[tpm] PDF to OCR

Wed Jan 28 19:08:46 PST 2009

On Wed, 2009-01-28 at 16:49 -0500, arocker at vex.net wrote:
> Has anyone any experience using the PDF::OCR module?

As far as I have been able to tell, the only OCR programs
that are worth anything at all are commercial.  I've used
ABBYY FineReader for http://www.fromoldbooks.org/ and
can average under a minute per page including some hand
fix-ups, starting with, say, a good clean 400dpi grayscale
scan of a page.

Google Books OCR is much crappier, and the GNU OCR is about
20 years behind the commercial stuff in quality.

But a lot of it depends on your source content -- some of
the packages are trained and developed with computer
printouts, for example, for OCR of business documents, and
may work well for that and really badly for other things;
I was using 19th-century (and older) books.

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org