[Omaha.pm] PDF to Text parsing

Jay Hannah jay at jays.net
Thu Oct 4 02:29:53 PDT 2012


On Oct 4, 2012, at 2:31 AM, Rob Townley <rob.townley at gmail.com> wrote:
> Interesting.  I suppose your PDFs only contained images of text, but
> not actual text, hence the need for OCR?  If so, I may use this for
> something else.

PDF files often contain both (1) text, with layout, placement, and font information, and (2) images. Those images may happen to contain pixels that humans read as text, and those pixels can sometimes be OCR'd to produce text.

PDF::OCR2 does both of these things for you. It can be used to "extract all text and all image ocr from pdf". 

Again, it all depends on the PDF file.   :)
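
To make the "it depends" part concrete, here is a rough Python sketch of how you might guess which case a given PDF falls into. This is not PDF::OCR2 and not anything Chris ran. The function name suggest_strategy and the two byte patterns are my own assumptions. It only looks for font dictionaries and image XObjects in the raw bytes, so it will miss them when a PDF stores its objects in compressed object streams, and a font entry does not guarantee useful extractable text.

```python
import re

# Heuristic byte patterns (assumption: the relevant PDF dictionaries are
# stored uncompressed; compressed object streams will hide them).
FONT_RE = re.compile(rb"/Type\s*/Font\b")        # font dictionary => text is probably present
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")   # image XObject => candidate for OCR


def suggest_strategy(data: bytes) -> str:
    """Guess how to get text out of a PDF, given its raw bytes."""
    has_fonts = FONT_RE.search(data) is not None
    has_images = IMAGE_RE.search(data) is not None
    if has_fonts and has_images:
        return "extract text and OCR images"
    if has_fonts:
        return "extract text"
    if has_images:
        return "OCR images"
    return "unknown"
```

To try it, read a file in binary mode and pass the bytes in, e.g. suggest_strategy(open(path, "rb").read()) for whatever path you have handy. A result of "unknown" just means the heuristic found nothing, not that the file is empty.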

I'm guessing Chris was dealing with a directory full of images. 

HTH,

j





