[Omaha.pm] PDF to Text parsing

Thu Oct 4 18:24:52 PDT 2012

All the ones I was dealing with were multi-page TIFF files labeled as
PDFs, produced by fax software.
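
For anyone sorting through a similar pile, here is a minimal sketch (not
what I actually ran) of telling the two apart by their leading bytes
rather than by extension. It assumes a well-formed PDF starts with
"%PDF-" and a TIFF starts with "II*\0" or "MM\0*"; the function name
sniff_format is made up for illustration.

```python
import sys


def sniff_format(path):
    """Guess a file's real format from its first bytes, ignoring the extension."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    # Assumption: the PDF header sits at offset 0 (some files have junk before it).
    if head.startswith(b"%PDF-"):
        return "pdf"
    # TIFF: byte-order mark ("II" little-endian or "MM" big-endian) plus magic 42.
    if head[:4] in (b"II*\x00", b"MM\x00*"):
        return "tiff"
    return "unknown"


if __name__ == "__main__":
    # Usage: python sniff.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        print(name, sniff_format(name))
```

Anything reported as "tiff" has no text layer to extract, so it would
have to go through OCR.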

Chris Brandstetter

-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/IT d+(-) s++:++ a C++++$ UBLISXC*++++$ P++++$ L+++$ E-- W+++ N+ o K-
w-- O M++$ V PS- PE Y+ PGP++ t++ 5+++ X+ R- tv-- b+>+++ DI D+ G+ e+ h++
r y?
------END GEEK CODE BLOCK------ 

On 10/04/2012 02:00 PM, omaha-pm-request at pm.org wrote:
>
> PDF files often contain both (1) text with layout, placement, and font information, and (2) images. Those images may happen to have pixels in them which humans interpret as text. Those pixels can sometimes be OCR'd to produce text.
>
> PDF::OCR2 does both of these things for you. It can be used to "extract all text and all image ocr from pdf". 
>
> Again, it all depends on the PDF file.   :)
>
> I'm guessing Chris was dealing with a directory full of images. 
>
> HTH,
>
> j
>