[Omaha.pm] [omaha] PDF to Text parsing

Jay Hannah jay at jays.net
Tue Oct 2 12:18:35 PDT 2012


On Oct 2, 2012, at 11:03 AM, "Rob Townley" <rob.townley at gmail.com> wrote:
> The ReEnergizeProgram.org auditor said that a big slowdown is getting
> all the data from PDF-based bills from MUD and OPPD into a spreadsheet
> / database.  Sounds like they email stuff, copy-n-paste a lot, and then
> email on.
> 
> What perl/python/php modules would you recommend for parsing the text from PDF?


On Oct 2, 2012, at 11:10 AM, Burch Kealey <bkealey at unomaha.edu> wrote:
> Send us one as an example. This is really a trivial task.

Ya, send us an example PDF. A CPAN search for "PDF" returns 475 hits, but your mileage will vary, and the only way to know for sure is to actually try them. Here are all the hits, and the one I'd probably try first for this job:

   https://metacpan.org/search?q=PDF
   https://metacpan.org/module/PDF::OCR2
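
Until there is a real example bill to look at, here is only a rough sketch of the "PDF to spreadsheet" step, in Python since that was one of the languages asked about. It assumes the bills are text-based PDFs (not scans) and that poppler's pdftotext command is installed. The field labels ("Account Number:", "Billing Date:", "Amount Due:") and all function names are made up for illustration; a real MUD or OPPD bill will need its own patterns.

```python
import csv
import io
import re
import subprocess

# Hypothetical labels: real MUD / OPPD bills will need their own patterns.
FIELD_PATTERNS = {
    "account": re.compile(r"Account Number:\s*(\S+)"),
    "billing_date": re.compile(r"Billing Date:\s*(\d{1,2}/\d{1,2}/\d{4})"),
    "amount_due": re.compile(r"Amount Due:\s*\$?([\d,]+\.\d{2})"),
}


def pdf_to_text(pdf_path):
    """Extract text with poppler's pdftotext; '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def parse_bill_text(text):
    """Return one dict for the bill; fields that do not match stay empty."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def rows_to_csv(rows):
    """Serialize parsed rows as CSV text that a spreadsheet can open."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(FIELD_PATTERNS), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Usage would be something like rows_to_csv([parse_bill_text(pdf_to_text("bill.pdf"))]), where bill.pdf is a placeholder file name. If the bills are scanned images, pdftotext will return little or nothing and an OCR step is needed first.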

Good luck!  :)

j
Omaha Perl Mongers: http://omaha.pm.org



P.S. PDF scraping is usually really gross. Government orgs often publish PDF archives as if they were data APIs, and they're really not. Poke MUD and OPPD to publish JSON or XML APIs / archives.





