[Chicago-talk] Locating text in a PDF

Elias Lutfallah dontyoumonkeywiththemonkey at gmail.com
Tue Jul 3 22:18:28 PDT 2007


All the modules mentioned are for creating, manipulating, and adding to
PDFs. What you'll probably want to do is use pdf2ps (it ships with
Ghostscript) to convert the PDF into PostScript. Then you should be able to
parse that output for the text. However, if the text is actually an image in
the PDF, this won't work either. Then you'd need some kind of OCR software.

I've had pretty good luck with this method when I needed to modify a PDF
that I didn't create myself: convert to PS, modify it with any text editor,
then convert it back to PDF. The result almost always ends up exactly like
the original except for my change.

There's also pdftotext (part of Xpdf), but I haven't used it. I've needed to
keep the original PDF intact, and this looks like it just extracts any text
in a PDF.
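
For the original question (finding a specific string), the pdftotext route
might look roughly like the sketch below. It is untested against real
documents and rests on a few assumptions: pdftotext is installed and on the
PATH, it writes to stdout when the output file is given as "-", and it
separates pages with form-feed characters (its default behaviour). The
function names and the file name are made up for illustration, and the
sketch is in Python; the same idea works from Perl by capturing the
command's output.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text.

    Assumes the pdftotext binary is on the PATH. Passing "-" as the
    output file asks it to write to stdout instead of a .txt file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def pages_containing(text, needle):
    """Return the 1-based page numbers whose text contains needle.

    Relies on pdftotext ending each page with a form feed ("\f").
    """
    pages = text.split("\f")
    return [n for n, page in enumerate(pages, start=1) if needle in page]


# Hypothetical usage; "report.pdf" is a placeholder file name:
# hits = pages_containing(pdf_to_text("report.pdf"), "Total due")
```

This is a plain substring match, so a phrase that wraps across two lines in
the extracted text will be missed, and as noted above, text stored as an
image will not show up at all.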

Good luck, but since this is from March 30, I hope that you figured out a
way to do what you needed already.

On 7/3/07, tiger peng <tigerpeng2001 at yahoo.com> wrote:
>
> Have you figured out how to look for a specific string? I have just
> skimmed through PDF::Parser, PDF::Extract and PDF::Xtract without finding
> any good clues.
>
> Ge
>
> ----- Original Message ----
> From: Jay Strauss < me at heyjay.com>
> To: Chicago.pm chatter < chicago-talk at pm.org>
> Sent: Friday, March 30, 2007 5:29:01 PM
> Subject: [Chicago-talk] Locating text in a PDF
>
> Once I have opened a PDF using PDF::API2
>
> How would I examine the text on a page looking for a specific string?
>
> Thanks
> Jay
> _______________________________________________
> Chicago-talk mailing list
> Chicago-talk at pm.org
> http://mail.pm.org/mailman/listinfo/chicago-talk
>