[Boulder.pm] PDF info extraction
Walter Pienciak
wpiencia at thunderdome.ieee.org
Wed Aug 18 15:24:45 CDT 2004
There are a number of PDF tools that can extract the document
information structure from a PDF. Here is a simple example:
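
use strict;
use warnings;
use PDF::API2;

# Take the PDF filename from the command line.
my $doc = shift or die "usage: $0 filename\n";
my $pdf = PDF::API2->open($doc);

# Fetch the document information dictionary as a hash and print it.
my %info = $pdf->info;
for (keys %info) {
    print "[$_]\t= [", $info{$_}, "]\n";
}
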
This is in theory a nice way to determine the presence of keywords,
abstract, etc., in the PDF file.
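
As a rough sketch of that kind of check, the helper below reports which
of a list of wanted fields are absent or empty in an info hash. The name
missing_fields and the sample hash are made up for illustration. Title,
Author, Subject and Keywords are standard information dictionary entries;
an abstract has no standard key of its own, so where it would be stored
(Subject, or a custom key) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the wanted fields that are
# missing from, or empty in, an info hash such as the one returned
# by $pdf->info.
sub missing_fields {
    my ($info, @wanted) = @_;
    return grep { !defined $info->{$_} || $info->{$_} eq '' } @wanted;
}

# Sample data standing in for a real document's info hash.
my %info = (Title => 'Example', Keywords => '', Producer => 'SomeTool');
my @missing = missing_fields(\%info, qw(Title Author Subject Keywords));
print "missing: @missing\n";
```

Treating an empty string the same as a missing key is a choice made for
this sketch; a stricter check could test only with exists.
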
But for a fair number of the documents I examine, the values of
%info come back as strange non-ASCII values. Piping through
od -bc, here is an example:

0001540 133 115 157 144 104 141 164 145 135 011 075 040 133 043 253 030
          [   M   o   d   D   a   t   e   ]  \t   =       [   # 253 030
0001560 302 051 363 235 360 363 154 330 351 126 254 251 026 345 347 062
        302   ) 363 235 360 363   l 330 351   V 254 251 026 345 347   2
0001600 201 374 063 035 135 012 133 120 162 157 144 165 143 145 162 135
        201 374   3 035   ]  \n   [   P   r   o   d   u   c   e   r   ]
0001620 012 133 120 162 157 144 165 143 145 162 135 011 075 040 133 046
         \n   [   P   r   o   d   u   c   e   r   ]  \t   =       [   &
0001640 362 130 235 173 243 330 341 205 062 232 255 017 362 360 103 272
        362   X 235   { 243 330 341 205   2 232 255 017 362 360   C 272
0001660 367 063 210 374 066 032 340 134 021 040 344 166 323 000 002 206
        367   3 210 374   6 032 340   \ 021     344   v 323  \0 002 206
0001700 123 135 012
          S   ]  \n

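
The same inspection can be done inside the script instead of piping
through od. This is only a sketch: show_value is a hypothetical helper,
and the sample ModDate value is just the first five bytes of the ModDate
field from the dump above, typed in as octal escapes. It prints a value
unchanged when it is all printable ASCII and as hex bytes otherwise,
which makes the affected fields easy to spot.

```perl
use strict;
use warnings;

# Hypothetical helper: format a value for display. Printable ASCII
# is returned unchanged; anything else is returned as hex bytes.
sub show_value {
    my ($value) = @_;
    return $value if $value =~ /\A[\x20-\x7e]*\z/;
    return 'hex:' . join(' ', map { sprintf('%02x', ord($_)) } split(//, $value));
}

# Sample data: a readable string, and the first five bytes of the
# ModDate value from the od output (octal escapes).
my %info = (
    Title   => 'Example',
    ModDate => "\043\253\030\302\051",
);
for my $key (sort keys %info) {
    print "[$key]\t= [", show_value($info{$key}), "]\n";
}
```

In the real script, %info would come from $pdf->info rather than from
hard-coded sample data.
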
Does anyone have any insight into this problem?
I would genuinely appreciate a pointer in the right direction.
Walter