[Boulder.pm] PDF info extraction

Walter Pienciak wpiencia at thunderdome.ieee.org
Wed Aug 18 15:24:45 CDT 2004


A number of PDF tools can extract the document information
dictionary (title, author, keywords, and so on) from a PDF.  Here is
a simple example using PDF::API2:

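use strict;
use warnings;
use PDF::API2;

# Take the PDF filename from the command line.
my $doc = shift || die "usage:  $0 filename\n";
my $pdf = PDF::API2->open($doc);

# Fetch the document information fields (Title, Author, Keywords, ...).
my %info = $pdf->info;

# Print each field as "[key] = [value]".
for (sort keys %info) {
    print "[$_]\t= [", $info{$_}, "]\n";
}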

In theory, this is a nice way to check whether a PDF file carries
keywords, an abstract, etc.

But for a fair number of the documents I examine, the values in
%info come back as what looks like arbitrary binary data rather than
readable text.  Here is an example of the script's output, piped
through od -bc:

0001540 133 115 157 144 104 141 164 145 135 011 075 040 133 043 253 030
           [   M   o   d   D   a   t   e   ]  \t   =       [   # 253 030 
0001560 302 051 363 235 360 363 154 330 351 126 254 251 026 345 347 062
         302   ) 363 235 360 363   l 330 351   V 254 251 026 345 347   2
0001600 201 374 063 035 135 012 133 120 162 157 144 165 143 145 162 135
         201 374   3 035   ]  \n   [   P   r   o   d   u   c   e   r   ]
0001620 012 133 120 162 157 144 165 143 145 162 135 011 075 040 133 046
          \n   [   P   r   o   d   u   c   e   r   ]  \t   =       [   &
0001640 362 130 235 173 243 330 341 205 062 232 255 017 362 360 103 272
         362   X 235   { 243 330 341 205   2 232 255 017 362 360   C 272
0001660 367 063 210 374 066 032 340 134 021 040 344 166 323 000 002 206
         367   3 210 374   6 032 340   \ 021     344   v 323  \0 002 206
0001700 123 135 012
           S   ]  \n
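
To spot the affected fields without leaving Perl, something like the
following could replace the pipe through od.  This is only a sketch:
show_bytes and is_printable_ascii are helper names I made up (they are
not part of PDF::API2), and %sample stands in for the real %info, with
the ModDate value cut down to the first five bytes from the dump above
and the Title value invented for contrast.

```perl
use strict;
use warnings;

# Hypothetical helper: render a value with every byte outside the
# printable ASCII range shown as a \xNN escape instead of raw binary.
sub show_bytes {
    my ($value) = @_;
    return '(undef)' unless defined $value;
    (my $out = $value) =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    return $out;
}

# Hypothetical helper: true if the value is defined and contains only
# printable ASCII characters.
sub is_printable_ascii {
    my ($value) = @_;
    return defined $value && $value !~ /[^\x20-\x7e]/ ? 1 : 0;
}

# Stand-in for %info.  ModDate holds the first five bytes of the value
# in the od dump (octal escapes); Title is an invented readable value.
my %sample = (
    ModDate => "#\253\030\302)",
    Title   => "An example title",
);

for my $key (sort keys %sample) {
    my $flag = is_printable_ascii($sample{$key}) ? 'ok' : 'BINARY';
    print "[$key]\t= [", show_bytes($sample{$key}), "]\t$flag\n";
}
```

In the real script, the loop would run over the %info returned by
$pdf->info instead of %sample.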

Does anyone have any insight into this problem?
I would genuinely appreciate a pointer in the right direction.

Walter



