From bogus@does.not.exist.com  Mon Aug  2 21:25:48 2004
From: bogus@does.not.exist.com ()
Date: Mon Aug  2 21:25:47 2004
Subject: No subject
Message-ID:

Also, we've been switched over to Mailman from majordomo, which I
suspect you've all figured out from your subscribe message.

I have a copy of the SAMS book 'Teach Yourself Perl In 21 Days' if
anyone wants it.

Walter

From wpiencia at thunderdome.ieee.org  Wed Aug 18 15:24:45 2004
From: wpiencia at thunderdome.ieee.org (Walter Pienciak)
Date: Wed Aug 18 15:24:48 2004
Subject: [Boulder.pm] PDF info extraction
Message-ID: <20040818202445.GA11984@thunderdome.ieee.org>

There are a number of PDF tools that can extract the document
information structure from a PDF.  Here is a simple example:

    use strict;
    use warnings;
    use PDF::API2;

    # The PDF to inspect is the first command-line argument.
    my $doc = shift or die "usage: $0 filename\n";

    # Open the file and fetch its document information structure.
    my $pdf  = PDF::API2->open($doc);
    my %info = $pdf->info;

    # Print each key/value pair, bracketed so stray bytes stand out.
    for (keys %info) {
        print "[$_]\t= [", $info{$_}, "]\n";
    }

This is in theory a nice way to determine the presence of keywords,
abstract, etc., in the PDF file.  But for a fair number of the
documents I examine, the values in %info come back as strange
non-ASCII bytes.  Here is an example, piped through od -bc:

    0001540 133 115 157 144 104 141 164 145 135 011 075 040 133 043 253 030
              [   M   o   d   D   a   t   e   ]  \t   =       [   # 253 030
    0001560 302 051 363 235 360 363 154 330 351 126 254 251 026 345 347 062
            302   ) 363 235 360 363   l 330 351   V 254 251 026 345 347   2
    0001600 201 374 063 035 135 012 133 120 162 157 144 165 143 145 162 135
            201 374   3 035   ]  \n   [   P   r   o   d   u   c   e   r   ]
    0001620 012 133 120 162 157 144 165 143 145 162 135 011 075 040 133 046
             \n   [   P   r   o   d   u   c   e   r   ]  \t   =       [   &
    0001640 362 130 235 173 243 330 341 205 062 232 255 017 362 360 103 272
            362   X 235   { 243 330 341 205   2 232 255 017 362 360   C 272
    0001660 367 063 210 374 066 032 340 134 021 040 344 166 323 000 002 206
            367   3 210 374   6 032 340   \ 021     344   v 323  \0 002 206
    0001700 123 135 012
              S   ]  \n

Does anyone have any insight into this problem?
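
To see at a glance which fields are affected, without piping everything
through od, something like the following rough sketch might help.  It is
only an illustration: the helper names is_printable_ascii, hex_bytes and
describe_pair are made up here, and the %sample hash merely stands in
for %info, using the first five bytes of the ModDate value from the dump
above plus an invented Title.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the value holds only printable ASCII (or tabs).
sub is_printable_ascii {
    my ($value) = @_;
    return $value =~ /\A[\x20-\x7e\t]*\z/ ? 1 : 0;
}

# Hypothetical helper: render a value as space-separated two-digit hex bytes.
sub hex_bytes {
    my ($value) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $value;
}

# Hypothetical helper: format one key/value pair, hex-dumping anything
# that is not plain printable ASCII.
sub describe_pair {
    my ($key, $value) = @_;
    return is_printable_ascii($value)
        ? "[$key]\t= [$value]"
        : "[$key]\t= <binary: " . hex_bytes($value) . ">";
}

# Stand-in for %info.  Title is invented; ModDate is the first five bytes
# of the value shown in the od dump (octal 043 253 030 302 051).
my %sample = (
    Title   => 'An example title',
    ModDate => "#\253\030\302)",
);

print describe_pair($_, $sample{$_}), "\n" for sort keys %sample;
```

In the real script, the print inside the loop over keys %info could be
swapped for a call to describe_pair($_, $info{$_}), so readable fields
print as before and only the problem values are shown in hex.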
I would genuinely appreciate a pointer in the right direction.

Walter