[Nottingham-pm] This morning's work
Michael Erskine
msemtd at yahoo.co.uk
Mon Aug 16 08:22:15 CDT 2004
This morning I'm writing an interface to a Rohde & Schwarz FSH3 Spectrum
Analyser - I quoted two weeks' work but got the guts of it together in about
an hour with POE this morning! As a result I've moved on to the important
stuff, like a script to fetch a complete month of Dilbert cartoons...
Life is good again!
#! /usr/bin/perl -w
use strict;
=for docs
Get a month of Dilbert cartoons into the current directory.
I noticed that there was an archive for the previous month or so of
Dilbert cartoons. The cartoon images have mildly obfuscated names,
but that's no real challenge for a regex!
First I get the archive page from
http://www.comics.com/comics/dilbert/archive/
and pull out all the links to each dated page, e.g.
<A HREF="/comics/dilbert/archive/dilbert-20040716.html">.
Luckily, these URLs are quite easy to spot, and any duplicates are taken
out with a hash slice.
This, of course, will all break when the format changes - ah well!
=cut
use tmstub; # author's own trace module: provides the t() logging function
use LWP::Simple;
use File::Slurp;
# Hot file handle magic: unbuffer STDERR and STDOUT so trace output appears immediately...
select( ( select(STDERR), $| = 1 )[0] );
select( ( select(STDOUT), $| = 1 )[0] );
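For anyone puzzled by the select() dance above: the same effect (autoflush on both handles, without disturbing the currently selected handle) can be had more readably via the autoflush method that IO::Handle adds to file handles. A minimal sketch of the equivalent:

```perl
#!/usr/bin/perl -w
use strict;
use IO::Handle;   # gives file handles an autoflush() method

# Equivalent to the select() trick above: disable output buffering
# so trace messages appear immediately, even when piped to a file.
STDOUT->autoflush(1);
STDERR->autoflush(1);

print "flushed immediately\n";
```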
# Get archive index...
my($pageurl, $pagedata) = ('http://www.comics.com/comics/dilbert/archive/');
if(not defined ($pagedata = get($pageurl))){
t "could not get page '$pageurl'";
exit(1);
}
# Split into lines...
my @lines = split /\n/m, $pagedata;
# pull out urls to daily pages...
my @urls = grep {s[.*<A HREF="(/comics/dilbert/archive/dilbert-\d{8}\.html)".*>][$1]} @lines;
# remove dupes...
my %h; @h{@urls} = (1) x @urls;
@urls = sort keys %h;
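The hash-slice dedupe above is a handy idiom in its own right: assigning through a slice keys each URL once, and duplicate keys simply overwrite. A standalone sketch with made-up URLs (note the left operand of `x` needs parentheses to replicate in list context):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical duplicate-laden list standing in for the scraped links.
my @urls = (
    '/comics/dilbert/archive/dilbert-20040716.html',
    '/comics/dilbert/archive/dilbert-20040717.html',
    '/comics/dilbert/archive/dilbert-20040716.html',   # duplicate
);

# Slice assignment: each URL becomes a hash key exactly once, so
# keys %h yields the unique set; sort restores a stable order.
my %h;
@h{@urls} = (1) x @urls;
my @unique = sort keys %h;

print scalar(@unique), " unique URLs\n";
```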
t "There are ".scalar(@urls)." urls to get...";
foreach(@urls){
# get machine readable date...
if(not /dilbert-(\d{8})\.html/){
t "could not establish date from url '$_'";
next;
}
my $mrd = $1;
# don't bother if it looks like we have it already...
if(-f "dilbert".$mrd.".gif" or -f "dilbert".$mrd.".jpg"){
t "already got '$mrd' so won't get '$_'";
next;
}
my($pageurl, $pagedata) = ('http://www.comics.com'.$_);
if(not defined ($pagedata = get($pageurl))){
t "could not get page '$pageurl'";
next;
}
if (not $pagedata =~ m{<IMG SRC="/comics/dilbert/archive/images/(dilbert(\d+)\.(gif|jpg))"}mgi){
t "couldn't find expected image url in page data for '$pageurl', will save it to a file...";
if(not $pageurl =~ /(dilbert-\d{8}\.html)/){
t "ouch! wanted to save html page but couldn't...";
next;
}
write_file($1, $pagedata);
next;
}
my($image, $obf, $ext) = ($1, $2, $3);
t "image is '$image', obf is '$obf', ext is '$ext'";
#~ exit(0);
my $content;
unless (defined ($content = get('http://www.comics.com/comics/dilbert/archive/images/'.$image))) {
t "could not get '$image'\n";
next;
}
t "got data!";
write_file( "dilbert".$mrd.".".$ext, {binmode => ':raw'}, $content );
}