[Nottingham-pm] This morning's work

Michael Erskine msemtd at yahoo.co.uk
Mon Aug 16 08:22:15 CDT 2004


This morning I'm writing an interface to a Rohde & Schwarz FSH3 Spectrum 
Analyser - I quoted two weeks' work but got the guts of it together in about 
an hour with POE! As a result I've moved on to the important 
stuff, like a script to get a complete month of Dilbert cartoons...

Life is good again!

#! /usr/bin/perl -w
use strict;
=for docs

Get a month of Dilbert cartoons into the current directory.
I noticed that there was an archive for the previous month or so of
Dilbert cartoons. The cartoon images have mildly obfuscated names,
but that's no real challenge for a regex!

First I get the archive page from 
http://www.comics.com/comics/dilbert/archive/
and pull out all the links to each dated page, e.g.
<A HREF="/comics/dilbert/archive/dilbert-20040716.html">.
Luckily, these urls are quite easy to spot and any duplicates are taken out 
with a hash hack.

This, of course, will all break when the format changes - ah well!

=cut
use tmstub;         # local helper module providing the t() logging sub
use LWP::Simple;
use File::Slurp;
# Hot file handle magic: enable autoflush on STDERR and STDOUT...
select( ( select(STDERR), $| = 1 )[0] );
select( ( select(STDOUT), $| = 1 )[0] );
# Get archive index...
my $pageurl = 'http://www.comics.com/comics/dilbert/archive/';
my $pagedata = get($pageurl);
if(not defined $pagedata){
    t "could not get page '$pageurl'";
    exit(1);
}
# Split into lines...
my @lines = split /\n/m, $pagedata;
# pull out urls to daily pages...
my @urls = grep {s[.*<A HREF="(/comics/dilbert/archive/dilbert-\d{8}\.html)".*>][$1]} @lines;
# remove dupes...
my %h; @h{@urls} = ();  # hash slice: duplicate keys collapse
@urls = sort keys %h;
t "There are ".scalar(@urls)." urls to get...";
foreach(@urls){
    # get machine readable date...
    if(not /dilbert-(\d{8})\.html/){
        t "could not establish date from url '$_'";
        next;
    }
    my $mrd = $1;
    # don't bother if it looks like we have it already...
    if(-f "dilbert".$mrd.".gif" or -f "dilbert".$mrd.".jpg"){
        t "already got '$mrd' so won't get '$_'";
        next;
    }
    my $pageurl = 'http://www.comics.com'.$_;
    my $pagedata = get($pageurl);
    if(not defined $pagedata){
        t "could not get page '$pageurl'";
        next;
    }
    if (not $pagedata =~ m{<IMG SRC="/comics/dilbert/archive/images/(dilbert(\d+)\.(gif|jpg))"}mgi){
        t "couldn't find expected image url in page data for '$pageurl'; will save it to a file...";
        if(not $pageurl =~ /(dilbert-\d{8}\.html)/){
            t "ouch! wanted to save html page but couldn't...";
            next;
        }
        write_file($1, $pagedata);
        next;
    }
    my($image, $obf, $ext) = ($1, $2, $3);
    t "image is '$image', obf is '$obf', ext is '$ext'";
    my $content = get('http://www.comics.com/comics/dilbert/archive/images/'.$image);
    unless (defined $content) {
        t "could not get '$image'\n";
        next;   # 'return' outside a sub would be a runtime error
    }
    t "got data!";
    write_file( "dilbert".$mrd.".".$ext, {binmode => ':raw'}, $content );
}
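
For anyone curious, the "hash hack" dedupe mentioned in the docs boils down
to this little idiom - a standalone sketch with made-up URLs:

```perl
#!/usr/bin/perl -w
use strict;

# Assigning a list to a hash slice turns each element into a key;
# duplicate keys collapse, so the sorted keys are the unique values.
my @urls = qw(a.html b.html a.html c.html b.html);
my %seen;
@seen{@urls} = ();               # values are undef; only keys matter
my @unique = sort keys %seen;
print "@unique\n";               # a.html b.html c.html
```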
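Incidentally, the "hot file handle magic" at the top of the script is just
autoflush; the core IO::Handle module spells the same thing more readably:

```perl
#!/usr/bin/perl -w
use strict;
use IO::Handle;   # core module; gives handles an autoflush method

# Equivalent to the select() dance: unbuffer both handles so
# diagnostics appear immediately, even when output is redirected.
STDOUT->autoflush(1);
STDERR->autoflush(1);
```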

