[Kc] HTTP Links

Eric Wilhelm scratchcomputing at gmail.com
Mon Aug 7 09:01:01 PDT 2006


# from Frank Wiles
# on Monday 07 August 2006 07:43 am:

>   Here is a slightly more robust version:

But how do you know it's more robust if you can't test it?  Here's a 
minor refactoring into a modular form (the extract_links file below).
Start by loading the script with require() and calling its main()
directly:

  perl -e 'my $package = require("./extract_links");
  my $main = eval("\\&${package}::main");
  $main->("-h");'

Then break the stuff in main() out into individual subs.
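
The loader one-liner above works because require() returns the last
value evaluated in the file, and the script ends by evaluating its own
package name.  Here is a minimal, self-contained sketch of that trick.
The bin::demo_script package and its main() are hypothetical stand-ins
written to a temp file, not the real extract_links, and can() is used
in place of the string eval only as one possible alternative:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny stand-in script (hypothetical; not the real extract_links).
my ($fh, $path) = tempfile(UNLINK => 1);
print $fh <<'END_SCRIPT';
package bin::demo_script;
sub main { return "main got: @_" }
package main;
# Only run when executed directly, not when loaded with require().
if ($0 eq __FILE__) { print bin::demo_script::main(@ARGV), "\n" }
'bin::demo_script';   # last value in the file: what require() returns
END_SCRIPT
close $fh or die "close failed: $!";

# require() hands back the file's final value, here the package name.
my $package = require $path;

# Look up main() by name; can() returns a code reference or undef.
my $main = $package->can('main')
  or die "no main() in $package";

my $result = $main->('-h');
print "$package -> $result\n";
```

Because $0 is the calling program rather than the loaded file, the
"run when executed directly" branch is skipped and nothing happens
until the caller invokes main() itself.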

getoptions() would be a good first candidate if you parse the options 
into a hash.  Then your tests could do:

  my ($opts, @args) = bin::extract_links::getoptions(qw(
    --url http://example.com --type html
  ));
  is(ref($opts), 'HASH', 'is a hash');
  ok(@args == 0, 'no leftover args');
  is($opts->{url}, 'http://example.com');
  # etc (or even is_deeply)
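
For a concrete picture, here is one way such a getoptions() might look.
This is only a sketch under assumptions: it swaps in the core
Getopt::Long (GetOptionsFromArray) rather than Getopt::Helpful, and the
return convention of a hash reference followed by the leftover
arguments is a guess taken from the test lines above.

```perl
use strict;
use warnings;
use Getopt::Long ();
use Test::More tests => 4;

package bin::extract_links;

# Hypothetical getoptions(): parse a list of arguments into a hash and
# return a reference to it, followed by whatever was left unparsed.
sub getoptions {
  my (@args) = @_;
  my %opts = (type => '*');   # same default as $file_type in main()
  Getopt::Long::GetOptionsFromArray(\@args, \%opts,
    'file=s', 'url=s', 'type=s', 'help',
  ) or die "bad options\n";
  return (\%opts, @args);
}

package main;

# No network and no files needed: only the parsing is under test.
my ($opts, @args) = bin::extract_links::getoptions(qw(
  --url http://example.com --type html
));
is(ref($opts), 'HASH', 'options come back as a hash reference');
is(scalar(@args), 0, 'no leftover arguments');
is($opts->{url}, 'http://example.com', 'url parsed');
is($opts->{type}, 'html', 'type overrides the default');
```

Working on a copy of the argument list instead of @ARGV is what makes
the sub callable from a test file.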

So, when you add the feature "figure out whether it is a url", you can 
test that the DWIM is working without having a network connection.
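
As a rough illustration of that, the guess could live in its own small
sub.  The name source_kind() and the "starts with http:// or https://"
rule are assumptions made up for this sketch; the real feature might
decide differently.

```perl
use strict;
use warnings;
use Test::More tests => 4;

# Hypothetical DWIM helper: guess whether an argument names a URL or a
# local file purely from its shape (no network access, no stat()).
sub source_kind {
  my ($arg) = @_;
  return 'url' if $arg =~ m{^https?://}i;
  return 'file';
}

is(source_kind('http://example.com/'),   'url',  'http is a url');
is(source_kind('HTTPS://example.com/x'), 'url',  'scheme is case-insensitive');
is(source_kind('/path/to/file.html'),    'file', 'absolute path is a file');
is(source_kind('http.html'),             'file', 'no scheme means file');
```

Since the sub only looks at the string, these tests run the same on a
laptop with the network unplugged.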

You could, of course, use IPC::Run and do tests that way, but you can't 
unit test if there aren't units.

--- extract_links
#!/usr/bin/perl

use warnings;
use strict;

=head1 NAME

extract_links - extract links from HTML documents

=cut

package bin::extract_links;

use HTML::SimpleLinkExtor;
use LWP::UserAgent;    # needed for the --url branch in main()
use Getopt::Helpful;

sub main {
  my (@args) = @_;

  my $extor = HTML::SimpleLinkExtor->new(); 

  # Some defaults
  my $local_file;
  my $remote_url;  
  my $file_type  = '*'; 

  # Parse commandline options 
  my $opts = Getopt::Helpful->new(
    usage =>  "CALLER --file /path/to/file [options]\n" .
      "         or\n" .
      "               --url http://example.com [options]",
    ['file=s', \$local_file, '/path/file.html','read a file'],
    ['url=s' , \$remote_url, 'http://example.com/','fetch a url'],
    ['type=s', \$file_type , 'html|pdf','type of file to get'],
    '+help',
  );
  $opts->Get_from(\@args);

  # Handle a local file or a url
  my @links; 
  if( $local_file ) {
      $extor->parse_file( $local_file ); 
      @links = $extor->links; 
  }
  elsif ( $remote_url ) {
      my $ua = LWP::UserAgent->new;
      # get() always returns a response object, so check is_success
      # and report the HTTP status line rather than $!
      my $response = $ua->get($remote_url);
      if( $response->is_success ) {
          $extor->parse( $response->content );
          @links = $extor->links;
      }
      else {
           die "Unable to retrieve the URL '$remote_url': ",
             $response->status_line, "\n";
      }
  }
  else {
      $opts->usage("You must define either a --file or a --url");
  }

  if( $file_type ne '*' ) { 
      # group the type so that 'html|pdf' means .html or .pdf
      print join("\n", grep( /\.(?:$file_type)/i, @links ) );
  }
  else { 
      print join("\n", @links); 
  }
}

package main;

if($0 eq __FILE__) {
  bin::extract_links::main(@ARGV);
}

# vi:ts=2:sw=2:et:sta
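# require() returns the last value evaluated in this file, so a caller
# that loads the script with require() gets the package name back.
my $package = 'bin::extract_links';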
# EOF

--Eric
-- 
"Because understanding simplicity is complicated."
--Eric Raymond
---------------------------------------------------
    http://scratchcomputing.com
---------------------------------------------------

