[Kc] HTTP Links

Frank Wiles frank at wiles.org
Mon Aug 7 07:43:25 PDT 2006


On Sun, 6 Aug 2006 13:14:44 -0500
djgoku <djgoku at gmail.com> wrote:

> On 8/4/06, Eric Wilhelm <scratchcomputing at gmail.com> wrote:
> > # from djgoku
> > # on Friday 04 August 2006 09:13 am:
> >
> > >see how the fair with multiline comments.
> >
> > me thinks they fare well
> >
> >   $ cat link_extract
> >   #!/usr/bin/perl
> >   use warnings; use strict; use HTML::SimpleLinkExtor;
> >   my $extor = HTML::SimpleLinkExtor->new();
> >   {local $/; $extor->parse(<STDIN>);}
> >   print join("\n", $extor->links, '');
> >
> >   $ curl -s http://www.... | ./link_extract | grep '\.pdf'
> >   http://.../3145_Intro.pdf
> >   http://.../3145_Chap01.pdf
> >   ...
> 
> #!/usr/bin/perl
> #
> # My second try, for get_http.pl
> # Syntax: get_http.pl filename (pdf|html|tar|etc)
> # Todo: Add use for web links (http://blah.com/blah
> 
> use strict;
> use warnings;
> 
> use HTML::SimpleLinkExtor;
> 
> my $extor = HTML::SimpleLinkExtor->new();
> 
> # Filename Stuff
> my $source = shift @ARGV;
> $extor->parse_file($source);
> my @links = $extor->links;
> 
> # Filetype Stuff
> my $filetype = '*';
> $filetype = shift @ARGV if (@ARGV);
> 
> # Print only found $filetype
> foreach (@links) {
> 	print "$_\n" if (m{\s*\.$filetype}i);
> }

   Here is a slightly more robust version: 

   #!/usr/bin/perl
   use strict;
   use warnings;
   use HTML::SimpleLinkExtor;
   use LWP::UserAgent;
   use Getopt::Long;

   my $extor = HTML::SimpleLinkExtor->new(); 
   
   # Some defaults
   my $local_file;
   my $remote_url;  
   my $file_type  = '*'; 
  
   # Parse commandline options 
   GetOptions(
       'file=s' => \$local_file,
       'url=s'  => \$remote_url,
       'type=s' => \$file_type
   );

   # Handle a local file or a url
   my @links; 
   if( $local_file ) { 
       $extor->parse_file( $local_file ); 
       @links = $extor->links; 
   }
   elsif ( $remote_url ) { 
       my $ua = LWP::UserAgent->new;
       # get() always returns a response object, even on failure,
       # so check is_success rather than the return value or $!
       my $response = $ua->get($remote_url);
       if( $response->is_success ) {
           $extor->parse( $response->content );
           @links = $extor->links;
       }
       else {
           die "Unable to retrieve the URL '$remote_url': ",
               $response->status_line, "\n";
       }
   }
   else { 
       die "You must specify either -file or -url\n";
   }

   # Print one link per line, ending with a newline
   if( $file_type ne '*' ) {
       print "$_\n" for grep( /\.$file_type/i, @links );
   }
   else {
       print "$_\n" for @links;
   }

   # EOF 
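
   One caveat with the type filter: a pattern like /\.pdf/i matches
   ".pdf" anywhere in the link, not only at the end. If that matters,
   a stricter filter could look something like the sketch below. The
   filter_links() helper and the example.com links are made up for
   illustration and are not part of the script above. Note that it
   treats the type as a literal string, so a regex alternation such as
   (pdf|html) would no longer work with it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only links whose path ends in ".$type",
# allowing an optional ?query or #fragment after the extension.
# A type of '*' keeps every link.
sub filter_links {
    my ($type, @links) = @_;
    return @links if $type eq '*';
    my $ext = quotemeta $type;    # treat the type as a literal string
    return grep { m{\.$ext(?:[?#].*)?\z}i } @links;
}

# Made-up sample links
my @links = (
    'http://example.com/a/Intro.pdf',
    'http://example.com/a/Intro.PDF?download=1',
    'http://example.com/my.pdf.d/index.html',
    'http://example.com/notes.pdfx',
);

print "$_\n" for filter_links('pdf', @links);
```

   Only the first two sample links should be printed; the last two
   contain ".pdf" but do not end with it.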

   You can now call it like so: 

   get_http.pl -file /path/to/file -type pdf 
   get_http.pl -file=/path/to/file -type=pdf
   get_http.pl -url=http://www.blah.com -type=gif
   get_http.pl -u=http://www.blah.com -t=pdf
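
   The short forms work because Getopt::Long, in its default
   configuration, accepts any unique abbreviation of an option name.
   Here is a small sketch to check that for yourself. parse_args() is
   a hypothetical helper written only for this demonstration, and it
   assumes a Getopt::Long recent enough to export GetOptionsFromArray
   and a Perl with the // operator (5.10 or later).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse a list of arguments with the same option
# spec the script uses for @ARGV, and return the values in a hash ref.
sub parse_args {
    my @args = @_;
    my %opt  = ( type => '*' );    # same default as the script
    GetOptionsFromArray(
        \@args,
        'file=s' => \$opt{file},
        'url=s'  => \$opt{url},
        'type=s' => \$opt{type},
    ) or die "Bad options\n";
    return \%opt;
}

# Try the full and the abbreviated spellings
for my $argv (
    [ '-file', '/path/to/file', '-type', 'pdf' ],
    [ '-u=http://www.blah.com', '-t=pdf' ],
) {
    my $opt = parse_args(@$argv);
    printf "file=%s url=%s type=%s\n",
        $opt->{file} // '(none)',
        $opt->{url}  // '(none)',
        $opt->{type};
}
```

   If another option starting with "t" or "u" is added later, those
   one-letter abbreviations become ambiguous and Getopt::Long will
   reject them, so spell the names out in anything you script.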

 ---------------------------------
   Frank Wiles <frank at wiles.org>
   http://www.wiles.org
 ---------------------------------


