[Kc] HTTP Links
Frank Wiles
frank at wiles.org
Mon Aug 7 07:43:25 PDT 2006
On Sun, 6 Aug 2006 13:14:44 -0500
djgoku <djgoku at gmail.com> wrote:
> On 8/4/06, Eric Wilhelm <scratchcomputing at gmail.com> wrote:
> > # from djgoku
> > # on Friday 04 August 2006 09:13 am:
> >
> > >see how the fair with multiline comments.
> >
> > me thinks they fare well
> >
> > $ cat link_extract
> > #!/usr/bin/perl
> > use warnings; use strict; use HTML::SimpleLinkExtor;
> > my $extor = HTML::SimpleLinkExtor->new();
> > {local $/; $extor->parse(<STDIN>);}
> > print join("\n", $extor->links, '');
> >
> > $ curl -s http://www.... | ./link_extract | grep '\.pdf'
> > http://.../3145_Intro.pdf
> > http://.../3145_Chap01.pdf
> > ...
>
> #!/usr/bin/perl
> #
> # My second try, for get_http.pl
> # Syntax: get_http.pl filename (pdf|html|tar|etc)
> # Todo: Add use for web links (http://blah.com/blah
>
> use strict;
> use warnings;
>
> use HTML::SimpleLinkExtor;
>
> my $extor = HTML::SimpleLinkExtor->new();
>
> # Filename Stuff
> my $source = shift @ARGV;
> $extor->parse_file($source);
> my @links = $extor->links;
>
> # Filetype Stuff
> my $filetype = '*';
> $filetype = shift @ARGV if (@ARGV);
>
> # Print only found $filetype
> foreach (@links) {
> print "$_\n" if (m{\s*\.$filetype}i);
> }
> _______________________________________________
> kc mailing list
> kc at pm.org
> http://mail.pm.org/mailman/listinfo/kc
Here is a slightly more robust version. It uses Getopt::Long for named
options and accepts either a local file or a URL:
#!/usr/bin/perl

use strict;
use warnings;

use HTML::SimpleLinkExtor;
use LWP::UserAgent;
use Getopt::Long;

my $extor = HTML::SimpleLinkExtor->new();

# Some defaults
my $local_file;
my $remote_url;
my $file_type = '*';

# Parse commandline options
GetOptions(
    'file=s' => \$local_file,
    'url=s'  => \$remote_url,
    'type=s' => \$file_type,
) or die "Invalid command line options\n";

# Handle a local file or a url
my @links;
if ( $local_file ) {
    $extor->parse_file( $local_file );
    @links = $extor->links;
}
elsif ( $remote_url ) {
    my $ua = LWP::UserAgent->new;

    # get() always returns a response object, so test is_success
    # rather than the return value of get() itself
    my $response = $ua->get( $remote_url );

    if ( $response->is_success ) {
        $extor->parse( $response->content );
        @links = $extor->links;
    }
    else {
        # $! is not set by a failed HTTP request; report the status line
        die "Unable to retrieve the URL '$remote_url': ",
            $response->status_line, "\n";
    }
}
else {
    die "You must define either a -file or a -url\n";
}

# Keep only the links matching the requested type, unless all were asked for
if ( $file_type ne '*' ) {
    @links = grep { /\.$file_type/i } @links;
}

# One link per line, including a newline after the last one
print "$_\n" for @links;

# EOF
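
A side note on the -type filter: a pattern like /\.$file_type/i matches
".pdf" anywhere in the link, so "notes.pdf.html" also counts as a pdf.
If that matters, something along these lines might be stricter. This is
only a sketch: links_with_extension is a made-up helper name and the
example.com links are placeholders, not anything from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): keep only links
# whose path ends in ".$ext", ignoring any ?query or #fragment part.
sub links_with_extension {
    my ( $ext, @links ) = @_;
    return grep {
        my $path = $_;
        $path =~ s/[?#].*//s;        # drop query string and fragment
        $path =~ /\.\Q$ext\E\z/i;    # literal extension at the end of the path
    } @links;
}

# Placeholder links, for illustration only
my @sample = (
    'http://example.com/3145_Intro.pdf',
    'http://example.com/report.PDF?download=1',
    'http://example.com/notes.pdf.html',
    'http://example.com/a.pdfs/index.html',
);

print "$_\n" for links_with_extension( 'pdf', @sample );
```

The \Q...\E quoting treats the type as a literal string, so a type
containing regex characters cannot change the meaning of the pattern.
The trade-off is that a type such as 'pdf|html' would no longer work as
an alternation.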
You can now call the script in any of these ways:
get_http.pl -file /path/to/file -type pdf
get_http.pl -file=/path/to/file -type=pdf
get_http.pl -url=http://www.blah.com -type=gif
get_http.pl -u=http://www.blah.com -t=pdf
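
The short forms work because Getopt::Long, in its default configuration,
accepts any unambiguous abbreviation of an option name and allows both
"-opt value" and "-opt=value". A quick way to convince yourself is to
feed the same option specification a few argument lists. This is a rough
sketch: parse_args is a hypothetical wrapper written for this example
and is not part of get_http.pl.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical wrapper: parse an argument list with the same option
# specification as the script and return the values as a hash reference.
sub parse_args {
    my @args = @_;
    my %opt  = ( type => '*' );    # same default as the script
    GetOptionsFromArray(
        \@args,
        'file=s' => \$opt{file},
        'url=s'  => \$opt{url},
        'type=s' => \$opt{type},
    ) or die "Invalid command line options\n";
    return \%opt;
}

# The argument lists mirror the example invocations above
for my $argv (
    [ '-file', '/path/to/file', '-type', 'pdf' ],
    [ '-url=http://www.blah.com', '-type=gif' ],
    [ '-u=http://www.blah.com',   '-t=pdf' ],
) {
    my $opt = parse_args(@$argv);
    printf "file=%s url=%s type=%s\n",
        map { defined $opt->{$_} ? $opt->{$_} : '(none)' } qw(file url type);
}
```

Abbreviations only stay valid while they are unambiguous, so adding
another option that starts with "t" or "u" later would break the -t and
-u shortcuts.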
---------------------------------
Frank Wiles <frank at wiles.org>
http://www.wiles.org
---------------------------------