[pm-h] Checking for Broken Links
Mike Flannigan
mikeflan at att.net
Thu Oct 24 03:03:40 PDT 2013
I almost never use the script below, but I think
I may start using modified copies of it. The
broken link referred to in the comment section below
is linked from the text "Spatial Evacuation Analysis Project"
<http://www.ncgia.ucsb.edu/%7Ecova/seap.html>
on the webpage
http://www.ncgia.ucsb.edu/about/sitemap.php

The program apparently skips the broken link, and
probably a lot of other links. Maybe because they are
relative links? I could probably figure this out,
but I just haven't worked on it much yet.
I found this script. I did not create it.
#!/usr/local/bin/perl
#
# This program crawls sites listed in URLS and checks
# all links. But it does not crawl outside the base
# site listed in FOLLOW_REGEX. It lists all the links
# followed, including the broken links. All output goes
# to the terminal window.
#
# I say this does not work, because the link
# http://www.ncgia.ucsb.edu/~cova/seap.html
# is broken on this page: http://www.ncgia.ucsb.edu/about/sitemap.php
# but this script does not point that out.
#
#
use strict;
use warnings;
use WWW::SimpleRobot;
my $robot = WWW::SimpleRobot->new(
    URLS           => [ 'http://www.ncgia.ucsb.edu/about/sitemap.php' ],
    FOLLOW_REGEX   => "^http://www.ncgia.ucsb.edu/",
    DEPTH          => 1,
    TRAVERSAL      => 'depth',
    VISIT_CALLBACK => sub {
        my ( $url, $depth, $html, $links ) = @_;
        print STDERR "\nVisiting $url\n\n";
        foreach my $link (@$links) {
            print STDERR "@{$link}\n";    # dereference each link arrayref
        }
    },
    BROKEN_LINK_CALLBACK => sub {
        my ( $url, $linked_from, $depth ) = @_;
        print STDERR "$url looks like a broken link on $linked_from\n";
        print STDERR "Depth = $depth\n";
    },
);
$robot->traverse;
my @urls  = @{ $robot->urls };
my @pages = @{ $robot->pages };
for my $page (@pages) {
    my $url               = $page->{url};
    my $depth             = $page->{depth};
    my $modification_time = $page->{modification_time};
}
print "\nAll done.\n";
__END__
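One way to test the relative-link hypothesis above is to resolve each href against the page's URL before matching it with FOLLOW_REGEX. A relative href like "/~cova/seap.html" would never match "^http://www.ncgia.ucsb.edu/", so the crawler would silently skip it. This is just a sketch using the core URI module (not part of the original script); the example hrefs are assumptions about what the sitemap's HTML might contain:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# The page from the post; relative links on it resolve against this URL.
my $base = 'http://www.ncgia.ucsb.edu/about/sitemap.php';

# Hypothetical hrefs as they might appear in the page's HTML. The
# relative form fails a match against ^http://www.ncgia.ucsb.edu/
# unless it is resolved to an absolute URL first.
for my $href ( '/~cova/seap.html',
               'http://www.ncgia.ucsb.edu/~cova/seap.html' ) {
    my $abs = URI->new_abs( $href, $base );
    print "$href => $abs\n";
}
```

If the crawler skips relative links, applying URI->new_abs to each extracted href before the FOLLOW_REGEX check should bring them back into the crawl.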
On 10/23/2013 1:48 PM, G. Wade Johnson wrote:
> Hi Mike, Thanks for the input. I'm glad you have been able to get
> input from other resources. I hope this list and the hangout will
> become more useful to you as well.
> We have had some of these topics covered in the past, so the talks
> pages may have some information that will help. My goal with this is
> really to help people get unstuck and see how to proceed, rather than
> teaching.
>
> For example, if you had a particular task you wanted to perform with
> LWP (even if it's an example problem), we could walk through where you
> are stuck and get you moving again. Also, we could answer questions on
> the modules that we know.
>
> It sounds like what I have in mind could be useful to you as well.
>
> G. Wade
>