[pm-h] Checking for Broken Links
G. Wade Johnson
gwadej at anomaly.org
Thu Oct 24 08:01:35 PDT 2013
That was fun, and it gave me a good excuse to play with Devel::hdb.
There's a bug in the way that WWW::SimpleRobot handles broken links.
If the link is in the original array that you pass, it recognizes the
broken link and calls the callback routine.
But, when it's traversing a page and building a list of links, it
discards any link that fails a "head" request. So, all broken links
would be discarded.
That's probably worth a bug report to the author.
More Detail
-----------
To troubleshoot this, I first ran it the way you did. Then, I looked
at the docs for WWW::SimpleRobot and didn't see anything useful there.
Next, I looked at the source (nicely formatted by metacpan:
https://metacpan.org/source/AWRIGLEY/WWW-SimpleRobot-0.07/SimpleRobot.pm).
On line 35, I noticed that the module supports a VERBOSE mode.
Looking down the code a little ways (lines 119-124), you can see that
verbose is used to print a "get $url" line before the
BROKEN_LINK_CALLBACK is called.
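If I'm reading the source correctly, turning that on is just another option passed to the constructor. This fragment is a guess from reading SimpleRobot.pm, not documented behavior:

```perl
my $robot = WWW::SimpleRobot->new(
    URLS         => [ 'http://www.ncgia.ucsb.edu/about/sitemap.php' ],
    FOLLOW_REGEX => "^http://www.ncgia.ucsb.edu/",
    VERBOSE      => 1,    # prints "get $url" lines as pages are fetched
    # ... remaining options as in Mike's script below ...
);
```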
Running that way showed that the code never prints
"get http://www.ncgia.ucsb.edu/%7Ecova/seap.html".
Looking a little further, lines 140-142 discard the link if head()
fails.
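Based on that description, the problematic pattern presumably amounts to something like the grep below. This is a self-contained sketch, not the module's actual code: head_ok() stands in for a real HEAD request, and the alive/dead table is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fake "does the server answer?" table; head_ok() stands in for head().
my %alive = (
    'http://www.ncgia.ucsb.edu/about/'            => 1,
    'http://www.ncgia.ucsb.edu/%7Ecova/seap.html' => 0,  # the broken link
);
sub head_ok { return $alive{ $_[0] } }

# The traversal keeps only links that answer a HEAD request, so a broken
# link found while crawling a page is silently dropped -- it never
# reaches BROKEN_LINK_CALLBACK.
my @links = grep { head_ok($_) } sort keys %alive;

print scalar(@links), " link(s) kept\n";    # prints "1 link(s) kept"
```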
The hdb debugging interface was really nice for this. (Unfortunately, I
spent a fair amount of time playing with the debugger.<shrug/>)
I can see a couple of ways of fixing this:
1. Easiest: report the bug through RT and hope the author takes care of
it soon.
2. Patch your copy of WWW::SimpleRobot to call the callback when
head() fails, or to stop discarding links on a failed head() request.
3. Copy the WWW::SimpleRobot traversal code into your script and fix it
there.
The first approach is probably the best.
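For what it's worth, the patch in option 2 could look roughly like the loop below. This is only a sketch under the assumption that the traversal filters new links with head(), as described above; the variable names are illustrative, not the module's actual internals, and stubs replace the network calls so the idea runs in isolation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stubs standing in for the module's internals and for head().
my %alive = ( 'http://good.example/' => 1, 'http://broken.example/' => 0 );
sub head_ok { return $alive{ $_[0] } }

my $broken_link_callback = sub {
    my ( $link, $linked_from ) = @_;
    print "$link looks broken on $linked_from\n";
};

my $current_url = 'http://page.example/';
my @new_links;
for my $link ( sort keys %alive ) {
    if ( head_ok($link) ) {
        push @new_links, $link;    # keep live links for further traversal
    }
    else {
        # The fix: report the broken link instead of silently dropping it.
        $broken_link_callback->( $link, $current_url );
    }
}
print scalar(@new_links), " live link(s)\n";    # prints "1 live link(s)"
```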
G. Wade
On Thu, 24 Oct 2013 05:03:40 -0500
Mike Flannigan <mikeflan at att.net> wrote:
>
> I almost never use this script below, but I think
> I may start using modified copies of it. The
> broken link referred to in the comment section below
> is linked to the text "Spatial Evacuation Analysis Project
> <http://www.ncgia.ucsb.edu/%7Ecova/seap.html>"
> on the webpage
> http://www.ncgia.ucsb.edu/about/sitemap.php
>
> The program apparently skips the broken link and
> probably a lot of other links. Maybe because they are
> relative links?? I could probably figure this out,
> but just haven't worked on it much yet.
>
> I found this script. I did not create it.
>
> #!/usr/local/bin/perl
> #
> # This program crawls sites listed in URLS and checks
> # all links. But it does not crawl outside the base
> # site listed in FOLLOW_REGEX. It lists all the links
> # followed, including the broken links. All output goes
> # to the terminal window.
> #
> # I say this does not work, because the link
> # http://www.ncgia.ucsb.edu/~cova/seap.html
> # is broken on this page: http://www.ncgia.ucsb.edu/about/sitemap.php
> # but this script does not point that out.
> #
> #
> use strict;
> use warnings;
> use WWW::SimpleRobot;
> my $robot = WWW::SimpleRobot->new(
>     URLS                 => [ 'http://www.ncgia.ucsb.edu/about/sitemap.php' ],
>     FOLLOW_REGEX         => "^http://www.ncgia.ucsb.edu/",
>     DEPTH                => 1,
>     TRAVERSAL            => 'depth',
>     VISIT_CALLBACK       => sub {
>         my ( $url, $depth, $html, $links ) = @_;
>         print STDERR "\nVisiting $url\n\n";
>         foreach my $link (@$links) {
>             print STDERR "@{$link}\n";    # This dereferences the links
>         }
>     },
>     BROKEN_LINK_CALLBACK => sub {
>         my ( $url, $linked_from, $depth ) = @_;
>         print STDERR "$url looks like a broken link on $linked_from\n";
>         print STDERR "Depth = $depth\n";
>     },
> );
> $robot->traverse;
> my @urls = @{$robot->urls};
> my @pages = @{$robot->pages};
> for my $page ( @pages )
> {
> my $url = $page->{url};
> my $depth = $page->{depth};
> my $modification_time = $page->{modification_time};
> }
>
> print "\nAll done.\n";
>
>
> __END__
>
>
>
>
> On 10/23/2013 1:48 PM, G. Wade Johnson wrote:
> > Hi Mike, Thanks for the input. I'm glad you have been able to get
> > input from other resources. I hope this list and the hangout will
> > become more useful to you as well.
> > We have had some of these topics covered in the past, so the talks
> > pages may have some information that will help. My goal with this is
> > really to help people get unstuck and see how to proceed, rather
> > than teaching.
> >
> > For example, if you had a particular task you wanted to perform with
> > LWP (even if it's an example problem), we could walk through where
> > you are stuck and get you moving again. Also, we could answer
> > questions on the modules that we know.
> >
> > It sounds like what I have in mind could be useful to you as well.
> >
> > G. Wade
> >
>
--
We've all heard that a million monkeys banging on a million typewriters
will eventually reproduce the works of Shakespeare. Now, thanks to the
Internet, we know this is not true. -- Robert Wilensky, UCB