[pm-h] Checking for Broken Links

G. Wade Johnson gwadej at anomaly.org
Thu Oct 24 08:01:35 PDT 2013


That was fun, and it gave me a good excuse to play with Devel::hdb.

There's a bug in the way that WWW::SimpleRobot handles broken links.

If the link is in the original array that you pass, it recognizes the
broken link and calls the callback routine.

But when it's traversing a page and building a list of links, it
discards any link that fails a "head" request. So every broken link it
finds during traversal is silently dropped before the callback ever
gets a chance to run.
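
To see the same failure outside the module, a quick HEAD check on that
URL reproduces it. Here's a minimal sketch using LWP::UserAgent
directly (this is not WWW::SimpleRobot's code, just a way to confirm
that the broken link really does fail a head request):

#!/usr/bin/perl
# Minimal sketch: does a HEAD request on the suspect URL succeed?
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new( timeout => 10 );
my $url = 'http://www.ncgia.ucsb.edu/%7Ecova/seap.html';

my $res = $ua->head( $url );
print $res->is_success
    ? "HEAD $url succeeded\n"
    : "HEAD $url failed: " . $res->status_line . "\n";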

That's probably worth a bug report to the author.

More Detail
-----------
To troubleshoot this, I first ran it the way you did. Then, I looked
at the docs for WWW::SimpleRobot and didn't see anything useful there.

Next, I looked at the source (nicely formatted by metacpan:
https://metacpan.org/source/AWRIGLEY/WWW-SimpleRobot-0.07/SimpleRobot.pm).

On line 35, I noticed that the module supports a VERBOSE mode. A little
further down (lines 119-124), you can see that VERBOSE is used to print
a "get $url" line before the BROKEN_LINK_CALLBACK is called.
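
If you want to repeat that run, turning the flag on should just be a
matter of adding it to the constructor. (I'm assuming here that VERBOSE
is passed like the other options in your script below; check the source
line I mentioned if it complains.)

my $robot = WWW::SimpleRobot->new(
    URLS         => [ 'http://www.ncgia.ucsb.edu/about/sitemap.php' ],
    FOLLOW_REGEX => "^http://www.ncgia.ucsb.edu/",
    DEPTH        => 1,
    VERBOSE      => 1,  # assumed flag: prints "get $url" as it fetches
    # ... same callbacks as in your script ...
);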

Running that way showed that the code never prints
"get http://www.ncgia.ucsb.edu/%7Ecova/seap.html".

Looking a little further turns up lines 140-142, which discard the link
if head() fails.

The hdb debugging interface was really nice for this. (Unfortunately, I
spent a fair amount of time playing with the debugger.<shrug/>)
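
For anyone curious, running a script under it is just a matter of
loading it as the debugger back end, roughly:

    perl -d:hdb your_script.pl

and then pointing a browser at the local URL it prints on startup.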

I can see a few ways of fixing this:

1. Easiest: report the bug through RT and hope the author takes care of
it soon.

2. Patch your copy of WWW::SimpleRobot so that it either calls the
callback when head() fails or stops discarding links on a failed
head() request.

3. Copy the WWW::SimpleRobot traversal code into your script and fix it
there (a rough sketch of that kind of do-it-yourself check is below).

The first approach is probably the best.
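
For what it's worth, here's a rough sketch of the do-it-yourself route.
It's not the module's traversal code, just a single-page check built on
LWP::UserAgent and HTML::LinkExtor, to show the shape of it (the page
URL is the one from your script; adapt as needed):

#!/usr/bin/perl
# Rough sketch: report broken links found on a single page.
# No traversal/DEPTH handling like WWW::SimpleRobot does.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

my $page = 'http://www.ncgia.ucsb.edu/about/sitemap.php';
my $ua   = LWP::UserAgent->new( timeout => 10 );

my $res = $ua->get( $page );
die "Can't fetch $page: ", $res->status_line, "\n" unless $res->is_success;

# Giving HTML::LinkExtor a base URL makes relative links absolute.
my $extor = HTML::LinkExtor->new( undef, $page );
$extor->parse( $res->decoded_content );

my %seen;
for my $link ( $extor->links ) {
    my ( $tag, %attrs ) = @$link;
    for my $url ( values %attrs ) {
        next if $seen{$url}++;
        next unless $url =~ m{^http}i;   # skip mailto:, javascript:, etc.

        # Some servers refuse HEAD; a GET fallback would be more thorough.
        my $check = $ua->head( $url );
        print "$url looks like a broken link on $page\n"
            unless $check->is_success;
    }
}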

G. Wade

On Thu, 24 Oct 2013 05:03:40 -0500
Mike Flannigan <mikeflan at att.net> wrote:

> 
> I almost never use this script below, but I think
> I may start using modified copies of it.  The
> broken link referred to in the comment section below
> is linked to the text "Spatial Evacuation Analysis Project"
> <http://www.ncgia.ucsb.edu/%7Ecova/seap.html>
> on the webpage
> http://www.ncgia.ucsb.edu/about/sitemap.php
> 
> The program apparently skips the broken link and
> probably a lot of other links.  Maybe because they are
> relative links??  I could probably figure this out,
> but just haven't worked on it much yet.
> 
> I found this script.  I did not create it.
> 
> #!/usr/local/bin/perl
> #
> # This program crawls sites listed in URLS and checks
> # all links.  But it does not crawl outside the base
> # site listed in FOLLOW_REGEX.  It lists all the links
> # followed, including the broken links.  All output goes
> # to the terminal window.
> #
> # I say this does not work, because the link
> # http://www.ncgia.ucsb.edu/~cova/seap.html
> # is broken on this page: http://www.ncgia.ucsb.edu/about/sitemap.php
> # but this script does not point that out.
> #
> #
> use strict;
> use warnings;
> use WWW::SimpleRobot;
> my $robot = WWW::SimpleRobot->new(
>      URLS            => [ 'http://www.ncgia.ucsb.edu/about/sitemap.php' ],
>      FOLLOW_REGEX    => "^http://www.ncgia.ucsb.edu/",
>      DEPTH           => 1,
>      TRAVERSAL       => 'depth',
>      VISIT_CALLBACK  =>
>          sub {
>              my ( $url, $depth, $html, $links ) = @_;
>              print STDERR "\nVisiting $url\n\n";
>              foreach my $link (@$links){
>                  print STDERR "@{$link}\n"; # This dereferences the links
>              }
>          },
>      BROKEN_LINK_CALLBACK  =>
>          sub {
>              my ( $url, $linked_from, $depth ) = @_;
>              print STDERR "$url looks like a broken link on $linked_from\n";
>              print STDERR "Depth = $depth\n";
>          }
> );
> $robot->traverse;
> my @urls = @{$robot->urls};
> my @pages = @{$robot->pages};
> for my $page ( @pages )
> {
>      my $url = $page->{url};
>      my $depth = $page->{depth};
>      my $modification_time = $page->{modification_time};
> }
> 
> print "\nAll done.\n";
> 
> 
> __END__
> 
> 
> 
> 
> On 10/23/2013 1:48 PM, G. Wade Johnson wrote:
> > Hi Mike, Thanks for the input. I'm glad you have been able to get 
> > input from other resources. I hope this list and the hangout will 
> > become more useful to you as well.
> > We have had some of these topics covered in the past, so the talks
> > pages may have some information that will help. My goal with this is
> > really to help people get unstuck and see how to proceed, rather
> > than teaching.
> >
> > For example, if you had a particular task you wanted to perform with
> > LWP (even if it's an example problem), we could walk through where
> > you are stuck and get you moving again. Also, we could answer
> > questions on the modules that we know.
> >
> > It sounds like what I have in mind could be useful to you as well.
> >
> > G. Wade
> >
> 


-- 
We've all heard that a million monkeys banging on a million typewriters
will eventually reproduce the works of Shakespeare. Now, thanks to the
Internet, we know this is not true.         -- Robert Wilensky, UCB

