[pm-h] Checking for Broken Links

Mike Flannigan mikeflan at att.net
Thu Oct 24 03:03:40 PDT 2013


I almost never use this script below, but I think
I may start using modified copies of it.  The
broken link referred to in the comment section below
is linked to the text "Spatial Evacuation Analysis Project"
<http://www.ncgia.ucsb.edu/%7Ecova/seap.html>
on the webpage
http://www.ncgia.ucsb.edu/about/sitemap.php

The program apparently skips the broken link, and
probably a lot of other links as well.  Maybe because they are
relative links?  I could probably figure this out,
but I just haven't worked on it much yet.
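If relative links are the problem, one way to test that theory is to resolve each href against the page's base URL before checking it. Here is a minimal sketch using the core URI module (the sample hrefs are made up for illustration):

```perl
#!/usr/bin/perl
# Sketch: resolve relative hrefs against a base URL with URI::new_abs.
# If the robot only follows absolute URLs, normalizing links like these
# first would be one way to stop it from skipping them.
use strict;
use warnings;
use URI;

my $base = 'http://www.ncgia.ucsb.edu/about/sitemap.php';
for my $href ( '../~cova/seap.html', '/giscc/', 'http://example.com/x' ) {
    my $abs = URI->new_abs( $href, $base );
    print "$href  =>  $abs\n";
}
```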

I found this script.  I did not create it.

#!/usr/local/bin/perl
#
# This program crawls sites listed in URLS and checks
# all links.  But it does not crawl outside the base
# site listed in FOLLOW_REGEX.  It lists all the links
# followed, including the broken links.  All output goes
# to the terminal window.
#
# I say this does not work, because the link
# http://www.ncgia.ucsb.edu/~cova/seap.html
# is broken on this page: http://www.ncgia.ucsb.edu/about/sitemap.php
# but this script does not point that out.
#
#
use strict;
use warnings;
use WWW::SimpleRobot;
my $robot = WWW::SimpleRobot->new(
     URLS            => [ 'http://www.ncgia.ucsb.edu/about/sitemap.php' ],
     FOLLOW_REGEX    => "^http://www\\.ncgia\\.ucsb\\.edu/", # escape dots so they match literally
     DEPTH           => 1,
     TRAVERSAL       => 'depth',
     VISIT_CALLBACK  =>
         sub {
             my ( $url, $depth, $html, $links ) = @_;
             print STDERR "\nVisiting $url\n\n";
             foreach my $link (@$links){
                 print STDERR "@{$link}\n"; # This dereferences each link entry
             }
         },
     BROKEN_LINK_CALLBACK  =>
         sub {
             my ( $url, $linked_from, $depth ) = @_;
             print STDERR "$url looks like a broken link on $linked_from\n";
             print STDERR "Depth = $depth\n";
         }
);
$robot->traverse;
my @urls = @{$robot->urls};
my @pages = @{$robot->pages};
for my $page ( @pages )
{
     my $url = $page->{url};
     my $depth = $page->{depth};
     my $modification_time = $page->{modification_time};
     # Without some output the loop above does nothing useful.
     print "Visited $url (depth $depth)\n";
}

print "\nAll done.\n";


__END__
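For comparison, here is a sketch of a more direct check that does not rely on WWW::SimpleRobot's crawling: fetch the one page, resolve every <a href> to an absolute URL, and issue a HEAD request for each. This is my own rough approach, assuming LWP::UserAgent, HTML::LinkExtor, and URI are installed; because it resolves relative links explicitly, a broken link like /~cova/seap.html should get reported.

```perl
#!/usr/bin/perl
# Sketch: check every link on a single page with HEAD requests.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $page = 'http://www.ncgia.ucsb.edu/about/sitemap.php';
my $ua   = LWP::UserAgent->new( timeout => 15 );

my $resp = $ua->get($page);
die "Cannot fetch $page: ", $resp->status_line unless $resp->is_success;

# Collect all <a href> targets, resolved to absolute URLs.
my %seen;
my $extor = HTML::LinkExtor->new(
    sub {
        my ( $tag, %attr ) = @_;
        return unless $tag eq 'a' and $attr{href};
        $seen{ URI->new_abs( $attr{href}, $resp->base ) } = 1;
    }
);
$extor->parse( $resp->decoded_content );

# HEAD each link and report its status.
for my $url ( sort keys %seen ) {
    my $check = $ua->head($url);
    print $check->is_success ? "OK     " : "BROKEN ", "$url\n";
}
```

A HEAD request is usually enough to detect a 404, though a few servers reject HEAD; swapping in $ua->get($url) would be the slower but more reliable variant.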




On 10/23/2013 1:48 PM, G. Wade Johnson wrote:
> Hi Mike, Thanks for the input. I'm glad you have been able to get 
> input from other resources. I hope this list and the hangout will 
> become more useful to you as well.
> We have had some of these topics covered in the past, so the talks
> pages may have some information that will help. My goal with this is
> really to help people get unstuck and see how to proceed, rather than
> teaching.
>
> For example, if you had a particular task you wanted to perform with
> LWP (even if it's an example problem), we could walk through where you
> are stuck and get you moving again. Also, we could answer questions on
> the modules that we know.
>
> It sounds like what I have in mind could be useful to you as well.
>
> G. Wade
>
