SPUG: Crawling the web

Fri Jun 15 10:50:59 CDT 2001

Good question!  Haven't used WWW::Robot, but after reading the docs:

Both modules are extensible via 'hooks'.  (I call them 'plugins'.)

IMPORTANT: My module allows multiple plugins; WWW::Robot does not.  If my
module becomes popular, it may encourage plugin authors to upload them to
CPAN, so we could have a set of modules that do link checking, HTML 4.0
validation, XHTML 1.0 validation, WML validation, XML validation, spell
checking, grammar checking, etc.  Trying to bundle all of these
applications into a single WWW::Robot plugin would be a mess, in my opinion.
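
A minimal sketch of the multiple-plugin idea, written in Python rather than
Perl and not the actual interface of either module.  The names run_plugins,
size_plugin and title_plugin are hypothetical; the point is only that each
plugin is an independent callable and every registered plugin sees each page.

```python
def size_plugin(url, body):
    """Hypothetical plugin: report the page size in bytes."""
    return {"size": len(body.encode("utf-8"))}


def title_plugin(url, body):
    """Hypothetical plugin: report whether the page contains a <title> tag."""
    return {"has_title": "<title>" in body.lower()}


def run_plugins(url, body, plugins):
    """Call every registered plugin on one page; key the results by plugin name."""
    return {plugin.__name__: plugin(url, body) for plugin in plugins}


# Assumed sample page; a real crawler would fetch this over HTTP.
report = run_plugins(
    "http://example.com/",
    "<html><title>Hi</title></html>",
    [size_plugin, title_plugin],
)
```

Adding a spell checker or a validator would then mean appending one more
callable to the list, without touching the existing plugins.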

My module can be used 'out of the box' for web page validation using some
built-in basic tests (response time, min/max size of page, text string/regex
matching against the page).  WWW::Robot requires the user to write a plugin
to use it.  My module is bundled with a script that allows it to be driven
from the command line using an input parameter file.  (It can also be used
like a module from within a user-supplied program.)
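
For illustration, the kind of basic tests listed above (response time,
min/max page size, regex match) could look roughly like this.  This is a
Python sketch under assumed names (check_page and its parameters are made
up), not the module's real parameter format.

```python
import re


def check_page(body, elapsed_seconds, max_seconds=None,
               min_bytes=None, max_bytes=None, must_match=None):
    """Return a list of failure messages; an empty list means the page passed."""
    failures = []
    size = len(body.encode("utf-8"))
    if max_seconds is not None and elapsed_seconds > max_seconds:
        failures.append("response too slow")
    if min_bytes is not None and size < min_bytes:
        failures.append("page too small")
    if max_bytes is not None and size > max_bytes:
        failures.append("page too large")
    if must_match is not None and re.search(must_match, body) is None:
        failures.append("pattern not found")
    return failures
```

A check that is left as None is skipped, so a caller only states the limits
it cares about.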

My module can be used for unit testing of static or dynamic web pages: If
passed a file pathname instead of a URL, it will start a local instance of
Apache on a private/dynamic port, copy the file to a temporary htdocs
directory, fetch the page and process it.  This is easy to integrate with
the 'make test' phase of a makefile.

My module handles creating and passing cookies to and from the web server,
even during multiple redirects; looks like WWW::Robot can accept a cookie
but not create one from scratch.

My module handles passing parameters (as from a form) to the web server;
it looks like WWW::Robot does not.

My module doesn't honor the Robot Exclusion Protocol (yet); WWW::Robot does.
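
As a rough idea of what honoring the protocol involves, here is a Python
sketch using the standard library's urllib.robotparser.  The helper name
allowed, the sample robots.txt text and the user agent "mybot" are all
assumptions for the example; neither Perl module works this way as far as
this post states.

```python
from urllib.robotparser import RobotFileParser

# Assumed sample robots.txt; a real crawler would fetch it from the site root.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ok_public = allowed(ROBOTS_TXT, "mybot", "http://example.com/index.html")
ok_private = allowed(ROBOTS_TXT, "mybot", "http://example.com/private/x.html")
```

A crawler would make this check before each fetch and skip any URL for
which it returns False.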

WWW::Robot extracts all the links on each page it visits; I leave that up to
the author of the plugin.  (Probably should add that ..)
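
Link extraction of the sort a plugin author would have to write could be
sketched as follows, in Python with only the standard library.  The names
LinkCollector and extract_links are hypothetical, and this handles only
<a href="..."> tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs of all links in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The returned list is what a crawler would feed back into its queue of URLs
to visit.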

My module exposes methods for input parameter validation and generating a
detailed report (both a summary report and a report on each visited URL);
looks like WWW::Robot does not have this.

WWW::Robot had a single release in 1997 and no maintenance has been done
since then, which suggests that it is dead code.  I'm actively working on my
module; it's buggy, but if it gets enough use and maintenance it could get to
production-level quality someday.

(I'm probably being a bit harsh on WWW::Robot; I haven't used it so it
probably has some nifty features that I should add to my module.)

Cheers,
Richard

----- Original Message -----
From: "Leon Brocard" <acme at astray.com>
To: "Richard Anderson" <Richard.Anderson at raycosoft.com>
Sent: Friday, June 15, 2001 6:54 AM
Subject: Re: SPUG: Crawling the web

> Richard Anderson sent the following bits through the ether:
>
> > I've written a generalized, extensible Perl module for crawling the web
> > and doing arbitrary processing of web pages.
>
> Sure. How does it compare to WWW::Robot?
>
> Leon
> --
> Leon Brocard.............................http://www.astray.com/
> Iterative Software...........http://www.iterative-software.com/
>
> ... Holy Smoke Batman, it's the Joker!
>

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     POST TO: spug-list at pm.org       PROBLEMS: owner-spug-list at pm.org
      Subscriptions; Email to majordomo at pm.org:  ACTION  LIST  EMAIL
  Replace ACTION by subscribe or unsubscribe, EMAIL by your Email-address
 For daily traffic, use spug-list for LIST ;  for weekly, spug-list-digest
  Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/