SPUG: robot (spider)
CJ Collier
cjcollier at colliertech.org
Tue Nov 9 17:57:42 CST 2004
Be sure to implement reading of robots.txt
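WWW::RobotRules, which ships with libwww-perl, handles the robots.txt bookkeeping for you. Here's a minimal sketch; the agent name and URLs are placeholders, and the robots.txt text is inlined so the example runs without a network (in the spider you'd fetch it with LWP first). LWP::RobotUA wraps this same logic into a UserAgent if you'd rather not manage it by hand.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Placeholder agent name; use your spider's real name here.
my $rules = WWW::RobotRules->new('MySpider/0.1');

# In the spider, this text would come from fetching
# http://example.com/robots.txt with LWP::UserAgent.
my $robots_txt = <<'EOT';
User-agent: *
Disallow: /private/
EOT

$rules->parse('http://example.com/robots.txt', $robots_txt);

# Consult the rules before fetching any URL on that host.
print $rules->allowed('http://example.com/index.html')
    ? "fetch it\n" : "skip it\n";
print $rules->allowed('http://example.com/private/secret.html')
    ? "fetch it\n" : "skip it\n";
```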
Use LWP::UserAgent to read the pages
Use HTML::Parser to find the links
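Those two steps fit together like this sketch: an HTML::Parser start handler fires on every opening tag, and you collect the href from the `<a>` tags. The HTML is inlined here so the parse step is visible; in the spider, `$html` would be the `content` of an LWP::UserAgent response instead.

```perl
use strict;
use warnings;
use HTML::Parser;

# Inline sample; in the spider this would be $ua->get($url)->content.
my $html = '<p><a href="/one">one</a> and <a href="/two">two</a></p>';

my @links;
my $p = HTML::Parser->new(
    api_version => 3,
    # start_h fires on every start tag; keep href from <a> tags only.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            push @links, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);
$p->parse($html);
$p->eof;

print "$_\n" for @links;
```

HTML::LinkExtor, an HTML::Parser subclass, does exactly this collection for you if you'd rather not write the handler yourself.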
Create a breadth-first function to follow links (a FIFO queue of pending
URLs gives breadth-first order; plain recursion gives depth-first).
Create a global hash to keep track of the links that have already been
followed.
You could accept a boolean argument to the worker object constructor
that forces the spider to stay on the same domain.
I've got more ideas if you'd like to hear them.
C.J.
On Fri, 2004-11-05 at 13:00 -0800, Luis Medrano wrote:
> List,
>
> I'm trying to build a spider. Can somebody explain the easiest way to do it?
>
> Thanks,
> Luis
>
> _____________________________________________________________
> Seattle Perl Users Group Mailing List
> POST TO: spug-list at mail.pm.org http://spugwiki.perlocity.org/
> ACCOUNT CONFIG: http://mail.pm.org/mailman/listinfo/spug-list
> MEETINGS: 3rd Tuesdays, Location: Amazon.com Pac-Med
> WEB PAGE: http://seattleperl.org/
>