SPUG: robot (spider)

CJ Collier cjcollier at colliertech.org
Tue Nov 9 17:57:42 CST 2004


Be sure to implement reading of robots.txt
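Something like this would cover it. LWP::RobotUA is a subclass of
LWP::UserAgent that fetches robots.txt for each host and refuses
requests the rules disallow; the agent name and contact address below
are placeholders:

    use strict;
    use warnings;
    use LWP::RobotUA;

    my $ua = LWP::RobotUA->new(
        agent => 'MySpider/0.1',     # placeholder bot name
        from  => 'you@example.com',  # contact address; the module requires one
    );
    $ua->delay(1/60);  # delay() takes minutes, so this is 1 second per host

    my $res = $ua->get('http://example.com/');
    print $res->status_line, "\n";  # "403 Forbidden by robots.txt" when blocked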

Use LWP::UserAgent to read the pages
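A minimal fetch might look like this (the agent string and URL are
stand-ins):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new(
        agent   => 'MySpider/0.1',  # placeholder identifier
        timeout => 10,              # don't hang on slow servers
    );

    my $url      = 'http://example.com/';
    my $response = $ua->get($url);
    if ($response->is_success) {
        my $html = $response->decoded_content;  # this is what the parser gets
    }
    else {
        warn "couldn't fetch $url: ", $response->status_line, "\n";
    }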

Use HTML::Parser to find the links
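Rough sketch with the version-3 event API; $base_url and $html are
assumed to come from the fetch step. (HTML::LinkExtor, which is built
on HTML::Parser, wraps up the same idea if you want less typing.)

    use strict;
    use warnings;
    use HTML::Parser;
    use URI;

    my $base_url = 'http://example.com/';         # page the HTML came from
    my $html     = '<a href="/about">About</a>';  # stand-in document

    my @links;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                return unless $tagname eq 'a' && defined $attr->{href};
                # resolve relative hrefs against the page's own URL
                push @links, URI->new_abs($attr->{href}, $base_url)->as_string;
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;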

Create a breadth-first function to follow links. A queue-driven loop is
the natural fit here; plain recursion would give you depth-first order
instead. (Both this and the next point appear in the sketch below.)

Create a global hash to keep track of the links that have already been
followed.
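
Putting those two points together, the skeleton is just a queue plus the
%seen hash. fetch() and extract_links() here are made-up wrappers around
the LWP::UserAgent and HTML::Parser pieces above:

    use strict;
    use warnings;

    my $start_url = 'http://example.com/';  # stand-in starting point
    my %seen;                               # URLs already followed
    my @queue = ($start_url);               # FIFO queue => breadth-first order

    while (defined(my $url = shift @queue)) {
        next if $seen{$url}++;              # skip anything we've seen
        my $html = fetch($url) or next;     # hypothetical wrapper, see above
        for my $link (extract_links($html, $url)) {
            push @queue, $link unless $seen{$link};
        }
    }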

You could accept a boolean argument to the worker object constructor
that forces the spider to stay on the same domain.
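
The guard itself is a one-liner with URI; Spider, stay_on_domain, and
on_same_host are made-up names for illustration (and this assumes http
links, since host() only exists for server-style URIs):

    use strict;
    use warnings;
    use URI;

    # e.g. my $spider = Spider->new(stay_on_domain => 1, start => $start_url);

    sub on_same_host {
        my ($start_url, $link) = @_;
        return lc(URI->new($start_url)->host) eq lc(URI->new($link)->host);
    }

    # before queueing a link:
    # push @queue, $link
    #     if !$self->{stay_on_domain} || on_same_host($self->{start}, $link);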

I've got more ideas if you'd like to hear them.

C.J.


On Fri, 2004-11-05 at 13:00 -0800, Luis Medrano wrote:
> List,
> 
> I'm trying to build a spider. Can somebody explain the easiest way to do it?
> 
> Thanks,
> Luis
> 


