SPUG: robot (spider)
cjcollier at colliertech.org
Tue Nov 9 17:57:42 CST 2004
Be sure to implement reading of robots.txt
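A minimal sketch of that step, assuming the CPAN module WWW::RobotRules alongside LWP::UserAgent; the agent name "MySpider/0.1" and the example.com URLs are placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use WWW::RobotRules;

# "MySpider/0.1" is a made-up agent name; use your own bot's name.
my $rules = WWW::RobotRules->new('MySpider/0.1');
my $ua    = LWP::UserAgent->new(agent => 'MySpider/0.1');

# Fetch and parse the site's robots.txt before crawling that host.
my $robots_url = 'http://example.com/robots.txt';
my $res = $ua->get($robots_url);
$rules->parse($robots_url, $res->content) if $res->is_success;

# Consult the rules before fetching any page on that host.
print "fetch permitted\n" if $rules->allowed('http://example.com/index.html');
```

Note that if robots.txt cannot be fetched, WWW::RobotRules treats the host as unrestricted, so a stricter spider might refuse to crawl instead.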
Use LWP::UserAgent to read the pages
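For the fetching step, something like the following; the URL, agent string, and timeout are illustrative choices, not requirements:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    agent   => 'MySpider/0.1',   # identify your bot; this name is an example
    timeout => 10,               # don't hang forever on a dead server
);

my $response = $ua->get('http://example.com/');
if ($response->is_success) {
    my $html = $response->decoded_content;   # body with charset decoding applied
    print length($html), " bytes fetched\n";
} else {
    warn 'fetch failed: ', $response->status_line, "\n";
}
```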
Use HTML::Parser to find the links
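One way to do that with HTML::Parser's event API, shown here on an inline HTML snippet; the snippet and base URL are stand-ins for a fetched page, and URI is used to absolutize relative links:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
use URI;

my @links;
my $parser = HTML::Parser->new(
    api_version => 3,
    # Fire on start tags; grab href attributes from anchors.
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            push @links, $attr->{href} if $tag eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

my $html = '<a href="/about">About</a> <a href="http://example.com/">Home</a>';
$parser->parse($html);
$parser->eof;

# Resolve relative links against the page's own URL before queueing them.
my $base     = 'http://example.com/index.html';
my @absolute = map { URI->new_abs($_, $base)->as_string } @links;
print "$_\n" for @absolute;
```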
Create a breadth-first function to follow links (a queue of pending URLs avoids deep recursion).
Create a global hash to keep track of the links that have already been visited.
You could accept a boolean argument to the worker object constructor
that forces the spider to stay on the same domain.
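Those three pieces fit together roughly like this; fetch_links() is a hypothetical stand-in for the LWP::UserAgent + HTML::Parser code (here backed by a tiny fixed link graph so the sketch runs on its own), and the domain check uses URI:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my %seen;   # global hash of URLs already visited

sub crawl {
    my ($start, $stay_on_domain) = @_;   # boolean flag from the constructor idea
    my $home  = URI->new($start)->host;
    my @queue = ($start);                # breadth-first worklist

    while (my $url = shift @queue) {
        next if $stay_on_domain
             && URI->new($url)->host ne $home;   # honor the same-domain flag
        next if $seen{$url}++;                   # skip already-visited links
        push @queue, fetch_links($url);          # enqueue newly discovered links
    }
}

# Hypothetical stand-in for the real fetch-and-parse step: a fixed link graph.
my %graph = (
    'http://example.com/'  => [ 'http://example.com/a', 'http://other.org/' ],
    'http://example.com/a' => [ 'http://example.com/' ],
);
sub fetch_links { @{ $graph{ $_[0] } || [] } }

crawl('http://example.com/', 1);
print "$_\n" for sort keys %seen;
```

Checking the domain before marking a URL as seen keeps off-domain links out of the visited set entirely; swapping those two lines would also work but records URLs the spider never fetched.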
I've got more ideas if you'd like to hear them.
On Fri, 2004-11-05 at 13:00 -0800, Luis Medrano wrote:
> I'm trying to build a spider. Can somebody explain the easiest way to do it?
> Seattle Perl Users Group Mailing List
> POST TO: spug-list at mail.pm.org
> http://spugwiki.perlocity.org/
> ACCOUNT CONFIG: http://mail.pm.org/mailman/listinfo/spug-list
> MEETINGS: 3rd Tuesdays, Location: Amazon.com Pac-Med
> WEB PAGE: http://seattleperl.org/