Phoenix.pm: keep-alives with RobotUA

Jacob Powers jpowers at ccbill.com
Thu Dec 4 18:14:58 CST 2003


As Scott said, you will need to break into RobotUA to add that. RobotUA
in its default form is very weak and riddled with "friendly" rules. Here
are a few things I have done to it.

	1.	Remove all robot rules. Just comment out the use of
		Robot Rules.
	2.	Remove all wait commands. In order to be more website
		friendly it has a sleep(1), sometimes up to a minute,
		in between each request. Very annoying if you are trying
		to pull down many pages/sites quickly. Just set use_sleep
		to 0 and delay to 0.
	3.	Set it up to use MySQL for the seen/used URLs instead of
		using a hash (this gets really big really fast). What I
		do is, in the addUrl function in RobotUA, I comment out
		the part where it adds/makes a hash and instead I MD5 the
		URL and put it in a table that has nothing but a char(32)
		field. This speeds the robot up a lot. You also have to
		change the part where it checks the URL, once again in
		the addUrl function, to read from the DB instead of the
		hash.
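
The seen-URL check in item 3 could be sketched roughly as follows. This is a minimal illustration, not the actual RobotDB code: the table name seen_urls, the column url_md5, and the helpers url_key and seen_before are all hypothetical, and $dbh is assumed to be an already connected DBI handle.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Assumed table (hypothetical name and column):
#   CREATE TABLE seen_urls (url_md5 CHAR(32) NOT NULL PRIMARY KEY);

# Fixed-width key for a URL: always 32 hex characters,
# no matter how long the URL itself is.
sub url_key {
    my ($url) = @_;
    return md5_hex($url);
}

# Returns 1 if the URL was already recorded. Otherwise records it
# and returns 0. $dbh is any connected DBI database handle.
sub seen_before {
    my ($dbh, $url) = @_;
    my $key = url_key($url);

    # Look the key up in the table instead of in an in-memory hash.
    my ($found) = $dbh->selectrow_array(
        'SELECT 1 FROM seen_urls WHERE url_md5 = ?', undef, $key);
    return 1 if $found;

    # First visit: remember the key for next time.
    $dbh->do('INSERT INTO seen_urls (url_md5) VALUES (?)', undef, $key);
    return 0;
}
```

A caller would obtain $dbh with DBI->connect (for MySQL, through DBD::mysql) and skip any URL for which seen_before returns true. The select-then-insert pair is not atomic, so this sketch assumes a single robot process writes to the table.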


Those are just my personal tweaks to the RobotUA, or as I have renamed
it RobotDB. Make sure you do ample testing in various scenarios with
your hooks in place. 
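
Before patching the module, it may be worth trying the settings from item 2, and the keep-alive from the original question, on the object itself. The sketch below assumes the installed LWP::RobotUA inherits conn_cache from LWP::UserAgent (RobotUA is a subclass) and that LWP::ConnCache is available. The agent name and contact address are placeholders.

```perl
use strict;
use warnings;
use LWP::RobotUA;
use LWP::ConnCache;

# Placeholder agent name and contact address; RobotUA requires both.
my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');

# Item 2: no waiting between requests to the same host.
$ua->delay(0);        # delay is given in minutes
$ua->use_sleep(0);    # do not sleep() inside the agent

# Keep-alive: attach a connection cache by hand, which is roughly what
# keep_alive => 1 does in the LWP::UserAgent constructor.
$ua->conn_cache(LWP::ConnCache->new(total_capacity => 1));
```

This leaves the robots.txt handling in place, so it covers only item 2 and the keep-alive, not item 1.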

Jacob Powers

-----Original Message-----
From: Scott Walters [mailto:scott at illogics.org] 
Sent: Thursday, December 04, 2003 4:09 PM
To: Matt Alexander
Subject: Re: Phoenix.pm: keep-alives with RobotUA

Very generally speaking:

Go under the hood and hack the feature on.

I haven't used any of those modules more than a wee widdle bit
so I don't know how it all fits together, but one object is
likely creating instances of others and this argument is something
that could be perpetuated.

Sorry I don't have a better answer, short on time today =(

Re: Spidering Hacks, post a mini-review on http://phoenix.pm.org under
books =)

-scott


On  0, Matt Alexander <m at pdxlug.org> wrote:
> 
> I'm using LWP::RobotUA, but I apparently can't pass 'keep_alive => 1' to
> the constructor like I can with LWP::UserAgent or WWW::Mechanize. Does
> anyone have a suggestion for how to enable keep-alives with RobotUA?
> Thanks,
> ~M
> P.S.  The new O'Reilly book "Spidering Hacks" is incredible. Definitely
> check it out.


