Phoenix.pm: keep-alives with RobotUA

Jacob Powers jpowers at ccbill.com
Fri Dec 5 11:16:48 CST 2003


Mike,
	
	I agree with you wholeheartedly about not DoS-ing a site or
having a malicious spider hitting any site you don't own or have
permission to crawl. However, leaving the delays and waits in does
create a *big* bottleneck in the software (upwards of 100 times
slower). You can make additional tweaks to your spider so it does not
DoS a site. One example is your type of traversal, breadth or depth.
Depending on whether you are hitting many sites or just one, you can
also spread out the load (hit one site's URL with a request, then a
different site's URL with the next request, and so on).
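	For what it's worth, here is a rough, untested sketch of that
round-robin idea using plain LWP::UserAgent and made-up hosts and URLs
(not code from my spider):

    use strict;
    use warnings;
    use LWP::UserAgent;

    # One queue of URLs per host; take one URL from each host per pass
    # so no single site sees back-to-back requests.
    my %queue = (
        'www.example-a.com' => [ 'http://www.example-a.com/page1',
                                 'http://www.example-a.com/page2' ],
        'www.example-b.com' => [ 'http://www.example-b.com/page1' ],
    );

    my $ua = LWP::UserAgent->new(keep_alive => 1, timeout => 30);

    while (grep { @$_ } values %queue) {
        for my $host (sort keys %queue) {
            my $url = shift @{ $queue{$host} } or next;
            my $resp = $ua->get($url);
            printf "%s %s\n", $resp->code, $url;
        }
    }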
	The default parameters are too friendly in my opinion. If I
recall correctly, RobotUA only allows 10 requests, with a one-second
delay between each, and then requires you to sleep for a minute.
That's excessive if you ask me.
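	If you do want to stay inside RobotUA and just loosen it up,
the knobs are delay() (in minutes) and use_sleep(); something along
these lines, with made-up agent/from values:

    use LWP::RobotUA;
    use HTTP::Request;

    my $ua = LWP::RobotUA->new('MyBot/0.1', 'me@example.com');
    $ua->delay(1/60);    # roughly one second between hits to the same host
    $ua->use_sleep(1);   # sleep out the delay instead of returning an error

    my $resp = $ua->request(HTTP::Request->new(GET => 'http://www.example.com/'));
    print $resp->status_line, "\n";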
	I use the RobotDB to spider my own sites daily to check for
broken links or changed pages. They can handle a lot more than one hit
per second (as most professional or well-done sites can), so speed
means more to me than being nice to my server. But I completely agree
with Mike about being aware of what you are doing and who you may be
affecting. As always, be courteous if the site is not yours.
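	A stripped-down, single-page version of that kind of link
check, using stock LWP/HTML modules and a placeholder URL, looks
roughly like this:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $page = 'http://www.example.com/';    # placeholder
    my $ua   = LWP::UserAgent->new(keep_alive => 1, timeout => 15);

    my $resp = $ua->get($page);
    die "Cannot fetch $page: ", $resp->status_line, "\n"
        unless $resp->is_success;

    # Collect href attributes from the <a> tags on the page.
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' and $attr{href};
    });
    $parser->parse($resp->content);

    # HEAD each link and report anything that does not come back 2xx.
    for my $link (@links) {
        my $url   = URI->new_abs($link, $page);
        my $check = $ua->head($url);
        printf "BROKEN %s %s\n", $check->code, $url
            unless $check->is_success;
    }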
	Oh yeah, and one last thing: if you just want a local copy of a
site, use wget instead of RobotUA; it works much better for that.
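	Something along these lines (placeholder URL) mirrors a site
with a one-second pause between requests and rewrites the links for
local browsing:

    wget --mirror --convert-links --wait=1 --no-parent http://www.example.com/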

Jacob Powers

-----Original Message-----
From: Michael Friedman [mailto:friedman at highwire.stanford.edu] 
Sent: Thursday, December 04, 2003 11:43 PM
To: phoenix-pm-list at happyfunball.pm.org
Subject: Re: Phoenix.pm: keep-alives with RobotUA

Jacob,

It's great that you have been able to modify RobotUA to meet your 
needs. I'd like to speak up, though, in favor of robot rules and 
request delays.

I work for HighWire Press. We run almost 400 high-traffic websites for 
medical and scientific journals, such as the New England Journal of 
Medicine, Science Magazine, and the Oxford English Dictionary. We have 
serious hardware, software, and bandwidth for these sites. And yet, one 
robot that ignores our robot rules and requests pages faster than 1 per 
second could bring down an entire site. Even very popular sites such as 
these generally only handle a dozen requests per second at their 
highest traffic peaks. More than that can saturate our net connections, 
overwhelm the application server, and in the worst cases, crash the 
machine.

Now, we've taken steps to avoid such behavior. If you hit one of our 
sites that fast, we'll block you from getting any pages within a 
second. But smaller sites don't always have the ability to do that. By 
requesting tons of pages as fast as possible, you can end up costing 
the provider money (for "extra" bandwidth charges) or even bring down 
the site as effectively as any Denial of Service attack.

So I urge caution when removing the safeguards that are built into
robot spidering software. Please don't overwhelm a site just because
you want a local copy of it. You can usually set the delay to 1 second;
it won't take *that* much longer, and you won't keep others from using
the site at the same time.

My $.02,
-- Mike

On Dec 4, 2003, at 5:14 PM, Jacob Powers wrote:

> As Scott said, you will need to break into RobotUA to add that. RobotUA
> in its default form is very weak and riddled with "friendly" rules. Here
> are a few things I have done to it:
>
> 	1.	Remove all robot rules. Just comment out the use of
> 		RobotRules.
> 	2.	Remove all wait commands. In order to be more website
> 		friendly it has a sleep(1), sometimes up to a minute, in
> 		between each request. Very annoying if you are trying to
> 		pull down many pages/sites quickly. Just set use_sleep to
> 		0 and delay to 0.
> 	3.	Set it up to use MySQL for the seen/used URLs instead of
> 		using a hash (this gets really big really fast). What I do
> 		is, in the addUrl function in RobotUA, I comment out the
> 		part where it adds/makes a hash and instead I MD5 the URL
> 		and put it in a table that has nothing but a char(32)
> 		field. This speeds the Robot up a lot. You also have to
> 		change where it checks the URL, once again in the addUrl
> 		function, to read from the DB instead of the hash (see the
> 		sketch after this list).
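>
> A rough sketch of that third change, assuming a MySQL table created as
> "CREATE TABLE seen (md5 CHAR(32) NOT NULL PRIMARY KEY)" and placeholder
> connection details:
>
>     use DBI;
>     use Digest::MD5 qw(md5_hex);
>
>     my $dbh = DBI->connect('DBI:mysql:database=spider', 'user', 'pass',
>                            { RaiseError => 1 });
>
>     # True the first time a URL is seen, false on every later call.
>     sub remember_url {
>         my ($url) = @_;
>         my $md5   = md5_hex($url);
>         my ($hit) = $dbh->selectrow_array(
>             'SELECT 1 FROM seen WHERE md5 = ?', undef, $md5);
>         return 0 if $hit;
>         $dbh->do('INSERT INTO seen (md5) VALUES (?)', undef, $md5);
>         return 1;
>     }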
>
>
> Those are just my personal tweaks to the RobotUA, or as I have renamed
> it, RobotDB. Make sure you do ample testing in various scenarios with
> your hooks in place.
>
> Jacob Powers
>
> -----Original Message-----
> From: Scott Walters [mailto:scott at illogics.org]
> Sent: Thursday, December 04, 2003 4:09 PM
> To: Matt Alexander
> Subject: Re: Phoenix.pm: keep-alives with RobotUA
>
> Very generally speaking:
>
> Go under the hood and hack the feature on.
>
> I haven't used any of those modules more than a wee widdle bit,
> so I don't know how it all fits together, but one object is likely
> creating instances of others, and the keep_alive argument is something
> that could be propagated through.
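>
> One thing that might work without patching anything (an untested guess
> on my part): LWP::RobotUA inherits from LWP::UserAgent, so you may be
> able to hang a connection cache off it after construction and get
> keep-alives that way.
>
>     use LWP::RobotUA;
>     use LWP::ConnCache;
>
>     my $ua = LWP::RobotUA->new('MyBot/0.1', 'me@example.com');
>     $ua->conn_cache(LWP::ConnCache->new(total_capacity => 5));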
>
> Sorry I don't have a better answer, short on time today =(
>
> Re: Spidering Hacks, post a mini-review on http://phoenix.pm.org under
> books =)
>
> -scott
>
>
> On  0, Matt Alexander <m at pdxlug.org> wrote:
>>
>> I'm using LWP::RobotUA, but I apparently can't pass 'keep_alive => 1'
>> to the constructor like I can with LWP::UserAgent or WWW::Mechanize.
>> Does anyone have a suggestion for how to enable keep-alives with
>> RobotUA?
>> Thanks,
>> ~M
>> P.S.  The new O'Reilly book "Spidering Hacks" is incredible.
>> Definitely check it out.
>>
---------------------------------------------------------------------
Michael Friedman                  HighWire Press, Stanford Southwest
Phone: 480-456-0880                                   Tempe, Arizona
FAX:   270-721-8034                  <friedman at highwire.stanford.edu>
---------------------------------------------------------------------



