Phoenix.pm: keep-alives with RobotUA

Michael Friedman friedman at highwire.stanford.edu
Fri Dec 5 00:43:16 CST 2003


Jacob,

It's great that you have been able to modify RobotUA to meet your 
needs. I'd like to speak up, though, in favor of robot rules and 
request delays.

I work for HighWire Press. We run almost 400 high-traffic websites for
medical and scientific journals, such as the New England Journal of
Medicine and Science Magazine, as well as reference works like the
Oxford English Dictionary. We have serious hardware, software, and
bandwidth behind these sites. And yet, a single robot that ignores our
robot rules and requests pages faster than one per second could bring
down an entire site. Even very popular sites like these generally
handle only about a dozen requests per second at their highest traffic
peaks. More than that can saturate our net connections, overwhelm the
application server, and, in the worst cases, crash the machine.

Now, we've taken steps to guard against such behavior. If you hit one
of our sites that fast, we'll block you from getting any pages at all
within a second. But smaller sites don't always have the ability to do
that. By requesting tons of pages as fast as possible, you can end up
costing the provider money (in "extra" bandwidth charges) or even
bringing the site down as effectively as any denial-of-service attack.

So I urge caution when removing the safeguards that are built into
robot spidering software. Please don't overwhelm a site just because
you want a local copy of it. You can usually set the delay to 1 second;
the crawl won't take *that* much longer, and you won't keep others from
using the site at the same time.

My $.02,
-- Mike

On Dec 4, 2003, at 5:14 PM, Jacob Powers wrote:

> As Scott said, you will need to break into RobotUA to add that. RobotUA
> in its default form is very weak and riddled with "friendly" rules. Here
> are a few things I have done to it.
>
>    1. Remove all robot rules. Just comment out the use of RobotRules.
>
>    2. Remove all wait commands. In order to be more website-friendly,
>       it sleeps between each request, one second and sometimes up to
>       a minute. Very annoying if you are trying to pull down many
>       pages/sites quickly. Just set use_sleep to 0 and delay to 0.
>
>    3. Set it up to use MySQL for the seen/used URLs instead of a hash
>       (the hash gets really big really fast). In the addUrl function
>       in RobotUA, I comment out the part where it builds the hash and
>       instead MD5 the URL and put the digest in a table that has
>       nothing but a char(32) field. This speeds the robot up a lot.
>       You also have to change where it checks the URL, again in the
>       addUrl function, to read from the DB instead of the hash. (A
>       rough sketch follows after this message.)
>
>
> Those are just my personal tweaks to RobotUA, or, as I have renamed it,
> RobotDB. Make sure you do ample testing in various scenarios with your
> hooks in place.
>
> Jacob Powers
>
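
For what it's worth, the MySQL seen-URL idea in point 3 above can be
sketched as a couple of standalone helpers. The DSN, credentials, table
name, and function names below are all made up for illustration; this
is not the code from Jacob's modified module.

    use DBI;
    use Digest::MD5 qw(md5_hex);

    # Hypothetical table:
    #   CREATE TABLE seen_urls (digest CHAR(32) PRIMARY KEY);
    my $dbh = DBI->connect('dbi:mysql:database=spider', 'user', 'password',
                           { RaiseError => 1 });

    # Record a URL as seen; duplicate digests are silently ignored.
    sub mark_seen {
        my ($url) = @_;
        $dbh->do('INSERT IGNORE INTO seen_urls (digest) VALUES (?)',
                 undef, md5_hex($url));
    }

    # True if the URL has already been recorded.
    sub seen {
        my ($url) = @_;
        my ($n) = $dbh->selectrow_array(
            'SELECT COUNT(*) FROM seen_urls WHERE digest = ?',
            undef, md5_hex($url));
        return $n;
    }
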
> -----Original Message-----
> From: Scott Walters [mailto:scott at illogics.org]
> Sent: Thursday, December 04, 2003 4:09 PM
> To: Matt Alexander
> Subject: Re: Phoenix.pm: keep-alives with RobotUA
>
> Very generally speaking:
>
> Go under the hood and hack the feature on.
>
> I haven't used any of those modules more than a wee widdle bit,
> so I don't know how it all fits together, but one object is
> likely creating instances of the others, and the keep_alive
> argument is something that could be passed along to them.
>
> Sorry I don't have a better answer, short on time today =(
>
> Re: Spidering Hacks, post a mini-review on http://phoenix.pm.org under
> books =)
>
> -scott
>
>
> Matt Alexander <m at pdxlug.org> wrote:
>>
>> I'm using LWP::RobotUA, but I apparently can't pass 'keep_alive => 1'
>> to the constructor like I can with LWP::UserAgent or WWW::Mechanize.
>> Does anyone have a suggestion for how to enable keep-alives with
>> RobotUA?
>> Thanks,
>> ~M
>> P.S.  The new O'Reilly book "Spidering Hacks" is incredible.
>> Definitely check it out.
>>
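
One way to hack keep-alives on that the thread doesn't spell out, but
that should work because LWP::RobotUA is a subclass of LWP::UserAgent,
is to attach an LWP::ConnCache to the robot after construction; as far
as I can tell, that is what keep_alive => 1 arranges for a plain
UserAgent. A rough sketch, with placeholder names:

    use LWP::RobotUA;
    use LWP::ConnCache;

    my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');

    # Reuse connections (keep-alive) by giving the agent a connection
    # cache; total_capacity is the number of idle connections to keep.
    $ua->conn_cache(LWP::ConnCache->new(total_capacity => 1));

    my $res = $ua->get('http://www.example.com/');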
---------------------------------------------------------------------
Michael Friedman                  HighWire Press, Stanford Southwest
Phone: 480-456-0880                                   Tempe, Arizona
FAX:   270-721-8034                  <friedman at highwire.stanford.edu>
---------------------------------------------------------------------



