Phoenix.pm: keep-alives with RobotUA
Michael Friedman
friedman at highwire.stanford.edu
Fri Dec 5 00:43:16 CST 2003
Jacob,
It's great that you have been able to modify RobotUA to meet your
needs. I'd like to speak up, though, in favor of robot rules and
request delays.
I work for HighWire Press. We run almost 400 high-traffic websites for
medical and scientific journals, such as the New England Journal of
Medicine, Science Magazine, and the Oxford English Dictionary. We have
serious hardware, software, and bandwidth for these sites. And yet, one
robot that ignores our robot rules and requests pages faster than 1 per
second could bring down an entire site. Even very popular sites such as
these generally only handle a dozen requests per second at their
highest traffic peaks. More than that can saturate our net connections,
overwhelm the application server, and in the worst cases, crash the
machine.
Now, we've taken steps to avoid such behavior. If you hit one of our
sites that fast, we'll block you from getting any pages within a
second. But smaller sites don't always have the ability to do that. By
requesting tons of pages as fast as possible, you can end up costing
the provider money (for "extra" bandwidth charges) or even bring down
the site as effectively as any Denial of Service attack.
So I urge caution when removing the safeguards that are built into
robot spidering software. Please don't overwhelm a site, just because
you want a local copy of it. You can usually set the delay to 1 second;
the crawl won't take *that* much longer, and you won't keep others from
using the site at the same time.
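For what it's worth, LWP::RobotUA makes the polite configuration easy; the one gotcha is that its delay() attribute is measured in minutes, not seconds. A minimal sketch (the agent name and contact address are placeholders):

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Agent name and contact email are placeholders -- use your own.
my $ua = LWP::RobotUA->new('MyBot/1.0', 'me@example.com');

# delay() is in MINUTES; 1/60 of a minute is roughly one request
# per second, which is about as fast as you should ever go.
$ua->delay(1/60);

my $resp = $ua->get('http://example.com/');
print $resp->status_line, "\n";
```

With that in place the agent still honors robots.txt and still sleeps between requests, just not for the default full minute.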
My $.02,
-- Mike
On Dec 4, 2003, at 5:14 PM, Jacob Powers wrote:
> As Scott said, you will need to break into RobotUA to add that. RobotUA
> in its default form is very conservative and riddled with "friendly"
> rules. Here are a few things I have done to it.
>
> 1. Remove all robot rules. Just comment out the use of
> WWW::RobotRules.
> 2. Remove all wait commands. To be more website-friendly,
> RobotUA sleeps between requests, sometimes for up to a
> minute. Very annoying if you are trying to pull down many
> pages/sites quickly. Just set use_sleep to 0 and delay to 0.
> 3. Set it up to use MySQL for the seen URLs instead of a
> hash (which gets really big, really fast). In the addUrl
> function in RobotUA, I comment out the part that adds the
> URL to a hash; instead I MD5 the URL and insert it into a
> table that has nothing but a char(32) field. This speeds
> the robot up a lot. You also have to change the part where
> it checks the URL, again in the addUrl function, to read
> from the DB instead of the hash.
>
>
> Those are just my personal tweaks to the RobotUA, or as I have renamed
> it RobotDB. Make sure you do ample testing in various scenarios with
> your hooks in place.
>
> Jacob Powers
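Jacob's step 3 can be sketched roughly as below. The table name and DBI connection details are made up for illustration, and the hook into RobotUA's own bookkeeping (his addUrl change) is left out:

```perl
use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

# Connection parameters are placeholders -- substitute your own.
my $dbh = DBI->connect('dbi:mysql:spider', 'user', 'pass',
                       { RaiseError => 1 });

# One char(32) column holding the MD5 digest of each URL seen.
$dbh->do('CREATE TABLE IF NOT EXISTS seen_urls (md5 CHAR(32) PRIMARY KEY)');

# Returns true if the URL was already seen; records it otherwise.
sub seen_url {
    my ($url) = @_;
    my $digest = md5_hex($url);
    my ($found) = $dbh->selectrow_array(
        'SELECT 1 FROM seen_urls WHERE md5 = ?', undef, $digest);
    return 1 if $found;
    $dbh->do('INSERT INTO seen_urls (md5) VALUES (?)', undef, $digest);
    return 0;
}
```

Hashing to a fixed-width char(32) keeps the table and its index small no matter how long the URLs get, which is presumably where the speedup comes from.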
>
> -----Original Message-----
> From: Scott Walters [mailto:scott at illogics.org]
> Sent: Thursday, December 04, 2003 4:09 PM
> To: Matt Alexander
> Subject: Re: Phoenix.pm: keep-alives with RobotUA
>
> Very generally speaking:
>
> Go under the hood and hack the feature on.
>
> I haven't used any of those modules more than a wee widdle bit,
> so I don't know how it all fits together, but one object is
> likely creating instances of others, and this argument is
> something that could be passed along to them.
>
> Sorry I don't have a better answer, short on time today =(
>
> Re: Spidering Hacks, post a mini-review on http://phoenix.pm.org under
> books =)
>
> -scott
>
>
> Matt Alexander <m at pdxlug.org> wrote:
>>
>> I'm using LWP::RobotUA, but I apparently can't pass 'keep_alive => 1'
> to
>> the constructor like I can with LWP::UserAgent or WWW::Mechanize.
> Does
>> anyone have a suggestion for how to enable keep-alives with RobotUA?
>> Thanks,
>> ~M
>> P.S. The new O'Reilly book "Spidering Hacks" is incredible.
> Definitely
>> check it out.
>>
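To answer the original question without gutting RobotUA: keep-alive in LWP is driven by a connection cache, and since LWP::RobotUA is a subclass of LWP::UserAgent you can attach one after construction instead of passing keep_alive to new(). A sketch (agent name and email are placeholders):

```perl
use strict;
use warnings;
use LWP::RobotUA;
use LWP::ConnCache;

my $ua = LWP::RobotUA->new('MyBot/1.0', 'me@example.com');

# keep_alive => 1 in LWP::UserAgent->new does essentially this
# internally: install a connection cache that keeps one
# persistent connection open for reuse.
$ua->conn_cache(LWP::ConnCache->new(total_capacity => 1));
```

This keeps all of RobotUA's politeness intact while reusing connections between requests to the same host.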
---------------------------------------------------------------------
Michael Friedman HighWire Press, Stanford Southwest
Phone: 480-456-0880 Tempe, Arizona
FAX: 270-721-8034 <friedman at highwire.stanford.edu>
---------------------------------------------------------------------