Phoenix.pm: keep-alives with RobotUA
Jacob Powers
jpowers at ccbill.com
Fri Dec 5 11:16:48 CST 2003
Mike,
I agree with you wholeheartedly about not DoS-ing a site or
running a malicious spider against any site you don't own or have
permission to crawl. However, leaving the delays and waits in does create
a *big* bottleneck in the software (upwards of 100 times slower). You can
make additional tweaks to your spider so it does not DoS a site. One
example is your type of traversal, breadth-first or depth-first. Depending
on whether you are hitting many sites or just one, you can also spread out
the load (hit one URL with a request, then hit a different URL on the next
request, and so on).
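That load-spreading idea (interleaving requests across hosts rather than hammering one server back-to-back) can be sketched roughly like this in plain Perl; the per-host queues and example hostnames are hypothetical, and a real spider would fetch each URL where the comment indicates:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-host queues of pending URLs; in a real spider
# these would be filled as links are discovered.
my %queue = (
    'site-a.example' => ['http://site-a.example/1', 'http://site-a.example/2'],
    'site-b.example' => ['http://site-b.example/1'],
);

# Round-robin: take one URL from each host in turn, so no single
# server sees two consecutive requests while others are pending.
my @order;
while (grep { @$_ } values %queue) {
    for my $host (sort keys %queue) {
        my $url = shift @{ $queue{$host} } or next;
        push @order, $url;    # a real spider would fetch $url here
    }
}
print "$_\n" for @order;
```

With the queues above, the two site-a URLs come out separated by the site-b request instead of back-to-back.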
The default parameters are too friendly, in my opinion. If I
recall correctly, RobotUA only allows 10 requests, with a one-second
delay in between each, and then requires you to sleep for a minute.
That's excessive, if you ask me.
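For what it's worth, those knobs are tunable without hacking the module: LWP::RobotUA exposes `delay` (measured in minutes; the module's documented default is 1, i.e. one request per minute) and `use_sleep`. A minimal sketch, with placeholder agent name and contact address:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# RobotUA requires an agent name and a contact address (placeholders here).
my $ua = LWP::RobotUA->new(
    agent => 'my-spider/0.1',
    from  => 'me@example.com',
);

# delay() is in MINUTES; 1/60 of a minute gives a one-second gap
# between requests instead of the default one-minute gap.
$ua->delay(1/60);

# With use_sleep(0) the agent would return a 503 response carrying a
# Retry-After header instead of sleeping; the default (1) sleeps.
$ua->use_sleep(1);

printf "delay is %.4f minutes\n", $ua->delay;
```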
I use the RobotDB to spider my own sites daily to check for
broken links or changed pages. They can handle a lot more than one hit
per second (as most professional or well-done sites can), so speed means
more to me than being nice to my server. But I completely agree with Mike
about being aware of what you are doing and who you may be affecting. As
always, be courteous if the site is not yours.
Oh, and one last thing: if you just want a local copy of a
site, use wget instead of RobotUA; it will work much better for that.
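A typical polite mirroring invocation looks something like the following (flags per the wget manual; the URL and agent string are placeholders), keeping a one-second pause between requests in the same spirit as RobotUA's delay:

```shell
# Mirror a site for local browsing: recursive + timestamping,
# rewrite links to work offline, pause 1 second between requests.
wget --mirror \
     --convert-links \
     --wait=1 \
     --user-agent='my-mirror/0.1' \
     http://example.com/
```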
Jacob Powers
-----Original Message-----
From: Michael Friedman [mailto:friedman at highwire.stanford.edu]
Sent: Thursday, December 04, 2003 11:43 PM
To: phoenix-pm-list at happyfunball.pm.org
Subject: Re: Phoenix.pm: keep-alives with RobotUA
Jacob,
It's great that you have been able to modify RobotUA to meet your
needs. I'd like to speak up, though, in favor of robot rules and
request delays.
I work for HighWire Press. We run almost 400 high-traffic websites for
medical and scientific journals, such as the New England Journal of
Medicine, Science Magazine, and the Oxford English Dictionary. We have
serious hardware, software, and bandwidth for these sites. And yet, one
robot that ignores our robot rules and requests pages faster than 1 per
second could bring down an entire site. Even very popular sites such as
these generally only handle a dozen requests per second at their
highest traffic peaks. More than that can saturate our net connections,
overwhelm the application server, and in the worst cases, crash the
machine.
Now, we've taken steps to guard against such behavior. If you hit one of
our sites that fast, we'll block you from getting any more pages within
a second. But smaller sites don't always have the ability to do that. By
requesting tons of pages as fast as possible, you can end up costing
the provider money (for "extra" bandwidth charges) or even bring down
the site as effectively as any denial-of-service attack.
So I urge caution when removing the safeguards that are built into
robot spidering software. Please don't overwhelm a site just because
you want a local copy of it. You can usually set the delay to 1 second;
it won't take *that* much longer, and you won't keep others from
using the site at the same time.
My $.02,
-- Mike
On Dec 4, 2003, at 5:14 PM, Jacob Powers wrote:
> As Scott said, you will need to break into RobotUA to add that. RobotUA
> in its default form is very weak and ridden with "friendly" rules. Here
> are a few things I have done to it.
>
> 1. Remove all robot rules. Just comment out the use of
>    RobotRules.
> 2. Remove all wait commands. In order to be more website
>    friendly it has a sleep(1), sometimes up to a minute,
>    in between each request. Very annoying if you are trying to pull
>    down many pages/sites quickly. Just set use_sleep
>    to 0 and delay to 0.
> 3. Set it up to use MySQL for the seen/used URLs instead of
>    using a hash (this gets really big really fast). What I
>    do is, in the addUrl function in RobotUA, I comment out the
>    part where it adds/makes a hash, and instead I MD5 the URL and
>    put it in a table that has nothing but a char(32) field.
>    This speeds the Robot up a lot. You also have to change the
>    part where it checks the URL, once again in the addUrl function,
>    to read from the DB instead of the hash.
>
>
> Those are just my personal tweaks to the RobotUA, or as I have renamed
> it, RobotDB. Make sure you do ample testing in various scenarios with
> your hooks in place.
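The seen-URL scheme in step 3 can be sketched roughly like this. Digest::MD5 is core Perl; the DBI calls are shown only as comments, since they assume a MySQL table (something like `CREATE TABLE seen (md5 CHAR(32) PRIMARY KEY)`) that exists only in Jacob's setup:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Reduce a URL of any length to a fixed-size 32-char hex key, so the
# "seen" table holds nothing but a single char(32) column.
sub url_key {
    my ($url) = @_;
    return md5_hex($url);
}

# Against MySQL via DBI it would look something like:
#   my $dbh = DBI->connect('dbi:mysql:spider', $user, $pass);
#   my $ins = $dbh->prepare('INSERT IGNORE INTO seen (md5) VALUES (?)');
#   my $chk = $dbh->prepare('SELECT 1 FROM seen WHERE md5 = ?');
#   $ins->execute(url_key($url));

my $key = url_key('http://example.com/');
print "$key\n";    # always 32 hex characters, whatever the URL length
```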
>
> Jacob Powers
>
> -----Original Message-----
> From: Scott Walters [mailto:scott at illogics.org]
> Sent: Thursday, December 04, 2003 4:09 PM
> To: Matt Alexander
> Subject: Re: Phoenix.pm: keep-alives with RobotUA
>
> Very generally speaking:
>
> Go under the hood and hack the feature on.
>
> I haven't used any of those modules more than a wee widdle bit
> so I don't know how it all fits together, but one object is
> likely creating instances of others, and this argument is something
> that could be passed down the chain.
>
> Sorry I don't have a better answer, short on time today =(
>
> Re: Spidering Hacks, post a mini-review on http://phoenix.pm.org under
> books =)
>
> -scott
>
>
> On 0, Matt Alexander <m at pdxlug.org> wrote:
>>
>> I'm using LWP::RobotUA, but I apparently can't pass 'keep_alive => 1'
>> to the constructor like I can with LWP::UserAgent or WWW::Mechanize.
>> Does anyone have a suggestion for how to enable keep-alives with
>> RobotUA?
>> Thanks,
>> ~M
>> P.S. The new O'Reilly book "Spidering Hacks" is incredible.
>> Definitely check it out.
>>
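One possible workaround for Matt's question, since LWP::RobotUA inherits from LWP::UserAgent: attach a connection cache after construction instead of passing keep_alive to new(). This is a sketch, not a tested patch; agent name and address are placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use LWP::ConnCache;

my $ua = LWP::RobotUA->new(
    agent => 'my-spider/0.1',     # placeholder
    from  => 'me@example.com',    # placeholder
);

# keep_alive => 1 in LWP::UserAgent->new just sets up an LWP::ConnCache
# behind the scenes; doing the same by hand gives RobotUA persistent
# connections even though its constructor won't take the option.
$ua->conn_cache(LWP::ConnCache->new(total_capacity => 1));

print "keep-alive enabled\n" if $ua->conn_cache;
```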
---------------------------------------------------------------------
Michael Friedman HighWire Press, Stanford Southwest
Phone: 480-456-0880 Tempe, Arizona
FAX: 270-721-8034 <friedman at highwire.stanford.edu>
---------------------------------------------------------------------