Phoenix.pm: keep-alives with RobotUA
Jacob Powers
jpowers at ccbill.com
Thu Dec 4 18:24:55 CST 2003
Oops, that didn't come out as pretty as it went in... oh well.
Jacob Powers
-----Original Message-----
From: Jacob Powers
Sent: Thursday, December 04, 2003 5:15 PM
To: phoenix-pm-list at happyfunball.pm.org
Subject: RE: Phoenix.pm: keep-alives with RobotUA
As Scott said, you will need to break into RobotUA to add that. RobotUA
in its default form is very weak and ridden with "friendly" rules. Here
are a few things I have done to it.
1. Remove all robot rules. Just comment out the use of
Robot Rules.
2. Remove all wait commands. In order to be more website
friendly it has a sleep(1), sometimes up to a minute,
in between each request. Very annoying if you are trying to pull
down many pages/sites quickly. Just set use_sleep
to 0 and delay to 0.
3. Set it up to use MySQL for the seen/used URLs instead of
using a hash (this gets really big really fast). What I
do is in the addUrl function, in RobotUA, I comment out the
part where it adds/makes a hash and instead I MD5 the URL and
put it in a table that has nothing but a char(32) field.
This speeds the Robot up a lot. You also have to change the
part where it checks the URL, once again in the addUrl function,
to read from the DB instead of the hash.
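For item 2, editing the module source may not be necessary: delay and use_sleep are also public accessors on LWP::RobotUA. A minimal sketch, assuming a placeholder robot name and contact address (both made up for illustration):

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder agent name and contact address; substitute your own.
my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');

# delay is expressed in minutes; 0 means no enforced pause between
# requests to the same host.
$ua->delay(0);

# With use_sleep off the agent never calls sleep() itself.
$ua->use_sleep(0);
```

This only changes the pacing. The robots.txt handling from item 1 is untouched by these two calls.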
Those are just my personal tweaks to the RobotUA, or as I have renamed
it RobotDB. Make sure you do ample testing in various scenarios with
your hooks in place.
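The seen-URL table from item 3 could look roughly like the sketch below. It is not the actual RobotDB code: the table name seen_urls, the column url_md5 and the helper add_url are hypothetical, and an in-memory SQLite database (via DBD::SQLite) stands in for MySQL so the example runs without a server. With MySQL only the connect string should need to change.

```perl
use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

# Assumption: in-memory SQLite stands in for the MySQL database.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, PrintError => 0 });

# A table with nothing but a char(32) field holding the MD5 of each URL.
$dbh->do('CREATE TABLE seen_urls (url_md5 CHAR(32) NOT NULL PRIMARY KEY)');

# Hypothetical helper: returns 1 if the URL is new (and records it),
# 0 if it has already been seen.
sub add_url {
    my ($dbh, $url) = @_;
    my $key = md5_hex($url);    # 32 hex characters

    # Check the table instead of an in-memory hash.
    my ($found) = $dbh->selectrow_array(
        'SELECT 1 FROM seen_urls WHERE url_md5 = ?', undef, $key);
    return 0 if $found;

    $dbh->do('INSERT INTO seen_urls (url_md5) VALUES (?)', undef, $key);
    return 1;
}
```

A crawler loop would call add_url($dbh, $url) before queueing each link and skip the link when it returns 0.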
Jacob Powers
-----Original Message-----
From: Scott Walters [mailto:scott at illogics.org]
Sent: Thursday, December 04, 2003 4:09 PM
To: Matt Alexander
Subject: Re: Phoenix.pm: keep-alives with RobotUA
Very generally speaking:
Go under the hood and hack the feature on.
I haven't used any of those modules more than a wee widdle bit
so I don't know how it all fits together, but one object is
likely creating instances of others and this argument is something
that could be perpetuated.
Sorry I don't have a better answer, short on time today =(
Re: Spidering Hacks, post a mini-review on http://phoenix.pm.org under
books =)
-scott
On 0, Matt Alexander <m at pdxlug.org> wrote:
>
> I'm using LWP::RobotUA, but I apparently can't pass 'keep_alive => 1' to
> the constructor like I can with LWP::UserAgent or WWW::Mechanize. Does
> anyone have a suggestion for how to enable keep-alives with RobotUA?
> Thanks,
> ~M
> P.S. The new O'Reilly book "Spidering Hacks" is incredible. Definitely
> check it out.
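One possible way to get keep-alives without patching the module: LWP::RobotUA is a subclass of LWP::UserAgent, so the conn_cache method should be available after construction. In LWP::UserAgent, keep_alive => N is shorthand for installing a connection cache with that capacity. A minimal sketch, with a placeholder robot name and contact address:

```perl
use strict;
use warnings;
use LWP::RobotUA;
use LWP::ConnCache;

# Placeholder agent name and contact address; substitute your own.
my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');

# Install a connection cache after construction. A capacity of 1
# mirrors what keep_alive => 1 does in LWP::UserAgent->new.
$ua->conn_cache(LWP::ConnCache->new(total_capacity => 1));
```

Whether this works on a given installation depends on the LWP version, so it is worth confirming with a packet trace or server logs that connections are actually reused.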