Phoenix.pm: keep-alives with RobotUA

Bill Nash billn at billn.net
Thu Dec 4 19:18:31 CST 2003


Think I spotted your problem:
X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0

}=)

- billn

On Thu, 4 Dec 2003, Jacob Powers wrote:

> Oops, that didn't come out as pretty as it went in... oh well.
>
> Jacob Powers
> -----Original Message-----
> From: Jacob Powers
> Sent: Thursday, December 04, 2003 5:15 PM
> To: phoenix-pm-list at happyfunball.pm.org
> Subject: RE: Phoenix.pm: keep-alives with RobotUA
>
> As Scott said, you will need to break into RobotUA to add that. RobotUA
> in its default form is very weak and riddled with "friendly" rules. Here
> are a few things I have done to it.
>
> 1. Remove all robot rules. Just comment out the use of Robot Rules.
> 2. Remove all wait commands. In order to be more website friendly it
>    has a sleep(1), sometimes up to a minute, in between each request.
>    Very annoying if you are trying to pull down many pages/sites
>    quickly. Just set use_sleep to 0 and delay to 0.
> 3. Set it up to use MySQL for the seen/used URLs instead of using a
>    hash (this gets really big really fast). What I do is in the addUrl
>    function, in RobotUA, I comment out the part where it adds/makes a
>    hash and instead I MD5 the URL and put it in a table that has
>    nothing but a char(32) field. This speeds the Robot up a lot. You
>    also have to change the part where it checks the URL, once again in
>    the addUrl function, to read from the DB instead of the hash.
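
For item 2, a minimal sketch of switching off the waits through the accessors instead of editing the module, assuming a stock LWP::RobotUA. The agent name and contact address are placeholders (the constructor requires both), and delay is given in minutes.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder agent name and contact address; RobotUA requires both.
my $ua = LWP::RobotUA->new('example-bot/0.1', 'bot@example.com');

$ua->delay(0);      # minutes to wait between requests to the same host
$ua->use_sleep(0);  # do not sleep() inside the agent when a wait is due

printf "delay=%s use_sleep=%s\n", $ua->delay, $ua->use_sleep;
```

This only touches the waiting behaviour; robots.txt handling is left as it is.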
>
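
The seen-URL table from item 3 could look roughly like the following. This is a sketch and not the actual RobotDB code: the table name seen_urls, the column url_md5, and the helpers url_key and add_url are all hypothetical, and $dbh is assumed to be a connected DBI handle to a MySQL database.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical schema (MySQL):
#   CREATE TABLE seen_urls (url_md5 CHAR(32) NOT NULL PRIMARY KEY);

# 32-character hex digest used as the table key.
sub url_key {
    my ($url) = @_;
    return md5_hex($url);
}

# Returns true the first time a URL is offered, false if it was
# already stored. $dbh is assumed to be a connected DBI handle.
sub add_url {
    my ($dbh, $url) = @_;
    my $rows = $dbh->do(
        'INSERT IGNORE INTO seen_urls (url_md5) VALUES (?)',
        undef, url_key($url),
    );
    # do() returns "0E0" (zero rows) when the key already existed.
    return defined $rows && $rows == 1;
}
```

Letting the primary key reject duplicates means the check and the insert happen in one statement, so no separate SELECT is needed before adding a URL.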
>
> Those are just my personal tweaks to the RobotUA, or as I have renamed
> it RobotDB. Make sure you do ample testing in various scenarios with
> your hooks in place.
>
> Jacob Powers
>
> -----Original Message-----
> From: Scott Walters [mailto:scott at illogics.org]
> Sent: Thursday, December 04, 2003 4:09 PM
> To: Matt Alexander
> Subject: Re: Phoenix.pm: keep-alives with RobotUA
>
> Very generally speaking:
>
> Go under the hood and hack the feature on.
>
> I haven't used any of those modules more than a wee widdle bit
> so I don't know how it all fits together, but one object is
> likely creating instances of others and this argument is something
> that could be perpetuated.
>
> Sorry I don't have a better answer, short on time today =(
>
> Re: Spidering Hacks, post a mini-review on http://phoenix.pm.org under
> books =)
>
> -scott
>
>
> On  0, Matt Alexander <m at pdxlug.org> wrote:
> >
> > I'm using LWP::RobotUA, but I apparently can't pass 'keep_alive => 1' to
> > the constructor like I can with LWP::UserAgent or WWW::Mechanize. Does
> > anyone have a suggestion for how to enable keep-alives with RobotUA?
> > Thanks,
> > ~M
> > P.S.  The new O'Reilly book "Spidering Hacks" is incredible. Definitely
> > check it out.
>
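
On the original question, one possibility that avoids patching anything: LWP::RobotUA is a subclass of LWP::UserAgent, so a connection cache can be attached after construction with conn_cache. A minimal sketch, with a placeholder agent name and contact address:

```perl
use strict;
use warnings;
use LWP::RobotUA;
use LWP::ConnCache;

# Placeholder agent name and contact address; RobotUA requires both.
my $ua = LWP::RobotUA->new('example-bot/0.1', 'bot@example.com');

# In LWP::UserAgent, keep_alive => 1 sets up a connection cache with a
# total capacity of 1; this attaches the same kind of cache by hand.
$ua->conn_cache(LWP::ConnCache->new(total_capacity => 1));
```

Whether connections are actually reused still depends on the server honouring keep-alive.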



