There are a few out there nowadays. I've had some success in the past with Nutch (http://lucene.apache.org/nutch/); it plays pretty well with other Apache projects, too. Heritrix (http://crawler.archive.org/) is another semi-popular one. It snags more than text, which may be of use. I know there are plenty of others, but these two shone in their particular niches the last time I researched this (which was a few years ago now).
 - Trevor

On Thu, Apr 22, 2010 at 2:08 PM, Michael R. Wolf <MichaelRWolf@att.net> wrote:
I let myself get sucker punched! I wrote my own web crawler based on
WWW::Mechanize because my preliminary research indicated that crawlers
were simple.

[Aside: If you're tempted to do that, let me save you some time.
Don't. They are *conceptually* simple (GET page, push links on queue,
iterate), but there are many levels of devils lurking in the details.]

Having finished phase 1 (way behind schedule and way over budget), I'm
looking for a better web crawler solution for phase 2.

Suggestions?

Thanks, Michael

P.S. Even if you told me *personally* at the previous SPUG
meeting, please post your suggestion here so that others can learn via
the email list and via search engines. Thanks.

P.P.S. [Stop reading here unless you're interested in the
nitty-gritty details of how I let myself get sucker punched.]

OK... since you're interested, here's a short description of my long
"journey", in the hope that it will help someone else.

I'm now at the end of phase 1, looking to start a new phase. I can
see that my proof-of-concept crawler worked, but I can also see that
it's a bad business and technical decision to continue investing in
what got me here.

I *significantly* underestimated the complexity of a production web
crawler, and the development time it would take to
create/test/debug/maintain it.

In my defense, I did a bit of research, and all the papers I could
find said that crawling was simple (the sketch after this list is
about all they describe):
* initialize a queue with some seed URLs
* GET the next page from the queue
* extract the links
* add the links to the queue
* [process the page for other information]
* loop until the queue is empty
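
In code, that naive loop is roughly this (just a sketch, not the
crawler I built: the seed URL is a placeholder, errors are simply
skipped, and the only bookkeeping is a %seen hash):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # The "simple" crawl loop, as the papers describe it.
    my @queue = ('http://www.example.com/');    # seed URL (placeholder)
    my %seen;                                   # raw URL => already fetched
    my $mech  = WWW::Mechanize->new( autocheck => 0 );  # don't die on bad GETs

    while ( my $url = shift @queue ) {
        next if $seen{$url}++;                  # skip pages we've already seen
        $mech->get($url);
        next unless $mech->success && $mech->is_html;

        # [process the page for other information]

        for my $link ( $mech->links ) {
            my $abs = $link->url_abs;           # absolute URI for this link
            next unless defined $abs && $abs->scheme =~ /^https?$/;
            push @queue, $abs->as_string;
        }
    }

That really is the whole algorithm as the papers present it, which is
exactly why it looks so temptingly simple.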

So, being a bit familiar with WWW::Mechanize, I built a simple crawler
around it. Then I added more features as I bumped up against the need
for them, then added more features, then added new features...

Mech is a great module (thanks, Andy)! It's well documented, well
designed, and does what it says it does.

But... I eventually realized that Mech is an automated browser, and an
automated browser is *not* the same thing as a crawler. For *simple*
cases, that distinction is not critical, but for the kind of
application I eventually need, the distinction is a deal-breaker.

I even asked Andy if anyone had taken Mech to the next level to create
a crawler. He said that he gets that question a lot, but doesn't know
of a project that's created a crawler framework.

Here's a brief requirement list for what a crawler needs to do beyond
what Mech does (i.e. code that I wrote, or need to write); a sketch of
the first two items follows the list:
* be polite (respect robots.txt, throttle the page retrieval rate to
prevent overloading my machine and the server machine)
* have a revisit policy to refresh links (based on cost, expected
benefit, anticipated expiration, actual expiration...)
* cache results to prevent expensive re-access of unexpired content
* prevent circular (i.e. infinite) crawls
* avoid useless content (for many definitions of useless)
* recognize non-canonical duplicates of a canonical URL
* keep the crawler on the same site (or virtualized duplicate servers)
* monitor/administer long (multi-day) processes (stop, start,
pause, continue, recover, monitor, search logs, debug...)
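
To make the first two bullets concrete, here's the flavor of glue they
imply, using WWW::RobotRules and URI from CPAN (again a sketch, not my
production code; the agent string, the one-second delay, and the
per-host hashes are illustrative):

    use strict;
    use warnings;
    use WWW::Mechanize;
    use WWW::RobotRules;
    use URI;
    use Time::HiRes qw(sleep time);

    my $agent = 'ExampleCrawler/0.1';           # placeholder user-agent
    my $delay = 1.0;                            # seconds between hits per host
    my $rules = WWW::RobotRules->new($agent);
    my $mech  = WWW::Mechanize->new( agent => $agent, autocheck => 0 );

    my %robots_fetched;                         # host => robots.txt parsed
    my %last_hit;                               # host => time of last request
    my %seen;                                   # canonical URL => crawled

    # Fetch a URL politely: canonicalize it, skip duplicates, honor
    # robots.txt, and throttle requests per host.
    sub polite_get {
        my ($url) = @_;
        my $uri = URI->new($url)->canonical;    # normalize case, port, escapes
        return unless ( $uri->scheme // '' ) =~ /^https?$/;
        return if $seen{ $uri->as_string }++;   # non-canonical duplicate

        my $host = $uri->host;
        unless ( $robots_fetched{$host}++ ) {
            my $robots_url = $uri->scheme . '://' . $uri->host_port . '/robots.txt';
            $mech->get($robots_url);
            $rules->parse( $robots_url, $mech->content ) if $mech->success;
        }
        return unless $rules->allowed( $uri->as_string );

        # Sleep off whatever remains of the per-host delay.
        my $wait = ( $last_hit{$host} // 0 ) + $delay - time();
        sleep($wait) if $wait > 0;
        $last_hit{$host} = time();

        $mech->get( $uri->as_string );
        return $mech->success ? $mech : undef;
    }

And that covers only the first two bullets; the revisit policy, the
cache, and the monitoring are each more code again.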

It's a big job to create a whole crawling environment and all the
production support ecosystem around it. It's not as big as Google,
since I'll only be looking at a few hundred sites, but it's a *much*
bigger problem than I want to tackle with WWW::Mechanize.

-- 
Michael R. Wolf
All mammals learn by playing!
MichaelRWolf@att.net

_____________________________________________________________
Seattle Perl Users Group Mailing List
POST TO: spug-list@pm.org
SUBSCRIPTION: http://mail.pm.org/mailman/listinfo/spug-list
MEETINGS: 3rd Tuesdays
WEB PAGE: http://seattleperl.org/