SPUG: Suggestions for web crawler framework, toolkit, or reusable pieces?

Trevor Hall hallta at gmail.com
Thu Apr 22 14:43:36 PDT 2010


There are a few out there nowadays.  I've had some success in the past
with Nutch (http://lucene.apache.org/nutch/).  Plays pretty well with other
Apache projects, too.  Heritrix (http://crawler.archive.org/) is another
semi-popular one.  Snags more than text, which may be of use.  I know there
are plenty of others, but these two shone in their particular niches the
last time I researched it (which was a few years ago now).

 - Trevor

On Thu, Apr 22, 2010 at 2:08 PM, Michael R. Wolf <MichaelRWolf at att.net> wrote:

> I let myself get sucker punched!  I wrote my own web crawler based on
> WWW::Mechanize because my preliminary research indicated that crawlers
> were simple.
>
> [Aside: If you're tempted to do that, let me save you some time.
> Don't.  They are *conceptually* simple (GET page, push links on queue,
> iterate), but there are many levels of devils lurking in the details.]
>
> Having finished phase 1 (way behind schedule and way over budget), I'm
> looking for a better web crawler solution for phase 2.
>
> Suggestions?
>
> Thanks, Michael
>
> P.S.  Even if you told me *personally* at the previous SPUG
> meeting, please post your suggestion here so that others can learn via
> the email list and via search engines.  Thanks.
>
> P.P.S.  [Stop reading here unless you're interested in the
> nitty-gritty details of how I let myself get sucker punched.]
>
> OK... since you're interested, here's a short description of my long
> "journey", in the hope that it will help someone else.
>
> I'm now at the end of phase 1, looking to start a new phase.  I can
> see that my proof-of-concept crawler worked, but I can also see that
> it's a bad business and technical decision to continue investing in
> what got me here.
>
> I *significantly* underestimated the complexity of a production web
> crawler, and the development time it would take to
> create/test/debug/maintain it.
>
> In my defense, I did a bit of research, and all the papers I could
> find said that crawling was simple (a minimal sketch follows the list):
>  * initialize a queue with some seed URLs
>  * GET the next page from the queue
>  * extract the links
>  * add the links to the queue
>  * [process the page for other information]
>  * loop until the queue is empty
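>
> A minimal sketch of that loop in Perl, built around WWW::Mechanize (the
> seed URL and the %seen hash below are placeholders, and it handles none
> of the politeness or revisit issues I get into later):
>
>     #!/usr/bin/perl
>     use strict;
>     use warnings;
>     use WWW::Mechanize;
>     use URI;
>
>     my $mech  = WWW::Mechanize->new( autocheck => 0 );
>     my @queue = ('http://example.com/');   # placeholder seed URL
>     my %seen;
>
>     while ( my $url = shift @queue ) {
>         next if $seen{$url}++;             # skip already-fetched URLs
>         $mech->get($url);
>         next unless $mech->success && $mech->is_html;
>
>         # [process the page for other information] goes here
>
>         for my $link ( $mech->links ) {
>             my $abs = URI->new( $link->url_abs )->canonical;
>             $abs->fragment(undef);         # drop #fragments before queueing
>             push @queue, "$abs" unless $seen{"$abs"};
>         }
>     }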
>
> So, being a bit familiar with WWW::Mechanize, I built a simple crawler
> around it.  Then I added more features as I bumped up against them, then
> added more features, then added new features...
>
> Mech is a great module (thanks, Andy)!  It's well documented, well
> designed, and does what it says it does.
>
> But... I eventually realized that Mech is an automated browser, and an
> automated browser is *not* the same thing as a crawler.  For *simple*
> cases, that distinction is not critical, but for the kind of
> application I eventually need, the distinction is a deal-breaker.
>
> I even asked Andy if anyone had taken Mech to the next level to create
> a crawler.  He said that he gets that question a lot, but doesn't know
> of a project that's created a crawler framework.
>
> Here's a brief list of requirements for what a crawler needs to do
> (i.e. code that I wrote, or still need to write) beyond what Mech does
> (a sketch of the "be polite" and "same site" items follows the list):
>  * be polite (respect robots.txt, throttle the page-retrieval rate to
>  avoid overloading my machine and the target server)
>  * have a revisit policy to refresh links (based on cost, expected
>  benefit, anticipated expiration, actual expiration...)
>  * cache results to prevent expensive re-access for unexpired content
>  * prevent circular (i.e. infinite) crawls
>  * avoid useless content (for many definitions of useless)
>  * recognize non-canonical duplicates of a canonical URL
>  * keep the crawler on the same site (or on virtualized duplicate servers)
>  * monitor/administer long (multi-day) processes (stop, start,
>  pause, continue, recover, monitor, search logs, debug...)
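>
> As a rough illustration of just the "be polite" and "same site" items,
> LWP::RobotUA (part of libwww-perl) already handles robots.txt and a
> per-host delay; the bot name, contact address, and host list below are
> placeholders:
>
>     #!/usr/bin/perl
>     use strict;
>     use warnings;
>     use LWP::RobotUA;
>     use URI;
>
>     my $ua = LWP::RobotUA->new(
>         agent => 'spug-crawler/0.1',           # placeholder bot name
>         from  => 'crawler-admin@example.com',  # placeholder contact address
>     );
>     $ua->delay( 10 / 60 );   # delay() takes minutes: ~10s between same-host hits
>
>     # Only crawl hosts on this (placeholder) whitelist.
>     my %allowed_host = map { $_ => 1 } qw( example.com www.example.com );
>
>     sub polite_get {
>         my ($url) = @_;
>         my $uri = URI->new($url)->canonical;
>         return unless $uri->scheme && $uri->scheme =~ /^https?$/;
>         return unless $allowed_host{ lc $uri->host };   # stay on our sites
>         my $resp = $ua->get($uri);    # RobotUA consults robots.txt itself
>         return $resp->is_success ? $resp : undef;
>     }
>
> Everything else on the list (revisit policy, caching, duplicate
> detection, monitoring) still has to live somewhere above that layer.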
>
> It's a big job to create a whole crawling environment and all the
> production-support ecosystem around it.  It's not as big a job as Google's,
> since I'll only be looking at a few hundred sites, but it's a *much*
> bigger problem than I want to create for myself with WWW::Mechanize.
>
> --
> Michael R. Wolf
>    All mammals learn by playing!
>        MichaelRWolf at att.net
>
> _____________________________________________________________
> Seattle Perl Users Group Mailing List
>    POST TO: spug-list at pm.org
> SUBSCRIPTION: http://mail.pm.org/mailman/listinfo/spug-list
>   MEETINGS: 3rd Tuesdays
>   WEB PAGE: http://seattleperl.org/
>