SPUG: Suggestions for web crawler framework, toolkit, or reusable pieces?

Michael R. Wolf MichaelRWolf at att.net
Thu Apr 22 14:08:29 PDT 2010


I let myself get sucker punched!  I wrote my own web crawler based on
WWW::Mechanize because my preliminary research indicated that crawlers
were simple.

[Aside: If you're tempted to do that, let me save you some time.
Don't.  They are *conceptually* simple (GET page, push links on queue,
iterate), but there are many levels of devils lurking in the details.]

Having finished phase 1 (way behind schedule and way over budget), I'm
looking for a better web crawler solution for phase 2.

Suggestions?

Thanks, Michael

P.S.  Even if you told me *personally* at the previous SPUG
meeting, please post your suggestion here so that others can learn via
the email list and via search engines.  Thanks.

P.P.S.  [Stop reading here unless you're interested in the
nitty-gritty details of how I let myself get sucker punched.]

OK... since you're interested, here's a short description of my long
"journey", in the hope that it will help someone else.

I'm now at the end of phase 1, looking to start a new phase.  I can
see that my proof-of-concept crawler worked, but I can also see that
it's a bad business and technical decision to continue investing in
what got me here.

I *significantly* underestimated the complexity of a production web
crawler, and the development time it would take to
create/test/debug/maintain it.

In my defense, I did a bit of research, and all the papers I could
find said that crawling was simple:
  * initialize a queue with some seed URLs
  * GET the next page from the queue
  * extract the links
  * add the links to the queue
  * [process the page for other information]
  * loop until the queue is empty

So, being a bit familiar with WWW::Mechanize, I built a simple crawler
around it.  Then I added more features as I bumped up against them,
then more features, then still more features...
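
For the curious, here's roughly what that textbook loop looks like
wrapped around Mech.  This is a stripped-down sketch, not my actual
code -- the seed URL is a placeholder, and it does none of the things
in the requirement list further down:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # The "textbook" crawl loop: GET page, push links, iterate.
    my @queue = ('http://example.com/');   # placeholder seed URL
    my %seen;                              # naive duplicate suppression

    my $mech = WWW::Mechanize->new( autocheck => 0 );

    while ( my $url = shift @queue ) {
        next if $seen{$url}++;

        $mech->get($url);
        next unless $mech->success && $mech->is_html;

        # [process the page for other information here]

        push @queue, map { $_->url_abs->as_string } $mech->links;
    }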

Mech is a great module (thanks, Andy)!  It's well documented, well
designed, and does what it says it does.

But... I eventually realized that Mech is an automated browser, and an
automated browser is *not* the same thing as a crawler.  For *simple*
cases, that distinction is not critical, but for the kind of
application I eventually need, the distinction is a deal-breaker.

I even asked Andy if anyone had taken Mech to the next level to create
a crawler.  He said that he gets that question a lot, but doesn't know
of a project that's created a crawler framework.

Here's a brief list of requirements for what a crawler needs to do
(i.e. code that I wrote, or still need to write) beyond what Mech
does; a small sketch of a couple of them follows the list:
  * be polite (respect robots.txt, throttle the page retrieval rate to
    avoid overloading both my machine and the server)
  * have a revisit policy to refresh links (based on cost, expected
    benefit, anticipated expiration, actual expiration...)
  * cache results to prevent expensive re-fetching of unexpired content
  * prevent circular (i.e. infinite) crawls
  * avoid useless content (for many definitions of useless)
  * recognize non-canonical duplicates of a canonical URL
  * keep the crawler on the same site (or virtualized duplicate servers)
  * monitor/administer long (multi-day) processes (stop, start,
    pause, continue, recover, monitor, search logs, debug...)
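
To make the first and last few bullets concrete, here's the shape of
the politeness and canonicalization checks.  Again, only a sketch --
the agent string and seed URL are made up, the one-second throttle is
arbitrary, and a real crawler needs per-host robots.txt handling and
per-host delays:

    use strict;
    use warnings;
    use URI;
    use WWW::RobotRules;
    use LWP::Simple qw(get);

    my $agent = 'ExampleCrawler/0.1';              # made-up agent string
    my $rules = WWW::RobotRules->new($agent);
    my $seed  = URI->new('http://example.com/');   # placeholder seed

    # Fetch and parse robots.txt once for the seed's host.
    my $robots_url = URI->new_abs('/robots.txt', $seed)->as_string;
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    # Decide whether a discovered link is worth queueing, and return
    # its canonical form so duplicates collapse to one %seen key.
    sub want_url {
        my ($url) = @_;
        my $uri = URI->new_abs($url, $seed)->canonical;
        return unless $uri->scheme && $uri->scheme =~ /^https?$/;
        return unless $uri->host eq $seed->host;        # stay on-site
        return unless $rules->allowed($uri->as_string); # obey robots.txt
        return $uri->as_string;
    }

    # In the fetch loop, pause between requests so neither end is
    # hammered:
    #     sleep 1;    # arbitrary throttle; tune per site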

It's a big job to create a whole crawling environment and the
production-support ecosystem around it.  It's not as big as Google,
since I'll only be looking at a few hundred sites, but it's a *much*
bigger problem than I want to keep building myself on top of
WWW::Mechanize.

-- 
Michael R. Wolf
     All mammals learn by playing!
         MichaelRWolf at att.net



