There are a few out there nowadays. I've had some success in the past with Nutch (http://lucene.apache.org/nutch/); it plays pretty well with other Apache projects, too. Heritrix (http://crawler.archive.org/) is another semi-popular one. It snags more than text, which may be of use. I know there are plenty of others, but these two shone in their particular niches the last time I researched this (which was a few years ago now).
 - Trevor

On Thu, Apr 22, 2010 at 2:08 PM, Michael R. Wolf <MichaelRWolf@att.net> wrote:
I let myself get sucker punched! I wrote my own web crawler based on
WWW::Mechanize because my preliminary research indicated that crawlers
were simple.

[Aside: If you're tempted to do that, let me save you some time.
Don't. They are *conceptually* simple (GET page, push links on queue,
iterate), but there are many levels of devils lurking in the details.]

Having finished phase 1 (way behind schedule and way over budget), I'm
looking for a better web crawler solution for phase 2.

Suggestions?

Thanks, Michael

P.S. Even if you told me *personally* at the previous SPUG
meeting, please post your suggestion here so that others can learn via
the email list and via search engines. Thanks.

P.P.S. [Stop reading here unless you're interested in the
nitty-gritty details of how I let myself get sucker punched.]

OK... since you're interested, here's a short description of my long
"journey", in the hope that it will help someone else.

I'm now at the end of phase 1, looking to start a new phase. I can
see that my proof-of-concept crawler worked, but I can also see that
it's a bad business and technical decision to continue investing in
what got me here.

I *significantly* underestimated the complexity of a production web
crawler, and the development time it would take to
create/test/debug/maintain it.

In my defense, I did a bit of research, and all the papers I could
find said that crawling was simple (the sketch after this list is
about all they describe):
* initialize a queue with some seed URLs
* GET the next page from the queue
* extract the links
* add the links to the queue
* [process the page for other information]
* loop until the queue is empty
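
In code, that naive loop is roughly this (just a sketch, not the
crawler I built: the seed URL is a placeholder, errors are simply
skipped, and the only bookkeeping is a %seen hash):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # The "simple" crawl loop, as the papers describe it.
    my @queue = ('http://www.example.com/');    # seed URL (placeholder)
    my %seen;                                   # raw URL => already fetched
    my $mech  = WWW::Mechanize->new( autocheck => 0 );  # don't die on bad GETs

    while ( my $url = shift @queue ) {
        next if $seen{$url}++;                  # skip pages we've already seen
        $mech->get($url);
        next unless $mech->success && $mech->is_html;

        # [process the page for other information]

        for my $link ( $mech->links ) {
            my $abs = $link->url_abs;           # absolute URI for this link
            next unless defined $abs && $abs->scheme =~ /^https?$/;
            push @queue, $abs->as_string;
        }
    }

That really is the whole algorithm as the papers present it, which is
exactly why it looks so temptingly simple.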

So, being a bit familiar with WWW::Mechanize, I built a simple crawler
around it. Then I added more features as I bumped up against the need
for them, then added more features, then added new features...

Mech is a great module (thanks, Andy)! It's well documented, well
designed, and does what it says it does.

But... I eventually realized that Mech is an automated browser, and an
automated browser is *not* the same thing as a crawler. For *simple*
cases, that distinction is not critical, but for the kind of
application I eventually need, the distinction is a deal-breaker.

I even asked Andy if anyone had taken Mech to the next level to create
a crawler. He said that he gets that question a lot, but doesn't know
of a project that's created a crawler framework.

Here's a brief requirement list for what a crawler needs to do beyond
what Mech does (i.e. code that I wrote, or need to write); a sketch of
the first two items follows the list:
* be polite (respect robots.txt, throttle the page retrieval rate to
prevent overloading my machine and the server machine)
* have a revisit policy to refresh links (based on cost, expected
benefit, anticipated expiration, actual expiration...)
* cache results to prevent expensive re-access of unexpired content
* prevent circular (i.e. infinite) crawls
* avoid useless content (for many definitions of useless)
* recognize non-canonical duplicates of a canonical URL
* keep the crawler on the same site (or virtualized duplicate servers)
* monitor/administer long (multi-day) processes (stop, start,
pause, continue, recover, monitor, search logs, debug...)
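
To make the first two bullets concrete, here's the flavor of glue they
imply, using WWW::RobotRules and URI from CPAN (again a sketch, not my
production code; the agent string, the one-second delay, and the
per-host hashes are illustrative):

    use strict;
    use warnings;
    use WWW::Mechanize;
    use WWW::RobotRules;
    use URI;
    use Time::HiRes qw(sleep time);

    my $agent = 'ExampleCrawler/0.1';           # placeholder user-agent
    my $delay = 1.0;                            # seconds between hits per host
    my $rules = WWW::RobotRules->new($agent);
    my $mech  = WWW::Mechanize->new( agent => $agent, autocheck => 0 );

    my %robots_fetched;                         # host => robots.txt parsed
    my %last_hit;                               # host => time of last request
    my %seen;                                   # canonical URL => crawled

    # Fetch a URL politely: canonicalize it, skip duplicates, honor
    # robots.txt, and throttle requests per host.
    sub polite_get {
        my ($url) = @_;
        my $uri = URI->new($url)->canonical;    # normalize case, port, escapes
        return unless ( $uri->scheme // '' ) =~ /^https?$/;
        return if $seen{ $uri->as_string }++;   # non-canonical duplicate

        my $host = $uri->host;
        unless ( $robots_fetched{$host}++ ) {
            my $robots_url = $uri->scheme . '://' . $uri->host_port . '/robots.txt';
            $mech->get($robots_url);
            $rules->parse( $robots_url, $mech->content ) if $mech->success;
        }
        return unless $rules->allowed( $uri->as_string );

        # Sleep off whatever remains of the per-host delay.
        my $wait = ( $last_hit{$host} // 0 ) + $delay - time();
        sleep($wait) if $wait > 0;
        $last_hit{$host} = time();

        $mech->get( $uri->as_string );
        return $mech->success ? $mech : undef;
    }

And that covers only the first two bullets; the revisit policy, the
cache, and the monitoring are each more code again.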

It's a big job to create a whole crawling environment and all the
production support ecosystem around it. It's not as big as Google,
since I'll only be looking at a few hundred sites, but it's a *much*
bigger problem than I want to tackle with WWW::Mechanize.

-- 
Michael R. Wolf
All mammals learn by playing!
MichaelRWolf@att.net

_____________________________________________________________
Seattle Perl Users Group Mailing List
POST TO: spug-list@pm.org
SUBSCRIPTION: http://mail.pm.org/mailman/listinfo/spug-list
MEETINGS: 3rd Tuesdays
WEB PAGE: http://seattleperl.org/