SPUG: Suggestions for web crawler framework, toolkit, or reusable pieces?
Fred Morris
m3047 at m3047.net
Fri Apr 23 00:36:12 PDT 2010
On Thursday 22 April 2010 15:30, Colin Meyer wrote:
> [...]
> . endless graph detection (you've seen Fred's roach motel california
> for crawlers, or whatever he calls it)
'Bot Motel. Thanks. People I work with now have taken it to further
extremes... not on company time, but this seems to appeal to the rank and
file.
Do I need to give a talk on it?
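For anyone who hasn't seen the 'Bot Motel: the crawler-side defense is some form of endless-graph detection. A minimal sketch of one common heuristic, in Python since the thread already straddles Perl and Python; the function name and thresholds here are my own illustration, not anything from the 'Bot Motel itself:

```python
from urllib.parse import urlparse

def looks_like_trap(url, max_depth=16, max_repeats=3):
    """Heuristic: flag URLs whose path is suspiciously deep, or whose
    path segments repeat, a common symptom of a crawler trap that
    generates an endless graph of links."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # A single segment occurring over and over suggests a loop.
    return any(segments.count(s) > max_repeats for s in set(segments))
```

Real crawlers layer several such checks (URL count per host, content fingerprinting, etc.); this only catches the path-loop variety.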
And that's just the legitimate (?) defenses. Once you get into things like
wildcarded domains or paths, the complexity rises. A lot of the wildcarding
is legit, too... or sort of. It gets infinitely more complex when your
caching DNS server serves up default answers for names that don't exist.
Verisign wanted to do this (Site Finder) and was spanked, but I can tell you
that any number of small ISPs will do this.
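One way a crawler can tell whether it's talking to a wildcarded zone (or a resolver rewriting NXDOMAIN) is to probe a random, almost-certainly-nonexistent label and see if it resolves. A sketch in Python, again my own illustration; the resolver is injectable so you can swap in whatever your crawler actually uses:

```python
import random
import socket
import string

def has_wildcard(domain, resolve=socket.gethostbyname):
    """Probe a random label under the domain. If it resolves anyway,
    either the zone is wildcarded or something upstream is serving
    up defaults for nonexistent names."""
    label = "".join(random.choices(string.ascii_lowercase, k=24))
    try:
        resolve(label + "." + domain)
        return True
    except socket.gaierror:
        return False
```

A 24-character random label could in principle exist, so a paranoid crawler probes two or three different labels before concluding anything.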
Then again, there is bone strange. Have you tried this?
dig www.facebook.com ns
How about this?
dig www.facebook.com soa
Why? On all of the gods' and goddesses' green or blasted earth... why?
We seem to spend a lot of time on DNS, these days.
Without some notion of "a framework for what, exactly?" I don't know what to
say. I've got a perfectly good CD-creator crawler (in Perl) for a particular
Zope (Python) photo-archiving scaffold that I created.
Michael: are you crawling the wild, or the tame?
--
Fred