SPUG: Suggestions for web crawler framework, toolkit, or reusable pieces?

Fred Morris m3047 at m3047.net
Fri Apr 23 00:36:12 PDT 2010


On Thursday 22 April 2010 15:30, Colin Meyer wrote:
> [...]
>  . endless graph detection (you've seen Fred's roach motel california
>    for crawlers, or whatever he calls it)

'Bot Motel. Thanks. People I work with now have taken it to further 
extremes... not on company time, but this seems to appeal to the rank and 
file.

Do I need to give a talk on it?

And that's just the legitimate (?) defenses. You get into things like 
wildcarded domains or paths and the complexity rises. A lot of the 
wildcarding is legit, too... or sort of. It gets infinitely more complex when 
your caching DNS server serves defaults. Verisign wanted to do this and was 
spanked, but I can tell you that any number of small ISPs will do this.
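One defensive trick a crawler can use against both wildcard zones and resolvers that serve defaults is to probe with a name that can't legitimately exist. A minimal sketch in Python (the function names and the 12-character probe length are my own choices, not anything from this thread):

```python
import random
import socket
import string

def random_label(length=12):
    """Generate a label that almost certainly isn't a real hostname."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

def looks_wildcarded(domain):
    """Heuristic: if a made-up subdomain still resolves, the zone (or
    your resolver) is handing out wildcard/default answers, so a
    per-host 'does this name exist?' check tells you nothing."""
    probe = random_label() + "." + domain
    try:
        socket.gethostbyname(probe)
        return True   # bogus name resolved: wildcard or NXDOMAIN rewriting
    except socket.gaierror:
        return False  # NXDOMAIN as expected: no wildcard in play
```

If `looks_wildcarded()` comes back true, every URL under that domain will "exist" as far as DNS is concerned, and the crawler has to fall back on content-level checks instead.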

Then again, there is bone strange. Have you tried this?

dig www.facebook.com ns

How about this?

dig www.facebook.com soa

Why? On all of the gods' and goddesses' green or blasted earth... why?

We seem to spend a lot of time on DNS, these days.


Without some notion of "framework for what, exactly?" I don't know what to 
say. I've got a perfectly good CD creator crawler (in Perl) for a particular 
Zope (Python) photo archiving scaffold that I created.

Michael: are you crawling the wild, or the tame?

--

Fred
