SPUG: Suggestions for web crawler framework, toolkit, or reusable pieces?

Fred Morris m3047 at m3047.net
Fri Apr 23 00:49:29 PDT 2010


On Thursday 22 April 2010 15:30, Colin Meyer wrote:
> My trouble with heritrix is that it is geared for deep, archival
> crawling (it's what crawls for the wayback machine), where I wanted
> shallow, iterative crawling, to find new content. [...]

There is a whole mathematics to heuristics. I have Judea Pearl's _Heuristics_ 
on my bookshelf. Honestly though, nobody cares.

It's really not that complicated once you put it on a mathematical footing. 
I doubt that a lot of what people are presently trying to do was envisioned 
when this book was written; but to get it on a mathematical footing, here's 
how. (Spoiler: no code.)

That's the (meta) problem: nobody cares about making it less complicated, 
they want to solve their little frog pond problem NOW!

I don't know what Michael is trying to do.

--

Fred
