SPUG: Suggestions for web crawler framework, toolkit, or reusable pieces?
Fred Morris
m3047 at m3047.net
Fri Apr 23 00:49:29 PDT 2010
On Thursday 22 April 2010 15:30, Colin Meyer wrote:
> My trouble with heritrix is that it is geared for deep, archival
> crawling (it's what crawls for the wayback machine), where I wanted
> shallow, iterative crawling, to find new content. [...]
There is a whole mathematics to heuristics. I have Judea Pearl's _Heuristics_
on my bookshelf. Honestly though, nobody cares.
It's really not that complicated once you put it on a mathematical footing.
I doubt that much of what people are presently trying to do was envisioned
when this book was written; but if you want to get it on a mathematical
footing, the book shows how. (Spoiler: no code.)
That's the (meta) problem: nobody cares about making it less complicated;
they want to solve their little frog pond problem NOW!
I don't know what Michael is trying to do.
--
Fred