[sf-perl] Testing a web crawler
Michael Friedman
friedman at highwire.stanford.edu
Sun Dec 30 14:47:33 PST 2007
I don't know about the optimization tests, but negative tests pretty
much require "outside" knowledge. You need to know things that the
software doesn't, so that you can predict which data should not match.

For example, you could use some other search engine to gather results
and then pick some of the relevant results from there that aren't in
your own search engine. Or, if you are using a limited dataset (which
you should be, for repeatable unit tests), you can intentionally place
files in there that are "close but don't satisfy the current algorithm".
Then the test checks to make sure those files don't appear in search
results. If the algorithm changes later and the files that you knew
shouldn't appear suddenly do, then you know you've become too
inclusive and need to ratchet back.
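
A minimal sketch of that kind of negative test, in Python for brevity.
Everything named here is made up for illustration: `search` is a toy
stand-in for the engine under test (a document matches only if it
contains every query term as a whole word), and the corpus, document
IDs, and query are hypothetical fixtures.

```python
def search(corpus, query):
    """Toy stand-in for the search engine under test (an assumption)."""
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in corpus.items()
        if all(term in text.lower().split() for term in terms)
    )


# Small, fixed corpus so the test is repeatable.
CORPUS = {
    "hit-1": "perl web crawler tutorial",
    "hit-2": "writing a web crawler in perl",
    # Near misses: close to the query, but they should not match
    # under the current rules.
    "near-1": "perl web crawlers compared",  # plural, not the exact term
    "near-2": "web crawler tutorial",        # missing "perl"
    # Noise: should never match.
    "noise-1": "sourdough bread recipe",
}

EXPECTED_HITS = {"hit-1", "hit-2"}
MUST_NOT_APPEAR = {"near-1", "near-2", "noise-1"}


def check_negative_cases(corpus, query):
    """Return (leaked, missed) document IDs for the given query."""
    results = set(search(corpus, query))
    leaked = results & MUST_NOT_APPEAR  # too inclusive
    missed = EXPECTED_HITS - results    # too strict
    return leaked, missed


if __name__ == "__main__":
    leaked, missed = check_negative_cases(CORPUS, "perl web crawler")
    print("leaked:", sorted(leaked))
    print("missed:", sorted(missed))
```

If a later change to the matching rules makes `near-1` show up (say,
by adding stemming), the `leaked` set is no longer empty and the test
flags that the engine has become more inclusive than intended.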

I haven't done work on search engines, but I do work with a journal
reference <-> journal citation matching algorithm that has to perform
similar discrimination between "good" and "not quite good enough"
matches. I've had to create data for each of the 8 factors involved in
my algorithm, both positive and negative, plus some random "don't
match anything" records. The dataset quickly becomes large, but if you
document what each record is meant to test, it shouldn't be too hard
to maintain.
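
One way such a dataset might be organized is as a table of documented
cases, one negative record per factor plus a few that match nothing.
This is only a sketch under stated assumptions: the post does not name
its eight factors, so the three below (`journal`, `year`,
`first_page`) are hypothetical, and `is_match` is a toy exact-match
rule, not the real algorithm.

```python
# Hypothetical factors; the original post does not name its eight.
FACTORS = ("journal", "year", "first_page")

# Reference record that every case is compared against (made-up data).
BASE = {"journal": "J Example", "year": 2007, "first_page": 101}


def is_match(reference, citation):
    """Toy matcher (an assumption): every factor must agree exactly."""
    return all(reference[f] == citation[f] for f in FACTORS)


def variant(**changes):
    """Copy BASE with the given fields changed."""
    record = dict(BASE)
    record.update(changes)
    return record


# Each case documents what it exercises:
# (description, factor or None, citation, expected result)
CASES = [
    ("positive: all factors agree", None, variant(), True),
    ("negative: different journal", "journal",
     variant(journal="J Other"), False),
    ("negative: year off by one", "year", variant(year=2008), False),
    ("negative: adjacent page", "first_page",
     variant(first_page=102), False),
    ("random: matches nothing", None,
     {"journal": "Unrelated", "year": 1990, "first_page": 7}, False),
]


def run_cases():
    """Return descriptions of the cases whose outcome was unexpected."""
    return [
        description
        for description, _factor, citation, expected in CASES
        if is_match(BASE, citation) != expected
    ]


def factors_without_negative_case():
    """Return factors that have no negative record in CASES."""
    covered = {factor for _d, factor, _c, expected in CASES
               if factor is not None and not expected}
    return [f for f in FACTORS if f not in covered]


if __name__ == "__main__":
    print("failures:", run_cases())
    print("uncovered factors:", factors_without_negative_case())
```

Keeping the description and the factor next to each record is what
makes a growing dataset maintainable: when a case starts failing, the
table says which factor it was written to exercise.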
Good luck!
-- Mike
On Dec 30, 2007, at 12:22 PM, Neil Heller wrote:
> I have been asked to consider ways to test a web crawler (aka search
> engine).
>
> There are lots of "positive"-type tests I can think of that mostly deal
> with "are the returned pages really relevant to the request".
>
> How does one test for pages that are (or might be) relevant but were
> missed by the web crawler?
>
> What might be the best (is that a trick word?) or most optimized request
> given someone's desire to find information?
---------------------------------------------------------------------
Michael Friedman HighWire Press
Phone: 650-725-1974 Stanford University
FAX: 270-721-8034 <friedman at highwire.stanford.edu>
---------------------------------------------------------------------