[sf-perl] Testing a web crawler

Michael Friedman friedman at highwire.stanford.edu
Sun Dec 30 14:47:33 PST 2007


I don't know about the optimization tests, but negative tests pretty  
much require "outside" knowledge. You need to know things that the  
software doesn't so that you can predict non-matching data.

For example, to find pages you missed, you could run the same query on
some other search engine and pick out relevant results that your own
engine doesn't return. For the opposite check, if you are using a limited
dataset (which you should be, for repeatable unit tests), you can
intentionally place files in it that are "close but don't meet the
current algorithm". The test then checks that those files don't appear
in the search results. If the algorithm changes later and files that you
knew shouldn't appear suddenly do, you know it has become too inclusive
and needs to be ratcheted back.
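
A minimal sketch of both checks, written in Python for brevity. Everything
here is an assumption made for illustration: the file names, the tiny
corpus, and the search() function, which is only a stand-in for whatever
engine is actually under test.

```python
# Hypothetical fixture corpus: file name -> text. Names and contents are
# invented for illustration only.
CORPUS = {
    "perl_testing.txt": "unit testing perl modules with Test::More",
    "perl_history.txt": "a short history of the perl language",  # near miss: no "testing"
    "java_testing.txt": "unit testing java classes with junit",  # near miss: no "perl"
    "cooking.txt": "how to bake sourdough bread",                # matches nothing
}


def search(query, corpus=CORPUS):
    """Stand-in for the engine under test: files containing every query term."""
    terms = query.lower().split()
    return {
        name
        for name, text in corpus.items()
        if all(term in text.lower().split() for term in terms)
    }


def missed(reference_hits, own_hits):
    """Results an outside reference found that our own engine did not."""
    return set(reference_hits) - set(own_hits)


def wrongly_included(query, must_not_appear):
    """Near-miss files that showed up anyway (an empty set means the test passes)."""
    return search(query) & set(must_not_appear)


NEAR_MISSES = ["perl_history.txt", "java_testing.txt", "cooking.txt"]
assert wrongly_included("perl testing", NEAR_MISSES) == set()
```

If a later change to the matching rules makes one of the near-miss files
show up, the assertion fails and names the file that slipped through.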

I haven't done work on search engines, but I do work with a journal  
reference <-> journal citation matching algorithm that has to perform  
similar discrimination between "good" and "not quite good enough"  
matches. I've had to create data for each of the 8 factors involved in  
my algorithm, both positive and negative, plus some random "don't  
match anything" records. The dataset quickly becomes large, but if you  
document it correctly it shouldn't be too hard to maintain.
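
One way to keep such a dataset documented is to store, next to each record,
the factor it exercises and the expected outcome. The sketch below is in
Python and is purely illustrative: the two factor names ("year" and
"author") and the toy matches() function are made up, and are not the
factors or the logic of the algorithm described above.

```python
from dataclasses import dataclass


@dataclass
class Case:
    factor: str   # hypothetical factor this record exercises
    expect: bool  # True = should match, False = near miss or non-match
    ref: dict     # the reference record
    cit: dict     # the citation record
    note: str     # why this record exists (the documentation)


def matches(ref, cit):
    """Toy stand-in matcher: same year and same surname, ignoring case."""
    return (ref["year"] == cit["year"]
            and ref["author"].lower() == cit["author"].lower())


CASES = [
    Case("year", True, {"author": "Smith", "year": 2001},
         {"author": "Smith", "year": 2001}, "identical year"),
    Case("year", False, {"author": "Smith", "year": 2001},
         {"author": "Smith", "year": 2002}, "year off by one"),
    Case("author", True, {"author": "Smith", "year": 2001},
         {"author": "SMITH", "year": 2001}, "surname differs only in case"),
    Case("author", False, {"author": "Smith", "year": 2001},
         {"author": "Smyth", "year": 2001}, "one-letter surname difference"),
    Case("none", False, {"author": "Smith", "year": 2001},
         {"author": "Jones", "year": 1987}, "unrelated record"),
]


def failures(cases=CASES):
    """Cases whose actual outcome disagrees with the documented expectation."""
    return [c for c in cases if matches(c.ref, c.cit) != c.expect]


for case in failures():
    print(f"FAIL [{case.factor}] {case.note}")
```

Because every record carries its own note, a failure reports which factor
regressed and why the record was added in the first place.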

Good luck!
-- Mike

On Dec 30, 2007, at 12:22 PM, Neil Heller wrote:

> I have been asked to consider ways to test a web crawler (aka search
> engine).
>
> There are lots of "positive"-type tests I can think of that mostly deal
> with "are the returned pages really relevant to the request".
>
> How does one test for pages that are (or might be) relevant but were
> missed by the web crawler?
>
> What might be the best (is that a trick word?) or most optimized request
> given someone's desire to find information?
>
>
>
> _______________________________________________
> SanFrancisco-pm mailing list
> SanFrancisco-pm at pm.org
> http://mail.pm.org/mailman/listinfo/sanfrancisco-pm

---------------------------------------------------------------------
Michael Friedman                     HighWire Press
Phone: 650-725-1974                  Stanford University
FAX:   270-721-8034                  <friedman at highwire.stanford.edu>
---------------------------------------------------------------------



