SPUG: search algorithm

Matt Tucker tuck at whistlingfish.net
Mon Dec 17 19:18:56 CST 2001


-- Thomas Whitney <whitneyt at agcs.com> spake thusly:

> Martin Korb wrote:
> 
>> Here is the problem: I have 7000 database entries which may or may
>> not be found in the content of over 50000 files. At the moment I am
>> using a linear search: the search patterns are the keys of a hash,
>> and find(\&wanted, $startdir) is called for each key. If the key is
>> found, I stop recursing the directory, delete that key, and start
>> over with the next one. That could amount to over 3 x 10^7
>> searches. Obviously, this will take much too long with a linear
>> search algorithm. Which is the best algorithm to use for such a
>> search, and where can I find out more about it? Any help is much
>> appreciated. Thanks, Martin
>
> If I am understanding correctly, it sounds like the directory search
> is the more expensive process. How about running through the keys on
> each file entry, rather than the other way around?
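(Roughly, that inversion might look like the sketch below -- the key
list and start directory are placeholders, and it assumes the keys can
be matched as literal strings:)

    use strict;
    use File::Find;

    my @keys     = ('foo', 'bar');      # stand-in for the 7000 DB entries
    my $startdir = '/some/start/dir';   # placeholder
    my %found;                          # key => first file it appeared in

    find(sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;
        my $text = do { local $/; <$fh> };   # slurp the file once
        close $fh;
        # One pass over the files; each file is tested against every key
        for my $key (@keys) {
            $found{$key} ||= $File::Find::name if $text =~ /\Q$key\E/;
        }
    }, $startdir);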

Another option might be:

If the keys you're searching for have a standard format, you might be
best off finding all _possible_ keys in each file (using a regexp) and
then looking those up in a hash of all the keys you're searching for. I
suspect this would be the fastest method, since you end up reading and
scanning each file once rather than 7000 times. You also get to take
advantage of Perl's hashing rather than scanning through all 7000 keys
for each candidate.
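
A minimal sketch of what I mean, assuming the keys look like plain
\w+ tokens (the candidate regexp, the key list, and the start
directory are placeholders to adapt):

    use strict;
    use File::Find;

    my @db_keys  = ('foo', 'bar');      # stand-in for the 7000 DB entries
    my $startdir = '/some/start/dir';   # placeholder
    my %wanted   = map { $_ => 1 } @db_keys;
    my %found;                          # key => first file it appeared in

    find(sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;
        while (my $line = <$fh>) {
            # Pull out every token that *could* be a key, then test it
            # against the hash -- each file is read and scanned only once,
            # and each candidate costs one O(1) hash lookup.
            while ($line =~ /(\w+)/g) {
                $found{$1} ||= $File::Find::name if $wanted{$1};
            }
        }
        close $fh;
    }, $startdir);

If the keys have a more distinctive shape (say, all digits, or a known
prefix), tightening the candidate regexp cuts the number of hash
lookups even further.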