SPUG: Searching for multiple strings

Andy Jacobs andyj at microsoft.com
Thu Jul 6 11:31:31 CDT 2000


I've written a program that searches a lot of text sources for multiple
search strings (among other things).  The search strings are read as input
from a file at run time (supplied by the user).  The program builds a table
of which search strings appear in which sources.  Each source string may
contain many search strings (perhaps overlapping), and each search string
may appear in multiple source strings.

My first implementation read the search strings into an array, and looped
through the array, searching for each one (in each source string).  This was
not very efficient, and I suspected it was due to having to re-compile each
search string each time it was used.  My next implementation created one
large string with "|"s to separate the strings, and then a grouping, so I
could tell which string was found.  Even though this second implementation
allowed me to use the "/o" flag on the match (to compile only once) it was
not much more efficient.  My latest implementation is much more efficient -
I create an array of closures where each one will search for a single
string.  Then I loop through the array, calling each closure in turn.  (I'm
not sure why this would be more efficient, but it seems to be - I suppose
it's because the big search does much more backtracking, and the overhead of
keeping track of the grouping as it backtracks).

I suspect I could gain even more efficiency if I could go through the source
string only once.  I'm aware of a table-based technique that builds a state
machine and lookup table, such that at any state you lookup the next state
based on the next character in the source string, and the resulting state
gives you any strings that were just found, and the new look-up table to
find the next state, etc.  My main question is: Has this been done already
and available as a module/script in CPAN, etc.?  My next question is: If
not, what are your feelings on whether this would be more efficient (and
thus a good use of my programming time)?  And finally, any other
suggestions?

There are typically 50-100 search strings of approximately 10-30 characters
each.  And hundreds of sources that will be searched (typically 1K-10K of
text).  According to -d:DProf this is currently the bottleneck in my
program.  Thanks for any pointers, advice or discussion.

 - Andy Jacobs

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     POST TO: spug-list at pm.org       PROBLEMS: owner-spug-list at pm.org
 Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/
     Subscriptions; Email to majordomo at pm.org:  ACTION  LIST  EMAIL
  Replace ACTION by subscribe or unsubscribe, EMAIL by your Email-address
 For full traffic, use spug-list for LIST ; otherwise use spug-list-digest





More information about the spug-list mailing list