SPUG: Searching for multiple strings

Andy Jacobs andyj at microsoft.com
Thu Jul 6 16:24:05 CDT 2000


In reply to various helpful responses:

The perlfaq6 gives a solution similar to what I have, although using the new
qr// may be more efficient than a closure call.  I'll have to learn more
about qr// and give this a shot.

I did try study() at one point, but it didn't help.  However, I don't
remember when I tried it, and it may be more helpful with my new
implementation.

My program needs the full list of all search strings that were in each
source string, and which were not.  So stopping early is not an option
(unless all the search strings have been found).

The method given at: http://www.effectiveperl.com/recipes/searching.html
uses a closure technique similar to what I am already doing.

But I think a different approach would improve things even more.
Specifically, the algorithm I mentioned about creating a state machine and
table based next state lookup (which I suspect is the ESM algorithm Greg
mentioned, and could be implemented similar to the Knuth-Morris-Pratt
algorithm given in "Mastering Algorithms with Perl" - but would need more
work to have it accept multiple strings).  In fact, I've previously written
this in C, but was wondering if it had already been done in Perl (and in
general if it would be worth the effort).  If I do it in Perl, I'll probably
add features like wildcards and repeating whitespace characters, etc., which
I didn't do in my C version.  Unless I hear otherwise, based on the
responses so far, I'll assume this has not already been written in Perl.

 - Andy

> -----Original Message-----
> From: owner-spug-list at pm.org 
> [mailto:owner-spug-list at pm.org]On Behalf Of
> greg heil
> Sent: Thursday, July 06, 2000 11:43 AM
> To: Andy Jacobs
> Cc: 'spug-list at pm.org'
> Subject: Re: SPUG: Searching for multiple strings
> 
> 
> Andy
> 
> The problem you outlined is known as the "exact
> set matching" problem. Actually it is slightly simpler
> - as you do not need to know the location of all
> matches, just the existence. However, the ESM is well
> enough solved, Eg p52 of:
> "Algorithms on Strings, Trees, and Sequences:
>  Computer Science and Computational Biology" '97 by
>  Dan Gusfield, http://wwwcsif.cs.ucdavis.edu/~gusfield/
> 
> It is solved there to O(n+m+k) with n the size of the
> text, m the sum of the sizes of the patterns and k the
> number of hits of P in T.
> 
> It sounds like your Perl solutions are having a hard
> time achieving a reduction from O(nm+cm) to O(nm) - ie
> effectively factoring out the compilation time for the
> search automata.
> 
> If your problem is important enough you might want
> to implement a better algorithm directly. It is not
> clear to me how the designers of Perl could be given
> syntactic clues signaling you want to solve an ESM;-)
> Thus you may need to resort to a language that exposes
> more of the hardwares capabilities.
> 
> -- 
> greg heil
> mailto:gheil at acm.org
> http://www.scn.org/tl/anvil
> 
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
> - - - - - - -
>      POST TO: spug-list at pm.org       PROBLEMS: owner-spug-list at pm.org
>  Seattle Perl Users Group (SPUG) Home Page: 
> http://www.halcyon.com/spug/
>      Subscriptions; Email to majordomo at pm.org:  ACTION  LIST  EMAIL
>   Replace ACTION by subscribe or unsubscribe, EMAIL by your 
> Email-address
>  For full traffic, use spug-list for LIST ; otherwise use 
> spug-list-digest
> 
> 
> 

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     POST TO: spug-list at pm.org       PROBLEMS: owner-spug-list at pm.org
 Seattle Perl Users Group (SPUG) Home Page: http://www.halcyon.com/spug/
     Subscriptions; Email to majordomo at pm.org:  ACTION  LIST  EMAIL
  Replace ACTION by subscribe or unsubscribe, EMAIL by your Email-address
 For full traffic, use spug-list for LIST ; otherwise use spug-list-digest





More information about the spug-list mailing list