[Mpls-pm] Interesting RegEx Problem

Ian Malpass ian at indecorous.com
Mon Sep 19 16:32:00 PDT 2005


On Mon, 19 Sep 2005, Andy Lester wrote:

> Date: Mon, 19 Sep 2005 16:01:44 -0500
> From: Andy Lester <andy at petdance.com>
> To: Ian Malpass <ian at indecorous.com>
> Cc: mpls-pm at pm.org
> Subject: Re: [Mpls-pm] Interesting RegEx Problem

> As I understand it, study says "I'm going to find the first instance of each of
> the 256 possible characters in the string, so that strings anchored with
> a literal can start right there."

It actually goes a little further, and does some symbol frequency checking 
too. From the perldoc:

>>>>>
The way "study" works is this: a linked list of every character in the 
string to be searched is made, so we know, for example, where all the 'k' 
characters are.  From each search string, the rarest character is 
selected, based on some static frequency tables constructed from some C 
programs and English text.  Only those places that contain this "rarest" 
character are examined.

For example, here is a loop that inserts index producing entries before 
any line containing a certain pattern:

     while (<>) {
         study;
         print ".IX foo\n"       if /\bfoo\b/;
         print ".IX bar\n"       if /\bbar\b/;
         print ".IX blurfl\n"    if /\bblurfl\b/;
         # ...
         print;
     }

In searching for "/\bfoo\b/", only those locations in $_ that contain "f" 
will be looked at, because "f" is rarer than "o".  In general, this is a 
big win except in pathological cases.  The only question is whether it 
saves you more time than it took to build the linked list in the first 
place.
<<<<<

So, if all the patterns are equally likely to match, then the amount of 
time it takes to run find the pattern that matches will be directly 
proportional to the number of patters. If all the input strings are of 
similar lengths, then I would guess that study() will take a similar 
amount of time on each, and that it would take longer on longer strings. 
That would suggest there's a tipping point beyond which using study is of 
benefit, but I think that can only be determined empirically :)

The nice thing is that because it's so localised in the code, it's easy to 
turn it on or off with a flag or environment variable. Even if it's not of 
benefit now, it might be in two years' time when the number of patterns 
has inflated wildly....

Certainly it should be documented in the code that study() might be 
applied at a given point even if it isn't used.

Still, 
<http://billharlan.com/pub/papers/A_Tirade_Against_the_Cult_of_Performance.html> 
"Optimise later, if at all", and all that jazz.

Ian

-
---------------------------------------------------------------------------

The soul would have no rainbows if the eyes held no tears.

Ian Malpass
<ian at indecorous.com>


More information about the Mpls-pm mailing list