[Mpls-pm] Interesting RegEx Problem

Mon Sep 19 17:35:18 PDT 2005

On Sep 19, 2005, at 12:39 PM, Robert Fischer wrote:

> List::Util::first is a great shortcut for implementing the
> sequential-search algorithm, but I'm looking for something better than
> sequential-search.

Check out Regexp::Assemble:

    http://search.cpan.org/~dland/Regexp-Assemble-0.17/Assemble.pm

It delves into the guts of the regexes and finds a good (i.e. short) 
regex that will match the alternation of the set of regexes.  For 
example:

my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+\\d*\\s+c' );
$ra->add( 'a\\w+\\d+' );
$ra->add( 'a\\d+' );
print $ra->re; # prints (?:a(?:b+(?:\d*\s+)?c|(?:\w+)?\d+))

It can also track which of the original patterns matched.

> As a tangential note: is there a concept of a "distance" between 
> regular
> expressions which can be reasonably implemented?  If so, has anyone
> implemented it yet?  String distance certainly doesn't work, because 
> \d{3}
> and [1-90][1-90][1-90] are implementation-identical, but have a drastic
> edit distance.

Your best bet might be to convert each regex to a canonical DFA (if 
possible - it won't be possible if you're using capturing groups or 
other regex extensions) and then use a graph-theoretical metric on the 
underlying DFA.  Definitely a research problem, I'm thinking. =)

  -Ken