[Mpls-pm] Interesting RegEx Problem
Ken Williams
ken at mathforum.org
Mon Sep 19 17:35:18 PDT 2005
On Sep 19, 2005, at 12:39 PM, Robert Fischer wrote:
> List::Util::first is a great shortcut for implementing the
> sequential-search algorithm, but I'm looking for something better than
> sequential-search.
Check out Regexp::Assemble:
http://search.cpan.org/~dland/Regexp-Assemble-0.17/Assemble.pm
It delves into the guts of the regexes and finds a good (i.e. short)
regex that will match the alternation of the set of regexes. For
example:
my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+\\d*\\s+c' );
$ra->add( 'a\\w+\\d+' );
$ra->add( 'a\\d+' );
print $ra->re; # prints (?:a(?:b+(?:\d*\s+)?c|(?:\w+)?\d+))
It can also track which of the original patterns matched.
> As a tangential note: is there a concept of a "distance" between
> regular
> expressions which can be reasonably implemented? If so, has anyone
> implemented it yet? String distance certainly doesn't work, because
> \d{3}
> and [1-90][1-90][1-90] are implementation-identical, but have a drastic
> edit distance.
Your best bet might be to convert each regex to a canonical DFA (if
possible - it won't be possible if you're using capturing groups or
other regex extensions) and then use a graph-theoretical metric on the
underlying DFA. Definitely a research problem, I'm thinking. =)
-Ken
More information about the Mpls-pm
mailing list