[Mpls-pm] Interesting RegEx Problem

Mon Sep 19 12:46:02 PDT 2005

If I understand the problem correctly, it reminds me of something
Friedl mentioned in a talk that he was doing about his work at Yahoo
-- making a regex engine so that company names (and various versions
of them) could be efficiently substituted in news articles with the
stock symbol and lookup link.  Unfortunately, I couldn't find a module
on CPAN that facilitates this or includes his code, but this might
give a different direction to go on this -- perhaps even contacting
Friedl directly.

And if you were going to take the collection and perform a search like
what was mentioned before, essentially iterating through the
collection and then through the regexes for each, perhaps you could do
an up-front sorting of the collection and the regexes.

For example, if the collection is of various lengths:

# try to match the easier ones (roughly speaking) first
@collection = sort { length($a) <=> length($b) } @collection;

# sort out the regexes by complexity (this could get ugly) or
# at least get the faster matches up front
# Perhaps a hash could be used instead to build a smarter version of this.
# Example below is to pull out the regexes that do a begining match first
# eg: @regexes = ('/abc/', '/^abc/', '/abc$/', '/^abc$/');
@regexes = sort { $b =~ m#^/\^# <=> $a =~ m#^/\^# } @regexes;

Then use these to do the iterative approach.  And if the regex is to
only be used once in the assumed 1-to-1 relationship, you can splice
it out of @regexes as you go onto the next string.

I hope this wasn't too confusing or off the track you were going.

- Glenn

On 9/19/05, Ian Malpass <ian at indecorous.com> wrote:
> On Mon, 19 Sep 2005, Robert Fischer wrote:
> 
> > I need to rephrase the problem, though, because I realize I left a major
> > aspect out of the phrasing:  Given a *collection of* strings and a
> > collection of regular expression strings, where it is known that each
> > string is matched by precisely one regular expression, how do you most
> > efficiently develop the mapping?
> >
> > Given that arrangement, I'm currently looking at lexically sorting the
> > strings and getting an MRU list for the pattern matches.  That
> > implementation assumes that strings which are lexically close are liable
> > to match the same (or similar) regular expressions.
> 
> Are they, though? "Alphabet" and "ZZ 9 Plural Z Alpha" both match /Alpha/.
> And /^.*Alpha.*$/ for that matter, if you need to match the entire string.
> A bit of a pathalogical case, I know, but it illustrates the point.
> 
> I'm not saying your approach won't bring some improvement, but the extra
> list maintenance and sorting will add overhead too. I'd profile it against
> a more straightforward approach (e.g. nested loops) and see what benefits
> it brings. Probably depends on the set sizes. Certainly depends on the set
> contents.
> 
> More complexity brings more opportunity for things to go wrong, and more
> difficulty maintaining things later :(
> 
> Ian
> 
> -
> ---------------------------------------------------------------------------
> 
> The soul would have no rainbows if the eyes held no tears.
> 
> Ian Malpass
> <ian at indecorous.com>
> _______________________________________________
> Mpls-pm mailing list
> Mpls-pm at pm.org
> http://mail.pm.org/mailman/listinfo/mpls-pm
>