[Mpls-pm] Interesting RegEx Problem

Mon Sep 19 13:26:37 PDT 2005

Glenn and Ian:

What you're talking about is exactly on the line of what I'm looking at. 
The relationship is not 1-to-1 (it's many-to-one), but the thoughts on
sorting and the like is exactly what I'm looking at.

~~ Robert.

> If I understand the problem correctly, it reminds me of something
> Friedl mentioned in a talk that he was doing about his work at Yahoo
> -- making a regex engine so that company names (and various versions
> of them) could be efficiently substituted in news articles with the
> stock symbol and lookup link.  Unfortunately, I couldn't find a module
> on CPAN that facilitates this or includes his code, but this might
> give a different direction to go on this -- perhaps even contacting
> Friedl directly.
>
> And if you were going to take the collection and perform a search like
> what was mentioned before, essentially iterating through the
> collection and then through the regexes for each, perhaps you could do
> an up-front sorting of the collection and the regexes.
>
> For example, if the collection is of various lengths:
>
> # try to match the easier ones (roughly speaking) first
> @collection = sort { length($a) <=> length($b) } @collection;
>
> # sort out the regexes by complexity (this could get ugly) or
> # at least get the faster matches up front
> # Perhaps a hash could be used instead to build a smarter version of this.
> # Example below is to pull out the regexes that do a begining match first
> # eg: @regexes = ('/abc/', '/^abc/', '/abc$/', '/^abc$/');
> @regexes = sort { $b =~ m#^/\^# <=> $a =~ m#^/\^# } @regexes;
>
> Then use these to do the iterative approach.  And if the regex is to
> only be used once in the assumed 1-to-1 relationship, you can splice
> it out of @regexes as you go onto the next string.
>
> I hope this wasn't too confusing or off the track you were going.
>
> - Glenn
>
>
>
>
> On 9/19/05, Ian Malpass <ian at indecorous.com> wrote:
>> On Mon, 19 Sep 2005, Robert Fischer wrote:
>>
>> > I need to rephrase the problem, though, because I realize I left a
>> major
>> > aspect out of the phrasing:  Given a *collection of* strings and a
>> > collection of regular expression strings, where it is known that each
>> > string is matched by precisely one regular expression, how do you most
>> > efficiently develop the mapping?
>> >
>> > Given that arrangement, I'm currently looking at lexically sorting the
>> > strings and getting an MRU list for the pattern matches.  That
>> > implementation assumes that strings which are lexically close are
>> liable
>> > to match the same (or similar) regular expressions.
>>
>> Are they, though? "Alphabet" and "ZZ 9 Plural Z Alpha" both match
>> /Alpha/.
>> And /^.*Alpha.*$/ for that matter, if you need to match the entire
>> string.
>> A bit of a pathalogical case, I know, but it illustrates the point.
>>
>> I'm not saying your approach won't bring some improvement, but the extra
>> list maintenance and sorting will add overhead too. I'd profile it
>> against
>> a more straightforward approach (e.g. nested loops) and see what
>> benefits
>> it brings. Probably depends on the set sizes. Certainly depends on the
>> set
>> contents.
>>
>> More complexity brings more opportunity for things to go wrong, and more
>> difficulty maintaining things later :(
>>
>> Ian
>>
>> -
>> ---------------------------------------------------------------------------
>>
>> The soul would have no rainbows if the eyes held no tears.
>>
>> Ian Malpass
>> <ian at indecorous.com>
>> _______________________________________________
>> Mpls-pm mailing list
>> Mpls-pm at pm.org
>> http://mail.pm.org/mailman/listinfo/mpls-pm
>>
> _______________________________________________
> Mpls-pm mailing list
> Mpls-pm at pm.org
> http://mail.pm.org/mailman/listinfo/mpls-pm
>

~~ Robert Fischer.
rfischer at corradiation.net
651-398-8010