[Pdx-pm] common elements in two lists

Sat Nov 5 06:07:11 PDT 2011

Michael G Schwern <schwern at pobox.com> wrote:
> On 2011.11.4 3:34 PM, Tom Keller wrote:
>> my @out = map { $_ =~  m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;
>> my @index = grep {
>> my $c = $_;
>> if ( $c > $#names1  or # always false
>> ( grep { $names1[$c] =~ m/$_/ } @out ) > 0) {
>
> That grep inside a grep makes it an O(N*M) algorithm (where N is the length of
> @names1 and M is @names2), so 46227 * 5726 == about 264 million iterations.
>
> You'd be better off putting the normalized names into a hash and checking
> against that.

It sounds like normalising the names is probably the right thing in
this situation.  But if it weren't, there's another way to speed this
up.  The original code builds a list of fixed strings, and then
selects the elements of @names1 that contain at least one of those
fixed strings.  It does that using grep inside grep; but another
option would be to build a single regex to use in the outer grep:

my @out = map { $_ =~  m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;
my $regex = join '|', map { quotemeta } @out;
$regex = qr/$regex/;
my @common = grep { /$regex/ } @names1;

Naively, that could be assumed to be O(N*M) in the same way as the
double-grep solution — and in Perl 5.8 and earlier, it is.  (Though
with a substantial reduction in the constant factors; it's probably
considerably faster anyway.)  However, starting from Perl 5.10, the
regex engine uses an Aho-Corasick matcher for patterns (or
subpatterns) consisting of fixed-string alternations:

http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

The running time of the Aho-Corasick algorithm is linear in the length
of the string being examined, independent of the number of
alternatives, so it's an excellent choice for a situation like this if
the number of fixed strings is arbitrarily large.

For those using Perl 5.8, I have a CPAN module which implements a
related algorithm:

https://metacpan.org/module/Text::Match::FastAlternatives

However, even under Perl 5.10 (and perhaps later; I should do some
more timing tests), T::M::FA is faster than the internal regex engine.
 I assume that's partly because it needs to do less work, but in
addition, the regex engine's running time seems to depend at least to
some extent on the number of alternatives:

http://aaroncrane.co.uk/2008/05/text_match_fastalternatives/

That said, the difference probably only matters if you have a
gargantuan set of fixed strings, or if you're doing this sort of
operation on millions of candidate strings (which was the situation I
had when I wrote it).

Also, that blog post is out of date; it says there's no Unicode
support in TMFA, but recent versions do handle Unicode for the most
part.  And for that matter, I have an improved version of TMFA which
is even faster still, and typically uses a quarter of the memory it
used to.  I really must get round to releasing that soon…

-- 
Aaron Crane ** http://aaroncrane.co.uk/