[Pdx-pm] common elements in two lists
Tom Keller
kellert at ohsu.edu
Mon Nov 7 16:40:53 PST 2011
Thanks All,
Using Benchmark to time Aaron Crane's suggestion:
###
use v5.10;                             # for say
use Benchmark qw( timediff timestr );
my $t0 = Benchmark->new;
## capture the relevant signifier from each name in the longer list of unique names
my @out = map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @class1;
say "out contains ", scalar @out, " strings with \@class1 as input";
my $regex = join '|', map { quotemeta } @out;
$regex = qr/$regex/;
my @common = grep { /$regex/ } @class2;
my $t1 = Benchmark->new;
my $td = timediff( $t1,$t0 );
say "the code took: ",timestr($td);
say "common contains ", scalar @common, " names.";
###
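For readers following along outside Perl, the single-regex technique above can be sketched like this in Python. The names and data are made up for illustration; the capture pattern is the one from the thread, and `re.escape` plays the role of Perl's quotemeta:

```python
import re

# Hypothetical stand-ins for @class1 / @class2
class1 = ["geneA_101x_foo", "geneB_202y_bar"]
class2 = ["sampleA_101x", "sampleZ_999q", "sampleB_202y"]

# Capture the relevant signifier from each name in the longer list
sig = re.compile(r"\w+[-_]*(\w*[-_]*\d+[a-z]*).*")
out = [m.group(1) for m in map(sig.match, class1) if m]

# Join the captured fixed strings into one alternation, quoting any
# regex metacharacters, and compile it once
regex = re.compile("|".join(map(re.escape, out)))

# One pass over the other list, one compiled regex per candidate
common = [name for name in class2 if regex.search(name)]
```

Here `common` keeps exactly the names of `class2` that contain at least one captured signifier as a substring, which is what the double-grep version computed with far more work.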
test.pl Output to ./data/affy_oa_common_classnames1_aaroncrane.txt
names1 contains 46227 names.
names2 contains 5726 names.
class1 contains 7815 names.
class2 contains 748 names.
out contains 7806 strings with @class1 as input
the code took: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
common contains 748 names.
vs. my original index method:
grep_names.pl Output to ./data/affy_oa_common1.txt
names1 contains 46227 names.
names2 contains 5726 names.
class1 contains 7815 names.
class2 contains 748 names.
out contains 7806 strings with @class1 as input
the code took: 19 wallclock secs (18.42 usr + 0.22 sys = 18.64 CPU)
index contains 748 matches.
The results are identical, with a fantastic speed-up.
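For comparison, the hash-of-normalized-names approach Schwern suggests in the quoted message below could be sketched like this (Python again, with made-up names). Note the caveat Aaron raises: a hash lookup requires the normalized keys to match exactly, whereas the regex version does substring matching, so the two are only equivalent when normalization really does produce the same key on both sides:

```python
import re

# Hypothetical data: each name is assumed to normalize to an exact key
names1 = ["geneA_101x_foo", "geneB_202y_bar", "geneC_303_baz"]
names2 = ["geneA_101x", "geneB_202y"]

sig = re.compile(r"\w+[-_]*(\w*[-_]*\d+[a-z]*).*")

def signifier(name):
    """Extract the signifier, or None if the pattern doesn't match."""
    m = sig.match(name)
    return m.group(1) if m else None

# Build a set of normalized keys from the shorter list: O(M)
wanted = {signifier(n) for n in names2} - {None}

# One O(1) set lookup per name in the longer list: O(N) overall,
# instead of O(N*M) for the grep-inside-grep version
common = [n for n in names1 if signifier(n) in wanted]
```

This avoids regex alternation entirely, at the cost of requiring exact key equality rather than containment.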
thanks again,
Tom
MMI DNA Services Core Facility
503-494-2442
kellert at ohsu.edu
Office: 6588 RJH (CROET/BasicScience)
OHSU Shared Resources
On Nov 5, 2011, at 6:07 AM, Aaron Crane wrote:
> Michael G Schwern <schwern at pobox.com> wrote:
>> On 2011.11.4 3:34 PM, Tom Keller wrote:
>>> my @out = map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;
>>> my @index = grep {
>>> my $c = $_;
>>> if ( $c > $#names1 or # always false
>>> ( grep { $names1[$c] =~ m/$_/ } @out ) > 0) {
>>
>> That grep inside a grep makes it an O(N*M) algorithm (where N is the length of
>> @names1 and M is @names2), so 46227 * 5726 == about 264 million iterations.
>>
>> You'd be better off putting the normalized names into a hash and checking
>> against that.
>
> It sounds like normalising the names is probably the right thing in
> this situation. But if it weren't, there's another way to speed this
> up. The original code builds a list of fixed strings, and then
> selects the elements of @names1 that contain at least one of those
> fixed strings. It does that using grep inside grep; but another
> option would be to build a single regex to use in the outer grep:
>
> my @out = map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;
> my $regex = join '|', map { quotemeta } @out;
> $regex = qr/$regex/;
> my @common = grep { /$regex/ } @names1;
>
> Naively, that could be assumed to be O(N*M) in the same way as the
> double-grep solution — and in Perl 5.8 and earlier, it is. (Though
> with a substantial reduction in the constant factors; it's probably
> considerably faster anyway.) However, starting from Perl 5.10, the
> regex engine uses an Aho-Corasick matcher for patterns (or
> subpatterns) consisting of fixed-string alternations:
>
> http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
>
> The running time of the Aho-Corasick algorithm is linear in the length
> of the string being examined, independent of the number of
> alternatives, so it's an excellent choice for a situation like this
> even when the number of fixed strings is very large.
>
> For those using Perl 5.8, I have a CPAN module which implements a
> related algorithm:
>
> https://metacpan.org/module/Text::Match::FastAlternatives
>
> However, even under Perl 5.10 (and perhaps later; I should do some
> more timing tests), T::M::FA is faster than the internal regex engine.
> I assume that's partly because it needs to do less work, but in
> addition, the regex engine's running time seems to depend at least to
> some extent on the number of alternatives:
>
> http://aaroncrane.co.uk/2008/05/text_match_fastalternatives/
>
> That said, the difference probably only matters if you have a
> gargantuan set of fixed strings, or if you're doing this sort of
> operation on millions of candidate strings (which was the situation I
> had when I wrote it).
>
> Also, that blog post is out of date; it says there's no Unicode
> support in TMFA, but recent versions do handle Unicode for the most
> part. And for that matter, I have an improved version of TMFA which
> is even faster still, and typically uses a quarter of the memory it
> used to. I really must get round to releasing that soon…
>
> --
> Aaron Crane ** http://aaroncrane.co.uk/
> _______________________________________________
> Pdx-pm-list mailing list
> Pdx-pm-list at pm.org
> http://mail.pm.org/mailman/listinfo/pdx-pm-list