[Pdx-pm] common elements in two lists
Tom Keller
kellert at ohsu.edu
Mon Nov 7 16:40:53 PST 2011
Thanks All,
Using Benchmark to time Aaron Crane's suggestion:
###
use v5.10;                             # for say
use Benchmark qw( timediff timestr );
my $t0 = Benchmark->new;
## capture the relevant signifier from each name in the longer list of unique names
my @out = map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @class1;
say "out contains ", scalar @out, " strings with \@class1 as input";
my $regex = join '|', map { quotemeta } @out;
$regex = qr/$regex/;
my @common = grep { /$regex/ } @class2;
my $t1 = Benchmark->new;
my $td = timediff( $t1,$t0 );
say "the code took: ",timestr($td);
say "common contains ", scalar @common, " names.";
###
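For readers following along outside Perl, the single-regex technique above can be sketched like this in Python. The names and data are made up for illustration; the capture pattern is the one from the thread, and `re.escape` plays the role of Perl's quotemeta:

```python
import re

# Hypothetical stand-ins for @class1 / @class2
class1 = ["geneA_101x_foo", "geneB_202y_bar"]
class2 = ["sampleA_101x", "sampleZ_999q", "sampleB_202y"]

# Capture the relevant signifier from each name in the longer list
sig = re.compile(r"\w+[-_]*(\w*[-_]*\d+[a-z]*).*")
out = [m.group(1) for m in map(sig.match, class1) if m]

# Join the captured fixed strings into one alternation, quoting any
# regex metacharacters, and compile it once
regex = re.compile("|".join(map(re.escape, out)))

# One pass over the other list, one compiled regex per candidate
common = [name for name in class2 if regex.search(name)]
```

Here `common` keeps exactly the names of `class2` that contain at least one captured signifier as a substring, which is what the double-grep version computed with far more work.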
test.pl Output to ./data/affy_oa_common_classnames1_aaroncrane.txt
names1 contains 46227 names.
names2 contains 5726 names.
class1 contains 7815 names.
class2 contains 748 names.
out contains 7806 strings with @class1 as input
the code took: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
common contains 748 names.
vs. my original index method:
grep_names.pl Output to ./data/affy_oa_common1.txt
names1 contains 46227 names.
names2 contains 5726 names.
class1 contains 7815 names.
class2 contains 748 names.
out contains 7806 strings with @class1 as input
the code took: 19 wallclock secs (18.42 usr + 0.22 sys = 18.64 CPU)
index contains 748 matches.
The results are identical, with a fantastic speed-up.
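For comparison, the hash-of-normalized-names approach Schwern suggests in the quoted message below could be sketched like this (Python again, with made-up names). Note the caveat Aaron raises: a hash lookup requires the normalized keys to match exactly, whereas the regex version does substring matching, so the two are only equivalent when normalization really does produce the same key on both sides:

```python
import re

# Hypothetical data: each name is assumed to normalize to an exact key
names1 = ["geneA_101x_foo", "geneB_202y_bar", "geneC_303_baz"]
names2 = ["geneA_101x", "geneB_202y"]

sig = re.compile(r"\w+[-_]*(\w*[-_]*\d+[a-z]*).*")

def signifier(name):
    """Extract the signifier, or None if the pattern doesn't match."""
    m = sig.match(name)
    return m.group(1) if m else None

# Build a set of normalized keys from the shorter list: O(M)
wanted = {signifier(n) for n in names2} - {None}

# One O(1) set lookup per name in the longer list: O(N) overall,
# instead of O(N*M) for the grep-inside-grep version
common = [n for n in names1 if signifier(n) in wanted]
```

This avoids regex alternation entirely, at the cost of requiring exact key equality rather than containment.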
thanks again,
Tom
MMI DNA Services Core Facility
503-494-2442
kellert at ohsu.edu
Office: 6588 RJH (CROET/BasicScience)
OHSU Shared Resources
On Nov 5, 2011, at 6:07 AM, Aaron Crane wrote:
> Michael G Schwern <schwern at pobox.com> wrote:
>> On 2011.11.4 3:34 PM, Tom Keller wrote:
>>> my @out = map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;
>>> my @index = grep {
>>> my $c = $_;
>>> if ( $c > $#names1 or # always false
>>> ( grep { $names1[$c] =~ m/$_/ } @out ) > 0) {
>>
>> That grep inside a grep makes it an O(N*M) algorithm (where N is the length of
>> @names1 and M is @names2), so 46227 * 5726 == about 264 million iterations.
>>
>> You'd be better off putting the normalized names into a hash and checking
>> against that.
>
> It sounds like normalising the names is probably the right thing in
> this situation. But if it weren't, there's another way to speed this
> up. The original code builds a list of fixed strings, and then
> selects the elements of @names1 that contain at least one of those
> fixed strings. It does that using grep inside grep; but another
> option would be to build a single regex to use in the outer grep:
>
> my @out = map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;
> my $regex = join '|', map { quotemeta } @out;
> $regex = qr/$regex/;
> my @common = grep { /$regex/ } @names1;
>
> Naively, that could be assumed to be O(N*M) in the same way as the
> double-grep solution — and in Perl 5.8 and earlier, it is. (Though
> with a substantial reduction in the constant factors; it's probably
> considerably faster anyway.) However, starting from Perl 5.10, the
> regex engine uses an Aho-Corasick matcher for patterns (or
> subpatterns) consisting of fixed-string alternations:
>
> http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
>
> The running time of the Aho-Corasick algorithm is linear in the length
> of the string being examined, independent of the number of
> alternatives, so it's an excellent choice for a situation like this
> even when the number of fixed strings is very large.
>
> For those using Perl 5.8, I have a CPAN module which implements a
> related algorithm:
>
> https://metacpan.org/module/Text::Match::FastAlternatives
>
> However, even under Perl 5.10 (and perhaps later; I should do some
> more timing tests), T::M::FA is faster than the internal regex engine.
> I assume that's partly because it needs to do less work, but in
> addition, the regex engine's running time seems to depend at least to
> some extent on the number of alternatives:
>
> http://aaroncrane.co.uk/2008/05/text_match_fastalternatives/
>
> That said, the difference probably only matters if you have a
> gargantuan set of fixed strings, or if you're doing this sort of
> operation on millions of candidate strings (which was the situation I
> had when I wrote it).
>
> Also, that blog post is out of date; it says there's no Unicode
> support in TMFA, but recent versions do handle Unicode for the most
> part. And for that matter, I have an improved version of TMFA which
> is even faster still, and typically uses a quarter of the memory it
> used to. I really must get round to releasing that soon…
>
> --
> Aaron Crane ** http://aaroncrane.co.uk/
> _______________________________________________
> Pdx-pm-list mailing list
> Pdx-pm-list at pm.org
> http://mail.pm.org/mailman/listinfo/pdx-pm-list