[Pdx-pm] common elements in two lists

Michael G Schwern schwern at pobox.com
Fri Nov 4 18:17:58 PDT 2011


On 2011.11.4 3:34 PM, Tom Keller wrote:
> I have two very long lists of names. They have slightly different conventions
> for naming the same thing, so I devised a regex to compare the two lists. I
> need to extract the names common to both. (Acknowledgement: "Effective Perl
> Programming, 1st ed.")
> But it is taking an ungodly amount of time, since 
> names1 contains 46227 names.
> names2 contains 5726 names.
> 
> Here's the code:
> ########
> my @names1 = get_names($file1);
> my @names2 = get_names($file2);
> #say join(", ", @names1);
> 
> my @out = map { $_ =~  m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;
> my @index = grep {
> my $c = $_;
> if ( $c > $#names1  or # always false
> ( grep { $names1[$c] =~ m/$_/ } @out ) > 0) {
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That grep inside a grep makes it an O(N*M) algorithm (where N is the length of
@names1 and M is @names2), so 46227 * 5726 == about 264 million iterations.

First optimization would be to use first() from List::Util instead of grep.
That will stop checking at its first match.  However, it will on average only
reduce the number of comparisons by half (so 130 million) and the cost of
using first() instead of grep() might offset that.  But it's cheap to find out.


> 1;  ## save
> } else {
> 0;  ## skip
> }
> } 0 .. $#names1;
> 
> my @common = map { $names1[$_] } @index;

You'd be better off putting the normalized names into a hash and checking
against that.

my %names1 = map { _normalize($_) => 1 } get_names($file1);
my %intersection;
for my $name (@names2) {
    $intersection{$name}++ if $names1{ _normalize($name) };
}

That requires N+M time, about 50 thousand, not N*M.  OTOH it does use double
the memory.


-- 
24. Must not tell any officer that I am smarter than they are, especially
    if it's true.
    -- The 213 Things Skippy Is No Longer Allowed To Do In The U.S. Army
           http://skippyslist.com/list/


More information about the Pdx-pm-list mailing list