[Pdx-pm] common elements in two lists

Ben Prew ben.prew at gmail.com
Fri Nov 4 16:33:09 PDT 2011


Also, you're using 5.10+, right?

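If so, you can also use the smartmatch operator (~~), which is true
when $name is string-equal to any element of @out:

# @out holds the keys extracted from @names2, so test @names1 against it
for my $name (@names1) {
      push @common, $name if ($name ~~ @out);
}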
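
For lists this size, a hash lookup performs the same exact-match test
without rescanning @out for every name. A minimal sketch, assuming
exact string equality is what you want; the sample values are made up
and stand in for the real lists:

```perl
use strict;
use warnings;

# Hypothetical sample data; the real lists come from get_names().
my @names1 = qw(mir-21 let-7a mir-155 mir-21);
my @out    = qw(mir-21 mir-155 mir-999);

# Build the lookup table once, then each test is a single hash probe.
my %wanted = map { $_ => 1 } @out;
my @common = grep { $wanted{$_} } @names1;

print join(", ", @common), "\n";
```

This also avoids ~~ entirely, which has been flagged experimental
since Perl 5.18.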

And, since the names repeat that often, use a hash to remove the duplicates first:

my %names1 = map { $_ => 1 } get_names($file1);
my @names1 = keys %names1;
my %names2 = map { $_ => 1 } get_names($file2);
my @names2 = keys %names2;
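
Note that keys returns the names in no particular order. If the
original order matters, a %seen hash with grep keeps the first
occurrence of each name. A small sketch with made-up input in place of
get_names():

```perl
use strict;
use warnings;

# Hypothetical input with repeats; stands in for get_names($file1).
my @raw = qw(mir-21 let-7a mir-21 mir-155 let-7a);

# Keep the first occurrence of each name, in input order.
my %seen;
my @unique = grep { !$seen{$_}++ } @raw;

print scalar(@unique), " unique names: @unique\n";
```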

Finally, you'll probably want to quote any regular expression
characters in the @out strings (the original code didn't, but it
probably should):

for my $name (@names1) {
      # \Q...\E quotes metacharacters; ~~ needs an array ref here, not a list
      push @common, $name if ($name ~~ [ map { qr/\Q$_\E/ } @out ]);
}
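
The same quoting can be done without ~~ by escaping each key with
quotemeta and joining the keys into one alternation that is compiled
once. This is only a sketch; the sample names are hypothetical and
chosen so that one key contains a metacharacter:

```perl
use strict;
use warnings;

# Hypothetical sample data; "mir.21" contains a regex metacharacter.
my @names1 = qw(hsa-mir.21 hsa-mirX21 hsa-let-7a);
my @out    = qw(mir.21 let-7a);

# quotemeta escapes each key so "." matches only a literal dot.
my $alternation = join '|', map { quotemeta } @out;
my $any_key     = qr/(?:$alternation)/;

# One compiled pattern is tested against each name.
my @common = grep { $_ =~ $any_key } @names1;

print "@common\n";
```

Without the quoting, "mir.21" would also match "hsa-mirX21". An empty
@out would produce an empty pattern that matches everything, so check
for that case first.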

Note: all code should be assumed untested.
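
Kris's class idea (quoted below) can be sketched with a hash of
buckets. The class_key function here is purely hypothetical: it
lowercases a name and drops punctuation, and would need to be replaced
by whatever rule fits the real naming conventions. The sample names
are made up as well:

```perl
use strict;
use warnings;

# Hypothetical class function: names that may refer to the same item
# must map to the same key (false positives are acceptable).
sub class_key {
    my ($name) = @_;
    my $key = lc $name;
    $key =~ s/[^a-z0-9]//g;
    return $key;
}

# Hypothetical sample data standing in for the two real lists.
my @names1 = qw(MIR-21 mir_21 let-7a);
my @names2 = qw(mir21 LET_7A mir-155);

# Bucket the larger list by class key.
my %bucket;
push @{ $bucket{ class_key($_) } }, $_ for @names1;

# Each name from the smaller list is compared only within its own class.
my @common;
for my $name (@names2) {
    my $candidates = $bucket{ class_key($name) } or next;
    push @common, @$candidates;
}

print "@common\n";
```

Here every member of a matching class is taken as common; a stricter
per-pair comparison could be run on @$candidates instead, and two
names from @names2 in the same class would add that class twice.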



2011/11/4 Tom Keller <kellert at ohsu.edu>:
> Hi,
> Kris B had a really good idea for this problem, since the names identify
> items that do get repeated many times:
> names1 contains 46227 names.
> names2 contains 5726 names.
> class1 contains 7815 names.
> class2 contains 748 names.
> Much faster.
> Thanks!
> Tom
> MMI DNA Services Core Facility
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
> OHSU Shared Resources
>
> On Nov 4, 2011, at 3:39 PM, Kris Bosland wrote:
>
> I would suggest smashing the strings into classes and then you only have to
> make comparisons within the classes.
> For example, you could find the Soundex(tm) code of each name.
> I don't know the details of your naming conventions so you would need to
> figure out the algorithm to get all the possible matches into one class -
> false positives are fine (a larger class but still small compared to your
> bigger set) but you don't want false negatives (putting things in different
> classes from each other).
> My $0.002
> <lurk/>
> -Kris
>
> 2011/11/4 Tom Keller <kellert at ohsu.edu>
>>
>> Greetings,
>> I have two very long lists of names. They have slightly different
>> conventions for naming the same thing, so I devised a regex to compare the
>> two lists. I need to extract the names common to both. (Acknowledgement:
>> "Effective Perl Programming, 1st ed.")
>> But it is taking an ungodly amount of time, since
>> names1 contains 46227 names.
>> names2 contains 5726 names.
>> Here's the code:
>> ########
>> my @names1 = get_names($file1);
>> my @names2 = get_names($file2);
>> #say join(", ", @names1);
>> my @out = map { $_ =~  m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;
>> my @index = grep {
>> my $c = $_;
>> if ( $c > $#names1  or # always false
>> ( grep { $names1[$c] =~ m/$_/ } @out ) > 0) {
>> 1;  ## save
>> } else {
>> 0;  ## skip
>> }
>> } 0 .. $#names1;
>> my @common = map { $names1[$_] } @index;
>> ########
>> Is there a faster/better way to do this?
>> thanks,
>> Tom
>>
>> _______________________________________________
>> Pdx-pm-list mailing list
>> Pdx-pm-list at pm.org
>> http://mail.pm.org/mailman/listinfo/pdx-pm-list
>



-- 
--Ben

