[Pdx-pm] common elements in two lists

Tom Keller kellert at ohsu.edu
Fri Nov 4 16:12:02 PDT 2011


Hi,
Kris B had a really good idea for this problem, since the names identify items that get repeated many times. Grouping the names into classes shrinks both lists considerably:
names1 contains 46227 names.
names2 contains 5726 names.
class1 contains 7815 names.
class2 contains 748 names.

Much faster.

Thanks!
Tom
MMI DNA Services Core Facility<http://www.ohsu.edu/xd/research/research-cores/dna-analysis/>
503-494-2442
kellert at ohsu.edu<http://ohsu.edu>
Office: 6588 RJH (CROET/BasicScience)

OHSU Shared Resources<http://www.ohsu.edu/xd/research/research-cores/index.cfm>






On Nov 4, 2011, at 3:39 PM, Kris Bosland wrote:

I would suggest smashing the strings into classes and then you only have to make comparisons within the classes.

For example, you could find the Soundex(tm) code of each name.

I don't know the details of your naming conventions, so you would need to work out an algorithm that puts all the possible matches into one class. False positives are fine (they make a class larger, but it is still small compared to your bigger set), but you don't want false negatives (names that should match landing in different classes from each other).
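
A minimal sketch of the class idea, for illustration only. The class key used here (the first run of digits in a name), the sample names, and the $match comparison are all assumptions made up for this example; a Soundex code or any other key would slot into class_key() the same way, as long as names that should match always get the same key.

```perl
use strict;
use warnings;

# Hypothetical class key: the first run of digits in the name.
# Names without digits all share the catch-all class ''.
sub class_key {
    my ($name) = @_;
    return $name =~ /(\d+)/ ? $1 : '';
}

# Return the names from the first list that match some name in the
# second list, running the expensive $match test only within a class.
sub common_by_class {
    my ($names1, $names2, $match) = @_;

    # Bucket the second list once: class key => [names].
    my %class2;
    push @{ $class2{ class_key($_) } }, $_ for @$names2;

    my @common;
    for my $n1 (@$names1) {
        my $candidates = $class2{ class_key($n1) } or next;  # no partner class
        push @common, $n1 if grep { $match->($n1, $_) } @$candidates;
    }
    return @common;
}

# Made-up sample data.
my @names1 = qw(probe-A_21 probe-B_7a probe-C_155 probe-D_99);
my @names2 = qw(a21 C-155 b_7a);

# Placeholder comparison: lower-case both names, drop '-' and '_',
# then test whether the second is a substring of the first.
my $match = sub {
    my ($x, $y) = map { (my $s = lc) =~ s/[-_]//g; $s } @_;
    return index($x, $y) >= 0;
};

my @common = common_by_class(\@names1, \@names2, $match);
print "$_\n" for @common;
```

Each name in the first list is compared only against the handful of names sharing its key, rather than against the whole second list.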

My $0.002

<lurk/>

-Kris

2011/11/4 Tom Keller <kellert at ohsu.edu>
Greetings,
I have two very long lists of names. They have slightly different conventions for naming the same thing, so I devised a regex to compare the two lists. I need to extract the names common to both. (Acknowledgement: "Effective Perl Programming, 1st ed.")
But it is taking an ungodly amount of time, since
names1 contains 46227 names.
names2 contains 5726 names.

Here's the code:
########
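my @names1 = get_names($file1);
my @names2 = get_names($file2);
#say join(", ", @names1);

# Pull the distinguishing fragment out of each name in @names2;
# names that do not match the pattern contribute nothing to @out.
my @out = map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;

# Collect the indices of the @names1 entries that contain any fragment.
# Every fragment is tried against every name, hence the long run time.
my @index = grep {
    my $c = $_;
    if ( $c > $#names1  or # always false
         ( grep { $names1[$c] =~ m/$_/ } @out ) > 0 ) {
        1;  ## save
    } else {
        0;  ## skip
    }
} 0 .. $#names1;

my @common = map { $names1[$_] } @index;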
########

Is there a faster/better way to do this?
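
One possible alternative, sketched under assumptions: instead of trying every fragment against every name, join the fragments into a single alternation, compile it once with qr//, and make one pass over the first list. The sample names are made up. quotemeta is added as a precaution; the fragments captured by the regex above contain only word characters and hyphens, so the matches should be the same as with the original m/$_/.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the two name lists.
my @names1 = qw(lab-abc-21 lab-def-7a core-abc-155 lab-abc-99b);
my @names2 = qw(x-abc-21 y-def-7a z-abc-155);

# Same fragment-extraction regex as in the code above.
my @out = map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;

my @common;
if (@out) {    # an empty alternation would match every name
    # Build one pattern that matches any of the fragments.
    my $alternation  = join '|', map { quotemeta($_) } @out;
    my $any_fragment = qr/$alternation/;

    # One regex match per name instead of one per (name, fragment) pair.
    @common = grep { $_ =~ $any_fragment } @names1;
}

print "$_\n" for @common;
```

With thousands of fragments the pattern gets long, but it is compiled only once, and Perl 5.10 and later can optimize an alternation of literal strings internally.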

thanks,
Tom