[Pdx-pm] common elements in two lists

Kris Bosland kris at bosland.com
Fri Nov 4 15:39:28 PDT 2011


I would suggest hashing the strings into classes (buckets) first; then you
only have to make comparisons within each class instead of comparing every
name against every pattern.

For example, you could find the Soundex(tm) code of each name.

I don't know the details of your naming conventions, so you would need to
figure out an algorithm that gets all the possible matches into one class.
False positives are fine (they make a class larger, but it is still small
compared to your bigger set), but you don't want false negatives (names that
should match ending up in different classes).
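
Here is a rough sketch of what I mean, not a drop-in fix. The class_key()
function is purely hypothetical: it takes the first run of digits plus any
letters right after it, which is only a guess at your conventions, and you
could swap in a Soundex code or whatever rule fits. The sample names are
made up too. The point is that each name costs one hash lookup instead of a
scan over thousands of regexes.

```perl
use strict;
use warnings;

# Hypothetical class key (an assumption, not the real naming rule): the first
# run of digits plus any letters right after it, lowercased. Names without
# digits all fall into the '' class.
sub class_key {
    my ($name) = @_;
    return $name =~ /(\d+[a-z]*)/i ? lc $1 : '';
}

# Return the names from @$names1 whose class also occurs in @$names2.
sub common_by_class {
    my ($names1, $names2) = @_;

    # One pass over the shorter list builds the classes.
    my %bucket;
    push @{ $bucket{ class_key($_) } }, $_ for @$names2;

    # One pass over the longer list; each name is a single hash lookup.
    # A stricter check against @{ $bucket{$key} } could be added here
    # to weed out false positives within a class.
    return grep { exists $bucket{ class_key($_) } } @$names1;
}

# Made-up sample names, for illustration only.
my @names1 = ('abc-12a', 'abc-99', 'xyz_7b');
my @names2 = ('ABC_12A', 'foo-7b');

my @common = common_by_class(\@names1, \@names2);
print join(', ', @common), "\n";
```

If the key is too loose you just get bigger classes, and the expensive regex
comparison can then run inside each class only.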

My $0.002

<lurk/>

-Kris

2011/11/4 Tom Keller <kellert at ohsu.edu>

> Greetings,
> I have two very long lists of names. They have slightly different
> conventions for naming the same thing, so I devised a regex to compare the
> two lists. I need to extract the names common to both. (Acknowledgement:
> "Effective Perl Programming, 1st ed.")
> But it is taking an ungodly amount of time, since
> names1 contains 46227 names.
> names2 contains 5726 names.
>
> Here's the code:
> ########
> my @names1 = get_names($file1);
> my @names2 = get_names($file2);
> #say join(", ", @names1);
>
> my @out = map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } @names2;
> my @index = grep {
>     my $c = $_;
>     if ( $c > $#names1    # always false
>          or ( grep { $names1[$c] =~ m/$_/ } @out ) > 0 ) {
>         1;    ## save
>     } else {
>         0;    ## skip
>     }
> } 0 .. $#names1;
>
> my @common = map { $names1[$_] } @index;
> ########
>
> Is there a faster/better way to do this?
>
> thanks,
> Tom
> MMI DNA Services Core Facility<http://www.ohsu.edu/xd/research/research-cores/dna-analysis/>
> 503-494-2442
> kellert at ohsu.edu
> Office: 6588 RJH (CROET/BasicScience)
>
> OHSU Shared Resources<http://www.ohsu.edu/xd/research/research-cores/index.cfm>