[Pdx-pm] common elements in two lists

Anthony Foiani tkil at scrye.com
Fri Nov 4 20:22:45 PDT 2011


Tom Keller <kellert at ohsu.edu> writes:

> They have slightly different conventions for naming the same thing

If you could give examples (or even just explain what you mean by the
above sentence), we can probably provide better help.

* If the values are textually identical, then a hash is probably the
  best choice:

    my %in_names_1;
    @in_names_1{ @names1 } = ( 1 ) x @names1;
    my @common = grep { $in_names_1{$_} } @names2;

* If each name can be filed into an unambiguous class, then consider
  using a hash-of-lists:

    my %names_in_class;
    foreach my $name ( @names1 ) {
       my $class = classify( $name );
       push @{ $names_in_class{ $class } }, $name;
    }
    foreach my $name ( @names2 ) {
       my $class = classify( $name );
       do_fancy_match( $name, $names_in_class{ $class } );
    }
   
  (Where "do_fancy_match" does more work; since we've reduced the
  search space by filing names into classes, it might be ok to even do
  quadratic or worse within that particular function.)

  (Note that this is the algorithm needed for a SoundEx match, where
  the SoundEx is the category.  Likewise, the "scrabble match" --
  where the class is simply the sorted multiset of letters in the
  originial name -- is of this style as well.)

Finally, if there's really no way to classify things, and every $name1
must be compared against every $name2, then just remember that 250M
comparisons is really not that scary a number anymore.  Not something
you'd want to do to render a web page, but modern hardware ought to be
able to hammer that out in a matter of minutes if not seconds.

All you can do is make the comparison routine as efficient as
possible, which might include preprocessing the two arrays.  (Which is
really just the pathologic case where every name has its own class;
the role of the classes here is to allow for faster comparisons
between words.)

(A simple example here would be trying to match [sub-]domain names:
"www.google.com" and "google.com" require at least a substring search,
possibly two; but by preprocessing the data to reverse each name, then
sorting the reversed names, you suddenly find that "moc.elgoog" and
"moc.elgoog.www" are right next to each other.)

HTH,
t.


More information about the Pdx-pm-list mailing list