[Pdx-pm] common elements in two lists

benh ben.hengst at gmail.com
Fri Nov 4 20:41:59 PDT 2011


I'd have to do some benchmarking to get the numbers but it seems like
you could just take the entries and key them with a count as a value
then just take them based on the value being > 1.

my %stash;
foreach( get_names($file1)
             , map { $_ =~  m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } get_names($file2)
             ) {
  $stash{$_}++;
}
my @intersect = grep{$stash{$_} > 1} keys %stash;


I'm not sure how much faster this would be rather then building two
hashes and doing the comparison like this:

my %file1 = map{ $_ => 1 } get_names($file1);
my @intersect = grep{$file1{$_}} map { $_ =~
m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } get_names($file2);

On Fri, Nov 4, 2011 at 20:22, Anthony Foiani <tkil at scrye.com> wrote:
> Tom Keller <kellert at ohsu.edu> writes:
>
>> They have slightly different conventions for naming the same thing
>
> If you could give examples (or even just explain what you mean by the
> above sentence), we can probably provide better help.
>
> * If the values are textually identical, then a hash is probably the
>  best choice:
>
>    my %in_names_1;
>    @in_names_1{ @names1 } = ( 1 ) x @names1;
>    my @common = grep { $in_names_1{$_} } @names2;
>
> * If each name can be filed into an unambiguous class, then consider
>  using a hash-of-lists:
>
>    my %names_in_class;
>    foreach my $name ( @names1 ) {
>       my $class = classify( $name );
>       push @{ $names_in_class{ $class } }, $name;
>    }
>    foreach my $name ( @names2 ) {
>       my $class = classify( $name );
>       do_fancy_match( $name, $names_in_class{ $class } );
>    }
>
>  (Where "do_fancy_match" does more work; since we've reduced the
>  search space by filing names into classes, it might be ok to even do
>  quadratic or worse within that particular function.)
>
>  (Note that this is the algorithm needed for a SoundEx match, where
>  the SoundEx is the category.  Likewise, the "scrabble match" --
>  where the class is simply the sorted multiset of letters in the
>  originial name -- is of this style as well.)
>
> Finally, if there's really no way to classify things, and every $name1
> must be compared against every $name2, then just remember that 250M
> comparisons is really not that scary a number anymore.  Not something
> you'd want to do to render a web page, but modern hardware ought to be
> able to hammer that out in a matter of minutes if not seconds.
>
> All you can do is make the comparison routine as efficient as
> possible, which might include preprocessing the two arrays.  (Which is
> really just the pathologic case where every name has its own class;
> the role of the classes here is to allow for faster comparisons
> between words.)
>
> (A simple example here would be trying to match [sub-]domain names:
> "www.google.com" and "google.com" require at least a substring search,
> possibly two; but by preprocessing the data to reverse each name, then
> sorting the reversed names, you suddenly find that "moc.elgoog" and
> "moc.elgoog.www" are right next to each other.)
>
> HTH,
> t.
> _______________________________________________
> Pdx-pm-list mailing list
> Pdx-pm-list at pm.org
> http://mail.pm.org/mailman/listinfo/pdx-pm-list
>



-- 
benh~

http://about.notbenh.info


More information about the Pdx-pm-list mailing list