[Pdx-pm] common elements in two lists
benh
ben.hengst at gmail.com
Fri Nov 4 20:41:59 PDT 2011
I'd have to do some benchmarking to get the numbers but it seems like
you could just take the entries and key them with a count as a value
then just take them based on the value being > 1.
my %stash;
foreach( get_names($file1)
, map { $_ =~ m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } get_names($file2)
) {
$stash{$_}++;
}
my @intersect = grep{$stash{$_} > 1} keys %stash;
I'm not sure how much faster this would be rather then building two
hashes and doing the comparison like this:
my %file1 = map{ $_ => 1 } get_names($file1);
my @intersect = grep{$file1{$_}} map { $_ =~
m/\w+[-_]*(\w*[-_]*\d+[a-z]*).*/ } get_names($file2);
On Fri, Nov 4, 2011 at 20:22, Anthony Foiani <tkil at scrye.com> wrote:
> Tom Keller <kellert at ohsu.edu> writes:
>
>> They have slightly different conventions for naming the same thing
>
> If you could give examples (or even just explain what you mean by the
> above sentence), we can probably provide better help.
>
> * If the values are textually identical, then a hash is probably the
> best choice:
>
> my %in_names_1;
> @in_names_1{ @names1 } = ( 1 ) x @names1;
> my @common = grep { $in_names_1{$_} } @names2;
>
> * If each name can be filed into an unambiguous class, then consider
> using a hash-of-lists:
>
> my %names_in_class;
> foreach my $name ( @names1 ) {
> my $class = classify( $name );
> push @{ $names_in_class{ $class } }, $name;
> }
> foreach my $name ( @names2 ) {
> my $class = classify( $name );
> do_fancy_match( $name, $names_in_class{ $class } );
> }
>
> (Where "do_fancy_match" does more work; since we've reduced the
> search space by filing names into classes, it might be ok to even do
> quadratic or worse within that particular function.)
>
> (Note that this is the algorithm needed for a SoundEx match, where
> the SoundEx is the category. Likewise, the "scrabble match" --
> where the class is simply the sorted multiset of letters in the
> originial name -- is of this style as well.)
>
> Finally, if there's really no way to classify things, and every $name1
> must be compared against every $name2, then just remember that 250M
> comparisons is really not that scary a number anymore. Not something
> you'd want to do to render a web page, but modern hardware ought to be
> able to hammer that out in a matter of minutes if not seconds.
>
> All you can do is make the comparison routine as efficient as
> possible, which might include preprocessing the two arrays. (Which is
> really just the pathologic case where every name has its own class;
> the role of the classes here is to allow for faster comparisons
> between words.)
>
> (A simple example here would be trying to match [sub-]domain names:
> "www.google.com" and "google.com" require at least a substring search,
> possibly two; but by preprocessing the data to reverse each name, then
> sorting the reversed names, you suddenly find that "moc.elgoog" and
> "moc.elgoog.www" are right next to each other.)
>
> HTH,
> t.
> _______________________________________________
> Pdx-pm-list mailing list
> Pdx-pm-list at pm.org
> http://mail.pm.org/mailman/listinfo/pdx-pm-list
>
--
benh~
http://about.notbenh.info
More information about the Pdx-pm-list
mailing list