[sf-perl] Sorting vs. hash

Fri Mar 28 13:22:45 PDT 2008

Sure.  See below.  I think the sorting trick is fastest.

use strict;
use warnings;
use Benchmark;

# starting point: one big fat array (Neil says 250K).
# but let's assume there are some dupes.
my @input;
my @ltr        = ( 'a' .. 'z' );
my $dupe_count = 0;
while ( @input < 250000 ) {
    my $t = join( '', map { $ltr[ rand(@ltr) ] } ( 1 .. rand(40) + 10 ) );
    push( @input, $t );

    # Eliminating dupes is the point.
    if ( @input / 20 == int( @input / 20 ) ) {
        unshift( @input, $t );
        $dupe_count++;
    }
}
printf "%d strings (%d dupes)\n", scalar(@input), $dupe_count;

timethese(
    100,
    {

        SortStyle  => \&array_strip_without_hash,
        SeenStyle  => \&array_strip_with_hash,
        SimpleHash => \&hash_strip,
        SortedHash => \&hash_strip_sort
    }
);

sub array_strip_without_hash {

    # This strikes me as really dumb, but...
    my @new_array;
    my $last_item = '';
    foreach my $item ( sort @input ) {
        next if $item eq $last_item;
        $last_item = $item;
        push( @new_array, $item );

    }
    die "oops" unless @new_array == @input - $dupe_count;

}

sub array_strip_with_hash {

    # My thinking is this is also dumb.
    my @new_array;
    my %seen;
    foreach my $item (@input) {
        next if $seen{$item};
        push( @new_array, $item );
        $seen{$item} = 1;
    }
    die "oops" unless @new_array == @input - $dupe_count;

}

sub hash_strip {

    my %new_hash = map { $_ => 1 } @input;
    my @new_array = keys %new_hash;
    die "oops" unless @new_array == @input - $dupe_count;

}

sub hash_strip_sort {

    # because it's more useful.
    my %new_hash = map { $_ => 1 } @input;
    my @new_array = sort keys %new_hash;
    die "oops" unless @new_array == @input - $dupe_count;
}

On Friday, March 28, 2008, at 01:16PM, "Neil Heller" <nheller at silcon.com> wrote:
>It's easy to fill an array with 250,0000 or more strings?
>
>Neil Heller
>510-862-4387
>
>
>-----Original Message-----
>From: sanfrancisco-pm-bounces+nheller=silcon.com at pm.org
>[mailto:sanfrancisco-pm-bounces+nheller=silcon.com at pm.org] On Behalf Of Andy
>Lester
>Sent: Friday, March 28, 2008 12:52 PM
>To: San Francisco Perl Mongers User Group
>Subject: Re: [sf-perl] Sorting vs. hash
>
>
>On Mar 28, 2008, at 2:44 PM, Neil Heller wrote:
>> Is it faster to sort the array and then iterate once through the list
>> (writing selected ones to a new array) or write them all into a hash  
>> (at
>> which time duplicates would be naturally eliminated)?
>
>
>It's easy enough to test it and not have to speculate idly.
>
>--
>Andy Lester => andy at petdance.com => www.petdance.com => AIM:petdance
>
>
>
>
>_______________________________________________
>SanFrancisco-pm mailing list
>SanFrancisco-pm at pm.org
>http://mail.pm.org/mailman/listinfo/sanfrancisco-pm
>
>_______________________________________________
>SanFrancisco-pm mailing list
>SanFrancisco-pm at pm.org
>http://mail.pm.org/mailman/listinfo/sanfrancisco-pm
>
>