[sf-perl] Sorting vs. hash

frosty biztos at mac.com
Fri Mar 28 13:28:38 PDT 2008


D'oh, no!  I take it back!  That'll teach me not to wait for my benchmarks to finish before posting.

It looks like the good old %seen trick is fastest on my MacBook Pro (sysperl, 5.8.8).
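FWIW the same %seen trick compresses into a grep one-liner.  Roughly this (untested as written here, but it's the usual idiom):

    my %seen;
    my @unique = grep { !$seen{$_}++ } @input;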

The hash-via-map stuff is a lot slower.  If I had more time to play with it I'd like to know which parts make it slower, but I have to get back to work.
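If anybody wants to chase that down, I'd probably split the map-into-hash step from the keys step and time them separately.  A rough sketch of what I mean (names made up, not benchmarked here):

    use Benchmark qw(timethese);

    # Build the hash once up front so KeysOnly measures just keys().
    my %prebuilt = map { $_ => 1 } @input;

    timethese( 100, {
        BuildHash => sub { my %h = map { $_ => 1 } @input },
        KeysOnly  => sub { my @k = keys %prebuilt },
    });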

My hunch is that if you really cared, you could make the sort thingy the fastest by using a different sort algo.
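For example, on 5.8 you can flip which algorithm sort() uses with the sort pragma.  Something like this is where I'd start (haven't actually timed it):

    {
        use sort '_quicksort';   # 5.8's default is '_mergesort'
        my @sorted = sort @input;
    }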

-- f.

On Friday, March 28, 2008, at 01:23PM, "frosty" <biztos at mac.com> wrote:
>Sure.  See below.  I think the sorting trick is fastest.
>
>use strict;
>use warnings;
>use Benchmark;
>
># starting point: one big fat array (Neil says 250K).
># but let's assume there are some dupes.
>my @input;
>my @ltr        = ( 'a' .. 'z' );
>my $dupe_count = 0;
>while ( @input < 250000 ) {
>    my $t = join( '', map { $ltr[ rand(@ltr) ] } ( 1 .. rand(40) + 10 ) );
>    push( @input, $t );
>
>    # Eliminating dupes is the point.
>    if ( @input % 20 == 0 ) {
>        unshift( @input, $t );
>        $dupe_count++;
>    }
>}
>printf "%d strings (%d dupes)\n", scalar(@input), $dupe_count;
>
>timethese(
>    100,
>    {
>
>        SortStyle  => \&array_strip_without_hash,
>        SeenStyle  => \&array_strip_with_hash,
>        SimpleHash => \&hash_strip,
>        SortedHash => \&hash_strip_sort
>    }
>);
>
>sub array_strip_without_hash {
>
>    # This strikes me as really dumb, but...
>    my @new_array;
>    my $last_item = '';
>    foreach my $item ( sort @input ) {
>        next if $item eq $last_item;
>        $last_item = $item;
>        push( @new_array, $item );
>
>    }
>    die "oops" unless @new_array == @input - $dupe_count;
>
>}
>
>sub array_strip_with_hash {
>
>    # My thinking is this is also dumb.
>    my @new_array;
>    my %seen;
>    foreach my $item (@input) {
>        next if $seen{$item};
>        push( @new_array, $item );
>        $seen{$item} = 1;
>    }
>    die "oops" unless @new_array == @input - $dupe_count;
>
>}
>
>sub hash_strip {
>
>    my %new_hash = map { $_ => 1 } @input;
>    my @new_array = keys %new_hash;
>    die "oops" unless @new_array == @input - $dupe_count;
>
>}
>
>sub hash_strip_sort {
>
>    # because it's more useful.
>    my %new_hash = map { $_ => 1 } @input;
>    my @new_array = sort keys %new_hash;
>    die "oops" unless @new_array == @input - $dupe_count;
>}
>
> 
>On Friday, March 28, 2008, at 01:16PM, "Neil Heller" <nheller at silcon.com> wrote:
>>It's easy to fill an array with 250,000 or more strings?
>>
>>Neil Heller
>>510-862-4387
>>
>>
>>-----Original Message-----
>>From: sanfrancisco-pm-bounces+nheller=silcon.com at pm.org
>>[mailto:sanfrancisco-pm-bounces+nheller=silcon.com at pm.org] On Behalf Of Andy
>>Lester
>>Sent: Friday, March 28, 2008 12:52 PM
>>To: San Francisco Perl Mongers User Group
>>Subject: Re: [sf-perl] Sorting vs. hash
>>
>>
>>On Mar 28, 2008, at 2:44 PM, Neil Heller wrote:
>>> Is it faster to sort the array and then iterate once through the list
>>> (writing selected ones to a new array) or write them all into a hash  
>>> (at
>>> which time duplicates would be naturally eliminated)?
>>
>>
>>It's easy enough to test it and not have to speculate idly.
>>
>>--
>>Andy Lester => andy at petdance.com => www.petdance.com => AIM:petdance
>>
>>
>>
>>
>>_______________________________________________
>>SanFrancisco-pm mailing list
>>SanFrancisco-pm at pm.org
>>http://mail.pm.org/mailman/listinfo/sanfrancisco-pm
>>
>_______________________________________________
>SanFrancisco-pm mailing list
>SanFrancisco-pm at pm.org
>http://mail.pm.org/mailman/listinfo/sanfrancisco-pm
>
>


More information about the SanFrancisco-pm mailing list