[mplspm]: In a sorted list [better]

Dave Bianchi djb at tc.umn.edu
Fri Mar 22 11:14:27 CST 2002


In all this discussion, the assumption is that you are starting with a
sorted list of strings.  Perhaps that list should not have been created
in the first place?  Maybe a hash would have been a more appropriate
data structure?  With two million strings, maybe a hash mapped to a DBM
file would be appropriate, generated by one program and used by another?

Without more information about the whole problem, a binary search seems
appropriate.  But I would take a step back and look at the bigger
picture and perhaps implement the solution differently.

- Dave Bianchi

"John J. Trammell" wrote:
> 
> On Fri, Mar 22, 2002 at 09:40:54AM -0600, Jim Anderson wrote:
> > And with 2 million strings, building a hash is a non-trivial amount of
> > time.
> 
> You said it:
> 
> #!/usr/bin/perl -w
> use strict;
> use Benchmark;
> $| = 1;
> 
> my $binary = sub
> {
>     use POSIX qw[ floor ];
>     my ($x,$a) = @_;
>     my ($p,$q) = (0,scalar(@$a)-1);
>     LOOP:
>     {
>         return 1 if $a->[$p] eq $x;
>         return 1 if $a->[$q] eq $x;
>         my $mid = floor( ($p+$q)/2 );
>         return 0 if ($mid == $p) || ($mid == $q);
>         for ($a->[$mid])
>         {
>             $x eq $_ && do { return 1; };
>             $x lt $_ && do { ($p,$q) = ($p,$mid); last; };
>             $x gt $_ && do { ($p,$q) = ($mid,$q); last; };
>         }
>         redo LOOP;
>     }
> };
> 
> my $hash = sub
> {
>     my ($x,$a) = @_;
>     my %h = map { $_, 0 } @$a;
>     return exists $h{$x};
> };
> 
> {
>     print "benchmark for size 26\n";
>     my @a = ('a' .. 'z');
>     my $x = $a[rand(@a)];
>     my $y = "__foo__";
> 
>     timethese(50000,
>     {
>         binary => sub { $binary->($x,\@a); $binary->($y,\@a); },
>         hash =>   sub { $hash->($x,\@a); $hash->($y,\@a); },
>     });
> }
> 
> {
>     print "benchmark for size @{[ 26*26 ]}\n";
>     my @a = ('aa' .. 'zz');
>     my $x = $a[rand(@a)];
>     my $y = "__foo__";
> 
>     timethese(50000,
>     {
>         binary => sub { $binary->($x,\@a); $binary->($y,\@a); },
>         hash =>   sub { $hash->($x,\@a); $hash->($y,\@a); },
>     });
> }
> 
> __END__
> 
> [ ~ ] perl bar.pl
> benchmark for size 26
> Benchmark: timing 50000 iterations of binary, hash...
>     binary:  4 wallclock secs ( 3.68 usr +  0.00 sys =  3.68 CPU) @ 13586.96/s (n=50000)
>     hash:  7 wallclock secs ( 7.21 usr +  0.01 sys =  7.22 CPU) @ 6925.21/s (n=50000)
> benchmark for size 676
> Benchmark: timing 50000 iterations of binary, hash...
>     binary:  7 wallclock secs ( 6.86 usr +  0.00 sys =  6.86 CPU) @ 7288.63/s (n=50000)
>     hash: 187 wallclock secs (184.91 usr +  0.06 sys = 184.97 CPU) @ 270.31/s (n=50000)
> [ ~ ] ls -l
> 
> Admittedly this benchmark isn't perfect, but I think it points
> in the direction of the binary search solution scaling better.
> 
> --------------------------------------------------
> Minneapolis Perl Mongers mailing list
> 
> To unsubscribe, send mail to majordomo at pm.org
> with "unsubscribe mpls" in the body of the message.


--------------------------------------------------
Minneapolis Perl Mongers mailing list

To unsubscribe, send mail to majordomo at pm.org
with "unsubscribe mpls" in the body of the message.



More information about the Mpls-pm mailing list