[mplspm]: In a sorted list [better]
Dave Bianchi
djb at tc.umn.edu
Fri Mar 22 11:14:27 CST 2002
In all this discussion, the assumption is that you are starting with a
sorted list of strings. Perhaps that list should not have been created
in the first place? Maybe a hash would have been a more appropriate
data structure? With two million strings, maybe a hash mapped to a DBM
file would be appropriate, generated by one program and used by another?
Without more information about the whole problem, a binary search seems
appropriate. But I would take a step back and look at the bigger
picture and perhaps implement the solution differently.
- Dave Bianchi
"John J. Trammell" wrote:
>
> On Fri, Mar 22, 2002 at 09:40:54AM -0600, Jim Anderson wrote:
> > And with 2 million strings, building a hash is a non-trivial amount of
> > time.
>
> You said it:
>
> #!/usr/bin/perl -w
> use strict;
> use Benchmark;
> $| = 1;
>
> my $binary = sub
> {
> use POSIX qw[ floor ];
> my ($x,$a) = @_;
> my ($p,$q) = (0,scalar(@$a)-1);
> LOOP:
> {
> return 1 if $a->[$p] eq $x;
> return 1 if $a->[$q] eq $x;
> my $mid = floor( ($p+$q)/2 );
> return 0 if ($mid == $p) || ($mid == $q);
> for ($a->[$mid])
> {
> $x eq $_ && do { return 1; };
> $x lt $_ && do { ($p,$q) = ($p,$mid); last; };
> $x gt $_ && do { ($p,$q) = ($mid,$q); last; };
> }
> redo LOOP;
> }
> };
>
> my $hash = sub
> {
> my ($x,$a) = @_;
> my %h = map { $_, 0 } @$a;
> return exists $h{$x};
> };
>
> {
> print "benchmark for size 26\n";
> my @a = ('a' .. 'z');
> my $x = $a[rand(@a)];
> my $y = "__foo__";
>
> timethese(50000,
> {
> binary => sub { $binary->($x,\@a); $binary->($y,\@a); },
> hash => sub { $hash->($x,\@a); $hash->($y,\@a); },
> });
> }
>
> {
> print "benchmark for size @{[ 26*26 ]}\n";
> my @a = ('aa' .. 'zz');
> my $x = $a[rand(@a)];
> my $y = "__foo__";
>
> timethese(50000,
> {
> binary => sub { $binary->($x,\@a); $binary->($y,\@a); },
> hash => sub { $hash->($x,\@a); $hash->($y,\@a); },
> });
> }
>
> __END__
>
> [ ~ ] perl bar.pl
> benchmark for size 26
> Benchmark: timing 50000 iterations of binary, hash...
> binary: 4 wallclock secs ( 3.68 usr + 0.00 sys = 3.68 CPU) @ 13586.96/s (n=50000)
> hash: 7 wallclock secs ( 7.21 usr + 0.01 sys = 7.22 CPU) @ 6925.21/s (n=50000)
> benchmark for size 676
> Benchmark: timing 50000 iterations of binary, hash...
> binary: 7 wallclock secs ( 6.86 usr + 0.00 sys = 6.86 CPU) @ 7288.63/s (n=50000)
> hash: 187 wallclock secs (184.91 usr + 0.06 sys = 184.97 CPU) @ 270.31/s (n=50000)
> [ ~ ] ls -l
>
> Admittedly this benchmark isn't perfect, but I think it points
> in the direction of the binary search solution scaling better.
>
> --------------------------------------------------
> Minneapolis Perl Mongers mailing list
>
> To unsubscribe, send mail to majordomo at pm.org
> with "unsubscribe mpls" in the body of the message.
--------------------------------------------------
Minneapolis Perl Mongers mailing list
To unsubscribe, send mail to majordomo at pm.org
with "unsubscribe mpls" in the body of the message.
More information about the Mpls-pm
mailing list