[Pdx-pm] string comparison vs hash
Thomas J Keller
kellert at ohsu.edu
Tue May 29 13:52:25 PDT 2007
Hi,
I was curious about the fastest way to pull a subset of lines from a
file, where a line of interest is one whose first field matches one of
the names in a list, and I got some really helpful solutions.
Background: the input file has a fixed structure and is in this case
only some 4,000 lines. It could be much larger, hence my interest in
speed.
I've tested the two most straightforward approaches: a string
comparison (a regex alternation of the names) and a hash lookup:
use Benchmark qw(cmpthese);
use IO::File;

# @names was declared in main::
# ($count and $annot_file are set elsewhere)
my $fh1 = IO::File->new;
my $fh2 = IO::File->new;
cmpthese( $count,
    {
        'with_string_cmp' => sub {
            # Build one alternation out of all the names
            my $names_join = join '|', @names;
            my @goi;
            if ($fh1->open("< $annot_file")) {
                my @lines = <$fh1>;
                foreach (@lines) {
                    chomp;
                    push @goi, $_ if $_ =~ m/($names_join)/;
                }
            } else { die "Could not open $annot_file: $!." }
            $fh1->close;
        },
        'with_hash' => sub {
            my %have_name = map({$_ => 1} @names);
            if ($fh2->open("< $annot_file")) {
                my $header = <$fh2>;
                while(my $line = <$fh2>) {
                    chomp $line;
                    my ($name,$else) = split(/\t/, $line, 2);
                    # Skip lines whose first field is not a wanted name
                    next unless exists $have_name{$name};
                    $have_name{$name} = [split(/\t/, $else)];
                }
            } else { die "Could not open $annot_file: $!." }
            $fh2->close;
        }
    });
I think this is a fair comparison. The data that gets saved is the same.
(Though the hash is easier to get at down the road.)
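For a like-for-like check of the two approaches on the first field only,
here is a minimal sketch. The sample data and the sub names
filter_by_regex and filter_by_hash are made up for illustration, and it
assumes tab-separated lines with the name in the first field. The regex
is anchored and the names are passed through quotemeta, so a name that
appears later in the line, or as a prefix of a longer name, does not
count as a match.

```perl
use strict;
use warnings;

# Hypothetical sample data: tab-separated, name in the first field.
my @names = qw(geneA geneC);
my @lines = (
    "geneA\t1\tfoo",
    "geneB\t2\tgeneA",   # mentions geneA, but not in the first field
    "geneC\t3\tbar",
    "geneAB\t4\tbaz",    # first field only starts with a wanted name
);

# Regex approach: one alternation, anchored to the whole first field.
sub filter_by_regex {
    my ($names, $lines) = @_;
    my $alt = join '|', map { quotemeta } @$names;
    my $re  = qr/^(?:$alt)(?:\t|\z)/;
    return grep { $_ =~ $re } @$lines;
}

# Hash approach: split off the first field and test for membership.
sub filter_by_hash {
    my ($names, $lines) = @_;
    my %want = map { $_ => 1 } @$names;
    return grep {
        my ($name) = split /\t/, $_, 2;
        defined $name && exists $want{$name};
    } @$lines;
}

my @by_regex = filter_by_regex(\@names, \@lines);
my @by_hash  = filter_by_hash(\@names, \@lines);
print scalar(@by_regex), " lines via regex, ", scalar(@by_hash), " via hash\n";
```

Both subs should pick out the same lines here (the geneA and geneC
rows), which is the property you want before comparing their speed.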
... drumroll ...
(warning: too few iterations for a reliable count)
                  Rate with_string_cmp       with_hash
with_string_cmp 1.45/s              --            -87%
with_hash       11.4/s            687%              --
Hashes rule!
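On the "too few iterations" warning: Benchmark accepts a negative count,
which it reads as "run each sub for at least that many CPU seconds"
rather than a fixed number of iterations. A minimal sketch of that,
using made-up in-memory data (the gene names, column values and sub
labels are placeholders, not the real annotation file):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-ins for the name list and the file's lines.
my @names = map { 'gene' . $_ } 1 .. 50;
my @lines = map { 'gene' . $_ . "\tcol2\tcol3" } 1 .. 4000;

my %want = map { $_ => 1 } @names;
my $alt  = join '|', map { quotemeta } @names;
my $re   = qr/^(?:$alt)\t/;

my %subs = (
    # Anchored alternation against the first field
    with_regex => sub {
        my @goi = grep { $_ =~ $re } @lines;
        return scalar @goi;
    },
    # Split off the first field, then a hash membership test
    with_hash => sub {
        my @goi = grep { my ($n) = split /\t/, $_, 2; exists $want{$n} } @lines;
        return scalar @goi;
    },
);

# -1 means: run each sub for at least 1 CPU second
my $rows = cmpthese( -1, \%subs );
```

A larger value such as -5 takes longer but gives steadier rates. The
numbers from this toy data say nothing about the real file, since
everything is already in memory and no I/O is timed.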
Thanks Eric, Ben, Rafael and Andy for your helpful suggestions.
Tom
kellert at ohsu.edu
503-494-2442