[Pdx-pm] string comparison vs hash

Thomas J Keller kellert at ohsu.edu
Tue May 29 13:52:25 PDT 2007

I was curious about the fastest way to get a subset of lines from a
file, where a line of interest is one whose first field matches one of
the names in a list of names, and got some really helpful suggestions.

Background: the input file has a fixed structure and is in this case
only some 4,000 lines. It could be much larger, hence my interest in
speed.

I've tested the two most straightforward approaches: a string
comparison and a hash method.

use Benchmark qw(cmpthese);
use IO::File;

# @names, $annot_file and $count were declared in main::

my $fh1 = IO::File->new;
my $fh2 = IO::File->new;
cmpthese( $count, {
     'with_string_cmp' => sub {
         # One alternation of all names; note it can match anywhere in the line.
         my $names_join = join '|', @names;
         my @goi;
         if ($fh1->open("< $annot_file")) {
             my @lines = <$fh1>;
             foreach (@lines) {
                 push @goi, $_ if $_ =~ m/($names_join)/;
             }
             $fh1->close;
         } else { die "Could not open $annot_file: $!." }
     },

     'with_hash' => sub {
         my %have_name = map({$_ => 1} @names);
         if ($fh2->open("< $annot_file")) {
             my $header = <$fh2>;
             while(my $line = <$fh2>) {
                 my ($name,$else) = split(/\t/, $line, 2);
                 $have_name{$name} or next;   # skip names not in the list
                 $have_name{$name} = [split(/\t/, $else)];
             }
             $fh2->close;
         } else { die "Could not open $annot_file: $!." }
     },
});

I think this is a fair comparison. The data that gets saved is the same.
(Though the hash is easier to get at down the road.)
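
For readers who want to try the hash approach without the benchmark harness, here is a minimal sketch on in-memory data. The sample lines and the names geneA, geneB and geneC are made up for illustration; only the technique (build the lookup hash once, then one probe per line on the first tab-separated field) comes from the post.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the annotation file:
# a header line, then tab-separated records keyed by the first field.
my @lines = (
    "name\tchrom\tstart\n",
    "geneA\tchr1\t100\n",
    "geneB\tchr2\t200\n",
    "geneC\tchr3\t300\n",
);
my @names = qw(geneA geneC);   # hypothetical names of interest

# Build the lookup table once; each line then costs one hash probe.
my %have_name = map { $_ => 1 } @names;

my %goi;
shift @lines;                  # drop the header line
for my $line (@lines) {
    chomp(my $copy = $line);
    my ($name, $rest) = split /\t/, $copy, 2;
    next unless $have_name{$name};          # not a name of interest
    $goi{$name} = [ split /\t/, $rest ];    # keep the remaining fields
}

print join(",", sort keys %goi), "\n";   # prints "geneA,geneC"
```

Keeping the matches in a separate hash (%goi here) leaves the lookup table untouched, so names that never appear in the file do not end up mixed in with the results.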

.. drumroll ...
             (warning: too few iterations for a reliable count)
                   Rate with_string_cmp       with_hash
with_string_cmp 1.45/s              --            -87%
with_hash       11.4/s            687%              --

Hashes rule!
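
A note on the regex side: the pattern in the benchmark is unanchored, so it can match a name anywhere in the line, and any regex metacharacters in a name are interpreted. The sketch below shows one possible way to restrict the match to the first field. It was not part of the benchmark above, the sample names and lines are hypothetical, and no timing claim is made for it.

```perl
use strict;
use warnings;

my @names = qw(geneA gene.C);  # hypothetical names; note the "." metacharacter
my @lines = (
    "geneA\tchr1\t100\n",
    "geneB\tchr2\tnear geneA\n",   # mentions geneA, but not in field one
    "geneXC\tchr3\t300\n",         # would match an unescaped "gene.C"
    "gene.C\tchr4\t400\n",
);

# Escape each name, anchor at line start, and require the tab that ends field one.
my $alternation = join '|', map { quotemeta($_) } @names;
my $first_field = qr/^(?:$alternation)\t/;

my @goi = grep { $_ =~ $first_field } @lines;
print scalar(@goi), " lines kept\n";     # prints "2 lines kept"
```

Compiling the pattern once with qr// also avoids re-interpolating the joined names for every line.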

Thanks Eric, Ben, Rafael and Andy for your helpful suggestions.

kellert at ohsu.edu
