[Pdx-pm] string comparison vs hash

Tue May 29 13:52:25 PDT 2007

Hi,
I was curious about the fastest way to get a subset of lines from a
file, where a line of interest is one whose first field matches one of
the names in a list of names, and got some really helpful solutions.
Background: the input file has a fixed structure and is in this case  
only some 4,000 lines. It could be much larger, hence my interest in  
speed.
I've tested the two most straightforward approaches, a string
comparison and a hash method:
# @names was declared in main::
use Benchmark qw(cmpthese);
use IO::File;

my $fh1 = IO::File->new;
my $fh2 = IO::File->new;
cmpthese( $count,
{
     'with_string_cmp' => sub {
         my $names_join = join '|', @names;
         my @goi;
         if ($fh1->open("< $annot_file")) {
             my @lines = <$fh1>;
             foreach (@lines) {
                 chomp;
                 push @goi, $_ if $_ =~ m/($names_join)/;
             }
         } else { die "Could not open $annot_file: $!." }
         $fh1->close;
     },

     'with_hash' => sub {
         my %have_name = map({$_ => 1} @names);
         if ($fh2->open("< $annot_file")) {
             my $header = <$fh2>;
             while(my $line = <$fh2>) {
                 my ($name,$else) = split(/\t/, $line, 2);
                 # keep only the lines whose first field is in @names
                 $have_name{$name} or next;
                 $have_name{$name} = [split(/\t/, $else)];
             }
         } else { die "Could not open $annot_file: $!." }
         $fh2->close;
     }
});

I think this is a fair comparison. The data that gets saved is the same.
(Though the hash is easier to get at down the road.)
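
For anyone who wants to check that the two approaches really pick out the
same lines, here is a minimal sketch that runs both on a few in-memory
lines instead of a file. The sample data and the sub names
(filter_with_regex, filter_with_hash) are made up for illustration and are
not from the benchmark above. The regex version here is anchored to the
first field and uses quotemeta, which the benchmarked version does not do,
so it is a stricter variant rather than the same code.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for @names and the annotation file.
my @names = qw(geneA geneC);
my @lines = (
    "name\tchrom\tstart",    # header line
    "geneA\tchr1\t100",
    "geneB\tchr2\t200",
    "geneC\tchr3\t300",
    "xgeneA\tchr4\t400",     # contains "geneA", but not as the first field
);

# Regex approach: anchor at the start of the line and require a tab after
# the name, so only the first field can match. quotemeta protects names
# that contain regex metacharacters.
sub filter_with_regex {
    my ($names, $lines) = @_;
    my $alternation = join '|', map { quotemeta($_) } @$names;
    my $re = qr/^(?:$alternation)\t/;
    my %found;
    for my $line (@$lines) {
        next unless $line =~ $re;
        my ($name, @rest) = split /\t/, $line;
        $found{$name} = \@rest;
    }
    return \%found;
}

# Hash approach: split off the first field and look it up.
sub filter_with_hash {
    my ($names, $lines) = @_;
    my %want = map { $_ => 1 } @$names;
    my %found;
    for my $line (@$lines) {
        my ($name, $rest) = split /\t/, $line, 2;
        next unless defined $rest && $want{$name};
        $found{$name} = [ split /\t/, $rest ];
    }
    return \%found;
}

my $by_regex = filter_with_regex(\@names, \@lines);
my $by_hash  = filter_with_hash(\@names, \@lines);

# Both approaches should select the same names.
print join(',', sort keys %$by_regex), "\n";
print join(',', sort keys %$by_hash),  "\n";
```

Both print lines should show geneA,geneC. The two subs could also be handed
to cmpthese in place of the file-reading closures if you want to time just
the matching and leave the I/O out of it.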

... drumroll ...
             (warning: too few iterations for a reliable count)
                   Rate with_string_cmp       with_hash
with_string_cmp 1.45/s              --            -87%
with_hash       11.4/s            687%              --

Hashes rule!

Thanks Eric, Ben, Rafael and Andy for your helpful suggestions.

Tom
kellert at ohsu.edu
503-494-2442
