[Pdx-pm] hash from array question
Eric Wilhelm
scratchcomputing at gmail.com
Sat May 26 01:38:27 PDT 2007
# from Thomas Keller
# on Friday 25 May 2007 10:55 pm:
> ( grep { $lines[$c] =~ m/$_/ } @names ) > 0 )
> { 1; #yes, select it
>...
>I can make a hash, but I was thinking it would be faster to (somehow)
>grep the lines of the file that I needed, and then make a hash of
>just the stuff I need to work with.
I agree with Andy. In the case of exact string match, "thinking hashes"
is definitely faster.
In this case, you've already read the entire file into @lines, just not
done the split? The question is then whether looping over @names *
@lines with m/$_/ as in your grep is quicker than split /\t/ on every
line (including some which may then be skipped). (BTW, you may need to
anchor that as m/^$_\t/ unless you're certain that the data is never
going to have $name in a later field.)
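That is, your quoted grep would become something like this (untested;
the \Q is my addition, in case a name ever contains regex
metacharacters):

  # anchored so a name can only match the first tab-delimited field
  ( grep { $lines[$c] =~ m/^\Q$_\E\t/ } @names ) > 0

But with a hash of the names, the whole thing becomes: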
  my %have_name = map({$_ => 1} @names);
  while(my $line = <$gene_fh>) {
    my ($name, $else) = split(/\t/, $line, 2);  # peel off just the first field
    $have_name{$name} or next;                  # skip names we don't want
    chomp($else);
    push(@genes_of_interest, [split(/\t/, $else)]);
  }
Whether the two-part split gives any speed advantage over
($name, @genes) = split(/\t/, $line) with the chomp done first is a
good question for the benchmark.
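Something like this with the core Benchmark module would answer it
(untested sketch, assuming the lines are already slurped into @lines):

  use Benchmark qw(cmpthese);
  cmpthese(-2, {
    two_part => sub {
      for my $line (@lines) {
        my ($name, $else) = split(/\t/, $line, 2);
        chomp($else);
        my @genes = split(/\t/, $else);
      }
    },
    one_split => sub {
      for my $line (@lines) {
        my $copy = $line;          # don't clobber @lines for the other sub
        chomp($copy);
        my ($name, @genes) = split(/\t/, $copy);
      }
    },
  });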
But, I'm pretty certain that $have_name{$name} is going to go faster
than grep({$_ eq $name} @names). The truth-hash idiom almost always
wins.
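If you want numbers for that too, a rough (untested) check, with the
worst case being a name at the end of the list:

  use Benchmark qw(cmpthese);
  my $name = $names[-1];    # the linear scan has to walk the whole list
  cmpthese(-2, {
    hash_lookup => sub { my $hit = $have_name{$name} },
    grep_scan   => sub { my $hit = grep({$_ eq $name} @names) },
  });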
I'm not sure you want both data sets in hashes though. Particularly if
order is important and if $name could occur on more than one line of
the input.
Then again, thousands of lines isn't that many. Maybe tens of thousands
is cause for worry. Hundreds of thousands is *maybe* cause for sqlite.
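If it ever comes to that, the sqlite route via DBI/DBD::SQLite is
roughly this (only a sketch, and the table/column names are made up):

  use DBI;
  my $dbh = DBI->connect("dbi:SQLite:dbname=genes.db", "", "",
    {RaiseError => 1});
  $dbh->do("CREATE TABLE IF NOT EXISTS genes (name TEXT, data TEXT)");
  $dbh->do("CREATE INDEX IF NOT EXISTS genes_name ON genes (name)");
  my $ins = $dbh->prepare("INSERT INTO genes (name, data) VALUES (?, ?)");
  $dbh->begin_work;            # one transaction keeps the bulk insert sane
  while(my $line = <$gene_fh>) {
    chomp($line);
    my ($name, $else) = split(/\t/, $line, 2);
    $ins->execute($name, $else);
  }
  $dbh->commit;
  for my $name (@names) {
    my $rows = $dbh->selectall_arrayref(
      "SELECT data FROM genes WHERE name = ?", undef, $name);
    push(@genes_of_interest, map({[split(/\t/, $_->[0])]} @$rows));
  }

But at your scale, the whole filter can just be the map/grep version: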
  my %have_names = map({$_ => 1} @names);
  my @genes_of_interest =
    grep({$have_names{$_->[0]}}                    # first field is the name
      map({chomp; [split(/\t/, $_)]} <$gene_fh>)); # note: slurps the whole file
--Eric
--
"Insert random misquote here"
---------------------------------------------------
http://scratchcomputing.com
---------------------------------------------------