[Pdx-pm] hash from array question
Eric Wilhelm
scratchcomputing at gmail.com
Sat May 26 01:38:27 PDT 2007
# from Thomas Keller
# on Friday 25 May 2007 10:55 pm:
> ( grep { $lines[$c] =~ m/$_/ } @names ) > 0 )
> { 1; #yes, select it
>...
>I can make a hash, but I was thinking it would be faster to (somehow)
>grep the lines of the file that I needed, and then make a hash of
>just the stuff I need to work with.
I agree with Andy. In the case of exact string match, "thinking hashes"
is definitely faster.
In this case, you've already read the entire file into @lines, just not
done the split? The question is then whether looping over @names *
@lines with m/$_/ as in your grep is quicker than split /\t/ on every
line (including some which may then be skipped). (BTW, you may need to
anchor that as m/^$_\t/ unless you're certain that the data is never
going to have $name in a later field.)
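That is, your quoted grep would become something like this (untested;
the \Q is my addition, in case a name ever contains regex
metacharacters):

  # anchored so a name can only match the first tab-delimited field
  ( grep { $lines[$c] =~ m/^\Q$_\E\t/ } @names ) > 0

But with a hash of the names, the whole thing becomes: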
  my %have_name = map({$_ => 1} @names);
  while(my $line = <$gene_fh>) {
    my ($name, $else) = split(/\t/, $line, 2);  # peel off just the first field
    $have_name{$name} or next;                  # skip names we don't want
    chomp($else);
    push(@genes_of_interest, [split(/\t/, $else)]);
  }
Whether the two-part split gives any speed advantage over
($name, @genes) = split(/\t/, $line) with the chomp done first is a
good question for the benchmark.
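Something like this with the core Benchmark module would answer it
(untested sketch, assuming the lines are already slurped into @lines):

  use Benchmark qw(cmpthese);
  cmpthese(-2, {
    two_part => sub {
      for my $line (@lines) {
        my ($name, $else) = split(/\t/, $line, 2);
        chomp($else);
        my @genes = split(/\t/, $else);
      }
    },
    one_split => sub {
      for my $line (@lines) {
        my $copy = $line;          # don't clobber @lines for the other sub
        chomp($copy);
        my ($name, @genes) = split(/\t/, $copy);
      }
    },
  });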
But, I'm pretty certain that $have_name{$name} is going to go faster
than grep({$_ eq $name} @names). The truth-hash idiom almost always
wins.
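If you want numbers for that too, a rough (untested) check, with the
worst case being a name at the end of the list:

  use Benchmark qw(cmpthese);
  my $name = $names[-1];    # the linear scan has to walk the whole list
  cmpthese(-2, {
    hash_lookup => sub { my $hit = $have_name{$name} },
    grep_scan   => sub { my $hit = grep({$_ eq $name} @names) },
  });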
I'm not sure you want both data sets in hashes though. Particularly if
order is important and if $name could occur on more than one line of
the input.
Then again, thousands of lines isn't that many. Maybe tens of thousands
is cause for worry. Hundreds of thousands is *maybe* cause for sqlite.
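If it ever comes to that, the sqlite route via DBI/DBD::SQLite is
roughly this (only a sketch, and the table/column names are made up):

  use DBI;
  my $dbh = DBI->connect("dbi:SQLite:dbname=genes.db", "", "",
    {RaiseError => 1});
  $dbh->do("CREATE TABLE IF NOT EXISTS genes (name TEXT, data TEXT)");
  $dbh->do("CREATE INDEX IF NOT EXISTS genes_name ON genes (name)");
  my $ins = $dbh->prepare("INSERT INTO genes (name, data) VALUES (?, ?)");
  $dbh->begin_work;            # one transaction keeps the bulk insert sane
  while(my $line = <$gene_fh>) {
    chomp($line);
    my ($name, $else) = split(/\t/, $line, 2);
    $ins->execute($name, $else);
  }
  $dbh->commit;
  for my $name (@names) {
    my $rows = $dbh->selectall_arrayref(
      "SELECT data FROM genes WHERE name = ?", undef, $name);
    push(@genes_of_interest, map({[split(/\t/, $_->[0])]} @$rows));
  }

But at your scale, the whole filter can just be the map/grep version: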
  my %have_names = map({$_ => 1} @names);
  my @genes_of_interest =
    grep({$have_names{$_->[0]}}                    # first field is the name
      map({chomp; [split(/\t/, $_)]} <$gene_fh>)); # note: slurps the whole file
--Eric
--
"Insert random misquote here"
---------------------------------------------------
http://scratchcomputing.com
---------------------------------------------------