[Chicago-talk] Data structures in Perl

Steven Lembark lembark at wrkhors.com
Mon Feb 16 14:09:05 CST 2004


People have asked about data structures in Perl before.
Here is an example of one case where I saved a huge amount
of time by switching from arrays to hashes for storing
sorted, numerically indexed values.

The basic problem is to run whole-gene comparisons of
organisms by sucking their gens and dna from GenBank
files. Genes with length mismatches more than 15% are
not worth making detailed comparisions of and can be
pruned up front.

One way to handle this is:

	my $min_mult = 1 - $length_factor;
	my $max_mult = 1 + $length_factor;

	# @listX = ( [ $name, $dna ], [$name, $dna] ... )

	for my $g1 ( @list1 )
	{
		for my $g2 ( @list2 )
		{
			my $i = length $g1;
			my $j = $i * $min_mult;
			my $k = $i * $max_mult;
			my $l = length $g2;

			next unless $j <= $l && $l <= $k;

			compare_genes $g1, $g2
		}
	}

Problem with this method is that I am running a cartesian
product and have to recompute $i, $j, $k, & $l millions
of times.

The first improvement I made was using arrays to store the
genes by length (saves re-computing that each time). At that
the comparision is replaced by an array slice:

	# push @{ $listX[length $dna] }, [ $name, $dna ]

	for my $g1 ( @list1 )
	{
		my $i = length $g1;
		my $j = $i * $min_mult;
		my $k = $i * $max_mult;

		for my $g2 ( @list2[$j..$k] )
		{
			# compare_genes now does its own
			# product of @$g1 : @$g2

			compare_genes $g1, @$g2
		}
	}

This still leves me computing $i, $j, & $k each time.
The array slices are not all that effecient, partly
because the arrays are long partly because I have to re-
scan up to $min each time. I could shift things off the
front to save work, but would then have to re-compute
the effective offset for the lengths each pass which
adds work.

The usual rule is that hashes are better for storing
sparse data. So...

Instead of storing the genes on lists I put them into a
hash keyed by the gene length:

	$base->{$dna_length}{$gene_name} = $gene_dna_string;

For the base species (the one being compared to all of the
others) I can compute the lengths once and queue them for
more effecient run times in the cluster in one pass:

	my @length_que =
		map { [ $_, int $_ * $min_mult, int $_ * $max_mult ] }
		sort { $b <=> $a }
		keys %$base
	;

[Thankfully, Perl's base data storage is an encapsulated,
polymorphic object so I don't have to deal with hash
keys being strings, numeric sorts, numeric math, and
re-accessing the keys as strings. Anyone with a stronger
stomach than mine is welcome to try this in C++, Java,
or Python and see how much code it adds :-]


After that, iterating the input files is just a matter of
putting them into the same hash format as the base gene and
grabbing all of the matching length keys:

	for( @ARGV )
	{
		my( $name, $alt_genz ) = read_genbank $_;

		for( @length_que )
		{
			# syntatic sugar

			my( $l, $min, $max ) = @$_;

			# the hash slice delivers its values in
			# sorted order, the map expands the sub-
			# hashes to simplify the structure.

			next unless my %alt =
				map { (defined) ? %$_ : () }
				@{$alt_genz}{ $min..$max }
			;

			compare_genes $base{$l}, \%alt;
		}
	}

This allows running all of the math only once: the queue
built for effecient average runtime in the cluster, the
length values are computed only once, and the hash quickly
grabs the necessary values without reading over [a usually
long] array to get them multiple times.

This arrangement saved me around 20 minutes of runtime.
Along with a few other tweaks, it takes my test harness
for comparing the full genome of m.genitalium with m.pneumoniae
from 45 to 8 minutes. For comparison, BLAST2 or Fasta
take just over 40 hours to run the same set of comparisions.


Things to notice are that people normally use arrays for
numeric indexing and storing sorted values. For sparse or
very long arrays being sliced, however, it may be far more
effecient to deal with a hash using numbers for keys.

--
Steven Lembark                               2930 W. Palmer
Workhorse Computing                       Chicago, IL 60647
                                            +1 888 359 3508



More information about the Chicago-talk mailing list