[Chicago-talk] Updated notes for Bioinfo talk

Steven Lembark lembark at jeeves.wrkhors.com
Tue Sep 23 08:11:28 CDT 2003


The other version had too many PNGs in it. I've converted the whole
thing to flat ASCII -- enjoy the artwork :-)

-- 
Steven Lembark                                            2930 W. Palmer
Workhorse Computing                                    Chicago, IL 60647
                                                         +1 888 910 1206
------------------------------------------------------------------------
Throwing Bioinformatics a [W-]curve:
Perl for high-speed, high-volume DNA sequence comparison.
------------------------------------------------------------------------

A funny thing happened on the way to a petri dish: watching things
grow isn't enough any more. You have to look inside them, further
than a microscope can go.

Recent problems in biology can no longer be handled by cell cultures
alone. In particular, using the information contained in DNA involves
managing huge amounts of raw data in increasingly complex ways.

Academics are pursuing questions on the causes of disease and the
origin of species; pharmaceutical companies are throwing money at
improving the cycle of drug discovery.

All of this centers around finding faster, more flexible ways of
comparing the sequences of DNA that make up our genes.



By now, everyone has seen the representation of DNA as "base pairs"
as the letters "C", "G", "T", and "A". Their sequences make up the
genetic code. A common question, then, is why the comparisons are
so difficult when all the DNA is just a bunch of strings.

Answer: they aren't just strings.

DNA is not read as a single sequence but in chunks of 3 bases.
Given the [CATG] units, there are 64 possible codons -- but only
20 amino acids for them to encode. This leaves the third base in
each codon partially redundant. The number of codons per amino
acid ranges from six (ARG, SER) down to one (TRP), with most
amino acids having two or four encodings.
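
A quick way to see the redundancy (the codon-to-amino-acid entries
below are from the standard genetic code; the snippet itself is just
an illustration):

	use strict;
	use warnings;

	my %amino =
	(
		# third-base wobble: four codons, one amino acid.

		GCT => 'ALA',
		GCC => 'ALA',
		GCA => 'ALA',
		GCG => 'ALA',

		# tryptophan gets exactly one codon.

		TGG => 'TRP',
	);

	# read the DNA three bases at a time, as the ribosome does.

	my @codons = 'GCTGCATGG' =~ m/(...)/g;

	print "$_ => $amino{$_}\n" for @codons;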

Good news: biologically, we can survive copying errors.
Bad news: the variations make the comparisons really difficult.

The obvious fix would be to convert the DNA to regexen like

	/CA[TG]CT[CG]/

Catch: it doesn't work. The simple reason is that genes are thousands
of bases long, which leaves the regexen unmanageable. The real reason
is that eucaryote genes (pretty much everything from yeast on up)
have non-coding pieces of DNA named "introns" that don't have
any effect on the resulting protein. Of the 450_000 bases in
hemoglobin, about 85% of the DNA never gets "used": it's snipped
out of the RNA copy before that copy is used to make proteins.

The obvious fix here would be to ignore the introns -- which would
work if only we could find them reliably.

There are also repeats: short sequences of DNA repeated multiple
times. These affect the protein without changing it too much. One
example of this is a TATA repeat in hemoglobin that gives us A, B,
and O blood.

As proteins get larger -- and their DNA encoding longer -- the
range of changes that lead to largely equivalent but still different
proteins also grows.

Now try comparing genes between species...



... which is exactly what you need to do next. 

Microarrays are good for showing what is active in a cell -- for
example active genes in human liver cancer cells. 

The problem is that you probably don't want to grow humans in 
order to test genetic variations and cures. You want to call 
the company in Maine that makes the funny hairless mice. 

The problem now is finding the mouse gene that most closely
matches the human gene that, when it goes haywire, gives humans
liver cancer. This requires scanning the mouse chromosomes for
matches similar enough that they may perform the same function
in a mouse as in a person.



Existing techniques for making these comparisons perform
alignments between the various genes. This involves pairing up
bases in the sequences, introducing "gaps" where the two genes
don't seem to fit:

	...CATCTT---CATTGCGCC... 
	...CATCTTGCCCATTGCGCC... 

The problem here is that DNA repeats the same short sequences
many times, and two genes may not align at three-unit boundaries:

	...CATCTGTA--CTGATA...
	...CATCTGTATACTGATA...

Gaps are also recursive: adding one offsets all the bases
downstream, which may add new gaps or may require a larger
gap to push everything back into its proper place.

The most effective methods today -- BLAST, Fasta, Clustal --
cannot handle long genes or repeats well, and introns not at all.

Face it: for comparing genes, strings just don't cut it.

Curves, however, may.



Approximation techniques work by trading degrees of freedom
in the problem set for a more inclusive outcome. With only one
dimension to start with, strings simply don't have anything
to give up.

One way of converting a string of DNA into a usable curve is
by using it as input to a finite state machine -- similar to
Dr. Wolfram's cellular automata. At each step the "bug" does
whatever the next base in the DNA sequence tells it to. This
can convert a linear string into whatever convoluted shape the
rules dictate. Given a useful set of rules, the shape produced
can be compared using approximation techniques.

One example is to take a square with corners at [+/-1,+/-1].
Label one set of opposing corners "A" and "T", the others "C"
and "G". Label the bases along the gene [1..N], start the
curve at [0,0,0], and let the bug walk halfway to the corner
labeled with the next base; the base's label [1..N] supplies
the Z coordinate at each step.
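
A minimal sketch of that walk (the names are mine, not the talk's
production code), tracking the X-Y position at each step; the base's
offset in the string supplies Z:

	use strict;
	use warnings;

	my %corner =
	(
		A => [ -1,  1 ],
		C => [  1,  1 ],
		G => [ -1, -1 ],
		T => [  1, -1 ],
	);

	sub w_curve_xy
	{
		my $dna = shift;

		my( $x, $y ) = ( 0, 0 );

		map
		{
			my $c = $corner{ $_ };

			# step halfway from here to this base's corner.

			$x = ( $x + $c->[0] ) / 2;
			$y = ( $y + $c->[1] ) / 2;

			[ $x, $y ]
		}
		grep { $corner{ $_ } }
		split //, uc $dna
	}

	my @curve = w_curve_xy 'CATTGC';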

Comparing the genes then involves comparing the curves 
generated by two bugs walking their respective genes.

Which is where I finally saw a camel peeking in under my tent.



The tent was parked over a C++ based visualization system
called the W-curve (described above). The technique is pretty
good for viewing genes within strings of DNA, but comparing
any long stretch of DNA using 3D curves on a monitor can just
as easily induce blindness as insanity.

My mission was to find a way for the machine to make comparisons
instead of increasingly feeble-eyed humans.

Our goal is to boldly go where no nerd has gone before: comparing
large sequences of DNA quickly. Making it efficient gives a few
good examples of how to look at problems and adapt them for use
with Perl.

The first step was looking at the algorithm. One of the more
common things done to a W-curve is rotating it about the
z-axis. Cartesian co-ordinates make this a three-step
operation, with three 4x4 matrices involved.

Ouch.

First fix: convert the coordinates to cylindrical, [r, a, Z],
leaving rotation as a simple addition.
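
Which is all the code rotation needs (a two-line sketch, with
made-up sample points):

	use strict;
	use warnings;

	use Math::Trig qw( pi );

	# each point is [ r, angle, z ]: rotating the entire curve
	# about the z-axis is one addition per point, no matrices.

	my @curve = ( [ 1.0, 0, 0 ], [ 0.5, pi/2, 1 ] );

	my $theta = pi / 4;

	$_->[1] += $theta for @curve;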

Now the coordinates work, but the maximum radius is sqrt 2:
the corners at [+/-1,+/-1] sit that far from the origin.

Second fix: move the corners onto the axes, starting at [0,1],
which gives a unit radius:


From this:

 A+------+------+ C (1,1)
  |      |      |
  |      |      |
  |      |      |
  |      |      |
--+------+------+--
  |      |      |
  |      |      |
  |      |      |
  |      |      |
 G+------+------+ T

To this:

         +A
        /|\
       / | \
      /  |  \
     /   |   \
    /    |    \
 G /     |     \ C (1,0)
--+------+------+--
   \     |     /
    \    |    /
     \   |   /
      \  |  /
       \ | /
        \|/
         +T



At this point the only thing left is to compute the half
interval. Halving the distance directly is one way, but it
requires a square root, which is expensive. Taking the arctan
of the half-intervals is simple enough and involves mostly
table lookups. The unit radius helps, since addition and
multiplication by one are simple and division by one is even
easier.

The last modification was to notice that rotations are simple
additions, and the math works equally well for any of the
corners. This allows computing the half-interval from any point
to (1,0) and rotating the result to the appropriate corner. The
sequence is: find the angle for a corner (e.g., C == 0, A ==
PI/2) and subtract it from the current vector's angle. After
that the math for a point with respect to (1,0) is the same,
with a single addition required to move the point around to
where it belongs. The point is that two rather minor-looking
geometric shifts simplified my life enormously.


Result: generating W-curves in Perl is a snap -- using map,
of course, to convert split //, $dnastring into an array
of [$radius, $angle] pairs.
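
Here is my sketch of that generation step (sub and variable names
are illustrative, not the production code). Each base's corner
angle is subtracted, the half-interval toward (1,0) is taken, and
the result is rotated back:

	use strict;
	use warnings;

	use Math::Trig qw( pi );

	my %corner_angle =
	(
		C => 0,
		A => pi / 2,
		G => pi,
		T => 3 * pi / 2,
	);

	sub next_point
	{
		my( $r, $a, $base ) = @_;

		my $rot = $corner_angle{ $base };

		# rotate this base's corner onto (1,0).

		$a -= $rot;

		# halfway from (r,a) to (1,0), in Cartesian terms.

		my $x = ( $r * cos( $a ) + 1 ) / 2;
		my $y = ( $r * sin( $a ) ) / 2;

		my $angle = atan2( $y, $x );

		# recover the radius without a square root:
		# x == r * cos(angle), so divide the cosine back out.

		my $cos = cos $angle;

		my $radius = abs( $cos ) > 1e-9 ? $x / $cos : abs( $y );

		# rotate the result back to the corner's true position.

		( $radius, $angle + $rot )
	}

	# map the string into [ radius, angle ] pairs.

	my( $r, $a ) = ( 0, 0 );

	my @curvz =
	map { [ ( $r, $a ) = next_point $r, $a, $_ ] }
	grep { exists $corner_angle{ $_ } }
	split //, uc 'catGTTAC';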



Comparing the curves is where the real fun begins, and
introduces one way to deal with Perl's lack of pointers when
efficiently handling multiple sequences of data.

People used to C bemoan the lack of pointers in Perl:
"arrays just don't cut it for comparisons", they say,
"because you have to scan the entire array to access the
offsets."

This is partly due to programmers misusing structures
in Perl -- or not using them at all.

The usual approach to comparing two lists in C is a pair
of pointers:

	foo_t **a = list1;	/* NULL-terminated lists of foo_t * */
	foo_t **b = list2;

	while( *a && *b )
	{
		foo_compare( *a++, *b++ );
	}

Lacking pointers in Perl, one fix is to use arrays with
offsets to walk the lists:

	for( 0..$#a )
	{
		foo_compare $a[$_], $b[$_];
	}

The problem here is that you end up walking a list for
each offset, with a serious performance hit.

Another approach uses a for-loop to walk one of the
lists, indexing the other. This still has to walk
the second list for its data:

	my $i = 0;

	for( @a )
	{
		foo_compare $_, $b[$i++];
	}

Perl can iterate one list quite effectively; it's the
second list that hurts.

Walking the second list can be avoided if it can be 
consumed for processing (or duplicating it is cheap):

	my @c = ();

	for( @a )
	{
		my $b = shift @b;

		foo_compare $_, $b;

		push @c, $b;
	}

	@b = @c;
	
The catch here is that @b has to be regenerated for each
use or cached by pushing its contents onto a second list in
the loop. Using a referent for @b helps a bit, but can run
into issues with the copy back from \@c if $b is blessed
or its value is used directly (e.g., as the key to a DBI-style
metadata hash).

My way out of this was to store the data in a single list.
Instead of generating separate arrays, the W-curve uses a
single list with offsets zero and one for the resulting 
data:

	my @dna = @_[0,1];

	my $i = length $dna[0];
	my $j = length $dna[1];

	my @curvz = ();

	# pre-extend the array to the longer string's length.

	$#curvz = ( $i > $j ? $i : $j ) - 1;

	for( 0, 1 )
	{
		my @a = split //, $dna[$_];

		for my $bucket ( @curvz )
		{
			last unless @a;

			$bucket->[$_] = next_w_curve_value shift @a;
		}
	}

This leaves the two results merged into a single array @curvz,
with the values being compared stored at offsets 0 and 1 of
$curvz[$i].

Comparing the curves now requires only a single pass over
@curvz:

	compare_two_curve_points $_ for( @curvz );

	...

	sub compare_two_curve_points
	{
		my( $a, $b ) = @{ $_[0] };

		# or just use the referent with offsets:
		#
		# my $a = shift;
		#
		# $a->[0] and $a->[1]
		#
		# $a and $b are the values being compared.
	}

This works well for our use because one of the curves will be
re-used for each pass. Generating a new comparison then requires
only clipping the combined array to the two curves' maximum
length and re-generating the curve at offset 1:

	my $string1 = shift;

	my $length1 = length $string1;

	# $length0 is retained from the pass that built curve 0.

	$#curvz = ( $length0 > $length1 ? $length0 : $length1 ) - 1;

	my @a = split //, $string1;

	# both walks start from the origin.

	$curvz[0] = [ 0, 0 ];

	for my $bucket ( @curvz )
	{
		last unless @a;

		$bucket->[1] = next_w_curve_value shift @a;
	}

	...

	compare_two_curve_points $_ for( @curvz );

Looks familiar, eh? Re-merging the second curve into a retained
array saves re-generating the first curve and keeps the
comparison simple.



My first-pass comparison function sums a value computed
separately at each point and averages it over the length 
of the longer array. This simple average has the advantage
that an array can be partitioned and the sections compared
separately.

For two points [r0,a0] and [r1,a1], the measure is computed with:

	( $r0, $r1 ) = ( $r1, $r0 ) if $r1 > $r0;

	my $score = $r0 - $r1 * cos( $a0 - $a1 );

Nice thing about cosines: cos(-a) == cos(a), so the angles
don't have to be reversed.

The result is a measure that includes distance along the longer
radius in the difference but not deviation from it. If the two
radii are at less than 90 degrees to one another then their
differences will subtract, otherwise they will add; if both
radii are small then their angle does not matter much. This allows
the two curves to wander along in roughly the same direction
with minor separations and still have a fairly low score. The
idea here is to handle variations caused by repeats without
having to explicitly align the curves.
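
A minimal version of the scoring pass, assuming the merged @curvz
layout from above (padding missing tail points with the origin is
my own guess at handling curves of different length):

	use strict;
	use warnings;

	sub compare_curves
	{
		my $curvz = shift;	# [ [r0,a0], [r1,a1] ] buckets

		my $sum = 0;

		for my $bucket ( @$curvz )
		{
			# a missing point counts as the origin.

			my( $r0, $a0 ) = @{ $bucket->[0] || [0,0] };
			my( $r1, $a1 ) = @{ $bucket->[1] || [0,0] };

			# measure against the longer radius; the
			# cosine is symmetric, so angle order is moot.

			( $r0, $r1 ) = ( $r1, $r0 ) if $r1 > $r0;

			$sum += $r0 - $r1 * cos( $a0 - $a1 );
		}

		# average over the longer curve's length.

		@$curvz ? $sum / @$curvz : 0
	}

Identical strings walk identical curves, so every term -- and the
average -- comes out zero; the score grows as the curves separate.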

Note that this makes no attempt to handle introns. The current
method is only useful for procaryotes or cDNA samples -- which
is fine, since cDNA is what gets used in microarray experiments.

Using the techniques here I was able to compare the whole
genomes of M. genitalium and M. pneumoniae in around 7 minutes
in a single-process Perl 5.8 job on Linux.

This is an improvement over alignment techniques, but does
not go very far towards handling larger procaryotes or anything
using a eucaryote-ish quantity of genes to run itself.

The next step required a bit of cut and run, so to speak: by
slicing up the DNA strings, generation and comparison tasks
could hopefully be done in reasonable time. This is workable
with W-curves since -- unlike alignment -- they do not modify
their input data by introducing gaps: the method can be used
for additive, piecewise comparison.
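
Because the raw score is a plain sum, slices can be scored
separately and combined afterwards. A small sketch of the
bookkeeping (the slice boundaries and data here are made up):

	use strict;
	use warnings;

	sub point_score
	{
		my( $r0, $a0, $r1, $a1 ) = @_;

		( $r0, $r1 ) = ( $r1, $r0 ) if $r1 > $r0;

		$r0 - $r1 * cos( $a0 - $a1 )
	}

	# fake merged data: 2_000 buckets of two [r,a] points each.

	my @curvz = map { [ [ rand, rand ], [ rand, rand ] ] } 1 .. 2_000;

	my( $sum, $count ) = ( 0, 0 );

	for my $slice ( [ 0, 999 ], [ 1_000, 1_999 ] )
	{
		my @part = @curvz[ $slice->[0] .. $slice->[1] ];

		$sum += point_score map { @$_ } @$_ for @part;

		$count += @part;
	}

	# same result as averaging one pass over the whole array.

	my $score = $sum / $count;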



So, my next step was to partition the generation and comparison
so that they could be handled in parallel. The eventual goal
of this was to slurp up some time on a local 8-node Beowulf
Cluster ("BC") to get the really big jobs done.

BCs in their various incarnations currently make up the most
powerful number-crunching systems in the known universe. The
trick to using them well involves going from "embarrassingly
parallel" task scheduling to truly parallel execution using
message passing and distributed data.

Contrary to most notions, this is an area where Perl shines:
most of the nasty work in a BC involves bookkeeping to maintain
lists of available nodes and of jobs pending and dispatched, and
to associate the results returned from individual nodes with the
total result. These tasks are nicely suited to Perl's arrays
and hashes; Perl's exception handling works well with the kind
of error recovery required for network errors and adaptive
scheduling. The PVM module makes multi-node dispatch simple
enough, with the underlying PVM library handling many of the
message-passing details automatically.

The first step in designing message-driven code is to figure
out what the messages will be. In this case, reading in a
species' data, selecting a gene for the first and second
w-curve lists, initiating computation and comparison,
partitioning, and passing back a result seemed like a decent
collection of tasks.

The code then had to be split into chunks to handle the various
messages. This left it looking like the event-driven code used
in GUI design: handlers for the various messages that call
processing routines to really get the work done. Fortunately
for me, the messages and their handlers are fairly straightforward:
pick a GenBank file for the species' genome, pick a gene for a
species and merge it into the curve array, make the comparison,
return the results. Aside from the GenBank file path (which can
be made short with default paths), the remaining messages are
short strings or pairs of integers.
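
The handler plumbing then reduces to a dispatch table. This sketch
fakes the message source; the message names, file name, and gene id
are mine, not the real protocol:

	use strict;
	use warnings;

	my %handlerz =
	(
		load_genome => sub { print "read GenBank file: @_\n" },
		pick_gene   => sub { print "merge gene into curve: @_\n" },
		compare     => sub { print "compare and return result\n" },
	);

	# stand-in for whatever delivers messages (e.g., a PVM receive).

	my @queue =
	(
		[ load_genome => 'm_genitalium.gb' ],
		[ pick_gene   => 0, 'MG_001' ],
		[ 'compare' ],
	);

	while( my $msg = shift @queue )
	{
		my( $type, @argz ) = @$msg;

		my $handler = $handlerz{ $type }
			or die "Unknown message type: $type";

		$handler->( @argz );
	}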

One issue at this point is how to deal with partitioned comparisons.
What we came up with is having the nodes generate their own
segments of the W-curves using sections of the source data. Since
generating the curve needs only the current curve point and the
next base, curve generation can be daisy-chained: each node returns
a message with the w-curve point at the end of its section, then
the dispatcher kicks off another job from that point onward. The
alternative is to have each node generate the entire w-curve and
then compare only sections of it. The choice between these is
largely a matter of memory management and the time saved
generating only part of the curve.

At this point we are still converting the code for execution
in the BC and testing how well my first -- admittedly simplistic --
comparison algorithm works. Once we have worked out the kinks
for M. genitalium and M. pneumoniae we can start on procaryotes
with longer genomes (e.g., E. coli) and cDNA from eucaryotes.



The main point of all this is that Perl can be used for serious
number crunching without resorting to XS. It can also be used
for programs in messy, complicated situations like genome research.
The main issues are making sure that the problem space is properly
defined for handling by Perl and that optimizations are designed
into the process carefully instead of being ad hoc hacks.

Bioinformatics is a growing, changing field. Perl fits nicely
into this Darwinian environment because it adapts. Gracefully
handling nested and dynamic data structures is one of its
strengths; efficient integration with networking libraries is
another. The trick to finding effective solutions is basing
them on effective Perl rather than recycled C.

The most interesting thing in projects like these is what we find
in them: the origins of genes, animal models for human cancer,
starting points for drug research. The fun part as a Perl hacker
is seeing just how easily Perl handles it all.



-------------------------------------------------------
Other places to look for information on Bioinformatics:
-------------------------------------------------------

http://www.bioinfo.org/

   General information, links to everything else.

http://www.bioperl.org/

   The bio* projects were intended to build a consistent
   set of bioinformatics tools in various high-level languages --
   Perl, Python, and Java among them. The BioPerl project has gone
   the furthest and is the most used of these. The site describes
   using BioPerl and has good links to other sites.

Developing Bioinformatics Computer Skills
Gibas & Jambeck, O'Reilly.

   The content covers where biology and informatics meet, including
   both the chemistry of DNA and the basics of Perl. This is a
   really good place to start, since it covers biology, Perl, and
   how they work together.

http://www.ora.com/

   O'Reilly's bioinformatics conference is detailed on
   their site, with good references to presenter information
   and papers. Their site also has a number of good books
   in the bioinformatics category.

Genes VII
Lewin, Oxford University Press.

	This is the standard text on genetics, and covers everything
	from base pairs to ribosome chemistry. Even a cursory reading
	of the first few chapters can be a big help in understanding
	gene-speak.

------------------------------------------------------------------------

