background for my problem

Fri Sep 20 01:59:14 CDT 2002

Actually I wrote the following after our meeting to be the weekly progress 
report that I write for my employer.  I apologize if it sounds 
over-complicated or pretentious...that would reflect my confused and somewhat 
dismayed state of mind.  I'll elaborate the situation further as you fire 
questions my way.

Terms:

cluster: think of this as a superstring of DNA bases (A,C,G,T)
repeat: sub-string of some minimum length (in this case 21) that occurs more 
than once in the cluster-set
ORF: open reading frame; always delimited on at least one end by a stop codon 
unless the entire superstring is contained in an even larger open reading 
frame, in which case we can't say where the stop codon might be

##############################################################################
#

This week was spent working on a tool for screening repeats located in 
putative coding region.  To do this, several data are needed for each cluster:
	1) where the stop codons are located in each frame
	2)where the repeats are located
	3)a comparison of repeat locations to candidate ORFs
The comparison is based in turn on three statements:
	1)clusters with no stop codons in at least one frame are entire ORFs
	2)sequences with stop codons evenly distributed in a frame may be
	   considered noncoding noncoding for that frame
	3)sequences with stop codons nonrandomly distributed in a frame may
	   be considered partially coding for that frame
I have code for finding stop codons in each frame.  It works by  adjusting the 
return value of the Perl built-in pos function on a per-frame basis.  I have 
code for accurately matching repeats to clusters and getting the locations of 
every match.  However it is horribly inefficient, so my supervisor suggested 
an alternative method.  In this approach, an index of all k-size DNA fragments 
(k-strings) found in the clusters is created; each k-string points to all 
clusters in which the k-string occurs.  Then, for each repeat, the first k 
nucleotides are tested against the index.  The full-length, global repeat 
matching is attempted only against the clusters pointed to by the k-string.  
The only difficulty I am having with this approach is deciding on data 
structures for capturing results.  Initially, I had planned the following, 
which uses several levels of nested, anonymous structures to fully annotate 
each cluster:

HASH: { clust_id => ARRAY: [ STRING: clust_seq,
				     HASH: { frame => CSV: stop positions, 
 },
				     HASH: { rep_seq => CSV: positions, 
 }
   ] , 

  }

It’s nicely organized in theory but a total nightmare to build, access 
piecemeal, and pass around to subroutines for iterating etc.; the Data::Dumper 
module may be a viable tool for accessing the whole structure.  In discussing 
data structures with other member of the Victoria Perl Mongers, the above 
schema could also be extraordinarily RAM intensive unless I am frequently 
dumping to a database or file.  Though the annotation is interesting and 
deserves to be placed in a database for its own sake, the immediate goal is a 
series of logical comparisons, so relevant information needs to stay in memory 
unless comparisons and resulting decisions are made on a per-cluster basis and 
then flushed from memory, which is an attractive thought.  The Victoria PM 
also had the idea of compressing di-nucleotide combinations (of which there 
are only 16) by symbol.  This would speed up pattern matching, but would add 
overhead in other ways.  An interesting idea nonetheless and one which I am 
eager to try.  In summary, the biggest problems are data structure and input 
size, keeping in mind the need to use the data and not just gather it, and to 
integrate with the existing data pipeline.  I’ve decided to backburner 
everything but the repeats-to-cluster code until satisfactory.  I can’t help 
thinking that a more creative, brilliant, yet simpler solution to the overall 
task exists, so there is a lot to ponder.