background for my problem
nkuipers
nkuipers at uvic.ca
Fri Sep 20 01:59:14 CDT 2002
Actually I wrote the following after our meeting to be the weekly progress
report that I write for my employer. I apologize if it sounds
over-complicated or pretentious...that would reflect my confused and somewhat
dismayed state of mind. I'll elaborate the situation further as you fire
questions my way.
Terms:
cluster: think of this as a superstring of DNA bases (A,C,G,T)
repeat: sub-string of some minimum length (in this case 21) that occurs more
than once in the cluster-set
ORF: open reading frame; always delimited on at least one end by a stop codon
unless the entire superstring is contained in an even larger open reading
frame, in which case we can't say where the stop codon might be
##############################################################################
#
This week was spent working on a tool for screening repeats located in
putative coding region. To do this, several data are needed for each cluster:
1) where the stop codons are located in each frame
2)where the repeats are located
3)a comparison of repeat locations to candidate ORFs
The comparison is based in turn on three statements:
1)clusters with no stop codons in at least one frame are entire ORFs
2)sequences with stop codons evenly distributed in a frame may be
considered noncoding noncoding for that frame
3)sequences with stop codons nonrandomly distributed in a frame may
be considered partially coding for that frame
I have code for finding stop codons in each frame. It works by adjusting the
return value of the Perl built-in pos function on a per-frame basis. I have
code for accurately matching repeats to clusters and getting the locations of
every match. However it is horribly inefficient, so my supervisor suggested
an alternative method. In this approach, an index of all k-size DNA fragments
(k-strings) found in the clusters is created; each k-string points to all
clusters in which the k-string occurs. Then, for each repeat, the first k
nucleotides are tested against the index. The full-length, global repeat
matching is attempted only against the clusters pointed to by the k-string.
The only difficulty I am having with this approach is deciding on data
structures for capturing results. Initially, I had planned the following,
which uses several levels of nested, anonymous structures to fully annotate
each cluster:
HASH: { clust_id => ARRAY: [ STRING: clust_seq,
HASH: { frame => CSV: stop positions,
},
HASH: { rep_seq => CSV: positions,
}
] ,
}
Its nicely organized in theory but a total nightmare to build, access
piecemeal, and pass around to subroutines for iterating etc.; the Data::Dumper
module may be a viable tool for accessing the whole structure. In discussing
data structures with other member of the Victoria Perl Mongers, the above
schema could also be extraordinarily RAM intensive unless I am frequently
dumping to a database or file. Though the annotation is interesting and
deserves to be placed in a database for its own sake, the immediate goal is a
series of logical comparisons, so relevant information needs to stay in memory
unless comparisons and resulting decisions are made on a per-cluster basis and
then flushed from memory, which is an attractive thought. The Victoria PM
also had the idea of compressing di-nucleotide combinations (of which there
are only 16) by symbol. This would speed up pattern matching, but would add
overhead in other ways. An interesting idea nonetheless and one which I am
eager to try. In summary, the biggest problems are data structure and input
size, keeping in mind the need to use the data and not just gather it, and to
integrate with the existing data pipeline. Ive decided to backburner
everything but the repeats-to-cluster code until satisfactory. I cant help
thinking that a more creative, brilliant, yet simpler solution to the overall
task exists, so there is a lot to ponder.
More information about the Victoria-pm
mailing list