Ideas?

Wed Sep 18 13:49:44 CDT 2002

Hello all,

I have a bit of a problem.  To present it, I need to first give a bit of a 
biology primer.

A DNA sequence can be represented as a string of A,G,C,T, which are 1-letter 
representations of different nucleotides.  Think GATTACA :).  Often, a 
sequence is considered in blocks of 3 nucleotides; this block is called a 
codon.  An array of codons occupies a "reading frame", and for a given 
sequence there are 6 reading frames.  For example, for 
ACG|GTC|TTT|CGA|TAA|AAA... the frames are:

1) as written
2) remove the first nucleotide from 1), giving CGG|TCT|TTC|GAT|AAA|A...
3) remove the first nucleotide from 2), giving GGT|CTT|TCG|ATA|AAA...

The other three frames are derived with similar mechanics, but the original 
sequence is first reversed, then "complemented" (essentially, tr/ACGT/TGCA/).
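
To make the frame mechanics concrete, here is a minimal Perl sketch of the six 
frames.  The sub names (reverse_complement, frames) are made up for 
illustration, and dropping a trailing partial codon is an assumption of the 
sketch, not a rule from biology.

```perl
use strict;
use warnings;

# Reverse the sequence, then complement it (tr/ACGT/TGCA/).
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;      # scalar context: reverses the string
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# Return the three frames of one strand, each as an array ref of codons.
# The frame offset goes to substr, so the input string is never modified.
# A trailing partial codon (1 or 2 nucleotides) is dropped.
sub frames {
    my ($seq) = @_;
    my @frames;
    for my $offset (0 .. 2) {
        my $shifted = substr($seq, $offset);
        push @frames, [ $shifted =~ /(...)/g ];
    }
    return @frames;
}

my $seq     = 'ACGGTCTTTCGATAAAAA';
my @forward = frames($seq);
my @reverse = frames(reverse_complement($seq));

print join('|', @$_), "\n" for @forward, @reverse;
```

Because substr takes the offset, nothing has to be chopped off the front of 
the sequence to get from one frame to the next.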

I am interested in finding all instances of 3 specific codons, and have 
created 2 regex objects (forward and reverse complement, for a total of 6 
codons) that do this perfectly.  I am also interested in knowing the locations 
of each matched codon in the string.  Currently I am using the pos function, 
and this is fine for the first frame in either orientation.  But my current 
implementation of creating the next frame involves removing the current first 
nucleotide from the sequence with s/^\w//, which compromises the "absolute" 
position of a match reported by pos.  I need ideas, please.  Arrays?  Tmp 
vars?  Adding/subtracting an appropriate integer to the pos return value 
(easy and viable, but sort of messy as I imagine it)?  Is a better logical 
foundation needed?  I am quite sure I could come up with an answer to this 
with more thought, but I wanted to hear other opinions, which are likely more 
elegant than mine.  How would you best do a frame-specific search while still 
being able to annotate the match location based on the original, untouched 
sequence?
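
One possible shape for this, as a rough sketch rather than a definitive 
answer: leave the sequence alone, scan it once with a zero-width lookahead, 
and derive the frame from the match offset modulo 3.  The codons TAA, TAG and 
TGA are placeholders (the three codons of interest are not named above), 
find_codons is a made-up name, and numbering the reverse-strand frames 4 to 6 
is an assumption.

```perl
use strict;
use warnings;

# Placeholder codons; substitute the real ones.
my $codon_re = qr/TAA|TAG|TGA/;

# Scan a sequence once and return a list of [codon, 0-based offset, frame].
# Frame is 1..3, computed from the offset, so no nucleotides are removed.
sub find_codons {
    my ($seq, $re) = @_;
    my @hits;
    # The lookahead consumes nothing, so candidates in all three frames
    # (including overlapping ones) are visited.
    while ($seq =~ /(?=($re))/g) {
        my $start = pos($seq);    # zero-width match: pos is the codon start
        push @hits, [ $1, $start, $start % 3 + 1 ];
    }
    return @hits;
}

my $seq = 'ACGGTCTTTCGATAAAAA';
my $len = length $seq;

for my $hit (find_codons($seq, $codon_re)) {
    printf "%s at offset %d (frame %d)\n", @$hit;
}

# Reverse strand: search the reverse complement, then map back.  A codon
# starting at $start there covers forward offsets
# ($len - $start - 3) .. ($len - $start - 1) of the original sequence.
(my $rc = reverse $seq) =~ tr/ACGT/TGCA/;
for my $hit (find_codons($rc, $codon_re)) {
    my ($codon, $start, $frame) = @$hit;
    printf "%s on reverse strand, forward offsets %d..%d (frame %d)\n",
        $codon, $len - $start - 3, $len - $start - 1, $frame + 3;
}
```

With this layout the only arithmetic left is the modulo for the frame and one 
subtraction for the reverse strand; every reported offset refers to the 
original, untouched sequence.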