diddling with regexes

Fri Jan 17 15:16:37 CST 2003

Hi everyone,

A colleague of mine is annotating, by hand, DNA sequences tens of thousands of 
residues in length.  I can't speed up the annotation per se, since she needs to 
decide for each feature whether or not to remark on it and what to note.  
However, I can flag the features of interest for her as a pre-processing step, 
so that she can jump straight from one to the next.  She needs to find all 
instances of the following:

T-rich nonamer->12bp->heptamer with terminal GTG->15-40bp->TTYGGNNNNGGN

As far as I know it is impossible to program a regex to find "T-rich", so I
have to stick to simply the length of the nonamer (9 characters).

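A regex for this might be as follows (the leading 25 covers the 9-base
nonamer, the 12 bp spacer, and the first 4 bases of the heptamer):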

/\G[acgt]{25}gtg[acgt]{15,40}tt[ct]gg[acgt]{4}gg[acgt]/gi
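
Since the regex itself cannot judge "T-rich", one option might be to capture
the nonamer and filter it afterwards.  The following is only a sketch: the
helper name is_t_rich and the threshold of 5 T's out of 9 are my own
assumptions, not a definition my colleague has given.

```perl
use strict;
use warnings;

# Hypothetical threshold: "T-rich" is not defined in the motif description.
my $MIN_T = 5;

# Return true if the candidate nonamer contains at least $min_t T residues.
sub is_t_rich {
    my ($nonamer, $min_t) = @_;
    $min_t = $MIN_T unless defined $min_t;
    # tr/// with an empty replacement list counts characters
    # without modifying the string.
    my $count = ($nonamer =~ tr/Tt//);
    return $count >= $min_t;
}
```

With a capture group around the first [acgt]{9} of the pattern, each hit
could be passed through is_t_rich($1) before being flagged.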

There are various optimizations I could try, and different approaches such as 
lookaround assertions, but for now my concern is with \G.  As far as I 
understand it, the \G assertion means that the /g modifier will not bump the 
engine to the position after the last match; instead it operates like pos() 
inside the regex, so that overlapping instances of the pattern are not missed.  
This is very important: I can't miss overlaps.  Do I understand correctly?

Any suggestions in general for this task?

Thanks muchly,

Nathanael