diddling with regexes
nkuipers
nkuipers at uvic.ca
Fri Jan 17 15:16:37 CST 2003
Hi everyone,
A colleague of mine is annotating, by hand, DNA sequences tens of thousands of
residues in length. I can't speed up the annotation per se since she needs to
decide for each feature whether or not to remark on it and what to note.
However, I can flag the features of interest for her so that she can jump
straight to them in sequence, as a pre-processing step. She needs to find all
instances of the following:
T-rich nonamer->12bp->heptamer with terminal GTG->15-40bp->TTYGGNNNNGGN
As far as I know it is impossible to write a regex that finds "T-rich", so I
have to settle for matching just the length of the nonamer (9 chars).
A regex for this might be as follows:
/\G[acgt]{25}gtg[acgt]{15,40}tt[ct]gg[acgt]{4}gg[acgt]/gi
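
To make the counting easier to check, the same components can be spelled out with the /x modifier. This is only a sketch: \G is left out so the match may start anywhere, and the capture group around the nonamer, the count_t helper, and the idea of counting Ts afterwards as a stand-in for "T-rich" are assumptions added for illustration. The test sequence is made up, not real data.

```perl
use strict;
use warnings;

# Same components as the one-line pattern, minus \G, written with the x flag.
# The capture group around the nonamer is an addition for illustration.
my $motif = qr/
    ([acgt]{9})       # nonamer (T-richness is not checked by the regex)
    [acgt]{12}        # 12 bp spacer
    [acgt]{4} gtg     # heptamer ending in GTG
    [acgt]{15,40}     # 15 to 40 bp spacer
    tt [ct] gg        # TTYGG (Y = C or T)
    [acgt]{4}         # NNNN
    gg [acgt]         # GGN
/xi;

# Hypothetical helper: number of Ts in a captured nonamer.
sub count_t {
    my ($nonamer) = @_;
    return ($nonamer =~ tr/tT//);
}

# Made-up test sequence: nonamer, spacer, heptamer, 15 bp spacer, final motif.
my $seq = 'ttttattta' . 'acgtacgtacgt' . 'acacgtg' . ('a' x 15) . 'ttcggacgtggc';

if ($seq =~ $motif) {
    printf "match at %d, nonamer %s has %d Ts\n", $-[0], $1, count_t($1);
}
```

The 9 + 12 + 4 positions before gtg add up to the {25} in the one-line version. What T count qualifies as "T-rich" is left open here; that threshold would have to come from the annotator.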
There are various optimizations I could try, and different approaches such as
lookaround assertions, but for now my concern is with \G. As far as I
understand it, the \G assertion means that the /g modifier will not bump the
engine to the position after the last match, so that it works like pos() inside
the regex and does not miss overlapping instances of the pattern. This is very
important: I can't miss overlaps. Do I understand correctly?
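
A toy experiment is one way to check this. As documented in perlre, \G matches only at the current pos(), and a successful /g match leaves pos() at the end of that match, so \G anchors the next attempt rather than rewinding it. The sketch below compares a plain /g scan, a \G-anchored scan, and a zero-width lookahead with a capture group. The string, the "aa" pattern, and the starts helper are all made up for illustration.

```perl
use strict;
use warnings;

# Toy stand-in for the real motif: find every "aa" in a short string.
my $seq = 'xaaaa';

# Return the start offsets of all matches of $re found by a /g loop.
sub starts {
    my ($str, $re) = @_;
    my @at;
    while ($str =~ /$re/g) {
        push @at, $-[0];
    }
    return @at;
}

my @plain    = starts($seq, qr/aa/);        # 1, 3: resumes after each match
my @anchored = starts($seq, qr/\Gaa/);      # none: "aa" must begin at pos(), which starts at 0
my @overlap  = starts($seq, qr/(?=(aa))/);  # 1, 2, 3: lookahead consumes nothing

print "plain:    @plain\n";
print "anchored: @anchored\n";
print "overlap:  @overlap\n";
```

On this toy input only the lookahead form reports the overlapping hit at offset 2. Wrapping the full motif the same way, as (?=(...)), and reading the sequence from $1 should behave similarly, though that has not been tried on real data here.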
Any suggestions in general for this task?
Thanks muchly,
Nathanael