diddling with regexes
Peter Scott
Peter at PSDT.com
Fri Jan 17 16:02:47 CST 2003
At 01:16 PM 1/17/03 -0800, nkuipers wrote:
>Hi everyone,
>
>A colleague of mine is annotating, by hand, DNA sequences tens of
>thousands of residues in length. I can't speed up the annotation per se
>since she needs to decide for each feature whether or not to remark on it
>and what to note. However I can flag the features of interest for her so
>that she can zoom right to them sequentially, like a pre-processing step.
>She needs to find all instances of the following:
>
>T-rich nonamer->12bp->heptamer with terminal GTG->15-40bp->TTYGGNNNNGGN
>
>As far as I know it is impossible to program a regex to find "T-rich"
I don't know what you mean by that.
>so I have to stick to simply the nonamer aspect (9 chars).
>
>A regex for this might be as follows:
>
>/\G[acgt]{25}gtg[acgt]{15,40}tt[ct]gg[acgt]{4}gg[acgt]/gi
>
>There are various optimizations I could try and different approaches like
>lookaround assertions but for now my concern is with \G. As far as I
>understand it, the \G assertion means that the /g modifier will not bump the
>engine to the position after the last match, thus operating like pos() in the
>regex and not missing overlapping instances of the pattern? This is very
>important, I can't miss overlaps. Do I understand correctly?
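Not quite. \G is an anchor: it matches only at the position where the
previous /g match on that string left off (that is, at pos(), or at the
start of the string if there has been no previous match). It exists so
that you can require successive global matches to be contiguous, with no
intervening junk between them. It does not stop /g from advancing past
the end of each match, so it won't recover overlapping instances.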
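To make the anchoring concrete, here is a minimal sketch comparing a plain
/g match with a \G-anchored one in list context. The string "aab_aab" is
a made-up example, not something from the original question:

```perl
use strict;
use warnings;

# Hypothetical test string: two runs of "aa" separated by "b_".
my $str = "aab_aab";

# Plain /g: the engine scans forward past "b_" and finds every "a".
my @plain = $str =~ /a/g;

# With \G: each match must start exactly where the previous one ended,
# so matching stops as soon as the next character is not an "a".
my @anchored = $str =~ /\Ga/g;

print "plain:    @plain\n";      # plain:    a a a a
print "anchored: @anchored\n";   # anchored: a a
```

The anchored version gives up at the first position where "a" cannot
match immediately. That is the "no intervening junk" behavior, and it is
unrelated to whether overlapping matches are found.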
If you want to make sure that you don't miss overlapping instances, you
need to ensure that pos isn't advanced past a point where another match
might start. Using a zero-width positive lookahead assertion should do
that. I don't have time to play around with this any more, so I don't
know whether there's a way to avoid the kludge, but check this out:
[peter at tweety ~]$ perl -le '$_="abacad"; print $1 while /(a.+)/g'
abacad
[peter at tweety ~]$ perl -le '$_="abacad"; print $1 while /\G(a.+)/g'
abacad
[peter at tweety ~]$ perl -le '$_="abacad"; print "$1$2" while /(a(?=(.+)))/g'
abacad
acad
ad
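Applied to the pattern from the question, the same trick might look like
the sketch below. It assumes the whole motif is wrapped in the lookahead
with a capture inside it. The 60-character sequence is synthetic, built
only so that two instances overlap, and the variable names are
illustrative:

```perl
use strict;
use warnings;

# Synthetic sequence (an assumption for illustration): "gtggtg" provides
# two candidate GTG positions, so two instances overlap, starting at
# offsets 0 and 3.
my $seq = ("t" x 9) . ("a" x 12) . "acac" . "gtggtg"
        . ("a" x 17) . "ttcggaaaagga";

# The pattern from the question, without \G.
my $motif = qr/[acgt]{25}gtg[acgt]{15,40}tt[ct]gg[acgt]{4}gg[acgt]/i;

# Plain /g consumes the first instance and so skips the overlapping one.
my $plain_count = () = $seq =~ /$motif/g;

# Zero-width lookahead: nothing is consumed, so after each hit the engine
# moves on by a single character and tries every later start position.
my @hits;
while ($seq =~ /(?=($motif))/g) {
    push @hits, [ pos($seq), length $1 ];   # [start offset, match length]
}

print "plain /g found $plain_count\n";            # plain /g found 1
printf "start=%d length=%d\n", @$_ for @hits;     # start=0 length=60
                                                  # start=3 length=57
```

One caveat: this reports at most one hit per start position. Because
{15,40} is greedy, that hit uses the longest spacer that still lets the
rest of the pattern match, so shorter alternatives from the same start
are not listed separately.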
--
Peter Scott
Pacific Systems Design Technologies
http://www.perldebugged.com/