diddling with regexes

Fri Jan 17 16:02:47 CST 2003

At 01:16 PM 1/17/03 -0800, nkuipers wrote:
>Hi everyone,
>
>A colleague of mine is annotating, by hand, DNA sequences tens of thousands
>of residues in length.  I can't speed up the annotation per se since she
>needs to decide for each feature whether or not to remark on it and what to
>note.  However I can flag the features of interest for her so that she can
>zoom right to them sequentially, like a pre-processing step.  She needs to
>find all instances of the following:
>
>T-rich nonamer->12bp->heptamer with terminal GTG->15-40bp->TTYGGNNNNGGN
>
>As far as I know it is impossible to program a regex to find "T-rich"

I don't know what you mean by that.

>so I
>have to stick to simply the nonamer aspect (9 chars).
>
>A regex for this might be as follows:
>
>/\G[acgt]{25}gtg[acgt]{15,40}tt[ct]gg[acgt]{4}gg[acgt]/gi
>
>There are various optimizations I could try and different approaches like
>lookaround assertions but for now my concern is with \G.  As far as I
>understand it, the \G assertion means that the /g modifier will not bump the
>engine to the position after the last match, thus operating like pos() in the
>regex and not missing overlapping instances of the pattern?  This is very
>important, I can't miss overlaps.  Do I understand correctly?

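Not quite.  \G is an anchor: it means "must match at the point where 
the previous /g match left off" (or at the start of the string if there 
hasn't been one yet).  Without it, you'd have no way of saying that you 
want successive global matches to succeed only if there's no 
intervening junk.  It does not stop /g from moving pos past each match, 
so it won't find overlapping instances for you.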
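
A minimal sketch of what the anchor does, using a made-up string (the 
variable names are just for illustration).  Without \G, /g skips over 
the X and carries on; with \G, each match has to start exactly where 
the last one ended, so matching stops at the X:

```perl
use strict;
use warnings;

my $str = "aaXaa";   # hypothetical input, not from the original question

# Unanchored: /g is free to skip the X and find the later a's.
my @loose = $str =~ /a/g;

# Anchored: every match must begin at the current pos, so the X ends the run.
my @anchored = $str =~ /\Ga/g;

print scalar(@loose), " ", scalar(@anchored), "\n";   # prints "4 2"
```

In both cases pos still moves past each match, which is why \G alone 
does nothing for overlaps.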

If you want to make sure that you don't miss overlapping instances, you 
need to ensure that pos isn't advanced past the point where the next 
one might start.  A zero-width positive lookahead assertion should do 
that, because it matches without consuming any of the string.  I don't 
have time to play around with this any further, so I don't know whether 
there's a way to avoid the kludge, but check this out:

[peter@tweety ~]$ perl -le '$_="abacad"; print $1 while /(a.+)/g'
abacad
[peter@tweety ~]$ perl -le '$_="abacad"; print $1 while /\G(a.+)/g'
abacad
[peter@tweety ~]$ perl -le '$_="abacad"; print "$1$2" while /(a(?=(.+)))/g'
abacad
acad
ad
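
Here is a rough sketch of how the same lookahead trick might be applied 
to the pattern from the question.  It is untested against real data: 
the sub name find_overlapping and the test sequence are invented for 
illustration, and the \G has been dropped.  The whole pattern goes 
inside the lookahead, so each match is zero-width and /g retries from 
the next character instead of jumping past the hit:

```perl
use strict;
use warnings;

# The regex from the question, minus \G.
my $motif = qr/[acgt]{25}gtg[acgt]{15,40}tt[ct]gg[acgt]{4}gg[acgt]/i;

# Hypothetical helper: returns [offset, matched text] for every start
# position at which the motif matches, including overlapping ones.
sub find_overlapping {
    my ($seq) = @_;
    my @hits;
    # The match itself is empty, so pos($seq) is the start of the hit.
    while ($seq =~ /(?=($motif))/g) {
        push @hits, [ pos($seq), $1 ];
    }
    return @hits;
}

# Made-up sequence with two overlapping hits (the two gtg's in a row).
my $seq = ('a' x 25) . 'gtg' . 'gtg' . ('a' x 15) . 'ttcgg' . 'aaaa' . 'gga';

printf "offset %d, length %d\n", $_->[0], length $_->[1]
    for find_overlapping($seq);
# prints:
# offset 0, length 58
# offset 3, length 55
```

Note that this reports at most one hit per start position (the one the 
greedy {15,40} settles on), so two hits that begin at the same place 
but differ only in spacer length would show up as one.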

--
Peter Scott
Pacific Systems Design Technologies
http://www.perldebugged.com/