diddling with regexes

Mon Jan 20 13:07:55 CST 2003

On Fri, 17 Jan 2003, nkuipers wrote:

> Hi everyone,
> 
> A colleague of mine is annotating, by hand, DNA sequences tens of thousands of 
> residues in length.  I can't speed up the annotation per se since she needs to 
> decide for each feature whether or not to remark on it and what to note.  
> However I can flag the features of interest for her so that she can zoom right 
> to them sequentially, like a pre-processing step.  She needs to find all 
> instances of the following:
> 
> T-rich nonamer->12bp->heptamer with terminal GTG->15-40bp->TTYGGNNNNGGN
> 
> As far as I know it is impossible to program a regex to find "T-rich" so I
> have to stick to simple the nonamer aspect (9 chars).
> 

I do not understand this.

> A regex for this might be as follows:
> 
> /\G[acgt]{25}gtg[acgt]{15,40}tt[ct]gg[acgt]{4}gg[acgt]/gi
> 
> There are various optimizations I could try and different approaches like 
> lookaround assertions but for now my concern is with \G.  As far as I 
> understand it, the \G assertion means that the /g modifier will not bump the 
> engine to the position after the last match, thus operating like pos() in the 
> regex and not missing overlapping instances of the pattern?  This is very 

I think the opposite is true: with /g the engine does resume at the
position after the last match, and \G does not change that, so overlaps
are missed either way. What \G adds is an anchor at that position, which
prevents the regex from skipping ahead past sections that don't match.

The simplest way to ensure that overlaps are found is to restart the regex
at exactly one character past the start of the last match.

The following example shows what I mean:

	#!perl
	# find all occurrences of two digits, including overlaps

	$_ = '11111 22222 33333 44444 55555';

	while (m/(\d\d)/g)
	{
	   print "pos=",pos()," \$1=$1, \$` \$& \$' =[$`][$&][$']\n";
	   #
	   # back up to one character past the start of what just matched
	   #
	   pos() = pos() - length($&) + 1;
	}

You can easily see the effect of \G by putting it into the m// above:

	while (m/\G(\d\d)/g)

The loop will now stop at the first blank, because \G only lets a match
start exactly at the current pos().

If you comment out the pos= assignment then you'll see that the regex
match misses the overlaps.
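
To put the four combinations side by side (with or without \G, with or
without the pos() reset), here is a small sketch. The helper name
match_starts and its anchor/overlap options are made up for this
illustration, and it uses $-[0] (the start offset of the last match)
instead of pos() - length($&), which should come to the same thing.

```perl
use strict;
use warnings;

# match_starts is a hypothetical helper, not part of the script above.
# It returns the start offset of every two-digit match in $text.
#   anchor  => 1  puts \G in front of the pattern
#   overlap => 1  backs pos() up to one past the start of each match
sub match_starts {
    my ($text, %opt) = @_;
    my $re = $opt{anchor} ? qr/\G\d\d/ : qr/\d\d/;
    my @starts;
    while ($text =~ /$re/g) {
        push @starts, $-[0];
        pos($text) = $-[0] + 1 if $opt{overlap};
    }
    return @starts;
}

my $text = '111 22';
print "plain /g:        @{[ match_starts($text) ]}\n";
print "pos() reset:     @{[ match_starts($text, overlap => 1) ]}\n";
print "\\G only:         @{[ match_starts($text, anchor => 1) ]}\n";
print "\\G + pos() reset: @{[ match_starts($text, anchor => 1, overlap => 1) ]}\n";
```

Only the runs with the pos() reset report the overlapping match at offset
1, and only the runs without \G get past the blank to the match at offset 4.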

I don't know enough about what you're matching to give a more specific
example.
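
That said, here is a rough sketch of the same restart trick applied to
the regex from your message, with the \G dropped. The pattern is copied
as-is from your post; the sub name find_candidates and the shape of its
return value are only my assumptions about what would be handy. Note that
because {15,40} is greedy, you get at most one hit per start position.

```perl
use strict;
use warnings;

# Pattern taken verbatim from the quoted message, minus the \G.
my $motif = qr/[acgt]{25}gtg[acgt]{15,40}tt[ct]gg[acgt]{4}gg[acgt]/i;

# find_candidates is a hypothetical name. It returns a list of
# [start offset, matched text] pairs, including overlapping hits.
sub find_candidates {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /$motif/g) {
        my ($start, $end) = ($-[0], $+[0]);
        push @hits, [ $start, substr($seq, $start, $end - $start) ];
        # restart one character past the start of this match
        pos($seq) = $start + 1;
    }
    return @hits;
}

# Made-up test sequence: "gtgtg" offers two possible GTG positions,
# so two overlapping candidates begin two residues apart.
my $seq = ('a' x 25) . 'gtgtg' . ('a' x 20) . 'ttcgg' . 'aaaa' . 'gg' . 'a';
for my $hit (find_candidates($seq)) {
    printf "start=%d length=%d\n", $hit->[0], length $hit->[1];
}
```

The start offsets could then be handed to whatever viewer your colleague
uses so she can jump from one candidate to the next.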