[tpm] I wish I was better at regex's

Stuart Watt stuart at morungos.com
Wed Mar 9 13:53:45 PST 2011

Sorry to chip in late, but this actually feels like a tokenizing 
problem, which is part-way to Richard's point. I do a lot of these, and 
there is a pattern for it in the Perl documentation, specifically under 
"What good is \G in a regular expression?" in perlfaq6. It would go 
something like this (*** warning: untested code ***):

use Carp;   # provides croak

while (1) {
   m{\G(\s*;[^\n]*)}gcx && next;                   # comment: matched, but not printed
   m{\G(=)}gcx && do { print $1; next; };
   m{\G(\s+)}gcx && do { print $1; next; };
   m{\G(\w+)}gcx && do { print $1; next; };
   m{\G(\"(?:\\.|[^\\\"])*\")}gcx && do { print $1; next; };   # double-quoted string
   m{\G(\'(?:\\.|[^\\\'])*\')}gcx && do { print $1; next; };   # single-quoted string
   m{\G\z}gcx && last;                             # end of input: done
   croak("Unprocessed input");                     # no token matched at this position
}

print, of course, could be replaced with anything that stores the 
matched text somewhere, e.g., a push onto an output array to be joined 
at the end.
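
As a rough sketch of that output-array variant: the sub name 
strip_comments and the choice to operate on a lexical $text (rather than 
$_) are my own assumptions for illustration, not something from the 
thread, and the token set is just the one used above.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: returns $text with ';' comments dropped, while
# quoted strings (which may themselves contain ';') pass through intact.
sub strip_comments {
    my ($text) = @_;
    my @out;
    while (1) {
        $text =~ m{\G\s*;[^\n]*}gc          and next;                        # comment: drop it
        $text =~ m{\G(=|\s+|\w+)}gc         and do { push @out, $1; next };  # plain tokens
        $text =~ m{\G("(?:\\.|[^\\"])*")}gc and do { push @out, $1; next };  # "..." string
        $text =~ m{\G('(?:\\.|[^\\'])*')}gc and do { push @out, $1; next };  # '...' string
        $text =~ m{\G\z}gc                  and last;                        # end of input
        croak("Unprocessed input at offset " . (pos($text) // 0));
    }
    return join '', @out;
}

print strip_comments(qq{name = "a;b" ; trailing comment\nx = 'c' ;gone\n});
```

Anything that none of the patterns recognise (an unterminated quote, 
say) ends in the croak rather than being silently passed through, which 
is usually what you want from a tokenizer.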

This has the benefit that it isn't all one huge regex, but it is slower. 
Essentially, the idea is simple: \G anchors the match at the current 
position (where the previous match left off), and each line handles a 
different type of token at that position. Strings can therefore be 
handled in separate regexes from words, comments, etc., so comment 
handling is separated from quote handling, which does improve 
maintainability.
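
To make the "current position" part concrete, here is a small sketch 
(the sample string and variable names are made up for illustration). It 
shows pos() advancing after each successful match, and the /c flag 
keeping the position in place when a match fails:

```perl
use strict;
use warnings;

my $input = 'key = 42';
my @seen;

# Each m//gc in scalar context tries to match exactly at pos($input).
$input =~ m{\G(\w+)}gc and push @seen, "word '$1' ends at " . pos($input);
# A failed match with /c leaves pos() where it was instead of resetting it.
$input =~ m{\G(=)}gc   or  push @seen, "no '=' at offset " . pos($input);
$input =~ m{\G(\s+)}gc and push @seen, "space ends at " . pos($input);
$input =~ m{\G(=)}gc   and push @seen, "'=' ends at " . pos($input);

print "$_\n" for @seen;
```

Without /c, the failed '=' attempt would reset pos() and the next \G 
match would start again from the beginning of the string.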

As has been said, it is possible to do this in a single regex (even 
with nesting, in Perl 5.10+) but the result can be an unreadable mess. 
Believe me, I've written some like that. There is also a significant 
risk of hitting serious performance issues. A complex regex can quickly 
degrade if backtracking and lazy quantifiers aren't handled right, and 
you can end up with truly bad performance. The approach above imposes a 
small hit, but usually prevents pathologically bad matching.
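
For comparison, a single-substitution version might look roughly like 
the sketch below. It assumes the task is the one discussed in the 
thread (dropping ';' comments that are not inside quotes); the sub name 
is hypothetical. Quoted strings are captured and put back unchanged, and 
anything else the regex matches is a comment and is replaced by nothing:

```perl
use strict;
use warnings;

# Hypothetical helper name; not from the original thread.
sub strip_comments_single_regex {
    my ($text) = @_;
    $text =~ s{
        ( "(?:\\.|[^\\"])*"      # group 1: a double-quoted string, kept as-is
        | '(?:\\.|[^\\'])*'      #          or a single-quoted string
        )
      | \s* ; [^\n]*             # otherwise a ';' comment, which is dropped
    }{ defined $1 ? $1 : '' }gex;
    return $text;
}

print strip_comments_single_regex(qq{name = "a;b" ; trailing comment\nx = 'c' ;gone\n});
```

This one is still readable because there are only two kinds of token to 
tell apart. Note that, unlike the \G loop, it has no equivalent of the 
croak: input it doesn't recognise (such as an unterminated quote) is 
simply skipped over.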

All the best

On 09/03/2011 4:19 PM, J. Bobby Lopez wrote:
> I would expect that you can just count the number of ';' instances in 
> the string, and get the index of the last instance which resides after 
> the last instance of the last single or double quote.  If there are no 
> quotes, then it's the first instance of ';'.
> On Wed, Mar 9, 2011 at 4:14 PM, Uri Guttman <uri at stemsystems.com> wrote:
>     >>>>> "RJ" == Rob Janes <janes.rob at gmail.com> writes:
>      RJ> i recall some compsci proof that regex cannot do nested pattern
>      RJ> matching, like (xxx) or (xxx (yyy) zzz).  for that you need a lalr
>      RJ> parser, something like recdescent or whatever.
>     that is true for pure regexes. perl's latest can match nested pairs. it
>     isn't trivial but the feature is in there and documented. regardless,
>     this problem is very easy to solve with text::balanced and some basic
>     code. just a single regex is the wrong solution.
>     uri
>     _______________________________________________
>     toronto-pm mailing list
>     toronto-pm at pm.org <mailto:toronto-pm at pm.org>
>     http://mail.pm.org/mailman/listinfo/toronto-pm