<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    Sorry to chip in late, but this actually feels like a tokenizing

    problem, which is part-way to Richard's point. I do a lot of these,

    and there is a pattern in the perldocs, specifically under "What

    good is \G in a regular expression?" in perlfaq6. It would go

    something like this (*** warning: untested code ***):<br>

    <br>

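    use Carp;   # provides croak; the input is expected in $_<br>

    while(1) {<br>

      # "next" restarts the loop after each token, so croak is only reached when nothing matched<br>

      m{\G(\s*;[^\n]*)}gcx && do { next; };   # Don't print when matched a comment<br>

      m{\G(=)}gcx && do { print $1; next; };<br>

      m{\G(\s+)}gcx && do { print $1; next; };<br>

      m{\G(\w+)}gcx && do { print $1; next; };<br>

      m{\G(\"(?:\\.|[^\\\"])*\")}gcx && do { print $1; next; };<br>

      m{\G(\'(?:\\.|[^\\\'])*\')}gcx && do { print $1; next; };<br>

      m{\G$}gcx && last;<br>

      croak("Unprocessed input");<br>

    }<br>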

      <br>

    The print calls could, of course, be replaced with code that stores

    each matched token somewhere, e.g., in an output array to be joined later.<br>

    <br>
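    For what it's worth, here is a minimal sketch of that output-array
    variant, assuming the whole input is held in one string and that ';'
    starts a comment as in the loop above. The sub name strip_comments and
    the sample input are made up for illustration; they are not from the
    thread.<br>

```perl
use strict;
use warnings;
use Carp qw(croak);

# strip_comments() is a hypothetical helper name. It walks the input with
# \G-anchored matches and collects every token except ';' comments into
# an array, which is joined at the end.
sub strip_comments {
    my ($text) = @_;
    my @out;
    for ($text) {    # alias $_ to $text so each m{}gc shares one pos()
        while (1) {
            m{\G \s* ; [^\n]*          }gcx and next;                       # comment: dropped
            m{\G ( = )                 }gcx and do { push @out, $1; next };
            m{\G ( \s+ )               }gcx and do { push @out, $1; next };
            m{\G ( \w+ )               }gcx and do { push @out, $1; next };
            m{\G ( "(?:\\.|[^\\"])*" ) }gcx and do { push @out, $1; next };
            m{\G ( '(?:\\.|[^\\'])*' ) }gcx and do { push @out, $1; next };
            m{\G \z                    }gcx and last;                       # end of input
            croak "Unprocessed input at offset " . (pos() // 0);
        }
    }
    return join '', @out;
}

print strip_comments(qq{name = "a;b" ; trailing comment\nx = 'c' ;gone\n});
```

    Collecting into @out rather than printing keeps the tokenizer free of
    side effects, so the result can be checked or post-processed before it
    goes anywhere.<br>
    <br>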

    This has the benefit that it isn't all one huge regex, though it is

    somewhat slower. The idea is simple: \G anchors a match at the current

    position in the string (where the previous /g match ended), /c keeps

    that position when a match fails, and each line handles one type of

    token at that position. Strings are therefore handled in separate

    regexes from words, comments, etc., so comment handling is separated

    from quote handling, which does improve maintainability. <br>

    <br>
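    To make the position handling concrete, here is a small sketch of how
    pos(), \G and the /c flag interact. The sample string and the variable
    names are illustrative only.<br>

```perl
use strict;
use warnings;

my $input = "foo = 42";

$input =~ m{\G(\w+)}gc;            # matches "foo"
my $after_word = pos($input);      # position just past "foo"

$input =~ m{\G(\d+)}gc;            # fails: the next character is a space
my $after_fail = pos($input);      # unchanged, because /c keeps pos on failure

$input =~ m{\G\s*=\s*}gc;          # consumes " = "
$input =~ m{\G(\d+)}gc;            # now the digits are at the current position
my $number = $1;
my $at_end = pos($input);

$input =~ m{\Gnomatch}g;           # no /c: a failed match resets pos
my $reset = defined pos($input) ? "kept" : "reset";

print "$after_word $after_fail $number $at_end $reset\n";   # prints "3 3 42 8 reset"
```

    This is why every line in the loop carries both /g and /c: without /c,
    the first token type that fails to match would throw away the current
    position.<br>
    <br>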

    As has been said, it is possible to do this in a single regex (even

    with nesting in Perl 5.10+), but the result can be an unreadable mess.

    Believe me, I've written some like that. There is also a significant

    risk of hitting serious performance issues: a complex regex can

    degrade quickly if backtracking and lazy quantifiers aren't handled

    right, and you can end up with truly bad performance. The approach

    above imposes a small hit, but usually prevents pathologically

    bad matching. <br>

    <br>
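    One way to keep backtracking contained inside an individual token
    regex is a possessive quantifier (Perl 5.10+). This is an optional
    extra rather than part of the loop above; the sketch below only shows
    what the quantifier does, and the sample strings are made up.<br>

```perl
use strict;
use warnings;

# Ordinary quantifier: a+ gives one "a" back so the trailing "a" can match.
my $greedy     = ("aaa" =~ /^a+a$/)  ? 1 : 0;
# Possessive quantifier (Perl 5.10+): a++ keeps all three, so the match fails.
my $possessive = ("aaa" =~ /^a++a$/) ? 1 : 0;

# The same idea applied to the double-quoted-string token: with *+ the
# engine does not retry shorter string bodies when the closing quote is missing.
my $string_token = qr{\G("(?:\\.|[^\\"])*+")};

my $good = q{"a \" b" rest};
my $bad  = q{"never closed};

my $matched = $good =~ m{$string_token}gc ? $1 : undef;
my $failed  = $bad  =~ m{$string_token}gc ? $1 : undef;

print "greedy=$greedy possessive=$possessive\n";
print "matched: $matched\n";
print "unterminated string rejected\n" unless defined $failed;
```

    Because the two alternatives in the string body cannot match the same
    character, the possessive form accepts exactly the same strings as the
    original pattern; it only removes the retries on failure.<br>
    <br>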

    All the best<br>

    Stuart<br>

    <br>

    <br>

    <br>

    On 09/03/2011 4:19 PM, J. Bobby Lopez wrote:

    <blockquote

      cite="mid:AANLkTi=es+cFw40e9hhi7Ujn1=2-xRJh+=DG9FeUhAQp@mail.gmail.com"

      type="cite">I would expect that you can just count the number of

      ';' instances in the string, and get the index of the last

      instance which resides after the last instance of the last single

      or double quote.  If there are no quotes, then it's the first

      instance of ';'.<br>

      <br>

      <div class="gmail_quote">On Wed, Mar 9, 2011 at 4:14 PM, Uri

        Guttman <span dir="ltr"><<a moz-do-not-send="true"

            href="mailto:uri@stemsystems.com">uri@stemsystems.com</a>></span>

        wrote:<br>

        <blockquote class="gmail_quote" style="border-left: 1px solid

          rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left:

          1ex;">

          >>>>> "RJ" == Rob Janes <<a

            moz-do-not-send="true" href="mailto:janes.rob@gmail.com">janes.rob@gmail.com</a>>

          writes:<br>

          <br>

           RJ> i recall some compsci proof that regex cannot do

          nested pattern<br>

           RJ> matching, like (xxx) or (xxx (yyy) zzz).  for that you

          need a lalr<br>

           RJ> parser, something like recdescent or whatever.<br>

          <br>

          that is true for pure regexes. perl's latest can match nested

          pairs. it<br>

          isn't trivial but the feature is in there and documented.

          regardless,<br>

          this problem is very easy to solve with text::balanced and

          some basic<br>

          code. just a single regex is the wrong solution.<br>

          <div class="im"><br>

            uri<br>

            <br>

            --<br>

            Uri Guttman  ------  <a moz-do-not-send="true"

              href="mailto:uri@stemsystems.com">uri@stemsystems.com</a>

             --------  <a moz-do-not-send="true"

              href="http://www.sysarch.com" target="_blank">http://www.sysarch.com</a>

            --<br>

            -----  Perl Code Review , Architecture, Development,

            Training, Support ------<br>

            ---------  Gourmet Hot Cocoa Mix  ----  <a

              moz-do-not-send="true" href="http://bestfriendscocoa.com"

              target="_blank">http://bestfriendscocoa.com</a> ---------<br>

            _______________________________________________<br>

          </div>

          <div>

            <div class="h5">toronto-pm mailing list<br>

              <a moz-do-not-send="true" href="mailto:toronto-pm@pm.org">toronto-pm@pm.org</a><br>

              <a moz-do-not-send="true"

                href="http://mail.pm.org/mailman/listinfo/toronto-pm"

                target="_blank">http://mail.pm.org/mailman/listinfo/toronto-pm</a><br>

            </div>

          </div>

        </blockquote>

      </div>

      <br>


    </blockquote>

  </body>

</html>