Phoenix.pm: Mystery perl module failure

Scott Walters scott at illogics.org
Mon Jan 19 01:06:34 CST 2004


Text::Balanced.

Okay, it took me three hours to type that. Sorry if this reply is short.
Post a status later...

That will get things quoted by an arbitrary character or set of matching
characters. If you know when you're expecting something quoted and when
you're expecting something unquoted, you should be able to mix those.
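For what it's worth, here is a minimal sketch of mixing the two with
Text::Balanced's extract_delimited. The tokenize() helper is a made-up name
for illustration, and the sketch assumes a token only counts as quoted when
the quote is its first character. Anything else, including a quote with no
partner, is treated as a plain whitespace-delimited word.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# tokenize() is a hypothetical helper, not part of Text::Balanced.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    $text =~ s/^\s+//;                      # drop leading whitespace
    while (length $text) {
        if ($text =~ /^["']/) {
            # In list context extract_delimited leaves $text alone and
            # returns the quoted string (quotes included) and the remainder.
            my ($quoted, $rest) = extract_delimited($text, q{"'});
            if (defined $quoted and length $quoted) {
                push @tokens, $quoted;
                $text = $rest;
                $text =~ s/^\s+//;
                next;
            }
        }
        # Plain word, or a quote with no partner: take up to the next space.
        $text =~ s/^(\S+)\s*//;
        push @tokens, $1;
    }
    return @tokens;
}

print "$_\n" for tokenize(q{not-quoted "quoted stuff" O'Neil 'another one'});
```

Because the quoted branch is only tried at the start of a token, an
apostrophe inside a word like O'Neil stays part of that word.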

Parse::RecDescent is much more powerful, and the power carries a price.
There is some learning involved.

I usually just hack up a quick parser using the \G trick from perldoc
perlre. \G in a regex matches where the last match left off, so you
can match...

  while(1) {
    if($str =~ m/\G(['"])(.*?)\1(?:\s+|\z)/gcs) {
      # $1 will contain the quoting character
      # $2 contains what was between them; the match is non-greedy so
      # that it stops at the first closing quote
      # (?:\s+|\z) eats up whitespace, or matches at the end of the string
      # the /gc are needed for \G to work
    }
    elsif($str =~ m/\G(\S+)(?:\s+|\z)/gc) {
      # $1 is the word: everything up to the first white-space
    }
    else {
      last;  # nothing matched here, so stop rather than loop forever
    }
    pos($str) == length($str) and last;
  }

This should match a stream of things like:

not-quoted "quoted stuff" thingie stuff
'another quoted thingie'

and find:

not-quoted
"quoted stuff"
thingie
stuff
'another quoted thingie'
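
Here is a self-contained sketch of the same \G idea that you can run against
that sample input. The split_tokens() name is made up for illustration, and
it assumes that a quoted token must start at a token boundary and end just
before whitespace or the end of the string. A quote that doesn't fit that
pattern is kept as part of an ordinary word.

```perl
use strict;
use warnings;

# split_tokens() is a hypothetical helper name.
sub split_tokens {
    my ($str) = @_;
    my @tokens;
    pos($str) = 0;                          # start scanning at the beginning
    while (pos($str) < length($str)) {
        if ($str =~ m/\G\s+/gc) {
            next;                           # skip whitespace between tokens
        }
        elsif ($str =~ m/\G(['"])(.*?)\1(?=\s|\z)/gcs) {
            push @tokens, "$1$2$1";         # quoted token, quotes kept in place
        }
        elsif ($str =~ m/\G(\S+)/gc) {
            push @tokens, $1;               # plain word, or an unmatched quote
        }
    }
    return @tokens;
}

my $input = qq{not-quoted "quoted stuff" thingie stuff\n'another quoted thingie'\n};
print "$_\n" for split_tokens($input);
```

Each pass through the loop consumes at least one character, since the
character at pos() is either whitespace or the start of a \S+ run, so the
loop can't spin forever.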

Hope this helps!
-scott

On  0, Robert Lindley <bob at brogmoid.com> wrote:
> 
> Here is a puzzle.
> 
> I am constructing an assembler for a circa 1979 computer that is on the 
> Apache.
> 
> I tried to use the Text::ParseWords module. It almost worked. I expect
> it to parse out a quoted token only if the quote immediately follows a
> word delimiter. I need it to work that way (and the regex looks like it
> should) but it grabs the whole word at the front and back of the quoted
> token.
> 
> What is really bad is that if there is an unmatched single or double
> quote anywhere on the line it throws the entire line away by returning
> an empty array of words.
> 
> I have extracted the part of Text::ParseWords that I am using and put it
> in an error demo program that is as small as is needed to show the error.
> 
> Question:
> 
>  Does anybody know how to modify the main regex to:
>   1. only tokenize a quoted string when that string starts with a single
>      or double quote
>   2. return all the tokens (including the quote in place) when any
>      unmatched quotes are present.
> 
> To run, copy both enclosed files somewhere and run with this command:
> 
> ./parse-error-demo.pl  test.src
> 
> I made one change to parse_line -- deleted reference to
> $PERL_SINGLE_QUOTE -- that should not affect this problem.
> 
> Does anyone know of another perl module to parse input lines into tokens
> treating quoted strings as single units by ignoring enclosed delimiters?
> 
> Thanks for any help.
> 
> Bob Lindley
> 
> File: parse-error-demo.pl
> 
> #!/usr/bin/perl
> #
> use strict 'vars';
> use warnings;
> # use Text::ParseWords;
> my($file, $input, $inline, @words1);
>   $file = shift;
>   open IN, $file or die "Can't open $file:\n   $!\n";
>   # read all lines in current input file.
>   while($inline = <IN>) {
>     $inline =~ s/\s+$//; # trim trailing white space
>     $inline =~ s/^\s+//; # trim leading white space
>     print "|$inline|\n";
>     if($inline eq "") { next; }  # Skip blank lines
>     @words1 = &parse_line('\s+' , 'delimiters', $inline);
>     print join "|", @words1, "\n--------\n";
>     # Each item in @words1 holds:
>     #    empty string '' (e.g. word starts in col 1.)
>     #    word with only delimiters present
>     #    delimited word
>     #
>   }
>   close IN;
>   exit;
> 
> sub parse_line {
>   # We will be testing undef strings
>   no warnings;
>   use re 'taint'; # if it's tainted, leave it as such
> 
>   my($delimiter, $keep, $line) = @_;
>   my($quote, $quoted, $unquoted, $delim, $word, @pieces);
>   while (length($line)) {
>     ($quote, $quoted, undef, $unquoted, $delim, undef) =
>       $line =~ m/^(["'])                 # a $quote
>       ((?:\\.|(?!\1)[^\\])*)    # and $quoted text
>       \1                     # followed by the same quote
>       ([\000-\377]*)         # and the rest
>       |                       # --OR--
>       ^((?:\\.|[^\\"'])*?)    # an $unquoted text
>       (\Z(?!\n)|(?-x:$delimiter)|(?!^)(?=["']))
>                                                # plus EOL, delimiter, or quote
>       ([\000-\377]*)           # the rest
>       /x;                      # extended layout
>     return() unless( $quote || length($unquoted) || length($delim));
>     $line = $+;
>     if ($keep) {
>       $quoted = "$quote$quoted$quote";
>     } else {
>       $unquoted =~ s/\\(.)/$1/g;
>       if (defined $quote) {
>         $quoted =~ s/\\(.)/$1/g if ($quote eq '"');
>         $quoted =~ s/\\([\\'])/$1/g if ($quote eq "'");
>       }
>     }
>     $word .= defined $quote ? $quoted : $unquoted;
>     if (length($delim)) {
>       push(@pieces, $word);
>       push(@pieces, $delim) if ($keep eq 'delimiters');
>       undef $word;
>     }
>     if (!length($line)) {
>       push(@pieces, $word);
>     }
>   }
>   return(@pieces);
> }
> 
> 
> 
> __END__
> 
> File: test.src
> 
> An ordinary line parses just fine 'this has a space in it.'
> Mismatched quotes throw away the whole line "mismatched quotes.'
> Dave O'Neil worked with George O'Malley on this project.
> My name is David O'Neil 
>                         ^ENABLES THE "SSS" MSG'S TO THE DTC,    {57-000
> ^ NOTE THE '' ABOVE MEANS TO USE ONE ' CHARACTER                {57-002
> 