Phoenix.pm: Mystery perl module failure

Robert Lindley bob at brogmoid.com
Thu Jan 22 21:43:10 CST 2004


Scott

Thanks!

That method allowed me to solve my problem. Had to modify the code a 
bit. It had two problems.

1. If a line (like the last line of a file) did not have a \n or space at
the end, it went into an infinite loop. Easy to fix: just made sure that all
lines had \n at the end.

2. Quoted strings of the form ' ',,'x' etc. got into trouble. Just made the
quoted capture non-greedy.
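
A minimal sketch of how those two fixes could be applied to the \G loop
quoted below. It is an illustration only, not the code Bob ended up with:
the tokenize name, the per-line interface, and the use of \S+ for bare
words are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line into whitespace-separated tokens,
# keeping a quoted string together when the token starts with a quote.
sub tokenize {
    my ($str) = @_;

    # Fix 1: guarantee trailing whitespace so the \s+ at the end of each
    # pattern can always match and the loop always makes progress.
    $str .= "\n" unless $str =~ /\n\z/;

    my @tokens;
    pos($str) = 0;
    $str =~ m/\G\s+/gc;    # skip leading whitespace, if any

    while (pos($str) < length($str)) {
        if ($str =~ m/\G(['"])(.*?)\1\s+/gc) {
            # Fix 2: non-greedy capture, so the match ends at the first
            # closing quote that is followed by whitespace.
            push @tokens, "$1$2$1";
        }
        elsif ($str =~ m/\G(\S+)\s+/gc) {
            # Bare word; also catches a token with an unmatched quote.
            push @tokens, $1;
        }
        else {
            last;    # defensive: nothing matched, so stop
        }
    }
    return @tokens;
}

print join("|", tokenize(q{not-quoted "quoted stuff" thingie stuff})), "\n";
# prints: not-quoted|"quoted stuff"|thingie|stuff
```

The /c flag keeps pos() in place when a branch fails, so the next branch
retries from the same spot. An unmatched quote simply falls through to the
bare-word branch, so the rest of the line is still returned.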

Ran 50,000+ lines of assembly code through it and no problems found.

Thanks again - - really saved the day!

If anybody is interested, I will post what I ended up with.

Bob Lindley

Scott Walters wrote:

>Text::Balanced.
>
>Okay, it took me three hours to type that. Sorry if this reply is short.
>Post a status later...
>
>That will get things quoted by an arbitrary character or set of matching
>characters. If you know when you're expecting something quoted and when
>you're expecting something unquoted, you should be able to mix those.
>
>Parse::RecDescent is much more powerful, and the power carries a price.
>There is some learning involved.
>
>I usually just hack up a quick parser using the \G trick from perldoc
>perlre. \G in a regex matches where the last match left off, so you
>can match...
>
>  while(1) {
>    if($str =~ m/\G(['"])(.*)\1\s+/gcs) {
>      # $1 will contain the quoting character
>      # $2 contains what was between them
>      # \s+ eats up whitespace
>      # the /gc are needed for \G to work
>    } 
>    if($str =~ m/\G(.*?)\s+/gcs) {
>      # $1 is the word 
>      # the match is non-greedy so that it will stop at the first white-space
>    }
>    pos($str) == length($str) and last;
>  }
>
>This should match a stream of things like:
>
>not-quoted "quoted stuff" thingie stuff
>'another quoted thingie'
>
>and find:
>
>not-quoted
>"quoted stuff"
>thingie
>stuff
>'another quoted thingie'
>
>Hope this helps!
>-scott
>
>On  0, Robert Lindley <bob at brogmoid.com> wrote:
>  
>
>>Here is a puzzle.
>>
>>I am constructing an assembler for a circa 1979 computer that is on the 
>>Apache.
>>
>>Tried to use Text::ParseWords module. It almost worked. I expect it to
>>parse out a quoted token only if the quote immediately follows a word
>>delimiter. I need it to work that way (and the regex looks like it should)
>>but it grabs the whole word at the front and back of the quoted token.
>>
>>What is really bad is that if there is an unmatched single or double
>>quote anywhere on the line it throws the entire line away by returning an
>>empty array of words.
>>
>>I have extracted the part of Text::ParseWords that I am using and put it
>>in an error demo program that is as small as is needed to show the error.
>>
>>Question:
>>
>> Does anybody know how to modify the main regex to:
>>  1. only tokenize a quoted string when that string starts with a single
>>     or double quote
>>  2. return all the tokens (including the quote in place) when any
>>     unmatched quotes are present.
>>
>>To run, copy both enclosed files somewhere and run with this command:
>>
>>./parse-error-demo.pl  test.src
>>
>>I made one change to parse_line -- deleted the reference to
>>$PERL_SINGLE_QUOTE -- that should not affect this problem.
>>
>>Does anyone know of another perl module to parse input lines into tokens,
>>treating quoted strings as single units by ignoring enclosed delimiters?
>>
>>Thanks for any help.
>>
>>Bob Lindley
>>
>>[attachment: parse-error-demo.pl]
>>
>>#!/usr/bin/perl
>>#
>>use strict 'vars';
>>use warnings;
>># use Text::ParseWords;
>>my($file, $input, $inline, @words1);
>>  $file = shift;
>>  open IN, $file or die "Can't open $file:\n   $!\n";
>>  # read all lines in current input file.
>>  while($inline = <IN>) {
>>    $inline =~ s/\s+$//; # trim trailing white space
>>    $inline =~ s/^\s+//; # trim leading white space
>>    print "|$inline|\n";
>>    if($inline eq "") { next; }  # Skip blank lines
>>    @words1 = &parse_line('\s+' , 'delimiters', $inline);
>>    print join "|", @words1, "\n--------\n";
>>    # Each item in @words holds:
>>    #    empty string '' (e.g. word starts in col 1.)
>>    #    word with only delimiters present
>>    #    delimited word
>>    #
>>  }
>>  close IN;
>>  exit;
>>
>>sub parse_line {
>>  # We will be testing undef strings
>>  no warnings;
>>  use re 'taint'; # if it's tainted, leave it as such
>>
>>  my($delimiter, $keep, $line) = @_;
>>  my($quote, $quoted, $unquoted, $delim, $word, @pieces);
>>  while (length($line)) {
>>    ($quote, $quoted, undef, $unquoted, $delim, undef) =
>>      $line =~ m/^(["'])                 # a $quote
>>      ((?:\\.|(?!\1)[^\\])*)    # and $quoted text
>>      \1                     # followed by the same quote
>>      ([\000-\377]*)         # and the rest
>>      |                       # --OR--
>>      ^((?:\\.|[^\\"'])*?)    # an $unquoted text
>>      (\Z(?!\n)|(?-x:$delimiter)|(?!^)(?=["']))
>>                                               # plus EOL, delimiter, or quote
>>      ([\000-\377]*)           # the rest
>>      /x;                      # extended layout
>>    return() unless( $quote || length($unquoted) || length($delim));
>>    $line = $+;
>>    if ($keep) {
>>      $quoted = "$quote$quoted$quote";
>>    } else {
>>      $unquoted =~ s/\\(.)/$1/g;
>>      if (defined $quote) {
>>        $quoted =~ s/\\(.)/$1/g if ($quote eq '"');
>>        $quoted =~ s/\\([\\'])/$1/g if ($quote eq "'");
>>      }
>>    }
>>    $word .= defined $quote ? $quoted : $unquoted;
>>    if (length($delim)) {
>>      push(@pieces, $word);
>>      push(@pieces, $delim) if ($keep eq 'delimiters');
>>      undef $word;
>>    }
>>    if (!length($line)) {
>>      push(@pieces, $word);
>>    }
>>  }
>>  return(@pieces);
>>}
>>
>>
>>
>>__END__
>>
>>[attachment: test.src]
>>
>>An ordinary line parses just fine 'this has a space in it.'
>>Mismatched quotes throw away the whole line "mismatched quotes.'
>>Dave O'Neil worked with George O'Malley on this project.
>>My name is David O'Neil 
>>                        ^ENABLES THE "SSS" MSG'S TO THE DTC,    {57-000
>>^ NOTE THE '' ABOVE MEANS TO USE ONE ' CHARACTER                {57-002
>>