Phoenix.pm: Mystery perl module failure
Robert Lindley
bob at brogmoid.com
Thu Jan 22 21:43:10 CST 2004
Scott
Thanks!
That method allowed me to solve my problem. Had to modify the code a
bit. It had two problems.
1. If a line (such as the last line of a file) did not have a \n or space at
the end, it went into an infinite loop. Easy to fix: I just made sure that all
lines had a \n at the end.
2. Quoted strings of the form ' ',,'x' etc. got into trouble. I just made the
quoted capture non-greedy.
Ran 50,000+ lines of assembly code through it and no problems found.
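
For anyone following along, the two changes described above can be sketched
roughly as follows. This is only a minimal illustration under assumptions, not
the code that was actually run: the sub name tokenize_line is made up for the
example, and the word pattern uses \S+ instead of the original (.*?) so that an
empty word can never match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): split one line into
# plain words and quoted strings using the \G technique.
sub tokenize_line {
    my ($str) = @_;

    # Change 1: make sure the line ends in whitespace, so the trailing \s+
    # in both patterns can always match and the loop always makes progress.
    $str .= "\n" unless $str =~ /\s\z/;

    my @tokens;
    pos($str) = 0;           # start scanning at the beginning
    $str =~ m/\G\s+/gc;      # skip any leading whitespace

    while (pos($str) < length($str)) {
        # Change 2: the quoted capture is non-greedy (.*?), so it stops at
        # the first closing quote that is followed by whitespace.
        if ($str =~ m/\G(['"])(.*?)\1\s+/gcs) {
            push @tokens, "$1$2$1";   # keep the quotes in place
        }
        elsif ($str =~ m/\G(\S+)\s+/gc) {
            push @tokens, $1;         # an ordinary word
        }
        else {
            last;                     # safety net: never spin forever
        }
    }
    return @tokens;
}

# Print one token per line for a sample input.
print "$_\n" for tokenize_line(q{not-quoted "quoted stuff" thingie 'another quoted thingie'});
```

Because a quote is only treated as a delimiter when it starts a token, a word
such as O'Neil falls through to the ordinary-word branch, and a line with an
unmatched quote is still returned word by word.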
Thanks again - - really saved the day!
If anybody is interested, I will post what I ended up with.
Bob Lindley
Scott Walters wrote:
>Text::Balanced.
>
>Okay, it took me three hours to type that. Sorry if this reply is short.
>Post a status later...
>
>That will get things quoted by an arbitrary character or set of matching
>characters. If you know when you're expecting something quoted and when
>you're expecting something unquoted, you should be able to mix those.
>
>Parse::RecDescent is much more powerful, and the power carries a price.
>There is some learning involved.
>
>I usually just hack up a quick parser using the \G trick from perldoc
>perlre. \G in a regex matches where the last match left off, so you
>can match...
>
>    while(1) {
>        if($str =~ m/\G(['"])(.*)\1\s+/gcs) {
>            # $1 will contain the quoting character
>            # $2 contains what was between them
>            # \s+ eats up whitespace
>            # the /gc are needed for \G to work
>        }
>        if($str =~ m/\G(.*?)\s+/gcs) {
>            # $1 is the word
>            # the match is non-greedy so that it will stop at the first white-space
>        }
>        pos($str) == length($str) and last;
>    }
>
>This should match a stream of things like:
>
>not-quoted "quoted stuff" thingie stuff
>'another quoted thingie'
>
>and find:
>
>not-quoted
>"quoted stuff"
>thingie
>stuff
>'another quoted thingie'
>
>Hope this helps!
>-scott
>
>On 0, Robert Lindley <bob at brogmoid.com> wrote:
>
>
>>Here is a puzzle.
>>
>>I am constructing an assembler for a circa 1979 computer that is on the
>>Apache.
>>
>>Tried to use Text::ParseWords module. It almost worked. I expect it to
>>parse out a quoted token only if the quote immediately follows a word
>>delimiter. I need it to work that way (and the regex looks like it should)
>>but it grabs the whole word at the front and back of the quoted token.
>>
>>What is really bad is that if there is an unmatched single or double
>>quote anywhere on the line it throws the entire line away by returning
>>an empty array of words.
>>
>>I have extracted the part of Text::ParseWords that I am using and put it
>>in an error demo program that is as small as is needed to show the error.
>>
>>Question:
>>
>> Does anybody know how to modify the main regex to:
>> 1. only tokenize a quoted string when that string starts with a single
>>    or double quote
>> 2. return all the tokens (including the quote in place) when any
>>    unmatched quotes are present.
>>
>>To run, copy both enclosed files somewhere and run with this command:
>>
>>./parse-error-demo.pl test.src
>>
>>I made one change to parse_line (deleted the reference to
>>$PERL_SINGLE_QUOTE) that should not affect this problem.
>>
>>Does anyone know of another perl module to parse input lines into tokens
>>treating quoted strings as single units by ignoring enclosed delimiters?
>>
>>Thanks for any help.
>>
>>Bob Lindley
>>
>>[Attachment: parse-error-demo.pl]
>>
>>#!/usr/bin/perl
>>#
>>use strict 'vars';
>>use warnings;
>># use Text::ParseWords;
>>my($file, $input, $inline, @words1);
>>$file = shift;
>>open IN, $file or die "Can't open $file:\n $!\n";
>># read all lines in current input file.
>>while($inline = <IN>) {
>>    $inline =~ s/\s+$//; # trim trailing white space
>>    $inline =~ s/^\s+//; # trim leading white space
>>    print "|$inline|\n";
>>    if($inline eq "") { next; } # Skip blank lines
>>    @words1 = &parse_line('\s+' , 'delimiters', $inline);
>>    print join "|", @words1, "\n--------\n";
>>    # Each item in @words holds:
>>    # empty string '' (e.g. word starts in col 1.)
>>    # word with only delimiters present
>>    # delimited word
>>    #
>>}
>>close IN;
>>exit;
>>
>>sub parse_line {
>>    # We will be testing undef strings
>>    no warnings;
>>    use re 'taint'; # if it's tainted, leave it as such
>>
>>    my($delimiter, $keep, $line) = @_;
>>    my($quote, $quoted, $unquoted, $delim, $word, @pieces);
>>    while (length($line)) {
>>        ($quote, $quoted, undef, $unquoted, $delim, undef) =
>>            $line =~ m/^(["'])                 # a $quote
>>                       ((?:\\.|(?!\1)[^\\])*)  # and $quoted text
>>                       \1                      # followed by the same quote
>>                       ([\000-\377]*)          # and the rest
>>                     |                         # --OR--
>>                       ^((?:\\.|[^\\"'])*?)    # an $unquoted text
>>                       (\Z(?!\n)|(?-x:$delimiter)|(?!^)(?=["']))
>>                                               # plus EOL, delimiter, or quote
>>                       ([\000-\377]*)          # the rest
>>                      /x;                      # extended layout
>>        return() unless( $quote || length($unquoted) || length($delim));
>>        $line = $+;
>>        if ($keep) {
>>            $quoted = "$quote$quoted$quote";
>>        } else {
>>            $unquoted =~ s/\\(.)/$1/g;
>>            if (defined $quote) {
>>                $quoted =~ s/\\(.)/$1/g if ($quote eq '"');
>>                $quoted =~ s/\\([\\'])/$1/g if ($quote eq "'");
>>            }
>>        }
>>        $word .= defined $quote ? $quoted : $unquoted;
>>        if (length($delim)) {
>>            push(@pieces, $word);
>>            push(@pieces, $delim) if ($keep eq 'delimiters');
>>            undef $word;
>>        }
>>        if (!length($line)) {
>>            push(@pieces, $word);
>>        }
>>    }
>>    return(@pieces);
>>}
>>
>>
>>
>>__END__
>>
>>[Attachment: test.src]
>>
>>An ordinary line parses just fine 'this has a space in it.'
>>Mismatched quotes throw away the whole line "mismatched quotes.'
>>Dave O'Neil worked with George O'Malley on this project.
>>My name is David O'Neil
>> ^ENABLES THE "SSS" MSG'S TO THE DTC, {57-000
>>^ NOTE THE '' ABOVE MEANS TO USE ONE ' CHARACTER {57-002
>>