SPUG: RE / Split Question

Michael R. Wolf MichaelRWolf at att.net
Sat Nov 15 00:07:27 CST 2003


An old thread.  I'm catching up....


sthoenna at efn.org (Yitzchak Scott-Thoennes) writes:

> On Thu, 31 Jul 2003 00:02:11 -0700, krahnj at acm.org wrote:
>>$ perl -le'
>>$glob = "425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t ";
>>
>>@array = $glob =~ /( \b\d+ \s+ \d+ (?:\s+ \D\w*)+ )/xg;
>>
>>print for @array;
>>'
>>425 501 sttlwa01t 
>>425 712 sttlwa01t tacwa02t 
>>425 337 tacwa02t 
>
> The problem with this kind of approach is that it silently ignores bad
> data (or good data if you make a mistake in your regex).  I like to do
> this kind of splitting with something like:
>
> @array = $glob =~ /\G ( \b\d+ \s+ \d+ (?:\s+ \D\w*)+ ) \s+ /xgc;
> print "error!" if (pos($glob)||0) != length($glob);
>
> This always starts each match where the preceding one left off and
> then verifies that the entire string was consumed.

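Watch out for the required whitespace (\s+) at the end of the RE. It
matches the interstitial whitespace between records, so it can't be
made completely optional (\s*), but as written it also makes the final
record fail unless the string ends in whitespace. I've refined the RE
in this test harness so that a record may also end at end of string (\z).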
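
A minimal sketch of the difference, assuming a hypothetical helper
split_records() and a sample line with no trailing whitespace (neither
is from the thread). With the trailing \s+ the last record is left
unmatched; with (?:\s+|\z) the whole line is consumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): apply $re repeatedly from the
# start of $str and report whether the whole string was consumed.
sub split_records {
    my ($str, $re) = @_;
    my @records  = $str =~ /$re/gc;    # /c keeps pos() after the final failure
    my $consumed = (pos($str) || 0) == length($str);
    return ($consumed, @records);
}

# The quoted RE (trailing \s+) and the refined RE (whitespace or end of string).
my $with_ws = qr/\G ( \b\d+ \s+ \d+ (?:\s+ \D\w*)+ ) \s+ /x;
my $or_eos  = qr/\G ( \b\d+ \s+ \d+ (?:\s+ \D\w*)+ ) (?:\s+|\z) /x;

# Sample input that does NOT end in whitespace.
my $line = "425 501 sttlwa01t 425 337 tacwa02t";

my ($ws_ok,  @ws_recs)  = split_records($line, $with_ws);
my ($eos_ok, @eos_recs) = split_records($line, $or_eos);

printf "trailing \\s+  : %d record(s), consumed: %s\n",
    scalar @ws_recs,  $ws_ok  ? "yes" : "no";
printf "\\s+ or \\z     : %d record(s), consumed: %s\n",
    scalar @eos_recs, $eos_ok ? "yes" : "no";
```

On this input the first pattern should report one record and an
unconsumed remainder, the second two records and full consumption.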

Thanks for the "gc" modifier -- I had to look it up. It works nicely
with the \G.
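
For anyone else who has to look it up, here is a small sketch of what
/c changes, using a made-up string and a simpler pattern than the one
in this thread. A list-context //g always ends on a failed match, which
normally resets pos(); with /c the position of the last successful
match is kept, so the leftover text can be located.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "12 34 oops";    # illustrative input, not from the thread

# Without /c: the final failed match resets pos() to undef.
my @plain     = $text =~ /\G(\d+)\s*/g;
my $pos_plain = pos($text);

# With /c: pos() stays where the last successful match ended.
my @kept     = $text =~ /\G(\d+)\s*/gc;
my $pos_kept = pos($text);
my $leftover = substr($text, $pos_kept || 0);

print "without /c: pos is ", (defined $pos_plain ? $pos_plain : "undef"), "\n";
print "with /c:    pos is ", (defined $pos_kept  ? $pos_kept  : "undef"),
      ", leftover '$leftover'\n";
```

Both matches capture the same two numbers; only the value of pos()
afterwards differs.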

Michael Wolf

================================================================
#! /usr/bin/perl -w

while ($glob = <DATA>) {
    chomp $glob;
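    # \G anchors each match where the previous one ended; /c keeps pos()
    # after the final failed match so leftover input can be detected.
    @array = $glob =~ /\G ( \b\d+ \s+ \d+ (?:\s+ \D\w*)+ ) (?:\s+|\z) /xgc;
    unless (@array && (pos($glob)||0) == length($glob)) {
        my $fmt = qq(Error.  Leftover string '%s', line %d, position %d\n);
        warn sprintf($fmt => $'||$glob, $., pos($glob)||0);
    }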
    print "$...\n", (join "\n" => @array), "\n\n";
}
__DATA__
Total trash
Beginning trash 425 501 sttlwa01t End Trash
425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t
425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t  
425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t ...
425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t 425 ...
425 501 sttlwa01t 425 712 sttlwa01t tacwa02t 425 337 tacwa02t 425 703 junk




