SPUG: Word boundary regex treated differently by 5.6 and 5.005033
Colin Meyer
cmeyer at helvella.org
Thu Apr 26 16:33:14 CDT 2001
On Thu, Apr 26, 2001 at 12:09:26AM -0700, Ben Burnett wrote:
> At 05:51 PM 4/25/01 -0700, Colin Meyer wrote:
> >More detail can be seen from the regex debugger:
> >perl -M're debug' -le '$t = "abcdefg"; print pos $t while $t =~ m/\B\w/g'
>
> I have to admit I haven't spent much time with the perl debugger. I'll take
> a closer look at this.
The main Perl debugger lets you single-step through code and examine
variables at each step. Very useful. It is invoked with:
perl -d script_name
The non-interactive regex debugger, which really isn't part of the main
debugger, spits out a ton of extra information when compiling and
matching regexes. It is invoked with:
perl -M're debug' script_name
or from within the program:
use re 'debug';
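A minimal sketch of switching the regex debugger on for just one block, assuming a Perl recent enough that the re pragma is lexically scoped. The helper name match_positions and the sample string are made up for illustration. The debug trace goes to STDERR, so the printed positions stay readable on STDOUT.

```perl
use strict;
use warnings;

# Hypothetical helper: collect pos() after each successful /g match.
sub match_positions {
    my ($text) = @_;
    my @positions;
    {
        # Only regexes compiled inside this block emit debug output.
        use re 'debug';
        push @positions, pos($text) while $text =~ m/\B\w/g;
    }
    return @positions;
}

my @positions = match_positions("abcdefg");
print "@positions\n";
```

Running it as "perl script_name 2>trace.txt" (trace.txt being any file name you like) keeps the trace out of the way.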
>
> >It is hard for me to decide if this is a new bug or a bug fix for an
> >old problem. The camel says that /g causes the regex to "start the
> >next match on the same variable at a position *just past* where the
> >last match stopped." The older versions of Perl seem to be looking at
> >the character that the last match ended on in order to determine the
> >border or non-border properties of the character at pos($t). Well,
> >it's either a bug with Perl, or a bug with its documentation. In
> >either case, a report should be submitted with perlbug.
>
> I think it's probably a bug with Perl itself. I can't imagine this change
> in behavior was intentional. I'll have to submit it in the morning.
I tend to think of the new behavior as a bug fix. A boundary assertion (\b
or \B) depends on a condition between two characters (or between a character
and the beginning or end of the string). When the second match is
attempted, pos($test) is 2. The pointer must move on to character 3
before it has examined two characters and can decide if \b or \B
matches. Michael's post draws this out. Only when pos($test) == 0 or
pos($test) == length($test) can the status of \b or \B be determined
with one character.
Hmm. I'm still undecided. I can see arguments for either behavior
being considered correct.
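To make the "condition between two characters" concrete, here is a rough sketch that works out \b versus \B at every offset by hand, without asking the regex engine. The helpers is_word_char and boundary_at are hypothetical names, and the sketch assumes the usual definition: a boundary exists where exactly one of the two neighbors is a \w character, with the space beyond either end of the string counting as non-word.

```perl
use strict;
use warnings;

# True if the character is a word character. The virtual "character"
# beyond either end of the string (undef) counts as a non-word character.
sub is_word_char {
    my ($char) = @_;
    return defined($char) && $char =~ /\w/ ? 1 : 0;
}

# Hypothetical helper: is there a word boundary at offset $pos of $text?
# It inspects the character before $pos and the character at $pos.
sub boundary_at {
    my ($text, $pos) = @_;
    my $before = $pos > 0             ? substr($text, $pos - 1, 1) : undef;
    my $after  = $pos < length($text) ? substr($text, $pos, 1)     : undef;
    return is_word_char($before) != is_word_char($after) ? 1 : 0;
}

my $test = "abcdefg";
for my $pos (0 .. length($test)) {
    printf "pos %d: %s\n", $pos, boundary_at($test, $pos) ? '\b' : '\B';
}
```

Only the first and last offsets have a single real neighbor. Every offset in between needs both the character before it and the character at it.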
Another way to think of it is:
perl -le '$t="abcdefg"; $t=~s/\b.//g; print $t'
Would you expect that to delete all of the characters in the string?
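If \b is judged from the characters on both sides of the current position, only the first character follows a boundary, so the substitution should remove just the 'a'. If each /g iteration instead treated its starting point as the beginning of the string, every character would go. A small sketch for checking what a given Perl does; strip_matches is a hypothetical helper, not anything from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a global substitution to a copy and return it.
sub strip_matches {
    my ($text, $regex) = @_;
    (my $copy = $text) =~ s/$regex//g;
    return $copy;
}

my $t = "abcdefg";
print "s/\\b.//g  leaves '", strip_matches($t, qr/\b./),  "'\n";
print "s/\\B\\w//g leaves '", strip_matches($t, qr/\B\w/), "'\n";
```

Running the same file under each installed Perl makes any difference between versions easy to spot.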
For the record, here's gawk:
gawk '{gsub(/\B./, ""); print}'
Given 'abcdefg' as input, gawk returns 'ag'. Different than any version
of Perl. Oh well.
> Here is an excerpt of code showing the regex hard at work in a motorcycle
> rental application CGI script.
> ...
> # we need to give this request a registration number while we are here.
> # this number will be built out of the initials of each word in the
> # applicant's name, a unique session_key, the applicant's state, and the
> # first two letters of the city that the applicant is in
> my $key = time();
> $key .= "-" . getppid() or $LogH->append("couldn't getppid to add to session key");
> my $request_id = $PASSED_VARS{'name'};
> $request_id =~ s/\B\w//g;
> $request_id =~ s/\W//g;
While you have uncovered an interesting bug, there are several other ways
to do what you are after:
# \w* rather than \w+, so one-letter words (a middle initial) are handled too
$request_id =~ s/\b(\w)\w*\W*/$1/g;
# or
$request_id = join '', map substr($_,0,1), split / /, $request_id;
#...
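For comparison, a self-contained sketch of both approaches wrapped in subs. The sub names and the sample name are made up. The substitution variant uses \w* so that a one-letter word such as a middle initial is consumed as well, and the split variant splits on any run of whitespace rather than on single spaces; both are my assumptions about what the input might look like.

```perl
use strict;
use warnings;

# Substitution approach: keep the first letter of each word, drop the rest.
sub initials_by_substitution {
    my ($name) = @_;
    $name =~ s/\b(\w)\w*\W*/$1/g;
    $name =~ s/\W//g;    # drop any leading punctuation or whitespace left over
    return $name;
}

# Split approach: break on whitespace, keep the first character of each word.
sub initials_by_split {
    my ($name) = @_;
    return join '', map { substr($_, 0, 1) } split ' ', $name;
}

my $name = "John Q Public";    # made-up sample input
print initials_by_substitution($name), "\n";
print initials_by_split($name), "\n";
```

Neither version depends on how \B behaves at pos(), so they should give the same answer on both Perl versions discussed here.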
Have fun,
-C.