SPUG: Word boundary regex treated differently by 5.6 and 5.005033

Colin Meyer cmeyer at helvella.org
Thu Apr 26 16:33:14 CDT 2001

On Thu, Apr 26, 2001 at 12:09:26AM -0700, Ben Burnett wrote:
> At 05:51 PM 4/25/01 -0700, Colin Meyer wrote:
> >More detail can be seen from the regex debugger:
> >perl -M're debug' -le '$t = "abcdefg"; print pos $t while $t =~ m/\B\w/g'
> I have to admit I haven't spent much time with the perl debugger. I'll take 
> a closer look at this.

The main Perl debugger lets you single-step through code and examine
variables at each step. Very useful. It is invoked with:
perl -d script_name

The non-interactive regex debugger, which really isn't part of the main
debugger, spits out a ton of extra information when compiling and
matching regexes. It is invoked with:
perl -M're debug' script_name

or from within the program:
use re 'debug';

> >It is hard for me to decide if this is a new bug or a bug fix for an
> >old problem. The camel says that /g causes the regex to "start the
> >next match on the same variable at a position *just past* where the
> >last match stopped." The older versions of Perl seem to be looking at
> >the character that the last match ended on in order to determine the
> >border or non-border properties of the character at pos($t). Well,
> >it's either a bug with Perl, or a bug with its documentation. In
> >either case, a report should be submitted with perlbug.
> I think it's probably a bug with Perl itself.  I can't imagine this change 
> in behavior was intentional.  I'll have to submit it in the morning.

I am tending to think of the new behavior as a bug fix. The boundary (\B
or \b) depends on a condition between two characters (or a character and
the beginning or end of the string). When the second match is
attempted, pos($test) is 2. The pointer must move on to character 3
before it has examined two characters and can decide if \b or \B
matches. Michael's post draws this out. Only when pos($test) == 0 or
pos($test) == length($test) can the status of \b or \B be determined
with one character.
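
To make the two-character dependence concrete, here is a minimal sketch that decides \b by hand from the characters on either side of an offset, treating anything outside the string as a non-word character. The helper name is_word_boundary is made up for illustration. It models only the documented definition of \b, not the internals of any particular Perl release.

```perl
use strict;
use warnings;

# Hypothetical helper: true if \b would hold at offset $i of $s.
# Offset $i sits between character $i-1 and character $i.
sub is_word_boundary {
    my ($s, $i) = @_;
    # Anything outside the string counts as a non-word character.
    my $before = ($i > 0          && substr($s, $i - 1, 1) =~ /\w/) ? 1 : 0;
    my $after  = ($i < length($s) && substr($s, $i,     1) =~ /\w/) ? 1 : 0;
    return $before != $after;
}

my $test = "abcdefg";
for my $i (0 .. length $test) {
    printf "%d: %s\n", $i, is_word_boundary($test, $i) ? '\b' : '\B';
}
```

Only offsets 0 and length($test) are settled by a single character. Every other offset needs both neighbours.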

Hmm.  I'm still undecided.  I can see arguments for either behavior 
being considered correct.

Another way to think of it is: 
perl -le '$t="abcdefg"; $t=~s/\b.//g; print $t'
Would you expect that to delete all of the characters in the string?
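
The same question can be put in script form. This is only a sketch: the sub name delete_after_boundary is hypothetical, and the result in the comment is what a recent Perl gives, where each \b is judged against the original string. It says nothing about what 5.005 prints.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-liner above; works on a copy
# of its argument, so the caller's string is left alone.
sub delete_after_boundary {
    my ($t) = @_;
    $t =~ s/\b.//g;
    return $t;
}

print delete_after_boundary("abcdefg"), "\n";   # bcdefg on a recent Perl
```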

For the record, here's gawk:
gawk '{gsub(/\B./, ""); print}'

Given 'abcdefg' as input, gawk returns 'ag'.  Different than any version
of Perl.  Oh well.

> Here is an excerpt of code showing the regex hard at work in a motorcycle 
> rental application CGI script.
> ...
>                  # we need to give this request a registration number while we are here.
>                  # this number will be built out of the initials of each word in the
>                  # applicant's name, a unique session_key, the applicant's state, and
>                  # the first two letters of the city that the applicant is in
>                          my $key = time();
>                          $key .= "-" . getppid() or $LogH->append("couldn't getppid to add to session key");

>                          my $request_id = $PASSED_VARS{'name'};
>                          $request_id =~ s/\B\w//g;
>                          $request_id =~ s/\W//g;

While you have uncovered an interesting bug, there are several other ways
to do what you are after:

$request_id =~ s/\b(\w)\w*\W*/$1/g;   # \w* so one-letter words are handled too
# or
$request_id = join '', map substr($_,0,1), split ' ', $request_id;
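
For comparison, here are both approaches wrapped as subs so they can be run side by side. This is a sketch under assumptions: the sub names and sample names are invented for illustration, the regex version uses \w* so one-letter words are handled, and the split version breaks on any run of whitespace.

```perl
use strict;
use warnings;

# Regex approach: keep the first \w of each word, drop the rest of the
# word and any non-word characters that follow it.
sub initials_regex {
    my ($name) = @_;
    $name =~ s/\b(\w)\w*\W*/$1/g;
    return $name;
}

# Split approach: break on whitespace, take the first character of each piece.
sub initials_split {
    my ($name) = @_;
    return join '', map { substr($_, 0, 1) } split ' ', $name;
}

for my $name ("John Q Public", "Mary Ann  Smith") {
    printf "%-16s %s %s\n", $name, initials_regex($name), initials_split($name);
}
```

The two agree on plain space-separated names. They differ once punctuation sits inside a word: "O'Neil" gives "ON" from the regex version and "O" from the split version.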


Have fun,
