[Melbourne-pm] Regexps - how does the lexical scope of capture buffers work? (Was: Regexp: What's the right way to do this?)

Michael G Schwern schwern at pobox.com
Wed Oct 17 22:23:46 PDT 2012


On 2012.10.17 4:33 PM, Nathan Bailey wrote:
> My first question is really a language design one. Regexp evaluations
> short circuit on failure; why don't if statement assignments do the same?
> I would think the above use case is far more common/likely than the current
> one, which would theoretically allow someone to collect a bunch of undefs
> through each loop iteration for the ifs that fail (and as you note, there
> are other ways to get the right-hand side to fail into undef).

I'm not sure what you mean by "regexp evaluations short circuit on failure".
I'm going to assume you're asking why when you run this code...

    sub bar { 0 }

    $foo = 42;
    if( $foo = bar() ) {
        ...
    }
    else {
        print $foo;  # what do you expect here?
    }

...why doesn't it print 42?

From a pragmatic POV, it's impossible to evaluate arbitrary code without
actually running it.  See also "The Halting Problem".  Once you've run it,
you'd have to roll back any changes it made, which isn't possible in most
languages/interpreters.  It's theoretically possible using something called
Software Transactional Memory but that's way beyond Perl.
http://en.wikipedia.org/wiki/Software_transactional_memory

And then there are side effects: printing to the screen, setting global
variables, network, disk and database access... how do you control them?  I
don't even think STM can account for that.
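If the goal is simply to leave $foo alone when the call returns false, you
don't need any rollback: assign to a temporary and only copy it over on
success.  A minimal sketch, reusing the bar() stub from above ($result is
just a name I picked for illustration):

```perl
use strict;
use warnings;

sub bar { 0 }    # stand-in for any call that can return false

my $foo = 42;

# Assign to a temporary first; only copy it into $foo on success.
if ( my $result = bar() ) {
    $foo = $result;
}

print "$foo\n";    # $foo is still 42 because bar() returned false
```

The temporary is scoped to the if statement, so nothing leaks out of it.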

From a language design perspective, there are lots and lots of cases where you
want to use changes and side effects from a failed conditional.  Here are a
couple of examples off the top of my head.  The first illustrates where you
want to use a side effect from a failed condition.

    if( open my $fh, '<', $file ) {
        print <$fh>;
    }
    else {
        # $! is a global set as a side effect of the failed open
        print "There was an error: $!\n";
    }

This one uses a change, in this case a variable assignment.

    # This is longhand for open ... or die
    my $fh;
    if( !open $fh, '<', $file ) {
        die "Can't open $file: $!";
    }

    # The condition failed, but we still want to use a variable assigned
    # in it.
    print <$fh>;

Regexes, OTOH, are their own little machines within a machine with clear
boundaries where they communicate with Perl.  They don't so much short-circuit
on failure as they simply do not clear out their associated global variables
until they have to.  I'm willing to bet this was originally an implementation
quirk, possibly an overly aggressive optimization, which became a feature
and/or compatibility issue.
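To make that concrete, here's a small sketch of the stale-capture behavior.
The strings and variable names are made up for illustration; the only thing
it relies on is that a failed match leaves $1 untouched.

```perl
use strict;
use warnings;

my $good = "start 10:00";
my $bad  = "no time here";

# First match succeeds and sets $1.
my $ok_first = $good =~ /(\d+:\d+)/;
my $first    = $1;

# Second match fails; $1 is NOT cleared, it keeps the old value.
my $ok_second = $bad =~ /(\d+:\d+)/;
my $stale     = $1;

print "first match:  ", ( $ok_first  ? "ok" : "failed" ), ", \$1 = $first\n";
print "second match: ", ( $ok_second ? "ok" : "failed" ), ", \$1 = $stale\n";
```

Both lines report the same capture, which is why reading $1 without
checking whether the match succeeded is a trap.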

If you were making it today, you'd want each regex to clear its associated
globals to avoid exactly the sort of problem you're having.  Better yet, you
wouldn't use globals and avoid the problem of regexes clobbering each other.
The regex would return a match object you could get information out of.

    # something like this
    if( my $match = $string =~ /foo (.*?) bar/ ) {
        print $match->capture(1);
    }
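Perl as it stands doesn't give you that match object, but you can get most
of the way there by matching in list context and assigning the captures
straight into lexicals.  A rough sketch; extract_time() and the sample
strings are hypothetical, not anything from your script:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured time, or nothing on failure.
sub extract_time {
    my ($line) = @_;

    # In list context a match returns its captures, or an empty list on
    # failure.  The list assignment is false when nothing was captured,
    # so $time never holds a leftover value from an earlier match.
    if ( my ($time) = $line =~ /(\d+:\d+)/ ) {
        return $time;
    }
    return;
}

my $start  = extract_time("start 10:00");
my $finish = extract_time("no time here");

print "start:  ", ( defined $start  ? $start  : "(no match)" ), "\n";
print "finish: ", ( defined $finish ? $finish : "(no match)" ), "\n";
```

No globals are read at all here, so one regex can't clobber another's
results.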


> My second question is what's a better way to do this. I can think of two ways:
>   1. Assign the capture buffer (ie. $start_time = $1), which is what a.pl does
>   2. Use a multi-line string regexp that pulls out both start and finish time at once
> 
> I was wondering if there was a deep fu way that I hadn't considered.

Use a p--... oh nevermind. :P


-- 
Anyway, last I saw him, the TPF goons were pouring concrete around him,
leaving only one hole each for air, tea, and power.  No ethernet,
because he's using git.
    -- Eric Wilhelm on one of my disappearances

