[Melbourne-pm] Regexps - how does the lexical scope of capture buffers work? (Was: Regexp: What's the right way to do this?)

Michael G Schwern schwern at pobox.com
Wed Oct 17 12:57:30 PDT 2012


On 2012.10.17 3:36 AM, Nathan Bailey wrote:
> I really wish I had obfuscated the contents of the regexps :-(
> 
> My question, which I thought I had clearly stated, related to the lexical scope of capture buffers, and why one approach to capture buffers worked and another didn't.
> 
> Let's try again:
>       #if (($start_time) = m#^\s*(\d+:\d+) -#) {
>       if (m#^\s*(\d+:\d+) -#) {
>          $start_time = $2;
>       #} elsif (($finish_time) = m#^\s*(\d+:\d+)#) {
>       } elsif (m#^\s*(\d+:\d+)#) {
>          $finish_time = $1;
>          $event++;
>      }
> 
> Why is $start_time undefined when we get to $finish_time in the
> first version (commented out) and not in the second?

To clarify...

    if (($start_time) = m#^\s*(\d+:\d+) -#) {
        print "$start_time\n";
    }
    elsif (($finish_time) = m#^\s*(\d+:\d+)#) {
        print "$finish_time\n";
    }

vs

    if (m#^\s*(\d+:\d+) -#) {
        $start_time = $1;       # the original had $2; it should be $1
        print "$start_time\n";
    }
    elsif (m#^\s*(\d+:\d+)#) {
        $finish_time = $1;
        $event++;
        print "$finish_time\n";
    }

The regexes are a red herring.  This has to do with how lexical variables and
conditions work.

Presumably you run this code more than once, maybe in a loop.  And maybe
$start_time and $finish_time are globals, or they're lexicals but declared
outside the loop like this...

    my($start_time, $finish_time);
    while(<HTML>) {
        if( m#^\s*(\d+:\d+) -# ) {
            $start_time = $1;
        }
        elsif( m#^\s*(\d+:\d+)# ) {
            $finish_time = $1;
        }

        print "$start_time - $finish_time\n";
    }

In the above version $start_time and $finish_time are only changed if their
regexes match.  And because it's an if/elsif condition, at most one of them is
going to change per iteration.  But their values persist from one iteration to
the next, so A) each line sets at most one of them and B) the other one still
holds whatever an earlier line left in it.  This is bad.

    my($start_time, $finish_time);
    while(<HTML>) {
        if( ($start_time) = m#^\s*(\d+:\d+) -# ) {
            ...
        }
        elsif( ($finish_time) = m#^\s*(\d+:\d+)# ) {
            ...
        }

        print "$start_time - $finish_time\n";
    }

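You're in the same boat here, only now the list assignment in the first
condition runs on every iteration, so $start_time is always overwritten: with
the captured value if the regex matches, with undef if it doesn't (a failed
match in list context returns an empty list).  That's why $start_time is
undefined by the time the elsif sets $finish_time.  Either way, there's still
data persisting from one iteration to the next, which I presume you don't
want?  Even if you do, you're better off having separate variables for "what I
saw this iteration" and "what I'm remembering".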
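
To see the difference between the two styles in isolation, here is a minimal
sketch.  The two sample lines and the sub names are made up for illustration;
they are not from the original program.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for lines read from the HTML file.
my @lines = ("10:00 - talk", "11:30");

# Style 1: list assignment in the condition.  The assignment always runs,
# so a failed match assigns an empty list and resets $start to undef.
sub with_list_assignment {
    my ($start, $finish);
    for (@_) {
        if    ( ($start)  = m#^\s*(\d+:\d+) -# ) { }
        elsif ( ($finish) = m#^\s*(\d+:\d+)#   ) { }
    }
    return ($start, $finish);
}

# Style 2: copy $1 only inside the branch whose regex matched.
sub with_capture_variable {
    my ($start, $finish);
    for (@_) {
        if    ( m#^\s*(\d+:\d+) -# ) { $start  = $1 }
        elsif ( m#^\s*(\d+:\d+)#   ) { $finish = $1 }
    }
    return ($start, $finish);
}

for my $style (\&with_list_assignment, \&with_capture_variable) {
    my ($start, $finish) = $style->(@lines);
    printf "%s - %s\n", $start // 'undef', $finish // 'undef';
}
```

With that input the first style should print "undef - 11:30" and the second
"10:00 - 11:30".  The second line fails the first regex, and in the first
style that failure wipes out the start time captured from the line before.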

Simplest way to fix this is to move the lexical variables inside the loop so
they're cleared on every iteration.

    while(<HTML>) {
        my($start_time, $finish_time);

        if( m#^\s*(\d+:\d+) -# ) {
            $start_time = $1;
        }
        elsif( m#^\s*(\d+:\d+)# ) {
            $finish_time = $1;
        }

        print "$start_time - $finish_time\n";
    }
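
If carrying a start time over to a later line is actually what you want, the
"what I saw this iteration" vs "what I'm remembering" split could look
something like this.  It's only a sketch: pair_times() and its variable names
are hypothetical, and it assumes each start time shows up on a line before its
finish time.

```perl
use strict;
use warnings;

# Hypothetical helper: pairs each finish time with the most recent start time.
sub pair_times {
    my @lines = @_;
    my $remembered_start;    # deliberately carried between iterations
    my @events;

    for my $line (@lines) {
        my ($seen_start, $seen_finish);    # fresh on every iteration

        if    ( $line =~ m#^\s*(\d+:\d+) -# ) { $seen_start  = $1 }
        elsif ( $line =~ m#^\s*(\d+:\d+)#   ) { $seen_finish = $1 }

        if ( defined $seen_start ) {
            $remembered_start = $seen_start;
        }
        elsif ( defined $seen_finish && defined $remembered_start ) {
            push @events, "$remembered_start - $seen_finish";
            undef $remembered_start;       # pair complete, forget it
        }
    }
    return @events;
}

print "$_\n"
    for pair_times("  9:00 -", "10:30", "no time here", "11:00 -", "12:15");
```

That way the per-line variables can never leak stale data, and the one
variable that does persist is obviously meant to.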

Eliminating the regexes might make it clearer.

    my($odd, $even);
    for my $num (1..10) {
        if( $num % 2 ) {
            $odd = $num;
        }
        elsif( !($num % 2) ) {
            $even = $num;
        }

        print "$even - $odd\n";
    }

vs

    for my $num (1..10) {
        my($odd, $even);

        if( $num % 2 ) {
            $odd = $num;
        }
        elsif( !($num % 2) ) {
            $even = $num;
        }

        print "$even - $odd\n";
    }
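
Here are the two versions side by side in a form you can run and compare.
This is a sketch, not a drop-in: the sub names and the show() helper are made
up, the range is cut down to 1..4, and undef is printed as the word "undef" to
avoid warnings.

```perl
use strict;
use warnings;

# Hypothetical helper: render undef visibly instead of warning.
sub show { return defined $_[0] ? $_[0] : 'undef' }

# Variables declared outside the loop: values persist between iterations.
sub declared_outside {
    my @out;
    my ($odd, $even);
    for my $num (1..4) {
        if    ( $num % 2 )    { $odd  = $num }
        elsif ( !($num % 2) ) { $even = $num }
        push @out, show($even) . ' - ' . show($odd);
    }
    return @out;
}

# Variables declared inside the loop: fresh (undef) on every iteration.
sub declared_inside {
    my @out;
    for my $num (1..4) {
        my ($odd, $even);
        if    ( $num % 2 )    { $odd  = $num }
        elsif ( !($num % 2) ) { $even = $num }
        push @out, show($even) . ' - ' . show($odd);
    }
    return @out;
}

print "outside: $_\n" for declared_outside();
print "inside:  $_\n" for declared_inside();
```

The outside version should give "undef - 1", "2 - 1", "2 - 3", "4 - 3", with
the stale value hanging around each time.  The inside version should give
"undef - 1", "2 - undef", "undef - 3", "4 - undef".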

Still not a regex question. ;)


-- 
54. "Napalm sticks to kids" is *not* a motivational phrase.
    -- The 213 Things Skippy Is No Longer Allowed To Do In The U.S. Army
           http://skippyslist.com/list/

