[Melbourne-pm] Regexps - how does the lexical scope of capture buffers work? (Was: Regexp: What's the right way to do this?)

Wed Oct 17 23:42:20 PDT 2012

On 2012.10.17 10:59 PM, Nathan Bailey wrote:
> On 18/10/2012, at 4:23 PM, Michael G Schwern wrote:
>> On 2012.10.17 4:33 PM, Nathan Bailey wrote:
>> I'm not sure what you mean by "regexp evaluations short circuit on failure".
> 
> As I understand it, the 'c' in the below regular expression never gets evaluated:
> 	if ("aa" =~ /bc/) { ...

Sooorta. It gets evaluated in the process of compiling the regex, but when run
it never bothers to check if there's a 'c' because there's never a 'b'...
maybe.  It depends on how the regex is implemented.  It's possible that instead
of looking first for 'b' and then 'c' it looks for 'bc'... but go a step down
and the string comparison probably never gets as far as checking the "c",
because the "b" already failed.  This is basically how strcmp works.

    # Pretend this is a low level language...
    sub strcmp {
        my($left, $right) = @_;

        # Different lengths, don't bother comparing.
        # (This isn't efficient in C, but it is in Perl)
        return 0 if length($left) != length($right);

        for my $idx (0..length($left)-1) {
            # If you encounter a different character, stop.
            return 0 if substr($left, $idx, 1) ne substr($right, $idx, 1);
        }

        # Made it this far, must be the same
        return 1;
    }

The key thing that separates that from trying to roll back a condition is
strcmp() doesn't change anything outside its scope in the process of doing its
work.  There's nothing to roll back, you just stop, and there are no side effects.

>> I'm going to assume you're asking why when you run this code...
>>    if( $foo = bar() ) {
> ...
>> And then there's side effects, printing to the screen, setting global
>> variables, network, disk and database access... how do you control them?  I
>> don't even think STE can account for that.
> 
> Thank-you, that's actually a really good answer - if the if statement includes
> some major side effect, it's not reasonable to expect that it could be undone
> on failure, and it is reasonable to expect that someone might want to record
> that failure, separate from the execution of the subsequent block of code.

Yes.  Any side effect.  Even simple assignment is a side effect.
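
To make that concrete, here's a rough sketch.  bar() and both sub names are
made up for illustration, and bar() always fails here.  It shows two things:
the assignment in a failed condition still happens, and a failed match leaves
$1 holding whatever the last successful match in that scope put there.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the bar() in the quoted example; always false.
sub bar { return 0 }

sub failed_condition_still_assigns {
    my $foo = 'original';
    if ( $foo = bar() ) {
        print "never reached\n";
    }
    # The condition was false, but the assignment was not undone.
    return $foo;
}

sub stale_capture {
    "abc" =~ /(b)/;    # succeeds, sets $1 to "b"
    "xyz" =~ /(q)/;    # fails, and does NOT reset $1
    return $1;         # still the value from the earlier match
}

print "foo is now: ", failed_condition_still_assigns(), "\n";
print "\$1 is still: ", stale_capture(), "\n";
```

Which is why you should only look at $1 after checking that the match you
care about actually succeeded.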

>> The regex would return a match object you could get information out of.
>>    # something like this
>>    if( my $match = $string =~ /foo (.*?) bar/ ) {
>>        print $match->capture(1);
>>    }
> 
> Interesting. That has a certain elegance to it. Maybe we should hassle Damian :-)
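
Perl 5 has no such match object, but you can get part of the way there today
by matching in list context and assigning the captures to lexicals.  A minimal
sketch, where capture_between() is just a name I made up for the example:

```perl
use strict;
use warnings;

# Returns the text between "foo " and " bar", or nothing if there is no match.
sub capture_between {
    my ($string) = @_;

    # In list context the match returns its captures, or an empty list on
    # failure.  The list assignment is true when at least one value came back,
    # so $inner only exists inside the block where the match succeeded.
    if ( my ($inner) = $string =~ /foo (.*?) bar/ ) {
        return $inner;
    }
    return;
}

my $found = capture_between("foo baz bar");
print defined $found ? "captured: $found\n" : "no match\n";
```

The capture is scoped to the if block, so there's no stale $1 to worry about.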
> 
>>> I was wondering if there was a deep fu way that I hadn't considered.
>> Use a p--... oh nevermind. :P
> 
> parser? I suspect Peter's Parse::RecDescent suggestion is actually the
> generic answer to my question (of which HTML::{TreeBuilder,TokeParser} and
> its cousins are a specific case for HTML). Beyond a certain point of regexp
> fu, you have to look at the document rather than the line.

Pretty much.  Though to be precise, it's not so much about looking at the
document as it is understanding the grammar.  You can still work on a
complicated document element by element, but those elements are not delimited
by newlines.  They're delimited by... something else.  Once you understand the
grammar you can iterate through the elements like you iterate through lines.
Except they can be nested.  Not a perfect analogy.
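
Here's a toy sketch of what I mean, assuming a made-up grammar where the
elements are words and parentheses mark nesting.  elements() is a hypothetical
helper, not anything from a real parser module.  It walks the string element
by element with \G and /gc, and newlines are just more whitespace.

```perl
use strict;
use warnings;

# Toy grammar (an assumption for this example): words, "(" and ")".
# Returns a list of [depth, word] pairs in document order.
sub elements {
    my ($text) = @_;
    my @found;
    my $depth = 0;

    pos($text) = 0;
    while ( pos($text) < length $text ) {
        # /gc keeps pos() where it was when an alternative fails to match.
        if    ( $text =~ /\G\s+/gc )        { next }
        elsif ( $text =~ /\G\(/gc )         { $depth++ }
        elsif ( $text =~ /\G\)/gc )         { $depth-- }
        elsif ( $text =~ /\G([^\s()]+)/gc ) { push @found, [ $depth, $1 ] }
    }
    return @found;
}

# Iterate over elements much like you would iterate over lines.
for my $el ( elements("html (head (title)\n body) end") ) {
    my ( $depth, $word ) = @$el;
    print "  " x $depth, $word, "\n";
}
```

Anything much more complicated than this and you want a real parser.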

For example, most HTML/XML parsers parse the whole document into a DOM
(Document Object Model... basically a bunch of objects representing all the
things in the document).  This is very convenient to work with, and it allows
things like the XPath I showed earlier, but it consumes a lot of memory and you
can't do anything until it's done parsing.  This is sort of like slurping a
whole file into an array.

OTOH a SAX parser reads a document element by element and lets you do
something to each element.  This is more like reading line by line in a file.
https://secure.wikimedia.org/wikipedia/en/wiki/SAX_parser
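
The file analogy in code, as a sketch.  This is plain file reading, not a real
DOM or SAX parser, and the sub names are invented for the example.  It reads
from an in-memory filehandle so it runs without any files on disk.

```perl
use strict;
use warnings;

# "DOM style": read everything into memory, then work on the whole thing.
sub count_slurped {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @lines = <$fh>;    # nothing can happen until all lines are read
    close $fh;
    return scalar @lines;
}

# "SAX style": hand each line to a callback as soon as it is read.
sub each_line {
    my ( $text, $callback ) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        $callback->($line);    # only one line is held at a time
    }
    close $fh;
    return;
}

my $document = "one\ntwo\nthree\n";
print "slurped ", count_slurped($document), " lines\n";
each_line( $document, sub { print "saw: $_[0]\n" } );
```

The slurping version has everything available at once but pays for it in
memory.  The callback version can start work right away but only ever sees
one piece.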

>> -- 
>> Anyway, last I saw him, the TPF goons were pouring concrete around him,
>> leaving only one hole each for air, tea, and power.  No ethernet,
>> because he's using git.
>>    -- Eric Wilhelm on one of my disappearances
> 
> There seems to be a certain lack of output capacity in this model?
> (and I'm not referring to the code :-)

So THAT'S why I haven't been getting anything done!

-- 
THIS I COMMAND!