[Melbourne-pm] Regexps - how does the lexical scope of capture buffers work? (Was: Regexp: What's the right way to do this?)

Thu Oct 18 11:08:46 PDT 2012

This is tangential to the OP's question, except that it's about the same
idiom. 

I started to think about what was going on in order to explain it, and
realised that there's a subtle difference between an array of undefs and
a list of undefs.  I've gone for years without noticing this.

So, there's this general idiom for assigning the captures from a regex
match to a list of variables.

   ($var1, $var2, ...) = m/(.)(.).../;

You're evaluating m// in list context.  If it doesn't match, it returns
an empty list, and $var1, $var2, etc are set to undef.
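
A small sketch of that reset, which I think is also what bites the
commented-out version quoted below: when the first pattern fails on a
later line, the list assignment sets $start_time back to undef before the
elsif is even tried.  The sample lines and the $start2/$finish2 names are
made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input lines, loosely modelled on the thread's examples.
my @lines = ('  9:00 - ', '  10:30');

# Version 1: list assignment inside the condition.
my ($start_time, $finish_time);
for (@lines) {
    if ( ($start_time) = m#^\s*(\d+:\d+) -# ) {
        # matched: $start_time holds the capture
    }
    elsif ( ($finish_time) = m#^\s*(\d+:\d+)# ) {
        # The failed assignment in the 'if' has already reset $start_time.
        print defined $start_time ? "start kept\n" : "start lost\n";
    }
}

# Version 2: plain match, copy $1 only after a successful match.
my ($start2, $finish2);
for (@lines) {
    if    (m#^\s*(\d+:\d+) -#) { $start2  = $1 }
    elsif (m#^\s*(\d+:\d+)#)   { $finish2 = $1 }
}
print "start2=$start2 finish2=$finish2\n";
```

Version 1 prints "start lost" on the second line; version 2 leaves the
earlier value alone because nothing is assigned unless the match succeeds.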

So consider that in a conditional:

if( ($var1,$var2,...) = m/(.)(.).../ )  { ... }

Perl does what you expect, but when you look closely it's pretty clever,
and I don't know that I've seen this documented:

if ( (undef,undef) ) { }   # a list of undefs is false

@arr = (undef,undef);
if ( @arr ) {}                        # an array of undefs is true

At one level it's quirky behaviour that a perl programmer of many years
may not have considered.  At another level, it enables a very useful
idiom, and we can mostly just get on and use it without worrying about
the subtleties.  Very perlish.
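
For what it's worth, the mechanism as I understand it is that the
condition tests the list assignment itself, and a list assignment in
scalar context returns the number of elements on its right-hand side
(perlop mentions this under the assignment operators).  A minimal sketch,
with made-up strings and variable names:

```perl
use strict;
use warnings;

my ($h, $m);

# Failed match: the right-hand side is an empty list, so the count is 0.
my $n_fail = ( ($h, $m) = "no time here" =~ /(\d+):(\d+)/ );

# Successful match: two captures on the right-hand side, so the count is 2.
my $n_ok = ( ($h, $m) = "at 10:30 -" =~ /(\d+):(\d+)/ );

# The count ignores the values: two undefs still count as 2.
my $n_undef = ( my ($x, $y) = (undef, undef) );

# An array in scalar context likewise gives its element count.
my @arr = (undef, undef);
my $n_arr = @arr;

print "failed match: $n_fail\n";   # 0
print "good match:   $n_ok\n";     # 2
print "two undefs:   $n_undef\n";  # 2
print "array:        $n_arr\n";    # 2
```

So the idiom is true exactly when the match produced something to assign,
even if a capture itself came back undef or empty.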

Regards,
Andrew McNaughton

On 17/10/12 21:36, Nathan Bailey wrote:
> I really wish I had obfuscated the contents of the regexps :-(
>
> My question, which I thought I had clearly stated, related to the lexical scope of capture buffers, and why one approach to capture buffers worked and another didn't.
>
> Let's try again:
>       #if (($start_time) = m#^\s*(\d+:\d+) -#) {
>       if (m#^\s*(\d+:\d+) -#) {
>          $start_time = $1;
>       #} elsif (($finish_time) = m#^\s*(\d+:\d+)#) {
>       } elsif (m#^\s*(\d+:\d+)#) {
>          $finish_time = $1;
>          $event++;
>      }
>
> Why is $start_time undefined when we get to $finish_time in the first version (commented out) and not in the second?
>
> And is there a good/better way to collect multiple values over multiple lines than this?
>
> thanks,
> Nathan
>
> On 17/10/2012, at 7:14 PM, Michael G Schwern wrote:
>
>> "Should I use regexes to parse HTML?"
>>
>> No, do not use regexes to parse HTML.  While it may seem easy to put together
>> a quick and dirty HTML scanner with regexes, it will very quickly get very
>> ugly.  HTML parsing requires matching balanced characters and tags such as <
>> and > and quotes which regexes do very poorly in addition to all the little
>> special cases like comments.
>>
>> In addition, you're going to forget many small things, like casing and spaces,
>> which you'll be hunting down forever.  For example...
>>
>>  <div  class="event-time calendar-1">2:3 - </div>
>>  <DIV class="event-time calendar-1">  2:3 - </div>
>>  <P class="summary">blah</p>
>>  <p  class = "description">blah</p>
>>  <!-- <p class="summary">blah</p> -->
>>
>> If you patch up your regexes to cover those, maybe an activity for the next
>> meeting might be to come up with more to break your regexes. :)
>>
>> There, your regex question is answered. :P
>>
>> It's quicker even in the short run to use a pre-existing, well-documented
>> parser like HTML::TreeBuilder as evidenced by the fact that you're posting on
>> a mailing list for help with your regex based HTML parser.  You even get
>> search facilities like XPath (see HTML::TreeBuilder::XPath and
>> http://www.w3schools.com/xpath/).
>>
>>    use HTML::TreeBuilder::XPath;
>>    use v5.14;
>>
>>    my $tree= HTML::TreeBuilder::XPath->new;
>>    $tree->parse_file(shift);
>>
>>    my @event_times  = $tree->findnodes(
>>        '//div[starts-with(@class, "event-time calendar-")]'
>>    );
>>
>>    for my $event_time (@event_times) {
>>        my($hour, $min) = $event_time->as_text =~ /(\d+):(\d+)/;
>>        say "Event at $hour:$min";
>>    }
>>
>> Once you learn how to use an HTML parser and XPath you'll never have to write
>> a hacky HTML regex parser again.  O(1) learning efficiency!
>>
>> If you're doing this as an exercise in learning regexes, well, don't ignore
>> the lesson just because it's not what you expected to learn.  If you want to
>> learn "from scratch" look into writing a grammar parser.
>>
>>
>> On 2012.10.17 12:11 AM, Nathan Bailey wrote:
>>> I knew someone would say that :P
>>> It's a regexp question, not an HTML parsing question!
>>> N
>>>
>>> On 17/10/2012, at 6:10 PM, Toby Wintermute wrote:
>>>
>>>> On 17 October 2012 18:09, Nathan Bailey <nathan.bailey at monash.edu> wrote:
>>>>> The code below works, but the commented out bits don't. I presume that
>>>>> when $shh and $smm are defined on the first loop through, they get undefined
>>>>> on the next time through?
>>>>> What's the "right" way to do this, TIMTOWTDI notwithstanding :-)
>>>>
>>>> use HTML::TreeBuilder;
>>>>
>>>>
>>>> -Toby
>>> _______________________________________________
>>> Melbourne-pm mailing list
>>> Melbourne-pm at pm.org
>>> http://mail.pm.org/mailman/listinfo/melbourne-pm
>>>
>>
>> -- 
>> s7ank: i want to be one of those guys that types "s/j&jd//.^$ueu*///djsls/sm."
>>       and it's a perl script that turns dog crap into gold.