[Melbourne-pm] Regexps - how does the lexical scope of capture buffers work? (Was: Regexp: What's the right way to do this?)

Nathan Bailey nathan.bailey at monash.edu
Wed Oct 17 03:36:56 PDT 2012


I really wish I had obfuscated the contents of the regexps :-(

My question, which I thought I had clearly stated, related to the lexical scope of capture buffers, and why one approach to capture buffers worked and another didn't.

Let's try again:
      #if (($start_time) = m#^\s*(\d+:\d+) -#) {
      if (m#^\s*(\d+:\d+) -#) {
         $start_time = $1;
      #} elsif (($finish_time) = m#^\s*(\d+:\d+)#) {
      } elsif (m#^\s*(\d+:\d+)#) {
         $finish_time = $1;
         $event++;
      }

Why is $start_time undefined when we get to $finish_time in the first version (commented out) and not in the second?
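
One likely explanation, sketched as a minimal runnable example: a failed match in list context returns an empty list, so the list assignment ($start_time) = m#...# sets $start_time back to undef on every line where the first pattern does not match, including the finish-time line. The plain-match version only copies $1 after a successful match, so the earlier value survives. The input lines and the variable names ending in _a and _b are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: a start-time line followed by a finish-time line.
my @lines = ('  9:00 - ', '  10:30');

# Version 1: list assignment inside the condition (the commented-out form).
# When the first pattern fails, the match returns an empty list, and
# assigning an empty list to ($start_a) resets it to undef.
my ($start_a, $finish_a);
for (@lines) {
    if (($start_a) = m#^\s*(\d+:\d+) -#) {
        # start time captured on this line
    } elsif (($finish_a) = m#^\s*(\d+:\d+)#) {
        # finish time captured; $start_a was already clobbered above
    }
}

# Version 2: plain match, copying $1 only after a successful match.
my ($start_b, $finish_b);
for (@lines) {
    if (m#^\s*(\d+:\d+) -#) {
        $start_b = $1;
    } elsif (m#^\s*(\d+:\d+)#) {
        $finish_b = $1;
    }
}

printf "version 1: start=%s finish=%s\n", $start_a // 'undef', $finish_a // 'undef';
printf "version 2: start=%s finish=%s\n", $start_b // 'undef', $finish_b // 'undef';
```

If the list-assignment style is preferred, one option is to assign into a temporary variable and copy it to $start_time only when the match succeeds.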

And is there a good/better way to collect multiple values over multiple lines than this?

thanks,
Nathan

On 17/10/2012, at 7:14 PM, Michael G Schwern wrote:

> "Should I use regexes to parse HTML?"
> 
> No, do not use regexes to parse HTML.  While it may seem easy to put together
> a quick and dirty HTML scanner with regexes, it will very quickly get very
> ugly.  HTML parsing requires matching balanced characters and tags, such as <
> and > and quotes, which regexes do very poorly, in addition to all the little
> special cases like comments.
> 
> In addition, you're going to forget many small things, like casing and spaces,
> which you'll be hunting down forever.  For example...
> 
>  <div  class="event-time calendar-1">2:3 - </div>
>  <DIV class="event-time calendar-1">  2:3 - </div>
>  <P class="summary">blah</p>
>  <p  class = "description">blah</p>
>  <!-- <p class="summary">blah</p> -->
> 
> If you patch up your regexes to cover those, maybe an activity for the next
> meeting might be to come up with more to break your regexes. :)
> 
> There, your regex question is answered. :P
> 
> It's quicker even in the short run to use a pre-existing, well documented
> parser like HTML::TreeBuilder, as evidenced by the fact that you're posting on
> a mailing list for help with your regex-based HTML parser.  You even get
> search facilities like XPath (see HTML::TreeBuilder::XPath and
> http://www.w3schools.com/xpath/).
> 
>    use HTML::TreeBuilder::XPath;
>    use v5.14;
> 
>    my $tree = HTML::TreeBuilder::XPath->new;
>    $tree->parse_file(shift);
> 
>    my @event_times = $tree->findnodes(
>        '//div[starts-with(@class, "event-time calendar-")]'
>    );
> 
>    for my $event_time (@event_times) {
>        my($hour, $min) = $event_time->as_text =~ /(\d+):(\d+)/;
>        say "Event at $hour:$min";
>    }
> 
> Once you learn how to use an HTML parser and XPath you'll never have to write
> a hacky HTML regex parser again.  O(1) learning efficiency!
> 
> If you're doing this as an exercise in learning regexes, well, don't ignore
> the lesson just because it's not what you expected to learn.  If you want to
> learn "from scratch" look into writing a grammar parser.
> 
> 
> On 2012.10.17 12:11 AM, Nathan Bailey wrote:
>> I knew someone would say that :P
>> 
>> It's a regexp question, not an HTML parsing question!
>> N
>> 
>> On 17/10/2012, at 6:10 PM, Toby Wintermute wrote:
>> 
>>> On 17 October 2012 18:09, Nathan Bailey <nathan.bailey at monash.edu> wrote:
>>>> The code below works, but the commented out bits don't. I presume that
>>>> when $shh and $smm are defined on the first loop through, they get
>>>> undefined on the next time through?
>>>> 
>>>> What's the "right" way to do this, TIMTOWTDI notwithstanding :-)
>>> 
>>> 
>>> use HTML::TreeBuilder;
>>> 
>>> 
>>> -Toby
>> 
>> _______________________________________________
>> Melbourne-pm mailing list
>> Melbourne-pm at pm.org
>> http://mail.pm.org/mailman/listinfo/melbourne-pm
>> 
> 
> 
> -- 
> s7ank: i want to be one of those guys that types "s/j&jd//.^$ueu*///djsls/sm."
>       and it's a perl script that turns dog crap into gold.


