[Melbourne-pm] Regexps - how does the lexical scope of capture buffers work? (Was: Regexp: What's the right way to do this?)
Nathan Bailey
nathan.bailey at monash.edu
Wed Oct 17 03:36:56 PDT 2012
I really wish I had obfuscated the contents of the regexps :-(
My question, which I thought I had clearly stated, related to the lexical scope of capture buffers, and why one approach to capture buffers worked and another didn't.
Let's try again:
#if (($start_time) = m#^\s*(\d+:\d+) -#) {
if (m#^\s*(\d+:\d+) -#) {
$start_time = $1;
#} elsif (($finish_time) = m#^\s*(\d+:\d+)#) {
} elsif (m#^\s*(\d+:\d+)#) {
$finish_time = $1;
$event++;
}
Why is $start_time undefined when we get to $finish_time in the first version (commented out) and not in the second?
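A minimal sketch that reproduces the difference, assuming two hypothetical input lines (a start time followed by " -", then a bare finish time). In list context a failed match returns the empty list, so ($start_time) = m#...# resets $start_time to undef on every line where the first pattern does not match. The second version only assigns after a successful match, so the earlier value survives. The variable names and sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input: a start-time line, then a finish-time line.
my @lines = ('  9:30 - ', '  10:45');

# Version 1: list assignment inside the condition.
my ($start_v1, $finish_v1);
for (@lines) {
    if (($start_v1) = m#^\s*(\d+:\d+) -#) {
        # Matched a start time; $start_v1 now holds the capture.
    } elsif (($finish_v1) = m#^\s*(\d+:\d+)#) {
        # Matched a finish time, but the failed test above has
        # already assigned the empty list to ($start_v1).
    }
}

# Version 2: plain match, copy $1 only after a successful match.
my ($start_v2, $finish_v2);
for (@lines) {
    if (m#^\s*(\d+:\d+) -#) {
        $start_v2 = $1;
    } elsif (m#^\s*(\d+:\d+)#) {
        $finish_v2 = $1;
    }
}

printf "v1: start=%s finish=%s\n", $start_v1 // 'undef', $finish_v1;
printf "v2: start=%s finish=%s\n", $start_v2 // 'undef', $finish_v2;
```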
And is there a good/better way to collect multiple values over multiple lines than this?
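One possible approach, sketched under the assumption that each event is a start-time line followed by a finish-time line: capture into a temporary lexical in the condition, and copy it into a per-event hash only on success. The names @events and %current and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample input: alternating start and finish lines.
my @lines = ('  9:30 - ', '  10:45', '  11:00 - ', '  12:15');

my @events;     # completed events, one hash reference each
my %current;    # fields collected so far for the event in progress

for my $line (@lines) {
    if (my ($start) = $line =~ m#^\s*(\d+:\d+) -#) {
        # $start is a throwaway lexical, so a failed match
        # cannot clobber anything collected earlier.
        $current{start} = $start;
    } elsif (my ($finish) = $line =~ m#^\s*(\d+:\d+)#) {
        $current{finish} = $finish;
        push @events, { %current };    # store a copy of the finished event
        %current = ();                 # start collecting the next one
    }
}

print "$_->{start} to $_->{finish}\n" for @events;
```

Keeping the collected fields in a hash means adding another per-event value later is one more branch rather than one more long-lived scalar.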
thanks,
Nathan
On 17/10/2012, at 7:14 PM, Michael G Schwern wrote:
> "Should I use regexes to parse HTML?"
>
> No, do not use regexes to parse HTML. While it may seem easy to put together
> a quick and dirty HTML scanner with regexes, it will very quickly get very
> ugly. HTML parsing requires matching balanced characters and tags such as <
> and > and quotes which regexes do very poorly in addition to all the little
> special cases like comments.
>
> In addition, you're going to forget many small things, like casing and spaces,
> which you'll be hunting down forever. For example...
>
> <div class="event-time calendar-1">2:3 - </div>
> <DIV class="event-time calendar-1"> 2:3 - </div>
> <P class="summary">blah</p>
> <p class = "description">blah</p>
> <!-- <p class="summary">blah</p> -->
>
> If you patch up your regexes to cover those, maybe an activity for the next
> meeting might be to come up with more to break your regexes. :)
>
> There, your regex question is answered. :P
>
> It's quicker even in the short run to use a pre-existing, well-documented
> parser like HTML::TreeBuilder, as evidenced by the fact that you're posting on
> a mailing list for help with your regex based HTML parser. You even get
> search facilities like XPath (see HTML::TreeBuilder::XPath and
> http://www.w3schools.com/xpath/).
>
> use HTML::TreeBuilder::XPath;
> use v5.14;
>
> my $tree = HTML::TreeBuilder::XPath->new;
> $tree->parse_file(shift);
>
> my @event_times = $tree->findnodes(
> '//div[starts-with(@class, "event-time calendar-")]'
> );
>
> for my $event_time (@event_times) {
> my($hour, $min) = $event_time->as_text =~ /(\d+):(\d+)/;
> say "Event at $hour:$min";
> }
>
> Once you learn how to use an HTML parser and XPath you'll never have to write
> a hacky HTML regex parser again. O(1) learning efficiency!
>
> If you're doing this as an exercise in learning regexes, well, don't ignore
> the lesson just because it's not what you expected to learn. If you want to
> learn "from scratch" look into writing a grammar parser.
>
>
> On 2012.10.17 12:11 AM, Nathan Bailey wrote:
>> I knew someone would say that :P
>>
>> It's a regexp question, not an HTML parsing question!
>> N
>>
>> On 17/10/2012, at 6:10 PM, Toby Wintermute wrote:
>>
>>> On 17 October 2012 18:09, Nathan Bailey <nathan.bailey at monash.edu> wrote:
>>>> The code below works, but the commented out bits don't. I presume that
>>>> when $shh and $smm are defined on the first loop through, they get undefined
>>>> on the next time through?
>>>>
>>>> What's the "right" way to do this, TIMTOWTDI notwithstanding :-)
>>>
>>>
>>> use HTML::TreeBuilder;
>>>
>>>
>>> -Toby
>>
>> _______________________________________________
>> Melbourne-pm mailing list
>> Melbourne-pm at pm.org
>> http://mail.pm.org/mailman/listinfo/melbourne-pm
>>
>
>
> --
> s7ank: i want to be one of those guys that types "s/j&jd//.^$ueu*///djsls/sm."
> and it's a perl script that turns dog crap into gold.