[Melbourne-pm] Regexps - how does the lexical scope of capture buffers work? (Was: Regexp: What's the right way to do this?)

Peter Vereshagin peter at vereshagin.org
Wed Oct 17 14:33:44 PDT 2012


Hi guys.

Then what is wrong with this:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use autodie;

    my $str = '';
    while ( my $buf = <DATA> ) {
        $str .= $buf;
        if (my ( $hh_mm_start => $hh_mm_end )
            = $str =~ m/
                <div[^>]*>\s*
                (\d\d?:\d\d?)
                \s*-\s*
                (\d\d?:\d\d?)
                /sx
            )
        {
            use Data::Dump;
            ddx $hh_mm_start => $hh_mm_end;
            $str = '';
        }
    }

    __DATA__
    <div class="event-time calendar-1">12:45 -
            14:00</div>
    <div class="event-time calendar-1">12:45 -
    </div>
    <div class="something else entirely">The time is now
    05:46</div>

?
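
For what it's worth, the loop above assigns the match result in list context rather than reading $1 and $2 afterwards, because a failed match does not clear the capture variables: they keep the values from the last successful match in that scope. A minimal sketch of the difference (the sample strings and variable names are made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample strings are invented for illustration only.
my $good = '<div>12:45 - 14:00</div>';
my $bad  = '<div>no times here</div>';
my $re   = qr/(\d\d?:\d\d)\s*-\s*(\d\d?:\d\d)/;

# A failed match leaves $1 and $2 from the last successful match in place.
$good =~ $re;       # succeeds, sets $1 and $2
$bad  =~ $re;       # fails, result deliberately ignored
my $stale = $1;     # still the value captured from $good

# List assignment in boolean context yields the number of values the
# match returned, so a failed match is reported as false directly.
my $matched = ( my ( $start, $end ) = $bad =~ $re ) ? 1 : 0;

print "stale \$1: $stale\n";                  # prints: stale $1: 12:45
print "list assignment matched: $matched\n";  # prints: list assignment matched: 0
```

So testing the list assignment itself, as in the if() above, avoids acting on captures left over from an earlier line.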

There is also the old but bold Parse::RecDescent:

    http://search.cpan.org/dist/Parse-RecDescent/ 

I use it to parse MySQL dumps here:

    http://gitweb.vereshagin.org/endvance/blob_plain/HEAD:/endvance/README

But surely HTML::* can make you happy, too.

2012/10/17 13:45:17 -0700 Michael G Schwern <schwern at pobox.com> => To melbourne-pm at pm.org :
MGS> On 2012.10.17 5:24 AM, Nathan Bailey wrote:
MGS> > I'm just wondering if there's a better way to grab text out of
MGS> > multiple lines that are related to each other. A simple solution
MGS> > would be to go for multi-line strings but I'm actually curious
MGS> > to know (a) if that's the way the evaluation of the regexp and
MGS> > assignment works and (b) if there are better ways of doing
MGS> > multi-line parsing, without simply treating it as one big complex
MGS> > line.
MGS> 
MGS> You don't want the answer to be "use an HTML parser", so here's a sort of a
MGS> Look Into Your Future as you try to parse HTML with regexes...
MGS> 
MGS> Reading your original code, it seems like you're trying to parse this:
MGS> 
MGS>     <div class="event-time calendar-1">12:45 -
MGS>         14:00</div>
MGS> 
MGS> but doing it line by line with individual regexes.  HTML doesn't give two
MGS> hoots about newlines, so trying to understand it line by line has lots of
MGS> problems.  This means you have to carry state over from one line to another,
MGS> which gets complicated.  Worse, you have to check that nothing else came
MGS> between them, or else you get fooled by this:
MGS> 
MGS>     <div class="event-time calendar-1">12:45 -
MGS>         </div>
MGS>     <div class="something else entirely">The time is now
MGS>         05:46</div>
MGS> 
MGS> If you try to parse as one big string...
MGS> 
MGS>     m{\G<div class="time">(\d+):(\d+) - (\d+):(\d+)</div>}msg
MGS> 
MGS> That works for this:
MGS> 
MGS>     <div class="time">12:45 - 14:00</div><div class="time">15:00 - 16:00</div>
MGS> 
MGS> But to account for whitespace and casing the regex really needs to be...
MGS> 
MGS>     m{\G<div\s+class\s*=\s*"time"\s*>\s*(\d+):(\d+)\s*-\s*(\d+):(\d+)\s*</div>}imsg
MGS> 
MGS> Yuck.
MGS> 
MGS> You'll run into trouble with this:
MGS> 
MGS>     m{\G<p\s+class\s*=\s*"summary"\s*>.*</p>}msgi
MGS> 
MGS>     <p class="summary">foo</p>
MGS>     <p class="somethingelse">bar</p>
MGS> 
MGS> Slurped up too much.  Have to make it non-greedy.
MGS> 
MGS>     m{\G<p\s+class\s*=\s*"summary"\s*>(.*?)</p>}msgi
MGS> 
MGS> And then you're told it's not just <p> tags that might contain the summary, but
MGS> <div> tags as well.  This gets into the joy of variable balancing tags in regexes.
MGS> 
MGS>     m{\G<(p|div)\s+class\s*=\s*"summary"\s*>(.*?)</\1>}msgi
MGS> 
MGS> And it all seems to be working fine until...
MGS> 
MGS>     <p class="summary"> foo <p>bar</p> baz </p>
MGS> 
MGS> Now you're hosed.  Regexes are *terrible* at trying to match nested balanced
MGS> delimiters.  HTML is all about nested balanced delimiters.  Solving this
MGS> requires wall-banging complexity.
MGS> http://perldoc.perl.org/perlfaq4.html#How-do-I-find-matching%2fnesting-anything%3f
MGS> http://perldoc.perl.org/perlre.html#%28%3fPARNO%29-%28%3f-PARNO%29-%28%3f%2bPARNO%29-%28%3fR%29-%28%3f0%29
MGS> https://metacpan.org/module/Regexp::Common::balanced
MGS> 
MGS> There is a class of problems which look easy to solve with regexes, but are
MGS> actually nigh impossible to get even mostly correct.  This is one of them.
MGS> 
MGS> I expect you'll try anyway. :)  "Parse HTML with a regex" is right up there
MGS> with "write a template language" and "write an ORM" for Perl rites of passage.
MGS> 
MGS> 
MGS> -- 
MGS> 185. My name is not a killing word.
MGS>     -- The 213 Things Skippy Is No Longer Allowed To Do In The U.S. Army
MGS>            http://skippyslist.com/list/
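
To make the nested case from the quoted message concrete, here is a minimal sketch. It assumes the whitespace-tolerant pattern is written with m{}-style delimiters and /x for readability; the variable names are mine, not from the thread:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The nested example from the quoted message.
my $html = '<p class="summary"> foo <p>bar</p> baz </p>';

# Non-greedy summary pattern with a backreference to the opening tag name.
# Under /x, whitespace inside the pattern is ignored.
my $summary_re = qr{
    <(p|div) \s+ class \s* = \s* "summary" \s* >
    (.*?)
    </\1>
}xsi;

my ( $tag, $summary ) = $html =~ $summary_re;

# The non-greedy capture stops at the FIRST closing </p>, which belongs
# to the inner paragraph, so " baz " is lost.
print "captured: [$summary]\n";    # prints: captured: [ foo <p>bar]
```

The regex has no way to know that the first closing tag pairs with the inner <p>, which is exactly where a real HTML parser earns its keep.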

--
Peter Vereshagin <peter at vereshagin.org> (http://vereshagin.org) pgp: A0E26627 

