[Melbourne-pm] Regexps - how does the lexical scope of capture buffers work? (Was: Regexp: What's the right way to do this?)
Peter Vereshagin
peter at vereshagin.org
Wed Oct 17 14:33:44 PDT 2012
Hi guys.
What is wrong with this:
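#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Data::Dump;    # exports dd and ddx

my $str = '';
# Accumulate lines until the buffer holds a complete "start - end" pair.
# defined() avoids an uninitialized-value warning at end of input.
while ( defined( my $buf = <DATA> ) ) {
    $str .= $buf;
    if (my ( $hh_mm_start => $hh_mm_end )
        = $str =~ m/
            <div[^>]*>\s*
            (\d\d?:\d\d?)
            \s*-\s*
            (\d\d?:\d\d?)
        /sx
       )
    {
        ddx $hh_mm_start => $hh_mm_end;
        $str = '';    # start a fresh buffer after a successful match
    }
}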
__DATA__
<div class="event-time calendar-1">12:45 -
14:00</div>
<div class="event-time calendar-1">12:45 -
</div>
<div class="something else entirely">The time is now
05:46</div>
?
There is also the old, bold Parse::RecDescent:
http://search.cpan.org/dist/Parse-RecDescent/
I use it to parse MySQL dumps here:
http://gitweb.vereshagin.org/endvance/blob_plain/HEAD:/endvance/README
But surely HTML::* can make you happy, too.
2012/10/17 13:45:17 -0700 Michael G Schwern <schwern at pobox.com> => To melbourne-pm at pm.org :
MGS> On 2012.10.17 5:24 AM, Nathan Bailey wrote:
MGS> > I'm just wondering if there's a better way to grab text out of
MGS> > multiple lines that are related to each other. A simple solution
MGS> > would be to go for multi-line strings but I'm actually curious
MGS> > to know (a) if that's the way the evaluation of the regexp and
MGS> > assignment works and (b) if there are better ways of doing
MGS> > multi-line parsing, without simply treating it as one big complex
MGS> > line.
MGS>
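On question (a), here is a minimal sketch of how I understand the capture variables to behave. The sample strings are made up for illustration. A match in list context returns its captures (or an empty list on failure), while $1 keeps the value of the last successful match in the enclosing block, so a failed match leaves the stale value in place.

```perl
use strict;
use warnings;

# Hypothetical sample strings, not taken from the original data.
my $with_time    = 'start 12:45';
my $without_time = 'no time here';

# List-context assignment: the captures are returned directly.
my ($first) = $with_time =~ /(\d\d?:\d\d)/;
my $after_ok = $1;                      # '12:45'

# This match fails: the list is empty, but $1 is NOT reset.
my ($second) = $without_time =~ /(\d\d?:\d\d)/;
my $after_fail = $1;                    # still '12:45'

# Captures are scoped to the enclosing block.
my $inside;
{
    'end 14:00' =~ /(\d\d?:\d\d)/;
    $inside = $1;                       # '14:00' inside the block
}
my $outside = $1;                       # back to '12:45' outside it

print "first=$first second=", ( defined $second ? $second : 'undef' ), "\n";
print "after_fail=$after_fail inside=$inside outside=$outside\n";
```

That is why assigning the match result to lexicals and testing it in the if, rather than reading $1 afterwards, tends to be the safer habit.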
MGS> You don't want the answer to be "use an HTML parser", so here's a sort of a
MGS> Look Into Your Future as you try to parse HTML with regexes...
MGS>
MGS> Reading your original code, it seems like you're trying to parse this:
MGS>
MGS> <div class="event-time calendar-1">12:45 -
MGS> 14:00</div>
MGS>
MGS> but doing it line by line with individual regexes. HTML doesn't give two
MGS> hoots about newlines, so trying to understand it line by line has lots of
MGS> problems. This means you have to carry state over from one line to another,
MGS> which gets complicated. Worse, you have to check that nothing else came
MGS> between them else you get fooled by this:
MGS>
MGS> <div class="event-time calendar-1">12:45 -
MGS> </div>
MGS> <div class="something else entirely">The time is now
MGS> 05:46</div>
MGS>
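To make the trap concrete, here is a rough sketch of a line-by-line parser that carries state and gets fooled by exactly that input. The variable names and the two per-line patterns are my own illustration, not anybody's real code.

```perl
use strict;
use warnings;

# The misleading sample from the quoted message, one element per line.
my @lines = (
    '<div class="event-time calendar-1">12:45 -',
    '</div>',
    '<div class="something else entirely">The time is now',
    '05:46</div>',
);

my $start;          # state carried over from one line to the next
my @naive_pairs;    # what the naive parser reports

for my $line (@lines) {
    if ( $line =~ /<div[^>]*>\s*(\d\d?:\d\d)\s*-/ ) {
        $start = $1;    # remember the start time
    }
    elsif ( defined $start and $line =~ /(\d\d?:\d\d)/ ) {
        # Nothing checks what came in between, so an unrelated
        # time from another div is accepted as the end time.
        push @naive_pairs, "$start-$1";
        undef $start;
    }
}

print "naive: @naive_pairs\n";    # naive: 12:45-05:46
```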
MGS> If you try to parse as one big string...
MGS>
MGS> /\G<div class="time">(\d+):(\d+) - (\d+):(\d+)</div>/msg
MGS>
MGS> That works for this:
MGS>
MGS> <div class="time">12:45 - 14:00</div><div class="time">15:00 - 16:00</div>
MGS>
MGS> But to account for whitespace and casing the regex really needs to be...
MGS>
MGS> /\G<div\s+class\s*=\s*"time"\s*>\s*(\d+):(\d+)\s*-\s*(\d+):(\d+)\s*</div>/imsg
MGS>
MGS> Yuck.
MGS>
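For the record, a small sketch of the \G plus /g approach run over the one-line sample. I use \s* wherever the whitespace is optional, /x to keep the pattern readable, and m{...} so the slash in </div> needs no escaping. The @ranges array is just my name for the result.

```perl
use strict;
use warnings;

my $html = '<div class="time">12:45 - 14:00</div>'
         . '<div class="time">15:00 - 16:00</div>';

my @ranges;
# \G anchors each attempt where the previous match ended, so the
# loop walks the string div by div and stops at the first gap.
while (
    $html =~ m{
        \G \s*
        <div \s+ class \s* = \s* "time" \s* > \s*
        (\d+):(\d+) \s* - \s* (\d+):(\d+)
        \s* </div>
    }xgi
  )
{
    push @ranges, "$1:$2-$3:$4";
}

print "$_\n" for @ranges;
# 12:45-14:00
# 15:00-16:00
```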
MGS> You'll run into trouble with this:
MGS>
MGS> /\G<p\s+class\s*=\s*"summary"\s*>.*</p>/msgi
MGS>
MGS> <p class="summary">foo</p>
MGS> <p class="somethingelse">bar</p>
MGS>
MGS> Slurped up too much. Have to make it non-greedy.
MGS>
MGS> /\G<p\s+class\s*=\s*"summary"\s*>(.*?)</p>/msgi
MGS>
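The greedy versus non-greedy difference is easy to see on the two-paragraph sample. This is only a sketch: I added a capture group to the greedy pattern so both results can be printed, and the variable names are mine.

```perl
use strict;
use warnings;

my $html = qq{<p class="summary">foo</p>\n<p class="somethingelse">bar</p>};

# Greedy: .* runs to the end and backtracks to the LAST </p>.
my ($greedy) = $html =~ m{<p\s+class\s*=\s*"summary"\s*>(.*)</p>}si;

# Non-greedy: .*? stops at the FIRST </p>.
my ($lazy) = $html =~ m{<p\s+class\s*=\s*"summary"\s*>(.*?)</p>}si;

print "greedy: [$greedy]\n";    # spans both paragraphs
print "lazy:   [$lazy]\n";      # lazy:   [foo]
```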
MGS> And then you're told it's not just <p> tags that might contain the summary, but
MGS> <div> tags as well. This gets into the joy of variable balancing tags in regexes.
MGS>
MGS> /\G<(p|div)\s+class\s*=\s*"summary"\s*>(.*?)</\1>/msgi
MGS>
MGS> And it all seems to be working fine until...
MGS>
MGS> <p class="summary"> foo <p>bar</p> baz </p>
MGS>
MGS> Now you're hosed. Regexes are *terrible* at trying to match nested balanced
MGS> delimiters. HTML is all about nested balanced delimiters. Solving this
MGS> requires wall-banging complexity.
MGS> http://perldoc.perl.org/perlfaq4.html#How-do-I-find-matching%2fnesting-anything%3f
MGS> http://perldoc.perl.org/perlre.html#%28%3fPARNO%29-%28%3f-PARNO%29-%28%3f%2bPARNO%29-%28%3fR%29-%28%3f0%29
MGS> https://metacpan.org/module/Regexp::Common::balanced
MGS>
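Here is a sketch of how the backreference pattern goes wrong on the nested sample, next to a hand-rolled depth counter for comparison. The counter is my own illustration and assumes only <p> tags and well-formed input, which real HTML will not give you.

```perl
use strict;
use warnings;

my $nested = '<p class="summary"> foo <p>bar</p> baz </p>';

# The backreference pattern stops at the first </p>, the inner one.
my ( $tag, $summary ) =
    $nested =~ m{<(p|div)\s+class\s*=\s*"summary"\s*>(.*?)</\1>}si;

# Counting depth by hand: +1 per opening <p>, -1 per closing </p>.
my ( $depth, $content_start, $content_end ) = (0);
while ( $nested =~ m{<(/?)p\b[^>]*>}gi ) {
    # $+[0] is the offset just past the first opening tag.
    $content_start = $+[0] unless defined $content_start;
    $depth += $1 ? -1 : 1;
    if ( $depth == 0 ) { $content_end = $-[0]; last }
}
my $whole = substr $nested, $content_start, $content_end - $content_start;

print "regex:   [$summary]\n";    # regex:   [ foo <p>bar]
print "counted: [$whole]\n";      # counted: [ foo <p>bar</p> baz ]
```

Even the counter is only a toy: attributes containing ">", comments, and unclosed tags all break it, which is the point being made above.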
MGS> There is a class of problems which look easy to solve with regexes, but are
MGS> actually nigh impossible to get even mostly correct. This is one of them.
MGS>
MGS> I expect you'll try anyway. :) "Parse HTML with a regex" is right up there
MGS> with "write a template language" and "write an ORM" for Perl rites of passage.
MGS>
MGS>
MGS> --
MGS> 185. My name is not a killing word.
MGS> -- The 213 Things Skippy Is No Longer Allowed To Do In The U.S. Army
MGS> http://skippyslist.com/list/
--
Peter Vereshagin <peter at vereshagin.org> (http://vereshagin.org) pgp: A0E26627