[Melbourne-pm] Regexps - how does the lexical scope of capture buffers work? (Was: Regexp: What's the right way to do this?)

Michael G Schwern schwern at pobox.com
Wed Oct 17 13:45:17 PDT 2012


On 2012.10.17 5:24 AM, Nathan Bailey wrote:
> I'm just wondering if there's a better way to grab text out of
> multiple lines that are related to each other. A simple solution
> would be to go for multi-line strings but I'm actually curious
> to know (a) if that's the way the evaluation of the regexp and
> assignment works and (b) if there are better ways of doing
> multi-line parsing, without simply treating it as one big complex
> line.

You don't want the answer to be "use an HTML parser", so here's a sort of a
Look Into Your Future as you try to parse HTML with regexes...

Reading your original code, it seems like you're trying to parse this:

    <div class="event-time calendar-1">12:45 -
        14:00</div>

but doing it line by line with individual regexes.  HTML doesn't give two
hoots about newlines, so trying to understand it line by line has lots of
problems.  This means you have to carry state over from one line to another,
which gets complicated.  Worse, you have to check that nothing else came
between them, or you get fooled by this:

    <div class="event-time calendar-1">12:45 -
        </div>
    <div class="something else entirely">The time is now
        05:46</div>
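
To make that failure concrete, here's a rough sketch of a line-by-line parser
that carries a start time from one line to the next.  The sub name
naive_times and both patterns are made up for illustration; they are not your
original code.

```perl
use strict;
use warnings;

# Hypothetical naive parser: remember a start time until an end time shows up.
sub naive_times {
    my @lines = @_;
    my ($start, @found);
    for my $line (@lines) {
        if ($line =~ /class="event-time[^"]*">\s*(\d+:\d+)\s*-/) {
            $start = $1;    # state carried over to later lines
        }
        elsif (defined $start && $line =~ /^\s*(\d+:\d+)<\/div>/) {
            push @found, "$start - $1";
            undef $start;
        }
    }
    return @found;
}

my @html = (
    '<div class="event-time calendar-1">12:45 -',
    '    </div>',
    '<div class="something else entirely">The time is now',
    '    05:46</div>',
);

my @times = naive_times(@html);
print "$_\n" for @times;    # prints "12:45 - 05:46"
```

The start time comes from the first div and the "end time" from an unrelated
one, because nothing resets the state when the first </div> goes by.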

If you try to parse as one big string...

    m{\G<div class="time">(\d+):(\d+) - (\d+):(\d+)</div>}msg

That works for this:

    <div class="time">12:45 - 14:00</div><div class="time">15:00 - 16:00</div>
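
A minimal sketch of walking that string with \G and /g in scalar context.
The variable names are mine, and the pattern uses m{} delimiters so the slash
in </div> needs no escaping.

```perl
use strict;
use warnings;

my $html = '<div class="time">12:45 - 14:00</div>'
         . '<div class="time">15:00 - 16:00</div>';

my @ranges;
# \G anchors each attempt where the previous match ended, so the loop
# stops at the first thing that isn't one of these divs.
while ($html =~ m{\G<div class="time">(\d+):(\d+) - (\d+):(\d+)</div>}g) {
    push @ranges, "$1:$2 to $3:$4";
}

print "$_\n" for @ranges;
# prints:
#   12:45 to 14:00
#   15:00 to 16:00
```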

But to account for whitespace and casing the regex really needs to be...

    m{\G\s*<div\s+class\s*=\s*"time"\s*>\s*(\d+):(\d+)\s*-\s*(\d+):(\d+)\s*</div>}imsg

Yuck.

You'll run into trouble with this:

    m{\G<p\s+class\s*=\s*"summary"\s*>.*</p>}msgi

    <p class="summary">foo</p>
    <p class="somethingelse">bar</p>

Slurped up too much.  Have to make it non-greedy.

    m{\G<p\s+class\s*=\s*"summary"\s*>(.*?)</p>}msgi
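
Here's a small side-by-side of the two, as a sketch.  I've added a capture
group to the greedy version so the difference is visible, dropped \G and /g
since there's only one match to find, and used \s* around the = and > so the
patterns actually match the sample.

```perl
use strict;
use warnings;

my $html = qq{<p class="summary">foo</p>\n<p class="somethingelse">bar</p>};

# Greedy: with /s, .* runs to the end of the string and backs up
# only as far as the LAST </p>.
my ($greedy) = $html =~ m{<p\s+class\s*=\s*"summary"\s*>(.*)</p>}si;

# Non-greedy: .*? stops at the FIRST </p>.
my ($lazy) = $html =~ m{<p\s+class\s*=\s*"summary"\s*>(.*?)</p>}si;

print "greedy: [$greedy]\n";    # includes the second paragraph's text
print "lazy:   [$lazy]\n";      # prints "lazy:   [foo]"
```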

And then you're told it's not just <p> tags that might contain the summary, but
<div> tags as well.  This gets into the joy of variable balancing tags in regexes.

    m{\G<(p|div)\s+class\s*=\s*"summary"\s*>(.*?)</\1>}msgi

And it all seems to be working fine until...

    <p class="summary"> foo <p>bar</p> baz </p>
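
Running the backreference pattern against that input shows where it goes
wrong.  This is a sketch: the variable names are mine, \G and /g are dropped
for a single match, and the whitespace is relaxed to \s* so the opening tag
matches.

```perl
use strict;
use warnings;

my $html = '<p class="summary"> foo <p>bar</p> baz </p>';

my ($tag, $summary) =
    $html =~ m{<(p|div)\s+class\s*=\s*"summary"\s*>(.*?)</\1>}si;

# The non-greedy .*? stops at the inner </p>, so the "summary" is cut
# short and still has an unclosed <p> in it.
print "tag:     [$tag]\n";        # prints "tag:     [p]"
print "summary: [$summary]\n";    # prints "summary: [ foo <p>bar]"
```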

Now you're hosed.  Regexes are *terrible* at trying to match nested balanced
delimiters.  HTML is all about nested balanced delimiters.  Solving this
requires wall-banging complexity.
http://perldoc.perl.org/perlfaq4.html#How-do-I-find-matching%2fnesting-anything%3f
http://perldoc.perl.org/perlre.html#%28%3fPARNO%29-%28%3f-PARNO%29-%28%3f%2bPARNO%29-%28%3fR%29-%28%3f0%29
https://metacpan.org/module/Regexp::Common::balanced
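
For a taste of what the perlre link is about, here's a rough sketch of a
recursive pattern (Perl 5.10 or later) that copes with nested <p> tags only.
The group name "element" is my own choice, and it assumes no comments, no ">"
inside attribute values, and no other tags in the text.  It is nowhere near a
real HTML parser.

```perl
use 5.010;
use strict;
use warnings;

my $balanced_p = qr{
    (?<element>
        <p\b[^>]*>              # opening <p ...>
        (?:
            (?&element)         # a nested <p>...</p>, matched recursively
          | [^<]++              # or a run of plain text (possessive)
        )*
        </p>                    # the closing tag that balances the opener
    )
}xi;

my $html = '<p class="summary"> foo <p>bar</p> baz </p>';

my $matched;
if ($html =~ $balanced_p) {
    $matched = $+{element};     # the outermost balanced element
}

print "$matched\n";    # prints the whole input string
```

It gets the nested case right, and that's about all it does.  Every extra
tag, attribute quirk or comment you need to support makes it worse.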

There is a class of problems which look easy to solve with regexes, but are
actually nigh impossible to get even mostly correct.  This is one of them.

I expect you'll try anyway. :)  "Parse HTML with a regex" is right up there
with "write a template language" and "write an ORM" for Perl rites of passage.


-- 
185. My name is not a killing word.
    -- The 213 Things Skippy Is No Longer Allowed To Do In The U.S. Army
           http://skippyslist.com/list/

