[Melbourne-pm] Regexp: What's the right way to do this?

Michael G Schwern schwern at pobox.com
Wed Oct 17 01:14:29 PDT 2012


"Should I use regexes to parse HTML?"

No, do not use regexes to parse HTML.  While it may seem easy to put together
a quick and dirty HTML scanner with regexes, it will very quickly get very
ugly.  HTML parsing requires matching balanced delimiters, such as < and >,
quotes, and opening and closing tags, which regexes do very poorly.  On top of
that come all the little special cases like comments.

In addition, you're going to forget many small things, like casing and spaces,
which you'll be hunting down forever.  For example...

  <div  class="event-time calendar-1">2:3 - </div>
  <DIV class="event-time calendar-1">  2:3 - </div>
  <P class="summary">blah</p>
  <p  class = "description">blah</p>
  <!-- <p class="summary">blah</p> -->

If you patch up your regexes to cover those, an activity for the next meeting
might be to come up with more cases to break them. :)
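
To make that concrete, here is a small sketch that runs the five sample lines
above through a "quick and dirty" pattern.  The pattern $naive and the helper
naive_match() are made up for illustration; they are not taken from the
original question.

```perl
use strict;
use warnings;

# The five sample lines from this message.
my @samples = (
    '<div  class="event-time calendar-1">2:3 - </div>',
    '<DIV class="event-time calendar-1">  2:3 - </div>',
    '<P class="summary">blah</p>',
    '<p  class = "description">blah</p>',
    '<!-- <p class="summary">blah</p> -->',
);

# Hypothetical naive pattern: lowercase tag, exactly one space before
# "class", no whitespace around "=", and no notion of comments.
my $naive = qr{<(div|p) class="([^"]*)">([^<]*)</\1>};

# Returns 1 if the naive pattern finds an element in the line, else 0.
sub naive_match {
    my ($line) = @_;
    return $line =~ $naive ? 1 : 0;
}

for my $line (@samples) {
    printf "%-7s %s\n", (naive_match($line) ? 'MATCH' : 'MISSED'), $line;
}
```

With this particular pattern the four real elements are all missed (extra
spaces, uppercase tag names, whitespace around "="), and the only line that
matches is the one inside the comment, which should have been ignored.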

There, your regex question is answered. :P

It's quicker even in the short run to use a pre-existing, well-documented
parser like HTML::TreeBuilder, as evidenced by the fact that you're posting on
a mailing list for help with your regex-based HTML parser.  You even get
search facilities like XPath (see HTML::TreeBuilder::XPath and
http://www.w3schools.com/xpath/).

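    use v5.14;
    use warnings;
    use HTML::TreeBuilder::XPath;

    # Parse the HTML file named as the first command-line argument.
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file(shift);

    # Select every div whose class attribute starts with the given prefix.
    my @event_times = $tree->findnodes(
        '//div[starts-with(@class, "event-time-calendar-")]'
    );

    for my $event_time (@event_times) {
        # Skip nodes whose text contains no hh:mm time.
        my($hour, $min) = $event_time->as_text =~ /(\d+):(\d+)/ or next;
        say "Event at $hour:$min";
    }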

Once you learn how to use an HTML parser and XPath you'll never have to write
a hacky HTML regex parser again.  O(1) learning efficiency!
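
As a rough illustration, the sketch below feeds the five troublesome sample
lines from earlier in this message to a parser.  It uses plain
HTML::TreeBuilder and its look_down() method rather than XPath, purely to keep
the example short; the class patterns are assumptions based on those sample
lines, not on the original poster's real HTML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The same sample lines that trip up a naive regex.
my $html = <<'HTML';
<div  class="event-time calendar-1">2:3 - </div>
<DIV class="event-time calendar-1">  2:3 - </div>
<P class="summary">blah</p>
<p  class = "description">blah</p>
<!-- <p class="summary">blah</p> -->
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Tag names are normalised to lowercase by the parser, and comments are
# not stored in the tree by default, so neither needs special handling.
my @times = map { $_->as_trimmed_text }
    $tree->look_down(_tag => 'div', class => qr/^event-time\b/);
my @summaries = map { $_->as_trimmed_text }
    $tree->look_down(_tag => 'p', class => 'summary');

print "event-time divs: ", scalar(@times), "\n";
print "summary paragraphs: ", scalar(@summaries), "\n";

# Free the tree explicitly; older HTML::Tree releases need this.
$tree->delete;
```

The spacing, casing and comment cases are dealt with by the parser itself, so
the remaining regex only has to pull the time out of the element's text.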

If you're doing this as an exercise in learning regexes, well, don't ignore
the lesson just because it's not what you expected to learn.  If you want to
learn "from scratch", look into writing a grammar parser.


On 2012.10.17 12:11 AM, Nathan Bailey wrote:
> I knew someone would say that :P
>
> It's a regexp question, not an HTML parsing question!
> N
>
> On 17/10/2012, at 6:10 PM, Toby Wintermute wrote:
>
>> On 17 October 2012 18:09, Nathan Bailey <nathan.bailey at monash.edu> wrote:
>>> The code below works, but the commented out bits don't. I presume that
>>> when $shh and $smm are defined on the first loop through, they get undefined
>>> on the next time through?
>>>
>>> What's the "right" way to do this, TIMTOWTDI notwithstanding :-)
>>
>>
>> use HTML::TreeBuilder;
>>
>>
>> -Toby
>
> _______________________________________________
> Melbourne-pm mailing list
> Melbourne-pm at pm.org
> http://mail.pm.org/mailman/listinfo/melbourne-pm
>


-- 
s7ank: i want to be one of those guys that types "s/j&jd//.^$ueu*///djsls/sm."
       and it's a perl script that turns dog crap into gold.

