[Edinburgh-pm] Marpa::HTML

Fri May 20 04:36:26 PDT 2011

(disclaimer: not played with Marpa, tho it looks interesting)

At least when I've gone down this rabbit hole, the problem isn't the parsing
per se, but rather given this:

    <html>
      <h1>hi</
      <tr><td>foo</tr></td
    </body>
    </html>

What the bleedin' feck does it mean?

(no opening body tag, borked closing h1, no table tags, missing > on the
closing td & the closing td/tr being out of order).

The "problem" is that browsers will display that just fine (for some value of
"fine"), so you do get real-world insanity like this when your home-brew
web crawler starts chewing on public web pages.

You're trying to extract meaning from the markup, but the markup is ambiguous,
even after a successful parse.

What is the heading? Is the content tabular?  What is the "text" for that page?
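
To make that concrete, here is a minimal sketch that assumes Python's
standard-library html.parser as a stand-in lenient tokenizer. It is not
Marpa::HTML, and the EventLogger and tokenize names are made up for
illustration. It feeds the snippet above through the parser and simply logs
what comes out, with no attempt at repair:

```python
from html.parser import HTMLParser

# The broken markup from the example above, verbatim.
BROKEN = """<html>
  <h1>hi</
  <tr><td>foo</tr></td
</body>
</html>
"""


class EventLogger(HTMLParser):
    """Hypothetical helper: record the token stream, repair nothing."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only runs
            self.events.append(("data", data.strip()))

    def handle_comment(self, data):
        # Some malformed end tags are reported as "bogus comments".
        self.events.append(("comment", data.strip()))


def tokenize(markup):
    """Feed markup to the logger and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


events = tokenize(BROKEN)
for event in events:
    print(event)
```

The exact events for the mangled closing tags differ between Python versions,
so don't rely on them. What should hold either way is that nothing raises,
and no table or body start tag ever shows up. The tokenizer "succeeds", and
the questions above are still yours to answer.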

Marpa seems to have some smarts for missing/broken HTML, but I'd wager the
combined might of the internets can produce obscure HTML markup insanity faster
than any mortal can keep up.  Tho it is a shame the browsers' rendering engines'
behaviour isn't more exposed for re-purposing in this regard.

This isn't to say that Marpa won't be a massive win for dealing with this
sorta thing, just that I'm guessing you're still going to be amazed at how
insane some markup can be, and end up dealing with piecemeal exceptions in the
real world.  And for me, it's this that has historically dominated the pain.

All that said, Marpa sure looks worth further investigation.

(as would some kinda "but i'm on both those lists, so kindly only forward me
one copy ta" filtering service ... that isn't gmail ;)