SPUG: XPath on (less-than-perfect) HTML

Tue Dec 8 19:44:17 PST 2009

On Tue, Dec 8, 2009 at 5:43 PM, Michael R. Wolf <MichaelRWolf at att.net> wrote:
>
> Although XML brings to mind 'well-formed' whereas HTML (not XHTML) does not, I guess I'm fortunate to be able to use XPath in the XML-ish packages by using the qw(suppress_errors recover) options to the parser to handle my HTML.  (Aside.  This was the answer to my earlier posting on how to get over the non-well-formed issue.)  I guess I started with XML::LibXML because I didn't think that XPath would be applicable to non-XML (i.e. HTML).  It appears that findnodes($xpath) works for a $tree (or $doc or $dom) parsed from either package.
>
> Could the only difference be that I've got to be explicit with the XML::LibXML parser about recovering on non-well-formed input, while the HTML one already (tacitly) expects non-well-formed input?

No personal experience, but it's not just about recovering; it's about
recovering the way a browser would have interpreted the HTML.
From the TreeBuilder POD:

> HTML is rather harder to parse than people who write it generally suspect.
>
> Here's the problem: HTML is a kind of SGML that permits "minimization"
> and "implication". In short, this means that you don't have to close
> every tag you open (because the opening of a subsequent tag may
> implicitly close it), and if you use a tag that can't occur in the
> context you seem to be using it in, under certain conditions the parser
> will be able to realize you mean to leave the current context and enter
> the new one, that being the only one that your code could correctly be
> interpreted in.
>
> Now, this would all work flawlessly and unproblematically if: 1) all
> the rules that both prescribe and describe HTML were (and had been)
> clearly set out, and 2) everyone was aware of these rules and wrote
> their code in compliance to them.
>
> However, it didn't happen that way, and so most HTML pages are
> difficult if not impossible to correctly parse with nearly any set of
> straightforward SGML rules. That's why the internals of HTML::TreeBuilder
> consist of lots and lots of special cases -- instead of being just a
> generic SGML parser with HTML DTD rules plugged in.
> ...
>
> The HTML::TreeBuilder source may seem long and complex, but it is rather
> well commented, and symbol names are generally self-explanatory. (You are
> encouraged to read the Mozilla HTML parser source for comparison.) Some
> of the complexity comes from little-used features, and some of it comes
> from having the HTML tokenizer (HTML::Parser) being a separate module,
> requiring somewhat of a different interface than you'd find in a combined
> tokenizer and tree-builder. But most of the length of the source comes
> from the fact that it's essentially a long list of special cases,
> with lots and lots of sanity-checking, and sanity-recovery -- because,
> as Roseanne Rosannadanna once said, "it's always something".
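
For comparison, here is a minimal sketch of the setup Michael describes: the same sloppy HTML parsed both ways, then queried with the same XPath. It assumes the HTML-side package is HTML::TreeBuilder::XPath (this message doesn't name it; plain HTML::TreeBuilder has no findnodes), and the sample markup and variable names are made up for illustration. It only shows that both trees answer findnodes(); it doesn't settle whether the two parsers recover identically on nastier input.

```perl
use strict;
use warnings;
use XML::LibXML;
use HTML::TreeBuilder::XPath;   # assumption: the HTML-side package in the thread

# Made-up sample: unclosed <p> and <li>, plus a stray </span>.
my $html = <<'HTML';
<html><body>
<p>First paragraph
<p>Second paragraph
<ul><li>one<li>two</ul>
</span>
</body></html>
HTML

my $xpath = '//ul/li';

# XML::LibXML: ask for the HTML parser, and be explicit about
# recovering from (and not reporting) markup errors.
my $doc = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);
my @libxml_items = map { $_->textContent } $doc->findnodes($xpath);

# HTML::TreeBuilder::XPath: no recovery options needed; the tree
# builder expects tag soup and applies its own implicit-close rules.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my @tb_items = map { $_->as_text } $tree->findnodes($xpath);
$tree->delete;   # explicit cleanup of the HTML::Element tree

print "XML::LibXML:       ", scalar(@libxml_items), " <li> nodes\n";
print "HTML::TreeBuilder: ", scalar(@tb_items),     " <li> nodes\n";
```

On simple cases like this, both should find the same nodes. The place to look for differences is markup where "what the browser would have done" is less obvious, such as mis-nested inline tags or table content outside a table.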