SPUG: XPath on (less-than-perfect) HTML

Thu Dec 31 13:15:50 PST 2009

On Tue, Nov 17, 2009 at 1:33 PM, Michael R. Wolf <MichaelRWolf at att.net> wrote:
> Yes, I know that XPath can only be applied to well-formed XML.
>
> That's the theoretical, pure, absolute truth.
>
> I'm working in the real world where I can't find a well-formed page.  (For
> instance, http://validator.w3c.org does not validate such biggies as
> amazon.com, ask.com, google.com, or msn.com).  For (my) practical purposes,
> there are no valid pages.
>
> What am I to (practically, not theoretically) do?
>
> What tricks do practical XPath users know that I might not?
>
> I'm trying to scrape pages across sites to aggregate data.
>
> I'm loath to use regular expressions for all the pure reasons, but if pure
> isn't workable outside the ivory towers, that purity is useless in the real
> world.
>
> I've already tried:
>    tidy -asxhtml
>    tidy -asxml
>    HTML::TokeParser
>    XML::XPath
>    XML::LibXML

I've happily used XML::LibXML for this, following Randal Schwartz's
Linux Magazine column (Jun 2003):
http://www.stonehenge.com/merlyn/LinuxMag/col49.html
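
For what it's worth, here is a minimal sketch of the general idea. It is
not taken from the column. It assumes an XML::LibXML recent enough to
provide load_html, and it relies on libxml2's HTML parser in recover mode
rather than the strict XML parser, so tag soup still yields a tree that
XPath can query. The sample markup and the helper name extract_hrefs are
made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse possibly malformed HTML and return every
# link target found by an XPath query.
sub extract_hrefs {
    my ($html) = @_;

    # Use the HTML parser, not the XML parser. recover => 1 keeps going
    # past markup errors; suppress_errors => 1 silences the warnings.
    my $doc = XML::LibXML->load_html(
        string          => $html,
        recover         => 1,
        suppress_errors => 1,
    );

    # Ordinary XPath works on the repaired tree.
    return map { $_->value } $doc->findnodes('//a/@href');
}

# Deliberately sloppy input: unclosed <p> tags, unquoted attribute value.
my $html = <<'HTML';
<html><body>
<p>First paragraph
<p>Second, with a <a href=/item/1>link</a>
</body></html>
HTML

print "$_\n" for extract_hrefs($html);
```

The point is that the page never has to be well-formed XML; the parser
repairs it first and XPath runs against the result. How well the repair
matches what a browser would do depends on the page, so spot-check the
tree on each site you scrape.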

Josh