SPUG: XPath on (less-than-perfect) HTML
Joshua ben Jore
twists at gmail.com
Thu Dec 31 13:15:50 PST 2009
On Tue, Nov 17, 2009 at 1:33 PM, Michael R. Wolf <MichaelRWolf at att.net> wrote:
> Yes, I know that XPath can only be applied to well-formed XML.
> That's the theoretical, pure, absolute truth.
> I'm working in the real world where I can't find a well-formed page. (For
> instance, http://validator.w3c.org does not validate such biggies as
> amazon.com, ask.com, google.com, or msn.com). For (my) practical purposes,
> there are no valid pages.
> What am I to (practically, not theoretically) do?
> What tricks do practical XPath users know that I might not?
> I'm trying to scrape pages across sites to aggregate data.
> I'm loath to use regular expressions for all the pure reasons, but if pure
> isn't workable outside the ivory towers, that purity is useless in the real
> world.
> I've already tried:
> tidy -asxhtml
> tidy -asxml
I've happily used XML::LibXML per Randal Schwartz in Linux Magazine
(Jun 2003) at http://www.stonehenge.com/merlyn/LinuxMag/col49.html
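
A minimal sketch of that kind of approach, not the code from the column:
XML::LibXML can hand the page to libxml2's HTML parser, which repairs tag
soup into a tree, and XPath then runs against that tree. The sample HTML,
the variable names, and the XPath expression are made up for illustration,
and load_html() is assumed to be available (XML::LibXML 1.70 or later).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page: the <p> and <li> tags are never closed,
# so a strict XML parser would reject it.
my $html = <<'HTML';
<html><body>
<p>First paragraph
<p>Second paragraph
<ul>
  <li><a href="/one">One</a>
  <li><a href="/two">Two</a>
</ul>
</body></html>
HTML

# load_html() uses libxml2's forgiving HTML parser rather than the XML one.
# recover => 2 asks the parser to carry on past errors without warnings.
my $doc = XML::LibXML->load_html(
    string  => $html,
    recover => 2,
);

# XPath now works on the repaired tree.
my @links = map { $_->getAttribute('href') } $doc->findnodes('//a');
my $paragraphs = $doc->findvalue('count(//p)');

print "link: $_\n" for @links;
print "paragraphs: $paragraphs\n";
```

For a live page the markup would come from an HTTP fetch rather than a
heredoc. How well the repaired tree matches what a browser builds depends
on how broken the page is, so the XPath expressions are worth checking
against the parsed result and not only against the page source.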