SPUG: XPath on (less-than-perfect) HTML
Michael R. Wolf
MichaelRWolf at att.net
Tue Nov 17 13:33:17 PST 2009
Yes, I know that XPath can only be applied to well-formed XML.
That's the theoretical, pure, absolute truth.
I'm working in the real world where I can't find a well-formed page.
(For instance, http://validator.w3.org does not validate such biggies
as amazon.com, ask.com, google.com, or msn.com.) For (my) practical
purposes, there are no valid pages.
What am I to (practically, not theoretically) do?
What tricks do practical XPath users know that I might not?
I'm trying to scrape pages across sites to aggregate data.
I'm loath to use regular expressions for all the pure reasons, but if
pure isn't workable outside the ivory towers, that purity is useless
in the real world.
I've already tried:
tidy -asxhtml
tidy -asxml
HTML::TokeParser
XML::XPath
XML::LibXML
I can't get to step #2 (running the XPath queries) because step #1
(parsing the data) fails.
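
To make the failure concrete, here is a minimal sketch of the two steps. It is written in Python with only the standard library, standing in for the Perl modules listed above, so it is an illustration and not the code I'm actually running. The sample markup in PAGE and the helper name scrape_items are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical "tag soup" of the kind found on real pages:
# an unquoted attribute value and unclosed <li> and <br> elements.
PAGE = "<html><body><ul class=items><li>One<li>Two<br></ul></body></html>"

# The same content, hand-corrected into well-formed XML.
CLEAN = "<html><body><ul class='items'><li>One</li><li>Two</li></ul></body></html>"


def scrape_items(markup):
    """Return (items, error). Exactly one of the two is None."""
    try:
        # Step 1: parse strictly as XML. This is where real pages die.
        root = ET.fromstring(markup)
    except ET.ParseError as err:
        return None, str(err)
    # Step 2: an XPath-style query, reachable only if step 1 succeeded.
    return [li.text for li in root.findall(".//li")], None


items, error = scrape_items(PAGE)
clean_items, clean_error = scrape_items(CLEAN)
```

With the tag-soup input the parser raises before the query ever runs, so items comes back as None. The hand-corrected input gets through both steps, which is exactly the situation I can't reproduce on live pages.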
Thanks for *practical* ideas, tricks, tips, and pointers....
Michael
--
Michael R. Wolf
All mammals learn by playing!
MichaelRWolf at att.net