SPUG: XPath on (less-than-perfect) HTML

Tue Nov 17 13:33:17 PST 2009

Yes, I know that XPath can only be applied to well-formed XML.

That's the theoretical, pure, absolute truth.

I'm working in the real world, where I can't find a well-formed page.
(For instance, http://validator.w3.org does not validate such biggies
as amazon.com, ask.com, google.com, or msn.com.)  For (my) practical
purposes, there are no valid pages.

What am I to (practically, not theoretically) do?

What tricks do practical XPath users know that I might not?

I'm trying to scrape pages across sites to aggregate data.

I'm loath to use regular expressions for all the pure reasons, but if
pure isn't workable outside the ivory towers, that purity is useless
in the real world.

I've already tried:
     tidy -asxhtml
     tidy -asxml
     HTML::TokeParser
     XML::XPath
     XML::LibXML

I can't take step #2 (running the XPath query) because step #1
(parsing the data) fails.
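
To make the failure concrete, here is a minimal sketch of the two steps.
It assumes a strict XML parse via XML::LibXML's parse_string; the HTML
fragment is made up for illustration and is not taken from any of the
sites above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical fragment standing in for a real-world page:
# unclosed <p> and <br> tags, plus a bare ampersand.
my $html = '<html><body><p>Price: 5 & up<br><p>In stock</body></html>';

# Step 1: parse. parse_string() expects well-formed XML and dies otherwise,
# so wrap it in eval and keep the error for inspection.
my $doc = eval { XML::LibXML->new->parse_string($html) };
my $parse_error = $@;

if ($doc) {
    # Step 2: query with XPath (not reached for input like the above).
    print $_->textContent, "\n" for $doc->findnodes('//p');
}
else {
    print "step 1 failed: $parse_error";
}
```

With input like this, the script should take the else branch, so the
XPath query in step 2 never gets a document to work on.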

Thanks for *practical* ideas, tricks, tips, and pointers....

Michael

-- 
Michael R. Wolf
     All mammals learn by playing!
         MichaelRWolf at att.net