SPUG: XPath on (less-than-perfect) HTML

C.J. Adams-Collier cjac at colliertech.org
Tue Nov 17 19:28:41 PST 2009


On Tue, 2009-11-17 at 13:33 -0800, Michael R. Wolf wrote:

> Yes, I know that XPath can only be applied to well-formed XML.
> That's the theoretical, pure, absolute truth.
> I'm working in the real world where I can't find a well-formed page.   
> (For instance, http://validator.w3c.org does not validate such biggies  
> as amazon.com, ask.com, google.com, or msn.com).  For (my) practical  
> purposes, there are no valid pages.
> What am I to (practically, not theoretically) do?
> What tricks do practical XPath users know that I might not?
> I'm trying to scrape pages across sites to aggregate data.
> I'm loath to use regular expressions for all the pure reasons, but if
> pure isn't workable outside the ivory towers, that purity is useless
> in the real world.
> I've already tried:
>      tidy -asxhtml
>      tidy -asxml
>      HTML::TokeParser
>      XML::XPath
>      XML::LibXML
> I can't take step #2 because step #1 (parsing the data) fails.
> Thanks for *practical* ideas, tricks, tips, and pointers....
> Michael

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.pm.org/pipermail/spug-list/attachments/20091117/4804d0e0/attachment.html>
