SPUG: XPath on (less-than-perfect) HTML
C.J. Adams-Collier
cjac at colliertech.org
Tue Nov 17 19:28:41 PST 2009
HTML::TreeBuilder::XPath
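
A minimal sketch of how that module might be used for this kind of scraping, assuming it is installed from CPAN. The sample markup, the XPath expression, and the variable names are made up for illustration and are not from the thread. The idea is that HTML::TreeBuilder builds a tree from tag soup first, and the XPath query then runs against that tree, so the input never has to be well-formed XML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input (not from the thread): no <html>/<body> wrapper,
# an unquoted attribute value, and unclosed <li> and <p> tags.
my $html = <<'HTML';
<title>Price list</title>
<ul class=items>
  <li><a href="/widget">Widget</a>
  <li><a href="/gadget">Gadget</a>
</ul>
<p>Updated daily
HTML

# Build a tree from the sloppy markup; the parser is tolerant of tag soup.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Query the tree with XPath and collect each link's text and target.
my @links = map { +{ text => $_->as_text, href => $_->attr('href') } }
            $tree->findnodes('//ul[@class="items"]/li/a');

print "$_->{text} -> $_->{href}\n" for @links;

# HTML::Element trees should be freed explicitly when no longer needed.
$tree->delete;
```

For a real page, the string passed to parse_content would be the body fetched from the site (for example with LWP::UserAgent), and the XPath expression would have to be adapted to that site's markup.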
On Tue, 2009-11-17 at 13:33 -0800, Michael R. Wolf wrote:
> Yes, I know that XPath can only be applied to well-formed XML.
>
> That's the theoretical, pure, absolute truth.
>
> I'm working in the real world where I can't find a well-formed page.
> (For instance, http://validator.w3.org does not validate such biggies
> as amazon.com, ask.com, google.com, or msn.com). For (my) practical
> purposes, there are no valid pages.
>
> What am I to (practically, not theoretically) do?
>
> What tricks do practical XPath users know that I might not?
>
> I'm trying to scrape pages across sites to aggregate data.
>
> I'm loath to use regular expressions for all the pure reasons, but if
> pure isn't workable outside the ivory towers, that purity is useless
> in the real world.
>
> I've already tried:
> tidy -asxhtml
> tidy -asxml
> HTML::TokeParser
> XML::XPath
> XML::LibXML
>
> I can't take step #2 because step #1 (parsing the data) fails.
>
> Thanks for *practical* ideas, tricks, tips, and pointers....
>
> Michael
>