<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">
<META NAME="GENERATOR" CONTENT="GtkHTML/3.28.1">
</HEAD>
<BODY>
<A HREF="http://search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.11/lib/HTML/TreeBuilder/XPath.pm">HTML::TreeBuilder::XPath</A> &mdash; it builds the tree with HTML::TreeBuilder's forgiving, browser-style parser (no well-formedness required), then lets you run XPath queries against the result.<BR>
<BR>
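A minimal sketch of the idea (the markup and URL below are made up for illustration, not from your pages):<BR>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Deliberately broken markup: unclosed <p> and <b>, missing </html>.
# (Hypothetical content, just to show the parser does not die on it.)
my $soup = <<'HTML';
<html><body>
<p>Unclosed <b>tags<p>everywhere
<a href="http://example.com/">a link</a>
HTML

# HTML::TreeBuilder parses like a browser and never fails on tag soup;
# the ::XPath subclass then accepts full XPath queries on the tree.
my $tree = HTML::TreeBuilder::XPath->new_from_content($soup);

my $href = $tree->findvalue('//a/@href');   # string value of first match
my @bold = $tree->findnodes('//b');         # HTML::Element nodes

print "$href\n";
print scalar(@bold), " bold element(s)\n";

$tree->delete;   # HTML::Element trees must be freed explicitly
```

So step #1 (parsing) always succeeds, and your step #2 (XPath) works unchanged.<BR>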
On Tue, 2009-11-17 at 13:33 -0800, Michael R. Wolf wrote:
<BLOCKQUOTE TYPE=CITE>
<PRE>
Yes, I know that XPath can only be applied to well-formed XML.
That's the theoretical, pure, absolute truth.

I'm working in the real world, where I can't find a well-formed page.
(For instance, <A HREF="http://validator.w3c.org">http://validator.w3c.org</A> does not validate such biggies
as amazon.com, ask.com, google.com, or msn.com.) For (my) practical
purposes, there are no valid pages.

What am I to (practically, not theoretically) do?
What tricks do practical XPath users know that I might not?

I'm trying to scrape pages across sites to aggregate data.
I'm loath to use regular expressions, for all the pure reasons, but if
purity isn't workable outside the ivory towers, it's useless
in the real world.

I've already tried:

    tidy -asxhtml
    tidy -asxml
    HTML::TokeParser
    XML::XPath
    XML::LibXML

I can't take step #2 (querying) because step #1 (parsing the data) fails.

Thanks for *practical* ideas, tricks, tips, and pointers....

Michael
</PRE>
</BLOCKQUOTE>
<BR>
</BODY>
</HTML>