<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">
<META NAME="GENERATOR" CONTENT="GtkHTML/3.28.1">
</HEAD>
<BODY>
<A HREF="http://search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.11/lib/HTML/TreeBuilder/XPath.pm">HTML::TreeBuilder::XPath</A> &mdash; it builds the tree with HTML::TreeBuilder's forgiving, browser-style parser (no well-formedness required), then lets you run XPath queries against the result.<BR>
<BR>
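A minimal sketch of the idea (the markup and URL below are made up for illustration, not from your pages):<BR>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Deliberately broken markup: unclosed <p> and <b>, missing </html>.
# (Hypothetical content, just to show the parser does not die on it.)
my $soup = <<'HTML';
<html><body>
<p>Unclosed <b>tags<p>everywhere
<a href="http://example.com/">a link</a>
HTML

# HTML::TreeBuilder parses like a browser and never fails on tag soup;
# the ::XPath subclass then accepts full XPath queries on the tree.
my $tree = HTML::TreeBuilder::XPath->new_from_content($soup);

my $href = $tree->findvalue('//a/@href');   # string value of first match
my @bold = $tree->findnodes('//b');         # HTML::Element nodes

print "$href\n";
print scalar(@bold), " bold element(s)\n";

$tree->delete;   # HTML::Element trees must be freed explicitly
```

So step #1 (parsing) always succeeds, and your step #2 (XPath) works unchanged.<BR>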
On Tue, 2009-11-17 at 13:33 -0800, Michael R. Wolf wrote:
<BLOCKQUOTE TYPE=CITE>
<PRE>
Yes, I know that XPath can only be applied to well-formed XML.
That's the theoretical, pure, absolute truth.

I'm working in the real world, where I can't find a well-formed page.
(For instance, <A HREF="http://validator.w3c.org">http://validator.w3c.org</A> does not validate such biggies
as amazon.com, ask.com, google.com, or msn.com.) For (my) practical
purposes, there are no valid pages.

What am I to (practically, not theoretically) do?
What tricks do practical XPath users know that I might not?

I'm trying to scrape pages across sites to aggregate data.
I'm loath to use regular expressions, for all the pure reasons, but if
purity isn't workable outside the ivory towers, it's useless
in the real world.

I've already tried:

    tidy -asxhtml
    tidy -asxml
    HTML::TokeParser
    XML::XPath
    XML::LibXML

I can't take step #2 (querying) because step #1 (parsing the data) fails.

Thanks for *practical* ideas, tricks, tips, and pointers....

Michael
</PRE>
</BLOCKQUOTE>
<BR>
</BODY>
</HTML>