SPUG: XPath on (less-than-perfect) HTML

Michael R. Wolf MichaelRWolf at att.net
Tue Dec 8 17:43:36 PST 2009


On Dec 8, 2009, at 10:15 AM, Colin Meyer wrote:

> Just came across this blog post on xpath webscraping (via perlbuzz):
>
>  http://ssscripting.blogspot.com/2009/12/using-perl-to-scrape-web.html
>
> It agrees with C.J.'s suggestion of using HTML::TreeBuilder::XPath

Colin,

    Thanks.


C.J.,

     Thanks again


All,

In essence, this article gets a nodeset via

     use WWW::Mechanize;
     use HTML::TreeBuilder::XPath;

     $agent = WWW::Mechanize->new();
     $agent->get($url);
     $content = $agent->content();

     @nodes = HTML::TreeBuilder::XPath->new()
                ->parse($content)
                ->findnodes($xpath);

     $text = join '', map { $_->content()->[0] } @nodes;

I'm getting similar results via

     use XML::LibXML;

     $content = DITTO;

     my %parse_options = (suppress_errors => 1, recover => 1);
     @nodes = XML::LibXML->new(\%parse_options)
                ->parse_html_string($content)
                ->findnodes($xpath);

     $text = join '', map { $_->textContent() } @nodes;


So, I asked myself, "Self, what's the difference between starting with  
XML::LibXML and starting with HTML::TreeBuilder if I get to pass an  
XPath off to a findnodes() method in either case?"  In chasing the  
provenance, I found that they're both maintained by Michel Rodriguez  
and have almost identical MANIFEST files.  (The names, but not the  
contents, of the lib/(XML|Tree)/*.pm files are identical; they differ  
in the names and number of the t/*.t files.)

A high-level code review suggested that the lib/*.pm files were mostly  
copy/paste-identical.

Here's the best (high-level) contrast I could find in the documentation.

     From the XML::XPathEngine POD:
     SEE ALSO
         Tree::XPathEngine for a similar module for non-XML trees.


Although XML brings to mind 'well-formed' whereas HTML (not XHTML)  
does not, I guess I'm fortunate to be able to use XPath in the XML-ish  
packages by passing the qw(suppress_errors recover) options to the  
parser to handle my HTML.  (Aside: this was the answer to my earlier  
posting on how to get past the non-well-formed issue.)  I guess I  
started with XML::LibXML because I didn't think that XPath would be  
applicable to non-XML (i.e. HTML).  It appears that findnodes($xpath)  
works on a $tree (or $doc or $dom) parsed by either package.

Could the only difference be that I've got to be explicit with the  
XML::LibXML parser about recovering from non-well-formed input, while  
the HTML one already (tacitly) expects non-well-formed input?
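To check that hunch, here's a minimal, self-contained sketch (the tag-soup string and variable names are mine, not from either module's docs) showing XML::LibXML's recover mode chewing through deliberately non-well-formed HTML and still answering an XPath query:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Deliberate tag soup: unclosed <p> elements and an unquoted attribute.
my $html = '<html><body><p class=msg>hello<p>world</body></html>';

# recover lets the parser build a tree from broken input;
# suppress_errors keeps it from spewing warnings while doing so.
my $parser = XML::LibXML->new( recover => 1, suppress_errors => 1 );
my $doc    = $parser->parse_html_string($html);

# findnodes() works on the recovered document just as on clean XML.
my @nodes = $doc->findnodes('//p');
my $text  = join ' ', map { $_->textContent } @nodes;
print "$text\n";
```

On my reading, the HTML parser in libxml2 closes each open <p> when the next one starts, so both paragraphs come back from the //p query.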

Since my code's got to run on Mac, Windows and CentOS it would be  
great to hear if anyone's got a strong preference for, or history  
with, one versus the other.

Thanks,
Michael

-- 
Michael R. Wolf
     All mammals learn by playing!
         MichaelRWolf at att.net
