SPUG: XPath on (less-than-perfect) HTML
C.J. Adams-Collier
cjac at colliertech.org
Tue Dec 8 18:30:53 PST 2009
yay me.
On Tue, 2009-12-08 at 17:43 -0800, Michael R. Wolf wrote:
> On Dec 8, 2009, at 10:15 AM, Colin Meyer wrote:
>
> > Just came across this blog post on xpath webscraping (via perlbuzz):
> >
> > http://ssscripting.blogspot.com/2009/12/using-perl-to-scrape-web.html
> >
> > It agrees with C.J.'s suggestion of using HTML::TreeBuilder::XPath
>
> Colin,
>
> Thanks.
>
>
> C.J.,
>
> Thanks again
>
>
> All,
>
> In essence, this article gets a nodeset via
>
> use WWW::Mechanize;
> use HTML::TreeBuilder::XPath;
>
> $agent = WWW::Mechanize->new();
> $agent->get($url);
> $content = $agent->content();
>
> @nodes = HTML::TreeBuilder::XPath->new()->parse($content)->findnodes($xpath);
>
> $text = join '', map { $_->content()->[0] } @nodes;
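For anyone following along, here's a self-contained version of the snippet above. It parses an inline string rather than fetching a live page, so the HTML, the class name, and the XPath are all made up for illustration; `as_text` stands in for walking `content()` by hand.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Inline HTML instead of a WWW::Mechanize fetch, so this runs offline.
my $content = '<html><body><p class="x">Hello</p><p class="x">World</p></body></html>';
my $xpath   = '//p[@class="x"]';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

my @nodes = $tree->findnodes($xpath);
my $text  = join '', map { $_->as_text } @nodes;
print "$text\n";    # prints "HelloWorld"

$tree->delete;      # HTML::TreeBuilder trees hold circular refs; free them
```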
>
> I'm getting similar results via
>
> use XML::LibXML;
>
> $content = DITTO;
>
> my %parse_options = (suppress_errors =>1, recover => 1);
> @nodes = XML::LibXML->new(\%parse_options)->parse_html_string($content)->findnodes($xpath);
>
> $text = join '', map { $_->textContent() } @nodes;
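And a self-contained sketch of the XML::LibXML route, again on an inline string. The input is deliberately non-well-formed (unclosed tags) to show the `recover`/`suppress_errors` options doing their job; here the options are passed to `parse_html_string` as a hashref, which recent XML::LibXML releases accept.

```perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately non-well-formed HTML: unclosed <p>, <body>, and <html> tags.
my $content = '<html><body><p class="x">Hello<p class="x">World';
my $xpath   = '//p[@class="x"]';

my %parse_options = (suppress_errors => 1, recover => 1);
my $dom   = XML::LibXML->new->parse_html_string($content, \%parse_options);
my @nodes = $dom->findnodes($xpath);
my $text  = join '', map { $_->textContent } @nodes;
print "$text\n";    # prints "HelloWorld"
```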
>
>
> So, I asked myself, "Self, what's the difference between starting with
> XML::LibXML and starting with HTML::TreeBuilder if I get to pass an
> XPath off to a findnodes() method in either case?". In chasing the
> provenance, I found that they're both maintained by Michel Rodriguez,
> and have almost identical MANIFEST files. (They're identical in the
> names (but not the contents) of the lib/(XML|Tree)/*.pm files, and
> differ in the names and number of the t/*.t files.)
>
> A high-level code review suggested that the lib/*.pm files are mostly
> copy/paste-identical.
>
> Here's the best (high-level) contrast I could find in the documentation.
>
> From the XML::XPathEngine POD:
> SEE ALSO
> Tree::XPathEngine for a similar module for non-XML trees.
>
>
> Although XML brings to mind 'well-formed' whereas HTML (not XHTML)
> does not, I guess I'm fortunate to be able to use XPath in the XML-ish
> packages by using the qw(suppress_errors recover) options to the
> parser to handle my HTML. (Aside: this was the answer to my earlier
> posting on how to get past the non-well-formed issue.) I guess I
> started with XML::LibXML because I didn't think that XPath would be
> applicable to non-XML (i.e. HTML). It appears that findnodes($xpath)
> works for a $tree (or $doc or $dom) parsed by either package.
>
> Could the only difference be that I've got to be explicit with the
> XML::LibXML parser about recovering from non-well-formed input, while
> the HTML one already (tacitly) expects non-well-formed input?
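That seems to be exactly it. A small sketch that shows the contrast directly (the broken snippet is made up): the strict XML entry point refuses non-well-formed input outright, while the HTML entry point recovers when asked to.

```perl
use strict;
use warnings;
use XML::LibXML;

my $broken = '<html><body><p>unclosed';
my $parser = XML::LibXML->new;

# The strict XML entry point dies on non-well-formed input...
my $xml_ok = eval { $parser->parse_string($broken); 1 };
print $xml_ok ? "xml: parsed\n" : "xml: died\n";    # prints "xml: died"

# ...while the HTML entry point recovers once told to.
my $dom = $parser->parse_html_string($broken,
    { recover => 1, suppress_errors => 1 });
print 'html: ', $dom->findvalue('//p'), "\n";       # prints "html: unclosed"
```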
>
> Since my code's got to run on Mac, Windows, and CentOS, it would be
> great to hear if anyone's got a strong preference for, or history
> with, one versus the other.
>
> Thanks,
> Michael
>