SPUG: XPath on (less-than-perfect) HTML
C.J. Adams-Collier
cjac at colliertech.org
Tue Dec 8 18:30:53 PST 2009
yay me.
On Tue, 2009-12-08 at 17:43 -0800, Michael R. Wolf wrote:
> On Dec 8, 2009, at 10:15 AM, Colin Meyer wrote:
>
> > Just came across this blog post on xpath webscraping (via perlbuzz):
> >
> > http://ssscripting.blogspot.com/2009/12/using-perl-to-scrape-web.html
> >
> > It agrees with C.J.'s suggestion of using HTML::TreeBuilder::XPath
>
> Colin,
>
> Thanks.
>
>
> C.J.,
>
> Thanks again
>
>
> All,
>
> In essence, this article gets a nodeset via
>
> use WWW::Mechanize;
> use HTML::TreeBuilder::XPath;
>
> $agent = WWW::Mechanize->new();
> $agent->get($url);
> $content = $agent->content();
>
> @nodes = HTML::TreeBuilder::XPath->new()->parse($content)->findnodes($xpath);
>
> $text = join '', map { $_->content()->[0] } @nodes;
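For anyone following along, here's a self-contained version of the snippet above. It parses an inline string rather than fetching a live page, so the HTML, the class name, and the XPath are all made up for illustration; `as_text` stands in for walking `content()` by hand.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Inline HTML instead of a WWW::Mechanize fetch, so this runs offline.
my $content = '<html><body><p class="x">Hello</p><p class="x">World</p></body></html>';
my $xpath   = '//p[@class="x"]';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

my @nodes = $tree->findnodes($xpath);
my $text  = join '', map { $_->as_text } @nodes;
print "$text\n";    # prints "HelloWorld"

$tree->delete;      # HTML::TreeBuilder trees hold circular refs; free them
```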
>
> I'm getting similar results via
>
> use XML::LibXML;
>
> $content = DITTO;
>
> my %parse_options = (suppress_errors =>1, recover => 1);
> @nodes = XML::LibXML->new(\%parse_options)->parse_html_string($content)->findnodes($xpath);
>
> $text = join '', map { $_->textContent() } @nodes;
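And a self-contained sketch of the XML::LibXML route, again on an inline string. The input is deliberately non-well-formed (unclosed tags) to show the `recover`/`suppress_errors` options doing their job; here the options are passed to `parse_html_string` as a hashref, which recent XML::LibXML releases accept.

```perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately non-well-formed HTML: unclosed <p>, <body>, and <html> tags.
my $content = '<html><body><p class="x">Hello<p class="x">World';
my $xpath   = '//p[@class="x"]';

my %parse_options = (suppress_errors => 1, recover => 1);
my $dom   = XML::LibXML->new->parse_html_string($content, \%parse_options);
my @nodes = $dom->findnodes($xpath);
my $text  = join '', map { $_->textContent } @nodes;
print "$text\n";    # prints "HelloWorld"
```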
>
>
> So, I asked myself, "Self, what's the difference between starting with
> XML::LibXML and starting with HTML::TreeBuilder if I get to pass an
> XPath off to a findnodes() method in either case?". In chasing the
> provenance, I found that they're both maintained by Michel Rodriguez,
> and have almost identical MANIFEST files. (They're identical in the
> names (but not the contents) of the lib/(XML|Tree)/*.pm files, and
> differ in the names and number of the t/*.t files.)
>
> A high-level code review suggested that the lib/*.pm files are mostly
> copy/paste-identical.
>
> Here's the best (high-level) contrast I could find in the documentation.
>
> From the XML::XPathEngine POD:
> SEE ALSO
> Tree::XPathEngine for a similar module for non-XML trees.
>
>
> Although XML brings to mind 'well-formed' whereas HTML (not XHTML)
> does not, I guess I'm fortunate to be able to use XPath in the XML-ish
> packages by using the qw(suppress_errors recover) options to the
> parser to handle my HTML. (Aside: this was the answer to my earlier
> posting on how to get past the non-well-formed issue.) I guess I
> started with XML::LibXML because I didn't think that XPath would be
> applicable to non-XML (i.e. HTML). It appears that findnodes($xpath)
> works for a $tree (or $doc or $dom) parsed by either package.
>
> Could the only difference be that I've got to be explicit with the
> XML::LibXML parser about recovering from non-well-formed input, while
> the HTML one already (tacitly) expects non-well-formed input?
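That seems to be exactly it. A small sketch that shows the contrast directly (the broken snippet is made up): the strict XML entry point refuses non-well-formed input outright, while the HTML entry point recovers when asked to.

```perl
use strict;
use warnings;
use XML::LibXML;

my $broken = '<html><body><p>unclosed';
my $parser = XML::LibXML->new;

# The strict XML entry point dies on non-well-formed input...
my $xml_ok = eval { $parser->parse_string($broken); 1 };
print $xml_ok ? "xml: parsed\n" : "xml: died\n";    # prints "xml: died"

# ...while the HTML entry point recovers once told to.
my $dom = $parser->parse_html_string($broken,
    { recover => 1, suppress_errors => 1 });
print 'html: ', $dom->findvalue('//p'), "\n";       # prints "html: unclosed"
```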
>
> Since my code's got to run on Mac, Windows, and CentOS, it would be
> great to hear if anyone's got a strong preference for, or history
> with, one versus the other.
>
> Thanks,
> Michael
>