SPUG: XPath on (less-than-perfect) HTML
Michael R. Wolf
MichaelRWolf at att.net
Tue Dec 8 17:43:36 PST 2009
On Dec 8, 2009, at 10:15 AM, Colin Meyer wrote:
> Just came across this blog post on xpath webscraping (via perlbuzz):
>
> http://ssscripting.blogspot.com/2009/12/using-perl-to-scrape-web.html
>
> It agrees with C.J.'s suggestion of using HTML::TreeBuilder::XPath
Colin,
Thanks.
C.J.,
Thanks again
All,
In essence, this article gets a nodeset via
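    use WWW::Mechanize;
    use HTML::TreeBuilder::XPath;
    # Fetch the page; $url and $xpath are supplied by the caller
    my $agent = WWW::Mechanize->new();
    $agent->get($url);
    my $content = $agent->content();
    # Build the tree in one step, then query it
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my @nodes = $tree->findnodes($xpath);
    # as_text() gathers all descendant text; content()->[0] is only the first child
    my $text  = join '', map { $_->as_text() } @nodes;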
I'm getting similar results via
    use XML::LibXML;
    # $content is fetched with WWW::Mechanize exactly as above
    my %parse_options = (suppress_errors => 1, recover => 1);
    my $doc   = XML::LibXML->new(\%parse_options)->parse_html_string($content);
    my @nodes = $doc->findnodes($xpath);
    my $text  = join '', map { $_->textContent() } @nodes;
So I asked myself, "Self, what's the difference between starting with
XML::LibXML and starting with HTML::TreeBuilder if I get to pass an
XPath off to a findnodes() method in either case?" In chasing the
provenance, I found that XML::XPathEngine and Tree::XPathEngine are
both maintained by Michel Rodriguez and have almost identical MANIFEST
files. (The lib/(XML|Tree)/*.pm files have the same names, though not
the same contents, while the t/*.t files differ in both name and
number.)
From a high-level code review, the lib/*.pm files look mostly
copy/paste-identical.
Here's the best (high-level) contrast I could find in the documentation.
From the XML::XPathEngine POD:
    SEE ALSO
        Tree::XPathEngine for a similar module for non-XML trees.
Although XML brings to mind 'well-formed' whereas HTML (not XHTML)
does not, I guess I'm fortunate to be able to use XPath in the XML-ish
packages by passing the suppress_errors and recover options to the
parser to handle my HTML. (Aside: this was the answer to my earlier
posting on how to get past the non-well-formed issue.) I guess I
started with XML::LibXML because I didn't think that XPath would be
applicable to non-XML (i.e. HTML). It appears that findnodes($xpath)
works for a $tree (or $doc or $dom) parsed from either package.
Could the only difference be that I have to tell the XML::LibXML
parser explicitly to recover from non-well-formed input, while the
HTML one already (tacitly) expects it?
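One way to poke at that question is to feed the same piece of tag soup
to both parsers and compare what findnodes() hands back. The following
is only a minimal sketch: the sample HTML, the XPath expression, and
the two helper sub names are made up for illustration, and it assumes
both modules are installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use XML::LibXML;

# Hypothetical sample input: unclosed <li> and <p> tags, no <html> wrapper.
my $html  = '<ul><li class="item">alpha<li class="item">beta</ul><p>tail';
my $xpath = '//li[@class="item"]';

# HTML::TreeBuilder::XPath is given no special options for the tag soup.
sub text_via_treebuilder {
    my ($content, $xp) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    my @text = map { $_->as_text() } $tree->findnodes($xp);
    $tree->delete();    # release the tree's circular references
    return @text;
}

# XML::LibXML is told to recover, and to keep quiet about the errors.
sub text_via_libxml {
    my ($content, $xp) = @_;
    my $parser = XML::LibXML->new({ recover => 1, suppress_errors => 1 });
    my $doc    = $parser->parse_html_string($content);
    return map { $_->textContent() } $doc->findnodes($xp);
}

my @a = text_via_treebuilder($html, $xpath);
my @b = text_via_libxml($html, $xpath);
print "TreeBuilder: @a\n";
print "LibXML:      @b\n";
```

If the two lists agree on the pages you actually scrape, the choice
probably comes down to installation: XML::LibXML needs the libxml2 C
library on each platform, while HTML::TreeBuilder::XPath is pure Perl.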
Since my code's got to run on Mac, Windows and CentOS it would be
great to hear if anyone's got a strong preference for, or history
with, one versus the other.
Thanks,
Michael
--
Michael R. Wolf
All mammals learn by playing!
MichaelRWolf at att.net