[Melbourne-pm] XML::LibXML HTML parsing regression
toby.corkindale at strategicdata.com.au
Tue May 12 19:14:34 PDT 2009
Just something to watch out for if you use XML::LibXML for parsing XHTML.
There's a regression that occurred in all versions after 1.66, whereby
the (X)HTML parsing mode is broken.
(I reported this some time back when I noticed it, but it doesn't seem
to have had any response for quite a while :( )
I just mention it here now because Ubuntu 9.04 ships with a recent
version of XML::LibXML, and thus if you're upgrading from 8.10 to 9.04
you'll find your web scraper apps all mysteriously break.
(e.g. those of you with RSS feeds from http://rea.dryft.net will have
noticed nothing turned up for a couple of days since the weekend. Guess
what I upgraded then.)
(Debian 5.0 is still on the rather old 1.66 version, so it'll still work
with XHTML data, but on the other hand you'll have to put up with
various other bugs.)
If you're hit by this problem, I recommend swapping out XML::LibXML
entirely for HTML::TreeBuilder::XPath - it's the closest thing to a
drop-in replacement I could find.
The main caveat is that it doesn't seem to handle character encodings,
so you'll need to manually faff about with that. (Which is why some, but
not all, items were showing up on the RSS feeds from last night.)
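For anyone making the same swap, here is a rough sketch of what the replacement plus the manual encoding step might look like. It is not the code behind the feeds mentioned above: the sample markup, the XPath expressions and the hard-coded charset are all made up for illustration, and in a real scraper the charset would have to come from the HTTP Content-Type header or a meta tag.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TreeBuilder::XPath;

# Hypothetical input: raw bytes as a scraper might receive them.
# "\xc3\xa9" is the UTF-8 byte sequence for an e-acute.
my $bytes = "<html><body><h1>Caf\xc3\xa9 news</h1>"
          . q{<p class="item">First</p><p class="item">Second</p>}
          . "</body></html>";

# Assumption: the charset is known up front. Nothing here detects it.
my $charset = 'UTF-8';

# Decode the bytes into Perl characters ourselves before parsing,
# rather than relying on the parser to work out the encoding.
my $html = decode($charset, $bytes);

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# XPath queries, much as you would have written them for XML::LibXML.
my $title = $tree->findvalue('//h1');
my @items = map { $_->as_text } $tree->findnodes('//p[@class="item"]');

# HTML::TreeBuilder trees should be deleted explicitly when done.
$tree->delete;
```

The query side carries over fairly directly (findnodes and findvalue both exist here), but the nodes that come back are HTML::Element objects, so any code calling XML::LibXML node methods on the results will still need adjusting.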