[Melbourne-pm] XML::LibXML HTML parsing regression

Toby Corkindale toby.corkindale at strategicdata.com.au
Tue May 12 19:14:34 PDT 2009


Hey guys,
Just something to watch out for if you use XML::LibXML for parsing XHTML.
There's a regression that occurred in all versions after 1.66, whereby 
the (X)HTML parsing mode is broken.
(I reported this some time back when I noticed it, but it doesn't seem 
to have had any response for quite a while :(
http://rt.cpan.org/Public/Bug/Display.html?id=44715
)

I just mention it here now because Ubuntu 9.04 ships with a recent 
version of XML::LibXML, and thus if you're upgrading from 8.10 to 9.04 
you'll find your web scraper apps all mysteriously break.
(e.g. those of you with RSS feeds from http://rea.dryft.net will have 
noticed nothing turned up for a couple of days since the weekend. Guess 
what I upgraded then..)

(Debian 5.0 is still on the rather old 1.66 version, so it'll still work 
with XHTML data, but on the other hand you'll have to put up with 
various other bugs.)



If you're hit by this problem, I recommend swapping out XML::LibXML 
entirely for HTML::TreeBuilder::XPath - it's the closest thing to a 
drop-in replacement I could find.
The main caveat is that it doesn't seem to handle character encodings 
for you, so you'll need to manually faff about with decoding the input 
yourself. (Which is why some, but not all, items were showing up on the 
RSS feeds from last night.)
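
For anyone making the same swap, here's a rough sketch of what I mean by 
handling the encoding by hand. It assumes you already know the charset of 
the page (UTF-8 here) and decodes the raw bytes with Encode before handing 
them to HTML::TreeBuilder::XPath. The extract_titles helper, the sample 
markup and the XPath expression are all made up for illustration; they're 
not from any real scraper.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TreeBuilder::XPath;

# Hypothetical helper: decode the raw bytes ourselves, then parse the
# resulting character string and pull out the text of matching nodes.
sub extract_titles {
    my ($raw_bytes, $charset) = @_;

    # Bytes -> Perl characters. The charset is assumed to be known by the
    # caller (e.g. from the HTTP Content-Type header).
    my $html = decode($charset, $raw_bytes);

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);    # parses the string and finishes the tree

    # Illustrative XPath only; adjust to the page you are scraping.
    my @titles = map { $_->as_text } $tree->findnodes('//h2[@class="title"]');

    $tree->delete;                  # free the tree explicitly
    return @titles;
}

# Sample input: UTF-8 encoded bytes (the "\xc3\xa9" pair is an e-acute).
my $raw = qq{<html><body><h2 class="title">Caf\xc3\xa9 news</h2>}
        . qq{<h2 class="title">Second item</h2></body></html>};

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for extract_titles($raw, 'UTF-8');
```

If you fetch pages with LWP, $response->decoded_content should give you 
an already-decoded string, in which case the explicit decode() step isn't 
needed.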


-Toby

