SPUG: HTML Entity UNICODE Blues

Thu Apr 8 16:29:49 CDT 2004

Executing the following on Windows using ActiveState Perl 5.8.3 (build 809):

	use HTML::Entities;
	$txt  = "&mdash;";      print "1) $txt\n";
	decode_entities($txt);  print "2) $txt\n";
	encode_entities($txt);  print "3) $txt\n";

The final text isn't equal to the starting text.  In fact, it looks something 
like this:

	1) &mdash;
	2) â
	3) &acirc;&#128;&#148;

And whaddyaknow, it looks like that on my Linux box too, running 5.8.0.

It appears to be some sort of Unicode translation issue, and I'm allergic to
Unicode (heh, heh), so I have no clue what to do to fix it.
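
For what it's worth, the three entities in output 3) match the UTF-8 encoding
of the em dash (U+2014), read one byte at a time.  A minimal sketch using only
the core Encode module (the variable names are illustrative, not from the
original script) shows where 226 (a-circumflex), 128 and 148 come from:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+2014 is the em dash that &mdash; stands for.
my $mdash = "\x{2014}";

# Encode the single character into its UTF-8 byte sequence (E2 80 94).
my $bytes = encode('UTF-8', $mdash);

# Read each byte as if it were a separate Latin-1 character.
my @codes = map { ord } split //, $bytes;

print join(' ', @codes), "\n";    # 226 128 148
```

That suggests the decoded text is being treated as three separate bytes rather
than as one character by the time encode_entities sees it.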

This came up using HTML::TreeBuilder to parse some HTML containing an &mdash;
sequence.  There seems to be no simple way to tell TreeBuilder that I don't
_want_ it to decode my entities, which I _don't_, since I'm only moving stuff
around structurally.  I just want the text to pass through unmunged.

So my current solution is:

	$txt =~ s|&|{[AMP]}|gs;
	# do the TreeBuilder stuff and store result in $fixed
	$fixed =~ s|\{\[AMP\]\}|&|gs;

This bypasses the whole decoding/encoding issue, but it's ugly.  I wouldn't
mind wise words suggesting a better solution.
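
If the placeholder trick stays, it can at least be wrapped up so it fails
loudly when the input already contains the placeholder.  This is only a
sketch: mask_entities, unmask_entities and $PLACEHOLDER are hypothetical names,
and the TreeBuilder step in the middle is left out.

```perl
use strict;
use warnings;

# Hypothetical placeholder; any string that cannot occur in the input works.
my $PLACEHOLDER = '{[AMP]}';

# Replace every ampersand so the parser sees no entities to decode.
sub mask_entities {
    my ($html) = @_;
    die "input already contains $PLACEHOLDER\n"
        if index($html, $PLACEHOLDER) >= 0;
    $html =~ s/&/$PLACEHOLDER/g;
    return $html;
}

# Put the ampersands back once the structural work is done.
sub unmask_entities {
    my ($html) = @_;
    $html =~ s/\Q$PLACEHOLDER\E/&/g;
    return $html;
}

my $txt    = 'Before &mdash; after &amp; more';
my $masked = mask_entities($txt);
# ... the TreeBuilder step would operate on $masked here ...
my $fixed  = unmask_entities($masked);

print "$fixed\n";
```

The text never passes through decode_entities or encode_entities in a form they
would recognize, so it comes back out exactly as it went in.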

During my attempted debugging I noticed that the versions of the libraries
delivered with ActiveState are not the most recent on CPAN, in some cases by
several versions.  Sigh.

mma