SPUG: HTML Entity UNICODE Blues
Marc M. Adkins
Perl at Doorways.org
Thu Apr 8 16:29:49 CDT 2004
Executing the following on Windows using ActiveState 5.8.3 (build 809):
use HTML::Entities;
$txt = "&mdash;"; print "1) $txt\n";
decode_entities($txt); print "2) $txt\n";
encode_entities($txt); print "3) $txt\n";
The final text isn't equal to the starting text. In fact, it looks something
like this:
1) &mdash;
2) รข
3) —
And whaddyaknow, it looks like that on my Linux box too, running 5.8.0.
It appears to be some sort of UNICODE translation issue, and I'm allergic to
UNICODE (heh, heh) so I have no clue what to do to fix it.
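For reference, a minimal sketch of the same round trip with an explicit output layer. It assumes the garbage in step 2 is the UTF-8 byte sequence for U+2014 being shown by a console that is not in UTF-8; that is a guess, not something verified on either setup above. The variable names are made up for illustration, and what encode_entities hands back in step 3 may well differ between HTML::Entities releases.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# Assumption: the terminal or file reading STDOUT expects UTF-8.
# Without an encoding layer, Perl prints the raw internal bytes for
# characters above 255 (and warns "Wide character in print" under warnings).
binmode(STDOUT, ':encoding(UTF-8)');

my $txt = '&mdash;';

# In scalar context decode_entities returns a decoded copy.
my $decoded = decode_entities($txt);
printf "decoded to %d character(s), code point U+%04X\n",
    length($decoded), ord($decoded);

# Encode again; the exact entity form (named or numeric) depends on the
# installed HTML::Entities version, so it is only printed, not assumed.
my $encoded = encode_entities($decoded);
print "re-encoded: $encoded\n";
```

If the decoded string turns out to be one character at U+2014, then the decode itself is fine and the mess is only in how the string gets printed.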
This came up using HTML::TreeBuilder to parse some HTML containing an &mdash;
sequence. There seems to be no simple way to tell TreeBuilder I don't _want_
it to decode my entities. Which I _don't_ since I'm just moving stuff around
structurally. I just want the text to pass through unmunged.
So my current solution is:
$txt =~ s|&|{[AMP]}|gs;
# do the TreeBuilder stuff and store result in $fixed
$fixed =~ s|\{\[AMP\]\}|&|gs;
Which bypasses the whole de/encoding issue. Ugly. Wouldn't mind wise words
suggesting a better solution.
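The workaround above, wrapped up as a self-contained sketch. The helper names hide_entities and restore_entities are hypothetical, and the TreeBuilder step is left as a comment. It also assumes the input never contains the literal text {[AMP]} on its own, since that would be turned into an ampersand on the way back.

```perl
use strict;
use warnings;

# Placeholder taken from the substitution above.
my $PLACEHOLDER = '{[AMP]}';

# Hypothetical helper: hide every ampersand so a parser sees no entities.
sub hide_entities {
    my ($html) = @_;
    $html =~ s/&/$PLACEHOLDER/g;
    return $html;
}

# Hypothetical helper: put the ampersands back after processing.
sub restore_entities {
    my ($html) = @_;
    $html =~ s/\Q$PLACEHOLDER\E/&/g;   # \Q...\E escapes the braces and brackets
    return $html;
}

my $txt    = 'before &mdash; after';
my $hidden = hide_entities($txt);

# ... do the TreeBuilder stuff on $hidden here ...

my $fixed = restore_entities($hidden);
print "$hidden\n";
print "$fixed\n";
```

Using \Q...\E means the placeholder can be changed in one place without re-escaping it by hand in the second substitution.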
During my attempted debugging I noticed that versions of libraries delivered
with ActiveState are not the most recent on CPAN. By several versions in
some cases. Sigh.
mma