[Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS
tiger peng
tigerpeng2001 at yahoo.com
Wed Feb 28 06:22:15 PST 2007
I checked it again between Linux and SunOS. The malformed characters are not related to the BOM.
If I set the locale to en_US.UTF-8 on SunOS and to en_US.utf8 (or en_US.UTF-8, which does not show up in locale -a) on Linux, the non-ASCII characters are malformed: each one is split into two characters. If I set both to en_US.ISO8859-1, they are not malformed and display correctly in my xterm (PuTTY).
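The split into two characters is consistent with how UTF-8 encodes code points 128..255: each one takes two bytes, while ISO-8859-1 uses a single byte. A minimal sketch, assuming the core Encode module that ships with perl 5.8; byte_values is a hypothetical helper name used only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the byte values a single code point occupies under a given encoding.
# byte_values is a hypothetical helper, not part of the original test.
sub byte_values {
    my ($codepoint, $encoding) = @_;
    my $bytes = encode($encoding, chr($codepoint));
    return map { ord } split //, $bytes;
}

# Compare the one-byte Latin-1 form with the two-byte UTF-8 form.
for my $cp (253 .. 255) {
    my @latin1 = byte_values($cp, 'ISO-8859-1');
    my @utf8   = byte_values($cp, 'UTF-8');
    print "$cp\tlatin1: @latin1\tutf8: @utf8\n";
}
```

For 253..255 the UTF-8 column shows 195 189, 195 190 and 195 191, the same byte pairs that appear in the SunOS check of latin1.txt.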
On Linux:
Here is the OS info and locale setting:
-> uname -a
Linux etdwag2 2.4.21-27.ELsmp #1 SMP Wed Dec 1 21:59:02 EST 2004 i686
-> locale
LANG=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8
Generate file with perl:
perl -e 'for $i(253..255){print " " . chr($i)."\t".$i."\n"}
for $i(253..255){print chr($i). chr($i)."\t".$i."\n"}
for $i(253..255){print chr($i-200).chr($i)."\t".$i."\n"}
' > latin1.txt
Check the files with perl:
perl -ne 'print ord(substr($_, 0, 1)).
"\t".ord(substr($_, 1, 1)).
"\t".ord(substr($_, 2, 1)).
"\t".$_' latin1.txt
32 253 9 ý 253
32 254 9 þ 254
32 255 9 ÿ 255
253 253 9 ýý 253
254 254 9 þþ 254
255 255 9 ÿÿ 255
53 253 9 5ý 253
54 254 9 6þ 254
55 255 9 7ÿ 255
On SunOS:
-> uname -a
SunOS etdwdev2 5.10 Generic_118833-24 sun4u sparc SUNW,Sun-Fire-880
-> locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
-> export LC_ALL=en_US.UTF-8
-> locale
LANG=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=en_US.UTF-8
# Now scp the file from Linux
# and check the file
perl -ne 'print ord(substr($_, 0, 1)).
"\t".ord(substr($_, 1, 1)).
"\t".ord(substr($_, 2, 1)).
"\t".$_' latin1.txt
32 195 189 ý 253
32 195 190 þ 254
32 195 191 ÿ 255
195 189 195 ýý 253
195 190 195 þþ 254
195 191 195 ÿÿ 255
53 195 189 5ý 253
54 195 190 6þ 254
55 195 191 7ÿ 255
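To compare the two machines without the locale or any implicit I/O layer getting in the way, the same check can be done on raw bytes. This is only a sketch: line_bytes is a hypothetical helper name, and it assumes the file is passed as the first command-line argument (for example latin1.txt).

```perl
use strict;
use warnings;

# Read a file as raw bytes and return one array ref of byte values per line.
# line_bytes is a hypothetical helper, not part of the original test.
sub line_bytes {
    my ($path) = @_;
    # ':raw' asks for plain bytes, with no encoding layer applied on read
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, [ map { ord } split //, $line ];
    }
    close $fh;
    return @rows;
}

# Print the byte values of every line of the file named on the command line.
my $file = shift @ARGV;
if (defined $file) {
    print join(' ', @$_), "\n" for line_bytes($file);
}
```

If both machines print the same numbers for the same file, the file contents are identical and the difference lies in how each perl or terminal interprets those bytes.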
----- Original Message ----
From: Jonathan Rockway <jon at jrock.us>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Tuesday, February 27, 2007 11:15:38 AM
Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS
tiger peng wrote:
> I am having trouble transferring UTF-8 files to SunOS with ftp or scp. After googling for a while, I found that when I searched for 'utf-8 byte order mark malformed SunOS', most of the top results relate to perl 5.8. So I hope I can get some quick help from my dear perlists.
UTF-8 is just binary. If your link can't handle 8-bit characters, then
uuencode the UTF-8 and uudecode it on the other end. My guess is that's
not the problem -- your terminal (or shell) is probably confused by
UTF-8 and you need to play around with locales.
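The uuencode idea can be sketched in Perl itself with the "u" template of pack and unpack. This is an illustration only: to_uu and from_uu are hypothetical names, and pack('u') produces just the encoded body, without the "begin"/"end" lines that the uuencode and uudecode command-line tools use.

```perl
use strict;
use warnings;

# uuencode a byte string into 7-bit-safe text using pack's "u" template.
# to_uu and from_uu are hypothetical helper names.
sub to_uu   { return pack('u', $_[0]) }

# Decode uuencoded text back into the original bytes.
sub from_uu { return unpack('u', $_[0]) }

# The UTF-8 bytes for code points 253..255
my $utf8_bytes = "\xC3\xBD\xC3\xBE\xC3\xBF";
my $encoded    = to_uu($utf8_bytes);
my $decoded    = from_uu($encoded);

print $encoded;
print $decoded eq $utf8_bytes ? "round trip ok\n" : "round trip FAILED\n";
```

The encoded text contains only printable ASCII, so it survives a link that cannot carry 8-bit data.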
A bit more detail might help us find the issue, although I kind of doubt
it has *anything* to do with Perl :)
Regards,
Jonathan Rockway
--
package JAPH;use Catalyst qw/-Debug/;($;=JAPH)->config(name => do {
$,.=reverse qw[Jonathan tsu rehton lre rekca Rockway][$_].[split //,
";$;"]->[$_].q; ;for 1..4;$,=~s;^.;;;$,});$;->setup;
_______________________________________________
Chicago-talk mailing list
Chicago-talk at pm.org
http://mail.pm.org/mailman/listinfo/chicago-talk