[Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS
tiger peng
tigerpeng2001 at yahoo.com
Thu Mar 1 07:50:50 PST 2007
It looks like a Perl build issue: the perl on the SunOS box does not seem to handle UTF-8 correctly.
I set the locale to en_US.utf8 on Linux and to en_US.UTF-8 on SunOS, then generated the file again on the Linux box and copied it to SunOS with scp. The file looks fine on both boxes in vi(m).
Then I used perl to check the file on SunOS. The output indicates that each non-ASCII character is split into two bytes.
-> perl -ne 'use utf8; print ord(substr($_, 0, 1)) . "\t" . ord(substr($_, 1, 1)) . "\t" . ord(substr($_, 2, 1)) . "\t" . $_' l>
32 195 189 ý 253
32 195 190 þ 254
32 195 191 ÿ 255
195 189 195 ýý 253
195 190 195 þþ 254
195 191 195 ÿÿ 255
53 195 189 5ý 253
54 195 190 6þ 254
55 195 191 7ÿ 255
Then I used the same command to create a file on SunOS. When I open it in vi, the characters are shown as octal escapes with the correct values.
Checking the file with perl, the characters are not split, but they do not display correctly.
32 253 9 ▒ 253
32 254 9 ▒ 254
32 255 9 ▒ 255
253 253 9 ▒▒ 253
254 254 9 ▒▒ 254
255 255 9 ▒▒ 255
53 253 9 5▒ 253
54 254 9 6▒ 254
55 255 9 7▒ 255
I copied this file to Linux with scp and ran the same perl command to check it, and got the following messages:
Malformed UTF-8 character (unexpected non-continuation byte 0x09, immediately after start byte 0xfd) in ord at -e line 1, <> line 1.
32 0 0 ▒ 253
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 2.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xfe) in ord at -e line 1, <> line 2.
32 0 0 ▒ 254
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 3.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <> line 3.
32 0 0 ▒ 255
Malformed UTF-8 character (unexpected non-continuation byte 0xfd, immediately after start byte 0xfd) in ord at -e line 1, <> line 4.
0 10 0 ▒▒ 253
Malformed UTF-8 character (unexpected non-continuation byte 0xfe, immediately after start byte 0xfe) in ord at -e line 1, <> line 5.
0 0 0 ▒▒ 254
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 6.
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 6.
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 6.
0 0 0 ▒▒ 255
Malformed UTF-8 character (unexpected non-continuation byte 0x09, immediately after start byte 0xfd) in ord at -e line 1, <> line 7.
53 0 0 5▒ 253
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 8.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xfe) in ord at -e line 1, <> line 8.
54 0 0 6▒ 254
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 9.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <> line 9.
55 0 0 7▒ 255
Checking the file sizes:
-rw-r--r-- 1 gpeng dba 75 Mar 1 09:38 fromLinux.txt
-rw-r--r-- 1 gpeng dba 63 Mar 1 09:37 fromSunOS.txt
Here is what perl -v says on the two boxes:
-> perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi
(with 1 registered patch, see perl -V for more detail)
-> perl -v
This is perl, v5.8.4 built for sun4-solaris-64int
(with 28 registered patches, see perl -V for more detail)
----- Original Message ----
From: tiger peng <tigerpeng2001 at yahoo.com>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Wednesday, February 28, 2007 8:22:15 AM
Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS
I checked it again between Linux and SunOS. The malformation is not related to the BOM.
If I set the locale to en_US.UTF-8 on SunOS and to en_US.utf8 (or en_US.UTF-8, which does not show up in locale -a) on Linux, the non-ASCII characters are malformed: each one is split into two characters. If I set both to en_US.ISO8859-1, they are not malformed and display correctly in my xterm (PuTTY).
On Linux:
Here is the OS info and the locale setting:
-> uname -a
Linux etdwag2 2.4.21-27.ELsmp #1 SMP Wed Dec 1 21:59:02 EST 2004 i686
-> locale
LANG=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8
Generate file with perl:
perl -e 'for $i(253..255){print " " . chr($i)."\t".$i."\n"}
for $i(253..255){print chr($i). chr($i)."\t".$i."\n"}
for $i(253..255){print chr($i-200).chr($i)."\t".$i."\n"}
' > latin1.txt
Check the files with perl:
perl -ne 'print ord(substr($_, 0, 1)).
"\t".ord(substr($_, 1, 1)).
"\t".ord(substr($_, 2, 1)).
"\t".$_' latin1.txt
32 253 9 ý 253
32 254 9 þ 254
32 255 9 ÿ 255
253 253 9 ýý 253
254 254 9 þþ 254
255 255 9 ÿÿ 255
53 253 9 5ý 253
54 254 9 6þ 254
55 255 9 7ÿ 255
On SunOS:
-> uname -a
SunOS etdwdev2 5.10 Generic_118833-24 sun4u sparc SUNW,Sun-Fire-880
-> locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
-> export LC_ALL=en_US.UTF-8
-> locale
LANG=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=en_US.UTF-8
# Now scp the file from Linux
# and check the file
perl -ne 'print ord(substr($_, 0, 1)).
"\t".ord(substr($_, 1, 1)).
"\t".ord(substr($_, 2, 1)).
"\t".$_' latin1.txt
32 195 189 ý 253
32 195 190 þ 254
32 195 191 ÿ 255
195 189 195 ýý 253
195 190 195 þþ 254
195 191 195 ÿÿ 255
53 195 189 5ý 253
54 195 190 6þ 254
55 195 191 7ÿ 255
----- Original Message ----
From: Jonathan Rockway <jon at jrock.us>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Tuesday, February 27, 2007 11:15:38 AM
Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS
tiger peng wrote:
> Having trouble to ftp or scp utf8 files to SunOS and after googling for a while, I found that when I searched 'utf-8 byte order mark malformed SunOS', most of the top searching results relate with perl 5.8. So I hope I can get a quick help from my dear perlists.
UTF-8 is just binary. If your link can't handle 8-bit characters, then
uuencode the UTF-8 and uudecode it on the other end. My guess is that's
not the problem -- your terminal (or shell) is probably confused by
UTF-8 and you need to play around with locales.
A bit more detail might help us find the issue, although I kind of doubt
it has *anything* to do with Perl :)
Regards,
Jonathan Rockway
--
package JAPH;use Catalyst qw/-Debug/;($;=JAPH)->config(name => do {
$,.=reverse qw[Jonathan tsu rehton lre rekca Rockway][$_].[split //,
";$;"]->[$_].q; ;for 1..4;$,=~s;^.;;;$,});$;->setup;
_______________________________________________
Chicago-talk mailing list
Chicago-talk at pm.org
http://mail.pm.org/mailman/listinfo/chicago-talk