[Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS

Thu Mar 1 07:50:50 PST 2007

It looks like a perl build issue: the perl on the SunOS box cannot handle UTF-8 correctly.

I set the locale to en_US.utf8 on Linux and to en_US.UTF-8 on SunOS, then generated the file again on the Linux box and copied it to SunOS with scp. The file looks fine on both boxes in vi(m).

Then I used perl to check the file on SunOS. The output indicates that each Unicode character is split into two.

-> perl -ne 'use utf8; print ord(substr($_, 0, 1)).
           "\t".ord(substr($_, 1, 1)).
           "\t".ord(substr($_, 2, 1)).
           "\t".$_' l>
32      195     189      ý     253
32      195     190      þ     254
32      195     191      ÿ     255
195     189     195     ýý    253
195     190     195     þþ    254
195     191     195     ÿÿ    255
53      195     189     5ý     253
54      195     190     6þ     254
55      195     191     7ÿ     255
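
The 195/189 pairs above are what one would expect if the file holds UTF-8 and is read one byte at a time: U+00FD, U+00FE and U+00FF each encode to two bytes. A minimal sketch, written in Python rather than Perl and not part of the original session (the helper name utf8_byte_values is made up for illustration):

```python
def utf8_byte_values(code_point):
    """Return the UTF-8 encoding of one code point as a list of byte values."""
    return list(chr(code_point).encode("utf-8"))


for cp in (253, 254, 255):
    # Each of these Latin-1 range characters needs two bytes in UTF-8.
    print(cp, utf8_byte_values(cp))
# 253 [195, 189]
# 254 [195, 190]
# 255 [195, 191]
```

So a line starting with a space and ý reads as 32, 195, 189 when ord() is applied to bytes instead of decoded characters.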

Then I used the same command to create a file on SunOS. When I open it in vi, the characters are shown as octal escapes with the correct values.
Checking the file with perl, the characters are not split, but they do not display correctly.
32      253     9        ▒      253
32      254     9        ▒      254
32      255     9        ▒      255
253     253     9       ▒▒      253
254     254     9       ▒▒      254
255     255     9       ▒▒      255
53      253     9       5▒      253
54      254     9       6▒      254
55      255     9       7▒      255

I copied this file to Linux with scp and ran the perl command to check it, and got the following messages:
Malformed UTF-8 character (unexpected non-continuation byte 0x09, immediately after start byte 0xfd) in ord at -e line 1, <> line 1.
32      0       0        ▒      253
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 2.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xfe) in ord at -e line 1, <> line 2.
32      0       0        ▒      254
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 3.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <> line 3.
32      0       0        ▒      255
Malformed UTF-8 character (unexpected non-continuation byte 0xfd, immediately after start byte 0xfd) in ord at -e line 1, <> line 4.
0       10      0       ▒▒      253
Malformed UTF-8 character (unexpected non-continuation byte 0xfe, immediately after start byte 0xfe) in ord at -e line 1, <> line 5.
0       0       0       ▒▒      254
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 6.
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 6.
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 6.
0       0       0       ▒▒      255
Malformed UTF-8 character (unexpected non-continuation byte 0x09, immediately after start byte 0xfd) in ord at -e line 1, <> line 7.
53      0       0       5▒      253
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 8.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xfe) in ord at -e line 1, <> line 8.
54      0       0       6▒      254
Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line 9.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <> line 9.
55      0       0       7▒      255
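
The messages above are consistent with a reader that expects UTF-8 being handed the single bytes 0xFD to 0xFF, which never form a valid sequence on their own. A sketch in Python, assuming the first line of the SunOS-generated file is the bytes 20 FD 09 32 35 33 0A (inferred from the ord output earlier in this message, not dumped from the real file; decodes_as_utf8 is an illustrative helper):

```python
def decodes_as_utf8(raw):
    """Return True if the byte string is valid UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_line = b" \xfd\t253\n"     # assumed content of line 1 written on SunOS
utf8_line = b" \xc3\xbd\t253\n"   # assumed content of line 1 written on Linux

print(decodes_as_utf8(latin1_line))  # False
print(decodes_as_utf8(utf8_line))    # True

# Both byte strings carry the same text once the right encoding is chosen.
print(latin1_line.decode("latin-1") == utf8_line.decode("utf-8"))  # True
```

Python rejects the line outright, whereas perl warns and carries on, but the underlying cause would be the same mismatch between the file's encoding and the reader's expectation.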

Check the file sizes:
-rw-r--r--    1 gpeng    dba            75 Mar  1 09:38 fromLinux.txt
-rw-r--r--    1 gpeng    dba            63 Mar  1 09:37 fromSunOS.txt
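
The 12-byte difference matches the 12 non-ASCII characters the generator writes (3 + 6 + 3), each taking one byte in ISO 8859-1 and two in UTF-8. A sketch in Python that rebuilds the same nine lines; it mirrors the perl -e generator quoted further down in this thread and does not read the actual files (build_text is an illustrative name):

```python
def build_text():
    """Rebuild the nine test lines as a string of characters."""
    lines = []
    for i in range(253, 256):
        lines.append(" " + chr(i) + "\t" + str(i) + "\n")
    for i in range(253, 256):
        lines.append(chr(i) + chr(i) + "\t" + str(i) + "\n")
    for i in range(253, 256):
        # chr(i - 200) is the ASCII digit '5', '6' or '7'.
        lines.append(chr(i - 200) + chr(i) + "\t" + str(i) + "\n")
    return "".join(lines)


text = build_text()
print(len(text.encode("latin-1")))  # 63
print(len(text.encode("utf-8")))    # 75
```

That would make fromSunOS.txt a single-byte (ISO 8859-1 style) file and fromLinux.txt a UTF-8 file with the same text.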

Here is what perl -v says on each box:
-> perl -v

This is perl, v5.8.0 built for i386-linux-thread-multi
(with 1 registered patch, see perl -V for more detail)

-> perl -v

This is perl, v5.8.4 built for sun4-solaris-64int
(with 28 registered patches, see perl -V for more detail)
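
The version difference may matter here. Perl 5.8.0 implicitly applied a UTF-8 layer to filehandles when the locale indicated UTF-8, and 5.8.1 removed that behavior in favor of the explicit -C switch and PERL_UNICODE. Whether this accounts for everything in this thread is not confirmed, but it would produce the pattern seen. A sketch in Python that models the two behaviors (the function names and the utf8_layer flag are hypothetical, for illustration only):

```python
def write_line(code_point, utf8_layer):
    """Model print() of ' <char>\\t<number>\\n' with or without a UTF-8 layer."""
    text = " " + chr(code_point) + "\t" + str(code_point) + "\n"
    return text.encode("utf-8" if utf8_layer else "latin-1")


def first_three_ords(raw, utf8_layer):
    """Model ord(substr($_, n, 1)) for n = 0..2 on a line read from a file."""
    if utf8_layer:
        units = [ord(ch) for ch in raw.decode("utf-8")]  # characters
    else:
        units = list(raw)                                # raw bytes
    return units[:3]


on_linux = write_line(253, utf8_layer=True)    # modeled 5.8.0, UTF-8 locale
on_sunos = write_line(253, utf8_layer=False)   # modeled 5.8.4, no implicit layer

print(first_three_ords(on_linux, utf8_layer=True))    # [32, 253, 9]
print(first_three_ords(on_linux, utf8_layer=False))   # [32, 195, 189]
print(first_three_ords(on_sunos, utf8_layer=False))   # [32, 253, 9]
```

Under this model neither perl is broken; the two simply disagree about whether an encoding layer is applied by default.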

----- Original Message ----
From: tiger peng <tigerpeng2001 at yahoo.com>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Wednesday, February 28, 2007 8:22:15 AM
Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS

I checked it again between Linux and SunOS. The malformation is not related to the BOM.

If I set the locale to en_US.UTF-8 on SunOS and to en_US.utf8 (or en_US.UTF-8, which does not show up in locale -a) on Linux, the non-ASCII characters are malformed: each one is split into two characters. If I set both to en_US.ISO8859-1, they are not malformed and display correctly in my xterm (PuTTY).

On Linux:
Here are the OS info and locale settings:
-> uname -a
Linux etdwag2 2.4.21-27.ELsmp #1 SMP Wed Dec 1 21:59:02 EST 2004 i686
-> locale
LANG=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8

Generate file with perl:

perl -e 'for $i(253..255){print " " .       chr($i)."\t".$i."\n"}
         for $i(253..255){print chr($i).    chr($i)."\t".$i."\n"}
         for $i(253..255){print chr($i-200).chr($i)."\t".$i."\n"}
' > latin1.txt
Check the files with perl:

perl -ne 'print ord(substr($_, 0, 1)).
           "\t".ord(substr($_, 1, 1)).
           "\t".ord(substr($_, 2, 1)).
           "\t".$_' latin1.txt
32      253     9        Ã½     253
32      254     9        Ã¾     254
32      255     9        Ã¿     255
253     253     9       Ã½Ã½    253
254     254     9       Ã¾Ã¾    254
255     255     9       Ã¿Ã¿    255
53      253     9       5Ã½     253
54      254     9       6Ã¾     254
55      255     9       7Ã¿     255
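
The Ã½ in the fourth column is what the two UTF-8 bytes of ý look like when they are displayed as ISO 8859-1. A sketch in Python, assuming that the terminal (or a later mail gateway) interpreted the bytes that way, which the thread does not confirm; the helper name is made up for illustration:

```python
def shown_as_latin1(text):
    """Encode text as UTF-8, then misread the bytes as ISO 8859-1."""
    return text.encode("utf-8").decode("latin-1")


for ch in ("\u00fd", "\u00fe", "\u00ff"):   # ý, þ, ÿ
    print(ch, "->", shown_as_latin1(ch))
# ý -> Ã½
# þ -> Ã¾
# ÿ -> Ã¿
```

The ord columns still read 32, 253, 9 because those are computed before the characters are encoded for output.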

On SunOS:
-> uname -a
SunOS etdwdev2 5.10 Generic_118833-24 sun4u sparc SUNW,Sun-Fire-880
-> locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
-> export LC_ALL=en_US.UTF-8
-> locale
LANG=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=en_US.UTF-8
# Now scp the file from Linux
# and check the file
perl -ne 'print ord(substr($_, 0, 1)).
           "\t".ord(substr($_, 1, 1)).
           "\t".ord(substr($_, 2, 1)).
           "\t".$_' latin1.txt
32      195     189      Ã½     253
32      195     190      Ã¾     254
32      195     191      Ã¿     255
195     189     195     Ã½Ã½    253
195     190     195     Ã¾Ã¾    254
195     191     195     Ã¿Ã¿    255
53      195     189     5Ã½     253
54      195     190     6Ã¾     254
55      195     191     7Ã¿     255

----- Original Message ----
From: Jonathan Rockway <jon at jrock.us>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Tuesday, February 27, 2007 11:15:38 AM
Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS

tiger peng wrote:
> I am having trouble transferring UTF-8 files to SunOS with ftp or scp. After googling for a while, I found that when I searched for 'utf-8 byte order mark malformed SunOS', most of the top results relate to perl 5.8. So I hope I can get some quick help from my dear perlists.

UTF-8 is just binary.  If your link can't handle 8-bit characters, then
uuencode the UTF-8 and uudecode it on the other end.  My guess is that's
not the problem -- your terminal (or shell) is probably confused by
UTF-8 and you need to play around with locales.

A bit more detail might help us find the issue, although I kind of doubt
it has *anything* to do with Perl :)

Regards,
Jonathan Rockway

-- 
package JAPH;use Catalyst qw/-Debug/;($;=JAPH)->config(name => do {
$,.=reverse qw[Jonathan tsu rehton lre rekca Rockway][$_].[split //,
";$;"]->[$_].q; ;for 1..4;$,=~s;^.;;;$,});$;->setup;
_______________________________________________
Chicago-talk mailing list
Chicago-talk at pm.org
http://mail.pm.org/mailman/listinfo/chicago-talk