[Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS

tiger peng tigerpeng2001 at yahoo.com
Thu Mar 1 13:08:43 PST 2007


Strange! Perl reads the BOM characters (inserted by MS Notepad) correctly!

-> perl  -nwe 'print if /([^\p{IsASCII}])/; print if s/([^\p{IsASCII}])/"\&#".ord($1).";"/ge'  unicodeNotepad.txt
êëìñ
êëìñ
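
What the one-liner checks can also be sketched outside Perl. The following is a minimal Python sketch, assuming the file is read as raw bytes; the function names escape_non_ascii and has_utf8_bom are made up for illustration, and the sample bytes are only an assumption (the Latin-1 encoding of the four characters shown above), not the real content of unicodeNotepad.txt.

```python
import codecs


def escape_non_ascii(line: bytes) -> str:
    """Replace every byte outside 7-bit ASCII with a decimal "&#N;" reference."""
    return "".join(chr(b) if b < 0x80 else f"&#{b};" for b in line)


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


if __name__ == "__main__":
    # Assumed sample: Latin-1 bytes for the four characters in the output above.
    sample = bytes([0xEA, 0xEB, 0xEC, 0xF1]) + b"\n"
    print(escape_non_ascii(sample), end="")
    print(has_utf8_bom(sample))
```

Working on bytes keeps the check independent of the locale, which seems to be the variable that changes between the two machines in this thread.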

Could anyone please provide some Java/C/C++ code that does what the perl commands do, i.e. generates the file and checks it?
I did google one before, but I can't find it now.



----- Original Message ----
From: Elias Lutfallah <eli at mortgagefolder.com>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Thursday, March 1, 2007 3:03:42 PM
Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS

I just remembered that I encountered a similar problem a couple of
years ago.

I can't remember what versions of perl I was using at the time, but I was
using IO::Socket to send data between two Linux machines. I wasn't even
using utf-8. After transfer, the file would be corrupt on the destination
machine.

One had a perl I had compiled myself, and one had the default perl that
came with the distribution. I didn't investigate it too thoroughly, I just
tar-ed up my compiled perl and put it on the other machine and everything
worked.

I probably should have done more research and submitted a bug report.

> It looks like a perl build issue (the perl on the SunOS box cannot
> correctly handle utf-8).
>
> I set the locale on Linux to en_US.utf8 and on SunOS to en_US.UTF-8. Then
> I generated the file again on the Linux box and scp'ed it to SunOS. The
> file looks good on both boxes in vi(m).
>
> Then I used perl to check the file on SunOS. The output indicates that
> each Unicode character is split into two bytes.
>
> -> perl -ne 'use utf8; print ord(substr($_, 0, 1)).
>            "\t".ord(substr($_, 1, 1)).
>            "\t".ord(substr($_, 2, 1)).
>            "\t".$_' l>
> 32      195     189      ý     253
> 32      195     190      þ     254
> 32      195     191      ÿ     255
> 195     189     195     ýý    253
> 195     190     195     þþ    254
> 195     191     195     ÿÿ    255
> 53      195     189     5ý     253
> 54      195     190     6þ     254
> 55      195     191     7ÿ     255
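
The 195/189, 195/190 and 195/191 pairs in the listing above are what code points 253 to 255 look like once they are encoded as UTF-8 and then read back one byte at a time. A small Python sketch of that arithmetic (the helper name utf8_byte_values is made up for illustration):

```python
def utf8_byte_values(code_point: int) -> list:
    """Return the decimal values of the bytes that encode one code point in UTF-8."""
    return list(chr(code_point).encode("utf-8"))


if __name__ == "__main__":
    for cp in (253, 254, 255):
        # Each of these code points needs two bytes in UTF-8.
        print(cp, utf8_byte_values(cp))
```

ASCII characters such as "5" (code point 53) stay a single byte, which matches the first column of the last three rows.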
>
> Then I used the same command to create a file on SunOS. When I open it in
> vi, the characters are shown as octal escapes with the correct values.
> Checking the file with perl, the characters are not split, but they are
> not displayed correctly.
> 32      253     9        ▒      253
> 32      254     9        ▒      254
> 32      255     9        ▒      255
> 253     253     9       ▒▒      253
> 254     254     9       ▒▒      254
> 255     255     9       ▒▒      255
> 53      253     9       5▒      253
> 54      254     9       6▒      254
> 55      255     9       7▒      255
>
> I scp'ed this file to Linux and ran the perl command to check it, and got
> the following messages:
> Malformed UTF-8 character (unexpected non-continuation byte 0x09,
> immediately after start byte 0xfd) in ord at -e line 1, <> line 1.
> 32      0       0        ▒      253
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 2.
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xfe) in ord at -e line 1, <> line 2.
> 32      0       0        ▒      254
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 3.
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xff) in ord at -e line 1, <> line 3.
> 32      0       0        ▒      255
> Malformed UTF-8 character (unexpected non-continuation byte 0xfd,
> immediately after start byte 0xfd) in ord at -e line 1, <> line 4.
> 0       10      0       ▒▒      253
> Malformed UTF-8 character (unexpected non-continuation byte 0xfe,
> immediately after start byte 0xfe) in ord at -e line 1, <> line 5.
> 0       0       0       ▒▒      254
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 6.
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 6.
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 6.
> 0       0       0       ▒▒      255
> Malformed UTF-8 character (unexpected non-continuation byte 0x09,
> immediately after start byte 0xfd) in ord at -e line 1, <> line 7.
> 53      0       0       5▒      253
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 8.
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xfe) in ord at -e line 1, <> line 8.
> 54      0       0       6▒      254
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 9.
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xff) in ord at -e line 1, <> line 9.
> 55      0       0       7▒      255
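
The warnings above are what one would expect when single bytes 253 to 255 are read where UTF-8 is expected. A minimal Python sketch of the same kind of check, assuming strict UTF-8 decoding (the function name first_invalid_utf8_offset is made up for illustration). Python's decoder is stricter than Perl's and rejects 0xFD outright rather than treating it as a start byte, so its error text differs from the Perl messages:

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks strict UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    # First line of the file as written on SunOS: a lone byte 0xFD after the space.
    print(first_invalid_utf8_offset(b" \xfd\t253\n"))
    # Same line as written on Linux: 0xC3 0xBD is a well-formed two-byte sequence.
    print(first_invalid_utf8_offset(b" \xc3\xbd\t253\n"))
```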
>
> Check the file size:
> -rw-r--r--    1 gpeng    dba            75 Mar  1 09:38 fromLinux.txt
> -rw-r--r--    1 gpeng    dba            63 Mar  1 09:37 fromSunOS.txt
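
The 12-byte difference between the two files is consistent with one file holding single-byte (Latin-1) characters and the other holding UTF-8. A Python sketch that rebuilds the nine-line test data from the perl -e generator quoted further down and measures both encodings (build_test_text is a made-up name for illustration):

```python
def build_test_text() -> str:
    """Rebuild the nine test lines produced by the perl -e generator in this thread."""
    lines = []
    for i in range(253, 256):
        lines.append(" " + chr(i) + "\t" + str(i) + "\n")
    for i in range(253, 256):
        lines.append(chr(i) + chr(i) + "\t" + str(i) + "\n")
    for i in range(253, 256):
        lines.append(chr(i - 200) + chr(i) + "\t" + str(i) + "\n")
    return "".join(lines)


if __name__ == "__main__":
    text = build_test_text()
    latin1_size = len(text.encode("latin-1"))   # one byte per character
    utf8_size = len(text.encode("utf-8"))       # two bytes per character above 127
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    print(latin1_size, utf8_size, non_ascii)
```

The data has 12 characters above 127, and each costs one extra byte in UTF-8, so 63 + 12 = 75.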
>
>
> Here is what perl -v says on the two boxes:
> -> perl -v
>
> This is perl, v5.8.0 built for i386-linux-thread-multi
> (with 1 registered patch, see perl -V for more detail)
>
> -> perl -v
>
> This is perl, v5.8.4 built for sun4-solaris-64int
> (with 28 registered patches, see perl -V for more detail)
>
>
>
> ----- Original Message ----
> From: tiger peng <tigerpeng2001 at yahoo.com>
> To: Chicago.pm chatter <chicago-talk at pm.org>
> Sent: Wednesday, February 28, 2007 8:22:15 AM
> Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte
> Order Mark on SunOS
>
> I checked it again between Linux and SunOS. The malformation is not
> related to the BOM.
>
> If I set the locale to en_US.UTF-8 on SunOS and to en_US.utf8 (or
> en_US.UTF-8, which does not show up in locale -a) on Linux, the non-ASCII
> characters are malformed; each one is split into two characters. If I set
> both to en_US.ISO8859-1, they are not malformed and display correctly in
> my xterm (PuTTY).
>
> On Linux:
> Here are the OS info and locale settings:
> -> uname -a
> Linux etdwag2 2.4.21-27.ELsmp #1 SMP Wed Dec 1 21:59:02 EST 2004 i686
> -> locale
> LANG=
> LC_CTYPE="en_US.utf8"
> LC_NUMERIC="en_US.utf8"
> LC_TIME="en_US.utf8"
> LC_COLLATE="en_US.utf8"
> LC_MONETARY="en_US.utf8"
> LC_MESSAGES="en_US.utf8"
> LC_PAPER="en_US.utf8"
> LC_NAME="en_US.utf8"
> LC_ADDRESS="en_US.utf8"
> LC_TELEPHONE="en_US.utf8"
> LC_MEASUREMENT="en_US.utf8"
> LC_IDENTIFICATION="en_US.utf8"
> LC_ALL=en_US.utf8
>
>
> Generate file with perl:
>
> perl -e 'for $i(253..255){print " " .       chr($i)."\t".$i."\n"}
>          for $i(253..255){print chr($i).    chr($i)."\t".$i."\n"}
>          for $i(253..255){print chr($i-200).chr($i)."\t".$i."\n"}
> ' > latin1.txt
> Check the files with perl:
>
> perl -ne 'print ord(substr($_, 0, 1)).
>            "\t".ord(substr($_, 1, 1)).
>            "\t".ord(substr($_, 2, 1)).
>            "\t".$_' latin1.txt
> 32      253     9        ý     253
> 32      254     9        þ     254
> 32      255     9        ÿ     255
> 253     253     9       ýý    253
> 254     254     9       þþ    254
> 255     255     9       ÿÿ    255
> 53      253     9       5ý     253
> 54      254     9       6þ     254
> 55      255     9       7ÿ     255
>
> On SunOS:
> -> uname -a
> SunOS etdwdev2 5.10 Generic_118833-24 sun4u sparc SUNW,Sun-Fire-880
> -> locale
> LANG=
> LC_CTYPE="C"
> LC_NUMERIC="C"
> LC_TIME="C"
> LC_COLLATE="C"
> LC_MONETARY="C"
> LC_MESSAGES="C"
> LC_ALL=
> -> export LC_ALL=en_US.UTF-8
> -> locale
> LANG=
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
> # Now scp the file from Linux
> # and check the file
> perl -ne 'print ord(substr($_, 0, 1)).
>            "\t".ord(substr($_, 1, 1)).
>            "\t".ord(substr($_, 2, 1)).
>            "\t".$_' latin1.txt
> 32      195     189      ý     253
> 32      195     190      þ     254
> 32      195     191      ÿ     255
> 195     189     195     ýý    253
> 195     190     195     þþ    254
> 195     191     195     ÿÿ    255
> 53      195     189     5ý     253
> 54      195     190     6þ     254
> 55      195     191     7ÿ     255
>
>
>
>
>
>
>
> ----- Original Message ----
> From: Jonathan Rockway <jon at jrock.us>
> To: Chicago.pm chatter <chicago-talk at pm.org>
> Sent: Tuesday, February 27, 2007 11:15:38 AM
> Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte
> Order Mark on SunOS
>
>
> tiger peng wrote:
>> I am having trouble using ftp or scp to transfer utf8 files to SunOS.
>> After googling for a while, I found that when I searched for 'utf-8 byte
>> order mark malformed SunOS', most of the top results relate to perl 5.8.
>> So I hope I can get some quick help from my dear perlists.
>
> UTF-8 is just binary.  If your link can't handle 8-bit characters, then
> uuencode the UTF-8 and uudecode it on the other end.  My guess is that's
> not the problem -- your terminal (or shell) is probably confused by
> UTF-8 and you need to play around with locales.
>
> A bit more detail might help us find the issue, although I kind of doubt
> it has *anything* to do with Perl :)
>
> Regards,
> Jonathan Rockway
>
> --
> package JAPH;use Catalyst qw/-Debug/;($;=JAPH)->config(name => do {
> $,.=reverse qw[Jonathan tsu rehton lre rekca Rockway][$_].[split //,
> ";$;"]->[$_].q; ;for 1..4;$,=~s;^.;;;$,});$;->setup;
> _______________________________________________
> Chicago-talk mailing list
> Chicago-talk at pm.org
> http://mail.pm.org/mailman/listinfo/chicago-talk


-- 
Elias Lutfallah
Chief Technology Officer
Mortgage Desk, Inc.
