[Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS
tiger peng
tigerpeng2001 at yahoo.com
Thu Mar 1 20:12:07 PST 2007
Finally, I recalled what I did before: binmode!
-> perl -e 'binmode(STDOUT, ":encoding(UTF-8)");print chr(234).chr(235).chr(236).chr(241)."\n"'> perled.txt
-> perl -C -nwe 'binmode(STDIN, ":encoding(UTF-8)");print if /([^\p{IsASCII}])/; print if s/([^\p{IsASCII}])/"\&#".ord($1).";"/ge' perled.txt
êëìñ
êëìñ
----- Original Message ----
From: tiger peng <tigerpeng2001 at yahoo.com>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Thursday, March 1, 2007 3:08:43 PM
Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS
Strange! Perl can read the BOM characters (inserted by MS Notepad) correctly!
-> perl -nwe 'print if /([^\p{IsASCII}])/; print if s/([^\p{IsASCII}])/"\&#".ord($1).";"/ge' unicodeNotepad.txt
êëìñ
êëìñ
Could anyone please provide some Java/C/C++ code that does what the perl commands do, i.e. generates the file and checks it?
I did google one, but can't find it now.
----- Original Message ----
From: Elias Lutfallah <eli at mortgagefolder.com>
To: Chicago.pm chatter <chicago-talk at pm.org>
Sent: Thursday, March 1, 2007 3:03:42 PM
Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte Order Mark on SunOS
I just remembered that I encountered a similar problem a couple of
years ago.
I can't remember what versions of perl I was using at the time, but I was
using IO::Socket to send data between two Linux machines. I wasn't even
using utf-8. After transfer, the file would be corrupt on the destination
machine.
One had a perl I had compiled myself, and one had the default perl that
came with the distribution. I didn't investigate it too thoroughly, I just
tar-ed up my compiled perl and put it on the other machine and everything
worked.
I probably should have done more research and submitted a bug report.
> It looks like a perl build issue (the perl on the SunOS box cannot
> correctly handle utf-8)
>
> I set the locale on Linux to en_US.utf8 and on SunOS to en_US.UTF-8. Then
> I generated the file again on the Linux box and scp'ed it to SunOS. The
> file looks good on both boxes with vi(m).
>
> Then I used perl to check the file on SunOS. The output indicates that
> each non-ASCII character is split into two.
>
> -> perl -ne 'use utf8; print ord(substr($_, 0, 1)).
> "\t".ord(substr($_, 1, 1)).
> "\t".ord(substr($_, 2, 1)).
> "\t".$_' l>
> 32 195 189 ý 253
> 32 195 190 þ 254
> 32 195 191 ÿ 255
> 195 189 195 ýý 253
> 195 190 195 þþ 254
> 195 191 195 ÿÿ 255
> 53 195 189 5ý 253
> 54 195 190 6þ 254
> 55 195 191 7ÿ 255
>
> Then I used the same command to create a file on SunOS. When I open it in
> vi, the characters are shown as octal escapes with the correct values.
> Checking the file with perl, the characters are not split, but they are
> not displayed correctly.
> 32 253 9 ▒ 253
> 32 254 9 ▒ 254
> 32 255 9 ▒ 255
> 253 253 9 ▒▒ 253
> 254 254 9 ▒▒ 254
> 255 255 9 ▒▒ 255
> 53 253 9 5▒ 253
> 54 254 9 6▒ 254
> 55 255 9 7▒ 255
>
> I scp'ed this file to Linux and ran the perl command to check it, then got
> the following messages:
> Malformed UTF-8 character (unexpected non-continuation byte 0x09,
> immediately after start byte 0xfd) in ord at -e line 1, <> line 1.
> 32 0 0 ▒ 253
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 2.
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xfe) in ord at -e line 1, <> line 2.
> 32 0 0 ▒ 254
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 3.
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xff) in ord at -e line 1, <> line 3.
> 32 0 0 ▒ 255
> Malformed UTF-8 character (unexpected non-continuation byte 0xfd,
> immediately after start byte 0xfd) in ord at -e line 1, <> line 4.
> 0 10 0 ▒▒ 253
> Malformed UTF-8 character (unexpected non-continuation byte 0xfe,
> immediately after start byte 0xfe) in ord at -e line 1, <> line 5.
> 0 0 0 ▒▒ 254
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 6.
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 6.
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 6.
> 0 0 0 ▒▒ 255
> Malformed UTF-8 character (unexpected non-continuation byte 0x09,
> immediately after start byte 0xfd) in ord at -e line 1, <> line 7.
> 53 0 0 5▒ 253
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 8.
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xfe) in ord at -e line 1, <> line 8.
> 54 0 0 6▒ 254
> Malformed UTF-8 character (unexpected end of string) at -e line 1, <> line
> 9.
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xff) in ord at -e line 1, <> line 9.
> 55 0 0 7▒ 255
>
> Check the file size:
> -rw-r--r-- 1 gpeng dba 75 Mar 1 09:38 fromLinux.txt
> -rw-r--r-- 1 gpeng dba 63 Mar 1 09:37 fromSunOS.txt
>
>
> Here is what perl -v says:
> -> perl -v
>
> This is perl, v5.8.0 built for i386-linux-thread-multi
> (with 1 registered patch, see perl -V for more detail)
>
> -> perl -v
>
> This is perl, v5.8.4 built for sun4-solaris-64int
> (with 28 registered patches, see perl -V for more detail)
>
>
>
> ----- Original Message ----
> From: tiger peng <tigerpeng2001 at yahoo.com>
> To: Chicago.pm chatter <chicago-talk at pm.org>
> Sent: Wednesday, February 28, 2007 8:22:15 AM
> Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte
> Order Mark on SunOS
>
> I checked it again between Linux and SunOS. The malformation is not related
> to the BOM.
>
> If I set the locale to en_US.UTF-8 on SunOS and en_US.utf8 (or en_US.UTF-8,
> which does not show up in locale -a) on Linux, the non-ASCII characters are
> malformed; they are split into two characters. If I set both to
> en_US.ISO8859-1, they are not malformed and display correctly in my
> xterm (PuTTY).
>
> On Linux:
> Here are the OS info and locale settings:
> -> uname -a
> Linux etdwag2 2.4.21-27.ELsmp #1 SMP Wed Dec 1 21:59:02 EST 2004 i686
> -> locale
> LANG=
> LC_CTYPE="en_US.utf8"
> LC_NUMERIC="en_US.utf8"
> LC_TIME="en_US.utf8"
> LC_COLLATE="en_US.utf8"
> LC_MONETARY="en_US.utf8"
> LC_MESSAGES="en_US.utf8"
> LC_PAPER="en_US.utf8"
> LC_NAME="en_US.utf8"
> LC_ADDRESS="en_US.utf8"
> LC_TELEPHONE="en_US.utf8"
> LC_MEASUREMENT="en_US.utf8"
> LC_IDENTIFICATION="en_US.utf8"
> LC_ALL=en_US.utf8
>
>
> Generate file with perl:
>
> perl -e 'for $i(253..255){print " " . chr($i)."\t".$i."\n"}
> for $i(253..255){print chr($i). chr($i)."\t".$i."\n"}
> for $i(253..255){print chr($i-200).chr($i)."\t".$i."\n"}
> ' > latin1.txt
> Check the files with perl:
>
> perl -ne 'print ord(substr($_, 0, 1)).
> "\t".ord(substr($_, 1, 1)).
> "\t".ord(substr($_, 2, 1)).
> "\t".$_' latin1.txt
> 32 253 9 ý 253
> 32 254 9 þ 254
> 32 255 9 ÿ 255
> 253 253 9 ýý 253
> 254 254 9 þþ 254
> 255 255 9 ÿÿ 255
> 53 253 9 5ý 253
> 54 254 9 6þ 254
> 55 255 9 7ÿ 255
>
> On SunOS:
> -> uname -a
> SunOS etdwdev2 5.10 Generic_118833-24 sun4u sparc SUNW,Sun-Fire-880
> -> locale
> LANG=
> LC_CTYPE="C"
> LC_NUMERIC="C"
> LC_TIME="C"
> LC_COLLATE="C"
> LC_MONETARY="C"
> LC_MESSAGES="C"
> LC_ALL=
> -> export LC_ALL=en_US.UTF-8
> -> locale
> LANG=
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
> # Now scp the file from Linux
> # and check the file
> perl -ne 'print ord(substr($_, 0, 1)).
> "\t".ord(substr($_, 1, 1)).
> "\t".ord(substr($_, 2, 1)).
> "\t".$_' latin1.txt
> 32 195 189 ý 253
> 32 195 190 þ 254
> 32 195 191 ÿ 255
> 195 189 195 ýý 253
> 195 190 195 þþ 254
> 195 191 195 ÿÿ 255
> 53 195 189 5ý 253
> 54 195 190 6þ 254
> 55 195 191 7ÿ 255
>
> ----- Original Message ----
> From: Jonathan Rockway <jon at jrock.us>
> To: Chicago.pm chatter <chicago-talk at pm.org>
> Sent: Tuesday, February 27, 2007 11:15:38 AM
> Subject: Re: [Chicago-talk] Malformed UTF-8 character in file with Byte
> Order Mark on SunOS
>
>
> tiger peng wrote:
>> I am having trouble ftp-ing or scp-ing UTF-8 files to SunOS. After googling
>> for a while, I found that when I searched for 'utf-8 byte order mark
>> malformed SunOS', most of the top search results relate to perl 5.8. So I
>> hope I can get some quick help from my dear perlists.
>
> UTF-8 is just binary. If your link can't handle 8-bit characters, then
> uuencode the UTF-8 and uudecode it on the other end. My guess is that's
> not the problem -- your terminal (or shell) is probably confused by
> UTF-8 and you need to play around with locales.
>
> A bit more detail might help us find the issue, although I kind of doubt
> it has *anything* to do with Perl :)
>
> Regards,
> Jonathan Rockway
>
> --
> package JAPH;use Catalyst qw/-Debug/;($;=JAPH)->config(name => do {
> $,.=reverse qw[Jonathan tsu rehton lre rekca Rockway][$_].[split //,
> ";$;"]->[$_].q; ;for 1..4;$,=~s;^.;;;$,});$;->setup;
> _______________________________________________
> Chicago-talk mailing list
> Chicago-talk at pm.org
> http://mail.pm.org/mailman/listinfo/chicago-talk
--
Elias Lutfallah
Chief Technology Officer
Mortgage Desk, Inc.