SPUG: Malformed UTF-8 character (unexpected end of string)

Johnston, Mark mark.johnston at pnl.gov
Thu Jul 15 16:17:04 CDT 2004


Luis,

Do you expect the web log to be encoded as UTF-8?  If not, you may need
to specify the correct encoding when you open it.  If your LANG
environment variable specifies that the default system text encoding is
utf8, then Perl expects strings to be UTF-8 encoded.

The gotcha here is that, unlike single-byte and fixed-width multibyte
character encodings, UTF-8 uses a variable-width scheme.  This makes
UTF-8 compatible with ASCII, because every 1-byte ASCII character is
also a valid UTF-8 character.  Not so with high-order bytes.  For
UTF-8 to be able to encode all 95,221 characters in the Unicode 3.2
repertoire, every character outside the ASCII set is represented by a
multiple-byte sequence, and a lead byte must be followed by the right
number of continuation bytes.  This means that an arbitrary stream of
bytes which contains high-order bytes is more likely to be invalid
UTF-8 than valid.
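
To make that concrete, here is a minimal sketch that checks whether a
byte string is well-formed UTF-8.  The helper name is_valid_utf8 and the
sample strings are made up for illustration; the check itself leans on
the Encode module's decode() with FB_CROAK, which dies on malformed
input.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: return 1 if the byte string is well-formed UTF-8,
# 0 otherwise.  decode() with FB_CROAK dies on malformed input, so the
# eval block only reaches its final "1" when decoding succeeded.
sub is_valid_utf8 {
    my ($bytes) = @_;    # a copy, because decode() may modify its input
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# Illustrative samples (assumptions, not taken from the actual log).
my %sample = (
    ascii  => "GET /index.html",    # pure ASCII, so also valid UTF-8
    latin1 => "caf\xE9",            # ISO-8859-1: lone high-order byte 0xE9
    utf8   => "caf\xC3\xA9",        # same word as a 2-byte UTF-8 sequence
);

for my $name (sort keys %sample) {
    printf "%-7s %s\n", $name,
        is_valid_utf8($sample{$name}) ? 'valid' : 'invalid';
}
```

The ISO-8859-1 sample is the interesting case: 0xE9 looks like the start
of a multi-byte sequence, but the string ends before any continuation
bytes arrive.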

If your source file is single-byte encoded and not UTF-8 encoded, then
you can use binmode() to specify byte-oriented input, or a non-default
encoding scheme (the Encode module is required for the latter).  You
can also set the LANG environment variable to a locale which is not a
UTF-8 locale prior to running your script.
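
A rough sketch of both options follows.  It assumes the log is really
ISO-8859-1, which is only a guess; the sub name read_log_fields and the
sample log line are invented for the example.  The layer is passed in
the open() mode here, which has the same effect as calling binmode() on
the handle right after opening it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a log through the given PerlIO layer and
# split each line into whitespace-separated fields.
sub read_log_fields {
    my ($path, $layer) = @_;
    open(my $fh, "<$layer", $path) or die "Cannot open $path: $!";
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, [ split ' ', $line ];
    }
    close $fh;
    return @rows;
}

# Build a small sample log containing one ISO-8859-1 byte (0xE9).
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);    # write the bytes exactly as given
print {$out} "10.0.0.1 GET /caf\xE9.html 200\n";
close $out;

# Option 1: byte-oriented input, no decoding at all.
my @as_bytes = read_log_fields($path, ':raw');

# Option 2: decode from an explicit, non-default encoding.
my @as_latin1 = read_log_fields($path, ':encoding(iso-8859-1)');

printf "raw:    %d fields\n", scalar @{ $as_bytes[0] };
printf "latin1: %d fields\n", scalar @{ $as_latin1[0] };
```

Either way, Perl never tries to interpret the high-order byte as the
start of a UTF-8 sequence, so the split should no longer complain.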

It seems odd that a web log file would not be vanilla ASCII, though.

	--Mark

-----Original Message-----
From: spug-list-bounces at mail.pm.org
[mailto:spug-list-bounces at mail.pm.org] On Behalf Of MADRANO ZALVIDAR, L
Sent: Thursday, July 15, 2004 1:40 PM
To: Andrew Sweger
Cc: spug-list at mail.pm.org
Subject: RE: SPUG: Malformed UTF-8 character (unexpected end of string)

Basically I'm just parsing a web log, but this is the line where the
error shows up:

my @temp=split(' ',$line);

And it is very weird. Why should it break just splitting a string?

Any thoughts?

Luis


-----Original Message-----
From: Andrew Sweger [mailto:andrew at sweger.net]
Sent: Thursday, July 15, 2004 1:23 PM
To: MADRANO ZALVIDAR, L
Cc: spug-list at mail.pm.org
Subject: Re: SPUG: Malformed UTF-8 character (unexpected end of string)

According to perldoc perldiag:

Malformed UTF-8 character (%s)

    Perl detected something that didn't comply with UTF-8 encoding
    rules.

    One possible cause is that you read in data that you thought to be
    in UTF-8 but it wasn't (it was for example legacy 8-bit data).
    Another possibility is careless use of utf8::upgrade().

Can you provide any other information about the application you're
having trouble with?

On Thu, 15 Jul 2004, MADRANO ZALVIDAR, L wrote:

> working with some logs, but for some reason it is showing this error,
> "Malformed UTF-8 character (unexpected end of string)", when I run the
> script. Does anybody know how this can be fixed?

--
Andrew B. Sweger -- The great thing about multitasking is that several
                                things can go wrong at once.





_____________________________________________________________
Seattle Perl Users Group Mailing List  
POST TO: spug-list at mail.pm.org  http://spugwiki.perlocity.org
ACCOUNT CONFIG: http://mail.pm.org/mailman/listinfo/spug-list
MEETINGS: 3rd Tuesdays, Location Unknown
WEB PAGE: http://www.seattleperl.org




