SPUG: Malformed UTF-8 character (unexpected end of string)
mark.johnston at pnl.gov
Thu Jul 15 16:17:04 CDT 2004
Do you expect the web log to be encoded as UTF-8? If not, you may need
to specify the correct encoding when you open it. If your LANG
environment variable specifies that the default system text encoding is
UTF-8, then Perl expects the input it reads to be UTF-8 encoded.
The gotcha here is that, unlike single-byte and fixed-width multibyte
character encodings, UTF-8 uses a variable-width scheme. This makes
UTF-8 compatible with ASCII, because every 1-byte ASCII character is
also a valid UTF-8 character. Not so with high-order bytes (0x80
through 0xFF), which can appear only inside multi-byte sequences. In
order for UTF-8 to be able to encode all 95,221 characters which are
included in the Unicode 3.2 repertoire, the 95,093 characters beyond
the 128 of the ASCII character set are represented by multiple-byte
sequences. This means that an arbitrary stream of bytes which contains
high-order bytes is more likely to be invalid UTF-8 than valid.
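To make the point above concrete, here is a minimal sketch that checks whether a byte string decodes as strict UTF-8. The helper name is_valid_utf8 and the sample strings are made up for illustration; the only library call is decode() from the Encode module with the FB_CROAK check. The Latin-1 sample ends in the byte 0xE9, which in UTF-8 is a lead byte that promises two continuation bytes, so the string ends before the sequence is complete.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes decodes cleanly as strict UTF-8.
# decode() with FB_CROAK dies on the first malformed sequence, so the
# eval block only reaches the trailing 1 when the whole string is valid.
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy; decode() may modify its argument
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my %sample = (
    ascii  => "GET /index.html",    # pure ASCII, all bytes below 0x80
    utf8   => "caf\xC3\xA9",        # e-acute as the two-byte sequence C3 A9
    latin1 => "caf\xE9",            # e-acute as the single Latin-1 byte E9
);

for my $name (sort keys %sample) {
    printf "%-7s %s\n", $name,
        is_valid_utf8($sample{$name}) ? 'valid UTF-8' : 'NOT valid UTF-8';
}
```

Only the Latin-1 sample should be rejected, which is the situation a single-byte log file puts Perl in when it has been told to expect UTF-8.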
If your source file is single-byte encoded and not UTF-8 encoded, then
you can use binmode() to specify byte-oriented input, or to select a
non-default encoding layer (the Encode module is required for the
latter). You can also set the LANG environment variable to a locale
which is not a UTF-8 locale prior to running your script.
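A rough sketch of the first two options follows, assuming the log is really ISO-8859-1 (an assumption; substitute whatever encoding the file actually uses). The helper split_log and the sample log line are invented for illustration. The layer is given in the mode argument of a three-argument open(), which has the same effect as calling binmode() on the handle right after opening it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read $path through the given PerlIO layer and
# split every line into whitespace-separated fields.
sub split_log {
    my ($path, $layer) = @_;
    open(my $in, "<$layer", $path) or die "Cannot open $path: $!";
    my @rows;
    while (my $line = <$in>) {
        chomp $line;
        push @rows, [ split(' ', $line) ];
    }
    close($in);
    return \@rows;
}

# Write a one-line sample log containing the Latin-1 byte 0xE9.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh);
print {$fh} "10.0.0.1 GET /caf\xE9 200\n";
close($fh);

# Option 1: decode as ISO-8859-1, so the program sees characters.
my $decoded = split_log($path, ':encoding(iso-8859-1)');

# Option 2: byte-oriented input, no decoding at all.
my $raw = split_log($path, ':raw');

print scalar(@{ $decoded->[0] }), " fields (decoded)\n";
print scalar(@{ $raw->[0] }), " fields (raw)\n";
```

Either way, split never sees data that claims to be UTF-8 but is not. The third option needs no code change: set LANG to a non-UTF-8 locale such as C in the shell before starting the script.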
It seems odd that a web log file would not be vanilla ASCII, though.
From: spug-list-bounces at mail.pm.org
[mailto:spug-list-bounces at mail.pm.org] On Behalf Of MADRANO ZALVIDAR, L
Sent: Thursday, July 15, 2004 1:40 PM
To: Andrew Sweger
Cc: spug-list at mail.pm.org
Subject: RE: SPUG: Malformed UTF-8 character (unexpected end of string)
Basically I'm just parsing a web log. But this is the line where the
error shows up:
my @temp=split(' ',$line);
And it is very weird. Why should just splitting a string break?
From: Andrew Sweger [mailto:andrew at sweger.net]
Sent: Thursday, July 15, 2004 1:23 PM
To: MADRANO ZALVIDAR, L
Cc: spug-list at mail.pm.org
Subject: Re: SPUG: Malformed UTF-8 character (unexpected end of string)
According to perldoc perldiag:
Malformed UTF-8 character (%s)
Perl detected something that didn't comply with UTF-8 encoding rules.
One possible cause is that you read in data that you thought to be
UTF-8 but it wasn't (it was for example legacy 8-bit data). Another
possibility is careless use of utf8::upgrade().
Can you provide any other information about the application you're
having trouble with?
On Thu, 15 Jul 2004, MADRANO ZALVIDAR, L wrote:
> working with some logs. but for some reason it is showing this error
> "Malformed UTF-8 character (unexpected end of string)" when I run the
> script. Does anybody know how this can be fixed?
Andrew B. Sweger -- The great thing about multitasking is that several
things can go wrong at once.
Seattle Perl Users Group Mailing List
POST TO: spug-list at mail.pm.org http://spugwiki.perlocity.org
ACCOUNT CONFIG: http://mail.pm.org/mailman/listinfo/spug-list
MEETINGS: 3rd Tuesdays, Location Unknown
WEB PAGE: http://www.seattleperl.org