[Chicago-talk] Malformed UTF-8 character
Don Drake
don at drakeconsult.com
Fri Dec 15 09:55:13 PST 2006
I used to get a similar error when loading spam into a db for MailLaunder.
I now use something similar to `iconv -c -t UTF-8 < inputfile` to clean out
any UTF-8 badness. The -c option discards any characters that cannot be
converted.
Hope that helps.
-Don
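A minimal sketch of the iconv approach above, for anyone trying it. The function names and the sample input are hypothetical, and the explicit -f arguments are an addition (without -f, iconv assumes the input is in the current locale's encoding). The second function assumes the data really is Windows-1252, in which case the quotes can be converted rather than dropped:

```shell
#!/bin/sh
# Sketch only: function names and sample data are hypothetical.

# drop_bad: in the spirit of the command above. -c discards any input
# that cannot be converted. The explicit -f UTF-8 is an addition.
drop_bad() {
    iconv -c -f UTF-8 -t UTF-8
}

# convert_cp1252: assumes the input really is Windows-1252, so bytes such
# as 0x91/0x92 become the UTF-8 curly quotes U+2018/U+2019 instead of
# being thrown away.
convert_cp1252() {
    iconv -f WINDOWS-1252 -t UTF-8
}

# Sample: "it's" with a Windows-1252 right single quote (0x92, octal 222).
# od shows the resulting bytes in hex.
printf 'it\222s\n' | drop_bad | od -An -tx1
printf 'it\222s\n' | convert_cp1252 | od -An -tx1
```

The trade-off is that drop_bad loses the punctuation entirely, while convert_cp1252 keeps it but produces nonsense if the input was not Windows-1252 to begin with.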
-----Original Message-----
From: chicago-talk-bounces+don=drakeconsult.com at pm.org
[mailto:chicago-talk-bounces+don=drakeconsult.com at pm.org] On Behalf Of
Andy_Bach at wiwb.uscourts.gov
Sent: Friday, December 15, 2006 11:43 AM
To: Chicago.pm chatter
Subject: Re: [Chicago-talk] Malformed UTF-8 character
> > I've got data w/ x91 and x92 chars in it (which must be Excel curly
> > quotes) and trying to parse it I get a lot of:
> > Malformed UTF-8 character (unexpected continuation byte 0x91, with no
> > preceding start byte) in pattern match (m//) at
> > /opt/util/check_doc_table.pl line 155, <> line 1.
> Looks like you might be coming up against windows-1252 and perl is
> thinking it's Unicode. Try seeing if the utf8 flag is set using the
> Encode module. If it is, you might consider turning off the utf8 flag
> and [d]encoding to the proper format for your work.
It's a CGI app (on Linux), and the data may, without our knowing, be
Windows/DOS text, Linux text, or even cut-and-paste from an HTML display.
The issue here is similar to what I'm running into: we have an Excel
spreadsheet w/ an 'export' macro that is supposed to produce a ".txt" file
that folks can then upload to their Linux box/db. The spreadsheet is
supposed to give non-Linux folks an easy way of editing: they edit, export
and upload rather than work on the Linux box. What happens is these chars
get left in, unnoticed, and the uploaded info fails in unexpected ways.
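One way to catch such characters before an upload might be a quick byte count. This is only a sketch: the helper name is hypothetical, and it assumes the exported file is meant to be plain ASCII (in genuine UTF-8 text the same byte values can legitimately occur inside multi-byte characters):

```shell
#!/bin/sh
# Sketch only: count_cp1252_punct is a hypothetical helper name.
# Counts bytes 0x91-0x94 (octal 221-224) on stdin. In Windows-1252 these
# are the curly single and double quotes. tr -cd deletes every byte
# except the listed ones, and wc -c counts what is left.
count_cp1252_punct() {
    LC_ALL=C tr -cd '\221\222\223\224' | wc -c | tr -d ' '
}

# Sample line with Windows-1252 curly double quotes around "hi".
printf 'say \223hi\224\n' | count_cp1252_punct
```

A non-zero count on a file that should be ASCII suggests the export still contains Windows punctuation.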
I wasn't doing any decoding, but:
my $utf_line = eval " decode(\'ISO-8859-1\', \$line, Encode::FB_WARN ) ";
#my $utf_line = decode('ISO-8859-1', $line , Encode::FB_WARN);
in various guises doesn't seem to help. W/o the eval I get:
Wide character in subroutine entry at
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/Encode.pm line 154, <> line 160.
(I do have an older 'Encode' but can't upgrade). W/ the eval, and, if I'm
reading the docs right, w/ decode - I still can't get rid of the issue.
a
Andy Bach
Systems Mangler
Internet: andy_bach at wiwb.uscourts.gov
VOICE: (608) 261-5738 FAX 264-5932
Seville Dar Daigo
Tousin Busses Inaro
Nojo Demistrux
Summit Cows In
Summit Dux
_______________________________________________
Chicago-talk mailing list
Chicago-talk at pm.org
http://mail.pm.org/mailman/listinfo/chicago-talk