[Chicago-talk] Malformed UTF-8 character

Don Drake don at drakeconsult.com
Fri Dec 15 09:55:13 PST 2006


I used to get a similar error when loading spam into a db for MailLaunder.

I now use something similar to `iconv -c -t UTF-8 < inputfile` to clean out
any UTF-8 badness.  The -c option discards any characters that can't be
converted.
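
For anyone who would rather do the same cleanup inside Perl, here is a rough sketch of the idea. It assumes a reasonably recent Encode, and `strip_bad_utf8` is a made-up helper name, not something from MailLaunder. It only approximates what `iconv -c` does: malformed sequences are first replaced with U+FFFD and then removed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: drop byte sequences that are not valid UTF-8.
sub strip_bad_utf8 {
    my ($bytes) = @_;
    # With no CHECK argument, decode() substitutes U+FFFD for each
    # malformed sequence instead of dying.
    my $text = decode('UTF-8', $bytes);
    # Remove the substitution characters. Note that this also removes
    # any genuine U+FFFD that was already present in the input.
    $text =~ s/\x{FFFD}//g;
    return encode('UTF-8', $text);
}

my $dirty = "It\x92s here";            # 0x92 is a stray continuation byte
print strip_bad_utf8($dirty), "\n";    # prints "Its here"
```

Note that this throws the curly quote away rather than converting it, which is fine for spam but maybe not for data you care about.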

Hope that helps.

-Don

-----Original Message-----
From: chicago-talk-bounces+don=drakeconsult.com at pm.org
[mailto:chicago-talk-bounces+don=drakeconsult.com at pm.org] On Behalf Of
Andy_Bach at wiwb.uscourts.gov
Sent: Friday, December 15, 2006 11:43 AM
To: Chicago.pm chatter
Subject: Re: [Chicago-talk] Malformed UTF-8 character

> > I've got data w/ a x91 and x92 chars in it (which must be Excel curling
> quotes) and trying to parse it I get a lot of:
> Malformed UTF-8 character (unexpected continuation byte 0x91, with no
> preceding start byte) in pattern match (m//) at
> /opt/util/check_doc_table.pl line 155, <> line 1.

> Looks like you might be coming up against windows-1252 and perl is
> thinking it's Unicode.  Try seeing if the utf8 flag is set using the
> Encode module.  If it is, you might consider turning off the utf8 flag
> and [d]encoding to the proper format for your work.
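
A minimal sketch of the flag check suggested above, using `Encode::is_utf8`. The helper name `describe_string` and the sample strings are made up for illustration; in the real script you would pass it the line that triggers the warning.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report how Perl is currently storing a string.
sub describe_string {
    my ($str) = @_;
    my $flag = Encode::is_utf8($str) ? 'on' : 'off';
    my $wide = $str =~ /[^\x00-\xFF]/ ? 'yes' : 'no';
    return "utf8 flag $flag, chars above 0xFF: $wide";
}

# Raw bytes, as read from a file with no decoding layer.
print describe_string("\x91raw bytes\x92"), "\n";
# prints "utf8 flag off, chars above 0xFF: no"

# Already-decoded text containing real curly quotes.
print describe_string("\x{2018}decoded\x{2019}"), "\n";
# prints "utf8 flag on, chars above 0xFF: yes"
```

If the flag is on while the data is really windows-1252 bytes, something upstream has marked the string as UTF-8 without decoding it properly.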

It's a CGI app (on Linux) and the data is, unknowingly, Windows/DOS text,
Linux text or even cut-and-paste from an HTML display.  The issue here is
similar to what I'm running into - we have an Excel spreadsheet w/ an
'export' macro that is supposed to produce a ".txt" file that folks can
then upload to their Linux box/db. The spreadsheet is supposed to give
non-Linux folks an easy way of editing - they edit, export and upload
rather than work on the Linux box. What happens is these chars get left
in, unnoticed, and the uploaded info fails in unexpected ways.

I wasn't doing any decoding, but:
 my $utf_line = eval " decode(\'ISO-8859-1\', \$line, Encode::FB_WARN ) ";
  #my $utf_line = decode('ISO-8859-1', $line , Encode::FB_WARN);

in various guises doesn't seem to help.  W/o the eval I get:
Wide character in subroutine entry at
/usr/lib/perl5/5.8.0/i386-linux-thread-multi/Encode.pm line 154, <> line 160.

(I do have an older 'Encode' but can't upgrade.) W/ the eval, and, if I'm
reading the docs right, w/ decode - I still can't get rid of the issue.
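
A sketch of one thing that might be worth trying, with some assumptions spelled out. ISO-8859-1 maps 0x91 and 0x92 to control characters, while cp1252 maps them to the curly quotes U+2018 and U+2019, so cp1252 is probably the better source encoding to name. Also, `decode` expects raw bytes; one common cause of "Wide character in subroutine entry" is handing it a string that already holds characters above 0xFF, which I'm only guessing is what happens at line 160. The helper name `cp1252_to_text` and the sample line are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: treat the input as raw Windows-1252 bytes and
# return a Perl character string.
sub cp1252_to_text {
    my ($bytes) = @_;
    return decode('cp1252', $bytes);
}

my $line = "\x91quoted\x92";        # bytes as they come out of the export
my $text = cp1252_to_text($line);   # now holds U+2018 ... U+2019

# Optionally flatten the curly quotes to plain ASCII before matching.
(my $ascii = $text) =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
print "$ascii\n";                   # prints 'quoted'
```

This only makes sense for lines that really are cp1252 bytes; lines that were already decoded, or that arrive as UTF-8, would need to be detected and handled separately.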

a

Andy Bach
Systems Mangler
Internet: andy_bach at wiwb.uscourts.gov
VOICE: (608) 261-5738  FAX 264-5932

Seville Dar Daigo
Tousin Busses Inaro
Nojo Demistrux
Summit Cows In
Summit Dux 
_______________________________________________
Chicago-talk mailing list
Chicago-talk at pm.org
http://mail.pm.org/mailman/listinfo/chicago-talk



