[Melbourne-pm] UTF-8 headaches

Thu Nov 15 13:49:25 PST 2007

I'm having a very similar (I suspect) problem at the moment.

I have a document that I can display correctly in a web browser, but  
when I process exactly the same document through a shell script (to  
convert it to PDF) the characters come out wrong. I suspect the web  
browser is able to switch encodings mid-stream and thus displays the  
document correctly, whereas the shell script assumes it is all one  
encoding and thus it comes out wrong.

I'm combining Unicode fragments from MySQL 5 but am performing some
substitutions on some of the fragments - I suspect this is part of
the problem. I have the added problem that some of those fragments
were probably initially entered using a different encoding scheme, so
there may be some text stored as Big5 that, again, the browser
might be quietly handling.
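
For what it's worth, here is a rough sketch of how mixed fragments could be
decoded one at a time: try strict UTF-8 first and only fall back to Big5 if
that fails. The helper name decode_fragment is made up for illustration, and
the two-encoding list is only an assumption about what is in the data. Many
byte strings are valid in more than one encoding, so the order of the list
matters and a "successful" decode is not proof of the original encoding.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns (decoded text, encoding name) for the first
# encoding that decodes the bytes without error, or an empty list if none do.
sub decode_fragment {
    my ($bytes) = @_;
    for my $encoding ('UTF-8', 'big5-eten') {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC stops it from modifying $bytes in place.
        my $text = eval { decode($encoding, $bytes, FB_CROAK | LEAVE_SRC) };
        return ($text, $encoding) if defined $text;
    }
    return;    # no candidate fits; the caller decides what to do
}

# Example data: the same character as UTF-8 bytes and as Big5 bytes.
my ($from_utf8, $enc_a) = decode_fragment("\xE4\xB8\xAD");
my ($from_big5, $enc_b) = decode_fragment("\xA4\xA4");
```

Once every fragment is a decoded Perl string, the substitutions and the final
output only have to deal with one representation.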

I could be wrong about all of the above, but that's what it looks like.

I think is_utf8() might be a good diagnostic tool for this - e.g. I
might just test every fragment before I allow it into the final
output and, if it fails, leave it out.
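
A minimal sketch of that filter follows. One caveat: as far as I know,
utf8::is_utf8() only reports whether Perl's internal UTF-8 flag is set on a
string, not whether raw bytes are well-formed UTF-8, so this sketch uses a
strict decode as the test instead. The helper name is_valid_utf8 and the
sample fragments are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # decode() dies under FB_CROAK on malformed input, so eval returns undef.
    return defined eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
}

# Example data: plain ASCII, valid UTF-8, and a Latin-1 byte that is not UTF-8.
my @fragments = ("plain ascii", "caf\xC3\xA9", "caf\xE9 au lait");

# Keep only the fragments that decode cleanly.
my @usable = grep { is_valid_utf8($_) } @fragments;
```

Dropping the failures silently loses data, so it may be worth logging them
instead, at least while diagnosing.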

Anyway, I'll be interested to hear your experiences, Kat.

Guy

On 15/11/2007, at 6:08 PM, Toby Corkindale wrote:

> Kat Grant wrote:
>> Hi All
>>
>> We have a web front ended application, and (not unusually) some jobs
>> that run server side in the background.
>> My problem is that the same bit of code called from within the web
>> server handles utf8 characters just fine, but when called from a
>> standalone script, turns them into rubbish.
>>
>> The code is literally identical, and it's doing my head in.
>>
>> I've tried running with all the various possible values of -C but
>> nothing has helped.
>>
>> The code is pulling some UTF-8 data from a mysql database,
>> constructing a MIME::Lite message and sending it. The messages come
>> through fine when sent from within the webserver, but the characters
>> are trashed when sent from a stand alone script.
>>
>> We use perl 5.8.8, MySQL 5, Apache 1.3, mod_perl on debian.
>
> I remember a Perl talk a couple of years ago or so about using Unicode
> and Perl and databases, and the conclusion could roughly be summed up
> as: DBD::Pg (PostgreSQL) Just Works(tm) and MySQL required a bunch of
> hoops to be jumped through.
>
> I notice from Google that:
> http://www.simplicidade.org/notes/archives/2005/12/utf8_and_dbdmys.html
>
> So I wonder if that's what you're seeing.. You're getting raw characters
> that add up to Unicode, but are not /marked/ as Unicode internally.
> When printed via the web, your browser might be smart enough to pick up
> that it's unicode and display it as such.. but on a terminal, the Perl
> i/o layer or the terminal may be (mistakenly) escaping the bytes, or
> trying to display them as iso-8859 instead, hence the garbage.
>
> ie. the webserver may be the one at fault here, and your terminal is
> correctly displaying garbage.
>
> Can you try this out?
>
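> # utf8::is_utf8() is always available; "use utf8" is not needed for it.
> # It reports whether Perl's internal UTF-8 flag is set on $string.
> print "Data " . (utf8::is_utf8($string) ? 'is' : 'is not')
>     . " flagged as UTF-8 internally.\n";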
>
>
> If it comes back as NOT being utf8, try this:
>
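> use Encode qw/decode/;
> # Decode the raw bytes into Perl characters.
> my $upgraded = decode("utf8", $string);
> # Encode again on output, otherwise print warns about wide characters.
> binmode(STDOUT, ":encoding(UTF-8)");
> print $upgraded;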
>
>
> -toby