[tpm] PostgreSQL INSERT/UTF-8 problem

Wed Jul 9 04:59:22 PDT 2008

Cees Hek wrote:
> On Wed, Jul 9, 2008 at 5:29 AM, Madison Kelly <linux at alteeve.com> wrote:
>> Hi all, second question of the day!
>>
>>  I've got a problem INSERTing a value into my DB. It's a French character
>> 'é', and my DB is set to UTF8, but the error is:
>>
>> INSERT INTO customer_data (cd_cust_id, cd_variable, cd_value, added_user,
>> added_date, modified_user, modified_date) VALUES (1,
>> 'CustServiceTypeDisplay_F', 'Résidence', 1, now(), 1, now());
>>
>> DBD::Pg::db do failed: ERROR:  invalid byte sequence for encoding "UTF8":
>> 0xe97369
>> HINT:  This error can also happen if the byte sequence does not match the
>> encoding expected by the server, which is controlled by "client_encoding".
>>
>>  When I manually run the INSERT, it works, so I know the problem is in perl
>> somewhere. Now then, I setup my script with this:
>>
>> # Setup for UTF-8 mode.
>> binmode STDOUT, ":utf8:";
>> $ENV{'PERL_UNICODE'}=1;
>>
>>  When I create my PgSQL connection, I use:
>>
>> $dbh=DBI->connect($db_connect_string, $$conf{db}{user}, $$conf{db}{pass},
>> {
>>        RaiseError => 1,
>>        AutoCommit => 1,
>>        pg_enable_utf8 => 1
>> }
>> ) or die ...;
>>
>>  I push a pile of queries into an array (referenced) and run them like this:
>>
>> # Sanity checks stripped for the email
>> $dbh->begin_work;
>> foreach my $query (@{$sql})
>> {
>>        print "Query: [$query]\n";
>>        $dbh->do($query) or $error.=$DBI::errstr.", ";
>> }
>> $dbh->commit;
>>
>>  Lastly, my database itself is set to UTF8:
>>
>> SET client_encoding = 'UTF8';
>>
>>  I've tried knocking out the 'pg_enable_utf8 => 1' line in case I was
>> dealing with double-encoding, but that didn't help.
>>
>>  Any tips/ideas?
> 
> Hi Madison,
> 
> You have told everything in the app that you are dealing only with
> UTF8 data, but somewhere you are getting data that is not in UTF8.
> What you didn't tell us is where that accented character came from.
> Was it from a file?  Was it from a POST in a web request?  Was it from
> another database?  Was it entered on the commandline?
> 
> Regardless of where it came from, this data source is not giving you
> UTF8 data, so you will have to convert it, or you need to get postgres
> to convert it for you.  Luckily, postgresql is very good at handling
> this sort of thing for you.  You have already mentioned that you
> called:  SET client_encoding = 'UTF8'.  That tells postgres that
> everything you send it will be in UTF8.  If you change that to SET
> client_encoding = 'LATIN1', then postgres will assume all data you
> send it is in LATIN1, and it will magically convert it to UTF8 for you
> before inserting it into the database (since your database is set to
> store UTF8 characters).
> 
> Also remember that when you retrieve data from postgres, it also obeys
> the client_encoding setting, and it will convert all those UTF8
> characters back into LATIN1 for you.
> 
> So what you need to find out is what encoding your source data is in,
> and then tell postgres that you are sending it in that encoding.  it
> will do the dirty work for you.
> 
> Alternatively you can use Rob's method of converting the data
> yourself.  That should also work, and that would be the route you want
> to go if you want to use UTF8 throughout your entire application
> (which isn't that bad of an idea if you deal with lots of special
> characters).
> 
> Cheers,
> 
> Cees
> 

Oh well, now don't I deserve a whack about the head? >_<

I use 'WWW::Mechanize' to grab data off of an SSL encrypted website. 
I've not yet figured out a way to determine what encoding the website 
uses or, more relevantly, what encoding the WWW::Mechanize object uses. 
I had started looking into this when I sent the original email, but had 
a senior's moment and forgot to include that.

Thus far, I've tried reloading the database without the 
'client_encoding' pragma, without the 'pg_enable_utf8', without the 
'binmode STDOUT, ":utf8:";' or '$ENV{'PERL_UNICODE'}=1;' and various 
combinations thereof in hopes that "magic" would happen, as much as I 
hate that, but it never worked. I do believe that 'LATIN1' is the 
default, though I would need to confirm that.

As I mentioned to Rob, I will play with the Encode module when I get 
into the office this morning. I *MUCH* prefer to handle conversions 
manually given the control freak I am. :) Joking aside, I do really want 
to control conversion myself so that I can honestly say that my program 
is fully UTF-8 through and through.

Thanks kindly for your reply! I will report my success or failure "for 
the record".

Madi