[tpm] PostgreSQL INSERT/UTF-8 problem

Cees Hek ceeshek at gmail.com
Tue Jul 8 18:41:52 PDT 2008


On Wed, Jul 9, 2008 at 5:29 AM, Madison Kelly <linux at alteeve.com> wrote:
> Hi all, second question of the day!
>
>  I've got a problem INSERTing a value into my DB. It's a French character
> 'é', and my DB is set to UTF8, but the error is:
>
> INSERT INTO customer_data (cd_cust_id, cd_variable, cd_value, added_user,
> added_date, modified_user, modified_date) VALUES (1,
> 'CustServiceTypeDisplay_F', 'Résidence', 1, now(), 1, now());
>
> DBD::Pg::db do failed: ERROR:  invalid byte sequence for encoding "UTF8":
> 0xe97369
> HINT:  This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".
>
>  When I manually run the INSERT, it works, so I know the problem is in perl
> somewhere. Now then, I setup my script with this:
>
> # Setup for UTF-8 mode.
> binmode STDOUT, ":utf8:";
> $ENV{'PERL_UNICODE'}=1;
>
>  When I create my PgSQL connection, I use:
>
> $dbh=DBI->connect($db_connect_string, $$conf{db}{user}, $$conf{db}{pass},
> {
>        RaiseError => 1,
>        AutoCommit => 1,
>        pg_enable_utf8 => 1
> }
> ) or die ...;
>
>  I push a pile of queries into an array (referenced) and run them like this:
>
> # Sanity checks stripped for the email
> $dbh->begin_work;
> foreach my $query (@{$sql})
> {
>        print "Query: [$query]\n";
>        $dbh->do($query) or $error.=$DBI::errstr.", ";
> }
> $dbh->commit;
>
>  Lastly, my database itself is set to UTF8:
>
> SET client_encoding = 'UTF8';
>
>  I've tried knocking out the 'pg_enable_utf8 => 1' line in case I was
> dealing with double-encoding, but that didn't help.
>
>  Any tips/ideas?

Hi Madison,

You have told everything in the app that you are dealing only with
UTF8 data, but somewhere you are getting data that is not in UTF8.
What you didn't tell us is where that accented character came from.
Was it from a file?  Was it from a POST in a web request?  Was it from
another database?  Was it entered on the commandline?

Regardless of where it came from, this data source is not giving you
UTF8 data, so you will have to convert it, or you need to get postgres
to convert it for you.  Luckily, postgresql is very good at handling
this sort of thing for you.  You have already mentioned that you
called:  SET client_encoding = 'UTF8'.  That tells postgres that
everything you send it will be in UTF8.  If you change that to SET
client_encoding = 'LATIN1', then postgres will assume all data you
send it is in LATIN1, and it will magically convert it to UTF8 for you
before inserting it into the database (since your database is set to
store UTF8 characters).

Also remember that when you retrieve data from postgres, it also obeys
the client_encoding setting, and it will convert all those UTF8
characters back into LATIN1 for you.

So what you need to find out is what encoding your source data is in,
and then tell postgres that you are sending it in that encoding.  it
will do the dirty work for you.

Alternatively you can use Rob's method of converting the data
yourself.  That should also work, and that would be the route you want
to go if you want to use UTF8 throughout your entire application
(which isn't that bad of an idea if you deal with lots of special
characters).

Cheers,

Cees


More information about the toronto-pm mailing list