[tpm] PostgreSQL INSERT/UTF-8 problem
Madison Kelly
linux at alteeve.com
Wed Jul 9 04:59:22 PDT 2008
Cees Hek wrote:
> On Wed, Jul 9, 2008 at 5:29 AM, Madison Kelly <linux at alteeve.com> wrote:
>> Hi all, second question of the day!
>>
>> I've got a problem INSERTing a value into my DB. It's a French character
>> 'é', and my DB is set to UTF8, but the error is:
>>
>> INSERT INTO customer_data (cd_cust_id, cd_variable, cd_value, added_user,
>> added_date, modified_user, modified_date) VALUES (1,
>> 'CustServiceTypeDisplay_F', 'Résidence', 1, now(), 1, now());
>>
>> DBD::Pg::db do failed: ERROR: invalid byte sequence for encoding "UTF8":
>> 0xe97369
>> HINT: This error can also happen if the byte sequence does not match the
>> encoding expected by the server, which is controlled by "client_encoding".
>>
>> When I manually run the INSERT, it works, so I know the problem is in perl
>> somewhere. Now then, I setup my script with this:
>>
>> # Setup for UTF-8 mode.
>> binmode STDOUT, ":utf8:";
>> $ENV{'PERL_UNICODE'}=1;
>>
>> When I create my PgSQL connection, I use:
>>
>> $dbh=DBI->connect($db_connect_string, $$conf{db}{user}, $$conf{db}{pass},
>> {
>> RaiseError => 1,
>> AutoCommit => 1,
>> pg_enable_utf8 => 1
>> }
>> ) or die ...;
>>
>> I push a pile of queries into an array (referenced) and run them like this:
>>
>> # Sanity checks stripped for the email
>> $dbh->begin_work;
>> foreach my $query (@{$sql})
>> {
>> print "Query: [$query]\n";
>> $dbh->do($query) or $error.=$DBI::errstr.", ";
>> }
>> $dbh->commit;
>>
>> Lastly, my database itself is set to UTF8:
>>
>> SET client_encoding = 'UTF8';
>>
>> I've tried knocking out the 'pg_enable_utf8 => 1' line in case I was
>> dealing with double-encoding, but that didn't help.
>>
>> Any tips/ideas?
>
> Hi Madison,
>
> You have told everything in the app that you are dealing only with
> UTF8 data, but somewhere you are getting data that is not in UTF8.
> What you didn't tell us is where that accented character came from.
> Was it from a file? Was it from a POST in a web request? Was it from
> another database? Was it entered on the commandline?
>
> Regardless of where it came from, this data source is not giving you
> UTF8 data, so you will have to convert it, or you need to get postgres
> to convert it for you. Luckily, postgresql is very good at handling
> this sort of thing for you. You have already mentioned that you
> called: SET client_encoding = 'UTF8'. That tells postgres that
> everything you send it will be in UTF8. If you change that to SET
> client_encoding = 'LATIN1', then postgres will assume all data you
> send it is in LATIN1, and it will magically convert it to UTF8 for you
> before inserting it into the database (since your database is set to
> store UTF8 characters).
>
> Also remember that when you retrieve data from postgres, it also obeys
> the client_encoding setting, and it will convert all those UTF8
> characters back into LATIN1 for you.
>
> So what you need to find out is what encoding your source data is in,
> and then tell postgres that you are sending it in that encoding. it
> will do the dirty work for you.
>
> Alternatively you can use Rob's method of converting the data
> yourself. That should also work, and that would be the route you want
> to go if you want to use UTF8 throughout your entire application
> (which isn't that bad of an idea if you deal with lots of special
> characters).
>
> Cheers,
>
> Cees
>
Oh well, now don't I deserve a whack about the head? >_<
I use 'WWW::Mechanize' to grab data off of an SSL encrypted website.
I've not yet figured out a way to determine what encoding the website
uses or, more relevantly, what encoding the WWW::Mechanize object uses.
I had started looking into this when I sent the original email, but had
a senior's moment and forgot to include that.
Thus far, I've tried reloading the database without the
'client_encoding' pragma, without the 'pg_enable_utf8', without the
'binmode STDOUT, ":utf8:";' or '$ENV{'PERL_UNICODE'}=1;' and various
combinations thereof in hopes that "magic" would happen, as much as I
hate that, but it never worked. I do believe that 'LATIN1' is the
default, though I would need to confirm that.
As I mentioned to Rob, I will play with the Encode module when I get
into the office this morning. I *MUCH* prefer to handle conversions
manually given the control freak I am. :) Joking aside, I do really want
to control conversion myself so that I can honestly say that my program
is fully UTF-8 through and through.
Thanks kindly for your reply! I will report my success or failure "for
the record".
Madi
More information about the toronto-pm
mailing list