[tpm] PostgreSQL INSERT/UTF-8 problem

Rob Janes janes.rob at gmail.com
Wed Jul 9 06:06:40 PDT 2008


The character encoding is named in the http header.  I'll bet it isn't
exactly named the way that any of the converters would like, so you'll have
to map it to what the converter expects.

Here's a snippet from the man page for WWW::Mech on getting the http
headers.

my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get( 'http://my.site.com' );
    my $res = $mech->response();
    for my $key ( $response->header_field_names() ) {
        print $key, " : ", $response->header( $key ), "\n";
    }


Here's the http header of interest:

Content-Type: text/html; charset=utf-8
see http://www.w3.org/International/O-charset and
http://www.iana.org/assignments/character-sets.

iconv is a utility for character conversions that seems to be available on
most unixes.  there may be more platform coverage for iconv.  especially if
your perl isn't always 5.8.1 and up.

latin1 is the same as iso-8859-1.

decide on the encoding in your program, and convert all inputs to that
encoding.  if latin1 and not utf8, you'll need to ensure some conversion
happens when writing to postgres.  you may find it easier to have postgres
do the latin1->utf8 conversion, especially since the editor you use seems to
like latin1.

so you want to be a utf8 purist.  maybe a "use utf8;" at the top is best.
use utf8::decode to convert inputs to utf8 with the utf8 flag set.  make
your ide use utf8 for editing and display.

-rob

On Wed, Jul 9, 2008 at 7:59 AM, Madison Kelly <linux at alteeve.com> wrote:

> Cees Hek wrote:
>
>> On Wed, Jul 9, 2008 at 5:29 AM, Madison Kelly <linux at alteeve.com> wrote:
>>
>>> Hi all, second question of the day!
>>>
>>>  I've got a problem INSERTing a value into my DB. It's a French character
>>> 'é', and my DB is set to UTF8, but the error is:
>>>
>>> INSERT INTO customer_data (cd_cust_id, cd_variable, cd_value, added_user,
>>> added_date, modified_user, modified_date) VALUES (1,
>>> 'CustServiceTypeDisplay_F', 'Résidence', 1, now(), 1, now());
>>>
>>> DBD::Pg::db do failed: ERROR:  invalid byte sequence for encoding "UTF8":
>>> 0xe97369
>>> HINT:  This error can also happen if the byte sequence does not match the
>>> encoding expected by the server, which is controlled by
>>> "client_encoding".
>>>
>>>  When I manually run the INSERT, it works, so I know the problem is in
>>> perl
>>> somewhere. Now then, I setup my script with this:
>>>
>>> # Setup for UTF-8 mode.
>>> binmode STDOUT, ":utf8:";
>>> $ENV{'PERL_UNICODE'}=1;
>>>
>>>  When I create my PgSQL connection, I use:
>>>
>>> $dbh=DBI->connect($db_connect_string, $$conf{db}{user}, $$conf{db}{pass},
>>> {
>>>       RaiseError => 1,
>>>       AutoCommit => 1,
>>>       pg_enable_utf8 => 1
>>> }
>>> ) or die ...;
>>>
>>>  I push a pile of queries into an array (referenced) and run them like
>>> this:
>>>
>>> # Sanity checks stripped for the email
>>> $dbh->begin_work;
>>> foreach my $query (@{$sql})
>>> {
>>>       print "Query: [$query]\n";
>>>       $dbh->do($query) or $error.=$DBI::errstr.", ";
>>> }
>>> $dbh->commit;
>>>
>>>  Lastly, my database itself is set to UTF8:
>>>
>>> SET client_encoding = 'UTF8';
>>>
>>>  I've tried knocking out the 'pg_enable_utf8 => 1' line in case I was
>>> dealing with double-encoding, but that didn't help.
>>>
>>>  Any tips/ideas?
>>>
>>
>> Hi Madison,
>>
>> You have told everything in the app that you are dealing only with
>> UTF8 data, but somewhere you are getting data that is not in UTF8.
>> What you didn't tell us is where that accented character came from.
>> Was it from a file?  Was it from a POST in a web request?  Was it from
>> another database?  Was it entered on the commandline?
>>
>> Regardless of where it came from, this data source is not giving you
>> UTF8 data, so you will have to convert it, or you need to get postgres
>> to convert it for you.  Luckily, postgresql is very good at handling
>> this sort of thing for you.  You have already mentioned that you
>> called:  SET client_encoding = 'UTF8'.  That tells postgres that
>> everything you send it will be in UTF8.  If you change that to SET
>> client_encoding = 'LATIN1', then postgres will assume all data you
>> send it is in LATIN1, and it will magically convert it to UTF8 for you
>> before inserting it into the database (since your database is set to
>> store UTF8 characters).
>>
>> Also remember that when you retrieve data from postgres, it also obeys
>> the client_encoding setting, and it will convert all those UTF8
>> characters back into LATIN1 for you.
>>
>> So what you need to find out is what encoding your source data is in,
>> and then tell postgres that you are sending it in that encoding.  it
>> will do the dirty work for you.
>>
>> Alternatively you can use Rob's method of converting the data
>> yourself.  That should also work, and that would be the route you want
>> to go if you want to use UTF8 throughout your entire application
>> (which isn't that bad of an idea if you deal with lots of special
>> characters).
>>
>> Cheers,
>>
>> Cees
>>
>>
> Oh well, now don't I deserve a whack about the head? >_<
>
> I use 'WWW::Mechanize' to grab data off of an SSL encrypted website. I've
> not yet figured out a way to determine what encoding the website uses or,
> more relevantly, what encoding the WWW::Mechanize object uses. I had started
> looking into this when I sent the original email, but had a senior's moment
> and forgot to include that.
>
> Thus far, I've tried reloading the database without the 'client_encoding'
> pragma, without the 'pg_enable_utf8', without the 'binmode STDOUT,
> ":utf8:";' or '$ENV{'PERL_UNICODE'}=1;' and various combinations thereof in
> hopes that "magic" would happen, as much as I hate that, but it never
> worked. I do believe that 'LATIN1' is the default, though I would need to
> confirm that.
>
> As I mentioned to Rob, I will play with the Encode module when I get into
> the office this morning. I *MUCH* prefer to handle conversions manually
> given the control freak I am. :) Joking aside, I do really want to control
> conversion myself so that I can honestly say that my program is fully UTF-8
> through and through.
>
> Thanks kindly for your reply! I will report my success or failure "for the
> record".
>
> Madi
>
>
> _______________________________________________
> toronto-pm mailing list
> toronto-pm at pm.org
> http://mail.pm.org/mailman/listinfo/toronto-pm
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.pm.org/pipermail/toronto-pm/attachments/20080709/e0edf299/attachment-0001.html>


More information about the toronto-pm mailing list