[Wellington-pm] More on Unicode and DB

Mon Sep 17 20:49:38 PDT 2007

Hi Mongers

In my talk about Unicode the other month, one subject I didn't get time
to explore properly was working with databases.  I've now done a little
more of work in this area with Postgres.

The initial question was how to 'round-trip' a Unicode string.

For example say I insert a row with a text field containing the string
'Māori' (or "M\x{101}ori" in ASCII Perl).  This string is 5 characters
long, but 6 bytes long because the second character is a multibyte
character.

I found that if I subsequently queried that value what I got back was
effectively a binary string.  It's the same 6 bytes I inserted but
without the UTF8 flag set on the Perl scalar, so it appears to be a 6
character string.

I could use Encode::decode to translate that into a scalar with the UTF8
flag set but that's a bit tedious.

It turns out that the Postgres driver (DBD::Pg) has a flag which can be
set to do the decoding automatically, e.g.:

  my $attr = {
      RaiseError     => 1,
      AutoCommit     => 0,
      pg_enable_utf8 => 1,
  };

  my $dbh = DBI->connect($dsn, $user, $passwd, $attr);

With the pg_enable_utf8 flag set, Unicode strings are round-tripped
correctly automatically.

It's mildly concerning that this flag is labelled "experimental and may
be subject to change" but I'm sure if it was changed it would only be
for the better :-)

Cheers
Grant