[tpm] PostgreSQL INSERT/UTF-8 problem

Rob Janes janes.rob at gmail.com
Wed Jul 9 10:51:59 PDT 2008


the code won't fail, but you would do an UPDATE more often that you need to.

the key-value from the database is utf8, "foo=bar"
the key-value from the http form is, say, ebcdic, "foo=bar"

utf8 "bar" does not equal ebcdic "bar"

update database with decode("ebcdic", $input_form{foo}) , which will update
the database with exactly the same string as what's already there.

to save on database updates, one would

my $charset_from_http_header;
... set $charset_from_http_header ...

while (my ($key, $value) = each %new_input_from_http_form)
{
       my $decoded_key = decode($charset_from_http_header, $key);
       my $decoded_value = decode($charset_from_http_header, $value); ## or
$new_input_from_http_form{$key}

       if ( $decoded_value ne $old_input_saved_in_db{$decoded_key} )  ##
important that $decoded_key is used here
       {
               # Update the DB using the '$decoded_value' value.
       }
}

well ... the http form keys are probably ascii, since you would have
designed the form, and therefore $decoded_key is the same as $key.  the
form's key and value pairs will be encoded as per the http header charset.
if the browser is set to an ebcdic code page, you'll be in trouble if you
don't decode the key.  that would be really wierd though, having a browser
doing ebcdic.  however, i wouldn't be surprised if there weren't other code
pages that are not ascii friendly, like maybe mandarin, thai, tibet,
whatever.

the binary representation of an ascii string is the same in latin1 as it is
in any of the iso-8859 dialects, and in utf8.  but if you render that same
string in ebcdic, kansas goes bye bye.

-rob

On Wed, Jul 9, 2008 at 12:50 PM, Madison Kelly <linux at alteeve.com> wrote:

> Rob Janes wrote:
>
>> oops, first reply went to Madison alone ...
>>
>> I think this is better ...
>>
>> use Encode qw(from_to decode);
>>
>> my $data = "Résidence";
>> from_to($data, "iso-8859-1", "utf8"); ## assuming Résidence is encoded in
>> 8859-1
>> or
>> my $data = decode("iso-8859-1", "Résidence");
>>
>> both of these will create a utf8 string from Résidence.  However,
>> depending on the original encoding of Résidence, what's stored in the
>> database may or may not be what you want.
>>
>> In other words, the lack of an error message is not indicative of it
>> working.
>>
>> -rob
>>
>
> I am not sure if this is the most ... appropriate way to solve the problem,
> so I would still love to have some feedback if you (or anyone) has any.
>
> I got it working this way:
>
> - Read the data from the website and push it into a hash (hidden input
> values stored as "$input{name}=value;".
> - Loop through the '%input' hash keys and populate a new hash '%enc_input'
> with the same format.
> - Read the old values from the database into a matching hash called
> '%old_input'.
> - Pseduo-code:
> foreach $key (keys %input)
> {
>        if ( $input{$key} ne $old_input{$key} )
>        {
>                # Update the DB using the '$enc_input' hash value.
>        }
> }
>
> It's ugly as sin, but it seems like the only time I need to use 'Encode'
> functions are in the actual PgSQL INPUT or UPDATE calls; not in the
> comparison of the value either from the HTML page or from the database.
>
> Odd.
>
> Thanks for your help, Rob and Cees!
>
> Madi
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.pm.org/pipermail/toronto-pm/attachments/20080709/065a759b/attachment.html>


More information about the toronto-pm mailing list