[San-Diego-pm] Cold shower in UTF-8

Thierry de Villeneuve thierryv at abac.com
Mon Oct 28 13:56:00 PDT 2013


Hello Joël,

There are several things to remember.

 - First, the MySQL Instance may be setup up with a specific default charset.  It's defined with the "default-character-set" setup parameter. You can discover the default encoding used for the client connections and unspecified table creations

[mysqld]
character-set-server=utf8
default-collation=utf8_unicode_ci

[client]
default-character-set=utf8

  That you can query with :

mysql> SHOW VARIABLES LIKE 'character%';
+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8                                                   |
| character_set_connection | utf8                                                   |
| character_set_database   | utf8                                                   |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8                                                   |
| character_set_server     | utf8                                                   |
| character_set_system     | utf8                                                   |
+--------------------------+--------------------------------------------------------+

   And figure out what charsets are implemented with this MySQL build :

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                 | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 |
| macce    | Mac Central European        | macce_general_ci    |      1 |
| macroman | Mac West European           | macroman_general_ci |      1 |
| cp852    | DOS Central European        | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |      1 |
| utf8mb4  | UTF-8 Unicode               | utf8mb4_general_ci  |      4 |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 |
| utf16    | UTF-16 Unicode              | utf16_general_ci    |      4 |
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 |
| utf32    | UTF-32 Unicode              | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset       | binary              |      1 |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |      1 |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 |
+----------+-----------------------------+---------------------+--------+




 - Then, When the tables are created, the DBA must specify under which charset the table is created and the collation pattern to override any instance level setup.

SET NAMES utf8;

CREATE TABLE `someTable` (
    `qId`          INT unsigned NOT NULL AUTO_INCREMENT,
    `qFileName`    VARCHAR(128) NOT NULL DEFAULT '',
  PRIMARY KEY (`qId`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;


 - Finally, the client connection must specify the character encoding to be used for the data transfers.

  Java ...

    <connection-url>jdbc:mysql://somehost.domain.com:3306/mydb?useUnicode=true&characterEncoding=UTF-8</connection-url>


  Perl ...

    eval {
      my $hsig = set_sig_handler ( 'ALRM', sub { my $canceled = 1; die; }, { mask=>[ qw( INT ALRM ) ] ,safe => 0 } );
      eval {
        alarm ($timeout);
        $l_dbh = DBI->connect($dsn, "$args->{DBuser}", "$args->{DBpswd}", {
          PrintError => 0,             ### Don't report errors via warn( )
          RaiseError => 1,             ### Do report errors via die( )
          AutoCommit => 0,             ### Do not commit automatically Inserts and Updates
          ShowErrorStatement => 1,     ### Do show the statement in error
          mysql_auto_reconnect => 0,   ### Complex issue when using table locking
          mysql_multi_statements => 1, ### To help on calling Stored Procedures. Do not enable server-side prepared statements
          mysql_server_prepare => 0,   ### Do NEVER USE 1 ! It's broken: Signed integers are returned as Unsigned integer !!!
          mysql_enable_utf8 => 1       ### Override instance defaults
        } );
      };
      alarm (0);
      die "$@" if $@;
    };


Now.



On Oct 26, 2013, at 1:22 AM, Joel Fentin wrote:

> I did some web sites long ago. Their owner moved them to Network Solutions. Network Solutions suddenly and without prior notice changed the MySQL character encoding to UTF-8. There are fields in the database which are displayed on webpages. I have some cleanup to do.
> 
> Is there an industry standard for putting CR &/or LF into such a database text field? Or does everyone roll his own?
> 
> Are there an industry standards for áéíñóúÁÉÍÑÓÚ¡¿
> 
> -- 
> Joel Fentin       tel: 760-749-8863
> Biz Website:      http://fentin.com
> Personal Website: http://fentin.com/me
> _______________________________________________
> San-Diego-pm mailing list
> San-Diego-pm at pm.org
> http://mail.pm.org/mailman/listinfo/san-diego-pm
> 



More information about the San-Diego-pm mailing list