[San-Diego-pm] Cold shower in UTF-8

Thierry de Villeneuve thierryv at abac.com
Mon Oct 28 14:38:18 PDT 2013


(clicked "send" by mistake. No complete)

Hello Joël,

There are several things to remember.

- First, the MySQL Instance may be setup up with a specific default charset.  It's defined with the "default-character-set" setup parameter. You can discover the default encoding used for the client connections and unspecified table creations. This default setup addresses no-charset specified client connections and no-charset specified table creation scripts.

[mysqld]
character-set-server=utf8
default-collation=utf8_unicode_ci

[client]
default-character-set=utf8

 That you can query with :

mysql> SHOW VARIABLES LIKE 'character%';
+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8                                                   |
| character_set_connection | utf8                                                   |
| character_set_database   | utf8                                                   |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8                                                   |
| character_set_server     | utf8                                                   |
| character_set_system     | utf8                                                   |
+--------------------------+--------------------------------------------------------+

  And figure out what charsets are implemented with this MySQL build :

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                 | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 |
| macce    | Mac Central European        | macce_general_ci    |      1 |
| macroman | Mac West European           | macroman_general_ci |      1 |
| cp852    | DOS Central European        | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |      1 |
| utf8mb4  | UTF-8 Unicode               | utf8mb4_general_ci  |      4 |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 |
| utf16    | UTF-16 Unicode              | utf16_general_ci    |      4 |
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 |
| utf32    | UTF-32 Unicode              | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset       | binary              |      1 |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |      1 |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 |
+----------+-----------------------------+---------------------+--------+




- Then, When the tables are created, the DBA must specify under which charset the table is created and the collation pattern to override any instance level setup.

SET NAMES utf8;

CREATE TABLE `someTable` (
   `qId`          INT unsigned NOT NULL AUTO_INCREMENT,
   `qFileName`    VARCHAR(128) NOT NULL DEFAULT '',
 PRIMARY KEY (`qId`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;




- Finally, the client connection must specify the character encoding to be used for the data transfers.

 Java ...

   <connection-url>jdbc:mysql://somehost.domain.com:3306/mydb?useUnicode=true&characterEncoding=UTF-8</connection-url>


 Perl ...

   eval {
     my $hsig = set_sig_handler ( 'ALRM', sub { my $canceled = 1; die; }, { mask=>[ qw( INT ALRM ) ] ,safe => 0 } );
     eval {
       alarm ($timeout);
       $l_dbh = DBI->connect($dsn, "$args->{DBuser}", "$args->{DBpswd}", {
         PrintError => 0,             ### Don't report errors via warn( )
         RaiseError => 1,             ### Do report errors via die( )
         AutoCommit => 0,             ### Do not commit automatically Inserts and Updates
         ShowErrorStatement => 1,     ### Do show the statement in error
         mysql_auto_reconnect => 0,   ### Complex issue when using table locking
         mysql_multi_statements => 1, ### To help on calling Stored Procedures. Do not enable server-side prepared statements
         mysql_server_prepare => 0,   ### Do NEVER USE 1 ! It's broken: Signed integers are returned as Unsigned integer !!!
         mysql_enable_utf8 => 1       ### Override instance defaults
       } );
     };
     alarm (0);
     die "$@" if $@;
   };


Now, if you happen to be no longer capable of reading former table data, it's more likely that one of the default setting of the instance has been changed and your client connection has no charset defined for translations and your tables were created without specifying a charset.

There is not really such thing as "NS has changed the MySQL character encoding". Of what ?

I would recommend you identify what is the instance default setup, as in "first".

Then that to identify how the tables are created, with a    SHOW CREATE TABLE `someTable`;

Then set client connection charset accordingly.


No clue is given on how the "move to NS" happened. Most probably using a poorly crafted mysqldump script.

You may have to extract all data using a mysqldump --no-data script to extract and fix the schema, then a mysqldump --skip-set-charset --no-create-db --no-create-info script to extract the raw data, using text editors to fix the charset.  Finally, insert the data back into a new DB with a repaired schema.



CREATE TABLE `someTableA` (
    `qId`          INT unsigned NOT NULL AUTO_INCREMENT,
    `qSomeName`    VARCHAR(128) NOT NULL DEFAULT '',
  PRIMARY KEY (`qId`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;


CREATE TABLE `someTableB` (
    `qId`          INT unsigned NOT NULL AUTO_INCREMENT,
    `qSomeName`    VARCHAR(128) NOT NULL DEFAULT '',
  PRIMARY KEY (`qId`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;



SHOW CREATE TABLE `someTableA`;

SHOW CREATE TABLE `someTableB`;



I hope this has helped you.

Thierry



On Oct 26, 2013, at 1:22 AM, Joel Fentin wrote:

> I did some web sites long ago. Their owner moved them to Network Solutions. Network Solutions suddenly and without prior notice changed the MySQL character encoding to UTF-8. There are fields in the database which are displayed on webpages. I have some cleanup to do.
> 
> Is there an industry standard for putting CR &/or LF into such a database text field? Or does everyone roll his own?
> 
> Are there an industry standards for áéíñóúÁÉÍÑÓÚ¡¿
> 
> -- 
> Joel Fentin       tel: 760-749-8863
> Biz Website:      http://fentin.com
> Personal Website: http://fentin.com/me
> _______________________________________________
> San-Diego-pm mailing list
> San-Diego-pm at pm.org
> http://mail.pm.org/mailman/listinfo/san-diego-pm
> 



More information about the San-Diego-pm mailing list