[Wellington-pm] More on Unicode and DB

Michael Robinson michael at diaspora.gen.nz
Tue Sep 18 02:52:23 PDT 2007


> For example say I insert a row with a text field containing the string
> 'Māori' (or "M\x{101}ori" in ASCII Perl).  This string is 5 characters
> long, but 6 bytes long because the second character is a multibyte
> character.

Going off on a tangent here, one of the fun facts about Unicode,
particularly for those dealing with Maaori, is that there are "combining
characters".

This means that the following two strings are canonically equivalent:

    Ma\x{304}ori
    M\x{101}ori

That is because 0x304 is the combining character "COMBINING MACRON" (see
unicore/EastAsianWidth.txt in your Perl 5.8 distribution), which modifies
the character immediately before it.
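
A quick way to see this for yourself is the sketch below. It assumes
Unicode::Normalize is installed (it ships with Perl 5.8) and writes the
decomposed form with the combining macron placed after the "a" it modifies.
The variable names are just for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $decomposed  = "Ma\x{304}ori";   # 'a' followed by COMBINING MACRON
my $precomposed = "M\x{101}ori";    # LATIN SMALL LETTER A WITH MACRON

# Compared character by character, the raw strings are different...
my $raw_equal = ($decomposed eq $precomposed) ? 1 : 0;

# ...but they agree once both are in the same normalization form.
my $nfc_equal = (NFC($decomposed) eq NFC($precomposed)) ? 1 : 0;
my $nfd_equal = (NFD($decomposed) eq NFD($precomposed)) ? 1 : 0;

printf "raw: %d, NFC: %d, NFD: %d\n", $raw_equal, $nfc_equal, $nfd_equal;
printf "lengths: %d vs %d\n", length($decomposed), length($precomposed);
```

So a plain "eq" (or a byte-wise comparison in the database) will treat the
two spellings as different unless you normalize both sides first.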

So, if you're trying to canonicalize text input, say in a search engine
dealing with Maaori source documents, and you need to deal with the fact
that some people input macrons, and some don't, then you probably also
need to consult Unicode::Normalize.
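
As a rough sketch of what that might look like (search_key is a made-up
helper name, not something from Unicode::Normalize, and a real search
engine would likely want more than this): decompose with NFD, throw away
the combining macron, and lower-case what is left.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: reduce a string to a macron-free, lower-case key.
sub search_key {
    my ($text) = @_;
    my $key = NFD($text);    # split precomposed letters into base + mark
    $key =~ s/\x{304}//g;    # drop COMBINING MACRON only
    return lc $key;
}

my @inputs = ("M\x{101}ori", "Ma\x{304}ori", "Maori");
my @keys   = map { search_key($_) } @inputs;
print join(", ", @keys), "\n";
```

All three inputs should come out as the same key, so documents and queries
can be matched on the key while the original text is kept for display.
Note that this only handles macrons; double-vowel spellings such as
"Maaori" would need a separate rule.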

(Not that I'm bringing back bad memories here, or anything :).

    -- michael.


