[Chicago-talk] Validating utf-8.

Jonathan Rockway jon-chicagotalk at jrock.us
Sat Oct 4 07:16:12 PDT 2008


* On Fri, Oct 03 2008, Clyde Forrester wrote:
> Jonathan Rockway wrote:
>> Oh, and one other thing.  "utf8" and "UTF-8" are not the same.  UTF-8
>> can only encode characters in Unicode, but utf8 will use UTF-8's
>> algorithm to encode all "characters" up to 0xffffffffffffffff.
>>
>> So while something may be valid utf8, it might not be valid UTF-8.
>>
>> Confused yet?  Perl ignores case and treats "-" and "_" as equal, so you
>> could also refer to UTF-8 as "utf-8" (as I usually do) or "utf_8" (or
>> "UtF-8" and so on).  You can call UTF-8 "utf-8-strict" also.
>>
>> Perl and Unicode are fun.
>>
> In what context is the utf8 vs. UTF-8 distinction being made?
> Is this a Perl distinction?
> Or is there a big standards organization battle going on here?

This is a Perl thing.  Unicode code points only go up to 0x10FFFF, so
they all fit in 21 bits.  While the UTF-8 encoding algorithm *works* on
numbers bigger than that, the result doesn't make sense as Unicode.  So
in Perl, when you say "UTF-8", it will complain (or die, if you ask it
to) when you do something that doesn't make sense.  If you say "utf8",
it will just use the UTF-8 encoding algorithm to dump whatever data it
has in memory.
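
To make that concrete, here is a minimal sketch, assuming the stock
Encode module and a "character" one past the last Unicode code point.
The variable names are just for illustration, and FB_CROAK is what asks
the strict encoding to die rather than substitute a replacement
character:

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence "not Unicode" warnings for this demo
use Encode qw(encode FB_CROAK LEAVE_SRC);

# One past U+10FFFF, so not a Unicode code point at all.
my $too_big = chr(0x110000);

# Lax "utf8": just runs the encoding algorithm on whatever it is given.
my $lax     = encode("utf8", $too_big);
my $lax_hex = join " ", map { sprintf "%02X", ord } split //, $lax;

# Strict "UTF-8": with FB_CROAK it dies on anything outside Unicode.
# LEAVE_SRC keeps encode() from modifying $too_big in place.
my $strict        = eval { encode("UTF-8", $too_big, FB_CROAK | LEAVE_SRC) };
my $strict_failed = !defined $strict;

print "utf8:  $lax_hex\n";
print "UTF-8: ", ($strict_failed ? "died: $@" : "encoded\n");
```

The lax call should hand back four bytes, while the strict call should
end up in the "died" branch.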

Regards,
Jonathan Rockway

--
print just => another => perl => hacker => if $,=$"

