[Chicago-talk] Validating utf-8.
Jonathan Rockway
jon-chicagotalk at jrock.us
Sat Oct 4 07:16:12 PDT 2008
* On Fri, Oct 03 2008, Clyde Forrester wrote:
> Jonathan Rockway wrote:
>> Oh, and one other thing. "utf8" and "UTF-8" are not the same. UTF-8
>> can only encode characters in Unicode, but utf8 will use UTF-8's
>> algorithm to encode all "characters" up to 0xffffffffffffffff.
>>
>> So while something may be valid utf8, it might not be valid UTF-8.
>>
>> Confused yet? Perl ignores case and treats "-" and "_" as equal, so you
>> could also refer to UTF-8 as "utf-8" (as I usually do) or "utf_8" (or
>> "UtF-8" and so on). You can call UTF-8 "utf-8-strict" also.
>>
>> Perl and Unicode are fun.
>>
> In what context is the utf8 vs. UTF-8 distinction being made?
> Is this a Perl distinction?
> Or is there a big standards organization battle going on here?
This is a Perl thing. Unicode code points only go up to 0x10FFFF, which
fits in 21 bits. So while the UTF-8 encoding algorithm *works* on numbers
bigger than that, the result doesn't make sense as Unicode. So in Perl,
when you say "UTF-8", it will reject anything that doesn't make sense.
If you say "utf8", it will just use the UTF-8 encoding algorithm to dump
whatever data it has in memory.
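
Here is a rough sketch of the difference using the core Encode module.
The variable names are made up for illustration, and I pass FB_CROAK
explicitly because Encode's default check mode substitutes a replacement
character instead of dying:

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence "not Unicode" warnings for this demo
use Encode qw(encode find_encoding);

# Name lookup ignores case and treats "_" like "-".
my $strict_name = find_encoding("UTF-8")->name;    # the strict encoding
my $alias_name  = find_encoding("utf_8")->name;    # same encoding, other spelling
my $lax_name    = find_encoding("utf8")->name;     # Perl's lax variant

# A "character" one past the last Unicode code point (0x10FFFF).
my $beyond = chr(0x110000);

# Lax "utf8" just runs the encoding algorithm on whatever it is given.
my $lax_bytes = encode("utf8", $beyond);

# Strict "UTF-8" refuses; FB_CROAK turns that refusal into an exception.
# LEAVE_SRC keeps encode() from modifying $beyond in place.
my $strict_bytes = eval {
    encode("UTF-8", $beyond, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $strict_failed = !defined $strict_bytes;

printf "%s / %s / %s\n", $strict_name, $alias_name, $lax_name;
printf "lax: %d bytes, strict: %s\n",
    length($lax_bytes), $strict_failed ? "died" : "succeeded";
```

So if you are validating input from the outside world, ask for "UTF-8"
with a check mode; "utf8" is for round-tripping Perl's own strings.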
Regards,
Jonathan Rockway
--
print just => another => perl => hacker => if $,=$"