[Chicago-talk] Validating utf-8.

Jonathan Rockway jon-chicagotalk at jrock.us
Fri Oct 3 08:49:58 PDT 2008


* On Fri, Oct 03 2008, Jonathan Rockway wrote:
> * On Fri, Oct 03 2008, Elliot Shank wrote:
>> Elliot Shank wrote:
>>> Using the built-in IO layers seems to hide problems, i.e.
>>>
>>>    open my $handle, '<:utf8', $file
>>>
>>> doesn't work.  If I feed that a binary file which is plainly not utf-8, perl blithely reads the file without complaint.
>>
>> Well, not without warnings, but I don't really want to hook $SIG{__WARN__} looking for specific strings, which is pretty fragile.
>
> If you use Encode::decode directly, you can specify exactly how to
> handle errors:
>
>   http://search.cpan.org/~dankogai/Encode-2.26/Encode.pm#Handling_Malformed_Data
>
> I think:
>
>   my $string = Encode::decode('utf-8', $octets, Encode::FB_CROAK);
>
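
Here is a minimal sketch of how that decode call could be wrapped into a validity check, assuming the file has already been read as raw octets.  The helper name is_valid_utf8 is made up for illustration; Encode::decode and Encode::FB_CROAK are the real Encode API.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): true if $octets is strict UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;    # work on a copy; decode with a CHECK value may modify its argument

    # FB_CROAK makes decode die on the first malformed sequence,
    # so eval turns that into a plain true/false answer.
    return eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK); 1 } ? 1 : 0;
}

for my $octets ("caf\xC3\xA9", "\xFF\xFE\x00") {
    my $verdict = is_valid_utf8($octets) ? 'ok' : 'malformed';
    print "$verdict\n";
}
```

This avoids hooking $SIG{__WARN__}: the failure arrives as an exception, which eval catches.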

Oh, and one other thing.  "utf8" and "UTF-8" are not the same.  UTF-8
can only encode characters in Unicode (code points up to U+10FFFF,
excluding the surrogates), but utf8 will use UTF-8's algorithm to
encode any "character" perl can hold, up to 0xffffffffffffffff on a
64-bit perl.

So while something may be valid utf8, it might not be valid UTF-8.
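
A rough sketch of the difference, assuming a reasonably recent Encode.  The octets below are what UTF-8's algorithm produces for 0x110000, one past the last Unicode code point; decodes_ok is a made-up helper name, not an Encode function.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $octets decodes under the named encoding.
sub decodes_ok {
    my ($encoding, $octets) = @_;
    return eval { Encode::decode($encoding, $octets, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# 0x110000 run through the UTF-8 algorithm: F4 90 80 80.
my $beyond_unicode = "\xF4\x90\x80\x80";

for my $encoding ('utf8', 'UTF-8') {
    my $verdict = decodes_ok($encoding, $beyond_unicode) ? 'accepted' : 'rejected';
    print "$encoding: $verdict\n";
}
```

The lax "utf8" should accept these octets while the strict "UTF-8" should croak on them, which is why the strict name is the one to use for validation.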

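Confused yet?  Encode ignores case in encoding names and treats "-" and
"_" as equal, so you could also refer to UTF-8 as "utf-8" (as I usually
do) or "utf_8" (or "UtF-8" and so on).  You can also call UTF-8
"utf-8-strict".  Only the name with no separator at all ("utf8",
"UTF8") gets you the lax version.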
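
If you want to see which encoding a given spelling actually resolves to, Encode::find_encoding will tell you.  A small sketch; canonical_name is a made-up wrapper, and the list of spellings is just for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: the canonical name Encode resolves a spelling to,
# or undef if Encode does not know the name.
sub canonical_name {
    my ($spelling) = @_;
    my $encoding = Encode::find_encoding($spelling);
    return $encoding ? $encoding->name : undef;
}

for my $spelling (qw(UTF-8 utf-8 utf_8 UtF-8 utf-8-strict utf8 UTF8)) {
    my $name = canonical_name($spelling);
    printf "%-14s => %s\n", $spelling, defined $name ? $name : '(unknown)';
}
```

The spellings with a hyphen or underscore should come back as "utf-8-strict", and the two without a separator as "utf8".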

Perl and Unicode are fun.

Regards,
Jonathan Rockway

--
print just => another => perl => hacker => if $,=$"

