[Chicago-talk] Validating utf-8.
Jonathan Rockway
jon-chicagotalk at jrock.us
Fri Oct 3 08:49:58 PDT 2008
* On Fri, Oct 03 2008, Jonathan Rockway wrote:
> * On Fri, Oct 03 2008, Elliot Shank wrote:
>> Elliot Shank wrote:
>>> Using the built-in IO layers seems to hide problems, i.e.
>>>
>>> open my $handle, '<:utf8', $file
>>>
>>> doesn't work. If I feed that a binary file which is plainly not utf-8, perl blithely reads the file without complaint.
>>
>> Well, not without warnings, but I don't really want to hook $SIG{__WARN__} looking for specific strings, which is pretty fragile.
>
> If you use Encode::decode directly, you can specify exactly how to
> handle errors:
>
> http://search.cpan.org/~dankogai/Encode-2.26/Encode.pm#Handling_Malformed_Data
>
> I think:
>
> my $string = Encode::decode('utf-8', $octets, Encode::FB_CROAK);
>
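To spell that out a bit, here is a minimal sketch of how I would wrap that
call, assuming you already have the raw octets in memory. The helper name
decode_or_undef is made up for illustration, and adding LEAVE_SRC is my own
choice so that the input is not modified in place.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef when
# $octets is not well-formed UTF-8.
sub decode_or_undef {
    my ($octets) = @_;

    # FB_CROAK makes decode() die on malformed input instead of
    # substituting a replacement character; LEAVE_SRC keeps decode()
    # from modifying $octets in place.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    # eval traps the die, so a malformed input leaves $string undef.
    my $string = eval { Encode::decode('UTF-8', $octets, $check) };
    return $string;
}

my $good = decode_or_undef("caf\xC3\xA9");  # e-acute as two UTF-8 bytes
my $bad  = decode_or_undef("caf\xE9!");     # lone Latin-1 byte, not UTF-8

print "good: ", (defined $good ? "decoded" : "malformed"), "\n";
print "bad:  ", (defined $bad  ? "decoded" : "malformed"), "\n";
```

That way the failure shows up as an ordinary exception you can trap with
eval, and there is no need to go fishing in $SIG{__WARN__}.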
Oh, and one other thing. "utf8" and "UTF-8" are not the same encoding.
UTF-8 can only encode Unicode code points, which stop at U+10FFFF, but
utf8 will use UTF-8's algorithm to encode all "characters" up to
0xffffffffffffffff. So while something may be valid utf8, it might not
be valid UTF-8.
Confused yet? For encoding names, Perl ignores case and treats "-" and
"_" as equal, so you could also refer to UTF-8 as "utf-8" (as I usually
do) or "utf_8" (or "UtF-8" and so on). You can also call UTF-8
"utf-8-strict". The lax one is the spelling with no separator at all:
"utf8".
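If you want to see the difference for yourself, something like the
following should do it. This is only a sketch: the four octets are what
UTF-8's algorithm produces for 0x110000, the first value past the end of
Unicode, and the helper name decodes_ok is my own invention.

```perl
use strict;
use warnings;
use Encode ();

# 0x110000 run through UTF-8's algorithm: one past U+10FFFF, so it is
# not a Unicode code point.
my $octets = "\xF4\x90\x80\x80";

# Hypothetical helper: true if $bytes decodes cleanly under $encoding.
sub decodes_ok {
    my ($encoding, $bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $ok = eval { Encode::decode($encoding, $bytes, $check); 1 };
    return $ok ? 1 : 0;
}

# Show which encoding object each spelling resolves to, and whether it
# accepts the out-of-range sequence.
for my $name ('utf8', 'UTF-8', 'utf-8') {
    my $canonical = Encode::find_encoding($name)->name;
    my $verdict   = decodes_ok($name, $octets) ? 'accepts' : 'rejects';
    print "$name ($canonical) $verdict the octets\n";
}
```

I would expect the lax "utf8" to accept the octets and both hyphenated
spellings to reject them.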
Perl and Unicode are fun.
Regards,
Jonathan Rockway
--
print just => another => perl => hacker => if $,=$"