[PerlChina] How can I use a regex to determine whether a variable's content is utf8 or gb2312?

Ken Lam bi.ken.lam at gmail.com
Wed Nov 26 23:42:01 PST 2008


try this?

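use Encode qw(encode decode);

# True if the bytes survive a decode/encode round trip as UTF-8.
sub is_utf8 {
    my $bytes = shift;
    my $round_trip = encode('utf8', decode('utf8', $bytes));

    return ($bytes eq $round_trip) ? 1 : 0;
}

# True if the bytes survive a decode/encode round trip as GBK.
sub is_gbk {
    my $bytes = shift;
    my $round_trip = encode('gbk', decode('gbk', $bytes));

    return ($bytes eq $round_trip) ? 1 : 0;
}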

silent wrote:
> Understood, but why are those two functions of mine wrong?
>
> agentzh wrote:
>   
>> We usually use the Encode::Guess module from CPAN. It works very well on long text, but on very short text, say two or three characters, it is not so accurate, heh.
>>
>>        use Encode::Guess;
>>        my @enc = qw( UTF-8 GB2312 Big5 GBK Latin1 );
>>        for my $enc (@enc) {
>>            my $decoder = guess_encoding($data, $enc);
>>            if (ref $decoder) {
>>                $charset = $decoder->name;
>>                last;
>>            }
>>        }
>>        if (!$charset) {
>>            die "Can't determine the charset of the input.\n";
>>        }
>>
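>> Here @enc is the list of charsets to try. Actually, I think you could also do this
>> directly with Encode's decode function; just pass a parameter that makes it throw an exception when it hits an invalid byte ;)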
>>
>> -agentzh
>>   
>>     
>
> _______________________________________________
> China-pm mailing list
> China-pm at pm.org
> http://mail.pm.org/mailman/listinfo/china-pm
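
For what it's worth, the decode-and-throw idea agentzh mentions in the quoted text could look roughly like the sketch below. The helper names decodes_as and first_matching_charset are made up for illustration and are not from the thread; the only Encode pieces used are decode and the Encode::FB_CROAK check value. Treat it as a rough sketch, not a tested recipe for every input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes decodes under $charset with no
# invalid byte sequences, false otherwise.
sub decodes_as {
    my ($bytes, $charset) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    # FB_CROAK makes decode() die on the first malformed sequence.
    return eval { decode($charset, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: try the candidates in order and return the first
# charset that decodes cleanly, or nothing if none of them does.
sub first_matching_charset {
    my ($bytes, @candidates) = @_;
    for my $charset (@candidates) {
        return $charset if decodes_as($bytes, $charset);
    }
    return;
}

my $gbk_bytes = "\xD6\xD0\xCE\xC4";    # the two characters of "Chinese" in GBK
print first_matching_charset($gbk_bytes, 'UTF-8', 'gbk'), "\n";    # prints "gbk"
```

The order of the candidates matters: plain ASCII decodes cleanly under both charsets, and some byte strings are valid in more than one encoding, so putting the stricter UTF-8 check first seems the safer choice. A clean decode only says the bytes are valid in that charset, not that the text was actually written in it.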
