[SP-pm] working with UTF-8 and ISO-8859-1 simultaneously

Stanislaw Pusep creaktive at gmail.com
Wed Oct 20 05:44:04 PDT 2010


More or less:

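sub detect_utf8 {
        use bytes;                      # work on octets, not characters

        my $str = $_[0];                # expects a REFERENCE to the octet string

        my $c = 0;                      # current lead octet
        my $b = 0;                      # current continuation octet

        my $bits = 0;                   # octets in the current sequence
        my $len = length ${$str};

        for(my $i = 0; $i < $len; $i++) {

                $c = ord(substr(${$str}, $i, 1));

                # 0x00-0x7F is plain ASCII; 0x80 and up must be checked
                # (the original test was "> 128", which let 0x80 through)
                if ($c >= 128) {

                        if (($c >= 254)) {
                                return 0;       # 0xFE/0xFF never occur
                        } elsif ($c >= 252) {
                                $bits = 6;
                        } elsif ($c >= 248) {
                                $bits = 5;
                        } elsif ($c >= 240) {
                                $bits = 4;
                        } elsif ($c >= 224) {
                                $bits = 3;
                        } elsif ($c >= 192) {
                                $bits = 2;
                        } else {
                                return 0;       # stray continuation octet
                        }

                        if (($i + $bits) > $len) {
                                return 0;       # sequence cut off at the end
                        }

                        # every following octet must look like 10xxxxxx
                        while ($bits > 1) {
                                $i++;
                                $b = ord(substr(${$str}, $i, 1));

                                if ($b < 128 || $b > 191) {
                                        return 0;
                                }

                                $bits--;
                        }
                }
        }

        return 1;
}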


This checks whether an *octet string* conforms to the UTF-8 "protocol". As
you can see, it is not efficient at all. And before using it, you have to
make sure that all of Perl's internal encoding-handling mechanisms are
switched off; otherwise, in the best case, it will produce a pile of
"Wide character..." warnings.
Anyway, never mind efficiency: this was the best solution I found
after *years* of searching :)
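
For comparison, here is a minimal sketch of the same kind of check done with
the core Encode module instead of a hand-written loop. The helper names
is_strict_utf8() and to_characters() are made up for illustration. It assumes
the input is a plain octet string and that anything failing the check should
be treated as ISO-8859-1, which is an assumption of this sketch, not
something Perl decides for you:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $octets is well-formed UTF-8.
sub is_strict_utf8 {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $octets.
    my $ok = eval {
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: decode as UTF-8 when valid, else as ISO-8859-1.
sub to_characters {
    my ($octets) = @_;
    my $encoding = is_strict_utf8($octets) ? 'UTF-8' : 'ISO-8859-1';
    return Encode::decode($encoding, $octets);
}

my $from_utf8   = to_characters("\xC3\xA3");    # UTF-8 octets for "ã"
my $from_latin1 = to_characters("\xE3");        # ISO-8859-1 octet for "ã"
```

Both variables end up holding the same single character, U+00E3. Note that
the strict 'UTF-8' encoding is pickier than detect_utf8() above: it rejects
overlong forms and the old 5- and 6-octet sequences that the loop accepts.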

ABS()



2010/10/20 Solli Honorio <shonorio em gmail.com>

> I have never needed to do this kind of thing, but wouldn't it be enough to
> compare the most significant bits with a bitwise operation?
>
> Solli M. Honório
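
A rough sketch of what that bitwise comparison could look like. The function
name utf8_sequence_length() is hypothetical; it only classifies a single
octet by its top bits, so a full check would still have to walk the string
and count the continuation octets. This sketch stops at 4-octet sequences,
the limit in current UTF-8 (RFC 3629):

```perl
use strict;
use warnings;

# Hypothetical helper: classify one octet by its most significant bits.
# Returns 1 for ASCII, 0 for a continuation octet, the sequence length
# (2..4) for a lead octet, and undef for octets that cannot start one.
sub utf8_sequence_length {
    my ($octet) = @_;
    return 1 if ($octet & 0x80) == 0x00;    # 0xxxxxxx: ASCII
    return 0 if ($octet & 0xC0) == 0x80;    # 10xxxxxx: continuation
    return 2 if ($octet & 0xE0) == 0xC0;    # 110xxxxx: lead of 2 octets
    return 3 if ($octet & 0xF0) == 0xE0;    # 1110xxxx: lead of 3 octets
    return 4 if ($octet & 0xF8) == 0xF0;    # 11110xxx: lead of 4 octets
    return;                                 # 0xF8..0xFF: not a lead octet
}

# Classify the two octets of "ã" in UTF-8.
my @kinds = map { utf8_sequence_length(ord) } split //, "\xC3\xA3";
```

Here @kinds is (2, 0): a 2-octet lead followed by one continuation octet.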
>
> 2010/10/20 Andre Carneiro <andregarciacarneiro em gmail.com>
>
>> I have already tried using that module. It does not always detect the
>> encoding correctly. But since it has been a long time since I last tried it
>> (about two years ago), it may be worth another look, considering that its
>> latest update was this year.
>>
>> And there is a note in that module's documentation:
>>
>> Because of the algorithm used, ISO-8859 series and other single-byte
>> encodings do not work well unless either one of ISO-8859 is the only one
>> suspect (besides ascii and utf8).
>>
>>
>>
>> Cheers!
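
To make the caveat above concrete, here is a small sketch using
Encode::Guess with ISO-8859-1 as the only extra suspect. guess_encoding()
returns an encoding object on success and a plain message string otherwise.
The helper name guess_name() and the rule "prefer utf8 when the message lists
it" are assumptions of this sketch, not part of the module:

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: name the encoding of $octets, with ASCII and UTF-8
# (the module's defaults) plus ISO-8859-1 as the only suspects.
sub guess_name {
    my ($octets) = @_;
    my $guess = guess_encoding($octets, 'iso-8859-1');

    # An object means exactly one suspect survived.
    return $guess->name if ref $guess;

    # Otherwise $guess is a message. Octets that are valid UTF-8 are also
    # valid ISO-8859-1, so both names can come back; this sketch assumes
    # UTF-8 is the better bet in that case.
    return 'utf8' if $guess =~ /\butf8\b/;
    return;
}

my $latin1_name = guess_name("caf\xE9");        # "café" in ISO-8859-1
my $utf8_name   = guess_name("caf\xC3\xA9");    # "café" in UTF-8
```

The ambiguity handled in the middle of guess_name() is exactly why
single-octet encodings "do not work well" when several of them are suspects
at once: almost any octet string is valid in all of them.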
>>
>>
>> 2010/10/19 Solli Honorio <shonorio em gmail.com>
>>
>>> Stanislaw,
>>>
>>> Does http://search.cpan.org/~dankogai/Encode-2.40/lib/Encode/Guess.pm do what you need?
>>>
>>> Solli
>>>
>>> 2010/10/19 Stanislaw Pusep <creaktive em gmail.com>
>>>
>>>> Thanks, Daniel!
>>>> Indeed, it is much more efficient to save the encoded data to a file and
>>>> then open and read it through Perl's "built-in converter" than to do
>>>> crazy conversions with inline buffers.
>>>> Only one question remains: what about detecting the encoding of a string?
>>>> PHP has mb_detect_encoding()
>>>> (http://php.net/manual/en/function.mb-detect-encoding.php, which is where
>>>> I stole my detect_utf8() from); in Perl, neither utf8::is_utf8() nor
>>>> utf8::valid() does that.
>>>>
>>>> ABS()
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Oct 19, 2010 at 01:12, Daniel de Oliveira Mantovani <
>>>> mantovani em perl.org.br> wrote:
>>>>
>>>>> perl -e '{binmode STDOUT,":utf8";use open IO => ":utf8";print uc($_) while <>}' teste.txt
>>>>>
>>>>> "Setting the default encoding
>>>>> You can set the encoding for all streams with the open pragma. If you
>>>>> want to use the same default encoding for all input and output
>>>>> filehandles, you can set them at the same time with the IO setting:
>>>>> use open IO => ':utf8';
>>>>> You can set the default encoding for just output handles with the OUT
>>>>> setting:
>>>>> use open OUT => ':utf8';
>>>>> Similarly, you can set all of the input filehandles to have the encoding
>>>>> that you need:
>>>>> use open IN => ':utf8';
>>>>> You can even set the default encoding for the input and output streams
>>>>> separately, but in the same call to open:
>>>>> use open IN => ":cp1251", OUT => ":shiftjis";
>>>>> The -C switch tells Perl to switch on various Unicode features. You can
>>>>> selectively turn on features by specifying the ones that you want
>>>>> without having to change the source code. If you use that switch with no
>>>>> specifiers, Perl uses UTF-8 for all of the standard filehandles and any
>>>>> that you open yourself:
>>>>> "
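
The same idea can also be applied per handle rather than through the pragma.
A minimal sketch, assuming the input is known to be ISO-8859-1; the temporary
file and its contents are invented for the example:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write "ação" as raw ISO-8859-1 octets to a throwaway file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw';
print {$tmp} "a\xE7\xE3o\n";
close $tmp or die "close failed: $!";

# Read it back through a decoding layer: Perl hands us characters.
open my $in, '<:encoding(ISO-8859-1)', $path
    or die "cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# Encode on the way out: STDOUT receives UTF-8 octets.
binmode STDOUT, ':encoding(UTF-8)';
print uc($line), "\n";
```

Because $line holds decoded characters, uc() uppercases the accented letters
too, and the output layer takes care of writing them as UTF-8.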
>>>>>
>>>>>
>>>>>
>>>>> 2010/10/19 Daniel de Oliveira Mantovani <mantovani em perl.org.br>:
>>>>> > Argh, sorry, I have gone many, many, many hours without sleep.
>>>>> >
>>>>> > perl -Mutf8 -pe 'binmode STDIN, ":utf8";$_=uc' teste.txt
>>>>> >
>>>>> > This is what you need.
>>>>> >
>>>>> > Sorry again.
>>>>> >
>>>>> >
>>>>> > 2010/10/19 Daniel de Oliveira Mantovani <mantovani em perl.org.br>:
>>>>> >> perl -Mutf8 -pe '$_=uc' teste.txt
>>>>> >>
>>>>> >> 2010/10/18 Stanislaw Pusep <creaktive em gmail.com>:
>>>>> >>> Yes, I did :)
>>>>> >>>
>>>>> >>> "The following functions are defined in the utf8:: package by the Perl
>>>>> >>> core. You do not need to say use utf8 to use these and in fact you
>>>>> >>> should not say that unless you really want to have UTF-8 source code."
>>>>> >>>
>>>>> >>> Anyway, I tried this:
>>>>> >>> perl -pe 'utf8::encode($_);$_=uc' teste.txt
>>>>> >>>
>>>>> >>> As expected, it prints the correct characters on screen, but without
>>>>> >>> converting the accented letters to uppercase. Go figure :(
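
A likely explanation, sketched under the assumption that teste.txt holds
UTF-8 octets: utf8::encode() goes from characters to octets, while uc() needs
the opposite direction, utf8::decode(), before it can see "ã" as one letter.
The variable names are illustrative only:

```perl
use strict;
use warnings;

my $octets = "\xC3\xA3";    # the two UTF-8 octets of "ã"

# Without decoding, uc() sees two unrelated octets and changes nothing.
my $unchanged = uc $octets;

# utf8::decode() turns the octets into the single character U+00E3 in
# place; uc() then maps it to U+00C3 ("Ã").
my $chars = $octets;
utf8::decode($chars) or die "not valid UTF-8";
my $upper = uc $chars;

# Back to octets for output.
utf8::encode($upper);
```

After this, $upper holds the octets C3 83, which is "Ã" in UTF-8, while
$unchanged is still C3 A3.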
>>>>> >>>
>>>>> >>> ABS()
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> 2010/10/18 Daniel de Oliveira Mantovani <mantovani em perl.org.br>
>>>>> >>>>
>>>>> >>>> Did you read the whole manual?
>>>>> >>>>
>>>>> >>>> "Converts in-place the internal octet sequence in the native encoding
>>>>> >>>> (Latin-1 or EBCDIC) to the equivalent character sequence in UTF-X.
>>>>> >>>> $string already encoded as characters does no harm. Returns the number
>>>>> >>>> of octets necessary to represent the string as UTF-X. Can be used to
>>>>> >>>> make sure that the UTF-8 flag is on, so that "\w" or "lc()" work as
>>>>> >>>> Unicode on strings containing characters in the range 0x80-0xFF (on
>>>>> >>>> ASCII and derivatives)."
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> 2010/10/18 Stanislaw Pusep <creaktive em gmail.com>:
>>>>> >>>> > Unfortunately...
>>>>> >>>> >
>>>>> >>>> > http://perldoc.perl.org/utf8.html
>>>>> >>>> > Do not use this pragma for anything else than telling Perl that your
>>>>> >>>> > script is written in UTF-8.
>>>>> >>>> >
>>>>> >>>> > My current reference on Perl and UTF-8 is this one (the Russian
>>>>> >>>> > original, not the translation):
>>>>> >>>> >
>>>>> >>>> > http://translate.google.com/translate?hl=en-US&sl=ru&tl=en&u=http%3A%2F%2Fxpoint.ru%2Fknow-how%2FPerl%2FPodderzhkaUnicode
>>>>> >>>> >
>>>>> >>>> > ABS()
>>>>> >>>> >
>>>>> >>>> >
>>>>> >>>> >
>>>>> >>>> > 2010/10/18 Daniel de Oliveira Mantovani <mantovani em perl.org.br>
>>>>> >>>> >>
>>>>> >>>> >> 2010/10/18 Daniel de Oliveira Mantovani <mantovani em perl.org.br>:
>>>>> >>>> >> <code>
>>>>> >>>> >>  my $text;{$/=$\;$text=<>};
>>>>> >>>> >>  sub do_what_I_want {return uc(@_)};
>>>>> >>>> >>  when (detect_utf8($buf)) {
>>>>> >>>> >>     {
>>>>> >>>> >>        require utf8;
>>>>> >>>> >>        do_what_I_want(...)
>>>>> >>>> >>     }
>>>>> >>>> >>  }
>>>>> >>>> >>
>>>>> >>>> >>  { do_what_I_want(...) }
>>>>> >>>> >> </code>
>>>>> >>>> >>
>>>>> >>>> >> Now it's right.
>>>>> >>>> >>
>>>>> >>>> >> >
>>>>> >>>> >> > /me ;)
>>>>> >>>> >> >
>>>>> >>>> >> >
>>>>> >>>> >> > Search StackOverflow for Perl and encoding; brian d foy gave a
>>>>> >>>> >> > very useful explanation.
>>>>> >>>> >> >
>>>>> >>>> >> > 2010/10/18 Stanislaw Pusep <creaktive em gmail.com>:
>>>>> >>>> >> >> I am sure this subject has come up several times on the list,
>>>>> >>>> >> >> so, ATTENTION: Perl has excellent mechanisms for handling I/O
>>>>> >>>> >> >> in various encodings in the most practical way possible. For
>>>>> >>>> >> >> example, you can take an ISO-8859-1 file from STDIN and send it
>>>>> >>>> >> >> to STDOUT as UTF-8; that is a piece of cake. Whenever you open
>>>>> >>>> >> >> a handle, you just specify what is inside and...
>>>>> >>>> >> >> That is exactly MY problem: I never know in advance what is
>>>>> >>>> >> >> inside :P
>>>>> >>>> >> >> The most viable solution I have found so far is:
>>>>> >>>> >> >>
>>>>> >>>> >> >>         my $buf;
>>>>> >>>> >> >>
>>>>> >>>> >> >>         # slurp the file as raw octets, with no decoding layer
>>>>> >>>> >> >>         eval {
>>>>> >>>> >> >>                 open(TXT, '<', $file) or die "cannot open $file: $!";
>>>>> >>>> >> >>                 binmode TXT, ':bytes';
>>>>> >>>> >> >>                 local $/ = undef;
>>>>> >>>> >> >>
>>>>> >>>> >> >>                 $buf = <TXT>;
>>>>> >>>> >> >>                 close TXT;
>>>>> >>>> >> >>         };
>>>>> >>>> >> >>
>>>>> >>>> >> >>         # detect_utf8() dereferences its argument, so pass a reference
>>>>> >>>> >> >>         my $iconv = Text::Iconv->new(detect_utf8(\$buf) ? 'utf-8' : 'iso-8859-1', 'utf-8');
>>>>> >>>> >> >>         $buf = $iconv->convert($buf);
>>>>> >>>> >> >>
>>>>> >>>> >> >>         # mark the converted octets as UTF-8 characters
>>>>> >>>> >> >>         Encode::_utf8_on($buf);
>>>>> >>>> >> >>
>>>>> >>>> >> >> To explain: I open the file "raw", with no encoding, and load
>>>>> >>>> >> >> its contents into the buffer. Then I use Text::Iconv to convert
>>>>> >>>> >> >> the encoding. A very important detail: even if the data is
>>>>> >>>> >> >> already in UTF-8, Text::Iconv still has to be applied. And that
>>>>> >>>> >> >> is not all: Perl does not recognize the buffer as something
>>>>> >>>> >> >> encoded in UTF-8 until I force the UTF-8 flag on.
>>>>> >>>> >> >> Done! After all that, $buf is authentic UTF-8. I can call uc()
>>>>> >>>> >> >> and "ã" becomes "Ã", and /\w/ matches the accented letters too.
>>>>> >>>> >> >> Here is the complete code: http://tinypaste.com/c3680
>>>>> >>>> >> >> The question is: is there a less inefficient way of doing
>>>>> >>>> >> >> this?
>>>>> >>>> >> >>
>>>>> >>>> >> >> ABS()
>>>>> >>>> >> >>
>>>>> >>>> >> >>
>>>>> >>>> >> >> _______________________________________________
>>>>> >>>> >> >> SaoPaulo-pm mailing list
>>>>> >>>> >> >> SaoPaulo-pm em pm.org
>>>>> >>>> >> >> http://mail.pm.org/mailman/listinfo/saopaulo-pm
>>>>> >>>> >> >>
>>>>> >>>> >> >
>>>>> >>>> >> >
>>>>> >>>> >> >
>>>>> >>>> >> > --
>>>>> >>>> >> > "If you’ve never written anything thoughtful, then you’ve never had
>>>>> >>>> >> > any difficult, important, or interesting thoughts. That’s the secret:
>>>>> >>>> >> > people who don’t write, are people who don’t think."
>>> --
>>> "o animal satisfeito dorme". - Guimarães Rosa
>> --
>> André Garcia Carneiro
>> Perl Analyst/Developer
>> (11)82907780