[SP-pm] working with UTF-8 and ISO-8859-1 simultaneously

Andre Carneiro andregarciacarneiro at gmail.com
Wed Oct 20 03:20:06 PDT 2010


I have tried this module before. It does not always detect the encoding
correctly. But since it has been a long time since I last tried it (about
two years ago), it may be worth another look, given that its latest update
was this year.

There is also a note in the module's documentation:

Because of the algorithm used, ISO-8859 series and other single-byte
encodings do not work well unless either one of ISO-8859 is the only one
suspect (besides ascii and utf8).
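
For reference, a minimal sketch of calling Encode::Guess with ISO-8859-1 as
the only extra suspect, as that note suggests. The helper name
guess_encoding_name is made up for illustration. guess_encoding() returns an
encoding object on success and an error string otherwise. Since any byte
sequence is valid ISO-8859-1, input that is valid UTF-8 with accents may come
back as ambiguous, and this sketch then simply returns undef.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper (not part of Encode::Guess): returns the name of the
# guessed encoding, or undef when the guess fails or is ambiguous.
# Only the default suspects plus ISO-8859-1 are considered.
sub guess_encoding_name {
    my ($bytes) = @_;
    my $guess = guess_encoding($bytes, 'iso-8859-1');
    # On failure guess_encoding() returns a plain string, not an object.
    return ref $guess ? $guess->name : undef;
}
```

A caller would still need a rule for the undef case, for example "assume
UTF-8 when the guess is ambiguous".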



Cheers!


2010/10/19 Solli Honorio <shonorio at gmail.com>

> Stanislaw,
>
> Does http://search.cpan.org/~dankogai/Encode-2.40/lib/Encode/Guess.pm do
> what you need?
>
> Solli
>
> 2010/10/19 Stanislaw Pusep <creaktive at gmail.com>
>
>> Thanks, Daniel!
>> Indeed, it is much more efficient to save the encoded data to a file and
>> then open and read it through Perl's "built-in converter" than to do crazy
>> conversions with inline buffers.
>> I have just one question left: what about detecting the encoding of a
>> string? PHP has mb_detect_encoding() (
>> http://php.net/manual/en/function.mb-detect-encoding.php, which is where I
>> stole my detect_utf8() from); in Perl, neither utf8::is_utf8() nor
>> utf8::valid() does that.
>>
>> ABS()
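
One way to approximate that check in Perl is to attempt a strict UTF-8 decode
and see whether it fails. The sketch below is only an illustration: the name
looks_like_utf8 is hypothetical, and it is not the detect_utf8() from the
thread. It assumes the only candidates are UTF-8 and a single-byte encoding.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns 1 when $bytes is well-formed UTF-8, 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    # FB_CROAK makes decode() die on the first malformed sequence.
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}
```

Pure ASCII also passes this test, which is harmless because ASCII decodes to
the same characters under both UTF-8 and ISO-8859-1. A Latin-1 text whose
bytes happen to form valid UTF-8 sequences would be misclassified, though
that is uncommon in ordinary prose.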
>>
>>
>>
>>
>> On Tue, Oct 19, 2010 at 01:12, Daniel de Oliveira Mantovani <
>> mantovani at perl.org.br> wrote:
>>
>>> perl -e '{binmode STDOUT,":utf8";use open IO => ":utf8";print uc($_)
>>> while <>}' teste.txt
>>>
>>> "Setting the default encoding
>>> You can set the encoding for all streams with the open pragma. If you want
>>> to use the same default encoding for all input and output filehandles, you
>>> can set them at the same time with the IO setting:
>>> use open IO => ':utf8';
>>> You can set the default encoding for just output handles with the OUT
>>> setting:
>>> use open OUT => ':utf8';
>>> Similarly, you can set all of the input filehandles to have the encoding
>>> that you need:
>>> use open IN => ':utf8';
>>> You can even set the default encoding for the input and output streams
>>> separately, but in the same call to open:
>>> use open IN => ":cp1251", OUT => ":shiftjis";
>>> The -C switch tells Perl to switch on various Unicode features. You can
>>> selectively turn on features by specifying the ones that you want without
>>> having to change the source code. If you use that switch with no
>>> specifiers, Perl uses UTF-8 for all of the standard filehandles and any
>>> that you open yourself:
>>> "
>>>
>>>
>>>
>>> 2010/10/19 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
>>> > Argh, sorry, I have gone many, many, many hours without sleep.
>>> >
>>> > perl -Mutf8 -pe 'binmode STDIN, ":utf8";$_=uc' texte.txt
>>> >
>>> > This is what you need.
>>> >
>>> > Sorry again.
>>> >
>>> >
>>> > 2010/10/19 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
>>> >> perl -Mutf8 -pe '$_=uc' teste.txt
>>> >>
>>> >> 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>>> >>> Yes, I did :)
>>> >>>
>>> >>> "The following functions are defined in the utf8:: package by the
>>> >>> Perl core. You do not need to say use utf8 to use these and in fact
>>> >>> you should not say that unless you really want to have UTF-8 source
>>> >>> code."
>>> >>>
>>> >>> Anyway, I tried this:
>>> >>> perl -pe 'utf8::encode($_);$_=uc' teste.txt
>>> >>>
>>> >>> As expected, it prints the correct characters on screen. However, it
>>> >>> does not convert the accented letters to uppercase. Go figure :(
>>> >>>
>>> >>> ABS()
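
A likely explanation for that result: utf8::encode() turns characters into
bytes, while uc() needs the opposite, bytes decoded into characters. The
sketch below assumes the input is UTF-8 and uses a hypothetical helper name,
uc_utf8_bytes. It decodes, uppercases, then encodes again for output.

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase text that is held as UTF-8 bytes.
sub uc_utf8_bytes {
    my ($bytes) = @_;
    my $text = $bytes;
    # Bytes -> characters; utf8::decode() returns false on invalid UTF-8.
    utf8::decode($text) or die "input is not valid UTF-8";
    my $upper = uc $text;    # now operates on characters, accents included
    utf8::encode($upper);    # characters -> UTF-8 bytes for output
    return $upper;
}
```

With the UTF-8 bytes for "ação" as input, this returns the UTF-8 bytes for
"AÇÃO", whereas uc() applied directly to the bytes only changes the ASCII
letters.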
>>> >>>
>>> >>>
>>> >>>
>>> >>> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
>>> >>>>
>>> >>>> Did you read the whole manual?
>>> >>>>
>>> >>>> "Converts in-place the internal octet sequence in the native encoding
>>> >>>> (Latin-1 or EBCDIC) to the equivalent character sequence in UTF-X.
>>> >>>> $string already encoded as characters does no harm. Returns the number
>>> >>>> of octets necessary to represent the string as UTF-X. Can be used to
>>> >>>> make sure that the UTF-8 flag is on, so that "\w" or "lc()" work as
>>> >>>> Unicode on strings containing characters in the range 0x80-0xFF (on
>>> >>>> ASCII and derivatives)."
>>> >>>>
>>> >>>>
>>> >>>> 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>>> >>>> > Unfortunately...
>>> >>>> >
>>> >>>> > http://perldoc.perl.org/utf8.html
>>> >>>> > Do not use this pragma for anything else than telling Perl that your
>>> >>>> > script is written in UTF-8.
>>> >>>> >
>>> >>>> > My current reference on Perl and UTF-8 is this one (the Russian
>>> >>>> > original, not the translation):
>>> >>>> >
>>> >>>> > http://translate.google.com/translate?hl=en-US&sl=ru&tl=en&u=http%3A%2F%2Fxpoint.ru%2Fknow-how%2FPerl%2FPodderzhkaUnicode
>>> >>>> >
>>> >>>> > ABS()
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>> > 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
>>> >>>> >>
>>> >>>> >> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
>>> >>>> >> <code>
>>> >>>> >>  my $text = do { local $/; <> };   # slurp all input
>>> >>>> >>  sub do_what_I_want { return uc($_[0]) }   # uc() takes one scalar
>>> >>>> >>  if (detect_utf8($text)) {
>>> >>>> >>     require utf8;
>>> >>>> >>     do_what_I_want(...)
>>> >>>> >>  }
>>> >>>> >>  else {
>>> >>>> >>     do_what_I_want(...)
>>> >>>> >>  }
>>> >>>> >> </code>
>>> >>>> >>
>>> >>>> >> Now it's right.
>>> >>>> >>
>>> >>>> >> >
>>> >>>> >> > /me ;)
>>> >>>> >> >
>>> >>>> >> >
>>> >>>> >> > Search StackOverflow for Perl and encoding; brian d foy gave a
>>> >>>> >> > very useful explanation.
>>> >>>> >> >
>>> >>>> >> > 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>>> >>>> >> >> I am sure this subject has come up on the list several times,
>>> >>>> >> >> so, ATTENTION: Perl has excellent mechanisms for handling I/O
>>> >>>> >> >> in various encodings in the most practical way possible. For
>>> >>>> >> >> example, you can take an ISO-8859-1 file from STDIN and send it
>>> >>>> >> >> to STDOUT as UTF-8; that is a piece of cake. Whenever you open
>>> >>>> >> >> a handle, you just specify what is inside and...
>>> >>>> >> >> And that is exactly MY problem: I never know in advance what is
>>> >>>> >> >> inside :P
>>> >>>> >> >> The most viable solution I have found so far is:
>>> >>>> >> >>
>>> >>>> >> >>         use Text::Iconv;
>>> >>>> >> >>         use Encode ();
>>> >>>> >> >>
>>> >>>> >> >>         my $buf;
>>> >>>> >> >>         eval {
>>> >>>> >> >>                 open(TXT, '<', $file) or die "cannot open $file: $!";
>>> >>>> >> >>                 binmode TXT, ':bytes';   # read raw bytes, no decoding
>>> >>>> >> >>                 local $/ = undef;        # slurp the whole file
>>> >>>> >> >>                 $buf = <TXT>;
>>> >>>> >> >>                 close TXT;
>>> >>>> >> >>         };
>>> >>>> >> >>
>>> >>>> >> >>         # Convert to UTF-8 from whichever encoding was detected.
>>> >>>> >> >>         my $iconv = Text::Iconv->new(detect_utf8($buf) ? 'utf-8' : 'iso-8859-1', 'utf-8');
>>> >>>> >> >>         $buf = $iconv->convert($buf);
>>> >>>> >> >>
>>> >>>> >> >>         # Force the UTF-8 flag so Perl treats the buffer as characters.
>>> >>>> >> >>         Encode::_utf8_on($buf);
>>> >>>> >> >>
>>> >>>> >> >> Explanation: I open the file "raw", with no encoding at all. I
>>> >>>> >> >> load the contents into the buffer. Then I use Text::Iconv to
>>> >>>> >> >> convert the encoding. A very important detail: even if the data
>>> >>>> >> >> is already in UTF-8, Text::Iconv still has to be applied. And
>>> >>>> >> >> that is not all: Perl does not recognize the buffer as
>>> >>>> >> >> UTF-8-encoded until I force the UTF-8 flag on.
>>> >>>> >> >> Done! After all that, $buf is genuine UTF-8. I can call uc() and
>>> >>>> >> >> "ã" becomes "Ã", and /\w/ matches accented letters too.
>>> >>>> >> >> Here is the complete code: http://tinypaste.com/c3680
>>> >>>> >> >> The question is: is there a less inefficient way to do this?
>>> >>>> >> >>
>>> >>>> >> >> ABS()
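
A possible shorter route, sketched under the assumption that the file is
either UTF-8 or ISO-8859-1: slurp the raw bytes, try a strict UTF-8 decode,
and fall back to ISO-8859-1 when it fails. The helper name slurp_as_text is
hypothetical. Since Encode's decode() already returns a character string,
this sketch needs neither Text::Iconv nor Encode::_utf8_on().

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: read $file as raw bytes and return a character string.
sub slurp_as_text {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "cannot open $file: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    $bytes = '' unless defined $bytes;

    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    # Strict decode: dies on malformed UTF-8, so eval yields undef.
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    # Any byte sequence is valid ISO-8859-1, so the fallback cannot fail.
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}
```

The result is a character string either way, so uc() and /\w/ see accented
letters as letters; output handles still need an encoding layer such as
binmode(STDOUT, ':encoding(UTF-8)').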
>>> >>>> >> >>
>>> >>>> >> >>
>>> >>>> >> >> _______________________________________________
>>> >>>> >> >> SaoPaulo-pm mailing list
>>> >>>> >> >> SaoPaulo-pm at pm.org
>>> >>>> >> >> http://mail.pm.org/mailman/listinfo/saopaulo-pm
>>> >>>> >> >>
>>> >>>> >> >
>>> >>>> >> >
>>> >>>> >> >
>>> >>>> >> > --
>>> >>>> >> > "If you’ve never written anything thoughtful, then you’ve never
>>> had
>>> >>>> >> > any difficult, important, or interesting thoughts. That’s the
>>> secret:
>>> >>>> >> > people who don’t write, are people who don’t think."
>>> >>>> >> >
> --
> "The satisfied animal sleeps." - Guimarães Rosa
>
>



-- 
André Garcia Carneiro
Perl Analyst/Developer
(11)82907780