[SP-pm] working with UTF-8 and ISO-8859-1 simultaneously

Solli Honorio shonorio at gmail.com
Wed Oct 20 05:31:54 PDT 2010


I've never had to do this myself, but wouldn't it just be a matter of
checking the most significant bits of each byte with a bitwise operation?
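A minimal sketch of that idea, assuming the input is a plain byte string. The helper name looks_like_utf8 is made up for illustration, and the check only looks at the lead/continuation bit patterns, so it does not reject overlong forms or surrogates:

```perl
use strict;
use warnings;

# Hypothetical helper: true when every byte fits the UTF-8 bit patterns
# (lead byte followed by the right number of 10xxxxxx continuation bytes).
sub looks_like_utf8 {
    my ($octets) = @_;
    my @bytes = unpack 'C*', $octets;
    my $i = 0;
    while ($i < @bytes) {
        my $byte = $bytes[$i++];
        my $follow;
        if    (($byte & 0x80) == 0x00) { $follow = 0 }   # 0xxxxxxx: ASCII
        elsif (($byte & 0xE0) == 0xC0) { $follow = 1 }   # 110xxxxx
        elsif (($byte & 0xF0) == 0xE0) { $follow = 2 }   # 1110xxxx
        elsif (($byte & 0xF8) == 0xF0) { $follow = 3 }   # 11110xxx
        else                           { return 0 }      # stray continuation or invalid lead
        for (1 .. $follow) {
            return 0 if $i >= @bytes;                    # truncated sequence
            return 0 if ($bytes[$i++] & 0xC0) != 0x80;   # must be 10xxxxxx
        }
    }
    return 1;
}

print looks_like_utf8("ma\xC3\xA7\xC3\xA3") ? "utf-8\n" : "not utf-8\n";
```

This is only a heuristic: ISO-8859-1 text whose high bytes happen to form valid UTF-8 sequences (the two bytes of "Ã£", for instance) would pass as UTF-8.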

Solli M. Honório

2010/10/20 Andre Carneiro <andregarciacarneiro at gmail.com>

> I've tried this module before. It doesn't always detect the encoding
> correctly. But since it has been a long time since I last used it (about
> two years ago), it may be worth another look, given that its latest
> release was this year.
>
> And there is a note in the module's documentation:
>
> Because of the algorithm used, ISO-8859 series and other single-byte
> encodings do not work well unless either one of ISO-8859 is the only one
> suspect (besides ascii and utf8).
>
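To make that documentation note concrete, here is a small sketch that calls Encode::Guess with latin1 as the only extra suspect. The helper guess_name is hypothetical; guess_encoding() itself returns an encoding object on success and a plain error string otherwise:

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

# Hypothetical helper: returns the guessed encoding name, or undef when
# Encode::Guess could not settle on a single candidate.
sub guess_name {
    my ($octets) = @_;
    my $guess = guess_encoding($octets, 'latin1');   # suspects: ascii, utf8, latin1
    return ref $guess ? $guess->name : undef;
}

for my $sample ("cafe com leite", "caf\xE9 com leite", "caf\xC3\xA9 com leite") {
    my $name = guess_name($sample);
    print defined $name ? $name : 'ambiguous or unknown', "\n";
}
```

As far as I can tell, the weak spot is the third sample: any byte string decodes as ISO-8859-1, so valid UTF-8 with accented characters matches both utf8 and latin1 and the guess comes back ambiguous. The module only helps here when the bytes are not valid UTF-8 at all.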
>
>
> Cheers!
>
>
> 2010/10/19 Solli Honorio <shonorio at gmail.com>
>
>> Stanislaw,
>>
>> Does http://search.cpan.org/~dankogai/Encode-2.40/lib/Encode/Guess.pm do
>> what you need?
>>
>> Solli
>>
>> 2010/10/19 Stanislaw Pusep <creaktive at gmail.com>
>>
>>> Thanks, Daniel!
>>> Indeed, it is much more efficient to save the encoded data to a file and
>>> then open and read it through Perl's built-in converter than to do
>>> crazy conversions with inline buffers.
>>> I have just one question left: what about detecting the encoding of a
>>> string? PHP has mb_detect_encoding()
>>> (http://php.net/manual/en/function.mb-detect-encoding.php, which is where
>>> I stole my detect_utf8() from); in Perl, neither utf8::is_utf8() nor
>>> utf8::valid() does that.
>>>
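A short illustration of why those two functions do not help, assuming raw bytes read without any I/O layer: utf8::is_utf8() only reports Perl's internal flag, while the return value of utf8::decode() can double as a validity check. The variable names are just for the example:

```perl
use strict;
use warnings;

my $octets = "\xC3\xA3";            # the two bytes of UTF-8 "ã", read raw

# The flag is off even though the bytes are perfectly valid UTF-8.
my $flag_before = utf8::is_utf8($octets) ? 1 : 0;

# utf8::decode() converts in place and returns false for malformed UTF-8.
my $copy       = $octets;
my $decoded_ok = utf8::decode($copy) ? 1 : 0;
my $flag_after = utf8::is_utf8($copy) ? 1 : 0;

my $latin1    = "\xE3";             # ISO-8859-1 "ã", not valid UTF-8
my $latin1_ok = utf8::decode($latin1) ? 1 : 0;

print "flag before: $flag_before, decoded: $decoded_ok, ",
      "flag after: $flag_after, latin1 decodes: $latin1_ok\n";
```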
>>> ABS()
>>>
>>>
>>>
>>>
>>> On Tue, Oct 19, 2010 at 01:12, Daniel de Oliveira Mantovani
>>> <mantovani at perl.org.br> wrote:
>>>
>>>> perl -e '{binmode STDOUT,":utf8";use open IO => ":utf8";print uc($_) while <>}' teste.txt
>>>>
>>>> "Setting the default encoding
>>>> You can set the encoding for all streams with the open pragma. If you
>>>> want to use the same default encoding for all input and output
>>>> filehandles, you can set them at the same time with the IO setting:
>>>>     use open IO => ':utf8';
>>>> You can set the default encoding for just output handles with the OUT
>>>> setting:
>>>>     use open OUT => ':utf8';
>>>> Similarly, you can set all of the input filehandles to have the encoding
>>>> that you need:
>>>>     use open IN => ':utf8';
>>>> You can even set the default encoding for the input and output streams
>>>> separately, but in the same call to open:
>>>>     use open IN => ":cp1251", OUT => ":shiftjis";
>>>> The -C switch tells Perl to switch on various Unicode features. You can
>>>> selectively turn on features by specifying the ones that you want without
>>>> having to change the source code. If you use that switch with no
>>>> specifiers, Perl uses UTF-8 for all of the standard filehandles and any
>>>> that you open yourself:
>>>> "
>>>>
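For the case where the encodings ARE known, the same effect can be had per handle instead of through the pragma. A small sketch, using in-memory filehandles as stand-ins for real files so that it runs on its own (the sample data and variable names are assumptions for the example):

```perl
use strict;
use warnings;

my $latin1_in = "ma\xE7\xE3\n";            # stand-in for an ISO-8859-1 file
my $utf8_out  = '';                        # stand-in for the UTF-8 output file

open my $in,  '<:encoding(ISO-8859-1)', \$latin1_in or die "open in: $!";
open my $out, '>:encoding(UTF-8)',      \$utf8_out  or die "open out: $!";

while (my $line = <$in>) {
    print {$out} uc $line;                 # $line holds characters here, so uc() sees "ç" and "ã"
}
close $in;
close $out or die "close out: $!";
```

The input layer decodes on read and the output layer encodes on write, so the code in between only ever deals with characters.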
>>>>
>>>>
>>>> 2010/10/19 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
>>>> > Argh, sorry, I have gone many, many, many hours without sleep.
>>>> >
>>>> > perl -Mutf8 -pe 'binmode STDIN, ":utf8";$_=uc' teste.txt
>>>> >
>>>> > This is what you need.
>>>> >
>>>> > Sorry again.
>>>> >
>>>> >
>>>> > 2010/10/19 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
>>>> >> perl -Mutf8 -pe '$_=uc' teste.txt
>>>> >>
>>>> >> 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>>>> >>> Yes, I did :)
>>>> >>>
>>>> >>> "The following functions are defined in the utf8:: package by the
>>>> >>> Perl core. You do not need to say use utf8 to use these and in fact
>>>> >>> you should not say that unless you really want to have UTF-8 source
>>>> >>> code."
>>>> >>>
>>>> >>> Anyway, I tried this:
>>>> >>> perl -pe 'utf8::encode($_);$_=uc' teste.txt
>>>> >>>
>>>> >>> As expected, it prints the correct characters on screen, but without
>>>> >>> converting the accented letters to uppercase. Go figure :(
>>>> >>>
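A likely explanation, sketched under the assumption that teste.txt holds ISO-8859-1 bytes: utf8::encode() goes from characters to octets, so uc() ends up looking at UTF-8 octets and leaves everything above ASCII alone. Decoding first and encoding only on output gives the expected result:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $line = "ma\xE7\xE3\n";                 # ISO-8859-1 octets for "maçã" (assumed input)

# What the one-liner does: encode first, then uc() sees only UTF-8 octets.
my $encoded_first = $line;
utf8::encode($encoded_first);              # now "ma\xC3\xA7\xC3\xA3\n"
my $wrong = uc $encoded_first;             # only "ma" changes; the accents stay lowercase

# Decode to characters, uppercase, and encode only on the way out.
my $chars = decode('ISO-8859-1', $line);
my $right = encode('UTF-8', uc($chars));   # UTF-8 octets for "MAÇÃ"
```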
>>>> >>> ABS()
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
>>>> >>>>
>>>> >>>> Did you read the whole manual?
>>>> >>>>
>>>> >>>> "Converts in-place the internal octet sequence in the native
>>>> >>>> encoding (Latin-1 or EBCDIC) to the equivalent character sequence in
>>>> >>>> UTF-X. $string already encoded as characters does no harm. Returns
>>>> >>>> the number of octets necessary to represent the string as UTF-X. Can
>>>> >>>> be used to make sure that the UTF-8 flag is on, so that "\w" or
>>>> >>>> "lc()" work as Unicode on strings containing characters in the range
>>>> >>>> 0x80-0xFF (on ASCII and derivatives)."
>>>> >>>>
>>>> >>>>
>>>> >>>> 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>>>> >>>> > Unfortunately...
>>>> >>>> >
>>>> >>>> > http://perldoc.perl.org/utf8.html
>>>> >>>> > Do not use this pragma for anything else than telling Perl that
>>>> >>>> > your script is written in UTF-8.
>>>> >>>> >
>>>> >>>> > My current reference on Perl and UTF-8 is this one (the Russian
>>>> >>>> > original, not the translation):
>>>> >>>> >
>>>> >>>> > http://translate.google.com/translate?hl=en-US&sl=ru&tl=en&u=http%3A%2F%2Fxpoint.ru%2Fknow-how%2FPerl%2FPodderzhkaUnicode
>>>> >>>> >
>>>> >>>> > ABS()
>>>> >>>> >
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
>>>> >>>> >>
>>>> >>>> >> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
>>>> >>>> >> <code>
>>>> >>>> >>  my $text; { local $/; $text = <> };        # slurp the whole input
>>>> >>>> >>  sub do_what_I_want { return uc($_[0]) };   # uc(@_) would uppercase the argument count
>>>> >>>> >>  if (detect_utf8($text)) {
>>>> >>>> >>     {
>>>> >>>> >>        require utf8;
>>>> >>>> >>        do_what_I_want(...)
>>>> >>>> >>     }
>>>> >>>> >>  }
>>>> >>>> >>  else { do_what_I_want(...) }
>>>> >>>> >> </code>
>>>> >>>> >>
>>>> >>>> >> Now it's right.
>>>> >>>> >>
>>>> >>>> >> >
>>>> >>>> >> > /me ;)
>>>> >>>> >> >
>>>> >>>> >> >
>>>> >>>> >> > Search StackOverflow for Perl and encoding; brian d foy gave
>>>> >>>> >> > a very useful explanation.
>>>> >>>> >> >
>>>> >>>> >> > 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>>>> >>>> >> >> I'm sure this subject has come up several times on the list,
>>>> >>>> >> >> so, NOTE: Perl has excellent mechanisms for handling I/O in
>>>> >>>> >> >> various encodings in the most practical way possible. For
>>>> >>>> >> >> example, you can take an ISO-8859-1 file from STDIN and send
>>>> >>>> >> >> it to STDOUT as UTF-8; that's a piece of cake. Whenever you
>>>> >>>> >> >> open a handle, you just specify what is inside it...
>>>> >>>> >> >> And that is MY problem: I never know in advance what is
>>>> >>>> >> >> inside :P
>>>> >>>> >> >> The most viable solution I have found so far is:
>>>> >>>> >> >>
>>>> >>>> >> >>         use Encode;
>>>> >>>> >> >>         use Text::Iconv;
>>>> >>>> >> >>
>>>> >>>> >> >>         my $buf;
>>>> >>>> >> >>
>>>> >>>> >> >>         eval {
>>>> >>>> >> >>                 open(TXT, '<', $file) or die "cannot open $file: $!";
>>>> >>>> >> >>
>>>> >>>> >> >>                 binmode TXT, ':bytes';   # read raw octets, no decoding
>>>> >>>> >> >>                 local $/ = undef;        # slurp mode
>>>> >>>> >> >>
>>>> >>>> >> >>                 $buf = <TXT>;
>>>> >>>> >> >>                 close TXT;
>>>> >>>> >> >>         };
>>>> >>>> >> >>
>>>> >>>> >> >>         # convert to UTF-8, even when the data already is UTF-8
>>>> >>>> >> >>         my $iconv = Text::Iconv->new(detect_utf8($buf) ? 'utf-8' : 'iso-8859-1', 'utf-8');
>>>> >>>> >> >>
>>>> >>>> >> >>         $buf = $iconv->convert($buf);
>>>> >>>> >> >>
>>>> >>>> >> >>         # force the UTF-8 flag so Perl treats the octets as characters
>>>> >>>> >> >>         Encode::_utf8_on($buf);
>>>> >>>> >> >>
>>>> >>>> >> >> To explain: I open the file "raw", with no encoding layer, and
>>>> >>>> >> >> load its contents into the buffer. Then I use Text::Iconv to
>>>> >>>> >> >> convert the encoding. A very important detail: even if the
>>>> >>>> >> >> data is already in UTF-8, Text::Iconv still has to be applied.
>>>> >>>> >> >> And that's not all: Perl does not recognize the buffer as
>>>> >>>> >> >> something encoded in UTF-8 until I force the UTF-8 flag on.
>>>> >>>> >> >> Done! After all that, $buf is genuine UTF-8. I can call uc()
>>>> >>>> >> >> and "ã" becomes "Ã", and /\w/ matches the accented letters too.
>>>> >>>> >> >> Here is the complete code: http://tinypaste.com/c3680
>>>> >>>> >> >> The question is: is there a less inefficient way to do this?
>>>> >>>> >> >>
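One possibly less roundabout variant, sketched with the core Encode module only: try a strict UTF-8 decode and fall back to ISO-8859-1 when it fails. The helper names decode_utf8_or_latin1 and slurp_text are made up for this example, and the fallback assumes ISO-8859-1 is the only other encoding that can show up:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode octets as strict UTF-8, else as ISO-8859-1.
sub decode_utf8_or_latin1 {
    my ($octets) = @_;
    my $chars = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets intact.
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $chars ? $chars : decode('ISO-8859-1', $octets);
}

# Hypothetical helper: slurp a file as raw octets and return characters.
sub slurp_text {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "cannot open $file: $!";
    local $/;                      # slurp mode
    my $octets = <$fh>;
    close $fh;
    return decode_utf8_or_latin1($octets);
}
```

decode() returns a character string directly, so there is no separate conversion pass for data that is already UTF-8 and no need to touch the UTF-8 flag by hand; uc() and /\w/ then work on the result.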
>>>> >>>> >> >> ABS()
>>>> >>>> >> >>
>>>> >>>> >> >>
>>>> >>>> >> >> _______________________________________________
>>>> >>>> >> >> SaoPaulo-pm mailing list
>>>> >>>> >> >> SaoPaulo-pm at pm.org
>>>> >>>> >> >> http://mail.pm.org/mailman/listinfo/saopaulo-pm
>>>> >>>> >> >>
>>>> >>>> >> >
>>>> >>>> >> >
>>>> >>>> >> >
>>>> >>>> >> > --
>>>> >>>> >> > "If you’ve never written anything thoughtful, then you’ve
>>>> never had
>>>> >>>> >> > any difficult, important, or interesting thoughts. That’s the
>>>> secret:
>>>> >>>> >> > people who don’t write, are people who don’t think."
>>>> >>>> >> >
>> --
>> "the satisfied animal sleeps". - Guimarães Rosa
>>
>
>
>
> --
> André Garcia Carneiro
> Perl Analyst/Developer
> (11)82907780