[SP-pm] working with UTF-8 and ISO-8859-1 simultaneously

Solli Honorio shonorio at gmail.com
Tue Oct 19 17:14:23 PDT 2010


Stanislaw,

Does http://search.cpan.org/~dankogai/Encode-2.40/lib/Encode/Guess.pm do what
you need?

Solli
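
A minimal sketch of how Encode::Guess could be applied here, assuming the
input is a byte string that is either UTF-8 or ISO-8859-1. With its default
suspects, guess_encoding() returns an encoding object on success and a plain
error string otherwise, so the sketch treats a failed guess as ISO-8859-1.
The helper name decode_guessing is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: bytes in, character string out.
sub decode_guessing {
    my ($bytes) = @_;
    my $enc = guess_encoding($bytes);
    # An object means the guess succeeded; a plain string is an error message.
    return ref($enc) ? $enc->decode($bytes) : decode('ISO-8859-1', $bytes);
}

my $from_utf8   = decode_guessing("ma\xC3\xA7\xC3\xA3");   # UTF-8 bytes
my $from_latin1 = decode_guessing("ma\xE7\xE3");           # ISO-8859-1 bytes
```

Both calls should yield the same four-character string. Adding ISO-8859-1 as
an explicit suspect would not help: every valid UTF-8 byte sequence is also
valid ISO-8859-1, so the guess would become ambiguous.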

2010/10/19 Stanislaw Pusep <creaktive at gmail.com>

> Thanks, Daniel!
> Indeed, it is much more efficient to save the encoded data to a file and
> then open and read it through Perl's "built-in converter" than to do those
> crazy conversions with inline buffers.
> Only one question remains: how do I detect the encoding of a string? PHP
> has mb_detect_encoding() (
> http://php.net/manual/en/function.mb-detect-encoding.php, which is where I
> stole my detect_utf8() from); in Perl, neither utf8::is_utf8() nor
> utf8::valid() does that.
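
For the two-encoding case in this thread, one possible stand-in for
detect_utf8() is the core function utf8::decode(), which returns false when
its argument is not well-formed UTF-8. The sketch below runs it on a copy so
the original bytes stay untouched. The name looks_like_utf8 is invented here,
and pure ASCII input also counts as well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the byte string is well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;                     # utf8::decode() works in place
    return utf8::decode($copy) ? 1 : 0;
}

# UTF-8 bytes, ISO-8859-1 bytes, plain ASCII.
my @results = map { looks_like_utf8($_) }
    "ma\xC3\xA7\xC3\xA3", "ma\xE7\xE3", "abc";
```

This only answers "UTF-8 or not"; it cannot tell one single-byte encoding
from another.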
>
> ABS()
>
>
>
>
> On Tue, Oct 19, 2010 at 01:12, Daniel de Oliveira Mantovani <
> mantovani at perl.org.br> wrote:
>
>> perl -e '{binmode STDOUT,":utf8";use open IO => ":utf8";print uc($_)
>> while <>}' teste.txt
>>
>> "Setting the default encoding
>> You can set the encoding for all streams with the open pragma. If you want
>> to use the same default encoding for all input and output filehandles, you
>> can set them at the same time with the IO setting:
>>     use open IO => ':utf8';
>> You can set the default encoding for just output handles with the OUT
>> setting:
>>     use open OUT => ':utf8';
>> Similarly, you can set all of the input filehandles to have the encoding
>> that you need:
>>     use open IN => ':utf8';
>> You can even set the default encoding for the input and output streams
>> separately, but in the same call to open:
>>     use open IN => ":cp1251", OUT => ":shiftjis";
>> The -C switch tells Perl to switch on various Unicode features. You can
>> selectively turn on features by specifying the ones that you want without
>> having to change the source code. If you use that switch with no
>> specifiers, Perl uses UTF-8 for all of the standard filehandles and any
>> that you open yourself:
>> "
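
The same layers can also be given per handle instead of as a default. A small
sketch, assuming the input is known to be ISO-8859-1 and the output should be
UTF-8; the temporary sample file exists only to keep the example
self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small ISO-8859-1 sample file (the word "maçã" as raw bytes).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "ma\xE7\xE3\n";
close $tmp or die "cannot close $path: $!";

# Read it back through a decoding layer, so each line holds characters.
open my $in, '<:encoding(ISO-8859-1)', $path or die "cannot open $path: $!";
my @upper = map { uc $_ } <$in>;
close $in;

# Encode on the way out; uc() has already upper-cased the accented letters.
binmode STDOUT, ':encoding(UTF-8)';
print @upper;
```

Because the layer decodes the bytes into characters while reading, uc()
handles the accented letters without any extra conversion step.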
>>
>>
>>
>> 2010/10/19 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
>> > Argh, sorry, I have gone many, many, many hours without sleep.
>> >
>> > perl -Mutf8 -pe 'binmode STDIN, ":utf8";$_=uc' teste.txt
>> >
>> > This is what you need.
>> >
>> > Sorry again.
>> >
>> >
>> > 2010/10/19 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
>> >> perl -Mutf8 -pe '$_=uc' teste.txt
>> >>
>> >> 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>> >>> I did read it :)
>> >>>
>> >>> "The following functions are defined in the utf8:: package by the Perl
>> >>> core. You do not need to say use utf8 to use these and in fact you
>> >>> should not say that unless you really want to have UTF-8 source code."
>> >>>
>> >>> Anyway, I tried this:
>> >>> perl -pe 'utf8::encode($_);$_=uc' teste.txt
>> >>>
>> >>> As expected, it prints the correct characters on screen, but without
>> >>> converting the accented letters to uppercase. Go figure :(
>> >>>
>> >>> ABS()
>> >>>
>> >>>
>> >>>
>> >>> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
>> >>>>
>> >>>> Did you read the whole manual?
>> >>>>
>> >>>> "Converts in-place the internal octet sequence in the native encoding
>> >>>> (Latin-1 or EBCDIC) to the equivalent character sequence in UTF-X.
>> >>>> $string already encoded as characters does no harm. Returns the number
>> >>>> of octets necessary to represent the string as UTF-X. Can be used to
>> >>>> make sure that the UTF-8 flag is on, so that "\w" or "lc()" work as
>> >>>> Unicode on strings containing characters in the range 0x80-0xFF (on
>> >>>> ASCII and derivatives)."
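
The passage quoted above is the description of utf8::upgrade() from the utf8
documentation. A small sketch of the effect it describes, assuming neither
the unicode_strings feature nor a locale is in effect (either would change
how uc() treats the un-upgraded string):

```perl
use strict;
use warnings;

my $s = "\xE3";          # "ã" as one native (Latin-1) byte, UTF-8 flag off
my $before = uc $s;      # byte semantics: only ASCII letters change case

utf8::upgrade($s);       # same character, now stored internally as UTF-8
my $after = uc $s;       # character semantics: "ã" becomes "Ã" (0xC3)

printf "before: %02X, after: %02X\n", ord($before), ord($after);
```

The string holds the same single character before and after the upgrade;
only its internal representation, and with it the casing rules Perl applies,
changes.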
>> >>>>
>> >>>>
>> >>>> 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>> >>>> > Unfortunately...
>> >>>> >
>> >>>> > http://perldoc.perl.org/utf8.html
>> >>>> > Do not use this pragma for anything else than telling Perl that your
>> >>>> > script is written in UTF-8.
>> >>>> >
>> >>>> > My current reference on Perl and UTF-8 is this one (the original in
>> >>>> > Russian, not the translation):
>> >>>> >
>> >>>> > http://translate.google.com/translate?hl=en-US&sl=ru&tl=en&u=http%3A%2F%2Fxpoint.ru%2Fknow-how%2FPerl%2FPodderzhkaUnicode
>> >>>> >
>> >>>> > ABS()
>> >>>> >
>> >>>> >
>> >>>> >
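>> >>>> > 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
>> >>>> >>
>> >>>> >> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
>> >>>> >> <code>
>> >>>> >>  # Slurp all input; local keeps $/ undefined only inside the block.
>> >>>> >>  my $text = do { local $/; <> };
>> >>>> >>  # uc(@_) would uppercase the argument count, so pass one string.
>> >>>> >>  sub do_what_I_want { return uc($_[0]) }
>> >>>> >>  if (detect_utf8($text)) {
>> >>>> >>     # detect_utf8() comes from the original post: decode UTF-8
>> >>>> >>     # bytes into characters before changing case.
>> >>>> >>     utf8::decode($text);
>> >>>> >>  }
>> >>>> >>  else {
>> >>>> >>     # ISO-8859-1 bytes: switch to character semantics for uc().
>> >>>> >>     utf8::upgrade($text);
>> >>>> >>  }
>> >>>> >>  my $result = do_what_I_want($text);
>> >>>> >> </code>
>> >>>> >>
>> >>>> >> Now it's right.
>> >>>> >>
>> >>>> >> >
>> >>>> >> > /me ;)
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > Search StackOverflow for Perl and encoding; brian d foy gave a
>> >>>> >> > very useful explanation there.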
>> >>>> >> >
>> >>>> >> > 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>> >>>> >> >> I am sure this subject has come up several times on the list,
>> >>>> >> >> so, ATTENTION: Perl has excellent mechanisms for handling I/O
>> >>>> >> >> in various encodings in the most practical way possible. For
>> >>>> >> >> example, taking an ISO-8859-1 file from STDIN and sending it to
>> >>>> >> >> STDOUT as UTF-8 is a piece of cake. Whenever you open a handle,
>> >>>> >> >> you just specify what is inside and...
>> >>>> >> >> And that is exactly MY problem: I never know in advance what is
>> >>>> >> >> inside :P
>> >>>> >> >> The most viable solution I have found so far is:
>> >>>> >> >>
>> >>>> >> >>         use Encode;
>> >>>> >> >>         use Text::Iconv;
>> >>>> >> >>
>> >>>> >> >>         my $buf;
>> >>>> >> >>         eval {
>> >>>> >> >>                 open(my $txt, '<', $file)
>> >>>> >> >>                     or die "cannot open $file: $!";
>> >>>> >> >>                 binmode $txt, ':bytes';   # read raw bytes
>> >>>> >> >>                 local $/ = undef;         # slurp the whole file
>> >>>> >> >>                 $buf = <$txt>;
>> >>>> >> >>                 close $txt;
>> >>>> >> >>         };
>> >>>> >> >>
>> >>>> >> >>         # Convert to UTF-8 even when the data already is UTF-8.
>> >>>> >> >>         my $iconv = Text::Iconv->new(
>> >>>> >> >>                 detect_utf8($buf) ? 'utf-8' : 'iso-8859-1',
>> >>>> >> >>                 'utf-8');
>> >>>> >> >>         $buf = $iconv->convert($buf);
>> >>>> >> >>
>> >>>> >> >>         # Force the UTF-8 flag on the converted buffer.
>> >>>> >> >>         Encode::_utf8_on($buf);
>> >>>> >> >>
>> >>>> >> >> To explain: I open the file "raw", with no encoding layer, and
>> >>>> >> >> load its contents into the buffer. Then I use Text::Iconv to
>> >>>> >> >> convert the encoding. A very important detail: even if the data
>> >>>> >> >> is already in UTF-8, Text::Iconv still has to be applied. And
>> >>>> >> >> that is not all: Perl does not treat the buffer as UTF-8 until
>> >>>> >> >> I force the UTF-8 flag on.
>> >>>> >> >> Done! After all that, $buf is genuine UTF-8. I can call uc() and
>> >>>> >> >> "ã" becomes "Ã", and /\w/ matches accented letters too.
>> >>>> >> >> Here is the complete code: http://tinypaste.com/c3680
>> >>>> >> >> The question is: is there a less inefficient way to do this?
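
One less roundabout variant, sketched under the same assumption that the file
is either UTF-8 or ISO-8859-1: read the bytes raw, try a strict UTF-8 decode
with Encode, and fall back to ISO-8859-1 if that dies. Encode::decode()
returns a proper character string, so neither Text::Iconv nor
Encode::_utf8_on() is needed. The helper name slurp_text and the temporary
sample files are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical helper: file name in, decoded character string out.
sub slurp_text {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "cannot open $file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    # FB_CROAK makes decode() die on malformed UTF-8;
    # LEAVE_SRC keeps $bytes intact for the fallback.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# Demo: the same word written once as ISO-8859-1 and once as UTF-8.
my @texts;
for my $bytes ("ma\xE7\xE3", "ma\xC3\xA7\xC3\xA3") {
    my ($tmp, $path) = tempfile(UNLINK => 1);
    binmode $tmp;
    print {$tmp} $bytes;
    close $tmp or die "cannot close $path: $!";
    push @texts, slurp_text($path);
}
```

Both entries of @texts should end up as the same character string, on which
uc() and /\w/ treat the accented letters as letters.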
>> >>>> >> >>
>> >>>> >> >> ABS()
>> >>>> >> >>
>> >>>> >> >>
>> >>>> >> >> _______________________________________________
>> >>>> >> >> SaoPaulo-pm mailing list
>> >>>> >> >> SaoPaulo-pm em pm.org
>> >>>> >> >> http://mail.pm.org/mailman/listinfo/saopaulo-pm
>> >>>> >> >>
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > --
>> >>>> >> > "If you’ve never written anything thoughtful, then you’ve never
>> >>>> >> > had any difficult, important, or interesting thoughts. That’s
>> >>>> >> > the secret: people who don’t write, are people who don’t think."
>> >>>> >> >



-- 
"o animal satisfeito dorme". - Guimarães Rosa