[SP-pm] working with UTF-8 and ISO-8859-1 simultaneously

Daniel de Oliveira Mantovani mantovani at perl.org.br
Mon Oct 18 19:35:08 PDT 2010


perl -Mutf8 -pe '$_=uc' teste.txt
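
One caveat: as far as I can tell, `-Mutf8` only declares that the script's own source is UTF-8; it does not decode what is read from teste.txt. A minimal sketch of the difference, assuming the file holds UTF-8 and using the core Encode module (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a UTF-8 file: "ação" encoded in UTF-8.
my $bytes = "a\xC3\xA7\xC3\xA3o";

# Without decoding, uc() only changes the ASCII letters "a" and "o".
my $naive = uc $bytes;

# Decode to characters first; uc() then applies Unicode case mapping.
my $chars = decode('UTF-8', $bytes);
my $upper = encode('UTF-8', uc $chars);   # back to UTF-8 octets for output
```

On the command line, `perl -CSD -pe '$_=uc' teste.txt` should have a similar effect for UTF-8 input, since `-CSD` (see perlrun) requests UTF-8 layers on the standard handles and on the files the program opens.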

2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
> Yes, I read it :)
>
> "The following functions are defined in the utf8:: package by the Perl core.
> You do not need to say use utf8 to use these and in fact you should not say
> that unless you really want to have UTF-8 source code."
>
> Anyway, I tried this:
> perl -pe 'utf8::encode($_);$_=uc' teste.txt
>
> As expected, it prints the correct characters on screen, but without
> converting the accented letters to uppercase. Go figure :(
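
A possible explanation, offered as a sketch rather than a diagnosis: `utf8::encode` goes from characters to UTF-8 octets, so after the call `$_` holds bytes and `uc` leaves everything above ASCII alone. Assuming teste.txt is ISO-8859-1 and the terminal expects UTF-8, that would give readable accents that stay lowercase. Doing the case mapping while the data is still characters, and encoding last, behaves differently (the sample string is made up):

```perl
use strict;
use warnings;

# Assumption: the line came from an ISO-8859-1 file; "\xE3" is "ã" there.
my $line = "n\xE3o";

# Original order: encode first, then uc. uc() now sees UTF-8 octets.
my $encoded_first = $line;
utf8::encode($encoded_first);          # octets "n\xC3\xA3o"
$encoded_first = uc $encoded_first;    # only "n" and "o" change

# Reordered: give the string character semantics, uc, and encode last.
my $encoded_last = $line;
utf8::upgrade($encoded_last);          # same characters, UTF-8 flag on
$encoded_last = uc $encoded_last;      # "ã" becomes "Ã"
utf8::encode($encoded_last);           # octets ready for a UTF-8 terminal
```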
>
> ABS()
>
>
>
> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
>>
>> Did you read the whole manual?
>>
>> "Converts in-place the internal octet sequence in the native encoding
>> (Latin-1 or EBCDIC) to the equivalent character sequence in UTF-X.
>> $string already encoded as characters does no harm. Returns the number
>> of octets necessary to represent the string as UTF-X. Can be used to
>> make sure that the UTF-8 flag is on, so that "\w" or "lc()" work as
>> Unicode on strings containing characters in the range 0x80-0xFF (on
>> ASCII and derivatives)."
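
The passage quoted above is the perldoc entry for `utf8::upgrade`. A small sketch of the effect it describes, assuming a perl where `use feature 'unicode_strings'` is not in effect and the data are raw ISO-8859-1 bytes (the sample strings are invented for illustration):

```perl
use strict;
use warnings;

# Assumption: "\xE9" is "é" and "\xC9" is "É", read as raw ISO-8859-1 bytes.
my $word    = "caf\xE9";
my $upper_e = "\xC9";

# Before upgrading, bytes in 0x80-0xFF get native (non-Unicode) semantics.
my $matched_before = ($word =~ /^\w+$/) ? 1 : 0;
my $lower_before   = lc $upper_e;

# After upgrading, the same characters are treated as Unicode.
utf8::upgrade($word);
utf8::upgrade($upper_e);
my $matched_after = ($word =~ /^\w+$/) ? 1 : 0;
my $lower_after   = lc $upper_e;
```

The characters themselves do not change; only the internal representation does, which is why `\w` and `lc()` start seeing "é" and "É" as letters.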
>>
>>
>> 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>> > Unfortunately...
>> >
>> > http://perldoc.perl.org/utf8.html
>> > Do not use this pragma for anything else than telling Perl that your
>> > script is written in UTF-8.
>> >
>> > My current reference on Perl and UTF-8 is this one (the Russian
>> > original, not the translation):
>> >
>> > http://translate.google.com/translate?hl=en-US&sl=ru&tl=en&u=http%3A%2F%2Fxpoint.ru%2Fknow-how%2FPerl%2FPodderzhkaUnicode
>> >
>> > ABS()
>> >
>> >
>> >
>> > 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
>> >>
>> >> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
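>> >> <code>
>> >>  # Slurp the whole input as raw bytes.
>> >>  my $text = do { local $/; <> };
>> >>  sub do_what_I_want { return uc $_[0] }
>> >>  # detect_utf8() is the helper used in the original message below.
>> >>  if (detect_utf8($text)) {
>> >>     utf8::decode($text);     # UTF-8 octets to characters
>> >>  }
>> >>  else {
>> >>     utf8::upgrade($text);    # Latin-1 octets to characters
>> >>  }
>> >>  $text = do_what_I_want($text);
>> >> </code>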
>> >>
>> >> Now it's right.
>> >>
>> >> >
>> >> > /me ;)
>> >> >
>> >> >
>> >> > Search StackOverflow for Perl and encoding; brian d foy gave a
>> >> > very useful explanation there.
>> >> >
>> >> > 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
>> >> >> I'm sure this subject has come up several times on the list, so,
>> >> >> WARNING: Perl has excellent mechanisms for handling I/O in
>> >> >> different encodings in the most practical way possible. For
>> >> >> example, taking an ISO-8859-1 file from STDIN and sending it to
>> >> >> STDOUT as UTF-8 is a piece of cake. Whenever you open a handle,
>> >> >> you just specify what's inside and...
>> >> >> That's exactly MY problem: I never know in advance what's inside :P
>> >> >> The most viable solution I've found so far is:
>> >> >>
>> >> >>         use Text::Iconv;
>> >> >>         use Encode ();
>> >> >>
>> >> >>         my $buf;
>> >> >>         eval {
>> >> >>                 # Read the whole file as raw bytes, with no decoding layer.
>> >> >>                 open(my $txt, '<', $file) or die "cannot open $file: $!";
>> >> >>                 binmode $txt, ':bytes';
>> >> >>                 local $/ = undef;
>> >> >>                 $buf = <$txt>;
>> >> >>                 close $txt;
>> >> >>         };
>> >> >>
>> >> >>         # detect_utf8() is not shown in this excerpt.
>> >> >>         my $iconv = Text::Iconv->new(
>> >> >>                 detect_utf8($buf) ? 'utf-8' : 'iso-8859-1', 'utf-8');
>> >> >>         $buf = $iconv->convert($buf);
>> >> >>
>> >> >>         # Force the UTF-8 flag on so Perl treats $buf as characters.
>> >> >>         Encode::_utf8_on($buf);
>> >> >>
>> >> >> To explain: I open the file "raw", with no encoding, and load its
>> >> >> contents into the buffer. Then I use Text::Iconv to convert the
>> >> >> encoding. A very important detail: even if the data is already
>> >> >> UTF-8, Text::Iconv still has to be applied. And that's not all:
>> >> >> Perl does not treat the buffer as UTF-8 text until I force the
>> >> >> UTF-8 flag on.
>> >> >> Done! After all that, $buf is genuine UTF-8. I can call uc() and
>> >> >> "ã" becomes "Ã", and /\w/ matches the accented letters too.
>> >> >> Here is the complete code: http://tinypaste.com/c3680
>> >> >> The question is: is there a less inefficient way of doing this?
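
One possible alternative, sketched here under assumptions rather than offered as the definitive answer: let the core Encode module do both the detection and the conversion. A strict UTF-8 decode is attempted first, and if the bytes are not valid UTF-8 they are assumed to be ISO-8859-1. The helper name `decode_utf8_or_latin1` is hypothetical, and it stands in for detect_utf8, Text::Iconv and `_utf8_on` in the snippet above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns a character string from bytes that are
# either valid UTF-8 or, failing that, assumed to be ISO-8859-1.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8; LEAVE_SRC keeps
    # $bytes untouched so the fallback still sees the original data.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

my $from_utf8   = decode_utf8_or_latin1("a\xC3\xA7\xC3\xA3o");  # UTF-8 bytes
my $from_latin1 = decode_utf8_or_latin1("a\xE7\xE3o");          # Latin-1 bytes

# Both are now the same four characters, so uc() and /\w/ see the accents.
my $upper = uc $from_latin1;
```

The usual caveat applies: ISO-8859-1 text whose bytes happen to form valid UTF-8 would be misread, although for real Portuguese text that should be rare.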
>> >> >>
>> >> >> ABS()
>> >> >>
>> >> >>
>> >> >> _______________________________________________
>> >> >> SaoPaulo-pm mailing list
>> >> >> SaoPaulo-pm at pm.org
>> >> >> http://mail.pm.org/mailman/listinfo/saopaulo-pm
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > "If you’ve never written anything thoughtful, then you’ve never had
>> >> > any difficult, important, or interesting thoughts. That’s the secret:
>> >> > people who don’t write, are people who don’t think."
>> >> >


