[SP-pm] working with UTF-8 and ISO-8859-1 simultaneously

Stanislaw Pusep creaktive at gmail.com
Tue Oct 19 04:31:15 PDT 2010


Thanks, Daniel!
Indeed, it is much more efficient to save the encoded data to a file and
then open and read it through Perl's "built-in converter" than to do those
crazy conversions with inline buffers.
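
A minimal sketch of that "built-in converter" approach. The sample file, its
contents and the choice of ISO-8859-1 are assumptions made up for
illustration; they are not from the thread. The file is opened through an
:encoding(...) layer, so Perl decodes bytes into characters while reading,
and uc() then handles the accented letters.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # Unicode rules for uc() regardless of internal representation (Perl 5.12+)
use File::Temp qw(tempfile);

# Write a small ISO-8859-1 sample file: the raw bytes "ma\xE7\xE3" spell "maçã".
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':raw';
print {$fh} "ma\xE7\xE3\n";
close $fh or die "cannot close $path: $!";

# Reopen it through a decoding layer: bytes become characters on read.
open my $in, '<:encoding(iso-8859-1)', $path or die "cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# Character-aware uppercase: U+00E7 U+00E3 become U+00C7 U+00C3.
my $upper = uc $line;

# Encode back to UTF-8 on output.
binmode STDOUT, ':encoding(UTF-8)';
print "$upper\n";
```

The same open() call works with '<:encoding(UTF-8)' when the file is known
to be UTF-8; the layer only helps once the encoding is known.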
Only one question remains: how do I detect the encoding of a string? PHP
has mb_detect_encoding() (
http://php.net/manual/en/function.mb-detect-encoding.php, which is where I
stole my detect_utf8() from); in Perl, neither utf8::is_utf8() nor
utf8::valid() does that.
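
One possible way to approximate this in Perl, as a sketch rather than a
definitive answer: try a strict UTF-8 decode with the core Encode module and
fall back to ISO-8859-1 when the bytes are malformed. The helper name
decode_guess() is hypothetical, and the assumption that anything that is not
valid UTF-8 must be ISO-8859-1 comes from this thread's use case, not from
Encode. (utf8::is_utf8() only reports Perl's internal string flag, which is
why it cannot answer the question.)

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (decoded text, name of the encoding assumed).
sub decode_guess {
    my ($bytes) = @_;

    # Strict decode: dies on malformed UTF-8. LEAVE_SRC keeps $bytes intact
    # so the fallback below still sees the original data.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, 'UTF-8') if defined $text;

    # Assumption: anything that is not well-formed UTF-8 is ISO-8859-1.
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}

my ($t1, $e1) = decode_guess("ma\xC3\xA7\xC3\xA3");   # "maçã" as UTF-8 bytes
my ($t2, $e2) = decode_guess("ma\xE7\xE3");           # "maçã" as ISO-8859-1 bytes

binmode STDOUT, ':encoding(UTF-8)';
print "$e1: ", uc($t1), "\n";
print "$e2: ", uc($t2), "\n";
```

This is a heuristic: pure ASCII is valid under both encodings, and a few
ISO-8859-1 byte sequences also happen to be well-formed UTF-8, so a wrong
guess is possible on unusual input.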

ABS()



On Tue, Oct 19, 2010 at 01:12, Daniel de Oliveira Mantovani <mantovani at perl.org.br> wrote:

> perl -e '{binmode STDOUT,":utf8";use open IO => ":utf8";print uc($_) while <>}' teste.txt
>
> "Setting the default encoding
> You can set the encoding for all streams with the open pragma. If you want
> to use the same default encoding for all input and output filehandles, you
> can set them at the same time with the IO setting:
> use open IO => ':utf8';
> You can set the default encoding for just output handles with the OUT
> setting:
> use open OUT => ':utf8';
> Similarly, you can set all of the input filehandles to have the encoding
> that you need:
> use open IN => ':utf8';
> You can even set the default encoding for the input and output streams
> separately, but in the same call to open:
> use open IN => ":cp1251", OUT => ":shiftjis";
> The -C switch tells Perl to switch on various Unicode features. You can
> selectively turn on features by specifying the ones that you want without
> having to change the source code. If you use that switch with no specifiers,
> Perl uses UTF-8 for all of the standard filehandles and any that you open
> yourself:
> "
>
>
>
> 2010/10/19 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
> > Argh, sorry, I have gone many, many, many hours without sleep.
> >
> > perl -Mutf8 -pe 'binmode STDIN, ":utf8";$_=uc' teste.txt
> >
> > This is what you need.
> >
> > Sorry again.
> >
> >
> > 2010/10/19 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
> >> perl -Mutf8 -pe '$_=uc' teste.txt
> >>
> >> 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
> >>> Yes, I did :)
> >>>
> >>> "The following functions are defined in the utf8:: package by the Perl
> >>> core. You do not need to say use utf8 to use these and in fact you
> >>> should not say that unless you really want to have UTF-8 source code."
> >>>
> >>> Anyway, I tried this:
> >>> perl -pe 'utf8::encode($_);$_=uc' teste.txt
> >>>
> >>> As expected, it prints the correct characters on screen. However, it
> >>> does not convert the accented characters to uppercase. Go figure :(
> >>>
> >>> ABS()
> >>>
> >>>
> >>>
> >>> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
> >>>>
> >>>> Did you read the whole manual?
> >>>>
> >>>> "Converts in-place the internal octet sequence in the native encoding
> >>>> (Latin-1 or EBCDIC) to the equivalent character sequence in UTF-X.
> >>>> $string already encoded as characters does no harm. Returns the number
> >>>> of octets necessary to represent the string as UTF-X. Can be used to
> >>>> make sure that the UTF-8 flag is on, so that "\w" or "lc()" work as
> >>>> Unicode on strings containing characters in the range 0x80-0xFF (on
> >>>> ASCII and derivatives)."
> >>>>
> >>>>
> >>>> 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
> >>>> > Unfortunately...
> >>>> >
> >>>> > http://perldoc.perl.org/utf8.html
> >>>> > Do not use this pragma for anything else than telling Perl that your
> >>>> > script is written in UTF-8.
> >>>> >
> >>>> > My current reference on Perl and UTF-8 is this one (the Russian
> >>>> > original, not the translation):
> >>>> >
> >>>> > http://translate.google.com/translate?hl=en-US&sl=ru&tl=en&u=http%3A%2F%2Fxpoint.ru%2Fknow-how%2FPerl%2FPodderzhkaUnicode
> >>>> >
> >>>> > ABS()
> >>>> >
> >>>> >
> >>>> >
> >>>> > 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>
> >>>> >>
> >>>> >> 2010/10/18 Daniel de Oliveira Mantovani <mantovani at perl.org.br>:
> >>>> >> <code>
> >>>> >>  my $text = do { local $/; <> };          # slurp all input
> >>>> >>  sub do_what_I_want { return uc($_[0]) }  # uc() takes a single scalar
> >>>> >>  if (detect_utf8($text)) {
> >>>> >>     {
> >>>> >>        require utf8;
> >>>> >>        do_what_I_want(...)
> >>>> >>     }
> >>>> >>  }
> >>>> >>  else { do_what_I_want(...) }
> >>>> >> </code>
> >>>> >>
> >>>> >> Now it's right.
> >>>> >>
> >>>> >> >
> >>>> >> > /me ;)
> >>>> >> >
> >>>> >> >
> >>>> >> > Search StackOverflow for Perl and encoding; brian d foy gave a
> >>>> >> > very useful explanation.
> >>>> >> >
> >>>> >> > 2010/10/18 Stanislaw Pusep <creaktive at gmail.com>:
> >>>> >> >> I am sure this subject has come up several times on the list, so,
> >>>> >> >> ATTENTION: Perl has excellent mechanisms for handling I/O in
> >>>> >> >> various encodings in the most practical way possible. For example,
> >>>> >> >> taking an ISO-8859-1 file from STDIN and sending it to STDOUT as
> >>>> >> >> UTF-8 is a piece of cake. Whenever you open a handle, you just
> >>>> >> >> specify what is inside and...
> >>>> >> >> That is exactly MY problem: I never know in advance what is
> >>>> >> >> inside :P
> >>>> >> >> The most viable solution I have found so far is:
> >>>> >> >>
> >>>> >> >>         my $buf;
> >>>> >> >>
> >>>> >> >>         eval {
> >>>> >> >>                 open(TXT, '<', $file) or die "cannot open $file: $!";
> >>>> >> >>
> >>>> >> >>                 binmode TXT, ':bytes';   # read raw bytes, no decoding
> >>>> >> >>                 local $/ = undef;        # slurp the whole file
> >>>> >> >>
> >>>> >> >>                 $buf = <TXT>;
> >>>> >> >>                 close TXT;
> >>>> >> >>         };
> >>>> >> >>
> >>>> >> >>         # convert to UTF-8 from whichever encoding was detected
> >>>> >> >>         my $iconv = Text::Iconv->new(detect_utf8($buf) ? 'utf-8' : 'iso-8859-1', 'utf-8');
> >>>> >> >>
> >>>> >> >>         $buf = $iconv->convert($buf);
> >>>> >> >>
> >>>> >> >>         Encode::_utf8_on($buf);          # mark the bytes as UTF-8 characters
> >>>> >> >>
> >>>> >> >> To explain: I open the file "raw", with no encoding. I load its
> >>>> >> >> content into the buffer. Then I use Text::Iconv to convert the
> >>>> >> >> encoding. A very important detail: even if the data is already in
> >>>> >> >> UTF-8, Text::Iconv still has to be applied. And that is not all:
> >>>> >> >> Perl does not recognize the buffer as UTF-8 until I force the
> >>>> >> >> UTF-8 flag on.
> >>>> >> >> Done! After all that, $buf is genuine UTF-8. I can call uc() and
> >>>> >> >> "ã" becomes "Ã", and /\w/ matches accented characters too.
> >>>> >> >> Here is the complete code: http://tinypaste.com/c3680
> >>>> >> >> The question is: is there a less inefficient way to do this?
> >>>> >> >>
> >>>> >> >> ABS()
> >>>> >> >>
> >>>> >> >>
> >>>> >> >> _______________________________________________
> >>>> >> >> SaoPaulo-pm mailing list
> >>>> >> >> SaoPaulo-pm em pm.org
> >>>> >> >> http://mail.pm.org/mailman/listinfo/saopaulo-pm
> >>>> >> >>
> >>>> >> >
> >>>> >> >
> >>>> >> >
> >>>> >> > --
> >>>> >> > "If you’ve never written anything thoughtful, then you’ve never
> >>>> >> > had any difficult, important, or interesting thoughts. That’s the
> >>>> >> > secret: people who don’t write, are people who don’t think."
> >>>> >> >
> >>>> >>