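Principle for handling Chinese text in Perl:<br><br>Keep Chinese strings in UTF-8 form inside Perl. If a string arrives in some other encoding, convert it to UTF-8 on the way in, and convert it back to the required encoding on the way out. That keeps string operations such as length, substr, and regex matching working on whole characters rather than on raw bytes.<br><br><div><span class="gmail_quote">On 07-4-15, <b class="gmail_sendername">Dongxu Ma</b> &lt;<a href="mailto:dongxu.ma@gmail.com">dongxu.ma@gmail.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Yes, your input stream was encoded as GB2312.<br><br><div>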
<span class="gmail_quote">On 07-4-15, <b class="gmail_sendername">zongzi</b> &lt;<a href="mailto:honghunter@gmail.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">honghunter@gmail.com</a>&gt; wrote:</span>
<div><span class="e" id="q_111f3f768076c302_1"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
So I have to force an explicit transcoding step?<br><br>On 07-4-13, Dongxu Ma &lt;<a href="mailto:dongxu.ma@gmail.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">dongxu.ma@gmail.com</a>&gt; wrote:<br>> `iconv -f GB2312 -t UTF8 p0.html' showed me the Chinese text inside that HTML,
<br>> which means your script needs to decode from GB2312 when it reads the HTML,
<br>> with something like:<br>><br>> 1. Encode::decode("GB2312", &lt;INPUT&gt;)<br>> 2. binmode INPUT, ":encoding(GB2312)"<br>><br>> On 07-4-13, zongzi &lt;<a href="mailto:honghunter@gmail.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">honghunter@gmail.com</a>&gt; wrote:<br>> > The editor I use is UltraEdit32.<br>> ><br>> > The pages all declare &lt;meta http-equiv="Content-type" content="text/html; charset=gb2312"/&gt;; is any further conversion still needed?<br>> >
<br>> ><br>> > On 07-4-13, Beckheng Lam &lt;<a href="mailto:beckheng@perlchina.org" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">beckheng@perlchina.org</a>&gt; wrote:<br>> > > Could this be related to GBK or UTF-8?
<br>> > ><br>> > > 缘起和合 wrote:<br>> > > Which editor did this? It really is a mess; try Vim.
<br>> > ><br>> > > On 4/12/07, zongzi &lt;<a href="mailto:honghunter@gmail.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">honghunter@gmail.com</a>&gt; wrote:<br>> > > >
<br>> > > > To make it easier to read novels on my PDA, I used wget to download the web pages (from the Sina reading channel) to my machine, and then used Perl to extract the body text from them.
<br>> > > ><br>> > > > The resulting txt file contains a lot of garbled characters (when opened in Notepad).<br>> > > ><br>> > > > How can I fix this?<br>> > > ><br>> > > ><br>> > > > My code is attached. It is very messy, sorry about that.
<br>> > > ><br>> > > > --<br>> > > > This is a rich man's world, utterly different from my own!<br>> > > ><br>> > > > _______________________________________________<br>> > > > China-pm mailing list
<br>> > > > <a href="mailto:China-pm@pm.org" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">China-pm@pm.org</a><br>> > > > <a href="http://mail.pm.org/mailman/listinfo/china-pm" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">
http://mail.pm.org/mailman/listinfo/china-pm</a><br>> > > >
<br>> > > ><br>> > ><br>> > ><br>> > ><br>> > > --<br>> > > ------======Nerazzurri======------<br>> > > ________________________________<br>> > >
<br>
<br>> ><br>> ><br>> > --<br>> > This is a rich man's world, utterly different from my own!
<br>><br>><br>><br>> --<br>> cheers,<br>> -dongxu<br>> __END__<br>> <a href="http://search.cpan.org/%7Edongxu" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">
http://search.cpan.org/~dongxu</a><br>><br><br><br>
--<br>This is a rich man's world, utterly different from my own!<br></blockquote></span></div></div><br><br clear="all"><br>-- <div><span class="e" id="q_111f3f768076c302_3"><br>cheers,<br>-dongxu<br>__END__<br><a href="http://search.cpan.org/%7Edongxu" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">
http://search.cpan.org/~dongxu</a>
</span></div><br></blockquote></div><br><br clear="all"><br>-- <br>---------------------------<br>Achilles Xu<br><a href="http://www.lazycode.org/achilles/">http://www.lazycode.org/achilles/
</a>
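
The rule at the top of this message (convert on the way in, work on the converted string, convert again on the way out) could be sketched roughly as follows, using the two decoding options quoted in the thread: Encode::decode and binmode with an :encoding layer. This is only a minimal illustration, not the code from the original attachment. The sample bytes and all variable names are assumptions made up for the example, and an in-memory filehandle stands in for the downloaded HTML file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample input: the GB2312 bytes for the two characters "中文".
my $gb_bytes = "\xD6\xD0\xCE\xC4";

# Option 1 from the thread: decode the raw bytes explicitly.
my $text = decode('GB2312', $gb_bytes);

# Option 2 from the thread: let the filehandle decode while reading.
# An in-memory filehandle is used here in place of a real file.
open my $in, '<', \$gb_bytes or die "open failed: $!";
binmode $in, ':encoding(GB2312)';
my $line = <$in>;
close $in;

# Inside the program, length now counts characters instead of bytes.
my $byte_count = length $gb_bytes;
my $char_count = length $text;

# On the way out, encode to whatever the consumer expects (UTF-8 here).
my $utf8_bytes = encode('UTF-8', $text);

printf "%d bytes in, %d characters, %d bytes out\n",
    $byte_count, $char_count, length $utf8_bytes;
```

For a real file, the same layer can be given directly to open, for example '<:encoding(GB2312)' as the mode, so that every line read from the handle is already decoded.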