[Melbourne-pm] foreign characters and perlio
Hamish Carpenter
hamish at hamishcarpenter.com
Tue Sep 30 04:24:14 PDT 2008
As a sort of follow-up to Stephen's Perl Mongers talk in August,
I've put the knowledge gained there to good use at work recently and
thought I would share.
Mostly I used the discussions at the meeting and Stephen's handy
links [0] to search out enough information to solve the problem. It
turns out that to import files, they need to be in UTF-8, but ours were
encoded differently, as either ISO-8859-1 or cp437 [1]. We had an ö
(Latin small letter o with diaeresis [2]) to import. The problem didn't
show up immediately, as only some people had characters outside the
US-ASCII range (U+0000 to U+007F). Using hexdump was useful in
determining the encodings, as every character was a single byte.
Once converted to UTF-8, the file grew by a few bytes, because each
non-ASCII character takes more than one byte in UTF-8.
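To make the byte counts concrete, here is a minimal sketch of what
hexdump would show for the ö in each encoding. The helper name
utf8_hex is made up for this illustration; decode, encode and unpack
are the standard Encode and core functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode $bytes from $encoding, re-encode as UTF-8,
# and return the resulting bytes as a lowercase hex string.
sub utf8_hex {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);        # bytes -> characters
    return unpack 'H*', encode('UTF-8', $chars);  # characters -> UTF-8 bytes
}

# ö (U+00F6) is the single byte 0x94 in cp437 and 0xF6 in ISO-8859-1.
for my $case ( [ 'cp437', "\x94" ], [ 'iso-8859-1', "\xF6" ] ) {
    my ($encoding, $byte) = @$case;
    printf "%-10s %s -> %s\n",
        $encoding, unpack('H*', $byte), utf8_hex($encoding, $byte);
}
```

Both input bytes come out as the same two-byte sequence, c3b6, which
is where the extra bytes in the converted file come from.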
The solution was to use perl to convert between the encodings. It
turns out that this was pretty easy and we could do it immediately
as perl is installed. There wasn't an existing conversion tool
installed in our environment.
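The code, version 1:
use strict;
use warnings;
use Encode qw(decode encode);
# Read the raw bytes; :raw prevents any newline translation.
open my $infile, '<:raw', 'datafile.dat' or die "Cannot read datafile.dat: $!";
my $bytes = do { local $/; <$infile> };   # slurp the whole file
my $chars = decode( 'cp437', $bytes );    # cp437 bytes -> Perl characters
my $output = encode( 'UTF-8', $chars );   # Perl characters -> UTF-8 bytes
open my $out_file, '>:raw', 'datafile.utf8' or die "Cannot write datafile.utf8: $!";
print $out_file $output;
close $out_file or die "Cannot close datafile.utf8: $!";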
The code, version 2:
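use strict;
use warnings;
use Encode;
# The :encoding layers decode on read and encode on write;
# see Encode::PerlIO and [3]
open my $infile, '<:encoding(cp437)', 'datafile.dat' or die "Cannot read datafile.dat: $!";
my $data = do { local $/; <$infile> };   # slurp the whole file as characters
open my $out_file, '>:encoding(UTF-8)', 'datafile.utf8' or die "Cannot write datafile.utf8: $!";
print $out_file $data;
close $out_file or die "Cannot close datafile.utf8: $!";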
The code basically slurps in the file and then spits it back out
again. Version 1 uses the conversion routines as discussed by Stephen,
but version 2 uses the new PerlIO layers to do it automagically [3].
When working with files, version 2 avoids the confusing encode/decode
pairing. Or is that decode/encode? It is also slightly shorter, but
that's mostly formatting; I'm sure a golfer could do this in a single line.
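For files too big to slurp comfortably, the same layer trick works
line by line. This is only a sketch under my own assumptions: the sub
name convert_file and its argument order are invented here, and it
always writes UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $src to $dst, re-encoding from $from to UTF-8.
sub convert_file {
    my ($src, $dst, $from) = @_;
    open my $in,  "<:encoding($from)", $src or die "Cannot read $src: $!";
    open my $out, '>:encoding(UTF-8)', $dst or die "Cannot write $dst: $!";
    # Each line is decoded on read and encoded on write by the layers.
    while ( my $line = <$in> ) {
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close $dst: $!";
    return;
}

# Run as a script only when given: source, destination, source encoding.
convert_file(@ARGV) if @ARGV == 3;
```

Saved as, say, convert.pl (the name is arbitrary), it would be run as
"perl convert.pl datafile.dat datafile.utf8 cp437".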
I hope this rings a bell for someone in the future and stops them
from tearing their hair out!
Hamish Carpenter
[0] http://perl.net.au/wiki/Melbourne_Perl_Mongers/Meeting_History_2008#Wednesday.2C_August_13th_2008
[1] http://en.wikipedia.org/wiki/CP437
[2] http://www.fileformat.info/info/unicode/char/00f6/index.htm
[3] http://www.perladvent.org/2004/11th/