[Melbourne-pm] foreign characters and perlio

Hamish Carpenter hamish at hamishcarpenter.com
Tue Sep 30 04:24:14 PDT 2008

As a sort of follow-up to Stephen's Perl Mongers talk in August, 
I've put the knowledge gained there to good use at work recently and 
thought I would share.

Mostly I used the discussions at the meeting and Stephen's handy 
links [0] to search out enough information to solve the problem. It 
turns out that the files we import need to be in UTF-8, but ours were 
encoded differently, as either ISO-8859-1 or cp437 [1]. We had an ö 
(Latin small letter o with diaeresis [2]) to import. The problem 
didn't show up immediately because only some people had characters 
outside the US-ASCII range (U+0000 to U+007F). hexdump was useful 
for working out the encodings, as they were all single-byte. Once 
converted to UTF-8, the file grew by a few bytes, since ö takes two 
bytes in UTF-8 instead of one.
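
To make the hexdump observation concrete, here is a minimal sketch 
that prints the bytes for ö in each of the three encodings mentioned 
above. The hex_bytes helper is my own name for illustration, not 
something from Stephen's talk, and the script assumes the Encode 
module that ships with Perl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00F6, LATIN SMALL LETTER O WITH DIAERESIS, as a Perl character string.
my $char = "\x{F6}";

# Hypothetical helper: render a byte string roughly as hexdump would.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Encode the same character three ways and show the resulting bytes.
for my $encoding ( 'cp437', 'ISO-8859-1', 'UTF-8' ) {
    printf "%-10s %s\n", $encoding, hex_bytes( encode( $encoding, $char ) );
}
```

The two single-byte encodings give one byte each (94 for cp437, f6 
for ISO-8859-1), while UTF-8 gives two (c3 b6), which is why the 
converted file is slightly larger.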

The solution was to use Perl to convert between the encodings. 
There wasn't an existing conversion tool installed in our 
environment, but Perl was, so we could do it immediately, and it 
turned out to be pretty easy.

The code, Version 1:

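     use strict;
     use warnings;
     use Encode qw(decode encode);

     # Read and write raw bytes so no I/O layer alters them.
     open my $infile, '<:raw', 'datafile.dat' or die "Cannot read datafile.dat: $!";
     my $bytes  = do { local $/; <$infile> };   # slurp the whole file
     my $chars  = decode( 'cp437', $bytes );    # cp437 bytes -> characters
     my $output = encode( 'UTF-8', $chars );    # characters -> UTF-8 bytes
     open my $out_file, '>:raw', 'datafile.utf8' or die "Cannot write datafile.utf8: $!";
     print $out_file $output;
     close $out_file or die "Cannot close datafile.utf8: $!";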

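The code, Version 2:

     use strict;
     use warnings;

     # The :encoding layers decode on read and encode on write.
     open my $infile, '<:encoding(cp437)', 'datafile.dat' or die "Cannot read datafile.dat: $!";
     my $data = do { local $/; <$infile> };
     # see Encode::PerlIO and [3]
     open my $out_file, '>:encoding(UTF-8)', 'datafile.utf8' or die "Cannot write datafile.utf8: $!";
     print $out_file $data;
     close $out_file or die "Cannot close datafile.utf8: $!";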

The code basically slurps in the file and then spits it back out 
again. Version 1 uses the conversion routines Stephen discussed, 
while version 2 uses the newer PerlIO layers to do it automagically 
[3]. When working with files, version 2 avoids the confusing 
encode/decode pairing. Or is that decode/encode? It is also slightly 
shorter, but that's mostly formatting; I'm sure a golfer could do 
this in a single line.
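
Since our files came in two different encodings, the layer approach 
can also be wrapped in a small routine that takes the source 
encoding as a parameter. This is only a sketch: convert_to_utf8 and 
its arguments are names I made up for illustration. It copies line 
by line rather than slurping, on the assumption that the files might 
one day be large.

```perl
use strict;
use warnings;

# Hypothetical helper: re-encode the file at $in_path, which is
# assumed to be in $from_encoding, as UTF-8 at $out_path.
sub convert_to_utf8 {
    my ( $in_path, $out_path, $from_encoding ) = @_;

    # Decode on read using the caller-supplied encoding name.
    open my $in, "<:encoding($from_encoding)", $in_path
        or die "Cannot read $in_path: $!";

    # Encode as strict UTF-8 on write.
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";

    # Copy one line at a time instead of holding the whole file in memory.
    while ( my $line = <$in> ) {
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

For the example above this would be called as 
convert_to_utf8( 'datafile.dat', 'datafile.utf8', 'cp437' ), or with 
'ISO-8859-1' for the other kind of file.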

I hope this rings a bell for someone in the future and stops them 
from tearing their hair out!

Hamish Carpenter

[1] http://en.wikipedia.org/wiki/CP437
[2] http://www.fileformat.info/info/unicode/char/00f6/index.htm
[3] http://www.perladvent.org/2004/11th/
