[Za-pm] ascii high order character conversion

Jonathan McKeown jonathan at hst.org.za
Fri May 9 01:49:42 PDT 2008


On Thursday 08 May 2008 18:20, Anne Wainwright wrote:
> Hi.
>
> I have data prepared on a dos programme that involves high order
> characters, like european letters with umlauts, cedillas, acute and grave
> accents etc.

This brings you into the messy realms of character sets and encodings. I would 
guess if the original data was prepared in DOS the encoding would be either 
cp850 (European) or cp437 (extended American). I would further guess cp437 - 
computers sold here tend to be configured as US rather than European.
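
One way to check the guess is to find a byte in the data that the two code pages map differently and see which reading makes sense. A minimal sketch, assuming the Encode module is available and using byte 0x9B purely as an illustration (the %codepoint hash is just a name made up for this example):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte 0x9B is the cent sign in cp437 but o-with-stroke in cp850,
# so it is one of the bytes that can tell the two code pages apart.
my $byte = "\x9B";
my %codepoint;
for my $cp (qw(cp437 cp850)) {
    my $char = decode($cp, $byte);      # bytes -> Perl character
    $codepoint{$cp} = ord $char;
    printf "%s: U+%04X\n", $cp, $codepoint{$cp};
}
```

The accented letters in the range 0x80-0x9A are the same in both code pages, so if your data only uses those the choice may not matter.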

> Not that lines of s/// wouldn't do the job, but I wondered if there was a
> more concise way of programming this to convert either to the plain
> unaccented character or to the correct windows character.

The Windows character set would be cp1252. This is similar, but not identical, 
to latin-1 (iso-8859-1). Depending on the destination for your data I'd 
suggest going for either latin-1 - likely to be supported - or utf-8 - 
support is patchy in some tools, although Perl handles it fine.
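
As a rough sketch of the conversion itself, assuming the source really is cp437 (that is still a guess) and using a made-up sample string: decode the DOS bytes into Perl characters first, then encode to whichever target you settle on.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "cafe" with e-acute, as cp437 bytes (sample data, assumed encoding)
my $dos_bytes = "caf\x82";

my $text   = decode('cp437', $dos_bytes);     # bytes -> Perl characters
my $latin1 = encode('iso-8859-1', $text);     # e-acute becomes byte 0xE9
my $win    = encode('cp1252', $text);         # same byte as latin-1 here
my $utf8   = encode('UTF-8', $text);          # e-acute becomes 0xC3 0xA9
```

Characters with no equivalent in the target (the DOS box-drawing characters, for instance, have none in latin-1) will not survive the trip to latin-1 or cp1252, so utf-8 is the safer target if the data contains any.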

> [maybe I must study the "perlebcdic Considerations for running Perl on
> EBCDIC platforms" found on CPAN which looks like it might be a guide.
> suggests tr///   , will absorb this evening]

For changing fixed characters into other fixed characters tr/// is much faster 
than regexes. Bear in mind you've got a huge amount of documentation 
(including perlebcdic) installed with your perl - try perldoc perltoc for a 
table of contents.
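
For the plain unaccented route, a tr/// over the raw DOS bytes might look something like this. It is only a sketch: the byte values are the ones shared by cp437 and cp850, the sample string is invented, and you would extend both lists to cover whatever letters actually occur in your data.

```perl
use strict;
use warnings;

# Sample DOS bytes: "cafe garcon uber" with e-acute, c-cedilla, u-umlaut
my $dos = "caf\x82 gar\x87on \x81ber";

# Copy, then map each accented byte to its plain ASCII letter:
# 0x81 u-umlaut, 0x82 e-acute, 0x84 a-umlaut, 0x85 a-grave,
# 0x87 c-cedilla, 0x8A e-grave, 0x94 o-umlaut
(my $plain = $dos) =~ tr/\x81\x82\x84\x85\x87\x8A\x94/ueaaceo/;

print "$plain\n";
```

The two lists must line up position by position, so it is worth keeping the comment next to them up to date.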

> Had hoped for a ready module from CPAN, but see nothing.

Have a look at Encode and family, which have been in the core since perl 
5.7.3 (dev version of 5.8); also PerlIO layers, which will do the conversion 
on the fly (I think that may be a 5.8 thing too).
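
A sketch of the layers approach, again assuming cp437 input and utf-8 output. The file names are placeholders and the script writes its own small sample file so that it runs on its own; in practice you would point $in_name at your real data.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($in_name, $out_name) = ("$dir/dos.txt", "$dir/utf8.txt");

# Write a one-line cp437 sample ("cafe" with e-acute) as raw bytes
open my $raw, '>:raw', $in_name or die "$in_name: $!";
print {$raw} "caf\x82\n";
close $raw or die "$in_name: $!";

# The layers do the work: decode on read, encode on write
open my $in,  '<:encoding(cp437)', $in_name  or die "$in_name: $!";
open my $out, '>:encoding(UTF-8)', $out_name or die "$out_name: $!";
print {$out} $_ while <$in>;
close $in;
close $out or die "$out_name: $!";
```

Between the two handles the data is ordinary Perl text, so any tr/// or s/// you want to apply can go inside the loop.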

Jonathan

