[Za-pm] ascii high order character conversion
Jonathan McKeown
jonathan at hst.org.za
Fri May 9 01:49:42 PDT 2008
On Thursday 08 May 2008 18:20, Anne Wainwright wrote:
> Hi.
>
> I have data prepared on a dos programme that involves high order
> characters, like european letters with umlauts, cedillas, acute and grave
> accents etc.
This brings you into the messy realms of character sets and encodings. If the
original data was prepared under DOS, I would guess the encoding is either
cp850 (Western European) or cp437 (the original IBM PC/US set). I would
further guess cp437 - computers sold here tend to be configured as US rather
than European. The two code pages agree on most accented letters but differ
on some other characters, so it is worth checking a sample of your data.
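
One way to check which code page you have is to decode the same bytes both
ways and compare. This is only a rough sketch: the two sample bytes are made
up for illustration, and it prints code points rather than glyphs so the
result does not depend on your terminal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: byte 0x81 followed by byte 0x9B, as might occur in a DOS file.
my $bytes = "\x81\x9B";

# Decode the same bytes under each candidate code page.
my $as_437 = decode('cp437', $bytes);
my $as_850 = decode('cp850', $bytes);

# Render a decoded string as a list of Unicode code points.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $text;
}

print 'cp437: ', codepoints($as_437), "\n";   # U+00FC U+00A2 (u-umlaut, cent sign)
print 'cp850: ', codepoints($as_850), "\n";   # U+00FC U+00F8 (u-umlaut, o-slash)
```

If the odd characters in your data make sense under one decoding and not the
other, that is probably your source encoding.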
> Not that lines of s/// wouldn't do the job, but I wondered if there was a
> more concise way of programming this to convert either to the plain
> unaccented character or to the correct windows character.
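The Windows character set would be cp1252. This is similar, but not identical,
to latin-1 (iso-8859-1): the two differ only in the range 0x80-0x9F, where
cp1252 has printable characters (curly quotes, the euro sign and so on) and
latin-1 has control codes. Depending on the destination for your data I'd
suggest going for either latin-1 - likely to be supported - or utf-8 -
support is patchy in some tools, although Perl handles it fine.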
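
As a minimal sketch of the conversion itself, assuming the source really is
cp437: decode the DOS bytes into Perl's internal character string, then
encode to whichever target you settle on. The sample string is invented for
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Mueller" with a u-umlaut, written as cp437 bytes (assumed source encoding).
my $dos_bytes = "M\x81ller";

# Step 1: bytes -> characters.
my $text = decode('cp437', $dos_bytes);

# Step 2: characters -> bytes in each candidate target encoding.
my $latin1 = encode('iso-8859-1', $text);
my $cp1252 = encode('cp1252',     $text);
my $utf8   = encode('UTF-8',      $text);

# Hex dumps make the difference visible regardless of terminal settings.
print 'latin-1: ', unpack('H*', $latin1), "\n";   # 4dfc6c6c6572
print 'cp1252:  ', unpack('H*', $cp1252), "\n";   # 4dfc6c6c6572
print 'utf-8:   ', unpack('H*', $utf8),   "\n";   # 4dc3bc6c6c6572
```

For ordinary accented letters latin-1 and cp1252 give the same bytes; utf-8
uses two bytes for each of them.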
> [maybe I must study the "perlebcdic Considerations for running Perl on
> EBCDIC platforms" found on CPAN which looks like it might be a guide.
> suggests tr/// , will absorb this evening]
For changing fixed characters into other fixed characters tr/// is much faster
than a chain of s/// substitutions, since it does the whole mapping in one
pass. Bear in mind you've got a huge amount of documentation (including
perlebcdic) installed with your perl - try perldoc perltoc for a table of
contents.
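
If plain unaccented characters are all you need, a tr/// on the raw bytes
might look something like this. It is a sketch only: the mapping assumes
cp437 input and covers just five letters, so you would have to extend both
lists to suit your data.

```perl
use strict;
use warnings;

# Hypothetical sample line in cp437 bytes:
# 0x81 = u-umlaut, 0x82 = e-acute, 0x87 = c-cedilla.
my $line = "M\x81ller caf\x82 gar\x87on";

# Map each accented cp437 byte to its plain ASCII letter in one pass.
# Left list: u-umlaut, e-acute, a-umlaut, c-cedilla, o-umlaut.
(my $plain = $line) =~ tr/\x81\x82\x84\x87\x94/ueaco/;

print "$plain\n";   # Muller cafe garcon
```

The two lists in tr/// are matched position by position, so they must be kept
the same length and in the same order.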
> Had hoped for a ready module from CPAN, but see nothing.
Have a look at Encode and family, which have been in the core since perl
5.7.3 (dev version of 5.8); also PerlIO layers, which will do the conversion
on the fly as you read or write a file (I think that may be a 5.8 thing too).
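
A rough sketch of the PerlIO approach, again assuming cp437 input and utf-8
output. The file names are placeholders in a temporary directory, and the
script writes its own one-line sample file so that it can be run as is.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Placeholder files in a scratch directory (names are hypothetical).
my $dir      = tempdir(CLEANUP => 1);
my $dos_file = "$dir/dos.txt";
my $new_file = "$dir/utf8.txt";

# Create a sample "DOS" file containing cp437 bytes.
open my $raw, '>:raw', $dos_file or die "Cannot write $dos_file: $!";
print {$raw} "M\x81ller\n";
close $raw or die "Cannot close $dos_file: $!";

# The layers do the work: decode on input, encode on output.
open my $src, '<:encoding(cp437)', $dos_file or die "Cannot read $dos_file: $!";
open my $dst, '>:encoding(UTF-8)', $new_file or die "Cannot write $new_file: $!";
while (my $line = <$src>) {
    print {$dst} $line;
}
close $src;
close $dst or die "Cannot close $new_file: $!";

# Read the result back as raw bytes to inspect what was written.
open my $check, '<:raw', $new_file or die "Cannot read $new_file: $!";
my $result = do { local $/; <$check> };
close $check;

print unpack('H*', $result), "\n";   # 4dc3bc6c6c65720a
```

Inside the loop the lines are ordinary character strings, so any tr/// or
regex work can happen there without worrying about the byte encoding.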
Jonathan