[Za-pm] ascii high order character conversion

Anne Wainwright aesop at fables.co.za
Tue May 13 03:20:41 PDT 2008


Jonathan & Tielman,

Thanks a lot, I think that I can make something out of that. The input is a 
DOS file (untainted by Windows) and I am sure that you are right about 
cp437.
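
A minimal sketch of how the two suggestions might combine, assuming the input really is cp437: decode the bytes with the core Encode module, decompose with Unicode::Normalize, and drop the combining marks. The helper name unaccent_cp437 is made up for illustration. Characters with no decomposition (sharp s, the box-drawing characters) pass through unchanged, so they would still need their own handling.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: raw cp437 bytes in, unaccented characters out.
sub unaccent_cp437 {
    my ($bytes) = @_;
    # bytes -> Perl characters, then split each letter from its accent
    my $chars = NFD(decode('cp437', $bytes));
    # remove the combining marks (the accents) that NFD separated out
    $chars =~ s/\p{Mn}//g;
    return $chars;
}

# In cp437, 0x87 is c-cedilla and 0x81 is u-umlaut.
print unaccent_cp437("gar\x87on, M\x81nchen"), "\n";
```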

I didn't have perldoc installed until very recently; somehow it never got 
installed. So I have only just realised what a great resource it is.

I do note that when using an editor in Linux, either gedit when working on 
my pearls, or the built-in one in mc when having a squizz in the directory, 
many of the characters don't show up. Like my Alt-127 (DEL or little 
house) shows up as a block (an outline square character). And things like 
C-cedilla also show as a block. I suppose I should look at the editor 
settings. I know that gedit does not detect it as cp437 so a bit of 
investigation needed there.
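
One possible workaround, sketched here on the assumption that the data is cp437 and that the editor copes with UTF-8: re-encode a copy of the data before opening it. The helper name recode_cp437 is hypothetical; decode and encode are the standard Encode functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a string of raw cp437 bytes.
# $target is any encoding name Encode knows, e.g. 'UTF-8' or 'cp1252'.
sub recode_cp437 {
    my ($bytes, $target) = @_;
    my $chars = decode('cp437', $bytes);   # bytes -> Perl characters
    return encode($target, $chars);        # characters -> target bytes
}

my $dos     = "gar\x87on, M\x81nchen";     # c-cedilla and u-umlaut in cp437
my $utf8    = recode_cp437($dos, 'UTF-8');    # for viewing on Linux
my $windows = recode_cp437($dos, 'cp1252');   # for sending to Windows
```

For whole files the same conversion can be done with PerlIO layers, opening the input with '<:encoding(cp437)' and the output with '>:encoding(UTF-8)'.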

This of course fouls up tr/// because it only translates discrete characters, 
and I don't think it very smart to paste in blocks which you can't read even 
if they would work (which I think they do, at least in regexes, which I 
tried out before I found \x), and I don't think you can put \x7F, for 
instance, into tr///.
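
For what it is worth, tr/// does accept \x escapes and ranges, so nothing unreadable needs to be pasted in. A small sketch, with hypothetical helper names and only the first eight cp437 accented bytes mapped; the two lists would need extending in step to cover the rest.

```perl
use strict;
use warnings;

# Hypothetical helper: map cp437 bytes 0x80-0x87 to plain ASCII letters.
# In order: C-cedilla, u-umlaut, e-acute, a-circumflex, a-umlaut,
# a-grave, a-ring, c-cedilla.
sub plain_ascii {
    my ($s) = @_;
    $s =~ tr/\x80-\x87/Cueaaaac/;
    return $s;
}

# Hypothetical helper: count DEL (Alt-127) characters without changing them.
# With an empty replacement list and no /d, tr/// only counts.
sub count_del {
    my ($s) = @_;
    return ($s =~ tr/\x7F//);
}

print plain_ascii("gar\x87on, M\x81nchen"), "\n";
print count_del("one\x7Ftwo\x7Fthree"), "\n";
```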

Written in dos, modified on linux perl (5.8), sent over to Windoze! 
Hopefully one day to some or other sql database but that is too big a step 
today.

Thanks
Anne

> It depends how your input data is formatted.
> 
> If it is UTF8 encoded, in double bytes, and if you have perl 5.6.1 and
> above, use Unicode::Normalize:
> 
> use Unicode::Normalize qw/:all/;
> ...
> $string =~ s/([\x80-\xFF])/substr(decompose($1),0,1)/eg;
> 
> If it's formatted by one or other Windows app, with special characters
> such as the Euro symbol single byte encoded, then your input is probably
> CP1252. I'm not sure about the available conversion modules.
> 
> Alternatively, just run your whole input file through the libiconv C
> library:
> 
> iconv --from-code=ISO-8859-1 --to-code=UTF-8
> 
> 
> --tielman
> 
> -----Original Message-----
> From: za-pm-bounces+tvilliers=lastminute.com at pm.org
> [mailto:za-pm-bounces+tvilliers=lastminute.com at pm.org] On Behalf Of Anne
> Wainwright
> Sent: 08 May 2008 17:20
> To: za-pm at pm.org
> Subject: [Za-pm] ascii high order character conversion
> 
> Hi.
> 
> I have data prepared on a dos programme that involves high order 
> characters, like European letters with umlauts, cedillas, acute and
> grave accents etc.
> 
> I have a dos utility that I wrote that converts all of these to plain 
> unaccented characters, a simple replacement operation. The reason being 
> that in moving the data to Windows it does not show them correctly and
> this was the easiest way to go at the time. Now I am away from that
> route and want to build this into my perl database conversion routine
> (convert from proprietary to delimited).
> 
> Now I am wondering if there is an easier way in perl than doing a s///
> for each of the characters used. I looked in the Perl Cookbook, and had
> a wander through the CPAN modules, but nothing struck me as specific
> for the task in hand.
> 
> Not that lines of s/// wouldn't do the job, but I wondered if there was
> a more concise way of programming this to convert either to the plain 
> unaccented character or to the correct windows character.
> 
> [maybe I must study the "perlebcdic Considerations for running Perl on 
> EBCDIC platforms" found on CPAN which looks like it might be a guide. 
> suggests tr///   , will absorb this evening]
> 
> Had hoped for a ready module from CPAN, but see nothing.
> 
> Any ideas gratefully received on what must have been a common problem
> some years back?
> 
> 
> Regards
> Anne
> ----
> Anne Wainwright
> 

----
Anne Wainwright


