[Za-pm] ascii high order character conversion
Anne Wainwright
aesop at fables.co.za
Tue May 13 03:20:41 PDT 2008
Jonathan & Tielman,
Thanks a lot, I think that I can make something out of that. The input is a
dos file (untainted by Windows) and I am sure that you are right about
cp437.
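
For the cp437 route, here is a minimal sketch of what the conversion could look like with the Encode module: decode the DOS bytes into Perl characters, then either re-encode them for Windows or strip the accents with Unicode::Normalize. The sample string and variable names are invented for illustration and are not taken from the real data:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

# Invented sample of raw DOS bytes: in cp437, 0x87 is c-cedilla
# and 0x81 is u-umlaut.
my $dos_bytes = "Fran\x87ois M\x81ller";

# Step 1: turn the cp437 bytes into Perl characters.
my $text = decode('cp437', $dos_bytes);

# Option A: re-encode for Windows (cp1252) or as UTF-8.
my $win_bytes  = encode('cp1252', $text);
my $utf8_bytes = encode('UTF-8',  $text);

# Option B: plain unaccented ASCII. NFD splits each accented letter into
# base letter + combining mark, then the marks (\p{Mn}) are removed.
(my $plain = NFD($text)) =~ s/\p{Mn}//g;

print "$plain\n";   # Francois Muller
```

Characters that have no decomposition (the sharp s and the box-drawing characters, for example) pass through NFD unchanged, so Option B would still need a small hand-made table for those.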
I didn't have perldoc installed until very recently; somehow it never got
installed. So I have only just realised what a great resource it is.
I do note that when using an editor in linux, either gedit when working on
my pearls, or the inbuilt one in mc when having a squizz in the directory,
many of the characters don't show up. My Alt-127 (DEL or little house), for
example, shows up as a block (an outline square character), and things like
C-cedilla also show as a block. I suppose I should look at the editor
settings. I know that gedit does not detect the file as cp437, so a bit of
investigation is needed there.
This of course fouls up tr///, because it only translates discrete
characters, and I don't think it is very smart to paste in blocks which you
can't read even if they would work (which I think they do, at least in the
regexes I tried out before I found \x). I also don't think you can put
\x7F, for instance, into tr///.
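
As a check on that last point: perlop describes tr/// as accepting the same escape sequences as a double-quoted string, so \x7F and ranges such as \x80-\xFF should work without pasting in unreadable characters. A small sketch, assuming cp437 byte values; the sample string is invented and mapping Alt-127 to '|' is only a placeholder choice:

```perl
use strict;
use warnings;

# Invented sample of cp437 bytes: c-cedilla 0x87, u-umlaut 0x81, Alt-127 0x7F.
my $line = "Fran\x87ois\x7FM\x81ller";

# With an empty replacement list and no /d, tr/// only counts the
# matching bytes and leaves the string unchanged.
my $high = ($line =~ tr/\x80-\xFF//);

# Map a few cp437 bytes to plain ASCII; 0x7F becomes '|' (placeholder choice).
(my $ascii = $line) =~ tr/\x80\x81\x82\x87\x7F/Cuec|/;

print "$high high-order bytes\n";   # 2 high-order bytes
print "$ascii\n";                   # Francois|Muller
```

The search and replacement lists pair up position by position, so a full table would need one entry for each cp437 byte actually used in the data.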
Written in dos, modified on linux perl (5.8), sent over to Windoze!
Hopefully one day to some or other sql database but that is too big a step
today.
Thanks
Anne
> It depends how your input data is formatted.
>
> If it is UTF-8 encoded, with each accented character stored as two bytes,
> and you have perl 5.6.1 or above, use Unicode::Normalize:
>
> use Unicode::Normalize qw/:all/;
> ...
> $string =~ s/([\x80-\xFF])/substr(decompose($1),0,1)/eg;
>
> If it's formatted by one or other Windows app, with special characters
> such as the Euro symbol single byte encoded, then your input is probably
> CP1252. I'm not sure about the available conversion modules.
>
> Alternatively, just run your whole input file through iconv, the
> command-line front end to the libiconv C library:
>
> iconv --from-code=ISO-8859-1 --to-code=UTF-8
>
>
> --tielman
>
> -----Original Message-----
> From: za-pm-bounces+tvilliers=lastminute.com at pm.org
> [mailto:za-pm-bounces+tvilliers=lastminute.com at pm.org] On Behalf Of Anne
> Wainwright
> Sent: 08 May 2008 17:20
> To: za-pm at pm.org
> Subject: [Za-pm] ascii high order character conversion
>
> Hi.
>
> I have data prepared on a dos programme that involves high order
> characters, like european letters with umlauts, cedillas, acute and
> grave accents etc.
>
> I have a dos utility that I wrote that converts all of these to plain
> unaccented characters, a simple replacement operation. The reason being
> that in moving the data to Windows it does not show them correctly and
> this was the easiest way to go at the time. Now I am away from that
> route and want to build this into my perl database conversion routine
> (convert from proprietary to delimited).
>
> Now I am wondering if there is an easier way in perl than doing a s///
> for each of the characters used. I looked in the Perl Cookbook, and had
> a wander through the CPAN modules, but nothing struck me as specific
> for the task in hand.
>
> Not that lines of s/// wouldn't do the job, but I wondered if there was
> a more concise way of programming this to convert either to the plain
> unaccented character or to the correct windows character.
>
> [maybe I must study the "perlebcdic Considerations for running Perl on
> EBCDIC platforms" found on CPAN which looks like it might be a guide.
> suggests tr/// , will absorb this evening]
>
> Had hoped for a ready module from CPAN, but see nothing.
>
> Any ideas gratefully received on what must have been a common problem
> some years back?
>
>
> Regards
> Anne
> ----
> Anne Wainwright
>
----
Anne Wainwright