[sf-perl] gsm hentai convert
David Graff
graff at ldc.upenn.edu
Tue Mar 14 15:35:42 PST 2006
bart at solozone.com said:
> Is there some easy way (& I am about to write a shell script using
> 'tr' and its OCTAL values to do this) to clear (LaTeX=? \"e and so
> forth ) to be their equivalents without diacriticals (stripped
> letters if they do not make it)?
Here's a pretty simple approach for normalizing accented Latin
letters (the vowels plus C, N and Y, in both cases) to their unaccented
ASCII equivalents -- it's not a one-liner anymore, but it's not that
hard...
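sub map_accents
{
    # Name.pl returns the Unicode name table as one string; keep only
    # the lines that name Latin letters.
    my @charnames =
        grep /\tLATIN \S+ LETTER/, split( /^/, do 'unicore/Name.pl' );
    my %deaccent;
    for my $c ( split //, qq/AEIOUCNYaeioucny/ ) {
        my $case = ( $c eq lc $c ) ? 'SMALL' : 'CAPITAL';
        # Collect every "LETTER X WITH ..." variant of this base letter;
        # assumes each line starts with a four-digit hex code point.
        $deaccent{$c} =
            join( '', map { chr hex( substr $_, 0, 4 ) }
                      grep /\tLATIN $case LETTER \U$c WITH/, @charnames );
    }
    return \%deaccent;    # base letter => string of its accented forms
}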
# Sample usage in a main script:
my $accmap = map_accents();
while (<>) {
    for my $c ( keys %$accmap ) {
        s/[$$accmap{$c}]/$c/g;
    }
    # $_ now contains no accented latin letters...
}
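
For comparison, here is a minimal sketch of a different technique, not the name-table lookup above: decompose each character with the core Unicode::Normalize module and delete the combining marks that fall out. The helper name strip_accents is made up for illustration. Note the trade-off: this only works for letters that have a canonical decomposition, so something like O WITH STROKE passes through unchanged, whereas the name-table approach maps it to O.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not from the original post): split each accented
# letter into base letter + combining mark, then drop the marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;    # Mn = nonspacing (combining) marks
    return $decomposed;
}

# \x{e9} is e with acute, \x{f1} is n with tilde
print strip_accents("caf\x{e9} ma\x{f1}ana"), "\n";
```

As with the loop above, this assumes the input has already been decoded into Perl character strings rather than raw UTF-8 bytes.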
> Note the 'panic' statement which I
> have never seen before, below when I erroneously tried this using
> 8859 on a different input when the command line said utf-8:
I'm not sure about the cause of the panic, but the message indicates
you were passing a string with a character in the Unicode CJK range
(U+9837 is a Chinese ideograph). If you really have Chinese text data
in Unicode, you'll need something more elaborate to handle that
(conversion to pinyin, maybe?).
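
One way to check whether the data really contains Chinese characters is to test each character against the Han script property. This is only a sketch; the helper name han_codepoints is made up for illustration, and it assumes the text has already been decoded into Perl character strings.

```perl
use strict;
use warnings;

# Hypothetical helper: return the code points (as "U+XXXX" strings) of
# every character in $text that belongs to the Han script.
sub han_codepoints {
    my ($text) = @_;
    return map  { sprintf "U+%04X", ord($_) }
           grep { /\p{Han}/ }
           split //, $text;
}

# \x{9837} is the character mentioned in the panic message
print "$_\n" for han_codepoints("abc\x{9837}");
```

If this prints anything for your input, the accent map above won't help with those characters, and they need separate handling.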
Dave Graff