[sf-perl] gsm hentai convert

David Graff graff at ldc.upenn.edu
Tue Mar 14 15:35:42 PST 2006

bart at solozone.com said:
> Is there some easy way (& I am about to write a shell script using
> 'tr' and its OCTAL values to do this) to clear (LaTeX=? \"e and so
> forth) to be their equivalents without diacriticals (stripped
> letters if they do not make it)?

Here's a pretty simple approach for normalizing all the accented Latin
characters to their unaccented ASCII equivalents -- it's not a
one-liner anymore, but it's not that hard...

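sub map_accents {
    # Each line of Name.pl starts with a hex code point, followed by
    # the character name; keep only the Latin letters.
    my @charnames =
         grep /\tLATIN \S+ LETTER/, split( /^/, do 'unicore/Name.pl' );

    my %deaccent;
    for my $c ( split //, qq/AEIOUCNYaeioucny/ ) {
        my $case = ( $c eq lc $c ) ?  'SMALL' : 'CAPITAL';
        # Collect every accented variant of $c into a single string.
        $deaccent{$c} =
              join( '', map { chr hex( substr $_, 0, 4 ) }
                    grep /\tLATIN $case LETTER \U$c WITH/, @charnames );
    }
    return \%deaccent;
}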

# Sample usage in a main script:

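my $accmap = map_accents();
while (<>) {
    for my $c ( keys %$accmap ) {
        # Replace any accented variant of $c with the plain letter.
        my $variants = $accmap->{$c};
        s/[$variants]/$c/g;
    }
    # $_ now contains no accented latin letters...
}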
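
For comparison, here is a minimal sketch of a different technique from the name-table mapping above: decompose each character with the core Unicode::Normalize module, then delete the combining marks. The helper name strip_marks is hypothetical, and the sketch assumes the input has already been decoded into Perl characters.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);    # core module

# Hypothetical helper (not from the original post):
# decompose each character, then drop the combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # e.g. U+00E9 becomes "e" + U+0301
    $decomposed =~ s/\p{Mn}+//g;   # remove nonspacing (combining) marks
    return $decomposed;
}

# "cafe naive Nandu" written with accented letters via escapes.
print strip_marks("caf\x{e9} na\x{ef}ve \x{d1}and\x{fa}"), "\n";
```

This only covers letters that have a canonical decomposition. A letter such as U+00F8 (LATIN SMALL LETTER O WITH STROKE) has none, so it passes through unchanged, whereas the name-based mapping would fold it to "o".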

> Note the 'panic' statement, which I
> have never seen before, below when I erroneously tried this using
> 8859 on a different input when the command line said utf-8.

I'm not sure about the cause of the panic, but the message indicates
you were passing a string with a character in the Unicode CJK range
(U+9837 is a Chinese ideograph).  If you really have Chinese text data
in Unicode, you'll need something more elaborate to handle that
(conversion to pinyin, maybe?).
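
One way to find out whether the data really contains Chinese text is to test each decoded line against the Han script property before trying to de-accent it. This is only a sketch: the helper name has_han is hypothetical, and it assumes the input has been decoded with the correct encoding first (bytes decoded with the wrong encoding can produce stray code points like this one).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post):
# true if the decoded string contains any Han ideograph.
sub has_han {
    my ($text) = @_;
    return $text =~ /\p{Han}/ ? 1 : 0;
}

# U+9837 is the code point named in the panic message.
my $sample = "abc\x{9837}";
printf "U+%04X is Han: %s\n", 0x9837, has_han($sample) ? 'yes' : 'no';
```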

	Dave Graff
