[San-Diego-pm] accents

Tkil tkil-sdpm at scrye.com
Wed Oct 27 01:20:05 CDT 2004


>>>>> "Joel" == Joel Fentin <joel at fentin.com> writes:

Joel> I need to see if what the Spanish language operator enters is
Joel> contained in a long hunk of text. Something like this:
Joel> if($X =~ /$Y/){[Do something]}

Joel> The operator might enter josé, JOSÉ, or jose. He might enter niño,
Joel> NIÑO, or nino.

Joel> An i modifier to m// will take care of case. Is there any fell
Joel> swoop way of taking care of accents?

If it's a character you know might have an accent, you can use \X
(which matches any base character plus possible combining characters).

For more generic cases, what you want to find is something that will
"canonicalize" the Unicode into one of the two canonical forms (here,
preferably "D" form, which uses combining marks whenever possible).
Fortunately, there is a standard Unicode::Normalize module to do this
for you.  First, I have to justify it:

So, you have "manana".  You might store it that way in your match
variable, but the actual entry data might be any one of:

Pure ASCII, no tilde:

   6D 61 6E 61 6E 61

ISO-8859-1.  Note that IE under Windows often lies; it might claim
it's sending ISO-8859-1, but it's really sending CP1252.  Either way,
"LATIN SMALL LETTER N WITH TILDE" maps to code point 0xF1:

   6D 61 F1 61 6E 61
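
To see where the ISO-8859-1 versus CP1252 distinction matters, here is a small sketch using Encode (the variable names are just for illustration).  For 0xF1 the two encodings agree; they part ways in the 0x80-0x9F range, where CP1252 puts printable characters such as the euro sign.

```perl
use strict;
use warnings;
use Encode qw( decode );

# 0xF1 is "n with tilde" in both encodings, so it decodes identically.
my $latin1_n = decode( "iso-8859-1", "\xf1" );
my $cp1252_n = decode( "cp1252",     "\xf1" );
printf "0xF1: latin1=U+%04X cp1252=U+%04X\n", ord($latin1_n), ord($cp1252_n);

# 0x80 differs: a C1 control character in ISO-8859-1, the euro sign
# (U+20AC) in CP1252.
my $latin1_80 = decode( "iso-8859-1", "\x80" );
my $cp1252_80 = decode( "cp1252",     "\x80" );
printf "0x80: latin1=U+%04X cp1252=U+%04X\n", ord($latin1_80), ord($cp1252_80);
```

So for Spanish text the mislabeling is probably harmless, but it can bite if the input contains "smart quotes" or similar characters.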

Since Unicode adopted U+0080 through U+00FF from ISO-8859-1, it is
entirely reasonable to represent that 0xF1 by the UTF-8 expansion of
C3B1:

   6D 61 C3B1 61 6E 61

Finally, this string can also be represented with seven Unicode code
points: 'm', 'a', 'n', COMBINING TILDE, 'a', 'n', 'a':

   6D 61 6E CC83 61 6E 61

So these are a sampling of the ways that you might get incoming data.
The question is, what do you want to match it against?
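
As a rough illustration of why normalization helps, the sketch below decodes three of the byte forms listed above and runs each through NFD.  The variable names are mine, and it assumes you already know which encoding each byte string is in (which, per the IE note, is not always a safe assumption).

```perl
use strict;
use warnings;
use Encode qw( decode );
use Unicode::Normalize qw( NFD );

# The same word as raw bytes in three different forms.
my $latin1_bytes     = "\x6d\x61\xf1\x61\x6e\x61";          # ISO-8859-1
my $utf8_precomposed = "\x6d\x61\xc3\xb1\x61\x6e\x61";      # UTF-8, U+00F1
my $utf8_combining   = "\x6d\x61\x6e\xcc\x83\x61\x6e\x61";  # UTF-8, n + U+0303

# Turn the bytes into Perl character strings.
my @decoded = (
    decode( "iso-8859-1", $latin1_bytes ),
    decode( "UTF-8",      $utf8_precomposed ),
    decode( "UTF-8",      $utf8_combining ),
);

# After NFD, every variant is the same sequence of code points.
my @normalized = map { NFD($_) } @decoded;
for my $s (@normalized) {
    print join( " ", map { sprintf "U+%04X", ord($_) } split //, $s ), "\n";
}
```

Each line printed should be the same seven code points, with U+0303 (COMBINING TILDE) right after the first 'n'.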

You can use "\X" somewhat like this:

   if ( $input =~ /ma\Xana/ ) { ... }

More info in "perldoc perlre".
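
Here is a small sketch of how "\X" behaves on already-decoded character strings (not raw bytes).  The hash and its keys are only for illustration.

```perl
use strict;
use warnings;

# Three character strings: no tilde, precomposed, and combining form.
my %input = (
    plain       => "manana",
    precomposed => "ma\x{f1}ana",      # U+00F1
    combining   => "man\x{303}ana",    # n followed by U+0303
);

# \X consumes one whole grapheme cluster, so "n", U+00F1, and
# "n" + U+0303 each count as a single unit here.
my %matched;
for my $form ( sort keys %input ) {
    $matched{$form} = $input{$form} =~ /\Ama\Xana\z/ ? 1 : 0;
    print "$form: ", ( $matched{$form} ? "match" : "no match" ), "\n";
}
```

All three forms should match.  Keep in mind that "\X" accepts any cluster at that position, so this pattern would also match "maxana".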

If you want to be a bit cleverer, take a look at Unicode::Normalize.
Something like this should give you the "base characters":

   use Encode qw( decode );
   use Unicode::Normalize qw( NFD reorder );

   # take a raw byte stream and interpret it as though it were in
   # ISO-8859-1.
   my $raw = decode "iso-8859-1", "ma\xf1ana";

   # normalize it in fully decomposed form ("Normalization Form D")
   # see: http://www.unicode.org/reports/tr15/
   my $norm = reorder NFD $raw;

   # remove any characters that aren't ascii:
   $norm =~ tr/\x00-\x7f//cd;

Given the above examples, you can look at the result like so:

| $ perl -MEncode=decode \
| >      -MUnicode::Normalize=NFD,reorder -lwe '
| >    $m = reorder NFD decode "iso-8859-1", "ma\xf1ana";
| >    $m =~ tr/\x00-\x7f//cd;
| >    print uc unpack "H*", $m'
| 6D616E616E61

I have no idea how well stuff like this works when you start talking
about non-Latin scripts (kanji, katakana, Arabic, etc.), though.
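
To tie this back to the original question, here is a minimal sketch that folds both the operator's entry and the long text before comparing them.  The helper names fold_accents and contains_folded are made up for this example, and it swaps in one variation: it removes only nonspacing marks (\p{Mn}) after NFD instead of deleting everything outside ASCII, so characters from other scripts are left alone rather than silently dropped.  It assumes both strings have already been decoded into character strings.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFD );

# Hypothetical helper: decompose, drop combining marks, fold case.
sub fold_accents {
    my ($text) = @_;
    my $folded = NFD($text);
    $folded =~ s/\p{Mn}+//g;    # remove nonspacing marks only
    return lc $folded;
}

# Hypothetical helper: true if $needle occurs in $haystack once both
# have been folded.  index() avoids treating the entry as a regex.
sub contains_folded {
    my ( $haystack, $needle ) = @_;
    return index( fold_accents($haystack), fold_accents($needle) ) >= 0;
}

my $text = "El ni\x{f1}o se llama JOS\x{c9}.";
for my $entry ( "jose", "Jos\x{e9}", "NINO", "nin\x{303}o", "pedro" ) {
    print contains_folded( $text, $entry ) ? "found" : "not found", "\n";
}
```

If you would rather keep the $X =~ /$Y/ shape, the folded entry should probably go through quotemeta (or \Q...\E) so that regex metacharacters typed by the operator are taken literally.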

t.

p.s. Please be warned that this is an area of perl that I'm still just
     poking around the fringes of -- the above might explode in your
     face...


