[Kc] cleaning up unicode?

Garrett Goebel garrett at scriptpro.com
Mon Mar 3 09:15:14 CST 2003


John Reinke wrote:
> 
> I have a text file which contains a character I am guessing as being
> considered a unicode character. It is the letter 'u' with the accent
> mark over it that looks like an apostrophe, which appears in some text
> editors as <FA>.

This need not be Unicode, and likely isn't. Many 8-bit "extended ASCII"
character sets contain the accented vowels; in ISO-8859-1 (Latin-1), for
example, byte 0xFA is ú. It is most likely that your file uses one of those.
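
If you want to see which high-bit bytes the file actually contains before
deciding on a character set, something like the following rough sketch might
help. The helper name high_bytes is made up for illustration, and it assumes
the text has been read as raw bytes (no decoding layer on the filehandle).

```perl
use strict;
use warnings;

# Hypothetical helper: list each distinct byte above 0x7F found in a
# string, as two-digit hex, so it can be looked up in a code page table.
sub high_bytes {
    my ($text) = @_;
    my %seen;
    $seen{ sprintf "%02X", ord $1 }++ while $text =~ /([\x80-\xFF])/g;
    return sort keys %seen;
}

print join(" ", high_bytes("Per\x{FA}, \x{FC}ber, men\x{FA}")), "\n";   # prints "FA FC"
```

In ISO-8859-1, FA is ú and FC is ü; a different code page would give those
bytes a different meaning.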

 
> I'd like to convert any of those characters to be regular ASCII
> characters, most likely with a tr command, but I haven't been able to
> find a way to match that character. Any suggestions?

Are you saying that:

  s/[üûùú]/u/g;
  s/Ü/U/g;

Doesn't work?
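
Since you mentioned tr: if typing the accented characters into your script is
the problem, you can name them by byte value instead. This is only a sketch,
and it assumes the file is ISO-8859-1, where 0xF9-0xFC are ù ú û ü and
0xD9-0xDC are the capitals. The sub name flatten_u is made up for the example.

```perl
use strict;
use warnings;

# Assumption: input is ISO-8859-1 (Latin-1) bytes.
sub flatten_u {
    my ($text) = @_;
    # When the replacement list is shorter than the search list, tr
    # repeats its last character, so each range maps to one letter.
    $text =~ tr/\xF9-\xFC/u/;
    $text =~ tr/\xD9-\xDC/U/;
    return $text;
}

print flatten_u("Per\x{FA} and \x{DC}ber"), "\n";   # prints "Peru and Uber"
```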

 
> Once I can match those characters, is there an easy way to convert all
> accented characters to their non-accented counterparts, such that the
> accent will disappear but the same letter will remain?

I'm not sure I'd recommend this. The accents, after all, represent different
vowel sounds, and the simple conversion you're suggesting doesn't correctly
convert them to their non-accented counterparts. For instance, Germans typing
on keyboards without umlauts (the 2 dots above ö) would type schön
("beautiful") as schoen. Simply dropping the dots gives schon ("already"),
which is a different word with a different meaning.
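
To make the German case concrete, here is a small sketch of a
language-specific mapping. Only the ö to oe pair comes from the example above;
the rest of the table is my assumption about the usual German convention, the
input is assumed to be ISO-8859-1, and the names %german and german_ascii are
invented for the example.

```perl
use strict;
use warnings;

# Assumed German transliteration table, keyed by Latin-1 code point.
my %german = (
    "\x{E4}" => 'ae', "\x{F6}" => 'oe', "\x{FC}" => 'ue',
    "\x{C4}" => 'Ae', "\x{D6}" => 'Oe', "\x{DC}" => 'Ue',
    "\x{DF}" => 'ss',
);

sub german_ascii {
    my ($text) = @_;
    # Replace each listed character with its multi-letter spelling.
    $text =~ s/([\x{E4}\x{F6}\x{FC}\x{C4}\x{D6}\x{DC}\x{DF}])/$german{$1}/g;
    return $text;
}

print german_ascii("sch\x{F6}n"), "\n";   # prints "schoen"
```

The same table would be wrong for other languages, which is why the language
has to be known first.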


Then again, the sum of my knowledge of locale and internationalization issues
is pretty meager. I don't know if it'd even be possible to do what you suggest
unless you could identify the language and locale for each section of text
you wished to transform. Even then, I'm not aware of any modules to perform
such conversions, though that isn't to say they don't exist.
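
For what it's worth, if a blunt, language-blind strip is really what you want,
one possible approach is Unicode::Normalize, which as far as I know ships with
Perl 5.8 and later. The sketch below decomposes each accented letter into a
base letter plus combining marks and then deletes the marks. It assumes the
text is already a decoded character string, and strip_accents is a made-up
name.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Assumption: $text holds decoded characters, not raw encoded bytes.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);     # e.g. U+00FA becomes "u" + U+0301
    $decomposed =~ s/\p{Mn}//g;      # drop the combining marks
    return $decomposed;
}

print strip_accents("sch\x{F6}n r\x{E9}sum\x{E9}"), "\n";   # prints "schon resume"
```

Note that this gives exactly the schön to schon result cautioned against
above, so it only suits cases where that loss is acceptable.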

It'd be nice to know what you're actually trying to accomplish. If for
instance you're wanting to munge this text file so it can be displayed in
HTML... you might be interested in HTML::Entities:

  http://search.cpan.org/author/GAAS/HTML-Parser-3.27/lib/HTML/Entities.pm

Which for example allows you to do this:

  use HTML::Entities;
  $input = "vis-à-vis Beyoncé's naïve papier-mâché résumé";
  print encode_entities($input), "\n";

which would result in:

  vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve papier-m&acirc;ch&eacute;
  r&eacute;sum&eacute;

--
Garrett Goebel
IS Development Specialist

ScriptPro                   Direct: 913.403.5261
5828 Reeds Road               Main: 913.384.1008
Mission, KS 66202              Fax: 913.384.2180
www.scriptpro.com          garrett at scriptpro.com 