[Kc] cleaning up unicode?

John Reinke jmreinke at sunflower.com
Mon Mar 3 21:58:24 CST 2003


On Mon, 2003-03-03 at 09:15, Garrett Goebel wrote:
> > I have a text file which contains a character I am guessing as being
> > considered a unicode character. It is the letter 'u' with the accent
> > mark over it that looks like an apostrophe, which appears in some text
> > editors as <FA>.
> 
> This need not be unicode, and likely isn't. Many extended-ASCII character
> sets contain the accented vowels. It is most likely that you're using one of
> those.

You are correct. I had to fall back on my C skills to learn that u with the
accent mark has the value 250 (0xFA, matching the <FA> the editors show) - a
single byte.
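
The same check can be done without leaving Perl. A minimal sketch, assuming
the text is a plain single-byte string; high_bytes is a hypothetical helper
name, not something from this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: list the distinct bytes above 127 found in a string,
# each formatted as "0xHH (decimal)".
sub high_bytes {
    my ($text) = @_;
    my %seen;
    return map  { sprintf '0x%02X (%d)', $_, $_ }
           grep { !$seen{$_}++ }
           map  { ord($_) } $text =~ /([\x80-\xFF])/g;
}

print join(', ', high_bytes("men\xFA")), "\n";   # prints "0xFA (250)"
```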

> > I'd like to convert any of those characters to be regular ASCII
> > characters, most likely with a tr command, but I haven't been able to
> > find a way to match that character. Any suggestions?
> 
> Are you saying that:
> 
>   s/[üûùú]/u/g;
>   s/Ü/U/g;
> 
> Doesn't work?

I'm sure it does, but what I'm really looking for (but didn't state
clearly) is a more generic solution. Besides, I'm not sure how to type
all those characters without pasting them in from another document.
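
On the typing problem: the characters can be written as hex escapes instead of
being pasted in. A small sketch, assuming the data is ISO-8859-1 (Latin-1),
which is consistent with u-acute being byte 250. flatten_u is a hypothetical
name, and the upper-case range is widened beyond the single U-umlaut in the
quoted example:

```perl
use strict;
use warnings;

# Hypothetical helper: the quoted substitutions, spelled with hex escapes
# so no accented character has to be typed. Assumes Latin-1 input.
sub flatten_u {
    my ($text) = @_;
    $text =~ s/[\xF9-\xFC]/u/g;   # u-grave, u-acute, u-circumflex, u-umlaut
    $text =~ s/[\xD9-\xDC]/U/g;   # the upper-case counterparts
    return $text;
}

print flatten_u("Per\xFA"), "\n";   # prints "Peru"
```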
 
> It'd be nice to know what you're actually trying to accomplish. If for

What I'm trying to accomplish is to remove the accent marks from
characters, essentially reducing everything down to 7-bit ASCII. Since
you're asking, the strings will become file names. I want to create a
subroutine that will convert a string to something valid for my file
system. While I could just eliminate the accented characters, it would
make sense to retain the letter part, and eliminate the additional
punctuation - no offense intended toward your "beautiful" (auf Deutsch)
example, Garrett.
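
One way such a subroutine might look, as a sketch only. It assumes the input
is ISO-8859-1 (Latin-1), and the set of characters allowed in a file name
(letters, digits, dot, underscore, hyphen) is a deliberately conservative
guess rather than a rule of any particular file system. ascii_filename is a
hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a Latin-1 string to a conservative 7-bit
# ASCII file name, keeping the base letter of each accented character.
sub ascii_filename {
    my ($name) = @_;

    # Characters that expand to more than one ASCII letter go first.
    $name =~ s/\xC6/AE/g;    # AE ligature
    $name =~ s/\xE6/ae/g;    # ae ligature
    $name =~ s/\xDF/ss/g;    # German sharp s

    # tr pads a short replacement list by repeating its last character,
    # so each range below collapses onto a single base letter.
    $name =~ tr/\xC0-\xC5/A/;
    $name =~ tr/\xE0-\xE5/a/;
    $name =~ tr/\xC8-\xCB/E/;
    $name =~ tr/\xE8-\xEB/e/;
    $name =~ tr/\xCC-\xCF/I/;
    $name =~ tr/\xEC-\xEF/i/;
    $name =~ tr/\xD2-\xD6\xD8/O/;
    $name =~ tr/\xF2-\xF6\xF8/o/;
    $name =~ tr/\xD9-\xDC/U/;
    $name =~ tr/\xF9-\xFC/u/;
    $name =~ tr/\xC7\xE7/Cc/;    # C and c with cedilla
    $name =~ tr/\xD1\xF1/Nn/;    # N and n with tilde
    $name =~ tr/\xDD/Y/;
    $name =~ tr/\xFD\xFF/y/;

    # Anything still outside the allowed set becomes an underscore.
    $name =~ s/[^A-Za-z0-9._-]+/_/g;
    return $name;
}

print ascii_filename("Men\xFA del d\xEDa.txt"), "\n";   # prints "Menu_del_dia.txt"
```

Using one tr per base letter keeps each mapping easy to check against a
Latin-1 code chart, at the cost of a few extra passes over the string.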

I thought that this might have been common enough that someone had a
quick formula that could handle this. Perhaps not. I'll have to look for
an existing package or code something up from scratch...

Thanks,
John




