[Kc] cleaning up unicode?

Mon Mar 3 21:58:24 CST 2003

On Mon, 2003-03-03 at 09:15, Garrett Goebel wrote:
> > I have a text file which contains a character that I am guessing is
> > considered a Unicode character. It is the letter 'u' with an acute
> > accent, the mark that looks like an apostrophe, and it appears in some
> > text editors as <FA>.
> 
> This need not be Unicode, and likely isn't. Many extended ASCII character
> sets contain the accented vowels. It is most likely that you're using one of
> those.

You are correct. I had to fall back on my C skills to learn that u with the
accent mark has the value 250 (0xFA, hence the <FA>) - a single byte.
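
The same check can be done without leaving Perl. This is only a rough
sketch: it assumes the text is in a single-byte encoding such as
ISO-8859-1, and the helper name high_bytes is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct byte values above 127 that
# occur in a string, in ascending order.
sub high_bytes {
    my ($text) = @_;
    my %seen;
    $seen{ ord($_) }++ for $text =~ /([\x80-\xFF])/g;
    return sort { $a <=> $b } keys %seen;
}

# "\xFA" is the byte that some editors display as <FA>.
printf "%d (0x%02X)\n", $_, $_ for high_bytes("men\xFA");   # prints "250 (0xFA)"
```

Running something like this over the real file would list every byte that
needs a mapping, not just the one that happened to be noticed first.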

> > I'd like to convert any of those characters to be regular ASCII
> > characters, most likely with a tr command, but I haven't been able to
> > find a way to match that character. Any suggestions?
> 
> Are you saying that:
> 
>   s/[üûùú]/u/g;
>   s/Ü/U/g;
> 
> Doesn't work?

I'm sure it does, but what I'm really looking for (but didn't state
clearly) is a more generic solution. Besides, I'm not sure how to type
all those characters without pasting them in from another document.
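
One way around the typing problem is to spell the characters as hex
escapes. A small sketch, assuming the data is ISO-8859-1, where 0xF9-0xFC
are the accented lower-case u's and 0xD9-0xDC the upper-case ones:

```perl
use strict;
use warnings;

# Assumes ISO-8859-1 bytes: 0xF9-0xFC = u with grave, acute, circumflex,
# diaeresis; 0xD9-0xDC = the same four in upper case.
my $s = "men\xFA \xDCber";
(my $plain = $s) =~ tr/\xF9-\xFC\xD9-\xDC/uuuuUUUU/;
print "$plain\n";    # prints "menu Uber"
```

The escapes keep the script itself pure ASCII, so it does not matter which
encoding the editor uses to save it.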

> It'd be nice to know what you're actually trying to accomplish. If for

What I'm trying to accomplish is to remove the accent marks from
characters, essentially reducing everything down to 7-bit ASCII. Since
you're asking, the strings will become file names. I want to create a
subroutine that will convert a string to something valid for my file
system. While I could just eliminate the accented characters, it would
make sense to retain the letter part, and eliminate the additional
punctuation - no offense intended toward your "beautiful" (auf Deutsch)
example, Garrett.
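
A rough sketch of such a subroutine follows. It assumes ISO-8859-1 input;
the name ascii_filename and the choice of "safe" characters (letters,
digits, dot, underscore, hyphen) are illustrative assumptions, not a
statement about what any particular file system requires.

```perl
use strict;
use warnings;

# Hypothetical helper; assumes the input is ISO-8859-1 bytes.
sub ascii_filename {
    my ($name) = @_;
    for ($name) {
        # Characters that read better as two letters.
        s/\xC6/AE/g;  s/\xE6/ae/g;  s/\xDF/ss/g;
        # A short replacement list repeats its last character,
        # so each range collapses to a single base letter.
        tr/\xC0-\xC5/A/;      tr/\xE0-\xE5/a/;
        tr/\xC7/C/;           tr/\xE7/c/;
        tr/\xC8-\xCB/E/;      tr/\xE8-\xEB/e/;
        tr/\xCC-\xCF/I/;      tr/\xEC-\xEF/i/;
        tr/\xD1/N/;           tr/\xF1/n/;
        tr/\xD2-\xD6\xD8/O/;  tr/\xF2-\xF6\xF8/o/;
        tr/\xD9-\xDC/U/;      tr/\xF9-\xFC/u/;
        tr/\xDD/Y/;           tr/\xFD\xFF/y/;
        # Anything still outside the safe set becomes "_", and runs of
        # such characters are squeezed into a single "_".
        tr/A-Za-z0-9._-/_/cs;
    }
    return $name;
}

print ascii_filename("Men\xFA del d\xEDa.txt"), "\n";   # prints "Menu_del_dia.txt"
```

The table is written out by hand, so it only covers the Latin-1 letters
listed; anything else falls through to the underscore rule.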

I thought that this might have been common enough that someone had a
quick formula that could handle this. Perhaps not. I'll have to look for
an existing package or code something up from scratch...
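
One candidate for the existing-package route is Unicode::Normalize:
decompose each accented letter into a base letter plus a combining mark,
then delete the marks. A minimal sketch, assuming a Perl recent enough
(5.8 or later) to ship Encode and Unicode::Normalize, and assuming the
input is ISO-8859-1; the name strip_marks is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper; the ISO-8859-1 source encoding is an assumption.
sub strip_marks {
    my ($bytes) = @_;
    # Decode to characters, then decompose: u-acute becomes "u" + U+0301.
    my $text = NFD( decode('ISO-8859-1', $bytes) );
    # Remove the combining marks (Unicode category Mn).
    $text =~ s/\p{Mn}//g;
    return $text;
}

print strip_marks("men\xFA \xDCber"), "\n";   # prints "menu Uber"
```

Letters with no decomposition, such as the German sharp s (0xDF) or the
ae ligature (0xE6), pass through unchanged, so a final pass that replaces
or removes any remaining non-ASCII characters would still be needed.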

Thanks,
John