[Kc] cleaning up unicode?

Tue Mar 4 10:20:15 CST 2003

Garrett Goebel <garrett at scriptpro.com> wrote:
> > What I'm trying to accomplish is remove the accent marks
> > from characters, essentially reducing everything down to
> > 7-bit ASCII. Since you're asking, the strings will become
> > file names. I want to create a subroutine that will
> > convert a string to something valid for my file system.
> 
> What file system(s) does the filename need to be valid for?

While I'm running on Linux, I'm a firm believer in storing data (file formats and likewise file names) in a manner that I can process and share across platforms. Since I have Linux, MacOS (old and new), and Windows at home, and Solaris and Windows at work, I figured that reducing everything to a-z, A-Z, 0-9, and _ would be safe, as long as the file names aren't too long.

I had hoped for a simple formula for mapping accented characters to their unaccented forms, similar to 32 being the difference between ASCII 'A' and 'a', but there doesn't seem to be such a pattern. It looks like a series of substitutions will have to suffice, assuming I can think of all the possible input.
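
One possible alternative to a hand-written substitution list, sketched here in Python purely for illustration: Unicode normalization form NFKD splits most accented letters into a base letter plus a combining mark, and the marks can then be dropped. The function name to_safe_filename and the max_length default of 64 are assumptions made for this sketch, not anything settled in this thread.

```python
import re
import unicodedata


def to_safe_filename(text, max_length=64):
    """Reduce text to a-z, A-Z, 0-9 and _ (hypothetical helper)."""
    # NFKD decomposes e.g. U+00E9 into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace each run of remaining disallowed characters with one underscore.
    safe = re.sub(r"[^A-Za-z0-9_]+", "_", stripped)
    # max_length is an assumed limit; pick one that suits the target systems.
    return safe[:max_length]


if __name__ == "__main__":
    print(to_safe_filename("Caf\u00e9 d\u00e9j\u00e0 vu"))
```

Letters with no decomposition, such as the Scandinavian o with a stroke (U+00F8) or the German sharp s (U+00DF), pass through normalization unchanged and end up as underscores here, so a small substitution table would probably still be needed for those.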

Or, I can just let it run as is and manually intervene when needed, but my three principal virtues don't want me to take that approach...

Thanks,
John