[Kc] cleaning up unicode?

John Reinke jmreinke at sunflower.com
Tue Mar 4 10:20:15 CST 2003


Garrett Goebel <garrett at scriptpro.com> wrote:
> > What I'm trying to accomplish is remove the accent marks
> > from characters, essentially reducing everything down to
> > 7-bit ASCII. Since you're asking, the strings will become
> > file names. I want to create a subroutine that will
> > convert a string to something valid for my file system.
> 
> What file system(s) does the filename need to be valid for?

While I'm running on Linux, I'm a firm believer in storing data (both file formats and file names) in a manner that I can process and share across platforms. Since I have Linux, Mac OS (old and new), and Windows at home, and Solaris and Windows at work, I figured that reducing everything to a-z, A-Z, 0-9, and _ would be safe, as long as the file names aren't too long.
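
To make that rule concrete, here is a minimal sketch in Python of a check for it. The function name `is_portable_name` and the default limit of 31 characters are assumptions for illustration (31 is the classic Mac OS HFS limit, the tightest of the systems listed); adjust the limit to whatever "too long" means for the platforms involved.

```python
import re

# Only a-z, A-Z, 0-9 and underscore, at least one character.
PORTABLE_NAME = re.compile(r"[A-Za-z0-9_]+")

def is_portable_name(name, max_length=31):
    """Return True if name uses only the safe characters and is short enough.

    max_length=31 is an assumed default, not a figure from this thread.
    """
    return len(name) <= max_length and PORTABLE_NAME.fullmatch(name) is not None
```

A name such as `report_2003` passes, while anything containing an accented letter, a space, or a dot is rejected.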

I had hoped for a simple formula for mapping accented characters to their unaccented equivalents, similar to the way 32 is the difference between ASCII 'A' and 'a', but there doesn't seem to be any such pattern. It looks like a series of substitutions will have to suffice, assuming I can think of all the possible input.
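
One possible alternative to a hand-written substitution table, sketched here in Python and assuming the input is already decoded Unicode text: Unicode normalization form NFD splits an accented letter into its base letter followed by combining marks, and the marks can then be dropped. The names `strip_accents` and `to_ascii_filename`, and the choice to turn every remaining unsafe character into `_`, are assumptions for illustration rather than anything settled in this thread.

```python
import re
import unicodedata

def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def to_ascii_filename(text):
    """Reduce text to a-z, A-Z, 0-9 and _ (hypothetical helper)."""
    base = strip_accents(text)
    # Anything still outside the safe set, including spaces and dots,
    # becomes an underscore.
    return re.sub(r"[^A-Za-z0-9_]", "_", base)
```

Letters that have no decomposition, such as 'ø' or 'ß', are not reduced to a base letter by this approach and simply become `_`, so a short explicit substitution list would still be needed for those.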

Or I can just let it run as is and manually intervene when needed, but my three principal virtues don't want me to take that approach...

Thanks,
John


