[Kc] cleaning up unicode? [x-adr][x-bayes]

Garrett Goebel garrett at scriptpro.com
Tue Mar 4 12:18:34 CST 2003


John Reinke wrote:
> Garrett Goebel <garrett at scriptpro.com> wrote:
> > > What I'm trying to accomplish is remove the accent marks
> > > from characters, essentially reducing everything down to
> > > 7-bit ASCII. Since you're asking, the strings will become
> > > file names. I want to create a subroutine that will
> > > convert a string to something valid for my file system.
> > 
> > What file system(s) does the filename need to be valid for?
> 
> While I'm running on Linux, I'm a firm believer in storing 
> data (file format and likewise the filenames) in a manner 
> that I can process and share across platforms. Considering 
> that I have Linux/MacOS(old & new)/Win at home and 
> Solaris/Win at work, I figured that reducing everything to 
> a-z,A-Z,0-9, and _ would be safe, as long as the file names 
> aren't too long.

I believe '-' and '.' are safe too.

For Mac HFS compatibility, "too long" means more than 31 characters.

When NTFS translates a long file name (LFN) to an 8.3 name, all extended
ASCII chars are translated to _.

And there's always the issue of many-to-one transformations:

  fön
  fün

Both become f_n once the accented character is replaced with _.

So you might want to append a counter when you go to rename the file if it
would stomp on a pre-existing file with the same target name.

It'd probably be best to use an appended counter for case insensitive a-z
clashes too... as HFS is case-preserving but not case-sensitive right? Or
alternatively force everything to upper or lower case.
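
A rough sketch of the appended-counter idea, comparing names without regard
to case. The helper name unique_name and the %taken hash are made up for
illustration; in a real script the "already taken" check would more likely
be a -e test against the target directory. It does not re-check the 31
character limit after appending the counter.

```perl
use strict;
use warnings;

# Hypothetical helper: return $name unchanged, or with _1, _2, ... inserted
# before the extension, so that it does not clash (ignoring case) with any
# name already recorded in %$taken. The chosen name is recorded in %$taken.
sub unique_name {
    my ($name, $taken) = @_;
    # split "base.ext" at the last dot; $ext is undef when there is no dot
    my ($base, $ext) = $name =~ /^(.*?)(\.[^.]*)?$/s;
    $ext = '' unless defined $ext;
    my $candidate = $name;
    my $n = 0;
    while ($taken->{lc $candidate}) {
        $n++;
        $candidate = "${base}_$n$ext";
    }
    $taken->{lc $candidate} = 1;
    return $candidate;
}

my %taken;
print unique_name($_, \%taken), "\n" for qw(f_n.txt f_n.txt F_N.TXT);
```

The third name is treated as a clash with the first two because the lookup
key is lower-cased, which is the behaviour you'd want on a case-preserving
but case-insensitive file system.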


> I had hoped for a formula similar to 32 being the difference 
> between ASCII 'A' and 'a' for the accented to non-accented 
> characters, but there doesn't seem to be a similar pattern. 
> It looks like a series of substitutions will have to suffice, 
> assuming I can think of all the possible input.

Substitutions might not be cross-platform... A script written in one
extended ASCII charset and executed on a machine with another might make
some unintended substitutions... It might be best to punt and convert all
unacceptable chars to _.

sub transform ($) { # modifies its argument in place; returns true if anything changed
  my $nok = 0;
  # truncate to 31 characters (the HFS limit)
  if (length($_[0]) > 31) {
    $nok += $_[0] =~ s/(?<=^.{31}).*//s;
  }
  # replace everything outside a-z, A-Z, 0-9, _, - and . with _
  $nok += $_[0] =~ s/[^a-zA-Z0-9_.-]/_/g;
  $nok;
}

my $name = 'ab?d*fg i!k&m,,,q@stuvwxyzyxwvutsrqponmlkjihgfedcba';
print "$name\n";
print transform $name;
print "\n$name\n";
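
If you'd rather map accented letters to their base letters than to _, one
option that doesn't depend on the charset the script was saved in is to
decompose the string and drop the combining marks. This is a different
technique from the substitution above, sketched under two assumptions: the
input has already been decoded into a Perl character string (not raw
bytes), and Unicode::Normalize is installed (it ships with Perl 5.8). The
name strip_accents is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: map accented letters to their base letters.
# Assumes $s is a decoded Perl character string, not raw bytes.
sub strip_accents {
    my ($s) = @_;
    $s = NFD($s);         # decompose, e.g. U+00F6 becomes "o" + U+0308
    $s =~ s/\p{Mn}//g;    # drop the nonspacing combining marks
    return $s;
}

# Code points are written as escapes so the source file itself stays
# plain ASCII, whatever editor or charset it passes through.
print strip_accents("f\x{f6}n"), "\n";   # fon
print strip_accents("f\x{fc}n"), "\n";   # fun
```

Letters with no canonical decomposition (ß, ø, æ and so on) come through
unchanged, so you'd still want to run the result through something like
transform afterwards.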


Note: this doesn't worry about filename clashes. But that after all depends
on where you intend to save the file...

--
Garrett Goebel
IS Development Specialist

ScriptPro                  Direct: 913.403.5261
5828 Reeds Road            Main: 913.384.1008
Mission, KS 66202          Fax: 913.384.2180
www.scriptpro.com          garrett at scriptpro dot com