[Kc] cleaning up unicode? [x-adr][x-bayes]
Garrett Goebel
garrett at scriptpro.com
Tue Mar 4 12:18:34 CST 2003
John Reinke wrote:
> Garrett Goebel <garrett at scriptpro.com> wrote:
> > > What I'm trying to accomplish is remove the accent marks
> > > from characters, essentially reducing everything down to
> > > 7-bit ASCII. Since you're asking, the strings will become
> > > file names. I want to create a subroutine that will
> > > convert a string to something valid for my file system.
> >
> > What file system(s) does the filename need to be valid for?
>
> While I'm running on Linux, I'm a firm believer in storing
> data (file format and likewise the filenames) in a manner
> that I can process and share across platforms. Considering
> that I have Linux/MacOS(old & new)/Win at home and
> Solaris/Win at work, I figured that reducing everything to
> a-z,A-Z,0-9, and _ would be safe, as long as the file names
> aren't too long.
I believe '-' and '.' are safe too.
For Mac HFS compatibility, "too long" means more than 31 characters.
On NTFS, when a long file name (LFN) is translated to 8.3 form, all
extended ASCII chars are translated to _.
And there's always the issue of many-to-one transformations:
  fön
  fün
Both come out as f_n once the accented characters are translated to _.
So you might want to append a counter when you go to rename the file if it
would stomp on a pre-existing file with the same target name.
It'd probably be best to use an appended counter for case-insensitive a-z
clashes too... as HFS is case-preserving but not case-sensitive, right? Or
alternatively, force everything to upper or lower case.
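A minimal sketch of the counter idea above. The helper name unique_name and the %taken hash of already-used names are made up for illustration; in practice you would check the target directory instead. It compares names case-insensitively and does not re-check the 31-character limit after appending the counter.

```perl
use strict;
use warnings;

# Hypothetical helper: return a name that does not clash with anything
# already recorded in %$taken. Names are compared case-insensitively,
# since HFS preserves case but ignores it when matching.
sub unique_name {
    my ($name, $taken) = @_;

    # Split off the last extension, if any, so the counter goes before it.
    my ($base, $ext) = $name =~ /^(.*?)(\.[^.]*)?\z/s;
    $ext = '' unless defined $ext;

    my $candidate = $name;
    my $n = 0;
    while ($taken->{lc $candidate}) {
        $n++;
        $candidate = "${base}_$n$ext";
    }
    $taken->{lc $candidate} = 1;    # remember the name we handed out
    return $candidate;
}

my %taken;
print unique_name('f_n.txt', \%taken), "\n";    # f_n.txt
print unique_name('f_n.txt', \%taken), "\n";    # f_n_1.txt
print unique_name('F_N.TXT', \%taken), "\n";    # F_N_2.TXT
```

Forcing everything to lower case before calling it would make the third result f_n_2.txt instead.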
> I had hoped for a formula similar to 32 being the difference
> between ASCII 'A' and 'a' for the accented to non-accented
> characters, but there doesn't seem to be a similar pattern.
> It looks like a series of substitutions will have to suffice,
> assuming I can think of all the possible input.
Substitutions might not be cross-platform... A script written in one
extended ASCII charset and executed on a machine using another might make
some unintended substitutions... It might be best to punt and convert all
unacceptable chars to _.
  sub transform ($) {  # modifies its argument in place; returns true if a transform took place
    my $nok = 0;
    if (length($_[0]) > 31) {  # truncate to the 31-char HFS limit
      $nok += $_[0] =~ s/(?<=^.{31}).*//;
    }
    if ($_[0] =~ m/[^a-zA-Z0-9_\-\.]/) {  # replace anything outside the safe set with _
      $nok += $_[0] =~ s/[^a-zA-Z0-9_\-\.]/_/g;
    }
    $nok;
  }

  my $name = 'ab?d*fg i!k&m,,,q@stuvwxyzyxwvutsrqponmlkjihgfedcba';
  print "$name\n";
  print transform($name);
  print "\n$name\n";

Note: this doesn't worry about filename clashes. But that after all depends
on where you intend to save the file...
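
If you'd rather drop the accent marks than turn the whole character into _, one possible approach (not what the code above does) is to decompose each character and throw away the combining marks. This sketch uses the Unicode::Normalize module that ships with Perl 5.8 and later, and assumes the input is already a decoded Perl character string, not raw bytes in some extended ASCII charset. The name strip_accents is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: remove accent marks from a character string.
sub strip_accents {
    my ($s) = @_;
    $s = NFD($s);          # decompose: o-umlaut becomes "o" + combining diaeresis
    $s =~ s/\p{Mn}//g;     # drop the combining (nonspacing) marks
    return $s;
}

print strip_accents("f\x{F6}n"), "\n";    # fon
print strip_accents("f\x{FC}n"), "\n";    # fun
```

Characters with no decomposition (ø and ß, for example) pass through unchanged, so you'd still want to run the result through transform() afterwards.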
--
Garrett Goebel
IS Development Specialist
ScriptPro Direct: 913.403.5261
5828 Reeds Road Main: 913.384.1008
Mission, KS 66202 Fax: 913.384.2180
www.scriptpro.com garrett at scriptpro dot com