<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2654.45">
<TITLE>RE: RE: [Kc] cleaning up unicode? [x-adr][x-bayes]</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=2>John Reinke wrote:</FONT>
<BR><FONT SIZE=2>> Garrett Goebel &lt;garrett@scriptpro.com&gt; wrote:</FONT>
<BR><FONT SIZE=2>> > > What I'm trying to accomplish is remove the accent marks</FONT>
<BR><FONT SIZE=2>> > > from characters, essentially reducing everything down to</FONT>
<BR><FONT SIZE=2>> > > 7-bit ASCII. Since you're asking, the strings will become</FONT>
<BR><FONT SIZE=2>> > > file names. I want to create a subroutine that will</FONT>
<BR><FONT SIZE=2>> > > convert a string to something valid for my file system.</FONT>
<BR><FONT SIZE=2>> > </FONT>
<BR><FONT SIZE=2>> > What file system(s) does the filename need to be valid for?</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> While I'm running on Linux, I'm a firm believer in storing </FONT>
<BR><FONT SIZE=2>> data (file format and likewise the filenames) in a manner </FONT>
<BR><FONT SIZE=2>> that I can process and share across platforms. Considering </FONT>
<BR><FONT SIZE=2>> that I have Linux/MacOS(old & new)/Win at home and </FONT>
<BR><FONT SIZE=2>> Solaris/Win at work, I figured that reducing everything to </FONT>
<BR><FONT SIZE=2>> a-z,A-Z,0-9, and _ would be safe, as long as the file names </FONT>
<BR><FONT SIZE=2>> aren't too long.</FONT>
</P>
<P><FONT SIZE=2>I believe '-' and '.' are safe too.</FONT>
</P>
<P><FONT SIZE=2>For Mac HFS compatibility, "too long" means more than 31 characters.</FONT>
</P>
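<P><FONT SIZE=2>One thing to watch with a 31-character limit is that plain truncation can cut off a file extension. The following is only a minimal sketch of one way around that; truncate_name and its 8-character cap on the extension are assumptions made for illustration, not something from the thread.</FONT></P>
<PRE>
```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): shorten a name to at
# most $max characters while keeping a short trailing extension intact.
sub truncate_name {
    my ($name, $max) = @_;
    $max = 31 unless defined $max;      # the HFS limit mentioned above
    return $name if length($name) <= $max;

    # Treat a final ".xxx" of 1 to 8 characters as the extension, but only
    # if it leaves room for at least one character of the stem.
    my ($stem, $ext) = ($name, '');
    if ($name =~ /^(.+?)(\.[^.]{1,8})\z/s and length($2) < $max) {
        ($stem, $ext) = ($1, $2);
    }

    # Keep as much of the stem as fits in front of the extension.
    return substr($stem, 0, $max - length($ext)) . $ext;
}

print truncate_name(('a' x 40) . '.txt'), "\n";   # 27 a's followed by .txt
```
</PRE>
<P><FONT SIZE=2>Names that already fit are returned unchanged, and names without a recognisable extension are simply cut at the limit.</FONT></P>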
<P><FONT SIZE=2>With NTFS, when a long file name (LFN) is translated to an 8.3 short name, all extended ASCII characters are translated to _.</FONT>
</P>
<P><FONT SIZE=2>And there's always the issue of many-to-one transformations. With every unacceptable character mapped to _, both of the following become f_n:</FONT>
</P>
<P><FONT SIZE=2> fön</FONT>
<BR><FONT SIZE=2> fün</FONT>
</P>
<P><FONT SIZE=2>So you might want to append a counter when you rename the file, if the new name would stomp on a pre-existing file with the same target name.</FONT></P>
<P><FONT SIZE=2>It'd probably be best to use an appended counter for case-insensitive a-z clashes too, as HFS is case-preserving but not case-sensitive, right? Or alternatively, force everything to upper or lower case.</FONT></P>
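<P><FONT SIZE=2>The appended-counter idea could be sketched roughly as follows. This is an illustration only: unique_name and the %taken hash are hypothetical, and the hash stands in for whatever check you actually make against the target directory. Names are compared in lower case so that clashes which differ only in case are caught too.</FONT></P>
<PRE>
```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return a name that does
# not clash, ignoring case, with any name already recorded in %$taken.
sub unique_name {
    my ($name, $taken) = @_;

    # Split off the extension so the counter can go in front of it.
    my ($stem, $ext) = $name =~ /^(.*?)(\.[^.]*)?\z/s;
    $ext = '' unless defined $ext;

    my $candidate = $name;
    my $n = 0;
    while ($taken->{ lc $candidate }) {
        $n++;
        $candidate = "${stem}_$n$ext";  # f_n.txt, f_n_1.txt, f_n_2.txt, ...
    }
    $taken->{ lc $candidate } = 1;      # remember it for later calls
    return $candidate;
}

my %taken;
print unique_name('f_n', \%taken), "\n";   # f_n
print unique_name('f_n', \%taken), "\n";   # f_n_1
print unique_name('F_N', \%taken), "\n";   # F_N_2
```
</PRE>
<P><FONT SIZE=2>Note that the counter makes the name longer, so any length limit would need to be applied again afterwards.</FONT></P>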
<BR>
<P><FONT SIZE=2>> I had hoped for a formula similar to 32 being the difference </FONT>
<BR><FONT SIZE=2>> between ASCII 'A' and 'a' for the accented to non-accented </FONT>
<BR><FONT SIZE=2>> characters, but there doesn't seem to be a similar pattern. </FONT>
<BR><FONT SIZE=2>> It looks like a series of substitutions will have to suffice, </FONT>
<BR><FONT SIZE=2>> assuming I can think of all the possible input.</FONT>
</P>
<P><FONT SIZE=2>Substitutions might not be cross-platform. A script written in one extended ASCII charset and executed on a machine that uses another might make some unintended substitutions. It might be best to punt and convert all unacceptable chars to _.</FONT></P>
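<P><FONT SIZE=2>sub transform ($) { # modifies its argument in place; returns true if a transform took place</FONT>
<BR><FONT SIZE=2> my $nok = 0;</FONT>
<BR><FONT SIZE=2> # truncate to 31 characters (the HFS limit)</FONT>
<BR><FONT SIZE=2> if (length($_[0]) > 31) {</FONT>
<BR><FONT SIZE=2> $nok += $_[0] =~ s/(?<=^.{31}).*//s;</FONT>
<BR><FONT SIZE=2> }</FONT>
<BR><FONT SIZE=2> # replace everything outside a-z, A-Z, 0-9, _, - and . with _</FONT>
<BR><FONT SIZE=2> if ($_[0] =~ m/[^a-zA-Z0-9_\-\.]/) {</FONT>
<BR><FONT SIZE=2> $nok += $_[0] =~ s/[^a-zA-Z0-9_\-\.]/_/g;</FONT>
<BR><FONT SIZE=2> }</FONT>
<BR><FONT SIZE=2> $nok; # 0 if nothing changed, otherwise a count of the changes made</FONT>
<BR><FONT SIZE=2>}</FONT>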
</P>
<P><FONT SIZE=2>$a = 'ab?d*fg i!k&m,,,q@stuvwxyzyxwvutsrqponmlkjihgfedcba';</FONT>
<BR><FONT SIZE=2>print "$a\n";</FONT>
<BR><FONT SIZE=2>print transform $a;</FONT>
<BR><FONT SIZE=2>print "\n$a\n";</FONT>
</P>
<BR>
<P><FONT SIZE=2>Note: this doesn't worry about filename clashes. But that after all depends on where you intend to save the file...</FONT>
</P>
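<P><FONT SIZE=2>If stripping the accents is still preferred over punting to _, one charset-independent alternative to a hand-written substitution table is Unicode decomposition. The sketch below uses NFD from the Unicode::Normalize module (bundled with Perl since 5.8) and then removes the combining marks. strip_accents is a hypothetical name, and the sketch assumes the input is already a decoded Perl character string rather than raw bytes in some 8-bit charset.</FONT></P>
<PRE>
```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not from the original post). Assumes $str is already
# a decoded Perl character string, not raw bytes in some 8-bit charset.
sub strip_accents {
    my ($str) = @_;
    my $decomposed = NFD($str);     # o with diaeresis becomes "o" + U+0308
    $decomposed =~ s/\p{Mn}//g;     # drop the combining marks
    return $decomposed;
}

my $name = strip_accents("f\x{F6}n");   # U+00F6 is o with diaeresis
print "$name\n";                         # fon
```
</PRE>
<P><FONT SIZE=2>Characters that have no canonical decomposition, such as the o with stroke (U+00F8), pass through unchanged, so the result would still need to go through something like transform before being used as a file name.</FONT></P>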
<P><FONT SIZE=2>--</FONT>
<BR><FONT SIZE=2>Garrett Goebel</FONT>
<BR><FONT SIZE=2>IS Development Specialist</FONT>
</P>
<P><FONT SIZE=2>ScriptPro Direct: 913.403.5261</FONT>
<BR><FONT SIZE=2>5828 Reeds Road Main: 913.384.1008</FONT>
<BR><FONT SIZE=2>Mission, KS 66202 Fax: 913.384.2180</FONT>
<BR><FONT SIZE=2>www.scriptpro.com garrett at scriptpro dot com</FONT>
</P>
</BODY>
</HTML>