[tpm] Irritation problem - regex French character set

Chris Jones cj at enersave.ca
Tue Apr 10 17:10:51 PDT 2012

Having successfully untainted one file while 
reading it in, I am now faced with untainting a 
file containing two languages, English and French.

File - tagnames2.dat
key              English value           French value
p1a_help1        Getting Help            Obtenir de l'aide
p2a_type         Building Type:          Type de bâtiment:
p3a_error_less   must be no less than    ne peut pas être inférieur à

As well, this file contains some math-like symbols: >, =, <, ~

My initial regex is:
if ( $tagLine =~ /(\w+)\t([-\w\/.]+)\t([-\w\/.]+)$/ ) {    # key and two values, same format
         my $tag = $1;
         my $phraseE = $2;
         my $phraseF = $3;
         my $tmpref = {
                 english  => $phraseE,
                 francais => $phraseF };
         $tags{ $tag } = $tmpref;
}

This works for the English phrase ($2) but not for the French phrase ($3).
I use a test file to print the "bad" lines, and it 
is the French phrases that trigger the bad-line error.

I could set the locale for the loop, then restore it, and 
write the regex without the \w shortcut.  Is that a good idea?
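
For comparison, here is a minimal sketch of a route that avoids locale
changes. It assumes tagnames2.dat is saved as UTF-8 and is read through
an :encoding(UTF-8) layer, so that \w sees decoded characters and matches
accented letters. The names parse_tag_line and load_tags are hypothetical,
and the punctuation allowed in the phrase class is a guess based on the
sample lines and the symbols listed above. Also worth checking in perlsec
and perllocale: captures from a pattern that uses \w under "use locale"
appear to stay tainted, so the locale route may not untaint at all.

```perl
use strict;
use warnings;
use utf8;    # only matters if this source file contains UTF-8 literals

# Hypothetical helper: untaint one tab-separated line (key, English, French).
# The allowed punctuation is an assumption; extend the class as needed.
sub parse_tag_line {
    my ($tagLine) = @_;

    # Word characters, spaces, and a short whitelist of punctuation.
    # \w covers accented letters only when the string is decoded text.
    my $phrase = qr/[-\w \/.,:'<>=~]+/;

    if ( $tagLine =~ /^(\w+)\t($phrase)\t($phrase)$/ ) {
        return ( $1, { english => $2, francais => $3 } );
    }
    return;    # empty list marks a "bad" line
}

# Hypothetical loader: assumes the file on disk is UTF-8.
sub load_tags {
    my ($file) = @_;
    open( my $fh, '<:encoding(UTF-8)', $file )
        or die "Cannot open $file: $!";

    my %tags;
    while ( my $tagLine = <$fh> ) {
        chomp $tagLine;
        my ( $tag, $phrases ) = parse_tag_line($tagLine);
        if   ( defined $tag ) { $tags{$tag} = $phrases }
        else                  { warn "bad line $.\n" }
    }
    close $fh;
    return \%tags;
}
```

Calling load_tags('tagnames2.dat') would return a hash reference shaped
like %tags above. If the file is actually Latin-1, the layer name would
need to change to match (for example :encoding(iso-8859-1)).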

Christopher Jones, P.Eng.
Suite 1801, 1 Yonge Street
Toronto, ON M5E1W7
Tel. 416-203-7465
Fax. 416-946-1005
email cj at enersave.ca
