[tpm] Irritation problem - regex French character set
Chris Jones
cj at enersave.ca
Tue Apr 10 17:10:51 PDT 2012
Having successfully untainted one file while
reading it in, I am now faced with untainting a
file containing two languages, English and French.
File - tagnames2.dat
key             English value         French value
p1a_help1       Getting Help          Obtenir de l'aide
p2a_type        Building Type:        Type de bâtiment:
p3a_error_less  must be no less than  ne peut pas être inférieur à
The file also contains some math-like symbols: >, =, <, ~
My initial regex is:
if ( $tagLine =~ /([\w]+)\t([-\w\/.]+)\t([-\w\/.]+)$/ )  # key and two values the same format
{
    my $tag     = $1;
    my $phraseE = $2;
    my $phraseF = $3;
    my $tmpref  = {
        english  => "$phraseE",
        francais => "$phraseF" };
    $tags{ $tag } = $tmpref;
    $count++;
}
This works for the English phrase ($2) but not for the French phrase ($3).
I use a test file to print the "bad" lines. It
is the French phrases that cause the bad line error.
I could set the locale for the loop and then restore it, and
write the regex without the \w shortcut. Is that a good idea?
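For comparison, here is a minimal sketch of one alternative to switching locale: decode each line into characters first, then match with Unicode properties such as \p{L}. It assumes the file is UTF-8 encoded (which the post does not state), and the helper name parse_tag_line and the list of allowed punctuation are hypothetical, chosen only to cover the sample lines above. On undecoded bytes a letter like "â" arrives as two high bytes that \w typically does not match, which may be why the French phrases fail. Note also that perlsec documents that, under `use locale`, matches against locale-dependent classes such as \w do not untaint their captures, so the locale route may work against the untainting goal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Characters allowed in a phrase: any Unicode letter, combining mark or digit,
# plus space and the punctuation seen in the sample data. This list is an
# assumption; extend it to whatever the real file needs.
my $value = qr/[\p{L}\p{M}\p{N} \-_\/.:'<>=~]+/;

# Hypothetical helper (not from the original post): parse one raw line.
# Assumes the file is UTF-8; for Latin-1 data use decode('ISO-8859-1', ...).
sub parse_tag_line {
    my ($raw) = @_;
    my $line = decode('UTF-8', $raw);   # bytes -> characters, so "â" is one letter
    $line =~ s/\r?\n\z//;               # drop the line ending
    return unless $line =~ /\A(\w+)\t($value)\t($value)\z/;
    return ($1, $2, $3);                # under -T, regex captures come back untainted
}

# Sample line from tagnames2.dat, written as raw UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
my ($tag, $english, $french) =
    parse_tag_line("p2a_type\tBuilding Type:\tType de b\xC3\xA2timent:\n");
print "$tag | $english | $french\n";
```

Listing the permitted characters explicitly, rather than accepting anything between the tabs, keeps the pattern meaningful as an untainting check; a line containing anything outside the list simply fails to match and can be reported as a "bad" line.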
Christopher Jones, P.Eng.
Suite 1801, 1 Yonge Street
Toronto, ON M5E1W7
Tel. 416-203-7465
Fax. 416-946-1005
email cj at enersave.ca