[tpm] Irritation problem - regex French character set

Liam R E Quin liam at holoweb.net
Tue Apr 10 18:52:41 PDT 2012


On Tue, 2012-04-10 at 20:10 -0400, Chris Jones wrote:
> Having successfully untainted one file while 
> reading it in, I am now faced with untainting a 
> file containing two languages, English and French.
> 
> File - tagnames2.dat
> key     English value   French value
> p1a_help1       Getting Help    Obtenir de l'aide
> p2a_type        Building Type:  Type de bâtiment:
> p3a_error_less   must be no less than    ne peut pas être inférieur à
> 
> As well, this file contains some math like symbols: >, =, <, ~
> 
> My initial regex is:
> if( $tagLine =~ 
> /([\w]+)\t([-\w\/.]+)\t([-\w\/.]+)$/) # key and two values the same format
> {
>          my $tag = $1;
>          my $phraseE =  $2;
>          my $phraseF =  $3;
>          my $tmpref = {
>                  english => "$phraseE",
>                  francais => "$phraseF" };
>          $tags{ $tag } = $tmpref;
>          $count++;
> }

It sounds like you might want this instead:

if ($tagLine =~ m{^([^\t]+)\t([^\t]+)\t(.+)$}) {
    $tags{$1} = {
        english => $2, francais => $3
    };
    ++$count;
} else {
    # maybe log an error here? be careful not to
    # show the untrusted data in an error message that
    # goes to the user, though!
}

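since you want to match based on tabs, not on what's between them.
Your character class [-\w\/.] has no space, apostrophe or colon in it,
so a value like "Getting Help" can never match.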
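
Here is a minimal sketch of the two patterns side by side, in case it helps. The sample lines are typed in by hand from your message rather than read from tagnames2.dat, and the names @lines, $old, $oldMatches and $rejected are made up for the example. It also assumes the data is UTF-8 and already decoded into Perl characters, which "use utf8" arranges for the literals here; for a real file you would open it with an ':encoding(UTF-8)' layer instead.

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain UTF-8 characters

# Hypothetical sample data, modelled on the lines quoted from tagnames2.dat.
my @lines = (
    "p1a_help1\tGetting Help\tObtenir de l'aide",
    "p2a_type\tBuilding Type:\tType de bâtiment:",
    "p3a_error_less\tmust be no less than\tne peut pas être inférieur à",
    "broken line without tabs",
);

# The original pattern, kept only to count how many lines it accepts.
my $old = qr/([\w]+)\t([-\w\/.]+)\t([-\w\/.]+)$/;

my %tags;
my ($count, $rejected, $oldMatches) = (0, 0, 0);

for my $tagLine (@lines) {
    ++$oldMatches if $tagLine =~ $old;

    # Split on the tabs; accept whatever sits between them.
    if ($tagLine =~ m{^([^\t]+)\t([^\t]+)\t(.+)$}) {
        $tags{$1} = { english => $2, francais => $3 };
        ++$count;
    } else {
        ++$rejected;    # do not echo the untrusted line back to the user
    }
}

binmode STDOUT, ':encoding(UTF-8)';
print "old pattern matched $oldMatches line(s)\n";
print "$count accepted, $rejected rejected\n";
print "$tags{p2a_type}{francais}\n";
```

Note that (.+) for the last field will also swallow any extra tabs, so a line with four columns ends up with a tab inside the French value. Use ([^\t]+) there too if that matters. Also bear in mind that a pattern this permissive untaints nearly anything, so it only makes sense if you do trust the file's contents.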

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/


