[Edinburgh-pm] Dumb regex question

Miles Gould miles at assyrian.org.uk
Fri Jun 15 09:43:28 PDT 2012


On 15/06/12 17:18, Chris Yocum wrote:
> my @words = ("díne", "láechreraig", "caínConchobor");
>
> foreach my $word (@words) {
>          if($word =~ m/\p{IsLower}(?=\p{IsUpper})/) {
>                   print "$word\n";
>          }
> }

First off, the lookahead assertion isn't needed to decide whether a word 
matches (it only keeps the uppercase letter out of the matched text), so 
let's simplify the regex.


#!/usr/bin/perl

use strict;
use warnings;

my @words = ("díne", "láechreraig", "caínConchobor");

foreach my $word (@words) {
    if ($word =~ m/\p{IsLower}\p{IsUpper}/) {
        print "$word\n";
    }
}

díne
láechreraig
caínConchobor


OK, bug verified (this is all on perl 5.10.1, by the way). Now let's see 
what's actually matching:


#!/usr/bin/perl

use strict;
use warnings;

my @words = ("díne", "láechreraig", "caínConchobor");

foreach my $word (@words) {
    if ($word =~ m/\p{IsLower}\p{IsUpper}/) {
        # $& holds the text matched by the last successful regex
        print "$word: $&\n";
    }
}

díne: d
láechreraig: l
caínConchobor: a


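In each case, the visible part of the match is the letter before the 
accented character, and *not* the accented character itself. The pattern 
needs two characters, though, so a second one must have matched without 
showing up in the output. Maybe Perl doesn't realise that the multibyte 
sequence should be treated as one character? Let's turn on the utf8 pragma 
so Perl knows our source code is in UTF-8: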


#!/usr/bin/perl

use strict;
use warnings;
use utf8;

my @words = ("díne", "láechreraig", "caínConchobor");

foreach my $word (@words) {
    if ($word =~ m/\p{IsLower}\p{IsUpper}/) {
        print "$word: $&\n";
    }
}

ca�nConchobor: nC
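
The hunch about multibyte sequences can be checked with a rough sketch like 
the one below, which is not part of the original scripts. It runs the same 
regex over one word twice: once as UTF-8 bytes (what Perl sees without the 
pragma) and once as characters (what it sees with it). The helper name 
match_as_hex is made up for this example, and the word is built from an 
escape so the result does not depend on how the file is saved.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use Encode qw(encode);

# "díne" written with an escape, so this file's own encoding is irrelevant.
my $chars = "d\x{ED}ne";                # character string: d, í, n, e
my $bytes = encode('UTF-8', $chars);    # the same word as UTF-8 bytes

# Return the text matched by the simplified regex as hex code points,
# or '(none)' if the string does not match at all.
sub match_as_hex {
    my ($string) = @_;
    return '(none)' unless $string =~ m/(\p{IsLower}\p{IsUpper})/;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $1;
}

printf "bytes: length %d, matched %s\n", length($bytes), match_as_hex($bytes);
printf "chars: length %d, matched %s\n", length($chars), match_as_hex($chars);
```

The byte string should match on 64 C3: the "d" plus the first byte of the 
UTF-8 encoding of í, which taken as a character on its own is U+00C3, an 
uppercase Ã. The character string contains no uppercase letter, so it 
should not match, which is the behaviour the pragma version shows.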


Success! Almost. Now we're outputting the result with the wrong 
encoding, for reasons that someone else will hopefully be able to explain.


#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Encode;

my @words = ("díne", "láechreraig", "caínConchobor");

foreach my $word (@words) {
    if ($word =~ m/\p{IsLower}\p{IsUpper}/) {
        # Turn the character string back into UTF-8 bytes before printing
        print encode('utf8', "$word\n");
    }
}

caínConchobor


Wincore!
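
On the wrong encoding from the previous step, a likely explanation: with 
use utf8 the strings hold characters rather than bytes, and print on a 
handle with no encoding layer writes any character below 256 as a single 
Latin-1 byte. So í goes out as the lone byte 0xED, which is not valid UTF-8. 
A small sketch of that, using an in-memory file in place of STDOUT; 
printed_bytes is a hypothetical helper, not something from the scripts 
above.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Return, as hex, the bytes that print() writes for $text on a handle
# opened with the given mode. An in-memory file stands in for STDOUT.
sub printed_bytes {
    my ($mode, $text) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close $fh or die "cannot close in-memory file: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

my $word = "ca\x{ED}n";    # "caín" as four characters

print 'no layer:    ', printed_bytes('>', $word), "\n";
print 'UTF-8 layer: ', printed_bytes('>:encoding(UTF-8)', $word), "\n";
```

The first line should show ED for the í and the second C3 AD. Calling 
binmode(STDOUT, ':encoding(UTF-8)') once near the top of the script puts 
the same layer on STDOUT, which would be an alternative to calling encode 
on every print.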

HTH,
Miles

