[Edinburgh-pm] Dumb regex question

Fri Jun 15 09:43:28 PDT 2012

On 15/06/12 17:18, Chris Yocum wrote:
> my @words = ("díne", "láechreraig", "caínConchobor");
>
> foreach my $word (@words) {
>          if($word =~ m/\p{IsLower}(?=\p{IsUpper})/) {
>                   print "$word\n";
>          }
> }

First off, the lookahead assertion doesn't change whether the regex 
matches (it only affects what ends up in $&), so let's simplify the regex.

#!/usr/bin/perl

use strict;
use warnings;

my @words = ("díne", "láechreraig", "caínConchobor");

foreach my $word (@words) {
         if($word =~ m/\p{IsLower}\p{IsUpper}/) {
                  print "$word\n";
         }
}

díne
láechreraig
caínConchobor

OK, bug verified (this is all on perl 5.10.1, by the way). Now let's see 
what's actually matching:

#!/usr/bin/perl

use strict;
use warnings;

my @words = ("díne", "láechreraig", "caínConchobor");

foreach my $word (@words) {
         if($word =~ m/\p{IsLower}\p{IsUpper}/) {
                  print "$word: $&\n";
         }
}

díne: d
láechreraig: l
caínConchobor: a
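
The terminal hides part of that match, though. As a rough check, dumping 
the match in hex shows two bytes, not one. This is only a sketch: 
matched_hex is a made-up helper, and the word is written out as explicit 
bytes so the result doesn't depend on how the file is saved.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: return the matched text as space-separated hex codes.
sub matched_hex {
    my ($word) = @_;
    return unless $word =~ m/(\p{IsLower}\p{IsUpper})/;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $1;
}

# "díne" as the raw bytes of its UTF-8 encoding (í is 0xC3 0xAD), which is
# what Perl sees for the literal when "use utf8" is not in effect.
print "matched: ", matched_hex("d\xC3\xADne"), "\n";
# prints: matched: 64 c3
```

The second byte, c3, is the first half of the UTF-8 encoding of í.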

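In each case, what shows up is the letter before the accented character, 
and *not* the accented character itself. Without the utf8 pragma, Perl 
treats the string literal as raw bytes, so the accented character is a 
two-byte sequence starting with 0xC3, and 0xC3 on its own is Ã (U+00C3), 
which counts as uppercase. In other words, Perl doesn't realise that the 
multibyte sequence should be treated as one character. Let's turn on the 
utf8 pragma so Perl knows our source code is in UTF-8: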

#!/usr/bin/perl

use strict;
use warnings;
use utf8;

my @words = ("díne", "láechreraig", "caínConchobor");

foreach my $word (@words) {
         if($word =~ m/\p{IsLower}\p{IsUpper}/) {
                  print "$word: $&\n";
         }
}

ca�nConchobor: nC

Success! Almost. Now we're outputting the result with the wrong 
encoding, for reasons that someone else will hopefully be able to explain.
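
A likely explanation, offered as a sketch rather than gospel: with 
"use utf8" the literals become strings of characters rather than bytes, 
and print on a filehandle with no encoding layer writes each character 
below 256 as a single byte. So í (U+00ED) goes out as the lone byte 0xED, 
which isn't valid UTF-8, and the terminal shows a replacement character. 
The following illustrates the bytes/characters distinction using decode 
and encode from the Encode module; the variable names are made up for the 
example.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use Encode qw(decode encode);

# The five bytes of "díne" in UTF-8 (í is 0xC3 0xAD).
my $bytes = "d\xC3\xADne";

# Roughly what "use utf8" does to the literal: decode bytes to characters.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
# prints: bytes: 5, characters: 4

printf "second character: U+%04X\n", ord(substr($chars, 1, 1));
# prints: second character: U+00ED

# Encoding the characters again gives back the original byte sequence.
print "round trip ok\n" if encode('UTF-8', $chars) eq $bytes;
```

So the fix is to turn the characters back into UTF-8 bytes on the way out.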

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Encode;

my @words = ("díne", "láechreraig", "caínConchobor");

foreach my $word (@words) {
         if($word =~ m/\p{IsLower}\p{IsUpper}/) {
                  # 'UTF-8' is Encode's strict encoding; 'utf8' is Perl's lax variant
                  print encode('UTF-8', "$word\n");
         }
}

caínConchobor

Wincore!
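
An alternative to calling encode on every print, sketched here on the 
assumption that your terminal expects UTF-8, is to put an encoding layer 
on STDOUT once with binmode. The camel_words helper is made up for the 
example; it just wraps the same regex.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use utf8;

# Everything printed to STDOUT from here on is encoded as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: return the words where a lowercase letter is
# immediately followed by an uppercase one.
sub camel_words {
    return grep { m/\p{IsLower}\p{IsUpper}/ } @_;
}

my @words = ("díne", "láechreraig", "caínConchobor");

print "$_\n" for camel_words(@words);
```

With the layer in place, plain print works and the Encode module isn't 
needed for output.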

HTH,
Miles