[Edinburgh-pm] Dumb regex question
Miles Gould
miles at assyrian.org.uk
Fri Jun 15 09:43:28 PDT 2012
On 15/06/12 17:18, Chris Yocum wrote:
> my @words = ("díne", "láechreraig", "caínConchobor");
>
> foreach my $word (@words) {
>     if($word =~ m/\p{IsLower}(?=\p{IsUpper})/) {
>         print "$word\n";
>     }
> }
First off, the lookahead assertion doesn't change which words match (it
only keeps the uppercase character out of the matched text), so let's
simplify the regex.
#!/usr/bin/perl
use strict;
use warnings;
my @words = ("díne", "láechreraig", "caínConchobor");
foreach my $word (@words) {
    if($word =~ m/\p{IsLower}\p{IsUpper}/) {
        print "$word\n";
    }
}

díne
láechreraig
caínConchobor
OK, bug verified (this is all on perl 5.10.1, by the way). Now let's see
what's actually matching:
#!/usr/bin/perl
use strict;
use warnings;
my @words = ("díne", "láechreraig", "caínConchobor");
foreach my $word (@words) {
    if($word =~ m/\p{IsLower}\p{IsUpper}/) {
        print "$word: $&\n";
    }
}

díne: d
láechreraig: l
caínConchobor: a
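To see exactly what got matched, here's a minimal sketch that dumps each
match as hex instead of printing it raw. The helper name matched_hex is made
up for illustration, and the words are written with explicit UTF-8 byte
escapes (my assumption about how the source file was saved) so the result
doesn't depend on anyone's editor settings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the matched text as space-separated hex
# values, one per matched element, or nothing if the pattern fails.
sub matched_hex {
    my ($word) = @_;
    return unless $word =~ m/(\p{IsLower}\p{IsUpper})/;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $1;
}

# No "use utf8" here, so these are byte strings. The accented letters are
# spelled out as their UTF-8 bytes (i-acute = C3 AD, a-acute = C3 A1).
my @words = ("d\xC3\xADne", "l\xC3\xA1echreraig", "ca\xC3\xADnConchobor");
foreach my $word (@words) {
    my $hex = matched_hex($word);
    print "$hex\n" if defined $hex;
}
# prints:
# 64 c3
# 6c c3
# 61 c3
```

Every match is two elements long: the plain letter, then the byte c3.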
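In each case, the match is the letter before the accented character plus
something unprintable, and *not* the accented character itself. The
accented letters are two bytes each in UTF-8, starting with 0xC3, and 0xC3
read as a single Latin-1 character is Ã, which is uppercase. So maybe Perl
doesn't realise that the multibyte sequence should be treated as one
character? Let's turn on the utf8 pragma so Perl knows our source code is
in UTF-8: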
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
my @words = ("díne", "láechreraig", "caínConchobor");
foreach my $word (@words) {
    if($word =~ m/\p{IsLower}\p{IsUpper}/) {
        print "$word: $&\n";
    }
}

ca�nConchobor: nC
Success! Almost. Now we're outputting the result with the wrong
encoding, for reasons that someone else will hopefully be able to explain.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
my @words = ("díne", "láechreraig", "caínConchobor");
foreach my $word (@words) {
    if($word =~ m/\p{IsLower}\p{IsUpper}/) {
        print encode('utf8', "$word\n");
    }
}

caínConchobor
Wincore!
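As for the wrong encoding, a likely explanation: with "use utf8" the strings
hold characters rather than bytes, and print on a filehandle with no
encoding layer writes characters below 256 as single Latin-1 bytes, so í
goes out as the lone byte 0xED, which isn't valid UTF-8. A minimal sketch of
an alternative to calling encode on every string, assuming the terminal
expects UTF-8, is to set an encoding layer on STDOUT once. The helper name
has_case_boundary is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # string literals in this file are UTF-8

# Encode everything printed to STDOUT as UTF-8, once, up front.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: true if a lowercase character is directly
# followed by an uppercase one.
sub has_case_boundary {
    my ($word) = @_;
    return $word =~ m/\p{IsLower}\p{IsUpper}/ ? 1 : 0;
}

my @words = ("díne", "láechreraig", "caínConchobor");
foreach my $word (@words) {
    print "$word\n" if has_case_boundary($word);
}
```

With the layer in place, plain print does the encoding, so the loop body
doesn't need Encode at all.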
HTH,
Miles