[Edinburgh-pm] Dumb regex question

Fri Jun 15 09:50:23 PDT 2012

On Fri, Jun 15, 2012 at 05:43:28PM +0100, Miles Gould wrote:
> On 15/06/12 17:18, Chris Yocum wrote:
> >my @words = ("díne", "láechreraig", "caínConchobor");
> >
> >foreach my $word (@words) {
> >         if($word =~ m/\p{IsLower}(?=\p{IsUpper})/) {
> >                  print "$word\n";
> >         }
> >}
> 
> First off, the lookahead assertion doesn't change whether the regex
> matches, so let's simplify it.
> 
> 
> #!/usr/bin/perl
> 
> use strict;
> use warnings;
> 
> my @words = ("díne", "láechreraig", "caínConchobor");
> 
> foreach my $word (@words) {
>         if($word =~ m/\p{IsLower}\p{IsUpper}/) {
>                  print "$word\n";
>         }
> }
> 
> díne
> láechreraig
> caínConchobor
> 

That's the regex I originally started with.  After some Googling, I
thought that the one I sent was the one I needed.

> 
> OK, bug verified (this is all on perl 5.10.1, by the way). Now let's
> see what's actually matching:
> 
> 
> #!/usr/bin/perl
> 
> use strict;
> use warnings;
> 
> my @words = ("díne", "láechreraig", "caínConchobor");
> 
> foreach my $word (@words) {
>         if($word =~ m/\p{IsLower}\p{IsUpper}/) {
>                  print "$word: $&\n";
>         }
> }
> 
> díne: d
> láechreraig: l
> caínConchobor: a
> 
> 
> In each case, it's the letter before the accented character, and
> *not* the accented character itself. Maybe it doesn't realise that
> the multibyte sequence should be treated as one character? Let's
> turn on the utf8 pragma so Perl knows our source code is in UTF-8:
> 
> 
> #!/usr/bin/perl
> 
> use strict;
> use warnings;
> use utf8;
> 
> my @words = ("díne", "láechreraig", "caínConchobor");
> 
> foreach my $word (@words) {
>         if($word =~ m/\p{IsLower}\p{IsUpper}/) {
>                  print "$word: $&\n";
>         }
> }
> 
> ca�nConchobor: nC
> 
> 
> Success! Almost. Now we're outputting the result with the wrong
> encoding, for reasons that someone else will hopefully be able to
> explain.
> 
> 
> #!/usr/bin/perl
> 
> use strict;
> use warnings;
> use utf8;
> use Encode;
> 
> my @words = ("díne", "láechreraig", "caínConchobor");
> 
> foreach my $word (@words) {
>         if($word =~ m/\p{IsLower}\p{IsUpper}/) {
>                  print encode('utf8', "$word\n");
>         }
> }
> 
> caínConchobor
> 
> 
> Wincore!

w00t!  Thanks!!
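
For the archive, here is my understanding of the two open questions above, plus a sketch that avoids the explicit encode() call. Treat it as a hedged guess rather than the definitive answer. Without "use utf8", the literal "í" is two bytes (0xC3 0xAD), and \p{...} looks at each byte as a code point, so 0xC3 counts as "Ã", an uppercase letter; that would explain why the letter before every accented character matched. With "use utf8" the strings hold real characters, but STDOUT has no encoding layer, so "í" (U+00ED) goes out as the single byte 0xED, which is not valid UTF-8. Setting an encoding layer on STDOUT once should fix the output side. The helper name case_boundary is mine, purely for illustration, and I'm assuming a UTF-8 terminal and a UTF-8 source file.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # string literals in this file are decoded from UTF-8

# Encode everything printed to STDOUT as UTF-8 (assumes a UTF-8 terminal).
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: returns the lowercase/uppercase pair at the first
# case boundary in $word, or undef if there is none.
sub case_boundary {
    my ($word) = @_;
    return $word =~ m/(\p{IsLower}\p{IsUpper})/ ? $1 : undef;
}

my @words = ("díne", "láechreraig", "caínConchobor");

foreach my $word (@words) {
    my $pair = case_boundary($word);
    print "$word: $pair\n" if defined $pair;
}
```

Capturing the pair in $1 instead of using $& is just a habit; on older perls, mentioning $& anywhere in a program slows down every regex match.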