[Edinburgh-pm] Dumb regex question
Chris Yocum
cyocum at gmail.com
Fri Jun 15 09:50:23 PDT 2012
On Fri, Jun 15, 2012 at 05:43:28PM +0100, Miles Gould wrote:
> On 15/06/12 17:18, Chris Yocum wrote:
> >my @words = ("díne", "láechreraig", "caínConchobor");
> >
> >foreach my $word (@words) {
> >    if ($word =~ m/\p{IsLower}(?=\p{IsUpper})/) {
> >        print "$word\n";
> >    }
> >}
>
> First off, the lookahead assertion makes no difference to whether
> the pattern matches (it only changes what ends up in $&), so let's
> simplify the regex.
>
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my @words = ("díne", "láechreraig", "caínConchobor");
>
> foreach my $word (@words) {
>     if ($word =~ m/\p{IsLower}\p{IsUpper}/) {
>         print "$word\n";
>     }
> }
>
> díne
> láechreraig
> caínConchobor
>
That's the regex I originally started with. After some Googling, I
thought that the one I sent was the one I needed.
>
> OK, bug verified (this is all on perl 5.10.1, by the way). Now let's
> see what's actually matching:
>
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my @words = ("díne", "láechreraig", "caínConchobor");
>
> foreach my $word (@words) {
>     if ($word =~ m/\p{IsLower}\p{IsUpper}/) {
>         print "$word: $&\n";
>     }
> }
>
> díne: d
> láechreraig: l
> caínConchobor: a
>
>
> In each case, it's the letter before the accented character, and
> *not* the accented character itself. Maybe it doesn't realise that
> the multibyte sequence should be treated as one character? Let's
> turn on the utf8 pragma so Perl knows our source code is in UTF-8:
>
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use utf8;
>
> my @words = ("díne", "láechreraig", "caínConchobor");
>
> foreach my $word (@words) {
>     if ($word =~ m/\p{IsLower}\p{IsUpper}/) {
>         print "$word: $&\n";
>     }
> }
>
> ca�nConchobor: nC
>
>
> Success! Almost. Now we're outputting the result with the wrong
> encoding, for reasons that someone else will hopefully be able to
> explain.
>
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use utf8;
> use Encode;
>
> my @words = ("díne", "láechreraig", "caínConchobor");
>
> foreach my $word (@words) {
>     if ($word =~ m/\p{IsLower}\p{IsUpper}/) {
>         print encode('utf8', "$word\n");
>     }
> }
>
> caínConchobor
>
>
> Wincore!
w00t! Thanks!!
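
On the unexplained wrong encoding: a likely reason, as far as I can
tell, is that with `use utf8` the literals become character strings,
and STDOUT has no encoding layer by default, so "í" (U+00ED) is written
as the single byte 0xED, which is not valid UTF-8 on its own. A minimal
sketch of an alternative to calling encode() on every print, assuming a
UTF-8 terminal, is to set the output layer once with binmode. The
helper name mixed_case_words is made up for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 in the source file

# Declare once that STDOUT expects UTF-8; every later print is then
# encoded on the way out.
binmode(STDOUT, ':encoding(UTF-8)');

# mixed_case_words is a hypothetical helper: it returns the words that
# contain a lowercase letter immediately followed by an uppercase one.
sub mixed_case_words {
    return grep { m/\p{IsLower}\p{IsUpper}/ } @_;
}

my @words = ("díne", "láechreraig", "caínConchobor");

print "$_\n" for mixed_case_words(@words);
```

Whether binmode or an explicit encode() is nicer probably depends on
how many places in the program write output; with the layer set, plain
print calls need no further changes.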