[Edinburgh-pm] Dumb regex question

Fri Jun 15 09:45:33 PDT 2012

Chris Yocum <cyocum at gmail.com> wrote:
> my @words = ("díne", "láechreraig", "caínConchobor");

Are you sure that Perl is seeing the strings you think it's seeing?
Try adding `use re "debug"` to see how the regex is compiled and
matched; my guess is that you need `use utf8` (or, if these strings
come from a file, an `:encoding(UTF-8)` layer set via binmode or
open) to get the behaviour you want.  Since your program doesn't say
what encoding it's in, Perl treats each byte of the source as one
Latin-1 character (for backwards compatibility).
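
As a rough sketch of the fix, assuming your source file is saved as
UTF-8: the pattern below is my guess at what you're matching (a
lower-case letter immediately followed by an upper-case one), and
$camel and @hits are names I've made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # tell Perl the string literals in this file are UTF-8

my @words = ("díne", "láechreraig", "caínConchobor");

# Hypothetical pattern, not taken from the original program:
# a lower-case letter immediately followed by an upper-case letter.
my $camel = qr/\p{Ll}\p{Lu}/;

# Keep only the words where the pattern matches.
my @hits = grep { $_ =~ $camel } @words;

# Encode output as UTF-8 so the accented characters print correctly.
binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @hits;
```

If the words come from a file rather than from literals, `use utf8`
doesn't apply; opening the file with a '<:encoding(UTF-8)' layer
should have the equivalent effect.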

In particular, "á" is 0xC3 0xA1 in UTF-8, but those bytes also form
the Latin-1 representation of U+00C3 U+00A1 "Ã¡", so Perl is probably
seeing "lÃ¡echreraig" for $words[1], which does indeed contain a
lower-case letter ("l") immediately followed by an upper-case one
("Ã").  A similar argument applies to "í", whose UTF-8 encoding also
starts with 0xC3.
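
To see the effect in isolation, here is a small sketch that spells
out the UTF-8 bytes with escapes, so it behaves the same however the
file is saved.  Again, the pattern is only my assumption about what
you're matching, and the variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical pattern, not taken from the original program:
# a lower-case letter immediately followed by an upper-case letter.
my $camel = qr/\p{Ll}\p{Lu}/;

# The UTF-8 bytes of "láechreraig"; 0xC3 0xA1 is the encoded "á".
my $bytes = "l\xC3\xA1echreraig";

# Undecoded, each byte counts as one character, so "l" is followed
# by U+00C3, which is an upper-case letter.
my $undecoded_matches = ($bytes =~ $camel) ? 1 : 0;

# Decoded, 0xC3 0xA1 collapses into the single character U+00E1.
my $chars = decode('UTF-8', $bytes);
my $decoded_matches = ($chars =~ $camel) ? 1 : 0;

printf "undecoded: length %d, matches %d\n",
    length($bytes), $undecoded_matches;
printf "decoded:   length %d, matches %d\n",
    length($chars), $decoded_matches;
```

The undecoded string is one character longer than the decoded one,
and only the undecoded string matches.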

> Note that this must be in Unicode because I have data with
> accent marks in it.

Well, it's certainly possible to represent contemporary Irish
orthography in Latin-1 — no Unicode needed there.  But I agree that
Unicode is desirable nonetheless.

-- 
Aaron Crane ** http://aaroncrane.co.uk/