[Edinburgh-pm] Dumb regex question
Aaron Crane
perl at aaroncrane.co.uk
Fri Jun 15 09:45:33 PDT 2012
Chris Yocum <cyocum at gmail.com> wrote:
> my @words = ("díne", "láechreraig", "caínConchobor");
Are you sure that Perl is seeing the strings you think it's seeing?
Try adding `use re "debug"` to get more insight into what's going on;
my guess is that you need `use utf8` (or the binmode equivalent if
these strings come from a file) to get the behaviour you want. Since
your program doesn't say what encoding it's in, Perl assumes Latin-1
(for backwards compatibility).
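For the file case, here is a minimal sketch of what I mean by the binmode route. The in-memory handle and the $raw variable are stand-ins I've made up for your real input file; the bytes are written as escapes so the example doesn't depend on how the script itself is saved.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a UTF-8 encoded input file: raw bytes
# for "díne" and "láechreraig", one word per line.
my $raw = "d\xC3\xADne\nl\xC3\xA1echreraig\n";

open my $in, '<', \$raw or die "cannot open in-memory file: $!";
binmode $in, ':encoding(UTF-8)';     # decode bytes to characters on read
chomp(my @words = <$in>);
close $in;

binmode STDOUT, ':encoding(UTF-8)';  # encode characters again on output
print length($_), " $_\n" for @words;
```

If the strings are literals in the program itself, `use utf8` at the top of a file saved as UTF-8 does the equivalent job for the source text.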
In particular, "á" is 0xC3 0xA1 in UTF-8, but those bytes also form
the Latin-1 representation of U+00C3 U+00A1 "Ã¡", so Perl is probably
seeing "lÃ¡echreraig" for $words[1], which does indeed contain a
lower-case letter immediately followed by an upper-case one. A similar
argument applies to "í".
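You can check the effect directly with something like the sketch below. The pattern /\p{Ll}\p{Lu}/ is my assumption, standing in for "a lower-case letter followed by an upper-case one"; your real regex may differ. The bytes are again written as escapes so the result doesn't depend on the script's own encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes of "láechreraig", as Perl sees them without `use utf8`.
my $bytes = "l\xC3\xA1echreraig";

# Undecoded, each byte is one character: "l", U+00C3 (upper-case), U+00A1, ...
my $undecoded_hit = $bytes =~ /\p{Ll}\p{Lu}/ ? 1 : 0;

# Decoded, 0xC3 0xA1 becomes the single lower-case character U+00E1.
my $chars       = decode('UTF-8', $bytes);
my $decoded_hit = $chars =~ /\p{Ll}\p{Lu}/ ? 1 : 0;

printf "undecoded: length %d, match %d\n", length $bytes, $undecoded_hit;
printf "decoded:   length %d, match %d\n", length $chars, $decoded_hit;
# undecoded: length 12, match 1
# decoded:   length 11, match 0
```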
> Note that this must be in Unicode because I have data with
> accent marks in it.
Well, it's certainly possible to represent contemporary Irish
orthography in Latin-1 — no Unicode needed there. But I agree that
Unicode is desirable nonetheless.
--
Aaron Crane ** http://aaroncrane.co.uk/