utf8 and regular expressions.
Scott Penrose
scottp at dd.com.au
Wed Jan 30 23:12:41 CST 2002
Since Perl Monks is down (well, partially), I thought I might post this
here (melbourne-pm at pm.org).
Here is a cute bit of code that has caused us hours of problems...
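#!/usr/bin/perl -w
$test = $ARGV[0] || "abc%abc123";
print "Testing on $test\n";
# First attempt: the pattern with the utf8 pragma in effect.
use utf8;
if ($test =~ /%([\dA-Fa-f]{2})/) {
    print "Found - $1\n";
}
# Second attempt: the identical pattern with the pragma turned off.
no utf8;
if ($test =~ /%([\dA-Fa-f]{2})/) {
    print "Found - $1\n";
}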
The output of the above is:
Testing on abc%abc123
Found - abc123
Found - ab
Using perl 5.6.0 or perl 5.6.1 (I tried both).
The problem, if you have not spotted it, is that we asked for exactly
{2} characters but get more when "use utf8" is in effect.
"{n} Match exactly n times" - man perlre
We also tried {2,2}
"{n,m} Match at least n but not more than m times" - man perlre
With "use utf8" in effect, the match takes every consecutive character in
the character class, no matter how long the string.
This makes decoding a URL - HELL !
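For context, the kind of decoding step this breaks might look like the sketch below. It is only an illustration: url_decode is a hypothetical helper name, not code from our application, and it assumes a perl where {2} behaves as documented.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (not from the original post): decode %XX escapes.
sub url_decode {
    my ($s) = @_;
    # Replace each '%' followed by exactly two hex digits with the
    # character that the two digits encode.
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

print url_decode("abc%41bc123"), "\n";   # prints "abcAbc123"
```

If the quantifier grabs more than two hex digits, hex() is handed the wrong value and the decoded string comes out corrupted.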
I don't think I am doing anything wrong, but maybe someone can point out
a problem with the above?
Otherwise, it looks like a UTF-8 bug in perl's regex engine.
Does anyone have perl 5.7 installed they could test it on?
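A small self-check along these lines might make it easier to compare versions. This is a sketch only: check_capture is a hypothetical name, and the "ok"/"BUG" labels simply assume that a two-character capture is the documented behaviour.

```perl
#!/usr/bin/perl -w
use strict;
use utf8;

# Hypothetical helper (not from the original post): return what the
# pattern captures, or undef if it does not match at all.
sub check_capture {
    my ($s) = @_;
    return undef unless $s =~ /%([\dA-Fa-f]{2})/;
    return $1;
}

my $got = check_capture("abc%abc123");
if (defined $got) {
    # $] holds the version number of the running perl.
    printf "perl %s: captured '%s' (%s)\n",
        $], $got, length($got) == 2 ? "ok" : "BUG";
} else {
    printf "perl %s: no match\n", $];
}
```

On a perl where {2} is honoured, the capture should be the two characters "ab".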
Scott
---
Scott Penrose
Open source and Linux Developer
http://linux.dd.com.au/
scottp at dd.com.au