utf8 and regular expressions.

Scott Penrose scottp at dd.com.au
Wed Jan 30 23:12:41 CST 2002


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Since Perl Monks is down (well, partially), I thought I might post this 
here (melbourne-pm at pm.org).

Here is a cute bit of code that has caused us hours of problems...

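#!/usr/bin/perl -w

# Take the test string from the command line, or fall back to a default.
$test = $ARGV[0] || "abc%abc123";

print "Testing on $test\n";

# First attempt: the regex is compiled with the utf8 pragma in effect.
use utf8;
if ($test =~ /%([\dA-Fa-f]{2})/) {
    print "Found - $1\n";
}

# Second attempt: the identical regex, with the utf8 pragma switched off.
no utf8;
if ($test =~ /%([\dA-Fa-f]{2})/) {
    print "Found - $1\n";
}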

The output of the above is:

	Testing on abc%abc123
	Found - abc123
	Found - ab

This is with perl 5.6.0 and perl 5.6.1 (I tried both).

The problem, if you have not spotted it, is that we asked for exactly 
{2} characters but get more when utf8 is in effect.

	"{n}    Match exactly n times" - man perlre
We also tried {2,2}
	"{n,m}  Match at least n but not more than m times" - man perlre
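
For comparison, here is a minimal sketch of what those two quantifiers 
should do according to perlre, on a perl where they behave as documented. 
The helper name matches_exactly_two is made up for illustration and is 
not part of the original script. The pattern is anchored at both ends so 
that only the quantifier decides whether the string matches.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): true only when the
# whole string is exactly two characters from the hex character class.
sub matches_exactly_two {
    my ($s, $use_range) = @_;
    # {2} and {2,2} are documented to mean the same thing.
    return $use_range
        ? ($s =~ /\A[\dA-Fa-f]{2,2}\z/ ? 1 : 0)
        : ($s =~ /\A[\dA-Fa-f]{2}\z/   ? 1 : 0);
}

for my $s ("a", "ab", "abc") {
    printf "%-3s  {2}: %d  {2,2}: %d\n",
        $s, matches_exactly_two($s, 0), matches_exactly_two($s, 1);
}
```

Only "ab" should be reported as a match, for both forms.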

With "use utf8" in effect, the match swallows every consecutive character 
in the character class, no matter how long the run is.
This makes decoding a URL hell!
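
To show why the {2} matters, here is a rough sketch of the kind of URL 
decoding this regex is used for. It assumes a perl where {2} behaves as 
documented. The name decode_percent is hypothetical, and the sketch uses 
[0-9A-Fa-f] rather than \d so that only ASCII digits are accepted.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): replace each '%'
# followed by exactly two hex digits with the character it encodes.
sub decode_percent {
    my ($s) = @_;
    # /g handles every escape in the string, /e evaluates chr(hex(...)).
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

print decode_percent("abc%20def%41"), "\n";   # prints "abc defA"
```

If the quantifier takes more than two characters, hex() is handed 
something like "abc123" instead of "ab", and the decoded text is wrong.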

I don't think I am doing anything wrong, but maybe someone can point out 
a problem with the above?

Otherwise, it is a UTF-8 bug in perl's regex engine.
Does anyone have perl 5.7 installed that they could test it on?
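
For anyone willing to test, a small self-reporting variant of the script 
might be easier to compare across versions. This is only a sketch: the 
helper name two_hex_capture is made up, and the expected capture "ab" is 
taken from the "no utf8" result in this post.

```perl
#!/usr/bin/perl -w
use strict;
use utf8;

# Hypothetical helper (not from the original script): return what the
# {2} group captures, or an empty string if the pattern does not match.
sub two_hex_capture {
    my ($s) = @_;
    return $s =~ /%([\dA-Fa-f]{2})/ ? $1 : '';
}

my $got = two_hex_capture("abc%abc123");

# $] holds the version number of the running perl.
printf "perl %s: captured '%s' (%s)\n",
    $], $got, $got eq "ab" ? "ok" : "unexpected";
```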

Scott
- ---
Scott Penrose
Open source and Linux Developer
http://linux.dd.com.au/
scottp at dd.com.au
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (Darwin)
Comment: For info see http://www.gnupg.org

iD8DBQE8WNJODCFCcmAm26YRAm5qAJ44PXprwN6jID3GtKixlENp//VqqQCeILnM
su4MTqPzhL56scRdNBHCBuA=
=DsBt
-----END PGP SIGNATURE-----



