[Melbourne-pm] Regular expression musings

Jacinta Richardson jarich at perltraining.com.au
Wed Jun 4 21:34:28 PDT 2008


I'm writing a Perl tip ( http://perltraining.com.au/tips/ ) about the usual
regular expression optimisations, and decided to benchmark some of them with
respect to .*, .*? and alternatives.

I considered the case of matching a string inside double quotes.  I used one
very long string copied three times with:

	* the whole string in quotes    (1)
	* only the first word in quotes (2)
	* only the last word in quotes  (3)
	* only one double quote         (4)

I figured that this covered all of my options since the regular expression
engine would halt as soon as the second double quote was found.  I used the
following expressions:

	/".*"/          (dot_star)
	/".*?"/         (dot_quest)
	/"[^"]*"/       (brackets)
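
For illustration, the four cases and three expressions could be set up roughly
as below.  This is only a sketch: the sample text, the %case and %regex hashes
and the matched() helper are hypothetical names, not taken from the attached
script.  The brackets pattern is written with a * quantifier, the form the
discussion of the results refers to.  It prints what each expression matches on
each case, which is a useful sanity check before timing anything.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the long test string (contains no quotes itself).
my $text = join ' ', ('lorem ipsum dolor sit amet') x 80;
my ($first, $rest) = split / /, $text, 2;      # first word, remainder
my ($init, $last)  = $text =~ /^(.*) (\S+)$/;  # remainder, last word

my %case = (
    1 => qq{"$text"},         # the whole string in quotes
    2 => qq{"$first" $rest},  # only the first word in quotes
    3 => qq{$init "$last"},   # only the last word in quotes
    4 => qq{"$text},          # only one double quote
);

my %regex = (
    dot_star  => qr/".*"/,
    dot_quest => qr/".*?"/,
    brackets  => qr/"[^"]*"/,
);

# Return the text matched by expression $name on case $n, or undef on failure.
sub matched {
    my ($name, $n) = @_;
    return $case{$n} =~ /($regex{$name})/ ? $1 : undef;
}

for my $n (sort keys %case) {
    for my $name (sort keys %regex) {
        my $m = matched($name, $n);
        printf "case %d %-9s %s\n", $n, $name,
            defined $m ? 'matched ' . length($m) . ' chars' : 'no match';
    }
}
```

With only two double quotes in each of cases 1 to 3, all three expressions
should match the same text there, and none should match on case 4.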

What I expected to find was:

	* dot_star clearly fastest on 1
	* dot_quest slower in general than brackets
	* brackets fastest on 2 and 4 (followed closely by dot_quest)
	* no significant time difference on 3
	* very significant time difference between dot_quest and the others on 4

What I found instead was:

	* dot_star fastest on 1, 4
	* dot_quest fastest on 2, 3
	* brackets never fastest
	* no significant time difference between any of them on 3, 4

The benchmarking results are:

Comparing over a string of length: 2159
Whole string quoted
               Rate  brackets1 dot_quest1  dot_star1
brackets1   17958/s         --       -75%       -83%
dot_quest1  71522/s       298%         --       -33%
dot_star1  107222/s       497%        50%         --

First word only quoted
               Rate  dot_star2  brackets2 dot_quest2
dot_star2  562469/s         --       -31%       -40%
brackets2  813620/s        45%         --       -13%
dot_quest2 936418/s        66%        15%         --

Last word only quoted
              Rate  brackets3  dot_star3 dot_quest3
brackets3  68064/s         --        -1%        -2%
dot_star3  68713/s         1%         --        -1%
dot_quest3 69176/s         2%         1%         --

Single starting quote
               Rate dot_quest4  brackets4  dot_star4
dot_quest4 203988/s         --        -0%        -1%
brackets4  204852/s         0%         --        -1%
dot_star4  206853/s         1%         1%         --


This surprises me.  I expected it to take more time for .*? to take a
character, try to match the ", fail and repeat, than to just compare a
character to a bitmap, consume it and repeat.  I certainly didn't expect .*? to
be about 300% faster than [^"]* over a long string.  I'm particularly surprised
to see .*? be 15% faster than [^"]* over a string of 5 characters.  This seems
even more unusual because [^"]* is not that much slower over the 9 characters
at the end of the string.

My benchmarking code is attached.  Can anyone spot any issues which might be
influencing these results?
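
For readers without the attachment, a comparison of this kind can be sketched
with the core Benchmark module, whose cmpthese() prints tables in the same
layout as the results above.  This is not the attached script: the sample text,
the @cases list, the %rows hash and the $min_cpu setting are all assumptions
made for the example, and the rates it reports will differ from those shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-in for the long test string (contains no quotes itself).
my $text = join ' ', ('lorem ipsum dolor sit amet') x 80;
my ($first, $rest) = split / /, $text, 2;      # first word, remainder
my ($init, $last)  = $text =~ /^(.*) (\S+)$/;  # remainder, last word

my @cases = (
    [ 'Whole string quoted'    => qq{"$text"}        ],
    [ 'First word only quoted' => qq{"$first" $rest} ],
    [ 'Last word only quoted'  => qq{$init "$last"}  ],
    [ 'Single starting quote'  => qq{"$text}         ],
);

# A negative count asks Benchmark to run each sub for at least that many
# CPU seconds.  0.5 is an arbitrary choice; raise it for steadier rates.
my $min_cpu = -0.5;
my %rows;    # keeps the table rows that cmpthese() returns, per case

print "Comparing over a string of length: ", length($text), "\n";
for my $case (@cases) {
    my ($label, $string) = @$case;
    print "$label\n";
    $rows{$label} = cmpthese($min_cpu, {
        dot_star  => sub { $string =~ /".*"/ },
        dot_quest => sub { $string =~ /".*?"/ },
        brackets  => sub { $string =~ /"[^"]*"/ },
    });
    print "\n";
}
```

Each sub does nothing but the match, so any per-call overhead is the same for
all three expressions and the percentage columns compare the matches alone.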

All the best,

	Jacinta

-- 
   ("`-''-/").___..--''"`-._          |  Jacinta Richardson         |
    `6_ 6  )   `-.  (     ).`-.__.`)  |  Perl Training Australia    |
    (_Y_.)'  ._   )  `._ `. ``-..-'   |      +61 3 9354 6001        |
  _..`--'_..-_/  /--'_.' ,'           | contact at perltraining.com.au |
 (il),-''  (li),'  ((!.-'             |   www.perltraining.com.au   |
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchmarking-res.pl
Type: application/x-perl
Size: 9839 bytes
Desc: not available
Url : http://mail.pm.org/pipermail/melbourne-pm/attachments/20080605/ea152ab2/attachment.bin 

