[Melbourne-pm] Regular expression musings
Jacinta Richardson
jarich at perltraining.com.au
Wed Jun 4 21:34:28 PDT 2008
I'm writing a Perl tip ( http://perltraining.com.au/tips/ ) about the usual
regular expression optimisations, and decided to benchmark .*, .*? and the
alternatives against each other.
I considered the case of matching a string inside double quotes. I used one
very long string copied three times with:
* the whole string in quotes (1)
* only the first word in quotes (2)
* only the last word in quotes (3)
* only one double quote (4)
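
A minimal sketch of how four such inputs might be built. The attached script is not reproduced here, so the filler text, the hash keys and the choice to put the lone quote at the very start are all assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical filler text standing in for the "very long string".
my $base  = join ' ', ('lorem ipsum dolor sit amet') x 80;
my @words = split / /, $base;

my %input = (
    # (1) the whole string in quotes
    whole  => qq{"$base"},
    # (2) only the first word in quotes
    first  => join(' ', qq{"$words[0]"}, @words[1 .. $#words]),
    # (3) only the last word in quotes
    last   => join(' ', @words[0 .. $#words - 1], qq{"$words[-1]"}),
    # (4) a single, unmatched double quote at the start
    single => qq{"$base},
);

printf "%-6s length %d\n", $_, length $input{$_} for sort keys %input;
```

Case 4 can never match, so it measures how quickly each pattern gives up.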
I figured that this covered all of my options since the regular expression
engine would halt as soon as the second double quote was found. I also used the
following expressions:
/".*"/ (dot_star)
/".*?"/ (dot_quest)
/"[^"]*"/ (brackets)
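
As a quick sanity check on what each expression matches, here is a small sketch. It takes brackets to be the starred class [^"]* that the discussion further down refers to; the sample text and the %re hash are hypothetical:

```perl
use strict;
use warnings;

# The three patterns under test, keyed by the labels used in this post.
my %re = (
    dot_star  => qr/".*"/,
    dot_quest => qr/".*?"/,
    brackets  => qr/"[^"]*"/,
);

# Hypothetical sample with two quoted words, to show where they differ.
my $text = 'say "one" and "two"';

for my $name (sort keys %re) {
    my ($match) = $text =~ /($re{$name})/;
    print "$name: $match\n";
}
```

The greedy dot_star runs from the first quote to the last one, while the other two stop at the first closing quote. In the four test cases there is at most one pair of quotes, so all three patterns match the same text whenever they match at all.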
What I expected to find was:
* dot_star clearly fastest on 1
* dot_quest slower in general than brackets
* brackets fastest on 2 and 4 (followed closely by dot_quest)
* no significant time difference on 3
* very significant time difference between dot_quest and the others on 4
What I found instead was:
* dot_star fastest on 1, 4
* dot_quest fastest on 2, 3
* brackets never fastest
* no significant time difference between any of them on 3, 4.
The benchmarking results are:
Comparing over a string of length: 2159

Whole string quoted
               Rate  brackets1 dot_quest1  dot_star1
brackets1   17958/s         --       -75%       -83%
dot_quest1  71522/s       298%         --       -33%
dot_star1  107222/s       497%        50%         --

First word only quoted
               Rate  dot_star2  brackets2 dot_quest2
dot_star2  562469/s         --       -31%       -40%
brackets2  813620/s        45%         --       -13%
dot_quest2 936418/s        66%        15%         --

Last word only quoted
               Rate  brackets3  dot_star3 dot_quest3
brackets3   68064/s         --        -1%        -2%
dot_star3   68713/s         1%         --        -1%
dot_quest3  69176/s         2%         1%         --

Single starting quote
               Rate dot_quest4  brackets4  dot_star4
dot_quest4 203988/s         --        -0%        -1%
brackets4  204852/s         0%         --        -1%
dot_star4  206853/s         1%         1%         --
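
Tables in this layout are what cmpthese from the core Benchmark module prints. A minimal sketch for one case (first word only quoted) could look like the following. The input text and the iteration count are assumptions, and this is not the attached script:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input for case 2: only the first word is quoted.
my $text = '"lorem" ' . join ' ', ('ipsum dolor sit amet lorem') x 80;

# The iteration count is arbitrary. A negative value such as -3 would
# instead run each sub for at least that many CPU seconds.
my $rows = cmpthese(200_000, {
    dot_star2  => sub { $text =~ /".*"/ },
    dot_quest2 => sub { $text =~ /".*?"/ },
    brackets2  => sub { $text =~ /"[^"]*"/ },
});
```

cmpthese prints the comparison chart and also returns it as a reference to an array of rows, which is handy for checking the numbers programmatically.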
This surprises me. I expected it to take more time for .*? to take a character,
try to match the ", fail and repeat, than to just compare a character to a
bitmap, consume it and repeat. I certainly didn't expect .*? to be 300% faster
than [^"]* over a long string. I'm particularly surprised to see .*? be 15%
faster than [^"]* over a string of 5 characters. This seems even more unusual
because [^"]* is barely slower (about 2%) over the 9 characters at the end of
the string.
My benchmarking code is attached. Can anyone spot any issues which might be
influencing these results?
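
One way to look into a question like this might be the re pragma's debug mode, which writes the compiled form of each pattern and a trace of the match attempt to STDERR. A minimal sketch, assuming a perl where the pragma is lexically scoped (5.10 or later) and using a short hypothetical input to keep the trace readable:

```perl
use strict;
use warnings;

my $text = 'a "quoted" word';    # short, hypothetical input

my ($lazy, $class);
{
    # Regexes compiled inside this block are dumped and traced on STDERR.
    use re 'debug';
    ($lazy)  = $text =~ /(".*?")/;
    ($class) = $text =~ /("[^"]*")/;
}

print "lazy:  $lazy\n";
print "class: $class\n";
```

Comparing the two traces shows which nodes the engine compiled for each pattern and how it stepped through the string, which may help explain timing differences that the pattern text alone does not.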
All the best,
Jacinta
--
("`-''-/").___..--''"`-._ | Jacinta Richardson |
`6_ 6 ) `-. ( ).`-.__.`) | Perl Training Australia |
(_Y_.)' ._ ) `._ `. ``-..-' | +61 3 9354 6001 |
_..`--'_..-_/ /--'_.' ,' | contact at perltraining.com.au |
(il),-'' (li),' ((!.-' | www.perltraining.com.au |
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchmarking-res.pl
Type: application/x-perl
Size: 9839 bytes
Desc: not available
Url : http://mail.pm.org/pipermail/melbourne-pm/attachments/20080605/ea152ab2/attachment.bin