[Melbourne-pm] Regular expression musings

Jacinta Richardson jarich at perltraining.com.au
Wed Jun 4 22:38:26 PDT 2008


Heh.  I've just shown that .* and .*? can fail very quickly.  I feel so foolish.
(The original data didn't need /s, but I should have put it in anyway.)  The
corrected file is attached, and the better benchmark results are:

Comparing over a string of length: 2159

Whole string quoted
               Rate  brackets1 dot_quest1  dot_star1
brackets1   16580/s         --       -84%       -93%
dot_quest1 102920/s       521%         --       -56%
dot_star1  232662/s      1303%       126%         --

First word only quoted
               Rate  dot_star2  brackets2 dot_quest2
dot_star2   46541/s         --       -94%       -94%
brackets2  744650/s      1500%         --        -7%
dot_quest2 804305/s      1628%         8%         --

Last word only quoted
              Rate  brackets3 dot_quest3  dot_star3
brackets3  65944/s         --        -2%        -2%
dot_quest3 66987/s         2%         --        -1%
dot_star3  67626/s         3%         1%         --

Single starting quote
               Rate  dot_star4 dot_quest4  brackets4
dot_star4  192377/s         --        -1%        -2%
dot_quest4 194869/s         1%         --        -0%
brackets4  195491/s         2%         0%         --
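
For readers without the attachment, here is a minimal sketch of how a comparison like the "Whole string quoted" table could be set up with the core Benchmark module. The three patterns are an assumption about what the labels brackets, dot_quest and dot_star stand for, and the test string is a made-up stand-in of roughly similar length, so the rates it prints will not match the ones above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: a stand-in for the 2159-character original.
my $text   = 'Lorem ipsum dolor sit amet, ' x 77;
my $quoted = qq{"$text"};

# Assumed patterns behind the benchmark labels (not taken from the attachment).
my %pattern = (
    brackets  => qr/"([^"]*)"/,    # negated character class
    dot_quest => qr/"(.*?)"/s,     # non-greedy dot
    dot_star  => qr/"(.*)"/s,      # greedy dot
);

# One benchmark sub per pattern, all matching against the same string.
my %tests = map {
    my $re = $pattern{$_};
    ( $_ => sub { my ($got) = $quoted =~ $re; return $got } );
} keys %pattern;

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, \%tests );
```

Swapping $quoted for a string with only the first or last word quoted, or with a single leading quote, would give the other three comparisons.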


Paul pointed out that Perl's way smarter than this: it makes anchors of static
points before running the regular expression.  We can see this when we turn on
re 'debug':

	use re 'debug';
	# $string1 is the whole-string-quoted test data (2159 characters)
	$string1 =~ /"(.*)"/s;

	exit;

Compiling REx `"(.*)"'
size 11 Got 92 bytes for offset annotations.
first at 1
   1: EXACT <">(3)
   3: OPEN1(5)
   5:   STAR(7)
   6:     SANY(0)
   7: CLOSE1(9)
   9: EXACT <">(11)
  11: END(0)
anchored """ at 0 floating """ at 1..2147483647 (checking floating) minlen 2
Offsets: [11]
        1[1] 0[0] 2[1] 0[0] 4[1] 3[1] 5[1] 0[0] 6[1] 0[0] 7[0]
Guessing start of match, REx ""(.*)"" against ""Lorem ipsum dolor sit amet,
consectetur adipisicing elit, s..."...
Found floating substr """ at offset 2158...
Found anchored substr """ at offset 0...
Guessed: match at offset 0
Matching REx ""(.*)"" against ""Lorem ipsum dolor sit amet, consectetur
adipisicing elit, s..."
  Setting an EVAL scope, savestack=14
   0 <> <"Lorem ipsum>    |  1:  EXACT <">
   1 <"> <Lorem ipsum>    |  3:  OPEN1
   1 <"> <Lorem ipsum>    |  5:  STAR
                           SANY can match 2158 times out of 2147483647...
  Setting an EVAL scope, savestack=14
2158 <s repellat.> <">    |  7:    CLOSE1
2158 <s repellat.> <">    |  9:    EXACT <">
2159 <s repellat."> <>    | 11:    END
Match successful!
Freeing REx: `"\"(.*)\""'


So Perl jumps to the anchors first, which is why it's so fast.  I'm not sure how
this ties in to the speed differences we see, though.
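
One way to start connecting the debug output to the four benchmark cases is to check what each pattern actually does on each shape of input. The sketch below is only that: the input strings are short hypothetical stand-ins for the real data, and the patterns are assumed to be the ones the labels suggest. It shows that all three patterns capture the same text whenever the string holds exactly two quotes, and that none of them can match a string with a single quote. That would at least be consistent with the optimiser's check for a second `"` at offset 1 or later, but it does not prove where the time goes.

```perl
use strict;
use warnings;

my $body = 'Lorem ipsum dolor sit amet';    # stand-in for the real text

# The four input shapes named in the benchmark headings (assumed layout).
my %input = (
    whole_quoted => qq{"$body"},
    first_word   => qq{"Lorem" ipsum dolor sit amet},
    last_word    => qq{Lorem ipsum dolor sit "amet"},
    single_quote => qq{"$body},
);

# Assumed patterns behind the labels brackets, dot_quest and dot_star.
my %pattern = (
    brackets  => qr/"([^"]*)"/,
    dot_quest => qr/"(.*?)"/s,
    dot_star  => qr/"(.*)"/s,
);

# Return the captured text, or undef when the pattern does not match.
sub capture {
    my ($case, $name) = @_;
    my ($got) = $input{$case} =~ $pattern{$name};
    return $got;
}

for my $case (sort keys %input) {
    for my $name (sort keys %pattern) {
        my $got = capture($case, $name);
        printf "%-13s %-10s %s\n", $case, $name,
            defined $got ? "captured '$got'" : 'no match';
    }
}
```

Running the same loop under use re 'debug' on one case at a time shows which of these outcomes are decided before the matcher proper starts.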

All the best,

	J

-- 
   ("`-''-/").___..--''"`-._          |  Jacinta Richardson         |
    `6_ 6  )   `-.  (     ).`-.__.`)  |  Perl Training Australia    |
    (_Y_.)'  ._   )  `._ `. ``-..-'   |      +61 3 9354 6001        |
  _..`--'_..-_/  /--'_.' ,'           | contact at perltraining.com.au |
 (il),-''  (li),'  ((!.-'             |   www.perltraining.com.au   |
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchmarking-res.pl
Type: application/x-perl
Size: 10142 bytes
Desc: not available
Url : http://mail.pm.org/pipermail/melbourne-pm/attachments/20080605/7826dc9a/attachment-0001.bin 

