[Omaha.pm] regex preference

Thu Aug 5 14:59:50 PDT 2010

2010/7/21 Jay Hannah <jhannah at omnihotels.com>
>    if ( $in_string =~ /<HotelCode>$code/mgi || $in_string =~ /<mfResort>$code/mgi ) {
>
> I prefer
>
>    if ( $in_string =~ /<(HotelCode|mfResort)>$code/mgi ) {

From a readability standpoint, I 100% agree with you.  I think the
combined regular expression is a lot more readable.  The one note I
would add, which bit me recently, regards performance.  If this is a
performance sensitive bit of code, and that regex will be going
through a lot of data, it's probably worth benchmarking it.
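
As a rough sketch of what such a benchmark could look like for the two
forms quoted above: the hotel code 'OMADT' and the sample lines are made
up for illustration, and /m and /g are left off because a plain
true/false test does not need them.  Treat it as a starting point under
those assumptions, not as a measurement of the real code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample input; substitute real data before drawing conclusions.
my $code  = 'OMADT';
my @lines = map {
      $_ % 3 == 0 ? "<HotelCode>$code</HotelCode>"
    : $_ % 3 == 1 ? "<mfResort>$code</mfResort>"
    :               "<Other>nothing here</Other>"
} 1 .. 3000;

# Each sub returns the number of matching lines, so the two forms can be
# checked for agreement before their timings are compared.
sub count_separate {
    my $n = 0;
    for (@lines) {
        $n++ if /<HotelCode>$code/i || /<mfResort>$code/i;
    }
    return $n;
}

sub count_combined {
    my $n = 0;
    for (@lines) {
        $n++ if /<(?:HotelCode|mfResort)>$code/i;
    }
    return $n;
}

# A negative count tells Benchmark to run each sub for at least that
# many CPU seconds instead of a fixed number of iterations.
cmpthese( -1, {
    separate => \&count_separate,
    combined => \&count_combined,
} );
```

cmpthese prints a small table of rates and relative percentages, which
is easier to read at a glance than raw timethese output.  The numbers
will depend heavily on the data and the Perl version.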

Background: A little while ago, I was writing a bit of Perl to do some
basic log processing.  When putting together the regular expressions,
I assumed that a single regex with alternation would be faster than
two separate checks, very similar to the code above.  That assumption
turned out to be very wrong.

A little assistance from the Benchmark module (thank you "Effective
Perl Programming, 2nd Ed." for kicking me in the ass to use that), and
I found out that combining the checks was significantly slower.  Here's
the code and results of my tests:

#!/usr/bin/perl
# vim: ts=3 sw=3 et sm ai smd sc bg=dark
#######################################################################
# Small script to benchmark regular expressions.  Expects test text as
# standard input.  Runs the test 100 times.  The xN_ prefix is to
# force test result ordering (sorts alphabetically by default).
#######################################################################
use strict;
use warnings;
use Benchmark qw(timethese);

my @data = <>;
my $host = "lab13";

print "Testing against " . scalar @data . " lines.\n";

timethese(
   $ARGV[0] || 100,
   {
      x1_control => sub {
         foreach (@data) {
            if (1) {
               next;
            };
         }
      },

      x2_mgi_separate => sub {
         foreach (@data) {
            my $foo = ( m/$host.*sudo/mgi || m/$host.*ssh/mgi );
         }
      },

      x3_separate => sub {
         foreach (@data) {
            if ( my ($foo) = ( m/$host.*sudo/g || m/$host.*ssh/g ) ) {
               next;
            };
         }
      },

      x4_mgi_combined => sub {
         foreach (@data) {
            if ( m/$host.*(?:sudo|ssh)/mgi ) {
               next;
            };
         }
      },

      x5_combined => sub {
         foreach (@data) {
            if ( m/$host.*(?:sudo|ssh)/g ) {
               next;
            };
         }
      },

      x6_mgi_combined_capture => sub {
         foreach (@data) {
            if ( m/$host.*(sudo|ssh)/mgi ) {
               next;
            };
         }
      },

      x7_combined_capture => sub {
         foreach (@data) {
            if ( m/$host.*(sudo|ssh)/g ) {
               next;
            };
         }
      },

   }
);

topher@nexus:~/perl/foo$ ./regex-benchmark.pl /tmp/regex-benchmark.data
Testing against 16789 lines.
Benchmark: timing 100 iterations of x1_control, x2_mgi_separate, x3_separate, x4_mgi_combined, x5_combined, x6_mgi_combined_capture, x7_combined_capture...

x1_control:  0 wallclock secs ( 0.32 usr +  0.00 sys =  0.32 CPU) @ 312.50/s (n=100)
            (warning: too few iterations for a reliable count)
x2_mgi_separate:  7 wallclock secs ( 6.41 usr +  0.00 sys =  6.41 CPU) @ 15.60/s (n=100)
x3_separate:  2 wallclock secs ( 2.20 usr +  0.01 sys =  2.21 CPU) @ 45.25/s (n=100)
x4_mgi_combined: 15 wallclock secs (14.87 usr +  0.04 sys = 14.91 CPU) @  6.71/s (n=100)
x5_combined:  3 wallclock secs ( 3.25 usr +  0.00 sys =  3.25 CPU) @ 30.77/s (n=100)
x6_mgi_combined_capture: 18 wallclock secs (17.66 usr +  0.00 sys = 17.66 CPU) @  5.66/s (n=100)
x7_combined_capture:  3 wallclock secs ( 3.48 usr +  0.00 sys =  3.48 CPU) @ 28.74/s (n=100)

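As you can see, for this case and with this data, separate regex
checks beat a combined regex with alternation: over twice as fast
with /mgi (6.41 vs. 14.91 CPU seconds) and about 1.5 times as fast
with plain /g (2.21 vs. 3.25).  The opposite of what I had expected.
Case-insensitive searches (the /mgi variants) are also significantly
slower, by roughly three to five times here.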
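
One thing worth checking when rerunning a benchmark like this: in
scalar context, m//g remembers where it stopped (see pos) on the string
it matched, and the loops above visit the same @data elements on every
iteration.  A minimal sketch of that effect, using a made-up log line
rather than the real test data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up log line, for illustration only.
my $line = "Aug  5 14:59:50 lab13 sudo: session opened";

# Scalar-context /g resumes from pos($line), so repeating the same test
# on the same string alternates between success and failure.
my @with_g = map { $line =~ /lab13.*sudo/g ? 1 : 0 } 1 .. 4;

# Clear any position left over from the matches above.
pos($line) = undef;

# Without /g, every attempt starts at the beginning of the string.
my @without_g = map { $line =~ /lab13.*sudo/ ? 1 : 0 } 1 .. 4;

print "with /g:    @with_g\n";       # with /g:    1 0 1 0
print "without /g: @without_g\n";    # without /g: 1 1 1 1
```

If that applies to a given test, dropping /g from the boolean checks
(or resetting pos between passes) may give timings that reflect the
matching work more directly.  How much it changes the numbers is
something to measure, not assume.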

Since then, I've accepted that I'm not as smart as I thought I was
when it comes to assumptions about optimizing regular expressions.
Now any regular expression that is going to be chewing on large data
sets gets tested against a few alternatives to make sure I'm not
screwing up performance by being clever.  Even little things can have
a big impact, especially with big data files.

> j

--
Christopher