[Edinburgh-pm] Regex performance article

Sun Jan 28 10:31:53 PST 2007

Rory Macdonald writes:
> http://swtch.com/~rsc/regexp/regexp1.html

That was very interesting; thanks for posting the link, Rory.

I don't think the approach Russ Cox describes is necessarily easy and/or
useful to fit into Perl, though.

There are two circumstances where the try-and-backtrack NFA execution
algorithm for regexes (that is, the algorithm used by Perl and PCRE and many
other tools) causes O(2^n) (exponential-time) performance.  One is when the
regex uses backreferences.  There are no known backreference algorithms that
avoid O(2^n) execution in all cases, so that's not surprising.  The other is
when the regex uses nested quantifiers -- things like /(a*)+/.

The thing that these two situations have in common is that they're fairly
rare in practice.  Backreferences are almost never useful, and when they
are, it's usually for something whose execution can't be O(2^n), like /\b
(\w+) \s+ \1 \b/x for finding duplicate words.  And nested quantifiers seem
to be used mainly in pathological test cases for regex engines.  The few
real-world situations they're needed for are things like this:

  / " (?: [^"\\]+ | \\ .)* " /x

which also can't take O(2^n) time.  It is admittedly possible to write such
things in ways that can exhibit O(2^n) performance, but I don't find that
significantly more interesting than the fact that you can write buggy code
that contains an infinite loop.  (For the record, you can also write regexes
that match the same strings, but are even more efficient; I have production
code containing

  / " (?> [^"\\]* ) (?> (?> \\ . [^"\\]* )* ) " /x

which can be executed a lot more quickly by Perl's engine.)

Russ mentions this point, but sees the issue as "a choice between an
implementation with a predictable, consistent, fast running time on all
inputs or one that usually runs quickly but can take years of CPU time (or
more) on some inputs".  That's a fair comment -- it would be nice not to
have to spend mental effort on constructing regexes in ways that avoid
misuse of nested quantifiers -- but it's also a simplification of the
situation.

First, as I'm sure Russ knows, asymptotic time complexity isn't the only
factor in how well a piece of code performs.  For small problem sizes (and
often, real-world problems _are_ small), the lower-order terms can hurt a
lot.  We never describe an algorithm as taking O(k_1n² + k_2n + k_3) time,
because that's asymptotically equivalent to O(n²) time complexity.  But for
small values of N, the k_x constants can dominate the actual runtime.

I recently encountered an excellent example of this, related precisely to
this issue of regex performance.  Another part of the production code
mentioned above needs to determine whether any of a set of 339 literal
strings can be found in another string, and we need to answer that question
millions of times per day.  We were building a regex from the literal
strings, separating each one with "|".  Profiling revealed that executing
this regex was a performance bottleneck.  That made sense: at each position
in the string being matched, Perl's regex engine has to try each of the
alternatives in turn.  On the other hand, a DFA execution algorithm, as
described by Russ, can keep track of all of the alternatives in parallel (at
each match position).

With that in mind, I looked round for a suitable off-the-shelf DFA engine to
st-- I mean, to borrow.  I found TRE http://laurikari.net/tre/ (one of the
"efficient implementations" Russ mentions on http://swtch.com/~rsc/regexp/),
and it's already in Debian, so it would be easy for us to use.  Some simple
proof-of-concept XS code later, I benchmarked the same regex on the Perl
engine and on TRE.

Perl's asymptotically-slow engine was about twice as fast as the
asymptotically-fast TRE engine.  So, I gave up on DFAs for this task, and
wrote something even faster instead:

  http://search.cpan.org/~arc/Text-Match-FastAlternatives-0.04/

It's entirely possible that there are DFA implementations that perform at
least as well as Perl's regex engine, but TRE is at least an existence proof
that picking a better algorithm isn't necessarily sufficient for good
performance.

The second problem with Russ's position is that backtracking NFA algorithms
typically have more features than DFA or DFA-conversion implementations.

  - Capturing parentheses have been considered hard to implement in DFA
    engines, though Russ says that "Thompson-style algorithms can be adapted
    to track submatch boundaries without giving up efficient performance",
    and certainly the Tcl regex implementation (a hybrid DFA/NFA engine, as
    far as I can tell) offers them.

  - There are no known algorithms for backreferences that always avoid
    O(2^n) matching time.

  - NFA implementations are beginning to offer recursive regex invocation.
    I'm not a mathematician, but I'm pretty sure that you can't construct a
    DFA from a regex that recursively invokes parts of itself.

  - Perl lets you embed code in regexes, and that code may have arbitrary
    side-effects.  Further, the side-effects are unpicked when Perl
    backtracks past the code.  Again, I'm no mathematician, but I think this
    makes it very hard to use a DFA implementation for embedded code blocks.

There seems to be a trade-off between expressiveness in regex syntax, versus
fast worst-case execution time.  Backtracking NFA implementations come down
firmly in favour of expressiveness.  Traditional DFA implementations prefer
to minimise asymptotic time complexity.  Hybrid implementations get to pick
and choose.

My own preference is for expressiveness, on balance.  That's in line with my
preference for Perl over more minimal languages; good code in a more
expressive language is typically easier to understand than good code in a
less expressive language.

That said, Russ's article is definitely interesting, and perhaps it will
move people to try out the algorithms he suggests in widely-used systems
like Perl or PCRE.

And this does seem to be a good time to go about doing that.  Perl 5.10 will
have many significant enhancements to its regex engine, including facilities
that make it much easier to build and use pluggable regex engines.  It's
quite easy to imagine Perl continuing to offer its current expressive but
sometimes slow regex engine as the default, but letting you switch to a DFA
engine for individual regexes where that would make sense.

-- 
Aaron Crane