[Wellington-pm] Today's tricky perl teaser

Tue Nov 28 00:30:20 PST 2006

On Tue, 2006-11-28 at 19:23 +1300, Cliff Pratt wrote:
> I now know why, but can anyone spot why the regex will not compile on 
> the date/time line? (Please ignore the inappropriate line breaks).

I'll confess that it wasn't obvious to me until I pasted the snippet
into my favourite editor (gvim) and the syntax highlighting showed the
regex ended at the slash between Date and time.

Instead of /.../, I almost always use m{...} (as Jacinta suggested).
This is especially useful when the regex spans multiple lines:

  @fields = m{
      ...
  }x;
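
As a rough, self-contained sketch of that layout: the sample line and the
choice of three fields below are made up for illustration and are not taken
from Cliff's script.  The point is that a '/' inside an /x comment is
harmless when the delimiters are braces.  (Braces inside the pattern or its
comments would still need to be balanced.)

```perl
use strict;
use warnings;

# Hypothetical sample line, invented for this example.
my $line = '192.0.2.1 frank [28/Nov/2006:19:23:00 +1300]';

# With m{...}x the '/' in the last comment is just comment text.
# With /.../x that same slash would end the regex early.
my @fields = $line =~ m{
    ^ (\S+)             # Client address
    \s+ (\S+)           # User name
    \s+ \[ (.*?) \]     # Date / time
}x;

print join('|', @fields), "\n";
```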

As Jacinta probably intended to say, using \S+ to match strings of
non-whitespace characters would simplify things.

Also, this bit:

  \s+(\[.*\])             # Date / time

would be better written as:

  \s+(\[.*?\])            # Date / time

Otherwise the .* will first match to the end of the string and then
backtrack to the last ']'.  That is slower, but harmless as long as
there is never another ']' on the line.  I notice from some Apache logs
I have handy that some Opera browsers seem to put a country code (?) in
square brackets at the end of the user agent, so you might encounter
more square brackets than you're expecting.  Even that shouldn't cause a
wrong match, since you're requiring specific things after the ']', but
it does make the regex engine work harder than necessary.
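
To make the difference visible, here is a small sketch with an invented log
fragment that has a second bracketed item in the user agent.  Unlike the
full regex, nothing is required after the ']' here, so the greedy version
is free to stop at the last ']' on the line.

```perl
use strict;
use warnings;

# Hypothetical log fragment, made up for illustration only.
my $line = 'host [28/Nov/2006:19:23:00 +1300] "Opera/9.02 (X11; Linux i686; U; en) [en]"';

# Greedy: .* runs to the end of the line, then backs up to the last ']'.
my ($greedy) = $line =~ m{\s+(\[.*\])};

# Non-greedy: .*? stops at the first ']' it can.
my ($lazy) = $line =~ m{\s+(\[.*?\])};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```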

Also, when capturing the date/time, you probably want to move the square
brackets so they're not part of what you capture:

  \s+\[(.*?)\]            # Date / time

Another approach to building up a complex regex like this one is to use
variables to name the bits:

  my $spc  = '\s+';
  my $ip   = '(\S+)';
  my $user = '(\S+)';
  my $date = '\[(.*?)\]';
  my $log_match = qr{ ^ $ip $spc $user $spc $date ... }x;

  while(<LOGS>) {
      @fields = $_ =~ $log_match;
      ...
  }
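
A runnable sketch of the same idea, under a few assumptions: the sample
lines and the @parsed array are hypothetical stand-ins for the LOGS
filehandle and whatever you do with the fields, and only the three named
pieces are matched (the rest of the line is simply ignored).

```perl
use strict;
use warnings;

# Named pieces, as plain strings that get interpolated into the regex.
my $spc  = '\s+';
my $ip   = '(\S+)';
my $user = '(\S+)';
my $date = '\[(.*?)\]';

# /x lets the pieces be spaced out for readability.
my $log_match = qr{ ^ $ip $spc $user $spc $date }x;

# Hypothetical sample lines standing in for the LOGS filehandle.
my @lines = (
    '192.0.2.1 frank [28/Nov/2006:19:23:00 +1300] "GET / HTTP/1.0"',
    'not a log line',
);

my @parsed;    # hypothetical place to keep the results
for my $line (@lines) {
    # In list context a failed match returns an empty list, so the
    # assignment is false and the line is reported instead.
    if (my @fields = $line =~ $log_match) {
        push @parsed, [@fields];
        print join(' | ', @fields), "\n";
    }
    else {
        print "no match: $line\n";
    }
}
```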

Cheers
Grant