[oak perl] regexp for discussion

Thu Mar 11 12:20:41 CST 2004

Two regular expressions for this free-for-all.

This one comes from a spam filter (applied to Subject: and body) I
used actively from about 1996 to 1999. It was one of my first very
long regexps, presented here with original whitespace and commenting.
(Yes, I really did use dollar-sign pound-sign as a global variable.
That relies on an iso-8859-1 encoding for the code, and this message
is utf-8 encoded, so I'm not sure it will run for you.)

# Used in the scan_* routines for currencies (perl considers this
# a reserved global, since it is a one char non-alphabetic variable,
# but it is unlikely to be used).
$£='$#£¥¢';

# [...]

# Be very generous about accepting "only US$15,- p min" text.
 /(\b(?:to|just|only)|^)# first they try to de-emphasize it
  \s+(?:[a-z]{2}\s*)?   # Sometimes with something for the currency
  (?:[$£]\s*)?         # Sometimes a currency notation ($£='$#£')
  (?:\d+[o\d]*          # A number
     (?:[.,](?:[o\d]{2} #    with an optional fractional portion
      |-)?)?            #    or a hyphen for 00
   |\.[o\d]+)           # or exclusively fractional
  (?:\s*cents)?         # an alternative currency location
  \s*(?:\/|p(?:er|\.)?) # "per" and variations
  \s*(?:m(?:in(?:ute)?)?# "minute"   and variations
     |h(?:ou)?r?        # or "hour"   ''     ''
     |da?y?             # or "day"    ''     ''
     |w(?:ee)?k?        # or "week"   ''     ''
     |mo?n(?:th)?       # or "month"  ''     ''
     |y(?:ea)?r?)       # or "year"   ''     ''
  /ix                   # Ignore case and use free-format regexp
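
If the encoding dependence is a worry, here is a rough sketch of the same
pattern with the currency class spelled out in escapes, so the source stays
plain ASCII. The names $currency, $rate_re and looks_like_rate are made up
for this example and are not from the original filter. The week alternative
is written as w(?:ee)?k? so that "wk" matches, by analogy with the hour and
year lines.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the currency global: a precompiled character
# class written with escapes, so this file can stay plain ASCII.
# \x{A3}, \x{A5} and \x{A2} are the pound, yen and cent signs.
my $currency = qr/[\$\#\x{A3}\x{A5}\x{A2}]/;

my $rate_re = qr{
  (\b(?:to|just|only)|^)   # de-emphasis word, or start of text
  \s+(?:[a-z]{2}\s*)?      # optional two-letter currency code
  (?:$currency\s*)?        # optional currency symbol
  (?:\d+[o\d]*             # a number (o is accepted as a digit)
     (?:[.,](?:[o\d]{2}    #   with an optional fractional part
      |-)?)?               #   or a hyphen for 00
   |\.[o\d]+)              # or fraction only
  (?:\s*cents)?            # alternative currency location
  \s*(?:/|p(?:er|\.)?)     # "per" and variations
  \s*(?:m(?:in(?:ute)?)?   # minute
     |h(?:ou)?r?           # hour
     |da?y?                # day
     |w(?:ee)?k?           # week
     |mo?n(?:th)?          # month
     |y(?:ea)?r?)          # year
}ix;

# Return 1 if the text contains a "price per time unit" phrase, else 0.
sub looks_like_rate {
  my ($text) = @_;
  return $text =~ $rate_re ? 1 : 0;
}

for my $subject ('only US$15,- p min', 'Meeting notes for Tuesday') {
  printf "%-28s %s\n", $subject,
    looks_like_rate($subject) ? 'spam-like' : 'ok';
}
```

Using qr{} rather than slashes means the "/" alternative needs no escaping.
Comments inside an /x pattern still get scanned for variables, so they are
best kept free of dollar and at signs.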

Now one from a log parser I use (Hi George, did iPro parse like this?).
The particular parser understands (== has regexps for) four different
formats. As George likes to attest, this is not expected to work on
all log lines of the target format, just better than 99% of them.

  combined => [ # Standard apache 'combined' log format
    # Column names, for (captures)
    [ 'ip', 'identd', 'username', 'date', 'time', 'tz', 'method', 'file',
      'protocol', 'status', 'bytes', 'referer', 'client', 'other', ],
    # Regexp
    qr%^                                #                       anchor
          ([\w.]+)                      # IP
      \s+                               #                       whitespace
          (\S+)                         # ident check
      \s+                               #                       whitespace
          (\S+)                         # auth user
      \s+                               #                       whitespace
          \[(\d\d/\w\w\w/\d\d\d\d)      # date
          :(\d\d:\d\d:\d\d)             # time
          \s+([+\d-]+)\]                # timezone
      \s+                               #                       whitespace
          "(\w+)                        # GET/POST/HEAD, etc
      \s+                               #                       whitespace
          (\S+)                         # URI/URL
      (?:                               # grouping for optional version
      \s+                               #                       whitespace
          (HTTP/[\d.]+)                 # protocol version
      )?                                # end grouping
          "                             # end of request line
      \s+                               #                       whitespace
          (\d\d\d)                      # response code, 200 success, etc
      \s+                               #                       whitespace
          (\d+|-)                       # bytes written
      \s+                               #                       whitespace
          "(\S+)"                       # referrer
      \s+                               #                       whitespace
          "([^"]+)"                     # user agent
      \s*                               #                       whitespace
          (.*)                          # other
      $                                 #                       anchor
      %xi,
   ],
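
For anyone wondering how the names and the regexp fit together, one way to
use such an entry is a hash slice over the captures. This is only a sketch.
The %format table and parse_line are names invented for the example, the
regexp is a condensed copy of the one above (with a plus sign also allowed
in the timezone, which is my assumption for servers east of Greenwich), and
the sample line is the one from the Apache documentation.

```perl
use strict;
use warnings;

# Same [ column names, regexp ] shape as the table entry, condensed.
my %format = (
  combined => [
    [ qw(ip identd username date time tz method file
         protocol status bytes referer client other) ],
    qr%^
      ([\w.]+) \s+ (\S+) \s+ (\S+) \s+            # ip, identd, username
      \[(\d\d/\w\w\w/\d\d\d\d) :(\d\d:\d\d:\d\d)  # date, time
      \s+ ([+\d-]+)\] \s+                         # tz (plus sign allowed)
      "(\w+) \s+ (\S+) (?: \s+ (HTTP/[\d.]+) )? " # method, file, protocol
      \s+ (\d\d\d) \s+ (\d+|-) \s+                # status, bytes
      "(\S+)" \s+ "([^"]+)" \s* (.*) $            # referer, client, other
    %xi,
  ],
);

# Return a hash ref of column => value, or nothing if the line
# does not match the named format.
sub parse_line {
  my ($fmt, $line) = @_;
  my ($names, $re) = @{ $format{$fmt} };
  my @values = $line =~ $re or return;
  my %rec;
  @rec{@$names} = @values;   # hash slice: pair each name with its capture
  return \%rec;
}

my $sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
           . '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
           . '"http://www.example.com/start.html" '
           . '"Mozilla/4.08 [en] (Win98; I ;Nav)"';

if (my $rec = parse_line('combined', $sample)) {
  print "$rec->{ip} $rec->{status} $rec->{file}\n";
}
```

The protocol column comes back undefined when the request line carries no
HTTP version, since that capture sits inside the optional group.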

I don't use the Apache 'combined' format for most of my logs anymore
because of parsing ambiguities, but I've got 20-odd servers and some
of the lesser ones haven't been migrated to something better. I expect
the 'combined' format is rather popular with others, though.

Problems I've seen: the auth-user value can have spaces in it (always
with "401 Unauthorized" errors for me), some non-compliant tools
send URLs with unencoded spaces in the GET line, and some user-agents
have quotes in them. The script that runs this has a debug mode which
prints all non-matching lines to STDERR, and I use them as fodder for
fixing regexp bugs. Sometimes the lines simply aren't well formed, and
those I don't try to accommodate, for sanity's sake. Last week I noticed
a couple of lines that were missing the " before the GET in the request
line. Some sort of log-writing bug I guess, and better ignored.
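
The debug mode is nothing fancy. A minimal sketch of the idea, not the
actual script: $line_re is a toy three-column pattern standing in for a
real format regexp, and scan_lines and the NOMATCH prefix are invented for
the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one of the parser's format regexps.
my $line_re = qr/^(\S+)\s+(\d\d\d)\s+(\d+|-)$/;

# Run each line through the regexp. Returns a ref to the parsed rows and
# the number of rejected lines. With $debug set, every rejected line is
# echoed to STDERR so it can be inspected later.
sub scan_lines {
  my ($lines, $debug) = @_;
  my @rows;
  my $rejects = 0;
  for my $line (@$lines) {
    chomp(my $copy = $line);
    if (my @fields = $copy =~ $line_re) {
      push @rows, \@fields;
    } else {
      $rejects++;
      print STDERR "NOMATCH: $copy\n" if $debug;
    }
  }
  return (\@rows, $rejects);
}

my ($rows, $rejects) = scan_lines(
  [ "10.0.0.1 200 512\n", "not a log line\n", "10.0.0.2 404 -\n" ],
  1,
);
print scalar(@$rows), " parsed, $rejects rejected\n";
```

Redirecting STDERR to a file during a run gives a list of lines to sort
into "fix the regexp" and "malformed, ignore".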

Elijah