[oak perl] regexp for discussion

Thu Mar 11 18:58:22 CST 2004

On Thursday 11 March 2004 10:20 am, B.E.G wrote:
> ...
> Now one from a log parser I use (Hi George, did iPro parse like this?).
Hi, Elijah.
Yes, some of the filters used a regex something like 
the one you show below.
Many of the filters, however, began with a split.
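
For anyone who has not seen the split-first style, here is a rough sketch of
the idea. It is not the actual iPro code: the sub name parse_by_split and the
choice to cut on double quotes first are my assumptions, picked because a
combined-format line has three quoted fields.

```perl
use strict;
use warnings;

# Hypothetical split-first filter (illustration only, not the iPro code).
# Cut the line on double quotes, then split the unquoted pieces on whitespace.
sub parse_by_split {
    my ($line) = @_;
    chomp $line;

    # Expected pieces: prefix, request, status/bytes, referer, separator, agent.
    my @chunk = split /"/, $line;
    return unless @chunk >= 6;

    # split ' ' skips leading whitespace, so padded pieces are fine.
    my ($ip, $identd, $username)   = split ' ', $chunk[0];
    my ($method, $file, $protocol) = split ' ', $chunk[1];
    my ($status, $bytes)           = split ' ', $chunk[2];
    return unless defined $bytes;

    return {
        ip       => $ip,
        identd   => $identd,
        username => $username,
        method   => $method,
        file     => $file,
        protocol => $protocol,    # undef when the request has no version
        status   => $status,
        bytes    => $bytes,
        referer  => $chunk[3],
        client   => $chunk[5],
    };
}
```

A line that does not split into at least six pieces, or that lacks a byte
count, makes the sub return nothing, so the caller can skip it.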
 
> The particular parser understands (== has regexps for) four different
> formats. As George likes to attest, this is not expected to work on
> all log lines of the target format, just better than 99% of them.
Right, I don't recall anyone being concerned with losing 1% of the data.
For some accounts even 2% might be OK.
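
If you want to check that a filter stays inside that 1% or 2%, counting the
skipped lines is enough. A minimal sketch follows; skipped_percent is a
made-up helper name, and it assumes the parser is a code ref that returns a
true value when a line parses.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of lines the given parser could not handle.
# $parser is any code ref returning true on success and false on failure.
sub skipped_percent {
    my ($parser, @lines) = @_;
    return 0 unless @lines;

    # grep in scalar context returns the number of lines that failed.
    my $skipped = grep { !$parser->($_) } @lines;
    return 100 * $skipped / scalar @lines;
}
```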

>   combined => [ # Standard apache 'combined' log format
>     # Column names, for (captures)
>     [ 'ip', 'identd', 'username', 'date', 'time', 'tz', 'method', 'file',
>       'protocol', 'status', 'bytes', 'referer', 'client', 'other', ],
>     # Regexp
>     qr%^                                #                       anchor
>           ([\w.]+)                      # IP
>       \s+                               #                       whitespace
>           (\S+)                         # ident check
>       \s+                               #                       whitespace
>           (\S+)                         # auth user
>       \s+                               #                       whitespace
>           \[(\d\d/\w\w\w/\d\d\d\d)      # date
>
>           :(\d\d:\d\d:\d\d)             # time
>
>           \s+([\d-]+)\]                 # timezone
>       \s+                               #                       whitespace
>           "(\w+)                        # GET/POST/HEAD, etc
>       \s+                               #                       whitespace
>           (\S+)                         # URI/URL
>       (?:                               # grouping for optional version
>       \s+                               #                       whitespace
>           (HTTP/[\d.]+)                 # protocol version
>       )?                                # end grouping
>           "                             # end of request line
>       \s+                               #                       whitespace
>           (\d\d\d)                      # response code, 200 success, etc
>       \s+                               #                       whitespace
>           (\d+|-)                       # bytes written
>       \s+                               #                       whitespace
>           "(\S+)"                       # referrer
>       \s+                               #                       whitespace
>           "([^"]+)"                     # user agent
>       \s*                               #                       whitespace
>           (.*)                          # other
>       $                                 #                       anchor
>       %xi,
>    ],
> ...
> Elijah
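
For discussion, here is one way the column names and the regexp in an entry
like this can be put to work. The post shows only the 'combined' entry, so
the %format hash and the parse_line sub are names I made up for the sketch.
The pattern is the quoted one, condensed onto fewer lines with the comments
dropped.

```perl
use strict;
use warnings;

# Assumed container: only the 'combined' entry appears in the post, so the
# %format name and parse_line() are hypothetical.
my %format = (
    combined => [
        # Column names, one per capture group, in order.
        [ qw(ip identd username date time tz method file
             protocol status bytes referer client other) ],
        # The quoted pattern, condensed.
        qr%^ ([\w.]+) \s+ (\S+) \s+ (\S+) \s+
             \[(\d\d/\w\w\w/\d\d\d\d) :(\d\d:\d\d:\d\d) \s+([\d-]+)\] \s+
             "(\w+) \s+ (\S+) (?: \s+ (HTTP/[\d.]+) )? " \s+
             (\d\d\d) \s+ (\d+|-) \s+ "(\S+)" \s+ "([^"]+)" \s* (.*) $
          %xi,
    ],
);

# Return a hash ref of column => value, or nothing if the line does not match.
sub parse_line {
    my ($name, $line) = @_;
    my ($columns, $regexp) = @{ $format{$name} };

    my @captures = $line =~ $regexp or return;

    my %record;
    @record{@$columns} = @captures;    # hash slice pairs names with captures
    return \%record;
}
```

The hash slice is what makes the column list worth keeping next to the
regexp: add a capture and a name together and the rest of the code does not
change. One thing to watch with the pattern as quoted: the timezone class
[\d-] accepts digits and '-' only, so a line logged with a '+' offset will
not match and ends up in the skipped share.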