[oak perl] regexp for discussion
B.E.G
oaklandpm at eli.users.panix.com
Thu Mar 11 12:20:41 CST 2004
Two regular expressions for this free-for-all.
This one comes from a spam filter (applied to Subject: and body) I
used actively from about 1996 to 1999. It was one of my first very
long regexps, presented here with original whitespace and commenting.
(Yes, I really did use dollar-sign pound-sign as a global variable.
That relies on an ISO-8859-1 encoding for the code; this message is
UTF-8 encoded, so I'm not sure it will run for you.)
# Used in the scan_* routines for currencies (perl considers this
# a reserved global, since it is a one char non-alphabetic variable,
# but it is unlikely to be used).
$£='$#£¥¢';
# [...]
# Be very generous about accepting "only US$15,- p min" text.
/(\b(?:to|just|only)|^)# first they try to de-emphasize it
\s+(?:[a-z]{2}\s*)? # Sometimes with something for the currency
(?:[$£]\s*)? # Sometimes a currency notation ($£='$#£')
(?:\d+[o\d]* # A number
(?:[.,](?:[o\d]{2} # with an optional fractional portion
|-)?)? # or a hyphen for 00
|\.[o\d]+) # or exclusively fractional
(?:\s*cents)? # an alternative currency location
\s*(?:\/|p(?:er|\.)?) # "per" and variations
\s*(?:m(?:in(?:ute)?)?# "minute" and variations
|h(?:ou)?r? # or "hour" '' ''
|da?y? # or "day" '' ''
|w(?:ee)?k? # or "week" '' ''
|mo?n(?:th)? # or "month" '' ''
|y(?:ea)?r?) # or "year" '' ''
/ix # Ignore case and use free-format regexp
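For anyone who wants to poke at this under a UTF-8 editor, here is a minimal sketch of the same expression. It rests on a few assumptions that are not in the original: a compiled character class ($currency) stands in for the $£ variable, the "week" alternative gets a ? after (?:ee) so that "wk" matches like the other abbreviations do, and the names $price_per_time and looks_like_rate are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                               # source contains literal currency symbols
binmode STDOUT, ':encoding(UTF-8)';

# Assumption: a compiled character class replaces the original $£ variable.
my $currency = qr/[\$#£¥¢]/;

my $price_per_time = qr{
    (\b(?:to|just|only)|^)          # de-emphasizing lead-in, or start of string
    \s+(?:[a-z][a-z]\s*)?           # optional two-letter currency code ("US")
    (?:$currency\s*)?               # optional currency symbol
    (?:\d+[o\d]*                    # a number (letter "o" tolerated as zero)
       (?:[.,](?:[o\d][o\d]|-)?)?   #   optional fraction, or a hyphen for 00
      |\.[o\d]+)                    # or a purely fractional amount
    (?:\s*cents)?                   # currency named after the amount
    \s*(?:\/|p(?:er|\.)?)           # slash, "p", "p." or "per"
    \s*(?:m(?:in(?:ute)?)?          # minute
         |h(?:ou)?r?                # hour
         |da?y?                     # day
         |w(?:ee)?k?                # week (assumed fix: "ee" made optional)
         |mo?n(?:th)?               # month
         |y(?:ea)?r?)               # year
}ix;

# Hypothetical helper: true if the text contains a "price per time unit" phrase.
sub looks_like_rate {
    my ($text) = @_;
    return $text =~ $price_per_time ? 1 : 0;
}

my @samples = (
    'only US$15,- p min',
    'just 99 cents per minute',
    'only $5 per wk',
    'just £2.50/hr',
    'I only have 15 minutes free',
);

for my $s (@samples) {
    printf "%-8s %s\n", (looks_like_rate($s) ? 'match' : 'no match'), $s;
}
```

The last sample is there to show that a lead-in word plus a number is not enough; the "per" part is mandatory. Be warned that the expression is generous by design, so phrases like "to 3 p.m." also match.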
Now one from a log parser I use (Hi George, did iPro parse like this?).
The particular parser understands (== has regexps for) four different
formats. As George likes to attest, this is not expected to work on
all log lines of the target format, just better than 99% of them.
combined => [ # Standard apache 'combined' log format
# Column names, for (captures)
[ 'ip', 'identd', 'username', 'date', 'time', 'tz', 'method', 'file',
'protocol', 'status', 'bytes', 'referer', 'client', 'other', ],
# Regexp
qr%^ # anchor
([\w.]+) # IP
\s+ # whitespace
(\S+) # ident check
\s+ # whitespace
(\S+) # auth user
\s+ # whitespace
\[(\d\d/\w\w\w/\d\d\d\d) # date
:(\d\d:\d\d:\d\d) # time
\s+([\d+-]+)\] # timezone (+hhmm or -hhmm)
\s+ # whitespace
"(\w+) # GET/POST/HEAD, etc
\s+ # whitespace
(\S+) # URI/URL
(?: # grouping for optional version
\s+ # whitespace
(HTTP/[\d.]+) # protocol version
)? # end grouping
" # end of request line
\s+ # whitespace
(\d\d\d) # response code, 200 success, etc
\s+ # whitespace
(\d+|-) # bytes written
\s+ # whitespace
"(\S+)" # referrer
\s+ # whitespace
"([^"]+)" # user agent
\s* # whitespace
(.*) # other
$ # anchor
%xi,
],
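To show how a names-plus-regexp pair like this can be used, here is a rough sketch that maps the captures onto the column names with a hash slice. The regexp is a condensed restatement of the one above; the names @columns, $combined and parse_line are hypothetical, and the timezone class here accepts a leading "+" as well, which is an assumption on my part. The sample line is the usual example from the Apache documentation.

```perl
use strict;
use warnings;

# Same column names as the 'combined' entry, in capture order.
my @columns = qw(ip identd username date time tz method file
                 protocol status bytes referer client other);

# Condensed restatement of the regexp; whitespace is ignored under /x.
my $combined = qr{^
    ([\w.]+) \s+ (\S+) \s+ (\S+) \s+            # host, ident, auth user
    \[(\d\d/\w\w\w/\d\d\d\d) :(\d\d:\d\d:\d\d)  # date and time
    \s+ ([\d+-]+)\] \s+                         # timezone (assumed: + or -)
    "(\w+) \s+ (\S+) (?:\s+(HTTP/[\d.]+))? "    # request line
    \s+ (\d\d\d) \s+ (\d+|-) \s+                # status, bytes
    "(\S+)" \s+ "([^"]+)" \s* (.*)              # referrer, agent, other
    $
}xi;

# Hypothetical helper: returns a hashref keyed by column name,
# or undef if the line does not match.
sub parse_line {
    my ($line) = @_;
    my @fields = $line =~ $combined or return;
    my %record;
    @record{@columns} = @fields;    # hash slice: names on the left, captures on the right
    return \%record;
}

my $line = q{127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] }
         . q{"GET /apache_pb.gif HTTP/1.0" 200 2326 }
         . q{"http://www.example.com/start.html" }
         . q{"Mozilla/4.08 [en] (Win98; I ;Nav)"};

if (my $rec = parse_line($line)) {
    printf "%s %s %s -> %s (%s bytes)\n",
        @{$rec}{qw(ip method file status bytes)};
}
```

Since the protocol group is optional, $rec->{protocol} is undef for requests that arrive without an HTTP version.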
I don't use the Apache 'combined' format for most of my logs anymore
because of parsing ambiguities, but I've got 20-odd servers and some
of the lesser ones haven't been migrated to something better. I expect
the 'combined' format is rather popular with others, though.
Problems I've seen: the auth-user value can have spaces in it (always
with "401 Not Authorized" errors for me), some non-compliant tools
send URLs with un-encoded spaces in the GET line, and some user-agents
have quotes in them. The script that runs this has a debug mode which
prints all non-matching lines to STDERR, and I use them as fodder for
fixing regexp bugs. Sometimes the lines simply aren't well formed, and
for sanity's sake I don't try to accommodate those. Last week I noticed
a couple of lines that were missing the " before the GET in the request
line. Some sort of log-writing bug, I guess, and better ignored.
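For illustration, a debug pass of that sort might look roughly like the sketch below. It is not the actual script: rejected_lines is a hypothetical helper, and only the request-line fragment of the regexp is used so that two of the failure cases mentioned above are easy to see.

```perl
use strict;
use warnings;

# Request-line fragment of the 'combined' regexp, isolated for illustration.
my $request = qr{"(\w+)\s+(\S+)(?:\s+(HTTP/[\d.]+))?"};

# Hypothetical helper: returns the lines that did not match the pattern,
# echoing each one to STDERR when $debug is true.
sub rejected_lines {
    my ($pattern, $debug, @lines) = @_;
    my @rejects = grep { $_ !~ $pattern } @lines;
    if ($debug) {
        print STDERR "NO MATCH: $_\n" for @rejects;
    }
    return @rejects;
}

my @samples = (
    '"GET /index.html HTTP/1.0"',       # well formed
    '"GET /my file.html HTTP/1.0"',     # un-encoded space in the URL
    'GET /index.html HTTP/1.0"',        # opening quote missing
);

my @bad = rejected_lines($request, 1, @samples);
printf "%d of %d sample lines rejected\n", scalar @bad, scalar @samples;
```

With these three samples the second and third lines end up on STDERR, which is the kind of output that gets reviewed when deciding whether a regexp needs fixing or the line is simply malformed.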
Elijah