[Wellington-pm] Today's tricky perl teaser
Grant McLean
grant at mclean.net.nz
Tue Nov 28 00:30:20 PST 2006
On Tue, 2006-11-28 at 19:23 +1300, Cliff Pratt wrote:
> I now know why, but can anyone spot why the regex will not compile on
> the date/time line? (Please ignore the inappropriate line breaks).
I'll confess that it wasn't obvious to me until I pasted the snippet
into my favourite editor (gvim) and the syntax highlighting showed the
regex ended at the slash between Date and time.
Instead of /.../, I almost always use m{...} (as Jacinta suggested).
This is especially useful when the regex spans multiple lines:
    @fields = m{
        ...
    }x;
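To make the delimiter point concrete, here is a minimal sketch. The log line and the field comments are made up for illustration and are not the original poster's data; only the "Date / time" comment comes from the snippet under discussion.

```perl
use strict;
use warnings;

# Hypothetical log line, invented for this example.
my $line = '192.0.2.1 - frank [28/Nov/2006:19:23:00 +1300] "GET / HTTP/1.0" 200 2326';

# With m{...}x the '/' in the comment is just comment text.
# Inside /.../x the same slash would end the pattern early.
my @fields = $line =~ m{
    ^(\S+)          # Client address
    \s+(\S+)        # Identity
    \s+(\S+)        # User
    \s+(\[.*?\])    # Date / time
}x;

print join(' | ', @fields), "\n";
```

One caveat with brace delimiters: an unbalanced '{' or '}' in a comment would cause the same kind of trouble, so the comments above avoid them.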
As Jacinta probably intended to say, using \S+ to match strings of
non-whitespace characters would simplify things.
Also, this bit:
    \s+(\[.*\])     # Date / time
would be better written as:
    \s+(\[.*?\])    # Date / time
Otherwise the .* will try to match to the end of the string and then
will backtrack to the last ']'. This will slow things down but will be
harmless as long as there is never another ']' on the line. I notice
from some Apache logs I have handy that some Opera browsers seem to put
a country code (?) in square brackets at the end of the user agent so
you might encounter more square brackets than you're expecting. This
still shouldn't cause a problem since you're requiring specific things
after the ']' but it will require the regex engine to work harder than
necessary.
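A small sketch of the difference, using only the date/time fragment on its own so the effect is visible. The log line, including the bracketed token at the end of the user agent, is invented for illustration. In the full regex the parts that follow the ']' would force the greedy version back to the right bracket, as described above.

```perl
use strict;
use warnings;

# Hypothetical line with a second bracketed token in the user agent.
my $line = '192.0.2.1 - - [28/Nov/2006:19:23:00 +1300] "GET / HTTP/1.0" '
         . '200 512 "-" "Opera/9.02 (Windows NT 5.1; U; en) [nz]"';

# Greedy: .* runs to the end of the string, then backs off to the last ']'.
my ($greedy) = $line =~ m{\s+(\[.*\])};

# Non-greedy: .*? stops at the first ']' it can.
my ($lazy) = $line =~ m{\s+(\[.*?\])};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```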
Also, when capturing the date/time, you probably want to move the square
brackets so they're not part of what you capture:
    \s+\[(.*?)\]    # Date / time
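For comparison, a quick sketch of what each version captures. The timestamp string is a made-up sample, and the split at the end is only one possible next step.

```perl
use strict;
use warnings;

# Hypothetical fragment of a log line, including the leading whitespace.
my $stamp = ' [28/Nov/2006:19:23:00 +1300]';

my ($with)    = $stamp =~ m{\s+(\[.*?\])};   # brackets inside the capture
my ($without) = $stamp =~ m{\s+\[(.*?)\]};   # brackets outside the capture

print "with:    $with\n";
print "without: $without\n";

# The bare value can be split further without stripping brackets first.
my ($when, $zone) = split ' ', $without;
print "when: $when, zone: $zone\n";
```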
Another approach to building up a complex regex like this one is to use
variables to name the bits:
    my $spc  = '\s+';
    my $ip   = '(\S+)';
    my $user = '(\S+)';
    my $date = '\[(.*?)\]';
    my $log_match = qr{ ^ $ip $spc $user $spc $date ... }x;
    while (<LOGS>) {
        @fields = $_ =~ $log_match;
    }
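Here is a self-contained sketch of the same idea that can be run as is. It makes some assumptions: it matches only the three leading fields rather than the whole line, the sample lines are invented, and an in-memory list stands in for the LOGS filehandle.

```perl
use strict;
use warnings;

# Named pieces, as above. Single quotes keep the backslashes intact.
my $spc  = '\s+';
my $ip   = '(\S+)';
my $user = '(\S+)';
my $date = '\[(.*?)\]';

# Only the leading fields are matched; the rest of the line is ignored.
my $log_match = qr{ ^ $ip $spc $user $spc $date }x;

# Made-up sample lines standing in for the LOGS filehandle.
my @lines = (
    '192.0.2.1 frank [28/Nov/2006:19:23:00 +1300] "GET / HTTP/1.0" 200',
    'not a log line',
);

my @parsed;
for my $line (@lines) {
    # In list context the match returns the captures, or an empty
    # list on failure, in which case the line is skipped.
    my @fields = $line =~ $log_match or next;
    push @parsed, \@fields;
    print join(' | ', @fields), "\n";
}
```

Compiling the pattern once with qr{} outside the loop also means the interpolation of the pieces is not repeated for every line.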
Cheers
Grant