[Wellington-pm] Today's tricky perl teaser

Tue Nov 28 02:01:23 PST 2006

Grant McLean wrote:
> On Tue, 2006-11-28 at 19:23 +1300, Cliff Pratt wrote:
>> I now know why, but can anyone spot why the regex will not compile
>> on the date/time line? (Please ignore the inappropriate line
>> breaks).
> 
> I'll confess that it wasn't obvious to me until I pasted the snippet 
> into my favourite editor (gvim) and the syntax highlighting showed
> the regex ended at the slash between Date and time.
> 
Well, I actually found it pretty quickly: I had the book open at the
right page at the time because I was looking up the multiline thing!
It is pretty tricky though, isn't it?
> 
> Instead of /.../, I almost always use m{...} (as Jacinta suggested). 
> This is especially useful when the regex spans multiple lines:
> 
> @fields = m{ ... }x;
> 
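For anyone who wants to see the delimiter problem in isolation, here is a minimal sketch. The sample line and the two-field pattern are made up for illustration and are not the original snippet. The slash-delimited version is compiled from a string with eval so that its failure can be caught rather than stopping the script.

```perl
use strict;
use warnings;

# Made-up sample: an address followed by a bracketed date/time.
my $line = '127.0.0.1 [28/Nov/2006:19:23:00 +1300]';

# With / as the delimiter, the slash inside the comment ends the pattern
# early, so "time" and the trailing "/x;" are parsed as Perl code.
my $slash_version = <<'CODE';
my @f = $line =~ /^(\S+)
    \s+(\[.*\])   # Date / time
/x;
1;
CODE
my $compiled = eval $slash_version;
print "slash-delimited version failed to compile\n" unless $compiled;

# With m{...}x the slash is just part of the comment.
my @fields = $line =~ m{^(\S+)
    \s+(\[.*\])   # Date / time
}x;
print "fields: @fields\n";
```

The same trap applies to whichever delimiter is chosen: with m{...}x, an unbalanced brace in a comment would end the pattern early in the same way.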
> As Jacinta probably intended to say, using \S+ to match strings of 
> non-whitespace characters would simplify things.
> 
> Also, this bit:
> 
> \s+(\[.*\])             # Date / time
> 
> Would be better written as:
> 
> \s+(\[.*?\])            # Date / time
> 
> Otherwise the .* will try to match to the end of the string and then 
> will backtrack to the last ']'.  This will slow things down but will
> be harmless as long as there is never another ']' on the line.  I
> notice from some Apache logs I have handy that some Opera browsers
> seem to put a country code (?) in square brackets at the end of the
> user agent so you might encounter more square brackets than you're
> expecting.  This still shouldn't cause a problem since you're
> requiring specific things after the ']' but it will require the regex
> engine to work harder than necessary.
> 
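A small sketch of the difference, using a made-up line with a second pair of square brackets in the user agent. Nothing follows the bracketed group in these two patterns, so the greedy version is free to run on to the last ']'.

```perl
use strict;
use warnings;

# Made-up line: the date/time field plus a user agent ending in "[en]".
my $line = '[28/Nov/2006:19:23:00 +1300] "GET / HTTP/1.0" "Opera/9.02 [en]"';

my ($greedy) = $line =~ m{(\[.*\])};    # .* runs to the end, backtracks to the last ']'
my ($lazy)   = $line =~ m{(\[.*?\])};   # .*? stops at the first ']'

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

In the full pattern the parts that follow the date would force the greedy version to backtrack further to the right ']', which is the extra work described above.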
> Also, when capturing the date/time, you probably want to move the
> square brackets so they're not part of what you capture:
> 
> 
> \s+\[(.*?)\]            # Date / time
> 
Yes, I'd noticed that already, though it wasn't in the pasted snippet. I
had also put the ^ inside the first (), as someone pointed out to me.
Would that have any unintended effect?
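A quick way to check both points, as a sketch on a made-up line (address, user, bracketed date). Since ^ is a zero-width assertion it adds nothing to the captured text, so for a simple group like this one the capture should be the same whether the ^ sits inside or outside the parentheses.

```perl
use strict;
use warnings;

# Made-up sample line, not taken from a real log.
my $line = '192.168.0.1 cliff [28/Nov/2006:19:23:00 +1300]';

my ($caret_inside)  = $line =~ m{(^\S+)};   # ^ inside the capture group
my ($caret_outside) = $line =~ m{^(\S+)};   # ^ outside the capture group
print "caret placement makes no difference here\n"
    if $caret_inside eq $caret_outside;

# Brackets outside the parentheses: only the date/time itself is captured.
my ($date) = $line =~ m{\s+\[(.*?)\]};
print "date: $date\n";
```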
> 
> 
> Another approach to building up a complex regex like this one is to
> use variables to name the bits:
> 
> my $spc  = '\s+';
> my $ip   = '(\S+)';
> my $user = '(\S+)';
> my $date = '\[(.*?)\]';
> my $log_match = qr{ ^ $ip $spc $user $spc $date ... }x;
> 
> while(<LOGS>) {
>     @fields = $_ =~ $log_match;
> 
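A runnable sketch of the named-pieces approach. The quoted pattern is elided after the date, so this version stops there on purpose. The sample lines and the in-memory list (in place of the LOGS filehandle) are assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Named building blocks, as in the quoted snippet.
my $spc  = '\s+';
my $ip   = '(\S+)';
my $user = '(\S+)';
my $date = '\[(.*?)\]';

# Only the three pieces shown above; the rest of the original is not known.
my $log_match = qr{ ^ $ip $spc $user $spc $date }x;

# Hypothetical input standing in for lines read from a log file.
my @lines = (
    '192.168.0.1 cliff [28/Nov/2006:19:23:00 +1300] "GET / HTTP/1.0"',
    'not a log line',
);

my @parsed;
for my $line (@lines) {
    # A failed match returns an empty list, so skip lines that do not fit.
    my @fields = $line =~ $log_match or next;
    push @parsed, \@fields;
}

print join(' | ', @$_), "\n" for @parsed;
```

Because the pieces are plain single-quoted strings, the backslashes reach the regex engine untouched, and under /x the spaces between the interpolated variables are ignored.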

Cheers,

Cliff