[Wellington-pm] Today's tricky perl teaser
Cliff Pratt
enkidu at cliffp.com
Tue Nov 28 02:01:23 PST 2006
Grant McLean wrote:
> On Tue, 2006-11-28 at 19:23 +1300, Cliff Pratt wrote:
>> I now know why, but can anyone spot why the regex will not compile
>> on the date/time line? (Please ignore the inappropriate line
>> breaks).
>
> I'll confess that it wasn't obvious to me until I pasted the snippet
> into my favourite editor (gvim) and the syntax highlighting showed
> the regex ended at the slash between Date and time.
>
Well, I actually found it pretty quickly, because I had the book open at
the right page at the time: I was looking up the multiline thing! It is
pretty tricky though, isn't it?
>
> Instead of /.../, I almost always use m{...} (as Jacinta suggested).
> This is especially useful when the regex spans multiple lines:
>
> @fields = m{ ... }x;
>
> As Jacinta probably intended to say, using \S+ to match strings of
> non-whitespace characters would simplify things.
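A minimal sketch of the two suggestions above, assuming a made-up log line and field layout (neither comes from the original snippet). With m{...}x the slash in the "Date / time" comment is only comment text, whereas inside /.../x it would end the pattern early:

```perl
use strict;
use warnings;

# Hypothetical log line, for illustration only.
my $line = '192.0.2.1 - frank [28/Nov/2006:19:23:00 +1300] "GET / HTTP/1.1"';

my @fields = $line =~ m{
    ^(\S+)          # Client address
    \s+(\S+)        # Identity
    \s+(\S+)        # User
    \s+\[(.*?)\]    # Date / time
}x;

print join('|', @fields), "\n";
```

The one thing to watch with m{...}x is unbalanced braces, or a $ or @, inside a comment, since the pattern is still scanned for its closing delimiter and for variables before the comments are discarded.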
>
> Also, this bit:
>
> \s+(\[.*\]) # Date / time
>
> Would be better written as:
>
> \s+(\[.*?\]) # Date / time
>
> Otherwise the .* will try to match to the end of the string and then
> will backtrack to the last ']'. This will slow things down but will
> be harmless as long as there is never another ']' on the line. I
> notice from some Apache logs I have handy that some Opera browsers
> seem to put a country code (?) in square brackets at the end of the
> user agent so you might encounter more square brackets than you're
> expecting. This still shouldn't cause a problem since you're
> requiring specific things after the ']' but it will require the regex
> engine to work harder than necessary.
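To illustrate the greedy versus non-greedy difference described above, here is a small sketch. The sample line is invented, and nothing is required after the ']' so that the difference actually shows up in the captured text:

```perl
use strict;
use warnings;

# Hypothetical line with a second bracketed item near the end.
my $line = '[28/Nov/2006:19:23:00 +1300] "GET /" "Opera/9.02 [en]"';

my ($greedy) = $line =~ m{(\[.*\])};    # runs on to the last ']'
my ($lazy)   = $line =~ m{(\[.*?\])};   # stops at the first ']'

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

In the full log regex the pieces that follow the ']' would normally force the greedy version back to the right place, so the cost there is extra backtracking rather than a wrong capture.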
>
> Also, when capturing the date/time, you probably want to move the
> square brackets so they're not part of what you capture:
>
>
> \s+\[(.*?)\] # Date / time
>
Yes, I'd noticed that already, though it wasn't in the pasted snippet. I
also included the ^ in the first (). Someone pointed that out to me.
Would that have any unintended effect?
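A quick way to check this, sketched with a hypothetical input line: ^ matches a position rather than any characters, so having it inside or outside the parentheses should capture the same text.

```perl
use strict;
use warnings;

my $line = '192.0.2.1 - frank';   # hypothetical input

my ($inside)  = $line =~ m{(^\S+)};   # anchor inside the capture group
my ($outside) = $line =~ m{^(\S+)};   # anchor outside the capture group

print "same capture\n" if $inside eq $outside;
```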
>
>
> Another approach to building up a complex regex like this one is to
> use variables to name the bits:
>
> my $spc  = '\s+';
> my $ip   = '(\S+)';
> my $user = '(\S+)';
> my $date = '\[(.*?)\]';
> my $log_match = qr{ ^ $ip $spc $user $spc $date ... }x;
>
> while(<LOGS>) {
>     @fields = $_ =~ $log_match;
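A self-contained sketch of the named-pieces approach quoted above. The three-field layout, the sample line and the @lines array are assumptions made for the example, standing in for the elided part of the pattern and for the LOGS filehandle:

```perl
use strict;
use warnings;

# Named pieces, as in the quoted snippet. Single quotes keep the
# backslashes intact until the strings are interpolated into qr{}.
my $spc  = '\s+';
my $ip   = '(\S+)';
my $user = '(\S+)';
my $date = '\[(.*?)\]';

# With /x the spaces between the pieces are ignored.
my $log_match = qr{ ^ $ip $spc $user $spc $date }x;

# Hypothetical input standing in for the log file.
my @lines = ('192.0.2.1 frank [28/Nov/2006:19:23:00 +1300] rest of line');

for my $line (@lines) {
    my @fields = $line =~ $log_match or next;   # skip lines that do not match
    print join('|', @fields), "\n";
}
```

Because qr{} compiles the pattern once, the same $log_match can be reused on every line of the loop.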
>
Cheers,
Cliff