[Neworleans-pm] split vs. match
B. Estrade
estrabd at mailcan.com
Mon Oct 27 14:27:45 PDT 2008
David,
Maybe the following will help:
http://oreilly.com/catalog/perlwsmng/chapter/ch08.html
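I don't know a whole lot about parsing tons of text with regexes. I do have one immediate suggestion: check whether the "/g" is really necessary. In list context it is what makes the match return every field rather than a single true value, so it may well have to stay. Anyway, take a look at that link; it might help.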
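
A quick way to test the "/g" idea is to run the match both ways on a sample line. The sketch below uses a shortened, made-up line (not your real data), and the variable names are only for illustration. As far as I understand perlop, a list-context match with no capturing parentheses and no /g returns just the list (1) on success, so the two results should differ:

```perl
use strict;
use warnings;

# Shortened, made-up sample line; not taken from the real log.
my $line = 'x.x.x.x id="0001" name="http access" action="pass"';

# List context with /g: every match of the whole pattern is returned.
my @with_g = $line =~ /\w+=".*?"|\S+/g;

# List context without /g and without capturing parentheses:
# a successful match returns only the list (1).
my @without_g = $line =~ /\w+=".*?"|\S+/;

print scalar(@with_g), " values with /g: @with_g\n";
print scalar(@without_g), " value without /g: @without_g\n";
```

If the second count comes back as 1, the long list assignment in your loop would fill only $date, so the "/g" probably needs to stay.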
Cheers,
Brett
On Mon, Oct 27, 2008 at 03:59:32PM -0500, David B. John wrote:
> (perl v5.8.8 on Ubuntu Hardy Heron/2.6.24-21-generic.)
>
> I have an http logfile I'm trying to parse which looks like:
>
> 2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" severity="info"
> sys="SecureWeb" sub="http" name="http access" action="pass" method="GET"
> srcip="x.x.x.x" user="" statuscode="200" cached="0" profile="profile_0"
> filteraction="action_REF_DefaultHTTPCFFAction" size="78632" time="105
> ms" request="0x90075b60"
> url="http://sb.google.com/safebrowsing/update?client=navclient-auto-ffox&appver=2.0.0.17&version=goog-white-domain:1:481,goog-white-url:1:371,goog-black-url:1:25401,goog-black-enchash:1:62374" error="" category="175,178" categoryname="Software/Hardware,Internet Services" content-type="text/html"
>
>
> (Nothing Spectacular.)
>
> If I loop through the log file and do:
>
> my ($date,$fwip,$proxy,$id,$severity,$sys,$sub,$name,$action,$method,
> $srcip,$user,$statuscode,$cached,$profile,$filteraction,$size,$time,
> $request,$url,$error,$category,$category_name,$content_type) =
> $_ =~ /\w+=".*?"|\S+/g;
>
> life is good but could be better (~ 75 seconds on an HP dc7700S for a
> compressed 500 MB logfile).
>
> However, I'd really like to use split b/c it's so much faster (~ 15
> seconds). The problem is if I split, sometimes $name, $time or
> $category_name will include a space within the quotes which I don't want
> to split on (see above).
>
> I used Text::ParseWords but gave up after waiting 5 minutes.
>
> I know I can use a regex with split but I'm stumped as to how I would go
> about writing it. E.g. split on a space except when enclosed in quotes.
> Also, would it theoretically be any faster than the example above since
> it's using regex or should I just live with it?
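
One possible way to write "split on a space except when enclosed in quotes" is a lookahead that only accepts a space when an even number of double quotes remains on the rest of the line. This is just a sketch: it assumes the quotes are always balanced and never appear inside a value, and the sample line is shortened and made up. I have not timed it. The lookahead rescans the remainder of the line at every space, so it may turn out no faster than the match.

```perl
use strict;
use warnings;

# Shortened, made-up sample in the same shape as the log line above.
my $sample = '2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" '
           . 'name="http access" time="105 ms" error=""';

# Split on whitespace only when the rest of the line holds an even
# number of double quotes, i.e. when we are not inside a quoted value.
my @fields = split /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/, $sample;

print "$_\n" for @fields;
```

Each element still carries its key="..." wrapper, exactly as the match version returns it, so the same list assignment should work on the result.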
>
> Thanks.
>
> David
--
B. Estrade
Louisiana Optical Network Initiative
+1.225.578.1920 aim: bz743
:wq