[Neworleans-pm] split vs. match
David B. John
djohn at archdiocese-no.org
Mon Oct 27 13:59:32 PDT 2008
(perl v5.8.8 on Ubuntu Hardy Heron/2.6.24-21-generic.)
I have an HTTP log file I'm trying to parse. Each record looks like:
2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" severity="info"
sys="SecureWeb" sub="http" name="http access" action="pass" method="GET"
srcip="x.x.x.x" user="" statuscode="200" cached="0" profile="profile_0"
filteraction="action_REF_DefaultHTTPCFFAction" size="78632" time="105
ms" request="0x90075b60"
url="http://sb.google.com/safebrowsing/update?client=navclient-auto-ffox&appver=2.0.0.17&version=goog-white-domain:1:481,goog-white-url:1:371,goog-black-url:1:25401,goog-black-enchash:1:62374" error="" category="175,178" categoryname="Software/Hardware,Internet Services" content-type="text/html"
(Nothing Spectacular.)
If I loop through the log file and do:
my ($date,$fwip,$proxy,$id,$severity,$sys,$sub,$name,$action,$method,
$srcip,$user,$statuscode,$cached,$profile,$filteraction,$size,$time,
$request,$url,$error,$category,$category_name,$content_type) =
$_ =~ /\w+=".*?"|\S+/g;
life is good but could be better (~75 seconds on an HP dc7700S for a
compressed 500 MB log file).
However, I'd really like to use split because it's so much faster (~15
seconds). The problem is that if I split on whitespace, $name, $time or
$category_name will sometimes include a space within the quotes, which I
don't want to split on (see above).
I tried Text::ParseWords but gave up after waiting 5 minutes.
I know split can take a regex, but I'm stumped as to how to write one
that splits on a space except when the space is enclosed in quotes.
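One possible shape for such a split, offered only as a sketch: peel off the
three unquoted leading fields with a limited split, then split the remainder
on whitespace that directly follows a double quote. This assumes values never
contain an embedded double quote and never begin with whitespace. The sub
name split_record and the shortened sample record are made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one log record into its fields.
# Assumes three unquoted leading fields, then key="value" pairs whose
# values contain no embedded double quotes and no leading whitespace.
sub split_record {
    my ($line) = @_;
    chomp $line;
    # Limit of 4: three whitespace-delimited fields plus the remainder.
    my ($date, $fwip, $proxy, $rest) = split ' ', $line, 4;
    return ($date, $fwip, $proxy) unless defined $rest;
    # Split only on whitespace that directly follows a double quote,
    # so spaces inside "http access" or "105 ms" are left alone.
    return ($date, $fwip, $proxy, split /(?<=")\s+/, $rest);
}

# Shortened, made-up record in the same shape as the log above.
my $sample = '2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" '
           . 'name="http access" user="" time="105 ms" size="78632"';

my @fields = split_record($sample);
print "$_\n" for @fields;
```

Each element still carries its key (for example name="http access"), just as
the match version above returns it, so the existing list assignment could be
kept unchanged.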
Also, would it theoretically be any faster than the match above, since
it still uses a regex, or should I just live with it?
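Whether a regex split beats the global match is probably easiest to settle
by measuring rather than guessing. Below is a rough sketch using the core
Benchmark module; the sample record is made up, the split variant assumes
values contain no embedded double quotes, and the numbers will depend on the
machine and on the real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Shortened, made-up record in the same shape as the real log lines.
my $line = '2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" '
         . 'name="http access" user="" time="105 ms" size="78632"';

my %approach = (
    # The global match from the original loop.
    match => sub {
        my @f = $line =~ /\w+=".*?"|\S+/g;
        return scalar @f;
    },
    # Limited split for the leading fields, then a split on whitespace
    # that follows a double quote (an assumption about the log format).
    split => sub {
        my ($date, $fwip, $proxy, $rest) = split ' ', $line, 4;
        my @f = ($date, $fwip, $proxy, split /(?<=")\s+/, $rest);
        return scalar @f;
    },
);

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, \%approach);
```

Timing a single in-memory line leaves out decompression and I/O, so it only
compares the parsing step itself.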
Thanks.
David