[Neworleans-pm] split vs. match

David B. John djohn at archdiocese-no.org
Mon Oct 27 13:59:32 PDT 2008


(perl v5.8.8 on Ubuntu Hardy Heron/2.6.24-21-generic.)

I have an http logfile I'm trying to parse which looks like:

2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" severity="info"
sys="SecureWeb" sub="http" name="http access" action="pass" method="GET"
srcip="x.x.x.x" user="" statuscode="200" cached="0" profile="profile_0"
filteraction="action_REF_DefaultHTTPCFFAction" size="78632" time="105
ms" request="0x90075b60"
url="http://sb.google.com/safebrowsing/update?client=navclient-auto-ffox&appver=2.0.0.17&version=goog-white-domain:1:481,goog-white-url:1:371,goog-black-url:1:25401,goog-black-enchash:1:62374" error="" category="175,178" categoryname="Software/Hardware,Internet Services" content-type="text/html"


(Nothing Spectacular.)

If I loop through the log file and do:

	my ($date, $fwip, $proxy, $id, $severity, $sys, $sub, $name, $action,
	    $method, $srcip, $user, $statuscode, $cached, $profile,
	    $filteraction, $size, $time, $request, $url, $error, $category,
	    $category_name, $content_type) = $_ =~ /\w+=".*?"|\S+/g;

life is good but could be better (~75 seconds on an HP dc7700S for a
compressed 500 MB logfile).

However, I'd really like to use split b/c it's so much faster (~15
seconds).  The problem is that if I split, $name, $time or
$category_name will sometimes include a space within the quotes, which I
don't want to split on (see above).
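
To make the problem concrete, here is a minimal sketch on a shortened
version of the log line above. The shortened line and the variable names
are made up for illustration only. A plain whitespace split cuts the
quoted values in two, while the match keeps them whole:

```perl
use strict;
use warnings;

# Shortened sample line (illustrative only; real lines have 24 fields).
my $line = '2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" '
         . 'name="http access" time="105 ms"';

# Plain whitespace split: quoted values containing a space are cut in two.
my @by_split = split ' ', $line;

# The match from above: key="value" pairs first, else any run of non-space.
my @by_match = $line =~ /\w+=".*?"|\S+/g;

print scalar(@by_split), " fields from split\n";
print scalar(@by_match), " fields from match\n";
print "$by_split[4] | $by_split[5]\n";   # the name value, broken apart
print "$by_match[4]\n";                  # the name value, intact
```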

I used Text::ParseWords but gave up after waiting 5 minutes.
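
A minimal sketch of what a Text::ParseWords version might look like,
assuming quotewords is called with the keep flag set so the quotes stay
in the fields. The sample line is shortened and made up for illustration.
Note that quotewords also treats apostrophes as quote characters, and an
unbalanced quote should make it return an empty list, which may matter
for URLs:

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Shortened sample line (illustrative only).
my $line = '2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" '
         . 'name="http access" time="105 ms"';

# Delimiter is a run of whitespace; a true second argument keeps the quotes.
my @fields = quotewords('\s+', 1, $line);

print "$_\n" for @fields;
```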

I know I can use a regex with split, but I'm stumped as to how I would go
about writing it, e.g. to split on a space except when it is enclosed in
quotes.  Also, would it theoretically be any faster than the example
above, since it would still be using a regex, or should I just live with it?
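
One possible shape for such a split, as a sketch only: split on
whitespace that is followed by an even number of double quotes up to the
end of the line, which means the whitespace sits outside any quoted
value. This assumes the quotes in every line are balanced and that values
never contain an escaped quote. The sample line is shortened and made up
for illustration:

```perl
use strict;
use warnings;

# Shortened sample line (illustrative only).
my $line = '2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" '
         . 'name="http access" time="105 ms"';
chomp $line;    # no-op here; needed for lines read from a file

# Split on whitespace only when an even number of double quotes remains
# between that whitespace and the end of the line (i.e. outside quotes).
my @fields = split /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/, $line;

print "$_\n" for @fields;
```

The lookahead rescans the rest of the line at every run of whitespace, so
there is no reason to assume this beats the match; it would need to be
timed on the real logfile.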

Thanks.

David



