[Neworleans-pm] split vs. match

Mon Oct 27 14:27:45 PDT 2008

David, 

Maybe the following will help:

http://oreilly.com/catalog/perlwsmng/chapter/ch08.html

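I don't have much experience parsing large volumes of text with regexes. My first thought was that the "/g" might not be necessary, but in list context with no capturing groups it is what makes the match return every field, so it needs to stay. Anyway, take a look at that link; it might help.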
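
For what it's worth, here is a rough sketch of one alternative to try. I have not benchmarked it against your data, so treat it as something to time rather than a guaranteed win. The idea is to peel off the three leading fields with a limited split (they never seem to contain spaces), then pull the key="value" pairs into a hash using a negated character class, [^"]*, in place of the lazy .*?. The names parse_line and %field are just made up for illustration, and it assumes no value ever contains an embedded double quote.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the three leading fields plus a hash ref
# of the key="value" pairs found in the remainder of the line.
sub parse_line {
    my ($line) = @_;

    # Assumption: date, firewall IP and proxy tag contain no whitespace,
    # so a split limited to 4 pieces leaves the rest of the line intact.
    my ($date, $fwip, $proxy, $rest) = split ' ', $line, 4;
    return unless defined $rest;

    # [^"]* stops at the closing quote, so spaces inside the quotes
    # (name, time, categoryname) are kept as part of the value.
    # Assumption: values never contain an embedded double quote.
    my %field = $rest =~ /([\w-]+)="([^"]*)"/g;

    return ($date, $fwip, $proxy, \%field);
}

# Shortened sample line in the same shape as the log entry quoted below.
my $sample = '2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" '
           . 'name="http access" time="105 ms" content-type="text/html"';

my ($date, $fwip, $proxy, $f) = parse_line($sample);
print "$date $f->{name} $f->{time}\n";
```

A hash also means you no longer depend on the fields always arriving in the same order, at the cost of building the hash for every line. If the order really is fixed, the same [^"]* change can be dropped into your existing pattern instead.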

Cheers,
Brett

On Mon, Oct 27, 2008 at 03:59:32PM -0500, David B. John wrote:
> (perl v5.8.8 on Ubuntu Hardy Heron/2.6.24-21-generic.)
> 
> I have an http logfile I'm trying to parse which looks like:
> 
> 2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" severity="info"
> sys="SecureWeb" sub="http" name="http access" action="pass" method="GET"
> srcip="x.x.x.x" user="" statuscode="200" cached="0" profile="profile_0"
> filteraction="action_REF_DefaultHTTPCFFAction" size="78632" time="105
> ms" request="0x90075b60"
> url="http://sb.google.com/safebrowsing/update?client=navclient-auto-ffox&appver=2.0.0.17&version=goog-white-domain:1:481,goog-white-url:1:371,goog-black-url:1:25401,goog-black-enchash:1:62374" error="" category="175,178" categoryname="Software/Hardware,Internet Services" content-type="text/html"
> 
> 
> (Nothing Spectacular.)
> 
> If I loop through the log file and do:
> 
> 	my ($date,$fwip,$proxy,$id,$severity,$sys,$sub,$name,$action,$method,
> $srcip,$user,$statuscode,$cached,$profile,$filteraction,$size,$time,
> $request,$url,$error,$category,$category_name,$content_type) = 
> $_ =~ /\w+=".*?"|\S+/g;
> 
> life is good but could be better (~ 75 seconds on an HP dc7700S for a
> compressed 500 MB logfile).
> 
> However, I'd really like to use split because it's so much faster (~ 15
> seconds).  The problem is that if I split on whitespace, $name, $time or
> $category_name will sometimes include a space within the quotes, which I
> don't want to split on (see above).
> 
> I used Text::ParseWords but gave up after waiting 5 minutes.
> 
> I know I can use a regex with split but I'm stumped as to how I would go
> about writing it.  E.g. split on a space except when enclosed in quotes.
> Also, would it theoretically be any faster than the example above since
> it's using regex or should I just live with it?
> 
> Thanks.
> 
> David

-- 
B. Estrade
Louisiana Optical Network Initiative
+1.225.578.1920 aim: bz743
:wq