[Neworleans-pm] split vs. match
estrabd at mailcan.com
Mon Oct 27 14:27:45 PDT 2008
Maybe the following will help:
I don't know a whole bunch about parsing tons of text with regexes. I do have one immediate suggestion - that "/g" might not be necessary. Anyway, take a look at that link; it might help.
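For what it's worth, here is a rough, unbenchmarked sketch of the three things I would try. The sample $line is a shortened, made-up version of the log line quoted below, and the variable names are mine. One note on "/g": in list context with no capture groups it is what makes the match return every token, so it probably has to stay.

```perl
use strict;
use warnings;

# Hypothetical sample line, shortened from the log format quoted below.
my $line = '2008:10:24-00:00:06 x.x.x.x httpproxy: id="0001" '
         . 'name="http access" time="105 ms" content-type="text/html"';

# Option 1: keep the match, but use a negated class [^"]* instead of the
# lazy .*? so the engine never has to look past the closing quote.
my @tokens = $line =~ /[\w-]+="[^"]*"|\S+/g;

# Option 2: split on whitespace only where an even number of quotes
# follows, i.e. where the whitespace sits outside any quoted value.
my @fields = split /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/, $line;

# Option 3: pull the key="value" pairs straight into a hash, so the
# field order in the log no longer matters.
my %kv = $line =~ /([\w-]+)="([^"]*)"/g;

print "$_\n" for @tokens;
```

Options 1 and 2 should give the same list for well-formed lines (balanced quotes, no escaped quotes inside values). I would not assume the split version is faster: the lookahead rescans the rest of the line at every run of whitespace, so it is worth timing both with the Benchmark module on a slice of the real log before committing to either.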
On Mon, Oct 27, 2008 at 03:59:32PM -0500, David B. John wrote:
> (perl v5.8.8 on Ubuntu Hardy Heron/2.6.24-21-generic.)
> I have an http logfile I'm trying to parse which looks like:
> 2008:10:24-00:00:06 x.x.x.x httpproxy: id="0001" severity="info" sys="SecureWeb" sub="http" name="http access" action="pass" method="GET" srcip="x.x.x.x" user="" statuscode="200" cached="0" profile="profile_0" filteraction="action_REF_DefaultHTTPCFFAction" size="78632" time="105 ms" request="0x90075b60" url="http://sb.google.com/safebrowsing/update?client=navclient-auto-ffox&appver=184.108.40.206&version=goog-white-domain:1:481,goog-white-url:1:371,goog-black-url:1:25401,goog-black-enchash:1:62374" error="" category="175,178" categoryname="Software/Hardware,Internet Services" content-type="text/html"
> (Nothing Spectacular.)
> If I loop through the log file and do:
> my ($date,$fwip,$proxy,$id,$severity,$sys,$sub,$name,$action,$method,
> $request,$url,$error,$category,$category_name,$content_type) =
> $_ =~ /\w+=".*?"|\S+/g;
> life is good but could be better (~ 75 seconds on an HP dc7700S for a
> compressed 500 MB logfile).
> However, I'd really like to use split b/c it's so much faster (~ 15
> seconds). The problem is if I split, sometimes $name, $time or
> $category_name will include a space within the quotes which I don't want
> to split on (see above).
> I used Text::ParseWords but gave up after waiting 5 minutes.
> I know I can use a regex with split but I'm stumped as to how I would go
> about writing it. E.g. split on a space except when enclosed in quotes.
> Also, would it theoretically be any faster than the example above since
> it's using regex or should I just live with it?
Louisiana Optical Network Initiative
+1.225.578.1920 aim: bz743