[Neworleans-pm] split vs. match

Mon Oct 27 22:54:53 PDT 2008

David,

The split function is not going to make things any faster. In fact, without
resorting to another language, I can't think of a faster way of doing it than
the one you have suggested. Even if you were to split on something like a
quote followed by a space (/" /) and then reattach the quote to the end of
each resulting element (work that is vastly simpler than regex matching), the
process would end up being slower than regex matching. The regex matching
happens in compiled machine code inside the regex engine, while the simpler
reattaching work would happen in interpreted Perl code. I'm also convinced
that even if you were to use the index function, your Perl code would still
be slower than the regex-based solution you described.
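
A minimal sketch of the two approaches compared above, for anyone who wants
to time them. The sample line is a shortened, made-up line in the format
quoted below, and the names by_match and by_split are hypothetical. The
timing step uses cmpthese from the core Benchmark module; results will vary
with the machine and the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Shortened, made-up sample line in the format quoted below.
my $line = '2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" '
         . 'name="http access" time="105 ms" error="" content-type="text/html"';

# Approach 1: the single global match from the original question.
sub by_match {
    my ($l) = @_;
    return $l =~ /\w+=".*?"|\S+/g;
}

# Approach 2: split on quote-plus-space, then put the quote back.
sub by_split {
    my ($l) = @_;
    my @parts = split /" /, $l;
    # Every element except the last lost its closing quote in the split.
    $_ .= '"' for @parts[0 .. $#parts - 1];
    # The first element still holds the unquoted leading fields.
    my $head = shift @parts;
    return (split(/ /, $head), @parts);
}

# Check that both approaches yield the same fields before timing them.
my @m = by_match($line);
my @s = by_split($line);
print "same fields\n" if "@m" eq "@s";

# Run each approach for at least one CPU second and compare the rates.
cmpthese(-1, {
    match => sub { my @f = by_match($line) },
    split => sub { my @f = by_split($line) },
});
```

Note that by_split assumes the sequence quote-plus-space never occurs inside
a quoted value, which holds for the sample line but is not guaranteed for
every log.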

In the past, I have tried a number of tricks to beat simple regex matching
for this type of work, and I've seldom succeeded. (When I write "this type of
work", I am of course excluding regular Apache-like log files and other files
that are designed to be easy and fast to parse. I'm talking about more
thoughtless file designs, such as the one you described.)

You could roll your own C extension, but that's just ridiculous because the
hardware needed to run the slower and more general Perl regex would be less
expensive than your time.

I don't know how you timed the split function, but I suspect that it was
much faster because its regex was much simpler than the one in your match.
If you try the split function with a more complicated regex, I'm sure you'll
find that split isn't so fast any more.
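
As an illustration of what a "more complicated regex" for split might look
like, here is a sketch that splits on a space only when it lies outside a
quoted value. The lookahead is a common idiom rather than anything taken
from this thread, and the sample line and the name $outside_quotes are made
up for the example. The cmpthese call lets you compare it with a plain split
on spaces on your own machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Shortened, made-up sample line in the format quoted below.
my $line = '2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" '
         . 'name="http access" time="105 ms" error="" content-type="text/html"';

# Match a space only if an even number of quotes remains between it and
# the end of the line, i.e. only if the space is not inside a quoted value.
my $outside_quotes = qr/ (?=(?:[^"]*"[^"]*")*[^"]*$)/;

my @fields = split $outside_quotes, $line;
print "$_\n" for @fields;

# Compare the plain split (wrong on quoted spaces) with the quote-aware one.
cmpthese(-1, {
    plain_split => sub { my @f = split / /, $line },
    quote_aware => sub { my @f = split $outside_quotes, $line },
});
```

The lookahead rescans the rest of the line at every space, so the amount of
work grows quickly with line length.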

You do need the /g at the end, of course. Without it, the match in list
context does not return every field.
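
A small sketch of why the /g matters here, using a shortened, made-up sample
line. In list context, a pattern with no capturing parentheses and no /g
returns only the list (1) on success, whereas with /g it returns every
match.

```perl
use strict;
use warnings;

# Shortened, made-up sample line in the format quoted below.
my $line = '2008:10:24-00:00:06 x.x.x.x id="0001" name="http access"';

# Same pattern, same list context; only the /g modifier differs.
my @without_g = $line =~ /\w+=".*?"|\S+/;
my @with_g    = $line =~ /\w+=".*?"|\S+/g;

printf "without /g: %d element(s): %s\n", scalar @without_g, "@without_g";
printf "with /g:    %d element(s)\n", scalar @with_g;
# prints:
# without /g: 1 element(s): 1
# with /g:    4 element(s)
```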

--Donnie

On Mon, Oct 27, 2008 at 5:27 PM, B. Estrade <estrabd at mailcan.com> wrote:

> David,
>
> Maybe the following will help:
>
> http://oreilly.com/catalog/perlwsmng/chapter/ch08.html
>
> I don't know a whole bunch about parsing tons of text with regexes. I do
> have one immediate suggestion - that "/g" might not be necessary. Anyway,
> take a look at that link; it might help.
>
> Cheers,
> Brett
>
> On Mon, Oct 27, 2008 at 03:59:32PM -0500, David B. John wrote:
> > (perl v5.8.8 on Ubuntu Hardy Heron/2.6.24-21-generic.)
> >
> > I have an http logfile I'm trying to parse which looks like:
> >
> > 2008:10:24-00:00:06 x.x.x.x httpproxy[4997]: id="0001" severity="info"
> > sys="SecureWeb" sub="http" name="http access" action="pass" method="GET"
> > srcip="x.x.x.x" user="" statuscode="200" cached="0" profile="profile_0"
> > filteraction="action_REF_DefaultHTTPCFFAction" size="78632" time="105
> > ms" request="0x90075b60"
> > url="http://sb.google.com/safebrowsing/update?client=navclient-auto-ffox&appver=2.0.0.17&version=goog-white-domain:1:481,goog-white-url:1:371,goog-black-url:1:25401,goog-black-enchash:1:62374"
> > error="" category="175,178" categoryname="Software/Hardware,Internet
> > Services" content-type="text/html"
> >
> >
> > (Nothing Spectacular.)
> >
> > If I loop through the log file and do:
> >
> >       my ($date,$fwip,$proxy,$id,$severity,$sys,$sub,$name,$action,$method,
> >           $srcip,$user,$statuscode,$cached,$profile,$filteraction,$size,$time,
> >           $request,$url,$error,$category,$category_name,$content_type) =
> >           $_ =~ /\w+=".*?"|\S+/g;
> >
> > life is good but could be better (~ 75 seconds on an HP dc7700S for a
> > compressed 500 MB logfile).
> >
> > However, I'd really like to use split b/c it's so much faster (~ 15
> > seconds).  The problem is if I split, sometimes $name, $time or
> > $category_name will include a space within the quotes which I don't want
> > to split on (see above).
> >
> > I used Text::ParseWords but gave up after waiting 5 minutes.
> >
> > I know I can use a regex with split but I'm stumped as to how I would go
> > about writing it.  E.g. split on a space except when enclosed in quotes.
> > Also, would it theoretically be any faster than the example above since
> > it's using regex or should I just live with it?
> >
> > Thanks.
> >
> > David
> >
> >
> > _______________________________________________
> > NewOrleans-pm mailing list
> > NewOrleans-pm at pm.org
> > http://mail.pm.org/mailman/listinfo/neworleans-pm
>
> --
> B. Estrade
> Louisiana Optical Network Initiative
> +1.225.578.1920 aim: bz743
> :wq