[Neworleans-pm] [neworleans-pm-owner at pm.org: Re: split vs. match]

Tue Oct 28 06:17:59 PDT 2008

Sorry for the dupes - it's been so long, I can't keep straight which email address is subscribed here :)...my reply is below.

On Tue, Oct 28, 2008 at 07:51:11AM -0500, David B. John wrote:
> On Tue, 2008-10-28 at 01:54 -0400, Donnie Cameron wrote:
> 
> > David,
> > 
> > The split function is not going to make things any faster. In fact,
> > without resorting to the use of another language, I can't think of a
> > faster way of doing it than you have suggested. Even if you were to
> > split on something like a quote followed by a space (/" /) and then
> > reattach the quote to the end of each resulting element (work that is
> > vastly simpler than regex matching), the process would end up being
> > slower than regex matching because the regex matching happens in
> > machine language and the more efficient work happens in Perl. I'm
> > convinced also that even if you were to use the index function, your
> > Perl code would still be slower than the regex-based solution you
> > described.
> > 
> > In the past, I have tried a number of tricks to try to beat simple
> > regex matching for this type of work and I've seldom been able to beat
> > the regex matching. (When I write "this type of work", I am of course
> > excluding regular Apache-like log files and other files that are
> > designed to be easy and fast to parse. I'm talking about more
> > thoughtless file designs, such as the one you described.) 
> > 
> > You could roll out your own C extension, but that's just ridiculous
> > because the hardware to process the slower and more general Perl regex
> > would be less expensive than your time. 
> > 
> > I don't know how you timed the split function, but I suspect that it
> > was much faster because its regex was probably much simpler. If you
> > try the split function with a more complicated regex, I'm sure you'll
> > find that split isn't so fast any more. 
> > 
> > You do need the /g at the end, of course.

Do you or don't you? I am not familiar with using the "g" switch in a pure match - I usually just use it when doing global search and replaces.
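
For what it's worth, here is a small sketch of how /g behaves in a plain match, as far as I understand it. The sample line and the quoted-field pattern are made up for illustration; they are not David's actual format.

```perl
use strict;
use warnings;

# Hypothetical input line; the real file format is an assumption here.
my $line = q{"alpha one" "beta" "gamma 3"};

# List context with /g: the captures from every match come back at once.
my @fields = $line =~ /"([^"]*)"/g;

# List context without /g: only the capture from the first match.
my ($first) = $line =~ /"([^"]*)"/;

# Scalar context with /g: each evaluation resumes where the last match ended,
# so the loop runs once per match.
my $count = 0;
$count++ while $line =~ /"([^"]*)"/g;

print join('|', @fields), "\n";   # alpha one|beta|gamma 3
print "$first\n";                 # alpha one
print "$count\n";                 # 3
```

So if the goal is to pull every field out of a line with one match, the /g does seem to be needed; without it only the first field is captured.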

> > 
> > --Donnie
> > 
> 
> Thanks Donnie.  I can live with that.  :)

David:

Is this for Apache?  If so, you are treading on well-worn ground - 

http://www.google.com/search?q=perl+parse+apache+log+file&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a

Also, if you want to analyze your log files, you may want to check out AWStats - http://awstats.sourceforge.net/.

Lastly, you can try an approach that essentially parses parts of the file in parallel. I am not familiar with writing multi-threaded Perl scripts, but that would allow you to get further speed-up once you've found the magic regex to use. Of course, you might have to deal with bringing back the results in some ordered way, so it is a rather advanced approach to take.
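
A rough sketch of that idea, using fork and pipes rather than threads. Everything specific here is an assumption for illustration: the sample lines, the quoted-field regex, and the worker count of 2. Each child tags its output with the original line index so the parent can put the results back in order.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical sample input; a real script would read lines from the log file.
my @lines   = map { qq{"field $_" "value $_"} } 1 .. 8;
my $workers = 2;    # assumed worker count

my @children;
for my $w (0 .. $workers - 1) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: parse every line whose index falls to this worker.
        close $reader;
        for my $i (grep { $_ % $workers == $w } 0 .. $#lines) {
            my @fields = $lines[$i] =~ /"([^"]*)"/g;
            print {$writer} join("\t", $i, @fields), "\n";
        }
        close $writer;       # flushes the buffered output to the parent
        POSIX::_exit(0);     # leave without running the parent's END blocks
    }

    # Parent: keep the read end, drop the write end.
    close $writer;
    push @children, [ $pid, $reader ];
}

# Collect results, using the leading index to restore the original order.
my @results;
for my $child (@children) {
    my ($pid, $reader) = @$child;
    while (my $row = <$reader>) {
        chomp $row;
        my ($i, @fields) = split /\t/, $row;
        $results[$i] = \@fields;
    }
    close $reader;
    waitpid($pid, 0);
}

print join(' | ', @$_), "\n" for @results;
```

The parent drains the pipes one after another, so this only pays off when the per-line matching is expensive compared to shipping the results back. It also assumes the fields never contain tabs or newlines.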

Cheers,
Brett

> David
> 
> 

-- 
B. Estrade
Louisiana Optical Network Initiative
+1.225.578.1920 aim: bz743
:wq