Log file parsing

Keary Suska hierophant at pcisys.net
Tue Jul 1 12:45:53 CDT 2003


on 7/1/03 10:15 AM, william.l.lewis at usa.net purportedly said:

> Long time listener, first time poster.  I have scoured around looking for a
> fast way to pull lines out of very large log files.  These files are on the
> order of 300+ MB.
> 
> I am building a list of proc/pid combinations that we log, creating an
> array of those, and then foreach'ing over that list, looking through the
> log for matching lines.
> 
> Each new proc/pid means a new search on the entire file, which is obviously
> time consuming.  Right now, I am using the OS grep to pipe into perl to get
> the lines for each proc/pid and then processing those lines.
> 
> Unfortunately, there is not enough memory to slurp the whole thing into RAM
> and process that way.  Though, if I could, how would I go about that?
> 
> Anyone done anything like this before?

I have done various manipulations on Apache log files on the order of 1+ GB,
and I find Perl can chew through a large text file in no time flat (less
than a minute for 1 GB files). Of course, it depends on what you need to do
to each line.

If you use a hash instead of an array for PIDs, you will likely eke out more
performance, since you can make a single pass over the file with one cheap
hash lookup per line (see the sketch after the list):

1) extract the proc/pid from the line
2) check whether it exists in the hash
3) do whatever with the line
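
Something along these lines (untested, and the file name, the sample
proc/pid keys, and the regex are just placeholders for whatever your log
format actually looks like):

  my @proc_pid_list = ('httpd/1234', 'sendmail/987');   # your real list here
  my %wanted = map { $_ => 1 } @proc_pid_list;

  open my $fh, '<', 'big.log' or die "can't open big.log: $!";
  while (my $line = <$fh>) {
      # 1) extract the proc/pid from the line (adjust the regex to your format)
      next unless $line =~ m{(\w+)/(\d+)};
      my $key = "$1/$2";

      # 2) check whether it exists in the hash -- one O(1) lookup per line,
      #    no inner loop over the whole list of proc/pids
      next unless exists $wanted{$key};

      # 3) do whatever with the line
      print $line;   # stand-in for your real per-line work
  }
  close $fh;

That way the file gets read exactly once no matter how many proc/pids you
are after, instead of one grep pass per PID.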

If you can maintain state, then when new PIDs show up you can search the file
for just the new PIDs, as opposed to the whole list all over again. If that
isn't acceptable performance-wise, you could also try maintaining a bitmap of
the file (one bit per line, on if the PID is known, off otherwise) and then
only deal with candidate lines.
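
For the bitmap, Perl's vec() is handy. A rough sketch (untested; the regex
and the %known hash are stand-ins for however you identify a line's proc/pid
and which ones you already handle):

  my %known = map { $_ => 1 } ('httpd/1234');   # placeholder: known proc/pids

  my $bitmap = '';                  # vec() grows the string as needed
  open my $fh, '<', 'big.log' or die "can't open big.log: $!";
  my $line_no = 0;
  while (my $line = <$fh>) {
      if ($line =~ m{(\w+)/(\d+)} and exists $known{"$1/$2"}) {
          vec($bitmap, $line_no, 1) = 1;   # bit on: this line's PID is known
      }
      $line_no++;
  }
  close $fh;

  # Later passes can test vec($bitmap, $n, 1) for line $n and only
  # bother with the candidate lines.

At one bit per line, even a log with a few million lines needs well under a
megabyte for the bitmap.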

Of course, all of this applies only if using a database isn't an acceptable
option; I would agree with Tim that for performance and flexibility a
database is the best option.

Keary Suska
Esoteritech, Inc.
"Leveraging Open Source for a better Internet"



