[Chicago-talk] views on a quandary...

Steven Lembark lembark at jeeves.wrkhors.com
Sat Sep 13 17:39:43 CDT 2003



--On Thursday, September 11, 2003 18:41:55 -0500 Walter Torres 
<walter at torres.ws> wrote:

> I have a (potential) client that has asked a question, which I thought I
> knew the answer, but am second guessing myself. So I thought I'd ask this
> august body for their view.
>
> I am creating a log post-processor for him he needs to munge some fields
> so a report generator can understand the log file. It seems it would be
> easier to create a munger than fix the reporter.
>
> Anyway, these log files are upwards of 300MB per day, and I will be
> processing as many as 20 a night.

I've happily processed 10GB of data through perl at once -- on a
system with 200GB of core, mind you, but it worked. The main issue
will be how much load the system can handle and how you read the
data. For this much data you probably don't want to read in slurp
mode, but many logfiles work nicely in paragraph or line mode. You
can also read in fixed chunks via

    $/ = \4096

(pick your size) and manage things that way (code'll look a whole
lot like C w/ regexes instead of string.h calls).
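
A minimal sketch of that approach, assuming the log path comes in
on the command line and picking 4KB as the record size (both are
placeholders -- adjust to taste):

    my $logfile = shift;    # path to the raw log

    open my $fh, '<', $logfile or die "open $logfile: $!";

    {
        # a reference in $/ makes <$fh> return fixed-size records;
        # anything that spans a chunk boundary has to be stitched
        # back together by the munger itself.
        local $/ = \4096;

        while( my $chunk = <$fh> )
        {
            # apply the field fixups here -- regexes in place of
            # string.h calls.
        }
    }

    close $fh;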

>
> It seems some brainiac has put the bug in his ear that...
>
>   "The suggestion was made that really large log files could be
>    parsed faster and with less of a hit to the processor and ram
>    usage if they were split into smaller, more manageable chunks
>    beforehand and then reassembled as they were being cleaned."
>
> My question to you lot, does this hold any water?

If the only way to read the data were in slurp mode then maybe, but
otherwise you still have to read all of the data just to split it.
Breaking up files that large may help for backup/recovery purposes
(e.g., hourly logs might be easier to grep for specific items).
Aside from that, you have to read that much junk off the disk
either way.

Funny thing is that gzip --fast may actually help since the
read cycle can use:

    open my $log, "gzip -dc $logfile.gz |"

and save some disk I/O -- assuming the logs can be zipped during
the day (say, with hourly rotation). The gzip trick works because
logs are repetitive enough to compress well, so unzipping them in
core takes less time than reading the full uncompressed data off
the disk.
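
Roughly, the whole read cycle would then look like this (a sketch;
the .gz naming is an assumption about how the logs get rotated):

    my $logfile = shift;    # base name of the zipped log

    # the open only fails if the fork fails; a bad or missing .gz
    # shows up as a non-zero exit status when the handle is closed.
    open my $log, "gzip -dc $logfile.gz |"
        or die "gzip -dc $logfile.gz: $!";

    while( my $line = <$log> )
    {
        # clean up the fields the report generator chokes on.
    }

    close $log
        or warn "gzip on $logfile.gz exited with status $?";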

>
> My prototype on this does streaming processing.

Adjust the buffer to a reasonable size for the system and you have
no reason to worry about the total file size -- just don't slurp it
via:

    for( <$fh> )

(which pulls every line into a list before the loop even starts) or

    local $/;

(which makes <$fh> return the whole file as one string), and you
should be able to read any amount of data easily.
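
For what it's worth, the streaming version is about as short as it
gets -- a sketch, with munge_line() standing in for whatever field
fixups the report generator needs:

    my $logfile = shift;

    # stand-in for the real fixups: takes a line, returns the
    # cleaned-up line.
    sub munge_line { my $line = shift; return $line }

    open my $in,  '<', $logfile         or die "open $logfile: $!";
    open my $out, '>', "$logfile.fixed" or die "open $logfile.fixed: $!";

    # one line in core at a time: a 300MB log costs no more memory
    # than a 3MB one.
    while( my $line = <$in> )
    {
        print {$out} munge_line( $line );
    }

    close $out or die "close $logfile.fixed: $!";
    close $in;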


-- 
Steven Lembark                                            2930 W. Palmer
Workhorse Computing                                    Chicago, IL 60647
                                                         +1 888 910 1206


