[Chicago-talk] views on a quandary...

Walter Torres walter at torres.ws
Thu Sep 11 18:41:55 CDT 2003


I have a (potential) client who has asked a question I thought I knew the
answer to, but I'm second-guessing myself. So I thought I'd ask this august
body for their views.

I am creating a log post-processor for him; he needs some fields munged so a
report generator can understand the log file. It seems it would be easier to
create a munger than to fix the reporter.

Anyway, these log files are upwards of 300MB per day, and I will be
processing as many as 20 a night.

It seems some brainiac has put the bug in his ear that...

  "The suggestion was made that really large log files could be
   parsed faster and with less of a hit to the processor and ram
   usage if they were split into smaller, more manageable chunks
   beforehand and then reassembled as they were being cleaned."

My question to you lot: does this hold any water?

My prototype on this does streaming processing.

Open the log file, open a temp file, loop down the log file, munging each line
in turn and dropping it to the temp file, then close both at the end.
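In rough form the loop is just the following sketch (munge_line() is only a
placeholder for the real field fixes, and the file names are made up):

  #!/usr/bin/perl
  use strict;
  use warnings;

  my ($log, $tmp) = ('big.log', 'big.log.munged');   # placeholder names

  open my $in,  '<', $log or die "can't read $log: $!";
  open my $out, '>', $tmp or die "can't write $tmp: $!";

  # only one line is ever held in RAM, regardless of file size
  while (my $line = <$in>) {
      print {$out} munge_line($line);
  }

  close $in;
  close $out or die "close failed on $tmp: $!";

  # placeholder -- the real one rewrites the fields the reporter chokes on
  sub munge_line {
      my ($line) = @_;
      return $line;
  }

Memory use stays flat no matter how big the log gets, so the job is basically
I/O-bound.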

I figured this would be the best method for large files.

I take it he is talking to other coders and they are thinking of loading the
files into RAM and munging them there.

I'm going to run some prelim tests this evening on 300MB files (which I'll
dummy up), sucking them in and spitting them out on my 600MHz machine, to see
how it goes. I'll also tackle ten 30MB files and compare the times.
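For the record, here's roughly how I'll dummy up the test files (the line
format is made up; only the size matters for the test):

  #!/usr/bin/perl
  use strict;
  use warnings;

  my $target = 300 * 1024 * 1024;   # ~300MB; drop to 30MB for the small files
  my $bytes  = 0;

  open my $out, '>', 'dummy.log' or die "can't write dummy.log: $!";
  while ($bytes < $target) {
      my $line = join(' ', scalar localtime, 'host' . int(rand 100),
                      '"GET /some/path HTTP/1.0"', 200, int(rand 50_000)) . "\n";
      print {$out} $line;
      $bytes += length $line;
  }
  close $out;

Then a plain time(1) around the munger on the one big file versus the ten
small ones should settle it.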

I really don't see how processing smaller files would save any IO/processor
time/cycles.

Besides, something has to split the files into "smaller, more manageable
chunks" in the first place.

Thanks for sharing your thoughts.

Walter
