[Chicago-talk] views on a quandary...

Thu Sep 11 19:54:52 CDT 2003

On Thursday, September 11, 2003, at 06:41  PM, Walter Torres wrote:

> I take it he is talking to other coders and they are thinking of 
> loading the
> files into RAM and munging them there.

Unless the munging you are doing on each line is somehow affected by 
the preceding lines, there will be no speed up. I can't think of any 
reason you would be doing this, and you would know if you were. If you 
were, your code would be backtracking and re-reading lines to get some 
other values to include in the current output. I assume you aren't 
doing anything like this? If you were, it would obviously be quicker to 
store all the data in RAM and then do searches against RAM, than to be 
constantly searching against the data on disk. I highly doubt you are 
doing something like this.

So, if your processing cares only about each line individually, then 
their approach of reading all the data into RAM first before doing the 
processing will most likely slow things down. Not only does my own 
experience coincide with this, but just look at other Unix utilities 
out there. Take gzip for instance. It goes very fast and does not read 
all of a file into RAM before it starts to compress it. If reading 
everything into RAM would speed it up, don't you think somebody out 
there would have written a version that did that by now?
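
To make the line-at-a-time approach concrete, here is a minimal sketch.
The munge_line routine is hypothetical and only stands in for whatever
the real per-line processing is; the in-memory filehandles are there
just so the example runs without any files on disk.

```perl
use strict;
use warnings;

# Hypothetical per-line transformation; it stands in for the real munging.
sub munge_line {
    my ($line) = @_;
    return uc $line;
}

# Read a line, transform it, write it, repeat. Our code holds one line at
# a time; the buffering underneath is left to Perl and the OS.
sub munge_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} munge_line($line);
    }
    return;
}

# Demonstration on in-memory filehandles so the sketch runs without files.
my $input  = "first line\nsecond line\n";
my $result = '';
open my $in,  '<', \$input  or die "open: $!";
open my $out, '>', \$result or die "open: $!";
munge_stream($in, $out);
close $out;
close $in;
print $result;
```

In a real filter you would pass \*STDIN and \*STDOUT (or handles opened
on the actual files) to munge_stream instead.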

Basically what they are trying to do is duplicate the buffering that Perl 
and your operating system's filesystem cache already do. But instead of 
using smaller more efficient buffers, they want to create a huge slow 
one. There will be a cost associated with allocating all of that RAM. 
And you will essentially be creating a buffer on top of another buffer 
that already exists. You will be doing the same thing twice and as a 
result could end up taking twice as long...
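
Rather than argue about it, the two approaches can be timed against
each other. The following is only a sketch: the file size, the uc call
standing in for the real work, and the sub names by_line and
slurp_first are all assumptions made for illustration. The numbers will
depend on your data and machine, so run it against something that
resembles the real files.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Build a throwaway input file (size chosen arbitrarily for the sketch).
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "line $_ of some sample text\n" for 1 .. 200_000;
close $tmp or die "close: $!";

# Approach 1: process each line as it is read.
sub by_line {
    open my $fh, '<', $path or die "open: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count += length uc $line;
    }
    close $fh;
    return $count;
}

# Approach 2: read the whole file into an array, then process it.
sub slurp_first {
    open my $fh, '<', $path or die "open: $!";
    my @lines = <$fh>;
    close $fh;
    my $count = 0;
    $count += length uc $_ for @lines;
    return $count;
}

# Run each approach for at least one CPU second and print a comparison.
cmpthese(-1, { by_line => \&by_line, slurp_first => \&slurp_first });
```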

When your code does a read from the disk, the filesystem and Perl will 
already read more of the data than you originally requested into memory 
(the input buffer). But it will return to your program only the data 
you asked for initially, so your processing step can do its thing 
while the OS is still reading more data into memory. And once you go to 
output the result, it will initially be put into the output buffer and 
later written to disk when there is the best opportunity to do so. The 
next time your program asks for more data from the disk, it will be 
returned to you directly from the input buffer without having to go to 
disk. And when you reach the end of the buffer, Perl will refill it 
from the filesystem, which has typically already read ahead while your 
code was doing its processing, taking advantage of the multi-tasking 
properties of your operating system.
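
The output side of this is easy to see for yourself. A small sketch,
assuming a regular file and Perl's default buffering (no autoflush set
on the handle): write a few bytes, check the file's size on disk, then
close the handle and check again.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway output file for the demonstration.
my ($out, $path) = tempfile(UNLINK => 1);

print {$out} "hello\n";

# Size on disk while the data may still sit in Perl's output buffer.
my $size_before = (stat $path)[7];

close $out or die "close: $!";   # closing the handle flushes the buffer

# Size on disk after the flush.
my $size_after = (stat $path)[7];

print "before close: $size_before bytes, after close: $size_after bytes\n";
```

With default buffering the first figure should be smaller than the
second, because the print only filled Perl's buffer and nothing reached
the disk until the close.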

By reading all the data into memory in one shot you are essentially 
negating all the good things that come with the already existing I/O 
buffers. You get none of the advantage that the OS and Perl developers' 
years of effort have put into making these things as fast as possible. To 
think that, "Oh, let's just read it all into RAM. That will make it 
faster." is almost insulting.

So I guess what you can say to them is that, yes, they are right in that 
reading small portions of the data into RAM at a time is a good thing to 
do. But we don't need to do that ourselves because Perl is already 
doing that for us, and doing it well.

I hope that this made sense.

-- 
Paul Baker

"Reality is that which, when you stop believing in it, doesn't go away."
          -- Philip K. Dick

GPG Key: http://homepage.mac.com/pauljbaker/public.asc