[Chicago-talk] views on a quandary...
pbaker at where2getit.com
Thu Sep 11 19:54:52 CDT 2003
On Thursday, September 11, 2003, at 06:41 PM, Walter Torres wrote:
> I take it he is talking to other coders and they are thinking of
> loading the
> files into RAM and munging them there.
Unless the munging you are doing on each line is somehow affected by the
preceding lines, there will be no speedup. I can't think of any reason
you would be doing this, and you would know if you were: your code would
be backtracking and re-reading earlier lines to pull in some other
values for the current output. I assume you aren't doing anything like
that? If you were, it would obviously be quicker to store all the data
in RAM and do your searches against RAM than to be constantly searching
against the data on disk. But I highly doubt you are doing something
like this.
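Just to make the backtracking case concrete, here is a rough sketch of
what it might look like. The file name and the "seen it before" check
are made up for illustration; the point is only that every line has to
be compared against all the earlier ones, so having them all in RAM
pays off:

    open my $fh, '<', 'records.txt' or die "records.txt: $!";
    chomp(my @lines = <$fh>);      # pull everything into RAM once
    close $fh;

    for my $i (0 .. $#lines) {
        # look back through the lines already seen for a duplicate
        my $seen = grep { $_ eq $lines[$i] } @lines[0 .. $i - 1];
        print $seen ? "$lines[$i] -> seen before\n" : "$lines[$i] -> new\n";
    }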
So, if your processing cares only about each line individually, their
approach of reading all the data into RAM before doing the processing
will most likely slow things down. Not only does my own experience bear
this out, but just look at the other Unix utilities out there. Take
gzip, for instance: it is very fast, and it does not read a whole file
into RAM before it starts compressing. If reading everything into RAM
would speed it up, don't you think somebody out there would have
written a version that did that by now?
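If the per-line case is what you have, the plain streaming loop is all
you need. This is just a sketch; input.txt, output.txt, and the uc()
transformation are stand-ins for whatever your munging actually is:

    open my $in,  '<', 'input.txt'  or die "input.txt: $!";
    open my $out, '>', 'output.txt' or die "output.txt: $!";

    while (my $line = <$in>) {        # Perl refills its read buffer behind the scenes
        chomp $line;
        print {$out} uc($line), "\n"; # your real per-line munging goes here
    }

    close $in;
    close $out or die "output.txt: $!";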
Basically, what they are trying to do is duplicate the buffering that
Perl and your operating system's filesystem cache already do. But
instead of using smaller, more efficient buffers, they want to create a
huge, slow one. There will be a cost to allocating all of that RAM, and
you will essentially be creating a buffer on top of a buffer that
already exists. You will be doing the same thing twice, and as a result
could end up taking twice as long...
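For comparison, here is roughly what the slurp-first version of the
same loop looks like (same placeholder file names and transformation).
The loop body is identical; all the slurp adds is the up-front
allocation of one array entry per line:

    open my $in, '<', 'input.txt' or die "input.txt: $!";
    my @all = <$in>;               # the huge, slow, hand-rolled buffer
    close $in;

    open my $out, '>', 'output.txt' or die "output.txt: $!";
    for my $line (@all) {          # then the same per-line work as before
        chomp $line;
        print {$out} uc($line), "\n";
    }
    close $out;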
When your code does a read from the disk, the filesystem and Perl will
already read more data than you originally requested into memory (the
input buffer). But only the data you asked for is returned to your
program, so your processing step can do its thing while the OS is still
reading more data into memory. And when you go to output the result, it
will initially be put into the output buffer and written to disk later,
when there is a good opportunity to do so. The next time your program
asks for more data from the disk, it will be returned to you directly
from the input buffer without having to go to disk at all. And when you
get close to the end of the buffer, Perl will go and fetch more data
from the filesystem in the background, while your code is doing its
processing, taking advantage of the multi-tasking properties of your
operating system.
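You can see the distinction in the two read styles below. The ordinary
line read goes through Perl's buffering layer, while sysread bypasses
it and goes to the OS for exactly the bytes you ask for; input.txt is
again just a placeholder:

    # Buffered, line-oriented read: Perl pulls a large block from the OS
    # and hands back one line at a time out of its own buffer.
    open my $fh, '<', 'input.txt' or die "input.txt: $!";
    my $first_line = <$fh>;

    # Unbuffered read, for contrast: each sysread call goes straight to
    # the OS and returns at most the 4096 bytes requested.
    open my $raw, '<', 'input.txt' or die "input.txt: $!";
    sysread($raw, my $buffer, 4096);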
By reading all the data into memory in one shot, you are essentially
negating all the good things that come with the I/O buffers that
already exist. You get none of the benefit of the years of effort the
OS and Perl developers have put into making these things as fast as
possible. To think, "oh, let's just read it all into RAM, that will
make it faster," is almost insulting.
So I guess what you can say to them is: yes, they are right that
pulling the data into RAM in manageable chunks is a good thing to do.
But we don't need to do that ourselves, because Perl is already doing
it for us in just about the most optimal way possible.
I hope that this made sense.
"Reality is that which, when you stop believing in it, doesn't go away."
-- Philip K. Dick
GPG Key: http://homepage.mac.com/pauljbaker/public.asc