[Chicago-talk] File reads and micromanagement

Thu Jan 25 17:49:09 PST 2007

I have a program which does massive amounts of file comparisons.
Until recently its comparison section looked something like this:

(initialize flags and accumulators)
(open files 3 and 4 for comparison)
while (not end of file && not unequal) {
  $length3=read(MY3,$b3,512,$accum3);
  $accum3+=$length3;
  $length4=read(MY4,$b4,512,$accum4);
  $accum4+=$length4;
  if (either buffer length was <512) {flag an end of file}
  if (the buffers don't match) {flag an unequal condition so I can bail
    out of the while before the end of file}
}
(close the files)
(notice whether the files matched and act accordingly)

What I was noticing was that as the file size increased, the processing 
time increased as something like the square of the file size. So, on a 
hunch, I ripped out the $accum parts. This hunch was partly based on 
looking at lots of examples of file reading code in various books.

  $length3 = read(MY3, $b3, 512);
  $length4 = read(MY4, $b4, 512);
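
For anyone who wants to try it, here is a minimal runnable sketch of the loop with the offsets removed. The sub name files_match, the lexical filehandles and the ':raw' layer are assumptions made for the illustration, not the original program.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is made up for this sketch): returns 1 if
# the two files have identical contents, 0 otherwise.
sub files_match {
    my ($path3, $path4) = @_;
    open(my $fh3, '<:raw', $path3) or die "Cannot open $path3: $!";
    open(my $fh4, '<:raw', $path4) or die "Cannot open $path4: $!";

    my $same = 1;
    while (1) {
        my ($b3, $b4) = ('', '');
        # Three-argument read: each call fills the buffer with the next
        # block; the filehandle itself remembers the file position.
        my $length3 = read($fh3, $b3, 512);
        my $length4 = read($fh4, $b4, 512);
        die "Read error: $!" unless defined $length3 && defined $length4;

        if ($b3 ne $b4) { $same = 0; last; }   # unequal: bail out early
        last if $length3 < 512;                # short block: end of file
    }
    close($fh3);
    close($fh4);
    return $same;
}
```

Each pass compares at most 512 bytes, so the total work should grow linearly with the file size.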

Then it went a bit faster for small files, and much much faster for 
large files. The execution time was now linearly proportional to the 
file size. It seems like instead of just noticing that I'm pointing to 
where it left off, it had been wallowing through the entire file on each 
read to re-establish the pointer.
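
Another possible explanation, going by perldoc -f read: the fourth argument is an offset into the buffer scalar, not into the file. If so, every read appended 512 more bytes to $b3 and $b4, and each pass compared everything read so far. The sketch below only shows the buffer growth; the in-memory filehandle and the names $data, $buf and @buffer_sizes are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Stand-in for a file on disk: 2048 bytes served from an in-memory filehandle.
my $data = 'x' x 2048;
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my ($buf, $accum) = ('', 0);
my @buffer_sizes;
while (1) {
    # Four-argument read: new data is placed at position $accum in $buf.
    my $length = read($fh, $buf, 512, $accum);
    die "Read error: $!" unless defined $length;
    last if $length == 0;              # end of file
    $accum += $length;
    push @buffer_sizes, length $buf;   # grows: the buffer is never reset
}
close($fh);
print "@buffer_sizes\n";
```

The buffer length goes up by 512 on every pass, so a comparison of the two buffers inside the loop looks at more data each time. Summed over the whole file, that is roughly proportional to the square of the file size, which would match the timing I saw.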

So there must be a lesson here.
I figure that it must be "Don't micromanage a high-level language."
It could also be "There's more than one way to do it, but some ways are 
really wretched."
or "That method that really bogs the system down? Yeah, well, don't do 
that."

So, have I rediscovered any well-known rules here (of which I've been 
amazingly oblivious)?
Have I exposed a bug, or a best practice, or nothing much really?
Where should I have already read about this?

Clyde