[Melbourne-pm] An intermittent problem with open for append

Wed May 28 22:38:31 PDT 2008

> 
> Paul Fenwick wrote:
> > Oops, I meant to qualify that with "provided your records are less than the 
> > atomic buffer size".  You're quite right that if we hit records bigger than 
> > our atomic buffer, we have to move to locking.
> >   
> Very interesting thread.  I had no idea that the kernel can mangle the 
> output based on block size.  However, at least in my tests, there will 
> be no data lost, but it may be mangled?

Early Unix systems had 2 atomic file system calls, one was "open
exclusive" and the other was "this data always gets appended".  The
second needs to be guaranteed for system logs where you will have
several things writing to a common log file.  In that case the order
isn't critical but getting the data in the file is.

My solaris internal book describes a write with the O_APPEND bit
set as simply setting the "where to write the next block pointer"
to the save value as the "file length" before the write.  While
this is like seeking to the end before a write, it is in the atomic
operation section of write so it was a trivial way to guarantee
that it works properly every time.

What I'm doing about the original problem is:
1) fixing the || vs or vs () bug.
2) logging a failed open >> to someplace else.
3) checking for unexpected signals that might be showing up.
4) rebuilding perl 5.8.8 without the perl's IO abstraction layer.

We have two CGIs, one uses /usr/local/bin/perl (5.8.8) and the other
uses /usr/bin/perl (5.003) and the only other difference in the
perl CGI scrips is one includes the base64 code (which is built
into 5.8.8) and uses that function.

The one has been running for at least 9 years and millions of times
with out every showing this problem, yet the other one get hit about
20,000 times and has had this problem 4 times in the last month.  They
are both called from the very same apache and the working one still gets
used far more often than the broken one.

The only major difference is version of perl and other things that
are not in scope.

I think there is a problem in perlapio.

The comments so far seem to fit into one of the following groups:
1) the open || report error
The failed open won't report an error... at that location but
will later down where the code with || doesn't have any commas in it.

2) >> is overwriting
This is to a local Solaris 5 ufs file system.  I don't think thats a problem
or lots of other people would have all sorts of odd problems.  If it was NFS,
then I could see it being a problem or if the records were out of order.  But
since this is just and audit log, all it means is someone has to unscramble the
data. (on the working one, I've never seen the data scrambled and it includes
about 2k worth of data)

3) Apache signals
4) program not flushing
These two might be the case but I do set the $| to flush and it doesn't
happen with perl 5.003.  It still could be a race condition.  And it doesn't
rule out those issues working differently in the new perl io abstration layer.

Thanks for everyones help.

-tim