[Pdx-pm] corruption puzzlement

Kyle Hayes kyle at silverbeach.net
Thu Aug 7 19:08:40 CDT 2003


On Thursday 07 August 2003 16:03, Michael Rasmussen wrote:
> Focused question here.  (snicker about the code behind my
> back ok?)
>
> We had text files arriving via scp with Unix style EOL
> characters that would eventually be used by Windows people.
> Had to convert the line endings in the files.  So I created
> a pair of scripts to handle the task ( unix2dos not available
> on the system)
>
> check2convert runs continuously, sleeping for two minutes and
> then checking if there are new files in the directory to muss
> with, if so 2dos is called for each file.
>
> There is a group of files that arrives about 2:00am.  This
> last Monday one of them showed up with 0 size.  Normally this
> file (a large one) takes about 40 seconds to transfer between
> sites.

I've seen this when something temporarily hangs SCP just at the wrong time.  
The action to create the file goes fine, then something burps on the network 
and no data is actually put into the file for a few seconds.  If your program 
runs at just that time, it'll see a zero byte file.  

File creation is a different action from putting data into the file.  Just 
because the file is there does not mean that the data is there yet.  If 
you've got a Linux system and active disks, it is even possible for the data 
to land before the directory entry is stitched up (depends on the filesystem).
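
One cheap check along those lines (a rough sketch of my own, not part of your 
scripts -- the five-second pause is a made-up fudge factor you'd tune to your 
transfer speeds): stat the file twice and only treat it as settled once it is 
non-empty and has stopped growing.

#!/usr/bin/perl
# Sketch only: treat a file as "settled" when its size is non-zero and has
# stopped changing.  The 5-second pause is an arbitrary fudge factor, not
# anything from the original scripts.
use strict;
use warnings;

sub file_settled {
        my ($path) = @_;
        my $size1 = (stat $path)[7];
        return 0 unless defined $size1;   # can't even stat it yet
        sleep 5;
        my $size2 = (stat $path)[7];
        return 0 unless defined $size2;
        return ($size1 > 0 && $size1 == $size2);
}

# e.g. skip files that scp is still filling in
foreach my $f (glob "*.txt *.csv") {
        system "./2dos", $f if file_settled($f);
}

It isn't bulletproof -- a stalled transfer looks exactly like a finished one 
-- which is why I'd still prefer a sentinel; more on that below.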

> I did some munging (eliminating the sleep and the file time stamp
> comparison)  to try and duplicate the truncation.  This
> raised two questions:
>
>   1) since the transfer takes 40 seconds and I loop every 120 seconds
> I'd expect to see 2dos trash the file every once in a while. This hasn't
> happened.  Huh??

Possibly luck?  Heisenbugs generally work that way.

>   2) No matter what I did I couldn't replicate the truncate the file to 0
> bytes behavior.

If it is a timing issue as I mentioned above, it might be pretty hard to 
duplicate.  I've only seen it a few times and I've got stuff that copies 
thousands of files daily that's been running for years.  We worked around it 
by using sentinels at the end of the data and checking for file size.

> Huh??  Is this pair of quickies potentially responsible for the 0 byte
> file we received earlier this week?  Any ideas on why 2dos doesn't trash
> about 1 in 3 of the incoming files where the transfer time would overlap
> with the loop invocation?
>
> ################# Start of check2convert ############
> #!/usr/bin/perl
>
> while (1) {
>         $mtime_ref = (stat (".timestamp"))[9];
>         $now = time;
>         utime $now, $now, ".timestamp";
>
>         @dir = `ls *.txt *.csv`;
>
>         foreach $f (@dir) {
>                 chomp $f;
>                 $mtime_cmp = (stat ($f))[9];
>                 if ( ($mtime_cmp > $mtime_ref) && -f $f )  {
>                         $cmd = "./2dos $f";
>                         system $cmd ;
>                 }
>         }
> sleep 120;
> } # while(1)

Change the program so that it waits until a file is at least 60 seconds old 
(if the longest transfer takes 40 seconds, give yourself some fudge factor).  
Your current "window" runs roughly from now back to 120 seconds ago.  You 
want to move that window back in time:

(cheeseball code warning!):

while (1) {
        $mtime_ref = (stat (".timestamp"))[9];
        $now = time - 60;  # shift our window back 60 seconds
        utime $now, $now, ".timestamp";  # time stamp in the past.

        @dir = `ls *.txt *.csv`;

        foreach $f (@dir) {
                chomp $f;
                $mtime_cmp = (stat ($f))[9];

                # the file must have shown up in a roughly two-minute window
                # that ends one minute ago (the previous, already-shifted
                # timestamp marks the start of the window).  This gives the
                # file time to "settle" (for all the data to be written).
                if ( ($mtime_cmp > $mtime_ref) &&
                     ($mtime_cmp <= $now) && -f $f )  {
                        $cmd = "./2dos $f";
                        system $cmd ;
                }
        }
        sleep 120;
} # while(1)


Also note that if you can stat the file, it is probably there, so the -f test 
may be redundant.  Your guarding if statement can still result in _missing_ a 
file altogether, though.  You have a race condition.  On a fast 
machine/network, it could happen.

Here's the scenario:

1) at time 42, your program comes out of the sleep and starts running.  It 
tags the timestamp file.

2) you get the directory listing into @dir, but it's still time 42.  Fast 
disk, directory in cache, whatever.  If your program runs a lot, you will 
have stuff in the d-cache on Linux (probably in some similar cache on most 
OSes except maybe Win 9x).

3) a remote SCP drops another file into the directory quickly.  The mtime for 
the file is still time 42.

But, remember that you got the directory listing in step 2.  If all the steps 
1-3 take less than a second, then you could miss the file dropped in step 3.  
The next time around the loop, you'll skip the new file because it has the 
same mtime as the timestamp file.

I generally process files into different directories.  The raw files land in 
one directory, and I move them to another directory after processing.  This 
means that only files that need processing are in the input directory.  
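
A rough sketch of that layout (the incoming/ and done/ directory names here 
are invented for illustration; they aren't part of your setup):

#!/usr/bin/perl
# Sketch of the two-directory approach: scp drops files into incoming/,
# we convert them and then move them to done/.  Directory names are
# hypothetical -- adjust to taste.
use strict;
use warnings;
use File::Copy qw(move);

my $incoming = "incoming";
my $done     = "done";

foreach my $f (glob "$incoming/*.txt $incoming/*.csv") {
        next unless -f $f;
        if (system("./2dos", $f) != 0) {
                warn "2dos failed on $f\n";
                next;
        }
        move($f, $done) or warn "could not move $f to $done/: $!\n";
}

The nice side effect is that the timestamp bookkeeping disappears: anything 
sitting in the input directory needs processing, full stop.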

The problem is actually a bit worse than it seems.  Depending on the 
filesystem used, you may see that the file is created and the data inserted 
into it _before_ the directory entry is created along with the mtime.  Thus, 
it is possible to have the file start being created before time 42, but 
finish and show up in step 3 above.  I've seen up to five second delays on 
heavily loaded Linux systems running ext2 filesystems.  Ext3 and Reiser 
running in journalling mode could actually have this problem worse than ext2.  
The WinNT filesystem can get really weird this way.  On a heavily loaded 
system, I timed a file taking more than 30 seconds to show up in a directory 
after a copy operation said it was complete.

Is there some sort of sentinel that you can look for at the end of the file?  
If the file is pretty big, just having a fudge factor delay isn't really a 
solution.  It might alleviate the problem, but it won't solve it.
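
If the sender can be changed to append a known last line, the check is cheap.  
This is just a sketch; the "END-OF-DATA" marker is made up, and your sending 
side doesn't write one today:

#!/usr/bin/perl
# Sketch: only convert a file once its last line is a known sentinel.
# "END-OF-DATA" is hypothetical -- the sending side would have to be
# changed to append it.
use strict;
use warnings;

sub has_sentinel {
        my ($path) = @_;
        open my $fh, "<", $path or return 0;
        my $last;
        $last = $_ while <$fh>;        # for huge files, seek near the end instead
        close $fh;
        return 0 unless defined $last;
        $last =~ s/\r?\n\z//;          # tolerate either line ending
        return $last eq "END-OF-DATA";
}

foreach my $f (glob "*.txt *.csv") {
        system "./2dos", $f if has_sentinel($f);
}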

> ################## end of check2convert #############
>
> ################## start of 2dos ####################
> #!/usr/bin/perl -i
>
> # slurp in a file and make it have dos line endings
> # be nice if I could do the test non destructively
> # open close open???
>
> $eol = "\r\n";
>
> $line = <>;
>
> if ($line =~ /\r\n/) {   $/ = $eol; }
>
> chomp $line;
> print "$line$eol";
>
> while(<>) {
>         chomp;
>         print "$_$eol";
> }
> ############# end of 2dos #############################

Erm, where does the output go?  Are these programs sanitized to protect the 
innocent?

Best,
Kyle



