[Pdx-pm] corruption puzzlement
Kyle Hayes
kyle at silverbeach.net
Thu Aug 7 19:08:40 CDT 2003
On Thursday 07 August 2003 16:03, Michael Rasmussen wrote:
> Focused question here. (snigger about the code behind my
> back ok?)
>
> We had text files arriving via scp with Unix style EOL
> characters that would eventually be used by Windows people.
> Had to convert the line endings in the files. So I created
> a pair of scripts to handle the task ( unix2dos not available
> on the system)
>
> check2convert runs continuously, sleeping for two minutes and
> then checking if there are new files in the directory to muss
> with, if so 2dos is called for each file.
>
> There is a group of files that arrives about 2:00am. This
> last Monday one of them showed up with 0 size. Normally this
> file (a large one) takes about 40 seconds to transfer between
> sites.
I've seen this when something temporarily hangs SCP just at the wrong time.
The action to create the file goes fine, then something burps on the network
and no data is actually put into the file for a few seconds. If your program
runs at just that time, it'll see a zero byte file.
File creation is a different action from putting data into the file. Just
because the file is there does not mean that the data is there yet. If
you've got a Linux system and active disks, it is possible to get the data
without having the directory stitched up yet too (depends on the filesystem).
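One common workaround for that gap between "file exists" and "data is all there" is to stat the file twice and only treat it as complete once the size stops changing. A sketch, not anything from the scripts above; the sub name and the two-second window are my own inventions:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a "settle" check: a file whose size is still changing is
# probably mid-transfer. The two-second wait is an arbitrary choice.
sub size_is_stable {
    my ($file, $wait) = @_;
    defined $wait or $wait = 2;        # seconds between the two stats
    my $size1 = (stat $file)[7];
    return 0 unless defined $size1;    # vanished or never existed
    sleep $wait;
    my $size2 = (stat $file)[7];
    return defined $size2 && $size1 == $size2 && $size1 > 0;
}
```

Note this only shrinks the window rather than closing it: a long enough network stall can still make the size look stable while the transfer is incomplete.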
> I did some munging (eliminating the sleep and the file time stamp
> comparison) to try and duplicate the truncation. This
> raised two questions:
>
> 1) since the transfer takes 40 seconds and I loop every 120 seconds
> I'd expect to see 2dos trash the file every once in a while. This hasn't
> happened. Huh??
Possibly luck? Heisenbugs generally work that way.
> 2) No matter what I did I couldn't replicate the truncate-the-file-to-0-bytes
> behavior.
If it is a timing issue as I mentioned above, it might be pretty hard to
duplicate. I've only seen it a few times and I've got stuff that copies
thousands of files daily that's been running for years. We worked around it
by using sentinels at the end of the data and checking for file size.
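The sentinel idea can be sketched like this. The marker string and the sub name are made up for illustration; it assumes the sender appends an agreed-upon last line once the copy is finished:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Only hand a file to 2dos once its last line is the sender's
# end-of-data marker. "__END_OF_DATA__" is an invented convention;
# use whatever the sending side actually appends.
sub transfer_complete {
    my ($file) = @_;
    open my $fh, '<', $file or return 0;
    my $last;
    $last = $_ while <$fh>;
    close $fh;
    return defined $last && $last =~ /^__END_OF_DATA__\r?\n?$/;
}
```

Unlike a size or mtime heuristic, this actually proves the tail of the data arrived, at the cost of agreeing on a convention with whoever produces the files.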
> Huh?? Is this pair of quickies potentially responsible for the 0 byte
> file we received earlier this week? Any ideas on why 2dos doesn't trash
> about 1 in 3 of the incoming files where the transfer time would overlap
> with the loop invocation?
>
> ################# Start of check2convert ############
> #!/usr/bin/perl
>
> while (1) {
> $mtime_ref = (stat (".timestamp"))[9];
> $now = time;
> utime $now, $now, ".timestamp";
>
> @dir = `ls *.txt *.csv`;
>
> foreach $f (@dir) {
> chomp $f;
> $mtime_cmp = (stat ($f))[9];
> if ( ($mtime_cmp > $mtime_ref) && -f $f ) {
> $cmd = "./2dos $f";
> system $cmd ;
> }
> }
> sleep 120;
> } # while(1)
Change the program so that you wait until the file is at least 60 seconds
old (if the longest file takes 40 seconds, give yourself some fudge factor).
Your current "window" runs roughly from 120 seconds ago up to now. You want
to move the window back in time:
(cheeseball code warning!):
while (1) {
$mtime_ref = (stat (".timestamp"))[9];
$now = time - 60; # shift our window back 60 seconds
utime $now, $now, ".timestamp"; # time stamp in the past.
@dir = `ls *.txt *.csv`;
foreach $f (@dir) {
chomp $f;
$mtime_cmp = (stat ($f))[9];
# file must have shown up in a roughly two minute window
# ending one minute ago and extending two minutes before that.
# this gives the file time to "settle" (for all the data to be written).
if ( ($mtime_cmp > $mtime_ref) &&
($mtime_cmp <= $now) && -f $f ) {
$cmd = "./2dos $f";
system $cmd ;
}
}
sleep 120;
} # while(1)
Also note that if you can stat the file, it is probably there, so the -f may
be redundant. Your guarding if statement can still result in _missing_ a file
altogether. You have a race condition. On a fast machine/network, it could
happen.
Here's the scenario:
1) at time 42, your program comes out of the sleep and starts running. It
tags the timestamp file.
2) you get the directory listing into @dir, but it's still time 42. Fast
disk, directory in cache, whatever. If your program runs a lot, you will
have stuff in the d-cache on Linux (probably in some similar cache on most
OSes except maybe Win 9x).
3) a remote SCP drops another file into the directory quickly. The mtime for
the file is still time 42.
But, remember that you got the directory listing in step 2. If all the steps
1-3 take less than a second, then you could miss the file dropped in step 3.
The next time around the loop, you'll skip the new file because it has the
same mtime as the timestamp file.
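One way to close that equal-mtime hole is to stop comparing times at all and instead remember which names have already been handed to 2dos. A sketch; the sub name is mine, and this assumes a file is never re-uploaded under the same name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Membership test instead of a time comparison: a file that lands with
# the same mtime as .timestamp can no longer slip through, because the
# question is "have I seen this name?", not "when did it arrive?".
sub pick_new_files {
    my ($seen, @candidates) = @_;
    my @fresh = grep { !$seen->{$_} } @candidates;
    $seen->{$_} = 1 for @fresh;
    return @fresh;
}

# In the main loop it would be used roughly like:
#   my %seen;
#   while (1) {
#       system "./2dos", $_ for pick_new_files(\%seen, glob "*.txt *.csv");
#       sleep 120;
#   }
```

The trade-off is that %seen grows forever and a re-sent file with an old name is silently skipped, which is part of why moving processed files out of the way (below, as you describe) is usually the cleaner fix.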
I generally process files into different directories. The raw files land in
one directory, and I move them to another directory after processing. This
means that only files that need processing are in the input directory.
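A minimal sketch of that layout; the directory names, sub name, and error handling are placeholders, and File::Copy's move() renames the file into the destination directory:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(move);

# Convert everything in the input directory, then move each file out,
# so the next sweep sees only genuinely new arrivals.
sub sweep_and_convert {
    my ($incoming, $done, $convert) = @_;
    for my $f (glob "$incoming/*.txt $incoming/*.csv") {
        $convert->($f);                          # e.g. sub { system "./2dos", $_[0] }
        move($f, $done) or warn "move $f: $!";
    }
}
```

With this arrangement the timestamp file, the mtime comparisons, and the race they carry all disappear: presence in the input directory *is* the "needs processing" flag.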
The problem is actually a bit worse than it seems. Depending on the
filesystem used, you may see that the file is created and the data inserted
into it _before_ the directory entry is created along with the mtime. Thus,
it is possible to have the file start being created before time 42, but
finish and show up in step 3 above. I've seen up to five second delays on
heavily loaded Linux systems running ext2 filesystems. Ext3 and Reiser
running in journalling mode could actually have this problem worse than ext2.
The WinNT filesystem can get really weird this way. On a heavily loaded
system, I timed a file taking more than 30 seconds to show up in a directory
after a copy operation said it was complete.
Is there some sort of sentinel that you can look for at the end of the file?
If the file is pretty big, just having a fudge factor delay isn't really a
solution. It might alleviate the problem, but it won't solve it.
> ################## end of check2convert #############
>
> ################## start of 2dos ####################
> #!/usr/bin/perl -i
>
> # slurp in a file and make it have dos line endings
> # be nice if I could do the test non destructively
> # open close open???
>
> $eol = "\r\n";
>
> $line = <>;
>
> if ($line =~ /\r\n/) { $/ = $eol; }
>
> chomp $line;
> print "$line$eol";
>
> while(<>) {
> chomp;
> print "$_$eol";
> }
> ############# end of 2dos #############################
Erm, where does the output go? Are these programs sanitized to protect the
innocent?
Best,
Kyle