[Boulder.pm] Streaming File Copies

Justin Crawford Justin.Crawford at cusys.edu
Wed Jun 4 16:04:59 CDT 2003


Howdy-

We've got a few enormous databases to copy from one server to another every night of the year.  I've written a Perl script to do this, whose logic goes like so:

1. Get a list of all the datafiles in a specified directory, and also get their sizes.
2. Randomize the list of filenames.  Go through the list one file at a time and fork a child to copy that file UNLESS we already have 3 children copying.
3. Collect dead children (ugh) and verify their return codes.  Also verify that the source and destination files are the same size.
4. If the return code or size is not right, put the file back in the "to be copied" pile.  If after 3 tries it's still not right, put the file in the "this file failed" pile, copy as many others as possible, then exit nonzero.
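
For illustration, the four steps might be sketched roughly like this.  This is a minimal sketch, not the actual script: it assumes source and destination are both visible as local paths (say, over NFS) and uses File::Copy for the copy itself, and the names copy_all, size_of, $MAX_KIDS and $MAX_TRIES are made up for the example.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(copy);
use List::Util qw(shuffle);
use POSIX ();

my $MAX_KIDS  = 3;   # hypothetical name: concurrent copies allowed
my $MAX_TRIES = 3;   # hypothetical name: attempts per file before giving up

# Size in bytes, or undef if the file cannot be stat'ed.
sub size_of { my $s = (stat $_[0])[7]; return $s }

# copy_all($dest_dir, @files): returns the files that never copied cleanly.
# Assumes every path is reachable locally; a remote copy would differ.
sub copy_all {
    my ($dest_dir, @files) = @_;
    my %size = map { $_ => size_of($_) } @files;   # step 1: source sizes
    my @todo = shuffle @files;                     # step 2: randomize
    my (%tries, %kid, @failed);

    while (@todo or %kid) {
        # Fork copiers until the pile is empty or the child limit is hit.
        while (@todo and keys(%kid) < $MAX_KIDS) {
            my $file = shift @todo;
            $tries{$file}++;
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {
                # Child: copy one file, exit 0 on success and 1 on failure.
                my $ok = copy($file, "$dest_dir/" . basename($file));
                POSIX::_exit($ok ? 0 : 1);
            }
            $kid{$pid} = $file;
        }

        # Step 3: wait for one child to die, then check its work.
        my $pid = wait();
        if ($pid == -1) { push @failed, values %kid; last }
        my $status = $?;
        my $file   = delete $kid{$pid};
        next unless defined $file;
        my $dest_size = size_of("$dest_dir/" . basename($file));
        next if $status == 0
            and defined $size{$file} and defined $dest_size
            and $dest_size == $size{$file};

        # Step 4: back on the pile, or give up after $MAX_TRIES attempts.
        if ($tries{$file} < $MAX_TRIES) { push @todo, $file }
        else                            { push @failed, $file }
    }
    return @failed;
}
```

A caller would pass the destination directory and the file list, then exit nonzero if the returned list is not empty.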

All of this takes advantage of extra network bandwidth and handles network blips, reducing job time and reducing failed copies (which otherwise result in frequent, expensive retries of the job).  The "randomize" step was a kludge to keep the script from just copying all of /filesystem1 and then all of /filesystem2; it wasn't very important since both /filesystem1 and /filesystem2 were on a RAID and might have been sharing the same physical disk for all we knew.

Until today, that is.  Now we have 4 separate physical disks on both the source and destination machines, and 4 networks between them.  I have to rewrite most of the above logic to make 4 streams that are filesystem- and network-specific.  But before I dissect this script--the hairiest script I've ever written--I've got two questions:

A. Anyone done this?  Anyone seen a mod or a script that's already dealt with robust network copy streaming?  Anyone have a better idea than Perl?
B. File listing, file sizing: I'm using rsh when list/size operations must happen on remote machines.  I don't like to do that, but I don't know what else to do.  Any better ideas for listing/sizing remote files?

Thanks!

Justin Crawford
Database Administration Group
University of Colorado Management Systems
justin.crawford at cusys.edu



