[Boulder.pm] Streaming File Copies
keanansmith at techie.com
Wed Jun 4 16:36:54 CDT 2003
I've done a couple of similar things. Nothing *exactly* like it, but similar enough that I can make some suggestions. :)
1. The best method for file listing/sizing depends on exactly how the machine in question gets to open the file to begin with. I find that the -s operator works on just about any file I can open (I usually use 'stat' directly, since I often want more than just the size), but that doesn't mean that's the case for you. Oh, and opendir and readdir usually work pretty effectively for file listing as well.
Both of those assume that you can somehow remotely mount the filesystem in question (or at least fake it through some network interface or another).
If that isn't supported, you might try FTP rather than RSH. It tends to put a little less overhead on the target machine (provided FTP is already supported there), although it's still messy. Depending on the FTP host on the target machine, you may be able to get a directory listing and file sizes all in one clean go, though.
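For the mounted-filesystem case, the listing/sizing step might look something like the sketch below. It is in Python rather than Perl, purely for illustration; the Perl equivalents are the opendir/readdir and -s (or stat) calls mentioned above. The function name list_files_with_sizes is made up for this example, and it assumes the remote directory is reachable as an ordinary local path.

```python
import os

def list_files_with_sizes(directory):
    """Return {filename: size_in_bytes} for regular files directly in `directory`.

    Assumes `directory` is a locally visible path (for example a remote
    filesystem that has already been mounted). Subdirectories are skipped.
    """
    sizes = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            # is_file() and stat() reuse information from the directory scan
            # where the OS provides it, so this is one pass for names and sizes.
            if entry.is_file(follow_symlinks=False):
                sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes
```

Calling list_files_with_sizes on the mounted data directory gives both the file list and the sizes in a single pass, which is the same information the rsh-based listing would have to return.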
2. If you have to go through the trouble of opening a shell on a remote machine for file listing/sizing, it's likely that machine has faster access to those files to begin with. It might be more effective to spawn a script on that machine to do the reading/writing, rather than pushing the same byte stream across distant networks more than once (i.e., once for reading and once for writing).
3. I don't know if you're doing block-based direct copying (in Perl) or using underlying system libraries to do the copy (with File::Copy or just directly). If you're doing it block by block, you can verify your write at that level instead of at the file-by-file level, meaning you catch errors sooner and have less work to do to fix them, which might be a time saver in the long run. (I know that the block-by-block copy routine I wrote in Perl was faster than the system libraries on the system I was using to copy the same files, and I did a lot of performance tests. But it was a Windows system, so that's not especially surprising.)
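A rough sketch of what block-level verification could look like follows, again in Python for illustration rather than the Perl routine described above (which isn't shown in this message). The function name, the 1 MB default block size, and the retry count are all assumptions made for the example. One caveat: reading a block back right after writing it may be answered from the OS cache rather than the disk or the network, so this catches fewer problems than a check made from the far end would.

```python
def copy_with_block_verify(src_path, dst_path, block_size=1024 * 1024, max_retries=3):
    """Copy src to dst one block at a time, re-reading each block after writing it.

    Returns the number of bytes copied. Raises IOError if a block still
    does not match after `max_retries` write attempts.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "w+b") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break  # end of source file
            offset = dst.tell()
            for _attempt in range(max_retries):
                dst.seek(offset)
                dst.write(block)
                dst.flush()
                dst.seek(offset)
                if dst.read(len(block)) == block:
                    break  # this block verified; move on to the next one
            else:
                # Only this block needed retrying, not the whole file.
                raise IOError(f"block at offset {offset} failed verification")
            copied += len(block)
    return copied
```

The point of the design is that a bad block costs one block's worth of rework, where a size check after the whole file is copied costs the whole file.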
----- Original Message -----
From: "Justin Crawford" <Justin.Crawford at cusys.edu>
Date: Wed, 4 Jun 2003 15:04:59 -0600
To: <boulder-pm at mail.pm.org>
Subject: [Boulder.pm] Streaming File Copies
> We've got a few enormous databases to copy from one server to another every night of the year. I've written a Perl script to do this, whose logic goes like so:
> 1. Get a list of all the datafiles contained by a specified directory, and also get their sizes.
> 2. Randomize the list of filenames. Go through the list one file at a time and fork a child to copy that file UNLESS we already have 3 children copying.
> 3. Collect dead children (ugh) and verify their return code. Also verify that the source and destination files are the same size.
> 4. If return code or size are not right, put the file back in the "to be copied" pile. If after 3 tries they're still not right, put the file in the "this file failed" pile, copy as many others as possible, then exit nonzero.
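For readers following along, steps 2 through 4 can be sketched roughly as below. This is not the script under discussion: it is in Python, it uses a thread pool in place of forked children, and it retries failed files in rounds instead of putting them back on the pile one at a time. The names copy_one and copy_all, and the constants, are made up for the example.

```python
import os
import random
import shutil
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 3  # "3 children copying" in step 2
MAX_TRIES = 3    # "after 3 tries" in step 4

def copy_one(src, dst):
    """Copy one file; return True only if source and destination sizes match."""
    try:
        shutil.copyfile(src, dst)
        return os.path.getsize(src) == os.path.getsize(dst)
    except OSError:
        return False

def copy_all(pairs, copy_func=copy_one):
    """Copy (src, dst) pairs, at most MAX_WORKERS at once.

    Returns the list of pairs that still failed after MAX_TRIES rounds
    (the "this file failed" pile); an empty list means everything copied.
    """
    pending = list(pairs)
    random.shuffle(pending)  # the "randomize" step
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        for _round in range(MAX_TRIES):
            results = list(pool.map(lambda p: copy_func(*p), pending))
            # Anything that failed goes back on the "to be copied" pile.
            pending = [p for p, ok in zip(pending, results) if not ok]
            if not pending:
                break
    return pending
```

A caller would exit nonzero when copy_all returns a non-empty list, matching the last part of step 4.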
> What all of this does is take advantage of extra network bandwidth and handle network blips, reducing job time and reducing failed copies (which otherwise result in frequent, expensive retries of the job). The "randomize" step was a kludge, to keep the script from just copying all of /filesystem1 and then all of /filesystem2; it wasn't very important since both /filesystem1 and /filesystem2 were on a RAID and might have been sharing the same physical disk for all we knew.
> Until today, that is. Now we have 4 separate physical disks on both source and destination machine, and 4 networks between. I have to rewrite most of the above logic to make 4 streams that are filesystem- and network-specific. But before I dissect this script--the hairiest script I've ever written--I've got two questions:
> A. Anyone done this? Anyone seen a mod or a script that's already dealt with robust network copy streaming? Anyone have a better idea than Perl?
> B. File listing, file sizing: I'm using rsh when list/size operations must happen on remote machines. I don't like to do that, but I don't know what else to do. Any better ideas for listing/sizing remote files?
> Justin Crawford
> Database Administration Group
> University of Colorado Management Systems
> justin.crawford at cusys.edu