APM: processing lots of files?

erik at debill.org
Mon Apr 19 15:19:30 CDT 2004


On Mon, Apr 19, 2004 at 01:52:57PM -0500, Sam Foster wrote:
> erik at debill.org wrote:
> >filesystems can be surprisingly expensive, so doing everything in a
> >single pass may be a win.
> 
> I'm using File::Find, which I think does this by default.

Ah.  I'd assumed you were running that once for each step.  As long as
you only run it once you're good.
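
Just to be concrete, "one pass" means doing all the per-file work
inside the wanted() routine, something like this (the mount point and
the quick-and-dirty key=value parsing are just placeholders for
whatever your real code does):

    use strict;
    use warnings;
    use File::Find;

    my %props;   # parsed key/value pairs, keyed by full file path

    sub wanted {
        return unless -f && /\.properties$/;
        # do every per-file step in here so the tree only gets walked once
        open(my $fh, '<', $_) or return;
        while (my $line = <$fh>) {
            chomp $line;
            next if $line =~ /^\s*[#!]/ || $line =~ /^\s*$/;  # comments, blanks
            my ($key, $val) = split /\s*=\s*/, $line, 2;
            $props{$File::Find::name}{$key} = $val if defined $val;
        }
        close $fh;
    }

    find(\&wanted, '/mnt/projects');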


> >Any chance of getting the files locally instead of via the network?
> >I'm assuming SMB, if it was NFS I might be able to suggest some mount
> >parameters to speed it up, but nothing beats a local disk.
> 
> There's about 3-4 GB of data, that is being worked on collaboratively by 
> a distributed team, so moving it isn't an option unfortunately. However, 
> it is NFS... so what you got?

I'm not sure what the exact options would be for NT, but you want to
use tcp (instead of udp, which is the default in a lot of places) and
crank the block size up.

I use tcp,rsize=16000,wsize=16000 at home.

Even larger block sizes are perfectly legit (I believe some companies
default to 64000), and bigger blocks cut down on the number of
requests needed to transfer your data (as well as on the read
requests that actually reach the physical disks).

Also, if you aren't already on an async mount, you might try that.
I'm not sure how it interacts with NFS (for all I know NFS mounts are
always async), but not waiting for your writes to complete is usually
a big throughput win.
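
Put together, the mount I'm describing would look something like this
fstab entry on a Linux client (the server name and mount point are
made up, and I don't know what the NT client calls the equivalent
knobs):

    # /etc/fstab on a Linux client, names are just examples
    fileserver:/export/projects  /mnt/projects  nfs  tcp,rsize=16000,wsize=16000,async  0  0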

> >>Simply crawling the tree and parsing each properties file is taking a 
> >>while (an hour or more). Next up I need to fix some broken references 
> >
> >30000/ 3600 = 8.3 files/sec.  Not exactly blazing, but not incredibly
> >slow either.
> 
> I just tried benchmarking one of my scripts again (I called my &find 
> from Benchmark::timeit) with a limited dataset, and got:
> 
> 72 wallclock secs ( 0.21 usr +  1.24 sys =  1.45 CPU) @  3.45/s (n=5)
> 
> which was after parsing just 160 files. 2.22 files/sec. Not so stellar 
> after all. I'll dig in to the module that's doing the parsing and see if 
> there's an obvious culprit there. (starting with the bits I wrote :)

72 wall clock and only 1.45 CPU?  Sounds like it's all IO wait.  The
good news is there's bound to be a way to make that go a lot faster :)

Does it slow down as it handles more and more files?  Is memory use
growing?  If your workstation goes into swap that would definitely
cause a slowdown.


> >parallel each on its own subset of the directories might be a win
> >(even with only a single processor to work with).
> 
> I didn't think of dividing up the directory list and simply running the 
> same script again in parallel. I'll try that. Would forking achieve the 
> same thing, or am I introducing unnecessary complexity?

You could have the script fork a set number of times right at the
beginning.  You just need a way for each process to figure out what
directories are its responsibility (even if it's "I only do odd
numbered directories").  Easy to do if your directory names are
relatively stable and predictable.  I wouldn't modify the function
that File::Find calls to fork(), since that's liable to make a fork
bomb.
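
Something like this is what I had in mind: fork up front and let each
child claim a slice of the top-level directories (the path, the
worker count, and the empty wanted() are placeholders for your real
script):

    use strict;
    use warnings;
    use File::Find;

    my $workers = 4;                                  # modest, even on one CPU
    my @topdirs = sort grep { -d } glob('/mnt/projects/*');  # dirs to divide up

    sub wanted {
        return unless -f && /\.properties$/;
        # same per-file work as the single-process version goes here
    }

    for my $n (0 .. $workers - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        next if $pid;                                 # parent loops and forks again

        # child: claim every $workers-th directory, do its share, then exit
        my @mine = @topdirs[ grep { $_ % $workers == $n } 0 .. $#topdirs ];
        find(\&wanted, @mine) if @mine;
        exit 0;
    }

    wait() for 1 .. $workers;                         # parent reaps the children

Since the bottleneck looks like IO wait rather than CPU, even three
or four children on a single processor should overlap enough network
round trips to be worth it.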


> thanks, this was a help,

Glad to help.  Just let us know how things turn out.


Erik
-- 
Humor soothes the savage cubicle monkey.
   -- J Jacques


