APM: processing lots of files?

Sam Foster austin.pm at sam-i-am.com
Mon Apr 19 13:52:57 CDT 2004


erik at debill.org wrote:
> I'd start by processing each directory completely before moving on to
> the next one, if at all possible.  Directory lookups on network
> filesystems can be surprisingly expensive, so doing everything in a
> single pass may be a win.

I'm using File::Find, which I think does this by default.
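
(For reference, the crawl is basically just this - the path and the parse
sub are placeholders for the real ones:)

    use File::Find;

    # find() works depth-first and hands the callback every entry in one
    # directory before moving on to the next, so each directory is handled
    # in a single pass.
    find(
        sub {
            return unless /\.properties$/;        # made-up match for the properties files
            parse_properties($File::Find::name);  # parse_properties() is a stand-in
        },
        '/path/to/tree'                           # placeholder root
    );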

> Any chance of getting the files locally instead of via the network?
> I'm assuming SMB, if it was NFS I might be able to suggest some mount
> parameters to speed it up, but nothing beats a local disk.

There's about 3-4 GB of data that's being worked on collaboratively by a 
distributed team, so moving it isn't an option, unfortunately. However, 
it is NFS... so what have you got?
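
(Is it rsize/wsize and attribute-cache tuning you're thinking of? Our
mount is pretty much stock, so I'm guessing at something like the
made-up fstab entry below, but I'd want your take:)

    # hypothetical /etc/fstab line - server, export and values invented
    fileserver:/export/data  /mnt/data  nfs  hard,intr,rsize=32768,wsize=32768,actimeo=60  0 0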

>>There's about 30,000 individual properties files (and a cross-references 
>>file in the same kind of format) - one for each directory.
> 
> How deeply does this structure go?  Some filesystems get bogged down
> when there are 1000s of files in a single directory.  If all of these
> 30k directories are within a single parent directory just getting a
> list of them could be a serious slowdown.  On Linux I try to avoid
> having more than a few hundred files in a directory if at all possible.

I have only 5-10 files in each directory. I'm using the preprocess hook 
that File::Find offers to visit only the positive matches (FWIW).
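
(Concretely, that's the options-hash form of the same find() call, with a
preprocess hook - the pattern is still a placeholder:)

    find(
        {
            # preprocess gets the whole directory listing at once; returning
            # a filtered list means File::Find never visits the rest.
            preprocess => sub {
                return grep { -d $_ || /\.properties$/ } @_;
            },
            wanted => sub {
                return unless /\.properties$/;
                parse_properties($File::Find::name);
            },
        },
        '/path/to/tree'
    );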

>>Simply crawling the tree and parsing each properties file is taking a 
>>while (an hour or more). Next up I need to fix some broken references 
> 
> 30000/ 3600 = 8.3 files/sec.  Not exactly blazing, but not incredibly
> slow either.

I just tried benchmarking one of my scripts again (I called my &find 
from Benchmark::timeit) with a limited dataset, and got:

72 wallclock secs ( 0.21 usr +  1.24 sys =  1.45 CPU) @  3.45/s (n=5)

which was after parsing just 160 files. 2.22 files/sec. Not so stellar 
after all. I'll dig into the module that's doing the parsing and see if 
there's an obvious culprit there. (starting with the bits I wrote :)
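
(For the record, the timing itself was nothing fancier than this - the
count and the crawl() name are made up:)

    use Benchmark qw(timeit timestr);

    # timeit() runs the sub COUNT times and returns a Benchmark object;
    # timestr() produces the "72 wallclock secs ..." line quoted above.
    my $t = timeit(5, sub { crawl('/path/to/subset') });
    print timestr($t), "\n";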

> I suspect your script is
> spending a fair amount of time waiting for data.  Running 2 copies in
> parallel each on its own subset of the directories might be a win
> (even with only a single processor to work with).

I hadn't thought of dividing up the directory list and simply running the 
same script in parallel. I'll try that. Would forking achieve the 
same thing, or am I introducing unnecessary complexity?
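
In case it helps to be concrete, the forked version I'm picturing would be
something like this (the directory lists and crawl() are placeholders for
the real pieces):

    # Each child gets its own slice of the top-level directories and runs
    # the same crawl; the parent just waits for them all to finish.
    my @subsets = (
        [ '/data/tree/a', '/data/tree/b' ],
        [ '/data/tree/c', '/data/tree/d' ],
    );

    my @kids;
    for my $dirs (@subsets) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            crawl(@$dirs);   # stand-in for the existing File::Find/parse code
            exit 0;
        }
        push @kids, $pid;
    }
    waitpid($_, 0) for @kids;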

thanks, this was a help,

Sam



