APM: processing lots of files?
Sam Foster
austin.pm at sam-i-am.com
Mon Apr 19 13:52:57 CDT 2004
erik at debill.org wrote:
> I'd start by processing each directory completely before moving on to
> the next one, if at all possible. Directory lookups on network
> filesystems can be surprisingly expensive, so doing everything in a
> single pass may be a win.
I'm using File::Find, which I think does this by default.
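For context, the crawl itself is just a File::Find call along these lines
(the root path, the filename pattern, and parse_properties() are
placeholders, not the real names):

  use strict;
  use warnings;
  use File::Find;

  # wanted() is called for every entry under the root, one directory's
  # worth of entries at a time.
  find(\&wanted, '/path/to/project/root');     # placeholder path

  sub wanted {
      return unless -f && /\.properties$/;     # only the properties files
      # parse_properties($File::Find::name);   # placeholder for the real parser
  }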
> Any chance of getting the files locally instead of via the network?
> I'm assuming SMB, if it was NFS I might be able to suggest some mount
> parameters to speed it up, but nothing beats a local disk.
There's about 3-4 GB of data that's being worked on collaboratively by a
distributed team, so moving it isn't an option, unfortunately. However,
it is NFS... so what have you got?
>>There's about 30,000 individual properties files (and a cross-references
>>file in the same kind of format) - one for each directory.
>
> How deeply does this structure go? Some filesystems get bogged down
> when there are 1000s of files in a single directory. If all of these
> 30k directories are within a single parent directory just getting a
> list of them could be a serious slowdown. On Linux I try to avoid
> having more than a few hundred files in a directory if at all possible.
I have only 5-10 files in each directory. I'm using the pre-processing
that File::Find offers to visit only the positive matches (FWIW).
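In case it's useful to anyone, the preprocess hook looks roughly like
this (names and pattern are placeholders again):

  use strict;
  use warnings;
  use File::Find;

  find({
      # preprocess gets the raw list of names in the current directory and
      # returns a trimmed list; subdirectories have to be kept so the walk
      # continues. find() has already chdir'd into the directory, so -d $_ works.
      preprocess => sub { grep { -d $_ || /\.properties$/ } @_ },
      wanted     => sub {
          return unless -f;
          # parse_properties($File::Find::name);   # placeholder
      },
  }, '/path/to/project/root');                     # placeholder path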
>>Simply crawling the tree and parsing each properties file is taking a
>>while (an hour or more). Next up I need to fix some broken references
>
> 30000/ 3600 = 8.3 files/sec. Not exactly blazing, but not incredibly
> slow either.
I just tried benchmarking one of my scripts again (I called my &find
from Benchmark::timeit) with a limited dataset, and got:
72 wallclock secs ( 0.21 usr + 1.24 sys = 1.45 CPU) @ 3.45/s (n=5)
which was after parsing just 160 files. 2.22 files/sec. Not so stellar
after all. I'll dig into the module that's doing the parsing and see if
there's an obvious culprit there (starting with the bits I wrote :)
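For the record, the timing is just Benchmark's timeit()/timestr();
roughly this, with crawl_tree() standing in for the wrapper around my
&find and the dataset path a placeholder:

  use strict;
  use warnings;
  use Benchmark qw(timeit timestr);

  sub crawl_tree { }    # stub; the real File::Find crawl + parse goes here

  # run the limited-dataset crawl 5 times and print the usual summary line
  my $t = timeit(5, sub { crawl_tree('/path/to/limited/dataset') });
  print timestr($t), "\n";
  # prints something like:
  # 72 wallclock secs ( 0.21 usr + 1.24 sys = 1.45 CPU) @ 3.45/s (n=5)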
> I suspect your script is
> spending a fair amount of time waiting for data. Running 2 copies in
> parallel each on its own subset of the directories might be a win
> (even with only a single processor to work with).
I didn't think of dividing up the directory list and simply running the
same script again in parallel. I'll try that. Would forking achieve the
same thing, or am I introducing unnecessary complexity?
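i.e., would something like this do it? (a rough sketch only - the glob
pattern is a placeholder and process_dirs() stands in for the existing
crawl-and-parse code):

  use strict;
  use warnings;

  my @dirs    = glob('/path/to/project/root/*');   # placeholder top-level dirs
  my $mid     = int(@dirs / 2);
  my @subsets = ([@dirs[0 .. $mid - 1]], [@dirs[$mid .. $#dirs]]);

  my @pids;
  for my $subset (@subsets) {
      my $pid = fork();
      die "fork failed: $!" unless defined $pid;
      if ($pid == 0) {              # child: crawl its half of the tree
          process_dirs(@$subset);
          exit 0;
      }
      push @pids, $pid;             # parent: remember the child pids
  }
  waitpid($_, 0) for @pids;         # wait for both halves to finish

  sub process_dirs { }              # stub; the existing crawl-and-parse code goes here

Either way the idea seems to be the same: keep one crawl busy while the
other is blocked waiting on NFS.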
thanks, this was a help,
Sam