APM: processing lots of files?

erik at debill.org
Mon Apr 19 12:12:28 CDT 2004


On Mon, Apr 19, 2004 at 09:45:13AM -0500, Sam Foster wrote:
> I'm currently working with a fairly large set of data that consists of a 
> deep filesystem directory structure, each directory having a 
> (java-style) properties text file, along with miscellaneous directory 
> contents. In addition there's an xml file for each directory that is 
> our final output for delivery to the client.
> 
> I've got some data clean-up to do, verification, reporting, and 
> validation of the output against a schema. Lots of tree-crawling and 
> text file parsing in other words. I'm in need of some performance tips.

I'd start by processing each directory completely before moving on to
the next one, if at all possible.  Directory lookups on network
filesystems can be surprisingly expensive, so doing everything in a
single pass may be a win.
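
Something like this untested sketch -- read each directory exactly
once, do all of its work while you're there, then descend.
process_dir() is just a placeholder for your own parsing and
validation:

#!/usr/bin/perl
# One-pass traversal: each directory is read once, all of its files
# are handled together, then we recurse into the subdirectories.
# (Doesn't guard against symlink loops.)
use strict;
use warnings;

sub walk {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "can't open $dir: $!";
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);

    my (@files, @subdirs);
    for my $e (@entries) {
        my $path = "$dir/$e";
        if (-d $path) { push @subdirs, $path }
        else          { push @files,   $path }
    }

    process_dir($dir, \@files);   # ALL work for this directory here
    walk($_) for @subdirs;        # then move on
}

sub process_dir {
    my ($dir, $files) = @_;
    # parse properties, fix xrefs, validate xml -- all in one visit
}

walk($ARGV[0] || '.');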

Any chance of getting the files locally instead of via the network?
I'm assuming SMB, if it was NFS I might be able to suggest some mount
parameters to speed it up, but nothing beats a local disk.


> There's about 30,000 individual properties files (and a cross-references 
> file in the same kind of format) - one for each directory.

How deeply does this structure go?  Some filesystems get bogged down
when there are thousands of files in a single directory.  If all of
these 30k directories sit under a single parent, just getting a list
of them could be a serious slowdown.  On Linux I try to avoid having
more than a few hundred entries in a directory if at all possible.
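
If you want to see where the fan-out is worst, a rough census with
File::Find (in the core distribution) should do it -- the count at
the root itself is approximate, but that doesn't matter here:

use strict;
use warnings;
use File::Find;

# Count the entries seen in each directory, then print the 20 busiest.
my %count;
find(sub { $count{$File::Find::dir}++ }, $ARGV[0] || '.');

my @hot = sort { $count{$b} <=> $count{$a} } keys %count;
splice(@hot, 20) if @hot > 20;
printf "%6d  %s\n", $count{$_}, $_ for @hot;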


> Simply crawling the tree and parsing each properties file is taking a 
> while (an hour or more). Next up I need to fix some broken references 

30,000 files / 3,600 seconds is about 8.3 files/sec.  Not exactly
blazing, but not incredibly slow either.


> (the xrefs file contains references like so: relatedLinks = 
> [@/some/path/, @/someother/path] .)
> After that I'll need to verify and validate some xml output. Again, one 
> file per directory.
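
If the xrefs are really as regular as that sample, fixing them up
might start with a plain regex pass.  Untested sketch, assuming one
relatedLinks entry per line in the format you quoted:

use strict;
use warnings;

# Pull the @/... paths out of a relatedLinks line.
my $line = 'relatedLinks = [@/some/path/, @/someother/path]';
if ($line =~ /^relatedLinks\s*=\s*\[(.*)\]/) {
    my @refs = map { s/^\s*\@//; s/\s+$//; $_ } split /,/, $1;
    print "$_\n" for @refs;   # /some/path/ and /someother/path
}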

Does this mean you can't parallelize this?  I suspect your script is
spending a fair amount of time waiting for data.  Running 2 copies in
parallel, each on its own subset of the directories, might be a win
(even with only a single processor to work with).
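
Something along these lines with fork (untested; process_tree() stands
in for whatever your script already does to a subtree):

use strict;
use warnings;

# Split the top-level directories in half; each child crawls one share.
my @dirs = grep { -d } glob("$ARGV[0]/*");
die "no subdirectories under $ARGV[0]\n" unless @dirs;
my $mid  = int($#dirs / 2);
my @half = ( [ @dirs[0 .. $mid] ], [ @dirs[$mid + 1 .. $#dirs] ] );

my @kids;
for my $share (@half) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                    # child: crawl its own share
        process_tree($_) for @$share;
        exit 0;
    }
    push @kids, $pid;
}
waitpid($_, 0) for @kids;               # parent waits for both

sub process_tree {
    my ($dir) = @_;
    # your existing per-tree crawling and parsing goes here
}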


Erik

-- 
Humor soothes the savage cubicle monkey.
   -- J Jacques


