APM: processing lots of files?

Sam Foster austin.pm at sam-i-am.com
Mon Apr 19 09:45:13 CDT 2004


(Preface: my Perl is fairly poor, perhaps fair on a good day. These are 
the kinds of tasks I originally learnt Perl for, but the sheer volume of 
the data is challenging me.)

I'm currently working with a fairly large set of data: a deep filesystem 
directory structure in which each directory has a (java-style) properties 
text file along with miscellaneous other contents. In addition, each 
directory has an XML file that is our final output for delivery to the 
client.
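
To give a feel for it, the crawl-and-parse step is roughly along these 
lines. This is a stripped-down sketch: the drive root, the properties 
file name, and the File::Find/parsing details here are just placeholders 
for what the real script does.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my %props_by_dir;    # directory path => hashref of key/value pairs

    # Minimal java-style properties parser: "key = value" lines,
    # '#' and '!' comment lines skipped.
    sub parse_properties {
        my ($file) = @_;
        my %props;
        open my $fh, '<', $file or die "can't open $file: $!";
        while (my $line = <$fh>) {
            next if $line =~ /^\s*[#!]/ || $line =~ /^\s*$/;
            my ($key, $value) = $line =~ /^\s*([^=:]+?)\s*[=:]\s*(.*?)\s*$/;
            $props{$key} = $value if defined $key;
        }
        close $fh;
        return \%props;
    }

    # Walk the tree; every directory is expected to hold one properties
    # file. 'data.properties' and the drive root are made-up names.
    find(sub {
        return unless $_ eq 'data.properties';
        $props_by_dir{$File::Find::dir} = parse_properties($File::Find::name);
    }, 'X:/data');

    printf "parsed %d directories\n", scalar keys %props_by_dir;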

I've got some data clean-up to do, plus verification, reporting, and 
validation of the output against a schema. Lots of tree-crawling and 
text-file parsing, in other words. I'm in need of some performance tips.

There are about 30,000 individual properties files (plus a 
cross-references file in the same kind of format), one for each directory.
Simply crawling the tree and parsing each properties file is taking a 
while (an hour or more). Next up I need to fix some broken references; 
the xrefs file contains references like this:

    relatedLinks = [@/some/path/, @/someother/path]
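
The fix-up I have in mind is roughly the following sketch. The xrefs 
file name, the data root, and the report-only behaviour are placeholders; 
it only flags references rather than repairing them.

    use strict;
    use warnings;

    my $data_root = 'X:/data';    # same placeholder root as above

    # Pull the @/...-style paths out of relatedLinks lines and report
    # any that don't point at an existing directory under the data root.
    sub check_related_links {
        my ($xrefs_file) = @_;
        open my $fh, '<', $xrefs_file or die "can't open $xrefs_file: $!";
        while (my $line = <$fh>) {
            next unless $line =~ /^\s*relatedLinks\s*=\s*\[(.*)\]/;
            for my $ref (split /\s*,\s*/, $1) {
                $ref =~ s/^\s*\@//;      # strip the leading @
                $ref =~ s{/\s*$}{};      # and any trailing slash
                print "broken reference in $xrefs_file: $ref\n"
                    unless -d "$data_root$ref";
            }
        }
        close $fh;
    }

    check_related_links('X:/data/some_dir/xrefs.properties');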
After that I'll need to verify and validate the XML output. Again, one 
file per directory.
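
For the validation step I'm assuming something like XML::LibXML's schema 
support will do (if it builds cleanly on Win32); this is just a sketch, 
and the schema and file names are invented.

    use strict;
    use warnings;
    use XML::LibXML;

    my $parser = XML::LibXML->new;
    my $schema = XML::LibXML::Schema->new(location => 'delivery.xsd');

    # Returns undef if the file parses and validates, an error string otherwise.
    sub validate_output {
        my ($xml_file) = @_;
        my $doc = eval { $parser->parse_file($xml_file) };
        return "parse error in $xml_file: $@" if $@;
        eval { $schema->validate($doc) };    # validate() dies on failure
        return $@ ? "schema error in $xml_file: $@" : undef;
    }

    my $err = validate_output('X:/data/some_dir/output.xml');
    print defined $err ? $err : "ok\n";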

This data is on the local network; I'm working on a Win2k box, having 
mapped a network drive. My machine is running ActiveState Perl 5.8, with 
1GB RAM and a (single) 1600 MHz Pentium processor.

I've done a little benchmarking on parts of individual scripts, but I 
need an order-of-magnitude speed increase, not shaving microseconds off 
here and there. Any thoughts?
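
For reference, the benchmarking so far has just been timing individual 
pieces in isolation, something like this Benchmark.pm sketch (the sub 
names come from the snippets above and the iteration count is arbitrary):

    use strict;
    use warnings;
    use Benchmark qw(timethese);

    # parse_properties and check_related_links are the subs from the
    # sketches above; 1000 iterations is an arbitrary count.
    timethese(1000, {
        parse => sub { parse_properties('X:/data/some_dir/data.properties') },
        xrefs => sub { check_related_links('X:/data/some_dir/xrefs.properties') },
    });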

I can attach a sample script if list protocol allows.

thanks,
Sam


