APM: processing lots of files?
Sam Foster
austin.pm at sam-i-am.com
Mon Apr 19 09:45:13 CDT 2004
(preface: my perl is fairly poor, perhaps fair on a good day. These are
the kind of tasks I originally learnt perl for, but the sheer volume of
the data is challenging me)
I'm currently working with a fairly large set of data that consists of a
deep filesystem directory structure, each directory containing a
(java-style) properties text file along with miscellaneous other
contents. In addition there's an xml file in each directory that is our
final output for delivery to the client.
I've got some data clean-up to do, verification, reporting, and
validation of the output against a schema. Lots of tree-crawling and
text file parsing in other words. I'm in need of some performance tips.
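For the tree-crawling part, a minimal sketch using the core File::Find module might look like the following (the file name 'item.properties' is a placeholder for whatever the real properties files are called):

```perl
use strict;
use warnings;
use File::Find;

# Return the full path of every properties file found under $root.
# 'item.properties' is a hypothetical name used for illustration.
sub collect_props {
    my ($root) = @_;
    my @found;
    find(
        sub {
            push @found, $File::Find::name if $_ eq 'item.properties';
        },
        $root
    );
    return @found;
}
```

Collecting the paths once and then iterating over the list also makes it easy to time the crawl separately from the parsing, which helps show where the hour is actually going.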
There are about 30,000 individual properties files (plus a
cross-references file in the same kind of format) - one for each directory.
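A sketch of parsing one of those files into a hash, assuming the files are simple "key = value" lines with '#' or '!' comments (continuation lines and escape sequences from the full java properties format are ignored here):

```perl
use strict;
use warnings;

# Parse a java-style properties string into a hash reference.
# Only handles plain "key = value" / "key: value" lines; comment
# lines start with '#' or '!'. Escapes and continuations are not
# handled in this sketch.
sub parse_props {
    my ($text) = @_;
    my %props;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:[#!].*)?$/;    # skip comments and blanks
        if ($line =~ /^\s*([^=:\s]+)\s*[=:]\s*(.*?)\s*$/) {
            $props{$1} = $2;
        }
    }
    return \%props;
}
```

Slurping each file whole and parsing it in memory (rather than reading line-by-line over the network share) may also matter here, since per-read latency to a mapped drive adds up quickly over 30,000 files.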
Simply crawling the tree and parsing each properties file is taking a
while (an hour or more). Next up I need to fix some broken references
(the xrefs file contains references like so: relatedLinks =
[@/some/path/, @/someother/path]).
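Assuming the relatedLinks value is a comma-separated list of @-prefixed paths inside square brackets (the exact quoting rules are a guess from the example above), extracting the paths for checking against the filesystem could be sketched like this:

```perl
use strict;
use warnings;

# Extract the @-prefixed paths from a line such as:
#   relatedLinks = [@/some/path/, @/someother/path]
# Returns the list of paths; the format details are assumed.
sub parse_related_links {
    my ($line) = @_;
    my ($list) = $line =~ /relatedLinks\s*=\s*\[([^\]]*)\]/;
    return () unless defined $list;
    my @paths;
    for my $item (split /,/, $list) {
        $item =~ s/^\s*\@?//;    # strip leading whitespace and '@'
        $item =~ s/\s+$//;
        push @paths, $item if length $item;
    }
    return @paths;
}
```

Each extracted path can then be tested with -d (or checked against the list gathered during the crawl) to find the broken references without touching the network again.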
After that I'll need to verify and validate some xml output. Again, one
file per directory.
This data is on the local network; I'm working on a win2k box, having
mapped a network drive. My machine is running ActiveState Perl 5.8, with
1GB RAM and a (single) 1600 MHz Pentium processor.
I've done a little benchmarking on parts of individual scripts, but I
need an order-of-magnitude speed increase, not to shave microseconds off
here and there. Any thoughts?
I can attach a sample script if list protocol allows.
thanks,
Sam