APM: processing lots of files?
Wayne Walker
wwalker at bybent.com
Fri Apr 23 20:38:33 CDT 2004
First, if you have the local disk space, you should mirror the
data and then parse it locally. Walking directories on a network file
system is slow. Rsync will let you mirror it once (SLOW), then mirror it
again (much faster) as often as needed.
What is the maximum number of files/directories in any one directory? This
has a large impact on performance, especially on networked disks.
What is the size of the whole directory tree (in MB)?
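Both questions are quick to answer from the command line. A sketch, run
here against a throwaway demo tree (substitute the real tree's path):

```shell
# Throwaway demo tree standing in for the real data:
mkdir -p /tmp/tree_demo/a/b
touch /tmp/tree_demo/a/one.txt /tmp/tree_demo/a/two.txt /tmp/tree_demo/a/b/three.txt

# Entry count per directory, largest first -- the top line shows the
# worst-case directory:
find /tmp/tree_demo -type d | while read -r d; do
  printf '%s %s\n' "$(ls -1 "$d" | wc -l)" "$d"
done | sort -rn | head

# Total size of the tree in MB:
du -sm /tmp/tree_demo
```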
On Mon, Apr 19, 2004 at 09:45:13AM -0500, Sam Foster wrote:
> (preface: my perl is fairly poor, perhaps fair on a good day. These are
> the kind of tasks I originally learnt perl for, but the sheer volume of
> the data is challenging me)
>
> I'm currently working with a fairly large set of data that consists of a
> deep filesystem directory structure, each directory having a
> (java-style) properties text file, along with miscellaneous directory
> contents. In addition there's an xml file for each that is our final
> output for delivery to the client.
>
> I've got some data clean-up to do, verification, reporting, and
> validation of the output against a schema. Lots of tree-crawling and
> text file parsing in other words. I'm in need of some performance tips.
>
> There's about 30,000 individual properties files (and a cross-references
> file in the same kind of format) - one for each directory.
> Simply crawling the tree and parsing each properties file is taking a
> while (an hour or more). Next up I need to fix some broken references
> (the xrefs file contains references like so: relatedLinks =
> [@/some/path/, @/someother/path] .)
> After that I'll need to verify and validate some xml output. Again, one
> file per directory.
>
> This data is on the local network, I'm working on a win2k box, having
> mapped a network drive. My machine is running ActiveState Perl 5.8, with
> 1 GB of RAM and a (single) 1600 MHz Pentium processor.
>
> I've done a little benchmarking on parts of individual scripts, but I
> need an order of magnitude speed increase, not shaving microseconds off
> here and there. Any thoughts?
>
> I can attach a sample script if list protocol allows.
>
> thanks,
> Sam
> _______________________________________________
> Austin mailing list
> Austin at mail.pm.org
> http://mail.pm.org/mailman/listinfo/austin
--
Wayne Walker
wwalker at bybent.com Do you use Linux?!
http://www.bybent.com Get Counted! http://counter.li.org/
Perl - http://www.perl.org/ Perl User Groups - http://www.pm.org/
Jabber IM: wwalker at jabber.phototropia.org AIM: lwwalkerbybent