APM: processing lots of files?

Wayne Walker wwalker at bybent.com
Fri Apr 23 20:38:33 CDT 2004


First, if you have the local disk space, you should mirror the
data, then parse it locally.  Walking directories on a network file
system is slow.

Rsync will allow you to mirror it once (SLOW), then mirror it again
(much faster) as often as needed.
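
Assuming rsync is available on both ends (e.g. via cygwin on the win2k
box) -- the host and paths here are placeholders, not real ones:

    # first run copies everything (slow); reruns only transfer changes
    rsync -a --delete fileserver:/export/data/ /cygdrive/c/mirror/data/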

What is the maximum # of files/directories in any one directory?  This
has a large impact on performance, especially on networked disks.

What is the size of the whole directory tree (in MBytes)?
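
A throwaway File::Find script will answer both questions; this is just
a rough sketch (the Z:/data default is a placeholder for your tree
root):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my $root = shift || 'Z:/data';    # placeholder -- your tree root
    my ($total_bytes, $max_entries, $widest_dir) = (0, 0, '');

    find(sub {
        # File::Find chdirs into each directory, so $_ is relative
        $total_bytes += -s $_ if -f $_;
        if (-d $_) {
            opendir my $dh, $_ or return;
            my $n = grep { $_ ne '.' and $_ ne '..' } readdir $dh;
            closedir $dh;
            ($max_entries, $widest_dir) = ($n, $File::Find::name)
                if $n > $max_entries;
        }
    }, $root);

    printf "total size: %.1f MB\n", $total_bytes / (1024 * 1024);
    print "widest directory: $widest_dir ($max_entries entries)\n";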

On Mon, Apr 19, 2004 at 09:45:13AM -0500, Sam Foster wrote:
> (preface: my perl is fairly poor, perhaps fair on a good day. These are 
> the kind of tasks I originally learnt perl for, but the sheer volume of 
> the data is challenging me)
> 
> I'm currently working with a fairly large set of data that consists of a 
> deep filesystem directory structure, each directory having a 
> (java-style) properties text file, along with miscellaneous directory 
> contents. In addition, there's an XML file for each directory that is 
> our final output for delivery to the client.
> 
> I've got some data clean-up to do, verification, reporting, and 
> validation of the output against a schema. Lots of tree-crawling and 
> text file parsing in other words. I'm in need of some performance tips.
> 
> There are about 30,000 individual properties files (and a cross-references 
> file in the same kind of format) - one for each directory.
> Simply crawling the tree and parsing each properties file is taking a 
> while (an hour or more). Next up I need to fix some broken references 
> (the xrefs file contains references like so: relatedLinks = 
> [@/some/path/, @/someother/path]).
> After that I'll need to verify and validate some xml output. Again, one 
> file per directory.
> 
> This data is on the local network; I'm working on a win2k box, having 
> mapped a network drive. My machine is running ActiveState Perl 5.8, with 
> 1GB RAM and a (single) 1600 MHz Pentium processor.
> 
> I've done a little benchmarking on parts of individual scripts, but I 
> need an order-of-magnitude speed increase, not shaving microseconds off 
> here and there. Any thoughts?
> 
> I can attach a sample script if list protocol allows.
> 
> thanks,
> Sam
> _______________________________________________
> Austin mailing list
> Austin at mail.pm.org
> http://mail.pm.org/mailman/listinfo/austin
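
For what it's worth: once the tree is local, parsing 30,000 small
properties files should be a matter of minutes, not hours -- the
network round trips are almost certainly the bottleneck, not the
parsing.  A minimal parse might look like this (the "key = value" form
and the relatedLinks example come from Sam's mail; the file name and
everything else are guesses):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Crude java-style properties parser; assumes plain "key = value"
    # lines with no continuations or escapes.
    sub parse_props {
        my ($file) = @_;
        my %props;
        open my $fh, '<', $file or die "can't open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            next if $line =~ /^\s*(?:[#!]|$)/;   # comments, blanks
            my ($key, $val) = $line =~ /^\s*([^=:]+?)\s*[=:]\s*(.*)$/
                or next;
            $props{$key} = $val;
        }
        close $fh;
        return \%props;
    }

    # Pull the @/path references out of a value like
    # "[@/some/path/, @/someother/path]".
    sub related_links {
        my ($val) = @_;
        return $val =~ m{\@(/[^,\]\s]+)}g;
    }

    my $props = parse_props(shift || 'dir.properties');
    print "$_\n" for related_links($props->{relatedLinks} || '');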

-- 

Wayne Walker
wwalker at bybent.com                 Do you use Linux?!
http://www.bybent.com              Get Counted!  http://counter.li.org/
Perl - http://www.perl.org/        Perl User Groups - http://www.pm.org/
Jabber IM:  wwalker at jabber.phototropia.org       AIM:     lwwalkerbybent
