SPUG: Directory Fun

Thu Mar 17 11:30:45 PST 2005

I mostly work on backroom stuff, so I don't often have to worry much
about performance.  We do use a lot of network-mounted directories,
though, and fairly large ones at that, so I decided to write a script to
compare some common (?) ways of going through a directory looking for
filenames that match a specific pattern.

I was kind of shocked at what I found (see end of script below).  Now 
keep in mind that I'm running on Windows 2000 and all of the directories 
are network mounts (as assigned drive letters).  Still, I would have 
thought that File::Find would have worked a touch better than it did. 
And the way the file glob works is odd, to say the least.

So I enter this into the group mind to see what response I get.  I don't 
mind finding out that some of my tests are badly written.  Learning is 
good.  I'm curious if other people find similar results or different.

I'm on my home Linux box right now, trying to duplicate the test.  But 
it's all local directories and they aren't the same size, so it's 
difficult to see any results at all.  Plus the glob isn't matching 
anything.  But I'm really more interested in Windows since that's where 
I'm earning my money right now.

###################################################################
###################################################################

use     strict;
use     warnings;

use     File::Find;
use     IO::Dir;

# You'll need to make up your own list of directories here.
#   Mine contain a total of 20,000+ files and are all network mounts.
#   You can avoid File::Spec by entering them all by hand.

use     File::Spec;

my  @dirs = map {
     File::Spec->catdir('S:\DA\Feds', $_, 'done')
} qw(Fed1 Fed2 Fed3 Fed4 Fed5 Fed6 Fed7 Fed8 Fed9 Fed10 Fed11);

# These are the patterns used to sieve through the files.
#   Your patterns will be dependent on your source directories.
#   The first pattern is a regular expression to be applied
#   to each filename.  The second is for the 'glob' code.
#   It's worth playing with these to see if different patterns
#   make a difference in the times for the various directories.

my  $ptn = qr(^02)i;
my  $glb = '02*';

#y  $ptn = qr(\.doc$)i;
#y  $glb = '*.doc';

#y  $ptn = qr(\.pdf$)i;
#y  $glb = '*.pdf';

# This set of patterns counts all files:
#y  $ptn = qr(.)i;
#y  $glb = '*';

###################################################################

sub duration
{
     my  $durn = shift;

     return '{unknown}'
         unless defined $durn;

     my  $secs = $durn % 60;  $durn = int($durn / 60);
     my  $mins = $durn % 60;  $durn = int($durn / 60);
     my  $hour = $durn % 24;  $durn = int($durn / 24);

     $durn ? sprintf('%d %02d:%02d:%02d', $durn, $hour, $mins, $secs) :
     $hour ? sprintf(     '%d:%02d:%02d',        $hour, $mins, $secs)
           : sprintf(          '%d:%02d',               $mins, $secs)
}

###################################################################

sub test ($&@)
{
     my  $name = shift;

     print $name;

     my  $func = shift;
     my  $cnt  = 0;
     my  $secs = undef;

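     eval {
         my  $strt = time;

         # Use a lexical loop variable so a callback that assigns to
         #   the global $_ cannot clobber the entries in @dirs.
         for my $dir (@dirs) {
             $cnt += $func->($dir);
         }

         $secs = time - $strt;
     };

     # Report a failed test instead of silently swallowing the error.
     warn $@ if $@;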

     printf " found %7d file%s in %s\n",
            $cnt, ($cnt == 1 ? '' : 's'), duration($secs);
}

###################################################################
#
# Main program consists of four tests:
#

# This is the winner for me despite coming first.
#   I figure any OS caching of directory data would
#   apply to later tests.

test 'scan', sub {
     my  $dir = shift;

     opendir my $dh, $dir
         or die "Unable to open directory '$dir' for reading:\n  $!\n";

     my  $cnt = 0;

     while (defined(my $item = readdir($dh))) {
         $cnt++
             if $item =~ $ptn;
     }

     closedir $dh;

     $cnt
};

# This is a close second.  Same order of magnitude,
#   not enough difference to care about.

test 'IOdr', sub {
     my  $dir = shift;
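     my  $hdl = IO::Dir->new($dir);

     die "Unable to open directory '$dir' for reading:\n  $!\n"
         unless $hdl;

     my  $cnt = 0;

     # A method call gets no implicit defined() test in a while
     #   condition, so test explicitly; otherwise a file named '0'
     #   would end the loop early.
     while (defined(my $item = $hdl->read)) {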
         $cnt++
             if $item =~ $ptn;
     }

     undef $hdl;

     $cnt
};

# This is the worst performer.
#   Two orders of magnitude above the best.
#   I would have thought this would have been somewhat ok.
#   Note that I'm not really using it for its intended purpose,
#   since I'm only scanning single directories with it.

test 'find', sub {
     my  $dir = shift;
     my  $cnt = 0;

     find sub {
         $cnt++
             if $_ =~ $ptn;
     }, $dir;

     $cnt
};

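# The performance of this one depends on how many items match!
#   It varies from comparable to 'scan' to comparable to 'find'.
#   A bare while (<$dir/$glb>) assigns each result to the global $_.
#   If the caller is looping over @dirs with $_ (which aliases the
#   array elements), the undef that ends the glob loop overwrites
#   the directory name and trashes the list.  A lexical loop
#   variable avoids that.

test 'glob', sub {
     my  $dir = shift;
     my  $cnt = 0;

     while (defined(my $file = glob("$dir/$glb"))) {
         $cnt++;
     }

     $cnt
};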

###################################################################

__END__
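
A note on the 'glob' test trashing the directory list: the likely cause
is that a bare while (<...>) assigns to the global $_, and 'for @list'
makes $_ an alias for the current array element.  The following is a
minimal sketch of that effect, not part of the benchmark.  The names
clobber, careful, @list and @safe are made up for illustration, and the
pattern 'no-such-file-*' is assumed to match nothing in the current
directory.

```perl
use strict;
use warnings;

my  @list = qw(one two three);

# A bare while (<...>) assigns to the global $_.  Inside 'for @list',
#   $_ is an alias for the current element, so the undef that ends
#   the glob loop is written straight into the array.
sub clobber
{
    while (<no-such-file-*>) { }
    1
}

clobber() for @list;

my  $trashed = grep { !defined } @list;

# A lexical loop variable leaves $_ (and the array) alone.
my  @safe = qw(one two three);

sub careful
{
    while (defined(my $file = glob('no-such-file-*'))) { }
    1
}

careful() for @safe;

my  $intact = grep { defined } @safe;

printf "clobbered %d of %d, kept %d of %d\n",
       $trashed, scalar @list, $intact, scalar @safe;
```

Either fix works on its own: localize or avoid $_ in the callback, or
loop over the directories with a lexical variable in the caller.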

Some results for ActiveState 5.8 on Windows 2000
     (with network-mounted drives):

   Looking for all files:
     scan found   20046 files in 0:01
     IOdr found   20046 files in 0:01
     find found   20035 files in 2:19
     glob found   20024 files in 2:17
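
The three different totals above are probably not a bug.  With eleven
directories, 'scan' and 'IOdr' count '.' and '..' in each one (22
extra), 'find' counts each starting directory once as '.' (11 extra),
and a '*' glob skips dot entries altogether.  Here is a small sketch
that reproduces the pattern on a throwaway directory.  The file names
are made up, and it assumes the temporary path contains no whitespace
(core glob splits its argument on spaces).

```perl
use strict;
use warnings;

use File::Find;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical test directory: five empty files, no subdirectories.
my  $dir = tempdir(CLEANUP => 1);

for my $name (qw(a.pdf b.pdf c.doc 02x 02y)) {
    my  $path = File::Spec->catfile($dir, $name);
    open my $fh, '>', $path or die "Unable to create $path: $!\n";
    close $fh;
}

# readdir returns '.' and '..' along with the real entries.
opendir my $dh, $dir or die "Unable to open $dir: $!\n";
my  $scan = grep { /./ } readdir $dh;
closedir $dh;

# File::Find calls the wanted sub once for the starting directory
#   itself (with $_ set to '.') and then once per entry below it.
my  $find = 0;
find sub { $find++ if /./ }, $dir;

# glob('*') skips names beginning with a dot.
my  $glob = () = glob("$dir/*");

printf "scan %d, find %d, glob %d\n", $scan, $find, $glob;
```

The 'scan' count should come out two higher than the 'glob' count, and
the 'find' count one higher.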

   Looking for .pdf files:
     scan found   17430 files in 0:00
     IOdr found   17430 files in 0:01
     find found   17430 files in 2:04
     glob found   17430 files in 1:46

   Looking for files beginning with '02':
     scan found    4305 files in 0:01
     IOdr found    4305 files in 0:01
     find found    4305 files in 2:06
     glob found    4305 files in 0:27

   Looking for .doc files:
     scan found      24 files in 0:01
     IOdr found      24 files in 0:01
     find found      24 files in 2:01
     glob found      24 files in 0:01
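
The 0:00 and 0:01 figures above are at the limit of what the built-in
time function can resolve, since it counts whole seconds.  To separate
'scan' from 'IOdr' one could time with Time::HiRes instead.  A rough
sketch follows; the timed helper and the sample file names are invented
for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Importing time from Time::HiRes replaces the built-in with a
#   version that returns fractional seconds.
use Time::HiRes qw(time);

# Hypothetical helper: run a code ref in scalar context and return
#   its result together with the elapsed time in seconds.
sub timed
{
    my  $func = shift;
    my  $strt = time;
    my  $rslt = $func->(@_);

    return ($rslt, time - $strt);
}

# Stand-in for a directory scan: count names beginning with '02'.
my  $count_02 = sub {
    my  $n = grep { /^02/ } @_;
    $n
};

my  ($cnt, $secs) = timed($count_02, qw(0201.pdf 0202.doc 0301.pdf));

printf "found %d in %.3f seconds\n", $cnt, $secs;
```

The duration sub in the script would need a fractional format to show
anything finer than whole seconds.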