[Chicago-talk] array to hash and counting files

Tue Dec 30 17:16:41 CST 2003

-- Jeremy Hubble <jhubble at core.com>

> I have a series of directories that each have a number of
> subdirectories with the same directory structure.  I need to count the
> number of files and directories in each subdirectory, and get a list of
> all unique files.
>
> Here is the code fragment I have (not tested yet):
>
> Is there a more efficient way to:
> 1) Extract the unique list of files?
> 2) Use perl to replace the find command?

Any time you feel the word "unique" percolating through your
brain, think "hash". If the hash key is the first-level subdir
then ++$hash{$subdir} will count the items by subdir; to
break the count out by file type use ++$hash{$subdir}{$type}
instead.
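
As a minimal sketch of that idiom, assuming a hypothetical list of
[ subdir, type, basename ] records in place of a real directory walk
(the names alpha, beta, and so on are made up for illustration):

```perl
use strict;
use warnings;

# hypothetical sample data: [ subdir, type, basename ]
my @items =
(
	[ qw( alpha file README   ) ],
	[ qw( alpha dir  src      ) ],
	[ qw( beta  file README   ) ],
	[ qw( beta  file Makefile ) ],
);

my %count  = ();
my %unique = ();

for my $item ( @items )
{
	my ( $subdir, $type, $name ) = @$item;

	++$count{$subdir}{$type};	# inner hash springs into existence on first use
	++$unique{$name};		# repeated names collapse onto a single key
}

my @names = sort keys %unique;

print "$_\n" for @names;
```

The values in %unique end up as the number of times each name was
seen, which comes for free with the ++ and is often handy later.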

	use File::Find qw( &finddepth );
	use File::Basename qw( &basename );

	# name is immaterial, could also be an anonymous
	# sub defined on the finddepth call line.
	#
	# File::Find is kind enough to define $File::Find::dir
	# as the directory and chdir to it. the current file's
	# basename is in $_ (i.e. -e, -d, etc, work as expected).
	#
	# the handler could also use stat to get the file type
	# and store that but outputting the results in human-
	# usable form gets a bit hairy. for now assume that
	# anything is either a file or directory.
	#
	# making countz a referent saves us from $countz{$subdir}->{type}
	# notation (I find it easier to read the -> toward the front).

	my $subdir = '';
	my %unique = ();
	my $countz = {};

	sub handler
	{
		if( -d )
		{
			++$countz->{$subdir}{dirz};
		}
		else
		{
			++$countz->{$subdir}{filz};
		}

		++$unique{$_};
	}

	# iterate over all the directory items in the base directory
	# processing each item through the handler.

	for( grep { -d } glob "$basedir/*" )
	{
		$subdir = basename $_;

		finddepth \&handler, $_;
	}

	# at this point keys %$countz are the subdirz of $basedir,
	# and the values are counts of files and directories by
	# subdir. %unique is keyed by basename of whatever was found
	# in the subdirs.
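
For anyone who wants to try this end to end, here is a rough,
self-contained sketch of the same approach. It assumes a throwaway
tree built with File::Temp stands in for the real $basedir; the file
names (alpha/src/main.c and friends) are hypothetical. It also skips
the starting directory itself, which File::Find hands to the handler
as '.', so that the subdir is not counted as one of its own entries.

```perl
use strict;
use warnings;

use File::Find     qw( finddepth );
use File::Basename qw( basename dirname );
use File::Path     qw( make_path );
use File::Temp     qw( tempdir );

# hypothetical scratch tree standing in for the real $basedir.

my $basedir = tempdir( CLEANUP => 1 );

for my $path ( qw( alpha/src/main.c alpha/README beta/README ) )
{
	make_path( dirname "$basedir/$path" );

	open my $fh, '>', "$basedir/$path" or die "$path: $!";
	close $fh;
}

my $subdir = '';
my %unique = ();
my $countz = {};

sub handler
{
	# the starting directory shows up as '.', skip it.
	return if $_ eq '.';

	my $type = -d $_ ? 'dirz' : 'filz';

	++$countz->{$subdir}{$type};
	++$unique{$_};
}

for my $top ( grep { -d } glob "$basedir/*" )
{
	$subdir = basename $top;

	finddepth \&handler, $top;
}

# report: one line per subdir, then the unique basenames.

for my $name ( sort keys %$countz )
{
	my $c = $countz->{$name};

	printf "%-8s %d dirs, %d files\n",
		$name, $c->{dirz} || 0, $c->{filz} || 0;
}

print "$_\n" for sort keys %unique;
```

Whether the subdir itself should count as a directory is a judgment
call; drop the "return if" line to get the handler exactly as posted.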

The one variation you might want is to find the relative paths
within the subdirectories (i.e., keys of %unique are full paths
relative to $basedir/$subdir). In this case use the fact that
$File::Find::name (like $File::Find::dir) is a relative path when
the input directory is a relative path itself:

	sub handler
	{
		if( -d )
		{
			++$countz->{$subdir}{dirz};
		}
		else
		{
			++$countz->{$subdir}{filz};
		}

		++$unique{$File::Find::name};
	}

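	# $basedir needs to be an absolute path here: once the
	# first chdir succeeds a relative one no longer resolves.

	for( grep { -d } glob "$basedir/*" )
	{
		chdir $_ or die "chdir $_: $!";

		$subdir = basename $_;

		finddepth \&handler, '.';
	}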

The combination of chdir and finddepth w/ '.' will leave all
the $File::Find::name entries as relative paths:

  DB<1> finddepth sub { print $File::Find::name, "\n" }, '.'
  ./output
  ./CVS/Root
  ./CVS/Repository
  ./CVS/Entries
  ./CVS/Entries.Log
  ./CVS
  ./CVSROOT/checkoutlist
  ./CVSROOT/commitinfo
  ./CVSROOT/config
  ./CVSROOT/cvswrappers
  ./CVSROOT/editinfo
  <snip>

At this point the keys of %unique will be paths relative
to the various subdirs, which will give you a unique list
of the files within the general file tree.
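
A small sketch of the relative-path variant, under the same kind of
assumptions as before: the tree is a hypothetical scratch directory
from File::Temp (which normally returns an absolute path, so the
repeated chdir calls keep working), and the '.' entry for each
starting directory is skipped. Restoring the original working
directory at the end is my addition, not part of the code above.

```perl
use strict;
use warnings;

use Cwd            qw( getcwd );
use File::Find     qw( finddepth );
use File::Basename qw( dirname );
use File::Path     qw( make_path );
use File::Temp     qw( tempdir );

my $start   = getcwd;
my $basedir = tempdir( CLEANUP => 1 );	# hypothetical stand-in tree

for my $path ( qw( alpha/src/main.c alpha/README beta/README ) )
{
	make_path( dirname "$basedir/$path" );

	open my $fh, '>', "$basedir/$path" or die "$path: $!";
	close $fh;
}

my %unique = ();

for my $top ( grep { -d } glob "$basedir/*" )
{
	chdir $top or die "chdir $top: $!";

	# searching '.' keeps $File::Find::name relative to $top.

	finddepth
	(
		sub
		{
			return if $_ eq '.';

			++$unique{$File::Find::name};
		},
		'.'
	);
}

# go back where we started so the scratch tree can be cleaned up.

chdir $start or die "chdir $start: $!";

my @paths = sort keys %unique;

print "$_\n" for @paths;
```

Here alpha/README and beta/README both land on the key ./README, so
a value greater than one marks a path that exists in several subdirs.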

--
Steven Lembark                               2930 W. Palmer
Workhorse Computing                       Chicago, IL 60647
                                            +1 888 359 3508