[LA.pm] perl CGI querying of directory filenames most efficient method?

Benjamin J. Tilly ben_tilly at operamail.com
Fri Sep 23 12:34:54 PDT 2005


"Peter Benjamin" <pete at peterbenjamin.com> wrote:
> 
> I hope I've written an email that can be understood.
> My advice is to read it all the way through before
> replying, as it is a complex "overall" efficiency
> question, involving not just the perl CGI code,
> but also the web server needing the same directory.
[...]

I read through, and here is one "big picture" kind of
optimization.  Squeezing the maximum out of a single
machine is not as good an optimization strategy as
making it possible to run multiple machines at once
behind a load balancer.  Your entire described design
smells of a problem that is forcing you to rely on a
single machine, and if that machine fails, then what?

A common strategy for this kind of problem is to store
information about what files you have in a database,
and then use something like
http://search.cpan.org/~jesus/Spread-3.17.3-1.07/Spread.pm
to mirror those files on all of your webservers.
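
A rough sketch of the database half of that, using DBI (the
table and column names here are made up for illustration):

    use strict;
    use warnings;
    use DBI;

    # Hypothetical schema: CREATE TABLE images (filename VARCHAR(255))
    my $dbh = DBI->connect('dbi:mysql:database=site', 'user', 'password',
                           { RaiseError => 1 });

    # Record a newly uploaded image so any webserver can look it up.
    sub record_image {
        my ($name) = @_;
        $dbh->do('INSERT INTO images (filename) VALUES (?)', undef, $name);
    }

    # Answer "do we have this image?" without touching the images/ directory.
    sub image_exists {
        my ($name) = @_;
        my ($count) = $dbh->selectrow_array(
            'SELECT COUNT(*) FROM images WHERE filename = ?', undef, $name);
        return $count;
    }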

> Would it be faster to get the entire directory listing of
> 12,000 images (and growing) with this type of statement:

Red flag alert!

What kind of filesystem do you have?

Many filesystems, eg ext3, store directory contents as a
linked list.  That means that when the directory gets
large, the process of finding specific files slows down.

If this is becoming a performance problem (running some
benchmarks is the best way to tell whether it is), then I
would suggest either switching to a filesystem that
handles lots of files in one directory (eg reiserfs) or
else using a nested directory structure.  A common first
strategy is to store foo.jpg as f/foo.jpg.  (You can
easily get more sophisticated, of course.)
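
Here is a rough sketch of that first strategy (the helper name
is mine, not anything from your code):

    use strict;
    use warnings;
    use File::Path qw(mkpath);

    # Map foo.jpg to images/f/foo.jpg so no single directory
    # grows without bound.
    sub nested_path {
        my ($base, $name) = @_;
        my $bucket = lc substr($name, 0, 1);   # first character picks the subdir
        mkpath("$base/$bucket") unless -d "$base/$bucket";
        return "$base/$bucket/$name";
    }

    my $path = nested_path('images', 'foo.jpg');   # "images/f/foo.jpg"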

From past experience I wouldn't worry about this too
much with a few thousand files, but I'd worry about it
big time if I expected this to grow to a few hundred
thousand files.

Of course, if you use the multiple-webserver strategy that
I described above, you can also just buy another
webserver...

[...]
> Maybe a readdir would be even faster?  Remember that
> File Caching may make these questions immaterial, as
> the web server needs to access the same images/ folder
> as well, and that needs to be taken into consideration
> for the overall efficiency, which I did not address
> in the questions below.

If a readdir processed in Perl is faster than several
separate -e checks, then you definitely have room to
optimize your filesystem!  (Note: It is not always
worth doing all possible optimizations.)
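
To make that concrete, here is a sketch of reading the
directory once and answering every existence question from a
hash (the directory and filenames are placeholders):

    use strict;
    use warnings;

    # One readdir, then O(1) hash lookups instead of a separate
    # stat for every -e test.
    opendir my $dh, 'images' or die "Cannot open images/: $!";
    my %have = map { $_ => 1 } readdir $dh;
    closedir $dh;

    print "have foo.jpg\n" if $have{'foo.jpg'};
    print "have bar.jpg\n" if $have{'bar.jpg'};   # no further filesystem calls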

> Questions:
> 
> Does the Unix File Cache also cache directory
> information?  I imagine it does, so this CGI script would
> not have to access the hard drive to get the list of files,
> but just get it from RAM, for each and every -e test,
> or for the foreach, or readdir methods.

Of course if you really have a CGI script, then there are
more obvious optimizations that you still have left.
Namely, mod_perl. :-)
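
The classic setup is Apache::Registry under mod_perl 1.x,
which keeps the Perl interpreter and your compiled script in
memory between requests.  Something like this in httpd.conf
(paths are illustrative):

    PerlModule Apache::Registry
    Alias /perl/ /var/www/perl/
    <Location /perl>
        SetHandler perl-script
        PerlHandler Apache::Registry
        Options ExecCGI
        PerlSendHeader On
    </Location>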

BTW the answer to your question is that it does, but the
filesystem still has to do real work on that cached data to
find specific files.

> Would the -e test have to go to the hard drive each time?
> Or would the entire directory 'file' be in the File Cache
> as well in that case?  Which would mean both methods are
> efficient, and either would do fine.
> 
> I might do comparison testing in a loop, but then that
> would not simulate a CGI script, unless the script was
> invoked by an outer loop coded in a shell script to
> create new PIDs each time (avoid using the directory
> contents that would be "buffered" in File IO cache,
> or even in the perl buffers).
> 
> What comparison testing method would you use?

I would use a loop, because you want to isolate a single
performance factor.  I would also suggest accessing the
whole directory once before benchmarking to avoid I/O
differences between runs.
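
Something along these lines is what I have in mind, using
Benchmark.pm and warming the cache first (the directory and
filenames are placeholders):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $dir    = 'images';
    my @wanted = ('foo.jpg', 'bar.jpg', 'baz.jpg');

    # Warm the cache so the first technique measured is not
    # penalized by cold I/O.
    opendir my $warm, $dir or die $!;
    my @all = readdir $warm;
    closedir $warm;

    cmpthese(-5, {                    # run each for at least 5 CPU seconds
        dash_e  => sub {
            my $n = grep { -e "$dir/$_" } @wanted;
        },
        readdir => sub {
            opendir my $dh, $dir or die $!;
            my %have = map { $_ => 1 } readdir $dh;
            closedir $dh;
            my $n = grep { $have{$_} } @wanted;
        },
    });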

> Right now it is "cheaper" programmer time wise to do -e
> testing than add a database field to store the values,
> which one might imagine would be faster, but even then
> the directory contents need to be accessed in order
> to web serve the image, and having the directory
> contents in File Cache due to the web server using
> it means the database field might not be any faster
> than either the if-elsif-else or foreach methods.

Databases are not magic.  They store data just like a
filesystem has to, and they add extra complexity on
top.  A filesystem should generally be faster at very
straightforward tasks.  This should be a straightforward
task.

(OK, filesystems are optimized for specific kinds of
access.  If your filesystem is not optimized for what
you're doing, while a database is, then the database can
be faster.  For instance if your filesystem does not
handle well having many files in one directory, and your
database has the right index, your database should be
faster.)
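
To make "the right index" concrete: with the hypothetical
images table from above, it is just an index on the filename
column, e.g.

    # One-time setup via DBI; a unique index keeps the existence
    # lookup fast even with hundreds of thousands of rows.
    $dbh->do('CREATE UNIQUE INDEX idx_filename ON images (filename)');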

Cheers,
Ben

