[Chicago-talk] neat little optimization

Steven Lembark lembark at wrkhors.com
Thu Oct 14 11:01:19 CDT 2004



-- JT Smith <jt at plainblack.com>

> Ok, everybody is talking this up, but there must be a down side. Is
> Memoize storing all of these cached objects in memory, or out to disk?
> Let's say I wanted to use it in a mod_perl app, on an class that has 2
> million object instances that may be called. If this is persisted in
> memory, I can see that eating through memory very quickly.
>
> Can anybody speak to the do's and don'ts of this thing?

memoize bascially looks like:

	$cache{@argz} = &$stored_sub( @argz )
		unless exists $cache{@argz};

	$cache{@argz}

The process works by storing a ref to the existing sub then
replacing it in the symbol table with a closure that checks
a cache (hash) and calls the sub if it can't find the args
in its cache.

This is useful when the time for stringify+hash lookup is
noticably less than the subroutine call (think join-from-hell
in a database, networked lookup of static data via URI).

It has all of the downsides of caching -- mainly that you
may be taking time to store values that are never re-used.

There is also the classic memory vs. speed issue: by
taking up memory to store the results you may get better
speed. Then again, you may not.

Main downside is memory (often less of an issue these days).
You also have issues with how to memoize data strucutres that
are heavily nested or that must return different values even
for the same input values (e.g., the same query might not
return identical values every time it is called).

As more people use OO with nested data structures, memoizing
gets harder and harder. Main catch there is that a generic
caching function has no good way of knowing which part of the
nestted data structure is relevant to the call being wrapped.
At that point you can waste huge amounts of memory separately
caching identical values whose arguments involved modification
of a data structure that didn't affect the return value.

As an example, say you have a hash object with "foo" and "bar"
keys. One of your calls depends only on the value of $obj->{foo}
but the value of $obj->{bar} changes between memoize calls.
Net result is that you end up with identical return values
cached againsed varous values of {bar}. If bar were, say, the
unix time of the request start then you could end up caching
a large number of unusable values.

Memoizing also gets hairy when the number of arguments gets
large or is difficulty to stringify into a hash key. At that
point the overhead of memoizing all of the arguments all of
the time may outweigh the performance savings of not calling
the "expeisive" function.

So, you'd want to use it where the called function uses a
few, simply-structured arguments, is called frequently with
the same arguments, must return the same value for its
arguments, and cannot [easily] be modified to cache its own
results. Querys of static data are a good example of were
this works well: database calls are expensive and you will
call $sth->( @blah ) with simple enough arguments that they
can be easily strigified into a hash key for less overhead
than the datbase call.

-- 
Steven Lembark                                       85-09 90th Street
Workhorse Computing                                Woodhaven, NY 11421
lembark at wrkhors.com                                     1 888 359 3508


More information about the Chicago-talk mailing list