[Pdx-pm] Measuring memory consumption of scalars
Marvin Humphrey
marvin at rectangular.com
Thu Jul 21 07:32:20 PDT 2005
Greets,
When I presented Sort::External to the group a couple months ago, a
number of people expressed an interest in seeing the buffer-flush
triggered by memory consumption rather than the number of items in
the cache. I'd like to explore what it would take to add this feature.
In order to calculate the memory consumed by an array of scalars, I
need to know the memory consumed by each. Right now,
Sort::External's feed() method just adds stuff to a buffer and
flushes if the size of the buffer array (as evaluated in a scalar
context ;) ) meets or exceeds the -cache_size.
    sub feed {
        my $self = shift;
        push @{ $self->{input_cache} }, @_;
        return unless @{ $self->{input_cache} } >= $self->{-cache_size};
        $self->_write_input_cache_to_tempfile;
    }
Conceptually, the most straightforward way to figure memory
consumption is to take the length() of each item, tack on a worst-
case number for the overhead of a scalar, and add that to a tally
which triggers a buffer-flush when it grows past a threshold. There
are at least five problems with this approach, but I believe they can
all be addressed.
1) Scalars which don't as yet have a string component (PV, for
Pointer Value, in the underlying representation) will be needlessly
forced to acquire one. Note that when these items are regenerated by
reading back from temp-files, they are going to have a PV component
anyway.
2) I'm not sure what the worst-case number for the administrative
overhead of a scalar is.
3) Perl seems to assign memory in unpredictably sized chunks. It may
be that the last 50-byte scalar causes Perl to grab another meg of
memory.
4) length() counts characters rather than bytes, so on utf8 strings it
is fairly expensive and understates the bytes actually held.
5) length() doesn't tell you about hidden portions of the PV. (For
the sake of avoiding extra malloc calls, Perl will sometimes move the
pointer which indicates where a string begins. I'm not sure exactly
when this happens, but I'd guess it's on stuff like substr($foo, 0,
4, ''); )
Any performance hits we take are mitigated by the fact that
Sort::External is probably going to be doing a bunch of disk i/o at
some point.
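The length()-plus-overhead tally described above might look roughly
like the following. This is only a sketch: SCALAR_OVERHEAD is a
made-up placeholder rather than a measured figure, and new_buffer(),
flush_cache(), and the mem_used / mem_threshold keys are hypothetical
stand-ins, not part of Sort::External.

```perl
use strict;
use warnings;

# Assumed worst-case per-scalar overhead in bytes. A placeholder only;
# the real figure is exactly what this post is asking about.
use constant SCALAR_OVERHEAD => 64;

# Hypothetical stand-in for the object: a plain hash holding the
# buffer, the running byte tally, and the flush threshold.
sub new_buffer {
    my ($mem_threshold) = @_;
    return {
        input_cache   => [],
        mem_used      => 0,
        mem_threshold => $mem_threshold,
        flushes       => 0,
    };
}

# Add items to the buffer, tallying an estimate for each one, and
# flush once the tally meets or exceeds the threshold.
sub feed {
    my $self = shift;
    for my $item (@_) {
        push @{ $self->{input_cache} }, $item;
        $self->{mem_used} += length($item) + SCALAR_OVERHEAD;
    }
    return unless $self->{mem_used} >= $self->{mem_threshold};
    flush_cache($self);
}

# Stand-in for _write_input_cache_to_tempfile: empty the buffer and
# reset the tally.
sub flush_cache {
    my $self = shift;
    $self->{input_cache} = [];
    $self->{mem_used}    = 0;
    $self->{flushes}++;
}
```

The tally is reset on every flush, so the threshold applies to each
buffer-load rather than to the lifetime of the object.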
The first step is to improve the efficiency and reliability of
figuring out the memory consumed by the PV component of the scalar.
If we use the bytes pragma, I believe that the efficiency of length()
improves dramatically, though I haven't tracked down where it is in
the Perl source code so I can't say that authoritatively. But we're
still stuck with the problem of hidden pieces of strings.
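To illustrate the difference between character and byte counts, here
is a small sketch. It calls bytes::length() directly, which the bytes
pragma provides, instead of switching length() itself with "use
bytes"; the sample string is just an example.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without changing length() here

# A string holding a character above 0xFF, which makes Perl store the
# whole string internally as utf8.
my $string = "caf\x{e9}\x{263A}";

my $chars  = length($string);           # counts characters
my $octets = bytes::length($string);    # counts bytes in the buffer

print "characters: $chars, bytes: $octets\n";
```

For this string the two counts differ (5 characters, 8 bytes), and
only the byte count says anything about memory.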
Fortunately, at the C level, SvLEN [1] solves most of the problems
associated with length(). It won't add a PV component to the scalar
if none exists. It returns the allocated size of the string buffer,
including the extra null byte that Perl tacks onto the end, rather
than the length of the string stored in it. One caveat: per perlapi,
it excludes any part attributable to SvOOK, so a hidden leading
portion of a string is still not counted. I believe that it is
actually returning a number that is stored as part of the C struct
for a scalar, so it should be nice and fast.
Looks like I'll have to turn feed() into an Inline::C or XS sub with
considerably more than 4 lines of code, unless someone knows a way to
expose SvLEN in Perl space.
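One possible way to reach the same number from Perl space is the core
B module, whose B::PV objects have a LEN method. The sketch below
assumes that route; pv_buffer_size() is a hypothetical helper name,
and going through B is likely slower than a direct SvLEN call in XS.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: return the allocated size of a scalar's string
# buffer, or 0 if the scalar has no PV component. Inspecting the
# scalar through B does not stringify it.
sub pv_buffer_size {
    my ($scalar_ref) = @_;
    my $sv = B::svref_2object($scalar_ref);
    return 0 unless $sv->can('LEN');    # e.g. a plain integer (B::IV)
    return $sv->LEN;
}

my $string = 'x' x 100;
my $number = 42;

printf "string: %d bytes allocated\n", pv_buffer_size(\$string);
printf "number: %d bytes allocated\n", pv_buffer_size(\$number);
```

The figure for the string should be at least 101 (100 characters plus
the trailing null byte), though the exact value depends on how Perl
rounded the allocation.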
That leaves me with two remaining problems.
What is the worst-case administrative overhead for a scalar? Anybody
know?
Then there's the approximate nature of memory management in Perl.
I'll solve this by documenting Sort::External's -mem_threshold as an
imprecise tool. ;)
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
[1] From perlapi: "SvLEN -- Returns the size of the string buffer in
the SV, not including any part attributable to SvOOK. See SvCUR."