[Pdx-pm] Measuring memory consumption of scalars

Marvin Humphrey marvin at rectangular.com
Thu Jul 21 07:32:20 PDT 2005


Greets,

When I presented Sort::External to the group a couple months ago, a  
number of people expressed an interest in seeing the buffer-flush  
triggered by memory consumption rather than the number of items in  
the cache.  I'd like to explore what it would take to add this feature.

In order to calculate the memory consumed by an array of scalars, I  
need to know the memory consumed by each.  Right now,  
Sort::External's feed() method just adds stuff to a buffer and  
flushes if the size of the buffer array (as evaluated in a scalar  
context ;) ) meets or exceeds the -cache_size.

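sub feed {
    my $self = shift;
    # Buffer the incoming items.
    push @{ $self->{input_cache} }, @_;
    # An array in scalar context yields its item count; flush once
    # that count reaches -cache_size.
    return unless @{ $self->{input_cache} } >= $self->{-cache_size};
    $self->_write_input_cache_to_tempfile;
}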

Conceptually, the most straightforward way to figure memory  
consumption is to take the length() of each item, tack on a worst- 
case number for the overhead of a scalar, and add that to a tally  
which triggers a buffer-flush when it grows past a threshold.  There  
are at least five problems with this approach, but I believe they can  
all be addressed.

1) Scalars which don't as yet have a string component (PV, for  
Pointer Value, in the underlying representation) will be needlessly  
forced to acquire one. Note that when these items are regenerated by  
reading back from temp-files, they are going to have a PV component  
anyway.
2) I'm not sure what the worst-case number for the administrative  
overhead of a scalar is.
3) Perl seems to assign memory in unpredictably sized chunks.  It may  
be that the last 50-byte scalar causes Perl to grab another meg of  
memory.
4) length(), which counts characters rather than bytes, is fairly
expensive on strings that Perl stores internally as utf8.
5) length() doesn't tell you about hidden portions of the PV.  (For  
the sake of avoiding extra malloc calls, Perl will sometimes move the  
pointer which indicates where a string begins.  I'm not sure exactly  
when this happens, but I'd guess it's on stuff like substr($foo, 0,  
4, ''); )
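
For reference, the length-based tally described before the list might
be sketched like this.  It is only an illustration, not
Sort::External's actual code: the TallyBuffer package, the
mem_threshold and mem_tally fields, the flush callback, and above all
the 64-byte per-scalar overhead are assumptions made up for the
example.

```perl
package TallyBuffer;    # hypothetical stand-in, not Sort::External itself
use strict;
use warnings;

# Assumed placeholder for the per-scalar overhead; the real worst-case
# figure is an open question.
use constant SCALAR_OVERHEAD => 64;

sub new {
    my ( $class, %args ) = @_;
    return bless {
        input_cache   => [],
        mem_tally     => 0,
        mem_threshold => $args{mem_threshold},
        flush         => $args{flush},    # stands in for the tempfile write
    }, $class;
}

sub feed {
    my $self = shift;
    for my $item (@_) {
        # length() may give $item a string component it lacked
        # (problem 1 in the list above).
        $self->{mem_tally} += length($item) + SCALAR_OVERHEAD;
        push @{ $self->{input_cache} }, $item;
    }
    return unless $self->{mem_tally} >= $self->{mem_threshold};
    $self->{flush}->( $self->{input_cache} );
    $self->{input_cache} = [];
    $self->{mem_tally}   = 0;
    return;
}

package main;

my @flushed;    # records how many items each flush received
my $buf = TallyBuffer->new(
    mem_threshold => 200,
    flush         => sub { push @flushed, scalar @{ $_[0] } },
);

# 3 * (10 + 64) = 222 >= 200, so the third call triggers one flush.
$buf->feed( 'a' x 10 ) for 1 .. 3;
```

The tally only estimates what Perl has actually allocated, which is
exactly where problems 2, 3 and 5 come in.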

Any performance hits we take are mitigated by the fact that  
Sort::External is probably going to be doing a bunch of disk i/o at  
some point.

The first step is to improve the efficiency and reliability of  
figuring out the memory consumed by the PV component of the scalar.   
If we use the bytes pragma, I believe that the efficiency of length()  
improves dramatically, though I haven't tracked down where it is in  
the Perl source code so I can't say that authoritatively.  But we're  
still stuck with the problem of hidden pieces of strings.
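
As a small illustration of the difference (a sketch only, with
variable names invented for the example), bytes::length() reports the
number of bytes in the PV, whereas plain length() reports characters:

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without changing
                 # string semantics in this scope

# One character (U+263A) that occupies three bytes in Perl's internal
# utf8 representation.
my $smiley = "\x{263A}";

my $chars = length($smiley);           # counts characters
my $bytes = bytes::length($smiley);    # counts bytes in the PV

print "characters: $chars, bytes: $bytes\n";    # characters: 1, bytes: 3
```

For the memory tally it is the byte count that matters, since that is
what the PV actually occupies.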

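Fortunately, at the C level, SvLEN [1] solves most of the problems
associated with length().  It won't add a PV component to the scalar
if none exists.  It returns the allocated size of the string buffer,
including an extra null byte that Perl tacks onto the end, so unused
space past the end of the string gets counted.  Going by the perlapi
entry, though, the part attributable to SvOOK (the chopped-off front
of a string) is excluded, so that hidden portion would still have to
be accounted for separately.  I believe that SvLEN is actually
returning a number that is stored as part of the C struct for a
scalar, so it should be nice and fast.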

Looks like I'll have to turn feed() into an Inline::C or XS sub with  
considerably more than 4 lines of code, unless someone knows a way to  
expose SvLEN in Perl space.
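
One possibility, offered as a sketch rather than something tested
against Sort::External, is the core B module: B::svref_2object()
returns an object wrapping the SV, and for scalars that have a PV slot
that object has a LEN method reporting SvLEN.  The pv_buffer_size()
helper below is a made-up name for the example, and the cost of
building a B object per item has not been measured here.

```perl
use strict;
use warnings;
use B ();

# Return the allocated size of a scalar's string buffer (SvLEN), or 0
# if the scalar has no PV slot at all.  Takes a reference so that the
# scalar itself is never copied or stringified.
sub pv_buffer_size {
    my $ref = shift;
    my $sv  = B::svref_2object($ref);
    return $sv->can('LEN') ? $sv->LEN : 0;
}

my $string = '';
$string .= 'x' for 1 .. 100;    # built at runtime so it owns its buffer
my $number = 42;                # integer only, no string component

printf "string buffer: %d bytes\n", pv_buffer_size( \$string );
printf "number buffer: %d bytes\n", pv_buffer_size( \$number );
```

The reported size for the string will be at least 101 (100 characters
plus the trailing null) and usually somewhat more, since Perl tends to
over-allocate.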

That leaves me with two remaining problems.

What is the worst-case administrative overhead for a scalar?  Anybody  
know?

Then there's the approximate nature of memory management in Perl.   
I'll solve this by documenting Sort::External's -mem_threshold as an  
imprecise tool.  ;)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] From perlapi: "SvLEN -- Returns the size of the string buffer in  
the SV, not including any part attributable to SvOOK. See SvCUR."

