[kw-pm] Pseudo Hashes (Was Re: slides)

Fri Jun 17 07:36:11 PDT 2005

OK, so I'll repeat the pseudohashes thingy here:

Problem statement: You have an object as a hashref, but there's a 
performance penalty for looking up values in a hash (compute the hash, 
lookup the entry in the hash buckets, including cascading into overflow 
buckets if required).

Now you can have your objects as blessed arrays no problem:

package Person;
use constant NAME => 0;
use constant AGE => 1;
use constant SEX => 2;

sub new {
     bless [], shift;
}

And the access is fast, keyed off constants, and thus key-safe. But 
this is truly horrible for subclassing - you have to remember to add 
entries starting at 3, and if the base class gets extended it all 
breaks really badly. Note this is how I wrote XML::XPath. XML::Twig 
uses a similar scheme, but the source code uses hashrefs, and mirod 
magically filters it all into arrayrefs when he runs Makefile.PL. Kinda 
sick, but works.

So to solve this in a more elegant (*cough*) way, the perl5-porters 
came up with the idea of pseudohashes. The basic premise is the 
construction of the pseudohash hashref is really an arrayref that looks 
like this:

   [ { name => 1, age => 2, sex => 3 }, "John Doe", 22, "Male" ]

And when you treat it like a hash perl automatically does the right 
thing in order to find your elements:

   $self->{name} becomes $self->[$self->[0]{name}]

This isn't much of an optimisation (it's slower than an ordinary hash 
lookup), so in order to make this really fast perl has to be able to 
precompute the hash entries. You can do this with the fields pragma and 
by specifying the class of the $self var.

package Person;
use fields qw(name age sex);

sub new {
     my Person $self = shift;
     $self = fields::new($self) unless ref($self);

	return $self;
}

Now the important thing to remember is that everywhere you use $self 
(i.e. in all your methods) you MUST declare it as "my Person $self" 
otherwise you don't get the optimisation - you get the slow version.

This all sounds great, but what are the downsides?

  - It's a mess. It's a really ugly hack. You can accidentally use $self 
as an arrayref instead of a hashref and perl won't complain.

  - You can't delete() keys. In fact doing so results in some really 
weird bugs under older versions of perl. You have to undef() keys.

  - exists() doesn't work right either.

  - local $r->{key} doesn't work right either.

  - Imagine you have $r->[ { 1 => \$y }, { 2 => \$z } ] (or something 
like that). Now you accidentally type $r->{1} in your program instead 
of $r->[1] - what happens? Perl thinks you want a pseudo hash. So it 
looks up the key 1 in your 0'th element, finds \$y. Since it's a 
reference, it's something like SCALAR(0x809f0c). But if you treat it as 
a number it's 8429324. So perl happily goes and tries to look up an 
array element nearly 8 and a half million entries away. And expands the 
array to accomodate. Yow!

If you're careful they can actually give you a small performance boost 
(I use them in the high_perf branch of qpsmtpd which is what we run on 
our spamtrap), but if you're not careful they get really horrid really 
fast :-)