Processing Data Partitions in Perl

Wed Jul 2 13:32:52 PDT 2008

Hi all,

Recently I've found myself working with various SQL result sets of the
following form:
  id  name     location
  --  ----     --------
   1  Cuba     NB
   2  New Inn  NB
   3  Spoons   CMK
   4  Vaults   Stony
   5  Crown    Stony

I want to work with these result sets as groups of records with the same
location.

Chapter 19 of SQL For Smarties (a wonderful book for anyone doing
anything non-trivial with SQL databases) tells me that I should refer to
such groups of records as partitions.

Nosing through CPAN led me to the partition implementation in
List::MoreUtils.  I quickly put something together along the following
lines:
  my @pub = $sth->fetchall_arrayref({});
  my $i = 0;
  my $prev;
  my @pub_by_location = part {
      ! defined $prev || $_->{location} ne $prev->{location} && $i++;
      $prev = $_;
      $i;
  } @pub;

And this gave me something like:
  [
    { id => 1, name => 'Cuba', location => 'NB' },
    { id => 2, name => 'New Inn', location => 'NB' },
  ],
  [
    { id => 3, name => 'Spoons', location => 'CMK' },
  ],
  [
    { id => 4, name => 'Vaults', location => 'Stony' },
    { id => 5, name => 'Crown', location => 'Stony' },
  ]

So I was happy.  But I was less happy about the idea of slurping
everything from the database into one array.  It doesn't matter much for
the five records in my contrived example, but it would for larger
datasets.

For large sets, I would like to retrieve all the records in
each partition, perform some operation on them, then forget about that
partition and move on to the next one.

So I wrote myself a little subroutine to do just this:
  sub make_partition_sub {
      my $record_sub = shift;
      my $field = shift;

      my (@group, $sub);

      $sub = {
          my $cur = $record_sub->();
	  return if (! defined $cur && ! scalar @group);

	  if (! defined $cur || $cur->{$field} ne $group[0]{$field}) {
	      my @old_group = @group;
	      @group = defined $cur ? $cur : ();
	      return [ @old_group ];
	  }
	  else {
	      push @group, $cur;
	      $sub->();
	  }
      };

      return $sub;
  }

I'm not too happy that it calls itself and returns from various places
within itself, but my attempts to make it more readable failed.

Anyway, when I call it as follows I get the results I want:
  my $return_record = sub { $sth->fetchrow_hashref() };
  my $get_partition = make_partition_sub($return_record, 'location');
  while (my $partition_ref = $get_partition->()) {
      my @partition = @$partition_ref;
      # Do something with @partition
  }

So now I'm happy that I have a generic partition builder that happens to
work with DBI statement handles, but also works with anything that
returns hash references of fields and values.  I've gone and written
something generic that seems useful that I can't already find on CPAN.

I guess I could also pull out the "if (!defined $cur..." line and make
the second argument a code reference instead of a field name.  This
would give more flexibility, but I worry that would make the code
gratuitously more generic:  I can't think of a common use case.

I thought I should run it past you lot to get feedback, check I haven't
reinvented rounder wheels and to see whether you reckon I've written
something generally useful that warrants packaging up and releasing,
despite its brevity.

Finally, what should I call such a thing?

Tom