[Melbourne-pm] Designing modules to handle large data files

Wed Aug 18 23:52:41 PDT 2010

Dear List,

As part of my work I have built several modules to handle data files.
The idea is to hide the structure and messiness of the data file in a
nice reusable module.  This also allows the script to focus on the
processing rather than the data format.

Unfortunately, while the method I have evolved towards meets these
objectives reasonably well, I'm running into significant memory and
speed problems with large data files.  I have some ideas for
restructuring it to improve this, but all involve some uncomfortable
compromises.

I was hoping some of the more experienced eyes on the list could look
over my approach and make a few suggestions.

Below is the basic module structure, followed by usage examples.

David

package DataType;
use Moose;
use 5.010;
use MyTypes;

around BUILDARGS => sub {
	my ($orig, $class, $file) = @_;
	return $class->$orig(_file => $file);
};

has '_file' => (
	is       => 'ro',
	isa      => 'MyTypes::File', # File handle, IO handle or filename
	coerce   => 1,
	required => 1,
	trigger  => \&_process_file,
);

sub _process_file {
	my ($this, $file) = @_;

	# Break file into entries

	$this->_set_rows(
		[map {DataType::Entry->new($_)} @entry_strings]
	);
}

# An easy optimisation is to store a hash of array refs where the
# key of the hash is the most commonly searched for string.  If
# there is no strong key candidate I just leave it as an array.

has '_rows' => (
	is      => 'ro',
	isa     => 'ArrayRef[DataType::Entry]',
	writer  => '_set_rows',
	default => sub {[]},
);

sub find {
	my ($this, %fields) = @_;

	my @possibles = @{$this->_rows};

	foreach my $k (keys %fields) {
		@possibles = grep {$_->$k ~~ $fields{$k}} @possibles;
	}

	return @possibles;
}

no Moose;
__PACKAGE__->meta->make_immutable;

package DataType::Entry;
use Moose;
use 5.010;

around BUILDARGS => sub {
	my ($orig, $class, $string) = @_;

	# Process string into structure

	return $class->$orig(%structure);
};

has [qw(field list)] => (
	is => 'ro',
);

no Moose;
__PACKAGE__->meta->make_immutable;

Examples of typical usage:

my $data = DataType->new($filename);

# Convert to a different data format
say join "\n", map {} sort {} map {} $data->find;

# Loop through all data
foreach ($data->find) {}

# Loop through a subset
foreach ($data->find(destination => "YSSY")) {}
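
For concreteness, here is a minimal sketch of the hash-of-array-refs
optimisation mentioned in the module comments.  It is not the real
module: plain hash refs stand in for DataType::Entry objects, string
equality replaces the smart match, and the "flight" field and the
sample rows are made up for illustration.  Only "destination" and
"YSSY" come from the usage above.

```perl
use strict;
use warnings;

# Hypothetical sample data; each hash ref stands in for one entry.
my @rows = (
    { destination => 'YSSY', flight => 'QF400' },
    { destination => 'YMML', flight => 'QF401' },
    { destination => 'YSSY', flight => 'VA800' },
);

# Build the index once: key value => array ref of the rows that have it.
my %by_destination;
push @{ $by_destination{ $_->{destination} } }, $_ for @rows;

# Use the index when the key field is supplied, then filter the
# remaining candidates on any other fields, as find() does above.
sub find {
    my (%fields) = @_;

    my @possibles;
    if (defined(my $dest = delete $fields{destination})) {
        # Indexed lookup; an unknown key yields an empty list.
        @possibles = @{ $by_destination{$dest} || [] };
    }
    else {
        # No key given: fall back to scanning every row.
        @possibles = @rows;
    }

    foreach my $k (keys %fields) {
        @possibles = grep { $_->{$k} eq $fields{$k} } @possibles;
    }

    return @possibles;
}

# Both YSSY rows, in file order, without scanning the YMML row.
my @sydney = find(destination => 'YSSY');
```

The trade-off is one extra array ref per distinct key value, and a
search on any other field alone still walks every row.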