[Melbourne-pm] Designing modules to handle large data files
sam at nipl.net
Sun Aug 22 18:14:27 PDT 2010
When you say 'large' datasets, how large do you mean? I did experiment with
using Perl for a toy full-text search system; it's quite capable of handling
medium-sized datasets (maybe 500MB) and querying and processing them very quickly.
I think if you have datasets that are smaller than your RAM, and you don't
create too many unnecessary Perl strings and objects, you should be able to
process everything in Perl if you prefer to do it that way. It may even
outperform a general-purpose relational database.
Say, for example, you have 6,000,000 objects, each with 10 fields. I would store
the objects on disk in the manner of Debian Packages files:
email: sam at ai.ki
email: fred at yahoo.com
Text files, key-value pairs, records terminated with a blank line.
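A minimal sketch of reading that format, assuming each record really is a run of
"key: value" lines ending at a blank line. Perl's paragraph mode ($/ = "") does
the record splitting for you; the field names here are just examples:

```perl
use strict;
use warnings;

# Read blank-line-terminated records, one hashref per record.
sub read_records {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode: a record ends at one or more blank lines
    my @records;
    while (my $chunk = <$fh>) {
        my %rec;
        for my $line (split /\n/, $chunk) {
            my ($key, $value) = split /:\s*/, $line, 2;
            $rec{$key} = $value if defined $value;
        }
        push @records, \%rec;
    }
    return @records;
}

# Usage with an in-memory filehandle standing in for the data file:
my $data = "name: sam\nemail: sam\@ai.ki\n\nname: fred\nemail: fred\@yahoo.com\n";
open my $fh, '<', \$data or die $!;
my @recs = read_records($fh);
print $recs[1]{email}, "\n";   # fred@yahoo.com
```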
I'm not sure, as I haven't tried this, but you might find that loading each
object into a single string and parsing out the fields on demand saves
you a lot of memory and makes the program run faster. I/O, and specifically
swapping, is what will kill your performance.
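The on-demand idea above could look like this: keep each object as one scalar
and only pull a field out with a regex when it's actually asked for, rather
than building a hash of ten small strings per object up front. The field names
are hypothetical:

```perl
use strict;
use warnings;

# Each object is a single string in the key: value format described above.
my @objects = (
    "name: sam\nemail: sam\@ai.ki\n",
    "name: fred\nemail: fred\@yahoo.com\n",
);

# Extract one field on demand; returns undef if the key isn't present.
sub field {
    my ($record, $key) = @_;
    return $record =~ /^\Q$key\E:\s*(.*)$/m ? $1 : undef;
}

print field($objects[0], 'email'), "\n";   # sam@ai.ki
```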
You will also need to create indexes, of course (Perl hash tables). If you are
really running out of RAM, you could compress objects using Compress::Zlib or
similar - or buy some more RAM!
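A quick sketch of that compression idea, assuming you keep rarely-touched
records deflated in memory and only inflate them on access (Compress::Zlib
ships with Perl):

```perl
use strict;
use warnings;
use Compress::Zlib;   # bundled with Perl since 5.9.3

# Repetitive key: value text compresses well.
my $record = "name: sam\nemail: sam\@ai.ki\n" x 50;

my $packed   = compress($record);     # store this instead of the raw string
my $restored = uncompress($packed);   # inflate only when the object is needed

printf "%d bytes -> %d bytes\n", length $record, length $packed;
die "round-trip failed" unless $restored eq $record;
```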
I do like to use streaming systems where possible, but sometimes you want
random access. You could also look at creating your indexes in RAM but
reading the object data from files, or perhaps using Berkeley DB for indexes
if your indexes become too big for RAM. I'm not a big fan of SQL, but I do
like the mathematical concept of relational databases.
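The indexes-in-RAM, data-on-disk approach might be sketched like this: one
pass records the byte offset of each record in a hash keyed by some field,
and lookups then seek() straight to the record. The data and field names are
illustrative only:

```perl
use strict;
use warnings;

# An in-memory filehandle standing in for the on-disk data file.
my $data = "name: sam\nemail: sam\@ai.ki\n\nname: fred\nemail: fred\@yahoo.com\n\n";
open my $fh, '<', \$data or die $!;

# Index pass: remember where each record starts, keyed by email.
my %offset;
{
    local $/ = "";                 # paragraph mode
    my $pos = tell $fh;
    while (my $rec = <$fh>) {
        my ($email) = $rec =~ /^email:\s*(.*)$/m;
        $offset{$email} = $pos if defined $email;
        $pos = tell $fh;
    }
}

# Random access: seek straight to fred's record and read just that one.
seek $fh, $offset{'fred@yahoo.com'}, 0 or die $!;
my $fred = do { local $/ = ""; <$fh> };
print $fred;
```

If %offset itself outgrows RAM, tying it to a Berkeley DB file (e.g. via
DB_File) keeps the same hash interface while storing the index on disk.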