[Melbourne-pm] Designing modules to handle large data files

Sam Watkins sam at nipl.net
Mon Aug 23 21:48:33 PDT 2010


On Mon, Aug 23, 2010 at 12:41:02PM +1000, Adrian Masters wrote:
> David,
> 
> [snip]
> > Say for example you have 6,000,000 objects each with 10 fields.  I would store
> > the objects on disk in the manner of Debian packages files:
> >
> > 	name: Sam
> > 	email: sam at ai.ki
> >
> > 	name: Fred
> > 	email: fred at yahoo.com
> >
> >
> > Text files, key-value pairs, records terminated with a blank line.
> [snip]
> 
> If you went down this road and were considering exchanging data with others, I'd suggest using either JSON or YAML

The format I'm suggesting is like YAML-lite, without the kitchen sink, as used
in email and HTTP headers.  The only addition over those is the blank line as
record separator.  It's the same as Debian package files.  I think it's more
than sufficient for practically any task, and it's an extremely Simple and
Readable format.  I don't know of a dataset that can't be expressed nicely like
this.  If you want more compactness, I would suggest going with TSV.
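
To show how little code the format needs, here is a rough sketch of a streaming
reader.  It is in Python purely for illustration; the function name
read_records is made up, and it assumes one "key: value" pair per line, no
continuation lines and no repeated keys within a record.

```python
import io


def read_records(lines):
    """Yield one dict per record; records are separated by blank lines."""
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    # The last record may not be followed by a blank line.
    if record:
        yield record


# Small in-memory sample standing in for a large file on disk.
sample = io.StringIO(
    "name: Sam\n"
    "email: sam at ai.ki\n"
    "\n"
    "name: Fred\n"
    "email: fred at yahoo.com\n"
)
records = list(read_records(sample))
```

Because it is a generator, only one record is held in memory at a time, so the
same loop should work on a file with millions of records.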

Other formats like XML, and even YAML and JSON, are over-complicated in my
opinion.  Simplicity, Clarity, Generality!!

  http://www.informit.com/ShowCover.aspx?isbn=020161586X

> If you want to go full geek, you could look at writing a BTree index for your file, and record characters position (1 index per use case) ;).

I like that method :)  The file is text, and the BTree index can be regenerated
from the file.  I'd recommend using libdb4 (Berkeley DB) for the index rather
than coding your own BTrees, unless you'd like to do that.  The illustrious
postfix does something like this for its map files; well, actually I think it
creates binary .db files from the text files, not indexes.
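
A rough sketch of the offset-index idea, again in Python for illustration.  It
uses the standard library's dbm.dumb as a stand-in for Berkeley DB (it is a
simple key-value store, not a BTree), and the names build_index, lookup,
people.txt and people.idx are made up for the example.  It assumes each record
has a unique "name" field to index on.

```python
import dbm.dumb
import os
import tempfile


def build_index(data_path, index_path, key_field=b"name"):
    """Map each record's key_field value to the byte offset of the record."""
    with open(data_path, "rb") as data, dbm.dumb.open(index_path, "n") as index:
        record_start = None
        while True:
            offset = data.tell()
            line = data.readline()
            if not line:
                break
            if not line.strip():
                # Blank line: the next non-blank line starts a new record.
                record_start = None
                continue
            if record_start is None:
                record_start = offset
            key, _, value = line.partition(b":")
            if key.strip() == key_field:
                index[value.strip()] = str(record_start)


def lookup(data_path, index_path, key):
    """Seek straight to one record and return its lines."""
    with dbm.dumb.open(index_path, "r") as index:
        offset = int(index[key].decode())
    with open(data_path, "rb") as data:
        data.seek(offset)
        lines = []
        for line in data:
            if not line.strip():
                break
            lines.append(line.decode().rstrip("\n"))
        return lines


# Demo on a tiny file; the index can be rebuilt from the text at any time.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "people.txt")
index_path = os.path.join(workdir, "people.idx")
with open(data_path, "wb") as f:
    f.write(b"name: Sam\nemail: sam at ai.ki\n\n"
            b"name: Fred\nemail: fred at yahoo.com\n\n")
build_index(data_path, index_path)
fred = lookup(data_path, index_path, b"Fred")
```

The text file stays the source of truth; if the index is lost or stale, running
build_index again recreates it.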

Although I do prefer to avoid them, it would very likely be much easier to use
an SQL database.

Sam

