[Melbourne-pm] Designing modules to handle large data files

Adrian Masters adrian at ash-blue.org
Sun Aug 22 19:41:02 PDT 2010


> Say for example you have 6,000,000 objects each with 10 fields.  I would store
> the objects on disk in the manner of Debian packages files:
> 	name: Sam
> 	email: sam at ai.ki
> 	name: Fred
> 	email: fred at yahoo.com
> Text files, key-value pairs, records terminated with a blank line.

If you go down this road and expect to exchange data with others, I'd suggest using either JSON or YAML, as they model rich data
structures without the (full) overhead of XML. The Doctrine and Propel frameworks for PHP use YAML for ORM schema and data representation.
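
To make the JSON suggestion concrete, here is a rough sketch in Python of the quoted name/email records written as one JSON object per line. The one-object-per-line layout and the helper names (to_json_lines, from_json_lines) are my own assumptions for illustration, not part of any standard mentioned in this thread.

```python
import json

# Hypothetical records mirroring the name/email example quoted above.
records = [
    {"name": "Sam", "email": "sam at ai.ki"},
    {"name": "Fred", "email": "fred at yahoo.com"},
]

def to_json_lines(records):
    """Serialise each record as one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

def from_json_lines(text):
    """Parse one JSON object per non-empty line back into dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

encoded = to_json_lines(records)
decoded = from_json_lines(encoded)
```

Keeping one object per line means a reader can still process the file a record at a time rather than loading the whole thing.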

If you want something fast that parses the data file only once, use a stream-based approach. You could handle your complex field requirements using an
event-driven design pattern like SAX (see http://search.cpan.org/~grantm/XML-SAX-0.96/SAX/Intro.pod).
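
A minimal sketch of what a SAX-like, single-pass parser for the quoted key-value format might look like, written in Python for brevity. The class and callback names (RecordHandler, start_record, field, end_record) are hypothetical; they only borrow the event-handler idea from SAX and are not the XML::SAX API.

```python
import io

class RecordHandler:
    """Hypothetical SAX-style handler: override the callbacks you need."""
    def start_record(self):
        pass
    def field(self, key, value):
        pass
    def end_record(self):
        pass

def parse_stream(lines, handler):
    """Read 'key: value' lines once, firing events; a blank line ends a record."""
    in_record = False
    for line in lines:
        if not line.strip():
            # Blank line: close the current record, if any.
            if in_record:
                handler.end_record()
                in_record = False
            continue
        key, _, value = line.partition(":")
        if not in_record:
            handler.start_record()
            in_record = True
        handler.field(key.strip(), value.strip())
    # The last record may not be followed by a blank line.
    if in_record:
        handler.end_record()

class EmailCollector(RecordHandler):
    """Example handler that keeps only the email field of each record."""
    def __init__(self):
        self.emails = []
    def field(self, key, value):
        if key == "email":
            self.emails.append(value)

sample = io.StringIO(
    "name: Sam\nemail: sam at ai.ki\n\nname: Fred\nemail: fred at yahoo.com\n"
)
collector = EmailCollector()
parse_stream(sample, collector)
```

The parser never holds more than one line in memory, so the handler decides what (if anything) is kept per record.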

If you are going to query the parsed data more often than you parse it, a database is the way to go (as per the worthy suggestions made previously).
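
As a rough illustration of the database route, the sketch below loads the two quoted records into SQLite once and then queries them by name. SQLite, the table and column names, and the email_for helper are my assumptions; the earlier suggestions in the thread may well have had a different database in mind.

```python
import sqlite3

# In-memory database for the sketch; a real data set would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, email TEXT)")
# Index the field we expect to query on.
conn.execute("CREATE INDEX people_name ON people (name)")

rows = [("Sam", "sam at ai.ki"), ("Fred", "fred at yahoo.com")]
conn.executemany("INSERT INTO people (name, email) VALUES (?, ?)", rows)
conn.commit()

def email_for(name):
    """Return the email stored for a name, or None if there is no match."""
    row = conn.execute(
        "SELECT email FROM people WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

The parse cost is paid once at load time; every later lookup goes through the index instead of rescanning the text file.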

If you want to go full geek, you could look at writing a B-tree index for your file that records the character position of each record (one index per use case) ;).
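
Here is a small sketch of the offset-index idea in Python. To keep it short it uses a sorted in-memory list searched with bisect as a stand-in for a real B-tree, and it records byte offsets rather than character positions; the function names (build_index, read_record, lookup) are hypothetical.

```python
import bisect
import io

def build_index(f, key_field):
    """Scan the file once; return a sorted list of (key value, record offset)."""
    index = []
    record_start = None
    key_value = None
    while True:
        pos = f.tell()
        line = f.readline()
        stripped = line.strip()
        if stripped:
            if record_start is None:
                record_start = pos  # first line of a new record
            key, _, value = stripped.partition(b":")
            if key.strip() == key_field:
                key_value = value.strip()
        else:
            # A blank line or end of file closes the current record.
            if record_start is not None and key_value is not None:
                index.append((key_value, record_start))
            record_start, key_value = None, None
            if not line:
                break
    index.sort()
    return index

def read_record(f, offset):
    """Seek to a record's offset and parse just that record."""
    f.seek(offset)
    record = {}
    for line in iter(f.readline, b""):
        if not line.strip():
            break
        key, _, value = line.partition(b":")
        record[key.strip().decode()] = value.strip().decode()
    return record

def lookup(f, index, value):
    """Binary-search the index, then read the matching record from the file."""
    i = bisect.bisect_left(index, (value,))
    if i < len(index) and index[i][0] == value:
        return read_record(f, index[i][1])
    return None

data = io.BytesIO(
    b"name: Sam\nemail: sam at ai.ki\n\nname: Fred\nemail: fred at yahoo.com\n"
)
name_index = build_index(data, b"name")
fred = lookup(data, name_index, b"Fred")
```

A second index over the email field would be built the same way with build_index(data, b"email"), which is the "one index per use case" part.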

