[Melbourne-pm] Designing modules to handle large data files

Toby Corkindale toby.corkindale at strategicdata.com.au
Sun Aug 22 18:49:17 PDT 2010


On 23/08/10 11:14, Sam Watkins wrote:
> I think if you have datasets that are smaller than your RAM, and you don't
> create too many unnecessary perl strings and objects, you should be able to
> process everything in perl if you prefer to do it like that.  It may even
> outperform a general relational database.

Outperform, yes, but it won't scale well at all.


[snip example]
> I'm not sure as I haven't tried this, but you might find that loading each
> object into a single string, and parsing out the fields 'on demand' will save
> you a lot of memory and the program will run faster.

To both of you - I suggest you benchmark this suggestion before 
implementing your program around it. My intuition suggests you won't 
save that much memory with this approach. Perl scalars aren't as 
inefficient as you imagine.
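
If you do want to measure it, Devel::Size makes the comparison cheap to
run. Here's a rough sketch, assuming Devel::Size is installed from CPAN;
the record layout is invented purely for illustration:

use strict;
use warnings;
use Devel::Size qw(total_size);

# Compare keeping each record as one raw tab-separated string
# versus pre-splitting it into a hash of named fields.
my (@as_strings, @as_hashes);
for my $i (1 .. 100_000) {
    my $line = join "\t", $i, "name$i", "some longer description field $i";
    push @as_strings, $line;

    my %rec;
    @rec{qw(id name description)} = split /\t/, $line;
    push @as_hashes, \%rec;
}

printf "raw strings:   %d bytes\n", total_size(\@as_strings);
printf "parsed hashes: %d bytes\n", total_size(\@as_hashes);

The difference you see there (plus a Benchmark run over the parsing
code) is what I'd base the decision on, rather than intuition - mine
included.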

> You will also need to create indexes of course (perl hash tables).  If you are
> really running out of RAM, you could compress objects using Compress::Zlib or
> similar - or buy some more RAM!

Or you could use a lightweight db or NoSQL system, which has already 
implemented those features for you.
Perhaps MongoDB or CouchDB would suit you?
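
To give an idea of how little code that takes, here's a rough sketch
using DBD::SQLite via DBI (the table and column names are invented; the
MongoDB and CouchDB Perl modules have their own, different APIs):

use strict;
use warnings;
use DBI;

# A single-file SQLite database; no server to set up.
my $dbh = DBI->connect("dbi:SQLite:dbname=objects.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS objects (
        id   INTEGER PRIMARY KEY,
        name TEXT,
        data TEXT
    )
});

# Batch the inserts in one transaction for speed.
$dbh->begin_work;
my $ins = $dbh->prepare("INSERT INTO objects (id, name, data) VALUES (?, ?, ?)");
$ins->execute($_, "name$_", "payload $_") for 1 .. 1_000;
$dbh->commit;

# Indexing is one statement, rather than hand-rolled hashes.
$dbh->do("CREATE INDEX IF NOT EXISTS idx_name ON objects (name)");

my ($data) = $dbh->selectrow_array(
    "SELECT data FROM objects WHERE name = ?", undef, "name42");
print "$data\n";

$dbh->disconnect;

One advantage of going through DBI is that when the data outgrows a
single file, you can point much the same code at a networked database
largely by changing the connect string.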

You can keep buying RAM in the short term, but what happens when your 
dataset gets 10x bigger? You quite quickly stop being able to install 
more RAM economically, whereas a scalable approach lets you process 
more data at little or no extra cost, with only a roughly linear 
increase in run time.

> I do like to use streaming systems where possible, but sometimes you want
> random access.  You could also look at creating your indexes in RAM, but
> reading the object data from files, or perhaps using Berkeley DB for indexes
> if your indexes become too big for RAM.  I'm not a big fan of SQL, but I do
> like the mathematical concept of relational databases.
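
That index-in-RAM-plus-seek approach is straightforward to set up. A
rough sketch, assuming a line-oriented data file whose first
tab-separated field is the key, and DB_File as the Berkeley DB binding
(file names and record format are invented here):

use strict;
use warnings;
use DB_File;

my $datafile = 'objects.tsv';

# Key -> byte offset.  Tied to a Berkeley DB file so the index itself
# can grow beyond RAM; drop the tie to keep it as a plain Perl hash.
tie my %offset, 'DB_File', 'objects.idx'
    or die "Can't tie objects.idx: $!";

open my $fh, '<', $datafile or die "Can't open $datafile: $!";
while (1) {
    my $pos  = tell $fh;
    my $line = <$fh>;
    last unless defined $line;
    my ($key) = split /\t/, $line, 2;
    $offset{$key} = $pos;
}

# Random access later: seek straight to the record, no rescanning.
sub fetch {
    my ($key) = @_;
    return unless exists $offset{$key};
    seek $fh, $offset{$key}, 0 or die "Can't seek: $!";
    my $record = <$fh>;
    chomp $record;
    return $record;
}

my $rec = fetch('42');
print "$rec\n" if defined $rec;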

