[Melbourne-pm] Designing modules to handle large data files

Daniel Pittman daniel at rimspace.net
Mon Aug 23 01:48:31 PDT 2010


Toby Corkindale <toby.corkindale at strategicdata.com.au> writes:
> On 23/08/10 11:14, Sam Watkins wrote:
>
>> I think if you have datasets that are smaller than your RAM, and you don't
>> create too many unnecessary perl strings and objects, you should be able to
>> process everything in perl if you prefer to do it like that.  It may even
>> outperform a general relational database.
>
> Outperform, yes, but it won't scale well at all.

*nod*  Everything is easy, and every algorithm is sufficient, for data smaller
than core memory.  Given that 24 to 96 GB of memory is attainable for a
dedicated home user today, a lot of the old scaling problems simply go away.

(Don't forget persistence, and hardware contention, though :)

[...]

>> You will also need to create indexes of course (perl hash tables).  If you are
>> really running out of RAM, you could compress objects using Compress::Zlib or
>> similar - or buy some more RAM!
>
> Or you could use a lightweight db or NoSQL system, which has already
> implemented those features for you.  Perhaps MongoDB or CouchDB would suit
> you?
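
To make Sam's in-memory suggestion concrete, here is a minimal sketch of that
approach (the file name and record layout are invented for illustration): a
plain hash as the index, with Compress::Zlib squeezing the values if RAM gets
tight.

    use strict;
    use warnings;
    use Compress::Zlib;

    # Invented layout: whitespace-separated "id name score" records.
    my %by_id;    # the index: record id => compressed record
    open my $fh, '<', 'records.txt' or die "records.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id) = split /\s+/, $line, 2;
        # Trading CPU for RAM; skip the compress() if the data already fits.
        $by_id{$id} = compress($line);
    }
    close $fh;

    # A lookup is then just a hash fetch plus an uncompress().
    my $record = exists $by_id{42} ? uncompress($by_id{42}) : undef;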

For something like this I would also seriously consider Riak; the main
differences between Riak and the MongoDB/CouchDB models are in how they scale
across systems.  (Internal, invisible sharding vs replication, basically.)

They all use JavaScript-based map/reduce as their built-in data mining tool,
and can generally do a reasonable job of exploiting data locality and the like.
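
(Purely as a sketch of the model, in Perl rather than the JavaScript those
stores actually run, and with invented record fields: map emits a key and a
value per record, reduce folds the values for each key.)

    use strict;
    use warnings;

    # Invented records; in Riak/MongoDB/CouchDB the map and reduce steps
    # below would be JavaScript functions shipped to, and run beside, the data.
    my @records = (
        { host => 'a', bytes => 100 },
        { host => 'b', bytes => 250 },
        { host => 'a', bytes =>  50 },
    );

    my $map    = sub { my ($rec) = @_; return ( $rec->{host}, $rec->{bytes} ) };
    my $reduce = sub { my ($key, $values) = @_; my $n = 0; $n += $_ for @$values; return $n };

    # Map phase: emit a (key, value) pair per record, grouped by key.
    my %emitted;
    for my $rec (@records) {
        my ($key, $value) = $map->($rec);
        push @{ $emitted{$key} }, $value;
    }

    # Reduce phase: fold each key's list of values down to a single result.
    my %total;
    $total{$_} = $reduce->( $_, $emitted{$_} ) for keys %emitted;
    # %total is now ( a => 150, b => 250 ): total bytes per host.

The point in the real systems, of course, is that the map step runs where the
data lives, so mostly just the reduced results cross the network.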

        Daniel
-- 
✣ Daniel Pittman            ✉ daniel at rimspace.net            ☎ +61 401 155 707
               ♽ made with 100 percent post-consumer electrons

