[Melbourne-pm] Designing modules to handle large data files

Thu Aug 19 00:15:22 PDT 2010

On 19/08/10 16:52, Tulloh, David wrote:
> Dear List,
>
> As part of my work I have built several modules to handle data files.
> The idea is to hide the structure and messiness of the data file in a
> nice reusable module.  This also allows the script to focus on the
> processing rather than the data format.
>
> Unfortunately, while the method I have evolved towards meets these
> objectives reasonably well, I'm running into significant memory and
> speed problems with large data files.  I have some ideas of ways to
> restructure it to improve this, but all involve some uncomfortable
> compromises.
>
> I was hoping some of the more experienced eyes on the list could look
> over my approach and make a few suggestions.

Suggestion 1:
Perhaps you should import the data file into a database and let the
database do the hard work (indexing, filtering, aggregation) for you?
By all means put a layer over the DB interface to make it nice for
people to use. Otherwise you run the risk of reinventing the wheel.
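
A minimal sketch of what I mean, not a finished design. It assumes the
CPAN modules DBI and DBD::SQLite are installed, and it invents a
"station,value" line format, a table called "reading", and the function
names load_file and mean_for_station purely for illustration; your real
file format and queries will differ.

```perl
use strict;
use warnings;
use DBI;    # CPAN module; DBD::SQLite is assumed as the driver

# Hypothetical loader: copies well-formed "station,value" lines into a table.
sub load_file {
    my ($dbh, $path) = @_;
    $dbh->do('CREATE TABLE IF NOT EXISTS reading (station TEXT, value REAL)');
    my $sth = $dbh->prepare(
        'INSERT INTO reading (station, value) VALUES (?, ?)');

    open my $fh, '<', $path or die "Cannot open $path: $!";
    $dbh->begin_work;    # one transaction keeps bulk inserts fast
    while (defined(my $line = <$fh>)) {
        chomp $line;
        next unless length $line;              # skip empty lines
        $sth->execute(split /,/, $line, 2);    # station, value
    }
    $dbh->commit;
    close $fh;
    return;
}

# Hypothetical "nice layer": callers never see SQL or the file format.
sub mean_for_station {
    my ($dbh, $station) = @_;
    my ($mean) = $dbh->selectrow_array(
        'SELECT AVG(value) FROM reading WHERE station = ?',
        undef, $station);
    return $mean;
}
```

A script would connect once, for example with
DBI->connect('dbi:SQLite:dbname=data.db', '', '', { RaiseError => 1 }),
call load_file a single time, and from then on only use the query
functions. The file name data.db is also just a placeholder.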

Suggestion 2:
If you want to stick with processing the file in situ, then you'll need
a streaming processor that reads and handles one record at a time,
rather than loading the whole thing into memory at once. That way memory
use depends on the size of a record, not the size of the file.
Are you familiar with that concept?
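
In case it helps, here is a rough sketch of the streaming idea using
only core Perl. The "station,value" line format, the '#' comment
convention, and the names record_iterator and sum_values are made up for
the example; the point is only that the module hands back one record per
call instead of a big array.

```perl
use strict;
use warnings;

# Hypothetical reader: returns a closure that yields one parsed record
# (a hashref) per call, and undef once the file is exhausted.
sub record_iterator {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    return sub {
        while (defined(my $line = <$fh>)) {
            chomp $line;
            next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
            my ($station, $value) = split /,/, $line;
            return { station => $station, value => $value };
        }
        return;    # end of file; $fh is closed when the closure is freed
    };
}

# Example consumer: a running total that only ever holds one record.
sub sum_values {
    my ($path) = @_;
    my $next  = record_iterator($path);
    my $total = 0;
    while (my $rec = $next->()) {
        $total += $rec->{value};
    }
    return $total;
}
```

The module still hides the file format, but the script drives the loop,
so anything that can be computed in one pass (sums, counts, filters,
min/max) never needs the whole file in memory.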

Cheers,
Toby