[Dub-pm] big data structures relative to memory size

Sean O'Riordain sean.oriordain at swiftcall.com
Fri Apr 16 10:54:02 CDT 2004


Thanks Fergal.

here is how i currently load from mysql...

     # one perl array per column, indexed by record number
     my (@as, @ae, @adw, @ahh, @abz);
     my $rec_count = 0;
     while (my $aref = $sth->fetchrow_hashref) {
         push @as, $aref->{timestamp_ue};
         push @ae, $aref->{finish_ue};
         push @adw, $aref->{dw};
         push @ahh, $aref->{hh};
         push @abz, $aref->{bzone};
         $rec_count++;
     }
     print " $rec_count cdr records loaded\n";

this takes maybe 5 minutes - so i'm not overly worried about that...

if there isn't a simple way of passing the info to Inline::C, then i was 
thinking of just re-writing all the info to disk in an easily parseable 
format, i.e. fixed-width columns.  Then i was going to do all the 
integer work in C and write the results to an output file... 
(currently it takes more than 8 hours at 100% cpu on a 1700MHz Athlon ...)
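
something like this is what i have in mind for the dump (the column 
widths and the cdr.dat filename are only made up for illustration):

     # dump the already-loaded arrays as fixed-width columns so the
     # C program can parse each line with sscanf/fixed offsets
     open(my $out, '>', 'cdr.dat') or die "cdr.dat: $!";
     for my $i (0 .. $#as) {
         printf $out "%10d %10d %1d %2d %-30s\n",
             $as[$i], $ae[$i], $adw[$i], $ahh[$i], $abz[$i];
     }
     close($out);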

i could speed up the string stuff by using a lookup table since there 
are only about 350 different values...
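
roughly what i mean by the lookup table -- the names here are made up:

     # intern each bzone string as a small integer id; there are only
     # ~350 distinct values, so the hash and reverse array stay tiny
     my (%bz_id, @bz_name);
     sub bz_to_id {
         my $bz = shift;
         $bz_id{$bz} = push(@bz_name, $bz) - 1
             unless exists $bz_id{$bz};
         return $bz_id{$bz};
     }
     # then store bz_to_id($aref->{bzone}) instead of the string itself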

in Inline::C is it possible to persistently keep a C data structure 
between calls? i.e. malloc space for my large int arrays, and then from 
perl append each new line of info?
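
i was imagining something along these lines -- completely untested, 
and the function names and the fixed MAX_RECS size are just 
illustrative:

use Inline C => <<'END_C';
#define MAX_RECS 6000000

/* static arrays persist between calls for the life of the process */
static int starts[MAX_RECS];
static int finishes[MAX_RECS];
static int n_recs = 0;

/* called from the perl fetch loop, one call per record */
int c_append(int start_ue, int finish_ue) {
    if (n_recs >= MAX_RECS) return -1;   /* buffer full */
    starts[n_recs]   = start_ue;
    finishes[n_recs] = finish_ue;
    return ++n_recs;
}

/* the heavy integer work then happens entirely on the C side */
long c_total_duration(void) {
    long total = 0;
    int i;
    for (i = 0; i < n_recs; i++)
        total += finishes[i] - starts[i];
    return total;
}
END_C

then from the fetch loop i'd call 
c_append($aref->{timestamp_ue}, $aref->{finish_ue}) for each record 
instead of pushing onto the perl arrays... would that work?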

cheers,
Sean


Fergal Daly wrote:
> Not knowing exactly what you have makes it a bit tricky. If you've got 5
> million things looking like
> 
> 	[$int1, $int2, $int3, $int4, $int5, $string]  x 1.5 million
> 
> then you will save quite a bit by having
> 
> @int1s = (int x 1.5 million)
> @int2s = (int x 1.5 million)
> ..
> @int5s = (int x 1.5 million)
> @strings = (string x 1.5 million)
>
> then just pass around the index. A package like
> 
> package MyObj;
> 
> sub new
> {
> 	my $pkg = shift;
> 	my $index = shift;
> 	return bless \$index, $pkg;
> }
> 
> sub getInt1
> {
> 	my $self = shift;
> 	return $int1s[$$self];
> }
> 
> etc...
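> 
> so a "record" ends up being nothing but an index blessed into the
> package, e.g. (42 is just an arbitrary index)
> 
> 	my $rec = MyObj->new(42);
> 	print $rec->getInt1(), "\n";	# same as $int1s[42]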
> 
> or you could get more memory-efficient and, rather than using arrays for
> the ints, have a string for each set of ints and have methods like
> 
> sub getInt1
> {
> 	my $self = shift;
> 
> 	# assume a 4 byte integer
> 	my $enc = substr($int1s, $$self*4, 4);
> 
> 	return unpack("L", $enc);
> }
> 
> you could also do this for the strings. It'll be slower because you'll be
> invoking methods; you could use plain subroutines if you're sure you'll
> never want inheritance etc.
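> 
> for completeness, the packed string itself would be built up at load
> time with something like (just a sketch)
> 
> 	$int1s .= pack("L", $int1);	# appends 4 bytes per record
> 
> and pack("A30", $string) would similarly give you fixed-width 30 byte
> string records that substr can pull back out.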
> 
> If you use Inline::C, how you load the data depends entirely on how you
> store it; you'll just have to write C routines for loading the data and
> call them from Perl,
> 
> F
> On Fri, Apr 16, 2004 at 03:08:23PM +0100, Sean O'Riordain wrote:
> 
>>Hi folks,
>>
>>I've an analysis program with a couple of million records that i really 
>>need to keep in memory, as i need to scan back and forth etc... With 5 
>>million odd records (written as a couple of independent 'arrays', or 
>>should i say 'lists') the program requires quite a bit more than the 
>>1.5GB of ram and becomes very slow due to swapping - gentoo-linux... 
>>Each record has 5 integers and a string of max. length 30 chars... but 
>>perl takes up extra ram for each SV...  I would like to be able to 
>>handle larger datasets much faster than currently...
>>
>>Has anybody used Inline::C for handling large data structures - if so 
>>how do you load the info?
>>
>>Anybody used PDL?
>>
>>Any thoughts which way I should jump?
>>
>>cheers,
>>Sean
>>_______________________________________________
>>Dublin-pm mailing list - Dublin-pm at mail.pm.org
>>http://dublin.pm.org/ - IRC irc.linux.ie #dublin-pm
>>
>>
> 
> 
> 


