[Dub-pm] big data structures relative to memory size

Sean O'Riordain seanpor at acm.org
Fri Apr 16 11:15:26 CDT 2004


Hi Fergal!

i'll have to think about that one!

if i malloc memory in C and then go back to perl before passing info into C 
again... how do i reference that memory the second time? i presume i pass 
the address back and forth by reference - or could it still be available 
as a static pointer... i'm missing a step here...
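
i'm guessing it would look something like this - completely untested, and 
the names init_store/append_val are just made up - with statics on the C 
side holding onto the malloced block between calls - is that roughly right?

    use Inline C => q{
        #include <stdlib.h>

        /* file-scope statics live for the whole perl process, so the
           malloced block survives between calls from perl */
        static int *store = NULL;
        static int  n     = 0;

        void init_store(int max) {
            store = malloc(max * sizeof(int));
            n = 0;
        }

        void append_val(int v) {
            store[n++] = v;
        }

        int get_val(int i) {
            return store[i];
        }
    };

    init_store(5_000_000);              # once, before loading
    append_val($aref->{timestamp_ue});  # inside the fetchrow loop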

thanks again!
Sean


Fergal Daly wrote:
> If you malloc data in the C code, it will stay malloced until you free it,
> whether you go back to Perl or not.
> 
> Getting data in and out of C within Perl is pretty much the same as getting
> it into C on its own, except that you can use bits of Perl too. You could
> try something a bit fancy: figure out exactly how much memory you'll be
> using and dump all the data into a file, exactly like a big chunk of
> memory, making a note of the offset where each array begins, how long it
> is, etc. Then you can just mmap the file so that it appears as a chunk of
> memory and bingo, your data structures are loaded. The mmap call will give
> you the address that the file has been mapped to, so you can get the address
> of each array inside it by adding its offset.
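> 
> Roughly like this (untested, off the top of my head - load_dump and the
> offset arguments are made-up names; the offsets are whatever you noted
> down when writing the file):
> 
>     use Inline C => q{
>         #include <sys/mman.h>
>         #include <sys/stat.h>
>         #include <fcntl.h>
> 
>         static int *as;  /* recovered array pointers */
>         static int *ae;
> 
>         void load_dump(char *path, int as_off, int ae_off) {
>             struct stat st;
>             char *base;
>             int fd = open(path, O_RDONLY);
> 
>             fstat(fd, &st);
>             /* map the whole file read-only; the kernel pages it in on demand */
>             base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
> 
>             as = (int *)(base + as_off);
>             ae = (int *)(base + ae_off);
>         }
>     };
> 
> after which as[i] and friends behave like ordinary C arrays.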
> 
> F
> 
> On Fri, Apr 16, 2004 at 04:54:02PM +0100, Sean O'Riordain wrote:
> 
>>Thanks Fergal.
>>
>>here is how i currently load from mysql...
>>
>>    my $rec_count = 0;
>>    while (my $aref = $sth->fetchrow_hashref) {
>>        push @as, $aref->{timestamp_ue};
>>        push @ae, $aref->{finish_ue};
>>        push @adw, $aref->{dw};
>>        push @ahh, $aref->{hh};
>>        push @abz, $aref->{bzone};
>>        $rec_count++;
>>    }
>>    print " $rec_count cdr records loaded\n";
>>
>>this takes maybe 5 minutes - so i'm not overly worried about that...
>>
>>if there isn't a simple way of passing the info to Inline::C, then i was 
>>thinking of just re-writing all the info to disk in an easily parseable 
>>format, i.e. fixed-width columns.  Then i was going to do all the integer 
>>work in C and write the results to an output file... 
>>(currently it takes more than 8 hours at 100% cpu on a 1700 MHz Athlon...)
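>>
>>e.g. each record as one fixed-length binary chunk - untested, and assuming 
>>bzone is the string column:
>>
>>    open my $out, '>', 'cdr.dat' or die "open: $!";
>>    binmode $out;
>>    for my $i (0 .. $#as) {
>>        # 4 ints plus the string padded to 30 bytes = 46-byte record
>>        print $out pack('l4A30', $as[$i], $ae[$i], $adw[$i], $ahh[$i], $abz[$i]);
>>    }
>>    close $out;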
>>
>>i could speed up the string stuff by using a lookup table since there 
>>are only about 350 different values...
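>>
>>e.g. something like (untested):
>>
>>    my (%id_of, @str_of);
>>    sub str_id {
>>        my $s = shift;
>>        unless (exists $id_of{$s}) {
>>            push @str_of, $s;          # remember the string once
>>            $id_of{$s} = $#str_of;     # and hand out its index
>>        }
>>        return $id_of{$s};             # small int instead of the string
>>    }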
>>
>>in Inline::C is it possible to persistently keep a C data structure 
>>between calls? i.e. malloc space for my large int arrays, and then from 
>>perl append each new line of info?
>>
>>cheers,
>>Sean
>>
>>
>>Fergal Daly wrote:
>>
>>>Not knowing exactly what you have makes it a bit tricky. If you've got 5
>>>million things looking like
>>>
>>>	[$int1, $int2, $int3, $int4, $int5, $string]  x 5 million
>>>
>>>then you will save quite a bit by having
>>>
>>>@int1s = (int x 5 million)
>>>@int2s = (int x 5 million)
>>>...
>>>@int5s = (int x 5 million)
>>>@strings = (string x 5 million)
>>>
>>>then just pass around the index. A package like
>>>
>>>package MyObj;
>>>
>>>sub new
>>>{
>>>	my $pkg = shift;
>>>	my $index = shift;
>>>	return bless \$index, $pkg;
>>>}
>>>
>>>sub getInt1
>>>{
>>>	my $self = shift;
>>>	return $int1s[$$self];
>>>}
>>>
>>>etc...
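>>>
>>>i.e. you'd create and use one like
>>>
>>>	my $rec = MyObj->new($i);	# $i is just the record number
>>>	print $rec->getInt1(), "\n";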
>>>
>>>or you could get more memory efficient: rather than using arrays for the
>>>ints, have one big string for each set of ints and have methods like
>>>
>>>sub getInt1
>>>{
>>>	my $self = shift;
>>>
>>>	# assume a 4 byte integer
>>>	my $enc = substr($int1s, $$self*4, 4);
>>>
>>>	return unpack("L", $enc);
>>>}
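>>>
>>>To build those strings in the first place you'd do something like
>>>(untested; $int1 and $str stand for the current record's values)
>>>
>>>	$int1s   .= pack("L", $int1);	# append a 4-byte int per record
>>>	$strings .= pack("A30", $str);	# fixed 30-byte slot per string
>>>
>>>and get a string back out with unpack("A30", substr($strings, $$self*30, 30)).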
>>>
>>>You could also do this for the strings. It'll be slower because you'll be
>>>invoking methods; you could use plain subroutines if you're sure you'll
>>>never want inheritance etc.
>>>
>>>If you use Inline::C, how you load the data depends entirely on how you
>>>store it; you'll just have to write C routines for loading the data and
>>>call them from Perl.
>>>
>>>F
>>>On Fri, Apr 16, 2004 at 03:08:23PM +0100, Sean O'Riordain wrote:
>>>
>>>
>>>>Hi folks,
>>>>
>>>>I've an analysis program with a couple of million records that i really 
>>>>need to keep in memory, as i need to scan back and forth through them... 
>>>>With 5 million odd records (written as a couple of independent 'arrays', 
>>>>or should i say 'lists') the program requires quite a bit more than the 
>>>>1.5 GB of RAM and becomes very slow due to swapping (gentoo linux)... 
>>>>Each record has 5 integers and a string of max. length 30 chars, but perl 
>>>>takes up extra RAM for each SV...  I would like to be able to handle 
>>>>larger datasets much faster than i currently can...
>>>>
>>>>Has anybody used Inline::C for handling large data structures? If so, 
>>>>how do you load the info?
>>>>
>>>>Anybody used PDL?
>>>>
>>>>Any thoughts which way I should jump?
>>>>
>>>>cheers,
>>>>Sean
