[Dub-pm] big data structures relative to memory size

Fergal Daly fergal at esatclear.ie
Fri Apr 16 11:03:32 CDT 2004


If you malloc data in the C code, it will stay malloced until you free it,
whether you go back to Perl in between or not.
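
For example, here's a minimal Inline::C sketch of the persistent C
structure you asked about at the end (untested, and the function names
are just mine) - the statics live in the compiled C code, so they hang
around between calls from Perl:

    use Inline C => q{
        #include <stdlib.h>

        /* these statics persist for the life of the process */
        static int *int1s   = NULL;
        static int count    = 0;
        static int capacity = 0;

        /* append one value, growing the array as needed; returns new count */
        int append_int1(int val) {
            if (count == capacity) {
                capacity = capacity ? capacity * 2 : 1024;
                int1s = realloc(int1s, capacity * sizeof(int));
            }
            int1s[count++] = val;
            return count;
        }

        /* fetch a value back by record index */
        int get_int1(int i) {
            return int1s[i];
        }
    };

    append_int1($_) for 1 .. 10;
    print get_int1(5), "\n";    # prints 6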

Getting data in and out of C within Perl is pretty much the same as getting
it into C on its own, except that you can use bits of Perl too. You could
try something a bit fancy: figure out exactly how much memory you'll be
using and dump all the data into a file, exactly like a big chunk of
memory, making a note of the offset where each array begins, how long it
is, etc. Then you can just mmap the file so that it appears as a chunk of
memory and bingo, your data structures are loaded. The mmap call will give
you the address the file has been mapped to, so you can get the address of
each array inside it by adding its offset.
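
Roughly like this, say (a sketch only, no error handling, and the layout
of five back-to-back int arrays of n entries each is just an assumption):

    use Inline C => q{
        #include <sys/mman.h>
        #include <fcntl.h>
        #include <unistd.h>

        static int *base = NULL;

        /* map the dump file; the "load" is effectively instant */
        int load_dump(char *path, int n_ints) {
            int fd = open(path, O_RDONLY);
            if (fd < 0) return 0;
            base = mmap(NULL, n_ints * sizeof(int), PROT_READ, MAP_SHARED,
                        fd, 0);
            close(fd);    /* the mapping stays valid after close */
            if (base == (int *) MAP_FAILED) {
                base = NULL;
                return 0;
            }
            return 1;
        }

        /* assumed layout: array 0 starts at offset 0, array 1 at n, ... */
        int get_int(int array, int n, int i) {
            return base[array * n + i];
        }
    };

    my $n = 1_500_000;                         # records per array, say
    load_dump("cdr.dump", 5 * $n) or die "mmap failed";
    print get_int(0, $n, 42), "\n";            # int1 of record 42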

F

On Fri, Apr 16, 2004 at 04:54:02PM +0100, Sean O'Riordain wrote:
> Thanks Fergal.
> 
> here is how I currently load from MySQL...
> 
>     my $rec_count = 0;
>     while (my $aref = $sth->fetchrow_hashref) {
>         push @as, $aref->{timestamp_ue};
>         push @ae, $aref->{finish_ue};
>         push @adw, $aref->{dw};
>         push @ahh, $aref->{hh};
>         push @abz, $aref->{bzone};
>         $rec_count++;
>     }
>     print " $rec_count cdr records loaded\n";
> 
> this takes maybe 5 minutes - so I'm not overly worried about that...
> 
> if there isn't a simple way of passing the info to Inline::C, then I was
> thinking of just re-writing all the info to disk in an easily parseable
> format, i.e. fixed-width columns, then doing all the integer work in C
> and writing the results to an output file...
> (currently it takes more than 8 hours at 100% CPU on a 1700MHz Athlon...)
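> 
> something along these lines maybe (untested; 'l4' assumes native 32-bit
> longs, and I'm assuming @abz is the string column and the rest are ints):
> 
>     open my $fh, '>', 'cdr.dat' or die "open: $!";
>     binmode $fh;
>     for my $i (0 .. $#as) {
>         # four ints, then the string space-padded to 30 bytes by A30,
>         # so every record is exactly 46 bytes wide
>         print $fh pack('l4 A30', $as[$i], $ae[$i], $adw[$i], $ahh[$i],
>                        $abz[$i]);
>     }
>     close $fh or die "close: $!";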
> 
> I could speed up the string stuff by using a lookup table, since there 
> are only about 350 different values...
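> 
> roughly this, I think:
> 
>     my (%code_for, @string_for);
>     my @abz_codes;
>     for my $s (@abz) {
>         unless (exists $code_for{$s}) {
>             push @string_for, $s;            # code -> string
>             $code_for{$s} = $#string_for;    # string -> code
>         }
>         push @abz_codes, $code_for{$s};
>     }
>     # ~350 distinct strings become small ints; $string_for[$code] maps back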
> 
> in Inline::C, is it possible to persistently keep a C data structure 
> between calls? i.e. malloc space for my large int arrays, and then append 
> each new record of info from Perl?
> 
> cheers,
> Sean
> 
> 
> Fergal Daly wrote:
> >Not knowing exactly what you have makes it a bit tricky, but if you've
> >got 5 million things looking like
> >
> >	[$int1, $int2, $int3, $int4, $int5, $string]  x 1.5 million
> >
> >then you will save quite a bit by having
> >
> >@int1s = (int x 1.5 million)
> >@int2s = (int x 1.5 million)
> >..
> >@int5s = (int x 1.5 million)
> >@strings = (string x 1.5 million)
> >
> >then just pass around the index. A package like
> >
> >package MyObj;
> >
> >sub new
> >{
> >	my $pkg = shift;
> >	my $index = shift;
> >	return bless \$index, $pkg;
> >}
> >
> >sub getInt1
> >{
> >	my $self = shift;
> >	return $int1s[$$self];
> >}
> >
> >etc...
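> >
> >so code elsewhere just holds a one-scalar object instead of a full
> >record, e.g.
> >
> >	my $obj = MyObj->new(42);
> >	print $obj->getInt1;	# same value as $int1s[42]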
> >
> >Or you could get more memory-efficient: rather than using arrays for the
> >ints, have a string for each set of ints and have methods like
> >
> >sub getInt1
> >{
> >	my $self = shift;
> >
> >	# assume a 4 byte integer
> >	my $enc = substr($int1s, $$self*4, 4);
> >
> >	return unpack("L", $enc);
> >}
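> >
> >(writing back works the same way with 4-arg substr, something like
> >
> >sub setInt1
> >{
> >	my $self = shift;
> >	my $val = shift;
> >
> >	substr($int1s, $$self*4, 4, pack("L", $val));
> >}
> >
> >or build the whole string in one go with pack("L*", @ints) when loading)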
> >
> >You could also do this for the strings. It'll be slower because you'll be
> >invoking methods; you could use plain subroutines if you're sure you'll
> >never want inheritance etc.
> >
> >If you use Inline::C, how you load the data depends entirely on how you
> >store it; you'll just have to write C routines for loading the data and
> >call them from Perl.
> >
> >F
> >On Fri, Apr 16, 2004 at 03:08:23PM +0100, Sean O'Riordain wrote:
> >
> >>Hi folks,
> >>
> >>I've an analysis program with a couple of million records that I really 
> >>need to keep in memory, as I need to scan back and forth etc... With 5 
> >>million odd records (written as a couple of independent 'arrays', or 
> >>should I say 'lists') the program requires quite a bit more than the 
> >>1.5GB of RAM and becomes very slow due to swapping - Gentoo Linux... 
> >>Each record has 5 integers and a string of max. length 30 chars... but 
> >>Perl takes up extra RAM for each SV... I would like to be able to 
> >>handle larger datasets much faster than currently...
> >>
> >>Has anybody used Inline::C for handling large data structures - if so, 
> >>how do you load the info?
> >>
> >>Anybody used PDL?
> >>
> >>Any thoughts which way I should jump?
> >>
> >>cheers,
> >>Sean
> 
> _______________________________________________
> Dublin-pm mailing list - Dublin-pm at mail.pm.org
> http://dublin.pm.org/ - IRC irc.linux.ie #dublin-pm