[Phoenix-pm] Hash performance

Metz, Bobby W, WWCS bwmetz at att.com
Mon Jun 12 09:19:10 PDT 2006


Scott, 
	Yes, single-level hash only.

Michael,
	Correct... #2 uses require to load a file generated by another
script, which basically looks like:

$hash{'key'} = 'value';
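
(A fuller sketch of what such a generated file would contain -- the
keys and values here are made up, and since a file pulled in via
require must return a true value, it presumably ends with a trailing
1;:)

$hash{'key1'} = 'value1';
$hash{'key2'} = 'value2';
# ... one line per record, +40K in all ...
1;    # require dies unless the file's last expression is true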

I had considered disk I/O but disregarded it, since method #1 reads the
source file from disk to populate the hash, and that is the same source
file the secondary script reads to generate the files loaded via
require in method #2.  I guess there is the difference of a few
characters per line in the hash notation, but I wouldn't have thought
that would nearly triple the memory used to store the hash.
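
(For comparison, method #1 boils down to something like this -- a
minimal sketch, assuming a tab-delimited source file; the file name and
delimiter are made up:)

open(IN, '<', 'source.dat') or die "Cannot open source.dat: $!";
while (<IN>) {
    chomp;
    my ($key, $value) = split /\t/, $_, 2;   # key, rest of line as value
    $hash{$key} = $value;
}
close(IN);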

Thanks,

Bobby



-----Original Message-----
From: Michael Friedman [mailto:friedman at highwire.stanford.edu]
Sent: Friday, June 09, 2006 8:11 PM
To: Metz, Bobby W, WWCS
Cc: phoenix-pm at pm.org
Subject: Re: [Phoenix-pm] Hash performance


Wait -- for #2 it sounds like you build the hashes and then write them
out to a file (in some other script, I assume), followed by this script
using 'require' to load the previously written files. Is that right?

I would bet, then, that the extra memory and slowness come from
accessing the filesystem. Once you require a file, perl basically
eval()s it into the current context -- doing just about exactly what
you do when you build the hash in the first place. :-) Scott could
probably explain the details; I'm just going on a guess.
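
Roughly, the equivalence is (a sketch, not the exact internals, and
'hash_data.pl' is a made-up name):

    require 'hash_data.pl';

    # ... is close to ...
    do 'hash_data.pl' or die "hash_data.pl failed: ", $@ || $!;
    # plus %INC bookkeeping so the load happens only once; 'do FILE'
    # itself just reads the file and evals the text.

So loading the file costs a read plus a full parse/eval of +40K lines
of Perl, on top of whatever memory the resulting hash needs.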

Personally, I think you did the right thing by benchmarking it. Now  
you know for sure which way is better and you can just rebuild it  
when you use it.

If you want to save the hashes in a file, you might want to check out
the GDBM, MLDBM, or TDB (a new, really fast one) database modules
(GDBM_File, MLDBM + Tie::MLDBM, TDB_File). They all tie to a hash and
let you manage the persistent storage without any effort whatsoever.
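
For example, with GDBM_File (an untested sketch -- the file name is
made up; MLDBM layers on top of a DBM like this to store nested
structures):

    use GDBM_File;

    # Ties %hash to an on-disk GDBM database; lookups and stores go
    # through the file, so all +40K records never sit in memory at once.
    tie my %hash, 'GDBM_File', 'records.gdbm', &GDBM_WRCREAT, 0640
        or die "Cannot tie records.gdbm: $!";

    $hash{'key'} = 'value';      # written straight through to disk
    print $hash{'key'}, "\n";    # read back from the file

    untie %hash;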

-- Mike

On Jun 9, 2006, at 5:38 PM, Metz, Bobby W, WWCS wrote:

> 	This is kind of a follow-up question to my multi-level hash
> post.  Everything I've been reading on-line about how hashes work
> leads me to conclusions that don't seem to pan out in reality, e.g.
> pre-defining the # of hash buckets to increase performance on large
> data sets.  At least, I thought +40K records would be considered
> large...no jokes please.
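>
> (The bucket pre-definition I mean is the lvalue form of keys, e.g.
>
>     keys(%hash) = 50_000;   # hint: preallocate buckets before loading
>
> which is documented to avoid rehashing as the hash grows, but it made
> no real difference for me.)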
> 	So, here's what I've observed using two methods to load +40K
> records into a single level hash.  I have always used method #1 as I
> learned it that way years ago but would love some thoughts around
> whether method #2 might be superior somehow as I know a lot of folks
> that do it that way instead.
>
> Method 1
> + Dynamically build hash from data file at run time.
> + Program load is consistently 3 seconds faster than Method 2.
> + Used 13M of memory to hold the records.
>
> Method 2
> + Used pre-built hashes loaded via "require".
> + Program load is consistently 3 seconds slower than Method 1.
> + Used 36M of memory to hold the records.
>
> 	Any of you know the inner workings of hashes enough to explain
> the difference?  I think the memory increase might have something to
> do with "require" mucking with the usual shared hash table used by
> perl, possibly forcing two copies.  But that's just an uneducated
> guess.  There was no discernible difference in output performance
> using a small test set against the +40K records, only the initial
> program load and total memory consumption.
>
> Thoughts?
>
> Thanks,
>
> Bobby
> _______________________________________________
> Phoenix-pm mailing list
> Phoenix-pm at pm.org
> http://mail.pm.org/mailman/listinfo/phoenix-pm

---------------------------------------------------------------------
Michael Friedman                     HighWire Press
Phone: 650-725-1974                  Stanford University
FAX:   270-721-8034                  <friedman at highwire.stanford.edu>
---------------------------------------------------------------------



