Just for fun

Brent Fulgham brent.fulgham at xpsystems.com
Tue Jul 25 18:48:27 CDT 2000


At the risk of revealing my lame level of Perl skills, but in the
interest of sparking conversation, I submit the following Perl
success story:

I agreed to help my (relatively) computer-illiterate father convert
some genealogical data he had received as a DBase III file to a
format his software could use.

Naturally, I selected Perl as my tool of choice.

The GEDCOM format consist of records for individuals (i.e., humans) as 
well as families.  The families (obviously) have all kinds of
possible links that might occur between them.  (And contrary to
popular opinion, there actually ARE quite a few branches in my
family tree ;-)

My first cut was to model this as a set of nested arrays (or arrays
holding references to arrays).  This was fairly slow (took 30 minutes
or so to run on an 8000 entry file), and turned out to be incorrect.
I had not modelled all the interrelationships properly, and so I
had the frightening result of kids who were shown as being their own
parents, etc.

Next I searched CPAN a bit, and located the Tree-DAG_Node class.
This allowed me to move away from trying to handle all the possible
parent-child relationships, making my job easier.  

The problem was this was extremely slow.  After 30 minutes of running
to reach the 30% point, I stopped and tried to speed things up.

Being a C/C++ programmer by day, my first thought was to implement a 
binary search algorithm to locate nodes (since the DAG_Node class 
doesn't provide searching routines).  I should point out that nodes are
not ordered by their value -- you might have parent 5 with children 
2000, 129, 2, and 3.

But the binary-search slowed things even further than the simple
iterative search I had used before.

So, I finally realized something that is probably obvious to the
more expert Perl users -- the Hash data type solves this problem.

Rather than search through the entire tree for a particular parent
node, I stored a reference to it in a hash (keyed by the node's
name).  If I added more nodes, I simply stored them in the hash
(after defining their links to the other nodes).

Imagine my shock to see that the program whipped through all 8000
elements in about 30 seconds.  Quite an improvement!

So I guess the moral is that good selection of the data type can
make all the difference in the world.  :-)

Do any of you have other tips on how to create speedy Perl programs?

Thanks,

-Brent



More information about the Thousand-oaks-pm mailing list