[Pdx-pm] RE: [Pdx-pm] Software development and The Rules

Thu Sep 26 19:13:52 CDT 2002

>>>>> "JK" == Joshua Keroes <jkeroes at eli.net> writes:

JK> Q: What's the fastest, most space-efficient way to sort all
JK>    of Portland's 503 phone numbers? 

Odd, I would have thought Portland to have at least 600 phone
numbers.  :)

It's probably obvious, but I haven't seen anyone point out that these
two criteria ("fastest", probably meaning time-efficient, and "most
space-efficient") are often at odds with each other, even in the
abstract (let alone what happens when you consider the tiered memory
model that our modern machines actually have [1]).  

Anyone who remembers the big, ah, discussions that would occur on
#perl whenever someone asked "what's the best way to do X?"  The
problem, of course, is that you can't answer that question without its
context.  In this case, we'd need to know much time costs, how much
memory costs, and try to find the minimum in the tradeoff.

We also aren't using those terms very precisely: what space are we
trying to save (e.g., disk v. memory typically)?  What time metric are
we trying to minimize (execution, authoring, maintaining)?

In this case, there are all sorts of optimizations you can apply.  But
it's also a bit of a trick question.  Management (or whoever is paying
the bills) doesn't care for the absolute fastest or smallest solution;
rather, they care about feasibility ("can it be done") and economy
("can we do it cheaply / quickly enough to make money off it").

Stupid optimizations:

1. Less RAM: decimal digits easily fit in four bits, so you can fit a
   7-digit value into a 32-bit integer.  Even a 10-digit number fits
   into a pretty short string with this encoding (although it won't
   fit into a 32-bit unsigned value in the raw, sadly.)

2. Less CPU Time: no need to convert to numeric: if you compare raw
   strings (or compressed as above), the fact that USA phone numbers
   are constant-length means that string comparison will do the right
   thing.

3. Less Programmer Time: use 'sort' on the command line.

4. Less CPU time: use radix sort.

5. Less Programmer Time: use a database.  This is notable because it's
   even odds that these numbers are already in a database, which most
   likely already has an in-order index on these numbers.

6. Less RAM: sort small segments of the list, then merge them.

7. Less CPU time (but on more CPUs): bin by first <x> digits, dispatch
   to parallel CPUs, merge sorted result lists.

8. Less CPU time, Less Programmer time, longer Wall-Clock time: wait 6
   months, buy a new CPU for same money, CPU-bound tasks will finish
   20% faster with no further effort on the programmer's part.

And so on.  What's most important to you?

t.

[1] Tiered memory: registers, L1 cache, L2 cache, maybe L3, then main
    RAM, maybe remote memory [think NUMA], then disk, then tape.  That
    last isn't as perverse as it sounds; there is a long history of
    using merge sort with tapes, and programming tasks where you want
    to minimize the number of tape swaps, number of times a bit of
    tape is scanned, all sorts of crazy stuff.