[sf-perl] Thinking about pseudo-randomness
Josh Berkus
josh at agliodbs.com
Tue Mar 31 15:37:36 PDT 2009
On 3/29/09 3:46 PM, Joe Brenner wrote:
> Josh Berkus<josh at agliodbs.com> wrote:
>
>> I've been thinking a little about the pseudo-randomness issue some, and
>> I think there *is* a generalizable case. However, can you send me your
>> slides/code so that I can build on what you have rather than starting
>> from scratch?
Ok, so I'm thinking about a more complex case than capitalization.
That is, the case which would cover:
song tracks, albums & artists
pictures & tags
people & departments
etc.
All of these can be broken down into:
1) entity
2) group memberships
3) serial placement in the first group (i.e. order)
4) outliers (thanks Joe for this concept)
(4) is a difficult concept, because while 1-3 are permanent data, 4 only
applies to the specific randomizer run (that is, this particular
playlist, picture page or committee). So since it's actually a property
of the run, we'll ignore it for now. For the rest of the data, all of
the above cases can be represented as:
Songs:
("Darkness") ("Up","Peter Gabriel","Rock") (1)
("Growing Up") ("Up","Peter Gabriel","Rock") (2)
("Sky Blue") ("Up","Peter Gabriel","Rock") (3)
("Have a Cigar")("Wish You Were Here","Pink Floyd","Rock") (3)
Pictures:
("GG Bridge Below") ("GG Bridge","Bridges","Architecture","SF") (1)
("GG Bridge East Side") ("GG Bridge","Bridges","Architecture","SF") (2)
People:
("Sam Joe") ("HR","Management","Atlanta","10-20yrs") (N/A)
("Mike Mike") ("HR","Staff","Atlanta","1-3yrs") (N/A)
("Kelly Susan") ("IT","Staff","Livermore","3-5yrs") (N/A)
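The reduction above could be represented in code roughly like this (a Python sketch for brevity; the `Entity` shape and field names are my own illustration, not anything from the slides):

```python
from collections import namedtuple

# Each entity carries its name, its group memberships (the first group
# is the one serial order applies to), and its serial position within
# that first group (None where order doesn't exist, as with People).
Entity = namedtuple("Entity", ["name", "groups", "serial"])

songs = [
    Entity("Darkness",     ("Up", "Peter Gabriel", "Rock"), 1),
    Entity("Growing Up",   ("Up", "Peter Gabriel", "Rock"), 2),
    Entity("Sky Blue",     ("Up", "Peter Gabriel", "Rock"), 3),
    Entity("Have a Cigar", ("Wish You Were Here", "Pink Floyd", "Rock"), 3),
]

people = [
    Entity("Sam Joe",     ("HR", "Management", "Atlanta", "10-20yrs"), None),
    Entity("Mike Mike",   ("HR", "Staff", "Atlanta", "1-3yrs"), None),
    Entity("Kelly Susan", ("IT", "Staff", "Livermore", "3-5yrs"), None),
]
```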
So, there are a couple of interesting things about this reduction.  One
is that in some cases group membership is hierarchical, but I don't
think that matters for picking "prandom" entries.  The second is that
for some data sets serial order doesn't matter or doesn't exist.
I think Joe's try/pick method is sound. I like the idea of regarding
the entities as a stream, and testing each new entity against the prior
entities.
    get (# of tries) items randomly
    foreach:
        if outlier:
            if first item, drop from list
            else if reached max # outliers, drop from list
            else score is (max # of groups / 2)
        else:
            compare against previous # selections:
                each "group" in common = +1
                if 1st group in common:
                    +2 for being adjacent serially
    add item with lowest score to the list
    next item
I think that algorithm works for the stream concept.
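One step of that stream could be sketched like this (Python for brevity; the `Entity` shape, parameter names, and defaults are my own illustration, and `is_outlier` is a caller-supplied predicate):

```python
import random
from collections import namedtuple

Entity = namedtuple("Entity", ["name", "groups", "serial"])

def pick_next(pool, history, is_outlier, tries=5,
              max_outliers=1, lookback=3, max_groups=4):
    """Draw `tries` candidates, score each against the last
    `lookback` selections, and return the lowest-scoring (least
    similar) one, or None if every candidate was dropped."""
    outliers_so_far = sum(1 for e in history if is_outlier(e))
    best, best_score = None, None
    for cand in random.sample(pool, min(tries, len(pool))):
        if is_outlier(cand):
            # Drop outliers at the start of the run or once the quota is hit.
            if not history or outliers_so_far >= max_outliers:
                continue
            score = max_groups / 2
        else:
            score = 0
            for prev in history[-lookback:]:
                # Each group in common = +1.
                score += len(set(cand.groups) & set(prev.groups))
                # Same first group and serially adjacent = +2 more.
                if (cand.groups[0] == prev.groups[0]
                        and cand.serial is not None
                        and prev.serial is not None
                        and abs(cand.serial - prev.serial) == 1):
                    score += 2
        if best_score is None or score < best_score:
            best, best_score = cand, score
    return best
```

The caller would append the returned entity to `history` and repeat for each slot in the playlist.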
The other possibility is the "whole list" concept, where you have a
static small list (like 6 or 10 items) and want them to be prandomly
selected. That would go like this:
    determine max number of groups possible (maxgroups)
    decide mingroups required and max_outliers
    sort items in random order
    if the first item is an outlier, shift it off
    if more than max_outliers in first list_size, shift outliers off
    if two items are serially in sequence, shift the second one off
    sum number of groups represented by first list_size items (numgroups)
    while numgroups < mingroups:
        calculate the item with the most groups
            in common with other items
        shift it off
        shift off serially ordered items and extra outliers
        recalculate numgroups
The problem with this second method is that it's possible for it to
never complete; presumably you'd lower mingroups and start over.
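A sketch of that whole-list method, including the lower-mingroups-and-restart fallback (again Python for brevity; the `Entity` shape, the `max_restarts` cap, and the exact pruning order are my own assumptions):

```python
import random
from collections import namedtuple

Entity = namedtuple("Entity", ["name", "groups", "serial"])

def prandom_list(items, list_size, mingroups, max_outliers, is_outlier,
                 max_restarts=20):
    """Shuffle, prune leading/excess outliers and serial runs, then drop
    the most group-redundant items until the first list_size items cover
    at least mingroups distinct groups.  Lowers mingroups and restarts
    when it can't converge (the non-termination case noted above)."""
    def groups_covered(window):
        return {g for e in window for g in e.groups}

    for _ in range(max_restarts):
        pool = items[:]
        random.shuffle(pool)
        cleaned, outliers = [], 0
        for e in pool:
            # Drop the second of two serially adjacent items.
            if (cleaned and e.serial is not None
                    and cleaned[-1].serial is not None
                    and e.groups[0] == cleaned[-1].groups[0]
                    and e.serial == cleaned[-1].serial + 1):
                continue
            if is_outlier(e):
                # No leading outlier, and cap the total.
                if not cleaned or outliers >= max_outliers:
                    continue
                outliers += 1
            cleaned.append(e)
        while len(cleaned) >= list_size:
            window = cleaned[:list_size]
            if len(groups_covered(window)) >= mingroups:
                return window
            # Shift off the item sharing the most groups with the rest.
            def redundancy(e):
                return sum(len(set(e.groups) & set(o.groups))
                           for o in window if o is not e)
            cleaned.remove(max(window, key=redundancy))
        mingroups = max(1, mingroups - 1)  # lower the bar and start over
    return None
```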
Just my thoughts so far.
--Josh Berkus