[sf-perl] Thinking about pseudo-randomness
Josh Berkus
josh at agliodbs.com
Tue Mar 31 15:37:36 PDT 2009
On 3/29/09 3:46 PM, Joe Brenner wrote:
> Josh Berkus<josh at agliodbs.com> wrote:
>
>> I've been thinking a little about the pseudo-randomness issue some, and
>> I think there *is* a generalizable case. However, can you send me your
>> slides/code so that I can build on what you have rather than starting
>> from scratch?
Ok, so I'm thinking about a more complex case than capitalization.
That is, the case which would cover:
song tracks, albums & artists
pictures & tags
people & departments
etc.
All of these can be broken down into:
1) entity
2) group memberships
3) serial placement in the first group (i.e. order)
4) outliers (thanks Joe for this concept)
(4) is a difficult concept, because while 1-3 are permanent data, 4 only
applies to the specific randomizer run (that is, this particular
playlist, picture page or committee). So since it's actually a property
of the run, we'll ignore it for now. For the rest of the data, all of
the above cases can be represented as:
Songs:
("Darkness") ("Up","Peter Gabriel","Rock") (1)
("Growing Up") ("Up","Peter Gabriel","Rock") (2)
("Sky Blue") ("Up","Peter Gabriel","Rock") (3)
("Have a Cigar")("Wish You Were Here","Pink Floyd","Rock") (3)
Pictures:
("GG Bridge Below") ("GG Bridge","Bridges","Architecture","SF") (1)
("GG Bridge East Side") ("GG Bridge","Bridges","Architecture","SF") (2)
People:
("Sam Joe") ("HR","Management","Atlanta","10-20yrs") (N/A)
("Mike Mike") ("HR","Staff","Atlanta","1-3yrs") (N/A)
("Kelly Susan") ("IT","Staff","Livermore","3-5yrs") (N/A)
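The reduction above could be represented in code roughly like this (a Python sketch for brevity; the `Entity` shape and field names are my own illustration, not anything from the slides):

```python
from collections import namedtuple

# Each entity carries its name, its group memberships (the first group
# is the one serial order applies to), and its serial position within
# that first group (None where order doesn't exist, as with People).
Entity = namedtuple("Entity", ["name", "groups", "serial"])

songs = [
    Entity("Darkness",     ("Up", "Peter Gabriel", "Rock"), 1),
    Entity("Growing Up",   ("Up", "Peter Gabriel", "Rock"), 2),
    Entity("Sky Blue",     ("Up", "Peter Gabriel", "Rock"), 3),
    Entity("Have a Cigar", ("Wish You Were Here", "Pink Floyd", "Rock"), 3),
]

people = [
    Entity("Sam Joe",     ("HR", "Management", "Atlanta", "10-20yrs"), None),
    Entity("Mike Mike",   ("HR", "Staff", "Atlanta", "1-3yrs"), None),
    Entity("Kelly Susan", ("IT", "Staff", "Livermore", "3-5yrs"), None),
]
```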
So, there are a couple of interesting things about this reduction.  One
is that in some cases group membership is hierarchical, but I don't
think that matters for picking "prandom" entries.  The second is that
for some data sets serial order doesn't matter or doesn't exist.
I think Joe's try/pick method is sound. I like the idea of regarding
the entities as a stream, and testing each new entity against the prior
entities.
    get (# of tries) items randomly
    foreach:
        if outlier:
            if first item, drop from list
            else if reached max # outliers, drop from list
            else score is (max # of groups / 2)
        else:
            compare against previous # selections:
                each "group" in common = +1
                if 1st group in common:
                    +2 for being adjacent serially
    add item with lowest score to the list
    next item
I think that algorithm works for the stream concept.
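One step of that stream could be sketched like this (Python for brevity; the `Entity` shape, parameter names, and defaults are my own illustration, and `is_outlier` is a caller-supplied predicate):

```python
import random
from collections import namedtuple

Entity = namedtuple("Entity", ["name", "groups", "serial"])

def pick_next(pool, history, is_outlier, tries=5,
              max_outliers=1, lookback=3, max_groups=4):
    """Draw `tries` candidates, score each against the last
    `lookback` selections, and return the lowest-scoring (least
    similar) one, or None if every candidate was dropped."""
    outliers_so_far = sum(1 for e in history if is_outlier(e))
    best, best_score = None, None
    for cand in random.sample(pool, min(tries, len(pool))):
        if is_outlier(cand):
            # Drop outliers at the start of the run or once the quota is hit.
            if not history or outliers_so_far >= max_outliers:
                continue
            score = max_groups / 2
        else:
            score = 0
            for prev in history[-lookback:]:
                # Each group in common = +1.
                score += len(set(cand.groups) & set(prev.groups))
                # Same first group and serially adjacent = +2 more.
                if (cand.groups[0] == prev.groups[0]
                        and cand.serial is not None
                        and prev.serial is not None
                        and abs(cand.serial - prev.serial) == 1):
                    score += 2
        if best_score is None or score < best_score:
            best, best_score = cand, score
    return best
```

The caller would append the returned entity to `history` and repeat for each slot in the playlist.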
The other possibility is the "whole list" concept, where you have a
static small list (like 6 or 10 items) and want them to be prandomly
selected. That would go like this:
    determine max number of groups possible (maxgroups)
    decide mingroups required and max_outliers
    sort items in random order
    if the first item is an outlier, shift it off
    if more than max_outliers in first list_size, shift outliers off
    if two items are serially in sequence, shift the second one off
    sum number of groups represented by first list_size items (numgroups)
    while numgroups < mingroups:
        calculate the item with the most groups
            in common with other items
        shift it off
        shift off serially ordered items and extra outliers
        recalculate numgroups
The problem with this second method is that it's possible for it to
never complete; presumably you'd lower mingroups and start over.
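A sketch of that whole-list method, including the lower-mingroups-and-restart fallback (again Python for brevity; the `Entity` shape, the `max_restarts` cap, and the exact pruning order are my own assumptions):

```python
import random
from collections import namedtuple

Entity = namedtuple("Entity", ["name", "groups", "serial"])

def prandom_list(items, list_size, mingroups, max_outliers, is_outlier,
                 max_restarts=20):
    """Shuffle, prune leading/excess outliers and serial runs, then drop
    the most group-redundant items until the first list_size items cover
    at least mingroups distinct groups.  Lowers mingroups and restarts
    when it can't converge (the non-termination case noted above)."""
    def groups_covered(window):
        return {g for e in window for g in e.groups}

    for _ in range(max_restarts):
        pool = items[:]
        random.shuffle(pool)
        cleaned, outliers = [], 0
        for e in pool:
            # Drop the second of two serially adjacent items.
            if (cleaned and e.serial is not None
                    and cleaned[-1].serial is not None
                    and e.groups[0] == cleaned[-1].groups[0]
                    and e.serial == cleaned[-1].serial + 1):
                continue
            if is_outlier(e):
                # No leading outlier, and cap the total.
                if not cleaned or outliers >= max_outliers:
                    continue
                outliers += 1
            cleaned.append(e)
        while len(cleaned) >= list_size:
            window = cleaned[:list_size]
            if len(groups_covered(window)) >= mingroups:
                return window
            # Shift off the item sharing the most groups with the rest.
            def redundancy(e):
                return sum(len(set(e.groups) & set(o.groups))
                           for o in window if o is not e)
            cleaned.remove(max(window, key=redundancy))
        mingroups = max(1, mingroups - 1)  # lower the bar and start over
    return None
```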
Just my thoughts so far.
--Josh Berkus