[sf-perl] DB design question
Michael Friedman
friedman at highwire.stanford.edu
Fri Jul 28 20:31:01 PDT 2006
Rich,
After reading the problem definition here, and reading your straw-man
solution, I think you are thinking about this problem differently from
the way you wrote it up.
> Problem
>
> For each new entity that I encounter, I need to determine and
> record its "unusual" attributes and save this in a way that
> will allow me to (later) find other entities which have similar
> sets of unusual attributes.
To me, that says that you are tracking each entity for the *sole
purpose* of clustering them based on similar attributes. (You say
attributes instead of attribute values. I'm assuming you meant
attribute values, but you could just make attributes have binary
values if not.)
However, the design proposed looks to me like a storage medium for
entities with varying attributes. That has a large number of
problems, not the least of which is that, given a completely free-form
attribute table, your SQL queries to find clusters of similar
entities will be ugly and slow. (As others have already pointed out.)
What I would suggest is a matrix instead of relational tables. You
could implement it in tables, but some other way might be more
efficient.
For example, what you are looking for is clusters with similar
attributes. Imagine an Excel spreadsheet with each entity as a single
row and each attribute as a column. The value for each attribute goes
into the proper cell for each entity. To find "every entity with
attribute X between 10 and 20", you sort the matrix on the column for
attribute X and read off everything with a value between 10 and 20.
(Entities without that attribute at all would get undef.)
Sure, it looks like a relational table, but you're also talking about
billions of entities and billions of attributes, so that'll never
work. :-) However, if each entity only has a manageable number of
attributes (100? 10? "reasonable" depends on a lot of context here)
then you could create a sparse hash for the matrix and store that in
some fashion. (MLDBM for a simple start; in the longer term, probably
objects using the Flyweight pattern for both entities and attributes,
in some sort of repository.)
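Here's a minimal core-Perl sketch of that sparse matrix, as a hash of
hashes. All the entity and attribute names are made up; the MLDBM tie
is shown commented out, as an untested sketch of the persistent version:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sparse matrix as a hash of hashes: only the attributes an entity
# actually has take up any space.
my %matrix = (
    ent_a => { attr_x => 12, attr_y => 3 },
    ent_b => { attr_x => 17 },
    ent_c => { attr_z => 1  },
);

# For persistence you could tie the hash instead, e.g. with MLDBM
# (untested sketch -- note MLDBM needs fetch-modify-store for
# updates to nested structures):
#
#   use Fcntl;
#   use MLDBM qw(DB_File Storable);
#   tie my %matrix, 'MLDBM', 'matrix.db', O_CREAT | O_RDWR, 0640
#       or die "tie failed: $!";

# An entity simply has no key for an attribute it lacks (undef).
my $val = $matrix{ent_b}{attr_y};
print defined $val ? $val : 'undef', "\n";   # undef
```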
Using a hash (tied to whatever), you can then cluster by sorting the
hash on $entity{$attribute} and reading out the range of items whose
values fall where you want. Optimizing this sort (or search) would be
the most important part of making it run with billions of items, but
it's also pretty straightforward.
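That sort-then-read-the-range step might look like this (names are
made up; this is just the one-attribute case from the Excel analogy
above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %matrix = (
    ent_a => { attr_x => 12 },
    ent_b => { attr_x => 17 },
    ent_c => { attr_x => 42 },
    ent_d => { attr_y => 99 },   # no attr_x -- skipped below
);

# "Every entity with attr between $lo and $hi": sort the entities
# that have the attribute at all, then grab the ones in range.
sub entities_in_range {
    my ($matrix, $attr, $lo, $hi) = @_;
    my @have   = grep { defined $matrix->{$_}{$attr} } keys %$matrix;
    my @sorted = sort { $matrix->{$a}{$attr} <=> $matrix->{$b}{$attr} }
                 @have;
    return grep {
        my $v = $matrix->{$_}{$attr};
        $v >= $lo && $v <= $hi;
    } @sorted;
}

my @cluster = entities_in_range(\%matrix, 'attr_x', 10, 20);
print "@cluster\n";   # ent_a ent_b
```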
For example, you could walk the hash in sections and cluster each
section separately. If you really are dealing with billions of
entities (Genomic data?) then you could split it across multiple
machines and write a little broker/server package that merges results
together. (I know there's something on CPAN that handles the hard
work of that. POE, maybe?)
A further optimization would be classification. If several attributes
tend to go together, you could wrap them up into a bigger attribute,
thus reducing the actual number of attributes you need to code for.
If attributes are degrees of one another, you could merge them into a
single non-binary-valued attribute, to reduce the number. If a bunch
of entities have the exact same attributes, ignore all but the first
one and keep track of "entity groups" in a separate storage
mechanism. Anything to reduce the number of cells in the matrix would
help with execution speed. But again, that's all really only
necessary once you hit the limits of your first try. Do The Simplest
Thing That Could Possibly Work, then optimize later.
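The "entity groups" idea above is cheap to try. Something like this
(again, all names invented): build a canonical signature for each
entity's attribute set, keep the first entity with each signature as
the representative, and just record who else shares it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %matrix = (
    ent_a => { attr_x => 1, attr_y => 2 },
    ent_b => { attr_x => 1, attr_y => 2 },   # same set as ent_a
    ent_c => { attr_z => 5 },
);

my (%rep, %group);
for my $ent (sort keys %matrix) {
    my $attrs = $matrix{$ent};
    # Canonical signature: sorted attr=value pairs, joined.
    my $sig = join ';', map { "$_=$attrs->{$_}" } sort keys %$attrs;
    if (exists $rep{$sig}) {
        # Duplicate set: record membership, drop from the matrix proper.
        push @{ $group{ $rep{$sig} } }, $ent;
    } else {
        $rep{$sig}   = $ent;   # first seen becomes the representative
        $group{$ent} = [];
    }
}

print "ent_a group: @{ $group{ent_a} }\n";   # ent_b
```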
This has gotten rather long, but without more context, I can only
make vague guesses at an approach. Best of luck with this!
-- Mike Friedman
PS - This is basically "compute a vector in attribute space for each
entity, but only store the non-zero values." If that description
makes more sense, code to it. :-)
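Coding to that vector description, "similar sets of unusual
attributes" becomes an overlap measure on sparse vectors. Jaccard is
one simple choice among several (cosine would work too); the
attribute names are made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two entities as sparse vectors: only non-zero attributes stored.
my %a = ( attr_x => 1, attr_y => 1, attr_z => 1 );
my %b = ( attr_y => 1, attr_z => 1, attr_w => 1 );

# Jaccard index: |shared attributes| / |union of attributes|.
sub jaccard {
    my ($u, $v) = @_;
    my $shared = grep { exists $v->{$_} } keys %$u;
    my %union  = map { $_ => 1 } keys %$u, keys %$v;
    return $shared / keys %union;
}

printf "%.2f\n", jaccard(\%a, \%b);   # 0.50 -- 2 shared of 4 total
```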
PPS - Fulltext search engines do something like this in their Query
By Example feature. If you can find an open source search engine with
QBE, you might be able to snag some code or at least design that
would help with this kind of problem.
On Jul 28, 2006, at 8:31 AM, Rich Morin wrote:
> This is one of those questions where there is no shortage of
> answers, but some are likely to be a lot better (for various
> reasons) than the others. So, I'm hoping for suggestions,
> rationales, etc.
>
> Background
>
> * There are billions of entities (5-10, at present).
>
> * Each entity has a unique name, which could be 100+
> characters in length.
>
> * Each entity has a collection of attributes, drawn from
> a much larger set (could also be in the billions).
>
> * The "signature" of an attribute might be 50 characters.
>
> * I'd like to keep the total storage constrained to (say)
> one KB per entity.
>
> * It's OK (but not necessary) to presume that entity and
> attribute ids can only be used once.
>
> Problem
>
> For each new entity that I encounter, I need to determine and
> record its "unusual" attributes and save this in a way that
> will allow me to (later) find other entities which have similar
> sets of unusual attributes.
>
>
> Discussion
>
> Without presuming any particular database for this project,
> I'll use MySQL syntax to sketch out a "straw man" design:
>
> CREATE TABLE Entities (
> id INT NOT NULL,
> name CHAR(200)
> )
>
> CREATE TABLE Attrs (
> id INT NOT NULL,
> sig CHAR(50),
> count INT NOT NULL
> )
>
> CREATE TABLE Links (
> id_entity INT NOT NULL,
> id_attr INT NOT NULL
> )
>
> Using this setup, I could start with an entity, pick one of
> its attributes that has a low count, and look for other
> entities that have the same attribute. Repeating this for
> several attributes, I could build up a "cluster" of entities
> that have similar, unusual attributes.
>
> Another setup would involve the use of a blob in the Entities
> table, containing (say) 250 id_attr values. This limits my
> ability to use SQL, however, and it also keeps me from doing
> cute lookups in the Links table.
>
> Anyway, I'm open to suggestions...
>
> -r
> --
> http://www.cfcl.com/rdm Rich Morin
> http://www.cfcl.com/rdm/resume rdm at cfcl.com
> http://www.cfcl.com/rdm/weblog +1 650-873-7841
>
> Technical editing and writing, programming, and web development
---------------------------------------------------------------------
Michael Friedman HighWire Press
Phone: 650-725-1974 Stanford University
FAX: 270-721-8034 <friedman at highwire.stanford.edu>
---------------------------------------------------------------------