[sf-perl] DB design question

Fri Jul 28 08:31:59 PDT 2006

This is one of those questions where there is no shortage of
answers, but some are likely to be a lot better (for various
reasons) than the others.  So, I'm hoping for suggestions,
rationales, etc.

Background

  *  There are billions of entities (5-10 billion, at present).

  *  Each entity has a unique name, which could be 100+
     characters in length.

  *  Each entity has a collection of attributes, drawn from
     a much larger set (could also be in the billions).

  *  The "signature" of an attribute might be 50 characters.

  *  I'd like to keep the total storage constrained to (say)
     one KB per entity.

  *  It's OK (but not necessary) to presume that entity and
     attribute ids are never reused.
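
As a rough sanity check on these numbers, here is a back-of-envelope
sketch in Python.  The 100-byte name and the 4-byte attribute id are
assumptions made for the arithmetic, not measured values.

```python
# Back-of-envelope sizing; the figures come from the bullets above,
# except NAME_BYTES and ID_BYTES, which are assumptions.
ENTITIES = 10 * 10**9          # upper end of the 5-10 billion range
BYTES_PER_ENTITY = 1024        # the "(say) one KB" budget
NAME_BYTES = 100               # assumed typical name; some will be longer
ID_BYTES = 4                   # assumes 32-bit attribute ids

total_bytes = ENTITIES * BYTES_PER_ENTITY
ids_per_entity = (BYTES_PER_ENTITY - NAME_BYTES) // ID_BYTES
fits_in_signed_int32 = ENTITIES <= 2**31 - 1

print(f"total storage            : {total_bytes / 10**12:.2f} TB")
print(f"attribute ids per entity : {ids_per_entity}")
print(f"entity ids fit in 32 bits: {fits_in_signed_int32}")
```

Under these assumptions the budget leaves room for a couple of
hundred attribute ids per entity, and the entity count alone rules
out a signed 32-bit id.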

Problem

For each new entity that I encounter, I need to determine and
record its "unusual" attributes, saving them in a way that
will allow me to (later) find other entities which have similar
sets of unusual attributes.

Discussion

Without presuming any particular database for this project,
I'll use MySQL syntax to sketch out a "straw man" design:

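  CREATE TABLE Entities (
    id         BIGINT NOT NULL,   -- INT tops out near 2.1 billion
    name       CHAR(200)
  );

  CREATE TABLE Attrs (
    id         INT NOT NULL,      -- BIGINT if attrs pass 2.1 billion
    sig        CHAR(50),
    count      BIGINT NOT NULL    -- entities having this attribute
  );

  CREATE TABLE Links (
    id_entity  BIGINT NOT NULL,
    id_attr    INT NOT NULL
  );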

Using this setup, I could start with an entity, pick one of
its attributes that has a low count, and look for other
entities that have the same attribute.  Repeating this for
several attributes, I could build up a "cluster" of entities
that have similar, unusual attributes.
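
To make that lookup concrete, here is a minimal sketch using Python's
sqlite3 module in place of MySQL (an assumption, chosen only so the
example is self-contained).  The toy data, the two indexes, and the
names build_demo_db and cluster are all hypothetical, and "unusual"
is taken to mean "count at or below a chosen threshold".

```python
import sqlite3
from collections import Counter

def build_demo_db():
    """Create the three straw-man tables in memory with a little toy data."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Entities (id INTEGER NOT NULL, name TEXT);
        CREATE TABLE Attrs    (id INTEGER NOT NULL, sig TEXT,
                               count INTEGER NOT NULL);
        CREATE TABLE Links    (id_entity INTEGER NOT NULL,
                               id_attr INTEGER NOT NULL);
        -- Hypothetical indexes; the lookups below need both directions.
        CREATE INDEX Links_by_attr   ON Links (id_attr);
        CREATE INDEX Links_by_entity ON Links (id_entity);
    """)
    links = [(1, 10), (1, 11), (1, 12),
             (2, 10), (2, 11),
             (3, 10), (3, 12),
             (4, 10)]
    db.executemany("INSERT INTO Entities VALUES (?, ?)",
                   [(i, f"entity-{i}") for i in (1, 2, 3, 4)])
    db.executemany("INSERT INTO Links VALUES (?, ?)", links)
    # Attrs.count = number of entities carrying each attribute.
    db.execute("""INSERT INTO Attrs
                  SELECT id_attr, 'sig-' || id_attr, COUNT(*)
                  FROM Links GROUP BY id_attr""")
    return db

def cluster(db, entity_id, max_count, how_many):
    """Rank other entities by how many of entity_id's rare attrs they share."""
    rare = db.execute("""
        SELECT a.id FROM Links l JOIN Attrs a ON a.id = l.id_attr
        WHERE l.id_entity = ? AND a.count <= ?
        ORDER BY a.count, a.id LIMIT ?""",
        (entity_id, max_count, how_many)).fetchall()
    shared = Counter()
    for (attr_id,) in rare:
        others = db.execute(
            "SELECT id_entity FROM Links"
            " WHERE id_attr = ? AND id_entity <> ?",
            (attr_id, entity_id))
        for (other,) in others:
            shared[other] += 1
    return shared.most_common()

db = build_demo_db()
print(cluster(db, entity_id=1, max_count=2, how_many=5))
```

In the toy data, attribute 10 is on every entity, so it is skipped as
too common; entities 2 and 3 each share one rare attribute with
entity 1, and entity 4 shares none.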

Another setup would involve the use of a blob in the Entities
table, containing (say) 250 id_attr values.  This limits my
ability to use SQL, however, and it also keeps me from doing
cute lookups in the Links table.
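
For comparison, a sketch of the blob variant in Python, assuming
32-bit unsigned attribute ids packed little-endian (so 250 ids come
to 1000 bytes, about the one-KB budget).  The helper names are
hypothetical.

```python
import struct

ID_FORMAT = "<I"                       # assumption: unsigned 32-bit id
ID_SIZE = struct.calcsize(ID_FORMAT)   # 4 bytes

def pack_attr_ids(attr_ids):
    """Pack attribute ids into a blob, de-duplicated and sorted."""
    ids = sorted(set(attr_ids))
    return struct.pack(f"<{len(ids)}I", *ids)

def unpack_attr_ids(blob):
    """Recover the list of attribute ids from a packed blob."""
    return list(struct.unpack(f"<{len(blob) // ID_SIZE}I", blob))

def shared_attrs(blob_a, blob_b):
    """Attribute ids present in both blobs."""
    return set(unpack_attr_ids(blob_a)) & set(unpack_attr_ids(blob_b))

blob = pack_attr_ids(range(250))
print(len(blob), "bytes for 250 ids")
```

Comparing two known entities is cheap this way, but finding every
entity that carries a given attribute means unpacking every blob,
which is the lookup the Links table gives for free.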

Anyway, I'm open to suggestions...

-r
-- 
http://www.cfcl.com/rdm            Rich Morin
http://www.cfcl.com/rdm/resume     rdm at cfcl.com
http://www.cfcl.com/rdm/weblog     +1 650-873-7841

Technical editing and writing, programming, and web development