[sf-perl] DB design question
Rich Morin
rdm at cfcl.com
Fri Jul 28 08:31:59 PDT 2006
This is one of those questions where there is no shortage of
answers, but some are likely to be a lot better (for various
reasons) than the others. So, I'm hoping for suggestions,
rationales, etc.
Background
* There are billions of entities (5-10, at present).
* Each entity has a unique name, which could be 100+
characters in length.
* Each entity has a collection of attributes, drawn from
a much larger set (could also be in the billions).
* The "signature" of an attribute might be 50 characters.
* I'd like to keep the total storage constrained to (say)
one KB per entity.
* It's OK (but not necessary) to presume that entity and
attribute ids can only be used once.
Problem
For each new entity that I encounter, I need to determine and
record its "unusual" attributes and save this in a way that
will allow me to (later) find other entities which have similar
sets of unusual attributes.
Discussion
Without presuming any particular database for this project,
I'll use MySQL syntax to sketch out a "straw man" design:
CREATE TABLE Entities (
id INT NOT NULL,
name CHAR(200)
)
CREATE TABLE Attrs (
id INT NOT NULL,
sig CHAR(50),
count INT NOT NULL
)
CREATE TABLE Links (
id_entity INT NOT NULL,
id_attr INT NOT NULL
)
Using this setup, I could start with an entity, pick one of
its attributes that has a low count, and look for other
entities that have the same attribute. Repeating this for
several attributes, I could build up a "cluster" of entities
that have similar, unusual attributes.
Another setup would involve the use of a blob in the Entities
table, containing (say) 250 id_attr values. This limits my
ability to use SQL, however, and it also keeps me from doing
cute lookups in the Links table.
Anyway, I'm open to suggestions...
-r
--
http://www.cfcl.com/rdm Rich Morin
http://www.cfcl.com/rdm/resume rdm at cfcl.com
http://www.cfcl.com/rdm/weblog +1 650-873-7841
Technical editing and writing, programming, and web development
More information about the SanFrancisco-pm
mailing list