[sf-perl] DB design question
Michael Friedman
friedman at highwire.stanford.edu
Fri Jul 28 20:31:01 PDT 2006
Rich,
After reading the problem definition here, and reading your straw-man
solution, I think you are thinking about this problem differently from
the way you wrote it up.
> Problem
>
> For each new entity that I encounter, I need to determine and
> record its "unusual" attributes and save this in a way that
> will allow me to (later) find other entities which have similar
> sets of unusual attributes.
To me, that says that you are tracking each entity for the *sole
purpose* of clustering them based on similar attributes. (You say
attributes instead of attribute values. I'm assuming you meant
attribute values, but you could just make attributes have binary
values if not.)
However, the design proposed looks to me like a storage medium for
entities with varying attributes. That has a large number of
problems, not the least of which is that, given a completely free-form
attribute table, your SQL queries to find clusters of similar
entities will be ugly and slow. (As others have already pointed out.)
What I would suggest is a matrix instead of relational tables. You
could implement it in tables, but some other way might be more
efficient.
For example, what you are looking for is clusters with similar
attributes. Imagine an Excel spreadsheet with each entity as a single
row and each attribute as a column. The value for each attribute goes
into the proper cell for each entity. To find "every entity with
attribute X between 10 and 20", you sort the matrix on the column for
attribute X and read off everything with a value between 10 and 20.
(Entities without that attribute at all would get undef.)
Sure, it looks like a relational table, but you're also talking about
billions of entities and billions of attributes, so that'll never
work. :-) However, if each entity only has a manageable number of
attributes (100? 10? "reasonable" depends on a lot of context here)
then you could create a sparse hash for the matrix and store that in
some fashion. (MLDBM for a simple start; in the longer term, probably
objects using the Flyweight pattern for both entities and attributes,
in some sort of repository.)
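Here's a minimal core-Perl sketch of that sparse matrix, as a hash of
hashes. All the entity and attribute names are made up; the MLDBM tie
is shown commented out, as an untested sketch of the persistent version:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sparse matrix as a hash of hashes: only the attributes an entity
# actually has take up any space.
my %matrix = (
    ent_a => { attr_x => 12, attr_y => 3 },
    ent_b => { attr_x => 17 },
    ent_c => { attr_z => 1  },
);

# For persistence you could tie the hash instead, e.g. with MLDBM
# (untested sketch -- note MLDBM needs fetch-modify-store for
# updates to nested structures):
#
#   use Fcntl;
#   use MLDBM qw(DB_File Storable);
#   tie my %matrix, 'MLDBM', 'matrix.db', O_CREAT | O_RDWR, 0640
#       or die "tie failed: $!";

# An entity simply has no key for an attribute it lacks (undef).
my $val = $matrix{ent_b}{attr_y};
print defined $val ? $val : 'undef', "\n";   # undef
```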
Using a hash (tied to whatever), you can then cluster by sorting the
hash on $entity{$attribute} and reading out the range of items whose
values fall where you want. Optimizing this sort (or search) would be
the most important part of making it run with billions of items, but
it's also pretty straightforward.
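That sort-then-read-the-range step might look like this (names are
made up; this is just the one-attribute case from the Excel analogy
above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %matrix = (
    ent_a => { attr_x => 12 },
    ent_b => { attr_x => 17 },
    ent_c => { attr_x => 42 },
    ent_d => { attr_y => 99 },   # no attr_x -- skipped below
);

# "Every entity with attr between $lo and $hi": sort the entities
# that have the attribute at all, then grab the ones in range.
sub entities_in_range {
    my ($matrix, $attr, $lo, $hi) = @_;
    my @have   = grep { defined $matrix->{$_}{$attr} } keys %$matrix;
    my @sorted = sort { $matrix->{$a}{$attr} <=> $matrix->{$b}{$attr} }
                 @have;
    return grep {
        my $v = $matrix->{$_}{$attr};
        $v >= $lo && $v <= $hi;
    } @sorted;
}

my @cluster = entities_in_range(\%matrix, 'attr_x', 10, 20);
print "@cluster\n";   # ent_a ent_b
```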
For example, you could walk the hash in sections and cluster each
section separately. If you really are dealing with billions of
entities (Genomic data?) then you could split it across multiple
machines and write a little broker/server package that merges results
together. (I know there's something on CPAN that handles the hard
work of that. POE, maybe?)
A further optimization would be classification. If several attributes
tend to go together, you could wrap them up into a bigger attribute,
thus reducing the actual number of attributes you need to code for.
If attributes are degrees of one another, you could merge them into a
single non-binary-valued attribute, to reduce the number. If a bunch
of entities have the exact same attributes, ignore all but the first
one and keep track of "entity groups" in a separate storage
mechanism. Anything to reduce the number of cells in the matrix would
help with execution speed. But again, that's all really only
necessary once you hit the limits of your first try. Do The Simplest
Thing That Could Possibly Work, then optimize later.
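The "entity groups" idea above is cheap to try. Something like this
(again, all names invented): build a canonical signature for each
entity's attribute set, keep the first entity with each signature as
the representative, and just record who else shares it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %matrix = (
    ent_a => { attr_x => 1, attr_y => 2 },
    ent_b => { attr_x => 1, attr_y => 2 },   # same set as ent_a
    ent_c => { attr_z => 5 },
);

my (%rep, %group);
for my $ent (sort keys %matrix) {
    my $attrs = $matrix{$ent};
    # Canonical signature: sorted attr=value pairs, joined.
    my $sig = join ';', map { "$_=$attrs->{$_}" } sort keys %$attrs;
    if (exists $rep{$sig}) {
        # Duplicate set: record membership, drop from the matrix proper.
        push @{ $group{ $rep{$sig} } }, $ent;
    } else {
        $rep{$sig}   = $ent;   # first seen becomes the representative
        $group{$ent} = [];
    }
}

print "ent_a group: @{ $group{ent_a} }\n";   # ent_b
```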
This has gotten rather long, but without more context, I can only
make vague guesses at an approach. Best of luck with this!
-- Mike Friedman
PS - This is basically "compute a vector in attribute space for each
entity, but only store the non-zero values." If that description
makes more sense, code to it. :-)
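Coding to that vector description, "similar sets of unusual
attributes" becomes an overlap measure on sparse vectors. Jaccard is
one simple choice among several (cosine would work too); the
attribute names are made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two entities as sparse vectors: only non-zero attributes stored.
my %a = ( attr_x => 1, attr_y => 1, attr_z => 1 );
my %b = ( attr_y => 1, attr_z => 1, attr_w => 1 );

# Jaccard index: |shared attributes| / |union of attributes|.
sub jaccard {
    my ($u, $v) = @_;
    my $shared = grep { exists $v->{$_} } keys %$u;
    my %union  = map { $_ => 1 } keys %$u, keys %$v;
    return $shared / keys %union;
}

printf "%.2f\n", jaccard(\%a, \%b);   # 0.50 -- 2 shared of 4 total
```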
PPS - Fulltext search engines do something like this in their Query
By Example feature. If you can find an open source search engine with
QBE, you might be able to snag some code or at least design that
would help with this kind of problem.
On Jul 28, 2006, at 8:31 AM, Rich Morin wrote:
> This is one of those questions where there is no shortage of
> answers, but some are likely to be a lot better (for various
> reasons) than the others. So, I'm hoping for suggestions,
> rationales, etc.
>
> Background
>
> * There are billions of entities (5-10, at present).
>
> * Each entity has a unique name, which could be 100+
> characters in length.
>
> * Each entity has a collection of attributes, drawn from
> a much larger set (could also be in the billions).
>
> * The "signature" of an attribute might be 50 characters.
>
> * I'd like to keep the total storage constrained to (say)
> one KB per entity.
>
> * It's OK (but not necessary) to presume that entity and
> attribute ids can only be used once.
>
> Problem
>
> For each new entity that I encounter, I need to determine and
> record its "unusual" attributes and save this in a way that
> will allow me to (later) find other entities which have similar
> sets of unusual attributes.
>
>
> Discussion
>
> Without presuming any particular database for this project,
> I'll use MySQL syntax to sketch out a "straw man" design:
>
> CREATE TABLE Entities (
> id INT NOT NULL,
> name CHAR(200)
> )
>
> CREATE TABLE Attrs (
> id INT NOT NULL,
> sig CHAR(50),
> count INT NOT NULL
> )
>
> CREATE TABLE Links (
> id_entity INT NOT NULL,
> id_attr INT NOT NULL
> )
>
> Using this setup, I could start with an entity, pick one of
> its attributes that has a low count, and look for other
> entities that have the same attribute. Repeating this for
> several attributes, I could build up a "cluster" of entities
> that have similar, unusual attributes.
>
> Another setup would involve the use of a blob in the Entities
> table, containing (say) 250 id_attr values. This limits my
> ability to use SQL, however, and it also keeps me from doing
> cute lookups in the Links table.
>
> Anyway, I'm open to suggestions...
>
> -r
> --
> http://www.cfcl.com/rdm Rich Morin
> http://www.cfcl.com/rdm/resume rdm at cfcl.com
> http://www.cfcl.com/rdm/weblog +1 650-873-7841
>
> Technical editing and writing, programming, and web development
---------------------------------------------------------------------
Michael Friedman HighWire Press
Phone: 650-725-1974 Stanford University
FAX: 270-721-8034 <friedman at highwire.stanford.edu>
---------------------------------------------------------------------