[LA.pm] Anthony Curtis presents Perl Stored Procedures for MySQL

Ben Tilly btilly at gmail.com
Wed Aug 19 12:36:24 PDT 2009


2009/8/19 David Fetter <david at fetter.org>:
> On Wed, Aug 19, 2009 at 11:22:21AM -0700, Aran Deltac wrote:
[...]
>> > Jim Gray
>> > <http://en.wikipedia.org/wiki/Jim_Gray_%28computer_scientist%29>
>> > measured this back in 2003, and those metrics have moved even further
>> > toward his conclusion, which was essentially, "do all the processing
>> > you can as close to where the data lives as you can arrange it."
>> >
>> > http://research.microsoft.com/apps/pubs/default.aspx?id=70001
>>
>> Good info, thanks.
>
> Clearly you didn't actually read it, or if you did, you didn't
> understand it.

Are you sure?  Because I read it, and I came to a very different
conclusion than you did.  The conclusion I came to is that there is a
world of difference between distributing information across a LAN and
distributing it across the Internet or a WAN.  In particular, his
figures show that the cost of reading data from a local hard drive is
the same as the cost of sending it across the LAN.  So in the standard
client-talks-to-webserver-talks-to-database scenario, the economics
say that pre-processing in the database is not a big win over sending
the raw data across and processing it in the webserver.

This assumes, of course, an apples-to-apples comparison in what you're
doing.  There are many, many factors that can tip the decision one way
or the other.

>> But, I have to agree, the less you do *in* the database, and the
>> more you can shrug off processing to other parts of the system, the
>> better.
>
> The more processing you do *as close as possible* to where the data
> is actually stored, the better off you are.  Read the paper.

I did.  His figures indicate that $1 will buy you 10 TB of local disk
access, and 10 TB of local LAN access.  They cost the same amount.  So
moving data elsewhere to process it and then moving it back is an
economic loss (though you may want to do that for performance
reasons).  And if the data is moving from database to webserver to
client anyway, moving where the processing happens is (assuming all
else is equal - sometimes a big assumption) economically neutral.

>> I like to treat my database as a very fast flat file storage engine
>> that does very little processing for me.
>
> Yes, that's a common mistake, but that it's common doesn't make it not
> be a mistake.  OO coders are especially prone to this mistake, but
> it's far from unknown among other kinds of coders who don't understand
> what an RDBMS is or what it does.

To clarify, the common mistake is that an OO coder will treat the
database as very fast flat file storage, and then do the equivalent of
a join in code.  If you push that work to the database, it will often
come up with a query plan that is a much better algorithm than the one
the OO coder would have written.  So by moving the processing to the
database you're better off in this case, but that's because databases
are designed to automatically come up with good algorithms, not
because of the intrinsic economics of the situation.
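
To make that concrete, here is a rough sketch in Perl/DBI (the
customers/orders tables, columns, and connection details are all made
up for illustration):

  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('dbi:mysql:database=shop', 'user', 'pass',
                         { RaiseError => 1 });

  # The "flat file" anti-pattern: pull both tables across the wire
  # and stitch them together by hand in Perl.
  my %name_for = map { $_->[0] => $_->[1] }
      @{ $dbh->selectall_arrayref('SELECT id, name FROM customers') };
  my %spent;
  for my $row (@{ $dbh->selectall_arrayref(
          'SELECT customer_id, total FROM orders') }) {
      $spent{ $name_for{ $row->[0] } } += $row->[1];
  }

  # Pushing the work down: the optimizer chooses the join and
  # aggregation strategy, and only the summary rows come back.
  my $summary = $dbh->selectall_arrayref(q{
      SELECT c.name, SUM(o.total) AS spent
      FROM customers c JOIN orders o ON o.customer_id = c.id
      GROUP BY c.name
  });

The second version typically wins not because the data stayed put, but
because the planner picks a decent join algorithm for you.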

That's one trade-off arguing that we should push work to the database.

An argument for the simple, fast, flat-file style of storage is that
it fits well with putting a memcached layer in, which moves the data
to where the processing happens.
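
For instance, something along these lines (Cache::Memcached with an
invented key scheme and query, just to sketch the shape of it):

  use strict;
  use warnings;
  use Cache::Memcached;
  use DBI;

  my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });
  my $dbh  = DBI->connect('dbi:mysql:database=shop', 'user', 'pass',
                          { RaiseError => 1 });

  # Read-through cache: check memcached first, hit the database only
  # on a miss, and keep the result near the webserver for next time.
  sub customer_name {
      my ($id) = @_;
      my $key  = "customer_name:$id";
      my $name = $memd->get($key);
      return $name if defined $name;

      ($name) = $dbh->selectrow_array(
          'SELECT name FROM customers WHERE id = ?', undef, $id);
      $memd->set($key, $name, 300);    # cache for five minutes
      return $name;
  }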

Another argument for moving processing to the database is that your
business rules then get enforced consistently across multiple
applications.
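
A minimal sketch of that idea (made-up table, rule, and procedure
name, assuming a MySQL version with stored procedures, driven through
DBI):

  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('dbi:mysql:database=shop', 'user', 'pass',
                         { RaiseError => 1 });

  # One copy of the pricing rule, kept next to the data, so every
  # client (Perl, PHP, cron jobs, ...) goes through the same logic.
  $dbh->do(q{
      CREATE PROCEDURE place_order(IN p_customer_id INT,
                                   IN p_amount DECIMAL(10,2))
      BEGIN
          -- Hypothetical business rule: orders over 100 get 5% off.
          IF p_amount > 100 THEN
              SET p_amount = p_amount * 0.95;
          END IF;
          INSERT INTO orders (customer_id, total)
          VALUES (p_customer_id, p_amount);
      END
  });

  # Any application then calls the procedure rather than duplicating
  # the rule in its own code.
  $dbh->do('CALL place_order(?, ?)', undef, 42, 120.00);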

Proponents of moving the processing off of the database can point out
that CPU availability on the database server is one of the classic
scalability bottlenecks.  Webservers are much easier to add than
database servers, so moving processing to them lets you scale better.

And so the argument goes back and forth with a variety of subtle
trade-offs.  The real point that we should draw from this is not that
processing should always be pushed to the webserver or the database,
but that there are a lot of trade-offs we need to understand to make a
good choice in any particular case.

[...]
> Everybody's entitled to an *informed* opinion.  The thing is, Jim Gray
> went out and measured in a technology-agnostic way, and he came to the
> opposite conclusion.
>
> Perhaps you'll go out and measure something different.  It'll be worth
> your very own Turing award if you manage to overturn the result in
> that paper.

Please step back, re-read the paper, then read what I have said and
see how they compare.  Because from where I sit it appears that you've
oversimplified the message of the paper to an inappropriate degree.

For example, read the Caveats section, which starts off:

  Beowulf clusters have completely different networking
  economics.  Render farms, materials simulation, and CFD
  fit beautifully on Beowulf clusters because there the
  cost of networking is very inexpensive: a GBps Ethernet
  fabric costs about 200$/port and delivers 50MBps, so
  Beowulf networking costs are comparable to disk
  bandwidth costs – 10,000 times less than the price of
  Internet transports.

Note that a typical website setup uses the same networking technology
as a Beowulf cluster, and therefore its economics match those of a
Beowulf cluster.  So the conclusions of that paper no more apply to a
typical website than they do to a Beowulf cluster.  *UNLESS*, of
course, the website has made the mistake of interactively moving data
over a WAN.

Cheers,
Ben

