[Wellington-pm] Perl database stuff....

Sam Vilain sam at vilain.net
Sun Jan 7 21:49:18 PST 2007


Jacinta Richardson wrote:
>> Surely just wrapping the entire operation in a transaction is enough?
>> Or is that one of those cases that needs the TRANSACTION ISOLATION LEVEL
>> set to max?
>>     
>
> Transactions don't solve the problem, regardless of their isolation level 
> (unless your levels mean something other than what I think they do).  The race 
> condition exists because the underlying operating system handles multiple 
> processes all at the same time.  As far as the operating system is concerned, 
> the database, scripts accessing the database, and anything else on the machine 
> are just processes.  Each process deserves time slices to do their things in a 
> timely fashion.  These are handled due to various algorithms but essentially 
> each process goes into a queue.  The OS takes one out from the front of the 
> queue, sets up its memory, runs its instructions for a while and then swaps 
> its used memory out to disk (if necessary) and moves that to the back of the 
> queue.  It then takes the next one out of the queue and processes that.  
>   

In summary, this is exactly the problem that ACID was designed to solve.
(see, eg, http://en.wikipedia.org/wiki/ACID)

> This 
> means that sometimes (in rare but possible cases) two or more processes can 
> collide and do the unexpected.  For example consider the following:
>
> 	Process A selects stuff for business key X - it's not there
> 	Operating system swaps Process A to memory/disk
> 	Process B selects stuff for business key X - it's not there
> 	Process B selects the sequence next val
> 	Process B inserts stuff into the database
> 	Operating system swaps Process B to memory/disk
> 	Process A selects the sequence next val
> 	Process A inserts its stuff into the database for key X
>
> Now you have either an insertion failure (if you have unique indices) or a 
> duplicate entry.  If Process A is using transactions and you do have unique 
> indices (separate from the id) then the transaction fails because the insert 
> fails.  If you don't have a separate key then the transaction succeeds because 
> failure isn't detected.
>   

To answer my own (earlier) question, the particular problem you refer to
is known as "phantom read" (see
http://www.postgresql.org/docs/8.2/interactive/transaction-iso.html). If
you use the SERIALIZABLE transaction isolation level, then the
transaction should protect you.
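
For what it's worth, here's a minimal sketch of turning that on from
Perl with plain DBI and DBD::Pg (the connection details are obviously
made up):

    use DBI;

    # connect with AutoCommit off so statements run inside transactions
    my $dbh = DBI->connect(
        'dbi:Pg:dbname=example', 'someuser', 'somepass',
        { AutoCommit => 0, RaiseError => 1 },
    );

    # make every transaction on this handle run at SERIALIZABLE
    $dbh->do(q{SET SESSION CHARACTERISTICS AS
               TRANSACTION ISOLATION LEVEL SERIALIZABLE});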

The marginal extra overhead of SERIALIZABLE is only needed for this
usage pattern, though; if you did it this way instead:

Process A starts a transaction
Process A selects the sequence next val
Operating system swaps Process A to memory/disk
Process B starts a transaction
Process B selects the sequence next val
Process B inserts stuff into the database
Operating system swaps Process B to memory/disk
Process A inserts its stuff into the database
Process B commits
Process A commits

Then you would not suffer from a potential phantom read. Note that B
commits a higher sequence value than A, and does so before A commits.
This is intentional, and is why you use sequences rather than
auto_increment; it's more scalable. It also means that in the event of
a rollback there will be missing numbers: sequences are atomic, and
don't roll back with transactions.
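
In DBI terms that pattern looks something like this - the item table
and the item_id_seq sequence are invented names, and I'm reusing the
$dbh with AutoCommit off from the sketch above:

    # grab the next id; nextval() is atomic and is never rolled back,
    # so aborting the transaction just leaves a gap in the numbering
    my ($id) = $dbh->selectrow_array(q{SELECT nextval('item_id_seq')});

    # nothing here depends on the *absence* of rows for key X
    $dbh->do(
        q{INSERT INTO item (id, description) VALUES (?, ?)},
        undef, $id, 'a shiny new widget',
    );

    $dbh->commit;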

Let's look at the pattern you describe, and how the SERIALIZABLE
isolation level solves the problem:

Process A starts a transaction
Process A selects the highest value from the table and computes the next key
Operating system swaps Process A to memory/disk
Process B starts a transaction
Process B selects the highest value from the table and computes the next key
Process B inserts stuff into the database
Operating system swaps Process B to memory/disk
Process A inserts its stuff into the database
Process B commits
Process A tries to commit

In this case, when Process A tries to commit it will be forced to roll
back. Process B locked the phantom rows off the end of the index as a
part of its transaction, and so process A cannot then commit; it has to
replay.
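
From the application's point of view that forced rollback just shows
up as an error from the insert or the commit; with Postgres you can
recognise it by its SQLSTATE of 40001 ("serialization failure") and
know that it is safe to replay. Roughly:

    eval {
        # ... the selects and inserts for the transaction ...
        $dbh->commit;
    };
    if ($@) {
        # 40001 is Postgres' "could not serialize access" state
        if (($dbh->state || '') eq '40001') {
            eval { $dbh->rollback };
            # safe to replay the whole transaction from the top
        }
        else {
            die $@;    # some other problem; don't retry blindly
        }
    }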

In the case where the information to be inserted was derived from rows
that were actually present - and selected during the course of the
transaction - process B will block when it tries to select those
values. This is the disadvantage of the REPEATABLE READ level: it's a
trade-off that means you don't need to write transaction replay
functionality, but with the side-effect that two concurrent
transactions interfere with each other, at least in time. That being
said, you don't save yourself completely - if two transactions lock
rows in different orders you can end up with a deadlock, and then they
both lose.

At least, those are the assumptions I've always worked with. I sure
hope I got them right ;).

> If I am using an id field, I set it to "auto-increment" if the 
> database allows.  Most of the time these days though, I just use Class::DBI 
> (probably soon to change to DBIx::Class).  I don't know what its policy is in 
> this regard, but I'm happy enough to trust it.
>   

None of these toolkits will simply save you. You should be conscious
of at least the "Dirty Read" problem and, ideally, make sure that your
application is prepared to handle the "Phantom Read" problem, or is
engineered to avoid it.

Here's a few simple words of advice for writing safe transactions:

1. always start a transaction at the beginning of your "request" cycle
(to get ACID support)

2. never use information from a previous transaction as input for an
update (to avoid "Dirty Read"), or information from another database handle.

3. build your application so that you can replay transactions when the
database signals a rollback (this is what lets you avoid "Phantom
Read"), and turn on serializable isolation - or make sure you never
allow the *absence* of rows to influence the committed result of a
transaction, *and* select/insert things in a common order so that
deadlocks don't abort transactions for you. (There's a sketch of the
replay approach just after this list.)

4. don't hold transactions open too long.
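
To make point 3 concrete, here's a rough sketch of such a replay
wrapper in plain DBI. The txn_do name is a nod to DBIx::Class's helper
of the same name, which runs a coderef inside a transaction in much
the same way; the three-attempt retry policy here is just an example:

    # run $work inside a transaction, replaying it a few times if the
    # database aborts it with a serialization failure (SQLSTATE 40001)
    sub txn_do {
        my ($dbh, $work) = @_;
        my $attempts = 3;
        for my $try (1 .. $attempts) {
            my @result = eval {
                my @r = $work->($dbh);
                $dbh->commit;
                @r;
            };
            return @result unless $@;
            eval { $dbh->rollback };
            die $@ unless ($dbh->state || '') eq '40001';
        }
        die "transaction still failing after $attempts attempts\n";
    }

    # usage: everything inside the coderef must be safe to replay
    txn_do($dbh, sub {
        my ($dbh) = @_;
        # selects and inserts go here
    });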

Now, points 4 and 2 have some interesting implications - for instance,
what about information that you send to the user to review and edit,
then when they hit "save", commit to the database? From an overall
perspective, the user is locking the rows, making changes, and
committing them to the database. Therefore, a simple and technically
correct (but practically flawed) way to do this is to make sure that
the user always uses the same database handle, and simply leave the
transaction running while they edit. However, this violates principle 4
above, and when you get enough users you can run into capacity problems
(ie, run out of rollback space). So, one good solution is to have a
(hidden from the user) "object version" field, which is sent to the
user with the rest of
their information, and confirmed to match the row during the update
cycle. An alternative that does not require an extra column is for the
user to send back the complete original row that they were editing (or
a hash of it), which is compared against the current row before the
update is committed.
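
A minimal sketch of that version check in plain DBI - the item table,
its version column and the sub name are all invented for illustration:

    # returns true on success, false if somebody else saved first
    sub save_item {
        my ($dbh, $id, $version_from_form, $new_description) = @_;

        # the update matches zero rows if anyone bumped the version
        # since the user loaded the form
        my $rows = $dbh->do(
            q{UPDATE item
                 SET description = ?, version = version + 1
               WHERE id = ? AND version = ?},
            undef, $new_description, $id, $version_from_form,
        );

        if ($rows == 0) {
            $dbh->rollback;
            return;    # caller should re-show the form with fresh data
        }

        $dbh->commit;
        return 1;
    }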

> in most of them it's probably less than 3.  This kind of a thing is a nightmare 
> when creating items for an online store (or CMS) -- where a number of 
> administrators might try to add the exact same item -- but rarely a problem when 
> handling a shopping cart, because most people *expect* duplicate items to appear 
> if they buy the same item in the same session from two different pages.  

Dirty read gives you behaviour like the Smoke CDs site, where if you
add a CD to your basket in two windows at nearly the same time, you
might get only one in your basket at the end. In their case it meant
that a potential customer, irritated at the poor performance of their
site, went to $local_cd_store instead.

> Thus in my situations the chances of hitting this kind of race condition are sufficiently 
> low that I was told not to spend the time worrying about it.
>
> It's a trade off: you can write some code correctly, or quickly.  ;)
>   

With the widespread availability of ACID-compliant databases like
Postgres (or even SQLite), you shouldn't need to compromise on
transaction consistency to meet deadlines - just take the above
guidelines on board and it should become automatic.

Sam.

