Database Isolation Levels and Data Anomalies
Database isolation and data anomalies are important to understand. This Blog Post explains details regarding both and how they are linked
Database isolation touches on the ability of a database platform to allow a transaction to execute as if there are no other transactions are running concurrently, in reality there can be a large number of concurrently running transactions in a database platform. Database isolation helps you deal with this with guarantees around data accuracy.
The ultimate goal is to prevent reads and writes of temporary, cancelled, or otherwise incorrect data written by concurrent transactions.
I often get asked what databases are at what isolation level and in truth most database's allow you to choose the isolation level you run at. This is because the stricter you make your isolation level the more cost is normally incurred in terms of performance. Below is a diagram of what the default isolation levels are for some common database platforms along with the data anomalies that can occur with each isolation level.
Data Anomalies
Now that we know what isolation level is and what the default isolation levels are for some common database platforms lets take a look at the data anomalies that can occur with each of the isolation mentioned above.
P1 - Dirty Read
A Dirty Read occurs when one transaction reads a value that has been written by another still in-flight transaction. It is not enough to prevent only reads of values written by transactions that eventually rollback, you need to prevent reads of values from transactions that ultimately commit too
P2 - Fuzzy Read or Non-Repeatable Read
A Fuzzy or Non-Repeatable Read occurs when a value that has been read by a still in-flight transaction is overwritten by another transaction. Even without a second read of the value actually occurring this can still cause database invariants to be violated.
P3 - Phantom
A Phantom occurs when a transaction does a predicate-based read (e.g. SELECT… WHERE P) and another transaction writes a data item matched by that predicate while the first transaction is still in flight. In the original ANSI SQL language only matching inserts were forbidden (the new entries being the phantoms), but in fact to be safe we need to prohibit any write: updates, deletes, and inserts.
P4: Lost Update
It turns out that there is an anomaly that is not prevented by systems that defend against Dirty Reads and Dirty Writes, but that is prevented by systems that also protect against Fuzzy reads. It’s known as a Lost Update. Consider 2 transaction T1 and T2,
T1 wants to update the data of x. It first reads the value of x, then adds 1 to the value and writes it back to the database. But after reading x, T2 writes a new x and submits it successfully, while T1 adds 1 to the old x value. In this way, T2's update appears to be discarded to T1.
P4C: Cursor Lost Update
P4C is a variation of the Lost Update phenomenon that involves a SQL cursor
A5A: Read Skew
Read skew can occur when there are integrity constraints between two or more data items. Read skew causes inconsistent reads due to inconsistent state of the data being read.
A5B: Write Skew
Write skew is very similar. Suppose our constraint is x+y ≤ 100. And this time we interpose a transaction before T1 issues a write
Database Isolation Levels
Isolation can be ordered as below (From weakest isolation to strongest isolation on)
Degree 0 (everything goes) Read Uncommitted << Read Committed << Cursor Stability << Repeatable Read and Snapshot << Serializable
Lets look at the Isolation Levels in a bit more detail
Read committed is an isolation level that guarantees that any data read was committed at the moment is read. It simply restricts the reader from seeing any intermediate, uncommitted, 'dirty' read. It makes no promise whatsoever that if the transaction re-issues the read, will find the Same data, data is free to change after it was read. PostgresSQL, Oracle and SQL server are some of the databases with this isolation level by default.
Repeatable read is a higher isolation level, that in addition to the guarantees of the read committed level, it also guarantees that any data read cannot change, if the transaction reads the same data again, it will find the previously read data in place, unchanged, and available to read. MySQL and Yugabyte have this isolation level by default
The next isolation level, serializable, makes an even stronger guarantee: in addition to everything repeatable read guarantees, it also guarantees that no new data can be seen by a subsequent read. CockroachDB ships with this as the default isolation Level
Example
Say you have a table T with a column C with one row in it, say it has the value '1'. And consider you have a simple task like the following:
BEGIN TRANSACTION;
SELECT * FROM T;
WAITFOR DELAY '00:01:00'
SELECT * FROM T;
COMMIT;
That is a simple task that issue two reads from table T, with a delay of 1 minute between them.
- under READ COMMITTED, the second SELECT may return any data. A concurrent transaction may update the record, delete it, insert new records. The second select will always see the new data.
- under REPEATABLE READ the second SELECT is guaranteed to display at least the rows that were returned from the first SELECT unchanged. New rows may be added by a concurrent transaction in that one minute, but the existing rows cannot be deleted nor changed.
- under SERIALIZABLE reads the second select is guaranteed to see exactly the same rows as the first. No row can change, nor deleted, nor new rows could be inserted by a concurrent transaction.
If you follow the logic above you can quickly realise that SERIALIZABLE transactions, while they may make life easy for you, are always completely blocking every possible concurrent operation, since they require that nobody can modify, delete nor insert any row. Cockroach is able to do with in a "Lockless" manner. You can read all about it here
Finally, there is also the SNAPSHOT isolation level. SNAPSHOT isolation level makes the same guarantees as serializable, but not by requiring that no concurrent transaction can modify the data. Instead, it forces every reader to see its own version of the world (it's own 'snapshot'). However, that benefit comes with a price: extra server resource consumption and still has the small possibility of Phantom Reads and Write Skew.
In Summary
CockroachDB is one of a few databases that only runs in serializable isolation level preventing all forms of data anomalies without needing strict locking mechanisms. CockroachDB believes that the accuracy of data should not be compromised and is part of the reason we only allow serializable isolation level within the DB platform.