Transactions and Concurrency

TokuDB uses two-phase locking in its transactional concurrency control algorithm. A transaction acquires locks while it executes; this is the lock expansion phase. A transaction releases its locks when it commits or aborts; this is the retiring (shrinking) phase. Within a single transaction, the two phases never overlap. This algorithm and its variants have been proven to provide serializability, which is a nice property to have.
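
As a minimal sketch of how the two phases play out (using the table t defined in the examples at the end of this page; the two-session layout is illustrative):

-- session 1: locks are acquired as statements execute (the expansion phase)
begin;
update t set y = y + 1 where x = 1;   -- write lock on the key x = 1

-- session 2: blocks here, because session 1 still holds its write lock
update t set y = y + 10 where x = 1;

-- session 1: all locks are released at commit (or abort)
commit;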

Transaction isolation levels

MySQL supports several transaction isolation levels, and each of them uses a different locking strategy.

TokuDB implements locking strategies similar to InnoDB's.
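
The isolation level is chosen with standard MySQL syntax (this is plain MySQL, not TokuDB-specific); for example:

set session transaction isolation level serializable;
set session transaction isolation level repeatable read;
set session transaction isolation level read committed;
set session transaction isolation level read uncommitted;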

Lock types

TokuDB uses two types of locks.

  • Read locks are exclusive locks (yes, exclusive locks), which may limit concurrency in some cases (see the sketch after this list). Read locks were shared locks in previous versions of TokuDB, but we simplified and sped up the TokuDB lock tree by not implementing shared locks. The cost is less concurrency. This can be reexamined.
  • Write locks are exclusive locks.
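
A sketch of the concurrency cost of exclusive read locks, assuming the table t from the examples at the end of this page and the serializable isolation level (where selects take read locks, as listed below):

-- session 1: the select takes an exclusive read lock on the key x = 1
set session transaction isolation level serializable;
begin;
select * from t where x = 1;

-- session 2: even another reader of the same key blocks, because TokuDB
-- read locks are exclusive rather than shared
set session transaction isolation level serializable;
begin;
select * from t where x = 1;   -- waits until session 1 commits or aborts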

Lock tree

TokuDB stores locks in an in-memory data structure called the lock tree. The lock tree contains the set of locks currently owned by each live transaction. In addition, the lock tree contains the set of transactions that are waiting for a lock owned by another transaction.
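
The lock tree's contents can be inspected from SQL. The tokudb_locks table is used in the examples at the end of this page; tokudb_lock_waits and tokudb_trx are its companions (exact column sets vary by TokuDB version):

select * from information_schema.tokudb_trx;        -- live transactions
select * from information_schema.tokudb_locks;      -- locks owned by live transactions
select * from information_schema.tokudb_lock_waits; -- transactions waiting on a lock held by another transaction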

Lock ranges

Each range lock includes a transaction id, a lock type, and a copy of the left and right edges of the range being locked. A point lock is a range lock where the left and right edges are the same.

A lock consists of some overhead bytes, some length fields, and a copy of the key for a point lock, or a copy of the left and right ends of the key range for a range lock. A rough estimate of a lock's size is about twice the size of the key.
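
For example, a point lock on one of the 9-byte primary keys shown in the hex dumps below copies the key twice (once each for the left and right edges), so it occupies roughly 2 × 9 = 18 bytes of key memory, plus the fixed overhead and length fields.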

Lock tree escalation

Since the lock tree is an in-memory data structure with a limit on its size, what happens when the limit is reached? When this happens, lock escalation runs. The goal of lock escalation is to shrink the memory footprint of the locks. It does this by merging adjacent range locks that are owned by the same transaction into a single larger range lock. The merged lock also covers the gap between the original range locks.

The lock tree runs lock escalation when the size of the locks in the lock tree exceeds one of two limits. The first limit applies to big transactions and is 1/2 of the lock memory. When this limit is hit and a big transaction is trying to acquire a lock, lock escalation runs on that client thread. Other client threads that also own a big transaction wait for this lock escalation to complete, while client threads that own only small transactions continue to run. This algorithm assigns the cost of lock escalation to the clients that own a lot of locks, and keeps small transactions from being stalled by the escalation activity caused by big transactions.

A big transaction is a transaction with a rollback log that got too big and spilled to the rollback file. The rollback log for a small transaction still fits in memory.
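
The lock memory budget is controlled by the tokudb_max_lock_memory server variable. The sketch below shows how to inspect it; the Tokudb_locktree% status counter names are an assumption here, since they vary across TokuDB versions:

show global variables like 'tokudb_max_lock_memory';   -- size of the lock tree's memory budget
show global status like 'Tokudb_locktree%';            -- lock tree memory and escalation counters, where available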

Locks taken for various database operations

Write locks are taken on all keys that were written while executing the transaction.

Read locks are taken on all keys that were read while executing the transaction. Sometimes TokuDB will prelock a key range that covers the range it expects to traverse, since prelocking the range has less overhead than locking each key that is iterated over.
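
As an illustration (a hypothetical serializable range scan over the table t from the examples at the end of this page):

set session transaction isolation level serializable;
begin;
-- rather than taking a point read lock on every key it visits, TokuDB may
-- prelock a single read lock covering the whole range from x = 1 to x = 100
select * from t where x between 1 and 100;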

Insert

  • All isolation levels
    • Write lock points written

Insert select

  • Serializable
    • Read lock ranges read
    • Write lock points written
  • Repeatable read
    • Read lock ranges read
    • Write lock points written
  • Read committed
    • Write lock points written
  • Read uncommitted
    • Write lock points written

Create select

  • Serializable
    • Read lock ranges read
    • Write lock target table
  • Repeatable read
    • Read lock ranges read
    • Write lock target table
  • Read committed
    • Write lock target table
  • Read uncommitted
    • Write lock target table

Select

  • Serializable
    • Read lock ranges read
  • Repeatable read
    • Snapshot read, no locks taken (see the sketch after this list)
  • Read committed
    • Snapshot read, no locks taken
  • Read uncommitted
    • No locks taken
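
A sketch of a snapshot read under repeatable read (the table t and the session layout are illustrative); the selects take no locks, and repeated reads within the transaction see a consistent snapshot:

-- session 1
set session transaction isolation level repeatable read;
begin;
select y from t where x = 1;   -- snapshot read, no locks taken

-- session 2: commits a change; it is not blocked, because session 1 holds no locks
update t set y = 100 where x = 1;

-- session 1: still sees the value from its snapshot, not 100
select y from t where x = 1;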

Select for update

  • All
    • Write lock ranges read (see the sketch after this list)
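
For example (reusing the table t from the examples at the end of this page), a select for update write-locks the range it reads at every isolation level, so a concurrent writer blocks:

-- session 1: write locks the range read by the select
begin;
select * from t where x = 1 for update;

-- session 2: blocks until session 1 commits or aborts
update t set y = 0 where x = 1;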

Update

  • All isolation levels
    • Read lock ranges read
    • Write lock points written

Delete

  • All isolation levels
    • Read lock ranges read
    • Write lock points written

Examples

The keys are just hex dumps of the key memory. Sorry that we don't nicely decode the memory.

Locks taken for inserts into a table with a primary key

When two rows are inserted into a table with a primary key, TokuDB takes point locks on the primary keys in the primary key's fractal tree index.

create table t (x bigint not null, y bigint not null, primary key(x)) engine=tokudb
insert into t values (1,2),(3,4)
select * from information_schema.tokudb_locks

| locks_trx_id | locks_mysql_thread_id | locks_dname   | locks_key_left     | locks_key_right    |
|       146928 |                     2 | ./test/t-main | 000100000000000000 | 000100000000000000 |
|       146928 |                     2 | ./test/t-main | 000300000000000000 | 000300000000000000 |

Locks taken for inserts into a table with a primary and clustering key

When two rows are inserted into a table with a primary key and a clustering secondary key, TokuDB takes point locks on the primary keys in the primary key's fractal tree index, and point locks on the secondary keys in the clustering key's fractal tree index.

create table t (x bigint not null, y bigint not null, primary key(x), clustering key(y)) engine=tokudb
insert into t values (1,2),(3,4)
select * from information_schema.tokudb_locks

| locks_trx_id | locks_mysql_thread_id | locks_dname    | locks_key_left                     | locks_key_right                    |
|       146934 |                     2 | ./test/t-main  | 000100000000000000                 | 000100000000000000                 |
|       146934 |                     2 | ./test/t-main  | 000300000000000000                 | 000300000000000000                 |
|       146934 |                     2 | ./test/t-key-y | 0002000000000000000100000000000000 | 0002000000000000000100000000000000 |
|       146934 |                     2 | ./test/t-key-y | 0004000000000000000300000000000000 | 0004000000000000000300000000000000 |