Ataraxia through Epoché: [Book][Database Internals] reading note part I

Reference:
Transactional storage for geo-replicated systems[walter-sosp11]

Transaction manager

This manager schedules transactions and ensures they cannot leave

the database in a logically inconsistent state.

Lock manager

This manager locks on the database objects for the running

transactions, ensuring that concurrent operations do not violate

physical data integrity.

Access methods (storage structures)

These manage access and organizing data on disk. Access methods

include heap files and storage structures such as B-Trees (“Ubiquitous B-Trees”)

or LSM Trees.

Buffer manager

This manager caches data pages in memory (see “Buffer

Management”).

Recovery manager

This manager maintains the operation log and restoring the system

state in case of a failure.

checkpointing

Backup <-> Log till backup snapshot.

Storage Structure

Disk-based storage structures often have a form of wide and short trees

, while memory-based implementations can choose from a larger pool of data structures and

perform optimizations that would otherwise be impossible or difficult to

implement on disk [MOLINA92].

Similarly, handling variable-size data on disk requires special attention, while in memory it’s often a matter of referencing the value with a pointer.

On modern CPUs, vectorized instructions can be used to process multiple data points

with a single CPU instruction [DREPPER07].

Wide Column Stores

Column-oriented databases should not be mixed up with wide column

stores, such as BigTable or HBase, where data is represented as a

multidimensional map, columns are grouped into column families (usually

storing data of the same type), and inside each column family, data is

stored row-wise. This layout is best for storing data retrieved by a key or a

sequence of keys.

https://jepsen.io/consistency

Serializability and external consistency (SSSSSS)

Even though there is some overlap in time in which Txn1 and Txn2 are both executing, their commit timestamps c1 and c2 respect a linear transaction order, which means that all effects of the reads and writes of Txn1 appear to have occurred at a single point of time (c1), and all effects of the reads and writes of Txn2 appear to have occurred at a single point of time (c2).

Furthermore, c1 < c2 (which is guaranteed because both Txn1 and Txn2 committed writes; this is true even if the writes happened on different machines), which respects the order of Txn1 happening before Txn2. (However, if Txn2 only did reads in the transaction, then c1 <= c2).

If Run() or Commit() succeeds

When a call to Run() (if using the Transaction Runner API) or Commit() (if using the Transaction API) returns with an OK status, it means the transaction committed at time T, and:

The writes buffered on that transaction are guaranteed to be applied at T.

All rows read by the transaction are guaranteed to be valid at T.
In other words, all the data you read is consistent because it came from a same single snapshot of the database.

If Run() or Commit() fails

If you're using the Transaction Runner API and a call to Run() fails, the read and write guarantees you have depend on what error the underlying Commit() call failed with:

Deadline Exceeded (if the client set a deadline on the read options) or Canceled (if the client canceled) - the transaction may or may not have committed (and thus buffered writes may or may not have happened).
A constraint failure (e.g. RowNotFound, AlreadyExists, BadUsage, etc.) - Writing the buffered mutations encountered some error, e.g. a row that the client is trying to update doesn't exist. In that case, the reads are guaranteed consistent, the writes are guaranteed to not be applied, and the non-existence of the row is guaranteed to be consistent with the reads as well.

Nested Transactions Are Unsafe

Potential dead lock could happen.

Use timeout to controll distributed transaction liveness

Long-running Operations SSSSSS servers automatically time out transactions that have been idle for 10s. The SSSSSS client automatically keeps transactions alive while a transactional read or query is running.

Lock Modes

SSSSSS uses a combination of shared locks and exclusive locks to control access to the data.

By default when you perform a read as part of a transaction, SSSSSS acquires shared read locks, which allows other reads to still access the data until your transaction is ready to commit. When your transaction is committing and writes are being applied, the transaction attempts to upgrade to an exclusive lock for any data you are writing. It blocks new shared read locks on the data, waits for existing shared read locks to clear, then places an exclusive lock for exclusive access to the data.

Locks are taken at the granularity of row-and-column. If transaction T1 has locked column "A" of row "foo", and transaction T2 wants to write column "B" of row "foo" then there is no conflict.

Writes to a data item that don't also read the data being written (aka "blind writes") don't conflict with other blind writers of the same item (the commit timestamp of each write determines the order in which it is applied to the database). A consequence of this is that SSSSSS only needs to upgrade to an exclusive lock if you have read the data you are writing. Otherwise SSSSSS uses a shared lock called a writer shared lock. But at distributed transaction participants even blind writes will take exclusive locks because transactions at different coordinators cannot be guaranteed, in general, to be assigned different commit timestamps.

SSSSSS uses the standard "wound-wait" algorithm to handle deadlock detection: under the hood SSSSSS keeps track of the age of each transaction that requests conflicting locks, and allows older transactions to abort younger transactions (where "older" means the transaction started earlier). By giving priority to older transactions, SSSSSS ensures that eventually every transaction has a chance to acquire locks by virtue of it becoming old enough to have higher priority than other transactions. e.g. a transaction holding a reader shared lock can be aborted by an older transaction wanting a writer shared lock.

Ataraxia through Epoché

Nov 28, 2023

[Book][Database Internals] reading note part I