Kindle Notes & Highlights
Read between October 21 – November 26, 2024
You can make your read from a replica that is synchronously updated on writes, and is thus sure to be up to date.
In general, if you think hard enough about linearizable sequence number generators, you inevitably end up with a consensus algorithm.
In a database with single-leader replication, all nodes need to agree on which node is the leader.
If there were two leaders, they would both accept writes and their data would diverge, leading to inconsistency and data loss.
In a database that supports transactions spanning several nodes or partitions, we have the problem that a transaction may fail on some nodes but succeed on others.
In a distributed system, we must assume that nodes may crash, so any consensus protocol must tolerate failures (in a fully asynchronous model with no timeouts, reliable consensus is in fact impossible).
Atomicity prevents failed transactions from littering the database with half-finished results and half-updated state.
Atomicity ensures that the secondary index stays consistent with the primary data (if the index became inconsistent with the primary data, it would not be very useful).
For transactions that execute at a single database node, atomicity is commonly implemented by the storage engine.
A transaction commit must be irrevocable—you are not allowed to change your mind and retroactively abort a transaction after it has been committed.
If a transaction was allowed to abort after committing, any transactions that read the committed data would be based on data that was retroactively declared not to have existed—so they would have to be reverted as well.
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes—i.e., to ensure that either all nodes commit or all nodes abort.
2PC provides atomic commit in a distributed database, whereas 2PL provides serializable isolation.
When 2PC is used, a distributed transaction begins with the application reading and writing data on multiple database nodes, as normal.
When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator. This transaction ID is globally unique.
The application begins a single-node transaction on each of the participants, and attaches the globally unique transaction ID to the single-node transaction.
When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID.
When a participant receives the prepare request, it makes sure that it can definitely commit the transaction under all circumstances.
If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction.
Without hearing from the coordinator, the participant has no way of knowing whether to commit or abort.
The only way 2PC can complete is by waiting for the coordinator to recover.
Two-phase commit is called a blocking atomic commit protocol because it can become stuck waiting for the coordinator to recover.
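The two phases described above can be sketched in Python. This is a minimal in-process illustration, not any real database's API; the class and method names are invented for the example:

```python
from enum import Enum

class Vote(Enum):
    YES = "yes"
    NO = "no"

class Participant:
    """A single node taking part in the distributed transaction."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self, txid):
        # Phase 1: by voting YES the participant promises it can commit
        # under all circumstances (a real system durably logs the
        # transaction to disk before replying).
        if self.can_commit:
            self.state = "prepared"
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def finish(self, txid, commit):
        # Phase 2: apply the coordinator's irrevocable decision.
        self.state = "committed" if commit else "aborted"

class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def commit_transaction(self, txid):
        # Phase 1: collect votes; a single NO (or a timeout) forces abort.
        votes = [p.prepare(txid) for p in self.participants]
        decision = all(v == Vote.YES for v in votes)
        # In a real implementation the decision is written to the
        # coordinator's log on disk here -- that write is the commit point.
        for p in self.participants:
            p.finish(txid, decision)
        return decision
```

Note what the sketch cannot show: if the coordinator crashed between the two loops, the prepared participants would be left in doubt, unable to unilaterally commit or abort.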
In a network with unbounded delay a timeout is not a reliable failure detector, because a request may time out due to a network problem even if no node has crashed.
Database-internal transactions do not have to be compatible with any other system, so they can use any protocol and apply optimizations specific to that particular technology.
Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways.
X/Open XA (short for eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies.
XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator.
XA assumes that your application uses a network driver or client library to communicate with the participant databases or messaging services.
The transaction coordinator implements the XA API. The standard does not specify how it should be implemented, but in practice the coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service).
If the application process crashes, or the machine on which the application is running dies, the coordinator goes with it.
Since the coordinator’s log is on the application server’s local disk, that server must be restarted, and the coordinator library must read the log to recover the commit/abort outcome of each transaction.
The database server cannot contact the coordinator directly, since all communication must go via its client library.
In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the log and resolve any in-doubt transactions.
The administrator must examine the participants of each in-doubt transaction, determine whether any participant has committed or aborted already, and then apply the same outcome to the other participants.
Many server-side applications are developed in a stateless model (as favored by HTTP), with all persistent state stored in a database, which has the advantage that application servers can be added and removed at will.
Since XA needs to be compatible with a wide range of data systems, it is necessarily a lowest common denominator.
Informally, consensus means getting several nodes to agree on something.
The consensus problem is normally formalized as follows: one or more nodes may propose values, and the consensus algorithm decides on one of those values.
The termination property formalizes the idea of fault tolerance.
In order to solve consensus, we must first solve consensus.
Every time the current leader is thought to be dead, a vote is started among the nodes to elect a new leader. This election is given an incremented epoch number, and thus epoch numbers are totally ordered and monotonically increasing.
If there is a conflict between two different leaders in two different epochs (perhaps because the previous leader actually wasn’t dead after all), then the leader with the higher epoch number prevails.
Before a leader is allowed to decide anything, it must first check that there isn't some other leader with a higher epoch number which might take a conflicting decision.
For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of nodes to respond in favor of the proposal.
A node votes in favor of a proposal only if it is not aware of any other leader with a higher epoch.
Consensus systems always require a strict majority to operate. This means you need a minimum of three nodes in order to tolerate one failure (the remaining two out of three form a majority), or a minimum of five nodes to tolerate two failures (the remaining three out of five form a majority).
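The epoch-and-quorum mechanism can be sketched as follows. This is an illustrative toy, assuming a known fixed cluster size; it omits the log replication and recovery that a real algorithm such as Raft or Paxos performs:

```python
class Node:
    """Votes on proposals; rejects any leader with a stale epoch."""
    def __init__(self):
        self.highest_epoch_seen = 0

    def vote(self, epoch):
        # Vote in favor only if no leader with a higher epoch is known.
        if epoch < self.highest_epoch_seen:
            return False
        self.highest_epoch_seen = epoch
        return True

def decide(live_nodes, cluster_size, epoch):
    """A leader's proposal is decided only by a strict majority.

    Votes are counted against the *whole* cluster, not just the nodes
    that happen to be reachable, so two leaders can never both win.
    """
    yes_votes = sum(1 for n in live_nodes if n.vote(epoch))
    return yes_votes > cluster_size // 2
```

Because any two majorities overlap, at least one node in any winning quorum has seen the newest epoch, which is what lets the higher-epoch leader prevail over a stale one.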
Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster.
Consensus systems generally rely on timeouts to detect failed nodes.
In environments with highly variable network delays, especially geographically distributed systems, it often happens that a node falsely believes the leader to have failed due to a transient network issue.
The fencing token is some number that monotonically increases every time the lock is acquired.
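A minimal sketch of fencing, assuming a single lock service and a storage service that both fit in one process for illustration (the names are invented):

```python
class LockService:
    """Hands out a monotonically increasing fencing token per lock grant."""
    def __init__(self):
        self.token = 0

    def acquire(self):
        self.token += 1
        return self.token

class Storage:
    """Rejects writes carrying a token older than the newest one seen."""
    def __init__(self):
        self.max_token_seen = 0
        self.data = None

    def write(self, token, value):
        if token < self.max_token_seen:
            # A newer token exists, so this client's lock must have been
            # re-granted to someone else in the meantime: fence it off.
            return False
        self.max_token_seen = token
        self.data = value
        return True
```

The point is that the storage service, not the paused client, enforces the ordering: a client whose lease silently expired during a GC pause is rejected when it finally writes.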