Kindle Notes & Highlights
The whole point of a consistent snapshot is that it does not include writes that are more recent than the snapshot, and thus reads from the snapshot are not linearizable.
Coordination services like Apache ZooKeeper [15] and etcd [16] are often used to implement distributed locks and leader election.
The web server doesn’t place the entire photo on the queue, since most message brokers are designed for small messages, and a photo may be several megabytes in size.
The most common approach to making a system fault-tolerant is to use replication.
Using the leader for reads relies on the assumption that you know for sure who the leader is.
If the network between datacenters is interrupted in a single-leader setup, clients connected to follower datacenters cannot contact the leader, so they cannot make any writes to the database, nor any linearizable reads.
If the application requires linearizable reads and writes, the network interruption causes the application to become unavailable in the datacenters that cannot contact the leader.
CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of starting a discussion about trade-offs in databases.
At times when the network is working correctly, a system can provide both consistency (linearizability) and total availability.
When a network fault occurs, you have to choose between either linearizability or total availability.
All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us understand systems better, so CAP is best avoided.
Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable in practice.
Memory access first goes to the cache by default, and any changes are asynchronously written out to main memory.
The reason for dropping linearizability is performance, not fault tolerance.
Linearizability is slow—and this is true all the time, not only during a network fault.
If a system obeys the ordering imposed by causality, we say that it is causally consistent.
A total order allows any two elements to be compared, so if you have two elements, you can always say which one is greater and which one is smaller.
There might be several requests waiting to be handled, but the datastore ensures that every request is handled atomically at a single point in time, acting on a single copy of the data, along a single timeline, without any concurrency.
If you are familiar with distributed version control systems such as Git, their version histories are very much like the graph of causal dependencies.
Often one commit happens after another, in a straight line, but sometimes you get branches (when several people concurrently work on a project), and merges are created when those concurrently created commits are combined.
The fact that linearizability ensures causality is what makes linearizable systems simple to understand and appealing.
In many cases, systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently.
In order to maintain causality, you need to know which operation happened before which other operation. This is a partial order: concurrent operations may be processed in any order, but if one operation happened before another, then they must be processed in that order on every replica.
Thus, when a replica processes an operation, it must ensure that all causally preceding operations (all operations that happened before) have already been processed; if some preceding operation is missing, the later operation must wait until the preceding operation has been processed.
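As a rough sketch of that waiting rule, suppose each operation carries the set of IDs of the operations it causally depends on (purely an illustrative assumption; the names below are hypothetical). A replica could then buffer an operation until all of its dependencies have been applied:

    # Sketch: a replica applies an operation only once every causally
    # preceding operation (its dependencies) has already been applied.
    class CausalReplica:
        def __init__(self):
            self.applied = set()   # IDs of operations already applied
            self.pending = []      # operations whose dependencies are still missing

        def receive(self, op_id, deps, apply_fn):
            """deps is the set of operation IDs that happened before this op."""
            self.pending.append((op_id, deps, apply_fn))
            self._drain()

        def _drain(self):
            progress = True
            while progress:
                progress = False
                for op in list(self.pending):
                    op_id, deps, apply_fn = op
                    if deps <= self.applied:      # all preceding ops applied?
                        apply_fn()                # safe to apply now
                        self.applied.add(op_id)
                        self.pending.remove(op)
                        progress = True
                    # otherwise the operation keeps waiting in self.pending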
If a node had already seen the value X when it issued the write Y, then X and Y may be causally related.
In order to determine the causal ordering, the database needs to know which version of the data was read by the application.
Although causality is an important theoretical concept, actually keeping track of all causal dependencies can become impracticable.
In many applications, clients read lots of data before writing something, and then it is not clear whether the write is causally dependent on all or only some of those prior reads.
Explicitly tracking all the data that has been read would mean a large overhead.
The leader can simply increment a counter for each operation, and thus assign a monotonically increasing sequence number to each operation in the replication log.
If a follower applies the writes in the order they appear in the replication log, the state of the follower is always causally consistent (even if it is lagging behind the leader).
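A rough sketch of that single-leader arrangement, with illustrative names and a single in-process object standing in for a real replication setup:

    # Sketch: the leader assigns consecutive sequence numbers to writes,
    # and a follower applies log entries strictly in sequence-number order.
    class Leader:
        def __init__(self):
            self.next_seq = 0
            self.log = []                      # replication log: (seq, write)

        def append(self, write):
            entry = (self.next_seq, write)     # monotonically increasing counter
            self.next_seq += 1
            self.log.append(entry)
            return entry

    class Follower:
        def __init__(self):
            self.expected_seq = 0
            self.state = {}

        def apply(self, entry):
            seq, (key, value) = entry
            assert seq == self.expected_seq    # apply in log order, no gaps
            self.state[key] = value            # state stays causally consistent
            self.expected_seq += 1

    leader, follower = Leader(), Follower()
    for write in [("x", 1), ("y", 2)]:
        follower.apply(leader.append(write))   # in-order delivery preserves causality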
If there is not a single leader (perhaps because you are using a multi-leader or leaderless database, or because the database is partitioned), it is less clear how to generate sequence numbers for operations.
Each node may process a different number of operations per second.
Thus, if one node generates even numbers and the other generates odd numbers, the counter for even numbers may lag behind the counter for odd numbers, or vice versa.
Timestamps from physical clocks are subject to clock skew, which can make them inconsistent with causality.
Each node has a unique identifier, and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is then simply a pair of (counter, node ID).
Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp, each timestamp is made unique.
Lamport timestamps provide a total ordering consistent with causality.
The key idea about Lamport timestamps, which makes them consistent with causality, is the following: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request.
When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.
As long as the maximum counter value is carried along with every operation, this scheme ensures that the ordering from the Lamport timestamps is consistent with causality, because every causal dependency results in an increased timestamp.
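Putting those rules together, a minimal Lamport clock might look like the following Python sketch (names illustrative):

    # Sketch: a Lamport clock. A timestamp is the pair (counter, node_id);
    # timestamps compare as tuples, so ties on the counter are broken by
    # node ID, giving a total order that is consistent with causality.
    class LamportClock:
        def __init__(self, node_id):
            self.node_id = node_id
            self.counter = 0       # maximum counter value seen so far

        def tick(self):
            """Generate a timestamp for a local operation."""
            self.counter += 1
            return (self.counter, self.node_id)

        def observe(self, timestamp):
            """Merge the counter carried on an incoming request or response."""
            counter, _ = timestamp
            self.counter = max(self.counter, counter)

    # Example: node B learns of node A's operation before writing, so
    # B's timestamp is guaranteed to be greater than A's.
    a, b = LamportClock("A"), LamportClock("B")
    ts_a = a.tick()        # (1, "A")
    b.observe(ts_a)
    ts_b = b.tick()        # (2, "B")
    assert ts_b > ts_a     # tuple comparison: causally later => greater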
The advantage of Lamport timestamps over version vectors is that they are more compact.
If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations: the unknown operations from the other node may need to be inserted at various positions in the total order.
If you have an operation to create a username, and you are sure that no other node can insert a claim for the same username ahead of your operation in the total order, then you can safely declare the operation successful.
Total order broadcast is usually described as a protocol for exchanging messages between nodes.
Consensus services such as ZooKeeper and etcd actually implement total order broadcast.
Since all nodes must deliver the same messages in the same order, all nodes can read the log and see the same sequence of messages.
Every request to acquire the lock is appended as a message to the log, and all messages are sequentially numbered in the order they appear in the log. The sequence number can then serve as a fencing token, because it is monotonically increasing.
Because log entries are delivered to all nodes in the same order, if there are several concurrent writes, all nodes will agree on which one came first.
Choosing the first of the conflicting writes as the winner and aborting later ones ensures that all nodes agree on whether a write was committed or aborted.
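A rough sketch of that approach, with an in-memory list standing in for a real total order broadcast log (names illustrative):

    # Sketch: deciding username claims via a totally ordered log. Every node
    # appends its claim and then reads the log back; the first claim for a
    # given username wins, and all nodes reach the same verdict because they
    # all see the messages in the same order.
    log = []                                   # stand-in for total order broadcast

    def append(message):
        log.append(message)
        return len(log) - 1                    # sequence number; monotonically
                                               # increasing, usable as a fencing token

    def claim_username(node, username):
        my_seq = append((node, username))
        # Scan the entries that were delivered before our own message.
        for other_node, other_name in log[:my_seq]:
            if other_name == username:
                return "aborted"               # an earlier claim for the same name wins
        return "committed"

    # Two nodes race for the same username; exactly one wins on every node.
    print(claim_username("node1", "alice"))    # committed
    print(claim_username("node2", "alice"))    # aborted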