Kindle Notes & Highlights
Read between August 2 and December 28, 2020
Later, client 1 comes back to life and sends its write to the storage service, including its token value 33. However, the storage server remembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33.
Note that this mechanism requires the resource itself to take an active role in checking tokens by rejecting any writes with an older token than one that has already been processed — it is not sufficient to rely on clients checking their lock status themselves.
For resources that do not explicitly support fencing tokens, you might still be able to work around the limitation (for example, in the case of a file storage service you could include the fencing token in the filename).
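The server-side check described above can be sketched as follows. This is a minimal illustration, not code from the book; `StorageServer` and its method names are hypothetical:

```python
class StorageServer:
    """Illustrative storage service that enforces fencing tokens."""

    def __init__(self):
        self.max_token_seen = -1  # highest fencing token processed so far
        self.data = {}

    def write(self, key, value, token):
        # Reject any write carrying an older token than one already processed.
        if token <= self.max_token_seen:
            raise PermissionError(
                f"stale fencing token {token} (already saw {self.max_token_seen})"
            )
        self.max_token_seen = token
        self.data[key] = value

server = StorageServer()
server.write("file", "from client 2", token=34)      # accepted
try:
    server.write("file", "from client 1", token=33)  # rejected: stale token
except PermissionError as e:
    print(e)
```

Note that the server, not the client, performs the comparison: even a client that believes it still holds the lock cannot get a stale write accepted.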
Distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses) — for example, if a node may claim to have received a particular message when in fact it didn’t. Such behavior is known as a Byzantine fault.
Byzantine Generals Problem
their endeavor is hampered by the fact that there are some traitors in their midst.
A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.
In aerospace environments, the data in a computer’s memory or CPU register could become corrupted by radiation,
For example, peer-to-peer networks like Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties to agree whether a transaction happened or not, without relying on a central authority
To use this approach against bugs, you would have to have four independent implementations of the same software and hope that a bug only appears in one of the four implementations.
Usually, corrupted packets are caught by the checksums built into TCP and UDP, but sometimes they evade detection
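A common defense is an additional end-to-end checksum at the application level, since TCP's 16-bit checksum is comparatively weak. A minimal sketch using CRC-32 (the `frame`/`unframe` helpers are hypothetical names, not from any particular library):

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append an application-level CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def unframe(data: bytes) -> bytes:
    """Verify and strip the trailing CRC-32, raising on corruption."""
    payload, received = data[:-4], int.from_bytes(data[-4:], "big")
    if zlib.crc32(payload) != received:
        raise ValueError("checksum mismatch: payload corrupted in transit")
    return payload

msg = frame(b"hello")
assert unframe(msg) == b"hello"

# Simulate a single-byte corruption that TCP's checksum might miss:
corrupted = bytes([msg[0] ^ 0xFF]) + msg[1:]
try:
    unframe(corrupted)
except ValueError as e:
    print(e)
```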
A publicly accessible application must carefully sanitize any inputs from users,
When synchronizing, the client contacts all of them, estimates their errors, and checks that a majority of servers agree on some time range.
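The majority check can be sketched as interval intersection over the servers' (estimate, uncertainty) pairs. This is a simplified illustration of the idea behind Marzullo-style clock selection, assuming each server reports a symmetric error bound; the function name is made up:

```python
def majority_time_range(readings, quorum):
    """Find a time range covered by at least `quorum` of the given
    (estimate, uncertainty) readings, or None if no majority agrees."""
    # Turn each reading into an interval [estimate - err, estimate + err]
    # and sweep over its endpoints, opens sorting before closes.
    events = []
    for est, err in readings:
        events.append((est - err, +1))  # interval opens
        events.append((est + err, -1))  # interval closes
    events.sort(key=lambda e: (e[0], -e[1]))

    count, start, best = 0, None, None
    for point, delta in events:
        count += delta
        if count >= quorum and start is None:
            start = point                # majority coverage begins
        elif count < quorum and start is not None:
            best = (start, point)        # majority coverage ends
            start = None
    return best

# Three servers: two agree closely, one is a wild outlier.
readings = [(100.0, 1.0), (100.5, 1.0), (250.0, 0.5)]
print(majority_time_range(readings, quorum=2))  # → (99.5, 101.0)
```

The outlier's interval never overlaps a majority, so it is effectively ignored, which is exactly why contacting several servers is more robust than trusting one.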
Synchronous model The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock error.
Partially synchronous model Partial synchrony means that a system behaves like a synchronous system most of the time, but it sometimes exceeds the bounds
Asynchronous model In this model, an algorithm is not allowed to make any timing assumptions — in fact, it does not even have a clock (so it cannot use timeouts).
Crash-stop faults In the crash-stop model, an algorithm may assume that a node can fail in only one way, namely by crashing.
Crash-recovery faults We assume that nodes may crash at any moment, and perhaps start responding again after some unknown time.
Byzantine (arbitrary) faults Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described in the last section.
Similarly, we can write down the properties we want of a distributed algorithm to define what it means to be correct.
To clarify the situation, it is worth distinguishing between two different kinds of properties: safety and liveness properties.
uniqueness and monotonic sequence are safety properties, but availability is a liveness property.
A giveaway is that liveness properties often include the word “eventually” in their definition.
Safety is often informally defined as nothing bad happens, and liveness as something good eventually happens.
For distributed algorithms, it is common to require that safety properties always hold,
That is, even if all nodes crash, or the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong result.
Quorum algorithms (see “Quorums for reading and writing”) rely on a node remembering the data that it claims to have stored.
that breaks the quorum condition, and thus breaks the correctness of the algorithm.
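The quorum condition in question is that every read quorum must overlap every write quorum, i.e. w + r > n. For small n this can be checked exhaustively; the function name below is illustrative:

```python
from itertools import combinations

def quorum_condition_holds(n, w, r):
    """Check exhaustively that every write quorum of size w shares at
    least one node with every read quorum of size r over n replicas.
    This is equivalent to the condition w + r > n."""
    nodes = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(nodes, w)
               for rq in combinations(nodes, r))

print(quorum_condition_holds(3, 2, 2))  # → True: quorums always overlap
print(quorum_condition_holds(3, 1, 2))  # → False: a read may miss the write
```

If a node "forgets" acknowledged data (say, it restarts with an empty disk), the effective write quorum shrinks, and the overlap guarantee above no longer holds, which is the correctness break the text describes.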
The fact that such partial failures can occur is the defining characteristic of distributed systems.
most distributed algorithms rely on timeouts to determine whether a remote node is still available.
However, as discussed in the introduction to Part II, scalability is not the only reason for wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals, and those things cannot be achieved with a single node.
The simplest way of handling such faults is to simply let the entire service fail, and show the user an error message. If that solution is unacceptable, we need to find ways of tolerating faults — that is, of keeping the service functioning correctly, even if some internal component is faulty.
We will assume that all the problems from Chapter 8 can occur: packets can be lost, reordered, duplicated, or arbitrarily delayed in the network; clocks are approximate at best; and nodes can pause (e.g., due to garbage collection) or crash at any time.
This is the same approach as we used with transactions in Chapter 7: by using a transaction, the application can pretend that there are no crashes
For example, one of the most important abstractions for distributed systems is consensus: that is, getting all of the nodes to agree on something.
A better name for eventual consistency may be convergence, as we expect all replicas to eventually converge to the same value
A database looks superficially like a variable that you can read and write, but in fact it has much more complicated semantics
linearizability
atomic consistency [7], strong consistency, immediate consistency, or external consistency
But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic.
words, linearizability is a recency guarantee.
What Makes a System Linearizable?
2. If a read request is concurrent with a write request, it may return either the old or the new value.
To make the system linearizable, we need to add another constraint, illustrated in
After any one read has returned the new value, all following reads (on the same or other clients) must also return the new value.
The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with C’s cas write, which updates x from 2 to 4. In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A. Again, it’s the same situation as with Alice and Bob in Figure 9-1
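That recency constraint can be expressed as a small checker over completed reads. The representation of each read as a (start, end, value) tuple is an assumption for illustration, modeled on the scenario above where x is updated from 2 to 4:

```python
def violates_recency(reads, old, new):
    """Check the constraint from the text: once any read has returned the
    new value, every read that starts after it completes must also return
    the new value. Each read is a (start_time, end_time, returned_value)
    tuple; simplified to a single register with one write old -> new."""
    for a in reads:
        for b in reads:
            # b started strictly after a finished, so they are not concurrent;
            # if a saw the new value, b is not allowed to see the old one.
            if a[2] == new and b[0] > a[1] and b[2] == old:
                return True  # not linearizable
    return False

# Client A reads 4 (new) during [1, 2]; client B's read [3, 5] returns 2 (old).
print(violates_recency([(1, 2, 4), (3, 5, 2)], old=2, new=4))  # → True
# If B instead returns 4, the history is consistent with linearizability:
print(violates_recency([(1, 2, 4), (3, 5, 4)], old=2, new=4))  # → False
```

When the two reads are concurrent (their intervals overlap), the checker correctly allows either value, matching rule 2 above.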
Serializability is an isolation property of transactions, where every transaction may read and write multiple objects (rows, documents, records)
guarantees that transactions behave the same as if they had executed in some serial order
Linearizability is a recency guarantee on reads and writes of a register
so it does not prevent problems such as write skew