Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
42%
Later, client 1 comes back to life and sends its write to the storage service, including its token value 33. However, the storage server remembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33.
42%
Note that this mechanism requires the resource itself to take an active role in checking tokens by rejecting any writes with an older token than one that has already been processed — it is not sufficient to rely on clients checking their lock status themselves.
42%
For resources that do not explicitly support fencing tokens, you might still be able to work around the limitation (for example, in the case of a file storage service you could include the fencing token in the filename).
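A minimal sketch, not from the book, of the check described above: a hypothetical in-memory storage service (the class and method names are illustrative) that remembers the highest fencing token it has processed and rejects any write carrying an older one.

class FencedStorage:
    """Hypothetical storage service that takes an active role in fencing."""

    def __init__(self):
        self.max_token_seen = -1   # highest fencing token processed so far
        self.data = {}

    def write(self, key, value, token):
        # Reject writes whose token is older than one already processed,
        # so a client still holding an expired lock cannot corrupt the data.
        if token < self.max_token_seen:
            raise PermissionError(
                f"stale fencing token {token}; already saw {self.max_token_seen}")
        self.max_token_seen = token
        self.data[key] = value

store = FencedStorage()
store.write("file", "written by client 2", token=34)       # accepted
try:
    store.write("file", "written by client 1", token=33)   # client 1 wakes up late
except PermissionError as err:
    print(err)   # stale fencing token 33; already saw 34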
42%
Distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses) — for example, if a node may claim to have received a particular message when in fact it didn’t. Such behavior is known as a Byzantine fault,
42%
Byzantine Generals Problem
42%
their endeavor is hampered by the fact that there are some traitors in their midst.
42%
A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.
42%
In aerospace environments, the data in a computer’s memory or CPU register could become corrupted by radiation,
42%
For example, peer-to-peer networks like Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties to agree whether a transaction happened or not, without relying on a central authority
42%
To use this approach against bugs, you would have to have four independent implementations of the same software and hope that a bug only appears in one of the four implementations.
42%
Usually, corrupted packets are caught by the checksums built into TCP and UDP, but sometimes they evade detection
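One common remedy, sketched here as an assumption rather than anything prescribed by the book, is an end-to-end application-level checksum, so that corruption which slips past the weak TCP/UDP checksums is still detected. The encode/decode helpers below are hypothetical.

import zlib

def encode(payload: bytes) -> bytes:
    # Prepend a CRC-32 of the payload as an end-to-end integrity check.
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def decode(message: bytes) -> bytes:
    checksum, payload = message[:4], message[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        raise ValueError("payload corrupted in transit")
    return payload

assert decode(encode(b"hello")) == b"hello"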
42%
A publicly accessible application must carefully sanitize any inputs from users,
42%
When synchronizing, the client contacts all of them, estimates their errors, and checks that a majority of servers agree on some time range.
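A minimal sketch of that majority check, under assumed details: each server reply is taken to be an estimated time plus an error bound, i.e. an interval that should contain the true time, and a simple sweep looks for a time range covered by a majority of the intervals. The function name is illustrative.

def majority_time_range(replies):
    """replies: list of (estimated_time, error_bound) pairs, one per server."""
    events = []
    for t, err in replies:
        events.append((t - err, +1))   # interval [t - err, t + err] opens
        events.append((t + err, -1))   # interval closes
    events.sort()

    majority = len(replies) // 2 + 1
    depth, start = 0, None
    for point, delta in events:
        depth += delta
        if depth >= majority and start is None:
            start = point              # a majority of intervals overlap from here
        elif depth < majority and start is not None:
            return (start, point)      # ...until here
    return None                        # no time range has majority agreement

# Two of three servers agree on a range; the third (an outlier) is ignored.
print(majority_time_range([(100.0, 0.5), (100.2, 0.5), (103.0, 0.5)]))   # (99.7, 100.5)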
43%
Synchronous model
The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock error.
43%
Partially synchronous model
Partial synchrony means that a system behaves like a synchronous system most of the time, but it sometimes exceeds the bounds
43%
Asynchronous model
In this model, an algorithm is not allowed to make any timing assumptions — in fact, it does not even have a clock (so it cannot use timeouts).
43%
Crash-stop faults
In the crash-stop model, an algorithm may assume that a node can fail in only one way, namely by crashing.
43%
Crash-recovery faults
We assume that nodes may crash at any moment, and perhaps start responding again after some unknown time.
43%
Byzantine (arbitrary) faults
Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described in the last section.
43%
Similarly, we can write down the properties we want of a distributed algorithm to define what it means to be correct.
43%
To clarify the situation, it is worth distinguishing between two different kinds of properties: safety and liveness properties.
43%
uniqueness and monotonic sequence are safety properties, but availability is a liveness property.
43%
A giveaway is that liveness properties often include the word “eventually” in their definition.
43%
Safety is often informally defined as nothing bad happens, and liveness as something good eventually happens.
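A minimal sketch, not from the book, of why that distinction matters operationally: a safety property like the monotonic sequence of fencing tokens can be refuted by pointing at a specific place in a finite trace, whereas a liveness property ("every request eventually gets a token") can never be refuted by a finite trace alone. The function name is illustrative.

def monotonic_sequence_holds(granted_tokens):
    # Safety: every granted token must be strictly greater than the previous one.
    return all(a < b for a, b in zip(granted_tokens, granted_tokens[1:]))

assert monotonic_sequence_holds([33, 34, 35])
assert not monotonic_sequence_holds([33, 35, 34])   # the bad thing happened at token 34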
43%
For distributed algorithms, it is common to require that safety properties always hold,
43%
That is, even if all nodes crash, or the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong result
43%
Quorum algorithms (see “Quorums for reading and writing”) rely on a node remembering the data that it claims to have stored.
43%
If a node may forget data that it previously stored, that breaks the quorum condition, and thus breaks the correctness of the algorithm.
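A minimal sketch of the quorum condition referred to here (see "Quorums for reading and writing"): with n replicas, w write acknowledgements, and r read responses, correctness relies on w + r > n so that every read quorum overlaps with every write quorum; a node that forgets acknowledged data silently voids that overlap. The helper below is illustrative.

def quorums_overlap(n, w, r):
    # At least one node must be in both the write quorum and the read quorum.
    return w + r > n

assert quorums_overlap(n=3, w=2, r=2)       # a read is guaranteed to reach a node with the latest write
assert not quorums_overlap(n=3, w=1, r=1)   # a read may miss the latest write entirely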
43%
The fact that such partial failures can occur is the defining characteristic of distributed systems.
43%
most distributed algorithms rely on timeouts to determine whether a remote node is still available.
43%
However, as discussed in the introduction to Part II, scalability is not the only reason for wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals, and those things cannot be achieved with a single node.
45%
The simplest way of handling such faults is to simply let the entire service fail, and show the user an error message. If that solution is unacceptable, we need to find ways of tolerating faults — that is, of keeping the service functioning correctly, even if some internal component is faulty.
45%
We will assume that all the problems from Chapter 8 can occur: packets can be lost, reordered, duplicated, or arbitrarily delayed in the network; clocks are approximate at best; and nodes can pause (e.g., due to garbage collection) or crash at any time.
45%
This is the same approach as we used with transactions in Chapter 7: by using a transaction, the application can pretend that there are no crashes
45%
For example, one of the most important abstractions for distributed systems is consensus: that is, getting all of the nodes to agree on something.
45%
A better name for eventual consistency may be convergence, as we expect all replicas to eventually converge to the same value
45%
A database looks superficially like a variable that you can read and write, but in fact it has much more complicated semantics
45%
linearizability
45%
atomic consistency [7], strong consistency, immediate consistency, or external consistency
45%
But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic.
45%
In other words, linearizability is a recency guarantee.
45%
What Makes a System Linearizable?
45%
2. If a read request is concurrent with a write request, it may return either the old or the new value.
45%
To make the system linearizable, we need to add another constraint, illustrated in
45%
After any one read has returned the new value, all following reads (on the same or other clients) must also return the new value.
45%
The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with C’s cas write, which updates x from 2 to 4. In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A. Again, it’s the same situation as with Alice and Bob in Figure 9-1
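A minimal sketch, not from the book, of checking that constraint on a single register whose value changes from an old value to a new value: reads are listed in the real-time order in which they returned, and once any read has returned the new value, no later read may return the old one. The function name is illustrative.

def reads_respect_recency(reads, old_value, new_value):
    seen_new = False
    for value in reads:
        if value == new_value:
            seen_new = True
        elif value == old_value and seen_new:
            return False   # a later read went back in time: not linearizable
    return True

# The scenario above: A reads the new value 4, then B's final read returns 2.
print(reads_respect_recency([2, 4, 2], old_value=2, new_value=4))   # False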
46%
Serializability is an isolation property of transactions, where every transaction may read and write multiple objects (rows, documents, records)
46%
It guarantees that transactions behave the same as if they had executed in some serial order
46%
Linearizability is a recency guarantee on reads and writes of a register
46%
so it does not prevent problems such as write skew