Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Such application servers are no longer stateless.
Distributed transactions thus have a tendency of amplifying failures,
consensus algorithm must satisfy the following properties:
Uniform agreement
Integrity
Validity
Termination
The termination property formalizes the idea of fault tolerance. It essentially says that a consensus algorithm cannot simply sit around and do nothing forever — in other words, it must make progress.
decision. (Termination is a liveness property, whereas the other three are safety properties.)
2PC does not meet the requirements for termination.
There is a limit to the number of failures that an algorithm can tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination
That majority can safely form a quorum.
Thus, the termination property is subject to the assumption that fewer than half of the nodes are crashed or unreachable.
Remember that total order broadcast requires messages to be delivered exactly once, in the same order, to all nodes. If you think about it, this is equivalent to performing several rounds of consensus:
implement total order broadcast directly, because that is more efficient than doing repeated rounds of one-value-at-a-time consensus.
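To make that equivalence concrete, the following is a minimal sketch (not code from the book) of total order broadcast built from repeated rounds of single-value consensus. The `decide` callable is a hypothetical stand-in for one consensus round, assumed to return the same value on every node for a given sequence slot.

```python
# Sketch only: total order broadcast as one consensus decision per message slot.

class TotalOrderBroadcast:
    def __init__(self, decide, deliver):
        self.decide = decide    # hypothetical consensus oracle: (slot, proposal) -> agreed value
        self.deliver = deliver  # called exactly once per slot, in slot order, on every node
        self.next_slot = 0

    def broadcast(self, message):
        # Propose our message for successive slots until one round decides on it;
        # a slot may be won by another node's message, which we simply deliver.
        while True:
            decided = self.decide(self.next_slot, message)
            self.deliver(self.next_slot, decided)
            self.next_slot += 1
            if decided == message:
                return
```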
Thus, we need consensus in order to elect a leader. But if the consensus algorithms described here are actually total order broadcast algorithms, and total order broadcast is like single-leader replication, and single-leader replication requires a leader, then…
It seems that in order to elect a leader, we first need a leader. In order to solve consensus, we must first solve consensus. How do we break out of this conundrum?
All of the consensus protocols discussed so far internally use a leader in some form or another, but they don’t guarantee that the leader is unique. Instead, they can make a weaker guarantee: the protocols define an epoch number and guarantee that within each epoch, the leader is unique.
For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of nodes to respond in favor of the proposal. The quorum typically, but not always, consists of a majority of nodes
Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal. The key insight is that the quorums for those two votes must overlap:
Moreover, consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met.
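The epoch-and-quorum mechanism described in the last few highlights can be sketched roughly as follows (illustrative Python, not any real protocol implementation; names such as `Acceptor` and `latest_epoch` are made up):

```python
# Rough sketch of the two rounds of voting. Epoch numbers are called ballot
# numbers in Paxos, terms in Raft, and view numbers in Viewstamped Replication.

class Acceptor:
    def __init__(self):
        self.latest_epoch = 0  # highest epoch this node has granted a vote to

    def vote_for_leader(self, epoch):
        # Round 1: leader election. At most one vote per epoch, and only for
        # epochs newer than any seen so far, so the leader is unique within an epoch.
        if epoch > self.latest_epoch:
            self.latest_epoch = epoch
            return True
        return False

    def accept_proposal(self, epoch, value):
        # Round 2: voting on the leader's proposal. Reject a leader whose epoch
        # has been superseded by a newer election.
        return epoch >= self.latest_epoch

def is_quorum(positive_votes, cluster_size):
    # A strict majority. Any two majorities overlap in at least one node, so a
    # successful proposal quorum always contains someone who took part in the
    # most recent leader election.
    return positive_votes > cluster_size // 2
```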
people choose to accept this risk for the sake of better performance.
Consensus systems always require a strict majority to operate. This means you need a minimum of three nodes in order to tolerate one failure (the remaining two out of three form a majority), or a minimum of five nodes to tolerate two failures (the remaining three out of five form a majority).
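The arithmetic behind that, as a quick illustrative snippet:

```python
# A cluster of n = 2f + 1 nodes tolerates f failures, because the remaining
# f + 1 nodes still form a strict majority.
def nodes_needed(tolerated_failures):
    return 2 * tolerated_failures + 1

assert nodes_needed(1) == 3  # 3 nodes tolerate 1 failure (2 of 3 are a majority)
assert nodes_needed(2) == 5  # 5 nodes tolerate 2 failures (3 of 5 are a majority)
```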
Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster. Dynamic membership extensions to consensus algorithms allow the set of nodes in the cluster to change over time.
Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly variable network delays, especially geographically distributed systems, it often happens that a node falsely believes the leader to have failed due to a transient network issue.
Sometimes, consensus algorithms are particularly sensitive to network problems.
if the entire network is working correctly except for one particular network link that is consistently unreliable, Raft can get into situations where leadership continually bounces between two nodes,
for example, HBase, Hadoop YARN, OpenStack Nova, and Kafka all rely on ZooKeeper running in the background. What is it that these projects get from it?
ZooKeeper and etcd are designed to hold small amounts of data that can fit entirely in memory (although they still write to disk for durability).
That small amount of data is replicated across all the nodes using a fault-tolerant total order broadcast algorithm.
implementing not only total order broadcast (and hence consensus), but also an interesting set of other features that turn out to be particularly useful when building distributed systems:
Linearizable atomic operations
Total ordering of operations
Failure detection
Change notifications
Thus, a client can find out when another client joins the cluster (based on the value it writes to ZooKeeper), or if another client fails (because its session times out and its ephemeral nodes disappear). By subscribing to notifications, a client avoids having to frequently poll to find out about changes.
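As an illustration of how ephemeral nodes and change notifications are typically used, here is a hedged sketch with the kazoo Python client for ZooKeeper (not code from the book; the server address, paths, and data are made up):

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is illustrative).
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

zk.ensure_path('/app/members')

# Ephemeral + sequential: ZooKeeper appends a monotonically increasing suffix
# and deletes the node automatically if this client's session expires.
my_node = zk.create('/app/members/node-', b'10.1.1.23',
                    ephemeral=True, sequence=True)

@zk.ChildrenWatch('/app/members')
def on_membership_change(children):
    # Invoked whenever a member joins or its ephemeral node disappears,
    # so clients are notified instead of having to poll.
    print('current members:', sorted(children))
```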
of them needs to be chosen as leader or primary.
and need to decide which partition to assign to which node.
Trying to perform majority votes over so many nodes would be terribly inefficient. Instead, ZooKeeper runs on a fixed number of nodes (usually three or five) and performs its majority votes among those nodes while supporting a potentially large number of clients.
provides a way of “outsourcing” some of the work of coordinating nodes
Normally, the kind of data managed by ZooKeeper is quite slow-changing: it represents information like “the node running on IP address 10.1.1.23 is the leader for partition 7,” and such assignments usually change on a timescale of minutes or hours. ZooKeeper is not intended for storing the runtime state of the application,
It is more important that DNS is reliably available and robust to network interruptions.
Thus, if your consensus system already knows who the leader is, then it can make sense to also use that information to help other services discover who the leader is. For this purpose, some consensus systems support read-only caching replicas.
A membership service determines which nodes are currently active and live members of a cluster.
For example, choosing a leader could mean simply choosing the lowest-numbered among the current members, but this approach would not work if different nodes have divergent opinions on who the current members are.
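A toy example of the failure mode described above, with divergent membership views producing two self-proclaimed leaders:

```python
# Illustration only: without agreement on membership, "pick the lowest-numbered
# member" can elect two different leaders at once (split brain).
view_on_node_a = {1, 2, 3}  # node A still believes node 1 is alive
view_on_node_b = {2, 3}     # node B has already timed node 1 out

assert min(view_on_node_a) == 1  # A follows node 1
assert min(view_on_node_b) == 2  # B follows node 2
```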
We looked in depth at linearizability, a popular consistency model: its goal is to make replicated data appear as though there were only a single copy, and to make all operations act on it atomically.
Unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model:
causal ordering (for example using Lamport timestamps), we saw that some things cannot be implemented this way.
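For reference, a minimal Lamport timestamp sketch (a generic illustration of the technique mentioned above, not code from the book):

```python
# Each node increments a counter on every local event, attaches it to outgoing
# messages, and fast-forwards its counter past any timestamp it receives.
# Ordering events by (counter, node_id) yields a total order consistent with causality.

class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def send(self):
        # Timestamp to attach to an outgoing message.
        return self.local_event()

    def receive(self, sender_timestamp):
        # Ensure the receive event is ordered after the corresponding send event.
        self.counter = max(self.counter, sender_timestamp[0])
        return self.local_event()
```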