Kindle Notes & Highlights
Read between August 2 and December 28, 2020
Such application servers are no longer stateless.
Distributed transactions thus have a tendency of amplifying failures.
A consensus algorithm must satisfy the following properties:
Uniform agreement
Integrity
Validity
Termination
The termination property formalizes the idea of fault tolerance. It essentially says that a consensus algorithm cannot simply sit around and do nothing forever — in other words, it must make progress.
(Termination is a liveness property, whereas the other three are safety properties.)
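Not from the book: a rough Python sketch of what the three safety properties mean for a recorded run of a consensus protocol. The node names and the proposed/decided record format are invented for illustration.

```python
# Hypothetical record of one consensus run: which value each node proposed,
# and which value each node that has reached a decision decided on.

def check_safety(proposed: dict, decided: dict) -> None:
    decisions = set(decided.values())
    # Uniform agreement: no two nodes decide differently.
    assert len(decisions) <= 1, "uniform agreement violated"
    # Integrity: no node decides twice (modeled here by one dict entry per node).
    # Validity: any decided value must have been proposed by some node.
    assert decisions <= set(proposed.values()), "validity violated"

# Example: three nodes propose values, two of them have decided so far.
check_safety(
    proposed={"n1": "A", "n2": "B", "n3": "A"},
    decided={"n1": "A", "n3": "A"},
)

# Termination is a liveness property, so it cannot be checked on a finite
# trace like this: it requires that every non-crashed node eventually decides.
```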
2PC does not meet the requirements for termination.
There is a limit to the number of failures that an algorithm can tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination. That majority can safely form a quorum.
Thus, the termination property is subject to the assumption that fewer than half of the nodes are crashed or unreachable.
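The arithmetic behind this is the usual majority-quorum calculation; a tiny sketch:

```python
def nodes_needed(f: int) -> int:
    """Minimum cluster size that keeps a functioning majority with f nodes down."""
    return 2 * f + 1

def quorum_size(n: int) -> int:
    """Smallest strict majority of an n-node cluster."""
    return n // 2 + 1

assert nodes_needed(1) == 3 and quorum_size(3) == 2  # 3 nodes tolerate 1 failure
assert nodes_needed(2) == 5 and quorum_size(5) == 3  # 5 nodes tolerate 2 failures
```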
Remember that total order broadcast requires messages to be delivered exactly once, in the same order, to all nodes. If you think about it, this is equivalent to performing several rounds of consensus:
In practice, consensus algorithms implement total order broadcast directly, because that is more efficient than doing repeated rounds of one-value-at-a-time consensus.
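To make the equivalence concrete, here is a toy sketch (not from the book) that builds total order broadcast on top of an assumed black-box decide(slot, proposal) single-value consensus call; real systems skip this layering and implement total order broadcast directly, as the highlight says.

```python
# Toy reduction: total order broadcast via repeated single-value consensus.
# `decide(slot, proposal)` is a hypothetical black box that returns the value
# all nodes agree on for that slot.

class TotalOrderBroadcast:
    def __init__(self, decide):
        self.decide = decide
        self.next_slot = 0

    def broadcast(self, message):
        """Keep proposing `message` for successive slots until it wins one."""
        while True:
            slot = self.next_slot
            chosen = self.decide(slot, message)   # one round of consensus per slot
            self.next_slot = slot + 1
            self.deliver(slot, chosen)            # deliver messages in slot order
            if chosen == message:
                return slot                       # our message was chosen for this slot

    def deliver(self, slot, message):
        print(f"deliver #{slot}: {message}")
```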
Thus, we need consensus in order to elect a leader. But if the consensus algorithms described here are actually total order broadcast algorithms, and total order broadcast is like single-leader replication, and single-leader replication requires a leader, then…
It seems that in order to elect a leader, we first need a leader. In order to solve consensus, we must first solve consensus. How do we break out of this conundrum?
All of the consensus protocols discussed so far internally use a leader in some form or another, but they don’t guarantee that the leader is unique. Instead, they can make a weaker guarantee: the protocols define an epoch number and guarantee that within each epoch, the leader is unique.
For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of nodes to respond in favor of the proposal. The quorum typically, but not always, consists of a majority of nodes
Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal. The key insight is that the quorums for those two votes must overlap:
Moreover, consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met.
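A toy acceptor (invented names, not any specific algorithm) showing how the epoch number and the overlapping quorums interact: a node that has already promised a higher epoch rejects proposals from a stale leader, and because the election quorum and the proposal quorum share at least one node, a new leader learns about values that may already have been accepted.

```python
# Toy acceptor in the spirit of epoch-numbered consensus protocols.

class Acceptor:
    def __init__(self):
        self.promised_epoch = 0   # highest epoch this node has voted for
        self.accepted = None      # (epoch, value) last accepted, if any

    def vote_for_leader(self, epoch):
        """First round of voting: electing a leader for `epoch`."""
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True, self.accepted    # report any previously accepted value
        return False, None

    def accept_proposal(self, epoch, value):
        """Second round of voting: on the leader's proposed value."""
        if epoch >= self.promised_epoch:
            self.promised_epoch = epoch
            self.accepted = (epoch, value)
            return True
        return False                      # a newer epoch exists: reject the stale leader
```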
Many people choose to accept this risk for the sake of better performance.
Consensus systems always require a strict majority to operate. This means you need a minimum of three nodes in order to tolerate one failure (the remaining two out of three form a majority), and a minimum of five nodes in order to tolerate two failures (the remaining three out of five form a majority).
Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster. Dynamic membership extensions to consensus algorithms allow the set of nodes in the cluster to change over time.
Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly variable network delays, especially geographically distributed systems, it often happens that a node falsely believes the leader to have failed due to a transient network issue.
Sometimes, consensus algorithms are particularly sensitive to network problems.
For example, if the entire network is working correctly except for one particular network link that is consistently unreliable, Raft can get into situations where leadership continually bounces between two nodes.
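A minimal sketch of the timeout-based failure detection the highlights describe; the timeout value and clock handling are simplified, and a network delay longer than the timeout is exactly what produces a false "leader is dead" verdict.

```python
import time

class HeartbeatFailureDetector:
    """Suspect the leader if no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout: float = 1.0):
        self.timeout = timeout
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def leader_suspected(self) -> bool:
        # A transient network delay > timeout makes this return True even
        # though the leader is alive, triggering an unnecessary election.
        return time.monotonic() - self.last_heartbeat > self.timeout
```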
for example, HBase, Hadoop YARN, OpenStack Nova, and Kafka all rely on ZooKeeper running in the background. What is it that these projects get from it?
ZooKeeper and etcd are designed to hold small amounts of data that can fit entirely in memory (although they still write to disk for durability).
That small amount of data is replicated across all the nodes using a fault-tolerant total order broadcast algorithm.
ZooKeeper implements not only total order broadcast (and hence consensus), but also an interesting set of other features that turn out to be particularly useful when building distributed systems:
Linearizable atomic operations
Total ordering of operations
Failure detection
Change notifications
Thus, a client can find out when another client joins the cluster (based on the value it writes to ZooKeeper), or if another client fails (because its session times out and its ephemeral nodes disappear). By subscribing to notifications, a client avoids having to frequently poll to find out about changes.
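A sketch of that pattern using the third-party kazoo Python client for ZooKeeper; the connection string and znode paths are placeholders.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # placeholder ensemble address
zk.start()
zk.ensure_path("/cluster/members")

# Register ourselves with an ephemeral, sequential znode: it disappears
# automatically when our session times out, which is how other clients
# learn that we failed.
my_znode = zk.create("/cluster/members/member-", b"10.1.1.23",
                     ephemeral=True, sequence=True)

# Subscribe to change notifications instead of polling: the callback fires
# whenever a member joins (creates a znode) or fails (its znode disappears).
@zk.ChildrenWatch("/cluster/members")
def on_membership_change(children):
    print("current members:", sorted(children))
```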
If you have several instances of a process or service, one of them needs to be chosen as leader or primary.
Likewise, with a partitioned resource, you need to decide which partition to assign to which node.
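For the leader/primary case, clients typically use a ready-made recipe rather than hand-rolling the znode dance; a sketch with kazoo's Election helper (path and identifier are placeholders).

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def lead():
    # Only the instance that wins the election runs this; it should block
    # for as long as this instance wants to remain leader/primary.
    print("I am now the leader")

election = zk.Election("/myapp/leader-election", identifier="10.1.1.23")
election.run(lead)   # blocks until elected, then calls lead()
```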
Trying to perform majority votes over so many nodes would be terribly inefficient. Instead, ZooKeeper runs on a fixed number of nodes (usually three or five) and performs its majority votes among those nodes while supporting a potentially large number of clients.
ZooKeeper provides a way of “outsourcing” some of the work of coordinating nodes.
Normally, the kind of data managed by ZooKeeper is quite slow-changing: it represents information like “the node running on IP address 10.1.1.23 is the leader for partition 7,” and such assignments usually change on a timescale of minutes or hours. ZooKeeper is not intended for storing the runtime state of the application,
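Such a slow-changing assignment would live in an ordinary (non-ephemeral) znode; a sketch, reusing the placeholder connection string and the example values from the highlight.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/myapp/partitions/7")
zk.set("/myapp/partitions/7", b"leader=10.1.1.23")   # changes on a timescale of minutes/hours

value, stat = zk.get("/myapp/partitions/7")
print(value.decode(), "| znode version:", stat.version)
```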
It is more important that DNS is reliably available and robust to network interruptions.
Thus, if your consensus system already knows who the leader is, then it can make sense to also use that information to help other services discover who the leader is. For this purpose, some consensus systems support read-only caching replicas.
A membership service determines which nodes are currently active and live members of a cluster.
For example, choosing a leader could mean simply choosing the lowest-numbered among the current members, but this approach would not work if different nodes have divergent opinions on who the current members are.
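A short illustration of why that fails: with divergent membership views, "lowest-numbered member wins" elects two different leaders (node IDs are made up).

```python
def choose_leader(members):
    return min(members)   # "lowest-numbered current member is the leader"

view_a = {1, 2, 3}   # node A thinks node 1 is still a member
view_b = {2, 3}      # node B thinks node 1 has failed

assert choose_leader(view_a) == 1
assert choose_leader(view_b) == 2   # split brain: the membership itself must be agreed on first
```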
We looked in depth at linearizability, a popular consistency model: its goal is to make replicated data appear as though there were only a single copy, and to make all operations act on it atomically.
Unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model: some things can be concurrent, so the version history is like a timeline with branching and merging.
Even when we capture the causal ordering (for example using Lamport timestamps), we saw that some things cannot be implemented this way.
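A minimal Lamport clock to make the reference concrete: timestamps are (counter, node_id) pairs compared lexicographically, giving a total order consistent with causality; the node IDs here are arbitrary example values.

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def on_receive(self, remote_timestamp):
        remote_counter, _ = remote_timestamp
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

a, b = LamportClock(node_id=1), LamportClock(node_id=2)
t1 = a.local_event()     # (1, 1)
t2 = b.on_receive(t1)    # (2, 2): causally after t1, and t2 > t1
assert t1 < t2

# Such an ordering alone cannot enforce a constraint like uniqueness, because
# a node cannot know whether another node is concurrently claiming the same
# value with a lower timestamp that has not arrived yet.
```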