Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
you could track the time of the last update and, for one minute after the last update, make all reads from the leader. You could also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader.
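A minimal sketch of that routing rule (not from the book; the leader and follower objects and the lag_seconds attribute are assumptions):

```python
import time

RECENT_WRITE_WINDOW = 60   # read from the leader for one minute after the last update
MAX_ACCEPTABLE_LAG = 60    # skip any follower more than one minute behind

def choose_read_node(leader, followers, last_update_time, now=None):
    """Pick a node to read from, given the time of the last update."""
    now = time.time() if now is None else now
    # For one minute after the last update, send all reads to the leader.
    if now - last_update_time < RECENT_WRITE_WINDOW:
        return leader
    # Otherwise prefer any follower whose measured replication lag is acceptable.
    for follower in followers:
        if follower.lag_seconds <= MAX_ACCEPTABLE_LAG:
            return follower
    return leader  # fall back to the leader if every follower is too far behind
```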
The client can remember the timestamp of its most recent write — then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp.
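A sketch of that idea, assuming hypothetical replica objects that report the timestamp up to which they have applied writes:

```python
def read_your_own_writes(replicas, key, last_write_ts):
    """Serve a read only from a replica that reflects the client's last write.

    last_write_ts is the (logical) timestamp the client remembered from its
    most recent write.
    """
    for replica in replicas:
        # The replica must have applied writes at least up to that timestamp.
        if replica.applied_up_to() >= last_write_ts:
            return replica.get(key)
    # No sufficiently fresh replica: wait and retry, or read from the leader.
    raise RuntimeError("no replica has caught up to the client's last write")
```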
If your replicas are distributed across multiple datacenters (for geographical proximity to users or for availability), there is additional complexity.
you may want to provide cross-device read-after-write consistency:
An anomaly that can occur when reading from asynchronous followers is that it’s possible for a user to see things moving backward in time.
monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward — i.e., they will not read older data after having previously read newer data.
each user always makes their reads from the same replica
For example, the replica can be chosen based on a hash of the user ID,...
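An illustrative version of that routing rule (the hashing details are an assumption, not the book's):

```python
import hashlib

def replica_for_user(user_id, replicas):
    """Always route a given user's reads to the same replica (monotonic reads).

    A stable hash keeps the choice consistent across requests; if that
    replica fails, reads must be rerouted (not shown here).
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]
```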
consistent prefix reads [23]. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
It would be better if application developers didn’t have to worry about subtle replication issues and could just trust their databases to “do the right thing.” This is why transactions exist: they are a way for a database to provide stronger guarantees so that the application can be simpler.
It rarely makes sense to use a multi-leader setup within a single datacenter, because the benefits rarely outweigh the added complexity.
the inter-datacenter network delay is hidden from users, which means the perceived performance may be better.
Tolerance of datacenter outages
every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices.
From an architectural point of view, this setup is essentially the same as multi-leader replication between datacenters, taken to the extreme: each device is a “datacenter,” and the network connection between them is extremely unreliable.
However, for faster collaboration, you may want to make the unit of change very small (e.g., a single keystroke) and avoid locking.
The simplest strategy for dealing with conflicts is to avoid them: if the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur.
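A minimal sketch of conflict avoidance by routing, assuming a hypothetical table that assigns each record a "home" leader (for example, the datacenter closest to the record's owner):

```python
# Hypothetical assignment of records to their designated "home" leader.
HOME_LEADER = {"user:123/profile": "dc-eu-west", "user:456/profile": "dc-us-east"}

def write(leaders, record_id, value):
    """Send the write only to the record's home leader.

    Because every write for a given record goes through the same leader,
    two leaders can never accept conflicting writes for that record.
    """
    leaders[HOME_LEADER[record_id]].apply_write(record_id, value)
```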
Give each write a unique ID (e.g., a timestamp, a long random number, a UUID, or a hash of the key and value), pick the write with the highest ID as the winner,
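A sketch of that last-write-wins (LWW) rule; using a (timestamp, random id) pair as the unique ID is an illustrative choice:

```python
import time
import uuid

def new_write(value):
    """Tag a write with a unique, totally ordered ID."""
    return {"id": (time.time(), uuid.uuid4().hex), "value": value}

def last_write_wins(conflicting_writes):
    """Keep only the write with the highest ID.

    This makes replicas converge, but the "losing" concurrent writes are
    silently discarded even though they were reported as successful.
    """
    return max(conflicting_writes, key=lambda w: w["id"])
```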
As soon as the database system detects a conflict in the log of replicated changes, it calls the conflict handler.
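Roughly how such an on-write handler fits in; detect_conflict is a placeholder for whatever causality check the database actually performs (e.g. comparing version metadata):

```python
def apply_replicated_change(store, key, incoming, conflict_handler):
    """Apply a change from the replication log, calling the handler on conflict.

    The handler runs in a background process as the change is applied, so it
    cannot prompt a user; it must resolve the conflict on its own.
    """
    current = store.get(key)
    if current is not None and detect_conflict(current, incoming):
        store[key] = conflict_handler(current, incoming)  # e.g. merge, or pick one
    else:
        store[key] = incoming

def detect_conflict(current, incoming):
    """Placeholder: a real system compares version metadata on the two writes."""
    return True
```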
When a conflict is detected, all the conflicting writes are stored. The next time the data is read, these multiple versions of the data are returned to the application.
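A sketch of on-read resolution, assuming a hypothetical db.get() that returns the sibling versions together with a causal context token that the write-back must include:

```python
def read_and_resolve(db, key, merge):
    """Read a key; if several conflicting versions come back, merge them in
    the application and write the result back so it supersedes the siblings."""
    siblings, context = db.get(key)
    if len(siblings) == 1:
        return siblings[0]
    resolved = merge(siblings)              # application-specific merge logic
    db.put(key, resolved, context=context)  # replaces exactly the versions read
    return resolved
```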
Conflict resolution logic can quickly become complicated, and custom code can be error-prone. Amazon is a frequently cited example of surprising effects due to a conflict resolution handler: for some time, the conflict resolution logic on the shopping cart would preserve items added to the cart, but not items removed from the cart. Thus, customers would sometimes see items reappearing in their carts even though they had previously been removed.
Conflict-free replicated datatypes (CRDTs)
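As an illustration (not an excerpt), one of the simplest CRDTs is a grow-only counter: each replica increments only its own slot, and merging takes the per-replica maximum, so replicas converge regardless of the order in which merges happen.

```python
class GCounter:
    """Grow-only counter CRDT."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                    # replica id -> count

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())
```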
Mergeable persistent data structures
similarly to the Git version control system, and use a three-way merge function
Operational transformation
Even if the application checks availability before allowing a user to make a booking, there can be a conflict if the two bookings are made on two different leaders.
A replication topology describes the communication paths
A problem with circular and star topologies is that if just one node fails, it can interrupt the flow of replication messages between other nodes, causing them to be unable to communicate until the node is fixed.
On the other hand, all-to-all topologies can have issues too. In particular, some network links may be faster than others (e.g., due to network congestion), with the result that some replication messages may “overtake” others,
because clocks cannot be trusted to be sufficiently in sync to correctly order these events
To order these events correctly, a technique called version vectors can be used,
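A minimal version-vector comparison, with each vector represented as a mapping from replica ID to counter (an illustrative sketch, not the book's code):

```python
def compare(vv_a, vv_b):
    """Decide whether one write happened before the other, or they are concurrent."""
    a_dominates = all(vv_a.get(r, 0) >= c for r, c in vv_b.items())
    b_dominates = all(vv_b.get(r, 0) >= c for r, c in vv_a.items())
    if a_dominates and b_dominates:
        return "equal"
    if a_dominates:
        return "b happened before a"
    if b_dominates:
        return "a happened before b"
    return "concurrent"   # neither dominates: the writes conflict and must be merged

# compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1})  -> "b happened before a"
# compare({"r1": 2}, {"r2": 1})                    -> "concurrent"
```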
Amazon used it for its in-house Dynamo system [37]. Riak, Cassandra, and Voldemort are open source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as Dynamo-style.
read requests are also sent to several nodes in parallel.
Read repair: When a client makes a read from several nodes in parallel, it can detect any stale responses.
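A sketch of read repair, assuming each node returns a (version, value) pair and exposes get/put methods:

```python
def read_with_repair(nodes, key):
    """Read from several replicas, return the newest value, and repair stale copies."""
    responses = {node: node.get(key) for node in nodes}           # parallel in practice
    newest_version, newest_value = max(responses.values(), key=lambda vv: vv[0])
    for node, (version, _) in responses.items():
        if version < newest_version:
            node.put(key, newest_value, version=newest_version)   # write newer value back
    return newest_value
```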
Anti-entropy process: In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order,
values that are rarely read may be missing from some replicas and thus have reduced durability, because read repair is only performed when a value is read by the application.
More generally, if there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. (In our example, n = 3, w = 2, r = 2.) As long as w + r > n, we expect to get an up-to-date value when reading,
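The quorum condition as a one-line check; a common choice is to make n an odd number and set w = r = (n + 1) / 2:

```python
def is_strict_quorum(n, w, r):
    """w + r > n guarantees the read and write sets overlap in at least one node,
    so at least one node read from must hold the latest successful write."""
    return w + r > n

assert is_strict_quorum(n=3, w=2, r=2)       # the example above
assert not is_strict_quorum(n=3, w=1, r=1)   # quorums need not overlap: stale reads possible
```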
Reads and writes that obey these r and w values are called quorum reads and writes
The quorum condition, w + r > n, allows the system to tolerate unavailable nodes as follows: If w < n, we can still process writes if a node is unavailable. If r < n, we can still process reads if a node is unavailable.
With a smaller w and r you are more likely to read stale values, because it’s more likely that your read didn’t include the node with the latest value. On the upside, this configuration allows lower latency and higher availability:
If a write happens concurrently with a read, the write may be reflected on only some of the replicas. In this case, it’s undetermined whether the read returns the old or the new value.
If a write succeeded on some replicas but failed on others (for example because the disks on some nodes are full), and overall succeeded on fewer than w replicas, it is not rolled back on the replicas where it succeeded.
If a node carrying a new value fails, and its data is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w, breaking the quorum condition.
These characteristics make databases with leaderless replication appealing for use cases that require high availability and low latency, and that can tolerate occasional stale reads.
Is it better to return errors to all requests for which we cannot reach a quorum of w or r nodes? Or should we accept writes anyway, and write them to some nodes that are reachable but aren’t among the n nodes on which the value usually lives?
Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes. This is called hinted handoff.
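A sketch of a sloppy-quorum write with hinted handoff; is_reachable, put, and put_hinted are hypothetical node methods:

```python
def write_with_hinted_handoff(home_nodes, fallback_nodes, key, value, w):
    """Accept the write on fallback nodes when some home nodes are unreachable,
    recording a hint naming the node each stray write really belongs to."""
    acks = 0
    unreachable = []
    for node in home_nodes:
        if node.is_reachable():
            node.put(key, value)
            acks += 1
        else:
            unreachable.append(node)
    # Write to nodes outside the key's usual home set, tagged with the hint.
    for home, fallback in zip(unreachable, fallback_nodes):
        if acks >= w:
            break
        fallback.put_hinted(key, value, intended_for=home)
        acks += 1
    return acks >= w   # True if the sloppy quorum of w writes was reached
```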
Sloppy quorums are particularly useful for increasing write availability: as long as any w nodes are available, the database can accept writes. However, this means that even when w + r > n, you cannot be sure to read the latest value for a key, because the latest value may have been temporarily written to some nodes outside of n
If each node simply overwrote the value for a key whenever it received a write request from a client, the nodes would become permanently inconsistent,
You might hope that replicated databases would handle this automatically, but unfortunately most implementations are quite poor: if you want to avoid losing data, you, the application developer, need to know a lot about the internals of your database's conflict handling.
One approach for achieving eventual convergence is to declare that each replica need only store the most “recent” value and allow “older” values to be overwritten and discarded.