Kindle Notes & Highlights
Read between October 21 - November 26, 2024
Pretending that replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.
A natural extension of the leader-based replication model is to allow more than one node to accept writes.
It rarely makes sense to use a multi-leader setup within a single datacenter, because the benefits rarely outweigh the added complexity.
With a normal leader-based replication setup, the leader has to be in one of the datacenters, and all writes must go through that datacenter.
In a multi-leader configuration, you can have a leader in each datacenter.
In a single-leader configuration, every write must go over the internet to the datacenter with the leader.
In a multi-leader configuration, every write can be processed in the local datacenter and is replicated asynchronously to the other datacenters.
If you make any changes while you are offline, they need to be synced with a server and your other devices when the device is next online.
The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required.
If you want synchronous conflict detection, you might as well just use single-leader replication.
In a multi-leader configuration, there is no defined ordering of writes, so it’s not clear what the final value should be.
As the most appropriate way of resolving a conflict may depend on the application, most multi-leader replication tools let you write conflict resolution logic using application code.
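A minimal sketch of what such application-level conflict resolution can look like. The resolver below is illustrative (not any particular database's API): the database hands the application every conflicting version of a value, and the application returns a single winner using a deterministic rule, so all replicas converge on the same result.

```python
# Hypothetical on-read conflict resolver: given all conflicting
# versions ("siblings") of a value, pick one deterministically.
# The rule here (longest string, ties broken lexicographically) is
# arbitrary; what matters is that every replica applies the same rule.

def resolve_conflict(siblings):
    return max(siblings, key=lambda value: (len(value), value))

resolve_conflict(["B", "C"])          # "C" (tie on length, "C" > "B")
resolve_conflict(["draft", "final2"])  # "final2" (longer)
```

A deterministic rule is essential: if two replicas resolved the same siblings differently, the conflict would never actually converge.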
Even if the application checks availability before allowing a user to make a booking, there can be a conflict if the two bookings are made on two different leaders.
A replication topology describes the communication paths along which writes are propagated from one node to another.
With more than two leaders, various different topologies are possible.
In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas.
When a node receives a data change that is tagged with its own identifier, that data change is ignored, because the node knows that it has already been processed.
A leader determines the order in which writes should be processed, and followers apply the leader’s writes in the same order.
In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client.
The client simply ignores the fact that one of the replicas missed the write.
The replication system should ensure that eventually all the data is copied to every replica.
After an unavailable node comes back online, how does it catch up on the writes that it missed?
Note that without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability, because read repair is only performed when a value is read by the application.
As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date.
If w + r > n, at least one of the r replicas you read from must have seen the most recent successful write.
With a smaller w and r you are more likely to read stale values, because it’s more likely that your read didn’t include the node with the latest value.
However, even with w + r > n, there are likely to be edge cases where stale values are returned.
Stronger guarantees generally require transactions or consensus.
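The quorum arithmetic behind the last few highlights can be checked in a couple of lines: with n replicas, w write acknowledgements, and r read responses, w + r > n means the read set and the write set must overlap in at least one replica.

```python
# The quorum overlap condition: reads and writes share at least one
# replica whenever w + r > n (pigeonhole argument).

def quorum_overlaps(n, w, r):
    return w + r > n

# Common choice for n = 3: majority quorums on both sides.
assert quorum_overlaps(n=3, w=2, r=2)

# With w = 1 and r = 1 the two sets can be disjoint, so stale reads
# are possible -- the trade-off described above.
assert not quorum_overlaps(n=3, w=1, r=1)
```

Note that, as the highlight says, satisfying the inequality still leaves edge cases (sloppy quorums, concurrent writes, partial write failures) where stale values can be returned.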
The only safe way of using a database with LWW is to ensure that a key is only written once and thereafter treated as immutable, thus avoiding any concurrent updates to the same key.
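A sketch of why LWW is lossy: each write carries a timestamp, and when writes conflict, only the one with the highest timestamp survives; the others are silently discarded. (The function and data shapes here are illustrative, not a specific database's API.)

```python
# Last write wins (LWW): keep only the write with the highest
# timestamp. Concurrent writes with lower timestamps are silently
# dropped -- which is why LWW is only safe when each key is written
# once and then treated as immutable.

def lww_resolve(writes):
    """writes: list of (timestamp, value); returns the surviving value."""
    return max(writes, key=lambda tv: tv[0])[1]

lww_resolve([(101, "A"), (105, "B"), (103, "C")])  # "B" survives; "A", "C" are lost
```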
An operation A happens before another operation B if B knows about A, or depends on A, or builds upon A in some way.
If one operation happened before another, the later operation should overwrite the earlier operation, but if the operations are concurrent, we have a conflict that needs to be resolved.
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred.
When a write includes the version number from a prior read, that tells us which previous state the write is based on.
With the example of a shopping cart, a reasonable approach to merging siblings is to just take the union.
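The version-number scheme and the union merge from the last two highlights can be sketched together (class and method names are illustrative): each write carries the version number from the client's prior read, the server keeps values from concurrent writes as siblings, and the client merges siblings by taking the union of the cart items.

```python
# Sketch of single-replica versioning with siblings, applied to a
# shopping cart. A write overwrites only the siblings at or below the
# version it was based on; anything written concurrently is kept.

class CartServer:
    def __init__(self):
        self.version = 0
        self.siblings = {}  # version -> frozenset of cart items

    def write(self, cart, based_on_version):
        # Drop only the siblings this write claims to supersede.
        self.siblings = {v: c for v, c in self.siblings.items()
                         if v > based_on_version}
        self.version += 1
        self.siblings[self.version] = frozenset(cart)
        return self.version

    def read(self):
        return self.version, list(self.siblings.values())

def merge_carts(siblings):
    # Client-side merge: take the union of all sibling carts.
    return set().union(*siblings)

server = CartServer()
server.write({"milk"}, based_on_version=0)  # client 1
server.write({"eggs"}, based_on_version=0)  # client 2, concurrent
version, siblings = server.read()           # two siblings survive
cart = merge_carts(siblings)                # {"milk", "eggs"}
```

Taking the union works for additions but not for removals: if one sibling has deleted an item, the union resurrects it, which is why real systems record deletions with tombstones rather than by simply removing the item.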
Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas.
The version vector allows the database to distinguish between overwrites and concurrent writes.
A version vector is sometimes also called a vector clock, even though they are not quite the same.
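A rough sketch of the comparison a version vector enables (illustrative code, not a specific database's implementation): write A happened before write B if every per-replica counter in A is less than or equal to the corresponding counter in B; if neither vector dominates the other, the writes are concurrent and both values must be kept as siblings.

```python
# Comparing two version vectors (dicts mapping replica id -> counter).

def compare(a, b):
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b"   # safe to overwrite a with b
    if b_le_a:
        return "b happened before a"
    return "concurrent"                # keep both as siblings

compare({"r1": 2, "r2": 1}, {"r1": 3, "r2": 1})  # a happened before b
compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3})  # concurrent
```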
If a leader fails and you promote an asynchronously updated follower to be the new leader, recently committed data may be lost.
In effect, each partition is a small database of its own, although the database may support operations that touch multiple partitions at the same time.
The main reason for wanting to partition data is scalability.
Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes.
A node may store more than one partition.
Each partition’s leader is assigned to one node, and its followers are assigned to other nodes.
Our goal with partitioning is to spread the data and the query load evenly across nodes.
A partition with disproportionately high load is called a hot spot.
A good hash function takes skewed data and makes it uniformly distributed.
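A minimal sketch of partitioning by hash of key. The choice of MD5 here is an assumption (real systems use various hash functions), and taking the hash mod N is the simplest possible assignment; many systems instead assign contiguous ranges of hash values to partitions, because mod N forces most keys to move when the number of partitions changes.

```python
# Hash partitioning sketch: a stable hash spreads even skewed,
# sequential keys roughly uniformly across partitions.

import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Adjacent keys land on unrelated partitions, which balances load
# but, as noted above, destroys the keys' sort order for range scans.
assignments = [partition_for(f"user_{i}", 8) for i in range(4)]
```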
Keys that were once adjacent are now scattered across all the partitions, so their sort order is lost.
The problem with secondary indexes is that they don’t map neatly to partitions.