For example, we can attach a timestamp to each write, pick the biggest timestamp as the most “recent,” and discard any writes with an earlier timestamp. This conflict resolution algorithm is called last write wins (LWW).
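To make this concrete, here is a minimal sketch of LWW (the store layout and function names are mine, not from any particular database):

```python
# A minimal sketch of last write wins (LWW): for each key, keep only the
# value with the biggest timestamp; any concurrently written value with a
# smaller timestamp is silently discarded.
store = {}   # key -> (timestamp, value)

def lww_write(key, value, timestamp):
    current = store.get(key)
    if current is None or timestamp > current[0]:
        store[key] = (timestamp, value)
    # Otherwise this write "loses": it is dropped even though the client
    # may have been told it succeeded.

def lww_read(key):
    entry = store.get(key)
    return entry[1] if entry else None
```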
We also say that B is causally dependent on A.
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred.
This algorithm ensures that no data is silently dropped, but it unfortunately requires that the clients do some extra work: if several operations happen concurrently, clients have to clean up afterward by merging the concurrently written values.
Riak calls these concurrent values siblings.
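The single-key algorithm behind these siblings can be sketched roughly as follows (a hypothetical illustration, not Riak's actual implementation): the server keeps a version number per key, every write states which version it was based on, and anything written concurrently is kept as a sibling rather than dropped:

```python
class SiblingKey:
    """Hypothetical single-node sketch: one version number per key,
    concurrent values are retained as siblings."""

    def __init__(self):
        self.version = 0      # incremented on every write to this key
        self.values = {}      # version at which each surviving value was written

    def read(self):
        # Return all siblings plus the version the client must send back
        # with its next write.
        return self.version, list(self.values.values())

    def write(self, value, based_on):
        # The new value supersedes everything the client had seen
        # (version <= based_on); values written concurrently (a higher
        # version the client did not see) are kept as siblings.
        self.version += 1
        self.values = {v: val for v, val in self.values.items() if v > based_on}
        self.values[self.version] = value
        return self.version

# Two clients writing concurrently, each based on version 0:
cart = SiblingKey()
cart.write(["milk"], based_on=0)
cart.write(["eggs"], based_on=0)
print(cart.read())   # (2, [['milk'], ['eggs']]) -- two siblings to merge
```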
With the example of a shopping cart, a reasonable approach to merging siblings is to just take the union.
However, if you want to allow people to also remove things from their carts, and not just add things, then taking the union of siblings may not yield the right result: an item removed in only one of the siblings reappears in the union.
To prevent this problem, an item cannot simply be deleted from the database when it is removed; instead, the system must leave a marker with an appropriate version number to indicate that the item has been removed when merging siblings. Such a deletion marker is known as a tombstone.
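A hedged sketch of such a merge, assuming (my assumption, for illustration) that each sibling records, per item, the version at which it was last added or removed:

```python
# Merge shopping-cart siblings with tombstones. Each sibling maps
# item -> (version, deleted); for each item the entry with the highest
# version wins, so a newer removal beats an older add instead of being
# lost in a plain union.
def merge_siblings(siblings):
    winners = {}
    for sibling in siblings:
        for item, (version, deleted) in sibling.items():
            if item not in winners or version > winners[item][0]:
                winners[item] = (version, deleted)
    # The merged cart is every item whose winning entry is not a tombstone.
    return {item for item, (_, deleted) in winners.items() if not deleted}

sibling_a = {"milk": (1, False), "eggs": (3, True)}    # eggs removed at v3
sibling_b = {"milk": (1, False), "eggs": (2, False)}   # still has eggs from v2
print(merge_siblings([sibling_a, sibling_b]))          # {'milk'}
```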
Instead, we need to use a version number per replica as well as per key.
Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas.
The collection of version numbers from all the replicas is called a version vector. (A variant of this idea is the dotted version vector.)
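A minimal sketch of how version vectors distinguish "overwrite is safe" from "keep both as siblings" (the helper below is illustrative, not any database's API):

```python
# Compare two version vectors, each a dict of replica -> counter.
def compare(a, b):
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b"   # b may safely overwrite a
    if b_le_a:
        return "b happened before a"
    return "concurrent"                # neither saw the other: keep siblings

print(compare({"r1": 2, "r2": 1}, {"r1": 2, "r2": 3}))  # a happened before b
print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 3}))  # concurrent
```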
The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster.
Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.
Large, complex queries can potentially be parallelized across many nodes, although this gets significantly harder.
Our goal with partitioning is to spread the data and the query load evenly across nodes.
In theory, 10 nodes should be able to handle 10 times as much data and 10 times the read and write throughput of a single node (ignoring replication for now).
The simplest approach would be to assign records to nodes randomly. That would distribute the data quite evenly across the nodes, but it has a big disadvantage: when you’re trying to read a particular item, you have no way of knowing which node it is on, so you have to query all nodes in parallel.
The ranges of keys are not necessarily evenly spaced, because your data may not be evenly distributed. For example, in Figure 6-2, volume 1 contains words starting with A and B, but volume 12 contains words starting with T, U, V, W, X, Y, and Z.
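As a small illustration of key-range partitioning (the boundary keys below are made up, not the encyclopedia volumes of Figure 6-2), finding a key's partition is just a binary search over sorted boundaries, and range scans stay cheap:

```python
import bisect

# Each partition owns a contiguous, sorted range of keys; partition i
# starts at boundaries[i].
boundaries = ["a", "c", "f", "k", "p", "t"]

def partition_for(key):
    return bisect.bisect_right(boundaries, key) - 1

print(partition_for("banana"))   # 0: the partition covering "a" up to "c"
print(partition_for("uranus"))   # 5: the last partition, "t" onward
# Because keys stay sorted within each partition, a range scan such as
# "all words from 'cat' to 'dog'" touches only the partitions whose
# boundaries overlap that range.
```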
However, the downside of key range partitioning is that certain access patterns can lead to hot spots.
Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key.
Unfortunately, by using the hash of the key for partitioning we lose a nice property of key-range partitioning: the ability to do efficient range queries.
A query therefore cannot search for a range of values within the first column of a compound key, but if it specifies a fixed value for the first column, it can perform an efficient range scan over the other columns of the key.
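A sketch of the hash-partitioning idea, with an illustrative partition count and hash function:

```python
import hashlib

# Partition by ranges of hash values (Figure 6-3 style): each partition
# owns an equal-sized contiguous slice of the hash space.
NUM_PARTITIONS = 8
HASH_SPACE = 2 ** 32

def partition_for(key):
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h * NUM_PARTITIONS // HASH_SPACE

# Keys that are adjacent in sort order now scatter across partitions,
# which is why range queries over the key become inefficient.
print(partition_for("user:1000"), partition_for("user:1001"))
```

Under the compound-key scheme described above, only the first column would be hashed like this; the remaining columns act as a sort key within the partition, which is why fixing the first column re-enables efficient range scans over the rest.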
For example, on a social media site, a celebrity user with millions of followers may cause a storm of activity when they do something.
However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all 100 keys and combine it.
This technique also requires additional bookkeeping: it only makes sense to append the random number for the small number of hot keys; for the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you also need some way of keeping track of which keys are being split.
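A hedged sketch of this hot-key-splitting technique (the key names and the hot-key registry are hypothetical bookkeeping):

```python
import random

HOT_KEYS = {"celebrity:12345"}      # the keys known to be hot

def key_for_write(key):
    if key in HOT_KEYS:
        # Appending a random two-digit suffix spreads writes over 100 keys,
        # which the hash function then spreads over different partitions.
        return f"{key}_{random.randint(0, 99):02d}"
    return key

def keys_for_read(key):
    if key in HOT_KEYS:
        # Reads must now fetch all 100 sub-keys and combine the results.
        return [f"{key}_{i:02d}" for i in range(100)]
    return [key]
```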
If you have declared the index, the database can perform the indexing automatically. For example, whenever a red car is added to the database, the database partition automatically adds it to the list of document IDs for the index entry color:red.
This approach to querying a partitioned database is sometimes known as scatter/gather, and it can make read queries on secondary indexes quite expensive.
Nevertheless, it is widely used: MongoDB, Riak [15], Cassandra [16], Elasticsearch [17], SolrCloud [18], and VoltDB [19] all use document-partitioned secondary indexes.
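Roughly what a document-partitioned (local) index and a scatter/gather read look like (the class, helper names, and car data below are illustrative):

```python
class Partition:
    """A hypothetical partition with a local index: it indexes only the
    documents it stores itself."""

    def __init__(self):
        self.docs = {}
        self.local_index = {}   # e.g. "color:red" -> [document IDs]

    def add_car(self, doc_id, car):
        self.docs[doc_id] = car
        # The index entry is maintained automatically on write.
        self.local_index.setdefault(f"color:{car['color']}", []).append(doc_id)

def find_by_color(partitions, color):
    # Scatter: the query must go to every partition, because any of them
    # may hold matching cars. Gather: combine the document IDs returned.
    ids = []
    for p in partitions:
        ids.extend(p.local_index.get(f"color:{color}", []))
    return ids

partitions = [Partition(), Partition()]
partitions[0].add_car(191, {"color": "red", "make": "Honda"})
partitions[1].add_car(768, {"color": "red", "make": "Dodge"})
print(find_by_color(partitions, "red"))   # [191, 768]
```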
Rather than each partition having its own secondary index (a local index), we can construct a global index that covers data in all partitions.
A global index must also be partitioned, but it can be partitioned differently from the primary key index.
We call this kind of index term-partitioned, because the term we’re looking for determines the partition of the index.
Partitioning by the term itself can be useful for range scans (e.g., on a numeric property, such as the asking price of the car), whereas partitioning on a hash of the term gives a more even distribution of load.
The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants. However, the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).
In an ideal world, the index would always be up to date, and every document written to the database would immediately be reflected in the index. However, in a term-partitioned index, that would require a distributed transaction across all partitions affected by a write, which is not supported in all databases.
In practice, updates to global secondary indexes are often asynchronous (that is, if you read the index shortly after a write, the change you just made may not yet be reflected in the index).
Amazon DynamoDB states that its global secondary indexes are updated within a fraction of a second in normal circumstances, but may experience longer propagation delays in cases of faults in the infrastructure.
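A rough sketch of a term-partitioned index; the partition count, hash choice, and helper names are mine, and the index updates are applied synchronously here purely for brevity (as noted above, real systems usually apply them asynchronously):

```python
import hashlib

NUM_INDEX_PARTITIONS = 4
index_partitions = [dict() for _ in range(NUM_INDEX_PARTITIONS)]

def index_partition_for(term):
    # The term itself determines which index partition owns its entry.
    h = int.from_bytes(hashlib.md5(term.encode()).digest()[:4], "big")
    return h % NUM_INDEX_PARTITIONS

def index_document(doc_id, terms):
    # A single document write may touch several index partitions, one per
    # term -- this is what makes writes slower and more complicated.
    for term in terms:                         # e.g. "color:red", "make:Honda"
        part = index_partitions[index_partition_for(term)]
        part.setdefault(term, []).append(doc_id)

def find(term):
    # A read goes to exactly one partition: the one that owns this term.
    return index_partitions[index_partition_for(term)].get(term, [])

index_document(191, ["color:red", "make:Honda"])
print(find("color:red"))   # [191]
```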
The query throughput increases, so you want to add more CPUs to handle the load.
The dataset size increases, so you want to add more disks and RAM to store it.
A machine fails, and other machines need to take over the failed machine’s responsibilities.
The process of moving load from one node in the cluster to another is called rebalancing.
No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.
How not to do it: hash mod N
When partitioning by the hash of a key, we said earlier (Figure 6-3) that it’s best to divide the possible hashes into ranges and assign each range to a partition.
The problem with the mod N approach is that if the number of nodes N changes, most of the keys will need to be moved from one node to another.
Such frequent moves make rebalancing excessively expensive.
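A quick way to see the problem, assuming an illustrative hash function and key set:

```python
import hashlib

# Why hash mod N rebalances badly: growing a cluster from 10 to 11 nodes
# moves the vast majority of keys to a different node.
def node_for(key, n):
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(node_for(k, 10) != node_for(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")   # roughly 90%
```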
Fortunately, there is a fairly simple solution: create many more partitions than there are nodes, and assign several partitions to each node.
Only entire partitions are moved between nodes. The number of partitions does not change, nor does the assignment of keys to partitions.
In principle, you can even account for mismatched hardware in your cluster: by assigning more partitions to nodes that are more powerful, you can force those nodes to take a greater share of the load.
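A sketch of this fixed-number-of-partitions scheme, with an illustrative partition count and a simple partition-stealing strategy for when a node is added (the helper names are mine):

```python
import hashlib

NUM_PARTITIONS = 1024   # fixed for the lifetime of the cluster

def partition_of(key):
    # The key -> partition mapping never changes, even as nodes come and go.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % NUM_PARTITIONS

def add_node(assignment, new_node):
    # The new node steals whole partitions from the existing nodes until
    # it holds roughly its fair share; no individual keys are remapped.
    counts = {}
    for owner in assignment.values():
        counts[owner] = counts.get(owner, 0) + 1
    fair_share = NUM_PARTITIONS // (len(counts) + 1)
    stolen = 0
    for p, owner in list(assignment.items()):
        if stolen >= fair_share:
            break
        if counts[owner] > fair_share:
            assignment[p] = new_node
            counts[owner] -= 1
            stolen += 1

assignment = {p: f"n{p % 4}" for p in range(NUM_PARTITIONS)}  # 4 nodes, 256 each
add_node(assignment, "n4")
moved = sum(1 for owner in assignment.values() if owner == "n4")
print(f"{moved} of {NUM_PARTITIONS} partitions moved")        # about 204
```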