Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
For example, we can attach a timestamp to each write, pick the biggest timestamp as the most “recent,” and discard any writes with an earlier timestamp. This conflict resolution algorithm is called last write wins (LWW).
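A rough sketch of that idea in Python (not any particular database's implementation); the timestamps and values are made up:

```python
# Sketch of last-write-wins (LWW): every write carries a timestamp, and the
# merge keeps only the value with the largest timestamp; earlier writes are lost.
def lww_merge(writes):
    """writes: iterable of (timestamp, value) pairs seen for one key."""
    return max(writes, key=lambda w: w[0])[1]

# Two concurrent writes to the same key, timestamped by different clients:
concurrent_writes = [(1700000001.2, "blue"), (1700000003.7, "green")]
print(lww_merge(concurrent_writes))  # "green" wins; "blue" is silently discarded
```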
We also say that B is causally dependent on A.
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred.
This algorithm ensures that no data is silently dropped, but it unfortunately requires that the clients do some extra work: if several operations happen concurrently, clients have to clean up afterward by merging the concurrently written values.
Riak calls these concurrent values siblings.
With the example of a shopping cart, a reasonable approach to merging siblings is to just take the union.
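A minimal sketch of that union-based merge, assuming each sibling is simply a set of item names:

```python
# Sketch: merging shopping-cart siblings by taking the union.
# Assumes each sibling is just a set of item names (a simplification).
def merge_siblings(siblings):
    merged = set()
    for cart in siblings:
        merged |= cart
    return merged

siblings = [{"milk", "flour", "eggs"}, {"milk", "ham", "bacon"}]
print(sorted(merge_siblings(siblings)))  # ['bacon', 'eggs', 'flour', 'ham', 'milk']
```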
However, if you want to allow people to also remove things from their carts, and not just add things, then taking the union of siblings may not yield the right result: an item that was removed in one sibling but not in the other will reappear in the merged cart.
To prevent this problem, an item cannot simply be deleted from the database when it is removed; instead, the system must leave a marker with an appropriate version number to indicate that the item has been removed when merging siblings. Such a deletion marker is known as a tombstone.
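A hedged sketch of a tombstone-aware merge; the per-item version numbers and the dict layout are illustrative, not Riak's actual representation:

```python
# Sketch: merging siblings when removals are recorded as tombstones.
# Each sibling maps item -> (version, is_tombstone); for every item the entry
# with the highest version wins, and tombstoned winners are dropped at the end.
def merge_with_tombstones(siblings):
    winners = {}
    for sibling in siblings:
        for item, (version, is_tombstone) in sibling.items():
            if item not in winners or version > winners[item][0]:
                winners[item] = (version, is_tombstone)
    return {item for item, (_, is_tombstone) in winners.items() if not is_tombstone}

a = {"milk": (3, False), "eggs": (4, True)}   # eggs removed -> tombstone, version 4
b = {"milk": (3, False), "eggs": (2, False)}  # older sibling still contains eggs
print(merge_with_tombstones([a, b]))  # {'milk'}: the removal wins, eggs stay gone
```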
Instead, we need to use a version number per replica as well as per key.
Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas.
The collection of version numbers from all the replicas is called a version vector.
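A minimal version-vector sketch, with hypothetical replica IDs; the `descends_from` check is one way to detect that two writes are concurrent (neither has seen the other):

```python
# Minimal version-vector sketch: one counter per replica, merged elementwise.
class VersionVector:
    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, replica_id):
        # A replica bumps its own counter when it processes a write.
        self.counters[replica_id] = self.counters.get(replica_id, 0) + 1

    def merge(self, other):
        # Elementwise maximum: everything either side has seen.
        replicas = set(self.counters) | set(other.counters)
        return VersionVector({r: max(self.counters.get(r, 0),
                                     other.counters.get(r, 0)) for r in replicas})

    def descends_from(self, other):
        # True if this vector has seen at least everything the other has.
        return all(self.counters.get(r, 0) >= c for r, c in other.counters.items())

a = VersionVector({"replica1": 2, "replica2": 1})
b = VersionVector({"replica1": 1, "replica2": 3})
concurrent = not a.descends_from(b) and not b.descends_from(a)
print(concurrent)  # True: neither write has seen the other, so they conflict
```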
dotted version vector
The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster.
Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.
Large, complex queries can potentially be parallelized across many nodes, although this gets significantly harder.
Our goal with partitioning is to spread the data and the query load evenly across nodes.
In theory, 10 nodes should be able to handle 10 times as much data and 10 times the read and write throughput of a single node (ignoring replication for now).
Suppose we assigned records to nodes randomly. That would distribute the data quite evenly across the nodes, but it has a big disadvantage: when you’re trying to read a particular item, you have no way of knowing which node it is on.
The ranges of keys are not necessarily evenly spaced, because your data may not be evenly distributed. For example, in Figure 6-2, volume 1 contains words starting with A and B, but volume 12 contains words starting with T, U, V, W, X, Y, and Z.
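A small illustrative sketch of key-range partitioning; the boundary keys are made up, echoing the uneven encyclopedia volumes:

```python
import bisect

# Sketch of key-range partitioning: each partition owns a contiguous range of
# sorted keys, and the boundaries can be chosen to fit the data distribution.
boundaries = ["c", "f", "k", "p", "t"]  # partition i holds keys below boundaries[i]

def partition_for(key):
    return bisect.bisect_right(boundaries, key)  # partition index 0..len(boundaries)

for word in ["aardvark", "kangaroo", "zebra"]:
    print(word, "-> partition", partition_for(word))
```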
However, the downside of key range partitioning is that certain access patterns can lead to hot spots.
Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key.
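An illustrative sketch of hash partitioning, mapping a 32-bit hash range onto equal partition ranges; the MD5-based hash and the partition count are arbitrary choices here:

```python
import hashlib

# Sketch of hash partitioning: a stable hash of the key, mapped onto a range
# of hash values per partition, decides where the key lives.
NUM_PARTITIONS = 8
HASH_SPACE = 2 ** 32

def partition_for(key):
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h * NUM_PARTITIONS // HASH_SPACE  # each partition owns a hash range

print(partition_for("alice"), partition_for("alicf"))  # similar keys scatter
```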
Unfortunately, however, by using the hash of the key for partitioning we lose a nice property of key-range partitioning: the ability to do efficient range queries.
A query therefore cannot search for a range of values within the first column of a compound key, but if it specifies a fixed value for the first column, it can perform an efficient range scan over the other columns of the key.
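A sketch of how a compound key (here a hypothetical (user_id, timestamp) pair) supports range scans within one partition; this mimics the idea, not any specific database's storage format:

```python
from bisect import bisect_left, bisect_right

# Sketch of a compound key (user_id, timestamp): the first column chooses the
# partition (via hashing, not shown), and within the partition rows stay sorted
# by the remaining columns, so fixing user_id allows an efficient timestamp scan.
rows = [  # one partition's rows, kept sorted by (user_id, timestamp)
    ("alice", 100, "post A"), ("alice", 130, "post B"),
    ("alice", 170, "post C"), ("bob", 110, "post D"),
]

def range_scan(rows, user_id, t_start, t_end):
    lo = bisect_left(rows, (user_id, t_start))
    hi = bisect_right(rows, (user_id, t_end, chr(0x10FFFF)))  # inclusive upper bound
    return rows[lo:hi]

print(range_scan(rows, "alice", 120, 200))  # posts B and C, without touching bob's rows
```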
For example, on a social media site, a celebrity user with millions of followers may cause a storm of activity when they do something.
However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all 100 keys and combine it.
This technique also requires additional bookkeeping: it only makes sense to append the random number for the small number of hot keys; for the vast majority of keys with low write throughput this would be unnecessary overhead.
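A toy sketch of splitting a hot key across 100 sub-keys on write and recombining on read; the key naming scheme and in-memory store are invented for illustration:

```python
import random

# Sketch: spread writes to one hot key across 100 sub-keys, then recombine on read.
# Only worthwhile for the few known-hot keys, since every read now has 100x fan-out.
store = {}

def write_hot(key, value):
    sub_key = f"{key}#{random.randrange(100):02d}"  # e.g. "celebrity_feed#37"
    store.setdefault(sub_key, []).append(value)

def read_hot(key):
    values = []
    for n in range(100):  # reads must gather and combine all 100 sub-keys
        values.extend(store.get(f"{key}#{n:02d}", []))
    return values

for i in range(5):
    write_hot("celebrity_feed", f"comment {i}")
print(len(read_hot("celebrity_feed")))  # 5
```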
If you have declared the index, the database can perform the indexing automatically. For example, whenever a red car is added to the database, the database partition automatically adds it to the list of document IDs for the index entry color:red.
This approach to querying a partitioned database is sometimes known as scatter/gather, and it can make read queries on secondary indexes quite expensive.
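A toy sketch of scatter/gather over document-partitioned (local) indexes; the index contents are made up:

```python
# Sketch of scatter/gather over document-partitioned (local) secondary indexes:
# each partition indexes only its own documents, so a query on color:red must be
# sent to every partition and the partial results combined.
partitions = [
    {"color:red": [191, 306], "make:honda": [306]},  # partition 0's local index
    {"color:red": [768], "make:ford": [768]},        # partition 1's local index
]

def query_secondary(index_entry):
    matches = []
    for local_index in partitions:  # scatter: ask every partition
        matches.extend(local_index.get(index_entry, []))
    return sorted(matches)          # gather: combine the partial results

print(query_secondary("color:red"))  # [191, 306, 768]
```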
Nevertheless, it is widely used: MongoDB, Riak [15], Cassandra [16], Elasticsearch [17], SolrCloud [18], and VoltDB [19] all use document-partitioned secondary indexes.
Rather than each partition having its own secondary index (a local index), we can construct a global index that covers data in all partitions.
A global index must also be partitioned, but it can be partitioned differently from the primary key index.
We call this kind of index term-partitioned, because the term we’re looking for determines the partition of the index.
Partitioning by the term itself can be useful for range scans (e.g., on a numeric property, such as the asking price of the car), whereas partitioning on a hash of the term gives a more even distribution of load.
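An illustrative sketch of the two routing choices for a term-partitioned index, using made-up term boundaries and partition counts:

```python
import hashlib

# Sketch of routing in a term-partitioned (global) index: the term itself (or a
# hash of it) decides which index partition holds the entry, independently of
# where the document lives.
NUM_INDEX_PARTITIONS = 4

def route_by_term_range(term):
    # Range-partition on the term: similar terms stay together, enabling range scans.
    return 0 if term < "f" else 1 if term < "n" else 2 if term < "t" else 3

def route_by_term_hash(term):
    # Hash the term instead: more even load, but no range scans over terms.
    h = int(hashlib.md5(term.encode()).hexdigest(), 16)
    return h % NUM_INDEX_PARTITIONS

print(route_by_term_range("color:red"), route_by_term_hash("color:red"))
```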
The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants. However, the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index.
In an ideal world, the index would always be up to date, and every document written to the database would immediately be reflected in the index. However, in a term-partitioned index, that would require a distributed transaction across all partitions affected by a write, which is not supported in all databases.
In practice, updates to global secondary indexes are often asynchronous.
Amazon DynamoDB states that its global secondary indexes are updated within a fraction of a second in normal circumstances, but may experience longer propagation delays in cases of faults in the infrastructure.
The query throughput increases, so you want to add more CPUs to handle the load.
The dataset size increases, so you want to add more disks and RAM to store it.
A machine fails, and other machines need to take over the failed machine’s responsibilities.
The process of moving load from one node in the cluster to another is called rebalancing.
No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.
How not to do it: hash mod N
When partitioning by the hash of a key, we said earlier (Figure 6-3) that it’s best to divide the possible hashes into ranges and assign each range to a partition.
The problem with the mod N approach is that if the number of nodes N changes, most of the keys will need to be moved from one node to another.
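A quick experiment illustrating why: with hash mod N, going from 10 to 11 nodes reassigns roughly 10/11 of all keys (the MD5-based hash here is an arbitrary stand-in):

```python
import hashlib

# With hash(key) mod N, growing a cluster from 10 to 11 nodes reassigns most
# keys, which is exactly what makes mod-N rebalancing so expensive.
def node_for(key, n):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(node_for(k, 10) != node_for(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys would move")  # roughly 91%
```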
Such frequent moves make rebalancing excessively expensive.
Fortunately, there is a fairly simple solution: create many more partitions than there are nodes, and assign several partitions to each node.
Only entire partitions are moved between nodes. The number of partitions does not change, nor does the assignment of keys to partitions.
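A toy sketch of this scheme: a fixed number of partitions, and a rebalance that only hands whole partitions to a new node until it holds its fair share; the assignment policy is invented for illustration:

```python
# Sketch: fix the number of partitions up front (the key -> partition mapping never
# changes) and rebalance by moving whole partitions to a new node.
NUM_PARTITIONS = 20

def add_node(assignment, new_node, num_nodes_after):
    assignment = dict(assignment)
    fair_share = NUM_PARTITIONS // num_nodes_after
    for _ in range(fair_share):
        # Steal one whole partition from whichever node currently holds the most.
        loads = {}
        for part, node in assignment.items():
            loads.setdefault(node, []).append(part)
        busiest = max(loads, key=lambda n: len(loads[n]))
        assignment[loads[busiest][0]] = new_node
    return assignment

before = {p: f"node{p % 4}" for p in range(NUM_PARTITIONS)}  # 4 nodes, 5 partitions each
after = add_node(before, "node4", 5)
moved = [p for p in before if before[p] != after[p]]
print(f"only {len(moved)} of {NUM_PARTITIONS} partitions moved")  # 4, not most of them
```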
In principle, you can even account for mismatched hardware in your cluster: by assigning more partitions to nodes that are more powerful, you can force those nodes to take a greater share of the load.