Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Kindle Notes & Highlights
25%
Dynamo-style databases allow several clients to concurrently write to the same key, which means that conflicts will occur even if strict quorums are used.
26%
Last write wins (LWW) is the only supported conflict resolution method in Cassandra.
26%
LWW achieves the goal of eventual convergence, but at the cost of durability: if there are several concurrent writes to the same key, even if they were all reported as successful to the client (because they were written to w replicas), only one of the writes will survive and the others will be silently discarded.
26%
If losing data is not acceptable, LWW is a poor choice for conflict resolution.
26%
Two operations are concurrent if neither happens before the other (i.e., neither knows about the other) [54].
26%
When a client writes a key, it must include the version number from the prior read, and it must merge together all values that it received in the prior read.
26%
Riak calls these concurrent values siblings.
26%
An item cannot simply be deleted from the database when it is removed; instead, the system must leave a marker with an appropriate version number to indicate that the item has been removed when merging siblings. Such a deletion marker is known as a tombstone.
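The merge-with-tombstones step can be sketched in Python. This is an illustrative client-side merge, not Riak's actual API; the `TOMBSTONE` marker and the shopping-cart values are made up for the example:

```python
# Sketch of merging sibling values with tombstones. Each sibling is a dict
# of item -> value, where a deleted item carries the TOMBSTONE marker so
# that the deletion survives the merge instead of being resurrected.

TOMBSTONE = "__deleted__"

def merge_siblings(siblings):
    """Merge concurrent sibling versions of a value into one."""
    merged = {}
    for sibling in siblings:
        for item, value in sibling.items():
            # A tombstone from any sibling wins: the deletion must survive.
            if value == TOMBSTONE or merged.get(item) == TOMBSTONE:
                merged[item] = TOMBSTONE
            else:
                merged.setdefault(item, value)
    # Strip tombstones from the value returned to the application.
    return {k: v for k, v in merged.items() if v != TOMBSTONE}

# Two concurrent versions of a shopping cart: one added "ham",
# the other deleted "milk" (leaving a tombstone).
cart_a = {"milk": 1, "eggs": 12, "ham": 2}
cart_b = {"milk": TOMBSTONE, "eggs": 12}
print(merge_siblings([cart_a, cart_b]))  # {'eggs': 12, 'ham': 2}
```

Without the tombstone, the merge would see "milk" in one sibling and not the other, and could not tell a deletion apart from a not-yet-replicated addition.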
26%
We need to use a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas.
26%
The collection of version numbers from all the replicas is called a version vector.
26%
Version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written. (Riak encodes the version vector as a string that it calls causal context.)
26%
The version vector structure ensures that it is safe to read from one replica and subsequently write back to another replica.
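The happens-before test behind version vectors can be sketched as follows. The replica ids and counters are illustrative; Riak ships the real thing to clients as an opaque causal-context string:

```python
# Version-vector comparison, sketched. A vector is a dict mapping
# replica id -> counter. One vector "happens before" another if every
# counter is <= the other's and at least one is strictly less; if neither
# dominates, the two writes are concurrent (siblings).

def compare(a, b):
    """Return the causal relationship between two version vectors."""
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happens before b"   # b has seen everything a has seen
    if b_le_a:
        return "b happens before a"
    return "concurrent"               # neither knows about the other

print(compare({"r1": 2, "r2": 1}, {"r1": 3, "r2": 1}))  # a happens before b
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3}))  # concurrent
```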
27%
What we call a partition here is called a shard in MongoDB, Elasticsearch, and SolrCloud; it’s known as a region in HBase, a tablet in Bigtable, a vnode in Cassandra and Riak, and a vBucket in Couchbase.
27%
Each partition is a small database of its own, although the database may support operations that touch multiple partitions at the same time.
28%
A partition with disproportionately high load is called a hot spot.
28%
Once you have a suitable hash function for keys, you can assign each partition a range of hashes (rather than a range of keys), and every key whose hash falls within a partition’s range will be stored in that partition.
28%
By using the hash of the key for partitioning we lose a nice property of key-range partitioning: the ability to do efficient range queries.
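The hash-range scheme can be sketched as follows. The use of md5 and the partition count are illustrative choices, not any particular database's defaults (Cassandra, for instance, defaults to a Murmur3-based partitioner):

```python
# Sketch of partitioning by hash of key: each partition owns an equal,
# contiguous range of the 32-bit hash space, and a key is routed by hashing.
import hashlib

NUM_PARTITIONS = 8
HASH_SPACE = 2 ** 32

def partition_for(key):
    # Take the first 4 bytes of the md5 digest as a 32-bit hash.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h // (HASH_SPACE // NUM_PARTITIONS)

# Adjacent keys hash to unrelated partitions, which is exactly why
# efficient range queries are lost under hash partitioning:
for key in ("user:41", "user:42", "user:43"):
    print(key, "-> partition", partition_for(key))
```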
28%
A table in Cassandra can be declared with a compound primary key consisting of several columns. Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables.
28%
This querying approach is known as scatter/gather, and it can make read queries on secondary indexes quite expensive. Even if you query the partitions in parallel, scatter/gather is prone to tail latency amplification.
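A scatter/gather query over a document-partitioned secondary index can be sketched with in-memory partitions (the documents and the `color` field are made up). The query must be sent to every partition and the partial results merged, which is why the slowest partition determines the overall response time:

```python
# Three partitions, each holding its own documents and indexing them locally.
partitions = [
    {"doc1": {"color": "red"}, "doc2": {"color": "green"}},
    {"doc3": {"color": "red"}},
    {"doc4": {"color": "yellow"}, "doc5": {"color": "red"}},
]

def query_by_color(color):
    results = []
    for partition in partitions:               # scatter: ask every partition
        results.extend(doc_id for doc_id, fields in partition.items()
                       if fields["color"] == color)
    return sorted(results)                     # gather: merge partial results

print(query_by_color("red"))  # ['doc1', 'doc3', 'doc5']
```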
28%
MongoDB, Riak [15], Cassandra [16], Elasticsearch [17], SolrCloud [18], and VoltDB [19] all use document-partitioned secondary indexes.
29%
We can construct a global index that covers data in all partitions. However, we can’t just store that index on one node, since it would likely become a bottleneck and defeat the purpose of partitioning.
29%
We call this kind of index term-partitioned, because the term we’re looking for determines the partition of the index.
29%
The downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).
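A sketch of how a term-partitioned index routes entries; the hash choice and partition count are assumptions for illustration:

```python
# Term-partitioned global index, sketched: the term itself determines which
# index partition holds its entry, so one document write may touch several
# index partitions on several nodes.
import hashlib

NUM_INDEX_PARTITIONS = 4

def index_partition_for(term):
    h = int(hashlib.md5(term.encode()).hexdigest(), 16)
    return h % NUM_INDEX_PARTITIONS

def index_updates_for_write(doc_id, terms):
    """Index partitions that must be updated when this document is written."""
    return {term: index_partition_for(term) for term in terms}

updates = index_updates_for_write("doc42", ["red", "fast", "convertible"])
print(updates)
print(len(set(updates.values())), "distinct index partition(s) touched by one write")
```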
29%
Updates to global secondary indexes are often asynchronous.
29%
The problem with the mod N approach is that if the number of nodes N changes, most of the keys will need to be moved from one node to another.
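A quick demonstration of why mod N rebalancing is so expensive (the hash function and key count are illustrative):

```python
# With hash(key) mod N assignment, growing a cluster from 10 to 11 nodes
# reassigns almost every key, forcing a mass migration.
import hashlib

def node_for(key, n_nodes):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = [f"key{i}" for i in range(10_000)]
moved = sum(node_for(k, 10) != node_for(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 90%
```

A key stays put only if its hash gives the same remainder mod 10 and mod 11, which happens for about 1 key in 11.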
29%
Create many more partitions than there are nodes, and assign several partitions to each node.
29%
By assigning more partitions to nodes that are more powerful, you can force those nodes to take a greater share of the load.
29%
If partitions are very large, rebalancing and recovery from node failures become expensive. But if partitions are too small, they incur too much overhead.
29%
Key range–partitioned databases such as HBase and RethinkDB create partitions dynamically.
29%
HBase and MongoDB allow an initial set of partitions to be configured on an empty database (this is called pre-splitting).
29%
It can be a good thing to have a human in the loop for rebalancing. It’s slower than a fully automatic process, but it can help prevent operational surprises.
29%
Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of partitions to nodes. Other actors, such as the routing tier or the partitioning-aware client, can subscribe to this information in ZooKeeper.
30%
Cassandra and Riak take a different approach: they use a gossip protocol among the nodes to disseminate any changes in cluster state. Requests can be sent to any node, and that node forwards them to the appropriate node for the requested partition.
30%
When using a routing tier or when sending requests to a random node, clients still need to find the IP addresses to connect to. These are not as fast-changing as the assignment of partitions to nodes, so it is often sufficient to use DNS for this purpose.
30%
When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed.
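The fixed-partitions scheme can be sketched as follows. The node names, the partition count, and the round-robin initial assignment are assumptions for illustration; the point is that the key-to-partition mapping never changes, and a joining node steals only whole partitions:

```python
# Rebalancing with a fixed number of partitions: when a node joins, it
# steals whole partitions from existing nodes until the load is even;
# every other partition stays exactly where it is.

NUM_PARTITIONS = 256

def add_node(assignment, new_node):
    """Return a new partition->node assignment with new_node's fair share."""
    n_nodes = len(set(assignment.values()) | {new_node})
    target = len(assignment) // n_nodes        # partitions per node afterwards
    counts = {}
    for node in assignment.values():
        counts[node] = counts.get(node, 0) + 1
    new_assignment = dict(assignment)
    stolen = 0
    for p, node in sorted(assignment.items()):
        if stolen == target:
            break
        if counts[node] > target:              # only overloaded nodes give up partitions
            new_assignment[p] = new_node
            counts[node] -= 1
            stolen += 1
    return new_assignment

old = {p: f"n{p % 3}" for p in range(NUM_PARTITIONS)}   # 3 nodes, ~85 partitions each
new = add_node(old, "n3")
moved = sum(old[p] != new[p] for p in old)
print(f"{moved} of {NUM_PARTITIONS} partitions moved")  # 64: only a quarter move
```

Contrast this with the mod N approach, where the same cluster growth would move most of the keys.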
30%
A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback).
31%
ACID, which stands for Atomicity, Consistency, Isolation, and Durability.
31%
(Systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency.)
31%
ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed.
31%
If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted.
31%
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity.
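This abort-and-discard behaviour can be sketched with SQLite's transaction support (an in-memory database; the accounts table and values are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, amount, fail_midway=False):
    # "with conn" wraps the statements in one transaction: it commits on
    # success and rolls back if any exception escapes the block.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                     (amount,))
        if fail_midway:
            raise RuntimeError("simulated fault between the two writes")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                     (amount,))

try:
    transfer(conn, 60, fail_midway=True)
except RuntimeError:
    pass

# The first write was discarded along with the aborted transaction:
print(conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone())
# (100,)
```

Without atomicity, the fault would leave alice debited but bob never credited, and the application would have to detect and repair the half-finished state itself.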
31%
This idea of consistency depends on the application’s notion of invariants, and it’s the application’s responsibility to define its transactions correctly.
31%
Serializable isolation is rarely used, because it carries a performance penalty.
31%
Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten.
31%
There is no one technique that can provide absolute guarantees. There are only various risk-reduction techniques, including writing to disk, replicating to remote machines, and backups.
31%
Multi-object transactions are often needed if several pieces of data need to be kept in sync.
31%
On any particular connection, everything between a BEGIN TRANSACTION and a COMMIT statement is considered to be part of the same transaction.
31%
Atomicity can be implemented using a log for crash recovery.
31%
Isolation can be implemented using a lock on each object (allowing only one thread to access an object at any one time).
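The lock-per-object scheme can be sketched with Python threads and a toy key-value store (the names are illustrative; real databases use far more sophisticated lock managers):

```python
import threading

locks = {}                       # object key -> its lock
locks_guard = threading.Lock()   # protects the locks dict itself

def lock_for(key):
    with locks_guard:
        return locks.setdefault(key, threading.Lock())

store = {"x": 0}

def increment(key, times):
    for _ in range(times):
        with lock_for(key):      # only one thread may touch this object at a time
            store[key] = store[key] + 1

threads = [threading.Thread(target=increment, args=("x", 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store["x"])  # 40000: the per-object lock prevents lost updates
```

Holding the lock for the whole read-modify-write makes each increment appear atomic to the other threads; without it, two threads could read the same value and overwrite each other's update.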
32%
A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.