Kindle Notes & Highlights
Dynamo-style databases allow several clients to concurrently write to the same key, which means that conflicts will occur even if strict quorums are used.
Last write wins (LWW) is the only supported conflict resolution method in Cassandra.
LWW achieves the goal of eventual convergence, but at the cost of durability: if there are several concurrent writes to the same key, even if they were all reported as successful to the client (because they were written to w replicas), only one of the writes will survive and the others will be silently discarded.
If losing data is not acceptable, LWW is a poor choice for conflict resolution.
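As a rough sketch of why LWW discards data (the timestamps and values below are made up for illustration; this is not Cassandra's implementation), a last-write-wins merge keeps only the value with the highest timestamp:

    # Hypothetical LWW merge: each write carries a timestamp, and the write
    # with the highest timestamp "wins"; the others are silently discarded
    # even though their clients saw a success response.
    def lww_merge(writes):
        """writes: list of (timestamp, value) pairs for the same key."""
        return max(writes, key=lambda w: w[0])[1]

    concurrent_writes = [(1700000001, "A"), (1700000003, "C"), (1700000002, "B")]
    print(lww_merge(concurrent_writes))  # "C" survives; "A" and "B" are lost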
two operations are concurrent if neither happens before the other (i.e., neither knows about the other) [54].
When a client writes a key, it must include the version number from the prior read, and it must merge together all values that it received in the prior read.
Riak calls these concurrent values siblings.
an item cannot simply be deleted from the database when it is removed; instead, the system must leave a marker with an appropriate version number to indicate that the item has been removed when merging siblings. Such a deletion marker is known as a tombstone.
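A typical application-level merge of siblings, sketched here with an assumed shopping-cart data shape (sets of items plus a set of tombstones; not Riak's actual code), takes the union of the values while keeping deletions deleted:

    # Sketch of merging sibling values on read, assuming the application stores
    # sets of items and records each deletion as a tombstone.
    def merge_siblings(siblings):
        """siblings: list of dicts like {"items": set, "tombstones": set}."""
        items = set().union(*(s["items"] for s in siblings))
        tombstones = set().union(*(s["tombstones"] for s in siblings))
        # An item deleted in any sibling stays deleted after the merge.
        return {"items": items - tombstones, "tombstones": tombstones}

    cart_a = {"items": {"milk", "eggs"}, "tombstones": set()}
    cart_b = {"items": {"milk", "flour"}, "tombstones": {"eggs"}}
    print(merge_siblings([cart_a, cart_b]))  # items: milk and flour; eggs stays deleted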
we need to use a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas.
The collection of version numbers from all the replicas is called a version vector
version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written. (Riak encodes the version vector as a string that it calls causal context.)
The version vector structure ensures that it is safe to read from one replica and subsequently write back to another replica.
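A minimal sketch of what a version vector makes possible (the dict-of-counters representation and helper names are assumptions, not Riak's causal-context format): two values are concurrent if neither vector dominates the other, and a merge takes the element-wise maximum:

    # Version vectors as dicts mapping replica id -> counter.
    def dominates(a, b):
        """True if vector a has seen everything that vector b has seen."""
        return all(a.get(r, 0) >= c for r, c in b.items())

    def is_concurrent(a, b):
        return not dominates(a, b) and not dominates(b, a)

    def merge(a, b):
        """Element-wise maximum: the merged vector has seen both histories."""
        return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

    v1 = {"replica_a": 2, "replica_b": 0}
    v2 = {"replica_a": 1, "replica_b": 1}
    print(is_concurrent(v1, v2))  # True: neither replica saw the other's write
    print(merge(v1, v2))          # {'replica_a': 2, 'replica_b': 1}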
What we call a partition here is called a shard in MongoDB, Elasticsearch, and SolrCloud; it’s known as a region in HBase, a tablet in Bigtable, a vnode in Cassandra and Riak, and a vBucket in Couchbase.
each partition is a small database of its own, although the database may support operations that touch multiple partitions at the same time.
A partition with disproportionately high load is called a hot spot.
Once you have a suitable hash function for keys, you can assign each partition a range of hashes (rather than a range of keys), and every key whose hash falls within a partition’s range will be stored in that partition.
by using the hash of the key for partitioning we lose a nice property of key-range partitioning: the ability to do efficient range queries.
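For illustration, a sketch of hash-range partitioning with made-up boundaries (the hash function and four-partition layout are arbitrary choices for this example):

    import hashlib
    from bisect import bisect_right

    NUM_PARTITIONS = 4
    # Upper bounds of the first NUM_PARTITIONS - 1 hash ranges over 0..2**32;
    # keys whose hash falls above the last bound land in the final partition.
    BOUNDARIES = [(2**32 // NUM_PARTITIONS) * i for i in range(1, NUM_PARTITIONS)]

    def key_hash(key):
        # A deterministic hash; Python's built-in hash() is randomized per
        # process, so every node must use the same stable function instead.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    def partition_for(key):
        return bisect_right(BOUNDARIES, key_hash(key))

    print(partition_for("user:42"), partition_for("user:43"))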
A table in Cassandra can be declared with a compound primary key consisting of several columns. Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables.
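A sketch of that split between the partition key and the clustering columns (an illustrative in-memory structure, not Cassandra's storage format): the first column decides placement, the remaining columns decide sort order within the partition:

    from collections import defaultdict

    # Compound primary key (user_id, update_timestamp): only user_id is hashed
    # to choose the partition; rows inside a partition stay sorted by the
    # remaining column, so range scans within one user are sequential reads.
    NUM_PARTITIONS = 4
    partitions = defaultdict(list)

    def insert(user_id, update_timestamp, value):
        p = hash(user_id) % NUM_PARTITIONS       # fine within a single process
        partitions[p].append((user_id, update_timestamp, value))
        partitions[p].sort()

    insert("user_1", 20201201, "post b")
    insert("user_1", 20200405, "post a")
    p = hash("user_1") % NUM_PARTITIONS
    print([row for row in partitions[p] if row[0] == "user_1"])
    # [('user_1', 20200405, 'post a'), ('user_1', 20201201, 'post b')]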
This approach is known as scatter/gather, and it can make read queries on secondary indexes quite expensive. Even if you query the partitions in parallel, scatter/gather is prone to tail latency amplification.
MongoDB, Riak [15], Cassandra [16], Elasticsearch [17], SolrCloud [18], and VoltDB [19] all use document-partitioned secondary indexes.
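A sketch of scatter/gather over document-partitioned (local) secondary indexes; the partition contents and index shape are invented for illustration:

    # Each partition keeps a secondary index over only its own documents, so a
    # query on the indexed field must ask every partition and combine results.
    partitions = [
        {"docs": {1: {"color": "red"}, 2: {"color": "blue"}},
         "index": {"red": [1], "blue": [2]}},
        {"docs": {3: {"color": "red"}},
         "index": {"red": [3]}},
    ]

    def query_by_color(color):
        results = []
        for part in partitions:            # scatter: query every partition
            results.extend(part["index"].get(color, []))
        return sorted(results)             # gather: combine partial results

    print(query_by_color("red"))  # [1, 3] -- both partitions had to answer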
we can construct a global index that covers data in all partitions. However, we can’t just store that index on one node, since it would likely become a bottleneck and defeat the purpose of partitioning.
We call this kind of index term-partitioned, because the term we’re looking for determines the partition of the index.
the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).
updates to global secondary indexes are often asynchronous
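A sketch of a term-partitioned global index (the partition count and routing-by-term scheme are assumptions): index entries are placed by term rather than by document, so a single document write can touch several index partitions:

    NUM_INDEX_PARTITIONS = 3
    index_partitions = [{} for _ in range(NUM_INDEX_PARTITIONS)]

    def index_partition_for(term):
        return sum(term.encode()) % NUM_INDEX_PARTITIONS   # toy routing by term

    def index_document(doc_id, terms):
        # Every term may land on a different index partition (often a different
        # node), which is why such index updates are usually asynchronous.
        for term in terms:
            partition = index_partitions[index_partition_for(term)]
            partition.setdefault(term, []).append(doc_id)

    index_document(42, ["red", "convertible"])
    print(index_partitions)  # the single write touched two index partitions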
The problem with the mod N approach is that if the number of nodes N changes, most of the keys will need to be moved from one node to another.
create many more partitions than there are nodes, and assign several partitions to each node.
by assigning more partitions to nodes that are more powerful, you can force those nodes to take a greater share of the load.
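A small sketch contrasting the two approaches (node counts, partition count, and key sample are arbitrary): with hash mod N, adding a node reassigns most keys, whereas with a fixed number of partitions the key-to-partition mapping never changes and only whole partitions move between nodes:

    import hashlib

    def h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    keys = [f"user:{i}" for i in range(10_000)]

    # mod N: going from 4 to 5 nodes changes the node for most keys.
    moved = sum(1 for k in keys if h(k) % 4 != h(k) % 5)
    print(f"mod N: {moved / len(keys):.0%} of keys would move")  # roughly 80%

    # Fixed partitions: the key -> partition mapping (say, 256 partitions)
    # never changes; rebalancing only reassigns whole partitions to nodes.
    partition_of = {k: h(k) % 256 for k in keys}
    print(len(set(partition_of.values())), "partitions, mapping unchanged by rebalancing")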
If partitions are very large, rebalancing and recovery from node failures become expensive. But if partitions are too small, they incur too much overhead.
key range–partitioned databases such as HBase and RethinkDB create partitions dynamically.
HBase and MongoDB allow an initial set of partitions to be configured on an empty database (this is called pre-splitting).
it can be a good thing to have a human in the loop for rebalancing. It’s slower than a fully automatic process, but it can help prevent operational surprises.
Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of partitions to nodes. Other actors, such as the routing tier or the partitioning-aware client, can subscribe to this information in ZooKeeper.
Cassandra and Riak take a different approach: they use a gossip protocol among the nodes to disseminate any changes in cluster state. Requests can be sent to any node, and that node forwards them to the appropriate node for the requested partition
When using a routing tier or when sending requests to a random node, clients still need to find the IP addresses to connect to. These are not as fast-changing as the assignment of partitions to nodes, so it is often sufficient to use DNS for this purpose.
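A sketch of the routing tier's job under this kind of scheme (the mapping and callback below are assumptions, not ZooKeeper's API): keep a locally cached copy of the partition-to-node assignment, refresh it when notified of a change, and forward each request accordingly:

    # Locally cached copy of the authoritative partition -> node assignment;
    # in the real system a watch/notification keeps this copy up to date.
    partition_assignment = {0: "node-a", 1: "node-b", 2: "node-c"}

    def on_assignment_changed(new_assignment):
        partition_assignment.clear()
        partition_assignment.update(new_assignment)

    def route(key, num_partitions=3):
        partition = hash(key) % num_partitions     # toy partitioning scheme
        return partition_assignment[partition]     # node to forward the request to

    print(route("user:42"))
    on_assignment_changed({0: "node-a", 1: "node-d", 2: "node-c"})
    print(route("user:42"))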
When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed.
A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback).
ACID stands for Atomicity, Consistency, Isolation, and Durability.
(Systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency.)
ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed
If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity.
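A sketch of that behavior using SQLite (the accounts table and CHECK constraint are invented to stand in for a fault that occurs mid-transaction): when a statement fails partway through, the already-applied writes are rolled back with it:

    import sqlite3

    # In-memory database; the accounts table and the CHECK constraint are
    # made up for this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
                 "balance INTEGER CHECK (balance >= 0))")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()

    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
            # This statement violates the CHECK constraint (alice would go
            # negative), aborting the whole transaction.
            conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
    except sqlite3.IntegrityError:
        pass  # aborted: bob's already-applied credit is discarded as well

    print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
    # [('alice', 100), ('bob', 0)] -- no partial writes survive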
this idea of consistency depends on the application’s notion of invariants, and it’s the application’s responsibility to define its transactions correctly
serializable isolation is rarely used, because it carries a performance penalty.
Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.
there is no one technique that can provide absolute guarantees. There are only various risk-reduction techniques, including writing to disk, replicating to remote machines, and backups
multi-object transactions are often needed if several pieces of data need to be kept in sync.
on any particular connection, everything between a BEGIN TRANSACTION and a COMMIT statement is considered to be part of the same transaction.
Atomicity can be implemented using a log for crash recovery.
isolation can be implemented using a lock on each object (allowing only one thread to access an object at any one time).
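As a rough sketch of the lock-per-object idea (a deliberately simplified scheme, not any particular database's implementation):

    import threading

    # One lock per object key (created up front here for simplicity); a writer
    # holds the object's lock while it reads and modifies that object, so only
    # one thread touches the object at any one time.
    object_locks = {"counter": threading.Lock()}
    store = {"counter": 0}

    def increment(key, repetitions=100_000):
        for _ in range(repetitions):
            with object_locks[key]:        # serialize access to this object
                store[key] += 1

    threads = [threading.Thread(target=increment, args=("counter",))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(store["counter"])  # 400000: no lost updates under the per-object lock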
A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.