Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Kindle Notes & Highlights
33%
Unfortunately, object-relational mapping frameworks make it easy to accidentally write code that performs unsafe read-modify-write cycles instead of using atomic operations provided by the database [38]. That’s not a problem if you know what you are doing, but it is potentially a source of subtle bugs that are difficult to find by testing.
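To make the contrast concrete, here is a minimal sketch (not from the book), using a hypothetical SQLAlchemy Counter model: the first transaction reads a value into application memory and writes the incremented value back, so a concurrent increment can be lost; the second pushes the increment into the database as a single atomic UPDATE.

```python
# A minimal sketch (not from the book); the Counter model and session setup
# are hypothetical, just to contrast the two patterns.
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Counter(Base):
    __tablename__ = "counters"
    id = Column(Integer, primary_key=True)
    value = Column(Integer, nullable=False)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Counter(id=1, value=0))
    session.commit()

with Session(engine) as session:
    # Unsafe read-modify-write: the value is read into application memory
    # and written back; a concurrent increment in between is silently lost.
    counter = session.get(Counter, 1)
    counter.value = counter.value + 1
    session.commit()

with Session(engine) as session:
    # Atomic operation: the increment is evaluated inside the database,
    # so concurrent increments cannot overwrite each other.
    session.query(Counter).filter(Counter.id == 1).update(
        {Counter.value: Counter.value + 1}
    )
    session.commit()
```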
34%
Atomic operations can work well in a replicated context, especially if they are commutative (i.e., you can apply them in a different order on different replicas, and still get the same result).
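For instance (a toy example, not from the book), increments are commutative, so two replicas that receive the same increments in different orders still end up with the same value:

```python
# Sketch: commutative increments converge regardless of application order.
ops = [3, -1, 7]  # the same increments delivered to both replicas

replica_a = 0
for op in ops:               # replica A applies them in arrival order
    replica_a += op

replica_b = 0
for op in reversed(ops):     # replica B happens to apply them in reverse
    replica_b += op

assert replica_a == replica_b == 9
```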
35%
In order to make the most of that single thread, transactions need to be structured differently from their traditional form.
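In the book's surrounding discussion, this restructuring means submitting the whole transaction to the database in one shot (for example, as a stored procedure) instead of holding it open across network round-trips. A rough sketch, with a hypothetical accounts table and a hypothetical transfer procedure:

```python
# Hypothetical sketch: an interactive transaction occupies the database's
# single execution thread across network round-trips, whereas a
# stored-procedure-style transaction ships all of its logic in one request.

def transfer_interactive(conn, src, dst, amount):
    # Interactive style: several round-trips while the transaction is open.
    cur = conn.cursor()
    cur.execute("SELECT balance FROM accounts WHERE id = %s", (src,))
    if cur.fetchone()[0] < amount:
        conn.rollback()
        return False
    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (amount, src))
    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (amount, dst))
    conn.commit()
    return True

def transfer_stored_procedure(conn, src, dst, amount):
    # Serial-execution style: the whole transaction is a single call, so the
    # database's one worker thread never sits idle waiting on the application.
    cur = conn.cursor()
    cur.execute("CALL transfer(%s, %s, %s)", (src, dst, amount))
    conn.commit()
```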
35%
However, for any transaction that needs to access multiple partitions, the database must coordinate the transaction across all the partitions that it touches.
35%
Serial execution of transactions has become a viable way of achieving serializable isolation within certain constraints:
35%
In 2PL, writers don’t just block other writers; they also block readers and vice versa.
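A minimal way to see what that means in lock terms (a sketch, not from the book): reads take shared locks, writes take exclusive locks, and the only combination that can be held concurrently is shared with shared.

```python
# Minimal sketch (not from the book): 2PL lock compatibility. Shared locks are
# taken for reads, exclusive locks for writes; the only compatible pairing is
# shared with shared, so writers block readers and readers block writers.
def compatible(held: str, requested: str) -> bool:
    return held == "shared" and requested == "shared"

assert compatible("shared", "shared")            # readers don't block readers
assert not compatible("shared", "exclusive")     # a reader blocks a writer
assert not compatible("exclusive", "shared")     # a writer blocks a reader
assert not compatible("exclusive", "exclusive")  # writers block writers
```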
38%
Working with distributed systems is fundamentally different from writing software on a single computer — and the main difference is that there are lots of new and exciting ways for things to go wrong.
38%
This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system.
38%
Although the system can be more reliable than its underlying parts, there is always a limit to how much more reliable it can be.
39%
In particular, it could happen that the node actually wasn’t dead but only slow to respond due to overload; transferring its load to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other dead, and everything stops working).
41%
Spanner implements snapshot isolation across datacenters in this way [59, 60]. It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the following observation:
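The observation, roughly, is that TrueTime reports an interval [earliest, latest] guaranteed to contain the true time, so if two transactions' intervals do not overlap, their order is unambiguous; Spanner deliberately waits out its interval before committing to force that non-overlap. A small sketch (not Spanner's actual code):

```python
# Sketch: comparing TrueTime-style confidence intervals. Ordering is only
# certain when the intervals do not overlap.
from collections import namedtuple

Interval = namedtuple("Interval", ["earliest", "latest"])

def definitely_before(a: Interval, b: Interval) -> bool:
    # True only if every instant in a precedes every instant in b.
    return a.latest < b.earliest

a = Interval(earliest=10.002, latest=10.005)
b = Interval(earliest=10.007, latest=10.010)
c = Interval(earliest=10.004, latest=10.008)

assert definitely_before(a, b)       # disjoint intervals: a committed first
assert not definitely_before(a, c)   # overlapping intervals: order is unknown
```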
41%
Using clock synchronization for distributed transaction semantics is an area of active research [57, 61, 62]. These ideas are interesting, but they have not yet been implemented in mainstream databases outside of Google.
42%
This trick hides GC pauses from clients and reduces the high percentiles of response time [70, 71]. Some latency-sensitive financial trading systems [72] use this approach.
42%
Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed system, we can state the assumptions we are making about the behavior (the system model) and design the actual system in such a way that it meets those assumptions. Algorithms can be proved to function correctly within a certain system model. This means that reliable behavior is achievable, even if the underlying system model provides very few guarantees.
42%
The moral of these stories is that a node cannot necessarily trust its own judgment of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a quorum, that is, voting among the nodes (see “Quorums for reading and writing”): decisions require some minimum number of votes from several nodes in order to reduce the dependence on any one particular node.
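A small sketch (not from the book) of the arithmetic behind a majority quorum: with n nodes, any two groups of floor(n/2) + 1 voters must share at least one member, so two conflicting decisions cannot both reach quorum, and no single node's opinion is decisive on its own.

```python
# Sketch: majority quorum sizing and decision check.
def quorum_size(n: int) -> int:
    # Smallest group size such that any two groups of this size overlap.
    return n // 2 + 1

def decision_reached(votes: int, n: int) -> bool:
    return votes >= quorum_size(n)

assert quorum_size(5) == 3
assert decision_reached(votes=3, n=5)        # a majority agrees: decided
assert not decision_reached(votes=2, n=5)    # too few votes: undecided
```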
42%
Distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses).
43%
However, a real implementation may still have to include code to handle the case where something happens that was assumed to be impossible, even if that handling boils down to printf("Sucks to be you") and exit(666) — i.e., letting a human operator clean up the mess [93]. (This is arguably the difference between computer science and software engineering.)
43%
Theoretical analysis and empirical testing are equally important.
45%
We need to understand the scope of what can and cannot be done: in some situations, it’s possible for the system to tolerate faults and continue working; in other situations, that is not possible.
45%
A better name for eventual consistency may be convergence, as we expect all replicas to eventually converge to the same value.