Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Kindle Notes & Highlights
23%
Single-node transactions have existed for a long time. However, in the move to distributed (replicated and partitioned) databases, many systems have abandoned them, claiming that transactions are too expensive in terms of performance and availability, and asserting that eventual consistency is inevitable in a scalable system. There is some truth in that statement, but it is overly simplistic,
23%
As multi-leader replication is a somewhat retrofitted feature in many databases, there are often subtle configuration pitfalls and surprising interactions with other database features. For example, autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason, multi-leader replication is often considered dangerous territory that should be avoided if possible
Corey
In other words: RTFM before you use something.
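To make the autoincrement pitfall concrete, here is a minimal sketch (my own illustration, not from the book) of why two leaders that each hand out keys from a local counter will collide, plus the usual interleaved-sequence workaround; random UUIDs are the other common escape hatch.

```java
import java.util.concurrent.atomic.AtomicLong;

// Each leader assigns ids from its own local counter, so leader A and
// leader B both hand out 1, 2, 3, ... for different rows, which conflicts
// as soon as the writes replicate to each other.
class NaiveLeader {
    private final AtomicLong nextId = new AtomicLong(1);

    long insert(String row) {
        return nextId.getAndIncrement();
    }
}

// Workaround: interleave the sequences so the leaders' id spaces never overlap.
// Leader 0 of 2 issues 0, 2, 4, ...; leader 1 of 2 issues 1, 3, 5, ...
// (Random UUIDs avoid the counter entirely, at the cost of larger, unordered keys.)
class InterleavedLeader {
    private final long leaderIndex;
    private final long leaderCount;
    private final AtomicLong counter = new AtomicLong(0);

    InterleavedLeader(long leaderIndex, long leaderCount) {
        this.leaderIndex = leaderIndex;
        this.leaderCount = leaderCount;
    }

    long insert(String row) {
        return counter.getAndIncrement() * leaderCount + leaderIndex;
    }
}
```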
38%
This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system.
Corey
Correction: this whole book is a pessimistic and depressing overview of things that may go wrong. If you want to go full cynic, read the results of the Jepsen tests.
40%
time-of-day clocks also have various oddities, as described in the next section. In particular, if the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time. These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks unsuitable for measuring elapsed time
Corey
I tried explaining this concept to a 9-year-old. She could grasp the idea that there is a lot more to the social media applications she uses than the simple interface she sees, but the notion of time being relative was much harder for her to take in.
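A small sketch of the practical consequence, assuming the JVM: a duration measured with the time-of-day clock can come out wrong (even negative) if NTP steps the clock mid-measurement, whereas the monotonic clock is the right tool for elapsed time.

```java
public class ElapsedTime {
    static void doSomething() throws InterruptedException {
        Thread.sleep(50); // stand-in for real work
    }

    public static void main(String[] args) throws InterruptedException {
        // Time-of-day clock: if NTP steps the clock backwards in the middle of
        // the measurement, this "duration" can be negative or wildly wrong.
        long startWall = System.currentTimeMillis();
        doSomething();
        long elapsedWallMs = System.currentTimeMillis() - startWall;

        // Monotonic clock: unaffected by NTP steps and clock resets, so it is
        // the appropriate clock for measuring elapsed time.
        long startMono = System.nanoTime();
        doSomething();
        long elapsedMonoMs = (System.nanoTime() - startMono) / 1_000_000;

        System.out.println("wall-clock elapsed: " + elapsedWallMs + " ms (unreliable)");
        System.out.println("monotonic elapsed:  " + elapsedMonoMs + " ms");
    }
}
```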
40%
Part of the problem is that incorrect clocks easily go unnoticed. If a machine’s CPU is defective or its network is misconfigured, it most likely won’t work at all, so it will quickly be noticed and fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most things will seem to work fine, even though its clock gradually drifts further and further away from reality. If some piece of software is relying on an accurately synchronized clock, the result is more likely to be silent and subtle data loss than a dramatic crash
Corey
I wonder if there is a way to use a consensus algorithm to help here? I suppose it would generate a lot of extra network chatter, but it might help find nodes whose clocks are way out of line.
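A rough sketch of the kind of check I mean, assuming a hypothetical fetchPeerTimeMillis RPC and ignoring round-trip compensation (which a real check would do, NTP-style): compare the local clock against the median of the peers' clocks and flag the node if it is too far off. A quorum-style median is enough to spot one wildly drifting node without a full consensus round.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ClockSkewCheck {
    // Hypothetical RPC asking a peer for its current wall-clock time.
    // A real implementation would subtract half the measured round trip.
    static long fetchPeerTimeMillis(String peer) {
        return System.currentTimeMillis(); // placeholder
    }

    // Returns false if this node's clock is more than maxSkewMillis away
    // from the median of its peers' clocks.
    static boolean clockLooksSane(List<String> peers, long maxSkewMillis) {
        List<Long> offsets = new ArrayList<>();
        for (String peer : peers) {
            offsets.add(fetchPeerTimeMillis(peer) - System.currentTimeMillis());
        }
        Collections.sort(offsets);
        long medianOffset = offsets.get(offsets.size() / 2);
        return Math.abs(medianOffset) <= maxSkewMillis;
    }
}
```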
41%
An emerging idea is to treat GC pauses like brief planned outages of a node, and to let other nodes handle requests from clients while one node is collecting its garbage. If the runtime can warn the application that a node soon requires a GC pause, the application can stop sending new requests to that node, wait for it to finish processing outstanding requests, and then perform the GC while no requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles of response time [70, 71]. Some latency-sensitive financial trading systems [72] use this approach.
Corey
I've tried this. In principle it's a cool idea, but it's really hard to put into practice. Even GC pauses lasting hundreds of milliseconds are hard to detect because there isn't a clear "hook" you can get for a GC event. You can ping the node to death with health checks on a really tight interval (sub-10 ms), but that introduces new problems.
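For what it's worth, on the JVM the closest thing to a hook I know of is the JMX garbage collection notification, and it illustrates the problem: it fires only after a collection has completed, so you can report the pause and let a load balancer react, but you cannot drain requests ahead of it. A rough sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;

public class GcWatcher {
    public static void install() {
        for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
            // On HotSpot the GC MXBeans also implement NotificationEmitter.
            NotificationEmitter emitter = (NotificationEmitter) gcBean;
            NotificationListener listener = (notification, handback) -> {
                if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    return;
                }
                GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                        .from((CompositeData) notification.getUserData());
                long pauseMillis = info.getGcInfo().getDuration();
                if (pauseMillis > 100) {
                    // The pause has already happened by the time we hear about it;
                    // the best we can do is report it so traffic can be shifted away.
                    System.err.println(info.getGcName() + " paused ~" + pauseMillis + " ms");
                }
            };
            emitter.addNotificationListener(listener, null, null);
        }
    }
}
```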
42%
Similarly, it would be appealing if a protocol could protect us from vulnerabilities, security compromises, and malicious attacks. Unfortunately, this is not realistic either: in most systems, if an attacker can compromise one node, they can probably compromise all of them, because they are probably running the same software. Thus, traditional mechanisms (authentication, access control, encryption, firewalls, and so on) continue to be the main protection against attackers.
Corey
This hints at the biggest misunderstanding of security that the average developer has: they tend to assume that once you are inside the network you are safe, but that is wrong. If you treat all your services as if they were exposed to the public internet, it forces you toward an approach more in line with zero trust.
66%
If you are mathematically inclined, you might say that the application state is what you get when you integrate an event stream over time, and a change stream is what you get when you differentiate the state by time,
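A tiny sketch of what "integrating the event stream" looks like in code (my own toy example, not from the book): fold each change event, in order, into an accumulating state; the change stream is then the sequence of diffs between successive states.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EventFold {
    // A toy event: a credit or debit against an account.
    record BalanceChange(String account, long deltaCents) {}

    // "Integration": fold every event, in order, into the current state.
    static Map<String, Long> applyAll(List<BalanceChange> events) {
        Map<String, Long> state = new HashMap<>();
        for (BalanceChange e : events) {
            state.merge(e.account(), e.deltaCents(), Long::sum);
        }
        return state;
    }

    public static void main(String[] args) {
        List<BalanceChange> events = List.of(
                new BalanceChange("alice", 1_000),
                new BalanceChange("alice", -250),
                new BalanceChange("bob", 500));
        System.out.println(applyAll(events)); // {bob=500, alice=750} (map order not guaranteed)
    }
}
```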
73%
As people in the functional programming community like to joke, “We believe in the separation of Church and state”
77%
I think it is not sufficient for software engineers to focus exclusively on the technology and ignore its consequences: the ethical responsibility is ours to bear also. Reasoning about ethics is difficult, but it is too important to ignore.