Kindle Notes & Highlights
Read between December 13, 2022 - January 4, 2023
configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages
We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.
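A minimal sketch of what that means in practice (Python, with invented measurements): summarizing the same response times by mean and by percentiles tells very different stories, because a handful of slow requests distorts the mean while the median and p99 separate the typical case from the tail.

```python
import random

def percentile(samples, p):
    """Nearest-rank p-th percentile (p in 0-100) of a list of samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Invented measurements: 98% of requests around 100 ms, 2% around 1.5 s.
times = [random.gauss(100, 20) for _ in range(980)]
times += [random.gauss(1500, 300) for _ in range(20)]

print(f"mean: {sum(times) / len(times):.0f} ms")  # distorted by the outliers
print(f"p50:  {percentile(times, 50):.0f} ms")    # the typical request
print(f"p99:  {percentile(times, 99):.0f} ms")    # the tail users notice
```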
a 1-second slowdown reduces a customer satisfaction metric by 16%
Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing.
Queueing delays often account for a large part of the response time at high percentiles.
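A toy single-server FIFO queue makes this visible (all parameters invented): a tiny fraction of slow requests inflates the delay of every request stuck in line behind them, which shows up at the high percentiles rather than the median.

```python
import random

random.seed(1)
ARRIVAL_GAP = 20      # a new request arrives every 20 ms
server_free_at = 0    # time at which the server finishes its current request
delays = []

for i in range(10_000):
    arrival = i * ARRIVAL_GAP
    start = max(arrival, server_free_at)
    delays.append(start - arrival)                      # time spent queued
    service = 1000 if random.random() < 0.003 else 10   # rare 1 s request
    server_free_at = start + service

delays.sort()
print("median queueing delay:", delays[len(delays) // 2], "ms")
print("p99 queueing delay:   ", delays[int(len(delays) * 0.99)], "ms")
```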
Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete.
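This effect, often called tail latency amplification, is easy to see in a sketch (hypothetical latency numbers): the more backends a request fans out to in parallel, the more likely it is to hit at least one slow call, so a 1-in-100 slow backend becomes a roughly 2-in-3 slow user request at a fan-out of 100.

```python
import random

random.seed(0)

def backend_call_ms():
    # Hypothetical backend: 1% of calls take ~1 s, the rest ~10 ms.
    return 1000 if random.random() < 0.01 else 10

def user_request_ms(fanout):
    # Parallel fan-out: the user still waits for the slowest call.
    return max(backend_call_ms() for _ in range(fanout))

for fanout in (1, 10, 100):
    slow = sum(user_request_ms(fanout) >= 1000 for _ in range(10_000))
    print(f"fan-out {fanout:>3}: {slow / 100:.1f}% of user requests take ~1 s")
```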
An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load.
using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.
there is no such thing as a generic, one-size-fits-all scalable architecture
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive.
In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes.
Concurrency and crash recovery are much simpler if segment files are append-only or immutable. For example, you don’t have to worry about the case where a crash happened while a value was being overwritten, leaving you with a file containing part of the old and part of the new value spliced together.
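A toy sketch of both ideas together (an append-only log with a hash index; not any particular database's design): reads become O(1) lookups instead of scans, every write pays the extra cost of keeping the index current, and an overwrite is simply a new appended record that the index points at.

```python
class IndexedLog:
    """Toy key-value store: an append-only log plus an in-memory hash index."""

    def __init__(self):
        self.log = []     # append-only list of (key, value) records
        self.index = {}   # key -> offset of the latest record for that key

    def put(self, key, value):
        self.log.append((key, value))        # the write itself: a pure append
        self.index[key] = len(self.log) - 1  # extra cost: update every index

    def get(self, key):
        offset = self.index.get(key)         # O(1) instead of scanning the log
        return None if offset is None else self.log[offset][1]

store = IndexedLog()
store.put("user:42", "alice")
store.put("user:42", "alicia")  # overwrite = append; the old record stays intact
print(store.get("user:42"))     # "alicia": the index points at the latest record
```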
Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk
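A rough illustration of that decoding overhead, using JSON purely as a stand-in for a disk format (actual timings vary by machine): even when the encoded bytes are already in RAM, the decode step costs far more than touching an in-memory data structure directly.

```python
import json
import timeit

record = {"user_id": 42, "name": "alice", "follower_ids": list(range(1000))}
encoded = json.dumps(record)  # stand-in for a cached on-disk representation

direct = timeit.timeit(lambda: record["follower_ids"][-1], number=100_000)
decode = timeit.timeit(lambda: json.loads(encoded)["follower_ids"][-1],
                       number=100_000)
print(f"direct access:       {direct:.3f} s")
print(f"decode, then access: {decode:.3f} s")
```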
Data warehouses now exist in almost all large enterprises, but in small companies they are almost unheard of.
as long as people agree on what the format is, it often doesn’t matter how pretty or efficient the format is. The difficulty of getting different organizations to agree on anything outweighs most other concerns.
A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable. For example, each service should be owned by one team, and that team should be able to release new versions of the service frequently, without having to coordinate with other teams.
Although RPC seems convenient at first, the approach is fundamentally flawed
Part of the appeal of REST is that it doesn’t try to hide the fact that it’s a network protocol (although this doesn’t seem to stop people from building RPC libraries on top of REST).
For a successful technology, reality must take precedence over public relations, for nature cannot be fooled. Richard Feynman, Rogers Commission Report (1986)
The problem with a shared-memory approach is that the cost grows faster than linearly: a machine with twice as many CPUs, twice as much RAM, and twice as much disk capacity as another typically costs significantly more than twice as much. And due to bottlenecks, a machine twice the size cannot necessarily handle twice the load.
In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores
The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair. Douglas Adams, Mostly Harmless (1992)
Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. We must state relationships, not procedures. Grace Murray Hopper, Management and the Computer of the Future (1962)
A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking.
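A toy single-process illustration of that coordination overhead (Python; real concurrent systems pay far more through contention and cache effects): even an uncontended lock adds a cost to every single operation that single-threaded code never pays.

```python
import threading
import timeit

counter = 0
lock = threading.Lock()

def increment_unlocked():
    global counter
    counter += 1

def increment_locked():
    global counter
    with lock:  # uncontended here, but acquire/release still costs something
        counter += 1

print("no lock:  ", timeit.timeit(increment_unlocked, number=1_000_000))
print("with lock:", timeit.timeit(increment_locked, number=1_000_000))
```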
Working with distributed systems is fundamentally different from writing software on a single computer—and the main difference is that there are lots of new and exciting ways for things to go wrong
In the end, our task as engineers is to build systems that do their job (i.e., meet the guarantees that users are expecting), in spite of everything going wrong.
a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure—if any part of the system fails, just let everything crash (like a kernel panic on a single machine).
In distributed systems, suspicion, pessimism, and paranoia pay off.
Shared-nothing is not the only way of building systems, but it has become the dominant approach for building internet services, for several reasons: it’s comparatively cheap because it requires no special hardware, it can make use of commoditized cloud computing services, and it can achieve high reliability through redundancy across multiple geographically distributed datacenters.
If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so.
All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us understand systems better, so CAP is best avoided.
although CAP has been historically influential, it has little practical value for designing systems
In many cases, systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently.
Distributed transactions thus have a tendency of amplifying failures, which runs counter to our goal of building fault-tolerant systems.
frequent leader elections result in terrible performance because the system can end up spending more time choosing a leader than doing any useful work.
In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application.
A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments. Donald Knuth
lack of integration leads to Balkanization of data.
The MapReduce approach is more appropriate for larger jobs: jobs that process so much data and run for such a long time that they are likely to experience at least one task failure along the way.
In an environment where tasks are not so often terminated, the design decisions of MapReduce make less sense.
A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. John Gall, Systemantics (1975)
This principle is known as exactly-once semantics, although effectively-once would be a more descriptive term
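A minimal sketch of effectively-once processing via deduplication (all names hypothetical; in practice the set of processed IDs would live in durable storage alongside the data it protects): the message may be delivered more than once, but its effect is applied only once.

```python
processed_ids = set()  # hypothetical; would be durable in a real system
balance = 0

def apply_transfer(message_id, amount):
    global balance
    if message_id in processed_ids:
        return  # duplicate delivery: the effect was already applied
    balance += amount
    processed_ids.add(message_id)

apply_transfer("txn-001", 100)
apply_transfer("txn-001", 100)  # redelivered after a timeout; no double credit
print(balance)  # 100, not 200
```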
Surprisingly often I see software engineers make statements like, “In my experience, 99% of people only need X” or “…don’t need X” (for various values of X). I think that such statements say more about the experience of the speaker than about the actual usefulness of a technology.
violations of timeliness are “eventual consistency,” whereas violations of integrity are “perpetual inconsistency.”
in most applications, integrity is much more important than timeliness.
If we cannot fully trust that every individual component of the system will be free from corruption—that every piece of hardware is fault-free and that every piece of software is bug-free—then we must at least periodically check the integrity of our data.
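A minimal sketch of such a periodic check (zlib.crc32 used purely for illustration; a real audit would use stronger checksums and keep them in durable storage): store a checksum alongside each record and verify it on a schedule, so silent corruption is detected rather than trusted.

```python
import zlib

def checksum(data: bytes) -> int:
    return zlib.crc32(data)

records = {"user:42": b"alice"}
checksums = {k: checksum(v) for k, v in records.items()}

# ... later, during a background audit pass ...
for key, value in records.items():
    if checksum(value) != checksums[key]:
        print(f"integrity violation detected in {key}")
```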
Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they codify that discrimination.