Kindle Notes & Highlights
Read between February 17 and February 27, 2023
Despite promises of reliability through redundancy, distributed systems exhibit availability more like “two eights” than the coveted “five nines.”
During the hectic rush of a development project, you can easily make decisions that optimize development cost at the expense of operational cost. This makes sense only in the context of the team aiming for a fixed budget and delivery date. In the context of the organization paying for the software, it’s a bad choice.
Imagine that your system requires five minutes of downtime on every release. You expect your system to have a five-year life span with monthly releases. (Most companies would like to do more releases per year, but I’m being very conservative.) You can compute the expected cost of downtime, discounted by the time-value of money. It’s probably on the order of $1,000,000 (300 minutes of downtime at a very modest cost of $3,000 per minute). Now suppose you could invest $50,000 to create a build pipeline and deployment process that avoids downtime during releases. That will, at a minimum, avoid the cost of that downtime.
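A quick back-of-the-envelope check of that arithmetic (a sketch only; the figures are the ones quoted above, and it ignores the time-value discounting the passage mentions):

    // Rough downtime-cost arithmetic for the scenario above.
    public class DowntimeCost {
        public static void main(String[] args) {
            int years = 5;                 // expected system life span
            int releasesPerYear = 12;      // monthly releases
            int minutesPerRelease = 5;     // downtime per release
            int costPerMinute = 3_000;     // "very modest" cost of downtime

            int totalMinutes = years * releasesPerYear * minutesPerRelease; // 300 minutes
            int totalCost = totalMinutes * costPerMinute;                   // $900,000, i.e. on the order of $1,000,000

            System.out.printf("Downtime: %d minutes, cost: $%,d%n", totalMinutes, totalCost);
            // Compare with a one-time $50,000 investment in zero-downtime releases.
        }
    }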
In any incident, my first priority is always to restore service. Restoring service takes precedence over investigation. If I can collect some data for postmortem analysis, that’s great—unless it makes the outage longer.
The trick to restoring service is figuring out what to target. You can always “reboot the world” by restarting every single server, layer by layer. That’s almost always effective, but it takes a long time. Most of the time, you can find one culprit that is really locking things up.
A postmortem is like a murder mystery. You have a set of clues. Some are reliable, such as server logs copied from the time of the outage. Some are unreliable, such as statements from people about what they saw. As with real witnesses, people will mix observations with speculation. They will present hypotheses as facts. The postmortem can actually be harder to solve than a murder, because the body goes away. There is no corpse to autopsy, because the servers are back up and running. Whatever state they were in that caused the failure no longer exists.
Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.
A system with longevity keeps processing transactions for a long time. What is a long time? It depends. A useful working definition of “a long time” is the time between code deployments.
The major dangers to your system’s longevity are memory leaks and data growth. Both kinds of sludge will kill your system in production. Both are rarely caught during testing.
Fault: A condition that creates an incorrect internal state in your software.
Error: Visibly incorrect behavior.
Failure: An unresponsive system. When a system doesn’t respond, we say it has failed.
Triggering a fault opens the crack. Faults become errors, and errors provoke failures.
Five nines of reliability for the overall application is nowhere near enough. It would result in thousands of disappointed users every day (at hundreds of millions of requests per day, even a 0.001 percent failure rate adds up to thousands of failures).
High interactive complexity arises when systems have enough moving parts and hidden, internal dependencies that most operators’ mental models are either incomplete or just plain wrong. In a system exhibiting high interactive complexity, the operator’s instinctive actions will have results ranging from ineffective to actively harmful.
connections are integration points, and every single one of them is out to destroy your system.
Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk. Every socket, process, pipe, or remote procedure call can and will hang. Even database calls can hang, in ways obvious and subtle. Every feed into the system can hang it, crash it, or generate other impulses at the worst possible time.
Slow failures, such as a dropped ACK, let threads block for minutes before throwing exceptions. The blocked thread can’t process other transactions, so overall capacity is reduced. If all threads end up getting blocked, then for all practical purposes, the server is down. Clearly, a slow response is a lot worse than no response.
not every problem can be solved at the level of abstraction where it manifests. Sometimes the causes reverberate up and down the layers.
Use a client library that allows fine-grained control over timeouts—including both the connection timeout and read timeout—and response handling. I recommend you avoid client libraries that try to map responses directly into domain objects. Instead, treat a response as data until you’ve confirmed it meets your expectations. It’s just text in maps (also known as dictionaries) and lists until you decide what to extract.
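As a concrete illustration, here is a minimal sketch using the JDK’s built-in java.net.http client (the URL and the status check are placeholders, not anything this passage prescribes). Both the connection timeout and the per-request timeout are set explicitly, and the response body stays plain text until it has been validated:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class StrictClient {
        // Client-wide connection timeout: how long to wait for the TCP connection.
        private static final HttpClient CLIENT = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        public static String fetchRaw(String url) throws Exception {
            // Per-request timeout: how long to wait for the response itself.
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(5))
                    .GET()
                    .build();
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new IllegalStateException("Unexpected status: " + response.statusCode());
            }
            // Treat the body as data (text, maps, lists) here; only map it into
            // domain objects after it has been checked against expectations.
            return response.body();
        }
    }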
A chain reaction happens because the death of one server makes the others pick up the slack. The increased load makes them more likely to fail.
A chain reaction will quickly bring an entire layer down. Other layers that depend on it must protect themselves, or they will go down in a cascading failure.
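A rough sketch of why the survivors become ever more likely to fail; the farm size and request rate are invented for illustration:

    // Load redistribution in a chain reaction: the same total demand is spread
    // over fewer and fewer servers as they die.
    public class ChainReaction {
        public static void main(String[] args) {
            double totalLoad = 500.0;  // requests per second across the farm (made up)
            int servers = 8;           // each comfortable at ~100 req/s (made up)
            for (int alive = servers; alive >= 4; alive--) {
                System.out.printf("%d servers alive -> %.1f req/s each%n", alive, totalLoad / alive);
            }
            // 8 alive -> 62.5 req/s each; 5 alive -> 100.0 (at the limit); 4 alive -> 125.0 (overloaded).
        }
    }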
A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer.
Users are a terrible thing. Systems would be much better off with no users.
When your system is teetering on the brink of disaster like a car on a cliff in a movie, some user will be the seagull that lands on the hood. Down she goes! Human users have a gift for doing exactly the worst possible thing at the worst possible time.
“Capacity” is the maximum throughput your system can sustain under a given workload while maintaining acceptable performance. When a transaction takes too long to execute, it means that the demand on your system exceeds its capacity.
If things are really bad, the logging system might not even be able to log the error. If no memory is available to create the log event, then nothing gets logged.
For every bit of data you put in the session, consider that it might never be used again. It could spend the next thirty minutes uselessly taking up memory and putting your system at risk.
Weak references are a useful way to respond to changing memory conditions, but they do add complexity. When you can, it’s best to just keep things out of the session.
Redis is another popular tool for moving memory out of your process.[7] It’s a fast “data structure server” that lives in a space between cache and database. Many systems use Redis to hold session data instead of keeping it in memory or in a relational database.
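A minimal sketch of keeping session data in Redis rather than on the app server’s heap, using the Jedis client (the client library, key scheme, and 30-minute expiry are assumptions for illustration):

    import redis.clients.jedis.Jedis;

    public class RedisSessionStore {
        public void save(String sessionId, String serializedSession) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Expire the entry after 30 minutes, roughly like a session timeout.
                jedis.setex("session:" + sessionId, 1800, serializedSession);
            }
        }

        public String load(String sessionId) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                return jedis.get("session:" + sessionId);  // null if expired or never stored
            }
        }
    }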
networks have gotten fast enough that “someone else’s memory” can be faster to access than local disk. Your application is better off making a remote call to get a value than reading it from storage.
Identify whatever your most expensive transactions are and double or triple the proportion of those transactions. If your retail system expects a 2 percent conversion rate (which is about standard for retailers), then your load tests should test for a 4, 6, or 10 percent conversion rate.
I advocate supplementing internal monitors (such as log file scraping, process monitoring, and port monitoring) with external monitoring. A mock client somewhere (not in the same data center) can run synthetic transactions on a regular basis. That client experiences the same view of the system that real users experience. If that client cannot process the synthetic transactions, then there is a problem, whether or not the server process is running.
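A bare-bones sketch of such a mock client (the URL and the success marker are placeholders); it would run on a schedule from somewhere outside the data center and raise an alert whenever the synthetic transaction fails:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class SyntheticProbe {
        private static final HttpClient CLIENT = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))
                .build();

        // Returns true when the synthetic transaction looks healthy from the outside.
        public static boolean probe(String url, String expectedMarker) {
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                        .timeout(Duration.ofSeconds(10))
                        .GET()
                        .build();
                HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                return response.statusCode() == 200 && response.body().contains(expectedMarker);
            } catch (Exception e) {
                return false;  // timeouts, DNS failures, and refused connections all count as "down"
            }
        }
    }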
One elegant way to avoid synchronization on domain objects is to make your domain objects immutable. Use them for querying and rendering. When the time comes to alter their state, do it by constructing and issuing a “command object.” This style is called “Command Query Responsibility Segregation” (CQRS), and it nicely avoids a large number of concurrency issues.
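A minimal sketch of that split (the account domain and all names are invented): the domain object is an immutable record used only for queries and rendering, and every change is expressed as a separate command object handled in one place:

    import java.math.BigDecimal;

    // Immutable domain object: safe to share across threads, used only for reads.
    record Account(String id, BigDecimal balance) { }

    // Command object: describes a requested state change without performing it.
    record DebitAccount(String accountId, BigDecimal amount) { }

    class AccountCommandHandler {
        // State changes happen only here, and they produce a new immutable
        // Account instead of mutating the old one.
        Account handle(Account current, DebitAccount command) {
            if (current.balance().compareTo(command.amount()) < 0) {
                throw new IllegalArgumentException("Insufficient funds");
            }
            return new Account(current.id(), current.balance().subtract(command.amount()));
        }
    }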
The maximum memory usage of all application-level caches should be configurable. Caches that do not limit maximum memory consumption will eventually eat away at the memory available for the system.
you need to monitor hit rates for the cached items to see whether most items are being used from cache. If hit rates are very low, then the cache is not buying any performance gains and might actually be slower than not using the cache. Keeping something in cache is a bet that the cost of generating it once, plus the cost of hashing and lookups, is less than the cost of generating it every time it’s needed. If a particular cached object is used only once during the lifetime of a server, then caching it is of no help.
It’s also wise to avoid caching things that are cheap to generate.
Caches should be built using weak references to hold the cached item itself. If memory gets low, the garbage collector is permitted to reap any object that is reachable only via weak references.
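A toy sketch pulling the last few cache points together: a bounded, LRU-evicting map whose values sit behind SoftReferences (Java’s memory-sensitive flavor of weak reference), with hit and miss counters for monitoring. The sizes are arbitrary, and it limits entry count rather than bytes as a stand-in for a true memory cap:

    import java.lang.ref.SoftReference;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.LongAdder;

    public class BoundedCache<K, V> {
        private final int maxEntries;
        private final Map<K, SoftReference<V>> map;
        private final LongAdder hits = new LongAdder();
        private final LongAdder misses = new LongAdder();

        public BoundedCache(int maxEntries) {
            this.maxEntries = maxEntries;
            // Access-ordered LinkedHashMap drops the least recently used entry
            // once the configured limit is exceeded.
            this.map = new LinkedHashMap<K, SoftReference<V>>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, SoftReference<V>> eldest) {
                    return size() > BoundedCache.this.maxEntries;
                }
            };
        }

        public synchronized V get(K key) {
            SoftReference<V> ref = map.get(key);
            V value = (ref == null) ? null : ref.get();  // null if evicted or reclaimed by the GC
            if (value == null) { misses.increment(); } else { hits.increment(); }
            return value;
        }

        public synchronized void put(K key, V value) {
            map.put(key, new SoftReference<>(value));
        }

        public double hitRate() {
            long h = hits.sum(), m = misses.sum();
            return (h + m) == 0 ? 0.0 : (double) h / (h + m);
        }
    }

A hit rate that stays near zero means the cache is pure overhead and should probably be removed.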
don’t just test your system with your usual workloads. See what happens if you take the number of calls the front end could possibly make, double it, and direct it all against your most expensive transaction. If your system is resilient, it might slow down—even start to fail fast if it can’t process transactions within the allowed time (see Fail Fast)—but it should recover once the load goes down. Crashing, hung threads, empty responses, or nonsense replies indicate your system won’t survive and might just start a cascading failure.
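A rough sketch of that kind of abusive test (the concurrency figures and the transaction hook are invented): aim far more concurrent demand than the front end could ever generate at the single most expensive transaction, then watch how the system degrades and whether it recovers:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    public class BurstTest {
        public static void main(String[] args) throws InterruptedException {
            int frontEndMaxConcurrency = 250;                 // made-up front-end ceiling
            int attackConcurrency = frontEndMaxConcurrency * 2;
            AtomicInteger failures = new AtomicInteger();

            ExecutorService pool = Executors.newFixedThreadPool(attackConcurrency);
            for (int i = 0; i < 10_000; i++) {
                pool.submit(() -> {
                    try {
                        runMostExpensiveTransaction();        // placeholder for the real call
                    } catch (Exception e) {
                        failures.incrementAndGet();           // fail-fast rejections land here too
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(30, TimeUnit.MINUTES);
            System.out.println("Failed or rejected transactions: " + failures.get());
        }

        private static void runMostExpensiveTransaction() throws Exception {
            // Invoke the real system here, e.g. checkout or search.
        }
    }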
the steady-state load on a system might be significantly different than the startup or periodic load. Imagine a farm of app servers booting up. Every single one needs to connect to a database and load some amount of reference or seed data. Every one starts with a cold cache and only gradually gets to a useful working set. Until then, most HTTP requests translate into one or more database queries. That means the transient load on the database is much higher when applications start up than after they’ve been running for a while.
A fixed retry interval will concentrate demand from callers on that period. Instead, use a backoff algorithm so different callers will be at different points in their backoff periods.
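A minimal sketch of such a backoff: exponential growth with a cap and full random jitter, so callers drift apart instead of retrying in lockstep. The base delay, cap, and attempt count are arbitrary choices:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ThreadLocalRandom;

    public class Backoff {
        // Retry an operation with capped exponential backoff plus jitter.
        public static <T> T withBackoff(Callable<T> operation, int maxAttempts) throws Exception {
            long baseDelayMillis = 100;
            long maxDelayMillis = 30_000;
            for (int attempt = 1; ; attempt++) {
                try {
                    return operation.call();
                } catch (Exception e) {
                    if (attempt >= maxAttempts) {
                        throw e;
                    }
                    // Exponential bound, capped, then a random delay up to that bound
                    // (full jitter) so different callers sit at different points.
                    long bound = Math.min(maxDelayMillis, baseDelayMillis << Math.min(attempt - 1, 20));
                    Thread.sleep(ThreadLocalRandom.current().nextLong(bound + 1));
                }
            }
        }
    }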
If observations report that more than 80 percent of the system is unavailable, it’s more likely to be a problem with the observer than the system.
When the gap between expected state and observed state is large, signal for confirmation.
Actions initiated by automation take time. That time is usually longer than a monitoring interval, so make sure to account for some delay in the system’s response to the action.
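Those three rules sketch out roughly like this in an automation component (the thresholds and decision names are invented for illustration):

    // Guardrails for automation that starts, stops, or restarts instances.
    public class AutomationGovernor {
        private static final double OBSERVER_SUSPICION_THRESHOLD = 0.8; // >80% "down" -> distrust the observer
        private static final double CONFIRMATION_GAP = 0.5;             // large expected-vs-observed gap -> ask a human

        public enum Decision { ACT, REQUIRE_HUMAN_CONFIRMATION, SUSPECT_OBSERVER }

        public Decision decide(int expectedHealthy, int observedHealthy, int total) {
            double observedDown = 1.0 - (double) observedHealthy / total;
            if (observedDown > OBSERVER_SUSPICION_THRESHOLD) {
                // More likely a monitoring problem than a real mass outage: take no automatic action.
                return Decision.SUSPECT_OBSERVER;
            }
            double gap = Math.abs(expectedHealthy - observedHealthy) / (double) total;
            if (gap > CONFIRMATION_GAP) {
                return Decision.REQUIRE_HUMAN_CONFIRMATION;
            }
            // After acting, wait longer than one monitoring interval before
            // measuring again, because the action itself takes time to show up.
            return Decision.ACT;
        }
    }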
You should give your system the ability to monitor its own performance, so it can also tell when it isn’t meeting its service-level agreement. Suppose your system is a service provider that’s required to respond within one hundred milliseconds. When a moving average over the last twenty transactions exceeds one hundred milliseconds, your system could start refusing requests.
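A minimal sketch of that self-check; the window of twenty transactions and the 100 ms limit come from the example above, everything else is invented:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Tracks a moving average of the last N response times so the system can
    // tell when it is no longer meeting its SLA and start refusing requests.
    public class SlaMonitor {
        private final int window;
        private final long limitMillis;
        private final Deque<Long> recent = new ArrayDeque<>();
        private long sum = 0;

        public SlaMonitor(int window, long limitMillis) {
            this.window = window;            // e.g. 20 transactions
            this.limitMillis = limitMillis;  // e.g. 100 ms
        }

        public synchronized void record(long elapsedMillis) {
            recent.addLast(elapsedMillis);
            sum += elapsedMillis;
            if (recent.size() > window) {
                sum -= recent.removeFirst();
            }
        }

        public synchronized boolean breachingSla() {
            if (recent.size() < window) {
                return false;  // not enough data yet
            }
            return (sum / (double) window) > limitMillis;
        }
    }

A request filter could call record() after each transaction and return 503 Service Unavailable while breachingSla() reports true.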
The only sensible numbers are “zero,” “one,” and “lots,” so unless your query selects exactly one row, it has the potential to return too many. Don’t rely on the data producers to create a limited amount of data.
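In JDBC terms, a sketch looks like this: the limit goes into the query itself, with setMaxRows as a second line of defense, instead of trusting the data producers. The table, columns, and ceiling are placeholders, and the LIMIT syntax assumes a database such as PostgreSQL or MySQL:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    public class BoundedQuery {
        private static final int MAX_ROWS = 1_000;  // an explicit, configurable ceiling

        public List<String> recentOrderIds(Connection conn, String customerId) throws SQLException {
            String sql = "SELECT order_id FROM orders WHERE customer_id = ? ORDER BY created_at DESC LIMIT ?";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setMaxRows(MAX_ROWS);  // driver-side cap as well, belt and suspenders
                stmt.setString(1, customerId);
                stmt.setInt(2, MAX_ROWS);
                List<String> ids = new ArrayList<>();
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getString("order_id"));
                    }
                }
                return ids;
            }
        }
    }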
Hope is not a design method.
Well-placed timeouts provide fault isolation—a problem in some other system or device does not have to become your problem.
Many APIs offer both a call with a timeout and a simpler, easier call that blocks forever. It would be better if, instead of overloading a single function, the no-timeout version were labeled “CheckoutAndMaybeKillMySystem.”
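The JDK itself has a concrete example of that pair: BlockingQueue.take() blocks forever, while poll(timeout, unit) is the version that gives up. A minimal sketch of choosing the latter:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class QueueConsumer {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        public String nextMessage() throws InterruptedException {
            // The "simple" alternative, queue.take(), would block this thread forever
            // if the producer dies: effectively CheckoutAndMaybeKillMySystem().
            String message = queue.poll(5, TimeUnit.SECONDS);
            if (message == null) {
                throw new IllegalStateException("No message within 5 seconds");
            }
            return message;
        }
    }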
The Timeouts pattern prevents calls to Integration Points from becoming Blocked Threads. Thus, timeouts avert Cascading Failures.

