Kindle Notes & Highlights
Read between July 30 - September 9, 2017
Don’t avoid one-time development expenses at the cost of recurring operational expenses.
Manage perceptions after a major incident. It’s as important as managing the incident itself.
Bugs will happen. They cannot be eliminated, so they must be survived instead.
Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.
The amazing thing is that the highly stable design usually costs the same to implement as the unstable one.
A resilient system keeps processing transactions, even when there are transient impulses, persistent stresses, or component failures disrupting normal processing.
So, how do you find these kinds of bugs? The only way you can catch them before they bite you in production is to run your own longevity tests. If you can, set aside a developer machine. Have it run JMeter, Marathon, or some other load-testing tool. Don’t hit the system hard; just keep driving requests all the time. (Also, be sure to have the scripts slack for a few hours a day to simulate the slow period during the middle of the night. That will catch connection pool and firewall timeouts.)
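A minimal sketch of such a longevity driver in plain Java, assuming a hypothetical /status endpoint (a real setup would more likely script JMeter or a similar tool): keep a light, steady trickle of requests flowing around the clock, with a nightly quiet window so idle timeouts in connection pools and firewalls get a chance to show themselves.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.time.LocalTime;

    // Longevity driver: low, steady load plus a nightly quiet period, so that
    // connection-pool and firewall idle timeouts get a chance to bite.
    public class LongevityDriver {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(5))
                    .build();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://test-target.example.com/status")) // hypothetical endpoint
                    .timeout(Duration.ofSeconds(10))
                    .build();

            while (true) {
                LocalTime now = LocalTime.now();
                boolean quietPeriod = now.isAfter(LocalTime.of(2, 0)) && now.isBefore(LocalTime.of(5, 0));
                if (!quietPeriod) {
                    try {
                        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
                        System.out.println(now + " -> " + response.statusCode());
                    } catch (Exception e) {
                        System.out.println(now + " -> FAILED: " + e); // the failures are the data we want
                    }
                }
                Thread.sleep(2_000);
            }
        }
    }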
Antipatterns create, accelerate, or multiply cracks in the system.
Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk.
Network failures can hit you in two ways: fast or slow.
The most effective patterns to combat integration point failures are Circuit Breaker and Decoupling Middleware.
Beware this necessary evil. Every integration point will eventually fail in some way, and you need to be prepared for that failure.
Prepare for the many forms of failure. Integration point failures take several forms, ranging from various network errors to semantic errors. You will not get nice error responses delivered through the defined protocol; instead, you’ll see some kind of protocol violation, slow response, or outright hang.
Know when to open up abstractions. Debugging integration point failures usually requires peeling back a layer of abstraction. Failures are often difficult to debug...
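A toy, single-threaded Circuit Breaker sketch (not the book's implementation, and not thread-safe): after a run of consecutive failures it opens and fails calls immediately, then lets a probe call through once a cool-down period has passed.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.Supplier;

    // Minimal circuit breaker: opens after a threshold of consecutive failures,
    // then allows a trial call once the cool-down has elapsed. A real one needs
    // synchronization, metrics, and distinct handling of timeouts vs. errors.
    public class CircuitBreaker<T> {
        private final int failureThreshold;
        private final Duration coolDown;
        private int consecutiveFailures = 0;
        private Instant openedAt = null;

        public CircuitBreaker(int failureThreshold, Duration coolDown) {
            this.failureThreshold = failureThreshold;
            this.coolDown = coolDown;
        }

        public T call(Supplier<T> integrationPoint) {
            if (openedAt != null && Instant.now().isBefore(openedAt.plus(coolDown))) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            try {
                T result = integrationPoint.get();   // the risky remote call
                consecutiveFailures = 0;             // success closes the breaker
                openedAt = null;
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                if (consecutiveFailures >= failureThreshold) {
                    openedAt = Instant.now();        // trip the breaker
                }
                throw e;
            }
        }
    }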
One server down jeopardizes the rest. A chain reaction happens because the death of one server makes the others pick up the slack. The increased load makes them more likely to fail. A chain reaction will quickly bring an entire layer down. Other layers that depend on it must protect themselves, or they will go down in a cascading failure.
Hunt for resource leaks. Most of the time, a chain reaction happens when your application has a memory leak. As one server runs out of memory and goes down, the other servers pick up the dead one’s burden. The increased traffic means they leak memory faster.
Hunt for obscure timing bugs. Obscure race conditions can also be triggered by traffic. Again, if one server goes down due to a deadlock, the increased load on the others m...
Defend with Bulkheads. Partitioning servers, with Bulkheads, can prevent Chain Reactions from taking out the entire service—though they won’t help the callers of whichever partition does go dow...
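One common way to build Bulkheads in code, sketched here with hypothetical service calls: give each downstream dependency its own bounded thread pool, so a hang in one dependency can exhaust only its own threads.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Bulkheads as separate, bounded thread pools: if calls to the inventory
    // service hang, they can tie up only inventoryPool, leaving the pricing
    // pool free to keep serving requests.
    public class Bulkheads {
        private final ExecutorService inventoryPool = Executors.newFixedThreadPool(10);
        private final ExecutorService pricingPool = Executors.newFixedThreadPool(10);

        public Future<String> checkInventory(String sku) {
            return inventoryPool.submit(() -> callInventoryService(sku));
        }

        public Future<String> lookUpPrice(String sku) {
            return pricingPool.submit(() -> callPricingService(sku));
        }

        // Hypothetical remote calls, stubbed out for the sketch.
        private String callInventoryService(String sku) { return "in stock"; }
        private String callPricingService(String sku) { return "19.99"; }
    }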
A cascading failure occurs when problems in one layer cause problems in callers.
Stop cracks from jumping the gap. A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.
Scrutinize resource pools. A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource.
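A sketch of that bounded-wait rule, using a counting semaphore as a stand-in for a real connection pool (mature pools expose the same idea as a checkout or connection-timeout setting):

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    // A thread may wait only a bounded time to check out a pooled resource.
    // Blocking forever here is how one dead integration point freezes every
    // request-handling thread in the caller.
    public class BoundedPool {
        private final Semaphore permits = new Semaphore(20); // illustrative pool size

        public Connection checkOut() throws InterruptedException {
            if (!permits.tryAcquire(250, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("pool exhausted: failing fast instead of blocking");
            }
            return openOrReuseConnection();
        }

        public void checkIn(Connection connection) {
            permits.release();
        }

        // Placeholder type and helper for the sketch.
        public static class Connection {}
        private Connection openOrReuseConnection() { return new Connection(); }
    }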
Defend with Timeouts and Circuit Breaker. A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls out to the troubled integration point. Using Timeouts ensure...
Users consume memory. Each user’s session requires some memory. Minimize that memory to improve your capacity. Use a session only for caching so you can purge the session’s contents if memory gets tight.
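One way to treat the session as a cache, sketched with hypothetical helper names (and assuming the javax.servlet API): keep bulky per-user data behind a SoftReference so the garbage collector can reclaim it under memory pressure, and rebuild it from the system of record when it is gone.

    import java.lang.ref.SoftReference;
    import javax.servlet.http.HttpSession;

    // Session-as-cache: expensive-to-build search results can be reclaimed by the
    // garbage collector when memory gets tight, then recomputed on demand.
    public class SearchResultsCache {
        private static final String KEY = "recentSearchResults";

        public SearchResults get(HttpSession session, String query) {
            @SuppressWarnings("unchecked")
            SoftReference<SearchResults> ref = (SoftReference<SearchResults>) session.getAttribute(KEY);
            SearchResults results = (ref == null) ? null : ref.get();
            if (results == null) {
                results = runSearch(query); // rebuild from the system of record
                session.setAttribute(KEY, new SoftReference<>(results));
            }
            return results;
        }

        // Placeholders for the sketch.
        public static class SearchResults {}
        private SearchResults runSearch(String query) { return new SearchResults(); }
    }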
Users do weird, random things. Users in the real world do things that you won’t predict (or sometimes understand). If there’s a weak spot in your application, they’ll find it through sheer numbers. Test scripts are useful for functional testing but too predictable for stability testing. Hire a bunch of chimpanzees to hammer on keyboards for more realistic testing.
Malicious users are out there. Become intimate with your network design; it should help avert attacks. Make sure your systems are easy to patch—you’ll be doing a lot of it. Keep your frameworks up-to-date, and keep yourself educated. T...
Users will gang up on you. Sometimes they come in really, really big mobs. Picture the Slashdot editors giggling as they point toward your site, saying, “Release the legions!” Large mobs can trigger hangs, deadlocks, and obscure race condit...
Distrust synchronized methods on domain objects.
In Java, it is possible for a subclass to declare a method synchronized that is unsynchronized in its superclass or interface definition. Object-oriented purists will tell you that this violates the Liskov Substitution principle. They are correct.
The problem with this design had nothing to do with the functional behavior. Functionally, RemoteAvailabilityCache was a nice piece of work. In times of stress, however, it had a nasty failure mode. The inventory system was undersized (see Antipattern 4.8, Unbalanced Capacities), so when the front end got busy, the back end would be flooded with requests. Eventually, it crashed. At that point, any thread calling RemoteAvailabilityCache.get would block, because one single thread was inside the create call, waiting for a response that would never come.
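A sketch of that failure mode, with hypothetical names echoing the original: a synchronized accessor that performs a remote call on a cache miss serializes every caller behind whichever thread is stuck waiting on the dead back end.

    import java.util.HashMap;
    import java.util.Map;

    // The shape of the problem: get() is synchronized, and on a cache miss it makes
    // a blocking remote call while holding the lock. If the inventory system hangs,
    // every other caller of get() blocks behind that one thread.
    public class RemoteAvailabilityCacheSketch {
        private final Map<String, Integer> cache = new HashMap<>();

        public synchronized Integer get(String sku) {
            Integer available = cache.get(sku);
            if (available == null) {
                available = createFromRemoteSystem(sku); // remote call under the lock
                cache.put(sku, available);
            }
            return available;
        }

        // Hypothetical remote lookup; in the incident described, this never returned.
        private Integer createFromRemoteSystem(String sku) {
            return 0;
        }
    }

A safer shape bounds the remote call with a timeout and does not hold the lock while waiting on the network.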
The Blocked Threads antipattern is the proximate cause of most failures. Application failures nearly always relate to Blocked Threads in one way or another, including the ever-popular “gradual slowdown” and “hung server.” The Blocked Threads antipattern leads to Chain Reactions and Cascading Failures.
Scrutinize resource pools. Like Cascading Failures, the Blocked Threads antipattern usually happens around resource pools, particularly database connection pools. A deadlock in the database can cause connections to be lost forever, and so can incorrect exception handling.
Use proven primitives. Learn and apply safe primitives. It might seem easy to roll your own producer/consumer queue; it isn’t. Any library of concurrency utiliti...
Defend with Timeouts. You cannot prove that your code has no deadlocks in it, but you can make sure that no deadlock lasts forever. Avoid Java’s infinite wait method; use the version that takes a timeout parameter. Always use Time...
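Both points can be illustrated with java.util.concurrent, sketched here with illustrative timeouts: a BlockingQueue is a proven producer/consumer primitive, and its timed offer and poll avoid the unbounded waits that a hand-rolled wait()/notify() queue invites.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Proven primitive plus bounded waits: neither producer nor consumer can
    // block forever, so a stalled partner cannot silently hang this thread.
    public class BoundedConsumer {
        private final BlockingQueue<String> workQueue = new ArrayBlockingQueue<>(100);

        public void produce(String item) throws InterruptedException {
            // Timed offer: shed load instead of blocking forever on a full queue.
            if (!workQueue.offer(item, 500, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("queue full: shedding load");
            }
        }

        public void consumeOne() throws InterruptedException {
            String item = workQueue.poll(2, TimeUnit.SECONDS); // never an infinite wait
            if (item == null) {
                return; // timed out: log it, check health, try again later
            }
            process(item);
        }

        private void process(String item) { /* application work */ }
    }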
Beware the code you cannot see. All manner of problems can lurk in the shadows of third-party code. Be very wary. Test it yourself. Whenever possible, acquire and investigat...
Keep the lines of communication open. Attacks of Self-Denial originate inside your own organization, when clever marketers cause self-inflicted wounds by creating their own flash mobs and traffic spikes. You can aid and abet these marketing efforts and protect your system at the same time, but only if you know what’s coming. Make sure nobody sends mass emails with deep links. Create static “landing zone” pages for the first click from these offers. Watch out for embedded session IDs in URLs.
Protect shared resources. Programming errors, unexpected scaling effects, and shared resources all create risks when traffic surges. Watch out for Fight Club bugs, where increased front-end load causes exponentially increasing back-end processing.
Expect rapid redistribution of any cool or valuable offer. Anybody who thinks they’ll release a special deal for limited distribution is asking for trouble. There’s no such thing as limited distribution. Even if you limit the number of times a fantastic deal can be redeemed, you’ll still get crushed ...
Examine production versus QA environments to spot Scaling Effects. You get bitten by Scaling Effects when you move from small one-to-one development and test environments to full-sized production environments. Patterns that work fine in small environments or one-to-one environments might slow down or fail completely when you move to production sizes.
Watch out for point-to-point communication. Point-to-point communication scales badly, since the number of connections increases as the square of the number of participants. Consider how large your system can grow while still using point-to-point connections—it might be sufficient. Once you’re dealing with tens of servers, you will probably need to replace it with some kind of one-to-many communication.
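To put numbers on "the square of the number of participants": a full point-to-point mesh among n servers needs n(n-1)/2 connections, so 6 servers need 15, 10 need 45, and 100 need 4,950.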
Watch out for shared resources. Shared resources can be a bottleneck, a capacity constraint, and a threat to stability. If your system must use some sort of shared resource, stress test it heavily. Also, be sure its clients wi...
Examine server and thread counts. In development and QA, your system probably looks like one or two servers, and so do all the QA versions of the other systems you call. In production, the ratio might be more like ten to one instead of one to one. Check the ratio of front-end to back-end servers, along with the number of threads each side can handle, in production compared to QA.
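A purely hypothetical illustration of how the ratio shifts: if QA runs one front end against one back end, but production runs 10 front ends with 200 request threads each against 2 back ends with 100 threads each, a back end that never saw more than a handful of concurrent calls in QA can now face up to 2,000 of them while it can service only 200 at a time.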
Observe near scaling effects and users. Unbalanced Capacities is a special case of Scaling Effects: one side of a relationship scales up much more than the other side. A change in traffic patterns—seasonal, market-driven, or publicity-driven—can cause a usually benign front-end system to suddenly flood a back-end system, in much the same way as a Slashdot or Digg post causes traffic to suddenly flood websites.
Stress both sides of the interface. If you provide the back-end system, see what happens if it suddenly gets ten times the highest ever demand, hitting the most expensive transaction. Does it fail completely? Does it slow down and recover? If you provide the front-end system, s...
Slow Responses triggers Cascading Failures. Upstream systems experiencing Slow Responses will themselves slow down and might be vulnerable to stability problems when the response times exceed their own timeouts.
For websites, Slow Responses causes more traffic. Users waiting for pages frequently hit the Reload button, generating even more traffic to your already overloaded system.
Consider Fail Fast. If your system tracks its own responsiveness, then it can tell when it is getting slow. Consider sending an immediate error response when the average response time exceeds the system’s allowed time (or at the very leas...
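A sketch of that gate with illustrative thresholds: keep an exponentially weighted average of recent response times and refuse new work outright once it exceeds the allowed budget.

    // Fail Fast gate: when the moving average of recent response times blows past
    // the budget, reject new requests immediately instead of queueing them up.
    // The update is not strictly atomic; good enough for a sketch.
    public class ResponsivenessGate {
        private static final double ALLOWED_MILLIS = 500.0; // illustrative budget
        private static final double ALPHA = 0.2;            // weight of the newest sample

        private volatile double averageMillis = 0.0;

        public void recordResponseTime(long millis) {
            averageMillis = ALPHA * millis + (1 - ALPHA) * averageMillis;
        }

        public void checkCapacity() {
            if (averageMillis > ALLOWED_MILLIS) {
                throw new IllegalStateException("responding too slowly: failing fast");
            }
        }
    }

A request handler would call checkCapacity() before doing any work and recordResponseTime() when it finishes.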
Hunt for memory leaks or resource contention. Contention for an inadequate supply of database connections produces Slow Responses. Slow Responses also aggravates that contention, leading to a self-reinforcing cycle. Memory leaks cause excessive effort in the garbage collector, resulting in slow response. Ineffici...
Don’t make empty promises. An SLA inversion means you are operating on wishful thinking: you’ve committed to a service level that you can achieve only through luck.
Examine every dependency. SLA Inversion lurks in unexpected places, particularly in the network infrastructure. For example, what is the SLA on your corporate DNS cluster? (I hope it’s a cluster, anyway.) How about on the SMTP service? Message queues and brokers? Enterprise SAN? SLA dependencies are everywhere.
Decouple your SLAs. Be sure you can maintain service even when your dependencies go down. If you fail whenever they do, then it’s a mathematical certainty that your ...
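The "mathematical certainty" is plain multiplication: a system that hard-depends on five services each promising 99.9% availability can itself promise at best 0.999^5, roughly 99.5%, before counting any failures of its own.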
Use realistic data volumes. Typical development and test data sets are too small to exhibit this problem. You need production-sized data sets to see what happens when your query returns a million rows that you turn into objects. As a side benefit, you’ll also get better information from your performance testing when you use production-sized test data.
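The defensive companion to this point, sketched with plain JDBC and a hypothetical orders table: put an explicit cap on how many rows you are willing to turn into objects, instead of trusting the data to stay small.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    // Bounded result set: never trust the database to return a "reasonable"
    // number of rows. Cap it explicitly and treat hitting the cap as a signal.
    public class OrderLoader {
        private static final int MAX_ROWS = 1_000; // illustrative bound

        public List<String> loadRecentOrderIds(Connection db, String customerId) throws SQLException {
            String sql = "SELECT order_id FROM orders WHERE customer_id = ? ORDER BY created_at DESC";
            try (PreparedStatement stmt = db.prepareStatement(sql)) {
                stmt.setMaxRows(MAX_ROWS); // hard cap at the driver level
                stmt.setFetchSize(100);    // stream in chunks rather than all at once
                stmt.setString(1, customerId);
                List<String> ids = new ArrayList<>();
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getString("order_id"));
                    }
                }
                return ids;
            }
        }
    }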

