Release It!: Design and Deploy Production-Ready Software (Pragmatic Programmers)
Kindle Notes & Highlights
Don’t avoid one-time development expenses at the cost of recurring operational expenses.
Manage perceptions after a major incident. It’s as important as managing the incident itself.
Bugs will happen. They cannot be eliminated, so they must be survived instead.
Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.
The amazing thing is that the highly stable design usually costs the same to implement as the unstable one.
A resilient system keeps processing transactions, even when there are transient impulses, persistent stresses, or component failures disrupting normal processing.
So, how do you find these kinds of bugs? The only way you can catch them before they bite you in production is to run your own longevity tests. If you can, set aside a developer machine. Have it run JMeter, Marathon, or some other load-testing tool. Don’t hit the system hard; just keep driving requests all the time. (Also, be sure to have the scripts slack for a few hours a day to simulate the slow period during the middle of the night. That will catch connection pool and firewall timeouts.)
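As an illustration of such a trickle test (my sketch, not the book’s; the endpoint URL, pacing, and overnight window are all assumptions), a bare-bones driver in Java might look like:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.time.LocalTime;

    // Trickle requests at a test system all day, slacking off overnight so
    // idle-connection and firewall timeouts get a chance to show themselves.
    public class LongevityDriver {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://test-env.example.com/health")) // hypothetical endpoint
                    .timeout(Duration.ofSeconds(10))
                    .build();
            while (true) {
                int hour = LocalTime.now().getHour();
                boolean nightLull = hour >= 2 && hour < 5;   // simulate the overnight slow period
                if (!nightLull) {
                    HttpResponse<String> resp =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    System.out.println(LocalTime.now() + " -> " + resp.statusCode());
                }
                Thread.sleep(nightLull ? 60_000 : 1_000);    // gentle, steady load
            }
        }
    }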
Antipatterns create, accelerate, or multiply cracks in the system.
Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk.
Network failures can hit you in two ways: fast or slow.
The most effective patterns to combat integration point failures are Circuit Breaker and Decoupling Middleware.
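To make the pattern concrete, here is a minimal Circuit Breaker sketch in Java. This is an illustration, not the book’s implementation; the failure threshold, cool-down period, and Supplier-based call interface are my assumptions.

    import java.util.function.Supplier;

    // Minimal, illustrative circuit breaker: trips open after a failure
    // threshold, rejects calls while open, and retries after a cool-down.
    public class CircuitBreaker {
        private final int failureThreshold;
        private final long retryAfterMillis;
        private int failures = 0;
        private long openedAt = 0;

        public CircuitBreaker(int failureThreshold, long retryAfterMillis) {
            this.failureThreshold = failureThreshold;
            this.retryAfterMillis = retryAfterMillis;
        }

        public synchronized <T> T call(Supplier<T> operation) {
            if (failures >= failureThreshold) {
                if (System.currentTimeMillis() - openedAt < retryAfterMillis) {
                    throw new IllegalStateException("circuit open: failing fast");
                }
                // Cool-down elapsed: fall through and allow one trial call (half-open).
            }
            try {
                T result = operation.get();
                failures = 0;             // success closes the circuit
                return result;
            } catch (RuntimeException e) {
                failures++;
                openedAt = System.currentTimeMillis();
                throw e;                  // caller still sees the failure
            }
        }
    }

The point of the design is that while the circuit is open, callers get an immediate error instead of tying up a thread on an integration point that is known to be failing.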
Beware this necessary evil. Every integration point will eventually fail in some way, and you need to be prepared for that failure.

Prepare for the many forms of failure. Integration point failures take several forms, ranging from various network errors to semantic errors. You will not get nice error responses delivered through the defined protocol; instead, you’ll see some kind of protocol violation, slow response, or outright hang.

Know when to open up abstractions. Debugging integration point failures usually requires peeling back a layer of abstraction. Failures are often difficult to debug ...
One server down jeopardizes the rest. A chain reaction happens because the death of one server makes the others pick up the slack. The increased load makes them more likely to fail. A chain reaction will quickly bring an entire layer down. Other layers that depend on it must protect themselves, or they will go down in a cascading failure.
Hunt for resource leaks. Most of the time, a chain reaction happens when your application has a memory leak. As one server runs out of memory and goes down, the other servers pick up the dead one’s burden. The increased traffic means they leak memory faster.
Hunt for obscure timing bugs. Obscure race conditions can also be triggered by traffic. Again, if one server goes down to a deadlock, the increased load on the others m...
Defend with Bulkheads. Partitioning servers with Bulkheads can prevent Chain Reactions from taking out the entire service—though they won’t help the callers of whichever partition does go dow...
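A small Java sketch of the Bulkheads idea (my illustration; the pool names and sizes are assumptions): each class of work gets its own bounded thread pool, so a flood from one client can exhaust its own partition but not the other.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Bulkheads as separate thread pools: a surge of web traffic can fill
    // webPool, but batch work keeps flowing through its own partition.
    public class Bulkheads {
        private final ExecutorService webPool   = Executors.newFixedThreadPool(20);
        private final ExecutorService batchPool = Executors.newFixedThreadPool(5);

        public void submitWebRequest(Runnable task) { webPool.submit(task); }
        public void submitBatchJob(Runnable task)   { batchPool.submit(task); }
    }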
A cascading failure occurs when problems in one layer cause problems in callers.
Stop cracks from jumping the gap. A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.
Scrutinize resource pools. A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource.
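A sketch of such a bounded checkout, assuming a BlockingQueue-backed pool (the class and method names are mine, not the book’s):

    import java.sql.Connection;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    // A pool whose checkout blocks for a bounded time instead of forever.
    public class BoundedConnectionPool {
        private final BlockingQueue<Connection> idle;

        public BoundedConnectionPool(BlockingQueue<Connection> idle) {
            this.idle = idle;
        }

        public Connection checkOut(long timeout, TimeUnit unit) throws InterruptedException {
            Connection conn = idle.poll(timeout, unit);  // bounded wait, not take()
            if (conn == null) {
                throw new IllegalStateException("no connection within " + timeout + " " + unit);
            }
            return conn;
        }

        public void checkIn(Connection conn) {
            idle.offer(conn);
        }
    }

The key line is poll(timeout, unit): a caller that cannot get a connection fails quickly and visibly, rather than joining an ever-growing crowd of blocked threads.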
Defend with Timeouts and Circuit Breaker. A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls out to the troubled integration point. Using Timeouts ensure...
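For example, bounding every outbound HTTP call with Java’s built-in client (the service URL and timeout values are illustrative assumptions):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.time.Duration;

    // Bounded waits on every outbound call: a connect timeout on the client
    // and a response timeout on each request.
    public class OutboundCall {
        static final HttpClient CLIENT = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))     // cap the wait to connect
                .build();

        static HttpRequest availabilityRequest(String sku) {
            return HttpRequest.newBuilder()
                    .uri(URI.create("https://inventory.example.com/availability/" + sku))
                    .timeout(Duration.ofSeconds(5))        // cap the wait for a response
                    .build();
        }
    }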
Users consume memory. Each user’s session requires some memory. Minimize that memory to improve your capacity. Use a session only for caching so you can purge the session’s contents if memory gets tight.
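One way to make session contents purgeable, sketched under assumptions (the SearchResults type and the session Map stand in for real application types), is to hold cached values through a SoftReference so the garbage collector can reclaim them under memory pressure:

    import java.lang.ref.SoftReference;
    import java.util.Map;

    // Session data held softly: the GC may reclaim it when memory gets
    // tight, and the application simply recomputes on the next request.
    public class SessionCache {
        record SearchResults(String query) {
            static SearchResults run(String query) { return new SearchResults(query); }
        }

        @SuppressWarnings("unchecked")
        public static SearchResults cachedResults(Map<String, Object> session, String query) {
            var ref = (SoftReference<SearchResults>) session.get("results:" + query);
            SearchResults results = (ref == null) ? null : ref.get();
            if (results == null) {
                results = SearchResults.run(query);                        // recompute on miss
                session.put("results:" + query, new SoftReference<>(results));
            }
            return results;
        }
    }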
Users do weird, random things. Users in the real world do things that you won’t predict (or sometimes understand). If there’s a weak spot in your application, they’ll find it through sheer numbers. Test scripts are useful for functional testing but too predictable for stability testing. Hire a bunch of chimpanzees to hammer on keyboards for more realistic testing.
Malicious users are out there. Become intimate with your network design; it should help avert attacks. Make sure your systems are easy to patch—you’ll be doing a lot of it. Keep your frameworks up-to-date, and keep yourself educated. T...
Users will gang up on you. Sometimes they come in really, really big mobs. Picture the Slashdot editors giggling as they point toward your site, saying, “Release the legions!” Large mobs can trigger hangs, deadlocks, and obscure race condit...
Distrust synchronized methods on domain objects.
In Java, it is possible for a subclass to declare a method synchronized that is unsynchronized in its superclass or interface definition. Object-oriented purists will tell you that this violates the Liskov Substitution principle. They are correct.
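A contrived Java illustration of that hazard (class and method names are hypothetical):

    // The supertype promises nothing about locking...
    class Availability {
        public String status(String sku) { return "AVAILABLE"; }
    }

    // ...but the subclass adds synchronization the contract never mentioned.
    class RemoteAvailability extends Availability {
        @Override
        public synchronized String status(String sku) {
            // Holds this object's monitor for the full remote call; every
            // other caller of any synchronized method now waits behind it.
            return callRemoteSystem(sku);
        }

        private String callRemoteSystem(String sku) {
            return "AVAILABLE"; // stand-in for a slow network call
        }
    }

Code written against Availability expects a cheap, nonblocking call; substitute a RemoteAvailability and every caller can end up queued behind one slow network request.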
The problem with this design had nothing to do with the functional behavior. Functionally, RemoteAvailabilityCache was a nice piece of work. In times of stress, however, it had a nasty failure mode. The inventory system was undersized (see Antipattern 4.8, ​Unbalanced Capacities​), so when the front end got busy, the back end would be flooded with requests. Eventually, it crashed. At that point, any thread calling RemoteAvailabilityCache.get would block, because one single thread was inside the create call, waiting for a response that would never come.
The Blocked Threads antipattern is the proximate cause of most failures. Application failures nearly always relate to Blocked Threads in one way or another, including the ever-popular “gradual slowdown” and “hung server.” The Blocked Threads antipattern leads to Chain Reactions and Cascading Failures.
Scrutinize resource pools. Like Cascading Failures, the Blocked Threads antipattern usually happens around resource pools, particularly database connection pools. A deadlock in the database can cause connections to be lost forever, and so can incorrect exception handling.
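A sketch of exception-safe handling in Java, using try-with-resources so the connection goes back to the pool on every path (the DAO, table, and query are illustrative assumptions):

    import java.sql.Connection;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    // Return the connection on every path, including exceptions; a missed
    // return on an error path leaks that connection from the pool forever.
    public class InventoryDao {
        private final DataSource pool;

        public InventoryDao(DataSource pool) { this.pool = pool; }

        public int countOnHand(String sku) throws SQLException {
            try (Connection conn = pool.getConnection();
                 var stmt = conn.prepareStatement("SELECT qty FROM inventory WHERE sku = ?")) {
                stmt.setString(1, sku);
                try (var rs = stmt.executeQuery()) {
                    return rs.next() ? rs.getInt(1) : 0;
                }
            } // try-with-resources closes (returns) the connection even on exceptions
        }
    }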
Use proven primitives. Learn and apply safe primitives. It might seem easy to roll your own producer/consumer queue; it isn’t. Any library of concurrency utiliti...
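For instance, java.util.concurrent’s bounded queue gives you a correct producer/consumer channel without any hand-rolled wait/notify (a minimal sketch; the job names are placeholders):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // A proven primitive instead of hand-rolled wait/notify: producers
    // block when the queue is full, consumers block when it is empty.
    public class WorkPipeline {
        public static void main(String[] args) {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

            Thread producer = new Thread(() -> {
                try {
                    for (int i = 0; i < 1_000; i++) {
                        queue.put("job-" + i);   // blocks if the queue is full
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        process(queue.take());   // blocks if the queue is empty
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            producer.start();
            consumer.start();
        }

        private static void process(String job) { System.out.println(job); }
    }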
Defend with Timeouts. You cannot prove that your code has no deadlocks in it, but you can make sure that no deadlock lasts forever. Avoid Java’s infinite wait method; use the version that takes a timeout parameter. Always use Time...
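A sketch of a bounded wait in Java (the class name and timeout budget are my assumptions): loop on the condition, track a deadline, and give up instead of blocking forever.

    // wait(remaining) instead of wait(): the thread wakes up after at most
    // the remaining time and can fail loudly rather than hang forever.
    public class BoundedWait {
        private final Object lock = new Object();
        private boolean ready = false;

        public void awaitReady(long timeoutMillis) throws InterruptedException {
            synchronized (lock) {
                long deadline = System.currentTimeMillis() + timeoutMillis;
                while (!ready) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) {
                        throw new IllegalStateException("timed out waiting for readiness");
                    }
                    lock.wait(remaining);  // bounded wait, unlike the infinite wait()
                }
            }
        }

        public void markReady() {
            synchronized (lock) {
                ready = true;
                lock.notifyAll();
            }
        }
    }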
Beware the code you cannot see. All manner of problems can lurk in the shadows of third-party code. Be very wary. Test it yourself. Whenever possible, acquire and investigat...
Keep the lines of communication open. Attacks of Self-Denial originate inside your own organization, when clever marketers cause self-inflicted wounds by creating their own flash mobs and traffic spikes. You can aid and abet these marketing efforts and protect your system at the same time, but only if you know what’s coming. Make sure nobody sends mass emails with deep links. Create static “landing zone” pages for the first click from these offers. Watch out for embedded session IDs in URLs.
Protect shared resources. Programming errors, unexpected scaling effects, and shared resources all create risks when traffic surges. Watch out for Fight Club bugs, where increased front-end load causes exponentially increasing back-end processing.
Expect rapid redistribution of any cool or valuable offer. Anybody who thinks they’ll release a special deal for limited distribution is asking for trouble. There’s no such thing as limited distribution. Even if you limit the number of times a fantastic deal can be redeemed, you’ll still get crushed ...
Examine production versus QA environments to spot Scaling Effects. You get bitten by Scaling Effects when you move from small one-to-one development and test environments to full-sized production environments. Patterns that work fine in small environments or one-to-one environments might slow down or fail completely when you move to production sizes.
Watch out for point-to-point communication. Point-to-point communication scales badly, since the number of connections increases as the square of the number of participants. Consider how large your system can grow while still using point-to-point connections—it might be sufficient. Once you’re dealing with tens of servers, you will probably need to replace it with some kind of one-to-many communication.
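To put numbers on the square growth (my arithmetic, not a quote from the book): n participants need n(n−1)/2 connections for full point-to-point connectivity, so 10 servers need 45 connections, while 100 servers need 4,950.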
Watch out for shared resources. Shared resources can be a bottleneck, a capacity constraint, and a threat to stability. If your system must use some sort of shared resource, stress test it heavily. Also, be sure its clients wi...
Examine server and thread counts. In development and QA, your system probably looks like one or two servers, and so do all the QA versions of the other systems you call. In production, the ratio might be more like ten to one instead of one to one. Check the ratio of front-end to back-end servers, along with the number of threads each side can handle, in production compared to QA.
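As a hypothetical worked example of why the ratio matters: 10 front-end servers running 50 request threads each can present 500 concurrent calls, while a back end sized like QA, say one server with a 25-connection pool, can service only 25 of them at a time; the other 475 queue up or time out.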
Observe near scaling effects and users. Unbalanced Capacities is a special case of Scaling Effects: one side of a relationship scales up much more than the other side. A change in traffic patterns—seasonal, market-driven, or publicity-driven—can cause a usually benign front-end system to suddenly flood a back-end system, in much the same way as a Slashdot or Digg post causes traffic to suddenly flood websites.
Stress both sides of the interface. If you provide the back-end system, see what happens if it suddenly gets ten times the highest-ever demand, hitting the most expensive transaction. Does it fail completely? Does it slow down and recover? If you provide the front-end system, s...
Slow Responses triggers Cascading Failures. Upstream systems experiencing Slow Responses will themselves slow down and might be vulnerable to stability problems when the response times exceed their own timeouts.
For websites, Slow Responses causes more traffic. Users waiting for pages frequently hit the Reload button, generating even more traffic to your already overloaded system.
Consider Fail Fast. If your system tracks its own responsiveness, then it can tell when it is getting slow. Consider sending an immediate error response when the average response time exceeds the system’s allowed time (or at the very leas...
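A minimal Fail Fast gate along those lines (an illustration, not the book’s code; the moving-average weighting and the notion of a latency budget are my assumptions):

    // Sheds load when the rolling average response time exceeds the budget.
    // The EWMA update is a sketch; a production version would use atomics.
    public class FailFastGate {
        private final long budgetMillis;
        private volatile double avgResponseMillis = 0.0;

        public FailFastGate(long budgetMillis) { this.budgetMillis = budgetMillis; }

        public void recordResponseTime(long millis) {
            // exponentially weighted moving average of observed latency
            avgResponseMillis = 0.9 * avgResponseMillis + 0.1 * millis;
        }

        public void admit() {
            if (avgResponseMillis > budgetMillis) {
                throw new IllegalStateException("failing fast: system is too slow");
            }
        }
    }

Callers invoke admit() before doing real work; when the system is already slow, they get an immediate error instead of piling more requests onto it.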
Hunt for memory leaks or resource contention. Contention for an inadequate supply of database connections produces Slow Responses. Slow Responses also aggravates that contention, leading to a self-reinforcing cycle. Memory leaks cause excessive effort in the garbage collector, resulting in slow response. Ineffici...
Don’t make empty promises. An SLA inversion means you are operating on wishful thinking: you’ve committed to a service level that you can achieve only through luck.
Examine every dependency. SLA Inversion lurks in unexpected places, particularly in the network infrastructure. For example, what is the SLA on your corporate DNS cluster? (I hope it’s a cluster, anyway.) How about on the SMTP service? Message queues and brokers? Enterprise SAN? SLA dependencies are everywhere.
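The underlying arithmetic (mine, not a quote): availabilities of hard dependencies multiply, so a system with five dependencies that each promise 99.9% can itself promise at most 0.999^5 ≈ 99.5%, no matter how solid its own code is.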
Decouple your SLAs. Be sure you can maintain service even when your dependencies go down. If you fail whenever they do, then it’s a mathematical certainty that your ...
Use realistic data volumes. Typical development and test data sets are too small to exhibit this problem. You need production-sized data sets to see what happens when your query returns a million rows that you turn into objects. As a side benefit, you’ll also get better information from your performance testing when you use production-sized test data.