Release It!: Design and Deploy Production-Ready Software
Read between February 17 and February 27, 2023
31%
When an operation is taking too long, sometimes we don’t care why…we just need to give up and keep moving. The Timeouts pattern lets us do that.
32%
Immediate retries are liable to hit the same problem and result in another timeout. That just makes the user wait even longer for her error message. Most of the time, you should queue the operation and retry it later.
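A minimal sketch of these two highlights combined, in Java (the class shape and names like callWithTimeout and retryQueue are mine, not the book's): bound the call with a timeout, and on expiry park the work for a later retry instead of retrying immediately.

```java
import java.util.concurrent.*;

public class TimeoutsSketch {
    static final ExecutorService pool = Executors.newFixedThreadPool(4);
    // Timed-out work parks here for a later retry instead of hammering the slow system again.
    static final BlockingQueue<Runnable> retryQueue = new LinkedBlockingQueue<>();

    static String callWithTimeout(Callable<String> operation) throws Exception {
        Future<String> future = pool.submit(operation);
        try {
            return future.get(1, TimeUnit.SECONDS);   // give up after one second, whatever the cause
        } catch (TimeoutException timedOut) {
            future.cancel(true);                      // stop waiting and keep moving
            retryQueue.add(() -> {                    // retry later, not immediately
                try { operation.call(); } catch (Exception stillFailing) { /* re-queue or drop */ }
            });
            throw timedOut;
        }
    }
}
```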
32%
When the circuit is “open,” calls to the circuit breaker fail immediately, without any attempt to execute the real operation. After a suitable amount of time, the circuit breaker decides that the operation has a chance of succeeding, so it goes into the “half-open” state. In this state, the next call to the circuit breaker is allowed to execute the dangerous operation. Should the call succeed, the circuit breaker resets and returns to the “closed” state, ready for more routine operation. If this trial call fails, however, the circuit breaker returns to the open state until another timeout …
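A minimal sketch of that state machine in Java (the class shape, thresholds, and names are illustrative, not the book's). It also logs every transition and exposes the current state, matching the recommendation two highlights below.

```java
import java.util.function.Supplier;

public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;
    private final int failureThreshold;
    private final long resetTimeoutMillis;

    CircuitBreaker(int failureThreshold, long resetTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutMillis = resetTimeoutMillis;
    }

    public synchronized <T> T call(Supplier<T> operation) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= resetTimeoutMillis) {
                transition(State.HALF_OPEN);          // time's up: allow one trial call
            } else {
                throw new IllegalStateException("circuit open: failing fast");
            }
        }
        try {
            T result = operation.get();               // the dangerous operation
            failures = 0;
            if (state == State.HALF_OPEN) transition(State.CLOSED);  // trial succeeded
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                openedAt = System.currentTimeMillis();
                transition(State.OPEN);               // trial failed, or too many failures
            }
            throw e;
        }
    }

    public synchronized State state() { return state; }   // expose for monitoring

    private void transition(State next) {
        System.out.println("circuit breaker: " + state + " -> " + next);  // always log changes
        state = next;
    }
}
```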
32%
Circuit breakers are a way to automatically degrade functionality when the system is under stress.
32%
Changes in a circuit breaker’s state should always be logged, and the current state should be exposed for querying and monitoring.
32%
Circuit Breaker is the fundamental pattern for protecting your system from all manner of Integration Points problems. When there’s a difficulty with Integration Points, stop calling it!
32%
Circuit Breaker is good at avoiding calls when Integration Points has a problem. The Timeouts pattern indicates that there’s a problem in Integration Points.
33%
You can partition the threads inside a single process, with separate thread groups dedicated to different functions. For example, it’s often helpful to reserve a pool of request-handling threads for administrative use. That way, even if all request-handling threads on the application server are hung, it can still respond to admin requests—perhaps to collect data for postmortem analysis or a request to shut down.
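A sketch of that partitioning in Java (the pool sizes are assumed numbers): two separate ExecutorService pools, so a saturated or hung request pool cannot starve the admin endpoints.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BulkheadSketch {
    // Request handlers get their own pool...
    private final ExecutorService requestPool = Executors.newFixedThreadPool(48);
    // ...while a small reserved pool keeps admin endpoints responsive even when
    // every request-handling thread is hung.
    private final ExecutorService adminPool = Executors.newFixedThreadPool(2);

    void handleRequest(Runnable work) { requestPool.submit(work); }
    void handleAdmin(Runnable work)   { adminPool.submit(work); }
}
```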
33%
Bulkheads are effective at maintaining service, or partial service, even in the face of failures.
33%
The Bulkheads pattern partitions capacity to preserve partial functionality when bad things happen.
33%
Every single time a human touches a server is an opportunity for unforced errors.
34%
The Steady State pattern says that for every mechanism that accumulates a resource, some other mechanism must recycle that resource.
34%
Log files on production systems have a terrible signal-to-noise ratio. It’s best to get them off the individual hosts as quickly as possible. Ship the log files to a centralized logging server, such as Logstash, where they can be indexed, searched, and monitored.
35%
Human intervention leads to problems. Eliminate the need for recurring human intervention. Your system should run for at least a typical deployment cycle without manual disk cleanups or nightly restarts.
35%
Purge data with application logic. DBAs can create scripts to purge data, but they don’t always know how the application behaves when data is removed. Maintaining logical integrity, especially if you use an ORM tool, requires the application to purge its own data.
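A hedged sketch of that advice in Java/JDBC (the sessions table, the 30-day cutoff, and the MySQL-flavored LIMIT clause are all illustrative): the application, which knows its own integrity rules, deletes expired rows in small batches.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class PurgeJob {
    // The application purges its own data, because it knows what removal means.
    // Small batches keep the purge from holding long locks.
    static void purgeExpiredSessions(Connection db) throws SQLException {
        long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
        // Note: DELETE ... LIMIT is MySQL-flavored SQL; adjust for your database.
        String sql = "DELETE FROM sessions WHERE last_seen < ? LIMIT 1000";
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            stmt.setTimestamp(1, new Timestamp(cutoff));
            while (stmt.executeUpdate() > 0) {
                // keep going until no expired rows remain
            }
        }
    }
}
```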
35%
Limit caching. In-memory caching speeds up applications, until it slows them down. Limit the amount o...
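The highlight is cut off, but the gist (cap how much an in-memory cache may hold) can be sketched with the standard Java LinkedHashMap LRU idiom:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder = true gives least-recently-used eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;   // evict instead of growing without bound
    }
}
```

Used like any other Map, `new BoundedCache<String, byte[]>(10_000)` evicts the least-recently-used entry once it passes 10,000 entries.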
35%
Roll the logs. Don’t keep an unlimited amount of log files. Configure log file...
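Again truncated, but the idea is expressible with the JDK's rotating FileHandler (the file pattern, 10 MB limit, and five generations are assumed values):

```java
import java.util.logging.FileHandler;
import java.util.logging.Logger;

public class LogRollingSketch {
    public static void main(String[] args) throws Exception {
        // Rotate at 10 MB and keep at most 5 generations (app.0.log .. app.4.log),
        // so the log files can never fill the disk.
        FileHandler handler = new FileHandler("app.%g.log", 10_000_000, 5, true);
        Logger log = Logger.getLogger("app");
        log.addHandler(handler);
        log.info("logging with bounded, rotating files");
    }
}
```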
35%
The service can quickly check out the connections it will need and verify the state of the circuit breakers around the integration points. This is sort of the software equivalent of the chef’s mise en place—gathering all the ingredients needed to perform the request before it begins. If any of the resources are not available, the service can fail immediately, rather than getting partway through the work.
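A sketch of that mise en place in Java, reusing the CircuitBreaker sketch above (the DataSource wiring and method names are illustrative): gather and verify every ingredient before doing any work, and fail immediately if one is missing.

```java
import java.sql.Connection;
import javax.sql.DataSource;

public class FailFastSketch {
    private final DataSource pool;        // the connection pool
    private final CircuitBreaker breaker; // wraps the integration point (sketch above)

    FailFastSketch(DataSource pool, CircuitBreaker breaker) {
        this.pool = pool;
        this.breaker = breaker;
    }

    String handle(String request) throws Exception {
        // Mise en place: verify every ingredient before starting the work.
        if (breaker.state() == CircuitBreaker.State.OPEN) {
            throw new IllegalStateException("dependency unavailable: failing fast");
        }
        try (Connection db = pool.getConnection()) {   // reserve the connection up front
            return doTheWork(request, db);
        }
    }

    private String doTheWork(String request, Connection db) { return "done"; }
}
```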
35%
If your system cannot meet its SLA, inform callers quickly. Don’t make them wait for an error message, and don’t make them wait until they time out. That just makes your problem into their problem.
36%
Do basic user input validation even before you reserve resources. Don’t bother checking out a database connection, fetching domain objects, populating them, and calling validate just to find out that a required parameter wasn’t entered.
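A small sketch of that ordering in Java (names are illustrative): reject bad input before checking out any resource.

```java
import java.sql.Connection;
import javax.sql.DataSource;

public class ValidateFirstSketch {
    // Check the cheap things before paying for the expensive ones.
    String createOrder(String customerId, DataSource pool) throws Exception {
        if (customerId == null || customerId.isBlank()) {
            throw new IllegalArgumentException("customerId is required");  // no connection wasted
        }
        try (Connection db = pool.getConnection()) {
            // ...fetch and populate domain objects only after validation passes...
            return "created";
        }
    }
}
```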
36%
The cleanest state your program can ever have is right after startup. The “let it crash” approach says that error recovery is difficult and unreliable, so our goal should be to get back to that clean startup as rapidly as possible.
36%
Crash components to save systems. It may seem counterintuitive to create system-level stability through component-level instability. Even so, it may be the best way to get back to a known good state.
36%
Restart fast and reintegrate. The key to crashing well is getting back up quickly. Otherwise you risk loss of service when too many components are bouncing. Once a component is back up, it should be reintegrated automatically.
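A toy sketch of the shape in Java (a real system would use supervision trees à la Erlang/OTP or an external process manager; this only shows the idea): the supervisor replaces a dead worker with a fresh one rather than repairing its state.

```java
public class SupervisorSketch {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            Thread worker = new Thread(SupervisorSketch::workerLoop, "worker");
            worker.start();
            worker.join();   // returns when the worker exits or crashes
            System.out.println("worker died; restarting from a clean state");
            Thread.sleep(100);   // tiny pause so a crash loop cannot spin the CPU
        }
    }

    static void workerLoop() {
        // Real work goes here. On an unrecoverable error, just let the thread die:
        // the supervisor brings a replacement back up quickly and reintegrates it.
    }
}
```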
36%
Isolate components to crash independently. Use Circuit Breakers to isolate callers from components that crash. Use supervisors to determine what the span of restarts should be. Design your supervision tree so that cra...
36%
Don’t crash monoliths. Large processes with heavy runtimes or long startups are not the right place to apply this pattern. Applications that couple many features ...
37%
Create cooperative demand control. Handshaking between a client and a server permits demand throttling to serviceable levels. Both the client and the server must be built to perform handshaking. Most common application-level protocols do not perform handshaking.
37%
Use health checks in clustered or load-balanced services as a way for instances to handshake with the load balancer.
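A sketch using the JDK's built-in HttpServer (the port, path, and healthy flag are assumptions): answering 503 tells the load balancer to stop sending work, without taking the instance down.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

public class HealthCheckSketch {
    static final AtomicBoolean healthy = new AtomicBoolean(true);

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // The load balancer polls this endpoint. A 503 is the instance's side
        // of the handshake: "don't send me work right now."
        server.createContext("/health", exchange -> {
            byte[] body = (healthy.get() ? "OK" : "BUSY").getBytes();
            exchange.sendResponseHeaders(healthy.get() ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```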
38%
A test harness differs from mock objects in that a mock object can only be trained to produce behavior that conforms to the defined interface. A test harness runs as a separate server, so it’s not obliged to conform to any interface. It can provoke network errors, protocol errors, or application-level errors. If all low-level errors were guaranteed to be recognized, caught, and thrown as the right type of exception, we would not need test harnesses.
38%
Emulate out-of-spec failures. Calling real applications lets you test only those errors that the real application can deliberately produce. A good test harness lets you simulate all sorts of messy, real-world failure modes.
38%
Stress the caller. The test harness can produce slow responses, no responses, or garbage responses. Then you can see how your application reacts.
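A toy harness of this kind in Java sockets (the port and the particular misbehaviors are illustrative): it accepts connections, stalls, then emits protocol garbage, so you can watch how the caller copes.

```java
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class EvilServerSketch {
    // A tiny test harness: accepts TCP connections and misbehaves on purpose.
    // Point your application's integration point at this port.
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(10200)) {
            while (true) {
                Socket client = server.accept();
                new Thread(() -> abuse(client)).start();
            }
        }
    }

    static void abuse(Socket client) {
        try (client; OutputStream out = client.getOutputStream()) {
            Thread.sleep(30_000);                     // respond very slowly...
            out.write("%%%garbage\r\n".getBytes());   // ...then with protocol garbage
        } catch (Exception ignored) {
        }
    }
}
```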
38%
The Test Harness pattern augments other testing methods. It does not replace unit tests, acceptance tests, penetration tests, and so on. Each of those techniques helps verify functional behavior. A test harness helps verify “nonfunctional” behavior while maintaining isolation from the remote systems.
38%
Done well, middleware simultaneously integrates and decouples systems. It integrates them by passing data and events back and forth between the systems. It decouples them by letting the participating systems remove specific knowledge of and calls to the other systems.
39%
If your service is exposed to uncontrolled demand, then you need to be able to shed load when the world goes crazy on you.
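One common way to shed load, sketched in Java with a Semaphore (the capacity of 100 is an assumed number): admit only as much work as the service can actually handle, and reject the rest immediately.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class LoadShedderSketch {
    // Permits model the service's real capacity.
    private final Semaphore capacity = new Semaphore(100);

    String handle(Supplier<String> work) {
        if (!capacity.tryAcquire()) {
            return "503 Service Unavailable";   // shed load instead of queueing forever
        }
        try {
            return work.get();
        } finally {
            capacity.release();
        }
    }
}
```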
40%
When Back Pressure kicks in, monitoring needs to know about it. That way you can tell whether it’s a random fluctuation or a trend.
40%
Back Pressure creates safety by slowing down consumers. Consumers will experience slowdowns. The only alternative is to let them crash the provider.
40%
Apply Back Pressure within a system boundary. Across boundaries, look at load shedding instead. This is especially true when the ...
40%
Queues must be finite for response times to be finite. You only have a few options when a queue is full. All of them are unpleasant: drop data, refuse work, or block. C...
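A sketch of those options in Java with a bounded ArrayBlockingQueue (the size and timeout are assumed values): a full queue forces an explicit choice between refusing work and blocking the producer.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedQueueSketch {
    // Finite capacity keeps response times finite.
    private final BlockingQueue<String> jobs = new ArrayBlockingQueue<>(1000);

    boolean submitOrRefuse(String job) {
        return jobs.offer(job);   // refuse work: false means tell the caller "not now"
    }

    boolean submitWithBackPressure(String job) throws InterruptedException {
        // Block the producer for a bounded time; that slowdown, propagating
        // upstream, is the back pressure.
        return jobs.offer(job, 100, TimeUnit.MILLISECONDS);
    }
}
```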
40%
A governor is stateful and time-aware. It knows what actions have been taken over a period of time. It should also be asymmetric. Most actions have a “safe” direction and an “unsafe” one. Shutting down instances is unsafe. Deleting data is unsafe. Blocking client IP addresses is unsafe.
40%
You will often find a tension between definitions of “safe.” Shutting down instances is unsafe for availability, while spinning up instances is unsafe for cost. These forces don’t cancel each other out. Instead, they define a U-shaped curve where going too far in either direction is bad. That means actions may also be safe within a defined range but unsafe outside the range.
40%
You can think about this U-shaped curve as defining the response curve for the governor. Inside the safe zone, the actions are fast. Outside the range, the governor applies increasing resistance.
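A sketch of such a governor in Java (the five-per-hour safe range and the exponential resistance are assumed): stateful, time-aware, and asymmetric, resisting only the unsafe direction.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class GovernorSketch {
    // Stateful and time-aware: remember the unsafe actions taken recently.
    private final Deque<Long> recentShutdowns = new ArrayDeque<>();
    private static final int SAFE_PER_HOUR = 5;   // inside this range, act at full speed

    // Asymmetric: only the unsafe direction (shutting instances down) is governed.
    synchronized void beforeShutdown() throws InterruptedException {
        long hourAgo = System.currentTimeMillis() - 3_600_000;
        while (!recentShutdowns.isEmpty() && recentShutdowns.peekFirst() < hourAgo) {
            recentShutdowns.removeFirst();        // forget actions outside the window
        }
        int excess = recentShutdowns.size() - SAFE_PER_HOUR;
        if (excess > 0) {
            long delay = (1L << Math.min(excess, 10)) * 1000;   // resistance grows outside the safe zone
            System.out.println("governor: slowing shutdowns by " + delay + " ms; alert a human");
            Thread.sleep(delay);                  // buy time for people to get involved
        }
        recentShutdowns.addLast(System.currentTimeMillis());
    }

    void beforeSpinUp() { /* safe direction: no resistance */ }
}
```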
40%
The whole point of a governor is to slow things down enough for humans to get involved. Naturally that means connecting to monitoring both to alert humans that there’s a situation and to give them enough visibility to understand what’s happening.
41%
If you ever catch yourself saying, “The odds of that happening are astronomical,” or some similar utterance, consider this: a single small service might do ten million requests per day over three years, for a total of 10,950,000,000 chances for something to go wrong. That’s more than ten billion opportunities for bad things to happen. Astronomical observations indicate there are four hundred billion stars in the Milky Way galaxy. Astronomers consider a number “close enough” if it’s within a factor of 10. Astronomically unlikely coincidences happen all the time.
45%
“DNS name to IP address” is a many-to-many relationship. But the machine still acts as if it has exactly one hostname. Many utilities and programs assume that the machine’s self-assigned FQDN is a legitimate DNS name that resolves back to itself. This is largely true for development machines and largely untrue for production services.
45%
Many data centers have a specific network for administrative access. This is an important security protection, because services such as SSH can be bound only to the administrative interface and are therefore not accessible from the production network. This can help if a firewall gets breached by an attacker or if the server handles an internal application and doesn’t sit behind a firewall.
47%
Containers are meant to start and stop rapidly. Avoid long startup or initialization sequences. Some production servers take many minutes to load reference data or to warm up caches. These are not suited for containers. Aim for a total startup time of one second.
49%
There’s simply no excuse not to use version control today. Only the code goes into version control, though. Version control doesn’t handle third-party libraries or dependencies very well.
49%
Developers should not do production builds from their own machines.
49%
Make production builds on a CI server, and have it put the binary into a safe repository that nobody else can write into.
49%
When it’s time to deploy new code, we don’t patch up the container; we just build a new one instead. We launch it and throw away the old one.
50%
Because the same software runs on several instances, some configuration properties should probably vary per machine. Keep these properties in separate places so nobody ever has to ask, “Are those supposed to be different?”
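A sketch of that separation in Java (the file names and layout are illustrative): common properties ship with the build, while per-instance overrides live in a separate, clearly named file, so a difference is visibly intentional.

```java
import java.io.FileInputStream;
import java.util.Properties;

public class ConfigSketch {
    // Shared settings live with the build; per-instance overrides live apart,
    // named after the host, so nobody has to ask "are those supposed to differ?"
    static Properties load(String hostName) throws Exception {
        Properties common = new Properties();
        try (FileInputStream in = new FileInputStream("conf/common.properties")) {
            common.load(in);
        }
        Properties perInstance = new Properties(common);  // fall back to common values
        try (FileInputStream in = new FileInputStream("conf/" + hostName + ".properties")) {
            perInstance.load(in);                          // override only what must vary
        }
        return perInstance;
    }
}
```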