Kindle Notes & Highlights
Read between February 17 - February 27, 2023
When an operation is taking too long, sometimes we don’t care why…we just need to give up and keep moving. The Timeouts pattern lets us do that.
Immediate retries are liable to hit the same problem and result in another timeout. That just makes the user wait even longer for her error message. Most of the time, you should queue the operation and retry it later.
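As an illustration of the two highlights above, here is a minimal Python sketch (not from the book) that gives up on a slow call after a deadline and parks the work for a later retry. The names call_remote_service, retry_queue, and the two-second timeout are all invented for the example.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def call_remote_service(request):
    """Stand-in for a slow integration-point call."""
    time.sleep(5)
    return f"response to {request}"

retry_queue = queue.Queue()                 # work parked for a later retry
executor = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(request, timeout_seconds=2.0):
    """Give up after the deadline instead of waiting indefinitely."""
    future = executor.submit(call_remote_service, request)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        retry_queue.put(request)            # retry later, not immediately
        raise
```

Note that future.result(timeout=...) only stops the caller from waiting; the worker thread may keep running, so a production timeout also has to be enforced at the socket or HTTP-client layer.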
When the circuit is “open,” calls to the circuit breaker fail immediately, without any attempt to execute the real operation. After a suitable amount of time, the circuit breaker decides that the operation has a chance of succeeding, so it goes into the “half-open” state. In this state, the next call to the circuit breaker is allowed to execute the dangerous operation. Should the call succeed, the circuit breaker resets and returns to the “closed” state, ready for more routine operation. If this trial call fails, however, the circuit breaker returns to the open state until another timeout elapses.
Circuit breakers are a way to automatically degrade functionality when the system is under stress.
Changes in a circuit breaker’s state should always be logged, and the current state should be exposed for querying and monitoring.
Circuit Breaker is the fundamental pattern for protecting your system from all manner of Integration Points problems. When there’s a difficulty with Integration Points, stop calling it!
Circuit Breaker is good at avoiding calls when Integration Points has a problem. The Timeouts pattern indicates that there’s a problem in Integration Points.
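A minimal sketch of the closed / open / half-open state machine described above, in Python, with every state transition logged as the highlights recommend. The thresholds, the RuntimeError, and the class itself are illustrative assumptions, not the book's implementation, and the sketch is not thread-safe.

```python
import logging
import time

log = logging.getLogger("circuit_breaker")

class CircuitBreaker:
    """Minimal closed / open / half-open state machine (illustrative sketch)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout    # seconds to stay open before a trial call
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def _set_state(self, new_state):
        log.warning("circuit breaker %s -> %s", self.state, new_state)  # always log transitions
        self.state = new_state

    def call(self, operation, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self._set_state("half-open")  # allow a single trial call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        if self.state == "half-open":
            self._set_state("closed")         # trial call succeeded
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
            self._set_state("open")
```

Exposing self.state for querying (and the logged transitions) is what makes the breaker's behavior observable to monitoring, per the highlight above.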
You can partition the threads inside a single process, with separate thread groups dedicated to different functions. For example, it’s often helpful to reserve a pool of request-handling threads for administrative use. That way, even if all request-handling threads on the application server are hung, it can still respond to admin requests—perhaps to collect data for postmortem analysis or a request to shut down.
Bulkheads are effective at maintaining service, or partial service, even in the face of failures.
The Bulkheads pattern partitions capacity to preserve partial functionality when bad things happen.
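A small sketch of the thread-partitioning idea, assuming Python's concurrent.futures: two independent pools act as bulkheads, so a flood of hung request work cannot consume the threads reserved for administrative commands. The pool sizes and function names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools are the bulkheads: exhausting one cannot starve the other.
request_pool = ThreadPoolExecutor(max_workers=50, thread_name_prefix="request")
admin_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="admin")

def handle_request(handler, *args):
    return request_pool.submit(handler, *args)

def handle_admin_command(handler, *args):
    # Diagnostics and shutdown requests still get a thread even when
    # every request-handling thread is hung.
    return admin_pool.submit(handler, *args)
```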
Every single time a human touches a server is an opportunity for unforced errors.
The Steady State pattern says that for every mechanism that accumulates a resource, some other mechanism must recycle that resource.
Log files on production systems have a terrible signal-to-noise ratio. It’s best to get them off the individual hosts as quickly as possible. Ship the log files to a centralized logging server, such as Logstash, where they can be indexed, searched, and monitored.
Human intervention leads to problems. Eliminate the need for recurring human intervention. Your system should run for at least a typical deployment cycle without manual disk cleanups or nightly restarts.
Purge data with application logic. DBAs can create scripts to purge data, but they don’t always know how the application behaves when data is removed. Maintaining logical integrity, especially if you use an ORM tool, requires the application to purge its own data.
Limit caching. In-memory caching speeds up applications, until it slows them down. Limit the amount o...
Roll the logs. Don’t keep an unlimited amount of log files. Configure log file...
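Two of the steady-state mechanisms above, sketched with Python's standard library: rolling logs with a hard cap on disk usage, and an in-memory cache bounded in size. The file name, limits, and load_profile function are hypothetical.

```python
import logging
from functools import lru_cache
from logging.handlers import RotatingFileHandler

# Roll the logs: cap each file at 10 MB and keep five generations,
# so log data cannot accumulate without bound.
handler = RotatingFileHandler("app.log", maxBytes=10 * 1024 * 1024, backupCount=5)
logging.getLogger().addHandler(handler)

# Limit caching: bound the in-memory cache so it speeds things up
# without eventually slowing them down.
@lru_cache(maxsize=10_000)
def load_profile(user_id):
    ...   # hypothetical expensive lookup
```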
The service can quickly check out the connections it will need and verify the state of the circuit breakers around the integration points. This is sort of the software equivalent of the chef’s mise en place—gathering all the ingredients needed to perform the request before it begins. If any of the resources are not available, the service can fail immediately, rather than getting partway through the work.
If your system cannot meet its SLA, inform callers quickly. Don’t make them wait for an error message, and don’t make them wait until they time out. That just makes your problem into their problem.
Do basic user input validation even before you reserve resources. Don’t bother checking out a database connection, fetching domain objects, populating them, and calling validate just to find out that a required parameter wasn’t entered.
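A sketch of the fail-fast ordering suggested by these highlights: validate the cheap things first, check the circuit breaker around the integration point next, and only then reserve expensive resources. place_order, order_breaker (an instance of the CircuitBreaker sketch above), and db_pool are invented names.

```python
class ServiceUnavailable(Exception):
    """Raised immediately when we already know we cannot do the work."""

def place_order(raw_request, order_breaker, db_pool):
    # 1. Cheap input validation first: no connections, no domain objects.
    if "customer_id" not in raw_request or not raw_request.get("items"):
        raise ValueError("missing required parameters")

    # 2. Check the circuit breaker around the integration point before
    #    starting any real work.
    if order_breaker.state == "open":
        raise ServiceUnavailable("downstream unavailable; failing fast")

    # 3. Only now gather the expensive ingredients (mise en place).
    with db_pool.connection() as conn:
        ...   # do the actual work
```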
The cleanest state your program can ever have is right after startup. The “let it crash” approach says that error recovery is difficult and unreliable, so our goal should be to get back to that clean startup as rapidly as possible.
Crash components to save systems. It may seem counterintuitive to create system-level stability through component-level instability. Even so, it may be the best way to get back to a known good state.
Restart fast and reintegrate. The key to crashing well is getting back up quickly. Otherwise you risk loss of service when too many components are bouncing. Once a component is back up, it should be reintegrated automatically.
Isolate components to crash independently. Use Circuit Breakers to isolate callers from components that crash. Use supervisors to determine what the span of restarts should be. Design your supervision tree so that cra...
Don’t crash monoliths. Large processes with heavy runtimes or long startups are not the right place to apply this pattern. Applications that couple many features ...
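A toy supervisor in Python, restarting a crashed worker process quickly so the system returns to its clean post-startup state. The worker.py command and one-second delay are placeholders; a real supervisor would also cap the restart rate and report to monitoring.

```python
import subprocess
import time

def supervise(cmd=("python", "worker.py"), restart_delay=1.0):
    """Keep one worker running; when it crashes, restart fast and reintegrate."""
    while True:
        started = time.monotonic()
        proc = subprocess.Popen(cmd)
        exit_code = proc.wait()                  # block until the component dies
        uptime = time.monotonic() - started
        print(f"worker exited with {exit_code} after {uptime:.1f}s; restarting")
        time.sleep(restart_delay)                # brief pause, then a clean start
```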
Create cooperative demand control. Handshaking between a client and a server permits demand throttling to serviceable levels. Both the client and the server must be built to perform handshaking. Most common application-level protocols do not perform handshaking.
Use health checks in clustered or load-balanced services as a way for instances to handshake with the load balancer.
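A sketch of using a health check as a handshake: the instance answers 503 when it is already saturated, so a load balancer stops sending it new work. The /health path, port, and capacity numbers are assumptions for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

active_requests = 0        # maintained by the real request handlers (not shown)
MAX_IN_FLIGHT = 100        # illustrative capacity limit

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # The handshake: say "no more" before this instance drowns.
        self.send_response(200 if active_requests < MAX_IN_FLIGHT else 503)
        self.end_headers()

def serve_health(port=8081):
    server = HTTPServer(("", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```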
A test harness differs from mock objects in that a mock object can only be trained to produce behavior that conforms to the defined interface. A test harness runs as a separate server, so it’s not obliged to conform to any interface. It can provoke network errors, protocol errors, or application-level errors. If all low-level errors were guaranteed to be recognized, caught, and thrown as the right type of exception, we would not need test harnesses.
Emulate out-of-spec failures. Calling real applications lets you test only those errors that the real application can deliberately produce. A good test harness lets you simulate all sorts of messy, real-world failure modes.
Stress the caller. The test harness can produce slow responses, no responses, or garbage responses. Then you can see how your application reacts.
The Test Harness pattern augments other testing methods. It does not replace unit tests, acceptance tests, penetration tests, and so on. Each of those techniques helps verify functional behavior. A test harness helps verify “nonfunctional” behavior while maintaining isolation from the remote systems.
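A bare-bones test harness along these lines: a socket server that deliberately hangs, returns garbage bytes, or dribbles a response out one byte at a time. The modes and port are illustrative only.

```python
import os
import socket
import time

def misbehaving_server(port=9999, mode="garbage"):
    """A deliberately broken endpoint for exercising callers, not a mock."""
    listener = socket.create_server(("", port))
    while True:
        conn, _ = listener.accept()
        if mode == "hang":
            time.sleep(3600)                    # accept, then never answer
        elif mode == "garbage":
            conn.sendall(os.urandom(1024))      # bytes that match no protocol
        elif mode == "slow":
            for byte in b"HTTP/1.1 200 OK\r\n":
                conn.sendall(bytes([byte]))     # dribble one byte per second
                time.sleep(1)
        conn.close()
```

Because it sits below the application protocol, a harness like this can provoke failures a well-behaved mock object never could.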
Done well, middleware simultaneously integrates and decouples systems. It integrates them by passing data and events back and forth between the systems. It decouples them by letting the participating systems remove specific knowledge of and calls to the other systems.
If your service is exposed to uncontrolled demand, then you need to be able to shed load when the world goes crazy on you.
When Back Pressure kicks in, monitoring needs to know about it. That way you can tell whether it’s a random fluctuation or a trend.
Back Pressure creates safety by slowing down consumers. Consumers will experience slowdowns. The only alternative is to let them crash the provider.
Apply Back Pressure within a system boundary. Across boundaries, look at load shedding instead. This is especially true when the ...
Queues must be finite for response times to be finite. You only have a few options when a queue is full. All of them are unpleasant: drop data, refuse work, or block. C...
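A sketch of back pressure with a finite queue, showing the three unpleasant options named above: block the caller, refuse the work, or drop it. The queue size, policy names, and Overloaded exception are invented, and the warning log is there so monitoring can see when back pressure kicks in.

```python
import logging
import queue

log = logging.getLogger("backpressure")

class Overloaded(Exception):
    pass

work_queue = queue.Queue(maxsize=1000)    # finite, so response time stays finite

def submit(job, policy="refuse"):
    """The unpleasant options when the queue is full: block, refuse, or drop."""
    if policy == "block":
        work_queue.put(job)               # back pressure: the caller slows down
        return
    try:
        work_queue.put_nowait(job)
    except queue.Full:
        # Make back pressure visible to monitoring, per the highlight above.
        log.warning("work queue full; policy=%s", policy)
        if policy == "refuse":
            raise Overloaded("shedding load: queue full")
        # policy == "drop": silently discard the job
```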
A governor is stateful and time-aware. It knows what actions have been taken over a period of time. It should also be asymmetric. Most actions have a “safe” direction and an “unsafe” one. Shutting down instances is unsafe. Deleting data is unsafe. Blocking client IP addresses is unsafe.
You will often find a tension between definitions of “safe.” Shutting down instances is unsafe for availability, while spinning up instances is unsafe for cost. These forces don’t cancel each other out. Instead, they define a U-shaped curve where going too far in either direction is bad. That means actions may also be safe within a defined range but unsafe outside the range.
You can think about this U-shaped curve as defining the response curve for the governor. Inside the safe zone, the actions are fast. Outside the range, the governor applies increasing resistance.
The whole point of a governor is to slow things down enough for humans to get involved. Naturally that means connecting to monitoring both to alert humans that there’s a situation and to give them enough visibility to understand what’s happening.
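A sketch of a governor, assuming that "unsafe" actions are routed through it: it is stateful (it remembers the last hour of actions), asymmetric (safe actions never touch it), and it applies increasing delay plus a warning once the rate leaves the safe range. All thresholds and names are made up.

```python
import logging
import time
from collections import deque

log = logging.getLogger("governor")

class Governor:
    """Slows risky automated actions enough for humans to get involved."""

    def __init__(self, safe_per_hour=5):
        self.safe_per_hour = safe_per_hour   # rate considered safe
        self.history = deque()               # timestamps of recent unsafe actions

    def before_unsafe_action(self, description):
        now = time.monotonic()
        while self.history and now - self.history[0] > 3600:
            self.history.popleft()           # only the last hour counts
        self.history.append(now)
        excess = len(self.history) - self.safe_per_hour
        if excess > 0:
            delay = 2 ** excess              # increasing resistance outside the safe range
            log.warning("governor delaying %r by %ds; operators should look", description, delay)
            time.sleep(delay)

# Usage sketch (shutdown_instance is hypothetical):
# governor = Governor()
# governor.before_unsafe_action("terminate instance i-1234")
# shutdown_instance("i-1234")
```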
If you ever catch yourself saying, “The odds of that happening are astronomical,” or some similar utterance, consider this: a single small service might do ten million requests per day over three years, for a total of 10,950,000,000 chances for something to go wrong. That’s more than ten billion opportunities for bad things to happen. Astronomical observations indicate there are four hundred billion stars in the Milky Way galaxy. Astronomers consider a number “close enough” if it’s within a factor of 10. Astronomically unlikely coincidences happen all the time.
“DNS name to IP address” is a many-to-many relationship. But the machine still acts as if it has exactly one hostname. Many utilities and programs assume that the machine’s self-assigned FQDN is a legitimate DNS name that resolves back to itself. This is largely true for development machines and largely untrue for production services.
Many data centers have a specific network for administrative access. This is an important security protection, because services such as SSH can be bound only to the administrative interface and are therefore not accessible from the production network. This can help if a firewall gets breached by an attacker or if the server handles an internal application and doesn’t sit behind a firewall.
Containers are meant to start and stop rapidly. Avoid long startup or initialization sequences. Some production servers take many minutes to load reference data or to warm up caches. These are not suited for containers. Aim for a total startup time of one second.
There’s simply no excuse not to use version control today. Only the code goes into version control, though. Version control doesn’t handle third-party libraries or dependencies very well.
Developers should not do production builds from their own machines.
Make production builds on a CI server, and have it put the binary into a safe repository that nobody else can write into.
When it’s time to deploy new code, we don’t patch up the container; we just build a new one instead. We launch it and throw away the old one.
Because the same software runs on several instances, some configuration properties should probably vary per machine. Keep these properties in separate places so nobody ever has to ask, “Are those supposed to be different?”
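One way to keep per-instance properties visibly separate from shared ones, sketched with Python's configparser; the file names and keys are illustrative.

```python
import configparser

# common.ini holds properties identical on every instance; instance.ini holds
# only the values that are supposed to differ per machine.
config = configparser.ConfigParser()
config.read(["common.ini", "instance.ini"])   # later files override earlier ones

db_url = config["database"]["url"]            # shared everywhere
node_id = config["node"]["id"]                # intentionally per-instance
```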

