Release It!: Design and Deploy Production-Ready Software
Read between February 17 - February 27, 2023
50%
We don’t want our instance binaries to change per environment, but we do want their properties to change. That means the code should look outside the deployment directory to find per-environment configurations.
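A minimal sketch of that idea in Python, assuming a hypothetical APP_CONFIG_DIR environment variable that points at a per-environment directory outside the deployment directory:

```python
import json
import os
from pathlib import Path

# APP_CONFIG_DIR is a hypothetical environment variable pointing outside the
# deployment directory (for example /etc/myapp). The binary stays identical
# across environments; only the files in that directory differ.
def load_config() -> dict:
    config_dir = Path(os.environ.get("APP_CONFIG_DIR", "/etc/myapp"))
    with open(config_dir / "app.json") as f:
        return json.load(f)
```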
50%
That’s not to say you should keep configurations out of version control altogether. Just keep them in a different repository than the source code. Lock it down to only the people who should have access,
51%
A system-level view will provide historical analysis, present state, instantaneous behavior, and future projections. The job of an individual instance is to reveal enough data to enable those perspectives.
51%
Transparency arises from deliberate design and architecture. “Adding transparency” late in development is about as effective as “adding quality.”
51%
The monitoring and reporting systems should be like an exoskeleton built around your system, not woven into it. In particular, decisions about what metrics should trigger alerts, where to set the thresholds, and how to “roll up” state variables into an overall system health status should all be left outside of the instance itself. These are policy decisions that will change at a very different rate than the application code will.
51%
Despite what all those application templates create for us, a logs directory under the application’s install directory is the wrong way to go. Log files can be large. They grow rapidly and consume lots of I/O. For physical machines, it’s a good idea to keep them on a separate drive. That lets the machine use more I/O bandwidth in parallel and reduces contention for the busy drives.
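A small sketch of that setup with Python's standard logging module; the /var/log/myapp path stands in for a directory on a separate drive:

```python
import logging
import logging.handlers

# Keep logs on a separate volume (/var/log/myapp here), not under the app's
# install directory; rotation keeps any single file from growing without bound.
handler = logging.handlers.RotatingFileHandler(
    "/var/log/myapp/app.log", maxBytes=50 * 1024 * 1024, backupCount=10
)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)
logging.getLogger().addHandler(handler)
```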
51%
Most developers implement logging as though they are the primary consumer of the log files. In fact, administrators and engineers in operations will spend far more time with these log files than developers will. Logging should be aimed at production operations rather than development or testing.
51%
Anything logged at level “ERROR” or “SEVERE” should be something that requires action on the part of operations. Not every exception needs to be logged as an error. Just because a user entered a bad credit card number and the validation component threw an exception doesn’t mean anything has to be done about it. Log errors in business logic or user input as warnings (if at all).
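A sketch of that logging policy; the exception class and validation logic are made up for illustration:

```python
import logging

log = logging.getLogger("payments")

class InvalidCardError(Exception):
    """Hypothetical exception for a bad credit card number."""

def validate(card_number: str) -> None:
    # Stand-in for real validation logic.
    if not card_number.isdigit():
        raise InvalidCardError(card_number)

def charge(card_number: str) -> None:
    try:
        validate(card_number)
        # ... hand off to the payment processor here ...
    except InvalidCardError:
        # Bad user input: nothing for operations to act on.
        log.warning("card validation failed")
    except ConnectionError:
        # Infrastructure failure: the kind of thing that should page someone.
        log.error("payment processor unreachable")
        raise
```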
52%
It’s easy to leave debug messages turned on in production. All it takes is one wrong commit with debug levels enabled. I recommend adding a step to your build process that automatically removes any configs that enable debug or trace log levels.
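One way to sketch such a build step: a small script that scans config files for debug or trace levels and fails the build (rather than rewriting the files) if it finds any. The file extensions and the pattern are assumptions about how the configs look:

```python
import re
import sys
from pathlib import Path

# Fail the build if any shipped config file enables DEBUG or TRACE logging.
PATTERN = re.compile(r"level\s*[=:]\s*['\"]?(DEBUG|TRACE)\b", re.IGNORECASE)

def check(config_dir: str) -> int:
    offenders = []
    for path in Path(config_dir).rglob("*"):
        if not path.is_file() or path.suffix not in {".properties", ".yml", ".yaml", ".xml"}:
            continue
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if PATTERN.search(line):
                offenders.append(f"{path}:{lineno}: {line.strip()}")
    for offender in offenders:
        print(offender, file=sys.stderr)
    return 1 if offenders else 0

if __name__ == "__main__":
    sys.exit(check(sys.argv[1] if len(sys.argv) > 1 else "config"))
```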
52%
The instance itself won’t be able to tell much about overall system health, but it should emit metrics that can be collected, analyzed, and visualized centrally. This may be as simple as periodically spitting a line of stats into a log file.
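A minimal version of that, with illustrative metric names: a background thread that writes one stats line per interval:

```python
import logging
import threading
import time

log = logging.getLogger("stats")
stats = {"requests": 0, "errors": 0}   # incremented elsewhere by the application

def emit_stats(interval: float = 60.0) -> None:
    # Periodically spit one parseable line of stats into the log.
    while True:
        time.sleep(interval)
        log.info("requests=%d errors=%d", stats["requests"], stats["errors"])

threading.Thread(target=emit_stats, daemon=True).start()
```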
52%
Some developers from Netflix have quipped that Netflix is a monitoring system that streams movies as a side effect.
53%
Health checks should be more than just “yup, it’s running.” A health check should report at least the following:
- The host IP address or addresses
- The version number of the runtime or interpreter (Ruby, Python, JVM, .Net, Go, and so on)
- The application version or commit ID
- Whether the instance is accepting work
- The status of connection pools, caches, and circuit breakers
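A sketch of a health-check payload carrying those fields; the key names and placeholder values are illustrative:

```python
import platform
import socket

def health() -> dict:
    # The accepting_work flag and the pool/cache/breaker statuses would come
    # from the instance's real state; the values here are placeholders.
    return {
        "host_ip": socket.gethostbyname(socket.gethostname()),
        "runtime_version": platform.python_version(),
        "app_version": "a1b2c3d",   # commit ID stamped in at build time
        "accepting_work": True,
        "connection_pools": {"db": "OK"},
        "caches": {"sessions": "OK"},
        "circuit_breakers": {"payment-gateway": "CLOSED"},
    }
```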
53%
When using DNS, it’s important to have a logical service name to call, rather than a physical hostname. Even if that logical name is just an alias to the underlying host, it’s still preferable. An alias only needs to be changed in one place (the name server’s database) rather than in every consuming application.
54%
The main emphasis for DNS servers should be diversity. Don’t host them on the same infrastructure as your production systems. Make sure you have more than one DNS provider with servers in different locations. Use a different DNS provider still for your public status page. Make sure there are no failure scenarios that leave you without at least one functioning DNS server.
55%
A normal proxy multiplexes many outgoing calls into a single source IP address. A reverse proxy server does the opposite: it demultiplexes calls coming into a single IP address and fans them out to multiple addresses.
55%
Load balancers can also attempt to direct repeated requests to the same instance. This helps when you have stateful services, like user session state in an application server. Directing a caller’s repeated requests to the same instance provides better response time for the caller, because the necessary resources will already be in that instance’s memory. A downside of sticky sessions is that they can prevent load from being distributed evenly across machines.
55%
Health checks are a vital part of load balancer configuration. Good health checks ensure that requests can succeed, not just that the service is listening to a socket.
56%
Services can measure their own response time to help with this. They can also check their own operational state to see if requests will be answered in a timely fashion. For instance, monitoring the degree of contention for a connection pool allows a service to estimate wait times. Likewise, a service can check response times on its own dependencies. If those dependencies are too slow and are required, then the health check should show that this service is unavailable. This provides back pressure through service tiers.
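A sketch of that kind of deep health check, with made-up thresholds: if too many callers are waiting on the connection pool, or a required dependency responds too slowly, the instance reports itself unhealthy so callers back off:

```python
import time
import urllib.request

MAX_POOL_WAITERS = 10          # hypothetical contention threshold
MAX_DEPENDENCY_SECONDS = 2.0   # hypothetical response-time budget

def dependency_latency(url: str) -> float:
    start = time.monotonic()
    urllib.request.urlopen(url, timeout=MAX_DEPENDENCY_SECONDS)
    return time.monotonic() - start

def healthy(pool_waiters: int, dependency_url: str) -> bool:
    # Too many callers waiting on the connection pool means new requests
    # would just queue up; report unhealthy to push back on upstream callers.
    if pool_waiters > MAX_POOL_WAITERS:
        return False
    try:
        return dependency_latency(dependency_url) < MAX_DEPENDENCY_SECONDS
    except OSError:
        # A required dependency that can't be reached means this instance
        # can't answer in a timely fashion either.
        return False
```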
57%
Reject work as close to the edge as possible. The further it penetrates into your system, the more resources it ties up.
57%
Provide health checks that allow load balancers to protect your application code.
57%
Start rejecting work when your response time is going t...
57%
“Service discovery” really has two parts. First, it’s a way that instances of a service can announce themselves to begin receiving load. This replaces statically configured load balancer pools with dynamic pools.
58%
The second part is lookup. A caller needs to know at least one IP address to contact for a particular service. The lookup process can appear to be a simple DNS resolution for the caller, even if some super-dynamic service-aware server is supplying the DNS service.
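A toy illustration of both parts, with an in-memory registry standing in for a real one (Consul, etcd, or a service-aware DNS server); the class and method names are invented:

```python
import random
from collections import defaultdict

class Registry:
    """Stand-in for a real service registry (Consul, etcd, DNS-based, ...)."""

    def __init__(self) -> None:
        self._instances = defaultdict(list)

    # Part one: an instance announces itself to start receiving load.
    def announce(self, service: str, address: str) -> None:
        self._instances[service].append(address)

    # Part two: a caller looks up an address for the service it needs.
    def lookup(self, service: str) -> str:
        return random.choice(self._instances[service])

registry = Registry()
registry.announce("orders", "10.0.1.17:8080")
registry.announce("orders", "10.0.1.18:8080")
print(registry.lookup("orders"))
```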
59%
The words “human error” don’t appear anywhere. It’s hard to overstate the importance of that. This is not a case of humans failing the system. It’s a case of the system failing humans. The administrative tools and playbooks allowed this error to happen. They amplified a minor error into enormous consequences. We must regard this as a system failure.
60%
Have postmortems for successful changes. See what variations or anomalies happened. Find out what the “near misses” were.
61%
If your image of a dev server is a fresh virtual machine with a known configuration, that’s great! Maybe your image of QA is a whole environment stamped out by the same automation tools that deploy to production, with an anonymized sample of production data from within the last week. If so, you’re doing quite well.
61%
There’s no “right number” of QA environments. Virtualize them so every team can create its own on-demand QA environment.
61%
At scale, “partially broken” is the normal state of operation.
62%
Traffic indicators: Page requests, page requests total, transaction counts, concurrent sessions
62%
Business transaction, for each type: Number processed, number aborted, dollar value, transaction aging, conversion rate, completion rate
62%
Users: Demographics or classification, technographics, percentage of users who are registered, number of users, usage patterns, errors encountered...
62%
Resource pool health: Enabled state, total resources (as applied to connection pools, worker thread pools, and any other resource pools), resources checked out, high-water mark, number of resources created, number of resources destroyed, number of times checked out, number of threads b...
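A sketch of a pool wrapper that tracks a few of those numbers; the pool itself is reduced to counters for illustration:

```python
import threading

class InstrumentedPool:
    """Tracks a few of the resource pool health numbers listed above."""

    def __init__(self, total_resources: int) -> None:
        self.total_resources = total_resources
        self.checked_out = 0
        self.high_water_mark = 0
        self.times_checked_out = 0
        self._lock = threading.Lock()

    def check_out(self) -> None:
        with self._lock:
            self.checked_out += 1
            self.times_checked_out += 1
            self.high_water_mark = max(self.high_water_mark, self.checked_out)

    def check_in(self) -> None:
        with self._lock:
            self.checked_out -= 1

    def metrics(self) -> dict:
        return {
            "total_resources": self.total_resources,
            "resources_checked_out": self.checked_out,
            "high_water_mark": self.high_water_mark,
            "times_checked_out": self.times_checked_out,
        }
```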
62%
Database connection health: Number of SQLExceptions thrown, number of queries, average...
62%
Data consumption: Number of entities or rows present, footprint i...
63%
Integration point health: State of circuit breaker, number of timeouts, number of requests, average response time, number of good responses, number of network errors, number of protocol errors, number of application errors, actual IP address of the remote endpoint, cu...
63%
Cache health: Items in cache, memory used by cache, cache hit rate, items flushed by garbage collector, configured uppe...
63%
In many organizations, deployment is ridiculously painful, so it’s a good place to start making life better.
63%
The build pipeline should tag the build as it passes various stages, especially verification steps like unit or integration tests.
64%
Live control is only necessary if it takes your instances a long time to be ready to run.
64%
In those cases, you need to look at ways to send control signals to running instances. Here is a brief checklist of controls to plan for:
- Reset circuit breakers.
- Adjust connection pool sizes and timeouts.
- Disable specific outbound integrations.
- Reload configuration.
- Start or stop accepting load.
- Feature toggles.
64%
It’s time to build a “command queue.” This is a shared message queue or pub/sub bus that all the instances can listen to. The admin tool sends out a command that the instances then perform.
64%
With a command queue, it’s even easier to create a dogpile. It’s often a good idea to have each instance add a random bit of delay to spread them out a bit. It can also help to identify “waves” or “gangs” of instances. So a command may target “wave 1,” followed by “wave 2” and “wave 3” a few minutes later.
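A sketch of the instance-side handler for such a command queue, with made-up wave assignment and delay bounds; the pub/sub plumbing that delivers the command dict is omitted:

```python
import random
import time

WAVE = 1   # which wave this instance belongs to, assigned at deploy time

def reload_config() -> None:
    print("reloading configuration")   # stand-in for the real action

def handle_command(command: dict) -> None:
    # Ignore commands aimed at other waves.
    if command.get("wave") not in (None, WAVE):
        return
    # A random bit of delay keeps every instance from acting at the same
    # instant and dogpiling a shared resource.
    time.sleep(random.uniform(0.0, 30.0))
    if command.get("name") == "reload_config":
        reload_config()
```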
64%
GUIs make terrible administrative interfaces for long-term production operation. The best interface for long-term operation is the command line. Given a command line, operators can easily build a scaffolding of scripts, logging, and automated actions to keep your software happy.
64%
Start with visibility. Use logging, tracing, and metrics to create transparency. Collect and index logs to look for general patterns. That also gets logs off of the machines for postmortem analysis when a machine or instance fails. Use configuration, provisioning, and deployment services to gain leverage over larger or more dynamic systems.
67%
In first-party authentication, the authority (us) keeps a database of credentials. The principal (the caller who claims to have an identity) provides credentials that the authority checks against its database. If the credentials match, the authority accepts that identity for the principal.
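The essence of that exchange, sketched with a salted password hash; a plain dict stands in for the authority's credential database:

```python
import hashlib
import hmac
import os

# The authority's credential database: username -> (salt, PBKDF2 hash).
def make_record(password: str) -> tuple:
    salt = os.urandom(16)
    return salt, hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

credentials = {"alice": make_record("correct horse battery staple")}

def authenticate(principal: str, password: str) -> bool:
    # The principal presents credentials; the authority checks them against
    # its own records and accepts or rejects the claimed identity.
    record = credentials.get(principal)
    if record is None:
        return False
    salt, expected = record
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, expected)
```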
69%
Suppose your service responds with a “404 Not Found” when a caller requests a resource that doesn’t exist, but responds with a “403 Forbidden” for a resource that exists but that the caller isn’t authorized to access. That means your service leaks information about which resources exist.
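One common way to close that leak, sketched as a request handler that returns 404 for both missing and unauthorized resources; the store and authorization check are hypothetical:

```python
# Hypothetical data store and authorization check.
store = {"invoice-42": "amount due: 100.00"}

def is_authorized(caller: str, resource_id: str) -> bool:
    return caller == "alice"

def get_resource(resource_id: str, caller: str) -> tuple:
    resource = store.get(resource_id)
    if resource is None or not is_authorized(caller, resource_id):
        # Same answer for "missing" and "forbidden", so callers can't probe
        # which resources exist.
        return 404, "Not Found"
    return 200, resource
```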
74%
Version 1.0 is the beginning of the system’s life. That means we shouldn’t plan for one or a few deployments to production, but for many upon many.
74%
Treat deployment as a feature.
75%
Between the time a developer commits code to the repository and the time it runs in production, code is a pure liability. Undeployed code is unfinished inventory. It has unknown bugs. It may break scaling or cause production downtime. It might be a great implementation of a feature nobody wants. Until you push it to production, you can’t be sure. The idea of continuous deployment is to reduce that delay as much as possible to minimize the liability of undeployed code.
75%
A bigger deployment with more change is definitely riskier. When those risks materialize, the most natural reaction is to add review steps as a way to mitigate future risks. But that will lengthen the commit-production delay, which increases risk even further!