Kindle Notes & Highlights
Read between February 17 and February 27, 2023
We don’t want our instance binaries to change per environment, but we do want their properties to change. That means the code should look outside the deployment directory to find per-environment configurations.
That’s not to say you should keep configurations out of version control altogether. Just keep them in a different repository than the source code. Lock it down to only the people who should have access,
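A minimal sketch of that idea, assuming a service that reads its per-environment settings from a directory named by an environment variable; APP_CONFIG_DIR, /etc/myapp, and settings.json are illustrative names, not anything prescribed by the book:

```python
import json
import os
from pathlib import Path

# The deployable artifact never changes per environment; only this
# externally supplied directory does. Names and defaults are illustrative.
CONFIG_DIR = Path(os.environ.get("APP_CONFIG_DIR", "/etc/myapp"))

def load_config(name: str = "settings.json") -> dict:
    """Read per-environment settings from outside the install directory."""
    path = CONFIG_DIR / name
    with path.open() as f:
        return json.load(f)

if __name__ == "__main__":
    settings = load_config()
    print(settings.get("database_url"))
```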
system-level view will provide historical analysis, present state, instantaneous behavior, and future projections. The job of an individual instance is to reveal enough data to enable those perspectives.
Transparency arises from deliberate design and architecture. “Adding transparency” late in development is about as effective as “adding quality.”
The monitoring and reporting systems should be like an exoskeleton built around your system, not woven into it. In particular, decisions about what metrics should trigger alerts, where to set the thresholds, and how to “roll up” state variables into an overall system health status should all be left outside of the instance itself. These are policy decisions that will change at a very different rate than the application code will.
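One way to keep that separation, sketched below: the instance publishes raw measurements only, while alert thresholds live in plain data that operations own and can change without a redeploy. The metric names and numbers here are illustrative.

```python
import json

# Inside the instance: publish raw facts only. No thresholds and no
# judgments about "healthy" vs. "unhealthy" are encoded here.
def current_metrics(pool_in_use: int, pool_max: int, p99_latency_ms: float) -> dict:
    return {
        "connection_pool.in_use": pool_in_use,
        "connection_pool.max": pool_max,
        "latency.p99_ms": p99_latency_ms,
    }

# Outside the instance: alerting policy is plain data the monitoring
# system loads, so operations can retune it without touching app code.
ALERT_POLICY = json.loads("""
{
  "connection_pool.in_use": {"warn_at": 40, "page_at": 48},
  "latency.p99_ms": {"warn_at": 500, "page_at": 2000}
}
""")

if __name__ == "__main__":
    print(current_metrics(42, 50, 180.0))
    print(ALERT_POLICY)
```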
Despite what all those application templates create for us, a logs directory under the application’s install directory is the wrong way to go. Log files can be large. They grow rapidly and consume lots of I/O. For physical machines, it’s a good idea to keep them on a separate drive. That lets the machine use more I/O bandwidth in parallel and reduces contention for the busy drives.
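A sketch of pointing logs somewhere outside the install directory, assuming the destination is supplied by the environment; APP_LOG_DIR and /var/log/myapp are illustrative defaults (a development machine would override them):

```python
import logging
import os
from logging.handlers import RotatingFileHandler
from pathlib import Path

# The log destination comes from the environment or a system-wide default,
# never from a "logs" directory under the application's install path.
LOG_DIR = Path(os.environ.get("APP_LOG_DIR", "/var/log/myapp"))
LOG_DIR.mkdir(parents=True, exist_ok=True)

handler = RotatingFileHandler(LOG_DIR / "service.log",
                              maxBytes=50 * 1024 * 1024, backupCount=10)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("myapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```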
Most developers implement logging as though they are the primary consumer of the log files. In fact, administrators and engineers in operations will spend far more time with these log files than developers will. Logging should be aimed at production operations rather than development or testing.
anything logged at level “ERROR” or “SEVERE” should be something that requires action on the part of operations. Not every exception needs to be logged as an error. Just because a user entered a bad credit card number and the validation component threw an exception doesn’t mean anything has to be done about it. Log errors in business logic or user input as warnings (if at all).
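A hedged example of that policy in code, with hypothetical names (charge, send_to_gateway): failed user input is logged as a warning, while a condition operations must act on earns an error.

```python
import logging

logger = logging.getLogger("payments")

class InvalidCardNumber(Exception):
    pass

def validate(card_number: str) -> None:
    if not card_number.isdigit():
        raise InvalidCardNumber()

def charge(card_number: str) -> None:
    try:
        validate(card_number)
    except InvalidCardNumber:
        # Bad user input: nothing for operations to act on, so a warning.
        logger.warning("card validation failed; request rejected")
        return
    try:
        send_to_gateway(card_number)
    except ConnectionError:
        # Operations must act on this, so it is logged as an error.
        logger.error("payment gateway unreachable")
        raise

def send_to_gateway(card_number: str) -> None:
    pass  # stand-in for the real payment gateway client
```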
It’s easy to leave debug messages turned on in production. All it takes is one wrong commit with debug levels enabled. I recommend adding a step to your build process that automatically removes any configs that enable debug or trace log levels.
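A sketch of such a build-time check, assuming log levels appear as plain key-value text in files under a config/ directory; the path and pattern are assumptions, not a standard:

```python
import re
import sys
from pathlib import Path

# Fail the build if any shipped configuration enables debug or trace logging.
FORBIDDEN = re.compile(r"\b(level\s*[:=]\s*)(debug|trace)\b", re.IGNORECASE)

def check(config_dir: str = "config") -> int:
    offenders = [
        f"{path}: {line.strip()}"
        for path in Path(config_dir).rglob("*")
        if path.is_file()
        for line in path.read_text(errors="ignore").splitlines()
        if FORBIDDEN.search(line)
    ]
    for offender in offenders:
        print(f"debug/trace logging enabled: {offender}", file=sys.stderr)
    return 1 if offenders else 0

if __name__ == "__main__":
    sys.exit(check())
```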
The instance itself won’t be able to tell much about overall system health, but it should emit metrics that can be collected, analyzed, and visualized centrally. This may be as simple as periodically spitting a line of stats into a log file.
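For example, the "line of stats" approach might look like this sketch, with placeholder counter values standing in for live ones:

```python
import json
import logging
import threading
import time

logging.basicConfig(level=logging.INFO)
stats_log = logging.getLogger("stats")

def emit_stats_forever(interval_seconds: int = 60) -> None:
    """Periodically write one machine-parseable line of instance metrics."""
    while True:
        stats_log.info(json.dumps({
            "ts": time.time(),
            "requests_in_flight": 3,       # placeholder values; a real
            "connection_pool_in_use": 12,  # instance would read live counters
        }))
        time.sleep(interval_seconds)

# Run in a daemon thread so it never blocks shutdown of the instance.
threading.Thread(target=emit_stats_forever, daemon=True).start()
```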
Some developers from Netflix have quipped that Netflix is a monitoring system that streams movies as a side effect.
Health checks should be more than just “yup, it’s running.” They should report at least the following: the host IP address or addresses; the version number of the runtime or interpreter (Ruby, Python, JVM, .NET, Go, and so on); the application version or commit ID; whether the instance is accepting work; and the status of connection pools, caches, and circuit breakers.
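A sketch of a health-check response carrying those fields, using only the standard library; the /health path, port, and field names are illustrative:

```python
import json
import platform
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

APP_VERSION = "1.4.2"      # would come from the build, e.g. a commit ID
ACCEPTING_WORK = True      # flipped off while draining or shutting down

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps({
            "host_ip": socket.gethostbyname(socket.gethostname()),
            "runtime": f"python {platform.python_version()}",
            "app_version": APP_VERSION,
            "accepting_work": ACCEPTING_WORK,
            "connection_pool": {"in_use": 7, "max": 50},   # placeholders
            "circuit_breakers": {"billing": "closed"},
        }).encode()
        self.send_response(200 if ACCEPTING_WORK else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Health).serve_forever()
```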
When using DNS, it’s important to have a logical service name to call, rather than a physical hostname. Even if that logical name is just an alias to the underlying host, it’s still preferable. An alias only needs to be changed in one place (the name server’s database) rather than in every consuming application.
The main emphasis for DNS servers should be diversity. Don’t host them on the same infrastructure as your production systems. Make sure you have more than one DNS provider with servers in different locations. Use a different DNS provider still for your public status page. Make sure there are no failure scenarios that leave you without at least one functioning DNS server.
A normal proxy multiplexes many outgoing calls into a single source IP address. A reverse proxy server does the opposite: it demultiplexes calls coming into a single IP address and fans them out to multiple addresses.
Load balancers can also attempt to direct repeated requests to the same instance. This helps when the service holds state, such as user session state in an application server. Sending a caller’s requests back to the same instance gives better response time, because the necessary resources are already in that instance’s memory. The downside of sticky sessions is that they can keep load from being distributed evenly across machines.
Health checks are a vital part of load balancer configuration. Good health checks ensure that requests can succeed, not just that the service is listening to a socket.
Services can measure their own response time to help with this. They can also check their own operational state to see if requests will be answered in a timely fashion. For instance, monitoring the degree of contention for a connection pool allows a service to estimate wait times. Likewise, a service can check response times on its own dependencies. If those dependencies are too slow and are required, then the health check should show that this service is unavailable. This provides back pressure through service tiers.
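A rough sketch of that kind of deeper health check, with made-up thresholds: pool contention or a slow required dependency marks the instance unavailable so the load balancer backs off.

```python
import time

POOL_MAX = 50
SLOW_DEPENDENCY_MS = 500   # illustrative threshold

def dependency_latency_ms() -> float:
    """Measure a required downstream dependency (stubbed out here)."""
    start = time.monotonic()
    # ... call the dependency's health endpoint or run a cheap probe ...
    return (time.monotonic() - start) * 1000

def healthy(pool_in_use: int, pool_waiters: int) -> bool:
    # High pool contention means new requests will queue; report unhealthy
    # so the load balancer stops sending work (back pressure).
    if pool_waiters > 0 or pool_in_use >= POOL_MAX:
        return False
    # A required dependency that is too slow makes this service effectively
    # unavailable, even though its own process is up.
    return dependency_latency_ms() < SLOW_DEPENDENCY_MS
```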
Reject work as close to the edge as possible. The further it penetrates into your system, the more resources it ties up.
Provide health checks that allow load balancers to protect your application code.
Start rejecting work when your response time is going t...
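One possible way to reject work at the edge, sketched with an arbitrary threshold: track recent response times and answer with a fast 503 before a new request ties up resources deeper in the system.

```python
from collections import deque

RECENT = deque(maxlen=100)     # last 100 response times, in seconds
SHED_ABOVE_SECONDS = 1.0       # illustrative target

def record(response_time: float) -> None:
    RECENT.append(response_time)

def should_shed() -> bool:
    """Refuse new work while recent response times are running too high."""
    if len(RECENT) < RECENT.maxlen:
        return False
    return sum(RECENT) / len(RECENT) > SHED_ABOVE_SECONDS

def handle(request) -> tuple[int, str]:
    if should_shed():
        # Reject at the edge: cheap for us, and an honest signal to callers.
        return 503, "shedding load"
    return 200, "ok"
```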
“Service discovery” really has two parts. First, it’s a way for instances of a service to announce themselves and begin receiving load. This replaces statically configured load balancer pools with dynamic pools.
The second part is lookup. A caller needs to know at least one IP address to contact for a particular service. The lookup process can appear to be a simple DNS resolution for the caller, even if some super-dynamic service-aware server is supplying the DNS service.
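A sketch of both halves under those assumptions; the registration payload and the logical name billing.service.internal are hypothetical, and lookup here is just ordinary DNS resolution:

```python
import socket

# Announce: the payload an instance would send to whatever discovery
# service is in use (Consul, Eureka, a DNS-based registry, ...).
def announce(service: str, port: int) -> dict:
    return {
        "service": service,
        "address": socket.gethostbyname(socket.gethostname()),
        "port": port,
    }

# Lookup: to the caller this looks like plain DNS resolution of a logical
# service name, even if a dynamic, service-aware server is answering.
def lookup(logical_name: str = "billing.service.internal") -> list[str]:
    return [info[4][0] for info in socket.getaddrinfo(logical_name, 443)]
```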
The words “human error” don’t appear anywhere. It’s hard to overstate the importance of that. This is not a case of humans failing the system. It’s a case of the system failing humans. The administrative tools and playbooks allowed this error to happen. They amplified a minor error into enormous consequences. We must regard this as a system failure.
Have postmortems for successful changes. See what variations or anomalies happened. Find out what the “near misses” were.
If your image of a dev server is a fresh virtual machine with a known configuration, that’s great! Maybe your image of QA is a whole environment stamped out by the same automation tools that deploy to production, with an anonymized sample of production data from within the last week. If so, you’re doing quite well.
There’s no “right number” of QA environments. Virtualize them so every team can create its own on-demand QA environment.
At scale, “partially broken” is the normal state of operation.
Traffic indicators: Page requests, page requests total, transaction counts, concurrent sessions
Business transaction, for each type: Number processed, number aborted, dollar value, transaction aging, conversion rate, completion rate
Users: Demographics or classification, technographics, percentage of users who are registered, number of users, usage patterns, errors encountered...
Resource pool health (see the sketch after this list): Enabled state, total resources (as applied to connection pools, worker thread pools, and any other resource pools), resources checked out, high-water mark, number of resources created, number of resources destroyed, number of times checked out, number of threads b...
Database connection health: Number of SQLExceptions thrown, number of queries, average...
Data consumption: Number of entities or rows present, footprint i...
Integration point health: State of circuit breaker, number of timeouts, number of requests, average response time, number of good responses, number of network errors, number of protocol errors, number of application errors, actual IP address of the remote endpoint, cu...
Cache health: Items in cache, memory used by cache, cache hit rate, items flushed by garbage collector, configured uppe...
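As a concrete illustration of the "resource pool health" item above, a per-pool snapshot might carry fields like these; the names and values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class ResourcePoolStats:
    """Point-in-time snapshot of one resource pool's health metrics."""
    enabled: bool
    total_resources: int
    checked_out: int
    high_water_mark: int
    resources_created: int
    resources_destroyed: int
    times_checked_out: int
    threads_blocked_waiting: int

pool = ResourcePoolStats(True, 50, 12, 37, 63, 13, 48_112, 0)
```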
In many organizations deployment is ridiculously painful, so it’s a good place to start making life better.
The build pipeline should tag the build as it passes various stages, especially verification steps like unit or integration tests.
Live control is only necessary if it takes your instances a long time to be ready to run.
In those cases, you need to look at ways to send control signals to running instances. Here is a brief checklist of controls to plan for: Reset circuit breakers. Adjust connection pool sizes and timeouts. Disable specific outbound integrations. Reload configuration. Start or stop accepting load. Feature toggles.
it’s time to build a “command queue.” This is a shared message queue or pub/sub bus that all the instances can listen to. The admin tool sends out a command that the instances then perform.
With a command queue, it’s even easier to create a dogpile. It’s often a good idea to have each instance add a random bit of delay to spread them out a bit. It can also help to identify “waves” or “gangs” of instances. So a command may target “wave 1,” followed by “wave 2” and “wave 3” a few minutes later.
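A sketch of such a command-queue listener, assuming some pub/sub client delivers raw messages to a callback; the command names echo the earlier checklist, and the wave field and five-second jitter are illustrative:

```python
import json
import random
import time

WAVE = 2   # which deployment "wave" this instance belongs to

HANDLERS = {
    "reset_circuit_breakers": lambda: print("breakers reset"),
    "reload_config":          lambda: print("config reloaded"),
    "stop_accepting_load":    lambda: print("draining"),
}

def on_command(raw: bytes) -> None:
    """Callback for messages arriving on a shared command topic."""
    command = json.loads(raw)
    # Only act if this command targets our wave (or targets everyone).
    if command.get("wave") not in (None, WAVE):
        return
    # Random delay so the whole fleet doesn't act at the same instant.
    time.sleep(random.uniform(0, 5))
    HANDLERS[command["name"]]()

# Example message an admin tool might publish:
# {"name": "reload_config", "wave": 2}
```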
GUIs make terrible administrative interfaces for long-term production operation. The best interface for long-term operation is the command line. Given a command line, operators can easily build a scaffolding of scripts, logging, and automated actions to keep your software happy.
Start with visibility. Use logging, tracing, and metrics to create transparency. Collect and index logs to look for general patterns. That also gets logs off of the machines for postmortem analysis when a machine or instance fails. Use configuration, provisioning, and deployment services to gain leverage over larger or more dynamic systems.
In first-party authentication, the authority (us) keeps a database of credentials. The principal (the caller who claims to have an identity) provides credentials that the authority checks against its database. If the credentials match, the authority accepts that identity for the principal.
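A minimal sketch of first-party authentication, with the credential "database" reduced to an in-memory dict: the authority stores a salt and password hash per principal and checks presented credentials against them.

```python
import hashlib
import hmac
import os

# The authority's credential store: per-principal salt and password hash.
# In a real system this lives in durable storage, not a dict.
CREDENTIALS: dict[str, tuple[bytes, bytes]] = {}

def _hash(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)

def register(principal: str, password: str) -> None:
    salt = os.urandom(16)
    CREDENTIALS[principal] = (salt, _hash(password, salt))

def authenticate(principal: str, password: str) -> bool:
    """Check the presented credentials against the authority's own records."""
    if principal not in CREDENTIALS:
        return False
    salt, stored = CREDENTIALS[principal]
    # Constant-time comparison to avoid leaking information via timing.
    return hmac.compare_digest(stored, _hash(password, salt))

register("alice", "correct horse battery staple")
assert authenticate("alice", "correct horse battery staple")
assert not authenticate("alice", "wrong password")
```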
Suppose your service responds with a “404 Not Found” when a caller requests a resource that doesn’t exist, but with a “403 Forbidden” for a resource that exists and the caller isn’t authorized to see. Your service then leaks information about which resources exist and which don’t.
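One common mitigation, sketched here with in-memory stand-ins for the data store and access-control list, is to answer 404 in both cases so an unauthorized caller can't tell whether the resource exists:

```python
def get_resource(resource_id: str, caller: str, db: dict, acl: dict) -> tuple[int, object]:
    """Return the same 404 whether the resource is missing or merely forbidden,
    so unauthorized callers can't probe which resources exist."""
    resource = db.get(resource_id)
    allowed = caller in acl.get(resource_id, set())
    if resource is None or not allowed:
        return 404, None
    return 200, resource

# Both cases look identical to an unauthorized caller:
db = {"invoice-17": {"total": 99}}
acl = {"invoice-17": {"alice"}}
assert get_resource("invoice-17", "mallory", db, acl) == (404, None)
assert get_resource("invoice-99", "mallory", db, acl) == (404, None)
```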
version 1.0 is the beginning of the system’s life. That means we shouldn’t plan for just one or a few deployments to production, but for many, many more.
treat deployment as a feature.
Between the time a developer commits code to the repository and the time it runs in production, code is a pure liability. Undeployed code is unfinished inventory. It has unknown bugs. It may break scaling or cause production downtime. It might be a great implementation of a feature nobody wants. Until you push it to production, you can’t be sure. The idea of continuous deployment is to reduce that delay as much as possible to minimize the liability of undeployed code.
A bigger deployment with more change is definitely riskier. When those risks materialize, the most natural reaction is to add review steps as a way to mitigate future risks. But that lengthens the commit-to-production delay, which increases risk even further!

