Kindle Notes & Highlights
Start by thinking about (or finding out!) what your users care about, not what you can measure. Often, what your users care about is difficult or impossible to measure, so you’ll end up approximating users’ needs in some way.
SLOs should specify how they’re measured and the conditions under which they’re valid.
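To make this concrete, here is a minimal sketch, not from the book, of an SLI measured as "fraction of requests served successfully and under a latency threshold," compared against an SLO target. The request data, the 300 ms threshold, and the measurement window are all hypothetical.

```python
# Hypothetical sketch: a latency SLI over a measurement window, compared
# against an SLO target. Names and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request returned a successful response

def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests that succeeded and finished under threshold_ms."""
    if not requests:
        return 1.0  # no traffic in the window; treat the SLI as met
    good = sum(1 for r in requests if r.ok and r.latency_ms < threshold_ms)
    return good / len(requests)

# Hypothetical SLO: "99% of requests, measured at the load balancer over a
# trailing window, complete successfully in under 300 ms."
SLO_TARGET = 0.99
window = [Request(120, True), Request(450, True), Request(80, False)]
print(latency_sli(window, threshold_ms=300) >= SLO_TARGET)
```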
Don’t pick a target based on current performance
Keep it simple: complicated aggregations in SLIs can obscure changes to system performance.
Have as few SLOs as possible
Users build on the reality of what you offer, rather than what you say you’ll supply, particularly for infrastructure services. If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally.
If a human operator needs to touch your system during normal operations, you have a bug.
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil.
If your service remains in the same state after you have finished a task, the task was probably toil.
Engineering work is novel and intrinsically requires human judgment. It produces a permanent improvement in your service, and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem — the more generalized, the better.
Toil doesn’t make everyone unhappy all the time, especially in small amounts. Predictable and repetitive tasks can be quite calming. They produce a sense of accomplishment and quick wins. They can be low-risk and low-stress activities. Some people gravitate toward tasks involving toil and may even enjoy that type of work.
If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil, sometimes shifting operational tasks that should rightfully be performed by Devs to SRE.
White-box monitoring: Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics.
Node and machine: Used interchangeably to indicate a single instance of a running kernel in either a physical server, virtual machine, or container.
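As one illustration of the "HTTP handler that emits internal statistics" mentioned in the white-box monitoring definition above, here is a minimal sketch using Python's standard library. The /varz path and the counters are hypothetical, not a specific Google interface.

```python
# Hypothetical sketch of a white-box stats endpoint: the process exposes its
# own internal counters over HTTP for a monitoring system to scrape.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STATS = {"http_requests_total": 0, "errors_total": 0}  # illustrative counters

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/varz":  # hypothetical stats path
            body = json.dumps(STATS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            STATS["http_requests_total"] += 1  # normal traffic updates counters
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), StatsHandler).serve_forever()
```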
Monitoring a complex application is a significant engineering endeavor in and of itself.
We avoid “magic” systems that try to learn thresholds or automatically detect causality.
Few teams at Google maintain complex dependency hierarchies because our infrastructure has a steady rate of continuous refactoring.
When collecting telemetry for debugging, white-box monitoring is essential. If web servers seem slow on database-heavy requests, you need to know both how fast the web server perceives the database to be, and how fast the database believes itself to be. Otherwise, you can’t distinguish an actually slow database server from a network problem between your web server and your database.
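A sketch of what instrumenting both sides might look like: the client times its own call while the database reports its own execution time, and a large gap between the two implicates the network or client stack rather than the database. The function shapes and metric names here are hypothetical.

```python
# Hypothetical sketch: record both the latency the web server perceives for a
# database call and the execution time the database reports for that call.
import time

metrics: dict[str, list[float]] = {}

def record(name: str, value: float) -> None:
    metrics.setdefault(name, []).append(value)

def timed_query(execute, server_elapsed):
    """execute() runs the query; server_elapsed() returns the database's own
    measurement of time spent on that query (both hypothetical callables)."""
    start = time.monotonic()
    result = execute()
    record("db.client_latency_s", time.monotonic() - start)
    record("db.server_latency_s", server_elapsed())
    return result

# Illustrative use with stand-ins for a real client and a real server stat:
timed_query(lambda: "rows", lambda: 0.004)
print(metrics)
```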
The four golden signals of monitoring are latency, traffic, errors, and saturation.
The simplest way to differentiate between a slow average and a very slow “tail” of requests is to collect request counts bucketed by latencies (suitable for rendering a histogram), rather than actual latencies: how many requests did I serve that took between 0 ms and 10 ms, between 10 ms and 30 ms, between 30 ms and 100 ms, between 100 ms and 300 ms, and so on? Distributing the histogram boundaries approximately exponentially (in this case by factors of roughly 3) is often an easy way to visualize the distribution of your requests.
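A minimal sketch of that bucketing scheme, with upper bounds spaced by roughly a factor of 3; the exact bucket edges and the Counter-based storage are illustrative.

```python
# Sketch: count requests into latency buckets with roughly exponential
# (factor-of-~3) boundaries, rather than storing individual latencies.
import bisect
from collections import Counter

BOUNDS_MS = [10, 30, 100, 300, 1000, 3000]  # upper edges; last bucket is "3000+"

def bucket_label(latency_ms: float) -> str:
    i = bisect.bisect_right(BOUNDS_MS, latency_ms)
    if i == len(BOUNDS_MS):
        return f">{BOUNDS_MS[-1]}ms"
    lower = 0 if i == 0 else BOUNDS_MS[i - 1]
    return f"{lower}-{BOUNDS_MS[i]}ms"

histogram = Counter()
for latency in (4, 12, 27, 95, 240, 1800, 7000):  # sample latencies in ms
    histogram[bucket_label(latency)] += 1

print(histogram)  # e.g. Counter({'10-30ms': 2, '0-10ms': 1, ...})
```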
Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal. Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
Often, sheer force of effort can help a rickety system achieve high availability, but this path is usually short-lived and fraught with burnout and dependence on a small number of heroic team members. Taking a controlled, short-term decrease in availability is often a painful, but strategic trade for the long-run stability of the system.
It focuses primarily on symptoms for paging, reserving cause-oriented heuristics to serve as aids to debugging problems.
For SRE, automation is a force multiplier, not a panacea. Of course, just multiplying force does not naturally change the accuracy of where that force is applied: doing automation thoughtlessly can create as many problems as it solves.
A platform also centralizes mistakes. In other words, a bug fixed in the code will be fixed there once and forever.
It’s easy to overlook the fact that once you have encapsulated some task in automation, anyone can execute the task.
For truly large services, the factors of consistency, quickness, and reliability dominate most conversations about the trade-offs of performing automation.
Automation is “meta-software” — software to act on software.
Automation code, like unit test code, dies when the maintaining team isn’t obsessive about keeping the code in sync with the codebase it covers.
The most functional tools are usually written by those who use them.
Our evolution of turnup automation followed a path:
1. Operator-triggered manual action (no automation)
2. Operator-written, system-specific automation
3. Externally maintained generic automation
4. Internally maintained, system-specific automation
5. Autonomous systems that need no human intervention
Inevitably, then, a situation arises in which the automation fails, and the humans are now unable to successfully operate the system. The fluidity of their reactions has been lost due to lack of practice, and their mental models of what the system should be doing no longer reflect the reality of what it is doing.
We have embraced the philosophy that frequent releases result in fewer changes between versions. This approach makes testing and troubleshooting easier. Some teams perform hourly builds and then select the version to actually deploy to production from the resulting pool of builds. Selection is based upon the test results and the features contained in a given build. Other teams have adopted a “Push on Green” release model and deploy every build that passes all tests.
Our builds are hermetic, meaning that they are insensitive to the libraries and other software installed on the build machine. Instead, builds depend on known versions of build tools, such as compilers, and dependencies, such as libraries. The build process is self-contained and must not rely on services that are external to the build environment.
Our goal is to fit the deployment process to the risk profile of a given service. In development or pre-production environments, we may build hourly and push releases automatically when all tests pass. For large user-facing services, we may push by starting in one cluster and expand exponentially until all clusters are updated. For sensitive pieces of infrastructure, we may extend the rollout over several days, interleaving them across instances in different geographic regions.
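A sketch of the "start in one cluster and expand exponentially" pattern; the cluster names and the doubling factor are illustrative, and a real rollout would gate each wave on health checks and be ready to roll back.

```python
# Hypothetical sketch of an exponential rollout: push to 1 cluster, then 2,
# then 4, and so on, until every cluster runs the new release.
def rollout_waves(clusters: list[str], factor: int = 2) -> list[list[str]]:
    waves, i, size = [], 0, 1
    while i < len(clusters):
        waves.append(clusters[i:i + size])
        i += size
        size *= factor
    return waves

clusters = [f"cluster-{n}" for n in range(10)]  # illustrative names
for wave in rollout_waves(clusters):
    print("pushing to:", wave)  # waves of 1, 2, 4, then the remainder
```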
Teams should budget for release engineering resources at the beginning of the product development cycle. It’s cheaper to put good practices and process in place early, rather than have to retrofit your system later.
Software systems are inherently dynamic and unstable. A software system can only be perfectly stable if it exists in a vacuum.
A good summary of the SRE approach to managing systems is: “At the end of the day, our job is to keep agility and stability in balance in the system.”
In fact, SRE’s experience has found that reliable processes tend to actually increase developer agility: rapid, reliable production rollouts make changes in production easier to see. As a result, once a bug surfaces, it takes less time to find and fix that bug.
In software, less is more! A small, simple API is usually also a hallmark of a well-understood problem.
The example also uses a Google convention that helps readability. Each computed variable name contains a colon-separated triplet indicating the aggregation level, the variable name, and the operation that created that name. A name such as task:http_requests:rate10m, for instance, would read as the http_requests variable aggregated at the task level by a 10-minute rate operation.
Borgmon configuration separates the definition of the rules from the targets being monitored.
In extreme cases, SRE teams may have the option to “give back the pager” — SRE can ask the developer team to be exclusively on-call for the system until it meets the standards of the SRE team in question. Giving back the pager doesn’t happen very frequently, because it’s almost always possible to work with the developer team to reduce the operational load and make a given system more reliable.
Ways in which things go right are special cases of the ways in which things go wrong. (John Allspaw)
Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct! Instead, your course of action should be to make the system work as well as it can under the circumstances.
Text logs are very helpful for reactive debugging in real time, while storing logs in a structured binary format can make it possible to build tools to conduct retrospective analysis with much more information. It’s really useful to have multiple verbosity levels available, along with a way to increase these levels on the fly.
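A minimal sketch of "multiple verbosity levels with a way to increase them on the fly," using Python's standard logging module; the signal-based toggle is just one possible mechanism, not something the book prescribes.

```python
# Sketch: multiple verbosity levels plus a way to raise them at runtime.
# Here, sending SIGUSR1 to the process toggles DEBUG logging on and off.
import logging
import signal

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("myservice")  # hypothetical logger name

def toggle_debug(signum, frame):
    new_level = (logging.DEBUG if log.getEffectiveLevel() > logging.DEBUG
                 else logging.INFO)
    log.setLevel(new_level)
    log.info("log level now %s", logging.getLevelName(new_level))

signal.signal(signal.SIGUSR1, toggle_debug)  # POSIX-only signal

log.debug("not shown while at INFO level")
log.info("request handled")
```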
Injecting known test data in order to check that the resulting output is expected (a form of black-box testing) at each step can be especially effective.
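A toy illustration of that technique: feed known inputs into each stage of a pipeline and compare the output against what you expect, to isolate the stage that misbehaves. The stages here are hypothetical stand-ins for real components.

```python
# Toy sketch: inject known test data into each pipeline stage and check the
# output. Stages are hypothetical stand-ins (e.g., parser, enrichment).
def parse(raw: str) -> dict:
    key, value = raw.split("=", 1)
    return {key: value}

def enrich(record: dict) -> dict:
    return {**record, "source": "test"}

stages = [
    ("parse", parse, "user=alice", {"user": "alice"}),
    ("enrich", enrich, {"user": "alice"}, {"user": "alice", "source": "test"}),
]

for name, fn, known_input, expected in stages:
    actual = fn(known_input)
    status = "ok" if actual == expected else f"MISMATCH: {actual!r}"
    print(f"{name}: {status}")
```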
Take clear notes of what ideas you had, which tests you ran, and the results you saw.