Kindle Notes & Highlights
Be skeptical of any design document, performance review, or essay that doesn’t mention failure. Such a document is potentially either too heavily filtered, or the author was not rigorous in his or her methods.
Building observability — with both white-box metrics and structured logs — into each component from the ground up. Designing systems with well-understood and observable interfaces between components.
Ensuring that information is available in a consistent way throughout a system — for instance, using a unique request identifier throughout the span of RPCs generated by various components — reduces the need to figure out which log entry on an upstream component matches a log entry on a downstream component, speeding the time to diagnosis and recovery.
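A minimal sketch of this idea in Python, assuming a hypothetical frontend/backend pair: a request ID is minted once at the edge and stamped onto every structured log line, so upstream and downstream entries can be joined by that one field during diagnosis.

    # Illustrative sketch (not from the book): one request ID rides through every
    # structured log entry so log lines from different components can be matched.
    # The component names and call chain are hypothetical.
    import json
    import logging
    import uuid

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("example")

    def log_event(component: str, request_id: str, message: str) -> None:
        """Emit one structured (JSON) log line tagged with the request ID."""
        log.info(json.dumps({"component": component,
                             "request_id": request_id,
                             "message": message}))

    def handle_backend(request_id: str) -> str:
        log_event("backend", request_id, "serving RPC")
        return "payload"

    def handle_frontend() -> str:
        request_id = uuid.uuid4().hex          # minted once, at the edge
        log_event("frontend", request_id, "received user request")
        result = handle_backend(request_id)    # the same ID is passed on every RPC
        log_event("frontend", request_id, "request complete")
        return result

    if __name__ == "__main__":
        handle_frontend()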
What to Do When Systems Break: First of all, don’t panic! You aren’t alone, and the sky isn’t falling.
History is about learning from everyone’s mistakes.
Until your system has actually failed, you don’t truly know how that system, its dependent systems, or your users will react.
She wasn’t in a position to think about the bigger picture of how to mitigate the problem because the technical task at hand was overwhelming.
make sure that everybody involved in the incident knows their role and doesn’t stray onto someone else’s turf.
The incident commander holds the high-level state about the incident. They structure the incident response task force, assigning responsibilities according to need and priority. De facto, the commander holds all positions that they have not delegated.
The Ops lead works with the incident commander to respond to the incident by applying operational tools to the task at hand. The operations team should be the only group modifying the system during an incident.
planning role supports Ops by dealing with longer-term issues, such as filing bugs, ordering dinner, arranging handoffs, and tracking how the system has diverged from the norm so it can be reverted once the incident is resolved.
if any of the following is true, the event is an incident: Do you need to involve a second team in fixing the problem? Is the outage visible to customers? Is the issue unsolved even after an hour’s concentrated analysis?
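A tiny sketch, purely illustrative, that encodes those three questions as an explicit predicate (the Event fields are hypothetical names) so the "is this an incident?" call is made the same way every time.

    # Minimal sketch (hypothetical helper, not from the book): the three questions
    # above expressed as a single predicate.
    from dataclasses import dataclass

    @dataclass
    class Event:
        needs_second_team: bool      # do we need to involve another team?
        customer_visible: bool       # is the outage visible to customers?
        analysis_hours: float        # concentrated analysis so far, in hours

    def is_incident(event: Event) -> bool:
        return (event.needs_second_team
                or event.customer_visible
                or event.analysis_hours >= 1.0)

    assert is_incident(Event(False, True, 0.2))        # customer-visible -> incident
    assert not is_incident(Event(False, False, 0.5))   # still a routine issue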
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
Writing a postmortem is not punishment — it is a learning opportunity for the entire company.
Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every “mistake” is seen as an opportunity to strengthen the system.
the ability to select zero or more outalations and include their subjects, tags, and “important” annotations in an email to the next on-call engineer (and an arbitrary cc list) in order to pass on recent state between shifts.
For each configuration file, a separate configuration test examines production to see how a particular binary is actually configured and reports discrepancies against that file.
Configuration tests are built and tested for a specific version of the checked-in configuration file. Comparing which version of the test is passing in relation to the goal version for automation implicitly indicates how far actual production currently lags behind ongoing engineering work.
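A rough sketch of what such a configuration test might look like, assuming a hypothetical fetch_live_config() hook that asks a running binary how it is actually configured (for example via a debug endpoint). The check diffs that answer against the checked-in file and fails loudly on drift.

    # Illustrative sketch, not a real tool: compare live configuration against the
    # checked-in configuration file and report any discrepancies.
    import json

    def load_checked_in_config(path: str) -> dict:
        with open(path) as f:
            return json.load(f)

    def config_drift(live: dict, checked_in: dict) -> dict:
        """Return keys whose live value differs from the checked-in value."""
        keys = set(live) | set(checked_in)
        return {k: (live.get(k), checked_in.get(k))
                for k in keys if live.get(k) != checked_in.get(k)}

    def check_config(fetch_live_config, path="service_config.json"):
        """Fail if production diverges from (or lags behind) the checked-in file."""
        drift = config_drift(fetch_live_config(), load_checked_in_config(path))
        assert not drift, f"production diverges from checked-in config: {drift}"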
To conduct a canary test, a subset of servers is upgraded to a new version or configuration and then left in an incubation period. Should no unexpected variances occur, the release continues and the rest of the servers are upgraded in a progressive fashion.
We commonly refer to the incubation period for the upgraded servers as “baking the binary.”
Conducting unit tests for every key function and class is a completely overwhelming prospect if the current test coverage is low or nonexistent. Instead, start with testing that delivers the most impact with the least effort.
It takes little effort to create a series of smoke tests to run for every release. This type of low-effort, high-impact first step can lead to highly tested, reliable software.
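As a sketch of how small such smoke tests can be: two checks that the binary starts and answers at all, assuming a hypothetical service running locally with a /healthz endpoint.

    # Minimal smoke-test sketch (service URL and endpoints are assumptions):
    # cheap checks run on every release before anything more thorough.
    import urllib.request

    SERVICE_URL = "http://localhost:8080"   # assumption: service under test runs locally

    def test_healthz_responds():
        with urllib.request.urlopen(f"{SERVICE_URL}/healthz", timeout=5) as resp:
            assert resp.status == 200

    def test_homepage_serves_html():
        with urllib.request.urlopen(f"{SERVICE_URL}/", timeout=5) as resp:
            assert resp.status == 200
            assert b"<html" in resp.read(4096).lower()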
If every bug is converted into a test, each test is supposed to initially fail because the bug hasn’t yet been fixed. As engineers fix the bugs, the software passes testing and you’re on the road to developing a comprehensive regression test suite.
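A sketch of the bug-to-test conversion, using an entirely hypothetical bug and function: the test is written first, fails against the buggy behaviour, and passes once the fix lands, growing the regression suite one bug at a time.

    # Illustrative only: hypothetical bug 1234 reported that an empty port string
    # crashed instead of falling back to the default. The guard below is the fix;
    # the first test would have failed before it was added.
    def parse_port(value: str) -> int:
        if not value.strip():
            return 8080          # fix: empty input falls back to the default port
        return int(value)

    def test_bug_1234_empty_port_uses_default():
        """Regression test for (hypothetical) bug 1234."""
        assert parse_port("") == 8080

    def test_existing_behaviour_still_works():
        assert parse_port("9000") == 9000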
Once source control is in place, you can add a continuous build system that builds the software and runs tests every time code is submitted. We’ve found it optimal if the build system notifies engineers the moment a change breaks a software project.
When the build is predictably solid and reliable, developers can iterate faster!
Bazel creates dependency graphs for software projects. When a change is made to a file, Bazel only rebuilds the part of the software that depends on that file.
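A toy sketch of the dependency-graph idea, not Bazel itself: given which target depends on which files and other targets (a made-up project below), a change to one file only marks the targets that transitively depend on it for rebuilding.

    # Hypothetical project: target -> direct dependencies (files or other targets).
    DEPS = {
        "util_lib":   ["util.cc"],
        "net_lib":    ["net.cc", "util_lib"],
        "server_bin": ["main.cc", "net_lib"],
        "tool_bin":   ["tool.cc", "util_lib"],
    }

    def targets_to_rebuild(changed_file: str) -> set:
        dirty = {changed_file}
        changed = True
        while changed:                      # propagate dirtiness up the graph
            changed = False
            for target, deps in DEPS.items():
                if target not in dirty and any(d in dirty for d in deps):
                    dirty.add(target)
                    changed = True
        return {t for t in dirty if t in DEPS}

    print(targets_to_rebuild("net.cc"))     # {'net_lib', 'server_bin'}; tool_bin untouched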
complexity (in case the patch creates a superlinear workload elsewhere).
If you let users try more versions of the software during the year, the MTBF suffers because there are more opportunities for user-visible breakage. However, you can also discover areas that would benefit from additional test coverage. If these tests are implemented, each improvement protects against some future failure. Careful reliability management combines the limits on uncertainty due to test coverage with the limits on user-visible faults in order to adjust the release cadence. This combination maximizes the knowledge that you gain from operations and end users. These gains drive test
Configuration files generally exist because changing the configuration is faster than rebuilding a tool.
A configuration file that changes more than once per user-facing application release (for example, because it holds release state) can be a major risk if these changes are not treated the same as application releases.
Testing is one of the most profitable investments engineers can make to improve the reliability of their product.
You can’t fix a problem until you understand it, and in engineering, you can only understand a problem by measuring it.
A standard rule of thumb is to start by having the release impact 0.1% of user traffic, and then scaling by orders of magnitude every 24 hours while varying the geographic location of servers being upgraded (then on day 2: 1%, day 3: 10%, day 4: 100%).
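A hedged sketch of that rule of thumb: start at 0.1% of traffic, scale by an order of magnitude each day, and stop the moment the canary looks unhealthy. set_traffic_fraction and canary_is_healthy are hypothetical hooks into whatever release tooling is in use.

    # Sketch of the progressive rollout schedule described above; not a real tool.
    import time

    ROLLOUT_STAGES = [0.001, 0.01, 0.10, 1.00]   # day 1: 0.1%, day 2: 1%, day 3: 10%, day 4: 100%
    BAKE_SECONDS = 24 * 60 * 60

    def progressive_rollout(set_traffic_fraction, canary_is_healthy, sleep=time.sleep):
        for fraction in ROLLOUT_STAGES:
            set_traffic_fraction(fraction)
            sleep(BAKE_SECONDS)                  # "baking the binary" at this stage
            if not canary_is_healthy():
                set_traffic_fraction(0.0)        # roll back on unexpected variance
                raise RuntimeError(f"canary failed at {fraction:.1%} of traffic")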
Spreadsheets suffer severely from scalability problems and have limited error-checking abilities. Data becomes stale, and tracking changes becomes difficult. Teams often are forced to make simplifying assumptions and reduce the complexity of their requirements, simply to render maintaining adequate capacity a tractable problem.
Intent-based Capacity Planning: The basic premise of this approach is to programmatically encode the dependencies and parameters (intent) of a service’s needs, and use that encoding to autogenerate an allocation plan that details which resources go to which service, in which cluster.
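A toy sketch of the intent-based idea, not Google's actual planner: services declare what they need, and a naive allocator turns that intent into a plan of which resources go to which service in which cluster. All cluster names, capacities, and service numbers are made up.

    # Illustrative only: intent in, allocation plan out.
    CLUSTER_CAPACITY_CPU = {"cluster-a": 600, "cluster-b": 400}

    SERVICE_INTENT = [
        # (service, total CPU needed, minimum number of clusters for redundancy)
        ("frontend", 300, 2),
        ("pipeline", 400, 1),
    ]

    def plan(capacity, intents):
        remaining = dict(capacity)
        allocation = []                              # (service, cluster, cpu)
        for service, cpu_needed, min_clusters in intents:
            per_cluster = cpu_needed / min_clusters
            clusters = sorted(remaining, key=remaining.get, reverse=True)[:min_clusters]
            for cluster in clusters:
                if remaining[cluster] < per_cluster:
                    raise RuntimeError(f"cannot satisfy intent for {service}")
                remaining[cluster] -= per_cluster
                allocation.append((service, cluster, per_cluster))
        return allocation

    for row in plan(CLUSTER_CAPACITY_CPU, SERVICE_INTENT):
        print(row)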
Don’t focus on perfection and purity of solution, especially if the bounds of the problem aren’t well known. Launch and iterate.
the old motto of “launch and iterate” is particularly relevant in SRE software development projects. Don’t wait for the perfect design; rather, keep the overall vision in mind while moving ahead with design and development. When you encounter areas of uncertainty, design the software to be flexible enough so that if process or strategy changes at a higher level, you don’t incur a huge rework cost. But at the same time, stay grounded by making sure that general solutions have a real-world–specific implementation that demonstrates the utility of the design.
agnosticism — writing the software to be generalized to allow myriad data sources as input
The right time to engage specialists will, of course, vary from project to project. As a rough guideline, the project should be successfully off the ground and demonstrably successful, such that the skills of the current team would be significantly bolstered by the additional expertise.
the same red flags you might instinctively identify in any software project, such as software that touches many moving parts at once, or software design that requires an all-or-nothing approach that prevents iterative development.
there are few pictures bigger than the intricate inner workings of modern technical infrastructure.
Dedicated, uninterrupted project work time is essential to any software development effort. Dedicated project time is necessary to enable progress on a project, because it’s nearly impossible to write code — much less to concentrate on larger, more impactful projects — when you’re thrashing between several tasks in the course of an hour. Therefore, the ability to work on a software project without interrupts is often an attractive reason for engineers to begin working on a development project. Such time must be aggressively defended.
However, as your organization grows, this ad hoc approach won’t scale, instead resulting in largely functional, yet narrow or single-purpose, software solutions that can’t be shared, which inevitably lead to duplicated efforts and wasted time.
SREs are a skeptical lot (in fact, skepticism is a trait for which we specifically hire);
Reducing the number of ways to perform the same task allows the entire department to benefit from the skills any single team has developed,
it’s relatively common for an SRE to lack experience as part of a team that built and shipped a product to a set of users.
The unique hands-on production experience that SREs bring to developing tools can lead to innovative approaches to age-old problems,
The video upload stream is routed via a different path — perhaps to a link that is currently underutilized — to maximize the throughput at the expense of latency.
But on the local level, inside a given datacenter, we often assume that all machines within the building are equally distant to the user and connected to the same network. Therefore, optimal distribution of load focuses on optimal resource utilization and protecting a single server from overloading.
The first (and perhaps most intuitive) approach is to always prefer the least loaded backend. In theory, this approach should result in the best end-user experience because requests are always routed to the least busy machine. Unfortunately, this logic breaks down quickly in the case of stateful protocols, which must use the same backend for the duration of a request.
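A minimal sketch contrasting the two policies, with hypothetical backend names and counts: pure least-loaded selection versus a sticky choice that keeps a stateful session pinned to the backend that first served it, even if another backend later becomes less loaded.

    # Illustrative only, not a real load balancer.
    active_requests = {"backend-1": 4, "backend-2": 1, "backend-3": 7}
    session_affinity = {}                       # session_id -> backend

    def pick_least_loaded() -> str:
        return min(active_requests, key=active_requests.get)

    def pick_for_session(session_id: str) -> str:
        # Stateful protocols must keep using the backend chosen for the session.
        if session_id not in session_affinity:
            session_affinity[session_id] = pick_least_loaded()
        return session_affinity[session_id]

    print(pick_least_loaded())                  # backend-2
    print(pick_for_session("upload-42"))        # backend-2, and it stays backend-2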