Kindle Notes & Highlights
by Betsy Beyer
Read between September 12 and December 27, 2017
"Hope is not a strategy." (Traditional SRE saying)
Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension.
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development.
And once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support. For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
In practice, this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on.
SREs and product developers aim to spend the error budget getting maximum feature velocity.
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
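A minimal sketch of that idea (my own illustration, not from the book): software evaluates the symptoms and a human is paged only when immediate action is needed. The thresholds and labels are assumptions.

```python
# Sketch: software does the interpreting; humans are notified only to act.
# Thresholds are illustrative assumptions, not recommended values.

def evaluate_alert(error_rate: float, sustained_minutes: float) -> str:
    """Return 'page' only when a human must act now."""
    if error_rate > 0.05 and sustained_minutes > 15:
        return "page"    # immediate human action required
    if error_rate > 0.01:
        return "ticket"  # needs attention, but can wait for business hours
    return "none"        # software keeps watching; nobody is notified

print(evaluate_alert(error_rate=0.002, sustained_minutes=30))  # none
```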
When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”
SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish the following: implementing progressive rollouts, quickly and accurately detecting problems, and rolling back changes safely when problems arise.
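A hedged sketch of those three automated practices together. The stage percentages, health threshold, and callback names are illustrative assumptions, not Google's tooling.

```python
# Progressive rollout with automated problem detection and rollback (sketch).

ROLLOUT_STAGES = [1, 5, 20, 50, 100]  # percent of instances receiving the change

def progressive_rollout(deploy, observe_error_rate, rollback, threshold=0.001):
    for percent in ROLLOUT_STAGES:
        deploy(percent)                           # implement progressive rollouts
        if observe_error_rate(percent) > threshold:
            rollback()                            # roll back safely when problems arise
            return False                          # problem detected at this stage
    return True
```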
We define toil as mundane, repetitive operational work providing no enduring value, which scales linearly with service growth.
Managing service reliability is largely about managing risk, and managing risk can be costly. 100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take. An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders, and allow multiple teams to reach the same conclusion about production risk without rancor.
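A small worked example of the arithmetic behind an error budget (my own sketch; the SLO, window, and function names are assumptions).

```python
# Turning an availability SLO into an error budget and tracking how much is left.

def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed unavailability for a given availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over a 90-day quarter allows roughly 129.6 minutes of downtime.
print(error_budget_minutes(0.999, 90))     # ~129.6
print(budget_remaining(0.999, 90, 90.0))   # ~0.31 of the budget left
```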
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs).
Most services consider request latency — how long it takes to return a response to a request — as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second.
If there is no explicit consequence, then you are almost certainly looking at an SLO.
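To make the SLI/SLO distinction concrete, here is an illustrative sketch (not from the book) of an error-rate SLI measured against an assumed SLO target.

```python
# Error-rate SLI, checked against an SLO target. Numbers are assumptions.

def error_rate_sli(failed_requests: int, total_requests: int) -> float:
    """Error rate as a fraction of all requests received."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests

SLO_MAX_ERROR_RATE = 0.001  # assumed target: at most 0.1% of requests may fail

sli = error_rate_sli(failed_requests=42, total_requests=100_000)
print(f"SLI={sli:.5f}, SLO met: {sli <= SLO_MAX_ERROR_RATE}")  # SLI=0.00042, SLO met: True
```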
User-facing serving systems, such as the Shakespeare search frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?
User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values, on the grounds that if the 99.9th percentile behavior is good, then the typical experience is certainly going to be.
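A rough sketch of why tail percentiles matter more than the mean when variance is high; the sample data and nearest-rank percentile helper are made up for illustration.

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies_ms = [40] * 95 + [900] * 5          # mostly fast, with a slow tail

print(sum(latencies_ms) / len(latencies_ms))  # 83.0 -> the mean hides the tail
print(percentile(latencies_ms, 50))           # 40
print(percentile(latencies_ms, 99))           # 900 -> the tail users actually feel
```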
Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO. However, not all product attributes are amenable to SLOs: it’s hard to specify “user delight” with an SLO.
So what is toil? Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
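A minimal illustration of tracking the four signals for a single endpoint; this is my own sketch, not Google's monitoring implementation, and the field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class GoldenSignals:
    latencies_ms: list = field(default_factory=list)  # latency
    request_count: int = 0                            # traffic
    error_count: int = 0                              # errors
    utilization: float = 0.0                          # saturation (0.0-1.0)

    def record(self, latency_ms: float, is_error: bool, utilization: float):
        self.latencies_ms.append(latency_ms)
        self.request_count += 1
        self.error_count += int(is_error)
        self.utilization = utilization

signals = GoldenSignals()
signals.record(latency_ms=12.5, is_error=False, utilization=0.62)
```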
This kind of tension is common within a team, and often reflects an underlying mistrust of the team’s self-discipline: while some team members want to implement a “hack” to allow time for a proper fix, others worry that a hack will be forgotten or that the proper fix will be deprioritized indefinitely. This concern is credible, as it’s easy to build layers of unmaintainable technical debt by patching over problems instead of making real fixes.
Our builds are hermetic, meaning that they are insensitive to the libraries and other software installed on the build machine.
It sometimes makes sense to sacrifice stability for the sake of agility. I’ve often approached an unfamiliar problem domain by conducting what I call exploratory coding — setting an explicit shelf life for whatever code I write with the understanding that I’ll need to try and fail once in order to really understand the task I need to accomplish.
As Fred Brooks suggests in his “No Silver Bullet” essay [Bro95], it is very important to consider the difference between essential complexity and accidental complexity. Essential complexity is the complexity inherent in a given situation that cannot be removed from a problem definition, whereas accidental complexity is more fluid and can be resolved with engineering effort.
A smaller project is easier to understand, easier to test, and frequently has fewer defects.
"May the queries flow, and the pager stay silent." (Traditional SRE blessing)
We strongly believe that the “E” in “SRE” is a defining characteristic of our organization, so we strive to invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be spent on-call, leaving up to another 25% on other types of operational, nonproject work.
It’s important that on-call SREs understand that they can rely on several resources that make the experience of being on-call less daunting than it may seem. The most important on-call resources are: clear escalation paths, well-defined incident-management procedures, and a blameless postmortem culture ([Loo10], [All12]).
In this case, it is appropriate to work together with the application developers to set common goals to improve the system.
In extreme cases, SRE teams may have the option to “give back the pager” — SRE can ask the developer team to be exclusively on-call for the system until it meets the standards of the SRE team in question. Giving back the pager doesn’t happen very frequently, because it’s almost always possible to work with the developer team to reduce the operational load and make a given system more reliable.
"Ways in which things go right are special cases of the ways in which things go wrong." (John Allspaw)
As doctors are taught, “when you hear hoofbeats, think of horses not zebras.” Also remember that, all things being equal, we should prefer simpler explanations.
Novice pilots are taught that their first responsibility in an emergency is to fly the airplane [Gaw09]; troubleshooting is secondary to getting the plane and everyone on it safely onto the ground. This approach is also applicable to computer systems: for example, if a bug is leading to possibly unrecoverable data corruption, freezing the system to prevent further failure may be better than letting this behavior continue.
Publishing negative results improves our industry’s data-driven culture. Accounting for negative results and statistical insignificance reduces the bias in our metrics and provides an example to others of how to maturely accept uncertainty. By publishing everything, you encourage others to do the same, and everyone in the industry collectively learns much more quickly.
There are many ways to simplify and speed troubleshooting. Perhaps the most fundamental are: Building observability — with both white-box metrics and structured logs — into each component from the ground up. Designing systems with well-understood and observable interfaces between components.
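The "structured logs" half of that advice can be illustrated with a short sketch: emit events as machine-parseable records rather than free-form strings. The field names are illustrative assumptions; only the Python standard library is used.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(component: str, event: str, **fields):
    """Emit one structured (JSON) log record instead of a free-form string."""
    record = {"ts": time.time(), "component": component, "event": event, **fields}
    logging.info(json.dumps(record))

log_event("frontend", "rpc_finished", backend="search", latency_ms=37, status="OK")
```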
My team follows these broad guidelines — if any of the following is true, the event is an incident: Do you need to involve a second team in fixing the problem? Is the outage visible to customers? Is the issue unsolved even after an hour’s concentrated analysis?
Best Practices for Incident Management:
- Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing.
- Prepare. Develop and document your incident management procedures in advance, in consultation with incident participants.
- Trust. Give full autonomy within the assigned role to all incident participants.
- Introspect. Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.
- Consider alternatives. Periodically consider your options and re-evaluate whether it still makes sense to continue what you’re doing or whether you should be taking another tack in incident management.
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
Teams have some internal flexibility, but common postmortem triggers include:
- User-visible downtime or degradation beyond a certain threshold
- Data loss of any kind
- On-call engineer intervention (release rollback, rerouting of traffic, etc.)
- A resolution time above some threshold
- A monitoring failure (which usually implies manual incident discovery)
You can’t “fix” people, but you can fix systems and processes to better support people.
Some example activities include a “postmortem of the month”: in a monthly newsletter, an interesting and well-written postmortem is shared with the entire organization.
Passing a test or a series of tests doesn’t necessarily prove reliability. However, tests that are failing generally prove the absence of reliability.
Regression tests can be analogized to a gallery of rogue bugs that historically caused the system to fail or produce incorrect results.
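In that spirit, a hedged sketch of a regression test: once a bug is found and fixed, a test pinning the formerly failing input joins the gallery. The function and the historical bug are invented for illustration.

```python
import unittest

def parse_port(value: str) -> int:
    # Hypothetical historical bug: whitespace-padded input used to crash
    # the parser, so the fix strips it and the test below pins that behavior.
    return int(value.strip())

class PortParsingRegressionTest(unittest.TestCase):
    def test_whitespace_padded_port_still_parses(self):
        self.assertEqual(parse_port(" 8080 \n"), 8080)

if __name__ == "__main__":
    unittest.main()
```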
Engineers use stress tests to find the limits on a web service.
Dedicated, noninterrupted, project work time is essential to any software development effort.
Client-side throttling addresses this problem. When a client detects that a significant portion of its recent requests have been rejected due to “out of quota” errors, it starts self-regulating and caps the amount of outgoing traffic it generates.
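A minimal sketch of one way a client might self-regulate: track requests attempted versus requests the backend accepted over a recent window, and start rejecting some requests locally once acceptances lag far behind. The constant K and the rejection formula are illustrative choices, not a spec.

```python
import random

class AdaptiveThrottle:
    def __init__(self, k: float = 2.0):
        self.k = k
        self.requests = 0   # requests attempted in the recent window
        self.accepts = 0    # requests the backend accepted

    def allow_request(self) -> bool:
        """Probabilistically reject locally when accepts lag far behind requests."""
        reject_prob = max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))
        return random.random() >= reject_prob

    def record(self, accepted: bool):
        self.requests += 1
        self.accepts += int(accepted)
```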
Criticality is another notion that we’ve found very useful in the context of global quotas and throttling. A request made to a backend is associated with one of four possible criticality values, depending on how critical we consider that request:
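The highlight stops at the colon; in the book the four values are CRITICAL_PLUS, CRITICAL, SHEDDABLE_PLUS, and SHEDDABLE. The enum below and the mapping from criticality to a throttling constant for the AdaptiveThrottle sketch above are my own illustrative assumptions.

```python
from enum import Enum

class Criticality(Enum):
    CRITICAL_PLUS = 4   # failures are user-visible and serious
    CRITICAL = 3        # default for user-facing requests
    SHEDDABLE_PLUS = 2  # partial unavailability is tolerable
    SHEDDABLE = 1       # frequent partial or full unavailability is tolerable

# Assumed design choice: less critical traffic gets throttled more aggressively
# by giving it a smaller K in the client-side throttle.
K_BY_CRITICALITY = {
    Criticality.CRITICAL_PLUS: 3.0,
    Criticality.CRITICAL: 2.0,
    Criticality.SHEDDABLE_PLUS: 1.5,
    Criticality.SHEDDABLE: 1.1,
}
```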