Site Reliability Engineering: How Google Runs Production Systems
Read between September 12 and December 27, 2017
3%
Hope is not a strategy. (Traditional SRE saying)
3%
Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension.
3%
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
3%
Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development.
3%
And once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support. For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.
3%
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
4%
In practice, this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on.
4%
SREs and product developers aim to spend the error budget getting maximum feature velocity.
4%
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
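A minimal sketch of what “software should do the interpreting” can look like: the rule below evaluates a measured error ratio and pages only when the condition is sustained and actionable. The names and thresholds are illustrative, not from the book.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """Pages a human only when the condition is actionable."""
    name: str
    threshold: float   # maximum tolerable error ratio
    duration_s: int    # condition must hold this long before paging

def evaluate(rule: AlertRule, error_ratio: float, breached_for_s: int) -> bool:
    """Return True only if a human needs to act; transient blips stay silent."""
    return error_ratio > rule.threshold and breached_for_s >= rule.duration_s

# Example: page only if >0.1% of requests fail for at least 10 minutes.
availability_rule = AlertRule("checkout-availability", threshold=0.001, duration_s=600)
if evaluate(availability_rule, error_ratio=0.004, breached_for_s=900):
    print(f"PAGE: {availability_rule.name} needs human action")
```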
4%
When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”
4%
SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish the following:
- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise
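The three practices above form a loop that is easy to sketch in code. This outline is purely illustrative; `deploy_to`, `error_rate`, and `rollback` stand in for whatever deployment and monitoring hooks a real pipeline would provide.

```python
import time

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of traffic per stage
MAX_ERROR_RATE = 0.001                      # illustrative rollback threshold
SOAK_SECONDS = 600                          # how long to watch each stage

def progressive_rollout(release, deploy_to, error_rate, rollback) -> bool:
    """Roll a release out in stages, watching for problems and rolling back."""
    for fraction in ROLLOUT_STAGES:
        deploy_to(release, fraction)        # e.g., a 1% canary first
        time.sleep(SOAK_SECONDS)            # let the stage soak
        if error_rate(release) > MAX_ERROR_RATE:
            rollback(release)               # automated, no human in the loop
            return False
    return True                             # fully rolled out
```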
7%
We define toil as mundane, repetitive operational work providing no enduring value, which scales linearly with service growth.
9%
Managing service reliability is largely about managing risk, and managing risk can be costly. 100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take. An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders, and allow multiple teams to …
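As a concrete, hypothetical illustration of how an error budget falls out of a reliability target: a 99.9% availability SLO over a 30-day window allows roughly 43 minutes of unavailability, and releases continue only while some of that budget remains.

```python
SLO = 0.999                      # 99.9% availability target
WINDOW_MINUTES = 30 * 24 * 60    # 30-day rolling window

error_budget_minutes = (1 - SLO) * WINDOW_MINUTES   # ~43.2 minutes

def budget_remaining(downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent this window."""
    return 1 - downtime_minutes / error_budget_minutes

# If outages have consumed the budget, pause risky launches for the window.
if budget_remaining(downtime_minutes=50) <= 0:
    print("Error budget exhausted: pause feature releases")
else:
    print("Budget remains: keep shipping")
```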
9%
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs).
9%
Most services consider request latency — how long it takes to return a response to a request — as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput…
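A minimal sketch of computing those SLIs from a window of request records; the record fields and the 60-second window are invented for illustration.

```python
def compute_slis(requests):
    """requests: iterable of dicts like {"latency_ms": 87, "status": 200}."""
    latencies = sorted(r["latency_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    total = len(latencies)
    return {
        "latency_p99_ms": latencies[max(0, int(0.99 * total) - 1)],  # tail latency
        "error_rate": errors / total,        # fraction of failed requests
        "throughput_qps": total / 60,        # requests per second over a 60 s window
    }

sample = [{"latency_ms": 40 + i, "status": 500 if i % 50 == 0 else 200} for i in range(300)]
print(compute_slis(sample))
```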
9%
if there is no explicit consequence, then you are almost certainly looking at an SLO.
10%
User-facing serving systems, such as the Shakespeare search frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?
10%
User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values…
10%
Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO. However, not all product attributes are amenable to SLOs: it’s hard to specify “user delight” with an SLO.
11%
So what is toil? Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
11%
Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.
13%
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
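A hypothetical sketch of instrumenting a request handler for those four signals; the in-process `metrics` dict and the queue-depth stub stand in for a real monitoring client.

```python
import random
import time

QUEUE_CAPACITY = 100
metrics = {"traffic": 0, "errors": 0, "latencies_ms": [], "saturation": 0.0}

def queue_depth() -> int:
    return random.randint(0, QUEUE_CAPACITY)     # stand-in for a real queue gauge

def do_work(request) -> str:
    return "ok"                                  # stand-in for the real handler

def handle(request) -> str:
    """Record latency, traffic, errors, and saturation around the real handler."""
    start = time.monotonic()
    metrics["traffic"] += 1                                                  # traffic
    try:
        return do_work(request)
    except Exception:
        metrics["errors"] += 1                                               # errors
        raise
    finally:
        metrics["latencies_ms"].append((time.monotonic() - start) * 1000)    # latency
        metrics["saturation"] = queue_depth() / QUEUE_CAPACITY               # saturation

handle("example request")
print(metrics)
```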
14%
This kind of tension is common within a team, and often reflects an underlying mistrust of the team’s self-discipline: while some team members want to implement a “hack” to allow time for a proper fix, others worry that a hack will be forgotten or that the proper fix will be deprioritized indefinitely. This concern is credible, as it’s easy to build layers of unmaintainable technical debt by patching over problems instead of making real fixes.
18%
Our builds are hermetic, meaning that they are insensitive to the libraries and other software installed on the build machine.
19%
It sometimes makes sense to sacrifice stability for the sake of agility. I’ve often approached an unfamiliar problem domain by conducting what I call exploratory coding — setting an explicit shelf life for whatever code I write with the understanding that I’ll need to try and fail once in order to really understand the task I need to accomplish.
19%
As Fred Brooks suggests in his “No Silver Bullet” essay [Bro95], it is very important to consider the difference between essential complexity and accidental complexity. Essential complexity is the complexity inherent in a given situation that cannot be removed from a problem definition, whereas accidental complexity is more fluid and can be resolved with engineering effort.
19%
A smaller project is easier to understand, easier to test, and frequently has fewer defects.
21%
May the queries flow, and the pager stay silent. (Traditional SRE blessing)
24%
We strongly believe that the “E” in “SRE” is a defining characteristic of our organization, so we strive to invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be spent on-call, leaving up to another 25% on other types of operational, nonproject work.
24%
It’s important that on-call SREs understand that they can rely on several resources that make the experience of being on-call less daunting than it may seem. The most important on-call resources are:
- Clear escalation paths
- Well-defined incident-management procedures
- A blameless postmortem culture ([Loo10], [All12])
25%
In this case, it is appropriate to work together with the application developers to set common goals to improve the system.
25%
In extreme cases, SRE teams may have the option to “give back the pager” — SRE can ask the developer team to be exclusively on-call for the system until it meets the standards of the SRE team in question. Giving back the pager doesn’t happen very frequently, because it’s almost always possible to work with the developer team to reduce the operational load and make a given system more reliable.
25%
Ways in which things go right are special cases of the ways in which things go wrong. (John Allspaw)
25%
as doctors are taught, “when you hear hoofbeats, think of horses not zebras.” Also remember that, all things being equal, we should prefer simpler explanations.
26%
Novice pilots are taught that their first responsibility in an emergency is to fly the airplane [Gaw09]; troubleshooting is secondary to getting the plane and everyone on it safely onto the ground. This approach is also applicable to computer systems: for example, if a bug is leading to possibly unrecoverable data corruption, freezing the system to prevent further failure may be better than letting this behavior continue.
27%
Publishing negative results improves our industry’s data-driven culture. Accounting for negative results and statistical insignificance reduces the bias in our metrics and provides an example to others of how to maturely accept uncertainty. By publishing everything, you encourage others to do the same, and everyone in the industry collectively learns much more quickly.
28%
There are many ways to simplify and speed troubleshooting. Perhaps the most fundamental are:
- Building observability — with both white-box metrics and structured logs — into each component from the ground up.
- Designing systems with well-understood and observable interfaces between components.
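A small sketch of the structured-logs half of that advice, using Python’s standard logging module to emit one JSON object per record so tooling can query fields instead of grepping free text. The field names are illustrative.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "severity": record.levelname,
            "message": record.getMessage(),
            **getattr(record, "fields", {}),   # structured, queryable context
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("frontend")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Attach machine-readable context instead of interpolating it into the message.
log.info("request finished", extra={"fields": {"path": "/search", "latency_ms": 87, "status": 200}})
```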
31%
My team follows these broad guidelines — if any of the following is true, the event is an incident:
- Do you need to involve a second team in fixing the problem?
- Is the outage visible to customers?
- Is the issue unsolved even after an hour’s concentrated analysis?
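Those three tests are simple enough to encode, which keeps the decision mechanical instead of being debated mid-outage. A minimal sketch, with field names invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Event:
    teams_involved: int
    customer_visible: bool
    minutes_spent_analyzing: int

def is_incident(event: Event) -> bool:
    """Declare an incident if any of the team's three criteria holds."""
    return (
        event.teams_involved > 1
        or event.customer_visible
        or event.minutes_spent_analyzing >= 60
    )

print(is_incident(Event(teams_involved=1, customer_visible=False, minutes_spent_analyzing=75)))  # True
```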
31%
Best Practices for Incident Management:
- Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing.
- Prepare. Develop and document your incident management procedures in advance, in consultation with incident participants.
- Trust. Give full autonomy within the assigned role to all incident participants.
- Introspect. Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.
- Consider alternatives. Periodically consider your options and re-evaluate whether it still makes sense to continue …
31%
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
31%
Teams have some internal flexibility, but common postmortem triggers include:
- User-visible downtime or degradation beyond a certain threshold
- Data loss of any kind
- On-call engineer intervention (release rollback, rerouting of traffic, etc.)
- A resolution time above some threshold
- A monitoring failure (which usually implies manual incident discovery)
31%
You can’t “fix” people, but you can fix systems and processes to better support people.
32%
Some example activities include:
- Postmortem of the month: In a monthly newsletter, an interesting and well-written postmortem is shared with the entire organization.
33%
Passing a test or a series of tests doesn’t necessarily prove reliability. However, tests that are failing generally prove the absence of reliability.
34%
Regression tests can be analogized to a gallery of rogue bugs that historically caused the system to fail or produce incorrect results.
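In that spirit, each regression test pins one specific historical failure in place so it cannot quietly return. A hypothetical example (the bug, the function, and the fix are invented):

```python
# Hypothetical regression test for a past outage: the quota parser used to
# crash on a trailing newline in the config file.
def parse_quota(line: str) -> int:
    return int(line.strip())    # fix: tolerate surrounding whitespace

def test_parse_quota_tolerates_trailing_newline():
    # This exact input once caused a production crash; keep it failing loudly if it regresses.
    assert parse_quota("500\n") == 500

test_parse_quota_tolerates_trailing_newline()
```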
34%
Engineers use stress tests to find the limits on a web service.
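A very rough sketch of the idea: offer increasing amounts of load and watch for the point where the failure rate climbs. The target URL is a placeholder, and a real stress test would use purpose-built load-generation tooling and control the offered rate far more carefully.

```python
import concurrent.futures
import urllib.request

TARGET = "http://localhost:8080/healthz"   # placeholder endpoint

def hit(url: str) -> bool:
    """Return True if a single request succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

def stress(levels=(100, 500, 1000, 2000)):
    """Offer increasing batches of concurrent requests and report the failure rate."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        for n in levels:
            results = list(pool.map(hit, [TARGET] * n))
            print(f"{n} requests: {results.count(False) / n:.1%} failed")

stress()
```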
40%
Dedicated, noninterrupted, project work time is essential to any software development effort.
45%
Client-side throttling addresses this problem. When a client detects that a significant portion of its recent requests have been rejected due to “out of quota” errors, it starts self-regulating and caps the amount of outgoing traffic it generates.
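One concrete form of this self-regulation is the adaptive throttling scheme described later in the same chapter: each client tracks requests attempted versus requests the backend accepted, and rejects new work locally with a probability that grows as attempts outrun a multiple (K) of accepts. A minimal sketch, with the windowing and decay of the counters omitted:

```python
import random

class AdaptiveThrottle:
    """Client-side throttling driven by the ratio of attempted to accepted requests."""

    def __init__(self, k: float = 2.0):
        self.k = k          # with K = 2, throttling starts once under half of attempts are accepted
        self.requests = 0   # attempts made by the application layer (real code uses a time window)
        self.accepts = 0    # requests the backend actually accepted

    def reject_probability(self) -> float:
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def allow(self) -> bool:
        """Decide locally whether to send this attempt upstream at all."""
        self.requests += 1
        if random.random() < self.reject_probability():
            return False    # fail fast locally, sparing the overloaded backend
        return True

    def record_accept(self) -> None:
        """Call when the backend accepted (did not reject) the request."""
        self.accepts += 1
```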
45%
Criticality is another notion that we’ve found very useful in the context of global quotas and throttling. A request made to a backend is associated with one of four possible criticality values, depending on how critical we consider that request: