Site Reliability Engineering: How Google Runs Production Systems
1%
Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second, despite estimates that 40–90% of the total costs of a system are incurred after birth.
2%
For our purposes, reliability is “The probability that [a system] will perform a required function without failure under stated conditions for a stated period of time,” following the definition in [Oco12].
3%
Hope is not a strategy.
3%
Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.
3%
SRE is what happens when you ask a software engineer to design an operations team.
3%
So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.
3%
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
3%
And once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support. For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.
3%
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
4%
When they are focused on operations work, on average, SREs should receive a maximum of two events per 8–12-hour on-call shift.
4%
Pursuing Maximum Change Velocity Without Violating a Service’s SLO
4%
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
4%
When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”
7%
Simplicity, a key principle of any effective software engineering and not only of reliability-oriented engineering, is a quality that, once lost, can be extraordinarily difficult to recapture.
7%
For example, a system that serves 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.
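The arithmetic in this highlight can be checked with a short sketch. The function names below are illustrative and do not come from the book, and the target is passed as a decimal string so that exact rational arithmetic avoids floating-point rounding (a design choice of this sketch, not something the source prescribes).

```python
import math
from fractions import Fraction


def allowed_errors(total_requests: int, availability_target: str) -> int:
    """Return the largest number of failed requests that still meets the target.

    availability_target is a decimal string such as "0.9999" (i.e. 99.99%).
    """
    # Fraction parses the decimal string exactly, so 1 - target is exact too.
    error_fraction = 1 - Fraction(availability_target)
    return math.floor(total_requests * error_fraction)


def budget_remaining(total_requests: int, availability_target: str,
                     observed_errors: int) -> int:
    """Errors that can still occur before the target is missed (may be negative)."""
    return allowed_errors(total_requests, availability_target) - observed_errors


if __name__ == "__main__":
    # The figures quoted in the highlight: 2.5M requests at a 99.99% target.
    print(allowed_errors(2_500_000, "0.9999"))
    # Hypothetical day with 100 failed requests so far.
    print(budget_remaining(2_500_000, "0.9999", 100))
```

A negative result from budget_remaining would indicate that the target for that day has already been missed.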
8%
Product development performance is largely evaluated on product velocity, which creates an incentive to push new code as quickly as possible. Meanwhile, SRE performance is (unsurprisingly) evaluated based upon reliability of a service, which implies an incentive to push back against a high rate of change.
9%
An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders, and allow multiple teams to reach the same conclusion about production risk without rancor.
9%
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service.
9%
SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added.
10%
User-facing serving systems, such as the Shakespeare search frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?
10%
Storage systems often emphasize latency, availability, and durability. In other words: How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it?
10%
While it’s tempting to ask for a system that can scale its load “infinitely” without any latency increase and that is “always” available, this requirement is unrealistic.
11%
If you can’t ever win a conversation about SLOs, it’s probably not worth having an SRE team for the product.
11%
At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect.
11%
Engineering work is novel and intrinsically requires human judgment. It produces a permanent improvement in your service, and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem — the more generalized, the better.
11%
Your career progress will slow down or grind to a halt if you spend too little time on projects. Google rewards grungy work when it’s inevitable and has a big positive impact, but you can’t make a career out of grunge.
13%
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
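As a rough illustration of the four signals named in this highlight, the sketch below summarizes one measurement window. Everything in it is an assumption made for the example: the Request and GoldenSignals types, the use of median latency, and CPU usage as the saturation measure are not taken from the book.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request failed


@dataclass
class GoldenSignals:
    median_latency_ms: float  # latency
    traffic_rps: float        # traffic, in requests per second
    error_rate: float         # errors, as a fraction of all requests
    saturation: float         # fraction of the constrained resource in use


def summarize(requests: List[Request], window_seconds: float,
              cpu_used: float, cpu_capacity: float) -> GoldenSignals:
    """Summarize one window of traffic into the four signals."""
    if not requests:
        raise ValueError("no requests in window")
    if window_seconds <= 0 or cpu_capacity <= 0:
        raise ValueError("window_seconds and cpu_capacity must be positive")
    failures = sum(1 for r in requests if not r.ok)
    return GoldenSignals(
        median_latency_ms=median(r.latency_ms for r in requests),
        traffic_rps=len(requests) / window_seconds,
        error_rate=failures / len(requests),
        saturation=cpu_used / cpu_capacity,
    )


if __name__ == "__main__":
    # Hypothetical two-second window with four requests, one of them failed.
    window = [Request(100, True), Request(200, True),
              Request(300, False), Request(400, True)]
    print(summarize(window, window_seconds=2.0, cpu_used=3.0, cpu_capacity=4.0))
```

A real system would likely track tail latency and separate the latency of failed requests from successful ones; the median is used here only to keep the sketch short.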
14%
Often, sheer force of effort can help a rickety system achieve high availability, but this path is usually short-lived and fraught with burnout and dependence on a small number of heroic team members. Taking a controlled, short-term decrease in availability is often a painful, but strategic trade for the long-run stability of the system.
17%
On that iteration, when trying to send the set of machines in the rack to Diskerase, the automation determined that the set of machines that still needed to be Diskerased was (correctly) empty. Unfortunately, the empty set was used as a special value, interpreted to mean “everything.” This means the automation sent almost all the machines we have in all colos to Diskerase.
Przemek
Damn!
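The failure mode in the Diskerase highlight above, an empty set doubling as a special value meaning "everything", can be reproduced in a few lines. This is a hypothetical sketch and not the actual automation: the fleet, the function names, and the explicit "everything" flag are all invented for illustration.

```python
from typing import Set

# Hypothetical fleet; real machine inventories are not part of the source.
ALL_MACHINES: Set[str] = {"m1", "m2", "m3"}


def select_targets_buggy(requested: Set[str]) -> Set[str]:
    """Bug pattern: an empty request is read as 'no filter', i.e. everything."""
    if not requested:
        return set(ALL_MACHINES)
    return requested & ALL_MACHINES


def select_targets_safe(requested: Set[str], everything: bool = False) -> Set[str]:
    """Safer variant: 'everything' must be asked for explicitly.

    An empty request selects nothing, so a finished job stays finished.
    """
    if everything:
        if requested:
            raise ValueError("pass either a machine set or everything=True, not both")
        return set(ALL_MACHINES)
    return requested & ALL_MACHINES


if __name__ == "__main__":
    nothing_left: Set[str] = set()  # all machines in the rack already handled
    print(sorted(select_targets_buggy(nothing_left)))  # selects the whole fleet
    print(sorted(select_targets_safe(nothing_left)))   # selects nothing
```

The design choice in the safe variant is to keep "nothing" and "everything" as distinct inputs, so that a correctly empty result cannot be misread downstream.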
19%
Building reliability into development allows developers to focus their attention on what we really do care about — the functionality and performance of their software and systems.
19%
The term “software bloat” was coined to describe the tendency of software to become slower and bigger over time as a result of a constant stream of additional features.
20%
French poet Antoine de Saint-Exupéry wrote, “perfection is finally attained not when there is no longer more to add, but when there is no longer anything to take away” [Sai39]. This principle is also applicable to the design and construction of software.
20%
Simple releases are generally better than complicated releases. It is much easier to measure and understand the impact of a single change rather than a batch of changes released simultaneously.
20%
Every time we say “no” to a feature, we are not restricting innovation; we are keeping the environment uncluttered of distractions so that focus remains squarely on innovation, and real engineering can proceed.
20%
We can characterize the health of a service — in much the same way that Abraham Maslow categorized human needs [Mas43] — from the most basic requirements needed for a system to function as a service at all to the higher levels of function — permitting self-actualization and taking active control of the direction of the service rather than reactively fighting fires.
24%
Stress hormones like cortisol and corticotropin-releasing hormone (CRH) are known to cause behavioral consequences — including fear — that can impair cognitive functions and cause suboptimal decision making [Chr09].
24%
The most important on-call resources are: clear escalation paths; well-defined incident-management procedures; and a blameless postmortem culture ([Loo10], [All12]).
25%
Being on-call for a quiet system is blissful, but what happens if the system is too quiet or when SREs are not on-call often enough? An operational underload is undesirable for an SRE team. Being out of touch with production for long periods of time can lead to confidence issues, both in terms of overconfidence and underconfidence, while knowledge gaps are discovered only when an incident occurs.
25%
Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn’t work.
26%
Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct! Instead, your course of action should be to make the system work as well as it can under the circumstances.
28%
Things break; that’s life.
Przemek
yes
28%
Few of us naturally respond well during an emergency. A proper response takes preparation and periodic, pertinent, hands-on training. Establishing and maintaining thorough training and testing processes requires the support of the board and management, in addition to the careful attention of staff.
Przemek
would be nice
31%
Mary returns to work the following morning to find that her transatlantic colleagues have assumed responsibility for the bug, mitigated the problem, closed the incident, and started work on the postmortem.
Przemek
very nice job Mary :)
31%
Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior.
33%
One key responsibility of Site Reliability Engineers is to quantify confidence in the systems they maintain. SREs perform this task by adapting classical software testing techniques to systems at scale.
33%
Passing a test or a series of tests doesn’t necessarily prove reliability. However, tests that are failing generally prove the absence of reliability.
34%
To borrow a technique from feature development and project management, if every task is high priority, none of the tasks are high priority.
37%
You can’t fix a problem until you understand it, and in engineering, you can only understand a problem by measuring it.
37%
Fully fledged software development projects within SRE provide career development opportunities for SREs, as well as an outlet for engineers who don’t want their coding skills to get rusty. Long-term project work provides much-needed balance to interrupts and on-call work, and can provide job satisfaction for engineers who want their careers to maintain a balance between software engineering and systems engineering.
39%
Don’t focus on perfection and purity of solution, especially if the bounds of the problem aren’t well known. Launch and iterate. Any sufficiently complex software engineering effort is bound to encounter uncertainty as to how a component should be designed or how a problem should be tackled.