Site Reliability Engineering: How Google Runs Production Systems
Read between October 13, 2018 - January 21, 2019
3%
The result of our approach to hiring for SRE is that we end up with a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated.
7%
It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost:
7%
rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness (with features, service, and performance) is optimized.
7%
when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.
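To make a target like 99.99% concrete, it helps to translate it into an error budget. A minimal sketch of that arithmetic (the 30-day window and request count are illustrative assumptions, not from the book):

```python
# Sketch: turn an availability target into a concrete error budget.
# The specific numbers below (30-day window, 1M requests) are illustrative.

def error_budget(target_availability: float, window_days: int = 30,
                 expected_requests: int = 1_000_000) -> dict:
    budget_fraction = 1.0 - target_availability
    window_minutes = window_days * 24 * 60
    return {
        "allowed_downtime_minutes": window_minutes * budget_fraction,
        "allowed_failed_requests": expected_requests * budget_fraction,
    }

if __name__ == "__main__":
    # A 99.99% target over 30 days leaves roughly 4.3 minutes of downtime.
    print(error_budget(0.9999))
```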
8%
We set a lower availability target for YouTube than for our enterprise products because rapid feature development was correspondingly more important.
9%
An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided. Most services consider request latency — how long it takes to return a response to a request — as a key SLI.
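As an illustration of what "carefully defined quantitative measure" can mean in practice, a latency SLI is commonly reported as a percentile over a window of observed requests. A minimal sketch, with made-up sample data:

```python
# Sketch: compute a latency SLI (e.g., 99th-percentile request latency)
# from a window of observed request durations. Sample data is made up.

def percentile(samples: list[float], pct: float) -> float:
    """Return the pct-th percentile (0-100) using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 90, 450]  # hypothetical window
print("p50:", percentile(latencies_ms, 50), "ms")
print("p99:", percentile(latencies_ms, 99), "ms")
```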
9%
In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added.
10%
It’s both unrealistic and undesirable to insist that SLOs will be met 100% of the time: doing so can reduce the rate of innovation and deployment, require expensive, overly conservative solutions, or both.
13%
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
13%
as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
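One way to follow that advice is to bucket latency observations by response status, so slow errors stay visible instead of being filtered out with the failures. A sketch with hypothetical names, not the book's code:

```python
# Sketch: record latency separately for successful and failed requests,
# so slow errors are tracked rather than filtered out. Names are illustrative.
from collections import defaultdict

class LatencyTracker:
    def __init__(self) -> None:
        # Map "success" / "error" -> list of observed latencies in ms.
        self.samples = defaultdict(list)

    def record(self, status_code: int, latency_ms: float) -> None:
        bucket = "error" if status_code >= 500 else "success"
        self.samples[bucket].append(latency_ms)

    def mean(self, bucket: str) -> float:
        values = self.samples[bucket]
        return sum(values) / len(values) if values else 0.0

tracker = LatencyTracker()
tracker.record(200, 35.0)
tracker.record(500, 2400.0)   # a slow error: worth tracking, not discarding
print(tracker.mean("success"), tracker.mean("error"))
```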
13%
Different aspects of a system should be measured with different levels of granularity.
13%
Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued. Every page should be actionable. Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page. Pages should be about a novel problem or an event that hasn’t been seen before.
14%
while some team members want to implement a “hack” to allow time for a proper fix, others worry that a hack will be forgotten or that the proper fix will be deprioritized indefinitely. This concern is credible, as it’s easy to build layers of unmaintainable technical debt by patching over problems instead of making real fixes.
24%
we strive to invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be spent on-call, leaving up to another 25% on other types of operational, nonproject work.
27%
By publishing everything, you encourage others to do the same, and everyone in the industry collectively learns much more quickly. SRE has already learned this lesson with high-quality postmortems, which have had a large positive effect on production stability. Publish your results.
31%
For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the “wrong” thing prevails, people will not bring issues to light for fear of punishment.
33%
As MTBF increases in response to better testing, developers are encouraged to release features faster. Some of these features will, of course, have bugs. New bugs result in an opposite adjustment to release velocity as these bugs are found and fixed.
34%
The term canary comes from the phrase “canary in a coal mine,” and refers to the practice of using a live bird to detect toxic gases before humans were poisoned.
42%
Lame duck: The backend task is listening on its port and can serve, but is explicitly asking clients to stop sending requests.
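A minimal sketch of how a backend task might model that state (illustrative only; real systems signal lame duck through their RPC and health-check machinery):

```python
# Sketch: a backend health state machine including "lame duck", in which the
# task keeps serving in-flight work but tells clients to send no new requests.
from enum import Enum

class BackendState(Enum):
    HEALTHY = "healthy"
    LAME_DUCK = "lame_duck"   # still able to serve, but asking clients to stop
    UNHEALTHY = "unhealthy"

class Backend:
    def __init__(self) -> None:
        self.state = BackendState.HEALTHY

    def prepare_for_restart(self) -> None:
        # Enter lame duck before shutdown so clients drain gracefully.
        self.state = BackendState.LAME_DUCK

    def accepts_new_requests(self) -> bool:
        return self.state is BackendState.HEALTHY

backend = Backend()
backend.prepare_for_restart()
print(backend.state, backend.accepts_new_requests())  # BackendState.LAME_DUCK False
```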
44%
Weighted Round Robin is fairly simple in principle: each client task keeps a “capability” score for each backend in its subset. Requests are distributed in Round-Robin fashion, but clients weigh the distributions of requests to backends proportionally.
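A sketch of that client-side selection logic, using static made-up weights in place of the dynamically updated capability scores the book describes:

```python
# Sketch of client-side Weighted Round Robin: each backend carries a
# "capability" weight, and requests are spread proportionally to it.
# Weights here are static and made up; in practice clients update them
# from observed backend behavior.

class WeightedRoundRobin:
    def __init__(self, weights: dict[str, float]) -> None:
        self.weights = weights
        self.credit = {backend: 0.0 for backend in weights}

    def pick(self) -> str:
        # Smooth WRR: grow each backend's credit by its weight, then choose
        # the backend with the most credit and charge it the total weight.
        for backend, weight in self.weights.items():
            self.credit[backend] += weight
        chosen = max(self.credit, key=self.credit.get)
        self.credit[chosen] -= sum(self.weights.values())
        return chosen

wrr = WeightedRoundRobin({"backend-a": 3, "backend-b": 1})
print([wrr.pick() for _ in range(8)])  # roughly 3:1 in favor of backend-a
```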
47%
As a server becomes overloaded, its responses to RPCs from its clients arrive later, which may exceed any deadlines those clients set. The work the server did to respond is then wasted, and clients may retry the RPCs, leading to even more overload.
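One common client-side mitigation for this amplification, sketched here under assumed names rather than taken from the book, is a retry budget: retries are allowed only while they remain a small fraction of total requests.

```python
# Sketch: a client-side retry budget. Retries are only allowed while they
# remain a small fraction of overall requests, so an overloaded backend is
# not hit with an amplifying wave of retries. The 10% ratio is illustrative.

class RetryBudget:
    def __init__(self, max_retry_ratio: float = 0.1) -> None:
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        # Spend one unit of retry budget if any remains.
        allowed = self.retries < self.max_retry_ratio * max(self.requests, 1)
        if allowed:
            self.retries += 1
        return allowed

budget = RetryBudget()
for _ in range(100):
    budget.record_request()
print(budget.can_retry())  # True until retries reach ~10% of requests
```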
48%
Load shedding drops some proportion of load by dropping traffic as the server approaches overload conditions. The goal is to keep the server from running out of RAM, failing health checks, serving with extremely high latency, or any of the other symptoms associated with overload, while still doing as much useful work as it can.
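A minimal sketch of that idea, with an assumed concurrency limit rather than anything Google-specific: cap in-flight requests and reject the overflow cheaply.

```python
# Sketch: shed load by capping in-flight requests. Excess requests get a
# cheap, immediate rejection instead of pushing the server into overload.
import threading

class LoadShedder:
    def __init__(self, max_in_flight: int = 100) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False          # shed: reply with an error such as 503
            self.in_flight += 1
            return True

    def release(self) -> None:
        with self.lock:
            self.in_flight -= 1

shedder = LoadShedder(max_in_flight=2)
print([shedder.try_acquire() for _ in range(3)])  # [True, True, False]
```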
48%
If a user’s web search is slow because an RPC has been queued for 10 seconds, there’s a good chance the user has given up and refreshed their browser, issuing another request: there’s no point in responding to the first one, since it will be ignored!
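The server-side counterpart is to check the request's deadline before doing any work. A sketch with hypothetical timings:

```python
# Sketch: skip requests whose deadline has already expired by the time they
# are dequeued; the client has almost certainly given up or retried.
import time

def handle(request_deadline: float, do_work) -> str:
    if time.monotonic() >= request_deadline:
        return "DEADLINE_EXCEEDED"   # don't waste work on an abandoned request
    return do_work()

deadline = time.monotonic() + 0.1     # hypothetical 100 ms deadline
time.sleep(0.2)                       # simulate 200 ms stuck in a queue
print(handle(deadline, lambda: "OK")) # DEADLINE_EXCEEDED
```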
48%
Graceful degradation shouldn’t trigger very often — usually in cases of a capacity planning failure or unexpected load shift.
49%
It’s important to note the distinction between a latency cache versus a capacity cache: when a latency cache is employed, the service can sustain its expected load with an empty cache, but a service using a capacity cache cannot sustain its expected load under an empty cache.
50%
At this point, the component should ideally start serving errors or degraded results in response to additional load, but not significantly reduce the rate at which it successfully handles requests.
51%
a growing number of distributed datastore technologies provide a different set of semantics known as BASE (Basically Available, Soft state, and Eventual consistency).
51%
Most of these systems that support BASE semantics rely on multimaster replication, where writes can be committed to different processes concurrently, and there is some mechanism to resolve conflicts (often as simple as “latest timestamp wins”).
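A sketch of the simplest such resolver, "latest timestamp wins", with a made-up record shape rather than a real datastore API:

```python
# Sketch: "latest timestamp wins" conflict resolution between two replicas'
# versions of the same record. The record shape here is made up.

def resolve(version_a: dict, version_b: dict) -> dict:
    """Each version carries a 'timestamp' set by the writer's clock."""
    return version_a if version_a["timestamp"] >= version_b["timestamp"] else version_b

replica_a = {"key": "user:42", "value": "alice@new.example", "timestamp": 1700000020}
replica_b = {"key": "user:42", "value": "alice@old.example", "timestamp": 1700000005}
print(resolve(replica_a, replica_b)["value"])  # the later write wins
# Note: this silently discards the losing write and trusts the writers' clocks,
# which is why it is only a crude form of conflict resolution.
```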
52%
As proven by the Dijkstra Prize–winning FLP impossibility result [Fis85], no asynchronous distributed consensus algorithm can guarantee progress in the presence of an unreliable network.
52%
In the first phase of the protocol, the proposer sends a sequence number to the acceptors. Each acceptor will agree to accept the proposal only if it has not yet seen a proposal with a higher sequence number.
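A sketch of that acceptor-side rule for the prepare phase only; persistence, the accept phase, and quorum counting are omitted:

```python
# Sketch: the acceptor side of Paxos's first (prepare) phase. The acceptor
# promises to honor a proposal number only if it is higher than any number
# it has already seen. Durable state and the accept phase are not shown.

class Acceptor:
    def __init__(self) -> None:
        self.highest_promised = -1   # highest proposal number seen so far
        self.accepted = None         # (number, value) accepted previously, if any

    def on_prepare(self, proposal_number: int):
        if proposal_number > self.highest_promised:
            self.highest_promised = proposal_number
            # Promise, reporting any previously accepted value so the
            # proposer is obliged to re-propose it.
            return ("promise", self.accepted)
        return ("reject", None)

acceptor = Acceptor()
print(acceptor.on_prepare(5))   # ('promise', None)
print(acceptor.on_prepare(3))   # ('reject', None): a lower number arrived late
```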
52%
Paxos on its own isn’t that useful: all it lets you do is to agree on a value and proposal number once. Because only a quorum of nodes need to agree on a value, any given node may not have a complete view of the set of values that have been agreed to. This limitation is true for most distributed consensus algorithms.
60%
BASE allows for higher availability than ACID, in exchange for a softer distributed consistency guarantee.
61%
From the user’s point of view, data integrity without expected and regular data availability is effectively the same as having no data at all.
63%
Similar to the effects enjoyed when unit tests are introduced early in the project lifecycle, a data validation pipeline results in an overall acceleration of software development projects.
66%
Google approached the challenges inherent to launches by creating a dedicated consulting team within SRE tasked with the technical side of launching a new product or feature.
66%
Experience has demonstrated that engineers are likely to sidestep processes that they consider too burdensome or as adding insufficient value — especially when a team is already in crunch mode, and the launch process is seen as just another item blocking their launch. For this reason, LCE must optimize the launch experience continuously to strike the right balance between cost and benefit.
67%
we perform most development on the mainline branch, but releases are built on separate branches per release. This setup makes it easy to fix bugs in a release without pulling in unrelated changes from the mainline.
67%
A new server might be installed on a few machines in one datacenter and observed for a defined period of time. If all looks well, the server is installed on all machines in one datacenter, observed again, and then installed on all machines globally.
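A sketch of that progression as data plus a loop; the stage names, fractions, and health check are hypothetical placeholders:

```python
# Sketch: a staged rollout that widens the blast radius only after each
# stage has been observed to be healthy. Stage names and the health check
# are hypothetical placeholders.

ROLLOUT_STAGES = [
    {"name": "canary: few machines in one datacenter", "fraction": 0.001},
    {"name": "one full datacenter",                    "fraction": 0.2},
    {"name": "all datacenters",                        "fraction": 1.0},
]

def rollout(deploy, looks_healthy) -> bool:
    for stage in ROLLOUT_STAGES:
        deploy(stage["fraction"])
        if not looks_healthy():          # observe for a defined period
            return False                 # stop and roll back instead of widening
    return True

ok = rollout(deploy=lambda fraction: print(f"deploying to {fraction:.1%}"),
             looks_healthy=lambda: True)
print("rollout complete" if ok else "rollout halted")
```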
70%
Early opportunities for ownership are standard across Google in general: all engineers are given a starter project that’s meant to provide a tour through the infrastructure sufficient to enable them to make a small but useful contribution early.
71%
Collect your best postmortems and make them prominently available for your newbies — in addition to interested parties from related and/or integrating teams — to read. Ask related teams to publish their best postmortems where you can access them.
73%
Polarizing time means that when a person comes into work each day, they should know if they’re doing just project work or just interrupts.
73%
A person should never be expected to be on-call and also make progress on projects (or anything else with a high context switching cost).
73%
Think about the value of the time you spend doing interrupts for this system, and if you’re spending this time wisely. At some point, if you can’t get the attention you need to fix the root cause of the problems causing interrupts, perhaps the component you’re supporting isn’t that important.
74%
Sort the team fires into toil and not-toil. When you’re finished, present the list to the team and clearly explain why each fire is either work that should be automated or acceptable overhead for running the service.
74%
“I’m not pushing back on the latest release because the tests are bad. I’m pushing back because the error budget we set for releases is exhausted.”
76%
Specialization is good, because it leads to higher chances of improved technical mastery, but it’s also bad, because it leads to siloization and ignorance of the broader picture.
77%
this, the product development team and the SRE
78%
The objectives of the Production Readiness Review are as follows:
- Verify that a service meets accepted standards of production setup and operational readiness, and that service owners are prepared to work with SRE and take advantage of SRE expertise.
- Improve the reliability of the service in production, and minimize the number and severity of incidents that might be expected.
A PRR targets all aspects of production that SRE cares about.
81%
Corrective and preventative action (CAPA) is a well-known concept for improving reliability that focuses on the systematic investigation of root causes of identified issues or risks in order to prevent recurrence. This principle is embodied by SRE’s strong culture of blameless postmortems.
82%
The ability to move or change quickly must be weighed against the differing implications of a failure.