Site Reliability Engineering: How Google Runs Production Systems
Read between October 13, 2018 - January 21, 2019
3%
The result of our approach to hiring for SRE is that we end up with a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated.
7%
It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost:
7%
rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness (with features, service, and performance) is optimized.
7%
when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.
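To make a target like 99.99% concrete, it helps to translate it into an error budget. A minimal sketch of that arithmetic (the 30-day window and request count are illustrative assumptions, not from the book):

```python
# Sketch: turn an availability target into a concrete error budget.
# The specific numbers below (30-day window, 1M requests) are illustrative.

def error_budget(target_availability: float, window_days: int = 30,
                 expected_requests: int = 1_000_000) -> dict:
    budget_fraction = 1.0 - target_availability
    window_minutes = window_days * 24 * 60
    return {
        "allowed_downtime_minutes": window_minutes * budget_fraction,
        "allowed_failed_requests": expected_requests * budget_fraction,
    }

if __name__ == "__main__":
    # A 99.99% target over 30 days leaves roughly 4.3 minutes of downtime.
    print(error_budget(0.9999))
```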
8%
We set a lower availability target for YouTube than for our enterprise products because rapid feature development was correspondingly more important.
9%
An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided. Most services consider request latency — how long it takes to return a response to a request — as a key SLI.
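As an illustration of what "carefully defined quantitative measure" can mean in practice, a latency SLI is commonly reported as a percentile over a window of observed requests. A minimal sketch, with made-up sample data:

```python
# Sketch: compute a latency SLI (e.g., 99th-percentile request latency)
# from a window of observed request durations. Sample data is made up.

def percentile(samples: list[float], pct: float) -> float:
    """Return the pct-th percentile (0-100) using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 18, 22, 25, 31, 40, 55, 90, 450]  # hypothetical window
print("p50:", percentile(latencies_ms, 50), "ms")
print("p99:", percentile(latencies_ms, 99), "ms")
```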
9%
In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added.
10%
It’s both unrealistic and undesirable to insist that SLOs will be met 100% of the time: doing so can reduce the rate of innovation and deployment, require expensive, overly conservative solutions, or both.
13%
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
13%
as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
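One way to follow that advice is to bucket latency observations by response status, so slow errors stay visible instead of being filtered out with the failures. A sketch with hypothetical names, not the book's code:

```python
# Sketch: record latency separately for successful and failed requests,
# so slow errors are tracked rather than filtered out. Names are illustrative.
from collections import defaultdict

class LatencyTracker:
    def __init__(self) -> None:
        # Map "success" / "error" -> list of observed latencies in ms.
        self.samples = defaultdict(list)

    def record(self, status_code: int, latency_ms: float) -> None:
        bucket = "error" if status_code >= 500 else "success"
        self.samples[bucket].append(latency_ms)

    def mean(self, bucket: str) -> float:
        values = self.samples[bucket]
        return sum(values) / len(values) if values else 0.0

tracker = LatencyTracker()
tracker.record(200, 35.0)
tracker.record(500, 2400.0)   # a slow error: worth tracking, not discarding
print(tracker.mean("success"), tracker.mean("error"))
```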
13%
Different aspects of a system should be measured with different levels of granularity.
13%
Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued. Every page should be actionable. Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page. Pages should be about a novel problem or an event that hasn’t been seen before.
14%
while some team members want to implement a “hack” to allow time for a proper fix, others worry that a hack will be forgotten or that the proper fix will be deprioritized indefinitely. This concern is credible, as it’s easy to build layers of unmaintainable technical debt by patching over problems instead of making real fixes.
24%
we strive to invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be spent on-call, leaving up to another 25% on other types of operational, nonproject work.
27%
By publishing everything, you encourage others to do the same, and everyone in the industry collectively learns much more quickly. SRE has already learned this lesson with high-quality postmortems, which have had a large positive effect on production stability. Publish your results.
31%
For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the “wrong” thing prevails, people will not bring issues to light for fear of punishment.
33%
As MTBF increases in response to better testing, developers are encouraged to release features faster. Some of these features will, of course, have bugs. New bugs result in an opposite adjustment to release velocity as these bugs are found and fixed.
34%
The term canary comes from the phrase “canary in a coal mine,” and refers to the practice of using a live bird to detect toxic gases before humans were poisoned.
42%
Lame duck: The backend task is listening on its port and can serve, but is explicitly asking clients to stop sending requests.
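A minimal sketch of how a backend task might model that state (illustrative only; real systems signal lame duck through their RPC and health-check machinery):

```python
# Sketch: a backend health state machine including "lame duck", in which the
# task keeps serving in-flight work but tells clients to send no new requests.
from enum import Enum

class BackendState(Enum):
    HEALTHY = "healthy"
    LAME_DUCK = "lame_duck"   # still able to serve, but asking clients to stop
    UNHEALTHY = "unhealthy"

class Backend:
    def __init__(self) -> None:
        self.state = BackendState.HEALTHY

    def prepare_for_restart(self) -> None:
        # Enter lame duck before shutdown so clients drain gracefully.
        self.state = BackendState.LAME_DUCK

    def accepts_new_requests(self) -> bool:
        return self.state is BackendState.HEALTHY

backend = Backend()
backend.prepare_for_restart()
print(backend.state, backend.accepts_new_requests())  # BackendState.LAME_DUCK False
```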
44%
Weighted Round Robin is fairly simple in principle: each client task keeps a “capability” score for each backend in its subset. Requests are distributed in Round-Robin fashion, but clients weigh the distributions of requests to backends proportionally.
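A sketch of that client-side selection logic, using static made-up weights in place of the dynamically updated capability scores the book describes:

```python
# Sketch of client-side Weighted Round Robin: each backend carries a
# "capability" weight, and requests are spread proportionally to it.
# Weights here are static and made up; in practice clients update them
# from observed backend behavior.

class WeightedRoundRobin:
    def __init__(self, weights: dict[str, float]) -> None:
        self.weights = weights
        self.credit = {backend: 0.0 for backend in weights}

    def pick(self) -> str:
        # Smooth WRR: grow each backend's credit by its weight, then choose
        # the backend with the most credit and charge it the total weight.
        for backend, weight in self.weights.items():
            self.credit[backend] += weight
        chosen = max(self.credit, key=self.credit.get)
        self.credit[chosen] -= sum(self.weights.values())
        return chosen

wrr = WeightedRoundRobin({"backend-a": 3, "backend-b": 1})
print([wrr.pick() for _ in range(8)])  # roughly 3:1 in favor of backend-a
```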
47%
As a server becomes overloaded, its responses to RPCs from its clients arrive later, which may exceed any deadlines those clients set. The work the server did to respond is then wasted, and clients may retry the RPCs, leading to even more overload.
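One common client-side mitigation for this amplification, sketched here under assumed names rather than taken from the book, is a retry budget: retries are allowed only while they remain a small fraction of total requests.

```python
# Sketch: a client-side retry budget. Retries are only allowed while they
# remain a small fraction of overall requests, so an overloaded backend is
# not hit with an amplifying wave of retries. The 10% ratio is illustrative.

class RetryBudget:
    def __init__(self, max_retry_ratio: float = 0.1) -> None:
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        # Spend one unit of retry budget if any remains.
        allowed = self.retries < self.max_retry_ratio * max(self.requests, 1)
        if allowed:
            self.retries += 1
        return allowed

budget = RetryBudget()
for _ in range(100):
    budget.record_request()
print(budget.can_retry())  # True until retries reach ~10% of requests
```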
48%
Load shedding drops some proportion of load by dropping traffic as the server approaches overload conditions. The goal is to keep the server from running out of RAM, failing health checks, serving with extremely high latency, or any of the other symptoms associated with overload, while still doing as much useful work as it can.
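A minimal sketch of that idea, with an assumed concurrency limit rather than anything Google-specific: cap in-flight requests and reject the overflow cheaply.

```python
# Sketch: shed load by capping in-flight requests. Excess requests get a
# cheap, immediate rejection instead of pushing the server into overload.
import threading

class LoadShedder:
    def __init__(self, max_in_flight: int = 100) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False          # shed: reply with an error such as 503
            self.in_flight += 1
            return True

    def release(self) -> None:
        with self.lock:
            self.in_flight -= 1

shedder = LoadShedder(max_in_flight=2)
print([shedder.try_acquire() for _ in range(3)])  # [True, True, False]
```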
48%
If a user’s web search is slow because an RPC has been queued for 10 seconds, there’s a good chance the user has given up and refreshed their browser, issuing another request: there’s no point in responding to the first one, since it will be ignored!
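The server-side counterpart is to check the request's deadline before doing any work. A sketch with hypothetical timings:

```python
# Sketch: skip requests whose deadline has already expired by the time they
# are dequeued; the client has almost certainly given up or retried.
import time

def handle(request_deadline: float, do_work) -> str:
    if time.monotonic() >= request_deadline:
        return "DEADLINE_EXCEEDED"   # don't waste work on an abandoned request
    return do_work()

deadline = time.monotonic() + 0.1     # hypothetical 100 ms deadline
time.sleep(0.2)                       # simulate 200 ms stuck in a queue
print(handle(deadline, lambda: "OK")) # DEADLINE_EXCEEDED
```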
48%
Graceful degradation shouldn’t trigger very often — usually in cases of a capacity planning failure or unexpected load shift.
49%
It’s important to note the distinction between a latency cache versus a capacity cache: when a latency cache is employed, the service can sustain its expected load with an empty cache, but a service using a capacity cache cannot sustain its expected load under an empty cache.
50%
At this point, the component should ideally start serving errors or degraded results in response to additional load, but not significantly reduce the rate at which it successfully handles requests.
51%
a growing number of distributed datastore technologies provide a different set of semantics known as BASE (Basically Available, Soft state, and Eventual consistency).
51%
Most of these systems that support BASE semantics rely on multimaster replication, where writes can be committed to different processes concurrently, and there is some mechanism to resolve conflicts (often as simple as “latest timestamp wins”).
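A sketch of the simplest such resolver, "latest timestamp wins", with a made-up record shape rather than a real datastore API:

```python
# Sketch: "latest timestamp wins" conflict resolution between two replicas'
# versions of the same record. The record shape here is made up.

def resolve(version_a: dict, version_b: dict) -> dict:
    """Each version carries a 'timestamp' set by the writer's clock."""
    return version_a if version_a["timestamp"] >= version_b["timestamp"] else version_b

replica_a = {"key": "user:42", "value": "alice@new.example", "timestamp": 1700000020}
replica_b = {"key": "user:42", "value": "alice@old.example", "timestamp": 1700000005}
print(resolve(replica_a, replica_b)["value"])  # the later write wins
# Note: this silently discards the losing write and trusts the writers' clocks,
# which is why it is only a crude form of conflict resolution.
```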
52%
As proven by the Dijkstra Prize–winning FLP impossibility result [Fis85], no asynchronous distributed consensus algorithm can guarantee progress in the presence of an unreliable network.
52%
In the first phase of the protocol, the proposer sends a sequence number to the acceptors. Each acceptor will agree to accept the proposal only if it has not yet seen a proposal with a higher sequence number.
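A sketch of that acceptor-side rule for the prepare phase only; persistence, the accept phase, and quorum counting are omitted:

```python
# Sketch: the acceptor side of Paxos's first (prepare) phase. The acceptor
# promises to honor a proposal number only if it is higher than any number
# it has already seen. Durable state and the accept phase are not shown.

class Acceptor:
    def __init__(self) -> None:
        self.highest_promised = -1   # highest proposal number seen so far
        self.accepted = None         # (number, value) accepted previously, if any

    def on_prepare(self, proposal_number: int):
        if proposal_number > self.highest_promised:
            self.highest_promised = proposal_number
            # Promise, reporting any previously accepted value so the
            # proposer is obliged to re-propose it.
            return ("promise", self.accepted)
        return ("reject", None)

acceptor = Acceptor()
print(acceptor.on_prepare(5))   # ('promise', None)
print(acceptor.on_prepare(3))   # ('reject', None): a lower number arrived late
```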
52%
Paxos on its own isn’t that useful: all it lets you do is to agree on a value and proposal number once. Because only a quorum of nodes need to agree on a value, any given node may not have a complete view of the set of values that have been agreed to. This limitation is true for most distributed consensus algorithms.
60%
BASE allows for higher availability than ACID, in exchange for a softer distributed consistency guarantee.
61%
From the user’s point of view, data integrity without expected and regular data availability is effectively the same as having no data at all.
63%
Similar to the effects enjoyed when unit tests are introduced early in the project lifecycle, a data validation pipeline results in an overall acceleration of software development projects.
66%
Google approached the challenges inherent to launches by creating a dedicated consulting team within SRE tasked with the technical side of launching a new product or feature.
66%
Experience has demonstrated that engineers are likely to sidestep processes that they consider too burdensome or as adding insufficient value — especially when a team is already in crunch mode, and the launch process is seen as just another item blocking their launch. For this reason, LCE must optimize the launch experience continuously to strike the right balance between cost and benefit.
67%
we perform most development on the mainline branch, but releases are built on separate branches per release. This setup makes it easy to fix bugs in a release without pulling in unrelated changes from the mainline.
67%
A new server might be installed on a few machines in one datacenter and observed for a defined period of time. If all looks well, the server is installed on all machines in one datacenter, observed again, and then installed on all machines globally.
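A sketch of that progression as data plus a loop; the stage names, fractions, and health check are hypothetical placeholders:

```python
# Sketch: a staged rollout that widens the blast radius only after each
# stage has been observed to be healthy. Stage names and the health check
# are hypothetical placeholders.

ROLLOUT_STAGES = [
    {"name": "canary: few machines in one datacenter", "fraction": 0.001},
    {"name": "one full datacenter",                    "fraction": 0.2},
    {"name": "all datacenters",                        "fraction": 1.0},
]

def rollout(deploy, looks_healthy) -> bool:
    for stage in ROLLOUT_STAGES:
        deploy(stage["fraction"])
        if not looks_healthy():          # observe for a defined period
            return False                 # stop and roll back instead of widening
    return True

ok = rollout(deploy=lambda fraction: print(f"deploying to {fraction:.1%}"),
             looks_healthy=lambda: True)
print("rollout complete" if ok else "rollout halted")
```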
70%
Early opportunities for ownership are standard across Google in general: all engineers are given a starter project that’s meant to provide a tour through the infrastructure sufficient to enable them to make a small but useful contribution early.
71%
Collect your best postmortems and make them prominently available for your newbies — in addition to interested parties from related and/or integrating teams — to read. Ask related teams to publish their best postmortems where you can access them.
73%
Polarizing time means that when a person comes into work each day, they should know if they’re doing just project work or just interrupts.
73%
A person should never be expected to be on-call and also make progress on projects (or anything else with a high context switching cost).
73%
Think about the value of the time you spend doing interrupts for this system, and if you’re spending this time wisely. At some point, if you can’t get the attention you need to fix the root cause of the problems causing interrupts, perhaps the component you’re supporting isn’t that important.
74%
Sort the team fires into toil and not-toil. When you’re finished, present the list to the team and clearly explain why each fire is either work that should be automated or acceptable overhead for running the service.
74%
“I’m not pushing back on the latest release because the tests are bad. I’m pushing back because the error budget we set for releases is exhausted.”
76%
Specialization is good, because it leads to higher chances of improved technical mastery, but it’s also bad, because it leads to siloization and ignorance of the broader picture.
77%
this, the product development team and the SRE
78%
The objectives of the Production Readiness Review are as follows:
- Verify that a service meets accepted standards of production setup and operational readiness, and that service owners are prepared to work with SRE and take advantage of SRE expertise.
- Improve the reliability of the service in production, and minimize the number and severity of incidents that might be expected.
A PRR targets all aspects of production that SRE cares about.
81%
Corrective and preventative action (CAPA) is a well-known concept for improving reliability that focuses on the systematic investigation of root causes of identified issues or risks in order to prevent recurrence. This principle is embodied by SRE’s strong culture of blameless postmortems.
82%
The ability to move or change quickly must be weighed against the differing implications of a failure.