Kindle Notes & Highlights
by Betsy Beyer
Read between February 14 - December 19, 2019
Hope is not a strategy.
Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.
At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager.
SRE is what happens when you ask a software engineer to design an operations team.
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity.
An outage is no longer a “bad” thing — it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
SRE has found that roughly 70% of outages are due to changes in a live system.
100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take.
An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders, and allow multiple teams to reach the same conclusion about production risk without rancor.
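Note: a minimal Python sketch (not from the book) of the error-budget arithmetic the highlights above imply; the SLO target, request counts, and failure counts are made-up numbers.

# Illustrative error-budget arithmetic for an availability SLO.
# The budget is simply 1 minus the availability target; every figure here is hypothetical.
slo_target = 0.999           # 99.9% of requests should succeed this period
total_requests = 2_500_000   # requests served so far in the period
failed_requests = 1_800      # failed (or too-slow) requests so far

error_budget = 1.0 - slo_target                   # fraction of requests allowed to fail
budget_allowed = error_budget * total_requests    # about 2,500 failures permitted
budget_remaining = budget_allowed - failed_requests

print(f"Failures allowed:  {budget_allowed:.0f}")
print(f"Failures so far:   {failed_requests}")
print(f"Budget remaining:  {budget_remaining:.0f} "
      f"({budget_remaining / budget_allowed:.0%} of the budget left)")
# A healthy remaining budget is spent on faster, riskier releases;
# a depleted one is the signal to slow down and invest in reliability.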
SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided. Most services consider request latency — how long it takes to return a response to a request — as a key SLI.
Although 100% availability is impossible, near-100% availability is often readily achievable.
SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results “quickly,” adopting an SLO that our average search request latency should be less than 100 milliseconds.
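Note: a small Python sketch of the SLI-versus-SLO check described above, using the quoted 100-millisecond average-latency objective; the latency samples are invented for illustration.

# Hypothetical check of an "average search latency below 100 ms" objective.
latencies_ms = [42, 87, 35, 120, 64, 98, 51, 73]   # measured request latencies (invented)

sli_avg_latency_ms = sum(latencies_ms) / len(latencies_ms)   # the SLI: a measured value
slo_threshold_ms = 100                                       # the SLO: a target for that SLI

print(f"SLI (average latency): {sli_avg_latency_ms:.1f} ms")
print("SLO met" if sli_avg_latency_ms < slo_threshold_ms else "SLO violated")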
The Global Chubby Planned Outage (written by Marc Alvidrez): Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region. Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security…
Professional goal: introduce failures into a system to meet the agreed-upon availability metrics.
An easy way to tell the difference between an SLO and an SLA is to ask “what happens if the SLOs aren’t met?”: if there is no explicit consequence, then you are almost certainly looking at an SLO.
Google Search is an example of an important service that doesn’t have an SLA for the public: we want everyone to use Search as fluidly and efficiently as possible, but we haven’t signed a contract with the whole world. Even so, there are still consequences if Search isn’t available — unavailability results in a hit to our reputation, as well as a drop in advertising revenue.
If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
We have to be careful about saying a task is “not toil because it needs human judgment.” We need to think carefully about whether the nature of the task intrinsically requires human judgment and cannot be addressed by better design. For example, one could build (and some have built) a service that alerts its SREs several times a day, where each alert requires a complex response involving plenty of human judgment. Such a service is poorly designed, with unnecessary complexity. The system needs to be simplified and rebuilt to either eliminate the underlying failure conditions or deal with these…
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.
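Note: a Python sketch of the 99th-percentile measurement mentioned above, computed over a one-minute window of request latencies; the sample data and the nearest-rank percentile helper are illustrative, not the book’s implementation.

# Hypothetical p99 latency over a one-minute window, used as an early saturation signal.
# Real monitoring systems derive this from latency histograms; this sketch just
# sorts one minute's worth of samples.
import math
import random

random.seed(1)
window_latencies_ms = [random.gauss(80, 20) for _ in range(600)]  # ~10 requests/s for 60 s

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

p99_ms = percentile(window_latencies_ms, 99)
print(f"p99 latency over the last minute: {p99_ms:.0f} ms")
# A rising p99 in a small window often shows saturation before averages move.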
Email alerts are of very limited value and tend to easily become overrun with noise; instead, you should favor a dashboard that monitors all ongoing subcritical problems for the sort of information that typically ends up in email alerts. A dashboard might also be paired with a log, in order to analyze historical correlations.
Warning: Joseph Bironas, an SRE who led Google’s datacenter turnup efforts for a time, forcefully argued: “If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings. Think The Matrix with less special effects and more pissed off System Administrators.”
Automation is “meta-software” — software to act on software.
For example, we often assume that pushing a new binary to a cluster is atomic; the cluster will either end up with the old version, or the new version. However, real-world behavior is more complicated: that cluster’s network can fail halfway through; machines can fail; communication to the cluster management layer can fail, leaving the system in an inconsistent state; depending on the situation, new binaries could be staged but not pushed, or pushed but not restarted, or restarted but not verifiable. Very few abstractions model these kinds of outcomes successfully, and most generally end up…
SRE hates manual operations, so we obviously try to create systems that don’t require them. However, sometimes manual operations are unavoidable.
We managed to achieve the self-professed nirvana of SRE: to automate ourselves out of a job.
The price of reliability is the pursuit of the utmost simplicity.
“At the end of the day, our job is to keep agility and stability in balance in the system.”
In fact, SRE’s experience has found that reliable processes tend to actually increase developer agility: rapid, reliable production rollouts make changes in production easier to see.
“perfection is finally attained not when there is no longer more to add, but when there is no longer anything to take away”
Under the influence of these stress hormones, the more deliberate cognitive approach is typically subsumed by unreflective and unconsidered (but immediate) action, leading to potential abuse of heuristics. Heuristics are very tempting behaviors when one is on-call. For example, when the same alert pages for the fourth time in the week, and the previous three pages were initiated by an external infrastructure system, it is extremely tempting to exercise confirmation bias by automatically associating this fourth occurrence of the problem with the previous cause. While intuition and quick…
Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn’t work. (Brian Redman)
Negative results should not be ignored or discounted. Realizing you’re wrong has much value: a clear negative result can resolve some of the hardest design questions.
Many more experiments are simply unreported because people mistakenly believe that negative results are not progress.
“I don’t know where the fire is yet, but I’m blinded by smoke coming from this whitelist cache.”
Things break; that’s life.
What to Do When Systems Break: First of all, don’t panic! You aren’t alone, and the sky isn’t falling.
Writing a postmortem is not punishment — it is a learning opportunity for the entire company.
If you haven’t tried it, assume it’s broken. (Unknown)
If at first you don’t succeed, back off exponentially.
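Note: a minimal Python sketch of retrying with exponential backoff, in the spirit of the last highlight; the function name, attempt limits, and jitter are illustrative choices, not a prescribed implementation.

# Illustrative retry loop with exponential backoff and jitter.
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay_s=0.5, max_delay_s=30.0):
    """Retry `operation` on failure, roughly doubling the wait between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                       # out of retries: surface the error
            delay_s = min(max_delay_s, base_delay_s * 2 ** attempt)
            time.sleep(delay_s * random.uniform(0.5, 1.5))  # jitter avoids synchronized retries

# Usage with a hypothetical flaky call:
# result = call_with_backoff(lambda: flaky_rpc())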