Site Reliability Engineering: How Google Runs Production Systems
Kindle Notes & Highlights
Google’s story is a story of scaling up.
Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort. Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second, despite estimates that 40–90% of the total costs of a system are incurred after birth.
Unpacking the term a little, first and foremost, SREs are engineers. We apply the principles of computer science and engineering to the design and development of computing systems: generally, large distributed ones.
Next, we focus on system reliability. Ben Treynor Sloss, Google’s VP for 24/7 Operations, originator of the term SRE, claims that reliability is the most fundamental feature of any product: a system isn’t very useful if nobody can use it! Because reliability is so critical, SREs are focused on finding ways to improve the design and operation of systems to make them more scalable, more reliable, and more efficient.
The “site” in our name originally referred to SRE’s role in keeping the google.com website running, though we now run many more services, many of which aren’t themselves websites — from internal infrastructure such as Bigtable to products for external developers such as the Google Cloud Platform.
It is equally no surprise that of all the post-deployment characteristics of software that we could choose to devote special attention to, reliability is the one we regard as primary.
much like security, the earlier you care about reliability, the better.
thorough understanding of how to operate the systems was not enough to prevent human errors
SRE Way in mind: thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it. Welcome to our emerging profession!
Hope is not a strategy.
It is a truth universally acknowledged that systems do not run themselves.
Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.
At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension.
SRE is what happens when you ask a software engineer to design an operations team.
(a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated.
SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
Google places a 50% cap on the aggregate “ops” work for all SREs — tickets, on-call, manual tasks, etc.
we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.
Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gleaned from the operations side of running a service.
Its core principles — involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
redirecting excess operational work to the product development teams:
This approach works well when the entire organization — SRE and development alike — understands why the safety valve mechanism exists, and supports the goal of having no overflow events because the product doesn’t generate
When they are focused on operations work, on average, SREs should receive a maximum of two events per 8–12-hour on-call shift.
If more than two events occur regularly per on-call shift, problems can’t be investigated thoroughly and engineers are sufficiently overwhelmed to prevent them from learning from these events.
Product development and SRE teams can enjoy a productive working relationship by eliminating the structural conflict in their respective goals. The structural conflict is between pace of innovation and product stability, and as described earlier, this conflict often is expressed indirectly. In SRE we bring this conflict to the fore, and then resolve it with the introduction of an error budget.
SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity.
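As a rough illustration of the arithmetic behind an error budget (the 99.9% target and the 30-day window below are assumed for the example, not taken from the text): the budget is simply one minus the availability target, converted into how much failure the service is allowed per period.

# Illustrative error-budget arithmetic; the SLO and the window are assumptions.
slo = 0.999                     # availability target agreed between SRE and product
error_budget = 1 - slo          # fraction of the window allowed to be "bad"
window_minutes = 30 * 24 * 60   # a 30-day rolling window
allowed_bad_minutes = error_budget * window_minutes
print(f"Budget: {error_budget:.1%} of the window, i.e. {allowed_bad_minutes:.0f} minutes per 30 days")

Spending that budget on launches and experiments, rather than hoarding it, is what "maximum feature velocity" refers to.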
a system that requires a human to read an email and decide whether or not some type of action needs to be taken in response is fundamentally flawed. Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
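A minimal sketch of that division of labor, with invented metric names and thresholds (nothing here is from the book): the software evaluates the signal and emits a notification only when a human actually needs to act.

from typing import Optional

# Hypothetical alert evaluation: software interprets the signal; a human is only
# notified when action is required. Thresholds are illustrative, not Google's.
def evaluate(error_rate: float, budget_burn_rate: float) -> Optional[str]:
    """Return an actionable notification, or None when no human attention is needed."""
    if budget_burn_rate > 10.0:    # budget would be gone in hours: page someone now
        return f"PAGE: error rate {error_rate:.2%}, fast error-budget burn"
    if budget_burn_rate > 1.0:     # budget on track to run out this window: file a ticket
        return f"TICKET: error rate {error_rate:.2%}, slow error-budget burn"
    return None                    # nothing for a person to interpret or do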
When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”
Implementing progressive rollouts
Quickly and accurately detecting problems
Rolling back changes safely when problems arise
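A sketch of how those three pieces might fit together (the stage fractions, hook names, and health check are assumptions made for illustration, not the book's implementation):

# Illustrative canary-style rollout: progressive stages, health detection, rollback.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of traffic on the new version

def progressive_rollout(deploy, is_healthy, rollback) -> bool:
    """deploy(fraction), is_healthy() -> bool, and rollback() are supplied by the caller."""
    for fraction in ROLLOUT_STAGES:
        deploy(fraction)            # widen the rollout one stage at a time
        if not is_healthy():        # detect problems quickly and accurately
            rollback()              # roll the change back safely
            return False
    return True                     # change is fully deployed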
ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability.
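A toy version of that sizing step (the N+2 rule, the demand forecast, and the per-replica capacity are assumptions for the example): take projected peak demand, work out how many replicas can serve it, then add spares so the service survives planned maintenance plus an unplanned failure.

import math

# Hypothetical N+2 capacity plan: serve projected peak load even with one replica
# down for an upgrade and another lost to failure. All numbers are made up.
projected_peak_qps = 120_000   # forecast demand
qps_per_replica = 10_000       # measured capacity of a single replica
serving_replicas = math.ceil(projected_peak_qps / qps_per_replica)
provisioned = serving_replicas + 2   # +1 for maintenance, +1 for failure
print(f"Provision {provisioned} replicas to serve {projected_peak_qps} QPS with N+2 redundancy")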
It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better!
experience shows that as we build systems, cost does not increase linearly as reliability increments — an incremental improvement in reliability may cost 100x more than the previous increment.
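One way to see why each increment costs so much more (the yearly window is only an assumption for the arithmetic): every additional "nine" of availability cuts the permitted downtime by a factor of ten, so each step leaves a tenth as much room for failure as the one before.

# Permitted downtime per year for successive availability targets; each extra
# "nine" allows 10x less downtime, which is why the next increment is so costly.
MINUTES_PER_YEAR = 365 * 24 * 60
for nines in range(2, 6):                  # 99% through 99.999%
    target = 1 - 10 ** -nines
    downtime = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} available -> {downtime:7.1f} minutes of downtime per year")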