Kenneth LeFebvre’s Kindle Notes & Highlights for Site Reliability Engineering: How Google Runs Production Systems

Site Reliability Engineering: How Google Runs Production Systems

by Betsy Beyer

Google’s story is a story of scaling up.

changed its mind

Our many questions are the real legacy of the volume:

Tools were only components in processes, working alongside chains of software, people, and data.

scaling is far more than just a photographic enlargement of a textbook computer architecture.

Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort.

Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second, despite estimates that 40–90% of the total costs of a system are incurred after birth.

reliability is the most fundamental feature of any product:

managing change itself is so tightly coupled with failures of all kinds,

And because their vocabulary and risk assumptions differ, both groups often resort to a familiar form of trench warfare to advance their interests.

SRE is what happens when you ask a software engineer to design an operations team.

quickly become bored by performing tasks by hand,

over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself:

redirecting excess operational work to the product development teams:

This also provides an effective feedback mechanism, guiding developers to build systems that don’t need manual intervention.

Postmortems should be written for all significant incidents, regardless of whether or not they paged;

This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time.

100% is the wrong reliability target for basically everything
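
A rough way to see why 100% is an awkward target is to turn an availability target into the downtime it tolerates. The sketch below is illustrative arithmetic only, not something taken from the book; the function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for the example.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of complete outage that an availability target tolerates.

    `target` is a fraction (0.999 means 99.9%). The 30-day default window
    is an assumption made for illustration.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


if __name__ == "__main__":
    # Each extra "nine" cuts the tolerated downtime by a factor of ten;
    # a 100% target leaves no room for any failure or risky change.
    for target in (0.99, 0.999, 0.9999, 1.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} -> {minutes:.1f} minutes per 30 days")
```

Under these assumptions a 100% target leaves zero minutes, so no release, migration, or hardware fault could ever be absorbed.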

Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.

No one needs to look at this information, but it is recorded for diagnostic or forensic purposes.
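
The two highlights above imply a routing rule: software decides whether a human is needed, and everything else is simply recorded. A minimal sketch of that rule follows, assuming a three-way split into pages, tickets, and logs. The names `Output` and `route`, and the `urgent` flag that separates immediate from deferred action, are hypothetical and only serve the example.

```python
from enum import Enum


class Output(Enum):
    PAGE = "page"      # a human must act now
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # no one needs to look; kept for diagnostics or forensics


def route(needs_human_action: bool, urgent: bool) -> Output:
    """Pick where a monitoring event goes (hypothetical rule for illustration)."""
    if not needs_human_action:
        # Nothing for a person to do, so record it and notify no one.
        return Output.LOG
    return Output.PAGE if urgent else Output.TICKET


if __name__ == "__main__":
    print(route(needs_human_action=True, urgent=True).value)
    print(route(needs_human_action=True, urgent=False).value)
    print(route(needs_human_action=False, urgent=False).value)
```

The point of the design is that the `needs_human_action` decision is made by software, so no person has to read raw signals to work out whether something is wrong.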

the SRE team must be in charge of capacity planning,

they also must be in charge of provisioning.