David Moreno’s Kindle Notes & Highlights for Site Reliability Engineering: How Google Runs Production Systems

Site Reliability Engineering: How Google Runs Production Systems

by Betsy Beyer

The tribal nature of IT culture often entrenches practitioners in dogmatic positions that hold the industry back.

SRE is what happens when you ask a software engineer to design an operations team.

In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available.
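
To put the highlight above in numbers, here is a minimal sketch of how much downtime per year a given availability target allows. The function name and the 365-day year are assumptions made for illustration; they are not taken from the book.

```python
# Allowed downtime per year for a given availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes per year a service may be down at this target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999, 100.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% available: about {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% leaves roughly 5.26 minutes of downtime per year, and 100% leaves none. That gap of about five minutes a year is the difference the highlight says users cannot perceive.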

Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
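
One way to read the highlight above is that the alerting rule itself decides whether a human is needed. The sketch below is a hypothetical illustration, not Google's implementation: the `Sample` type, the `classify` function, and both thresholds are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring window for a service (hypothetical shape)."""
    error_ratio: float  # fraction of failed requests in the window, 0.0 to 1.0


# Hypothetical thresholds, chosen only for illustration.
PAGE_THRESHOLD = 0.05    # a human must act now
TICKET_THRESHOLD = 0.01  # a human must act, but not immediately


def classify(sample: Sample) -> str:
    """Let software interpret the signal; return 'page', 'ticket', or 'log'."""
    if sample.error_ratio >= PAGE_THRESHOLD:
        return "page"
    if sample.error_ratio >= TICKET_THRESHOLD:
        return "ticket"
    # Nothing for a human to do: record the sample for later diagnosis only.
    return "log"


if __name__ == "__main__":
    for ratio in (0.001, 0.02, 0.10):
        print(f"error ratio {ratio}: {classify(Sample(ratio))}")
```

The point of the design is that nobody has to watch a dashboard and judge whether a number looks bad. The rule encodes that judgment, and a person is notified only when the outcome is "page" or "ticket".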