Paul’s Kindle Notes & Highlights for Site Reliability Engineering: How Google Runs Production Systems

Site Reliability Engineering: How Google Runs Production Systems

by Betsy Beyer

Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.

Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.

SRE teams are characterized by both rapid innovation and a large acceptance of change.

the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system.

In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

Goal

this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on.

Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
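One way to read the highlight above is as a routing rule: software evaluates the signal, and a person is paged only when there is something for them to do. The sketch below is a minimal illustration under assumptions that are not in the book excerpt: the names `needs_human` and `handle`, the error-ratio threshold, and the `auto_remediated` flag are all hypothetical.

```python
def needs_human(error_ratio: float, threshold: float, auto_remediated: bool) -> bool:
    """Software does the interpreting: decide whether a person must act.

    All three parameters are hypothetical inputs chosen for illustration.
    """
    # A breach that automation has already handled needs no human action.
    return error_ratio > threshold and not auto_remediated


def handle(error_ratio: float, threshold: float, auto_remediated: bool) -> str:
    """Return "page" when a human must act, otherwise "record"."""
    if needs_human(error_ratio, threshold, auto_remediated):
        return "page"
    # Nothing for a human to do: keep the data for later analysis only.
    return "record"


print(handle(0.01, 0.001, auto_remediated=False))
print(handle(0.01, 0.001, auto_remediated=True))
```

The point of the split is that the decision logic lives in code, so nobody has to read a graph and decide whether it counts as an incident.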

When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”

SRE has found that roughly 70% of outages are due to changes in a live system.

Implementing progressive rollouts
Quickly and accurately detecting problems
Rolling back changes safely when problems arise
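The three practices above (progressive rollout, fast detection, safe rollback) can be sketched as a single loop. This is a rough illustration, not how any particular release system works: the function `progressive_rollout`, the stage fractions, and the `deploy`, `healthy`, and `rollback` callbacks are all hypothetical stand-ins for real deployment and monitoring hooks.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Expand a change stage by stage; roll back at the first failed check.

    Returns True if every stage passed, False if the change was rolled back.
    """
    for fraction in stages:
        deploy(fraction)      # expose the change to a larger share of traffic
        if not healthy():     # detect problems before expanding further
            rollback()        # revert the change
            return False
    return True


# Simulated environment: the change misbehaves once it reaches half of traffic.
state = {"fraction": 0.0}


def fake_deploy(fraction: float) -> None:
    state["fraction"] = fraction


def fake_rollback() -> None:
    state["fraction"] = 0.0


ok = progressive_rollout(
    [0.01, 0.1, 1.0],
    fake_deploy,
    lambda: state["fraction"] < 0.5,
    fake_rollback,
)
print(ok, state["fraction"])
```

Checking health after every stage limits how much traffic a bad change can reach before it is reverted.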

Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive.

The opportunity cost: The cost borne by an organization when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users. These engineers no longer work on new features and products for end users.

Service failures can have many potential effects, including user dissatisfaction, harm, or loss of trust; direct or indirect revenue loss; brand or reputational impact; and undesirable press coverage.

Equation 3-2. Aggregate availability

For example, a system that serves 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.
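The arithmetic in this example can be checked with a short sketch. It assumes aggregate availability is measured as successful requests divided by total requests, which the highlight does not spell out. The helper names `allowed_errors` and `meets_target` are made up for illustration, and `Decimal` is used only to avoid floating-point rounding in `1 - 0.9999`.

```python
from decimal import Decimal


def allowed_errors(total_requests: int, availability_target: str) -> int:
    """Failed requests permitted while still meeting the target.

    The target is passed as a string (e.g. "0.9999") so Decimal parses it exactly.
    """
    budget = total_requests * (Decimal(1) - Decimal(availability_target))
    return int(budget)  # truncation equals floor for non-negative budgets


def meets_target(successful: int, total: int, availability_target: str) -> bool:
    """True if successful / total is at least the target (compared without division)."""
    return Decimal(successful) >= total * Decimal(availability_target)


print(allowed_errors(2_500_000, "0.9999"))  # 250
print(meets_target(2_500_000 - 250, 2_500_000, "0.9999"))  # True
print(meets_target(2_500_000 - 251, 2_500_000, "0.9999"))  # False
```

With 2.5M requests and a 99.99% target, the budget is 2,500,000 × 0.0001 = 250 failed requests, and the 251st error misses the target for that day.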