Site Reliability Engineering: How Google Runs Production Systems
Google’s story is a story of scaling up.
changed its mind
Our many questions are the real legacy of the volume:
Tools were only components in processes, working alongside chains of software, people, and data.
scaling is far more than just a photographic enlargement of a textbook computer architecture.
Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort.
Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second, despite estimates that 40–90% of the total costs of a system are incurred after birth.
reliability is the most fundamental feature of any product:
managing change itself is so tightly coupled with failures of all kinds,
And because their vocabulary and risk assumptions differ, both groups often resort to a familiar form of trench warfare to advance their interests.
SRE is what happens when you ask a software engineer to design an operations team.
quickly become bored by performing tasks by hand,
over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself:
redirecting excess operational work to the product development teams:
This also provides an effective feedback mechanism, guiding developers to build systems that don’t need manual intervention.
Postmortems should be written for all significant incidents, regardless of whether or not they paged;
This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time.
100% is the wrong reliability target for basically everything
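One way to make this highlight concrete: any target below 100% implies some amount of tolerated unavailability. The sketch below is only an illustration, assuming a simple time-based availability target; the function name `allowed_downtime` and its parameters are hypothetical and do not come from the book.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta) -> timedelta:
    """Downtime that a time-based availability target tolerates over a window.

    `target` is a fraction between 0 and 1 (for example 0.999 for "three nines").
    This is an illustrative assumption about how a target might be expressed.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    # The tolerated unavailability is whatever fraction the target leaves over.
    return window * (1.0 - target)


if __name__ == "__main__":
    month = timedelta(days=30)
    for target in (0.99, 0.999, 1.0):
        print(f"{target:.3%} over 30 days tolerates {allowed_downtime(target, month)}")
```

Under these assumptions, a target of exactly 1.0 leaves zero tolerated downtime, so no failure of any kind could be absorbed.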
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
No one needs to look at this information, but it is recorded for diagnostic or forensic purposes.
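The two monitoring highlights above can be read as a routing rule: software decides whether a human needs to act, and everything else is only recorded. A minimal sketch of that idea follows. The three-way split, the `Output` enum, and the `route` function are assumptions made for illustration, not an API from the book.

```python
from enum import Enum


class Output(Enum):
    PAGE = "page"      # a human needs to act right away
    TICKET = "ticket"  # a human needs to act, but not immediately
    LOG = "log"        # nobody needs to look; kept for diagnostics or forensics


def route(needs_human_action: bool, urgent: bool) -> Output:
    """Decide where a monitoring event goes (hypothetical rule)."""
    if not needs_human_action:
        # Record only; no notification is sent.
        return Output.LOG
    return Output.PAGE if urgent else Output.TICKET


if __name__ == "__main__":
    print(route(needs_human_action=True, urgent=True))
    print(route(needs_human_action=True, urgent=False))
    print(route(needs_human_action=False, urgent=False))
```

In this sketch the urgency flag is ignored when no human action is needed, so an event that requires no action never notifies anyone.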
the SRE team must be in charge of capacity planning,
they also must be in charge of provisioning.