Site Reliability Engineering: How Google Runs Production Systems
Kindle Notes & Highlights
3%
Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.
3%
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
3%
SRE teams are characterized by both rapid innovation and a large acceptance of change.
3%
The number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system.
3%
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
Paul
Goal
4%
This is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on.
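As a rough illustration of the monitoring described in this highlight, the sketch below computes how much operational work exceeds a cap and would therefore be redirected. The function name `excess_operational_hours` and the `ops_cap` parameter are hypothetical, and the highlight itself does not state a cap value, so the caller supplies one as an assumption.

```python
def excess_operational_hours(ops_hours: float, total_hours: float, ops_cap: float) -> float:
    """Return the operational hours above the cap (hypothetical helper).

    ops_cap is the fraction of total time allowed for operational work.
    Its value is an assumption supplied by the caller.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= ops_cap <= 1:
        raise ValueError("ops_cap must be between 0 and 1")
    # Anything above the allowed share is the excess to hand back to developers.
    return max(0.0, ops_hours - ops_cap * total_hours)


# Illustrative numbers only: 30 of 40 hours spent on operations, cap of 0.5.
print(excess_operational_hours(30, 40, 0.5))  # 10.0
```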
4%
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
4%
When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”
4%
SRE has found that roughly 70% of outages are due to changes in a live system.
4%
Implementing progressive rollouts; quickly and accurately detecting problems; rolling back changes safely when problems arise.
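The three practices in this highlight (progressive rollout, problem detection, safe rollback) can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `deploy`, `error_rate`, and `rollback` are hypothetical hooks supplied by the caller, not functions from any real deployment tool, and a single error-rate threshold stands in for whatever detection a real system would use.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float,
) -> bool:
    """Roll a change out stage by stage; roll back on the first bad reading.

    All callables are hypothetical hooks. Returns True if every stage
    completed, False if the change was rolled back.
    """
    for fraction in stages:
        deploy(fraction)  # expose the change to this share of traffic
        if error_rate() > max_error_rate:  # detect problems after each stage
            rollback()  # revert the change
            return False
    return True
```

A caller might pass stages such as `[0.01, 0.1, 1.0]`, so that a bad change is caught while it affects only a small share of traffic.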
4%
Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive.
7%
The opportunity cost: the cost borne by an organization when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users. These engineers no longer work on new features and products for end users.
7%
Service failures can have many potential effects, including user dissatisfaction, harm, or loss of trust; direct or indirect revenue loss; brand or reputational impact; and undesirable press coverage.
7%
Equation 3-2. Aggregate availability. For example, a system that serves 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.
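The arithmetic in this example can be checked with a short sketch. It assumes the request-based definition of availability (successful requests divided by total requests). The function name `allowed_errors` is hypothetical and used only for illustration.

```python
from fractions import Fraction


def allowed_errors(total_requests: int, availability_target: str) -> int:
    """Return how many failed requests still meet the availability target.

    Assumes availability = successful requests / total requests.
    The target is passed as a decimal string (e.g. "0.9999") so it converts
    to an exact fraction without floating-point rounding.
    """
    target = Fraction(availability_target)
    # Error budget = total * (1 - target), truncated to whole requests.
    return int(total_requests * (1 - target))


print(allowed_errors(2_500_000, "0.9999"))  # 250
```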