Site Reliability Engineering: How Google Runs Production Systems
Kindle Notes & Highlights
Google’s story is a story of scaling up.
Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort. Yet software engineering as a discipline spends much more time talking about the first period as opposed to the second, despite estimates that 40–90% of the total costs of a system are incurred after birth.
Unpacking the term a little, first and foremost, SREs are engineers. We apply the principles of computer science and engineering to the design and development of computing systems: generally, large distributed ones.
Next, we focus on system reliability. Ben Treynor Sloss, Google’s VP for 24/7 Operations, originator of the term SRE, claims that reliability is the most fundamental feature of any product: a system isn’t very useful if nobody can use it! Because reliability is so critical, SREs are focused on finding ways to improve the design and operation of systems to make them more scalable, more reliable, and more efficient.
The “site” in our name originally referred to SRE’s role in keeping the google.com website running, though we now run many more services, many of which aren’t themselves websites — from internal infrastructure such as Bigtable to products for external developers such as the Google Cloud Platform.
It is equally no surprise that of all the post-deployment characteristics of software that we could choose to devote special attention to, reliability is the one we regard as primary.
much like security, the earlier you care about reliability, the better.
thorough understanding of how to operate the systems was not enough to prevent human errors
SRE Way in mind: thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it. Welcome to our emerging profession!
Hope is not a strategy.
It is a truth universally acknowledged that systems do not run themselves.
Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.
At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension.
SRE is what happens when you ask a software engineer to design an operations team.
(a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated.
SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
Google places a 50% cap on the aggregate “ops” work for all SREs — tickets, on-call, manual tasks, etc.
we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.
Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gleaned from the operations side of running a service.
Its core principles — involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
redirecting excess operational work to the product development teams:
This approach works well when the entire organization — SRE and development alike — understands why the safety valve mechanism exists, and supports the goal of having no overflow events because the product doesn’t generate
When they are focused on operations work, on average, SREs should receive a maximum of two events per 8–12-hour on-call shift.
If more than two events occur regularly per on-call shift, problems can’t be investigated thoroughly and engineers are sufficiently overwhelmed to prevent them from learning from these events.
Product development and SRE teams can enjoy a productive working relationship by eliminating the structural conflict in their respective goals. The structural conflict is between pace of innovation and product stability, and as described earlier, this conflict often is expressed indirectly. In SRE we bring this conflict to the fore, and then resolve it with the introduction of an error budget.
SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity.
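As a rough illustration of the arithmetic behind an error budget (the 99.9% target and the 30-day window below are assumed for the example, not taken from the text): the budget is simply one minus the availability target, converted into how much failure the service is allowed per period.

# Illustrative error-budget arithmetic; the SLO and the window are assumptions.
slo = 0.999                     # availability target agreed between SRE and product
error_budget = 1 - slo          # fraction of the window allowed to be "bad"
window_minutes = 30 * 24 * 60   # a 30-day rolling window
allowed_bad_minutes = error_budget * window_minutes
print(f"Budget: {error_budget:.1%} of the window, i.e. {allowed_bad_minutes:.0f} minutes per 30 days")

Spending that budget on launches and experiments, rather than hoarding it, is what "maximum feature velocity" refers to.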
a system that requires a human to read an email and decide whether or not some type of action needs to be taken in response is fundamentally flawed. Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
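A minimal sketch of that division of labor, with invented metric names and thresholds (nothing here is from the book): the software evaluates the signal and emits a notification only when a human actually needs to act.

from typing import Optional

# Hypothetical alert evaluation: software interprets the signal; a human is only
# notified when action is required. Thresholds are illustrative, not Google's.
def evaluate(error_rate: float, budget_burn_rate: float) -> Optional[str]:
    """Return an actionable notification, or None when no human attention is needed."""
    if budget_burn_rate > 10.0:    # budget would be gone in hours: page someone now
        return f"PAGE: error rate {error_rate:.2%}, fast error-budget burn"
    if budget_burn_rate > 1.0:     # budget on track to run out this window: file a ticket
        return f"TICKET: error rate {error_rate:.2%}, slow error-budget burn"
    return None                    # nothing for a person to interpret or do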
When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”
Implementing progressive rollouts
Quickly and accurately detecting problems
Rolling back changes safely when problems arise
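A sketch of how those three pieces might fit together (the stage fractions, hook names, and health check are assumptions made for illustration, not the book's implementation):

# Illustrative canary-style rollout: progressive stages, health detection, rollback.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of traffic on the new version

def progressive_rollout(deploy, is_healthy, rollback) -> bool:
    """deploy(fraction), is_healthy() -> bool, and rollback() are supplied by the caller."""
    for fraction in ROLLOUT_STAGES:
        deploy(fraction)            # widen the rollout one stage at a time
        if not is_healthy():        # detect problems quickly and accurately
            rollback()              # roll the change back safely
            return False
    return True                     # change is fully deployed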
ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability.
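A toy version of that sizing step (the N+2 rule, the demand forecast, and the per-replica capacity are assumptions for the example): take projected peak demand, work out how many replicas can serve it, then add spares so the service survives planned maintenance plus an unplanned failure.

import math

# Hypothetical N+2 capacity plan: serve projected peak load even with one replica
# down for an upgrade and another lost to failure. All numbers are made up.
projected_peak_qps = 120_000   # forecast demand
qps_per_replica = 10_000       # measured capacity of a single replica
serving_replicas = math.ceil(projected_peak_qps / qps_per_replica)
provisioned = serving_replicas + 2   # +1 for maintenance, +1 for failure
print(f"Provision {provisioned} replicas to serve {projected_peak_qps} QPS with N+2 redundancy")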
It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better!
experience shows that as we build systems, cost does not increase linearly as reliability increments — an incremental improvement in reliability may cost 100x more than the previous increment.
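One way to see why each increment costs so much more (the yearly window is only an assumption for the arithmetic): every additional "nine" of availability cuts the permitted downtime by a factor of ten, so each step leaves a tenth as much room for failure as the one before.

# Permitted downtime per year for successive availability targets; each extra
# "nine" allows 10x less downtime, which is why the next increment is so costly.
MINUTES_PER_YEAR = 365 * 24 * 60
for nines in range(2, 6):                  # 99% through 99.999%
    target = 1 - 10 ** -nines
    downtime = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} available -> {downtime:7.1f} minutes of downtime per year")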