Site Reliability Engineering: How Google Runs Production Systems
Google drew a line in the silicon, forcing that fate into being. The revised role was called SRE, or Site Reliability Engineer.
Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort.
They are people who stand on the cusp between one way of looking at the world and another one: like Newton, who is sometimes called not the world’s first physicist, but the world’s last alchemist.
And taking the historical view, who, then, looking back, might be the first SRE?
We like to think that Margaret Hamilton, working on the Apollo program on loan from MIT, had all of the significant traits of the first SRE.
Accordingly, for the systems you look after, for the groups you work in, or for the organizations you’re building, please bear the SRE Way in mind: thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it. Welcome to our emerging profession!
Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.
SRE is what happens when you ask a software engineer to design an operations team.
By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available. There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available.
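The arithmetic behind this claim: for independent systems in series, end-to-end availability is roughly the product of the availabilities along the path. A minimal sketch in Python, with made-up figures for illustration:

    # End-to-end availability across independent components in series
    # is approximately the product of their availabilities.
    # All figures below are illustrative, not measurements.
    path = {
        "laptop": 0.999,
        "home_wifi": 0.995,
        "isp": 0.999,
        "service": 0.99999,  # a "five nines" service
    }

    end_to_end = 1.0
    for availability in path.values():
        end_to_end *= availability

    print(f"End-to-end availability: {end_to_end:.5f}")  # ~0.99300

The user's experience is dominated by the weaker links in the path, so pushing the service from 99.999% toward 100% is invisible to them.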
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
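One hedged illustration of "software should do the interpreting": put the decision logic in the alerting pipeline itself, so a human is paged only when the combination of signals is actionable. The metric names and thresholds below are hypothetical, not Google's rules.

    # Hypothetical sketch: software interprets the signals; a human is
    # paged only when action is required. Names/thresholds are assumed.
    def should_page(error_rate: float, error_budget_left: float) -> bool:
        burning = error_rate > 0.001               # above the SLO's error rate
        budget_at_risk = error_budget_left < 0.25  # <25% of budget remains
        return burning and budget_at_risk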
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR) [Sch15]. The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health — that is, the MTTR.
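The steady-state relationship behind those two terms is commonly written as availability = MTTF / (MTTF + MTTR). A sketch with made-up hour figures, showing why MTTR is the lever during emergency response:

    # Availability from mean time to failure (MTTF) and mean time to
    # repair (MTTR); the hour figures are illustrative.
    mttf_hours = 1000.0   # mean time between failures
    mttr_hours = 1.0      # mean time to restore service

    availability = mttf_hours / (mttf_hours + mttr_hours)
    print(f"Availability: {availability:.4%}")       # ~99.9001%

    halved_mttr = mttf_hours / (mttf_hours + mttr_hours / 2)
    print(f"With MTTR halved: {halved_mttr:.4%}")    # ~99.9500%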
- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise
Motivated originally by familiarity — “as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks” — it has become much more: a set of principles, a set of practices, a set of incentives, and a field of endeavor within the larger software engineering discipline.
After all, people regularly use www.google.com to check if their Internet connection is set up correctly.
In 2006, YouTube was focused on consumers and was in a very different phase of its business lifecycle than Google was at the time.
If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be?
In this case, if the cost of improving availability by one nine is less than $900, it is worth the investment. If the cost is greater than $900, the costs will exceed the projected increase in revenue.
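The arithmetic behind the $900 threshold, assuming (as in the surrounding example) annual service revenue of $1M and a proposed move from 99.9% to 99.99% availability:

    # Value of one more nine if revenue scales with availability.
    # The revenue figure is the example's assumption, not a constant.
    revenue = 1_000_000          # annual service revenue
    current_target = 0.999       # 99.9%
    proposed_target = 0.9999     # 99.99%

    value = (proposed_target - current_target) * revenue
    print(f"Incremental revenue: ${value:.0f}")  # $900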
Service latency for our Ads systems provides an illustrative example. When Google first launched Web Search, one of the service’s key distinguishing features was speed. When we introduced AdWords, which displays advertisements next to search results, a key requirement of the system was that the ads should not slow down the search experience.
It’s a best practice to test a new release on some small subset of a typical workload, a practice often called canarying. How long do we wait, and how big is the canary?
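The passage leaves those two questions open; one hedged sketch of the comparison logic a canary evaluation might use. The function name and the tolerance factor are assumptions for illustration, not Google's method.

    # Hypothetical canary check: widen the rollout only if the canary's
    # error rate is no worse than `tolerance` times the stable baseline.
    def canary_is_healthy(canary_errors: int, canary_requests: int,
                          baseline_error_rate: float,
                          tolerance: float = 2.0) -> bool:
        if canary_requests == 0:
            return False  # not enough signal yet; keep waiting
        canary_error_rate = canary_errors / canary_requests
        return canary_error_rate <= tolerance * baseline_error_rate

Waiting longer and sizing the canary larger both buy more signal at the cost of exposing more traffic to a possibly bad release.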
(Indeed, Google SRE’s unofficial motto is “Hope is not a strategy.”)
The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
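A quick sketch of that subtraction for a hypothetical 99.9% quarterly availability SLO:

    # Error budget: the gap between the SLO and perfection is the
    # unreliability the service may spend. Figures are illustrative.
    slo = 0.999                      # quarterly availability target
    quarter_minutes = 91 * 24 * 60   # ~one quarter, in minutes

    allowed_downtime = (1 - slo) * quarter_minutes
    print(f"Downtime budget: {allowed_downtime:.0f} minutes")  # ~131

Releases, experiments, and incidents all draw down the same budget.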
100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take.
An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided.
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.
On the other hand, you can say that you want the average latency per request to be under 100 milliseconds, and setting such a goal could in turn motivate you to write your frontend with low-latency behaviors of various kinds or to buy certain kinds of low-latency equipment.
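A minimal sketch of measuring that SLI against the 100-millisecond objective (the latency samples are fabricated):

    # SLI: mean request latency; SLO: mean under 100 ms.
    latencies_ms = [42, 87, 95, 110, 63, 78, 240, 55]  # made-up samples

    sli = sum(latencies_ms) / len(latencies_ms)  # the indicator
    slo_target_ms = 100                          # the objective

    print(f"Mean latency: {sli:.1f} ms")         # 96.2 ms
    print("SLO met" if sli <= slo_target_ms else "SLO missed")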
Many indicator metrics are most naturally gathered on the server side, using a monitoring system such as Borgmon (see Chapter 10) or Prometheus, or with periodic log analysis.
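As one hedged example of server-side gathering, Prometheus exposes whatever it has scraped through its standard HTTP query API; the server address and the http_requests_total metric below are illustrative assumptions.

    # Pull an availability SLI from a Prometheus server's query API.
    import json
    import urllib.parse
    import urllib.request

    query = ('sum(rate(http_requests_total{code!~"5.."}[5m]))'
             ' / sum(rate(http_requests_total[5m]))')
    url = ("http://prometheus.example.com:9090/api/v1/query?query="
           + urllib.parse.quote(query))

    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)
    # result["data"]["result"][0]["value"][1] is the fraction of
    # non-5xx requests over the last five minutes: an availability SLI.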
You can always refine SLO definitions and targets over time as you learn about a system’s behavior. It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.
Build tools must allow us to ensure consistency and repeatability. If two people attempt to build the same product at the same revision number in the source code repository on different machines, we expect identical results.
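A small sketch of checking that expectation, comparing two independently built artifacts byte for byte (the file paths are hypothetical):

    # Repeatable builds: the same revision built on two machines should
    # yield bit-identical output. File paths below are hypothetical.
    import hashlib

    def digest(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    assert digest("machine_a/out/server") == digest("machine_b/out/server"), \
        "builds are not reproducible"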