Kindle Notes & Highlights
by Betsy Beyer
Read between March 28 – October 15, 2020
Google drew a line in the silicon, forcing that fate into being. The revised role was called SRE, or Site Reliability Engineer.
Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort.
They are people who stand on the cusp between one way of looking at the world and another one: like Newton, who is sometimes called not the world’s first physicist, but the world’s last alchemist.
And taking the historical view, who, then, looking back, might be the first SRE?
We like to think that Margaret Hamilton, working on the Apollo program on loan from MIT, had all of the significant traits of the first SRE.
Accordingly, for the systems you look after, for the groups you work in, or for the organizations you’re building, please bear the SRE Way in mind: thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it. Welcome to our emerging profession!
Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.
SRE is what happens when you ask a software engineer to design an operations team.
By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available. There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available.
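As a back-of-the-envelope illustration (the component availabilities here are assumptions, not figures from the text): independent components in series multiply, so a 99.999% service reached through a 99% home WiFi link and a 99% ISP delivers roughly 0.99999 × 0.99 × 0.99 ≈ 0.980, or about 98% end-to-end availability. The service's extra nines vanish behind the weakest links in the path.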
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
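The book elsewhere distinguishes three valid kinds of monitoring output: pages (a human must act immediately), tickets (a human must act, but not right away), and logging (nobody is notified). A minimal sketch of that routing in Python; the two input signals are illustrative assumptions, not the book's implementation:

```python
from enum import Enum


class Output(Enum):
    PAGE = "alert: a human must act immediately"
    TICKET = "ticket: a human must act, but not immediately"
    LOG = "log: kept for diagnostics, nobody is notified"


def route(needs_action: bool, user_visible_now: bool) -> Output:
    """Software does the interpreting; a human is paged only when
    immediate action is required."""
    if not needs_action:
        return Output.LOG
    return Output.PAGE if user_visible_now else Output.TICKET
```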
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR) [Sch15]. The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health — that is, the MTTR.
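The highlight does not spell the relationship out, but the standard steady-state form is Availability ≈ MTTF / (MTTF + MTTR). For example, a service that fails on average every 1,000 hours and takes 10 hours to restore is available about 1000 / 1010 ≈ 99.0% of the time; cutting MTTR to 1 hour lifts that to 1000 / 1001 ≈ 99.9%, which is why fast, practiced emergency response buys reliability as directly as rarer failures do.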
- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise
Motivated originally by familiarity — “as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks” — it has become much more: a set of principles, a set of practices, a set of incentives, and a field of endeavor within the larger software engineering discipline.
After all, people regularly use www.google.com to check if their Internet connection is set up correctly.
In 2006, YouTube was focused on consumers and was in a very different phase of its business lifecycle than Google was at the time.
If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be?
In this case, if the cost of improving availability by one nine is less than $900, it is worth the investment. If the cost is greater than $900, the costs will exceed the projected increase in revenue.
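For context, the $900 falls out of arithmetic like this (the inputs are the illustrative figures of that example, not universal constants): raising the availability target from 99.9% to 99.99% adds 0.09% availability; applied to a service earning $1M, the expected incremental revenue is $1,000,000 × 0.0009 = $900.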
Service latency for our Ads systems provides an illustrative example. When Google first launched Web Search, one of the service’s key distinguishing features was speed. When we introduced AdWords, which displays advertisements next to search results, a key requirement of the system was that the ads should not slow down the search experience.
It’s a best practice to test a new release on some small subset of a typical workload, a practice often called canarying. How long do we wait, and how big is the canary?
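One way to reason about "how long and how big" is to bound what a bad release could cost while it is confined to the canary. A rough sketch in Python; the traffic numbers, failure rate, and budget below are illustrative assumptions:

```python
def canary_cost_in_errors(total_qps: float, canary_fraction: float,
                          canary_duration_s: float,
                          assumed_failure_rate: float) -> float:
    """Expected failed requests if the release is bad and stays confined to
    the canary population for the whole evaluation window."""
    return total_qps * canary_fraction * canary_duration_s * assumed_failure_rate


def fits_in_error_budget(expected_failures: float,
                         remaining_budget_requests: float) -> bool:
    """A bigger or longer canary detects problems more reliably, but a bad
    release then burns more error budget; this bounds that cost."""
    return expected_failures <= remaining_budget_requests


if __name__ == "__main__":
    failures = canary_cost_in_errors(total_qps=10_000, canary_fraction=0.01,
                                     canary_duration_s=600,
                                     assumed_failure_rate=0.05)
    print(failures, fits_in_error_budget(failures, remaining_budget_requests=5_000))
```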
(Indeed, Google SRE’s unofficial motto is “Hope is not a strategy.”)
The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
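Worked with illustrative numbers: against a quarterly availability SLO of 99.9%, measured uptime of 99.97% so far leaves 99.97% − 99.9% = 0.07 percentage points of budget, roughly 0.0007 × 90 days × 24 hours ≈ 1.5 hours of full outage (or the equivalent fraction of failed requests) still available to spend on risky launches.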
100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take.
An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided.
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.
On the other hand, you can say that you want the average latency per request to be under 100 milliseconds, and setting such a goal could in turn motivate you to write your frontend with low-latency behaviors of various kinds or to buy certain kinds of low-latency equipment.
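A minimal sketch of how the two terms relate, using the 100 millisecond example from the sentence above (the sample latencies are made up):

```python
from statistics import mean


def average_latency_ms(latencies_ms: list[float]) -> float:
    """SLI: a carefully defined quantitative measure (here, mean request latency)."""
    return mean(latencies_ms)


def meets_slo(sli_ms: float, target_ms: float = 100.0) -> bool:
    """SLO: a target value for that SLI (here, average latency under 100 ms)."""
    return sli_ms < target_ms


if __name__ == "__main__":
    observed = [12.0, 48.5, 95.0, 310.0, 22.4, 80.1, 67.9, 101.3, 55.0, 71.2]
    sli = average_latency_ms(observed)
    print(f"SLI = {sli:.1f} ms, meets SLO: {meets_slo(sli)}")
```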
Many indicator metrics are most naturally gathered on the server side, using a monitoring system such as Borgmon (see Chapter 10) or Prometheus, or with periodic log analysis.
You can always refine SLO definitions and targets over time as you learn about a system’s behavior. It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.
Build tools must allow us to ensure consistency and repeatability. If two people attempt to build the same product at the same revision number in the source code repository on different machines, we expect identical results.
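A small way to verify that property after the fact; this sketch assumes the two artifacts have been copied from the two machines, and the use of SHA-256 is my choice rather than a description of Google's build tooling:

```python
import hashlib
import sys


def artifact_digest(path: str) -> str:
    """SHA-256 digest of a build output; identical builds yield identical digests."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


if __name__ == "__main__":
    # Usage: python check_build.py artifact_from_machine_a artifact_from_machine_b
    a, b = sys.argv[1], sys.argv[2]
    if artifact_digest(a) == artifact_digest(b):
        print("builds are bit-for-bit identical")
    else:
        print("builds differ: not reproducible at this revision")
        sys.exit(1)
```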