Kindle Notes & Highlights
by Betsy Beyer
Read between March 30 - May 11, 2018
Hiring experienced, qualified SREs is difficult and costly. Despite enormous effort from the recruiting organization, there are never enough SREs to support all the services that need their expertise. Once SREs are hired, their training is also a lengthier process than is typical for development engineers.
“Hope is not a strategy.” This rallying cry of the SRE team at Google sums up what we mean by preparedness and disaster testing. The SRE culture is forever vigilant and constantly questioning: What could go wrong?
The lifeguards may have been well prepared for what happened in practice, but might feel like they haven’t done an adequate job. Similar to Google, lifeguarding embraces a culture of blameless incident analysis. Incidents are chaotic, and many factors contribute to any given incident. In this field, it’s not helpful to place blame on a single individual.
Data-driven decisions win over decisions based on feelings, hunches, or the opinion of the most senior employee in the room.
Decisions should be informed rather than prescriptive, and are made without deference to personal opinions — even that of the most-senior person in the room, who Eric Schmidt and Jonathan Rosenberg dub the “HiPPO,” for “Highest-Paid Person’s Opinion” [Sch14].
SRE teams are constructed so that our engineers divide their time between two equally important types of work. SREs staff on-call shifts, which entail putting our hands around the systems, observing where and how these systems break, and understanding challenges such as how to best scale them. But we also have time to then reflect and decide what to build in order to make those systems easier to manage. In essence, we have the pleasure of playing both the roles of the pilot and the engineer/designer.
Use load testing rather than tradition to establish the resource-to-capacity ratio: a cluster of X machines could handle Y queries per second three months ago, but can it still do so given changes to the system?
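As a rough illustration of deriving the resource-to-capacity ratio from a fresh load test rather than from an old figure, the sketch below turns a measured result into a machine count. The function name, its parameters, and the `headroom` fraction are assumptions made for this example; they do not come from the highlighted text.

```python
import math


def machines_needed(measured_qps, machines_tested, target_qps, headroom=0.25):
    """Estimate how many machines are needed to serve target_qps.

    measured_qps    -- peak QPS sustained by the cluster in the latest load test
    machines_tested -- number of machines in that test cluster
    target_qps      -- the load we want to be able to serve
    headroom        -- fraction of each machine's capacity kept in reserve
                       (hypothetical safety margin, not a value from the book)
    """
    if machines_tested <= 0 or measured_qps <= 0:
        raise ValueError("load test results must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")

    # Ratio taken from the current load test, not from a months-old number.
    qps_per_machine = measured_qps / machines_tested
    usable_qps_per_machine = qps_per_machine * (1 - headroom)
    return math.ceil(target_qps / usable_qps_per_machine)
```

Rerunning the load test after significant changes and feeding the new numbers into a calculation like this keeps the ratio tied to the system as it is today.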
Every client that makes an RPC must implement exponential backoff (with jitter) for retries, to dampen error amplification. Mobile clients are especially troublesome because there may be millions of them and updating their code to fix behavior takes a significant amount of time — possibly weeks — and requires that users install updates.
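A minimal sketch of client-side retries with exponential backoff and jitter might look like the following. It uses the "full jitter" variant, where each sleep is a random fraction of a capped, doubling ceiling; that choice is one common option, not something the highlight prescribes. `RetryableError`, `call_with_retries`, and all default values are hypothetical names and numbers chosen for this example.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(rpc, max_retries=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call rpc(), retrying on RetryableError with jittered exponential backoff.

    The ceiling for the n-th retry is min(cap, base * 2**n) seconds, and the
    actual sleep is a random fraction of that ceiling, so that many clients
    failing at the same moment do not all retry at the same moment.
    """
    for attempt in range(max_retries + 1):
        try:
            return rpc()
        except RetryableError:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the error
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters are injectable only so the behavior can be exercised deterministically in tests; production callers would normally leave them at their defaults.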
We’ve found that at least eight people need to be part of the on-call team, in order to avoid fatigue and allow sustainable staffing and low turnover. Preferably, those on-call should be in two well-separated geographic locations (e.g., California and Ireland) to provide a better quality of life by avoiding nighttime pages; in this case, six people at each site is the minimum team size.