Kindle Notes & Highlights
by Betsy Beyer
Read between September 12 - December 27, 2017
It’s a common mistake to assume that an overloaded backend should turn down and stop accepting all traffic.
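One alternative to turning down entirely is for the backend to keep serving what it can and reject only the excess. The sketch below illustrates that idea under stated assumptions: the `Backend` class, the `max_in_flight` limit, and the method names are hypothetical and do not come from the book.

```python
class Backend:
    """Toy backend that sheds excess load instead of turning down entirely."""

    def __init__(self, max_in_flight: int) -> None:
        # Hypothetical capacity limit: how many requests may be served at once.
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_accept(self) -> bool:
        """Accept the request if there is spare capacity; otherwise reject only this request."""
        if self.in_flight >= self.max_in_flight:
            # Shed this one request; already-accepted requests keep being served.
            return False
        self.in_flight += 1
        return True

    def finish(self) -> None:
        """Mark one accepted request as complete, freeing capacity."""
        if self.in_flight > 0:
            self.in_flight -= 1


backend = Backend(max_in_flight=3)
results = [backend.try_accept() for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Here the overloaded backend still serves three requests and rejects two, rather than rejecting all five.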
A cascading failure is a failure that grows over time as a result of positive feedback. It can occur when a portion of an overall system fails, increasing the probability that other portions of the system fail.
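The feedback loop can be illustrated with a toy model. This is a minimal sketch, assuming load is split evenly across live replicas and a replica fails as soon as its share exceeds its capacity. The function name and the numbers are illustrative, not taken from the book.

```python
def simulate_cascade(capacities, total_load):
    """Return the capacities of the replicas still alive once the system settles.

    Each round, total_load is split evenly across live replicas. Any replica
    whose share exceeds its capacity fails, which raises the share for the
    rest in the next round (the positive feedback).
    """
    alive = list(capacities)
    while alive:
        share = total_load / len(alive)
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # No new failures this round, so the system is stable.
        alive = survivors
    return alive


# Four replicas carry 90 units each, which is within capacity.
print(simulate_cascade([100, 100, 100, 100], 360))  # [100, 100, 100, 100]

# With one replica gone, each survivor is asked for 120 units and all fail.
print(simulate_cascade([100, 100, 100], 360))  # []
```

Losing a single replica pushes the remaining ones over capacity, so a partial failure becomes a total one.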
Byzantine failures occur when a process passes incorrect messages due to a bug or malicious activity. They are comparatively costly to handle and less often encountered.
When considering data integrity, what matters is that services in the cloud remain accessible to users. User access to data is especially important.
To revise our earlier definition of data integrity, we might say that data integrity means that services in the cloud remain accessible to users. User access to data is especially important, so this access should remain in perfect shape.
No one really wants to make backups; what people really want are restores.
SRE students will have questions like the following: What am I working on? How much progress have I made? When will these activities accumulate enough experience for me to go on-call?
Pages always have an expected response time (SLO), which is sometimes measured in minutes.
Tickets may also have an SLO, but response time is more likely measured in hours, days, or weeks.
Tickets are managed in a few different ways, depending on the SRE team: a primary on-call engineer might work on tickets while on-call, a secondary engineer might work on tickets while on-call, or a team can have a dedicated ticket person who is not on-call. Tickets might be randomly autodistributed among team members, or team members might be expected to service tickets ad hoc.
On the other hand, when a person is concentrating full-time on interrupts, interrupts stop being interrupts.
However, viewing an engineer as an interruptible unit of work, whose context switches are free, is suboptimal if you want people to be happy and productive.
Polarizing time means that when a person comes into work each day, they should know if they’re doing just project work or just interrupts.
There should be a handoff for tickets, as well as for on-call work. A handoff process maintains shared state between ticket handlers as responsibility switches over.
Remind the team that more tickets should not require more SREs: the goal of the SRE model is to only introduce more humans as more complexity is added to the system. Instead, try to draw attention to how healthy work habits reduce the time spent on tickets.
“Mistakes are inevitable in any system with multiple subtle interactions. You were on-call, and I trust you to make the right decisions with the right information. I’d like you to write down what you were thinking at each point in time, so that we can find out where the system misled you, and where the cognitive demands were too high.”
In the first case, just as in software engineering — where the earlier the bug is found, the cheaper it is to fix — the earlier an SRE team consultation happens, the better the service will be and the quicker it will feel the benefit.
Not all Google services receive close SRE engagement. A couple of factors are at play here: Many services don’t need high reliability and availability, so support can be provided by other means. By design, the number of development teams that request SRE support exceeds the available bandwidth of SRE teams (see Chapter 1).
“Hope is not a strategy.” This rallying cry of the SRE team at Google sums up what we mean by preparedness and disaster testing. The SRE culture is forever vigilant and constantly questioning: What could go wrong? What action can we take to address those issues before they lead to an outage or data loss?
At Google, we constantly walk a tightrope between user expectations for high reliability and a laser-sharp focus on rapid change and innovation. While Google is incredibly serious about reliability, we must adapt our approaches to our high rate of change.
Ultimately, SRE’s goal is to follow a similar course. An SRE team should be as compact as possible and operate at a high level of abstraction, relying upon lots of backup systems as failsafes and thoughtful APIs to communicate with the systems. At the same time, the SRE team should also have comprehensive knowledge of the systems — how they operate, how they fail, and how to respond to failures — that comes from operating them day-to-day.
At least eight people need to be part of the on-call team in order to avoid fatigue and allow sustainable staffing and low turnover.
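The figure of eight can be reproduced with simple arithmetic. This sketch assumes two engineers are on-call at any time (for example, a primary and a secondary) and that no engineer spends more than 25% of their time on-call. Neither assumption is stated in the highlight itself, and the function name is hypothetical.

```python
import math


def min_oncall_team_size(people_on_call: int, max_oncall_fraction: float) -> int:
    """Smallest team that keeps each engineer at or below the on-call time cap.

    people_on_call: engineers on-call at any moment (assumed: 2).
    max_oncall_fraction: cap on each engineer's on-call time (assumed: 0.25).
    """
    return math.ceil(people_on_call / max_oncall_fraction)


print(min_oncall_team_size(2, 0.25))  # 8
```

With week-long shifts, a team of eight under these assumptions puts each engineer on-call roughly one week in four.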