Kindle Notes & Highlights
Ben Treynor Sloss, the senior VP overseeing technical operations at Google — and the originator of the term “Site Reliability Engineering”
Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.
These costs arise from the fact that the two teams are quite different in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and possibilities for technical solutions; they have different assumptions about the target level of product stability.
Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension.
“We want to launch anything, any time, without hindrance” versus “We don’t want to ever change anything in the system once it works”.
SRE is what happens when you ask a software engineer to design an operations team.
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
Therefore, Google places a 50% cap on the aggregate “ops” work for all SREs — tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable.
an SRE team must spend the remaining 50% of its time actually doing development.
One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small.
involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks — are consistent with many of SRE’s principles and practices. One could view DevOps as a generalization of several core SRE principles
this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on.
on average, SREs should receive a maximum of two events per 8–12-hour on-call shift.
There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available. Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.
SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity.
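As a rough sketch of the arithmetic behind an error budget (the 99.9% target and the request counts below are made-up numbers for illustration, not figures from the book):

    # Error-budget arithmetic sketch. The quarterly 99.9% target and the
    # request volume are illustrative assumptions, not from the book.
    availability_target = 0.999                       # "three nines" for the quarter
    error_budget_fraction = 1 - availability_target   # 0.1% of requests may fail

    requests_this_quarter = 250_000_000
    allowed_failures = requests_this_quarter * error_budget_fraction

    failed_so_far = 180_000
    budget_remaining = allowed_failures - failed_so_far

    print(f"Allowed failures this quarter: {allowed_failures:,.0f}")
    print(f"Budget remaining: {budget_remaining:,.0f} failed requests")
    # While budget_remaining is positive, product development can keep spending
    # it on launches; once it reaches zero, the focus shifts back to reliability.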
When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”
While no playbook, no matter how comprehensive it may be, is a substitute for smart engineers able to think on the fly, clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page.
SRE has found that roughly 70% of outages are due to changes in a live system.
Because capacity is critical to availability, it naturally follows that the SRE team must be in charge of capacity planning, which means they also must be in charge of provisioning.
Resource use is a function of demand (load), capacity, and software efficiency.
SREs provision to meet a capacity target at a specific response speed,
very fast virtual switch with tens of thousands of ports. We accomplished this by connecting hundreds of Google-built switches in a Clos network fabric [Clos53] named Jupiter
backbone network B4 [Jai13]. B4 is a software-defined networking architecture
For example, the BNS path might be a string such as /bns/<cluster>/<user>/<job name>/<task number>, which would resolve to <IP address>:<port>.
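A minimal sketch of how a path in that form might be resolved; the resolve_bns function and its lookup table are hypothetical, invented for illustration rather than taken from Google's actual naming service:

    # Hypothetical sketch of resolving a BNS-style path of the form
    # /bns/<cluster>/<user>/<job name>/<task number> to <IP address>:<port>.
    # The lookup table and resolve_bns() are illustrative, not a real API.
    ADDRESS_MAP = {  # stand-in for the naming service's backing store
        ("us-east1", "shakespeare", "frontend", 3): ("10.1.2.3", 8080),
    }

    def resolve_bns(path: str) -> str:
        parts = path.strip("/").split("/")
        if len(parts) != 5 or parts[0] != "bns":
            raise ValueError(f"not a BNS path: {path}")
        _, cluster, user, job, task = parts
        ip, port = ADDRESS_MAP[(cluster, user, job, int(task))]
        return f"{ip}:{port}"

    print(resolve_bns("/bns/us-east1/shakespeare/frontend/3"))  # 10.1.2.3:8080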
A layer on top of D called Colossus creates a cluster-wide filesystem that offers usual filesystem semantics, as well as replication and encryption.
Bigtable [Cha06] is a NoSQL database
Spanner [Cor12] offers an SQL-like interface
Instead of using “smart” routing hardware, we rely on less expensive “dumb” switching components in combination with a central (duplicated) controller that precomputes best paths across the network.
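As an analogy only, a central controller precomputing best paths can be pictured as an ordinary shortest-path computation over its view of the topology; the switch names and link costs below are invented, and B4's real traffic engineering is far more sophisticated:

    # Illustrative only: a controller precomputing cheapest paths with a
    # shortest-path search (Dijkstra) over a made-up topology.
    import heapq

    TOPOLOGY = {   # switch -> {neighbor: link cost}
        "sw-a": {"sw-b": 1, "sw-c": 4},
        "sw-b": {"sw-a": 1, "sw-c": 1, "sw-d": 5},
        "sw-c": {"sw-a": 4, "sw-b": 1, "sw-d": 1},
        "sw-d": {"sw-b": 5, "sw-c": 1},
    }

    def best_paths(source):
        """Return the cheapest path from source to every reachable switch."""
        dist = {source: 0}
        paths = {source: [source]}
        queue = [(0, source)]
        while queue:
            d, node = heapq.heappop(queue)
            if d > dist.get(node, float("inf")):
                continue  # stale queue entry
            for neighbor, cost in TOPOLOGY[node].items():
                nd = d + cost
                if nd < dist.get(neighbor, float("inf")):
                    dist[neighbor] = nd
                    paths[neighbor] = paths[node] + [neighbor]
                    heapq.heappush(queue, (nd, neighbor))
        return paths

    print(best_paths("sw-a")["sw-d"])  # ['sw-a', 'sw-b', 'sw-c', 'sw-d']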
Global Software Load Balancer (GSLB)
Chubby handles these locks across datacenter locations. It uses the Paxos protocol for asynchronous Consensus
Data that must be consistent is well suited to storage in Chubby.
Our code is heavily multithreaded, so one task can easily use many cores. To facilitate dashboards, monitoring, and debugging, every server has an HTTP server that provides diagnostics and statistics for a given task.
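A minimal sketch of the per-task diagnostics idea using only Python's standard library; the /statusz path and the statistics reported are assumptions for illustration, not the interface the book describes:

    # Illustrative only: a task exposing an HTTP endpoint with basic diagnostics.
    import json
    import threading
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    START_TIME = time.time()
    REQUEST_COUNT = 0   # a real task would increment and export many such counters

    class DiagnosticsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/statusz":       # hypothetical diagnostics path
                self.send_error(404)
                return
            stats = {
                "uptime_seconds": round(time.time() - START_TIME, 1),
                "requests_served": REQUEST_COUNT,
                "threads": threading.active_count(),
            }
            body = json.dumps(stats).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8000), DiagnosticsHandler).serve_forever()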
Data is transferred to and from an RPC using protocol buffers,
they can fix the problem, send the proposed changes (“changelist,” or CL) to the owner for review,
Each time a CL is submitted, tests run on all software that may depend on that CL, either directly or indirectly.
The Google Frontend (or GFE) is a reverse proxy that terminates the TCP connection
the following considerations mean that we need at least 37 tasks in the job, or N + 2: During updates, one task at a time will be unavailable, leaving 36 tasks. A machine failure might occur during a task update, leaving only 35 tasks, just enough to serve peak load.
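Spelled out, the provisioning rule is two spare tasks on top of the peak-load requirement; the 35-task figure is the peak-load requirement from the surrounding example:

    # N + 2 provisioning from the example above: N tasks serve peak load,
    # plus one for a task being updated and one for a machine failure.
    peak_load_tasks = 35          # just enough to serve peak load (N)
    update_spare = 1              # one task at a time is down during updates
    failure_spare = 1             # a machine may fail during that update
    tasks_needed = peak_load_tasks + update_spare + failure_spare
    print(tasks_needed)           # 37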
Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.
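A minimal sketch of that idea, assuming a hypothetical search.proto compiled to Python with protoc; the message and field names mirror the protobuf documentation's standard example rather than anything in this book:

    # Sketch assuming a hypothetical search.proto compiled with
    #   protoc --python_out=. search.proto
    # containing:
    #   syntax = "proto3";
    #   message SearchRequest {
    #     string query = 1;
    #     int32 page_number = 2;
    #     int32 results_per_page = 3;
    #   }
    import search_pb2  # hypothetical generated module

    request = search_pb2.SearchRequest(query="hamlet", page_number=1, results_per_page=10)
    wire_bytes = request.SerializeToString()   # compact, language- and platform-neutral bytes

    decoded = search_pb2.SearchRequest()
    decoded.ParseFromString(wire_bytes)        # any language with the same .proto can decode this
    assert decoded.query == "hamlet"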
Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the number of features a team can afford to offer.
We strive to make a service reliable enough, but no more reliable than it needs to be.
we set quarterly availability targets for a service and track our performance against those targets on a weekly, or even daily, basis.
we’ve measured the typical background error rate for ISPs as falling between 0.01% and 1%.
Exposing cost in this way motivates the clients to choose the level of service with the lowest cost that still meets their needs.
The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.
service level indicators (SLIs), objectives (SLOs), and agreements (SLAs).
request latency — how long it takes to return a response to a request — as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second.
availability, or the fraction of the time that a service is usable.
Choosing too many indicators makes it hard to pay the right level of attention to the indicators that matter, while choosing too few may leave significant behaviors of your system unexamined.
Most metrics are better thought of as distributions rather than averages.
Using percentiles for indicators allows you to consider the shape of the distribution and its differing attributes:
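A short sketch of why the distribution matters; the latency samples below are made up, and the nearest-rank percentile helper is just one common way to compute percentiles:

    # Illustrative only: made-up request latencies (ms) showing why percentiles
    # describe the distribution better than the average does.
    import math
    import statistics

    latencies_ms = sorted([12, 14, 15, 15, 16, 17, 18, 20, 22, 25, 30, 45, 60, 150, 900])

    def percentile(sorted_vals, p):
        """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
        idx = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
        return sorted_vals[idx]

    print(f"mean = {statistics.mean(latencies_ms):.0f} ms")   # ~91 ms
    print(f"p50  = {percentile(latencies_ms, 50)} ms")        # 20 ms
    print(f"p99  = {percentile(latencies_ms, 99)} ms")        # 900 ms
    # The mean is dragged up by one 900 ms outlier; the median shows the typical
    # request, while p99 shows the tail that the slowest users experience.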