Site Reliability Engineering: How Google Runs Production Systems
Ben Treynor Sloss, the senior VP overseeing technical operations at Google — and the originator of the term “Site Reliability Engineering”
Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.
These costs arise from the fact that the two teams are quite different in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and possibilities for technical solutions; they have different assumptions about the target level of product stability.
Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension.
(“We want to launch anything, any time, without hindrance” versus “We won’t want to ever change anything in the system once it works”).
SRE is what happens when you ask a software engineer to design an operations team.
Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
Therefore, Google places a 50% cap on the aggregate “ops” work for all SREs — tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable.
an SRE team must spend the remaining 50% of its time actually doing development.
One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small.
involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks — are consistent with many of SRE’s principles and practices. One could view DevOps as a generalization of several core SRE principles
this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on.
on average, SREs should receive a maximum of two events per 8–12-hour on-call shift.
There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available. Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.
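A rough back-of-the-envelope check of this argument: end-to-end availability of serially dependent components is roughly the product of their individual availabilities. The figures below for the laptop, home WiFi, ISP, and power grid are illustrative assumptions, not numbers from the book.

    # Illustrative only: made-up availability figures for the components
    # between the user and the service.
    def end_to_end(availabilities):
        product = 1.0
        for a in availabilities:
            product *= a
        return product

    path = {"laptop": 0.99, "home_wifi": 0.99, "isp": 0.999, "power_grid": 0.9999}

    for service in (0.99999, 1.0):   # five nines vs. a perfect backend
        total = end_to_end(list(path.values()) + [service])
        print(f"service at {service:.5%}: user-visible availability ~ {total:.4%}")
    # Both cases come out near 97.9%, so the user cannot tell the difference.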
SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity.
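A hypothetical worked example of the error budget idea; the 99.9% quarterly availability target below is an assumed figure, not one from the book.

    # With a 99.9% quarterly availability target, 0.1% of the quarter may be
    # "spent" on outages, risky launches, experiments, and so on.
    slo = 0.999
    quarter_minutes = 91 * 24 * 60            # ~91 days in a quarter
    budget_minutes = (1 - slo) * quarter_minutes
    print(f"error budget per quarter: {budget_minutes:.0f} minutes "
          f"(~{budget_minutes / 60:.1f} hours)")
    # -> roughly 131 minutes (~2.2 hours) of full downtime per quarter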
When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”
While no playbook, no matter how comprehensive it may be, is a substitute for smart engineers able to think on the fly, clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page.
SRE has found that roughly 70% of outages are due to changes in a live system.
Because capacity is critical to availability, it naturally follows that the SRE team must be in charge of capacity planning, which means they also must be in charge of provisioning.
Resource use is a function of demand (load), capacity, and software efficiency.
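One minimal way to sketch how those three factors might combine when sizing a service; the formula, the per-machine throughput, and the efficiency value are illustrative assumptions rather than the book's model.

    import math

    # Illustrative sketch only: demand (peak_qps), capacity (machines), and
    # software efficiency interacting in a provisioning estimate. All numbers
    # are made up.
    def machines_needed(peak_qps, qps_per_machine, efficiency):
        effective_qps_per_machine = qps_per_machine * efficiency
        return math.ceil(peak_qps / effective_qps_per_machine)

    print(machines_needed(peak_qps=100_000, qps_per_machine=500, efficiency=0.8))
    # -> 250; improving efficiency to 1.0 would drop this to 200 machines.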
SREs provision to meet a capacity target at a specific response speed,
very fast virtual switch with tens of thousands of ports. We accomplished this by connecting hundreds of Google-built switches in a Clos network fabric [Clos53] named Jupiter
backbone network B4 [Jai13]. B4 is a software-defined networking architecture
For example, the BNS path might be a string such as /bns/<cluster>/<user>/<job name>/<task number>, which would resolve to <IP address>:<port>.
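A toy sketch of what such a name-to-address resolution could look like; the lookup table and the resolve_bns helper below are invented for illustration and are not the real BNS interface.

    # Hypothetical sketch of BNS-style name resolution.
    FAKE_BNS_TABLE = {
        "/bns/cluster-a/alice/webserver/0": ("10.1.2.3", 8080),
        "/bns/cluster-a/alice/webserver/1": ("10.1.2.4", 8080),
    }

    def resolve_bns(path):
        """Map a /bns/<cluster>/<user>/<job name>/<task number> path to ip:port."""
        ip, port = FAKE_BNS_TABLE[path]
        return f"{ip}:{port}"

    print(resolve_bns("/bns/cluster-a/alice/webserver/0"))   # -> 10.1.2.3:8080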
A layer on top of D called Colossus creates a cluster-wide filesystem that offers usual filesystem semantics, as well as replication and encryption.
Bigtable [Cha06] is a NoSQL database
Spanner [Cor12] offers an SQL-like interface
Instead of using “smart” routing hardware, we rely on less expensive “dumb” switching components in combination with a central (duplicated) controller that precomputes best paths across the network.
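The flavor of this design can be sketched with a standard shortest-path computation over a tiny invented topology: the controller precomputes next hops and would push them to the switches as forwarding state. This is only an illustration, not Google's actual path-selection algorithm.

    import heapq

    # Tiny made-up topology; link costs between switches.
    TOPOLOGY = {
        "s1": {"s2": 1, "s3": 4},
        "s2": {"s1": 1, "s3": 1, "s4": 5},
        "s3": {"s1": 4, "s2": 1, "s4": 1},
        "s4": {"s2": 5, "s3": 1},
    }

    def best_next_hops(source):
        """Return {destination: next hop from `source`} via Dijkstra's algorithm."""
        dist, next_hop, queue = {source: 0}, {}, [(0, source, None)]
        while queue:
            d, node, first_hop = heapq.heappop(queue)
            if d > dist.get(node, float("inf")):
                continue
            for neighbor, cost in TOPOLOGY[node].items():
                nd = d + cost
                if nd < dist.get(neighbor, float("inf")):
                    dist[neighbor] = nd
                    hop = first_hop or neighbor
                    next_hop[neighbor] = hop
                    heapq.heappush(queue, (nd, neighbor, hop))
        return next_hop

    print(best_next_hops("s1"))   # e.g. traffic for s4 leaves s1 via s2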
Global Software Load Balancer (GSLB)
Chubby handles these locks across datacenter locations. It uses the Paxos protocol for asynchronous Consensus
Data that must be consistent is well suited to storage in Chubby.
Our code is heavily multithreaded, so one task can easily use many cores. To facilitate dashboards, monitoring, and debugging, every server has an HTTP server that provides diagnostics and statistics for a given task.
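A minimal sketch of the per-task diagnostics idea using only the Python standard library; the /stats path and the counters are invented here, not Google's internal interface.

    import json
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical per-task diagnostics endpoint.
    STATS = {"requests_served": 0, "threads": 0}

    class StatsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/stats":
                self.send_error(404)
                return
            STATS["threads"] = threading.active_count()
            body = json.dumps(STATS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), StatsHandler).serve_forever()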
Data is transferred to and from an RPC using protocol buffers,
they can fix the problem, send the proposed changes (“changelist,” or CL) to the owner for review,
Each time a CL is submitted, tests run on all software that may depend on that CL, either directly or indirectly.
(Google Frontend, or GFE) is a reverse proxy that terminates the TCP connection
the following considerations mean that we need at least 37 tasks in the job, or N + 2: During updates, one task at a time will be unavailable, leaving 36 tasks. A machine failure might occur during a task update, leaving only 35 tasks,
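Working through the arithmetic implied by the quote, with 35 tasks taken as the peak-load requirement:

    # N + 2: assume peak load needs 35 tasks, then reserve one for a rolling
    # update and one for a concurrent machine failure.
    tasks_for_peak_load = 35
    spare_for_update = 1
    spare_for_machine_failure = 1
    required = tasks_for_peak_load + spare_for_update + spare_for_machine_failure
    print(required)   # -> 37, i.e. N + 2 where N = 35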
Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.
Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer.
We strive to make a service reliable enough, but no more reliable than it needs to be.
we set quarterly availability targets for a service and track our performance against those targets on a weekly, or even daily, basis.
we’ve measured the typical background error rate for ISPs as falling between 0.01% and 1%.
Exposing cost in this way motivates the clients to choose the level of service with the lowest cost that still meets their needs.
The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.
service level indicators (SLIs), objectives (SLOs), and agreements (SLAs).
request latency — how long it takes to return a response to a request — as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second.
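A short sketch of computing those three SLIs from a hypothetical request log; the field layout, sample values, and 60-second window are invented for illustration.

    # Each entry is (latency in ms, succeeded?); values are made up.
    requests = [
        (120, True), (85, True), (2300, False), (95, True), (110, True),
    ]
    window_seconds = 60

    latencies = [ms for ms, _ in requests]
    error_rate = sum(1 for _, ok in requests if not ok) / len(requests)
    throughput = len(requests) / window_seconds

    print(f"mean latency: {sum(latencies) / len(latencies):.0f} ms")
    print(f"error rate:   {error_rate:.1%}")
    print(f"throughput:   {throughput:.2f} requests/second")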
availability, or the fraction of the time that a service is usable.
Choosing too many indicators makes it hard to pay the right level of attention to the indicators that matter, while choosing too few may leave significant behaviors of your system unexamined.
Most metrics are better thought of as distributions rather than averages.
Using percentiles for indicators allows you to consider the shape of the distribution and its differing attributes:
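A small made-up latency sample illustrating the point: the mean alone hides both how fast the typical request is and how bad the tail is, while percentiles expose the shape of the distribution.

    import statistics

    # 90 fast requests plus a long tail of 10 very slow ones (values invented).
    latencies_ms = [12, 14, 15, 15, 16, 17, 18, 20, 22, 25] * 9 + [900] * 10

    quantiles = statistics.quantiles(latencies_ms, n=100)
    print(f"mean: {statistics.mean(latencies_ms):.0f} ms")   # ~106 ms
    print(f"p50:  {quantiles[49]:.0f} ms")                   # ~17 ms
    print(f"p95:  {quantiles[94]:.0f} ms")                   # 900 ms
    print(f"p99:  {quantiles[98]:.0f} ms")                   # 900 ms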