Site Reliability Engineering Quotes

Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer
2,875 ratings, 4.21 average rating, 271 reviews
Showing 1-30 of 79
“Hope is not a strategy.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“When a team must allocate a disproportionate amount of time to resolving tickets at the cost of spending time improving the service, scalability and reliability suffer.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“team size should not scale directly with service growth.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“Remember that the code path you never use is the code path that (often) doesn’t work.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“A key principle of any effective software engineering, not only reliability-oriented engineering, simplicity is a quality that, once lost, can be extraordinarily difficult to recapture.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every “mistake” is seen as an opportunity to strengthen the system. When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t “fix” people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“Hope is not a strategy.” (Traditional SRE saying)
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.” (Carla Geisser, Google SRE)
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“monitoring is an absolutely essential component of doing the right thing in production. If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“Best practices in this domain use automation to accomplish the following: implementing progressive rollouts; quickly and accurately detecting problems; rolling back changes safely when problems arise.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). We”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“And taking the historical view, who, then, looking back, might be the first SRE? We like to think that Margaret Hamilton, working on the Apollo program on loan from MIT, had all of the significant traits of the first SRE. In her own words, “part of the culture was to learn from everyone and everything, including from that which one would least expect.””
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s Chubby service introduced planned outages in response to being overly available), throttling some requests, or designing the system so that it isn’t faster under light loads.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“You can always refine SLO definitions and targets over time as you learn about a system’s behavior. It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency. In other words: How much data is being processed? How long does it take the data to progress from ingestion to completion? (Some pipelines may also have targets for latency on individual processing stages.)”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“Google Search is an example of an important service that doesn’t have an SLA for the public: we want everyone to use Search as fluidly and efficiently as possible, but we haven’t signed a contract with the whole world.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial — a rebate or a penalty — but they can take other forms.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
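The "lower bound ≤ SLI ≤ upper bound" structure in the quote above can be sketched as a small check. This is only an illustration: the function name `meets_slo` and its parameters are hypothetical and do not come from the book.

```python
from typing import Optional


def meets_slo(sli: float,
              lower: Optional[float] = None,
              upper: Optional[float] = None) -> bool:
    """Return True when lower <= sli <= upper.

    A bound left as None is not checked, so passing only `upper`
    gives the simpler "SLI <= target" form. (Hypothetical helper.)
    """
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True


# Example: a latency SLI of 0.08 s measured against a 0.1 s target.
latency_ok = meets_slo(0.08, upper=0.1)
```

Leaving either bound optional lets one function cover both forms named in the quote.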
“An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“For example, if product development wants to skimp on testing or increase push velocity and SRE is resistant, the error budget guides the decision. When the budget is large, the product developers can take more risks. When the budget is nearly drained, the product developers themselves will push for more testing or slower push velocity, as they don’t want to risk using up the budget and stall their launch.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“In order to base these decisions on objective data, the two teams jointly define a quarterly error budget based on the service’s service level objective, or SLO (see Chapter 4). The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
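One way to read the error budget described above is as the number of failures an availability SLO still permits in a period. The sketch below assumes a request-based availability SLO; the function names, the request counts, and the 10% "nearly drained" threshold are illustrative assumptions, not values from the book.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Requests allowed to fail in the period under an availability SLO.

    Assumes the SLO is a success ratio, e.g. 0.999 for 99.9%.
    """
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int,
                     failed_requests: int) -> float:
    """Unspent budget; a negative value means the SLO was missed."""
    return error_budget(slo, total_requests) - failed_requests


def nearly_drained(slo: float, total_requests: int,
                   failed_requests: int, threshold: float = 0.1) -> bool:
    """True when less than `threshold` of the budget is left.

    The 10% default is an arbitrary illustrative cutoff.
    """
    budget = error_budget(slo, total_requests)
    remaining = budget_remaining(slo, total_requests, failed_requests)
    return remaining < threshold * budget


# Hypothetical quarter: 1,000,000 requests at a 99.9% SLO, 950 failures.
slow_down = nearly_drained(0.999, 1_000_000, 950)
```

With these made-up numbers the budget is about 1,000 failed requests, so 950 failures leaves roughly 5% of it, which is the situation where the quotes describe developers pushing for more testing or slower releases.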
“Note that we can run multiple classes of services using identical hardware and software. We can provide vastly different service guarantees by adjusting a variety of service characteristics, such as the quantities of resources, the degree of redundancy, the geographical provisioning constraints, and, critically, the infrastructure software configuration.”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“With explicitly delineated levels of service, the infrastructure providers can effectively externalize the difference in the cost it takes to provide service at a given level to clients. Exposing cost in this way motivates the clients to choose the level of service with the lowest cost that still meets their needs. For example, Google+ can decide to put data critical to enforcing user privacy in a high-availability, globally consistent datastore (e.g., a globally replicated SQL-like system like Spanner [Cor12]), while putting optional data (data that isn’t critical, but that enhances the user experience) in a cheaper, less reliable, less fresh, and eventually consistent datastore (e.g., a NoSQL store with best-effort replication like Bigtable).”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
“If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be? Does this additional revenue offset the cost of reaching that level of reliability?”
Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
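The cost/benefit question in the quote above can be turned into a rough back-of-the-envelope calculation. The sketch assumes revenue scales linearly with availability, which is a simplification; the function names and all dollar figures are hypothetical.

```python
def incremental_revenue(current: float, target: float,
                        revenue: float) -> float:
    """Extra revenue from raising availability from `current` to `target`.

    Assumes revenue is proportional to availability (a simplification).
    """
    return (target - current) * revenue


def worth_one_more_nine(current: float, target: float,
                        revenue: float, cost: float) -> bool:
    """True when the added revenue at least covers the cost of the work."""
    return incremental_revenue(current, target, revenue) >= cost


# Hypothetical service: $1,000,000 revenue, moving from 99.9% to 99.99%.
gain = incremental_revenue(0.999, 0.9999, 1_000_000)
justified = worth_one_more_nine(0.999, 0.9999, 1_000_000, cost=5_000)
```

Under these made-up inputs the gain is about $900, so an improvement costing $5,000 would not pay for itself on revenue alone.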
