Site Reliability Engineering Quotes by Betsy Beyer (page 2 of 3)

“a system that serves 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
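
The arithmetic behind this quote can be checked with a short sketch. The function name `allowed_errors` is hypothetical, and the availability target is passed as a string so that `Decimal` avoids floating-point noise; this is one way to compute an error budget, not a method taken from the book.

```python
from decimal import Decimal


def allowed_errors(total_requests: int, availability_target: str) -> int:
    """Return how many requests may fail while still meeting the target.

    availability_target is a decimal string such as "0.9999" (hypothetical
    convention chosen here to keep the arithmetic exact).
    """
    budget = Decimal(total_requests) * (Decimal(1) - Decimal(availability_target))
    # The budget is non-negative, so int() truncation rounds down.
    return int(budget)


if __name__ == "__main__":
    # 2.5M requests per day at a 99.99% daily availability target.
    print(allowed_errors(2_500_000, "0.9999"))
```

With the figures from the quote, 2,500,000 × (1 − 0.9999) gives a budget of 250 failed requests for the day.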

“You might expect Google to try to build 100% reliable services—ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer. Further, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
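
The smartphone comparison can be made concrete with a rough sketch. It assumes the device and the service fail independently and that a request needs both to work, so end-to-end availability is the product of the two; that model and the name `end_to_end` are assumptions for illustration, not something the quote spells out.

```python
def end_to_end(*availabilities: float) -> float:
    """Availability of components in series, assuming independent failures."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    device = 0.99  # the "99% reliable smartphone" from the quote
    for service in (0.9999, 0.99999):
        total = end_to_end(device, service)
        print(f"service {service:.5f} -> end to end {total:.7f}")
```

Under these assumptions the two end-to-end figures differ by less than 0.01 percentage points, which is far smaller than the 1% of failures contributed by the device itself.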

“A key principle of any effective software engineering, not only reliability-oriented engineering, simplicity is a quality that, once lost, can be extraordinarily difficult to recapture. Nevertheless, as the old adage goes, a complex system that works necessarily evolved from a simple system that works. Chapter 9, Simplicity, goes into this topic in detail.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Figure 2-4 shows how a user’s request is serviced: first, the user points their browser to shakespeare.google.com. To obtain the corresponding IP address, the user’s device resolves the address with its DNS server (1). This request ultimately ends up at Google’s DNS server, which talks to GSLB. As GSLB keeps track of traffic load among frontend servers across regions, it picks which server IP address to send to this user. [Figure 2-4. The life of a request] The browser connects to the HTTP server on this IP. This server (named the Google Frontend, or GFE) is a reverse proxy that terminates the TCP connection (2). The GFE looks up which service is required (web search, maps, or—in this case—Shakespeare). Again using GSLB, the server finds an available Shakespeare frontend server, and sends that server an RPC containing the HTTP request (3).”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“All of Google’s services communicate using a Remote Procedure Call (RPC) infrastructure named Stubby; an open source version, gRPC, is available. Often, an RPC call is made even when a call to a subroutine in the local program needs to be performed. This makes it easier to refactor the call into a different server if more modularity is needed, or when a server’s codebase grows. GSLB can load balance RPCs in the same way it load balances externally visible services.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.””
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Often, sheer force of effort can help a rickety system achieve high availability, but this path is usually short-lived and fraught with burnout and dependence on a small number of heroic team members.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Google places a 50% cap on the aggregate “ops” work for all SREs —”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“And, as we all know, culture beats strategy every time: [Mer11]”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“If you currently assign tickets randomly to victims on your team, stop. Doing so is extremely disrespectful of your team’s time, and works completely counter to the principle of not being interruptible as much as possible.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“An upfront investment in SRE training is absolutely worthwhile, both for the students eager to grasp their production environment and for the teams grateful to welcome students into the ranks of on-call.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“John is the newest member of the FooServer SRE team. Senior SREs on this team are tasked with a lot of grunt work, such as responding to tickets, dealing with alerts, and performing tedious binary rollouts.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“How can we harness the enthusiasm and curiosity in our new hires to make sure that existing SREs benefit from it?”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Successful SRE teams are built on trust”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Investing up front in the education and technical orientation of new SREs will shape them into better engineers. Such training will accelerate them to a state of proficiency faster, while making their skill set more robust and balanced.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“[Dea07] J. Dean, “Software Engineering Advice from Building Large-Scale Distributed Systems”, Stanford CS297 class lecture, Spring 2007.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Whenever you see leader election, critical shared state, or distributed locking, think about distributed consensus: any lesser approach is a ticking bomb waiting to explode in your systems.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“For non-Byzantine failures, the minimum number of replicas that can be deployed is three — if two are deployed, then there is no tolerance for failure of any process. Three replicas may tolerate one failure.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
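
The replica counts in this quote follow from majority-quorum arithmetic, sketched below. The sketch assumes a protocol that needs a strict majority of replicas to make progress and only crash (non-Byzantine) failures; the function names are hypothetical.

```python
def quorum_size(replicas: int) -> int:
    """Smallest strict majority of the replica group."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can fail while a majority quorum is still reachable."""
    return replicas - quorum_size(replicas)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} replicas: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Two replicas need both members for a majority, so no failure is tolerated, while three replicas tolerate one. Adding a fourth replica does not raise the tolerance, which is one reason odd group sizes are commonly chosen.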

“Quorum leases are particularly useful for read-heavy workloads in which reads for particular subsets of the data are concentrated in a single geographic region.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Quorum leases [Mor14] are a recently developed distributed consensus performance optimization aimed at reducing latency and increasing throughput for read operations.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Introducing randomness is the best approach. Raft [Ong14], for example, has a well-thought-out method of approaching the leader election process.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“(Synchronous consensus applies to real-time systems, in which dedicated hardware means that messages will always be passed with specific timing guarantees.)”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Use randomized exponential backoff on errors”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
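
One possible shape for randomized exponential backoff is sketched below. It uses the "full jitter" variant, in which each delay is drawn uniformly between zero and an exponentially growing, capped ceiling. The function name, the default values for `base` and `cap`, and the choice of variant are all assumptions for illustration and are not taken from the book.

```python
import random
import time


def call_with_retries(operation, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation(), retrying on any exception with randomized backoff.

    base and cap are in seconds (hypothetical defaults). The sleep parameter
    is injectable so the delays can be observed without actually waiting.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Ceiling doubles on every failed attempt, up to the cap.
            ceiling = min(cap, base * 2 ** attempt)
            # Randomizing the delay keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

A caller would wrap a flaky operation, for example `call_with_retries(lambda: fetch(url))`, where `fetch` stands for any hypothetical function that may raise on transient errors. In real use it is usually better to catch only the exception types known to be retryable rather than `Exception`.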

“Antagonistic neighbors Other processes (often completely unrelated and run by different teams) can have a significant impact on the performance of your processes. We’ve seen differences in performance of this nature of up to 20%. This difference mostly stems from competition for shared resources, such as space in memory caches or bandwidth, in ways that may not be directly obvious.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“There’s a lot of evidence suggesting that diverse teams are simply better teams [Nel14]”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Connecting the performance of the service with design decisions in a regular meeting is an immensely powerful feedback loop.”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

“Production meetings are a special kind of meeting where an SRE team carefully articulates to itself — and to its invitees — the state of the service(s) in their charge, so as to increase general awareness among everyone who cares, and to improve the operation of the service(s).”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems

““Those who cannot remember the past are condemned to repeat it.” George Santayana, philosopher and essayist”
― Betsy Beyer, Site Reliability Engineering: How Google Runs Production Systems
