Site Reliability Engineering Quotes

2,875 ratings, 4.21 average rating, 271 reviews
“If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be?”
― Site Reliability Engineering: How Google Runs Production Systems
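One way to put a number on the question in this quote is sketched below. It assumes, purely for illustration, that revenue scales linearly with availability; the $1M revenue figure and the function name `incremental_revenue` are hypothetical and not taken from the quote.

```python
def incremental_revenue(annual_revenue: float, current: float, proposed: float) -> float:
    """Estimate the revenue gained by raising availability from `current` to `proposed`.

    Assumption (not from the quote): revenue is proportional to the
    fraction of requests served successfully.
    """
    if not (0.0 <= current <= proposed <= 1.0):
        raise ValueError("availabilities must satisfy 0 <= current <= proposed <= 1")
    return annual_revenue * (proposed - current)


if __name__ == "__main__":
    # Hypothetical service earning $1M per year, moving from 99.9% to 99.99%.
    gain = incremental_revenue(1_000_000, 0.999, 0.9999)
    print(f"Estimated incremental revenue: ${gain:.2f}")
```

Under these assumptions the extra nine is worth about $900 per year, which is the figure to weigh against the cost of achieving it.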
“a system that serves 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.”
― Site Reliability Engineering: How Google Runs Production Systems
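The arithmetic in this quote can be checked with a short sketch. The function name `allowed_errors` is an assumption made for illustration; the target is passed as a decimal string so that `Fraction` keeps the calculation exact rather than relying on floating point.

```python
from fractions import Fraction


def allowed_errors(total_requests: int, availability_target: str) -> int:
    """Return how many requests may fail while still meeting the target.

    `availability_target` is a decimal string such as "0.9999" (99.99%).
    """
    error_budget = 1 - Fraction(availability_target)  # permitted failure ratio
    return int(total_requests * error_budget)         # round down to whole requests


if __name__ == "__main__":
    # The figures from the quote: 2.5M requests per day at a 99.99% target.
    print(allowed_errors(2_500_000, "0.9999"))  # → 250
```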
“You might expect Google to try to build 100% reliable services—ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer. Further, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.”
― Site Reliability Engineering: How Google Runs Production Systems
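The smartphone comparison can be made concrete with a minimal sketch. It assumes the device and the service fail independently and that a request succeeds only when both work, so their availabilities multiply; the function name `end_to_end_availability` is hypothetical.

```python
def end_to_end_availability(*components: float) -> float:
    """Availability seen by the user when every component must work.

    Assumption: component failures are independent, so the individual
    availabilities simply multiply.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    phone = 0.99  # the 99% reliable smartphone from the quote
    four_nines = end_to_end_availability(phone, 0.9999)
    five_nines = end_to_end_availability(phone, 0.99999)
    print(f"with a 99.99% service:  {four_nines:.5%}")
    print(f"with a 99.999% service: {five_nines:.5%}")
    print(f"difference:             {five_nines - four_nines:.5%}")
```

Under these assumptions the user sees roughly 98.990% versus 98.999%, a gap of about 0.009 percentage points that is dwarfed by the 1% lost to the device itself.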
“A key principle of any effective software engineering, not only reliability-oriented engineering, simplicity is a quality that, once lost, can be extraordinarily difficult to recapture. Nevertheless, as the old adage goes, a complex system that works necessarily evolved from a simple system that works. Chapter 9, Simplicity, goes into this topic in detail.”
― Site Reliability Engineering: How Google Runs Production Systems
“Figure 2-4 shows how a user’s request is serviced: first, the user points their browser to shakespeare.google.com. To obtain the corresponding IP address, the user’s device resolves the address with its DNS server (1). This request ultimately ends up at Google’s DNS server, which talks to GSLB. As GSLB keeps track of traffic load among frontend servers across regions, it picks which server IP address to send to this user. [Figure 2-4. The life of a request] The browser connects to the HTTP server on this IP. This server (named the Google Frontend, or GFE) is a reverse proxy that terminates the TCP connection (2). The GFE looks up which service is required (web search, maps, or—in this case—Shakespeare). Again using GSLB, the server finds an available Shakespeare frontend server, and sends that server an RPC containing the HTTP request (3).”
― Site Reliability Engineering: How Google Runs Production Systems
“All of Google’s services communicate using a Remote Procedure Call (RPC) infrastructure named Stubby; an open source version, gRPC, is available. Often, an RPC call is made even when a call to a subroutine in the local program needs to be performed. This makes it easier to refactor the call into a different server if more modularity is needed, or when a server’s codebase grows. GSLB can load balance RPCs in the same way it load balances externally visible services.”
― Site Reliability Engineering: How Google Runs Production Systems
“When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.””
― Site Reliability Engineering: How Google Runs Production Systems
“SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity.”
― Site Reliability Engineering: How Google Runs Production Systems
“Often, sheer force of effort can help a rickety system achieve high availability, but this path is usually short-lived and fraught with burnout and dependence on a small number of heroic team members.”
― Site Reliability Engineering: How Google Runs Production Systems
“Google places a 50% cap on the aggregate “ops” work for all SREs —”
― Site Reliability Engineering: How Google Runs Production Systems
“And, as we all know, culture beats strategy every time: [Mer11]”
― Site Reliability Engineering: How Google Runs Production Systems
“If you currently assign tickets randomly to victims on your team, stop. Doing so is extremely disrespectful of your team’s time, and works completely counter to the principle of not being interruptible as much as possible.”
― Site Reliability Engineering: How Google Runs Production Systems
“An upfront investment in SRE training is absolutely worthwhile, both for the students eager to grasp their production environment and for the teams grateful to welcome students into the ranks of on-call.”
― Site Reliability Engineering: How Google Runs Production Systems
“John is the newest member of the FooServer SRE team. Senior SREs on this team are tasked with a lot of grunt work, such as responding to tickets, dealing with alerts, and performing tedious binary rollouts.”
― Site Reliability Engineering: How Google Runs Production Systems
“How can we harness the enthusiasm and curiosity in our new hires to make sure that existing SREs benefit from it?”
― Site Reliability Engineering: How Google Runs Production Systems
“Successful SRE teams are built on trust”
― Site Reliability Engineering: How Google Runs Production Systems
“Investing up front in the education and technical orientation of new SREs will shape them into better engineers. Such training will accelerate them to a state of proficiency faster, while making their skill set more robust and balanced.”
― Site Reliability Engineering: How Google Runs Production Systems
“[Dea07] J. Dean, “Software Engineering Advice from Building Large-Scale Distributed Systems”, Stanford CS297 class lecture, Spring 2007.”
― Site Reliability Engineering: How Google Runs Production Systems
“Whenever you see leader election, critical shared state, or distributed locking, think about distributed consensus: any lesser approach is a ticking bomb waiting to explode in your systems.”
― Site Reliability Engineering: How Google Runs Production Systems
“For non-Byzantine failures, the minimum number of replicas that can be deployed is three — if two are deployed, then there is no tolerance for failure of any process. Three replicas may tolerate one failure.”
― Site Reliability Engineering: How Google Runs Production Systems
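The replica counts in this quote follow from majority-quorum arithmetic, sketched below. This assumes a protocol that needs a strict majority of replicas to make progress and that replicas fail only by crashing (non-Byzantine); the function names are hypothetical.

```python
def majority_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Crash failures a group can absorb while a majority is still reachable."""
    return replicas - majority_quorum(replicas)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} replicas: quorum={majority_quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

With two replicas the quorum is both of them, so no failure is tolerated; three replicas tolerate one, matching the quote. Under the same assumption, a fourth replica adds no extra tolerance over three.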
“Quorum leases are particularly useful for read-heavy workloads in which reads for particular subsets of the data are concentrated in a single geographic region.”
― Site Reliability Engineering: How Google Runs Production Systems
“Quorum leases [Mor14] are a recently developed distributed consensus performance optimization aimed at reducing latency and increasing throughput for read operations.”
― Site Reliability Engineering: How Google Runs Production Systems
“Introducing randomness is the best approach. Raft [Ong14], for example, has a well-thought-out method of approaching the leader election process.”
― Site Reliability Engineering: How Google Runs Production Systems
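The role of randomness in leader election can be illustrated with a toy sketch in the spirit of Raft's randomized election timeouts. It is not an implementation of Raft: the 150-300 ms range is only an example interval, and the function names and node IDs are hypothetical.

```python
import random


def election_timeout_ms(rng: random.Random, low: float = 150.0, high: float = 300.0) -> float:
    """Pick an election timeout uniformly at random from [low, high] milliseconds."""
    return rng.uniform(low, high)


def first_to_time_out(node_ids, rng):
    """Give every node its own random timeout and return the node that fires first.

    Because the timeouts differ, one node usually becomes a candidate and
    requests votes before the others, which makes split votes less likely.
    """
    timeouts = {node: election_timeout_ms(rng) for node in node_ids}
    candidate = min(timeouts, key=timeouts.get)
    return candidate, timeouts


if __name__ == "__main__":
    rng = random.Random()  # pass a seeded Random for reproducible runs
    candidate, timeouts = first_to_time_out(["node-a", "node-b", "node-c"], rng)
    for node, timeout in sorted(timeouts.items()):
        print(f"{node}: {timeout:.1f} ms")
    print(f"first candidate: {candidate}")
```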
“(Synchronous consensus applies to real-time systems, in which dedicated hardware means that messages will always be passed with specific timing guarantees.)”
― Site Reliability Engineering: How Google Runs Production Systems
“Use randomized exponential backoff on errors”
― Site Reliability Engineering: How Google Runs Production Systems
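A minimal sketch of randomized exponential backoff follows. The quote does not prescribe a particular variant; this one draws each delay uniformly between zero and an exponentially growing ceiling, and the names `call_with_retries`, `base`, and `cap`, along with their default values, are assumptions for illustration.

```python
import random
import time


def call_with_retries(operation, attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=None):
    """Call `operation`, retrying on any exception with randomized backoff.

    Before retry number k (counting from 0) the caller waits a random time
    in [0, min(cap, base * 2**k)] seconds, so the ceiling doubles after each
    failure while the jitter keeps many clients from retrying in lockstep.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Hypothetical operation that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("transient error")
        return "ok"

    print(call_with_retries(flaky))
```

Passing `sleep` and `rng` as parameters is a design choice that keeps the retry loop testable without real waiting.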
“Antagonistic neighbors: Other processes (often completely unrelated and run by different teams) can have a significant impact on the performance of your processes. We’ve seen differences in performance of this nature of up to 20%. This difference mostly stems from competition for shared resources, such as space in memory caches or bandwidth, in ways that may not be directly obvious.”
― Site Reliability Engineering: How Google Runs Production Systems
“There’s a lot of evidence suggesting that diverse teams are simply better teams [Nel14]”
― Site Reliability Engineering: How Google Runs Production Systems
“Connecting the performance of the service with design decisions in a regular meeting is an immensely powerful feedback loop.”
― Site Reliability Engineering: How Google Runs Production Systems
“Production meetings are a special kind of meeting where an SRE team carefully articulates to itself — and to its invitees — the state of the service(s) in their charge, so as to increase general awareness among everyone who cares, and to improve the operation of the service(s).”
― Site Reliability Engineering: How Google Runs Production Systems
“Those who cannot remember the past are condemned to repeat it.” George Santayana, philosopher and essayist
― Site Reliability Engineering: How Google Runs Production Systems