If there is a single maxim that runs through this book’s arguments, it is that we are often better served by connecting ideas than we are by protecting them.
“Note that we can run multiple classes of services using identical hardware and software. We can provide vastly different service guarantees by adjusting a variety of service characteristics, such as the quantities of resources, the degree of redundancy, the geographical provisioning constraints, and, critically, the infrastructure software configuration.”
― Site Reliability Engineering: How Google Runs Production Systems
“You might expect Google to try to build 100% reliable services—ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer. Further, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.”
― Site Reliability Engineering: How Google Runs Production Systems
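
The smartphone comparison in the quote above can be checked with a little arithmetic. The sketch below rests on a simplifying assumption that is not in the source: the device and the service fail independently, and a request succeeds only if both are working, so the availabilities multiply. The function name `combined_availability` is made up for this illustration.

```python
def combined_availability(*components: float) -> float:
    """Availability seen by the user when every component must be up.

    Assumes independent failures (a simplification for illustration).
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


device = 0.99  # the "99% reliable smartphone" from the quote

# Compare the two service levels named in the quote.
for service in (0.9999, 0.99999):
    seen_by_user = combined_availability(device, service)
    print(f"service {service:.5f} -> user sees {seen_by_user:.6f}")
```

Under these assumptions the two end-to-end figures differ by less than one hundredth of a percentage point, which is small next to the one percent lost to the device itself.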
“When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.””
― Site Reliability Engineering: How Google Runs Production Systems
“Figure 2-4 shows how a user’s request is serviced: first, the user points their browser to shakespeare.google.com. To obtain the corresponding IP address, the user’s device resolves the address with its DNS server (1). This request ultimately ends up at Google’s DNS server, which talks to GSLB. As GSLB keeps track of traffic load among frontend servers across regions, it picks which server IP address to send to this user.

Figure 2-4. The life of a request

The browser connects to the HTTP server on this IP. This server (named the Google Frontend, or GFE) is a reverse proxy that terminates the TCP connection (2). The GFE looks up which service is required (web search, maps, or—in this case—Shakespeare). Again using GSLB, the server finds an available Shakespeare frontend server, and sends that server an RPC containing the HTTP request (3).”
― Site Reliability Engineering: How Google Runs Production Systems
“SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity.”
― Site Reliability Engineering: How Google Runs Production Systems
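
To make the idea of an error budget concrete, here is a minimal sketch of the arithmetic. The quote gives no numbers, so the 99.9% target, the 30-day window, and the 12 minutes of downtime are all hypothetical, as are the function names. The budget is taken to be the unavailability the target permits over the window.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Downtime permitted by an availability target over a window."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def remaining_budget_minutes(slo: float, window_days: int,
                             downtime_minutes: float) -> float:
    """Budget left after subtracting downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# Hypothetical example: 99.9% target, 30-day window, 12 minutes already lost.
budget = error_budget_minutes(0.999, 30)
left = remaining_budget_minutes(0.999, 30, 12.0)
print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

With these made-up figures the window allows about 43 minutes of downtime. A team that has used 12 of them still has room to take release risk, while a team that has used them all would be expected to slow down.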