
Site Reliability Engineering: How Google Runs Production Systems

Average rating 4.23 · 1,963 ratings · 193 reviews
The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitmen
Paperback, 552 pages
Published April 26th 2016 by O'Reilly Media (first published April 16th 2016)

Reader Q&A

Popular Answered Questions
Hampus Wessman: Yes. The free version on the website was released later. It's the same book.

Community Reviews

Simon Eskildsen
Much of the information from Google on running production systems effectively has been extremely important to how I have changed my thinking about the SRE role over the years. Finally, there's one piece that has all of what you previously had to look long and hard for in various talks, papers and abstracts: error budgets, the SRE role definition, scaling, etc. That said, this book suffers from the classic problem of having too many authors write independent chapters. Much is repeated, ...
Apr 25, 2016 rated it it was ok
Boring as F. The main message is: oh look at us, we have super hard problems and like saying 99.999% a lot. And oh yeah... SREs are developers. We don't spend more than 50% on "toil" work. Pleeeease. Book has some interesting stories and if you are good at reading between the lines you might learn something. Everything else is BS. Does every chapter need to start by telling us who edited the chapter? I don't give a f. The book also seems to be the product of multiple individuals (a lot of them act ...
Michael Scott
Apr 23, 2016 rated it liked it
Shelves: compsci-tech
Site Reliability Engineering, or Google's claim to fame re: technology and concepts developed more than a decade ago by the grid computing community, is a collection of essays on the design and operation of large-scale datacenters, with the goal of making them simultaneously scalable, robust, and efficient. Overall, despite (willing?) ignorance of the history of distributed systems and in particular (grid) datacenter technology, this is an excellent book that teaches us how Google thinks (or use ...
Dimitrios Zorbas
Jul 26, 2017 rated it it was amazing
I have so many bookmarks in this book and consider it an invaluable read. While not every project / company needs to operate at Google scale, it helps streamline the process of defining SLOs / SLAs for the occasion and establishing communication channels and practices to achieve them.

It helped me wrap my head around concepts for which I used to rely on intuition.
I've shaped processes and created template documents (postmortem / launch coordination checklist) for work based on this book.
Michael Koltsov
Mar 03, 2017 rated it it was amazing
I don’t normally buy paper books, which means that in the course of the last few years I’ve bought only one paper book even though I’ve read hundreds of books during that period of time. This book is the second one I’ve bought so far, which means a lot to me. Not to mention that Google is providing it on the Internet free of charge.

For me, personally, this book is a basis on which a lot of my past assumptions could be argued as viable solutions with the scale of Google. This book is not revealin
Alexander Yakushev
Sep 04, 2016 rated it really liked it
This book is great on multiple levels. First of all, it packs great content: the detailed explanation of how and why Google has internally established what we now call "the DevOps culture." The rationale, coupled with a hands-on implementation guide, provides incredible insight into creating and running an SRE team in your own company.
The text quality is top-notch, the book is written with clarity in mind and thoroughly edited.
I'd rate the content itself at four stars. But the book deserves the fi
James Stewart
Jun 12, 2016 rated it it was ok
Loads of interesting ideas and thoughts, but a bit of a slog to get through.

The approach of having different members of the team write different sections probably worked really well for engaging everyone, but it made for quite a bit of repetition. It also ends up feeling like a few books rolled into one, with one on distributed systems design, another on SRE culture and practices, and maybe another on management.
Alex Palcuie
Mar 18, 2017 rated it it was amazing
Shelves: favorites
I think this is the best engineering book in the last decade.
Regis Hattori
Dec 23, 2019 rated it really liked it
This book is divided into five parts: Introduction, Principles, Practices, Management, and Conclusions.

I see a lot of value in the first two parts for anyone involved in software development. They convince us of the importance of the subject with very good arguments, no matter if you are a software engineer, a product manager or even a user. This part deserves 5 stars.

After some chapters of the Practices part, the conclusion I made is that this part of the book may only be useful if you
Tomas Varaneckas
Sep 25, 2017 rated it it was ok
This was a really hard read, in a bad sense. The first couple of dozen pages were really promising, but the book turned out to be an unnecessarily long, incredibly boring, repetitive and inconsistent gang bang of random blog posts and often trivial information. It has roughly 10% of valuable content, and would greatly benefit from being reduced to a 50-pager. In its current state it seems that it was a corporate collaborative ego-trip, to show potential employees how cool Google SRE is, and how maje ...
Oct 19, 2016 rated it really liked it
There's a ton of great information here, and we refer to it regularly as we're trying to change the culture at work. I gave it a 4 instead of a 5 because it does suffer a little from the style – think collection of essays rather than a unified arc – but it's really worth reading even if it requires some care to transfer to more usual environments.
Bjoern Rochel
Aug 27, 2019 rated it really liked it
Shelves: 2018, 2019
A little disclaimer: My review here is more about the concept and organizational parts than the pure technical aspects. Mostly because I manage engineering teams nowadays and these areas are the more important ones for me. This book also contains a lot of technical information on how to implement SRE that I would highly recommend for interested software engineers.

One aspect I liked in particular about SRE is the Error Budget concept, Google's way to manage the age-old conflict between product a
Liviu Costea
Nov 16, 2019 rated it it was amazing
Shelves: devops
A lot of food for thought, a book that became a reference in the field. The only problem is the wide coverage: you might find some chapters very niche, as not everybody cares how to build a layer 4 load balancer.
Highly recommended if you are following devops approaches.
Vít Listík
Dec 25, 2018 rated it it was amazing
I like the fact that it is written by multiple authors. Everything stated in the book seems so obvious but it is so sad to read it because it is not yet an industry standard. A must read for every SRE.
Amir Sarabadani
Dec 24, 2019 rated it liked it
It's basically a looong advertisement for Google with some useful information inside, while it should be the other way around.
Ahmad hosseini
What is SRE?
Site Reliability Engineering (SRE) is Google’s approach to service management.
An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
Typical SRE activities fall into the following approximate categories:
• Software engineering: Involves writing or modifying code, in addition to any associated design and documentation work.
• System engineering: Involves configuring p
Luke Amdor
Oct 16, 2017 rated it really liked it
Some really great chapters especially towards the beginning and the end. However, I feel like it could have been edited better. It meanders a lot.
Jul 15, 2018 rated it really liked it
The book seems largely to be a collection of essays written by disparate people within Google's SRE organization. It's as well-organized and coherent as that can be (and I think it's a good format for this -- far better than if they'd tried to create something with a more unified narrative). But it's very uneven: some chapters are terrific while some seem rather empty. I found the chapters on risk, load balancing, overload, distributed consensus, and (surprisingly) launches to be among the most ...
Scott Maclellan
Jan 26, 2019 rated it really liked it
A fantastic and in-depth resource. Great for going deeper and maturing how a company builds and runs software at scale. Touches on the specific tactical actions your team can take to build more reliable products. The extended sections on culture slowed me down a lot, but have led to some very interesting conversations at work.
Tadas Talaikis
Jun 06, 2017 rated it liked it
"Boring" (at least from the outside world's perspective; OK with me), and it could basically be much shorter. Culture, automation of everything, load balancing, monitoring, like everywhere else, except maybe the Borg thing.
Apr 22, 2018 rated it liked it
There’s interesting content for sure. But the writing isn’t engaging (the book is long so that becomes boring kinda fast) and some aspects of the Google culture are real creepy (best example: “humans are imperfect machines” while talking about people management...).
Jun 12, 2020 rated it it was amazing
This is a complete collection of everything about building the SRE team, from their practices to how to onboard a new SRE to the team.
I am personally really inspired by the concept of the error budget and the share-by-default culture fostered by practices such as blameless postmortems.
David Robillard
Feb 05, 2019 rated it it was amazing
Shelves: it
A must read for anyone involved with online services.
Gary Boland
Jan 14, 2018 rated it liked it
A useful checklist for production engineering is tarnished by the undercurrent of marketing/recruiting. Still deserves its place on the shelf if you deliver software for a living.
Sundarraj Kaushik
Oct 04, 2018 rated it it was amazing
A wonderful book to learn how to manage websites so that they are reliable.

Some good random extracts from the book.

Site Reliability Engineering
1. Operations personnel should spend 50% of their time in writing automation scripts and programs.
2. the decision to stop releases for the remainder of the quarter once an error budget is depleted
3. an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning o
This is the kind of book that can be quite hard to digest in one go, cover to cover. It took me more than two years to (casually) read it!

Of course not everything can be applied everywhere. Not every organization is of the size of Google, and has the same amount of resources to apply the principles.

Still, there is good advice in the book which can come in handy in many situations.
Mar 08, 2020 rated it liked it
Shelves: paused
The book is great for gaining a better understanding of Google’s SRE culture. But I got to a point where it became irrelevant to me, so I decided to drop it.
Mark Hillick
Aug 08, 2017 rated it it was amazing  ·  review of another edition
Tim O'Hearn
Feb 06, 2018 rated it it was amazing
“Perfect algorithms may not have perfect implementations.”

And perfect books may not have perfect writers. Site Reliability Engineering is an essay collection that can be rickety at times but is steadfast in its central thesis. Google can claim credit for inventing Site Reliability Engineering and, in this book, a bunch of noteworthy engineers share their wisdom from the trenches.

When it comes to software architecture and product development, I’ve found delight in reading about how startups’ p
Jun 18, 2016 rated it really liked it
When I started working on software infrastructure at large companies, I was struck by how little of what I was working on had been covered in school, and how little I could find in academia. Talking to friends in industry, many of us were facing the same problems, but there didn't seem to be any literature on what we were doing. Everything we learned, we learned either through the school of hard knocks, or from more experienced folks. This book fills a much needed gap.

Furthermore, since many com

Readers also enjoyed

  • Designing Data-Intensive Applications
  • The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations
  • Kubernetes: Up & Running
  • Building Microservices: Designing Fine-Grained Systems
  • Accelerate: Building and Scaling High-Performing Technology Organizations
  • The Unicorn Project
  • The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
  • Monolith to Microservices: Sustaining Productivity While Detangling the System
  • Team Topologies: Organizing Business and Technology Teams for Fast Flow
  • Software Engineering at Google: Lessons Learned from Programming Over Time
  • Release It!: Design and Deploy Production-Ready Software (Pragmatic Programmers)
  • Domain-Driven Design: Tackling Complexity in the Heart of Software
  • Making Work Visible: Exposing Time Theft to Optimize Work & Flow
  • The Manager's Path: A Guide for Tech Leaders Navigating Growth and Change
  • The Go Programming Language
  • The Mythical Man-Month: Essays on Software Engineering
  • A Philosophy of Software Design
  • Effective Java

Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy ...

“When a team must allocate a disproportionate amount of time to resolving tickets at the cost of spending time improving the service, scalability and reliability suffer.”
“team size should not scale directly with service growth.”