More on this book
Community
Kindle Notes & Highlights
The tribal nature of IT culture often entrenches practitioners in dogmatic positions that hold the industry back.
SRE is what happens when you ask a software engineer to design an operations team.
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available.
Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.