With this practical book, you’ll discover how to catch complications in your distributed system before they develop into costly problems. Based on his extensive experience in systems ops at large technology companies, author Slawek Ligus describes an effective data-driven approach for monitoring and alerting that enables you to maintain high availability and deliver a high quality of service. Learn methods for measuring state changes and data flow in your system, and set up alerts to help you recover quickly from problems when they do arise. If you’re a system operator waging the daily battle to provide the best performance at the lowest cost, this book is for you.
Was it easy to read: Not really. The ideas weren’t complex. But they were somehow formulated in a technical manner that made me re-read a lot of sentences to finally understand them.
What I liked about it: That it gives some practical, actionable tips, e.g. on plotting and understanding graphs (the typical quantities metrics represent, summary statistics best fit for those quantities), setting up alerts, calculating thresholds. Also I think this book really shows what a mature monitoring culture should look like: monitor extensively – alert selectively.
What I disliked: The author is a DevOps engineer, so naturally the book is written from a DevOps perspective; most examples are about CPU usage and the like. As a developer, I would have been happier with application monitoring examples. Plus the language was more complex than it needed to be, as I mentioned before.
Ideas/ Quotes: “Monitor extensively and alert selectively: identify what metrics drive your business and work top-down to setup alarms around timeseries behind KPIs”
“Ideally, monitoring should enable operators to drill down from high level overview into the fine levels of details, granular enough to point at specifics”
“Flawed assumption: the ticket generated from an alert is a unit of work rather than an indication of a problem in the system”
“All alarms that trigger on non-issue should be done away if there is no evidence that the resulting alerts are actionable. If this policy is not followed, false alarms will cause more harm than good. There are only 2 ways one can respond to non-issues: ignore it or overreact”
“Measuring quality must not be effort-full, otherwise quality assessment will come at a very high cost and with dubious credibility”
Excellent and concise practical book on setting up monitoring of your IT services. Full of technical advice without getting bogged down with any particular monitoring systems and software.
A short note about this book, which I used in my work. First of all, two good points. The first is that it deals with monitoring, alerting and reporting in general, that is to say independently of the tools used. This is both a strong point and a weak point, since it would have been useful to identify families of tools suited to each use. This step back is not so common and makes it possible to introduce higher-level concepts, for example the organization of monitoring in stacks, which is absolutely crucial, but also general notions and definitions applicable in all circumstances, or almost. And so we come to the second strong point: definitions. In a professional context it is essential to rely on precise definitions that frame concepts most people have an unfortunate tendency to confuse, such as monitoring and alerting.
As for the weak points, it lacks background and practical cases. If we do not know the subject well, we will finish the book about as badly off as we started -- I exaggerate, we will at least be armed with definitions and concepts, and that's already a lot. The writing is completely devoid of soul: no humor, no anecdotes, which makes reading quite boring.
The most problematic point is the structure of the book, which is really unclear and makes it hard to refer back to it to find a particular element. More importantly, it lacks structuring elements for the implementation of a solution, like the 4 golden signals:
> - Latency: The time it takes to service a request, with a focus on distinguishing between the latency of successful requests and the latency of failed requests.
> - Traffic: A measure of how much demand is being placed on the service. This is measured using a high-level service-specific metric, like HTTP requests per second in the case of an HTTP REST API.
> - Errors: The rate of requests that fail. The failures can be explicit (e.g., HTTP 500 errors) or implicit (e.g., an HTTP 200 OK response with a response body having too few items).
> - Saturation: How “full” is the service. This is a measure of the system utilization, emphasizing the resources that are most constrained (e.g., memory, I/O or CPU). Services degrade in performance as they approach high saturation.
Or the 5 golden signals if we add to that the measure of availability -- I wrote an article about it. Despite these reservations, it is still useful to have the book on hand to refer to from time to time, but it is not a must-have -- far from it. I would rather read a more recent book from the same publisher: Practical Monitoring: Effective Strategies for the Real World.
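To make the four signals quoted in this review a little more concrete, here is a minimal sketch of how they could be computed over one window of request records. Everything in it is an assumption made for illustration: the `Request` record, the `golden_signals` function, the use of the median as the latency summary, and CPU utilization as the saturation figure are hypothetical choices, not something taken from the book or from any particular monitoring tool. It only counts explicit failures (HTTP 5xx); implicit failures, such as a 200 response with a bad body, would need application-specific checks.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    duration_ms: float  # time taken to service the request
    status: int         # HTTP status code


def golden_signals(requests, window_s, cpu_utilization):
    """Summarize one window of requests into the four signals.

    window_s (window length in seconds) and cpu_utilization (0.0 to 1.0)
    are assumed inputs; a real system would read saturation from whichever
    resource is most constrained.
    """
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        # Latency: kept separate for successful and failed requests.
        "latency_ok_ms": median(ok) if ok else None,
        "latency_failed_ms": median(failed) if failed else None,
        # Traffic: demand on the service, here requests per second.
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that failed explicitly (HTTP 5xx).
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        # Saturation: how "full" the service is.
        "saturation": cpu_utilization,
    }


sample = [
    Request(120.0, 200),
    Request(80.0, 200),
    Request(100.0, 200),
    Request(900.0, 500),
]
signals = golden_signals(sample, window_s=2.0, cpu_utilization=0.75)
print(signals)
```

Keeping the failed-request latency apart from the successful one follows the definition quoted above: a fast error and a slow error point at different problems, and mixing them into a single number would hide both.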
I was really looking forward to this book as I've heard good things about it and thought it would round out what I already knew about the topic. However, right from the start it felt rather awkward. The author is trying to maintain an abstract high-level view on monitoring and alerting and not go into specific implementations. This makes for an awkward combination with it being basically a 101/introductory book on the topic. A lot of the formal descriptions of monitoring and alerting feel forced, don't hold up in the abstract very well, and are too high level to be practical. He also talks about operations in an almost romantic hero style way which I didn't enjoy. In addition to that the book also includes some final chapters on outage handling and organizational and cultural setups. The terms human error, root cause analysis, and "5 Whys" are thrown around a lot with no acknowledgement that they are actually harmful to learning according to modern research in the field of systems safety. Definitely not a book I would recommend.
A very solid overview of monitoring and alerting for online services. The book covers both the engineering aspects of setting up and configuring monitors and alerts as well as the management aspects of collecting data over time, using it to implement continuous improvement, etc.
I appreciated that the book covered monitoring and alerting in general and wasn't specific to a particular system or toolchain.
Highly recommended for both beginners and experienced engineers dealing with monitoring and alerting in production.
Something about the writing style made it hard to stay interested along the way. It might have been that I felt like I was so often reading that an idea was about to be broken down into a number of smaller components. Some long chapters didn't help, either.
Another gripe is a large number of grammatical errors. The editor(s) should've done a better job with these.
In the printed book, the graphs and figures were often very hard to read. The colors used were hard to distinguish compared to the PDF or ebook.
Good introduction to monitoring. Lacks some real examples.
I was expecting a few more examples from this book, both technical and non-technical. Still, it's a very good introduction to monitoring and alerting, but I don't know if I would recommend it to someone who's just starting on the subject.
A practical book, but not so accurate right now (2017). The technology it covers has not been kept up to date and has been replaced by other projects that are not covered in this book.
However, it can give you a strong background on what monitoring is and should be, what types of monitoring we can use, and for what purposes.
Did you know there is a solid theoretical background behind monitoring (sufficient for a couple of PhD theses)? I didn't, but there actually is, and the book proves it.