The Complete Guide to Optimizing Systems Performance
Written by the winner of the 2013 LISA Award for Outstanding Achievement in System Administration. Large-scale enterprise, cloud, and virtualized computing systems have introduced serious performance challenges. Now, internationally renowned performance expert Brendan Gregg has brought together proven methodologies, tools, and metrics for analyzing and tuning even the most complex environments. Systems Performance: Enterprise and the Cloud focuses on Linux(R) and Unix(R) performance, while illuminating performance issues that are relevant to all operating systems. You'll gain deep insight into how systems work and perform, and learn methodologies for analyzing and improving system and application performance. Gregg presents examples from bare-metal systems and virtualized cloud tenants running Linux-based Ubuntu(R), Fedora(R), CentOS, and the illumos-based Joyent(R) SmartOS(TM) and OmniTI OmniOS(R). He systematically covers modern systems performance, including the "traditional" analysis of CPUs, memory, disks, and networks, and new areas including cloud computing and dynamic tracing. This book also helps you identify and fix the "unknown unknowns" of complex performance: bottlenecks that emerge from elements and interactions you were not aware of. The text concludes with a detailed case study, showing how a real cloud customer issue was analyzed from start to finish.
Coverage includes
- Modern performance analysis and tuning: terminology, concepts, models, methods, and techniques
- Dynamic tracing techniques and tools, including examples of DTrace, SystemTap, and perf
- Kernel internals: uncovering what the OS is doing
- Using system observability tools, interfaces, and frameworks
- Understanding and monitoring application performance
- Optimizing CPUs: processors, cores, hardware threads, caches, interconnects, and kernel scheduling
- Memory optimization: virtual memory, paging, swapping, memory architectures, busses, address spaces, and allocators
- File system I/O, including caching
- Storage devices/controllers, disk I/O workloads, RAID, and kernel I/O
- Network-related performance issues: protocols, sockets, interfaces, and physical connections
- Performance implications of OS and hardware-based virtualization, and new issues encountered with cloud computing
- Benchmarking: getting accurate results and avoiding common mistakes
This guide is indispensable for anyone who operates enterprise or cloud environments: system, network, database, and web admins; developers; and other professionals. For students and others new to optimization, it also provides exercises reflecting Gregg's extensive instructional experience.
This isn't a book, so much as it is a reference manual or an appendix. It's nearly 800 pages of dense, low-level discussions of performance issues related to the CPU, memory, hard drive, OS, and so on. The writing is very structured, repetitive, and dry and resembles a list of facts more than prose. If you have a specific performance issue and need to know how to, say, use DTrace to diagnose an issue with a memory leak, this book is perfect. If you're looking for something you can read cover to cover to generally improve your understanding of system performance, this book probably isn't it.
If you are going to read this, I recommend reading the first few sections of each chapter, which typically have a nice introduction to the architecture of the CPU, memory, etc. They are also full of handy tables, such as typical, real-world latencies and typical performance trade-offs to consider (e.g. CPU vs. memory, small vs. large record sizes). The remainder of each chapter is a deep dive into specific performance tools you can use, which is handy as a reference, but does not make for interesting reading otherwise, as there is no way you can retain so much detailed info. I'd also mention that since the author is a Solaris and DTrace expert, you will see a lot of information about both in every single chapter.
The final chapter of the book is great: it walks through a real-world case study and shows how to use various techniques to analyze it and the thought process that goes into tracking down performance bottlenecks. Seeing such a case study gives you a much better sense of the context in which the various performance tools should be used and some awareness of whether the data returned by those tools is normal or not. This would have been a much better book if every chapter had been primarily focused on such case studies, with all the other nitty-gritty details tacked on solely as supporting information (perhaps in an appendix!).
Whenever I watch Netflix movies from different locations, on different networks, and via different devices without any serious performance problems, I think about this book. I just can't help it. Guess why?
Brendan Gregg, also known as The Server Whisperer, among other things, wrote one of the most pragmatic and comprehensive books on Unix and Linux performance engineering. If you've been in any way seriously involved with benchmarking, tuning, and analyzing GNU/Linux or Unix-based systems in the last 10 years, you've either come across some of the tools developed by the author, such as DTraceToolkit, or used concepts and techniques developed by him, such as the USE Method and Flame Graphs.
I'll cut it short, and try to answer a simple question: Should you buy and study this book?
Yes, and no.
Yes, in the sense that it has solid, very well laid out and well presented material on the fundamental aspects of performance engineering for most of the underlying components and layers, such as the kernel, file systems, disks, memory, networking, etc. On top of that, the principles clearly explained and exemplified by Gregg are timeless, e.g. queuing theory, benchmarking pitfalls, workload characterization, checklists, visualization techniques, etc. (The book also provides nice historical context, which is a plus in my book, but that's just me.) Moreover, the questions at the end of each chapter are very nice additions: some of the straightforward technical questions will help you fill your technical gaps, whereas some of the more open-ended ones are very good practice material for stretching your analytical thinking.
No, in the sense that a period of 6 years is like a lifetime in this industry, and technology is a moving target, GNU/Linux doubly so! From 2021 onward, I don't think the majority of the readers of this review will need any of the Solaris-related practical and historical material in the book (it is great, but not for everyone, I guess). Moreover, it would be good to have more up-to-date resources on the finer details of practical performance engineering tools for GNU/Linux systems, as well as a heavier focus on cloud-computing-related tips, tricks, and pitfalls when it comes to analyzing and tuning performance.
I bought this book some time ago, but I'll definitely buy the 2nd Edition from the same author (see the official book web site for more information).
Long story short, most of the body of knowledge in this book (and most probably the one in the 2nd Edition) can be considered "required" for most professional performance engineers working with Linux-based systems.
Great book on debugging production systems. It offers a comprehensive, but simple, mental model for how systems work, and solid methodologies for looking at each component. The standout is the USE method: checking each system component (network, disk, CPU, memory, mutexes, ...) for utilization, saturation, and errors. Most of the time people use the 'streetlight' method, going through whatever tools they happen to know. Its absurdity is best illustrated by the parable of the drunk man who looked for his keys under the streetlight because that was where the light was.
The reason I can't give the 5th star is that the book is focused on currently observable problems. Many of the gnarliest systems performance problems I've encountered happen for a short period of time under some hard-to-reproduce condition (where the focus is on recovery, not understanding). You then have to dig through metrics _after_ the fact to find out what might have happened. This is often easy for errors, but not for saturation and utilization. Why is there nothing in the book about this?
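For readers who have not met the USE method mentioned in this review, here is a minimal sketch of the idea in Python, applied to samples that were recorded earlier rather than observed live. The `Sample` record, the `use_check` function, the resource names, and the 90% utilization limit are all hypothetical choices made for illustration; none of them come from the book.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One recorded measurement for a resource (hypothetical schema)."""
    resource: str       # e.g. "cpu", "disk", "net"
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: float   # amount of queued work, e.g. run-queue length
    errors: int         # error events counted in the interval


def use_check(samples, util_limit=0.9):
    """Return (resource, finding) pairs, checking errors first,
    then saturation, then utilization for each sample.

    util_limit is an assumed threshold, not a recommended value.
    """
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append((s.resource, "errors"))
        if s.saturation > 0:
            findings.append((s.resource, "saturation"))
        if s.utilization >= util_limit:
            findings.append((s.resource, "utilization"))
    return findings


if __name__ == "__main__":
    recorded = [
        Sample("cpu", utilization=0.95, saturation=3, errors=0),
        Sample("disk", utilization=0.20, saturation=0, errors=1),
        Sample("net", utilization=0.10, saturation=0, errors=0),
    ]
    for resource, finding in use_check(recorded):
        print(f"{resource}: {finding}")
```

The point of the sketch is only the shape of the checklist: every resource gets the same three questions, so nothing is skipped just because no familiar tool points at it. It also shows the reviewer's concern: the check is only as good as the saturation and utilization samples that happened to be recorded.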
This is an impressive book, packed with information about methodology and tooling to diagnose performance problems in modern computing environments. The problem is that the presentation of the material is so incredibly dull that in the end I wanted to go read a telephone book for pleasure instead.
The structure is clear and I see what the author was trying to do: provide a repeating pattern for how information is laid out across chapters. But it makes the robotic writing feel even less appealing. I was also annoyed by Gregg tooting his own horn a little too much for my taste. He wastes no opportunity to point out which programs presented here exist thanks to him, down to the exact date and circumstances in which he built them. Who cares?! That is useless information to the reader.
I wish the approach of the last chapter, the case study, had appeared in every chapter, presenting real-world examples of how to apply all the theory Gregg teaches. In that sense the book feels like a mixture of a university course and a reference: lots of theory and frameworks, but very little time spent on actually guiding the reader through how to transfer this knowledge to actual problems.
Nevertheless, I can see myself coming back to this book a lot. Again, the university analogy comes to mind: at the time you're taught the subject you have no idea how to apply any of it or when, but several years later you might encounter a problem and have that light bulb moment in which you remember and dust off this tome and find the advice you need.
If you're looking for a gentle and most of all practical introduction to performance analysis, however, this isn't it.
Do not let the size daunt you, however. The chapters are self-contained, as the author understands that the book might be read under pressure, and each ends with useful exercises.
What really makes this book stand out is not the top-notch technical writing or the abundance of useful one-liners, but the fact that the author goes further and proposes a methodology for troubleshooting and performance analysis, as opposed to the ad-hoc methods of the past (or, best case scenario, a checklist, and $DEITY forbid the use of the “blame someone else methodology”). In particular, the author suggests the USE methodology, USE standing for Utilization, Saturation, Errors, to methodically and accurately analyze and diagnose problems. This methodology (which can be adapted and expanded at will; last time I checked, the book was not written in stone) is worth the price of the book alone.
The author correctly maintains that you must have an X-ray (so to speak) of the system at all times. By utilizing tools such as DTrace (available for Solaris and BSD) or the Linux equivalent SystemTap, much insight can be gained from the internals of a system.
Chapters 5-10 are self-explanatory: the author presents what the chapter is about, common errors, and common one-liners used to diagnose possible problems. As said before, chapters aim to be self-contained and can be read while actually troubleshooting a live system, so there are no lengthy explanations there. At the end of each chapter, the bibliography section provides useful pointers to resources for further study, something that is greatly appreciated. Finally, the exercises can easily be turned into interview questions, which is another bonus.
Cloud computing, and the special considerations it presents, gets its own chapter, and the author tries to keep it platform agnostic (even though he is employed by a “Cloud Computing” company), which is a nice touch. This is followed by a chapter of useful advice on how to actually benchmark systems, and the book ends with a sadly too short case study.
The appendices that follow should be read, as they contain a lot of useful one-liners (as if the ones in the book were not enough), concrete examples of the USE method, a guide to porting DTrace to SystemTap, and a who's who in the world of systems performance.
So how to sum up the book? “Incredible value” is one thought that comes to mind, “timeless classic” is another. If you are a systems {operator|engineer|administrator|architect}, this book is a must-have and should be kept within reach at all times. Even if your $DAYJOB does not have systems in the title, the book is going to be useful if you have to interact with Unix-like systems on a frequent basis.
Brendan is probably the de facto authority in the performance world. He walks through the Linux kernel internals and covers the performance of each area: memory, CPU, file systems, disks, and networks. His methodologies for analyzing performance problems are a must-read for SREs and performance engineers. There is a plethora of tools that Brendan helped create for Linux performance troubleshooting. I love the easy-to-follow, structured approach of Brendan's writing. In particular, the USE methodology, the drill-down methodology, and the block diagram with tools should be at the desk of every SRE, production engineer, and performance engineer.
My only concern is that I was too late in picking up this first edition. BPF tools are not covered and some of the content is outdated by now.
If you want to read this, please wait until November 2020, when the new edition hits the stands.
A very, very well written book. I didn't actually read it front to back. I read the first 4 chapters, which cover the foundations, chapter 5, which covers application-level performance, and the last 3 chapters on cloud and multi-tenant performance, benchmarking, and a case study. The middle chapters dive into other specific topics like CPU, memory, file systems, etc. that I will reference on an as-needed basis.
Overall very well written, communicates concepts clearly, and reifies a lot of things that I often see go unnoticed or underappreciated. A book worth keeping on the bookshelf even after being read.
Absolutely amazing book on performance measurement. It contains a lot of theory on how to measure performance (starting from "what performance really is", which is not so obvious) through to examples of how to drill down. The book also contains a lot of practical examples of investigating performance issues. It looks slightly outdated (SystemTap, Solaris, DTrace) but it is really worth reading for admins and anyone who cares about performance.
Though at risk of being a tad ranty about how Solaris is better than Linux, Brendan Gregg's treatment of kernel development and performance is detailed and comprehensive: it introduces each topic and then guides the reader through how to measure it. It's a must-read for Linux developers.
Good to skim through to learn about what is possible and out there, but it's more like a reference book to check when needed with a specific performance problem.
When investigating a CPU issue, I found that I had to dive much deeper than the details provided by this book. It's a good starting point though. Brendan Gregg is a legend.
One of my favourite all-round references when I need to do system benchmarking or troubleshooting. This book covers almost all aspects of the low-level stuff, from the kernel to network protocols, file systems, and disk systems. Highly recommended.
This book is nightmarishly good. I figured it would take months to slog through it, but was able to in 4 months. I was a bit nutty, since I took handwritten notes.
Book I wanted to have when I first started my career. All the stuff in OS classes and things gleaned from experience wrapped up in one, with exercises at chapter's end to test your knowledge, well-written text, graphics that accurately capture and describe topics and a mild amount of human humour injected where appropriate.
This is a HARD long read if you are totally new to this space. By the end of it, you won't know everything, but you will have a general idea and know where to look.
I think the book suffers from being just so long. Also, you will miss out if you do not have DTrace installed and functioning properly, with kernel symbols enabled, so that you can follow some of the more detailed tracing and system call demonstrations.
I want to give it a 4.5 for not being totally Linux-centric, but this is "Systems Performance," so that agnosticism is wrong on my part. The length is pretty killer as well, so the book can double as explainer and reference. Some folks may want to just skim, find some magic commands, and go from there. That may be enough to survive but not to "thrive" so I won't hate on that.
Buy the book, fight through from start to finish, feel confident afterwards with your next perf challenge.
This book details how to approach software performance issues. It explains how to observe, measure and visualise what is happening in the OS and beyond (only on the Linux and Solaris platforms). I don't think the book reads very well from cover to cover, but I did devour the first chapters, which explain the concepts and methodologies, as well as the final case study. In my opinion, the rest is more of a reference to consult when faced with a particular performance issue, as those chapters cover CPU, memory, network, disk, etc. and detail how to use the available tooling.