Kindle Notes & Highlights
by
Gene Kim
Read between
May 25 - June 9, 2019
Management at New Relic, observed in 2011, “We found that when we woke up developers at 2 a.m., defects were fixed faster than ever.” One side effect of this practice is that it helps Development management see that business goals are not achieved simply because features have been marked as “done.” Instead, the feature is only done when it is performing as designed in production, without causing excessive escalations or unplanned work for either Development or Operations.
There will always be some people who are less experienced than others doing releases and launches. The LRR and HRR checklists are a way to create that organizational memory.”
If we are not performing user research, the odds are that two-thirds of the features we are building deliver zero or negative value to our organization, even as they make our codebase ever more complex, thus increasing our maintenance costs over time and making our software more difficult to change. Furthermore, the effort to build these features is often made at the expense of delivering features that would deliver value (i.e., opportunity cost). Jez Humble joked, “Taken to an extreme, the organization and customers would have been better off giving the entire team a vacation, instead of ...
“human error is not our cause of troubles; instead, human error is a consequence of the design of the tools that we gave them.”
In the blameless post-mortem meeting, we will do the following:
- Construct a timeline and gather details from multiple perspectives on failures, ensuring we don’t punish people for making mistakes
- Empower all engineers to improve safety by allowing them to give detailed accounts of their contributions to failures
- Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them in the future
- Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgment of those decisions ...
After we conduct a blameless post-mortem meeting, we should widely announce the availability of the meeting notes and any associated artifacts (e.g., timelines, IRC chat logs, external communications). This information should (ideally) be placed in a centralized location where our entire organization can access it and learn from the incident.
Our work in the technology value stream, like space travel, should be approached as a fundamentally experimental endeavor and managed that way. All work we do is a potentially important hypothesis and a source of data, rather than a routine application and validation of past practice. Instead of treating technology work as entirely standardized, where we strive for process compliance, we must continually seek to find ever-weaker failure signals so that we can better understand and manage the system we operate in.
Resilience requires that we first define our failure modes and then perform testing to ensure that these failure modes operate as designed. One way we do this is by injecting faults into our production environment and rehearsing large-scale failures so we are confident we can recover from accidents when they occur, ideally without even impacting our customers.
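The idea of defining a failure mode and then injecting a fault to confirm it behaves as designed can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FaultInjector`, `primary_lookup`, and `fetch_price` are hypothetical names, the "fault" is a simulated `ConnectionError`, and the designed failure mode is assumed to be "serve the last known value from a cache." A real exercise would target actual infrastructure rather than an in-process wrapper.

```python
import random


class FaultInjector:
    """Hypothetical helper: fails a wrapped call at a configured rate."""

    def __init__(self, failure_rate, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)

    def call(self, func, *args, **kwargs):
        # Raise a simulated outage instead of calling the real dependency.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)


def primary_lookup(sku):
    """Stand-in for a call to a real downstream service (assumption)."""
    return {"sku": sku, "price": 42, "source": "primary"}


def fetch_price(sku, injector, cache):
    """Designed failure mode: if the primary is down, serve the cached value."""
    try:
        result = injector.call(primary_lookup, sku)
        cache[sku] = result
        return result
    except ConnectionError:
        if sku in cache:
            return {**cache[sku], "source": "cache"}
        raise  # No cached value: the failure is surfaced, not hidden.


# Rehearsal: warm the cache with no faults, then force a total outage.
cache = {}
healthy = fetch_price("abc", FaultInjector(failure_rate=0.0), cache)
degraded = fetch_price("abc", FaultInjector(failure_rate=1.0), cache)
```

In this sketch the rehearsal passes if `degraded` still carries the price but reports `"cache"` as its source, which is the failure mode we said we wanted.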
Instead of putting our expertise into Word documents, we need to transform these documented standards and processes, which encompass the sum of our organizational learnings and knowledge, into an executable form that makes them easier to reuse. One of the best ways we can make this knowledge re-usable is by putting it into a centralized source code repository, making the tool available for everyone to search and use.
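As one possible illustration of turning a documented standard into executable form, a readiness checklist could be written as code and kept in the shared repository. The sketch below is an assumption, not a description of any particular organization's checklist: the `Check` type, the three example checks, and the service fields (`oncall`, `runbook_url`, `dashboards`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Check:
    """One checklist item: a name plus a predicate over a service descriptor."""
    name: str
    passed: Callable[[Dict], bool]


# Hypothetical checklist items; a real list would encode the team's own standards.
CHECKS = [
    Check("has on-call rotation", lambda s: bool(s.get("oncall"))),
    Check("has runbook link", lambda s: bool(s.get("runbook_url"))),
    Check("has monitoring dashboards", lambda s: len(s.get("dashboards", [])) > 0),
]


def failed_checks(service: Dict, checks: List[Check] = CHECKS) -> List[str]:
    """Return the names of checklist items the service does not satisfy."""
    return [check.name for check in checks if not check.passed(service)]


# Example descriptor with an empty runbook link (illustrative data only).
service = {"oncall": "team-a", "runbook_url": "", "dashboards": ["latency"]}
missing = failed_checks(service)
```

Because the checklist is ordinary code, it can be searched, reviewed, versioned, and run automatically, which a Word document cannot offer.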

