Kindle Notes & Highlights
by Gene Kim
Read between August 5, 2020 and March 5, 2021
Development management see that business goals are not achieved simply because features have been marked as “done.” Instead, the feature is only done when it is performing as designed in production, without causing excessive escalations or unplanned work for either Development or Operations.‡
Regardless of how we’ve organized our teams, the underlying principles remain the same: when developers get feedback on how their applications perform in production, which includes fixing it when it breaks, they become closer to the customer; this creates a buy-in that everyone in the value stream benefits from.
The most common reaction for developers after participating in a customer observation is dismay, often stating “how awful it was seeing the many ways we have been inflicting pain on our customers.” These customer observations almost always result in significant learning and a fervent desire to improve the situation for the customer.
By having developers be responsible for deployment and production support, we are far more likely to have a smooth transition to Operations.**
One of the many surprising facts about Google is that they have a functional orientation for their Ops engineers, who are referred to as “Site Reliability Engineers” (SRE), a term coined by Ben Treynor Sloss in 2004.‡‡
Treynor Sloss has resisted creating a single-sentence definition of what SREs are, but he once described SREs as “what happens when a software engineer is tasked with what used to be called operations.”
In this book, we use the term “Ops engineer,” but the two terms, “Ops Engineer” and “Site Reliability Engineer,” are intended to be interchangeable.
The faster we can experiment, iterate, and integrate feedback into our product or service, the faster we can learn and out-experiment the competition. And how quickly we can integrate our feedback depends on our ability to deploy and release software.
If we are not performing user research, the odds are that two-thirds of the features we are building deliver zero or negative value to our organization, even as they make our codebase ever more complex, thus increasing our maintenance costs over time and making our software more difficult to change. Furthermore, the effort to build these features is often made at the expense of delivering features that would deliver value (i.e., opportunity cost). Jez Humble joked, “Taken to an extreme, the organization and customers would have been better off giving the entire team a vacation, instead of…
it’s so essential for developers to work in small, incremental steps rather than on long-lived feature branches.
Giray Özil tweeted, “Ask a programmer to review ten lines of code, he’ll find ten issues. Ask him to do five hundred lines, and he’ll say it looks good.”
When testing failures occur, our typical reaction is to do more testing. However, if we are merely performing more testing at the end of the project, we may worsen our outcomes.
Pair programming is when two engineers work together at the same workstation, a method popularized by Extreme Programming and Agile in the early 2000s.
When the two have differing specialties, skills are transferred as an automatic side effect, whether it’s through ad-hoc training or by sharing techniques and workarounds.
Jeff Atwood, one of the founders of Stack Exchange, wrote, “I can’t help wondering if pair programming is nothing more than code review on steroids….The advantage of pair programming is its gripping immediacy: it is impossible to ignore the reviewer when he or she is sitting right next to you.”
If accidents are not caused by “bad apples,” but rather are due to inevitable design problems in the complex system that we created, then instead of “naming, blaming, and shaming” the person who caused the failure, our goal should always be to maximize opportunities for organizational learning, continually reinforcing that we value actions that expose and share more widely the problems in our daily work.
As John Allspaw, CTO of Etsy, states, “Our goal at Etsy is to view mistakes, errors, slips, lapses, and so forth with a perspective of learning.”
When engineers make mistakes and feel safe when giving details about it, they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. This is what creates organizational learning.
To help enable a just culture, when accidents and significant incidents occur (e.g., failed deployment, production issue that affected customers), we should conduct a blameless post-mortem after the incident has been resolved.‡ Blameless post-mortems, a term coined by John Allspaw, help us examine “mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure.”
One of the potentially surprising outcomes of these meetings is that people will often blame themselves for things outside of their control or question their own abilities. Ian Malpass, an engineer at Etsy, observes, “In that moment when we do something that causes the entire site to go down, we get this ‘ice water down the spine’ feeling, and likely the first thought through our head is, ‘I suck and I have no idea what I’m doing.’ We need to stop ourselves from doing that, as it is a route to madness, despair, and feelings of being an imposter, which is something that we can’t let happen to good…
After developing and using Morgue, the number of recorded post-mortems at Etsy increased significantly compared to when they used wiki pages, especially for P2, P3, and P4 incidents (i.e., lower severity problems). This result reinforced the hypothesis that if they made it easier to document post-mortems through tools such as Morgue, more people would record and detail the outcomes of their post-mortem meetings, enabling more organizational learning.
On failures, Roy Rapoport from Netflix observes, “What the 2014 State of DevOps Report proved to me is that high-performing DevOps organizations will fail and make mistakes more often. Not only is this okay, it’s what organizations need! You can even see it in the data: if high performers are performing thirty times more frequently but with only half the change failure rate, they’re obviously having more failures.”
“DevOps must allow this sort of innovation and the resulting risks of people making mistakes. Yes, you’ll have more failures in production. But that’s a good thing, and should not be punished.”
As Peter Senge is known to say, “The only sustainable competitive advantage is an organization’s ability to learn faster than the competition.”
One of the practices that forms part of the Toyota Production System is called the improvement blitz (or sometimes a kaizen blitz), defined as a dedicated and concentrated period of time to address a particular issue, often over the course of several days. Dr. Spear explains, “...blitzes often take this form: A group is gathered to focus intently on a process with problems…The blitz lasts a few days, the objective is process improvement, and the means are the concentrated use of people from outside the process to advise those normally inside the process.”
In addition to the Lean-oriented terms kaizen blitz and improvement blitz, the technique of dedicated rituals for improvement work has also been called spring or fall cleanings and ticket queue inversion weeks. Other terms have also been used, such as hack days, hackathons, and 20% innovation time. Unfortunately, these specific rituals sometimes focus on product innovation and prototyping new market ideas, rather than on improvement work, and worse, they are often restricted to developers—which is considerably different than the goals of an improvement blitz.†
We may schedule week-long improvement blitzes that prioritize Dev and Ops working together toward improvement goals. These improvement blitzes are simple to administer: One week is selected where everyone in the technology organization works on an improvement activity at the same time. At the end of the period, each team makes a presentation to their peers that discusses the problem they were tackling and what they built. This practice reinforces a culture in which engineers work across the entire value stream to solve problems. Furthermore, it reinforces fixing problems as part of our daily…
Consider for a moment that our complex system is like a spider web, with intertwining strands that are constantly weakening and breaking. If the right combination of strands breaks, the entire web collapses. There is no amount of command-and-control management that can direct workers to fix each strand one by one. Instead, we must create the organizational culture and norms that lead to everyone continually finding and fixing broken strands as part of our daily work. As Dr. Spear observes, “No wonder then that spiders repair rips and tears in the web as they occur, not waiting for the failures…
The DevOps Enterprise Summit was created in 2014 for technology leaders to share their experiences adopting DevOps principles and practices in large, complex organizations. The program is organized primarily around experience reports from technology leaders on the DevOps journey, as well as subject matter experts on topics selected by the community.
In other words, when we use components or libraries—either commercial or open source—in our software, we not only inherit their functionality, but also any security vulnerabilities they contain.
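One hedged, concrete way to act on this observation, assuming a Python project with a requirements.txt and the pip-audit scanner available (the specific tool is my choice for illustration, not the book's; npm audit or OWASP Dependency-Check play the same role in other ecosystems), is to fail the build whenever known vulnerabilities are reported in third-party dependencies:

import subprocess
import sys

def audit_dependencies(requirements_file: str = "requirements.txt") -> int:
    # pip-audit exits non-zero when it finds known vulnerabilities in the listed packages.
    result = subprocess.run(["pip-audit", "-r", requirements_file])
    if result.returncode != 0:
        print("Known vulnerabilities found in third-party dependencies; failing the build.")
    return result.returncode

if __name__ == "__main__":
    sys.exit(audit_dependencies(*sys.argv[1:2]))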
AWS GovCloud has already been approved for use for federal government systems of all types, including those which require high levels of confidentiality, integrity, and availability. By the time you read this book, it is expected that Cloud.gov will be approved for all systems that require moderate levels of confidentiality, integrity, and availability.**
However, in order to protect our continuous build, integration, or deployment pipeline, our mitigation strategies may include:
▹ Hardening continuous build and integration servers and ensuring we can reproduce them in an automated manner,
▹ Reviewing all changes introduced into version control, either through pair programming at commit time or by a code review process between commit and merge into trunk, to prevent continuous integration servers from running uncontrolled code (a minimal sketch of such a review gate follows this list).
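The second mitigation lends itself to automation. The sketch below is a hypothetical, minimal CI gate (not taken from the book) that refuses to build a range of commits unless each one carries a Reviewed-by: trailer in its commit message, one common Git convention for recording review; a real pipeline would more likely query its code review system directly.

import subprocess
import sys

def commits_in_range(base: str, head: str) -> list[str]:
    # List commit SHAs reachable from head but not from base.
    out = subprocess.run(
        ["git", "rev-list", f"{base}..{head}"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.split()

def has_review_trailer(sha: str) -> bool:
    # Treat a 'Reviewed-by:' trailer in the commit message as evidence of review.
    message = subprocess.run(
        ["git", "log", "-1", "--format=%B", sha],
        check=True, capture_output=True, text=True,
    ).stdout
    return any(line.lower().startswith("reviewed-by:") for line in message.splitlines())

def main(base: str = "origin/main", head: str = "HEAD") -> int:
    unreviewed = [sha for sha in commits_in_range(base, head) if not has_review_trailer(sha)]
    if unreviewed:
        print("Refusing to build unreviewed commits:")
        for sha in unreviewed:
            print("  " + sha)
        return 1
    print("All commits carry a Reviewed-by trailer; proceeding with the build.")
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:3]))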
If we have constructed our deployment pipeline correctly so that deployments are low-risk, the majority of our changes won’t need to go through a manual change approval process, because we will have placed our reliance on controls such as automated testing and proactive production monitoring.
Ideally, by having a reliable deployment pipeline in place, we will have already earned a reputation for fast, reliable, and non-dramatic deployments. At this point, we should seek to gain agreement from Operations and the relevant change authorities that our changes have been demonstrated to be low risk enough to be defined as standard changes, pre-approved by the CAB. This enables us to deploy into production without need for further approval, although the changes should still be properly recorded.
Even when our changes are categorized as standard changes, they still need to be visible and recorded in our change management systems (e.g., Remedy or ServiceNow). Ideally, deployments will be performed automatically by our configuration management and deployment pipeline tools (e.g., Puppet, Chef, Jenkins) and the results will be automatically recorded.
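As one illustration of "results automatically recorded," here is a minimal sketch of a post-deployment step that posts a standard-change record to a change-management API. The endpoint URL and field names are placeholders assumed for illustration, not the actual Remedy or ServiceNow interfaces.

import json
import os
import urllib.request

def record_standard_change(service: str, version: str, commit_sha: str, status: str) -> None:
    # Assemble the change record from metadata the deployment job already has.
    payload = {
        "type": "standard",  # pre-approved change category agreed with the CAB
        "service": service,
        "version": version,
        "commit": commit_sha,
        "status": status,  # e.g., "succeeded" or "rolled_back"
        "pipeline_build": os.environ.get("BUILD_URL", "unknown"),  # set by many CI servers
    }
    request = urllib.request.Request(
        "https://change-api.example.internal/changes",  # hypothetical endpoint, not a real product API
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print(f"Change record created, HTTP {response.status}")

if __name__ == "__main__":
    record_standard_change("checkout-service", "2.31.0", "9f2c4e1", "succeeded")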
We may automatically link these change request records to specific items in our work planning tools (e.g., JIRA, Rally, LeanKit, ThoughtWorks Mingle), allowing us to create more context for our changes, such as linking to feature defects, production incidents, or user stories. This can be accomplished in a lightweight way by including ticket numbers from planning tools in the comments associated with version control check-ins.§ By doing this, we can trace a production deployment to the changes in version control and, from there, trace them further back to the planning tool tickets.
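A lightweight way to implement this traceability is sketched below, under the assumption that ticket IDs follow a JIRA-style "ABC-123" pattern: scan the commit messages included in a deployment and collect the planning-tool tickets they mention.

import re
import subprocess

TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")  # e.g., PAY-481 or OPS-7 (illustrative)

def tickets_for_deploy(previous_release: str, current_release: str) -> set[str]:
    # Collect every ticket ID mentioned in the commit messages shipped by this deploy.
    log = subprocess.run(
        ["git", "log", "--format=%B", f"{previous_release}..{current_release}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(TICKET_PATTERN.findall(log))

if __name__ == "__main__":
    # Example: which planning-tool tickets shipped between the last two release tags?
    for ticket in sorted(tickets_for_deploy("v1.41.0", "v1.42.0")):
        print(ticket)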
wherever possible, we should avoid using separation of duties as a control. Instead, we should choose controls such as pair programming, continuous inspection of code check-ins, and code review. These controls can give us the necessary reassurance about the quality of our work.
At a time when every technology leader is challenged with enabling security, reliability, and agility, and at a time when security breaches, time to market, and massive technology transformation are taking place, DevOps offers a solution.
Innovation is impossible without risk taking, and if you haven’t managed to upset at least some people in management, you’re probably not trying hard enough.
For example, we may choose to measure our incidents by the following metrics (a small sketch of computing them appears after this list):
▹ Event severity: How severe was this issue? This directly relates to the impact on the service and our customers.
▹ Total downtime: How long were customers unable to use the service to any degree?
▹ Time to detect: How long did it take for us or our systems to know there was a problem?
▹ Time to resolve: How long after we knew there was a problem did it take for us to restore service?
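To make these definitions concrete, here is a small sketch that computes the listed metrics from incident timestamps; the field names and severity scale are illustrative assumptions, not prescribed by the book.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    severity: int                # e.g., 1 (most severe) through 4
    started_at: datetime         # when the problem began affecting the service
    detected_at: datetime        # when we or our systems noticed it
    resolved_at: datetime        # when service was restored

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.detected_at

    @property
    def total_downtime(self) -> timedelta:
        # Approximates how long customers could not use the service.
        return self.resolved_at - self.started_at

if __name__ == "__main__":
    incident = Incident(
        severity=2,
        started_at=datetime(2021, 3, 1, 9, 0),
        detected_at=datetime(2021, 3, 1, 9, 12),
        resolved_at=datetime(2021, 3, 1, 10, 5),
    )
    print("Time to detect:", incident.time_to_detect)
    print("Time to resolve:", incident.time_to_resolve)
    print("Total downtime:", incident.total_downtime)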

