The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations
4%
Many DevOps practices emerge if we continue to manage our work beyond the goal of “potentially shippable code” at the end of each iteration, extending it to having our code always in a deployable state, with developers checking into trunk daily, and if we demonstrate our features in production-like environments.
4%
Instead of IT Operations doing manual work that comes from work tickets, it enables developer productivity through APIs and self-serviced platforms that create environments, test and deploy code, monitor and display production telemetry, and so forth. By doing this, IT Operations becomes more like Development (as do QA and Infosec), engaged in product development, where the product is the platform that developers use to safely, quickly, and securely test, deploy, and run their IT services in production.
4%
enabling them to win in the marketplace. Before the revolution, average manufacturing plant order lead times were six weeks, with fewer than 70% of orders shipped on time. By 2005, with the widespread implementation of Lean practices, average product lead times had dropped to less than three weeks, and more than 95% of orders were shipped on time.
5%
Development will take responsibility for responding to changes in the market and for deploying features and changes into production as quickly as possible. IT Operations will take responsibility for providing customers with IT service that is stable, reliable, and secure, making it difficult or even impossible for anyone to introduce production changes that could jeopardize production. Configured this way, Development and IT Operations have diametrically opposed goals and incentives.
5%
Alarmingly, our most fragile artifacts support either our most important revenue-generating systems or our most critical projects. In other words, the systems most prone to failure are also our most important and are at the epicenter of our most urgent changes. When these changes fail, they jeopardize our most important organizational promises, such as availability to customers, revenue goals, security of customer data, accurate financial reporting, and so forth.
5%
every company is a technology company, whether they know it or not.
5%
Instead of accruing technical debt, problems are fixed as they are found, mobilizing the entire organization if needed because global goals outweigh local goals.
7%
DevOps relies on bodies of knowledge from Lean, Theory of Constraints, the Toyota Production System, resilience engineering, learning organizations, safety culture, human factors, and many others. Other valuable contexts that DevOps draws from include high-trust management cultures, servant leadership, and organizational change management.
8%
In DevOps, we typically define our technology value stream as the process required to convert a business hypothesis into a technology-enabled service or feature that delivers value to the customer.
8%
The First Way enables fast left-to-right flow of work from Development to Operations to the customer. In order to maximize flow, we need to make work visible, reduce our batch sizes and intervals of work, build in quality by preventing defects from being passed to downstream work centers, and constantly optimize for global goals.
8%
The Second Way enables the fast and constant flow of feedback from right to left at all stages of our value stream. It requires that we amplify feedback to prevent problems from happening again, or that we enable faster detection and recovery.
8%
The Third Way enables the creation of a generative, high-trust culture that supports a dynamic, disciplined, and scientific approach to experimentation and risk-taking, facilitating the creation of organizational learning, both from our successes and failures.
10%
Small batch sizes result in less WIP, faster lead times, faster detection of errors, and less rework.
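One way to see why less WIP goes with faster lead times is Little's Law, a standard queueing result that the highlight itself does not cite: average lead time equals average WIP divided by average throughput. The sketch below is only an illustration; the function name and the team numbers are assumptions, not figures from the book.

```python
def average_lead_time(wip_items: float, throughput_per_day: float) -> float:
    """Little's Law: average lead time (days) = average WIP / average throughput."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip_items / throughput_per_day


# Hypothetical team that finishes 5 items per day in both scenarios.
large_batch = average_lead_time(wip_items=50, throughput_per_day=5)
small_batch = average_lead_time(wip_items=10, throughput_per_day=5)

print(f"large batches: {large_batch} days, small batches: {small_batch} days")
```

With throughput held constant, cutting WIP by a factor of five cuts the average lead time by the same factor, which is the relationship the highlight describes.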
11%
waste and hardship in the software development stream as anything that causes delay for the customer, such as activities that can be bypassed without affecting the result.
11%
•Partially done work: This includes any work in the value stream that has not been completed (e.g., requirement documents or change orders not yet reviewed) and work that is sitting in queue (e.g., waiting for QA review or server admin ticket). Partially done work becomes obsolete and loses value as time progresses.
11%
•Extra processes: Any additional work being performed in a process that does not add value to the customer. This may include documentation not used in a downstream work center, or reviews or approvals that do not add value to the output. Extra processes add effort and increase lead times.
11%
•Extra features: Features built into the service that are not needed by the organization or the customer (e.g., “gold plating”). Extra features add complexity and ...
This highlight has been truncated due to consecutive passage length restrictions.
11%
•Task switching: When people are assigned to multiple projects and value streams, requiring them to context switch and manage dependencies between work, adding addi...
11%
•Waiting: Any delays between work requiring resources to wait until they can complete the current work. Delays increase cycle time and pr...
11%
•Motion: The amount of effort to move information or materials from one work center to another. Motion waste can be created when people who need to communicate frequently are not colocated. Handoffs also create motion waste and o...
11%
•Defects: Incorrect, missing, or unclear information, materials, or products create waste, as effort is needed to resolve these issues. The longer the time between defect creation and defect detec...
11%
•Nonstandard or manual work: Reliance on nonstandard or manual work from others, such as using non-rebuilding servers, test environments, and configurations. Ideally, any manual work that can be automated should be automated, self-serviced, and available on demand. However, some types of manual work will likely always be essential.
11%
•Heroics: In order for an organization to achieve goals, individuals and teams are put in a position where they must perform unreasonable acts, which may even become a part of their daily work (e.g., nightly 2:00 AM problems in production, creating hundreds of work tickets as part of every software release).
11%
if I have a lousy idea, but it’s the only idea out there, then you know what? My lousy idea is the best idea we got going, and so that’s the one we try.
11%
how we’ve set things up, that’s all artificial, that’s a constraint. It’s not a natural law of physics. Keep that in mind because so much resistance comes from the uncertainty of doing something differently. There’s often this mindset that because we haven’t done something a certain way before, it can’t be done. But we’ve made all of this stuff up.
11%
Improving flow through the technology value stream is essential to achieving DevOps outcomes. We do this by making work visible, limiting WIP, reducing batch sizes and the number of handoffs, continually identifying and evaluating our constraints, and eliminating hardships in our daily work.
12%
We make our systems of work safer by creating fast, frequent, high-quality information flow throughout our value stream and our organization, which includes feedback and feedforward loops. This allows us to detect and remediate problems while they are smaller, cheaper, and easier to fix; to avert problems before they cause catastrophe; and to create organizational learning that we integrate into future work.
12%
When failures and accidents occur, we treat them as opportunities for learning, as opposed to causes for punishment and blame.
12%
In a safe system of work, we must constantly test our design and operating assumptions.
12%
The more assumptions we can invalidate, the faster we can find and fix problems, increasing our resilience, agility, and ability to learn and innovate.
12%
Instead of working around the problem or scheduling a fix “when we have more time,” we swarm to fix it immediately
13%
people were afraid to ask for help and they didn’t want to disturb their teammates. In order to alleviate this, they changed how they defined when teammates should pull the Andon cord. Instead of the cord being pulled whenever a team member was stuck, they would pull the cord whenever they needed the opinion of the team.
13%
timeliness. Examples of ineffective quality controls, per Lean Enterprise, include:
•Requiring another team to complete tedious, error-prone, and manual tasks that could be easily automated and run as needed by the team who needs the work performed.
•Requiring approvals from busy people who are distant from the work, forcing them to make decisions without adequate knowledge of the work or the potential implications, or to merely rubber stamp their approvals.
•Creating large volumes of documentation of questionable detail, which become obsolete shortly after they are written.
•Pushing large ...
13%
Lean defines two types of customers that we must design for: the external customer (who most likely pays for the service we are delivering) and the internal customer (who receives and processes the work immediately after us). According to Lean, our most important customer is our next step downstream.
13%
In the technology value stream, we optimize for downstream work centers by designing for operations, where operational non-functional requirements (e.g., architecture, performance, stability, testability, configurability, and security) are prioritized as highly as user features.
14%
When accidents and failures occur, instead of looking for human error, we look for how we can redesign the system to prevent the accident from happening again.
14%
We improve daily work by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of our code and environments. We do this by reserving cycles in each development interval, or by scheduling kaizen blitzes, which are periods when engineers self-organize into teams to work on fixing any problem they want.
14%
when teams or individuals have experiences that create expertise, our goal is to convert that tacit knowledge (i.e., knowledge that is difficult to transfer to another person by means of writing it down or verbalizing) into explicit, codified knowledge,
20%
organizations that don’t pay down technical debt can find themselves so burdened with daily workarounds for problems left unfixed that they can no longer complete any new work. In other words, they are now only making the interest payment on their technical debt. We will actively manage this technical debt by ensuring that we invest at least 20% of all Development and Operations capacity on refactoring and investing in automation work and architecture and non-functional requirements (NFRs).
20%
when organizations do not pay their “20% tax,” technical debt will increase to the point where an organization inevitably spends all of its cycles paying down technical debt.
22%
the person performing the work often has little visibility or understanding of how their work relates to any value stream goals (e.g., “I’m just configuring servers because someone told me to.”). This places workers in a creativity and motivation vacuum.
22%
market-oriented teams are responsible not only for feature development but also for testing, securing, deploying, and supporting their service in production, from idea conception to retirement. These teams are designed to be cross-functional and independent—able to design and run user experiments, build and deliver new features, deploy and run their service in production, and fix any defects without manual dependencies on other teams, thus enabling them to move faster.
22%
To achieve market orientation, we won’t do a large, top-down reorganization, which often creates large amounts of disruption, fear, and paralysis. Instead, we will embed the functional engineers and skills (e.g., Ops, QA, Infosec) into each service team, or create a platform organization that provides an automated technology platform for service teams to self-serve everything they need to test, deploy, monitor, and manage their services in testing and production environments.
23%
we want to encourage learning, help people overcome learning anxiety, help ensure that people have relevant skills and a defined career road map, and so forth. By doing this, we help foster a growth mindset in our engineers—after all, a learning organization requires people who are willing to learn. By encouraging everyone to learn, as well as providing training and support, we create the most sustainable and least expensive way to create greatness in our teams—by investing in the development of the people we already have.
23%
Development and Test teams are assigned to a “project” and then reassigned to another project as soon as the project is completed and funding runs out. This leads to all sorts of undesired outcomes, including developers being unable to see the long-term consequences of decisions they make (a form of feedback) and a funding model that only values and pays for the earliest stages of the software life cycle—which, tragically, is also the least expensive part for successful products or services.
23%
developers should be able to understand and update the code of a service without knowing anything about the internals of its peer services. Services interact with their peers strictly through APIs and thus don’t share data structures, database schemata, or other internal representations of objects. Bounded contexts ensure that services are compartmentalized and have well-defined interfaces, which also enable easier testing.
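As a rough illustration of services that interact "strictly through APIs," the sketch below has an order service that depends only on an interface, never on the inventory service's internal data. All class and method names here (`InventoryAPI`, `OrderService`, `reserve`, and so on) are hypothetical and chosen for the example; they are not from the book.

```python
from dataclasses import dataclass
from typing import Protocol


class InventoryAPI(Protocol):
    """The only thing the order service knows about inventory: its public API."""

    def reserve(self, sku: str, quantity: int) -> bool: ...


@dataclass
class OrderService:
    inventory: InventoryAPI

    def place_order(self, sku: str, quantity: int) -> str:
        # No access to inventory tables or internal objects, only the API call.
        return "confirmed" if self.inventory.reserve(sku, quantity) else "rejected"


class InMemoryInventory:
    """One possible implementation; its stock dict is private to this service."""

    def __init__(self, stock: dict[str, int]) -> None:
        self._stock = dict(stock)

    def reserve(self, sku: str, quantity: int) -> bool:
        if self._stock.get(sku, 0) < quantity:
            return False
        self._stock[sku] -= quantity
        return True


class AlwaysAvailable:
    """Test double: satisfies the same API without any real inventory."""

    def reserve(self, sku: str, quantity: int) -> bool:
        return True
```

Because `OrderService` sees only the interface, it can be tested with `AlwaysAvailable` and the inventory internals can change freely, which is the "easier testing" benefit the highlight mentions.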
24%
Each two-pizza team (2PT) is as autonomous as possible. The team’s lead, working with the executive team, decides on the key business metric that the team is responsible for, known as the fitness function, which becomes the overall evaluation criteria for the team’s experiments. The team is then able to act autonomously to maximize that metric.
26%
The additional work identified during project team retrospectives falls into the broad category of improvement work, such as fixing defects, refactoring, and automating manual work. Product managers and project managers may want to defer or deprioritize improvement work in favor of customer features. However, we must remind everyone that improvement of daily work is more important than daily work itself, and that all teams must have dedicated capacity for this (e.g., reserving 20% of all capacity for improvement work, scheduling one day per week or one week per month, etc.).
27%
In order to create fast and reliable flow from Dev to Ops, we must ensure that we always use production-like environments at every stage of the value stream. Furthermore, these environments must be created in an automated manner, ideally on demand from scripts and configuration information stored in version control and entirely self-serviced, without any manual work required from Operations. Our goal is to ensure that we can re-create the entire production environment based on what’s in version control.
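To make the idea of building every environment from version-controlled configuration more concrete, here is a minimal sketch. It assumes a hypothetical JSON environment definition (inlined here, but imagined as a file in the repository) and a made-up `build_plan` helper that turns it into provisioning steps; a real setup would hand this work to a provisioning or configuration management tool.

```python
import json

# Hypothetical environment definition, as it might be stored in version control.
ENVIRONMENT_SPEC = json.loads("""
{
  "base_image": "ubuntu-22.04",
  "packages": ["nginx", "postgresql-client"],
  "settings": {"worker_count": 4}
}
""")


def build_plan(spec: dict, stage: str) -> list[str]:
    """Turn the versioned spec into an ordered list of provisioning steps.

    The stage name only labels the environment; every other step comes
    from the spec, so dev, test, and production are built the same way.
    """
    steps = [f"create environment '{stage}' from image {spec['base_image']}"]
    steps += [f"install {pkg}" for pkg in sorted(spec["packages"])]
    steps += [f"set {key}={value}" for key, value in sorted(spec["settings"].items())]
    return steps


test_plan = build_plan(ENVIRONMENT_SPEC, "test")
prod_plan = build_plan(ENVIRONMENT_SPEC, "production")
```

Since both plans derive from the same checked-in definition, they differ only in the environment label, which is what makes a test environment "production-like" in this sketch.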
28%
To ensure that we can restore production service repeatedly and predictably (and, ideally, quickly) even when catastrophic events occur, we must check in the following assets to our shared version control repository:
•all application code and dependencies (e.g., libraries, static content, etc.)
•any script used to create database schemas, application reference data, etc.
•all the environment creation tools and artifacts described in the previous step (e.g., VMware or AMI images, Puppet, Chef, or Ansible scripts)
•any file used to create containers (e.g., Docker, Rocket, or Kubernetes ...