The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations
18%
The age of the application was not a significant predictor of performance; instead, what predicted performance was whether the application was architected (or could be re-architected) for testability and deployability.
18%
As we generate successes, we earn the right to expand the scope of our DevOps initiative. We want to follow a safe sequence that methodically grows our levels of credibility, influence, and support. The following list, adapted from a course taught by Dr. Roberto Fernandez, a William F. Pounds Professor in Management at MIT, describes the ideal phases used by change agents to build and expand their coalition and base of support: Find Innovators and Early Adopters: In the beginning, we focus our efforts on teams who actually want to help—these are our kindred spirits and fellow travelers who are…
20%
Based on their research, Dr. Govindarajan and Dr. Trimble assert that organizations need to create a dedicated transformation team that is able to operate outside of the rest of the organization, which remains responsible for daily operations (they call these the “dedicated team” and the “performance engine,” respectively).
20%
We will actively manage this technical debt by ensuring that we invest at least 20% of all Development and Operations cycles on refactoring, automation work, architecture, and non-functional requirements (NFRs, sometimes referred to as the “ilities”), such as maintainability, manageability, scalability, reliability, testability, deployability, and security.
21%
The deal [between product owners and] engineering goes like this: Product management takes 20% of the team’s capacity right off the top and gives this to engineering to spend as they see fit.
22%
One of the most visible consequences of this is long lead times, especially for complex activities like large deployments where we must open up tickets with multiple groups and coordinate work handoffs, resulting in our work waiting in long queues at every step.
23%
Taken to the extreme, market-oriented teams are responsible not only for feature development, but also for testing, securing, deploying, and supporting their service in production, from idea conception to retirement.
23%
To achieve market orientation, we won’t do a large, top-down reorganization, which often creates large amounts of disruption, fear, and paralysis. Instead, we will embed the functional engineers and skills (e.g., Ops, QA, Infosec) into each service team, or provide their capabilities to teams through automated self-service platforms that provide production-like environments, initiate automated tests, or perform deployments.
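To make the self-service idea concrete, here is a minimal sketch of what a product team's interaction with such a platform could look like, assuming a hypothetical internal API at platform.internal; the endpoints and client are illustrative, not from the book:

    # Hypothetical self-service platform client; the API host, endpoints,
    # and payloads are illustrative assumptions, not from the book.
    import sys
    import requests

    PLATFORM = "https://platform.internal/api/v1"

    def provision_environment(service: str) -> str:
        """Request a production-like environment; return its URL."""
        resp = requests.post(f"{PLATFORM}/environments", json={"service": service})
        resp.raise_for_status()
        return resp.json()["url"]

    def run_tests(service: str) -> bool:
        """Kick off the automated test suite for a service."""
        resp = requests.post(f"{PLATFORM}/test-runs", json={"service": service})
        resp.raise_for_status()
        return resp.json()["passed"]

    def deploy(service: str, version: str) -> None:
        """Perform a deployment without filing a ticket with Ops."""
        resp = requests.post(f"{PLATFORM}/deployments",
                             json={"service": service, "version": version})
        resp.raise_for_status()

    if __name__ == "__main__":
        service, version = sys.argv[1], sys.argv[2]
        print("environment:", provision_environment(service))
        if run_tests(service):
            deploy(service, version)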
23%
What these organizations have in common is a high-trust culture that enables all departments to work together effectively, where all work is transparently prioritized and there is sufficient slack in the system to allow high-priority work to be completed quickly. This is, in part, enabled by automated self-service platforms that build quality into the products everyone is building.
24%
Another way to enable high-performing outcomes is to create stable service teams with ongoing funding to execute their own strategy and road map of initiatives. These teams have the dedicated engineers needed to deliver on concrete commitments made to internal and external customers, such as features, stories, and tasks. Contrast this to the more traditional model where Development and Test teams are assigned to a “project” and then reassigned to another project as soon as the project is completed and funding runs out. This leads to all sorts of undesired outcomes, including developers being…
26%
For a variety of reasons, such as cost and scarcity, we may be unable to embed Ops engineers into every product team. However, we can get many of the same benefits by assigning a designated liaison for each product team.
26%
Furthermore, just like in the embedded Ops model, this liaison attends the team standups, integrating their needs into the Operations road map and performing any needed tasks.
26%
When Ops engineers are embedded or assigned as liaisons into our product teams, we can integrate them into our Dev team rituals.
29%
Our goal is to ensure that Development and QA are routinely integrating the code with production-like environments at increasingly frequent intervals throughout the project. We do this by expanding the definition of “done” beyond just correct code functionality: at the end of each development interval, we have integrated, tested, working, and potentially shippable code, demonstrated in a production-like environment (the final phrase being the addition).
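One way to make this expanded definition of “done” enforceable rather than aspirational is to encode it as the final gate of the build: green only after the artifact runs in a production-like environment. A rough sketch, assuming a hypothetical deploy.sh script and a /health endpoint:

    # Illustrative end-of-interval gate: "done" only when the build is
    # deployed to a production-like environment and demonstrably working
    # there. deploy.sh and the staging host are assumptions.
    import subprocess
    import requests

    STAGING_URL = "https://staging.example.internal"

    def unit_tests_pass() -> bool:
        return subprocess.run(["pytest", "-q"]).returncode == 0

    def demonstrated_in_production_like_env() -> bool:
        subprocess.run(["./deploy.sh", "staging"], check=True)
        return requests.get(f"{STAGING_URL}/health", timeout=5).status_code == 200

    if __name__ == "__main__":
        assert unit_tests_pass(), "not done: unit tests failing"
        assert demonstrated_in_production_like_env(), \
            "not done: not shown working in a production-like environment"
        print("done: integrated, tested, working, potentially shippable")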
32%
Our goal is to write and run automated performance tests that validate our performance across the entire application stack (code, database, storage, network, virtualization, etc.) as part of the deployment pipeline, so we detect problems early, when the fixes are cheapest and fastest.
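A minimal sketch of such a pipeline gate: sample a key endpoint's latency and fail the build on regression. The URL, sample count, and latency budget are invented for the example:

    # Illustrative performance gate for the deployment pipeline: fail the
    # build if p95 latency of a key endpoint exceeds its budget.
    import statistics
    import time
    import requests

    TARGET = "https://staging.example.internal/search?q=test"
    P95_BUDGET_MS = 250
    SAMPLES = 50

    def measure_latencies_ms() -> list[float]:
        latencies = []
        for _ in range(SAMPLES):
            start = time.perf_counter()
            requests.get(TARGET, timeout=10)
            latencies.append((time.perf_counter() - start) * 1000)
        return latencies

    if __name__ == "__main__":
        p95 = statistics.quantiles(measure_latencies_ms(), n=20)[18]
        print(f"p95 = {p95:.1f} ms (budget {P95_BUDGET_MS} ms)")
        assert p95 <= P95_BUDGET_MS, "performance regression: failing the pipeline"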
35%
Our countermeasure to large batch size merges is to institute continuous integration and trunk-based development practices, where all developers check in their code to trunk at least once per day.
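A pipeline can push back on large batches mechanically. The sketch below, with arbitrary thresholds and origin/main assumed to be trunk, rejects merges from stale or oversized branches:

    # Illustrative small-batch guard for trunk-based development: reject
    # merges from branches older than a day or carrying oversized diffs.
    import subprocess
    import time

    MAX_AGE_SECONDS = 24 * 3600    # "check in to trunk at least once per day"
    MAX_CHANGED_LINES = 400        # rough batch-size ceiling

    def git(*args: str) -> str:
        return subprocess.run(["git", *args], capture_output=True,
                              text=True, check=True).stdout.strip()

    def branch_age_seconds() -> float:
        base = git("merge-base", "HEAD", "origin/main")
        return time.time() - int(git("show", "-s", "--format=%ct", base))

    def diff_size() -> int:
        # rough size: files changed + insertions + deletions
        stat = git("diff", "--shortstat", "origin/main...HEAD")
        return sum(int(tok) for tok in stat.split() if tok.isdigit())

    if __name__ == "__main__":
        assert branch_age_seconds() < MAX_AGE_SECONDS, "branch too old: merge to trunk daily"
        assert diff_size() < MAX_CHANGED_LINES, "batch too large: split this change"
        print("small, fresh change: OK to merge to trunk")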
35%
However, the data from Puppet Labs’ 2015 State of DevOps Report is clear: trunk-based development predicts higher throughput and better stability, and even higher job satisfaction and lower rates of burnout.
36%
They also focused on making all their environments look as similar as possible, including the restricted security access rights and load balancers.
36%
We created a development and deployment process that removed the need for handoffs to DBAs by cross-training developers, automating schema changes, and executing them daily.
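The pattern described here, versioned migrations applied automatically on every deploy, can be sketched in a few lines. sqlite3 keeps the example self-contained; the book does not prescribe this implementation:

    # Minimal sketch of an automated schema-change runner: versioned .sql
    # files are applied in order and recorded in the database, so the job
    # is idempotent and safe to run daily. The pattern applies to any RDBMS.
    import pathlib
    import sqlite3

    MIGRATIONS_DIR = pathlib.Path("migrations")   # e.g., 0001_create_users.sql

    def migrate(db_path: str = "app.db") -> None:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)")
        applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
        for sql_file in sorted(MIGRATIONS_DIR.glob("*.sql")):
            if sql_file.stem not in applied:
                conn.executescript(sql_file.read_text())
                conn.execute("INSERT INTO schema_migrations VALUES (?)", (sql_file.stem,))
                conn.commit()
                print("applied", sql_file.name)

    if __name__ == "__main__":
        migrate()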
37%
The deployment process at Etsy has become so safe and routine that new engineers will perform a production deployment on their first day at work—as have Etsy board members and even dogs!
37%
The next tests to run are the smoke tests, which are system-level tests that run cURL to execute PHPUnit test cases. Following these tests, the functional tests are run, which execute end-to-end GUI-driven tests on a live server—this server is either their QA environment or staging environment (nicknamed “Princess”), which is actually a production server that has been taken out of rotation, ensuring that it exactly matches the production environment.
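In the same spirit as Etsy's cURL-driven smoke tests, a smoke suite can be little more than cheap HTTP assertions against a live server. The host, paths, and expected strings below are invented for illustration:

    # Illustrative smoke suite: cheap HTTP assertions against a live
    # server (QA, staging, or an out-of-rotation production box).
    import sys
    import requests

    CHECKS = [
        ("/", 200, "Welcome"),
        ("/login", 200, "Sign in"),
        ("/api/status", 200, "ok"),
    ]

    def smoke(base_url: str) -> bool:
        all_passed = True
        for path, want_status, want_text in CHECKS:
            resp = requests.get(base_url + path, timeout=5)
            passed = resp.status_code == want_status and want_text in resp.text
            print("PASS" if passed else "FAIL", path)
            all_passed = all_passed and passed
        return all_passed

    if __name__ == "__main__":
        base = sys.argv[1] if len(sys.argv) > 1 else "https://princess.example.internal"
        sys.exit(0 if smoke(base) else 1)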
46%
In decades past, creating links between a service and the production infrastructure it depended on was often a manual effort (such as ITIL CMDBs or creating configuration definitions inside alerting tools such as Nagios). However, increasingly these links are now registered automatically within our services, which are then dynamically discovered and used in production through tools such as Zookeeper, Etcd, Consul, etc. These tools enable services to register themselves, storing information that other services need to interact with them (e.g., IP address, port numbers, URIs). This solves…
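To make self-registration concrete, here is a sketch against Consul's HTTP agent and catalog endpoints; the endpoints are real Consul APIs, but treat the details as illustrative and verify against current documentation:

    # Illustrative self-registration and discovery via a local Consul agent.
    import requests

    CONSUL = "http://localhost:8500"

    def register(name: str, address: str, port: int) -> None:
        """Tell the local agent who we are and where we listen."""
        payload = {"Name": name, "Address": address, "Port": port}
        requests.put(f"{CONSUL}/v1/agent/service/register", json=payload).raise_for_status()

    def discover(name: str) -> list[tuple[str, int]]:
        """Look up every registered instance of a service."""
        resp = requests.get(f"{CONSUL}/v1/catalog/service/{name}")
        resp.raise_for_status()
        return [(s["ServiceAddress"], s["ServicePort"]) for s in resp.json()]

    if __name__ == "__main__":
        register("billing-api", "10.0.0.12", 8443)
        print(discover("billing-api"))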
50%
One potential countermeasure is to do what Google does, which is to have Development groups self-manage their services in production before they become eligible for a centralized Ops group to manage.
50%
We may define launch requirements that must be met in order for services to interact with real customers and be exposed to real production traffic.
50%
Furthermore, to help the product teams, Ops engineers should act as consultants to help them make their services production-ready.
50%
However, the key differences are that we require effective monitoring to be in place, deployments to be reliable and deterministic, and an architecture that supports fast and frequent deployments.
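These launch requirements can themselves be automated as a readiness gate. The sketch below uses hypothetical internal metric and deployment-history endpoints to check the two requirements named above:

    # Illustrative launch-readiness gate: every requirement maps to an
    # automated check, and all must pass before the service takes real
    # production traffic. The internal endpoints are hypothetical.
    import requests

    SERVICE = "billing-api"

    def monitoring_in_place() -> bool:
        r = requests.get(f"https://metrics.internal/api/series?service={SERVICE}")
        return r.ok and len(r.json()) > 0

    def deployments_reliable() -> bool:
        r = requests.get(f"https://deploys.internal/api/history?service={SERVICE}&limit=10")
        return r.ok and all(d["status"] == "succeeded" for d in r.json())

    REQUIREMENTS = {
        "effective monitoring in place": monitoring_in_place,
        "reliable, deterministic deployments": deployments_reliable,
    }

    if __name__ == "__main__":
        unmet = [name for name, check in REQUIREMENTS.items() if not check()]
        if unmet:
            raise SystemExit(f"launch blocked; unmet requirements: {unmet}")
        print(f"{SERVICE} cleared for production traffic")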
50%
By integrating operability requirements into the earliest stages of the development process and having Development initially self-manage their own applications and services, the process of transitioning new services into production becomes smoother and far more predictable to complete.
50%
However, for services already in production, we need a different mechanism to ensure that Operations is never stuck with an unsupportable service in production. This is especially relevant for functionally-oriented Operations organizations. In this step, we may create a service handback mechanism—in other words, when a production service becomes sufficiently fragile, Operations has the ability to return production support responsibility back to Development.
50%
Developers still must have self-managed their service in production for at least six months before it becomes eligible to have an SRE assigned to the team.
51%
We further increase the likelihood of production problems being fixed by ensuring that the Development teams remain intact and are not disbanded after the project is complete.
51%
In organizations with project-based funding, there may be no developers to hand the service back to, as the team has already been disbanded or may not have the budget or time to take on service responsibility. Potential countermeasures include holding an improvement blitz to improve the service, temporarily funding or staffing improvement efforts, or retiring the service.
53%
These controls often add more friction to the deployment process by multiplying the number of steps and approvals, and increasing batch sizes and deployment lead times, which we know reduces the likelihood of successful production outcomes for both Dev and Ops. These controls also reduce how quickly we get feedback from our work.
54%
For more complex organizations and organizations with more tightly-coupled architectures, we may need to deliberately schedule our changes, where representatives from the teams get together, not to authorize changes, but to schedule and sequence their changes in order to minimize accidents.
54%
A logical place to require reviews is prior to committing code to trunk in source control, where changes could potentially have a team-wide or global impact.
55%
Dr. Laurie Williams performed a study in 2001 that showed “paired programmers are 15% slower than two independent individual programmers, while ‘error-free’ code increased from 70% to 85%.”
55%
To fix the problem and eliminate all of these delays, they ended up dismantling the entire Gerrit code review process, instead requiring pair programming to implement code changes into the system. By doing this, they reduced the amount of time required to get code reviewed from weeks to hours.
57%
Examples of such countermeasures include new automated tests to detect dangerous conditions in our deployment pipeline, adding further production telemetry, identifying categories of changes that require additional peer review, and conducting rehearsals of this category of failure as part of regularly scheduled Game Day exercises.
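As one concrete example of such a countermeasure, a pipeline check might scan pending schema migrations for destructive SQL, a category of change flagged for additional peer review. The patterns and review marker are invented:

    # Illustrative "dangerous condition" detector for the pipeline: block
    # deploys whose pending migrations contain destructive SQL unless a
    # reviewer has explicitly acknowledged it.
    import pathlib
    import re
    import sys

    DANGEROUS = [
        re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
        re.compile(r"\bTRUNCATE\b", re.IGNORECASE),
        re.compile(r"\bDELETE\s+FROM\s+\w+\s*;", re.IGNORECASE),   # DELETE with no WHERE
    ]
    REVIEW_MARKER = "-- reviewed: destructive"

    def scan(path: pathlib.Path) -> list[str]:
        findings = []
        for sql_file in sorted(path.glob("*.sql")):
            text = sql_file.read_text()
            if REVIEW_MARKER in text:
                continue
            for pattern in DANGEROUS:
                if pattern.search(text):
                    findings.append(f"{sql_file.name}: {pattern.pattern}")
        return findings

    if __name__ == "__main__":
        problems = scan(pathlib.Path("migrations"))
        for p in problems:
            print("DANGEROUS:", p)
        sys.exit(1 if problems else 0)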
60%
All too often, we codify our standards and processes for architecture, testing, deployment, and infrastructure management in prose, storing them in Word documents that are uploaded somewhere.
60%
Instead of putting our expertise into Word documents, we need to transform these documented standards and processes, which encompass the sum of our organizational learnings and knowledge, into an executable form that makes them easier to reuse. One of the best ways we can make this knowledge reusable is by putting it into a centralized source code repository, making these tools available for everyone to search and use.
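For instance, a standard like “every service declares an owner and exposes a health endpoint” stops being prose and becomes a test any pipeline can import from the shared repository. A sketch, assuming a service.yaml metadata convention:

    # Illustrative executable standard, kept in a shared repository:
    # "every service declares an owner and exposes a working /health
    # endpoint." The service.yaml convention is an assumption.
    import pathlib
    import requests
    import yaml   # PyYAML

    def check_service(repo: pathlib.Path, base_url: str) -> list[str]:
        violations = []
        meta_file = repo / "service.yaml"
        meta = yaml.safe_load(meta_file.read_text()) if meta_file.exists() else None
        if not isinstance(meta, dict) or "owner" not in meta:
            violations.append("standard violated: service.yaml must declare an owner")
        if requests.get(f"{base_url}/health", timeout=5).status_code != 200:
            violations.append("standard violated: /health must return 200")
        return violations

    if __name__ == "__main__":
        for v in check_service(pathlib.Path("."), "https://staging.example.internal"):
            print(v)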
61%
If we do not have a list of technologies that Operations will support, collectively generated by Development and Operations, we should systematically go through the production infrastructure and services, as well as all their dependencies that are currently supported, to find which ones are creating a disproportionate amount of failure demand and unplanned work.
62%
Since 2011, we have been committed to creating a culture of learning—part of that is something we call Teaching Thursday, where each week we create time for our associates to learn. For two hours, each associate is expected to teach or learn.
68%
However, our goal should be to continually show an exemplary track record of successful changes, so we can eventually gain their agreement that our automated changes can be safely classified as standard changes.