The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations
Matrix-oriented organizations attempt to combine functional and market orientation.
Market-oriented organizations optimize for responding quickly to customer needs.
TESTING, OPERATIONS, AND SECURITY AS EVERYONE’S JOB, EVERY DAY In high-performing organizations, everyone within the team shares a common goal—quality, availability, and security aren’t the responsibility of individual departments but are a part of everyone’s job, every day.
One of the most significant things they did to help change the outcomes of deployments was to have all Facebook engineers, engineering managers, and architects rotate through on-call duty for the services they built. By doing this, everyone who worked on the service experienced visceral feedback on the upstream architectural and coding decisions they made, which made an enormous positive impact on the downstream outcomes.
The term full stack engineer is now commonly used (sometimes as a rich source of parody) to describe generalists who are familiar with—or at least have a general level of understanding of—the entire application stack (e.g., application code, databases, operating systems, networking, cloud).
When we value people merely for their existing skills or performance in their current role rather than for their ability to acquire and deploy new skills, we (often inadvertently) reinforce what Dr. Carol Dweck describes as the fixed mindset, where people view their intelligence and abilities as static “givens” that can’t be changed in meaningful ways.
Bounded contexts are described in the book Domain-Driven Design by Eric J. Evans. The idea is that developers should be able to understand and update the code of a service without knowing anything about the internals of its peer services.
As part of its transformation initiative away from a monolithic code base in 2002, Amazon used the two-pizza rule to keep team sizes small—a team only as large as can be fed with two pizzas—usually about five to ten people.
When we asked for permission, we were told no, but we did it anyway, because we knew we needed it.
Kanban boards are an ideal tool to create visibility, and visibility is a key component in properly recognizing and integrating Ops work into all the relevant value streams. When we do this well, we achieve market-oriented outcomes, regardless of how we’ve drawn our organization charts.
Continuous delivery includes creating the foundations of our automated deployment pipeline, ensuring that we have automated tests that constantly validate that we are in a deployable state, having developers integrate their code into trunk daily, and architecting our environments and code to enable low-risk releases. Primary focuses within these chapters include:
- Creating the foundation of our deployment pipeline
- Enabling fast and reliable automated testing
- Enabling and practicing continuous integration and testing
- Automating, enabling, and architecting for low-risk releases
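A minimal sketch of such a pipeline, run as ordered stages that halt on the first failure; the stage names and commands are illustrative assumptions, not the book's implementation:

```python
# Minimal deployment-pipeline sketch: each stage must pass before the next runs.
# Stage contents are illustrative placeholders.
import subprocess
import sys

STAGES = [
    ("commit", ["pytest", "tests/unit"]),            # fast unit tests on every commit
    ("acceptance", ["pytest", "tests/acceptance"]),  # automated acceptance tests
    ("deploy-staging", ["./deploy.sh", "staging"]),  # production-like environment
    ("deploy-prod", ["./deploy.sh", "production"]),  # one-click, low-risk release
]

def run_pipeline() -> None:
    for name, cmd in STAGES:
        print(f"--- stage: {name} ---")
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"stage '{name}' failed; pipeline stopped, trunk is not deployable")
    print("all stages green: we are in a deployable state")

if __name__ == "__main__":
    run_pipeline()
```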
In order to create fast and reliable flow from Dev to Ops, we must ensure that we always use production-like environments at every stage of the value stream.
Operations benefits from this capability to create new environments quickly, because automation of the environment creation process enforces consistency and reduces tedious, error-prone manual work. Furthermore, Development benefits by being able to reproduce all the necessary parts of the production environment to build, run, and test their code on their workstations. By doing this, we enable developers to find and fix many problems, even at the earliest stages of the project, as opposed to during integration testing or worse, in production.
But why does using version control for our environments predict IT and organizational performance better than using version control for our code? Because in almost all cases, there are orders of magnitude more configurable settings in our environment than in our code. Consequently, it is the environment that needs to be in version control the most. Version control also provides a means of communication for everyone working in the value stream—having Development, QA, Infosec, and Operations able to see each other’s changes helps reduce surprises, creates visibility into each other’s work, and …
At Netflix, the average age of an AWS instance is twenty-four days, with 60% being less than one week old.
Automated testing addresses another significant and unsettling problem. Gary Gruver observes that “without automated testing, the more code we write, the more time and money is required to test our code—in most cases, this is a totally unscalable business model for any technology organization.”
When facing deadline pressures, developers may stop creating unit tests as part of their daily work, regardless of how we’ve defined ‘done.’ To detect this, we may choose to measure and make visible our test coverage (as a function of number of classes, lines of code, permutations, etc.), maybe even failing our validation test suite when it drops below a certain level (e.g., when less than 80% of our classes have unit tests).
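As a sketch of such a gate, using Python's coverage library (the 80% threshold mirrors the example above; the tests/ directory is a hypothetical layout):

```python
# Sketch of a coverage gate: run the test suite under coverage measurement
# and fail the build when total coverage drops below 80%.
import sys
import unittest
import coverage

THRESHOLD = 80.0  # mirrors the "less than 80%" example above

cov = coverage.Coverage()
cov.start()

suite = unittest.defaultTestLoader.discover("tests")  # hypothetical tests/ dir
unittest.TextTestRunner().run(suite)

cov.stop()
total = cov.report()  # prints a report and returns total coverage percent
if total < THRESHOLD:
    sys.exit(f"coverage {total:.1f}% is below the {THRESHOLD}% gate; failing the build")
```

For pytest users, the pytest-cov plugin offers a similar gate via its --cov-fail-under option.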
WRITE OUR AUTOMATED TESTS BEFORE WE WRITE THE CODE (“TEST-DRIVEN DEVELOPMENT”) One of the most effective ways to ensure we have reliable automated testing is to write those tests as part of our daily work, using techniques such as test-driven development (TDD) and acceptance test-driven development (ATDD). This is when we begin every change to the system by first writing an automated test that validates the expected behavior (and confirming that it fails), and then writing the code to make the test pass. This technique was developed by Kent Beck in the late 1990s as part of Extreme Programming, and has the following …
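A minimal TDD illustration in Python; the slugify function is a made-up example, not from the book. The test is written first and fails until the code below it exists:

```python
# Step 1: write the test first; it fails because slugify() does not exist yet.
import unittest

def slugify(title: str) -> str:
    # Step 2: the simplest code that makes the failing test pass.
    return title.strip().lower().replace(" ", "-")

class TestSlugify(unittest.TestCase):
    def test_title_becomes_url_slug(self):
        self.assertEqual(slugify("  The DevOps Handbook "), "the-devops-handbook")

if __name__ == "__main__":
    unittest.main()
```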
we can now again modify our definition of “done” (addition in bold text): “At the end of each development interval, we must have integrated, tested, working, and potentially shippable code, demonstrated in a production-like environment, created from trunk using a one-click process, and validated with automated tests.”
The requirements for our deployment pipeline include:
In practice, the terms deployment and release are often used interchangeably. However, they are two distinct actions that serve two very different purposes:
- Deployment is the installation of a specified version of software to a given environment (e.g., deploying code into an integration test environment or deploying code into production). Specifically, a deployment may or may not be associated with a release of a feature to customers.
- Release is when we make a feature (or set of features) available to all our customers or a segment of customers (e.g., we enable the feature to be used by 5% of …
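One way to make the distinction concrete in code: the feature below is deployed for everyone, but released only to a percentage of users via a flag. A sketch; the hashing scheme, names, and the 5% figure echoing the passage above are illustrative assumptions.

```python
# Deployment puts this code in production; release is a separate decision,
# controlled here by a percentage-based feature flag. Names are illustrative.
import hashlib

ROLLOUT_PERCENT = {"new_checkout": 5}  # release to 5% of users

def is_released(feature: str, user_id: str) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to the rollout %."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < ROLLOUT_PERCENT.get(feature, 0)

def old_checkout_flow(user_id: str) -> str:   # placeholder implementations
    return "old checkout"

def new_checkout_flow(user_id: str) -> str:
    return "new checkout"

def checkout(user_id: str) -> str:
    if is_released("new_checkout", user_id):
        return new_checkout_flow(user_id)     # deployed and released
    return old_checkout_flow(user_id)         # deployed code still serves everyone else
```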
switching off one particular stakeholder’s features is usually much easier than rolling back an entire release.
Perform Dark Launches Feature toggles allow us to deploy features into production without making them accessible to users, enabling a technique known as dark launching. This is where we deploy all the functionality into production and then perform testing of that functionality while it is still invisible to customers. For large or risky changes, we often do this for weeks before the production launch, enabling us to safely test with the anticipated production-like loads.
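A sketch of a dark launch, assuming a hypothetical new search backend: the new code path runs in production on real traffic, but its output is only logged and compared, never shown to the user.

```python
# Dark-launch sketch: exercise the new code path under real production load,
# log any discrepancies, and always return the old path's result to the user.
import logging

log = logging.getLogger("dark_launch")

def old_search(query: str) -> list:   # current behavior users see
    return ["result-a", "result-b"]

def new_search(query: str) -> list:   # hidden new implementation under test
    return ["result-a", "result-b"]

def search(query: str) -> list:
    visible = old_search(query)
    try:
        shadow = new_search(query)           # invisible to customers
        if shadow != visible:
            log.warning("dark launch mismatch for %r", query)
    except Exception:
        log.exception("dark-launched path failed for %r", query)
    return visible                           # users only ever see the old path
```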
Eugene Letuchy, an engineer on the Chat team, wrote about how the number of concurrent users presented a huge software engineering challenge: “The most resource-intensive operation performed in a chat system is not sending messages. It is rather keeping each online user aware of the online-idle-offline states of their friends, so that conversations can begin.”
“The secret for going from zero to seventy million users overnight is to avoid doing it all in one fell swoop.”
“In the last five years, there has been confusion around the terms continuous delivery versus continuous deployment—and, indeed, my own thinking and definitions have changed since we wrote the book.”
This is the principle of evolutionary architecture—Jez Humble observes that architecture of “any successful product or organization will necessarily evolve over its life cycle.”
Each decision most likely best served the organizational goals at the time. If we had tried to implement the 1995 equivalent of micro-services out of the gate, we would have likely failed, collapsing under our own weight and probably taking the entire company with us.
The challenge is how to keep migrating from the architecture we have to the architecture we need.
What Shoup’s team did at eBay is a textbook example of evolutionary design, using a technique called the strangler application pattern—instead of “ripping out and replacing” old services with architectures that no longer support our organizational goals, we put the existing functionality behind an API and avoid making further changes to it. All new functionality is then implemented in the new services that use the new desired architecture, making calls to the old system when necessary.
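A sketch of the pattern in Python (the endpoints and service names are hypothetical): a thin facade routes traffic, sending migrated functionality to new services while the untouched legacy system keeps serving everything else behind the same API.

```python
# Strangler facade sketch: new functionality lives in new services; everything
# not yet migrated is delegated, unchanged, to the legacy system.
MIGRATED = {"/orders", "/inventory"}  # hypothetical endpoints already moved

def new_service(path: str, request: dict) -> dict:
    return {"served_by": "new", "path": path}

def legacy_system(path: str, request: dict) -> dict:
    return {"served_by": "legacy", "path": path}

def handle(path: str, request: dict) -> dict:
    if path in MIGRATED:
        return new_service(path, request)    # desired architecture
    return legacy_system(path, request)      # old code, frozen behind the API
```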
The strangler application pattern is especially useful for helping migrate portions of a monolithic application or tightly-coupled services to one that is more loosely-coupled. All too often, we find ourselves working within an architecture that has become too tightly-coupled and too interconnected, often having been created years (or decades) ago.
“[IT project owners] are not held accountable for their contributions to overall system entropy.”
As described earlier, the strangler application pattern involves placing existing functionality behind an API, where it remains unchanged, and implementing new functionality using our desired architecture, making calls to the old system when necessary. When we implement strangler applications, we seek to access all services through versioned APIs, also called versioned services or immutable services.
The problem that became evident to Ashman was that the number of code commits started to decrease, objectively showing the increasing difficulty of introducing code changes, while the number of lines of code continued to increase. Ashman noted, “To me, it said we needed to do something, otherwise the problems would keep getting worse, with no end in sight.” As a result, in 2012 Ashman focused on implementing a code re-architecting project that used the strangler pattern. The team accomplished this by creating what they internally called Building Blocks, which allowed developers to work in …
The strangler application pattern involves incrementally replacing a whole system, usually a legacy system, with a completely new one. Conversely, branching by abstraction, a term coined by Paul Hammant, is a technique where we create an abstraction layer between the areas that we are changing. This enables evolutionary design of the application architecture while allowing everybody to work off trunk/master and practice continuous integration.
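As a sketch, the abstraction layer might look like this in Python: both the old and the new implementation live on trunk behind one interface, and a switch selects which one runs while the migration is in flight (names are illustrative assumptions).

```python
# Branch-by-abstraction sketch: trunk carries both implementations behind one
# abstraction; a configuration switch flips traffic once the new code is ready.
from abc import ABC, abstractmethod

class ReportStore(ABC):                      # the abstraction layer
    @abstractmethod
    def save(self, report: dict) -> None: ...

class LegacyFileStore(ReportStore):          # existing behavior, left intact
    def save(self, report: dict) -> None:
        print("writing report to flat file")

class NewDatabaseStore(ReportStore):         # built incrementally on trunk
    def save(self, report: dict) -> None:
        print("writing report to database")

USE_NEW_STORE = False                        # flipped when the migration completes

def get_store() -> ReportStore:
    return NewDatabaseStore() if USE_NEW_STORE else LegacyFileStore()
```

Because both implementations ship on trunk, everyone keeps integrating continuously; no long-lived migration branch is needed.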
In Operations, we may deal with this problem with the following rule of thumb: When something goes wrong in production, we just reboot the server. If that doesn’t work, reboot the server next to it. If that doesn’t work, reboot all the servers. If that doesn’t work, blame the developers, they’re always causing outages.
the Microsoft Operations Framework (MOF) study in 2001 found that organizations with the highest service levels rebooted their servers twenty times less frequently than average and had five times fewer “blue screens of death.” In other words, they found that the best-performing organizations were much better at diagnosing and fixing service incidents, in what Kevin Behr, Gene Kim, and George Spafford called a “culture of causality” in The Visible Ops Handbook. High performers used a disciplined approach to solving problems, using production telemetry to understand possible contributing factors …
CREATE OUR CENTRALIZED TELEMETRY INFRASTRUCTURE Operational monitoring and logging is by no means new—multiple generations of Operations engineers have used and customized monitoring frameworks (e.g., HP OpenView, IBM Tivoli, and BMC Patrol/BladeLogic) to ensure the health of production systems.
In order for us to see all problems as they occur, we must design and develop our applications and environments so that they generate sufficient telemetry, allowing us to understand how our system is behaving as a whole. When all levels of our application stack have monitoring and logging, we enable other important capabilities, such as graphing and visualizing our metrics, anomaly detection, proactive alerting and escalation, etc.
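As a sketch of what generating sufficient application-level telemetry can mean, here structured events are emitted with Python's standard logging so a centralized pipeline could aggregate and graph them (the event and field names are assumptions):

```python
# Sketch: emit structured, machine-parseable events from the application layer
# so a centralized telemetry pipeline can aggregate, graph, and alert on them.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("telemetry")

def emit(event: str, **fields) -> None:
    log.info(json.dumps({"ts": time.time(), "event": event, **fields}))

# Example: instrument a login attempt (event and field names are illustrative).
emit("user.login", outcome="success", latency_ms=42)
```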
In The Art of Monitoring, James Turnbull describes a modern monitoring architecture, which has been developed and used by Operations engineers at web-scale companies (e.g., Google, Amazon, Facebook).
In the applications we create and operate, every feature should be instrumented—if it was important enough for an engineer to implement, it is certainly important enough to generate enough production telemetry so that we can confirm that it is operating as designed and that the desired outcomes are being achieved.
Alternatives to StatsD that allow developers to generate production telemetry that can be easily aggregated and analyzed include JMX and Codahale Metrics. Other tools that create metrics invaluable for problem solving include New Relic, AppDynamics, and Dynatrace. Tools such as munin and collectd can be used to create similar functionality.
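For instance, with the community statsd client for Python, counting and timing a feature takes a line or two each; a sketch, assuming a StatsD daemon listening on the default port and a hypothetical order-submission function:

```python
# Sketch: count and time a feature with the 'statsd' Python client
# (pip install statsd), assuming a StatsD daemon on localhost:8125.
import statsd

stats = statsd.StatsClient("localhost", 8125)

def process(order) -> None:      # hypothetical business logic
    pass

def submit_order(order) -> None:
    stats.incr("orders.submitted")           # counter: one per feature use
    with stats.timer("orders.submit_time"):  # timer: how long processing takes
        process(order)
```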
This is often referred to as an information radiator, defined by the Agile Alliance as “the generic term for any of a number of handwritten, drawn, printed, or electronic displays which a team places in a highly visible location, so that all team members as well as passers-by can see the latest information at a glance: count of automated tests, velocity, incident reports, continuous integration status, and so on. This idea originated as part of the Toyota Production System.”
By putting information radiators in highly visible places, we promote responsibility among team members, actively demonstrating the following values:
- The team has nothing to hide from its visitors (customers, stakeholders, etc.)
- The team has nothing to hide from itself: it acknowledges and confronts problems
These metrics will vary according to different domains and organizational goals. For instance, for e-commerce sites, we may want to maximize the time spent on the site; however, for search engines, we may want to reduce the time spent on the site, since long sessions may indicate that users are having difficulty finding what they’re looking for.
By radiating how customers interact with what we build in the context of our goals, we enable fast feedback to feature teams so they can see whether the capabilities we are building are actually being used and to what extent they are achieving business goals. As a result, we reinforce the cultural expectations that instrumenting and analyzing customer usage is also a part of our daily work, so we better understand how our work contributes to our organizational goals.
USE MEANS AND STANDARD DEVIATIONS TO DETECT POTENTIAL PROBLEMS One of the simplest statistical techniques that we can use to analyze a production metric is computing its mean (or average) and standard deviation. By doing this, we can create a filter that detects when this metric is significantly different from its norm, and even configure our alerting so that we can take corrective action (e.g., notify on-call production staff at 2 a.m. to investigate when database queries are significantly slower than average).
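A sketch of this technique: compute the mean and standard deviation over a window of recent samples and flag the latest value when it falls more than three standard deviations from the mean (the three-sigma threshold and the query-latency metric are illustrative choices).

```python
# Sketch: flag a metric value that deviates more than 3 standard deviations
# from its recent mean; the window and threshold are illustrative, not fixed.
import statistics

def is_anomalous(history: list[float], latest: float, sigmas: float = 3.0) -> bool:
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return abs(latest - mean) > sigmas * std

query_ms = [12.0, 14.0, 11.0, 13.0, 12.5, 13.5, 12.0, 14.5]  # recent samples
if is_anomalous(query_ms, latest=95.0):
    print("page on-call: database queries significantly slower than average")
```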
“Alert fatigue is the single biggest problem we have right now…We need to be more intelligent about our alerts or we’ll all go insane.”
We have re-discovered that the secret to smooth and continuous flow is making small, frequent changes that anyone can inspect and easily understand.
we will have everyone in the value stream share the downstream responsibilities of handling operational incidents. We can do this by putting developers, development managers, and architects on pager rotation, just as Pedro Canahuati, Facebook Director of Production Engineering, did in 2009. This ensures everyone in the value stream gets visceral feedback on any upstream architectural and coding decisions they make.