The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations
44%
Note that in addition to monitoring our production services, we also need telemetry for those services in our pre-production environments (e.g., development, test, staging, etc.). Doing this enables us to find and fix issues before they go into production, such as detecting when we have ever-increasing database insert times due to a missing table index.
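A minimal sketch of what such pre-production telemetry could look like, assuming an in-memory SQLite table, a locally collected list of timings, and an arbitrary 2x drift threshold (all illustrative assumptions, not the book's implementation); in practice the timings would be emitted to the metrics system used in staging:

import sqlite3
import statistics
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")

insert_times_ms = []

def timed_insert(customer_id: int) -> None:
    # Record how long each insert takes, just as we would in a staging run.
    start = time.perf_counter()
    conn.execute("INSERT INTO orders (customer_id) VALUES (?)", (customer_id,))
    insert_times_ms.append((time.perf_counter() - start) * 1000)

for i in range(1000):
    timed_insert(i)

# Compare recent inserts against the earlier baseline; a steady upward drift
# in staging is the cue to investigate (e.g., a missing table index) before
# the change ever reaches production.
baseline = statistics.median(insert_times_ms[:100])
recent = statistics.median(insert_times_ms[-100:])
if recent > baseline * 2:
    print(f"WARN: insert latency drifting up ({baseline:.3f} ms -> {recent:.3f} ms)")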
45%
Netflix “used outlier detection in a very simple way, which was to first compute what was the ‘current normal’ right now, given the population of nodes in a compute cluster. And then we identified which nodes didn’t fit that pattern, and removed those nodes from production.”
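The quote describes the technique only at a high level; below is a minimal sketch of the same idea (compute the fleet's "current normal," then flag the nodes that no longer fit it), using a median/MAD rule, invented per-node error rates, and an arbitrary cutoff of 5, none of which come from Netflix:

from statistics import median

node_error_rates = {
    "i-001": 0.8, "i-002": 1.1, "i-003": 0.9, "i-004": 1.0,
    "i-005": 0.7, "i-006": 7.4,  # the node that no longer fits the pattern
}

values = list(node_error_rates.values())
current_normal = median(values)                        # the "current normal" right now
mad = median(abs(v - current_normal) for v in values)  # robust estimate of spread

outliers = [
    node for node, v in node_error_rates.items()
    if mad and abs(v - current_normal) / mad > 5       # cutoff chosen for illustration
]

# In production these nodes would be pulled from the load balancer and terminated,
# letting the cluster's autoscaling replace them.
print("remove from production:", outliers)             # -> ['i-006']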
45%
One of the easiest ways to do this is to analyze our most severe incidents in the recent past (e.g., thirty days) and create a list of telemetry that could have enabled earlier and faster detection and diagnosis of the problem, as well as easier and faster confirmation that an effective fix had been implemented.
47%
ensure that we are actively monitoring our production telemetry when anyone performs a production deployment,
47%
have Development groups self-manage their services in production to prove they are stable before they become eligible for an SRE (site reliability engineering) team to manage. By having developers be responsible for deployment and production support, we are far more likely to have a smooth transition to Operations.
49%
“the most inefficient way to test a business model or product idea is to build the complete product to see whether the predicted demand actually exists.”
49%
we may conduct an experiment to see whether modifying the text or color on a “buy” button increases revenue or whether slowing down the response time of a website (by introducing an artificial delay as the treatment) reduces revenue.
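As an illustration of the mechanics, the sketch below assigns each user deterministically to a control or treatment cohort and tallies conversions per cohort; the experiment name, hashing scheme, and sample data are assumptions made for the example, not the authors' implementation:

import hashlib

def assign_variant(user_id: str, experiment: str = "buy-button-color") -> str:
    # Hash the user into a stable bucket so they always see the same variant.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# Later, compare conversion between the two cohorts.
observations = [("alice", True), ("bob", False), ("carol", True), ("dave", False)]
totals = {"control": [0, 0], "treatment": [0, 0]}      # [conversions, visitors]
for user_id, converted in observations:
    bucket = totals[assign_variant(user_id)]
    bucket[0] += int(converted)
    bucket[1] += 1

for variant, (conversions, visitors) in totals.items():
    print(variant, f"{conversions}/{visitors} converted")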
49%
If we are not performing user research, the odds are that two-thirds of the features we are building deliver zero or negative value to our organization, even as they make our codebase ever more complex, thus increasing our maintenance costs over time and making our software more difficult to change.
49%
we can frame hypotheses in feature development in the following form:12
• We believe increasing the size of hotel images on the booking page
• Will result in improved customer engagement and conversion
• We will have confidence to proceed when we see a 5% increase in customers who review hotel images who then proceed to book in forty-eight hours.
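Checking the "confidence to proceed" criterion above is then a small calculation. The numbers below are invented, and the 5% increase is read as a relative lift over a pre-change baseline; both are assumptions made for illustration:

baseline_rate = 0.20      # image viewers who booked within 48 hours, before the change
viewers_after = 4_000     # image viewers shown the larger hotel images
bookings_after = 880      # of those, how many booked within 48 hours

new_rate = bookings_after / viewers_after
relative_lift = (new_rate - baseline_rate) / baseline_rate

print(f"conversion: {new_rate:.1%}, relative lift: {relative_lift:+.1%}")
if relative_lift >= 0.05:
    print("Criterion met: we have confidence to proceed.")
else:
    print("Criterion not met: revisit or abandon the change.")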
51%
The principle of small batch sizes also applies to code reviews. The larger the size of the change that needs to be reviewed, the longer it takes to understand and the larger the burden on the reviewing engineer.
51%
•If someone submits a change that is too large to reason about easily—in other words, you can’t understand its impact after reading through it a couple of times, or you need to ask the submitter for clarification—it should be split up into multiple, smaller changes that can be understood at a glance.
52%
when asked to describe a great pull request that indicates an effective review process, Tomayko quickly listed off the essential elements: there must be sufficient detail on why the change is being made, how the change was made, as well as any identified risks and resulting countermeasures.
54%
If accidents are not caused by “bad apples” but rather are due to inevitable design problems in the complex system that we created, then instead of “naming, blaming, and shaming” the person who caused the failure, our goal should always be to maximize opportunities for organizational learning, continually reinforcing that we value actions that expose and share more widely the problems in our daily work. This is what enables us to improve the quality and safety of the system we operate within and reinforce the relationships between everyone who operates within that system.
57%
We put into our shared source code repository not only source code but also other artifacts that encode knowledge and learning, including:
• configuration standards for our libraries, infrastructure, and environments (Chef, Puppet, or Ansible scripts)
• deployment tools
• testing standards and tools, including security
• deployment pipeline tools
• monitoring and analysis tools
• tutorials and standards
57%
Examples of non-functional requirements include ensuring that we have:
• sufficient production telemetry in our applications and environments
• the ability to accurately track dependencies
• services that are resilient and degrade gracefully
• forward and backward compatibility between versions
• the ability to archive data to manage the size of the production data set
• the ability to easily search and understand log messages across services
• the ability to trace requests from users through multiple services
• simple, centralized runtime configuration using feature flags, etc.
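The last item above (simple, centralized runtime configuration using feature flags) might look something like the sketch below; the flag names, the inline JSON standing in for a central flag store, and the percentage-rollout scheme are illustrative assumptions:

import hashlib
import json

# In practice this document would be fetched from a central configuration service.
FLAGS = json.loads("""
{
  "new-checkout-flow": {"enabled": true, "rollout_percent": 25},
  "verbose-tracing":   {"enabled": false}
}
""")

def flag_enabled(name: str, user_id: str) -> bool:
    flag = FLAGS.get(name, {"enabled": False})
    if not flag.get("enabled"):
        return False
    percent = flag.get("rollout_percent", 100)
    # Stable hashing keeps a given user in or out of the rollout consistently.
    bucket = int(hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percent

if flag_enabled("new-checkout-flow", user_id="user-42"):
    print("serve the new checkout flow")
else:
    print("serve the existing flow")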
58%
the five essential characteristics of cloud computing defined by the US Federal Government’s National Institute of Standards and Technology (NIST):20
• On-demand self-service: Consumers can automatically provision computing resources as needed, without human interaction from the provider.
• Broad network access: Capabilities can be accessed through heterogeneous platforms, such as mobile phones, tablets, laptops, and workstations.
• Resource pooling: Provider resources are pooled in a multi-tenant model, with physical and virtual resources dynamically assigned on demand. The customer may specify ...
59%
One of the easiest ways to do this is to schedule and conduct day- or week-long improvement blitzes, where everyone on a team (or in the entire organization) self-organizes to fix problems they care about—no feature work is allowed.
63%
The 2019 report also showed that when analyzing software components, the time required for those projects to remediate their security vulnerabilities (TTR) was correlated with the time required to update any of their dependencies (TTU).23 In other words, projects that update more frequently tend to remediate their security vulnerabilities faster. This is why Jeremy Long, founder of the OWASP Dependency Check project, suggests that the best security patching strategy is to remain current on all dependencies.
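One way to act on that advice in a Python project is to have CI report (or fail on) stale dependencies. The sketch below shells out to pip and assumes its JSON output format; the fail-the-build policy is just one possible choice:

import json
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "pip", "list", "--outdated", "--format=json"],
    capture_output=True, text=True, check=True,
)
outdated = json.loads(result.stdout)

for pkg in outdated:
    print(f"{pkg['name']}: {pkg['version']} -> {pkg['latest_version']}")

if outdated:
    sys.exit(1)   # keep the build red until dependencies are brought current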
63%
The 2019 study also found that the “popularity” of a software project (e.g., number of GitHub stars or forks or the number of Maven Central downloads) is not correlated with better security characteristics.
68%
Innovation is impossible without risk-taking, and if you haven’t managed to upset at least some people in management, you’re probably not trying hard enough. Don’t let your organization’s immune system deter or distract you from your vision.
70%
any countermeasures must be assigned to someone, and if the corrective action does not warrant being a top priority when the meeting is over, then it is not a corrective action. (This is to prevent the meeting from generating a list of good ideas that are never implemented.)