Kindle Notes & Highlights
by
Gene Kim
Read between February 24 and May 23, 2021
The dedicated release engineer became intimately familiar with the product’s Development and QA issues, and helped them get what they needed from the Ops organization to achieve their goals. They were familiar with the typical Dev and QA requests for Ops, and would often execute the needed work themselves. As needed, they would also pull in dedicated technical Ops engineers (e.g., DBAs, Infosec, storage engineers, network engineers) and help determine which self-service tools the entire Operations group should prioritize building.
One way to enable market-oriented outcomes is for Operations to create a set of centralized platforms and tooling services that any Dev team can use to become more productive, such as getting production-like environments, deployment pipelines, automated testing tools, production telemetry dashboards, and so forth.†
“Without these self-service Operations platforms, the cloud is just Expensive Hosting 2.0.”
For instance, we may create a platform that provides a shared version control repository with pre-blessed security libraries, a deployment pipeline that automatically runs code quality and security scanning tools, which deploys our applications into known, good environments that already have production monitoring tools installed on them. Ideally, we make life so much easier for Dev teams that they will overwhelmingly decide that using our platform is the easiest, safest, and most secure means to get their applications into production.
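A minimal sketch of what one stage of such a self-service pipeline might look like, assuming flake8 and bandit as the code quality and security scanners and a hypothetical deploy.sh script; the tool choices are illustrative, not prescribed by the book:

    # pipeline.py - sketch of a shared pipeline stage (tool choices are assumptions)
    import subprocess
    import sys

    STEPS = [
        ["flake8", "src"],                    # code quality scan
        ["bandit", "-r", "src"],              # security scan
        ["python", "-m", "pytest", "tests"],  # automated tests
        ["./deploy.sh", "staging"],           # hypothetical deploy into a known, good environment
    ]

    def run_pipeline() -> int:
        for step in STEPS:
            print("running:", " ".join(step))
            if subprocess.run(step).returncode != 0:
                print("pipeline failed at:", step[0])
                return 1
        return 0

    if __name__ == "__main__":
        sys.exit(run_pipeline())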
We build into these platforms the cumulative and collective experience of everyone in the organization, including QA, Operations, and Infosec, which helps to create an ever safer system of work. This increases developer productivity and makes it easy for product teams to leverage common processes, such as performing automated testing and satisfying security and compliance requirements.
These platform teams provide other services to help their customers learn their technology, migrate off of other technologies, and even provide coaching and consulting to help elevate the state of the practice inside the organization.
These shared services also facilitate standardization, which enables engineers to quickly become productive, even if they switch between teams. For instance, if every product team chooses a different toolchain, engineers may have to learn an entirely new set of technologies to do their work, putting the team goals ahead of the global goals.
INVITE OPS TO OUR DEV STANDUPS
INVITE OPS TO OUR DEV RETROSPECTIVES
Often, Development teams will make their work visible on a project board or kanban board. It’s far less common, however, for work boards to show the relevant Operations work that must be performed in order for the application to run successfully in production, where customer value is actually created. As a result, we are not aware of necessary Operations work until it becomes an urgent crisis, jeopardizing deadlines or creating a production outage.
In order to create fast and reliable flow from Dev to Ops, we must ensure that we always use production-like environments at every stage of the value stream.
Furthermore, these environments must be created in an automated manner, ideally on demand from scripts and configuration information stored in version control and entirely self-serviced, without any manual work required from Operations. Our goal is to ensure that we can re-create the entire production environment based on what’s in version control.
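As a minimal sketch of this idea, assume the environment is fully described by a Dockerfile and configuration kept in the same repository as the application; the image and container names below are hypothetical:

    # recreate_env.py - sketch: rebuild a production-like environment from version control only
    import subprocess

    IMAGE = "myapp-env:latest"  # hypothetical image name

    def recreate_environment() -> None:
        # Everything needed must come from the repository, not from manual Ops work.
        subprocess.run(["git", "pull", "--ff-only"], check=True)
        subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)
        # Start a fresh, disposable instance for this stage of the value stream.
        subprocess.run(["docker", "run", "--rm", "-d", "--name", "myapp-test-env", IMAGE], check=True)

    if __name__ == "__main__":
        recreate_environment()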
All too often, the only time we discover how our applications perform in anything resembling a production-like environment is during production deployment—far too late to correct problems.
The team quickly made a surprising discovery: only 50% of the source code in their development and test environments matched what was running in production.
The team carefully reverse-engineered all the changes that had been made to the different environments and put them all into version control. They also automated their environment creation process so they could repeatedly and correctly spin up environments. Campbell-Pretty described the results, noting that “the time it took to get a correct environment went from eight weeks to one day. This was one of the key adjustments that allowed us to hit our objectives concerning our lead time, the cost to deliver, and the number of escaped defects that made it into production.”
- Copying a virtualized environment (e.g., a VMware image, running a Vagrant script, booting an Amazon Machine Image file in EC2)
- Building an automated environment creation process that starts from “bare metal” (e.g., PXE install from a baseline image)
- Using “infrastructure as code” configuration management tools (e.g., Puppet, Chef, Ansible, Salt, CFEngine, etc.)
- Using automated operating system configuration tools (e.g., Solaris Jumpstart, Red Hat Kickstart, Debian preseed)
- Assembling an environment from a set of virtual images or containers (e.g., Vagrant, Docker)
- Spinning up a new …
- All application code and dependencies (e.g., libraries, static content, etc.)
- Any script used to create database schemas, application reference data, etc.
- All the environment creation tools and artifacts described in the previous step (e.g., VMware or AMI images, Puppet or Chef recipes, etc.)
- Any file used to create containers (e.g., Docker or Rocket definition or composition files)
- All supporting automated tests and any manual test scripts
- Any script that supports code packaging, deployment, database migration, and environment provisioning
- All project artifacts (e.g., requirements …
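One way a pipeline could enforce a checklist like the one above is to fail fast when expected artifacts are missing from the repository; the paths below are hypothetical examples of the artifact types listed, a sketch rather than a standard layout:

    # repo_check.py - sketch: fail the build if expected artifacts are not in version control
    import os
    import sys

    REQUIRED_PATHS = [
        "src/",                     # application code and dependencies
        "db/schema.sql",            # database schema creation script
        "environments/Dockerfile",  # environment/container definition
        "tests/",                   # automated tests
        "deploy/deploy.sh",         # packaging, deployment, and provisioning scripts
    ]

    missing = [path for path in REQUIRED_PATHS if not os.path.exists(path)]
    if missing:
        print("missing from version control:", ", ".join(missing))
        sys.exit(1)
    print("all expected artifacts are present in the repository")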
We may have multiple repositories for different types of objects and services, where they are labelled and tagged alongside our source code. For instance, we may store large virtual machine images, ISO files, compiled binaries, and so forth in artifact repositories (e.g., Nexus, Artifactory). Alternatively, we may put them in blob stores (e.g., Amazon S3 buckets) or put Docker images into Docker registries, and so forth.
This latter pattern is what has become known as immutable infrastructure, where manual changes to the production environment are no longer allowed—the only way production changes can be made is to put the changes into version control and re-create the code and environments from scratch. By doing this, no variance is able to creep into production.
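A minimal sketch of the “replace, never patch” flow for a containerized service, with hypothetical image and container names; in practice an orchestrator or deployment tool performs these steps:

    # immutable_deploy.py - sketch: change production only by rebuilding and replacing
    import subprocess

    IMAGE = "myapp:2.4.1"                # hypothetical, built from what is in version control
    OLD, NEW = "myapp-old", "myapp-new"  # hypothetical container names

    # Rebuild the environment from source control; never exec into or hand-edit the running one.
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)
    subprocess.run(["docker", "run", "-d", "--name", NEW, IMAGE], check=True)
    # ...run smoke tests against NEW here, then retire the old instance wholesale...
    subprocess.run(["docker", "rm", "-f", OLD], check=True)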
We do this by expanding the definition of “done” beyond just correct code functionality (addition in bold text): at the end of each development interval, we have integrated, tested, working and potentially shippable code, demonstrated in a production-like environment.
Ideally, we use the same tools, such as monitoring, logging, and deployment, in our pre-production environments as we do in production. By doing this, we have familiarity and experience that will help us smoothly deploy and run, as well as diagnose and fix, our service when it is in production.
Automated testing addresses another significant and unsettling problem. Gary Gruver observes that “without automated testing, the more code we write, the more time and money is required to test our code—in most cases, this is a totally unscalable business model for any technology organization.”
“Fear became the mind-killer. Fear stopped new team members from changing things because they didn’t understand the system. But fear also stopped experienced people from changing things because they understood it all too well.”†
Now when any Google developer commits code, it is automatically run against a suite of hundreds of thousands of automated tests. If the code passes, it is automatically merged into trunk, ready to be deployed into production. Many Google properties build hourly or daily, then pick which builds to release; others adopt a continuous “Push on Green” delivery philosophy. The stakes are higher than ever—a single code deployment error at Google can take down every property, all at the same time (such as a global infrastructure change or when a defect is introduced into a core library that every …
To achieve this, we must create automated build and test processes that run in dedicated environments. This is critical for the following reasons:
- Our build and test process can run all the time, independent of the work habits of individual engineers.
- A segregated build and test process ensures that we understand all the dependencies required to build, package, run, and test our code (i.e., removing the “it worked on the developer’s laptop, but it broke in production” problem).
- We can package our application to enable the repeatable installation of code and configurations into an environment …
Now that we have a working deployment pipeline infrastructure, we must create our continuous integration practices, which require three capabilities:
- A comprehensive and reliable set of automated tests that validate we are in a deployable state.
- A culture that “stops the entire production line” when our validation tests fail.
- Developers working in small batches on trunk rather than long-lived feature branches.
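A minimal sketch of the “stop the line” gate, assuming pytest as the test runner and a hypothetical small-batch branch name; in practice a CI server enforces this on every trunk commit:

    # ci_gate.py - sketch: merge a small batch into trunk only when the full test suite is green
    import subprocess
    import sys

    def tests_are_green() -> bool:
        return subprocess.run(["python", "-m", "pytest", "-q"]).returncode == 0

    if __name__ == "__main__":
        if not tests_are_green():
            # "Stop the line": block the merge; fixing the build outranks starting new work.
            print("validation failed: trunk is red")
            sys.exit(1)
        subprocess.run(["git", "merge", "--ff-only", "small-batch-change"], check=True)  # hypothetical branch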
Worse, suppose the problem wasn’t caused by a code change, but was due to a test environment issue (e.g., an incorrect configuration setting somewhere). The development team may believe that they fixed the problem because all the unit tests pass, only to discover that the tests will still fail later that night. Further complicating the issue, ten more changes will have been checked in to version control by the team that day. Each of these changes has the potential to introduce more errors that could break our automated tests, further increasing the difficulty of successfully diagnosing and …
In general, automated tests fall into one of the following categories, from fastest to slowest:
- Unit tests: These typically test a single method, class, or function in isolation, providing assurance to the developer that their code operates as designed. For many reasons, including the need to keep our tests fast and stateless, unit tests often “stub out” databases and other external dependencies (e.g., functions are modified to return static, predefined values, instead of calling the real database).§§
- Acceptance tests: These typically test the application as a whole to provide assurance that a …
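A minimal sketch of a unit test with the database stubbed out, using Python’s unittest.mock; the OrderService class is a hypothetical example of code under test:

    # test_order_service.py - sketch: a fast, stateless unit test that stubs out the database
    from unittest.mock import Mock

    class OrderService:
        """Hypothetical code under test; in a real codebase this lives in the application."""
        def __init__(self, db):
            self.db = db

        def total_for(self, customer_id):
            return sum(order["amount"] for order in self.db.orders_for(customer_id))

    def test_total_sums_order_amounts():
        fake_db = Mock()
        # Stub: return static, predefined values instead of calling the real database.
        fake_db.orders_for.return_value = [{"amount": 10}, {"amount": 5}]
        assert OrderService(fake_db).total_for("customer-42") == 15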
Integration tests: Integration tests are where we ensure that our application correctly interacts with other production applications and services, as opposed to calling stubbed out interfaces. As Humble and Farley observe, “Much of the work in the SIT environment involves deploying new versions of each of the applications until they all cooperate. In this situation the smoke test is usually a fully fledged set of acceptance tests that run against the whole application.” Integration tests are performed on builds that have passed our unit and acceptance tests. Because integration tests are often …
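A minimal sketch of a smoke-style integration test that exercises a deployed environment rather than stubs; it assumes the requests library, and the URL and endpoint are hypothetical:

    # test_integration_smoke.py - sketch: call the real, deployed services instead of stubs
    import requests

    BASE_URL = "https://staging.example.internal"  # hypothetical SIT/staging environment

    def test_order_service_responds_end_to_end():
        # Unlike a unit test, this crosses the network to the actual downstream applications.
        response = requests.get(f"{BASE_URL}/api/orders/health", timeout=5)
        assert response.status_code == 200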
when less than 80% of our classes have unit tests).¶¶
One of the most effective ways to ensure we have reliable automated testing is to write those tests as part of our daily work, using techniques such as test-driven development (TDD) and acceptance test-driven development (ATDD). This is when we begin every change to the system by first writing an automated test that validates the expected behavior fails, and then we write the code to make the tests pass. This technique was developed by Kent Beck in the late 1990s as part of Extreme Programming, and has the following three steps: Ensure the tests fail. “Write a test for the next bit of …
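A minimal illustration of the red/green rhythm in Python; the parse_price function is hypothetical and the refactor step is left as a comment:

    # Step 1 (red): write a failing test for the next bit of behavior before the code exists.
    def test_parse_price_handles_currency_symbol():
        assert parse_price("$19.99") == 19.99

    # Step 2 (green): write just enough code to make the test pass.
    def parse_price(text: str) -> float:
        return float(text.lstrip("$"))

    # Step 3: refactor both new and old code, keeping the tests green after every change.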
To mitigate this, a small number of reliable, automated tests are almost always preferable over a large number of manual or unreliable automated tests. Therefore, we focus on automating only the tests that genuinely validate the business goals we are trying to achieve. If abandoning a test results in production defects, we should add it back to our manual test suite, with the ideal of eventually automating it.
INTEGRATE PERFORMANCE TESTING INTO OUR TEST SUITE
All too often, we discover that our application performs poorly during integration testing or after it has been deployed to production. Performance problems are often difficult to detect, such as when things slow down over time, going unnoticed until it is too late (e.g., database queries without an index). And many problems are difficult to solve, especially when they are caused by architectural decisions we made or unforeseen limitations of our networking, database, storage, or other systems. Our goal is to write and run automated performance …
To find performance problems early, we should log performance results and evaluate each performance run against previous results. For instance, we might fail the performance tests if performance deviates more than 2% from the previous run.
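A minimal sketch of that comparison step, assuming results are stored in a JSON file by earlier runs; the file name and the stand-in workload are hypothetical:

    # perf_check.py - sketch: fail if this run is more than 2% slower than the previous run
    import json
    import sys
    import time

    def run_workload() -> float:
        start = time.perf_counter()
        sum(i * i for i in range(1_000_000))  # stand-in for the real performance scenario
        return time.perf_counter() - start

    current = run_workload()
    try:
        with open("perf_baseline.json") as f:  # hypothetical results file from the previous run
            baseline = json.load(f)["duration"]
    except FileNotFoundError:
        baseline = None

    if baseline is not None and current > baseline * 1.02:
        print(f"performance regression: {current:.3f}s vs baseline {baseline:.3f}s")
        sys.exit(1)

    with open("perf_baseline.json", "w") as f:
        json.dump({"duration": current}, f)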
Furthermore, similar to how we run analysis tools on our application in our deployment pipeline (e.g., static code analysis, test coverage analysis), we should run tools that analyze the code that constructs our environments (e.g., Foodcritic for Chef, puppet-lint for Puppet). We should also run any security hardening checks as part of our automated tests to ensure that everything is configured securely and correctly (e.g., server-spec).
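A minimal sketch of wiring such checks into the pipeline; it assumes Foodcritic, puppet-lint, and RSpec-driven Serverspec checks are installed, and the paths are placeholders:

    # env_code_checks.py - sketch: analyze the code that builds our environments, like app code
    import subprocess
    import sys

    CHECKS = [
        ["foodcritic", "cookbooks"],   # Chef cookbook analysis (path is a placeholder)
        ["puppet-lint", "manifests"],  # Puppet manifest analysis
        ["rspec", "spec"],             # Serverspec-style hardening checks, typically run via RSpec
    ]

    failed = [check[0] for check in CHECKS if subprocess.run(check).returncode != 0]
    if failed:
        print("environment code checks failed:", ", ".join(failed))
        sys.exit(1)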
Every member of the team should be empowered to roll back the commit to get back into a green state.
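One way a team member might do this, assuming the breaking change is the most recent commit on trunk; a sketch of the mechanics, not a policy:

    # roll_back_to_green.py - sketch: revert the breaking commit so trunk returns to green
    import subprocess

    # Revert the most recent trunk commit (assumed to be the one that broke the build)
    # without rewriting history, then share the fix immediately.
    subprocess.run(["git", "revert", "--no-edit", "HEAD"], check=True)
    subprocess.run(["git", "push", "origin", "HEAD"], check=True)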