Kindle Notes & Highlights
by Gene Kim
Read between February 27 and March 28, 2023
We may have multiple repositories for different types of objects and services, where they are labeled and tagged alongside our source code. For instance, we may store large virtual machine images, ISO files, compiled binaries, and so forth in artifact repositories (e.g., Nexus, Artifactory). Alternatively, we may put them in blob stores (e.g., Amazon S3 buckets) or put Docker images into Docker registries, and so forth. We will also create and store a cryptographic hash of these objects at build time and validate this hash at deploy time to ensure they haven’t been tampered with.
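The hash-and-verify step described above could look roughly like this sketch (the artifact paths and manifest file are illustrative assumptions, not from the text):

```python
# Minimal sketch: compute a SHA-256 digest of a build artifact at build time
# and verify it again at deploy time. Paths and filenames are illustrative.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file so large artifacts (e.g., VM images) fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_hash(artifact: Path, manifest: Path) -> None:
    """Build time: store the digest alongside the artifact metadata."""
    manifest.write_text(json.dumps({artifact.name: sha256_of(artifact)}))

def verify_hash(artifact: Path, manifest: Path) -> None:
    """Deploy time: refuse to deploy if the artifact was tampered with."""
    expected = json.loads(manifest.read_text())[artifact.name]
    actual = sha256_of(artifact)
    if actual != expected:
        raise RuntimeError(f"{artifact.name}: hash mismatch ({actual} != {expected})")
```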
Version control plays a critical role in the software development process. We now know that when all application and environment changes are recorded in version control, it not only enables us to quickly see all changes that might have contributed to a problem, but also provides the means to roll back to a previous known, running state, allowing us to recover from failures more quickly.
To ensure consistency of our environments, whenever we make production changes (configuration changes, patching, upgrading, etc.), those changes need to be replicated everywhere in our production and pre-production environments, as well as in any newly created environments.
immutable infrastructure, where manual changes to the production environment are no longer allowed—the only way production changes can be made is to put the changes into version control and re-create the code and environments from scratch.
expanding the definition of “done” beyond just correct code functionality: at the end of each development interval, or more frequently, we have integrated, tested, working, and potentially shippable code, demonstrated in a production-like environment.
by the end of the project, we will have successfully deployed and run our code in production-like environments hundreds or even thousands of times, giving us confidence that most of our production deployment problems have been found and fixed.
we use the same tools, such as monitoring, logging, and deployment, in our pre-production environments as we do in production. By doing this, we have familiarity and experience that will help us smoothly deploy and run, as well as diagnose and fix, our service when it is in production.
developers build automated tests as part of their daily work. This creates a fast feedback loop that helps developers find problems early and fix them quickly when there are the fewest constraints (e.g., time, resources).
•We can package our application to enable the repeatable installation of code and configurations into an environment (e.g., on Linux: RPM, yum, npm; on Windows: OneGet; alternatively, framework-specific packaging systems can be used, such as EAR and WAR files for Java, gems for Ruby, etc.).
We begin the deployment pipeline by running the commit stage, which builds and packages the software, runs automated unit tests, and performs additional validation such as static code analysis, duplication and test coverage analysis, and checking style.
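A commit stage along those lines might be wired together as in this sketch; the specific build, test, lint, and coverage commands are placeholders, not tools prescribed by the text:

```python
# Minimal sketch of a commit stage: build/package, fast unit tests, then
# static analysis and a coverage check. Commands are illustrative placeholders.
import subprocess
import sys

COMMIT_STAGE = [
    ["./build.sh", "--package"],                 # build and package the software
    ["pytest", "tests/unit"],                    # fast automated unit tests
    ["flake8", "src"],                           # static code analysis / style checks
    ["coverage", "report", "--fail-under=80"],   # test coverage threshold
]

def run_commit_stage() -> None:
    for command in COMMIT_STAGE:
        print("running:", " ".join(command))
        result = subprocess.run(command)
        if result.returncode != 0:
            # Fail fast: a red commit stage stops the pipeline immediately.
            sys.exit(result.returncode)

if __name__ == "__main__":
    run_commit_stage()
```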
Once changes are accepted into version control, we want to package our code only once so that the same packages are used to deploy code throughout our entire deployment pipeline. By doing this, code will be deployed into our integrated test and staging environments in the same way that it is deployed into production. This eliminates variances that can cause downstream errors that are difficult to diagnose (e.g., using different compilers, compiler flags, library versions, or configurations).
we must create our continuous integration practices, which require three capabilities:
•a comprehensive and reliable set of automated tests that validate we are in a deployable state
•a culture that “stops the entire production line” when our validation tests fail
•developers working in small batches on trunk rather than long-lived feature branches
When facing deadline pressures, developers may stop creating unit tests as part of their daily work, regardless of how we’ve defined “done.” To detect this, we may choose to measure and make visible our test coverage (as a function of number of classes, lines of code, permutations, etc.), maybe even failing our validation test suite when it drops below a certain level (e.g., when less than 80% of our classes have unit tests).
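One hedged way to enforce the 80%-of-classes example above is a simple check in the validation suite; the sketch below assumes a `test_<module>.py` naming convention mirroring the source tree, which is my assumption rather than anything from the text:

```python
# Sketch: fail the validation suite if fewer than 80% of source modules have a
# corresponding unit-test module. The test_<module>.py convention is assumed.
import sys
from pathlib import Path

THRESHOLD = 0.80  # fail the build below 80%, mirroring the example in the text

def main(src_dir: str = "src", test_dir: str = "tests") -> None:
    sources = [p.stem for p in Path(src_dir).rglob("*.py") if p.stem != "__init__"]
    tested = {p.stem.removeprefix("test_") for p in Path(test_dir).rglob("test_*.py")}
    missing = [name for name in sources if name not in tested]
    covered = 1 - len(missing) / max(len(sources), 1)
    print(f"modules with unit tests: {covered:.0%}")
    if covered < THRESHOLD:
        print("missing tests for:", ", ".join(sorted(missing)))
        sys.exit(1)

if __name__ == "__main__":
    main()
```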
A specific design goal of our automated test suite is to find errors as early in the testing phase as possible. This is why we run faster-running automated tests (e.g., unit tests) before slower-running automated tests (e.g., acceptance and integration tests), which are both run before any manual testing.
Because we want our tests to run quickly, we need to design our tests to run in parallel, potentially across many different servers. We may also want to run different categories of tests in parallel. For example, when a build passes our acceptance tests, we may run our performance testing in parallel with our security testing,
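A sketch of fanning out the slower test categories in parallel once acceptance tests pass (the commands and category names are hypothetical):

```python
# Sketch: once the build passes acceptance tests, run the slower categories
# (performance, security) concurrently. Commands are illustrative only.
import subprocess
from concurrent.futures import ThreadPoolExecutor

TEST_CATEGORIES = {
    "performance": ["pytest", "tests/performance"],
    "security": ["pytest", "tests/security"],
}

def run_category(name: str, command: list[str]) -> tuple[str, int]:
    return name, subprocess.run(command).returncode

def run_in_parallel() -> bool:
    with ThreadPoolExecutor(max_workers=len(TEST_CATEGORIES)) as pool:
        results = list(pool.map(lambda item: run_category(*item), TEST_CATEGORIES.items()))
    failures = [name for name, code in results if code != 0]
    if failures:
        print("failed categories:", ", ".join(failures))
    return not failures

if __name__ == "__main__":
    raise SystemExit(0 if run_in_parallel() else 1)
```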
Any tester (which includes all our developers) should use the latest build that has passed all the automated tests, as opposed to waiting for developers to flag a specific build as ready to test. By doing this, we ensure that testing happens as early in the process as possible.
To have humans executing tests that should be automated is a waste of human potential.
Our goal is to write and run automated performance tests that validate our performance across the entire application stack (code, database, storage, network, virtualization, etc.) as part of the deployment pipeline so we detect problems early, when the fixes are cheapest and fastest.
When we have acceptance tests that are able to be run in parallel, we can use them as the basis of our performance tests. For instance, suppose we run an e-commerce site and have identified “search” and “checkout” as two high-value operations that must perform well under load. To test this, we may run thousands of parallel search acceptance tests simultaneously with thousands of parallel checkout tests.
To find performance problems early, we should log performance results and evaluate each performance run against previous results. For instance, we might fail the performance tests if performance deviates more than 2% from the previous run.
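The 2%-deviation check described above could be as simple as the following sketch (the metric names and result-file locations are assumptions):

```python
# Sketch: compare the current performance run against the previous one and
# fail if any metric regresses by more than 2%. File names are illustrative.
import json
import sys

MAX_REGRESSION = 0.02  # 2%, as in the example above

def check(previous_file: str = "perf_previous.json",
          current_file: str = "perf_current.json") -> None:
    with open(previous_file) as f:
        previous = json.load(f)   # e.g., {"search_p95_ms": 180.0}
    with open(current_file) as f:
        current = json.load(f)
    failures = []
    for metric, old_value in previous.items():
        new_value = current.get(metric)
        if new_value is None:
            continue
        change = (new_value - old_value) / old_value
        if change > MAX_REGRESSION:
            failures.append(f"{metric}: {old_value} -> {new_value} (+{change:.1%})")
    if failures:
        print("performance regression detected:")
        print("\n".join(failures))
        sys.exit(1)

if __name__ == "__main__":
    check()
```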
We prioritize the team goals over individual goals—whenever we help someone move their work forward, we help the entire team. This applies whether we’re helping someone fix the build or an automated test, or even performing a code review for them. And of course, we know that they’ll do the same for us, when we need help.
the longer developers are allowed to work in their branches in isolation, the more difficult it becomes to integrate and merge everyone’s changes back into trunk. In fact, integrating those changes becomes exponentially more difficult as we increase the number of branches and the number of changes in each code branch.
significant problems result when developers work in long-lived private branches (also known as “feature branches”), only merging back into trunk sporadically, resulting in a large batch-size of changes.
Our countermeasure to large batch size merges is to institute continuous integration and trunk-based development practices, where all developers check their code into trunk at least once per day. Checking in code this frequently reduces our batch size to the work performed by our entire developer team in a single day. The more frequently developers check their code into trunk, the smaller the batch size and the closer we are to the theoretical ideal of single-piece flow.
The discipline of daily code commits also forces us to break our work down into smaller chunks while still keeping trunk in a working, releasable state. Version control becomes an integral mechanism of how the team communicates with each other—everyone has a better shared understanding of the system, is aware of the state of the deployment pipeline, and can help each other when it breaks. As a result, we achieve higher quality and faster deployment lead times.
trunk-based development predicts higher throughput, better stability, and better availability when teams follow these practices:
•have three or fewer active branches in the application’s code repository
•merge branches to trunk at least daily
•don’t have code freezes or integration phases
because it is painful, we tend to do it less and less frequently, resulting in another self-reinforcing downward spiral. By deferring production deployments, we accumulate ever-larger differences between the code to be deployed and what’s running in production, increasing the deployment batch size. As deployment batch size grows, so does the risk of unexpected outcomes associated with the change, as well as the difficulty fixing them.
•Smoke testing our deployments: During the deployment process, we should test that we can connect to any supporting systems (e.g., databases, message buses, external services) and run a single test transaction through the system to ensure that our system is performing as designed. If any of these tests fail, we should fail the deployment.
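A smoke test of that kind might look like the sketch below; the dependency hostnames, ports, and the test-transaction URL are hypothetical placeholders:

```python
# Sketch: deployment smoke test. Verify connectivity to supporting systems and
# run one test transaction; a non-zero exit fails the deployment.
# Hostnames, ports, and URLs below are illustrative placeholders.
import socket
import sys
import urllib.request

DEPENDENCIES = [("db.internal", 5432), ("message-bus.internal", 5672)]
HEALTH_URL = "http://localhost:8080/health/test-transaction"

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def smoke_test() -> bool:
    for host, port in DEPENDENCIES:
        if not can_connect(host, port):
            print(f"cannot reach {host}:{port}")
            return False
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            return response.status == 200
    except OSError as exc:
        print(f"test transaction failed: {exc}")
        return False

if __name__ == "__main__":
    sys.exit(0 if smoke_test() else 1)
```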
we need to decouple our production deployments from our feature releases.
the terms deployment and release are often used interchangeably. However, they are two distinct actions that serve two very different purposes:
•Deployment is the installation of a specified version of software to a given environment (e.g., deploying code into an integration test environment or deploying code into production). Specifically, a deployment may or may not be associated with a release of a feature to customers.
•Release is when we make a feature (or set of features) available to all our customers or a segment of customers (e.g., we enable the feature to be used by 5% of our
The Blue-Green Deployment Pattern
The simplest of the three patterns is blue-green deployment. In this pattern, we have two production environments: blue and green. At any time, only one of these is serving customer traffic (say, green is live).
To release a new version of our service, we deploy to the inactive environment, where we can perform our testing without interrupting the user experience. When we are confident that everything is functioning as designed, we execute our release by directing traffic to the blue environment. Thus, blue becomes live and green becomes staging. Rollback is performed by sending customer traffic back to the green environment.
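In code, the release step of blue-green deployment is often just an atomic switch of whatever directs customer traffic; a minimal sketch, where the `router` object and its `point_traffic_at` method are hypothetical stand-ins for a real load balancer or ingress API:

```python
# Sketch: blue-green release as an atomic switch of the live environment.
# The `router` object stands in for whatever load balancer or DNS/ingress
# mechanism actually directs customer traffic; its API here is hypothetical.
class BlueGreenReleaser:
    def __init__(self, router, live: str = "green"):
        self.router = router
        self.live = live                      # environment serving traffic
        self.idle = "blue" if live == "green" else "green"

    def deploy_new_version(self, version: str, deploy_fn) -> None:
        """Deploy and test against the idle environment; users are unaffected."""
        deploy_fn(environment=self.idle, version=version)

    def release(self) -> None:
        """Direct customer traffic to the freshly deployed environment."""
        self.router.point_traffic_at(self.idle)
        self.live, self.idle = self.idle, self.live

    def rollback(self) -> None:
        """Send traffic back to the previously live environment."""
        self.router.point_traffic_at(self.idle)
        self.live, self.idle = self.idle, self.live
```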
The cluster immune system expands upon the canary release pattern by linking our production monitoring system with our release process and by automating the rollback of code when the user-facing performance of the production system deviates outside of a predefined expected range,
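A minimal sketch of that automated-rollback idea; the error-rate metric, threshold, and observation window are assumptions rather than anything specified in the text:

```python
# Sketch: automated rollback when a user-facing metric leaves its expected
# range after a release. `get_error_rate` and the thresholds are illustrative.
import time

EXPECTED_MAX_ERROR_RATE = 0.01   # assumed acceptable range: <= 1% errors
OBSERVATION_WINDOW_SECS = 300    # watch the new release for five minutes

def watch_release(get_error_rate, rollback, interval: float = 15.0) -> bool:
    """Return True if the release stays healthy; otherwise roll back."""
    deadline = time.monotonic() + OBSERVATION_WINDOW_SECS
    while time.monotonic() < deadline:
        rate = get_error_rate()
        if rate > EXPECTED_MAX_ERROR_RATE:
            print(f"error rate {rate:.2%} outside expected range; rolling back")
            rollback()
            return False
        time.sleep(interval)
    return True
```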
Implement Feature Toggles
The primary way we enable application-based release patterns is by implementing feature toggles (also called feature flags), which provide us with the mechanism to selectively enable and disable features without requiring a production code deployment. Feature toggles can also control which features are visible and available to specific user segments (e.g., internal employees, segments of customers).
Feature toggles are usually implemented by wrapping application logic or UI elements with a conditional statement, where the feature is enabled or disabled based on a configuration setting stored somewhere.
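A minimal feature-toggle sketch along those lines; the in-memory settings dict, the flag name, and the segment rule are illustrative assumptions (real systems typically read this from a config store):

```python
# Sketch: a feature toggle as a conditional around application logic, driven by
# configuration. The settings dict and segment rule below are illustrative.
FEATURE_FLAGS = {
    # flag name           enabled  segments allowed to see it
    "new_checkout_flow": {"enabled": True, "segments": {"internal", "beta"}},
}

def feature_enabled(name: str, user_segment: str) -> bool:
    flag = FEATURE_FLAGS.get(name, {"enabled": False, "segments": set()})
    return flag["enabled"] and user_segment in flag["segments"]

def render_checkout(user_segment: str) -> str:
    if feature_enabled("new_checkout_flow", user_segment):
        return "new checkout UI"      # feature wrapped by the toggle
    return "existing checkout UI"     # default path when the toggle is off
```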
To ensure that we find errors in features wrapped in feature toggles, our automated acceptance tests should run with all feature toggles on. (We should also test that our feature toggling functionality works correctly too!)
suppose we dark launch a new feature that poses significant release risk, such as new search features, account creation processes, or new database queries. After all the code is in production, keeping the new feature disabled, we may modify user session code to make calls to new functions—instead of displaying the results to the user, we simply log or discard the results. For example, we may have 1% of our online users make invisible calls to a new feature scheduled to be launched to see how our new feature behaves under load. After we find and fix any problems, we progressively increase the
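A dark-launch call site along these lines might look like the following sketch; the 1% sampling rate, the new search function, and the logging sink are illustrative, and the shadow results are discarded rather than shown:

```python
# Sketch: dark launch. A small fraction of user sessions invisibly exercises
# the new code path; its results are logged and discarded, never shown.
import logging
import random

DARK_LAUNCH_FRACTION = 0.01  # 1% of sessions, as in the example above
log = logging.getLogger("dark_launch")

def search(query: str, legacy_search, new_search):
    results = legacy_search(query)             # what the user actually sees
    if random.random() < DARK_LAUNCH_FRACTION:
        try:
            shadow = new_search(query)         # exercise the new feature under load
            log.info("new search returned %d results for %r", len(shadow), query)
        except Exception:
            log.exception("new search failed for %r", query)  # never impact users
    return results
```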
When all developers are working in small batches on trunk, or everyone is working off trunk in short-lived feature branches that get merged to trunk regularly, and when trunk is always kept in a releasable state, and when we can release on demand at the push of a button during normal business hours, we are doing continuous delivery. Developers get fast feedback when they introduce any regression errors, which include defects, performance issues, security issues, usability issues, etc. When these issues are found, they are fixed immediately so that trunk is always deployable. In addition to the
the strangler fig application pattern—instead of “ripping out and replacing” old services with architectures that no longer support our organizational goals, we put the existing functionality behind an API and avoid making further changes to it. All new functionality is then implemented in the new services that use the new desired architecture, making calls to the old system when necessary.
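The strangler fig pattern often shows up concretely as a thin routing facade; in this sketch the route prefixes and the two service clients are hypothetical:

```python
# Sketch: strangler fig routing facade. New functionality lives in new services;
# everything else is proxied, unchanged, to the legacy system behind the API.
# The route prefixes and the two client objects are illustrative assumptions.
MIGRATED_PREFIXES = ("/orders", "/catalog")   # endpoints already re-implemented

class StranglerFacade:
    def __init__(self, new_service_client, legacy_client):
        self.new = new_service_client
        self.legacy = legacy_client

    def handle(self, path: str, request: dict) -> dict:
        if path.startswith(MIGRATED_PREFIXES):
            # New architecture handles this; it may still call the old system.
            return self.new.call(path, request)
        # Untouched functionality stays behind the existing API.
        return self.legacy.call(path, request)
```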
Our goal is to create telemetry within our applications and environments, both in our production and pre-production environments as well as in our deployment pipeline.
ensure that we always have enough telemetry so that we can confirm that our services are correctly operating in production. And when problems do occur, our goal is to make it possible to quickly determine what is going wrong and make informed decisions on how best to fix it, ideally long before customers are impacted.
understand how our system is behaving as a whole. When all levels of our application stack have monitoring and logging, we enable other important capabilities, such as graphing and visualizing our metrics, anomaly detection, proactive alerting and escalation, etc.
In addition to collecting telemetry from our production services and environments, we must also collect telemetry from our deployment pipeline when important events occur, such as when our automated tests pass or fail and when we perform deployments to any environment. We should also collect telemetry on how long it takes us to execute our builds and tests. By doing this, we can detect conditions that could indicate problems, such as if the performance test or our build takes twice as long as normal, allowing us to find and fix errors before they go into production.
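As one hedged example of pipeline telemetry, the "build takes twice as long as normal" condition can be a simple comparison against recent history; the JSON storage format and the factor of two are assumptions:

```python
# Sketch: record build/test durations from the deployment pipeline and flag a
# run that takes more than twice the recent average. The JSON file is assumed.
import json
import statistics
from pathlib import Path

HISTORY_FILE = Path("pipeline_durations.json")  # e.g., {"build": [312, 298, ...]}
SLOWDOWN_FACTOR = 2.0

def record_and_check(stage: str, duration_secs: float) -> bool:
    history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else {}
    recent = history.get(stage, [])
    is_anomaly = bool(recent) and duration_secs > SLOWDOWN_FACTOR * statistics.mean(recent)
    history[stage] = (recent + [duration_secs])[-20:]   # keep the last 20 runs
    HISTORY_FILE.write_text(json.dumps(history))
    if is_anomaly:
        print(f"{stage} took {duration_secs:.0f}s, more than twice the recent average")
    return is_anomaly
```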
“Monitoring is so important that our monitoring systems need to be more available and scalable than the systems being monitored.”
In the applications we create and operate, every feature should be instrumented. If it was important enough for an engineer to implement, then it is important enough to generate enough production telemetry to confirm that it is operating as designed and that the desired outcomes are being achieved.
“When deciding whether a message should be ERROR or WARN, imagine being woken up at 4 AM. Low printer toner is not an ERROR.”
To help ensure that we have information relevant to the reliable and secure operations of our service, we should ensure that all potentially significant application events generate logging entries,
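A short sketch of level-appropriate logging for significant application events, in the spirit of the 4 AM test above; the order flow, function, and payment gateway are illustrative examples:

```python
# Sketch: log significant application events at levels that match the 4 AM
# test quoted above. The specific events below are illustrative examples.
import logging

log = logging.getLogger("orders")
logging.basicConfig(level=logging.INFO)

def place_order(order_id: str, payment_gateway) -> None:
    log.info("order %s received", order_id)              # routine, searchable later
    try:
        payment_gateway.charge(order_id)
    except TimeoutError:
        # Degraded but recoverable: worth a WARN, not a page at 4 AM.
        log.warning("payment gateway timeout for order %s; will retry", order_id)
        raise
    except Exception:
        # Customer-impacting failure: this is what ERROR (and alerting) is for.
        log.error("payment failed for order %s", order_id, exc_info=True)
        raise
```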
We need metrics from the following levels:
•Business level: Examples include the number of sales transactions, revenue of sales transactions, user sign-ups, churn rate, A/B testing results, etc.
•Application level: Examples include transaction times, user response times, application faults, etc.
•Infrastructure level (e.g., database, operating system, networking, storage): Examples include web server traffic, CPU load, disk usage, etc.
•Client software level (e.g., JavaScript on the client browser, mobile application): Examples include application errors and crashes, user-measured transaction