Kindle Notes & Highlights
The system is resilient to failure. Rather than being surprised by failures and treating them as exceptions, the architecture accepts that hardware and software failures are a part of the physics of information technology (IT).
There is a “playbook” of instructions on how to handle every alert that can be generated. Each type of alert is documented with a technical description of what is wrong, what the business impact is, and how to fix the issue. The playbook is continually improved.
All failures have a corresponding countermeasure, whether it is manually or automatically activated. Countermeasures that are activated frequently are always automated.
The less frequently a countermeasure is activated, the less confident we are that it will work the next time it is needed. Therefore infrequently activated countermeasures are periodically and automatically exercised by intentionally causing failures.
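One way to put this into practice is a scheduled fire drill that triggers a failure on purpose and verifies that the countermeasure still works. A minimal sketch, where `pick_random_instance`, `terminate_instance`, and `service_is_healthy` are hypothetical helpers wrapping your infrastructure and monitoring APIs:

```python
import logging
import time

# Hypothetical helpers that wrap the infrastructure and monitoring APIs.
from drill_helpers import pick_random_instance, terminate_instance, service_is_healthy

logging.basicConfig(level=logging.INFO)

def run_failover_drill(service: str, timeout_s: int = 300) -> bool:
    """Intentionally kill one instance and confirm the service self-heals."""
    victim = pick_random_instance(service)
    logging.info("Drill: terminating %s instance %s", service, victim)
    terminate_instance(victim)

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if service_is_healthy(service):
            logging.info("Countermeasure worked: %s recovered", service)
            return True
        time.sleep(10)

    logging.error("Countermeasure failed for %s; alert the on-call", service)
    return False
```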
In distributed systems, failure is normal. Hardware failures that are rare, when multiplied by thousands of machines, become common. Therefore failures are assumed, designs work around them, and software anticipates them. Failure is an expected part of the landscape.
Systems should export metrics. They should count interesting events, such as how many times a particular API was called, and make these counters accessible.
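As an illustration of exporting counters, here is a sketch using only the Python standard library; the port and the plain-text endpoint format are assumptions, not any particular monitoring product's API:

```python
import threading
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

_counters = Counter()
_lock = threading.Lock()

def incr(event: str, n: int = 1) -> None:
    """Count an interesting event, e.g. incr('api.get_user.calls')."""
    with _lock:
        _counters[event] += n

class MetricsHandler(BaseHTTPRequestHandler):
    """Expose all counters as plain text so monitoring can scrape them."""
    def do_GET(self):
        with _lock:
            body = "".join(f"{k} {v}\n" for k, v in sorted(_counters.items()))
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

# Serve the counters in a background thread on an illustrative port.
threading.Thread(
    target=HTTPServer(("", 9100), MetricsHandler).serve_forever,
    daemon=True,
).start()
```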
CAP stands for consistency, availability, and partition tolerance. The CAP Principle states that it is not possible to build a distributed system that guarantees consistency, availability, and partition tolerance all at once. Any one or two can be achieved, but not all three simultaneously. When using such systems, you must be aware of which properties are guaranteed.
Figure 1.10: Numbers every engineer should know
Designing for operations means making sure all the normal operational functions can be done well. Normal operational functions include tasks such as periodic maintenance, updates, and monitoring. These issues must be kept in mind in early stages of planning.
The best strategy for providing a highly available service is to build features into the software that enhance one’s ability to perform and automate operational tasks.
• Configuration
• Startup and shutdown
• Queue draining
• Software upgrades
• Backups and restores
• Redundancy
• Replicated databases
• Hot swaps
• Toggles for individual features
• Graceful degradation
• Access controls and rate limits
• Data import controls
• Monitoring
• Auditing
• Debug instrumentation
• Exception collection
Rather than the term “operational requirements,” some organizations use the term “non-functional requirements.” We consider this term misleading.
It must be possible for software upgrades to be implemented without taking down the service.
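A common way to satisfy this requirement is a rolling upgrade: take one replica out of rotation, upgrade it, wait for it to pass health checks, then move on to the next. A minimal sketch, assuming hypothetical `drain`, `undrain`, `upgrade`, and `is_healthy` helpers in front of the load balancer and deploy tooling:

```python
import time

# Hypothetical wrappers around the load balancer and deploy system.
from deploy_helpers import drain, undrain, upgrade, is_healthy

def rolling_upgrade(replicas: list[str], version: str) -> None:
    """Upgrade one replica at a time so the service stays up throughout."""
    for replica in replicas:
        drain(replica)              # stop new requests; let in-flight ones finish
        upgrade(replica, version)   # install the new release
        while not is_healthy(replica):
            time.sleep(5)           # wait until it passes health checks
        undrain(replica)            # back in rotation before touching the next one
```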
Toggles for Individual Features
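A feature toggle can be as small as a runtime lookup consulted at the call site, letting a risky feature be switched off without a redeploy. A minimal sketch; the flag file, its format, and the flag names are illustrative assumptions:

```python
import json
import threading

class FeatureFlags:
    """Per-feature on/off switches, reloadable at runtime without a restart."""

    def __init__(self, path: str):
        self._path = path
        self._lock = threading.Lock()
        self._flags: dict[str, bool] = {}
        self.reload()

    def reload(self) -> None:
        """Re-read flags from disk (could also poll a config service)."""
        with open(self._path) as f:
            flags = json.load(f)
        with self._lock:
            self._flags = flags

    def enabled(self, name: str) -> bool:
        with self._lock:
            return self._flags.get(name, False)  # unknown flags default to off

flags = FeatureFlags("flags.json")  # hypothetical file, e.g. {"new_checkout": true}

if flags.enabled("new_checkout"):
    ...  # new code path
else:
    ...  # old, proven code path
```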
Graceful Degradation
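Graceful degradation means returning a reduced but useful answer instead of an error when a dependency fails. A sketch, assuming a hypothetical `recommendation_service` client and a precomputed fallback list:

```python
# Hypothetical client for a personalization backend.
from clients import recommendation_service

CACHED_BESTSELLERS = ["sku-101", "sku-202", "sku-303"]  # precomputed fallback

def get_recommendations(user_id: str) -> list[str]:
    """Prefer personalized results; degrade to a static list rather than fail."""
    try:
        return recommendation_service.for_user(user_id, timeout=0.2)
    except (TimeoutError, ConnectionError):
        # Degraded mode: the page still renders, just less personalized.
        return CACHED_BESTSELLERS
```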
Software needs to generate logs that are useful when debugging. Such logs should be both human-readable and machine-parseable. The kind of logging that is appropriate for debugging differs from the kind of logging that is needed for auditing.
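One format that serves both audiences is one JSON object per line: still grep-able by a human, trivially parseable by a machine. A sketch using the standard `logging` module; the field names and example output are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("frontend")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request served user=%s latency_ms=%d", "u42", 17)
# -> roughly: {"ts": "...", "level": "INFO", "logger": "frontend",
#              "msg": "request served user=u42 latency_ms=17"}
```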
Threading
Data can be processed in different ways to achieve better scale. Simply processing one request at a time has its limits. Threading is a technique that can be used to improve system throughput by processing many requests at the same time.
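A minimal sketch of thread-based request handling using the standard library's thread pool; `handle_request` stands in for real work:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(req: str) -> str:
    # Stand-in for real work: parsing, I/O, calling backends, etc.
    return f"handled {req}"

requests = [f"req-{i}" for i in range(100)]

# Process many requests at the same time instead of one after another.
with ThreadPoolExecutor(max_workers=16) as pool:
    for result in pool.map(handle_request, requests):
        print(result)
```

In CPython, threads mainly overlap time spent waiting on I/O (the global interpreter lock serializes pure-Python CPU work), which matches the profile of a typical request handler that spends most of its time calling backends.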
Graceful degradation
Resilient systems continue where predictive strategies leave off.
Resilient systems decouple component failure from user-visible outages.
survivable systems
Software Resiliency Beats Hardware Reliability
Software solutions are favored for many reasons. First and foremost, they are more economical. Once software is written, it can be applied to many services and many machines with no additional cost (assuming it is home-grown, is open source, or does not require a per-machine license). Software is also more malleable than hardware. It is easier to fix, upgrade, and replace.
Everything Malfunctions Eventually
Sheltered from the reality of a world full of malfunctions, we enable software developers to continue writing software that assumes a perfect, malfunction-free world (which, of course, does not exist).
Distributed computing, in contrast to the traditional approach, embraces components’ failures and malfunctions. It takes a reality-based approach that accepts malfunctions as a fact of life.
The three sources of work are life-cycle management, interacting with stakeholders, and process improvement and automation. Life-cycle management is the operational work involved in running the service. Interacting with stakeholders refers to both maintaining the relationship with people who use and depend on the service, and prioritizing and fulfilling their requests. Process improvement and automation is work inspired by the business desire for continuous improvement.
Emergency Issues:
Normal Requests:
Project Work:
The team manager should be part of the operational rotation.
Meta-work
There is also meta-work: meetings, status reports, company functions. These generally eat into project time and should be minimized.
developing software that automates or optimizes aspects of the team’s responsibilities.
building organizational memory.
shift report
Every operations team should have a goal of eliminating the need for people to open tickets with them, just as there should always be a goal of automating manual processes.
Any ticket created by an automated system should have a corresponding playbook entry that explains how to process it, with a link to the bug ID requesting that the automation be improved to eliminate the need to open such tickets.
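In practice this means the automation attaches both the playbook link and the open automation bug to every ticket it files. A sketch; the `create_ticket` client, its fields, the URL, and the bug ID are all hypothetical:

```python
# Hypothetical ticketing client; a real system would wrap a ticketing REST API.
from ticket_helpers import create_ticket

def file_automated_ticket(alert: str, details: str) -> None:
    create_ticket(
        summary=f"[auto] {alert}",
        body=details,
        # Every automated ticket must tell the responder what to do...
        playbook_url=f"https://wiki.example.com/playbooks/{alert}",
        # ...and point at the bug tracking the work to make this ticket unnecessary.
        automation_bug="BUG-1234",  # hypothetical ID
    )
```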
Theme
Toil Reduction
7.4.2 Communication Policies
Many teams establish a communication agreement that clarifies which methods will be used in which situations. For example, a common agreement is that chat rooms will be the primary communication channel, but only for ephemeral discussions. If a decision is made in the chat room or an announcement needs to be made, it will be broadcast via email. Email is for information that needs to carry across on-call shifts or day boundaries. Announcements with lasting effects, such as major policies or design decisions, need to be recorded in the team wiki or other documentation.
Operations in distributed computing is done at a large scale. Processes that have to be done manually do not scale. Constant process improvement and automation are essential.
Operations is responsible for the life cycle of a service: launch, maintenance, upgrades, and decommissioning. Maintenance tasks include emergency and non-emergency response. In addition, related projects maintain and evolve the service.
Checklists
The most productive use of time for operational staff is time spent automating and optimizing processes.
In a DevOps organization, software developers and operational engineers work together as one team that shares responsibility for a web site or service.
DevOps combines some cultural and attitude shifts with some common-sense processes. Originally based on applying Agile methodology to operations, the result is a streamlined set of principles and processes that can create reliable services.
We find that (1) operator error is the largest single cause of failures in two of the three services, (2) operator errors often take a long time to repair, (3) configuration errors are the largest category of operator errors, (4) failures in custom-written front-end software are significant, and (5) more extensive online testing and more thoroughly exposing and detecting component failures would reduce failure rates in at least one service.
DevOps