Practice of Cloud System Administration, The: DevOps and SRE Practices for Web Services, Volume 2
5%
The system is resilient to failure. Rather than being surprised by failures and treating them as exceptions, the architecture accepts that hardware and software failures are a part of the physics of information technology (IT).
6%
There is a “playbook” of instructions on how to handle every alert that can be generated. Each type of alert is documented with a technical description of what is wrong, what the business impact is, and how to fix the issue. The playbook is continually improved.
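A playbook of this kind can be modeled as a small data structure with a coverage check. The sketch below is illustrative only: `PlaybookEntry`, its field names, and `missing_playbook_entries` are hypothetical names, not anything the book defines.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybookEntry:
    """One playbook entry per alert type (hypothetical structure)."""
    alert_type: str
    technical_description: str   # what is wrong
    business_impact: str         # why it matters
    fix_steps: list = field(default_factory=list)  # how to fix the issue


def missing_playbook_entries(alert_types, playbook):
    """Return the alert types that have no playbook entry, sorted by name.

    `playbook` is assumed to be a dict mapping alert type to PlaybookEntry.
    """
    return sorted(set(alert_types) - set(playbook))
```

Running a check like this whenever a new alert is defined is one way to keep the rule "every alert has an entry" from drifting over time.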
6%
All failures have a corresponding countermeasure, whether it is manually or automatically activated. Countermeasures that are activated frequently are always automated.
6%
The less frequently a countermeasure is activated, the less confident we are that it will work the next time it is needed. Therefore infrequently activated countermeasures are periodically and automatically exercised by intentionally causing failures.
6%
In distributed systems, failure is normal. Hardware failures that are rare, when multiplied by thousands of machines, become common. Therefore failures are assumed, designs work around them, and software anticipates them. Failure is an expected part of the landscape.
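The claim that rare failures become common at scale can be checked with a little arithmetic. The sketch below assumes machines fail independently and uses a made-up per-machine daily failure probability of 0.1%; both are assumptions for illustration, not figures from the book.

```python
def prob_at_least_one_failure(per_machine_prob: float, machines: int) -> float:
    """Probability that at least one of `machines` independent machines fails."""
    return 1.0 - (1.0 - per_machine_prob) ** machines


def expected_failures(per_machine_prob: float, machines: int) -> float:
    """Expected number of failed machines in the same period."""
    return per_machine_prob * machines


if __name__ == "__main__":
    daily_failure_prob = 0.001  # assumed value, for illustration only
    for fleet_size in (1, 100, 10_000):
        p_any = prob_at_least_one_failure(daily_failure_prob, fleet_size)
        avg = expected_failures(daily_failure_prob, fleet_size)
        print(f"{fleet_size:>6} machines: P(any failure)={p_any:.4f}, expected={avg:.1f}")
```

Under these assumptions a single machine almost never fails on a given day, while a fleet of ten thousand is nearly certain to see a failure every day, which is why the design has to treat failure as routine.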
6%
Systems should export metrics. They should count interesting events, such as how many times a particular API was called, and make these counters accessible.
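As a minimal sketch of the idea, a service can keep named counters in memory and expose them in a simple text form. The `Metrics` class, the counter name `api.get_user.calls`, and the `get_user` handler are hypothetical; a real service would more likely use an existing metrics library.

```python
import threading
from collections import Counter


class Metrics:
    """Thread-safe event counters (illustrative, not a production library)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def increment(self, name, amount=1):
        with self._lock:
            self._counts[name] += amount

    def snapshot(self):
        """Return a copy of all counters as a plain dict."""
        with self._lock:
            return dict(self._counts)

    def render(self):
        """Render counters as 'name value' lines, sorted by name."""
        return "\n".join(f"{k} {v}" for k, v in sorted(self.snapshot().items()))


metrics = Metrics()


def get_user(user_id):
    """Hypothetical API handler that counts each call."""
    metrics.increment("api.get_user.calls")
    return {"id": user_id}
```

Serving the output of `render()` from an internal HTTP endpoint is one common way to make such counters accessible to a monitoring system.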
8%
CAP stands for consistency, availability, and partition resistance. The CAP Principle states that it is not possible to build a distributed system that guarantees consistency, availability, and resistance to partitioning. Any one or two can be achieved but not all three simultaneously. When using such systems you must be aware of which are guaranteed.
9%
Figure 1.10: Numbers every engineer should know
9%
Designing for operations means making sure all the normal operational functions can be done well. Normal operational functions include tasks such as periodic maintenance, updates, and monitoring. These issues must be kept in mind in early stages of planning.
9%
The best strategy for providing a highly available service is to build features into the software that enhance one’s ability to perform and automate operational tasks.
9%
• Configuration
• Startup and shutdown
• Queue draining
• Software upgrades
• Backups and restores
• Redundancy
• Replicated databases
• Hot swaps
• Toggles for individual features
• Graceful degradation
• Access controls and rate limits
• Data import controls
• Monitoring
• Auditing
• Debug instrumentation
• Exception collection
10%
Rather than the term “operational requirements,” some organizations use the term “non-functional requirements.” We consider this term misleading.
10%
It must be possible for software upgrades to be implemented without taking down the service.
11%
Toggles for Individual Features
11%
Graceful Degradation
11%
Software needs to generate logs that are useful when debugging. Such logs should be both human-readable and machine-parseable. The kind of logging that is appropriate for debugging differs from the kind of logging that is needed for auditing.
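One way to get logs that are both human-readable and machine-parseable is to emit one JSON object per line. The sketch below uses Python's standard `logging` module; `JsonFormatter`, `make_logger`, and the convention of passing extra data under a `fields` key are assumptions made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured data supplied via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]   # replace any handlers from earlier calls
    logger.propagate = False
    return logger
```

A call such as `make_logger("frontend").debug("cache miss", extra={"fields": {"key": "user:42"}})` produces a line that a person can skim and a tool can parse with any JSON reader. Audit logging would typically go to a separate logger with its own retention and access rules.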
22%
Threading
Data can be processed in different ways to achieve better scale. Simply processing one request at a time has its limits. Threading is a technique that can be used to improve system throughput by processing many requests at the same time.
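The difference between one-at-a-time processing and threaded processing can be sketched with Python's standard thread pool. `handle_request` is a hypothetical handler whose `sleep` stands in for waiting on a disk or network call; in Python, threads mainly help with this kind of I/O-bound work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id):
    """Hypothetical request handler; the sleep simulates waiting on I/O."""
    time.sleep(0.05)
    return f"response-{request_id}"


def serve_sequentially(request_ids):
    """Process one request at a time."""
    return [handle_request(r) for r in request_ids]


def serve_with_threads(request_ids, workers=8):
    """Process many requests at once; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_request, request_ids))
```

With eight workers, eight waiting requests overlap instead of queuing behind one another, so total time approaches that of a single request rather than the sum of all of them.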
23%
Graceful degradation,
24%
Resilient systems continue where predictive strategies leave
24%
Resilient systems decouple component failure from user-visible outages.
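A simple illustration of this decoupling is a caller that tries each replica in turn, so that one failed component does not surface as an error to the user. This is a minimal sketch under assumed names (`call_with_failover`, `AllReplicasFailed`); the replicas are modeled as plain callables.

```python
class AllReplicasFailed(Exception):
    """Raised only when every replica has failed for a request."""


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response.

    `replicas` is a sequence of callables that take the request and either
    return a response or raise an exception.
    """
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a failed component is expected, not fatal
            errors.append(exc)
    raise AllReplicasFailed(errors)
```

The user sees an outage only when all replicas fail. A real system would also record each failure so that it is repaired rather than silently masked.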
24%
survivable systems,
24%
Software Resiliency Beats Hardware Reliability
24%
Software solutions are favored for many reasons. First and foremost, they are more economical. Once software is written, it can be applied to many services and many machines with no additional cost (assuming it is home-grown, is open source, or does not require a per-machine license). Software is also more malleable than hardware. It is easier to fix, upgrade, and replace.
24%
Everything Malfunctions Eventually
24%
Sheltered from the reality of a world full of malfunctions, we enable software developers to continue writing software that assumes a perfect, malfunction-free world (which, of course, does not exist).
24%
Distributed computing, in contrast to the traditional approach, embraces components’ failures and malfunctions. It takes a reality-based approach that accepts malfunctions as a fact of life.
30%
The three sources of work are life-cycle management, interacting with stakeholders, and process improvement and automation. Life-cycle management is the operational work involved in running the service. Interacting with stakeholders refers to both maintaining the relationship with people who use and depend on the service, and prioritizing and fulfilling their requests. Process improvement and automation is work inspired by the business desire for continuous improvement.
30%
Emergency Issues:
30%
Normal Requests:
30%
Project Work:
30%
The team manager should be part of the operational rotation.
31%
Meta-work
There is also meta-work: meetings, status reports, company functions. These generally eat into project time and should be minimized.
31%
developing software that automates or optimizes aspects of the team’s responsibilities.
31%
building organizational memory.
31%
shift report
31%
Every operations team should have a goal of eliminating the need for people to open tickets with them, similar to how there should always be a goal to automate manual processes.
31%
Any ticket created by an automated system should have a corresponding playbook entry that explains how to process it, with a link to the bug ID requesting that the automation be improved to eliminate the need to open such tickets.
31%
Jeff Ryan
End of shift report
31%
Theme
31%
Toil Reduction
31%
7.4.2 Communication Policies
Many teams establish a communication agreement that clarifies which methods will be used in which situations. For example, a common agreement is that chat rooms will be the primary communication channel but only for ephemeral discussions. If a decision is made in the chat room or an announcement needs to be made, it will be broadcast via email. Email is for information that needs to carry across oncall shifts or day boundaries. Announcements with lasting effects, such as major policies or design decisions, need to be recorded in the team wiki or other document ...
31%
Operations in distributed computing is done at a large scale. Processes that have to be done manually do not scale. Constant process improvement and automation are essential.
31%
Operations is responsible for the life cycle of a service: launch, maintenance, upgrades, and decommissioning. Maintenance tasks include emergency and non-emergency response. In addition, related projects maintain and evolve the service.
32%
Checklists
32%
The most productive use of time for operational staff is time spent automating and optimizing processes.
32%
In a DevOps organization, software developers and operational engineers work together as one team that shares responsibility for a web site or service.
32%
DevOps combines some cultural and attitude shifts with some common-sense processes. Originally based on applying Agile methodology to operations, the result is a streamlined set of principles and processes that can create reliable services.
32%
We find that (1) operator error is the largest single cause of failures in two of the three services, (2) operator errors often take a long time to repair, (3) configuration errors are the largest category of operator errors, (4) failures in custom-written front-end software are significant, and (5) more extensive online testing and more thoroughly exposing and detecting component failures would reduce failure rates in at least one service.
32%
DevOps