The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2
Kindle Notes & Highlights
48%
There are three tips we’ve found that help us resist the temptation to future-proof. First, use test-driven development and force yourself to stop coding when all tests pass. Second, adding TODO() comments listing features you’d like to add often reduces the emotional need to actually write the code. Third, the style guide should explicitly discourage excessive future-proofing and encourage aggressively deleting unused code or features that have become obsolete. This establishes a high standard that can be applied at code review time.
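A minimal sketch of the first two tips in practice (names are hypothetical): the function covers exactly what today's tests require, and the tempting extra feature survives only as a TODO comment.

```python
def parse_config(path: str) -> dict:
    """Parse a simple key=value config file. Covers everything
    the current tests exercise; nothing more."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

# TODO(brian): support include directives and environment-variable
# interpolation if a real use case ever appears. Deliberately not
# implemented now -- all current tests pass without it.
```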
50%
6. Create a design document template for your organization. Show it to coworkers and get feedback about the format and ways that you can make it easier to use.
7. Write the design for a proposed or existing project using the design document format.
8. Take an existing document that your organization uses and rewrite it using the design document format.
Brian
The last one is an interesting exercise. It can be useful to get a good example immediately, without waiting for a project to hit the right phase to generate one.
54%
To accelerate this kind of improvement, we introduce malfunctions artificially. Rather than trying to avoid malfunctions, we instigate them. If a failover process is broken, we want to learn this fact in a controlled way, preferably during normal business hours when the largest number of people are awake and available to respond. If we do not trigger failures in a controlled way, we will learn about such problems only during a real emergency: usually after hours, usually when everyone is asleep.
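A toy sketch of controlled failure injection under these constraints (the host name, service name, and ssh/systemctl commands are all placeholders, not the book's tooling):

```python
import datetime
import subprocess

def in_business_hours(now=None):
    """Only run drills when the most people are awake to respond."""
    now = now or datetime.datetime.now()
    return now.weekday() < 5 and 9 <= now.hour < 16

def trigger_failover_drill(primary_host: str):
    """Deliberately stop the primary and let the failover path react."""
    if not in_business_hours():
        raise RuntimeError("Drills run during business hours only.")
    # Hypothetical command; substitute your own process manager.
    subprocess.run(["ssh", primary_host, "systemctl", "stop", "myservice"],
                   check=True)
    print(f"Primary {primary_host} stopped; watch the failover complete.")
```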
54%
Wheel of Misfortune is a game that operational teams play to prepare people for oncall. It is a way to improve an individual’s knowledge of how to handle oncall tasks and to share best practices. It can also be used to introduce new procedures to the entire team. This game enables team members to maintain skills, learn new skills as needed, and learn from each other. The game is played as follows. The entire team meets in a conference room. Each round involves one person volunteering to be the contestant and another volunteering to be the Master of Disaster (MoD). The MoD explains an oncall …
55%
Fire drills exercise a particular disaster preparedness process. In these situations actual failures are triggered to actively test both the technology and the people involved.
57%
Outages of long duration require frequent status updates to management and other stakeholders. By designating a single person to be the Public Information Officer, the IC can keep from being distracted by executives demanding updates or users asking, “Is it up yet?” Every status update should end by noting when the next status can be expected.
Brian
Yes.
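A trivial sketch of that last habit (the format is my own invention): compose every status update so it ends with when the next one is due.

```python
import datetime

def status_update(summary: str, next_in_minutes: int) -> str:
    """Compose an outage status update that always ends by noting
    when the next update can be expected."""
    next_at = (datetime.datetime.now()
               + datetime.timedelta(minutes=next_in_minutes))
    return f"{summary}\nNext update by {next_at:%H:%M}."

print(status_update("DB failover in progress; reads are degraded.", 30))
```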
58%
Many monitoring systems do not track the units of a metric, requiring people to guess based on the metric name or other context and to perform conversions manually. This process is notably prone to error.
Brian
Totally. And make sure the human-consumable metrics representations make units unambiguous as well.
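One widely used mitigation, echoing Brian's point, is the Prometheus naming convention of baking the base unit into the metric name; a sketch using the prometheus_client library:

```python
from prometheus_client import Gauge, Histogram

# Unit is unambiguous from the name: seconds, not ms or minutes.
request_latency_seconds = Histogram(
    "http_request_duration_seconds",
    "Time spent serving HTTP requests, in seconds.")

# Bytes, not KB/MiB -- dashboards can convert for display.
queue_depth_bytes = Gauge(
    "work_queue_depth_bytes",
    "Bytes of work waiting in the queue.")

request_latency_seconds.observe(0.042)   # always record base units
queue_depth_bytes.set(128 * 1024)
```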
69%
We have identified eight broad categories of operational responsibilities that most services have.
Brian
Suggests rating each of these dimensions (qualitatively, on a scale of 1-5) for each service to get an overall picture of the health of each service and team, as well as which areas to invest in first to make the largest improvements. The areas are:
- regular tasks: process of prioritizing, assigning, and completing non-emergency tasks
- emergency response
- monitoring and metrics
- capacity planning
- change management
- new product introduction/removal
- service deploy/decommission
- performance/efficiency (of infra resources)
Some of these look like they might be able to have objective thresholds set to determine their scores. This is actually close to (though narrower than) something michael e was working with to assess overall team health. I think both approaches are onto something. You can see trends over time, help prioritize across a long list of possible next projects, etc.
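A back-of-the-envelope sketch of what this note describes, with made-up services and scores: rate each of the eight dimensions 1-5 per service, then look at averages and weak spots to decide where to invest first.

```python
# The eight operational responsibility categories listed above.
DIMENSIONS = [
    "regular tasks", "emergency response", "monitoring and metrics",
    "capacity planning", "change management", "new product intro/removal",
    "service deploy/decommission", "performance/efficiency",
]

# Hypothetical assessments: 1 (Initial/Chaotic) .. 5 (Optimizing).
scores = {
    "billing":  dict(zip(DIMENSIONS, [3, 2, 4, 2, 3, 3, 2, 3])),
    "frontend": dict(zip(DIMENSIONS, [4, 3, 3, 3, 4, 2, 3, 2])),
}

for service, dims in scores.items():
    weakest = min(dims, key=dims.get)
    avg = sum(dims.values()) / len(dims)
    print(f"{service}: avg {avg:.1f}, invest first in '{weakest}'")
```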
69%
The difference between NPI, SDD, and CM is subtle. NPI is how something is launched for the first time. It is the non-technical coordinating function for all related processes, many of which are technical. SDD is the technical process of deploying new instances of an existing item, whether it is a machine, a server, or a service. It may include non-technical processes such as budget approval, but these are in support of the technical goal. CM is how upgrades and changes are managed, either using a software deployment platform or an enterprise-style change management review board.
69%
The CMM defines five levels:
• Level 1, Initial: Sometimes called Chaotic. This is the starting point for a new or undocumented process. Processes are ad hoc and rely on individual heroics.
• Level 2, Repeatable: The process is at least documented sufficiently such that it can be repeated with the same results.
• Level 3, Defined: Roles and responsibilities of the process are defined and confirmed.
• Level 4, Managed: The process is quantitatively managed in accordance with agreed-upon metrics.
• Level 5, Optimizing: Process management includes deliberate process optimization/improvement.
Brian
How it's recommended that you actually score the areas above. It seems to be more about each process's sophistication and predictability than about the quality of the output?
70%
Level 3: Defined
At this level, the roles and responsibilities of the process are defined and confirmed. At the previous level, we learned what needed to be done. At this level, we know who is responsible for doing it, and we have definitions of how to measure correctness.
Brian
I think the definitions of correctness (which metrics we want to gather and what values we strive for, before the metrics are actually collected automatically) are probably a bigger deal than the assignment of responsibilities. But maybe it has always been natural for responsibilities to be clear early on, even during the chaotic phase?
70%
or there is disagreement among team members about whether the role of a load balancer is to improve capacity or to improve resiliency. This demonstrates Level 1 behavior.
Brian
I admit to being in this state at one point (during the design of the next-gen load balancer layer architecture), or at least to instigating the discussion as part of the design. Fortunately, we clarified the state we desired to be in (we were against the capacity approach :) ). Granted, after rolling out the new load balancer layer, we did not measure capacity or how close we were to crossing from the resiliency regime into the capacity regime. Really good point here.
70%
20.4.2 Assessing Each Service
During each assessment period, record the assessment number (1 through 5) along with notes that justify the assessment. Generally these notes are in the form of answers to the questions.
Brian
Write "forward-looking" promotion documents saying what would need to be done to graduate to the next level
70%
Both of these spreadsheets should be filled out by the team as a group exercise. The manager’s role is to hold everyone to high standards for accuracy. The manager should not do this assessment on his or her own and surprise the team with the result. A self-assessment has an inherent motivational factor; being graded by a manager is demotivating at best.
71%
Level 3 assessment is designed to be good enough for most services. Going above that level should be reserved for high-priority services such as revenue-generating services or services that are particularly demanding.
71%
identify a problem, engineer a solution, and measure the success or failure by whether the assessment improves. This is a subtle but important difference.
71%
Appendix A. Assessments
Brian
This appendix is useful for the assessment of service (process) health. Use it.
76%
A.11 Toil Reduction
Toil Reduction is the process by which we improve the use of people within our system. When we reduce toil (i.e., exhausting physical labor), we create a more sustainable working environment for operational staff. While reducing toil is not a service per se, this OR can be used to assess the amount of toil and determine whether practices are in place to limit the amount of toil.
Brian
Teams in toil hell might really benefit from a quick self-assessment here, even if a comprehensive assessment is coming later.
80%
O(nm): Exponential Scaling: Performance worsens exponentially.
Brian
Bug: should be O(m^n). O(nm) grows polynomially, not exponentially. Precision, please.
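The difference shows up fast; a throwaway check with made-up numbers:

```python
# n*m grows polynomially; m**n grows exponentially in n.
m = 2
for n in (10, 20, 30):
    print(f"n={n}: n*m = {n * m:>4}   m**n = {m ** n:,}")
# n=10: n*m =   20   m**n = 1,024
# n=20: n*m =   40   m**n = 1,048,576
# n=30: n*m =   60   m**n = 1,073,741,824
```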
81%
Goals: (bullet list describing the problem being solved)
Brian
(From design doc template): include scaling demands here (will scale to X units, perhaps cumulative total or TPS, by Y date). Can be very helpful later in the doc to indicate the scaling limits of the proposed solution. When (in terms of metrics, plus a best guess in terms of time) will this design break down and need replacing?
81%
Non-goals: (bullet list describing the limits of the project)
Brian
(From design doc template) Reminder to list what's explicitly out of scope for a design review. It could be helpful to have a section with a quick summary of what we'd do to extend the design to cover high-probability future scope (don't 'future-proof', but do know if we're taking a big step away from future needs).