The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2
Kindle Notes & Highlights
48%
There are three tips we’ve found that help us resist the temptation to future-proof. First, use test-driven development and force yourself to stop coding when all tests pass. Second, adding TODO() comments listing features you’d like to add often reduces the emotional need to actually write the code. Third, the style guide should explicitly discourage excessive future-proofing and encourage aggressively deleting unused code or features that have become obsolete. This establishes a high standard that can be applied at code review time.
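A minimal sketch of the first two tips in practice (names are hypothetical): the function covers exactly what today's tests require, and the tempting extra feature survives only as a TODO comment.

```python
def parse_config(path: str) -> dict:
    """Parse a simple key=value config file. Covers everything
    the current tests exercise; nothing more."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

# TODO(brian): support include directives and environment-variable
# interpolation if a real use case ever appears. Deliberately not
# implemented now -- all current tests pass without it.
```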
50%
6. Create a design document template for your organization. Show it to coworkers and get feedback about the format and ways that you can make it easier to use.
7. Write the design for a proposed or existing project using the design document format.
8. Take an existing document that your organization uses and rewrite it using the design document format.
Brian
The last one is an interesting exercise. It can be useful to get a good example immediately, without waiting for a project to hit the right phase to generate one.
54%
To accelerate this kind of improvement, we introduce malfunctions artificially. Rather than trying to avoid malfunctions, we instigate them. If a failover process is broken, we want to learn this fact in a controlled way, preferably during normal business hours when the largest number of people are awake and available to respond. If we do not trigger failures in a controlled way, we will learn about such problems only during a real emergency: usually after hours, usually when everyone is asleep.
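A toy sketch of controlled failure injection under these constraints (the host name, service name, and ssh/systemctl commands are all placeholders, not the book's tooling):

```python
import datetime
import subprocess

def in_business_hours(now=None):
    """Only run drills when the most people are awake to respond."""
    now = now or datetime.datetime.now()
    return now.weekday() < 5 and 9 <= now.hour < 16

def trigger_failover_drill(primary_host: str):
    """Deliberately stop the primary and let the failover path react."""
    if not in_business_hours():
        raise RuntimeError("Drills run during business hours only.")
    # Hypothetical command; substitute your own process manager.
    subprocess.run(["ssh", primary_host, "systemctl", "stop", "myservice"],
                   check=True)
    print(f"Primary {primary_host} stopped; watch the failover complete.")
```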
54%
Wheel of Misfortune is a game that operational teams play to prepare people for oncall. It is a way to improve an individual’s knowledge of how to handle oncall tasks and to share best practices. It can also be used to introduce new procedures to the entire team. This game enables team members to maintain skills, learn new skills as needed, and learn from each other. The game is played as follows. The entire team meets in a conference room. Each round involves one person volunteering to be the contestant and another volunteering to be the Master of Disaster (MoD). The MoD explains an oncall …
55%
Fire drills exercise a particular disaster preparedness process. In these situations actual failures are triggered to actively test both the technology and the people involved.
57%
Outages of long duration require frequent status updates to management and other stakeholders. By designating a single person to be the Public Information Officer, the IC can keep from being distracted by executives demanding updates or users asking, “Is it up yet?” Every status update should end by noting when the next status can be expected.
Brian
Yes.
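A trivial sketch of that last habit (the format is my own invention): compose every status update so it ends with when the next one is due.

```python
import datetime

def status_update(summary: str, next_in_minutes: int) -> str:
    """Compose an outage status update that always ends by noting
    when the next update can be expected."""
    next_at = (datetime.datetime.now()
               + datetime.timedelta(minutes=next_in_minutes))
    return f"{summary}\nNext update by {next_at:%H:%M}."

print(status_update("DB failover in progress; reads are degraded.", 30))
```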
58%
Many monitoring systems do not track the units of a metric, requiring people to guess based on the metric name or other context and to perform conversions manually. This process is notably prone to error.
Brian
Totally. And make sure the human-consumable metrics representations make units unambiguous as well.
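One widely used mitigation, echoing Brian's point, is the Prometheus naming convention of baking the base unit into the metric name; a sketch using the prometheus_client library:

```python
from prometheus_client import Gauge, Histogram

# Unit is unambiguous from the name: seconds, not ms or minutes.
request_latency_seconds = Histogram(
    "http_request_duration_seconds",
    "Time spent serving HTTP requests, in seconds.")

# Bytes, not KB/MiB -- dashboards can convert for display.
queue_depth_bytes = Gauge(
    "work_queue_depth_bytes",
    "Bytes of work waiting in the queue.")

request_latency_seconds.observe(0.042)   # always record base units
queue_depth_bytes.set(128 * 1024)
```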
69%
We have identified eight broad categories of operational responsibilities that most services have.
Brian
Suggests rating each of these dimensions (qualitatively, on a scale of 1-5) for each service to get an overall picture of the health of each service and team, as well as which areas to invest in first to make the largest improvements. The areas are:
- regular tasks: process of prioritizing, assigning, and completing non-emergency tasks
- emergency response
- monitoring and metrics
- capacity planning
- change management
- new product introduction/removal
- service deploy/decommission
- performance/efficiency (of infra resources)
Some of these look like they might be able to have objective thresholds set to determine their scores. This is actually close to (though narrower than) something michael e was working with to assess overall team health. I think both approaches are onto something. You can see trends over time, help prioritize across a long list of possible next projects, etc.
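A back-of-the-envelope sketch of what this note describes, with made-up services and scores: rate each of the eight dimensions 1-5 per service, then look at averages and weak spots to decide where to invest first.

```python
# The eight operational responsibility categories listed above.
DIMENSIONS = [
    "regular tasks", "emergency response", "monitoring and metrics",
    "capacity planning", "change management", "new product intro/removal",
    "service deploy/decommission", "performance/efficiency",
]

# Hypothetical assessments: 1 (Initial/Chaotic) .. 5 (Optimizing).
scores = {
    "billing":  dict(zip(DIMENSIONS, [3, 2, 4, 2, 3, 3, 2, 3])),
    "frontend": dict(zip(DIMENSIONS, [4, 3, 3, 3, 4, 2, 3, 2])),
}

for service, dims in scores.items():
    weakest = min(dims, key=dims.get)
    avg = sum(dims.values()) / len(dims)
    print(f"{service}: avg {avg:.1f}, invest first in '{weakest}'")
```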
69%
The difference between NPI, SDD, and CM is subtle. NPI is how something is launched for the first time. It is the non-technical coordinating function for all related processes, many of which are technical. SDD is the technical process of deploying new instances of an existing item, whether it is a machine, a server, or a service. It may include non-technical processes such as budget approval, but these are in support of the technical goal. CM is how upgrades and changes are managed, either using a software deployment platform or an enterprise-style change management review board.
69%
The CMM defines five levels:
• Level 1, Initial: Sometimes called Chaotic. This is the starting point for a new or undocumented process. Processes are ad hoc and rely on individual heroics.
• Level 2, Repeatable: The process is at least documented sufficiently such that it can be repeated with the same results.
• Level 3, Defined: Roles and responsibilities of the process are defined and confirmed.
• Level 4, Managed: The process is quantitatively managed in accordance with agreed-upon metrics.
• Level 5, Optimizing: Process management includes deliberate process optimization/improvement.
Brian
How it's recommended that you actually score the areas above. It seems to be more about each process's sophistication and predictability than about the quality of the output?
70%
Level 3: Defined
At this level, the roles and responsibilities of the process are defined and confirmed. At the previous level, we learned what needed to be done. At this level, we know who is responsible for doing it, and we have definitions of how to measure correctness.
Brian
I think the definitions of correctness (which metrics we want to gather and what values we strive for, before the metrics are actually collected automatically) are probably a bigger deal than the assignment of responsibilities. But maybe it has always been natural for responsibilities to be clear early on, even during the chaotic phase?
70%
or there is disagreement among team members about whether the role of a load balancer is to improve capacity or to improve resiliency. This demonstrates Level 1 behavior.
Brian
I admit to being in this state at one point (during the design of the next-gen load balancer layer architecture), or at least to instigating the discussion as part of the design. Fortunately, we clarified the state we desired to be in (we were against the capacity approach :) ). Granted, after rolling out the new load balancer layer, we did not measure capacity or how close we were to crossing from the resiliency regime into the capacity regime. Really good point here.
70%
20.4.2 Assessing Each Service
During each assessment period, record the assessment number (1 through 5) along with notes that justify the assessment. Generally these notes are in the form of answers to the questions.
Brian
Write "forward-looking" promotion documents saying what would need to be done to graduate to the next level
70%
Both of these spreadsheets should be filled out by the team as a group exercise. The manager’s role is to hold everyone to high standards for accuracy. The manager should not do this assessment on his or her own and surprise the team with the result. A self-assessment has an inherent motivational factor; being graded by a manager is demotivating at best.
71%
Level 3 assessment is designed to be good enough for most services. Going above that level should be reserved for high-priority services such as revenue-generating services or services that are particularly demanding.
71%
identify a problem, engineer a solution, and measure the success or failure by whether the assessment improves. This is a subtle but important difference.
71%
Appendix A. Assessments
Brian
This appendix is useful for the assessment of service (process) health. Use it.
76%
A.11 Toil Reduction
Toil Reduction is the process by which we improve the use of people within our system. When we reduce toil (i.e., exhausting physical labor), we create a more sustainable working environment for operational staff. While reducing toil is not a service per se, this OR can be used to assess the amount of toil and determine whether practices are in place to limit the amount of toil.
Brian
Teams in toil hell might really benefit from a quick self-assessment here, even if a comprehensive assessment is coming later.
80%
O(nm): Exponential Scaling: Performance worsens exponentially.
Brian
Bug: should be O(m^n). O(nm) grows polynomially, not exponentially. Precision, please.
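The difference shows up fast; a throwaway check with made-up numbers:

```python
# n*m grows polynomially; m**n grows exponentially in n.
m = 2
for n in (10, 20, 30):
    print(f"n={n}: n*m = {n * m:>4}   m**n = {m ** n:,}")
# n=10: n*m =   20   m**n = 1,024
# n=20: n*m =   40   m**n = 1,048,576
# n=30: n*m =   60   m**n = 1,073,741,824
```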
81%
Goals: (bullet list describing the problem being solved)
Brian
(From design doc template): include scaling demands here (will scale to X units, perhaps cumulative total or TPS, by Y date). Can be very helpful later in the doc to indicate the scaling limits of the proposed solution. When (in terms of metrics, plus a best guess in terms of time) will this design break down and need replacing?
81%
Non-goals: (bullet list describing the limits of the project)
Brian
(From design doc template) Reminder to list what's explicitly out of scope for a design review. It could be helpful to have a section with a quick summary of what we'd do to extend the design to cover high-probability future scope (don't 'future-proof', but do know if we're taking a big step away from future needs).