Kindle Notes & Highlights
System administrators who can write code are more valuable to an employer.
However, automation should not be approached as simply developing faster, more predictable functional replacements for the tasks that people perform. The human–computer system needs to be viewed as a whole.
Also bear in mind that eliminating a task, whether it is easy, difficult, frequent, or rare, is beneficial because it becomes one less thing to know, do, or maintain. If you can eliminate the task, rather than automating it, that will be the most efficient approach.
• Tasks classified as rare/easy can remain manual. If they are easy, anyone should be able to do them successfully. A team’s culture will influence whether the person does the right thing.
• Tasks classified as rare/difficult should be documented and tools should be created to assist the process. Documentation and better tools make it easier to do the tasks correctly and consistently. This quadrant includes troubleshooting and recovery tasks that cannot be automated. However, good documentation can assist the process and good tools can remove the burden of repetition or human error.
• Tasks
Buying commercial software or using free or open source projects leverages the skills and knowledge of hundreds or thousands of other people.
When designing automation, ask yourself which view of the human component is being assumed by this automation. Are people a bottleneck, a source of unwanted variability, or a resource? If people are a bottleneck, can you remove the bottleneck without removing their visibility into what the system is doing, and without making it impossible for them to adjust how the system works, when necessary? If people are a source of unwanted variability, then you are constraining the environment and inputs of the automation to make it more reliable. What effect does that have on the people running the
The long-term operation of a system can be broken down into four stages: tracking, regulating, monitoring, and targeting. Tracking covers event detection and short-term control in response to inputs or detected events. Automation typically starts at this level. Regulation covers long-term control, such as managing transition between states.
There is a popular misconception that the goal of automation is to do tasks faster than they could be done manually. That is just one of the goals. Other goals include the following:
• Help scaling. Automation is a workforce multiplier. It permits one person to do the work of many.
• Improve accuracy. Automation is less error prone than people are. Automation does not get distracted or lose interest, nor does it get sloppy over time. Over time software is improved to handle more edge cases and error situations. Unlike hardware, software gets stronger over time (Spolsky 2004, p. 183).
Automation, however, would be applied where each freshly installed machine looks up its hostname in a directory or external database to find its function. It then configures the machine’s OS, installs various packages, configures them, and starts the services the machine is intended to run. The manual steps are eliminated, such that machines come to life on their own.
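The lookup-and-configure flow just described can be sketched in a few lines. The hostnames, role directory, and package lists below are hypothetical stand-ins; a real system would query LDAP, DNS, or a CMDB and hand the plan to a configuration-management tool.

```python
import socket

# Stand-in for the external directory mapping hostname -> machine function.
ROLE_DIRECTORY = {
    "web01": "webserver",
    "db01": "database",
}

# Stand-in for per-role configuration profiles.
ROLE_PROFILES = {
    "webserver": {"packages": ["nginx"], "services": ["nginx"]},
    "database": {"packages": ["postgresql"], "services": ["postgresql"]},
}

def bootstrap(hostname=None):
    """Return the configuration plan a freshly installed machine would apply.

    Looks up the machine's function by hostname, then returns the packages
    to install and services to start. Unknown hosts fall back to a generic
    profile.
    """
    hostname = hostname or socket.gethostname().split(".")[0]
    role = ROLE_DIRECTORY.get(hostname, "generic")
    profile = ROLE_PROFILES.get(role, {"packages": [], "services": []})
    return {"role": role, **profile}
```

With this in place, a new machine needs nothing beyond its hostname: the rest of its identity comes from the directory.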
Automation has many benefits, but it requires dedicated time and effort to create. Automation, like any programming, is best created during a block of time where there are no outside interruptions. Sometimes there is so much other work to be done that it is difficult to find a sufficient block of time to focus on creating the automation. You need to deliberately make the time to create the automation, not hope that eventually things will quiet down sufficiently so that you have the time.
Apply any and all effort to fix the biggest bottleneck first. There may be multiple areas where automation is needed. Choose the one with the biggest impact first.
You can’t automate what you can’t do manually.
The automation tools and their support tools such as bug trackers and source code repositories should be a centralized, shared service used by all involved in software development. Such an approach makes it easier to collaborate. For example, moving bugs and other issues between projects is easier if all teams use the same bug tracking system.
A VCS should not be used only for source code; that is, configuration files must also be revision controlled. When automation or use of tools involves configuration file changes, you should automate the steps of checking the config file out of version control, modifying it, and then checking it back in. Tools should not be allowed to modify config files outside of the VCS.
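The check-out/modify/check-in discipline can be made explicit by having tooling generate the VCS steps as data. This sketch assumes git; the repository path, file name, and the `$EDITOR_OR_TOOL` placeholder are hypothetical, and the function returns the commands it would run rather than executing them, so the sequence stays auditable.

```python
def vcs_config_update(repo, path, message):
    """Return the ordered commands for a VCS-mediated config file edit.

    The tool never touches the config file outside version control: it
    syncs first, edits, then commits and pushes the change.
    """
    return [
        ["git", "-C", repo, "pull"],            # check the file out (sync to head)
        ["$EDITOR_OR_TOOL", f"{repo}/{path}"],  # placeholder: tool modifies the file
        ["git", "-C", repo, "add", path],       # stage the modification
        ["git", "-C", repo, "commit", "-m", message],
        ["git", "-C", repo, "push"],            # check the change back in
    ]
```

A wrapper like this makes it structurally impossible for automation to leave an uncommitted config change behind.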
A style guide is a standard indicating how source code should be formatted and which language features are encouraged, discouraged, or banned.
Style Guide Basics
Some teams have a list of tasks that are done during each shift. Some example tasks include verifying the monitoring system is working, checking that backups ran, and checking for security alerts related to software used in-house. These tasks should be eliminated through automation.
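A first step toward eliminating such a checklist is encoding it as code. In this sketch the check names come from the examples above, but the implementations are stubs; real checks would probe the monitoring system, backup job status, and a security feed.

```python
def monitoring_alive():
    return True   # stub: e.g., query the monitoring system's heartbeat endpoint

def backups_ran():
    return True   # stub: e.g., inspect last night's backup job exit status

def security_alerts_clear():
    return True   # stub: e.g., poll an advisory feed for in-house software

CHECKS = [monitoring_alive, backups_ran, security_alerts_clear]

def run_shift_checks(checks=CHECKS):
    """Run every check and return the names of the ones that failed.

    An empty result means the shift checklist passed; a non-empty result
    can be routed to the alerting system instead of a human's to-do list.
    """
    return [c.__name__ for c in checks if not c()]
```

Once the checks run on a schedule and alert on failure, the human checklist disappears.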
Alert Responsibilities
Once alerted, your responsibilities change. You are now responsible for verifying the problem, fixing it, and ensuring that follow-up work gets completed. You may not be the person who does all of this work, but you are responsible for making sure it all happens through delegation and handoffs.
Quick Fixes versus Long-Term Fixes
Now the issue is worked on. Your priority is to come up with the best solution that will resolve the issue within the SLA. Sometimes we have a choice between a long-term fix and a quick fix. The long-term fix will resolve the fundamental problem and prevent the issue in the future. It may involve writing code or releasing new software.
Asking for Help
It is also the responsibility of the oncall person to ask for help when needed.
Follow-up Work
Once the problem has been resolved, the priority shifts to raising the visibility of the issue so that long-term fixes and optimizations will be done.
Observe, Orient, Decide, Act (OODA)
The OODA loop was developed for combat operations by John Boyd. Designed for situations like fighter jet combat, it fits high-stress situations that require quick responses. Kyle Brandt (2014) popularized the idea of applying OODA to system administration.
Oncall Playbook
Ideally, every alert that the system can generate will be matched by documentation that describes what to do in response. An oncall playbook is this documentation.
Third-Party Escalation
Sometimes escalations must include someone outside the operations team, such as the oncall team of a different service or a vendor. Escalating to a third party has special considerations.
Eventually your shift will end and it will be time to hand off oncall to the next person.
One strategy is to write an end-of-shift report that is emailed to the entire oncall roster. Sending it to just the next person oncall is vulnerable to error. You may pick the wrong person by mistake, or may not know about a substitution that was negotiated. Sending the report to everyone keeps the entire team up-to-date and gives everyone an opportunity to get involved if needed. The end-of-shift report should include any notable events that happened and anything that the next shift needs to know. For example, it should identify an ongoing outage or behaviors that need manual monitoring.
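The report itself can be generated from a template so no required section is forgotten. The field names and the roster alias below are hypothetical; the point is that the recipient is always the whole roster and empty sections still appear explicitly.

```python
def end_of_shift_report(author, notable_events, ongoing_outages, needs_watching):
    """Render an end-of-shift report addressed to the entire oncall roster."""
    lines = [
        "To: oncall-roster@example.com",   # whole roster, never just the next person
        f"Subject: End-of-shift report ({author})",
        "",
        "Notable events:",
        *[f"  - {e}" for e in notable_events],
        "Ongoing outages:",
        *[f"  - {o}" for o in (ongoing_outages or ["none"])],
        "Needs manual monitoring:",
        *[f"  - {w}" for w in (needs_watching or ["none"])],
    ]
    return "\n".join(lines)
```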
Long-Term Fixes
Each alert may generate follow-up work that cannot be done during the oncall period. Large problems may require a causal analysis to determine the root cause of the outage.
The postmortem process records, for any engineers whose actions have contributed to the outage, a detailed account of actions they took at the time, effects they observed, expectations they had, assumptions they had made, and their understanding of the timeline of events as they occurred. They should be able to do this without fear of punishment or retribution.
A Postmortem Report for Every High-Priority Alert
At Google many teams had a policy of writing a postmortem report every time their monitoring system paged the oncall person.
The ideal monitoring system makes the operations team omniscient and omnipresent.
Many monitoring systems do not track the units of a metric, requiring people to guess the units from the metric name or other context and to perform conversions manually.
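One remedy is to make the unit part of the metric's definition, so conversions are done once, in code. This registry and its conversion table are illustrative only, not any particular monitoring system's API.

```python
# Scale factors to base units (seconds for time, bytes for size).
UNIT_SCALE = {"ms": 0.001, "s": 1.0, "B": 1, "KiB": 1024}

class Metric:
    """A metric whose unit is declared up front, not guessed from its name."""

    def __init__(self, name, unit):
        if unit not in UNIT_SCALE:
            raise ValueError(f"unknown unit: {unit}")
        self.name, self.unit = name, unit

    def to_base(self, value):
        """Convert a reading to base units, removing manual conversion."""
        return value * UNIT_SCALE[self.unit]

# Hypothetical metric: request latency recorded in milliseconds.
latency = Metric("http.request.latency", "ms")
```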
16.3 What to Monitor
The following are some example KPIs:
• Availability:
• Latency:
• Urgent Bug Count:
• Urgent Bug Resolution:
• Major Bug Resolution:
• Backend Server Stability:
• User Satisfaction:
• Cart Size: The median number of items
• Finance:
You need to instrument your system enough so that you can see when things are going to fail.
The Minimum Monitor Problem
A common interview question is “If you could monitor only three aspects of a web server, what would they be?” This is an excellent test of technical knowledge and logical thinking. It requires you to use your technical knowledge to find one metric that can proxy for many possible problems.
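One possible answer (not the only defensible one) is request rate, error rate, and high-percentile latency, since together they proxy for load, correctness, and user experience. The sketch below computes all three from a sample of hypothetical `(status_code, latency_seconds)` request records.

```python
def summarize(requests, window_seconds):
    """Compute three proxy metrics from (status_code, latency_seconds) records.

    Covers load (request rate), correctness (5xx error rate), and user
    experience (95th-percentile latency) in one pass over the sample.
    """
    latencies = sorted(lat for _, lat in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    # Nearest-rank p95; 0.0 when there are no samples.
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0.0
    return {
        "request_rate": len(requests) / window_seconds,
        "error_rate": errors / len(requests) if requests else 0.0,
        "latency_p95": p95,
    }
```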
One way to reduce storage needs is through summarization, or down-sampling. With this technique, recent data is kept at full fidelity but older data is replaced by averages or other forms of summarization.
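The scheme can be sketched directly: keep the newest points untouched and collapse everything older into bucket averages. The retention split and bucket size are illustrative; real systems typically apply several tiers of progressively coarser buckets.

```python
def downsample(points, keep_recent, bucket_size):
    """Summarize old data while keeping recent data at full fidelity.

    points is an oldest-first list of samples; keep_recent (>= 1) is how
    many trailing points stay untouched. Returns (summarized_old, recent),
    where each old bucket of bucket_size points becomes its average.
    """
    old, recent = points[:-keep_recent], points[-keep_recent:]
    summarized = [
        sum(bucket) / len(bucket)
        for bucket in (old[i:i + bucket_size] for i in range(0, len(old), bucket_size))
    ]
    return summarized, recent
```

Eight full-fidelity samples shrink to two averages plus the two newest points, trading precision on old data for storage.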
Meta-monitoring
Monitoring the monitoring system is called meta-monitoring. How do you know if the reason you haven’t been alerted today is because everything is fine or because the monitoring system has failed?
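A common meta-monitoring technique is a heartbeat (dead man's switch): the monitoring system periodically records a timestamp, and a separate, independent checker alerts when that heartbeat goes stale. The 120-second threshold below is illustrative.

```python
import time

HEARTBEAT_MAX_AGE = 120  # illustrative: seconds of silence before we assume monitoring died

def monitoring_is_alive(last_heartbeat, now=None):
    """True if the monitoring system has checked in recently.

    Run this from infrastructure independent of the monitoring system
    itself, so silence is distinguishable from health.
    """
    now = time.time() if now is None else now
    return (now - last_heartbeat) <= HEARTBEAT_MAX_AGE
```

If this check fails, "no alerts today" means the monitoring system is down, not that everything is fine.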