Practice of Cloud System Administration, The: DevOps and SRE Practices for Web Services, Volume 2
The accuracy and precision of collected data should be monitored; one should monitor how often an attempt to collect a measurement fails.
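
One way to watch collection health, as the highlight above suggests, is to count attempts and failures for each collector. This is only a minimal sketch; the `CollectionStats` class and its method names are hypothetical and not taken from the book.

```python
class CollectionStats:
    """Tracks how often attempts to collect a measurement fail (hypothetical helper)."""

    def __init__(self):
        self.attempts = 0
        self.failures = 0

    def record(self, succeeded):
        # Call once per collection attempt, whether or not it worked.
        self.attempts += 1
        if not succeeded:
            self.failures += 1

    def failure_ratio(self):
        # Fraction of attempts that failed; 0.0 before any attempt is recorded.
        return self.failures / self.attempts if self.attempts else 0.0
```

The ratio can then be stored and alerted on like any other metric.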
Monitoring is the primary way we gain visibility into the systems we run.
The goal of monitoring is to detect problems before they turn into outages, not to detect outages. If we simply detect outages, then our operating process has downtime “baked in.”
A measurement is a data point.
Identify the business’s key performance indicators (KPIs) and then determine which metrics can be collected to create those KPIs.
Sensing and Measurement
The sensing and measurement component gathers the measurements. Measurements can be categorized as blackbox or whitebox, depending on the amount of knowledge of the internals that is used.
Blackbox monitoring means that measurements try to emulate a user.
Doing an HTTPS GET of a web site’s main page is an example of blackbox monitoring.
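
A blackbox probe of this kind might look like the sketch below. It uses only the Python standard library; the `probe` function, its `opener` parameter (there so the probe can be exercised without a network), and the choice to treat any 2xx status as healthy are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch url as a user would and return (ok, status, seconds)."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            resp.read()  # read the body so the timing covers the whole page
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None  # no usable answer at all (DNS, TLS, timeout, ...)
    elapsed = time.monotonic() - start
    # Assumption: only 2xx responses count as "the page works".
    ok = status is not None and 200 <= status < 300
    return ok, status, elapsed
```

Calling `probe("https://www.example.com/")` on a schedule and recording all three values gives both an up/down signal and a user-visible latency measurement.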
Blackbox testing includes monitoring that attempts to do multi-step processes as a user, such as verifying that the purchasing process is working.
A whitebox measurement has the benefit of internal knowledge because it is a lower level of abstraction. For example, such a measure might monitor raw counters of the number of times a particular API call was made, the number of outstanding items waiting in a queue, or latency information of internal processes.
Direct versus Synthesized Measurements
Some measurements are direct, whereas others are synthesized. For example, if every time a purchase is made the system sends a metric of the total money collected to the monitoring system, this is a direct measurement. Alternatively, if every 5 minutes the monitoring system tallies the total money collected so far, this is a synthesized metric.
Rate versus Capability Monitoring
Event frequency determines what to monitor.
The rate at which purchases happen determines which of these to monitor.
In short, rate metrics are more important when event frequency is high and there are smooth, predictable trends.
When there is low event frequency or an uneven rate, capability metrics are more important.
never collect rates ...
A counter is a measurement that only increases—for example, a count of the number of API calls received by a service or the count of the number of packets transmitted on a network interface.
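
Because a counter only increases, a rate can be derived later from any two stored samples. The sketch below shows one way to do that; the function name, the `(timestamp, value)` sample shape, and the handling of a counter restart are assumptions for illustration.

```python
def rate_per_second(prev, curr):
    """Derive an events-per-second rate from two (timestamp_seconds, counter) samples."""
    (t0, c0), (t1, c1) = prev, curr
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if c1 < c0:
        # A counter never decreases, so a drop means it restarted from zero.
        # Assumption: count only what has accumulated since the restart.
        delta = c1
    else:
        delta = c1 - c0
    return delta / (t1 - t0)
```

For example, samples of 100 at time 0 and 700 at time 60 give a rate of 10 events per second over that minute.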
Collection
Once we have a measurement, we must transmit it to the storage system.
• Escalation Chain: The escalation chain is who to contact, and who to contact if that person does not respond. Generally, one or two chains are defined for each service or group of services.
• Suggested Resolution: Concise instructions of what to do to resolve this issue. This is best done with a link to the playbook entry related to this alert, as described in Section 14.2.5.
The last two items may be difficult to write at the time the alert rule is created.
Having the ability to negatively acknowledge (“NAK”) the alert saves time during escalations.
Jeff Ryan
Talk to Chris Kreps about how MIR3 might work with this. It is a good idea.
The only thing we dislike more than pie charts are averages, or the mathematical mean. Averages can be misleading in ways that encourage you to make bad decisions.
That said, averages aren’t inherently misleading and have a place in your statistical toolbox. Like other statistical functions they need to be used wisely.
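
A small illustration of how a mean can mislead, using made-up latency numbers (the data and variable names are invented for this sketch): one slow request pulls the mean far above what almost every user experienced, while the median stays close to the typical value.

```python
import statistics

# Made-up sample: nine fast requests and one very slow one, in milliseconds.
latencies_ms = [20, 21, 22, 23, 24, 25, 26, 27, 28, 2000]

mean = statistics.mean(latencies_ms)      # dominated by the single outlier
median = statistics.median(latencies_ms)  # the middle of the distribution

# How many requests were actually faster than the "average" request?
below_mean = sum(1 for x in latencies_ms if x < mean)
```

Here nine of the ten requests are faster than the mean, so reporting the mean alone would describe an experience that only one request had.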
There are two major objectives of capacity planning. First, we want to prevent service interruptions due to lack of capacity. Second, we want to preserve capital investment by adding only the capacity required at any given time.
Terms to Know
• QPS: Queries per second. Usually how many web hits or API calls received per second.
• Active Users: The number of users who have accessed the service in the specified timeframe.
• MAU: Monthly active users. The number of users who have accessed the service in the last month.
• Engagement: How many times on average an active user performs a particular transaction.
• Primary Resource: The one system-level resource that is the main limiting factor for the service.
• Capacity Limit: The point at which performance starts to degrade rapidly or become unpredictable.
• Core Driver: A factor that ...
Future Resources = Current Usage × (1 + Normal Growth + Planned Growth) + Headroom
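
The formula above can be worked through directly. In this sketch the function and parameter names are hypothetical, growth figures are fractions (0.20 means 20%), and headroom is assumed to be expressed in the same units as current usage.

```python
def future_resources(current_usage, normal_growth, planned_growth, headroom):
    """Future Resources = Current Usage x (1 + Normal Growth + Planned Growth) + Headroom."""
    return current_usage * (1 + normal_growth + planned_growth) + headroom


# Illustrative numbers only: 100 servers in use today, 20% normal growth,
# 10% planned growth from a launch, and 15 servers of headroom.
needed = future_resources(100, 0.20, 0.10, 15)
```

With these inputs the result is about 145 servers: 130 to cover growth plus 15 of headroom.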
Another component in determining how much headroom is needed is the amount of time it takes to have additional resources deployed into production from the moment that someone realizes that additional resources are required.
Core drivers are factors that strongly drive demand for a primary resource.
holidays. A more accurate representation of users may be how many were active in the last 7 or 30 days.
moving average convergence/divergence (MACD) metric. MACD measures the difference between a long-period (e.g., 3 months) and a short-period (e.g., 1 month) moving average.
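
The difference between the two moving averages can be sketched as follows. Two assumptions are made here: simple (unweighted) moving averages are used, whereas MACD as commonly defined in financial charting uses exponential ones, and the sign convention is short-period minus long-period, so a positive value means recent usage is running above the longer trend. The function names are hypothetical.

```python
def moving_average(values, window):
    """Simple mean of the most recent `window` values."""
    if window <= 0 or len(values) < window:
        raise ValueError("need at least `window` values and a positive window")
    return sum(values[-window:]) / window


def macd(values, short_window, long_window):
    # Assumption: short minus long, so positive means growth is accelerating.
    return moving_average(values, short_window) - moving_average(values, long_window)


# Illustrative monthly usage figures, oldest first.
monthly_usage = [100, 110, 120]
signal = macd(monthly_usage, short_window=1, long_window=3)
```

For these figures the 1-month average is 120 and the 3-month average is 110, giving a difference of 10.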
Resource Regression
A resource regression is a calculation of the difference in resource usage between one release or version and another.
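
A resource regression can be computed from per-release measurements, as in the sketch below. The function name, the metric names, and the numbers are all hypothetical; the only idea taken from the text is comparing usage between two releases.

```python
def resource_regression(old, new):
    """Return {resource: (absolute_change, fractional_change)} between two releases."""
    result = {}
    for name in sorted(old.keys() & new.keys()):
        delta = new[name] - old[name]
        # The fractional change is undefined when the old value is zero.
        fraction = delta / old[name] if old[name] else None
        result[name] = (delta, fraction)
    return result


# Illustrative per-request measurements for two releases.
release_1 = {"cpu_ms_per_request": 40.0, "ram_mb": 512}
release_2 = {"cpu_ms_per_request": 50.0, "ram_mb": 512}
changes = resource_regression(release_1, release_2)
```

In this example CPU time per request rose by 10 ms (25%) while memory use was unchanged, so the same hardware would serve fewer requests after the new release.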
How many database calls? A single transaction may touch many services within your system infrastructure, and those service resources must be assessed as well and scaled appropriately along with the transaction server scaling.
Standard capacity planning is sufficient for small sites, sites that grow slowly, and sites with simple needs. It is insufficient for large, rapidly growing sites. They require more advanced techniques.
Advanced capacity planning is based on core drivers, capacity limits of individual resources, and sophisticated data analysis such as correlation, regression analysis, and statistical models for forecasting.
Measurement affects behavior. People change their behavior when they know they are being measured.
People tend to find the shortest path to meeting a goal. This creates unintended side effects.
Setting KPIs is quite possibly the most important thing that a manager does.
It is often said that a manager has two responsibilities: setting priorities and providing the resources to get those priorities done.
A key performance indicator is a type of performance measurement used to evaluate the success of an organization or a particular activity.
KPIs should be directly tied to the organization’s strategy, vision, or mission.
A well-defined KPI follows the SMART criteria: Specific, Measurable, Achievable, Relevant, and Time-phrased.
Fewer than 10 “severity 1” open bugs.
OKRs, which stands for “objectives and key results.”
Step 1: Envision the Ideal
Pause to imagine what the world would be like if this goal was met perfectly.
Step 2: Quantify Distance to the Ideal