Kindle Notes & Highlights
The accuracy and precision of collected data should be monitored; one should monitor how often an attempt to collect a measurement fails.
Monitoring is the primary way we gain visibility into the systems we run.
The goal of monitoring is to detect problems before they turn into outages, not to detect outages. If we simply detect outages, then our operating process has downtime “baked in.”
A measurement is a data point.
Identify the business’s key performance indicators (KPIs) and then determine which metrics can be collected to create those KPIs.
Sensing and Measurement
The sensing and measurement component gathers the measurements. Measurements can be categorized as blackbox or whitebox, depending on the amount of knowledge of the internals that is used.
Blackbox monitoring means that measurements try to emulate a user.
Doing an HTTPS GET of a web site’s main page is an example of blackbox monitoring.
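As a rough illustration of such a probe, the sketch below performs one GET and records whether it succeeded and how long it took. The function name `probe`, the injectable `fetch` parameter, and the rule that any status from 200 to 399 counts as success are assumptions made for this example; they do not come from the book.

```python
import time
import urllib.request


def probe(url, fetch=None, timeout=5.0):
    """Fetch url once, as a user would, and report what happened."""
    if fetch is None:
        # Default fetcher: a plain GET that returns the HTTP status code.
        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                return resp.status

    start = time.monotonic()
    try:
        status = fetch(url)
        ok = 200 <= status < 400  # assumed success rule for this sketch
    except Exception:
        # A failed attempt is still a measurement worth recording.
        status = None
        ok = False
    elapsed = time.monotonic() - start
    return {"url": url, "ok": ok, "status": status, "seconds": elapsed}
```

Passing a stand-in `fetch` function makes it possible to exercise the probe without touching the network, which is useful when testing the monitoring code itself.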
Blackbox testing includes monitoring that attempts to do multi-step processes as a user, such as verifying that the purchasing process is working.
A whitebox measurement has the benefit of internal knowledge because it is a lower level of abstraction. For example, such a measure might monitor raw counters of the number of times a particular API call was made, the number of outstanding items waiting in a queue, or latency information of internal processes.
Direct versus Synthesized Measurements
Some measurements are direct, whereas others are synthesized. For example, if every time a purchase is made the system sends a metric of the total money collected to the monitoring system, this is a direct measurement. Alternatively, if every 5 minutes the monitoring system tallies the total money collected so far, this is a synthesized metric.
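The distinction can be sketched in a few lines. In this hypothetical `PurchaseMetrics` class (the name, methods, and in-memory lists are all invented for illustration), `record_purchase` plays the role of the direct measurement and `tally` plays the role of the periodic, synthesized one.

```python
class PurchaseMetrics:
    def __init__(self):
        self.total_collected = 0.0  # running total held by the service
        self.direct = []            # one sample per purchase
        self.synthesized = []       # one sample per scheduled tally

    def record_purchase(self, amount, now):
        """Direct: the purchase event itself sends the new total."""
        self.total_collected += amount
        self.direct.append((now, self.total_collected))

    def tally(self, now):
        """Synthesized: the monitoring side samples the total on a schedule."""
        self.synthesized.append((now, self.total_collected))


m = PurchaseMetrics()
m.record_purchase(10.0, now=0)
m.record_purchase(5.0, now=120)
m.tally(now=300)  # e.g. the 5-minute tally
```

After this run the direct series has two samples, one per purchase, while the synthesized series has a single sample taken at the 300-second mark.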
Rate versus Capability Monitoring
Event frequency determines what to monitor.
The rate at which purchases happen determines which of these to monitor.
In short, rate metrics are more important when event frequency is high and there are smooth, predictable trends.
When there is low event frequency or an uneven rate, capability metrics are more important.
never collect rates ...
A counter is a measurement that only increases—for example, a count of the number of API calls received by a service or the count of the number of packets transmitted on a network interface.
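Because a counter only increases, a rate can be derived later from any two samples. The sketch below assumes samples are `(timestamp_seconds, counter_value)` pairs and treats a decrease as a counter reset (for example, after a restart); both choices are assumptions of this example rather than something the highlight specifies.

```python
def rate_per_second(prev, curr):
    """Derive a per-second rate from two (timestamp, counter) samples."""
    (t0, c0), (t1, c1) = prev, curr
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if c1 < c0:
        # Counter went backwards: assume it was reset and report no rate.
        return None
    return (c1 - c0) / (t1 - t0)
```

For instance, samples of 100 at time 0 and 700 at time 60 give a rate of 10 events per second.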
Collection
Once we have a measurement, we must transmit it to the storage system.
• Escalation Chain: The escalation chain is who to contact, and who to contact if that person does not respond. Generally, one or two chains are defined for each service or group of services.
• Suggested Resolution: Concise instructions of what to do to resolve this issue. This is best done with a link to the playbook entry related to this alert, as described in Section 14.2.5.
The last two items may be difficult to write at the time the alert rule is created.
Having the ability to negatively acknowledge (“NAK”) the alert saves time during escalations.
The only thing we dislike more than pie charts are averages, or the mathematical mean. Averages can be misleading in ways that encourage you to make bad decisions.
That said, averages aren’t inherently misleading and have a place in your statistical toolbox. Like other statistical functions they need to be used wisely.
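A small made-up data set shows how a mean can mislead. The latency values below are invented for illustration: nine requests are fast and one is very slow.

```python
import statistics

# Hypothetical request latencies in milliseconds (invented data).
latencies_ms = [20, 21, 22, 22, 23, 24, 25, 26, 27, 2000]

mean = statistics.mean(latencies_ms)      # 221
median = statistics.median(latencies_ms)  # 23.5

# How many requests were actually faster than the "average" request?
below_mean = sum(1 for x in latencies_ms if x < mean)  # 9
```

The mean of 221 ms describes none of the requests well: nine out of ten finished in under 30 ms, and the single outlier took 2000 ms. The median is closer to what a typical user experienced, which is why it helps to look at more than one statistic.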
There are two major objectives of capacity planning. First, we want to prevent service interruptions due to lack of capacity. Second, we want to preserve capital investment by adding only the capacity required at any given time.
Terms to Know
QPS: Queries per second. Usually how many web hits or API calls received per second.
Active Users: The number of users who have accessed the service in the specified timeframe.
MAU: Monthly active users. The number of users who have accessed the service in the last month.
Engagement: How many times on average an active user performs a particular transaction.
Primary Resource: The one system-level resource that is the main limiting factor for the service.
Capacity Limit: The point at which performance starts to degrade rapidly or become unpredictable.
Core Driver: A factor that ...
Future Resources = Current Usage × (1 + Normal Growth + Planned Growth) + Headroom
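The formula translates directly into code. In this sketch the growth terms are assumed to be fractions of current usage (0.25 meaning 25%) and headroom is assumed to be expressed in the same units as usage; the highlight does not state the units, so treat both as assumptions.

```python
def future_resources(current_usage, normal_growth, planned_growth, headroom):
    """Future Resources = Current Usage x (1 + Normal + Planned) + Headroom."""
    return current_usage * (1 + normal_growth + planned_growth) + headroom


# Hypothetical numbers: 100 units in use today, 25% normal growth,
# 25% planned growth, and 30 units of headroom.
needed = future_resources(100, 0.25, 0.25, 30)  # 180.0
```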
Another component in determining how much headroom is needed is the amount of time it takes to have additional resources deployed into production from the moment that someone realizes that additional resources are required.
Core drivers are factors that strongly drive demand for a primary resource.
holidays. A more accurate representation of users may be how many were active in the last 7 or 30 days.
moving average convergence/divergence (MACD) metric. MACD measures the difference between a long-period (e.g., 3 months) and a short-period (e.g., 1 month) moving average.
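A minimal sketch of that calculation follows. It uses simple moving averages over monthly samples and reports the short-period average minus the long-period average. Both choices are assumptions for this example: the highlight does not say which way the difference is taken, and the MACD indicator as commonly defined in finance uses exponential moving averages instead.

```python
def moving_average(values, window):
    """Simple moving average of the last `window` values."""
    if window <= 0 or len(values) < window:
        raise ValueError("need at least `window` values")
    return sum(values[-window:]) / window


def macd(values, short_window, long_window):
    """Short-period average minus long-period average (sign is assumed)."""
    return moving_average(values, short_window) - moving_average(values, long_window)


# Hypothetical monthly usage figures, oldest first.
monthly_usage = [100, 100, 100, 130]
signal = macd(monthly_usage, short_window=1, long_window=3)  # 20.0
```

With this sign convention, a positive value means recent usage is running above the longer-term trend.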
Resource Regression
A resource regression is a calculation of the difference in resource usage between one release or version and another.
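In its simplest form, a resource regression is a per-resource subtraction. The sketch below assumes each release's usage has already been measured and stored as a dictionary; the resource names and numbers are hypothetical.

```python
def resource_regression(old_usage, new_usage):
    """Change in each resource from the old release to the new one."""
    return {
        name: new_usage[name] - old_usage[name]
        for name in old_usage
        if name in new_usage  # compare only resources measured in both
    }


# Hypothetical per-request measurements for two releases.
old_release = {"cpu_ms_per_request": 40, "db_calls_per_request": 3}
new_release = {"cpu_ms_per_request": 46, "db_calls_per_request": 3}

delta = resource_regression(old_release, new_release)
# {"cpu_ms_per_request": 6, "db_calls_per_request": 0}
```

A positive value means the new release uses more of that resource per request, which feeds into how much extra capacity the rollout needs.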
How many database calls? A single transaction may touch many services within your system infrastructure, and those service resources must be assessed as well and scaled appropriately along with the transaction server scaling.
Standard capacity planning is sufficient for small sites, sites that grow slowly, and sites with simple needs. It is insufficient for large, rapidly growing sites. They require more advanced techniques.
Advanced capacity planning is based on core drivers, capacity limits of individual resources, and sophisticated data analysis such as correlation, regression analysis, and statistical models for forecasting.
Measurement affects behavior. People change their behavior when they know they are being measured.
People tend to find the shortest path to meeting a goal. This creates unintended side effects.
Setting KPIs is quite possibly the most important thing that a manager does.
It is often said that a manager has two responsibilities: setting priorities and providing the resources to get those priorities done.
A key performance indicator is a type of performance measurement used to evaluate the success of an organization or a particular activity.
KPIs should be directly tied to the organization’s strategy, vision, or mission.
A well-defined KPI follows the SMART criteria: Specific, Measurable, Achievable, Relevant, and Time-phrased.
Fewer than 10 “severity 1” open bugs.
OKRs, which stands for “objectives and key results.”
Step 1: Envision the Ideal
Pause to imagine what the world would be like if this goal was met perfectly.
Step 2: Quantify Distance to the Ideal