Kindle Notes & Highlights
A replicated state machine (RSM) is a system that executes the same set of operations, in the same order, on several processes.
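A minimal sketch of the idea (class and operation names invented for illustration, not from the book): every replica applies the same deterministic operations in the same order, so all replicas end up in the same state.

```python
# Minimal sketch of a replicated state machine: every replica applies the
# same deterministic operations in the same agreed order, so all replicas
# converge to the same state. In practice the order comes from a consensus
# protocol; here it is just a shared list.

class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}

    def apply(self, op):
        """Apply one deterministic operation to local state."""
        kind, key, value = op
        if kind == "set":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)


# The agreed-upon operation log (in a real RSM, produced by consensus).
log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]

replicas = [Replica("a"), Replica("b"), Replica("c")]
for op in log:
    for r in replicas:
        r.apply(op)

assert all(r.state == {"y": 2} for r in replicas)
```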
Timestamps are highly problematic in distributed systems because it’s impossible to guarantee that clocks are synchronized across multiple machines. Spanner [Cor12] addresses this problem by modeling the worst-case uncertainty involved and slowing down processing where necessary to resolve that uncertainty.
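A hedged sketch of the commit-wait idea, not Spanner's actual API: treat the clock as an uncertainty interval and refuse to acknowledge a commit until its timestamp is unambiguously in the past on every machine.

```python
import time

# Illustrative commit-wait: model clock uncertainty as an interval
# [now - epsilon, now + epsilon], pick a commit timestamp at the top of the
# interval, and deliberately slow down -- don't acknowledge the commit until
# that timestamp is guaranteed to be in the past everywhere.
EPSILON_SECONDS = 0.007  # assumed worst-case clock uncertainty (illustrative)

def now_interval():
    t = time.time()
    return (t - EPSILON_SECONDS, t + EPSILON_SECONDS)

def commit(apply_writes):
    commit_ts = now_interval()[1]          # latest possible "true" time right now
    apply_writes(commit_ts)
    while now_interval()[0] < commit_ts:   # commit wait resolves the uncertainty
        time.sleep(0.001)
    return commit_ts

commit(lambda ts: None)  # toy usage: the writes themselves are a no-op here
```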
A barrier in a distributed computation is a primitive that blocks a group of processes from proceeding until some condition is met (for example, until all parts of one phase of a computation are completed). Use of a barrier effectively splits a distributed computation into logical phases.
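Within a single process, Python's threading.Barrier shows the same blocking semantics; a distributed barrier would normally be built on a coordination service, but the shape of the code is similar (the worker logic below is invented for illustration).

```python
import threading

# All workers must finish phase 1 before any of them starts phase 2.
NUM_WORKERS = 4
phase_barrier = threading.Barrier(NUM_WORKERS)

def worker(worker_id):
    print(f"worker {worker_id}: phase 1 done")
    phase_barrier.wait()          # block until every worker reaches this point
    print(f"worker {worker_id}: starting phase 2")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```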
Distributed locks can be used to prevent multiple workers from processing the same input file. In practice, it is essential to use renewable leases with timeouts instead of indefinite locks, because doing so prevents locks from being held indefinitely by processes that crash.
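A toy, in-process sketch of a renewable lease, standing in for a real lock service such as Chubby, ZooKeeper, or etcd; the only point it illustrates is that leases expire unless renewed, so a crashed holder cannot block others forever. All names are invented.

```python
import threading
import time

# Toy in-process lease registry standing in for a distributed lock service.
class LeaseRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._leases = {}  # resource -> (owner, expiry_time)

    def acquire(self, resource, owner, ttl):
        with self._lock:
            current = self._leases.get(resource)
            if current and current[1] > time.time() and current[0] != owner:
                return False           # someone else holds an unexpired lease
            self._leases[resource] = (owner, time.time() + ttl)
            return True

    def renew(self, resource, owner, ttl):
        # Renewal is just re-acquisition by the current owner.
        return self.acquire(resource, owner, ttl)


registry = LeaseRegistry()
if registry.acquire("input-file-123", owner="worker-a", ttl=10.0):
    # Process the file, renewing well before the TTL elapses; if worker-a
    # crashes, the lease expires and another worker can acquire it.
    registry.renew("input-file-123", owner="worker-a", ttl=10.0)
```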
but read operations may be served from any replica, because stale data results in extra work being performed but not incorrect results
Quorum leases are particularly useful for read-heavy workloads in which reads for particular subsets of the data are concentrated in a single geographic region.
There are two failure domains that you can never escape: the software itself, and human error on the part of the system’s administrators.
Base services for which outages have wide impact (such as cron) should have very few dependencies.
The most important state we keep in Paxos is information regarding which cron jobs are launched. We synchronously inform a quorum of replicas of the beginning and end of each scheduled launch for each cron job.
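A rough sketch of that bookkeeping, with a trivial in-memory stand-in for the Paxos-backed log; in the real system, append would return only after a quorum of replicas accepted the entry.

```python
class QuorumLog:
    """Stand-in for a Paxos-backed log; append() would block on a quorum."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)


def run_scheduled_launch(log, job_name, launch_time, start_job):
    entry_id = (job_name, launch_time)
    log.append(("launch started", entry_id))   # synchronously recorded first
    start_job(job_name)
    log.append(("launch finished", entry_id))  # then the completion


log = QuorumLog()
run_scheduled_launch(log, "daily-backup", "2024-01-01T03:00", lambda job: None)
```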
Snapshots are in fact our most critical state — if we lose our snapshots, we essentially have to start from zero again because we’ve lost our internal state. Losing logs, on the other hand, just causes a bounded loss of state and sends the cron system back in time to the point when the latest snapshot was taken.
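An illustrative recovery routine (hypothetical structures, not the actual storage format) makes the asymmetry concrete: losing logs loses only the delta since the last snapshot, while losing snapshots loses the baseline itself.

```python
# State is rebuilt from the latest snapshot plus the log entries recorded
# after it. Losing the logs only loses the delta since that snapshot.

def recover_state(latest_snapshot, log_entries_after_snapshot):
    state = dict(latest_snapshot)              # baseline state from the snapshot
    for job, event in log_entries_after_snapshot:   # replay the bounded delta
        state[job] = event
    return state


snapshot = {"backup-job": "launch finished"}
log = [("report-job", "launch started"), ("report-job", "launch finished")]
print(recover_state(snapshot, log))
print(recover_state(snapshot, []))  # logs lost: back in time to the snapshot
```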
The core of the distributed cron implementation is therefore Paxos, a commonplace algorithm to reach consensus in an unreliable environment.
In light of the risk trade-offs, running a well-tuned periodic pipeline successfully is a delicate balance between high resource cost and risk of preemptions.
if retry logic is not implemented, correctness problems can result: work is dropped upon failure and the job is never retried. If retry logic is present but naive or poorly implemented, retrying upon failure can compound the problem.
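A sketch of retry logic that avoids both failure modes, assuming hypothetical process() and is_retryable() hooks supplied by the pipeline: failures are surfaced rather than silently dropped, and retries back off with jitter so they do not compound an overload.

```python
import random
import time

def run_with_retries(task, process, is_retryable, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return process(task)
        except Exception as error:
            if not is_retryable(error) or attempt == max_attempts:
                raise          # surface the failure; don't drop the work silently
            # Capped exponential backoff with jitter, so retries don't hammer
            # an already-struggling dependency.
            backoff = min(60.0, (2 ** attempt) + random.uniform(0, 1))
            time.sleep(backoff)
```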
the double correctness guarantee holds: the output files are always unique, and the pipeline state is always correct by virtue of tasks with leases.
Workflow also versions all tasks.
in order to commit work, a worker must own an active lease and reference the task ID number of the configuration it used to produce its result. If the configuration changed while the work unit was in flight, all workers of that type will be unable to commit despite owning current leases. Thus, all work performed after a configuration change is consistent with the new configuration,
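A sketch of that commit-time check (illustrative structures, not Workflow’s actual schema): a result is accepted only if the worker still holds an active lease and the result references the configuration version that is still current.

```python
import time

def try_commit(task_store, work_unit, worker_id, result, result_config_version):
    task = task_store[work_unit]
    lease_valid = (task["lease_owner"] == worker_id
                   and task["lease_expiry"] > time.time())
    config_current = (result_config_version == task["config_version"])
    if lease_valid and config_current:
        task["result"] = result
        return True
    return False    # stale lease or stale configuration: reject the commit


tasks = {"shard-7": {"lease_owner": "worker-3",
                     "lease_expiry": time.time() + 30,
                     "config_version": 12,
                     "result": None}}
assert try_commit(tasks, "shard-7", "worker-3", "output-path", 12)
assert not try_commit(tasks, "shard-7", "worker-3", "other-output", 11)
```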
Most cloud computing applications seek to optimize for some combination of uptime, latency, scale, velocity, and privacy.
Uptime: Also referred to as availability, the proportion of time a service is usable by its users.
Latency: How responsive a service appears to its users.
Scale: A service’s volume of users and the mixture of workloads the service can handle before latency suffers or the service falls apart.
Velocity: How fast a service can innovate to provide users with superior value at reasonable cost.
Privacy: This concept imposes complex requirements. As a simplification, this chapter limits its scope in discussing privacy to data deletion: data must be destroyed within a reasonable time after users delete it.
No one really wants to make backups; what people really want are restores.
Archives safekeep data for long periods of time to meet auditing, discovery, and compliance needs.
Thus, we see that diversity is key: protecting against a failure at layer X requires storing data on diverse components at that layer. Media isolation protects against media flaws: a bug or attack in a disk device driver is unlikely to affect tape drives. If we could, we’d make backup copies of our valuable data on clay tablets.
The first layer is soft deletion (or “lazy deletion” in the case of developer API offerings), which has proven to be an effective defense against inadvertent data deletion scenarios. The second line of defense is backups and their related recovery methods. The third and final layer is regular data validation, covered in “Third Layer: Early Detection”
Even the best armor is useless if you don’t put it on.
The factors supporting successful recovery should drive your backup decisions, not the other way around.
establish “trust points” in your data — portions of your stored data that are verified after being rendered immutable, usually by the passage of time.
Between distributing the load horizontally and restricting the work to vertical slices of the data demarcated by time, we can reduce those eight decades of wall time by several orders of magnitude, rendering our restores relevant.
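A back-of-the-envelope illustration with made-up numbers: sharding the restore horizontally and restricting it to the affected time slice cuts the serial wall time by orders of magnitude.

```python
# All figures below are invented for illustration only.
serial_days = 80 * 365                 # "eight decades" of serial wall time, in days
shards = 5000                          # horizontal sharding across restore tasks
fraction_of_data_affected = 0.01       # vertical slice: only recent data needs restoring

parallel_days = serial_days * fraction_of_data_affected / shards
print(f"serial: {serial_days} days, sharded + sliced: {parallel_days:.3f} days")
```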
Similar to the effects enjoyed when unit tests are introduced early in the project lifecycle, a data validation pipeline results in an overall acceleration of software development projects.
Validation job management
Monitoring, alerts, and dashboards
Rate-limiting features
Troubleshooting tools
Production playbooks
Data validation APIs that make validators easy to add and refactor
Continuously test the recovery process as part of your normal operations.
Set up alerts that fire when a recovery process fails to provide a heartbeat indication of its success.
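A sketch of the heartbeat-based alerting in the second point, with invented names and thresholds: the recovery test records a timestamp on success, and the alert fires when that heartbeat goes stale, so a job that silently stops running is caught just like one that fails loudly.

```python
import time

HEARTBEAT_MAX_AGE = 26 * 3600   # expect at least one successful restore per day

def record_success(heartbeats, job_name):
    heartbeats[job_name] = time.time()

def check_heartbeat(heartbeats, job_name):
    last = heartbeats.get(job_name, 0)
    if time.time() - last > HEARTBEAT_MAX_AGE:
        return f"ALERT: recovery test '{job_name}' has no recent successful run"
    return None


heartbeats = {}
record_success(heartbeats, "nightly-restore-test")
print(check_heartbeat(heartbeats, "nightly-restore-test"))  # None: heartbeat is fresh
```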
While a project like NORAD Tracks Santa may seem whimsical, it had all the characteristics that define a difficult and risky launch: a hard deadline (Google couldn’t ask Santa to come a week later if the site wasn’t ready), a lot of publicity, an audience of millions, and a very steep traffic ramp-up (everybody was going to be watching the site on Christmas Eve).
Runaway success is usually the most welcome cause of overload when a new service launches, but there are myriad other causes, including load balancing failures, machine outages, synchronized client behavior, and external attacks.
Provide product development with a platform of SRE-validated infrastructure, upon which they can build their systems.
When SRE can’t provide full-fledged support, it provides other options for making improvements to production, such as documentation and consultation.
SREs should essentially work as a part of the development team, rather than an external unit.
Applicable services often have the following characteristics:
The service implements significant new functionality and will be part of an existing system already managed by SRE.
The service is a significant rewrite or alternative to an existing system, targeting the same use cases.
The development team sought SRE advice or approached SRE for takeover upon launch.
a “dark launch” setup, in which part of the traffic from existing users is sent to the new service in addition to being sent to the live production service.
This is a positive outcome, because the service has been engineered to be reliable and low maintenance, and can therefore remain with the development team.
Codified best practices: The ability to commit what works well in production to code, so services can simply use this code and become “production ready” by design.
Reusable solutions: Common and easily shareable implementations of techniques used to mitigate scalability and reliability issues.
A common production platform with a common control surface: Uniform sets of interfaces to production facilities, uniform sets of operational controls, and uniform monitoring, logging, and configuration for all services.
Easier automation and smarter systems: A common control surface that enables automation
enable product development teams to design applications using the framework solution that was built and blessed by SRE, as opposed to either retrofitting the application to SRE specifications after the fact, or retrofitting more SREs to support a service that was markedly different
A production platform with a common service structure, conventions, and software infrastructure made it possible for an SRE team to provide support for the “platform” infrastructure, while the development teams provide on-call support for functional issues with the service
“Hope is not a strategy.” This rallying cry of the SRE team at Google sums up what we mean by preparedness and disaster testing.
Operations on a submarine are ruled by a trusted human decision chain — a series of people, rather than one individual.
An SRE team should be as compact as possible and operate at a high level of abstraction, relying upon lots of backup systems as failsafes and thoughtful APIs to communicate with the systems. At the same time, the SRE team should also have comprehensive knowledge of the systems — how they operate, how they fail, and how to respond to failures — that comes from operating them day-to-day.
Incorrect data: Validate both syntax and, if possible, semantics. Watch for empty data and partial or truncated data (e.g., alert if the configuration is N% smaller than the previous version).
Delayed data: This may invalidate current data due to timeouts. Alert well before the data is expected to expire.
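A sketch combining the checks above, with invented thresholds and field names: parse for syntax, check a required key as light semantics, flag a suspicious shrink relative to the previous version, and alert well before the data expires.

```python
import json
import time

SHRINK_ALERT_FRACTION = 0.10        # "N% smaller than the previous version"
EXPIRY_HEADROOM_SECONDS = 6 * 3600  # alert well before expiry

def validate_config(new_blob, previous_blob, expires_at, required_keys=("jobs",)):
    problems = []
    try:
        config = json.loads(new_blob)                 # syntax check
    except ValueError:
        return ["config does not parse"]
    for key in required_keys:                         # light semantic check
        if key not in config:
            problems.append(f"missing required key: {key}")
    if previous_blob and len(new_blob) < (1 - SHRINK_ALERT_FRACTION) * len(previous_blob):
        problems.append("config shrank suspiciously vs. previous version")
    if expires_at - time.time() < EXPIRY_HEADROOM_SECONDS:
        problems.append("data is close to expiry; upstream may be delayed")
    return problems


print(validate_config('{"jobs": []}', '{"jobs": [1, 2, 3, 4]}', time.time() + 86400))
```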
Putting alerts into email and hoping that someone will read all of them and notice the important ones is the moral equivalent of piping them to /dev/null: they will eventually be ignored.
Overloads and Failure: Services should produce reasonable but suboptimal results if overloaded.