Site Reliability Engineering: How Google Runs Production Systems
41%
Consistent hashing [Kar97] describes a mapping algorithm that remains relatively stable even when new backends are added to or removed from the list. This approach minimizes the disruption to existing connections when the pool of backends changes.
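As a rough illustration of the idea (not the implementation the book refers to), here is a minimal consistent-hash ring in Python, assuming MD5 and 100 virtual nodes per backend:

    # Minimal consistent-hash ring: each backend owns many points ("virtual
    # nodes") on a ring; a key is served by the first point clockwise from it.
    import bisect
    import hashlib

    class HashRing:
        def __init__(self, backends, vnodes=100):
            self.vnodes = vnodes
            self.ring = []                      # sorted list of (hash, backend)
            for backend in backends:
                self.add(backend)

        def _hash(self, key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def add(self, backend):
            for i in range(self.vnodes):
                bisect.insort(self.ring, (self._hash(f"{backend}#{i}"), backend))

        def remove(self, backend):
            self.ring = [(h, b) for (h, b) in self.ring if b != backend]

        def lookup(self, key):
            i = bisect.bisect(self.ring, (self._hash(key), ""))
            return self.ring[i % len(self.ring)][1]

    ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    before = {k: ring.lookup(k) for k in map(str, range(1000))}
    ring.add("10.0.0.4")
    moved = sum(before[k] != ring.lookup(k) for k in before)
    print(moved)   # only roughly a quarter of the keys move to the new backend

Adding or removing one backend reassigns only the keys whose ring positions it owned (about 1/N of them), which is what keeps most existing connections mapped to the same backend.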
41%
By changing the destination MAC address of a forwarded packet, the balancer can leave all the information in upper layers intact, so the backend receives the original source and destination IP addresses. The backend can then send a reply directly to the original sender — a technique known as Direct Server Response (DSR). If user requests are small and replies are large (e.g., most HTTP requests), DSR provides tremendous savings, because only a small fraction of traffic need traverse the load balancer.
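A rough sense of the savings, with assumed sizes: for an 800-byte request that produces an 80 KB reply, only the inbound 800 bytes (about 1% of the bytes exchanged) pass through the balancer, while the other ~99% flow directly from the backend to the client.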
41%
Our current VIP load balancing solution [Eis16] uses packet encapsulation. A network load balancer puts the forwarded packet into another IP packet with Generic Routing Encapsulation (GRE) [Han94], and uses a backend’s address as the destination. A backend receiving the packet strips off the outer IP+GRE layer and processes the inner IP packet as if it were delivered directly to its network interface. The network load balancer and the backend no longer need to exist in the same broadcast domain; they can even be on separate continents as long as a route between the two exists. Packet …
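A sketch of that layering with scapy, assuming the library is available; the addresses are placeholders, not a description of Google's production path:

    # Sketch of GRE encapsulation: the balancer wraps the client's original
    # packet in an outer IP+GRE header addressed to the chosen backend.
    from scapy.all import IP, GRE, TCP

    client_ip, vip, backend_ip = "198.51.100.7", "203.0.113.10", "10.0.0.42"

    inner = IP(src=client_ip, dst=vip) / TCP(sport=34567, dport=443)
    outer = IP(src="10.0.0.1", dst=backend_ip) / GRE() / inner

    # The backend strips the outer IP+GRE layer and processes `inner` as if
    # it had arrived directly, original source and destination intact.
    outer.show()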
41%
Unfortunately, encapsulation also comes with a price: inflated packet size. Encapsulation introduces overhead (24 bytes in the case of IPv4+GRE, to be precise), which can cause the packet to exceed the available Maximum Transmission Unit (MTU) size and require fragmentation.
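The 24 bytes are the 20-byte outer IPv4 header plus the 4-byte GRE header, so a packet built for a standard 1,500-byte Ethernet MTU becomes 1,524 bytes once encapsulated; keeping the inner packet at or below 1,476 bytes, or raising the MTU along the path, is what avoids fragmentation.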
42%
An alternative model would be to establish and tear down a connection for each request, but this model has significant resource and latency costs. In the corner case of a connection that remains idle for a long time, our RPC implementation has an optimization that switches the connection to a cheap “inactive” mode where, for example, the frequency of health checks is reduced and the underlying TCP connection is dropped in favor of UDP.
43%
To keep interfaces (and their implementations) simple, services are often defined to allow the most expensive requests to consume 100, 1,000, or even 10,000 times more resources than the cheapest requests.
44%
All that said, we’ve learned (the hard way!) about one very dangerous pitfall of the Least-Loaded Round Robin approach: if a task is seriously unhealthy, it might start serving 100% errors. Depending on the nature of those errors, they may have very low latency; it’s frequently significantly faster to just return an “I’m unhealthy!” error than to actually process a request. As a result, clients might start sending a very large amount of traffic to the unhealthy task, erroneously thinking that the task is available, as opposed to fast-failing them! We say that the unhealthy task is now …
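A toy simulation of this failure mode, assuming a least-loaded policy that counts only in-flight requests; the latencies and request rate are made up:

    # Toy model: least-loaded picks the backend with the fewest in-flight
    # requests. A backend that fails instantly keeps its count near zero and
    # therefore attracts ("sinkholes") most of the traffic.
    import heapq
    import random

    LATENCY = {"healthy": 0.100, "failing": 0.001}    # seconds per request

    def simulate(requests=10_000, rate=200.0):
        in_flight = {b: 0 for b in LATENCY}
        served = {b: 0 for b in LATENCY}
        completions, now = [], 0.0                    # (finish_time, backend)
        for _ in range(requests):
            now += random.expovariate(rate)           # next arrival
            while completions and completions[0][0] <= now:
                in_flight[heapq.heappop(completions)[1]] -= 1
            target = min(LATENCY, key=lambda b: in_flight[b])
            in_flight[target] += 1
            served[target] += 1
            heapq.heappush(completions, (now + LATENCY[target], target))
        return served

    print(simulate())   # the failing backend receives the vast majority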
44%
…spend most of their life just waiting for the network to respond. In this case, because blocking on I/O often consumes zero CPU, very little RAM, and no bandwidth, we’d still want to send twice as many requests to the faster backend. However, Least-Loaded Round Robin will consider both backend tasks equally loaded.
44%
Each client task has only a very limited view into the state of its backend tasks: the view of its own requests.
46%
Even if a backend is only slightly overloaded, a client request is often better served if the backend rejects retried and new requests equally and quickly.
46%
We implement a per-request retry budget of up to three attempts.
46%
Layering on the per-client retry budget (a 10% retry ratio) reduces the growth to just 1.1x in the general case.
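A minimal sketch of how the two budgets could combine on the client side; the 3-attempt cap and 10% ratio come from the text, while the names and the lack of a sliding window are simplifications:

    class RetriableError(Exception):
        pass

    class RetryBudget:
        """Retry only while this request has attempts left AND retries stay
        under 10% of all requests this client has sent."""
        def __init__(self, max_attempts=3, retry_ratio=0.10):
            self.max_attempts = max_attempts
            self.retry_ratio = retry_ratio
            self.requests = 0
            self.retries = 0

        def call(self, send):
            # `send` performs one attempt and raises RetriableError on failure.
            for attempt in range(self.max_attempts):
                self.requests += 1
                try:
                    return send()
                except RetriableError:
                    out_of_attempts = attempt == self.max_attempts - 1
                    over_budget = self.retries >= self.retry_ratio * self.requests
                    if out_of_attempts or over_budget:
                        raise                       # let the failure bubble up
                    self.retries += 1

Capping the client-wide retry ratio is what bounds the worst-case amplification to roughly 1.1x even when every request is failing.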
46%
If multiple layers retried, we’d have a combinatorial explosion.
46%
Therefore, instead of “batch client → backend,” you have “batch client → batch proxy → backend.” In this case, when the very large job starts, only the batch proxy job suffers, shielding the actual backends (and higher-priority clients). Effectively, the batch proxy acts like a fuse. Another advantage of using the proxy is that it typically reduces the number of connections against the backend, which can improve the load balancing against the backend
46%
It’s a common mistake to assume that an overloaded backend should turn down and stop accepting all traffic. However, this assumption runs counter to the goal of robust load balancing. We actually want the backend to continue accepting as much traffic as possible, but to only accept that load as capacity frees up.
46%
A cascading failure is a failure that grows over time as a result of positive feedback.
47%
The most common cause of cascading failures is overload.
47%
Running out of a resource can result in higher latency, elevated error rates, or the substitution of lower-quality results.
47%
Because requests take longer to handle, more requests are handled concurrently (up to a possible maximum capacity at which queuing may occur). This affects almost all resources, including memory, number of active threads (in a thread-per-request server model), number of file descriptors, and backend resources (which in turn can have other effects).
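One way to see the multiplier, with assumed numbers: in-flight requests ≈ arrival rate × latency, so a server receiving 1,000 requests per second at 100 ms average latency holds about 100 requests in flight, while the same 1,000 requests per second at 1 second of latency holds about 1,000 in flight, along with the threads, memory, and file descriptors backing them.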
47%
Excessively long queue lengths: If there is insufficient capacity to handle all the requests at steady state, the server will saturate its queues.
47%
Thread starvation: When a thread can’t make progress because it’s waiting for a lock …
47%
A task might be evicted by the container manager (VM or otherwise) for exceeding available resource limits.
47%
A vicious cycle can occur in this scenario: less CPU is available, resulting in slower requests, resulting in increased RAM usage, resulting in more GC, resulting in even lower availability of CPU. This is known colloquially as the “GC death spiral.”
47%
Reduction in available RAM can reduce application-level cache hit rates …
47%
Serve degraded results: Serve lower-quality, cheaper-to-compute results to the user.
47%
Servers should protect themselves from becoming overloaded and crashing. When overloaded at either the frontend or backend layers, fail early and cheaply.
47%
Rate limiting can be implemented in a number of places:
- At the reverse proxies, by limiting the volume of requests by criteria such as IP address to mitigate attempted denial-of-service attacks and abusive clients.
- At the load balancers, …
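As a rough sketch of the first of those options, a per-client-IP token bucket; the rates and structure are assumptions, not a description of any particular proxy:

    # Sketch: each client IP may make `rate` requests/second, with bursts up
    # to `burst`; callers reject (e.g. HTTP 429) when allow() returns False.
    import time
    from collections import defaultdict

    class PerIPRateLimiter:
        def __init__(self, rate=5.0, burst=10.0):
            self.rate = rate
            self.burst = burst
            self.tokens = defaultdict(lambda: burst)   # ip -> available tokens
            self.last = defaultdict(time.monotonic)    # ip -> last refill time

        def allow(self, ip):
            now = time.monotonic()
            elapsed = now - self.last[ip]
            self.last[ip] = now
            self.tokens[ip] = min(self.burst, self.tokens[ip] + elapsed * self.rate)
            if self.tokens[ip] >= 1.0:
                self.tokens[ip] -= 1.0
                return True
            return False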
48%
If the request rate and latency of a given task are constant, there is no reason to queue requests: a constant number of threads should be occupied.
48%
For a system with fairly steady traffic over time, it is usually better to have small queue lengths relative to the thread pool size (e.g., 50% or less), which results in the server rejecting requests early when it can’t sustain the rate of incoming requests.
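A minimal sketch of that shape, assuming a thread-per-request server whose queue is capped at half the thread count; the numbers and the handler are placeholders:

    # Sketch: a small bounded queue in front of a worker pool. When the queue
    # is full the request is rejected immediately instead of piling up.
    import queue
    import threading
    import time

    NUM_THREADS = 20
    work_queue = queue.Queue(maxsize=NUM_THREADS // 2)   # 50% of the pool size

    def handle(request):
        time.sleep(0.05)                 # stand-in for real work

    def worker():
        while True:
            handle(work_queue.get())
            work_queue.task_done()

    def submit(request):
        try:
            work_queue.put_nowait(request)
            return True
        except queue.Full:
            return False                 # reject early: reply "overloaded"

    for _ in range(NUM_THREADS):
        threading.Thread(target=worker, daemon=True).start()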
48%
On the other end of the spectrum, systems with “bursty” load for which traffic patterns fluctuate drastically may do better with a queue size based on the current number of threads in use, processing time for each request, and the size and frequency of bursts.
48%
Keep the system simple and understandable, particularly if it isn’t used often.
48%
You can make sure that graceful degradation stays working by regularly running a small subset of servers near overload in order to exercise this code path.
48%
Always use randomized exponential backoff when scheduling retries. See also “Exponential Backoff and Jitter” in the AWS Architecture Blog.
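A small sketch of such a policy in the "full jitter" style that post describes; the base delay and cap are assumptions:

    # Randomized ("full jitter") exponential backoff for scheduling retries.
    import random

    def backoff_with_jitter(attempt, base=0.05, cap=10.0):
        """Seconds to wait before retry number `attempt` (0-based)."""
        return random.uniform(0, min(cap, base * (2 ** attempt)))

    for attempt in range(4):
        print(attempt, round(backoff_with_jitter(attempt), 3))

The randomization is what prevents a crowd of clients from retrying in lockstep after a shared failure.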
48%
Limit retries per request. Don’t retry a given request indefinitely. Consider having a server-wide retry budget. For example, only allow 60 retries per minute in a process …
48%
A single request at the highest layer may produce a number of attempts at the lowest layer as large as the product of the attempts made at each layer.
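For example, with assumed numbers: in a four-layer stack where each of the three calling layers makes up to 3 attempts, one user request can turn into as many as 3 × 3 × 3 = 27 attempts against the lowest backend.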
48%
When a frontend sends an RPC to a backend server, the frontend consumes resources waiting for a reply. RPC deadlines define how long a request can wait before the frontend gives up, limiting the time that the backend may consume the frontend’s resources.
49%
Suppose an RPC has a 10-second deadline, as set by the client. The server is very overloaded, and as a result, it takes 11 seconds to move from a queue to a thread pool. At this point, the client has already given up on the request.
49%
Rather than inventing a deadline when sending RPCs to backends, servers should employ deadline propagation. With deadline propagation, a deadline is set high in the stack (e.g., in the frontend).
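A minimal sketch of the mechanism, assuming the deadline travels as an absolute timestamp and each layer checks the remaining budget (including after waiting in a queue, which addresses the 10-second/11-second scenario above):

    # Sketch: set one absolute deadline at the top and hand each lower layer
    # only the budget that is left, giving up as soon as it reaches zero.
    import time

    class DeadlineExceeded(Exception):
        pass

    def time_left(deadline):
        budget = deadline - time.monotonic()
        if budget <= 0:
            raise DeadlineExceeded()
        return budget

    def frontend(timeout=10.0):
        deadline = time.monotonic() + timeout   # set once, high in the stack
        return middle_layer(deadline)

    def middle_layer(deadline):
        time_left(deadline)                     # e.g. re-check after queueing
        return backend(deadline)

    def backend(deadline):
        rpc_timeout = time_left(deadline)       # leftover budget, not a guess
        return do_work_within(rpc_timeout)

    def do_work_within(timeout):
        return "ok"                             # placeholder for real work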
49%
This problem can be avoided if the requests that don’t complete return with an error early, rather than waiting the full deadline. For example, if a backend is unavailable, it’s usually best to immediately return an error for that backend, rather than consuming resources until the backend is available.
49%
Having deadlines several orders of magnitude longer than the mean request latency is usually bad.
49%
It’s usually better to avoid intra-layer communication …
50%
Because of caching effects, gradually ramping up load may yield different results than immediately increasing to expected load levels. Therefore, consider testing both gradual and impulse load patterns.
50%
If a component enters a degraded mode on heavy load, is it capable of exiting the degraded mode without human intervention? If a couple of servers crash under heavy load, how much does the load need to drop in order for the system to stabilize?
50%
Keep in mind that individual components may have different breaking points, so load test each component separately.
50%
Process health checking (“is this binary responding at all?”) and service health checking (“is this binary able to respond to this class of requests right now?”) are two conceptually distinct operations. Process health checking is relevant to the cluster scheduler, whereas service health checking is relevant to the load balancer.
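A small sketch of the two checks as separate HTTP endpoints; the paths and the readiness condition are assumptions:

    # /healthz: "is this binary responding at all?"  (for the scheduler)
    # /ready:   "can it serve this class of requests right now?"  (for the LB)
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def able_to_serve():
        return True        # e.g. backends reachable, not shedding load (assumed)

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                self._reply(200, b"alive")
            elif self.path == "/ready":
                if able_to_serve():
                    self._reply(200, b"serving")
                else:
                    self._reply(503, b"not ready")
            else:
                self._reply(404, b"not found")

        def _reply(self, code, body):
            self.send_response(code)
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), HealthHandler).serve_forever()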
51%
The distributed consensus problem deals with reaching agreement among a group of processes connected by an unreliable communications network.
51%
The CAP theorem ([Fox99], [Bre12]) holds that a distributed system cannot simultaneously have all three of the following properties:
- Consistent views of the data at each node
- Availability of the data at each node
- Tolerance to network partitions [Gil02]
52%
Distributed consensus algorithms may be crash-fail (which assumes that crashed nodes never return to the system) or crash-recover. Crash-recover algorithms are much more useful, because most problems in real systems are transient in nature due to a slow network, restarts, and so on.
52%
Byzantine failures occur when a process passes incorrect messages due to a bug or malicious activity; they are comparatively costly to handle, and less often encountered.
52%
No asynchronous distributed consensus algorithm can guarantee progress in the presence of an unreliable network.