Kindle Notes & Highlights
consistent hashing [Kar97] describes a way to provide a mapping algorithm that remains relatively stable even when new backends are added to or removed from the list. This approach minimizes the disruption to existing connections when the pool of backends changes.
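A minimal sketch of a consistent hash ring, to make the "relatively stable mapping" concrete; the hash function, replica count, and backend names below are illustrative choices, not the algorithm from [Kar97] or any particular production implementation.

import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent hash ring: each backend owns many points on a ring,
    and a key maps to the first backend point at or after the key's hash."""

    def __init__(self, backends, replicas=100):
        self._replicas = replicas
        self._ring = []  # sorted list of (point, backend)
        for backend in backends:
            self.add(backend)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, backend):
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._hash(f"{backend}#{i}"), backend))

    def remove(self, backend):
        self._ring = [(p, b) for p, b in self._ring if b != backend]

    def pick(self, key):
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["backend-a", "backend-b", "backend-c"])
before = {k: ring.pick(k) for k in ("alpha", "beta", "gamma", "delta")}
ring.add("backend-d")  # only keys near backend-d's points should move
after = {k: ring.pick(k) for k in before}
print([k for k in before if before[k] != after[k]])  # typically a small subset

Because each backend owns many scattered points on the ring, adding or removing one backend remaps only the keys adjacent to its points, leaving most existing assignments (and their connections) untouched.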
By changing the destination MAC address of a forwarded packet, the balancer can leave all the information in upper layers intact, so the backend receives the original source and destination IP addresses. The backend can then send a reply directly to the original sender — a technique known as Direct Server Response (DSR). If user requests are small and replies are large (e.g., most HTTP requests), DSR provides tremendous savings, because only a small fraction of traffic need traverse the load balancer.
Our current VIP load balancing solution [Eis16] uses packet encapsulation. A network load balancer puts the forwarded packet into another IP packet with Generic Routing Encapsulation (GRE) [Han94], and uses a backend’s address as the destination. A backend receiving the packet strips off the outer IP+GRE layer and processes the inner IP packet as if it were delivered directly to its network interface. The network load balancer and the backend no longer need to exist in the same broadcast domain; they can even be on separate continents as long as a route between the two exists. Packet encapsulation is a powerful mechanism that provides great flexibility in the way our networks are designed and evolve.
Unfortunately, encapsulation also comes with a price: inflated packet size. Encapsulation introduces overhead (24 bytes in the case of IPv4+GRE, to be precise), which can cause the packet to exceed the available Maximum Transmission Unit (MTU) size and require fragmentation.
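As a quick worked example of that overhead, assuming a standard 1,500-byte Ethernet MTU, a minimal 20-byte outer IPv4 header, and a 4-byte basic GRE header:

ETHERNET_MTU = 1500      # bytes available to the outer, encapsulated packet
OUTER_IPV4_HEADER = 20   # minimal IPv4 header added by encapsulation
GRE_HEADER = 4           # basic GRE header [Han94]

# Largest inner packet that still fits without fragmentation:
inner_mtu = ETHERNET_MTU - OUTER_IPV4_HEADER - GRE_HEADER
print(inner_mtu)  # 1476 -- larger inner packets must be fragmented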
An alternative model would be to establish and tear down a connection for each request, but this model has significant resource and latency costs. In the corner case of a connection that remains idle for a long time, our RPC implementation has an optimization that switches the connection to a cheap “inactive” mode where, for example, the frequency of health checks is reduced and the underlying TCP connection is dropped in favor of UDP.
To keep interfaces (and their implementations) simple, services are often defined to allow the most expensive requests to consume 100, 1,000, or even 10,000 times more resources than the cheapest requests.
All that said, we’ve learned (the hard way!) about one very dangerous pitfall of the Least-Loaded Round Robin approach: if a task is seriously unhealthy, it might start serving 100% errors. Depending on the nature of those errors, they may have very low latency; it’s frequently significantly faster to just return an “I’m unhealthy!” error than to actually process a request. As a result, clients might start sending a very large amount of traffic to the unhealthy task, erroneously thinking that the task is available, as opposed to fast-failing them! We say that the unhealthy task is now sinkholing traffic.
Consider two backend tasks, one of which can serve requests twice as fast as the other, handling requests that are mostly I/O bound (i.e., they spend most of their life just waiting for the network to respond). In this case, because blocking on I/O often consumes zero CPU, very little RAM, and no bandwidth, we’d still want to send twice as many requests to the faster backend. However, Least-Loaded Round Robin will consider both backend tasks equally loaded.
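A toy sketch of the client-side accounting that Least-Loaded Round Robin relies on, to show why the I/O-bound case above confuses it; the backend names and tie-breaking are hypothetical.

import collections

class LeastLoadedPicker:
    """Pick the backend with the fewest requests outstanding from *this* client."""

    def __init__(self, backends):
        self._active = collections.Counter({b: 0 for b in backends})

    def pick(self):
        backend = min(self._active, key=self._active.get)  # fewest in-flight wins
        self._active[backend] += 1
        return backend

    def done(self, backend):
        self._active[backend] -= 1

picker = LeastLoadedPicker(["fast-backend", "slow-backend"])
# Two backends holding the same number of cheap, mostly idle, I/O-bound
# requests look identical here, even though one could absorb far more work.
print(picker.pick(), picker.pick())

Because the picker only counts its own outstanding requests, equal in-flight counts look like equal load, regardless of how expensive those requests actually are on each backend.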
each client task has only a very limited view into the state of its backend tasks: the view of its own requests.
Even if a backend is only slightly overloaded, a client request is often better served if the backend rejects retry and new requests equally and quickly.
we implement a per-request retry budget of up to three attempts.
layering on the per-client retry budget (a 10% retry ratio) reduces the growth to just 1.1x in the general case.
If multiple layers retried, we’d have a combinatorial explosion.
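A sketch of how the two client-side limits above (at most three attempts per request, and retries only while the client-wide retry ratio stays under roughly 10%) might fit together; the counters and the flaky-backend stand-in are illustrative, not the actual implementation.

import random

class RetryBudget:
    """Per-request cap of three attempts plus a per-client ~10% retry ratio."""

    MAX_ATTEMPTS = 3
    RETRY_RATIO = 0.10

    def __init__(self):
        self.requests = 0  # all attempts sent by this client
        self.retries = 0   # attempts that were retries

    def call(self, send):
        for attempt in range(self.MAX_ATTEMPTS):
            if attempt > 0:
                # Per-client budget: stop retrying once retries reach ~10%.
                if self.retries >= self.RETRY_RATIO * self.requests:
                    return False
                self.retries += 1
            self.requests += 1
            if send():
                return True
        return False

budget = RetryBudget()
flaky = lambda: random.random() < 0.7  # hypothetical backend, 70% success
ok = sum(budget.call(flaky) for _ in range(10_000))
print(ok, round(budget.retries / budget.requests, 3))  # ratio stays near the 10% cap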
Therefore, instead of “batch client → backend,” you have “batch client → batch proxy → backend.” In this case, when the very large job starts, only the batch proxy job suffers, shielding the actual backends (and higher-priority clients). Effectively, the batch proxy acts like a fuse. Another advantage of using the proxy is that it typically reduces the number of connections against the backend, which can improve the load balancing against the backend
It’s a common mistake to assume that an overloaded backend should turn down and stop accepting all traffic. However, this assumption runs counter to the goal of robust load balancing. We actually want the backend to continue accepting as much traffic as possible, but to only accept that load as capacity frees up.
cascading failure is a failure that grows over time as a result of positive feedback.
The most common cause of cascading failures is overload.
Running out of a resource can result in higher latency, elevated error rates, or the substitution of lower-quality results.
Because requests take longer to handle, more requests are handled concurrently (up to a possible maximum capacity at which queuing may occur). This affects almost all resources, including memory, number of active threads (in a thread-per-request server model), number of file descriptors, and backend resources (which in turn can have other effects).
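To put rough numbers on that: the number of requests in flight is approximately the request rate multiplied by the average latency (Little’s law). A server handling 1,000 requests per second at 100 ms average latency has about 100 requests in progress; if overload pushes latency to 1 second at the same arrival rate, roughly 1,000 requests are in progress, with a corresponding increase in memory, active threads, file descriptors, and backend load.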
Excessively long queue lengths: If there is insufficient capacity to handle all the requests at steady state, the server will saturate its queues.
Thread starvation: A thread can’t make progress because it’s waiting for a lock.
A task might be evicted by the container manager (VM or otherwise) for exceeding available resource limits.
A vicious cycle can occur in this scenario: less CPU is available, resulting in slower requests, resulting in increased RAM usage, resulting in more GC, resulting in even lower availability of CPU. This is known colloquially as the “GC death spiral.”
A reduction in available RAM can reduce application-level cache hit rates.
Serve degraded results: Serve lower-quality, cheaper-to-compute results to the user.
Servers should protect themselves from becoming overloaded and crashing. When overloaded at either the frontend or backend layers, fail early and cheaply.
Rate limiting can be implemented in a number of places: at the reverse proxies, by limiting the volume of requests by criteria such as IP address to mitigate attempted denial-of-service attacks and abusive clients; and at the load balancers, by dropping requests when the service enters global overload.
If the request rate and latency of a given task are constant, there is no reason to queue requests: a constant number of threads should be occupied.
For a system with fairly steady traffic over time, it is usually better to have small queue lengths relative to the thread pool size (e.g., 50% or less), which results in the server rejecting requests early when it can’t sustain the rate of incoming requests.
On the other end of the spectrum, systems with “bursty” load for which traffic patterns fluctuate drastically may do better with a queue size based on the current number of threads in use, processing time for each request, and the size and frequency of bursts.
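A minimal sketch of the first policy, assuming a steady-traffic server with a hypothetical pool of 100 worker threads and a queue capped at 50% of that size:

import queue

THREAD_POOL_SIZE = 100
# For steady traffic, keep the queue small relative to the thread pool
# (here 50%), so the server rejects early rather than buffering a backlog.
request_queue = queue.Queue(maxsize=THREAD_POOL_SIZE // 2)

def submit(request):
    """Enqueue a request, or fail fast when the server can't keep up."""
    try:
        request_queue.put_nowait(request)
        return True
    except queue.Full:
        return False  # caller should translate this into a quick, cheap rejection

Rejecting at submit() keeps the failure early and cheap; the bursty-traffic variant described above would size the queue from current thread usage, per-request processing time, and burst size rather than a fixed fraction.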
Keep the system simple and understandable, particularly if it isn’t used often.
You can make sure that graceful degradation stays working by regularly running a small subset of servers near overload in order to exercise this code path.
Always use randomized exponential backoff when scheduling retries. See also “Exponential Backoff and Jitter” in the AWS Architecture Blog.
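A sketch of randomized (“full jitter”) exponential backoff; the base delay and cap are illustrative, and a real client would also honor the per-request and per-client retry limits discussed here.

import random

def backoff_with_jitter(attempt, base=0.05, cap=10.0):
    """Return a randomized sleep, in seconds, before retry number `attempt`.

    The upper bound doubles with each attempt (exponential backoff) and the
    actual delay is drawn uniformly below it (jitter), so retrying clients
    don't synchronize into waves against a recovering backend."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Example: delays for the first few retries of one request.
print([round(backoff_with_jitter(a), 3) for a in range(5)])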
Limit retries per request. Don’t retry a given request indefinitely. Consider having a server-wide retry budget. For example, only allow 60 retries per minute in a process; once the retry budget is exceeded, don’t retry, just fail the request.
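One plausible way to enforce such a per-process budget is a token bucket refilled at the budgeted rate; this is a sketch under that assumption, not a description of any particular implementation.

import time

class ProcessRetryBudget:
    """Token bucket allowing roughly 60 retries per minute in this process."""

    def __init__(self, retries_per_minute=60):
        self.rate = retries_per_minute / 60.0     # tokens added per second
        self.max_tokens = float(retries_per_minute)
        self.tokens = self.max_tokens             # start with a full bucket
        self.last = time.monotonic()

    def allow_retry(self):
        now = time.monotonic()
        self.tokens = min(self.max_tokens, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # budget exhausted: fail the request instead of retrying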
a single request at the highest layer may produce a number of attempts as large as the product of the number of attempts at each layer, all the way down to the lowest layer. For example, if each of three layers retries a failing request up to three times, a single request at the top can produce as many as 3 × 3 × 3 = 27 attempts at the lowest layer.
When a frontend sends an RPC to a backend server, the frontend consumes resources waiting for a reply. RPC deadlines define how long a request can wait before the frontend gives up, limiting the time that the backend may consume the frontend’s resources.
Suppose an RPC has a 10-second deadline, as set by the client. The server is very overloaded, and as a result, it takes 11 seconds to move from a queue to a thread pool. At this point, the client has already given up on the request.
Rather than inventing a deadline when sending RPCs to backends, servers should employ deadline propagation. With deadline propagation, a deadline is set high in the stack (e.g., in the frontend).
This problem can be avoided if the requests that don’t complete return with an error early, rather than waiting the full deadline. For example, if a backend is unavailable, it’s usually best to immediately return an error for that backend, rather than consuming resources until the backend is available.
Having deadlines several orders of magnitude longer than the mean request latency is usually bad.
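A sketch of the deadline-propagation bookkeeping described above, using an absolute deadline handed from layer to layer; handle_frontend_request, call_backend, and the backend_call signature are hypothetical (real RPC frameworks such as gRPC carry the deadline in request metadata for you).

import time

class DeadlineExceeded(Exception):
    pass

def handle_frontend_request(backend_call, timeout_s=10.0):
    # The deadline is set once, high in the stack (here, at the frontend)...
    deadline = time.monotonic() + timeout_s
    return call_backend(backend_call, deadline)

def call_backend(backend_call, deadline):
    # ...and propagated, so each layer only waits for the time that remains
    # and can fail fast instead of doing work the client has given up on.
    budget = deadline - time.monotonic()
    if budget <= 0:
        raise DeadlineExceeded("no time left before the propagated deadline")
    return backend_call(timeout=budget, deadline=deadline)

# Hypothetical backend stub that respects the remaining timeout.
slow_backend = lambda timeout, deadline: (time.sleep(min(timeout, 0.1)), "ok")[1]
print(handle_frontend_request(slow_backend))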
It’s usually better to avoid intra-layer communication.
Because of caching effects, gradually ramping up load may yield different results than immediately increasing to expected load levels. Therefore, consider testing both gradual and impulse load patterns.
If a component enters a degraded mode on heavy load, is it capable of exiting the degraded mode without human intervention? If a couple of servers crash under heavy load, how much does the load need to drop in order for the system to stabilize?
Keep in mind that individual components may have different breaking points, so load test each component separately.
Process health checking (“is this binary responding at all?”) and service health checking (“is this binary able to respond to this class of requests right now?”) are two conceptually distinct operations. Process health checking is relevant to the cluster scheduler, whereas service health checking is relevant to the load balancer.
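A sketch of that distinction with two hypothetical HTTP endpoints, /healthz for the scheduler’s liveness probe and /readyz for the load balancer’s readiness probe; the paths and the READY_TO_SERVE flag are illustrative.

from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness state: flipped to False during startup, overload,
# or when a hard dependency is unavailable.
READY_TO_SERVE = True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Process health check: "is this binary responding at all?"
            self._reply(200)
        elif self.path == "/readyz":
            # Service health check: "can it serve this class of requests right now?"
            self._reply(200 if READY_TO_SERVE else 503)
        else:
            self._reply(404)

    def _reply(self, code):
        self.send_response(code)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()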
The distributed consensus problem deals with reaching agreement among a group of processes connected by an unreliable communications network.
The CAP theorem ([Fox99], [Bre12]) holds that a distributed system cannot simultaneously have all three of the following properties: consistent views of the data at each node, availability of the data at each node, and tolerance to network partitions [Gil02].
Distributed consensus algorithms may be crash-fail (which assumes that crashed nodes never return to the system) or crash-recover. Crash-recover algorithms are much more useful, because most problems in real systems are transient in nature due to a slow network, restarts, and so on.
Byzantine failure occurs when a process passes incorrect messages due to a bug or malicious activity. Byzantine failures are comparatively costly to handle and less often encountered.
no asynchronous distributed consensus algorithm can guarantee progress in the presence of an unreliable network.