Kindle Notes & Highlights
Read between August 2 – December 28, 2020
In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure.
Partial failures are nondeterministic:
There is a spectrum of philosophies on how to build large-scale computing systems: at one end of the scale is the field of high-performance computing (HPC), with supercomputers; at the other extreme is cloud computing. Traditional enterprise datacenters lie somewhere between these extremes.
In a supercomputer, a job typically checkpoints the state of its computation to durable storage from time to time.
Thus, a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure
Many internet-related applications are online, in the sense that they need to be able to serve users with low latency at any time. Making the service unavailable is not acceptable.
and nodes communicate through shared memory and remote direct memory access (RDMA). On the other hand, nodes in cloud services are built from commodity machines, which can provide equivalent performance at lower cost
Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses
but in a system with thousands of nodes, it is reasonable to assume that something is always broken
If the system can tolerate failed nodes and still keep working as a whole,
you can perform a rolling upgrade (see Chapter 4), restarting one node at a time, while the service
we need to build a reliable system from unreliable components.
The fault handling must be part of the software design, and you
In distributed systems, suspicion, pessimism, and paranoia pay off.
some bits wrong, for example due to radio interference on a wireless network
is unreliable: it may drop, delay, duplicate, or reorder packets. TCP (the Transmission Control Protocol) provides a more reliable transport layer on top of IP.
Although the system can be more reliable than its underlying parts, there is always a limit to how much more reliable it can be.
the distributed systems we focus on in this book are shared-nothing systems: i.e., a bunch of machines connected by a network.
it’s comparatively cheap because it requires no special hardware,
The internet and most internal networks in datacenters (often Ethernet) are asynchronous packet networks.
The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive.
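The timeout pattern described above can be sketched in a few lines. This is an illustrative helper (the function name and return shape are my own, not from the book); the key point in the comment is that a timeout tells you only that you stopped waiting, not what happened remotely.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_timeout(fn, timeout_s: float):
    """Wait at most timeout_s for fn() to complete.

    A timeout does NOT tell us whether fn's effect actually happened on
    the remote node; it only means we gave up waiting for the response.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except FutureTimeout:
        return ("timeout", None)
    finally:
        pool.shutdown(wait=False)  # don't block on the still-running call

# a fast "node" responds in time:
fast_result = call_with_timeout(lambda: 42, timeout_s=1.0)   # ("ok", 42)

# a slow "node": the response would arrive after the timeout has expired
slow_result = call_with_timeout(lambda: time.sleep(0.3) or "late",
                                timeout_s=0.05)              # ("timeout", None)
```

Note that the slow call may still complete in the background; the request was not cancelled, only abandoned, which is exactly the ambiguity the text warns about.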
A study of network failures [16] found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages.
Sharks might bite undersea cables and damage them
If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests,
Many systems need to automatically detect faulty nodes.
A load balancer needs to stop sending requests to a node that is dead (i.e., take it out of rotation).
with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader.
you might get some feedback to explicitly tell you that something is not working:
Even if TCP acknowledges that a packet was delivered, the application may have crashed before handling it.
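Because TCP's own acknowledgments only confirm that bytes reached the remote operating system, knowing that a request was *processed* requires an acknowledgment from the application itself. A sketch (the wire format and handler are hypothetical):

```python
import socket

def handle_request(conn: socket.socket) -> None:
    """Do the work first, then send an application-level acknowledgment.

    TCP's ACKs say nothing about whether the application handled the
    data, so the ack below is only sent once processing has succeeded.
    """
    data = conn.recv(1024)
    result = data.upper()          # stand-in for real request processing
    conn.sendall(b"OK:" + result)  # app-level ack, sent after processing

# demo over a local socket pair:
client, server = socket.socketpair()
client.sendall(b"hello")
handle_request(server)
reply = client.recv(1024)          # b"OK:HELLO"
client.close(); server.close()
```

If the server crashed between `recv` and `sendall`, the client would see no ack and could not tell whether the work was done, which is the ambiguity the highlight describes.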
A long timeout means a long wait until a node is declared dead
A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown
If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse.
it could happen that the node actually wasn’t dead but only slow to respond due to overload;
transferring its load to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other dead, and everything stops working).
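One way to soften this trade-off is to measure response times continually and set the timeout adaptively, rather than using a fixed constant. A minimal sketch in the spirit of TCP's retransmission timeout (RFC 6298); the class name is mine and the constants are illustrative, not tuned:

```python
class AdaptiveTimeout:
    """Adaptive timeout from observed round-trip times (a sketch).

    Tracks an exponentially weighted moving average of response times
    and of their variability; the timeout is the mean plus a safety
    margin proportional to the variability (jitter).
    """
    def __init__(self, initial_s: float = 1.0):
        self.srtt = initial_s          # smoothed response time estimate
        self.rttvar = initial_s / 2    # smoothed deviation estimate

    def observe(self, rtt_s: float) -> None:
        # update variability first, then the mean (as in RFC 6298)
        self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt_s)
        self.srtt = 0.875 * self.srtt + 0.125 * rtt_s

    def timeout_s(self) -> float:
        return self.srtt + 4 * self.rttvar

at = AdaptiveTimeout(initial_s=1.0)
for _ in range(100):
    at.observe(0.1)       # node consistently responds in ~100 ms
adapted = at.timeout_s()  # shrinks toward the observed latency
```

A node that is usually fast gets a tight timeout (fast detection), while a highly variable node automatically gets more slack, reducing premature death declarations.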
asynchronous networks have unbounded delays (that is, they try to deliver packets as quickly as possible, but there is no upper limit on the time it may take for a packet to arrive)
When driving a car, travel times on road networks often vary most due to traffic congestion.
Similarly, the variability of packet delays on computer networks is most often due to queueing:
packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one (as illustrated in Figure 8-2). On a busy network link, a packet may have to wait a while until it can get a slot (this is called network congestion).
If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent, even though the network is functioning fine.
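The queueing-and-drop behavior at a switch port can be illustrated with a toy discrete-time model (entirely my own illustration, not from the book): each tick, packets arrive into a bounded buffer, one packet drains onto the outgoing link, and arrivals beyond the buffer's capacity are lost.

```python
from collections import deque

def simulate_switch(arrivals_per_tick, capacity, drain_per_tick=1):
    """Toy model of one switch output port with a bounded queue.

    Each tick: new packets are enqueued (dropped if the buffer is full),
    then up to drain_per_tick packets are forwarded on the link.
    Returns (delivered, dropped, still_queued).
    """
    queue = deque()
    delivered, dropped = 0, 0
    for arriving in arrivals_per_tick:
        for pkt in range(arriving):
            if len(queue) < capacity:
                queue.append(pkt)
            else:
                dropped += 1          # buffer full: packet is lost
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()           # one slot on the link per drain
            delivered += 1
    return delivered, dropped, len(queue)

# a burst of senders overwhelms a 4-packet buffer draining 1 packet/tick:
delivered, dropped, backlog = simulate_switch([3, 3, 0, 0], capacity=4)
# delivered=4, dropped=1, backlog=1
```

The drop happens even though every link is healthy, which is why senders must treat loss as routine and retransmit, and why delays grow as queues fill.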
the incoming request from the network is queued by the operating system until the application is ready to handle it.
virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued (buffered)