Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Rate it:
Open Preview
38%
Flag icon
In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure.
38%
Flag icon
are nondeterministic:
38%
Flag icon
build large-scale computing systems:
38%
Flag icon
of high-performance computing
38%
Flag icon
Supercomputers
38%
Flag icon
cloud computing,
38%
Flag icon
Traditional enterprise datacenters lie somewhere between these extremes.
38%
Flag icon
In a supercomputer, a job typically checkpoints the state of its computation to durable storage from time to time.
38%
Flag icon
Thus, a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure
38%
Flag icon
online,
38%
Flag icon
Making the service unavailable
38%
Flag icon
is not acce...
This highlight has been truncated due to consecutive passage length restrictions.
38%
Flag icon
and nodes communicate through shared memory and remote direct memory access (RDMA). On
38%
Flag icon
nodes in cloud services are built from commodity machines, which can provide equivalent performance at lower cost
38%
Flag icon
Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses
38%
Flag icon
but in a system with thousands of nodes, it is reasonable to assume that something is always broken
38%
Flag icon
If the system can tolerate failed nodes and still keep working as a whole,
38%
Flag icon
you can perform a rolling upgrade (see Chapter 4), restarting one node at a time, while the service
38%
Flag icon
we need to build a reliable system from unreliable components.
38%
Flag icon
The fault handling must be part of the software design, and you
38%
Flag icon
In distributed systems, suspicion, pessimism, and paranoia pay off.
38%
Flag icon
some bits wrong, for example due to radio interference on a wireless network
38%
Flag icon
is unreliable: it may drop, delay, duplicate, or reorder packets. TCP (the
38%
Flag icon
Although the system can be more reliable than its underlying parts, there is always a limit to how much more reliable it can
38%
Flag icon
the distributed systems we focus on in this book are shared-nothing systems: i.e., a bunch of machines connected by a network.
38%
Flag icon
it’s comparatively cheap because it requires no special hardware,
38%
Flag icon
The internet and most internal networks in datacenters (often Ethernet) are asynchronous packet networks.
38%
Flag icon
The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive.
38%
Flag icon
[16]. It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages.
39%
Flag icon
Sharks might bite undersea cables and damage them
39%
Flag icon
If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests,
39%
Flag icon
Many systems need to automatically detect faulty nodes.
39%
Flag icon
A load balancer needs to stop sending requests to a node that is dead (i.e.,
39%
Flag icon
with single-leader replication, if the...
This highlight has been truncated due to consecutive passage length restrictions.
39%
Flag icon
you might get some feedback to explicitly tell you that something is not working:
39%
Flag icon
Even if TCP acknowledges that a packet was delivered, the application may have crashed before handling it.
39%
Flag icon
A long timeout means a long wait until a node is declared dead
39%
Flag icon
A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown
39%
Flag icon
If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse.
39%
Flag icon
it could happen that the node actually wasn’t dead but only slow to respond due to overload;
39%
Flag icon
transferring its load to other nodes can cause a ...
This highlight has been truncated due to consecutive passage length restrictions.
39%
Flag icon
asynchronous networks have unbounded delays (that
39%
Flag icon
a car,
39%
Flag icon
traffic congestion.
39%
Flag icon
the variability of packet delays on computer networks is most o...
This highlight has been truncated due to consecutive passage length restrictions.
39%
Flag icon
packets to the same destination, the network switch must queue
39%
Flag icon
them up and feed them into the destination network link one by one (as illustrated in Figure 8-2). On a busy network link, a packet
39%
Flag icon
If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent — even thou...
This highlight has been truncated due to consecutive passage length restrictions.
39%
Flag icon
the incoming request from the network is queued by th...
This highlight has been truncated due to consecutive passage length restrictions.
39%
Flag icon
virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued (buffered)
1 12 28