Kindle Notes & Highlights
Read between August 2 – December 28, 2020
In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure.
Partial failures are nondeterministic:
There is a spectrum of philosophies on how to build large-scale computing systems: at one end of the scale is the field of high-performance computing (HPC), with supercomputers; at the other extreme is cloud computing. Traditional enterprise datacenters lie somewhere between these extremes.
In a supercomputer, a job typically checkpoints the state of its computation to durable storage from time to time.
Thus, a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure
Many internet-related applications are online, in the sense that they need to be able to serve users with low latency at any time. Making the service unavailable is not acceptable.
and nodes communicate through shared memory and remote direct memory access (RDMA). On the other hand, nodes in cloud services are built from commodity machines, which can provide equivalent performance at lower cost
Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses
but in a system with thousands of nodes, it is reasonable to assume that something is always broken
If the system can tolerate failed nodes and still keep working as a whole,
you can perform a rolling upgrade (see Chapter 4), restarting one node at a time, while the service
we need to build a reliable system from unreliable components.
The fault handling must be part of the software design, and you
In distributed systems, suspicion, pessimism, and paranoia pay off.
some bits wrong, for example due to radio interference on a wireless network
is unreliable: it may drop, delay, duplicate, or reorder packets. TCP (the Transmission Control Protocol) provides a more reliable transport layer on top of IP.
Although the system can be more reliable than its underlying parts, there is always a limit to how much more reliable it can be.
the distributed systems we focus on in this book are shared-nothing systems: i.e., a bunch of machines connected by a network.
it’s comparatively cheap because it requires no special hardware,
The internet and most internal networks in datacenters (often Ethernet) are asynchronous packet networks.
The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive.
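The timeout pattern described above can be sketched in a few lines. This is an illustrative helper (the function name and return shape are my own, not from the book); the key point in the comment is that a timeout tells you only that you stopped waiting, not what happened remotely.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_timeout(fn, timeout_s: float):
    """Wait at most timeout_s for fn() to complete.

    A timeout does NOT tell us whether fn's effect actually happened on
    the remote node; it only means we gave up waiting for the response.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except FutureTimeout:
        return ("timeout", None)
    finally:
        pool.shutdown(wait=False)  # don't block on the still-running call

# a fast "node" responds in time:
fast_result = call_with_timeout(lambda: 42, timeout_s=1.0)   # ("ok", 42)

# a slow "node": the response would arrive after the timeout has expired
slow_result = call_with_timeout(lambda: time.sleep(0.3) or "late",
                                timeout_s=0.05)              # ("timeout", None)
```

Note that the slow call may still complete in the background; the request was not cancelled, only abandoned, which is exactly the ambiguity the text warns about.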
A study of network failures [16] found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages.
Sharks might bite undersea cables and damage them
If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests,
Many systems need to automatically detect faulty nodes.
A load balancer needs to stop sending requests to a node that is dead (i.e., take it out of rotation).
with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader.
you might get some feedback to explicitly tell you that something is not working:
Even if TCP acknowledges that a packet was delivered, the application may have crashed before handling it.
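Because TCP's own acknowledgments only confirm that bytes reached the remote operating system, knowing that a request was *processed* requires an acknowledgment from the application itself. A sketch (the wire format and handler are hypothetical):

```python
import socket

def handle_request(conn: socket.socket) -> None:
    """Do the work first, then send an application-level acknowledgment.

    TCP's ACKs say nothing about whether the application handled the
    data, so the ack below is only sent once processing has succeeded.
    """
    data = conn.recv(1024)
    result = data.upper()          # stand-in for real request processing
    conn.sendall(b"OK:" + result)  # app-level ack, sent after processing

# demo over a local socket pair:
client, server = socket.socketpair()
client.sendall(b"hello")
handle_request(server)
reply = client.recv(1024)          # b"OK:HELLO"
client.close(); server.close()
```

If the server crashed between `recv` and `sendall`, the client would see no ack and could not tell whether the work was done, which is the ambiguity the highlight describes.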
A long timeout means a long wait until a node is declared dead
A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown
If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse.
it could happen that the node actually wasn’t dead but only slow to respond due to overload;
transferring its load to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other dead, and everything stops working).
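One way to soften this trade-off is to measure response times continually and set the timeout adaptively, rather than using a fixed constant. A minimal sketch in the spirit of TCP's retransmission timeout (RFC 6298); the class name is mine and the constants are illustrative, not tuned:

```python
class AdaptiveTimeout:
    """Adaptive timeout from observed round-trip times (a sketch).

    Tracks an exponentially weighted moving average of response times
    and of their variability; the timeout is the mean plus a safety
    margin proportional to the variability (jitter).
    """
    def __init__(self, initial_s: float = 1.0):
        self.srtt = initial_s          # smoothed response time estimate
        self.rttvar = initial_s / 2    # smoothed deviation estimate

    def observe(self, rtt_s: float) -> None:
        # update variability first, then the mean (as in RFC 6298)
        self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt_s)
        self.srtt = 0.875 * self.srtt + 0.125 * rtt_s

    def timeout_s(self) -> float:
        return self.srtt + 4 * self.rttvar

at = AdaptiveTimeout(initial_s=1.0)
for _ in range(100):
    at.observe(0.1)       # node consistently responds in ~100 ms
adapted = at.timeout_s()  # shrinks toward the observed latency
```

A node that is usually fast gets a tight timeout (fast detection), while a highly variable node automatically gets more slack, reducing premature death declarations.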
asynchronous networks have unbounded delays (that is, they try to deliver packets as quickly as possible, but there is no upper limit on the time it may take for a packet to arrive)
When driving a car, travel times on road networks often vary most due to traffic congestion.
Similarly, the variability of packet delays on computer networks is most often due to queueing:
packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one (as illustrated in Figure 8-2). On a busy network link, a packet may have to wait a while until it can get a slot (this is called network congestion).
If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent, even though the network is functioning fine.
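The queueing-and-drop behavior at a switch port can be illustrated with a toy discrete-time model (entirely my own illustration, not from the book): each tick, packets arrive into a bounded buffer, one packet drains onto the outgoing link, and arrivals beyond the buffer's capacity are lost.

```python
from collections import deque

def simulate_switch(arrivals_per_tick, capacity, drain_per_tick=1):
    """Toy model of one switch output port with a bounded queue.

    Each tick: new packets are enqueued (dropped if the buffer is full),
    then up to drain_per_tick packets are forwarded on the link.
    Returns (delivered, dropped, still_queued).
    """
    queue = deque()
    delivered, dropped = 0, 0
    for arriving in arrivals_per_tick:
        for pkt in range(arriving):
            if len(queue) < capacity:
                queue.append(pkt)
            else:
                dropped += 1          # buffer full: packet is lost
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()           # one slot on the link per drain
            delivered += 1
    return delivered, dropped, len(queue)

# a burst of senders overwhelms a 4-packet buffer draining 1 packet/tick:
delivered, dropped, backlog = simulate_switch([3, 3, 0, 0], capacity=4)
# delivered=4, dropped=1, backlog=1
```

The drop happens even though every link is healthy, which is why senders must treat loss as routine and retransmit, and why delays grow as queues fill.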
the incoming request from the network is queued by the operating system until the application is ready to handle it.
virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued (buffered)