Kindle Notes & Highlights
Read between April 5 - December 1, 2020
A key feature of a transaction is that it can be aborted and safely retried if an error occurred.
object-relational mapping (ORM) frameworks such as Rails’s ActiveRecord and Django don’t retry aborted transactions — the error usually results in an exception bubbling up the stack, so any user input is thrown away and the user gets an error message.
If the error is due to overload, retrying the transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit the number of retries, use exponential backoff, and handle overload-related errors differently
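A minimal sketch of what such retry logic might look like in application code, assuming a hypothetical run_transaction callable and a hypothetical TransientTransactionError raised by the driver on retryable aborts (overload-related errors should not be retried this way):

```python
import random
import time

class TransientTransactionError(Exception):
    """Placeholder for whatever exception your driver raises on a retryable
    abort (e.g., a deadlock or serialization failure) - not an overload error."""

def run_with_retries(run_transaction, max_retries=3, base_delay=0.1):
    """Retry an aborted transaction a limited number of times, with exponential
    backoff plus jitter so that retries don't amplify an overloaded database."""
    for attempt in range(max_retries + 1):
        try:
            return run_transaction()   # one attempt at the whole transaction
        except TransientTransactionError:
            if attempt == max_retries:
                raise                  # give up and surface the error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```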
serializable isolation means that the database guarantees that transactions have the same effect as if they ran serially (i.e., one at a time, without any concurrency).
databases prevent dirty writes by using row-level locks: when a transaction wants to modify a particular object (row or document), it must first acquire a lock on that object. It must then hold that lock until the transaction is committed or aborted.
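The same locking mechanism can be requested explicitly from the application; a sketch assuming PostgreSQL via psycopg2, with a hypothetical accounts table and connection string:

```python
import psycopg2

# Hypothetical connection string and table; adjust for your environment.
conn = psycopg2.connect("dbname=bank user=app")

with conn:                      # commits on success, rolls back on exception
    with conn.cursor() as cur:
        # Acquire a row-level lock on the account; a concurrent transaction that
        # tries to modify the same row must wait until this transaction ends.
        cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (42,))
        (balance,) = cur.fetchone()
        cur.execute("UPDATE accounts SET balance = %s WHERE id = %s",
                    (balance - 100, 42))
# The row lock is released when the transaction commits or aborts.
```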
Read skew is considered acceptable under read committed isolation:
a key principle of snapshot isolation is readers never block writers, and writers never block readers. This allows a database to handle long-running read queries on a consistent snapshot at the same time as processing writes normally, without any lock contention between the two.
In a supercomputer, a job typically checkpoints the state of its computation to durable storage from time to time. If one node fails, a common solution is to simply stop the entire cluster workload. After the faulty node is repaired, the computation is restarted from the last checkpoint [7, 8]. Thus, a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure
Even if TCP acknowledges that a packet was delivered, the application may have crashed before handling it. If you want to be sure that a request was successful, you need a positive response from the application itself [24].
Conversely, if something has gone wrong, you may get an error response at some level of the stack, but in general you have to assume that you will get no response at all.
UDP is a good choice in situations where delayed data is worthless.
a circuit is a fixed amount of reserved bandwidth which nobody else can use while the circuit is established, whereas the packets of a TCP connection opportunistically use whatever network bandwidth is available.
multi-tenancy with dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside of variable delays. Variable delays in networks are not a law of nature, but simply the result of a cost/benefit trade-off.
it makes no sense to compare monotonic clock values from two different computers, because they don’t mean the same thing.
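For example, in Python's standard library the two clocks are separate APIs; only differences between monotonic readings taken on the same machine are meaningful:

```python
import time

wall_start = time.time()        # time-of-day clock: may jump if NTP steps it
mono_start = time.monotonic()   # monotonic clock: absolute value is meaningless

time.sleep(0.1)                 # stand-in for some work

elapsed = time.monotonic() - mono_start   # safe: a duration on one machine
# Comparing mono_start with a monotonic reading from another machine would be
# meaningless; the clocks may count from arbitrary, unrelated starting points.
```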
The quartz clock in a computer is not very accurate: it drifts (runs faster or slower than it should). Clock drift varies depending on the temperature of the machine.
NTP synchronization can only be as good as the network delay, so there is a limit to its accuracy when you’re on a congested network with variable packet delays.
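The effect of network delay is visible in the standard NTP/SNTP offset calculation; the half-round-trip term is the irreducible uncertainty when the outbound and return paths may be asymmetric:

```python
def ntp_offset_and_uncertainty(t0, t1, t2, t3):
    """Classic NTP/SNTP calculation.

    t0: client's clock when the request was sent
    t1: server's clock when the request arrived
    t2: server's clock when the reply was sent
    t3: client's clock when the reply arrived
    """
    round_trip = (t3 - t0) - (t2 - t1)    # time the messages spent on the network
    offset = ((t1 - t0) + (t2 - t3)) / 2  # estimated clock offset, assuming the
                                          # delay was split symmetrically
    # If the two directions had different delays, the true offset can be wrong
    # by up to half the round trip - so accuracy is bounded by network delay.
    return offset, round_trip / 2
```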
When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds while another VM is running. From an application’s point of view, this pause manifests itself as the clock suddenly jumping forward
Additional causality tracking mechanisms, such as version vectors, are needed in order to prevent violations of causality (see “Detecting Concurrent Writes”).
NTP’s synchronization accuracy is itself limited by the network round-trip time,
if you have two confidence intervals, each consisting of an earliest and latest possible timestamp (A = [A_earliest, A_latest] and B = [B_earliest, B_latest]), and those two intervals do not overlap (i.e., A_earliest < A_latest < B_earliest < B_latest), then B definitely happened after A — there can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.
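A small helper makes the rule concrete (intervals are (earliest, latest) pairs; any overlap means the order is unknown):

```python
def definitely_before(a, b):
    """True if event A definitely happened before event B, given clock
    confidence intervals a = (a_earliest, a_latest), b = (b_earliest, b_latest)."""
    a_earliest, a_latest = a
    b_earliest, b_latest = b
    return a_latest < b_earliest   # the intervals don't overlap: A precedes B

def ordering(a, b):
    if definitely_before(a, b):
        return "A before B"
    if definitely_before(b, a):
        return "B before A"
    return "uncertain"             # the intervals overlap
```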
Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction. By doing so, it ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals do not overlap.
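A sketch of that commit-wait idea, assuming a hypothetical clock object with a TrueTime-style now() returning an (earliest, latest) interval and a hypothetical apply_commit callback:

```python
import time

def commit_with_wait(clock, apply_commit):
    """Wait out the clock uncertainty before committing (commit wait)."""
    earliest, latest = clock.now()
    commit_ts = latest                 # pick the latest possible current time
    # Once even the earliest possible current time has passed commit_ts, any
    # later transaction's confidence interval cannot overlap with this one.
    while clock.now()[0] <= commit_ts:
        time.sleep(0.001)
    apply_commit(commit_ts)
```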
when a node obtains a lease, it knows that it is the leader for some amount of time, until the lease expires. In order to remain leader, the node must periodically renew the lease before it expires. If the node fails, it stops renewing the lease,
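A sketch of lease-based leadership, assuming a hypothetical coordination-service client exposing acquire_lease and renew_lease calls; note that the expiry is tracked on the local monotonic clock:

```python
import time

LEASE_DURATION = 10.0   # seconds; hypothetical value

class LeaseHolder:
    """Sketch assuming a hypothetical service client with acquire_lease(duration)
    and renew_lease(duration), both returning True on success."""

    def __init__(self, service):
        self.service = service
        self.expires_at = 0.0                 # tracked on the local monotonic clock

    def is_leader(self):
        # Only act as leader while the lease has not (locally) expired.
        return time.monotonic() < self.expires_at

    def run(self):
        while True:
            ok = (self.service.renew_lease(LEASE_DURATION) if self.is_leader()
                  else self.service.acquire_lease(LEASE_DURATION))
            if ok:
                self.expires_at = time.monotonic() + LEASE_DURATION
            time.sleep(LEASE_DURATION / 3)    # renew well before expiry
```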
In the case of a virtual machine, the CPU time spent in other virtual machines is known as steal time.
real-time systems may have lower throughput, since they have to prioritize timely responses above all else
If ZooKeeper is used as a lock service, the transaction ID zxid or the node version cversion can be used as a fencing token. Since they are guaranteed to be monotonically increasing, they have the required properties.
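On the storage side, fencing then amounts to remembering the highest token seen and rejecting anything older; a minimal sketch (the token could be the zxid or cversion obtained from ZooKeeper):

```python
class FencedStorage:
    """Sketch of fencing: the storage service remembers the highest fencing
    token it has processed and rejects writes that carry an older one."""

    def __init__(self):
        self.max_token_seen = -1
        self.data = {}

    def write(self, token, key, value):
        if token < self.max_token_seen:
            raise PermissionError(f"stale fencing token {token}; "
                                  f"already saw {self.max_token_seen}")
        self.max_token_seen = token
        self.data[key] = value
```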
A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.
Most Byzantine fault-tolerant algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly (i.e., if you have n nodes, fewer than n/3 may be faulty).
Using multiple servers makes NTP more robust than relying on just a single server.
The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock error.
uniqueness and monotonic sequence are safety properties, but availability is a liveness property.
scalability is not the only reason for wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals,
The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.
If two nodes both believe that they are the leader, that situation is called split brain, and it often leads to data loss.
The edge cases of eventual consistency only become apparent when there is a fault in the system (e.g., a network interruption) or at high concurrency.
In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written.
A database may provide both serializability and linearizability, and this combination is known as strict serializability
Coordination services like Apache ZooKeeper [15] and etcd [16] are often used to implement distributed locks and leader election.
libraries like Apache Curator [17] help by providing higher-level recipes on top of ZooKeeper.
hard uniqueness constraint, such as the one you typically find in relational databases, requires linearizability.
Using the leader for reads relies on the assumption that you know for sure who the leader is.
consensus algorithms can implement linearizable storage safely. This is how ZooKeeper [21] and etcd [22] work, for example.
Systems with multi-leader replication are generally not linearizable, because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes.
it is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability.
If your application does not require linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas (e.g., multi-leader). In this case, the application can remain available in the face of a network problem, but its behavior is not linearizable.
CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3.
if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network.
there are no concurrent operations in a linearizable datastore: there must be a single timeline along which all operations are totally ordered.
linearizability implies causality: any system that is linearizable will preserve causality correctly
causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures
Unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model: some things can be concurrent, so the version history is like a timeline with branching and merging. Causal consistency does not have the coordination overhead of linearizability and is much less sensitive to network problems.