Chena Lee’s Kindle Notes & Highlights for Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Kindle Notes & Highlights

by Chena Lee

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

by Martin Kleppmann

Read between August 2 - December 28, 2020

40%

For correct ordering, you would need the clock source to be significantly more accurate than the thing you are measuring (namely network delay).

41%

So-called logical clocks [56, 57], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer alternative for ordering events
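
Reader's sketch of the "incrementing counter" idea, written as a Lamport-style clock. The class and method names are hypothetical and do not come from the book; this is one well-known form of logical clock, not necessarily the one the cited references describe.

```python
class LamportClock:
    """Logical clock: an incrementing counter, not a reading of physical time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        # A local event or a message send advances the counter.
        self.counter += 1
        return self.counter

    def receive(self, remote_counter):
        # On receipt, move past the sender's counter so the receive
        # event is always ordered after the corresponding send event.
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter


a, b = LamportClock(), LamportClock()
sent = a.tick()              # node A stamps an outgoing message
received = b.receive(sent)   # node B orders its receive event after the send
```

The ordering depends only on counters carried in messages, so drift in a quartz clock cannot reorder a send and its receive.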

41%

it doesn’t make sense to think of a clock reading as a point in time — it is more like a range of times, within a confidence interval:

41%

An interesting exception is Google’s TrueTime API in Spanner

41%

which explicitly reports the confidence interval on the local clock.

41%

The most common implementation of snapshot isolation requires a monotonically increasing transaction ID.

41%

However, when a database is distributed across many machines, potentially in multiple datacenters, a global, monotonically increasing transaction ID (across all partitions) is difficult to generate,

41%

With lots of small, rapid transactions, creating transaction IDs in a distributed system becomes an untenable bottleneck.

41%

Only if the intervals overlap are we unsure in which order A and B happened.

41%

Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction.
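
Reader's sketch tying the last few highlights together: a clock reading as an [earliest, latest] range, the overlap test, and a wait for the length of the interval. All names are hypothetical; this is not the TrueTime API, only an illustration of the idea under the assumption that the interval bounds are trustworthy.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest: float  # seconds
    latest: float    # seconds


def definitely_before(a, b):
    # A happened before B only if A's whole interval precedes B's.
    return a.latest < b.earliest


def overlaps(a, b):
    # If neither interval is entirely before the other, the order is unknown.
    return not definitely_before(a, b) and not definitely_before(b, a)


def commit_wait(commit_ts, sleep=time.sleep):
    # Wait for the length of the confidence interval before committing,
    # so that any later transaction gets a non-overlapping interval.
    sleep(commit_ts.latest - commit_ts.earliest)


# Intervals 7 ms wide, matching the synchronization figure quoted in these notes.
a = TimeInterval(10.000, 10.007)
b = TimeInterval(10.010, 10.017)
c = TimeInterval(10.005, 10.012)
a_before_b = definitely_before(a, b)   # intervals do not overlap
a_vs_c_unknown = overlaps(a, c)        # intervals overlap, order is uncertain
```

The `sleep` parameter is injectable only so the wait can be exercised without actually pausing.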

41%

Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms

41%

Using clock synchronization for distributed transaction semantics is an area of active research

41%

Process Pauses

41%

writes. How does a node know that it is still leader

41%

One option is for the leader to obtain a lease from the other nodes, which is similar to a lock with a timeout

41%

Firstly, it’s relying on synchronized clocks: the expiry time on the lease is set by a different machine

41%

and it’s being compared to the local system clock.

41%

the code assumes

41%

that very little time passes between the point that it checks the time (System.currentTimeMillis()) and the time when the request is processed (process(request)). Normally this code runs very quickly, so the 10 second buffer

41%

what if there is an unexpected pause in the execution of the program? For example, imagine the thread stops for 15 seconds around the line lease.isValid()
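
Reader's sketch of the failure described here, loosely modelled on the check the highlights mention (a 10 second buffer, then processing). The fake clock, the `Lease` class, and the `pause_ms` parameter are hypothetical stand-ins used to simulate a pause deterministically; real code would not be able to observe the pause this way.

```python
SAFETY_BUFFER_MS = 10_000


class FakeClock:
    """Stand-in for the local system clock so a pause can be simulated."""

    def __init__(self, now_ms=0):
        self.now_ms = now_ms

    def current_time_millis(self):
        return self.now_ms

    def advance(self, ms):
        self.now_ms += ms


class Lease:
    def __init__(self, expiry_ms):
        self.expiry_ms = expiry_ms


def handle_request(lease, clock, pause_ms=0):
    # Step 1: check that at least the safety buffer remains on the lease.
    if lease.expiry_ms - clock.current_time_millis() < SAFETY_BUFFER_MS:
        return "renew lease first"
    # Simulated GC pause or VM suspension between the check and the work.
    clock.advance(pause_ms)
    # Step 2: the request is processed regardless; report whether the lease
    # was in fact still valid at that moment.
    if clock.current_time_millis() < lease.expiry_ms:
        return "processed with valid lease"
    return "processed with EXPIRED lease"


no_pause = handle_request(Lease(expiry_ms=12_000), FakeClock())
long_pause = handle_request(Lease(expiry_ms=12_000), FakeClock(), pause_ms=15_000)
```

With 12 seconds left the check passes in both runs, but the 15 second pause means the second request is processed after the lease has expired.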

41%

thread might be paused for so long? Unfortunately not. There are various reasons why this could

41%

garbage collector (GC) that occasionally needs to stop all running threads.

41%

“concurrent” garbage collectors like the HotSpot JVM’s CMS cannot fully run in parallel with the application code — even they need to stop the world from time to time

41%

a virtual machine can be suspended (pausing the execution of all processes and saving the contents of memory to disk) and resumed

41%

execution may also be suspended and resumed arbitrarily, e.g., when the user closes the lid of their laptop.

41%

the operating system context-switches to another thread,

41%

the hypervisor switches to a different virtual machine

41%

the currently running thread can be paused at any arbitrary...

41%

the application performs synchronous disk access, a thread may be paused waiting for a slow disk I/O operation to complete

41%

the Java classloader lazily loads class files when they are first used, which could happen at any time in the program execution.

41%

the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the I/O latency is further subject to the variability of network delays

41%

swapping to disk (paging), a simple memory access may result in a page fault that requires a page from disk to be loaded into memory.

41%

swapping pages in and out of memory and getting little actual work done (this is known as thrashing).

41%

A Unix process can be paused by sending it the SIGSTOP signal, for example by pressing Ctrl-Z in a shell.

41%

The problem is similar to making multi-threaded code on a single machine thread-safe: you can’t assume anything about timing, because arbitrary context switches and parallelism may occur.

41%

A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function.

41%

Some software runs in environments where a failure to respond within a specified time can cause serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects must respond quickly and predictably to their sensor inputs.

41%

deadline

41%

hard real-time systems.

41%

you wouldn’t want the release of the airbag to be delayed due to an inopportune GC pause in the airbag release system.

41%

a real-time operating system (RTOS) that allows processes to be scheduled with a guaranteed allocation of CPU time in specified intervals is needed;

41%

Moreover, “real-time” is not the same as “high-performance” — in fact, real-time systems may have lower throughput, since they have to prioritize timely responses above

41%

An emerging idea is to treat GC pauses like brief planned outages of a node,

42%

A variant of this idea is to use the garbage collector only for short-lived objects (which are fast to collect) and to restart processes periodically,

42%

a node cannot necessarily trust its own judgment of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover.

42%

many distributed algorithms rely on a quorum, that is, voting among the nodes
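
Reader's sketch of a vote among nodes. The function name is hypothetical, and the strict-majority threshold is an assumption on my part; the highlight itself only says that nodes vote.

```python
def quorum_reached(votes):
    """votes maps a node name to True if that node supports the decision."""
    yes = sum(1 for vote in votes.values() if vote)
    # Strict majority: more than half of all nodes, including those voting no.
    return yes > len(votes) // 2


three_of_five = quorum_reached(
    {"n1": True, "n2": True, "n3": True, "n4": False, "n5": False}
)
two_of_four = quorum_reached(
    {"n1": True, "n2": True, "n3": False, "n4": False}
)
```

An even split does not count as a quorum, so two disjoint groups cannot both reach a decision.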

42%

system requires there to be only one of some thing.

42%

Implementing this in a distributed system requires care:

42%

The problem is an example of what we discussed in “Process Pauses”: if the client holding the lease is paused for too long, its lease expires. Another client can obtain a lease for the same file, and start writing to the file. When the paused client comes back, it believes (incorrectly) that it still has a valid lease and proceeds to also write to the file.

42%

Let’s assume that every time the lock server grants a lock or lease, it also returns a fencing token, which is a number that increases every time a lock is granted
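
Reader's sketch of how a fencing token could stop the paused client from the previous highlight. `LockServer` and `Storage` are hypothetical names, and the storage-side check (reject any write carrying an older token than one already seen) is an assumption that goes beyond what this highlight states.

```python
class LockServer:
    def __init__(self):
        self._next_token = 0

    def grant(self):
        # The token increases every time a lock or lease is granted.
        self._next_token += 1
        return self._next_token


class Storage:
    def __init__(self):
        self.highest_token = 0
        self.data = None

    def write(self, token, value):
        # Reject writes from a holder whose token is older than one already seen.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data = value
        return True


server, storage = LockServer(), Storage()
token_1 = server.grant()   # client 1 gets the lease, then is paused
token_2 = server.grant()   # the lease expires; client 2 gets a newer token
client_2_ok = storage.write(token_2, "from client 2")
client_1_ok = storage.write(token_1, "from client 1")   # stale token
```

Client 1 still believes its lease is valid, but its write is refused because the storage has already seen a higher token.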
