Kindle Notes & Highlights
Read between August 2 and December 28, 2020
Ensuring that a username is unique and rejecting concurrent registrations for the same username.
With some digging, it turns out that a wide range of problems are actually reducible to consensus and are equivalent to each other:
Linearizable compare-and-set registers
Atomic transaction commit
Total order broadcast
Locks and leases
Membership/coordination service
Uniqueness constraint
Use an algorithm to automatically choose a new leader. This approach requires a consensus algorithm, and it is advisable to use a proven algorithm that correctly handles adverse network conditions.
Although a single-leader database can provide linearizability without executing a consensus algorithm on every write, it still requires consensus to maintain its leadership and for leadership changes.
Tools like ZooKeeper play an important role in providing an “outsourced” consensus, failure detection, and membership service that applications can use.
If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, then it is advisable to use something like ZooKeeper.
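ZooKeeper's ephemeral znodes are the usual building block here: a node that creates an ephemeral znode effectively holds a lease on it until its session ends. A minimal sketch using the zkCli shell that ships with ZooKeeper (server address, path, and data are made-up examples, not from the book):

    # connect to a ZooKeeper ensemble (address is an example)
    zkCli.sh -server zk1.example.com:2181

    # inside the zkCli prompt: try to take the leader role by creating
    # an ephemeral znode (-e); this fails if another live session
    # already holds /app/leader
    create -e /app/leader "worker-42"

    # read back the current leader; when the owning session dies,
    # ZooKeeper deletes the ephemeral znode and another node can take over
    get /app/leader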
Nevertheless, not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus.
Applications thus commonly use a combination of several different datastores, indexes, caches, analytics systems, etc. and implement mechanisms for moving data from one store to another.
On a high level, systems that store and process data can be grouped into two broad categories:
Systems of record
A system of record, also known as the source of truth, holds the authoritative version of the data.
When new data comes in, e.g., as user input, it is first written here.
Derived data systems
A classic example is a cache: its contents are derived from an underlying system of record and can be rebuilt if lost.
Technically speaking, derived data is redundant, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly denormalized.
Services (online systems)
Response time is usually the primary measure of performance of a service, and availability is often very important.
Batch processing systems (offline systems)
The primary performance measure of a batch job is usually throughput.
Stream processing systems (near-real-time systems)
However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data.
MapReduce, a batch processing algorithm
It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.
Although a command line like the one sketched below likely looks a bit obscure if you're unfamiliar with Unix tools, it is incredibly powerful: it will process gigabytes of log files in a matter of seconds.
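The command line this highlight refers to isn't captured in these notes. A pipeline of the kind the book analyzes, assuming nginx-style access logs in which the requested URL is the seventh whitespace-separated field, might look like this:

    cat /var/log/nginx/access.log |   # read the log (path is an example)
      awk '{print $7}' |              # extract the requested URL from each line
      sort |                          # bring identical URLs next to each other
      uniq -c |                       # collapse repeats, prefixing each URL with its count
      sort -r -n |                    # sort by that count, highest first
      head -n 5                       # keep only the five most requested URLs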
Sorting versus in-memory aggregation
The Unix pipeline example does not build an in-memory hash table of counts, but instead relies on sorting a list of URLs in which multiple occurrences of the same URL are simply repeated.
On the other hand, if the job’s working set is larger than the available memory, the sorting approach has the advantage that it can make efficient use of disks.
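For contrast, the in-memory alternative can be sketched with awk's associative arrays (field number and filename are assumptions carried over from the pipeline sketch above); here the whole table of URL counts must fit in memory:

    # build an in-memory hash table of URL -> count, then print
    # "count url" pairs and keep the five largest
    awk '{counts[$7]++} END {for (url in counts) print counts[url], url}' access.log |
      sort -r -n | head -n 5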
Mergesort has sequential access patterns that perform well on disks.
The sort utility in GNU Coreutils (Linux) automatically handles larger-than-memory datasets by spilling to disk, and automatically parallelizes sorting across multiple CPU cores.
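GNU sort exposes knobs for exactly this; a sketch with arbitrary example values:

    # sort a file much larger than RAM: cap the in-memory buffer (-S),
    # write temporary sorted runs to a scratch directory (-T), and
    # merge using 8 threads (--parallel)
    sort -S 1G -T /mnt/scratch --parallel=8 urls.txt > urls.sorted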
Doug McIlroy, the inventor of Unix pipes, first described them like this in 1964 [11]: “We should have some ways of connecting programs like [a] garden hose — screw in another segment when it becomes necessary to massage data in another way. This is the way of I/O also.”
This idea of connecting programs with pipes became part of what is now known as the Unix philosophy:
Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.
Expect the output of every program to become the input to another, as yet unknown, program.
Design and build software, even operating systems, to be tried early, ideally within weeks.
Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them.
This approach — automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking down large projects into manageable chunks — sounds remarkably like the Agile and DevOps movements of today.
The sort utility is arguably a better sorting implementation than most programming languages have in their standard libraries.
If you expect the output of one program to become the input to another program, that means those programs must use the same data format — in other words, a compatible interface.
In Unix, that interface is a file, conventionally treated as a list of records separated by the \n (newline) character.
Another characteristic feature of Unix tools is their use of standard input (stdin) and standard output (stdout).
The program doesn’t know or care where the input is coming from and where the output is going to.
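A small illustration of that decoupling (filenames are made up): the same awk program runs unchanged whether its input and output are wired to files, pipes, or the terminal.

    awk '{print $7}' < access.log > urls.txt      # input redirected from a file, output to a file
    cat access.log | awk '{print $7}' | sort -u   # input from a pipe, output into another pipe
    awk '{print $7}'                              # no redirection: stdin is the keyboard, stdout the terminal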
The input files to Unix commands are normally treated as immutable. This means you can run the commands as often as you want, trying various command-line options, without damaging the input files.
You can end the pipeline at any point, pipe the output into less, and check that it has the expected form.
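For example (filenames again made up), you can stop after any stage and inspect the intermediate result without any risk to the original log file:

    # run only the first stages of the pipeline and page through the output
    cat /var/log/nginx/access.log | awk '{print $7}' | sort | less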