Kindle Notes & Highlights
Read between August 2 and December 28, 2020
By assigning more partitions to nodes that are more powerful, you can force those nodes to take a greater share of the load.
Although in principle it’s possible to split and merge partitions (see the next section), a fixed number of partitions is operationally simpler, and so many fixed-partition databases choose not to implement partition splitting.
Thus, the number of partitions configured at the outset is the maximum number of nodes you can have, so you need to choose it high enough to accommodate future growth.
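As a rough illustration of this fixed-partition scheme (the node names, partition count, and hash choice are made up, not any particular database’s API):

    import hashlib

    NUM_PARTITIONS = 1024  # fixed at the outset; also the ceiling on node count

    def partition_for_key(key: str) -> int:
        # Hash the key and take it modulo the fixed partition count.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return digest % NUM_PARTITIONS

    # The partition-to-node assignment is just a lookup table; rebalancing
    # rewrites entries here, moving entire partitions between nodes.
    assignment = {p: f"node{p % 3}" for p in range(NUM_PARTITIONS)}  # 3 nodes

    def node_for_key(key: str) -> str:
        return assignment[partition_for_key(key)]

    print(node_for_key("foo"))

Because a key always maps to the same partition, only the assignment table changes during rebalancing, never the key-to-partition mapping.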
However, each partition also has management overhead, so it’s counterproductive to choose too high a number.
If partitions are very large, rebalancing and recovery from node failures become expensive. But if partitions are too small, they incur too much overhead.
For databases that use key range partitioning (see “Partitioning by Key Range”), a fixed number of partitions with fixed boundaries would be very inconvenient: if the boundaries were chosen badly, all of the data could end up in one partition while the others sit empty. For this reason, key-range-partitioned databases create partitions dynamically.
When a partition grows to exceed a configured size (on HBase, the default is 10 GB), it is split into two partitions so that approximately half of the data ends up on each side of the split [26]. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition.
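A toy sketch of the splitting half of that process (a tiny in-memory threshold stands in for 10 GB):

    SPLIT_THRESHOLD = 4  # toy threshold; HBase’s default is 10 GB of data

    class Partition:
        """A key-range partition modeled as a sorted list of keys."""
        def __init__(self, keys):
            self.keys = sorted(keys)

        def maybe_split(self):
            # Once the partition exceeds the threshold, split it so that
            # approximately half of the data ends up on each side.
            if len(self.keys) <= SPLIT_THRESHOLD:
                return [self]
            mid = len(self.keys) // 2
            return [Partition(self.keys[:mid]), Partition(self.keys[mid:])]

    left, right = Partition(["a", "c", "f", "k", "q", "z"]).maybe_split()
    print(left.keys, right.keys)  # ['a', 'c', 'f'] ['k', 'q', 'z']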
An advantage of dynamic partitioning is that the number of partitions adapts to the total data volume.
However, a caveat is that an empty database starts off with a single partition, since there is no a priori information about where to draw the partition boundaries. To mitigate this, HBase and MongoDB allow an initial set of partitions to be configured on an empty database (this is called pre-splitting).
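Conceptually, pre-splitting amounts to handing the database a list of initial boundary keys up front; a minimal sketch (the boundary values are made up):

    import bisect

    # Hypothetical initial boundaries, yielding four key ranges:
    # (-inf, "f"), ["f", "m"), ["m", "t"), ["t", +inf)
    initial_boundaries = ["f", "m", "t"]

    def partition_index(key: str) -> int:
        # Find which range the key falls into via binary search.
        return bisect.bisect_right(initial_boundaries, key)

    print(partition_index("apple"), partition_index("zebra"))  # 0 3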
With dynamic partitioning, the number of partitions is proportional to the size of the dataset, since splitting and merging keep each partition’s size between a fixed minimum and maximum. On the other hand, with a fixed number of partitions, the size of each partition is proportional to the size of the dataset. A third option is to make the number of partitions proportional to the number of nodes. In this case, the size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged, but when you increase the number of nodes, the partitions become smaller again.
When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place. The randomization can produce unfair splits, but when averaged over a larger number of partitions, the new node ends up taking a fair share of the load from the existing nodes.
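A toy version of that join step (the field names and splits-per-join count are made up for illustration):

    import random

    SPLITS_PER_JOIN = 2  # how many existing partitions a joining node splits

    def node_joins(partitions, new_node):
        # The new node picks random existing partitions, splits each one, and
        # takes ownership of one half; the other half stays with its old owner.
        for part in random.sample(partitions, SPLITS_PER_JOIN):
            mid = len(part["keys"]) // 2
            partitions.append({"owner": new_node, "keys": part["keys"][mid:]})
            part["keys"] = part["keys"][:mid]

    partitions = [{"owner": "node0", "keys": list("abcdefgh")},
                  {"owner": "node0", "keys": list("ijklmnop")},
                  {"owner": "node1", "keys": list("qrstuvwx")}]
    node_joins(partitions, "node2")
    for p in partitions:
        print(p["owner"], "".join(p["keys"]))  # which partitions got split varies per run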
If it is not done carefully, this process can overload the network or the nodes and harm the performance of other requests while the rebalancing is in progress.
As partitions are rebalanced, the assignment of partitions to nodes changes. Somebody needs to stay on top of those changes in order to answer the question: if I want to read or write the key “foo”, which IP address and port number do I need to connect to?
This is an instance of a more general problem called service discovery, which isn’t limited to databases.
This routing tier does not itself handle any requests; it only acts as a partition-aware load balancer.
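A minimal sketch of such a partition-aware lookup (the addresses and the key-to-partition rule are made up):

    import hashlib

    # Hypothetical routing table, kept up to date by whatever coordinates
    # rebalancing (for example ZooKeeper); addresses here are invented.
    routing_table = {0: ("10.0.0.1", 6379), 1: ("10.0.0.2", 6379), 2: ("10.0.0.3", 6379)}

    def route(key):
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        partition = digest % len(routing_table)  # simplistic key-to-partition rule
        return routing_table[partition]          # forward the request to this address

    print(route("foo"))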
Cassandra and Riak take a different approach: they use a gossip protocol among the nodes to disseminate any changes in cluster state.
This model puts more complexity in the database nodes but avoids the dependency on an external coordination service such as ZooKeeper.
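As a sketch of the gossip idea (a toy version-stamped state map, not Cassandra’s actual protocol): each node periodically exchanges its view of cluster state with a random peer, and both keep the newest version of each entry.

    import random

    def gossip_round(nodes):
        for node in nodes:
            peer = random.choice([n for n in nodes if n is not node])
            for key in set(node["state"]) | set(peer["state"]):
                a = node["state"].get(key, (0, ""))
                b = peer["state"].get(key, (0, ""))
                newest = max(a, b)  # higher version number wins
                node["state"][key] = peer["state"][key] = newest

    nodes = [{"state": {"partition-7": (1, "node0")}},
             {"state": {}},
             {"state": {"partition-7": (2, "node2")}}]  # newer assignment
    for _ in range(3):
        gossip_round(nodes)
    print([n["state"]["partition-7"] for n in nodes])  # all converge on (2, 'node2')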
However, massively parallel processing (MPP) relational database products, often used for analytics, are much more sophisticated in the types of queries they support.
For decades, transactions have been the mechanism of choice for simplifying these issues: crashes, interrupted operations, and concurrent access to the same data.
A transaction is a way for an application to group several reads and writes together into a logical unit.
Either the entire transaction succeeds (commit) or it fails (abort, rollback); if it fails, the application can safely retry.
However, sometimes there are advantages to weakening transactional guarantees or abandoning them entirely (for example, to achieve higher performance or higher availability).
Transactions were the main casualty of the NoSQL movement: many of this new generation of databases abandoned them entirely, or redefined the word to describe a much weaker set of guarantees than had previously been understood.
There emerged a popular belief that transactions were the antithesis of scalability, and that any large-scale system would have to abandon them to maintain good performance and high availability.
The safety guarantees provided by transactions are often described by the acronym ACID, which stands for Atomicity, Consistency, Isolation, and Durability.
(Systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency.)
Without atomicity, if an error occurs partway through making multiple changes, it’s difficult to know which changes have taken effect and which haven’t. The application could try again, but that risks making the same change twice, leading to duplicate or incorrect data.
In the context of ACID, consistency refers to an application-specific notion of the database being in a “good state.”
The idea of ACID consistency is that you have certain statements about your data (invariants) that must always be true.
However, this idea of consistency depends on the application’s notion of invariants, and it’s the application’s responsibility to define its transactions correctly so that they preserve consistency. If you write bad data that violates your invariants, the database can’t stop you.
Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application.
Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each other’s toes.
Classic database textbooks formalize isolation as serializability, which means that each transaction can pretend that it is the only transaction running on the entire database. The database ensures that when the transactions have committed, the result is the same as if they had run serially (one after another), even though in reality they may have run concurrently.
However, in practice, serializable isolation is rarely used, because it carries a performance penalty.
In Oracle there is an isolation level called “serializable,” but it actually implements something called snapshot isolation, which is a weaker guarantee than serializability.
In order to provide a durability guarantee, a database must wait until these writes or replications are complete before reporting a transaction as successfully committed.
One study of SSDs found that between 30% and 80% of drives develop at least one bad block during the first four years of operation. When a worn-out SSD (that has gone through many write/erase cycles) is disconnected from power, it can start losing data within a timescale of weeks to months, depending on the temperature.
In practice, there is no one technique that can provide absolute guarantees. There are only various risk-reduction techniques, including writing to disk, replicating to remote machines, and backups — and they can and should be used together.
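For instance, the writing-to-disk layer by itself looks roughly like this (the file name is arbitrary):

    import os

    def durable_append(path: str, record: bytes) -> None:
        # Append a record and force it to stable storage before acknowledging.
        with open(path, "ab") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to flush its buffers to disk
        # A real system would layer replication to remote machines and
        # periodic backups on top of this; no single technique is absolute.

    durable_append("wal.log", b"key=foo value=bar\n")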
Multi-object transactions are often needed if several pieces of data need to be kept in sync.
In relational databases, that is typically done based on the client’s TCP connection to the database server: on any particular connection, everything between a BEGIN TRANSACTION and a COMMIT statement is considered to be part of the same transaction.
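A runnable illustration using SQLite from Python (the account names and the transfer scenario are made up for the example; Python’s sqlite3 module opens the transaction implicitly before the first write):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()

    try:
        # Both writes form one logical unit on this connection: either both
        # take effect (commit) or neither does (rollback).
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        conn.commit()
    except Exception:
        conn.rollback()  # abort: no half-finished transfer becomes visible
        raise

    print(conn.execute("SELECT name, balance FROM accounts").fetchall())
    # [('alice', 50), ('bob', 50)]

Here the invariant that money is neither created nor destroyed holds at commit time because the two updates are applied atomically.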
Many nonrelational databases, on the other hand, don’t have such a way of grouping operations together.
Many of them do, however, provide more complex atomic operations, such as an increment operation (which removes the need for a read-modify-write cycle) and a compare-and-set operation (which allows a write only if the value has not been concurrently changed by someone else).
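A toy single-object register showing both operations (the class and its locking strategy are illustrative, not any database’s implementation):

    import threading

    class Register:
        """Toy single-object store with atomic increment and compare-and-set."""
        def __init__(self, value=0):
            self._value = value
            self._lock = threading.Lock()

        def increment(self, delta=1):
            # Atomic increment: callers need no read-modify-write cycle.
            with self._lock:
                self._value += delta
                return self._value

        def compare_and_set(self, expected, new):
            # Write only if nobody changed the value since we last read it.
            with self._lock:
                if self._value != expected:
                    return False
                self._value = new
                return True

    r = Register()
    r.increment()
    print(r.compare_and_set(expected=1, new=10))  # True
    print(r.compare_and_set(expected=1, new=20))  # False: value is now 10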