Kindle Notes & Highlights
Fan-out allows several independent consumers to each “tune in” to the same broadcast of messages, without affecting each other.
fan-out: delivering each message to multiple consumers.
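To make fan-out concrete, here is a minimal sketch (my own toy model, not any real broker's API): two consumers each keep a private offset into one shared log, so each receives every message without affecting the other.

```python
# Toy fan-out: one shared append-only log, independent consumer positions.
log = []  # shared sequence of messages

class ConsumerGroup:
    def __init__(self, name):
        self.name = name
        self.offset = 0  # read position, private to this group

    def poll(self):
        """Return all messages this group has not seen yet."""
        new = log[self.offset:]
        self.offset = len(log)
        return new

log.extend(["m1", "m2", "m3"])
analytics, cache = ConsumerGroup("analytics"), ConsumerGroup("cache")
assert analytics.poll() == ["m1", "m2", "m3"]
assert cache.poll() == ["m1", "m2", "m3"]  # unaffected by analytics
```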
Why can we not have a hybrid, combining the durable storage approach of databases with the low-latency notification facilities of messaging?
A log is simply an append-only sequence of records on disk.
In order to scale to higher throughput than a single disk can offer, the log can be partitioned.
Such a sequence number makes sense because a partition is append-only, so the messages within a partition are totally ordered.
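A sketch of that idea, assuming hash-based partitioning by key (the partition count and hashing scheme are invented for illustration): messages with the same key land in the same partition, and each partition hands out monotonically increasing offsets, so ordering holds within a partition.

```python
import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def append(key, value):
    """Append to the partition chosen by hashing the key; return (partition, offset)."""
    p = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS
    partitions[p].append(value)
    return p, len(partitions[p]) - 1  # offset = sequence number within the partition

p1, o1 = append("user-42", "clicked")
p2, o2 = append("user-42", "purchased")
assert p1 == p2 and o2 == o1 + 1  # same key: same partition, increasing offsets
```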
in situations with high message throughput, where each message is fast to process and where message ordering is important, the log-based approach works very well.
To reclaim disk space, the log is actually divided into segments, and from time to time old segments are deleted or moved to archive storage.
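A toy illustration of segment-based retention; the fixed segment size and segment-count budget here are assumptions for the sketch, since real brokers typically retain by time or total size.

```python
SEGMENT_SIZE = 3   # messages per segment (arbitrary for the demo)
MAX_SEGMENTS = 2   # retention budget

segments = [[]]

def append(msg):
    if len(segments[-1]) == SEGMENT_SIZE:
        segments.append([])          # roll over to a new segment
        if len(segments) > MAX_SEGMENTS:
            segments.pop(0)          # reclaim space: drop the oldest segment
    segments[-1].append(msg)

for i in range(10):
    append(f"msg-{i}")
print(segments)  # [['msg-6', 'msg-7', 'msg-8'], ['msg-9']] -- old data gone
```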
if consumers cannot keep up with producers, the broker's options are dropping messages, buffering, or applying backpressure.
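Of the three, backpressure is easy to sketch with the standard library: a bounded queue makes the producer block whenever the consumer falls behind (the sizes and delay are arbitrary).

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=5)  # bounded buffer: a full queue blocks producers

def producer():
    for i in range(20):
        buf.put(i)  # blocks here whenever the consumer lags (backpressure)

def consumer():
    for _ in range(20):
        buf.get()
        time.sleep(0.01)  # deliberately slow consumer

t1, t2 = threading.Thread(target=producer), threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
```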
the offset is under the consumer’s control, so it can easily be manipulated if necessary: for example, you can start a copy of a consumer with yesterday’s offsets and write the output to a different location, in order to repeat that processing.
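The mechanics are trivial, as this sketch shows (rewinding to the start of the log rather than to yesterday's position):

```python
# The offset is just an integer the consumer stores, so replay is one line.
log = ["e1", "e2", "e3", "e4"]
offset = 4           # consumer has processed everything
offset = 0           # rewind: reprocess the whole log, e.g. after a bug fix
assert log[offset:] == ["e1", "e2", "e3", "e4"]
```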
This aspect makes log-based messaging more like the batch processes of the last chapter, where derived data is clearly separated from input data through a repeatable transformation process.
an event is a record of something that happened at some point in time.
most nontrivial applications need to combine several different technologies in order to satisfy their requirements: for example, using an OLTP database to serve user requests, a cache to speed up common requests, a full-text index to handle search queries, and a data warehouse for analytics.
With data warehouses this synchronization is usually performed by ETL processes (see “Data Warehousing”), often by taking a full copy of a database, transforming it, and bulk-loading it into the data warehouse — in other words, a batch process.
dual writes have some serious problems, one of which is a race condition illustrated in Figure 11-4. In this example, two clients concurrently want to update an item X: client 1 wants to set the value to A, and client 2 wants to set it to B. Both clients first write the new value to the database, then write it to the search index.
Another problem with dual writes is that one of the writes may fail while the other succeeds.
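The race is easy to reproduce by hand; this sketch plays out the interleaving from Figure 11-4, leaving the database and the search index permanently inconsistent.

```python
# Dual writes give no ordering guarantee across the two systems, so the
# two writes can interleave differently at each system:
database = {}
search_index = {}

database["X"] = "A"        # client 1 writes to the database first
database["X"] = "B"        # client 2 writes to the database second
search_index["X"] = "B"    # but client 2 reaches the search index first...
search_index["X"] = "A"    # ...and client 1's index write lands last

assert database["X"] != search_index["X"]  # B vs A: silently inconsistent
```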
More recently, there has been growing interest in change data capture (CDC), which is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other systems.
CDC is especially interesting if changes are made available as a stream, immediately as they are written.
We can call the log consumers derived data systems, as discussed in the introduction to Part III.
Essentially, change data capture makes one database the leader (the one from which the changes are captured), and turns the others into followers.
the system of record database does not wait for the change to be applied to consumers before committing it. This design has the operational advantage that adding a slow consumer does not affect the system of record too much, but it has the downside that all the issues of replication lag apply (see “Problems with Replication Lag”).
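A minimal sketch of that leader/follower arrangement (a toy model, not a real CDC connector): every write to the system of record is captured, in write order, into a change log, and a follower applies the log asynchronously at its own pace.

```python
changelog = []          # ordered stream of change events
system_of_record = {}   # the leader

def write(key, value):
    system_of_record[key] = value
    changelog.append(("set", key, value))  # capture the change, in write order

class Follower:
    def __init__(self):
        self.state, self.pos = {}, 0

    def catch_up(self):
        """Apply any changes not yet seen; may lag arbitrarily behind."""
        for op, key, value in changelog[self.pos:]:
            self.state[key] = value
        self.pos = len(changelog)

write("X", "A"); write("X", "B")
index = Follower()
index.catch_up()
assert index.state == system_of_record  # same order of writes, same result
```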
if you don’t have the entire log history, you need to start with a consistent snapshot, as previously discussed in “Setting Up New Followers”.
There are some parallels between the ideas we’ve discussed here and event sourcing, a technique that was developed in the domain-driven design (DDD) community.
event sourcing involves storing all changes to the application state as a log of change events.
The log of changes is extracted from the database at a low level (e.g., by parsing the replication log), which ensures that the order of writes extracted from the database matches the order in which they were actually written, avoiding the race condition in Figure 11-4.
In event sourcing, the application logic is explicitly built on the basis of immutable events that are written to an event log.
Events are designed to reflect things that happened at the application level, rather than low-level state changes.
Thus, applications that use event sourcing need to take the log of events (representing the data written to the system) and transform it into application state that is suitable for showing to a user.
The transformation can use arbitrary logic, but it should be deterministic so that you can run it again and derive the same application state from the event log.
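A sketch of such a transformation: a deterministic fold over the event log (the event shapes are invented), so replaying the same log always produces the same state.

```python
events = [
    {"type": "item_added",   "cart": "c1", "item": "book"},
    {"type": "item_added",   "cart": "c1", "item": "pen"},
    {"type": "item_removed", "cart": "c1", "item": "pen"},
]

def apply(state, event):
    """Pure-logic step: fold one event into the current state."""
    items = state.setdefault(event["cart"], set())
    if event["type"] == "item_added":
        items.add(event["item"])
    elif event["type"] == "item_removed":
        items.discard(event["item"])
    return state

def project(log):
    state = {}
    for e in log:
        state = apply(state, e)
    return state

assert project(events) == project(events)  # deterministic: replay-safe
assert project(events) == {"c1": {"book"}}
```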
the current value for a primary key is entirely determined by the most recent event for that primary key, and so log compaction can discard previous events for the same key.
an event typically expresses the intent of a user action, not the mechanics of the state update that occurred as a result of the action.
later events typically do not override prior events, and so you need the full history of events to reconstruct the final state.
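The contrast in miniature (event shapes invented for illustration): CDC-style events carry the whole new value, so the latest event per key suffices and older ones can be compacted away; with intent-style events, every event matters.

```python
# CDC style: each event contains the full new value for its key.
cdc_events = [("cart", {"items": ["book"]}), ("cart", {"items": ["book", "pen"]})]
state_from_cdc = dict(cdc_events)  # keeping only the last event per key is enough
assert state_from_cdc["cart"] == {"items": ["book", "pen"]}

# Event sourcing style: each event records an intent; fold the full history.
intent_events = [("add", "book"), ("add", "pen"), ("remove", "pen")]
items = set()
for action, item in intent_events:  # the final state depends on every event
    if action == "add":
        items.add(item)
    else:
        items.discard(item)
assert items == {"book"}
```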
The event sourcing philosophy is careful to distinguish between events and commands [48].
The application must first validate that it can execute the command.
If the validation is successful and the command is accepted, it becomes an event, which is durable and immutable.
At the point when the event is generated, it becomes a fact.
a change or cancellation is a separate event that is added later.
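A sketch of that lifecycle, using a seat-reservation command of the kind the book's discussion draws on (the function and event names are my own): a command is validated synchronously and may be rejected; only an accepted command becomes an immutable event.

```python
events = []          # immutable, append-only facts
seats_taken = set()  # current state, derived from past events

def handle_reserve_seat(seat):
    """Command handler: validate first, then record a fact."""
    if seat in seats_taken:
        raise ValueError(f"seat {seat} already taken")  # command rejected
    events.append({"type": "seat_reserved", "seat": seat})  # now a fact
    seats_taken.add(seat)

handle_reserve_seat("14B")
try:
    handle_reserve_seat("14B")
except ValueError:
    pass                       # second command rejected at validation time
assert len(events) == 1        # only the accepted command became an event
```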
batch processing benefits from the immutability of its input files,
The biggest downside of event sourcing and change data capture is that the consumers of the event log are usually asynchronous, so there is a possibility that a user may make a write to the log, then read from a log-derived view and find that their write has not yet been reflected in the read view.
One solution would be to perform the updates of the read view synchronously with appending the event to the log. This requires a transaction to combine the writes into an atomic unit, so either you need to keep the event log and the read view in the same storage system, or you need a distributed transaction across the different systems.
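An alternative that avoids the transaction (a sketch under simplifying assumptions, with the asynchronous apply step run inline): the writer remembers the log offset of its write, and a read waits until the derived view has applied at least that offset.

```python
log = []
view = {"applied_up_to": 0, "data": {}}

def write(key, value):
    log.append((key, value))
    return len(log)  # the offset the client must wait for

def apply_next():
    """Apply one log entry to the view; asynchronous in a real system."""
    key, value = log[view["applied_up_to"]]
    view["data"][key] = value
    view["applied_up_to"] += 1

def read(key, min_offset):
    while view["applied_up_to"] < min_offset:
        apply_next()  # applied inline here; a real client would wait/retry
    return view["data"][key]

offset = write("X", "A")
assert read("X", offset) == "A"  # the user sees their own write
```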
sometimes, for example to comply with privacy regulations, you actually want to rewrite history and pretend that the data was never written in the first place.
A piece of code that processes streams like this is known as an operator or a job.
The one crucial difference to batch jobs is that a stream never ends.
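In code, the difference shows up as a loop with no natural end; this sketch wires an unbounded source through an operator and truncates only so the demo terminates.

```python
import itertools

def source():
    """Unbounded input: in a real system this would block waiting for events."""
    for i in itertools.count():
        yield {"n": i}

def double(stream):
    """A stream operator: consumes one unbounded stream, produces another."""
    for event in stream:
        yield {"n": event["n"] * 2}

# The pipeline itself never ends; we cut it off only for the demo:
print(list(itertools.islice(double(source()), 5)))  # [{'n': 0}, ..., {'n': 8}]
```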
Stream processing has long been used for monitoring purposes, where an organization wants to be alerted if certain things happen.
Complex event processing (CEP) is an approach developed in the 1990s for analyzing event streams, especially geared toward the kind of application that requires searching for certain event patterns.
Whereas a typical database stores data durably and treats queries as transient, CEP engines reverse these roles: queries are stored long-term, and events from the input streams continuously flow past them in search of a query that matches an event pattern.
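A sketch of that inversion (the pattern, “three consecutive failed logins for one user”, is an invented example): the query object persists while events flow past it one at a time.

```python
from collections import defaultdict

class FailedLoginPattern:
    """Stored query: fires when a user fails to log in 3 times in a row."""
    def __init__(self):
        self.streak = defaultdict(int)

    def feed(self, event):
        if event["type"] == "login_failed":
            self.streak[event["user"]] += 1
            if self.streak[event["user"]] == 3:
                return f"ALERT: 3 failed logins for {event['user']}"
        else:
            self.streak[event["user"]] = 0
        return None

query = FailedLoginPattern()  # the query persists; events flow past it
stream = [{"type": "login_failed", "user": "bob"}] * 3
alerts = [a for e in stream if (a := query.feed(e))]
assert alerts == ["ALERT: 3 failed logins for bob"]
```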