Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Fan-out allows several independent consumers to each “tune in” to the same broadcast of messages, without affecting each other
fan-out: delivering each message to multiple consumers.
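A minimal sketch of the fan-out pattern in Python (the broker and consumer names here are invented for illustration, not from the book): every subscriber gets its own queue, and each published message is copied into all of them, so consumers can proceed independently of one another.

```python
from collections import deque

class FanOutBroker:
    """Toy broker: every published message is delivered to all subscribers."""

    def __init__(self):
        self.subscriber_queues = {}

    def subscribe(self, name):
        # Each consumer gets its own queue, so a slow consumer does not hold up the others.
        self.subscriber_queues[name] = deque()

    def publish(self, message):
        # Fan-out: copy the message into every subscriber's queue.
        for queue in self.subscriber_queues.values():
            queue.append(message)

    def poll(self, name):
        queue = self.subscriber_queues[name]
        return queue.popleft() if queue else None

broker = FanOutBroker()
broker.subscribe("search-indexer")
broker.subscribe("cache-invalidator")
broker.publish({"user_id": 42, "action": "update_profile"})

print(broker.poll("search-indexer"))      # both consumers receive the same message
print(broker.poll("cache-invalidator"))
```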
Why can we not have a hybrid, combining the durable storage approach of databases with the low-latency notification facilities of messaging?
A log is simply an append-only sequence of records on disk.
In order to scale to higher throughput than a single disk can offer, the log can be partitioned
Such a sequence number makes sense because a partition is append-only, so the messages within a partition are totally ordered.
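A sketch of that idea (an in-memory stand-in, not how a broker like Kafka actually stores data): appending to a partition assigns the next monotonically increasing offset, so messages within a partition are totally ordered, and messages with the same key stay in the same partition.

```python
class PartitionedLog:
    """In-memory stand-in for a partitioned, append-only log."""

    def __init__(self, num_partitions=4):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Messages with the same key always go to the same partition,
        # so they keep their relative order.
        partition = hash(key) % len(self.partitions)
        self.partitions[partition].append(message)
        offset = len(self.partitions[partition]) - 1   # append-only => offsets only ever grow
        return partition, offset

    def read(self, partition, offset):
        # A consumer reads sequentially from a given offset within a partition.
        return self.partitions[partition][offset:]

log = PartitionedLog()
p, o = log.append("user-42", {"event": "page_view"})
log.append("user-42", {"event": "purchase"})
print(log.read(p, o))   # both events, in the order they were appended
```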
in situations with high message throughput, where each message is fast to process and where message ordering is important, the log-based approach works very well.
To reclaim disk space, the log is actually divided into segments, and from time to time old segments are deleted or moved
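One way to picture that (a simplification, not any broker's real on-disk format): the log is written as a list of fixed-size segments, and a retention policy drops whole segments from the front to reclaim space.

```python
class SegmentedLog:
    """Append-only log split into fixed-size segments so old data can be dropped cheaply."""

    def __init__(self, segment_size=3, max_segments=2):
        self.segment_size = segment_size
        self.max_segments = max_segments
        self.segments = [[]]

    def append(self, record):
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append([])        # roll over to a new segment
        self.segments[-1].append(record)
        if len(self.segments) > self.max_segments:
            self.segments.pop(0)            # reclaim space: delete the oldest segment

log = SegmentedLog()
for i in range(8):
    log.append(f"msg-{i}")
print(log.segments)   # only the most recent segments survive
```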
dropping messages, buffering, or applying backpressure.
the offset is under the consumer’s control, so it can easily be manipulated if necessary:
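Because the broker only stores the log, the consumer's position is just a number the consumer itself owns; resetting it replays old messages. A hypothetical sketch:

```python
class LogConsumer:
    """Consumer that tracks its own position in an append-only log."""

    def __init__(self, log):
        self.log = log      # the shared, append-only list of messages
        self.offset = 0     # entirely under the consumer's control

    def poll(self):
        batch = self.log[self.offset:]
        self.offset = len(self.log)
        return batch

    def seek(self, offset):
        # Rewind (or skip ahead) to reprocess or ignore messages, e.g. after fixing a bug.
        self.offset = offset

log = ["event-0", "event-1", "event-2"]
consumer = LogConsumer(log)
consumer.poll()         # processes event-0..2
consumer.seek(1)        # rewind: replay everything from offset 1
print(consumer.poll())  # ['event-1', 'event-2']
```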
This aspect makes log-based messaging more like the batch processes of the last chapter, where derived data is clearly separated from input data through a repeatable transformation process.
an event is a record of something that happened at some point in time.
most nontrivial applications need to combine several different technologies in order to satisfy their requirements:
With data warehouses this synchronization is usually performed by ETL processes (see “Data Warehousing”), often by taking a full copy of a database, transforming it, and bulk-loading it into the data warehouse — in other words, a batch process.
dual writes have some serious problems, one of which is a race condition illustrated in Figure 11-4. In this example, two clients concurrently want to update an item X: client 1 wants to set the value to A, and client 2 wants to set it to B. Both clients first write the new value to the database, then write it to the search index.
Another problem with dual writes is that one of the writes may fail while the other succeeds.
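A compressed illustration of both problems (the stores and the interleaving below are simplified, not a reproduction of Figure 11-4): with dual writes, an unlucky interleaving leaves the database and the search index with different final values, and a crash between the two writes leaves them permanently inconsistent.

```python
database, search_index = {}, {}

def dual_write(key, value, fail_second_write=False):
    database[key] = value              # write 1: the database
    if fail_second_write:
        raise RuntimeError("crashed before updating the search index")
    search_index[key] = value          # write 2: the search index

# Race condition: client 1 and client 2 interleave their two writes.
database["X"] = "A"        # client 1 writes to the database first...
database["X"] = "B"        # ...then client 2 overwrites it...
search_index["X"] = "B"    # ...client 2 updates the index...
search_index["X"] = "A"    # ...and client 1's index write arrives last.
print(database["X"], search_index["X"])   # B A -- the two systems now disagree

# Partial failure: one write succeeds, the other does not.
try:
    dual_write("Y", "new-value", fail_second_write=True)
except RuntimeError:
    pass
print(database.get("Y"), search_index.get("Y"))   # new-value None
```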
More recently, there has been growing interest in change data capture (CDC), which is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other systems.
CDC is especially interesting if changes are made available as a stream,
We can call the log consumers derived data systems,
Essentially, change data capture makes one database the leader (the one from which the changes are captured), and turns the others into followers.
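A sketch of that leader/follower relationship, assuming a change stream is already available (the event shape below is invented for illustration): a follower such as a search index simply applies the change events in log order.

```python
# Change events as they might appear in a CDC stream, in commit order.
change_stream = [
    {"op": "insert", "key": "user:1", "value": {"name": "Alice"}},
    {"op": "update", "key": "user:1", "value": {"name": "Alice Smith"}},
    {"op": "delete", "key": "user:1", "value": None},
]

search_index = {}   # a follower: derived entirely from the leader's change log

def apply_change(index, event):
    """Apply one change event to the derived system, in log order."""
    if event["op"] == "delete":
        index.pop(event["key"], None)
    else:
        index[event["key"]] = event["value"]

for event in change_stream:
    apply_change(search_index, event)

print(search_index)   # {} -- the follower converges on the leader's state
```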
the system of record database does not wait for the change to be applied to consumers before committing it. This design has the operational advantage that adding a slow consumer does not affect the system of record too much, but it has the downside that all the issues of replication lag apply (see “Problems with Replication Lag”).
if you don’t have the entire log history, you need to start with a consistent snapshot, as previously discussed in “Setting Up New Followers”.
There are some parallels between the ideas we’ve discussed here and event sourcing, a technique that was developed in the domain-driven design (DDD) community
event sourcing involves storing all changes to the application state as a log of change events.
The log of changes is extracted from the database at a low level
which ensures that the order of writes extracted from the database matches the order in which they were actually written, avoiding the race condition
In event sourcing, the application logic is explicitly built on the basis of immutable events that are written to an event log.
Events are designed to reflect things that happened at the application level, rather than low-level state changes.
Thus, applications that use event sourcing need to take the log of events (representing the data written to the system) and transform it into application state that is suitable for showing to a user
but it should be deterministic so that you can run it again and derive the same application state from the event log.
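A minimal sketch of that replay, with an invented event schema: the current state is a pure, deterministic fold over the event log, so running it again over the same log always yields the same application state.

```python
from functools import reduce

event_log = [
    {"type": "CartCreated", "cart_id": "c1"},
    {"type": "ItemAdded", "cart_id": "c1", "item": "book", "qty": 2},
    {"type": "ItemAdded", "cart_id": "c1", "item": "pen", "qty": 1},
    {"type": "ItemRemoved", "cart_id": "c1", "item": "pen"},
]

def apply_event(state, event):
    """Deterministic state transition: same log in, same state out."""
    if event["type"] == "CartCreated":
        state[event["cart_id"]] = {}
    elif event["type"] == "ItemAdded":
        state[event["cart_id"]][event["item"]] = event["qty"]
    elif event["type"] == "ItemRemoved":
        state[event["cart_id"]].pop(event["item"], None)
    return state

current_state = reduce(apply_event, event_log, {})
print(current_state)   # {'c1': {'book': 2}} -- a derived view, suitable for showing to a user
```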
the current value for a primary key is entirely determined by the most recent event for that primary key,
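In that case (change data capture, as opposed to event sourcing) the log can be compacted, because older events for the same key are redundant; a toy sketch with a hypothetical record shape:

```python
def compact(log):
    """Keep only the most recent event per primary key; earlier ones are redundant."""
    latest = {}
    for event in log:                # later events overwrite earlier ones for the same key
        latest[event["key"]] = event
    return list(latest.values())

cdc_log = [
    {"key": "user:1", "value": "Alice"},
    {"key": "user:2", "value": "Bob"},
    {"key": "user:1", "value": "Alice Smith"},   # supersedes the first event for user:1
]
print(compact(cdc_log))
```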
an event typically expresses the intent of a user action, not the mechanics of the state update that occurred as a result of the action.
later events typically do not override prior events, and so you need the full history of events to reconstruct the final state.
The event sourcing philosophy is careful to distinguish between events and commands [48].
The application must first validate that it can execute the command.
If the validation is successful and the command is accepted, it becomes an event,
At the point when the event is generated, it becomes a fact.
change or cancellation is a separate event that is added later.
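A sketch of that command/event split (the names and the seat-reservation schema are invented for illustration): the command is validated synchronously, and only if it passes does an immutable event get appended; later changes are new events rather than edits to old ones.

```python
import datetime

event_log = []        # append-only: events are facts and are never modified
seats_taken = set()   # current state, derived from the event log

def handle_reserve_seat(user_id, seat):
    """A command: a request that may still fail validation."""
    if seat in seats_taken:
        raise ValueError(f"seat {seat} is already taken")   # command rejected, no event written
    # Validation passed: the command becomes an immutable event, i.e. a fact.
    event = {
        "type": "SeatReserved",
        "user_id": user_id,
        "seat": seat,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    event_log.append(event)
    seats_taken.add(seat)
    return event

def handle_cancel_reservation(user_id, seat):
    # A cancellation does not delete the original event; it is a separate, later event.
    event = {"type": "ReservationCancelled", "user_id": user_id, "seat": seat}
    event_log.append(event)
    seats_taken.discard(seat)
    return event

handle_reserve_seat("u1", "14B")
handle_cancel_reservation("u1", "14B")
print([e["type"] for e in event_log])   # ['SeatReserved', 'ReservationCancelled']
```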
batch processing benefits from the immutability of its input files,
The biggest downside of event sourcing and change data capture is that the consumers of the event log are usually asynchronous, so there is a possibility that a user may make a write to the log, then read from a log-derived view and find that their write has not yet been reflected in the read view.
This requires a transaction to combine the writes into an atomic unit, so either you need to keep the event log and the read view in the same storage system, or you need a distributed transaction across the different systems.
you actually want to rewrite history and pretend that the data was never written in the first place.
A piece of code that processes streams like this is known as an operator or a job.
The one crucial difference to batch jobs is that a stream never ends.
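One way to picture an operator over a never-ending stream (a toy sketch, not any framework's actual API): a Python generator that consumes events one at a time and emits derived events, with no notion of "end of input".

```python
import itertools
import random

def sensor_readings():
    """An unbounded input stream; in practice this would be a log-based topic."""
    for seq in itertools.count():
        yield {"seq": seq, "temperature": random.uniform(15.0, 35.0)}

def high_temperature_alerts(stream, threshold=30.0):
    """A stream operator: processes events as they arrive and emits derived events, forever."""
    for reading in stream:
        if reading["temperature"] > threshold:
            yield {"alert": "too hot", "seq": reading["seq"]}

# The stream never ends, so we only take a finite slice for the demo.
for alert in itertools.islice(high_temperature_alerts(sensor_readings()), 3):
    print(alert)
```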
Stream processing has long been used for monitoring purposes, where an organization wants to be alerted if certain things happen.
Complex event processing (CEP) is an approach developed in the 1990s for analyzing event streams, especially geared toward the kind of application that requires searching for certain event patterns
CEP engines reverse these roles: queries are stored long-term, and events from the input streams continuously flow past them in search of a query that matches an event pattern
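A toy illustration of that inversion (not any real CEP engine's query language): the pattern is registered once and stays resident, while events flow past it; when the pattern matches, the engine emits a complex event.

```python
from collections import deque

class PatternQuery:
    """A long-lived query: N failed logins from the same user within a window of sequence numbers."""

    def __init__(self, n=3, window=5):
        self.n = n
        self.window = window
        self.recent = {}   # user -> deque of sequence numbers of failed logins

    def feed(self, event):
        # Events continuously flow past the stored query.
        if event["type"] != "login_failed":
            return None
        seq = event["seq"]
        failures = self.recent.setdefault(event["user"], deque())
        failures.append(seq)
        while failures and seq - failures[0] > self.window:
            failures.popleft()
        if len(failures) >= self.n:
            return {"complex_event": "possible_brute_force", "user": event["user"]}
        return None

query = PatternQuery()                       # the query is stored long-term...
stream = [
    {"seq": 1, "type": "login_failed", "user": "alice"},
    {"seq": 2, "type": "login_ok", "user": "bob"},
    {"seq": 3, "type": "login_failed", "user": "alice"},
    {"seq": 4, "type": "login_failed", "user": "alice"},
]
for event in stream:                         # ...and the events flow past it
    match = query.feed(event)
    if match:
        print(match)                         # emitted when the pattern is detected
```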