Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
On the other hand, in situations with high message throughput, where each message is fast to process and where message ordering is important, the log-based approach works very well.
Since partitioned logs typically preserve message ordering only within a single partition, all messages that need to be ordered consistently need to be routed to the same partition.
This can be achieved by choosing the partition for an event based on the user ID of that event (in other words, making the user ID the partitioning key).
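A minimal sketch of that routing rule, assuming a fixed partition count and a stable hash (the helper name and the use of MD5 here are illustrative, not any particular broker's implementation):

```python
import hashlib

NUM_PARTITIONS = 8  # illustrative; a real topic's partition count is fixed up front

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route an event to a partition by hashing its partitioning key.

    Because the hash of a given user ID is stable, all events for that
    user land in the same partition and are therefore delivered to
    consumers in a consistent order relative to each other.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event for the same user is routed to the same partition:
assert partition_for("user-123") == partition_for("user-123")
```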
All messages with an offset less than a consumer’s current offset have already been processed, and all messages with a greater offset have not yet been seen.
Thus, the broker does not need to track acknowledgments for every single message — it only needs to periodically record the consumer offsets.
The message broker behaves like a leader database, and the consumer like a follower.
If a consumer node fails, another node in the consumer group is assigned the failed consumer’s partitions, and it starts consuming messages at the last recorded offset.
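A toy model of this offset bookkeeping, under assumed names and in-memory partitions (real brokers persist offsets far more efficiently):

```python
class ConsumerGroup:
    """Toy model: the broker records one offset per partition per group.

    Every message below the offset counts as processed; there are no
    per-message acknowledgments to track.
    """

    def __init__(self, partition_logs):
        self.partition_logs = partition_logs           # partition id -> list of messages
        self.offsets = {p: 0 for p in partition_logs}  # periodically persisted by the broker

    def poll(self, partition, max_messages=10):
        log = self.partition_logs[partition]
        start = self.offsets[partition]
        batch = log[start:start + max_messages]
        # Advancing the recorded offset is the only acknowledgment needed.
        self.offsets[partition] = start + len(batch)
        return batch

    def reassign_after_failure(self, partition):
        """A replacement consumer simply resumes from the last recorded
        offset for the failed consumer's partition."""
        return self.offsets[partition]

group = ConsumerGroup({0: ["m0", "m1", "m2"]})
assert group.poll(0, max_messages=2) == ["m0", "m1"]
assert group.reassign_after_failure(0) == 2  # the next consumer starts at offset 2
```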
The log is actually divided into segments, and from time to time old segments are deleted or moved to archive storage.
Effectively, the log implements a bounded-size buffer that discards old messages when it gets full, also known as a circular buffer or ring buffer.
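In memory, the same discard semantics fall out of a bounded deque; this sketch illustrates only the buffer behavior, not the on-disk segment mechanics:

```python
from collections import deque

# A bounded buffer that silently discards the oldest entries when full --
# the same effect that deleting old segments has on the on-disk log.
log = deque(maxlen=1000)

for i in range(1500):
    log.append(f"message-{i}")

# The oldest 500 messages have been dropped to make room:
assert log[0] == "message-500"
assert len(log) == 1000
```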
You can monitor how far a consumer is behind the head of the log, and raise an alert if it falls behind significantly.
In a log-based message broker, consuming messages is more like reading from a file: it is a read-only operation that does not change the log. The only side effect of processing, besides any output of the consumer, is that the consumer offset moves forward. But the offset is under the consumer’s control, so it can easily be manipulated if necessary.
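A sketch of both ideas, lag monitoring and offset manipulation, using an illustrative consumer whose only mutable state is its offset:

```python
class LogConsumer:
    """Illustrative consumer: reading never mutates the shared log;
    only this consumer's own offset moves."""

    ALERT_THRESHOLD = 100_000  # illustrative lag threshold

    def __init__(self, log):
        self.log = log    # shared append-only list of messages
        self.offset = 0   # position of the next message to read

    def lag(self) -> int:
        """Distance behind the head of the log; alert if it grows large."""
        return len(self.log) - self.offset

    def read(self):
        """A read-only operation on the log; the only side effect is
        that this consumer's offset moves forward."""
        if self.offset < len(self.log):
            message = self.log[self.offset]
            self.offset += 1
            return message
        return None

    def rewind(self, offset: int) -> None:
        """Because the offset is under the consumer's control, it can
        be moved back to reprocess old messages, e.g. after fixing a bug."""
        self.offset = offset

consumer = LogConsumer(["m0", "m1", "m2"])
consumer.read(); consumer.read()
assert consumer.lag() == 1
consumer.rewind(0)              # replay the log from the beginning
assert consumer.read() == "m0"
```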
In fact, a replication log (see “Implementation of Replication Logs”) is a stream of database write events, produced by the leader as it processes transactions. The followers apply that stream of writes to their own copy of the database and thus end up with an accurate copy of the same data.
If periodic full database dumps are too slow, an alternative that is sometimes used is dual writes, in which the application code explicitly writes to each of the systems when data changes: for example, first writing to the database, then updating the search index, then invalidating the cache entries (or even performing those writes concurrently).
However, dual writes have some serious problems, one of which is a race condition (illustrated in the sketch below).
Another problem with dual writes is that one of the writes may fail while the other succeeds.
Ensuring that they either both succeed or both fail is a case of the atomic commit problem, which is expensive to solve.
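A deliberately broken sketch to make both failure modes concrete (the two stores and the interleaving are illustrative):

```python
import threading

database = {}
search_index = {}

def dual_write(key, value):
    """The problematic pattern: two independent writes with no
    atomicity and no ordering guarantee across concurrent clients."""
    database[key] = value
    # If the process crashes here, the database and the search index
    # disagree forever: the partial-failure problem.
    search_index[key] = value

# The race condition: two clients write concurrently, and the two
# systems may each end up with a *different* client's value, because
# nothing forces the two pairs of writes to interleave consistently.
t1 = threading.Thread(target=dual_write, args=("user:42", "picture-A"))
t2 = threading.Thread(target=dual_write, args=("user:42", "picture-B"))
t1.start(); t2.start(); t1.join(); t2.join()
print(database["user:42"], search_index["user:42"])  # may differ
```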
The situation would be better if there really was only one leader — for example, the database — and if we could make the search index a follower of the database. But is this possible in practice?
The problem with most databases’ replication logs is that they have long been considered to be an internal implementation detail of the database, not a public API.
More recently, there has been growing interest in change data capture (CDC), which is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other systems. CDC is especially interesting if changes are made available as a stream, immediately as they are written.
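The consuming side of CDC can be as simple as applying an ordered stream of change events to a derived store; the event shape below is illustrative, not any particular CDC tool's format:

```python
# A derived system (a toy search index) kept in sync by applying the
# database's change stream in log order, exactly like a follower.
search_index = {}

def apply_change(event: dict) -> None:
    if event["op"] == "upsert":
        search_index[event["key"]] = event["value"]
    elif event["op"] == "delete":
        search_index.pop(event["key"], None)

change_stream = [
    {"op": "upsert", "key": "doc:1", "value": "hello"},
    {"op": "upsert", "key": "doc:1", "value": "hello world"},
    {"op": "delete", "key": "doc:1"},
]

for event in change_stream:
    apply_change(event)

assert "doc:1" not in search_index  # the delete was applied last
```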
We can call the log consumers derived data systems: the data stored in the search index and the data warehouse is just another view onto the data in the system of record.
Change data capture makes one database the leader (the one from which the changes are captured), and turns the others into followers.
Database triggers can be used to implement change data capture.
However, they tend to be fragile and have significant performance overheads.
Parsing the replication log can be a more robust approach, although it also comes with challenges, such as handling schema changes.
Thus, if you don’t have the entire log history, you need to start with a consistent snapshot, as previously discussed.
The principle is simple: the storage engine periodically looks for log records with the same key, throws away any duplicates, and keeps only the most recent update for each key.
The disk space required for such a compacted log depends only on the current contents of the database, not the number of writes that have ever occurred in the database.
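The core of that principle fits in a few lines; this sketch keeps the last write per key in a single pass (real storage engines compact segment by segment, in the background):

```python
def compact(log):
    """Keep only the most recent update for each key, preserving order.

    The compacted log's size is proportional to the number of distinct
    keys (the current database contents), not to the total number of
    writes that ever occurred.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later writes shadow earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors]

log = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
assert compact(log) == [("b", 4), ("a", 5)]
```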
Event sourcing is a powerful technique for data modeling: from an application point of view it is more meaningful to record the user’s actions as immutable events, rather than recording the effect of those actions on a mutable database.
On the other hand, with event sourcing, events are modeled at a higher level: an event typically expresses the intent of a user action, not the mechanics of the state update that occurred as a result of the action. In this case, later events typically do not override prior events, and so you need the full history of events to reconstruct the final state. Log compaction is not possible in the same way.
Applications that use event sourcing typically have some mechanism for storing snapshots of the current state that is derived from the log of events, so they don’t need to repeatedly reprocess the full log.
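A sketch of event sourcing with snapshots; the shopping-cart domain and event names are made up for illustration:

```python
events = [
    {"type": "item_added", "item": "book"},
    {"type": "item_added", "item": "pen"},
    {"type": "item_removed", "item": "pen"},
]

def apply(state: frozenset, event: dict) -> frozenset:
    """Fold one immutable event into the derived state."""
    if event["type"] == "item_added":
        return state | {event["item"]}
    if event["type"] == "item_removed":
        return state - {event["item"]}
    return state

def replay(events, snapshot=frozenset(), from_offset=0):
    """Reconstruct current state from a snapshot plus the events
    appended after it, instead of reprocessing the full log."""
    state = snapshot
    for event in events[from_offset:]:
        state = apply(state, event)
    return state

# Take a snapshot after two events, then replay only the tail:
snapshot = replay(events[:2])
assert replay(events, snapshot=snapshot, from_offset=2) == frozenset({"book"})
```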
Batch processing benefits from the immutability of its input files.
We normally think of databases as storing the current state of the application — this representation is optimized for reads, and it is usually the most convenient for serving queries.
Whenever you have state that changes, that state is the result of the events that mutated it over time.
No matter how the state changes, there was always a sequence of events that caused those changes.
If you store the changelog durably, that simply has the effect of making the state reproducible.
As Pat Helland puts it [52]: Transaction logs record all the changes made to the database. High-speed appends are the only way to change the log. From this perspective, the contents of the database hold a caching of the latest record values in the logs. The truth is the log. The database is a cache of a subset of the log. That cached subset happens to be the latest value of each record and index value from the log.
Log compaction, as discussed in “Log compaction”, is one way of bridging the distinction between log and database state: it retains only the latest version of each record, and discards overwritten versions.
Such auditability is particularly important in financial systems.
Immutable events also capture more information than just the current state.
Moreover, by separating mutable state from the immutable event log, you can derive several different read-oriented representations from the same log of events.
Storing data is normally quite straightforward if you don’t have to worry about how it is going to be queried and accessed; many of the complexities of schema design, indexing, and storage engines are the result of wanting to support certain query and access patterns.
For this reason, you gain a lot of flexibility by separating the form in which data is written from the form it is read, and by allowing several different read views.
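For instance, one write-optimized event log can feed several read-optimized views, each denormalized however its queries need (the views below are illustrative):

```python
event_log = [
    {"user": "alice", "action": "purchase", "amount": 30},
    {"user": "bob",   "action": "purchase", "amount": 10},
    {"user": "alice", "action": "purchase", "amount": 5},
]

def per_user_totals(log):
    """One read view: spending per user."""
    totals = {}
    for event in log:
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    return totals

def total_revenue(log):
    """Another read view, derived from the very same log."""
    return sum(event["amount"] for event in log)

# Both views are derived from, and can be rebuilt from, the same log:
assert per_user_totals(event_log) == {"alice": 35, "bob": 10}
assert total_revenue(event_log) == 45
```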
Debates about normalization and denormalization (see “Many-to-One and Many-to-Many Relationships”) become largely irrelevant if you can translate data from a write-optimized event log to read-optimized application state.
The biggest downside of event sourcing and change data capture is that the consumers of the event log are usually asynchronous, so there is a possibility that a user may make a write to the log, then read from a log-derived view and find that their write has not yet been reflected in the read view.
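One possible mitigation (sketched here under assumed names, not a prescription from the book) is to have each write return its log offset, and have reads wait until the derived view has caught up to at least that offset:

```python
import time

event_log = []
view = {}                 # derived read view, updated asynchronously by a consumer
view_applied_up_to = -1   # offset of the last event applied to the view

def write(key, value) -> int:
    """Append to the log and return the new event's offset."""
    event_log.append((key, value))
    return len(event_log) - 1

def read_own_write(key, min_offset: int, timeout: float = 1.0):
    """Block until the view has applied at least min_offset, so the
    caller is guaranteed to observe its own preceding write."""
    deadline = time.monotonic() + timeout
    while view_applied_up_to < min_offset:
        if time.monotonic() > deadline:
            raise TimeoutError("read view has not caught up yet")
        time.sleep(0.01)
    return view.get(key)
```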
With event sourcing, you can design an event such that it is a self-contained description of a user action.
If the event log and the application state are partitioned in the same way (for example, processing an event for a customer in partition 3 only requires updating partition 3 of the application state), then a straightforward single-threaded log consumer needs no concurrency control for writes.
Many systems that don’t use an event-sourced model nevertheless rely on immutability: various databases internally use immutable data structures or multi-version data to support point-in-time snapshots.
Besides the performance reasons, there may also be circumstances in which you need data to be deleted for administrative or legal reasons, in spite of all immutability.