Kindle Notes & Highlights
This means that if a slow consumer cannot keep up with the rate of messages, and it falls so far behind that its consumer offset points to a deleted segment, it will miss some of the messages.
You can monitor how far a consumer is behind the head of the log, and raise an alert if it falls behind significantly.
The only side effect of processing, besides any output of the consumer, is that the consumer offset moves forward.
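Note: a minimal in-memory sketch of this idea (names like Partition, poll, and lag are illustrative, not any real broker's API). The broker keeps one offset per consumer group, consuming only moves that offset forward, and lag is the distance between the offset and the head of the log.

    # Sketch of a log-based broker partition with a per-consumer-group offset.
    class Partition:
        def __init__(self):
            self.log = []          # append-only list of messages
            self.offsets = {}      # consumer group -> next offset to read

        def append(self, message):
            self.log.append(message)

        def poll(self, group):
            """Return the next message for this group and advance its offset."""
            pos = self.offsets.get(group, 0)
            if pos >= len(self.log):
                return None                    # caught up with the head of the log
            self.offsets[group] = pos + 1      # the only side effect of consuming
            return self.log[pos]

        def lag(self, group):
            """How far the group is behind the head of the log (for alerting)."""
            return len(self.log) - self.offsets.get(group, 0)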
The fact that something was written to a database is an event that can be captured, stored, and processed.
The events in the replication log describe the data changes that occurred.
The problem with most databases’ replication logs is that they have long been considered to be an internal implementation detail of the database, not a public API.
Clients are supposed to query the database through its data model and query language, not parse the replication logs and try to extract data from them.
Change data capture is a mechanism for ensuring that all changes made to the system of record are also reflected in the derived data systems so that the derived systems have an accurate copy of the data.
Essentially, change data capture makes one database the leader (the one from which the changes are captured), and turns the others into followers.
If you have the log of all changes that were ever made to a database, you can reconstruct the entire state of the database by replaying the log.
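Note: a sketch of the follower side of change data capture, assuming a hypothetical change-event format {"key": ..., "new_value": ...} where None means deletion. The same loop that keeps a derived store in sync also reconstructs the full state when replayed from the start of the log against an empty store.

    def apply_changes(change_stream, store):
        """Apply change events in log order to a derived key-value store."""
        for change in change_stream:
            key, new_value = change["key"], change["new_value"]
            if new_value is None:
                store.pop(key, None)       # deletion in the system of record
            else:
                store[key] = new_value     # insert or update in the system of record
        return store

    # Keeping a follower (e.g. a cache or search index) up to date:
    #   apply_changes(live_change_stream, cache)
    # Reconstructing the entire state by replaying the whole log:
    #   state = apply_changes(full_change_log, {})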
In a log-structured storage engine, an update with a special null value (a tombstone) indicates that a key was deleted, and causes it to be removed during log compaction.
If the same key is frequently overwritten, previous values will eventually be garbage-collected, and only the latest value will be retained.
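Note: a sketch of log compaction under the same assumed event format: keep only the most recent entry per key, and drop keys whose latest value is a tombstone (None).

    def compact(log):
        """Retain only the latest value per key; drop tombstoned keys."""
        latest = {}
        for change in log:                     # later entries overwrite earlier ones
            latest[change["key"]] = change["new_value"]
        return [{"key": k, "new_value": v}
                for k, v in latest.items()
                if v is not None]              # keys whose latest entry is a tombstone are removed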
Similarly to change data capture, event sourcing involves storing all changes to the application state as a log of change events.
The application writing to the database does not need to be aware that CDC is occurring.
In event sourcing, the application logic is explicitly built on the basis of immutable events that are written to an event log.
An event log by itself is not very useful, because users generally expect to see the current state of a system, not the history of modifications.
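Note: because event sourcing stores higher-level descriptions of what happened, deriving the current state takes application logic. A sketch with made-up shopping-cart events (ItemAdded and ItemRemoved are illustrative names, not a standard):

    events = [
        {"type": "ItemAdded",   "cart": "c1", "item": "book", "qty": 1},
        {"type": "ItemAdded",   "cart": "c1", "item": "pen",  "qty": 2},
        {"type": "ItemRemoved", "cart": "c1", "item": "pen"},
    ]

    def current_cart(events, cart_id):
        """Fold the immutable event log into the cart's current contents."""
        contents = {}
        for e in events:
            if e.get("cart") != cart_id:
                continue
            if e["type"] == "ItemAdded":
                contents[e["item"]] = contents.get(e["item"], 0) + e["qty"]
            elif e["type"] == "ItemRemoved":
                contents.pop(e["item"], None)
        return contents

    print(current_cart(events, "c1"))   # {'book': 1}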
At the point when the event is generated, it becomes a fact.
The nature of state is that it changes, so databases support updating and deleting data as well as inserting it.
Whenever you have state that changes, that state is the result of the events that mutated it over time.
No matter how the state changes, there was always a sequence of events that caused those changes.
The log of all changes, the changelog, represents the evolution of state over time.
If you store the changelog durably, that simply has the effect of making the state reproducible.
The truth is the log. The database is a cache of a subset of the log.
Running old and new systems side by side is often easier than performing a complicated schema migration in an existing system.
The traditional approach to database and schema design is based on the fallacy that data must be written in the same form as it will be queried.
With event sourcing, you can design an event such that it is a self-contained description of a user action.
Besides the performance reasons, there may also be circumstances in which you need data to be deleted for administrative or legal reasons, in spite of all immutability.
Similarly to the way that a regular expression allows you to search for certain patterns of characters in a string, CEP allows you to specify rules to search for certain patterns of events in a stream.
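Note: a toy sketch of the CEP idea, assuming a made-up event shape {"user": ..., "type": ..., "ts": seconds}. A small piece of state per user flags the pattern "several failed logins followed by a success within a time window".

    from collections import defaultdict, deque

    def detect_suspicious_logins(events, window=60, threshold=3):
        """Yield (user, ts) when >= `threshold` failures are followed by a
        success, all within `window` seconds."""
        failures = defaultdict(deque)            # user -> recent failure timestamps
        for e in events:
            user, ts = e["user"], e["ts"]
            recent = failures[user]
            while recent and ts - recent[0] > window:
                recent.popleft()                 # expire old failures
            if e["type"] == "login_failed":
                recent.append(ts)
            elif e["type"] == "login_ok" and len(recent) >= threshold:
                yield (user, ts)                 # pattern matched
                recent.clear()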
Averaging over a few minutes smoothes out irrelevant fluctuations from one second to the next, while still giving you a timely picture of any changes in traffic pattern.
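Note: a sketch of that smoothing, assuming samples arrive in time order as (timestamp, value) pairs: report the average over the trailing few minutes rather than each second's raw value.

    from collections import deque

    def smoothed_rate(samples, window_seconds=300):
        """Yield (timestamp, trailing-window average) for each incoming sample."""
        buf = deque()                            # (timestamp, value) pairs inside the window
        total = 0.0
        for ts, value in samples:
            buf.append((ts, value))
            total += value
            while ts - buf[0][0] > window_seconds:
                _, old_value = buf.popleft()     # drop samples older than the window
                total -= old_value
            yield ts, total / len(buf)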
Probabilistic algorithms produce approximate results, but have the advantage of requiring significantly less memory in the stream processor than exact algorithms.
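Note: HyperLogLog is the usual example for approximate distinct counting; as a simpler illustration of the same memory/accuracy trade-off, here is a K-Minimum-Values sketch (my choice, not necessarily what the author has in mind) that estimates the distinct count in O(k) memory regardless of stream length.

    import hashlib

    def _hash01(value):
        """Hash a value to a float in [0, 1)."""
        digest = hashlib.sha1(str(value).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def approx_distinct(stream, k=256):
        """K-Minimum-Values estimate of the number of distinct items."""
        smallest = set()                         # the k smallest hashes seen so far
        for item in stream:
            h = _hash01(item)
            if h in smallest:
                continue
            smallest.add(h)
            if len(smallest) > k:
                smallest.remove(max(smallest))   # keep only the k smallest
        if len(smallest) < k:
            return len(smallest)                 # saw fewer than k distinct items
        return int((k - 1) / max(smallest))      # standard KMV estimator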
Conventional search engines first index the documents and then run queries over the index.
In a batch process, the processing tasks rapidly crunch through a large collection of historical events.
Confusing event time and processing time leads to bad data.
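Note: a tiny illustration with made-up timestamps. Bucketing by arrival (processing) time pushes a delayed event into the wrong minute; bucketing by event time does not.

    from collections import Counter

    records = [
        {"event_time": 60 * 59 + 50, "processing_time": 60 * 59 + 51},  # on time
        {"event_time": 60 * 59 + 55, "processing_time": 60 * 61 + 5},   # delayed ~70s
    ]

    def per_minute_counts(records, key):
        return Counter(r[key] // 60 for r in records)

    print(per_minute_counts(records, "event_time"))       # both events counted in minute 59
    print(per_minute_counts(records, "processing_time"))  # one wrongly lands in minute 61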
A tricky problem when defining windows in terms of event time is that you can never be sure when you have received all of the events for a particular window, or whether there are some events still to come.
Once you know how the timestamp of an event should be determined, the next step is to decide how windows over time periods should be defined.
A tumbling window has a fixed length, and every event belongs to exactly one window.
A hopping window also has a fixed length, but allows windows to overlap in order to provide some smoothing.
A sliding window contains all the events that occur within some interval of each other.
A sliding window can be implemented by keeping a buffer of events sorted by time and removing old events when they expire from the window.
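Note: a sketch of how these three window types can be computed over event timestamps in epoch seconds (the function names are mine; the sliding window assumes events are added in timestamp order).

    from collections import deque

    def tumbling_window(ts, size):
        """Each event belongs to exactly one fixed-length window."""
        start = ts - ts % size
        return (start, start + size)

    def hopping_windows(ts, size, hop):
        """Fixed-length, overlapping windows: an event belongs to every window
        of length `size` starting at a multiple of `hop` that covers it."""
        first_start = (ts - size + hop) // hop * hop   # earliest window still covering ts
        return [(s, s + size)
                for s in range(max(first_start, 0), ts + 1, hop)
                if s <= ts < s + size]

    class SlidingWindow:
        """Buffer of events within `size` seconds of the most recent event."""
        def __init__(self, size):
            self.size = size
            self.events = deque()                      # (timestamp, event), in time order

        def add(self, ts, event):
            self.events.append((ts, event))
            while ts - self.events[0][0] > self.size:
                self.events.popleft()                  # expire events that left the window
            return list(self.events)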
In order to calculate the click-through rate for each URL in the search results, you need to bring together the events for the search action and the click action, which are connected by having the same session ID.
Due to variable network delays, the click event may even arrive before the search event.
If the search event expires without you seeing a matching click event, you emit an event saying which search results were not clicked.
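Note: a sketch of that stream-stream join, with assumed event shapes ({"type": "search"/"click", "session": ..., "ts": ...}). Searches are buffered by session ID, matched against clicks, and expired unmatched searches are emitted as "not clicked".

    def join_searches_and_clicks(events, timeout=3600):
        """Join search and click events on session ID."""
        pending = {}                                   # session -> buffered search event
        for e in events:
            # Expire buffered searches older than the timeout.
            for session, search in list(pending.items()):
                if e["ts"] - search["ts"] > timeout:
                    del pending[session]
                    yield ("not_clicked", search)
            if e["type"] == "search":
                pending[e["session"]] = e
            elif e["type"] == "click":
                search = pending.pop(e["session"], None)
                if search is not None:
                    yield ("clicked", search, e)
                # A click arriving before its search (network delay) would need
                # buffering on the click side as well; omitted here for brevity.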
When joining sales to a table of tax rates, you probably want to join with the tax rate at the time of the sale, which may be different from the current tax rate if you are reprocessing historical data.
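Note: a sketch of that point-in-time lookup, with made-up rates: the table keeps the date each rate took effect, and a sale is joined against the rate in force at the sale's own timestamp.

    import bisect

    rate_changes = [                     # sorted by the date each rate took effect
        ("2023-01-01", 0.19),
        ("2024-07-01", 0.21),
    ]
    effective_dates = [d for d, _ in rate_changes]

    def rate_at(sale_date):
        """Return the tax rate in force on `sale_date`."""
        i = bisect.bisect_right(effective_dates, sale_date) - 1
        if i < 0:
            raise ValueError("no tax rate defined before " + sale_date)
        return rate_changes[i][1]

    print(rate_at("2024-03-15"))   # 0.19, not the current rate of 0.21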
If a stream operator crashes, it can restart from its most recent checkpoint and discard any output generated between the last checkpoint and the crash.
Within the confines of the stream processing framework, the microbatching and checkpointing approaches provide the same exactly-once semantics as batch processing.
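Note: a sketch of the checkpoint-and-replay idea against a log-based input, where offsets are simply list indices; a real framework would also write the checkpoint (offset plus operator state) to durable storage and deal with the output produced since the last checkpoint.

    def run_with_checkpoints(log, process, state, checkpoint, checkpoint_every=100):
        """Process events from `log`, recording (offset, state) periodically.
        After a crash, restart from the checkpoint and reprocess from there."""
        start = checkpoint.get("offset", 0)
        state = checkpoint.get("state", state)
        for offset in range(start, len(log)):
            state = process(log[offset], state)
            if (offset + 1) % checkpoint_every == 0:
                checkpoint["offset"] = offset + 1   # durably record progress...
                checkpoint["state"] = state         # ...together with operator state
        return state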
In order to give the appearance of exactly-once processing in the presence of faults, we need to ensure that all outputs and side effects of processing an event take effect if and only if the processing is successful.
Our goal is to discard the partial output of any failed tasks so that they can be safely retried without taking effect twice.
An idempotent operation is one that you can perform multiple times, and it has the same effect as if you performed it only once.
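Note: setting a key to a fixed value is idempotent, while incrementing a counter is not. One common way to make the increment safe to retry (my sketch, not a prescribed mechanism) is to deduplicate on an event ID.

    processed_ids = set()     # in practice this would live in durable storage
    counters = {}

    def apply_once(event):
        """Make a non-idempotent increment safe to retry by recording event IDs."""
        if event["id"] in processed_ids:
            return                             # duplicate delivery: no effect
        counters[event["key"]] = counters.get(event["key"], 0) + event["delta"]
        processed_ids.add(event["id"])

    e = {"id": "evt-42", "key": "page_views", "delta": 1}
    apply_once(e)
    apply_once(e)                              # retried delivery has no further effect
    print(counters)                            # {'page_views': 1}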