Kindle Notes & Highlights
Read between August 2 and December 28, 2020
In these circumstances, it’s not sufficient to just append another event to the log to indicate that the prior data should be considered deleted — you actually want to rewrite history and pretend that the data was never written in the first place.
Truly deleting data is surprisingly hard [64], since copies can live in many places: for example, storage engines, filesystems, and SSDs often write to a new location rather than overwriting in place, and backups are often deliberately immutable to prevent deletion.
You can take the data in the events and write it to a database, cache, search index, or similar storage system, from where it can then be queried by other clients.
You can push the events to users in some way, for example by sending email alerts or push notifications, or by streaming the events to a real-time dashboard where they are visualized.
You can process one or more input streams to produce one or more output streams.
Fault-tolerance mechanisms must also change: with a batch job that has been running for a few minutes, a failed task can simply be restarted from the beginning, but with a stream job that has been running for several years, restarting from the beginning after a crash may not be a viable option.
Complex event processing (CEP) is an approach developed in the 1990s for analyzing event streams, especially geared toward the kind of application that requires searching for certain event patterns.
For example, you might want to know the average number of queries per second to a service over the last 5 minutes, and their 99th percentile response time during that period.
The time interval over which you aggregate is known as a window.
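A minimal sketch of this kind of windowed aggregation (the counter layout, window size, and function names are illustrative, not from the book): requests are counted per 5-minute tumbling window, from which the average queries per second for that window can be derived.

```python
from collections import defaultdict

WINDOW_SIZE = 300  # window length in seconds (5 minutes)

request_counts = defaultdict(int)  # window start timestamp -> number of requests

def on_request(event_time):
    # Assign the event to its 5-minute window and bump that window's counter.
    window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
    request_counts[window_start] += 1

def queries_per_second(window_start):
    # Average rate over the window, once the window is complete.
    return request_counts[window_start] / WINDOW_SIZE
```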
This use of approximation algorithms sometimes leads people to believe that stream processing systems are always lossy and inexact, but that is wrong: there is nothing inherently approximate about stream processing, and probabilistic algorithms are merely an optimization.
Maintaining materialized views
We can regard these examples as specific cases of maintaining materialized views: deriving an alternative view onto some dataset so that you can query it efficiently, and updating that view whenever the underlying data changes.
Besides CEP, which allows searching for patterns consisting of multiple events, there is also sometimes a need to search for individual events based on complex criteria, such as full-text search queries.
For example, media monitoring services subscribe to feeds of news articles and broadcasts from media outlets, and search for any news mentioning companies, products, or topics of interest.
Conventional search engines first index the documents and then run queries over the index. By contrast, searching a stream turns the processing on its head: the queries are stored, and the documents run past the queries, like in CEP.
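A sketch of this "queries stored, documents run past them" idea, with made-up query names and a deliberately naive matching rule (every required term must appear in the document):

```python
# Stored queries: each named query is a set of terms that must all appear.
stored_queries = {
    "acme-mentions": {"acme"},
    "widget-news": {"widget", "recall"},
}

def on_document(doc_id, text):
    # Each incoming document is run past every stored query, the reverse of
    # indexing documents first and querying the index afterwards.
    words = set(text.lower().split())
    for name, required_terms in stored_queries.items():
        if required_terms <= words:
            emit({"query": name, "doc_id": doc_id})

def emit(match):
    print(match)  # stand-in for notifying the query's subscriber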
A tricky problem when defining windows in terms of event time is that you can never be sure when you have received all of the events for a particular window, or whether there are some events still to come.
Ignore the straggler events, as they are probably a small percentage of events in normal circumstances.
Publish a correction, an updated value for the window with stragglers included.
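A sketch of the second option, assuming a per-window event count as the output (the bookkeeping and output format are illustrative): the count is published when the window closes, and an updated value is published if a straggler for that window arrives later.

```python
from collections import defaultdict

counts = defaultdict(int)  # window start -> number of events counted so far
published = set()          # windows whose result has already been emitted

def on_event(window_start):
    counts[window_start] += 1
    if window_start in published:
        # A straggler arrived after this window's result was published,
        # so emit a correction with the updated value.
        emit({"window": window_start, "count": counts[window_start], "correction": True})

def on_window_close(window_start):
    emit({"window": window_start, "count": counts[window_start], "correction": False})
    published.add(window_start)

def emit(result):
    print(result)  # stand-in for writing to the output stream
```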
However, the clock on a user-controlled device often cannot be trusted, as it may be accidentally or deliberately set to the wrong time.
To adjust for incorrect device clocks, one approach is to log three timestamps [82]:
The time at which the event occurred, according to the device clock
The time at which the event was sent to the server, according to the device clock
The time at which the event was received by the server, according to the server clock
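Subtracting the second timestamp from the third estimates the offset between the device clock and the server clock (assuming the network delay between sending and receiving is negligible or estimated separately); applying that offset to the first timestamp estimates when the event really occurred. A sketch, with illustrative names and epoch-second timestamps:

```python
def estimate_event_time(event_time_device, send_time_device, receive_time_server):
    # Offset between the device clock and the server clock, estimated at the
    # moment the event was sent (network delay assumed negligible here).
    clock_offset = receive_time_server - send_time_device
    # Shift the device's event timestamp into server-clock terms.
    return event_time_device + clock_offset

# Example: a device clock running 120 seconds fast is corrected.
print(estimate_event_time(1_000_000, 1_000_030, 999_910))  # -> 999880
```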
Tumbling window: A tumbling window has a fixed length, and every event belongs to exactly one window.
Hopping window: A hopping window also has a fixed length, but allows windows to overlap in order to provide some smoothing.
Sliding window: A sliding window contains all the events that occur within some interval of each other.
Session window: Unlike the other window types, a session window has no fixed duration. Instead, it is defined by grouping together all events for the same user that occur closely together in time, and the window ends when the user has been inactive for some time.
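Two of these can be sketched concretely (window sizes, the inactivity gap, and function names are my own assumptions): a hopping window assigns an event to several overlapping fixed-length windows, while a session window is extended per user until a gap of inactivity closes it.

```python
INACTIVITY_GAP = 1800  # assumed session cutoff: 30 minutes without events

def hopping_windows(event_time, size, hop):
    # All (start, end) windows of length `size`, starting every `hop` seconds,
    # that contain the event; with hop < size an event falls into several.
    windows = []
    start = (event_time // hop) * hop       # latest window start <= event_time
    while start > event_time - size:        # walk back while the window still covers the event
        windows.append((start, start + size))
        start -= hop
    return list(reversed(windows))

def assign_to_session(sessions, user, event_time):
    # sessions maps user -> [session_start, last_event_time] of the open session.
    current = sessions.get(user)
    if current is None or event_time - current[1] > INACTIVITY_GAP:
        sessions[user] = [event_time, event_time]  # previous session (if any) has ended
    else:
        current[1] = event_time                    # extend the current session
```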
Stream-stream join (window join)
The click may never come if the user abandons their search, and even if it comes, the time between the search and the click may be highly variable: in many cases it might be a few seconds, but it could be as long as days or weeks.
Due to variable network delays, the click event may even arrive before the search event.
To implement this type of join, the stream processor needs to maintain state: for example, all the events that occurred in the last hour, indexed by session ID.
If there is a matching event, you emit an event saying which search result was clicked. If the search event expires without you seeing a matching click event, you emit an event saying which search results were not clicked.
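A sketch of such a join for the search/click example, keyed by session ID (the event fields, one-hour window, and timer hook are illustrative assumptions):

```python
JOIN_WINDOW = 3600  # keep search events around for one hour

pending_searches = {}  # session_id -> search event awaiting a click

def on_search(event):
    pending_searches[event["session_id"]] = event

def on_click(event):
    search = pending_searches.pop(event["session_id"], None)
    if search is not None:
        emit({"type": "search_clicked", "query": search["query"], "url": event["url"]})

def on_timer(now):
    # Expire searches older than the join window, reporting them as not clicked.
    for session_id, search in list(pending_searches.items()):
        if search["timestamp"] < now - JOIN_WINDOW:
            emit({"type": "search_not_clicked", "query": search["query"]})
            del pending_searches[session_id]

def emit(output_event):
    print(output_event)  # stand-in for writing to the output stream
```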
Stream-table join (stream enrichment)
This process is sometimes known as enriching the activity events with information from the database.
One option is to query a remote database for every event, but such remote queries are likely to be slow and risk overloading the database.
Another approach is to load a copy of the database into the stream processor so that it can be queried locally without a network round-trip. This technique is very similar to the hash joins we discussed in “Map-Side Joins”: the local copy of the database might be an in-memory hash table if it is small enough, or an index on the local disk.
The difference to batch jobs is that a batch job uses a point-in-time snapshot of the database as input, whereas a stream processor is long-running, and the contents of the database are likely to change over time, so the stream processor’s local copy of the database needs to be kept up to date.
This issue can be solved by change data capture: the stream processor can subscribe to a changelog of the user profile database as well as the stream of activity events.
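A sketch of this enrichment, in which the local copy of the user-profile table is kept current by consuming its changelog (the field names are illustrative, not a particular system's format):

```python
user_profiles = {}  # user_id -> profile dict (local copy of the database)

def on_profile_change(change):
    # Change data capture event from the user-profile database's changelog.
    if change["new_value"] is None:
        user_profiles.pop(change["user_id"], None)  # row was deleted
    else:
        user_profiles[change["user_id"]] = change["new_value"]

def on_activity_event(event):
    # Enrich the activity event with the user's profile, without a network round-trip.
    profile = user_profiles.get(event["user_id"], {})
    emit({**event, "user_profile": profile})

def emit(output_event):
    print(output_event)  # stand-in for writing to the output stream
```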
Table-table join (materialized view maintenance)
The stream processor needs to maintain a database containing the set of followers for each user so that it knows which timelines need to be updated when a new tweet arrives.
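A sketch of that bookkeeping, with illustrative event shapes: follow/unfollow events maintain the follower sets, and each new tweet is fanned out to its followers' timelines (the materialized view).

```python
from collections import defaultdict

followers = defaultdict(set)   # followee user id -> set of follower user ids
timelines = defaultdict(list)  # user id -> list of tweets (the materialized timeline)

def on_follow_event(event):
    if event["type"] == "follow":
        followers[event["followee"]].add(event["follower"])
    elif event["type"] == "unfollow":
        followers[event["followee"]].discard(event["follower"])

def on_tweet(tweet):
    # Fan the new tweet out to the timeline of every current follower.
    for follower_id in followers[tweet["sender"]]:
        timelines[follower_id].append(tweet)
```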
Put another way: if state changes over time, and you join with some state, what point in time do you use for the join?
If the ordering of events across streams is undetermined, the join becomes nondeterministic [87], which means you cannot rerun the same job on the same input and necessarily get the same result.
In data warehouses, this issue is known as a slowly changing dimension (SCD), and it is often addressed by using a unique identifier for a particular version of the joined record: for example, every time the tax rate changes, it is given a new identifier, and the invoice includes the identifier for the tax rate at the time of sale.
This principle is known as exactly-once semantics, although effectively-once would be a more descriptive term.
Microbatching and checkpointing
One solution is to break the stream into small blocks, and treat each block like a miniature batch process.
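A sketch of that loop, under assumed helper functions for reading the input, processing a block, and durably recording a checkpoint (the batch interval and function names are illustrative):

```python
import time

BATCH_INTERVAL = 1.0  # seconds of input per microbatch (an assumed setting)

def run_microbatches(read_events_since, process_batch, save_checkpoint, last_checkpoint):
    position = last_checkpoint
    while True:
        deadline = time.time() + BATCH_INTERVAL
        batch, position = read_events_since(position)  # one small block of events
        process_batch(batch)                           # treated like a miniature batch job
        save_checkpoint(position)                      # after a crash, restart from here
        time.sleep(max(0.0, deadline - time.time()))
```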
In order to give the appearance of exactly-once processing in the presence of faults, we need to ensure that all outputs and side effects of processing an event take effect if and only if the processing is successful. Those effects include any messages sent to downstream operators or external messaging systems, any database writes, any changes to operator state, and any acknowledgment of input messages. Those things either all need to happen atomically, or none of them must happen, but they should not go out of sync with each other.
Idempotence
Our goal is to discard the partial output of any failed tasks so that they can be safely retried without taking effect twice. Distributed transactions are one way of achieving that goal, but another way is to rely on idempotence [97].
For example, when consuming messages from Kafka, every message has a persistent, monotonically increasing offset. When writing a value to an external database, you can include the offset of the message that triggered the last write with the value.
Relying on idempotence implies several assumptions: restarting a failed task must replay the same messages in the same order (a log-based message broker does this), the processing must be deterministic, and no other node may concurrently update the same value [98].
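A sketch of that idea (the store and field layout are illustrative, not Kafka's or any particular database's API): each write carries the offset of the message that produced it, so replaying already-applied messages has no further effect.

```python
store = {}  # key -> (value, offset of the message that last wrote it)

def apply_update(key, value, offset):
    _, last_offset = store.get(key, (None, -1))
    if offset <= last_offset:
        return  # this message was already applied before the crash; skip the duplicate
    store[key] = (value, offset)

# Replaying the same messages in the same order is harmless:
apply_update("counter", 41, offset=7)
apply_update("counter", 42, offset=8)
apply_update("counter", 42, offset=8)  # duplicate delivered after a restart, ignored
print(store["counter"])                # -> (42, 8)
```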
When failing over from one processing node to another, fencing may be required (see “The leader and the lock”) to prevent interference from a node that is thought to be dead but is actually alive.