Kindle Notes & Highlights
Read between August 2 and December 28, 2020
In these circumstances, it’s not sufficient to just append another event to the log to indicate that the prior data should be considered deleted — you actually want to rewrite history and pretend that the data was never written in the first place.
Truly deleting data is surprisingly hard [64], since copies can live in many places: for example, storage engines, filesystems, and SSDs often write to a new location rather than overwriting in place, and backups are often deliberately immutable to prevent deletion.
You can take the data in the events and write it to a database, cache, search index, or similar storage system, from where it can then be queried by other clients.
You can push the events to users in some way, for example by sending email alerts or push notifications, or by streaming the events to a real-time dashboard where they are visualized.
You can process one or more input streams to produce one or more output streams.
Fault-tolerance mechanisms must also change: with a batch job that has been running for a few minutes, a failed task can simply be restarted from the beginning, but with a stream job that has been running for several years, restarting from the beginning after a crash may not be a viable option.
Complex event processing (CEP) is an approach developed in the 1990s for analyzing event streams, especially geared toward the kind of application that requires searching for certain event patterns.
For example, you might want to know the average number of queries per second to a service over the last 5 minutes, and their 99th percentile response time during that period.
The time interval over which you aggregate is known as a window.
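A minimal sketch of this kind of windowed aggregation (the counter layout, window size, and function names are illustrative, not from the book): requests are counted per 5-minute tumbling window, from which the average queries per second for that window can be derived.

```python
from collections import defaultdict

WINDOW_SIZE = 300  # window length in seconds (5 minutes)

request_counts = defaultdict(int)  # window start timestamp -> number of requests

def on_request(event_time):
    # Assign the event to its 5-minute window and bump that window's counter.
    window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
    request_counts[window_start] += 1

def queries_per_second(window_start):
    # Average rate over the window, once the window is complete.
    return request_counts[window_start] / WINDOW_SIZE
```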
This use of approximation algorithms sometimes leads people to believe that stream processing systems are always lossy and inexact, but that is wrong: there is nothing inherently approximate about stream processing, and probabilistic algorithms are merely an optimization.
Maintaining materialized views
We can regard these examples as specific cases of maintaining materialized views: deriving an alternative view onto some dataset so that you can query it efficiently, and updating that view whenever the underlying data changes.
Besides CEP, which allows searching for patterns consisting of multiple events, there is also sometimes a need to search for individual events based on complex criteria, such as full-text search queries.
For example, media monitoring services subscribe to feeds of news articles and broadcasts from media outlets, and search for any news mentioning companies, products, or topics of interest.
Conventional search engines first index the documents and then run queries over the index. By contrast, searching a stream turns the processing on its head: the queries are stored, and the documents run past the queries, like in CEP.
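A sketch of this "queries stored, documents run past them" idea, with made-up query names and a deliberately naive matching rule (every required term must appear in the document):

```python
# Stored queries: each named query is a set of terms that must all appear.
stored_queries = {
    "acme-mentions": {"acme"},
    "widget-news": {"widget", "recall"},
}

def on_document(doc_id, text):
    # Each incoming document is run past every stored query, the reverse of
    # indexing documents first and querying the index afterwards.
    words = set(text.lower().split())
    for name, required_terms in stored_queries.items():
        if required_terms <= words:
            emit({"query": name, "doc_id": doc_id})

def emit(match):
    print(match)  # stand-in for notifying the query's subscriber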
A tricky problem when defining windows in terms of event time is that you can never be sure when you have received all of the events for a particular window, or whether there are some events still to come.
Ignore the straggler events, as they are probably a small percentage of events in normal circumstances.
Publish a correction, an updated value for the window with stragglers included.
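A sketch of the second option, assuming a per-window event count as the output (the bookkeeping and output format are illustrative): the count is published when the window closes, and an updated value is published if a straggler for that window arrives later.

```python
from collections import defaultdict

counts = defaultdict(int)  # window start -> number of events counted so far
published = set()          # windows whose result has already been emitted

def on_event(window_start):
    counts[window_start] += 1
    if window_start in published:
        # A straggler arrived after this window's result was published,
        # so emit a correction with the updated value.
        emit({"window": window_start, "count": counts[window_start], "correction": True})

def on_window_close(window_start):
    emit({"window": window_start, "count": counts[window_start], "correction": False})
    published.add(window_start)

def emit(result):
    print(result)  # stand-in for writing to the output stream
```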
However, the clock on a user-controlled device often cannot be trusted, as it may be accidentally or deliberately set to the wrong time.
To adjust for incorrect device clocks, one approach is to log three timestamps [82]:
The time at which the event occurred, according to the device clock
The time at which the event was sent to the server, according to the device clock
The time at which the event was received by the server, according to the server clock
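Subtracting the second timestamp from the third estimates the offset between the device clock and the server clock (assuming the network delay between sending and receiving is negligible or estimated separately); applying that offset to the first timestamp estimates when the event really occurred. A sketch, with illustrative names and epoch-second timestamps:

```python
def estimate_event_time(event_time_device, send_time_device, receive_time_server):
    # Offset between the device clock and the server clock, estimated at the
    # moment the event was sent (network delay assumed negligible here).
    clock_offset = receive_time_server - send_time_device
    # Shift the device's event timestamp into server-clock terms.
    return event_time_device + clock_offset

# Example: a device clock running 120 seconds fast is corrected.
print(estimate_event_time(1_000_000, 1_000_030, 999_910))  # -> 999880
```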
Tumbling window: A tumbling window has a fixed length, and every event belongs to exactly one window.
Hopping window: A hopping window also has a fixed length, but allows windows to overlap in order to provide some smoothing.
Sliding window: A sliding window contains all the events that occur within some interval of each other.
Session window: Unlike the other window types, a session window has no fixed duration. Instead, it is defined by grouping together all events for the same user that occur closely together in time, and the window ends when the user has been inactive for some time.
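Two of these can be sketched concretely (window sizes, the inactivity gap, and function names are my own assumptions): a hopping window assigns an event to several overlapping fixed-length windows, while a session window is extended per user until a gap of inactivity closes it.

```python
INACTIVITY_GAP = 1800  # assumed session cutoff: 30 minutes without events

def hopping_windows(event_time, size, hop):
    # All (start, end) windows of length `size`, starting every `hop` seconds,
    # that contain the event; with hop < size an event falls into several.
    windows = []
    start = (event_time // hop) * hop       # latest window start <= event_time
    while start > event_time - size:        # walk back while the window still covers the event
        windows.append((start, start + size))
        start -= hop
    return list(reversed(windows))

def assign_to_session(sessions, user, event_time):
    # sessions maps user -> [session_start, last_event_time] of the open session.
    current = sessions.get(user)
    if current is None or event_time - current[1] > INACTIVITY_GAP:
        sessions[user] = [event_time, event_time]  # previous session (if any) has ended
    else:
        current[1] = event_time                    # extend the current session
```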
Stream-stream join (window join)
The click may never come if the user abandons their search, and even if it comes, the time between the search and the click may be highly variable: in many cases it might be a few seconds, but it could be as long as days or weeks.
Due to variable network delays, the click event may even arrive before the search event.
To implement this type of join, the stream processor needs to maintain state: for example, all the events that occurred in the last hour, indexed by session ID.
If there is a matching event, you emit an event saying which search result was clicked. If the search event expires without you seeing a matching click event, you emit an event saying which search results were not clicked.
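A sketch of such a join for the search/click example, keyed by session ID (the event fields, one-hour window, and timer hook are illustrative assumptions):

```python
JOIN_WINDOW = 3600  # keep search events around for one hour

pending_searches = {}  # session_id -> search event awaiting a click

def on_search(event):
    pending_searches[event["session_id"]] = event

def on_click(event):
    search = pending_searches.pop(event["session_id"], None)
    if search is not None:
        emit({"type": "search_clicked", "query": search["query"], "url": event["url"]})

def on_timer(now):
    # Expire searches older than the join window, reporting them as not clicked.
    for session_id, search in list(pending_searches.items()):
        if search["timestamp"] < now - JOIN_WINDOW:
            emit({"type": "search_not_clicked", "query": search["query"]})
            del pending_searches[session_id]

def emit(output_event):
    print(output_event)  # stand-in for writing to the output stream
```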
Stream-table join (stream enrichment)
This process is sometimes known as enriching the activity events with information from the database.
One option is to query a remote database for every event, but such remote queries are likely to be slow and risk overloading the database.
Another approach is to load a copy of the database into the stream processor so that it can be queried locally without a network round-trip. This technique is very similar to the hash joins we discussed in “Map-Side Joins”: the local copy of the database might be an in-memory hash table if it is small enough, or an index on the local disk.
The difference to batch jobs is that a batch job uses a point-in-time snapshot of the database as input, whereas a stream processor is long-running, and the contents of the database are likely to change over time, so the stream processor’s local copy of the database needs to be kept up to date.
This issue can be solved by change data capture: the stream processor can subscribe to a changelog of the user profile database as well as the stream of activity events.
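A sketch of this enrichment, in which the local copy of the user-profile table is kept current by consuming its changelog (the field names are illustrative, not a particular system's format):

```python
user_profiles = {}  # user_id -> profile dict (local copy of the database)

def on_profile_change(change):
    # Change data capture event from the user-profile database's changelog.
    if change["new_value"] is None:
        user_profiles.pop(change["user_id"], None)  # row was deleted
    else:
        user_profiles[change["user_id"]] = change["new_value"]

def on_activity_event(event):
    # Enrich the activity event with the user's profile, without a network round-trip.
    profile = user_profiles.get(event["user_id"], {})
    emit({**event, "user_profile": profile})

def emit(output_event):
    print(output_event)  # stand-in for writing to the output stream
```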
Table-table join (materialized view maintenance)
The stream processor needs to maintain a database containing the set of followers for each user so that it knows which timelines need to be updated when a new tweet arrives.
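A sketch of that bookkeeping, with illustrative event shapes: follow/unfollow events maintain the follower sets, and each new tweet is fanned out to its followers' timelines (the materialized view).

```python
from collections import defaultdict

followers = defaultdict(set)   # followee user id -> set of follower user ids
timelines = defaultdict(list)  # user id -> list of tweets (the materialized timeline)

def on_follow_event(event):
    if event["type"] == "follow":
        followers[event["followee"]].add(event["follower"])
    elif event["type"] == "unfollow":
        followers[event["followee"]].discard(event["follower"])

def on_tweet(tweet):
    # Fan the new tweet out to the timeline of every current follower.
    for follower_id in followers[tweet["sender"]]:
        timelines[follower_id].append(tweet)
```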
Put another way: if state changes over time, and you join with some state, what point in time do you use for the join?
If the ordering of events across streams is undetermined, the join becomes nondeterministic [87], which means you cannot rerun the same job on the same input and necessarily get the same result.
In data warehouses, this issue is known as a slowly changing dimension (SCD), and it is often addressed by using a unique identifier for a particular version of the joined record: for example, every time the tax rate changes, it is given a new identifier, and the invoice includes the identifier for the tax rate at the time of sale.
This principle is known as exactly-once semantics, although effectively-once would be a more descriptive term.
Microbatching and checkpointing
One solution is to break the stream into small blocks, and treat each block like a miniature batch process.
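A sketch of that loop, under assumed helper functions for reading the input, processing a block, and durably recording a checkpoint (the batch interval and function names are illustrative):

```python
import time

BATCH_INTERVAL = 1.0  # seconds of input per microbatch (an assumed setting)

def run_microbatches(read_events_since, process_batch, save_checkpoint, last_checkpoint):
    position = last_checkpoint
    while True:
        deadline = time.time() + BATCH_INTERVAL
        batch, position = read_events_since(position)  # one small block of events
        process_batch(batch)                           # treated like a miniature batch job
        save_checkpoint(position)                      # after a crash, restart from here
        time.sleep(max(0.0, deadline - time.time()))
```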
In order to give the appearance of exactly-once processing in the presence of faults, we need to ensure that all outputs and side effects of processing an event take effect if and only if the processing is successful. Those effects include any messages sent to downstream operators or external messaging systems, any database writes, any changes to operator state, and any acknowledgment of input messages. Those things either all need to happen atomically, or none of them must happen, but they should not go out of sync with each other.
Idempotence
Our goal is to discard the partial output of any failed tasks so that they can be safely retried without taking effect twice. Distributed transactions are one way of achieving that goal, but another way is to rely on idempotence [97].
For example, when consuming messages from Kafka, every message has a persistent, monotonically increasing offset. When writing a value to an external database, you can include the offset of the message that triggered the last write with the value.
Relying on idempotence implies several assumptions: restarting a failed task must replay the same messages in the same order (a log-based message broker does this), the processing must be deterministic, and no other node may concurrently update the same value [98].
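A sketch of that idea (the store and field layout are illustrative, not Kafka's or any particular database's API): each write carries the offset of the message that produced it, so replaying already-applied messages has no further effect.

```python
store = {}  # key -> (value, offset of the message that last wrote it)

def apply_update(key, value, offset):
    _, last_offset = store.get(key, (None, -1))
    if offset <= last_offset:
        return  # this message was already applied before the crash; skip the duplicate
    store[key] = (value, offset)

# Replaying the same messages in the same order is harmless:
apply_update("counter", 41, offset=7)
apply_update("counter", 42, offset=8)
apply_update("counter", 42, offset=8)  # duplicate delivered after a restart, ignored
print(store["counter"])                # -> (42, 8)
```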
When failing over from one processing node to another, fencing may be required (see “The leader and the lock”) to prevent interference from a node that is thought to be dead but is actually alive.