Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Any stream process that requires state must ensure that this state can be recovered after a failure.
One option is to keep the state in a remote datastore and replicate it, although having to query a remote database for each individual message can be slow,
An alternative is to keep state local to the stream processor, and replicate it periodically. Then, when the stream processor is recovering from a failure, the new task can read the replicated state and resume processing without data loss.
In some cases, it may not even be necessary to replicate the state, because it can be rebuilt from the input streams.
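Rebuilding state from the input streams can be sketched as replaying a retained log through the operator's fold function. This is a minimal hypothetical illustration (the event names and helper functions are invented, not from the book):

```python
# Hypothetical sketch: a counting operator whose state is not replicated
# at all, because it can be rebuilt by replaying the retained input log.

def apply_event(state, event):
    """Fold one event into the operator's state (a dict of counts)."""
    state[event] = state.get(event, 0) + 1
    return state

def rebuild_state(input_log):
    """After a crash, recover state by replaying the input log from the start."""
    state = {}
    for event in input_log:
        state = apply_event(state, event)
    return state

log = ["click", "view", "click"]
recovered = rebuild_state(log)   # {"click": 2, "view": 1}
```

The trade-off is recovery time: replaying a long log is slow, which is why real systems usually combine replay with periodic checkpoints.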
Representing databases as streams opens up powerful opportunities for integrating systems. You can keep derived data systems such as search indexes, caches, and analytics systems continually up to date by consuming
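The idea of keeping derived systems up to date by consuming a stream of database changes can be sketched as follows. This is a toy illustration under assumed event shapes (`("upsert"/"delete", key, value)` tuples are my invention):

```python
# Hypothetical sketch: consuming a database changelog to keep two
# derived systems (a cache and a toy search index) continually up to date.

cache = {}            # key -> latest value
search_index = {}     # word -> set of keys whose value contains it

def apply_change(change):
    op, key, value = change
    if op == "upsert":
        cache[key] = value
        for word in value.split():
            search_index.setdefault(word, set()).add(key)
    elif op == "delete":
        cache.pop(key, None)
        for keys in search_index.values():
            keys.discard(key)

changelog = [
    ("upsert", "doc1", "hello world"),
    ("upsert", "doc2", "hello stream"),
    ("delete", "doc1", None),
]
for change in changelog:
    apply_change(change)
```

Both derived views stay consistent with the source because they see the same changes in the same order.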
A recurring theme in this book has been that for any given problem, there are several solutions, all of which have different pros, cons, and trade-offs.
Thus, the most appropriate choice of software tool also depends on the circumstances. Every piece of software, even a so-called “general-purpose” database, is designed for a particular usage pattern.
the first challenge is then to figure out the mapping between the software products and the circumstances...
so many applications need to combine two different tools in order to satisfy all of the requirements.
The need for data integration often only becomes apparent if you zoom out and consider the dataflows across an entire organization.
When copies of the same data need to be maintained in several storage systems in order to satisfy different access patterns, you need to be very clear about the inputs and outputs: where is data written first, and which representations are derived from which sources? How do you get data into all the right places, in the right formats?
If it is possible for you to funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order.
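The benefit of a single, totally ordered write log is determinism: any consumer replaying the same log in the same order derives the same state. A minimal sketch (the log format and view function are assumptions for illustration):

```python
# Hypothetical sketch: all user input funnels through one totally ordered
# log; derived representations are produced by replaying it in order.

write_log = []  # the single, totally ordered sequence of writes

def append_write(key, value):
    write_log.append((key, value))

def derive_kv_view(log):
    """A last-write-wins key/value view derived by replaying the log."""
    view = {}
    for key, value in log:
        view[key] = value
    return view

append_write("x", 1)
append_write("x", 2)
append_write("y", 9)

# Two independent consumers replaying the same ordered log converge
# on the same derived state.
assert derive_kv_view(write_log) == derive_kv_view(list(write_log))
```

Without the total order, two consumers could apply the two writes to "x" in opposite orders and permanently disagree.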
If the throughput of events is greater than a single machine can handle, you need to partition the log across multiple machines
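Partitioning a log is typically done by hashing a key, so that all events for one key land in one partition and keep their relative order, while different keys spread across machines. A hypothetical sketch (partition count and routing scheme are my choices, not the book's):

```python
# Hypothetical sketch: hash-partitioning a log across machines.
# Events with the same key always go to the same partition, so
# per-key ordering is preserved; there is no order across partitions.

import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    """Deterministically route a key to a partition via a stable hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return digest[0] % NUM_PARTITIONS

def append(key, event):
    partitions[partition_for(key)].append((key, event))

append("user:1", "login")
append("user:1", "click")   # same key -> same partition, after "login"
append("user:2", "login")   # may land on a different partition
```

This is the design used by partitioned log systems such as Apache Kafka: total order within a partition, no order guarantee between partitions.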
offline, you typically have a separate leader in each datacenter,
This implies an undefined ordering of events that originate in two different datacenters.
applications are deployed as microservices
When two events originate in different services, there is no defined order for those events.
Some applications maintain client-side state that is updated immediately on user input
With such applications, clients and servers are very likely to see events in different orders.
Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events. It is still an open research problem to design consensus algorithms that can scale beyond the throughput of a single node and that work well in a geographically distributed setting.
However, in a system that stores friendship status in one place and messages in another place, that ordering dependency between the unfriend event and the message-send event may be lost.
In this example, the notifications are effectively a join between the messages and the friend list,
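The unfriend-then-message example can be sketched as a join between a friend-list state (maintained from one stream) and a message stream, processed in a single order. All names and event shapes here are invented for illustration:

```python
# Hypothetical sketch: notifications as a join between a friend-list
# table and a message stream. Processing both in one defined order
# preserves the dependency between "unfriend" and "message-send".

friends = set()   # current friendship state, updated from one stream

def on_friend_event(event):
    kind, user = event
    (friends.add if kind == "friend" else friends.discard)(user)

def on_message(sender, text, notifications):
    # The join: only notify if the sender is currently a friend.
    if sender in friends:
        notifications.append((sender, text))

notifications = []
on_friend_event(("friend", "alice"))
on_message("alice", "hi", notifications)        # delivered
on_friend_event(("unfriend", "alice"))
on_message("alice", "hi again", notifications)  # dropped, as intended
```

If the two streams were stored in separate systems with no shared ordering, the second message could be processed before the unfriend event and a spurious notification would be sent.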
Logical timestamps
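Logical timestamps can be sketched with a minimal Lamport-style clock: counters that respect causality without synchronized wall clocks. A hypothetical illustration, not taken from the book's text:

```python
# Hypothetical sketch of a Lamport-style logical clock: every event gets
# a timestamp, and a received message pushes the local clock past the
# sender's timestamp, so causally later events get larger timestamps.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        """Timestamp attached to an outgoing message."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """On receipt, jump ahead of the sender's timestamp."""
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # node a stamps an outgoing message
t_recv = b.receive(t_send)  # node b's timestamp is now causally later
```

Such timestamps give a total order consistent with causality, but concurrent events may still be ordered arbitrarily.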
If you can log an event to record the state of the system that the user saw before making a decision, and give that event a unique identifier,
We will return to this idea in “Reads are events too”.
In principle, one type of processing can be emulated on top of the other,
No matter whether the derived data is a search index, a statistical model, or a cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing state changes in one system through functional application code and applying the effects to derived systems.
However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system
The reasoning behind this design is that batch processing is simpler and thus less prone to bugs, while stream processors are thought to be less reliable and harder to make fault-tolerant (see
database stores data in records of some data model (rows in tables, documents, vertices in a graph, etc.) while an operating system’s filesystem stores data in files — but at their core, both are “information management” systems
In this light, I think that the dataflow across an entire organization starts looking like one huge database [7]. Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it is acting like the database subsystem that keeps indexes or materialized views up to date.
Federated databases: unifying reads
It is possible to provide a unified query interface to a wide variety of underlying storage engines and processing methods
Unbundled databases: unifying writes
Federation and unbundling are two sides of the same coin: composing a reliable, scalable, and maintainable system out of diverse components.
The tools for composing data systems are getting better, but I think one major part is missing: we don’t yet have the unbundled-database equivalent of the Unix shell (i.e.,
There is interesting early-stage research in this area, such as differential dataflow [24, 25], and I hope that these ideas will find their way into production systems.
When the function that creates a derived dataset is not a standard cookie-cutter function like creating a secondary index, custom code is required to handle the application-specific aspects.
Most web applications today are deployed as stateless services,
This style of deployment is convenient, as servers can be added or removed at will, but the state has to go somewhere: typically, a database.
Stable message ordering and fault-tolerant message processing are quite stringent demands, but they are much less expensive and more operationally robust than distributed transactions.
Composing stream operators into dataflow systems has a lot of similar characteristics to the microservices approach [40]. However, the underlying communication mechanism is very different: one-directional, asynchronous message streams
Viewed like this, the role of caches, indexes, and materialized views is simple: they shift the boundary between the read path and the write path.
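Shifting the boundary between read path and write path can be made concrete with a toy materialized view: the write path does extra work so the read path becomes a cheap lookup. Everything here is an invented illustration:

```python
# Hypothetical sketch: a materialized view moves work from the read path
# (scan on every read) to the write path (update on every write).

posts = []            # base data
view_recent = []      # materialized view, maintained on the write path

def write_post(post):
    posts.append(post)
    view_recent.insert(0, post)       # write path does the extra work
    del view_recent[5:]               # keep only the 5 most recent

def read_recent_via_view():
    return list(view_recent)          # read path: cheap lookup

def read_recent_via_scan():
    return list(reversed(posts))[:5]  # read path without the view

for i in range(7):
    write_post(i)
```

Both read strategies return the same answer; the view just chooses to pay the cost at write time instead of read time, exactly the boundary shift that caches and indexes perform.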
However, recent “single-page” JavaScript web apps have gained a lot of stateful capabilities,
These changing capabilities have led to a renewed interest in offline-first applications that
More recent protocols have moved beyond the basic request/response pattern of HTTP: server-sent events (the EventSource API) and WebSockets provide communication channels by which a web browser can keep an open TCP connection to a server,
In terms of our model of write path and read path, actively pushing state changes all the way to client devices means extending the write path all the way to the end user.