Kindle Notes & Highlights
Read between August 2 and December 28, 2020
Any stream process that requires state must ensure that this state can be recovered after a failure. One option is to keep the state in a remote datastore and replicate it, although having to query a remote database for each individual message can be slow.
An alternative is to keep state local to the stream processor, and replicate it periodically. Then, when the stream processor is recovering from a failure, the new task can read the replicated state and resume processing without data loss.
In some cases, it may not even be necessary to replicate the state, because it can be rebuilt from the input streams.
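As a concrete sketch of the local-state option (not from the book; the checkpoint_store and read_log interfaces are hypothetical), a processor might checkpoint its state together with its log offset and, on recovery, replay the input from that offset:

    import json

    class CountingProcessor:
        """Keeps a per-key count as local state and checkpoints it periodically."""

        def __init__(self, checkpoint_store, checkpoint_every=1000):
            self.counts = {}                       # local state: key -> count
            self.offset = 0                        # position in the input log
            self.checkpoint_store = checkpoint_store
            self.checkpoint_every = checkpoint_every

        def process(self, key):
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                # Replicate the state together with the offset it reflects.
                self.checkpoint_store.save(json.dumps(
                    {"offset": self.offset, "counts": self.counts}))

        def recover(self, read_log):
            # Load the last checkpoint, then replay the input log from the
            # checkpointed offset to rebuild state without data loss.
            snapshot = json.loads(self.checkpoint_store.load())
            self.counts, self.offset = snapshot["counts"], snapshot["offset"]
            for key in read_log(self.offset):
                self.process(key)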
Representing databases as streams opens up powerful opportunities for integrating systems. You can keep derived data systems such as search indexes, caches, and analytics systems continually up to date by consuming the log of changes and applying them to the derived system.
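For instance, a minimal consumer that keeps a search index in sync with a change stream might look like the following sketch (the event shape with "op", "id", and "doc" fields is an assumption, not from the book):

    def apply_changelog(changelog, search_index):
        """Consume a stream of change events and keep a derived index up to date."""
        for change in changelog:
            if change["op"] == "upsert":
                search_index[change["id"]] = change["doc"]
            elif change["op"] == "delete":
                search_index.pop(change["id"], None)

    # Replaying the same changelog into an empty dict rebuilds the index from scratch.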
A recurring theme in this book has been that for any given problem, there are several solutions, all of which have different pros, cons, and trade-offs.
Thus, the most appropriate choice of software tool also depends on the circumstances. Every piece of software, even a so-called "general-purpose" database, is designed for a particular usage pattern.
The first challenge is then to figure out the mapping between the software products and the circumstances...
So many applications need to combine two different tools in order to satisfy all of the requirements.
The need for data integration often only becomes apparent if you zoom out and consider the dataflows across an entire organization.
When copies of the same data need to be maintained in several storage systems in order to satisfy different access patterns, you need to be very clear about the inputs and outputs: where is data written first, and which representations are derived from which sources? How do you get data into all the right places, in the right formats?
If it is possible for you to funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order.
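A toy illustration of this idea (a sketch, not the book's code): all writes pass through one append point that assigns sequence numbers, and every derived view replays the same sequence:

    import threading

    class OrderedLog:
        """A single point through which all writes are funneled and ordered."""

        def __init__(self):
            self._events = []
            self._lock = threading.Lock()

        def append(self, event):
            with self._lock:
                self._events.append(event)
                return len(self._events) - 1       # the event's sequence number

        def read_from(self, seq=0):
            # Every consumer reading from the same position sees the
            # events in exactly the same order.
            return self._events[seq:]

Because a cache updater and a search-index updater both replay the same sequence, they converge to mutually consistent derived representations.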
If the throughput of events is greater than a single machine can handle, you need to partition the log across multiple machines. The order of events in two different partitions is then ambiguous.
If the servers are spread across multiple geographically distributed datacenters, for example in order to tolerate an entire datacenter going offline, you typically have a separate leader in each datacenter.
This implies an undefined ordering of events that originate in two different datacenters.
When applications are deployed as microservices, a common design choice is to deploy each service and its durable state as an independent unit.
When two events originate in different services, there is no defined order for those events.
Some applications maintain client-side state that is updated immediately on user input.
With such applications, clients and servers are very likely to see events in different orders.
Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events. It is still an open research problem to design consensus algorithms that can scale beyond the throughput of a single node and that work well in a geographically distributed setting.
However, in a system that stores friendship status in one place and messages in another place, that ordering dependency between the unfriend event and the message-send event may be lost.
In this example, the notifications are effectively a join between the messages and the friend list.
Logical timestamps
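Logical timestamps here refers to schemes such as Lamport timestamps, which order events consistently with causality without funneling everything through one leader. A minimal sketch (an illustration, not the book's code):

    class LamportClock:
        """Lamport timestamps: a per-node counter merged on message receipt."""

        def __init__(self):
            self.time = 0

        def tick(self):
            # Increment before any local event or outgoing message.
            self.time += 1
            return self.time

        def receive(self, sender_time):
            # On receipt, jump past the sender's timestamp so that the
            # receive is ordered after the send.
            self.time = max(self.time, sender_time) + 1
            return self.time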
If you can log an event to record the state of the system that the user saw before making a decision, and give that event a unique identifier, then any later events can reference that event identifier in order to record the causal dependency.
We will return to this idea in “Reads are events too”.
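As a sketch of that idea (the event shape and the record_decision helper are illustrative, not from the book), each logged event carries a unique ID plus references to the events the user had seen:

    import uuid

    def record_decision(log, observed_ids, payload):
        """Log an event with a unique ID and its causal dependencies."""
        event = {"id": str(uuid.uuid4()),
                 "depends_on": list(observed_ids),   # what the user had seen
                 "payload": payload}
        log.append(event)
        return event["id"]

    # The unfriend example: the message references the unfriend event that
    # causally preceded it, so downstream consumers can respect that order.
    log = []
    unfriend = record_decision(log, [], {"type": "unfriend", "user": "alice"})
    record_decision(log, [unfriend], {"type": "message", "text": "hi"})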
In principle, one type of processing can be emulated on top of the other, although the performance characteristics vary.
No matter whether the derived data is a search index, a statistical model, or a cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing state changes in one system through functional application code and applying the effects to derived systems.
However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system.
The reasoning behind this design is that batch processing is simpler and thus less prone to bugs, while stream processors are thought to be less reliable and harder to make fault-tolerant.
A database stores data in records of some data model (rows in tables, documents, vertices in a graph, etc.) while an operating system's filesystem stores data in files — but at their core, both are "information management" systems.
In this light, I think that the dataflow across an entire organization starts looking like one huge database [7]. Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it is acting like the database subsystem that keeps indexes or materialized views up to date.
Federated databases: unifying reads
It is possible to provide a unified query interface to a wide variety of underlying storage engines and processing methods.
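PostgreSQL's foreign data wrappers are a production example of this federated pattern. As a toy sketch of the dispatch idea (all names here are hypothetical):

    class FederatedQuery:
        """One read interface routing queries to diverse underlying stores."""

        def __init__(self, stores):
            self.stores = stores                   # name -> backend with .query()

        def query(self, store_name, expression):
            # A real federated database would translate a single query
            # language for each backend; here we only dispatch.
            return self.stores[store_name].query(expression)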
Unbundled databases: unifying writes
Federation and unbundling are two sides of the same coin: composing a reliable, scalable, and maintainable system out of diverse components.
The tools for composing data systems are getting better, but I think one major part is missing: we don't yet have the unbundled-database equivalent of the Unix shell (i.e., a high-level language for composing storage and processing systems in a simple and declarative way).
There is interesting early-stage research in this area, such as differential dataflow [24, 25], and I hope that these ideas will find their way into production systems.
When the function that creates a derived dataset is not a standard cookie-cutter function like creating a secondary index, custom code is required to handle the application-specific aspects.
Most web applications today are deployed as stateless services, in which any user request can be routed to any application server.
This style of deployment is convenient, as servers can be added or removed at will, but the state has to go somewhere: typically, a database.
Stable message ordering and fault-tolerant message processing are quite stringent demands, but they are much less expensive and more operationally robust than distributed transactions.
Composing stream operators into dataflow systems has a lot of similar characteristics to the microservices approach [40]. However, the underlying communication mechanism is very different: one-directional, asynchronous message streams rather than synchronous request/response interactions.
Viewed like this, the role of caches, indexes, and materialized views is simple: they shift the boundary between the read path and the write path.
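A small sketch of that boundary shift (illustrative, not from the book): maintaining a word-count materialized view does the aggregation work on the write path so that the read path becomes a plain lookup:

    class WordCountView:
        """A materialized view: aggregation happens at write time."""

        def __init__(self):
            self.counts = {}

        def on_write(self, document):
            # Write path: precompute the aggregation eagerly.
            for word in document.split():
                self.counts[word] = self.counts.get(word, 0) + 1

        def on_read(self, word):
            # Read path: just a lookup; the work has moved to the write side.
            return self.counts.get(word, 0)

Without the view, on_read would have to scan every document; the cache simply relocates that work to the other side of the boundary.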
However, recent "single-page" JavaScript web apps have gained a lot of stateful capabilities.
These changing capabilities have led to a renewed interest in offline-first applications that do as much as possible using a local database on the same device, without requiring an internet connection, and sync with remote servers in the background when a network connection is available.
More recent protocols have moved beyond the basic request/response pattern of HTTP: server-sent events (the EventSource API) and WebSockets provide communication channels by which a web browser can keep an open TCP connection to a server, and the server can actively push messages to the browser as long as it remains connected.
In terms of our model of write path and read path, actively pushing state changes all the way to client devices means extending the write path all the way to the end user.
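On the wire, extending the write path this way can be as simple as holding an HTTP response open and emitting one server-sent event per state change. A minimal sketch of the SSE framing (the subscribe() change feed is hypothetical, not from the book):

    import json

    def sse_frame(payload):
        """Format one server-sent event for the browser's EventSource API:
        a 'data:' line terminated by a blank line."""
        return f"data: {json.dumps(payload)}\n\n".encode("utf-8")

    # A handler would keep the response open and write one frame per change:
    #     for change in subscribe(user_id):      # hypothetical change feed
    #         response.write(sse_frame(change))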