Kindle Notes & Highlights
Read between October 21 and November 26, 2024
Even if an operation is not naturally idempotent, it can often be made idempotent with a bit of extra metadata.
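A minimal Python sketch of the idea, assuming a client-supplied request ID as the extra metadata (the names here are illustrative, not from the book):

```python
# Sketch: making a non-idempotent "add credit" operation idempotent by
# attaching a unique request ID and remembering which IDs were applied.
# The request_id field and in-memory store are illustrative assumptions.

processed_ids = set()   # in production this would be durable storage
balance = 0

def add_credit(amount, request_id):
    """Safe to retry: replaying the same request_id has no further effect."""
    global balance
    if request_id in processed_ids:
        return balance  # duplicate delivery; do nothing
    processed_ids.add(request_id)
    balance += amount
    return balance

add_credit(10, "req-001")
add_credit(10, "req-001")  # retried after a timeout; applied only once
assert balance == 10
```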
Log compaction allows the stream to retain a full copy of the contents of a database.
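A toy sketch of what compaction does, assuming a log of (key, value) entries where None marks a deletion:

```python
# Sketch: log compaction keeps only the most recent entry for each key,
# so the compacted log is a full (latest-value) copy of the database.

log = [
    ("user:1", "alice"),
    ("user:2", "bob"),
    ("user:1", "alicia"),   # later update supersedes the first entry
    ("user:2", None),       # tombstone: key deleted
]

def compact(entries):
    latest = {}
    for key, value in entries:       # later entries overwrite earlier ones
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

assert compact(log) == [("user:1", "alicia")]
```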
Representing databases as streams opens up powerful opportunities for integrating systems.
It’s possible to create a load balancing scheme in which two consumers share the work of processing a partition by having both read the full set of messages, but one of them only considers messages with even-numbered offsets while the other deals with the odd-numbered offsets.
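The scheme is easy to sketch; the message layout below is an illustrative assumption:

```python
# Sketch: two consumers read the full partition but each acts on half
# of the offsets, splitting the work without repartitioning.

def should_process(offset: int, consumer_index: int) -> bool:
    """Consumer 0 takes even offsets, consumer 1 takes odd offsets."""
    return offset % 2 == consumer_index

messages = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]  # (offset, payload)
for offset, payload in messages:
    if should_process(offset, consumer_index=0):
        print("consumer 0 handles", offset, payload)
```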
What one person considers to be an obscure and pointless feature may well be a central requirement for someone else.
If it is possible for you to funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order.
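A small sketch of why the single ordering helps: two different derived views replay the same write sequence and therefore agree with each other (the view functions are my own illustrative examples):

```python
# Sketch: a single totally ordered log of writes; each derived view is
# produced by replaying the same sequence, so the views stay consistent.

write_log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]

def key_value_view(log):
    kv = {}
    for _, key, value in log:
        kv[key] = value
    return kv

def audit_view(log):
    """A different representation of the same writes: per-key history."""
    history = {}
    for _, key, value in log:
        history.setdefault(key, []).append(value)
    return history

assert key_value_view(write_log) == {"x": 3, "y": 2}
assert audit_view(write_log) == {"x": [1, 3], "y": [2]}
```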
Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events.
In cases where there is no causal link between events, the lack of a total order is not a big problem, since concurrent events can be ordered arbitrarily.
A partitioned system with secondary indexes either needs to send writes to multiple partitions (if the index is term-partitioned) or send reads to all partitions (if the index is document-partitioned).
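A sketch contrasting the two layouts, with illustrative partition routing via Python's built-in hash:

```python
# Sketch of the two patterns (partition layout is an illustrative assumption).

NUM_PARTITIONS = 4
partitions = [{"docs": {}, "index": {}} for _ in range(NUM_PARTITIONS)]

# Document-partitioned (local) index: a write goes to one partition,
# but a read on the index must scatter/gather across all partitions.
def write_doc_local(doc_id, doc, term):
    p = partitions[hash(doc_id) % NUM_PARTITIONS]
    p["docs"][doc_id] = doc
    p["index"].setdefault(term, []).append(doc_id)

def read_term_local(term):
    results = []
    for p in partitions:                     # read hits every partition
        results.extend(p["index"].get(term, []))
    return results

# Term-partitioned (global) index: a read on the index goes to one
# partition, but a write may have to update several index partitions.
def write_doc_global(doc_id, doc, terms):
    partitions[hash(doc_id) % NUM_PARTITIONS]["docs"][doc_id] = doc
    for term in terms:                       # write touches several partitions
        p = partitions[hash(term) % NUM_PARTITIONS]
        p["index"].setdefault(term, []).append(doc_id)

def read_term_global(term):
    return partitions[hash(term) % NUM_PARTITIONS]["index"].get(term, [])

write_doc_local(1, {"title": "red fox"}, term="red")
assert read_term_local("red") == [1]
```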
Stream processing allows changes in the input to be reflected in derived views with low delay, whereas batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset.
Derived views allow gradual evolution.
Incrementalizing a batch computation adds complexity, making it more akin to the streaming layer, which runs counter to the goal of keeping the batch layer as simple as possible.
Unix developed pipes and files that are just sequences of bytes, whereas databases developed SQL and transactions.
Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it is acting like the database subsystem that keeps indexes or materialized views up to date.
At a system level, asynchronous event streams make the system as a whole more robust to outages or performance degradation of individual components.
Event logs provide an interface that is powerful enough to capture fairly strong consistency properties (due to durability and ordering of events), but also general enough to be applicable to almost any kind of data.
When one dataset is derived from another, it goes through some kind of transformation function.
A full-text search index is created by applying various natural language processing functions such as language detection, word segmentation, stemming or lemmatization, spelling correction, and synonym identification, followed by building a data structure for efficient lookups (such as an inverted index).
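A heavily simplified sketch of that pipeline, substituting plain lowercasing and whitespace splitting for the real NLP stages:

```python
# Sketch: build an inverted index mapping each term to the IDs of the
# documents that contain it. Real systems apply the NLP steps above first.

from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():   # stand-in for segmentation/stemming
            index[token].add(doc_id)
    return index

docs = {1: "the quick brown fox", 2: "the lazy dog"}
index = build_inverted_index(docs)
assert index["the"] == {1, 2}
assert index["fox"] == {1}
```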
In a machine learning system, we can consider the model as being derived from the training data by applying various feature extraction and statistical analysis functions.
When the function that creates a derived dataset is not a standard cookie-cutter function like creating a secondary index, custom code is required to handle the application-specific aspects.
Most web applications today are deployed as stateless services, in which any user request can be routed to any application server, and the server forgets everything about the request once it has sent the response.
Stable message ordering and fault-tolerant message processing are quite stringent demands, but they are much less expensive and more operationally robust than distributed transactions.
The fastest and most reliable network request is no network request at all!
If you want to reconstruct the original output, you will need to obtain the historical exchange rate at the original time of purchase.
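A tiny illustration, with made-up rates keyed by date:

```python
# Sketch: replaying a purchase event deterministically requires the rate
# in effect at the original time, not today's rate. The rate table here
# is illustrative made-up data.

rates_by_date = {"2024-10-21": 1.25, "2024-11-26": 1.5}  # USD per EUR

def price_in_usd(amount_eur: float, purchase_date: str) -> float:
    return amount_eur * rates_by_date[purchase_date]

# Reprocessing the same event months later yields the same output,
# because the lookup uses the event's own timestamp.
assert price_in_usd(100.0, "2024-10-21") == 125.0
```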
Taken together, the write path and the read path encompass the whole journey of the data, from the point where it is collected to the point where it is consumed (probably by another human).
The read path is the portion of the journey that only happens when someone asks for it.
No index means less work on the write path (no index to update), but a lot more work on the read path.
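A toy sketch of the trade-off, assuming an append-only list of rows and an optional value-to-position index:

```python
# Sketch: without an index, writes are a single append but reads scan
# everything; maintaining an index shifts that work onto the write path.

rows = []
index = {}          # value -> list of row positions

def write(value, use_index=False):
    rows.append(value)                     # write path: O(1) append
    if use_index:                          # extra write-path work
        index.setdefault(value, []).append(len(rows) - 1)

def read(value, use_index=False):
    if use_index:
        return index.get(value, [])        # read path: O(1) lookup
    return [i for i, v in enumerate(rows) if v == value]  # full scan

write("a", use_index=True)
assert read("a", use_index=True) == [0]
assert read("a") == [0]
```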
In particular, we can think of the on-device state as a cache of state on the server.
In a typical web page, if you load the page in a web browser and the data subsequently changes on the server, the browser does not find out about the change until you reload the page.
When a client is first initialized, it would still need to use a read path to get its initial state, but thereafter it could rely on a stream of state changes sent by the server.
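A sketch of that client lifecycle, assuming simple put/delete change events (the event shape is an illustrative assumption):

```python
# Sketch: on-device state as a cache of server state, initialized once
# via a read and then kept fresh by applying a stream of change events.

def initialize(server_snapshot: dict) -> dict:
    return dict(server_snapshot)          # one-time read path

def apply_change(state: dict, event: dict) -> None:
    if event["op"] == "put":
        state[event["key"]] = event["value"]
    elif event["op"] == "delete":
        state.pop(event["key"], None)

state = initialize({"title": "Draft"})
for event in [{"op": "put", "key": "title", "value": "Final"},
              {"op": "delete", "key": "obsolete"}]:
    apply_change(state, event)            # pushed by the server, no reload
assert state == {"title": "Final"}
```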
When both the writes and the reads are represented as events, and routed to the same stream operator in order to be handled, we are in fact performing a stream-table join between the stream of read queries and the database.
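A sketch of treating reads as events, assuming a single operator that owns the table (event shapes are illustrative):

```python
# Sketch: read requests flow through the same operator as writes, so a
# read is just another event joined against the operator's local table.

table = {}        # operator's local state (the "table" side of the join)
responses = []    # stream of responses back to clients

def handle(event):
    if event["type"] == "write":
        table[event["key"]] = event["value"]
    elif event["type"] == "read":               # stream-table join
        responses.append((event["client"], table.get(event["key"])))

for e in [{"type": "write", "key": "x", "value": 1},
          {"type": "read", "key": "x", "client": "c1"}]:
    handle(e)
assert responses == [("c1", 1)]
```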
For queries that only touch a single partition, the effort of sending queries through a stream and collecting a stream of responses is perhaps overkill.
Often simple solutions appear to work correctly when concurrency is low and there are no faults, but turn out to have many subtle bugs in more demanding circumstances.
Any lost packets are retransmitted and any duplicates are removed by the TCP stack before it hands the data to an application.
In many databases, a transaction is tied to a client connection (if the client sends several queries, the database knows that they belong to the same transaction because they are sent on the same TCP connection).
The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system.
Although the low-level features (TCP duplicate suppression, Ethernet checksums, WiFi encryption) cannot provide the desired end-to-end features by themselves, they are still useful, since they reduce the probability of problems at the higher levels.
We just need to remember that the low-level reliability features are not by themselves sufficient to ensure end-to-end correctness.
When we refuse to use distributed transactions because they are too expensive, we end up having to reimplement fault-tolerance mechanisms in application code.
Uniqueness checking can be scaled out by partitioning based on the value that needs to be unique.
Every request for a username is encoded as a message, and appended to a partition determined by the hash of the username.
A stream processor sequentially reads the requests in the log, using a local database to keep track of which usernames are taken.
Its fundamental principle is that any writes that may conflict are routed to the same partition and processed sequentially.
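Putting the last four highlights together, a minimal sketch; the hashing, log layout, and status values are illustrative assumptions:

```python
# Sketch: username-claim requests are hashed to a partition, and a
# single-threaded processor per partition decides each claim against its
# local set of taken names, so conflicting claims resolve deterministically.

import hashlib

NUM_PARTITIONS = 4
partition_logs = [[] for _ in range(NUM_PARTITIONS)]

def submit_claim(username: str) -> int:
    p = int(hashlib.md5(username.encode()).hexdigest(), 16) % NUM_PARTITIONS
    partition_logs[p].append({"username": username})
    return p

def process_partition(p: int, taken: set):
    """Sequential processing: the first claim on a name wins."""
    for request in partition_logs[p]:
        name = request["username"]
        if name in taken:
            yield (name, "rejected")
        else:
            taken.add(name)
            yield (name, "granted")

p = submit_claim("alice")
submit_claim("alice")                  # conflicting claim, same partition
print(list(process_partition(p, taken=set())))
# [('alice', 'granted'), ('alice', 'rejected')]
```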
Ensuring that an operation is executed atomically, while satisfying constraints, becomes more interesting when several partitions are involved.
The request to transfer money from account A to account B is given a unique request ID by the client, and appended to a log partition based on the request ID.
To avoid the need for a distributed transaction, we first durably log the request as a single message, and then derive the credit and debit instructions from that first message.
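A sketch of that pattern; the field names and log layout are illustrative assumptions:

```python
# Sketch: the transfer request is logged once as a single message, and a
# processor derives the debit and credit instructions from it. Rerunning
# the derivation is harmless because outputs carry the same request ID.

request_log = []                       # partitioned by request ID
account_logs = {"A": [], "B": []}      # per-account instruction streams

def submit_transfer(request_id, src, dst, amount):
    request_log.append({"id": request_id, "src": src, "dst": dst,
                        "amount": amount})

def derive_instructions():
    """Deterministic: downstream consumers deduplicate by request ID."""
    for req in request_log:
        account_logs[req["src"]].append(("debit", req["id"], req["amount"]))
        account_logs[req["dst"]].append(("credit", req["id"], req["amount"]))

submit_transfer("req-42", "A", "B", 100)
derive_instructions()
assert account_logs["A"] == [("debit", "req-42", 100)]
assert account_logs["B"] == [("credit", "req-42", 100)]
```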
Timeliness means ensuring that users observe the system in an up-to-date state.
Integrity means absence of corruption; i.e., no data loss, and no contradictory or false data.
Atomicity and durability are important tools for preserving integrity.
Violations of timeliness can be annoying and confusing, but violations of integrity can be catastrophic.