Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Rate it:
Open Preview
68%
Flag icon
Even if an operation is not naturally idempotent, it can often be made idempotent with a bit of extra metadata.
69%
Flag icon
Log compaction allows the stream to retain a full copy of the contents of a database.
69%
Flag icon
Representing databases as streams opens up powerful opportunities for integrating systems.
69%
Flag icon
It’s possible to create a load balancing scheme in which two consumers share the work of processing a partition by having both read the full set of messages, but one of them only considers messages with even-numbered offsets while the other deals with the odd-numbered offsets.
70%
Flag icon
What one person considers to be an obscure and pointless feature may well be a central requirement for someone else.
71%
Flag icon
If it is possible for you to funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order.
71%
Flag icon
Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events.
71%
Flag icon
In cases where there is no causal link between events, the lack of a total order is not a big problem, since concurrent events can be ordered arbitrarily.
71%
Flag icon
A partitioned system with secondary indexes either needs to send writes to multiple partitions (if the index is term-partitioned) or send reads to all partitions (if the index is document-partitioned).
71%
Flag icon
Stream processing allows changes in the input to be reflected in derived views with low delay, whereas batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset.
71%
Flag icon
Derived views allow gradual evolution.
72%
Flag icon
Incrementalizing a batch computation adds complexity, making it more akin to the streaming layer, which runs counter to the goal of keeping the batch layer as simple as possible.
72%
Flag icon
Unix developed pipes and files that are just sequences of bytes, whereas databases developed SQL and transactions.
72%
Flag icon
Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it is acting like the database subsystem that keeps indexes or materialized views up to date.
72%
Flag icon
At a system level, asynchronous event streams make the system as a whole more robust to outages or performance degradation of individual components.
72%
Flag icon
Event logs provide an interface that is powerful enough to capture fairly strong consistency properties (due to durability and ordering of events), but also general enough to be applicable to almost any kind of data.
73%
Flag icon
When one dataset is derived from another, it goes through some kind of transformation function.
73%
Flag icon
A full-text search index is created by applying various natural language processing functions such as language detection, word segmentation, stemming or lemmatization, spelling correction, and synonym identification, followed by building a data structure for efficient lookups (such as an inverted index).
73%
Flag icon
In a machine learning system, we can consider the model as being derived from the training data by applying various feature extraction and statistical analysis functions.
73%
Flag icon
When the function that creates a derived dataset is not a standard cookie-cutter function like creating a secondary index, custom code is required to handle the application-specific aspects.
73%
Flag icon
Most web applications today are deployed as stateless services, in which any user request can be routed to any application server, and the server forgets everything about the request once it has sent the response.
73%
Flag icon
Stable message ordering and fault-tolerant message processing are quite stringent demands, but they are much less expensive and more operationally robust than distributed transactions.
73%
Flag icon
The fastest and most reliable network request is no network request at all!
73%
Flag icon
If you want to reconstruct the original output, you will need to obtain the historical exchange rate at the original time of purchase.
73%
Flag icon
Taken together, the write path and the read path encompass the whole journey of the data, from the point where it is collected to the point where it is consumed (probably by another human).
73%
Flag icon
The read path is the portion of the journey that only happens when someone asks for it.
73%
Flag icon
No index means less work on the write path (no index to update), but a lot more work on the read path.
74%
Flag icon
In particular, we can think of the on-device state as a cache of state on the server.
74%
Flag icon
In a typical web page, if you load the page in a web browser and the data subsequently changes on the server, the browser does not find out about the change until you reload the page.
74%
Flag icon
When a client is first initialized, it would still need to use a read path to get its initial state, but thereafter it could rely on a stream of state changes sent by the server.
74%
Flag icon
When both the writes and the reads are represented as events, and routed to the same stream operator in order to be handled, we are in fact performing a stream-table join between the stream of read queries and the database.
74%
Flag icon
For queries that only touch a single partition, the effort of sending queries through a stream and collecting a stream of responses is perhaps overkill.
74%
Flag icon
Often simple solutions appear to work correctly when concurrency is low and there are no faults, but turn out to have many subtle bugs in more demanding circumstances.
74%
Flag icon
Any lost packets are retransmitted and any duplicates are removed by the TCP stack before it hands the data to an application.
74%
Flag icon
In many databases, a transaction is tied to a client connection (if the client sends several queries, the database knows that they belong to the same transaction because they are sent on the same TCP connection).
75%
Flag icon
The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system.
75%
Flag icon
Although the low-level features (TCP duplicate suppression, Ethernet checksums, WiFi encryption) cannot provide the desired end-to-end features by themselves, they are still useful, since they reduce the probability of problems at the higher levels.
75%
Flag icon
We just need to remember that the low-level reliability features are not by themselves sufficient to ensure end-to-end correctness.
75%
Flag icon
When we refuse to use distributed transactions because they are too expensive, we end up having to reimplement fault-tolerance mechanisms in application code.
75%
Flag icon
Uniqueness checking can be scaled out by partitioning based on the value that needs to be unique.
75%
Flag icon
Every request for a username is encoded as a message, and appended to a partition determined by the hash of the username.
75%
Flag icon
A stream processor sequentially reads the requests in the log, using a local database to keep track of which usernames are taken.
75%
Flag icon
Its fundamental principle is that any writes that may conflict are routed to the same partition and processed sequentially.
75%
Flag icon
Ensuring that an operation is executed atomically, while satisfying constraints, becomes more interesting when several partitions are involved.
75%
Flag icon
The request to transfer money from account A to account B is given a unique request ID by the client, and appended to a log partition based on the request ID.
75%
Flag icon
To avoid the need for a distributed transaction, we first durably log the request as a single message, and then derive the credit and debit instructions from that first message.
76%
Flag icon
Timeliness means ensuring that users observe the system in an up-to-date state.
76%
Flag icon
Integrity means absence of corruption; i.e., no data loss, and no contradictory or false data.
76%
Flag icon
Atomicity and durability are important tools for preserving integrity.
76%
Flag icon
Violations of timeliness can be annoying and confusing, but violations of integrity can be catastrophic.