Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
1%
It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
3%
In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
3%
Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
5%
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
6%
Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure — i.e., there is an implicit schema, but it is not enforced by the database [20]. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it).
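To make the schema-on-read idea concrete, here is a minimal Python sketch (the field names and the name-splitting rule are illustrative, not taken from the book): the database stores whatever document it is given, and the reading code supplies the interpretation, including handling of older document shapes.

import json

def read_user(record_json: str) -> dict:
    # Schema-on-read: nothing enforced the structure at write time,
    # so the reader interprets the document and tolerates older shapes.
    doc = json.loads(record_json)
    if "first_name" not in doc:
        # Older documents stored a single "name" field; split it on read.
        first, _, last = doc.get("name", "").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
    return doc

print(read_user('{"name": "Ada Lovelace"}'))
# {'name': 'Ada Lovelace', 'first_name': 'Ada', 'last_name': 'Lovelace'}

Under schema-on-write, a relational schema would instead reject the old shape at insert time, and a migration would rewrite the existing rows.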
6%
A hybrid of the relational and document models is a good route for databases to take in the future.
10%
There is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics.
11%
Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.
11%
well-chosen indexes speed up read queries, but every index slows down writes.
12%
As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.
13%
A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations.
13%
Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.
15%
Log-structured storage engines are a comparatively recent development. Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
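As a rough illustration of how sequential writes fall out of the log-structured approach, here is a toy Python sketch loosely in the spirit of the hash-indexed log described in Chapter 3 (the class and method names are made up): every write appends to the end of a file, and an in-memory hash map records the byte offset of each key's latest value.

import os

class AppendOnlyStore:
    def __init__(self, path: str):
        self.path = path
        self.index: dict[str, int] = {}   # key -> byte offset of latest record
        open(path, "ab").close()          # make sure the log file exists

    def put(self, key: str, value: str) -> None:
        with open(self.path, "ab") as log:
            log.seek(0, os.SEEK_END)      # writes only ever go to the end of the log
            offset = log.tell()
            log.write(f"{key},{value}\n".encode())
        self.index[key] = offset          # the hash index lives in memory

    def get(self, key: str) -> str | None:
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)
            _, value = log.readline().decode().rstrip("\n").split(",", 1)
            return value

A real engine would also compact the log and rebuild the index on restart; the point here is only that the disk sees nothing but sequential appends.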
15%
When your queries require sequentially scanning across a large number of rows, indexes are much less relevant. Instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk.
19%
The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.
23%
Pretending that replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.
26%
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred.
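One common way to make “unaware of each other” precise is to compare version vectors; the following Python sketch is a generic illustration of that check, not code from the book.

def happens_before(a: dict[str, int], b: dict[str, int]) -> bool:
    # a happened before b if b has seen everything a had seen, plus more.
    replicas = set(a) | set(b)
    return (all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
            and any(a.get(r, 0) < b.get(r, 0) for r in replicas))

def concurrent(a: dict[str, int], b: dict[str, int]) -> bool:
    # Neither operation knows about the other: concurrent, whatever the wall clock said.
    return not happens_before(a, b) and not happens_before(b, a)

# Version vectors represented as {replica: counter}
print(concurrent({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}))  # True
print(concurrent({"r1": 1}, {"r1": 2}))                    # False: the first happened before the second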
31%
Today, when a system claims to be “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term.
31%
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity.
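A small Python sketch of what that abort guarantee means in practice, using SQLite as a stand-in (the table and the simulated fault are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        # ...imagine the process fails here, before the matching credit is written...
        raise RuntimeError("simulated fault mid-transaction")
except RuntimeError:
    pass

# Atomicity: the half-finished transfer was discarded rather than applied partially.
print(conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone())  # (100,)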
31%
Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID.
32%
A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.
33%
From a performance point of view, a key principle of snapshot isolation is that readers never block writers, and writers never block readers.
35%
Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency.
35%
A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking.
36%
Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot isolation, writers don’t block readers, and vice versa.
37%
Transactions are an abstraction layer that allows an application to pretend that certain concurrency problems and certain kinds of hardware and software faults don’t exist.
40%
Better hardware utilization is also a significant motivation for using virtual machines.
40%
Variable delays in networks are not a law of nature, but simply the result of a cost/benefit trade-off.
40%
If you use software that requires synchronized clocks, it is essential that you also carefully monitor the clock offsets between all the machines.
41%
It doesn’t make sense to think of a clock reading as a point in time — it is more like a range of times, within a confidence interval.
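A hedged sketch of treating a clock reading as an interval (the API below is loosely modeled on interval clocks such as Google's TrueTime, not on any particular library): two events can only be ordered safely when their uncertainty intervals do not overlap.

from dataclasses import dataclass

@dataclass
class ClockReading:
    earliest: float   # seconds since epoch, lower bound of the reading
    latest: float     # upper bound of the reading

    def definitely_before(self, other: "ClockReading") -> bool:
        # Only non-overlapping intervals give a trustworthy ordering.
        return self.latest < other.earliest

a = ClockReading(earliest=100.000, latest=100.007)   # a few ms of uncertainty
b = ClockReading(earliest=100.003, latest=100.010)
print(a.definitely_before(b))   # False: the intervals overlap, so the order is unknown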
43%
Safety is often informally defined as nothing bad happens, and liveness as something good eventually happens.
47%
Applications that don’t require linearizability can be more tolerant of network problems.
47%
A better way of phrasing CAP would be either Consistent or Available when Partitioned.
47%
Linearizability is slow — and this is true all the time, not only during a network fault.
50%
Distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions.
65%
Event sourcing is a powerful technique for data modeling: from an application point of view it is more meaningful to record the user’s actions as immutable events, rather than recording the effect of those actions on a mutable database.
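A tiny Python sketch of the contrast (the event names and fields are hypothetical): instead of overwriting the current state, the application appends immutable events and derives the state from them.

# Mutable-state style: each update destroys the history.
cart = {"item_123": 2}
cart["item_123"] = 1     # was that a removal, a correction, or a second thought?

# Event-sourcing style: record what the user did, as immutable facts.
events = [
    {"type": "ItemAdded",   "item": "item_123", "qty": 2},
    {"type": "ItemRemoved", "item": "item_123", "qty": 1},
]

def current_cart(events):
    state: dict[str, int] = {}
    for e in events:
        delta = e["qty"] if e["type"] == "ItemAdded" else -e["qty"]
        state[e["item"]] = state.get(e["item"], 0) + delta
    return {item: qty for item, qty in state.items() if qty > 0}

print(current_cart(events))   # {'item_123': 1}, and the full history is preserved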
66%
Storing data is normally quite straightforward if you don’t have to worry about how it is going to be queried and accessed; many of the complexities of schema design, indexing, and storage engines are the result of wanting to support certain query and access patterns (see Chapter 3). For this reason, you gain a lot of flexibility by separating the form in which data is written from the form it is read, and by allowing several different read views. This idea is sometimes known as command query responsibility segregation (CQRS).
66%
The traditional approach to database and schema design is based on the fallacy that data must be written in the same form as it will be queried. Debates about normalization and denormalization (see “Many-to-One and Many-to-Many Relationships”) become largely irrelevant if you can translate data from a write-optimized event log to read-optimized application state: it is entirely reasonable to denormalize data in the read-optimized views, as the translation process gives you a mechanism for keeping it consistent with the event log.
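Continuing in the same spirit, a hedged Python sketch (the events and views are invented for illustration): one write-optimized log can feed several read-optimized, denormalized views, and each view can be rebuilt from the log whenever needed, which is what keeps the denormalization safe.

# One write-optimized event log...
events = [
    {"type": "PageViewed", "user": "u1", "page": "/home"},
    {"type": "PageViewed", "user": "u2", "page": "/home"},
    {"type": "PageViewed", "user": "u1", "page": "/pricing"},
]

# ...translated into two independent read-optimized views.
views_per_page: dict[str, int] = {}
pages_per_user: dict[str, list[str]] = {}

for e in events:   # the translation process; re-running it rebuilds the views
    views_per_page[e["page"]] = views_per_page.get(e["page"], 0) + 1
    pages_per_user.setdefault(e["user"], []).append(e["page"])

print(views_per_page)   # {'/home': 2, '/pricing': 1}
print(pages_per_user)   # {'u1': ['/home', '/pricing'], 'u2': ['/home']}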
66%
Deletion is more a matter of “making it harder to retrieve the data” than actually “making it impossible to retrieve the data.”
71%
Derived views allow gradual evolution. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch. Instead, you can maintain the old schema and the new schema side by side as two independently derived views onto the same underlying data.
72%
Whenever you run CREATE INDEX, the database essentially reprocesses the existing dataset (as discussed in “Reprocessing data for application evolution”) and derives the index as a new view onto the existing data. The existing data may be a snapshot of the state rather than a log of all changes that ever happened, but the two are closely related (see “State, Streams, and Immutability”).
73%
Building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization.
74%
The role of caches, indexes, and materialized views is simple: they shift the boundary between the read path and the write path.
75%
By themselves, TCP, database transactions, and stream processors cannot entirely rule out these duplicates. Solving the problem requires an end-to-end solution: a transaction identifier that is passed all the way from the end-user client to the database.
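A minimal sketch of that end-to-end idea in Python with SQLite (the schema and the transfer operation are invented for illustration): the client generates one request identifier for the operation, and a uniqueness constraint in the database makes any retry a harmless no-op.

import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transfers (
    request_id TEXT PRIMARY KEY,   -- end-to-end identifier chosen by the client
    account    TEXT,
    amount     INTEGER
)""")

def transfer(request_id: str, account: str, amount: int) -> None:
    try:
        with conn:
            conn.execute("INSERT INTO transfers VALUES (?, ?, ?)",
                         (request_id, account, amount))
    except sqlite3.IntegrityError:
        pass   # this request was already applied; the retry changes nothing

request_id = str(uuid.uuid4())       # generated once, in the end-user client
transfer(request_id, "alice", 50)
transfer(request_id, "alice", 50)    # e.g. the client retried after a timeout

print(conn.execute("SELECT count(*) FROM transfers").fetchone())   # (1,)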
75%
Just because an application uses a data system that provides comparatively strong safety properties, such as serializable transactions, that does not mean the application is guaranteed to be free from data loss or corruption. The application itself needs to take end-to-end measures, such as duplicate suppression, as well.
76%
Violations of timeliness are “eventual consistency,” whereas violations of integrity are “perpetual inconsistency.”
76%
You cannot reduce the number of apologies to zero, but you can aim to find the best trade-off for your needs — the sweet spot where there are neither too many inconsistencies nor too many availability problems.
77%
It is important to try restoring from your backups from time to time — otherwise you may only find out that your backup is broken when it is too late and you have already lost data. Don’t just blindly trust that it is all working.
77%
If you are not afraid of making changes, you can evolve an application much more easily to meet changing requirements.
78%
Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret.