Kindle Notes & Highlights
Read between April 9 - May 30, 2019
It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
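As an illustration of that contrast (a minimal sketch with made-up field names, not an example from the book), the same profile can be held either as one self-contained document or as normalized rows that rely on joins:

    # Document model: one JSON-like tree with the one-to-many "positions"
    # nested inside the record, giving good storage locality.
    profile_document = {
        "user_id": 251,
        "name": "Alice",
        "positions": [
            {"title": "Engineer", "org": "Acme"},
            {"title": "Intern", "org": "Initech"},
        ],
    }

    # Relational model: the same data normalized into rows; joins and
    # many-to-one / many-to-many relationships are the database's job.
    users = [(251, "Alice")]                    # (user_id, name)
    positions = [(1, 251, "Engineer", "Acme"),  # (position_id, user_id, title, org)
                 (2, 251, "Intern", "Initech")]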
Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure — i.e., there is an implicit schema, but it is not enforced by the database [20]. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it).
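A minimal Python sketch of that distinction, using hypothetical table and field names (sqlite3 stands in for any relational database):

    import json
    import sqlite3

    # Schema-on-read: the stored record is an uninterpreted blob; the structure
    # is an implicit assumption made by the code that reads it.
    record = json.loads('{"name": "Alice", "nickname": "Al"}')
    display_name = record.get("nickname", record["name"])  # implicit schema

    # Schema-on-write: the relational database enforces the declared structure
    # for every row at the moment it is written.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    db.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))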
A hybrid of the relational and document models is a good route for databases to take in the future.
there is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics.
Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.
well-chosen indexes speed up read queries, but every index slows down writes.
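A small sketch of that trade-off (hypothetical table and index names, sqlite3 used only for illustration):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")

    # The index makes this lookup fast instead of a full table scan...
    db.execute("CREATE INDEX idx_events_user ON events (user_id)")
    rows = db.execute("SELECT * FROM events WHERE user_id = ?", (42,)).fetchall()

    # ...but every insert now has to update the index as well as the table,
    # so each additional index costs some write throughput.
    db.execute("INSERT INTO events (user_id, payload) VALUES (?, ?)", (42, "clicked"))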
As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads
A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations
Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.
Log-structured storage engines are a comparatively recent development. Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
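A toy version of the idea, assuming a simple comma-separated record format and an in-memory hash index (a sketch, not how any particular storage engine is implemented):

    import os

    class AppendOnlyStore:
        """Every write is a sequential append; reads go through a hash index."""
        def __init__(self, path):
            self.index = {}                    # key -> byte offset of latest value
            self.f = open(path, "a+b")

        def set(self, key, value):
            self.f.seek(0, os.SEEK_END)
            offset = self.f.tell()
            self.f.write(f"{key},{value}\n".encode())   # sequential write only
            self.f.flush()
            self.index[key] = offset

        def get(self, key):
            offset = self.index.get(key)
            if offset is None:
                return None
            self.f.seek(offset)
            _, _, value = self.f.readline().decode().rstrip("\n").partition(",")
            return value

    store = AppendOnlyStore("data.log")
    store.set("user:42", "Alice")
    print(store.get("user:42"))                # -> Alice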
when your queries require sequentially scanning across a large number of rows, indexes are much less relevant. Instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk.
The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.
Pretending that replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred.
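This definition can be made concrete with version vectors: each node counts the writes it has seen, and operation A happened before B only if B had already seen everything A had. A minimal sketch (node names are made up):

    def happens_before(a, b):
        # a and b are version vectors: {node_id: counter}
        nodes = set(a) | set(b)
        return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
                and any(a.get(n, 0) < b.get(n, 0) for n in nodes))

    def concurrent(a, b):
        # Neither operation knew about the other, so they are concurrent.
        return not happens_before(a, b) and not happens_before(b, a)

    v1 = {"node1": 2, "node2": 0}
    v2 = {"node1": 1, "node2": 1}
    print(concurrent(v1, v2))   # True: the two writes were unaware of each other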
Today, when a system claims to be “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term.
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity.
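For example (a sketch using sqlite3 and invented account names), a simulated fault between two writes leaves the database as if the transaction had never happened:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    db.commit()

    simulate_crash = True
    try:
        db.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        if simulate_crash:
            raise RuntimeError("fault between the debit and the credit")
        db.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        db.commit()
    except Exception:
        db.rollback()   # atomicity: the half-finished debit is discarded

    print(db.execute("SELECT * FROM accounts ORDER BY name").fetchall())
    # [('alice', 100), ('bob', 0)] -- unchanged, as if nothing had been attempted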
Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID.
A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.
From a performance point of view, a key principle of snapshot isolation is readers never block writers, and writers never block readers
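The mechanism behind this is multi-version concurrency control (MVCC): the database keeps several committed versions of each object, and a reader only looks at versions committed before its own transaction started, so it never waits for a writer. A heavily simplified sketch:

    # versions maps each key to a list of (commit_txid, value), oldest first.
    versions = {
        "x": [(5, "old value"),
              (9, "new value")],
    }

    def read(key, snapshot_txid):
        # A reader sees the latest version committed at or before its snapshot,
        # regardless of what writers are doing concurrently.
        visible = [v for txid, v in versions[key] if txid <= snapshot_txid]
        return visible[-1] if visible else None

    print(read("x", snapshot_txid=7))    # -> 'old value'
    print(read("x", snapshot_txid=12))   # -> 'new value'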
Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency.
A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking.
Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot isolation, writers don’t block readers, and vice versa.
Transactions are an abstraction layer that allows an application to pretend that certain concurrency problems and certain kinds of hardware and software faults don’t exist.
Better hardware utilization is also a significant motivation for using virtual machines.
Variable delays in networks are not a law of nature, but simply the result of a cost/benefit trade-off.
if you use software that requires synchronized clocks, it is essential that you also carefully monitor the clock offsets between all the machines.
it doesn’t make sense to think of a clock reading as a point in time — it is more like a range of times, within a confidence interval:
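One way to make that concrete (a sketch with a made-up error bound, not any particular clock API) is to report every reading as an interval; two readings can only be ordered with confidence if their intervals do not overlap:

    import time

    def clock_read(error_bound_s=0.007):
        # Hypothetical +/- 7 ms uncertainty around the local clock reading.
        now = time.time()
        return (now - error_bound_s, now + error_bound_s)

    earliest, latest = clock_read()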
Safety is often informally defined as nothing bad happens, and liveness as something good eventually happens.
applications that don’t require linearizability can be more tolerant of network problems.
a better way of phrasing CAP would be either Consistent or Available when Partitioned
Linearizability is slow — and this is true all the time, not only during a network fault.
distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions
Event sourcing is a powerful technique for data modeling: from an application point of view it is more meaningful to record the user’s actions as immutable events, rather than recording the effect of those actions on a mutable database.
Storing data is normally quite straightforward if you don’t have to worry about how it is going to be queried and accessed; many of the complexities of schema design, indexing, and storage engines are the result of wanting to support certain query and access patterns (see Chapter 3). For this reason, you gain a lot of flexibility by separating the form in which data is written from the form it is read, and by allowing several different read views. This idea is sometimes known as command query responsibility segregation
The traditional approach to database and schema design is based on the fallacy that data must be written in the same form as it will be queried. Debates about normalization and denormalization (see “Many-to-One and Many-to-Many Relationships”) become largely irrelevant if you can translate data from a write-optimized event log to read-optimized application state: it is entirely reasonable to denormalize data in the read-optimized views, as the translation process gives you a mechanism for keeping it consistent with the event log.
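A small sketch of the idea running through the last few highlights: user actions are appended to an immutable event log (the write-optimized form), and a denormalized, read-optimized view is derived by folding over that log, so the view can be rebuilt at any time and stays consistent with the events. Event and field names here are invented for illustration:

    event_log = [
        {"type": "item_added",   "cart": "c1", "item": "book",   "qty": 1},
        {"type": "item_added",   "cart": "c1", "item": "pencil", "qty": 2},
        {"type": "item_removed", "cart": "c1", "item": "pencil"},
    ]

    def derive_cart_view(events):
        view = {}                               # read-optimized state
        for e in events:
            cart = view.setdefault(e["cart"], {})
            if e["type"] == "item_added":
                cart[e["item"]] = cart.get(e["item"], 0) + e["qty"]
            elif e["type"] == "item_removed":
                cart.pop(e["item"], None)
        return view

    print(derive_cart_view(event_log))          # {'c1': {'book': 1}}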
Deletion is more a matter of “making it harder to retrieve the data” than actually “making it impossible to retrieve the data.”
Derived views allow gradual evolution. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch. Instead, you can maintain the old schema and the new schema side by side as two independently derived views onto the same underlying data.
Whenever you run CREATE INDEX, the database essentially reprocesses the existing dataset (as discussed in “Reprocessing data for application evolution”) and derives the index as a new view onto the existing data. The existing data may be a snapshot of the state rather than a log of all changes that ever happened, but the two are closely related (see “State, Streams, and Immutability”).
building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization.
the role of caches, indexes, and materialized views is simple: they shift the boundary between the read path and the write path.
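For instance (a sketch with invented names), an unread-message count can be computed on the read path by scanning, or shifted onto the write path by maintaining a precomputed counter, which is what a cache or materialized view does:

    messages = []                       # (recipient, read_flag) pairs
    unread_counts = {}                  # derived view, maintained at write time

    def send_message(recipient):
        messages.append((recipient, False))
        unread_counts[recipient] = unread_counts.get(recipient, 0) + 1  # write path does the work

    def unread_via_scan(recipient):     # read path does the work
        return sum(1 for r, read in messages if r == recipient and not read)

    def unread_via_view(recipient):     # read path is a cheap lookup
        return unread_counts.get(recipient, 0)

    send_message("alice")
    assert unread_via_scan("alice") == unread_via_view("alice") == 1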
By themselves, TCP, database transactions, and stream processors cannot entirely rule out these duplicates. Solving the problem requires an end-to-end solution: a transaction identifier that is passed all the way from the end-user client to the database.
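A sketch of that end-to-end pattern (sqlite3 and the table names are stand-ins): the client generates the identifier once, and the server records it in the same transaction as the operation's effect, so a network-level retry changes nothing:

    import sqlite3
    import uuid

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed_requests (request_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    db.execute("INSERT INTO accounts VALUES ('alice', 100)")
    db.commit()

    def transfer(request_id, amount):
        try:
            with db:   # one transaction: the ID record and the effect commit together
                db.execute("INSERT INTO processed_requests VALUES (?)", (request_id,))
                db.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                           (amount,))
        except sqlite3.IntegrityError:
            pass       # duplicate request ID: the retried request is suppressed

    request_id = str(uuid.uuid4())      # generated once, at the end-user client
    transfer(request_id, 30)
    transfer(request_id, 30)            # retry after a lost acknowledgment: no effect
    print(db.execute("SELECT balance FROM accounts").fetchone())   # (70,)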
just because an application uses a data system that provides comparatively strong safety properties, such as serializable transactions, that does not mean the application is guaranteed to be free from data loss or corruption. The application itself needs to take end-to-end measures, such as duplicate suppression, as well.
violations of timeliness are “eventual consistency,” whereas violations of integrity are “perpetual inconsistency.”
You cannot reduce the number of apologies to zero, but you can aim to find the best trade-off for your needs — the sweet spot where there are neither too many inconsistencies nor too many availability problems.
it is important to try restoring from your backups from time to time — otherwise you may only find out that your backup is broken when it is too late and you have already lost data. Don’t just blindly trust that it is all working.
If you are not afraid of making changes, you can much better evolve an application to meet changing requirements.
Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret.