Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Kindle Notes & Highlights
3%
It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance — fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
3%
However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves.
3%
“good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations”
3%
Monitoring the health of the system and quickly restoring service if it goes into a bad state
3%
Tracking down the cause of problems, such as system failures or degraded performance
3%
Keeping tabs on how different systems affect each other, so that a problematic change can be avo...
3%
Anticipating future problems and solving them before they occur (e.g...
3%
Preserving the organization’s knowledge about the system, even as individual people come and go
3%
Exhibiting predictable behavior, minimizing surprises
3%
In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.
3%
Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
3%
One of the best tools we have for removing accidental complexity is abstraction.
3%
Throughout this book, we will keep our eyes open for good abstractions that allow us to extract parts of a large system into well-defined, reusable components.
3%
The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability
4%
The limits of my language mean the limits of my world.
Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)
4%
Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer? For example:
4%
In a complex application there may be more intermediary levels, such as APIs built upon APIs, but the basic idea is still the same: each layer hides the complexity of the layers below it by providing a clean data model.
4%
The Object-Relational Mismatch
4%
There is a one-to-many relationship from the user to these items, which can be represented in various ways:
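For instance, one natural representation (a sketch; the field names loosely follow the book's résumé example and are only illustrative) is to nest the items directly inside the user's document:

```python
# A one-to-many relationship expressed as a self-contained document:
# the single user owns lists of positions and education entries.
resume = {
    "user_id": 251,
    "first_name": "Bill",
    "positions": [  # one user -> many positions
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
    "education": [  # one user -> many education entries
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
}
```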
5%
The lack of a schema is often cited as an advantage; we will discuss this in “Schema flexibility in the document model”.
5%
The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes.
5%
However, when it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model [9]. That identifier is resolved at read time by using a join or follow-up queries.
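A rough sketch of the follow-up-query approach, with made-up collection and field names: the document stores only an identifier (like a foreign key), and the application resolves it when it reads the record:

```python
# Many-to-one reference stored as an ID and resolved at read time
# with a follow-up lookup against the referenced collection.
regions = {
    "region:91": {"id": "region:91", "name": "Greater Seattle Area"},
}
users = {
    "user:251": {"id": "user:251", "name": "Bill", "region_id": "region:91"},
}

def get_user_with_region(user_id):
    user = dict(users[user_id])
    user["region"] = regions[user["region_id"]]  # the follow-up query
    return user

print(get_user_with_region("user:251"))
```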
5%
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
5%
However, if your application does use many-to-many relationships, the document model becomes less appealing.
5%
For highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models (see “Graph-Like Data Models”) are the most natural.
6%
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking.
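As a small illustration of the schema-on-read side (a sketch with hypothetical field names): the structure is interpreted by the application when it reads a record, so old and new record shapes can coexist:

```python
# Schema-on-read: the application tolerates records written under an older
# implicit schema (a single "name" field) alongside newer ones
# (separate "first_name"/"last_name" fields).
def full_name(user: dict) -> str:
    if "first_name" in user:  # newer record shape
        return f'{user["first_name"]} {user.get("last_name", "")}'.strip()
    return user.get("name", "")  # older record shape

print(full_name({"name": "Ada Lovelace"}))
print(full_name({"first_name": "Ada", "last_name": "Lovelace"}))
```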
6%
But in cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure.
6%
It seems that relational and document databases are becoming more similar over time, and that is a good thing: the data models complement each other.
7%
MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.
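A sketch of that pattern using Python's built-in map and reduce rather than any particular database's MapReduce API; the data is invented for illustration:

```python
# Query logic expressed as small code snippets that the framework calls
# repeatedly: map emits a value per record, reduce (fold) combines them.
from functools import reduce

observations = [
    {"family": "Sharks", "num_animals": 3},
    {"family": "Sharks", "num_animals": 4},
    {"family": "Dolphins", "num_animals": 2},
]

emitted = map(lambda obs: obs["num_animals"], observations)
total = reduce(lambda acc, n: acc + n, emitted, 0)

print(total)  # 9
```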
7%
The aggregation pipeline language is similar in expressiveness to a subset of SQL, but it uses a JSON-based syntax rather than SQL’s English-sentence-style syntax; the difference is perhaps a matter of taste. The moral of the story is that a NoSQL system may find itself accidentally reinventing SQL, albeit in disguise.
10%
On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you.
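In that spirit, a toy key-value store can be sketched in a handful of lines (an illustration of the idea only, not the book's own example; it is neither efficient nor crash-safe):

```python
import os

DB_FILE = "database.log"  # hypothetical file name for this sketch

def db_set(key, value):
    # Store: append the record to the end of the log, never modify in place.
    with open(DB_FILE, "a", encoding="utf-8") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Retrieve: scan the whole log; the last entry for the key wins.
    if not os.path.exists(DB_FILE):
        return None
    result = None
    with open(DB_FILE, encoding="utf-8") as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result

db_set("42", "San Francisco")
print(db_get("42"))  # San Francisco
```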
10%
there is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics.
11%
In this book, log is used in the more general sense: an append-only sequence of records.
11%
Appending and segment merging are sequential write operations, which are generally much faster than random writes,
11%
both of which were inspired by Google’s Bigtable paper [9] (which introduced the terms SSTable and memtable).
11%
In order to optimize this kind of access, storage engines often use additional Bloom filters [15]. (A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key does not appear in the database, and thus saves many unnecessary disk reads for nonexistent keys.)
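A minimal Bloom filter sketch (sizes and hash choices are arbitrary, not what any particular engine uses): it answers "definitely not present" or "possibly present", so reads for nonexistent keys can skip the disk entirely:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per "bit", for simplicity

    def _positions(self, key):
        # Derive several positions from the key with salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False: definitely absent. True: possibly present (maybe a false positive).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:999"))  # almost certainly False
```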
12%
By contrast, B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
12%
The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page; i.e., all references to that page remain intact when the page is overwritten. This is in stark contrast to log-structured indexes such as LSM-trees, which only append to files (and eventually delete obsolete files) but never modify files in place.
12%
In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log).
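A rough sketch of the idea (file name and record format invented for illustration): every change is appended to the log before the page is touched, and the log is replayed after a crash:

```python
import json

WAL_FILE = "wal.log"   # hypothetical log file for this sketch
pages = {}             # in-memory stand-in for B-tree pages on disk

def write(page_id, data):
    # 1. Append the intended change to the write-ahead log first.
    with open(WAL_FILE, "a", encoding="utf-8") as wal:
        wal.write(json.dumps({"page": page_id, "data": data}) + "\n")
    # 2. Only then overwrite the page in place.
    pages[page_id] = data

def recover():
    # After a crash, replay the log to bring pages back to a consistent state.
    try:
        with open(WAL_FILE, encoding="utf-8") as wal:
            for line in wal:
                entry = json.loads(line)
                pages[entry["page"]] = entry["data"]
    except FileNotFoundError:
        pass

write(1, {"key": "foo", "value": "bar"})
recover()
print(pages)
```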
12%
However, benchmarks are often inconclusive and sensitive to details of the workload. You need to test systems with your particular workload in order to make a valid comparison.
12%
However, lower write amplification and reduced fragmentation are still advantageous on SSDs: representing data more compactly allows more read and write requests within the available I/O bandwidth.
12%
The impact on throughput and average response time is usually small, but at higher percentiles (see “Describing Performance”) the response time of queries to log-structured storage engines can sometimes be quite high, and B-trees can be more predictable
12%
Typically, SSTable-based storage engines do not throttle the rate of incoming writes, even if compaction cannot keep up, so you need explicit monitoring to detect this situation
12%
The heap file approach is common because it avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place.
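A sketch of that arrangement (structure and names are illustrative): the rows live once in the heap, and each secondary index stores only positions pointing into it:

```python
heap = []            # the "heap file": actual row data, stored exactly once
index_by_name = {}   # secondary indexes hold heap positions, not row copies
index_by_city = {}

def insert(row):
    pos = len(heap)  # location of the new row within the heap
    heap.append(row)
    index_by_name.setdefault(row["name"], []).append(pos)
    index_by_city.setdefault(row["city"], []).append(pos)

def lookup_by_city(city):
    return [heap[pos] for pos in index_by_city.get(city, [])]

insert({"name": "Alice", "city": "Berlin"})
insert({"name": "Bob", "city": "Berlin"})
print(lookup_by_city("Berlin"))
```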
13%
A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index) is known as a covering index or index with included columns, which stores some of a table’s columns within the index [33]. This allows some queries to be answered by using the index alone (in which case, the index is said to cover the query)
13%
More commonly, specialized spatial indexes such as R-trees are used.
13%
A 2D index could narrow down by timestamp and temperature simultaneously. This technique is used by HyperDex.
13%
Redis and Couchbase provide weak durability by writing to disk asynchronously.
13%
Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk