Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
1%
Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally.
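A minimal sketch of the idea, assuming nothing about any particular framework: a hypothetical decorator that makes a small, configurable fraction of calls fail on purpose, so the caller's retry/fallback path is exercised continuously instead of only during real outages.

```python
import random

def inject_faults(rate=0.01, exc=ConnectionError):
    """Hypothetical fault-injection decorator: deliberately fail a small
    fraction of calls so the fault-tolerance machinery keeps getting tested."""
    def wrap(fn):
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                raise exc("fault injected deliberately")
            return fn(*args, **kwargs)
        return wrapper
    return wrap

@inject_faults(rate=0.01)
def fetch_profile(user_id):
    # imagine a real network call here
    return {"id": user_id, "name": "example"}
```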
2%
Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
Robert Gustavo
Alternately, you can use Otis's account as test data on the sandbox, talk to external services, and then trigger a notification for his notes and highlights, cause him to get a bunch of likes, and hel…
2%
provide tools to recompute data (in case it turns out that the old computation was incorrect).
Chase DuBois
Ours are often not fast enough or are difficult to find or use
2%
The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance.
Robert
I guess when you follow someone (non-celebrity), it backfills your timeline?
Brian Rosenblat
chase, you've become quite the SME on social network architecture :)
Brian
@robert: yes, i believe so. nice thing is to use something like redis that you can insert entries in place without having to re-write the entire feed to cache.
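A toy sketch of the hybrid described above, not Twitter's actual implementation: ordinary authors are fanned out to followers' cached timelines at write time, while celebrities are fetched and merged at read time. All stores and names here are illustrative in-memory stand-ins.

```python
import heapq

timeline_cache = {}      # user_id -> list of (timestamp, tweet) fanned out at write time
tweets_by_author = {}    # author_id -> list of (timestamp, tweet)
follows = {}             # user_id -> set of followed author_ids
CELEBRITIES = set()      # authors with too many followers to fan out

def post_tweet(author_id, timestamp, tweet):
    tweets_by_author.setdefault(author_id, []).append((timestamp, tweet))
    if author_id in CELEBRITIES:
        return  # approach 1: no fan-out, fetched at read time instead
    # approach 2: fan out to every follower's cached home timeline
    # (a real system would keep a followers list rather than scanning all users)
    for user_id, followed in follows.items():
        if author_id in followed:
            timeline_cache.setdefault(user_id, []).append((timestamp, tweet))

def read_home_timeline(user_id):
    cached = timeline_cache.get(user_id, [])
    celebrity_tweets = [
        t for a in follows.get(user_id, set()) & CELEBRITIES
        for t in tweets_by_author.get(a, [])
    ]
    # merge the precomputed timeline with the celebrity tweets, newest first
    return list(heapq.merge(sorted(cached, reverse=True),
                            sorted(celebrity_tweets, reverse=True),
                            reverse=True))
```

This also shows why Brian's Redis point works: the write path only appends one entry per follower's cached list rather than rebuilding the whole feed.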
3%
When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked.
3%
One of the best tools we have for removing accidental complexity is abstraction
3%
The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones.
3%
Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.
4%
A JSON datatype is also supported by several databases, including IBM DB2, MySQL, and PostgreSQL
Chase DuBois
I had no idea
5%
The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes. Anything that is meaningful to humans may need to change sometime in the future — and if that information is duplicated, all the redundant copies need to be updated. That incurs write overheads, and risks inconsistencies (where some copies of the information are updated but others aren’t).
Brian
But merging duplicate records is always a pain, right? And the aggregation paradigm we have with works probably pops up in other domains as well, which unless you can bound the number of objects aggre…
6%
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static and dynamic type checking have big debates about their relative merits [22], enforcement of schemas in the database is a contentious topic, and in general there’s no right or wrong answer.
6%
Schema changes have a bad reputation of being slow and requiring downtime. This reputation is not entirely deserved: most relational database systems execute the ALTER TABLE statement in a few milliseconds. MySQL is a notable exception — it copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime when altering a large table — although various tools exist to work around this limitation [24, 25, 26].
Chase DuBois
Waah-wah MySQL.
Brian
What does Aurora do? :)
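A quick way to see the fast case for yourself, using SQLite (bundled with Python) purely as a stand-in; it is not one of the systems named above, but its ADD COLUMN is likewise a metadata-only change that doesn't rewrite existing rows.

```python
import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("user%d" % i,) for i in range(100_000)])

start = time.perf_counter()
# Adding a nullable column only touches the schema, not the 100k rows.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
print("ALTER TABLE took %.1f ms" % ((time.perf_counter() - start) * 1000))
```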
6%
If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality. If data is split across multiple tables, like in Figure 2-1, multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.
6%
It’s worth pointing out that the idea of grouping related data together for locality is not limited to the document model. For example, Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table
6%
Most relational database systems (other than MySQL)
Chase DuBois
I'm seeing a pattern here.
6%
declarative languages often lend themselves to parallel execution.
7%
The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
8%
By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model. Those features give graphs a great deal of flexibility for data modeling, as illustrated in Figure 2-5. The figure shows a few things that would be difficult to express in a traditional relational schema, such as different kinds of regional structures in different countries (France has départements and régions, whereas the US has counties and states), quirks of history such as a country within a country (ignoring for …
Brian
Hm, this quickly reminds me of book hierarchy, actually, which I hadn't thought of in this way. Especially with things like comic book volumes, box sets, series etc....
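A minimal property-graph sketch, loosely following the book's Lucy/Idaho/United States example; the plain-dict representation is just for illustration, not how a graph database actually stores data.

```python
# Vertices and edges as plain dicts; each edge carries a label, so several
# kinds of relationship can live in one graph.
vertices = {
    1: {"type": "country", "name": "United States"},
    2: {"type": "state",   "name": "Idaho"},
    3: {"type": "person",  "name": "Lucy"},
}
edges = [
    {"tail": 2, "head": 1, "label": "within"},    # Idaho is within the US
    {"tail": 3, "head": 2, "label": "born_in"},   # Lucy was born in Idaho
]

def outgoing(vertex_id, label):
    """Follow all edges with a given label from a vertex."""
    return [e["head"] for e in edges if e["tail"] == vertex_id and e["label"] == label]

def born_in_country(person_id):
    """Walk born_in, then any number of 'within' hops, to reach a country."""
    frontier = outgoing(person_id, "born_in")
    while frontier:
        v = frontier.pop()
        if vertices[v]["type"] == "country":
            return vertices[v]["name"]
        frontier.extend(outgoing(v, "within"))
    return None

print(born_in_country(3))  # -> "United States"
```

Adding a new vertex type or edge label needs no schema migration, which is the evolvability point made a couple of highlights below.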
8%
Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures.
9%
New nonrelational “NoSQL” datastores have diverged in two main directions: Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare. Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.
11%
well-chosen indexes speed up read queries, but every index slows down writes.
Brian
assuming you don't allow index updates to occur asynchronously with mechanisms to eventually propagate (anticipating later chapters).
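A toy illustration of that trade-off, not a real storage engine: every write pays extra work to keep a secondary index in sync, and the read uses the index instead of scanning.

```python
rows = {}            # primary key -> record
index_by_email = {}  # secondary index: email -> primary key

def put(pk, record):
    old = rows.get(pk)
    if old is not None:
        index_by_email.pop(old["email"], None)   # extra work on every write
    rows[pk] = record
    index_by_email[record["email"]] = pk         # extra work on every write

def find_by_email(email):
    pk = index_by_email.get(email)               # O(1) lookup instead of a full scan
    return rows.get(pk)

put(1, {"email": "a@example.com", "name": "Ada"})
print(find_by_email("a@example.com"))
```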
20%
During rolling upgrades, or for various other reasons, we must assume that different nodes are running different versions of our application’s code. Thus, it is important that all data flowing around the system is encoded in a way that provides backward compatibility (new code can read old data) and forward compatibility (old code can read new data).
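A small sketch of both properties using JSON; the field names are invented for the example.

```python
import json

# "Old" record, written before the optional field existed.
old_bytes = json.dumps({"user_id": 42, "name": "Ada"}).encode()

# "New" record, written by upgraded code with an extra field.
new_bytes = json.dumps({"user_id": 42, "name": "Ada", "plan": "pro"}).encode()

def decode_new(data):
    """New code reading old data (backward compatibility):
    a missing optional field falls back to a default."""
    doc = json.loads(data)
    return {"user_id": doc["user_id"], "name": doc["name"],
            "plan": doc.get("plan", "free")}

def decode_old(data):
    """Old code reading new data (forward compatibility):
    unknown fields are ignored rather than treated as an error."""
    doc = json.loads(data)
    return {"user_id": doc["user_id"], "name": doc["name"]}

print(decode_new(old_bytes))   # plan defaults to "free"
print(decode_old(new_bytes))   # extra "plan" field ignored
```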
20%
Except for MySQL, which often rewrites an entire table even though it is not strictly necessary, as mentioned in “Schema flexibility in the document model”
29%
The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants. However, the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).
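A toy sketch of a term-partitioned index with made-up partitioning: a read for a term goes to exactly one partition, while indexing a single document may touch several of them.

```python
# Toy term-partitioned ("global") secondary index.
N_PARTITIONS = 4
index_partitions = [dict() for _ in range(N_PARTITIONS)]   # term -> set of doc ids

def partition_for(term):
    return hash(term) % N_PARTITIONS

def index_document(doc_id, terms):
    # A single document write may hit several index partitions:
    # each term can hash to a different node.
    for term in terms:
        part = index_partitions[partition_for(term)]
        part.setdefault(term, set()).add(doc_id)

def search(term):
    # A read only needs the one partition that owns this term
    # (no scatter/gather across all partitions).
    return index_partitions[partition_for(term)].get(term, set())

index_document(1, ["red", "convertible"])
index_document(2, ["red", "sedan"])
print(search("red"))   # {1, 2}, served from a single partition
```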
29%
global secondary indexes
Chase DuBois
Now this name makes so much more sense!
29%
After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster. While rebalancing is happening, the database should continue accepting reads and writes. No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.
31%
ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed — for example, a process crashes, a network connection is interrupted, a disk becomes full, or some integrity constraint is violated. If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction.
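A runnable illustration with SQLite standing in for the database: two writes grouped into one transaction, where a constraint violation aborts the transaction and undoes the write that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 50), ('bob', 100)")
conn.commit()

try:
    # Both writes belong to one transaction; sqlite3's connection context
    # manager commits on success and rolls back if anything raises.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 80 WHERE name = 'bob'")    # applied...
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")  # ...violates the CHECK
except sqlite3.IntegrityError:
    print("transfer aborted, nothing was applied")

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 50), ('bob', 100)] -- the credit to bob was undone along with the failed debit
```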
31%
Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID.
Chase DuBois
This sounds like opinion. I wonder what the counterargument is.
31%
The classic database textbooks formalize isolation as serializability, which means that each transaction can pretend that it is the only transaction running on the entire database.
31%
in practice, serializable isolation is rarely used, because it carries a performance penalty. Some popular databases, such as Oracle 11g, don’t even implement it. In Oracle there is an isolation level called “serializable,” but it actually implements something called snapshot isolation, which is a weaker guarantee than serializability
32%
However, document databases lacking join functionality also encourage denormalization (see “Relational Versus Document Databases Today”). When denormalized information needs to be updated, like in the example of Figure 7-2, you need to update several documents in one go. Transactions are very useful in this situation to prevent denormalized data from going out of sync.
32%
datastores with leaderless replication (see “Leaderless Replication”) work much more on a “best effort” basis, which could be summarized as “the database will do as much as it can, and if it runs into an error, it won’t undo something it has already done” — so it’s the application’s responsibility to recover from errors.
Chase DuBois
Sounds a bit scary
32%
Concurrency bugs are hard to find by testing, because such bugs are only triggered when you get unlucky with the timing. Such timing issues might occur very rarely, and are usually difficult to reproduce. Concurrency is also very difficult to reason about, especially in a large application where you don’t necessarily know which other pieces of code are accessing the database.
Chase DuBois
Amen
32%
A popular comment on revelations of such problems is “Use an ACID database if you’re handling financial data!” — but that misses the point. Even many popular relational database systems (which are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these bugs from occurring.
33%
Snapshot isolation [28] is the most common solution to this problem. The idea is that each transaction reads from a consistent snapshot of the database — that is, the transaction sees all the data that was committed in the database at the start of the transaction. Even if the data is subsequently changed by another transaction, each transaction sees only the old data from that particular point in time.
33%
From a performance point of view, a key principle of snapshot isolation is readers never block writers, and writers never block readers.
33%
With append-only B-trees, every write transaction (or batch of transactions) creates a new B-tree root, and a particular root is a consistent snapshot of the database at the point in time when it was created.
Chase DuBois
Hard to picture this
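For Chase's "hard to picture": a toy persistent binary search tree (not a real B-tree) showing the path-copying idea. An insert never modifies an existing node; it copies the path from the root down to the change and returns a new root, so every old root remains a consistent snapshot.

```python
from collections import namedtuple

# Immutable nodes: writes copy the path from root to the insertion point.
Node = namedtuple("Node", ["key", "value", "left", "right"])

def insert(root, key, value):
    if root is None:
        return Node(key, value, None, None)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    elif key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)

def get(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

roots = [None]                      # each committed write appends a new root
roots.append(insert(roots[-1], "x", 1))
roots.append(insert(roots[-1], "y", 2))
roots.append(insert(roots[-1], "x", 99))

# Every old root is still a consistent snapshot of the database
# as it was when that root was created.
print(get(roots[2], "x"), get(roots[3], "x"))   # 1 99
```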
33%
Unfortunately, object-relational mapping frameworks make it easy to accidentally write code that performs unsafe read-modify-write cycles instead of using atomic operations provided by the database
Chase DuBois
yep
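A sketch of the two patterns side by side, with SQLite standing in for the real database and a made-up "likes" counter as the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('likes', 0)")
conn.commit()

def increment_unsafe(conn):
    # Read-modify-write cycle: two concurrent callers can both read the same
    # value and one increment gets lost -- the pattern ORMs make easy to write.
    (value,) = conn.execute("SELECT value FROM counters WHERE name = 'likes'").fetchone()
    conn.execute("UPDATE counters SET value = ? WHERE name = 'likes'", (value + 1,))
    conn.commit()

def increment_atomic(conn):
    # The whole update happens inside the database; there is no stale read to lose.
    conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'likes'")
    conn.commit()
```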
34%
the last write wins (LWW) conflict resolution method is prone to lost updates, as discussed in “Last write wins (discarding concurrent writes)”. Unfortunately, LWW is the default in many replicated databases.
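A toy LWW register, not any particular database's implementation: the write carrying the larger timestamp wins, and the concurrent write disappears without any error being reported.

```python
# key -> (timestamp, value)
stored = {}

def lww_write(key, timestamp, value):
    current = stored.get(key)
    if current is None or timestamp > current[0]:
        stored[key] = (timestamp, value)
    # otherwise the write is silently discarded

# Two clients write concurrently with nearly identical clocks:
lww_write("cart", 1000.001, ["book"])
lww_write("cart", 1000.000, ["laptop"])   # lost, even though it was a real update
print(stored["cart"])                      # (1000.001, ['book'])
```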
34%
Figure 7-8. Example of write skew causing an application bug.
34%
Write skew can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly (depending on the timing).
34%
This effect, where a write in one transaction changes the result of a search query in another transaction, is called a phantom [3]. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the examples we discussed, phantoms can lead to particularly tricky cases of write skew.
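A toy rendering of the on-call example from Figure 7-8, with plain dicts standing in for snapshot isolation: each transaction checks a snapshot taken before either one wrote, so both checks pass and the invariant (at least two doctors on call) is silently broken.

```python
on_call = {"alice": True, "bob": True}

def request_leave(doctor, snapshot):
    # The check runs against the snapshot taken when the transaction started.
    currently_on_call = sum(snapshot.values())
    if currently_on_call >= 2:
        on_call[doctor] = False     # the write goes to the live data
        return True
    return False

# Both transactions take their snapshot before either one writes:
snap_alice = dict(on_call)
snap_bob = dict(on_call)

request_leave("alice", snap_alice)   # sees 2 on call -> allowed
request_leave("bob", snap_bob)       # also sees 2 on call -> allowed

print(on_call)   # {'alice': False, 'bob': False} -- nobody is on call
```

The two transactions update different objects, so neither a dirty-write nor a lost-update check catches it; that is what makes write skew harder to prevent.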
35%
With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible. As they don’t need to wait for I/O and they avoid the overhead of other concurrency control mechanisms, they can achieve quite good throughput on a single thread.
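A toy single-threaded executor in the spirit of that design (all names invented): the client submits the whole transaction as one procedure, and the single worker runs procedures to completion one at a time against in-memory data, so no locks are needed.

```python
from queue import Queue
from threading import Thread

data = {"balance": 100}     # in-memory "database"
requests = Queue()

def worker():
    while True:
        procedure, reply = requests.get()
        # Only this thread ever touches `data`, so no concurrency control is needed.
        reply.put(procedure(data))

Thread(target=worker, daemon=True).start()

def submit(procedure):
    """Ship a whole transaction to the executor and wait for its result."""
    reply = Queue(maxsize=1)
    requests.put((procedure, reply))
    return reply.get()

def withdraw_30(db):
    db["balance"] -= 30
    return db["balance"]

print(submit(withdraw_30))   # 70
```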
38%
This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer a computer to crash completely rather than returning a wrong result, because wrong results are difficult and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are implemented and present an idealized system model that operates with mathematical perfection.
38%
nondeterminism and possibility of partial failures is what makes distributed systems hard to work with
38%
it is an old idea in computing to construct a more reliable system from a less reliable underlying base
38%
If you send a request and don’t get a response, it’s not possible to distinguish whether (a) the request was lost, (b) the remote node is down, or (c) the response was lost.
39%
If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse.
39%
In such environments, you can only choose timeouts experimentally: measure the distribution of network round-trip times over an extended period, and over many machines, to determine the expected variability of delays. Then, taking into account your application’s characteristics, you can determine an appropriate trade-off between failure detection delay and risk of premature timeouts.
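A sketch of that procedure with synthetic numbers standing in for real measurements; the distribution and the safety factor are invented, and the point is only that the timeout comes from the tail of observed round-trip times rather than a guess.

```python
import random, statistics

# Stand-in for round-trip times (ms) collected from real requests over time.
rtts_ms = [random.lognormvariate(3.0, 0.5) for _ in range(10_000)]

p99 = statistics.quantiles(rtts_ms, n=100)[98]   # 99th percentile of observed RTTs
safety_factor = 3                                 # application-specific trade-off
timeout_ms = p99 * safety_factor

print(f"p99 = {p99:.1f} ms, chosen timeout = {timeout_ms:.1f} ms")
```

A larger safety factor lowers the risk of declaring a slow-but-alive node dead prematurely, at the cost of detecting real failures more slowly.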
40%
With careful use of quality of service (QoS, prioritization and scheduling of packets) and admission control (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or provide statistically bounded delay
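One common form of the admission control mentioned here is a token bucket that rate-limits a sender; a toy version with invented parameters.

```python
import time

class TokenBucket:
    """Toy admission control: a sender may only transmit when a token is
    available, bounding its average rate and burst size."""
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_send(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # over the limit: delay or drop instead of queueing unboundedly

bucket = TokenBucket(rate_per_sec=100, burst=10)
sent = sum(bucket.try_send() for _ in range(1000))
print(f"admitted {sent} of 1000 back-to-back packets")
```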