Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Kindle Notes & Highlights
1%
there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka).
1%
There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization’s tolerance of different kinds of risk, regulatory
1%
It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
1%
Many critical bugs are actually due to poor error handling
1%
Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
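The arithmetic behind this estimate can be sketched as follows. It assumes a mean time to failure (MTTF) of roughly 10 to 50 years per disk, the range the book reports just before this sentence, and treats failures as independent and evenly spread over time. The function name is illustrative only.

```python
def expected_failures_per_day(num_disks, mttf_years):
    """Expected disk failures per day, assuming independent failures at a constant rate."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days

# Assumed MTTF range of 10 to 50 years per disk.
low = expected_failures_per_day(10_000, 50)
high = expected_failures_per_day(10_000, 10)
print(f"{low:.2f} to {high:.2f} failures per day")
```

Under these assumptions the estimate falls between about half a failure and a little under three failures per day, which is the same order of magnitude as "one disk per day".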
2%
A single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime.
2%
If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found [12].
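A minimal sketch of such a self-check, using a toy in-memory queue. The class `CountingQueue` and its method names are hypothetical, and the invariant is stated in a slightly relaxed form (messages in equal messages out plus messages still waiting) so that it can be checked at any moment, not only once the queue has drained.

```python
from collections import deque

class CountingQueue:
    """Toy in-memory queue that counts messages so it can audit itself."""

    def __init__(self):
        self._items = deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, message):
        self._items.append(message)
        self.enqueued += 1

    def get(self):
        message = self._items.popleft()
        self.dequeued += 1
        return message

    def invariant_holds(self):
        # Everything that came in has either gone out or is still waiting.
        return self.enqueued == self.dequeued + len(self._items)

queue = CountingQueue()
for message in ("a", "b", "c"):
    queue.put(message)
queue.get()

if not queue.invariant_holds():
    print("ALERT: message count discrepancy")
```

In a real system the check would run periodically in the background and feed a monitoring system rather than print.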
2%
Decouple the places where people make the most mistakes from the places where they can cause failures.
2%
The median is a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer than the median.
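As a small illustration, the median and a higher percentile can be computed from a list of measured response times. The sample values are made up, and the nearest-rank percentile used here is only one of several common definitions.

```python
import math
import statistics

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list (one of several definitions)."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

# Made-up response times in milliseconds, sorted ascending.
response_times_ms = sorted([30, 32, 35, 40, 41, 45, 50, 60, 120, 900])

p50 = statistics.median(response_times_ms)
p95 = percentile(response_times_ms, 95)
faster_than_median = sum(t < p50 for t in response_times_ms)
print(p50, p95, faster_than_median)
```

Half of the ten sample requests finish faster than the median, while the 95th percentile is dominated by the single very slow request.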
3%
service level objectives (SLOs) and service level agreements (SLAs),
3%
As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests — an effect sometimes known as head-of-line blocking.
3%
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time.
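The difference between the two kinds of load generator can be sketched by comparing when each one sends its requests. Both functions are hypothetical helpers that only compute send times; no real requests are issued, and the response times are invented.

```python
def closed_loop_send_times(response_times_ms):
    """Client that waits for each response before sending the next request."""
    send_times = []
    clock = 0
    for response_time in response_times_ms:
        send_times.append(clock)
        clock += response_time
    return send_times

def open_loop_send_times(interval_ms, count):
    """Client that sends on a fixed schedule, independent of responses."""
    return [i * interval_ms for i in range(count)]

# Invented response times: the third request hits a 2-second stall.
responses_ms = [100, 100, 2000, 100]
closed = closed_loop_send_times(responses_ms)
open_loop = open_loop_send_times(100, len(responses_ms))
print(closed)     # [0, 100, 200, 2200]
print(open_loop)  # [0, 100, 200, 300]
```

The waiting client delays its fourth request by about two seconds, so it quietly reduces the load exactly when the server is slow. The fixed-schedule client keeps the pressure on, which is closer to how independent users behave.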
3%
Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete.
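A short calculation shows how quickly this adds up. It assumes each backend call is slow independently with the same probability; the 1% figure and the function names are illustrative.

```python
def prob_request_is_slow(p_slow_call, num_calls):
    """Probability that at least one of num_calls independent calls is slow."""
    return 1 - (1 - p_slow_call) ** num_calls

def request_latency(call_latencies_ms):
    """A request that fans out in parallel finishes when its slowest call does."""
    return max(call_latencies_ms)

one_call = prob_request_is_slow(0.01, 1)
hundred_calls = prob_request_is_slow(0.01, 100)
print(f"{one_call:.1%} vs {hundred_calls:.1%}")  # 1.0% vs 63.4%
```

With 100 parallel calls, a slowness that affects only 1 in 100 backend calls ends up affecting well over half of end-user requests.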
3%
Distributing load across multiple machines is also known as a shared-nothing architecture.
3%
An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises
3%
complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
5%
in the network model, a record could have multiple parents. For example, there could be one record for the "Greater Seattle Area" region, and every user who lived in that region could be linked to it. This allowed many-to-one and many-to-many relationships to be modeled.
5%
The only way of accessing a record was to follow a path from a root record along these chains of links. This was called an access path.
5%
when it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference
5%
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
5%
you cannot refer directly to a nested item within a document,
5%
if your application does use many-to-many relationships, the document model becomes less appealing.
5%
Joins can be emulated in application code by making multiple requests to the database, but that also moves complexity into the application and is usually slower than a join performed by specialized code inside the database.
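The contrast can be sketched with Python's built-in `sqlite3` module. The `users` and `regions` tables and their contents are invented for illustration; the first query lets the database perform the join, while the loop emulates it with one extra query per user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, region_id INTEGER);
INSERT INTO regions VALUES (1, 'Greater Seattle Area');
INSERT INTO users VALUES (1, 'Alice', 1);
INSERT INTO users VALUES (2, 'Bob', 1);
""")

# Join performed inside the database: a single query.
joined = conn.execute(
    "SELECT users.name, regions.name FROM users "
    "JOIN regions ON users.region_id = regions.id ORDER BY users.id"
).fetchall()

# Join emulated in application code: one query for users, then one per user.
emulated = []
users = conn.execute("SELECT name, region_id FROM users ORDER BY id").fetchall()
for user_name, region_id in users:
    (region_name,) = conn.execute(
        "SELECT name FROM regions WHERE id = ?", (region_id,)
    ).fetchone()
    emulated.append((user_name, region_name))

print(joined == emulated)
```

Both approaches return the same rows, but the emulated version issues a separate query per user and leaves the application responsible for getting the matching logic right.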
6%
arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.
6%
schema-on-read (the structure of the data is implicit, and only interpreted when the data is read),
6%
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking.
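In practice, schema-on-read means the reading code has to cope with whatever shape a stored document happens to have. A minimal sketch, assuming older documents store a single `name` field and newer ones store `first_name`; the field names and sample documents are illustrative.

```python
import json

def first_name(raw_document):
    """Interpret a stored document at read time, tolerating old and new shapes."""
    user = json.loads(raw_document)
    if "first_name" in user:
        return user["first_name"]
    if user.get("name"):
        # Older documents only have a full name; derive the first name on read.
        return user["name"].split(" ")[0]
    return None  # no guarantee that either field exists

old_doc = '{"name": "Ada Lovelace"}'
new_doc = '{"first_name": "Ada", "last_name": "Lovelace"}'
print(first_name(old_doc), first_name(new_doc))
```

With schema-on-write, the equivalent change would be a migration that adds a column and backfills it, after which readers could rely on the field being present.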
6%
MySQL is a notable exception — it copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime when altering a large table
6%
The schema-on-read approach is advantageous if the items in the collection don’t all have the same structure for some reason
6%
for example, because there are many different types of objects and it is not practicable to put each type of object in its own table, or because the structure of the data is determined by external systems over which you have no control and which may change at any time.
6%
If data is split across multiple tables, like in Figure 2-1, multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.
6%
The locality advantage only applies if you need large parts of the document at the same time.
6%
Only modifications that don’t change the encoded size of a document can easily be performed in place [19].
6%
it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document [9]. These performance limitations significantly reduce the set of situations in which document databases are useful.
6%
Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table
6%
A hybrid of the relational and document models is a good route for databases to take in the future.
6%
An imperative language tells the computer to perform certain operations in a certain order.
6%
In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want — what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated) — but not how to achieve that goal. It is up to the database system’s query optimizer to decide which indexes and which join methods to use, and in which order to execute various parts of the query.
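The two styles can be put side by side. The sketch below filters a small, made-up list of animals first with an explicit loop and then with an SQL query run through Python's built-in `sqlite3` module; the table and column names are assumptions for illustration.

```python
import sqlite3

animals = [
    ("Great white", "Sharks"),
    ("Bottlenose", "Dolphins"),
    ("Hammerhead", "Sharks"),
]

def sharks_imperative(rows):
    """Imperative: spell out how to walk the data, step by step."""
    sharks = []
    for name, family in rows:
        if family == "Sharks":
            sharks.append(name)
    return sharks

def sharks_declarative(rows):
    """Declarative: state the condition and let the database decide how to run it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
    conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT name FROM animals WHERE family = 'Sharks' ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [name for (name,) in result]

print(sharks_imperative(animals) == sharks_declarative(animals))
```

The SQL version says nothing about iteration order or index use, so the database would be free to add an index on `family` later without the query changing.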
6%
makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
7%
there is nothing in SQL that constrains it to running on a single machine, and MapReduce doesn’t have a monopoly on distributed query execution.
7%
property graph model (implemented by Neo4j, Titan, and InfiniteGraph)
7%
triple-store model (implemented by Datomic, AllegroGraph, and others).
8%
Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes
8%
Cypher is a declarative query language for property graphs, created for the Neo4j graph database
8%
Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using something called recursive common table expressions (the WITH RECURSIVE syntax).
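A small sketch of a variable-length traversal using `WITH RECURSIVE`, run through Python's built-in `sqlite3` module (SQLite supports recursive common table expressions). The `located_in` table and its rows are invented for illustration; each row says that one place lies directly inside another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE located_in (child TEXT, parent TEXT);
INSERT INTO located_in VALUES ('Boise', 'Idaho');
INSERT INTO located_in VALUES ('Idaho', 'United States');
INSERT INTO located_in VALUES ('United States', 'North America');
""")

def ancestors(conn, place):
    """All places that contain `place`, following any number of located_in edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE ancestor(name) AS (
            SELECT parent FROM located_in WHERE child = ?
            UNION
            SELECT located_in.parent
            FROM located_in JOIN ancestor ON located_in.child = ancestor.name
        )
        SELECT name FROM ancestor
        """,
        (place,),
    ).fetchall()
    return {name for (name,) in rows}

boise_ancestors = ancestors(conn, "Boise")
print(sorted(boise_ancestors))
```

The query does not know in advance how many hops separate a place from the top of the hierarchy; the recursive step keeps following edges until no new rows appear.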
8%
The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas.
8%
In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object. The subject of a triple is equivalent to a vertex in a graph.
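The idea can be sketched with plain tuples. Only the (Jim, likes, bananas) triple comes from the text; the other triples and the `match` helper are invented for illustration and do not correspond to any real triple-store API.

```python
triples = [
    ("Jim", "likes", "bananas"),   # from the text
    ("Jim", "age", 33),            # invented: the object is a plain value
    ("Lucy", "likes", "bananas"),  # invented
    ("Lucy", "knows", "Jim"),      # invented: the object is another vertex
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

banana_fans = [s for (s, _, _) in match(triples, predicate="likes", obj="bananas")]
jim_facts = match(triples, subject="Jim")
print(banana_fans)
```

Querying by pattern with wildcards is roughly what declarative languages for triple-stores build on, with variables taking the place of `None`.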
9%
the semantic web was overhyped in the early 2000s but so far hasn’t shown any sign of being realized in practice,
9%
tools like Apache Jena [42] can automatically convert between different RDF formats if necessary.
9%
SPARQL is a query language for triple-stores using the RDF data model [43]. (It is an acronym for SPARQL Protocol and RDF Query Language, pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar
9%
Cascalog [47] is a Datalog implementation for querying large datasets in Hadoop.