Kindle Notes & Highlights
Read between April 5 - December 1, 2020
there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka).
There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization’s tolerance of different kinds of risk, regulatory
It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
Many critical bugs are actually due to poor error handling
Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
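A quick back-of-envelope check of this highlight. The book cites a mean time to failure (MTTF) of roughly 10 to 50 years per disk; the 27-year figure below is my own choice from inside that range, picked so the numbers come out round:

```python
# Assumption: MTTF of ~27 years per disk (within the book's 10-50 year range).
disks = 10_000
mttf_days = 27 * 365  # ~9,855 days of expected life per disk

failures_per_day = disks / mttf_days
print(f"expected failures per day: {failures_per_day:.2f}")  # ~1.01
```

With 10,000 disks each expected to last ~9,855 days, about one failure per day falls out directly.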
A single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime.
If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found [12].
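A minimal sketch of that self-auditing idea, using the book's message-queue example. The class and method names here are hypothetical, not from any real queue library; the point is only the running invariant check:

```python
from collections import deque

class AuditedQueue:
    """A toy queue that counts messages in and out and can audit itself."""

    def __init__(self):
        self._q = deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, msg):
        self._q.append(msg)
        self.enqueued += 1

    def get(self):
        msg = self._q.popleft()
        self.dequeued += 1
        return msg

    def check(self):
        # Invariant: messages in == messages out + messages still queued.
        ok = self.enqueued == self.dequeued + len(self._q)
        if not ok:
            print("ALERT: message count discrepancy")
        return ok

q = AuditedQueue()
q.put("a")
q.put("b")
q.get()
print(q.check())  # True
```

A real system would run such a check periodically in the background and raise an alert on discrepancy, rather than on every operation.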
Decouple the places where people make the most mistakes from the places where they can cause failures.
The median is a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer than the median.
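The half-below, half-above property is easy to see on a small sample. The response times below are made up for illustration:

```python
import statistics

# Hypothetical response times in milliseconds.
response_times_ms = [12, 35, 47, 51, 68, 90, 150, 300, 1200]

median = statistics.median(response_times_ms)
print(median)  # 68

faster = sum(1 for t in response_times_ms if t < median)
slower = sum(1 for t in response_times_ms if t > median)
print(faster, slower)  # 4 4
```

Note how the single 1200 ms outlier barely influences the median, which is exactly why the book prefers percentiles over the mean for describing "typical" latency.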
service level objectives (SLOs) and service level agreements (SLAs),
As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests — an effect sometimes known as head-of-line blocking.
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time.
Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete.
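This is the tail-latency-amplification effect, and the arithmetic is striking. Assuming (my number, for illustration) that 1% of backend calls are slow, the chance that a fan-out request hits at least one slow call grows quickly with the fan-out:

```python
# Assumption: 1% of individual backend calls exceed the latency threshold (p99).
p_slow = 0.01

for n in (1, 10, 100):
    # The request is slow whenever ANY of the n parallel calls is slow.
    p_request_slow = 1 - (1 - p_slow) ** n
    print(f"{n:>3} parallel calls -> {p_request_slow:.0%} of requests hit the tail")
```

With 100 parallel calls, roughly 63% of end-user requests are slowed down by what is nominally a 1-in-100 event.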
Distributing load across multiple machines is also known as a shared-nothing architecture.
An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises
Complexity is accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
in the network model, a record could have multiple parents. For example, there could be one record for the "Greater Seattle Area" region, and every user who lived in that region could be linked to it. This allowed many-to-one and many-to-many relationships to be modeled.
The only way of accessing a record was to follow a path from a root record along these chains of links. This was called an access path.
when it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
you cannot refer directly to a nested item within a document,
if your application does use many-to-many relationships, the document model becomes less appealing.
Joins can be emulated in application code by making multiple requests to the database, but that also moves complexity into the application and is usually slower than a join performed by specialized code inside the database.
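A sketch of what that application-side join emulation looks like, using in-memory lists of dicts to stand in for two round trips to the database (the data mirrors the book's Greater Seattle Area example; the structure is my own):

```python
# Result of a hypothetical first request: fetch the users.
users = [
    {"id": 1, "name": "Alice", "region_id": 10},
    {"id": 2, "name": "Bob", "region_id": 10},
]
# Result of a hypothetical second request: fetch the referenced regions.
regions = [{"id": 10, "name": "Greater Seattle Area"}]

# The "join" now lives in application code instead of the database.
regions_by_id = {r["id"]: r for r in regions}
joined = [
    {"user": u["name"], "region": regions_by_id[u["region_id"]]["name"]}
    for u in users
]
print(joined)
```

The extra round trips and the hand-written lookup logic are precisely the complexity that a database-side join would absorb.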
arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.
schema-on-read (the structure of the data is implicit, and only interpreted when the data is read),
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking.
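Schema-on-read in miniature: documents are just dicts, and their structure is only interpreted by the code that reads them. Field names below are hypothetical:

```python
docs = [
    {"name": "Alice", "speaks": ["en", "fr"]},
    {"name": "Bob"},  # older document, written before 'speaks' existed
]

# The "schema" lives in the reading code, which must tolerate old shapes --
# the dynamic-typing analogue of handling a value whose type you check at runtime.
for d in docs:
    langs = d.get("speaks", [])  # interpret the structure at read time
    print(d["name"], len(langs))
```

Under schema-on-write, by contrast, the second document would have been rejected (or migrated) at write time, the way a static type checker rejects ill-typed code before it runs.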
MySQL is a notable exception — it copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime when altering a large table
The schema-on-read approach is advantageous if the items in the collection don't all have the same structure for some reason, for example because:
- There are many different types of objects, and it is not practicable to put each type of object in its own table.
- The structure of the data is determined by external systems over which you have no control and which may change at any time.
If data is split across multiple tables, like in Figure 2-1, multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.
The locality advantage only applies if you need large parts of the document at the same time.
Only modifications that don't change the encoded size of a document can easily be performed in place [19].
it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document [9]. These performance limitations significantly reduce the set of situations in which document databases are useful.
Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table
A hybrid of the relational and document models is a good route for databases to take in the future.
An imperative language tells the computer to perform certain operations in a certain order.
In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want — what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated) — but not how to achieve that goal. It is up to the database system’s query optimizer to decide which indexes and which join methods to use, and in which order to execute various parts of the query.
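A concrete declarative query, run through SQLite from Python. The table and data are invented for illustration; the point is that the query states only the pattern of results wanted (the condition and the ordering), and the optimizer picks the execution strategy:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT, legs INTEGER)")
conn.executemany(
    "INSERT INTO animals VALUES (?, ?, ?)",
    [("shark", "Lamnidae", 0), ("ostrich", "Struthionidae", 2), ("cat", "Felidae", 4)],
)

# WHAT we want: animals with at least two legs, sorted by name.
# HOW to get it (full scan? index? sort algorithm?) is the optimizer's decision.
rows = conn.execute(
    "SELECT name FROM animals WHERE legs >= 2 ORDER BY name"
).fetchall()
print(rows)  # [('cat',), ('ostrich',)]
```

If an index on `legs` were added later, this query text would not change, which is exactly the evolvability argument the highlight makes.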
This makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
there is nothing in SQL that constrains it to running on a single machine, and MapReduce doesn’t have a monopoly on distributed query execution.
property graph model (implemented by Neo4j, Titan, and InfiniteGraph)
triple-store model (implemented by Datomic, AllegroGraph, and others).
Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes
Cypher is a declarative query language for property graphs, created for the Neo4j graph database
Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using something called recursive common table expressions (the WITH RECURSIVE syntax).
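SQLite supports `WITH RECURSIVE`, so a variable-length traversal can be tried directly. The `within` table of child-to-parent region links below is my own toy example in the spirit of the book's location hierarchy:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE within (child TEXT, parent TEXT)")
conn.executemany("INSERT INTO within VALUES (?, ?)", [
    ("Seattle", "Washington"),
    ("Washington", "United States"),
    ("United States", "North America"),
])

# Everything Seattle is transitively within, however many hops away.
rows = conn.execute("""
    WITH RECURSIVE ancestors(place) AS (
        SELECT parent FROM within WHERE child = 'Seattle'
        UNION
        SELECT w.parent FROM within w JOIN ancestors a ON w.child = a.place
    )
    SELECT place FROM ancestors
""").fetchall()
print(sorted(r[0] for r in rows))  # ['North America', 'United States', 'Washington']
```

The recursive CTE is SQL's (rather verbose) answer to Cypher's concise variable-length path patterns like `-[:WITHIN*0..]->`.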
The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas.
In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object. The subject of a triple is equivalent to a vertex in a graph.
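A toy triple-store makes the model tangible: facts are plain (subject, predicate, object) tuples, and queries are pattern matches where `None` acts as a wildcard. This is my own sketch, not any real triple-store API:

```python
triples = [
    ("Jim", "likes", "bananas"),
    ("Jim", "livesIn", "Idaho"),
    ("Idaho", "within", "USA"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(s="Jim"))     # both facts about Jim
print(match(p="within"))  # [('Idaho', 'within', 'USA')]
```

Real triple-stores answer such patterns via indexes over the three positions, and SPARQL queries are essentially conjunctions of these patterns with shared variables.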
the semantic web was overhyped in the early 2000s but so far hasn’t shown any sign of being realized in practice,
tools like Apache Jena [42] can automatically convert between different RDF formats if necessary.
SPARQL is a query language for triple-stores using the RDF data model [43]. (It is an acronym for SPARQL Protocol and RDF Query Language, pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar
Cascalog [47] is a Datalog implementation for querying large datasets in Hadoop.