Kindle Notes & Highlights
Read between October 21 - November 26, 2024
Technology is a powerful force in our society.
Fortunately, behind the rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular tool you are using.
In computing we tend to be attracted to things that are new and shiny, but I think we have a huge amount to learn from things that have been done before.
A data-intensive application is typically built from standard building blocks that provide commonly needed functionality.
The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
When one component dies, the redundant component can take its place while the broken component is replaced.
There is no quick solution to the problem of systematic faults in software.
Scalability is the term we use to describe a system’s ability to cope with increased load.
The median is also known as the 50th percentile, and sometimes abbreviated as p50.
For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more.
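A minimal sketch of how p50 and p95 could be computed from a set of measured response times (the sample values and names below are illustrative, not from the book):

```python
# Sketch: nearest-rank percentiles over a batch of measured response times.
# The response_times_ms values are made up for illustration.
response_times_ms = [12, 15, 20, 22, 25, 30, 45, 60, 120, 1500]

def percentile(sorted_values, p):
    # Nearest-rank percentile: roughly p% of requests fall at or below this value.
    k = max(0, int(round(p / 100 * len(sorted_values))) - 1)
    return sorted_values[k]

values = sorted(response_times_ms)
p50 = percentile(values, 50)   # median response time
p95 = percentile(values, 95)   # 95th percentile (tail latency)
print(f"p50={p50}ms, p95={p95}ms")
```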
Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing.
Queueing delays often account for a large part of the response time at high percentiles.
Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete.
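A rough sketch of this tail-latency amplification, assuming a hypothetical backend_call that occasionally hits a slow outlier: the request fans out in parallel, yet the user still waits for the slowest call.

```python
import concurrent.futures
import random
import time

# Sketch: one user request fans out to 10 backend calls in parallel.
# The overall latency is still the maximum of the individual latencies,
# so a single slow call dominates the user-visible response time.
def backend_call(i):
    latency = random.choice([0.01, 0.02, 0.05, 1.5])  # occasional slow outlier
    time.sleep(latency)
    return latency

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(backend_call, range(10)))
print(f"slowest call: {max(latencies):.2f}s, total: {time.time() - start:.2f}s")
```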
An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load.
While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity.
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters.
Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns.
When complexity makes maintenance hard, budgets and schedules are often overrun.
One of the best tools we have for removing accidental complexity is abstraction.
Reliability means making systems work correctly, even when faults occur.
Scalability means having strategies for keeping performance good, even when load increases.
The limits of my language mean the limits of my world.
Most applications are built by layering one data model on top of another.
There are many different kinds of data models, and every data model embodies assumptions about how it is going to be used.
Specialized query operations that are not well supported by the relational model are among the driving forces behind the adoption of NoSQL databases.
Different applications have different requirements, and the best choice of technology for one use case may well be different from the best choice for another use case.
In the JSON representation, all the relevant information is in one place, and one query is sufficient.
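A made-up example of such a self-contained document (field names are illustrative): the one-to-many data is nested inside the profile, so a single lookup by key returns everything needed to render the page.

```python
import json

# All related data (positions, education) lives inside the document itself,
# so one fetch by user_id is sufficient; no joins are needed to render it.
profile = {
    "user_id": 251,
    "first_name": "Alice",
    "positions": [
        {"job_title": "Engineer", "organization": "Example Corp"},
        {"job_title": "Intern", "organization": "Example Labs"},
    ],
    "education": [{"school_name": "Example University", "start": 2010, "end": 2014}],
}
document = json.dumps(profile)             # stored as one encoded string
print(json.loads(document)["positions"][0]["job_title"])
```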
If the user interface has free-text fields for entering the region and the industry, it makes sense to store them as plain-text strings.
Conference on Data Systems Languages (CODASYL)
In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use.
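A small sketch of this, using SQLite from Python's standard library (an assumption for illustration; the highlights don't name a specific database): the query states only what to fetch, and EXPLAIN QUERY PLAN reveals the access path the optimizer chose, such as whether it uses the index.

```python
import sqlite3

# The application declares *what* it wants; the database decides *how*.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE INDEX idx_users_region ON users(region)")
for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE region = ?", ("Europe",)
):
    print(row)   # the chosen plan, decided by the query optimizer
```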
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application.
The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
However, if your application does use many-to-many relationships, the document model becomes less appealing.
No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.
A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it).
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking.
The schema-on-read approach is advantageous if the items in the collection don’t all have the same structure for some reason.
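A minimal sketch of schema-on-read, with made-up field names: documents in the same collection have different shapes, and the application interprets the structure only when it reads the data.

```python
# Documents written in an old and a new format coexist in one collection.
docs = [
    {"user_id": 1, "name": "Alice Example"},                     # old format
    {"user_id": 2, "first_name": "Bob", "last_name": "Sample"},  # new format
]

def first_name(doc):
    # The application code, not the database, decides how to handle each variant.
    if "first_name" in doc:
        return doc["first_name"]
    return doc.get("name", "").split(" ")[0]

print([first_name(d) for d in docs])   # ['Alice', 'Bob']
```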
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB’s BSON).
If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality.
The locality advantage only applies if you need large parts of the document at the same time.
A hybrid of the relational and document models is a good route for databases to take in the future.
Many commonly used programming languages are imperative.
An imperative language tells the computer to perform certain operations in a certain order.
A declarative query language is attractive because it is typically more concise and easier to work with than an imperative API.
The fact that SQL is more limited in functionality gives the database much more room for automatic optimizations.
Declarative languages have a better chance of getting faster in parallel execution because they specify only the pattern of the results, not the algorithm that is used to determine the results.
In a web browser, using declarative CSS styling is much better than manipulating styles imperatively in JavaScript.
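A rough Python analogy for the imperative/declarative contrast (the animal data is invented for illustration): the imperative version spells out how to iterate and accumulate, while the comprehension only states the pattern of the desired result.

```python
animals = [{"name": "shark", "family": "Lamniformes"},
           {"name": "whale", "family": "Mammalia"}]

# Imperative: explicit loop, accumulator, and a fixed order of operations.
sharks = []
for animal in animals:
    if animal["family"] == "Lamniformes":
        sharks.append(animal["name"])

# More declarative: describe only which results we want.
sharks_declarative = [a["name"] for a in animals if a["family"] == "Lamniformes"]
assert sharks == sharks_declarative
```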
MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google.
MapReduce is a fairly low-level programming model for distributed execution on a cluster of machines.
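A toy, single-process sketch of the MapReduce model (the observation data and function names are illustrative): map emits key-value pairs, the framework groups them by key, and reduce combines the values for each key.

```python
from collections import defaultdict

observations = [
    {"family": "Sharks", "num_animals": 3},
    {"family": "Sharks", "num_animals": 4},
    {"family": "Mammalia", "num_animals": 1},
]

def map_fn(doc):
    yield (doc["family"], doc["num_animals"])

def reduce_fn(key, values):
    return (key, sum(values))

groups = defaultdict(list)
for doc in observations:
    for key, value in map_fn(doc):                     # map phase
        groups[key].append(value)
print([reduce_fn(k, vs) for k, vs in groups.items()])  # reduce phase
```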