Kindle Notes & Highlights
Read between August 2 - December 28, 2020
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See “Reliability”.
It can tolerate the user making mistakes or using the software in unexpected ways.
The system prevents any unauthorized access and abuse.
The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
So it only makes sense to talk about tolerating certain types of faults.
failure is when the system as a whole stops providing the required service to the user.
it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
it can make sense to increase the rate of faults by triggering them deliberately — for example, by randomly killing individual processes without warning.
Netflix Chaos Monkey
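A toy sketch of the idea (not Netflix's actual Chaos Monkey, which terminates cloud instances rather than local processes): a supervisor-style script that kills randomly chosen worker processes without warning, so the fault-handling paths get exercised regularly. All names and timings below are invented for illustration.

```python
# Toy illustration of deliberately injecting faults, in the spirit of Chaos Monkey.
import random
import time
from multiprocessing import Process

def worker(worker_id: int) -> None:
    """A long-running worker that a real supervisor would restart after a crash."""
    while True:
        time.sleep(1)  # stand-in for real work

if __name__ == "__main__":
    workers = [Process(target=worker, args=(i,)) for i in range(5)]
    for p in workers:
        p.start()

    # Periodically kill one worker at random, without warning, to verify that
    # the rest of the system keeps serving requests and that restart logic works.
    for _ in range(3):
        time.sleep(5)
        victim = random.choice([p for p in workers if p.is_alive()])
        print(f"chaos: terminating pid {victim.pid}")
        victim.terminate()

    # Clean up the survivors so the example itself terminates.
    for p in workers:
        if p.is_alive():
            p.terminate()
```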
Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system.
RAID
the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.
Another class of fault is a systematic error within the system [8]. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [5]. Examples include:
dormant
carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production.
Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.
succinctly
besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled — during which it is latent, awaiting service [17].
However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay.
Usually it is better to use percentiles.
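A minimal sketch of why the mean hides what users experience while percentiles do not; the response-time sample is invented, and the percentile helper is NumPy's, not anything from the book.

```python
# Minimal sketch: mean vs. percentiles over a batch of response times (ms).
import numpy as np

response_times_ms = np.array([12, 14, 15, 15, 16, 18, 20, 22, 25, 950])

mean = response_times_ms.mean()
p50, p95, p99 = np.percentile(response_times_ms, [50, 95, 99])

print(f"mean = {mean:.0f} ms")  # ~111 ms, inflated by the one outlier
print(f"p50  = {p50:.0f} ms")   # the median: half of requests were at least this fast
print(f"p95  = {p95:.0f} ms")   # tail percentiles expose the slow outlier
print(f"p99  = {p99:.0f} ms")
```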
This is because the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases — that is, they’re the most valuable customers
happy by ensuring the website is fast for them: Amazon has also observed that a 100 ms increase in response time reduces sales by 1%
High percentiles become especially important in backend services that are called multiple times as part of serving a single end-user request.
tail latency amplification
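A small worked calculation (just arithmetic, not a quote from the book) showing the amplification: if each backend call is slow with probability p, a request that must wait for the slowest of n parallel calls is slow with probability 1 - (1 - p)^n.

```python
# Tail latency amplification: if a single backend call exceeds some latency
# threshold 1% of the time, a user request that fans out to n backends and
# must wait for the slowest one exceeds that threshold far more often.
p_slow = 0.01  # probability one backend call is "slow" (illustrative)

for n in (1, 10, 100):
    p_user_slow = 1 - (1 - p_slow) ** n
    print(f"fan-out to {n:>3} backends -> {p_user_slow:.1%} of user requests are slow")
```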
forward decay [25], t-digest [26], or HdrHistogram [27].
dichotomy
While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity.
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare — the load parameters.
Operability: Make it easy for operations teams to keep the system running smoothly.
Simplicity: Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)
Evolvability: Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.
complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
impedance mismatch.
Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate
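A hand-rolled sketch of the translation layer that ORM frameworks automate, mapping a row to an in-memory object; it is not how ActiveRecord or Hibernate work internally, and the table and class names are illustrative.

```python
# Hand-written sketch of the translation an ORM automates: rows <-> objects.
import sqlite3
from dataclasses import dataclass

@dataclass
class User:
    user_id: int
    first_name: str
    last_name: str

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO users VALUES (251, 'Bill', 'Gates')")

row = conn.execute(
    "SELECT user_id, first_name, last_name FROM users WHERE user_id = ?", (251,)
).fetchone()
user = User(*row)  # the boilerplate mapping that ORM frameworks generate for you
print(user)
```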
In the traditional SQL model (prior to SQL:1999), the most common normalized representation is to put positions, education, and contact information in separate tables, with a foreign key reference to the users table, as in Figure 2-1. Later versions of the SQL standard added support for structured datatypes and XML data; this allowed multi-valued data to be stored within a single row, with support for querying and indexing inside those documents. These features are supported to varying degrees by Oracle, IBM DB2, MS SQL Server, and PostgreSQL [6, 7]. A JSON datatype is also supported by …
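A sketch of that normalized layout, along the lines of the book's Figure 2-1; the column names are approximations rather than the book's exact schema, and SQLite stands in for a full relational database.

```python
# Normalized relational layout: one row per position/education entry,
# linked to the users table by a foreign key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id      INTEGER PRIMARY KEY,
        first_name   TEXT,
        last_name    TEXT
    );
    CREATE TABLE positions (
        position_id  INTEGER PRIMARY KEY,
        user_id      INTEGER REFERENCES users(user_id),
        job_title    TEXT,
        organization TEXT
    );
    CREATE TABLE education (
        education_id INTEGER PRIMARY KEY,
        user_id      INTEGER REFERENCES users(user_id),
        school_name  TEXT,
        start_year   INTEGER,
        end_year     INTEGER
    );
""")

conn.execute("INSERT INTO users VALUES (251, 'Bill', 'Gates')")
conn.execute(
    "INSERT INTO positions (user_id, job_title, organization) "
    "VALUES (251, 'Co-chair', 'Bill & Melinda Gates Foundation')"
)

# Reassembling the profile requires a join across the tables:
for row in conn.execute("""
    SELECT u.first_name, p.job_title, p.organization
    FROM users u JOIN positions p ON p.user_id = u.user_id
"""):
    print(row)
```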
For a data structure like a résumé, which is mostly a self-contained document, a JSON representation can be quite appropriate: see Example 2-1. JSON has the appeal of being much simpler than XML. Document-oriented databases like MongoDB [9], RethinkDB [10], CouchDB [11], and Espresso [12] support this data model.
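For contrast, roughly the same profile as one self-contained document, a simplified approximation of the book's Example 2-1 rather than the exact listing.

```python
# The résumé as one self-contained JSON document (simplified from Example 2-1).
import json

bill_gates_profile = {
    "user_id": 251,
    "first_name": "Bill",
    "last_name": "Gates",
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
    "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
}

print(json.dumps(bill_gates_profile, indent=2))
```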
Localization support — when the site is translated into other languages, the standardized lists can be localized, so the region and industry can be displayed in the viewer’s language
Removing such duplication is the key idea behind normalization in databases.
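A small sketch of that de-duplication idea: the human-readable region name lives in exactly one row, and every profile refers to it by ID, so renaming or localizing it touches a single place. Table names and IDs are invented.

```python
# The de-duplication idea behind normalization: the human-meaningful text
# lives in exactly one row, and everything else refers to it by ID.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users   (user_id INTEGER PRIMARY KEY, name TEXT,
                          region_id INTEGER REFERENCES regions(region_id));
""")
conn.execute("INSERT INTO regions VALUES (101, 'Greater Seattle Area')")
conn.executemany("INSERT INTO users VALUES (?, ?, 101)",
                 [(251, 'Bill'), (252, 'Melinda')])

# Renaming (or localizing) the region touches a single row; every user that
# references region 101 picks up the change automatically.
conn.execute("UPDATE regions SET name = 'Seattle Metropolitan Area' WHERE region_id = 101")
```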
obscurity
CODASYL
The links between records in the network model were not foreign keys, but more like pointers in a programming language (while still being stored on disk). The only way of accessing a record was to follow a path from a root record along these chains of links. This was called an access path.
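A toy illustration of what an access path amounts to: records reachable only by following pointer-like links from a root record. This mimics the idea only and bears no resemblance to CODASYL's actual interface.

```python
# Toy sketch of the network model's "access path": records are reached by
# following chains of links from a root record, not by declarative queries.
class Record:
    def __init__(self, name: str):
        self.name = name
        self.links = []  # pointer-like references to child records

root = Record("root")
org = Record("Microsoft")
person = Record("Bill Gates")
root.links.append(org)
org.links.append(person)

# To find "Bill Gates" the application must know and follow a path:
# root -> Microsoft -> Bill Gates. There is no way to jump straight to it.
node = root
for step in ("Microsoft", "Bill Gates"):
    node = next(link for link in node.links if link.name == step)
print(node.name)
```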
a relation (table) is simply a collection of tuples (rows), and that’s it.
the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model
schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
The document model has limitations: for example, you cannot refer directly to a nested item within a document, but instead you need to say something like “the second item in the list of positions for user 251”
Schema changes have a bad reputation of being slow and requiring downtime. This reputation is not entirely deserved: most relational database systems execute the ALTER TABLE statement in a few milliseconds.
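A sketch of the kind of schema change being discussed: adding a nullable column is a quick, metadata-only operation in most systems, and existing rows simply read the new column as NULL until they are backfilled. SQLite is used here as a stand-in.

```python
# Adding a nullable column does not rewrite existing rows, so it completes
# almost instantly; the backfill can happen later, at leisure.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (251, 'Bill Gates')")

conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")  # fast: metadata only
print(conn.execute("SELECT user_id, name, first_name FROM users").fetchall())
# -> [(251, 'Bill Gates', None)]
```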
The schema-on-read approach is advantageous if the items in the collection don’t all have the same structure for some reason (i.e., the data is heterogeneous), for example because:
- There are many different types of objects, and it is not practicable to put each type of object in its own table.
- The structure of the data is determined by external systems over which you have no control and which may change at any time.
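A sketch of schema-on-read in application code, echoing the book's first_name example: documents written before and after a format change coexist in the same collection, and the reading code interprets whichever shape it finds. The sample documents are invented.

```python
# Schema-on-read: the structure is interpreted when the data is read,
# so documents with different shapes can live side by side.
old_doc = {"user_id": 251, "name": "Bill Gates"}  # written before the format change
new_doc = {"user_id": 252, "first_name": "Melinda", "last_name": "Gates"}

def first_name(doc: dict) -> str:
    if "first_name" in doc:
        return doc["first_name"]
    return doc["name"].split(" ")[0]  # fall back to the old format at read time

for doc in (old_doc, new_doc):
    print(first_name(doc))
```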
The locality advantage only applies if you need large parts of the document at the same time.
On updates to a document, the entire document usually needs to be rewritten
Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table [27]. Oracle allows the same,
This includes functions to make local modifications to XML documents and the ability to index and query inside XML documents, which allows applications to use data models very similar to what they would do when using a document database.