Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Rate it:
Open Preview
6%
Flag icon
SQL is a declarative query language, whereas IMS and CODASYL queried the database using imperative code.
6%
Flag icon
An imperative language tells the computer to perform certain operations in a certain order.
6%
Flag icon
In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want — what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated) — but not how to achieve that goal.
6%
Flag icon
If the database wants to reclaim unused disk space behind the scenes, it might need to move records around, changing the order in which the animals appear. Can the database do that safely, without breaking queries?
7%
Flag icon
MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.
7%
Flag icon
MongoDB 2.2 added support for a declarative query language called the aggregation pipeline
7%
Flag icon
In the property graph model, each vertex consists of: A unique identifier A set of outgoing edges A set of incoming edges A collection of properties (key-value pairs) Each edge consists of: A unique identifier The vertex at which the edge starts (the tail vertex) The vertex at which the edge ends (the head vertex) A label to describe the kind of relationship between the two vertices A collection of properties (key-value pairs)
7%
Flag icon
You can think of a graph store as consisting of two relational tables, one for vertices and one for edges,
8%
Flag icon
Chena Lee
이러면 edge 를 이미 비슷한거있는데 또 만들어버리거나 여러 이름의 edge를 이 query에 다 써다하는데 빼먹을수 있잖아
8%
Flag icon
In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.
9%
Flag icon
The semantic web is fundamentally a simple and reasonable idea: websites already publish information as text and pictures for humans to read, so why don’t they also publish information as machine-readable data for computers to read?
9%
Flag icon
plethora
9%
Flag icon
hubris.
9%
Flag icon
The reasoning behind this design is that you should be able to combine your data with someone else’s data, and if they attach a different meaning to the word within or lives_in, you won’t get a conflict because their predicates are actually <http://other.org/foo#within> and <http://other.org/foo#lives_in>.
9%
Flag icon
for example, it is the query language of Datomic [40], and Cascalog [47] is a Datalog implementation for querying large datasets in Hadoop.viii
10%
Flag icon
Similarly to what db_set does, many databases internally use a log, which is an append-only data file.
10%
Flag icon
log is used in the more general sense: an append-only sequence of records.
11%
Flag icon
An index is an additional structure that is derived from the primary data.
11%
Flag icon
well-chosen indexes speed up read queries, but every index slows down writes.
11%
Flag icon
Bitcask
11%
Flag icon
subject to the requirement that all the keys fit in the available RAM,
11%
Flag icon
Concurrency and crash recovery are much simpler if segment files are append-only or immutable.
11%
Flag icon
SSTable
11%
Flag icon
LSM-Tree)
12%
Flag icon
B-trees have stood the test of time very well. They remain the standard index implementation in almost all relational databases, and many nonrelational databases use them too.
12%
Flag icon
This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
12%
Flag icon
In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log).
12%
Flag icon
This is typically done by protecting the tree’s data structures with latches
12%
Flag icon
Log-structured approaches are simpler in this regard, because they do all the merging in the background without interfering with incoming queries and atomically swap old segments for new segments from time to time.
12%
Flag icon
copy-on-write scheme
12%
Flag icon
We can save space in pages by not storing the entire key, but abbreviating it.
12%
Flag icon
keys only need to provide enough information to act as boundaries between key ranges.
12%
Flag icon
Many B-tree implementations therefore try to lay out the tree so that leaf pages appear in sequential order on disk.
12%
Flag icon
each leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.
12%
Flag icon
Some storage engines even overwrite the same page twice in order to avoid ending up with a partially updated page in the event of a power failure
12%
Flag icon
This effect — one write to the database resulting in multiple writes to the disk over the course of the database’s lifetime — is known as write amplification.
12%
Flag icon
On many SSDs, the firmware internally uses a log-structured algorithm to turn random writes into sequential writes on the underlying storage chips, so the impact of the storage engine’s write pattern is less pronounced [19].
12%
Flag icon
the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a
12%
Flag icon
If write throughput is high and compaction is not configured carefully, it can happen that compaction cannot keep up with the rate of incoming writes.
12%
Flag icon
Typically, SSTable-based storage engines do not throttle the rate of incoming writes, even if compaction cannot keep up, so you need explicit monitoring to detect this situation
12%
Flag icon
in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree
12%
Flag icon
The main difference is that in a secondary index, the indexed values are not necessarily unique;
12%
Flag icon
it could be a reference to the row stored elsewhere. In the latter case, the place where rows are stored is known as a heap file, and it stores data in no particular order (it may be append-only,
12%
Flag icon
The heap file approach is common because it avoids duplicating data when multiple secondary indexes are present:
12%
Flag icon
The situation is more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space. In that case, either all indexes need to be updated to point at the new heap location of the
12%
Flag icon
it can be desirable to store the indexed row directly within an index. This is known as a clustered index.
13%
Flag icon
covering index or index with included columns, which stores some of a table’s columns within the index
13%
Flag icon
As with any kind of duplication of data, clustered and covering indexes can speed up reads, but they require additional storage and can add overhead on writes.
13%
Flag icon
The most common type of multi-column index is called a concatenated index,
13%
Flag icon
Multi-dimensional indexes are a more general way of querying several columns at once, which is particularly important for geospatial data.