Kindle Notes & Highlights
Read between August 2 and December 28, 2020
SQL is a declarative query language, whereas IMS and CODASYL queried the database using imperative code.
An imperative language tells the computer to perform certain operations in a certain order.
In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want — what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated) — but not how to achieve that goal.
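A small sketch of the contrast, assuming a hypothetical in-memory `animals` dataset (the imperative version fixes the iteration order; the declarative-style version only states the condition, leaving the "how" to the engine):

```python
# Hypothetical dataset for illustration.
animals = [
    {"name": "Great white", "family": "Sharks"},
    {"name": "Dolphin", "family": "Mammals"},
    {"name": "Hammerhead", "family": "Sharks"},
]

def get_sharks_imperative(animals):
    # Imperative: spell out *how* to walk the data, step by step,
    # in a fixed order.
    sharks = []
    for animal in animals:
        if animal["family"] == "Sharks":
            sharks.append(animal["name"])
    return sharks

def get_sharks_declarative(animals):
    # Declarative-style: state *what* you want; the comprehension
    # (standing in for a query optimizer) decides how to produce it.
    return [a["name"] for a in animals if a["family"] == "Sharks"]
```

The declarative form is what lets the database reorder records behind the scenes without breaking queries, since no query depends on physical order.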
If the database wants to reclaim unused disk space behind the scenes, it might need to move records around, changing the order in which the animals appear. Can the database do that safely, without breaking queries?
MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.
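A minimal MapReduce-style sketch using Python's built-in `functools.reduce` (the fold/inject function mentioned above); the data and grouping are hypothetical:

```python
from functools import reduce

# Hypothetical observations: (month, number_of_animals_seen).
observations = [("1995-12", 1), ("1995-12", 2), ("1996-01", 3)]

# Map phase: emit (key, value) pairs (here the records already are pairs).
mapped = [(month, num) for month, num in observations]

# Shuffle: group values by key, as the framework would between phases.
groups = {}
for month, num in mapped:
    groups.setdefault(month, []).append(num)

# Reduce phase: fold each group of values down to a single result.
totals = {month: reduce(lambda a, b: a + b, nums)
          for month, nums in groups.items()}
```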
MongoDB 2.2 added support for a declarative query language called the aggregation pipeline.
In the property graph model, each vertex consists of:
- A unique identifier
- A set of outgoing edges
- A set of incoming edges
- A collection of properties (key-value pairs)

Each edge consists of:
- A unique identifier
- The vertex at which the edge starts (the tail vertex)
- The vertex at which the edge ends (the head vertex)
- A label to describe the kind of relationship between the two vertices
- A collection of properties (key-value pairs)
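The structure above maps directly onto two record types; a minimal in-memory sketch (names and the example data are hypothetical, and a real graph database stores all of this on disk):

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # ids of outgoing edges
    in_edges: list = field(default_factory=list)   # ids of incoming edges

@dataclass
class Edge:
    id: int
    tail: int        # id of the vertex the edge starts at
    head: int        # id of the vertex the edge ends at
    label: str       # kind of relationship
    properties: dict = field(default_factory=dict)

# Hypothetical usage: Lucy was born in Idaho.
idaho = Vertex(id=1, properties={"name": "Idaho", "type": "state"})
lucy = Vertex(id=2, properties={"name": "Lucy"})
born = Edge(id=10, tail=lucy.id, head=idaho.id, label="born_in")
lucy.out_edges.append(born.id)
idaho.in_edges.append(born.id)
```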
You can think of a graph store as consisting of two relational tables, one for vertices and one for edges.
In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.
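At its simplest, a triple-store is just a set of three-part statements plus pattern matching over them; a toy sketch (the `objects` helper and the extra triples are hypothetical):

```python
# All information as (subject, predicate, object) statements.
triples = {
    ("Jim", "likes", "bananas"),
    ("Jim", "lives_in", "Oakland"),
    ("Oakland", "within", "USA"),
}

def objects(subject, predicate):
    """All objects matching a (subject, predicate, ?) pattern."""
    return {o for s, p, o in triples if s == subject and p == predicate}
```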
The semantic web is fundamentally a simple and reasonable idea: websites already publish information as text and pictures for humans to read, so why don’t they also publish information as machine-readable data for computers to read?
plethora
hubris.
The reasoning behind this design is that you should be able to combine your data with someone else’s data, and if they attach a different meaning to the word within or lives_in, you won’t get a conflict because their predicates are actually <http://other.org/foo#within> and <http://other.org/foo#lives_in>.
for example, it is the query language of Datomic [40], and Cascalog [47] is a Datalog implementation for querying large datasets in Hadoop.
Similarly to what db_set does, many databases internally use a log, which is an append-only data file.
log is used in the more general sense: an append-only sequence of records.
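The book's `db_set` is a shell one-liner; a Python equivalent of the same append-only idea (a sketch, with a temp file standing in for the database file):

```python
import os
import tempfile

def db_set(path, key, value):
    # Append-only: every write adds a line, never overwrites in place.
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(path, key):
    # Full scan, O(n) per read; the *last* matching line wins,
    # which is how updates work without overwriting.
    value = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                value = v
    return value

path = os.path.join(tempfile.mkdtemp(), "database")
db_set(path, "42", "San Francisco")
db_set(path, "42", "Oakland")  # supersedes the earlier value
```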
An index is an additional structure that is derived from the primary data.
well-chosen indexes speed up read queries, but every index slows down writes.
Bitcask
subject to the requirement that all the keys fit in the available RAM.
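A Bitcask-style sketch of that idea: an in-memory hash map records the byte offset of each key's latest entry in the append-only file, so a read is one seek instead of a scan, subject to all keys fitting in RAM (class and file layout are hypothetical simplifications):

```python
import os
import tempfile

class HashIndexedLog:
    def __init__(self, path):
        self.path = path
        self.index = {}             # key -> byte offset, kept in RAM
        open(path, "a").close()     # ensure the file exists

    def set(self, key, value):
        with open(self.path, "a") as f:
            self.index[key] = f.tell()   # offset where this entry starts
            f.write(f"{key},{value}\n")

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path) as f:
            f.seek(self.index[key])      # one seek, no scan
            return f.readline().rstrip("\n").partition(",")[2]

store = HashIndexedLog(os.path.join(tempfile.mkdtemp(), "log"))
store.set("city", "San Francisco")
store.set("city", "Oakland")  # index now points at the newer entry
```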
Concurrency and crash recovery are much simpler if segment files are append-only or immutable.
SSTable
LSM-Tree
B-trees have stood the test of time very well. They remain the standard index implementation in almost all relational databases, and many nonrelational databases use them too.
This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log).
This is typically done by protecting the tree's data structures with latches (lightweight locks).
Log-structured approaches are simpler in this regard, because they do all the merging in the background without interfering with incoming queries and atomically swap old segments for new segments from time to time.
copy-on-write scheme
We can save space in pages by not storing the entire key, but abbreviating it.
keys only need to provide enough information to act as boundaries between key ranges.
Many B-tree implementations therefore try to lay out the tree so that leaf pages appear in sequential order on disk.
each leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.
Some storage engines even overwrite the same page twice in order to avoid ending up with a partially updated page in the event of a power failure.
This effect — one write to the database resulting in multiple writes to the disk over the course of the database’s lifetime — is known as write amplification.
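The definition reduces to simple arithmetic; the numbers below are hypothetical, purely for illustration:

```python
# Write amplification = bytes physically written to disk over the
# database's lifetime / bytes of data written by the application.
app_bytes = 100 * 1024 * 1024        # 100 MiB written by the application
disk_bytes = app_bytes * (1 + 1 + 3) # e.g. WAL + memtable flush + 3 compaction rewrites
write_amplification = disk_bytes / app_bytes
```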
On many SSDs, the firmware internally uses a log-structured algorithm to turn random writes into sequential writes on the underlying storage chips, so the impact of the storage engine’s write pattern is less pronounced [19].
the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background.
If write throughput is high and compaction is not configured carefully, it can happen that compaction cannot keep up with the rate of incoming writes.
Typically, SSTable-based storage engines do not throttle the rate of incoming writes, even if compaction cannot keep up, so you need explicit monitoring to detect this situation.
in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree.
The main difference is that in a secondary index, the indexed values are not necessarily unique;
it could be a reference to the row stored elsewhere. In the latter case, the place where rows are stored is known as a heap file, and it stores data in no particular order (it may be append-only, or it may keep track of deleted rows in order to overwrite them with new data later).
The heap file approach is common because it avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place.
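An in-memory sketch of the heap-plus-secondary-indexes arrangement (a list position stands in for a heap location; names and columns are hypothetical):

```python
heap = []                   # rows live once, in insertion order, unsorted
by_name, by_city = {}, {}   # secondary indexes; indexed values not unique

def insert(row):
    pos = len(heap)         # "heap location" = list position in this sketch
    heap.append(row)
    # Each index stores only the position, not a copy of the row.
    by_name.setdefault(row["name"], []).append(pos)
    by_city.setdefault(row["city"], []).append(pos)

insert({"name": "Alice", "city": "Oakland"})
insert({"name": "Bob", "city": "Oakland"})

# A lookup goes index -> heap position -> row:
rows_in_oakland = [heap[pos] for pos in by_city.get("Oakland", [])]
```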
The situation is more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space. In that case, either all indexes need to be updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap location.
it can be desirable to store the indexed row directly within an index. This is known as a clustered index.
covering index or index with included columns, which stores some of a table’s columns within the index
As with any kind of duplication of data, clustered and covering indexes can speed up reads, but they require additional storage and can add overhead on writes.
The most common type of multi-column index is called a concatenated index,
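The classic analogy is an old-fashioned phone book sorted by (lastname, firstname); a toy sketch with hypothetical entries shows why column order matters:

```python
# A concatenated index combines several columns into one sort key.
entries = [("Smith", "Alice"), ("Jones", "Bob"), ("Smith", "Bob")]
index = sorted(entries)   # sorted by lastname first, then firstname

# Efficient: all Smiths (a prefix of the sort key)...
smiths = [e for e in index if e[0] == "Smith"]
# ...but useless for finding everyone named "Bob" regardless of
# lastname, because firstname alone is not a prefix of the key.
```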
Multi-dimensional indexes are a more general way of querying several columns at once, which is particularly important for geospatial data.