Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Note there is nothing in SQL that constrains it to running on a single machine, and MapReduce doesn’t have a monopoly on distributed query execution.
A usability problem with MapReduce is that you have to write two carefully coordinated JavaScript functions, which is often harder than writing a single query.
If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
In a relational database, you usually know in advance which joins you need in your query.
One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements.
If you want to search the same data in several different ways, you may need several different indexes on different parts of the data.
An index is an additional structure that is derived from the primary data.
Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.
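The write overhead of an index can be sketched in a few lines of Python (a hypothetical illustration, not code from the book): an append-only log with an in-memory hash index mapping each key to the byte offset of its latest value.

```python
# Minimal sketch: every write must update the index as well as the log,
# which is the extra work that slows writes down; reads become O(1) lookups
# instead of full log scans. Keys/values containing "," or "\n" would break
# this toy encoding.

class LogWithHashIndex:
    def __init__(self):
        self.log = bytearray()
        self.index = {}  # key -> byte offset of the latest record

    def put(self, key: str, value: str) -> None:
        record = f"{key},{value}\n".encode()
        offset = len(self.log)
        self.log.extend(record)    # 1. append to the log
        self.index[key] = offset   # 2. extra work: keep the index current

    def get(self, key: str) -> str:
        offset = self.index[key]   # jump straight to the latest record
        end = self.log.index(b"\n", offset)
        line = self.log[offset:end].decode()
        return line.split(",", 1)[1]

db = LogWithHashIndex()
db.put("a", "1")
db.put("a", "2")    # supersedes the first record; the index now points here
print(db.get("a"))  # -> 2
```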
Compaction means throwing away duplicate keys in the log, and keeping only the most recent update for each key.
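The compaction idea can be sketched in Python (a made-up illustration, not any real storage engine's code): given updates in write order, keep only the newest value per key.

```python
# Sketch of log compaction: duplicate keys are discarded, and only the
# most recent update for each key survives.

def compact(log):
    latest = {}
    for key, value in log:       # later entries overwrite earlier ones
        latest[key] = value
    return list(latest.items())  # one entry per key, newest value wins

log = [("mew", 1078), ("purr", 2103), ("purr", 2104), ("mew", 1079)]
print(compact(log))  # -> [('mew', 1079), ('purr', 2104)]
```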
There are also different strategies to determine the order and timing of how SSTables are compacted and merged.
A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes.
There is no quick and easy rule for determining which type of storage engine is better for your use case, so it is worth testing empirically.
A primary key uniquely identifies one row in a relational table, or one document in a document database, or one vertex in a graph database.
Redis and Couchbase provide weak durability by writing to disk asynchronously.
Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway.
Transaction processing just means allowing clients to make low-latency reads and writes—as opposed to batch processing jobs, which only run periodically (for example, once per day).
Usually an analytic query needs to scan over a huge number of records, only reading a few columns per record, and calculates aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user.
The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.
Usually, facts are captured as individual events, because this allows maximum flexibility of analysis later.
As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event.
Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with.
If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work.
Column storage is easiest to understand in a relational data model, but it applies equally to nonrelational data.
The column-oriented storage layout relies on each column file containing the rows in the same order.
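A minimal Python sketch of that layout (the column names and values here are made up): each column is a separate list standing in for a file, and reassembling a row depends on every column using the same row order.

```python
# Sketch of column-oriented storage: one "file" (list) per column,
# all in the same row order.

columns = {
    "date_key":   [140102, 140102, 140103],
    "product_sk": [69, 74, 31],
    "quantity":   [1, 3, 2],
}

# A query touching only `quantity` reads just that one "file":
total_quantity = sum(columns["quantity"])
print(total_quantity)  # -> 6

# Reassembling row 1 works only because every column shares the row order:
row_1 = {name: values[1] for name, values in columns.items()}
print(row_1)  # -> {'date_key': 140102, 'product_sk': 74, 'quantity': 3}
```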
Often, the number of distinct values in a column is small compared to the number of rows
The bit is 1 if the row has that value, and 0 if not.
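A bitmap encoding of a low-cardinality column can be sketched as follows (hypothetical values, not from the book):

```python
# One bitmap per distinct value in the column; bit i is 1 if row i has
# that value, and 0 if not.

def bitmap_index(column):
    bitmaps = {}
    for value in sorted(set(column)):
        bitmaps[value] = [1 if v == value else 0 for v in column]
    return bitmaps

product_sk = [69, 69, 74, 31, 31, 31]
bitmaps = bitmap_index(product_sk)
print(bitmaps[31])  # -> [0, 0, 0, 1, 1, 1]

# A condition like `product_sk IN (31, 69)` becomes a bitwise OR of bitmaps:
matches = [a | b for a, b in zip(bitmaps[31], bitmaps[69])]
print(matches)      # -> [1, 1, 0, 1, 1, 1]
```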
For data warehouse queries that need to scan over millions of rows, a big bottleneck is the bandwidth for getting data from disk into memory.
In a column store, it doesn’t necessarily matter in which order the rows are stored.
The administrator of the database can choose the columns by which the table should be sorted, using their knowledge of common queries.
Different queries benefit from different sort orders, so why not store the same data sorted in several different ways?
The difference is that a materialized view is an actual copy of the query results, written to disk, whereas a virtual view is just a shortcut for writing queries.
When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data.
The advantage of a materialized data cube is that certain queries become very fast because they have effectively been precomputed.
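A tiny materialized data cube can be sketched in Python (the fact rows and dimension names here are invented for illustration): aggregates are precomputed along two dimensions, so a grouped query becomes a lookup.

```python
# Sketch of a 2-D data cube: SUM(net_price) precomputed per
# (date_key, product_sk) cell.
from collections import defaultdict

facts = [  # (date_key, product_sk, net_price) -- made-up rows
    (140101, 69, 13.97),
    (140101, 74, 2.49),
    (140102, 69, 14.99),
]

cube = defaultdict(float)
for date, product, price in facts:
    cube[(date, product)] += price  # the precomputed aggregate

# "Total sales of product 69 on date 140102" is now a single lookup
# rather than a scan over the fact table:
print(cube[(140102, 69)])  # -> 14.99
```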
Everything changes and nothing stands still.
The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
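The two directions can be shown with Python's standard `json` module (any encoding library follows the same pattern):

```python
# The same in-memory object, encoded to a byte sequence and decoded back.
import json

record = {"userName": "Martin", "favoriteNumber": 1337}

encoded = json.dumps(record).encode("utf-8")   # encoding / serialization
assert isinstance(encoded, bytes)

decoded = json.loads(encoded.decode("utf-8"))  # decoding / deserialization
assert decoded == record
```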
JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision.
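This is easy to demonstrate in Python; note that what a given number decodes to is a choice of the parser, not of JSON itself:

```python
# JSON has one "number" type: whether you get an int or a float back,
# and with what precision, depends on the parser.
import json

print(json.loads("1"))    # -> 1    (Python chooses int)
print(json.loads("1.0"))  # -> 1.0  (Python chooses float)

# Integers beyond 2**53 round-trip exactly in Python, but parsers that
# use IEEE 754 doubles (e.g. JavaScript's) silently lose precision:
n = 2**53 + 1
print(json.loads(json.dumps(n)) == n)  # True here, not guaranteed elsewhere
```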
The difficulty of getting different organizations to agree on anything outweighs most other concerns.
{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}
The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed).
Both Thrift and Protocol Buffers require a schema for any data that is encoded.
You can add new fields to the schema, provided that you give each field a new tag number.
If you were to add a field and make it required, that check would fail if new code read data written by old code, because the old code will not have written the new field that you added.
Therefore, to maintain backward compatibility, every field you add after the initial deployment of the schema must be optional or have a default value.
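The role of tag numbers and defaults in schema evolution can be sketched with a toy tagged format (this is an illustration of the principle, not the real Thrift or Protocol Buffers wire format):

```python
# Decoding by tag number: unknown tags are skipped (forward compatibility),
# and fields absent from the data fall back to their defaults
# (backward compatibility).

SCHEMA = {                      # tag -> (field name, default)
    1: ("user_name", ""),
    2: ("favorite_number", 0),  # added after initial deployment,
                                # so it must have a default
}

def decode(encoded):
    record = {name: default for name, default in SCHEMA.values()}
    for tag, value in encoded:
        if tag in SCHEMA:       # skip tags this code doesn't know about
            record[SCHEMA[tag][0]] = value
    return record

# Data written by old code (no tag 2) still decodes under the new schema:
print(decode([(1, "Martin")]))
# -> {'user_name': 'Martin', 'favorite_number': 0}

# Data containing an unknown future tag 3 decodes fine as well:
print(decode([(1, "Martin"), (3, "??")]))
# -> {'user_name': 'Martin', 'favorite_number': 0}
```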
Removing a field is just like adding a field, with backward and forward compatibility concerns reversed.
Thrift has a dedicated list datatype, which is parameterized with the datatype of the list elements.
Avro also uses a schema to specify the structure of the data being encoded.
The encoding simply consists of values concatenated together.
To parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field.
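The Avro idea of tagless, schema-ordered encoding can be sketched like this (real Avro uses a compact binary format, not Python lists; the field names are from the earlier JSON example):

```python
# No field tags in the encoding -- just values concatenated in schema
# order. The reader needs the schema to know which value is which.

SCHEMA = ["userName", "favoriteNumber"]  # field order is the schema

def encode(record):
    return [record[field] for field in SCHEMA]  # values only, in order

def decode(values):
    return dict(zip(SCHEMA, values))            # schema restores the names

encoded = encode({"userName": "Martin", "favoriteNumber": 1337})
print(encoded)          # -> ['Martin', 1337]
print(decode(encoded))  # -> {'userName': 'Martin', 'favoriteNumber': 1337}
```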
When data is decoded (read), the Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema.
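Schema resolution can be sketched as follows (field names and defaults are invented; real Avro does this on its binary format, matching fields by name, ignoring writer-only fields, and filling in reader-side defaults):

```python
# Values were written in the writer's schema order; the reader translates
# them into its own schema: match by name, drop fields only the writer
# had, fill defaults for fields only the reader has.

WRITER_SCHEMA = ["userName", "userID"]  # what the data was written with
READER_SCHEMA = {                       # field name -> default value
    "userName": "",
    "favoriteNumber": 0,                # new field the writer never had
}

def resolve(values):
    written = dict(zip(WRITER_SCHEMA, values))  # via the writer's schema
    return {name: written.get(name, default)    # into the reader's schema
            for name, default in READER_SCHEMA.items()}

print(resolve(["Martin", 1234]))
# -> {'userName': 'Martin', 'favoriteNumber': 0}   (userID is dropped)
```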