Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Note there is nothing in SQL that constrains it to running on a single machine, and MapReduce doesn’t have a monopoly on distributed query execution.
A usability problem with MapReduce is that you have to write two carefully coordinated JavaScript functions, which is often harder than writing a single query.
If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
In a relational database, you usually know in advance which joins you need in your query.
One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements.
If you want to search the same data in several different ways, you may need several different indexes on different parts of the data.
An index is an additional structure that is derived from the primary data.
Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.
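The write overhead of an index can be sketched in a few lines of Python (a hypothetical illustration, not code from the book): an append-only log with an in-memory hash index mapping each key to the byte offset of its latest value.

```python
# Minimal sketch: every write must update the index as well as the log,
# which is the extra work that slows writes down; reads become O(1) lookups
# instead of full log scans. Keys/values containing "," or "\n" would break
# this toy encoding.

class LogWithHashIndex:
    def __init__(self):
        self.log = bytearray()
        self.index = {}  # key -> byte offset of the latest record

    def put(self, key: str, value: str) -> None:
        record = f"{key},{value}\n".encode()
        offset = len(self.log)
        self.log.extend(record)    # 1. append to the log
        self.index[key] = offset   # 2. extra work: keep the index current

    def get(self, key: str) -> str:
        offset = self.index[key]   # jump straight to the latest record
        end = self.log.index(b"\n", offset)
        line = self.log[offset:end].decode()
        return line.split(",", 1)[1]

db = LogWithHashIndex()
db.put("a", "1")
db.put("a", "2")    # supersedes the first record; the index now points here
print(db.get("a"))  # -> 2
```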
Compaction means throwing away duplicate keys in the log, and keeping only the most recent update for each key.
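The compaction idea can be sketched in Python (a made-up illustration, not any real storage engine's code): given updates in write order, keep only the newest value per key.

```python
# Sketch of log compaction: duplicate keys are discarded, and only the
# most recent update for each key survives.

def compact(log):
    latest = {}
    for key, value in log:       # later entries overwrite earlier ones
        latest[key] = value
    return list(latest.items())  # one entry per key, newest value wins

log = [("mew", 1078), ("purr", 2103), ("purr", 2104), ("mew", 1079)]
print(compact(log))  # -> [('mew', 1079), ('purr', 2104)]
```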
There are also different strategies to determine the order and timing of how SSTables are compacted and merged.
A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes.
There is no quick and easy rule for determining which type of storage engine is better for your use case, so it is worth testing empirically.
A primary key uniquely identifies one row in a relational table, or one document in a document database, or one vertex in a graph database.
Redis and Couchbase provide weak durability by writing to disk asynchronously.
Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway.
Transaction processing just means allowing clients to make low-latency reads and writes—as opposed to batch processing jobs, which only run periodically (for example, once per day).
Usually an analytic query needs to scan over a huge number of records, only reading a few columns per record, and calculates aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user.
The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.
Usually, facts are captured as individual events, because this allows maximum flexibility of analysis later.
As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event.
Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with.
If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work.
Column storage is easiest to understand in a relational data model, but it applies equally to nonrelational data.
The column-oriented storage layout relies on each column file containing the rows in the same order.
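A minimal Python sketch of that layout (the column names and values here are made up): each column is a separate list standing in for a file, and reassembling a row depends on every column using the same row order.

```python
# Sketch of column-oriented storage: one "file" (list) per column,
# all in the same row order.

columns = {
    "date_key":   [140102, 140102, 140103],
    "product_sk": [69, 74, 31],
    "quantity":   [1, 3, 2],
}

# A query touching only `quantity` reads just that one "file":
total_quantity = sum(columns["quantity"])
print(total_quantity)  # -> 6

# Reassembling row 1 works only because every column shares the row order:
row_1 = {name: values[1] for name, values in columns.items()}
print(row_1)  # -> {'date_key': 140102, 'product_sk': 74, 'quantity': 3}
```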
Often, the number of distinct values in a column is small compared to the number of rows
The bit is 1 if the row has that value, and 0 if not.
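A bitmap encoding of a low-cardinality column can be sketched as follows (hypothetical values, not from the book):

```python
# One bitmap per distinct value in the column; bit i is 1 if row i has
# that value, and 0 if not.

def bitmap_index(column):
    bitmaps = {}
    for value in sorted(set(column)):
        bitmaps[value] = [1 if v == value else 0 for v in column]
    return bitmaps

product_sk = [69, 69, 74, 31, 31, 31]
bitmaps = bitmap_index(product_sk)
print(bitmaps[31])  # -> [0, 0, 0, 1, 1, 1]

# A condition like `product_sk IN (31, 69)` becomes a bitwise OR of bitmaps:
matches = [a | b for a, b in zip(bitmaps[31], bitmaps[69])]
print(matches)      # -> [1, 1, 0, 1, 1, 1]
```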
For data warehouse queries that need to scan over millions of rows, a big bottleneck is the bandwidth for getting data from disk into memory.
In a column store, it doesn’t necessarily matter in which order the rows are stored.
The administrator of the database can choose the columns by which the table should be sorted, using their knowledge of common queries.
Different queries benefit from different sort orders, so why not store the same data sorted in several different ways?
The difference is that a materialized view is an actual copy of the query results, written to disk, whereas a virtual view is just a shortcut for writing queries.
When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data.
The advantage of a materialized data cube is that certain queries become very fast because they have effectively been precomputed.
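A tiny materialized data cube can be sketched in Python (the fact rows and dimension names here are invented for illustration): aggregates are precomputed along two dimensions, so a grouped query becomes a lookup.

```python
# Sketch of a 2-D data cube: SUM(net_price) precomputed per
# (date_key, product_sk) cell.
from collections import defaultdict

facts = [  # (date_key, product_sk, net_price) -- made-up rows
    (140101, 69, 13.97),
    (140101, 74, 2.49),
    (140102, 69, 14.99),
]

cube = defaultdict(float)
for date, product, price in facts:
    cube[(date, product)] += price  # the precomputed aggregate

# "Total sales of product 69 on date 140102" is now a single lookup
# rather than a scan over the fact table:
print(cube[(140102, 69)])  # -> 14.99
```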
Everything changes and nothing stands still.
The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
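The two directions can be shown with Python's standard `json` module (any encoding library follows the same pattern):

```python
# The same in-memory object, encoded to a byte sequence and decoded back.
import json

record = {"userName": "Martin", "favoriteNumber": 1337}

encoded = json.dumps(record).encode("utf-8")   # encoding / serialization
assert isinstance(encoded, bytes)

decoded = json.loads(encoded.decode("utf-8"))  # decoding / deserialization
assert decoded == record
```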
JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision.
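This is easy to demonstrate in Python; note that what a given number decodes to is a choice of the parser, not of JSON itself:

```python
# JSON has one "number" type: whether you get an int or a float back,
# and with what precision, depends on the parser.
import json

print(json.loads("1"))    # -> 1    (Python chooses int)
print(json.loads("1.0"))  # -> 1.0  (Python chooses float)

# Integers beyond 2**53 round-trip exactly in Python, but parsers that
# use IEEE 754 doubles (e.g. JavaScript's) silently lose precision:
n = 2**53 + 1
print(json.loads(json.dumps(n)) == n)  # True here, not guaranteed elsewhere
```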
The difficulty of getting different organizations to agree on anything outweighs most other concerns.
{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}
The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed).
Both Thrift and Protocol Buffers require a schema for any data that is encoded.
You can add new fields to the schema, provided that you give each field a new tag number.
If you were to add a field and make it required, that check would fail if new code read data written by old code, because the old code will not have written the new field that you added.
Therefore, to maintain backward compatibility, every field you add after the initial deployment of the schema must be optional or have a default value.
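The role of tag numbers and defaults in schema evolution can be sketched with a toy tagged format (this is an illustration of the principle, not the real Thrift or Protocol Buffers wire format):

```python
# Decoding by tag number: unknown tags are skipped (forward compatibility),
# and fields absent from the data fall back to their defaults
# (backward compatibility).

SCHEMA = {                      # tag -> (field name, default)
    1: ("user_name", ""),
    2: ("favorite_number", 0),  # added after initial deployment,
                                # so it must have a default
}

def decode(encoded):
    record = {name: default for name, default in SCHEMA.values()}
    for tag, value in encoded:
        if tag in SCHEMA:       # skip tags this code doesn't know about
            record[SCHEMA[tag][0]] = value
    return record

# Data written by old code (no tag 2) still decodes under the new schema:
print(decode([(1, "Martin")]))
# -> {'user_name': 'Martin', 'favorite_number': 0}

# Data containing an unknown future tag 3 decodes fine as well:
print(decode([(1, "Martin"), (3, "??")]))
# -> {'user_name': 'Martin', 'favorite_number': 0}
```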
Removing a field is just like adding a field, with backward and forward compatibility concerns reversed.
Thrift has a dedicated list datatype, which is parameterized with the datatype of the list elements.
Avro also uses a schema to specify the structure of the data being encoded.
The encoding simply consists of values concatenated together.
To parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field.
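The Avro idea of tagless, schema-ordered encoding can be sketched like this (real Avro uses a compact binary format, not Python lists; the field names are from the earlier JSON example):

```python
# No field tags in the encoding -- just values concatenated in schema
# order. The reader needs the schema to know which value is which.

SCHEMA = ["userName", "favoriteNumber"]  # field order is the schema

def encode(record):
    return [record[field] for field in SCHEMA]  # values only, in order

def decode(values):
    return dict(zip(SCHEMA, values))            # schema restores the names

encoded = encode({"userName": "Martin", "favoriteNumber": 1337})
print(encoded)          # -> ['Martin', 1337]
print(decode(encoded))  # -> {'userName': 'Martin', 'favoriteNumber': 1337}
```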
When data is decoded (read), the Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema.
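Schema resolution can be sketched as follows (field names and defaults are invented; real Avro does this on its binary format, matching fields by name, ignoring writer-only fields, and filling in reader-side defaults):

```python
# Values were written in the writer's schema order; the reader translates
# them into its own schema: match by name, drop fields only the writer
# had, fill defaults for fields only the reader has.

WRITER_SCHEMA = ["userName", "userID"]  # what the data was written with
READER_SCHEMA = {                       # field name -> default value
    "userName": "",
    "favoriteNumber": 0,                # new field the writer never had
}

def resolve(values):
    written = dict(zip(WRITER_SCHEMA, values))  # via the writer's schema
    return {name: written.get(name, default)    # into the reader's schema
            for name, default in READER_SCHEMA.items()}

print(resolve(["Martin", 1234]))
# -> {'userName': 'Martin', 'favoriteNumber': 0}   (userID is dropped)
```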