Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
One option is to translate a two-dimensional location into a single number using a space-filling curve, and then to use a regular B-tree index [34]. More commonly, specialized spatial indexes such as R-trees are used.
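As an illustration (not from the book), here is a minimal sketch of one such space-filling curve, the Z-order (Morton) curve, which interleaves the bits of the two coordinates so that nearby points tend to get numerically close index values; the function name `interleave_bits` is a hypothetical helper, not a real library API:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two coordinates into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bit positions
    return z

# A 2-D bounding-box query can then be approximated by one or more
# range scans over z in an ordinary B-tree index.
z = interleave_bits(3, 5)
```

Note that a Z-order range scan can include false positives (points outside the box whose z-values fall in the range), which is one reason dedicated spatial indexes like R-trees are more common.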
R-trees
Lucene is able to search text for words within a certain edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced)
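For illustration, here is a compact sketch of the Levenshtein edit-distance computation that underlies such fuzzy matching (the function name is ours; Lucene's actual implementation uses a Levenshtein automaton over its term dictionary):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-letter additions, removals, or replacements."""
    prev = list(range(len(b) + 1))           # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # remove ca
                           cur[j - 1] + 1,               # add cb
                           prev[j - 1] + (ca != cb)))    # replace ca with cb
        prev = cur
    return prev[-1]
```

With this definition, an edit distance of 1 means exactly one letter was added, removed, or replaced, as the passage describes.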
data on disk needs to be laid out carefully if you want good performance on reads and writes.
As RAM becomes cheaper, the cost-per-gigabyte argument is eroded. Many datasets are simply not that big, so it’s quite feasible to keep them entirely in memory, potentially distributed across several machines. This has led to the development of in-memory databases.
they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk
Redis offers a database-like interface to various data structures such as priority queues and sets. Because it keeps all data in memory, its implementation is comparatively simple.
non-volatile memory (NVM)
A transaction needn’t necessarily have ACID (atomicity, consistency, isolation, and durability) properties.
Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).
In order to differentiate this pattern of using databases from transaction processing, it has been called online analytic processing (OLAP)
for both transaction processing and analytic queries. SQL turned out to be quite flexible in this regard: it works well for OLTP-type queries as well as OLAP-type queries.
This process of getting data into the warehouse is known as Extract–Transform–Load (ETL) and is illustrated in
Data warehouse vendors such as Teradata, Vertica, SAP HANA, and ParAccel typically sell their systems under expensive commercial licenses. Amazon RedShift is a hosted version of ParAccel.
Many data warehouses are used in a fairly formulaic style, known as a star schema (also known as dimensional modeling
Other columns in the fact table are foreign key references to other tables, called dimension tables.
The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.
A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions.
In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row of a table are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes.
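The contrast between the two layouts can be sketched in a few lines (a toy model, not how any real database lays out bytes):

```python
rows = [
    {"date": "2024-01-01", "product": "apple", "qty": 3},
    {"date": "2024-01-01", "product": "pear",  "qty": 5},
    {"date": "2024-01-02", "product": "apple", "qty": 2},
]

# Row-oriented: all values of one row are stored next to each other,
# which suits OLTP lookups that read or write whole records.
row_store = rows

# Column-oriented: all values of one column are stored together, which
# suits analytic scans that touch only a few columns of many rows.
column_store = {key: [r[key] for r in rows] for key in rows[0]}

total_qty = sum(column_store["qty"])  # reads only the qty column
```

An analytic query like the sum above never has to load the `date` or `product` values at all, which is the central appeal of column storage.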
Fortunately, column-oriented storage often lends itself very well to compression.
But if n is bigger, there will be a lot of zeros in most of the bitmaps (we say that they are sparse). In that case, the bitmaps can additionally be run-length encoded,
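A minimal run-length encoder makes the idea concrete (an illustrative sketch; real bitmap indexes use engineered formats such as Roaring bitmaps):

```python
def run_length_encode(bitmap: list[int]) -> list[tuple[int, int]]:
    """Encode a bitmap as (bit, run_length) pairs."""
    runs: list[tuple[int, int]] = []
    for bit in bitmap:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((bit, 1))               # start a new run
    return runs

# A sparse bitmap (mostly zeros) collapses into just a few runs.
sparse = [0] * 9 + [1] + [0] * 20
encoded = run_length_encode(sparse)  # [(0, 9), (1, 1), (0, 20)]
```

Thirty bits become three (bit, length) pairs; the sparser the bitmap, the bigger the saving.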
Another advantage of sorted order is that it can help with compression of columns. If the primary sort column does not have many distinct values, then after sorting, it will have long sequences where the same value is repeated many times in a row.
Different queries benefit from different sort orders, so why not store the same data sorted in several different ways? Data needs to be replicated to multiple machines anyway, so that you don’t lose data if one machine fails. You might as well store that redundant data sorted in different ways so that when you’re processing a query, you can use the version that best fits the query pattern.
One way of creating such a cache is a materialized view. In a relational data model, it is often defined like a standard (virtual) view: a table-like object whose contents are the results of some query.
why materialized views are not often used in OLTP databases.
disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data.
The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is frequently a source of security problems
it’s generally a bad idea to use your language’s built-in encoding for anything other than very transient purposes.
Binary strings are a useful feature, so people get around this limitation by encoding the binary data as text using Base64.
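The workaround looks like this with Python's standard library (an illustration; any Base64 implementation behaves the same way):

```python
import base64
import json

binary = bytes([0x00, 0xFF, 0x10, 0x80])            # raw bytes, not valid UTF-8
text = base64.b64encode(binary).decode("ascii")     # "AP8QgA=="
doc = json.dumps({"payload": text})                 # now safe to embed in JSON
restored = base64.b64decode(text)                   # round-trips losslessly
```

The cost is size: Base64 emits four characters for every three input bytes, so the encoded data is about 33% larger than the original.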
JSON is less verbose than XML, but both still use a lot of space compared to binary formats.
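A quick size comparison illustrates the gap; the binary layout below is a hand-rolled sketch (length-prefixed string plus a fixed 4-byte integer), not any standard wire format:

```python
import json
import struct

record = {"userName": "Martin", "favoriteNumber": 1337}

as_json = json.dumps(record).encode("utf-8")

# Hypothetical binary layout: 1-byte name length, name bytes, 4-byte uint.
# Note it carries no field names at all, only values.
name = record["userName"].encode("utf-8")
as_binary = struct.pack(f"<B{len(name)}sI", len(name), name,
                        record["favoriteNumber"])
```

Here the JSON encoding is 46 bytes while the binary one is 11, largely because JSON must repeat the field names and spell the number out as digits.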
Thrift and Protocol Buffers each come with a code generation tool that takes a schema definition like the ones shown here, and produces classes that implement the schema in various programming languages
Note from Chena Lee: It's like Amazon's Coral model!
The big difference compared to Figure 4-1 is that there are no field names (userName, favoriteNumber, interests). Instead, the encoded data contains field tags, which are numbers (1, 2, and 3).
it is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come.
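This variable-length integer ("varint") scheme can be sketched as follows, in the little-endian style that Protocol Buffers uses (seven payload bits per byte, top bit as the continuation flag):

```python
def encode_varint(n: int) -> bytes:
    """Encode an unsigned int 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # top bit set: more bytes follow
        else:
            out.append(byte)          # top bit clear: last byte
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:           # continuation bit clear: stop
            break
    return n

encoded = encode_varint(1337)  # two bytes: 0xB9 0x0A
```

Values up to 127 take one byte, values up to 16383 take two, and so on, which is why small field tags and small numbers encode so compactly.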
You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field’s tag,
You can add new fields to the schema, provided that you give each field a new tag number. If old code (which doesn’t know about the new tag numbers you added) tries to read data written by new code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field.
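A toy model of that decoding rule (wire-level parsing omitted; tag numbers and field names are the example's, echoing the book's userName/favoriteNumber/interests record):

```python
# What the *old* code knows about the schema: tags 1 and 2 only.
KNOWN_TAGS = {1: "userName", 2: "favoriteNumber"}

def decode(fields: list[tuple[int, object]]) -> dict:
    """Decode (tag, value) pairs, silently skipping unknown tag numbers."""
    record = {}
    for tag, value in fields:
        if tag in KNOWN_TAGS:
            record[KNOWN_TAGS[tag]] = value
        # Unknown tag (e.g. 3, added by newer code): ignored, not an error.
    return record

# Data written by new code, which also writes tag 3 ("interests"):
decoded = decode([(1, "Martin"), (2, 1337), (3, ["hiking"])])
```

The old code still produces a valid record from the new data, dropping only the field it has never heard of.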
This maintains forward compatibility: old code can read records that were written by new code.
if you add a new field, you cannot make it required.
every field you add after the initial deployment of the schema must be optional or have a default value.
you can only remove a field that is optional
and you can never use the same tag number again
that the binary data can only be decoded correctly if the code reading the data is using the exact same schema as the code that wrote the data. Any mismatch in the schema between the reader and the writer would mean incorrectly decoded data.
The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same — they only need to be compatible.
The difference is that Avro is friendlier to dynamically generated schemas.
most relational databases have a network protocol over which you can send queries to the database and get back responses. Those protocols are generally specific to a particular database, and the database vendor provides a driver (e.g., using the ODBC or JDBC APIs)
The schema is a valuable form of documentation,
For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
The encoding formats discussed previously support such preservation of unknown fields, but sometimes you need to take care at an application level,
data outlives code.
a client-side JavaScript application running inside a web browser can use XMLHttpRequest to become an HTTP client (this technique is known as Ajax)