Kindle Notes & Highlights
Read between August 2 - December 28, 2020
One option is to translate a two-dimensional location into a single number using a space-filling curve, and then to use a regular B-tree index [34]. More commonly, specialized spatial indexes such as R-trees are used.
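As a concrete sketch of the space-filling-curve idea (illustrative code, not anything from the text): interleaving the bits of the x and y coordinates produces a Z-order (Morton) code, a single number on which an ordinary B-tree can be built while preserving some spatial locality.

```python
def interleave(x: int, y: int, bits: int = 16) -> int:
    """Z-order (Morton) code: interleave the bits of x and y.

    Nearby (x, y) points tend to get nearby codes, so a regular
    B-tree index on the code keeps spatially close points close
    together on disk.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x supplies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y supplies the odd bit positions
    return z
```

One limitation worth noting: a range query over a 2D rectangle maps to several disjoint ranges of Z-codes, which is part of why specialized structures like R-trees are more common.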
R-trees
Lucene is able to search text for words within a certain edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced)
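To make "edit distance" concrete, here is the classic Levenshtein dynamic program (a minimal sketch; Lucene's actual implementation uses more sophisticated automaton-based techniques, not this algorithm):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # delete ca from a
                cur[j - 1] + 1,              # insert cb into a
                prev[j - 1] + (ca != cb),    # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]
```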
data on disk needs to be laid out carefully if you want good performance on reads and writes.
As RAM becomes cheaper, the cost-per-gigabyte argument is eroded. Many datasets are simply not that big, so it’s quite feasible to keep them entirely in memory, potentially distributed across several machines. This has led to the development of in-memory databases.
they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk
Redis offers a database-like interface to various data structures such as priority queues and sets. Because it keeps all data in memory, its implementation is comparatively simple.
non-volatile memory (NVM)
A transaction needn’t necessarily have ACID (atomicity, consistency, isolation, and durability) properties.
Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).
In order to differentiate this pattern of using databases from transaction processing, it has been called online analytic processing (OLAP)
for both transaction processing and analytic queries. SQL turned out to be quite flexible in this regard: it works well for OLTP-type queries as well as OLAP-type queries.
This process of getting data into the warehouse is known as Extract–Transform–Load (ETL) and is illustrated in
Data warehouse vendors such as Teradata, Vertica, SAP HANA, and ParAccel typically sell their systems under expensive commercial licenses. Amazon Redshift is a hosted version of ParAccel.
Many data warehouses are used in a fairly formulaic style, known as a star schema (also known as dimensional modeling
Other columns in the fact table are foreign key references to other tables, called dimension tables.
The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.
A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions.
In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row of a table are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes.
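A toy sketch of the layout difference (hypothetical table and column names), showing the same rows stored row-oriented and column-oriented:

```python
# Hypothetical three-row table with columns "date" and "price".
rows = [
    {"date": "2024-01-01", "price": 10},
    {"date": "2024-01-02", "price": 12},
    {"date": "2024-01-03", "price": 11},
]

# Row-oriented: all values from one row stored next to each other
# (OLTP-friendly: fetching a whole row touches one contiguous region).
row_store = [tuple(r.values()) for r in rows]

# Column-oriented: all values from one column stored together
# (OLAP-friendly: a query that aggregates only "price" reads just that list).
column_store = {
    "date":  [r["date"] for r in rows],
    "price": [r["price"] for r in rows],
}

avg_price = sum(column_store["price"]) / len(column_store["price"])
```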
Fortunately, column-oriented storage often lends itself very well to compression.
But if n is bigger, there will be a lot of zeros in most of the bitmaps (we say that they are sparse). In that case, the bitmaps can additionally be run-length encoded,
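A minimal sketch of run-length encoding a sparse bitmap (alternating run lengths, starting with the count of leading zeros; production bitmap indexes use more refined schemes, but the principle is the same):

```python
def rle_encode(bitmap: list[int]) -> list[int]:
    """Run-length encode a bitmap of 0s and 1s as alternating run
    lengths, beginning with the number of leading zeros (possibly 0).
    A sparse bitmap of mostly zeros collapses to a few small numbers."""
    runs = []
    current, length = 0, 0
    for bit in bitmap:
        if bit == current:
            length += 1
        else:
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return runs
```

For example, a bitmap of 1,000 zeros followed by a single one encodes as just two numbers instead of 1,001 bits.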
Another advantage of sorted order is that it can help with compression of columns. If the primary sort column does not have many distinct values, then after sorting, it will have long sequences where the same value is repeated many times in a row.
Different queries benefit from different sort orders, so why not store the same data sorted in several different ways? Data needs to be replicated to multiple machines anyway, so that you don’t lose data if one machine fails. You might as well store that redundant data sorted in different ways so that when you’re processing a query, you can use the version that best fits the query pattern.
One way of creating such a cache is a materialized view. In a relational data model, it is often defined like a standard (virtual) view: a table-like object whose contents are the results of some query.
why materialized views are not often used in OLTP databases.
disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data.
The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is frequently a source of security problems
it’s generally a bad idea to use your language’s built-in encoding for anything other than very transient purposes.
Binary strings are a useful feature, so people get around this limitation by encoding the binary data as text using Base64.
JSON is less verbose than XML, but both still use a lot of space compared to binary formats.
The big difference compared to Figure 4-1 is that there are no field names (userName, favoriteNumber, interests). Instead, the encoded data contains field tags, which are numbers (1, 2, and 3).
it is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come.
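This variable-length integer scheme can be sketched as follows (written in the protobuf-style little-endian form: seven payload bits per byte, least-significant group first, with the top bit as the continuation flag):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer with 7 payload bits per byte;
    the top bit of each byte is 1 if more bytes follow, 0 on the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit set: more to come
        else:
            out.append(byte)          # final byte: top bit clear
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:           # top bit clear marks the last byte
            break
    return n
```

Numbers up to 127 fit in one byte; anything up to 16,383 fits in the two bytes mentioned above.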
You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field’s tag,
You can add new fields to the schema, provided that you give each field a new tag number. If old code (which doesn’t know about the new tag numbers you added) tries to read data written by new code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field.
This maintains forward compatibility: old code can read records that were written by new code.
if you add a new field, you cannot make it required.
every field you add after the initial deployment of the schema must be optional or have a default value.
you can only remove a field that is optional
and you can never use the same tag number again
that the binary data can only be decoded correctly if the code reading the data is using the exact same schema as the code that wrote the data. Any mismatch in the schema between the reader and the writer would mean incorrectly decoded data.
The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same — they only need to be compatible.
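A toy sketch of this resolution idea (real Avro matches full writer and reader schemas, handles type promotion, and more; the `favoriteColor` field here is hypothetical): fields are matched by name, fields only the writer knows are ignored, and fields only the reader knows are filled from the reader's declared default.

```python
def resolve_record(reader_fields: dict, record: dict) -> dict:
    """Toy Avro-style resolution: reader_fields maps each field name
    the reader expects to its default value. Writer-only fields in
    the record are dropped; reader-only fields get their defaults."""
    return {name: record.get(name, default)
            for name, default in reader_fields.items()}

# The writer included "interests", which this reader ignores; the reader
# expects "favoriteColor" (hypothetical), which the writer omitted, so
# the reader's default is used.
decoded = resolve_record(
    {"userName": None, "favoriteColor": "red"},
    {"userName": "Martin", "interests": ["cycling"]},
)
```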
The difference is that Avro is friendlier to dynamically generated schemas.
most relational databases have a network protocol over which you can send queries to the database and get back responses. Those protocols are generally specific to a particular database, and the database vendor provides a driver (e.g., using the ODBC or JDBC APIs)
The schema is a valuable form of documentation,
For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
The encoding formats discussed previously support such preservation of unknown fields, but sometimes you need to take care at an application level,
data outlives code.
a client-side JavaScript application running inside a web browser can use XMLHttpRequest to become an HTTP client (this technique is known as Ajax)