Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
A batch job’s output is only considered valid when the job has completed successfully (MapReduce discards the partial output of a failed job).
A join is necessary whenever you have some code that needs to access records on both sides of that association (both the record that holds the reference and the record being referenced).
When we talk about joins in the context of batch processing, we mean resolving all occurrences of some association within a dataset.
In order to achieve good throughput in a batch process, the computation must be (as much as possible) local to one machine.
Making random-access requests over the network for every record you want to process is too slow.
Recall that the purpose of the mapper is to extract a key and value from each input record.
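As a minimal sketch of a mapper in the Hadoop Streaming style (the log format and field position are assumptions for illustration), reading web-server log lines from stdin and emitting a (URL, 1) pair per record:

    import sys

    # Hypothetical mapper over a web-server access log: one input line is one
    # record, and we emit a tab-separated (key, value) pair per record --
    # here the requested URL as the key and a count of 1 as the value.
    def map_line(line):
        fields = line.rstrip("\n").split(" ")
        url = fields[6] if len(fields) > 6 else "-"   # assumed log field layout
        return url, 1

    if __name__ == "__main__":
        for line in sys.stdin:
            key, value = map_line(line)
            print(f"{key}\t{value}")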
Having lined up all the required data in advance, the reducer can be a fairly simple, single-threaded piece of code that can churn through records with high throughput and low memory overhead.
When a mapper emits a key-value pair, the key acts like the destination address to which the value should be delivered.
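A rough sketch of how that routing can work (not Hadoop's actual partitioner, just the standard hash-partitioning idea): hash the key modulo the number of reducers, so every pair carrying the same key is delivered to the same reducer.

    import hashlib

    def reducer_for_key(key: str, num_reducers: int) -> int:
        # All occurrences of the same key hash to the same reducer partition,
        # which is what lets the reducer see them together.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_reducers

    # Every value emitted under "user-42" goes to the same reducer:
    assert reducer_for_key("user-42", 8) == reducer_for_key("user-42", 8)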
Using the MapReduce programming model has separated the physical network communication aspects of the computation (getting the data to the right machine) from the application logic (processing the data once you have it).
If you have multiple web servers handling user requests, the activity events for a particular user are most likely scattered across various different servers’ log files.
Since a MapReduce job is only complete when all of its mappers and reducers have completed, any subsequent jobs must wait for the slowest reducer to complete before they can start.
The simplest way of performing a map-side join applies in the case where a large dataset is joined with a small dataset.
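A minimal sketch of such a broadcast hash join (record shapes and field names are invented): each mapper loads the small input entirely into an in-memory hash table and then streams through its block of the large input, probing the table for each record.

    def broadcast_hash_join(small_records, large_records, key):
        # Index the small dataset by the join key; it must fit in memory.
        lookup = {rec[key]: rec for rec in small_records}
        # Stream through the large dataset and probe the hash table per record.
        for rec in large_records:
            match = lookup.get(rec[key])
            if match is not None:
                yield {**rec, **match}

    users  = [{"user_id": 1, "dob": "1990-01-01"}]
    events = [{"user_id": 1, "url": "/home"}, {"user_id": 2, "url": "/about"}]
    print(list(broadcast_hash_join(users, events, "user_id")))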
If the inputs to the map-side join are partitioned in the same way, then the hash join approach can be applied to each partition independently.
Another variant of a map-side join applies if the input datasets are not only partitioned in the same way, but also sorted based on the same key.
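The merge step of such a map-side merge join can be sketched as follows, assuming both inputs are already sorted by the join key (field names are illustrative):

    import itertools
    from operator import itemgetter

    def merge_join(left_sorted, right_sorted, key):
        # Both inputs must already be sorted by `key`. Walk them in lockstep,
        # as in the merge step of a merge sort, pairing up matching keys.
        right_iter = iter(right_sorted)
        right = next(right_iter, None)
        for k, group in itertools.groupby(left_sorted, key=itemgetter(key)):
            group = list(group)
            while right is not None and right[key] < k:
                right = next(right_iter, None)
            while right is not None and right[key] == k:
                for left in group:
                    yield {**left, **right}
                right = next(right_iter, None)

    events = [{"id": 1, "url": "/a"}, {"id": 2, "url": "/b"}, {"id": 2, "url": "/c"}]
    users  = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
    print(list(merge_join(events, users, "id")))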
If a map-side merge join is possible, it probably means that prior MapReduce jobs brought the input datasets into this partitioned and sorted form in the first place.
The output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partitioned and sorted in the same way as the large input (since one map task is started for each file block of the join’s large input, regardless of whether a partitioned or broadcast join is used).
The output of a batch process is often not a report, but some other kind of structure.
If the indexed set of documents changes, one option is to periodically rerun the entire indexing workflow for the entire set of documents, and replace the previous index files wholesale with the new index files when it is done.
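One way to sketch that rebuild-and-swap pattern (the paths and the index-building step here are invented): build the new index files in a fresh directory, and only once the job has finished successfully, atomically repoint a symlink so queries switch over to the new files wholesale.

    import os, tempfile

    def rebuild_index(documents, index_root="indexes", current_link="indexes/current"):
        os.makedirs(index_root, exist_ok=True)
        # Build the new index in a fresh directory; nothing reads it yet.
        new_dir = tempfile.mkdtemp(prefix="index-", dir=index_root)
        with open(os.path.join(new_dir, "index.dat"), "w") as f:
            for doc in documents:              # stand-in for real index building
                f.write(doc + "\n")
        # Switch readers over by replacing the symlink in one step.
        tmp_link = current_link + ".tmp"
        os.symlink(os.path.abspath(new_dir), tmp_link)
        os.replace(tmp_link, current_link)     # atomic rename on POSIX filesystems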
If all the mappers or reducers concurrently write to the same output database, at the rate expected of a batch process, that database can easily be overwhelmed, and its performance for queries is likely to suffer.
If you introduce a bug into the code and the output is wrong or corrupted, you can simply roll back to a previous version of the code and rerun the job, and the output will be correct again.
If a map or reduce task fails, the MapReduce framework automatically re-schedules it and runs it again on the same input.
Databases require you to structure data according to a particular model (e.g., relational or documents), whereas files in a distributed filesystem are just byte sequences, which can be written using any data model and encoding.
MapReduce gave engineers the ability to easily run their own code over large datasets.
Batch processes are less sensitive to faults than online systems, because they do not immediately affect users if they fail and they can always be run again.
Pipes do not fully materialize the intermediate state, but instead stream one command’s output to the next command’s input incrementally, using only a small in-memory buffer.
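The same incremental streaming can be sketched with Python generators standing in for Unix processes: each stage pulls one record at a time from the stage before it, so no intermediate result is fully materialized.

    def read_lines(path):
        with open(path) as f:
            for line in f:                     # yields one line at a time
                yield line.rstrip("\n")

    def grep(lines, needle):
        return (line for line in lines if needle in line)

    def head(lines, n=10):
        for i, line in enumerate(lines):
            if i >= n:
                break
            yield line

    # Roughly `cat access.log | grep 404 | head`: each stage consumes its
    # input incrementally, holding only one record in memory at a time.
    # for line in head(grep(read_lines("access.log"), "404")):
    #     print(line)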
Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.
A sorting operation inevitably needs to consume its entire input before it can produce any output, because it’s possible that the very last input record is the one with the lowest key and thus needs to be the very first output record.
The overhead of sending messages over the network can significantly slow down distributed graph algorithms.
The freedom to easily run arbitrary code is what has long distinguished batch processing systems of MapReduce heritage from MPP databases.
While the extensibility of being able to run arbitrary code is useful, there are also many common cases where standard processing patterns keep reoccurring, and so it is worth having reusable implementations of the common building blocks.
MapReduce frequently writes to disk, which makes it easy to recover from an individual failed task without restarting the entire job but slows down execution in the failure-free case.
Dataflow engines perform less materialization of intermediate state and keep more in memory, which means that they need to recompute more data if a node fails.
If several tasks for a partition succeed, only one of them actually makes its output visible.
A complex system that works is invariably found to have evolved from a simple system that works.
The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users.
The more often you poll, the lower the percentage of requests that return new events, and thus the higher the overheads become.
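As a rough illustration: if new events arrive on average once per minute but a client polls once per second, about 59 out of every 60 polls come back empty, so roughly 98% of the requests are pure overhead.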
In particular, Unix pipes and TCP connect exactly one sender with one recipient, whereas a messaging system allows multiple producer nodes to send messages to the same topic and allows multiple consumer nodes to receive messages in a topic.
If you can afford to sometimes lose messages, you can probably get higher throughput and lower latency on the same hardware.
If a consumer is offline, it may miss messages that were sent while it was unreachable.
Producers write messages to the broker, and consumers receive them by reading them from the broker.
If the connection to a client is closed or times out without the broker receiving an acknowledgment, it assumes that the message was not processed, and therefore it delivers the message again to another consumer.
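A toy sketch of that redelivery behavior (not any particular broker's implementation): the broker keeps each delivered message marked as in-flight until it is acknowledged, and requeues it if the consumer's connection is lost first.

    from collections import deque

    class ToyBroker:
        def __init__(self):
            self.queue = deque()
            self.in_flight = {}        # delivery tag -> unacknowledged message
            self.next_tag = 0

        def publish(self, message):
            self.queue.append(message)

        def deliver(self):
            # Hand the next message to a consumer, but remember it until acked.
            message = self.queue.popleft()
            self.next_tag += 1
            self.in_flight[self.next_tag] = message
            return self.next_tag, message

        def ack(self, tag):
            del self.in_flight[tag]    # processed successfully; forget it

        def connection_lost(self, tag):
            # No ack arrived: assume the message was not processed and put it
            # back so it can be delivered again, possibly to another consumer.
            self.queue.appendleft(self.in_flight.pop(tag))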
Sending a packet over a network or making a request to a network service is normally a transient operation that leaves no permanent trace.
A log is simply an append-only sequence of records on disk.
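In its simplest form that can be sketched as appending each record to a file, with the record's byte position serving as its offset:

    import os

    class AppendOnlyLog:
        def __init__(self, path):
            self.f = open(path, "ab")
            self.f.seek(0, os.SEEK_END)     # position at the current end of file

        def append(self, record: bytes) -> int:
            offset = self.f.tell()          # byte position doubles as the offset
            self.f.write(record + b"\n")    # records are only ever appended
            self.f.flush()
            return offset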
There is no ordering guarantee across different partitions.
Typically, when a consumer has been assigned a log partition, it reads the messages in the partition sequentially, in a straightforward single-threaded manner.
Since partitioned logs typically preserve message ordering only within a single partition, all messages that need to be ordered consistently need to be routed to the same partition.
In database replication, the log sequence number allows a follower to reconnect to a leader after it has become disconnected, and resume replication without skipping any writes.
If a consumer node fails, another node in the consumer group is assigned the failed consumer’s partitions, and it starts consuming messages at the last recorded offset.
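A sketch of that takeover, with an in-memory dictionary standing in for the broker's offset tracking: the consumer that inherits the partition simply resumes from the last offset the failed consumer committed, which is also why messages after that point may be processed twice.

    committed = {}                              # partition -> next offset to read

    def consume(partition, log, process):
        # Start from the last recorded offset for this partition (0 if none).
        for offset in range(committed.get(partition, 0), len(log)):
            process(log[offset])
            committed[partition] = offset + 1   # record progress after processing

    # If the consumer that owned partition 3 fails mid-way, its replacement just
    # calls consume(3, log, process) again and resumes at committed[3]; anything
    # processed but not yet committed is seen a second time, so processing
    # should ideally be idempotent.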
If you only ever append to the log, you will eventually run out of disk space.
To reclaim disk space, the log is actually divided into segments, and from time to time old segments are deleted or moved to archive storage.
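Segment-based retention can be sketched like this (the file layout is invented): the log is a directory of segment files, appends go to the newest one, a new segment is started once the current one grows large enough, and retention deletes whole old segments.

    import os

    SEGMENT_BYTES = 64 * 1024 * 1024            # roll to a new segment after ~64 MB

    def append(log_dir, record: bytes):
        segments = sorted(os.listdir(log_dir))
        active = segments[-1] if segments else "00000000.log"
        path = os.path.join(log_dir, active)
        if os.path.exists(path) and os.path.getsize(path) >= SEGMENT_BYTES:
            next_num = int(active.split(".")[0]) + 1
            path = os.path.join(log_dir, f"{next_num:08d}.log")   # fresh segment
        with open(path, "ab") as f:
            f.write(record + b"\n")

    def enforce_retention(log_dir, keep_segments=10):
        # Reclaim disk space by deleting the oldest whole segments.
        for old in sorted(os.listdir(log_dir))[:-keep_segments]:
            os.remove(os.path.join(log_dir, old))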