Kindle Notes & Highlights
A batch job’s output is only considered valid when the job has completed successfully (MapReduce discards the partial output of a failed job).
A join is necessary whenever you have some code that needs to access records on both sides of that association (both the record that holds the reference and the record being referenced).
When we talk about joins in the context of batch processing, we mean resolving all occurrences of some association within a dataset.
In order to achieve good throughput in a batch process, the computation must be (as much as possible) local to one machine.
Making random-access requests over the network for every record you want to process is too slow.
Recall that the purpose of the mapper is to extract a key and value from each input record.
Because the framework has lined up all the required data in advance, the reducer can be a fairly simple, single-threaded piece of code that can churn through records with high throughput and low memory overhead.
When a mapper emits a key-value pair, the key acts like the destination address to which the value should be delivered.
Using the MapReduce programming model has separated the physical network communication aspects of the computation (getting the data to the right machine) from the application logic (processing the data once you have it).
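To make that separation concrete, here is a minimal single-process sketch of the model in Python (the log-line format and the hit-counting job are invented for illustration; a real framework distributes the shuffle across machines):

    from collections import defaultdict

    def mapper(record):
        # Extract a key and value from each input record; here the record
        # is a log line "timestamp url" and we count hits per URL.
        _, url = record.split(" ", 1)
        yield url, 1

    def reducer(key, values):
        yield key, sum(values)

    def run_job(records):
        # The "shuffle": the emitted key acts as a destination address,
        # so all values for one key are delivered to one reducer call.
        groups = defaultdict(list)
        for record in records:
            for key, value in mapper(record):
                groups[key].append(value)
        for key in sorted(groups):
            yield from reducer(key, groups[key])

    print(list(run_job(["t1 /home", "t2 /about", "t3 /home"])))
    # [('/about', 1), ('/home', 2)]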
If you have multiple web servers handling user requests, the activity events for a particular user are most likely scattered across various different servers’ log files.
Since a MapReduce job is only complete when all of its mappers and reducers have completed, any subsequent jobs must wait for the slowest reducer to complete before they can start.
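As a sketch of the reduce-side sort-merge join this enables (the record shapes and the user/event tagging scheme are assumptions, not the book's code): the mappers emit the join key for both inputs, and a secondary sort places each user record ahead of that user's activity events, so the reducer only ever holds one user in memory.

    from itertools import groupby

    users = [("u1", ("user", {"dob": "1990-01-01"}))]
    events = [("u1", ("event", {"url": "/home"})),
              ("u1", ("event", {"url": "/about"}))]

    # Secondary sort: user records sort before event records for the
    # same key, because ("user" != "user") is False.
    tagged = sorted(users + events,
                    key=lambda kv: (kv[0], kv[1][0] != "user"))

    for user_id, group in groupby(tagged, key=lambda kv: kv[0]):
        group = list(group)
        user = group[0][1][1]                 # first record: the profile
        for _, (_tag, event) in group[1:]:    # the rest: activity events
            print(user_id, user["dob"], event["url"])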
The simplest way of performing a map-side join applies in the case where a large dataset is joined with a small dataset.
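A sketch of the idea, with an invented profile table and event tuples: each mapper loads the small input into an in-memory hash table and joins every record of the large input against it locally, with no network round trips per record.

    # Small input: fits entirely in each mapper's memory.
    user_profiles = {"u1": "1990-01-01", "u2": "1985-06-15"}

    def mapper(activity_event):
        user_id, url = activity_event
        dob = user_profiles.get(user_id)   # in-memory lookup, not a network call
        yield user_id, url, dob

    for event in [("u1", "/home"), ("u2", "/about")]:
        for joined in mapper(event):
            print(joined)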
If the inputs to the map-side join are partitioned in the same way, then the hash join approach can be applied to each partition independently.
Another variant of a map-side join applies if the input datasets are not only partitioned in the same way, but also sorted based on the same key.
If a map-side merge join is possible, it probably means that prior MapReduce jobs brought the input datasets into this partitioned and sorted form in the first place.
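A sketch of the merge pass itself (data and field names are illustrative, and the sketch assumes every event's user appears in the users input): because both sides are already sorted by user_id, a mapper can advance through them in lockstep, like one pass of a mergesort.

    sorted_users = iter([("u1", "1990-01-01"), ("u2", "1985-06-15")])
    sorted_events = [("u1", "/home"), ("u1", "/about"), ("u2", "/pricing")]

    user_id, dob = next(sorted_users)
    for event_user, url in sorted_events:   # both sides advance in key order
        while event_user > user_id:
            user_id, dob = next(sorted_users)
        if event_user == user_id:
            print(event_user, dob, url)     # joined record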
The output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partitioned and sorted in the same way as the large input (since one map task is started for each file block of the join’s large input, regardless of whether a partitioned or broadcast join is used).
The output of a batch process is often not a report, but some other kind of structure.
If the indexed set of documents changes, one option is to periodically rerun the entire indexing workflow for the entire set of documents, and replace the previous index files wholesale with the new index files when it is done.
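One way to make that wholesale replacement atomic, sketched here with an invented directory layout (the book does not prescribe one): write the new index files into a fresh directory, then repoint a symlink at it in a single rename.

    import os
    import tempfile

    def rebuild_index(documents, index_root="indexes", live_link="index-current"):
        os.makedirs(index_root, exist_ok=True)
        new_dir = tempfile.mkdtemp(prefix="index-", dir=index_root)
        with open(os.path.join(new_dir, "segments"), "w") as f:
            for doc_id, text in documents:
                f.write(f"{doc_id}\t{text}\n")   # stand-in for real index files
        tmp_link = live_link + ".tmp"
        os.symlink(new_dir, tmp_link)            # build the pointer off to the side
        os.replace(tmp_link, live_link)          # then swap it in atomically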
If all the mappers or reducers concurrently write to the same output database, at a rate expected of a batch process, that database can easily be overwhelmed, and its performance for queries is likely to suffer.
If you introduce a bug into the code and the output is wrong or corrupted, you can simply roll back to a previous version of the code and rerun the job, and the output will be correct again.
If a map or reduce task fails, the MapReduce framework automatically reschedules it and runs it again on the same input.
Databases require you to structure data according to a particular model (e.g., relational or documents), whereas files in a distributed filesystem are just byte sequences, which can be written using any data model and encoding.
MapReduce gave engineers the ability to easily run their own code over large datasets.
Batch processes are less sensitive to faults than online systems, because they do not immediately affect users if they fail and they can always be run again.
Pipes do not fully materialize the intermediate state, but instead stream one command’s output to the next command’s input incrementally, using only a small in-memory buffer.
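Python generators behave analogously, which makes the contrast easy to demonstrate (the filenames are illustrative): each stage pulls one record at a time from the stage before it, so nothing is fully materialized between stages.

    def read_lines(path):
        with open(path) as f:
            for line in f:              # streams line by line, small buffer only
                yield line.rstrip("\n")

    def grep(lines, needle):
        return (line for line in lines if needle in line)

    def head(lines, n):
        for i, line in enumerate(lines):
            if i >= n:
                break                   # upstream stops producing, like SIGPIPE
            yield line

    for line in head(grep(read_lines("access.log"), "404"), 10):
        print(line)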
Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.
A sorting operation inevitably needs to consume its entire input before it can produce any output, because it’s possible that the very last input record is the one with the lowest key and thus needs to be the very first output record.
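A tiny demonstration of why sorting is therefore a "blocking" operator, in contrast to the streaming stages above: sorted() must exhaust its input before it can yield anything.

    def records():
        yield from [3, 1, 2]
        print("input fully consumed")    # runs before sorted() returns

    output = sorted(records())           # prints "input fully consumed" first
    print(output)                        # [1, 2, 3]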
The overhead of sending messages over the network can significantly slow down distributed graph algorithms.
The freedom to easily run arbitrary code is what has long distinguished batch processing systems of MapReduce heritage from MPP databases.
While the extensibility of being able to run arbitrary code is useful, there are also many common cases where standard processing patterns keep reoccurring, and so it is worth having reusable implementations of the common building blocks.
MapReduce frequently writes to disk, which makes it easy to recover from an individual failed task without restarting the entire job but slows down execution in the failure-free case.
Dataflow engines perform less materialization of intermediate state and keep more in memory, which means that they need to recompute more data if a node fails.
If several tasks for a partition succeed, only one of them actually makes its output visible.
A complex system that works is invariably found to have evolved from a simple system that works.
The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users.
The more often you poll, the lower the percentage of requests that return new events, and thus the higher the overheads become.
In particular, Unix pipes and TCP connections connect exactly one sender with one recipient, whereas a messaging system allows multiple producer nodes to send messages to the same topic and allows multiple consumer nodes to receive messages in a topic.
If you can afford to sometimes lose messages, you can probably get higher throughput and lower latency on the same hardware.
If a consumer is offline, it may miss messages that were sent while it was unreachable.
Producers write messages to the broker, and consumers receive them by reading them from the broker.
If the connection to a client is closed or times out without the broker receiving an acknowledgment, it assumes that the message was not processed, and therefore it delivers the message again to another consumer.
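A toy model of that redelivery behavior (not any real broker's API): a delivered message stays "in flight" until acknowledged, and if its ack deadline passes it goes back on the queue for another consumer.

    import time

    class Broker:
        def __init__(self, ack_timeout=30.0):
            self.queue = []             # messages awaiting delivery
            self.in_flight = {}         # msg_id -> (message, ack deadline)
            self.ack_timeout = ack_timeout

        def deliver(self):
            now = time.monotonic()
            for msg_id, (msg, deadline) in list(self.in_flight.items()):
                if now > deadline:      # no ack in time: assume not processed
                    del self.in_flight[msg_id]
                    self.queue.append((msg_id, msg))   # redeliver later
            if self.queue:
                msg_id, msg = self.queue.pop(0)
                self.in_flight[msg_id] = (msg, now + self.ack_timeout)
                return msg_id, msg

        def ack(self, msg_id):
            self.in_flight.pop(msg_id, None)   # processed: drop it for good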
Sending a packet over a network or making a request to a network service is normally a transient operation that leaves no permanent trace.
A log is simply an append-only sequence of records on disk.
There is no ordering guarantee across different partitions.
Typically, when a consumer has been assigned a log partition, it reads the messages in the partition sequentially, in a straightforward single-threaded manner.
Since partitioned logs typically preserve message ordering only within a single partition, all messages that need to be ordered consistently need to be routed to the same partition.
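The usual way to achieve that routing, sketched here with an arbitrary hash function and partition count: derive the partition number from a hash of the message key, so every message with the same key lands in the same partition.

    import hashlib

    def partition_for(key: str, num_partitions: int = 8) -> int:
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions

    # All events for user u1 map to the same partition, so their
    # relative order is preserved; u2's events may go elsewhere.
    print(partition_for("u1"), partition_for("u1"), partition_for("u2"))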
In database replication, the log sequence number allows a follower to reconnect to a leader after it has become disconnected, and resume replication without skipping any writes.
If a consumer node fails, another node in the consumer group is assigned the failed consumer’s partitions, and it starts consuming messages at the last recorded offset.
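A minimal sketch of that offset-based recovery, using in-memory stand-ins for the broker's log and the durable offset store: the consumer checkpoints the offset it has processed up to, and a restarted consumer resumes from that checkpoint.

    log = ["msg-0", "msg-1", "msg-2", "msg-3"]   # one partition's log
    committed_offset = 0                         # durable in a real system

    def consume(batch_size=2):
        global committed_offset
        while committed_offset < len(log):
            batch = log[committed_offset:committed_offset + batch_size]
            for message in batch:
                print("processing", message)
            committed_offset += len(batch)       # checkpoint after the batch

    consume()   # if this crashes, rerunning resumes at committed_offset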
If you only ever append to the log, you will eventually run out of disk space.
To reclaim disk space, the log is actually divided into segments, and from time to time old segments are deleted or moved to archive storage.
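A sketch of that retention policy, with an assumed file layout: the log is a series of segment files, and reclaiming space means deleting the oldest segments whole rather than trimming a single giant file.

    import glob
    import os

    def enforce_retention(log_dir, max_segments=100):
        segments = sorted(glob.glob(os.path.join(log_dir, "segment-*.log")))
        for old in segments[:-max_segments]:   # keep only the newest N segments
            os.remove(old)                     # or move to archive storage instead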