Kindle Notes & Highlights
Read between August 2 and December 28, 2020
Writing the output of each stage to a file that you can inspect is great for debugging.
This allows you to restart the later stage without rerunning the entire pipeline.
However, the biggest limitation of Unix tools is that they run only on a single machine — and that’s where tools like Hadoop come in.
MapReduce is a bit like Unix tools, but distributed across potentially thousands of machines.
Running a MapReduce job normally does not modify the input and does not have any side effects other than producing the output.
The output files are written once, in a sequential fashion (not modifying any existing part of a file once it has been written).
MapReduce jobs read and write files on a distributed filesystem.
In Hadoop’s implementation of MapReduce, that filesystem is called HDFS (Hadoop Distributed File System), an open source reimplementation of the Google File System (GFS).
Object storage services such as Amazon S3 are similar in many ways.
HDFS consists of a daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine.
A central server called the NameNode keeps track of which file blocks are stored on which machine.
In order to tolerate machine and disk failures, file blocks are replicated on multiple machines.
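As a concrete illustration of this client/NameNode/DataNode split, the sketch below reads a file through Hadoop's FileSystem API; the lookup of which DataNodes hold each block, and the streaming of block contents from those machines, all happen behind fs.open(). The cluster address and file path are made up for the example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of reading a file from HDFS via the FileSystem API.
// The NameNode resolves file blocks to DataNodes behind the scenes.
public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // hypothetical cluster address

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/logs/access.log")),
                                           StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);  // print each line of the file
            }
        }
    }
}
```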
A MapReduce job calls the mapper function to extract a key and value from each input record.
In the log analysis example, the mapper extracts the URL as the key, and leaves the value empty.
After the key-value pairs have been sorted by key, the job calls the reducer function to iterate over the sorted key-value pairs. If there are multiple occurrences of the same key, the sort has made them adjacent, so they are easy to combine without keeping much state in memory.
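To make the mapper/reducer division concrete, here is a sketch of the page-views-per-URL job from the log analysis example, written against Hadoop's Java MapReduce API. The class names, the assumption that the URL is the seventh whitespace-separated field of each log line, and the choice to emit a count of 1 per record rather than an empty value are illustrative, not taken from the book.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewsPerUrl {

    // Mapper: extract the URL (assumed to be the 7th whitespace-separated
    // field of an access log line) as the key, with a count of 1 as the value.
    public static class UrlMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 6) {
                url.set(fields[6]);   // $7 in the book's awk example is index 6 here
                context.write(url, ONE);
            }
        }
    }

    // Reducer: the framework has already sorted and grouped the pairs by URL,
    // so the reducer only needs to sum the counts for each key.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text url, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            context.write(url, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page views per URL");
        job.setJarByClass(PageViewsPerUrl.class);
        job.setMapperClass(UrlMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```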
MapReduce can parallelize a computation across many machines, without you having to write code to explicitly handle the parallelism.
The parallelization of a Hadoop MapReduce job is based on partitioning: the input is typically a directory in HDFS, and each file or file block within it can be processed by a separate map task.
The MapReduce scheduler (not shown in the diagram) tries to run each mapper on one of the machines that stores a replica of the input file, provided that the machine has enough spare RAM and CPU resources to run the map task.
This principle is known as putting the computation near the data [27]: it saves copying the input file over the network, reducing network load and increasing locality.
In most cases, the application code that should run in the map task is not yet present on the machine that is assigned the task of running it, so the MapReduce framework first copies the code (e.g., JAR files in the case of a Java program) to the appropriate machines.
To ensure that all key-value pairs with the same key end up at the same reducer, the framework uses a hash of the key to determine which reduce task should receive a particular key-value pair (see “Partitioning by Hash of Key”).
First, each map task partitions its output by reducer, based on the hash of the key. Each of these partitions is written to a sorted file on the mapper’s local disk.
The process of partitioning by reducer, sorting, and copying data partitions from mappers to reducers is known as the shuffle.
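The sketch below shows roughly what this default hash partitioning does: hash the key, clear the sign bit, and take the result modulo the number of reduce tasks, so that all pairs with the same key go to the same reducer. The class name is illustrative.

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of hash-based partitioning: the returned number selects which
// reduce task receives this key-value pair, so identical keys always map
// to the same reducer.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```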
The range of problems you can solve with a single MapReduce job is limited. Referring back to the log analysis example, a single MapReduce job could determine the number of page views per URL, but not the most popular URLs, since that requires a second round of sorting.
Thus, it is very common for MapReduce jobs to be chained together into workflows, such that the output of one job becomes the input to the next job.
To handle these dependencies between job executions, various workflow schedulers for Hadoop have been developed, including Oozie, Azkaban, Luigi, Airflow, and Pinball.
Workflows consisting of 50 to 100 MapReduce jobs are common when building recommendation systems.
Various higher-level tools for Hadoop, such as Pig [30], Hive [31], Cascading [32], Crunch [33], and FlumeJava [34], also set up workflows of multiple MapReduce stages that are automatically wired together appropriately.
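As a minimal sketch of such chaining without a workflow scheduler, a plain driver program can point the second job's input at the first job's output directory and only start it once the first job has completed successfully. All paths, job names, and the omitted mapper/reducer classes are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Bare-bones chaining of two MapReduce jobs: the second job's input
// directory is simply the first job's output directory.
public class TwoStageWorkflow {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path rawLogs     = new Path("/logs/raw");            // hypothetical input
        Path viewsPerUrl = new Path("/logs/views-per-url");  // intermediate output
        Path topUrls     = new Path("/logs/top-urls");       // final output

        Job countJob = Job.getInstance(conf, "count page views");
        // ... set mapper/reducer classes for the counting stage here ...
        FileInputFormat.addInputPath(countJob, rawLogs);
        FileOutputFormat.setOutputPath(countJob, viewsPerUrl);
        if (!countJob.waitForCompletion(true)) {
            System.exit(1);  // only proceed if stage 1 succeeded
        }

        Job rankJob = Job.getInstance(conf, "rank URLs by view count");
        // ... set mapper/reducer classes for the second round of sorting here ...
        FileInputFormat.addInputPath(rankJob, viewsPerUrl);  // consume stage 1 output
        FileOutputFormat.setOutputPath(rankJob, topUrls);
        System.exit(rankJob.waitForCompletion(true) ? 0 : 1);
    }
}
```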
In many datasets it is common for one record to have an association with another record: a foreign key in a relational model, a document reference in a document model, or an edge in a graph model.
A join is necessary whenever you have code that needs to access records on both sides of that association.
MapReduce has no concept of indexes — at least not in the usual sense.
When a MapReduce job is given a set of files as input, it reads the entire content of all of those files; a database would call this operation a full table scan.
When we talk about joins in the context of batch processing, we mean resolving all occurrences of some association within a dataset.
The simplest implementation of this join would go over the activity events one by one and query the user database (on a remote server) for every user ID it encounters.
Such an approach would most likely suffer from very poor performance: the processing throughput would be limited by the round-trip time to the database server, the effectiveness of a local cache would depend very much on the distribution of data, and running a large number of queries in parallel could easily overwhelm the database.
In order to achieve good throughput in a batch process, the computation must be (as much as possible) local to one machine.
Moreover, querying a remote database would mean that the batch job becomes nondeterministic, because the data in the remote database might change while the job is running.
Thus, a better approach would be to take a copy of the user database and put it in the same distributed filesystem as the log of user activity events.
In this join, the key would be the user ID: one set of mappers would go over the activity events (extracting the user ID as the key and the event as the value), while another set of mappers would go over the user database (extracting the user ID as the key and the user’s date of birth as the value).
The MapReduce job can even arrange the records to be sorted such that the reducer always sees the record from the user database first, followed by the activity events in timestamp order — this technique is known as a secondary sort.
The reducer performs the join logic, outputting pairs of viewed-url and viewer-age-in-years. Subsequent MapReduce jobs could then calculate the distribution of viewer ages for each URL, and cluster by age group.
Since the reducer processes all of the records for a particular user ID in one go, it only needs to keep one user record in memory at any one time, and it never needs to make any requests over the network.
This algorithm is known as a sort-merge join, since mapper output is sorted by key, and the reducers then merge together the sorted lists of records from both sides of the join.
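A sketch of this reduce-side join is shown below, assuming tab-separated input files and illustrative paths. One mapper class reads activity events and another reads the user database extract, both keyed by user ID and tagged so the reducer can tell the two kinds of record apart; the reducer then joins each event with the user's date of birth. The secondary sort itself (composite key, custom partitioner, and grouping comparator) is not implemented in this sketch, so the reducer buffers any events that happen to arrive before the user record.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserActivityJoin {

    // Activity event line assumed as: userId <TAB> url <TAB> timestamp
    public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");
            ctx.write(new Text(f[0]), new Text("E:" + f[1]));  // tag as event
        }
    }

    // User record line assumed as: userId <TAB> dateOfBirth
    public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");
            ctx.write(new Text(f[0]), new Text("U:" + f[1]));  // tag as user record
        }
    }

    // The reducer sees all records for one user ID together. Without a
    // secondary sort the user record may arrive after some events, so events
    // seen earlier are buffered until the date of birth is known.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text userId, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String dateOfBirth = null;
            List<String> pendingUrls = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("U:")) {
                    dateOfBirth = s.substring(2);
                } else if (dateOfBirth != null) {
                    ctx.write(new Text(s.substring(2)), new Text(dateOfBirth));
                } else {
                    pendingUrls.add(s.substring(2));
                }
            }
            if (dateOfBirth != null) {
                for (String url : pendingUrls) {
                    ctx.write(new Text(url), new Text(dateOfBirth));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "join events with users");
        job.setJarByClass(UserActivityJoin.class);
        MultipleInputs.addInputPath(job, new Path("/data/activity-events"),
                TextInputFormat.class, EventMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/user-db-extract"),
                TextInputFormat.class, UserMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/joined"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```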
This separation contrasts with the typical use of databases, where a request to fetch data from a database often occurs somewhere deep inside a piece of application code.
Since MapReduce handles all network communication, it also shields the application code from having to worry about partial failures: MapReduce transparently retries failed tasks without affecting the application logic.
Besides joins, another common use of the “bringing related data to the same place” pattern is grouping records by some key (as in the GROUP BY clause in SQL).
The simplest way of implementing such a grouping operation with MapReduce is to set up the mappers so that the key-value pairs they produce use the desired grouping key.
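A minimal sketch of such a grouping job follows, collecting all URLs visited in each session (assuming input lines of the form sessionId, tab, url). The field layout, class names, and the arrow-joined output format are assumptions for illustration; note that without a secondary sort the URLs arrive at the reducer in no guaranteed order.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupBySession {

    // Mapper: emit the grouping key (the session ID) so that all events
    // of one session end up at the same reducer.
    public static class SessionKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");
            ctx.write(new Text(f[0]), new Text(f[1]));  // key = session ID, value = url
        }
    }

    // Reducer: receives all URLs for one session together and joins them
    // into a single output line.
    public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text sessionId, Iterable<Text> urls, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder sequence = new StringBuilder();
            for (Text url : urls) {
                if (sequence.length() > 0) sequence.append(" -> ");
                sequence.append(url.toString());
            }
            ctx.write(sessionId, new Text(sequence.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "group events by session");
        job.setJarByClass(GroupBySession.class);
        job.setMapperClass(SessionKeyMapper.class);
        job.setReducerClass(ConcatReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```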