Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
…to inspect is great for debugging.
This allows you to restart the later stage without rerunning the entire pipeline.
However, the biggest limitation of Unix tools is that they run only on a single machine — and that’s where tools like Hadoop come in.
MapReduce is a bit like Unix tools, but distributed across potentially thousands of machines.
…running a MapReduce job normally does not modify the input and does not have any side effects other than producing the output.
The output files are written once, in a sequential fashion (not modifying any existing part of a file once it has been written).
MapReduce jobs read and write files on a distributed filesystem.
In Hadoop’s implementation of MapReduce, that filesystem is called HDFS (Hadoop Distributed File System), an open source reimplementation of the Google File System (GFS).
Object storage services such as Amazon S3, …
HDFS consists of a daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine.
A central server called the NameNode keeps track of which file blocks are stored on which machine.
In order to tolerate machine and disk failures, file blocks are replicated on multiple machines.
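The following is a minimal sketch, in Python, of the bookkeeping such a metadata service performs: a mapping from each block to the machines holding its replicas. The class and method names, and the fixed replication factor, are illustrative assumptions, not HDFS internals.

```python
import random

REPLICATION_FACTOR = 3  # HDFS's default; purely illustrative here

class NameNodeSketch:
    """Toy metadata service: tracks which machines hold each file block."""

    def __init__(self, machines):
        self.machines = machines
        self.block_locations = {}  # block_id -> machines holding a replica

    def place_block(self, block_id):
        # Pick distinct machines, so that no single machine or disk
        # failure can destroy every replica of the block.
        replicas = random.sample(self.machines, REPLICATION_FACTOR)
        self.block_locations[block_id] = replicas
        return replicas

    def locate(self, block_id):
        # A client asks where it can read a block from.
        return self.block_locations[block_id]

nn = NameNodeSketch(machines=[f"node{i}" for i in range(10)])
nn.place_block("blk_0001")
print(nn.locate("blk_0001"))  # e.g. ['node3', 'node7', 'node1']
```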
…the mapper function to extract a key and value from each input record.
…key, and leaves the value empty.
…reducer function to iterate over the sorted key-value pairs. If there are multiple occurrences of the same key, …
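To make the division of labor between the two callbacks concrete, here is a minimal local simulation in Python of the log-analysis example: the mapper emits the URL as the key with an empty value, and the reducer counts the occurrences of each key. The log format (URL in field 7, awk's $7) and all function names are assumptions for illustration; a real Hadoop job would implement these callbacks through the Java API or Hadoop Streaming.

```python
from itertools import groupby
from operator import itemgetter

def mapper(log_line):
    # Extract the requested URL (field 7 of the common web-server log
    # format; an assumption about the input) and use it as the key,
    # leaving the value empty.
    yield (log_line.split()[6], "")

def reducer(key, values):
    # Receives all values for one key; counting them gives views per URL.
    yield (key, sum(1 for _ in values))

def run_job(log_lines):
    # Simulates the framework locally: map, sort by key, then group
    # adjacent pairs with equal keys and hand each group to the reducer.
    pairs = sorted((kv for line in log_lines for kv in mapper(line)),
                   key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(key, (v for _, v in group))
```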
MapReduce can parallelize a computation across many machines, without you having to write code to explicitly handle the parallelism.
…the dataflow in a Hadoop MapReduce job. Its parallelization is based on partitioning (see …).
The MapReduce scheduler (not shown in the diagram) tries to run each mapper on one of the machines that stores a replica of the input file, …
This principle is known as putting the computation near the data [27]: it saves copying the input file over the network, reducing network load and increasing locality.
In most cases, the application code that should run in the map task is not yet present on the machine that is assigned the task of running it, so the MapReduce framework first copies the code (e.g., JAR files in the case of a Java program) to the appropriate machines.
To ensure that all key-value pairs with the same key end up at the same reducer, the framework uses a hash of the key to determine which reduce task should receive a particular key-value pair (see “Partitioning by Hash of Key”).
First, each map task partitions its output by reducer, based on the hash of the key. Each of these partitions is written to a sorted file on the mapper’s local disk, …
The process of partitioning by reducer, sorting, and copying data partitions from mappers to reducers is known as the shuffle.
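A sketch of both steps, with invented helper names: a deterministic hash routes each key to a reduce task, and every map task writes one sorted partition per reducer. Real Hadoop does this with Partitioner classes and on-disk sorted files rather than in-memory lists.

```python
import hashlib

NUM_REDUCERS = 4  # illustrative

def partition(key):
    # Deterministic hash of the key, modulo the number of reduce tasks,
    # so that every pair with the same key goes to the same reducer.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_REDUCERS

def shuffle_on_mapper(mapper_output):
    # Each map task splits its output into one sorted run per reducer;
    # the framework then copies partition i from every mapper to
    # reducer i, which merges the sorted runs.
    partitions = {r: [] for r in range(NUM_REDUCERS)}
    for key, value in mapper_output:
        partitions[partition(key)].append((key, value))
    return {r: sorted(pairs) for r, pairs in partitions.items()}

out = shuffle_on_mapper([("/home", ""), ("/about", ""), ("/home", "")])
# Both ("/home", "") pairs land in the same partition, sorted adjacently.
```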
The range of problems you can solve with a single MapReduce job is limited. Referring back to the log analysis example, a single MapReduce job could determine the number of page views per URL, but not the most popular URLs, since that requires a second round of sorting.
Thus, it is very common for MapReduce jobs to be chained together into workflows, such that the output of one job becomes the input to the next job.
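As an illustration of such a chain, here is a local simulation of the earlier example: a first job counts page views per URL, and a second job re-keys by the count so that sorting by key yields the most popular URLs (the second round of sorting mentioned above). The log format and function names are assumed.

```python
from collections import Counter

def job1_count_per_url(log_lines):
    # First job: key = URL; the reduce output is views per URL.
    urls = [line.split()[6] for line in log_lines]  # assumed log format
    return Counter(urls)  # stands in for map -> shuffle -> reduce

def job2_rank_urls(counts):
    # Second job: re-key by the count, so sorting by key (which the
    # framework does anyway) yields URLs in order of popularity.
    return sorted(((n, url) for url, n in counts.items()), reverse=True)

logs = [
    'a - - [10/Oct/2000:13:55:36 -0700] "GET /home HTTP/1.1" 200 10',
    'a - - [10/Oct/2000:13:55:37 -0700] "GET /home HTTP/1.1" 200 10',
    'a - - [10/Oct/2000:13:55:38 -0700] "GET /about HTTP/1.1" 200 10',
]
print(job2_rank_urls(job1_count_per_url(logs)))  # [(2, '/home'), (1, '/about')]
```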
To handle these dependencies between job executions, various workflow schedulers for Hadoop have been developed, including Oozie, Azkaban, Luigi, Airflow, and Pinball.
Workflows consisting of 50 to 100 MapReduce jobs are common when building recommendation systems.
Various higher-level tools for Hadoop, such as Pig [30], Hive [31], Cascading [32], Crunch [33], and FlumeJava [34], also set up workflows of multiple MapReduce stages that are automatically wired together appropriately.
In many datasets it is common for one record to have an association with another record: a foreign key …
A join is necessary …
MapReduce has no concept of indexes — at least not in the usual sense.
When a MapReduce job is given a set of files as input, it reads the entire content of all of those files; a database would call this operation a full table scan.
When we talk about joins in the context of batch processing, we mean resolving all occurrences of some association within a dataset.
The simplest implementation of this join would go over the activity events one by one and query the user database (on a remote server) for every user ID it encounters.
…from very poor performance:
…limited by the round-trip time to the database server, the effectiveness of a local cache would depend very much on the distribution of data, and running a large number of queries …
In order to achieve good throughput in a batch process, the computation must be (as much as possible) local to one machine.
Moreover, querying a remote database would mean that the batch job becomes nondeterministic, because the data in the remote database might change while the job is running.
Thus, a better approach would be to take a copy of the user database …
…key would be the user ID:
The MapReduce job can even arrange the records to be sorted such that the reducer always sees the record from the user database first, followed by the activity events in timestamp order — this technique is known as a secondary sort.
…outputting pairs of viewed-url and viewer-age-in-years. Subsequent MapReduce jobs could then calculate the distribution of viewer ages for each URL, and cluster by age group.
Since the reducer processes all of the records for a particular user ID in one go, it only needs to keep one user record in memory at any one time, and it never needs to make any requests over the network.
…algorithm is known as a sort-merge join.
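Below is a local sketch of this reduce-side join, with invented record shapes: the mapper tags each record so that, within one user ID, the user-database record sorts ahead of the activity events and the events sort by timestamp (the secondary sort); the reducer then keeps a single user record in memory while it streams the events, emitting the date of birth where the book derives an age.

```python
from itertools import groupby
from operator import itemgetter

# Assumed record shapes: a user-database extract plus activity events.
users = [{"user_id": 2, "date_of_birth": "1990-04-01"},
         {"user_id": 1, "date_of_birth": "1985-12-20"}]
events = [{"user_id": 1, "url": "/home", "ts": 2},
          {"user_id": 1, "url": "/about", "ts": 1},
          {"user_id": 2, "url": "/home", "ts": 5}]

def mapper_output():
    # Composite key (user_id, tag, ts): tag 0 for the user record and
    # 1 for events, so the sort puts the user record first within each
    # user ID, and the events after it in timestamp order.
    for u in users:
        yield ((u["user_id"], 0, 0), u)
    for e in events:
        yield ((e["user_id"], 1, e["ts"]), e)

def reduce_side_join():
    pairs = sorted(mapper_output(), key=itemgetter(0))
    for user_id, group in groupby(pairs, key=lambda kv: kv[0][0]):
        records = (r for _, r in group)
        user = next(records)        # the single user record, kept in memory
        for event in records:       # then that user's events, streamed
            yield (event["url"], user["date_of_birth"])

print(list(reduce_side_join()))
```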
This separation contrasts with the typical use of databases, where a request to fetch data from a database often occurs somewhere deep inside a piece of application code.
Since MapReduce handles all network communication, it also shields the application code from having to worry about partial failures, …
MapReduce transparently retries failed tasks without affecting the application logic.
Besides joins, another common use of the “bringing related data to the same place” pattern is grouping records by some key (as in the GROUP BY clause in SQL).
The simplest way of implementing such a grouping operation with MapReduce is to set up the mappers so that the key-value pairs they produce use the desired grouping key.
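For instance, a minimal local sketch, assuming records with a hypothetical session_id field: the mapper emits the grouping key, and the sort-and-shuffle (simulated here by sorting a list) brings all records of a group together for the reducer.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records to group, as in SQL's GROUP BY session_id.
records = [{"session_id": "s1", "action": "view"},
           {"session_id": "s2", "action": "view"},
           {"session_id": "s1", "action": "buy"}]

# Mapper: the key-value pairs it produces use the desired grouping key.
pairs = [(r["session_id"], r) for r in records]

# The framework's sort and shuffle (simulated by a local sort) make
# equal keys adjacent, so the reducer sees a whole group in one go.
pairs.sort(key=itemgetter(0))
for key, group in groupby(pairs, key=itemgetter(0)):
    print(key, [r["action"] for _, r in group])
# s1 ['view', 'buy']
# s2 ['view']
```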