Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Various higher-level programming models (Pig, Hive, Cascading, Crunch) were created as abstractions on top of MapReduce.
This setup is reasonable if the output from the first job is a dataset that you want to publish widely within your organization.
Publishing data to a well-known location in the distributed filesystem allows loose coupling so that jobs don’t need to know who is producing their input or consuming their output.
The process of writing out this intermediate state to files is called materialization.
Pipes do not fully materialize the intermediate state, but instead stream one command’s output to the next command’s input incrementally, using only a small in-memory buffer.
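The same contrast can be sketched in ordinary Python: building a list materializes the whole intermediate result, while chaining generators streams records one at a time, the way a pipe does. This is only an analogy for the behavior described above, not how Unix implements pipes, and the file name input.txt is made up.

    def read_lines(path):
        # Stream lines one at a time instead of loading the whole file.
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")

    def to_upper(lines):
        # A second processing stage that consumes the first one incrementally.
        for line in lines:
            yield line.upper()

    # Materialized: the entire intermediate result exists before the next
    # stage can start, like MapReduce output written to HDFS.
    materialized = [line.upper() for line in read_lines("input.txt")]

    # Streamed: each stage pulls one record at a time from the previous stage,
    # keeping only a small buffer in memory, like a Unix pipe.
    for line in to_upper(read_lines("input.txt")):
        print(line)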
A MapReduce job can only start when all tasks in the preceding jobs (that generate its inputs) have completed.
Having to wait until all of the preceding job’s tasks have completed slows down the execution of the workflow as a whole.
Mappers are often redundant: they just read back the same file that was just written by a reducer.
Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.
These systems are known as dataflow engines because they explicitly model the flow of data through several processing stages.
They parallelize work by partitioning inputs, and they copy the output of one function over the network to become the input to another function.
Unlike in MapReduce, these functions need not take the strict roles of alternating map and reduce, but instead can be assembled in more flexible ways. We call these functions operators.
The dataflow engine’s scheduler can try to place the task that consumes some data on the same machine as the task that produces it, so that the data can be exchanged through a shared memory buffer rather than having to copy it over the network.
You can use dataflow engines to implement the same computations as MapReduce workflows, and they usually execute significantly faster due to the optimizations described here.
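As a rough illustration, the classic word count that needs a full map and reduce cycle in MapReduce can be written as a chain of operators in a dataflow engine such as Spark. A minimal sketch using PySpark’s RDD API, assuming a local Spark installation and made-up input and output paths:

    from pyspark import SparkContext

    sc = SparkContext("local", "wordcount-sketch")

    # Operators are chained freely; the engine decides how to connect them
    # (pipelining, shared memory buffers, or a network shuffle for reduceByKey).
    counts = (sc.textFile("hdfs:///tmp/input")         # made-up input path
                .flatMap(lambda line: line.split())    # mapper-like operator
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))      # reducer-like operator

    # The final output is still materialized to the distributed filesystem.
    counts.saveAsTextFile("hdfs:///tmp/output")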
To enable this recomputation, the framework must keep track of how a given piece of data was computed — which input partitions it used, and which operators were applied to it.
When recomputing data, it is important to know whether the computation is deterministic: that is, given the same input data, do the operators always produce the same output?
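A small sketch of why this matters: if an operator draws on an outside source of randomness, recomputing a lost partition yields different records than the ones downstream operators already consumed. One common remedy is to derive the randomness from the data itself, for example by seeding a generator with the partition number. This is an illustrative sketch, not the API of any particular framework.

    import random

    # Nondeterministic: a recomputed partition would contain different records
    # than the ones that downstream operators have already processed.
    def sample_nondeterministic(records):
        return [r for r in records if random.random() < 0.1]

    # Deterministic: seeding the generator from the partition id means a
    # recomputation produces exactly the same output as the original run.
    def sample_deterministic(records, partition_id):
        rng = random.Random(partition_id)
        return [r for r in records if rng.random() < 0.1]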
Thus, when using a dataflow engine, materialized datasets on HDFS are still usually the inputs and the final outputs of a job.
As an optimization for batch processing graphs, the bulk synchronous parallel (BSP) model of computation [70] has become popular.
It is also known as the Pregel model.
Fault tolerance in the Pregel model is achieved by periodically checkpointing the state of all vertices at the end of an iteration — i.e., writing their full state to durable storage.
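A minimal single-machine sketch of the vertex-centric, superstep-based style the Pregel model describes: in each superstep, every vertex processes the messages sent to it in the previous superstep, updates its own state, and may send messages along its outgoing edges. The graph, the source vertex, and the distance computation are made-up examples; a real implementation partitions the vertices across machines and checkpoints their state between iterations.

    # Per-vertex state: distance from vertex "A" along directed edges.
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    state = {v: float("inf") for v in graph}
    inbox = {"A": [0]}                      # initial message to the source vertex

    while inbox:                            # one loop iteration = one superstep
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < state[vertex]:        # state improved: notify neighbours
                state[vertex] = best
                for neighbour in graph[vertex]:
                    outbox.setdefault(neighbour, []).append(best + 1)
        inbox = outbox                      # checkpointing state here is what
                                            # gives the fault tolerance above

    print(state)                            # {'A': 0, 'B': 1, 'C': 1, 'D': 2}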
As a result, graph algorithms often have a lot of cross-machine communication overhead, and the intermediate state (messages sent between nodes) is often bigger than the original graph.
For this reason, if your graph can fit in memory on a single computer, it’s quite likely that a single-machine (maybe even single-threaded) algorithm will outperform a distributed batch process.
By incorporating declarative aspects in their high-level APIs, and having query optimizers that can take advantage of them during execution, batch processing frameworks begin to look more like MPP databases.
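For example, in Spark the DataFrame API is the declarative layer: you state which filters, groupings, and aggregations you want, and the query optimizer decides how to execute them (filter pushdown, column pruning, join strategies). A minimal PySpark sketch with made-up column names and paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("declarative-sketch").getOrCreate()

    # Declarative: we state *what* result we want; the optimizer chooses how.
    events = spark.read.parquet("hdfs:///tmp/events")      # made-up dataset path
    daily = (events
             .filter(events.country == "US")               # made-up column names
             .groupBy("date")
             .count())
    daily.write.parquet("hdfs:///tmp/daily_counts")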
At the same time, by having the extensibility of being able to run arbitrary code and read/write data in arbitrary formats, they retain their flexibility advantage.
Spatial algorithms such as k-nearest neighbors search for items that are close to a given item in some multi-dimensional space (a kind of similarity search).
As batch processing systems gain built-in functionality and high-level declarative operators, and as MPP databases become more programmable and flexible, the two are beginning to look more alike.
In the next chapter, we will turn to stream processing, in which the input is unbounded — that is, you still have a job, but its inputs are never-ending streams of data. In this case, a job is never complete, because at any time there may still be more work coming.
In reality, a lot of data is unbounded because it arrives gradually over time.
The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users.
In a stream processing context, a record is more commonly known as an event, but it is essentially the same thing: a small, self-contained, immutable object containing the details of something that happened at some point in time. An event usually contains a timestamp indicating when it happened according to a time-of-day clock.
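A minimal sketch of such an event in Python, using an immutable dataclass; the event type and field names are made up:

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass(frozen=True)           # frozen makes the event immutable
    class PageViewEvent:
        user_id: str
        url: str
        occurred_at: datetime         # when it happened, per a time-of-day clock

    event = PageViewEvent(
        user_id="user-123",
        url="/cart",
        occurred_at=datetime.now(timezone.utc),
    )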
However, when moving toward continual processing with low delays, polling becomes expensive if the datastore is not designed for this kind of usage. The more often you poll, the lower the percentage of requests that return new events, and thus the higher the overheads become. Instead, it is better for consumers to be notified when new events appear.
What happens if the producers send messages faster than the consumers can process them? Broadly speaking, there are three options: the system can drop messages, buffer messages in a queue, or apply backpressure (also known as flow control; i.e., blocking the producer from sending more messages).
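These three options can be sketched with an ordinary in-process bounded queue standing in for the messaging system; the names and sizes below are made up, and real systems implement this inside the broker or the transport protocol rather than in application code.

    import queue

    # Bounded in-process buffer standing in for the messaging system.
    buf = queue.Queue(maxsize=1000)

    def send_with_backpressure(message):
        buf.put(message)              # blocks the producer until the consumer frees space

    def send_or_drop(message):
        try:
            buf.put_nowait(message)   # never blocks; the message is lost if the buffer is full
        except queue.Full:
            pass

    # The third option, unbounded buffering, is queue.Queue(maxsize=0): nothing
    # blocks and nothing is dropped, but memory use grows for as long as the
    # producer outpaces the consumer (some brokers spill such a queue to disk).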
If messages are buffered in a queue, it is important to understand what happens as that queue grows. Does the system crash if the queue no longer fits in memory, or does it write messages to disk? In the latter case, how does the disk access affect the performance of the messaging system?
A number of messaging systems use direct network communication between producers and consumers without going via intermediary nodes.
UDP multicast is widely used in the financial industry for streams such as stock market feeds, where low latency is important.
Brokerless messaging libraries such as ZeroMQ and nanomsg take a similar approach, implementing publish/subscribe messaging over TCP or IP multicast.
While these direct messaging systems work well in the situations for which they are designed, they generally require the application code to be aware of the possibility of message loss.
A widely used alternative is to send messages via a message broker (also known as a message queue), which is a kind of database that is optimized for handling message streams.
By centralizing the data in the broker, these systems can more easily tolerate clients that come and go (connect, disconnect, and crash), and the question of durability is moved to the broker instead.
Faced with slow consumers, they generally allow unbounded queueing (as opposed to dropping messages or applying backpressure).
A consequence of queueing is also that consumers are generally asynchronous: when a producer sends a message, it normally only waits for the broker to confirm that it has buffered the message, and does not wait for the message to be processed by consumers.
(Note that it could happen that the message actually was fully processed, but the acknowledgment was lost in the network. Handling this case requires an atomic commit protocol.)
Even if the message broker otherwise tries to preserve the order of messages (as required by both the JMS and AMQP standards), the combination of load balancing with redelivery inevitably leads to messages being reordered. To avoid this issue, you can use a separate queue per consumer (i.e., not use the load balancing feature).
Message reordering is not a problem if messages are completely independent of each other, but it can be important if there are causal dependencies between messages, as we shall see later in the chapter.
Why can we not have a hybrid, combining the durable storage approach of databases with the low-latency notification facilities of messaging? This is the idea behind log-based message brokers.
The Unix tool tail -f, which watches a file for data being appended, essentially works like this.
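A rough Python sketch of that tail -f behavior: seek to the current end of an append-only file, then repeatedly read whatever has been appended since. The file path is made up, and a real log-based broker would also track per-consumer offsets and partition the log across machines.

    import time

    def follow(path):
        with open(path) as f:
            f.seek(0, 2)                  # jump to the current end of the file
            while True:
                line = f.readline()
                if line:
                    yield line.rstrip("\n")
                else:
                    time.sleep(0.1)       # nothing new yet; check again shortly

    for message in follow("/var/log/app/events.log"):   # made-up path
        print("consumed:", message)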
Even though these message brokers write all messages to disk, they are able to achieve throughput of millions of messages per second by partitioning across multiple machines, and fault tolerance by replicating messages.
To achieve load balancing across a group of consumers, instead of the broker assigning individual messages to consumers, the broker can assign entire partitions to nodes in the consumer group.
Each client then consumes all the messages in the partitions it has been assigned.
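A minimal consumer sketch, assuming the kafka-python client, a broker reachable at localhost:9092, and a topic and group name invented for this example. Each process that joins the same group is assigned a subset of the topic’s partitions and reads those partitions’ messages in offset order.

    from kafka import KafkaConsumer

    # Consumers sharing a group_id split the topic's partitions between them;
    # this client then reads every message in its assigned partitions.
    consumer = KafkaConsumer(
        "events",                           # made-up topic name
        bootstrap_servers="localhost:9092",
        group_id="analytics",               # made-up consumer group name
        auto_offset_reset="earliest",
    )

    for message in consumer:
        print(message.partition, message.offset, message.value)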