Kindle Notes & Highlights
Read between August 2 and December 28, 2020
various higher-level programming models (Pig, Hive, Cascading, Crunch) were created as abstractions on top of MapReduce.
This setup is reasonable if the output from the first job is a dataset that you want to publish widely within your organization.
Publishing data to a well-known location in the distributed filesystem allows loose coupling so that jobs don’t need to know who is producing their input or consuming their output
The process of writing out this intermediate state to files is called materialization.
Pipes do not fully materialize the intermediate state, but instead stream one command’s output to the next command’s input incrementally, using only a small in-memory buffer.
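To make the contrast concrete, here is a small Python sketch (the file and command names are made up): the first version materializes the intermediate result to a file before the next stage starts, while the second wires the two commands together with a pipe so the output streams to the next command through a small buffer.

    import subprocess

    # Materialized: write the first command's entire output to a file, and start
    # the second command only after the first has finished.
    with open("intermediate.txt", "w") as out:
        subprocess.run(["sort", "words.txt"], stdout=out, check=True)
    with open("intermediate.txt") as inp:
        subprocess.run(["uniq", "-c"], stdin=inp, check=True)

    # Streaming, like `sort words.txt | uniq -c`: the second command starts right
    # away and reads the first command's output incrementally from a small
    # in-kernel pipe buffer.
    p1 = subprocess.Popen(["sort", "words.txt"], stdout=subprocess.PIPE)
    p2 = subprocess.Popen(["uniq", "-c"], stdin=p1.stdout)
    p1.stdout.close()   # so p1 gets SIGPIPE if p2 exits early
    p2.wait()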
A MapReduce job can only start when all tasks in the preceding jobs (that generate its inputs) have completed,
Having to wait until all of the preceding job’s tasks have completed slows down the execution of the workflow as a whole.
Mappers are often redundant: they just read back the same file that was just written by a reducer.
Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.
these systems are known as dataflow engines.
They parallelize work by partitioning inputs, and they copy the output of one function over the network to become the input to another function.
Unlike in MapReduce, these functions need not take the strict roles of alternating map and reduce, but instead can be assembled in more flexible ways. We call these functions operators.
it can try to place the task that consumes some data on the same machine as the task that produces it, so that the data can be exchanged through a shared memory buffer rather than having to copy it over the network.
You can use dataflow engines to implement the same computations as MapReduce workflows, and they usually execute significantly faster due to the optimizations described here.
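A minimal sketch of the operator idea, using plain Python generators as stand-in operators (the file name and record format are invented for illustration): each stage pulls records from the previous one incrementally, so no intermediate dataset needs to be written out between stages.

    # Stand-in "operators" built from Python generators: each stage pulls records
    # from the previous one, so nothing is materialized between stages.
    def read_source(path):
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")

    def parse(records):
        for record in records:
            user, url = record.split(",", 1)    # assumes made-up "user,url" lines
            yield user, url

    def drop_anonymous(pairs):
        for user, url in pairs:
            if user != "anonymous":
                yield user, url

    # Assemble the operators in whatever order the computation needs; records flow
    # through the chain one at a time when the final consumer asks for them.
    pipeline = drop_anonymous(parse(read_source("clicks.csv")))
    for user, url in pipeline:
        print(user, url)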
To enable this recomputation, the framework must keep track of how a given piece of data was computed — which input partitions it used, and which operators were applied to it.
When recomputing data, it is important to know whether the computation is deterministic: given the same input, do the operators always produce the same output?
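Spark's RDD API is one example of such lineage tracking. The sketch below, assuming a local PySpark installation and a made-up log file, builds a chain of deterministic transformations and prints the lineage that Spark could use to recompute a lost partition.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lineage-demo")

    # Each transformation records how its output is derived from its inputs.
    # Because these lambdas are deterministic, any lost partition can be rebuilt
    # by replaying the same operators over the same input partitions.
    lines = sc.textFile("access.log")
    errors = lines.filter(lambda l: "ERROR" in l)
    counts = errors.map(lambda l: (l.split()[0], 1)).reduceByKey(lambda a, b: a + b)

    # toDebugString() shows the chain of operators (the lineage) behind the RDD.
    print(counts.toDebugString())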
Thus, when using a dataflow engine, materialized datasets on HDFS are still usually the inputs and the final outputs of a job.
As an optimization for batch processing graphs, the bulk synchronous parallel (BSP) model of computation [70] has become popular.
It is also known as the Pregel model.
Fault tolerance in Pregel is achieved by periodically checkpointing the state of all vertices at the end of an iteration — i.e., writing their full state to durable storage.
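The following is a toy, single-process simulation of the vertex-centric superstep model, computing single-source shortest paths on a made-up graph. It is only meant to show the message-passing and superstep-barrier structure, not a real distributed implementation.

    # In each superstep, every vertex processes the messages sent to it in the
    # previous superstep and may send new messages to its neighbors; the run
    # stops when no messages are in flight.
    edges = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 5}, "C": {"D": 1}, "D": {}}
    distance = {v: float("inf") for v in edges}
    inbox = {"A": [0]}                      # superstep 0: the source receives distance 0

    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:     # vertex state improved, so propagate
                distance[vertex] = best
                for neighbor, weight in edges[vertex].items():
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox                      # barrier between supersteps

    print(distance)                         # {'A': 0, 'B': 1, 'C': 3, 'D': 4}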
As a result, graph algorithms often have a lot of cross-machine communication overhead, and the intermediate state (messages sent between nodes) is often bigger than the original graph.
For this reason, if your graph can fit in memory on a single computer, it’s quite likely that a single-machine (maybe even single-threaded) algorithm will outperform a distributed batch process
By incorporating declarative aspects in their high-level APIs, and having query optimizers that can take advantage of them during execution, batch processing frameworks begin to look more like MPP databases. At the same time, by having the extensibility of being able to run arbitrary code and read/write data in arbitrary formats, they retain their flexibility advantage.
as k-nearest neighbors
As batch processing systems gain built-in functionality and high-level declarative operators, and as MPP databases become more programmable and flexible, the two are beginning to look more alike.
In the next chapter, we will turn to stream processing, in which the input is unbounded — that is, you still have a job, but its inputs are never-ending streams of data. In this case, a job is never complete, because at any time there may still be more work coming
In reality, a lot of data is unbounded because it arrives gradually over time:
The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users.
In a stream processing context, a record is more commonly known as an event, but it is essentially the same thing: a small, self-contained, immutable object containing the details of something that happened at some point in time. An event usually contains a timestamp indicating when it happened according to a time-of-day clock.
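For illustration, an event might be represented like this (the field names are invented):

    from dataclasses import dataclass, asdict
    import json, time

    # Illustrative event record: small, self-contained, immutable, and stamped with
    # the time-of-day clock of the machine that produced it.
    @dataclass(frozen=True)
    class PageViewEvent:
        user_id: str
        url: str
        timestamp: float    # seconds since the Unix epoch

    event = PageViewEvent(user_id="user_42", url="/cart", timestamp=time.time())
    encoded = json.dumps(asdict(event))    # e.g. encoded as JSON text for transport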
However, when moving toward continual processing with low delays, polling becomes expensive if the datastore is not designed for this kind of usage. The more often you poll, the lower the percentage of requests that return new events, and thus the higher the overheads become. Instead, it is better for consumers to be notified when new events appear.
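A sketch of the polling approach and its overhead (the datastore here is just an in-memory list standing in for a real table):

    import time

    # Stand-in for a table of events; each entry is an (id, payload) pair.
    datastore = [(1, "signup"), (2, "page_view")]

    def fetch_events_since(last_id):
        return [e for e in datastore if e[0] > last_id]

    last_seen = 0
    for _ in range(5):                      # a few polling rounds for illustration
        for event_id, payload in fetch_events_since(last_seen):
            print("processing", payload)
            last_seen = event_id
        time.sleep(0.5)                     # shorter interval: lower latency, but more empty queries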
What happens if the producers send messages faster than the consumers can process them? Broadly speaking, there are three options: the system can drop messages, buffer messages in a queue, or apply backpressure (also known as flow control; i.e., blocking the producer from sending more messages).
If messages are buffered in a queue, it is important to understand what happens as that queue grows. Does the system crash if the queue no longer fits in memory, or does it write messages to disk? In the latter case, how does the disk access affect the performance of the messaging system?
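All three options can be illustrated with a single bounded in-memory queue; the queue size below is arbitrary:

    import queue

    buffer = queue.Queue(maxsize=1000)      # bounded in-memory buffer (size is arbitrary)

    def send_with_backpressure(msg):
        buffer.put(msg, block=True)         # flow control: block the producer until space frees up

    def send_or_drop(msg):
        try:
            buffer.put_nowait(msg)          # buffer the message if there is room...
        except queue.Full:
            pass                            # ...otherwise drop it

    # The remaining option, spilling the queue to disk once memory runs out, is the
    # approach that log-based brokers later take to its logical conclusion.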
A number of messaging systems use direct network communication
UDP multicast is widely used in the financial industry for streams such as stock market feeds, where low latency is important
implementing publish/subscribe messaging over TCP or IP multicast.
These direct messaging systems work well in the situations for which they are designed, but they generally require the application code to be aware of the possibility of message loss.
message broker (also known as a message queue),
and the question of durability is moved to the broker instead.
they generally allow unbounded queueing
A consequence of queueing is also that consumers are generally asynchronous:
(Note that it could happen that the message actually was fully processed, but the acknowledgment was lost in the network. Handling this case requires an atomic commit protocol.)
Even if the broker otherwise tries to preserve the order of messages (as required by both the JMS and AMQP standards), the combination of load balancing with redelivery inevitably leads to messages being reordered. To avoid this issue, you can use a separate queue per consumer. Reordering is not a problem if messages are completely independent of each other, but it can be important if there are causal dependencies between messages, as we shall see later in the chapter.
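As one concrete example (the queue name and processing logic are placeholders), a consumer using RabbitMQ's Python client pika might acknowledge each message only after processing it, so that an unacknowledged message is redelivered to another consumer:

    import pika

    # Queue name and processing logic are placeholders. If this process crashes
    # before calling basic_ack, the broker redelivers the message to another
    # consumer, which is the redelivery behavior discussed above.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="task_queue", durable=True)

    def on_message(ch, method, properties, body):
        print("processing", body)
        ch.basic_ack(delivery_tag=method.delivery_tag)   # acknowledge only after processing

    channel.basic_qos(prefetch_count=1)     # don't give a consumer more than one unacked message
    channel.basic_consume(queue="task_queue", on_message_callback=on_message)
    channel.start_consuming()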
Why can we not have a hybrid, combining the durable storage approach of databases with the low-latency notification facilities of messaging? This is the idea behind log-based message brokers.
The Unix tool tail -f, which watches a file for data being appended, essentially works like this.
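A rough Python equivalent of tail -f, just to make the "follow the end of a log" behavior explicit (the file name is made up):

    import time

    # Seek to the end of the file, then keep yielding whatever gets appended,
    # sleeping briefly whenever there is nothing new to read.
    def follow(path):
        with open(path) as f:
            f.seek(0, 2)                    # 2 = os.SEEK_END
            while True:
                line = f.readline()
                if line:
                    yield line.rstrip("\n")
                else:
                    time.sleep(0.1)

    for record in follow("events.log"):
        print(record)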
Even though these message brokers write all messages to disk, they are able to achieve throughput of millions of messages per second by partitioning across multiple machines, and fault tolerance by replicating messages
To achieve load balancing across a group of consumers, the broker can assign entire partitions to consumers in the group; each client then consumes all the messages in the partitions it has been assigned.
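A sketch of what this looks like from the client side, using the kafka-python library; the topic, group, and broker address are invented for illustration:

    from kafka import KafkaConsumer

    # All consumers sharing the same group_id form a consumer group; the broker
    # assigns each partition of the topic to one consumer in the group, and that
    # consumer reads every message in its assigned partitions, in order, tracking
    # its position as an offset per partition.
    consumer = KafkaConsumer(
        "click_events",
        bootstrap_servers="localhost:9092",
        group_id="analytics",
        auto_offset_reset="earliest",
    )

    for message in consumer:
        print(message.partition, message.offset, message.value)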