Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Clients maintain a long-lived session on ZooKeeper servers, and the client and server periodically exchange heartbeats to check that the other node is still alive.
If the leader fails, one of the other nodes should take over.
As nodes are removed or fail, other nodes need to take over the failed nodes’ work.
An application may initially run only on a single node, but eventually may grow to thousands of nodes.
ZooKeeper is not intended for storing the runtime state of the application, which may change thousands or even millions of times per second.
ZooKeeper, etcd, and Consul are also often used for service discovery—that is, to find out which IP address you need to connect to in order to reach a particular service.
Although service discovery does not require consensus, leader election does.
A membership service determines which nodes are currently active and live members of a cluster.
Causal consistency does not have the coordination overhead of linearizability and is much less sensitive to network problems.
If one node is going to accept a registration, it needs to somehow know that another node isn’t concurrently in the process of registering the same name.
When several clients are racing to grab a lock or lease, the lock decides which one successfully acquired it.
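As a rough in-process analogy, that "the lock decides" behavior boils down to an atomic check-and-set: whichever client's attempt lands first becomes the holder, and every other client is told the lock is taken. The sketch below is purely hypothetical (an invented in-memory `LeaseRegistry`, not ZooKeeper's actual lock recipe):

```python
import threading

class LeaseRegistry:
    """Toy linearizable register: the first writer wins (hypothetical, in-memory)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._holders = {}

    def try_acquire(self, resource, client_id):
        # Atomic check-and-set: only one client can become the holder.
        with self._lock:
            if resource not in self._holders:
                self._holders[resource] = client_id
                return True
            return False

registry = LeaseRegistry()
results = {}

def worker(client_id):
    results[client_id] = registry.try_acquire("leader-lock", client_id)

threads = [threading.Thread(target=worker, args=(f"client-{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [c for c, ok in results.items() if ok]
print(winners)  # exactly one client ends up holding the lease
```

However many clients race, the registry's single atomic decision point guarantees exactly one winner, which is the essence of why this problem is equivalent to consensus.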
When several transactions concurrently try to create conflicting records with the same key, the constraint must decide which one to allow and which should fail with a constraint violation.
Although a single-leader database can provide linearizability without executing a consensus algorithm on every write, it still requires consensus to maintain its leadership and for leadership changes.
If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, then it is advisable to use something like ZooKeeper.
Strictly speaking, ZooKeeper and etcd provide linearizable writes, but reads may be stale, since by default they can be served by any one of the replicas.
A total order that is inconsistent with causality is easy to create, but not very useful.
Consensus is allowed to decide on any value that is proposed by one of the participants.
In a large application you often need to be able to access and process data in many different ways, and there is no one database that can satisfy all those different needs simultaneously.
In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application.
A system of record, also known as source of truth, holds the authoritative version of your data.
Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way.
If you lose derived data, you can recreate it from the original source.
Most databases, storage engines, and query languages are not inherently either a system of record or a derived system.
The distinction between system of record and derived data system depends not on the tool, but on how you use it in your application.
By being clear about which data is derived from which other data, you can bring clarity to an otherwise ...
A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data.
The Ruby script keeps an in-memory hash table of URLs, where each URL is mapped to the number of times it has been seen.
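A rough Python equivalent of that Ruby script, assuming (as in the book's log-analysis example) that the requested URL is the seventh whitespace-separated field of each access-log line:

```python
from collections import Counter

def count_urls(lines):
    """Map each URL to the number of times it has been seen."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Field 6 (zero-indexed) is the URL in the assumed log format.
        if len(fields) > 6:
            counts[fields[6]] += 1
    return counts

# Illustrative log lines in a simplified nginx-style format.
log = [
    '1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 512',
    '1.2.3.4 - - [10/Oct/2000:13:55:40 -0700] "GET /about.html HTTP/1.1" 200 128',
    '5.6.7.8 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.1" 200 512',
]
for url, n in count_urls(log).most_common(5):
    print(n, url)  # prints "2 /index.html" then "1 /about.html"
```

Like the Ruby version, this keeps the whole hash table in memory, which works only while the set of distinct URLs fits in RAM; the sort-based Unix approach discussed in the book avoids that limit.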
Make each program do one thing well.
Design and build software, even operating systems, to be tried early, ideally within weeks.
The sort tool is a great example of a program that does one thing well.
If you want to be able to connect any program’s output to any program’s input, that means that all programs must use the same input/output interface.
A file is just an ordered sequence of bytes.
Although it’s not perfect, even decades later, the uniform interface of Unix is still something remarkable.
If you run a program and don’t specify anything else, stdin comes from the keyboard and stdout goes to the screen.
Pipes let you attach the stdout of one process to the stdin of another process (with a small in-memory buffer, and without writing the entire intermediate data stream to disk).
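An in-process analogy of such a pipeline: each stage below is a generator that lazily consumes the previous stage's output, much as a process reads its stdin. The stage names mimic `cat`, `sort`, and `uniq -c`, but this is a sketch of the composition idea, not the real tools:

```python
from itertools import groupby

def cat(lines):
    # Emit the input stream unchanged.
    for line in lines:
        yield line

def sort(lines):
    # Like Unix sort, this stage must buffer its whole input before emitting.
    yield from sorted(lines)

def uniq_c(lines):
    # Count runs of consecutive identical lines, like `uniq -c`.
    for line, group in groupby(lines):
        yield f"{sum(1 for _ in group)} {line}"

data = ["b", "a", "b", "a", "a"]
# Analogous to: cat data | sort | uniq -c
pipeline = uniq_c(sort(cat(data)))
print(list(pipeline))  # ['3 a', '2 b']
```

Because every stage consumes and produces the same "interface" (a stream of lines), the stages compose in any order that makes sense, which is exactly the property the uniform byte-stream interface gives Unix tools.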
Programs that need multiple inputs or outputs are possible but tricky.
MapReduce is a bit like Unix tools, but distributed across potentially thousands of machines.
As with most Unix tools, running a MapReduce job normally does not modify the input and does not have any side effects other than producing the output.
While Unix tools use stdin and stdout as input and output, MapReduce jobs read and write files on a distributed filesystem.
Shared-disk storage is implemented by a centralized storage appliance, often using custom hardware and special network infrastructure such as Fibre Channel.
HDFS consists of a daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine (assuming that every general-purpose machine in a datacenter has some disks attached to it).
A central server called the NameNode keeps track of which file blocks are stored on which machine.
In order to tolerate machine and disk failures, file blocks are replicated on multiple machines.
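A toy sketch of that NameNode-style metadata: a mapping from each file block to the machines holding its replicas. The placement policy here is invented for illustration (a real NameNode also weighs rack locality and free disk space):

```python
import hashlib

MACHINES = [f"node-{i}" for i in range(5)]
REPLICATION = 3  # each block is stored on this many distinct machines

def place_block(block_id, machines=MACHINES, replicas=REPLICATION):
    """Pick `replicas` distinct machines for a block (toy placement policy)."""
    start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % len(machines)
    return [machines[(start + i) % len(machines)] for i in range(replicas)]

# NameNode-style metadata: which machines hold each block of a file.
block_map = {f"file1#blk{i}": place_block(f"file1#blk{i}") for i in range(4)}
for block, locations in block_map.items():
    print(block, "->", locations)
```

With three replicas per block, any single machine or disk failure leaves at least two copies, and the metadata tells readers which surviving machines to contact.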
MapReduce is a programming framework with which you can write code to process large datasets in a distributed filesystem like HDFS.
The mapper is called once for every input record, and its job is to extract the key and value from the input record.
The MapReduce framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values.
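That map, shuffle, reduce structure can be sketched in a few lines of single-machine Python. This is a toy word count; the function names are illustrative, not a real framework's API:

```python
from collections import defaultdict

def map_func(record):
    # Mapper: called once per input record, emits (key, value) pairs.
    for word in record.split():
        yield (word, 1)

def reduce_func(key, values):
    # Reducer: called with a key and an iterator over all values for that key.
    yield (key, sum(values))

def mapreduce(records, mapper, reducer):
    # "Shuffle": collect all values belonging to the same key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Call the reducer once per key, in sorted key order.
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, iter(groups[key])))
    return out

result = mapreduce(["to be or not to be"], map_func, reduce_func)
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

A real framework runs the mappers and reducers on many machines and spills the shuffled data to disk, but the contract seen by user code is the same as in this sketch.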
The main difference from pipelines of Unix commands is that MapReduce can parallelize a computation across many machines, without you having to write code to explicitly handle the parallelism.
In Hadoop MapReduce, the mapper and reducer are each a Java class that implements a particular interface.
The reduce task takes the files from the mappers and merges them together, preserving the sort order.
The reducer is called with a key and an iterator that sequentially scans over all records with the same key (which may in some cases not all fit in memory).
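A small sketch of that merge-then-group step, using sorted in-memory lists to stand in for the mappers' sorted output files (the data is made up for illustration):

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each mapper's output partition is already sorted by key.
mapper1 = [("apple", 1), ("banana", 1), ("cherry", 1)]
mapper2 = [("apple", 1), ("cherry", 1), ("cherry", 1)]

# Merge the sorted runs while preserving sort order, as the reduce task does.
merged = heapq.merge(mapper1, mapper2, key=itemgetter(0))

# The reducer sees each key once, with an iterator over that key's records;
# the iterator streams, so the records for one key need not all fit in memory.
for key, records in groupby(merged, key=itemgetter(0)):
    print(key, sum(v for _, v in records))  # apple 2 / banana 1 / cherry 3
```

Because `heapq.merge` and `groupby` are both streaming, this mirrors how a reduce task can process far more data than fits in memory, as long as each input run is sorted.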