Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
a username is unique and rejecting concurrent registrations for the same username.
With some digging, it turns out that a wide range of problems are actually reducible to consensus and are equivalent to each other
Linearizable compare-and-set registers
Atomic transaction commit
Total order broadcast
Locks and leases
Membership/coordination service
Uniqueness constraint
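The highlights above list several primitives that are equivalent to consensus. As a rough, single-process sketch (not from the book), here is how a username uniqueness constraint could be expressed on top of a linearizable compare-and-set register. The AtomicRegister class and register_username function are hypothetical stand-ins; a local lock simulates the linearizability that a consensus service would provide in a distributed setting.

```python
import threading
from typing import Dict, Optional


class AtomicRegister:
    """Hypothetical stand-in for a linearizable compare-and-set register.

    In a real system, linearizability would be provided by a consensus
    service; here a lock simulates it within a single process.
    """

    def __init__(self) -> None:
        self._value: Optional[str] = None
        self._lock = threading.Lock()

    def compare_and_set(self, expected: Optional[str], new: str) -> bool:
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


registers: Dict[str, AtomicRegister] = {}


def register_username(username: str, user_id: str) -> bool:
    """Uniqueness constraint: of all concurrent registrations for the same
    username, only one compare-and-set from None can succeed."""
    reg = registers.setdefault(username, AtomicRegister())
    return reg.compare_and_set(expected=None, new=user_id)
```

Under concurrent calls for the same username, exactly one caller sees True; every other caller finds the register already set and is rejected.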
Use an algorithm to automatically choose a new leader. This approach requires a consensus algorithm, and it is advisable to use a proven algorithm that correctly handles adverse
Although a single-leader database can provide linearizability without executing a consensus algorithm on every write, it still requires consensus to maintain its leadership and for leadership changes.
Tools like ZooKeeper play an important role in providing an “outsourced” consensus, failure detection, and membership service that applications can use.
If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, then it is advisable to use something like ZooKeeper.
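As one possible illustration of such "outsourced" coordination, the sketch below uses the kazoo Python client for ZooKeeper to acquire a distributed lock, one of the consensus-reducible primitives listed above. The connection string, lock path, and identifier are placeholder assumptions, and a running ZooKeeper ensemble plus the kazoo package are presumed.

```python
from kazoo.client import KazooClient

# Assumed connection string; point this at your own ZooKeeper ensemble.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# A distributed lock backed by ZooKeeper's consensus protocol.
lock = zk.Lock("/app/locks/username-service", identifier="worker-1")

with lock:  # blocks until this client holds the lock
    # Critical section: this process is the exclusive holder here,
    # e.g. it can safely perform a check-then-register step.
    pass

zk.stop()
```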
Nevertheless, not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus.
Applications thus commonly use a combination of several different datastores, indexes, caches, analytics systems, etc. and implement mechanisms for moving data from one store to another.
On a high level, systems that store and process data can be grouped into two broad categories:
Systems of record
as source of truth,
When new data comes in, e.g., as user input, it is first written here.
Derived data systems
A classic example is a cache:
Technically speaking, derived data is redundant, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly denormalized.
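A minimal sketch of that relationship, assuming a hypothetical user-profile store as the system of record: the cache holds a redundant, denormalized view that can always be discarded and rebuilt from the authoritative data.

```python
# Hypothetical system of record: the authoritative copy of each user profile.
system_of_record = {
    1: {"name": "Alice", "city": "Oslo"},
}

# Derived data: a cache holding a redundant, denormalized view of the record.
cache = {}


def get_display_name(user_id: int) -> str:
    """Read-through cache: serve the derived value if present,
    otherwise recompute it from the system of record."""
    if user_id not in cache:
        record = system_of_record[user_id]
        cache[user_id] = f'{record["name"]} ({record["city"]})'
    return cache[user_id]
```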
Services (online systems)
Response time is usually the primary measure of performance of a service, and availability is often very important
Batch processing systems (offline systems)
The primary performance measure of a batch job is usually throughput
Stream processing systems (near-real-time systems)
However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data.
MapReduce, a batch processing algorithm
It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.
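For intuition, here is a single-process Python sketch of the map / shuffle / reduce dataflow (word counting over in-memory documents). It illustrates the pattern only; it is not the Hadoop API or the book's example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(documents: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: bring all values for the same key together, as the
    framework would do between the map and reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into one result, here a sum."""
    return {word: sum(ones) for word, ones in grouped.items()}


docs = ["the cat sat", "the cat ran"]
print(reducer(shuffle(mapper(docs))))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```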
Although the preceding command line likely looks a bit obscure if you’re unfamiliar with Unix tools, it is incredibly powerful. It will process gigabytes of log files in a matter of seconds,
Sorting versus in-memory aggregation
The Unix pipeline example does not have such a hash table, but instead relies on sorting a list of URLs in which multiple occurrences of the same URL are simply repeated.
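The alternative the book contrasts this with is an in-memory aggregation: keep a hash table keyed by URL and increment a counter per request. A sketch of that approach in Python is below; the access.log filename and the URL being the seventh whitespace-separated field are assumptions about the log format, made only for illustration.

```python
from collections import Counter

counts = Counter()  # in-memory hash table: URL -> number of occurrences

# Assumed log format: whitespace-separated fields with the URL in field 7.
with open("access.log") as f:
    for line in f:
        fields = line.split()
        if len(fields) > 6:
            counts[fields[6]] += 1

# The five most requested URLs, matching the tail of the sorting pipeline.
for url, n in counts.most_common(5):
    print(n, url)
```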
On the other hand, if the job’s working set is larger than the available memory, the sorting approach has the advantage that it can make efficient use of disks.
Mergesort has sequential access patterns that perform well on disks.
The sort utility in GNU Coreutils (Linux) automatically handles larger-than-memory datasets by spilling to disk, and automatically parallelizes sorting across multiple CPU cores
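For contrast with the hash-table version above, here is a sketch of the sort-based approach: once the keys are sorted, repeated occurrences of the same URL sit next to each other, so a single sequential scan can count them. In this sketch sorted() stands in for the external, disk-spilling sort that a tool like GNU sort would perform on a larger-than-memory dataset.

```python
from itertools import groupby

# Tiny in-memory example; in practice the sort step is what gets replaced
# by an external, disk-spilling sort when the data does not fit in memory.
urls = ["/home", "/about", "/home", "/home", "/contact", "/about"]

counts = []
for url, run in groupby(sorted(urls)):
    # After sorting, all occurrences of the same URL are adjacent,
    # so one sequential pass is enough to count each run.
    counts.append((sum(1 for _ in run), url))

# Top requests, analogous to counting repeated lines and taking the head.
for n, url in sorted(counts, reverse=True)[:5]:
    print(n, url)
```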
Doug McIlroy, the inventor of Unix pipes, first described them like this in 1964 [11]: “We should have some ways of connecting programs like [a] garden hose — screw in another segment when it becomes necessary to massage data in another way. This is the way of I/O also.”
connecting programs with pipes became part of what is now known as the Unix philosophy
Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.
Expect the output of every program to become the input to another, as yet unknown, program.
Design and build software, even operating systems, to be tried early, ideally within weeks.
Use tools in preference to unskilled help to lighten a programming task,
This approach — automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking down large projects into manageable chunks —
It is arguably a better sorting implementation than most programming languages have in their standard libraries
If you expect the output of one program to become the input to another program, that means those programs must use the same data format — in other words, a compatible interface.
In Unix, that interface is a file
of records separated by the \n
Another characteristic feature of Unix tools is their use of standard input (stdin) and standard output (stdout).
program doesn’t know or care where the input is coming from and where the output is going to.
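In the same spirit, a uniform-interface filter in Python might look like the sketch below: it reads newline-separated records from stdin and writes results to stdout, without knowing whether either end is a file, a pipe, or a terminal. Extracting the seventh field (e.g. the URL in an access log line) is an assumption chosen for illustration.

```python
import sys

# A Unix-style filter: read newline-separated records from stdin and write
# transformed records to stdout, without knowing or caring where the input
# comes from or where the output goes.
for line in sys.stdin:
    fields = line.rstrip("\n").split()
    if len(fields) > 6:
        # Emit only the seventh field of each record.
        print(fields[6])
```

Saved under a hypothetical name such as extract_url.py, it can slot into a pipeline between other tools exactly as awk or sort would.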
The input files to Unix commands are normally treated as immutable. This means you can run the commands as often as you want,
You can end the pipeline at any point, pipe the output into less,