Kindle Notes & Highlights
Read between August 2 and December 28, 2020
Ensuring that a username is unique and rejecting concurrent registrations for the same username.
With some digging, it turns out that a wide range of problems are actually reducible to consensus and are equivalent to each other:
Linearizable compare-and-set registers
Atomic transaction commit
Total order broadcast
Locks and leases
Membership/coordination service
Uniqueness constraint
Use an algorithm to automatically choose a new leader. This approach requires a consensus algorithm, and it is advisable to use a proven algorithm that correctly handles adverse network conditions.
Although a single-leader database can provide linearizability without executing a consensus algorithm on every write, it still requires consensus to maintain its leadership and for leadership changes.
Tools like ZooKeeper play an important role in providing an “outsourced” consensus, failure detection, and membership service that applications can use.
If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, then it is advisable to use something like ZooKeeper.
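ZooKeeper's ephemeral znodes are the usual building block here: a node that creates an ephemeral znode effectively holds a lease on it until its session ends. A minimal sketch using the zkCli shell that ships with ZooKeeper (server address, path, and data are made-up examples, not from the book):

    # connect to a ZooKeeper ensemble (address is an example)
    zkCli.sh -server zk1.example.com:2181

    # inside the zkCli prompt: try to take the leader role by creating
    # an ephemeral znode (-e); this fails if another live session
    # already holds /app/leader
    create -e /app/leader "worker-42"

    # read back the current leader; when the owning session dies,
    # ZooKeeper deletes the ephemeral znode and another node can take over
    get /app/leader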
Nevertheless, not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus.
Applications thus commonly use a combination of several different datastores, indexes, caches, analytics systems, etc. and implement mechanisms for moving data from one store to another.
On a high level, systems that store and process data can be grouped into two broad categories:
Systems of record
A system of record, also known as the source of truth, holds the authoritative version of the data.
When new data comes in, e.g., as user input, it is first written here.
Derived data systems
A classic example is a cache: its contents are derived from an underlying system of record and can be rebuilt if lost.
Technically speaking, derived data is redundant, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly denormalized.
Services (online systems)
Response time is usually the primary measure of performance of a service, and availability is often very important.
Batch processing systems (offline systems)
The primary performance measure of a batch job is usually throughput.
Stream processing systems (near-real-time systems)
However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data.
MapReduce, a batch processing algorithm
It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.
Although a command line like the one sketched below likely looks a bit obscure if you're unfamiliar with Unix tools, it is incredibly powerful: it will process gigabytes of log files in a matter of seconds.
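The command line this highlight refers to isn't captured in these notes. A pipeline of the kind the book analyzes, assuming nginx-style access logs in which the requested URL is the seventh whitespace-separated field, might look like this:

    cat /var/log/nginx/access.log |   # read the log (path is an example)
      awk '{print $7}' |              # extract the requested URL from each line
      sort |                          # bring identical URLs next to each other
      uniq -c |                       # collapse repeats, prefixing each URL with its count
      sort -r -n |                    # sort by that count, highest first
      head -n 5                       # keep only the five most requested URLs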
Sorting versus in-memory aggregation
The Unix pipeline example does not build an in-memory hash table of counts, but instead relies on sorting a list of URLs in which multiple occurrences of the same URL are simply repeated.
On the other hand, if the job’s working set is larger than the available memory, the sorting approach has the advantage that it can make efficient use of disks.
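For contrast, the in-memory alternative can be sketched with awk's associative arrays (field number and filename are assumptions carried over from the pipeline sketch above); here the whole table of URL counts must fit in memory:

    # build an in-memory hash table of URL -> count, then print
    # "count url" pairs and keep the five largest
    awk '{counts[$7]++} END {for (url in counts) print counts[url], url}' access.log |
      sort -r -n | head -n 5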
Mergesort has sequential access patterns that perform well on disks.
The sort utility in GNU Coreutils (Linux) automatically handles larger-than-memory datasets by spilling to disk, and automatically parallelizes sorting across multiple CPU cores.
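GNU sort exposes knobs for exactly this; a sketch with arbitrary example values:

    # sort a file much larger than RAM: cap the in-memory buffer (-S),
    # write temporary sorted runs to a scratch directory (-T), and
    # merge using 8 threads (--parallel)
    sort -S 1G -T /mnt/scratch --parallel=8 urls.txt > urls.sorted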
Doug McIlroy, the inventor of Unix pipes, first described them like this in 1964 [11]: “We should have some ways of connecting programs like [a] garden hose — screw in another segment when it becomes necessary to massage data in another way. This is the way of I/O also.”
This idea of connecting programs with pipes became part of what is now known as the Unix philosophy:
Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.
Expect the output of every program to become the input to another, as yet unknown, program.
Design and build software, even operating systems, to be tried early, ideally within weeks.
Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them.
This approach — automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking down large projects into manageable chunks — sounds remarkably like the Agile and DevOps movements of today.
The sort utility is arguably a better sorting implementation than most programming languages have in their standard libraries.
If you expect the output of one program to become the input to another program, that means those programs must use the same data format — in other words, a compatible interface.
In Unix, that interface is a file, conventionally treated as a list of records separated by the \n (newline) character.
Another characteristic feature of Unix tools is their use of standard input (stdin) and standard output (stdout).
The program doesn’t know or care where the input is coming from and where the output is going to.
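A small illustration of that decoupling (filenames are made up): the same awk program runs unchanged whether its input and output are wired to files, pipes, or the terminal.

    awk '{print $7}' < access.log > urls.txt      # input redirected from a file, output to a file
    cat access.log | awk '{print $7}' | sort -u   # input from a pipe, output into another pipe
    awk '{print $7}'                              # no redirection: stdin is the keyboard, stdout the terminal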
The input files to Unix commands are normally treated as immutable. This means you can run the commands as often as you want, trying various command-line options, without damaging the input files.
You can end the pipeline at any point, pipe the output into less, and check that it has the expected form.
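For example (filenames again made up), you can stop after any stage and inspect the intermediate result without any risk to the original log file:

    # run only the first stages of the pipeline and page through the output
    cat /var/log/nginx/access.log | awk '{print $7}' | sort | less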