More on this book
Community
Kindle Notes & Highlights
Read between
October 21 - November 26, 2024
Any node whose clock drifts too far from the others should be declared dead and removed from the cluster.
Logical clocks do not measure the time of day or the number of seconds elapsed, only the relative ordering of events (whether one event happened before or after another).
If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.
It allows read-only transactions to see the database in a consistent state at a particular point in time, without locking and interfering with read-write transactions.
The most common implementation of snapshot isolation requires a monotonically increasing transaction ID.
In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction.
How does a node know that it is still leader (that it hasn’t been declared dead by the others), and that it may safely accept writes?
In order to remain leader, the node must periodically renew the lease before it expires.
Many programming language runtimes (such as the Java Virtual Machine) have a garbage collector (GC) that occasionally needs to stop all running threads.
In virtualized environments, a virtual machine can be suspended (pausing the execution of all processes and saving the contents of memory to disk) and resumed (restoring the contents of memory and continuing execution).
When the operating system context-switches to another thread, or when the hypervisor switches to a different virtual machine (when running in a virtual machine), the currently running thread can be paused at any arbitrary point in the code.
In the case of a virtual machine, the CPU time spent in other virtual machines is known as steal time.
If the operating system is configured to allow swapping to disk (paging), a simple memory access may result in a page fault that requires a page from disk to be loaded into memory.
In extreme circumstances, the operating system may spend most of its time swapping pages in and out of memory and getting little actual work done (this is known as thrashing).
A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function.
In embedded systems, real-time means that a system is carefully designed and tested to meet specified timing guarantees in all circumstances.
If the runtime can warn the application that a node soon requires a GC pause, the application can stop sending new requests to that node, wait for it to finish processing outstanding requests, and then perform the GC while no requests are in progress.
If a remote node doesn’t respond, there is no way of knowing what state it is in, because problems in the network cannot reliably be distinguished from problems at a node.
A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover.
If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive.
When the paused client comes back, it believes (incorrectly) that it still has a valid lease and proceeds to also write to the file.
In the Byzantine version of the problem, there are n generals who need to agree, and their endeavor is hampered by the fact that there are some traitors in their midst.
Byzantium was an ancient Greek city that later became Constantinople, in the place which is now Istanbul in Turkey.
A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.
A publicly accessible application must carefully sanitize any inputs from users, for example checking that a value is within a reasonable range and limiting the size of strings to prevent denial of service through large memory allocations.
NTP clients can be configured with multiple server addresses.
The use of multiple servers makes NTP more robust than if it only uses a single server.
Algorithms need to be written in a way that does not depend too heavily on the details of the hardware and software configuration on which they are run.
The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock error.
In the crash-stop model, an algorithm may assume that a node can fail in only one way, namely by crashing.
In the crash-recovery model, nodes are assumed to have stable storage (i.e., nonvolatile disk storage) that is preserved across crashes, while the in-memory state is assumed to be lost.
For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful model.
An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model.
Safety is often informally defined as nothing bad happens, and liveness as something good eventually happens.
If a safety property is violated, we can point at a particular point in time at which it was broken (for example, if the uniqueness property was violated, we can identify the particular operation in which a duplicate fencing token was returned).
Safety and liveness properties and system models are very useful for reasoning about the correctness of a distributed algorithm.
Proving an algorithm correct does not mean its implementation on a real system will necessarily always behave correctly.
In distributed systems, we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken.
To tolerate faults, the first step is to detect them, but even that is hard.
The only way information can flow from one node to another is by sending it over the unreliable network.
Is it better to be alive and wrong or right and dead?
The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.
If two nodes both believe that they are the leader, that situation is called split brain, and it often leads to data loss.
If you look at two database nodes at the same moment in time, you’re likely to see different data on the two nodes, because write requests arrive on different nodes at different times.
In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written.
Maintaining the illusion of a single copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and doesn’t come from a stale cache or replica.
If a read request is concurrent with a write request, it may return either the old or the new value.
After any one read has returned the new value, all following reads (on the same or other clients) must also return the new value.
In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A.
Linearizability is a recency guarantee on reads and writes of a register (an individual object).