If we want to make distributed systems work, we must accept the possibility of partial failure and build fault-tolerance mechanisms into the software. In other words, we need to build a reliable system from unreliable components. (As discussed in “Reliability”, there is no such thing as perfect reliability, so we’ll need to understand the limits of what we can realistically promise.)

