We may also perform Game Day exercises, where we rehearse large-scale failures, such as turning off entire data centers. Or we may inject ever-larger scale faults into the production environment (such as the famous Netflix “Chaos Monkey,” which randomly kills processes and compute servers in production) to ensure that we’re as resilient as we want to be.

