Kindle Notes & Highlights
Read between February 27 and March 17, 2022
the occasional outage and problem with a system—particularly if it is resolved quickly and cleanly—can actually boost the user’s trust and confidence. The technical term for this effect is the service recovery paradox.
Resilience engineering tests—also called failure drills—look to trigger failure strategically so that the true behavior of the system can be documented and verified.
if an organization has never restored from backup, it does not have working backups.
Waiting for an actual outage to figure that out is not a safer strategy than running a failure drill at a time you’ve chosen, supervised by your most experienced engineers.
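A minimal sketch of what such a drill could look like, using a throwaway SQLite file as a stand-in for a real production store (the paths, the orders table, and the row-count floor are assumptions for illustration): restore the latest backup into a scratch location and verify it actually contains the data you expect.

```python
# backup_drill.py -- an illustrative restore drill, not a production tool.
# Assumptions: backups are plain SQLite files, and "verified" means the
# restored copy passes an integrity check and holds a sane number of rows.
import shutil
import sqlite3
from pathlib import Path

BACKUP_PATH = Path("/backups/orders-latest.db")       # hypothetical backup location
SCRATCH_PATH = Path("/tmp/restore-drill/orders.db")   # throwaway restore target
MINIMUM_EXPECTED_ROWS = 1_000                          # hypothetical sanity floor

def run_restore_drill() -> None:
    # 1. Restore: copy the backup somewhere that cannot touch production.
    SCRATCH_PATH.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(BACKUP_PATH, SCRATCH_PATH)

    # 2. Verify: the restored file must open and pass a basic data check.
    with sqlite3.connect(SCRATCH_PATH) as conn:
        integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
        if integrity != "ok":
            raise RuntimeError(f"Restored database failed integrity check: {integrity}")
        (row_count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
        if row_count < MINIMUM_EXPECTED_ROWS:
            raise RuntimeError(f"Restore looks incomplete: only {row_count} rows")

    # 3. Document: a drill nobody can see the results of proves nothing.
    print(f"Restore drill passed: {row_count} rows recovered from {BACKUP_PATH}")

if __name__ == "__main__":
    run_restore_drill()
```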
Two types of impact are relevant to failure tests. The first is technical impact: the likelihood of cascading failures, data corruption, or dramatic changes in security or stability. The second is user impact: How many people are negatively affected and to what degree?
The 4+1 architectural view model was developed by Philippe Kruchten.
people’s perception of risk is not static, and it’s often not connected to the probability of failure so much as to the potential feeling of rejection and condemnation from their peers.
I count among my accomplishments the fact that by the time I left the UN, people had bought in to the agile, iterative process that treats software as a living thing that needs to be constantly maintained and improved. They stopped saying “When the website is done . . .”
The only thing harder than managing your own doubts is dealing with sabotage from colleagues who don’t understand how much progress is being made because their expectations of what improvement will look like are different from those of other members of the team.
As software engineers, it is easy to fall into the trap of thinking that effective work will always look like writing code, but sometimes you get much closer to your desired outcome by teaching others to do the job rather than doing it yourself.
How to do something should be the decision of the people actually entrusted to do it.
we tend to think of failure as bad luck and success as skill.
success is no more or less complex than failure. You should use the same methodology to learn from success that you use to learn from failure.
Traditional postmortems are written by the software engineers who responded to the incident. These are people you want working on software, not writing reports.
Postmortems establish a record about what really happened and how specific actions affected the outcome. They do not document failure; they provide context.
It has become popular of late to use the term working group instead of the word committee because committee just sounds bureaucratic to most people.
A camel, the old expression goes, is a horse built by committee.
Software engineers can’t fix problems if they spend all their time in meetings giving their managers updates about the problems. The boot has to be moved off their necks.
The most valuable skill a leader can have is knowing when to get out of the way.
Modernization projects without a clear definition of what success looks like will find themselves with a finish line that only moves further back the closer they get to it.
All technology is imperfect, so the goal of legacy modernization should not be restoring mythical perfection but bringing the system to a state where it is possible to maintain modern best practices around security, stability, and availability.
Future-proofing isn’t about preventing mistakes; it’s about knowing how to maintain and evolve technology gradually.
Two types of problems will cause us to rethink a working system as it ages. The first is usage changes; the second is deteriorations.
The secret to future-proofing is making migrations and redesigns normal routines that don’t require heavy lifting.
Those that neglect to devote a little bit of time to cleaning their technical debt will be forced into cumbersome and risky legacy modernization efforts instead.
Launching a new feature is like having a house party. The more house parties you have in your house before you clean things up, the worse condition your house will be in.
managing deteriorations comes down to these two practices: If you’re introducing something that will deteriorate, build it to fail gracefully. Shorten the time between upgrades so that people have plenty of practice doing them.
When in doubt, default to panicking. It’s more important that errors get noticed so that they can be resolved.
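A hedged sketch of how those two postures can sit side by side in code (the config and cache names are invented for illustration): a critical misconfiguration panics immediately so it gets noticed, while a non-critical dependency degrades gracefully but still fails loudly in the logs.

```python
# error_posture.py -- illustrative only; names and structure are assumptions.
import json
import logging
from pathlib import Path

logger = logging.getLogger("error_posture")

def load_config(path: Path) -> dict:
    # Critical path: a bad config should stop the process, not limp along.
    # Defaulting to panic means the error gets noticed and resolved.
    raw = json.loads(path.read_text())  # let any parse error propagate
    if "database_url" not in raw:
        raise RuntimeError(f"config {path} is missing database_url; refusing to start")
    return raw

def get_recommendations(user_id: str, cache) -> list[str]:
    # Non-critical path: the feature can degrade gracefully, but the failure
    # is still logged loudly so it cannot quietly become the new normal.
    try:
        return cache.get(f"recs:{user_id}")
    except Exception:
        logger.exception("recommendation cache unavailable; serving empty list")
        return []  # graceful degradation, visible in logs and metrics
```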
In 2008, Google announced it had sorted a petabyte of data in six hours with 4,000 computers using 48,000 storage drives. A single run always resulted in at least one of the 48,000 drives dying.
The main thing to consider when thinking about automating a problem away is this: If the automation fails, will it be clear what has gone wrong?
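One hedged sketch of what that can look like (the rotation steps are placeholders, not a real tool): wrap each step of an automated job so that, when it fails, the output names exactly which step broke and with what state, instead of leaving engineers to reverse-engineer the automation.

```python
# automated_rotation.py -- a sketch of automation that fails legibly.
# The steps themselves are hypothetical; the point is the error reporting.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("log_rotation")

def run_step(name: str, func, **state) -> None:
    logger.info("starting step %r with state %s", name, state)
    try:
        func(**state)
    except Exception as exc:
        # When the automation fails, make it obvious what went wrong and where.
        raise RuntimeError(
            f"automated log rotation failed at step {name!r} "
            f"with state {state!r}: {exc}"
        ) from exc
    logger.info("finished step %r", name)

def archive_old_logs(directory: str) -> None: ...      # placeholder step
def delete_archived_logs(directory: str) -> None: ...  # placeholder step

def rotate_logs(directory: str) -> None:
    run_step("archive_old_logs", archive_old_logs, directory=directory)
    run_step("delete_archived_logs", delete_archived_logs, directory=directory)
```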
Automation is more problematic when it obscures or otherwise encourages engineers to forget what the system is actually doing under the hood.
Throughout this book and in this chapter especially, the message has been don’t build for scale before you have scale. Build something that you will have to redesign later, even if that means building a monolith to start.
Before your system can be scaled with practical approaches, like load balancers, mesh networks, queues, and other trappings of distributed computing, the simple systems have to be stable and performant. Disaster comes from trying to build complex systems right away and neglecting the foundation on which all system behavior—planned and unexpected—will be determined.
a person cannot be well versed in an infinite number of topics, so for every additional service, expect the level of expertise to be cut in half.
Lots of teams model their systems after white papers from Google or Amazon when they do not have the team to maintain a Google or an Amazon.
Two tools popular with system thinkers for these kinds of models are Loopy (https://ncase.me/loopy/) and InsightMaker (https://insightmaker.com/).
Future-proofing means constantly rethinking and iterating on the existing system.
People don’t go into building a service thinking that they will neglect it until it’s a huge liability for their organization. People fail to maintain services because they are not given the time or resources to maintain them.
Legacy modernizations themselves are anti-patterns. A healthy organization running a healthy system should be able to evolve it over time without rerouting resources to a formal modernization effort.
It’s not the age of a system that causes it to fail, but the pressure of what the organization has forgotten about it slowly building toward an explosion.
The hard part about legacy modernization is the system around the system. The organization, its communication structures, its politics, and its incentives are all intertwined with the technical product in such a way that to move the product, you must do it by turning the gears of this other, complex, undocumented system.
Engineering organizations that maintain a separation between operations and development, for example, inevitably find that their development teams design solutions so that when they go wrong, they impact the operations team first and most severely. Meanwhile, their operations team builds processes that throw barriers in front of development, passing the negative consequences of that to those teams.
Designing a modernization effort is about iteration, and iteration is about feedback. Therefore, the real challenge of legacy modernization is not the technical tasks, but making sure the feedback loops are open in the critical places and communication is orderly everywhere else.
As a general rule, the discretion to make decisions should be delegated to the people who must implement those decisions.
If you are not contributing code or being woken up in the middle of the night to answer a page, have the good sense to remember that no matter how important your job is, you are not the implementor. You do not operate the system, but you can find the operators and make sure t...
The person in the best position to find a working strategy is the person on the ground watching the gears of the system turn.
Failure is inevitable when attempting to change complex systems in production. There are too many moving parts, too many unknowns, and too much that needs to be fixed. Getting every single decision right is impossible. Modern engineering teams use stats like service level objectives, error budgets, and mean time to recovery to move the emphasis away from avoiding failure and toward recovering quickly.
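As a rough illustration of how those stats shift the emphasis (the 99.9% target, 30-day window, and incident rate below are assumptions, not from the book): an availability objective implies a concrete error budget, and recovering quickly is what keeps you inside it.

```python
# error_budget.py -- back-of-the-envelope SLO math, illustrative numbers only.

SLO_TARGET = 0.999             # hypothetical 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60  # a 30-day rolling window

# The error budget is the downtime the SLO permits within the window.
error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES   # about 43.2 minutes

# With a fixed budget, what matters is how fast you recover, not whether you fail.
incidents_per_month = 4  # assumed incident rate
mean_time_to_recovery = error_budget_minutes / incidents_per_month

print(f"Error budget: {error_budget_minutes:.1f} minutes per 30 days")
print(f"At {incidents_per_month} incidents/month, MTTR must stay under "
      f"{mean_time_to_recovery:.1f} minutes to stay within the budget")
```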
We cannot completely eliminate failure, because there’s a level of complexity where a single person—no matter how intelligent—cannot comprehend the full system.
What gets people seen is what they will ultimately prioritize, even if those behaviors are in open conflict with the official instructions they receive from management.
the program that doesn’t change becomes a time bomb.