Kill It with Fire: Manage Aging Computer Systems (and Future Proof Modern Ones)
69%
the occasional outage and problem with a system—particularly if it is resolved quickly and cleanly—can actually boost the user’s trust and confidence. The technical term for this effect is the service recovery paradox.
70%
Resilience engineering tests—also called failure drills—look to trigger failure strategically so that the true behavior of the system can be documented and verified.
70%
if an organization has never restored from backup, it does not have working backups.
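One way to act on this highlight is a small restore drill: restore the backup into a scratch location and check that the result matches the source. The sketch below is illustrative only. The function names `sha256_of` and `restore_drill` are hypothetical, and the plain file copy stands in for whatever restore procedure an organization really uses.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_drill(original: Path, backup: Path) -> bool:
    """Restore `backup` into a scratch directory and compare it to `original`.

    Assumption: a file copy stands in for the real restore procedure.
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy2(backup, restored)  # placeholder for the real restore step
        return sha256_of(restored) == sha256_of(original)
```

Running a check like this on a schedule turns "we have backups" from an assumption into something that has actually been exercised.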
70%
Waiting for an actual outage to figure that out is not a safer strategy than running a failure drill at a time you’ve chosen, supervised by your most experienced engineers.
70%
Two types of impact are relevant to failure tests. The first is technical impact: the likelihood of cascading failures, data corruption, or dramatic changes in security or stability. The second is user impact: How many people are negatively affected and to what degree?
70%
4+1 architectural view model. Developed by Philippe Kruchten,
72%
people’s perception of risk is not static, and it’s often not connected to the probability of failure so much as it is the potential feeling of rejection and condemnation from their peers.
74%
I count among my accomplishments the fact that by the time I left the UN, people had bought in to the agile, iterative process that treats software as a living thing that needs to be constantly maintained and improved. They stopped saying “When the website is done . . .”
74%
The only thing harder than managing your own doubts is dealing with sabotage from colleagues who don’t understand how much progress is being made because their expectations of what improvement will look like is different from other members of the team.
74%
As software engineers, it is easy to fall into the trap of thinking that effective work will always look like writing code, but sometimes you get much closer to your desired outcome by teaching others to do the job rather than doing it yourself.
75%
How to do something should be the decision of the people actually entrusted to do it.
76%
we tend to think of failure as bad luck and success as skill.
76%
success is no more or less complex than failure. You should use the same methodology to learn from success that you use to learn from failure.
76%
Traditional postmortems are written by the software engineers who responded to the incident. These are people you want working on software, not writing reports.
76%
Postmortems establish a record about what really happened and how specific actions affected the outcome. They do not document failure; they provide context.
77%
It has become popular as of late to use the term working group instead of the word committee because committee just sounds bureaucratic to most people.
77%
A camel, the old expression goes, is a horse built by committee.
78%
Software engineers can’t fix problems if they spend all their time in meetings giving their managers updates about the problems. The boot has to be moved off their necks.
78%
The most valuable skill a leader can have is knowing when to get out of the way.
78%
Modernization projects without a clear definition of what success looks like will find themselves with a finish line that only moves further back the closer they get to it.
78%
All technology is imperfect, so the goal of legacy modernization should not be restoring mythical perfection but bringing the system to a state where it is possible to maintain modern best practices around security, stability, and availability.
78%
Future-proofing isn’t about preventing mistakes; it’s about knowing how to maintain and evolve technology gradually.
78%
Two types of problems will cause us to rethink a working system as it ages. The first are usage changes. The second are deteriorations.
80%
The secret to future-proofing is making migrations and redesigns normal routines that don’t require heavy lifting.
80%
Those that neglect to devote a little bit of time to cleaning their technical debt will be forced into cumbersome and risky legacy modernization efforts instead.
80%
Launching a new feature is like having a house party. The more house parties you have in your house before you clean things up, the worse condition your house will be in.
80%
managing deteriorations comes down to these two practices: If you’re introducing something that will deteriorate, build it to fail gracefully. Shorten the time between upgrades so that people have plenty of practice doing them.
81%
When in doubt, default to panicking. It’s more important that errors get noticed so that they can be resolved.
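As a rough illustration of defaulting to a loud failure, the sketch below rejects an unrecognized input instead of quietly mapping it to a fallback value. The `PaymentStatus` enum and `parse_status` function are hypothetical names invented for this example and do not come from the book.

```python
from enum import Enum


class PaymentStatus(Enum):
    """Hypothetical set of statuses the system knows how to handle."""
    PAID = "paid"
    REFUNDED = "refunded"


def parse_status(raw: str) -> PaymentStatus:
    """Parse a status string, raising on anything unrecognized.

    Returning a default here would hide bad data; raising makes the
    error visible so someone can resolve it.
    """
    try:
        return PaymentStatus(raw.strip().lower())
    except ValueError:
        raise ValueError(f"unrecognized payment status: {raw!r}") from None
```

The alternative, returning a default such as `PAID` for unknown values, would keep the program running while silently corrupting its results.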
82%
In 2008, Google announced it had sorted a petabyte of data in six hours with 4,000 computers using 48,000 storage drives. A single run always resulted in at least one of the 48,000 drives dying.
82%
The main thing to consider when thinking about automating a problem away is this: If the automation fails, will it be clear what has gone wrong?
82%
Automation is more problematic when it obscures or otherwise encourages engineers to forget what the system is actually doing under the hood.
83%
Throughout this book and in this chapter especially, the message has been don’t build for scale before you have scale. Build something that you will have to redesign later, even if that means building a monolith to start.
83%
Before your system can be scaled with practical approaches, like load balancers, mesh networks, queues, and other trappings of distributive computing, the simple systems have to be stable and performant. Disaster comes from trying to build complex systems right away and neglecting the foundation with which all system behavior—planned and unexpected—will be determined.
83%
a person cannot be well versed in an infinite number of topics, so for every additional service, expect the level of expertise to be cut in half.
83%
Lots of teams model their systems after white papers from Google or Amazon when they do not have the team to maintain a Google or an Amazon.
84%
Two tools popular with system thinkers for these kinds of models are Loopy (https://ncase.me/loopy/) and InsightMaker (https://insightmaker.com/).
85%
Future-proofing means constantly rethinking and iterating on the existing system.
85%
People don’t go into building a service thinking that they will neglect it until it’s a huge liability for their organization. People fail to maintain services because they are not given the time or resources to maintain them.
85%
Legacy modernizations themselves are anti-patterns. A healthy organization running a healthy system should be able to evolve it over time without rerouting resources to a formal modernization effort.
85%
It’s not the age of a system that causes it to fail, but the pressure of what the organization has forgotten about it slowly building toward an explosion.
86%
The hard part about legacy modernization is the system around the system. The organization, its communication structures, its politics, and its incentives are all intertwined with the technical product in such a way that to move the product, you must do it by turning the gears of this other, complex, undocumented system.
86%
Engineering organizations that maintain a separation between operations and development, for example, inevitably find that their development teams design solutions so that when they go wrong, they impact the operations team first and most severely. Meanwhile, their operations team builds processes that throw barriers in front of development, passing the negative consequences of that to those teams.
86%
Designing a modernization effort is about iteration, and iteration is about feedback. Therefore, the real challenge of legacy modernization is not the technical tasks, but making sure the feedback loops are open in the critical places and communication is orderly everywhere else.
86%
As a general rule, the discretion to make decisions should be delegated to the people who must implement those decisions.
86%
If you are not contributing code or being woken up in the middle of the night to answer a page, have the good sense to remember that no matter how important your job is, you are not the implementor. You do not operate the system, but you can find the operators and make sure t...
This highlight has been truncated due to consecutive passage length restrictions.
87%
The person in the best position to find a working strategy is the person on the ground watching the gears of the system turn.
87%
Failure is inevitable when attempting to change complex systems in production. There are too many moving parts, too many unknowns, and too much that needs to be fixed. Getting every single decision right is impossible. Modern engineering teams use stats like service level objectives, error budgets, and mean time to recovery to move the emphasis away from avoiding failure and toward recovering quickly.
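To make the stats named in this highlight concrete, here is a minimal sketch of how an error budget and mean time to recovery might be computed from an availability objective. The function names, the 30-day window, and the sample numbers are assumptions for illustration, not definitions taken from the book.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window.

    `slo` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, outages: list[float], window_days: int = 30) -> float:
    """Error budget left after subtracting each outage's duration in minutes."""
    return error_budget_minutes(slo, window_days) - sum(outages)


def mean_time_to_recovery(outages: list[float]) -> float:
    """Average outage duration in minutes; requires at least one outage."""
    if not outages:
        raise ValueError("no outages recorded")
    return sum(outages) / len(outages)
```

Under these assumptions, a 99.9% objective over 30 days allows about 43.2 minutes of downtime, so two outages of 10 and 20 minutes would leave roughly 13.2 minutes of budget and a mean time to recovery of 15 minutes.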
87%
We cannot completely eliminate failure, because there’s a level of complexity where a single person—no matter how intelligent—cannot comprehend the full system.
87%
What gets people seen is what they will ultimately prioritize, even if those behaviors are in open conflict with the official instructions they receive from management.
87%
the program that doesn’t change becomes a time bomb.