Kill It with Fire: Manage Aging Computer Systems (and Future Proof Modern Ones)
Too often, meetings are maladapted attempts to solve problems. So if you want to know what parts of the project are suffering the most, pay attention to what the team is having meetings about, how often meetings are held, and who is being dragged into those meetings. In particular, look for meetings with long invite lists. Large meetings are less effective than small meetings, but they do convincingly spread the blame around by giving everyone the impression that all parties were consulted and all opinions were explored.
Building a product from the beginning with a service-oriented architecture is usually a mistake. Because you don’t have the proper product/market fit figured out yet, integrations and data contracts become a major pain point.
In general, the level of abstraction your design has should be inversely proportional to the number of untested assumptions you’re making. The more abstractions a given design includes, the more difficult changing APIs without breaking data contracts becomes. The more often you break contracts, the more often a team has to stop new work and redo old work.
At small organizations, we find people are doing several different jobs at once with roles not so clearly defined. Everyone in the same space is using the same resources. In short, small organizations build monoliths because small organizations are monoliths.
Large organizations do well when they transition their monoliths to services, because the problems around communication and knowledge sharing that need to be solved to make complex systems work are problems that large organizations have to solve anyway.
Decisions motivated by wanting to avoid rewriting code later are usually bad decisions. In general, any decision made to please or impress imagined spectators with the superficial elegance of your approach is a bad one. If you’re coming into a project where team members are fixing something that isn’t broken, you can be sure they are doing so because they are afraid of the way their product looks to other people.
If you have hundreds or even thousands of engineers contributing to the same code base, the potential for miscommunication and conflict is almost infinite. Coordinating between teams sharing ownership on the same monolith often pushes organizations back into a traditional release cycle model where one team tests and assembles a set of updates that go to production in a giant package. This slows development down, and more important, it slows down rollbacks that affect the organization’s ability to respond to failure.
Breaking up the monolith into services that roughly correspond to what each team owns means that each team can control its own deploys. Development speeds up. Add a layer of complexity in the form of formal, testable API specs, and the system can facilitate communication between those teams by policing how they are allowed to change downstream interactions.
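A formal, testable API spec can be as simple as a schema check run on every deploy. A minimal sketch of the idea in Python (the contract fields and helper names here are invented for illustration, not taken from the book):

```python
# Sketch: a machine-checkable data contract between two teams' services.
# In practice this role is usually played by an OpenAPI or JSON Schema spec.

CONTRACT = {          # what the downstream team has agreed to consume
    "user_id": int,
    "email": str,
    "created_at": str,
}

def violates_contract(payload: dict) -> list[str]:
    """Return a list of contract violations for a candidate response payload."""
    errors = []
    for field, expected_type in CONTRACT.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors

# A CI gate like this blocks a deploy that would break downstream consumers.
print(violates_contract({"user_id": 42, "email": "a@example.com"}))
# → ['missing field: created_at']
```

The point is not the validator itself but that the check is automated: the contract polices changes so teams do not have to coordinate every deploy in person.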
You have to be comfortable with the unknown. You can do that by emphasizing resilience over reliability. Reliability is important, but too many organizations use reliability to aim for perfection, which is the exact opposite of what they should be trying to accomplish.
“Anything over four nines is basically a lie.” The more nines you are trying to guarantee, the more risk-averse engineering teams will become, and the more they will avoid necessary improvements. Remember, to get five nines or more, they have only seconds to respond to incidents. That’s a lot of pressure.
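The arithmetic behind the nines is easy to check. A quick illustrative sketch of the yearly downtime budget each availability target implies:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability target of N nines
    (e.g. nines=4 means 99.99% availability)."""
    availability = 1 - 10 ** (-nines)
    return SECONDS_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines: {downtime_budget_seconds(n) / 60:.1f} minutes per year")
# Five nines leaves roughly 5.3 minutes per year: a single slow incident
# response can blow the entire annual budget.
```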
Code Yellows are not generally called by engineering managers or directors because the scope of their field of influence should be small enough to manage the problem via other project management strategies. Code Yellows are for systemic problems; a problem that fits entirely within the domain of a single engineering manager without touching or affecting any other group is not systemic.
Popular culture sells the myth about lazy, stupid, uncaring bureaucrats. It’s easy to dismiss people that way. Buying into the idea that those kinds of problems are really character flaws means you don’t have to recognize that you’ve created an environment where people feel trapped. They are caught between conflicting incentives with no way to win.
However, if they implemented a better process and still failed anyway, they would lose this mental lifeline they were hanging on to. The team doomed itself to failure because they were afraid of learning that the problem the whole time was them—not their process, not the organization’s denial of resources, not the inexperienced manager.
If the organization is making changes that will not provide enough value to justify their expense, boost the value of those changes by turning them into a vehicle of better engineering practices. If the organization is paralyzed by missing information and unknown complications, promote resilience and eliminate the fear of failure. If problems extend beyond what any one team can solve by itself, allow the organization to temporarily reorganize itself around the problem. If teams are demoralized to the point where they are hurting themselves, challenge other parts of the organization to …
The simplest form of design exercise is to talk to your user. Doing that is better than doing nothing, but in unstructured conversations, the quality of the feedback can vary. Sometimes users don’t know what they want. Sometimes the user and the researcher use the same words to mean different things. Sometimes the power dynamics between the user and the person conducting the interview are so great that the user tells the interviewer what the interviewer wants to hear.
Nothing produces out-of-scope digressions more effectively than having people in meetings who don’t need to be there.
Conway’s law has become a voodoo curse—something that people believe only in retrospect. Few engineers attribute their architecture successes to the structures of their organizations, but when a product is malformed, the explanation of Conway’s law is easily accepted.
Software engineers are incentivized to forgo tried-and-true approaches in favor of new frontiers. Left to their own devices, software engineers will proliferate tools, ignoring feature overlaps for the sake of the one thing tool X does better than tool Y, relevant only in that specific situation.
Well-integrated, high-functioning software that is easy to understand usually blends in. Simple solutions do not do much to enhance one’s personal brand. They are rarely worth talking about. Therefore, when an organization provides no pathway to promotion for software engineers, they are incentivized to make technical decisions that emphasize their individual contribution over integrating well into an existing system.
The folly of engineering culture is that we are often ashamed of signing up our organization for a future rewrite by picking the right architecture for right now, but we have no misgivings about producing systems that are difficult for others to understand and therefore impossible to maintain.
The systems we like to rewrite from scratch are usually the systems we have been ignoring. We don’t know how likely failure really is, because we pay attention to those systems only when they fail and forget about them otherwise.
The systems we like to rewrite from scratch also tend to be complex with many layers of abstraction and integrations. When we change something on them, it doesn’t always go smoothly, particularly if we’ve slipped up in our test coverage. The more problems we have making changes, the more we overestimate future failures. The more a system seems brittle, failure-prone, and just impossible to save, the more a full rewrite feels like an easier solution.
Note (Hussain Abbas): This is probably every Drupal project I have picked up.
It’s the minor adjustments to systems that have not been actively developed in a while that create the impression that failure is inevitable and push otherwise rational engineers toward doing rewrites when rewrites are not necessary.
Coordination requires trust. Given a choice, we prefer to base our trust on the character of people we know, but when we scale to a size where that is not possible anymore, we gradually replace social bonds with process. Typically this happens when the organization has reached the size of around 100 to 150 people.
One of the benefits of microservices, for example, is that it allows many teams to contribute to the same system independently from one another. Whereas a monolith would require coordination in the form of code reviews—a personal, direct interaction between colleagues—service-oriented architecture scales the same guarantees with process. Engineers document contracts and protocols; automation is applied to ensure that those contracts are not violated, and it prescribes a course of action if they are. For that reason, engineers who want to “jump ahead” and build something with microservices from …
Conway’s law is a tendency, not a commandment. Large, complex organizations can develop fluid and resilient communication pathways; it just requires the right leadership and the right tooling. Reorgs should be undertaken only in situations where an organization’s structure is completely and totally out of alignment with its implementation.
Done well, the candidate plans the roadmap out backward, starting at the end state and identifying smaller and smaller units of change. With each step, we are designing tests to find weaknesses in the organization’s operational excellence that we can resolve. It’s important that our roadmap is structured around proving we have something with a simple test, rather than steps that assert we do. On large projects, it’s easy for people to become confused or misreport the ground truth. It is useful to know how a leader would construct a test to verify information.
Conway’s law is ultimately about communication and incentives. The incentive side can be covered by giving people a pathway to prestige and career advancement that complements the modernization effort. The only way to design communication pathways is actually to give people something to communicate about. In each case, we allow the vision for the new organization to reveal itself by designing structures that encourage new communication pathways to form in response to our modernization challenges.
Don’t design the organization; let the organization design itself by choosing a structure that facilitates the communication teams will need to get the job done.
Risk is not a static number on a spreadsheet. It’s a feeling that can be manipulated, and while we may justify that feeling with statistics, probabilities, and facts, our perception of level of risk often bears no relationship to those data points.
It’s easy for the organization’s rhetoric to be disconnected from the values that govern the work environment. What colleagues pay attention to are the real values of an organization. No matter how passionate or consistent the messaging, attention from colleagues will win out over the speeches.
Punishment and rewards are two sides of the same coin. Rewards have a punitive effect because they, like outright punishment, are manipulative. “Do this and you’ll get that” is not really very different from “Do this or here’s what will happen to you.” In the case of incentives, the reward itself may be highly desired; but by making that bonus contingent on certain behaviors, managers manipulate their subordinates, and that experience of being controlled is likely to assume a punitive quality over time.
when you’re examining why certain failures are considered riskier than others, an important question to ask yourself is this: How do people get seen here? What behaviors and accomplishments give them an opportunity to talk about their ideas and their work with their colleagues and be acknowledged?
the occasional outage and problem with a system—particularly if it is resolved quickly and cleanly—can actually boost the user’s trust and confidence. The technical term for this effect is the service recovery paradox.
what we do know is that recovering fast and being transparent about the nature of the outage and the resolution often improves relationships with stakeholders. Factoring in the boost to psychological safety and productivity that just culture creates, technical organizations shouldn’t shy away from breaking things on purpose. Under the right conditions, the cost of user trust is minimal, and the benefits might be substantial.
Having a part of a system that no one understands is a weakness, so avoiding the issue for fear of breaking things should not be considered the safer choice. Using failure as a tool to make systems and the organizations that run them stronger is one of the foundational concepts behind resilience engineering.
Is it better to wait for something to fail and hope you have the right resources and expertise at the ready? Or is it better to trigger failure at a time when you can plan resources, expertise, and impact in advance? You don’t know that something doesn’t work the way you intended it to until you try it.
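One low-ceremony way to trigger failure on purpose, in the spirit of a planned "game day" drill, is to wrap a dependency call so it fails at a configurable rate during the drill window. A sketch with invented names, not a prescription from the book:

```python
import random

def with_injected_failure(call, failure_rate: float, drill_active: bool):
    """Wrap a dependency call so it sometimes fails deliberately during a drill."""
    def wrapper(*args, **kwargs):
        if drill_active and random.random() < failure_rate:
            # Raise the kind of error a real outage would produce, but at a
            # time when people, expertise, and rollback plans are standing by.
            raise ConnectionError("injected failure (planned drill)")
        return call(*args, **kwargs)
    return wrapper

# During the drill window, route traffic through the wrapped call and watch
# whether alerts, retries, and fallbacks actually behave as intended.
fetch_user = with_injected_failure(lambda user_id: {"id": user_id},
                                   failure_rate=0.1, drill_active=True)
```

Outside the drill window, `drill_active` stays false and the wrapper is a transparent pass-through, so the same code path runs in both modes.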
Organizations should not shy away from failure, because failure can never be prevented. It can only be put off for a while or redirected to another part of the system. Eventually, every organization will experience failure. Embracing failure as a source of learning means that your team gains more experience and ultimately gets better at mitigating negative impacts on users. Practicing recovering from failure, therefore, makes the team more valuable.
Technology is never done, but modernization projects can be.
Whenever you can avoid having people argue about principles, philosophies, and other generalities of good engineering versus bad engineering, take advantage of it.
As software engineers, it is easy to fall into the trap of thinking that effective work will always look like writing code, but sometimes you get much closer to your desired outcome by teaching others to do the job rather than doing it yourself.
Just as humans are terrible judges of probability, we’re also terrible judges of time. What feels like ages might only be a few days. By marking time, we can realign our emotional perception of how things are going. Find some way to record what you worked on and when so the team can easily go back and get a full picture of how far they’ve come.
Knowing that such-and-such ticket was closed on a certain day doesn’t necessarily take me back to that moment. Sometimes a date is just a date. When you mark time, do so in a way that evokes memory, that highlights the feeling of distance between that moment and where the team is now.
But postmortems are not specific to failure. If your modernization plan includes modeling your approach after another successful project, consider doing a postmortem on that project’s success instead. Remember that we tend to think of failure as bad luck and success as skill. We do postmortems on failure because we’re likely to see them as complex scenarios with a variety of contributing factors. We assume that success happens for simple, straightforward reasons.
As an industry, we reflect on success but study failure. Sometimes obsessively. I’m suggesting that if you’re modeling your project after another team or another organization’s success, you should devote some time to actually researching how that success came to be in the first place.
The value of the postmortem is not its level of detail, but the role it plays in knowledge sharing. Postmortems are about helping people not involved in the incident avoid the same mistakes. The best postmortems are also distributed outside the organization to cultivate trust through transparency.
When people associate refactoring and necessary migrations with a system somehow being built wrong or breaking, they will put off doing them until things are falling apart. When the process of managing these changes is part of the routine, like getting a haircut or changing the oil in your car, the system can be future-proofed by gradually modernizing it.
As a general rule, we get better at dealing with problems the more often we have to deal with them. The longer the gap between the maturity date of deteriorations, the more likely knowledge has been lost and critical functionality has been built without taking the inevitable into account.
don’t build for scale before you have scale. Build something that you will have to redesign later, even if that means building a monolith to start. Build something wrong and update it often.
The secret to building technology “wrong” but in the correct way is to understand that successful complex systems are made up of stable simple systems. Before your system can be scaled with practical approaches, like load balancers, mesh networks, queues, and other trappings of distributed computing, the simple systems have to be stable and performant. Disaster comes from trying to build complex systems right away, neglecting the foundation that determines all system behavior, planned and unexpected.