Kindle Notes & Highlights
We build our computer systems the way we build our cities: over time, without a plan, on top of ruins. —Ellen Ullman
To understand legacy systems, you have to be able to define how the original requirements were determined. You have to excavate an entire thought process and figure out what the trade-offs look like now that the options are different.
It is easy to build things, but it is difficult to rethink them once they are in place.
Market shifts are complex events. We can see the pattern of technology cycling through the same approaches and structures over and over, but these shifts are less about the superiority of a given characteristic and more about how potential consumers organize themselves.
Adopting new practices doesn’t necessarily make technology better, but doing so almost always makes technology more complicated, and more complicated technology is hard to maintain and ultimately more prone to failure.
When I’m working on a legacy system, I always start off by evaluating the prospective users. Who will be maintaining this system long term? What technologies are they comfortable with? Who will be using this system the most? How do they expect the system to work?
Engineering teams tend to gravitate toward full rewrites because they incorrectly think of old systems as specs. They assume that since an old system works, all technical challenges and possible problems have been settled. The risks have been eliminated!
So programmers prefer full rewrites over iterating on legacy systems because rewrites maintain an attractive level of ambiguity, while the existing systems are well known and, therefore, boring.
People talk about the phases of their modernization plans in terms of which technologies they are going to use rather than what value they will add.
Modernizations should be based on adding value, not chasing new technology. Familiar interfaces help speed up adoption. People gain awareness of interfaces and technology through their networks, not necessarily by popularity.
In 1983, Charles Perrow coined the term normal accidents to describe systems that were so prone to failure that no amount of safety procedures could eliminate accidents entirely.
If your goal is to reduce failures or minimize security risks, your best bet is to start by evaluating your system on those two characteristics: Where are things tightly coupled, and where are things complex? Your goal should not be to eliminate all complexity and all coupling; there will be trade-offs in each specific instance.
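Purely as an illustration of that evaluation (this sketch is mine, not the book's), one way to surface the risky intersections is to score each module by how many other modules depend on it (coupling) and by a rough complexity proxy, then rank the results. The module names, numbers, and scoring rule below are all assumptions.

```python
# Sketch: rank modules by coupling and complexity to find risky hot spots.
# Assumes you can export a dependency map and a rough complexity score
# (cyclomatic complexity, or even line count) for each module.
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    dependents: int   # how many other modules import this one (coupling)
    complexity: int   # any rough complexity proxy you trust

def risk_score(m: Module) -> int:
    # Tight coupling and high complexity compound each other,
    # so multiply rather than add.
    return m.dependents * m.complexity

modules = [
    Module("billing_core", dependents=14, complexity=220),
    Module("report_export", dependents=2, complexity=340),
    Module("auth_shim", dependents=18, complexity=35),
]

for m in sorted(modules, key=risk_score, reverse=True):
    print(f"{m.name:15} coupling={m.dependents:3} complexity={m.complexity:4} risk={risk_score(m)}")
```

The ranking is only a starting point for the trade-off conversation the text describes, not a target for eliminating coupling or complexity outright.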
When both observability and testing are lacking on your legacy system, observability comes first. Tests tell you only what won’t fail; monitoring tells you what is failing.
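As a minimal sketch of what observability-first can look like in practice (my illustration, not the author's code), you might wrap untested legacy entry points with timing and failure logging before writing any tests, so production tells you what is actually failing. The function name and logger name here are hypothetical.

```python
# Sketch: instrument legacy entry points before writing tests, so the system
# reports what is actually failing and how long it takes.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("legacy")

def observed(fn):
    """Wrap a legacy function with timing and failure logging."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            log.info("%s ok in %.3fs", fn.__name__, time.monotonic() - start)
            return result
        except Exception:
            log.exception("%s failed after %.3fs", fn.__name__, time.monotonic() - start)
            raise
    return wrapper

@observed
def reconcile_invoices(batch_id):  # hypothetical legacy entry point
    ...
```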
In poker, people call it resulting. It’s the habit of confusing the quality of the outcome with the quality of the decision. In psychology, people call it a self-serving bias. When things go well, we overestimate the roles of skill and ability and underestimate the role of luck. When things go poorly, on the other hand, it’s all bad luck or external forces.
Being strategically narrow-minded to demonstrate value and build momentum is not a bad idea.
With my engineers, I set the expectation that to have a productive, free-flowing debate, we need to be able to sort comments and issues into in-scope and out-of-scope quickly and easily as a team. I call this technique “true but irrelevant,” because I can typically sort meeting information into three buckets: things that are true, things that are false, and things that are true but irrelevant. Irrelevant is just a punchier way of saying out of scope.
Fixing things that are not broken means you’re taking on all the risks of a modernization but will not be able to find the compelling value add and build the momentum that keeps things going.
The US Army/Marine Corps Counterinsurgency Field Manual put it best when it advised soldiers: “Planning is problem solving, while design is problem setting.”
Essentially, engineers are motivated to create named things. If something can be named, it can have a creator. If the named thing turns out to be popular, the engineer’s prestige increases, and her career will advance.
The folly of engineering culture is that we are often ashamed of signing up our organization for a future rewrite by picking the right architecture for right now, but we have no misgivings about producing systems that are difficult for others to understand and therefore impossible to maintain.
Conway argued against aspiring for a universally correct architecture. He wrote in 1968, “It is an article of faith among experienced system designers that given any system design, someone someday will find a better one to do the same job. In other words, it is misleading and incorrect to speak of the design for a specific job, unless this is understood in the context of space, time, knowledge, and technology.”
One of the major themes that influences how systems degrade over time is how terrible human beings are at probability. We tend to overestimate the likelihood of events recurring once we have already seen them and underestimate the likelihood of events that have not yet happened. Sidney Dekker, a professor of human factors and system safety, called the outcome of this cognition problem on system safety drift. Systems do not generally fail all at once; they “drift” into failure via feedback loops caused by a desire to prevent failure.
We’re overemphasizing failure that may be rare and underestimating both the time it will take to complete the rewrite and the performance gains of the rewrite itself. We are swapping a system that works and needs to be adjusted for an expensive and difficult migration to something unproven.
Design is problem setting. Incorporating it into your process will help your teams become more resilient. By themselves, technical conversations tend to incentivize people to maintain status by criticizing ideas. Design can help mitigate those effects by giving conversations the structure of a game and a path to winning. Legacy modernizations are ultimately transitions and require leaders with high tolerance for ambiguity. Conway’s law doesn’t mean you should design your organization to look like the technology you want. It means you should pay attention to how the organization’s structure shapes the systems it produces.
Don’t design the organization; let the organization design itself by choosing a structure that facilitates the communication teams will need to get the job done.
Social markets are governed by social norms (read: peer pressure and social capital), and they often inspire people to work harder and longer than much more expensive incentives that represent the traditional work-for-pay exchange. In other words, people will work hard for positive reinforcement; they might not work harder for an extra thousand dollars.
So when you’re examining why certain failures are considered riskier than others, an important question to ask yourself is this: How do people get seen here? What behaviors and accomplishments give them an opportunity to talk about their ideas and their work with their colleagues and be acknowledged?
Working groups relax hierarchy to allow people to solve problems across organizational units, whereas committees both reflect and reinforce those organizational boundaries and hierarchies.
As a general rule, we get better at dealing with problems the more often we have to deal with them.
Build something wrong and update it often. The secret to building technology “wrong” but in the correct way is to understand that successful complex systems are made up of stable simple systems.
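One way to read that in code terms (my sketch, assuming a simple batch-processing pipeline, not an example from the book) is to keep each piece small and behind a boundary that stays stable, so any single step can be built “wrong” today and replaced later without touching the rest.

```python
# Sketch: a "complex" pipeline composed of simple, independently replaceable steps.
# Each step is a plain function with a stable signature (records in, records out),
# so a step built "wrong" today can be swapped out without rewriting the pipeline.
from typing import Callable, Iterable

Record = dict
Step = Callable[[Iterable[Record]], Iterable[Record]]

def parse(rows: Iterable[Record]) -> Iterable[Record]:
    return (r for r in rows if r)  # placeholder: drop empty records

def normalize(rows: Iterable[Record]) -> Iterable[Record]:
    return ({**r, "name": str(r.get("name", "")).strip()} for r in rows)

def pipeline(steps: list[Step], rows: Iterable[Record]) -> list[Record]:
    for step in steps:
        rows = step(rows)
    return list(rows)

print(pipeline([parse, normalize], [{"name": " Ada "}, {}]))
```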
It’s not the age of a system that causes it to fail, but the pressure of what the organization has forgotten about it slowly building toward an explosion.
Part of the reason legacy modernizations fail so often is that human beings are incentivized to mute or otherwise remove feedback loops that establish accountability. We are often unable to stop this because we insist on talking about that problem as a moral failing instead of an unconscious bias. Engineering organizations that maintain a separation between operations and development, for example, inevitably find that their development teams design solutions so that when they go wrong, they impact the operations team first and most severely.