Kindle Notes & Highlights
Read between August 17 - September 9, 2023
Computers wouldn’t be developed to work with monitors until 1964 when Bell Labs incorporated the first primitive visual interface into the Multics time-sharing system. We had no way of seeing the input the computer was receiving, so we borrowed an interface from the telegraph, which, in turn, was borrowing one from 18th-century French weavers.
From an economic perspective, there’s a difference between risk and ambiguity. Risks are known and estimable threats; ambiguities are places where outcomes, both positive and negative, are unknown. The traditional school of thought tells us that human beings are averse to ambiguity and will avoid it as much as possible. However, ambiguity aversion is one of those decision-making models that tests well in laboratories but breaks down in the real world, where decisions are more complex and probabilities less clearly defined. Specifically, when the decision involves multiple […]
The goal of full rewrites is to restore ambiguity and, therefore, enthusiasm. They fail because they rest on a flawed assumption: that the old system can be used as a spec, trusted to have accurately diagnosed and eliminated every risk.
Another place where artificial consistency comes into play is with databases. The top choices for databases 10 years ago are not the top choices today, so senior leaders will sometimes ask that legacy databases be migrated to another option more consistent with whatever newer systems are using. As with the previous example, there are legitimate nontechnical reasons to do this, such as not wanting the expense of supporting two different databases that essentially behave the same way, but the issue can quickly get out of hand when the engineering team is being asked to remove the key value store […]
Figuring out when consistency adds technical value and when it is artificial is one of the hardest decisions an engineering team must make.
But for most organizations, the conversation around modernization begins with failure. No one would invest the time and effort if the system were humming along just fine. The term legacy modernization itself is a little misleading. Plenty of old systems exist that no one gives a thought to changing because they just work.
In 1983, Charles Perrow coined the term normal accidents to describe systems that were so prone to failure, no amount of safety procedures could eliminate accidents entirely.
Signs of complexity in software include the number of direct dependencies and the depth of the dependency tree, the number of integrations, the hierarchy of users and their ability to delegate, the number of edge cases the system must control for, the amount of input from untrusted sources, the amount of legal variety in that input, and so on.
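Two of the signals above, direct dependency count and dependency-tree depth, are easy to make concrete. This is a hypothetical sketch over a toy dependency graph; the package names are invented for illustration, not taken from any real system.

```python
# Toy dependency graph: package -> list of packages it depends on directly.
deps = {
    "app": ["web", "db"],
    "web": ["http"],
    "db": ["http", "pool"],
    "http": [],
    "pool": [],
}

def direct_dependencies(graph, package):
    """Number of packages the given package depends on directly."""
    return len(graph[package])

def tree_depth(graph, package):
    """Depth of the dependency tree rooted at the given package."""
    children = graph[package]
    if not children:
        return 0
    return 1 + max(tree_depth(graph, child) for child in children)

print(direct_dependencies(deps, "app"))  # 2
print(tree_depth(deps, "app"))           # 2 (app -> db -> pool)
```

In a real codebase these numbers come from the package manager's lockfile rather than a hand-written dict, but the shape of the measurement is the same.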
Tightly coupled and complex systems are prone to failure because the coupling produces cascading effects, and the complexity makes the direction and course of those cascades impossible to predict.
Places of complexity are areas where the human operators make the most mistakes and have the greatest probability of misunderstanding. Places of tight coupling are areas of acceleration where effects both good and bad will move faster, which means less time for intervention.
A fair amount of prep work is necessary to make iteration in place work. You will need to set up monitoring. At a minimum, you should have some way to track errors in the application layer and search logs, but the tooling here grows more sophisticated every year. The better you can identify what normal looks like on your legacy system, the easier it is to iterate in place safely.
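As a minimal starting point for "track errors in the application layer," here is a sketch using only the standard library, assuming no dedicated monitoring tool is in place yet. The logger name and messages are invented; the counting handler is a crude stand-in for a real error tracker, useful for baselining what normal error volume looks like.

```python
import logging
from collections import Counter

class LevelCounterHandler(logging.Handler):
    """Counts log records per level; a crude stand-in for an error tracker."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        self.counts[record.levelname] += 1

# Hypothetical application logger for a legacy system being instrumented.
logger = logging.getLogger("legacy-app")
logger.setLevel(logging.INFO)
counter = LevelCounterHandler()
logger.addHandler(counter)

logger.info("request handled")
logger.error("upstream timeout")
logger.error("upstream timeout")

print(counter.counts["ERROR"])  # 2
```

Once a baseline like this exists, a spike above the normal error count is a signal to pause an iteration-in-place change and investigate.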
If you’ve never restored from a backup, you don’t actually have backups. If you’ve never failed over to another region, you don’t actually have failovers. If you’ve never rolled back a deploy, you don’t have a mature deploy pipeline.
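The backup claim above can be turned into a drill. This is a hedged sketch, not a production restore procedure: it "restores" a backup file into a scratch directory and verifies its checksum against the original. The file names and the copy-based restore step are illustrative assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def restore_drill(original: Path, backup: Path) -> bool:
    """Restore the backup into a scratch directory and verify it matches."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy(backup, restored)   # the "restore" step, simplified
        return checksum(restored) == checksum(original)

with tempfile.TemporaryDirectory() as d:
    data = Path(d) / "data.db"          # hypothetical datastore file
    data.write_text("important records")
    backup = Path(d) / "data.db.bak"
    shutil.copy(data, backup)           # take the backup
    print(restore_drill(data, backup))  # True
```

The point is not the mechanics but the habit: until a drill like this runs regularly and passes, the backups are unverified.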
In poker, people call it resulting. It’s the habit of confusing the quality of the outcome with the quality of the decision. In psychology, people call it a self-serving bias. When things go well, we overestimate the roles of skill and ability and underestimate the role of luck. When things go poorly, on the other hand, it’s all bad luck or external forces.
One of the main reasons legacy modernization projects are hard is that people overvalue the hindsight an existing system offers them.
Scale always involves some luck. You can plan for a certain number of transactions or users, but you can’t really control those factors, especially if you’re building anything that involves the public internet.
With my engineers, I set the expectation that to have a productive, free-flowing debate, we need to be able to sort comments and issues into in-scope and out-of-scope quickly and easily as a team. I call this technique “true but irrelevant,” because I can typically sort meeting information into three buckets: things that are true, things that are false, and things that are true but irrelevant. Irrelevant is just a punchier way of saying out of scope.
A combat medic’s first job is to stop the bleeding, not order a bunch of X-rays and put together a diet and exercise plan. To be effective when you’re coming into a project that has already started, your role needs to adapt. First you need to stop the bleeding, and then you can do your analysis and long-term planning.
Slack started as an online multiplayer video game. Slack was actually the second time its founder had started building an online game only to realize that the real product was something completely different. His earlier startup, Flickr, had the same origin story.
Tightly coupled systems become messy because they accrue debt with each workaround that is deployed. The downsides of tight coupling can be mitigated with engineering standards dictating how to extend, modify, and ultimately play nicely with the coupling. They can also be mitigated by honoring the engineering team’s commitment to refactoring on occasion.
Contract testing is a form of automated testing that checks whether components of a system have broken their data contracts with one another. When breaking up a monolith into services, honoring these contracts, or clearly communicating when they need to be broken, is essential to building, integrating, and maintaining a high-performing system. Contract testing is not a form of formal specification per se, but rolling it out follows roughly the same process. It requires every endpoint to have a spec written in a specific markup language that the contract testing tool can parse and check for […]
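To make the idea concrete, here is a hand-rolled sketch of a contract check, not any particular tool: each endpoint declares a spec (field name to expected type), and the check reports violations when a response breaks that contract. The `/users` endpoint, its fields, and the `SPECS` table are invented; real tools parse specs written in a markup language (such as an OpenAPI document) rather than a Python dict.

```python
# Hypothetical per-endpoint contracts: field name -> expected Python type.
SPECS = {
    "/users": {"id": int, "name": str},
}

def violates_contract(endpoint, response):
    """Return a list of contract violations for a response payload."""
    spec = SPECS[endpoint]
    problems = []
    for field, expected in spec.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems

print(violates_contract("/users", {"id": 7, "name": "Ada"}))  # []
print(violates_contract("/users", {"id": "7"}))  # type error + missing field
```

Run against every service boundary in CI, checks like this turn "we broke a contract" from a production surprise into a failed build.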
“Planning is problem solving, while design is problem setting.” Problem-solving versus problem-setting is the difference between being reactive and being responsive.
When a manager’s prestige is determined by the number of people reporting up to her and the size of her budget, the manager will be incentivized to subdivide design tasks that in turn will be reflected in the efficiency of the technical design—or as Conway put it: “The greatest single common factor behind many poorly designed systems now in existence has been the availability of a design organization in need of work.”
When an organization has no clear career pathway for software engineers, they grow their careers by building their reputations externally. This means getting drawn into the race of being one of the first to prove the production-scale benefits of a new paradigm, language, or technical product.
Organizations end up with patchwork solutions because the tech community rewards explorers. Being among the first with tales of documenting, experimenting, or destroying a piece of technology builds an individual’s prestige.
When an organization provides no pathway to promotion for software engineers, they are incentivized to make technical decisions that emphasize their individual contribution over integrating well into an existing system.
Unfortunately, far too often managers advance in their careers by managing more people. And if the organization isn’t properly controlling for that, system design will be overcomplicated by the need to broadcast importance.
Organizations that are unprepared to grow talent end up with managers who are incentivized to subdivide their teams into more specialized units before there are either enough people or enough work to maintain such a unit. The manager gets to check off the career-building experience of running multiple teams, hiring more engineers, and taking on more ambitious projects, and the needs of the overall architecture are ignored.
Ariely’s research suggests that even evoking the traditional market by offering small financial incentives to work harder causes people to stop thinking about the bonds between them and their colleagues and makes them think about things in terms of a monetary exchange—which is a colder, less personal, and often less emotionally rewarding space.
Note the distinction between good behavior and bad outcomes. When deciding how they should execute on a given set of tasks, workers consider two questions: How does the organization want me to behave? And, will I get punished if things go wrong despite that correct behavior? If you want people to do the right thing despite the risk, you need to accept that failure.
We do postmortems on failures because we’re likely to see them as complex scenarios with a variety of contributing factors. We assume that success happens for simple, straightforward reasons. In reality, success is no more or less complex than failure. You should use the same methodology to learn from success that you use to learn from failure.
Postmortems establish a record about what really happened and how specific actions affected the outcome. They do not document failure; they provide context.
In that situation, my team would typically run two war rooms: one where the engineers were solving problems together and one outfitted with lots of fancy dashboards where I was babysitting the senior executives. The most valuable skill a leader can have is knowing when to get out of the way.
Engineering organizations that maintain a separation between operations and development, for example, inevitably find that their development teams design solutions so that when they go wrong, they impact the operations team first and most severely. Meanwhile, their operations team builds processes that throw barriers in front of development, passing the negative consequences of that to those teams.

