Kindle Notes & Highlights
Read between March 31 and May 16, 2023
Too often, meetings are maladapted attempts to solve problems. So if you want to know what parts of the project are suffering the most, pay attention to what the team is having meetings about, how often meetings are held, and who is being dragged into those meetings. In particular, look for meetings with long invite lists. Large meetings are less effective than small meetings, but they do convincingly spread the blame around by giving everyone the impression that all parties were consulted and all opinions were explored.
Building a product from the beginning with a service-oriented architecture is usually a mistake. Because you don’t have the proper product/market fit figured out yet, integrations and data contracts become a major pain point.
In general, the level of abstraction in your design should be inversely proportional to the number of untested assumptions you’re making. The more abstractions a design includes, the harder it becomes to change APIs without breaking data contracts. The more often you break contracts, the more often a team has to stop new work and redo old work.
At small organizations, we find people doing several different jobs at once, with roles not so clearly defined. Everyone in the same space is using the same resources. In short, small organizations build monoliths because small organizations are monoliths.
Large organizations do well when they transition their monoliths to services, because the problems around communication and knowledge sharing that need to be solved to make complex systems work are problems that large organizations have to solve anyway.
Decisions motivated by wanting to avoid rewriting code later are usually bad decisions. In general, any decision made to please or impress imagined spectators with the superficial elegance of your approach is a bad one. If you’re coming into a project where team members are fixing something that isn’t broken, you can be sure they are doing so because they are afraid of the way their product looks to other people.
If you have hundreds or even thousands of engineers contributing to the same code base, the potential for miscommunication and conflict is almost infinite. Coordinating between teams sharing ownership of the same monolith often pushes organizations back into a traditional release-cycle model, where one team tests and assembles a set of updates that go to production in a giant package. This slows development down, and more important, it slows down rollbacks, which hurts the organization’s ability to respond to failure.
Breaking up the monolith into services that roughly correspond to what each team owns means that each team can control its own deploys. Development speeds up. Add a layer of complexity in the form of formal, testable API specs, and the system can facilitate communication between those teams by policing how they are allowed to change downstream interactions.
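To make "formal, testable API specs" concrete, here is a minimal sketch of a contract check in Python using the jsonschema library. The endpoint, field names, and USER_CONTRACT schema are hypothetical, invented for illustration; the point is that a consumer-visible contract lives in version control, and a change that breaks it fails a test before it reaches production.

    # Minimal contract check (pip install jsonschema).
    from jsonschema import ValidationError, validate

    # Hypothetical published contract for the response of GET /users/{id}.
    USER_CONTRACT = {
        "type": "object",
        "required": ["id", "email"],
        "properties": {
            "id": {"type": "integer"},
            "email": {"type": "string"},
        },
    }

    def honors_contract(response_body: dict) -> bool:
        """Return True if a producer's response still honors the contract."""
        try:
            validate(instance=response_body, schema=USER_CONTRACT)
            return True
        except ValidationError as err:
            print(f"Contract violation: {err.message}")
            return False

    # A quiet rename like 'email' -> 'email_address' is caught immediately:
    assert honors_contract({"id": 7, "email": "a@example.com"})
    assert not honors_contract({"id": 7, "email_address": "a@example.com"})

Run in each team's CI, a check like this is the "policing" the passage describes: teams can deploy independently because the contract, not a meeting, guards the downstream interaction.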
You have to be comfortable with the unknown. You can do that by emphasizing resilience over reliability. Reliability is important, but too many organizations use reliability to aim for perfection, which is the exact opposite of what they should be trying to accomplish.
“Anything over four nines is basically a lie.” The more nines you are trying to guarantee, the more risk-averse engineering teams will become, and the more they will avoid necessary improvements. Remember, to get five nines or more, they have only seconds to respond to incidents. That’s a lot of pressure.
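The arithmetic behind that claim is easy to check. Here is a quick sketch in plain Python (no external dependencies) that converts each availability target into its downtime budget:

    # Downtime budget implied by each availability target ("nines").
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in range(2, 6):
        availability = 1 - 10 ** -nines
        budget_s = SECONDS_PER_YEAR * (1 - availability)
        print(f"{nines} nines ({availability:.4%}): "
              f"{budget_s / 60:8.1f} min/year, {budget_s / 12:7.1f} s/month")

Five nines allows roughly 315 seconds of downtime per year, about 26 seconds a month; a single slow rollback can burn the entire budget.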
Code Yellows are not generally called by engineering managers or directors because the scope of their field of influence should be small enough to manage the problem via other project management strategies. Code Yellows are for systemic problems; a problem that fits entirely within the domain of a single engineering manager without touching or affecting any other group is not systemic.
Popular culture sells the myth of the lazy, stupid, uncaring bureaucrat. It’s easy to dismiss people that way. Buying into the idea that those kinds of problems are really character flaws means you don’t have to recognize that you’ve created an environment where people feel trapped. They are caught between conflicting incentives with no way to win.
However, if they implemented a better process and still failed, they would lose the mental lifeline they were hanging on to. The team doomed itself to failure because its members were afraid of learning that the problem the whole time was them: not their process, not the organization’s denial of resources, not the inexperienced manager.
If the organization is making changes that will not provide enough value to justify their expense, boost the value of those changes by turning them into a vehicle of better engineering practices. If the organization is paralyzed by missing information and unknown complications, promote resilience and eliminate the fear of failure. If problems extend beyond what any one team can solve by itself, allow the organization to temporarily reorganize itself around the problem. If teams are demoralized to the point where they are hurting themselves, challenge other parts of the organization to
…
The simplest form of design exercise is to talk to your user. Doing that is better than doing nothing, but in unstructured conversations, the quality of the feedback can vary. Sometimes users don’t know what they want. Sometimes the user and the researcher use the same words to mean different things. Sometimes the power dynamics between the user and the person conducting the interview are so lopsided that the user simply tells the interviewer what the interviewer wants to hear.
Nothing produces out-of-scope digressions more effectively than having people in meetings who don’t need to be there.
Conway’s law has become a voodoo curse—something that people believe only in retrospect. Few engineers attribute their architecture successes to the structures of their organizations, but when a product is malformed, the explanation of Conway’s law is easily accepted.
Software engineers are incentivized to forgo tried-and-true approaches in favor of new frontiers. Left to their own devices, software engineers will proliferate tools, ignoring feature overlaps for the sake of the one thing tool X does better than tool Y, even when that advantage matters only in one specific situation.
Well-integrated, high-functioning software that is easy to understand usually blends in. Simple solutions do not do much to enhance one’s personal brand. They are rarely worth talking about. Therefore, when an organization provides software engineers no pathway to promotion, they are incentivized to make technical decisions that emphasize their individual contribution over integrating well into an existing system.
The folly of engineering culture is that we are often ashamed of signing up our organization for a future rewrite by picking the right architecture for right now, but we have no misgivings about producing systems that are difficult for others to understand and therefore impossible to maintain.
The systems we like to rewrite from scratch are usually the systems we have been ignoring. We don’t know how likely failure is because we pay attention to these systems only when they fail and forget about them otherwise.
The systems we like to rewrite from scratch also tend to be complex with many layers of abstraction and integrations. When we change something on them, it doesn’t always go smoothly, particularly if we’ve slipped up in our test coverage. The more problems we have making changes, the more we overestimate future failures. The more a system seems brittle, failure-prone, and just impossible to save, the more a full rewrite feels like an easier solution.
It’s the minor adjustments to systems that have not been actively developed in a while that create the impression that failure is inevitable and push otherwise rational engineers toward doing rewrites when rewrites are not necessary.
Coordination requires trust. Given a choice, we prefer to base our trust on the character of people we know, but when we scale to a size where that is not possible anymore, we gradually replace social bonds with process. Typically this happens when the organization has reached the size of around 100 to 150 people.
One of the benefits of microservices, for example, is that they allow many teams to contribute to the same system independently of one another. Whereas a monolith would require coordination in the form of code reviews (a personal, direct interaction between colleagues), service-oriented architecture scales the same guarantees with process. Engineers document contracts and protocols; automation is applied to ensure that those contracts are not violated, and it prescribes a course of action if they are. For that reason, engineers who want to “jump ahead” and build something with microservices from
…
Conway’s law is a tendency, not a commandment. Large, complex organizations can develop fluid and resilient communication pathways; it just requires the right leadership and the right tooling. Reorgs should be undertaken only when an organization’s structure is completely out of alignment with its implementation.
Done well, the roadmap is planned backward: the candidate starts at the end state and identifies smaller and smaller units of change. With each step, we design tests to find weaknesses in the organization’s operational excellence that we can resolve. It’s important that the roadmap is structured around proving we have something with a simple test, rather than around steps that merely assert we do. On large projects, it’s easy for people to become confused or to misreport the ground truth. It is useful to know how a leader would construct a test to verify information.
Conway’s law is ultimately about communication and incentives. The incentive side can be covered by giving people a pathway to prestige and career advancement that complements the modernization effort. The only way to design communication pathways is actually to give people something to communicate about. In each case, we allow the vision for the new organization to reveal itself by designing structures that encourage new communication pathways to form in response to our modernization challenges.
Don’t design the organization; let the organization design itself by choosing a structure that facilitates the communication teams will need to get the job done.
Risk is not a static number on a spreadsheet. It’s a feeling that can be manipulated, and while we may justify that feeling with statistics, probabilities, and facts, our perception of the level of risk often bears no relationship to those data points.
It’s easy for the organization’s rhetoric to be disconnected from the values that govern the work environment. What colleagues pay attention to are the real values of an organization. No matter how passionate or consistent the messaging, attention from colleagues will win out over the speeches.
Punishment and rewards are two sides of the same coin. Rewards have a punitive effect because they, like outright punishment, are manipulative. “Do this and you’ll get that” is not really very different from “Do this or here’s what will happen to you.” In the case of incentives, the reward itself may be highly desired; but by making that bonus contingent on certain behaviors, managers manipulate their subordinates, and that experience of being controlled is likely to assume a punitive quality over time.
When you’re examining why certain failures are considered riskier than others, an important question to ask yourself is this: How do people get seen here? What behaviors and accomplishments give them an opportunity to talk about their ideas and their work with their colleagues and be acknowledged?
The occasional outage or problem with a system, particularly if it is resolved quickly and cleanly, can actually boost the user’s trust and confidence. The technical term for this effect is the service recovery paradox.
What we do know is that recovering fast and being transparent about the nature of the outage and the resolution often improves relationships with stakeholders. Factoring in the boost to psychological safety and productivity that a just culture creates, technical organizations shouldn’t shy away from breaking things on purpose. Under the right conditions, the cost in user trust is minimal, and the benefits might be substantial.
Having a part of a system that no one understands is a weakness, so avoiding the issue for fear of breaking things should not be considered the safer choice. Using failure as a tool to make systems and the organizations that run them stronger is one of the foundational concepts behind resilience engineering.
Is it better to wait for something to fail and hope you have the right resources and expertise at the ready? Or is it better to trigger failure at a time when you can plan resources, expertise, and impact in advance? You don’t know that something doesn’t work the way you intended it to until you try it.
Organizations should not shy away from failure, because failure can never be prevented. It can only be put off for a while or redirected to another part of the system. Eventually, every organization will experience failure. Embracing failure as a source of learning means that your team gains more experience and ultimately gets better at mitigating negative impacts on users. Practicing recovery from failure, therefore, makes the team more valuable.
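One way to practice on your own schedule is controlled fault injection. Below is a minimal sketch, not a production chaos-engineering tool; the CHAOS_ENABLED environment variable, the failure rate, and fetch_user are all invented for illustration. The idea is to fail a known fraction of calls only when explicitly enabled, so the team rehearses recovery at a time when resources, expertise, and impact have been planned in advance.

    # Minimal fault-injection sketch: fail a call path at a controlled rate.
    import functools
    import os
    import random

    def inject_faults(rate: float = 0.05):
        """Raise a synthetic error in a `rate` fraction of calls,
        but only when explicitly enabled (e.g., in staging)."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                enabled = os.environ.get("CHAOS_ENABLED") == "1"
                if enabled and random.random() < rate:
                    raise RuntimeError(f"injected fault in {fn.__name__}")
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    @inject_faults(rate=0.10)
    def fetch_user(user_id: int) -> dict:
        # Stand-in for a real dependency call.
        return {"id": user_id}

Left disabled in production and switched on during a staging exercise or game day, a wrapper like this turns "hope you have the right resources at the ready" into a scheduled drill.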
Technology is never done, but modernization projects can be.
Whenever you can avoid having people argue about principles, philosophies, and other generalities of good engineering versus bad engineering, take advantage of it.
As software engineers, it is easy to fall into the trap of thinking that effective work will always look like writing code, but sometimes you get much closer to your desired outcome by teaching others to do the job rather than doing it yourself.
Just as humans are terrible judges of probability, we’re also terrible judges of time. What feels like ages might only be a few days. By marking time, we can realign our emotional perception of how things are going. Find some way to record what you worked on and when so the team can easily go back and get a full picture of how far they’ve come.
Knowing that such-and-such ticket was closed on a certain day doesn’t necessarily take me back to that moment. Sometimes a date is just a date. When you mark time, do so in a way that evokes memory, that highlights the feeling of distance between that moment and where the team is now.
But postmortems are not specific to failure. If your modernization plan includes modeling your approach after another successful project, consider doing a postmortem on that project’s success instead. Remember that we tend to think of failure as bad luck and success as skill. We do postmortems on failure because we’re likely to see them as complex scenarios with a variety of contributing factors. We assume that success happens for simple, straightforward reasons.
As an industry, we reflect on success but study failure. Sometimes obsessively. I’m suggesting that if you’re modeling your project after another team or another organization’s success, you should devote some time to actually researching how that success came to be in the first place.
The value of the postmortem is not its level of detail, but the role it plays in knowledge sharing. Postmortems are about helping people not involved in the incident avoid the same mistakes. The best postmortems are also distributed outside the organization to cultivate trust through transparency.
When people associate refactoring and necessary migrations with a system somehow being built wrong or breaking, they will put off doing them until things are falling apart. When the process of managing these changes is part of the routine, like getting a haircut or changing the oil in your car, the system can be future-proofed by gradually modernizing it.
As a general rule, we get better at dealing with problems the more often we have to deal with them. The longer the gap between the maturity dates of deteriorations, the more likely it is that knowledge has been lost and critical functionality has been built without taking the inevitable into account.
Don’t build for scale before you have scale. Build something that you will have to redesign later, even if that means building a monolith to start. Build something wrong and update it often.
The secret to building technology “wrong” but in the correct way is to understand that successful complex systems are made up of stable simple systems. Before your system can be scaled with practical approaches, like load balancers, mesh networks, queues, and other trappings of distributed computing, the simple systems have to be stable and performant. Disaster comes from trying to build complex systems right away, neglecting the foundation that determines all system behavior, planned and unexpected.