Dan Kuida’s Kindle Notes & Highlights for An Elegant Puzzle: Systems of Engineering Management

An Elegant Puzzle: Systems of Engineering Management

by Will Larson

Started reading June 11, 2025

The first blog post that I ever wrote was on April 7, 2007, and was titled “Finding Our Programming Flow.” It was not very good. That year I wrote 69 posts, the last being “Miyajima and Hiroshima,” a collection of pictures from a trip I took while teaching English in Japan. The next year, 2008, I wrote 192 posts. The writing still left much to be desired. It took 200 more posts and another decade to cobble together a written voice and to make enough mistakes that my experience might become worth reading. I’m fortunate that that moment coincided with my time at Stripe, an environment where ...more

It was only when I got the opportunity to work at Uber, which was growing its engineering team from 200 to 2,000 over two years, and then at Stripe, which was experiencing similar rapid growth, that I had the opportunity to truly refine my approach to management through exposure to an endless variety of challenges. There are few things peaceful about managing in rapidly growing companies, but I’ve never found anywhere better to learn and to grow.

Organizational design gets the right people in the right places, empowers them to make decisions, and then holds them accountable for their results. Maintained consistently and changed sparingly, nothing else will help you scale more.

An organization is a collection of people working toward a shared goal. Each organization is an exploration of the possible, undertaken together by the ten, the hundred, or the thousand. Initially, I was tempted to glibly write that sometimes organizations work, but the truly extraordinary thing is that all organizations work.

When I have a problem that I want to solve quickly and cheaply, I start thinking about process design. A problem I want to solve permanently and we have time to go slow? That’s a good time to evolve your culture. However, if process is too weak a force, and culture too slow, then organizational design lives between those two.

Managers should support six to eight engineers. This gives them enough time for active coaching, coordinating, and furthering their team’s mission by writing strategies, leading change, and so on.

Ramping up. Managers supporting fewer than four other managers should be in a period of active learning on either the problem domain or on transitioning from supporting engineers to supporting managers. In the steady state, this can lead to folks feeling underutilized, or being tempted to meddle in daily operations.

On-call rotations want eight engineers. For production on-call responsibilities, I’ve found that two-tier 24/7 support requires eight engineers. As teams holding their own pagers have become increasingly mainstream, this has become an important sizing constraint, and I try to ensure that every engineering team’s steady state is eight people. Shared rotations. It is sometimes necessary to pool multiple teams together to reach the eight engineers necessary for a 24/7 on-call rotation. This is an effective intermediate step toward teams owning their own on-call rotations, but it is not a good ...more

I’ve sponsored quite a few teams of one or two people, and each time I’ve regretted it. To repeat: I have regretted it every single time. An important property of teams is that they abstract the complexities of the individuals that compose them. Teams with fewer than four individuals are a sufficiently leaky abstraction that they function indistinguishably from individuals. To reason about a small team’s delivery, you’ll have to know about each on-call shift, vacation, and interruption.

Keep innovation and maintenance together. A frequent practice is to spin up a new team to innovate while existing teams are bogged down in maintenance. I’ve historically done this myself, but I’ve moved toward innovating within existing teams. This requires very deliberate decision-making and some bravery, but in exchange you’ll get higher morale and a culture of learning, and will avoid creating a two-tiered class system of innovators and maintainers.

A team is falling behind if each week their backlog is longer than it was the week before. Typically, people are working extremely hard but not making much progress, morale is low, and your users are vocally dissatisfied.

Teams want to climb from falling behind to innovating, while entropy drags them backward. Each state requires a different tack.

I can’t stress enough that these fixes are slow. This is because systems accumulate months or years of static, and you have to drain that all away. Conversely, the same properties that make these fixes slow to fix make them extremely durable once in effect!

Many folks try to move all teams at the same time, peanut buttering their limited resources, but resist that indecision-framed-as-fairness: it’s not a fair outcome if no one gets anything.

Adding new individuals to a team disrupts that team’s gelling process, so I’ve found it much easier to have rapid growth periods for any given team, followed by consolidation/gelling periods during which the team gels. The organization will never stop growing, but each team will.

Durable excellence. This approach to nurturing great organizations is the opposite of a quick fix. While it’s slow, I’ve found that it consistently leads to enduring, real improvement in the happiness and throughput of an organization. Most importantly, these improvements stick around long enough to compound, creating a durable excellence.

Fundamentally, I believe that sustained productivity comes from high-performing teams, and that disassembling a high-performing team leads to a significant loss of productivity, even if the members are fully retained. In this worldview, high-performing teams are sacred, and I’m quite hesitant to disassemble them.

Another reason that I lean away from moving folks off high-performing teams is that most teams have high fixed costs and relatively small variable costs: moving one person can shift an innovating team back into falling behind, and now neither team is doing particularly well. This is especially true on teams responsible for products and services.

The expected time to complete a new task approaches infinity as a team’s utilization approaches 100 percent, and most teams have many dependencies on other teams. Together, these facts mean you can often slow a team down by shifting resources to it, because doing so creates new upstream constraints.
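A rough way to see why completion time blows up near full utilization is the textbook M/M/1 queueing model. The passage does not name a model, so treating a team as a single-server queue is an assumption made here for illustration only. Under that assumption, mean time in the system is the bare service time divided by (1 - utilization):

```python
def expected_time_in_system(service_time: float, utilization: float) -> float:
    """Mean time a task spends queued plus being worked on, under M/M/1.

    service_time: average time to do one task with no queue (any unit).
    utilization: fraction of capacity already in use, in [0, 1).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


# Illustrative values only: a task that takes one day of uninterrupted work.
for u in (0.5, 0.8, 0.9, 0.99):
    days = expected_time_in_system(1.0, u)
    print(f"utilization {u:.0%}: about {days:.1f} days per task")
```

The point of the sketch is the shape of the curve, not the exact numbers: each step closer to 100 percent costs far more than the one before it.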

The other approach that I’ve seen work well is to rotate individuals for a fixed period into an area that needs help. The fixed duration allows them to retain their identity and membership in their current team, giving their full focus to helping out, rather than splitting their focus between performing the work and finding membership in the new team. This is also a safe way to measure how much slack the team really has!

10%

All real-world systems have some degree of inherent self-healing properties: an overloaded database will slow down enough that someone fixes it, and overwhelmed employees will get slow at finishing work until someone finds a way to help.

10%

That’s less than four hours per engineer per month if you can leverage your entire existing team, but training comes up again here: if it takes you six months to get the average engineer onto your interview loop, each trained engineer is now doing three to four hours of hiring-related work per week, and your trained engineers are down to approximately 0.4 efficiency. The overall team is getting 1.06 engineers’ worth of work out of every three engineers. It’s not just training and hiring, though: For every additional order of magnitude of engineers, you need to design and maintain a new layer ...more

11%

Understanding the overall impact of increased load comes down to a few important trends: Most implemented systems are designed to support one to two orders of magnitude of growth from the current load. Even systems designed for more growth tend to run into limitations within one to two orders of magnitude. If your traffic doubles every six months, then your load increases an order of magnitude every 18 months. (And sometimes new features or products cause load to increase much more quickly.) The cardinality of supported systems increases over time as you add teams, and as “trivial” ...more
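The doubling arithmetic above can be checked in a few lines. This is a sketch assuming smooth exponential growth; the function name is made up for illustration. Three doublings give 8x, which the passage rounds to an order of magnitude, while a full 10x takes closer to 20 months:

```python
import math


def months_to_grow(factor: float, doubling_months: float) -> float:
    """Months for load to grow by `factor`, given a fixed doubling period."""
    return doubling_months * math.log2(factor)


print(months_to_grow(8, 6))             # 18.0 (three doublings)
print(round(months_to_grow(10, 6), 1))  # 19.9 (a full order of magnitude)
```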

11%

However, the real productivity killer is not system rewrites but the migrations that follow those rewrites. Poorly designed migrations expand the consequences of this rewrite loop from the individual teams supporting the systems to the entire surrounding organization.

11%

you only get value from projects when they finish: to make progress, above all else, you must ensure that some of your projects finish.

11%

If your engineer is doing more than three interviews a week, it is a useful act of mercy to give them a month off every three or four months.

12%

The second most effective time thief that I’ve found is ad hoc interruptions: getting pinged on HipChat or Slack, taps on the shoulder, alerts from your on-call system, high-volume email lists, and so on.

12%

With that setup in place, create a rotation for people who are available to answer questions, and train your team not to answer other forms of interruptions. This is remarkably uncomfortable because we want to be helpful humans, but it becomes necessary as the number of interruptions climbs higher.

12%

One specific tool that I’ve found extremely helpful here is an ownership registry, which allows you to look up who owns what, eliminating the frequent “Who owns X?” variety of question. You’ll need this sort of thing to automate paging the right on-call rotation, so you might as well get two useful tools out of it!
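As a minimal sketch of what such an ownership registry might look like, the snippet below maps a resource name to an owning team and its on-call rotation. Every name in it (the resources, teams, rotation identifiers, and the fallback rotation) is hypothetical, and a real registry would live in a shared service or config repository rather than an in-memory dict.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Owner:
    team: str
    oncall_rotation: str  # hypothetical identifier a paging tool would accept


# Hypothetical entries, for illustration only.
REGISTRY: Dict[str, Owner] = {
    "payments-api": Owner(team="payments", oncall_rotation="payments-primary"),
    "search-indexer": Owner(team="search", oncall_rotation="search-primary"),
}


def who_owns(resource: str) -> Optional[Owner]:
    """Answer the "Who owns X?" question; None means nobody is registered."""
    return REGISTRY.get(resource)


def rotation_to_page(resource: str, fallback: str = "catchall-rotation") -> str:
    """Pick the rotation to page, falling back when a resource is unowned."""
    owner = who_owns(resource)
    return owner.oncall_rotation if owner else fallback


print(who_owns("payments-api"))
print(rotation_to_page("unknown-service"))
```

The same lookup serves both uses the passage mentions: humans asking who owns something, and alerting code deciding whom to page.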

12%

Finally, the one thing that I’ve found at companies with very few interruptions and have observed almost nowhere else: really great, consistently available documentation. It’s probably even harder to bootstrap documentation into a non-documenting company than it is to bootstrap unit tests into a non-testing company, but the best solution to frequent interruptions I’ve seen is a culture of documentation, documentation reading, and a documentation search that actually works.

12%

Something that is somewhat ignored a bit here is how to handle urgent project requests when you’re already underwater with your existing work and maintenance. The most valuable skill in this situation is learning to say no in a way that is appropriate to your company’s culture. That probably deserves its own chapter. There are probably some companies where saying no is culturally impossible, and in those places I guess you either learn to say your noes as yeses, or maybe you find a slightly easier environment to participate in. How do you remain productive in times of hypergrowth?

13%

How you respond to this is, in my opinion, the core challenge of leading a large organization. How do you continue to remain emotionally engaged with the challenges faced by individuals you’re responsible to help, when their problem is low in your problems queue? In that moment, do you shrug off the responsibility, either by changing roles or picking powerlessness? Hide in indifference? Become so hard on yourself that you collapse inward?

13%

What I’ve found most successful is to identify a few areas to improve, ensure you’re making progress on those, and give yourself permission to do the rest poorly. Work with your manager to write this up as an explicit plan and agree on what reasonable progress looks like. These issues are still stored with your other bags of risk and responsibility, but you’ve agreed on expectations.

Prioritize and execute

13%

Succession planning is thinking through how the organization would function without you, documenting those gaps, and starting to fill them in. It’s awkward enough to talk about that it doesn’t get much discussion, but it’s a foundational skill for building an enduring organization.

14%

The key tools for leading efficient change are systems thinking, metrics, and vision. When the steps of change are too wide, teams get destabilized, and gaps open within them. In those moments, managers create stability by becoming glue.

15%

Big changes appear to happen in a moment, but if you look closely underneath the big change, there is usually a slow accumulation of small changes.

15%

Since reading Accelerate: The Science of Lean Software and DevOps, by Nicole Forsgren, Gene Kim, and Jez Humble, I’ve spent a lot of time pondering the authors’ definition of velocity. They focus on four measures of developer velocity: Delivery lead time is the time from the creation of code to its use in production. Deployment frequency is how often you deploy code. Change fail rate is how frequently changes fail. Time to restore service is the time spent recovering from defects. The book uses surveys from tens of thousands of organizations to assess each one’s overall productivity and show ...more
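To make the four measures concrete, here is a rough sketch that computes them from a list of deploy records. The record shape (`Deploy` and its fields) and the choice of hours and per-day units are assumptions for illustration; neither book prescribes this representation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Deploy:
    committed_at: datetime                  # when the code was created
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did this change fail in production?
    restored_at: Optional[datetime] = None  # set only once a failure is fixed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def velocity_measures(deploys: List[Deploy], window_days: int) -> Dict[str, Optional[float]]:
    """Compute the four measures over one observation window."""
    if not deploys:
        raise ValueError("need at least one deploy")
    failures = [d for d in deploys if d.failed]
    restores = [hours(d.restored_at - d.deployed_at)
                for d in failures if d.restored_at is not None]
    return {
        "delivery_lead_time_hours": mean(hours(d.deployed_at - d.committed_at) for d in deploys),
        "deploys_per_day": len(deploys) / window_days,
        "change_fail_rate": len(failures) / len(deploys),
        "time_to_restore_hours": mean(restores) if restores else None,
    }


# Made-up sample data: two deploys in one day, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deploy(t0, t0 + timedelta(hours=2)),
    Deploy(t0, t0 + timedelta(hours=4), failed=True,
           restored_at=t0 + timedelta(hours=5)),
]
print(velocity_measures(sample, window_days=1))
```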

16%

Problem discovery. The first phase of a planning cycle is exploring the different problems that you could pick to solve. It’s surprisingly common to skip this phase, but that, unsurprisingly, leads to inertia-driven local optimization. Taking the time to evaluate which problem to solve is one of the best predictors I’ve found of a team’s long-term performance. The themes that I’ve found useful for populating the problem space are: Users’ pain. What are the problems that your users experience? It’s useful to go broad via survey mechanisms, as well as to go deep by interviewing a smaller set of ...more

18%

Solution validation. Once you’ve narrowed down the problem you want to solve, it’s easy to jump directly into execution, but that can make it easy to fall in love with a difficult approach. Instead, I’ve found that it’s well worth it to take the risk out of your approach with an explicit solution validation phase.

18%

The elements that I’ve found effective for solution validation are: Write a customer letter. Write the launch announcement that you would send after finishing the solution. Are you able to write something exciting, useful, and real? It’s much more useful to test it against your actual users than to rely on your intuition. Identify prior art. How do peers across the industry approach this problem? The fact that others have solved a problem in a certain way doesn’t mean that it’s a great way, but it does at least mean it’s possible. A mild caveat: it’s better to rely on people you have some ...more

18%

Strategies are grounded documents which explain the trade-offs and actions that will be taken to address a specific challenge. Visions are aspirational documents that enable individuals who don’t work closely together to make decisions that fit together cleanly.

19%

Good Strategy/Bad Strategy by Richard Rumelt,

19%

When you read bad guiding policies, you think, “so what?” because it’s found a way to justify entrenching the status quo. When you read good guiding policies, you think, “Ah, that’s really going to annoy Anna, Bill, and Claire,” because the approach takes a clear stance on competing goals.

19%

When you apply your guiding policies to your diagnosis, you get your actions. Folks are often comfortable with hard decisions in the abstract, but struggle to translate policies into the specific steps to implement them. This is typically the easiest part to write, but publishing it and following through with it can be a significant test of your commitment.

19%

When you read good, coherent actions, you think, “This is going to be uncomfortable, but I think it can work.” When you read bad ones, you think, “Ah, we got afraid of the consequences, and we aren’t really changing anything.”

19%

Because strategies are specific to a given problem, it’s okay—and even encouraged—to write quite a few of them. Over the past year, I’ve worked with people on strategies for how we partner with other teams, how we manage end-to-end API latency, and how we manage infrastructure costs. I’ve also peered over others’ shoulders as they worked on quite a few more ideas. The act of writing a strategy leads folks through a systematic analysis, so, even if we don’t share them, writing these documents helps us work through quite a few challenges, both overwhelming and mundane.

20%

Test the document! This is a core leadership tool, and your first version will almost certainly be bad. Write a draft, sit down with a few different folks to get their perspectives, then iterate. Keep doing this until you’ve synthesized feedback. If there is feedback you disagree with, embrace the vision as an opportunity to address conflict by explicitly acknowledging disagreements within the vision text.

21%

Bad goals are indistinguishable from numbers. “Our p50 build time will be below two seconds,” or “We’ll finish eight large projects.” You’ll know a goal is just a number when you read it and aren’t sure if it’s ambitious or whether it matters. Good goals are a composition of four specific kinds of numbers: A target states where you want to reach. A baseline identifies where you are today. A trend describes the current velocity. A time frame sets bounds for the change. Put these all together, and a well-structured goal takes the form of: “In Q3, we will reduce time to render our frontpage from ...more
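The four-part structure can be sketched as a small record plus a check for which parts a draft goal is missing. The class and function names are hypothetical, and the sample below reuses the passage's own "bad goal" about build time rather than guessing at the truncated example.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Goal:
    description: str
    target: Optional[str] = None      # where you want to reach
    baseline: Optional[str] = None    # where you are today
    trend: Optional[str] = None       # the current velocity
    time_frame: Optional[str] = None  # bounds for the change


def missing_parts(goal: Goal) -> List[str]:
    """List the goal components that have not been filled in."""
    return [f.name for f in fields(goal)
            if f.name != "description" and getattr(goal, f.name) is None]


# The passage's example of a goal that is "just a number": only a target.
bare_number = Goal(
    description="Our p50 build time will be below two seconds",
    target="below two seconds",
)
print(missing_parts(bare_number))  # ['baseline', 'trend', 'time_frame']
```

A draft with an empty result from `missing_parts` is not automatically a good goal, but a non-empty result is a quick signal that the reader cannot yet tell whether the number is ambitious.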

22%

Infrastructure cost is a great example of a baseline metric. When you’re asked to take responsibility for a company’s overall infrastructure costs, you’re going to start from a goal along the lines of “Maintain infrastructure costs at their current percentage of net revenue of 30 percent.”

23%

What I’ve found effective is to send push notifications, typically email, to teams whose metric has changed recently, both in terms of absolute change and in terms of their benchmarked performance against their cohort.
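One possible shape for that notification logic is sketched below: flag a team when its metric moved by a large relative amount, or when its rank within its cohort shifted. The thresholds, the ranking scheme, and all names are assumptions for illustration, and the actual sending of email is left out.

```python
from typing import Dict, List


def rank_in_cohort(values: Dict[str, float]) -> Dict[str, int]:
    """Rank teams from 1 (lowest value) upward; ties are broken by team name."""
    ordered = sorted(values, key=lambda team: (values[team], team))
    return {team: position + 1 for position, team in enumerate(ordered)}


def teams_to_notify(previous: Dict[str, float],
                    current: Dict[str, float],
                    min_relative_change: float = 0.2,
                    min_rank_shift: int = 2) -> List[str]:
    """Return teams whose metric changed notably, absolutely or versus the cohort."""
    prev_rank = rank_in_cohort(previous)
    curr_rank = rank_in_cohort(current)
    flagged = []
    for team in sorted(current):
        if team not in previous:
            continue  # no history to compare against yet
        old, new = previous[team], current[team]
        if old != 0:
            relative = abs(new - old) / abs(old)
        else:
            relative = 0.0 if new == 0 else float("inf")
        rank_shift = abs(curr_rank[team] - prev_rank[team])
        if relative >= min_relative_change or rank_shift >= min_rank_shift:
            flagged.append(team)
    return flagged


# Made-up monthly cost figures for three teams in one cohort.
last_month = {"a": 100.0, "b": 110.0, "c": 120.0}
this_month = {"a": 100.0, "b": 150.0, "c": 121.0}
print(teams_to_notify(last_month, this_month))  # ['b']
```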
