An Elegant Puzzle: Systems of Engineering Management
Rate it:
Open Preview
Kindle Notes & Highlights
3%
Flag icon
The first blog post that I ever wrote was on April 7, 2007, and was titled “Finding Our Programming Flow.” It was not very good. That year I wrote 69 posts, the last being “Miyajima and Hiroshima,” a collection of pictures from a trip I took while teaching English in Japan. The next year, 2008, I wrote 192 posts. The writing still left much to be desired. It took 200 more posts and another decade to cobble together a written voice and to make enough mistakes that my experience might become worth reading. I’m fortunate that that moment coincided with my time at Stripe, an environment where ...more
4%
Flag icon
It was only when I got the opportunity to work at Uber, which was growing its engineering team from 200 to 2,000 over two years, and then at Stripe, which was experiencing similar rapid growth, that I had the opportunity to truly refine my approach to management through exposure to an endless variety of challenges. There are few things peaceful about managing in rapidly growing companies, but I’ve never found anywhere better to learn and to grow.
4%
Flag icon
Organizational design gets the right people in the right places, empowers them to make decisions, and then holds them accountable for their results. Maintained consistently and changed sparingly, nothing else will help you scale more.
5%
Flag icon
An organization is a collection of people working toward a shared goal. Each organization is an exploration of the possible, undertaken together by the ten, the hundred, or the thousand. Initially, I was tempted to glibly write that sometimes organizations work, but the truly extraordinary thing is that all organizations work.
5%
Flag icon
When I have a problem that I want to solve quickly and cheaply, I start thinking about process design. A problem I want to solve permanently and we have time to go slow? That’s a good time to evolve your culture. However, if process is too weak a force, and culture too slow, then organizational design lives between those two.
5%
Flag icon
Managers should support six to eight engineers This gives them enough time for active coaching, coordinating, and furthering their team’s mission by writing strategies,2 leading change,3 and so on.
6%
Flag icon
Ramping up. Managers supporting fewer than four other managers should be in a period of active learning on either the problem domain or on transitioning from supporting engineers to supporting managers. In the steady state, this can lead to folks feeling underutilized, or being tempted to meddle in daily operations.
6%
Flag icon
On-call rotations want eight engineers For production on-call responsibilities,4 I’ve found that two-tier 24/7 support requires eight engineers. As teams holding their own pagers have become increasingly mainstream, this has become an important sizing constraint, and I try to ensure that every engineering team’s steady state is eight people. Shared rotations. It is sometimes necessary to pool multiple teams together to reach the eight engineers necessary for a 24/7 on-call rotation. This is an effective intermediate step toward teams owning their own on-call rotations, but it is not a good ...more
6%
Flag icon
I’ve sponsored quite a few teams of one or two people, and each time I’ve regretted it. To repeat: I have regretted it every single time. An important property of teams is that they abstract the complexities of the individuals that compose them. Teams with fewer than four individuals are a sufficiently leaky abstraction that they function indistinguishably from individuals. To reason about a small team’s delivery, you’ll have to know about each on-call shift, vacation, and interruption.
6%
Flag icon
Keep innovation and maintenance together. A frequent practice is to spin up a new team to innovate while existing teams are bogged down in maintenance. I’ve historically done this myself, but I’ve moved toward innovating within existing teams.5 This requires very deliberate decision-making and some bravery, but in exchange you’ll get higher morale and a culture of learning, and will avoid creating a two-tiered class system of innovators and maintainers.
7%
Flag icon
A team is falling behind if each week their backlog is longer than it was the week before. Typically, people are working extremely hard but not making much progress, morale is low, and your users are vocally dissatisfied.
7%
Flag icon
Teams want to climb from falling behind to innovating, while entropy drags them backward. Each state requires a different tact.
8%
Flag icon
I can’t stress enough that these fixes are slow. This is because systems accumulate months or years of static, and you have to drain that all away. Conversely, the same properties that make these fixes slow to fix make them extremely durable once in effect!
8%
Flag icon
Many folks try to move all teams at the same time, peanut buttering7 their limited resources, but resist that indecision-framed-as-fairness: it’s not a fair outcome if no one gets anything.
8%
Flag icon
Adding new individuals to a team disrupts that team’s gelling process, so I’ve found it much easier to have rapid growth periods for any given team, followed by consolidation/gelling periods during which the team gels. The organization will never stop growing, but each team will.
8%
Flag icon
Durable excellence This approach to nurturing great organizations is the opposite of a quick fix. While it’s slow, I’ve found that it consistently leads to enduring, real improvement in the happiness and throughput of an organization. Most importantly, these improvements stick around long enough to compound, creating a durable excellence.
8%
Flag icon
Fundamentally, I believe that sustained productivity comes from high-performing teams, and that disassembling a high-performing team leads to a significant loss of productivity, even if the members are fully retained. In this worldview, high-performing teams are sacred, and I’m quite hesitant to disassemble them.
9%
Flag icon
Another reason that I lean away from moving folks off high-performing teams is that most teams have high fixed costs and relatively small variable costs: moving one person can shift an innovating team back into falling behind, and now neither team is doing particularly well. This is especially true on teams responsible for products and services.
9%
Flag icon
The expected time to complete a new task approaches infinity as a team’s utilization approaches 100 percent, and most teams have many dependencies on other teams. Together, these facts mean you can often slow a team down by shifting resources to it, because doing so creates new upstream constraints.
9%
Flag icon
The other approach that I’ve seen work well is to rotate individuals for a fixed period into an area that needs help. The fixed duration allows them to retain their identity and membership in their current team, giving their full focus to helping out, rather than splitting their focus between performing the work and finding membership in the new team. This is also a safe way to measure how much slack the team really has!
10%
Flag icon
All real-world systems have some degree of inherent self-healing properties: an overloaded database will slow down enough that someone fixes it, and overwhelmed employees will get slow at finishing work until someone finds a way to help.
10%
Flag icon
That’s less than four hours per engineer per month if you can leverage your entire existing team, but training comes up again here: if it takes you six months to get the average engineer onto your interview loop, each trained engineer is now doing three to four hours of hiring-related work per week, and your trained engineers are down to approximately 0.4 efficiency. The overall team is getting 1.06 engineers’ worth of work out of every three engineers. It’s not just training and hiring, though: For every additional order of magnitude of engineers, you need to design and maintain a new layer ...more
11%
Flag icon
Understanding the overall impact of increased load comes down to a few important trends: Most system-implemented systems are designed to support one to two orders’ magnitude of growth from the current load. Even systems designed for more growth tend to run into limitations within one to two orders of magnitude. If your traffic doubles every six months, then your load increases an order of magnitude every 18 months. (And sometimes new features or products cause load to increase much more quickly.) The cardinality of supported systems increases over time as you add teams, and as “trivial” ...more
11%
Flag icon
However, the real productivity killer is not system rewrites but the migrations that follow those rewrites. Poorly designed migrations expand the consequences of this rewrite loop from the individual teams supporting the systems to the entire surrounding organization.
11%
Flag icon
you only get value from projects when they finish: to make progress, above all else, you must ensure that some of your projects finish.
11%
Flag icon
If your engineer is doing more than three interviews a week, it is a useful act of mercy to give them a month off every three or four months.
12%
Flag icon
The second most effective time thief that I’ve found is ad hoc interruptions: getting pinged on HipChat or Slack, taps on the shoulder, alerts from your on-call system, high-volume email lists, and so on.
12%
Flag icon
With that setup in place, create a rotation for people who are available to answer questions, and train your team not to answer other forms of interruptions. This is remarkably uncomfortable because we want to be helpful humans, but it becomes necessary as the number of interruptions climbs higher.
12%
Flag icon
One specific tool that I’ve found extremely helpful here is an ownership registry, which allows you to look up who owns what, eliminating the frequent “Who owns X?” variety of question. You’ll need this sort of thing to automate paging the right on-call rotation, so you might as well get two useful tools out of it!
12%
Flag icon
Finally, the one thing that I’ve found at companies with very few interruptions and have observed almost nowhere else: really great, consistently available documentation. It’s probably even harder to bootstrap documentation into a non-documenting company than it is to bootstrap unit tests into a non-testing company, but the best solution to frequent interruptions I’ve seen is a culture of documentation, documentation reading, and a documentation search that actually works.
12%
Flag icon
Something that is somewhat ignored a bit here is how to handle urgent project requests when you’re already underwater with your existing work and maintenance. The most valuable skill in this situation is learning to say no in a way that is appropriate to your company’s culture. That probably deserves its own chapter. There are probably some companies where saying no is culturally impossible, and in those places I guess you either learn to say your noes as yeses, or maybe you find a slightly easier environment to participate in. How do you remain productive in times of hypergrowth?
13%
Flag icon
How you respond to this is, in my opinion, the core challenge of leading a large organization. How do you continue to remain emotionally engaged with the challenges faced by individuals you’re responsible to help, when their problem is low in your problems queue? In that moment, do you shrug off the responsibility, either by changing roles or picking powerlessness? Hide in indifference? Become so hard on yourself that you collapse inward?
13%
Flag icon
What I’ve found most successful is to identify a few areas to improve, ensure you’re making progress on those, and give yourself permission to do the rest poorly. Work with your manager to write this up as an explicit plan and agree on what reasonable progress looks like. These issues are still stored with your other bags of risk and responsibility, but you’ve agreed on expectations.
Dan Kuida
Prioritize and execute
13%
Flag icon
Succession planning is thinking through how the organization would function without you, documenting those gaps, and starting to fill them in. It’s awkward enough to talk about that it doesn’t get much discussion, but it’s a foundational skill for building an enduring organization.
14%
Flag icon
The key tools for leading efficient change are systems thinking, metrics, and vision. When the steps of change are too wide, teams get destabilized, and gaps open within them. In those moments, managers create stability by becoming glue.
15%
Flag icon
Big changes appear to happen in a moment, but if you look closely underneath the big change, there is usually a slow accumulation of small changes.
15%
Flag icon
Since reading Accelerate: The Science of Lean Software and DevOp, by Nicole Forsgren, Gene Kim, and Jez Humble,4 I’ve spent a lot of time pondering the authors’ definition of velocity. They focus on four measures of developer velocity: Delivery lead time is the time from the creation of code to its use in production. Deployment frequency is how often you deploy code. Change fail rate is how frequently changes fail. Time to restore service is the time spent recovering from defects. The book uses surveys from tens of thousands of organizations to assess each one’s overall productivity and show ...more
16%
Flag icon
Problem discovery The first phase of a planning cycle is exploring the different problems that you could pick to solve. It’s surprisingly common to skip this phase, but that, unsurprisingly, leads to inertia-driven local optimization. Taking the time to evaluate which problem to solve is one of the best predictors I’ve found of a team’s long-term performance. The themes that I’ve found useful for populating the problem space are: Users’ pain. What are the problems that your users experience? It’s useful to go broad via survey mechanisms, as well as to go deep by interviewing a smaller set of ...more
This highlight has been truncated due to consecutive passage length restrictions.
18%
Flag icon
Solution validation Once you’ve narrowed down the problem you want to solve, it’s easy to jump directly into execution, but that can make it easy to fall in love with a difficult approach. Instead, I’ve found that it’s well worth it to take the risk out of your approach with an explicit solution validation phase.
18%
Flag icon
The elements that I’ve found effective for solution validation are: Write a customer letter. Write the launch announcement that you would send after finishing the solution. Are you able to write something exciting, useful, and real? It’s much more useful to test it against your actual users than to rely on your intuition. Identify prior art. How do peers across the industry approach this problem? The fact that others have solved a problem in a certain way doesn’t mean that it’s a great way, but it does at least mean it’s possible. A mild caveat: it’s better to rely on people you have some ...more
This highlight has been truncated due to consecutive passage length restrictions.
18%
Flag icon
Strategies are grounded documents which explain the trade-offs and actions that will be taken to address a specific challenge. Visions are aspirational documents that enable individuals who don’t work closely together to make decisions that fit together cleanly.
19%
Flag icon
Good Strategy/Bad Strategy by Richard Rumelt,
19%
Flag icon
When you read bad guiding policies, you think, “so what?” because its found a way to justify entrenching the status quo. When you read good guiding policies, you think, “Ah, that’s really going to annoy Anna, Bill, and Claire,” because the approach takes a clear stance on competing goals.
19%
Flag icon
When you apply your guiding policies to your diagnosis, you get your actions. Folks are often comfortable with hard decisions in the abstract, but struggle to translate policies into the specific steps to implement them. This is typically the easiest part to write, but publishing it and following through with it can be a significant test of your commitment.
19%
Flag icon
When you read good, coherent actions, you think, “This is going to be uncomfortable, but I think it can work.” When you read bad ones, you think, “Ah, we got afraid of the consequences, and we aren’t really changing anything.”
19%
Flag icon
Because strategies are specific to a given problem, it’s okay—and even encouraged—to write quite a few of them. Over the past year, I’ve worked with people on strategies for how we partner with other teams, how we manage end-to-end API latency, and how we manage infrastructure costs.15 I’ve also peered over others’ shoulders as they worked on quite a few more ideas. The act of writing a strategy leads folks through a systematic analysis, so, even if we don’t share them, writing these documents helps us work through quite a few challenges, both overwhelming and mundane.
20%
Flag icon
Test the document! This is a core leadership tool, and your first version will almost certainly be bad. Write a draft, sit down with a few different folks to get their perspectives, then iterate. Keep doing this until you’ve synthesized feedback. If there is feedback you disagree with, embrace the vision as an opportunity to address conflict by explicitly acknowledging disagreements within the vision text.
21%
Flag icon
Bad goals are indistinguishable from numbers. “Our p50 build time will be below two seconds,” or “We’ll finish eight large projects.” You’ll know a goal is just a number when you read it and aren’t sure if it’s ambitious or whether it matters. Good goals are a composition of four specific kinds of numbers: A target states where you want to reach. A baseline identifies where you are today. A trend describes the current velocity. A time frame sets bounds for the change. Put these all together, and a well-structured goal takes the form of: “In Q3, we will reduce time to render our frontpage from ...more
22%
Flag icon
Infrastructure cost is a great example of a baseline metric.20 When you’re asked to take responsibility for a company’s overall infrastructure costs, you’re going to start from a goal along the lines of “Maintain infrastructure costs at their current percentage of net revenue of 30 percent.”
23%
Flag icon
What I’ve found effective is to send push notifications, typically email, to teams whose metric has changed recently, both in terms of absolute change and in terms of their benchmarked performance against their cohort.
« Prev 1