Kindle Notes & Highlights
Read between July 27 - October 15, 2021
Since production is the only place to learn how the software will respond to real-world stimuli, I advocate any approach that begins the learning process as soon as possible.
The worst problem here is that the bug in one system could propagate to all the other affected systems. A better question to ask is, “How do we prevent bugs in one system from affecting everything else?”
Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.
Millions of dollars in image advertising—touting online customer service—can be undone in a few hours by a batch of bad hard drives.
The major dangers to your system’s longevity are memory leaks and data growth. Both kinds of sludge will kill your system in production. Both are rarely caught during testing.
The trouble is that applications never run long enough in the development environment to reveal their longevity bugs.
Triggering a fault opens the crack. Faults become errors, and errors provoke failures. That’s how the cracks propagate.
High interactive complexity arises when systems have enough moving parts and hidden, internal dependencies that most operators’ mental models are either incomplete or just plain wrong.
Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk. Every socket, process, pipe, or remote procedure call can and will hang. Even database calls can hang, in ways obvious and subtle. Every feed into the system can hang it, crash it, or generate other impulses at the worst possible time.
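One standard defense against hanging integration points is to bound every blocking network call with a timeout. A minimal Java sketch; the host, port, and timeout values are only illustrative, not from the book:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            // Bound how long we will wait for the remote side to accept the connection.
            socket.connect(new InetSocketAddress("example.com", 80), 2_000);
            // Bound how long any subsequent read() may block waiting for data.
            socket.setSoTimeout(5_000);
            // ... write the request and read the response here ...
        }
        // Without both bounds, a hung remote service can pin this thread forever.
    }
}
```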
Users are a terrible thing. Systems would be much better off with no users.
Human users have a gift for doing exactly the worst possible thing at the worst possible time.
The weak reference object actually is a bag of holding. It keeps the payload for later use.
What is the point of adding this level of indirection? When memory gets low, the garbage collector is allowed to reclaim any weakly reachable objects.
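A small Java sketch of the indirection being described: the payload stays reachable only through a WeakReference, so the collector may reclaim it under memory pressure. The payload type and size here are illustrative:

```java
import java.lang.ref.WeakReference;

public class WeakRefDemo {
    public static void main(String[] args) {
        byte[] payload = new byte[10 * 1024 * 1024];        // strongly reachable
        WeakReference<byte[]> ref = new WeakReference<>(payload);

        payload = null;   // drop the strong reference; only the weak one remains

        byte[] cached = ref.get();
        if (cached != null) {
            // The collector has not needed the memory yet; use the payload.
        } else {
            // The payload was reclaimed under memory pressure; rebuild or refetch it.
        }
    }
}
```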
On one hand, networks have gotten fast enough that “someone else’s memory” can be faster to access than local disk.
Once you start trying to expose metrics, as I discuss in Designing for Transparency, rolling your own connection pool goes from a fun Computer Science 101 exercise to a tedious grind.
In other words, a lot of "comp sci" exercises probably really are given to undergrads, but the point isn't for them to create a fully baked solution so much as to illustrate how problems are devilishly complicated even when they have the shallow appearance of being simple.
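For illustration, a stripped-down pool that exposes a couple of counters. The class and method names (MeteredPool, checkOut, checkIn) are mine, not the book's, and a real pool would add health checks, eviction, and richer metrics:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Minimal fixed-size pool that counts checkouts and time spent waiting. */
public class MeteredPool<T> {
    private final BlockingQueue<T> idle;
    private final AtomicLong checkouts = new AtomicLong();
    private final AtomicLong waitNanos = new AtomicLong();

    public MeteredPool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Blocks up to the timeout, recording how long callers actually wait. */
    public T checkOut(long timeout, TimeUnit unit) throws InterruptedException {
        long start = System.nanoTime();
        T resource = idle.poll(timeout, unit);   // null means the pool is exhausted
        waitNanos.addAndGet(System.nanoTime() - start);
        if (resource != null) {
            checkouts.incrementAndGet();
        }
        return resource;
    }

    public void checkIn(T resource) {
        idle.offer(resource);
    }

    public long totalCheckouts()  { return checkouts.get(); }
    public long totalWaitMillis() { return TimeUnit.NANOSECONDS.toMillis(waitNanos.get()); }
}
```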
In object theory, the Liskov substitution principle (see Family Values: A Behavioral Notion of Subtyping [LW93]) states that any property that is true about objects of a type T should also be true for objects of any subtype of T.
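A tiny Java illustration of the principle; the class names are invented for this note, not drawn from [LW93]:

```java
// Supertype contract: add(n) increases total() by exactly n.
class Tally {
    protected int total = 0;
    public void add(int n) { total += n; }
    public int total()     { return total; }
}

// Honors the contract: adds behavior (logging) without changing the observable property.
class LoggingTally extends Tally {
    @Override public void add(int n) {
        System.out.println("adding " + n);
        super.add(n);
    }
}

// Violates the contract: silently drops large values, so callers written
// against Tally get wrong results when handed this subtype.
class CappedTally extends Tally {
    @Override public void add(int n) {
        if (n <= 10) super.add(n);
    }
}
```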
The only sensible numbers are “zero,” “one,” and “lots,” so unless your query selects exactly one row, it has the potential to return too many.
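In practice that means capping result sets both in the SQL and at the driver. A JDBC sketch; the table, columns, and LIMIT syntax are illustrative and vary by database:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BoundedQuery {
    // Cap both in SQL and on the driver so a runaway table can't flood the app.
    static final int MAX_ROWS = 1_000;

    public static void fetchRecent(Connection conn) throws SQLException {
        String sql = "SELECT id, status FROM orders ORDER BY created_at DESC LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, MAX_ROWS);
            ps.setMaxRows(MAX_ROWS);           // driver-side cap as a backstop
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process one row at a time instead of materializing them all
                }
            }
        }
    }
}
```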
The Steady State pattern says that for every mechanism that accumulates a resource, some other mechanism must recycle that resource.
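One common Java idiom for pairing accumulation with recycling is a bounded, self-evicting map; this sketch is mine, not the book's example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory cache that recycles its own entries instead of growing forever. */
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);   // access order, so the least recently used entry goes first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;   // the recycling mechanism paired with accumulation
    }
}
```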
You will often find a tension between definitions of “safe.” Shutting down instances is unsafe for availability, while spinning up instances is unsafe for cost. These forces don’t cancel each other out. Instead, they define a U-shaped curve where going too far in either direction is bad.
You can only measure the response time on requests that are done.
In the car business, they say the engine needs fuel, fire, and air to work. Our version of that is code, config, and connection.
It is vital to establish a strong “chain of custody” that stretches from the developer through to the production instance. It must be impossible for an unauthorized party to sneak code into your system.
Given a system on the verge of failure, administrators in operations have to proceed through observation, analysis, hypothesis, and action very quickly. If that action appears to resolve the issue, it becomes part of the lore, possibly even part of a documented knowledge base. Who says it was the right action, though? What if it’s just a coincidence?
Some developers from Netflix have quipped that Netflix is a monitoring system that streams movies as a side effect.
Instances are the basic blocks that make up our system. They’re like cobblestone Minecraft blocks—not that interesting by themselves, but we can make amazing things out of them.
Every failing system starts with a queue backing up somewhere.
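Which is an argument for making every queue finite and observable. A small Java sketch; the names and limits are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedIntake {
    // A finite queue makes back-pressure visible instead of letting work pile up unseen.
    private final BlockingQueue<String> work = new ArrayBlockingQueue<>(1_000);

    public boolean submit(String job) throws InterruptedException {
        // Wait briefly, then shed load rather than queueing forever.
        boolean accepted = work.offer(job, 100, TimeUnit.MILLISECONDS);
        if (!accepted) {
            // Signal the caller (or a metric) that the system is backing up.
        }
        return accepted;
    }
}
```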
It takes time to shove packets through the wires.
Cost also comes from operations. The harder your software is to operate, the more time it takes from people.
The idea of monitoring, log collection, alerting, and dashboarding as being about economic value more than technical availability may be unfamiliar. Even so, if you adopt this perspective, you’ll find that it is easy to make decisions about what to monitor, how much data to collect, and how to represent it.
The bottleneck that burns you next year probably doesn’t exist right now.
My advice is to dodge the analysis trap. Don’t try to find the best tool, but instead pick one that suffices and get good with it.
Between the time a developer commits code to the repository and the time it runs in production, code is a pure liability. Undeployed code is unfinished inventory. It has unknown bugs. It may break scaling or cause production downtime.
Forget that. You need to test on all the weird data.
It turns out that a successful service needs to be changed more often than a useless one.
Releases should be about as big an event as getting a haircut (or compiling a new kernel, for you gray-ponytailed UNIX hackers who don’t require haircuts).
In The Evolution of Useful Things [Pet92], Henry Petroski argues that the old dictum “Form follows function” is false. In its place, he offers the rule of design evolution, “Form follows failure.” That is, changes in the design of such commonplace things as forks and paper clips are motivated more by the things early designs do poorly than those things they do well.
It turns out that like concurrency, safety is not a composable property.
We want a continuous low level of breakage to make sure our system can handle the big things.
Also, as Charity Majors, CEO of Honeycomb.io, says, “If you have a wall full of green dashboards, that means your monitoring tools aren’t good enough.” There’s always something weird going on.
What happens when your single point of failure goes home every evening?

