Release It!: Design and Deploy Production-Ready Software
Kindle Notes & Highlights
4%
Since production is the only place to learn how the software will respond to real-world stimuli, I advocate any approach that begins the learning process as soon as possible.
6%
A postmortem is like a murder mystery.
Rob
…where you are both the detective AND the murderer!
9%
The worst problem here is that the bug in one system could propagate to all the other affected systems. A better question to ask is, “How do we prevent bugs in one system from affecting everything else?”
9%
Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.
9%
Millions of dollars in image advertising—touting online customer service—can be undone in a few hours by a batch of bad hard drives.
10%
The major dangers to your system’s longevity are memory leaks and data growth. Both kinds of sludge will kill your system in production. Both are rarely caught during testing.
10%
The trouble is that applications never run long enough in the development environment to reveal their longevity bugs.
11%
Triggering a fault opens the crack. Faults become errors, and errors provoke failures. That’s how the cracks propagate.
12%
High interactive complexity arises when systems have enough moving parts and hidden, internal dependencies that most operators’ mental models are either incomplete or just plain wrong.
13%
Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk. Every socket, process, pipe, or remote procedure call can and will hang. Even database calls can hang, in ways obvious and subtle. Every feed into the system can hang it, crash it, or generate other impulses at the worst possible time.
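One way to keep a hung socket from hanging the caller is to put a timeout on every blocking call. The sketch below is only an illustration of that idea, not code from the book; the function name `read_with_timeout` and its parameters are hypothetical.

```python
import socket


def read_with_timeout(host, port, timeout_seconds, max_bytes=1024):
    """Return bytes read from the peer, or None if it does not answer in time.

    Hypothetical helper for illustration only.
    """
    try:
        # The timeout given to create_connection() stays set on the socket,
        # so it bounds both the connect and the recv() below.
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            return conn.recv(max_bytes)
    except socket.timeout:
        # The remote end accepted the connection but never sent anything.
        return None
```

A caller that gets `None` back can fail fast or fall back to a default instead of blocking a request thread indefinitely.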
18%
Users are a terrible thing. Systems would be much better off with no users.
18%
Human users have a gift for doing exactly the worst possible thing at the worst possible time.
18%
The weak reference object actually is a bag of holding. It keeps the payload for later use.
18%
What is the point of adding this level of indirection? When memory gets low, the garbage collector is allowed to reclaim any weakly reachable objects.
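The indirection can be sketched with Python's `weakref` module. This is an assumption-laden illustration rather than the book's example: Python clears a weak reference once no strong references to the payload remain, not in response to low memory as the highlight describes, and the immediate reclamation shown here relies on CPython's reference counting. The `Payload` class and `demo` function are hypothetical names.

```python
import gc
import weakref


class Payload:
    """Stand-in for an expensive object we would like to keep if possible."""

    def __init__(self, data):
        self.data = data


def demo():
    payload = Payload("expensive result")
    ref = weakref.ref(payload)          # the "bag of holding"
    alive_before = ref() is payload     # dereferencing yields the payload
    del payload                         # drop the only strong reference
    gc.collect()                        # not needed in CPython, harmless elsewhere
    reclaimed_after = ref() is None     # the reference now yields None
    return alive_before, reclaimed_after
```

The caller always has to handle the `None` case, because the payload may have been reclaimed between any two dereferences.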
19%
On one hand, networks have gotten fast enough that “someone else’s memory” can be faster to access than local disk.
22%
Once you start trying to expose metrics, as I discuss in Designing for Transparency, rolling your own connection pool goes from a fun Computer Science 101 exercise to a tedious grind.
Rob
i.o.w. a lot of “comp sci” exercises probably really ARE given to undergrads but the point isn’t for them to create a fully baked solution so much as it is to illustrate how problems are devilishly complicated even when they have the shallow appearance of being simple
22%
In object theory, the Liskov substitution principle (see Family Values: A Behavioral Notion of Subtyping [LW93]) states that any property that is true about objects of a type T should also be true for objects of any subtype of T.
27%
When a bunch of servers impose this transient load all at once, it’s called a dogpile.
Rob
I like “thundering herd” but you do you man
27%
Use random clock slew to diffuse the demand.
Rob
…clock SKEW…?
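Whatever the term, the idea of spreading demand with randomness can be sketched as a fixed interval plus a random offset, so that servers started together do not all fire at the same instant. The function `jittered_delay` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def jittered_delay(base_seconds, max_jitter_seconds, rng=random):
    """Return the base delay plus a random offset in [0, max_jitter_seconds].

    `rng` can be the random module or a random.Random instance
    (useful for reproducible tests).
    """
    return base_seconds + rng.uniform(0.0, max_jitter_seconds)
```

For example, each server might sleep for `jittered_delay(60, 10)` seconds before refreshing a shared cache entry, which spreads the refreshes over a ten-second window instead of a single moment.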
30%
The only sensible numbers are “zero,” “one,” and “lots,” so unless your query selects exactly one row, it has the potential to return too many.
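One common defence against unbounded result sets is to put an explicit limit on every query that can return "lots". The sketch below uses SQLite for convenience; the `events` table, `make_demo_db`, and `fetch_bounded` are hypothetical names, not anything from the book.

```python
import sqlite3


def make_demo_db(row_count):
    """Build an in-memory database with a hypothetical `events` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, note TEXT)")
    conn.executemany(
        "INSERT INTO events (note) VALUES (?)",
        [("event %d" % i,) for i in range(row_count)],
    )
    return conn


def fetch_bounded(conn, max_rows):
    """Return at most max_rows rows plus a flag saying whether more exist."""
    # Ask for one extra row so truncation can be detected without a COUNT(*).
    cursor = conn.execute(
        "SELECT id, note FROM events ORDER BY id LIMIT ?", (max_rows + 1,)
    )
    rows = cursor.fetchall()
    truncated = len(rows) > max_rows
    return rows[:max_rows], truncated
```

The caller then decides what to do when `truncated` is true, for instance paging or rejecting the request, rather than discovering the problem when memory runs out.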
34%
The Steady State pattern says that for every mechanism that accumulates a resource, some other mechanism must recycle that resource.
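As a small illustration of pairing accumulation with recycling, here is a sketch of an in-memory cache that evicts its oldest entry whenever it grows past a fixed size. The `BoundedCache` class is hypothetical, and oldest-written eviction is just one possible recycling policy.

```python
from collections import OrderedDict


class BoundedCache:
    """Cache whose size never exceeds max_entries (oldest-written evicted first)."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def put(self, key, value):
        # Accumulating mechanism: every put may add an entry.
        self._entries[key] = value
        self._entries.move_to_end(key)
        # Recycling mechanism: drop the oldest entries once over the limit.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def get(self, key, default=None):
        return self._entries.get(key, default)

    def __len__(self):
        return len(self._entries)
```

Because eviction happens inside `put`, the cache cannot grow without bound no matter how long the process runs.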
40%
You will often find a tension between definitions of “safe.” Shutting down instances is unsafe for availability, while spinning up instances is unsafe for cost. These forces don’t cancel each other out. Instead, they define a U-shaped curve where going too far in either direction is bad.
43%
You can only measure the response time on requests that are done.
48%
In the car business, they say the engine needs fuel, fire, and air to work. Our version of that is code, config, and connection.
Rob
And COFFEE!
49%
It is vital to establish a strong “chain of custody” that stretches from the developer through to the production instance. It must be impossible for an unauthorized party to sneak code into your system.
52%
Given a system on the verge of failure, administrators in operations have to proceed through observation, analysis, hypothesis, and action very quickly. If that action appears to resolve the issue, it becomes part of the lore, possibly even part of a documented knowledge base. Who says it was the right action, though? What if it’s just a coincidence?
Rob
Cargo Cult Ops
52%
Some developers from Netflix have quipped that Netflix is a monitoring system that streams movies as a side effect.
53%
Instances are the basic blocks that make up our system. They’re like cobblestone Minecraft blocks—not that interesting by themselves, but we can make amazing things out of them.
56%
Every failing system starts with a queue backing up somewhere.
56%
It takes time to shove packets through the wires.
62%
Cost also comes from operations. The harder your software is to operate, the more time it takes from people.
62%
The idea of monitoring, log collection, alerting, and dashboarding as being about economic value more than technical availability may be unfamiliar. Even so, if you adopt this perspective, you’ll find that it is easy to make decisions about what to monitor, how much data to collect, and how to represent it.
62%
The bottleneck that burns you next year probably doesn’t exist right now.
74%
My advice is to dodge the analysis trap. Don’t try to find the best tool, but instead pick one that suffices and get good with it.
75%
Between the time a developer commits code to the repository and the time it runs in production, code is a pure liability. Undeployed code is unfinished inventory. It has unknown bugs. It may break scaling or cause production downtime.
75%
internalize the motto, “If it hurts, do it more often.”
Rob
This includes your documentation.
76%
Forget that. You need to test on all the weird data.
80%
a huge number of edge cases
Rob
How many edge cases do you get before they’re no longer edge cases?
80%
It turns out that a successful service needs to be changed more often than a useless one.
87%
Releases should be about as big an event as getting a haircut (or compiling a new kernel, for you gray-ponytailed UNIX hackers who don’t require haircuts).
88%
Give up, shut down the whole company, and open a hot dog and doughnut shop in Fiji.
Rob
don't mind if I do?
89%
In The Evolution of Useful Things [Pet92], Henry Petroski argues that the old dictum “Form follows function” is false. In its place, he offers the rule of design evolution, “Form follows failure.” That is, changes in the design of such commonplace things as forks and paper clips are motivated more by the things early designs do poorly than those things they do well.
91%
Global state is the most insidious form of implicit context.
Rob
As every old school JavaScript developer knows all too well.
93%
Avoid frameworks that require code generation based on a schema.
Rob
LOOKING AT YOU, THRIFT
95%
It turns out that like concurrency, safety is not a composable property.
96%
We want a continuous low level of breakage to make sure our system can handle the big things.
97%
Also, as Charity Majors, CEO of Honeycomb.io says, “If you have a wall full of green dashboards, that means your monitoring tools aren’t good enough.” There’s always something weird going on.
98%
What happens when your single point of failure goes home every evening?