Kindle Notes & Highlights
Read between July 27 - October 15, 2021
Since production is the only place to learn how the software will respond to real-world stimuli, I advocate any approach that begins the learning process as soon as possible.
The worst problem here is that the bug in one system could propagate to all the other affected systems. A better question to ask is, “How do we prevent bugs in one system from affecting everything else?”
Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.
Millions of dollars in image advertising—touting online customer service—can be undone in a few hours by a batch of bad hard drives.
The major dangers to your system’s longevity are memory leaks and data growth. Both kinds of sludge will kill your system in production. Both are rarely caught during testing.
The trouble is that applications never run long enough in the development environment to reveal their longevity bugs.
Triggering a fault opens the crack. Faults become errors, and errors provoke failures. That’s how the cracks propagate.
High interactive complexity arises when systems have enough moving parts and hidden, internal dependencies that most operators’ mental models are either incomplete or just plain wrong.
Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk. Every socket, process, pipe, or remote procedure call can and will hang. Even database calls can hang, in ways obvious and subtle. Every feed into the system can hang it, crash it, or generate other impulses at the worst possible time.
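One standard defense against hanging integration points is to bound every blocking network call with a timeout. A minimal Java sketch; the host, port, and timeout values are only illustrative, not from the book:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            // Bound how long we will wait for the remote side to accept the connection.
            socket.connect(new InetSocketAddress("example.com", 80), 2_000);
            // Bound how long any subsequent read() may block waiting for data.
            socket.setSoTimeout(5_000);
            // ... write the request and read the response here ...
        }
        // Without both bounds, a hung remote service can pin this thread forever.
    }
}
```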
Users are a terrible thing. Systems would be much better off with no users.
Human users have a gift for doing exactly the worst possible thing at the worst possible time.
The weak reference object actually is a bag of holding. It keeps the payload for later use.
What is the point of adding this level of indirection? When memory gets low, the garbage collector is allowed to reclaim any weakly reachable objects.
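A small Java sketch of the indirection being described: the payload stays reachable only through a WeakReference, so the collector may reclaim it under memory pressure. The payload type and size here are illustrative:

```java
import java.lang.ref.WeakReference;

public class WeakRefDemo {
    public static void main(String[] args) {
        byte[] payload = new byte[10 * 1024 * 1024];        // strongly reachable
        WeakReference<byte[]> ref = new WeakReference<>(payload);

        payload = null;   // drop the strong reference; only the weak one remains

        byte[] cached = ref.get();
        if (cached != null) {
            // The collector has not needed the memory yet; use the payload.
        } else {
            // The payload was reclaimed under memory pressure; rebuild or refetch it.
        }
    }
}
```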
On one hand, networks have gotten fast enough that “someone else’s memory” can be faster to access than local disk.
Once you start trying to expose metrics, as I discuss in Designing for Transparency, rolling your own connection pool goes from a fun Computer Science 101 exercise to a tedious grind.
In other words, a lot of "comp sci" exercises probably really are given to undergrads, but the point isn't for them to create a fully baked solution so much as to illustrate how problems are devilishly complicated even when they have the shallow appearance of being simple.
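For illustration, a stripped-down pool that exposes a couple of counters. The class and method names (MeteredPool, checkOut, checkIn) are mine, not the book's, and a real pool would add health checks, eviction, and richer metrics:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Minimal fixed-size pool that counts checkouts and time spent waiting. */
public class MeteredPool<T> {
    private final BlockingQueue<T> idle;
    private final AtomicLong checkouts = new AtomicLong();
    private final AtomicLong waitNanos = new AtomicLong();

    public MeteredPool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Blocks up to the timeout, recording how long callers actually wait. */
    public T checkOut(long timeout, TimeUnit unit) throws InterruptedException {
        long start = System.nanoTime();
        T resource = idle.poll(timeout, unit);   // null means the pool is exhausted
        waitNanos.addAndGet(System.nanoTime() - start);
        if (resource != null) {
            checkouts.incrementAndGet();
        }
        return resource;
    }

    public void checkIn(T resource) {
        idle.offer(resource);
    }

    public long totalCheckouts()  { return checkouts.get(); }
    public long totalWaitMillis() { return TimeUnit.NANOSECONDS.toMillis(waitNanos.get()); }
}
```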
In object theory, the Liskov substitution principle (see Family Values: A Behavioral Notion of Subtyping [LW93]) states that any property that is true about objects of a type T should also be true for objects of any subtype of T.
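A tiny Java illustration of the principle; the class names are invented for this note, not drawn from [LW93]:

```java
// Supertype contract: add(n) increases total() by exactly n.
class Tally {
    protected int total = 0;
    public void add(int n) { total += n; }
    public int total()     { return total; }
}

// Honors the contract: adds behavior (logging) without changing the observable property.
class LoggingTally extends Tally {
    @Override public void add(int n) {
        System.out.println("adding " + n);
        super.add(n);
    }
}

// Violates the contract: silently drops large values, so callers written
// against Tally get wrong results when handed this subtype.
class CappedTally extends Tally {
    @Override public void add(int n) {
        if (n <= 10) super.add(n);
    }
}
```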
The only sensible numbers are “zero,” “one,” and “lots,” so unless your query selects exactly one row, it has the potential to return too many.
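In practice that means capping result sets both in the SQL and at the driver. A JDBC sketch; the table, columns, and LIMIT syntax are illustrative and vary by database:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BoundedQuery {
    // Cap both in SQL and on the driver so a runaway table can't flood the app.
    static final int MAX_ROWS = 1_000;

    public static void fetchRecent(Connection conn) throws SQLException {
        String sql = "SELECT id, status FROM orders ORDER BY created_at DESC LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, MAX_ROWS);
            ps.setMaxRows(MAX_ROWS);           // driver-side cap as a backstop
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process one row at a time instead of materializing them all
                }
            }
        }
    }
}
```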
The Steady State pattern says that for every mechanism that accumulates a resource, some other mechanism must recycle that resource.
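One common Java idiom for pairing accumulation with recycling is a bounded, self-evicting map; this sketch is mine, not the book's example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory cache that recycles its own entries instead of growing forever. */
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);   // access order, so the least recently used entry goes first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;   // the recycling mechanism paired with accumulation
    }
}
```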
You will often find a tension between definitions of “safe.” Shutting down instances is unsafe for availability, while spinning up instances is unsafe for cost. These forces don’t cancel each other out. Instead, they define a U-shaped curve where going too far in either direction is bad.
You can only measure the response time on requests that are done.
In the car business, they say the engine needs fuel, fire, and air to work. Our version of that is code, config, and connection.
It is vital to establish a strong “chain of custody” that stretches from the developer through to the production instance. It must be impossible for an unauthorized party to sneak code into your system.
Given a system on the verge of failure, administrators in operations have to proceed through observation, analysis, hypothesis, and action very quickly. If that action appears to resolve the issue, it becomes part of the lore, possibly even part of a documented knowledge base. Who says it was the right action, though? What if it’s just a coincidence?
Some developers from Netflix have quipped that Netflix is a monitoring system that streams movies as a side effect.
Instances are the basic blocks that make up our system. They’re like cobblestone Minecraft blocks—not that interesting by themselves, but we can make amazing things out of them.
Every failing system starts with a queue backing up somewhere.
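Which is an argument for making every queue finite and observable. A small Java sketch; the names and limits are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedIntake {
    // A finite queue makes back-pressure visible instead of letting work pile up unseen.
    private final BlockingQueue<String> work = new ArrayBlockingQueue<>(1_000);

    public boolean submit(String job) throws InterruptedException {
        // Wait briefly, then shed load rather than queueing forever.
        boolean accepted = work.offer(job, 100, TimeUnit.MILLISECONDS);
        if (!accepted) {
            // Signal the caller (or a metric) that the system is backing up.
        }
        return accepted;
    }
}
```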
It takes time to shove packets through the wires.
Cost also comes from operations. The harder your software is to operate, the more time it takes from people.
The idea of monitoring, log collection, alerting, and dashboarding as being about economic value more than technical availability may be unfamiliar. Even so, if you adopt this perspective, you’ll find that it is easy to make decisions about what to monitor, how much data to collect, and how to represent it.
The bottleneck that burns you next year probably doesn’t exist right now.
My advice is to dodge the analysis trap. Don’t try to find the best tool, but instead pick one that suffices and get good with it.
Between the time a developer commits code to the repository and the time it runs in production, code is a pure liability. Undeployed code is unfinished inventory. It has unknown bugs. It may break scaling or cause production downtime.
Forget that. You need to test on all the weird data.
It turns out that a successful service needs to be changed more often than a useless one.
Releases should be about as big an event as getting a haircut (or compiling a new kernel, for you gray-ponytailed UNIX hackers who don’t require haircuts).
In The Evolution of Useful Things [Pet92], Henry Petroski argues that the old dictum “Form follows function” is false. In its place, he offers the rule of design evolution, “Form follows failure.” That is, changes in the design of such commonplace things as forks and paper clips are motivated more by the things early designs do poorly than those things they do well.
It turns out that like concurrency, safety is not a composable property.
We want a continuous low level of breakage to make sure our system can handle the big things.
Also, as Charity Majors, CEO of Honeycomb.io, says, “If you have a wall full of green dashboards, that means your monitoring tools aren’t good enough.” There’s always something weird going on.
What happens when your single point of failure goes home every evening?

