Kindle Notes & Highlights
Read between August 10 - September 28, 2017
Each feature of our service can be individually enabled or disabled. If a feature turns out to have negative consequences, such as security holes or unexpectedly bad performance, it can be disabled without deploying a different software release.
This can be a little tricky because of the combinatorics problem of having many switches that each introduce a new degree of freedom. How do you build confidence that any constellation of switch states is healthy, valid, error free? Defining the 'feature' lines appropriately here can be critical.
This pattern also permits some rather sophisticated latency management. Suppose this system is expected to return a result in 200 ms or less. If one of the backends is slow for some reason, the frontend doesn’t have to wait for it. If it takes 10 ms to compose and send the resulting HTML, at 190 ms the frontend can give up on the slow backends and generate the page with the information it has. The ability to manage a latency time budget like that can be very powerful. For example, if the advertisement system is slow, search results can be displayed without any ads.
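Here's a minimal sketch of that latency-budget idea, with an invented 200 ms budget and placeholder backend calls (nothing here is from the book):

```python
# Minimal sketch of a latency budget (invented timings and backend calls, not the
# book's code). The frontend queries backends in parallel, waits until the budget
# is nearly spent, and renders with whatever has arrived; the slow ads backend is
# simply dropped from the page.
import concurrent.futures
import time

TOTAL_BUDGET_S = 0.200   # 200 ms end-to-end target
RENDER_COST_S = 0.010    # reserve 10 ms to compose and send the HTML
POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)   # long-lived, shared

def fetch_results(query):
    time.sleep(0.050)                       # placeholder: fast search backend
    return ["result 1", "result 2"]

def fetch_ads(query):
    time.sleep(0.500)                       # placeholder: ads backend is slow today
    return ["an ad"]

def render_page(query):
    deadline = time.monotonic() + TOTAL_BUDGET_S - RENDER_COST_S
    futures = {"search": POOL.submit(fetch_results, query),
               "ads": POOL.submit(fetch_ads, query)}
    page = {}
    for name, future in futures.items():
        remaining = deadline - time.monotonic()
        try:
            page[name] = future.result(timeout=max(remaining, 0))
        except concurrent.futures.TimeoutError:
            pass                            # give up on this backend; render without it
    return page

print(render_page("kittens"))               # {'search': [...]}; no 'ads' key
```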
…later. It must be possible to perform the following tasks:
• Make a backup of a configuration and restore it
• View the difference between one archived copy and another revision
• Archive the running configuration without taking the system down
Should also be able to review the state of config at an arbitrary point in time (e.g., at the time of a failure or error spike). This should also let you determine if/when a config change occurred coincident with a problem.
A common issue is that a user’s login state is stored locally by a web server but not communicated to replicas. When the load balancer receives future requests from the same user, if they are sent to a different replica the user will be asked to log in again. This will repeat until the user has logged into every replica. Solutions to this problem are discussed in Section 4.2.3.
Worked with a customer many years ago whose app servers worked this way. On the first request to www, the customer was redirected to another host and remained sticky there for the remainder of the session. I think the only way to maintain state outside of the app server was in the db, but the app server may have provided some in-memory cache that forced additional session stickiness at the next layer of the stack as well... Netscape Application Server, anyone?
This is years before the advent of memcached (2003). What cache server approach did people use before that? Or in dial-up days it didn't matter and it didn't exist?
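For what it's worth, the usual fix these days is to keep session state in a shared store so any replica can serve any request. A toy sketch of that shape (the dict stands in for memcached, Redis, or a DB table; this is my assumption, not necessarily the book's Section 4.2.3 solution):

```python
# Toy sketch of externalized session state, so any replica can serve any request.
# SESSION_STORE stands in for a shared backend (memcached, Redis, or a DB table);
# everything here is illustrative.
import uuid

SESSION_STORE = {}   # in production this lives outside the web servers

def login(username):
    session_id = str(uuid.uuid4())
    SESSION_STORE[session_id] = {"user": username}
    return session_id                      # handed back to the browser as a cookie

def handle_request(session_id):
    session = SESSION_STORE.get(session_id)
    if session is None:
        return "302 redirect to /login"    # unknown or expired session
    return f"200 OK, hello {session['user']}"   # works the same on every replica

sid = login("alice")
print(handle_request(sid))
```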
A configuration setting (a toggle) should be present to enable or disable each new feature.
It's probably healthy to remove the switches after some time. I haven't seen a good trigger to spark the removal, however, other than putting it on a project's done-done-done criteria list.
Can be healthy to decide which switches should remain for load-shedding or feature disabling (e.g., if a feature is broken and affecting other portions of the app somehow), but don't leave them all in place forever. Configuration represents an opportunity for dev/test/prod environment entropy, even when the code is identical.
More sophisticated toggles can be enabled for particular groups of users. A feature may be enabled for a small group of trusted testers who receive early access. Once it is validated, the toggle can enable the feature for all users, perhaps by enabling it for successively larger groups.
Yes, this is helpful. I'd also add an easy way to opt people into the feature (including self-opt-in).
Is it wise to keep A/B testing completely separate? I think so, as it might be easy to foul up a test if settings get changed (or if self-opt-in is allowed, for example).
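Rough sketch of what such a per-group toggle lookup could look like, with trusted testers, a rollout percentage, and self opt-in. The flag names, groups, and hashing scheme are all made up:

```python
# Per-group toggle sketch: trusted testers first, then a growing percentage,
# plus self-service opt-in as suggested above. All names here are illustrative.
import hashlib

TRUSTED_TESTERS = {"alice", "bob"}
FLAGS = {
    "new_checkout": {
        "testers_enabled": True,
        "rollout_percent": 10,        # raise gradually: 1 -> 10 -> 50 -> 100
        "opt_in_users": {"carol"},    # self-service opt-in list
    },
}

def bucket(user, flag):
    # Stable 0-99 bucket per (user, flag) so a user stays in the same group.
    digest = hashlib.sha256(f"{flag}:{user}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag, user):
    cfg = FLAGS[flag]
    if user in cfg["opt_in_users"]:
        return True
    if cfg["testers_enabled"] and user in TRUSTED_TESTERS:
        return True
    return bucket(user, flag) < cfg["rollout_percent"]

print(is_enabled("new_checkout", "alice"))   # True: trusted tester
print(is_enabled("new_checkout", "dave"))    # depends on dave's bucket vs. 10%
```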
Graceful degradation means software acts differently when it is becoming overloaded or when systems it depends on are down. For example, a web site might have two user interfaces: one is rich and full of images, while the other is lightweight and all text. Normally users receive the rich interface. However, if the system is overloaded or at risk of hitting bandwidth limits, it switches to the lightweight mode.
This is often not built in from the beginning, as the app may depend on a single db/cluster, and it's hard to know where the sensible lines to tolerate an outage are. So I have a feeling this is most often done retroactively. Any good methodologies for identifying opportunities to gracefully degrade? I've seen the pattern in which the team adds degradation after each relevant failure. Not ideal, but less likely to be wasted effort, though it might be a while until the same failure/outage occurs again.
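In the simplest case a degraded mode is just a check at render time, something like this sketch (the load probe and threshold are placeholders; a real system might key off queue depth, error rates, or a flag set by monitoring):

```python
# Graceful-degradation sketch: serve the lightweight, text-only page when the
# system looks overloaded. os.getloadavg is Unix-only; the threshold is arbitrary.
import os

MAX_HEALTHY_LOAD = 8.0

def overloaded():
    one_minute_load, _, _ = os.getloadavg()
    return one_minute_load > MAX_HEALTHY_LOAD

def render_home():
    if overloaded():
        return "<html><body>Text-only results, no images.</body></html>"
    return "<html><body><img src='hero.jpg'> Full, rich page...</body></html>"
```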
Even small sites have learned that it is better to put up a temporary web server that displays the same “under construction” page no matter what the query, than to have the users receive no service at all.
Suggesting that customers check Twitter or some such for updates can be helpful too, assuming you actually post there (but even if you don't, customers might at least report when you're back up :( ).
Following is an example database failover procedure:
1. Announce the impending failover to the db-team and manager-team mailing lists.
2. Verify the hot spare has at least 10 terabytes of free disk space.
3. Verify these dependencies are all operating within parameters: (link to server control panel) (link to data feed control panel).
A good runbook should also include what to do if conditions are not met. Perhaps link to another runbook or put instructions inline. Even if that only means "page so-and-so," precious time isn't lost wondering what to do.
When developers are unwilling to add operational features, one option is to write the features yourself. This is a bad option for two reasons. First, the developers might not accept your code. As an outsider, you do not know their coding standards, the internal infrastructure, and their overall vision for the future software architecture. Any bugs in your code will receive magnified blame.
This book assumes that devs and ops are separate. It also implicitly points out some of the key benefits of not making this split. I guess the audience usually selects itself.
IaaS providers usually cannot provide guarantees of how much steal time will exist, nor can they provide mechanisms to control it. Netflix found the only way it could deal with this issue was to be reactionary. If high steal time was detected on a virtual machine in Amazon Web Services (AWS), Netflix would delete the virtual machine and have it re-created. If the company was lucky, the new virtual machine would be created on a physical machine that was less oversubscribed. This is a sorry state of affairs (Link 2013).
How often were they competing with themselves? Was this just an inefficient way to distribute Netflix evenly across all of ec2?
With an SOA it is easy to split up a team every time it grows beyond a manageable limit. Each team can focus on a related set of subsystems. They can even trade subsystems between teams as skills and demands require.
There should probably be some pressure toward homogeneity of technology across services/teams for this reason. But this can be hard to do, e.g. culturally.
Following are some best practices for running an SOA:
• Use the same underlying RPC protocol to implement the APIs on all services. This way any tool related to the RPC mechanism is leveraged for all services.
• Have a consistent monitoring mechanism. All services should expose measurements to the monitoring system the same way.
• Use the same techniques with each service as much as possible. Use the same load balancing system, management techniques, coding standards, and so on. As services move between teams, it will be easier for people to get up to speed if these things are consistent.
• …
Decoupling the components can be a long and difficult journey. Start by identifying pieces that can be spilt out as services one at a time. Do not pick the easiest pieces but rather the pieces most in need of the benefits of SOA: flexibility, ease of upgrade and replacement, and so on.
Interesting. We did it this way and learned a lot of lessons slowly rather than quickly. Seems like your first piece should perhaps be small, then jump straight to the ones most in need of isolation before going back to the medium fruit?
Also, bug: "spilt" -> "split"
When this solution works well, it is often the easiest solution because it does not require a redesign of the software. However, there are many problems with scaling this way.
The problems that follow are apt, but often scaling up is a perfectly reasonable response, especially if it means you can spend precious engineering time on other problems and keep the team slightly smaller. But you do need to know, in advance, when re-engineering is appropriate. Especially true for the last problem: scaling up may not solve the scaling problem at all, regardless of cost.
A cache is a net benefit in performance if the time saved during cache hits exceeds the time lost from the additional overhead. We can estimate this using weighted averages. If the typical time for a regular lookup is L, a cache hit is H, a cache miss is M, and the cache hit ratio is R, then using the cache is more effective if H × R + M × (1 − R) < L.
This is just looking at averages, though. If your caching has a big benefit to your p90 latency but holds average latency the same or even degrades it a little, you're probably coming out well ahead overall. Need to look at the shape of the performance distribution, not just mean response times.
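A quick worked instance of the book's test with made-up numbers:

```python
# Worked instance of  H*R + M*(1-R) < L  using invented figures.
L = 10.0    # ms, regular lookup without the cache
H = 0.5     # ms, cache hit
M = 11.0    # ms, cache miss (full lookup plus cache bookkeeping)
R = 0.80    # cache hit ratio

with_cache = H * R + M * (1 - R)   # 0.4 + 2.2 = 2.6 ms average
print(with_cache < L)              # True: the cache wins on average
# Per the note above, also check the latency distribution (p90/p99), not just
# this mean, before declaring victory.
```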
Suppose a system without caching required 20 replicas, but with caching required only 15. If each replica is a machine, this means the cache is more cost-effective if it costs less than purchasing 5 machines.
This might be dangerous advice. It sounds like using the cache as a scaling/availability strategy, which can be risky. Loss of the cache means an outage? Caching should be a performance tool, not an availability tool.
Most algorithms do not perform well with a sudden influx of otherwise little-used data. For example, backing up a database involves reading every record one at a time and leaves the cache filled with otherwise little-used data. At that point, the cache is cold, so performance suffers.
Search engine bots can have a similar effect. Some systems deliberately avoid writing to the cache in response to robot requests.
ARC solves this problem by putting newly cached data in a probationary state. If it is accessed a second time, it gets out of probation and is put into the main cache. A single pass through the database flushes the probationary cache, not the main cache.
Adaptive replacement cache eviction algorithm. Addressed the above (assuming robots don't double-hit the cached data ;) )
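A toy version of the probation idea, heavily simplified (real ARC also keeps ghost lists and adapts the partition sizes; the class and sizes here are invented):

```python
# Two-tier cache in the spirit of ARC's probation tier. New entries land in
# probation; only a second access promotes them to the main cache, so a single
# scan (backup, robot crawl) only churns the probation tier.
from collections import OrderedDict

class ProbationaryCache:
    def __init__(self, probation_size=1000, main_size=1000):
        self.probation = OrderedDict()   # seen once; scans only churn this tier
        self.main = OrderedDict()        # seen at least twice; survives scans
        self.probation_size = probation_size
        self.main_size = main_size

    def get(self, key):
        if key in self.main:
            self.main.move_to_end(key)
            return self.main[key]
        if key in self.probation:
            # Second access: promote out of probation into the main cache.
            value = self.probation.pop(key)
            self._put(self.main, key, value, self.main_size)
            return value
        return None                      # miss; caller fetches and calls put()

    def put(self, key, value):
        if key in self.main:
            self._put(self.main, key, value, self.main_size)
        else:
            self._put(self.probation, key, value, self.probation_size)

    @staticmethod
    def _put(tier, key, value, limit):
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > limit:
            tier.popitem(last=False)     # evict the least recently used entry
```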
Another variation is for threads to kill and re-create themselves periodically so that they remain “fresh.” This mitigates memory leaks and other problems, but in doing so hides them and makes them more difficult to find.
Should you leave a few threads/workers intact for a longer period to reveal memory leaks? Use them as canaries?
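Something like this sketch, maybe: recycle most workers after N requests to contain leaks, but leave a couple of designated canaries alone so leaks stay visible (all names and thresholds invented):

```python
# Worker recycling with long-lived canary workers. Most workers restart after a
# fixed number of requests; canaries are never recycled, so a memory leak still
# shows up in their metrics.
MAX_REQUESTS_BEFORE_RECYCLE = 10_000

class Worker:
    def __init__(self, worker_id, canary=False):
        self.worker_id = worker_id
        self.canary = canary              # monitored closely, never recycled
        self.requests_served = 0

    def handle(self, request):
        self.requests_served += 1
        # ... real request handling would go here ...
        if not self.canary and self.requests_served >= MAX_REQUESTS_BEFORE_RECYCLE:
            self.restart()

    def restart(self):
        # A real server would re-exec or respawn the process here.
        self.requests_served = 0

workers = [Worker(i, canary=(i < 2)) for i in range(10)]   # first two are canaries
```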
Some CDNs can process the HTTPS connections on your behalf, relieving you of managing such complexity. The connection between your web servers and the CDN can then use an easier-to-manage transport mechanism, or no encryption at all.
Do you and your customers want to trust the CDN with whatever data you were trying to encrypt in the first place?
The probability an outage will happen during that time is the reciprocal of the mean time between failures. The percent probability that a second failure will happen during the repair window is MTTR/MTBF × 100.
This is mathematically atrocious. Probably ok for back of envelope. But what's the probability that this undercomputes true probability by say 20%?
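If you assume exponentially distributed failures (my assumption, not the book's), the exact figure is 1 − exp(−MTTR/MTBF), and MTTR/MTBF is its first-order approximation, which errs slightly high rather than undercounting:

```python
# How rough is P ≈ MTTR/MTBF? Under an exponential failure model the exact
# probability of a second failure during the repair window is 1 - exp(-MTTR/MTBF).
import math

MTBF = 720.0   # hours (~30 days), made-up numbers
MTTR = 4.0     # hours

approx = MTTR / MTBF                  # 0.00556 -> about 0.56%
exact = 1 - math.exp(-MTTR / MTBF)    # 0.00554 -> about 0.55%
print(f"approx={approx:.4%} exact={exact:.4%}")
# For MTTR << MTBF the linear estimate is well within back-of-envelope tolerance.
```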
In this approach, the primary replica receives the entire workload but the secondary replica is ready to take over at any time. This is sometimes called the hot spare or “hot standby” strategy since the spare is connected to the system, running (hot), and can be switched into operation instantly.
The best fix is to eliminate the bug that causes the problem. Unfortunately, it can take a long time to fix the code and push a new release. A quick fix is needed in the meantime. A widely used strategy is to have a banned query list that is easy to update and communicate to all the frontends. The frontends automatically reject any query that is found on the banned query list.
Queries are sent to the remaining servers only if replies to the canary requests are received in a reasonable period of time. If the leaf servers crash or hang while the canary requests are being processed, the system flags the request as potentially dangerous
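Sketch of how the two ideas could fit together in a frontend: reject anything on the banned list, and send canary requests to a couple of leaf servers before fanning out to the rest. Names, timeouts, and the leaf-server call are all invented:

```python
# Banned-query list plus canary requests at the frontend.
import concurrent.futures

BANNED_QUERIES = {"known crash-inducing query"}   # easy to update and push out
CANARY_COUNT = 2
CANARY_TIMEOUT_S = 1.0
POOL = concurrent.futures.ThreadPoolExecutor(max_workers=16)

def query_leaf(server, query):
    return f"{server}: results for {query!r}"     # placeholder leaf-server call

def handle_query(query, leaf_servers):
    if query.strip().lower() in BANNED_QUERIES:
        return "rejected: banned query"
    # Canary phase: a hang or crash here stops the fan-out to the remaining leaves.
    canaries = [POOL.submit(query_leaf, s, query) for s in leaf_servers[:CANARY_COUNT]]
    try:
        canary_results = [c.result(timeout=CANARY_TIMEOUT_S) for c in canaries]
    except Exception:                             # timeout or leaf crash
        return "flagged as potentially dangerous" # candidate for the banned list
    rest = [POOL.submit(query_leaf, s, query) for s in leaf_servers[CANARY_COUNT:]]
    return canary_results + [f.result() for f in rest]

print(handle_query("kittens", ["leaf1", "leaf2", "leaf3", "leaf4"]))
```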
Solid-state drives (SSDs), which have no moving parts, wear out since each block is rated to be written only a certain number of times.
Heh, I think it has more to do with physics than the rating. I'd like to see more precision from this book. (do prescription drugs lose efficacy /because/ of the expiration label...?)
They discovered the “bathtub failure curve” where failures tend to happen either in the first month or only many years later.
Is this distribution built into failure models mentioned above?
Did anyone consider exercising drives for 30 days and only putting those that survive into prod? That may be somewhat pointless? Unless you reserve that for the more painful to replace drives/hosts?
wouldn't it be nice if drive manufacturers just did this for us? :)
“DRAM Errors in the Wild: A Large-Scale Field Study” (Schroeder, Pinheiro & Weber 2009) analyzed memory errors in a large fleet of machines in datacenters over a period of 2.5 years. These authors found that error rates were orders of magnitude higher than previously reported and were dominated by hard errors—the kind that ECC can detect but not correct.
For example, Google drains machines one by one for kernel upgrades. As a result of this practice, each machine is rebooted in a controlled way approximately every three months. This reduces the number of surprises found during power outages.
You can choose to break a service into many replicas and put one replica in each rack. With this arrangement, the service has rack diversity. A simple example would be a DNS service where each DNS server is in a different rack so that a rack-wide failure does not cause a service outage.
I think I remember us inadvertently managing to get multiple replicas in the same rack (or at least in a reasonably common failure domain, at a much smaller scale than an AZ) from AWS somehow. Highly correlated failures. Was not pleasant.
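The placement logic that avoids this is simple enough when you control placement; a toy version (rack names and service invented):

```python
# Toy rack-diversity placement: always put the next replica in the least-loaded
# rack, so no rack holds a second replica until every rack holds one.
from collections import Counter

RACKS = ["rack-a", "rack-b", "rack-c", "rack-d"]

def place_replicas(service, replica_count):
    per_rack = Counter()
    placement = []
    for i in range(replica_count):
        rack = min(RACKS, key=lambda r: per_rack[r])   # least-loaded rack first
        per_rack[rack] += 1
        placement.append((f"{service}-{i}", rack))
    return placement

print(place_replicas("dns", 4))
# [('dns-0', 'rack-a'), ('dns-1', 'rack-b'), ('dns-2', 'rack-c'), ('dns-3', 'rack-d')]
```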
Because of these differences, distributed computing services are best managed by a separate team, with separate management, with bespoke operational and management practices.
Would like to see an active debate between the Google and Amazon views on operations (and operational excellence). Where is the pressure to consider investment in system/product/app/service changes to improve uptime and reduce the cost of maintenance/availability/downtime/recovery? It may make a lot of sense to specialize functions to separate the roles, but it may require a certain size, and there must be strong mechanisms to collaborate and continuously strive for operational excellence. I guess AWS is a little closer to Google than much of Amazon, with a tier of support engineers who handle 'routine' maintenance and ops (but is that a good thing or a sign of missed excellence opportunities?). It may well allow them to move fast and have a little more buffer on when infrastructure investments must take place.
I also wonder what comp difference (if any) exists between Google developers and reliability engineers. There probably is one, right?
If engineering management is pressured to focus on new features and neglect bug fixes, the result is a system that slowly destabilizes until it spins out of control.
The budget can also be based on an SLA. A certain amount of instability is expected each month, which is considered a budget. Each roll-out uses some of the budget, as do instability-related bugs. Developers can maximize the number of roll-outs that can be done each month by dedicating effort to improve the code that causes this instability. This creates a positive feedback loop. An example of this is Google’s Error Budgets, which are more fully explained in Section 19.4.
Erm, I think you mean a negative feedback loop. ;) Again, let's achieve more precision in this book.
That said, this error budget idea intrigues.
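The arithmetic is simple; made-up numbers here, and Section 19.4 covers how Google actually runs Error Budgets:

```python
# SLA-based budget sketch. A 99.9% monthly SLA leaves about 43 minutes of
# allowed downtime; roll-outs and instability bugs draw it down.
SLA = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60

budget = (1 - SLA) * MINUTES_PER_MONTH   # ~43.2 minutes of "allowed" downtime
spent = 12 + 9                           # e.g. one bad roll-out plus one bug
remaining = budget - spent

print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
if remaining <= 0:
    print("freeze risky roll-outs; spend the effort on stability instead")
```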
5. Have a common staffing pool for SRE and Developers.
6. Have excess Ops work overflow to the Dev team.
7. Cap SRE operational load at 50 percent.
8. Share 5 percent of Ops work with the Dev team.
These seem to be leaning toward uniting the dev and ops teams. Why not go the whole way again? Sounds like devs and reliability engineers are actually the same role but alternate throughout a career?
An SRE might not be a full-time software developer, but he or she should be able to solve nontrivial problems by writing code.
That seems very inconsistent with devs and reliability engineers being from "the same pool"?
Ah. They meant that you get N people for a program. If you need more reliability engineers, you need to subtract devs. That's a good feedback mechanism.
The common staffing pool encourages the developers to create systems that can be operated efficiently so as to minimize the number of SREs needed.
It also means that reliability engineers don't get to enjoy the fruits of their innovation. They don't get to enjoy quiet on-calls if they do their jobs exceptionally well. They get moved to a trouble spot instead. Does this create any pathologies; is there a local optimum that makes ops just quiet enough to tolerate but not so quiet that you're made redundant? Devs, on the other hand, personally derive all benefit.
• Faults are introduced into the system to increase resilience. Fire drills (discussed in Chapter 15) intentionally take down machines or networks to make sure redundant systems kick in.
• You try "crazy" or audacious things. For example, you might try to get the flow time from one week down to one day.
A group of users could be selected, half with their check box default checked (group A) and the other half with the new mechanism (group B). Whichever design has the better results would be used for all users after the test. Flag flips can be used both to control the test and to enable the winning design when the test is finished.
Not sure A/B testing and feature flag toggling should be through the same mechanism. Seems like too much risk of fouling up your A/B tests. But it might be great for the /code/ not to care: it just consults a single call to determine which treatment a given request or customer should receive.
Differentiated Services: Sometimes there is a need to enable different services for different users. A good flag system can enable paid customers to see different features than unpaid users see. Many membership levels can be implemented by associating a set of flags with each one.
Should this be a separate system as well to maintain customer feature permissions? Seems that this would need to evolve very independently...
But the other pattern here is that it might be good to have the same set of features and available treatments for each feature across these needs: global feature enablement/kill switch, A/B testing, customer-level feature permissions/access.
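Sketch of that "single call" shape: application code asks one question, while kill switches, A/B experiments, and membership entitlements stay separate policies behind it. All names and data here are hypothetical:

```python
# One treatment() lookup backed by three separate policies: kill switches,
# A/B experiments, and membership entitlements.
import hashlib

KILL_SWITCHES = {"recommendations": True}   # False = feature globally disabled
EXPERIMENTS = {"new_checkout"}              # features currently under A/B test
ENTITLEMENTS = {"paid": {"advanced_reports", "new_checkout", "recommendations"},
                "free": {"new_checkout"}}

def ab_group(user, feature):
    digest = hashlib.sha256(f"{feature}:{user}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"

def treatment(feature, user, tier):
    if not KILL_SWITCHES.get(feature, True):
        return "off"                                    # global kill switch wins
    if feature not in ENTITLEMENTS.get(tier, set()):
        return "off"                                    # not in this membership level
    if feature in EXPERIMENTS:
        return ab_group(user, feature)                  # A/B treatment
    return "on"

print(treatment("new_checkout", "alice", "free"))       # 'A' or 'B'
print(treatment("advanced_reports", "bob", "free"))     # 'off'
```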
Tasks classified as rare/easy can remain manual. If they are easy, anyone should be able to do them successfully. A team’s culture will influence if the person does the right thing.
Tasks classified as rare/difficult should be documented and tools should be created to assist the process. Documentation and better tools will make it easier to do the tasks correctly and consistently. This quadrant includes troubleshooting and recovery tasks that cannot be automated. However, good documentation can assist the process and good tools can remove the burden of repetition or human error.
Tasks classified as frequent/easy should be automated. The return on investment is obvious. Interestingly enough, once something is documented, it becomes easier to do, thus sliding it toward this quadrant.
Tasks classified as frequent/difficult should be automated, but it may be best to acquire that automation rather than write it yourself. Purchasing commercial software or using free or open source projects leverages the skills and knowledge of hundreds or thousands of other people.
In companies like Google, there is an established policy to deal with this type of situation. Specifically, there is an official process for a team to declare a toil emergency. The team pauses to consider its options and make a plan to fix the biggest sources of toil. Management reviews the reprioritization plans and approves putting other projects on hold until balance is achieved.