The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2
Kindle Notes & Highlights
6%
Each feature of our service can be individually enabled or disabled. If a feature turns out to have negative consequences, such as security holes or unexpectedly bad performance, it can be disabled without deploying a different software release.
Brian
This can be a little tricky because of the combinatorics problem of having many switches that each introduce a new degree of freedom. How do you build confidence that any constellation of switch states is healthy, valid, error free? Defining the 'feature' lines appropriately here can be critical.
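A minimal sketch (not from the book) of the per-feature switch idea, assuming a hypothetical flag file that ships with configuration rather than inside the release artifact:

import json

def load_flags(path="feature_flags.json"):
    # Hypothetical flag file, e.g. {"new_checkout": true, "beta_search": false}.
    # It lives in configuration, so a feature can be turned off without
    # deploying a different software release.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # no config yet: every feature stays at its default

def is_enabled(flags, name, default=False):
    # Unknown flags fall back to a safe default so a missing entry never
    # silently enables a half-finished feature.
    return bool(flags.get(name, default))

if __name__ == "__main__":
    flags = load_flags()
    if is_enabled(flags, "beta_search"):
        print("serving the new search path")
    else:
        print("serving the existing search path")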
7%
This pattern also permits some rather sophisticated latency management. Suppose this system is expected to return a result in 200 ms or less. If one of the backends is slow for some reason, the frontend doesn’t have to wait for it. If it takes 10 ms to compose and send the resulting HTML, at 190 ms the frontend can give up on the slow backends and generate the page with the information it has. The ability to manage a latency time budget like that can be very powerful. For example, if the advertisement system is slow, search results can be displayed without any ads.
Brian
Reminds me of recent load test results ;)
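A sketch of the latency-budget pattern using asyncio; the backend names and delays are made up, and a real frontend would issue RPCs rather than sleep:

import asyncio

BUDGET = 0.190  # seconds left for backends after reserving ~10 ms to render the page

async def call_backend(name, delay):
    # Stand-in for an RPC to a backend; 'delay' simulates its response time.
    await asyncio.sleep(delay)
    return name, f"data from {name}"

async def build_page():
    tasks = {
        asyncio.create_task(call_backend("search", 0.05)),
        asyncio.create_task(call_backend("ads", 0.40)),  # too slow this time
    }
    done, pending = await asyncio.wait(tasks, timeout=BUDGET)
    for task in pending:
        task.cancel()  # give up on slow backends rather than miss the deadline
    results = dict(task.result() for task in done)
    # Render with whatever arrived in time; ads are simply omitted if late.
    print("rendering page with:", sorted(results))

asyncio.run(build_page())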
10%
later. It must be possible to perform the following tasks:
• Make a backup of a configuration and restore it
• View the difference between one archived copy and another revision
• Archive the running configuration without taking the system down
Brian
Should also be able to review the state of config at an arbitrary point in time (e.g. At time of failure/error spike etc). This should also let you determine if/when a config change occurs coincident with a problem.
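A sketch of those three tasks over plain files, assuming a hypothetical config_archive directory; keeping the archive in version control accomplishes the same thing, and the timestamped copies also give the point-in-time view mentioned above.

import difflib, pathlib, shutil, time

ARCHIVE = pathlib.Path("config_archive")  # hypothetical archive location

def archive(live_path):
    # Snapshot the running configuration without taking the system down.
    ARCHIVE.mkdir(exist_ok=True)
    dest = ARCHIVE / f"{pathlib.Path(live_path).name}.{int(time.time())}"
    shutil.copy2(live_path, dest)
    return dest

def restore(archived_path, live_path):
    # Restore an archived copy over the live configuration.
    shutil.copy2(archived_path, live_path)

def diff(path_a, path_b):
    # Show the difference between two archived revisions.
    a = pathlib.Path(path_a).read_text().splitlines(keepends=True)
    b = pathlib.Path(path_b).read_text().splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, str(path_a), str(path_b)))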
10%
Live restores are often done by providing a special API for inserting data during a restore operation. The architecture should allow for the restoration of a single account, preferably without locking that user or group out of the service.
Brian
Proxy +1 from customer care around the world
10%
A common issue is that a user’s login state is stored locally by a web server but not communicated to replicas. When the load balancer receives future requests from the same user, if they are sent to a different replica the user will be asked to log in again. This will repeat until the user has logged into every replica. Solutions to this problem are discussed in Section 4.2.3.
Brian
Worked with a customer many years ago whose app servers worked this way. On the first request to www, the customer was redirected to another host and remained sticky there for the remainder of the session. I think the only way to maintain state outside the app server was in the db, but the app server may have provided some in-memory cache that forced additional session stickiness at the next layer of the stack as well?... Netscape Application Server, anyone? This was years before the advent of memcached (2003). What cache server approach did people use before that? Or in dial-up days did it not matter, so it didn't exist?
11%
A configuration setting (a toggle) should be present to enable or disable each new feature.
Brian
It's probably healthy to remove the switches after some time. I haven't seen a good trigger to spark the removal, however, other than putting it on the list of a project's done-done-done criteria. It can be healthy to decide which switches should remain for load-shedding or feature disabling (e.g., if it's broken and affecting other portions of the app somehow), but don't leave them all on forever. Configuration represents an opportunity for dev/test/prod environment entropy, even when the code is identical.
Robert Gustavo
Every time you add a switch, you should add a jira task to remove it. Our page of switches is unwieldy.
Dylan Jackson
Assuming you log/write switching events, detecting switches which have not changed in several months (which are not whitelisted for keeping long-term for operational purposes) is a good way to detect …
Brian
+1 on Gus. Maybe there's also a periodic active step to review/justify/retain a switch so there's consistent pressure to downsize? Dylan's suggestion can be used to escalate pressure.
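A sketch of Dylan's stale-switch detector, assuming flag flips are logged with a timestamp and that long-term operational switches are whitelisted (all names and the 90-day cutoff are hypothetical):

import datetime

STALE_AFTER = datetime.timedelta(days=90)
OPERATIONAL_WHITELIST = {"serve_lightweight_ui", "disable_ads"}  # kept on purpose

def stale_switches(last_flipped, now=None):
    # last_flipped: {"flag_name": datetime of the most recent state change (UTC)}
    now = now or datetime.datetime.utcnow()
    return sorted(
        name
        for name, when in last_flipped.items()
        if name not in OPERATIONAL_WHITELIST and now - when > STALE_AFTER
    )

The output is a candidate list for the periodic review/justify/retain step, not an automatic removal.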
11%
More sophisticated toggles can be enabled for particular groups of users. A feature may be enabled for a small group of trusted testers who receive early access. Once it is validated, the toggle can enable the feature for all users, perhaps by enabling it for successively larger groups.
Brian
Yes, this is helpful. I'd also add an easy way to opt people into the feature (including self-opt-in). Is it wise to keep A/B testing completely separate? I think so, as it might be easy to foul up a test if settings get changed (or if self-opt-in is allowed, for example).
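A sketch of rolling a feature out to successively larger groups, assuming a stable user id; hashing keeps each user's assignment consistent as the percentage grows, and the trusted-tester list is made up:

import hashlib

TRUSTED_TESTERS = {"alice@example.com", "bob@example.com"}  # early-access group

def bucket(user_id):
    # Stable 0-99 bucket derived only from the user id.
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def feature_enabled(user_id, rollout_percent):
    if user_id in TRUSTED_TESTERS:
        return True
    return bucket(user_id) < rollout_percent

# Ramp 1% -> 10% -> 50% -> 100%; users already in stay in as the number grows.
print(feature_enabled("carol@example.com", 10))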
11%
Graceful degradation means software acts differently when it is becoming overloaded or when systems it depends on are down. For example, a web site might have two user interfaces: one is rich and full of images, while the other is lightweight and all text. Normally users receive the rich interface. However, if the system is overloaded or at risk of hitting bandwidth limits, it switches to the lightweight mode.
Brian
This is often not built in from the beginning, as the app may depend on a single db/cluster. It's hard to know where the sensible lines to tolerate an outage are, so I have a feeling this is most often done retroactively. Any good methodologies for identifying opportunities to gracefully degrade? I've seen the pattern in which the team adds it after each relevant failure. Not ideal, but less likely to be wasted effort, though it might be a while until the same failure/outage occurs again.
Chet
I've heard of systems disabling views/widgets if an SLA has not been met.

Some systems go into read-only mode.
Brian
Yeah. I'm curious how to identify which features to build graceful degradation into. It's expensive to build it in all the time, especially in an automated way.
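A sketch of the rich-vs-lightweight switch driven by a load signal; the threshold and the probe are stand-ins (Unix-only load average here), and a real system might watch queue depth, upstream error rates, or bandwidth instead:

import os

LOAD_THRESHOLD = 8.0  # hypothetical 1-minute load-average cutoff

def overloaded():
    return os.getloadavg()[0] > LOAD_THRESHOLD

def render_page(content):
    if overloaded():
        # Lightweight, text-only interface while the system is under pressure.
        return f"<html><body><p>{content}</p></body></html>"
    # Rich interface for normal operation.
    return f"<html><body><img src='hero.jpg'><p>{content}</p></body></html>"

print(render_page("hello"))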
11%
Even small sites have learned that it is better to put up a temporary web server that displays the same “under construction” page no matter what the query, than to have the users receive no service at all.
Brian
Suggesting that customers check Twitter or some such to get updates can be helpful too, assuming you actually post them (but even if you don't, customers might at least report when you're back up :( )
12%
It is critical that every procedure include a test suite that verifies success or failure.
Brian
Good point: essential element to a good runbook.
12%
Following is an example database failover procedure:
1. Announce the impending failover to the db-team and manager-team mailing lists.
2. Verify the hot spare has at least 10 terabytes of free disk space.
3. Verify these dependencies are all operating within parameters: (link to server control panel) (link to data feed control panel).
Brian
Good runbook should also include what to do if conditions are not met. Perhaps link to another runbook or put instructions in line. Even if that means "page so and so", precious time isn't lost wondering what to do
12%
When developers are unwilling to add operational features, one option is to write the features yourself. This is a bad option for two reasons. First, the developers might not accept your code. As an outsider, you do not know their coding standards, the internal infrastructure, and their overall vision for the future software architecture. Any bugs in your code will receive magnified blame.
Brian
This book makes the assumption that devs and ops are separate. It also implicitly points out some of the key benefits of not making this split. I guess the audience usually selects itself.
14%
IaaS providers usually cannot provide guarantees of how much steal time will exist, nor can they provide mechanisms to control it. Netflix found the only way it could deal with this issue was to be reactionary. If high steal time was detected on a virtual machine in Amazon Web Services (AWS), Netflix would delete the virtual machine and have it re-created. If the company was lucky, the new virtual machine would be created on a physical machine that was less oversubscribed. This is a sorry state of affairs (Link 2013).
Brian
How often were they competing with themselves? Was this just an inefficient way to distribute Netflix evenly across all of ec2?
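A sketch of the reactive check, reading steal time from /proc/stat on Linux; the threshold is made up, and the "re-create the VM" step is only a print because it would go through the provider's API:

def cpu_steal_fraction():
    # /proc/stat "cpu" line: user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        fields = f.readline().split()
    values = [int(v) for v in fields[1:]]
    steal = values[7] if len(values) > 7 else 0
    return steal / sum(values)  # since boot; a real check would compare two samples

if __name__ == "__main__":
    if cpu_steal_fraction() > 0.10:  # hypothetical 10% threshold
        print("steal time too high: terminate this instance and let it be re-created")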
19%
With an SOA it is easy to split up a team every time it grows beyond a manageable limit. Each team can focus on a related set of subsystems. They can even trade subsystems between teams as skills and demands require.
Brian
There should probably be some pressure toward homogeneity of technology across services/teams for this reason. But this can be hard to do, e.g. culturally.
Chet
Some languages are better with certain domains than others. I've seen algorithm heavy orgs use C/C++/Fortran -- not Python/Java/Ruby/Javascript.

There are exceptions to the rule of consistency.
19%
Following are some best practices for running an SOA:
• Use the same underlying RPC protocol to implement the APIs on all services. This way any tool related to the RPC mechanism is leveraged for all services.
• Have a consistent monitoring mechanism. All services should expose measurements to the monitoring system the same way.
• Use the same techniques with each service as much as possible. Use the same load balancing system, management techniques, coding standards, and so on. As services move between teams, it will be easier for people to get up to speed if these things are consistent.
• ...
19%
Decoupling the components can be a long and difficult journey. Start by identifying pieces that can be spilt out as services one at a time. Do not pick the easiest pieces but rather the pieces most in need of the benefits of SOA: flexibility, ease of upgrade and replacement, and so on.
Brian
Interesting. We did it this way and learned a lot of lessons slowly rather than quickly. Seems your first should perhaps be small, then jump straight to the most in need of isolation before going back to medium fruit? Also, bug: "spilt" -> "split"
Chet
Were the pieces small or big? Were they easy or hard?

Small, easy show little value. Big and easy could show some value, but I find it similar to refactoring code because I don't like how the last per…
Brian
Small and easy can reveal the lessons you don't want to learn while doing the big, difficult ones.
20%
When this solution works well, it is often the easiest solution because it does not require a redesign of the software. However, there are many problems with scaling this way.
Brian
The problems that follow are apt, but often scaling up is a perfectly reasonable response, especially if it means you can spend precious engineering time on other problems and keep the team slightly smaller. But you do need to know, in advance, when re-engineering is appropriate. That is especially true for the last problem: scaling up may not solve the scaling problem at all, regardless of cost.
21%
A cache is a net benefit in performance if the time saved during cache hits exceeds the time lost from the additional overhead. We can estimate this using weighted averages. If the typical time for a regular lookup is L, a cache hit is H, a cache miss is M, and the cache hit ratio is R, then using the cache is more effective if H × R + M × (1 − R) < L.
Brian
This is just looking at averages, though. If your caching has a big benefit for your p90 latency but holds average latency the same or even degrades it a little, you're probably coming out well ahead overall. You need to look at the shape of the performance distribution, not just mean response times.
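The inequality worked through with made-up numbers (a 30 ms regular lookup, 1 ms hit, 32 ms miss, 60% hit ratio):

L, H, M, R = 30.0, 1.0, 32.0, 0.60  # illustrative values, times in ms

expected_with_cache = H * R + M * (1 - R)  # 1*0.6 + 32*0.4 = 13.4 ms
print(expected_with_cache < L)             # True: the cache wins on average

As the note above says, this compares means only; it says nothing about how the cache reshapes the latency distribution.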
21%
Suppose a system without caching required 20 replicas, but with caching required only 15. If each replica is a machine, this means the cache is more cost-effective if it costs less than purchasing 5 machines.
Brian
This might be dangerous advice. It sounds like using the cache as a scaling/availability strategy, which can be risky. Loss of cache means an outage? Caching should be a performance tool, not an availability tool.
Chet
I agree that caching should focus on reducing P50 performance.

What about p99/p100? We have situations where caching is the only thing keeping an API/page working. Otherwise, it's unavailable for a par…
Brian
That's a special case and you shouldn't consider it caching in that case. It becomes a nonauthoritative, nonephemeral data store. But you'll still probably need a way for the cx to gracefully degrade …
22%
Most algorithms do not perform well with a sudden influx of otherwise little-used data. For example, backing up a database involves reading every record one at a time and leaves the cache filled with otherwise little-used data. At that point, the cache is cold, so performance suffers.
Brian
Search engine bots can have a similar effect. Some deliberately avoid writing to cache in response to robot requests.
22%
ARC solves this problem by putting newly cached data in a probationary state. If it is accessed a second time, it gets out of probation and is put into the main cache. A single pass through the database flushes the probationary cache, not the main cache.
Brian
Adaptive replacement cache eviction algorithm. Addressed the above (assuming robots don't double-hit the cached data ;) )
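A toy sketch of the probation idea only (not the full ARC algorithm): new keys land in a probation LRU and are promoted to the main cache on a second access, so a single scan evicts only probationary entries. Sizes and names here are arbitrary:

from collections import OrderedDict

class ProbationCache:
    def __init__(self, main_size=4, probation_size=4):
        self.main = OrderedDict()       # LRU order: oldest first
        self.probation = OrderedDict()
        self.main_size = main_size
        self.probation_size = probation_size

    def get(self, key):
        if key in self.main:
            self.main.move_to_end(key)
            return self.main[key]
        if key in self.probation:
            value = self.probation.pop(key)            # second access: promote
            self._put(self.main, key, value, self.main_size)
            return value
        return None

    def put(self, key, value):
        if key in self.main:
            self.main[key] = value
            self.main.move_to_end(key)
        else:
            self._put(self.probation, key, value, self.probation_size)

    @staticmethod
    def _put(store, key, value, limit):
        store[key] = value
        store.move_to_end(key)
        while len(store) > limit:
            store.popitem(last=False)   # evict the least recently used entry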
23%
Another variation is for threads to kill and re-create themselves periodically so that they remain “fresh.” This mitigates memory leaks and other problems, but in doing so hides them and makes them more difficult to find.
Brian
Should you leave a few threads/workers intact for a longer period to reveal memory leaks? Use them as canaries?
Chet
It'll take longer to find a leak if it exists. I think it's more infuriating from a debugging perspective.

Given 10 threads. You recreate 9 regularly. You keep 1 alive forever. How long will it take th…
Brian
You could see the memory footprint time series diverge between the 9 (sawtooth) and the 1 (linear, or seemingly flat). Or at least see an absolute difference in memory usage. If the memory leak is pro…
Brian
And this could still work with bounce deployments (thread would appropriately be killed and recreated)--can't be running old code! :)
23%
Best practice is to use a flag or software switch to determine whether native URLs or CDN URLs are output as your system generates web pages.
Brian
(a kill switch, that is)
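A sketch of that switch; the flag source and hostnames are made up:

USE_CDN = True  # in practice this comes from the flag/config system, not a constant
CDN_BASE = "https://cdn.example.com"

def asset_url(path):
    # Flipping the switch off falls back to native URLs immediately,
    # e.g. when the CDN itself is misbehaving.
    return f"{CDN_BASE}{path}" if USE_CDN else path

print(asset_url("/static/logo.png"))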
23%
Some CDNs can process the HTTPS connections on your behalf, relieving you of managing such complexity. The connection between your web servers and the CDN can then use an easier-to-manage transport mechanism, or no encryption at all.
Brian
Do you and your customers want to trust CDN with whatever data you were trying to encrypt in the first place?
24%
Components from the same manufacturing batch have similar mortality curves, resulting in a sudden rush of failures.
Brian
Yeah, see, this violates the exponential probability distribution which is probably used for modeling.
24%
The probability an outage will happen during that time is the reciprocal of the mean time between failures. The percent probability that a second failure will happen during the repair window is MTTR/MTBF × 100.
Brian
This is mathematically atrocious. Probably OK for back-of-the-envelope work. But what's the probability that this undercomputes the true probability by, say, 20%?
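Worked with illustrative numbers to see what the approximation says: with an MTBF of 10,000 hours and an MTTR of 4 hours, the chance of a second failure landing inside the repair window is roughly 4 / 10,000 = 0.04%. As the note above says, this is back-of-the-envelope only; it assumes a constant failure rate and a repair window that is short relative to the MTBF.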
25%
In this approach, the primary replica receives the entire workload but the secondary replica is ready to take over at any time. This is sometimes called the hot spare or “hot standby” strategy since the spare is connected to the system, running (hot), and can be switched into operation instantly.
Brian
When is this advantageous over spreading load across all?
Chet
Hot spares are useful for databases and hard drives where there is a single owner of writes. Spreading write load across two is a double master scenario. It sounds easier to pick one box for everythin…
Brian
But I'm wondering why not use for reads, assuming you already have configured read-only replicas. Best I could think of are to have the db in writable mode/access so that it's ready to swap in with no…
25%
The best fix is to eliminate the bug that causes the problem. Unfortunately, it can take a long time to fix the code and push a new release. A quick fix is needed in the meantime. A widely used strategy is to have a banned query list that is easy to update and communicate to all the frontends. The frontends automatically reject any query that is found on the banned query list.
Brian
Interesting
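A sketch of the frontend-side check, with a hypothetical banned list that operators can push out quickly:

# In practice this set would be distributed to every frontend (config push,
# shared store, etc.) so an operator can ban a query of death within minutes.
BANNED_QUERIES = {"query that crashes the ranking code"}

def handle_query(query):
    if query.strip().lower() in BANNED_QUERIES:
        return "rejected: query is temporarily banned"
    return f"processing: {query}"

print(handle_query("query that crashes the ranking code"))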
25%
Queries are sent to the remaining servers only if replies to the canary requests are received in a reasonable period of time. If the leaf servers crash or hang while the canary requests are being processed, the system flags the request as potentially dangerous
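A sketch of the canary pattern: send the query to a couple of leaf servers first and fan out to the rest only if the canaries answer within the deadline. The RPC is simulated, and the timeout is arbitrary:

import concurrent.futures as cf

def send_query(server, query):
    # Stand-in for the real leaf RPC; hang or raise here to simulate a query of death.
    return f"{server}: results for {query!r}"

def fan_out(servers, query, canary_count=2, canary_timeout=1.0):
    canaries, rest = servers[:canary_count], servers[canary_count:]
    with cf.ThreadPoolExecutor() as pool:
        futures = [pool.submit(send_query, s, query) for s in canaries]
        try:
            results = [f.result(timeout=canary_timeout) for f in futures]
        except Exception:
            return None  # a canary timed out or crashed: flag the query as dangerous
        results += [pool.submit(send_query, s, query).result() for s in rest]
    return results

print(fan_out(["leaf1", "leaf2", "leaf3", "leaf4"], "kittens"))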
26%
Solid-state drives (SSDs), which have no moving parts, wear out since each block is rated to be written only a certain number of times.
Brian
Heh, I think it has more to do with physics than with the rating. I'd like to see more precision from this book. (Do prescription drugs lose efficacy /because/ of the expiration label...?)
26%
They discovered the “bathtub failure curve” where failures tend to happen either in the first month or only many years later.
Brian
Is this distribution built into the failure models mentioned above? Did anyone consider exercising drives for 30 days and only putting those that survive into prod? That may be somewhat pointless? Unless you reserve that for the more-painful-to-replace drives/hosts? Wouldn't it be nice if drive manufacturers just did this for us? :)
26%
“DRAM Errors in the Wild: A Large-Scale Field Study” (Schroeder, Pinheiro & Weber 2009) analyzed memory errors in a large fleet of machines in datacenters over a period of 2.5 years. These authors found that error rates were orders of magnitude higher than previously reported and were dominated by hard errors—the kind that ECC can detect but not correct.
Brian
Fascinating, especially the last bit.
26%
For example, Google drains machines one by one for kernel upgrades. As a result of this practice, each machine is rebooted in a controlled way approximately every three months. This reduces the number of surprises found during power outages.
26%
If we are load balancing over two machines, each at 80 percent utilization, then there is no spare capacity available if one goes down.
Brian
I remember naively suggesting a scenario similar to this in LB tier. Michael and Dave prevented the madness :)
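Worked out: two machines at 80% each carry 160% of one machine's capacity, so if one dies the survivor would have to absorb 160% and must shed load. For N+1 with two machines, each needs to stay at or below roughly 50% utilization; more generally, N+1 over n machines allows about (n-1)/n of total capacity to be used safely.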
26%
You can choose to break a service into many replicas and put one replica in each rack. With this arrangement, the service has rack diversity. A simple example would be a DNS service where each DNS server is in a different rack so that a rack-wide failure does not cause a service outage.
Brian
I think I remember us inadvertently managing to get multiple replicas in the same rack (or at least a reasonably common failure domain at much smaller scale than AZ) from AWS somehow. Highly correlated failures. Was not pleasant.
28%
Because of these differences, distributed computing services are best managed by a separate team, with separate management, with bespoke operational and management practices.
Brian
Would like to see an active debate between this Google view and the Amazon view on operations (and operational excellence). Where is the pressure to consider investment in system/product/app/service changes that improve uptime and reduce the cost of maintenance/availability/downtime/recovery? It may make a lot of sense to specialize functions and separate the roles, but it may require a certain size, and there must be strong mechanisms to collaborate and continuously strive for operational excellence. I guess AWS is a little closer to Google than much of Amazon, with a tier of support engineers who handle 'routine' maintenance and ops (but is that a good thing or a sign of missed excellence opportunities?). It may well allow them to move fast and have a little more buffer on when infrastructure investments must take place. I also wonder what comp difference (if any) exists between Google developers and reliability engineers. There probably is one, right?
28%
If engineering management is pressured to focus on new features and neglect bug fixes, the result is a system that slowly destabilizes until it spins out of control.
28%
The budget can also be based on an SLA. A certain amount of instability is expected each month, which is considered a budget. Each roll-out uses some of the budget, as do instability-related bugs. Developers can maximize the number of roll-outs that can be done each month by dedicating effort to improve the code that causes this instability. This creates a positive feedback loop. An example of this is Google’s Error Budgets, which are more fully explained in Section 19.4.
Brian
Erm, I think you mean a negative feedback loop. ;) Again, let's achieve more precision in this book. That said, this error budget idea intrigues.
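A worked illustration of the budget idea: a 99.9% monthly availability target leaves roughly 0.1% of ~43,200 minutes, about 43 minutes, of allowed downtime per month. Each rollout and each instability-related bug draws against those 43 minutes, and when the budget is spent, launches pause until reliability work earns it back. (Numbers here are illustrative, not from the book.)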
28%
Have a common staffing pool for SRE and Developers.
6. Have excess Ops work overflow to the Dev team.
7. Cap SRE operational load at 50 percent.
8. Share 5 percent of Ops work with the Dev team.
Brian
These seem to be leaning toward uniting the dev and ops teams. Why not go the whole way again? Sounds like devs and reliability engineers are actually the same role but alternate throughout a career?
28%
Aim for a maximum of two events per oncall shift.
Brian
How long is a shift? Don't tell me you alter the shift length to make this true. ;)
28%
An SRE might not be a full-time software developer, but he or she should be able to solve nontrivial problems by writing code.
Brian
That seems very inconsistent with devs and reliability engineers being from "the same pool"? Ah. They mean that you get N people for a program; if you need more reliability engineers, you need to subtract devs. That's a good feedback mechanism.
28%
The common staffing pool encourages the developers to create systems that can be operated efficiently so as to minimize the number of SREs needed.
Brian
It also means that reliability engineers don't get to enjoy the fruits of their innovation. They don't get to enjoy quiet on-calls if they do their jobs exceptionally well; they get moved to a trouble spot instead. Does this create any pathologies? Is there a local optimum that makes ops just quiet enough to tolerate but not so quiet that you're made redundant? Devs, on the other hand, personally derive all the benefit.
33%
Faults are introduced into the system to increase resilience. Fire drills (discussed in Chapter 15) intentionally take down machines or networks to make sure redundant systems kick in. • You try “crazy” or audacious things. For example, you might try to get the flow time from one week down to one day.
Brian
Two (very different) things we should do way more
42%
A group of users could be selected, half with their check box default checked (group A) and the other half with the new mechanism (group B). Whichever design has the better results would be used for all users after the test. Flag flips can be used both to control the test and to enable the winning design when the test is finished.
Brian
Not sure A/B testing and feature flag toggling should go through the same mechanism. Seems like too much risk of fouling up your A/B tests. But it might be great for the /code/ not to care: it just makes a single call to determine which treatment a given request or customer should receive.
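A sketch of that "single call" idea: the code asks one function for its treatment, and whether the answer comes from an experiment assignment or from a post-test flag flip is hidden behind it. All names and percentages are made up:

import hashlib

EXPERIMENTS = {
    # experiment name -> (treatments, percent of users in group B)
    "checkbox_default": (("A", "B"), 50),
}
FLAG_OVERRIDES = {}  # after the test, e.g. {"checkbox_default": "B"} for everyone

def treatment(experiment, user_id):
    if experiment in FLAG_OVERRIDES:       # winning design rolled out to all users
        return FLAG_OVERRIDES[experiment]
    treatments, percent_b = EXPERIMENTS[experiment]
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 100
    return treatments[1] if bucket < percent_b else treatments[0]

print(treatment("checkbox_default", "user-42"))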
42%
Differentiated Services: Sometimes there is a need to enable different services for different users. A good flag system can enable paid customers to see different features than unpaid users see. Many membership levels can be implemented by associating a set of flags with each one.
Brian
Should this be a separate system as well, to maintain customer feature permissions? Seems that this would need to evolve very independently... But the other pattern here is that it might be good to have the same set of features and available treatments for each feature across these needs: global feature enablement/kill switch, A/B testing, and customer-level feature permissions/access.
44%
Tasks classified as rare/easy can remain manual. If they are easy, anyone should be able to do them successfully. A team’s culture will influence if the person does the right thing.
44%
Tasks classified as rare/difficult should be documented and tools should be created to assist the process. Documentation and better tools will make it easier to do the tasks correctly and consistently. This quadrant includes troubleshooting and recovery tasks that cannot be automated. However, good documentation can assist the process and good tools can remove the burden of repetition or human error.
44%
Tasks classified as frequent/easy should be automated. The return on investment is obvious. Interestingly enough, once something is documented, it becomes easier to do, thus sliding it toward this quadrant.
44%
Tasks classified as frequent/difficult should be automated, but it may be best to acquire that automation rather than write it yourself. Purchasing commercial software or using free or open source projects leverages the skills and knowledge of hundreds or thousands of other people.
46%
In companies like Google, there is an established policy to deal with this type of situation. Specifically, there is an official process for a team to declare a toil emergency. The team pauses to consider its options and make a plan to fix the biggest sources of toil. Management reviews the reprioritization plans and approves putting other projects on hold until balance is achieved.