Lorin Hochstein's Blog, page 22

August 6, 2020

Soak time

This is what makes separating “action items generation” from post-incident review meetings so valuable. Soak time works.

Your kid gets it.

Published on August 06, 2020 19:17

August 4, 2020

In praise of the Wild West engineer

If you put software engineers on a continuum with slow and cautious at one end and Wild West at the other end, I’ve always been on the slow and cautious side. But I’ve come to appreciate the value of the Wild West engineer.





Here’s an example of Wild West behavior: during some sort of important operational task (let’s say a failover), the engineer carrying it out sees a Python stack trace appear in the UI. Whoops, there’s a bug in the operational code! They ssh to the box, fix the code, and then run it again. It works!





This sort of behavior used to horrify me. How could you just hotfix the code by ssh’ing to the box and updating the code by hand like that? Do you know how dangerous that is? You’re supposed to PR, run tests, and deploy a new image!





But here’s the thing. During an actual incident, the engineers involved will have to take risky actions in order to remediate. You’ve got to poke and prod at the system in a way that’s potentially unsafe in order to (hopefully!) make things better. As Richard Cook wrote, all practitioner actions are gambles. The Wild West engineers are the ones with the most experience making these sorts of production changes, so they’re more likely to be successful at them in these circumstances.





I also don’t think that a Wild West engineer is someone who simply takes unnecessary risks. Rather, they have a well-calibrated sense of what’s risky and what isn’t. In particular, if something breaks, they know how to fix it. Once, years ago, during an incident, I made a global change to a dynamic property in order to speed up a remediation, and a Wild West engineer I knew clucked with disapproval. That was a stupid risk I took. You always stage your changes across regions!





Now, just because the Wild West engineers have a well-calibrated sense of risk doesn’t mean that sense is always correct! David Woods notes that all systems are simultaneously well-adapted, under-adapted, and over-adapted. The Wild West engineer might miscalculate a risk. But I think it’s a mistake to dismiss Wild West engineers as simply irresponsible. While I’m still firmly the slow-and-cautious type, when everything is on fire, I’m happy to have the Wild West engineers around to take those dangerous remediation actions. Because if a remediation ends up making things worse, they’ll know how to handle it. They’ve been there before.

Published on August 04, 2020 22:40

July 28, 2020

Safe by design?

I’ve been enjoying the ongoing MIT STAMP workshop. In particular, I’ve been enjoying listening to Nancy Leveson talk about system safety. Leveson is a giant in the safety research community (and, incidentally, an author of my favorite software engineering study). She’s also a critic of root cause and human error as explanations for accidents. Despite this, she has a different perspective on safety than many in the resilience engineering community. To sharpen my thinking, I’m going to capture my understanding of that difference below.





From Leveson’s perspective, the engineering design should ensure that the system is safe. More specifically, the design should contain controls that eliminate or mitigate hazards. In this view, accidents are invariably attributable to design errors: a hazard in the system was not effectively controlled in the design.






"The operator's job is to make up for the holes in the designer's work." – Jens Rasmussen (1981)

— John Allspaw (@allspaw) July 18, 2012





By contrast, many in the resilience engineering community claim that design alone cannot ensure that the system is safe. The idea here is that the system design will always be incomplete, and the human operators must adapt their local work to make up for the gaps in the designed system. These adaptations usually contribute to safety, and sometimes contribute to incidents, and in post-incident investigations we often only notice the latter case.





These perspectives are quite different. Leveson believes that depending on human adaptation in the system is itself dangerous. If we’re depending on human adaptation to achieve system safety, then the design engineers have not done their jobs properly in controlling hazards. The resilience engineering folks believe that depending on human adaptation is inevitable, because of the messy nature of complex systems.

Published on July 28, 2020 17:45

All we can do is find problems

I’m in the second week of the three-week virtual MIT STAMP workshop. Today, Prof. Nancy Leveson gave a talk titled Safety Assurance (Safety Case): Is it Possible? Feasible? Safety assurance refers to the act of assuring that a system is safe after the design has been completed.





Leveson is a skeptic of evaluating the safety of a system. Instead, she argues for focusing on generating safety requirements at the design stage so that safety can be designed in, rather than doing an evaluation post-design. (You can read her white paper for more details on her perspective). Here are the last three bullets from her final slide:





If you are using hazard analysis to prove your system is safe, then you are using it wrong and your goal is futile
Hazard analysis (using any method) can only help you find problems, it cannot prove that no problems exist
The general problem is in setting the right psychological goal. It should not be “confirmation,” but exploration



This perspective resonated with me, because it matches how I think about availability metrics. You can’t use availability metrics to inform you about whether your system is reliable enough, because they can only tell you if you have a problem. If your availability metrics look good, that doesn’t tell you anything about how to spend your engineering resources on reliability.





As Leveson remarked about safety, I think the best we can do in our non-safety-critical domains is study our systems to identify where the potential problems are, so that we can address them. Since we can’t actually quantify risk, the best we can do is to get better at identifying systemic issues. We need to always be looking for problems in the system, regardless of how many nines of availability we achieved last quarter. After all, that next major outage is always just around the corner.

Published on July 28, 2020 10:07

July 22, 2020

The power of functionalism

Most software engineers are likely familiar with functional programming. The idea of functionalism, focusing on the “what” rather than the “how”, doesn’t just apply to programming. I was reminded of how powerful a functionalist approach is this week while attending the STAMP workshop. STAMP is an approach to systems safety developed by Nancy Leveson.





The primary metaphor in STAMP is the control system: STAMP employs a control system model to help reason about the safety of a system. This is very much a functionalist approach, as it models agents in the system based only on what control actions they can take and what feedback they can receive. You can use this same model to reason about a physical component, a software system, a human, a team, an organization, even a regulatory body. As long as you can identify the inputs your component receives, and the control actions that it can perform, you can model it as a control system.
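
To make that concrete, here’s a minimal sketch of what this functionalist framing might look like in code. It’s my own illustration rather than anything defined by STAMP itself: the class and the example components are hypothetical, but they show how the same “control actions plus feedback” shape fits a software component and a human role alike.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ControlLoop:
    # A component modeled only by the process it controls, the control
    # actions it can take, and the feedback it receives.
    controller: str
    controlled_process: str
    control_actions: List[str] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)

# The same shape fits a software component...
autoscaler = ControlLoop(
    controller="autoscaler",
    controlled_process="web server fleet",
    control_actions=["add instance", "remove instance"],
    feedback=["CPU utilization", "request latency"],
)

# ...and a human or organizational one.
on_call = ControlLoop(
    controller="on-call engineer",
    controlled_process="production service",
    control_actions=["roll back deploy", "fail over to another region"],
    feedback=["alerts", "dashboards", "customer reports"],
)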





Cognitive systems engineering (CSE) uses a different metaphor: that of a cognitive system. But CSE also takes a functional approach, observing how people actually work and trying to identify what functions their actions serve in the system. It’s a bottom-up functionalism where STAMP is top-down, so it yields different insights into the system.





What’s appealing to me about these functionalist approaches is that they change the way I look at a problem. They get me to think about the problem or system at hand in a different way than I would have if I didn’t deliberately take a functional approach. And “it helped me look at the world in a different way” is the highest compliment I can pay to a technology.

Published on July 22, 2020 22:08

July 20, 2020

“How could they be so stupid?”

This is still the case. https://t.co/4tTv9kFizF

— John Allspaw (@allspaw) July 20, 2020





From the New York Times story on the recent Twitter hack:





Mr. O’Connor said other hackers had informed him that Kirk got access to the Twitter credentials when he found a way into Twitter’s internal Slack messaging channel and saw them posted there, along with a service that gave him access to the company’s servers. 





It’s too soon after this incident to put too much faith in the reporting, but let’s assume it’s accurate. A collective cry of “Posting credentials to a Slack channel? How could engineers at Twitter be so stupid?” rose up from the internet. It’s a natural reaction, but it’s not a constructive one.





I don’t personally know any engineers at Twitter, but I have confidence that they have excellent engineers over there, including excellent security folks. So, how do we explain this seemingly obvious security lapse?





The problem is that we on the outside can’t, because we don’t have enough information. This type of lapse is a classic example of a workaround. People in a system use workarounds (they do things the “wrong” way) when there are obstacles to doing things the “right” way.





There are countless possibilities for why people employ workarounds. Maybe some system that’s required for doing it the “right” way is down for some reason, or maybe it simply takes too long or is too hard to do things the “right” way. Combine that with production pressures, and a workaround is born.





I’m willing to bet that there are people in your organization that use workarounds. You probably use some yourself. Identifying those workarounds teaches us something about how the system works, and how people have to do things the “wrong” way to actually get their work done.





Some workarounds, like the Twitter example, are dangerous. But simply observing “they shouldn’t have done that” does nothing to address the problems in the system that motivated the workaround in the first place.





When you see a workaround, don’t ask “how could they be so stupid to do things the obviously wrong way?” Instead, ask “what are the properties of our system that contributed to the development of this workaround?” Because, unless you gain a deeper understanding of your system, the problems that motivated the workaround aren’t going to go away.

Published on July 20, 2020 18:55

July 16, 2020

A reasonable system

Reasonable is an adjective we typically apply to humans, or something we implore of them (“Be reasonable!”). And, while I do want reasonable colleagues, what I really want is a reasonable system.





By reasonable system, I mean a system whose behavior I can reason about, both backwards and forwards in time. Given my understanding of how the system works, and the signals that are emitted by the system, I want to be able to understand its past behavior, and predict what its behavior is going to be in the future.





Published on July 16, 2020 19:01

June 13, 2020

Who’s afraid of serializability?

Kyle Kingsbury recently did an analysis of PostgreSQL 12.3 and found that under certain conditions it violated guarantees it makes about transactions, including violations of the serializable transaction isolation level.


I thought it would be fun to use one of his counterexamples to illustrate what serializable means.


Here’s one of the counterexamples that Kingsbury’s tool found:






[Figure: the counterexample found by Kingsbury’s tool]



In this counterexample, there are two list objects, here named 1799 and 1798, which I’m going to call x and y. The examples use two list operations, append (denoted "a") and read (denoted "r").


Here’s my redrawing of the example. I’ve drawn all operations against x in blue and all operations against y in red. Note that I’m using the empty list ([]) instead of nil.






[Figure: my redrawing of the counterexample, with operations on x in blue and operations on y in red]



There are two transactions, which I’ve denoted T1 and T2, and each one involves operations on two list objects, denoted x and y.


For transactions that use the serializable isolation level, all of the operations in all of the transactions have to be consistent with some sequential ordering of the transactions. In this particular example, that means that all of the operations have to make sense assuming either:



all of the operations in T1 happened before all of the operations in T2
all of the operations in T2 happened before all of the operations in T1

Assume order: T1, T2

If we assume T1 happened before T2, then the operations for x are:


x = []
T1: x.append(2)
T2: x.read() → []

This history violates the contract of a list: we’ve appended an element to a list but then read an empty list. It’s as if the append didn’t happen!


Assume order: T2, T1

If we assume T2 happened before T1, then the operations for y are:


y = []
T2: y.append(4)
y.append(5)
y.read() → [4, 5]
T1: y.read() → []

This history violates the contract of a list as well: we read [4, 5] and then []. It’s as if the values disappeared!
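
Neither ordering works, which is exactly what makes this pair of transactions non-serializable. Here’s a small sketch of the same check done mechanically (in Python; the encoding of the transactions is my own illustration, not anything from Kingsbury’s tooling): replay the two transactions in every possible order against real lists and see whether all of the reads come out right.

from itertools import permutations

# Each transaction is a list of operations:
# ("append", obj, value) appends value; ("read", obj, expected) records what was read.
T1 = [("append", "x", 2), ("read", "y", [])]
T2 = [("read", "x", []), ("append", "y", 4), ("append", "y", 5), ("read", "y", [4, 5])]

def serializable(*transactions):
    # True if some sequential ordering of the transactions reproduces every read.
    for order in permutations(transactions):
        state = {"x": [], "y": []}
        ok = True
        for txn in order:
            for op, obj, value in txn:
                if op == "append":
                    state[obj].append(value)
                elif state[obj] != value:  # a read that doesn't match what the history saw
                    ok = False
        if ok:
            return True
    return False

print(serializable(T1, T2))  # False: neither T1,T2 nor T2,T1 is consistent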


Kingsbury indicates that this pair of transactions is illegal by annotating the operations with arrows that show required orderings. An "rw" arrow means that the read operation at the tail of the arrow must be ordered before the write operation at its head. If the arrows form a cycle, then the example violates serializability: there’s no possible ordering that can satisfy all of the arrows.
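
For this particular counterexample, the arrows amount to two constraints: T1 read y before T2’s appends took effect, so T1 must come before T2, and T2 read x before T1’s append took effect, so T2 must come before T1. Here’s a tiny sketch of that cycle check (again my own illustration, not Elle’s actual implementation):

# "Must come before" edges implied by the rw arrows in this counterexample.
must_precede = {"T1": ["T2"], "T2": ["T1"]}

def has_cycle(graph):
    # Depth-first search: reaching a node we're still visiting means a cycle.
    visiting, done = set(), set()
    def visit(node):
        if node in visiting:
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(visit(succ) for succ in graph.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False
    return any(visit(node) for node in graph)

print(has_cycle(must_precede))  # True: no ordering can satisfy both arrows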


Serializability, linearizability, locality

This example is a good illustration of how serializability differs from linearizability.
Linearizability is a consistency model that also requires operations to be consistent with some sequential ordering.
However, linearizability is only about individual objects, whereas serializability is about transactions, which involve collections of objects.


(Linearizability also requires that if operation A happens before operation B in time, then operation A must take effect before operation B, and serializability doesn’t require that, but let’s put that aside for now).


The counterexample above is a linearizable history: we can order the operations such that they are consistent with the contracts of x and y. Here’s a valid linearization:


x = []
y = []
x.read() → []
x.append(2)
y.read() → []
y.append(4)
y.append(5)
y.read() → [4, 5]
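
As a quick sanity check, here’s that linearization replayed against ordinary Python lists; every read matches what the history recorded:

x, y = [], []
assert x == []       # x.read() → []
x.append(2)          # x.append(2)
assert y == []       # y.read() → []
y.append(4)          # y.append(4)
y.append(5)          # y.append(5)
assert y == [4, 5]   # y.read() → [4, 5]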

This example demonstrates how it’s possible to have histories that are linearizable but not serializable.


We say that linearizability is a local property whereas serializability is not: by the definition of linearizability, we can determine whether a history is linearizable by looking at the histories of the individual objects (x, y), but we can’t do that for serializability.

Published on June 13, 2020 23:05

May 25, 2020

SRE, CSE, and the safety boundary

Site reliability engineering (SRE) and cognitive systems engineering (CSE) are two fields seeking the same goal: helping to design, build, and operate complex, software-intensive systems that stay up and running. They both worry about incidents and human workload, and they both reason about systems in terms of models. But their approaches are very different, and this post is about exploring one of those differences.


Caveat: I believe that you can’t really understand a field unless you either have direct working experience, or you have observed people doing work in the field. I’m not a site reliability engineer or a cognitive systems engineer, nor have I directly observed SREs or CSEs at work. This post is an outsider’s perspective on both of these fields. But I think it holds true to the philosophies that these approaches espouse publicly. Whether it corresponds to the actual day-to-day work of SREs and CSEs, I will leave to the judgment of the folks on the ground who actually do SRE or CSE work.


A bit of background

Site reliability engineering was popularized by Google, and continues to be strongly associated with the company. Google has published three O’Reilly books, the first one in 2016. I won’t say any more about the background of SRE here, but there are many other sources (including the Google books) for those who want to know more about the background.


Cognitive systems engineering is much older, tracing its roots back to the early eighties. If SRE is, as Ben Treynor described it, what happens when you ask a software engineer to design an operations function, then CSE is what happens when you ask a psychologist how to prevent nuclear meltdowns.


CSE emerged in the wake of the Three Mile Island accident of 1979, when researchers were trying to make sense of how the accident happened. Before Three Mile Island, research on "human factors" aspects of work had focused on human physiology (for example, designing airplane cockpits), but after TMI the focus expanded to include cognitive aspects of work. The two researchers most closely associated with CSE, Erik Hollnagel and David Woods, were both trained as psychology researchers: their paper Cognitive Systems Engineering: New wine in new bottles marks the birth of the field (Thai Wood covered this paper in his excellent Resilience Roundup newsletter).


CSE has been applied in many different domains, but I think it would be unknown in the "tech" community were it not for the tireless efforts of John Allspaw to popularize the results of CSE research that has been done in the past four decades.







The field and decades of writing on topics such as these and many more like them: it’s Cognitive Systems Engineering.

CSE and its methods are the predominant fuel of Resilience Engineering.

— John Allspaw (@allspaw) April 18, 2020





A useful metaphor: Rasmussen’s dynamic safety model

Jens Rasmussen was a Danish safety researcher whose work remains deeply influential in CSE. In 1997 he published a paper titled Risk management in a dynamic society: a modelling problem. This paper introduced the metaphor of the safety boundary, as illustrated in the following visual model, which I’ve reproduced from this paper:






[Figure: Rasmussen’s dynamic safety model, reproduced from the 1997 paper]



Rasmussen viewed a safety-critical system as a point that moves inside of a space enclosed by three boundaries.


At the top right is what Rasmussen called the "boundary to economic failure". If the system crosses this boundary, then the system will fail due to poor economic performance. We know that if we try to work too quickly, we sacrifice safety. But we can’t work arbitrarily slowly to increase safety, because then we won’t get anything done. Management naturally puts pressure on the system to move away from this boundary.


At the bottom right is what Rasmussen called the "boundary of unacceptable work load". Management can apply pressure on the workforce to work both safely and quickly, but increasing safety and increasing productivity both require effort on the part of practitioners, and there are limits to the amount of work that people can do. Practitioners naturally put pressure on the system to move away from this boundary.


At the left, the diagram has two boundaries. The outer boundary is what Rasmussen called the "boundary of functionally acceptable performance", which I’ll call the safety boundary. If the system crosses this boundary, an incident happens. We can never know exactly where this boundary is. The inner boundary is labelled "resulting perceived boundary of acceptable performance". That’s where we think the boundary is, and it’s the boundary we try to stay away from.


SRE vs CSE in context of the dynamic safety model

I find the dynamic safety model useful because I think it illustrates the difference in focus between SRE and CSE.


SRE focuses on two questions:



How do we keep the system away from the safety boundary?
What do we do once we’ve crossed the boundary?

To deal with the first question, SRE thinks about issues such as how to design systems and how to introduce changes safely. The second question is the realm of incident response.


CSE, on the other hand, focuses on the following questions:



How will the system behave near the system boundary?
How should we take this boundary behavior into account in our design?

CSE focuses on the space near the boundary, both to learn how work is actually done, and to inform how we should design tools to better support this work. In the words of Woods and Hollnagel:



Discovery is aided by looking at situations that are near the margins of practice and when resource saturation is threatened (attention, workload, etc.). These are the circumstances when one can see how the system stretches to accommodate new demands, and the sources of resilience that usually bridge gaps. – Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, p37



Fascinatingly, CSE has also identified common patterns of system behavior at the boundary that hold across multiple domains. But that will have to wait for a different post.


Reading more about CSE

I’m still a novice in the field of cognitive systems engineering. I’m actually using these posts to help me learn by explaining the concepts to others.


The source I’ve found most useful so far is the book Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, which is referenced in this post. If you prefer videos, Cook’s Lectures on the study of cognitive work is excellent.


I’ve also started a CSE reading list.

Published on May 25, 2020 12:15

Example footnotes

This should show as a footnote[^1].


[^1]: Footnote text goes here.









I have more [^1] to say up here.

[^1]: To say down here.



I have more [^1] to say up here.
[^1]: To say down here.




Published on May 25, 2020 10:57