The most robust set of answers to this question can be found in a report by the President’s Council of Advisors on Science and Technology (PCAST), an advisory group of the nation’s leading scientists and engineers, which in 2016 produced an in-depth review of forensic science in criminal courts. The report summarizes the available evidence on the validity of fingerprint analysis and especially on the likelihood of erroneous identifications (false positives) such as the one involving Mayfield.
This evidence is surprisingly sparse, and as PCAST notes, it is “distressing” that work to produce it did not begin until recently. The most credible data come from the only published large-scale study of fingerprint identification accuracy, which was conducted by FBI scientists themselves in 2011. The study involved 169 examiners, each comparing approximately one hundred pairs of latent and exemplar fingerprints. Its central finding was that very few erroneous identifications occurred: the false-positive rate was about one in six hundred.
The first step to reduce noise must be, of course, to acknowledge its possibility.
Even when they are aware of the risk of bias, forensic scientists are not immune to the bias blind spot: the tendency to acknowledge the presence of bias in others, but not in oneself.
Thanks to the persistence of Dror and his colleagues, attitudes are slowly changing and a growing number of forensic laboratories have begun taking new measures to reduce error in their analyses.
The necessary methodological steps are relatively simple. They illustrate a decision hygiene strategy that has applicability in many domains: sequencing information to limit the formation of premature intuitions. In any judgment, some information is relevant, and some is not. More information is not always better, especially if it has the potential to bias judgments by leading the judge to form a premature intuition.
In that spirit, the new procedures deployed in forensic laboratories aim to protect the independence of the examiners’ judgments by giving the examiners only the information they need, when they need it. In other words, the laboratory keeps them as much in the dark about the case as possible and reveals information only gradually. The approach that Dror and colleagues codified to do this is called linear sequential unmasking.
Dror has another recommendation that illustrates the same decision hygiene strategy: examiners should document their judgments at each step, before they are given any additional information.
The same logic inspires a third recommendation, which is an important part of decision hygiene. When a different examiner is called on to verify the identification made by the first person, the second person should not be aware of the first judgment.
Speaking of Sequencing Information
“Wherever there is judgment, there is noise—and that includes reading fingerprints.”
“We have more information about this case, but let’s not tell the experts everything we know before they make their judgment, so as not to bias them. In fact, let’s tell them only what they absolutely need to know.”
“The second opinion is not independent if the person giving it knows what the first opinion was. And the third one, even less so: there can be a bias cascade.”
“To fight noise, they first have to admit that it exists.”
Many judgments involve forecasting. What is the unemployment rate likely to be in the next quarter? How many electric cars will be sold next year? What will be the effects of climate change in 2050? How long will it take to complete a new building? What will be the annual earnings of a particular company? How will a new employee perform? What will be the cost of a new air pollution regulation? Who will win an election? The answers to such questions have major consequences, and fundamental choices of policy and strategy depend on them.
Analysts of forecasting—of when it goes wrong and why—make a sharp distinction between bias and noise (also called inconsistency or unreliability). Everyone agrees that in some contexts, forecasters are biased. For example, official agencies show unrealistic optimism in their budget forecasts.
If asked to formulate their forecasts as confidence intervals rather than as point estimates, forecasters tend to pick narrower intervals than they should.
Forecasters are also noisy.
Occasion noise is common; forecasters do not always agree with themselves. Between-person noise is also pervasive; forecasters disagree with one another, even if they are specialists.
Averaging is mathematically guaranteed to reduce noise: averaging n independent judgments divides their noise (measured as a standard deviation) by the square root of n.
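A worked version of that claim, assuming the simplest case of n independent judgments with a common standard deviation σ (standard statistics, not a quotation from the book):

$$
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
= \frac{\sigma^{2}}{n},
\qquad
\operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}.
$$

With n = 4, the standard deviation of the average is σ/2, which is why averaging four independent judgments cuts noise in half, as one of the quotes later in this section puts it.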
One method to produce aggregate forecasts is to use prediction markets, in which individuals bet on likely outcomes and are thus incentivized to make the right forecasts. Much of the time, prediction markets have been found to do very well, in the sense that if the prediction market price suggests that events are, say, 70% likely to happen, they happen about 70% of the time. Many companies in various industries have used prediction markets to aggregate diverse views.
Another formal process for aggregating diverse views is known as the Delphi method.
The Delphi method has worked well in many situations, but it can be challenging to implement. A simpler version, mini-Delphi, can be deployed within a single meeting. Also called estimate-talk-estimate, it requires participants first to produce separate (and silent) estimates, then to explain and justify them, and finally to make a new estimate in response to the estimates and explanations of others.
The Good Judgment Project
However, given our objective ignorance of future events, it is much better to formulate probabilistic forecasts.
To know whether forecasters are good, we should ask whether their probability estimates map onto reality.
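One standard way to check that mapping is the Brier score, the mean squared difference between each probability forecast and the 0/1 outcome; it is the measure used in Tetlock’s forecasting tournaments. The sketch below is a minimal illustration with invented numbers, not the Good Judgment Project’s scoring code.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.

    0.0 is a perfect score; always answering 0.5 scores 0.25; lower is better.
    """
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical example: stated probabilities for three events vs. what actually happened.
forecasts = [0.9, 0.7, 0.2]  # "90% likely", "70% likely", "20% likely"
outcomes = [1, 1, 0]         # 1 = the event occurred, 0 = it did not
print(brier_score(forecasts, outcomes))  # ~0.047 on this tiny, made-up sample
```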
With each new piece of information, Tetlock and his colleagues allowed the forecasters to update their forecasts. For scoring purposes, each one of these updates is treated as a new forecast. That way, participants in the Good Judgment Project are incentivized to monitor the news and update their forecasts continuously.
(A well-known response to this criticism, sometimes attributed to John Maynard Keynes, is, “When the facts change, I change my mind. What do you do?”)
Perpetual Beta
Apart from general intelligence, we could reasonably expect that superforecasters are unusually good with numbers. And they are. But their real advantage is not their talent at math; it is their ease in thinking analytically and probabilistically.
Superforecasters also excel at taking the outside view, and they care a lot about base rates.
In short, what distinguishes the superforecasters isn’t their sheer intelligence; it’s how they apply it. The skills they bring to bear reflect the sort of cognitive style we described in chapter 18 as likely to result in better judgments, particularly a high level of “active open-mindedness.”
To characterize the thinking style of superforecasters, Tetlock uses the phrase “perpetual beta,” a term used by computer programmers for a program that is not meant to be released in a final version but that is endlessly used, analyzed, and improved. Tetlock finds that “the strongest predictor of rising into the ranks of superforecasters is perpetual beta, the degree to which one is committed to belief updating and self-improvement.”
The success of the superforecasting project highlights the value of two decision hygiene strategies: selection (the superforecasters are, well, super) and aggregation (when they work in teams, forecasters perform better). The two strategies are broadly applicable in many judgments. Whenever possible, you should aim to combine the strategies, by constructing teams of judges (e.g., forecasters, investment professionals, recruiting officers) who are selected for being both good at what they do and complementary to one another.
Regardless of diversity, aggregation can only reduce noise if judgments are truly independent.
Speaking of Selection and Aggregation
“Let’s take the average of four independent judgments—this is guaranteed to reduce noise by half.”
“We should strive to be in perpetual beta, like the superforecasters.”
“Before we discuss this situation, what is the relevant base rate?”
“We have a good team, but how can we ensure more diversity of opinions?”
Guidelines in Medicine
A central task of doctors is to make diagnoses—to decide whether a patient has some kind of illness and, if so, to identify it. Diagnosis often requires some kind of judgment. For many conditions, the diagnosis is routine and largely mechanical, and rules and procedures are in place to minimize noise.
Importantly, some diagnoses do not involve judgment at all. Health care often progresses by removing the element of judgment—by shifting from judgment to calculation.
But the surprise is not the existence of noise in the medical profession. It is its sheer magnitude.
When there is noise, one physician may be clearly right and the other may be clearly wrong (and may suffer from some kind of bias). As might be expected, skill matters a lot. A study of pneumonia diagnoses by radiologists, for instance, found significant noise. Much of it came from differences in skill. More specifically, “variation in skill can explain 44% of the variation in diagnostic decisions,” suggesting that “policies that improve skill perform better than uniform decision guidelines.” Here as elsewhere, training and selection are evidently crucial to the reduction of error.
In medicine, between-person noise, or interrater reliability, is usually measured by the kappa statistic. The higher the kappa, the less noise. A kappa value of 1 reflects perfect agreement; a value of 0 reflects exactly as much agreement as you would expect between monkeys throwing darts onto a list of possible diagnoses.
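To make the statistic concrete, here is a minimal sketch (not from the book) of Cohen’s kappa for two raters, using the textbook formula kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance given each rater’s label frequencies. The example data are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of
    agreement and p_e is the agreement expected by chance, i.e. if each rater
    assigned labels at random with their own observed frequencies.
    """
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n)
              for label in set(freq_a) | set(freq_b))
    return (p_o - p_e) / (1 - p_e)


# Hypothetical example: two radiologists reading the same ten chest X-rays.
a = ["pneumonia", "normal", "normal", "pneumonia", "normal",
     "pneumonia", "normal", "normal", "pneumonia", "normal"]
b = ["pneumonia", "normal", "pneumonia", "pneumonia", "normal",
     "normal", "normal", "normal", "pneumonia", "normal"]
print(round(cohens_kappa(a, b), 2))  # 0.58: real agreement, but far from kappa = 1
```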
These cases of interpersonal noise dominate the existing research, but there are also findings of occasion noise. Radiologists sometimes offer a different view when assessing the same image again and thus disagree with themselves (albeit less often than they disagree with others). When assessing the degree of blockage in angiograms, twenty-two physicians agreed with themselves between 63 and 92% of the time. In areas that involve vague criteria and complex judgments, intrarater reliability, as it is called, can be poor.
How can we explain such findings? A possible answer is that physicians almost inevitably run behind in clinic after seeing patients with complex medical problems that require more than the usual twenty-minute slot. We already mentioned the role of stress and fatigue as triggers of occasion noise (see chapter 7), and these elements seem to be at work here. To keep up with their schedules, some doctors skip discussions about preventive health measures. Another illustration of the role of fatigue among clinicians is the lower rate of appropriate handwashing toward the end of hospital shifts.
What might work to reduce noise in medicine? As we mentioned, training can increase skill, and skill certainly helps. So does the aggregation of multiple expert judgments (second opinions and so forth). Algorithms offer an especially promising avenue, and doctors are now using deep-learning algorithms and artificial intelligence to reduce noise. For example, such algorithms have been used to detect lymph node metastases in women with breast cancer. The best of these have been found to be superior to the best pathologists, and, of course, algorithms are not noisy.
Perhaps the most famous example of a guideline for diagnosis is the Apgar score, developed in 1952 by the obstetric anesthesiologist Virginia Apgar. Assessing whether a newborn baby is in distress used to be a matter of clinical judgment for physicians and midwives. Apgar’s score gave them a standard guideline instead. The evaluator measures the baby’s color, heart rate, reflexes, muscle tone, and respiratory effort, sometimes summarized as a “backronym” for Apgar’s name: appearance (skin color), pulse (heart rate), grimace (reflexes), activity (muscle tone), and respiration (breathing rate).
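The structure of the score is simple enough to write down: the evaluator rates each of the five signs 0, 1, or 2, and the total (0 to 10) summarizes the newborn’s condition. The sketch below illustrates only that arithmetic; the clinical criteria for assigning each 0, 1, or 2 come from the guideline itself and are not encoded here.

```python
def apgar_score(appearance, pulse, grimace, activity, respiration):
    """Total Apgar score from the five component ratings, each 0, 1, or 2.

    Returns a value between 0 and 10; higher indicates a newborn in better
    condition. How each sign earns its 0, 1, or 2 is defined by the clinical
    guideline, not by this sketch.
    """
    components = (appearance, pulse, grimace, activity, respiration)
    if any(score not in (0, 1, 2) for score in components):
        raise ValueError("each component must be rated 0, 1, or 2")
    return sum(components)


# Hypothetical assessment one minute after birth.
print(apgar_score(appearance=1, pulse=2, grimace=2, activity=1, respiration=2))  # 8
```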
Speaking of Guidelines in Medicine
“Among doctors, the level of noise is far higher than we might have suspected. In diagnosing cancer and heart disease—even in reading X-rays—specialists sometimes disagree. That means that the treatment a patient gets might be a product of a lottery.”
“Doctors like to think that they make the same decision whether it’s Monday or Friday or early in the morning or late in the afternoon. But it turns out that what doctors say and do might well depend on how tired they are.”
“Medical guidelines can make doctors less likely to blunder at a patient’s expense.”
In almost all large organizations, performance is formally evaluated on a regular basis. Those who are rated do not enjoy the experience. As one newspaper headline put it, “Study Finds That Basically Every Single Person Hates Performance Reviews.” Every single person also knows (we think) that performance reviews are subject to both bias and noise. But most people do not know just how noisy they are.
Thousands of research articles have been published on the practice of performance appraisals. Most researchers find that such appraisals are exceedingly noisy. This sobering conclusion comes mostly from studies based on 360-degree performance reviews, in which multiple raters provide input on the same person being rated, usually on multiple dimensions of performance. When this analysis is conducted, the result is not pretty. Studies often find that true variance, that is, variance attributable to the person’s performance, accounts for no more than 20 to 30% of the total variance. The rest, 70 to 80%, is system noise.
Different studies come to different conclusions on the breakdown of system noise into these three components (level, pattern, and occasion), and we can certainly imagine reasons why it should vary from one organization to the next. But all forms of noise are undesirable. The basic message that emerges from this research is a simple one: most ratings of performance have much less to do with the performance of the person being rated than we would wish. As one review summarizes it, “the relationship between job performance and ratings of job performance is likely to be weak or at best uncertain.”
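To make the variance language concrete, here is a simulation sketch under invented assumptions (not a reanalysis of the studies cited): ratings are generated as a true performance signal plus level, pattern, and occasion noise, and the signal’s share of total variance lands in the 20 to 30% range described above.

```python
import random
import statistics

random.seed(0)

N_RATEES, N_RATERS, N_OCCASIONS = 50, 10, 2

# Invented standard deviations: a real but modest performance signal, plus noise.
SD_TRUE, SD_LEVEL, SD_PATTERN, SD_OCCASION = 1.0, 0.8, 1.2, 0.7

true_perf = [random.gauss(0, SD_TRUE) for _ in range(N_RATEES)]
leniency = [random.gauss(0, SD_LEVEL) for _ in range(N_RATERS)]      # level noise
pattern = [[random.gauss(0, SD_PATTERN) for _ in range(N_RATEES)]    # pattern noise
           for _ in range(N_RATERS)]

ratings = [
    true_perf[j] + leniency[i] + pattern[i][j] + random.gauss(0, SD_OCCASION)
    for i in range(N_RATERS)
    for j in range(N_RATEES)
    for _ in range(N_OCCASIONS)  # occasion noise: a fresh draw for every rating occasion
]

# Because the components are independent, their variances simply add up.
components = {
    "true performance": SD_TRUE**2,
    "level noise": SD_LEVEL**2,
    "pattern noise": SD_PATTERN**2,
    "occasion noise": SD_OCCASION**2,
}
total_var = sum(components.values())
for name, var in components.items():
    print(f"{name}: {var / total_var:.0%} of total variance")  # 28%, 18%, 40%, 14%
print(f"simulated total variance: {statistics.pvariance(ratings):.2f} (theory: {total_var:.2f})")
```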
Almost all organizations use the noise-reduction strategy of aggregation. Aggregate ratings are often associated with 360-degree rating systems, which became the standard in large corporations in the 1990s. (The journal Human Resource Management had a special issue on 360-degree feedback in 1993.)
The rise in popularity of 360-degree feedback coincided with the spread of fluid, project-based organizations.
As we have seen, the halo effect implies that supposedly separate dimensions will in fact not be treated separately. A strong positive or negative rating on one of the first questions will tend to pull answers to subsequent questions in the same direction.
Finally, 360-degree systems are not immune to a near-universal disease of all performance measurement systems: creeping ratings inflation. One large industrial company once observed that 98% of its managers had been rated as “fully meeting expectations.” When almost everyone receives the highest possible rating, it is fair to question the value of these ratings.