Noise: A Flaw in Human Judgment
Kindle Notes & Highlights
Read between September 16 and November 21, 2021
Regardless of diversity, aggregation can only reduce noise if judgments are truly independent.
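A quick sketch of why independence matters (my own illustration, not the book's): for n judgments with common variance s² and pairwise correlation ρ, the variance of the average is s²(1/n + (1 − 1/n)ρ). Only when ρ ≈ 0 does averaging keep shrinking the noise; with correlated judges it flattens out at a floor of ρ·s² no matter how many opinions you add.

```python
# Variance of the mean of n equally correlated judgments:
#   Var(mean) = s2 * (1/n + (1 - 1/n) * rho)
def var_of_mean(s2: float, n: int, rho: float) -> float:
    return s2 * (1 / n + (1 - 1 / n) * rho)

for rho in (0.0, 0.5):
    print(rho, [round(var_of_mean(1.0, n, rho), 2) for n in (1, 4, 16, 64)])
# 0.0 [1.0, 0.25, 0.06, 0.02]  -> independent judges: noise keeps shrinking
# 0.5 [1.0, 0.62, 0.53, 0.51]  -> correlated judges: noise floor near rho * s2
```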
group deliberation often adds more error in bias than it removes in noise.
“We have a good team, but how can we ensure more diversity of opinions?”
If you have a fasting blood sugar level of 126 milligrams per deciliter or higher or an HbA1c (an average measure of blood sugar over the prior three months) of at least 6.5, you are considered to have diabetes.
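This criterion is exactly the kind of mechanical rule the book contrasts with noisy judgment. A minimal sketch of the threshold as a computation (the function name and test values are my own, for illustration only):

```python
# Diabetes criterion from the highlight, expressed as a mechanical rule:
# fasting glucose >= 126 mg/dL, or HbA1c >= 6.5.
def meets_diabetes_criteria(fasting_glucose_mg_dl: float, hba1c_pct: float) -> bool:
    return fasting_glucose_mg_dl >= 126 or hba1c_pct >= 6.5

print(meets_diabetes_criteria(130, 6.0))  # True  (glucose criterion met)
print(meets_diabetes_criteria(110, 6.2))  # False (neither threshold reached)
```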
the surprise is not the existence of noise in the medical profession. It is its sheer magnitude.
decision hygiene strategy: the development of diagnostic guidelines.
noise in medicine is hardly limited to noise in diagnostic judgments,
skill matters a lot. A study of pneumonia diagnoses by radiologists, for instance, found significant noise. Much of it came from differences in skill. More specifically, “variation in skill can explain 44% of the variation in diagnostic decisions,” suggesting that “policies that improve skill perform better than uniform decision guidelines.”
In medicine, between-person noise, or interrater reliability, is usually measured by the kappa statistic. The higher the kappa, the less noise. A kappa value of 1 reflects perfect agreement;
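For reference, Cohen's kappa is κ = (p_o − p_e)/(1 − p_e), where p_o is the observed rate of agreement and p_e is the agreement expected by chance given each rater's base rates. A minimal sketch with hypothetical readings:

```python
from collections import Counter

def cohen_kappa(rater_a: list, rater_b: list) -> float:
    """kappa = (p_o - p_e) / (1 - p_e): agreement corrected for chance."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n)               # chance agreement
              for l in set(freq_a) | set(freq_b))
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical radiologists reading the same ten scans:
a = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]
print(round(cohen_kappa(a, b), 2))  # 0.58 -- "moderate" agreement, i.e., noisy
```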
“fair,” which is of course better but which also indicates significant noise.
Coronary angiograms,
In nonacute settings, when a patient presents with recurrent chest pain, treatment—such as stent placement—is often pursued if one or more arteries is found to be more than 70% blocked. However, a degree of variability in interpreting angiograms has been documented, potentially leading to unnecessary procedures.
Endometriosis
chest X-ray,
diagnosis of TB
melanoma
dermatologists at New York University failed to diagnose melanoma from skin biopsies 36% of the time.
variability in radiologists’ judgments with respect to breast cancer from screening mammograms.
doctors are significantly more likely to order cancer screenings early in the morning than late in the afternoon.
physicians almost inevitably run behind in clinic after seeing patients with complex medical problems that require more than the usual twenty-minute slot. We already mentioned the role of stress and fatigue as triggers of occasion noise (see chapter 7), and these elements seem to be at work here.
What might work to reduce noise in medicine? As we mentioned, training can increase skill, and skill certainly helps. So does the aggregation of multiple expert judgments (second opinions and so forth). Algorithms offer an especially promising avenue, and doctors are now using deep-learning algorithms and artificial intelligence to reduce noise.
And AI now performs at least as well as radiologists do in detecting cancer from mammograms; further advances in AI will probably demonstrate its superiority.
Apgar score, developed in 1952 by the obstetric anesthesiologist Virginia Apgar.
“backronym”
Note that heart rate is the only strictly numerical component of the score and that all the other items involve an element of judgment. But because the judgment is decomposed into individual elements, each of which is straightforward to assess, practitioners with even a modest degree of training are unlikely to disagree a great deal—and hence Apgar scoring produces little noise.
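A sketch of that decomposition (component names follow the APGAR backronym; the example ratings are hypothetical):

```python
# Apgar scoring: five separately judged components, each rated 0, 1, or 2,
# summed into a 0-10 score. Decomposing the judgment keeps each piece easy.
APGAR_COMPONENTS = ("appearance", "pulse", "grimace", "activity", "respiration")

def apgar_score(ratings: dict) -> int:
    assert set(ratings) == set(APGAR_COMPONENTS)
    assert all(r in (0, 1, 2) for r in ratings.values())
    return sum(ratings.values())

# Hypothetical newborn: pink, heart rate > 100, crying, some limb flexion:
print(apgar_score({"appearance": 2, "pulse": 2, "grimace": 2,
                   "activity": 1, "respiration": 2}))  # 9
```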
Unlike rules or algorithms, guidelines do not eliminate the need for judgment: the decision is not a straightforward computation. Disagreement remains possible on each of the components and hence on the final conclusion. Yet guidelines succeed in reducing noise because they decompose a complex decision into a number of easier subjudgments on predefined dimensions.
Why is this? Specialists lack a single, clear answer (which means that the explanations for noise are themselves noisy).
Note (Erhan): In most instances, the diagnostician relies on symptoms experienced and described by the patient. In other words, the physician is dealing with subjective symptoms and not objective signs.
“inconstancy of the physician”: different schools of thought, different training, different clinical experiences, different interview styles.
Beyond physician differences, however, the main reason for noise was “inadequacy of the nomenclature.”
DSM-III led to a dramatic increase in the research on whether diagnoses were noisy. It also proved helpful in reducing noise.
“the use of diagnostic criteria for psychiatric disorders has been shown to increase the reliability of psychiatric diagnoses.” On the other hand, there continues to be a serious risk that “admissions of a single patient will reveal multiple diagnoses for the same patient.”
“psychiatrists have a hard time agreeing on who does and does not have major depressive disorder.” Field trials for DSM-5 found “minimal agreement,” which “means that highly trained specialist psychiatrists under study conditions were only able to agree that a patient has depression between 4 and 15% of the time.”
According to some field trials, DSM-5 actually made things worse, showing increased noise “in all major domains, with some diagnoses, such as mixed anxiety-depressive disorder… so unreliable as to appear useless in clinical practice.”
In medicine in general, guidelines have been highly successful in reducing both bias and noise. They have helped doctors, nurses, and patients and greatly improved public health in the process. The medical profession needs more of them.
“Basically Every Single Person Hates Performance Reviews.” Every single person also knows (we think) that performance reviews are subject to both bias and noise.
Today’s knowledge workers balance multiple, sometimes contradictory objectives. Focusing on only one of them might produce erroneous evaluations and have harmful incentive effects. The number of patients a doctor sees every day is an important driver of hospital productivity, for example, but you would not want physicians to focus single-mindedly on that indicator, much less to be evaluated and rewarded only on that basis.
Studies often find that true variance, that is, variance attributable to the person’s performance, accounts for no more than 20 to 30% of the total variance. The rest, 70 to 80% of the variance in the ratings, is system noise.
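To make that split concrete, here is a toy simulation (all weights are my own, chosen so the true-performance share lands in the quoted 20–30% range):

```python
import random

random.seed(0)
n_employees, n_raters = 500, 20
true_perf = [random.gauss(0, 1.0) for _ in range(n_employees)]   # what we want to measure
rater_bias = [random.gauss(0, 1.2) for _ in range(n_raters)]     # level noise (harsh/lenient raters)
ratings = [true_perf[e] + rater_bias[r] + random.gauss(0, 1.2)   # pattern + occasion noise
           for e in range(n_employees) for r in range(n_raters)]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Share of rating variance that reflects actual performance differences:
print(f"true-variance share: {var(true_perf) / var(ratings):.0%}")  # roughly 25%
```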
“the relationship between job performance and ratings of job performance is likely to be weak or at best uncertain.”
raters might not in fact attempt to evaluate performance accurately but might rate people “strategically.”
some 360-degree feedback systems are used solely for developmental purposes. With these systems, the respondents are told that the feedback will not be used for evaluation purposes. To the extent that the raters actually believe what they are told, this approach discourages them from inflating—or deflating—ratings.
Even when the feedback is purely developmental, ratings remain noisy.
the halo effect implies that supposedly separate dimensions will in fact not be treated separately. A strong positive or negative rating on one of the first questions will tend to pull answers to subsequent questions in the same direction.
Note (Erhan): Pay attention to the very first question of surveys and scales.
Finally, 360-degree systems are not immune to a near-universal disease of all performance measurement systems: creeping ratings inflation. One large industrial company once observed that 98% of its managers had been rated as “fully meeting expectations.” When almost everyone receives the highest possible rating, it is fair to question the value of these ratings.
common frame of reference.
frame-of-reference training, has been shown to help ensure consistency between raters.
case scales are more reliable than scales that use numbers, adjectives, or behavioral descriptions.
if your goal is to determine which candidates will succeed in a job and which will fail, standard interviews (also called unstructured interviews to distinguish them from structured interviews, to which we will turn shortly) are not very informative. To put it more starkly, they are often useless.
if all you know about two candidates is that one appeared better than the other in the interview, the chances that this candidate will indeed perform better are about 56 to 61%.
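That range follows from the “percent concordant” measure the book uses: for jointly normal predictor and outcome with correlation r, PC = 50% + arcsin(r)/π. Plugging in validity correlations of roughly .20 to .33, a typical range reported for unstructured interviews, reproduces the quoted figures (my reconstruction of the arithmetic, not a quote):

```python
import math

def percent_concordant(r: float) -> float:
    # P(the higher-scored candidate also performs better), bivariate normal case
    return 0.5 + math.asin(r) / math.pi

for r in (0.20, 0.33):
    print(f"r = {r:.2f} -> PC = {percent_concordant(r):.0%}")
# r = 0.20 -> PC = 56%
# r = 0.33 -> PC = 61%
```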
There is strong evidence, for instance, that hiring recommendations are linked to impressions formed in the informal rapport-building phase of an interview, those first two or three minutes where you just chat amicably to put the candidate at ease. First impressions turn out to matter—a lot.
initial impressions have a deep effect on the way the interview proceeds.