
Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars

Mounting failures of replication in the social and biological sciences give a new urgency to critically appraising proposed reforms. This book pulls back the cover on disagreements between experts charged with restoring integrity to science. It denies two pervasive views of the role of probability in inference: to assign degrees of belief, and to control error rates in the long run. If statistical consumers are unaware of the assumptions behind rival evidence reforms, they can't scrutinize the consequences that affect them (in personalized medicine, psychology, etc.). The book sets sail with a simple tool: if little has been done to rule out flaws in inferring a claim, then it has not passed a severe test. Many methods advocated by data experts do not stand up to severe scrutiny and are in tension with successful strategies for blocking or accounting for cherry picking and selective reporting. Through a series of excursions and exhibits, the philosophy and history of inductive inference come alive. Philosophical tools are put to work to solve problems about science and pseudoscience, induction and falsification.

486 pages, Hardcover

Published September 20, 2018

55 people are currently reading
288 people want to read

About the author

Deborah G. Mayo

7 books · 5 followers


Community Reviews

5 stars: 8 (28%)
4 stars: 10 (35%)
3 stars: 5 (17%)
2 stars: 4 (14%)
1 star: 1 (3%)
Displaying 1 - 5 of 5 reviews
79 reviews · 7 followers
abandoned
April 2, 2019
I think I'd be wasting my time continuing with this. I'm not a statistician or a user of statistical methods, so I'm not really in a position to judge the book -- it's just not for me. A couple of thoughts anyway:

- The style is strange; it gave me the feeling of being dropped into the author's train of thought, or perhaps thrown headlong into its path. I don't think that was due to the complexity of the ideas, or even to the semantic density of the writing, but rather to some combination of stream-of-consciousness composition and assumed shared background between writer and reader. In a popular book I would simply call this bad writing/editing, but since Mayo is a specialist writing for specialists, I'm not in a position to lay down that kind of judgment. It didn't feel necessary, though, at least in the early sections that I read.

- (What follows may simply reveal my own ignorance and misunderstanding)

p. 53: "Analogous situations to the optional stopping example occur even without optional stopping, as with selecting a data-dependent, maximally likely, alternative. Here's an example from Cox and Hinkley [citation] attributed to Allan Birnbaum [citation]. A single observation is made on X, which can take values 1, 2, ..., 100. [...] If X is observed to be r, [...] then the most likely hypothesis is θ = r. In fact, Pr(X = r; θ = r) = 1. By contrast, Pr(X = r; θ = 0) = 0.01. Whatever value r that is observed, hypothesis θ = r is 100 times as likely as is θ = 0. [...] So "even if in fact θ = 0, we are certain to find evidence apparently pointing strongly against θ = 0, if we allow comparisons of likelihoods chosen in the light of the data" [citation]. This does not happen if the test is restricted to two preselected values. [...] Allan Birnbaum gets the prize for inventing chestnuts that deeply challenge both those who do, and those who do not, hold the Likelihood Principle."

(As far as I can remember, or find using the index, the Likelihood Principle has not yet been formally defined, but on page 30 it is said to be "related" to the "Law of Likelihood", i.e. "Data x are better evidence for hypothesis H1 than for H0 if x is more probable under H1 than under H0: Pr(x; H1) > Pr(x; H0), that is, the likelihood ratio (LR) of H1 over H0 exceeds 1.")

Earlier (p. 38) a similar case is used to illustrate "our most serious problem: The Law of Likelihood permits finding evidence in favor of a hypothesis deliberately arrived at using the data". A deck of cards is shuffled, and the top card turned over: it is the Ace of Diamonds. The LL tells us that "the hypothesis that the deck consists of 52 aces of diamonds [...] is [52 times] better supported than the hypothesis that the deck is normal". This is supposed to present a problem for 'Likelihoodists', one which they can only evade by insisting on the distinction between evidence and belief.

I don't understand what is supposed to be strange or threatening about the 'trick deck' case. The hypothesis that the deck consists of 52 aces of diamonds *has* just received some support, whether we formulated it in advance or not -- but this doesn't imply that we need to reduce our confidence that the deck is normal. All that has happened is the other 51 'the deck consists of 52 copies of one card' hypotheses have just been ruled out. If we gave (or would have given) each of them a 0.01% chance before looking at the data, we should now consider the AD hypothesis 0.52% likely to be true.
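
For what it's worth, here is a quick check of that arithmetic. The 0.01% prior for each trick-deck hypothesis is just the illustrative number from the paragraph above, not anything the book commits to:

```python
# Quick Bayes check of the trick-deck arithmetic above. The priors are the
# illustrative numbers from this review (0.01% for each of the 52 possible
# trick decks), not taken from the book.
trick_prior = 0.0001                  # prior for each "deck is 52 copies of card c"
normal_prior = 1 - 52 * trick_prior   # prior that the deck is normal

# Likelihood of turning over the Ace of Diamonds:
lik_normal = 1 / 52                   # normal deck
lik_trick_AD = 1.0                    # deck of 52 Aces of Diamonds
lik_trick_other = 0.0                 # any other trick deck is now ruled out

evidence = normal_prior * lik_normal + trick_prior * lik_trick_AD
posterior_trick_AD = trick_prior * lik_trick_AD / evidence
posterior_normal = normal_prior * lik_normal / evidence

print(round(posterior_trick_AD, 4))  # ~0.0052, i.e. about 0.52%
print(round(posterior_normal, 4))    # ~0.9948, essentially unchanged
```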

Similarly, I don't understand what is supposed to be puzzling about the Allan Birnbaum case, which seems to be presented almost as a paradox. If we had no other reason to favour the hypothesis that θ = r, then the observed value provides no evidence for or against θ = 0, only evidence in favour of θ = r. The ratio Pr(θ = r)/Pr(θ = 0) increases, but only because θ = [anything other than 0 or r] has been completely ruled out, and its probability mass transferred to θ = r.
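
The same bookkeeping for the Birnbaum case, under a uniform prior over the 101 values of θ (my assumption, to match "no other reason to favour θ = r"), shows that the posterior for θ = 0 doesn't move at all:

```python
# Bayes bookkeeping for the Birnbaum example, under a uniform prior over
# theta in {0, 1, ..., 100} (my assumption). Model: under theta = 0, X is
# uniform on {1, ..., 100}; under theta = k > 0, X = k with certainty.
prior = 1 / 101
r = 37                                  # whatever value happens to be observed

lik = {0: 0.01, r: 1.0}                 # Pr(X = r; theta); zero for every other theta
evidence = sum(prior * lik.get(theta, 0.0) for theta in range(101))

posterior_0 = prior * lik[0] / evidence
posterior_r = prior * lik[r] / evidence

print(round(posterior_0, 6))  # ~0.009901, exactly the prior 1/101: no evidence against theta = 0
print(round(posterior_r, 4))  # ~0.9901: the mass from the 99 ruled-out values moved here
```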

Surely nobody really advocates the Bayesianism-without-prior-probabilities approach that these examples seem to target? Mayo must have a reason for presenting them, but I would have loved a clear explanation of what it is.
Jerzy
557 reviews · 137 followers
October 20, 2023
Finally finished it. Hard to put a star rating on this! I'm very glad I read it, since the ideas are important, and the history & context of past debates is excellent. These ideas will inform my statistical practice and teaching for decades. Yet... the writing is often extremely unclear, not merely in the sense that these are tricky ideas that take time to digest (they are and I'm ok with that!) but in the sense of "did this even have an editor and proofreader?"

Maybe that's in the spirit of error-statistical severe testing itself: although I assign a high subjective posterior probability to "the ideas here are valuable," I assign a low severity to "their presentation and communication is clear and convincing" :-)

Mayo and Spanos (2011), a shorter article on the author's website, is a more succinct introduction to severity. But it's still not all that clear. Can anyone recommend a better place to point a novice?

-----

tl;dr: Sometimes you want to know the answer to the question "Do I have COVID-19 or not?"

But sometimes, you're in a situation where it's impossible to know that for sure, and instead we accept the fact that we have to rely on a good but imperfect test.
In other words, we can only answer questions such as: "Did I use a COVID-19 test from a well-tested, reputable, trustworthy manufacturer? Was the test unexpired? Did I conduct the test correctly?"
If you answer "Yes" to all of these, *then* you might decide to trust the test result as an imperfect stand-in for what you really want to know.
But if you answer "No" to some of them (the test had expired, or you swabbed incorrectly, or whatever), then you know the test simply wasn't good evidence about your COVID status. The test might *be* right (maybe you don't have COVID, and the test result happens to be negative), but it might not; we can't *trust* it; we need a better test.

That's basically what this book is about. "Is this claim true?" (or even "...probably true?") is a different question than "How good is this evidence for/against the claim?" We are often in situations where we can't truly answer the 1st question, but at least we can think carefully about answering the 2nd. In those cases, what questions should we ask about the evidence? And how can we design better evidence-gathering procedures?

...Update, a few months later (Aug 2023): One more (real-life!!!) example that came to my mind! When my kids were younger and just learning to bike, they would sometimes bike far ahead of me and get to an intersection and want to cross it before I caught up with them, and I'd be yelling to them. Do you think I was yelling "Based on the data already available, what's your posterior probability that a car is coming?" No, of course not. I was yelling "Look both ways before you cross!"

* Under the Bayesian perspective, we typically want to elicit priors, look at the available data, calculate posteriors, and use the posteriors to make decisions. For the bike example, there was always a low "prior prob" of cars coming---the neighborhood where we biked is very quiet. And since my kids (at first!) were only focused on looking straight ahead and didn't look left & right, their "available data" would almost surely show no cars. So their "posterior prob" that no car is coming would almost always be very very high... even on the rare occasion that a car really *was* coming.
* Under the Frequentist or Mayo's error-statistician perspective, what we *should* do instead is design useful studies that'll help us not fool ourselves. In the bike example, I *don't* care at all about the posterior probability as such. I *do* care about whether my kids used a "study design" that would reliably detect a mistake if there were one. I care about whether they looked both ways, so that if there were a car they would see one. (A toy numeric version of this contrast is sketched just below.)
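
Here is that contrast with entirely made-up numbers, just to show how the two questions come apart:

```python
# Toy numbers for the bike example (entirely made up) to contrast the
# Bayesian posterior with the error-probe question.
p_car = 0.02  # prior probability a car is coming on this quiet street

def posterior_no_car(p_detect_if_car):
    """Posterior that no car is coming, given the kid saw nothing (Bayes' rule).
    p_detect_if_car: chance the kid notices a car, given one is coming."""
    p_saw_nothing = (1 - p_car) + p_car * (1 - p_detect_if_car)
    return (1 - p_car) / p_saw_nothing

for name, p_detect in [("looked straight ahead only", 0.10),
                       ("looked both ways", 0.95)]:
    print(name,
          "| posterior(no car | saw nothing) =", round(posterior_no_car(p_detect), 3),
          "| would have spotted a real car:", p_detect)

# Both posteriors come out around 0.98 or higher, so "update on the available
# data" barely distinguishes the two procedures; the probe question (0.10 vs
# 0.95 chance of catching a real car) distinguishes them sharply.
```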

"Did we collect enough data? Did we collect the right data?" are different questions than "How should I update my priors after data lands in my lap?"

~~~

I'll admit that sometimes Deborah Mayo's writing can be unclear; when I find her comments on old blog posts I often can't make heads or tails of what she wrote. Even in this book, her writing could have used some additional editing; and I cannot understand why she uses "significance level" for what everyone else calls "p-value" (and something else for what everyone else calls a significance level). But when I can parse her ideas, I generally agree and find them to be a sensible, well-thought-out philosophy of statistics.

In this book, the key idea seems to be: We need a philosophy of statistical inference that helps us judge how well a dataset or test can be taken as evidence for/against a claim, which is distinct from the probability that the claim is true. If we blur the lines between "how good is this evidence?" and "is the claim true?" then we can't really make sense of statistical inference, and frankly can't do good science in general.

* We can't honestly say that the data provide evidence for a (scientific or substantive) claim if we haven't attempted to rule out ways that the claim might be false. Mayo's catchy phrase/acronym for this is to say we have "bad evidence, no test (BENT)."

For example, on p.18: "Playfair's formula may be true, or probably true, but Peirce's little demonstration is enough to show his [ie Playfair's] method did a lousy job of testing it."

* On the flip side, if we have indeed run "severe" tests---ones that would have been likely to find flaws in our claim if they were there---and the tests found no such flaws, then we do indeed have evidence for the claim.
(By the way, it seems that her use of "severe tests" is not limited to traditional statistical hypothesis tests; those are just the ones that give the most people the most headaches in trying to understand & interpret them :) Different things like "can we reproduce the statistical summaries in your paper if you share your raw data with us?" are also severe tests for certain kinds of flaws. The "Macho Man" example in Section 2.6 gives several other ways to probe a psych experiment for flaws: whether or not there was stat. significance, are these measurements and null hypotheses actually relevant to the substantive/scientific question?)

To me, this is at the heart of the frequentist-vs-Bayesian schism:
* Many Bayesians seem to think that frequentists' P(data | claim) are just convoluted & ineffective attempts to talk about P(claim | data & prior), ie, what number should I assign to my quantified belief in the claim after seeing the data? Is the claim true or not? Since frequentists can't put a number on this directly, they are just incompetent Bayesians, or so the story often goes.
* But (competent) frequentists are really trying to do something else entirely. How do I design a study that'll give solid evidence about the claim? Is the study well-designed or not? (Of course, there are also many people using & even teaching frequentist methods without really understanding this crucial point... But "get rid of frequentism because it's often misused" doesn't help us, if we can't replace it with something else that achieves the same important goal!)

It reminds me of Kevin Kelly & Clark Glymour's view in "Why bayesian confirmation does not capture the logic of scientific justification": Sure, frequentism doesn't do exactly what scientists really want... neither can anything else, including Bayesianism. And given the limitations on what can be done, frequentism is probably closer to what scientists can/should settle for. Kelly & Glymour are pretty harsh in this respect:
Of course, this "best-we-can-do" justification is not as satisfying as a "two-sided" decision procedure would be, but that kind of performance is impossible and the grown-up attitude is to obtain the best possible performance as efficiently as possible rather than to opine the impossible.


So far, Mayo's clearly not a Bayesian, but she's also hinting that she is not fully in the frequentist camp either. Not sure where it'll end up yet.

...Update, a few months later (Aug 2023): Mayo is mostly aligned with best-practices Frequentists after all. However, she claims that she has a different justification for using traditional frequentist methods--specifically for going from "My test did/didn't reject H0: mu=0" to the interpretation that "We do/don't infer that mu>0" or similar.

* She says the standard justification is behavioristic: "With a small p-value we can reject H0, because doing so will rarely be wrong (in the long run, across many applications of such tests)." To her and many others, this is unsatisfying: why should long-run properties convince us that we made the right decision *here*, with *this* dataset?
* By contrast, Mayo claims that her "severity" approach is more directly inferential: "With a small p-value we can reject H0 because such a big departure from H0 (in HA's direction) would be very unlikely to occur if H0 were true (and would be plausible if HA were true)." (A rough numeric sketch of this severity reasoning is given just below.)
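
For concreteness, here is my rough numeric reading of the severity calculation from the Mayo and Spanos article mentioned above. The numbers are made up and the formulas are my paraphrase, not a quotation:

```python
# My rough numeric reading of the severity calculation (made-up numbers;
# the formulas are my paraphrase of Mayo & Spanos, not a quotation).
# Test T+: H0: mu <= 0 vs H1: mu > 0, with sigma known.
from statistics import NormalDist

sigma, n = 1.0, 25
se = sigma / n ** 0.5
xbar_obs = 0.4                        # hypothetical observed sample mean

z = xbar_obs / se                     # test statistic under mu = 0
p_value = 1 - NormalDist().cdf(z)     # one-sided p-value, ~0.023: reject H0

def severity_mu_greater_than(mu1):
    """SEV(mu > mu1) after rejecting: probability of a result *smaller* than
    the one observed, if mu were only mu1."""
    return NormalDist().cdf((xbar_obs - mu1) / se)

print(round(p_value, 3))                        # 0.023
print(round(severity_mu_greater_than(0.0), 3))  # 0.977: "mu > 0" passes with high severity
print(round(severity_mu_greater_than(0.3), 3))  # 0.691: much weaker warrant for "mu > 0.3"
```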

But... to my eyes, this seems like just a different wording of the same argument!!!
"With a small p-value we can reject H0 because we rarely get a small p-value when H0 is true and therefore our method is rarely wrong."
Right?

Hmm... or maybe it's a more specific version of it. Mayo claims that the "long-run" argument is necessary but not sufficient. So maybe her severity approach is not meant to be a *different* argument, but just a *more specific* argument, one that fills in the "long-run" argument's gaps between necessary and sufficient?

* Perhaps "our method will rarely be wrong" isn't satisfactory on its own because there exist *pointless* "methods" that are also rarely wrong---such as ignoring the data and drawing p from Unif(0,1) to decide whether to reject H0. This is rarely wrong when H0 is true, yet it's useless for using data to learn about the world. (A quick simulation after this list makes this concrete.)
* And maybe Mayo is claiming that by saying "our method will rarely be wrong *in this specific way*" she has ruled out pointless-but-rarely-wrong methods, leaving only useful-and-rarely-wrong methods?
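
Here is that quick simulation (mine, purely illustrative): the data-ignoring rule controls the type I error rate just as well as a real z-test, but has essentially no power.

```python
# Quick simulation of the "pointless but rarely wrong" point: a rule that
# ignores the data and draws p ~ Unif(0,1) has the right type I error rate
# but essentially no power, unlike an actual one-sided z-test.
import random
from statistics import NormalDist

random.seed(0)
n, reps, alpha = 25, 20_000, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)

def reject_ztest(mu):
    xbar = sum(random.gauss(mu, 1) for _ in range(n)) / n
    return xbar / (1 / n ** 0.5) > z_crit    # one-sided z-test of H0: mu <= 0

def reject_coinflip(mu):
    return random.random() < alpha           # never even looks at the data

for method in (reject_ztest, reject_coinflip):
    size = sum(method(0.0) for _ in range(reps)) / reps   # rejection rate when H0 is true
    power = sum(method(0.5) for _ in range(reps)) / reps  # rejection rate when mu = 0.5
    print(method.__name__, round(size, 3), round(power, 3))

# z-test: size ~0.05, power ~0.80.  Coin flip: size ~0.05, power ~0.05.
```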

~~~

In the section on the Law of Likelihood and also the Likelihood Principle, I like how Mayo frames her criticisms of Royall's views. According to (Mayo's summary of) Royall, we have to distinguish evidence vs belief (which I do agree with)... so we can simply use likelihood ratios for evidence and Bayesian priors for belief. But Royall seems to think that cherry-picked hypotheses, early peeking, and all the other data-snooping stuff that frequentists try to prevent can just be handled with priors, and I *don't* agree with that. Royall (according to Mayo) says that we can just trust likelihood ratios as comparative evidence for any two hypotheses H0 and H1, *no matter how we came up with them*; and if that ever seems to lead to a weird and bad conclusion, it just means you must have a prior about those particular hypotheses, so go Bayesian and include that prior in deciding which hypothesis to believe.

This makes no sense to me, nor to Mayo. I might indeed have my own personal prior beliefs about the plausibility of each hypothesis; but such priors are *distinct* from whether we have a well-designed study and an adequate sample size and pre-specified hypotheses and all that good stuff. First and foremost, we need to do good science! Doing bad science and then using a prior doesn't magically make it good science. Hiding a rotten floorboard under a rug doesn't mean that the floor is safe for guests to stand on.
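
To make the data-snooping worry concrete, here is a tiny simulation of mine in the spirit of the Birnbaum/Cox-Hinkley example quoted in another review on this page: if you are allowed to pick the alternative after seeing the data, the likelihood ratio against the true hypothesis is huge every single time.

```python
# Illustration of why "just trust the likelihood ratio, however the
# hypotheses were chosen" is troubling: the alternative is chosen to match
# the data, so the LR against the true hypothesis (theta = 0) is always 100.
import random
random.seed(1)

for _ in range(5):
    x = random.randint(1, 100)           # data generated under theta = 0:
                                         # X uniform on {1, ..., 100}
    lik_theta_0 = 1 / 100                # Pr(X = x; theta = 0)
    lik_theta_x = 1.0                    # Pr(X = x; theta = x), picked after seeing x
    print(x, lik_theta_x / lik_theta_0)  # likelihood ratio = 100, every single time
```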
Haydn
126 reviews · 3 followers
October 3, 2023
Gave up, with some regret, 3/4 of the way through.

The content is good. Great, even. But Mayo doesn't get to the real juice - to her real points - quickly enough. She lacks consistency. This is why I persevered with this book for so long and was reluctant to abandon ship. Large swaths of excellent material are interspersed with superfluous sections, overly verbose paragraphs, and the occasional wince-inducing phrase.

Hope to pick it up again in future.
153 reviews
August 17, 2020
I skipped excursion 5 because I couldn't take any more normal hypothesis testing examples. I have a long list of questions to answer for myself after reading this, but they are mostly about basic tenets of various statistical philosophies rather than severity itself. My (very naive) impression is that the primary contribution of severity is as justification for frequentist inference without having to refer to long run performance. I remain unclear on how severity and post-hoc power analysis are different. I was also disappointed by how little time was spent discussing Kass' statistical pragmatism -- see also Tong (2019) in The American Statistician for another interesting perspective on the pragmatic/"conditional" interpretation of models.
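
The way I have usually seen the severity vs. post-hoc power difference explained (my framing, not necessarily Mayo's wording) is that post-hoc power plugs an effect size into the ordinary power function, which is evaluated at the test's cutoff, while severity for "mu <= mu1" after a non-significant result is evaluated at the actually observed outcome. A sketch with made-up numbers:

```python
# My framing of the usual contrast (not necessarily Mayo's wording; numbers
# are made up): post-hoc power depends on the test's cutoff, severity on the
# actually observed result.
from statistics import NormalDist

sigma, n, alpha = 1.0, 25, 0.05
se = sigma / n ** 0.5
cutoff = NormalDist().inv_cdf(1 - alpha) * se   # reject H0: mu <= 0 when xbar > cutoff
xbar_obs = 0.25                                 # hypothetical non-significant sample mean

def power_at(mu):
    """Pr(xbar > cutoff; mu): depends on the cutoff, not on the data."""
    return 1 - NormalDist().cdf((cutoff - mu) / se)

def severity_mu_at_most(mu1):
    """SEV(mu <= mu1) after failing to reject: Pr(xbar > xbar_obs; mu = mu1)."""
    return 1 - NormalDist().cdf((xbar_obs - mu1) / se)

print(round(power_at(xbar_obs), 3))             # "post-hoc power" at the observed effect, ~0.35
print(round(severity_mu_at_most(xbar_obs), 3))  # severity, ~0.5: different number, different question
```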
Athanassios Protopapas
22 reviews · 3 followers
November 5, 2023
This was an interesting, if uneven, book. The historical perspective was rich and informative, even for an outsider like me. The examples are all quite simple in terms of math, so one does not need an advanced degree in applied math to follow them. The writing is quite idiosyncratic, often entertaining. But the book is written as a discussion with philosophers of statistics, and as such can probably only be fully understood by professional philosophers of statistics. This is a shame because I don't think most of the concepts discussed are that impenetrable. I think it is the writing style that makes it a struggle to follow the arguments, even (those I think are) the most basic ones. Like others have commented in online reviews, the book feels like you popped up in the middle of a discussion you weren't invited to. It would have been so much easier if the manuscript had been reviewed by someone with just a basic background in statistics and philosophy of science who could indicate all the parts that needed an additional phrase or two to make sure everyone is on board for the argument. And some extra editing could certainly help. In the end I think I kind of got the gist of the severity idea, but maybe not. It was still very interesting to read about the Fisher/Neyman-Pearson debates and the Bayesian debates; and comforting to confirm my primitive intuitions about power and confidence intervals (and thus about which hypotheses one can reject with some confidence, or not). I guess this book wasn't written for me, and that's OK, but it still feels like a missed opportunity to effectively reach a much wider audience.
