Stats can't make modeling decisions

Here's a question that appeared recently on the Reddit statistics forum:
"If effect sizes of coefficient are really small, can you interpret as no relationship?  Coefficients are very significant, which is expected with my large dataset. But coefficients are tiny (0.0000001). Can I conclude no relationship? Or must I say there is a relationship, but it's not practical?"

I posted a response there, but since we get questions like this a lot, I will write a more detailed response here.

First, as several people mentioned on Reddit, you have to distinguish between a small coefficient and a small effect size.  The size of the coefficient depends on the units it is expressed in.  For example, in a previous article I wrote about the relationship between a baby's birth weight and its mother's age ("Are first babies more likely to be light?").  With weights in pounds and ages in years, the estimated coefficient is about 0.017 pounds per year.

At first glance, that looks like a small effect size.  But the average birth weight in the U.S. is about 7.3 pounds, and the range from the youngest to the oldest mother was more than 20 years.  So if we say the effect size is "about 3 ounces per decade", that would be easier to interpret.  Or it might be even better to express the effect in terms of percentages; for example, "A 10-year increase in mother's age is associated with a 2.4% increase in birth weight."
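
As a quick check, here is how those conversions work out, using approximate numbers from the example above (a back-of-the-envelope sketch, not the original analysis):

```python
# Converting the coefficient into practical terms
# (approximate numbers from the birth weight example above)

coef = 0.017          # estimated coefficient: pounds per year of mother's age
mean_weight = 7.3     # average U.S. birth weight in pounds

# Express the effect as ounces per decade
oz_per_decade = coef * 16 * 10
print(f"{oz_per_decade:.1f} ounces per decade")        # about 2.7, roughly 3 ounces

# Or as a percentage of the mean birth weight per decade
pct_per_decade = coef * 10 / mean_weight * 100
print(f"{pct_per_decade:.1f}% per decade")             # a little over 2%
```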

So that's the first part of my answer:

Expressing effect size in practical terms makes it easier to evaluate its importance in practice.

The second part of my answer addresses the question, "Can I conclude no relationship?"  This is a question about modeling, not statistics, and

Statistical analysis can inform modeling choices, but it can't make decisions for you.

As a reminder, when you make a model of a real-world scenario, you have to decide what to include and what to leave out. If you include the most important things and leave out less important things, your model will be good enough for most purposes.

But in most scenarios, there is no single uniquely correct model. Rather, there are many possible models that might be good enough, or not, for various purposes.

Based on statistics alone, you can't say whether there is, or is not, a relationship between two variables.  But statistics can help you justify your decision to include a relationship in a model or ignore it.

The affirmative

If you want to argue that an effect SHOULD be included in a model, you can justify that decision (using classical statistics) in two steps:
1) Estimate the effect size and use background knowledge to make an argument about why it matters in practical terms.  For example, a 3 ounce difference in birth weight might be associated with real differences in health outcomes, or not.
AND
2) Show that the p-value is small, which at least suggests that the observed effect is unlikely to be due to chance.  (Some people will object to this interpretation of p-values, but I explain why I think it is valid in "Hypothesis testing is only mostly useless").
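
As an illustration of these two steps, here is a minimal sketch using statsmodels; the dataset is simulated, standing in for the real birth weight data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data standing in for the real birth weight dataset (hypothetical values)
rng = np.random.default_rng(0)
n = 10_000
mother_age = rng.uniform(18, 40, n)
birthweight = 7.0 + 0.017 * mother_age + rng.normal(0, 1.2, n)
df = pd.DataFrame(dict(mother_age=mother_age, birthweight=birthweight))

results = smf.ols('birthweight ~ mother_age', data=df).fit()

# Step 1: the effect size, converted to practical units
coef = results.params['mother_age']                     # pounds per year
print(f"{coef * 16 * 10:.1f} ounces per decade")

# Step 2: the p-value for the coefficient
print(f"p-value: {results.pvalues['mother_age']:.2g}")
```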

In my study of birth weight, I argued that mother's age should be included in the model because the effect size was big enough to matter in the real world, and because the p-value was very small.

The negative

If you want to argue that it is ok to leave an effect out of a model, you can justify that decision in one of two ways:

1) If you apply a hypothesis test and get a small p-value, you probably can't dismiss the effect as random.  But if the estimated effect size is small, you can use background information to make an argument about why it is negligible.

OR

2)  If you apply a hypothesis test and get a large p-value, that suggests that the effect you observed could be explained by chance.  But that doesn't mean the effect is necessarily negligible.  To make that argument, you need to consider the power of the test.  One way to do that is to find the smallest hypothetical effect size that would yield a high probability of a significant test.  Then you can say something like, "If the effect size were as big as X, this test would have a 90% chance of being statistically significant. The test was not statistically significant, so the effect size is likely to be less than X.  And in practical terms, X is negligible."
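
One way to carry out that kind of power calculation is by simulation.  The following sketch estimates, for a range of hypothetical effect sizes, the probability that a simple regression yields a significant result; the sample size and noise level are assumptions you would replace with values from your own data:

```python
import numpy as np
from scipy import stats

def power_of_slope(effect, n=1000, sigma=1.2, alpha=0.05, iters=500, seed=0):
    """Estimate the probability that a slope of the given size is detected (p < alpha)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(iters):
        x = rng.uniform(18, 40, n)
        y = effect * x + rng.normal(0, sigma, n)
        if stats.linregress(x, y).pvalue < alpha:
            hits += 1
    return hits / iters

# Scan hypothetical effect sizes to find the smallest one detected ~90% of the time
for effect in [0.005, 0.01, 0.02, 0.04]:
    print(effect, power_of_slope(effect))
```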

The Bayesian

So far I have been using the logic of classical statistics, which is problematic in many ways.

Alternatively, in a Bayesian framework, the result would be a posterior distribution on the effect size, which you could use to generate an ensemble of models with different effect sizes. To make predictions, you would generate predictive distributions that represent your uncertainty about the effect size. In that case there's no need to make binary decisions about whether there is, or is not, a relationship.
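
Here is a minimal sketch of that idea, using a normal approximation to the posterior distribution of the slope; the point estimate, standard error, intercept, and noise level are hypothetical stand-ins for values you would get from your own analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior for the slope: approximately normal around the estimate,
# which is roughly what a flat prior and a normal approximation would give
slope_hat, slope_se = 0.017, 0.004
posterior_slopes = rng.normal(slope_hat, slope_se, size=1000)

# Each draw is one member of an ensemble of models; use all of them for prediction
intercept, sigma = 7.0, 1.2          # assumed values, for illustration only
mother_age = 35
predictive = intercept + posterior_slopes * mother_age + rng.normal(0, sigma, size=1000)

# Summarize the predictive distribution instead of making a yes/no decision
low, high = np.percentile(predictive, [5, 95])
print(f"90% predictive interval: {low:.2f} to {high:.2f} pounds")
```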

Or you could use Bayesian model comparison, but I think that's a mostly misguided effort to shoehorn Bayesian methods into a classical framework.  But that's a topic for another time.