Allen B. Downey's Blog: Probably Overthinking It
May 18, 2016
Learning to Love Bayesian Statistics
And here's a transcript of what I said:
Thanks everyone for joining me for this webcast. At the bottom of this slide you can see the URL for my slides, so you can follow along at home.
I’m Allen Downey and I’m a professor at Olin College, which is a new engineering college right outside Boston. Our mission is to fix engineering education, and one of the ways I’m working on that is by teaching Bayesian statistics.
Bayesian methods have been the victim of a 200 year smear campaign. If you are interested in the history and the people involved, I recommend this book, The Theory That Would Not Die. But the result of this campaign is that many of the things you have heard about Bayesian statistics are wrong, or at least misleading.
In particular, you might have heard that Bayesian stats are difficult, computationally slow, the results are subjective; and we don’t need Bayesian stats because they are redundant with more mainstream methods. I’ll explain what each of these claims is about, and then I’ll explain why I disagree.
At most universities, Bayesian stats is a graduate level topic that you can only study after you’ve learned a lot of math and a lot of classical statistics.
And if you pick up most books, you can see why people think it’s hard. This is from one of the most popular books about Bayesian statistics, and this is just page 7. Only 563 pages to go!
Also, you might have heard that Bayesian methods are computationally intensive, so they don’t scale up to real-world problems or big data.
Furthermore, the results you get depend on prior beliefs, so they are subjective. That might be ok for some applications, but not for science, where the results are supposed to be objective.
And there are some cases where, if the prior is objective, the results you get from Bayesian methods are the same as the results from classical methods; so what's the point?
Finally, and this is something I hear from statisticians, we’ve made 100 years of scientific and engineering progress using classical statistics. We’re doing just fine without Bayes.
Ronald Fisher, one of the most prominent statisticians of the 20th century, summed it up like this: “The theory of inverse probability [which is what he called it, because Bayes is the theory that dare not speak its name] is founded upon an error, and must be wholly rejected.” Well, that’s the end of the webcast. Thanks for joining me.
Wait, not so fast. None of what I just said is right; they are all myths. Let me explain them one at a time (although not in the same order). I’ll start with #3.
The results from Bayesian methods are subjective. Actually, this one is true. But that’s a good thing.
In fact, it’s an I.J. Good thing. Another prominent statistician, and a cryptologist, summed it up like this, “The Bayesian states his judgements, whereas the objectivist sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science.”
His point is that we like to pretend that science is strictly objective, but we’re kidding ourselves. Science is always based on modeling decisions, and modeling decisions are always subjective. Bayesian methods make these decisions explicit, and that’s a feature, not a bug.
But all hope for objectivity is not lost. Even if you and I start with different priors, if we see enough data (and agree on how to interpret it) our beliefs will converge. And if we don’t have enough data to converge, we’ll be able to quantify how much uncertainty remains.
On to the next myth, that Bayesian methods are unnecessary because science and engineering are getting along fine without them.
Well, the cracks in that wall are starting to show. In 2005, this paper explained why many published findings are probably false because of unintended flexibility in experimental design, definitions, outcomes and analysis.
This 2011 paper introduced the term “researcher degree of freedom” to explain why false positive rates might be much higher than 5%.
And in 2015 a psychology journal shook the foundations of classical statistics when it banned p-values and confidence intervals in favor of estimated effect sizes.
Most recently, there has been a large-scale effort to replicate effects reported in previous papers. The initial results are discouraging, with many failures to replicate, including some well-known and generally accepted effects.
So everything is not great.
Of course, I don’t want to dismiss every statistical method invented in the last 100 years. I am a big fan of regression, for example. But these particular tools, confidence intervals and hypothesis testing with p-values, we would be better off without.
In fact, here’s what the world would look like today if p-values had never been invented.
The next myth I’ll address is that Bayesian methods are redundant because they produce the same results as classical methods, for particular choices of prior beliefs. To be honest, I have always found this claim baffling.
Classical methods produce results in the form of point estimates and confidence intervals.
Bayesian methods produce a posterior distribution, which contains every possible outcome and its corresponding probability. That’s a different kind of thing from a point estimate or an interval, and it contains a lot more information. So it’s just not the same.
If you want to compare results from Bayesian methods and classical methods, you can use the posterior distribution to generate point estimates and intervals. But that’s not a fair comparison. It’s like running a race between a car and an airplane, but to make them comparable, you keep the plane on the ground. You’re missing the whole point!
Bayesian methods don’t do the same things better. They do different things, and those things are better. Let me give you some examples.
Suppose you run a poll of likely voters and you find that 52% of your sample intends to vote for one of the candidates, Alice. You compute a confidence interval like 42% to 62%, and you’ve got a p-value, 0.34. Now here’s what you really want to know: what is the probability that Alice will win?
Based on these statistics, I have no idea. This is a diagram from a friend of mine, Ted Bunn, explaining that p-values can answer lots of questions, just not the questions we care about.
In contrast, what you get from Bayesian statistics is a posterior distribution, like this, that shows the possible outcomes of the election and the probability of each outcome. With a distribution like this, you can answer the questions you care about, like the probability of victory, or the probability of winning by more than 5 points.
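To make this concrete, here's a minimal sketch of that computation, using a made-up posterior over Alice's share of the vote (the numbers are purely illustrative, not Silver's):

import numpy as np

# Hypothetical posterior over Alice's share of the vote in a two-way race:
# each share has a posterior probability; the probabilities sum to 1.
shares = np.array([0.46, 0.48, 0.50, 0.52, 0.54, 0.56])
probs = np.array([0.05, 0.15, 0.20, 0.30, 0.20, 0.10])

p_victory = probs[shares > 0.50].sum()   # probability that Alice wins
margins = 2 * shares - 1                 # margin over the opponent
p_by_5 = probs[margins > 0.05].sum()     # probability she wins by more than 5 points
print(p_victory, p_by_5)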
And when you get new data, Bayes’s theorem tells you how to update the previous distribution to get a new distribution that incorporates the new data. By the way, these figures are from Nate Silver’s blog, the guy who correctly predicted the outcome of the 2008 presidential election in 49 out of 50 states, and in 2012 he got 50 out of 50, using Bayesian statistics.
As this example shows, Bayesian methods answer the questions we actually care about, unlike classical methods, and produce results in a form that makes information actionable.
Let me give you another example. Suppose you are testing a new drug, A, compared to an existing drug B, and it looks like A is better, with a nice small p-value. But A is more expensive. If you’re a doctor, which drug should you prescribe?
Again, classical statistics provides almost no help with this kind of decision making. But suppose I gave you results in the form of probabilities, like the probability of survival, or positive outcomes, or side effects. Now if you have prices, you can do cost-effectiveness analysis, like dollars per life saved, or per quality adjusted life-year, and so on. That kind of information is actually useful.
I’ll do one more example that’s a little more fun. Suppose you are on The Price Is Right, and you are a contestant in the Showdown at the end of the show. You get to see a prize and then you have to guess the price. Your opponent sees a different prize and then they have to guess the price. Whoever is closer, without going over, wins the prize.
Well, here’s what you could do. If you know the distribution of prices for prizes in the past, you could use that as your prior. And conveniently, there’s a super-fan on the Internet who has done that for you, so you can download the results.
Then you can use your price estimate to do a Bayesian update. Here’s what it looks like: the darker line is the prior; the lighter line is the posterior, which is what you believe about the price after you see the prize.
Here’s what it looks like for the second contestant. Again, the dark line is the prior, the light line is what your opponent believes after they see the second prize.
Now you can each perform an optimization that computes your expected gain depending on what you bid. In this example, the best bid for you is about $21,000, and the best bid for your opponent is about $32,000.
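The real analysis has to model the opponent as well, but here's a simplified sketch of the optimization step. The posterior over the true price and the model of the opponent are both made up for illustration:

import numpy as np

# Made-up posterior over the true price of your showcase, on a grid of prices.
prices = np.arange(0, 50001, 500)
posterior = np.exp(-0.5 * ((prices - 25000) / 6000.0) ** 2)
posterior /= posterior.sum()

def prob_win(diff):
    # Made-up model of the opponent: the further your bid falls below the
    # true price, the more likely the opponent lands closer without going over.
    return np.exp(-diff / 5000.0)

def expected_gain(bid):
    total = 0.0
    for price, prob in zip(prices, posterior):
        if bid > price:
            continue                     # overbid: you win nothing
        total += prob * prob_win(price - bid) * price
    return total

gains = [expected_gain(bid) for bid in prices]
print(prices[np.argmax(gains)])          # the bid that maximizes expected gain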
This example is mostly for fun, but it makes a point: Bayesian methods support complex decision making under uncertainty. The Price Is Right problem is not easy; it involves some discontinuities that make it hard to do with continuous mathematics. But using Bayesian statistics and some simple computational tools, it’s not that hard.
Ok, one more thing you hear about Bayesian methods is that they are too slow. Well, compared to what? If you want the wrong answer, you can have it very fast. If you want the right answer, there are a few ways to get there.
For a lot of problems, you can get away with brute force. It turns out that computers are fast, and computation is cheap. For many real world problems, a simple implementation of Bayesian statistics is fast enough. Maybe it takes a 10th of a second instead of a millionth of a second, but most of the time we don’t care.
If you do care, there are alternatives. MCMC stands for Markov chain Monte Carlo, which is a killer computational technique for speeding up Bayesian methods, especially when the number of parameters you are trying to estimate gets big.
And if that’s not fast enough, sometimes there are analytic methods you can use. Here’s an example I worked on that uses a Dirichlet distribution, one of the cases where you can perform a Bayesian update just by doing a few additions.
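The Dirichlet distribution is a conjugate prior for the multinomial, which is why the update reduces to addition. Here's a minimal sketch of the idea (the counts are hypothetical, not the ones from my example):

import numpy as np

# A Dirichlet prior over the probabilities of three categories,
# represented by its concentration parameters.
alpha_prior = np.array([1.0, 1.0, 1.0])

# Hypothetical observed counts for each category.
counts = np.array([3, 7, 2])

# The posterior is another Dirichlet whose parameters are prior + counts.
alpha_posterior = alpha_prior + counts

# Posterior mean of each category's probability.
print(alpha_posterior / alpha_posterior.sum())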
Ok, last myth I’ll talk about is the idea that Bayesian methods are hard. Again, if you look at things like the Wikipedia page, it can be a little intimidating. But that’s a problem with the math, not the methods.
The fundamental ideas are very simple, and if you take a computational approach, they are really not hard.
I’ll show you an example, starting with a toy problem and then using it to solve a real problem. Suppose I have a box of dice where one is 4-sided, one is 6-sided, one 8-sided and one 12-sided.
I pick a die, but I don’t let you see it. I roll the die and report that I got a six. Then I ask, what is the probability that I rolled each die? Immediately you can figure that I didn’t roll the 4-sided die. And you might have intuition that the six-sided die is the most likely candidate. But how do we quantify that intuition?
I’ll show you two ways to solve this problem, first on paper and then computationally.
Here’s a table you can use to solve problems like this when you have a small number of possible hypotheses.
I’ll fill in the first column, which is the list of possible dice.
Now, let’s assume that I was equally likely to choose any of the dice. So the prior probability is the same for all of them. I’ll set them to 1. I could divide through and set them all to ¼ , but it turns out it doesn’t matter.
So that’s what you should believe before I roll the die and tell you the outcome. Then you get some data: I tell you I rolled a 6. What you have to do is compute the likelihood of the data under each hypothesis. That is, for example, what’s the probability of rolling a 6 on a 4 sided die? Zero. What’s the probability of getting a 6 on a six-sided die? One in six. So I’ll fill those in.
The chance of rolling a 6 on an 8-sided die is one in 8. On a 12-sided die it’s one in 12. And that’s it. Everything from here on is just arithmetic. First we multiply the prior probabilities by the likelihoods.
The result is the posterior probabilities, but they are not normalized; that is, they don’t add up to 1. We can normalize them by adding them up and dividing through.
But first, just to make the arithmetic easy, I’m going to multiply through by 24.
Now I’ll add them up… and divide through by 9. So the posterior probabilities are 0 (the 4-sided die has been eliminated from the running), 4 ninths, 3 ninths, and 2 ninths. As expected, the 6-sided die is the most likely.
Now here’s what that looks like computationally. I’ve got a function, likelihood, that computes the likelihood of the data under each hypothesis. The hypothesis is the number of sides on the die; the data is the outcome that I rolled, the 6.
If the outcome exceeds the number of sides, that’s impossible, so the likelihood is zero. Otherwise, the likelihood is one out of the number of sides. Once you provide a likelihood function, you’re done. The Suite class knows how to do a Bayesian update; it does the same thing we just did with the table.
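Here is roughly what that code looks like, using the Suite class from the thinkbayes2 library. This is a sketch in the spirit of the slide, not a copy of it:

from thinkbayes2 import Suite

class Dice(Suite):
    """The hypothesis is the number of sides on the die."""

    def Likelihood(self, data, hypo):
        # data is the outcome I rolled; hypo is the number of sides.
        if data > hypo:
            return 0           # can't roll a number bigger than the number of sides
        return 1.0 / hypo      # otherwise, one out of the number of sides

suite = Dice([4, 6, 8, 12])    # equal prior probability for each die
suite.Update(6)                # the data: I rolled a 6
suite.Print()                  # posterior: 0, 4/9, 3/9, 2/9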
So, we solved the dice problem, which might not seem very interesting. But we also solved the German tank problem, which was very interesting during World War II.
When the Germans were making tanks, they allocated serial numbers to different factories, in different months, in blocks of 100. But not all of the serial numbers got used. So if you captured a tank and looked at the serial number, you could estimate how many tanks were made at a particular factory in a particular month.
To see how that works, let’s look at the likelihood function. The hypothesis now is the number of tanks that were made, out of a possible 100. The data is the serial number of a tank that was captured.
You might recognize this likelihood function. It’s the same as the dice problem, except instead of 4 dice, we have 100 dice.
Here’s what the update looks like. We create an object called Tank that represents the prior distribution, then update it with the data, the serial number 37. And here’s what the posterior distribution looks like.
Everything below 37 has been eliminated, because that’s not possible. The most likely estimate is 37. But the other values up to 100 are also possible. If you see more data, you can do another update and get a new posterior.
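Here's a sketch of the corresponding code, again using the Suite class (illustrative, not copied from the slide):

from thinkbayes2 import Suite

class Tank(Suite):
    """The hypothesis is the number of tanks made, out of a possible 100."""

    def Likelihood(self, data, hypo):
        # data is the serial number of a captured tank; hypo is the number of tanks.
        if data > hypo:
            return 0           # a serial number can't exceed the number of tanks
        return 1.0 / hypo      # each existing serial number is equally likely

tank = Tank(range(1, 101))     # uniform prior on 1 to 100 tanks
tank.Update(37)                # we captured a tank with serial number 37
print(tank.Mean())             # posterior mean; the single most likely value is 37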
It turns out that this works. During WWII there were statisticians producing estimates like this (although not using exactly this method), and their estimates were consistently much lower than what was coming from conventional intelligence. After the war, the production records were captured, and it turned out that the statistical estimates were much better.
So Bayesian methods are not as hard as people think. In particular, if you use computational methods, you can get started very quickly.
I teach a class at Olin College for people who have almost no prior statistical knowledge, but they know how to program in Python. In 7 weeks, they work on projects where they apply Bayesian methods to problems they choose and formulate, and I publish the good ones.
Here’s one where the students predicted the outcome of the 2015 Super Bowl, which the New England Patriots won.
Here’s one where they analyzed responses on Tinder, computing the probability that someone would respond. It sounds a little bit sad, but it’s not really like that. They were having some fun with it.
And here’s one that actually got a lot of attention: two students who used data from Game of Thrones to predict the probability that various characters would survive for another book or another chapter.
So people can get started with this very quickly and work on real world problems, although some of the problems are more serious than others.
In summary, most of what you’ve heard about Bayesian methods is wrong, or at least misleading. The results are subjective, but so is everything we believe about the world. Get over it.
Bayesian methods are not redundant; they are different, and better. And we need them more than ever.
Bayesian methods can be computationally intensive, but there are lots of ways to deal with that. And for most applications, they are fast enough, which is all that matters.
Finally, they are not that hard, especially if you take a computational approach.
Of course, I am not impartial. I teach workshops where I introduce people to Bayesian statistics using Python. I’ve got one coming up this weekend in Boston and another at PyCon in Portland Oregon.
And I’m also trying to help teachers get this stuff into the undergraduate curriculum. It’s not just for graduate students! In June I’m doing a one-day workshop for college instructors, along with my colleague, Sanjoy Mahajan.
And I’ve got a book on the topic, called Think Bayes, which you can read at thinkbayes.com
I’ve got a few more books that I recommend, but I won’t read them to you. Let me get to here, where you can go to this URL to get my slides. And here are a few ways you can get in touch with me if you want to follow up.

May 17, 2016
Does Trivers-Willard apply to people?
According to Wikipedia, the Trivers-Willard hypothesis:
"...suggests that female mammals are able to adjust offspring sex ratio in response to their maternal condition. For example, it may predict greater parental investment in males by parents in 'good conditions' and greater investment in females by parents in 'poor conditions' (relative to parents in good condition)."For humans, the hypothesis suggests that people with relatively high social status might be more likely to have boys. Some studies have shown evidence for this hypothesis, but based on my very casual survey, it is not persuasive.
To test whether the T-W hypothesis holds up in humans, I downloaded birth data for the nearly 4 million babies born in the U.S. in 2014.
I selected variables that seemed likely to be related to social status and used logistic regression to identify variables associated with sex ratio.
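As a sketch of what each regression looks like (the file name and column names here are hypothetical, not necessarily the ones in the notebook):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names: 'sex' is the baby's sex,
# 'mager' is the mother's age, one of the candidate predictors.
df = pd.read_csv('natality2014.csv')
df['boy'] = (df.sex == 'M').astype(int)

# Logistic regression of the baby's sex on one candidate predictor.
model = smf.logit('boy ~ mager', data=df)
results = model.fit()
print(results.summary())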
Summary of results
Running regression with one variable at a time, many of the variables I selected have a statistically significant effect on sex ratio, with the sign of the effect generally in the direction predicted by T-W.

However, many of the variables are also correlated with race. If we control for either the mother's race or the father's race, most other variables have no additional predictive power.

Contrary to other reports, the age of the parents seems to have no predictive power.

Strangely, the variable that shows the strongest and most consistent relationship with sex ratio is the number of prenatal visits. Although it seems obvious that prenatal visits are a proxy for quality of health care and socioeconomic status, the sign of the effect is opposite what T-W predicts; that is, more prenatal visits is a strong predictor of lower sex ratio (more girls).

Conclusion
This dataset provides strong evidence of a race effect: African Americans and Hispanics are more likely than whites to have girls. Asians are slightly more likely to have girls.
Other than that, there is no evidence to support T-W. The number of prenatal visits has strong predictive power, but the sign of the effect is the opposite of what T-W would predict.
And once we control for race and prenatal visits, no other variables have predictive power (despite the size of the dataset).
You can read all the details in this Jupyter notebook.
Note: Following convention, I report sex ratio in terms of boys per 100 girls. The overall sex ratio at birth is about 105; that is, 105 boys are born for every 100 girls.
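For reference, converting between the probability of a boy and the sex ratio is a one-line calculation:

p_boy = 0.512                            # approximate probability of a male birth
sex_ratio = 100 * p_boy / (1 - p_boy)
print(sex_ratio)                         # about 105 boys per 100 girls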
UPDATE
I've run exactly the same analysis using data from 2013. Here's the notebook. All of the results and conclusions are substantially the same.

May 16, 2016
I will probably not win the Great Bear Run
Last year I wrote about my chances of winning my age group in a 5K. This is an update to that article, based on new data.
Almost every year since 2008 I have participated in the Great Bear Run, a 5K road race in Needham MA. I usually finish in the top 30 or so, and in my age group I have come in 2nd, 3rd, 4th (three times), 5th, and 6th.
So I have to wonder if I'll ever win my age group. To answer this question, I developed a Bayesian model of road racing.
The SOB model

To understand the model, it helps to look at the data. Here is the list of people who beat me in each race:
In [1]: from __future__ import print_function, division
        from thinkbayes2 import Pmf, Cdf, Beta
        import thinkbayes2
        import thinkplot
        import numpy as np   # used by the Bear2 class below
        %matplotlib inline
In [2]: data = {
2008: ['Gardiner', 'McNatt', 'Terry'],
2009: ['McNatt', 'Ryan', 'Partridge', 'Turner', 'Demers'],
2010: ['Gardiner', 'Barrett', 'Partridge'],
2011: ['Barrett', 'Partridge'],
2012: ['Sagar'],
2013: ['Hammer', 'Wang', 'Hahn'],
2014: ['Partridge', 'Hughes', 'Smith'],
2015: ['Barrett', 'Sagar', 'Fernandez'],
2016: ['McGrane', 'Davies', 'Partridge', 'Johnson'],
}
There are some regulars who show up and beat me almost every year, although they are not always in my age group. And there are other names that appear only once.
To predict my performance in future races, we need a model that includes the probability that regulars will show up, as well as the possibility that newcomers will appear.
I model this process with three factors, $S$, $O$, and $B$. In order to finish ahead of me, a runner has to
1. Show up,
2. Outrun me, and
3. Be in my age group.

For each runner, the probability of displacing me is a product of these factors:
$p_i = SOB$
Some runners have a higher SOB factor than others; we can use previous results to estimate it.
But first we have to think about an appropriate prior. Based on casual observation, I conjecture that the prior distribution of $S$ is an increasing function, with many people who run nearly every year, and fewer who run only occasionally:
In [3]: ss = Beta(2, 1)
        thinkplot.Pdf(ss.MakePmf(), label='S')
        thinkplot.Config(xlabel='Probability of showing up (S)',
                         ylabel='PMF', loc='upper left')
[figure: prior distribution of S]
The prior distribution of $O$ is biased toward high values. Of the people who have the potential to beat me, many of them will beat me every time. I am only competitive with a few of them.
(For example, of the 18 people who have beat me, I have only ever beat 2).
In [4]: os = Beta(3, 1)
        thinkplot.Pdf(os.MakePmf(), label='O')
        thinkplot.Config(xlabel='Probability of outrunning me (O)',
                         ylabel='PMF', loc='upper left')
[figure: prior distribution of O]
The probability that a runner is in my age group depends on the difference between his age and mine. Someone exactly my age will always be in my age group. Someone 4 years older will be in my age group only once every 5 years (the Great Bear run uses 5-year age groups).
So I'll model the prior distribution of $B$ as uniform.
In [5]: bs = Beta(1, 1)
        thinkplot.Pdf(bs.MakePmf(), label='B')
        thinkplot.Config(xlabel='Probability of being in my age group (B)',
                         ylabel='PMF', loc='upper left')
[figure: prior distribution of B]
I used Beta distributions for each of the three factors, so each $p_i$ is the product of three Beta-distributed variates. In general, the result is not a Beta distribution, but maybe we can find a Beta distribution that is a good approximation of the actual distribution.
I'll draw a sample from the distributions of $S$, $O$, and $B$, and multiply them out. It turns out that the result is a good match for a Beta distribution with parameters 1 and 3.
In [6]: n = 1000
        sample = ss.Sample(n) * os.Sample(n) * bs.Sample(n)
        cdf = Cdf(sample)

        thinkplot.PrePlot(1)
        prior = Beta(1, 3)
        thinkplot.Cdf(prior.MakeCdf(), color='grey', label='Model')
        thinkplot.Cdf(cdf, label='SOB sample')
        thinkplot.Config(xlabel='Probability of displacing me',
                         ylabel='CDF', loc='lower right')
[figure: CDF of the SOB sample compared to a Beta(1, 3) model]
Now let's look more carefully at the data. There are 18 people who have beat me during at least one year, several more than once.
The runner with the biggest SOB factor is Rich Partridge, who has displaced me in 5 of 9 years. In fact, he outruns me almost every year, but is not always in my age group.
In [7]: from itertools import chain
        from collections import Counter

        counter = Counter(chain(*data.values()))
        len(counter), counter
Out[7]: (18,
Counter({'Barrett': 3,
'Davies': 1,
'Demers': 1,
'Fernandez': 1,
'Gardiner': 2,
'Hahn': 1,
'Hammer': 1,
'Hughes': 1,
'Johnson': 1,
'McGrane': 1,
'McNatt': 2,
'Partridge': 5,
'Ryan': 1,
'Sagar': 2,
'Smith': 1,
'Terry': 1,
'Turner': 1,
'Wang': 1}))
The following function makes a Beta distribution to represent the posterior distribution of $p_i$ for each runner. It starts with the prior, Beta(1, 3), and updates it with the number of times the runner displaces me, and the number of times he doesn't.
In [8]: def MakeBeta(count, num_races, precount=3):
            beta = Beta(1, precount)
            beta.Update((count, num_races-count))
            return beta
Now we can make a posterior distribution for each runner:
In [9]: num_races = len(data)
        betas = [MakeBeta(count, num_races)
                 for count in counter.values()]
Let's check the posterior means to see if they make sense. For Rich Partridge, who has displaced me 5 times out of 9, the posterior mean is 46%; for someone who has displaced me only once, it is 15%.
So those don't seem crazy.
In [10]: [beta.Mean() * 100 for beta in betas]

Out[10]: [15.384615384615385,
15.384615384615385,
15.384615384615385,
15.384615384615385,
15.384615384615385,
23.076923076923077,
15.384615384615385,
15.384615384615385,
15.384615384615385,
23.076923076923077,
23.076923076923077,
15.384615384615385,
15.384615384615385,
46.15384615384615,
15.384615384615385,
15.384615384615385,
15.384615384615385,
30.76923076923077]
Now we're ready to do some inference. The model only has one parameter, the total number of runners who could displace me, $n$. For the 18 SOBs we have actually observed, we use previous results to estimate $p_i$. For additional hypothetical runners, we update the distribution with 0 displacements out of num_races.
To improve performance, my implementation precomputes the distribution of $k$ for each value of $n$, using ComputePmfs and ComputePmf.
After that, the Likelihood function is simple: it just looks up the probability of $k$ given $n$.
In [11]: class Bear2(thinkbayes2.Suite, thinkbayes2.Joint):

             def ComputePmfs(self, data):
                 num_races = len(data)
                 counter = Counter(chain(*data.values()))
                 betas = [MakeBeta(count, num_races)
                          for count in counter.values()]
                 self.pmfs = dict()
                 low = len(betas)
                 high = max(self.Values())
                 for n in range(low, high+1):
                     self.pmfs[n] = self.ComputePmf(betas, n, num_races)

             def ComputePmf(self, betas, n, num_races, label=''):
                 no_show = MakeBeta(0, num_races)
                 all_betas = betas + [no_show] * (n - len(betas))

                 ks = []
                 for i in range(2000):
                     ps = [beta.Random() for beta in all_betas]
                     xs = np.random.random(len(ps))
                     k = sum(xs < ps)
                     ks.append(k)
                 return Pmf(ks, label=label)

             def Likelihood(self, data, hypo):
                 n = hypo
                 k = data
                 return self.pmfs[n][k]

             def Predict(self):
                 metapmf = thinkbayes2.Pmf()
                 for n, prob in self.Items():
                     pmf = self.pmfs[n]
                     metapmf[pmf] = prob
                 mix = thinkbayes2.MakeMixture(metapmf)
                 return mix
Here's what some of the precomputed distributions look like, for several values of $n$.
If there are fewer runners, my chance of winning is slightly better, but the difference is small, because fewer runners implies a higher mean for $p_i$.
In [12]: bear2 = Bear2()

         thinkplot.PrePlot(3)
         pmf = bear2.ComputePmf(betas, 18, num_races, label='n=18')
         pmf2 = bear2.ComputePmf(betas, 22, num_races, label='n=22')
         pmf3 = bear2.ComputePmf(betas, 26, num_races, label='n=26')
         thinkplot.Pdfs([pmf, pmf2, pmf3])
         thinkplot.Config(xlabel='# Runners who beat me (k)',
                          ylabel='PMF', loc='upper right')
[figure: distributions of k for n=18, 22, and 26]
For the prior distribution of $n$, I'll use a uniform distribution from 18 to 40.
In [13]: low = 18
         high = 40
         bear2 = Bear2(range(low, high))
         bear2.ComputePmfs(data)
And here's the update, using the number of runners who displaced me each year:
In [14]: for year, sobs in data.items():
             k = len(sobs)
             bear2.Update(k)
Here's the posterior distribution of $n$. It's noisy because I used random sampling to estimate the conditional distributions of $k$. But that's ok because we don't really care about $n$; we care about the predictive distribution of $k$. And noise in the distribution of $n$ has very little effect on $k$.
In [15]: thinkplot.PrePlot(1)
         thinkplot.Pdf(bear2, label='n')
         thinkplot.Config(xlabel='Number of SOBs (n)',
                          ylabel='PMF', loc='upper right')
[figure: posterior distribution of n]
The predictive distribution for $k$ is a weighted mixture of the conditional distributions we already computed:
In [16]: predict = bear2.Predict()

And here's what it looks like:
In [17]: thinkplot.Hist(predict, label='k')
         thinkplot.Config(xlabel='# Runners who beat me (k)', ylabel='PMF', xlim=[-0.5, 12])
         predict[0] * 100

Out[17]: 1.3229375957878466

[figure: predictive distribution of k]
According to this model, my chance of winning my age group is less than 2%. Disappointing.

May 13, 2016
Do generations exist?
This article presents a "one-day paper", my attempt to pose a research question, answer it, and publish the results in one work day (May 13, 2016).
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
In [1]: from __future__ import print_function, divisionfrom thinkstats2 import Pmf, Cdf
import thinkstats2
import thinkplot
import pandas as pd
import numpy as np
from scipy.stats import entropy
%matplotlib inline
What's a generation supposed to be, anyway?
If generation names like "Baby Boomers" and "Generation X" are just a short way of referring to people born during certain intervals, you can use them without implying that these categories have any meaningful properties.
But if these names are supposed to refer to generations with identifiable characteristics, we can test whether these generations exist. In this notebook, I suggest one way to formulate generations as a claim about the world, and test it.
Suppose we take a representative sample of people in the U.S., divide them into cohorts by year of birth, and measure the magnitude of the differences between consecutive cohorts. Of course, there are many ways we could define and measure these differences; I'll suggest one in a minute.
But ignoring the details for now, what would those differences look like if generations exist? Presumably, the differences between successive cohorts would be relatively small within each generation, and bigger between generations.
If we plot the cumulative total of these differences, we expect to see something like the figure below (left), with relatively fast transitions (big differences) between generations, and periods of slow change (small differences) within generations.
On the other hand, if there are no generations, we expect the differences between successive cohorts to be about the same. In that case the cumulative differences should look like a straight line, as in the figure below (right):
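Here's a small simulation, my own illustration rather than anything from the notebook, of what those two shapes look like:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(17)
n_cohorts = 33

# If generations exist: small differences within a generation,
# with a big jump every several cohorts.
with_gens = np.full(n_cohorts, 1.0)
with_gens[::8] += 8.0

# If there are no generations: roughly constant differences.
no_gens = np.ones(n_cohorts) + np.random.normal(0, 0.1, n_cohorts)

plt.plot(np.cumsum(with_gens), label='generations exist')
plt.plot(np.cumsum(no_gens), label='no generations')
plt.xlabel('Cohort #')
plt.ylabel('Cumulative difference')
plt.legend()
plt.show()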
So, how should we quantify the differences between successive cohorts? When people talk about generational differences, they are often talking about differences in attitudes about political and social issues, and other cultural questions. Fortunately, these are exactly the sorts of things surveyed by the General Social Survey (GSS).
To gather data, I selected questions from the GSS that were asked during the last three cycles (2010, 2012, 2014) and that were coded on a 5-point Likert scale.
You can see the variables that met these criteria, and download the data I used, here:
https://gssdataexplorer.norc.org/projects/13170/variables/data_cart
Now let's see what we got.
First I load the data dictionary, which contains the metadata:
In [2]: dct = thinkstats2.ReadStataDct('GSS.dct')

Then I load the data itself:
In [3]: df = dct.ReadFixedWidth('GSS.dat')

I'm going to drop two variables that turned out to be mostly N/A:
In [4]: df.drop(['immcrime', 'pilloky'], axis=1, inplace=True)

And then replace the special codes 8, 9, and 0 with N/A:
In [5]: df.ix[:, 3:] = df.ix[:, 3:].replace([8, 9, 0], np.nan)
        df.head()
Out[5]:
   year  id_  age  wrkwayup  fepol  pillok  premarsx  spanking  fechld  fepresch  ...  patriot2  patriot3  patriot4  poleff18  poleff19  poleff20  govdook  polgreed  choices  refrndms
0  2010    1   31         4      2       1         4         4       2         2  ...       NaN       NaN       NaN       NaN       NaN       NaN        3         2      NaN       NaN
1  2010    2   23         5      2       1         4         3       2         3  ...       NaN       NaN       NaN       NaN       NaN       NaN        4         4      NaN       NaN
2  2010    3   71       NaN      2       2         1         1       3         2  ...       NaN       NaN       NaN       NaN       NaN       NaN        3         4      NaN       NaN
3  2010    4   82         2      2       1         3         3       3         2  ...       NaN       NaN       NaN       NaN       NaN       NaN        2         2      NaN       NaN
4  2010    5   78       NaN    NaN     NaN       NaN       NaN     NaN       NaN  ...       NaN       NaN       NaN       NaN       NaN       NaN      NaN       NaN      NaN       NaN

5 rows × 150 columns
For the age variable, I also have to replace 99 with N/A:

In [6]: df.age.replace([99], np.nan, inplace=True)

Here's an example of a typical variable on a 5-point Likert scale:

In [7]: thinkplot.Hist(Pmf(df.choices))

[figure: histogram of responses to the choices variable]
I have to compute the year each respondent was born:

In [8]: df['yrborn'] = df.year - df.age

Here's what the distribution looks like. The survey includes roughly equal numbers of people born each year from 1922 to 1996.
In [9]: pmf_yrborn = Pmf(df.yrborn)
        thinkplot.Cdf(pmf_yrborn.MakeCdf())

Out[9]: {'xscale': 'linear', 'yscale': 'linear'}

[figure: CDF of year born]
Next I sort the respondents by year born and then assign them to cohorts so there are 200 people in each cohort.
In [10]: df_sorted = df[~df.age.isnull()].sort_values(by='yrborn')
         df_sorted['counter'] = np.arange(len(df_sorted), dtype=int) // 200
         df_sorted[['year', 'age', 'yrborn', 'counter']].head()
Out[10]:
      year  age  yrborn  counter
1411  2010   89    1921        0
1035  2010   89    1921        0
1739  2010   89    1921        0
1490  2010   89    1921        0
861   2010   89    1921        0

In [11]: df_sorted[['year', 'age', 'yrborn', 'counter']].tail()

Out[11]:
      year  age  yrborn  counter
4301  2014   18    1996       32
4613  2014   18    1996       32
4993  2014   18    1996       32
4451  2014   18    1996       32
6032  2014   18    1996       32
I end up with the same number of people in each cohort (except the last).
In [12]: thinkplot.Cdf(Cdf(df_sorted.counter))
         None

[figure: CDF of the cohort index]
Then I can group by cohort.
In [13]: groups = df_sorted.groupby('counter')

I'll instantiate an object for each cohort.
In [14]: class Cohort:
             skip = ['year', 'id_', 'age', 'yrborn', 'cohort', 'counter']

             def __init__(self, name, df):
                 self.name = name
                 self.df = df
                 self.pmf_map = {}

             def make_pmfs(self):
                 for col in self.df.columns:
                     if col in self.skip:
                         continue
                     self.pmf_map[col] = Pmf(self.df[col].dropna())
                     try:
                         self.pmf_map[col].Normalize()
                     except ValueError:
                         print(self.name, col)

             def total_divergence(self, other, divergence_func):
                 total = 0
                 for col, pmf1 in self.pmf_map.items():
                     pmf2 = other.pmf_map[col]
                     divergence = divergence_func(pmf1, pmf2)
                     #print(col, pmf1.Mean(), pmf2.Mean(), divergence)
                     total += divergence
                 return total
To compute the difference between successive cohorts, I'll loop through the questions, compute Pmfs to represent the responses, and then compute the difference between Pmfs.
I'll use two functions to compute these differences. One computes the difference in means:
In [15]: def MeanDivergence(pmf1, pmf2):
             return abs(pmf1.Mean() - pmf2.Mean())
The other computes the Jensen-Shannon divergence:

In [16]: def JSDivergence(pmf1, pmf2):
             xs = set(pmf1.Values()) | set(pmf2.Values())
             ps = np.asarray(pmf1.Probs(xs))
             qs = np.asarray(pmf2.Probs(xs))
             ms = ps + qs
             return 0.5 * (entropy(ps, ms) + entropy(qs, ms))
First I'll loop through the groups and make Cohort objects:

In [17]: cohorts = []
         for name, group in groups:
             cohort = Cohort(name, group)
             cohort.make_pmfs()
             cohorts.append(cohort)

         len(cohorts)
Out[17]: 33
Each cohort spans a range of two or three birth years. For example, the cohort at index 11 spans 1956 to 1957.
In [18]: cohorts[11].df.yrborn.describe()

Out[18]: count 200.000000
mean 1956.585000
std 0.493958
min 1956.000000
25% 1956.000000
50% 1957.000000
75% 1957.000000
max 1957.000000
Name: yrborn, dtype: float64
Here's the total divergence between the first two cohorts, using the mean difference between Pmfs.
In [19]: cohorts[0].total_divergence(cohorts[1], MeanDivergence)

Out[19]: 31.040279367049521
And here's the total J-S divergence:
In [20]: cohorts[0].total_divergence(cohorts[1], JSDivergence)

Out[20]: 6.7725572288951366
This loop computes the (absolute value) difference between successive cohorts and the cumulative sum of the differences.
In [21]: res = []
         cumulative = 0
         for i in range(len(cohorts)-1):
             td = cohorts[i].total_divergence(cohorts[i+1], MeanDivergence)
             cumulative += td
             print(i, td, cumulative)
             res.append((i, cumulative))
0 31.040279367 31.040279367
1 28.3836530218 59.4239323888
2 23.5696110289 82.9935434177
3 26.2323801584 109.225923576
4 24.9007829029 134.126706479
5 26.0393476525 160.166054132
6 24.8166232133 184.982677345
7 21.6520436658 206.634721011
8 24.1933904043 230.828111415
9 25.6776252798 256.505736695
10 29.8524030016 286.358139696
11 26.8980464824 313.256186179
12 25.7648247447 339.021010923
13 26.1889717953 365.209982719
14 22.8978988777 388.107881596
15 21.8568321591 409.964713756
16 24.0318591633 433.996572919
17 26.5253260458 460.521898965
18 22.8921890979 483.414088063
19 23.2634817637 506.677569826
20 22.7139238903 529.391493717
21 21.8612744354 551.252768152
22 28.7140491676 579.96681732
23 25.3770585475 605.343875867
24 28.4834464175 633.827322285
25 26.5365954008 660.363917685
26 25.9157951263 686.279712812
27 24.5135820021 710.793294814
28 25.8325896861 736.6258845
29 23.3612099178 759.987094418
30 25.4897938733 785.476888291
31 33.1647682631 818.641656554
The results are a nearly straight line, suggesting that there are no meaningful generations, at least as I've formulated the question.
In [22]: xs, ys = zip(*res)
         thinkplot.Plot(xs, ys)
         thinkplot.Config(xlabel='Cohort #',
                          ylabel='Cumulative difference in means',
                          legend=False)
[figure: cumulative difference in means between successive cohorts]
The results look pretty much the same using J-S divergence.
In [23]: res = []
         cumulative = 0
         for i in range(len(cohorts)-1):
             td = cohorts[i].total_divergence(cohorts[i+1], JSDivergence)
             cumulative += td
             print(i, td, cumulative)
             res.append((i, cumulative))
0 6.7725572289 6.7725572289
1 5.62917236405 12.4017295929
2 5.0624907141 17.464220307
3 4.17000968182 21.6342299889
4 4.28812434098 25.9223543299
5 4.82791915546 30.7502734853
6 4.68718559848 35.4374590838
7 4.68468177153 40.1221408553
8 5.05290361605 45.1750444714
9 4.73898052765 49.914024999
10 4.70690791746 54.6209329165
11 4.50240017279 59.1233330893
12 4.27566340317 63.3989964924
13 4.25538202128 67.6543785137
14 3.81795459454 71.4723331083
15 3.87937009897 75.3517032072
16 4.04261284213 79.3943160494
17 4.08350528576 83.4778213351
18 4.48996769966 87.9677890348
19 3.91403460156 91.8818236363
20 3.84677999195 95.7286036283
21 3.80839549117 99.5369991195
22 4.51411169618 104.051110816
23 4.08910255992 108.140213376
24 4.4797335534 112.619946929
25 4.89923921364 117.519186143
26 4.31338353073 121.832569673
27 4.2918089177 126.124378591
28 4.12560127765 130.249979869
29 3.9072987636 134.157278632
30 4.5453620366 138.702640669
31 8.30423913322 147.006879802
In [24]: xs, ys = zip(*res)
thinkplot.Plot(xs, ys)
thinkplot.Config(xlabel='Cohort #',
ylabel='Cumulative JS divergence',
legend=False)
[figure: cumulative JS divergence between successive cohorts]
Conclusion: Using this set of questions from the GSS, and two measures of divergence, it seems that the total divergence between successive cohorts is nearly constant over time.
If a "generation" is supposed to be a sequence of cohorts with relatively small differences between them, punctuated by periods of larger differences, this study suggests that generations do not exist.

May 11, 2016
Binomial and negative binomial distributions
Today's post is prompted by this question from Reddit:
How do I calculate the distribution of the number of selections (with replacement) I need to make before obtaining k? For example, let's say I am picking marbles from a bag with replacement. There is a 10% chance of green and 90% of black. I want k=5 green marbles. What is the distribution of the number of times I need to take a marble before getting 5?
I believe this is a geometric distribution. I see how to calculate the cumulative probability given n picks, but I would like to generalize it so that for any value of k (number of marbles I want), I can tell you the mean, 10% and 90% probability for the number of times I need to pick from it.
Another way of saying this is, how many times do I need to pull on a slot machine before it pays out given that each pull is independent?
Note: I've changed the notation in the question to be consistent with convention.
In [1]: from __future__ import print_function, division

        import thinkplot
        from thinkstats2 import Pmf, Cdf

        from scipy import stats
        from scipy import special

        %matplotlib inline
Solution
There are two ways to solve this problem. One is to relate the desired distribution to the binomial distribution.
If the probability of success on every trial is p, the probability of getting the kth success on the nth trial is
PMF(n; k, p) = BinomialPMF(k-1; n-1, p) * p

That is, the probability of getting k-1 successes in n-1 trials, times the probability of getting the kth success on the nth trial.
Here's a function that computes it:
In [2]: def MakePmfUsingBinom(k, p, high=100):
            pmf = Pmf()
            for n in range(1, high):
                pmf[n] = stats.binom.pmf(k-1, n-1, p) * p
            return pmf
And here's an example using the parameters in the question.
In [3]: pmf = MakePmfUsingBinom(5, 0.1, 200)
        thinkplot.Pdf(pmf)
[figure: PMF of the number of trials n, binomial method]
We can solve the same problem using the negative binomial distribution, but it requires some translation from the parameters of the problem to the conventional parameters of the negative binomial distribution.
The negative binomial PMF is the probability of getting r non-terminal events before the kth terminal event. (I am using "terminal event" instead of "success" and "non-terminal" event instead of "failure" because in the context of the negative binomial distribution, the use of "success" and "failure" is often reversed.)
If n is the total number of events, n = k + r, so
r = n - k

If the probability of a terminal event on every trial is p, the probability of getting the kth terminal event on the nth trial is

PMF(n; k, p) = NegativeBinomialPMF(n-k; k, p)

That is, the probability of n-k non-terminal events on the way to getting the kth terminal event.
Here's a function that computes it:
In [4]: def MakePmfUsingNbinom(k, p, high=100):
            pmf = Pmf()
            for n in range(1, high):
                r = n-k
                pmf[n] = stats.nbinom.pmf(r, k, p)
            return pmf
Here's the same example:
In [5]: pmf2 = MakePmfUsingNbinom(5, 0.1, 200)
        thinkplot.Pdf(pmf2)
[figure: PMF of the number of trials n, negative binomial method]
And confirmation that the results are the same within floating point error.
In [6]: diffs = [abs(pmf[n] - pmf2[n]) for n in pmf]
        max(diffs)
Out[6]: 8.6736173798840355e-17
Using the PMF, we can compute the mean and standard deviation:
In [7]: pmf.Mean(), pmf.Std()

Out[7]: (49.998064403376738, 21.207570382894403)
To compute percentiles, we can convert to a CDF (which computes the cumulative sum of the PMF)
In [8]: cdf = Cdf(pmf)
        scale = thinkplot.Cdf(cdf)
[figure: CDF of the number of trials n]
And here are the 10th and 90th percentiles.
In [9]: cdf.Percentile(10), cdf.Percentile(90)

Out[9]: (26, 78)
Copyright 2016 Allen Downey
MIT License: http://opensource.org/licenses/MIT

May 10, 2016
Probability is hard: part four
In the first article, I presented the Red Dice problem, which is a warm-up problem that might help us make sense of the other problems.

In the second article, I presented the problem of interpreting medical tests when there is uncertainty about parameters like sensitivity and specificity. I explained that there are (at least) four models of the scenario, and I hinted that some of them yield different answers. I presented my solution for one of the scenarios, using the computational framework from Think Bayes.

In the third article, I presented my solution for Scenario B.

Now, on to Scenarios C and D. For the last time, here are the scenarios:
Scenario A: The patients are drawn at random from the relevant population, and the reason we are uncertain about t is that either (1) there are two versions of the test, with different false positive rates, and we don't know which test was used, or (2) there are two groups of people, the false positive rate is different for different groups, and we don't know which group the patient is in.

Scenario B: As in Scenario A, the patients are drawn at random from the relevant population, but the reason we are uncertain about t is that previous studies of the test have been contradictory. That is, there is only one version of the test, and we have reason to believe that t is the same for all groups, but we are not sure what the correct value of t is.

Scenario C: As in Scenario A, there are two versions of the test or two groups of people. But now the patients are being filtered so we only see the patients who tested positive and we don't know how many patients tested negative. For example, suppose you are a specialist and patients are only referred to you after they test positive.

Scenario D: As in Scenario B, we have reason to think that t is the same for all patients, and as in Scenario C, we only see patients who test positive and don't know how many tested negative.
And the questions we would like to answer are:
What is the probability that a patient who tests positive is actually sick?

Given two patients who have tested positive, what is the probability that both are sick?

I have posted a new Jupyter notebook with solutions for Scenarios C and D. It also includes the previous solutions for Scenarios A and B; if you have already seen the first two, you can skip to the new stuff.
You can read a static version of the notebook here.
OR
You can run the notebook on Binder.
If you click the Binder link, you should get a home page showing the notebooks in the repository for this blog. Click on test_scenario_cd.ipynb to load the notebook for this article.
Comments
As I said in the first article, this problem turned out to be harder than I expected. First, it took time, and a lot of untangling, to identify the different scenarios. Once we did that, figuring out the answers was easier, but still not easy.
For me, writing a simulation for each scenario turned out to be very helpful. It has the obvious benefit of providing an approximate answer, but it also forced me to specify the scenarios precisely, which is crucial for problems like these.
Now, having solved four versions of this problem, what guidance can we provide for interpreting medical tests in practice? And how realistic are these scenarios, anyway?
Most medical tests have sensitivity higher than 90% and false positive rates much lower than the 20-40% in these scenarios. But many symptoms that indicate disease are neither sensitive nor specific. So this analysis might be more applicable to interpreting symptoms, rather than test results.
I think all four scenarios are relevant to practice:
For some signs of disease, we have estimates of true and false positive rates, but when there are multiple inconsistent estimates, we might be unsure which estimate is right. For some other tests, we might have reason to believe that these parameters are different for different groups of patients; for example, many antibody tests are more likely to generate false positives in patients who have been exposed to other diseases.

In some medical environments, we could reasonably treat patients as a random sample of the population, but more often patients have been filtered by a selection process that depends on their symptoms and test results. We considered two extremes: in Scenarios A and B, we have a random sample, and in C and D we only see patients who test positive. But in practice the selection process is probably somewhere in between.
If the selection process is unknown, and we don't know whether the parameters of the test are likely to be different for different groups, one option is to run the analysis for all four scenarios (or rather, for the three scenarios that are different), and use the range of the results to characterize uncertainty due to model selection.
UPDATES
[May 11, 2016] In this comment on Reddit, /u/ppcsf shows how to solve this problem using a probabilistic programming language based on C#. The details are in this gist.

May 9, 2016
Probability is hard: part three
In the first article, I presented the Red Dice problem, which is a warm-up problem that might help us make sense of the other problems.
In the second article, I presented the problem of interpreting medical tests when there is uncertainty about parameters like sensitivity and specificity. I explained that there are (at least) four models of the scenario, and I hinted that some of them yield different answers. I presented my solution for one of the scenarios, using the computational framework from Think Bayes.
Now, on to Scenario B. For those of you coming in late, here are the Scenarios again:
Scenario A: The patients are drawn at random from the relevant population, and the reason we are uncertain about t is that either (1) there are two versions of the test, with different false positive rates, and we don't know which test was used, or (2) there are two groups of people, the false positive rate is different for different groups, and we don't know which group the patient is in.

Scenario B: As in Scenario A, the patients are drawn at random from the relevant population, but the reason we are uncertain about t is that previous studies of the test have been contradictory. That is, there is only one version of the test, and we have reason to believe that t is the same for all groups, but we are not sure what the correct value of t is.

Scenario C: As in Scenario A, there are two versions of the test or two groups of people. But now the patients are being filtered so we only see the patients who tested positive and we don't know how many patients tested negative. For example, suppose you are a specialist and patients are only referred to you after they test positive.

Scenario D: As in Scenario B, we have reason to think that t is the same for all patients, and as in Scenario C, we only see patients who test positive and don't know how many tested negative.
And the questions we would like to answer are:
What is the probability that a patient who tests positive is actually sick?

Given two patients who have tested positive, what is the probability that both are sick?

I have posted a new Jupyter notebook with a solution for Scenario B. It also includes the previous solution for Scenario A; if you read it before, you can skip to the new stuff.
You can read a static version of the notebook here.
OR
You can run the notebook on Binder.
If you click the Binder link, you should get a home page showing the notebooks and Python modules in the repository for this blog. Click on test_scenario_b.ipynb to load the notebook for this article.
I'll give you a chance to think about Scenarios C and D, and I'll post my solution in the next couple of days.
Enjoy!

May 6, 2016
Probability is hard, part two
Now I'll present the problem we are really working on: interpreting medical tests. Suppose we test a patient to see if they have a disease, and the test comes back positive. What is the probability that the patient is actually sick (that is, has the disease)?
To answer this question, we need to know:
The prevalence of the disease in the population the patient is from. Let's assume the patient is identified as a member of a population where the known prevalence is p.

The sensitivity of the test, s, which is the probability of a positive test if the patient is sick.

The false positive rate of the test, t, which is the probability of a positive test if the patient is not sick.
Given these parameters, it is straightforward to compute the probability that the patient is sick, given a positive test. This problem is a staple in probability classes.
But here's the wrinkle. What if you are uncertain about one of the parameters: p, s, or t? For example, suppose you have reason to believe that t is either 0.2 or 0.4 with equal probability. And suppose p=0.1 and s=0.9.
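For a known value of t, the computation is just Bayes's theorem. Here's a minimal sketch using the numbers above:

def prob_sick_given_positive(p, s, t):
    # Bayes's theorem: P(sick | positive) = P(positive | sick) P(sick) / P(positive)
    return p * s / (p * s + (1 - p) * t)

p, s = 0.1, 0.9
for t in [0.2, 0.4]:
    print(t, prob_sick_given_positive(p, s, t))
# With t=0.2 the answer is 1/3; with t=0.4 it is 1/5.

The subtle part is what to do when we don't know which value of t is right, which is where the scenarios below come in.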
The questions we would like to answer are:
What is the probability that a patient who tests positive is actually sick?

Given two patients who have tested positive, what is the probability that both are sick?
As we did with the Red Dice problem, we'll consider four scenarios:
Scenario A: The patients are drawn at random from the relevant population, and the reason we are uncertain about t is that either (1) there are two versions of the test, with different false positive rates, and we don't know which test was used, or (2) there are two groups of people, the false positive rate is different for different groups, and we don't know which group the patient is in.

Scenario B: As in Scenario A, the patients are drawn at random from the relevant population, but the reason we are uncertain about t is that previous studies of the test have been contradictory. That is, there is only one version of the test, and we have reason to believe that t is the same for all groups, but we are not sure what the correct value of t is.

Scenario C: As in Scenario A, there are two versions of the test or two groups of people. But now the patients are being filtered so we only see the patients who tested positive and we don't know how many patients tested negative. For example, suppose you are a specialist and patients are only referred to you after they test positive.

Scenario D: As in Scenario B, we have reason to think that t is the same for all patients, and as in Scenario C, we only see patients who test positive and don't know how many tested negative.
I have posted a Jupyter notebook with solutions for the basic scenario (where we are certain about t) and for Scenario A. You might want to write your own solution before you read mine. And then you might want to work on Scenarios B, C, and D.
You can read a static version of the notebook here.
OR
You can run the notebook on Binder.
If you click the Binder link, you should get a home page showing the notebooks and Python modules in the repository for this blog. Click on test_scenario_a.ipynb to load the notebook for this article.
I'll post my solution for Scenarios B, C, and D next week.

May 5, 2016
Probability is hard
A few times we have hit a brick wall on a hard problem and made a strategic retreat by working on a simpler problem. At this point, we have retreated all the way to what I'll call the Red Dice problem, which goes like this:
Suppose I have a six-sided die that is red on 2 sides and blue on 4 sides, and another die that's the other way around, red on 4 sides and blue on 2.
I choose a die at random and roll it, and I tell you it came up red. What is the probability that I rolled the second die (red on 4 sides)? And if I do it again, what's the probability that I get red again?

There are several variations on this problem, with answers that are subtly different. I explain the variations, and my solution, in a Jupyter notebook:
You can read a static version of the notebook here.
OR
You can run the notebook on Binder.
If you click the Binder link, you should get a home page showing the notebooks and Python modules in the repository for this blog. Click on red_dice.ipynb to load the notebook for this article.
Once we have settled the Red Dice problem, we will get back to the original problem, which relates to interpreting medical tests (a classic example of Bayesian inference that, again, turns out to be more complicated than we thought).
UPDATE: After reading the notebook, some readers are annoyed with me because Scenarios C and D are not consistent with the way I posed the question. I'm sorry if you feel tricked -- that was not the point! To clarify, Scenarios A and B are legitimate interpretations of the question, as posed, which is deliberately ambiguous. Scenarios C and D are exploring a different version because it will be useful when we get to the next problem.

May 3, 2016
Stats can't make modeling decisions
This question came up on Reddit:

"If effect sizes of coefficient are really small, can you interpret as no relationship? Coefficients are very significant, which is expected with my large dataset. But coefficients are tiny (0.0000001). Can I conclude no relationship? Or must I say there is a relationship, but it's not practical?"

I posted a response there, but since we get questions like this a lot, I will write a more detailed response here.
First, as several people mentioned on Reddit, you have to distinguish between a small coefficient and a small effect size. The size of the coefficient depends on the units it is expressed in. For example, in a previous article I wrote about the relationship between a baby's birth weight and its mother's age ("Are first babies more likely to be light?"). With weights in pounds and ages in years, the estimated coefficient is about 0.017 pounds per year.
At first glance, that looks like a small effect size. But the average birth weight in the U.S. is about 7.3 pounds, and the range from the youngest to the oldest mother was more than 20 years. So if we say the effect size is "about 3 ounces per decade", that would be easier to interpret. Or it might be even better to express the effect in terms of percentages; for example, "A 10 year increase in mother's age is associated with a 2.4% increase in birth weight."
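The arithmetic, as a quick sketch:

coef_lb_per_year = 0.017                      # estimated coefficient, pounds per year
mean_weight_lb = 7.3                          # average U.S. birth weight, pounds

per_decade_lb = coef_lb_per_year * 10
print(per_decade_lb * 16)                     # about 2.7 ounces per decade
print(per_decade_lb / mean_weight_lb * 100)   # roughly 2.3%, close to the figure above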
So that's the first part of my answer:
Expressing effect size in practical terms makes it easier to evaluate its importance in practice.
The second part of my answer addresses the question, "Can I conclude no relationship?" This is a question about modeling, not statistics, and
Statistical analysis can inform modeling choices, but it can't make decisions for you.
As a reminder, when you make a model of a real-world scenario, you have to decide what to include and what to leave out. If you include the most important things and leave out less important things, your model will be good enough for most purposes.
But in most scenarios, there is no single uniquely correct model. Rather, there are many possible models that might be good enough, or not, for various purposes.
Based on statistics alone, you can't say whether there is, or is not, a relationship between two variables. But statistics can help you justify your decision to include a relationship in a model or ignore it.
The affirmative

If you want to argue that an effect SHOULD be included in a model, you can justify that decision (using classical statistics) in two steps:
1) Estimate the effect size and use background knowledge to make an argument about why it matters in practical terms. For example, a 3 ounce difference in birth weight might be associated with real differences in health outcomes, or not.
AND
2) Show that the p-value is small, which at least suggests that the observed effect is unlikely to be due to chance. (Some people will object to this interpretation of p-values, but I explain why I think it is valid in "Hypothesis testing is only mostly useless").
In my study of birth weight, I argued that mother's age should be included in the model because the effect size was big enough to matter in the real world, and because the p-value was very small.
The negative

If you want to argue that it is ok to leave an effect out of a model, you can justify that decision in one of two ways:
1) If you apply a hypothesis test and get a small p-value, you probably can't dismiss the effect as random. But if the estimated effect size is small, you can use background information to make an argument about why it is negligible.
OR
2) If you apply a hypothesis test and get a large p-value, that suggests that the effect you observed could be explained by chance. But that doesn't mean the effect is necessarily negligible. To make that argument, you need to consider the power of the test. One way to do that is to find the smallest hypothetical effect size that would yield a high probability of a significant test. Then you can say something like, "If the effect size were as big as X, this test would have a 90% chance of being statistically significant. The test was not statistically significant, so the effect size is likely to be less than X. And in practical terms, X is negligible."
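One way to find that effect size is by simulation. Here's a rough sketch; the sample size, noise level, and candidate effect sizes are placeholders:

import numpy as np
from scipy import stats

def power(effect_size, n, sigma=1.0, alpha=0.05, iters=1000):
    # Estimate the probability of a significant two-sample t-test
    # when the true difference between group means is effect_size.
    count = 0
    for _ in range(iters):
        group1 = np.random.normal(0, sigma, n)
        group2 = np.random.normal(effect_size, sigma, n)
        _, p = stats.ttest_ind(group1, group2)
        if p < alpha:
            count += 1
    return count / iters

# Search for the smallest effect size that is detected about 90% of the time.
for effect_size in [0.1, 0.2, 0.3, 0.4, 0.5]:
    print(effect_size, power(effect_size, n=100))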
The Bayesian

So far I have been using the logic of classical statistics, which is problematic in many ways.
Alternatively, in a Bayesian framework, the result would be a posterior distribution on the effect size, which you could use to generate an ensemble of models with different effect sizes. To make predictions, you would generate predictive distributions that represent your uncertainty about the effect size. In that case there's no need to make binary decisions about whether there is, or is not, a relationship.
Or you could use Bayesian model comparison, but I think that is a mostly misguided effort to shoehorn Bayesian methods into a classical framework. But that's a topic for another time.

