Allen B. Downey's Blog: Probably Overthinking It
October 25, 2016
Socks, skeets, space aliens
In my Bayesian statistics class this semester, I asked students to invent new Bayes theorem problems, with the following criteria:
1) A good Bayes's theorem problem should pose an interesting question that seems hard to solve directly, but
2) It should be easier to solve with Bayes's theorem than without it, and
3) It should have some element of surprise, or at least a non-obvious outcome.
Several years ago I posted some of my favorites in this article. Last week I posted a problem one of my students posed (Why is My Cat Orange?). This week I have another student-written problem and two related problems that I wrote. I'll post solutions later in the week.
The sock drawer problem

Posed by Yuzhong Huang:
There are two drawers of socks. The first drawer has 40 white socks and 10 black socks; the second drawer has 20 white socks and 30 black socks. We randomly get 2 socks from a drawer, and it turns out to be a pair (same color) but we don't know the color of these socks. What is the chance that we picked the first drawer?
[For this one, you can compute an approximate solution assuming socks are selected with replacement, or an exact solution assuming, more realistically, that they are selected without replacement.]
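If you want to check your answer before the solutions appear, the update takes only a few lines, since there are two hypotheses and the likelihoods are simple counting. This is just a sketch, and it assumes the drawer was chosen with equal probability:

```python
from fractions import Fraction

def prob_pair(white, black, replace=False):
    """Probability that two socks drawn from one drawer match in color."""
    total = white + black
    if replace:
        return Fraction(white, total)**2 + Fraction(black, total)**2
    return (Fraction(white, total) * Fraction(white - 1, total - 1) +
            Fraction(black, total) * Fraction(black - 1, total - 1))

for replace in [True, False]:
    like1 = prob_pair(40, 10, replace)   # drawer 1
    like2 = prob_pair(20, 30, replace)   # drawer 2
    # equal priors, so the posterior is just the normalized likelihood
    posterior1 = like1 / (like1 + like2)
    print(replace, posterior1)
```

Using `Fraction` keeps the with- and without-replacement answers exact, so you can see how little the replacement assumption matters here.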
The Alien Blaster problem

In preparation for an alien invasion, the Earth Defense League has been working on new missiles to shoot down space invaders. Of course, some missile designs are better than others; let's assume that each design has some probability of hitting an alien ship, x.
Based on previous tests, the distribution of x in the population of designs is roughly uniform between 10% and 40%. To approximate this distribution, we'll assume that x is either 10%, 20%, 30%, or 40% with equal probability.
Now suppose the new ultra-secret Alien Blaster 10K is being tested. In a press conference, an EDF general reports that the new design has been tested twice, taking two shots during each test. The results of the test are confidential, so the general won't say how many targets were hit, but they report: "The same number of targets were hit in the two tests, so we have reason to think this new design is consistent."
Is this data good or bad; that is, does it increase or decrease your estimate of x for the Alien Blaster 10K?
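Again, if you want to check your reasoning numerically, the update over the four hypotheses is short. A sketch, not the official solution:

```python
def hits_out_of_2(x):
    """Distribution of hits on two shots: [P(0), P(1), P(2)]."""
    return [(1 - x)**2, 2 * x * (1 - x), x**2]

xs = [0.1, 0.2, 0.3, 0.4]      # possible values of x, equally likely a priori
prior = [0.25] * 4

# likelihood of "same number of hits in both tests", for each x
likelihood = [sum(p**2 for p in hits_out_of_2(x)) for x in xs]

unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

prior_mean = sum(x * p for x, p in zip(xs, prior))
post_mean = sum(x * p for x, p in zip(xs, posterior))
```

Comparing `post_mean` to `prior_mean` answers the question.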
The Skeet Shooting problem

At the 2016 Summer Olympics in the Women's Skeet event, Kim Rhode faced Wei Meng in the bronze medal match. After 25 shots, they were tied, sending the match into sudden death. In each round of sudden death, each competitor shoots at two targets. In the first three rounds, Rhode and Wei hit the same number of targets. Finally in the fourth round, Rhode hit more targets, so she won the bronze medal, making her the first Summer Olympian to win an individual medal at six consecutive summer games. Based on this information, should we infer that Rhode and Wei had an unusually good or bad day?
As background information, you can assume that anyone in the Olympic final has about the same probability of hitting 13, 14, 15, or 16 out of 25 targets.
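The same machinery extends to this problem with a joint prior over both shooters. One modeling assumption I'm adding that the problem doesn't spell out: each shooter's per-target probability in sudden death equals her implied total out of 25.

```python
from itertools import product

def hits_out_of_2(p):
    """Distribution of hits on two targets in one sudden-death round."""
    return [(1 - p)**2, 2 * p * (1 - p), p**2]

# per-target probabilities implied by 13..16 hits out of 25
ps = [13/25, 14/25, 15/25, 16/25]

posterior = {}
for p_rhode, p_wei in product(ps, ps):
    dr = hits_out_of_2(p_rhode)
    dw = hits_out_of_2(p_wei)
    tie = sum(a * b for a, b in zip(dr, dw))           # same score in a round
    rhode_more = sum(dr[i] * dw[j]
                     for i in range(3) for j in range(3) if i > j)
    # likelihood: three tied rounds, then Rhode hits more in round four
    posterior[(p_rhode, p_wei)] = (1 / 16) * tie**3 * rhode_more

total = sum(posterior.values())
posterior = {k: v / total for k, v in posterior.items()}
```

Summing out each shooter's marginal posterior and comparing its mean to the prior mean answers the question.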
I'll post solutions in a few days.

Published on October 25, 2016 12:08
October 21, 2016
Why is my cat orange?
One of the students in my Bayesian statistics class, Mafalda Borges, came up with an excellent new Bayes theorem problem. Here's my paraphrase:
The sex-linked red gene, O, determines whether there will be red variations to fur color. This gene is located on the X chromosome... Males have only one X chromosome, so they have only one allele of this gene: O results in orange variations, and o results in non-orange fur. Since females have two X chromosomes, they have two alleles of this gene: OO results in orange-toned fur, oo results in non-orange fur, and Oo results in a tortoiseshell cat, in which some areas of the fur are orange variants and other areas non-orange.
If the population genetics for the red gene are in equilibrium, we can use the Hardy-Weinberg principle. If the prevalence of the red allele is p and the prevalence of the non-red allele is q=1-p:
1) The fraction of male cats that are orange is p and the fraction that are non-orange is q.
2) The fractions of female cats that are OO, Oo, and oo are p², 2pq, and q², respectively.
Finally, if we know the genetics of a mating pair, we can compute the probability of each genetic combination in their offspring.
1) If the offspring is male, he got a Y chromosome from his father. Whether he is orange or not depends on which allele he got from his mother.
2) If the offspring is female, her coat depends on the alleles she got from both parents.
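The two offspring tables from the original post did not survive the formatting of this page, but they follow directly from Mendelian inheritance; this sketch reconstructs them, writing genotypes as strings like "Oo":

```python
from collections import Counter
from fractions import Fraction

ALLELES = {"OO": ["O", "O"], "Oo": ["O", "o"], "oo": ["o", "o"]}

def male_offspring_orange(mother):
    """P(orange) for a male kitten: he gets his Y from his father,
    so only the allele from his mother matters."""
    return Fraction(ALLELES[mother].count("O"), 2)

def female_offspring(mother, father_allele):
    """Genotype distribution for a female kitten: one X allele from
    each parent; the father contributes 'O' or 'o'."""
    dist = Counter()
    for a in ALLELES[mother]:
        genotype = "".join(sorted(a + father_allele))
        dist[genotype] += Fraction(1, 2)
    return dist
```

For example, `female_offspring("Oo", "o")` gives a 50/50 split between tortoiseshell (Oo) and non-orange (oo) daughters.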
That's all the background information you need. Here is the problem:

About 3/4 of orange cats are male. If my cat is orange, what is the probability that his mother was orange?

I'll post the solution next week.
Published on October 21, 2016 11:52
October 14, 2016
Millennials are still not getting married
Last year I presented a paper called "Will Millennials Ever Get Married?" at SciPy 2015. You can see video of the talk and download the paper here.
I used data from the National Survey of Family Growth (NSFG) to estimate the age at first marriage for women in the U.S., broken down by decade of birth. I found evidence that women born in the 1980s and 90s were getting married later than previous cohorts, and I generated projections that suggest they are on track to stay unmarried at substantially higher rates.
Yesterday the National Center for Health Statistics (NCHS) released a new batch of data from surveys conducted in 2013-2015. I downloaded it and updated my analysis. Also, for the first time, I apply the analysis to the data from male respondents.
Women

Based on a sample of 58,488 women in the U.S., here are survival curves that estimate the fraction who have never been married for each birth group (women born in the 1940s, 50s, etc) at each age.
For example, the top line represents women born in the 1990s. At age 15, none of them were married; at age 24, 81% of them were still unmarried. (The survey data runs up to 2015, so the oldest respondents in this group were interviewed at age 25, but the last year contains only partial data, so the survival curve is cut off at age 24.)
For women born in the 1980s, the curve goes up to age 34, at which point about 39% of them had never been married.
Two patterns are visible in this figure. Women in each successive cohort are getting married later, and a larger fraction are never getting married at all.
By making some simple projections, we can estimate the magnitude of these effects separately. I explain the methodology in the paper. The following figure shows the survival curves from the previous figure as well as projections shown in gray:
These results suggest that women born in the 1980s and 1990s are not just getting married later; they are on pace to stay unmarried at rates substantially higher than previous cohorts. In particular, women born in the 1980s seem to have leveled off; very few of them have been married between ages 30 and 34. For women born in the 1990s, it is too early to tell whether they have started to level off.
The following figure summarizes these results by taking vertical slices through the survival curves at ages 23, 33 and 43.
In this figure the x-axis is birth cohort and the y-axis is the fraction who have never married.
1) The top line shows that the fraction of women never married by age 23 has increased from 25% for women born in the 40s to 81% for women born in the 90s.
2) The fraction of women unmarried at age 33 has increased from 9% for women born in the 40s to 38% for women born in the 80s, and is projected to be 47% for women born in the 90s.
3) The fraction of women unmarried at age 43 has increased from 8% for women born in the 40s to 17% for women born in the 70s, and is projected to be 36% for women born in the 1990s.
These projections are based on simple assumptions, so we should not treat them as precise predictions, but they are not as naive as simple straight-line extrapolations of past trends.
Men

The results for men are similar but less extreme. Here are the estimated survival curves based on a sample of 24,652 men in the U.S. The gray areas show 90% confidence intervals for the estimates due to sampling error.
And projections for the last two cohorts.
Finally, here are the same data plotted with birth cohort on the x-axis.
1) At age 23, the fraction of men who have never married has increased from 66% for men born in the 50s to 88% for men born in the 90s.
2) At age 33, the fraction of unmarried men has increased from 27% to 44%, and is projected to go to 50%.
3) At age 43, the fraction of unmarried men is almost unchanged for men born in the 50s, 60s, and 70s, but is projected to increase to 30% for men born in the 1990s.
Methodology

The NSFG is intended to be representative of the adult U.S. population, but it uses stratified sampling to systematically oversample certain subpopulations, including teenagers and racial minorities. My analysis takes this design into account (by weighted resampling) to generate results that are representative of the population.
The survival curves are computed by Kaplan-Meier estimation, with confidence intervals computed by resampling. Missing values are filled by random choice from valid values, so the confidence intervals represent variability due to missing values as well as sampling.
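As a minimal sketch of the estimator (with made-up numbers, not the NSFG data): treat each respondent's age at first marriage as the event time, and the age at interview as a censoring time for respondents who have never married.

```python
import numpy as np

# age at first marriage, or age at interview if never married (hypothetical)
ages    = np.array([22, 25, 25, 28, 30, 33, 33, 34])
married = np.array([ 1,  1,  0,  1,  0,  1,  0,  0])  # 0 = censored

def kaplan_meier(times, events):
    """Return [(age, estimated P(still unmarried at that age)), ...]."""
    curve = []
    surv = 1.0
    for t in np.unique(times[events == 1]):
        at_risk = np.sum(times >= t)                  # unmarried and observed
        d = np.sum((times == t) & (events == 1))      # marriages at this age
        surv *= 1 - d / at_risk                       # Kaplan-Meier product
        curve.append((t, surv))
    return curve

curve = kaplan_meier(ages, married)
```

The key property is that censored respondents still count in the at-risk denominator up to their interview age, so partial observation of the youngest cohorts is used rather than discarded.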
To generate projections, we might consider two factors:
1) If people in the last two cohorts are postponing marriage, we might expect their marriage rates to increase or decrease more slowly.
2) If we extrapolate the trends, we might expect marriage rates to continue to fall or fall faster.
I chose a middle path between these extremes: I assume that the hazard function from the previous generation will apply to the next. This takes into account the possibility of delayed marriage (since there are more unmarried people "at risk" in the projections), but it also assumes a degree of regression to past norms. In that sense, the projections are probably conservative; that is, they probably underestimate how different the last two cohorts will be from their predecessors.
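Concretely, the projection step looks something like this (the numbers are illustrative, not the NSFG estimates):

```python
import numpy as np

ages = np.arange(25, 45)                 # ages to project over
hazard_prev = np.full(len(ages), 0.05)   # previous cohort's marriage hazard by age
surv = 0.47                              # fraction still unmarried at last observed age

# extend the younger cohort's survival curve by applying the
# older cohort's hazard at each age it hasn't reached yet
projection = [surv]
for h in hazard_prev:
    projection.append(projection[-1] * (1 - h))
```

Because the hazard multiplies the surviving (still unmarried) fraction, a cohort that starts with more unmarried people automatically produces more projected marriages, which is the "catching up" effect described above.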
All the details are in this Jupyter notebook.
Published on October 14, 2016 08:43
September 26, 2016
Bayes's Theorem is not optional
Abstract: I present a probability puzzle, the Rain in Seattle Problem, and use it to explain differences between the Bayesian and frequentist interpretations of probability, and between Bayesian and frequentist statistical methods. Since I am trying to clear up confusion, I try to describe the alternatives without commenting on their pros and cons.
Introduction

Conversations about Bayesian statistics sometimes get bogged down in confusion about two separate questions:
1) The Bayesian interpretation of probability, as opposed to the frequentist interpretation.
2) The Bayesian approach to statistical inference, as opposed to the frequentist approach.
The first is a philosophical position about what probability means; the second is more like a practical recommendation about how to make inferences from data. They are almost entirely separate questions; for example, you might prefer the Bayesian interpretation of probability by philosophical criteria, and then use frequentist statistics because of practical requirements; or the other way around.
Under the frequentist interpretation of probability, we can only talk about the probability of an event if we can model it as a subset of a sample space. For example, we can talk about the probability of drawing a straight in poker because a straight is a well-defined subset of the sample space that contains all poker hands. But by this interpretation, we could not assign a probability to the proposition that Hillary Clinton will win the election, unless we could model this event as a subset of all elections, somehow.
Under the Bayesian interpretation, a probability represents a degree of belief, so it is permissible to assign probabilities to events even if they are unique. It is also permissible to use probability to represent uncertainty about non-random events. For example, if you are uncertain about whether there is life on Mars, you could assign a probability to that proposition under the Bayesian interpretation. Under the frequentist interpretation, there either is life on Mars or not; it is not a random event, so we can't assign a probability to it.
(I avoid saying things like "a Bayesian believes this" or "a Frequentist believes that". These are philosophical positions, and we can discuss their consequences regardless of who believes what.)
In problems where the frequentist interpretation of probability applies, the Bayesian and frequentist interpretations yield the same answers. The difference is that for some problems we get an answer under Bayesianism and no answer under frequentism.
Now, before I get into Bayesian and frequentist inference, let's look at an example.
The Rain in Seattle problem

Suppose you are interviewing for a data science job and you are asked this question (from glassdoor.com):

You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?

Take a minute to think about it before you go on. Then take a look at the responses on glassdoor.com. The top response, which uses Bayes's Theorem, is correct. I'll explain the correct solution first; then I want to comment on some of the other responses.

The question asks you to compute the probability of rain conditioned on three yesses, which I'll write P(rain|YYY).
Now, here's an important point: you can't give a meaningful answer to this question unless you know P(rain), the probability of rain unconditioned on what your friends say. To see why, consider two extreme cases:
1. If P(rain) is 1, it always rains in Seattle. If your friends all tell you it's raining, you know that they are telling the truth, and that P(rain|YYY) is 1.
2. If P(rain) is 0, it never rains in Seattle, so you know your friends are lying and P(rain|YYY) = 0.
For values of P(rain) between 0 and 1, the answer could be any value between 0 and 1. So if you see any response to this question that does not take into account P(rain), you can be sure that it is wrong (or coincidentally right based on an invalid argument).
But if we are given the base rate, we can solve the problem easily using Bayes's Rule. According to the Western Regional Climate Center, from 1965 to 1999 there was measurable rain in Seattle during 822 hours per year, which is about 10% of the time.
A base rate of 10% corresponds to prior odds of 1:9. Each friend is twice as likely to tell the truth as to lie, so each friend contributes evidence in favor of rain with a likelihood ratio, or Bayes factor, of 2. Multiplying the prior odds by the likelihood ratios yields posterior odds 8:9, which corresponds to probability 8/17, or 0.47.
And that is the unique correct answer to the question (provided that you accept the modeling assumptions). More generally, if P(rain) = p, the conditional probability is

P(rain|YYY) = Probability(8 × Odds(p))

where Odds() converts a probability to odds and Probability() does the opposite.
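In code, with the two conversions written out as small helper functions:

```python
def odds(p):
    """Convert a probability to odds in favor."""
    return p / (1 - p)

def probability(o):
    """Convert odds in favor back to a probability."""
    return o / (o + 1)

p_rain = 0.10                 # base rate for Seattle
bayes_factor = 2 ** 3         # three friends, each contributing a factor of 2
posterior = probability(bayes_factor * odds(p_rain))
# posterior is 8/17, about 0.47
```

Working in odds makes the three independent updates a single multiplication, which is why the odds form of Bayes's Rule is convenient here.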
What about the frequentist answer?

Several of the responses on glassdoor.com provide what they call a frequentist or non-Bayes perspective. Here is one:

Answer from a frequentist perspective: Suppose there was one person. P(Y|rain) is twice (2/3 / 1/3) as likely as P(Y|no rain), so the P(rain) is 2/3. If instead n people all say YES, then they are either all telling the truth, or all lying. The outcome that they are all telling the truth is (2/3)^n / (1/3)^n = 2^n as likely as the outcome that they are not. Thus P(YYY | rain) = 2^n / (2^n + 1) = 8/9 for n=3. Notice that this corresponds exactly to the Bayesian answer when prior(raining) = 1/2.

And here's another:

I thought about this a little differently from a non-Bayes perspective. It's raining if any ONE of the friends is telling the truth, because if they are telling the truth then it is raining. If all of them are lying, then it isn't raining because they told you that it was raining. So what you want is the probability that any one person is telling the truth. Which is simply 1-Pr(all lie) = 26/27. Anyone let me know if I'm wrong here!

These are not actually frequentist responses. For this problem, we get the same answer under Bayesianism and frequentism because:
1) Everything in this problem can be well-modeled by random processes. There is a well-defined long-run probability of rain in Seattle, and we can model the friends' responses as independent random variables (at least according to the statement of the problem).
AND
2) There is nothing especially Bayesian about Bayes's Theorem! Bayes's Theorem is an uncontroversial law of probability that is true under any interpretation of probability, and can be used for any kind of statistical inference.
The "non-Bayes" responses are not actually other perspectives; they are just incorrect. Under frequentism, we would either accept the solution based on Bayes's Theorem or, under a strict interpretation, we might say that it is either raining in Seattle or not, and refuse to assign a probability.
But what about frequentist inference?

Statistical inference is the process of inferring the properties of a population based on a sample. For example, if you want to know the fraction of U.S. voters who intend to vote for Donald Trump, you could poll a sample of the population. Then,
1) Using frequentist inference, you could compute an estimate of the fraction of the population that intends to vote for Trump (call it x), you could compute a confidence interval for the estimate, and you could compute a p-value based on a null-hypothesis like "x is 50%". But if anyone asked "what's the probability that x is greater than 50%", you would not be able to answer that question.
2) Using Bayesian inference, you would start with some prior belief about x, use the polling data to update your belief, and produce a posterior distribution for x, which represents all possible values and their probabilities. You could use the posterior distribution to compute estimates and intervals similar to the results of frequentist inference. But if someone asked "what's the probability that x is greater than 50%", you could compute the answer easily.
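As a sketch of that Bayesian computation, using a grid approximation and a hypothetical poll in which 52 of 100 respondents support the candidate:

```python
import numpy as np

xs = np.linspace(0, 1, 101)            # possible values of x
prior = np.ones(len(xs)) / len(xs)     # uniform prior over x

# binomial likelihood of 52 yesses in 100 responses (constant factor dropped)
likelihood = xs**52 * (1 - xs)**48

posterior = prior * likelihood
posterior /= posterior.sum()

p_gt_half = posterior[xs > 0.5].sum()  # the question frequentism can't answer
```

The posterior distribution also yields point estimates and credible intervals, so the frequentist outputs come along for free.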
So, how does this apply to the Rain in Seattle Problem? It doesn't, because the Rain in Seattle problem has nothing to do with statistical inference. It is a question about probability, not statistics. It has one correct answer under any interpretation of probability, regardless of your preferences for statistical inference.
Summary

1) Conversations about Bayesian methods will be improved if we distinguish two almost unrelated questions: the meaning of probability and the choice of inferential methods.
2) You don't have to be a Bayesian to use Bayes's Theorem. Most probability problems, including the Rain in Seattle problem, have a single solution considered correct under any interpretation of probability and statistics.
IntroductionConversations about Bayesian statistics sometimes get bogged down in confusion about two separate questions:
1) The Bayesian interpretation of probability, as opposed to the frequentist interpretation.
2) The Bayesian approach to statistical inference, as opposed to frequentist approach.
The first is a philosophical position about what probability means; the second is more like a practical recommendation about how to make inferences from data. They are almost entirely separate questions; for example, you might prefer the Bayesian interpretation of probability by philosophical criteria, and then use frequentist statistics because of practical requirements; or the other way around.
Under the frequentist interpretation of probability, we can only talk about the probability of an event if we can model it as a subset of a sample space. For example, we can talk about the probability of drawing a straight in poker because a straight is a well-defined subset of the sample space that contains all poker hands. But by this interpretation, we could not assign a probability to the proposition that Hillary Clinton will win the election, unless we could model this event as a subset of all elections, somehow.
Under the Bayesian interpretation, a probability represents a degree of belief, so it is permissible to assign probabilities to events even if they are unique. It is also permissible to use probability to represent uncertainty about non-random events. For example, if you are uncertain about whether there is life on Mars, you could assign a probability to that proposition under the Bayesian interpretation. Under the frequentist interpretation, there either is life on Mars or not; it is not a random event, so we can't assign a probability to it.
(I avoid saying things like "a Bayesian believes this" or "a Frequentist believes that". These are philosophical positions, and we can discuss their consequences regardless of who believes what.)
In problems where the frequentist interpretation of probability applies, the Bayesian and frequentist interpretations yield the same answers. The difference is that for some problems we get an answer under Bayesianism and no answer under frequentism.
Now, before I get into Bayesian and frequentist inference, let's look at an example.
The Rain in Seattle problemSuppose you are interviewing for a data science job and you are asked this question (from glassdoor.com):
You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?Take a minute to think about it before you go on. Then take a look at the responses on glassdoor.com. The top response, which uses Bayes's Theorem, is correct. I'll explain the correct solution first; then I want to comment on some of the other responses.
The question asks you to compute the probability of rain conditioned on three yesses, which I'll write P(rain|YYY).
Now, here's an important point: you can't give a meaningful answer to this question unless you know P(rain), the probability of rain unconditioned on what your friends say. To see why, consider two extreme cases:
1. If P(rain) is 1, it always rains in Seattle. If your friends all tell you it's raining, you know that they are telling the truth, and that P(rain|YYY) is 1.
2. If P(rain) is 0, it never rains in Seattle, so you know your friends are lying and P(rain|YYY) = 0.
For values of P(rain) between 0 and 1, the answer could be any value between 0 and 1. So if you see any response to this question that does not take into account P(rain), you can be sure that it is wrong (or coincidentally right based on an invalid argument).
But if we are given the base rate, we can solve the problem easily using Bayes's Rule, According to the Western Regional Climate Center, from 1965-99 there was measurable rain in Seattle during 822 hours per year, which is about 10% of the time.
A base rate of 10% corresponds to prior odds of 1:9. Each friend is twice as likely to tell the truth as to lie, so each friend contributes evidence in favor of rain with a likelihood ratio, or Bayes factor, of 2. Multiplying the prior odds by the likelihood ratios yields posterior odds 8:9, which corresponds to probability 8/17, or 0.47.
And that is the unique correct answer to the question (provided that you accept the modeling assumptions). More generally, if P(rain) = p, the conditional probability P(rain|YYY) is
Probability(8 Odds(p))
assuming that Odds() converts probabilities to odds and Probability() does the opposite.
What about the frequentist answer?Several of the responses on glassdoor.com provide what they call a frequentist or non-Bayes perspective:
Answer from a frequentist perspective: Suppose there was one person. P(Y|rain) is twice (2/3 / 1/3) as likely as P(Y|no rain), so the P(rain) is 2/3. If instead n people all say YES, then they are either all telling the truth, or all lying. The outcome that they are all telling the truth is (2/3)^n / (1/3)^n = 2^n as likely as the outcome that they are not. Thus P(YYY | rain) = 2^n / (2^n + 1) = 8/9 for n=3. Notice that this corresponds exactly to the Bayesian answer when prior(raining) = 1/2.

And here's another:
I thought about this a little differently from a non-Bayes perspective. It's raining if any ONE of the friends is telling the truth, because if they are telling the truth then it is raining. If all of them are lying, then it isn't raining because they told you that it was raining. So what you want is the probability that any one person is telling the truth. Which is simply 1-Pr(all lie) = 26/27. Anyone let me know if I'm wrong here!

These are not actually frequentist responses. For this problem, we get the same answer under Bayesianism and frequentism because:
1) Everything in this problem can be well-modeled by random processes. There is a well-defined long-run probability of rain in Seattle, and we can model the friends' responses as independent random variables (at least according to the statement of the problem).
AND
2) There is nothing especially Bayesian about Bayes's Theorem! Bayes's Theorem is an uncontroversial law of probability that is true under any interpretation of probability, and can be used for any kind of statistical inference.
The "non-Bayes" responses are not actually other perspectives; they are just incorrect. Under frequentism, we would either accept the solution based on Bayes's Theorem or, under a strict interpretation, we might say that it is either raining in Seattle or not, and refuse to assign a probability.
But what about frequentist inference?
Statistical inference is the process of inferring the properties of a population based on a sample. For example, if you want to know the fraction of U.S. voters who intend to vote for Donald Trump, you could poll a sample of the population. Then,
1) Using frequentist inference, you could compute an estimate of the fraction of the population that intends to vote for Trump (call it x), you could compute a confidence interval for the estimate, and you could compute a p-value based on a null-hypothesis like "x is 50%". But if anyone asked "what's the probability that x is greater than 50%", you would not be able to answer that question.
2) Using Bayesian inference, you would start with some prior belief about x, use the polling data to update your belief, and produce a posterior distribution for x, which represents all possible values and their probabilities. You could use the posterior distribution to compute estimates and intervals similar to the results of frequentist inference. But if someone asked "what's the probability that x is greater than 50%", you could compute the answer easily.
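To make the contrast concrete, here is a sketch of both computations on the same made-up poll (520 of 1000 respondents supporting the candidate; the numbers and function names are mine, for illustration only):

```python
import math

def frequentist_summary(k, n, null=0.5):
    """Point estimate, 95% CI, and two-sided p-value for a proportion,
    using the normal approximation."""
    phat = k / n
    se = math.sqrt(phat * (1 - phat) / n)
    ci = (phat - 1.96 * se, phat + 1.96 * se)
    # z statistic under the null hypothesis x = null
    z = (phat - null) / math.sqrt(null * (1 - null) / n)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return phat, ci, p_value

def posterior_prob_above(k, n, threshold=0.5, grid_size=1001):
    """Bayesian version: posterior P(x > threshold) for k successes
    in n trials, starting from a uniform prior on a grid."""
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    # binomial likelihood at each grid point (unnormalized;
    # the binomial coefficient cancels in the normalization)
    likes = [x**k * (1 - x)**(n - k) for x in xs]
    total = sum(likes)
    return sum(l for x, l in zip(xs, likes) if x > threshold) / total

est, ci, p = frequentist_summary(520, 1000)
print(est, ci, p)                       # p-value is about 0.2
print(posterior_prob_above(520, 1000))  # posterior P(x > 0.5), about 0.9
```

The frequentist summary cannot answer "what is the probability that x > 50%?"; the last line answers it directly from the posterior.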
So, how does this apply to the Rain in Seattle Problem? It doesn't, because the Rain in Seattle problem has nothing to do with statistical inference. It is a question about probability, not statistics. It has one correct answer under any interpretation of probability, regardless of your preferences for statistical inference.
Summary
1) Conversations about Bayesian methods will be improved if we distinguish two almost unrelated questions: the meaning of probability and the choice of inferential methods.
2) You don't have to be a Bayesian to use Bayes's Theorem. Most probability problems, including the Rain in Seattle problem, have a single solution considered correct under any interpretation of probability and statistics.

Published on September 26, 2016 04:47
September 16, 2016
Blow it up and start again
The president of Olin College, Rick Miller, spoke recently at the Business Innovation Factory. Here's the most-tweeted quote from the talk: "The only way to change education is to blow it up and start again."
I agree, and I saw an example recently that helps make the point. The American Statistical Association recently published this Statement on p-Values. Here's how it starts:

In February 2014, George Cobb, Professor Emeritus of Mathematics and Statistics at Mount Holyoke College, posed these questions to an ASA discussion forum:

Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.

Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.

Cobb’s concern was a long-worrisome circularity in the sociology of science based on the use of bright lines such as p < 0.05: “We teach it because it’s what we do; we do it because it’s what we teach.”

This "worrisome circularity" is a concrete example of why gradual change is so hard, and why sometimes the only solution is to blow it up and start again. That idea is scary to a lot of people, but it doesn't have to be. I have an example that might help: the statistics curriculum at Olin.
Statistics at Olin
First I'll explain what worked, then we'll look at what could have gone wrong.
In 2010, I proposed a new class called Computational Probability and Statistics, as a substitute for a very conventional statistics class that was offered at the time. My class was based on a library I developed while I was on sabbatical at Google, which is now the thinkstats module in ThinkX.
While teaching the class, I wrote Think Stats, which was published by O'Reilly Media in 2011. After a few semesters, I developed another course called Computational Bayesian Statistics, and wrote Think Bayes, which was published in 2013.
In 2014 I expanded CompProbStat from 2 credits to 4 and renamed it Data Science. I recruited external collaborators to provide data and motivating questions for student projects, and several other professors sat in and helped guide student projects. In 2016 one of those professors took over and taught his version of the class, adding his expertise in machine learning.
At the same time, two of my colleagues were developing their own statistics classes, focused on applications in diverse areas of engineering and science. None of these classes look much like the conventional statistics material, and they are much better for it.
In six years, we developed five new classes, published two books, got six additional professors involved in teaching data science and statistics, and, most importantly, we developed a curriculum that serves the goals and needs of our students.
How did that happen?
This project would have been impossible at almost any other college.
At most colleges and universities, a professor of computer science (like me) who proposes a new statistics class will not get far, because of two fundamental and unquestioned assumptions of undergraduate education: (1) you need a Ph.D. in a topic before you can teach an introductory course, and (2) if you do research in a field, that makes you better at teaching it to undergraduates. Note: neither of these is true.
And the content of my courses would have been scrutinized by a committee with a checklist. To teach something new, you have to stop teaching something old, and it is nearly impossible to get permission to stop teaching anything. Every topic, no matter how obsolete, is defended by zealots with no respect for evidence or reason. Fun example: here are Time magazine's Five Reasons Kids Should Still Learn Cursive Writing. Note: none of them are good.
Every field has its obstructionists, but statistics has its own special kind: the anti-Bayesians. I can only imagine the howls if I proposed teaching Bayesian statistics to undergraduates. When I suggested teaching it before classical statistics, I would have been thrown out a window. And when I proposed to teach it instead of classical statistics, I would have been dragged through the streets.
At Olin, fixing engineering education is our mission. When someone proposes an experiment, we ask the right questions: Does it contribute to our mission by improving undergraduate education at Olin and other institutions? Is it a reasonable risk? And do we have the resources to do it? If the answers are yes, we do it. Note: if it's an unreasonable risk and we don't have the resources, sometimes we do it anyway.
The second reason my project would be impossible at most schools is that statistics is owned by the math or statistics department, and even though the faculty don't like teaching classes for non-majors, they get credit for providing "service classes" (a term I would like to ban), so they have an incentive to protect their territory.
And just as the math department would fight to keep me out, the computer science department would fight to keep me in. If the CS department owns my "faculty line" (another term I would like to ban), they want me to teach CS classes.
At Olin, we have no departments. We don't have to do this kind of bookkeeping, and that leaves us free to think about the students (remember them?) and design a curriculum that serves their needs.
The third reason my project wouldn't happen anywhere else is that I wouldn't do it anywhere else. At most universities, there is no incentive to develop new classes; in fact, there is a strong disincentive. If you try something new, you make enemies, because the new is an insult to the old. If it doesn't work, you get punished, and even if it works, you get no reward.
The one factor that drives hiring and firing is research. Even at liberal arts colleges that value good teaching, there is no expectation for innovation. If you do a decent job of teaching the same two or three classes over and over, that's good enough. At Olin, we are encouraged to take risks, supported while we work out the bugs, and rewarded for the effort.
Also, at most universities, there is no incentive to write textbooks. They don't count as research and they don't count as teaching; the time you spend on a textbook is just time you didn't spend on research. At Olin, we use broad categories to evaluate faculty work, and a successful textbook is valued because it benefits students (at Olin and other institutions) and contributes to our mission to change engineering education.
So blow it up
You don't get a lot of opportunities to blow it up and start again, but when you do, a lot of good things can happen. It's not as scary as it sounds.
Also, there is nothing special about p = 0.05.

Published on September 16, 2016 08:14
September 14, 2016
It's a small world, scale-free network after all
Real social networks generally have the properties of small world graphs (high clustering and low path lengths) and the characteristics of scale free networks (a heavy-tailed degree distribution).
The Watts-Strogatz (WS) network model has small world characteristics, but the degree distribution is roughly normal, very different from observed distributions.
The Barabasi-Albert (BA) model has low path lengths and a heavy-tailed degree distribution, but:

1) It has low clustering, and
2) The degree distribution does not fit observed data well.
The Holme-Kim (HK) model generates graphs with higher clustering, although still not as high as observed values. And the degree distribution is heavy-tailed, but it still doesn't fit observed distributions well.
I propose a new model that generates graphs with:

1) Low path lengths,
2) Clustering coefficients similar to the HK model (but still lower than observed values), and
3) A degree distribution that fits observed data well.
I test the models with a relatively small dataset from SNAP.
The proposed model is based on a "friend of a friend" growth mechanism that is a plausible description of the way social networks actually grow. The implementation is simple, comparable to BA and HK in both lines of code and run time.
All the details are in this Jupyter notebook, but I summarize the primary results here.
Comparing the models
The Facebook dataset from SNAP contains 4039 nodes and 88234 edges. The mean path length is 3.7 and the clustering coefficient is 0.6.
A WS model with the same number of nodes and edges, and with probability of rewiring, p=0.05, has mean path length 3.2 and clustering 0.62, so it clearly has the small world properties. But the distribution of degree does not match the data at all:
[Figure: degree distributions of the WS model and the Facebook data]
A BA model with the same number of nodes and edges has very short paths (2.5), but very low clustering (0.04). The degree distribution is a better match for the data:
[Figure: degree distributions of the BA model and the Facebook data]
If we plot CDFs on a log-log scale, the BA model matches the tail of the distribution reasonably well, but the WS model is hopeless.
[Figure: CDFs of degree for the WS and BA models and the data, log-log scale]
But if we plot CDFs on a log-x scale, we see that the BA model does not match the rest of the distribution:
[Figure: CDFs of degree for the BA model and the data, log-x scale]
The HK model also has short path lengths (2.8), and the clustering is much better (0.23), but still not as high as in the data (0.6). The degree distribution is pretty much the same as in the BA model.
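The comparisons above can be reproduced at small scale with the generators built into NetworkX. This is a sketch; the parameters here are chosen for speed, not to match the Facebook dataset:

```python
import networkx as nx

n = 500  # small n so the path-length computation is fast

# WS: ring of n nodes, each connected to 10 neighbors, rewiring prob 0.05
ws = nx.connected_watts_strogatz_graph(n, 10, 0.05, seed=1)
# BA: each new node attaches to 5 existing nodes, preferentially by degree
ba = nx.barabasi_albert_graph(n, 5, seed=1)
# HK: like BA, but each attachment adds a triangle with probability 0.5
hk = nx.powerlaw_cluster_graph(n, 5, 0.5, seed=1)

for name, g in [("WS", ws), ("BA", ba), ("HK", hk)]:
    print(name,
          nx.average_clustering(g),
          nx.average_shortest_path_length(g))
```

Consistent with the numbers above, WS has by far the highest clustering, BA the lowest, and HK falls in between, while all three have short path lengths.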
The FOF model
The generative model I propose is called FOF for "friends of friends". It is similar to both BA and HK, but it yields a degree distribution that matches observed data better.
It starts with a complete graph with m+1 nodes, so initially all nodes have degree m. Each time we generate a node we:

1) Select a random target uniformly from existing nodes.
2) Iterate through the friends of the target. For each one, with probability p, we form a triangle that includes the source, the friend, and a random friend of the friend.
3) Finally, connect the source and the target.
Because we choose friends of the target, this process has preferential attachment, but it does not yield a power law tail. Rather, the degree distribution is approximately lognormal with median degree m.
Because this process forms triangles, it yields a moderately high clustering coefficient.
An FOF graph with the same number of nodes and edges as the Facebook data has low path length (3.0) and moderate clustering (0.24, which is more than BA, comparable to HK, but still less than the observed value, 0.6).
The degree distribution is a reasonable match for the tail of the observed distribution:
[Figure: CDFs of degree for the FOF model and the data, log-log scale]
And a good match for the rest of the distribution:

[Figure: CDFs of degree for the FOF model and the data, log-x scale]
In summary, the FOF model has:

1) Short path lengths, like WS, BA, and HK.
2) Moderate clustering, similar to HK, less than WS, and higher than BA.
3) Good fit to the tail of the degree distribution, like BA and HK.
4) Good fit to the rest of the degree distribution, unlike WS, BA, and HK.
Also, the mechanism of growth is plausible: when a person joins the network, they connect to a randomly-chosen friend and then a random subset of "friends of friends". This process has preferential attachment because friends of friends are more likely to have high degree (see The Inspection Paradox is Everywhere). But the resulting distribution is approximately lognormal, which is heavy-tailed but does not have a power law tail.
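The observation that friends of friends tend to have high degree is the friendship paradox, a version of the inspection paradox, and it can be verified exactly on a toy example (the star graph below is my choice, just for illustration):

```python
def mean_degree(degrees):
    """Average degree of a node chosen uniformly at random."""
    return sum(degrees) / len(degrees)

def mean_friend_degree(degrees):
    """Average degree of a node reached by following a random edge;
    a node with degree d is reached in proportion to d."""
    return sum(d * d for d in degrees) / sum(degrees)

# star graph on 4 nodes: the hub has degree 3, the leaves degree 1
degrees = [3, 1, 1, 1]
print(mean_degree(degrees))         # 1.5
print(mean_friend_degree(degrees))  # 2.0
```

By the Cauchy-Schwarz inequality, the friend average is at least the node average for any graph, with equality only when all degrees are equal.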
Implementation
Here is a function that generates FOF graphs:
import random
import networkx as nx

def fof_graph(n, m, p=0.25, seed=None):
    if m < 1 or m+1 >= n:
        raise nx.NetworkXError("must have 1 <= m < n-1")
    if seed is not None:
        random.seed(seed)
    # start with a completely connected core of m+1 nodes
    G = nx.complete_graph(m+1)
    for source in range(len(G), n):
        # choose a random target uniformly from existing nodes
        target = random.choice(list(G.nodes()))
        # enumerate neighbors of target and add triangles
        friends = list(G.neighbors(target))
        for friend in friends:
            if flip(p):
                triangle(G, source, friend)
        # connect source and target
        G.add_edge(source, target)
    return G

def flip(p):
    """Returns True with probability `p`."""
    return random.random() < p

def triangle(G, source, friend):
    """Chooses a random neighbor of `friend` and makes a triangle.

    The triangle connects `source`, `friend`, and a
    random neighbor of `friend`.
    """
    fof = set(G[friend])
    if source in G:
        # exclude nodes the source is already connected to
        fof -= set(G[source])
    if fof:
        w = random.choice(list(fof))
        G.add_edge(source, w)
    G.add_edge(source, friend)
Again, all the details are in this Jupyter notebook.

Published on September 14, 2016 07:38
September 2, 2016
Sleeping Beauty and the Red Dice
In response to my previous article on the Sleeping Beauty Problem, I got this comment from a reader:
Elga presents the Sleeping Beauty problem like this:
But Elga's Section 3 introduces some confusion around the meaning of "information". Elga says:
In particular (as I explained in my previous article), when Sleeping Beauty is awakened, she observes an event, awakening, that is twice as likely under T (the proposition that the coin toss is Heads) than under H, and she should change her credences accordingly.
So in my solution, her belief change is not unusual; it is an application of Bayes's Theorem that is only remarkable because it is not immediately obvious what the evidence is and what it's likelihood is under the two hypotheses. In that sense, it is similar to the Elvis Problem.
In the rest of Section 3, Elga tries to reconcile the seemingly contradictory conclusions that Beauty receives no new information and Beauty should change her credences. I think this argument addresses a non-problem, because Beauty does receive information that justifies her change in credences. So I agree with Lewis that Elga is wrong to conclude that the Sleeping Beauty problem raises, "a new question about how a rational agent ought to update her beliefs over time".
In summary:
1) Lewis is wrong about the answer to the problem and wrong to reject Elga's proof,
2) Also, his claim that Beauty does not receive information is wrong.
3) However, he is right to reject the argument in Elga's Section 3.
The Red Dice
At this point, we have three arguments to support the "thirder" position:
1) The argument based on long-run frequencies (I quoted Elga's version above).
2) The argument based on the principle of indifference (Elga's section 2).
3) The argument based on Bayes's theorem (in my previous article).
But if you still find it hard to believe that Beauty gets information when she wakes up, the Red Dice problem might help. I wrote about several versions of it in this previous article:
A thirder would respond (correctly) that the outcome you observed is twice as likely if the die is mostly red, and therefore it provides evidence in favor of the hypothesis that it is mostly red. Specifically, the posterior probability is 2/3.
If you don't believe this answer, you can see a more careful explanation and a demonstration by simulation in this Jupyter notebook (see Scenario C).
The Red Dice problem suggests that we should be skeptical of an argument with the form "The observation was inevitable under all hypotheses, and therefore we received no information." If an event happens once under H and twice under T, it is inevitable under both; nevertheless, a random observation of the event is twice as likely under T, and therefore provides evidence in favor of T.
The late great philosopher David Lewis was a halfer. I'd be interested in any reactions to his paper on it: http://fitelson.org/probability/lewis_sb.pdfThe context of the paper is a disagreement between Lewis and Adam Elga; specifically, Lewis's paper is a response to Elga's paper "Self-locating belief and the Sleeping Beauty Problem".
Elga presents the Sleeping Beauty problem like this:
Some researchers are going to put you to sleep. During the two days that your sleep will last, they will briefly wake you up either once or twice, depending on the toss of a fair coin (Heads: once; Tails: twice). After each waking, they will put you to back to sleep with a drug that makes you forget that waking. [Just after you are] awakened, to what degree ought you believe that the outcome of the coin toss is Heads?And then he states the two most common responses to the problem
First answer: 1/2, of course! Initially you were certain that the coin was fair, and so initially your credence in the coin’s landing Heads was 1/2. Upon being awakened, you receive no new information (you knew all along that you would be awakened). So your credence in the coin’s landing Heads ought to remain 1/2.
Second answer: 1/3, of course! Imagine the experiment repeated many times. Then in the long run, about 1/3 of the wakings would be Heads-wakings — wakings that happen on trials in which the coin lands Heads. So on any particular waking, you should have credence 1/3 that that waking is a Heads-waking, and hence have credence 1/3 in the coin’s landing Heads on that trial. This consideration remains in force in the present circumstance, in which the experiment is performed just once.In his Section 2, Elga then proves that the correct answer is 1/3. His proof is correct (although there are a few spots where it would be helpful to fill in some intermediate steps). So Lewis is wrong to reject this proof.
But Elga's Section 3 introduces some confusion around the meaning of "information". Elga says:
Let H be the proposition that the outcome of the coin toss is Heads. Before being putAnd then in a footnote:
to sleep, your credence in H was 1/2. I’ve just argued that when you are awakened
on Monday, that credence ought to change to 1/3. This belief change is unusual. It is
not the result of your receiving new information — you were already certain that you
would be awakened on Monday.
To say that an agent receives new information (as I shall use that expression) is to say that the agent receives evidence that rules out possible worlds not already ruled out by her previous evidence.This is where Elga and I disagree. I would say that an agent receives information if they receive evidence that is not equally likely in all possible worlds. In that case, the evidence should cause the agent to change their credences (subjective beliefs) about at least some possible worlds.
In particular (as I explained in my previous article), when Sleeping Beauty is awakened, she observes an event, awakening, that is twice as likely under T (the proposition that the coin lands Tails) as under H, and she should change her credences accordingly.
So in my solution, her belief change is not unusual; it is an application of Bayes's Theorem that is only remarkable because it is not immediately obvious what the evidence is and what its likelihood is under the two hypotheses. In that sense, it is similar to the Elvis Problem.
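To make the update concrete, here is a minimal sketch (my own, not from Elga's paper) of the Bayes table I have in mind, using the fact that there is one awakening per trial under Heads and two under Tails:

```python
# Bayes update for Sleeping Beauty.
# Assumption: 1 awakening per trial under Heads, 2 under Tails,
# so an observed awakening is twice as likely under Tails.
prior = {'Heads': 0.5, 'Tails': 0.5}
likelihood = {'Heads': 1, 'Tails': 2}

unnorm = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnorm.values())
posterior = {h: p / total for h, p in unnorm.items()}
print(posterior['Heads'])  # 0.3333..., that is, 1/3
```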
In the rest of Section 3, Elga tries to reconcile the seemingly contradictory conclusions that Beauty receives no new information and Beauty should change her credences. I think this argument addresses a non-problem, because Beauty does receive information that justifies her change in credences. So I agree with Lewis that Elga is wrong to conclude that the Sleeping Beauty problem raises "a new question about how a rational agent ought to update her beliefs over time".
In summary:
1) Lewis is wrong about the answer to the problem and wrong to reject Elga's proof,
2) Also, his claim that Beauty does not receive information is wrong.
3) However, he is right to reject the argument in Elga's Section 3.
The Red Dice
At this point, we have three arguments to support the "thirder" position:
1) The argument based on long-run frequencies (I quoted Elga's version above).
2) The argument based on the principle of indifference (Elga's section 2).
3) The argument based on Bayes's theorem (in my previous article).
But if you still find it hard to believe that Beauty gets information when she wakes up, the Red Dice problem might help. I wrote about several versions of it in this previous article:
Suppose I have a six-sided die that is mostly red -- that is, red on 4 sides and blue on 2 -- and another that is mostly blue -- that is, blue on 4 sides and red on 2.
I choose a die at random (with equal probability) and roll it. If it comes up red, I tell you "it came up red". Otherwise, I put the die back, choose again, and roll again. I repeat until the outcome is red.
If I follow this procedure and eventually report that the die came up red, what is the probability that the last die I rolled is mostly red?
A halfer might claim (incorrectly) that you have received no relevant information about the die because the outcome was inevitable, eventually. The evidence you receive when I tell you the outcome is red is identical regardless of which die it was, so it should not change your credences.
A thirder would respond (correctly) that the outcome you observed is twice as likely if the die is mostly red, and therefore it provides evidence in favor of the hypothesis that it is mostly red. Specifically, the posterior probability is 2/3.
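If you'd rather check this numerically before reading the careful explanation, here is a quick simulation sketch (the function name and trial count are my choices) of the procedure as stated:

```python
import random

def red_dice_posterior(trials=100_000, seed=1):
    """Estimate P(mostly red | reported red) by simulating the procedure:
    choose a die at random, roll it; if red, report and stop; else start over."""
    rng = random.Random(seed)
    mostly_red_wins = 0
    for _ in range(trials):
        while True:
            mostly_red = rng.random() < 0.5      # pick a die with equal probability
            p_red = 4/6 if mostly_red else 2/6   # red faces on each die
            if rng.random() < p_red:             # this roll came up red
                mostly_red_wins += mostly_red
                break                            # report "red" and stop
    return mostly_red_wins / trials

print(red_dice_posterior())  # close to 2/3
```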
If you don't believe this answer, you can see a more careful explanation and a demonstration by simulation in this Jupyter notebook (see Scenario C).
The Red Dice problem suggests that we should be skeptical of an argument with the form "The observation was inevitable under all hypotheses, and therefore we received no information." If an event happens once under H and twice under T, it is inevitable under both; nevertheless, a random observation of the event is twice as likely under T, and therefore provides evidence in favor of T.

Published on September 02, 2016 10:44
June 16, 2016
What is a distribution?
This article uses object-oriented programming to explore one of the most useful concepts in statistics: distributions. The code is in a Jupyter notebook.
You can read a static version of the notebook on nbviewer.
OR
You can run the code in a browser by clicking this link and then selecting distribution.ipynb from the list.
The following is a summary of the material in the notebook, which you might want to read before you dive into the code.
Random processes and variables
One of the recurring themes of my books is the use of object-oriented programming to explore mathematical ideas. Many mathematical entities are hard to define because they are so abstract. Representing them in Python puts the focus on what operations each entity supports — that is, what the objects can do — rather than on what they are.
In this article, I explore the idea of a probability distribution, which is one of the most important ideas in statistics, but also one of the hardest to explain. To keep things concrete, I'll start with one of the usual examples: rolling dice.
When you roll a standard six-sided die, there are six possible outcomes — numbers 1 through 6 — and all outcomes are equally likely.
If you roll two dice and add up the total, there are 11 possible outcomes — numbers 2 through 12 — but they are not equally likely. The least likely outcomes, 2 and 12, each happen only once in 36 tries; the most likely outcome, 7, happens 1 time in 6.
And if you roll three dice and add them up, you get a different set of possible outcomes with a different set of probabilities.
What I've just described are three random number generators, which are also called random processes. The output from a random process is a random variable, or more generally a set of random variables. And each random variable has a probability distribution, which is the set of possible outcomes and the corresponding set of probabilities.
Representing distributions
There are many ways to represent a probability distribution. The most obvious is a probability mass function, or PMF, which is a function that maps from each possible outcome to its probability. And in Python, the most obvious way to represent a PMF is a dictionary that maps from outcomes to probabilities.
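As a small illustration (mine, not from the notebook), here is the PMF of the sum of two dice as a plain dictionary, along with the probability-weighted mean computed from it:

```python
from itertools import product
from collections import Counter

# PMF of the sum of two six-sided dice, as a dict from outcome to probability
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: n / 36 for total, n in sorted(counts.items())}

print(pmf[2], pmf[7])                      # 1/36 and 6/36
mean = sum(x * p for x, p in pmf.items())  # probability-weighted mean
print(round(mean, 6))                      # 7.0
```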
So is a Pmf a distribution? No. At least in my framework, a Pmf is one of several representations of a distribution. Other representations include the cumulative distribution function (CDF) and the characteristic function (CF).
These representations are equivalent in the sense that they all contain the same information; if I give you any one of them, you can figure out the others.
So why would we want different representations of the same information? The fundamental reason is that there are many operations we would like to perform with distributions; that is, questions we would like to answer. Some representations are better for some operations, but none of them is the best for all operations.
Here are some of the questions we would like a distribution to answer:
What is the probability of a given outcome?
What is the mean of the outcomes, taking into account their probabilities?
What is the variance of the outcome? Other moments?
What is the probability that the outcome exceeds (or falls below) a threshold?
What is the median of the outcomes, that is, the 50th percentile?
What are the other percentiles?
How can we generate a random sample from this distribution, with the appropriate probabilities?
If we run two random processes and choose the maximum of the outcomes (or minimum), what is the distribution of the result?
If we run two random processes and add up the results, what is the distribution of the sum?
Each of these questions corresponds to a method we would like a distribution to provide. But there is no one representation that answers all of them easily and efficiently.
As I demonstrate in the notebook, the PMF representation makes it easy to look up an outcome and get its probability, and it can compute mean, variance, and other moments efficiently.
The CDF representation can look up an outcome and find its cumulative probability efficiently. And it can do a reverse lookup equally efficiently; that is, given a probability, it can find the corresponding value, which is useful for computing medians and other percentiles.
The CDF also provides an easy way to generate random samples, and a remarkably simple way to compute the distribution of the maximum, or minimum, of a sample.
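Here is a sketch of that last point (the helper name is mine): a dictionary-based CDF for one die, used to compute the distribution of the maximum of two independent rolls, since both rolls must be at most x for the maximum to be at most x.

```python
def pmf_to_cdf(pmf):
    """Cumulative probabilities from a PMF dictionary, in outcome order."""
    cdf, total = {}, 0.0
    for x in sorted(pmf):
        total += pmf[x]
        cdf[x] = total
    return cdf

die = {k: 1/6 for k in range(1, 7)}
cdf = pmf_to_cdf(die)

# For independent draws, CDF_max(x) = CDF(x)**2
max_cdf = {x: cdf[x] ** 2 for x in cdf}
print(round(max_cdf[3], 4))  # P(max of two rolls <= 3) = (3/6)**2 = 0.25
```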
To answer the last question, the distribution of a sum, we can use the PMF representation, which is simple, but not efficient. An alternative is to use the characteristic function (CF), which is the Fourier transform of the PMF. That might sound crazy, but using the CF and the Convolution Theorem, we can compute the distribution of a sum in linearithmic time, or O(n log n).
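Here's a sketch of that idea (the array indexing and FFT size are my choices, not the notebook's), computing the distribution of the sum of two dice via the Convolution Theorem:

```python
import numpy as np

# PMF of one die as an array indexed by outcome (index 0 unused)
die = np.zeros(7)
die[1:] = 1/6

n = 16                               # FFT length >= 13, the length of the result
cf = np.fft.fft(die, n)              # the DFT of the PMF plays the role of the CF
sum_pmf = np.fft.ifft(cf * cf).real  # multiply CFs, invert: convolution of PMFs

print(round(sum_pmf[7], 4))  # P(sum == 7) = 6/36, about 0.1667
```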
If you are not familiar with the Convolution Theorem, you might want to read Chapter 8 of Think DSP.
So what's a distribution?
The Pmf, Cdf, and CharFunc are different ways to represent the same information. For the questions we want to answer, some representations are better than others. So how should we represent the distribution itself?
In my implementation, each representation is a mixin; that is, a class that provides a set of capabilities. A distribution inherits all of the capabilities from all of the representations. Here's a class definition that shows what I mean:
class Dist(Pmf, Cdf, CharFunc):
    def __init__(self, d):
        """Initializes the Dist.

        Calls all three __init__ methods.
        """
        Pmf.__init__(self, d)
        Cdf.__init__(self, *compute_cumprobs(d))
        CharFunc.__init__(self, compute_fft(d))
When you create a Dist, you provide a dictionary of values and probabilities.
Dist.__init__ calls the three parent __init__ methods to create the Pmf, Cdf, and CharFunc representations. The result is an object that has all the attributes and methods of the three representations.
From a software engineering point of view, that might not be the best design, but it is meant to illustrate what it means to be a distribution.
In short, if you give me any representation of a distribution, you have told me everything I need to answer questions about the possible outcomes and their probabilities. Converting from one representation to another is mostly a matter of convenience and computational efficiency.
Conversely, if you are trying to find the distribution of a random variable, you can do it by computing whichever representation is easiest to figure out.
So that's the idea. If you want more details, take a look at the notebook by following one of the links at the top of the page.

Published on June 16, 2016 08:04
June 14, 2016
Bayesian Statistics for Undergrads
Yesterday Sanjoy Mahajan and I led a workshop on teaching Bayesian statistics for undergraduates. The participants were college teachers from around New England, including Norwich University in Vermont and Wesleyan University in Connecticut, as well as our neighbors, Babson College and Wellesley College.
The feedback we got was enthusiastic, and we hope the workshop will help the participants design new classes that make Bayesian methods accessible to their students.
Materials from the workshop are in this GitHub repository. And here are the slides:
The goal of the workshop is to show that teaching Bayesian statistics to undergrads is possible and desirable. To show that it's possible, we presented three approaches:
A computational approach, based on my class at Olin, Computational Bayesian Statistics, and the accompanying book, Think Bayes. This material is appropriate for students with basic programming skills, although a lot of it could be adapted for use with spreadsheets.
An analytic approach, based on Sanjoy's class, called Bayesian Inference. This material is appropriate for students who are comfortable with mathematics including calculus.
We also presented core material that does not depend on programming or advanced math -- really just arithmetic.
Why Bayes?
Reasons the participants gave for teaching Bayes included:
Some of them work and teach in areas like psychology and biology where the limitations of classical methods have become painfully apparent, and interest in alternatives is high.
Others are interested in applications like business intelligence and data analytics where Bayesian methods are a hot topic.
Some participants teach introductory classes that satisfy requirements in quantitative reasoning, and they are looking for material to develop students' ability to reason with and about uncertainty.
I think these are all good reasons. At the introductory level, Bayesian methods are a great opportunity for students who might not be comfortable with math to gradually build confidence with mathematical methods as tools for better thinking.
Bayes's theorem provides a divide-and-conquer strategy for solving difficult problems by breaking them into smaller, simpler pieces. And many of the classic applications of Bayes's theorem -- like interpreting medical tests and weighing courtroom evidence -- are real-world problems where careful thinking matters and mistakes have consequences!
For students who only take a few classes in mathematics, I think Bayesian statistics is a better choice than calculus, which the vast majority of students will never use again; and better than classical statistics, which (based on my observation) often leaves students more confused about quantitative reasoning than when they started.
At the more advanced level, Bayesian methods are appealing because they can be applied in a straightforward way to real-world decision making processes, unlike classical methods, which generally fail to answer the questions we actually want to answer.
For example, if we are considering several hypotheses about the world, it is useful to know the probability that each is true. You can use that information to guide decision making under uncertainty. But classical statistical inference refuses to answer that question, and under the frequentist interpretation of probability, you are not even allowed to ask it.
As another example, the result you get from Bayesian statistics is generally a posterior distribution for a parameter, or a joint distribution for several parameters. From these results, it is straightforward to compute a distribution that predicts almost any quantity of interest, and this distribution encodes not only the most likely outcome or central tendency; it also represents the uncertainty of the prediction and the spread of the possible outcomes.
Given a predictive distribution, you can answer whatever questions are relevant to the domain, like the probability of exceeding some bound, or the range of values most likely to contain the true value (another question classical inference refuses to answer). And it is straightforward to feed the entire distribution into other analyses, like risk-benefit analysis and other kinds of optimization, that directly guide decision making.
I mention these advantages in part to address one of the questions that came up in the workshop. Several of the participants are currently teaching traditional introductory statistics classes, and they would like to introduce Bayesian methods, but are also required to cover certain topics in classical statistics, notably null-hypothesis significance testing (NHST).
So they want to know how to design a class that covers these topics and also introduces Bayesian statistics. This is an important challenge, and I was frustrated that I didn't have a better answer to offer at the workshop. But with some time to organize my thoughts, I have two suggestions:
Avoid direct competition
I don't recommend teaching a class that explicitly compares classical and Bayesian statistics. Pedagogically, it is likely to be confusing. Strategically, it is asking for intra-departmental warfare. And importantly, I think it misrepresents Bayesian methods, and undersells them, if you present them as a tool-for-tool replacement for classical methods.
The real problem with classical inference is not that it gets the wrong answer; the problem is that it asks the wrong questions. For example, a fundamental problem with NHST is that it requires a binary decision: either we reject the null hypothesis or we fail to reject it (whatever that means). An advantage of the Bayesian approach is that it helps us represent and work with uncertainty; expressing results in terms of probability is more realistic, and more useful, than trying to cram the world into one of two holes.
If you use Bayesian methods to compute the probability of a hypothesis, and then apply a threshold to decide whether the theory is true, you are missing the point. Similarly, if you compute a posterior distribution, and then collapse it to a single point estimate (or even an interval), you are throwing away exactly the information that makes Bayesian results more useful.
Bayesian methods don't do the same things better; they do different things, which are better. If you want to demonstrate the advantages of Bayesian methods, do it by solving practical problems and answering the questions that matter.
As an example, this morning my colleague Jon Adler sent me a link to this paper, Bayesian Benefits for the Pragmatic Researcher, which is a model of what I am talking about.
Identify the goals
As always, it is important to be explicit about the learning goals of the class you are designing. Curriculum problems that seem impossible can sometimes be simplified by unpacking assumptions about what needs to be taught and why. For example, if we think about why NHST is a required topic, we get some insight into how to present it: if you want to make sure students can read papers that report p-values, you might take one approach; if you imagine they will need to use classical methods, that might require a different approach.
For classical statistical inference, I recommend "The New Statistics", an approach advocated by Geoff Cumming (I am not sure to what degree it is original to him). The fundamental idea is that statistical analysis should focus on estimating effect sizes, and should express results in terms that emphasize practical consequences, as contrasted with statistical significance.
If "The New Statistics" is what we should teach, computational simulation is how. Many of the ideas that take the most time, and seem the hardest, in a traditional stats class, can be taught much more effectively using simulation. I wrote more about this just last week, in this post, There is Still Only One Test, and there are links there to additional resources.
But if the goal is to teach classical statistical inference better, I would leave Bayes out of it. Even if it's tempting to use a Bayesian framework to explain the problems with classical inference, it would be more likely to confuse students than help them.
If you only have space in the curriculum to teach one paradigm, and you are not required to teach classical methods, I recommend a purely Bayesian course. But if you have to teach classical methods in the same course, I suggest keeping them separated.
I experienced a version of this at PyCon this year, where I taught two tutorials back to back: Bayesian statistics in the morning and computational statistical inference in the afternoon. I joked that I spent the morning explaining why the afternoon was wrong. But the reality is that the two topics hardly overlap at all. In the morning I used Bayesian methods to formulate real-world problems and answer practical questions. In the afternoon, I helped people understand classical inference, including its limitations, and taught them how to do it well, if they have to.
I think a similar balance (or compromise?) could work in the undergraduate statistics curriculum at many colleges and universities.
The feedback we got was enthusiastic, and we hope the workshop will help the participants design new classes that make Bayesian methods accessible to their students.
Materials from the workshop are in this GitHub repository. And here are the slides:
The goal of the workshop is to show that teaching Bayesian statistics to undergrads is possible and desirable. To show that it's possible, we presented three approaches:
A computational approach, based on my class at Olin, Computational Bayesian Statistics, and the accompanying book, Think Bayes. This material is appropriate for students with basic programming skills, although a lot of it could adapted for use with spreadsheets.An analytic approach, based on Sanjoy's class, called Bayesian Inference. This material is appropriate for students who are comfortable with mathematics including calculus.We also presented core material that does not depend on programming or advanced math --really just arithmetic.Why Bayes?Reasons the participants gave for teaching Bayes included:Some of them work and teach in areas like psychology and biology where the limitations of classical methods have become painfully apparent, and interest in alternatives is high.Others are interested in applications like business intelligence and data analytics where Bayesian methods are a hot topic.Some participants teach introductory classes that satisfy requirements in quantitative reasoning, and they are looking for material to develop students' ability to reason with and about uncertainty.I think these are all good reasons. At the introductory level, Bayesian methods are a great opportunity for students who might not be comfortable with math to gradually build confidence with mathematical methods as tools for better thinking.
Bayes's theorem provides a divide-and-conquer strategy for solving difficult problems by breaking them into smaller, simpler pieces. And many of the classic applications of Bayes's theorem -- like interpreting medical tests and weighing courtroom evidence -- are real-world problems where careful thinking matters and mistakes have consequences!
For students who only take a few classes in mathematics, I think Bayesian statistics is a better choice than calculus, which the vast majority of students will never use again; and better than classical statistics, which (based on my observation) often leaves students more confused about quantitative reasoning than when they started.
At the more advanced level, Bayesian methods are appealing because they can be applied in a straightforward way to real-world decision making processes, unlike classical methods, which generally fail to answer the questions we actually want to answer.
For example, if we are considering several hypotheses about the world, it is useful to know the probability that each is true. You can use that information to guide decision making under uncertainty. But classical statistical inference refuses to answer that question, and under the frequentist interpretation of probability, you are not even allowed to ask it.
As another example, the result you get from Bayesian statistics is generally a posterior distribution for a parameter, or a joint distribution for several parameters. From these results, it is straightforward to compute a distribution that predicts almost any quantity of interest, and this distribution encodes not only the most likely outcome or central tendency; it also represents the uncertainty of the prediction and the spread of the possible outcomes.
Given a predictive distribution, you can answer whatever questions are relevant to the domain, like the probability of exceeding some bound, or the range of values most likely to contain the true value (another question classical inference refuses to answer). And it is straightforward to feed the entire distribution into other analyses, like risk-benefit analysis and other kinds of optimization, that directly guide decision making.
I mention these advantages in part to address one of the questions that came up in the workshop. Several of the participants are currently teaching traditional introductory statistics classes, and they would like to introduce Bayesian methods, but are also required to cover certain topics in classical statistics, notably null-hypothesis significance testing (NHST).
So they want to know how to design a class that covers these topics and also introduces Bayesian statistics. This is an important challenge, and I was frustrated that I didn't have a better answer to offer at the workshop. But with some time to organize my thoughts, I have a two suggestions:Avoid direct competitionI don't recommend teaching a class that explicitly compares classical and Bayesian statistics. Pedagogically, it is likely to be confusing. Strategically, it is asking for intra-departmental warfare. And importantly, I think it misrepresents Bayesian methods, and undersells them, if you present them as a tool-for-tool replacement for classical methods.
The real problem with classical inference is not that it gets the wrong answer; the problem is that is asks the wrong questions. For example, a fundamental problem with NHST is that it requires a binary decision: either we reject the null hypothesis or we fail to reject it (whatever that means). An advantage of the Bayesian approach is that it helps us represent and work with uncertainty; expressing results in terms of probability is more realistic, and more useful, than trying to cram the world into one of two holes.
If you use Bayesian methods to compute the probability of a hypothesis, and then apply a threshold to decide whether the theory is true, you are missing the point. Similarly, if you compute a posterior distribution, and then collapse it to a single point estimate (or even an interval), you are throwing away exactly the information that makes Bayesian results more useful.
Bayesian methods don't do the same things better; they do different things, which are better. If you want to demonstrate the advantages of Bayesian methods, do it by solving practical problems and answering the questions that matter.
As an example, this morning my colleague Jon Adler sent me a link to this paper, Bayesian Benefits for the Pragmatic Researcher , which is a model of what I am talking about.
Identify the goals

As always, it is important to be explicit about the learning goals of the class you are designing. Curriculum problems that seem impossible can sometimes be simplified by unpacking assumptions about what needs to be taught and why. For example, if we think about why NHST is a required topic, we get some insight into how to present it: if you want to make sure students can read papers that report p-values, you might take one approach; if you imagine they will need to use classical methods, that might require a different approach.
For classical statistical inference, I recommend "The New Statistics", an approach advocated by Geoff Cumming (I am not sure to what degree it is original to him). The fundamental idea is that statistical analysis should focus on estimating effect sizes, and should express results in terms that emphasize practical consequences, as contrasted with statistical significance.
If "The New Statistics" is what we should teach, computational simulation is how. Many of the ideas that take the most time, and seem the hardest, in a traditional stats class, can be taught much more effectively using simulation. I wrote more about this just last week, in this post, There is Still Only One Test, and there are links there to additional resources.
But if the goal is to teach classical statistical inference better, I would leave Bayes out of it. Even if it's tempting to use a Bayesian framework to explain the problems with classical inference, it would be more likely to confuse students than help them.
If you only have space in the curriculum to teach one paradigm, and you are not required to teach classical methods, I recommend a purely Bayesian course. But if you have to teach classical methods in the same course, I suggest keeping them separated.
I experienced a version of this at PyCon this year, where I taught two tutorials back to back: Bayesian statistics in the morning and computational statistical inference in the afternoon. I joked that I spent the morning explaining why the afternoon was wrong. But the reality is that the two topics hardly overlap at all. In the morning I used Bayesian methods to formulate real-world problems and answer practical questions. In the afternoon, I helped people understand classical inference, including its limitations, and taught them how to do it well, if they have to.
I think a similar balance (or compromise?) could work in the undergraduate statistics curriculum at many colleges and universities.

Published on June 14, 2016 09:27
June 7, 2016
There is still only one test
In 2011 I wrote an article called "There is Only One Test", where I explained that all hypothesis tests are based on the same framework.
Here are the elements of this framework:
1) Given a dataset, you compute a test statistic that measures the size of the apparent effect. For example, if you are describing a difference between two groups, the test statistic might be the absolute difference in means. I'll call the test statistic from the observed data
Published on June 07, 2016 07:34