Allen B. Downey's Blog: Probably Overthinking It, page 22
February 24, 2015
Upcoming talk on survival analysis in Python
On March 2, 2015 I am presenting a short talk for the Python Data Science meetup. Here is the announcement for the meetup.
And here are my slides:
The code for the talk is in an IPython notebook you can view on nbviewer. It is still a work in progress!
And here's the punchline graph:

Each line corresponds to a different cohort: 50s means women born during the 1950s, and so on. The curves show the probability of being unmarried as a function of age; for example, about 50% of women born in the 50s had been married by age 23. For women born in the 80s, only about 25% were married by age 23.
The gray lines show predictions based on applying patterns from previous cohorts.
A few patterns emerge from this figure:
1) Successive generations of women are getting married later and later. No surprise there.
2) For women born in the 50s, the curve leveled off; if they didn't marry early, they were unlikely to marry at all. For later generations, the curve keeps dropping, which indicates some women getting married for the first time at later ages.
3) The predictions suggest that the fraction of women who eventually marry is not changing substantially. My predictions for Millennials suggest that they will end up marrying, eventually, at rates similar to previous cohorts.
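If you want to experiment before the talk, here is a minimal sketch of how curves like these can be estimated. It is not the notebook's actual code; the data, ages, and function name are made up for illustration. It treats women who have not yet married as censored observations, in the style of a Kaplan-Meier estimate.

import numpy as np

# Hypothetical data: age at first marriage, or np.nan if still
# unmarried when surveyed (right-censored).
age_married = np.array([22, 25, np.nan, 30, 23, np.nan, 27])
age_surveyed = np.array([40, 40, 35, 40, 40, 28, 40])

def survival_curve(age_married, age_surveyed, ages=range(18, 41)):
    """Kaplan-Meier-style estimate of P(still unmarried) at each age."""
    s = 1.0
    curve = []
    for age in ages:
        # women still unmarried and still under observation at this age
        at_risk = np.sum((np.isnan(age_married) | (age_married >= age))
                         & (age_surveyed >= age))
        events = np.sum(age_married == age)   # first marriages at this age
        if at_risk > 0:
            s *= 1 - events / at_risk
        curve.append((age, s))
    return curve

for age, s in survival_curve(age_married, age_surveyed):
    print(age, round(s, 3))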
Published on February 24, 2015 13:59
February 10, 2015
Bayesian analysis of match rates on Tinder
Last fall I taught an introduction to Bayesian statistics at Olin College. My students worked on some excellent projects, and I invited them to write up their results as guest articles for this blog. Just in time for Valentine's Day, here's the second in the series:
It’s (Not) a Match!
Ankur Das and Mason del Rosario
All code for this post can be found on Github: https://github.com/mdelrosa/tinderStudy
Valentine’s day is fast approaching, and if you are anything like us engineering students, you will most likely be dateless on the 14th of February. Luckily, there are lots of great apps geared towards bringing singles together. One of the most popular is Tinder. For those unfamiliar with the app, Tinder presents the user with other users’ profiles, and the user then either swipes right for yes or left for no. When two people have swiped right on one another, they become a “match” and can chat with each other via text through the app.

Images of our Tinder profiles. Aren’t we a couple of handsome devils?
Some time ago, two of our friends decided to conduct an experiment on Tinder. The friends, one male and one female, each swiped right on the next 100 profiles and recorded how many matches they got. Though our female friend received over 90 matches, the male garnered only one, giving a 90% match rate for the female and an abysmal 1% match rate for the male! The discrepancy we saw is not an isolated incident. This article from Godofstyle.com reports an experiment where male profiles at best garnered 7% match rates while a female profile received a 20% match rate.
Why such poor odds for men on Tinder? Women may be more discriminating with their swipes, but another possibility is that women are simply less active on Tinder. Pursuing the less depressing option, we set about trying to answer the following question: how does the activity rate of Tinder’s population affect a given user’s match rate?
p, q, and r
Luckily for all us dateless men and women out there, the number of responses we get is not an accurate measure of our attractiveness on Tinder. Adapting Allen Downey’s work on “The Volunteer Problem,” we can write a user’s total response rate p as the product of r, the true attractiveness rating, and q, the activity rating. For every 100 users we swipe, only some fraction q of them will even see our profiles, let alone swipe back. With that in mind, we have something to attribute our Tinder failures to: the inactivity of users who may not have logged in recently.
Still, it would be nice to know our actual response rate, r. Finding p is easy --- simply measure the number of return swipes --- but we need to determine q to solve for r. With our set of Bayesian tools, Python scripts, and swiping fingers, we set out to estimate the activity rate, q, of other Tinder users.
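To make the relationship concrete, here is the arithmetic in a few lines of Python. The numbers are illustrative, not measured: 10 matches per 100 swipes echoes the example in the conclusion, and the value of q is a placeholder.

p = 10 / 100   # observed match rate: 10 matches per 100 right-swipes
q = 0.40       # hypothetical fraction of swiped users active in time to respond
r = p / q      # implied true match rate among users who actually saw us
print(r)       # 0.25 -- better than the observed 0.10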
How often do we Tinder?
For some of us, Tinder is a way of life. For reasons we won’t get into here, we check Tinder hourly, keeping up to date with the latest swipes. But many users check the app more infrequently, going days or weeks without looking at new potential matches. Unfortunately, Tinder’s best selling points also make life difficult for the amateur Bayesian statistician. The app provides minimal information on profiles and has no public history of past swipes and encounters. When seeing a new profile, the user’s last log-in time provides the only hint of her future activity, leaving us guessing whether she chose to ignore our profile or simply never saw it.
Luckily, we can adapt Prof. Downey’s solution to “The Red Line Problem” to turn this sparse data into a predictive model. We start with an assumption that the average time between Tinder activity follows a normal distribution centered at 10 hours. Even then, someone who uses Tinder every 10 hours on average will occasionally check twice an hour or just once a week. We model this variance with an exponential distribution for each average activity period to show different activity rates for the same user.
(Note: all times and arrival rates are given in hours unless otherwise noted)
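Here is a sketch of that prior as we understand it; the grid, the standard deviation of the normal, and the variable names are our assumptions, not taken from the repository linked above.

import numpy as np
from scipy.stats import norm, expon

gaps = np.linspace(1, 30, 291)           # hypothetical average gaps (hours)
prior = norm.pdf(gaps, loc=10, scale=3)  # normal prior centered at 10 hours
prior /= prior.sum()

# Given an average gap, the actual gaps between checks are exponential;
# for example, the gap distribution for a 10-hour user:
t = np.linspace(0, 50, 501)
pdf_10h = expon.pdf(t, scale=10)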
While some other user may go 10 hours without checking Tinder, our last arrival data only gives us a single point in that interval, skewing our results. First, we are 5 times more likely to observe someone during a 5 hour interval than a 1 hour interval, simply due to its length. We account for this bias towards larger intervals by multiplying each probability by time.
Exponential PDF scaled by time for a user with an average of 10 hours between activity
Even with these likelihoods, our data comes in the form of time since last activity. A log-in 15 minutes ago could be followed by another 10 minutes or 10 days later, so we need to adapt our distribution to this type of input. Within each gap between activity, we have an equal chance of making our observation at any point. Combining the chances of each activity period happening with the chances of finding a time in that period produces a mixture of observed last log-in times, specific to each hypothetical user with a different average activity rate.
Exponential CDF for observed last activity time for a user with an average of 10 hours between activity
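Putting the last two paragraphs together, here is a sketch of that mixture for a single hypothetical user; this is our reconstruction of the approach, not the code from the repository.

import numpy as np
from scipy.stats import expon

mean_gap = 10
gaps = np.linspace(0.5, 60, 120)       # possible gap lengths (hours)
p_gap = expon.pdf(gaps, scale=mean_gap)
p_landed = p_gap * gaps                # length bias: weight by duration
p_landed /= p_landed.sum()

# Our observation falls uniformly inside whichever gap we landed in,
# which yields a mixture over observed "time since last activity":
times = np.linspace(0, 60, 121)
mix = np.zeros_like(times)
for gap, w in zip(gaps, p_landed):
    inside = times <= gap
    mix[inside] += w / inside.sum()
print(times[np.argmax(np.cumsum(mix) >= 0.5)])  # median observed time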
Now what?
That was a lot of work to get to a distribution within a distribution within a distribution, but now we finally have a chance to use our data! Looking back at our prior distribution of average activity rates, we now have a corresponding distribution of observed past log-in times for each. We sampled 100 users to record their activity times, giving us accurate information for the users relevant to us. Our findings vary between genders, areas, and age ranges, so new time data must be collected by each aspiring Tinder statistician. Updating each of our hypotheses with the chance of observing the times we saw yields the graph below: a steep distribution centered at 10 hours.
Posterior distribution of average time between activity
The posterior distribution peaks at the same central value as the prior, but with a smaller standard deviation, due to the large amount of data. The small standard deviation likely says more about Tinder’s sorting algorithm than about the true userbase, however. Given the number of people who have likely stopped using the app, we should see a much longer tail of users checking Tinder every 24 hours or more. This data suggests that Tinder only shows us users who tend to be fairly active. That behavior makes sense: what dating app would match you with inactive users?
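Here is a sketch of the update step described above. It is our reconstruction, with simulated observations standing in for the 100 sampled users, and it uses a convenient shortcut: for a memoryless arrival process, the time since the last check is itself exponential with the user's average gap, so that serves as the likelihood.

import numpy as np
from scipy.stats import norm, expon

gaps = np.linspace(1, 30, 291)           # hypotheses: average gap (hours)
prior = norm.pdf(gaps, loc=10, scale=3)
prior /= prior.sum()

# Stand-in for the 100 observed "time since last activity" values:
observed = np.random.default_rng(0).exponential(10, size=100)

posterior = prior.copy()
for t in observed:
    posterior *= expon.pdf(t, scale=gaps)  # likelihood under each hypothesis
    posterior /= posterior.sum()
print(gaps[np.argmax(posterior)])          # peaks near 10 hours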
Predictive Modeling
After a bit of data collection and analysis, we now know the breakdown of our potential matches. Most of them check Tinder about every 10 hours on average, so they should respond promptly to our swipes. For the sake of modeling ease, let’s say that when someone checks Tinder, they instantly see our profile, regardless of the time they spend active or Tinder’s display algorithm. Not only does this make our analysis much easier, it also gives us a worst case estimate. If our results for Tinder success assume extra people see our profile, our true response rate can only be better.
Using the spread of potential Tinder matches, we can predict how many will use Tinder within some threshold time. If we check again in 5 hours, for instance, we will of course receive fewer views than 10 hours later. To find the distribution of q, we looked at each average activity rate, finding the chance of activity within some threshold time for a user with that rate. Scaling these probabilities by the likelihood of each average activity rate produces a distribution of q, the portion of users active within some time.
q distributions for active users within 5 and 10 hours
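Here is a sketch of that computation under our assumptions; the posterior below is a placeholder for the one computed above.

import numpy as np

gaps = np.linspace(1, 30, 291)               # hypotheses: average gap (hours)
posterior = np.exp(-0.5 * (gaps - 10) ** 2)  # placeholder posterior
posterior /= posterior.sum()

T = 5                                  # threshold time in hours
q_given_gap = 1 - np.exp(-T / gaps)    # P(checks within T | average gap)
q_mean = np.sum(posterior * q_given_gap)
print(q_mean)                          # expected fraction active within T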
True Response Rate
Now that we are able to generate a distribution for q, we are halfway toward generating a distribution for a given user’s true response rate, r. Next, we need to know what the distribution of the perceived response rate, p, looks like. Recalling that the product of q and r gives us p, we can reverse engineer our desired r distribution by taking the quotient of p over q! We’re working on that part of the problem now.
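Since that step is still in progress, here is one hedged way it might go, using Monte Carlo sampling; the distributions below are placeholders, not our results.

import numpy as np

rng = np.random.default_rng(1)
q_samples = rng.normal(0.4, 0.05, 10_000).clip(0.01, 1)  # placeholder q
p_samples = rng.beta(10 + 1, 90 + 1, 10_000)  # e.g. 10 matches in 100 swipes
r_samples = np.clip(p_samples / q_samples, 0, 1)   # true rate r = p / q
print(np.median(r_samples))   # noticeably higher than the observed p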
Conclusion
Clearly, we still have some work to do before we can say anything quantitative about a user’s true match rate, but we can say some positive things without going much further. One encouraging observation is that a user’s true match rate will always be better than the perceived match rate. A user may see that only 10 out of every 100 profiles become matches, but some of those 100 users have not even been on Tinder for quite some time. The upshot: you’re probably more attractive than Tinder makes you think you are!
That’s all for now. Happy swiping!
Published on February 10, 2015 04:47
February 5, 2015
Godless freshmen: now more Nones than Catholics
This article is an update to my annual series on one of the most under-reported stories of the decade: the fraction of college freshmen who report no religious preference has more than tripled since 1985, from 8% to 27%, and the trend is accelerating.
In last year's installment, I made the bold prediction that the trend would continue, and that the students starting college in 2014 would again be the most godless ever. It turns out I was right, for the fifth year in a row. The fraction of students reporting no religious preference increased to 27.5%, a substantial increase over last year's record-high 24.6%. Also, for the first time in the history of the survey, the number of "Nones" exceeds the number of Catholics (25.3%).
The number of people reporting that they never attended a religious service also reached an all-time high at 29.3%, up from 27.3% last year.
Of course, we should not over-interpret a single data point, but generally:
1) This year's data points are consistent with previous predictions, and
2) Data since 1990 support the conclusion that the number of incoming college students with no religious preference is increasing and accelerating.
This analysis is based on survey results from the Cooperative Institutional Research Program (CIRP) of the Higher Education Research Institute (HERI). In 2014, more than 153,000 students at 227 colleges and universities completed the CIRP Freshman Survey, which includes questions about students’ backgrounds, activities, and attitudes.
In one question, students select their “current religious preference,” from a choice of seventeen common religions, “Other religion,” or “None.”
Another question asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Students are instructed to select “Occasionally” if they attended one or more times.
The following figure shows the fraction of Nones over more than 40 years of the survey:
Fraction of college Freshmen with no religious preference. The blue line shows actual data through 2013; the blue square shows the new data point for 2014. The gray regions show the predictions I generated last year based on data through 2013. The new data point falls at the high end of the predicted interval.
The red line shows a quadratic fit to the data. The dark gray region shows a 90% confidence interval, which quantifies sampling error, so it reflects uncertainty about the parameters of the fit. The light gray region shows a 90% confidence interval taking into account both sampling error and residual error. So it reflects total uncertainty about the predicted value, including uncertainty due to random variation from year to year.
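Here is a minimal sketch of this kind of fit and predictive interval; it is not the actual code behind the figures, and the data below are stand-ins.

import numpy as np

rng = np.random.default_rng(2)
years = np.arange(1990, 2015)
nones = (8 + 0.02 * (years - 1990) ** 2      # stand-in for the survey data
         + rng.normal(0, 0.8, len(years)))

coeffs = np.polyfit(years, nones, deg=2)     # quadratic fit
resid = nones - np.polyval(coeffs, years)

preds = []
for _ in range(1000):
    fake = np.polyval(coeffs, years) + rng.choice(resid, len(resid))
    c = np.polyfit(years, fake, deg=2)                    # sampling error
    preds.append(np.polyval(c, 2015) + rng.choice(resid)) # plus residual error

print(np.percentile(preds, [5, 95]))         # 90% predictive interval for 2015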
Here is the corresponding plot for attendance at religious services:
Fraction of college Freshmen who report no attendance at religious services.
Again, the new data point for 2014, 29.3%, falls in the predicted range, although somewhat ahead of the long term trend.
Predictions for 2015
Using the new 2014 data, we can generate predictions for 2015. Here is the revised plot for "Nones":
Predictive interval for 2015. This year's measurement is ahead of the long-term trend, so next year's is likely to regress slightly, to 27.1% (down 0.4 percentage points).
And here is the prediction for "No attendance":
Predictive interval for 2015.
Again, because this year's value is ahead of the long-term trend, the center of the predictive distribution is slightly lower, at 28.6% (down 0.7 percentage points).
I'll be back next year to check on these predictions.
Comments
1) As always, more males than females report no religious preference, and the gender gap appears to be growing.
Difference between men and women, fraction reporting no religious preference. Evidence that the gender gap is increasing is strong. The p-value of the slope of the fitted curve is less than 1e-6.
2) I notice that the number of schools and the number of students participating in the Freshman Survey have been falling for several years. I wonder if the mix of schools represented in the survey is changing over time, and what effect this might have on the trends I am watching. The percentage of "Nones" differs across different kinds of institutions (colleges, universities, public, private, etc.). If participation rates are changing among these groups, that would affect the results.
3) Obviously college students are not representative of the general population. Data from other sources indicate that the same trends are happening in the general population, but I haven't been able to make a quantitative comparison between college students and others. Data from other sources also indicate that college graduates are slightly more likely to attend religious services, and to report a religious preference, than the general population.
Data Source
Eagan, K., Stolzenberg, E. B., Ramirez, J. J., Aragon, M. C., Suchard, M. R., & Hurtado, S. (2014). The American freshman: National norms fall 2014. Los Angeles: Higher Education Research Institute, UCLA.
This and all previous reports are available from the HERI publications page.
Published on February 05, 2015 11:29
January 26, 2015
Bayesian predictions for Super Bowl XLIX
Last fall I taught an introduction to Bayesian statistics at Olin College. My students worked on some excellent projects, and I invited them to write up their results as guest articles for this blog. Here is the first article of the series:
Predicting Super Bowl XLIX
Alex Crease and Matt Wismer
All code for this post can be found on Github: https://github.com/MattWis/PoissonFootball
With the Super Bowl coming up this Sunday, everyone is probably wondering where to place their bets. Fortunately, we've developed a model that uses Bayesian statistics to predict the results of football matches based upon prior season data of each team. We then used our model to predict a game between the Seahawks and the Patriots.
Our model is an expanded version of Allen Downey's solution to the “World Cup Problem”, where one can predict goals scored in a soccer match based on prior data. While the soccer model uses a Poisson process to predict when the next goal will be scored, our football model is a bit trickier because it not only accounts for defense, but also incorporates the two types of scoring, field goals and 7-point touchdowns, to predict final scores. Safeties and two-point conversions were eliminated to simplify our model. We also didn't include overtime considerations, allowing the two teams to tie at the end of regulation.
This model can be useful in a number of different betting schemes. Here’s what it predicts:
Who’s going to win?
Patriots, easy. Well, we may be biased because we both go to school in New England. Here’s the real story:
We consider scoring to be a single Poisson process and maintain a separate distribution that represents our belief about the probability that a given score is a touchdown rather than a field goal. We then calculate the probability of scoring a certain number of touchdowns and field goals from the binomial distribution. The single Poisson process is related to the ability of the offense to get into field goal range, and the percentage of touchdowns is related to the team’s red zone efficiency.
We also keep track of offense and defense separately. To calculate the distribution of points scored, we average the offense’s scoring rate with the rate at which the defense allows scoring, and average the offense’s touchdown percent with the percentage of touchdowns that the defense allows.
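Here is a sketch of the model by simulation; it is our reconstruction, not the project's code, and the rates and touchdown fractions below are illustrative rather than estimated from season data.

import numpy as np

rng = np.random.default_rng(4)

def simulate_points(rate, td_frac, n=100_000):
    """rate: expected scores per game; td_frac: share that are touchdowns."""
    scores = rng.poisson(rate, size=n)       # scoring events (Poisson)
    tds = rng.binomial(scores, td_frac)      # which ones are touchdowns
    return 7 * tds + 3 * (scores - tds)      # 7-point TDs, 3-point FGs

pats = simulate_points(rate=4.0, td_frac=0.60)
hawks = simulate_points(rate=3.8, td_frac=0.55)
print((pats > hawks).mean(), (hawks > pats).mean(), (pats == hawks).mean())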
The points distributions for each team were generated as described above, so we then calculated the probability that one would be greater than the other, or, in simpler terms, the probability that each team would win:
Outcome:       Patriots Win    Seahawks Win    Overtime
Probability:   52.0%           44.1%           3.9%
At this point, the teams seem well matched, with our model slightly favoring the Patriots. Looking more closely at the data, the 90% confidence interval gives the Patriots a scoring range of 6-45, and the Seahawks 6-44, bringing us to the investigation of the point spread.
Betting the Point Spread
This game is likely to be pretty close. We looked at the probability of different spreads by subtracting the Seahawks’ point distribution from the Patriots’ point distribution. The cumulative distribution of the point spread is shown below. As you can see, the 50% mark appears right around the Patriots winning by 1. So if you can find a spread significantly distant from that, you can consult this chart to figure out how likely you are to win the spread. However, this result does not treat overtime as a possibility; instead, it includes the chance that the teams tie at the end of regulation. Thus the point spread may be different if the game goes into overtime.
Bets on the Total
Some people like to bet on the total points scored by both teams, so we wanted to figure that out as well. We added our two points distributions and got an interesting result: the total is almost equally likely to be 37 (3.636%) or 44 (3.634%). You can consult the graph below for other likely total scores to see where to place your bets.
Another method of betting on the total is based on an over/under scale, like the point spread. We've got you covered in that case, too. We expect the total score to be centered around 45 points, as that is the mean of the distribution. According to http://www.oddsshark.com/nfl/odds, the over/under line is currently at 48 or 48.5, and with our results in mind, we advise betting under. However, if the game goes into overtime, all bets are off (pun intended) and the total score could go higher.
Office Pool Grid Betting
Because of the multiple ways of scoring in football, certain scores are more likely to occur than others. As you can see in the histograms below, scoring 17 points is the most likely scenario for both teams, because it is a combination of 2 touchdowns and 1 field goal, both likely quantities for a football game. However, it is impossible to score something like 11 points, because no combination of 7s and 3s adds up to that value. From the two graphs, you can see that the Seahawks are actually more likely to score field goals than the Patriots, because their probability of scoring 3 points is higher.
To satisfy your betting needs, we've created histograms below displaying the likelihood that teams will score values with specific last digits so you can (hopefully) blow your office mates away with your office pool bets. If you are familiar with the way points add up in football, our data isn't very surprising. Across both teams, the top two digits to place your bets on are 0 and 7, with 4 and 3 coming up behind.

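Both calculations are easy to reproduce from simulated point distributions like the ones sketched earlier (again with illustrative parameters, not the fitted values).

import numpy as np

rng = np.random.default_rng(4)

def simulate_points(rate, td_frac, n=100_000):
    scores = rng.poisson(rate, size=n)
    tds = rng.binomial(scores, td_frac)
    return 7 * tds + 3 * (scores - tds)

pats = simulate_points(4.0, 0.60)
hawks = simulate_points(3.8, 0.55)

total = pats + hawks
vals, counts = np.unique(total, return_counts=True)
print(vals[np.argmax(counts)])        # most likely combined total

digits, counts = np.unique(pats % 10, return_counts=True)
print(dict(zip(digits, counts / counts.sum())))  # last-digit probabilities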
Sadly, our results do not show a clear outcome to place your bets on. However, we hope that we've shed some light on how you can make an educated guess when deciding where to put your money. Please remember that you ultimately decide how to bet, and we are in no way liable for any money you may lose (or win) over Super Bowl weekend.
Published on January 26, 2015 06:15
December 17, 2014
Statistics tutorials at PyCon 2015
I am happy to announce that I will offer two statistics tutorials at PyCon 2015 on April 9 in Montreal. In the morning session I am teaching Bayesian Statistics Made Simple, which I have taught several times before, including the last three PyCons. In the afternoon I am offering a new tutorial, Statistical Inference with Computational Methods.
The whole tutorial schedule is here, along with registration information. And here are some details about the tutorials:
Bayesian statistics made simple
Audience level: Intermediate
Category: Science
Description: An introduction to Bayesian statistics using Python. Bayesian statistics are usually presented mathematically, but many of the ideas are easier to understand computationally. People who know Python can get started quickly and use Bayesian analysis to solve real problems. This tutorial is based on material and case studies from Think Bayes (O’Reilly Media).
Abstract: Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started. People who know Python can use their programming skills to get a head start. I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems. Participants will work hands-on with example code and practice on example problems. Attendees should have at least basic Python and basic statistics. If you learned about Bayes’s theorem and probability distributions at some time, that’s enough, even if you don’t remember it! Attendees should bring a laptop with Python and matplotlib. You can work in any environment; you just need to be able to download a Python program and run it. I will provide code to help attendees get set up ahead of time.
More information and registration here.
Statistical inference with computational methods
Audience level: Intermediate
Category: Science
Description: Statistical inference is a fundamental tool in science and engineering, but it is often poorly understood. This tutorial uses computational methods, including Monte Carlo simulation and resampling, to explore estimation, hypothesis testing and statistical modeling. Attendees will develop understanding of statistical concepts and learn to use real data to answer relevant questions.
Abstract: Do you know the difference between standard deviation and standard error? Do you know what statistical test to use for any occasion? Do you really know what a p-value is? How about a confidence interval? Most students don’t really understand these concepts, even after taking several statistics classes. The problem is that these classes focus on mathematical methods that bury the concepts under a mountain of details. This tutorial uses Python to implement simple statistical experiments that develop deep understanding. Attendees will learn about resampling and related tools that use random simulation to perform statistical inference, including estimation and hypothesis testing. We will use pandas, which provides structures for data analysis, along with NumPy and SciPy. I will present examples using real-world data to answer relevant questions. The tutorial material is based on my book, Think Stats, a class I teach at Olin College, and my blog, “Probably Overthinking It.”
More information and registration here.
Published on December 17, 2014 06:12
December 4, 2014
The Rock Hyrax Problem

This is the third of a series of articles about Bayesian analysis. The previous article is here.
Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:
Suppose I capture and tag 10 rock hyraxes. Some time later, I capture another 10 hyraxes and find that two of them are already tagged. How many hyraxes are there in this environment?
This is an example of a mark and recapture experiment, which you can read about on Wikipedia. The Wikipedia page also includes the photo of a tagged hyrax shown above.
As always with problems like this, we have to make some modeling assumptions.
1) For simplicity, you can assume that the environment is reasonably isolated, so the number of hyraxes does not change between observations.
2) And you can assume that each hyrax is equally likely to be captured during each phase of the experiment, regardless of whether it has been tagged. In reality, it is possible that tagged animals would avoid traps in the future, or possible that the same behavior that got them caught the first time makes them more likely to be caught again. But let's start simple.
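If you want a head start (after attempting the problem yourself!), here is a minimal sketch of the update under these assumptions; the uniform prior and its range are illustrative choices, not part of the problem statement.

import numpy as np
from scipy.stats import hypergeom

N = np.arange(20, 501)                 # hypotheses for the population size
prior = np.ones_like(N, dtype=float)   # uniform prior (an assumption)
# P(2 of the 10 recaptured are tagged | N total, 10 tagged):
likelihood = hypergeom.pmf(2, N, 10, 10)
posterior = prior * likelihood
posterior /= posterior.sum()
print(N[np.argmax(posterior)])         # MAP estimate: 50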
My solution to this problem uses the computational framework from my book, Think Bayes. The framework is described in this notebook. If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.
If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.
And when you are ready, you can see my solution in this notebook.
I will post more of the exercises from my class over the next few weeks.
UPDATE December 5, 2014: João Neto posted a solution to this problem in BUGS, using a Jeffreys prior.
Published on December 04, 2014 05:55
November 25, 2014
The World Cup Problem Part 2: Germany v. Argentina
This is the second of two articles about Bayesian analysis applied to World Cup soccer. The previous article is here.
Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:
In the final match of the 2014 FIFA World Cup, Germany defeated Argentina 1-0. How much evidence does this victory provide that Germany had the better team? What is the probability that Germany would win a rematch?
Before you can answer a question like this, you have to make some modeling decisions. As I suggested to my class, scoring in games like soccer and hockey can be well modeled by a Poisson process, which assumes that each team, against a given opponent, will score goals at some goal-scoring rate, λ, and that this rate is stationary; in other words, the probability of scoring a goal is about the same at any point during the game.
If this model holds, we expect the distribution of time between goals to be exponential, and the distribution of goals per game to be Poisson.
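Here is a rough sketch of one way to set it up (not the solution in my notebook; the gamma prior below is a generic stand-in): compute a grid posterior for each team's goal rate given the observed score, then simulate a rematch.

import numpy as np
from scipy.stats import gamma, poisson

lams = np.linspace(0.01, 8, 400)       # hypothetical goal rates per game
prior = gamma.pdf(lams, a=1.3)         # generic prior near ~1.3 goals/game
prior /= prior.sum()

def update(goals):
    post = prior * poisson.pmf(goals, lams)  # Poisson likelihood of the score
    return post / post.sum()

post_ger, post_arg = update(1), update(0)

# P(Germany's rate is higher), one proxy for "had the better team":
joint = np.outer(post_ger, post_arg)
print(np.tril(joint, k=-1).sum())

# Simulate a rematch from the posterior predictive distributions:
rng = np.random.default_rng(5)
g = rng.poisson(rng.choice(lams, 100_000, p=post_ger))
a = rng.poisson(rng.choice(lams, 100_000, p=post_arg))
print((g > a).mean(), (a > g).mean(), (g == a).mean())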
My solution to this problem uses the computational framework from my book, Think Bayes. The framework is described in this notebook. If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.
If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.
And when you are ready, you can see my solution in this notebook.
I will post more of the exercises from my class over the next few weeks.
One technical note: Last week on reddit, someone asked about my use of "fake data" to construct the prior distribution. This move is not as bogus as it sounds; it is just a convenient way to construct a prior that has the right mean (or whatever other statistics are known) and a shape that is consistent with background information.
UPDATE November 25, 2014: Cameron Davidson-Pilon has once again risen to the challenge and implemented a solution using PyMC. You can see his solution here.
If you want to learn more about PyMC, an excellent place to start is Cameron's online book Bayesian Methods for Hackers.
Published on November 25, 2014 06:44
November 18, 2014
The World Cup Problem: Germany v. Brazil
Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:
In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark. At that point in the match, how many goals would you expect Germany to score after 90 minutes? What was the probability that they would score 5 more goals (as, in fact, they did)?
Before you can answer a question like this, you have to make some modeling decisions. As I suggested to my class, scoring in games like soccer and hockey can be well modeled by a Poisson process, which assumes that each team, against a given opponent, will score goals at some goal-scoring rate, λ, and that this rate is stationary; in other words, the probability of scoring a goal is about the same at any point during the game.
If this model holds, we expect the distribution of time between goals to be exponential, and the distribution of goals per game to be Poisson.
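Here is a rough sketch of one approach (again, not the solution in my notebook; the prior is a generic stand-in): update a prior over Germany's goal rate using the two observed gaps between goals, then use the posterior predictive for the remaining 67 minutes.

import numpy as np
from scipy.stats import gamma, expon, poisson

lams = np.linspace(0.01, 15, 600)     # goal rate, in goals per 90 minutes
prior = gamma.pdf(lams, a=1.3)        # generic prior near typical rates
prior /= prior.sum()

post = prior.copy()
for gap in [11 / 90, 12 / 90]:        # gaps before each goal, in game units
    post *= expon.pdf(gap, scale=1 / lams)   # exponential inter-arrival times
post /= post.sum()

remaining = 67 / 90
print(2 + np.sum(post * lams * remaining))   # expected goals at 90 minutes
print(np.sum(post * (1 - poisson.cdf(4, lams * remaining))))  # P(5+ more)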
My solution to this problem uses the computational framework from my book, Think Bayes. The framework is described in this notebook. If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.
If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.
And when you are ready, you can see my solution in this notebook.
I will post more of the exercises from my class over the next few weeks. Coming next: The World Cup Problem Part II: Germany v. Argentina.

Published on November 18, 2014 06:32
October 4, 2014
On efficient algorithms for finding the goddamn endnotes
Introduction
In many recent books, the goddamn endnotes are numbered sequentially within each chapter, but chapter numbers seldom appear in the header on each page. So a reader who wants to find a goddamn endnote typically has to search backward to figure out which chapter they are reading before they can find the goddamn endnote. Several simple changes can make this process more efficient and less infuriating.
In this paper, we present a novel algorithm for finding a goddamn endnote (FGE) in O(1) time; that is, time bounded by a constant regardless of the size or number of chapters.
Background
The most widely deployed algorithm for FGE requires O(n+m) time; that is, time proportional to either n, the number of pages in a chapter, or m, the number of chapters, whichever is larger. The following pseudocode sketches this algorithm:
1) Store the endnote number you want to find as e.
2) Figure out what chapter you are reading by searching for the beginning of the chapter, and store the result as c.
3) Find the endnotes at the end of the book and locate the section for chapter c.
4) Look up endnote e.
The most common implementation of Step 2 is linear in n, the number of pages in the chapter. Some readers can execute a binary search by comparing chapter titles in the header, reducing search time to O(log n).
Similarly, the most common implementation of Step 3 is linear in m, the number of chapters, but some readers perform a binary search.
Assuming that the number of endnotes in a chapter is bounded by a constant, we consider Step 4 to be constant time.
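For concreteness, here is the status quo rendered as (entirely hypothetical) Python; the book is just lists, not a real e-reader API.

def find_goddamn_endnote(chapter_starts, endnotes_by_chapter, page, e):
    """chapter_starts: first page of each chapter, ascending.
    endnotes_by_chapter: one {endnote number: text} dict per chapter."""
    c = 0                                     # Step 2: find the chapter...
    while c + 1 < len(chapter_starts) and chapter_starts[c + 1] <= page:
        c += 1                                # ...by scanning, goddammit
    return endnotes_by_chapter[c][e]          # Steps 3-4: look up endnote e

starts = [1, 41, 90]                          # chapters begin on these pages
notes = [{1: "see ch. 1"}, {1: "see ch. 2"}, {1: "argh"}]
print(find_goddamn_endnote(starts, notes, page=57, e=1))  # -> "see ch. 2"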
Alternatives
1) Put the goddamn chapter number (GCN) in the goddamn header. This simple change immediately reduces Step 2 to constant time.
2) Number the goddamn endnotes from the beginning to the end of the book. Do not start over at the beginning of each goddamn chapter. This change also reduces Step 2 to constant time, but with this change Step 4 may no longer be considered constant time.
3) Along with the endnote number, e, also include p, the page where the goddamn endnote appears. This change makes Step 2 constant time, and Step 3, although technically log time, is practically constant time for most readers.
4) Put the goddamn endnotes at the bottom of the goddamn page (cf. footnotes). For a bounded page size, looking at the bottom of the page is a constant time operation.
Recommendations
We recommend that publishers replace existing linear-time algorithms with any of several easy-to-implement, efficient alternatives, goddammit.
Published on October 04, 2014 16:57
Probably Overthinking It
Probably Overthinking It is a blog about data science, Bayesian Statistics, and occasional other topics.