Allen B. Downey's Blog: Probably Overthinking It, page 19

April 27, 2016

Bayes on Jupyter

I am working on an updated version of my workshop, Bayesian Statistics Made Simple, now using Jupyter notebooks (formerly known as IPython). It's still a work in progress, but you can see a draft of my slides here:

 

If you want to run the code, you can run the notebook in your browser by hitting this button



You should see a home page like this:


If you want to try the exercises, open workshop01.ipynb.  If you just want to see the answers, open workshop01_soln.ipynb.
Either way, you should be able to run the notebooks in your browser and try out the examples.
If you run into any problems, let me know.  Comments and suggestions are welcome.
Special thanks to the generous people who run Binder, which makes it easy to share and reproduce computation.  You can watch their video here:




April 6, 2016

Bayesian update with a multivariate normal distribution


This notebook contains a solution to a problem posted on Reddit; here's the original statement of the problem:

So, I have two sets of data where the elements correspond to each other:

A = {122.8, 115.5, 102.5, 84.7, 154.2, 83.7, 122.1, 117.6, 98.1,
111.2, 80.3, 110.0, 117.6, 100.3, 107.8, 60.2}
B = {82.6, 99.1, 74.6, 51.9, 62.3, 67.2, 82.4, 97.2, 68.9, 77.9,
81.5, 87.4, 92.4, 80.8, 74.7, 42.1}

I'm trying to find out the probability that (91.9 <= A <= 158.3) and (56.4 <= B <= 100). I know that P(91.9 <= A <= 158.3) = 0.727098 and that P(56.4 <= B <= 100) = 0.840273, given that A is a normal distribution with mean 105.5 and standard deviation 21.7 and that B is a normal distribution with mean 76.4 and standard deviation 15.4. However, since they are dependent events, P(BA)=P(A)P(B|A)=P(B)P(A|B). Is there any way that I can find out P(A|B) and P(B|A) given the data that I have?

The original poster added this clarification:

I'm going to give you some background on what I'm trying to do here first. I'm doing sports analysis trying to find the best quarterback of the 2015 NFL season using passer rating and quarterback rating, two different measures of how the quarterback performs during a game. The numbers in the sets above are the different ratings for each of the 16 games of the season (A being passer rating, B being quarterback rating, the first element being the first game, the second element being the second, etc.) The better game the quarterback has, the higher each of the two measures will be; I'm expecting that they're correlated and dependent on each other to some degree. I'm assuming that they're normally distributed because most things done by humans tend to be normally distributed.
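
As a quick check (not part of the original solution), the marginal probabilities quoted in the problem statement can be reproduced from the stated normal models using scipy:

    from scipy.stats import norm

    # A ~ Normal(105.5, 21.7) and B ~ Normal(76.4, 15.4), as stated above
    p_A = norm(105.5, 21.7).cdf(158.3) - norm(105.5, 21.7).cdf(91.9)
    p_B = norm(76.4, 15.4).cdf(100) - norm(76.4, 15.4).cdf(56.4)
    print(p_A, p_B)    # roughly 0.727 and 0.840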

As a first step, let's look at the data. I'll put the two datasets into NumPy arrays.

In [2]: a = np.array([122.8, 115.5, 102.5, 84.7, 154.2, 83.7,
                      122.1, 117.6, 98.1, 111.2, 80.3, 110.0,
                      117.6, 100.3, 107.8, 60.2])
        b = np.array([82.6, 99.1, 74.6, 51.9, 62.3, 67.2,
                      82.4, 97.2, 68.9, 77.9, 81.5, 87.4,
                      92.4, 80.8, 74.7, 42.1])
        n = len(a)
        n
Out[2]: 16

And make a scatter plot:

In [3]: thinkplot.Scatter(a, b, alpha=0.7)

It looks like modeling this data with a bi-variate normal distribution is a reasonable choice.

Let's make a single array out of it:

In [4]: X = np.array([a, b])

And compute the sample mean

In [29]: x̄ = X.mean(axis=1)
print(x̄)
[ 105.5375 76.4375]

Sample standard deviation

In [30]: std = X.std(axis=1)
print(std)
[ 21.04040384 14.93640163]

Covariance matrix

In [32]: S = np.cov(X)
print(S)
[[ 472.21183333 161.33583333]
[ 161.33583333 237.96916667]]

And correlation coefficient

In [33]: corrcoef = np.corrcoef(a, b)
print(corrcoef)
[[ 1. 0.4812847]
[ 0.4812847 1. ]]

Now, let's start thinking about this as a Bayesian estimation problem.

There are 5 parameters we would like to estimate:

The means of the two variables, μ_a, μ_b

The standard deviations, σ_a, σ_b

The coefficient of correlation, ρ.

As a simple starting place, I'll assume that the prior distributions for these variables are uniform over all possible values.

I'm going to use a mesh algorithm to compute the joint posterior distribution, so I'll "cheat" and construct the mesh using conventional estimates for the parameters.

For each parameter, I'll compute a range of possible values where

The center of the range is the value estimated from the data.

The width of the range is 6 standard errors of the estimate.

The likelihood of any point outside this mesh is so low, it's safe to ignore it.

Here's how I construct the ranges:

In [9]: def make_array(center, stderr, m=11, factor=3):
            return np.linspace(center-factor*stderr,
                               center+factor*stderr, m)

        μ_a = x̄[0]
        μ_b = x̄[1]
        σ_a = std[0]
        σ_b = std[1]
        ρ = corrcoef[0][1]

        μ_a_array = make_array(μ_a, σ_a / np.sqrt(n))
        μ_b_array = make_array(μ_b, σ_b / np.sqrt(n))
        σ_a_array = make_array(σ_a, σ_a / np.sqrt(2 * (n-1)))
        σ_b_array = make_array(σ_b, σ_b / np.sqrt(2 * (n-1)))
        #ρ_array = make_array(ρ, np.sqrt((1 - ρ**2) / (n-2)))
        ρ_array = make_array(ρ, 0.15)

        def min_max(array):
            return min(array), max(array)

        print(min_max(μ_a_array))
        print(min_max(μ_b_array))
        print(min_max(σ_a_array))
        print(min_max(σ_b_array))
        print(min_max(ρ_array))
(89.757197120005102, 121.31780287999489)
(65.235198775056304, 87.639801224943696)
(9.5161000378105989, 32.564707642175762)
(6.7553975307657055, 23.117405735750815)
(0.031284703359568844, 0.93128470335956881)

Although the mesh is constructed in 5 dimensions, for doing the Bayesian update, I want to express the parameters in terms of a vector of means, μ, and a covariance matrix, Σ.

Params is an object that encapsulates these values. pack is a function that takes 5 parameters and returns a Params object.

In [10]: class Params:
             def __init__(self, μ, Σ):
                 self.μ = μ
                 self.Σ = Σ

             def __lt__(self, other):
                 # always returns False; defined so that Params objects can be sorted
                 return (self.μ, self.Σ) < (self.μ, self.Σ)

In [11]: def pack(μ_a, μ_b, σ_a, σ_b, ρ):
             μ = np.array([μ_a, μ_b])
             cross = ρ * σ_a * σ_b
             Σ = np.array([[σ_a**2, cross], [cross, σ_b**2]])
             return Params(μ, Σ)

Now we can make a prior distribution. First, mesh is the Cartesian product of the parameter arrays. Since there are 5 dimensions with 11 points each, the total number of points is 11**5 = 161,051.

In [12]: mesh = product(μ_a_array, μ_b_array,
                        σ_a_array, σ_b_array, ρ_array)

The result is an iterator. We can use itertools.starmap to apply pack to each of the points in the mesh:

In [13]: mesh = starmap(pack, mesh)

Now we need an object to encapsulate the mesh and perform the Bayesian update. MultiNorm represents a map from each Params object to its probability.

It inherits Update from thinkbayes2.Suite and provides Likelihood, which computes the probability of the data given a hypothetical set of parameters.

If we know the mean is μ and the covariance matrix is Σ:

The sampling distribution of the mean, x̄, is multivariate normal with parameters μ and Σ/n.

The sampling distribution of (n-1) S is Wishart with parameters n-1 and Σ.

So the likelihood of the observed summary statistics, x̄ and S, is the product of two probability densities:

The pdf of the multivariate normal distribution evaluated at x̄.

The pdf of the Wishart distribution evaluated at (n-1) S.

In [14]: class MultiNorm(thinkbayes2.Suite):

             def Likelihood(self, data, hypo):
                 x̄, S, n = data

                 pdf_x̄ = multivariate_normal(hypo.μ, hypo.Σ/n)
                 pdf_S = wishart(df=n-1, scale=hypo.Σ)

                 like = pdf_x̄.pdf(x̄) * pdf_S.pdf((n-1) * S)
                 return like

Now we can initialize the suite with the mesh.

In [15]: suite = MultiNorm(mesh)

And update it using the data (the return value is the total probability of the data, aka the normalizing constant). This takes about 30 seconds on my machine.

In [16]: suite.Update((x̄, S, n))
Out[16]: 1.6385250666091713e-15

Now to answer the original question, about the conditional probabilities of A and B, we can either enumerate the parameters in the posterior or draw a sample from the posterior.

Since we don't need a lot of precision, I'll draw a sample.

In [17]: sample = suite.MakeCdf().Sample(300)

For a given pair of values, μ and Σ, in the sample, we can generate a simulated dataset.

The size of the simulated dataset is arbitrary, but should be large enough to generate a smooth distribution of P(A|B) and P(B|A).

In [18]: def generate(μ, Σ, sample_size):
             return np.random.multivariate_normal(μ, Σ, sample_size)

         # run an example using sample stats
         fake_X = generate(x̄, S, 300)

The following function takes a sample of a and b and computes the conditional probabilities P(A|B) and P(B|A):

In [19]: def conditional_probs(sample):
             df = pd.DataFrame(sample, columns=['a', 'b'])
             pA = df[(91.9 <= df.a) & (df.a <= 158.3)]
             pB = df[(56.4 <= df.b) & (df.b <= 100)]
             pBoth = pA.index.intersection(pB.index)
             pAgivenB = len(pBoth) / len(pB)
             pBgivenA = len(pBoth) / len(pA)
             return pAgivenB, pBgivenA

         conditional_probs(fake_X)
Out[19]: (0.8174603174603174, 0.865546218487395)

Now we can loop through the sample of parameters, generate simulated data for each, and compute the conditional probabilities:

In [20]: def make_predictive_distributions(sample):
             pmf = thinkbayes2.Joint()

             for params in sample:
                 fake_X = generate(params.μ, params.Σ, 300)
                 probs = conditional_probs(fake_X)
                 pmf[probs] += 1

             pmf.Normalize()
             return pmf

         predictive = make_predictive_distributions(sample)

Then pull out the posterior predictive marginal distribution of P(A|B), and print the posterior predictive mean:

In [21]: thinkplot.Cdf(predictive.Marginal(0).MakeCdf())
predictive.Marginal(0).Mean()
Out[21]: 0.7246540147158328

And then pull out the posterior predictive marginal distribution of P(B|A), with the posterior predictive mean

In [22]: thinkplot.Cdf(predictive.Marginal(1).MakeCdf())
predictive.Marginal(1).Mean()
Out[22]: 0.8221366197059744

We don't really care about the posterior distributions of the parameters, but it's good to take a look and make sure they are not crazy.

The following function takes μ and Σ and unpacks them into a tuple of 5 parameters:

In [23]: def unpack(μ, Σ):
             μ_a = μ[0]
             μ_b = μ[1]
             σ_a = np.sqrt(Σ[0][0])
             σ_b = np.sqrt(Σ[1][1])
             ρ = Σ[0][1] / σ_a / σ_b
             return μ_a, μ_b, σ_a, σ_b, ρ

So we can iterate through the posterior distribution and make a joint posterior distribution of the parameters:

In [24]: def make_marginals(suite):
             joint = thinkbayes2.Joint()
             for params, prob in suite.Items():
                 t = unpack(params.μ, params.Σ)
                 joint[t] = prob
             return joint

         marginals = make_marginals(suite)

And here are the posterior marginal distributions for μ_a and μ_b

In [25]: thinkplot.Cdf(marginals.Marginal(0).MakeCdf())
thinkplot.Cdf(marginals.Marginal(1).MakeCdf());

And here are the posterior marginal distributions for σ_a and σ_b

In [26]: thinkplot.Cdf(marginals.Marginal(2).MakeCdf())
thinkplot.Cdf(marginals.Marginal(3).MakeCdf());

Finally, the posterior marginal distribution for the correlation coefficient, ρ

In [27]: thinkplot.Cdf(marginals.Marginal(4).MakeCdf());

March 28, 2016

Engineering education is good preparation for a lot of things

I am a big fan of engineering education.  I think studying engineering for four years is intellectually challenging, exciting, and (when it is done right) fun.  An engineering education is good preparation for careers in many fields and (again, when it is done right) good preparation for citizenship.

So you can imagine my reaction to a click-bait headline like, "Does Engineering Education Breed Terrorists?", which appeared recently in the Chronicle of Higher Education.

The article presents results from a new book, Engineers of Jihad: The Curious Connection Between Violent Extremism and Education, by Diego Gambetta and Steffen Hertog.  If you don't want to read the book, it looks like you can get most of the results from this working paper, which is free to download.

I have not read the book, but I will summarize results from the paper.

Engineers are over-represented

Gambetta and Hertog compiled a list of 404 "members of violent Islamist groups".  Of those, they found educational information on 284, and found that 196 had some higher education.  For 178 of those cases, they were able to determine the subject of study, and of those, the most common fields were engineering (78), Islamic studies (34), medicine (14), and economics/business (12).

So the representation of engineers among this group is 78/178, or 44%.  Is that more than we should expect?  To answer that, the ideal reference group would be a random sample of males from the same countries, with the same distribution of ages, selecting people with some higher education.

It is not easy to assemble a reference sample like that, but Gambetta and Hertog approximate it with data from several sources.  Their Table 4 summarizes the results, broken down by country:


In every country, the fraction of engineers in the sample of violent extremists is higher than the fraction in the reference group (with the possible exceptions of Singapore and Saudi Arabia).

As always in studies like this, there are methodological issues that could be improved, but taking the results at face value, it looks like people who studied engineering are over-represented among violent extremists by a substantial factor -- Gambetta and Hertog put it at "two to four times the size we would expect [by random selection]".

Why engineers?

In the second part of the paper, Gambetta and Hertog turn to the question of causation; as they formulate it, "we will try to explain only why engineers became more radicalized than people with other degrees".

They consider four hypotheses:

1) "Random appearance of engineers as first movers, followed by diffusion through their network", which means that if the first members of these groups were engineers by chance, and if they recruited primarily through their social networks, and if engineers are over-represented in those networks, the observed over-representation might be due to historical accident.

2) "Selection of engineers because of their technical skills"; that is, engineers might be over-represented because extremist groups recruit them specifically.

3) "Engineers' peculiar cognitive traits and dispositions"; which includes both selection and self-selection; that is, extremist groups might recruit people with the traits of engineers, and engineers might be more likely to have traits that make them more likely to join extremist groups.  [I repeat "more likely" as a reminder that under this hypothesis, not all engineers have the traits, and not all engineers with the traits join extremist groups.]

4) "The special social difficulties faced by engineers in Islamic countries"; which is the hypothesis that (a) engineering is a prestigious field of study, (b) graduates expect to achieve high status, (c) in many Islamic countries, their expectations are frustrated by lack of economic opportunity, and (d) the resulting "frustrated expectations and relative deprivation" explain why engineering graduates are more likely to become radicalized.

Their conclusion is that 1 and 2 "do not survive close scrutiny" and that the observed over-representation is best explained by the interaction of 3 and 4.

My conclusions

Here is my Bayesian interpretation of these hypotheses and the arguments Gambetta and Hertog present:

1) Initially, I didn't find this hypothesis particularly plausible.  The paper presents weak evidence against it, so I find it less plausible now.

2) I find it very plausible that engineers would be recruited by extremist groups, not only for specific technical skills but also for their general ability to analyze systems, debug problems, design solutions, and work on teams to realize their designs.  And once recruited, I would expect these skills to make them successful in exactly the ways that would cause them to appear in the selected sample of known extremists.  The paper addresses only a narrow version of this hypothesis (specific technical skills), and provides only weak evidence against it, so I still find it highly plausible.

3) Initially I found it plausible that there are character traits that make a person more likely to pursue engineering and more likely to join an extremist group.  This hypothesis could explain the observed over-representation even if both of those connections are weak; that is, even if few of the people who have the traits pursue engineering and even fewer join extremist groups.  The paper presents some strong evidence for this hypothesis, so I find it quite plausible.

4) My initial reaction to this hypothesis is that the causal chain sounds fragile.  And even if true, it is so hard to test, it verges on being unfalsifiable.  But Gambetta and Hertog were able to do more work with it than I expected, so I grudgingly upgrade it to "possible".

In summary, my interpretation of the results differs slightly from the authors': I think (2) and (3) are more plausible than (3) and (4).

But overall I think the paper is solid: it starts with an intriguing observation, makes good use of data to refine the question and to evaluate alternative hypotheses, and does a reasonable job of interpreting the results.

So at this point, you might be wondering "What about engineering education?  Does it breed terrorists, like the headline said?"

Good question!  But Gambetta and Hertog have nothing to say about it.  They did not consider this hypothesis, and they presented no evidence in favor or against.  In fact, the phrase "engineering education" does not appear in their 90-page paper except in the title of a paper they cite.

Does bad journalism breed click bait?

So where did that catchy headline come from?  From the same place all click bait comes from: the intersection of dwindling newspaper subscriptions and eroding journalistic integrity.

In this case, the author of the article, Dan Berrett, might not be the guilty party.  He does a good job of summarizing Gambetta and Hertog's research.  He presents summaries of a related paper on the relationship between education and extremism, and a related book.  He points out some limitations of the study design.  So far, so good.

The second half of the article is weaker, where Berrett reports speculation from researchers farther and farther from the topic.

For example, Donna M. Riley at Virginia Tech suggests that traditional [engineering] programs "reinforce positivist patterns of thinking" because "general-education courses [...] are increasingly being narrowed" and "engineers spend almost all their time with the same set of epistemological rules."

Berrett also quotes a 2014 study of American students in four engineering programs (MIT, UMass Amherst, Olin, and Smith), which concludes, "Engineering education fosters a culture of disengagement that defines public welfare concerns as tangential to what it means to practice engineering."

But even if these claims are true, they are about American engineering programs in the 2010s.  They are unlikely to explain the radicalization of the extremists in the study, who were educated in Islamic countries in the "mid 70s to late 90s".

At this point there is no reason to think that engineering education contributed to the radicalization of the extremists in the sample.  Gambetta and Hertog don't consider the hypothesis, and present no evidence that supports it.  To their credit, they "wouldn’t presume to make curricular recommendations to [engineering] educators."

In summary, Gambetta and Hertog provide several credible explanations, supported by evidence, that could explain the observed over-representation.  Speculating on other causes, assuming they are true—and then providing prescriptions for American engineering education—is gratuitous.  And suggesting in a headline that engineering education causes terrorism is hack journalism.

We have plenty of problems

Traditional engineering programs are broken in a lot of ways.  That's why Olin College was founded, and why I wanted to work here.  I am passionate about engineering education and excited about the opportunities to make it better.  Furthermore, some of the places we have the most room for improvement are exactly the ones highlighted in Berrett's article.

Yes, the engineering curriculum should "emphasize human needs, contexts, and interactions", as the
executive director of the American Society for Engineering Education (ASEE) suggests.  As he explains, "We want engineers to understand what motivates humans, because that’s how you satisfy human needs.  You have to understand people as people."

And yes, we want engineers "who recognize needs, design solutions and engage in creative enterprises for the good of the world".  That's why we put it in Olin College's mission statement.

We may not be doing these things very well, yet.  We are doing our best to get better.

In the meantime, it is premature to add "preventing terrorism" to our list of goals.  We don't know the causes of terrorism, we have no reason to think engineering education is one of them, and even if we did, I'm not sure we would know what to do about it.

Debating changes in engineering education is hard enough.  Bringing terrorism into it—without evidence or reason—doesn't help.


March 18, 2016

The retreat from religion is accelerating

Updated results from the CIRP Freshman Survey were published last week, which means it's time for me to update my annual series on the increasing number of college students with no religious affiliation (the "nones").

Since I started following this trend in 2008, I have called it "one of the most under-reported stories of the decade". I'm not sure if that's true any more; there have been more stories recently about secularization in the U.S., and the general disengagement of Millennials from organized religion.

But one part of the story continues to surprise me: the speed of the transformation is remarkable, averaging almost 1 percentage point per year for the last 20 years, and there is increasing evidence that it is accelerating.

30% have no religion

Among first-year college students in Fall 2015, the fraction reporting no religious affiliation reached an all-time high of 29.6%, up more than 2 percentage points from 27.5% last year.

And the fraction reporting that they never attended a religious service also reached an all-time high at 30.5%, up from 29.3% last year.

These reports are based on survey results from the Cooperative Institutional Research Program (CIRP) of the Higher Education Research Institute (HERI).  In 2015, more than 141,000 full-time, first-time college students at 199 colleges and universities completed the CIRP Freshman Survey, which includes questions about students’ backgrounds, activities, and attitudes.

In one question, students select their “current religious preference,” from a choice of seventeen common religions, “Other religion,” "Atheist", "Agnostic", or “None.”  The options "Atheist" and "Agnostic" are new this year.  For consistency with previous years, I compare the "Nones" from previous years with the sum of "None", "Atheist" and "Agnostic" this year.

Another question asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Students are instructed to select “Occasionally” if they attended one or more times.
The following figure shows the fraction of Nones over more than 40 years of the survey.
The blue line shows actual data through 2014; the red line shows a quadratic fit to the data.  The dark gray region shows a 90% confidence interval, which represents uncertainty about the parameters of the model.  The light gray region shows a 90% predictive interval, which represents uncertainty about the predicted value.

The blue square shows the actual data point for 2015.  For the first time since I started running this analysis, the new data point falls outside the predicted interval.
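
For readers who want to reproduce this kind of figure, here is a rough sketch of one way to fit a quadratic trend and compute confidence and predictive intervals with statsmodels; the column names and the data file are placeholders, not the ones I actually used:

    import pandas as pd
    import statsmodels.formula.api as smf

    # hypothetical file with one row per survey year:
    # 'year' and 'nones' (percent reporting no religious preference)
    df = pd.read_csv('nones.csv')
    df['t'] = df.year - df.year.min()
    df['t2'] = df.t ** 2

    model = smf.ols('nones ~ t + t2', data=df).fit()

    # predict the next year, with 90% intervals
    t_next = df.t.max() + 1
    new = pd.DataFrame(dict(t=[t_next], t2=[t_next ** 2]))
    pred = model.get_prediction(new).summary_frame(alpha=0.10)
    print(pred[['mean',
                'mean_ci_lower', 'mean_ci_upper',    # confidence interval
                'obs_ci_lower', 'obs_ci_upper']])    # predictive interval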

30% have not attended a religious service

Here is the corresponding plot for attendance at religious services.
The new data point for 2015, at 30.5%, falls in the predicted range, although somewhat ahead of the long term trend.

The trend is accelerating

The quadratic model suggests that the trend is accelerating, but for a clearer test, I plotted annual changes from 1985 to the present:
Visually, it looks like the annual changes are getting bigger, and the slope of the fitted line is about 0.4 percentage points per decade.  The p-value of that slope is 3.6%, which is considered statistically significant, but barely.  You can interpret that as you like.

The gender gap persists

As always, more males than females report no religious preference.  The gender gap closed slightly this year, but the long term trend suggests it is still growing.

Predictions for 2016
Using the new 2015 data, we can run the same analysis to generate predictions for 2016.  Here is the revised plot for "Nones":

This year's measurement is ahead of the long-term trend, so next year's is predicted to regress to 28.6% (down 1.0%).
And here is the prediction for "No attendance":
Again, because this year's value is ahead of the long term trend, the center of the predictive distribution is lower, at 29.8% (down 0.7%).

I'll be back next year to check on these predictions.

Comments
1) The two new categories, "Atheist" and "Agnostic", were added this year.  Of the 29.6% I am treating as "no religion", 8.3% reported as agnostic, 5.9% as atheist, and 15.4% as none.  It's hard to say what effect the change in the question had on the results.
2) The number of schools and the number of students participating in the Freshman Survey have been falling for several years.  I wonder if the mix of schools represented in the survey is changing over time, and what effect this might have on the trends I am watching.  The percentage of "Nones" is different across different kinds of institutions (colleges, universities, public, private, etc.).  If participation rates are changing among these groups, that would affect the results.

3) Obviously college students are not representative of the general population.  Data from other sources indicate that the same trends are happening in the general population, but I haven't been able to make a quantitative comparison between college students and others.  Data from other sources also indicate that college graduates are slightly more likely to attend religious services, and to report a religious preference, than the general population.

Data Source
Eagan, K., Stolzenberg, E. B., Ramirez, J. J., Aragon, M. C., Suchard, M. R., & Hurtado, S. (2014). The American freshman: National norms fall 2014. Los Angeles: Higher Education Research Institute, UCLA.

This and all previous reports are available from the HERI publications page.

March 4, 2016

A quick Bayes problem in IPython

I've been busy with other projects, so I haven't posted anything new for a while.  But a reader posted a nice Bayes's theorem problem, which I am using as an excuse to experiment with embedding an IPython/Jupyter notebook in a blog, following these instructions. Let's see how it goes...

The following problem was submitted to my blog, Probably Overthinking It, by a user named Amit, who wrote:

The following data is about a poll that occurred in 3 states. In state1, 50% of voters support Party1, in state2, 60% of the voters support Party1, and in state3, 35% of the voters support Party1. Of the total population of the three states, 40% live in state1, 25% live in state2, and 35% live in state3. Given that a voter supports Party1, what is the probability that he lives in state2?

My solution follows. First I'll create a suite to represent our prior knowledge. If we know nothing about a voter, we would use the relative populations of the states to guess where they are from.

In [2]: prior = thinkbayes2.Suite({'State 1': 0.4, 'State 2': 0.25, 'State 3': 0.35})
prior.Print()
State 1 0.4
State 2 0.25
State 3 0.35

Now if we know a voter supports Party 1, we can use that as data to update our belief. The following dictionary contains the likelihood of the data (supporting Party 1) under each hypothesis (which state the voter is from).

In [3]: likelihood = {'State 1': 0.5, 'State 2': 0.60, 'State 3': 0.35}

To make the posterior distribution, I'll start with a copy of the prior.

The update consists of looping through the hypotheses and multiplying the prior probability of each hypothesis, hypo, by the likelihood of the data if hypo is true.

The result is a map from hypotheses to posterior likelihoods, but they are not probabilities yet because they are not normalized.

In [4]: posterior = prior.Copy()
        for hypo in posterior:
            posterior[hypo] *= likelihood[hypo]
        posterior.Print()
State 1 0.2
State 2 0.15
State 3 0.1225

Normalizing the posterior distribution returns the total likelihood of the data, which is the normalizing constant.

In [5]: posterior.Normalize()
Out[5]: 0.4725

Now the posterior is a proper distribution:

In [6]: posterior.Print()
State 1 0.42328042328
State 2 0.31746031746
State 3 0.259259259259

And the probability that the voter is from State 2 is about 32%.
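
The same arithmetic can be checked without thinkbayes2; this is just a sanity check using a plain dictionary, not part of the original notebook:

    # prior: probability that a random voter lives in each state
    prior = {'State 1': 0.4, 'State 2': 0.25, 'State 3': 0.35}

    # likelihood: probability of supporting Party 1 in each state
    likelihood = {'State 1': 0.5, 'State 2': 0.6, 'State 3': 0.35}

    # multiply, then normalize
    unnorm = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnorm.values())
    posterior = {state: p / total for state, p in unnorm.items()}

    print(posterior['State 2'])    # about 0.317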


December 8, 2015

Many rules of statistics are wrong

There are two kinds of people who violate the rules of statistical inference: people who don't know them and people who don't agree with them.  I'm the second kind.

The rules I hold in particular contempt are:

The interpretation of p-values: Suppose you are testing a hypothesis, H, so you've defined a null hypothesis, H0, and computed a p-value, which is the probability, under H0, of an effect as big as the one observed.

According to the conventional wisdom of statistics, if the p-value is small, you are allowed to reject the null hypothesis and declare that the observed effect is "statistically significant".  But you are not allowed to say anything about H, not even that it is more likely in light of the data.

I disagree.  If we were really not allowed to say anything about H, significance testing would be completely useless, but in fact it is only mostly useless.  As I explained in this previous article, a small p-value indicates that the observed data are unlikely under the null hypothesis.  Assuming that they are more likely under H (which is almost always the case), you can conclude that the data are evidence in favor of H and against H0.  Or, equivalently, that the probability of H, after seeing the data, is higher than it was before.  And it is reasonable to conclude that the apparent effect is probably not due to random sampling, but might have explanations other than H.
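
To make that concrete, here is a toy calculation with made-up numbers (they don't come from any particular study): if the data are much less likely under H0 than under H, the posterior probability of H is higher than the prior.

    # made-up likelihoods, for illustration only
    p_data_given_H0 = 0.01   # the data are unlikely under the null
    p_data_given_H = 0.20    # the same data are more likely under H

    prior_H = 0.5            # start with even prior odds
    bayes_factor = p_data_given_H / p_data_given_H0

    prior_odds = prior_H / (1 - prior_H)
    posterior_odds = prior_odds * bayes_factor
    posterior_H = posterior_odds / (1 + posterior_odds)

    print(bayes_factor, posterior_H)    # 20.0 and about 0.95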

Correlation does not imply causation: If this slogan is meant as a reminder that correlation does not always imply causation, that's fine.  But based on responses to some of my previous work, many people take it to mean that correlation provides no evidence in favor of causation, ever.

I disagree.  As I explained in this previous article, correlation between A and B is evidence of some causal relationship between A and B, because you are more likely to observe correlation if there is a causal relationship than if there isn't.  The problem with using correlation to infer causation is that it does not distinguish among three possible relationships: A might cause B, B might cause A, or any number of other factors, C, might cause both A and B.

So if you want to show that A causes B, you have to supplement correlation with other arguments that distinguish among possible relationships.  Nevertheless, correlation is evidence of causation.

Regression provides no evidence of causation: This rule is similar to the previous one, but generalized to include regression analysis.  I posed this question on the reddit stats forum: the consensus view among the people who responded is that regression doesn't say anything about causation, ever.  (More about that in this previous article.)

I disagree.  I think regression provides evidence in favor of causation for the same reason correlation does, but in addition, it can distinguish among different explanations for correlation.  Specifically, if you think that a third factor, C, might cause both A and B, you can try adding a variable that measures C as an independent variable.  If the apparent relationship between A and B is substantially weaker after the addition of C, or if it changes sign, that's evidence that C is a confounding variable.

Conversely, if you add control variables that measure all the plausible confounders you can think of, and the apparent relationship between A and B survives each challenge substantially unscathed, that outcome should increase your confidence that either A causes B or B causes A, and decrease your confidence that confounding factors explain the relationship.
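
Here is a synthetic sketch of both situations (my own illustration, not data from any of the studies discussed here): in the first case C causes both A and B, and the apparent effect of A vanishes once C is controlled for; in the second case A really does cause B, and the effect survives the control.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    np.random.seed(17)
    n = 1000
    C = np.random.normal(size=n)                  # potential confounder

    # case 1: C causes both A and B; A has no direct effect on B
    A1 = C + np.random.normal(size=n)
    B1 = 2 * C + np.random.normal(size=n)
    df1 = pd.DataFrame(dict(A=A1, B=B1, C=C))
    print(smf.ols('B ~ A', data=df1).fit().params['A'])       # near 1.0
    print(smf.ols('B ~ A + C', data=df1).fit().params['A'])   # near 0.0: effect vanishes

    # case 2: A really does cause B (with coefficient 1)
    A2 = C + np.random.normal(size=n)
    B2 = A2 + 2 * C + np.random.normal(size=n)
    df2 = pd.DataFrame(dict(A=A2, B=B2, C=C))
    print(smf.ols('B ~ A', data=df2).fit().params['A'])       # near 2.0, biased upward by C
    print(smf.ols('B ~ A + C', data=df2).fit().params['A'])   # near 1.0: effect survives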

By providing evidence against confounding factors, regression provides evidence in favor of causation, but it is not clear whether it can distinguish between "A causes B" and "B causes A".  The received wisdom of statistics says no, of course, but at this point I hope you understand why I am not inclined to accept it.

In this previous article, I explore the possibility that running regressions in both directions might help.  At this point, I think there is an argument to be made, but I am not sure.  It might turn out to be hogwash.  But along the way, I had a chance to explore another bit of conventional wisdom...

Methods for causal inference, like matching estimators, have a special ability to infer causality:  In this previous article, I explored a propensity score matching estimator, which is one of the methods some people think have special ability to provide evidence for causation.  In response to my previous work, several people suggested that I try these methods instead of regression.

Causal inference, and the counterfactual framework it is based on, is interesting stuff, and I look forward to learning more about it.  And matching estimators may well squeeze stronger evidence from the same data, compared to regression.  But so far I am not convinced that they have any special power to provide evidence for causation.

Matching estimators and regression are based on many of the same assumptions and vulnerable to some of the same objections.  I believe (tentatively for now) that if either of them can provide evidence for causation, both can.

Quoting rules is not an argument

As these examples show, many of the rules of statistics are oversimplified, misleading, or wrong.  That's why, in many of my explorations, I do things experts say you are not supposed to do.  Sometimes I'm right and the rule is wrong, and I write about it here.  Sometimes I'm wrong and the rule is right; in that case I learn something and I try to explain it here.  In the worst case, I waste time rediscovering something everyone already "knew".

If you think I am doing something wrong, I'd be interested to hear why.  Since my goal is to test whether the rules are valid, repeating them is not likely to persuade me.  But if you explain why you think the rules are right, I am happy to listen.


December 3, 2015

Internet use and religion, part six

In my work on Internet use and religion, one of the recurring questions is whether the analysis I am doing, primarily regression models using observational data, provides evidence that Internet use causes decreased religiosity, or only shows a statistical association between them.

I discuss this question in the next section, which is probably too long.  If you get bored, you can skip to the following section, which presents a different method for estimating effect size, a "propensity score matching estimator", and compares the results to the regression models.

What does "evidence" mean?

In the previous article I presented results from two regression models and made the following argument:

1) Internet use predicts religiosity fairly strongly: the effect size is stronger than education, income, and use of other media (but not as strong as age).

2) Controlling for the same variables, religiosity predicts Internet use only weakly: the effect is weaker than age, date of interview, income, education, and television (and about the same as radio and newspaper).

3) This asymmetry suggests that Internet use causes a decrease in religiosity, and the reverse effect (religiosity discouraging Internet use) is weaker or zero.

4) It is still possible that a third factor could cause both effects, but the control variables in the model, and the asymmetry of the effect, make it hard to come up with plausible ideas for what the third factor could be.
I am inclined to consider these results as evidence of causation (and not just a statistical association).

When I make arguments like this, I get pushback from statisticians who assert that the kind of observational data I am working with cannot provide any evidence for causation, ever.  To understand this position better, I posted this query on reddit.com/r/statistics.  As always, I appreciate the thoughtful responses, even if I don't agree.  The top-rated comments came from /u/Data_Driven_Dude, who states this position:
Causality is almost wholly a factor of methodology, not statistics. Which variables are manipulated, which are measured, when they're measured/manipulated, in what order, and over what period of time are all methodological considerations. Not to mention control/confounding variables. 
So the most elaborate statistics in the world can't offer evidence of causation if, for example, a study used cross-sectional survey design. [...] 
Long story short: causality is a helluva lot harder than most people believe it to be, and that difficulty isn't circumvented by mere regression.
I believe this is a consensus opinion among many statisticians and social scientists, but to be honest I find it puzzling.  As I argued in this previous article, correlation is in fact evidence of causation, because observing a correlation is more likely if there is causation than if there isn't.

The problem with correlation is not that it is powerless to demonstrate causation, but that a simple bivariate correlation between A and B can't distinguish between A causing B, B causing A, or a confounding variable, C, causing both A and B.

But regression models can.  In theory, if you control for C, you can measure the causal effect of A on B.  In practice, you can never know whether you have identified and effectively controlled for all confounding variables.  Nevertheless, by adding control variables to a regression model, you can find evidence for causation.  For example:

1) If A (Internet use, in my example) actually causes B (decreased religiosity), but not the other way around, and we run regressions with B as a dependent variable and A as an explanatory variable, we expect to find that A predicts B, of course.  But we also expect the observed effect to persist as we add control variables.  The magnitude of the effect might get smaller, but if we can control effectively for all confounding variables, it should converge on the true causal effect size.

2) On the other hand, if we run the regression the other way, using B to predict A, we expect to find that B predicts A, but as we add control variables, the effect should disappear, and if we control for all confounding variables, it should converge to zero.
For example, in this previous article, I found that first babies are lighter than others, by about 3 ounces.  However, the mothers of first babies tend to be younger, and babies of younger mothers tend to be lighter.  When I control for mother's age, the apparent effect is smaller, less than an ounce, and no longer statistically significant.  I conclude that mother's age explains the apparent difference between first babies and others, and that the causal effect of being a first baby is small or zero.

I don't think that conclusion is controversial, but here's the catch: if you accept "the effect disappears when I add control variables" as evidence against causation, then you should also accept "the effect persists despite effective control for confounding variables" as evidence for causation.

Of course the key word is "effective".  If you think I have not actually controlled for an important confounding variable, you would be right to be skeptical of causation.  If you think the controls are weak, you might accept the results as weak evidence of causation.  But if you think the controls are effective, you should accept the results as strong evidence.

So I don't understand the claim that regression models cannot provide any evidence for causation, at all, ever, which I believe is the position my correspondents took, and which seems to be taken as a truism among at least some statisticians and social scientists.

I put this question to /u/Data_Driven_Dude, who wrote the following:
[...]just because you have statistical evidence for a hypothesized relationship doesn't mean you have methodological evidence for it. Methods and stats go hand in hand; the data you gather and analyze via statistics are reflections of the methods used to gather those data.
Say there is compelling evidence that A causes B. So I hypothesize that A causes B. I then conduct multiple, methodologically-rigorous studies over several years (probably over at least a decade). This effectively becomes my research program, hanging my hat on the idea that A causes B. I become an expert in A and B. After all that work, the studies support my hypothesis, and I then suggest that there is overwhelming evidence that A causes B.
 
Now take your point. There could be nascent (yet compelling) research from other scientists, as well logical "common sense," suggesting that A causes B. So I hypothesize that A causes B. I then administer a cross-sectional survey that includes A, B, and other variables that may play a role in that relationship. The data come back, and huzzah! Significant regression model after controlling for potentially spurious/confounding variables! Conclusion: A causes B. 
Nope. In your particular study, ruling out alternative hypotheses by controlling for other variables and finding that B doesn't predict A when those other variables are considered is not evidence of causality. Instead, what you found is support for the possibility that you're on the right track to finding causality. You found that A tentatively predicts B when controlling for XYZ, within L population based on M sample across N time. Just because you started with a causal hypothesis doesn't mean you conducted a study that can yield data to support that hypothesis. 
So when I say that non-experimental studies provide no evidence of causality, I mean that not enough methodological rigor has been used to suppose that your results are anything but a starting point. You're tiptoeing around causality, you're picking up traces of it, you see its shadow in the distance. But you're not seeing causality itself, you're seeing its influence on a relationship: a ripple far removed from the source.
I won't try to summarize or address this point-by-point, but a few observations:

1) One point of disagreement seems to be the meaning of "evidence".  I admit that I am using it in a Bayesian sense, but honestly, that's because I don't understand any alternatives.  In particular, I don't understand the distinction between "statistical evidence" and "methodological evidence".

2) Part of what my correspondent describes is a process of accumulating evidence, starting with initial findings that might not be compelling and ending (a decade later!) when the evidence is overwhelming.  I mostly agree with this, but I think the process starts when the first study provides some evidence and continues as each additional study provides more.  If the initial study provides no evidence at all, I don't know how this process gets off the ground.  But maybe I am still stuck on the meaning of "evidence".

3) A key feature of this position is the need for methodological rigor, which sounds good, but I am not sure what it means.  Apparently regression models with observational data lack it.  I suspect that randomized controlled trials have it.  But I'm not sure what's in the middle.  Or, to be more honest, I know what is considered to be in the middle, but I'm not sure I agree.

To pursue the third point, I am exploring methods commonly used in the social sciences to test causality.

SKIP TO HERE!

Matching estimators of causal effects

I'm reading Morgan and Winship's Counterfactuals and Causal Inference, which is generally good, although it has the academic virtue of presenting simple ideas in complicated ways.  So far I have implemented one of the methods in Counterfactuals, a propensity score matching estimator.  Matching estimators work like this:

1) To estimate the effect of a particular treatment, D, on a particular outcome, Y, we divide an observed sample into a treatment group that received D and a control group that didn't.

2) For each member of the treatment group, we identify a member of the control group that is as similar as possible (I'll explain how soon), and compute the difference in Y between the matched pair.

3) Averaging the observed differences over the pairs yields an estimate of the mean causal effect of D on Y.
The hard part of this process is matching.  Ideally the matching process should take into account all factors that cause Y.  If the pairs are identical in all of these factors, and differ only in D, any average difference in Y must be caused by D.

Of course, in practice we can't identify all factors that cause Y.  And even if we could, we might not be able to observe them all.  And even if we could, we might not be able to find a perfect match for each member of the treatment group.

We can't solve the first two problems, but "propensity scores" help with the third.  The basic idea is

1) Identify factors that predict D.  In my example, D is Internet use.

2) Build a model that uses those factors to predict the probability of D for each member of the sample; this probability is the propensity score.  In my example, I use logistic regression with age, income, education, and other factors to compute propensity scores.

3) Match each member of the treatment group with the member in the control group with the closest propensity score.  In my example, the members of each pair have the same predicted probability of using the Internet (according to the model in step 2), so the only relevant difference between them is that one did and one didn't.  Any difference in the outcome, religiosity, should reflect the causal effect of the treatment, Internet use.

I implemented this method (details below) and applied it to the data from the European Social Survey (ESS).
In each country I divide respondents into a treatment group with Internet use above the median and a control group below the median.  We lose information by quantizing Internet use in this way, but the distribution tends to be bimodal, with many people at the extremes and few in the middle, so treating Internet use as a binary variable is not completely terrible.
To compute propensity scores, I use logistic regression to predict Internet use based on year of birth (including a quadratic term), year of interview, education and income (as in-country ranks), and use of other media (television, radio, and newspaper).
As expected, the average propensity in the treatment group is higher than in the control group.  But some members of the control group are matched more often than others (and some not at all).  After matching, the two groups have the same average propensity.

Finally, I compute the pair-wise difference in religiosity and the average across pairs.

For each country, I repeat this process 101 times using weighted resampled data with randomly filled missing values.  That way I can compute confidence intervals that reflect variation due to sampling and missing data.  The following figure shows the estimated effect size for each country and a 95% confidence interval:
The estimated effect of Internet use on religiosity is negative in 30 out of 34 countries; in 18 of them it is statistically significant.  In 4 countries the estimate is positive, but none of them are statistically significant.

The median effect size is 0.28 points on a 10 point scale.  The distribution of effect size across countries is similar to the results from the regression model, which has median 0.33:

The confidence intervals for the matching estimator are bigger.  Some part of this difference is because of the information we lose by quantizing Internet use.  Some part is because we lose some samples during the matching process.  And some part is due to the non-parametric nature of matching estimators, which make fewer assumptions about the structure of the effect.

How it works

The details of the method are in this IPython notebook, but I'll present the kernel of the algorithm here.  In the following, group is a Pandas DataFrame for one country, with one row for each respondent.

The first step is to quantize Internet use and define a binary variable, treatment:

    netuse = group.netuse_f
    thresh = netuse.median()
    if thresh < 1:
        thresh = 1
    group.treatment = (netuse >= thresh).astype(int)

The next step is use logistic regression to compute a propensity for each respondent:


    formula = ('treatment ~ inwyr07_f + '
               'yrbrn60_f + yrbrn60_f2 + '
               'edurank_f + hincrank_f +'
               'tvtot_f + rdtot_f + nwsptot_f')

    model = smf.logit(formula, data=group)    
    results = model.fit(disp=False)
    group.propensity = results.predict(group)
    
Next we divide into treatment and control groups:

    treatment = group[group.treatment == 1]
    control = group[group.treatment == 0]
    
And sort the controls by propensity:

    series = control.propensity.sort_values()

Then do the matching by bisection search:

    indices = series.searchsorted(treatment.propensity)
    indices[indices < 0] = 0
    indices[indices >= len(control)] = len(control)-1
    
And select the matches from the controls:

    control_indices = series.index[indices]
    matches = control.loc[control_indices]

Extract the distance in propensity between each pair, and the difference in religiosity:

    distances = (treatment.propensity.values - 
                 matches.propensity.values)
    differences = (treatment.rlgdgr_f.values - 
                   matches.rlgdgr_f.values)
    
Select only the pairs that are a close match

    caliper = differences[abs(distances) < 0.001]

And compute the mean difference:

    delta = np.mean(caliper)

That's all there is to it.  There are better ways to do the matching, but I started with something simple and computationally efficient (it's n log n, where n is the size of the control or treatment group, whichever is larger).
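
One piece not shown above is the outer loop that produces the confidence intervals.  Here is a rough sketch of how the 101 resampling iterations might be wired together; the weight column name and the one_estimate callback (a stand-in for the whole matching procedure above, returning delta) are assumptions, not code copied from the notebook:

    import numpy as np

    def resample_rows_weighted(df, weight_col='pspwght'):
        # resample rows with replacement, with probability proportional
        # to the survey weights (the column name is an assumption)
        probs = df[weight_col] / df[weight_col].sum()
        indices = np.random.choice(df.index, size=len(df), replace=True, p=probs)
        return df.loc[indices]

    def estimate_with_ci(group, one_estimate, iters=101):
        # one_estimate takes a resampled DataFrame and returns the mean
        # pairwise difference in religiosity (delta, as computed above)
        estimates = [one_estimate(resample_rows_weighted(group))
                     for _ in range(iters)]
        low, high = np.percentile(estimates, [2.5, 97.5])   # 95% CI
        return np.median(estimates), (low, high)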

Back to the philosophy

The agreement of the two methods provides some evidence of causation, because if the effect were spurious, I would expect different methodologies, which are more or less robust against the spurious effect, to yield different results.

But it is not very strong evidence, because the two methods are based on many of the same assumptions.  In particular, the matching estimator is only as good as the propensity model, and in this case the propensity model includes the same factors as the regression model.  If those factors effectively control for confounding variables, both methods should work.  If they don't, neither will.

The propensity model uses logistic regression, so it is based on the usual assumptions about linearity and the distribution of errors.  But the matching estimator is non-parametric, so it depends on fewer assumptions about the effect itself.  It seems to me that being non-parametric is a potential advantage of matching estimators, but it doesn't help with the fundamental problem, which is that we don't know if we have effectively controlled for all confounding variables.

So I am left wondering why a matching estimator should be considered suitable for causal inference if a regression model is not.  In practice one of them might do a better job than the other, but I don't see any difference, in principle, in their ability to provide evidence for causation: either both can, or neither.


November 30, 2015

Internet use and religion, part five

[If you are jumping into the middle of this series, you might want to start with this article, which explains the methodological approach I am taking.]

In the previous article, I show results from two regression models that predict religious affiliation and degree of religiosity.  I use the models to compare hypothetical respondents who are at their national means for all explanatory factors; then I vary one factor at a time, comparing someone at the 25th percentile with someone at the 75th percentile.  I compute the difference in the predicted probability of religious affiliation and the predicted level of religiosity:

1) In almost every country, a hypothetical respondent with high Internet use is less likely to report a religious affiliation.  The median effect size across countries is 3.5 percentage points.

2) In almost every country, a hypothetical respondent with high Internet use reports a lower degree of religiosity.  The median effect size is 0.36 points on a 10-point scale.

These results suggest that Internet use might cause religious disaffiliation and decreased religiosity, but they are ambiguous.  It is also possible that the direction of causation is the other way; that is, that religiosity causes a decrease in Internet use.  Or there might be other factors that cause both Internet use and religiosity.

I'll address the first possibility first.  If religiosity causes lower Internet use, we should be able to measure that effect by flipping the models, taking religious affiliation and religiosity as explanatory variables and trying to predict Internet use.

I did that experiment and found:

1) Religious affiliation (hasrelig) has no predictive value for Internet use in most countries, and only weak predictive value in others.

2) Degree of religiosity (rlgdgr) has some predictive value for Internet use in some countries, but the effect is weaker than other explanatory variables (like age and education), and weaker than the effect of Internet use on religiosity: the median across countries is 0.19 points on a 7-point scale.

Considering the two possibilities, that Internet use causes religious disaffiliation or the other way around, these results support the first possibility, although the second might make a smaller contribution.

Although it is still possible that a third factor causes increased Internet use and decreased religious affiliation, it would have to do so in a strangely asymmetric way to account for these results.  And since the model controls for age, income, education, and other media use, this hypothetical third factor would have to be uncorrelated with these controls (or only weakly correlated).

I can't think of any plausible candidates for this third factor.  So I tentatively conclude that Internet use causes decreased religiosity.  I present more detailed results below.

Summary of previous results

In the previous article, I computed an effect size for each factor and reported results for two models, one that predicts hasrelig (whether the respondent reports a religious affiliation) and one that predicts rlgdgr (degree of religiosity on a 0-10 scale).  The following two graphs summarize the results.

Each figure shows the distribution of effect size across the 34 countries in the study.  The first figure shows the results for the first model as a difference in percentage points; each line shows the effect size for a different explanatory variable.

The factors with the largest effect sizes are year of birth (dark green line) and Internet use (purple).

For Internet use, a respondent who is average in every way, but falls at the 75th percentile of Internet use, is typically 2-7 percentage points less likely to be affiliated than a similar respondent at the 25th percentile of Internet use.  In a few countries, the effect is apparently the other way around, but in those cases the estimated effect size is not statistically significant.

 Overall, people who use the Internet more are less likely to be affiliated, and the effect is stronger than the effect of education, income, or the consumption of other media.

Similarly, when we try to predict degree of religiosity, people who use the Internet more (again comparing the 75th and 25th percentiles) report lower religiosity, typically 0.2 to 0.7 points on a 10 point scale.  Again, the effect size for Internet use is bigger than for education, income, or other media.
Of course, what I am calling an "effect size" may not be an effect in the sense of cause and effect.  What I have shown so far is that Internet users tend to be less religious, even when we control for other factors.  It is possible, and I think plausible, that Internet use actually causes this effect, but there are two other possible explanations for the observed statistical association:

1) Religious affiliation and religiosity might cause decreased Internet use.
2) Some other set of factors might cause both increased Internet use and decreased religiosity.

Addressing the first alternative explanation, if people who are more religious tend to use the Internet less (other things being equal), we would expect that effect to appear in a model that includes religiosity as an explanatory variable and Internet use as a dependent variable.

But it turns out that if we run these models, we find that religiosity has little power to predict levels of Internet use when we control for other factors.  I present the results below; the details are in this IPython notebook.

Model 1

The first model tries to predict level of Internet use taking religious affiliation (hasrelig) as an explanatory variable, along with the same controls I used before: year of birth (linear and quadratic terms), year of interview, education, income, and consumption of other media.
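For concreteness, here is a minimal sketch of this flipped model, assuming a per-country DataFrame with the column names used in this series (netuse, hasrelig, yrborn60, edurank, hincrank, tvtot, rdtot, nwsptot); inwyr, standing in for year of interview, is a name I am assuming rather than the one in the notebook.

import statsmodels.formula.api as smf

# Flipped model: Internet use is the dependent variable; religious affiliation
# and the usual controls are the explanatory variables.  `inwyr` is an assumed
# column name for year of interview.
flipped_formula = ('netuse ~ hasrelig + yrborn60 + I(yrborn60**2) + inwyr + '
                   'edurank + hincrank + tvtot + rdtot + nwsptot')

def fit_flipped(df_country):
    """Fit OLS predicting netuse for one country's respondents."""
    return smf.ols(flipped_formula, data=df_country).fit()

# Example: inspect the estimate and 95% CI for hasrelig.
# results = fit_flipped(df_country)
# print(results.params['hasrelig'], results.conf_int().loc['hasrelig'])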

The following figure shows the effect size of religious affiliation on Internet use.
In most countries it is essentially zero; in a few countries, people who report a religious affiliation also report less Internet use, but the difference is always less than 0.5 points on a 7-point scale.

The following figure shows the distribution of effect size for the other variables on the same scale.
If we are trying to predict Internet use for a given respondent, the most useful explanatory variables, in descending order of effect size, are year of birth, education, year of interview, income, and television viewing.  The effect sizes for religious affiliation, radio listening, and newspaper reading are substantially smaller.

The results of the second model are similar.
Model 2

The second model tries to predict level of Internet use taking degree of religiosity (rlgdgr) as an explanatory variable, along with the same controls I used before.

The following figure shows the estimated effect size in each country, showing the difference in Internet use of two hypothetical respondents who are at their national mean for all variables except degree of religiosity, where they are at the 25th and 75th percentiles.

In most countries, the respondent reporting a higher level of religiosity also reports a lower level of Internet use, in almost all cases less than 0.5 points on a 7-point scale.  Again, this effect is smaller than the apparent effect of the other explanatory variables.
Again, the variables that best predict Internet use are year of birth, education, year of interview, income, and television viewing.  The apparent effect of religiosity is somewhat less than television viewing, and more than radio listening and newspaper reading.
Next steps

As I present these results, I realize that I can make them easier to interpret by expressing the effect size in standard deviations rather than raw differences.  Internet use is recorded on a 7-point scale, and religiosity on a 10-point scale, so it's not obvious how to compare them.
Also, the variability of Internet use and religiosity differs across countries, so standardizing will help with comparisons between countries, too.
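As a minimal sketch of that standardization, assuming raw_diff is the 25th-to-75th-percentile difference produced by one of the models and df_country holds one country's responses:

def standardize_effect(raw_diff, df_country, outcome):
    """Express a raw effect size in standard deviations of the outcome,
    using the within-country standard deviation."""
    return raw_diff / df_country[outcome].std()

# Hypothetical example: a difference of 0.35 on the 7-point netuse scale, in a
# country where the standard deviation of netuse is about 2, is roughly 0.17 SDs.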
More results in the next installment.

Published on November 30, 2015 12:16

November 24, 2015

Internet use and religion, part four

In the previous article, I presented preliminary results from a study of relationships between Internet use and religion.  Using data from the European Social Survey, I ran regressions to estimate the effect of media consumption (television, radio, newspapers, and Internet) on religious affiliation and degree of religiosity.

As control variables, I include year born, education and income (expressed as relative ranks within each country) and the year the data were gathered (between 2002 and 2010).

Some of the findings so far:

1) In almost every country, younger people are less religious.
2) In most countries, people with more education are less religious.
3) In about half of the 34 countries, people with lower income are less religious.  In the rest, the effect (if any) is too small to be distinguished from noise with this sample size.
4) In most countries, people who watch more television are less religious.
5) In fewer than half of the countries, people who listen to more radio are less religious.
6) The results for newspapers are similar: only a few countries show a negative effect, and in some countries the effect is positive.
7) In almost every country, people who use the Internet are less religious.
8) There is a weak relationship between the strength of the effect and the average degree of religiosity: the negative effect of Internet use on religion tends to be stronger in more religious countries.

In the previous article, I measured effect size using the parameters of the regression models: for logistic regression, the parameter is a log odds ratio; for linear regression, it is a linear weight.  These parameters are not useful for comparing the effects of different factors, because they are not on the same scale, and they are not the best choice for comparing effect size between countries, because they don't take into account variation in each factor in each country.
For example, in one country the parameter associated with Internet use might be small, but if there is large variation in Internet use within the country, the net effect size might be greater than in another country with a larger parameter, but little variation.
Effect size

So my next step is to define effect size in terms that are comparable between factors and between countries.  To explain the methodology, I'll use the logistic model, which predicts the probability of religious affiliation.  I start by fitting the model to the data, then use the model to predict the probability of affiliation for a hypothetical respondent whose values for all factors are the national mean.  Then I vary one factor at a time, generating predictions for hypothetical respondents whose value for one factor is at the 25th percentile (within country) and at the 75th percentile.  Finally, I compute the difference in predicted values in percentage points.
As an example, suppose a hypothetically average respondent has a 45% chance of reporting a religious affiliation, as predicted by the model.  And suppose the 25th and 75th percentiles of Internet use are 2 and 7, on a 7 point scale.  A person who is average in every way, but with Internet use only 2 might have a 47% chance of affiliation.  The same person with Internet use 7 might have a 42% chance.  In that case I would report that the effect size is a difference of 5 percentage points.
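Here is a minimal sketch of that computation, assuming a statsmodels logit result fitted with a formula, one country's DataFrame, and a list of the explanatory column names; the details of the actual notebook may differ.

import numpy as np
import pandas as pd

def effect_size(results, df, factor, exog_cols):
    """Percentage-point change in predicted affiliation when `factor` moves
    from its 25th to its 75th percentile, with everything else at the mean."""
    base = df[exog_cols].mean()              # hypothetical nationally-average respondent
    low, high = base.copy(), base.copy()
    low[factor] = df[factor].quantile(0.25)
    high[factor] = df[factor].quantile(0.75)
    preds = np.asarray(results.predict(pd.DataFrame([low, high])))
    return 100 * (preds[1] - preds[0])       # difference in percentage points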
As in the previous article, I run this analysis on about 200 iterations of resampled data, then compute a median and 95% confidence interval for each value.
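And a minimal sketch of the resampling loop, using the effect_size function above; this is a plain bootstrap, whereas the actual analysis also refills missing values on each iteration, and its resampling procedure may differ in other details.

import numpy as np
import statsmodels.formula.api as smf

def resampled_effect(df, formula, factor, exog_cols, iters=201):
    """Median and 95% interval of the effect size over bootstrap resamples."""
    sizes = []
    for _ in range(iters):
        sample = df.sample(n=len(df), replace=True)
        results = smf.logit(formula, data=sample).fit(disp=False)
        sizes.append(effect_size(results, sample, factor, exog_cols))
    low, median, high = np.percentile(sizes, [2.5, 50, 97.5])
    return median, (low, high)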
The IPython notebook for this installment is here.
Quadratic age model

Before I get to the results, there is one other change from the previous installment:  I added a quadratic term for year born.  The reason is that in preliminary results, I noticed that Internet had the strongest negative association with religiosity, followed by television, then radio and newspapers.  I wondered whether this pattern might be the result of correlation with age; that is, whether younger people are more likely to consume new media and be less religious.  I was already controlling for age using yrborn60 (year born minus 1960) but I worried that if the relationship with age is nonlinear, I might not be controlling for it effectively.
So I added a quadratic term to the model.  Here are the estimated parameters for the linear term and quadratic term:
In many countries, both parameters are statistically significant, so I am inclined to keep them in the model.  The sign of the quadratic term is usually positive, so the curves are convex up, which suggests that the age effect might be slowing down.
Anyway, including the quadratic term has almost no effect on the other results: the relative strengths of the associations are the same.
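For reference, a minimal sketch of the model formula with the quadratic term, with inwyr again standing in for year of interview:

import statsmodels.formula.api as smf

# Affiliation model with linear and quadratic terms for year born (yrborn60 is
# year born minus 1960).
quadratic_formula = ('hasrelig ~ yrborn60 + I(yrborn60**2) + inwyr + edurank + '
                     'hincrank + tvtot + rdtot + nwsptot + netuse')

# results = smf.logit(quadratic_formula, data=df_country).fit(disp=False)
# The sign of the parameter on the quadratic term indicates whether the age
# curve is convex (positive) or concave (negative).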
Model 1 results

Again, the first model uses logistic regression with dependent variable hasrelig, which indicates whether the respondent reports a religious affiliation.

In the following figures, the x-axis is the percentage point difference in hasrelig between people at the 25th and 75th percentile for each explanatory variable.
 In most countries, people with more education are less religious.
 In most countries, the effect of income is small and not statistically significant.
 The effect of television is negative in most countries.
The effect of radio is usually small.
 The effect of newspapers is usually small.
In most countries, Internet use is associated with substantial decreases in religious affiliation.

Pulling together the results so far, the following figure shows the distribution (CDF) of effect size across countries:
 Overall, the effect size for Internet use is the largest, followed by education and television.  The effect sizes for income, radio, and newspaper are all small, and centered around zero.

Model 2 results

The second model uses linear regression with dependent variable rlgdgr, which indicates degree of religiosity on a 0-10 scale.


In the following figures, the x-axis shows the difference in rlgdgr between people at the 25th and 75th percentile for each explanatory variable.

In most countries, people with more education are less religious.
The effect size for income is smaller.
 People who watch more television are less religious.
The effect size for radio is small.
The effect size for newspapers is small.
 In most countries, people who use the Internet more are less religious.
Comparing the effect sizes for the different explanatory variables, again, Internet use has the biggest effect, followed by education and television.  Effect sizes for income, radio, and newspapers are smaller and centered around zero.

That's all for now.  I have a few things to check out, and then I should probably wrap things up.
Published on November 24, 2015 12:31

November 19, 2015

Internet use and religion, part three

This article reports preliminary results from an exploration of the relationship between religion and Internet use in Europe, using data from the European Social Survey (ESS).
I describe the data processing pipeline and models in this previous article.  All the code for this article is in this IPython notebook.
Data inventory

The dependent variables I use in the models are

rlgblg: Do you consider yourself as belonging to any particular religion or denomination?

rlgdgr: Regardless of whether you belong to a particular religion, how religious would you say you are?  Scale from 0 = Not at all religious to 10 = Very religious.

The explanatory variables are

yrbrn: And in what year were you born?

hincrank: Household income, rank from 0-1 indicating where this respondent falls relative to respondents from the same country, same round of interviews (one way to compute such ranks is sketched after this list of variables).

edurank: Years of education, rank from 0-1 indicating where this respondent falls relative to respondents from the same country, same round of interviews.

tvtot: On an average weekday, how much time, in total, do you spend watching television?  Scale from 0 = No time at all to 7 = More than 3 hours.

rdtot: On an average weekday, how much time, in total, do you spend listening to the radio? Scale from 0 = No time at all to 7 = More than 3 hours.

nwsptot: On an average weekday, how much time, in total, do you spend reading the newspapers? Scale from 0 = No time at all to 7 = More than 3 hours.

netuse: Now, using this card, how often do you use the internet, the World Wide Web or e-mail - whether at home or at work - for your personal use?  Scale from 0 = No access at home or work, 1 = Never use, 6 = Several times a week, 7 = Every day.
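For hincrank and edurank, here is a minimal sketch of the within-country, within-round rank transform; hinctnta and eduyrs are assumed names for the raw income and education variables, and cntry and essround for the country and round codes.

def add_ranks(df):
    """Add percentile ranks (0-1) of income and education, computed within
    each country and round of interviews."""
    grouped = df.groupby(['cntry', 'essround'])
    df = df.copy()
    df['hincrank'] = grouped['hinctnta'].rank(pct=True)
    df['edurank'] = grouped['eduyrs'].rank(pct=True)
    return df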
Model 1: Affiliated or not?

In the first model, the dependent variable is rlgblg, which indicates whether the respondent is affiliated with a religion.
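A minimal sketch of the per-country logistic fit, assuming rlgblg has been recoded so that 1 means affiliated and 0 means not, with yrborn60 as year born minus 1960 and inwyr an assumed name for year of interview:

import statsmodels.formula.api as smf

affiliation_formula = ('rlgblg ~ yrborn60 + inwyr + edurank + hincrank + '
                       'tvtot + rdtot + nwsptot + netuse')

def fit_country(df_country):
    """Logistic regression for one country; the parameters are log odds ratios."""
    results = smf.logit(affiliation_formula, data=df_country).fit(disp=False)
    return results.params, results.conf_int()   # estimates and 95% CIs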
The following figures show estimated parameters from logistic regression, for each of the explanatory variables.  The parameters are log odds ratios: negative values indicate that the variable decreases the likelihood of affiliation; positive values indicate that it increases the likelihood.
The horizontal lines show the 95% confidence interval for the parameters, which includes the effects of random sampling and filling missing values.  Confidence intervals that cross the zero line indicate that the parameter is not statistically significant at the p<0.05 level. In most countries, interview year has no apparent effect.  I will probably drop it from the next iteration of the model.  Year born has a consistent negative effect, indicating that younger people are less likely to be affiliated.  Possible exceptions are Israel, Turkey and Cyprus.
 In most countries, people with more education are less likely to be affiliated.  Possible exceptions: Latvia, Sweden, and the UK.
In a few countries, income might have an effect, positive or negative.  But in most countries it is not statistically significant.
 It looks like television might have a positive or negative effect in several countries.
 In most countries the effect of radio is not statistically significant.  Possible exceptions are Portugal, Greece, Bulgaria, the Netherlands, Estonia, the UK, Belgium, and Germany.
 In most countries the effect of newspapers is not statistically significant.  Possible exceptions are Turkey, Greece, Italy, Spain, Croatia, Estonia, Portugal and Norway.
 In the majority of countries, Internet use (which includes email and web) is associated with religious disaffiliation.  The estimated parameter is only positive in 4 countries, and not statistically significant in any of them.  The effect of Internet use appears strongest in Poland, Portugal, Israel, and Austria.

The following scatterplot shows the estimated parameter for Internet use on the x-axis, and the fraction of people who report religious affiliation on the y-axis.  There is a weak negative correlation between them (rho = -0.38), indicating that the effect of Internet use is stronger in countries with higher rates of affiliation.
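A minimal sketch of that correlation, assuming params is an array of per-country Internet-use parameters and rates is the corresponding fraction affiliated; the post does not say whether rho is a Pearson or Spearman coefficient, so both are shown.

from scipy.stats import pearsonr, spearmanr

def internet_effect_correlation(params, rates):
    """Correlation between per-country Internet-use parameters and the
    fraction of respondents reporting an affiliation."""
    r, _ = pearsonr(params, rates)
    rho, _ = spearmanr(params, rates)
    return r, rho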

Model 2: Degree of religiosity

In the second model, the dependent variable is rlgdgr, a self-reported degree of religiosity on a 0-10 scale (where 0 = not at all religious and 10 = very religious).
The following figures show estimated parameters from linear regression, for each of the explanatory variables.  Negative values indicate that the variable predicts a lower degree of religiosity; positive values indicate a higher degree.
Again, the horizontal lines show the 95% confidence interval for the parameters; intervals that cross the zero line are not statistically significant at the p<0.05 level.  As in Model 1, interview year is almost never statistically significant.
 Younger people are less religious in every country except Israel.
 In most countries, people with more education are less religious, with possible exceptions Estonia, the UK, and Latvia.
 In about half of the countries, people with higher income are less religious.  One possible exception: Germany.
 In most countries, people who watch more television are less religious.  Possible exceptions: Greece and Italy.
In several countries, people who listen to the radio are less religious.  Possible exceptions: Slovenia, Austria, Poland, Israel, Croatia, Lithuania.
 In some countries, people who read newspapers more are less religious, but in some other countries they are more religious.
 In almost every country, people who use the Internet more are less religious.  The estimated parameter is only positive in three countries, and none of them are statistically significant.  The effect of Internet use appears to be particularly strong in Israel and Luxembourg.

The following scatterplot shows the estimated parameter for Internet use on the x-axis and the national average degree of religiosity on the y-axis.  Using all data points, the coefficient of correlation is -0.16, but if we exclude the outliers, it is -0.36, indicating that the effect of Internet use is stronger in countries with higher degrees of religiosity.
Next steps

I am working on a second round of visualizations that show the size of the Internet effect in each country, expressed in terms of differences between people at the 25th, 50th, and 75th percentiles of Internet use.

I am also open to suggestions for further explorations.  And if anyone has insight into some of the countries that show up as exceptions to the common patterns, I would be interested to hear it.




Published on November 19, 2015 10:37
