Naked Statistics: Stripping the Dread from the Data
Read between October 14, 2018 - February 16, 2019
65%
However, we know from the central limit theorem that the mean for a large, properly drawn sample will not typically deviate wildly from the mean for the population as a whole.
65%
American adult population in what is known as a t-distribution. (Basically the t-distribution is more dispersed than the normal distribution and therefore has “fatter tails.”)
65%
This means that if we were to do this analysis repeatedly—say with 100 different samples—then we would expect our observed regression coefficient to be within two standard errors of the true population parameter roughly 95 times out of 100.
65%
95 percent confidence interval. We can say that 95 times out of 100, we expect our confidence interval, which is 4.5 ± .26, to contain the true population parameter. This is the range between 4.24 and 4.76. A basic statistics package will calculate this interval as well.
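The arithmetic behind that interval can be sketched as follows. This is a minimal illustration, not the statistics package's own routine: the standard error of 0.13 is an assumption inferred from the ± .26 margin (two standard errors), and the function name is hypothetical.

```python
def confidence_interval_95(coefficient, standard_error):
    """Approximate 95% interval: the coefficient plus or minus two standard errors."""
    margin = 2 * standard_error
    return coefficient - margin, coefficient + margin

# 4.5 is the coefficient from the passage; 0.13 is assumed so that 2 * SE = 0.26.
low, high = confidence_interval_95(4.5, 0.13)
print(f"{low:.2f} to {high:.2f}")  # prints "4.24 to 4.76"
```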
66%
Thus, we can reject the null hypothesis that there is no association between height and weight for the general population at the 95 percent confidence level.
66%
rule of thumb is that the coefficient is likely to be statistically significant when the coefficient is at least twice the size of the standard error.*
66%
Remember, we have not proved that taller people weigh more in the general population; we have merely shown that our results for the Changing Lives sample would be highly anomalous if that were not the case.
66%
the dependent variable while holding the other explanatory variables constant, or “controlling for” these other factors.
66%
(often called multiple regression analysis when more than one explanatory variable is involved, or multivariate regression analysis)
66%
WEIGHT = –118 + 4.3 × (HEIGHT IN INCHES) + .12 × (AGE IN YEARS) – 4.8 × (IF SEX IS FEMALE)
Our best estimate of the weight of a fifty-three-year-old woman who is 5 feet 5 inches is: –118 + 4.3 × (65) + .12 × (53) – 4.8 = 163 pounds.
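The same estimate can be reproduced in a few lines. This is only a sketch of the quoted equation; the function and parameter names are hypothetical, and the sex term is treated as a 0/1 indicator.

```python
def predicted_weight(height_inches, age_years, is_female):
    """Apply the quoted regression equation; the result is in pounds."""
    female_indicator = 1 if is_female else 0
    return -118 + 4.3 * height_inches + 0.12 * age_years - 4.8 * female_indicator

# A fifty-three-year-old woman who is 5 feet 5 inches (65 inches) tall.
estimate = predicted_weight(height_inches=65, age_years=53, is_female=True)
print(round(estimate))  # prints 163
```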
67%
African Americans might be more likely than other residents to live in “food deserts,” which are areas with limited access to grocery stores that carry fruits, vegetables, and other fresh produce.
67%
(An R² of zero means that our regression equation does no better than the mean at predicting the weight of any individual in the sample; an R² of 1 means that the regression equation perfectly predicts the weight of every person in the sample.) A lot of the variation in weight across individuals remains unexplained.
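The two endpoints described here follow directly from the usual definition of R² as one minus the ratio of residual variation to total variation. The sketch below uses made-up weights purely for illustration; the function name is hypothetical.

```python
def r_squared(actual, predicted):
    """1 minus (sum of squared residuals / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

weights = [150, 160, 170, 180]  # made-up weights in pounds, not study data
mean_weight = sum(weights) / len(weights)

r2_mean_only = r_squared(weights, [mean_weight] * len(weights))  # always guess the mean
r2_perfect = r_squared(weights, weights)                         # guess every weight exactly
print(r2_mean_only, r2_perfect)  # prints "0.0 1.0"
```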
67%
Education turns out to be negatively associated with weight, as I had hypothesized. Among participants in the Changing Lives study, each year of ...
67%
What is going on? The honest answer is that I have no idea. Let me reiterate a point that was buried earlier in a footnote: I’m just playing around with data here to illustrate how regression analysis works. The analytics presented here are to true academic research what street hockey is to the NHL. If this were a real research project, there would be weeks or months of follow-on analysis to probe this finding.
67%
As an example, let’s look at a paper by three economists examining the wage trajectories of a sample of roughly 2,500 men and women who graduated with MBAs from the Booth School of Business at the University of Chicago.¹ Upon graduation, male and female graduates have very similar average starting salaries: $130,000 for men and $115,000 for women. After ten years in the workforce, however, a huge gap has opened up; women on average are earning a striking 45 percent less than their male classmates: $243,000 versus $442,000. In a broader sample of more than 18,000 MBA graduates who entered the ...
68%
For workers who have been in the labor force more than ten years, the authors can ultimately explain all but 1 percent of the gender wage gap with factors unrelated to discrimination on the job.* They conclude, “We identify three proximate reasons for the large and rising gender gap in earnings: differences in training prior to MBA graduation; differences in career interruptions; and differences in weekly hours. These three determinants can explain the bulk of gender differences across the years following MBA completion.”
68%
To get your mind around how we can isolate the effect on weight of a single variable, say, education, imagine the following situation. Assume that all of the Changing Lives participants are convened in one place—
68%
If we have enough participants in our study, we can further subdivide each of those rooms by income. Eventually we will have lots of rooms, each of which contains individuals who are identical in all respects except for education and weight, which are the two variables we care about.
68%
There would be a room of forty-five-year-old 5-foot 5-inch men who earn $30,000 to $40,000 a year. Next door would be all the forty-five-year-old 5-foot 5-inch women who earn $30,000 to $40,000 a year. And so on (and on and on). There will still be some
68%
much of the remaining variation in weight in each room can be explained by education. In other words, what is the best linear relationship between education and weight in each room?
68%
The whole point of this exercise is to calculate a single coefficient that best expresses the relationship between education and weight for the entire sample, while holding other factors constant. What we would like to calculate is the single coefficient for education that we can use in every room to minimize the sum of the squared residuals for all of the rooms combined.
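One way to picture that single coefficient is sketched below. It assumes each room gets its own baseline (its own average education and weight) and then finds the one slope that minimizes the squared residuals pooled across all rooms. The data are invented for illustration, the function name is hypothetical, and a real statistics package would reach the result by a more general method.

```python
def pooled_slope(rooms):
    """Single slope minimizing squared residuals across rooms,
    with each room measured relative to its own averages."""
    numerator = 0.0
    denominator = 0.0
    for room in rooms:
        mean_x = sum(x for x, _ in room) / len(room)
        mean_y = sum(y for _, y in room) / len(room)
        for x, y in room:
            numerator += (x - mean_x) * (y - mean_y)
            denominator += (x - mean_x) ** 2
    return numerator / denominator

# Each room holds (years of education, weight in pounds) pairs; values are made up.
rooms = [
    [(12, 180), (16, 172)],
    [(10, 150), (14, 142), (18, 134)],
]
education_coefficient = pooled_slope(rooms)
print(education_coefficient)  # prints -2.0
```

In this toy data set each extra year of education goes with two fewer pounds in every room, so the pooled coefficient is -2 even though the rooms have very different average weights.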
68%
As an aside, you can see why large data sets are so useful. They allow us to control for many factors while still having many observations in each “room.” Obviously a computer can do all of this in a split second without herding thousands of people into different rooms.
68%
Skepticism is always a good first response. I wrote at the outset of the chapter that “low-control” jobs are bad for your health.
69%
the “low-control” idea evolved into a term known as “job strain,” which characterizes jobs with “high psychological workload demands” and “low decision latitude.” Between 1981 and 1993, thirty-six studies were published on the subject; most found a significant positive association between job strain and heart disease.
69%
The t-distribution
69%
Our sample of 25 will still give us meaningful information, as would a sample of 5 or 10—but how meaningful? The t-distribution answers that question.
69%
They will still be distributed around the true coefficient for the whole population, but the shape of that distribution will not be our familiar bell-shaped normal curve.
69%
The t-distribution is actually a series, or “family,” of probability density functions that vary according to the size of our sample. Specifically, the more data we have in our sample, the more “degrees of freedom” we have when determining the appropriate distribution against which to evaluate our results.
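The "family" idea can be made concrete with the standard density formula for Student's t-distribution. The sketch below uses only the Python standard library and a simple midpoint-rule integration; the function names and the cutoff of 2 are choices made for illustration.

```python
import math

def t_density(x, df):
    """Probability density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))

def two_sided_tail(cutoff, df, steps=20000):
    """P(|T| > cutoff), found by integrating the density over [-cutoff, cutoff]."""
    width = 2 * cutoff / steps
    inside = sum(t_density(-cutoff + (i + 0.5) * width, df)
                 for i in range(steps)) * width
    return 1 - inside

# Share of the distribution lying more than 2 units from the center.
tails = {df: two_sided_tail(2.0, df) for df in (4, 30, 1000)}
for df, tail in tails.items():
    print(df, round(tail, 3))
```

With few degrees of freedom a noticeably larger share of the distribution sits beyond 2; as the degrees of freedom grow, the share shrinks toward the normal curve's figure of about 4.6 percent.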
69%
For instance, a basic regression analysis with a sample of 10 and a single explanatory variable has 8 degrees of freedom (the sample size minus the two estimated parameters, the intercept and the slope).
69%
the coefficient on a particular variable is zero. Once we get the regression results, we would calculate a t-statistic, which is the ratio of the observed coefficient to the standard error for that coefficient.*
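As a quick sketch of that ratio, the snippet below reuses the height coefficient of 4.5 quoted earlier. The standard error of 0.13 is an assumption (half of the ± .26 margin), and the cutoff of 2 is the rule of thumb mentioned above rather than an exact critical value.

```python
def t_statistic(coefficient, standard_error):
    """Ratio of the observed coefficient to its standard error."""
    return coefficient / standard_error

# 4.5 is the quoted coefficient; 0.13 is an assumed standard error.
t = t_statistic(4.5, 0.13)
likely_significant = abs(t) >= 2  # rule of thumb, not an exact test
print(round(t, 1), likely_significant)  # prints "34.6 True"
```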
70%
The fewer the degrees of freedom (and therefore the “fatter” the tails of the relevant t-distribution),
70%
large data set like the Nurses’ Health Study for statistical associations that may or may not be causal, a clinical trial consists of a controlled experiment. One sample is given a treatment, such as hormone replacement; another sample is given a placebo. Clinical trials showed that women taking estrogen had a higher incidence of heart disease, stroke, blood clots, breast cancer, and other adverse health outcomes.
71%
Regression analysis is the hydrogen bomb of the statistics arsenal. Every person with a personal computer and a large data set can be a researcher in his or her own home or cubicle. What could possibly go wrong? All kinds of things. Regression analysis provides precise answers to complicated questions. These answers may or may not be accurate. In the wrong hands, regression analysis will yield results that are misleading or just plain wrong.
71%
Have you ever read the warning label on a hair dryer—the part that cautions, Do Not Use in the Bath Tub? And you think to yourself, “What kind of moron uses a hair dryer in the bath tub?” It’s an electrical appliance; you don’t use electrical appliances around water. They’re not designed for that. If regression analysis had a similar warning label, it would say, Do Not Use When There Is Not a Linear Association between the Variables That You Are Analyzing.
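A small sketch of what goes wrong: in the made-up data below, y depends entirely on x, but the relationship is U-shaped rather than linear. The fitted slope comes out at zero, which would wrongly suggest no association at all. The function name and data are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope for a regression of ys on xs with an intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

xs = [-3, -2, -1, 0, 1, 2, 3]   # made-up explanatory variable
ys = [x ** 2 for x in xs]       # y is fully determined by x, but not linearly

slope = ols_slope(xs, ys)
print(slope)
```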
71%
that we cannot accurately summarize the relationship between lessons and scores with a single coefficient.
71%
The results you get will be the statistical equivalent of using a hair dryer in the bath tub.
71%
Correlation does not equal causation. Regression analysis can only demonstrate an association between two variables. As I have mentioned before, we cannot prove with statistics alone that a change in one variable is causing a change in the other.
71%
Suppose we were searching for potential causes for the rising rate of autism in the United States over the last two decades. Our dependent variable—the outcome we are seeking to explain—would be some measure of the incidence of autism by year, such as the number of diagnosed cases for every 1,000 children of a certain age. If we were to include annual per capita income in China as an explanatory variable, we would almost certainly find a positive and statistically significant association between rising incomes in China and rising autism rates in the U.S. over the past twenty years.
72%
Reverse causality. A statistical association between A and B does not prove that A causes B. In fact, it’s entirely plausible that B is causing A. I alluded to this possibility earlier in
72%
poorly; bad golf is causing more lessons, not the other way around. (There are some simple methodological
72%
K–12 education. A positive and significant association between these two variables does not provide any insight into which direction the relationship happens to run. Investments in K–12 education could cause economic growth. On the other hand, states that have strong economies can afford to spend more on K–12 education, so the strong economy
72%
Or, education spending could boost economic growth, which makes possible additional education spending—the causality could be going in both ways.
72%
For example, it would be inappropriate to use the unemployment rate in a regression equation explaining GDP growth, since unemployment is clearly affected by the rate of GDP growth. Or, to think of it another way, a regression
72%
Omitted variable bias. You should be skeptical the next time you see a huge headline proclaiming, “Golfers More Prone to Heart Disease, Cancer, and Arthritis!”
72%
Any study that attempts to measure the effects of playing golf on health must control properly for age.
72%
Golf isn’t killing people; old age is killing people, and they happen to enjoy playing golf while it does.
72%
true. Regression results will be misleading and inaccurate if the regression equation leaves out an important explanatory variable, particularly if other variables in the equation “pick up” that effect.
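The golf example can be sketched with invented numbers. In the data below, the illness score depends only on age, but older people also play more golf. Leaving age out makes golf "pick up" the age effect; looking within each age group makes the apparent golf effect vanish. All values and names are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope for a regression of ys on xs with an intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# (rounds of golf per year, illness score); made-up values, not real data.
young = [(0, 1), (5, 1), (10, 1)]    # thirty-year-olds: healthy however much they play
old = [(20, 5), (25, 5), (30, 5)]    # seventy-year-olds: sicker however much they play

everyone = young + old
slope_ignoring_age = ols_slope([g for g, _ in everyone], [s for _, s in everyone])
slope_young = ols_slope([g for g, _ in young], [s for _, s in young])
slope_old = ols_slope([g for g, _ in old], [s for _, s in old])

print(round(slope_ignoring_age, 3), slope_young, slope_old)
```

Ignoring age, more golf appears to go with more illness; within either age group, the slope is zero.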