Naked Statistics: Stripping the Dread from the Data
1%
I particularly disliked high school calculus for the simple reason that no one ever bothered to tell me why I needed to learn it. What is the area beneath a parabola? Who cares?
1%
Curiously, I loved physics in high school, even though physics relies very heavily on the very same calculus that I refused to do in Mrs.
2%
Because physics has a clear purpose.
2%
I love statistics. Statistics can be used to explain everything from DNA testing to the idiocy of playing the lottery.
2%
a goat showed up behind one of the doors that he didn’t pick. Should he switch? The answer is yes. Why? That’s in Chapter 5½.
2%
The paradox of statistics is that they are everywhere—from batting averages to presidential polls—but
3%
Or maybe we just need to think more clearly about what many workers are doing during that ten-minute break. My professional experience suggests that many of those workers who report leaving their offices for short breaks are huddled outside the entrance of the building smoking cigarettes.
3%
does your credit card company use data on what you are buying to predict if you are likely to miss a payment? (Seriously, they can do that.)
3%
It’s easy to lie with statistics, but it’s hard to tell the truth without them.
63%
Most of the studies that you read about in the newspaper are based on regression analysis.
63%
The problem is that the mechanics of regression analysis are not the hard part; the hard part is determining which variables ought to be considered in the analysis and how that can best be done.
64%
There are so many potential regression pitfalls
64%
regression analysis—from the simplest statistical relationships to the complex models cobbled together by Nobel Prize winners. At its core, regression analysis seeks to find the “best fit” for a linear relationship between two variables.
64%
Regression analysis enables us to go one step further and “fit a line” that best describes a linear relationship between the two variables.
64%
It should be intuitive that the larger the sum of residuals overall, the worse the fit of the line.
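This intuition is easy to check. The sketch below uses made-up height and weight points (not the book's data) and compares the sum of squared residuals, which is the quantity ordinary least squares actually totals, for a well-chosen line and a clearly wrong one:

```python
# Made-up (height, weight) points and two candidate lines of the
# form weight = a + b * height.
points = [(60, 130), (65, 150), (70, 175), (75, 200)]

def sum_squared_residuals(a, b):
    """Total of squared residuals for the line weight = a + b * height."""
    return sum((w - (a + b * h)) ** 2 for h, w in points)

good_fit = sum_squared_residuals(-140, 4.5)  # tracks the trend closely
bad_fit = sum_squared_residuals(0, 1.0)      # clearly the wrong line
print(good_fit, bad_fit)  # 12.5 38775.0
```

The line that hugs the points leaves a far smaller total of squared residuals, which is exactly the sense in which it is the better fit.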
64%
ordinary least squares gives us the best description of a linear relationship between two variables.
64%
The regression line certainly does not describe every observation in the data set perfectly. But it is the best description we can muster for what is clearly a meaningful relationship between height and weight. It also means that every observation can be explained as WEIGHT = a + b(HEIGHT) + e, where e is a “residual” that catches the variation in weight for each individual that is not explained by height.
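A minimal ordinary-least-squares fit can be written out directly. The numbers below are invented for illustration (not the Changing Lives sample); the point is that a and b come from simple sums, and the residuals e are whatever is left over:

```python
# OLS fit on made-up height/weight data.  The slope b and intercept a
# minimize the sum of squared residuals.
heights = [62, 64, 66, 68, 70, 72, 74]
weights = [120, 136, 152, 145, 180, 176, 192]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Slope: covariance of height and weight over the variance of height.
b = (sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
     / sum((h - mean_h) ** 2 for h in heights))
# Intercept: forces the line through the point of means.
a = mean_w - b * mean_h

# WEIGHT = a + b * HEIGHT + e for every observation; by construction
# the residuals sum to zero.
residuals = [w - (a + b * h) for h, w in zip(heights, weights)]
print(round(a, 1), round(b, 2))
```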
65%
a one-unit increase in the independent variable (height) is associated with an increase of 4.5 units in the dependent variable (weight).
65%
Thus, if we had no other information, our best guess for the weight of a person who is 5 feet 10 inches tall (70 inches) in the Changing Lives study would be –135 + 4.5 (70) = 180 pounds.
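The arithmetic is worth confirming: plugging 70 inches into the fitted equation quoted in the text reproduces the 180-pound prediction.

```python
# Evaluate the fitted equation weight = -135 + 4.5 * height at 70 inches.
a, b = -135, 4.5
height_inches = 70
predicted_weight = a + b * height_inches
print(predicted_weight)  # 180.0
```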
65%
For any regression coefficient, you will generally be interested in three things: sign, size, and significance.
65%
having perfect teeth may be associated with other personality traits that explain the earnings advantage; the earnings effect may be caused by the kind of people who care about their teeth, not the teeth themselves.
65%
does it reflect a meaningful association that is likely to be observed for the population as a whole?
65%
However, we know from the central limit theorem that the mean for a large, properly drawn sample will not typically deviate wildly from the mean for the population as a whole. Similarly, we can assume that the observed relationship between variables like height and weight will not typically bounce around wildly from sample to sample, assuming that these samples are large and properly drawn from the same population.
66%
Once again, the normal distribution is our friend.
66%
we can calculate a standard error for the regression coefficient that gives us a sense of how much dispersion we should expect in the coefficients from sample to sample.
66%
the normal distribution is no longer willing to be our friend.
66%
(Basically the t-distribution is more dispersed than the normal distribution and therefore has “fatter tails.”)
66%
any basic statistical software package will easily manage the additional complexity associated with using the t-distributions.
66%
We can say that 95 times out of 100, we expect our confidence interval, which is 4.5 ± .26, to contain the true population parameter.
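That interval is just the coefficient plus or minus roughly two standard errors. A quick sketch using the large-sample multiplier of 1.96 (the exact t-multiplier would differ slightly):

```python
# 95 percent confidence interval from the quoted coefficient (4.5) and
# standard error (.13).
coefficient = 4.5
standard_error = 0.13
margin = 1.96 * standard_error            # about .26, as in the text
low, high = coefficient - margin, coefficient + margin
print(round(low, 2), round(high, 2))      # 4.25 4.75
```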
66%
there is only a 5 percent chance that we are wrongly rejecting the null hypothesis.
66%
In fact, our results are even more extreme than that. The standard error (.13) is extremely low relative to the size of the coefficient (4.5).
66%
One rough rule of thumb is that the coefficient is likely to be statistically significant when the coefficient is at least twice the size of the standard error.* A statistics package also calculates a p-value, which is .000 in this case, meaning that there is essentia...
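The twice-the-standard-error rule of thumb is easy to verify with the quoted numbers: dividing the coefficient by its standard error gives a t-ratio of roughly 35, vastly beyond the threshold of about 2.

```python
# Rule of thumb: a coefficient at least twice its standard error is
# likely statistically significant.
coefficient = 4.5
standard_error = 0.13
t_ratio = coefficient / standard_error
print(round(t_ratio, 1), t_ratio > 2)  # 34.6 True
```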
66%
The R² tells us how much of that variation around the mean is associated with differences in height alone. The answer in our case is .25, or 25 percent. The more significant point may be that 75 percent of the variation in weight for our sample remains unexplained. There are clearly factors other than height that might help us understand the weights of the Changing Lives participants. This is where things get more interesting.
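For a one-variable regression, R² is the squared correlation between the two variables: explained variation over total variation. A sketch on made-up height/weight numbers (not the Changing Lives sample):

```python
# R-squared: the share of total variation in weight around its mean
# that the fitted line accounts for (invented data).
heights = [62, 64, 66, 68, 70, 72, 74]
weights = [120, 136, 152, 145, 180, 176, 192]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

sxy = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
sxx = sum((h - mean_h) ** 2 for h in heights)
syy = sum((w - mean_w) ** 2 for w in weights)

# For simple regression, R-squared equals the squared correlation.
r_squared = sxy ** 2 / (sxx * syy)
print(round(r_squared, 2))
```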
66%
When we include multiple variables in the regression equation, the analysis gives us an estimate of the linear association between each explanatory variable and the dependent variable while holding the other explanatory variables constant, or “controlling for” these other factors.
67%
Regression analysis (often called multiple regression analysis when more than one explanatory variable is involved, or multivariate regression analysis)
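With more than one explanatory variable, the coefficients come from solving the normal equations (XᵀX)β = Xᵀy. The sketch below uses invented data generated without noise as -135 + 4.5·height + 0.5·age, so the fit recovers the true coefficients; every number here is an illustrative assumption:

```python
# Multiple regression via the normal equations, on invented data.
rows = [(62, 30), (64, 45), (66, 25), (68, 50), (70, 35), (72, 60), (74, 40)]
y = [-135 + 4.5 * h + 0.5 * g for h, g in rows]   # exact, no noise

# Design matrix with an intercept column: [1, height, age]
X = [[1.0, h, g] for h, g in rows]

k = 3
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]

def solve(m, v):
    """Gaussian elimination with partial pivoting for a small system."""
    aug = [row[:] + [val] for row, val in zip(m, v)]
    n = len(aug)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    beta = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (aug[r][n] - s) / aug[r][r]
    return beta

intercept, b_height, b_age = solve(xtx, xty)
print(round(intercept, 1), round(b_height, 2), round(b_age, 2))
```

Each coefficient is the association with the dependent variable holding the other explanatory variables constant, which is what "controlling for" means mechanically.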
67%
I have included a table with the complete results of this regression equation in the appendix to this chapter.
67%
an R² of zero means that our regression equation does no better than the mean at predicting the weight of any individual in the sample; an R² of 1 means that the regression equation perfectly predicts the weight of every person in the sample.)
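Both extremes can be produced with toy data: points that lie exactly on a line give an R² of 1, while points whose best-fit line is just the flat mean give an R² of 0.

```python
# The two R-squared extremes on made-up data.
def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

perfect = r_squared([1, 2, 3, 4], [10, 20, 30, 40])  # exactly linear
useless = r_squared([1, 2, 3, 4], [10, 20, 20, 10])  # symmetric: flat fit
print(perfect, useless)  # 1.0 0.0
```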
68%
If this were a real research project, there would be weeks or months of follow-on analysis to probe this finding.
68%
The gender wage gap fades away as the authors add more explanatory variables to the analysis.
68%
the value of multiple regression analysis, particularly the research insights that stem from being able to isolate the effect of one explanatory variable while controlling for other confounding factors.
69%
Our goal now is to see how much of the remaining variation in weight in each room can be explained by education. In other words, what is the best linear relationship between education and weight in each room?
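The "rooms" device can be sketched as fitting the education-weight line separately inside each group. The room names and all numbers below are hypothetical, invented purely to show the mechanics:

```python
# Separate best-fit slopes within each "room" (group) of invented data.
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

rooms = {
    "room_a": ([12, 14, 16, 18], [180, 176, 171, 168]),  # (education, weight)
    "room_b": ([12, 14, 16, 18], [150, 147, 143, 140]),
}
slopes = {name: slope(edu, wt) for name, (edu, wt) in rooms.items()}
for name in sorted(slopes):
    print(name, round(slopes[name], 2))
```

Each within-room slope describes the education-weight relationship after the other factors that sorted people into rooms have already been held constant.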