Kindle Notes & Highlights
Started reading
November 28, 2019
Curiously, I loved physics in high school, even though physics relies very heavily on the very same calculus that I refused to do in Mrs.
Because physics has a clear purpose.
I love statistics. Statistics can be used to explain everything from DNA testing to the idiocy of playing the lottery.
a goat showed up behind one of the doors that he didn’t pick. Should he switch? The answer is yes. Why? That’s in Chapter 5½.
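The switch-or-stay question is easy to check by brute force. A minimal simulation sketch (not from the book; the two-thirds figure comes out of the simulation itself):

```python
import random

random.seed(42)

def play(switch):
    """One round of the Monty Hall game with three doors."""
    car = random.randrange(3)
    pick = random.randrange(3)
    # The host opens a door that is neither the contestant's pick nor the car.
    host = next(d for d in range(3) if d != pick and d != car)
    if switch:
        # Switch to the one remaining unopened door.
        pick = next(d for d in range(3) if d != pick and d != host)
    return pick == car

trials = 100_000
stay = sum(play(switch=False) for _ in range(trials)) / trials
switch = sum(play(switch=True) for _ in range(trials)) / trials
print(f"stay wins: {stay:.3f}, switch wins: {switch:.3f}")
```

Staying wins about a third of the time and switching about two-thirds, which is why the answer is yes.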
The paradox of statistics is that they are everywhere—from batting averages to presidential polls—but
Or maybe we just need to think more clearly about what many workers are doing during that ten-minute break. My professional experience suggests that many of those workers who report leaving their offices for short breaks are huddled outside the entrance of the building smoking cigarettes
does your credit card company use data on what you are buying to predict if you are likely to miss a payment? (Seriously, they can do that.)
It’s easy to lie with statistics, but it’s hard to tell the truth without them.
Most of the studies that you read about in the newspaper are based on regression analysis.
The problem is that the mechanics of regression analysis are not the hard part; the hard part is determining which variables ought to be considered in the analysis and how that can best be done.
There are so many potential regression pitfalls
regression analysis—from the simplest statistical relationships to the complex models cobbled together by Nobel Prize winners. At its core, regression analysis seeks to find the “best fit” for a linear relationship between two variables.
Regression analysis enables us to go one step further and “fit a line” that best describes a linear relationship between the two variables.
It should be intuitive that the larger the sum of residuals overall, the worse the fit of the line.
ordinary least squares gives us the best description of a linear relationship between two variables.
The regression line certainly does not describe every observation in the data set perfectly. But it is the best description we can muster for what is clearly a meaningful relationship between height and weight. It also means that every observation can be explained as WEIGHT = a + b(HEIGHT) + e, where e is a “residual” that catches the variation in weight for each individual that is not explained by height.
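Fitting that line is mechanical enough to sketch in a few lines of code. The heights and weights below are made-up illustrative numbers, not the Changing Lives data:

```python
# Ordinary least squares for one predictor: WEIGHT = a + b * HEIGHT + e.
heights = [62, 64, 66, 68, 70, 72, 74]         # inches (illustrative)
weights = [120, 136, 148, 157, 172, 180, 196]  # pounds (illustrative)

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

# The slope b that minimizes the sum of squared residuals.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights)) \
    / sum((x - mean_x) ** 2 for x in heights)
# The intercept a puts the line through the point of means.
a = mean_y - b * mean_x

# Each residual e is the part of a weight that height leaves unexplained.
residuals = [y - (a + b * x) for x, y in zip(heights, weights)]
print(f"WEIGHT = {a:.1f} + {b:.2f} * HEIGHT")
```

With an intercept in the model the residuals sum to zero; what least squares minimizes is the sum of their squares.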
a one-unit increase in the independent variable (height) is associated with an increase of 4.5 units in the dependent variable (weight).
Thus, if we had no other information, our best guess for the weight of a person who is 5 feet 10 inches tall (70 inches) in the Changing Lives study would be –135 + 4.5 (70) = 180 pounds.
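That arithmetic is worth spelling out, using the intercept and slope reported in the chapter:

```python
a, b = -135, 4.5         # intercept and slope from the chapter's regression
height = 70              # 5 feet 10 inches, expressed in inches
predicted_weight = a + b * height
print(predicted_weight)  # 180.0
```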
For any regression coefficient, you will generally be interested in three things: sign, size, and significance.
having perfect teeth may be associated with other personality traits that explain the earnings advantage; the earnings effect may be caused by the kind of people who care about their teeth, not the teeth themselves.
does it reflect a meaningful association that is likely to be observed for the population as a whole?
However, we know from the central limit theorem that the mean for a large, properly drawn sample will not typically deviate wildly from the mean for the population as a whole. Similarly, we can assume that the observed relationship between variables like height and weight will not typically bounce around wildly from sample to sample, assuming that these samples are large and properly drawn from the same population.
Once again, the normal distribution is our friend.
we can calculate a standard error for the regression coefficient that gives us a sense of how much dispersion we should expect in the coefficients from sample to sample.
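That sample-to-sample dispersion can be made concrete with a simulation. The sketch below invents a population in which the true relationship is WEIGHT = -135 + 4.5 * HEIGHT plus noise (my assumption, echoing the chapter's coefficients), draws many samples, and refits the slope each time:

```python
import random

random.seed(0)

def fit_slope(xs, ys):
    """Least-squares slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)

slopes = []
for _ in range(1000):
    # Each pass draws a fresh sample of 200 people from the invented population.
    heights = [random.uniform(60, 76) for _ in range(200)]
    weights = [-135 + 4.5 * h + random.gauss(0, 15) for h in heights]
    slopes.append(fit_slope(heights, weights))

mean_slope = sum(slopes) / len(slopes)
spread = (sum((s - mean_slope) ** 2 for s in slopes) / len(slopes)) ** 0.5
print(f"slopes cluster near {mean_slope:.2f} with spread {spread:.2f}")
```

The spread of the simulated slopes plays the role of the standard error: it tells you how far a single sample's coefficient typically sits from the truth.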
(Basically, the t-distribution is more dispersed than the normal distribution and therefore has "fatter tails.")
any basic statistical software package will easily manage the additional complexity associated with using the t-distributions.
We can say that 95 times out of 100, we expect our confidence interval, which is 4.5 ± .26, to contain the true population parameter.
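The interval itself is just the coefficient plus or minus a cutoff times the standard error. A sketch using the chapter's numbers and the large-sample normal cutoff of 1.96 (which gives roughly .25; the chapter's .26 presumably reflects a slightly larger t cutoff or rounding):

```python
coef, se = 4.5, 0.13     # coefficient and standard error from the chapter
margin = 1.96 * se       # normal-approximation margin, about 0.25
low, high = coef - margin, coef + margin
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```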
there is only a 5 percent chance that we are wrongly rejecting the null hypothesis.
In fact, our results are even more extreme than that. The standard error (.13) is extremely low relative to the size of the coefficient (4.5).
One rough rule of thumb is that the coefficient is likely to be statistically significant when the coefficient is at least twice the size of the standard error.* A statistics package also calculates a p-value, which is .000 in this case, meaning that there is essentia...
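The chapter's own numbers illustrate the rule of thumb. The ratio of coefficient to standard error (the t-statistic) is enormous here:

```python
coef, se = 4.5, 0.13
t_stat = coef / se                     # roughly 34.6
passes_rule_of_thumb = coef >= 2 * se  # the rough "twice the standard error" test
print(f"t-statistic: {t_stat:.1f}, passes rule of thumb: {passes_rule_of_thumb}")
```

A ratio of roughly 34 is so far out in the tail that the p-value is zero to three decimal places, which is all the .000 means.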
The R2 tells us how much of that variation around the mean is associated with differences in height alone. The answer in our case is .25, or 25 percent. The more significant point may be that 75 percent of the variation in weight for our sample remains unexplained. There are clearly factors other than height that might help us understand the weights of the Changing Lives participants. This is where things get more interesting.
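R² has a simple mechanical definition: one minus the share of the variation around the mean that survives in the residuals. A sketch with made-up numbers (the .25 in the text belongs to the Changing Lives sample, not this toy data):

```python
# Illustrative data: actual weights and the predictions from some fitted line.
weights = [120, 136, 148, 157, 172, 180, 196]
predicted = [124, 136, 148, 160, 172, 184, 196]

mean_w = sum(weights) / len(weights)
ss_total = sum((w - mean_w) ** 2 for w in weights)                # variation around the mean
ss_resid = sum((w - p) ** 2 for w, p in zip(weights, predicted))  # unexplained variation
r_squared = 1 - ss_resid / ss_total
print(f"R-squared: {r_squared:.2f}")
```

An R² of .25 means the residual variation is 75 percent of the original variation around the mean.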
When we include multiple variables in the regression equation, the analysis gives us an estimate of the linear association between each explanatory variable and the dependent variable while holding the other explanatory variables constant, or “controlling for” these other factors.
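Mechanically, "controlling for" other factors means fitting all the coefficients at once. A sketch with two made-up explanatory variables, height and age, where numpy's least-squares solver stands in for a statistics package:

```python
import numpy as np

# Illustrative data, not the Changing Lives sample.
height = np.array([62, 64, 66, 68, 70, 72, 74], dtype=float)          # inches
age    = np.array([30, 45, 28, 51, 33, 40, 62], dtype=float)          # years
weight = np.array([121, 139, 147, 161, 171, 182, 199], dtype=float)   # pounds

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.column_stack([np.ones_like(height), height, age])
coefs, *_ = np.linalg.lstsq(X, weight, rcond=None)
intercept, b_height, b_age = coefs

# b_height is the association between height and weight with age held
# constant; b_age is the association with age with height held constant.
print(f"intercept={intercept:.1f}, height={b_height:.2f}, age={b_age:.2f}")
```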
Regression analysis (often called multiple regression analysis when more than one explanatory variable is involved, or multivariate regression analysis)
I have included a table with the complete results of this regression equation in the appendix to this chapter.
an R2 of zero means that our regression equation does no better than the mean at predicting the weight of any individual in the sample; an R2 of 1 means that the regression equation perfectly predicts the weight of every person in the sample.)
If this were a real research project, there would be weeks or months of follow-on analysis to probe this finding.
The gender wage gap fades away as the authors add more explanatory variables to the analysis.
the value of multiple regression analysis, particularly the research insights that stem from being able to isolate the effect of one explanatory variable while controlling for other confounding factors.
Our goal now is to see how much of the remaining variation in weight in each room can be explained by education. In other words, what is the best linear relationship between education and weight in each room?