Statistics Done Wrong: The Woefully Complete Guide
Kindle Notes & Highlights
35%
Twice the standard deviation of the measurements. Calculate how far each observation is from the average, square each difference, and then average the results and take the square root. This is the standard deviation, and it measures how spread out the measurements are from their mean. Standard deviation bars stretch from one standard deviation below the mean to one standard deviation above.
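A minimal NumPy sketch of that calculation, using made-up measurement values; note this is the population form of the standard deviation (dividing by n), which matches the description above:

```python
import numpy as np

measurements = np.array([4.2, 5.1, 3.8, 5.5, 4.9, 4.4])  # hypothetical data

mean = measurements.mean()
deviations = measurements - mean          # how far each observation is from the average
sd = np.sqrt((deviations ** 2).mean())    # square, average the results, take the square root

# Equivalent to np.std(measurements); the error bar would stretch
# from mean - sd to mean + sd.
print(mean - sd, mean + sd)
```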
35%
Twice the standard error for the estimate, another way of measuring the margin of error. If you run numerous identical experiments and obtain an estimate of Fixitol’s effectiveness from each, the standard error is the standard deviation of these estimates. The bars stretch one standard error below and one standard error above the mean. In the most common cases, a standard error bar is about half as wide as the 95% confidence interval.
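A small simulation of that definition, assuming normally distributed measurements with made-up parameters: the standard deviation of the estimates from many identical experiments matches the usual standard-error formula, and a +/- 1 standard error bar is roughly half the width of a 95% confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_mean, sigma = 30, 10.0, 2.0       # hypothetical experiment parameters

# Run many identical experiments and take the estimate (sample mean) from each.
estimates = rng.normal(true_mean, sigma, size=(10_000, n)).mean(axis=1)

print(estimates.std())                    # standard error: the SD of these estimates
print(sigma / np.sqrt(n))                 # the usual analytic formula, for comparison

# A 95% confidence interval spans roughly +/- 1.96 standard errors, so a
# +/- 1 SE bar is about half as wide as the 95% CI.
```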
36%
The standard deviation measures the spread of the individual data points.
36%
Confidence intervals and standard errors, on the other hand, estimate how far the average for this sample might be from the true average.
39%
There are several ways to mitigate this problem. One is to split the dataset in half, choosing regions of interest with the first half and performing the in-depth analysis with the second. This reduces statistical power, though, so we’d have to collect more data to compensate. Alternatively, we could select regions of interest using some criterion other than response to walrus or penguin stimuli, such as prior anatomical knowledge.
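A rough sketch of the split-half idea on simulated, effect-free data with hypothetical "voxel" responses: the regions of interest are chosen using only the first half of the trials, and the significance test uses only the held-out second half.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
responses = rng.normal(size=(40, 1000))        # 40 trials x 1,000 voxels, no real effect

first, second = responses[:20], responses[20:]     # split the dataset in half
roi = np.argsort(first.mean(axis=0))[-10:]         # pick regions of interest from half 1

# The in-depth analysis uses only the held-out half, so the selection step
# cannot bias this test toward significance.
print(stats.ttest_1samp(second[:, roi].mean(axis=1), 0.0).pvalue)
```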
39%
This phenomenon, called regression to the mean, isn’t some special property of blood pressures or businesses. It’s just the observation that luck doesn’t last forever. On average, everyone’s luck is average.
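A quick simulation of the idea with made-up "skill plus luck" scores: select the apparent top performers in one year, and their average the following year falls back toward the mean.

```python
import numpy as np

rng = np.random.default_rng(2)
skill = rng.normal(size=10_000)              # each unit's stable underlying level
year1 = skill + rng.normal(size=10_000)      # observed performance = skill + luck
year2 = skill + rng.normal(size=10_000)      # fresh luck the following year

top = year1 > np.quantile(year1, 0.95)       # select the apparent top performers
print(year1[top].mean(), year2[top].mean())  # the follow-up average regresses toward the mean
```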
42%
One statistical technique to deal with such scenarios is called regression modeling. It estimates the marginal effect of each variable—the health impact of each additional pound of weight, not just the difference between groups on either side of an arbitrary cutoff. This gives much finer-grained results than a simple comparison between groups.
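A minimal ordinary-least-squares sketch of the idea on simulated data; the variable names and coefficients are invented, and the point is only that the coefficient on weight estimates the marginal effect per additional pound with the other variable held fixed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
weight = rng.normal(180, 30, n)                            # hypothetical weight in pounds
age = rng.normal(50, 10, n)                                # a second predictor
risk = 0.02 * weight + 0.05 * age + rng.normal(0, 1, n)    # simulated outcome

X = np.column_stack([np.ones(n), weight, age])             # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, risk, rcond=None)

# coef[1] is the marginal effect: the estimated change in the outcome per
# additional pound with age held fixed, not a crude above/below-cutoff contrast.
print(coef)
```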
42%
A common simplification technique is to dichotomize variables by splitting a continuous measurement into two separate groups.
42%
One common solution is to split the data along the median of the sample, which divides the data into two equal-size groups—a so-called median split
43%
A major objection to dichotomization is that it throws away information.
43%
This reduces the statistical power of your study.
43%
You’ll get less precise estimates of the correlations you’re trying to measure and will often underestimate effect sizes.
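A small simulated example of that loss: correlate a continuous predictor with an outcome, then repeat the calculation after a median split of the predictor. The data and effect size here are invented; the dichotomized version typically comes out noticeably weaker.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=1_000)
y = 0.5 * x + rng.normal(size=1_000)             # a genuine continuous relationship

high = (x > np.median(x)).astype(float)          # median split: two equal-size groups
print(stats.pearsonr(x, y)[0])                   # correlation using the full measurements
print(stats.pearsonr(high, y)[0])                # weaker correlation after dichotomizing x
```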
44%
As a result, the ANOVA procedure falsely claims that yachts and health care are related. Worse, this false correlation isn’t statistically significant only 5% of the time—from the ANOVA’s perspective, it’s a true correlation, and it is detected as often as the statistical power of the test allows.
44%
Regression procedures can easily fit this data without any dichotomization, while producing false-positive correlations only at the rate you’d expect.
44%
Regression in its simplest form is fitting a straight line to data: finding the equation of the line that best predicts the outcome from the data.
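In code, the simplest case is a one-variable least-squares line fit; the data here are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 50)
y = 3.0 + 1.5 * x + rng.normal(0, 2, 50)    # hypothetical noisy linear data

slope, intercept = np.polyfit(x, y, 1)      # least-squares fit of a straight line
print(f"y is roughly {intercept:.2f} + {slope:.2f} * x")
```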
44%
Regression with multiple variables allows you to control for confounding factors in a study.
46%
stepwise regression, a common procedure for selecting which variables are the most important in a regression.
46%
Hypothetically, by adding only statistically significant variables, you avoid overfitting, but running so many significance tests is bound to produce false positives, so some of the variables you select will be bogus.
46%
There are several variations of stepwise regression. The version I just described is called forward selection since it starts from scratch and starts including variables. The alternative, backward elimination, starts by including all 1,600 variables and excludes those that are statistically insignificant, one at a time.
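A minimal sketch of forward selection (not the book's code), assuming statsmodels is available. With 50 candidate variables that are pure noise it still tends to "select" a few, which is the false-positive problem described above; backward elimination would instead start with every variable included and drop the least significant one at a time.

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Greedy forward selection: repeatedly add the most significant remaining variable."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            pvals[j] = sm.OLS(y, design).fit().pvalues[-1]   # p-value of the candidate
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                         # nothing left looks "significant"
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 50))            # 50 candidate variables, none truly related
y = rng.normal(size=200)                  # to this outcome
print(forward_selection(X, y))            # yet a few bogus variables are usually selected
```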
46%
Akaike information criterion and the Bayesian information criterion, which reduce overfitting by penalizing models with more variables.
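A brief sketch of how those criteria behave, using statsmodels' built-in aic and bic attributes on simulated data with one genuine predictor and one irrelevant one.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)                 # an irrelevant extra variable
y = 2.0 * x1 + rng.normal(size=300)

small = sm.OLS(y, sm.add_constant(x1)).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Both criteria reward goodness of fit but penalize extra parameters (BIC more
# heavily), so the needless x2 term usually makes the bigger model look worse.
print(small.aic, big.aic)
print(small.bic, big.bic)
```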
46%
cross-validation: fit the model using only a portion of the melons and then test its effectiveness at predicting the ripeness of the other melons. If the model overfits, it will perform poorly during cross-validation.
46%
leave-one-out cross-validation, where the model is fit using all but one data point and then evaluated on its ability to predict that point;
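A leave-one-out cross-validation sketch using scikit-learn, with invented "melon" features; only the first feature actually predicts ripeness here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 5))                        # hypothetical melon measurements
ripeness = 2.0 * X[:, 0] + rng.normal(size=40)      # only the first feature matters

# Each melon is held out once; the model is fit on the rest and must predict it.
scores = cross_val_score(LinearRegression(), X, ripeness,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(-scores.mean())    # average held-out squared error; overfit models do poorly here
```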
47%
in some cases there may be a good reason to believe that only a few of the variables have any effect on the outcome.
47%
the lasso (short for least absolute shrinkage and selection operator, an inspired acronym) has better mathematical properties and doesn’t fool the user with claims of statistical significance. But the lasso is not bulletproof, and there is no perfect automated solution.
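A short scikit-learn sketch of the lasso on simulated data where only two of thirty variables matter; the penalty strength alpha is an arbitrary choice here and would normally be picked by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)   # only two variables matter

lasso = Lasso(alpha=0.1).fit(X, y)
# The L1 penalty shrinks coefficients and sets most of the irrelevant ones
# exactly to zero, selecting variables without a string of significance tests.
print(np.nonzero(lasso.coef_)[0])
```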
47%
But that’s not what your model says. The model says that people with cholesterol and weight within that range have a 30% lower risk of heart attack; it doesn’t say that if you put an overweight person on a diet and exercise routine, that person will be less likely to have a heart attack.
47%
It’s common to interpret the results by saying, “If weight increases by one pound, with all other variables held constant, then heart attack rates increase by . . .” Perhaps that is true, but it may not be possible to hold all other variables constant in practice.
48%
Nobody ever gains a pound with all other variables held constant, so your regression equation doesn’t translate to reality.
48%
Simpson’s paradox arises whenever an apparent trend in data, caused by a confounding variable, can be eliminated or reversed by splitting the data into natural groups.
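A worked numeric illustration, using counts in the spirit of the classic kidney-stone treatment example: one treatment does better within each severity group yet looks worse when the groups are pooled, because it was given more of the hard cases.

```python
# (successes, patients) for two hypothetical treatments, split by case severity
a_mild, a_severe = (81, 87), (192, 263)
b_mild, b_severe = (234, 270), (55, 80)

def rate(successes, patients):
    return successes / patients

print(rate(*a_mild), rate(*b_mild))        # treatment A wins on mild cases
print(rate(*a_severe), rate(*b_severe))    # A wins on severe cases too
print(rate(81 + 192, 87 + 263),            # ...yet A looks worse overall, because
      rate(234 + 55, 270 + 80))            # it handled more of the severe cases
```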
49%
In general, random assignment eliminates confounding variables and prevents Simpson’s paradox from giving us backward results. Purely observational studies are particularly susceptible to the paradox.
50%
Simpson’s paradox was discovered by Karl Pearson and Udny Yule and is thus an example of Stigler’s law of eponymy, discovered by Robert Merton, which states that no scientific discovery is named after the original discoverer.
51%
Further simulation by the researchers suggested that if scientists try different statistical analyses until one works—say, by controlling for different combinations of variables and trying different sample sizes—false positive rates can jump to more than 50% for a given dataset.
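A toy simulation of just one of those degrees of freedom (peeking at several sample sizes and keeping the best p-value); even this mild version pushes the false positive rate noticeably above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)

def one_flexible_study(n=40):
    """A study with no real effect, analyzed at several sample sizes."""
    group = rng.integers(0, 2, n)
    outcome = rng.normal(size=n)
    pvals = []
    for m in (20, 30, 40):                           # peek at different sample sizes
        a, b = outcome[:m][group[:m] == 0], outcome[:m][group[:m] == 1]
        pvals.append(stats.ttest_ind(a, b).pvalue)
    return min(pvals) < 0.05                         # claim "significance" if any peek works

print(np.mean([one_flexible_study() for _ in range(2_000)]))   # noticeably above 0.05
```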
52%
Other blinding techniques include adding a constant to all measurements, keeping this constant hidden from analysts until the analysis is finalized; having independent groups perform separate parts of the analysis and only later combining their results; or using simulations to inject false data that is later removed.
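A sketch of the first technique (a hidden constant offset), with made-up measurements; the offset stays concealed until the analysis choices are frozen.

```python
import numpy as np

rng = np.random.default_rng(11)
measurements = rng.normal(5.0, 1.0, 100)     # hypothetical raw data

offset = rng.uniform(-10, 10)                # hidden constant, kept from the analysts
blinded = measurements + offset

# Analysts develop and finalize the analysis on the blinded values; only
# afterward is the offset revealed and subtracted to recover the real answer.
blinded_result = blinded.mean()
print(blinded_result - offset)
```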
52%
In some medical studies, triple blinding is performed as a form of blind analysis: neither the patients, the doctors, nor the statisticians know which group is the control group until the analysis is complete.
57%
Reproducibility Project,
61%
outcome reporting bias, where systematic reviews become biased toward more extreme and more interesting results.
61%
CONSORT checklist, which requires reporting of statistical methods, all measured outcomes, and any changes to the trial design after it began.
63%
This problem is commonly known as publication bias, or the file drawer problem. Many studies sit in a file drawer for years, never published, despite the valuable data they could contribute.
63%
Suppose, for example, that the effect size is 0.8 (on some arbitrary scale), but the review was composed of many small studies that each had a power of 0.2. You would expect only 20% of the studies to be able to detect the effect—but you may find that 90% or more of the published studies found it because the rest were tossed in the bin.
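A simulation in that spirit, with a made-up effect of 0.8 standard deviations and tiny studies chosen to have roughly 0.2 power (an assumption); it also shows the related distortion that the "published" (significant) studies overstate the effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
effect, n = 0.8, 6                            # small studies; roughly 0.2 power (assumption)

observed, significant = [], []
for _ in range(10_000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(effect, 1.0, n)
    observed.append(treated.mean() - control.mean())
    significant.append(stats.ttest_ind(treated, control).pvalue < 0.05)
observed, significant = np.array(observed), np.array(significant)

print(significant.mean())                     # only ~20% of all studies detect the effect
print(observed.mean())                        # average estimate over all studies: near 0.8
print(observed[significant].mean())           # among "published" studies only: inflated
```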