Statistics Done Wrong: The Woefully Complete Guide
Kindle Notes & Highlights
35%
Twice the standard deviation of the measurements. Calculate how far each observation is from the average, square each difference, and then average the results and take the square root. This is the standard deviation, and it measures how spread out the measurements are from their mean. Standard deviation bars stretch from one standard deviation below the mean to one standard deviation above.
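A minimal NumPy sketch of that calculation, using made-up measurement values; note this is the population form of the standard deviation (dividing by n), which matches the description above:

```python
import numpy as np

measurements = np.array([4.2, 5.1, 3.8, 5.5, 4.9, 4.4])  # hypothetical data

mean = measurements.mean()
deviations = measurements - mean          # how far each observation is from the average
sd = np.sqrt((deviations ** 2).mean())    # square, average the results, take the square root

# Equivalent to np.std(measurements); the error bar would stretch
# from mean - sd to mean + sd.
print(mean - sd, mean + sd)
```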
35%
Twice the standard error for the estimate, another way of measuring the margin of error. If you run numerous identical experiments and obtain an estimate of Fixitol’s effectiveness from each, the standard error is the standard deviation of these estimates. The bars stretch one standard error below and one standard error above the mean. In the most common cases, a standard error bar is about half as wide as the 95% confidence interval.
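A small simulation of that definition, assuming normally distributed measurements with made-up parameters: the standard deviation of the estimates from many identical experiments matches the usual standard-error formula, and a +/- 1 standard error bar is roughly half the width of a 95% confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_mean, sigma = 30, 10.0, 2.0       # hypothetical experiment parameters

# Run many identical experiments and take the estimate (sample mean) from each.
estimates = rng.normal(true_mean, sigma, size=(10_000, n)).mean(axis=1)

print(estimates.std())                    # standard error: the SD of these estimates
print(sigma / np.sqrt(n))                 # the usual analytic formula, for comparison

# A 95% confidence interval spans roughly +/- 1.96 standard errors, so a
# +/- 1 SE bar is about half as wide as the 95% CI.
```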
36%
The standard deviation measures the spread of the individual data points.
36%
Confidence intervals and standard errors, on the other hand, estimate how far the average for this sample might be from the true average.
39%
There are several ways to mitigate this problem. One is to split the dataset in half, choosing regions of interest with the first half and performing the in-depth analysis with the second. This reduces statistical power, though, so we’d have to collect more data to compensate. Alternatively, we could select regions of interest using some criterion other than response to walrus or penguin stimuli, such as prior anatomical knowledge.
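A rough sketch of the split-half idea on simulated, effect-free data with hypothetical "voxel" responses: the regions of interest are chosen using only the first half of the trials, and the significance test uses only the held-out second half.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
responses = rng.normal(size=(40, 1000))        # 40 trials x 1,000 voxels, no real effect

first, second = responses[:20], responses[20:]     # split the dataset in half
roi = np.argsort(first.mean(axis=0))[-10:]         # pick regions of interest from half 1

# The in-depth analysis uses only the held-out half, so the selection step
# cannot bias this test toward significance.
print(stats.ttest_1samp(second[:, roi].mean(axis=1), 0.0).pvalue)
```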
39%
This phenomenon, called regression to the mean, isn’t some special property of blood pressures or businesses. It’s just the observation that luck doesn’t last forever. On average, everyone’s luck is average.
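A quick simulation of the idea with made-up "skill plus luck" scores: select the apparent top performers in one year, and their average the following year falls back toward the mean.

```python
import numpy as np

rng = np.random.default_rng(2)
skill = rng.normal(size=10_000)              # each unit's stable underlying level
year1 = skill + rng.normal(size=10_000)      # observed performance = skill + luck
year2 = skill + rng.normal(size=10_000)      # fresh luck the following year

top = year1 > np.quantile(year1, 0.95)       # select the apparent top performers
print(year1[top].mean(), year2[top].mean())  # the follow-up average regresses toward the mean
```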
42%
One statistical technique to deal with such scenarios is called regression modeling. It estimates the marginal effect of each variable—the health impact of each additional pound of weight, not just the difference between groups on either side of an arbitrary cutoff. This gives much finer-grained results than a simple comparison between groups.
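A minimal ordinary-least-squares sketch of the idea on simulated data; the variable names and coefficients are invented, and the point is only that the coefficient on weight estimates the marginal effect per additional pound with the other variable held fixed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
weight = rng.normal(180, 30, n)                            # hypothetical weight in pounds
age = rng.normal(50, 10, n)                                # a second predictor
risk = 0.02 * weight + 0.05 * age + rng.normal(0, 1, n)    # simulated outcome

X = np.column_stack([np.ones(n), weight, age])             # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, risk, rcond=None)

# coef[1] is the marginal effect: the estimated change in the outcome per
# additional pound with age held fixed, not a crude above/below-cutoff contrast.
print(coef)
```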
42%
A common simplification technique is to dichotomize variables by splitting a continuous measurement into two separate groups.
42%
One common solution is to split the data along the median of the sample, which divides the data into two equal-size groups—a so-called median split
43%
A major objection to dichotomization is that it throws away information.
43%
This reduces the statistical power of your study.
43%
You’ll get less precise estimates of the correlations you’re trying to measure and will often underestimate effect sizes.
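A small simulated example of that loss: correlate a continuous predictor with an outcome, then repeat the calculation after a median split of the predictor. The data and effect size here are invented; the dichotomized version typically comes out noticeably weaker.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=1_000)
y = 0.5 * x + rng.normal(size=1_000)             # a genuine continuous relationship

high = (x > np.median(x)).astype(float)          # median split: two equal-size groups
print(stats.pearsonr(x, y)[0])                   # correlation using the full measurements
print(stats.pearsonr(high, y)[0])                # weaker correlation after dichotomizing x
```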
44%
As a result, the ANOVA procedure falsely claims that yachts and health care are related. Worse, this false correlation isn’t statistically significant only 5% of the time—from the ANOVA’s perspective, it’s a true correlation, and it is detected as often as the statistical power of the test allows.
44%
Regression procedures can easily fit this data without any dichotomization, while producing false-positive correlations only at the rate you’d expect.
44%
Regression in its simplest form is fitting a straight line to data: finding the equation of the line that best predicts the outcome from the data.
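In code, the simplest case is a one-variable least-squares line fit; the data here are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 50)
y = 3.0 + 1.5 * x + rng.normal(0, 2, 50)    # hypothetical noisy linear data

slope, intercept = np.polyfit(x, y, 1)      # least-squares fit of a straight line
print(f"y is roughly {intercept:.2f} + {slope:.2f} * x")
```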
44%
Regression with multiple variables allows you to control for confounding factors in a study.
46%
stepwise regression, a common procedure for selecting which variables are the most important in a regression.
46%
Hypothetically, by adding only statistically significant variables, you avoid overfitting, but running so many significance tests is bound to produce false positives, so some of the variables you select will be bogus.
46%
There are several variations of stepwise regression. The version I just described is called forward selection since it starts from scratch and starts including variables. The alternative, backward elimination, starts by including all 1,600 variables and excludes those that are statistically insignificant, one at a time.
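A minimal sketch of forward selection (not the book's code), assuming statsmodels is available. With 50 candidate variables that are pure noise it still tends to "select" a few, which is the false-positive problem described above; backward elimination would instead start with every variable included and drop the least significant one at a time.

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Greedy forward selection: repeatedly add the most significant remaining variable."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])
            pvals[j] = sm.OLS(y, design).fit().pvalues[-1]   # p-value of the candidate
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                         # nothing left looks "significant"
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 50))            # 50 candidate variables, none truly related
y = rng.normal(size=200)                  # to this outcome
print(forward_selection(X, y))            # yet a few bogus variables are usually selected
```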
46%
Akaike information criterion and the Bayesian information criterion, which reduce overfitting by penalizing models with more variables.
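A brief sketch of how those criteria behave, using statsmodels' built-in aic and bic attributes on simulated data with one genuine predictor and one irrelevant one.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)                 # an irrelevant extra variable
y = 2.0 * x1 + rng.normal(size=300)

small = sm.OLS(y, sm.add_constant(x1)).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Both criteria reward goodness of fit but penalize extra parameters (BIC more
# heavily), so the needless x2 term usually makes the bigger model look worse.
print(small.aic, big.aic)
print(small.bic, big.bic)
```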
46%
cross-validation: fit the model using only a portion of the melons and then test its effectiveness at predicting the ripeness of the other melons. If the model overfits, it will perform poorly during cross-validation.
46%
leave-one-out cross-validation, where the model is fit using all but one data point and then evaluated on its ability to predict that point;
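A leave-one-out cross-validation sketch using scikit-learn, with invented "melon" features; only the first feature actually predicts ripeness here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 5))                        # hypothetical melon measurements
ripeness = 2.0 * X[:, 0] + rng.normal(size=40)      # only the first feature matters

# Each melon is held out once; the model is fit on the rest and must predict it.
scores = cross_val_score(LinearRegression(), X, ripeness,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(-scores.mean())    # average held-out squared error; overfit models do poorly here
```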
47%
in some cases there may be a good reason to believe that only a few of the variables have any effect on the outcome.
47%
the lasso (short for least absolute shrinkage and selection operator, an inspired acronym) has better mathematical properties and doesn’t fool the user with claims of statistical significance. But the lasso is not bulletproof, and there is no perfect automated solution.
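A short scikit-learn sketch of the lasso on simulated data where only two of thirty variables matter; the penalty strength alpha is an arbitrary choice here and would normally be picked by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)   # only two variables matter

lasso = Lasso(alpha=0.1).fit(X, y)
# The L1 penalty shrinks coefficients and sets most of the irrelevant ones
# exactly to zero, selecting variables without a string of significance tests.
print(np.nonzero(lasso.coef_)[0])
```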
47%
But that’s not what your model says. The model says that people with cholesterol and weight within that range have a 30% lower risk of heart attack; it doesn’t say that if you put an overweight person on a diet and exercise routine, that person will be less likely to have a heart attack.
47%
It’s common to interpret the results by saying, “If weight increases by one pound, with all other variables held constant, then heart attack rates increase by . . .” Perhaps that is true, but it may not be possible to hold all other variables constant in practice.
48%
Nobody ever gains a pound with all other variables held constant, so your regression equation doesn’t translate to reality.
48%
Simpson’s paradox arises whenever an apparent trend in data, caused by a confounding variable, can be eliminated or reversed by splitting the data into natural groups.
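A worked numeric illustration, using counts in the spirit of the classic kidney-stone treatment example: one treatment does better within each severity group yet looks worse when the groups are pooled, because it was given more of the hard cases.

```python
# (successes, patients) for two hypothetical treatments, split by case severity
a_mild, a_severe = (81, 87), (192, 263)
b_mild, b_severe = (234, 270), (55, 80)

def rate(successes, patients):
    return successes / patients

print(rate(*a_mild), rate(*b_mild))        # treatment A wins on mild cases
print(rate(*a_severe), rate(*b_severe))    # A wins on severe cases too
print(rate(81 + 192, 87 + 263),            # ...yet A looks worse overall, because
      rate(234 + 55, 270 + 80))            # it handled more of the severe cases
```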
49%
In general, random assignment eliminates confounding variables and prevents Simpson’s paradox from giving us backward results. Purely observational studies are particularly susceptible to the paradox.
50%
Simpson’s paradox was discovered by Karl Pearson and Udny Yule and is thus an example of Stigler’s law of eponymy, discovered by Robert Merton, which states that no scientific discovery is named after the original discoverer.
51%
Further simulation by the researchers suggested that if scientists try different statistical analyses until one works—say, by controlling for different combinations of variables and trying different sample sizes—false positive rates can jump to more than 50% for a given dataset.
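A toy simulation of just one of those degrees of freedom (peeking at several sample sizes and keeping the best p-value); even this mild version pushes the false positive rate noticeably above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)

def one_flexible_study(n=40):
    """A study with no real effect, analyzed at several sample sizes."""
    group = rng.integers(0, 2, n)
    outcome = rng.normal(size=n)
    pvals = []
    for m in (20, 30, 40):                           # peek at different sample sizes
        a, b = outcome[:m][group[:m] == 0], outcome[:m][group[:m] == 1]
        pvals.append(stats.ttest_ind(a, b).pvalue)
    return min(pvals) < 0.05                         # claim "significance" if any peek works

print(np.mean([one_flexible_study() for _ in range(2_000)]))   # noticeably above 0.05
```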
52%
Other blinding techniques include adding a constant to all measurements, keeping this constant hidden from analysts until the analysis is finalized; having independent groups perform separate parts of the analysis and only later combining their results; or using simulations to inject false data that is later removed.
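A sketch of the first technique (a hidden constant offset), with made-up measurements; the offset stays concealed until the analysis choices are frozen.

```python
import numpy as np

rng = np.random.default_rng(11)
measurements = rng.normal(5.0, 1.0, 100)     # hypothetical raw data

offset = rng.uniform(-10, 10)                # hidden constant, kept from the analysts
blinded = measurements + offset

# Analysts develop and finalize the analysis on the blinded values; only
# afterward is the offset revealed and subtracted to recover the real answer.
blinded_result = blinded.mean()
print(blinded_result - offset)
```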
52%
In some medical studies, triple blinding is performed as a form of blind analysis: neither the patients, the doctors, nor the statisticians know which group is the control group until the analysis is complete.
57%
Reproducibility Project,
61%
outcome reporting bias, where systematic reviews become biased toward more extreme and more interesting results.
61%
CONSORT checklist, which requires reporting of statistical methods, all measured outcomes, and any changes to the trial design after it began.
63%
This problem is commonly known as publication bias, or the file drawer problem. Many studies sit in a file drawer for years, never published, despite the valuable data they could contribute.
63%
Suppose, for example, that the effect size is 0.8 (on some arbitrary scale), but the review was composed of many small studies that each had a power of 0.2. You would expect only 20% of the studies to be able to detect the effect—but you may find that 90% or more of the published studies found it because the rest were tossed in the bin.
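A simulation in that spirit, with a made-up effect of 0.8 standard deviations and tiny studies chosen to have roughly 0.2 power (an assumption); it also shows the related distortion that the "published" (significant) studies overstate the effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
effect, n = 0.8, 6                            # small studies; roughly 0.2 power (assumption)

observed, significant = [], []
for _ in range(10_000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(effect, 1.0, n)
    observed.append(treated.mean() - control.mean())
    significant.append(stats.ttest_ind(treated, control).pvalue < 0.05)
observed, significant = np.array(observed), np.array(significant)

print(significant.mean())                     # only ~20% of all studies detect the effect
print(observed.mean())                        # average estimate over all studies: near 0.8
print(observed[significant].mean())           # among "published" studies only: inflated
```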