Statistics Done Wrong: The Woefully Complete Guide
Kindle Notes & Highlights
8%
The p value is the probability, under the assumption that there is no true effect or no true difference, of collecting data that shows a difference equal to or more extreme than what you actually observed.
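To make the definition concrete, here is a minimal Python sketch (the two groups and their measurements are invented for illustration, not taken from the book): the p value reported by the test is the probability, assuming no true difference, of a result at least as extreme as the one observed.

```python
# Hypothetical two-group comparison illustrating the definition of a p value.
from scipy import stats

control = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
treated = [12.8, 12.4, 13.1, 12.6, 12.9, 12.7]

# Under the null hypothesis of no true difference, p is the probability of a
# t statistic at least this extreme.
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```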
8%
Remember, a p value is not a measure of how right you are or how important a difference is. Instead, think of it as a measure of surprise.
10%
The Neyman-Pearson approach is where we get “statistical significance,” with a prechosen p value threshold that guarantees the long-run false positive rate.
11%
A confidence interval combines a point estimate with the uncertainty in that estimate.
11%
If you want to test whether an effect is significantly different from zero, you can construct a 95% confidence interval and check whether the interval includes zero. In the process, you get the added bonus of learning how precise your estimate is. If the confidence interval is too wide, you may need to collect more data.
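As a sketch of that workflow (the paired differences below are made up), you can compute the 95% interval from the t distribution and check whether it covers zero:

```python
# Hypothetical before/after differences: build a 95% confidence interval for
# the mean difference and check whether it includes zero.
import numpy as np
from scipy import stats

diffs = np.array([0.8, 1.2, 0.3, 1.5, 0.9, 1.1, 0.6, 1.4])
mean = diffs.mean()
sem = stats.sem(diffs)
t_crit = stats.t.ppf(0.975, df=len(diffs) - 1)
low, high = mean - t_crit * sem, mean + t_crit * sem
print(f"95% CI: ({low:.2f}, {high:.2f}); includes zero: {low <= 0 <= high}")
```

The interval's width is the added bonus mentioned above: a wide interval is a signal to collect more data.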
11%
For example, if you run a clinical trial, you might produce a confidence interval indicating that your drug reduces symptoms by somewhere between 15 and 25 percent. This effect is statistically significant because the interval doesn’t include zero,
12%
The power of a study is the probability that it will distinguish an effect of a certain size from pure luck.
17%
a narrow interval covering zero tells you that the effect is most likely small (which may be all you need to know, if a small effect is not practically useful), while a wide interval clearly shows that the measurement was not precise enough to draw conclusions.
17%
accuracy in parameter estimation, or AIPE
18%
but you’ve inflated the size of its effect because your study was underpowered. This effect is known as truth inflation, type M error (M for magnitude), or the winner’s curse.
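A small simulation makes the winner's curse visible; the true effect size, sample size, and threshold below are assumptions chosen to make each study underpowered, not numbers from the book.

```python
# Run many underpowered two-group studies with a true effect of 0.3 SD and
# keep only the ones that reach p < 0.05. Their average estimated effect is
# well above the true effect: truth inflation (type M error).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n = 0.3, 20   # assumed values; power is low at this sample size
significant_estimates = []
for _ in range(5000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    if stats.ttest_ind(treated, control).pvalue < 0.05:
        significant_estimates.append(treated.mean() - control.mean())

print(f"true effect: {true_effect}")
print(f"mean estimate among significant studies: {np.mean(significant_estimates):.2f}")
```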
20%
shrinkage. For counties with few residents, you can “shrink” the cancer rate estimates toward the national average by taking a weighted average of the county cancer rate with the national average rate. When the county has few residents, you weight the national average strongly; when the county is large, you weight the county strongly.
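A minimal sketch of that weighted average, with made-up county figures and a fixed weighting constant k (a full analysis would estimate the weights from the data rather than assume them):

```python
# Shrink small-county cancer rates toward the national average.
# w approaches 1 for large counties (trust the county's own rate) and 0 for
# tiny ones (trust the national average). k is an assumed tuning constant.
national_rate = 0.0004
k = 50_000

counties = {"Tiny County": (3, 1_200), "Big County": (450, 1_000_000)}  # (cases, population)
for name, (cases, population) in counties.items():
    raw_rate = cases / population
    w = population / (population + k)
    shrunk = w * raw_rate + (1 - w) * national_rate
    print(f"{name}: raw {raw_rate:.5f} -> shrunk {shrunk:.5f}")
```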
20%
Unfortunately, it biases results in the opposite direction: small counties with truly abnormal cancer rates are estimated to have rates much closer to the national average than they are.
20%
For sites like reddit that have simple up-and-down votes rather than star ratings, one alternative is to generate a confidence interval for the fraction of positive votes. The interval starts wide when a comment has only a few votes and narrows to a definite value (“70% of voters like this comment”) as comments accumulate; sort the comments by the bottom bound of their confidence intervals. New comments start near the bottom, but the best among them accumulate votes and creep up the page as the confidence interval narrows. And because comments are sorted by the proportion of positive votes ...
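One common way to build that interval for a fraction of positive votes is the Wilson score interval (my choice here; the highlight itself doesn't name a particular interval). Sorting by its lower bound looks roughly like this:

```python
# Rank comments by the lower bound of a 95% Wilson score interval for the
# fraction of upvotes, instead of by the raw fraction itself.
from math import sqrt

def wilson_lower_bound(upvotes: int, total: int, z: float = 1.96) -> float:
    if total == 0:
        return 0.0
    p = upvotes / total
    center = p + z**2 / (2 * total)
    margin = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin) / (1 + z**2 / total)

comments = [("A", 60, 80), ("B", 3, 3), ("C", 500, 700)]  # (id, upvotes, total votes)
ranked = sorted(comments, key=lambda c: wilson_lower_bound(c[1], c[2]), reverse=True)
print([c[0] for c in ranked])
```

Comment B has a perfect score but only three votes, so its lower bound, and therefore its rank, stays modest until more votes arrive.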
20%
Calculate the statistical power when designing your study to determine the appropriate sample size.
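For instance, statsmodels can solve for the sample size that gives a target power; the effect size of 0.5 standard deviations here is an assumption you would replace with the smallest effect you care about.

```python
# Per-group sample size for a two-sample t-test with 80% power at alpha = 0.05,
# assuming the smallest effect of interest is 0.5 standard deviations.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"about {n_per_group:.0f} subjects per group")  # roughly 64
```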
20%
Cohen’s classic Statistical Power Analysis for the Behavioral Sciences
20%
When you need to measure an effect with precision, rather than simply testing for significance, use assurance instead of power: design your experiment to measure the hypothesized effect to your desired level of precision.
20%
“Not significant” does not mean “nonexistent.”
20%
Look skeptically on the results of clearly underpowered studies. They may be exaggerated due to truth inflation.
20%
Use confidence intervals to determine the range of answers consistent with your data, regardless of statistical significance. When comparing groups of dif...
21%
A large sample size is supposed to ensure that any differences between groups are a result of my treatment, not genetics or preexisting conditions. But in this new design, I’m not recruiting new patients. I’m just counting the genetics of each existing patient 100 times. This problem is known as pseudoreplication
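A toy simulation (all numbers assumed) shows why this matters: with no true effect at all, treating every repeated measurement as an independent patient makes the test far too eager to find differences.

```python
# Two arms of 10 patients, 100 repeated measurements each, and NO true effect.
# Treating all 1,000 rows per arm as independent subjects (pseudoreplication)
# gives a very different answer than averaging each patient's measurements first.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def arm(n_patients=10, n_measurements=100):
    patient_means = rng.normal(0.0, 1.0, n_patients)             # between-patient variation
    noise = rng.normal(0.0, 0.2, (n_patients, n_measurements))   # within-patient noise
    return patient_means[:, None] + noise

a, b = arm(), arm()
p_pseudo = stats.ttest_ind(a.ravel(), b.ravel()).pvalue             # wrong: 1,000 "subjects" per arm
p_correct = stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue  # one value per real patient
print(f"pseudoreplicated p = {p_pseudo:.4f}, per-patient p = {p_correct:.4f}")
```

Averaging each patient down to one number is the same remedy suggested a few highlights below.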
21%
You can think of pseudoreplication as collecting data that answers the wrong question.
22%
Pseudoreplication can also be caused by taking separate measurements of the same subject over time (autocorrelation).
22%
quantify dependence so you can correctly interpret your data. (This means they usually give wider confidence intervals and larger p values than the naive analysis.)
22%
Average the dependent data points.
22%
To make your results reflect the level of certainty in your measurements, which increases as you take more, you’d perform a weighted analysis, weighting the better-measured patients more strongly.
22%
Analyze each dependent data point separately.
22%
Correct for the dependence by adjusting your p values and confidence intervals.
22%
estimate the size of the dependence between data points and account for it, including clustered standard errors, repeated measures tests, and hierarchical models.
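As one concrete option, statsmodels can attach cluster-robust standard errors to an ordinary regression; the synthetic data and column names below are assumptions for illustration.

```python
# Cluster-robust standard errors: measurements from the same patient may be
# correlated, so the corrected standard errors are wider than the naive ones
# that treat every row as independent.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
patients = np.repeat(np.arange(20), 10)            # 20 patients, 10 measurements each
treatment = (patients % 2).astype(float)           # half the patients treated
patient_effect = rng.normal(0, 1, 20)[patients]    # shared within-patient component
outcome = 0.5 * treatment + patient_effect + rng.normal(0, 0.3, len(patients))
df = pd.DataFrame({"outcome": outcome, "treatment": treatment, "patient": patients})

model = smf.ols("outcome ~ treatment", data=df)
naive = model.fit()
clustered = model.fit(cov_type="cluster", cov_kwds={"groups": df["patient"]})
print(f"naive SE: {naive.bse['treatment']:.3f}, clustered SE: {clustered.bse['treatment']:.3f}")
```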
23%
Principal components analysis determines which combinations of variables in the data account for the most variation in the results.
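For example, scikit-learn's PCA reports how much of the total variation each component explains; the synthetic four-variable dataset below is an assumption for illustration.

```python
# Four noisy copies of one underlying variable: the first principal component
# should account for nearly all of the variation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 1))
data = np.hstack([latent + rng.normal(0, 0.1, (200, 1)) for _ in range(4)])

pca = PCA(n_components=4).fit(data)
print(pca.explained_variance_ratio_)   # first entry dominates
```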
25%
Use statistical methods such as hierarchical models and clustered standard errors to account for a strong dependence between your measurements.
25%
my false discovery rate—the fraction of statistically significant results that are really false positives—is 38%.
25%
base rate
26%
low p values as a sign that error is unlikely:
26%
No!
26%
base rate fallacy.
26%
A p value is calculated under the assumption that the medication does not work. It tells me the probability of obtaining my data or data more extreme than it. It does not tell me the chance my medication is effective. A small p value is stronger evidence, but to calculate the probability that the medication is effective, you’d need to factor in the base rate.
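To see how the base rate enters, here is a back-of-the-envelope calculation with assumed numbers (not the book's figures): 1,000 candidate drugs, of which 10% truly work, tested with 80% power at p < 0.05.

```python
# Assumed: 1,000 candidate drugs, base rate 10%, power 0.80, alpha 0.05.
n_drugs, base_rate, power, alpha = 1000, 0.10, 0.80, 0.05

true_positives = n_drugs * base_rate * power           # working drugs that test significant
false_positives = n_drugs * (1 - base_rate) * alpha    # non-working drugs that slip through
fdr = false_positives / (true_positives + false_positives)
print(f"false discovery rate ~ {fdr:.0%}")             # about 36% under these assumptions
```

When true effects are rare, a large fraction of your statistically significant results will be false discoveries even though every one of them cleared the p < 0.05 bar.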
30%
P(false positive) = 1 – (1 – 0.05)^n. For n = 100, the false positive probability increases to 99%.
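A one-line check of that formula for n = 100 independent comparisons at the 0.05 level:

```python
# Probability of at least one false positive across n independent comparisons.
n, alpha = 100, 0.05
print(1 - (1 - alpha) ** n)   # about 0.994
```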
30%
look-elsewhere effect.
30%
Bonferroni correction method allows you to calculate p values as you normally would but says that if you make n comparisons in the trial, your criterion for significance should be p < 0.05/n. This lowers the chances of a false positive to what you’d see from making only one comparison at p < 0.05. However, as you can imagine, this reduces statistical power, since you’re demanding much stronger correlations before you conclude they’re statistically significant.
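A sketch of the correction (the p values are invented):

```python
# Bonferroni: with n comparisons, demand p < 0.05 / n for significance.
p_values = [0.001, 0.012, 0.030, 0.049]
alpha = 0.05
threshold = alpha / len(p_values)

for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f}: {verdict} (threshold {threshold:.4f})")
```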
32%
false discovery rate: the fraction of statistically significant results that are false positives.
32%
Yoav Benjamini and Yosef Hochberg devised an exceptionally simple procedure that tells you which p values to consider statistically significant.
32%
1. Perform your statistical tests and get the p value for each.
2. Make a list and sort it in ascending order.
3. Choose a false-discovery rate and call it q. Call the number of statistical tests m.
4. Find the largest p value such that p ≤ iq/m, where i is the p value’s place in the sorted list.
5. Call that p value and all smaller than it statistically significant.
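A direct translation of those steps into Python (the p values are invented and q = 0.10 is an arbitrary choice):

```python
# Benjamini-Hochberg: find the largest p value satisfying p <= i*q/m, where i
# is its rank in the sorted list and m is the number of tests; it and all
# smaller p values are declared significant.
def benjamini_hochberg(p_values, q=0.10):
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    significant = set(order[:cutoff_rank])
    return [i in significant for i in range(m)]

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(p_values, q=0.10))
```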
32%
Benjamini–Hochberg procedure
32%
The procedure usually provides better statistical power than the Bonferroni correction, and the false discovery rate is easier to interpret than the false positive rate.
33%
the probability of concluding that one drug has a significant effect and the other does not is 2B(1 – B).
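Assuming B here stands for the statistical power of each of the two (independent, equally powered) comparisons, the figure is just the chance that exactly one of them comes out significant:

```python
# With 80% power in each study, exactly one of the two drugs looks
# "significant" about a third of the time, even if their true effects are identical.
B = 0.80
print(2 * B * (1 - B))   # 0.32
```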