Kindle Notes & Highlights
The p value is the probability, under the assumption that there is no true effect or no true difference, of collecting data that shows a difference equal to or more extreme than what you actually observed.
Remember, a p value is not a measure of how right you are or how important a difference is. Instead, think of it as a measure of surprise.
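As a rough sketch of what that definition means in practice, the following simulation estimates a p value by shuffling group labels to mimic the "no true difference" assumption; the measurements are made-up numbers, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements for two groups (invented numbers, for illustration only).
treatment = np.array([5.1, 6.3, 4.8, 7.0, 6.1, 5.9])
control   = np.array([4.2, 5.0, 4.9, 5.3, 4.4, 5.1])

observed_diff = treatment.mean() - control.mean()

# Simulate the "no true difference" assumption: shuffle the group labels many times
# and see how often chance alone produces a difference at least this extreme.
pooled = np.concatenate([treatment, control])
n_treat = len(treatment)
null_diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(pooled)
    null_diffs.append(shuffled[:n_treat].mean() - shuffled[n_treat:].mean())

# Two-sided p value: fraction of shuffles at least as extreme as what we observed.
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print(f"observed difference = {observed_diff:.2f}, p = {p_value:.3f}")
```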
The Neyman-Pearson approach is where we get “statistical significance,” with a prechosen p value threshold that guarantees the long-run false positive rate.
A confidence interval combines a point estimate with the uncertainty in that estimate.
If you want to test whether an effect is significantly different from zero, you can construct a 95% confidence interval and check whether the interval includes zero. In the process, you get the added bonus of learning how precise your estimate is. If the confidence interval is too wide, you may need to collect more data.
For example, if you run a clinical trial, you might produce a confidence interval indicating that your drug reduces symptoms by somewhere between 15 and 25 percent. This effect is statistically significant because the interval doesn’t include zero.
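A minimal sketch of that workflow, assuming a simple two-group comparison with hypothetical data (the numbers and the rough t-based standard error are illustrative choices, not taken from the book):

```python
import numpy as np
from scipy import stats

# Hypothetical percent symptom reductions in two arms of a trial (illustrative only).
drug    = np.array([22, 18, 25, 19, 21, 17, 24, 20])
placebo = np.array([ 3,  1,  5,  0,  4,  2,  6,  1])

diff = drug.mean() - placebo.mean()
# Standard error of the difference in means (unequal-variance form).
se = np.sqrt(drug.var(ddof=1) / len(drug) + placebo.var(ddof=1) / len(placebo))
df = len(drug) + len(placebo) - 2  # rough df; Welch's formula would be more careful
t_crit = stats.t.ppf(0.975, df)

lo, hi = diff - t_crit * se, diff + t_crit * se
print(f"95% CI for the effect: ({lo:.1f}, {hi:.1f})")
print("significant at 0.05" if not (lo <= 0 <= hi) else "CI includes zero")
```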
The power of a study is the probability that it will distinguish an effect of a certain size from pure luck.
a narrow interval covering zero tells you that the effect is most likely small (which may be all you need to know, if a small effect is not practically useful), while a wide interval clearly shows that the measurement was not precise enough to draw conclusions.
accuracy in parameter estimation, or AIPE
but you’ve inflated the size of its effect because your study was underpowered. This effect, known as truth inflation, type M error (M for magnitude), or the winner’s curse
shrinkage. For counties with few residents, you can “shrink” the cancer rate estimates toward the national average by taking a weighted average of the county cancer rate with the national average rate. When the county has few residents, you weight the national average strongly; when the county is large, you weight the county strongly.
Unfortunately, it biases results in the opposite direction: small counties with truly abnormal cancer rates are estimated to have rates much closer to the national average than they are.
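One way to sketch this weighted-average idea in code, with made-up county figures and an arbitrarily chosen weighting constant (the book does not prescribe a specific rule):

```python
import numpy as np

# Hypothetical county data: population and observed cancer rate per 100,000 (illustrative).
population  = np.array([  900,  4_000,  60_000, 1_200_000])
county_rate = np.array([310.0,  150.0,   205.0,     198.0])
national_rate = 200.0

# One simple weighting rule: trust the county estimate in proportion to its population,
# relative to a "prior strength" constant (chosen arbitrarily here).
prior_strength = 20_000
weight = population / (population + prior_strength)

shrunk_rate = weight * county_rate + (1 - weight) * national_rate
for pop, raw, shrunk in zip(population, county_rate, shrunk_rate):
    print(f"pop {pop:>9,}: raw {raw:6.1f} -> shrunk {shrunk:6.1f}")
```

With these assumed numbers, the small counties are pulled strongly toward the national rate of 200 while the largest county barely moves, which is exactly the bias toward the average described above.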
For sites like reddit that have simple up-and-down votes rather than star ratings, one alternative is to generate a confidence interval for the fraction of positive votes. The interval starts wide when a comment has only a few votes and narrows to a definite value (“70% of voters like this comment”) as votes accumulate; sort the comments by the bottom bound of their confidence intervals. New comments start near the bottom, but the best among them accumulate votes and creep up the page as the confidence interval narrows. And because comments are sorted by the proportion of positive votes …
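A sketch of this ranking scheme, using the Wilson score interval as one common way to get a lower bound on the true fraction of positive votes (the comments and vote counts are invented for illustration):

```python
import math

def wilson_lower_bound(upvotes: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the ~95% Wilson score interval for the true upvote fraction."""
    if total == 0:
        return 0.0
    p = upvotes / total
    denom = 1 + z**2 / total
    centre = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (centre - margin) / denom

# Hypothetical comments: (name, upvotes, total votes) -- illustrative only.
comments = [("new comment", 1, 1), ("B", 70, 100), ("C", 9, 10), ("D", 400, 600)]
for name, up, total in sorted(comments, key=lambda c: wilson_lower_bound(c[1], c[2]), reverse=True):
    print(name, up, total, round(wilson_lower_bound(up, total), 3))
```

The brand-new comment with a single upvote sorts last despite its 100% positive fraction, because its interval is still wide; the heavily voted comments sort by well-estimated proportions.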
Calculate the statistical power when designing your study to determine the appropriate sample size.
Cohen’s classic Statistical Power Analysis for the Behavioral Sciences
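As an illustration of a power calculation for sample-size planning, here is a sketch using statsmodels' two-sample t-test power solver; the effect size, power target, and significance level are assumed values, not recommendations from the book:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a "medium" standardized effect (Cohen's d = 0.5)
# with 80% power at a 5% significance level -- all target values chosen for illustration.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"about {n_per_group:.0f} subjects per group")  # roughly 64
```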
When you need to measure an effect with precision, rather than simply testing for significance, use assurance instead of power: design your experiment to measure the hypothesized effect to your desired level of precision.
“Not significant” does not mean “nonexistent.”
Look skeptically on the results of clearly underpowered studies. They may be exaggerated due to truth inflation.
Use confidence intervals to determine the range of answers consistent with your data, regardless of statistical significance. When comparing groups of dif...
A large sample size is supposed to ensure that any differences between groups are a result of my treatment, not genetics or preexisting conditions. But in this new design, I’m not recruiting new patients. I’m just counting the genetics of each existing patient 100 times. This problem is known as pseudoreplication
You can think of pseudoreplication as collecting data that answers the wrong question.
Pseudoreplication can also be caused by taking separate measurements of the same subject over time (autocorrelation).
quantify dependence so you can correctly interpret your data. (This means they usually give wider confidence intervals and larger p values than the naive analysis.)
Average the dependent data points.
To make your results reflect the level of certainty in your measurements, which increases as you take more, you’d perform a weighted analysis, weighting the better-measured patients more strongly.
Analyze each dependent data point separately.
Correct for the dependence by adjusting your p values and confidence intervals.
estimate the size of the dependence between data points and account for it, including clustered standard errors, repeated measures tests, and hierarchical models.
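A sketch of two of these options on simulated repeated-measures data: averaging each patient's measurements before the analysis, and keeping all measurements but using standard errors clustered by patient. The data, model formula, and variable names are all invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical long-format data: 30 patients, 5 repeated measurements each (illustrative).
n_patients, n_reps = 30, 5
patient  = np.repeat(np.arange(n_patients), n_reps)
treated  = np.repeat(rng.integers(0, 2, n_patients), n_reps)
baseline = np.repeat(rng.normal(0, 1, n_patients), n_reps)   # patient-level variation
outcome  = 0.5 * treated + baseline + rng.normal(0, 0.3, n_patients * n_reps)
df = pd.DataFrame({"patient": patient, "treated": treated, "outcome": outcome})

# Option 1: average the dependent data points, then analyze one value per patient.
per_patient = df.groupby("patient", as_index=False).mean()
print(smf.ols("outcome ~ treated", data=per_patient).fit().summary().tables[1])

# Option 2: keep all rows but use cluster-robust standard errors (clustered by patient).
fit = smf.ols("outcome ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["patient"]}
)
print(fit.summary().tables[1])
```

Both approaches give wider, more honest confidence intervals than a naive analysis that pretends the 150 rows are independent observations.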
Principal components analysis determines which combinations of variables in the data account for the most variation in the results.
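A minimal sketch of that idea with scikit-learn, on made-up data in which two of the three variables are nearly redundant:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical dataset: three measured variables, two of which are strongly correlated,
# so most of the variation should be captured by the first component. Illustrative only.
x = rng.normal(size=200)
data = np.column_stack([x, x + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

pca = PCA()
pca.fit(data)
print("fraction of variance explained by each component:",
      pca.explained_variance_ratio_.round(2))
```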
Use statistical methods such as hierarchical models and clustered standard errors to account for a strong dependence between your measurements.
my false discovery rate—the fraction of statistically significant results that are really false positives—is 38%.
base rate
low p values as a sign that error is unlikely: No!
base rate fallacy.
A p value is calculated under the assumption that the medication does not work. It tells me the probability of obtaining my data or data more extreme than it. It does not tell me the chance my medication is effective. A small p value is stronger evidence, but to calculate the probability that the medication is effective, you’d need to factor in the base rate.
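A sketch of that base-rate arithmetic; the base rate, power, and significance threshold below are assumed for illustration and are not meant to reproduce the book's 38% figure exactly:

```python
# Illustrative numbers (not the book's exact example): out of many candidate medications,
# only 10% truly work; each trial has 80% power and uses a 5% significance threshold.
base_rate = 0.10
power = 0.80
alpha = 0.05

true_positive_rate  = base_rate * power           # truly effective AND detected
false_positive_rate = (1 - base_rate) * alpha     # ineffective but "significant" by chance

false_discovery_rate = false_positive_rate / (true_positive_rate + false_positive_rate)
print(f"fraction of significant results that are false positives: {false_discovery_rate:.0%}")
# roughly 36% with these assumed numbers
```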
P(false positive) = 1 – (1 – 0.05)^n. For n = 100, the false positive probability increases to 99%.
look-elsewhere effect.
The Bonferroni correction method allows you to calculate p values as you normally would but says that if you make n comparisons in the trial, your criterion for significance should be p < 0.05/n. This lowers the chances of a false positive to what you’d see from making only one comparison at p < 0.05. However, as you can imagine, this reduces statistical power, since you’re demanding much stronger correlations before you conclude they’re statistically significant.
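A short sketch of both points: the family-wise false positive probability from the formula above, and the Bonferroni-adjusted threshold applied to some hypothetical p values:

```python
n = 100
alpha = 0.05

# Probability of at least one false positive across n independent comparisons at p < 0.05.
print(f"P(at least one false positive) = {1 - (1 - alpha) ** n:.2f}")   # about 0.99

# Bonferroni: demand p < alpha / n for each of the n comparisons.
p_values = [0.0004, 0.003, 0.02, 0.2]           # hypothetical results from 4 of the tests
threshold = alpha / n                            # 0.0005 when n = 100
print([p for p in p_values if p < threshold])    # only the strongest result survives
```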
false discovery rate: the fraction of statistically significant results that are false positives.
Yoav Benjamini and Yosef Hochberg devised an exceptionally simple procedure that tells you which p values to consider statistically significant.
1. Perform your statistical tests and get the p value for each. Make a list and sort it in ascending order.
2. Choose a false-discovery rate and call it q. Call the number of statistical tests m.
3. Find the largest p value such that p ≤ iq/m, where i is the p value’s place in the sorted list.
4. Call that p value and all smaller than it statistically significant.
Benjamini–Hochberg procedure
The procedure usually provides better statistical power than the Bonferroni correction, and the false discovery rate is easier to interpret than the false positive rate.
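A direct sketch of those steps; the p values are hypothetical:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the p values declared significant at false-discovery rate q."""
    m = len(p_values)
    ranked = sorted(p_values)                      # ascending order
    # Largest rank i (1-based) such that p_(i) <= i*q/m.
    cutoff_rank = 0
    for i, p in enumerate(ranked, start=1):
        if p <= i * q / m:
            cutoff_rank = i
    return ranked[:cutoff_rank]                    # that p value and everything smaller

# Hypothetical p values from 8 tests (illustrative only).
ps = [0.001, 0.008, 0.012, 0.041, 0.049, 0.20, 0.38, 0.74]
print(benjamini_hochberg(ps, q=0.05))              # [0.001, 0.008, 0.012]
```

Note that 0.041 and 0.049 would pass an uncorrected 0.05 threshold but are not declared significant here, while the procedure keeps more results than a Bonferroni cutoff of 0.05/8 = 0.00625 would.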
the probability of concluding that one drug has a significant effect and the other does not is 2B(1 – B).
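As an illustrative check, taking B to be the power of each study (the formula is symmetric, so the same number results if B is instead the false negative rate): with B = 0.8, the probability is 2 × 0.8 × (1 − 0.8) = 0.32, so there is roughly a one-in-three chance of seeing a "difference in significance" between two equally effective treatments.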