Statistics Done Wrong: The Woefully Complete Guide
Kindle Notes & Highlights
Read between April 2, 2019 and January 26, 2020
13%
The size of the bias you’re looking for.
Yuan
Deviation
13%
The sample size.
Yuan
N
13%
Measurement error.
Yuan
Measure. Many psychological studies fail to answer the research question by measuring the wrong thing
14%
With only 100 flips, there’s just too little data to always separate bias from random variation.
14%
Another problem is that even if the coin is perfectly fair, I will falsely accuse it of bias 5% of the time. I’ve designed my test to interpret outcomes with p < 0.05 as a sign of bias,
14%
Fortunately, an increased sample size improves the sensitivity.
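A minimal simulation sketch (not from the book) of how sample size drives this sensitivity: a two-sided binomial test at p < 0.05 on a coin assumed, for illustration, to land heads 60% of the time.

```python
# Sketch: power of a two-sided binomial test to detect a coin biased
# toward 60% heads, at p < 0.05, for different numbers of flips.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
true_p = 0.60          # assumed bias, for illustration
trials = 2000          # simulated experiments per sample size

for flips in (100, 500, 1000):
    heads = rng.binomial(flips, true_p, size=trials)
    rejected = sum(binomtest(int(h), flips, 0.5).pvalue < 0.05 for h in heads)
    print(f"{flips:5d} flips: detected bias in {rejected / trials:.0%} of experiments")
```

With only 100 flips, the test catches the assumed bias in well under half of the simulated experiments; the detection rate climbs as the number of flips grows.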
14%
Often, performing a sufficiently powerful test is out of the question for purely practical reasons.
14%
measurements. If you were to compare the IQs of two groups of people, you’d see not only the normal variation in intelligence from one person to the next but also the random variation in individual scores.
14%
More data helps distinguish the signal from the noise. But this is easier said than done: many scientists don’t have the resources to conduct studies with adequate statistical power to detect what they’re looking for. They are doomed to fail before they even start.
14%
If a trial isn’t powerful enough to detect the effect it’s looking for, we say it is underpowered.
14%
You might think calculations of statistical power are essential for medical trials; a scientist might want to know how many patients are needed to test a new medication, and a quick calculation of statistical power would provide the answer.
14%
Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of detecting a real effect of the expected size. (If the true effect is ac...
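A sketch of the kind of power calculation described here, using statsmodels; the two-sample t-test setup and the "medium" effect size of d = 0.5 are illustrative assumptions.

```python
# Sketch: how many patients per group are needed for 80% power to
# detect an assumed "medium" effect (Cohen's d = 0.5) at alpha = 0.05?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"about {n_per_group:.0f} patients per group")  # roughly 64
```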
15%
And nearly two-thirds of the negative trials didn’t have the power to detect a 50% difference.3
15%
A more recent study of trials in cancer research found similar results: only about half of published studies with negative results had enough statistical power to detect even a large difference in their primary outcome variable.4 Less than 10% of these studies explained why their sample sizes were so poor. Similar problems have been consistently seen in other fields of medicine.5,6
15%
concern. If each study is underpowered, the true effect will likely be discovered only after many studies using many animals have been completed and analyzed—using far more animal subjects than if the study had been done properly in the first place.
15%
Then, in 1989, a review showed that in the decades since Cohen’s research, the average study’s power had actually decreased.9 This decrease was because of researchers becoming aware of another problem, the issue of multiple comparisons, and compensating for it in a way that reduced their studies’ power.
15%
Math is another possible explanation for why power calculations are so uncommon: analytically calculating power can be difficult or downright impossible.
16%
As long as this significant result is interesting enough to feature in a paper, the scientist will not feel that her studies are underpowered.
16%
But it’s misleading to assume these results mean there is no real difference. There may be a difference, even an important one, but the study was so small it’d be lucky to notice it.
Yuan
The story is the other way around: a finding of no real difference could be caused by a small sample size as well. So false-negative studies need to be "powered" by increasing the sample size.
16%
Wrong Turns on Red
Yuan
This is a superb example of "statistical insignificance versus practical insignificance."
16%
In other words, he turned statistical insignificance into practical insignificance.
17%
Confidence Intervals and Empowerment
Yuan
Curious about their relationship
17%
Thinking about results in terms of confidence intervals provides a new way to approach experimental design.
17%
“How much data must I collect to measure the effect to my desired precision?”
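A small worked sketch of that question, assuming a roughly normal estimate of a mean and a guessed standard deviation; the numbers are made up.

```python
# Sketch: n needed so the 95% CI half-width (margin of error) is at most E,
# using n >= (z * sigma / E)^2 for a mean with assumed standard deviation sigma.
import math
from scipy.stats import norm

sigma = 15.0          # assumed standard deviation (e.g., an IQ-like scale)
E = 2.0               # desired margin of error
z = norm.ppf(0.975)   # two-sided 95% critical value, about 1.96

n = math.ceil((z * sigma / E) ** 2)
print(f"need about {n} observations")  # about 217
```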
18%
You’ve correctly concluded Fixitol is effective, but you’ve inflated the size of its effect because your study was underpowered.
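A minimal simulation sketch (not from the book) of this inflation: among underpowered trials, the ones that happen to reach p < 0.05 overestimate the true effect. The true effect size and group size below are illustrative assumptions.

```python
# Sketch: effect-size inflation in underpowered studies. The true effect is
# 0.3 SD, but among simulated trials that reach p < 0.05, the average
# estimated effect is noticeably larger.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
true_effect, n = 0.3, 20          # assumed true effect (in SDs) and group size
significant_effects = []

for _ in range(5000):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    if ttest_ind(treated, control).pvalue < 0.05:
        significant_effects.append(treated.mean() - control.mean())

print(f"true effect: {true_effect}")
print(f"mean estimate among significant trials: {np.mean(significant_effects):.2f}")
```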
18%
Studies that produce less “exciting” results are closer to the truth but less interesting to a major journal editor.21
18%
When a study claims to have detected a large effect with a relatively small sample, your first reaction should not be “Wow, they’ve found something big!” but “Wow, this study is underpowered!”
19%
Another example: in the United States, counties with the lowest rates of kidney cancer tend to be Midwestern, Southern, and Western rural counties.
19%
On the other hand, counties with the highest rates of kidney cancer tend to be Midwestern, Southern, and Western rural counties.
20%
A popular strategy to fight this problem is called shrinkage.
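A minimal sketch of the shrinkage idea (the book's exact method may differ): pull each county's observed rate toward the overall rate, with small counties pulled hardest. The counties, counts, and weighting constant are made up for illustration.

```python
# Sketch: shrinking small-sample rates toward the overall mean.
# Each county's estimate is a weighted average of its own rate and the
# overall rate; the weight on the county's own data grows with population.
counties = {            # made-up (population, cancer cases) pairs
    "Tiny County":  (1_000, 2),
    "Small County": (10_000, 9),
    "Big County":   (1_000_000, 850),
}

total_pop = sum(pop for pop, _ in counties.values())
total_cases = sum(cases for _, cases in counties.values())
overall_rate = total_cases / total_pop

prior_strength = 50_000   # assumed "pseudo-population" controlling the pull
for name, (pop, cases) in counties.items():
    raw = cases / pop
    weight = pop / (pop + prior_strength)
    shrunk = weight * raw + (1 - weight) * overall_rate
    print(f"{name:12s} raw {raw:.5f} -> shrunk {shrunk:.5f}")
```

The tiny county's extreme raw rate gets pulled almost all the way back to the overall rate, while the big county's estimate barely moves.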
20%
Cohen defined “medium-sized” as a 0.5-standard-deviation difference between groups.
26%
A p value is calculated under the assumption that the medication does not work. It tells me the probability of obtaining my data or data more extreme than it. It does not tell me the chance my medication is effective.
26%
A small p value is stronger evidence, but to calculate the probability that the medication is effective, you’d need to factor in the base rate.
26%
A 2002 study found that an overwhelming majority of statistics students—and instructors—failed a simple quiz about p values.
27%
In 90% of women with breast cancer, the mammogram will correctly detect it. (That’s the statistical power of the test. This is an estimate, since it’s hard to tell how many cancers we miss if we don’t know they’re there.) However,
27%
Only 9% of women with positive mammograms have breast cancer. Even doctors get this wrong. If you ask them, two-thirds will erroneously conclude that a p < 0.05 result implies a 95% chance that the result is true.2 But as you can see in these examples, the likelihood that a positive mammogram means cancer depends on the proportion of women who actually have cancer. And we are very fortunate that only a small proportion of women have breast cancer at any given time.
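A worked sketch of the base-rate arithmetic behind that 9% figure. The 90% sensitivity comes from the highlight above; the 1% prevalence and 9% false positive rate are illustrative assumptions, not necessarily the book's exact numbers.

```python
# Sketch: why a positive mammogram is far from a 90% guarantee of cancer.
# P(cancer | positive) = P(positive | cancer) * P(cancer) / P(positive)
prevalence = 0.01           # assumed fraction of women with breast cancer
sensitivity = 0.90          # from the highlight: 90% of cancers are detected
false_positive_rate = 0.09  # assumed rate of positives among healthy women

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_cancer_given_positive = sensitivity * prevalence / p_positive
print(f"P(cancer | positive mammogram) = {p_cancer_given_positive:.0%}")  # about 9%
```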
27%
So when the US Surgeon General’s famous report Smoking and Health came out in 1964, saying that tobacco smoking causes lung cancer, tobacco companies turned to Huff to provide their public rebuttal.[11]
Yuan
Ironic
28%
That is, he interprets the p value as the probability that the results are a fluke.
30%
As the comic shows, making multiple comparisons means multiple chances for a false positive. The more tests I perform, the greater the chance that at least one of them will produce a false positive.
30%
Suppose we have n independent hypotheses to test, none of which is true. We set our significance criterion at p < 0.05. The probability of obtaining at least one false positive among the n tests is as follows: P(false positive) = 1 – (1 – 0.05)^n. For n = 100, the false positive probability increases to 99%.
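A quick numerical check of that formula:

```python
# Quick check of P(at least one false positive) = 1 - (1 - 0.05)^n.
for n in (1, 10, 100):
    p_any_false_positive = 1 - (1 - 0.05) ** n
    print(f"n = {n:3d}: {p_any_false_positive:.0%}")
# n = 100 gives about 99%, as the highlight says.
```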
30%
If you send out a 10-page survey asking about nuclear power plant proximity, milk consumption, age, number of male cousins, favorite pizza topping, current sock color, and a few dozen other factors for good measure, you’ll probably find that at least one of those things is correlated with cancer.
Yuan
Now I understand the genetic study problem
30%
If we want to make many comparisons at once but control the overall false positive rate, the p value should be calculated under the assumption that none of the differences is real.
Yuan
For genetic studies, such as correlating a disease with genes, there are so many possible correlations to examine that the p value needs to be really small to avoid false positives.
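A tiny sketch of the arithmetic behind this note, using a Bonferroni-style correction; the one million tests is an illustrative assumption (it happens to give the conventional genome-wide significance threshold of 5 × 10⁻⁸).

```python
# Sketch: Bonferroni-style threshold when testing many genetic variants.
# To keep the overall false positive rate near 5%, each individual test
# must clear a much stricter bar.
overall_alpha = 0.05
n_tests = 1_000_000     # assumed number of variants tested
per_test_threshold = overall_alpha / n_tests
print(f"per-test threshold: {per_test_threshold:.0e}")  # 5e-08
```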
31%
So why assume the null hypothesis is true in the first place?
31%
Accurate estimates of the size of differences are much more interesting than checking only whether each effect could be zero.
32%
Scientists are more interested in limiting the false discovery rate: the fraction of statistically significant results that are false positives.
32%
In 1995, Yoav Benjamini and Yosef Hochberg devised an exceptionally simple procedure that tells you which p values to consider statistically significant.
32%
You’re done! The procedure guarantees that out of all statistically significant results, on average no more than q percent will be false positives.11 I hope the method makes
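A minimal sketch of the Benjamini–Hochberg step-up procedure as it is commonly stated (the book's exact wording of the steps isn't in these highlights): sort the p values, compare the i-th smallest to i·q/m, and call significant everything up to the largest p value that clears its threshold.

```python
# Sketch of the Benjamini-Hochberg procedure: control the false discovery
# rate at level q by comparing the i-th smallest p value to i * q / m.
def benjamini_hochberg(p_values, q=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank   # largest rank whose p value clears its threshold
    significant = set(order[:cutoff_rank])
    return [i in significant for i in range(m)]

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))  # only the first two are declared significant
```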
32%
The Benjamini–Hochberg procedure is fast and effective, and it has been widely adopted by statisticians and scientists. It’s particularly appropriate when testing hundreds of hypotheses that are expected to be mostly false, such as associating genes with diseases. (The vast majority of genes have nothing to do with a particular
Yuan
Aha
32%
Remember, p < 0.05 isn’t the same as a 5% chance your result is false.
32%
If you are testing multiple hypotheses or looking for correlations between many variables, use a procedure such as Bonferroni or Benjamini–Hochberg (or one of their various derivatives and adaptations) to control for the excess of false positives.
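In practice there is no need to code the corrections by hand; a sketch using statsmodels' multipletests, with made-up p values:

```python
# Sketch: Bonferroni and Benjamini-Hochberg corrections via statsmodels.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

reject_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
reject_bh, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Bonferroni rejects:        ", list(reject_bonf))
print("Benjamini-Hochberg rejects:", list(reject_bh))
```

Bonferroni controls the chance of any false positive at all and so rejects fewer hypotheses; Benjamini–Hochberg controls the false discovery rate and is less conservative.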