Kindle Notes & Highlights
Read between April 2, 2019 and January 26, 2020
With only 100 flips, there’s just too little data to always separate bias from random variation.
Another problem is that even if the coin is perfectly fair, I will falsely accuse it of bias 5% of the time. I’ve designed my test to interpret outcomes with p < 0.05 as a sign of bias.
Fortunately, an increased sample size improves the sensitivity.
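As a rough illustration of that point (not from the book), the sketch below simulates a coin that lands heads 60% of the time and estimates how often a two-sided binomial test at p < 0.05 flags it as biased; the 0.6 bias, the flip counts, and the simulation size are all assumptions.

```python
# Illustrative simulation: power of a p < 0.05 binomial test to detect a coin
# that actually lands heads 60% of the time, at several sample sizes.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
true_heads_prob = 0.6      # assumed bias of the coin
n_experiments = 2000       # simulated repetitions per flip count

for n_flips in (100, 500, 1000):
    heads = rng.binomial(n_flips, true_heads_prob, size=n_experiments)
    rejections = sum(binomtest(int(k), n_flips, p=0.5).pvalue < 0.05 for k in heads)
    print(f"{n_flips:>4} flips: estimated power ~ {rejections / n_experiments:.2f}")
```

With 100 flips the test catches this bias only about half the time; with 1,000 flips it almost always does.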
Often, performing a sufficiently powerful test is out of the question for purely practical reasons.
If you were to compare the IQs of two groups of people, you’d see not only the normal variation in intelligence from one person to the next but also the random variation in individual scores.
More data helps distinguish the signal from the noise. But this is easier said than done: many scientists don’t have the resources to conduct studies with adequate statistical power to detect what they’re looking for. They are doomed to fail before they even start.
If a trial isn’t powerful enough to detect the effect it’s looking for, we say it is underpowered.
You might think calculations of statistical power are essential for medical trials; a scientist might want to know how many patients are needed to test a new medication, and a quick calculation of statistical power would provide the answer.
Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of detecting a real effect of the expected size. (If the true effect is actually larger, the study will have an even better chance of detecting it.)
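As a quick sketch of what such a power calculation looks like in practice (my example, not the book's), statsmodels can solve for the sample size that gives 80% power to detect a half-standard-deviation difference between two groups at p < 0.05:

```python
# Hedged example: sample size per group for a two-sample t-test with 80% power
# to detect a "medium" effect of 0.5 standard deviations at alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,  # Cohen's d: half a standard deviation between groups
    alpha=0.05,       # significance threshold
    power=0.8,        # desired probability of detecting a real effect
)
print(round(n_per_group))  # roughly 64 patients in each group
```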
And nearly two-thirds of the negative trials didn’t have the power to detect a 50% difference.3
A more recent study of trials in cancer research found similar results: only about half of published studies with negative results had enough statistical power to detect even a large difference in their primary outcome variable.4 Less than 10% of these studies explained why their sample sizes were so poor. Similar problems have been consistently seen in other fields of medicine.5,6
If each study is underpowered, the true effect will likely be discovered only after many studies using many animals have been completed and analyzed—using far more animal subjects than if the study had been done properly in the first place.
Then, in 1989, a review showed that in the decades since Cohen’s research, the average study’s power had actually decreased.9 This decrease was because of researchers becoming aware of another problem, the issue of multiple comparisons, and compensating for it in a way that reduced their studies’ power.
Math is another possible explanation for why power calculations are so uncommon: analytically calculating power can be difficult or downright impossible.
As long as this significant result is interesting enough to feature in a paper, the scientist will not feel that her studies are underpowered.
But it’s misleading to assume these results mean there is no real difference. There may be a difference, even an important one, but the study was so small it’d be lucky to notice it.
The story also runs the other way: a finding of no real difference can be caused by a small sample size as well. So negative studies need to be adequately "powered" by increasing the sample size.
In other words, he turned statistical insignificance into practical insignificance.
Thinking about results in terms of confidence intervals provides a new way to approach experimental design.
“How much data must I collect to measure the effect to my desired precision?”
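A minimal sketch of that precision-first question, with assumed numbers: if the 95% confidence interval for a mean should be no wider than a chosen half-width, and you have a rough guess of the standard deviation, the required sample size follows from the normal approximation.

```python
# Precision-based planning (illustrative numbers): choose n so that the 95%
# confidence interval for a mean has the desired half-width.
import math

sigma = 15.0   # assumed standard deviation of individual measurements
margin = 2.0   # desired half-width of the 95% confidence interval
z = 1.96       # normal quantile for 95% confidence

n = math.ceil((z * sigma / margin) ** 2)
print(n)       # about 217 observations
```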
You’ve correctly concluded Fixitol is effective, but you’ve inflated the size of its effect because your study was underpowered.
Studies that produce less “exciting” results are closer to the truth but less interesting to a major journal editor.21
When a study claims to have detected a large effect with a relatively small sample, your first reaction should not be “Wow, they’ve found something big!” but “Wow, this study is underpowered!”
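A small simulation (assumptions mine) shows why: with a modest true effect and small groups, only the experiments that got lucky with a large apparent effect cross p < 0.05, so the significant ones systematically overstate the truth.

```python
# Illustrative "truth inflation" simulation: true effect of 0.3 SD, 20 subjects
# per group, and a look at the effect sizes of the runs that reach p < 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
true_effect, n_per_group, n_studies = 0.3, 20, 5000
significant_effects = []

for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    if ttest_ind(treated, control).pvalue < 0.05:
        significant_effects.append(treated.mean() - control.mean())

print("true effect:", true_effect)
print("mean estimate among significant runs:", round(float(np.mean(significant_effects)), 2))
```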
Another example: in the United States, counties with the lowest rates of kidney cancer tend to be Midwestern, Southern, and Western rural counties.
On the other hand, counties with the highest rates of kidney cancer tend to be Midwestern, Southern, and Western rural counties.
A popular strategy to fight this problem is called shrinkage.
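A toy simulation (all numbers assumed, not the book's data) makes the kidney-cancer paradox and the shrinkage fix concrete: give every county the same true rate, and the smallest counties still dominate both extremes of the observed rates until each county's estimate is pulled toward the overall rate.

```python
# Toy model: every county has the same true rate, but county sizes vary widely.
import numpy as np

rng = np.random.default_rng(2)
true_rate = 0.0001                                   # identical in every county
populations = rng.integers(1_000, 1_000_000, 3000)   # assumed county sizes
cases = rng.binomial(populations, true_rate)
raw = cases / populations

# The smallest counties produce both the highest and the lowest observed rates.
for label, idx in (("highest", np.argsort(raw)[-10:]), ("lowest", np.argsort(raw)[:10])):
    print(f"median population of 10 {label}-rate counties: {np.median(populations[idx]):,.0f}")

# Crude shrinkage: blend each county's rate with the overall rate, weighting
# small counties heavily toward it (a simple stand-in for empirical Bayes).
overall = cases.sum() / populations.sum()
pseudo_pop = 100_000                                  # assumed prior weight
shrunk = (cases + pseudo_pop * overall) / (populations + pseudo_pop)
print(f"raw rates span    {raw.min():.6f} to {raw.max():.6f}")
print(f"shrunk rates span {shrunk.min():.6f} to {shrunk.max():.6f}")
```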
Cohen defined “medium-sized” as a 0.5-standard-deviation difference between groups.
A p value is calculated under the assumption that the medication does not work. It tells me the probability of obtaining my data or data more extreme than it. It does not tell me the chance my medication is effective.
A small p value is stronger evidence, but to calculate the probability that the medication is effective, you’d need to factor in the base rate.
A 2002 study found that an overwhelming majority of statistics students—and instructors—failed a simple quiz about p values.
In 90% of women with breast cancer, the mammogram will correctly detect it. (That’s the statistical power of the test. This is an estimate, since it’s hard to tell how many cancers we miss if we don’t know they’re there.)
Only 9% of women with positive mammograms have breast cancer. Even doctors get this wrong. If you ask them, two-thirds will erroneously conclude that a p < 0.05 result implies a 95% chance that the result is true.2 But as you can see in these examples, the likelihood that a positive mammogram means cancer depends on the proportion of women who actually have cancer. And we are very fortunate that only a small proportion of women have breast cancer at any given time.
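Here is the arithmetic behind that 9%, as a hedged Bayes' rule sketch; the 90% sensitivity comes from the highlight above, while the prevalence and false positive rate are illustrative assumptions chosen to land near the book's figure.

```python
# Bayes' rule with illustrative numbers: what fraction of positive mammograms
# actually indicate cancer?
prevalence = 0.008           # assumed: 0.8% of screened women have cancer
sensitivity = 0.90           # from the highlight: 90% of cancers are detected
false_positive_rate = 0.07   # assumed: 7% of healthy women test positive

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_cancer_given_positive = sensitivity * prevalence / p_positive
print(f"{p_cancer_given_positive:.0%}")  # roughly 9%
```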
That is, he interprets the p value as the probability that the results are a fluke.
As the comic shows, making multiple comparisons means multiple chances for a false positive. The more tests I perform, the greater the chance that at least one of them will produce a false positive.
Suppose we have n independent hypotheses to test, none of which is true. We set our significance criterion at p < 0.05. The probability of obtaining at least one false positive among the n tests is as follows:

P(at least one false positive) = 1 − (1 − 0.05)^n

For n = 100, the false positive probability increases to 99%.
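The formula is easy to check directly (a quick sketch, not from the book):

```python
# Probability of at least one false positive among n independent tests of true
# null hypotheses, each at the p < 0.05 threshold.
for n in (1, 10, 100):
    print(n, round(1 - (1 - 0.05) ** n, 3))
# prints 0.05 for one test, about 0.4 for ten, and about 0.99 for a hundred
```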
If you send out a 10-page survey asking about nuclear power plant proximity, milk consumption, age, number of male cousins, favorite pizza topping, current sock color, and a few dozen other factors for good measure, you’ll probably find that at least one of those things is correlated with cancer.
If we want to make many comparisons at once but control the overall false positive rate, the p value should be calculated under the assumption that none of the differences is real.
In genetic studies, such as searches for a correlation between a disease and a gene, there are so many possible correlations to examine that the p value threshold needs to be very small to avoid false positives.
So why assume the null hypothesis is true in the first place?
Accurate estimates of the size of differences are much more interesting than checking only whether each effect could be zero.
Scientists are more interested in limiting the false discovery rate: the fraction of statistically significant results that are false positives.
In 1995, Yoav Benjamini and Yosef Hochberg devised an exceptionally simple procedure that tells you which p values to consider statistically significant.
You’re done! The procedure guarantees that out of all statistically significant results, on average no more than q percent will be false positives.11
The Benjamini–Hochberg procedure is fast and effective, and it has been widely adopted by statisticians and scientists. It’s particularly appropriate when testing hundreds of hypotheses that are expected to be mostly false, such as associating genes with diseases. (The vast majority of genes have nothing to do with a particular disease.)
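A from-scratch sketch of the procedure just described (variable names mine): sort the p values, compare the k-th smallest to k/m times q, and declare everything up to the largest passing rank significant.

```python
# Benjamini-Hochberg at false discovery rate q, implemented directly.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * q        # k/m * q for rank k
    passing = np.nonzero(p[order] <= thresholds)[0]   # ranks that meet their cutoff
    significant = np.zeros(m, dtype=bool)
    if passing.size:
        significant[order[: passing[-1] + 1]] = True  # keep everything up to the largest passing rank
    return significant
```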
Remember, p < 0.05 isn’t the same as a 5% chance your result is false.
If you are testing multiple hypotheses or looking for correlations between many variables, use a procedure such as Bonferroni or Benjamini–Hochberg (or one of their various derivatives and adaptations) to control for the excess of false positives.
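As a usage sketch with made-up p values, statsmodels bundles both corrections; "fdr_bh" is the Benjamini–Hochberg procedure described above.

```python
# Compare Bonferroni and Benjamini-Hochberg on ten invented p values.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.440]
for method in ("bonferroni", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, int(reject.sum()), "of", len(pvals), "declared significant")
```

Bonferroni controls the chance of any false positive at all and is therefore stricter; Benjamini–Hochberg typically declares more results significant while still controlling the false discovery rate.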