Kindle Notes & Highlights
Read between April 2, 2019 and January 26, 2020
The first principle is that you must not fool yourself, and you are the easiest person to fool.
This already happens on most websites that discuss science news, and it would annoy me endlessly to see this book used to justify it. The first comments on a news article are always complaints about how “they didn’t control for this variable” and “the sample size is too small,” and 9 times out of 10, the commenter never read the scientific paper to notice that their complaint was addressed in the third paragraph.
A research paper’s statistical methods can be judged only in detail and in context with the rest of its methods: study design, measurement techniques, cost constraints, and goals.
Use your statistical knowledge to better understand the strengths, limitations, and potential biases of research, not to shoot down any paper that seems to misuse a p value or contradict your personal beliefs.
Also, remember that a conclusion supported by poor statistics can still be correct—statistical and logical errors do not make a conclusion false; they merely leave it unsupported.
In short, please practice statistics...
But few people complain about statistics done by trained scientists. Scientists seek understanding, not ammunition to use against political opponents.
t tests, p values, proportional hazards models, propensity scores, logistic regressions, least-squares fits, and confidence intervals.
Review articles and editorials appear regularly in leading journals, demanding higher statistical standards and tougher review, but few scientists hear their pleas, and journal-mandated standards are often ignored.
“We are fast becoming a nuisance to society. People don’t take us seriously anymore, and when they do take us seriously, we may unintentionally do more harm than good.”
Hanlon’s razor directs us to “never attribute to malice that which is adequately explained by incompetence.”
“torture the data until it confesses.”
Readers interested in the pharmaceutical industry’s statistical misadventures may enjoy Ben Goldacre’s Bad Pharma (Faber & Faber, 2012), which caused a statistically significant increase in my blood pressure while I read it.
We use statistics to make judgments about these kinds of differences. We will always observe some difference due to luck and random variation, so statisticians talk about statistically significant differences when the difference is larger than could easily be produced by luck. So first we must learn how to make that decision.
“Even if my medication were completely ineffective, what are the chances my experiment would have produced the observed outcome?”
The p value is the probability, under the assumption that there is no true effect or no true difference, of collecting data that shows a difference equal to or more extreme than what you actually observed.
Remember, a p value is not a measure of how right you are or how important a difference is. Instead, think of it as a measure of surprise.
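To make that definition concrete, here is a minimal Python sketch (the data and variable names are invented for illustration, not taken from the book) that estimates a p value by permutation: assume there is no true difference, shuffle the group labels many times, and count how often chance alone produces a difference at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements for a treated and a control group.
treated = np.array([8.2, 7.9, 9.1, 8.7, 8.4, 9.0])
control = np.array([7.8, 8.1, 7.5, 8.0, 7.7, 8.3])
observed = treated.mean() - control.mean()

# Under the "no true difference" assumption the group labels are
# interchangeable, so shuffle them and see how often luck alone gives
# a difference at least as extreme as the one actually observed.
pooled = np.concatenate([treated, control])
n_perm, extreme = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[:len(treated)].mean() - pooled[len(treated):].mean()
    if abs(diff) >= abs(observed):
        extreme += 1

print(f"observed difference: {observed:.2f}, two-sided p ≈ {extreme / n_perm:.3f}")
```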
The choice of 0.05 isn’t because of any special logical or statistical reasons, but it has become scientific convention through decades of common use.
Remember, p is a measure of surprise.
It’s not a measure of the size of the effect. You can get a tiny p value by measuring a huge effect—“This medicine makes people live four times longer”—or by measuring a tiny effect with great certainty.
because any medication or intervention usually has some real effect, you can always get a statistically significant result by collecting so much data that you detect extremely small, practically unimportant differences.
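One way to see that point numerically (my own sketch, with arbitrary numbers and a standard two-sample t test from SciPy, not an example from the book): give the "treatment" group a real but trivially small advantage and watch the p value shrink as the sample size grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.01  # a real but practically meaningless difference

# With small samples this difference is invisible; with enormous samples
# the same difference becomes "statistically significant".
for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```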
There’s no mathematical tool to tell you whether your hypothesis is true or false; you can see only whether it’s consistent with the data. If the data is sparse or unclear, your conclusions will be uncertain.
Recall that a p value is calculated under the assumption that luck (not your medication or intervention) is the only factor in your experiment, and that p is defined as the probability of obtaining a result equal to or more extreme than the one observed.
which makes p values “psychic”: two experiments with different designs can produce identical data but different p values because the unobserved data is different.
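A standard numerical illustration of this (my own, not the book's, using SciPy's binomial and negative binomial distributions): suppose a coin lands heads 9 times in 12 flips. If the design fixed the number of flips at 12, the one-sided p value counts outcomes with 9 or more heads; if the design was "flip until the third tail appears", the same 12 flips are judged against outcomes needing 12 or more flips, and the p value changes because the unobserved extreme outcomes are different.

```python
from scipy import stats

heads, tails = 9, 3  # the observed data: 9 heads in 12 flips

# Design 1: exactly 12 flips were planned in advance.
# "As or more extreme" = 9 or more heads out of 12.
p_fixed_n = stats.binom.sf(heads - 1, heads + tails, 0.5)

# Design 2: flipping continued until the 3rd tail appeared.
# "As or more extreme" = 9 or more heads before the 3rd tail.
p_stop_rule = stats.nbinom.sf(heads - 1, tails, 0.5)

print(f"fixed number of flips: p = {p_fixed_n:.3f}")   # about 0.073
print(f"stop at the 3rd tail:  p = {p_stop_rule:.3f}") # about 0.033
```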
The p value, when combined with an experimenter’s prior experience and domain knowledge, could be useful in deciding how to interpret new data.
In science, it is important to limit two kinds of errors: false positives, where you conclude there is an effect when there isn’t, and false negatives, where you fail to notice a real effect.
If we’re too ready to jump to conclusions about effects, we’re prone to get false positives; if we’re too conservative, we’ll err on the side of false negatives.
it is possible to develop a formal decision-making process that will ensure false positives occur only at some predefined rate. They called this rate α, and their idea was for experimenters to set an α based upon their experience and expectations. So, for instance, if we’re willing to put up with a 10% rate of false positives, we’ll set α = 0.1.
To determine which testing procedure is best, we see which has the lowest false negative rate for a given choice of α.
We use the p value to implement the Neyman-Pearson testing procedure by rejecting the null hypothesis whenever p < α.
Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.3
A single experiment does not have a false positive rate.
The false positive rate is determined by your procedure, not the result of any single experiment.
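That long-run guarantee is easy to check by simulation (a sketch with arbitrary parameters, assuming a two-sample t test): run many experiments in which the null hypothesis really is true, reject whenever p < α, and the fraction rejected comes out close to α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n_experiments, rejections = 0.05, 10_000, 0

for _ in range(n_experiments):
    # Both groups come from the same distribution, so the null
    # hypothesis of "no difference" is true in every experiment.
    a = rng.normal(0.0, 1.0, 30)
    b = rng.normal(0.0, 1.0, 30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        rejections += 1

print(f"fraction of experiments rejected: {rejections / n_experiments:.3f} "
      f"(alpha = {alpha})")
```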
Confidence intervals can answer the same questions as p values, with the advantage that they provide more information and are more straightforward to interpret.
A confidence interval combines a point estimate with the uncertainty in that estimate.
A confidence interval quantifies the uncertainty in your conclusions, providing vastly more information than a p value, which says nothing about effect sizes.
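As a sketch of what that looks like in practice (invented numbers, and a normal-approximation interval; a t-based interval would be slightly wider for samples this small): the point estimate is the difference in group means, and the 95% interval is that estimate plus or minus about two standard errors.

```python
import numpy as np
from scipy import stats

# Hypothetical symptom scores under a drug and a placebo.
drug = np.array([12.1, 10.8, 11.5, 13.0, 12.4, 11.9, 12.7, 11.2])
placebo = np.array([13.5, 12.9, 14.1, 13.8, 12.6, 13.2, 14.0, 13.4])

estimate = drug.mean() - placebo.mean()              # point estimate
se = np.sqrt(drug.var(ddof=1) / len(drug) +
             placebo.var(ddof=1) / len(placebo))     # standard error of the difference
z = stats.norm.ppf(0.975)                            # about 1.96 for 95% coverage

low, high = estimate - z * se, estimate + z * se
print(f"difference in means: {estimate:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```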
If the symptom is already pretty innocuous, maybe a 15–25% improvement isn’t too important. Then again, for a symptom like spontaneous human combustion, you might get excited about any improvement.
If you can write a result as a confidence interval instead of as a p value, you should.7
One possible explanation is that confidence intervals go unreported because they are often embarrassingly wide.11
During Rothman’s three-year tenure as associate editor, the fraction of papers reporting solely p values dropped precipitously. Significance tests returned after his departure, although subsequent editors successfully encouraged researchers to report confidence intervals as well.
You’ve seen how it’s possible to miss real effects by not collecting enough data. You might miss a viable medicine or fail to notice an important side effect. So how do you know how much data to collect?
The concept of statistical power provides the answer. The power of a study is the probability that it will distinguish an effect of a certain size from pure luck. A study might easily detect a huge benefit from a medication, but detecting a subtle difference is much less likely.
A power curve, as shown in Figure 2-2, can tell me.
The power for any hypothesis test is the probability that it will yield a statistically significant outcome (defined in this example as p < 0.05).
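Power can be estimated directly by simulation (a sketch with made-up parameters, again assuming a two-sample t test): pick a hypothetical effect size and sample size, generate many fake experiments that contain that effect, and count how often the result is significant at p < 0.05. Repeating this over a range of sample sizes traces out a rough power curve like the one the book describes.

```python
import numpy as np
from scipy import stats

def estimated_power(effect_size, n_per_group, alpha=0.05, n_sims=5_000, seed=3):
    """Fraction of simulated experiments that reach p < alpha."""
    rng = np.random.default_rng(seed)
    significant = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_size, 1.0, n_per_group)
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            significant += 1
    return significant / n_sims

# A rough power curve: how power grows with sample size for a modest
# (half a standard deviation) true effect.
for n in (10, 25, 50, 100, 200):
    print(f"n per group = {n:>3}: power ≈ {estimated_power(0.5, n):.2f}")
```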