Kindle Notes & Highlights
Read between April 2, 2019 and January 26, 2020
Learn to use prior estimates of the base rate to calculate the probability that a given result is a false positive (as in the mammogram example).
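As a sketch of that calculation, here is Bayes' rule applied to a screening test; the prevalence, sensitivity, and false positive rate below are illustrative assumptions, not the book's mammogram figures.

```python
# Sketch: positive predictive value of a screening test via Bayes' rule.
# The numbers below are illustrative assumptions, not figures from the text.

def prob_true_positive(base_rate, sensitivity, false_positive_rate):
    """P(condition | positive test) from the base rate and test characteristics."""
    p_positive = sensitivity * base_rate + false_positive_rate * (1 - base_rate)
    return sensitivity * base_rate / p_positive

# E.g., a rare condition (0.8% prevalence), 90% sensitivity, 7% false positive rate:
print(prob_true_positive(base_rate=0.008, sensitivity=0.9, false_positive_rate=0.07))
# ~0.094: most positive results are false positives, despite an "accurate" test.
```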
However, a difference in significance does not always make a significant difference.1
This doesn’t improve our statistical power, but it does prevent the false conclusion that the drugs are different.
A high standard deviation would tell me it benefits some patients much more than others. Confidence intervals and standard errors, on the other hand, estimate how far the average for this sample might be from the true average.
Hence, it is important to know whether an error bar represents a standard deviation, confidence interval, or standard error, though papers often do not say.14
Plugging in the numbers for Fixitol and Solvix, I find that p < 0.05! There is a statistically significant difference between them, even though the confidence intervals overlap.
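As a rough illustration of that point (with made-up means and standard errors, not the actual Fixitol and Solvix numbers), two 95% confidence intervals can overlap even though a test of the difference comes out significant:

```python
# Sketch: overlapping 95% confidence intervals do not rule out a significant
# difference. Means and standard errors here are made up for illustration.
import numpy as np
from scipy import stats

mean_a, se_a = 10.0, 1.0   # group A: sample mean and standard error
mean_b, se_b = 13.0, 1.0   # group B

# 95% confidence interval for each mean
z = stats.norm.ppf(0.975)
ci_a = (mean_a - z * se_a, mean_a + z * se_a)
ci_b = (mean_b - z * se_b, mean_b + z * se_b)
print("CI A:", ci_a, "CI B:", ci_b)    # the intervals overlap

# A test of the *difference* uses the standard error of the difference
se_diff = np.sqrt(se_a**2 + se_b**2)
z_stat = (mean_b - mean_a) / se_diff
p = 2 * stats.norm.sf(abs(z_stat))
print("p =", round(p, 3))              # ~0.034, significant at the 0.05 level
```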
Unfortunately, many scientists skip the math and simply glance at plots to see whether confidence intervals overlap.
Earlier, we assumed the error bars in Figure 5-3 represent confidence intervals. But what if they are standard errors or standard deviations? Could we spot a significant difference by just looking for whether the error bars overlap?
A survey of psychologists, neuroscientists, and medical researchers found that the majority judged significance by confidence interval overlap, with many scientists confusing standard errors, standard deviations, and confidence intervals.
There is exactly one situation when visually checking confidence intervals works, and it is when comparing the confidence interval against a fixed value, rather than another confidence interval.
Overlapping confidence intervals do not mean two values are not significantly different. Judging significance by whether confidence intervals or standard error bars overlap will mislead.
Your eyeball is not a well-defined statistical procedure.
And because standard error bars are about half as wide as the 95% confidence interval,
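A quick numerical check of that ratio, under the usual normal approximation:

```python
# A 95% CI spans roughly +/-1.96 standard errors, so an error bar of +/-1 SE
# covers about half that width (illustrative arithmetic only).
se = 1.0
print(2 * se)          # width of a +/-1 SE error bar
print(2 * 1.96 * se)   # width of the corresponding 95% confidence interval
```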
If in your explorations you find an interesting correlation, the standard procedure is to collect a new dataset and test the hypothesis again. Testing an independent dataset will filter out false positives and leave any legitimate discoveries standing.
And so exploratory findings should be considered tentative until confirmed.
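A small simulation (my own illustration, not from the book) of how a replication pass filters out most false positives while real effects survive:

```python
# Simulation sketch: test many candidate effects, keep those with p < 0.05,
# then re-test the survivors on fresh data. All settings are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_effects, n_real, n = 1000, 50, 30
true_effect = np.zeros(n_effects)
true_effect[:n_real] = 0.8        # only a minority of candidate effects are real

def significant(effects):
    """One-sample t-test per candidate effect on a fresh sample of size n."""
    data = rng.normal(loc=effects, scale=1.0, size=(n, len(effects)))
    return stats.ttest_1samp(data, 0.0).pvalue < 0.05

first = significant(true_effect)             # exploratory pass
second = significant(true_effect[first])     # replication on new data
survivors_real = true_effect[first][second] != 0

print("hits in exploration:", first.sum(),
      "of which false:", (true_effect[first] == 0).sum())
print("hits after replication:", second.sum(),
      "of which false:", (~survivors_real).sum())
```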
These rules are often violated in the neuroimaging literature, perhaps as much as 40% of the time, causing inflated correlations and false positives.
Studies committing this error tend to find larger correlations between stimuli and neural activity than are plausible, given the random noise and error inherent to brain imaging.3 Similar problems occur when geneticists collect data on thousands of genes and select subsets for analysis or when epidemiologists dredge through demographics and risk factors to find which ones are associated with disease.4
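A toy simulation of that selection effect (illustrative only, not the cited neuroimaging analyses): pick the noise signals most correlated with a stimulus, then compare their correlations on the same data and on independent data.

```python
# Sketch of the "non-independence" problem: selecting signals by their
# correlation with a stimulus and then reporting that correlation on the
# SAME data inflates it, even when everything is pure noise.
import numpy as np

rng = np.random.default_rng(1)
n_timepoints, n_voxels = 100, 5000
stimulus = rng.normal(size=n_timepoints)
voxels = rng.normal(size=(n_timepoints, n_voxels))   # pure noise: no real effect

def corr_with_stimulus(data, stim):
    stim_c = stim - stim.mean()
    data_c = data - data.mean(axis=0)
    return (stim_c @ data_c) / np.sqrt((stim_c**2).sum() * (data_c**2).sum(axis=0))

r_all = corr_with_stimulus(voxels, stimulus)
selected = np.argsort(np.abs(r_all))[-20:]           # pick the 20 "best" voxels

fresh_stimulus = rng.normal(size=n_timepoints)       # an independent session
fresh_voxels = rng.normal(size=(n_timepoints, n_voxels))
r_fresh = corr_with_stimulus(fresh_voxels, fresh_stimulus)

print("mean |r|, same data:       ", np.abs(r_all[selected]).mean())    # inflated
print("mean |r|, independent data:", np.abs(r_fresh[selected]).mean())  # near zero
```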
Had we chosen a fixed group size in advance, the p value would be the probability of obtaining more extreme results with that particular group size.
But many stopped studies don’t even publish their original intended sample size or the stopping rule used to justify terminating the study.8 A trial’s early stoppage is not automatic evidence that its results are biased, but it is suggestive.
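A sketch of why stopping rules matter (an illustrative simulation with arbitrary peeking intervals): checking for significance after every few subjects and stopping at the first p < 0.05 inflates the false positive rate well past 5%.

```python
# Simulation sketch: with no real effect, peeking at the data repeatedly and
# stopping as soon as p < 0.05 produces many more false positives than the
# nominal 5% of a fixed-sample-size test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_trials, max_n, start_n, step = 2000, 200, 10, 10

false_positives = 0
for _ in range(n_trials):
    data = rng.normal(size=max_n)                 # the null is true: mean really is 0
    for n in range(start_n, max_n + 1, step):     # peek every `step` subjects
        if stats.ttest_1samp(data[:n], 0.0).pvalue < 0.05:
            false_positives += 1
            break

print("false positive rate with optional stopping:", false_positives / n_trials)
# Typically 0.2 or more with these settings, versus the nominal 0.05.
```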
Dichotomization eliminates this distinction, dropping useful information and statistical power.
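A small simulation (illustrative numbers, not from the book) of that power loss: compare a test using the continuous predictor with one that splits it at the median.

```python
# Sketch: splitting a continuous predictor into "high" and "low" groups throws
# away information and costs statistical power compared with using it as-is.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n, slope = 2000, 50, 0.3

power_continuous = power_split = 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    y = slope * x + rng.normal(size=n)

    # Use the continuous predictor directly
    r, p = stats.pearsonr(x, y)
    if p < 0.05:
        power_continuous += 1

    # Dichotomize at the median, then compare group means
    high = x > np.median(x)
    if stats.ttest_ind(y[high], y[~high]).pvalue < 0.05:
        power_split += 1

print("power, continuous predictor:", power_continuous / n_sims)
print("power, median split:        ", power_split / n_sims)   # noticeably lower
```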
We are often interested in controlling for confounding factors. You might measure two or three variables (or two or three dozen) along with the outcome variable and attempt to determine the unique effect of each variable on the outcome after the other variables have been “controlled for.”
While the mathematical theory of regression with multiple variables can be more advanced than many practicing scientists care to learn, involving a great deal of linear algebra, the basic concepts and results are easy to understand and interpret.
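A minimal sketch of that idea on simulated data, using plain least squares: once a confounder is included in the model, the coefficient on the variable of interest is estimated with the confounder held fixed.

```python
# Sketch: a multiple regression separates a confounder's influence from the
# variable of interest. Data and coefficients are simulated for illustration.
import numpy as np

rng = np.random.default_rng(4)
n = 500
z = rng.normal(size=n)                  # confounder (e.g., age)
x = 0.8 * z + rng.normal(size=n)        # exposure driven partly by the confounder
y = 2.0 * z + rng.normal(size=n)        # outcome depends on z, NOT directly on x

def ols_coefs(y, *predictors):
    """Least-squares coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("y ~ x    :", ols_coefs(y, x))      # x alone looks strongly related to y
print("y ~ x + z:", ols_coefs(y, x, z))   # controlling for z, x's coefficient is ~0
```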
Don’t arbitrarily split continuous variables into discrete groups unless you have good reason. Use a statistical procedure that can take full advantage of the continuous variables. If you do need to split continuous variables into groups for some reason, don’t choose the groups to maximize your statistical significance. Define the split in advance, use the same split as in previous similar research, or use outside standards (such as a medical definition of obesity or high blood pressure) instead.
only a truly randomized experiment eliminates all confounding variables.
Let’s start with the simplest problem: overfitting, which is the result of excessive enthusiasm in data analysis.
Stepwise regression is common in many scientific fields, but it’s usually a bad idea.
You probably already noticed one problem: multiple comparisons. Hypothetically, by adding only statistically significant variables, you avoid overfitting, but running so many significance tests is bound to produce false positives, so some of the variables you select will be bogus.
(Alternative stepwise procedures use other criteria instead of statistical significance but suffer from many of the same problems.)
stepwise regression is susceptible to egregious overfitting,
It’s also possible to change the criteria used to include new variables; instead of statistical significance, more-modern procedures use metrics like the Akaike information criterion and the Bayesian information criterion, which reduce overfitting by penalizing models with more variables.
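A sketch of how badly selection by p-value can go wrong (my own simulation, not from the book): run forward stepwise selection on an outcome and candidate predictors that are all pure noise.

```python
# Sketch of forward stepwise selection on pure noise: even with no real
# relationships, choosing predictors by p-value tends to build a model of
# "significant" but meaningless variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, n_candidates = 100, 50
X = rng.normal(size=(n, n_candidates))   # candidate predictors: all pure noise
y = rng.normal(size=n)                   # outcome: also pure noise

def p_value_of_new_variable(y, selected_cols, new_col):
    """t-test p-value for the newest variable's coefficient in an OLS fit."""
    A = np.column_stack([np.ones(n)] + selected_cols + [new_col])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A @ beta
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t = beta[-1] / np.sqrt(cov[-1, -1])
    return 2 * stats.t.sf(abs(t), dof)

selected_idx, selected_cols = [], []
while True:
    remaining = [j for j in range(n_candidates) if j not in selected_idx]
    if not remaining:
        break
    pvals = {j: p_value_of_new_variable(y, selected_cols, X[:, j]) for j in remaining}
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:              # stop when nothing left looks "significant"
        break
    selected_idx.append(best)
    selected_cols.append(X[:, best])

print("variables selected from pure noise:", selected_idx)
```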
How can a regression model be fairly evaluated, avoiding these problems? One option is cross-validation: fit the model using only a portion of the melons and then test its effectiveness at predicting the ripeness of the other melons.
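A minimal sketch of cross-validation, assuming scikit-learn is available; the "melon" features here are simulated stand-ins rather than real measurements.

```python
# Sketch: fit on part of the data, evaluate on the held-out part, and rotate
# which portion is held out. Features and ripeness values are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n_melons = 200
features = rng.normal(size=(n_melons, 5))    # e.g., weight, color, tap pitch, ...
ripeness = features @ np.array([0.5, 0.3, 0.0, 0.0, 0.0]) \
           + rng.normal(scale=0.5, size=n_melons)

scores = cross_val_score(LinearRegression(), features, ripeness,
                         cv=5, scoring="r2")
print("held-out R^2 per fold:", np.round(scores, 2))
print("mean:", round(scores.mean(), 2))
```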
But choosing a single model is usually foolishly overconfident. With so many variables to choose from, there are often many combinations of variables that predict the outcome nearly as well.
the lasso (short for least absolute shrinkage and selection operator, an inspired acronym) has better mathematical properties and doesn’t fool the user with claims of statistical significance. But the lasso is not bulletproof, and there is no perfect automated solution.
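A brief sketch of the lasso in practice, assuming scikit-learn and simulated data: the L1 penalty shrinks coefficients and sets most of them exactly to zero, selecting a sparse model without issuing p-values.

```python
# Sketch: the lasso selects a sparse set of variables by penalizing the sum of
# absolute coefficients. Data are simulated; only three variables matter.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
n, p = 200, 30
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [1.5, -1.0, 0.5]
y = X @ true_beta + rng.normal(size=n)

lasso = LassoCV(cv=5).fit(X, y)          # penalty strength chosen by cross-validation
print("chosen alpha:", round(lasso.alpha_, 3))
print("nonzero coefficients:", np.flatnonzero(lasso.coef_))
# Usually the three real variables, sometimes plus a few small spurious ones:
# the lasso is useful but, as noted above, not bulletproof.
```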
Correlation and Causation
The choices that produce interesting results will attract our attention and engage our human tendency to build plausible stories for any outcome.
The most worrying consequence of this statistical freedom is that researchers may unintentionally choose the statistical analysis most favorable to them.
The proliferation of statistical techniques has given us useful tools, but it seems they’ve been put to use as blunt objects with which to beat the data until it confesses.
the constant pressure to publish means that thorough documentation and replication are ignored. There’s no incentive for researchers to make their data and calculations available for inspection or to devote time to replicating other researchers’ results.
But first they asked two biostatisticians, Keith Baggerly and Kevin Coombes, to check the data.
the lead Duke researcher, Anil Potti, had falsified his résumé.
Potti eventually resigned from Duke amid accusations of fraud.
The Potti case illustrates two problems: the lack of reproducibility in much of modern science and the difficulty of publishing negative and contradictory results in academic journals.
The problem was not just that Potti did not share his data readily. Scientists often do not record and document the steps they take converting raw data to results, except in the often-vague form of a scientific paper or whatever is written down in a lab notebook.
Ideally, these steps would be reproducible: fully automated, with the computer source code available for inspection as a definitive record of the work. Errors would be easy to spot and correct, and any scientist could download the dataset and code and produce exactly the same results. Even better, the code would be combined with a description of its purpose.
but another scientist reading the paper and curious about its methods can download the source code, which shows exactly how
A more comprehensive strategy to ensure reproducibility and ease of error detection would follow the “Ten Simple Rules for Reproducible Computational Research,” developed by a group of biomedical researchers.9 These rules include automating data manipulation and reformatting, recording all changes to analysis software and custom programs using a software version control system, storing all raw data, and making all scripts and data available for public analysis.
Automated data analysis makes it easy to try software on new datasets or test that each piece functions correctly.
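A minimal sketch of what such an automated, testable analysis step might look like; the file name and functions here are hypothetical, not from any particular study.

```python
# Sketch of an automated, testable analysis step: every transformation from
# raw data to result lives in code, so anyone can rerun and check each piece.
import csv


def load_measurements(path):
    """Read raw measurements from a CSV file with a 'value' column (hypothetical format)."""
    with open(path, newline="") as f:
        return [float(row["value"]) for row in csv.DictReader(f)]


def summarize(values):
    """The analysis step itself: here, just the mean of the measurements."""
    return sum(values) / len(values)


def test_summarize():
    """A small check that can run automatically whenever the code changes."""
    assert summarize([1.0, 2.0, 3.0]) == 2.0


if __name__ == "__main__":
    test_summarize()
    print(summarize(load_measurements("raw_data.csv")))   # hypothetical input file
```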