Kindle Notes & Highlights
In my experience teaching scientific programming, novices learn more quickly when they have working code to modify, rather than needing to write an algorithm from scratch.
The second line is the right answer. This problem arises because of rounding error, when the computer rounds very small decimal values to zero. This loses precision and can introduce substantial errors in inference. As a result, we nearly always do statistical calculations using the logarithm of a probability, rather than the probability itself.
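The highlight above refers to a short numerical demonstration. A minimal sketch of the kind of underflow being described (the specific numbers are illustrative, not necessarily the book's):

    ( log( 0.01^200 ) )   # 0.01^200 underflows to zero, so the log comes back as -Inf
    ( 200 * log(0.01) )   # doing the arithmetic on the log scale instead gives the right value, about -921.03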
Programming at the level needed to perform 21st century statistical inference is not that complicated, b...
Everyone knows that the command line is m...
Pointing and clicking, however, leaves no trail of breadcrumbs. A file with your R comm...
With point-and-click, you pay down the road, rather th...
It is also a basic ethical requirement of science that our analyses be fully doc...
Teaching statistics this way is somewhat like teaching engineering backwards, starting with bridge building and ending with basic physics. So students and many scientists tend to use charts like Figure 1.1 without much thought to their underlying structure, without much awareness of the models that each procedure embodies, and without any framework to help them make the inevitable compromises required by real research.
If you don’t understand how the golem processes information, then you can’t interpret the golem’s output.
This is the proper objective, the thinking goes, because Karl Popper argued that science advances by falsifying hypotheses. Karl Popper (1902–1994) is possibly the most influential philosopher of science, at least among scientists. He did persuasively argue that science works better by developing hypotheses that are, in principle, falsifiable.
As a result, relations are multiple in both directions: Hypotheses do not imply unique models, and models do not imply unique hypotheses. This fact greatly complicates statistical inference.
The null model is not unique to any process model or hypothesis.
If we reject the null, we can’t really conclude that selection matters, because there are other neutral models that predict different distributions of alleles. And if we fail to reject the null, we can’t really conclude that evolution is neutral, because some selection models expect the same frequency distribution.
We have a hypothesis H, and we show that it entails some observation D. Then we look for D. If we don’t find it, we must conclude that H is false. Logicians call this kind of reasoning modus tollens, which is Latin shorthand for “the method of destruction.” In contrast, finding D tells us nothing certain about H, because other hypotheses might also predict D.
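The two cases in this passage can be written compactly. A brief sketch in standard logical notation (H and D as above):

    \[
    (H \Rightarrow D) \wedge \neg D \;\vdash\; \neg H \qquad \text{(modus tollens: valid)}
    \]
    \[
    (H \Rightarrow D) \wedge D \;\nvdash\; H \qquad \text{(affirming the consequent: invalid)}
    \]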
This is a seductive story. If we can believe that important scientific hypotheses can be stated in this form, then we have a powerful method for improving the accuracy of our theories: look for evidence that disconfirms our hypotheses. Whenever we find a black swan, H0 must be false. Progress!
First, observations are prone to error, especially at the boundaries of scientific knowledge. Second, most hypotheses are quantitative, concerning degrees of existence, rather than discrete, concerning total presence or absence.
But the probabilistic nature of evidence rarely appears when practicing scientists discuss the philosophy and practice of falsification.[14] My reading of the history of science is that these sorts of measurement problems are the norm, not the exception.[15]
But falsification is always consensual, not logical.
But Bayesian data analysis embraces it most fully, by using the language of chance to describe the plausibility of different possibilities.
It means also that parameters and models cannot have probability distributions, only measurements can.
The distribution of these measurements is called a sampling distribution.
There’s uncertainty about the planet’s shape, but notice that none of the uncertainty is a result of variation in repeat measurements.
So the sampling distribution of any measurement is constant, because the measurement is deterministic—there’s nothing “random” about it.
However, it is important to realize that even when a Bayesian procedure and frequentist procedure give exactly the same answer, our Bayesian golems aren’t justifying their inferences with imagined repeat sampling.
More generally, Bayesian golems treat “randomness” as a property of information, not of the world.
We just use randomness to describe our uncertainty in the face of i...
I want to convince the reader of something that appears unreasonable: multilevel regression deserves to be the default form of regression. Papers that do not use multilevel models should have to justify not using a multilevel approach.
For now, you can understand overfitting with this mantra: fitting is easy; prediction is hard.
However, the bonus that arises from this is that, if we really have shuffled enough to erase any prior knowledge of the ordering, then the order the cards end up in is very likely to be one of the many orderings with high information entropy. The concept of information entropy will be increasingly important as we progress, and will be unpacked in Chapters 6 and 9.
Designing a simple Bayesian model benefits from a design loop with three steps.
(1) Data story: Motivate the model by narrating how the data might arise.
(2) Update: Educate your model by feeding it the data.
(3) Evaluate: All statistical models require supervision, leading possibly to model revision.
As a result, showing that a model does a good job does not in turn uniquely support our data story. Still, the story has value because in trying to outline the story, often one realizes that additional questions must be answered.
For example, there is a widespread superstition that 30 observations are needed before one can use a Gaussian distribution.
In contrast, Bayesian estimates are valid for any sample size. This does not mean that more data isn’t helpful—it certainly is. Rather, the estimates have a clear and valid interpretation, no matter the sample size. But the price for this power is dependency upon the initial estimates, the prior. If the prior is a bad one, then the resulting inference will be misleading. There’s no free lunch,[42] when it comes to learning about the world. A Bayesian golem must choose an initial plausibility,
We could shuffle the order of the observations, as long as six W’s and three L’s remain, and still end up with the same final plausibility curve. That is only true, however, because the model assumes that order is irrelevant to inference. When
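A rough sketch of how one could check this order-invariance with a simple grid approximation (the names p_grid, post_from, d1, and d2 are illustrative, not from the text):

    # grid approximation of the globe-tossing posterior, assuming a flat prior
    p_grid <- seq(0, 1, length.out = 1000)
    prior  <- rep(1, length(p_grid))
    post_from <- function(data) {
      lik <- rep(1, length(p_grid))
      for (d in data) lik <- lik * dbinom(d, size = 1, prob = p_grid)  # one Bernoulli factor per toss
      post <- lik * prior
      post / sum(post)
    }
    d1 <- c(1, 1, 0, 1, 1, 0, 1, 0, 1)        # six W (1) and three L (0) in one order
    d2 <- sample(d1)                          # same counts, shuffled order
    all.equal(post_from(d1), post_from(d2))   # TRUE: the posterior is unchanged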
Consider three different kinds of things we counted in the previous sections.
(1) The number of ways each conjecture could produce an observation
(2) The accumulated number of ways each conjecture could produce the entire data
(3) The initial plausibility of each conjectured cause of the data
You can build your own likelihood formula from basic assumptions of your story for how the data arise. That’s what we did in the globe tossing example earlier. Or you can use one of several off-the-shelf likelihoods that are common in the sciences. Later in the book, you’ll see how information theory justifies many of the conventional choices of likelihood.
In this case, once we add our assumptions that (1) every toss is independent of the other tosses and (2) the probability of W is the same on every toss, probability theory provides a unique answer, known as the binomial distribution. This is the common “coin tossing” distribution. And so the probability of observing w W’s in n tosses, with a probability p of W, is:

\Pr(w \mid n, p) = \frac{n!}{w!\,(n-w)!}\, p^w (1-p)^{n-w}
Just keep in mind that the job of the likelihood is to tell us the relative number of ways to see the data w, given values for p and n.
Overthinking: Names and probability distributions. The “d” in dbinom stands for density. Functions named in this way almost always have corresponding partners that begin with “r” for random samples and that begin with “p” for cumulative probabilities. See for example the help ?dbinom.
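For instance, a quick sketch using these three functions for the globe-tossing counts (6 W in 9 tosses, with p = 0.5 chosen only for illustration):

    dbinom(6, size = 9, prob = 0.5)   # density: probability of exactly 6 W in 9 tosses (about 0.16)
    pbinom(6, size = 9, prob = 0.5)   # cumulative: probability of 6 or fewer W
    rbinom(3, size = 9, prob = 0.5)   # random: three simulated counts of W out of 9 tosses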
Notably, the most influential assumptions in both Bayesian and many non-Bayesian models are the likelihood functions and their relations to the parameters.
• What is the average difference between treatment groups?
• How strong is the association between a treatment and an outcome?
• Does the effect of the treatment depend upon a covariate?
• How much variation is there among groups?
Overthinking: Prior as probability distribution. You could write the prior in the example here as: \Pr(p) = \frac{1}{1-0} = 1. The prior is a probability distribution for the parameter.
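A quick sketch checking that this flat prior on [0, 1] is a proper probability distribution, using R's built-in uniform density:

    dunif(c(0, 0.25, 0.5, 1), min = 0, max = 1)   # the density is 1 everywhere on [0, 1]
    integrate(dunif, lower = 0, upper = 1)        # and it integrates to 1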
Such priors are sometimes called regularizing or weakly informative priors.
They are so useful that non-Bayesian statistical procedures have adopted a mathematically equivalent approach, penalized likelihood.
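One way to see the equivalence mentioned here, as a sketch for a single coefficient β with a Gaussian prior of scale σ (symbols are illustrative, not from the text): the log posterior adds the log prior to the log likelihood, and a Gaussian prior contributes a quadratic penalty.

    \[
    \log \Pr(\beta \mid y) = \log \Pr(y \mid \beta) + \log \Pr(\beta) + \text{const},
    \qquad
    \beta \sim \mathcal{N}(0, \sigma^2) \;\Rightarrow\; \log \Pr(\beta) = -\frac{\beta^2}{2\sigma^2} + \text{const}.
    \]

So maximizing this posterior is the same as maximizing a likelihood penalized by λβ² with λ = 1/(2σ²), the ridge-style penalty used in penalized likelihood.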
If your goal is to lie with statistics, you’d be a fool to do it with priors, because such a lie would be easily uncovered. Better to use the more opaque machinery of the likelihood.
both Bayesian and non-Bayesian models are equally harried, because both traditions depend heavily upon likelihood functions and conventionalized model forms.
This is because non-Bayesian procedures need to make choices that Bayesian ones do not, such as choice of es...