Kindle Notes & Highlights
In my experience teaching scientific programming, novices learn more quickly when they have working code to modify, rather than needing to write an algorithm from scratch.
The second line is the right answer. This problem arises because of rounding error, when the computer rounds very small decimal values to zero. This loses precision and can introduce substantial errors in inference. As a result, we nearly always do statistical calculations using the logarithm of a probability, rather than the probability itself.
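The highlight above refers to a short numerical demonstration. A minimal sketch of the kind of underflow being described (the specific numbers are illustrative, not necessarily the book's):

    ( log( 0.01^200 ) )   # 0.01^200 underflows to zero, so the log comes back as -Inf
    ( 200 * log(0.01) )   # doing the arithmetic on the log scale instead gives the right value, about -921.03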
Programming at the level needed to perform 21st century statistical inference is not that complicated, b...
Everyone knows that the command line is m...
Pointing and clicking, however, leaves no trail of breadcrumbs. A file with your R comm...
With point-and-click, you pay down the road, rather th...
It is also a basic ethical requirement of science that our analyses be fully doc...
Teaching statistics this way is somewhat like teaching engineering backwards, starting with bridge building and ending with basic physics. So students and many scientists tend to use charts like Figure 1.1 without much thought to their underlying structure, without much awareness of the models that each procedure embodies, and without any framework to help them make the inevitable compromises required by real research.
If you don’t understand how the golem processes information, then you can’t interpret the golem’s output.
This is the proper objective, the thinking goes, because Karl Popper argued that science advances by falsifying hypotheses. Karl Popper (1902–1994) is possibly the most influential philosopher of science, at least among scientists. He did persuasively argue that science works better by developing hypotheses that are, in principle, falsifiable.
As a result, relations are multiple in both directions: Hypotheses do not imply unique models, and models do not imply unique hypotheses. This fact greatly complicates statistical inference.
The null model is not unique to any process model or hypothesis.
If we reject the null, we can’t really conclude that selection matters, because there are other neutral models that predict different distributions of alleles. And if we fail to reject the null, we can’t really conclude that evolution is neutral, because some selection models expect the same frequency distribution.
We have a hypothesis H, and we show that it entails some observation D. Then we look for D. If we don’t find it, we must conclude that H is false. Logicians call this kind of reasoning modus tollens, which is Latin shorthand for “the method of destruction.” In contrast, finding D tells us nothing certain about H, because other hypotheses might also predict D.
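The two cases in this passage can be written compactly. A brief sketch in standard logical notation (H and D as above):

    \[
    (H \Rightarrow D) \wedge \neg D \;\vdash\; \neg H \qquad \text{(modus tollens: valid)}
    \]
    \[
    (H \Rightarrow D) \wedge D \;\nvdash\; H \qquad \text{(affirming the consequent: invalid)}
    \]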
This is a seductive story. If we can believe that important scientific hypotheses can be stated in this form, then we have a powerful method for improving the accuracy of our theories: look for evidence that disconfirms our hypotheses. Whenever we find a black swan, H0 must be false. Progress!
First, observations are prone to error, especially at the boundaries of scientific knowledge. Second, most hypotheses are quantitative, concerning degrees of existence, rather than discrete, concerning total presence or absence.
But the probabilistic nature of evidence rarely appears when practicing scientists discuss the philosophy and practice of falsification.[14] My reading of the history of science is that these sorts of measurement problems are the norm, not the exception.[15]
But falsification is always consensual, not logical.
But Bayesian data analysis embraces it most fully, by using the language of chance to describe the plausibility of different possibilities.
It means also that parameters and models cannot have probability distributions, only measurements can.
The distribution of these measurements is called a sampling distribution.
There’s uncertainty about the planet’s shape, but notice that none of the uncertainty is a result of variation in repeat measurements.
So the sampling distribution of any measurement is constant, because the measurement is deterministic—there’s nothing “random” about it.
However, it is important to realize that even when a Bayesian procedure and frequentist procedure give exactly the same answer, our Bayesian golems aren’t justifying their inferences with imagined repeat sampling.
More generally, Bayesian golems treat “randomness” as a property of information, not of the world.
We just use randomness to describe our uncertainty in the face of i...
I want to convince the reader of something that appears unreasonable: multilevel regression deserves to be the default form of regression. Papers that do not use multilevel models should have to justify not using a multilevel approach.
For now, you can understand overfitting with this mantra: fitting is easy; prediction is hard.
However, the bonus that arises from this is that, if we really have shuffled enough to erase any prior knowledge of the ordering, then the order the cards end up in is very likely to be one of the many orderings with high information entropy. The concept of information entropy will be increasingly important as we progress, and will be unpacked in Chapters 6 and 9.
Designing a simple Bayesian model benefits from a design loop with three steps.
(1) Data story: Motivate the model by narrating how the data might arise.
(2) Update: Educate your model by feeding it the data.
(3) Evaluate: All statistical models require supervision, leading possibly to model revision.
As a result, showing that a model does a good job does not in turn uniquely support our data story. Still, the story has value because in trying to outline the story, often one realizes that additional questions must be answered.
For example, there is a widespread superstition that 30 observations are needed before one can use a Gaussian distribution.
In contrast, Bayesian estimates are valid for any sample size. This does not mean that more data isn’t helpful—it certainly is. Rather, the estimates have a clear and valid interpretation, no matter the sample size. But the price for this power is dependency upon the initial estimates, the prior. If the prior is a bad one, then the resulting inference will be misleading. There’s no free lunch,[42] when it comes to learning about the world. A Bayesian golem must choose an initial plausibility,
We could shuffle the order of the observations, as long as six W’s and three L’s remain, and still end up with the same final plausibility curve. That is only true, however, because the model assumes that order is irrelevant to inference. When
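A rough sketch of how one could check this order-invariance with a simple grid approximation (the names p_grid, post_from, d1, and d2 are illustrative, not from the text):

    # grid approximation of the globe-tossing posterior, assuming a flat prior
    p_grid <- seq(0, 1, length.out = 1000)
    prior  <- rep(1, length(p_grid))
    post_from <- function(data) {
      lik <- rep(1, length(p_grid))
      for (d in data) lik <- lik * dbinom(d, size = 1, prob = p_grid)  # one Bernoulli factor per toss
      post <- lik * prior
      post / sum(post)
    }
    d1 <- c(1, 1, 0, 1, 1, 0, 1, 0, 1)        # six W (1) and three L (0) in one order
    d2 <- sample(d1)                          # same counts, shuffled order
    all.equal(post_from(d1), post_from(d2))   # TRUE: the posterior is unchanged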
Consider three different kinds of things we counted in the previous sections.
(1) The number of ways each conjecture could produce an observation
(2) The accumulated number of ways each conjecture could produce the entire data
(3) The initial plausibility of each conjectured cause of the data
You can build your own likelihood formula from basic assumptions of your story for how the data arise. That’s what we did in the globe tossing example earlier. Or you can use one of several off-the-shelf likelihoods that are common in the sciences. Later in the book, you’ll see how information theory justifies many of the conventional choices of likelihood.
In this case, once we add our assumptions that (1) every toss is independent of the other tosses and (2) the probability of W is the same on every toss, probability theory provides a unique answer, known as the binomial distribution. This is the common “coin tossing” distribution. And so the probability of observing w W’s in n tosses, with a probability p of W, is:

\Pr(w \mid n, p) = \frac{n!}{w!\,(n-w)!}\, p^w (1-p)^{n-w}
Just keep in mind that the job of the likelihood is to tell us the relative number of ways to see the data w, given values for p and n.
Overthinking: Names and probability distributions. The “d” in dbinom stands for density. Functions named in this way almost always have corresponding partners that begin with “r” for random samples and that begin with “p” for cumulative probabilities. See for example the help ?dbinom.
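For instance, a quick sketch using these three functions for the globe-tossing counts (6 W in 9 tosses, with p = 0.5 chosen only for illustration):

    dbinom(6, size = 9, prob = 0.5)   # density: probability of exactly 6 W in 9 tosses (about 0.16)
    pbinom(6, size = 9, prob = 0.5)   # cumulative: probability of 6 or fewer W
    rbinom(3, size = 9, prob = 0.5)   # random: three simulated counts of W out of 9 tosses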
Notably, the most influential assumptions in both Bayesian and many non-Bayesian models are the likelihood functions and their relations to the parameters.
• What is the average difference between treatment groups?
• How strong is the association between a treatment and an outcome?
• Does the effect of the treatment depend upon a covariate?
• How much variation is there among groups?
Overthinking: Prior as probability distribution. You could write the prior in the example here as: \Pr(p) = \frac{1}{1-0} = 1. The prior is a probability distribution for the parameter.
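A quick sketch checking that this flat prior on [0, 1] is a proper probability distribution, using R's built-in uniform density:

    dunif(c(0, 0.25, 0.5, 1), min = 0, max = 1)   # the density is 1 everywhere on [0, 1]
    integrate(dunif, lower = 0, upper = 1)        # and it integrates to 1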
Such priors are sometimes called regularizing or weakly informative priors.
They are so useful that non-Bayesian statistical procedures have adopted a mathematically equivalent approach, penalized likelihood.
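One way to see the equivalence mentioned here, as a sketch for a single coefficient β with a Gaussian prior of scale σ (symbols are illustrative, not from the text): the log posterior adds the log prior to the log likelihood, and a Gaussian prior contributes a quadratic penalty.

    \[
    \log \Pr(\beta \mid y) = \log \Pr(y \mid \beta) + \log \Pr(\beta) + \text{const},
    \qquad
    \beta \sim \mathcal{N}(0, \sigma^2) \;\Rightarrow\; \log \Pr(\beta) = -\frac{\beta^2}{2\sigma^2} + \text{const}.
    \]

So maximizing this posterior is the same as maximizing a likelihood penalized by λβ² with λ = 1/(2σ²), the ridge-style penalty used in penalized likelihood.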
If your goal is to lie with statistics, you’d be a fool to do it with priors, because such a lie would be easily uncovered. Better to use the more opaque machinery of the likelihood.
both Bayesian and non-Bayesian models are equally harried, because both traditions depend heavily upon likelihood functions and conventionalized model forms.
This is because non-Bayesian procedures need to make choices that Bayesian ones do not, such as choice of es...