This is the best statistics textbook I've read, and I've read at least parts of ~10 of them. I've also read many tutorials/explanatory articles online, and this competes with the best of them. The text is exceptionally clear and even somewhat addictive, which I was not expecting from a statistics book. I can think of a few reasons for this. First, Kruschke motivates why you should care. For example, one of the canonical examples that he returns to often is coin flipping. Instead of assuming that you care about coin flipping, he explains why -- e.g., coin flipping can be thought of as estimating whether or not a treatment "works" in a clinical trial. Even though this explanation is totally obvious, it was still nice because it made me think that he cared about and respected my reading experience. Second, he is careful to repeat the key points after he gives an example, to close the loop that so many other authors seem to consider beneath them. Finally, Kruschke is actually pretty funny. I scribbled "lol" on not a small number of pages of this book, due to the high hit rate of his dry jokes. The exception to this is the poems. Do yourself a favor and skip the poems.
Even though it is a Bayesian book, for me its most helpful chapter was Chapter 15, which explains the Generalized Linear Model. Other chapters that I found particularly helpful were Chapter 18, which explains shrinkage very nicely, and Chapter 9, which explains hierarchical models very well. One of the downsides of a book like this is how quickly the field is moving. For example, some of the gamma priors that Kruschke recommends are discouraged by other authors, as you can see if you read the Stan manual. But there are a lot of fundamental principles in this book that will probably stand the test of time, so I'm expecting and hoping that investing time in it will pay dividends across the course of my career.
Quotes
"Bayesian model comparison compensates for model complexity by the fact that each model must have a prior distribution over its parameters, and more complex models must dilute their prior distributions over larger parameter spaces than simpler models. Thus, even if a complex model has some particular combination of parameter values that fit the data well, the prior probability of that particular combination must be small because the prior is spread thinly over the broad parameter space." - p. 290
"HMC instead uses a proposal distribution that changes depending on the current position. HMC figures out the direction in which the posterior distribution increases, called its gradient, and warps the proposal distribution toward the gradient." - p. 401
"The key to models of variable selection is that each predictor has both a regression coefficient and an inclusion indicator, which can be thought of as simply another coefficient that takes on the values 0 or 1. When the inclusion indicator is 1, then the regression coefficient has its usual role. When the inclusion indicator is 0, the predictor has no role in the model and the gression coefficient is superfluous.... A simple way to put a prior on the inclusion indicator is to have each indicator come from an independent Bernoulli prior, such as δj ~ dbern(0.5)." p. 537
SR Flashcards
q: define probability density
a: the probability of an outcome occurring in a particular interval divided by the width of that interval
q: define thinning (MCMC)
a: a method in which only every kth value in an MCMC chain is stored in memory
> a method for reducing autocorrelation that does not improve efficiency
> not recommended by Kruschke unless storing the full original chain or analyzing it subsequently would take too much memory
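A minimal sketch of thinning in plain R (the chain and the interval k here are placeholders, not values from the book):

```r
# Keep only every k-th value of an MCMC chain to reduce autocorrelation.
chain   <- rnorm(10000)                           # stand-in for a real MCMC chain
k       <- 10                                     # thinning interval
thinned <- chain[seq(1, length(chain), by = k)]   # every k-th sample
length(thinned)                                   # 1000 values retained
```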
q: define Haldane prior
a: a noncommittal beta prior, beta(θ|ε, ε) with ε close to zero, which puts most of its weight near θ = 0 and θ = 1
> whereas beta(theta|1,1) is a more conventional uniform distribution
> makes sense when you think θ is more likely to be near 0 or 1 than somewhere in the middle
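A rough visual comparison in R (the epsilon value is an arbitrary small number, not one from the book):

```r
# Haldane-style prior beta(eps, eps) piles its mass near theta = 0 and theta = 1,
# unlike the uniform beta(1, 1).
eps   <- 0.01
theta <- seq(0.001, 0.999, length.out = 999)
plot(theta, dbeta(theta, eps, eps), type = "l", ylab = "density")
lines(theta, dbeta(theta, 1, 1), lty = 2)   # uniform prior for comparison
```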
q: define marginal likelihood
a: the operation of taking the average of the likelihood p(D|θ) across all values of θ, weighted by the prior probability of θ
> i.e., p(D) = Σ_θ p(D|θ) p(θ), or the corresponding integral when θ is continuous
> denominator in Bayes' rule; aka the evidence; p. 107 DBDA
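A toy grid approximation for the coin-flipping case (the data, prior, and grid are invented for illustration):

```r
# Marginal likelihood p(D) for z heads in N flips under a beta(2, 2) prior,
# approximated by averaging the likelihood over a grid of theta values,
# weighted by the (grid-normalized) prior.
z <- 7; N <- 10
theta <- seq(0.001, 0.999, length.out = 999)
prior <- dbeta(theta, 2, 2) / sum(dbeta(theta, 2, 2))
likelihood <- dbinom(z, N, theta)
p_D <- sum(likelihood * prior)
p_D
```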
q: define autocorrelation function
a: the autocorrelation across a spectrum of candidate lags
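In R, `acf()` computes and plots this; the AR(1) series below is just a stand-in for a real chain:

```r
# Autocorrelation function of a (simulated) autocorrelated chain at lags 0..50.
set.seed(1)
chain <- as.numeric(arima.sim(model = list(ar = 0.8), n = 5000))
acf(chain, lag.max = 50)
```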
q: define effective sample size (MCMC)
a: the actual length of an MCMC chain divided by a factor that accounts for its autocorrelation: ESS = N / (1 + 2 Σ_k ACF(k))
> more autocorrelation -> less actual independent data from each draw
> e.g., to get stable estimates of the limits of a 95% HDI, Kruschke recommends an ESS of around 10,000
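With the coda R package this is `effectiveSize()`; the chain below is the same kind of stand-in as above, so the numbers are illustrative only:

```r
library(coda)
# An autocorrelated chain of length 10,000 yields far fewer effective samples.
set.seed(1)
chain <- as.numeric(arima.sim(model = list(ar = 0.8), n = 10000))
effectiveSize(mcmc(chain))
```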
q: define shrink factor (MCMC)
a: the ratio of between-chain variance to within-chain variance across independent MCMC chains, which is close to 1 when the chains have converged to the same distribution
> aka Gelman-Rubin statistic
> ?gelman.diag in the coda R package
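A quick sketch with coda's `gelman.diag()` on two fake "chains" (independent normal draws, so the statistic should come out near 1):

```r
library(coda)
set.seed(1)
chain1 <- mcmc(rnorm(5000))
chain2 <- mcmc(rnorm(5000))
gelman.diag(mcmc.list(chain1, chain2))   # potential scale reduction factor ~ 1
```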
q: What is the general effect of shrinkage in hierarchical models? what specifies the degree of shrinkage?
a: to pull low-level parameter estimates towards the modes of the higher-level distribution; the degree of shrinkage is set by the relative strength (precision) of the lower-level data versus the higher-level distribution
> although shrinkage is a consequence of hierarchical model structure, not Bayesian estimation per se
q: What does a Bayes factor quantify?
a: the factor by which the prior odds between two models are multiplied, given the data, to yield the posterior odds; i.e., BF = p(D|M1) / p(D|M2)
q: describe noise distribution
a: the distribution that describes the random variability of the data values around the underlying trend
> i.e., it usually sits at the bottom of one of Kruschke's model diagrams
> can differ; e.g., you could model the noise as normal, log-normal, gamma, etc...
q: In hierarchical models, if you want high precision estimates at the individual level, what do you need? if you want high precision estimates at the group level, what do you need?
a: lots of data within individuals, lots of individuals (without necessarily lots of data per individual, although more is better)
> p 382 DBDA
q: In Bayesian analysis, how is a nominal predictor used to predict values in a linear model?
a: for nominal predictors, you generally estimate a separate beta coefficient for each possible level, quantifying that level's "deflection from the mean"
> typically the baseline is constrained so that the deflections sum to zero across the categories (see the sketch below)
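A small numerical illustration of sum-to-zero deflections (the group means are invented):

```r
# Each category's deflection is its deviation from the grand mean,
# so the deflections sum to zero across categories.
group_means <- c(A = 10, B = 14, C = 18)
grand_mean  <- mean(group_means)
deflections <- group_means - grand_mean
deflections        # A = -4, B = 0, C = 4
sum(deflections)   # 0
```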
q: In a generalized linear model, what happens to the predictor variables first? second?
a: they are combined, e.g. via addition; the combination is then mapped to the predicted variable by an inverse link function
> p 435 DBDA
q: define inverse link function
a: the function that maps the combination of predictor variables onto the central tendency of the predicted data
> sometimes called the "link function" for convenience; called inverse for historical reasons
q: define logit
a: the inverse of the sigmoidal logistic function
> canonical link fxn for the Bernoulli distribution
q: define probit
a: the inverse of the cdf of the standard normal distribution; the cdf is denoted Φ(z), so the probit is denoted Φ⁻¹(p)
> canonical link function for the normal distribution
> probit stands for "probability unit"; Bliss 1934
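Both link functions above are available in base R, for example:

```r
p <- 0.8
qlogis(p)           # logit(p) = log(p / (1 - p)), about 1.39
plogis(qlogis(p))   # logistic (the inverse link) recovers 0.8
qnorm(p)            # probit(p) = Phi^{-1}(p), about 0.84
pnorm(qnorm(p))     # Phi (the inverse link) recovers 0.8
```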
q: If the link function in a GLM is the identity function, then what is the GLM equivalent to?
a: conventional linear regression
> Ch. 15 DBDA
q: If a distribution has higher kurtosis, what does that mean practically?
a: that it has heavier tails
q: A GLM can be written as follows:
μ = f(lin(x), [parameters])
y ~ pdf(μ, [parameters]), where...
1) lin() = ?
2) f = ?
3) pdf = ?
a: 1) a linear function to combine the predictors x
2) the inverse link function
3) the noise distribution (going from predicted central tendency to noisy data)
> p 444 DBDA
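A minimal generative sketch of this structure for logistic regression (all coefficient values are made up):

```r
# lin(): linear combination of predictors; f: inverse link (logistic);
# pdf: Bernoulli noise distribution generating the data.
set.seed(1)
n   <- 100
x1  <- rnorm(n); x2 <- rnorm(n)
lin <- 0.5 + 1.2 * x1 - 0.8 * x2        # lin(x)
mu  <- plogis(lin)                      # f(lin(x)): predicted probability
y   <- rbinom(n, size = 1, prob = mu)   # y ~ pdf(mu)
```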
q: define posterior predictive check
a: simulating data from the model using parameter values drawn from the posterior, then comparing the simulated data to the actual data to see whether the model describes the data well
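A toy example for the coin-flipping model with a conjugate beta(2, 2) prior (the data are invented):

```r
# Draw theta from the posterior, simulate replicated datasets, and compare
# the simulated head counts to the observed count.
set.seed(1)
z <- 7; N <- 10
theta_post <- rbeta(5000, 2 + z, 2 + (N - z))         # posterior draws
y_rep <- rbinom(5000, size = N, prob = theta_post)    # replicated data
hist(y_rep); abline(v = z, lwd = 2)                   # observed value vs. simulations
```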
q: If two variables are highly correlated in multiple linear regression, what will that do to the posterior estimate of those coefficients?
a: it will make them very broad
> if there are three or more correlated predictors, pairwise scatterplots may not show it, but autocorrelation will remain high
q: define multiplicative interaction
a: when the predicted value is a weighted combination of both the individual predictors and the multiplicative product of those predictors
> a type of non-additivity
> DBDA p 525
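A tiny numeric example of the predicted value with an interaction term (all coefficients invented):

```r
# y_hat = b0 + b1*x1 + b2*x2 + b12*(x1*x2)
x1 <- 2; x2 <- 3
b0 <- 1; b1 <- 0.5; b2 <- -0.25; b12 <- 0.4
y_hat <- b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2
y_hat   # 1 + 1 - 0.75 + 2.4 = 3.65
```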
q: define double exponential distribution
a: two exponential distributions glued back-to-back on either side of a peak (location parameter)
> e.g., one exponential distribution over the positive side and its mirror image over the negative side of β, symmetric about the center
> aka Laplace distribution
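A hand-rolled density for comparison against the normal (dlaplace below is my own helper, not a base R function):

```r
# Laplace (double exponential) density: exp(-|x - mu| / b) / (2 * b).
dlaplace <- function(x, mu = 0, b = 1) exp(-abs(x - mu) / b) / (2 * b)
x <- seq(-5, 5, length.out = 401)
plot(x, dlaplace(x), type = "l", ylab = "density")   # sharp peak, heavy tails
lines(x, dnorm(x), lty = 2)                          # normal for comparison
```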
q: What is the etymology of ANOVA?
a: the ANOVA model posits that the total variance can be partitioned into within-group variance plus between-group variance, and since "analysis" means separation into constituent parts, the term ANOVA accurately describes the underlying algebra in the traditional methods
q: What is the "homogeneity of variance assumption" in ANOVA?
a: that the standard deviation of the data within each group is the same for all groups
> ANOVA also assumes that the data are normally distributed within groups
q: Can you discern posterior distributions of credible differences between parameters based on their marginal distributions? why or why not?
a: no; the parameters might be correlated, so you need to evaluate differences between jointly credible values, e.g. by taking the differences at each step of an MCMC chain
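A sketch with fake, correlated "posterior draws" to show the step-by-step difference (all numbers invented):

```r
# Compute the difference at each step of the chain, then summarize it.
set.seed(1)
common <- rnorm(5000)
beta1  <- 1.0 + common + rnorm(5000, sd = 0.3)   # correlated draws
beta2  <- 0.8 + common + rnorm(5000, sd = 0.3)
d      <- beta1 - beta2                          # difference per MCMC step
quantile(d, c(0.025, 0.975))                     # equal-tailed 95% interval
mean(d > 0)                                      # proportion of steps with beta1 > beta2
```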