Bayes's theorem and logistic regression
This week's post has more math than most, so I wrote it in LaTeX and translated it to HTML using HeVeA. Some of the formulas are not as pretty as they could be. If you prefer, you can read this article in PDF.
Abstract: My two favorite topics in probability and statistics are Bayes’s theorem and logistic regression. Because there are similarities between them, I have always assumed that there is a connection. In this note, I demonstrate the connection mathematically, and (I hope) shed light on the motivation for logistic regression and the interpretation of the results.

Here is the example that runs through this note: back in grad school, I was waiting for an advanced computer science class to start, and I was not sure I was in the right room. As each student arrived, I used the observed data to update my belief that I was in the right place. We can use Bayes’s theorem to quantify the calculation I was doing intuitively.
I’ll use H to represent the hypothesis that I was in the right room, and F to represent the observation that the first other student was female. Bayes’s theorem provides an algorithm for updating the probability of H:
P(H|F) = P(H) P(F|H) / P(F)

where

- P(H) is the prior probability of H before the other student arrived.
- P(H|F) is the posterior probability of H, updated based on the observation F.
- P(F|H) is the likelihood of the data, F, assuming that the hypothesis is true.
- P(F) is the likelihood of the data, independent of H.

Before I saw the other students, I was confident I was in the right room, so I might assign P(H) something like 90%.
When I was in grad school, most advanced computer science classes were 90% male, so if I was in the right room, the likelihood of the first female student was only 10%. And the likelihood of three female students was only 0.1%.
If we don’t assume I was in the right room, then the likelihood of the first female student was more like 50%, so the likelihood of all three was 12.5%.
Plugging those numbers into Bayes’s theorem yields P(H|F) = 0.18 after one female student, P(H|FF) = 0.036 after the second, and P(H|FFF) = 0.0072 after the third.
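Here is a minimal sketch of that calculation in Python, plugging in P(F) = 0.5 for each update as in the text:

    prior = 0.9          # P(H): prior probability that I was in the right room
    p_f_given_h = 0.1    # P(F|H): probability the next student is female if H is true
    p_f = 0.5            # P(F): probability the next student is female, as used above

    p = prior
    for i in range(1, 4):
        p = p * p_f_given_h / p_f      # Bayes's theorem: P(H|F) = P(H) P(F|H) / P(F)
        print(f"After {i} female student(s): P = {p:.4f}")
    # After 1 female student(s): P = 0.1800
    # After 2 female student(s): P = 0.0360
    # After 3 female student(s): P = 0.0072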
Now let’s consider logistic regression, which is based on the following functional form:

logit(p) = β0 + β1 x1 + ... + βn xn

where the dependent variable, p, is a probability, the xs are explanatory variables, and the βs are coefficients we want to estimate. The logit function is the log-odds, or
logit(p) = ln( p / (1 − p) )

When you present logistic regression like this, it raises three questions:
1. Why is logit(p) the right choice for the dependent variable?
2. Why should we expect the relationship between logit(p) and the explanatory variables to be linear?
3. How should we interpret the estimated parameters?

The answer to all of these questions turns out to be Bayes’s theorem. To demonstrate that, I’ll use a simple example where there is only one explanatory variable. But the derivation generalizes to multiple regression.
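To make the functional form concrete, here is a minimal Python sketch of logit, its inverse (the logistic function), and a model with one explanatory variable; the coefficient values are made up purely for illustration:

    import math

    def logit(p):
        return math.log(p / (1 - p))      # log-odds of p

    def inv_logit(z):
        return 1 / (1 + math.exp(-z))     # logistic function, the inverse of logit

    # A hypothetical model with one explanatory variable:
    beta0, beta1 = 0.5, 2.0
    for x in (0, 1):
        z = beta0 + beta1 * x             # linear predictor, equal to logit(p)
        print(x, inv_logit(z))            # predicted probability: about 0.622, 0.924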
On notation: I’ll use P(H) for the probability that some hypothesis, H, is true. O(H) is the odds of the same hypothesis, defined as
O(H) = P(H) / (1 − P(H))

I’ll use LO(H) to represent the log-odds of H:

LO(H) = ln O(H)

I’ll also use LR for a likelihood ratio, and OR for an odds ratio. Finally, I’ll use LLR for a log-likelihood ratio, and LOR for a log-odds ratio.
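To make the notation concrete, here is a tiny sketch that computes the odds and log-odds of the prior from the example (the helper names are just labels for this sketch); note that LO(H) is logit applied to P(H):

    import math

    def odds(p):
        return p / (1 - p)            # O(H) = P(H) / (1 - P(H))

    def log_odds(p):
        return math.log(odds(p))      # LO(H) = ln O(H), the same as logit(p)

    p_h = 0.9                 # prior probability that I was in the right room
    print(odds(p_h))          # O(H): about 9 to 1 in favor of H
    print(log_odds(p_h))      # LO(H) = ln(9), about 2.197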
The odds form of Bayes’s theorem is

O(H|F) = O(H) LR(F|H)     (1)

where

- O(H) is the prior odds that I was in the right room,
- O(H|F) is the posterior odds after seeing one female student,
- LR(F|H) is the likelihood ratio of the data, given the hypothesis.

The likelihood ratio of the data is:
LR(F|H) = P(F|H) / P(F|¬H)

where ¬H means H is false.
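Plugging the numbers from the example scenario into these definitions, as a small sketch:

    # Likelihood ratios from the example: P(F|H) = 0.1, P(M|H) = 0.9,
    # and P(F|not H) = P(M|not H) = 0.5.
    lr_f = 0.1 / 0.5       # LR(F|H) = 0.2: a female student is evidence against H
    lr_m = 0.9 / 0.5       # LR(M|H) = 1.8: a male student is evidence for H

    # Odds-form update (Eqn 1): each observation multiplies the odds by its LR.
    odds_h = 0.9 / 0.1                    # prior odds O(H) = 9
    print(round(odds_h * lr_f, 3))        # O(H|F)   = 1.8
    print(round(odds_h * lr_f ** 3, 3))   # O(H|FFF) = 0.072 after three female students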
Noticing that logistic regression is expressed in terms of log-odds, my next move is to write the log-odds form of Bayes’s theorem by taking the log of Eqn 1:
LO(H|F) = LO(H) + LLR(F|H)

If the observed student had been male, the update term would be LLR(M|H) instead. More generally, if we define X to be 0 when the observed student is female and 1 when male, we can write

LO(H|X) = LO(H) + LLR(F|H), if X = 0
LO(H|X) = LO(H) + LLR(M|H), if X = 1

Or we can collapse these two expressions into one by using X as a multiplier:

LO(H|X) = LO(H) + LLR(F|H) + X [LLR(M|H) − LLR(F|H)]
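As a quick check that the two-case form and the collapsed form agree, here is a short sketch using the numbers from the example:

    import math

    lo_prior = math.log(0.9 / 0.1)    # LO(H) = ln(9)
    llr_f = math.log(0.1 / 0.5)       # LLR(F|H) = ln(0.2)
    llr_m = math.log(0.9 / 0.5)       # LLR(M|H) = ln(1.8)

    for x in (0, 1):
        case_form = lo_prior + (llr_f if x == 0 else llr_m)
        collapsed = lo_prior + llr_f + x * (llr_m - llr_f)
        print(x, round(case_form, 3), round(collapsed, 3))
    # X = 0: both give 0.588; X = 1: both give 2.785.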
Odds ratios are often used in medicine to describe the association between a disease and a risk factor. In the example scenario, we can use an odds ratio to express the odds of the hypothesis H if we observe a male student, relative to the odds if we observe a female student:
ORX(H) = O(H|M) / O(H|F)

I’m using the notation ORX to represent the odds ratio associated with the variable X.
Applying Bayes’s theorem to the top and bottom of the previous expression yields
ORX(H) = O(H) LR(M|H) / [O(H) LR(F|H)] = LR(M|H) / LR(F|H)

Taking the log of both sides yields

LORX(H) = LLR(M|H) − LLR(F|H)

That is, the log-odds ratio associated with X is the difference between the two log-likelihood ratios. Substituting it into the collapsed expression above, and noticing that LO(H) + LLR(F|H) is the posterior log-odds LO(H|F), we get
LO(H|X) = LO(H|F) + X LORX(H)

We can think of this equation as the log-odds form of Bayes’s theorem, with the update term expressed as a log-odds ratio. Let’s compare that to the functional form of logistic regression:
logit(p) = β0 + X β1

The correspondence between these equations suggests the following interpretation:
- The predicted value, logit(p), is the posterior log-odds of the hypothesis, given the observed data.
- The intercept, β0, is the log-odds of the hypothesis when X = 0.
- The coefficient of X, β1, is a log-odds ratio that represents the odds of H when X = 1, relative to when X = 0.

This relationship between logistic regression and Bayes’s theorem tells us how to interpret the estimated coefficients. It also answers the questions I posed at the beginning of this note: the functional form of logistic regression makes sense because it corresponds to the way Bayes’s theorem uses data to update probabilities.
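To illustrate this correspondence numerically, here is a small simulation sketch, assuming numpy and statsmodels are available (that library choice is an assumption, not part of the derivation). It generates data according to the classroom example and fits a logistic regression predicting H from X; the estimated coefficients should come out close to the values derived above.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 500_000

    # H: "I was in the right room", with prior probability 0.9.
    h = (rng.random(n) < 0.9).astype(float)

    # X: 1 if the observed student is male, 0 if female.
    # P(male | H) = 0.9 and P(male | not H) = 0.5, as in the example.
    x = (rng.random(n) < np.where(h == 1, 0.9, 0.5)).astype(float)

    # Fit logit(p) = beta0 + beta1 * X, predicting H from X.
    result = sm.Logit(h, sm.add_constant(x)).fit(disp=False)
    print(result.params)

    # Theory predicts:
    #   beta0, the intercept, close to LO(H|F) = ln(9) + ln(0.2) = ln(1.8), about 0.588
    #   beta1, the slope,     close to LORX(H) = ln(1.8) - ln(0.2) = ln(9), about 2.197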