![Page 1: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/1.jpg)
CS109A Introduction to Data SciencePavlos Protopapas, Kevin Rader, and Chris Tanner
Advanced Section #5:Generalized Linear Models:
Logistic Regression and Beyond
1
Nick Stern
![Page 2: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/2.jpg)
CS109A, PROTOPAPAS, RADER
Outline
1. Motivation
• Limitations of linear regression
2. Anatomy
• Exponential Dispersion Family (EDF)
• Link function
3. Maximum Likelihood Estimation for GLM’s
• Fischer Scoring
2
![Page 3: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/3.jpg)
CS109A, PROTOPAPAS, RADER
Motivation
3
![Page 4: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/4.jpg)
CS109A, PROTOPAPAS, RADER
Motivation
4
Linear regression framework:
𝑦" = 𝑥"%𝛽 + 𝜖"
Assumptions:
1. Linearity: Linear relationship between expected value and predictors
2. Normality: Residuals are normally distributed about expected value
3. Homoskedasticity: Residuals have constant variance 𝜎*
4. Independence: Observations are independent of one another
![Page 5: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/5.jpg)
CS109A, PROTOPAPAS, RADER
Motivation
5
Expressed mathematically…
• Linearity
𝔼 𝑦" = 𝑥"%𝛽
• Normality
𝑦" ∼ 𝒩(𝑥"%𝛽, 𝜎*)
• Homoskedasticity
𝜎* (instead of) 𝜎"*
• Independence
𝑝 𝑦"|𝑦3 = 𝑝(𝑦") for 𝑖 ≠ 𝑗
![Page 6: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/6.jpg)
CS109A, PROTOPAPAS, RADER
Motivation
6
What happens when our assumptions break down?
![Page 7: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/7.jpg)
CS109A, PROTOPAPAS, RADER
Motivation
7
We have options within the framework of linear regression
Transform X or Y
(Polynomial Regression)
Nonlinearity
Weight observations
(WLS Regression)
Heteroskedasticity
![Page 8: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/8.jpg)
CS109A, PROTOPAPAS, RADER
Motivation
8
But assuming Normality can be pretty limiting…
Consider modeling the following random variables:
• Whether a coin flip is heads or tails (Bernoulli)
• Counts of species in a given area (Poisson)
• Time between stochastic events that occur w/ constant rate (gamma)
• Vote counts for multiple candidates in a poll (multinomial)
![Page 9: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/9.jpg)
CS109A, PROTOPAPAS, RADER
Motivation
9
We can extend the framework for linear regression.
Enter:
Generalized Linear Models
Relaxes:
• Normality assumption
• Homoskedasticity assumption
![Page 10: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/10.jpg)
CS109A, PROTOPAPAS, RADER
Motivation
10
![Page 11: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/11.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy
11
![Page 12: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/12.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy
12
Two adjustments must be made to turn LM into GLM
1. Assume response variable comes from a family of distributions called the exponential dispersion family (EDF).
2. The relationship between expected value and predictors is expressed through a link function.
![Page 13: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/13.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy – EDF Family
13
The EDF family contains: Normal, Poisson, gamma, and more!
The probability density function looks like this:
𝑓 𝑦"|𝜃" = exp𝑦"𝜃" − 𝑏 𝜃"
𝜙"+ 𝑐 𝑦", 𝜙"
Where
𝜃 - “canonical parameter”𝜙 - “dispersion parameter”𝑏 𝜃 - “cumulant function”𝑐 𝑦, 𝜙 - “normalization factor”
![Page 14: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/14.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy – EDF Family
14
Example: representing Bernoulli distribution in EDF form.
PDF of a Bernoulli random variable:
𝑓 𝑦" 𝑝" = 𝑝"@A 1 − 𝑝" C D @A
Taking the log and then exponentiating (to cancel each other out) gives:
𝑓 𝑦" 𝑝" = exp 𝑦" log 𝑝" + 1 − 𝑦" log 1 − 𝑝"
Rearranging terms…
𝑓 𝑦" 𝑝" = exp 𝑦" log𝑝"
1 − 𝑝"+ log 1 − 𝑝"
![Page 15: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/15.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy – EDF Family
15
Comparing:
𝑓 𝑦" 𝑝" = exp 𝑦" log𝑝"
1 − 𝑝"+ log 1 − 𝑝" 𝑓 𝑦"|𝜃" = exp
𝑦"𝜃" − 𝑏 𝜃"𝜙"
+ 𝑐 𝑦", 𝜙"vs.
Choosing:
𝜃" = log𝑝"
1 − 𝑝"𝜙" = 1
𝑏(𝜃") = log 1 + 𝑒IA
𝑐(𝑦", 𝜙") = 0
And we recover the EDF form of the Bernoulli distribution
![Page 16: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/16.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy – EDF Family
16
The EDF family has some useful properties. Namely:
1. 𝔼 𝑦" ≡ 𝜇" = 𝑏N 𝜃"
2. 𝑉𝑎𝑟 𝑦" = 𝜙"𝑏NN 𝜃"(the proofs for these identities are in the notes)
Plugging in the values we obtained for Bernoulli, we get back:
𝔼 𝑦" = 𝑝" , 𝑉𝑎𝑟 𝑦" = 𝑝"(1 − 𝑝")
![Page 17: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/17.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy – Link Function
17
Time to talk about the link function
![Page 18: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/18.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy – Link Function
18
Recall from linear regression that:
𝜇" = 𝑥"%𝛽
Does this work for the Bernoulli distribution?
𝜇" = 𝑝" = 𝑥"%𝛽
Solution: wrap the expectation in a function called the link function:
𝑔 𝜇" = 𝑥"%𝛽 ≡ 𝜂"
*For the Bernoulli distribution, the link function is the “logit” function (hence “logistic” regression)
![Page 19: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/19.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy – Link Function
19
Link functions are a choice, not a property. A good choice is:
1. Differentiable (implies “smoothness”)
2. Monotonic (guarantees invertibility)
1. Typically increasing so that 𝜇 increases w/ 𝜂
3. Expands the range of 𝜇 to the entire real line
Example: Logit function for Bernoulli
𝑔 𝜇" = 𝑔 𝑝" = log𝑝"
1 − 𝑝"
![Page 20: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/20.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy – Link Function
20
Logit function for Bernoulli looks familiar…
𝑔 𝑝" = log𝑝"
1 − 𝑝"= 𝜃"
Choosing the link function by setting 𝜃" = 𝜂" gives us what is known as the “canonical link function.” Note:
𝜇" = 𝑏N 𝜃" → 𝜃" = 𝑏NDC(𝜇")(derivative of cumulant function must be invertible)
This choice of link, while not always effective, has some nice properties. Take STAT 149 to find out more!
![Page 21: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/21.jpg)
CS109A, PROTOPAPAS, RADER
Anatomy – Link Function
21
Here are some more examples (fun exercises at home)
Distribution 𝒇(𝒚𝒊|𝜽𝒊) Mean Function 𝝁𝒊 = 𝒃N(𝜽𝒊) Canonical Link 𝜽𝒊 = 𝒈(𝝁𝒊)
Normal 𝜃" 𝜇"
Bernoulli/Binomial 𝑒IA1 + 𝑒IA
log𝜇"
1 − 𝜇"
Poisson 𝑒IA log(𝜇")
Gamma−1𝜃"
−1𝜇"
Inverse Gaussian −2𝜃"DC*
−12𝜇"*
![Page 22: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/22.jpg)
CS109A, PROTOPAPAS, RADER
Maximum Likelihood Estimation
22
![Page 23: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/23.jpg)
CS109A, PROTOPAPAS, RADER
Maximum Likelihood Estimation
23
Recall from linear regression – we can estimate our parameters, 𝜃, by choosing those that maximize the likelihood, 𝐿 𝑦 𝜃), of the data, where:
𝐿 𝑦 𝜃 =^"
_
𝑝 𝑦" 𝜃"
In words: likelihood is the probability of observing a set of “N”independent datapoints, given our assumptions about the generative process.
![Page 24: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/24.jpg)
CS109A, PROTOPAPAS, RADER
Maximum Likelihood Estimation
24
For GLM’s we can plug in the PDF of the EDF family:
𝐿 𝑦 𝜃 =^"`C
_
exp𝑦"𝜃" − 𝑏 𝜃"
𝜙"+ 𝑐 𝑦", 𝜙"
How do we maximize this? Differentiate w.r.t. 𝜃 and set equal to 0. Taking the log first simplifies our life:
ℓ 𝑦 𝜃 = b"`C
_𝑦"𝜃" − 𝑏 𝜃"
𝜙"+ b
"`C
_
𝑐 𝑦", 𝜙"
![Page 25: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/25.jpg)
CS109A, PROTOPAPAS, RADER
Maximum Likelihood Estimation
25
Through lots of calculus & algebra (see notes), we can obtain the following form for the derivative of the log-likelihood:
ℓN 𝑦 𝜃 =b"`C
_1
𝑉𝑎𝑟 𝑦"𝜕𝜇"𝜕𝛽 (𝑦" − 𝜇")
Setting this sum equal to 0 gives us the generalized estimating equations:
b"`C
_1
𝑉𝑎𝑟 𝑦"𝜕𝜇"𝜕𝛽 (𝑦" − 𝜇") = 0
![Page 26: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/26.jpg)
CS109A, PROTOPAPAS, RADER
Maximum Likelihood Estimation
26
When we use the canonical link, this simplifies to the normal equations:
b"`C
_𝑦" − 𝜇" 𝑥"%
𝜙"= 0
Let’s attempt to solve the normal equations for the Bernoulli distribution. Plugging in 𝜇" and 𝜙" we get:
b"`C
_
𝑦" −𝑒dA
ef
1 − 𝑒dAef𝑥"% = 0
![Page 27: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/27.jpg)
CS109A, PROTOPAPAS, RADER
Maximum Likelihood Estimation
27
Sad news: we can’t isolate 𝛽 analytically.
![Page 28: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/28.jpg)
CS109A, PROTOPAPAS, RADER
Maximum Likelihood Estimation
28
Good news: we can approximate it numerically. One choice ofalgorithm is the Fisher Scoring algorithm.
In order to find the 𝜃 that maximizes the log-likelihood, ℓ(𝑦|𝜃):
1. Pick a starting value for our parameter, 𝜃g.
2. Iteratively update this value as follows:
𝜃"hC = 𝜃" −ℓN(𝜃")
𝔼 ℓNN 𝜃"
In words: perform gradient ascent with a learning rate inversely proportional to the expected curvature of the function at that point.
![Page 29: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/29.jpg)
CS109A, PROTOPAPAS, RADER
Maximum Likelihood Estimation
29
Here are the results of implementing the Fisher Scoring algorithm for simple logistic regression in python:
DEMO
![Page 30: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between](https://reader033.vdocuments.site/reader033/viewer/2022060207/5f03df677e708231d40b2f58/html5/thumbnails/30.jpg)
CS109A, PROTOPAPAS, RADER
Questions?
30