![Page 1: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/1.jpg)
Naive Bayes and Logistic Regression
[70240413 Statistical Machine Learning, Spring, 2015]
Jun Zhu [email protected]
http://bigml.cs.tsinghua.edu.cn/~jun
State Key Lab of Intelligent Technology & Systems
Tsinghua University
March 31, 2015
![Page 2: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/2.jpg)
Outline
Probabilistic methods for supervised learning
Naive Bayes classifier
Logistic regression
Exponential family distributions
Generalized linear models
![Page 3: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/3.jpg)
An Intuitive Example
[Courtesy of E. Keogh]
![Page 4: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/4.jpg)
With more data …
Build a histogram, e.g., for “Antenna length”
[Courtesy of E. Keogh]
![Page 5: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/5.jpg)
Empirical distribution
Histogram (or empirical distribution)
Smooth with kernel density estimation (KDE):
[Courtesy of E. Keogh]
![Page 6: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/6.jpg)
Classification?
Classify another insect we find. Its antennae are 3 units long
Is it more probable that the insect is a Grasshopper or a
Katydid?
[Courtesy of E. Keogh]
![Page 7: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/7.jpg)
Classification Probability
[Figure: class histograms at antenna length 3, with counts 10 and 2]
[Courtesy of E. Keogh]
![Page 8: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/8.jpg)
Classification Probability
[Courtesy of E. Keogh]
![Page 9: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/9.jpg)
Classification Probability
[Courtesy of E. Keogh]
![Page 10: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/10.jpg)
Naïve Bayes Classifier
The simplest “category-feature” generative model:
Category: “bird”, “mammal”
Features: “has beak”, “can fly” …
![Page 11: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/11.jpg)
Naïve Bayes Classifier
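A mathematical model:
Naive Bayes assumption: features are conditionally independent given the class label Y
A joint distribution: $p(\mathbf{x}, y) = p(y)\, p(\mathbf{x} \mid y)$, where $p(y)$ is the prior and $p(\mathbf{x} \mid y)$ is the likelihood
[Figure: class node Y ∈ {bird, mammal} with feature nodes "has beak?", "can fly?", "has fur?", "has four legs?"]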
![Page 12: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/12.jpg)
Naïve Bayes Classifier
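A mathematical model:
Inference via Bayes rule: $p(y \mid \mathbf{x}) = \frac{p(\mathbf{x}, y)}{p(\mathbf{x})} = \frac{p(y)\, p(\mathbf{x} \mid y)}{p(\mathbf{x})}$
Bayes’ decision rule: $y^* = \arg\max_{y \in \mathcal{Y}} p(y \mid \mathbf{x})$
[Figure: class node Y ∈ {bird, mammal} with feature nodes "has beak?", "can fly?", "has fur?", "has four legs?"]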
![Page 13: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/13.jpg)
Bayes Error
Theorem: Bayes classifier is optimal!
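$p(\text{error} \mid x) = \begin{cases} p(y = 1 \mid x) & \text{if we decide } y = 0 \\ p(y = 0 \mid x) & \text{if we decide } y = 1 \end{cases}$
$p(\text{error}) = \int_{-\infty}^{\infty} p(\text{error} \mid x)\, p(x)\, dx$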
![Page 14: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/14.jpg)
Naïve Bayes Classifier
How to learn model parameters?
Assume X are d binary features, Y has 2 possible labels
How many parameters to estimate?
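$p(y \mid \pi) = \begin{cases} \pi & \text{if } y = 1 \text{ (i.e., bird)} \\ 1 - \pi & \text{otherwise} \end{cases}$
[Figure: class node Y ∈ {bird, mammal} with feature nodes "has beak?", "can fly?", "has fur?", "has four legs?"]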
![Page 15: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/15.jpg)
Naïve Bayes Classifier
How to learn model parameters?
A set of training data:
(1, 1, 0, 0; 1)
(1, 0, 0, 0; 1)
(0, 1, 1, 0; 0)
(0, 0, 1, 1; 0)
Maximum likelihood estimation (N: # of training data)
![Page 16: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/16.jpg)
Naïve Bayes Classifier
Maximum likelihood estimation (N: # of training data)
Results (count frequency! Exercise?):
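As a minimal sketch of the "count frequency" result, the code below fits the Bernoulli naive Bayes parameters on the four training examples listed earlier. The function name `fit_nb_mle` and the symbols `pi` and `q` are illustrative choices, not notation from the slides.

```python
import numpy as np

# Toy training set from the slides: four binary features, last entry is the label (1 = bird).
data = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
])
X, y = data[:, :4], data[:, 4]

def fit_nb_mle(X, y):
    """Return (pi, q) with pi = P(y=1) and q[c, j] = P(x_j = 1 | y = c), both as relative frequencies."""
    pi = y.mean()
    q = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    return pi, q

pi, q = fit_nb_mle(X, y)
print(pi)  # class prior for "bird"
print(q)   # per-class feature frequencies
```

Several entries of `q` come out as exactly 0 or 1, which is the zero-counts problem that smoothing addresses.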
![Page 17: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/17.jpg)
Naïve Bayes Classifier
Data scarcity issue (zero-counts problem):
What if some feature values never appear in the training data for a class?
Laplace smoothing (Additive smoothing):
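The slide's exact formula is not reproduced in this transcript, so the following is a sketch of one common form of additive smoothing for binary features: add a pseudo-count `alpha` to each of the two feature values. The name `fit_nb_laplace` and the default `alpha=1.0` are assumptions for illustration.

```python
import numpy as np

def fit_nb_laplace(X, y, alpha=1.0):
    """q[c, j] = (count(x_j = 1, y = c) + alpha) / (count(y = c) + 2 * alpha)."""
    return np.stack([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in (0, 1)
    ])

# Same toy data as the training set on the slides.
X = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]])
y = np.array([1, 1, 0, 0])
q = fit_nb_laplace(X, y)
print(q)  # no probability is exactly 0 or 1 any more
```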
![Page 18: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/18.jpg)
A Bayesian Treatment
Put a prior on the parameters
![Page 19: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/19.jpg)
A Bayesian Treatment
Maximum a Posteriori Estimate (MAP):
Results (Exercise?):
![Page 20: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/20.jpg)
A Bayesian Treatment
Maximum a Posteriori Estimate (MAP):
If the prior is non-informative, it has no effect
MLE is a special case of the Bayesian estimate
Increasing the prior hyperparameters leads to a heavier influence from the prior
![Page 21: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/21.jpg)
Bayesian Regression
Goal: learn a function from noisy observed data
Linear
Polynomial
…
![Page 22: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/22.jpg)
Bayesian Regression
Noisy observations
Gaussian likelihood function for linear regression
Gaussian prior (conjugate)
Inference with Bayes’ rule
Posterior
Marginal likelihood
Prediction
![Page 23: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/23.jpg)
Extensions of NB
We covered the case with binary features and binary class labels
NB is applicable to the cases:
Discrete features + discrete class labels
Continuous features + discrete class labels
…
More dependency between features can be considered
Tree augmented NB
…
![Page 24: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/24.jpg)
Gaussian Naive Bayes (GNB)
E.g.: character recognition: feature Xi is intensity at pixel i:
The generative process is
Different mean and variance for each class k and each feature i
Sometimes assume the variance is:
independent of Y (i.e., $\sigma_i$)
or independent of X (i.e., $\sigma_k$)
or both (i.e., $\sigma$)
![Page 25: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/25.jpg)
Estimating Parameters & Prediction
MLE estimates (computed from the value of pixel i in each training image n)
Prediction:
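The estimation and prediction steps for Gaussian naive Bayes can be sketched as follows, assuming a separate mean and variance for each class k and feature i. The function names, the synthetic two-class data, and the small variance floor `eps` are assumptions added for illustration.

```python
import numpy as np

def fit_gnb(X, y):
    """MLE for Gaussian naive Bayes: class priors plus per-class, per-feature mean and variance."""
    classes = np.unique(y)
    prior = np.array([(y == k).mean() for k in classes])
    mu = np.stack([X[y == k].mean(axis=0) for k in classes])
    var = np.stack([X[y == k].var(axis=0) for k in classes])  # MLE variance (divides by N_k)
    return classes, prior, mu, var

def predict_gnb(model, X, eps=1e-9):
    classes, prior, mu, var = model
    var = var + eps  # small floor, an added safeguard that is not part of the MLE
    # log N(x_i | mu_ik, var_ik) summed over features, for every sample and class
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None]
                      + (X[:, None, :] - mu[None]) ** 2 / var[None]).sum(axis=2)
    return classes[np.argmax(np.log(prior)[None] + log_lik, axis=1)]

# Synthetic data: two well-separated Gaussian classes in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gnb(X, y)
pred = predict_gnb(model, X)
print((pred == y).mean())  # training accuracy
```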
![Page 26: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/26.jpg)
What you need to know about NB classifier
What’s the assumption
Why we use it
How do we learn it
Why is Bayesian estimation (MAP) important
![Page 27: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/27.jpg)
Linear regression and linear classification
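[Figure: left panel, a linear fit; right panel, a linear decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$ separating the regions $\mathbf{w}^\top \mathbf{x} + b < 0$ and $\mathbf{w}^\top \mathbf{x} + b > 0$]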
![Page 28: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/28.jpg)
What’s the decision boundary of NB?
Is it linear or non-linear?
There are several distributions that lead to a linear decision
boundary, e.g., GNB with equal variance
Decision boundary (??):
![Page 29: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/29.jpg)
Gaussian Naive Bayes (GNB)
Decision boundary (the general multivariate Gaussian case):
![Page 30: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/30.jpg)
The predictive distribution of GNB
Understanding the predictive distribution
Under naive Bayes assumption:
Note: For multi-class, the predictive distribution is softmax!
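A worked derivation may help here. It is a sketch for the binary case under two stated assumptions: labels $y \in \{0, 1\}$ with prior $\pi = p(y = 1)$, and a variance $\sigma_i^2$ that is shared across classes for each feature $i$. The symbols $w_i$ and $b$ are introduced only for this derivation.

```latex
\begin{align*}
\log \frac{p(y=1 \mid \mathbf{x})}{p(y=0 \mid \mathbf{x})}
  &= \log \frac{\pi}{1-\pi}
     + \sum_i \left[ -\frac{(x_i - \mu_{i1})^2}{2\sigma_i^2}
                     + \frac{(x_i - \mu_{i0})^2}{2\sigma_i^2} \right] \\
  &= \underbrace{\sum_i \frac{\mu_{i1} - \mu_{i0}}{\sigma_i^2}\, x_i}_{\mathbf{w}^\top \mathbf{x}}
     + \underbrace{\log \frac{\pi}{1-\pi}
     + \sum_i \frac{\mu_{i0}^2 - \mu_{i1}^2}{2\sigma_i^2}}_{b} \\
p(y=1 \mid \mathbf{x}) &= \frac{1}{1 + \exp\!\left(-(\mathbf{w}^\top \mathbf{x} + b)\right)}
\end{align*}
```

The quadratic terms in $x_i$ cancel only because the variance does not depend on the class, which is why this case gives a linear decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$.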
![Page 31: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/31.jpg)
Generative vs. Discriminative Classifiers
Generative classifiers (e.g., Naive Bayes)
Assume some functional form for P(X,Y) (or P(Y) and P(X|Y))
Estimate parameters of P(X,Y) directly from training data
Make prediction
But, we note that
Why not learn P(Y|X) directly? Or, why not learn the decision boundary directly?
Discriminative classifiers (e.g., Logistic regression)
Assume some functional form for P(Y|X)
Estimate parameters of P(Y|X) directly from training data
![Page 32: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/32.jpg)
Logistic Regression
Recall the predictive distribution of GNB!
Assume the following functional form for P(Y|X)
Logistic function (or Sigmoid) applied to a linear function of the
data (for ):
$\psi_\alpha(v) = \frac{1}{1 + \exp(-\alpha v)}$
$\alpha \to \infty$: step function
Using a large $\alpha$ can be good for some neural networks
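A short sketch of how the slope parameter changes the logistic function; the function name `sigmoid` and the sample points are illustrative.

```python
import numpy as np

def sigmoid(v, alpha=1.0):
    """Logistic function 1 / (1 + exp(-alpha * v))."""
    return 1.0 / (1.0 + np.exp(-alpha * np.asarray(v, dtype=float)))

v = np.array([-1.0, -0.1, 0.1, 1.0])
soft = sigmoid(v, alpha=1.0)     # smooth transition around v = 0
sharp = sigmoid(v, alpha=100.0)  # close to a 0/1 step function
print(soft)
print(sharp)
```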
![Page 33: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/33.jpg)
Logistic Regression
What’s the decision boundary of logistic regression? (linear
or nonlinear?)
Logistic regression is a linear classifier!
![Page 34: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/34.jpg)
Representation
Logistic regression
For notation simplicity, we use the augmented vector:
Then, we have
![Page 35: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/35.jpg)
Multiclass Logistic Regression
For more than 2 classes, where $y \in \{1, \dots, K\}$, the logistic regression classifier is defined as
Well normalized distribution! No weights for class K!
Is the decision boundary still linear?
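One way to read "no weights for class K" is that class K acts as a reference class with a fixed score of zero. The sketch below follows that reading; the function name and the weight values are hypothetical.

```python
import numpy as np

def multiclass_lr_proba(W, x):
    """P(y = k | x) for k = 1..K, with weight rows only for classes 1..K-1.

    Class K is the reference class and has an implicit score of 0.
    """
    scores = np.append(W @ x, 0.0)  # K-1 linear scores plus 0 for class K
    scores -= scores.max()          # numerical stability; leaves the probabilities unchanged
    expo = np.exp(scores)
    return expo / expo.sum()

W = np.array([[1.0, -1.0], [0.5, 2.0]])  # hypothetical weights: K = 3 classes, d = 2 features
p = multiclass_lr_proba(W, np.array([0.2, 0.4]))
print(p, p.sum())
```

The log-ratio of any two class probabilities is linear in x, so the boundary between any pair of classes is still a hyperplane.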
![Page 36: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/36.jpg)
Training Logistic Regression
We consider the binary classification
Training data
How to learn the parameters?
Can we do MLE?
No! Don’t have a model for P(X) or P(X|Y)
Can we do large-margin learning?
![Page 37: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/37.jpg)
Maximum Conditional Likelihood Estimate
We learn the parameters by solving
Discriminative philosophy – don’t waste effort on
learning P(X), focus on P(Y|X) – that’s all that matters for
classification!
![Page 38: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/38.jpg)
Maximum Conditional Likelihood Estimate
We have:
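For reference, the conditional log-likelihood can be written out as below. This is a sketch that assumes labels $y_n \in \{0, 1\}$, the augmented input vector, and the shorthand $\mu_n = p(y_n = 1 \mid \mathbf{x}_n, \mathbf{w})$; the slide's own notation may differ.

```latex
\begin{align*}
\mu_n &= \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{x}_n)} \\
\ell(\mathbf{w}) &= \sum_{n=1}^{N} \log p(y_n \mid \mathbf{x}_n, \mathbf{w})
   = \sum_{n=1}^{N} \left[ y_n \log \mu_n + (1 - y_n) \log (1 - \mu_n) \right] \\
  &= \sum_{n=1}^{N} \left[ y_n \mathbf{w}^\top \mathbf{x}_n
     - \log\!\left(1 + \exp(\mathbf{w}^\top \mathbf{x}_n)\right) \right] \\
\nabla_{\mathbf{w}} \ell(\mathbf{w}) &= \sum_{n=1}^{N} (y_n - \mu_n)\, \mathbf{x}_n
\end{align*}
```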
![Page 39: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/39.jpg)
Maximum Conditional Likelihood Estimate
Bad news: no closed-form solution!
Good news: the conditional log-likelihood is a concave function of w!
Is the original logistic function concave?
Read [S. Boyd, Convex Optimization, Chap. 1] for an introduction to convex optimization.
![Page 40: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/40.jpg)
Optimizing concave/convex function
Conditional likelihood for logistic regression is concave
Maximum of a concave function = minimum of a convex
function
Gradient ascent (concave) / Gradient descent (convex)
Gradient:
Update rule:
![Page 41: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/41.jpg)
Gradient Ascent for Logistic Regression
Property of sigmoid function
Gradient ascent algorithm iteratively does:
where is the prediction made by the current model
Until the change (of objective or gradient) falls below some threshold
$\psi(v) = \frac{1}{1 + \exp(-v)}$
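The gradient ascent loop can be sketched as follows, assuming labels in {0, 1} and a constant-1 column in X for the bias. The step size, tolerance, gradient averaging, and synthetic data are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def fit_lr_gradient_ascent(X, y, step=0.5, tol=1e-6, max_iter=10000):
    """Maximize the conditional log-likelihood by gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = sigmoid(X @ w)               # predictions of the current model
        grad = X.T @ (y - mu) / len(y)    # averaged gradient of the log-likelihood
        w = w + step * grad               # ascent step
        if np.linalg.norm(grad) < tol:    # stop once the gradient is small
            break
    return w

# Synthetic one-dimensional data with noisy labels.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, np.ones_like(x)])
y = (rng.random(200) < sigmoid(X @ np.array([2.0, -0.5]))).astype(float)
w = fit_lr_gradient_ascent(X, y)
print(w)
```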
![Page 42: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/42.jpg)
Issues
Gradient descent is the simplest optimization method; faster convergence can be obtained by using more advanced methods
E.g., Newton's method, conjugate gradient ascent, IRLS (iteratively reweighted least squares)
Vanilla logistic regression often over-fits; using regularization can help a lot!
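The slides do not say which regularizer is meant. As one common choice, an $\ell_2$ penalty with a hypothetical strength $\lambda > 0$ changes the objective and its gradient as sketched below, with $\mu_n$ denoting the model's predicted probability for example $n$ and labels assumed to be in $\{0, 1\}$.

```latex
\begin{align*}
\ell_\lambda(\mathbf{w}) &= \sum_{n=1}^{N} \log p(y_n \mid \mathbf{x}_n, \mathbf{w})
   - \frac{\lambda}{2} \|\mathbf{w}\|_2^2 \\
\nabla_{\mathbf{w}} \ell_\lambda(\mathbf{w}) &= \sum_{n=1}^{N} (y_n - \mu_n)\, \mathbf{x}_n - \lambda \mathbf{w}
\end{align*}
```

The penalized objective remains concave, so the same gradient-based methods apply.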
![Page 43: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/43.jpg)
Effects of step-size
Large => fast convergence but larger residual error; Also possible oscillations
Small => slow convergence but small residual error
![Page 44: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/44.jpg)
The Newton’s Method
AKA: Newton-Raphson method
A method that finds the root of:
[Figure from Wikipedia]
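A minimal sketch of the Newton-Raphson iteration for a scalar function, x ← x − f(x) / f'(x). The example function and starting point are chosen only for illustration.

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    """Find a root of f by iterating x <- x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:  # stop when the update is negligible
            break
    return x

# Example: the positive root of x^2 - 2.
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```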
![Page 45: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/45.jpg)
The Newton’s Method
To maximize the conditional likelihood
We need to find such that
So we can perform the following iteration:
where H is known as the Hessian matrix:
![Page 46: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/46.jpg)
Newton’s Method for LR
The update equation
where the gradient is:
The Hessian matrix is:
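Newton's method for logistic regression can be sketched as below, assuming labels in {0, 1}, gradient X^T (y − μ), and Hessian −X^T R X with R = diag(μ_n (1 − μ_n)). The tiny ridge term and the synthetic data are added assumptions, not part of the slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def fit_lr_newton(X, y, n_iter=20, ridge=1e-8):
    """Newton's method on the conditional log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ w)
        grad = X.T @ (y - mu)                       # gradient of the log-likelihood
        H = -(X * (mu * (1 - mu))[:, None]).T @ X   # Hessian: -X^T R X
        H -= ridge * np.eye(len(w))                 # tiny ridge for invertibility (added safeguard)
        w = w - np.linalg.solve(H, grad)            # w <- w - H^{-1} grad
    return w

# Synthetic one-dimensional data with noisy labels.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, np.ones_like(x)])
y = (rng.random(200) < sigmoid(X @ np.array([2.0, -0.5]))).astype(float)
w = fit_lr_newton(X, y)
print(w)
```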
![Page 47: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/47.jpg)
Iteratively reweighted least squares (IRLS)
In the least-squares estimate of linear regression, we have
Now, for logistic regression
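The connection between the two can be sketched as follows, assuming labels in $\{0, 1\}$, a design matrix $X$ with one row per example, predictions $\boldsymbol{\mu}$, and $R = \mathrm{diag}(\mu_n (1 - \mu_n))$. The working response $\mathbf{z}$ is a symbol introduced here.

```latex
\begin{align*}
\text{Linear regression:}\quad
\mathbf{w} &= (X^\top X)^{-1} X^\top \mathbf{y} \\
\text{Newton step for LR:}\quad
\mathbf{w}^{t+1} &= \mathbf{w}^{t} + (X^\top R X)^{-1} X^\top (\mathbf{y} - \boldsymbol{\mu}) \\
  &= (X^\top R X)^{-1} X^\top R\, \mathbf{z},
  \qquad \mathbf{z} = X \mathbf{w}^{t} + R^{-1} (\mathbf{y} - \boldsymbol{\mu})
\end{align*}
```

Each Newton step is therefore a weighted least-squares problem whose weights $R$ and targets $\mathbf{z}$ are recomputed from the current model.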
![Page 48: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/48.jpg)
Convergence curves
Legend: X-axis: Iteration #; Y-axis: classification error
In each figure, red for IRLS and blue for gradient descent
[Figure: two panels, rec.autos vs. rec.sports.baseball and comp.windows.x vs. rec.motorcycles]
![Page 49: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/49.jpg)
LR: Practical Issues
IRLS takes per iteration, where N is # training points and d is feature dimension, but converges in fewer iterations
Quasi-Newton methods, that approximate the Hessian, work faster
Conjugate gradient takes per iteration, and usually works best in practice
Stochastic gradient descent can also be used if N is large (cf. the perceptron rule)
![Page 50: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/50.jpg)
Gaussian NB vs. Logistic Regression
Representation equivalence
But only in some special case! (GNB with class independent
variances)
What are the differences?
LR makes no assumption about P(X|Y) in learning
They optimize different functions, obtain different solutions
GNB (Gaussian parameters) vs. LR (regression parameters)
![Page 51: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/51.jpg)
Generative vs. Discriminative
Given infinite data (asymptotically)
(1) If the conditional independence assumption holds, discriminative and generative NB perform similarly
(2) If the conditional independence assumption does NOT hold, discriminative outperforms generative NB
[Ng & Jordan, NIPS 2001]
![Page 52: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/52.jpg)
Generative vs. Discriminative
Given finite data (N data points, d features)
Naive Bayes (generative) requires $N = O(\log d)$ to converge to its asymptotic error, whereas logistic regression (discriminative) requires $N = O(d)$.
Why?
“Independent class conditional densities” – parameter estimates are not coupled, each parameter is learnt independently, not jointly, from training data
![Page 53: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/53.jpg)
Experimental Comparison
UCI Machine Learning Repository: 15 datasets, 8 with continuous features and 7 with discrete features
[Figure: error curves for Naive Bayes and Logistic Regression]
![Page 54: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/54.jpg)
What you need to know
LR is a linear classifier
Decision boundary is a hyperplane
LR is learnt by maximizing conditional likelihood
No closed-form solution
Concave! Global optimum by gradient ascent methods
GNB with class-independent variances is representationally equivalent to LR
Solutions differ because of objective (loss) functions
In general, NB and LR make different assumptions
NB: features independent given class, assumption on P(X|Y)
LR: functional form of P(Y|X), no assumption on P(X|Y)
Convergence rates:
GNB (usually) needs less data
LR (usually) gets to better solutions in the limit
![Page 55: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/55.jpg)
Exponential family
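For a numeric random variable X,
$p(x \mid \eta) = h(x) \exp\{\eta^\top T(x) - A(\eta)\} = \frac{1}{Z(\eta)} h(x) \exp\{\eta^\top T(x)\}$
is an exponential family distribution with natural (canonical) parameter $\eta$
Function T(x) is a sufficient statistic.
Function $A(\eta) = \log Z(\eta)$ is the log normalizer.
Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma,...
[Figure: plate diagram with node $X_n$ repeated N times]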
![Page 56: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/56.jpg)
Recall Linear Regression
Let us assume that the target variable and the inputs are
related by the equation:
where ε is an error term of unmodeled effects or random noise
Now assume that ε follows a Gaussian N(0,σ), then we have:
![Page 57: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/57.jpg)
Recall: Logistic Regression (sigmoid classifier)
The conditional distribution: a Bernoulli
where $\mu$ is a logistic function
We can use the brute-force gradient method as in LR
But we can also apply generic laws by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model!
![Page 58: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/58.jpg)
Example: Multivariate Gaussian Distribution
For a continuous vector random variable $x \in \mathbb{R}^d$:
Exponential family representation
Note: a d-dimensional Gaussian is a $(d + d^2)$-parameter distribution with a $(d + d^2)$-element vector of sufficient statistics (but because of symmetry and positivity, the parameters are constrained and have fewer degrees of freedom)
Moment parameter
Natural parameter
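The two parameterizations can be written out as below. This is a reconstruction under standard conventions (column-stacking $\mathrm{vec}(\cdot)$, base measure $h(x)$), which may differ in detail from the slide's own layout.

```latex
\begin{align*}
p(x \mid \mu, \Sigma)
  &= \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}
     \exp\!\left\{ -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\} \\
  &= \frac{1}{(2\pi)^{d/2}}
     \exp\!\left\{ -\tfrac{1}{2} \mathrm{tr}\!\left(\Sigma^{-1} x x^\top\right)
       + \mu^\top \Sigma^{-1} x
       - \tfrac{1}{2} \mu^\top \Sigma^{-1} \mu - \tfrac{1}{2} \log |\Sigma| \right\} \\
\eta &= \left[ \Sigma^{-1} \mu;\; -\tfrac{1}{2} \mathrm{vec}\!\left(\Sigma^{-1}\right) \right], \qquad
T(x) = \left[ x;\; \mathrm{vec}\!\left(x x^\top\right) \right] \\
A(\eta) &= \tfrac{1}{2} \mu^\top \Sigma^{-1} \mu + \tfrac{1}{2} \log |\Sigma|, \qquad
h(x) = (2\pi)^{-d/2}
\end{align*}
```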
![Page 59: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/59.jpg)
Example: Multinomial distribution
For a binary vector random variable :
Exponential family representation
![Page 60: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/60.jpg)
Why exponential family?
Moment generating property (proof?)
![Page 61: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/61.jpg)
Moment estimation
We can easily compute moments of any exponential family distribution by taking the derivatives of the log normalizer $A(\eta)$.
The qth derivative gives the qth cumulant: the mean for q = 1, the variance for q = 2, and the third central moment for q = 3.
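As a quick numerical check of this property, the sketch below uses the Bernoulli distribution, whose log normalizer in natural form is A(η) = log(1 + e^η), and compares finite-difference derivatives of A with the known mean and variance. The helper names and the value of `eta` are illustrative.

```python
import math

def A(eta):
    """Log normalizer of the Bernoulli distribution in natural form."""
    return math.log(1.0 + math.exp(eta))

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

eta = 0.7
mu = 1.0 / (1.0 + math.exp(-eta))       # Bernoulli mean for this eta
mean_from_A = derivative(A, eta)        # should match mu
var_from_A = second_derivative(A, eta)  # should match mu * (1 - mu)
print(mean_from_A, mu)
print(var_from_A, mu * (1 - mu))
```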
![Page 62: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/62.jpg)
Moment vs canonical parameters
The moment parameter $\mu$ can be derived from the natural (canonical) parameter: $\frac{dA(\eta)}{d\eta} = E[T(x)] = \mu$
$A(\eta)$ is convex since $\frac{d^2 A(\eta)}{d\eta^2} = \mathrm{Var}[T(x)] \geq 0$
Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1).
A distribution in the exponential family can be parameterized not only by $\eta$ (the canonical parameterization), but also by $\mu$ (the moment parameterization).
[Figure: two plots of A against $\eta$, with $\eta$ ranging from -2 to 2]
![Page 63: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/63.jpg)
Sufficiency
For $p(x \mid \theta)$, T(x) is sufficient for $\theta$ if there is no information in X regarding $\theta$ beyond that in T(x).
We can throw away X for the purpose of inference w.r.t. $\theta$.
Bayesian view: $p(\theta \mid T(x), x) = p(\theta \mid T(x))$
Frequentist view: $p(x \mid T(x), \theta) = p(x \mid T(x))$
The Neyman factorization theorem: T(x) is sufficient for $\theta$ if
$p(x, T(x), \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x))$
$\Rightarrow p(x \mid \theta) = g(T(x), \theta)\, h(x, T(x))$
![Page 64: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/64.jpg)
IID Sampling for Exponential Family
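For an exponential family distribution, we can read off the sufficient statistics by inspection once it is written in the standard form: $p(x \mid \eta) = h(x) \exp\{\eta^T T(x) - A(\eta)\}$
Sufficient statistics: $T(x)$
For IID sampling, the joint distribution is also an exponential family: $p(D \mid \eta) = \left(\prod_{n=1}^N h(x_n)\right) \exp\left\{\eta^T \sum_{n=1}^N T(x_n) - N A(\eta)\right\}$
Sufficient statistics: $\sum_{n=1}^N T(x_n)$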
![Page 65: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/65.jpg)
MLE for Exponential Family
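For IID data, the log-likelihood is $\ell(\eta; D) = \sum_n \log h(x_n) + \eta^T \sum_n T(x_n) - N A(\eta)$
Take derivatives and set to zero: $\frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N \frac{\partial A(\eta)}{\partial \eta} = 0 \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N} \sum_n T(x_n)$
This amounts to moment matching.
We can infer the canonical parameters using $\hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})$, where ψ is the inverse of the map $\mu = \partial A / \partial \eta$.
Only the sufficient statistics are involved!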
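The moment-matching recipe can be sketched for a Poisson model, assuming T(x) = x and A(η) = exp(η), so that the canonical parameter is the log of the mean. The counts below are invented for illustration.

```python
import math

def poisson_mle(counts):
    """Moment matching: mu_MLE is the average of the sufficient statistic T(x) = x."""
    mu_mle = sum(counts) / len(counts)
    eta_mle = math.log(mu_mle)  # invert mu = dA/d(eta) = exp(eta)
    return mu_mle, eta_mle

def log_likelihood(counts, eta):
    """Poisson log-likelihood in canonical form: sum of eta*x - exp(eta) - log(x!)."""
    return sum(eta * x - math.exp(eta) - math.lgamma(x + 1) for x in counts)

counts = [2, 4, 3, 0, 5, 1, 3]
mu_mle, eta_mle = poisson_mle(counts)
print(mu_mle, eta_mle)

# The likelihood should not improve when eta is nudged away from the estimate.
print(log_likelihood(counts, eta_mle),
      log_likelihood(counts, eta_mle + 0.05),
      log_likelihood(counts, eta_mle - 0.05))
```

Note that the estimate touches the data only through the sum of the counts and the sample size.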
![Page 66: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/66.jpg)
Examples
Gaussian:
Multinomial:
Poisson:
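The formulas for these three examples did not survive extraction. The block below is a reconstruction from the standard forms rather than the slide's own content; it assumes a unit-variance Gaussian, a multinomial with one-hot encoded x, and a Poisson with canonical parameter η equal to the log rate.

```latex
\begin{align*}
\text{Gaussian } (\sigma^2 = 1):\quad
  & T(x) = x,\quad A(\eta) = \tfrac{1}{2}\eta^2,\quad
    \hat{\mu}_{MLE} = \frac{1}{N}\sum_{n=1}^{N} x_n \\
\text{Multinomial (one-hot } x\text{)}:\quad
  & T(x) = x,\quad
    \hat{\mu}_{MLE} = \frac{1}{N}\sum_{n=1}^{N} x_n \\
\text{Poisson}:\quad
  & T(x) = x,\quad A(\eta) = e^{\eta},\quad
    \hat{\mu}_{MLE} = \frac{1}{N}\sum_{n=1}^{N} x_n,\quad
    \hat{\eta}_{MLE} = \log \hat{\mu}_{MLE}
\end{align*}
```

In all three cases the estimate is the sample average of the sufficient statistic, which is the moment-matching result applied to each family.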
![Page 67: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/67.jpg)
Generalized Linear Models (GLIMs)
The graphical model: [plate diagram with input X_n and output Y_n, repeated over N samples]
Linear regression
Discriminative linear classification
Commonality: model $E_p(Y) = \mu = f(\theta^T X)$
What is p()? The conditional distribution of Y.
What is f()? The response function.
GLIM
The observed input x is assumed to enter into the model via a linear combination of its elements, $\xi = \theta^T x$
The conditional mean μ is represented as a function f(ξ) of ξ, where f is known as the response function
The observed output y is assumed to be characterized by an exponential family distribution with conditional mean μ.
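The three GLIM ingredients can be sketched as a forward computation: a linear predictor, a response function, and the resulting conditional mean. The sketch assumes the canonical response for each family (identity for Gaussian, logistic for Bernoulli, exponential for Poisson); the dictionary and function names are illustrative.

```python
import math

def linear_predictor(theta, x):
    """xi = theta^T x, the linear combination of the input elements."""
    return sum(t * xi for t, xi in zip(theta, x))

# Canonical response functions f, mapping xi to the conditional mean mu.
RESPONSE = {
    "gaussian": lambda xi: xi,                            # linear regression
    "bernoulli": lambda xi: 1.0 / (1.0 + math.exp(-xi)),  # logistic regression
    "poisson": lambda xi: math.exp(xi),                   # Poisson regression
}

def conditional_mean(family, theta, x):
    """mu = f(theta^T x) for the chosen exponential family."""
    return RESPONSE[family](linear_predictor(theta, x))

theta = [0.5, -1.0]
x = [2.0, 1.0]  # theta^T x = 0.0
for family in ("gaussian", "bernoulli", "poisson"):
    print(family, conditional_mean(family, theta, x))
```

The same linear predictor feeds every model; only the response function and the output distribution change.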
![Page 68: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/68.jpg)
GLIM, cont.
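The choice of exp family is constrained by the nature of the data Y
Example: y is a continuous vector → multivariate Gaussian
y is a class label → Bernoulli or multinomial
The choice of the response function
Following some mild constraints, e.g., range in [0,1], positivity …
Canonical response function: $f = \psi^{-1}(\cdot)$, where ψ maps the mean μ to the canonical parameter η
In this case θ^T x directly corresponds to the canonical parameter η.
$p(y \mid \eta) = h(y) \exp\{\eta(x)^T y - A(\eta)\}$
$p(y \mid \eta, \phi) = h(y, \phi) \exp\left\{\frac{1}{\phi}\left(\eta(x)^T y - A(\eta)\right)\right\}$
[Diagram: x → ξ → μ → η, labeled with θ, f, and ψ; the output y is drawn from the exponential family (EXP)]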
![Page 69: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/69.jpg)
MLE for GLIMs
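Log-likelihood: $\ell(\theta) = \sum_n \log h(y_n) + \sum_n \left(\eta_n y_n - A(\eta_n)\right)$, where $\eta_n = \psi(\mu_n)$, $\mu_n = f(\xi_n)$, $\xi_n = \theta^T x_n$
Derivative of Log-likelihood: $\frac{d\ell}{d\theta} = \sum_n (y_n - \mu_n)\, \frac{d\eta_n}{d\mu_n}\, \frac{d\mu_n}{d\xi_n}\, x_n$
Setting the derivative to zero gives a fixed-point equation, because μ is a function of θ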
![Page 70: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/70.jpg)
MLE for GLIMs with canonical response
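Log-likelihood: $\ell = \sum_n \log h(y_n) + \sum_n \left(\theta^T x_n y_n - A(\eta_n)\right)$
Derivative of Log-likelihood: $\frac{d\ell}{d\theta} = \sum_n \left(x_n y_n - \frac{dA(\eta_n)}{d\eta_n} \frac{d\eta_n}{d\theta}\right) = \sum_n (y_n - \mu_n)\, x_n = X^T (y - \mu)$
Setting the derivative to zero gives a fixed-point equation, because μ is a function of θ
Online learning for canonical GLIMs
Stochastic gradient ascent = least mean squares (LMS) algorithm: $\theta^{t+1} = \theta^t + \rho\, (y_n - \mu_n^t)\, x_n$, where $\mu_n^t = f((\theta^t)^T x_n)$ and ρ is a step size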
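The online update for a canonical GLIM can be sketched for logistic regression, where the response function is the logistic sigmoid. This is a minimal illustration under assumed settings: the toy data, step size, epoch count, and function names are all invented, and the first feature is a constant 1 acting as an intercept.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(data, step_size=0.1, epochs=200, seed=0):
    """LMS-style update theta <- theta + rho * (y - mu) * x, one example at a time."""
    rng = random.Random(seed)
    theta = [0.0] * len(data[0][0])
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            mu = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            theta = [t + step_size * (y - mu) * xi for t, xi in zip(theta, x)]
    return theta

# Toy, non-separable data: (features [1, x], label y).
data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, -0.5], 1),
        ([1.0, 0.5], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]

theta = sgd_logistic(data)
print(theta)
```

The update has the same form as LMS for linear regression; only the computation of μ differs. With a constant step size the iterates hover around the maximum-likelihood solution rather than converging exactly.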
![Page 71: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/71.jpg)
MLE for GLIMs with canonical response
Log-likelihood: $\ell = \sum_n \log h(y_n) + \sum_n \left(\theta^T x_n y_n - A(\eta_n)\right)$
Derivative of Log-likelihood: $\frac{d\ell}{d\theta} = \sum_n (y_n - \mu_n)\, x_n = X^T (y - \mu)$
Batch learning applies
E.g., Newton's method leads to an Iteratively Reweighted Least Squares (IRLS) algorithm
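A minimal sketch of the Newton / IRLS iteration for logistic regression with one input and an intercept, assuming the canonical (sigmoid) response so that the weights are μ(1 − μ). It solves the 2×2 Newton system in closed form to stay dependency-free; the data and names are illustrative, and the data are deliberately non-separable so that the maximum-likelihood estimate is finite.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def irls_logistic(xs, ys, iterations=25):
    """Newton steps for p(y=1|x) = sigmoid(b0 + b1*x).

    Each step solves (X^T W X) d = X^T (y - mu) with W = diag(mu * (1 - mu)).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient X^T (y - mu)
        h00 = h01 = h11 = 0.0    # entries of X^T W X
        for x, y in zip(xs, ys):
            mu = sigmoid(b0 + b1 * x)
            w = mu * (1.0 - mu)
            g0 += y - mu
            g1 += (y - mu) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Closed-form solution of the 2x2 linear system.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = irls_logistic(xs, ys)
print(b0, b1)
```

Each iteration is a weighted least-squares solve whose weights are recomputed from the current μ, which is where the name "iteratively reweighted" comes from.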
![Page 72: Naive Bayes and Logistic Regression - Tsinghua Universityml.cs.tsinghua.edu.cn/~jun/courses/statml-fall2015/5-NB-Logistic... · Naive Bayes and Logistic Regression [70240413 Statistical](https://reader031.vdocuments.site/reader031/viewer/2022022420/5a79cfee7f8b9afa378d0af0/html5/thumbnails/72.jpg)
What you need to know
Exponential family distribution
Moment estimation
Generalized linear models
Parameter estimation of GLIMs