Linear Methods for Classification
Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer
Introduction
• Suppose there are $K$ classes, labeled $1, 2, \ldots, K$
• A class of methods model a discriminant function $\delta_k(x)$ for each class; then, classify $x$ to the class with the largest value of its discriminant function
• The decision boundary between class $k$ and class $\ell$ is the set of points for which $\delta_k(x) = \delta_\ell(x)$
Linear Discriminant Analysis
• Suppose $f_k(x)$ is the class‐conditional density of $X$ in class $G = k$
• Let $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$
• A simple application of Bayes' theorem gives (a small sketch follows):
$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\,\pi_\ell}$$
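To make this computation concrete, here is a minimal Python sketch of the Bayes rule above (the function name posterior_probs and its arguments are illustrative, not from the lecture):

```python
import numpy as np

def posterior_probs(x, densities, priors):
    """Pr(G = k | X = x) from class-conditional densities f_k and priors pi_k."""
    f = np.array([f_k(x) for f_k in densities])  # f_k(x) for each class k
    pi = np.asarray(priors)                      # prior probabilities, summing to 1
    return f * pi / np.sum(f * pi)               # Bayes' theorem
```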
Linear Discriminant Analysis
• Recall:
$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\,\pi_\ell}$$
• Many techniques are based on models for the class densities:
• linear discriminant analysis (also quadratic discriminant analysis)
• mixtures of Gaussians
• nonparametric density estimates
• Naive Bayes models
Linear Discriminant Analysis
• Suppose that we model each class density as multivariate Gaussian:
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\!\Big( -\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \Big)$$
• Linear Discriminant Analysis (LDA) arises in the special case when the classes have a common covariance matrix $\Sigma_k = \Sigma$ for all $k$
Linear Discriminant Analysis
• Comparing two classes $k$ and $\ell$, it is sufficient to look at the log‐ratio:
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = \ell \mid X = x)} = \log \frac{f_k(x)}{f_\ell(x)} + \log \frac{\pi_k}{\pi_\ell}$$
$$= \log \frac{\pi_k}{\pi_\ell} - \frac{1}{2}(\mu_k + \mu_\ell)^T \Sigma^{-1} (\mu_k - \mu_\ell) + x^T \Sigma^{-1} (\mu_k - \mu_\ell)$$
(Note that the quadratic term $x^T \Sigma^{-1} x$ can be canceled out, as can the normalization factors, since $\Sigma_k = \Sigma_\ell = \Sigma$)
• One important outcome is that the log‐ratio is an equation linear in $x$
Linear Discriminant Analysis
• The linear log‐odds function implies that the decision boundary between classes $k$ and $\ell$, the set where $\Pr(G = k \mid X = x) = \Pr(G = \ell \mid X = x)$, is linear in $x$; in $p$ dimensions, a hyperplane
• This is true for any pair of classes, so all decision boundaries are linear
• The boundaries divide $\mathbb{R}^p$ into regions classified as class 1, class 2, etc., with the regions separated by hyperplanes
Linear Discriminant Analysis
• Regions separated by hyperplanes
• [Figure: an idealized example with three Gaussian classes in two dimensions; the decision boundaries are lines.]
Linear Discriminant Analysis
• From the previous equation comparing two classes, without loss of generality we can select one of the classes as the reference class
• Then, the linear discriminant function for class $k$ is:
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
• This gives an equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$
Linear Discriminant Analysis
• In practice we do not know the parameters of the Gaussian distributions
• We estimate them using training data (a NumPy sketch follows):
• $\hat{\pi}_k = N_k / N$, where $N_k$ is the number of class‐$k$ observations
• $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$
• $\hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$
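Putting the plug-in estimates and the discriminant function $\delta_k(x)$ together, here is a minimal NumPy sketch of LDA (the class name LDASketch and the variable names are illustrative; this is a teaching sketch, not a production implementation):

```python
import numpy as np

class LDASketch:
    """LDA via the plug-in estimates and linear discriminant scores above."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        N, K = len(y), len(self.classes)
        self.pi = np.array([np.mean(y == k) for k in self.classes])        # N_k / N
        self.mu = np.array([X[y == k].mean(axis=0) for k in self.classes])
        # Pooled covariance with the (N - K) denominator from the slide
        S = sum((X[y == k] - m).T @ (X[y == k] - m)
                for k, m in zip(self.classes, self.mu))
        self.Sigma_inv = np.linalg.inv(S / (N - K))
        return self

    def decision_function(self, X):
        # delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k
        return (X @ self.Sigma_inv @ self.mu.T
                - 0.5 * np.sum(self.mu @ self.Sigma_inv * self.mu, axis=1)
                + np.log(self.pi))

    def predict(self, X):
        # Classify to the class with the largest discriminant value
        return self.classes[np.argmax(self.decision_function(X), axis=1)]
```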
Linear Discriminant Analysis
Example
• [Figure: an example of LDA applied to data, showing the linear decision boundaries.]
Quadratic Discriminant Analysis
• If the $\Sigma_k$ are not assumed to be equal, the convenient cancellations do not occur
• The pieces quadratic in $x$ remain
• Quadratic discriminant functions (QDA):
$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$
• The decision boundary between each pair of classes $k$ and $\ell$ is described by a quadratic equation $\{x : \delta_k(x) = \delta_\ell(x)\}$ (a code sketch of the scores follows)
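By analogy with the LDA sketch earlier, the quadratic discriminant scores can be computed directly from the formula above; a minimal sketch (qda_scores and its arguments are illustrative):

```python
import numpy as np

def qda_scores(X, mus, Sigmas, pis):
    """delta_k(x) for QDA, one column per class; classify by argmax over columns."""
    scores = []
    for mu, Sigma, pi in zip(mus, Sigmas, pis):
        diff = X - mu                                      # (N, p)
        Sinv = np.linalg.inv(Sigma)
        quad = np.einsum('ij,jk,ik->i', diff, Sinv, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
        _, logdet = np.linalg.slogdet(Sigma)               # log |Sigma_k|
        scores.append(-0.5 * logdet - 0.5 * quad + np.log(pi))
    return np.stack(scores, axis=1)
```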
Quadratic Discriminant Analysis
Example
• [Figure: an example where the three classes are Gaussian mixtures, and the decision boundaries are approximated by quadratic equations in $x$.]
Linear / Quadratic Discriminant Analysis
• The estimates for QDA are similar to those for LDA, except that separate covariance matrices must be estimated for each class
• When $p$ is large, this means a dramatic increase in the number of parameters
• For LDA:
• there are $(K - 1)(p + 1)$ parameters, since we only need the differences $\delta_k(x) - \delta_K(x)$ between the discriminant functions, where $K$ is some pre‐chosen class (here we have chosen the last)
• each difference requires $p + 1$ parameters
Linear / Quadratic Discriminant Analysis
• For QDA, there are $(K - 1)\{p(p + 3)/2 + 1\}$ parameters (see the sketch below for a quick check)
• In the STATLOG project, LDA was among the top three classifiers for 7 of the 22 datasets, QDA was among the top three for four datasets, and one of the pair was in the top three for 10 datasets
• Both techniques are widely used
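As a quick sanity check on these parameter counts, a small sketch (the function names are illustrative):

```python
def lda_params(K, p):
    # (K - 1) discriminant differences, each with an intercept plus p slopes
    return (K - 1) * (p + 1)

def qda_params(K, p):
    # Each difference also carries a symmetric quadratic form:
    # p(p + 1)/2 quadratic terms + p linear terms + 1 intercept = p(p + 3)/2 + 1
    return (K - 1) * (p * (p + 3) // 2 + 1)

# For example, with K = 11 classes and p = 10 features:
# lda_params(11, 10) -> 110, while qda_params(11, 10) -> 660
```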
Regularized Discriminant Analysis
• A compromise between LDA and QDA: it allows one to shrink the separate covariances of QDA toward a common covariance, as in LDA
• The regularized covariance matrices have the form:
$$\hat{\Sigma}_k(\alpha) = \alpha\, \hat{\Sigma}_k + (1 - \alpha)\, \hat{\Sigma}, \qquad \alpha \in [0, 1]$$
where $\hat{\Sigma}$ is the pooled covariance matrix as used in LDA
• $\alpha$ allows a continuum of models between LDA and QDA, and needs to be specified
• $\alpha$ can be chosen based on the performance of the model on validation data, or by cross‐validation (a one‐line sketch follows)
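The shrinkage itself is a one-line convex combination; a minimal sketch (the names are illustrative):

```python
def rda_covariances(Sigma_hats, Sigma_pooled, alpha):
    """Shrink the per-class QDA covariances toward the pooled LDA covariance.
    alpha = 1 recovers QDA's separate covariances; alpha = 0 recovers LDA."""
    return [alpha * S_k + (1 - alpha) * Sigma_pooled for S_k in Sigma_hats]
```

In practice the resulting covariances would be plugged into the QDA discriminant functions, with $\alpha$ tuned on validation data or by cross-validation as the slide notes.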
Regularized Discriminant Analysis
• The results of RDA applied to the vowel data
• Both training and test error improve with increasing $\alpha$
• Test error increases sharply after $\alpha = 0.9$
• The large discrepancy between the training and test error is partly due to:
• many repeat measurements on a small number of individuals, different in the training and test sets
Logistic Regression
• We wish to model the posterior probabilities of the $K$ classes via linear functions in $x$ (a $p$‐dimensional vector)
• while ensuring that they sum to one and remain in $[0, 1]$
• The model:
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x, \qquad k = 1, \ldots, K - 1$$
Logistic Regression
• The model is specified in terms of $K - 1$ log‐odds or logit transformations
• The choice of denominator is arbitrary: the estimates are equivariant under this choice
$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{\ell=1}^{K-1} \exp(\beta_{\ell 0} + \beta_\ell^T x)}, \qquad k = 1, \ldots, K - 1$$
$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{\ell=1}^{K-1} \exp(\beta_{\ell 0} + \beta_\ell^T x)}$$
• These probabilities sum to 1 (a direct NumPy transcription follows)
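These two formulas translate directly into code; a minimal sketch (the function name multinomial_probs is illustrative, and a numerically careful version would guard the exponentials with the log-sum-exp trick):

```python
import numpy as np

def multinomial_probs(X, B0, B):
    """Class probabilities under the K-1 logit parameterization.
    B0: (K-1,) intercepts; B: (K-1, p) coefficients; class K is the reference."""
    logits = X @ B.T + B0                                    # (N, K-1)
    expl = np.exp(logits)
    denom = 1.0 + expl.sum(axis=1, keepdims=True)
    return np.hstack([expl, np.ones((len(X), 1))]) / denom   # rows sum to 1
```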
Logistic Regression
Two‐class Classification
• For two‐class classification, we can label the two classes as 0 and 1.
• Treating class 1 as the concept of interest, the posterior probability can be regarded as the class membership probability:
$$\Pr(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}$$
• As a result, this logistic function maps $x$ in $p$‐dimensional space to a value in $[0, 1]$
Logistic Regression
Two‐Class Cases and Shape of the Sigmoid Curve
• Consider a 1‐dimensional $x$:
$$\Pr(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$$
• [Figure: the sigmoid shape of $\Pr(G = 1 \mid X = x)$ plotted against $x$.]
Logistic Regression
An Example in One Dimension
• We wish to predict death from the baseline APACHE II score of patients.
• Let $\pi(x) = \Pr(\text{death} \mid x)$ be the probability that a patient with score $x$ will die.
• Note that linear regression would not work well here, since it could produce probabilities less than 0 or greater than 1
Logistic Regression
An Example in One Dimension
• Data that has a sharp survival cut‐off point between patients who live or die will lead to a large value of $\beta_1$ (a steep sigmoid)
Logistic Regression
An Example in One Dimension
• On the other hand, if the data has a lengthy transition from survival to death, it will lead to a low value of $\beta_1$ (a shallow sigmoid)
Logistic Regression
Model Fitting for General Cases (K classes, p dimensions)
• Logistic regression models are fit by maximum likelihood, using the conditional likelihood of $G$ given $X$
• $\Pr(G \mid X)$ completely specifies the conditional distribution, so the multinomial distribution is appropriate
Logistic Regression
Model Fitting for General Cases (K classes, p dimensions)
• Let the entire parameter set be $\theta = \{\beta_{10}, \beta_1^T, \ldots, \beta_{(K-1)0}, \beta_{K-1}^T\}$, and write $\Pr(G = k \mid X = x) = p_k(x; \theta)$
• The log‐likelihood for $N$ observations of input data and class labels is:
$$\ell(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta)$$
where $p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta)$
Logistic Regression
Model Fitting for Two‐class Cases
• It is convenient to code the two classes via a 0/1 response $y_i$, where $y_i = 1$ when $g_i = 1$, and $y_i = 0$ when $g_i = 2$
• Let $p_1(x; \theta) = p(x; \beta)$, and $p_2(x; \theta) = 1 - p(x; \beta)$
• The log‐likelihood (coded in the sketch below):
$$\ell(\beta) = \sum_{i=1}^{N} \big\{\, y_i \log p(x_i; \beta) + (1 - y_i) \log\big(1 - p(x_i; \beta)\big) \,\big\}$$
$$= \sum_{i=1}^{N} \big\{\, y_i\, \beta^T x_i - \log\big(1 + e^{\beta^T x_i}\big) \,\big\}$$
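The second form of the log-likelihood maps directly to code; a minimal sketch (the name neg_log_likelihood is illustrative, and np.logaddexp computes $\log(1 + e^z)$ stably):

```python
import numpy as np

def neg_log_likelihood(beta, X, y):
    """Negative two-class logistic log-likelihood; X includes a leading 1s column."""
    z = X @ beta
    # sum_i { y_i * beta^T x_i - log(1 + exp(beta^T x_i)) }, with the sign flipped
    return -(y @ z - np.logaddexp(0.0, z).sum())
```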
Logistic Regression
Model Fitting
• Here $\beta = \{\beta_{10}, \beta_1\}$
• Assume the vector of inputs $x_i$ includes the constant term 1 to accommodate the intercept
• To maximize the log‐likelihood, set its derivatives to zero
• Score equations:
$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \big( y_i - p(x_i; \beta) \big) = 0$$
This involves $p + 1$ equations nonlinear in $\beta$
Logistic Regression
Newton's Method for Optimization
• Let's consider a function $f$ of one scalar variable $x$. The second‐order Taylor expansion around $x_0$:
$$f(x_0 + \Delta) \approx f(x_0) + \Delta f'(x_0) + \frac{1}{2}\Delta^2 f''(x_0)$$
• We want to find the global minimum $x^*$
• Near the minimum we could make a Taylor expansion, using $f'(x^*) = 0$:
$$f(x) \approx f(x^*) + \frac{1}{2}(x - x^*)^2 f''(x^*)$$
• Newton's method uses this fact, and minimizes a quadratic approximation to the function.
Logistic Regression
Newton's Method for Optimization
• Guess an initial point $x_0$. We can take a second‐order Taylor expansion around $x_0$ and it will still be accurate:
$$f(x) \approx f(x_0) + (x - x_0) f'(x_0) + \frac{1}{2}(x - x_0)^2 f''(x_0)$$
• Take the derivative with respect to $x$ and set it equal to 0:
$$f'(x_0) + (x - x_0) f''(x_0) = 0$$
$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)}$$
Logistic Regression
Newton's Method for Optimization
• We just take the derivative with respect to $x$ and set it equal to zero, at a point we will call $x_1$
• We can iterate this procedure, minimizing one approximation and then using that to get a new approximation (a short sketch follows):
$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$
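A minimal sketch of this scalar Newton iteration (the function and the worked example are illustrative, not from the slides):

```python
def newton_1d(fprime, fsecond, x0, tol=1e-10, max_iter=100):
    """Iterate x_{n+1} = x_n - f'(x_n) / f''(x_n) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: minimize f(x) = x^4 - 3x, so f'(x) = 4x^3 - 3 and f''(x) = 12x^2
x_min = newton_1d(lambda x: 4 * x**3 - 3, lambda x: 12 * x**2, x0=1.0)
# converges to (3/4)^(1/3), approximately 0.9086
```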
Logistic Regression
Model Fitting
• Returning to the model fitting for the two‐class case, the log‐likelihood is:
$$\ell(\beta) = \sum_{i=1}^{N} \big\{\, y_i\, \beta^T x_i - \log\big(1 + e^{\beta^T x_i}\big) \,\big\}$$
• Starting with $\beta^{\text{old}}$, a single Newton update is:
$$\beta^{\text{new}} = \beta^{\text{old}} - \left( \frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^T} \right)^{-1} \frac{\partial \ell(\beta)}{\partial \beta}$$
where the derivatives are evaluated at $\beta^{\text{old}}$
Logistic Regression
Model Fitting
• Let
• $\mathbf{y}$ denote the vector of $y_i$ values
• $\mathbf{X}$ the $N \times (p+1)$ matrix of $x_i$ values
• $\mathbf{p}$ the vector of fitted probabilities, with $i$‐th element $p(x_i; \beta^{\text{old}})$
• $\mathbf{W}$ an $N \times N$ diagonal matrix of weights, with $i$‐th diagonal element $p(x_i; \beta^{\text{old}})\big(1 - p(x_i; \beta^{\text{old}})\big)$
Logistic Regression
Model Fitting
• In matrix notation:
$$\frac{\partial \ell(\beta)}{\partial \beta} = \mathbf{X}^T (\mathbf{y} - \mathbf{p})$$
$$\frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^T} = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$
• Newton step (implemented in the sketch below):
$$\beta^{\text{new}} = \beta^{\text{old}} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p})$$
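Putting the gradient, the Hessian, and the Newton step together gives an iteratively reweighted fitting loop; a minimal NumPy sketch (fit_logistic is an illustrative name; a practical version would add a convergence check and step-size safeguards):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Two-class logistic regression by Newton's method.
    X is N x (p+1) with a leading column of 1s; y is a 0/1 response vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities p(x_i; beta)
        W = p * (1.0 - p)                          # diagonal of the weight matrix
        grad = X.T @ (y - p)                       # X^T (y - p)
        hess = X.T @ (X * W[:, None])              # X^T W X
        beta = beta + np.linalg.solve(hess, grad)  # the Newton step above
    return beta
```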
Logistic Regression
Example
• A subset of the Coronary Risk‐Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa
• Aim: establish the intensity of ischemic heart disease risk factors in that high‐incidence region
• The response variable is the presence or absence of myocardial infarction (MI) at the time of the survey
• There are 160 cases in the data set, and a sample of 302 controls
Logistic Regression
Example
• [Figure: the South African heart disease data.]
Logistic Regression
Example
• Fit a logistic‐regression model by maximum likelihood, giving the results shown on the next slide
• z scores for each coefficient in the model (coefficients divided by their standard errors)
Logistic Regression
Example
• Results from a logistic regression fit to the South African heart disease data:

Term          Coefficient   Std. Error   Z Score
(Intercept)   -4.130        0.964        -4.285
sbp            0.006        0.006         1.023
tobacco        0.080        0.026         3.034
ldl            0.185        0.057         3.219
famhist        0.939        0.225         4.178
obesity       -0.035        0.029        -1.187
alcohol        0.001        0.004         0.136
age            0.043        0.010         4.184
Logistic Regression
Example
• z scores greater than approximately 2 in absolute value are significant at the 5% level
• There are some surprises in the table of coefficients:
• sbp and obesity appear to be not significant
• On their own, both sbp and obesity are significant, with positive sign
• In the presence of many other correlated variables, they are no longer needed (and can even get a negative sign)
L1 Regularized Logistic Regression
• The $L_1$ penalty used in the lasso can be used for variable selection and shrinkage with any linear regression model.
• For logistic regression, we would maximize a penalized version of the log‐likelihood:
$$\max_{\beta_0, \beta} \left\{ \sum_{i=1}^{N} \Big[ y_i (\beta_0 + \beta^T x_i) - \log\big(1 + e^{\beta_0 + \beta^T x_i}\big) \Big] - \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$
• As with the lasso, we typically do not penalize the intercept term, and we standardize the predictors (an illustration follows)
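As an illustration, scikit-learn's LogisticRegression supports an L1 penalty; a hedged sketch on synthetic data (the data-generating choices are arbitrary, and scikit-learn's C plays the role of $1/\lambda$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # toy predictors; only the first two matter below
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

Xs = StandardScaler().fit_transform(X)  # standardize the predictors
# Smaller C means stronger L1 shrinkage; the saga solver leaves the intercept unpenalized
model = LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000).fit(Xs, y)
print(model.coef_)  # expect several coefficients driven exactly to zero
```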
Logistic Regression or LDA?
• For linear discriminant analysis, the log‐posterior odds between class $k$ and class $K$ are linear functions of $x$:
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \log \frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^T x$$
Logistic Regression or LDA?
• The linear logistic regression model by construction has linear logits:
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$$
• It seems that the linear discriminant analysis model and the linear logistic model are the same
• Although they have exactly the same form, different linear coefficients are estimated
• Logistic regression is more general, because it makes fewer assumptions
Logistic Regression or LDA?
• Joint density of $X$ and $G$:
$$\Pr(X, G = k) = \Pr(X)\,\Pr(G = k \mid X)$$
where $\Pr(X)$ denotes the marginal density of the inputs $X$
• For both LDA and logistic regression, the second term on the right has the logit‐linear form:
$$\Pr(G = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_k^T x}}{1 + \sum_{\ell=1}^{K-1} e^{\beta_{\ell 0} + \beta_\ell^T x}}$$
Logistic Regression or LDA?
• The logistic regression model leaves the marginal density of $X$ as an arbitrary density function
• It fits the parameters of $\Pr(G \mid X)$ by maximizing the conditional likelihood, the multinomial likelihood with probabilities $\Pr(G = k \mid X)$
• Although $\Pr(X)$ is ignored, we can think of this marginal density as being estimated in a fully nonparametric and unrestricted fashion,
• using the empirical distribution function, which places mass $1/N$ at each observation
Logistic Regression or LDA?
• LDA fits the parameters by maximizing the full log‐likelihood, based on the joint density:
$$\Pr(X, G = k) = \phi(X; \mu_k, \Sigma)\,\pi_k$$
where $\phi$ is the Gaussian density function
• Standard normal theory leads easily to the estimates $\hat{\mu}_k$, $\hat{\Sigma}$, and $\hat{\pi}_k$
• Here the marginal density does play a role. It is a mixture density:
$$\Pr(X) = \sum_{k=1}^{K} \pi_k\, \phi(X; \mu_k, \Sigma)$$
which also involves the parameters.
Logistic Regression or LDA?
• LDA relies on additional model assumptions, which provide:
• more information about the parameters
• the ability to estimate them more efficiently (lower variance), if in fact the true $f_k(x)$ are Gaussian
• In the worst case, ignoring this marginal part of the likelihood constitutes a loss of efficiency of about 30% asymptotically in the error rate
• With 30% more data, the conditional likelihood will do as well.
Logistic Regression or LDA?
• For LDA, observations far from the decision boundary (which are down‐weighted by logistic regression) play a role in estimating the common covariance matrix
• This is not all good news: it means that LDA is not robust to gross outliers
• The marginal likelihood can be thought of as a regularizer
• In practice, it is generally felt that logistic regression is a safer, more robust bet than LDA, relying on fewer assumptions.