CLASSIFICATION DISCRIMINATION LECTURE 15

Upload: laureen-rogers

Post on 19-Jan-2016


Page 1

CLASSIFICATION / DISCRIMINATION

LECTURE 15

Page 2

What is Discrimination or Classification?

• Consider an example where we have two populations P1 and P2, distributed as N(μ1, Σ1) and N(μ2, Σ2) respectively.

• A new observation x is obtained and is known to come from one of these two populations.

• The task of a discriminant function is to determine a "rule" for deciding which of the two populations x most likely came from.

• How we come up with a rule is what we need to study.

Page 3

Supervised Learning

• In computer science this is known as SUPERVISED learning.

• Essentially, we know the class labels ahead of time.

• What we need to do is find a RULE, using features in the data, that DISCRIMINATES effectively between the classes.

• Then, given a new observation with its features, we can correctly classify it.

Page 4

Example 1

• Suppose you are a doctor considering two different anesthetics for a patient.

• You have some information about the patient: gender, age, and some medical history variables.

• So what we need is a data set with patient information and whether or not each anesthetic was SAFE for that patient.

• Then, USING the available variables, build a MODEL or RULE that says whether anesthetic A or B is better for the patient.

• Then use this rule to decide whether to give the new patient A or B.

Page 5

Example 2: Turkey Thief

• There was a legal case in Kansas where a turkey farmer accused his neighbor of stealing turkeys from the farm.

• When the neighbor was arrested and the police looked in his freezer, there were multiple frozen turkeys there.

• The accused claimed these were WILD turkeys that he had caught.

• A statistician was called in to give evidence, as there are some biological differences between domestic and wild turkeys.

• So a biologist measured the bones and other body characteristics of domestic and wild turkeys, and the statistician built a DISCRIMINANT function.

• They used the classification function to see whether the turkeys in the freezer fell into the WILD or the DOMESTIC class.

• THEY ALL fell in the DOMESTIC classification!

Page 6

The Idea

• USING knowledge of the classes we build the FUNCTION.

• We want to minimize misclassification error.

• Question: should we use ALL the data to build the MODEL? If we do, we have no good way to estimate the misclassification probabilities.

• Generally: Training set and Testing sets are used.

Page 7

Some common Statistical Rules

• Suppose we want to classify between two multivariate normal populations: P1 with parameters μ1 and Σ1, and P2 with parameters μ2 and Σ2.

• Suppose a new observation vector x is known to come from P1 or P2.

• There are various statistical rules that allow us to PREDICT which population x most likely came from.

Page 8

1. Likelihood Rule

Choose P1 if L(x; μ1, Σ1) > L(x; μ2, Σ2); else choose P2.

Here, x is the observation vector.

This is a mathematical rule and reasonable under the assumption of normality.
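As a concrete illustration, the likelihood rule can be sketched in a few lines of Python with numpy. The means and covariance below are made-up values for the example, not taken from the lecture:

```python
import numpy as np

def mvn_log_likelihood(x, mu, Sigma):
    """Log-density of a multivariate normal N(mu, Sigma) at x."""
    d = x - mu
    p = len(mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

def likelihood_rule(x, mu1, Sigma1, mu2, Sigma2):
    """Choose population 1 if L(x; mu1, Sigma1) > L(x; mu2, Sigma2), else 2."""
    return 1 if (mvn_log_likelihood(x, mu1, Sigma1)
                 > mvn_log_likelihood(x, mu2, Sigma2)) else 2

# Illustrative (assumed) parameters: two well-separated populations
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.eye(2)
print(likelihood_rule(np.array([0.4, -0.1]), mu1, Sigma, mu2, Sigma))  # -> 1
print(likelihood_rule(np.array([2.9, 3.2]), mu1, Sigma, mu2, Sigma))  # -> 2
```

Comparing log-likelihoods rather than likelihoods avoids numerical underflow and gives the same decision.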

Page 9

2. Linear Discriminant Function (LDA) rule:

Choose P1 if b’x – k > 0 and P2 otherwise.

Here b = Σ⁻¹(μ1 - μ2) and k = ½(μ1 - μ2)′Σ⁻¹(μ1 + μ2). The function b′x is called the linear discriminant function. This assumes equal covariance matrices Σ1 = Σ2 = Σ.

It’s a single linear function of x that summarizes all the information in x.
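The LDA rule above can be sketched directly (Python with numpy; the means and covariance are illustrative assumptions, not from the lecture):

```python
import numpy as np

def lda_rule(x, mu1, mu2, Sigma):
    """Choose P1 if b'x - k > 0, with b = Sigma^{-1}(mu1 - mu2)
    and k = 0.5*(mu1 - mu2)' Sigma^{-1} (mu1 + mu2); else P2."""
    b = np.linalg.solve(Sigma, mu1 - mu2)
    k = 0.5 * b @ (mu1 + mu2)
    return 1 if b @ x - k > 0 else 2

# Illustrative (assumed) parameters
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(lda_rule(np.array([0.5, 0.5]), mu1, mu2, Sigma))  # -> 1
print(lda_rule(np.array([3.0, 3.0]), mu1, mu2, Sigma))  # -> 2
```

Note that `np.linalg.solve` is used instead of explicitly inverting Σ, which is the numerically preferable way to compute Σ⁻¹(μ1 - μ2).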

Page 10

3. Mahalanobis Distance Rule

Choose P1 if d1 < d2

where di = (x - μi)′Σ⁻¹(x - μi) for i = 1, 2.

The function di is a measure of how far x is from μi, taking the variance-covariance structure into account.

This assumes equal covariance matrices Σ1 = Σ2 = Σ. The likelihood criterion under normality and equal covariance is equivalent to this rule.
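A minimal Python sketch of the Mahalanobis distance rule (numpy; parameters are illustrative assumptions):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

def mahalanobis_rule(x, mu1, mu2, Sigma):
    """Choose P1 if d1 < d2, else P2 (equal covariance assumed)."""
    d1 = mahalanobis_sq(x, mu1, Sigma)
    d2 = mahalanobis_sq(x, mu2, Sigma)
    return 1 if d1 < d2 else 2

# Illustrative (assumed) parameters
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.eye(2)
print(mahalanobis_rule(np.array([1.0, 1.0]), mu1, mu2, Sigma))  # -> 1
```

With Σ = I this reduces to comparing ordinary Euclidean distances; a non-identity Σ stretches the distance along correlated directions.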

Page 11

4. Posterior probability rule

Choose P1 if P(P1|x) > P(P2|x), where P(Pi|x) = exp(-di/2) / [exp(-d1/2) + exp(-d2/2)]

• Also assumes equal covariance.

• Not a true probability, as {P1|x} is not a random event: the observation belongs to either P1 or P2.

• Gives an idea of how confident we are in our effort to discriminate.
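The posterior probability rule can be sketched as follows (Python with numpy; the parameters are made up for illustration):

```python
import numpy as np

def posterior_probs(x, mu1, mu2, Sigma):
    """P(Pi|x) = exp(-di/2) / (exp(-d1/2) + exp(-d2/2)), where di is the
    squared Mahalanobis distance of x to mu_i (equal covariance assumed)."""
    def dsq(mu):
        d = x - mu
        return d @ np.linalg.solve(Sigma, d)
    w = np.exp(-0.5 * np.array([dsq(mu1), dsq(mu2)]))
    return w / w.sum()

# Illustrative (assumed) parameters
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.eye(2)
p = posterior_probs(np.array([0.5, 0.5]), mu1, mu2, Sigma)
print(p)  # first entry close to 1: x is much nearer mu1
```

The two entries always sum to 1, and the rule "choose the larger entry" agrees with the Mahalanobis distance rule.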

Page 12

Caveats

Generally μi and Σi are not known and we use sample values.

Under equal covariance all four rules are equivalent in terms of discrimination between groups.

Also, in general we have more than two populations into which to classify the observations.

Page 13

Sample Discriminant Rules

• Since we never know the parameters μ1, μ2, Σ1, Σ2, we use sample estimates (generally the estimates below) and form discriminant rules as given before:

x̄1, x̄2, S1, S2, and the pooled covariance

S_pooled = [(N1 - 1)S1 + (N2 - 1)S2] / (N1 + N2 - 2)
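A short sketch of computing the pooled covariance estimate in Python (numpy; the data here are randomly generated purely for illustration):

```python
import numpy as np

def pooled_cov(X1, X2):
    """S_pooled = ((N1-1)S1 + (N2-1)S2) / (N1 + N2 - 2), where Si is the
    usual sample covariance of the rows of Xi."""
    N1, N2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    return ((N1 - 1) * S1 + (N2 - 1) * S2) / (N1 + N2 - 2)

# Made-up example data: 20 and 30 bivariate observations
rng = np.random.default_rng(0)
X1 = rng.normal(size=(20, 2))
X2 = rng.normal(size=(30, 2))
Sp = pooled_cov(X1, X2)
print(Sp)  # a symmetric 2x2 matrix
```

Pooling both groups is what justifies the equal-covariance assumption behind the LDA and Mahalanobis rules; if the two groups pooled are the same sample, the estimate reduces to that sample's covariance.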

Page 14

Estimating Probability of Misclassification

• 1. Re-substitution Estimates:

Apply the discriminant function to the data used to develop the rule and see how well it discriminates in general.

USES the SAME data to build and to validate the model.

Page 15

Holdout Data:

Keep a part of the data out from the part used to construct the rule and use the rule on that part and see how well it performs.

The problem: if you don't have a lot of samples, it's not the most efficient use of the data for building the model.

Page 16

Cross Validation:

Remove one observation from the set, construct the rule from the remaining observations, and predict the held-out one; then repeat this for the second, the third, and so on.

Record in a summary matrix whether each data point is misclassified.

Also called jackknifing.

• Obviously, a rule that classifies correctly a higher proportion of the time is preferred.
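The leave-one-out procedure can be sketched generically. The nearest-mean classifier below is only a stand-in for whatever discriminant rule is being validated, and the data are made up (Python with numpy):

```python
import numpy as np

def nearest_mean(Xtr, ytr, x):
    """Toy classifier: assign x to the class whose training mean is closest."""
    labels = np.unique(ytr)
    dists = [np.linalg.norm(x - Xtr[ytr == c].mean(axis=0)) for c in labels]
    return labels[int(np.argmin(dists))]

def loocv_error(X, y, classify):
    """Leave-one-out cross-validation: hold out each observation in turn,
    build the rule from the rest, and count misclassifications."""
    n = len(X)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i
        if classify(X[mask], y[mask], X[i]) != y[i]:
            errors += 1
    return errors / n

# Two well-separated, made-up groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(5.0, 0.1, (5, 2))])
y = np.array([1] * 5 + [2] * 5)
print(loocv_error(X, y, nearest_mean))  # -> 0.0 for these separated groups
```

Because the rule is refit n times without the point being predicted, the resulting error estimate is far less optimistic than re-substitution on the same data.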

Page 17

The Issue for Microarrays (MA)

• Often it is known in advance WHERE the samples come from and what conditions they have been exposed to.

• In fact we are often interested in gene expression profiles to distinguish between different conditions or classes.

• In the past, schemes like voting schemes were used to look at class membership in MAs.

• MANY, MANY methods are available, but the general consensus is that a few of them have robust performance, e.g. the Linear Discriminant Function (LDA) and k-Nearest Neighbors (k-NN).

Page 18
Page 19

Cost Function and Prior Probabilities

• When there are only two populations, all four rules discussed earlier have the property that the probability of misclassifying 1 into 2 is the same as 2 into 1.

• This is NOT generally a good idea, especially in our anesthetic example. The idea is: if you are going to err, err on the side of caution.

• Hence we need to take into account the COST of misclassification.

Page 20

Some Math Details

• Define U = b′x - k from LDA.
• U = (μ1 - μ2)′Σ⁻¹x - ½(μ1 - μ2)′Σ⁻¹(μ1 + μ2)
• Under normality and equal covariance,
• if x comes from P1, U ~ N(d/2, d)
• and if x comes from P2, U ~ N(-d/2, d)

• where d = (μ1 - μ2)′Σ⁻¹(μ1 - μ2)

• And our rule for LDA is: choose P1 if U > 0 and P2 otherwise.
• To make it asymmetric you can use a rule U > u, where u is chosen so that the probability of misclassifying into one of the populations is at most a fixed number, say α.

Page 21

A General Rule

• Define the cost function C(i|j): the cost of misclassifying an observation from Pj into Pi.
• Define the prior probability pi for the ith group.
• Average cost of misclassification (two groups): p1 C(2|1) P(2|1) + p2 C(1|2) P(1|2)
• Bayes Rule: choose P1
• if p1 f(x; θ1) C(2|1) > p2 f(x; θ2) C(1|2)

• Observe that if p1 = p2 and C(2|1) = C(1|2), this reduces to the likelihood rule.

• Under normality and equal covariance it reduces to: choose P1 if d1* < d2*, where di* = ½(x - μi)′Σ⁻¹(x - μi) - log(pi · C(j|i)), j ≠ i.
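A small numeric sketch of this cost-weighted Bayes rule (Python with numpy; the univariate normal densities, priors, and costs are illustrative assumptions, not from the lecture):

```python
import numpy as np

def bayes_rule(x, f1, f2, p1, p2, c21, c12):
    """Choose P1 if p1*f1(x)*C(2|1) > p2*f2(x)*C(1|2), else P2.
    c21 = C(2|1): cost of misclassifying a P1 observation into P2;
    c12 = C(1|2): cost of misclassifying a P2 observation into P1."""
    return 1 if p1 * f1(x) * c21 > p2 * f2(x) * c12 else 2

# Two made-up univariate normal class densities
def normal_pdf(mu, s):
    return lambda x: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

f1, f2 = normal_pdf(0.0, 1.0), normal_pdf(3.0, 1.0)

# Equal priors and costs: reduces to the likelihood rule
print(bayes_rule(1.0, f1, f2, 0.5, 0.5, 1.0, 1.0))   # -> 1 (x = 1 is nearer mu1)
# A heavy cost for sending a true P2 case to P1 shifts the boundary toward P1
print(bayes_rule(1.0, f1, f2, 0.5, 0.5, 1.0, 50.0))  # -> 2
```

This mirrors the anesthetic example: inflating the cost of the dangerous error moves the decision boundary so that we err on the side of caution.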

Page 22

Probabilistic Classification Theory (PCT)

• Most classification methods can be described as special implementations of Bayes classifiers. The decision rule for classifying x into one of the classes P1,…,Pk depends upon:
– Prior information about the class frequencies p1,…,pk.
– Information about how class membership affects the gene expression profiles xi (i = 1,…,n).
– Misclassification costs C(j|i) of classifying an observation which belongs to class Pi into Pj.

• Our aim is to find a classification rule R that minimizes the expected classification cost.

Page 23
Page 24

PCT II: Bayes Rule

• Recall Cost of Misclassification is given by:

• C(j|i) = 0 if i = j
• C(j|i) = Ci if i ≠ j (generally Ci is set to 1)

• Result: the classification rule that minimizes the expected misclassification cost is the one that maximizes the posterior probability:

• R(x) = arg maxc P(C = c | x) = arg maxc P(x | C = c) pc

• This is called the Bayes Rule.

Page 25

PCT III: Prior Information

• Hence the idea is: IF we know the probability of class membership pc and the conditional probability of the data given the class, P(x|C), we can find the optimal classification rule.

• In general it is VERY difficult to KNOW the prior information about class membership.

• To find P(x|C), the likelihood of the data, we often use the normal distribution (or log-transform the gene expression so it is approximately normal). This is done on the training set.

Page 26

Steps in Discriminant Analysis in MA

• Selection of features
• Model fitting
• Model validation

Page 27

Selection of Features

Selecting a set of genes. We do not want all the genes, since using all of them tends to over-fit the data and also causes singularity.

How to select genes (gene filtering):
– Use ONLY differentially expressed genes, using an ANOVA-type model: xi = a·C(xi) + ei
– Look at multiple genes or gene groups. Do PCA on all the genes. Not very efficient.
– Partial Least Squares (PLS): finds orthogonal linear combinations that maximize Cov(Xl, y).
– Do PCA and then rank the PCs by the ratio of between-class to within-class variance.
– Other methods: projection pursuit, etc.

Most common: differential expression or PLS.

Page 28

MODEL FITTING

• Commonly used:
• LDA
• k-Nearest Neighbor

• Other related:
• DLDA (Diagonal LDA)
• RDA (Regularized DA) (there is an R package for this)
• PAM (Prediction Analysis for Microarrays) (there is an R package for this)
• FDA (Flexible DA)

Page 29

Validation

• See how well the classifiers classify the observations into the different classes.

• The most commonly used method is leave-one-out cross-validation.

• Though test data sets (holdout samples) and resubstitution are still used.

Page 30

Linear Discriminant Analysis (LDA)

• Easy, useful method.
• Has been found to be robust in MA.
• Idea:
• The main assumption is that the class densities can be written as multivariate normals.
• In R one uses lda in the MASS library.
• Hence,

– P(x | C = k) = MVN(μk, Σk)

– Maximize: P(C = k | x) = P(x | C = k) pk / Σj P(x | C = j) pj

– If the feature set is known then it is fairly straightforward; else one has to use some technique (forward, backward, or step-wise selection) for feature selection.

Page 31

K-nearest Neighbor (kNN)

• Assumption: samples with almost the same features should belong to the same class. In other words, given a set of genes (g1,…,gm) known to be important for class membership, the kNN classifier assigns an unclassified sample to the class prevalent among the k samples whose expression values for the m genes are closest to those of the sample of interest.

• Typically the profile for sample j is compared to the other profiles using Euclidean distance (however, any other distance, like Manhattan or correlation, can be useful as well).

• The aim of kNN is to estimate the posterior probability P(C(X) = j | X = x) of a profile belonging to a class directly.

• For a particular k, it estimates this probability as the relative fraction of samples belonging to class j among the k samples with the most similar profiles.

• Essentially a non-linear classifier, and the decision boundaries may be VERY irregular.

Page 32

lda example from R

Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
train <- sample(1:150, 75)
table(Iris$Sp[train])

##  c  s  v
## 27 24 24
## your answer may differ, e.g.
##  c  s  v
## 22 23 30

Page 33

Running lda

z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
predict(z, Iris[-train, ])$class

##  [1] s s s s s s s s s s s s s s s s s s s s s s s s s s c c c c c c c c c c c c
## [39] c c c c c c c c c c c v v v v c v v v v v v v v v v v c v v c v v v v v v
## Levels: c s v

Page 34

Contd…

(z1 <- update(z, . ~ . - Petal.W.))

## Call:
## lda(Sp ~ Sepal.L. + Sepal.W. + Petal.L., data = Iris, prior = c(1,
##     1, 1)/3, subset = train)
##
## Prior probabilities of groups:
##         c         s         v
## 0.3333333 0.3333333 0.3333333

Page 35

Contd…

## Group means:
##   Sepal.L. Sepal.W. Petal.L.
## c 5.955556 2.781481 4.359259
## s 5.008333 3.450000 1.429167
## v 6.637500 2.983333 5.629167
##
## Coefficients of linear discriminants:
##                 LD1         LD2
## Sepal.L.  0.9045765 -0.07677002
## Sepal.W.  0.7347963  2.58009411
## Petal.L. -3.1529282  0.37700694
##
## Proportion of trace:
##    LD1    LD2
## 0.9939 0.0061

Page 36

knn

• library(class)• > train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])• > test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])• > cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))• > knn(train, test, cl, k = 3, prob=TRUE)