
Page 1:

INF-STK 5010 Classification
Ole Christian Lingjærde
Division for Biomedical Informatics, Dept. of Computer Science, UiO

Page 2:

Classifiers

A classifier is a rule $g: \mathbf{x} \mapsto g(\mathbf{x})$ that assigns to each input vector $\mathbf{x} \in \mathbf{R}^n$ an output value $g(\mathbf{x}) \in \{1, 2, \ldots, K\}$. The input vector $\mathbf{x} = (x_1, \ldots, x_n)$ has $n$ components, each of which is called a feature or an explanatory variable.

The output value $g(\mathbf{x})$ is the class label assigned to the input $\mathbf{x}$, and it may take any one of $K$ values.

The number of classes $K$ and the number of features $n$ are problem specific. The special case $K = 2$ is called binary classification, and the class labels are then usually called 0 and 1, so that $g(\mathbf{x}) \in \{0, 1\}$.

Page 3:

Example


• Let $\mathbf{x} = (x_1, x_2) \in R^2$ and define

$$g(x_1, x_2) = \begin{cases} 0, & 2x_1 + x_2 < 2 \\ 1, & 2x_1 + x_2 \ge 2 \end{cases}$$

• This is a binary classifier with two explanatory variables, and each of the two classes constitutes a half-plane.
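As a concrete illustration (this sketch is not from the slides), the rule can be written directly in R:

g = function(x1, x2) ifelse(2*x1 + x2 < 2, 0, 1)   # vectorized over inputs
g(c(0, 1), c(0, 1))                                 # returns 0 1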

Page 4:

Example


• Let $\mathbf{x} = (x_1, x_2) \in R^2$ and define

$$g(x_1, x_2) = \begin{cases} 0, & 2e^{x_1} + x_2 < 2 \\ 1, & 2e^{x_1} + x_2 \ge 2 \end{cases}$$

• This is a binary classifier with two explanatory variables, and the border between the classes (called the decision boundary) is nonlinear.

Page 5:

Example


• Let $\mathbf{x} = (x_1, x_2) \in R^2$ and define

$$g(x_1, x_2) = \begin{cases} 1, & x_1 < -1.0 \\ 2, & x_1 \ge -1.0 \text{ and } x_2 > 2.0 \\ 3, & x_1 \ge -1.0 \text{ and } x_2 \le 2.0 \end{cases}$$

• This is a classification tree:

[Figure: the corresponding classification tree. The root node tests $x_1 < -1$: if true, $g(x_1,x_2)=1$; otherwise a second node tests $x_2 > 2$: if true, $g(x_1,x_2)=2$, else $g(x_1,x_2)=3$.]
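A minimal R sketch of this tree (not from the slides), written as nested ifelse calls that mirror the two splits:

g = function(x1, x2) {
  ifelse(x1 < -1.0, 1,              # left branch of the root split
         ifelse(x2 > 2.0, 2, 3))    # second split, on x2
}
g(c(-2.0, 0.0, 0.0), c(0.0, 3.0, 1.0))   # returns 1 2 3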

Page 6:

Types of classifiers


• Naive Bayes classifier
• Logistic regression
• Linear discriminant analysis
• Nearest neighbor classifiers
• Shrunken centroid
• Artificial neural networks
• Classification trees
• Random forest
• Support vector machines
• ... and more

Page 7:

Training and testing


• Training: All the mentioned classifiers require training on a set of input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ with corresponding known class labels $y_1, \ldots, y_N$, to learn the classification rule.

• Testing: After training, the classifier may be used to classify new input cases $\mathbf{x}^*$ with unknown class labels.

Page 8:

Training = model fitting


• Training a classifier means determining the values of all unknown parameters in the underlying model.

• Training requires a training data set, consisting of a selection of (ideally) representative cases from the population that we want to apply the classifier to later.

• Parameter values are determined to optimize the classification performance on the training data (e.g. to produce as few classification errors as possible).

• In statistical parlance, this process is called model fitting and commonly involves solving a numerical optimization problem, as illustrated by the sketch below.
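As an illustration of fitting by minimizing training errors (a sketch on synthetic data, not from the slides), consider a classifier with a single unknown threshold parameter, fitted by grid search:

set.seed(1)
x = c(rnorm(50, mean = 0), rnorm(50, mean = 2))    # one feature, two classes
y = rep(c(0, 1), each = 50)                        # known class labels
thresholds = seq(-2, 4, by = 0.01)                 # candidate parameter values
errors = sapply(thresholds, function(t) sum((x >= t) != y))
t.hat = thresholds[which.min(errors)]              # fitted parameter
c(t.hat, min(errors))                              # threshold and no. of training errors

Real classifiers have more parameters and often use smoother loss functions, but the principle is the same: choose the parameter values that perform best on the training data.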

Page 9:

Testing = predicting


• Testing a classifier means applying the fitted model to new samples to obtain class label predictions.

• Parameter values are then fixed at the levels determined during the training of the classifier.

• The collection of new samples that we apply the model to is referred to as the test data set. In practice, we may want to apply the classifier to multiple test data sets.

Page 10:

Introducing a famous classification data set: the iris data


Page 11:

The iris data

• Introduced by Ronald Fisher in 1936
• The data set consists of 50 samples from each of three species of the flower Iris:
  • Iris setosa
  • Iris versicolor
  • Iris virginica
• Four features were measured from each sample:
  • the length of sepals (in cm)
  • the length of petals (in cm)
  • the width of sepals (in cm)
  • the width of petals (in cm)


Page 12:

The iris classification problem

Discriminant analysis problem: Can we distinguish the species from each other based on the four sepal and petal measurements?

Classification problem: Can we predict which species of iris we have based on the four sepal and petal measurements?

The two problems are closely connected.


Page 13:

The iris data set


Page 14:

The iris data set


> dim(iris)
[1] 150   5
> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
> str(iris)
'data.frame': 150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 ...

Page 15:

The iris data set


> hist(iris$Sepal.Length, col="lightblue")

Page 16:

The iris data set


> plot(density(iris$Sepal.Length))

Page 17:

The iris data set


> pie(table(iris$Species))

Page 18:

The iris data set


> boxplot(Sepal.Length ~ Species, data = iris, col = c("lightblue", "pink", "lightgreen"))

Page 19:

Now let's look at some specific classifiers and apply them to the iris data set!


Page 20:

Naive Bayes classifier


• For a given input $\mathbf{x} \in R^n$ we seek the class $k \in \{1, \ldots, K\}$ that maximizes the posterior probability $P(k \mid \mathbf{x})$.

• We transform to something easier to estimate:

$$P(k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid k)\, P(k)}{P(\mathbf{x})} \qquad \text{(Bayes' theorem)}$$

Why is the right hand side easier to estimate?
• Insight 1: We can ignore the denominator $P(\mathbf{x})$ since we want to compare different $k$'s for the same $\mathbf{x}$.
• Insight 2: If we assume that all the features $x_1, \ldots, x_n$ are independent for a given class, we have (ignoring the denominator):

$$P(k \mid \mathbf{x}) \propto P(x_1 \mid k) \cdots P(x_n \mid k)\, P(k)$$

Page 21:

Naive Bayes classifier (2)

• Insight 3: All the terms on the right hand side of

$$P(k \mid \mathbf{x}) \propto P(x_1 \mid k) \cdots P(x_n \mid k)\, P(k)$$

are easily estimated. Specifically:
  • Select all training samples $(\mathbf{x}_i, y_i)$ of class $y_i = k$.
  • Estimate the density $P(\cdot \mid k)$ of the $j$'th feature from the selected training samples and then plug in the value $x_j$ to determine $P(x_j \mid k)$.
  • Estimate $P(k)$ as the proportion of training samples that are of class $k$.
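A from-scratch sketch of the three insights (not from the slides; it assumes Gaussian class-conditional densities for each numeric feature, which is also what e1071's naiveBayes assumes by default, and the new input x below is made up for illustration):

x = c(Sepal.Length = 6.1, Petal.Length = 4.7)   # a new input to classify
post = sapply(levels(iris$Species), function(k) {
  sub = iris[iris$Species == k, ]
  p1 = dnorm(x[1], mean(sub$Sepal.Length), sd(sub$Sepal.Length))  # P(x1|k)
  p2 = dnorm(x[2], mean(sub$Petal.Length), sd(sub$Petal.Length))  # P(x2|k)
  pk = nrow(sub) / nrow(iris)                                     # P(k)
  p1 * p2 * pk                  # proportional to P(k|x); P(x) ignored
})
names(which.max(post))          # the class maximizing the posterior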

Page 22:

Naive Bayes classifier (3)


library(e1071)
learn.data = data.frame(x1=iris$Sepal.Length, x2=iris$Petal.Length, y=iris$Species)
fit = naiveBayes(y ~ x1 + x2, learn.data)
y.pred = predict(fit, learn.data, type="class")
table(y.pred, learn.data$y)

y.pred       setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         44         7
  virginica       0          6        43

Page 23:

Logistic regression


• For a continuous response $y_i$ the linear regression model $y_i \sim N(\boldsymbol{\beta}'\mathbf{x}_i, \sigma^2)$ may be used to model the relation between $y_i$ and $\mathbf{x}_i$. The parameter vector $\boldsymbol{\beta}$ is found with maximum likelihood (ML).

• For a binary response $y_i \in \{0,1\}$ the above model is not meaningful, but this model may be used:

$$y_i \sim \mathrm{Bin}\!\left(1, \frac{\exp(\boldsymbol{\beta}'\mathbf{x}_i)}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_i)}\right)$$

Again, the parameters $\boldsymbol{\beta}$ may be found with ML.

Page 24:

Logistic regression (2)


Recall that if $y \sim \mathrm{Bin}(1, p)$ then the probability distribution of the variable $y$ is given by

$$P(y = 1) = p \qquad (\,P(y = 0) = 1 - p\,)$$

Thus, the model

$$y_i \sim \mathrm{Bin}\!\left(1, \frac{\exp(\boldsymbol{\beta}'\mathbf{x}_i)}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_i)}\right)$$

is equivalent to

$$P(y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\boldsymbol{\beta}'\mathbf{x}_i)}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_i)}, \qquad P(y_i = 0 \mid \mathbf{x}_i) = \frac{1}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_i)}$$

Page 25:

Logistic regression (3)


Suppose the training data is as follows:

    x_i            y_i
    (0.1, 0.3)     0
    (0.3, 0.6)     1
    (-0.4, 1.5)    0
    (-1.1, 0.1)    0

The probability of observing the responses above is

$$P(y_1 \mid \mathbf{x}_1)\, P(y_2 \mid \mathbf{x}_2)\, P(y_3 \mid \mathbf{x}_3)\, P(y_4 \mid \mathbf{x}_4)$$

which according to the logistic regression model is

$$\frac{1}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_1)} \cdot \frac{\exp(\boldsymbol{\beta}'\mathbf{x}_2)}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_2)} \cdot \frac{1}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_3)} \cdot \frac{1}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_4)}$$

Page 26:

Logistic regression (4)


The maximum likelihood estimator for the unknown parameter vector $\boldsymbol{\beta}$ is found by maximizing

$$\frac{1}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_1)} \cdot \frac{\exp(\boldsymbol{\beta}'\mathbf{x}_2)}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_2)} \cdot \frac{1}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_3)} \cdot \frac{1}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_4)}$$

with respect to $\boldsymbol{\beta}$. This is easily done with numerical optimization algorithms (not treated here).
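A minimal sketch of that optimization with optim() on the four-point toy data above (not from the slides). Note that with only four points the classes happen to be linearly separable, so the likelihood has no finite maximizer and optim() simply stops at a large value of beta; on realistic data the maximum is well defined:

X = rbind(c(0.1, 0.3), c(0.3, 0.6), c(-0.4, 1.5), c(-1.1, 0.1))
y = c(0, 1, 0, 0)
negloglik = function(beta) {
  eta = X %*% beta                     # linear predictor beta'x_i
  -sum(y * eta - log(1 + exp(eta)))    # negative Bernoulli log-likelihood
}
fit = optim(c(0, 0), negloglik)        # numerical minimization
fit$par                                # ML estimate of beta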

Page 27:

Logistic regression (5)

learn.data = data.frame(x1=iris$Sepal.Length, x2=iris$Petal.Length, y=iris$Species)
learn.data = learn.data[learn.data$y != "virginica",]
fit = glm(y ~ x1 + x2, family=binomial, learn.data)
y.pred = predict(fit, learn.data, type="response")
y.pred = ifelse(y.pred < 0.5, 0, 1)
table(y.pred, as.numeric(learn.data$y))

y.pred  1  2
     0 50  0
     1  0 50


Page 28:

Linear discriminant analysis

• In the simplest case we have $\mathbf{x} \in R^n$ and $y \in \{0,1\}$, and the idea is to find a hyperplane

$$\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n = \beta_0$$

that separates the cases of $\mathbf{x}$ with $y = 0$ well from the cases of $\mathbf{x}$ with $y = 1$, as in this example:


[Figure: two clusters of points in the plane, one labelled y=0 and one labelled y=1, separated by a straight line marked "Separating hyperplane".]
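For two classes, such a hyperplane can be computed in closed form. A sketch (not the internals of MASS::lda, and assuming equal priors) using the classical direction $\boldsymbol{\beta} = S^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)$, where $S$ is the pooled within-class covariance:

d = iris[iris$Species != "virginica", c("Sepal.Length", "Petal.Length", "Species")]
X0 = as.matrix(d[d$Species == "setosa", 1:2])
X1 = as.matrix(d[d$Species == "versicolor", 1:2])
S = ((nrow(X0)-1)*cov(X0) + (nrow(X1)-1)*cov(X1)) / (nrow(X0)+nrow(X1)-2)
beta = solve(S, colMeans(X1) - colMeans(X0))           # normal vector of hyperplane
beta0 = sum(beta * (colMeans(X0) + colMeans(X1)) / 2)  # threshold at the midpoint
pred = ifelse(as.matrix(d[, 1:2]) %*% beta > beta0, "versicolor", "setosa")
table(pred, d$Species)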

Page 29:

Linear discriminant analysis (2)

learn.data = data.frame(x1=iris$Sepal.Length, x2=iris$Petal.Length, y=iris$Species)
library(MASS)
fit = lda(y ~ x1 + x2, learn.data)
y.pred = predict(fit, learn.data)
table(y.pred$class, learn.data$y)

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         3
  virginica       0          2        47

Page 30:

Nearest neighbor classifier

• To classify a new sample $\mathbf{x}$, we find the $k$ nearest neighbors in the training data set and assign to $\mathbf{x}$ the class that occurs most often among these neighbors.

• Example with $k = 3$:


[Figure: example with k = 3; the black point is assigned to Class 3.]
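A from-scratch sketch of the rule itself for a single new point (not from the slides; the knn() function on the next page does the same thing for many points at once, and the query point below is made up):

knn1 = function(x.new, X, y, k = 3) {
  d = sqrt(rowSums(t(t(X) - x.new)^2))   # Euclidean distances to x.new
  votes = y[order(d)[1:k]]               # labels of the k nearest neighbours
  names(which.max(table(votes)))         # majority class among them
}
X = as.matrix(iris[, c("Sepal.Length", "Petal.Length")])
knn1(c(6.0, 4.5), X, iris$Species, k = 3)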

Page 31:

Nearest neighbor classifier (2)

tmp = data.frame(x1=iris$Sepal.Length, x2=iris$Petal.Length, y=iris$Species)
tmp = tmp[sample(nrow(tmp)),]
learn.data = tmp[1:100,]
test.data = tmp[101:150,]
library(class)
y.pred = knn(learn.data[,c(1,2)], test.data[,c(1,2)], learn.data$y, k=3)
table(y.pred, test.data$y)

y.pred       setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         16         2
  virginica       0          2        13