cz5225: modeling and simulation in biology lecture 7, microarray class classification by machine...

CZ5225: Modeling and Simulation in Biology Lecture 7, Microarray Class Classification by Machine learning Methods Prof. Chen Yu Zong Tel: 6874-6877 Email: [email protected] http://bidd.nus.edu.sg Room 07-24, level 8, S16, National University of Singapore

Upload: charlotte-moore

Post on 18-Jan-2016


TRANSCRIPT

Page 1: CZ5225: Modeling and Simulation in Biology Lecture 7, Microarray Class Classification by Machine learning Methods Prof. Chen Yu Zong Tel: 6874-6877 Email:

CZ5225: Modeling and Simulation in Biology

Lecture 7, Microarray Class Classification by Machine Learning Methods

Prof. Chen Yu Zong

Tel: 6874-6877
Email: [email protected]

http://bidd.nus.edu.sg
Room 07-24, level 8, S16,

National University of Singapore

Page 2: Machine Learning Method

Inductive learning:

Example-based learning

[Figure: examples described by a descriptor and labeled as positive examples or negative examples.]

Page 3: Machine Learning Method

Feature vectors:

A = (1, 1, 1)
B = (0, 1, 1)
C = (1, 1, 1)
D = (0, 1, 1)
E = (0, 0, 0)
F = (1, 0, 1)

[Figure labels: Descriptor, Feature vector, Positive examples, Negative examples.]

Page 4: Machine Learning Method

Feature vectors in input space:

A = (1, 1, 1)
B = (0, 1, 1)
C = (1, 1, 1)
D = (0, 1, 1)
E = (0, 0, 0)
F = (1, 0, 1)

[Figure: feature vectors A, B, E, and F plotted as points in the input space with axes X, Y, and Z.]

Page 5: Machine Learning Method

Vector A = (a1, a2, a3, …, aN)

The machine learning task becomes one of finding a border line that optimally separates the known positive and negative samples in a training set.

[Figure: Positive and Negative samples with a border line between them.]

Page 6: Classifying Cancer Patients vs. Healthy Patients from Microarray

Patient_X = (gene_1, gene_2, gene_3, …, gene_N)

N (the number of dimensions) is normally larger than 2, so we can’t visualize the data.

[Figure labels: Cancerous, Healthy.]

Page 7: Classifying Cancer Patients vs. Healthy Patients from Microarray

For simplicity, pretend that we are only looking at the expression levels of 2 genes.

[Figure: scatter plot of Cancerous and Healthy patients; x-axis: Gene_1 expression level, y-axis: Gene_2 expression level; axis ticks at -5, 0, 5; annotations: Up-regulated, Down-regulated.]

Page 8: Classifying Cancer Patients vs. Healthy Patients from Microarray

[Figure: scatter plot of Cancerous and Healthy patients; x-axis: Gene_1 expression level, y-axis: Gene_2 expression level; axis ticks at -5, 0, 5.]

Question:

How can we build a classifier for this data?

Page 9: Classifying Cancer Patients vs. Healthy Patients from Microarray

[Figure: scatter plot of Cancerous and Healthy patients; x-axis: Gene_1 expression level, y-axis: Gene_2 expression level; axis ticks at -5, 0, 5.]

Simple Classification Rule:
IF gene_1 < 0 AND gene_2 < 0 THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 THEN person = cancerous
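The two-gene rule from this slide can be written as a small function. This is only a sketch: the function name `classify_two_genes` and the patient values are made up for illustration, and returning `None` for mixed-sign patients is an assumption, since the rule on the slide does not say what to do with them.

```python
def classify_two_genes(gene_1, gene_2):
    """Apply the two-gene rule; return None when neither branch fires."""
    if gene_1 < 0 and gene_2 < 0:
        return "healthy"
    if gene_1 > 0 and gene_2 > 0:
        return "cancerous"
    return None  # assumption: the rule is silent about mixed-sign patients


# Hypothetical expression levels (gene_1, gene_2) for three patients.
patients = {"p1": (-2.0, -3.5), "p2": (1.5, 4.0), "p3": (-1.0, 2.0)}
labels = {name: classify_two_genes(*levels) for name, levels in patients.items()}
print(labels)
```

The third patient shows the weakness of a rule list: any point outside the two listed quadrants is left unclassified.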

Page 10: Classifying Cancer Patients vs. Healthy Patients from Microarray

Simple Classification Rule:
IF gene_1 < 0 AND gene_2 < 0 AND … gene_5000 < Y
THEN person = healthy

IF gene_1 > 0 AND gene_2 > 0 AND … gene_5000 > W
THEN person = cancerous

If we move away from our simple example with 2 genes to a realistic case with, say, 5000 genes, then

1. What will these rules look like?

2. How will we find them?

Gets a little complicated, unwieldy…

Page 11: Classifying Cancer Patients vs. Healthy Patients from Microarray

[Figure: scatter plot of Cancerous and Healthy patients; x-axis: Gene_1 expression level, y-axis: Gene_2 expression level; axis ticks at -5, 0, 5.]

Reformulate the previous rule

SIMPLE RULE:

• If a data point lies to the ‘left’ of the line, then ‘healthy’.

• If a data point lies to the ‘right’ of the line, then ‘cancerous’.

It is easier to generalize this line to 5000 genes than it is to generalize a list of rules. It is also easier to solve mathematically.

Page 12: Extension to More Than 2 Genes (dimensions)

[Figure: plot of Cancerous and Healthy patients; axis ticks at -5, 0, 5.]

• Line in 2D: x1C1 + x2C2 = T

• If we had 3 genes, and needed to build a ‘line’ in 3-dimensional space, then we would be seeking a plane.

Plane in 3D: x1C1 + x2C2 + x3C3 = T

• If we were looking in more than 3 dimensions, the ‘plane’ is called a hyperplane. A hyperplane is simply a generalization of a plane to dimensions higher than 3.

Hyperplane in N dimensions: x1C1 + x2C2 + x3C3 + … + xNCN = T
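The same formula works in any number of dimensions, which is what makes it easier to generalize than a rule list. A minimal sketch follows; the name `hyperplane_side`, the coefficients C, and the threshold T are made-up illustrations, and which side counts as "cancerous" is a convention the slides leave to the figure.

```python
def hyperplane_side(x, c, t):
    """Return +1 if x1*C1 + ... + xN*CN > T, -1 if it is < T, 0 if on the hyperplane."""
    score = sum(xi * ci for xi, ci in zip(x, c)) - t
    return (score > 0) - (score < 0)


# Hypothetical 2D line x1 + x2 = 0 and a hypothetical 4D hyperplane.
print(hyperplane_side([3.0, 2.0], [1.0, 1.0], 0.0))
print(hyperplane_side([-4.0, -1.0], [1.0, 1.0], 0.0))
print(hyperplane_side([1.0, 0.0, 2.0, -1.0], [0.5, 1.0, -1.0, 2.0], 1.0))
```

The function body does not change when the number of genes grows from 2 to 5000; only the lengths of `x` and `c` do.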

Page 13: Classification Methods (1)

Page 14: Classification Methods (1)

Page 15: Classification Methods (2)

Page 16: Classification Methods (2)

Page 17: Classification Methods (2)

Page 18: Classification Methods (2)

Page 19: Classification Methods (3)

Page 20: Classification Methods (3)

K Nearest Neighbor Method

Page 21: Classification Methods (4)

Page 22: Classification Methods (4)

Page 23: Classification Methods (4)

Page 24: Classification Methods (5): SVM

What is SVM?
• Support vector machines (SVMs) are a machine learning method that learns from examples (statistical learning) and classifies objects into one of two classes.

Advantages of SVM:
• Diversity of class members is accommodated (“no racial discrimination”).
• Low over-fitting risk.
• Easier to find “optimal” parameters for better class differentiation performance.

Page 25: Classification Methods (5): SVM Method

Project to a higher dimensional space

[Figure: protein family members and nonmembers, shown with a border in the original space and a new border after projection.]

Page 26: Classification Methods (5): SVM Method

[Figure: protein family members and nonmembers with the new border; the support vectors are marked.]

Page 27: What is a good Decision Boundary?

[Figure: two groups of points, Class 1 and Class 2.]

• Consider a two-class, linearly separable classification problem

• Many decision boundaries!
  – The Perceptron algorithm can be used to find such a boundary
  – Different algorithms have been proposed

• Are all decision boundaries equally good?
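To make the Perceptron remark concrete, here is a minimal sketch of the classic perceptron update rule on a made-up, linearly separable 2D data set. The function name `train_perceptron`, the data, and the `max_epochs` cutoff are illustrative assumptions, not taken from the lecture.

```python
def train_perceptron(points, labels, max_epochs=1000):
    """Classic perceptron. Returns (w, b) with y_i * (w.x_i + b) > 0 for all i,
    or None if no such boundary is found within max_epochs passes."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


# Hypothetical data: Class 2 (label -1) lower-left, Class 1 (label +1) upper-right.
points = [(-2.0, -3.0), (-1.0, -2.0), (-3.0, -1.0), (2.0, 3.0), (1.0, 2.0), (3.0, 1.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_perceptron(points, labels)
print(w, b)
```

The boundary returned depends on the order of the training points, which is one way to see that many valid boundaries exist and that the perceptron does not prefer any particular one.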

Page 28: Examples of Bad Decision Boundaries

[Figure: two panels, each showing Class 1 and Class 2 with a decision boundary.]

Page 29: Large-margin Decision Boundary

[Figure: Class 1 and Class 2 separated by a decision boundary with margin m.]

• The decision boundary should be as far away from the data of both classes as possible
  – We should maximize the margin, m
  – Distance between the origin and the line w^T x = k is k/||w||
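The distance formula k/||w|| can be checked numerically. In this sketch the vector w and the offset k are made-up values, and the function uses |k| so that it also covers negative k, which the slide does not discuss.

```python
import math


def distance_origin_to_hyperplane(w, k):
    """Distance from the origin to the hyperplane w.x = k, i.e. |k| / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(k) / norm


w, k = [3.0, 4.0], 10.0  # hypothetical values; here ||w|| = 5
d = distance_origin_to_hyperplane(w, k)

# The closest point on the hyperplane lies along w: x = k * w / ||w||^2.
squared_norm = sum(wi * wi for wi in w)
closest = [k * wi / squared_norm for wi in w]
print(d, closest)
```

The closest point satisfies w.x = k and its length equals the computed distance, which is the geometric content of the formula.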

Page 30: SVM Method

[Figure: protein family members and nonmembers with the new border; the support vectors are marked.]

Page 31: SVM Method

Border line is nonlinear

Page 32: SVM Method

Non-linear transformation: use of kernel function

Page 33: SVM Method

Non-linear transformation

Page 34: Mathematical Algorithm of SVM

A hyperplane is defined by its normal vector $w$, for example $w \cdot x = 0$ or the parallel hyperplane $w \cdot x = k$.

The Euclidean distance between a point $x^*$ on $w \cdot x = k$ and the closest point $x_0^*$ on $w \cdot x = 0$:

$$\lVert x^* - x_0^* \rVert = \frac{k}{\lVert w \rVert}$$

The margin, or separation between classes: maximize the distance between the hyperplane and the closest point, subject to the classes being separated:

$$\max_{w} \min_{i} \frac{y_i (w \cdot x_i)}{\lVert w \rVert_2} \quad \text{subject to } y_i (w \cdot x_i) > 0 \text{ for all } i$$

Page 35: Mathematical Algorithm of SVM

An equivalent formulation is

$$\min_{w,b} \tfrac{1}{2} \lVert w \rVert_2^2 \quad \text{subject to } y_i (w \cdot x_i + b) \ge 1 \text{ for all } i,$$

where $b$ is an offset term for the hyperplane.

When the data are not separable:

$$\min_{w,b,\xi} \tfrac{1}{2} \lVert w \rVert_2^2 + C \sum_i \xi_i \quad \text{subject to } y_i (w \cdot x_i + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \text{ for all } i.$$

$$f(x) = w \cdot x + b, \qquad y = \operatorname{sign}(f(x))$$

[Figure: hyperplanes $w \cdot x = k$ and $w \cdot x = 0$ with the points $x^*$ and $x_0^*$ and a slack $\xi_i$.]

Page 36: Mathematical Algorithm of SVM

$$\min_{w,b,\xi} \tfrac{1}{2} \lVert w \rVert_2^2 + C \sum_i \xi_i \quad \text{subject to } y_i (w \cdot x_i + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \text{ for all } i.$$

$$\xi_i \ge 1 - y_i f(x_i)$$

$$\min_{f \in F} \sum_i \bigl(1 - y_i f(x_i)\bigr)_+ + \frac{1}{C} \lVert P f \rVert_2^2$$

The first term is the empirical error, the second term measures complexity, and $1/C$ sets the tradeoff between them.
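As a rough illustration of the empirical error versus complexity tradeoff, the sketch below evaluates the standard soft-margin objective 0.5*||w||^2 + C*sum(xi_i), with each slack taken as xi_i = max(0, 1 - y_i*(w.x_i + b)). It assumes that standard form; the function name `hinge_objective`, the data, and the values of w, b, and C are made up. It only evaluates the objective for a given (w, b) and does not perform the minimization.

```python
def hinge_objective(w, b, points, labels, C):
    """Return (objective, slacks) for 0.5*||w||^2 + C*sum(xi_i),
    where xi_i = max(0, 1 - y_i * (w.x_i + b))."""
    slacks = []
    for x, y in zip(points, labels):
        f = sum(wi * xi for wi, xi in zip(w, x)) + b
        slacks.append(max(0.0, 1.0 - y * f))
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks


# Hypothetical 2D data: the second point is inside the margin, the fourth is misclassified.
points = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (0.5, 1.0)]
labels = [1, 1, -1, -1]
objective, slacks = hinge_objective([1.0, 0.0], 0.0, points, labels, C=1.0)
print(objective, slacks)
```

A slack of zero means the point is on the correct side and outside the margin; a slack between 0 and 1 means it is inside the margin; a slack above 1 means it is misclassified. Raising C makes these violations more expensive relative to the ||w||^2 term.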

Page 37: Mathematical Algorithm of SVM

Nonlinear decision boundaries

Map data to a higher dimensional space, the feature space:

$$x \rightarrow \varphi(x), \quad \text{where } \varphi : \mathbb{R}^n \rightarrow \mathbb{R}^m$$

Construct a linear classifier in this space:

$$\min_{w,b,\xi} \tfrac{1}{2} \lVert w \rVert_2^2 + C \sum_i \xi_i \quad \text{subject to } y_i (w \cdot \varphi(x_i) + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \text{ for all } i, \quad \text{and } f(x) = w \cdot \varphi(x) + b.$$

Which can be written as

$$f(x) = \sum_i \alpha_i y_i K(x_i, x) + b, \quad \text{where } K(x, y) = \varphi(x) \cdot \varphi(y).$$
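The point of the kernel is that K(x, y) gives the dot product in feature space without computing the mapping explicitly. The sketch below checks this for one particular kernel, which is an assumption since the slides do not name one: the homogeneous degree-2 polynomial kernel on 2D inputs, with explicit map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2). The support vectors, alphas, and b passed to `decision_function` are made-up numbers, not the result of any training.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """Explicit feature map for 2D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel K(x, y) = (x.y)^2."""
    return dot(x, y) ** 2


def decision_function(x, support_vectors, labels, alphas, b, kernel=poly_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b


x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, y), dot(phi(x), phi(y)))  # the two values agree up to rounding

# Hypothetical support vectors and coefficients, for illustration only.
f = decision_function((2.0, 0.0), [(1.0, 0.0), (0.0, 1.0)], [1, -1], [0.5, 0.5], 0.0)
print(f)
```

The kernel works on the 2D inputs directly, while `phi` builds 3D vectors; for higher-degree kernels the gap between the two costs grows quickly.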

Page 38: Mathematical Algorithm of SVM

Page 39: SVM Performance Measure

Page 40: SVM Performance Measure

Page 41: SVM Performance Measure

Page 42: SVM Performance Measure

TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.

• Sensitivity (accuracy for positive samples):

$$P_+ = \frac{TP}{TP + FN}$$

• Specificity (accuracy for negative samples):

$$P_- = \frac{TN}{TN + FP}$$

• Overall prediction accuracy:

$$Q = \frac{TP + TN}{TP + TN + FP + FN}$$

• Matthews correlation coefficient:

$$C = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}}$$

Page 43: Why SVM Works?

• The feature space is often very high dimensional. Why don’t we have the curse of dimensionality?

• A classifier in a high-dimensional space has many parameters and is hard to estimate

• Vapnik argues that the fundamental problem is not the number of parameters to be estimated. Rather, the problem is about the flexibility of a classifier

• Typically, a classifier with many parameters is very flexible, but there are also exceptions
  – Let x_i = 10^i, where i ranges from 1 to n. The classifier [formula not preserved] can classify all x_i correctly for all possible combinations of class labels on x_i
  – This 1-parameter classifier is very flexible

Page 44: Why SVM Works?

• Vapnik argues that the flexibility of a classifier should not be characterized by the number of parameters, but by the flexibility (capacity) of a classifier
  – This is formalized by the “VC-dimension” of a classifier

• Consider a linear classifier in two-dimensional space

• If we have three training data points that do not lie on a single line, then no matter how those points are labeled, we can classify them perfectly

Page 45: VC-dimension

• However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect

• We can see that 3 is the critical number

• The VC-dimension of a linear classifier in a 2D space is 3 because, if we have 3 points in the training set, perfect classification is always possible irrespective of the labeling, whereas for 4 points, perfect classification can be impossible
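The three-versus-four-points argument can be tried out with a small experiment. The sketch below uses a perceptron as the linear classifier and assumes three non-collinear points plus the classic XOR arrangement for the four-point case; all names and coordinates are illustrative. Finding a separating line proves separability, whereas failing within `max_epochs` only suggests non-separability (for XOR, non-separability is a known fact).

```python
from itertools import product


def separable(points, labels, max_epochs=1000):
    """Run a perceptron; True if it finds a line with every point strictly
    on its correct side within max_epochs passes, else False."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


# Three non-collinear points: every one of the 2^3 labelings is separable.
three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
all_ok = all(separable(three, labels) for labels in product([-1, 1], repeat=3))

# Four points with the XOR labeling: no line separates the two classes.
four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
xor_ok = separable(four, [1, 1, -1, -1])
print(all_ok, xor_ok)
```

Most other labelings of the same four points are separable; it takes only one labeling that fails for the set not to be shattered.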

Page 46: VC-dimension

• The VC-dimension of the nearest neighbor classifier is infinity, because no matter how many points you have, you get perfect classification on training data

• The higher the VC-dimension, the more flexible a classifier is

• VC-dimension, however, is a theoretical concept; in practice, the VC-dimension of most classifiers is difficult to compute exactly
  – Qualitatively, if we think a classifier is flexible, it probably has a high VC-dimension
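The nearest neighbor claim is easy to verify on a toy example. This sketch assumes a 1-nearest-neighbor classifier with Euclidean distance and distinct training points; the function name and the coordinates are made up. It tries every possible labeling of five points and checks that each training point is always classified correctly.

```python
from itertools import product


def nearest_neighbor_predict(train_points, train_labels, query):
    """1-nearest-neighbor: return the label of the closest training point."""
    def squared_distance(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    best = min(range(len(train_points)),
               key=lambda i: squared_distance(train_points[i], query))
    return train_labels[best]


# Hypothetical, distinct training points.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]

# Each training point is its own nearest neighbor, so every labeling is reproduced.
perfect = all(
    all(nearest_neighbor_predict(points, labels, p) == y
        for p, y in zip(points, labels))
    for labels in product([-1, 1], repeat=len(points))
)
print(perfect)
```

Perfect accuracy on the training data says nothing about accuracy on new patients, which is why a very flexible classifier needs separate test data to be evaluated.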