CZ5225: Modeling and Simulation in Biology
Lecture 7: Microarray Class Classification by Machine Learning Methods
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://bidd.nus.edu.sg
Room 07-24, level 8, S16
National University of Singapore
Machine Learning Method

Inductive learning: example-based learning
• Descriptor
• Positive examples
• Negative examples
Machine Learning Method

Feature vectors (descriptors):
A = (1, 1, 1), B = (0, 1, 1), C = (1, 1, 1), D = (0, 1, 1), E = (0, 0, 0), F = (1, 0, 1)

[Figure: positive and negative examples, each encoded as a feature vector.]
Machine Learning Method

Feature vectors in input space:
A = (1, 1, 1), B = (0, 1, 1), C = (1, 1, 1), D = (0, 1, 1), E = (0, 0, 0), F = (1, 0, 1)

[Figure: the feature vectors plotted as points in a 3-D input space with axes X, Y, and Z.]
Machine Learning Method

Vector A = (a1, a2, a3, …, aN)

The task of machine learning is transformed into finding a borderline that optimally separates the known positive and negative samples in a training set.
Classifying Cancer Patients vs. Healthy Patients from Microarray

Patient_X = (gene_1, gene_2, gene_3, …, gene_N)

N (the number of dimensions) is normally larger than 2, so we can't visualize the data.

[Figure: cancerous and healthy samples as points in gene-expression space.]
Classifying Cancer Patients vs. Healthy Patients from Microarray

For simplicity, pretend that we are only looking at the expression levels of 2 genes.

[Figure: scatter plot of Gene_2 expression level against Gene_1 expression level, each axis running from -5 to 5; up-regulated and down-regulated regions are marked, with cancerous and healthy samples forming separate clusters.]
Classifying Cancer Patients vs. Healthy Patients from Microarray

Question: How can we build a classifier for this data?

[Figure: the same Gene_1 vs. Gene_2 scatter plot of cancerous and healthy samples.]
Classifying Cancer Patients vs. Healthy Patients from Microarray

Simple classification rule:
IF gene_1 < 0 AND gene_2 < 0 THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 THEN person = cancerous

[Figure: the Gene_1 vs. Gene_2 scatter plot with the rule's thresholds at 0 on each axis.]
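A minimal sketch of this two-gene rule in Python (the expression values in the example calls are hypothetical, for illustration only):

```python
def classify(gene_1, gene_2):
    """Two-gene threshold rule from the slide above."""
    if gene_1 < 0 and gene_2 < 0:
        return "healthy"
    if gene_1 > 0 and gene_2 > 0:
        return "cancerous"
    return "undecided"  # the rule says nothing about mixed-sign points

# Hypothetical expression levels for two patients
print(classify(-2.1, -3.4))  # healthy
print(classify(1.7, 2.9))    # cancerous
```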
Classifying Cancer Patients vs. Healthy Patients from Microarray

Simple classification rule:
IF gene_1 < 0 AND gene_2 < 0 AND … AND gene_5000 < Y THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 AND … AND gene_5000 > W THEN person = cancerous

If we move away from our simple example with 2 genes to a realistic case with, say, 5000 genes, then:
1. What will these rules look like?
2. How will we find them?
It gets a little complicated and unwieldy…
Classifying Cancer Patients vs. Healthy Patients from Microarray

Reformulate the previous rule:

SIMPLE RULE:
• If a data point lies to the 'left' of the line, then 'healthy'.
• If a data point lies to the 'right' of the line, then 'cancerous'.

It is easier to generalize this line to 5000 genes than it is a list of rules. It is also easier to solve mathematically.

[Figure: the Gene_1 vs. Gene_2 scatter plot with a separating line between the healthy and cancerous clusters.]
Extension to More Than 2 Genes (Dimensions)

• Line in 2D: x1C1 + x2C2 = T
• If we had 3 genes and needed to build a 'line' in 3-dimensional space, we would be seeking a plane.
  Plane in 3D: x1C1 + x2C2 + x3C3 = T
• If we were looking in more than 3 dimensions, the 'plane' is called a hyperplane. A hyperplane is simply a generalization of a plane to dimensions higher than 3.
  Hyperplane in N dimensions: x1C1 + x2C2 + x3C3 + … + xNCN = T

[Figure: cancerous and healthy clusters separated by a line in 2D.]
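A minimal sketch of the N-dimensional hyperplane rule (the coefficient vector C and threshold T below are made-up illustrative values; in practice they would be learned from training data):

```python
import numpy as np

def hyperplane_side(x, C, T):
    """Return which side of the hyperplane x.C = T the point x lies on."""
    return "cancerous" if np.dot(x, C) > T else "healthy"

# Hypothetical 5-gene example: C and T are illustrative, not learned values
C = np.array([0.8, -0.3, 1.2, 0.5, -0.7])
T = 0.0
patient = np.array([1.1, -0.4, 0.9, 0.2, -1.0])
print(hyperplane_side(patient, C, T))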
Classification Methods (1)

[Figure-only slides; content not captured in the transcript.]

Classification Methods (2)

[Figure-only slides; content not captured in the transcript.]

Classification Methods (3)
K Nearest Neighbor Method
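The slides for this method are figures only; as a concrete reference, here is a minimal k-nearest-neighbor sketch in Python (the toy two-gene data and the choice k = 3 are illustrative assumptions):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]               # indices of k closest
    votes = y_train[nearest]
    return int(np.sign(votes.sum()))                  # majority label (+1 / -1)

# Toy 2-gene training set: +1 = cancerous, -1 = healthy (hypothetical values)
X_train = np.array([[2.0, 3.0], [1.5, 2.2], [-2.0, -1.0], [-3.0, -2.5]])
y_train = np.array([1, 1, -1, -1])
print(knn_classify(np.array([1.8, 2.5]), X_train, y_train))  # -> 1
```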
Classification Methods (4)

[Figure-only slides; content not captured in the transcript.]
Classification Methods (5): SVM

What is SVM?
• Support vector machines: a machine learning method that learns from examples (statistical learning) to classify objects into one of two classes.

Advantages of SVM:
• Diversity of class members (no racial discrimination).
• Low over-fitting risk.
• Easier to find "optimal" parameters for better class differentiation performance.
Classification Methods (5): SVM Method

Project the data to a higher dimensional space.

[Figure: protein family members vs. nonmembers; the nonlinear border in the original space becomes a new linear border after projection.]
Classification Methods (5): SVM Method

[Figure: protein family members vs. nonmembers separated by the new border, whose position is determined by the support vectors on either side.]
What is a Good Decision Boundary?

• Consider a two-class, linearly separable classification problem.
• Many decision boundaries!
  – The perceptron algorithm can be used to find such a boundary.
  – Different algorithms have been proposed.
• Are all decision boundaries equally good?

[Figure: Class 1 and Class 2 points with several candidate decision boundaries.]
Examples of Bad Decision Boundaries

[Figure: two panels of Class 1 and Class 2 points, each with a decision boundary that passes very close to the training data.]
Large-margin Decision Boundary

• The decision boundary should be as far away from the data of both classes as possible.
  – We should maximize the margin, m.
  – The distance between the origin and the line w^T x = k is k/||w||.

[Figure: Class 1 and Class 2 separated by a decision boundary with margin m.]
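A quick numeric check of the distance claim above (the vector w and offset k are arbitrary illustrative values): the point on the line w·x = k closest to the origin is x* = k·w/||w||², and its norm equals k/||w||.

```python
import numpy as np

w = np.array([3.0, 4.0])          # arbitrary normal vector, ||w|| = 5
k = 10.0
x_star = k * w / np.dot(w, w)     # closest point on the line w.x = k
print(np.dot(w, x_star))          # 10.0 -> x_star lies on the line
print(np.linalg.norm(x_star))     # 2.0  == k / ||w||
```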
SVM Method

[Figure: protein family members vs. nonmembers separated by the new border, with the support vectors marked on each side.]
SVM Method

[Figure: a case where the border line is nonlinear.]
SVM Method

Non-linear transformation: use of a kernel function.
SVM Method

Non-linear transformation
[Figure-only slide; content not captured in the transcript.]
Mathematical Algorithm of SVM

A hyperplane is defined by its normal vector $w$, as the set of points $x$ with

$$w \cdot x = 0.$$

The Euclidean distance between the parallel hyperplanes $w \cdot x = 0$ and $w \cdot x = k$ (attained between the closest points $x_0^*$ and $x^*$ on the two hyperplanes) is

$$\frac{k}{\|w\|}.$$

Maximize the distance between the hyperplane and the closest points, subject to the classes being separated:

$$\max_{w}\,\min_{i}\; \frac{y_i\,(w \cdot x_i)}{\|w\|} \qquad \text{subject to}\quad y_i\,(w \cdot x_i) \ge 1 \;\text{ for all } i.$$

The margin, or separation between the classes, is then $2/\|w\|$.
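A one-step derivation of that margin value, added for completeness (standard, not from the slide): the constraint can be scaled so the closest points of the two classes lie on $w \cdot x = +1$ and $w \cdot x = -1$, and applying the distance formula above to each of these hyperplanes gives

$$m \;=\; \frac{1}{\|w\|} + \frac{1}{\|w\|} \;=\; \frac{2}{\|w\|}.$$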
Mathematical Algorithm of SVM

An equivalent formulation is

$$\min_{w,b}\; \tfrac{1}{2}\,\|w\|^2 \qquad \text{subject to}\quad y_i\,(w \cdot x_i + b) \ge 1 \;\text{ for all } i,$$

where $b$ is an offset term for the hyperplane.

When the data are not separable:

$$\min_{w,b,\xi}\; \tfrac{1}{2}\,\|w\|^2 + C \sum_i \xi_i \qquad \text{subject to}\quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0 \;\text{ for all } i,$$

with decision function

$$f(x) = w \cdot x + b, \qquad y = \operatorname{sign}(f(x)).$$
Mathematical Algorithm of SVM

Since the optimal slack variables satisfy $\xi_i = \bigl(1 - y_i f(x_i)\bigr)_+$, the soft-margin problem

$$\min_{w,b,\xi}\; \tfrac{1}{2}\,\|w\|^2 + C \sum_i \xi_i \qquad \text{subject to}\quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0 \;\text{ for all } i$$

can be rewritten as

$$\min_{f \in F}\; \sum_i \bigl(1 - y_i f(x_i)\bigr)_+ + \frac{1}{C}\,P(f),$$

an empirical error / complexity tradeoff: the first term is the empirical error, the second a complexity penalty.
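A minimal numpy sketch of the empirical-error term above, the hinge loss $(1 - y_i f(x_i))_+$ (the labels and classifier scores are made-up values for illustration):

```python
import numpy as np

def hinge_loss(y, f_x):
    """Sum of (1 - y_i * f(x_i))_+ over all samples."""
    return np.maximum(0.0, 1.0 - y * f_x).sum()

y   = np.array([1, -1, 1, -1])           # hypothetical labels
f_x = np.array([2.0, -0.5, 0.3, 1.2])    # hypothetical classifier scores
# Correct with margin (loss 0), inside margin (0.5, 0.7), wrong side (2.2)
print(hinge_loss(y, f_x))  # 0 + 0.5 + 0.7 + 2.2 = 3.4
```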
Mathematical Algorithm of SVM: Nonlinear Decision Boundaries

Map the data to a higher dimensional space, the feature space:

$$x \rightarrow \varphi(x), \qquad K_{nm} = \varphi(x_n) \cdot \varphi(x_m),$$

and construct a linear classifier in this space:

$$\min_{w,b,\xi}\; \tfrac{1}{2}\,\|w\|^2 + C \sum_i \xi_i \qquad \text{subject to}\quad y_i\,\bigl(w \cdot \varphi(x_i) + b\bigr) \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0 \;\text{ for all } i,$$

$$f(x) = w \cdot \varphi(x) + b.$$

This can be written as

$$f(x) = \sum_i \alpha_i\, y_i\, K(x_i, x) + b, \qquad \text{where}\quad K(x, y) = \varphi(x) \cdot \varphi(y).$$
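One concrete way to train such a kernel classifier is scikit-learn's SVC; a minimal sketch (the toy data, the RBF kernel choice, and the parameter values are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-gene data: +1 = cancerous, -1 = healthy (hypothetical values)
X = np.array([[2.0, 3.0], [1.5, 2.2], [2.5, 1.8],
              [-2.0, -1.0], [-3.0, -2.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# RBF kernel K(x, y) = exp(-gamma * ||x - y||^2); C controls the
# empirical-error / complexity tradeoff from the soft-margin objective
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print(clf.support_vectors_)        # the support vectors found
print(clf.predict([[1.8, 2.5]]))   # -> [1]
```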
Mathematical Algorithm of SVM

[Figure-only slide; content not captured in the transcript.]
SVM Performance Measure

[Figure-only slides; content not captured in the transcript.]

SVM Performance Measure
• Sensitivity $P^{+} = TP/(TP+FN)$: accuracy for positive samples.
• Specificity $P^{-} = TN/(TN+FP)$: accuracy for negative samples.
• Overall prediction accuracy:

$$Q = \frac{TP + TN}{TP + FN + TN + FP}$$

• Matthews correlation coefficient:

$$C = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP+FN)(TP+FP)(TN+FN)(TN+FP)}}$$
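A minimal sketch computing these four measures from a confusion matrix (the counts in the example call are made up for illustration):

```python
import math

def performance(tp, tn, fp, fn):
    """Sensitivity, specificity, overall accuracy Q, and Matthews CC."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    q = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return sensitivity, specificity, q, mcc

# Hypothetical confusion-matrix counts
print(performance(tp=40, tn=45, fp=5, fn=10))
```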
Why SVM Works?

• The feature space is often very high dimensional. Why don't we have the curse of dimensionality?
• A classifier in a high-dimensional space has many parameters and is hard to estimate.
• Vapnik argues that the fundamental problem is not the number of parameters to be estimated. Rather, the problem is the flexibility of a classifier.
• Typically, a classifier with many parameters is very flexible, but there are also exceptions:
  – Let x_i = 10^-i, where i ranges from 1 to n. The classifier y = sign(sin(αx)) can classify all x_i correctly for all possible combinations of class labels on the x_i (a numerical check follows this slide).
  – This 1-parameter classifier is very flexible.
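A sketch verifying this flexibility claim numerically, using the standard construction of α from Burges' SVM tutorial (an addition, not from the slides): α = π(1 + Σ_i ((1 − y_i)/2)·10^i) realizes any labeling of the points x_i = 10^-i.

```python
import itertools
import numpy as np

n = 5
x = np.array([10.0 ** -i for i in range(1, n + 1)])

for labels in itertools.product([-1, 1], repeat=n):
    y = np.array(labels)
    # Burges' construction: a single alpha realizing this labeling exactly
    alpha = np.pi * (1 + sum((1 - y[i - 1]) / 2 * 10.0 ** i
                             for i in range(1, n + 1)))
    pred = np.sign(np.sin(alpha * x))
    assert np.array_equal(pred, y), (labels, pred)

print("sign(sin(alpha*x)) shattered all", 2 ** n, "labelings")
```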
Why SVM Works?

• Vapnik argues that the flexibility of a classifier should not be characterized by the number of parameters, but by the capacity of the classifier.
  – This is formalized by the "VC-dimension" of a classifier.
• Consider a linear classifier in two-dimensional space.
• If we have three training data points, no matter how those points are labeled, we can classify them perfectly.
VC-dimension

• However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect.
• We can see that 3 is the critical number.
• The VC-dimension of a linear classifier in a 2D space is 3: if we have 3 points in the training set, perfect classification is always possible irrespective of the labeling, whereas for 4 points perfect classification can be impossible (see the sketch below).
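A sketch that checks this by brute force (scikit-learn's SVC with a large C stands in for "any linear classifier"; the point coordinates, including the XOR-style square for the 4-point case, are illustrative choices):

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def shatters(points):
    """True if a linear classifier fits every non-trivial labeling exactly."""
    for labels in itertools.product([0, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue  # SVC needs two classes; one-class labelings are trivial
        clf = SVC(kernel="linear", C=1e6).fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])          # non-collinear triple
four  = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])  # XOR-style square
print(shatters(three))  # True  -> 3 points can always be separated
print(shatters(four))   # False -> e.g. the XOR labeling fails
```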
VC-dimension

• The VC-dimension of the nearest neighbor classifier is infinity, because no matter how many points you have, you get perfect classification on the training data.
• The higher the VC-dimension, the more flexible a classifier is.
• VC-dimension, however, is a theoretical concept; in practice the VC-dimension of most classifiers is difficult to compute exactly.
  – Qualitatively, if we think a classifier is flexible, it probably has a high VC-dimension.