Applications of Supervised Learning in Bioinformatics Yen-Jen Oyang Dept. of Computer Science and Information Engineering

Upload: arlene-hubbard

Post on 18-Jan-2016


TRANSCRIPT

Page 1:

Applications of Supervised Learning in Bioinformatics

Yen-Jen Oyang

Dept. of Computer Science and Information Engineering

Page 2:

Problem Definition of Supervised Learning (or Data Classification)

In a supervised learning problem, each sample is described by a set of feature values, and each sample belongs to one of the predefined classes.

The goal is to derive a set of rules that predicts which class an incoming query sample should belong to, based on a given set of training samples. Supervised learning is also called data classification.

Page 3:

The Vector Space Model

            feature_1   feature_2   ...   feature_m
sample_1
sample_2
...
sample_n

Feature vector: $\mathbf{v} = (x_1, x_2, \ldots, x_m) \in \mathbb{R}^m$

Each sample belongs to one of the classes Class 1, Class 2, ..., Class C.

Page 4:

Application of Supervised Learning in Microarray Data Analysis

In microarray data analysis, supervised learning algorithms have been employed to predict the class of an incoming query sample based on the existing samples with known classes.

Page 5:

Application of Supervised Learning in Microarray Data Analysis

For example, in the Leukemia data set, there are 72 samples and 7129 genes:

25 Acute Myeloid Leukemia (AML) samples.

38 B-cell Acute Lymphoblastic Leukemia (B-cell ALL) samples.

9 T-cell Acute Lymphoblastic Leukemia (T-cell ALL) samples.

Page 6:

Model of the Leukemia Dataset

            gene_1   gene_2   ...   gene_7129
sample_1
sample_2
...
sample_72

Feature vector: $\mathbf{v} = (x_1, x_2, \ldots, x_{7129}) \in \mathbb{R}^{7129}$

Each sample belongs to one of the classes Class 1, Class 2, Class 3.
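The layout above can be sketched as one row per sample and one column per gene. This is only an illustration: the sizes and class counts come from the Leukemia example, the expression values are zero placeholders rather than real measurements, and the names `X`, `y`, and `class_counts` are hypothetical. The slides do not say which disease corresponds to Class 1, 2, or 3, so the labels below use the disease names directly.

```python
# Sizes taken from the Leukemia example: 72 samples, 7129 genes.
N_SAMPLES, N_GENES = 72, 7129

# Class sizes from the same example (the ordering here is arbitrary).
class_counts = {"AML": 25, "B-cell ALL": 38, "T-cell ALL": 9}

# X[i][g] would hold the expression level of gene g in sample i.
# Zeros are placeholders only; no real microarray values are used.
X = [[0.0] * N_GENES for _ in range(N_SAMPLES)]

# y[i] is the class label of sample i.
y = [label for label, count in class_counts.items() for _ in range(count)]

print(len(X), len(X[0]), len(y))
```

With far more features (7129) than samples (72), each row is a single point in a very high-dimensional feature space.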

Page 7:

Training Process

From the mathematical point of view, the task of the supervised learning algorithm in the training stage is to identify curves that separate samples with different classes.

Prediction of the class of an incoming query sample is carried out by referring to the separating curves identified during the training stage.

Page 8:

[Figure: an incoming query sample, labeled "query".]

Page 9:

The Basis of Kernel Regression

For a 1-dimensional function $f(x)$, we have

$$
f(x) = \int \delta(x - t)\, f(t)\, dt
     = \lim_{\Delta \to 0} \sum_{k} \Delta\, \delta(x - k\Delta)\, f(k\Delta)
     \approx \sum_{k} \frac{\Delta}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - k\Delta)^2}{2\sigma^2}}\, f(k\Delta),
$$

with $\Delta > 0$, $\sigma \to 0$, and …
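The idea behind this slide can be checked numerically. The sketch below assumes the standard fact that a narrow Gaussian behaves like a Dirac delta, so a function value is approximately a Gaussian-weighted sum of nearby function values. The names `gaussian_sum`, `delta`, `sigma`, and `half_width`, and the choice of `math.sin` as the test function, are illustrative and not taken from the slides.

```python
import math

def gaussian_sum(f, x, delta=0.005, sigma=0.05, half_width=10.0):
    """Approximate f(x) by a sum of Gaussians centered on a grid.

    Computes sum_k delta / (sqrt(2*pi)*sigma)
             * exp(-(x - k*delta)^2 / (2*sigma^2)) * f(k*delta)
    over grid points k*delta in [-half_width, half_width].
    """
    k_max = int(half_width / delta)
    norm = delta / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for k in range(-k_max, k_max + 1):
        t = k * delta
        total += norm * math.exp(-((x - t) ** 2) / (2.0 * sigma ** 2)) * f(t)
    return total

approx = gaussian_sum(math.sin, 1.0)
exact = math.sin(1.0)
print(approx, exact)
```

The grid spacing is kept well below the kernel width so the sum tracks the integral; shrinking `sigma` then tightens the approximation.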

Page 10:

Problem Definition of Kernel Density Estimation (KDE) with Gaussian Kernels

Given a set of samples $s_1, s_2, \ldots, s_n$ randomly taken from a probability distribution, we want to find a set of Gaussian functions $K(\mathbf{v}; \boldsymbol{\mu}_i, \sigma_i)$ and the corresponding weights $w_i$ to obtain an approximate probability density function, i.e.

$$
\hat{f}(\mathbf{v}) = \sum_i w_i\, K(\mathbf{v}; \boldsymbol{\mu}_i, \sigma_i)
                    = \sum_i w_i\, e^{-\frac{\|\mathbf{v} - \boldsymbol{\mu}_i\|^2}{2\sigma_i^2}}
                    \approx f(\mathbf{v}).
$$
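A minimal one-dimensional sketch of such an estimator follows. It makes simplifying assumptions the slides do not state: one kernel is centered on every sample, all kernels share a single width `sigma`, and all weights are equal and chosen so the estimate integrates to 1. The sample values and the names `gaussian_kernel` and `kde` are hypothetical.

```python
import math

def gaussian_kernel(v, mu, sigma):
    """Unnormalized Gaussian kernel exp(-(v - mu)^2 / (2 * sigma^2))."""
    return math.exp(-((v - mu) ** 2) / (2.0 * sigma ** 2))

def kde(samples, sigma):
    """Return an estimate f_hat(v) = sum_i w * K(v; s_i, sigma)."""
    # Equal weights that make f_hat integrate to 1 in one dimension.
    w = 1.0 / (len(samples) * math.sqrt(2.0 * math.pi) * sigma)

    def f_hat(v):
        return sum(w * gaussian_kernel(v, s, sigma) for s in samples)

    return f_hat

samples = [1.0, 1.5, 2.0, 4.0]  # hypothetical toy samples
f_hat = kde(samples, sigma=0.5)

# Riemann-sum check that the estimate behaves like a probability density.
step = 0.01
area = sum(f_hat(-5.0 + k * step) * step for k in range(1500))
print(area)
```

Allowing a different width per kernel, as the general formulation with $\sigma_i$ permits, only changes how `w` and `sigma` are chosen for each sample.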

Page 11:

The KDE Based Predictor

The KDE based learning algorithm constructs one approximate probability density function for each class of samples.

Prediction is conducted based on the following likelihood function:

$$
L_j(\mathbf{v}) = \frac{|S_j|}{|S|}\, \hat{f}_j(\mathbf{v}),
$$

where $|S_j|$ and $|S|$ are the number of training samples of class $j$ and the total number of training samples of all classes, respectively, and $\hat{f}_j(\mathbf{v})$ is the approximate probability density function of class-$j$ samples.
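One way to read this likelihood function is as a class prior multiplied by a class density, with the prediction being the class of highest likelihood. The sketch below assumes one-dimensional features, a fixed kernel width, and an argmax decision; the training values, labels, and the names `class_density` and `make_predictor` are hypothetical.

```python
import math

def class_density(samples, sigma):
    """Fixed-width Gaussian KDE for the samples of one class (1-D)."""
    norm = 1.0 / (len(samples) * math.sqrt(2.0 * math.pi) * sigma)
    return lambda v: norm * sum(
        math.exp(-((v - s) ** 2) / (2.0 * sigma ** 2)) for s in samples
    )

def make_predictor(training, sigma=0.5):
    """training maps each class label to its list of training samples."""
    total = sum(len(samples) for samples in training.values())
    densities = {label: class_density(s, sigma) for label, s in training.items()}

    def predict(v):
        # L_j(v) = |S_j| / |S| * f_hat_j(v); return the label with the largest L_j.
        likelihood = {
            label: len(training[label]) / total * densities[label](v)
            for label in training
        }
        return max(likelihood, key=likelihood.get)

    return predict

training = {"a": [0.0, 0.5, 1.0], "b": [4.0, 4.5, 5.0, 5.5]}  # toy data
predict = make_predictor(training)
print(predict(0.2), predict(4.8))
```

The factor `len(training[label]) / total` means a class with more training samples wins ties in regions where the two densities are comparable.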

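Page 12:

The Decision Function of the RVKDE Based Predictor

$$
f(\mathbf{v}) = \frac{1}{n_{+}} \sum_{i} \left(\frac{1}{\sqrt{2\pi}\,\sigma_i}\right)^{m} \exp\!\left(-\frac{\|\mathbf{v} - \mathbf{s}_i\|^2}{2\sigma_i^2}\right)
              - \frac{1}{n_{-}} \sum_{j} \left(\frac{1}{\sqrt{2\pi}\,\sigma_j}\right)^{m} \exp\!\left(-\frac{\|\mathbf{v} - \mathbf{s}_j\|^2}{2\sigma_j^2}\right),
$$

where $\mathbf{s}_i$ ranges over the positive training samples, $\mathbf{s}_j$ ranges over the negative training samples, and

(i) … and …, respectively;

(ii) $\delta(\mathbf{s}_i)$ ($\delta(\mathbf{s}_j)$) is the average distance among the positive (negative) samples in the proximity of $\mathbf{s}_i$ ($\mathbf{s}_j$), respectively;

(iii) $m$ is the dimension of the feature vector.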

Page 13:

With the KDE based predictor, each training sample is associated with a kernel function, typically with a varying width.

Page 14:

An Example of Supervised Learning (Data Classification)

Given the data set shown on the next slide, can we figure out a set of rules that predict the classes of samples?

Page 15:

Data Set

Data      Class    Data      Class    Data      Class
(15,33)   O        (18,28)   ×        (16,31)   O
(9,23)    ×        (15,35)   O        (9,32)    ×
(8,15)    ×        (17,34)   O        (11,38)   ×
(11,31)   O        (18,39)   ×        (13,34)   O
(13,37)   ×        (14,32)   O        (19,36)   ×
(18,32)   O        (25,18)   ×        (10,34)   ×
(16,38)   ×        (23,33)   ×        (15,30)   O
(12,33)   O        (21,28)   ×        (13,22)   ×

Page 16:

Distribution of the Data Set

[Figure: scatter plot of the data set, with class O samples drawn as 。 and class × samples drawn as ×. Surviving axis ticks: 10, 15, 20 and 30.]

Page 17:

Rule Based on Observation

If $(x - 15)^2 + (y - 30)^2 \le 25$ and $y \ge 30$, then class = O; else class = ×.
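The observation-based rule can be checked against the 24 samples of the data set. The sketch below assumes the rule is a half-disc of radius 5 centered at (15, 30) with non-strict inequalities, which is the reading that matches every sample in the table; the names `DATA` and `rule` are illustrative, and "X" stands in for the × class.

```python
# The 24 labeled samples from the Data Set slide ("X" stands for the × class).
DATA = [
    ((15, 33), "O"), ((18, 28), "X"), ((16, 31), "O"),
    ((9, 23), "X"), ((15, 35), "O"), ((9, 32), "X"),
    ((8, 15), "X"), ((17, 34), "O"), ((11, 38), "X"),
    ((11, 31), "O"), ((18, 39), "X"), ((13, 34), "O"),
    ((13, 37), "X"), ((14, 32), "O"), ((19, 36), "X"),
    ((18, 32), "O"), ((25, 18), "X"), ((10, 34), "X"),
    ((16, 38), "X"), ((23, 33), "X"), ((15, 30), "O"),
    ((12, 33), "O"), ((21, 28), "X"), ((13, 22), "X"),
]

def rule(x, y):
    """Class O inside the upper half-disc around (15, 30), class X elsewhere."""
    inside_circle = (x - 15) ** 2 + (y - 30) ** 2 <= 25
    return "O" if inside_circle and y >= 30 else "X"

errors = [(point, label) for point, label in DATA if rule(*point) != label]
print(len(errors))  # prints 0
```

Both conditions matter: (15,35) lies exactly on the circle, and (18,28) is inside the circle but is excluded by the `y >= 30` test.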

Page 18:

Rule Generated by a Kernel Density Estimation Based Algorithm

Let

$$
f_o(\mathbf{v}) = \sum_{i=1}^{10} \frac{1}{2\pi\sigma_{o_i}^2}\, e^{-\frac{\|\mathbf{v} - \mathbf{c}_{o_i}\|^2}{2\sigma_{o_i}^2}}
$$

and

$$
f_x(\mathbf{v}) = \sum_{j=1}^{14} \frac{1}{2\pi\sigma_{x_j}^2}\, e^{-\frac{\|\mathbf{v} - \mathbf{c}_{x_j}\|^2}{2\sigma_{x_j}^2}}.
$$

If $f_o(\mathbf{v}) \ge f_x(\mathbf{v})$, then prediction = “O”.

Otherwise prediction = “X”.

Page 19:

Class O kernels:

$\mathbf{c}_{o_i}$    $\sigma_{o_i}$
(15,33)    1.723
(11,31)    2.745
(18,32)    2.327
(12,33)    1.794
(15,35)    1.973
(17,34)    2.045
(14,32)    1.794
(16,31)    1.794
(13,34)    1.794
(15,30)    2.027

Class × kernels:

$\mathbf{c}_{x_j}$    $\sigma_{x_j}$
(9,23)     6.458
(8,15)     10.08
(13,37)    2.939
(16,38)    2.745
(18,28)    5.451
(18,39)    3.287
(25,18)    10.86
(23,33)    5.322
(21,28)    5.070
(9,32)     4.562
(11,38)    3.463
(19,36)    3.587
(10,34)    3.232
(13,22)    6.260
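The kernel centers and widths listed on this slide can be plugged into a two-class density comparison. The sketch below rests on three assumptions that the extracted text does not fully confirm: each width pairs with the center in the same list position, each kernel is a two-dimensional Gaussian normalized by 1 / (2πσ²), and ties go to class O. The names `O_KERNELS`, `X_KERNELS`, `density`, and `predict` are illustrative.

```python
import math

# (center, width) pairs, paired in the order they are listed on the slide.
O_KERNELS = [
    ((15, 33), 1.723), ((11, 31), 2.745), ((18, 32), 2.327), ((12, 33), 1.794),
    ((15, 35), 1.973), ((17, 34), 2.045), ((14, 32), 1.794), ((16, 31), 1.794),
    ((13, 34), 1.794), ((15, 30), 2.027),
]
X_KERNELS = [
    ((9, 23), 6.458), ((8, 15), 10.08), ((13, 37), 2.939), ((16, 38), 2.745),
    ((18, 28), 5.451), ((18, 39), 3.287), ((25, 18), 10.86), ((23, 33), 5.322),
    ((21, 28), 5.070), ((9, 32), 4.562), ((11, 38), 3.463), ((19, 36), 3.587),
    ((10, 34), 3.232), ((13, 22), 6.260),
]

def density(v, kernels):
    """Sum of 2-D Gaussians, each normalized by 1 / (2 * pi * sigma^2)."""
    total = 0.0
    for (cx, cy), sigma in kernels:
        d2 = (v[0] - cx) ** 2 + (v[1] - cy) ** 2
        total += math.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)
    return total

def predict(v):
    """Predict "O" when the O density is at least the X density (assumed tie rule)."""
    return "O" if density(v, O_KERNELS) >= density(v, X_KERNELS) else "X"

print(predict((15, 33)), predict((8, 15)), predict((25, 18)))
```

Isolated samples such as (8,15) and (25,18) carry the largest widths, so their kernels spread thinly over the sparse regions, while the tightly clustered O samples get narrow, tall kernels.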