Applications of Supervised Learning in Bioinformatics

Yen-Jen Oyang

Dept. of Computer Science and Information Engineering

Problem Definition of Supervised Learning (or Data Classification)

In a supervised learning problem, each sample is described by a set of feature values and belongs to one of the predefined classes.

The goal is to derive, from a given set of training samples, a set of rules that predicts which class an incoming query sample belongs to. Supervised learning is also called data classification.

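The Vector Space Model

[Figure: an n × m table with one row per sample (sample1, sample2, ..., samplen) and one column per feature (feature1, feature2, ..., featurem). Each row is a feature vector $v = (x_1, x_2, \ldots, x_m) \in \mathbb{R}^m$ labeled with one of the classes Class 1, Class 2, ..., Class C.]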

Application of Supervised Learning in Microarray Data Analysis

In microarray data analysis, supervised learning algorithms have been employed to predict the class of an incoming query sample based on the existing samples with known classes.

For example, in the Leukemia data set, there are 72 samples and 7129 genes:

25 Acute Myeloid Leukemia (AML) samples.

38 B-cell Acute Lymphoblastic Leukemia (B-cell ALL) samples.

9 T-cell Acute Lymphoblastic Leukemia (T-cell ALL) samples.

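Model of the Leukemia Dataset

[Figure: a 72 × 7129 table with one row per sample (sample1, sample2, ..., sample72) and one column per gene (gene1, gene2, ..., gene7129). Each row is a feature vector $v = (x_1, x_2, \ldots, x_{7129}) \in \mathbb{R}^{7129}$ labeled with one of the classes Class 1, Class 2, Class 3.]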

Training Process

From the mathematical point of view, the task of the supervised learning algorithm in the training stage is to identify curves that separate samples of different classes.

The class of an incoming query sample is predicted by referring to the separating curves identified during the training stage.

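The Basis of Kernel Regression

For a 1-dimension function $f(x)$, we have

\[
f(x) = \int_{-\infty}^{\infty} \delta(x - t)\, f(t)\, dt
     = \lim_{\Delta \to 0} \sum_{k} \Delta\, \delta(x - k\Delta)\, f(k\Delta)
     \approx \sum_{k} \frac{\Delta}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - k\Delta)^2}{2\sigma^2}}\, f(k\Delta),
\]

with $\Delta \to 0$, $\sigma \to 0$, and the grid spacing $\Delta$ small relative to $\sigma$.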
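The approximation of a function by a sum of narrow Gaussians, each weighted by a sampled value of the function, can be checked numerically. The sketch below is an illustration only and is not taken from the slides: the function name `gaussian_sum_approx`, the test function `math.sin`, and the values of `delta`, `sigma` and `span` are all assumptions chosen for the demonstration.

```python
import math

def gaussian_sum_approx(f, x, delta=0.01, sigma=0.05, span=8.0):
    """Approximate f(x) by Gaussians on a grid, weighted by samples of f.

    Grid points are k * delta. Only points within span * sigma of x are
    summed, because the remaining terms are negligibly small.
    """
    k_min = math.floor((x - span * sigma) / delta)
    k_max = math.ceil((x + span * sigma) / delta)
    norm = delta / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for k in range(k_min, k_max + 1):
        t = k * delta
        total += norm * math.exp(-((x - t) ** 2) / (2.0 * sigma ** 2)) * f(t)
    return total

approx = gaussian_sum_approx(math.sin, 1.0)
exact = math.sin(1.0)
print(f"approx={approx:.5f} exact={exact:.5f}")
```

Shrinking `sigma` (while keeping `delta` several times smaller than `sigma`) brings the two printed values closer together.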

Problem Definition of Kernel Density Estimation (KDE) with Gaussian Kernels

Given a set of samples $s_1, s_2, \ldots, s_n$ randomly taken from a probability distribution, we want to find a set of Gaussian functions $K(\nu; \mu_i, \sigma_i)$ and the corresponding weights $w_i$ to obtain an approximate probability density function, i.e.

\[
\hat{f}(\nu) = \sum_{i} w_i\, K(\nu; \mu_i, \sigma_i)
             = \sum_{i} w_i\, e^{-\frac{\|\nu - \mu_i\|^2}{2\sigma_i^2}}
             \approx f(\nu).
\]
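A minimal sketch of such an estimate in one dimension follows. It makes the simplest possible choices, which are assumptions and not the method of the slides: one Gaussian per sample ($\mu_i = s_i$), a single shared width `sigma`, and equal weights $w_i = 1/n$ with the Gaussian normalization constant written explicitly. The names `gaussian_kernel` and `kde` and the sample values are hypothetical.

```python
import math

def gaussian_kernel(v, mu, sigma):
    """Normalized 1-D Gaussian centered at mu with width sigma."""
    return math.exp(-((v - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def kde(v, samples, sigma=0.5):
    """Equal-weight estimate: w_i = 1/n, mu_i = s_i, sigma_i = sigma."""
    w = 1.0 / len(samples)
    return sum(w * gaussian_kernel(v, s, sigma) for s in samples)

samples = [1.0, 1.2, 0.8, 3.0, 3.1]  # made-up values for illustration

# Riemann sum over a grid wide enough to cover all kernels.
step = 0.01
grid = [i * step for i in range(-500, 1000)]
area = sum(kde(v, samples) * step for v in grid)
print(f"density at 1.0: {kde(1.0, samples):.4f}, area under estimate: {area:.4f}")
```

The area check confirms that the estimate behaves like a probability density; the density is highest near the cluster of three samples around 1.0.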

The KDE Based Predictor

The KDE based learning algorithm constructs one approximate probability density function for each class of samples.

Prediction is conducted based on the following likelihood function:

\[
L_j(v) = \frac{|S_j|}{|S|}\, \hat{f}_j(v),
\]

where $|S_j|$ and $|S|$ are the number of training samples of class $j$ and the total number of training samples of all classes, respectively, and $\hat{f}_j(v)$ is the approximate probability density function of class-$j$ samples.
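The predictor can be sketched as follows: estimate one density per class, scale it by the fraction of training samples in that class, and pick the class with the largest value. This is a sketch under stated assumptions, not the author's implementation: the fixed width `sigma`, the equal kernel weights, the choice of the maximum as the decision, the names `density` and `predict`, and the toy training data are all hypothetical.

```python
import math

def density(v, samples, sigma):
    """Equal-weight Gaussian KDE in m dimensions with a fixed width."""
    m = len(v)
    norm = (1.0 / (math.sqrt(2.0 * math.pi) * sigma)) ** m
    total = 0.0
    for s in samples:
        dist2 = sum((a - b) ** 2 for a, b in zip(v, s))
        total += norm * math.exp(-dist2 / (2.0 * sigma ** 2))
    return total / len(samples)

def predict(v, training, sigma=1.0):
    """Return the class j with the largest |S_j|/|S| * f_j(v)."""
    n_total = sum(len(samples) for samples in training.values())
    likelihood = {
        label: len(samples) / n_total * density(v, samples, sigma)
        for label, samples in training.items()
    }
    return max(likelihood, key=likelihood.get)

# Toy 2-D training set (made-up values).
training = {
    "A": [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6)],
    "B": [(4.0, 4.0), (4.2, 3.7)],
}
print(predict((0.3, 0.3), training))
print(predict((3.9, 4.1), training))
```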

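The Decision Function of the RVKDE Based Predictor

\[
f(\nu) = \frac{1}{n_{+}} \sum_{s_i} \left( \frac{1}{\sqrt{2\pi}\,\sigma_i} \right)^{m} \exp\!\left( -\frac{\|\nu - s_i\|^2}{2\sigma_i^2} \right)
       - \frac{1}{n_{-}} \sum_{s_j} \left( \frac{1}{\sqrt{2\pi}\,\sigma_j} \right)^{m} \exp\!\left( -\frac{\|\nu - s_j\|^2}{2\sigma_j^2} \right),
\]

where the first sum runs over the $n_{+}$ positive training samples $s_i$, the second sum runs over the $n_{-}$ negative training samples $s_j$, and

(i) $\sigma_i$ and $\sigma_j$ are the widths of the kernels at $s_i$ and $s_j$, respectively [their exact expressions are not recoverable from the source];

(ii) $\delta(s_i)$ ($\delta(s_j)$) is the average distance among the positive (negative) samples in the proximity of $s_i$ ($s_j$), respectively;

(iii) $m$ is the dimension of the feature vector.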

With the KDE based predictor, each training sample is associated with a kernel function, typically with a varying width.
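One way to give each training sample its own kernel width is to tie the width to how crowded the sample's neighborhood is. The sketch below does this for a two-class problem. The width rule used here (`beta` times the mean distance to the `k` nearest samples of the same class) is an assumption made for illustration and is not claimed to be the RVKDE formula; `kernel_widths`, `class_density`, `decision`, `k`, `beta` and the toy data are hypothetical.

```python
import math

def kernel_widths(samples, k=3, beta=1.0):
    """Per-sample width: beta times the mean distance to the k nearest
    samples of the same class. Requires at least two samples."""
    widths = []
    for i, s in enumerate(samples):
        dists = sorted(math.dist(s, t) for j, t in enumerate(samples) if j != i)
        nearest = dists[:k]
        widths.append(beta * sum(nearest) / len(nearest))
    return widths

def class_density(v, samples, widths):
    """Average of Gaussian kernels, one per sample, each with its own width."""
    m = len(v)
    total = 0.0
    for s, sigma in zip(samples, widths):
        norm = (1.0 / (math.sqrt(2.0 * math.pi) * sigma)) ** m
        total += norm * math.exp(-math.dist(v, s) ** 2 / (2.0 * sigma ** 2))
    return total / len(samples)

def decision(v, positives, negatives, k=3, beta=1.0):
    """Positive value favors the positive class, negative value the negative class."""
    pos = class_density(v, positives, kernel_widths(positives, k, beta))
    neg = class_density(v, negatives, kernel_widths(negatives, k, beta))
    return pos - neg

# Toy data: a tight positive cluster and a spread-out negative cluster.
positives = [(0, 0), (1, 0), (0, 1), (1, 1)]
negatives = [(5, 5), (8, 5), (5, 8), (9, 9)]
print(decision((0.5, 0.5), positives, negatives))
print(decision((6.0, 6.0), positives, negatives))
```

Samples in dense regions get narrow kernels and samples in sparse regions get wide ones, which is the effect of a varying width.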

An Example of Supervised Learning (Data Classification)

Given the data set shown on the next slide, can we figure out a set of rules that predicts the classes of the samples?

Data Set

Data      Class    Data      Class    Data      Class
(15,33)   O        (18,28)   ×        (16,31)   O
(9,23)    ×        (15,35)   O        (9,32)    ×
(8,15)    ×        (17,34)   O        (11,38)   ×
(11,31)   O        (18,39)   ×        (13,34)   O
(13,37)   ×        (14,32)   O        (19,36)   ×
(18,32)   O        (25,18)   ×        (10,34)   ×
(16,38)   ×        (23,33)   ×        (15,30)   O
(12,33)   O        (21,28)   ×        (13,22)   ×

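Distribution of the Data Set

[Figure: scatter plot of the data set, with class O samples drawn as circles and class × samples drawn as crosses; axis ticks at 10, 15, 20 and at 30.]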

Rule Based on Observation

If $(x - 15)^2 + (y - 30)^2 \le 25$ and $y \ge 30$, then class = o; else class = x.
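The observed rule (a half disc of radius 5 centered at (15, 30)) can be checked against the 24 samples of the data set. In the sketch below, the non-strict inequalities are an assumption, since the comparison operators are not legible in the source; they are chosen so that the boundary samples (15,35) and (15,30) fall in class O. The name `observed_rule` is hypothetical, and "X" stands for the × class.

```python
# The 24 training samples from the Data Set slide.
DATA = [
    ((15, 33), "O"), ((18, 28), "X"), ((16, 31), "O"),
    ((9, 23), "X"), ((15, 35), "O"), ((9, 32), "X"),
    ((8, 15), "X"), ((17, 34), "O"), ((11, 38), "X"),
    ((11, 31), "O"), ((18, 39), "X"), ((13, 34), "O"),
    ((13, 37), "X"), ((14, 32), "O"), ((19, 36), "X"),
    ((18, 32), "O"), ((25, 18), "X"), ((10, 34), "X"),
    ((16, 38), "X"), ((23, 33), "X"), ((15, 30), "O"),
    ((12, 33), "O"), ((21, 28), "X"), ((13, 22), "X"),
]

def observed_rule(x, y):
    """Class O inside the disc of radius 5 around (15, 30), upper half only.

    The use of <= and >= (rather than < and >) is an assumption.
    """
    if (x - 15) ** 2 + (y - 30) ** 2 <= 25 and y >= 30:
        return "O"
    return "X"

errors = [(point, label) for point, label in DATA if observed_rule(*point) != label]
print(f"{len(DATA) - len(errors)} of {len(DATA)} samples classified correctly")
```

With these inequalities the rule reproduces every training label; the condition on y is what separates (18,28) from the O samples.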

Rule Generated by a Kernel Density Estimation Based Algorithm

Let

\[
f_o(v) = \sum_{i=1}^{10} \frac{1}{2\pi\sigma_{o_i}^2}\, e^{-\frac{\|v - c_{o_i}\|^2}{2\sigma_{o_i}^2}}
\quad \text{and} \quad
f_x(v) = \sum_{j=1}^{14} \frac{1}{2\pi\sigma_{x_j}^2}\, e^{-\frac{\|v - c_{x_j}\|^2}{2\sigma_{x_j}^2}}.
\]

If $f_o(v) \ge f_x(v)$, then prediction = "O".

Otherwise prediction = "X".

c_oi      σ_oi       c_xj      σ_xj
(15,33)   1.723      (9,23)    6.458
(11,31)   2.745      (8,15)    10.08
(18,32)   2.327      (13,37)   2.939
(12,33)   1.794      (16,38)   2.745
(15,35)   1.973      (18,28)   5.451
(17,34)   2.045      (18,39)   3.287
(14,32)   1.794      (25,18)   10.86
(16,31)   1.794      (23,33)   5.322
(13,34)   1.794      (21,28)   5.070
(15,30)   2.027      (9,32)    4.562
                     (11,38)   3.463
                     (19,36)   3.587
                     (10,34)   3.232
                     (13,22)   6.260
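The generated rule can be evaluated directly from the listed centers and widths. The sketch below assumes that the widths pair with the centers in reading order, that each kernel is a normalized 2-D Gaussian, and that a tie goes to class O; none of these is confirmed by the source beyond what the slide shows. The names `O_KERNELS`, `X_KERNELS`, `kernel_sum` and `predict` are hypothetical.

```python
import math

# (center, width) pairs, paired in the order they appear on the slide (assumed).
O_KERNELS = [
    ((15, 33), 1.723), ((11, 31), 2.745), ((18, 32), 2.327), ((12, 33), 1.794),
    ((15, 35), 1.973), ((17, 34), 2.045), ((14, 32), 1.794), ((16, 31), 1.794),
    ((13, 34), 1.794), ((15, 30), 2.027),
]
X_KERNELS = [
    ((9, 23), 6.458), ((8, 15), 10.08), ((13, 37), 2.939), ((16, 38), 2.745),
    ((18, 28), 5.451), ((18, 39), 3.287), ((25, 18), 10.86), ((23, 33), 5.322),
    ((21, 28), 5.070), ((9, 32), 4.562), ((11, 38), 3.463), ((19, 36), 3.587),
    ((10, 34), 3.232), ((13, 22), 6.260),
]

def kernel_sum(v, kernels):
    """Sum of normalized 2-D Gaussians evaluated at the point v."""
    total = 0.0
    for (cx, cy), sigma in kernels:
        dist2 = (v[0] - cx) ** 2 + (v[1] - cy) ** 2
        total += math.exp(-dist2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)
    return total

def predict(v):
    """Predict "O" when the O-class sum is at least the X-class sum."""
    return "O" if kernel_sum(v, O_KERNELS) >= kernel_sum(v, X_KERNELS) else "X"

for query in [(15, 33), (14, 32), (8, 15), (25, 18)]:
    print(query, predict(query))
```

Points in the middle of the O cluster are dominated by the narrow O kernels, while points far from it only receive noticeable contributions from the wide × kernels.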
