

A Gentle Introduction to Support Vector Machines in Biomedicine

Alexander Statnikov*, Douglas Hardin#, Isabelle Guyon†, Constantin F. Aliferis*
(Materials about SVM clustering were contributed by Nikita Lytkin*)

*New York University, #Vanderbilt University, †ClopiNet

Part I

• Introduction
• Necessary mathematical concepts
• Support vector machines (SVMs) for binary classification: classical formulation
• Basic principles of statistical machine learning

Introduction

About this tutorial

• Main goal: Fully understand support vector machines (and important extensions) with a modicum of mathematics knowledge.

• This tutorial is both modest (it does not invent anything new) and ambitious (support vector machines are generally considered mathematically quite difficult to grasp).

• Our educational approach:

[Diagram of the educational approach: basic problem, main idea of the SVM solution, geometrical interpretation, math/theory, algorithms, extensions, case studies.]

Data-analysis problems of interest

1. Build computational classification models (or "classifiers") that assign objects (patients/samples) into two or more classes.
- Classifiers can be used for diagnosis, outcome prediction, and other classification tasks.
- E.g., build a decision-support system to diagnose primary and metastatic cancers from gene expression profiles of the patients:

[Diagram: patient with lung cancer, biopsy, gene expression profile, classifier model, output "primary lung cancer" vs. "metastatic lung cancer".]

Data-analysis problems of interest

[Diagram: article, classifier model, output "relevant to clinical trials" vs. "irrelevant to clinical trials".]

[Diagram: patient blood sample, mass spectrometry proteomics profile, classifier model, output "will respond to treatment", "will not respond to treatment", or "cannot make decision".]

Data-analysis problems of interest

2. Build computational regression models to predict values of some continuous response variable or outcome.
- Regression models can be used to predict survival, length of stay in the hospital, laboratory test values, etc.
- E.g., build a decision-support system to predict the optimal dosage of a drug to be administered to the patient. This dosage is determined by the values of patient biomarkers, and clinical and demographics data:

[Diagram: patient biomarkers, clinical and demographics data (e.g., 1, 2.2, 3, 423, 2, 3, 92, 2, 1, 8), regression model, output "optimal dosage is 5 IU/Kg/week".]

Data-analysis problems of interest

3. Out of all measured variables in the dataset, select the smallest subset of variables that is necessary for the most accurate prediction (classification or regression) of some variable of interest (e.g., phenotypic response variable).
- E.g., find the most compact panel of breast cancer biomarkers from microarray gene expression data for 20,000 genes:

[Figure: heatmap of gene expression in breast cancer tissues vs. normal tissues.]

Data-analysis problems of interest

4. Build a computational model to identify novel or outlier objects (patients/samples).
- Such models can be used to discover deviations in sample handling protocol when doing quality control of assays, etc.
- E.g., build a decision-support system to identify aliens.

[Figure: a group of humans plus three outliers, Alien #1, Alien #2, and Alien #3.]

Data-analysis problems of interest

5. Group objects (patients/samples) into several clusters based on their similarity.
- These methods can be used to discover disease sub-types and for other tasks.
- E.g., consider clustering of brain tumor patients into 4 clusters based on their gene expression profiles. All patients have the same pathological sub-type of the disease, and clustering discovers new molecular disease sub-types that happen to have different characteristics in terms of patient survival and time to recurrence after treatment.

[Figure: patients grouped into Cluster #1 through Cluster #4.]

Basic principles of classification

• Want to classify objects as boats and houses.

• All objects before the coast line are boats and all objects after the coast line are houses.
• The coast line serves as a decision surface that separates the two classes.

Basic principles of classification

• These boats will be misclassified as houses.

• This house will be misclassified as a boat.

Basic principles of classification

[Figure: boats and houses plotted by longitude and latitude.]

• The methods that build classification models ("classification algorithms") operate very similarly to the previous example.
• First, all objects are represented geometrically.

Basic principles of classification

[Figure: boats and houses plotted by longitude and latitude, with a separating line.]

• Then the algorithm seeks to find a decision surface that separates the classes of objects.

Basic principles of classification

[Figure: new, unlabeled objects ("?") plotted by longitude and latitude on both sides of the decision surface.]

• Unseen (new) objects are classified as "boats" if they fall below the decision surface and as "houses" if they fall above it.

The Support Vector Machine (SVM) approach

• Support vector machines (SVMs) is a binary classification algorithm that offers a solution to problem #1.
• Extensions of the basic SVM algorithm can be applied to solve problems #1-#5.
• SVMs are important because of (a) theoretical reasons:
- Robust to very large numbers of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.

Main ideas of SVMs

[Figure: cancer patients and normal patients plotted by the expression of gene X and gene Y.]

• Consider an example dataset described by 2 genes, gene X and gene Y.
• Represent patients geometrically (by "vectors" or "points").

• Find a linear decision surface ("hyperplane") that can separate the patient classes and has the largest distance (i.e., largest "gap" or "margin") between the border-line patients (i.e., "support vectors").

Main ideas of SVMs

[Figure: cancer and normal patients mapped by a kernel from the input space (gene X, gene Y) to a feature space where a separating decision surface exists.]

• If such a linear decision surface does not exist, the data is mapped into a much higher-dimensional space ("feature space") where the separating decision surface is found.
• The feature space is constructed via a very clever mathematical projection ("kernel trick").

History of SVMs and usage in the literature

• Support vector machine classifiers have a long history of development starting from the 1950's.
• The most important milestone for the development of modern SVMs is the 1992 paper by Boser, Guyon, and Vapnik ("A training algorithm for optimal margin classifiers").

[Timeline of contributors, 1950-1995: Nachman Aronszajn (1950), Frank Rosenblatt (1957), Vladimir N. Vapnik and Alexander Y. Lerner (1963), Mark A. Aizerman, Emanuel M. Braverman and Lev I. Rozonoer (1964), Thomas M. Cover (1965), Fred W. Smith (1968), Vapnik and Aleksey Y. Chervonenkis (1974-1979), Tomaso Poggio (1975), Grace Wahba (1990), Poggio and Federico Girosi (1990), Bernard E. Boser, Isabelle Guyon and Vapnik (1992), Corinna Cortes and Vapnik (1995).]

History of SVMs and usage in the literature

[Bar chart: use of support vector machines in the literature, 1998-2007. Publications per year in the general sciences grew from 359 (1998) to roughly 8,860 (2007); in biomedicine, from 4 (1998) to roughly 1,190 (2007).]

History of SVMs and usage in the literature

[Bar chart: use of linear regression in the literature, 1998-2007. Publications per year stayed on the order of 10,000-24,000 in both the general sciences and biomedicine.]

Necessary mathematical concepts

How to represent samples geometrically?
Vectors & points in n-dimensional space (Rn)

• Assume that a sample/patient is described by n characteristics ("features" or "variables").
• Vector representation: Every sample/patient is a vector in Rn with its tail at the point with 0 coordinates and its arrow-head at the point with the feature values.
• Example: Consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as a vector in R2:

[Figure: the vector from (0, 0) to (110, 29) in the plane with axes Systolic BP and Age.]

How to represent samples geometrically?
Vectors & points in n-dimensional space (Rn)

• Point representation: Every sample/patient is represented as a point in the n-dimensional space (Rn) with coordinates given by the values of its features.
• Example: Consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as a point in R2:

[Figure: the point (110, 29) in the plane with axes Systolic BP and Age.]

How to represent samples geometrically?
Vectors & points in n-dimensional space (Rn)

Patient id | Cholesterol (mg/dl) | Systolic BP (mmHg) | Age (years) | Tail of the vector | Arrow-head of the vector
1 | 150 | 110 | 35 | (0, 0, 0) | (150, 110, 35)
2 | 250 | 120 | 30 | (0, 0, 0) | (250, 120, 30)
3 | 140 | 160 | 65 | (0, 0, 0) | (140, 160, 65)
4 | 300 | 180 | 45 | (0, 0, 0) | (300, 180, 45)
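The same table can be written down directly as numeric vectors. A minimal sketch, assuming Python with NumPy (the array simply re-types the four patients from the table above):

```python
import numpy as np

# Each row is one patient: [cholesterol (mg/dl), systolic BP (mmHg), age (years)]
patients = np.array([
    [150, 110, 35],
    [250, 120, 30],
    [140, 160, 65],
    [300, 180, 45],
], dtype=float)

# Every patient is a vector in R^3 with its tail at the origin (0, 0, 0)
print(patients.shape)   # (4, 3): 4 samples, 3 features
print(patients[0])      # arrow-head of patient 1: [150. 110.  35.]
```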

Purpose of vector representation

• Having represented each sample/patient as a vector, we can now geometrically represent the decision surface that separates two groups of samples/patients.

[Figure: a decision surface (a line) in R2 and a decision surface (a plane) in R3.]

• In order to define the decision surface, we need to introduce some basic math elements...

Basic operations on vectors in Rn

1. Multiplication by a scalar
Consider a vector a = (a1, a2, ..., an) and a scalar c.
Define: ca = (ca1, ca2, ..., can).
When you multiply a vector by a scalar, you "stretch" it in the same or the opposite direction, depending on whether the scalar is positive or negative.

[Figure: a = (1, 2); with c = 2, ca = (2, 4) points in the same direction; with c = -1, ca = (-1, -2) points in the opposite direction.]

Basic operations on vectors in Rn

2. Addition
Consider vectors a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
Define: a + b = (a1 + b1, a2 + b2, ..., an + bn).
Recall the addition of forces in classical mechanics.

[Figure: a = (1, 2), b = (3, 0), a + b = (4, 2).]

Basic operations on vectors in Rn

3. Subtraction
Consider vectors a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
Define: a - b = (a1 - b1, a2 - b2, ..., an - bn).
What vector do we need to add to b to get a? I.e., similar to subtraction of real numbers.

[Figure: a = (1, 2), b = (3, 0), a - b = (-2, 2).]

Basic operations on vectors in Rn

4. Euclidean length or L2-norm
Consider a vector a = (a1, a2, ..., an).
Define the L2-norm: ||a||2 = sqrt(a1^2 + a2^2 + ... + an^2).
We often denote the L2-norm without the subscript, i.e. ||a||.
The L2-norm is a typical way to measure the length of a vector; other methods to measure length also exist.

[Figure: the length of the vector a = (1, 2) is ||a|| = sqrt(5) ≈ 2.24.]

Basic operations on vectors in Rn

5. Dot product
Consider vectors a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
Define the dot product: a·b = a1b1 + a2b2 + ... + anbn = Σi aibi.
The law of cosines says that a·b = ||a||2 ||b||2 cos θ, where θ is the angle between a and b. Therefore, when the vectors are perpendicular, a·b = 0.

[Figure: a = (1, 2) and b = (3, 0) give a·b = 3; a = (0, 2) and b = (3, 0) are perpendicular, so a·b = 0.]

Basic operations on vectors in Rn

5. Dot product (continued)
a·b = a1b1 + a2b2 + ... + anbn = Σi aibi

• Property: a·a = a1a1 + a2a2 + ... + anan = ||a||^2.
• In the classical regression equation y = w·x + b, the response variable y is just a dot product of the vector representing the patient characteristics (x) and the regression weights vector (w), which is common across all patients, plus an offset b.
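These operations map directly onto NumPy. A minimal sketch, assuming NumPy, using the same small examples as the slides (the regression weights and offset in the last lines are hypothetical):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])

print(2 * a)               # scalar multiplication: [2. 4.]
print(a + b)               # addition:              [4. 2.]
print(a - b)               # subtraction:           [-2.  2.]
print(np.linalg.norm(a))   # L2-norm (length):      2.236...
print(np.dot(a, b))        # dot product:           3.0

# The regression view: y = w·x + b for one patient x
w = np.array([0.5, -1.2])   # hypothetical regression weights
x = np.array([110.0, 29.0])
offset = 4.0                # hypothetical offset b
print(np.dot(w, x) + offset)
```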

Hyperplanes as decision surfaces

• A hyperplane is a linear decision surface that splits the space into two parts;
• It is obvious that a hyperplane is a binary classifier.

[Figure: a hyperplane in R2 is a line; a hyperplane in R3 is a plane.]

A hyperplane in Rn is an (n-1)-dimensional subspace.

Equation of a hyperplane

First, we illustrate the definition of a hyperplane with an interactive demonstration: http://www.dsl-lab.org/svm_tutorial/planedemo.html
(Source: http://www.math.umn.edu/~nykamp/)

Equation of a hyperplane

Consider the case of R3:
An equation of a hyperplane is defined by a point (P0) and a vector (w) perpendicular to the plane at that point.
Define the vectors x0 = OP0 and x = OP, where P is an arbitrary point on the hyperplane.
A condition for P to be on the plane is that the vector x - x0 is perpendicular to w:
w·(x - x0) = 0, or equivalently w·x - w·x0 = 0.
Define b = -w·x0. Then the hyperplane is w·x + b = 0.
The above equations also hold for Rn when n > 3.

Equation of a hyperplane

Example: w = (4, 1, 6), P0 = (0, 1, 7).
b = -w·x0 = -(0 + 1 + 42) = -43, so the hyperplane is w·x - 43 = 0, i.e.
(4, 1, 6)·(x(1), x(2), x(3)) - 43 = 0, or 4x(1) + x(2) + 6x(3) - 43 = 0.
Moving in the + direction of w gives parallel hyperplanes such as w·x - 50 = 0; moving in the - direction gives hyperplanes such as w·x - 10 = 0.

What happens if the b coefficient changes? The hyperplane moves along the direction of w. We obtain "parallel hyperplanes".
The distance between two parallel hyperplanes w·x + b1 = 0 and w·x + b2 = 0 is equal to D = |b1 - b2| / ||w||.

(Derivation of the distance between two parallel hyperplanes)
Take x1 on the first hyperplane and x2 on the second such that x2 = x1 + tw for some scalar t (the shortest path between the hyperplanes runs along w). Then D = ||x2 - x1|| = ||tw|| = |t| ||w||.
Since w·x1 + b1 = 0 and w·x2 + b2 = 0:
w·(x1 + tw) + b2 = 0
w·x1 + t||w||^2 + b2 = 0
-b1 + t||w||^2 + b2 = 0
t = (b1 - b2) / ||w||^2
Therefore D = |t| ||w|| = |b1 - b2| / ||w||.
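The distance formula can be checked numerically. A minimal sketch, assuming NumPy, reusing the w = (4, 1, 6) example from above:

```python
import numpy as np

w = np.array([4.0, 1.0, 6.0])
b1, b2 = -43.0, -10.0

# Distance between the parallel hyperplanes w·x + b1 = 0 and w·x + b2 = 0
D = abs(b1 - b2) / np.linalg.norm(w)
print(D)   # ≈ 4.53

# Cross-check: project the difference of one point from each plane onto w / ||w||
x1 = np.array([0.0, 1.0, 7.0])    # satisfies w·x1 - 43 = 0
x2 = np.array([0.0, 10.0, 0.0])   # satisfies w·x2 - 10 = 0
print(abs(np.dot(w, x2 - x1)) / np.linalg.norm(w))   # same value
```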

Recap

We know...
• How to represent patients (as "vectors")
• How to define a linear decision surface ("hyperplane")

We need to know...
• How to efficiently compute the hyperplane that separates two classes with the largest "gap"?

Need to introduce basics of the relevant optimization theory.

[Figure: cancer patients and normal patients plotted by gene X and gene Y, separated by a maximum-gap hyperplane.]

Basics of optimization: Convex functions

• A function is called convex if the function lies below the straight line segment connecting two points, for any two points in the interval.
• Property: Any local minimum is a global minimum!

[Figure: a convex function with a single global minimum vs. a non-convex function with both a local minimum and a global minimum.]

Basics of optimization: Quadratic programming (QP)

• Quadratic programming (QP) is a special optimization problem: the function to optimize ("objective") is quadratic, subject to linear constraints.
• Convex QP problems have convex objective functions.
• These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).

Basics of optimization: Example QP problem

Consider x = (x1, x2).
Minimize (1/2)||x||^2 subject to x1 + x2 - 1 ≥ 0
(quadratic objective, linear constraints)

This is a QP problem, and it is a convex QP, as we will see later.

We can rewrite it as:
Minimize (1/2)(x1^2 + x2^2) subject to x1 + x2 - 1 ≥ 0
(quadratic objective, linear constraints)

Basics of optimization: Example QP problem

[Figure: the surface f(x1, x2) = (1/2)(x1^2 + x2^2) and the constraint line x1 + x2 - 1 = 0.]

The solution is x1 = 1/2 and x2 = 1/2.
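The toy QP can also be solved numerically with an off-the-shelf optimizer. A minimal sketch, assuming SciPy (this is only an illustration of the QP, not how SVM solvers work in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Minimize (1/2)(x1^2 + x2^2) subject to x1 + x2 - 1 >= 0
objective = lambda x: 0.5 * (x[0] ** 2 + x[1] ** 2)
constraint = {"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}

result = minimize(objective, x0=np.zeros(2), constraints=[constraint])
print(result.x)   # ≈ [0.5, 0.5], matching the analytic solution
```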

Congratulations! You have mastered all the math elements needed to understand support vector machines.

Now, let us strengthen your knowledge with a quiz.

Quiz

1) Consider the hyperplane shown in white. It is defined by the equation w·x - 10 = 0. Which of the three other hyperplanes (orange, green, yellow) can be defined by the equation w·x - 3 = 0?

2) What is the dot product between the vectors a = (3, 3) and b = (-1, 1)?

Quiz

3) What is the dot product between the vectors a = (3, 3) and b = (1, 0)?

4) What is the length of the vector a = (2, 0), and what is the length of all the other red vectors in the figure?

Quiz

5) Which of the four functions is/are convex?

[Figure: four function plots, labeled 1 through 4.]

Support vector machines (SVMs) for binary classification: classical formulation

Case 1: Linearly separable data; "hard-margin" linear SVM

Given training data: x1, x2, ..., xN ∈ Rn and y1, y2, ..., yN ∈ {-1, +1}.

• Want to find a classifier (hyperplane) to separate the negative objects (y = -1) from the positive ones (y = +1).
• An infinite number of such hyperplanes exist.
• SVMs find the hyperplane that maximizes the gap between the data points on the boundaries (the so-called "support vectors").
• If the points on the boundaries are not informative (e.g., due to noise), SVMs may not do well.

Statement of the linear SVM classifier

The gap is the distance between the parallel hyperplanes w·x + b = -1 and w·x + b = +1, or equivalently (w·x + b) + 1 = 0 and (w·x + b) - 1 = 0.

We know that the distance between parallel hyperplanes is D = |b1 - b2| / ||w||.
Therefore the gap is D = 2 / ||w||.

Since we want to maximize the gap, we need to minimize ||w||, or equivalently minimize (1/2)||w||^2 (the factor 1/2 is convenient for taking derivatives later on).

Statement of the linear SVM classifier

In addition, we need to impose constraints that all objects are correctly classified. In our case:
w·xi + b ≥ +1 if yi = +1
w·xi + b ≤ -1 if yi = -1
Equivalently: yi(w·xi + b) ≥ 1.

In summary:
Want to minimize (1/2)||w||^2 subject to yi(w·xi + b) ≥ 1 for i = 1,...,N.
Then, given a new object x, the classifier is f(x) = sign(w·x + b).

SVM optimization problem: Primal formulation

Minimize (1/2) Σi wi^2 (objective function) subject to yi(w·xi + b) - 1 ≥ 0 for i = 1,...,N (constraints).

• This is called the "primal formulation of linear SVMs".
• It is a convex quadratic programming (QP) optimization problem with n variables (wi, i = 1,...,n), where n is the number of features in the dataset.
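In practice this QP is solved by library implementations rather than by hand. A minimal sketch, assuming Python with scikit-learn; the two-gene toy data is hypothetical, and a very large C is used to approximate the hard-margin case:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-gene expression data: rows are patients, columns are genes X and Y
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],     # cancer
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]])    # normal
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)    # very large C approximates the hard margin
clf.fit(X, y)

print(clf.coef_, clf.intercept_)     # w and b of the separating hyperplane
print(clf.support_vectors_)          # the border-line patients (support vectors)
print(clf.predict([[3.0, 3.0]]))     # classify a new patient
```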

SVM optimization problem: Dual formulation

• The previous problem can be recast in the so-called "dual form", giving rise to the "dual formulation of linear SVMs".
• It is also a convex quadratic programming problem, but with N variables (αi, i = 1,...,N), where N is the number of samples.

Maximize Σi αi - (1/2) Σi,j αi αj yi yj (xi·xj) (objective function) subject to αi ≥ 0 and Σi αi yi = 0 (constraints).

Then the w-vector is defined in terms of the αi: w = Σi αi yi xi.
And the solution becomes: f(x) = sign(Σi αi yi (xi·x) + b).

SVM optimization problem: Benefits of using the dual formulation

1) No need to access the original data; we need access only to dot products.
Objective function: Σi αi - (1/2) Σi,j αi αj yi yj (xi·xj)
Solution: f(x) = sign(Σi αi yi (xi·x) + b)

2) The number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems!).
E.g., if a microarray dataset contains 20,000 genes and 100 patients, then we need to find only up to 100 parameters!
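The dual view is visible in fitted models: only support vectors carry non-zero αi, and w can be recovered as Σi αi yi xi. A minimal sketch, assuming scikit-learn and the same hypothetical toy data as before:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]])   # hypothetical two-gene data
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only
alpha_times_y = clf.dual_coef_[0]
sv = clf.support_vectors_

w_from_dual = np.dot(alpha_times_y, sv)   # w = sum_i alpha_i y_i x_i
print(w_from_dual)                        # matches clf.coef_[0] up to numerical precision
print(clf.coef_[0])
print(len(sv), "support vectors out of", len(X), "training samples")
```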

(Derivation of the dual formulation)

Minimize (1/2) Σi wi^2 (objective function) subject to yi(w·xi + b) - 1 ≥ 0 for i = 1,...,N (constraints).

Apply the method of Lagrange multipliers.
Define the Lagrangian
Λ_P(w, b, α) = (1/2) Σi wi^2 - Σi αi [yi(w·xi + b) - 1],
where w is a vector with n elements and α is a vector with N elements.

We need to minimize this Lagrangian with respect to w and b, and simultaneously require that the derivative with respect to α vanishes, all subject to the constraints that αi ≥ 0.

(Derivation of the dual formulation)

If we set the derivatives with respect to w and b to 0, we obtain:
∂Λ_P/∂b = 0  gives  Σi αi yi = 0
∂Λ_P/∂w = 0  gives  w = Σi αi yi xi

We substitute the above into the equation for Λ_P(w, b, α) and obtain the "dual formulation of linear SVMs":
Λ_D = Σi αi - (1/2) Σi,j αi αj yi yj (xi·xj)

We seek to maximize the above Lagrangian with respect to α, subject to the constraints that αi ≥ 0 and Σi αi yi = 0.

Case 2: Not linearly separable data; "soft-margin" linear SVM

What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear.
Want to handle this case without changing the family of decision functions.

Approach: Assign a "slack variable" ξi ≥ 0 to each object, which can be thought of as the distance from the separating hyperplane if an object is misclassified, and 0 otherwise.

Want to minimize (1/2)||w||^2 + C Σi ξi subject to yi(w·xi + b) ≥ 1 - ξi for i = 1,...,N.
Then, given a new object x, the classifier is f(x) = sign(w·x + b).

Two formulations of the soft-margin linear SVM

Primal formulation:
Minimize (1/2) Σi wi^2 + C Σi ξi (objective function) subject to yi(w·xi + b) ≥ 1 - ξi (constraints) for i = 1,...,N.

Dual formulation:
Maximize Σi αi - (1/2) Σi,j αi αj yi yj (xi·xj) (objective function) subject to 0 ≤ αi ≤ C and Σi αi yi = 0 (constraints) for i = 1,...,N.

Parameter C in the soft-margin SVM

Minimize (1/2)||w||^2 + C Σi ξi subject to yi(w·xi + b) ≥ 1 - ξi for i = 1,...,N.

• When C is very large, the soft-margin SVM is equivalent to the hard-margin SVM;
• When C is very small, we admit misclassifications in the training data at the expense of having a w-vector with small norm;
• C has to be selected for the distribution at hand, as will be discussed later in this tutorial.

[Figure: decision boundaries and margins for C = 100, C = 1, C = 0.15, and C = 0.1.]
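The effect of C is easy to explore empirically. A minimal sketch, assuming scikit-learn; the data and the C values mirror the ones pictured, but are otherwise hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [4.0, 4.0], [3.0, 4.0]])   # hypothetical data
y = np.array([-1, -1, -1, +1, +1, +1])

for C in (100, 1, 0.15, 0.1):                        # the values pictured on the slide
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])      # gap width 2 / ||w||
    print(f"C={C}: margin ≈ {margin:.2f}, {clf.n_support_.sum()} support vectors")
```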

Case 3: Not linearly separable data; kernel trick

[Figure: tumor and normal samples plotted by gene X and gene Y. The data is not linearly separable in the input space, but is linearly separable in the feature space obtained by a kernel.]

Kernel trick

Original data x (in the input space) is mapped to Φ(x), data in a higher-dimensional feature space:

f(x) = sign(w·x + b)   becomes   f(x) = sign(w·Φ(x) + b)
w = Σi αi yi xi   becomes   w = Σi αi yi Φ(xi)
f(x) = sign(Σi αi yi (xi·x) + b)   becomes   f(x) = sign(Σi αi yi (Φ(xi)·Φ(x)) + b) = sign(Σi αi yi K(xi, x) + b)

Therefore, we do not need to know Φ explicitly; we just need to define the function K(·,·): Rn × Rn → R.

Not every function Rn × Rn → R can be a valid kernel; it has to satisfy the so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable.

Popular kernels

A kernel is a dot product in some feature space: K(xi, xj) = Φ(xi)·Φ(xj).

Examples:
Linear kernel: K(xi, xj) = xi·xj
Gaussian kernel: K(xi, xj) = exp(-γ ||xi - xj||^2)
Exponential kernel: K(xi, xj) = exp(-γ ||xi - xj||)
Polynomial kernel: K(xi, xj) = (p + xi·xj)^q
Hybrid kernel: K(xi, xj) = (p + xi·xj)^q exp(-γ ||xi - xj||^2)
Sigmoidal kernel: K(xi, xj) = tanh(k (xi·xj) - δ)
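Each kernel is just a function of two vectors. A minimal sketch of a few of them, assuming NumPy (the parameter names γ, p, q follow the list above; the test vectors are hypothetical):

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def gaussian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def polynomial_kernel(x, z, p=1.0, q=3):
    return (p + np.dot(x, z)) ** q

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(linear_kernel(x, z), gaussian_kernel(x, z), polynomial_kernel(x, z))
```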

Understanding the Gaussian kernel

Consider the Gaussian kernel: K(x, xj) = exp(-γ ||x - xj||^2).

• All data points in the input space are mapped to a multi-dimensional sphere.
• The parameter γ defines the relative location of the points on the sphere.

Understanding the polynomial kernel

Consider the polynomial kernel: K(xi, xj) = (1 + xi·xj)^3.
Assume that we are dealing with 2-dimensional data (i.e., in R2). Where will this kernel map the data?

The kernel maps the 2-dimensional space (x(1), x(2)) to a 10-dimensional space spanned (up to constant factors) by the monomials
1, x(1), x(2), x(1)^2, x(2)^2, x(1)x(2), x(1)^3, x(2)^3, x(1)^2 x(2), x(1) x(2)^2.

Example of the benefits of using a kernel

• The data is not linearly separable in the input space (R2).
• Apply the kernel K(x, z) = (x·z)^2 to map the data to a higher-dimensional (3-dimensional) space where it is linearly separable.

K(x, z) = (x·z)^2 = (x(1)z(1) + x(2)z(2))^2
= x(1)^2 z(1)^2 + 2 x(1)z(1) x(2)z(2) + x(2)^2 z(2)^2
= (x(1)^2, sqrt(2) x(1)x(2), x(2)^2) · (z(1)^2, sqrt(2) z(1)z(2), z(2)^2)
= Φ(x)·Φ(z)

Example of the benefits of using a kernel

Therefore, the explicit mapping is Φ(x) = (x(1)^2, sqrt(2) x(1)x(2), x(2)^2).

[Figure: the four points x1, x2, x3, x4 in the input space and their images under Φ in the 3-dimensional feature space, where the two classes become linearly separable.]
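The equivalence between the kernel and the explicit mapping can be verified numerically. A minimal sketch, assuming NumPy (the two test vectors are hypothetical):

```python
import numpy as np

def phi(v):
    # explicit mapping for K(x, z) = (x·z)^2 in R^2
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])

print(np.dot(x, z) ** 2)        # kernel computed in the input space: 1.0
print(np.dot(phi(x), phi(z)))   # dot product in the feature space:   1.0
```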

Comparison with methods from classical statistics & regression

• Need ≥ 5 samples for each parameter of the regression model to be estimated:

Number of variables | Polynomial degree | Number of parameters | Required sample
2 | 3 | 10 | 50
10 | 3 | 286 | 1,430
10 | 5 | 3,003 | 15,015
100 | 3 | 176,851 | 884,255
100 | 5 | 96,560,646 | 482,803,230

• SVMs do not have such a requirement and often require a much smaller sample than the number of variables, even when a high-degree polynomial kernel is used.

Basic principles of statistical machine learning

Generalization and overfitting

• Generalization: A classifier or a regression algorithm learns to correctly predict output from given inputs not only in previously seen objects but also in previously unseen objects.
• Overfitting: A classifier or a regression algorithm learns to correctly predict output from given inputs in previously seen objects but fails to do so in previously unseen objects.
• Overfitting implies poor generalization.

Example of overfitting and generalization

There is a linear relationship between the predictor and the outcome (plus some Gaussian noise).

[Figure: training data and test data (outcome of interest Y vs. predictor X) with the fits produced by Algorithm 1 and Algorithm 2.]

• Algorithm 1 learned non-reproducible peculiarities of the specific sample available for learning but did not learn the general characteristics of the function that generated the data. Thus, it is overfitted and has poor generalization.
• Algorithm 2 learned the general characteristics of the function that produced the data. Thus, it generalizes.

Example of overfitting and generalization

[Figure: the true relationship, the training data with the decision surface of algorithm #1, and the training data with the decision surface of algorithm #2.]

A general principle here (Occam's razor) is that the likelihood of overfitting increases with the "complexity" of an algorithm.

"Loss + penalty" paradigm for learning to avoid overfitting and ensure generalization

• Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem:
Minimize (Loss + λ Penalty)
• Loss measures the error of fitting the data
• Penalty penalizes the complexity of the learned function
• λ is a regularization parameter that balances Loss and Penalty

"Loss + penalty" paradigm for learning to avoid overfitting and ensure generalization

[Figure: cancer and normal samples (gene X vs. gene Y) separated by a straight line and by a wiggly curve.]

The classifier that is represented by a straight line is considered simpler than the classifier represented by a wiggly line (because we need less information to represent it).

SVMs in "loss + penalty" form

SVMs build the following classifiers: f(x) = sign(w·x + b).

Consider the soft-margin linear SVM formulation:
Find w and b that minimize (1/2)||w||^2 + C Σi ξi subject to yi(w·xi + b) ≥ 1 - ξi for i = 1,...,N.

This can also be stated as: Find w and b that minimize
Σi [1 - yi f(xi)]+ + λ ||w||^2
where the first term is the Loss ("hinge loss") and the second term is the Penalty
(in fact, one can show that λ = 1/(2C)).
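The hinge loss is straightforward to compute directly. A minimal sketch, assuming NumPy; the f(x) values are hypothetical classifier outputs:

```python
import numpy as np

y = np.array([+1, +1, -1, -1])          # true labels
f = np.array([2.0, 0.3, -1.5, 0.4])     # hypothetical values of w·x + b

hinge = np.maximum(0.0, 1.0 - y * f)    # [1 - y_i f(x_i)]_+
print(hinge)          # [0.  0.7 0.  1.4]: only margin violations are penalized
print(hinge.sum())    # the Loss term; the Penalty term would be lambda * ||w||^2
```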

Meaning of the SVM loss function

Consider the loss function: Σi [1 - yi f(xi)]+

• Recall that [...]+ indicates the positive part.
• For a given sample/patient i, the loss is non-zero if 1 - yi f(xi) > 0.
• In other words, the loss is non-zero if yi f(xi) < 1.
• Since yi ∈ {-1, +1}, this means that the loss is non-zero if
f(xi) < +1 for yi = +1
f(xi) > -1 for yi = -1
• In other words, the loss is non-zero if
w·xi + b < +1 for yi = +1
w·xi + b > -1 for yi = -1

Meaning of the SVM loss function

[Figure: the hyperplanes w·x + b = -1, 0, +1 divide the space into regions 1 through 4.]

• If the object is negative, it is penalized only in regions 2, 3, 4.
• If the object is positive, it is penalized only in regions 1, 2, 3.

Flexibility of the "loss + penalty" framework

Minimize (Loss + λ Penalty)

Part 2

• Model selection for SVMs
• Extensions to the basic SVM model:
1. SVMs for multicategory data
2. Support vector regression
3. Novelty detection with SVM-based methods
4. Support vector clustering
5. SVM-based variable selection
6. Computing posterior class probabilities for SVM classifiers

Model selection for SVMs

Need for model selection for SVMs

[Left figure: tumor and normal samples (gene 1 vs. gene 2) arranged so that no linear separation exists.]
• It is impossible to find a linear SVM classifier that separates tumors from normals!
• Need a non-linear SVM classifier, e.g., an SVM with a polynomial kernel of degree 2 solves this problem without errors.

[Right figure: tumor and normal samples that are linearly separable.]
• We should not apply a non-linear SVM classifier while we can perfectly solve this problem using a linear SVM classifier!

A data-driven approach for model selection for SVMs

• We do not know a priori what type of SVM kernel and what kernel parameter(s) to use for a given dataset.
• Need to examine various combinations of parameters, e.g., consider searching the following grid of (C, d) pairs:

Polynomial degree d \ Parameter C | 0.1 | 1 | 10 | 100 | 1000
1 | (0.1, 1) | (1, 1) | (10, 1) | (100, 1) | (1000, 1)
2 | (0.1, 2) | (1, 2) | (10, 2) | (100, 2) | (1000, 2)
3 | (0.1, 3) | (1, 3) | (10, 3) | (100, 3) | (1000, 3)
4 | (0.1, 4) | (1, 4) | (10, 4) | (100, 4) | (1000, 4)
5 | (0.1, 5) | (1, 5) | (10, 5) | (100, 5) | (1000, 5)

• How to search this grid while producing an unbiased estimate of classification performance?

Nested cross-validation

Recall the main idea of cross-validation: the data is split into folds; each fold in turn serves as the test set while the remaining folds serve as the training set.

What combination of SVM parameters should be applied to the training data? Perform a "grid search" using another, nested loop of cross-validation: each training set is further split into training and validation parts.

Example of nested cross-validation

Consider that we use 3-fold cross-validation and we want to optimize parameter C, which takes values "1" and "2".

Outer loop:
Training set | Testing set | C | Accuracy | Average accuracy
P1, P2 | P3 | 1 | 89% |
P1, P3 | P2 | 2 | 84% | 83%
P2, P3 | P1 | 1 | 76% |

Inner loop (for the first outer split, training data = P1, P2):
Training set | Validation set | C | Accuracy | Average accuracy
P1 | P2 | 1 | 86% |
P2 | P1 | 1 | 84% | 85%  (choose C = 1)
P1 | P2 | 2 | 70% |
P2 | P1 | 2 | 90% | 80%

On the use of cross-validation

• Empirically, we found that cross-validation works well for model selection for SVMs in many biomedical problem domains;
• Many other approaches that can be used for model selection for SVMs exist, e.g.:
- Virtual leave-one-out cross-validation
- Generalized cross-validation
- Bayesian information criterion (BIC)
- Minimum description length (MDL)
- Vapnik-Chervonenkis (VC) dimension
- Bootstrap

SVMs for multicategory data

One-versus-rest multicategory SVM method

[Figure: three tumor types (Tumor I, II, III) plotted by gene 1 and gene 2. Each class is separated from the remaining classes by its own binary SVM hyperplane, and new samples ("?") are assigned to the class whose classifier is most confident.]

Page 45: A Gentle Introduction to Support Vector Machines in Biomedicine · 2010-10-27 · A Gentle Introduction to Support Vector Machines in Biomedicine Alexander Statnikov*, Douglas Hardin#,

One-versus-one multicategorySVM h dSVM method

Gene 2

Tumor I

* *

*** *

*

*Tumor II

*** * *

*

*

** **?

?

89Gene 1

Tumor III

DAGSVM multicategory SVM method

[Figure: decision directed acyclic graph for three leukemia classes. Root node: AML vs. ALL T-cell; depending on the outcome ("not AML" or "not ALL T-cell"), the next node is ALL B-cell vs. ALL T-cell or AML vs. ALL B-cell; the leaves are ALL T-cell, ALL B-cell, and AML.]

SVM multicategory methods by Weston and Watkins and by Crammer and Singer

[Figure: three tumor classes (gene 1 vs. gene 2) separated by a single joint multiclass SVM formulation; new samples ("?") are classified directly.]
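In practice, the one-versus-rest and one-versus-one schemes are available as wrappers around a binary SVM. A minimal sketch, assuming scikit-learn; the three-class data is hypothetical:

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=90, centers=3, random_state=0)   # three hypothetical tumor types

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)    # one binary SVM per class
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)     # one binary SVM per pair of classes

print(ovr.predict(X[:5]), ovo.predict(X[:5]))
# Note: scikit-learn's SVC already handles multiclass problems internally via one-versus-one.
```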

Support vector regression

ε-Support vector regression (ε-SVR)

Given training data: x1, x2, ..., xN ∈ Rn and y1, y2, ..., yN ∈ R.

Main idea: Find a function f(x) = w·x + b that approximates y1,...,yN:
• it has at most ε deviation from the true values yi
• it is as "flat" as possible (to avoid overfitting)

E.g., build a model to predict the survival of cancer patients that can admit a one-month error (= ε).

ε-Support vector regression (ε-SVR)

[Figure: data points and the fitted function y = f(x) with its ε-tube, f(x) + ε and f(x) - ε.]

Formulation of "hard-margin" ε-SVR

Find f(x) = w·x + b by minimizing (1/2)||w||^2 subject to the constraints:
yi - (w·xi + b) ≤ ε
(w·xi + b) - yi ≤ ε
for i = 1,...,N.

I.e., the difference between yi and the fitted function should be smaller than ε and larger than -ε: the fitted function should go through the ε-neighborhood of all points yi.

Influence of parameter ε

[Figure: ε-SVR fits of the same data for ε = 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30.]

Formulation of "soft-margin" ε-SVR

If we have points like this (e.g., outliers or noise), we can either:
a) increase ε to ensure that these points are within the new ε-neighborhood, or
b) assign a penalty ("slack" variable) to each of these points (as was done for "soft-margin" SVMs).

Formulation of "soft-margin" ε-SVR

Find f(x) = w·x + b by minimizing (1/2)||w||^2 + C Σi (ξi + ξi*) subject to the constraints:
yi - (w·xi + b) ≤ ε + ξi
(w·xi + b) - yi ≤ ε + ξi*
ξi, ξi* ≥ 0
for i = 1,...,N.

• ξi = 0 if the linear function is above the lower boundary of the ε-neighborhood; otherwise it denotes the distance from the linear function to the lower boundary.
• ξi* = 0 if the linear function is below the upper boundary of the ε-neighborhood; otherwise it denotes the distance from the linear function to the upper boundary.
• Notice that only points outside the ε-neighborhood are penalized!

Nonlinear ε-SVR

[Figure: in the input space, no linear relation between x and y exists; after mapping the data by Φ(x) into a feature space, a linear relation exists and is determined there; mapped back to the input space, it corresponds to a non-linear relation.]

ε-Support vector regression in "loss + penalty" form

Build a decision function of the form f(x) = w·x + b.
Find w and b that minimize
Σi max(0, |yi - f(xi)| - ε) + λ ||w||^2
where the first term is the Loss ("linear ε-insensitive loss") and the second term is the Penalty.

[Figure: the linear ε-insensitive loss as a function of the approximation error; it is zero inside [-ε, ε] and grows linearly outside.]
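A working ε-SVR fit looks like this. A minimal sketch, assuming scikit-learn; the one-dimensional data is hypothetical:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)        # hypothetical predictor
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.3, size=40)     # roughly linear outcome + noise

# epsilon is the half-width of the tube; C trades flatness against tube violations
reg = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)
print(reg.coef_, reg.intercept_)   # approximately w ≈ 2 and b ≈ 1
print(reg.predict([[2.5]]))        # ≈ 6
```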

Comparing loss functions of regression methods

[Figure: loss value vs. approximation error for four loss functions: the linear ε-insensitive loss, the quadratic ε-insensitive loss, the mean squared error, and the mean linear (absolute) error.]

Comparing ε-SVR with popular regression methods

Applying ε-SVR to real data

In the absence of domain knowledge about decision functions, it is recommended to optimize the following parameters (e.g., by cross-validation using grid search):
• parameter C
• parameter ε
• kernel parameters (e.g., degree of polynomial)

Notice that parameter ε depends on the ranges of the variables in the dataset; therefore, it is recommended to normalize/re-scale the data prior to applying ε-SVR.

Novelty detection with SVM-based methods

What is it about?

• Find the simplest and most compact region in the space of predictors where the majority of data samples "live" (i.e., with the highest density of samples).
• Build a decision function that takes value +1 in this region and -1 elsewhere.
• Once we have such a decision function, we can identify novel or outlier samples/patients in the data.

[Figure: a dense cloud of samples (predictor X vs. predictor Y) enclosed by a boundary; decision function = +1 inside, decision function = -1 outside.]

Key assumptions

• We do not know the classes/labels of the samples (positive or negative) in the data available for learning; this is not a classification problem.
• All positive samples are similar, but each negative sample can be different in its own way.

Thus, we do not need to collect data for negative samples!

Sample applications

[Figure: "normal" samples fall inside the learned region; "novel" samples fall outside it. Modified from: www.cs.huji.ac.il/course/2004/learns/NoveltyDetection.ppt]

Sample applications

Discover deviations in sample handling protocol when doing quality control of assays.

[Figure: samples plotted by protein X and protein Y. Samples with high quality of processing form the dense central region; outliers include samples with low quality of processing from the lab of Dr. Smith, from infants, from ICU patients, and from patients with lung cancer.]

Sample applications

Identify websites that discuss benefits of proven cancer treatments.

[Figure: websites plotted by the weighted frequencies of word X and word Y. Websites that discuss benefits of proven cancer treatments form the dense central region; outliers include websites that discuss cancer prevention methods, websites that discuss side-effects of proven cancer treatments, blogs of cancer patients, and websites that discuss unproven cancer treatments.]

On the presentation of one-class SVMs

When we previously discussed SVM classification and regression, we first introduced the version of the method for linear data and then discussed the non-linear version. We will follow a similar course in the discussion of one-class SVMs, but with one fundamental difference. Unlike SVMs and SVR, only the non-linear one-class SVM is designed to be applied and makes practical sense. Furthermore, one-class SVM should be used with special types of kernel functions. So, even though we will describe the linear one-class SVM, it is presented here only for educational purposes and its practical meaning may be limited.

Linear one-class SVM

Main idea: Find the maximal-gap hyperplane that separates the data from the origin (i.e., the only member of the second class is the origin).

[Figure: positive objects (y = +1) plotted by gene X and gene Y, separated from the origin (the only negative object, y = -1) by the hyperplane w·x + b = 0 with margins w·x + b = -1 and w·x + b = +1.]

Linear one-class SVM

One-class SVM seeks the most compact region where the majority of data points live, so we are interested in another hyperplane: w·x + b = 0.

[Figure: the same data with the hyperplane placed tightly around the data cloud; the origin remains the only member of the second class.]

Formulation of one-class SVM: linear case

Given training data: x1, x2, ..., xN ∈ Rn.

Find f(x) = sign(w·x + b) by minimizing
(1/2)||w||^2 + (1/(νN)) Σi ξi + b
subject to the constraints:
w·xi + b ≥ -ξi
ξi ≥ 0
for i = 1,...,N.

• ν ∈ (0, 1] is an upper bound on the fraction of outliers (i.e., points outside the decision surface) allowed in the data.
• I.e., the decision function should be positive in all training samples except for small deviations.

Formulation of one-class SVM: linear and non-linear cases

Linear case:
Find $f(\vec{x}) = sign(\vec{w} \cdot \vec{x} - b)$ by minimizing $\frac{1}{2}\|\vec{w}\|^2 + \frac{1}{\nu N}\sum_{i=1}^{N} \xi_i - b$ subject to $\vec{w} \cdot \vec{x}_i \geq b - \xi_i$ and $\xi_i \geq 0$ for $i = 1, \ldots, N$.

Non-linear case (use the "kernel trick"):
Find $f(\vec{x}) = sign(\vec{w} \cdot \Phi(\vec{x}) - b)$ by minimizing $\frac{1}{2}\|\vec{w}\|^2 + \frac{1}{\nu N}\sum_{i=1}^{N} \xi_i - b$ subject to $\vec{w} \cdot \Phi(\vec{x}_i) \geq b - \xi_i$ and $\xi_i \geq 0$ for $i = 1, \ldots, N$.
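The formulation above maps directly onto off-the-shelf implementations. Below is a minimal sketch (not part of the original tutorial) using scikit-learn's OneClassSVM, whose parameter nu plays the role of ν; the data and parameter values are illustrative assumptions.

```python
# Minimal one-class SVM sketch (illustrative; assumes scikit-learn is available).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # "normal" samples only

# nu upper-bounds the fraction of training points treated as outliers
model = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1)
model.fit(X_train)

X_new = np.array([[0.1, -0.2],    # close to the training cloud
                  [4.0,  4.0]])   # far away from it
print(model.predict(X_new))             # +1 = inside the region, -1 = outlier
print(model.decision_function(X_new))   # signed distance to the boundary
```

A Gaussian (RBF) kernel is used here because, as discussed below, one-class SVM makes practical sense only in its non-linear form.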


Non-linear one-class SVM

Use a kernel to map all points to a sphere with unit radius in the feature space (e.g., use a Gaussian kernel): $\vec{x} \mapsto \Phi(\vec{x})$.

Figure: input space (Gene X vs. Gene Y) mapped to the feature space, where the images $\Phi(\vec{x}_i)$ lie on the unit sphere. We want to find the simplest and most compact region enclosing the majority of data points.

Non-linear one-class SVM

Solve the linear one-class SVM problem in the feature space, with the hyperplane $\vec{w} \cdot \Phi(\vec{x}) - b = 0$.

Figure: feature space with positive objects (y = +1) separated from the negative object (y = −1).


Non-linear one-class SVM

Figure: input space (Gene X vs. Gene Y) with the decision surface corresponding to $\vec{w} \cdot \Phi(\vec{x}) - b = 0$.

Consider the corresponding decision surface (image) in the input space – it is exactly what we were looking for!

More about one-class SVM

• One-class SVMs inherit most of the properties of SVMs for binary classification (e.g., the "kernel trick", sample efficiency, ease of finding a solution by an efficient optimization method, etc.);
• The choice of the parameter $\nu$ significantly affects the resulting decision surface;
• The choice of origin is arbitrary and also significantly affects the decision surface returned by the algorithm.


More about one-class SVM

Figure: decision surfaces of one-class SVM for $\nu$ = 0.5, $\nu$ = 0.3, $\nu$ = 0.2, and $\nu$ = 0.1.
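As the figure suggests, ν directly controls how much of the training data is left outside the decision surface. A minimal sketch (illustrative assumptions; scikit-learn's OneClassSVM) that mirrors the four panels:

```python
# Effect of nu on the fraction of training points flagged as outliers (illustrative).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))

for nu in [0.5, 0.3, 0.2, 0.1]:
    oc_svm = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
    outlier_fraction = np.mean(oc_svm.predict(X) == -1)
    print("nu = %.1f -> fraction of training outliers = %.2f" % (nu, outlier_fraction))
```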

Support vector clustering

Contributed by Nikita Lytkin



Goal of clustering (aka class discovery)

Given a heterogeneous set of data points $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \in R^n$, assign labels $y_1, y_2, \ldots, y_N \in \{1, 2, \ldots, K\}$ such that points with the same label are highly "similar" to each other and are distinctly different from the rest.

Figure: illustration of the clustering process.

Support vector domain description

• The Support Vector Domain Description (SVDD) of the data is the set of vectors lying on the surface of the smallest hyper-sphere (of radius R) enclosing all data points in a feature space.
  – These surface points, called Support Vectors, lie in low-density regions in the input space and are good indicators of cluster boundaries.

Figure: kernel mapping of the data to a feature space and the smallest enclosing hyper-sphere of radius R.


Outline of Support Vector Clustering

• Data points that are support vectors of the minimal enclosing hyper-sphere are taken to be the cluster boundary points.
• All other points are assigned to clusters using the boundary points.

Figure: minimal enclosing hyper-sphere with radius R and center a.

Cluster assignment in SVC

• Two points $\vec{x}_i$ and $\vec{x}_j$ belong to the same cluster (i.e., have the same label) if every point of the line segment $(\vec{x}_i, \vec{x}_j)$ projected into the feature space lies within the hyper-sphere.

Figure: for a segment between points of different clusters, some points lie outside the hyper-sphere; for a segment between points of the same cluster, every point is within the hyper-sphere in the feature space.


Cluster assignment in SVC (continued)

• A point-wise adjacency matrix is constructed by testing the line segments between every pair of points.
• Connected components are extracted.
• Points belonging to the same connected component are assigned to the same cluster.

Example adjacency matrix for points A, B, C, D, E (1 = the segment stays inside the hyper-sphere):

    A  B  C  D  E
A   1  1  1  0  0
B      1  1  0  0
C         1  0  0
D            1  1
E               1

Here {A, B, C} form one cluster and {D, E} form another.
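A simplified sketch of this assignment step is given below. It is an assumption-laden illustration rather than the original SVDD implementation: the "inside the hyper-sphere" test is replaced by the decision function of a one-class SVM with a Gaussian kernel, and connected components are extracted with SciPy.

```python
# Support-vector-clustering-style cluster assignment (simplified, illustrative).
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2.0, 0.4, size=(30, 2)),     # two illustrative blobs
               rng.normal(+2.0, 0.4, size=(30, 2))])

# learn a boundary enclosing the bulk of the data
boundary = OneClassSVM(kernel="rbf", gamma=1.0, nu=0.1).fit(X)

def same_cluster(a, b, n_steps=20):
    """True if every sampled point of the segment a-b stays inside the region."""
    ts = np.linspace(0.0, 1.0, n_steps)[:, None]
    segment = a + ts * (b - a)
    return bool(np.all(boundary.decision_function(segment) >= 0))

n = len(X)
adjacency = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i, n):
        adjacency[i, j] = adjacency[j, i] = same_cluster(X[i], X[j])

n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)
```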

SVM-based variable selection



Understanding the weight vector w

Recall the standard SVM formulation: find $\vec{w}$ and $b$ that minimize $\frac{1}{2}\|\vec{w}\|^2$ subject to $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1$ for $i = 1, \ldots, N$, and use the classifier $f(\vec{x}) = sign(\vec{w} \cdot \vec{x} + b)$ (the hyperplane $\vec{w} \cdot \vec{x} + b = 0$ separating positive objects, y = +1, from negative objects, y = −1).

• The weight vector $\vec{w}$ contains as many elements as there are input variables in the dataset, i.e. $\vec{w} \in R^n$.
• The magnitude of each element denotes the importance of the corresponding variable for the classification task.

Understanding the weight vector w

• $\vec{w} = (1, 1)$: hyperplane $x_1 + x_2 + b = 0$ — X1 and X2 are equally important.
• $\vec{w} = (1, 0)$: hyperplane $x_1 + 0 \cdot x_2 + b = 0$ — X1 is important, X2 is not.
• $\vec{w} = (0, 1)$: hyperplane $0 \cdot x_1 + x_2 + b = 0$ — X2 is important, X1 is not.
• $\vec{w} = (1, 1, 0)$: hyperplane $x_1 + x_2 + 0 \cdot x_3 + b = 0$ — X1 and X2 are equally important, X3 is not.


Understanding the weight vector w

Figure: gene X1 vs. gene X2 for melanoma vs. nevi samples. The SVM decision surface corresponds to $\vec{w} = (1, 1)$ ($x_1 + x_2 + b = 0$), while the decision surface of another classifier corresponds to $\vec{w} = (1, 0)$ ($x_1 + 0 \cdot x_2 + b = 0$). True model: X1 → Phenotype, with X2 redundant.

• In the true model, X1 is causal and X2 is redundant.
• The SVM decision surface implies that X1 and X2 are equally important; thus it is locally causally inconsistent.
• There exists a causally consistent decision surface for this example.
• Causal discovery algorithms can identify that X1 is causal and X2 is redundant.

Simple SVM-based variable selection algorithm

Algorithm:
1. Train an SVM classifier using data for all variables to estimate the vector $\vec{w}$.
2. Rank each variable based on the magnitude of the corresponding element in vector $\vec{w}$.
3. Using the above ranking of variables, select the smallest nested subset of variables that achieves the best SVM prediction accuracy.


Simple SVM-based variable selection algorithm

Consider that we have 7 variables: X1, X2, X3, X4, X5, X6, X7.
The vector $\vec{w}$ is: (0.1, 0.3, 0.4, 0.01, 0.9, −0.99, 0.2).
The ranking of variables is: X6, X5, X3, X2, X7, X1, X4.

Subset of variables          Classification accuracy
X6 X5 X3 X2 X7 X1 X4         0.920   (best classification accuracy)
X6 X5 X3 X2 X7 X1            0.920
X6 X5 X3 X2 X7               0.919   (statistically indistinguishable from the best one)
X6 X5 X3 X2                  0.852
X6 X5 X3                     0.843
X6 X5                        0.832
X6                           0.821

Select the following variable subset: X6, X5, X3, X2, X7.
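A minimal sketch of this procedure (synthetic data and a linear kernel are illustrative assumptions): train a linear SVM, rank variables by |w|, and evaluate nested subsets of the top-ranked variables.

```python
# Simple SVM-weight-based variable selection (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

w = SVC(kernel="linear", C=1.0).fit(X, y).coef_.ravel()
ranking = np.argsort(-np.abs(w))          # variables, largest |w| first

for k in [1, 2, 5, 10, 20]:               # nested subsets of top-ranked variables
    subset = ranking[:k]
    acc = cross_val_score(SVC(kernel="linear", C=1.0), X[:, subset], y, cv=5).mean()
    print("top %2d variables -> CV accuracy %.3f" % (k, acc))
```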

Simple SVM-based variable selection algorithm

• SVM weights are not locally causally consistent: we may end up with a variable subset that is not causal and not necessarily the most compact one.
• The magnitude of a variable in vector $\vec{w}$ estimates the effect of removing that variable on the objective function of the SVM (i.e., the function that we want to minimize). However, this algorithm becomes sub-optimal when considering the effect of removing several variables at a time. This pitfall is addressed in the SVM-RFE algorithm that is presented next.


SVM-RFE variable selection algorithm

Algorithm:
1. Initialize V to all variables in the data.
2. Repeat:
3.   Train an SVM classifier using data for variables in V to estimate the vector $\vec{w}$.
4.   Estimate the prediction accuracy of variables in V using the above SVM classifier (e.g., by cross-validation).
5.   Remove from V a variable (or a subset of variables) with the smallest magnitude of the corresponding element in vector $\vec{w}$.
6. Until there are no variables in V.
7. Select the smallest subset of variables with the best prediction accuracy.

SVM-RFE variable selection algorithm

Figure: example of the recursive procedure — starting with 10,000 genes, train an SVM model and estimate prediction accuracy; keep the 5,000 genes important for classification and discard the ones not important for classification; train an SVM model on the 5,000 genes, keep 2,500, and so on.

• Unlike the simple SVM-based variable selection algorithm, SVM-RFE estimates the vector $\vec{w}$ many times to establish the ranking of the variables.
• Notice that the prediction accuracy should be estimated at each step in an unbiased fashion, e.g. by cross-validation.
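A minimal sketch of SVM-RFE (not the original code; scikit-learn's RFE wrapper and the parameter values are assumptions). The wrapper repeatedly re-trains the linear SVM and removes the variables with the smallest |w|:

```python
# SVM-RFE sketch using scikit-learn's RFE wrapper (illustrative).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           random_state=0)

svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=8, step=0.5)   # drop 50% per round
rfe.fit(X, y)

selected = rfe.support_                   # boolean mask of retained variables
acc = cross_val_score(svm, X[:, selected], y, cv=5).mean()
print(selected.sum(), "variables kept; CV accuracy = %.3f" % acc)
```

Note that for an unbiased accuracy estimate the whole RFE procedure (not just the final SVM) should be nested inside the cross-validation loop, as emphasized above.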


SVM variable selection in feature space

The real power of SVMs comes with application of the kernel trick, which maps data to a much higher-dimensional space ("feature space") where the data is linearly separable.

Figure: tumor and normal samples in the input space (Gene 1 vs. Gene 2) mapped by a kernel to a feature space where they become linearly separable.

SVM variable selection in feature space

• We have data for 100 SNPs (X1,…,X100) and some phenotype.
• We allow up to 3rd-order interactions, e.g. we consider:
  • X1,…,X100
  • X1², X1X2, X1X3,…,X1X100,…,X100²
  • X1³, X1X2X3, X1X2X4,…,X1X99X100,…,X100³
• Task: find the smallest subset of features (either SNPs or their interactions) that achieves the best predictive accuracy of the phenotype.
• Challenge: if we have a limited sample, we cannot explicitly construct and evaluate all SNPs and their interactions (176,851 features in total) as is done in classical statistics.


SVM variable selection in feature space

Heuristic solution: apply the algorithm SVM-FSMB that:
1. Uses SVMs with a polynomial kernel of degree 3 and selects M features (not necessarily input variables!) that have the largest weights in the feature space.
   E.g., the algorithm can select features like: X10, (X1X2), (X9X2X22), (X7²X98), and so on.
2. Applies the HITON-MB Markov blanket algorithm to find the Markov blanket of the phenotype using the M features from step 1.

Computing posterior class probabilities for SVM classifiers


Output of SVM classifier

1. SVMs output a class label (positive or negative) for each sample: $sign(\vec{w} \cdot \vec{x} + b)$.
2. One can also compute the distance from the hyperplane $\vec{w} \cdot \vec{x} + b = 0$ that separates the classes, e.g. $\vec{w} \cdot \vec{x} + b$. These distances can be used to compute performance metrics like the area under the ROC curve.

Figure: positive samples (y = +1) and negative samples (y = −1) on either side of the hyperplane.

Question: how can one use SVMs to estimate posterior class probabilities, i.e., P(class positive | sample x)?

Simple binning method

1. Train an SVM classifier in the Training set.
2. Apply it to the Validation set and compute distances from the hyperplane to each sample.

   Sample #   1    2    3    4    5   …   98    99    100
   Distance   2   −1    8    3    4   …   −2    0.3   0.8

3. Create a histogram with Q (e.g., say 10) bins using the above distances. Each bin has an upper and lower value in terms of distance.

Figure: data split into Training, Validation, and Testing sets; histogram of the number of samples in the Validation set vs. distance.


Simple binning method

4. Given a new sample from the Testing set, place it in the corresponding bin.
   E.g., sample #382 has distance to the hyperplane = 1, so it is placed in the bin [0, 2.5].
5. Compute the probability P(positive class | sample #382) as the fraction of true positives in this bin.
   E.g., this bin has 22 samples (from the Validation set), out of which 17 are positive ones, so we compute P(positive class | sample #382) = 17/22 = 0.77.

Figure: histogram of the number of samples in the Validation set vs. distance, with the bin containing the new sample highlighted.
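A minimal sketch of the binning procedure (the data split, bin count, and parameter values are illustrative assumptions):

```python
# Simple binning method for posterior probabilities (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                    random_state=0)

svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# steps 1-2: distances of validation samples to the separating hyperplane
d_valid = svm.decision_function(X_valid)

# step 3: Q equal-width bins over the validation distances
Q = 10
edges = np.linspace(d_valid.min(), d_valid.max(), Q + 1)
bin_of = lambda d: np.clip(np.digitize(d, edges) - 1, 0, Q - 1)

# fraction of true positives per bin (0.5 used for empty bins)
valid_bins = bin_of(d_valid)
p_positive = np.array([y_valid[valid_bins == q].mean()
                       if np.any(valid_bins == q) else 0.5 for q in range(Q)])

# steps 4-5: posterior probability of a new sample = fraction of positives in its bin
d_test = svm.decision_function(X_test)
print(p_positive[bin_of(d_test)][:5])     # P(positive class | sample) for 5 samples
```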

Platt's method

Convert distances output by the SVM to probabilities by passing them through the sigmoid filter:

$P(\text{positive class} \mid \text{sample}) = \frac{1}{1 + \exp(A d + B)}$

where d is the distance from the hyperplane and A and B are parameters.

Figure: sigmoid curve of P(positive class | sample) as a function of distance.


Platt's method

1. Train an SVM classifier in the Training set.
2. Apply it to the Validation set and compute distances from the hyperplane to each sample.

   Sample #   1    2    3    4    5   …   98    99    100
   Distance   2   −1    8    3    4   …   −2    0.3   0.8

3. Determine parameters A and B of the sigmoid function by minimizing the negative log-likelihood of the data from the Validation set.
4. Given a new sample from the Testing set, compute its posterior probability using the sigmoid function.
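A minimal sketch of Platt-style calibration (an approximation rather than Platt's exact fitting procedure: here the sigmoid parameters are fitted by a logistic regression on the one-dimensional distance, which maximizes the same likelihood; data and parameters are illustrative):

```python
# Platt-style sigmoid calibration of SVM distances (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                    random_state=0)

svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# fit P(positive | d) = 1 / (1 + exp(A*d + B)) on validation-set distances
d_valid = svm.decision_function(X_valid).reshape(-1, 1)
sigmoid = LogisticRegression(C=1e6).fit(d_valid, y_valid)   # ~unregularized fit

# posterior probabilities for new (testing) samples
d_test = svm.decision_function(X_test).reshape(-1, 1)
print(sigmoid.predict_proba(d_test)[:5, 1])
```

scikit-learn's SVC(probability=True) performs a similar sigmoid calibration internally.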

Part 3

• Case studies (taken from our research)
  1. Classification of cancer gene expression microarray data
  2. Text categorization in biomedicine
  3. Prediction of clinical laboratory values
  4. Modeling clinical judgment
  5. Using SVMs for feature selection
  6. Outlier detection in ovarian cancer proteomics data
• Software
• Conclusions
• Bibliography


1. Classification of cancer gene expression microarray data

Comprehensive evaluation of algorithms for classification of cancer microarray data

Main goals:
• Find the best-performing decision support algorithms for cancer diagnosis from microarray gene expression data;
• Investigate benefits of using gene selection and ensemble classification methods.


Classifiers

• K-Nearest Neighbors (KNN) — instance-based
• Backpropagation Neural Networks (NN) — neural networks
• Probabilistic Neural Networks (PNN) — neural networks
• Multi-Class SVM: One-Versus-Rest (OVR) — kernel-based
• Multi-Class SVM: One-Versus-One (OVO) — kernel-based
• Multi-Class SVM: DAGSVM — kernel-based
• Multi-Class SVM by Weston & Watkins (WW) — kernel-based
• Multi-Class SVM by Crammer & Singer (CS) — kernel-based
• Weighted Voting: One-Versus-Rest — voting
• Weighted Voting: One-Versus-One — voting
• Decision Trees: CART — decision trees

Ensemble classifiers

Figure: the dataset is given to Classifier 1, Classifier 2, …, Classifier N; their predictions (Prediction 1, Prediction 2, …, Prediction N), together with the dataset, are fed to an ensemble classifier that makes the final prediction.


Gene selection methods

1. Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion (a small sketch of S2N ranking is shown after this list);
2. Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;
3. Kruskal-Wallis nonparametric one-way ANOVA (KW);
4. Ratio of genes' between-categories to within-category sum of squares (BW).

Figure: genes ordered from highly discriminatory to uninformative.
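As an illustration of the first method, the sketch below computes a Golub-style signal-to-noise ratio in one-versus-rest fashion (this common definition is an assumption; the data are synthetic):

```python
# Signal-to-noise (S2N) gene ranking, one-versus-rest (illustrative sketch).
import numpy as np

def s2n_ovr(X, y):
    """X: samples x genes; y: integer class labels. Returns a classes-x-genes S2N matrix."""
    scores = []
    for c in np.unique(y):
        in_c, rest = X[y == c], X[y != c]
        s2n = (in_c.mean(axis=0) - rest.mean(axis=0)) / \
              (in_c.std(axis=0) + rest.std(axis=0) + 1e-12)
        scores.append(s2n)
    return np.vstack(scores)

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 100))            # 60 samples, 100 genes (synthetic)
y = rng.randint(0, 3, size=60)            # 3 tumor categories
top_genes = np.argsort(-np.abs(s2n_ovr(X, y)), axis=1)[:, :10]
print(top_genes)                          # 10 highest-ranked genes per class
```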

Performance metrics and statistical comparison

1. Accuracy
   + can compare to previous studies
   + easy to interpret & simplifies statistical comparison
2. Relative classifier information (RCI)
   + easy to interpret & simplifies statistical comparison
   + not sensitive to distribution of classes
   + accounts for difficulty of a decision problem

• Randomized permutation testing to compare accuracies of the classifiers (α = 0.05); one common form of such a test is sketched below.
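The following is a minimal sketch of a randomized permutation test for comparing the accuracies of two classifiers (the exact test used in the study may differ; the prediction vectors here are made-up):

```python
# Permutation test for the difference in accuracy of two classifiers (illustrative).
import numpy as np

def permutation_test(pred_a, pred_b, y_true, n_perm=10000, seed=0):
    rng = np.random.RandomState(seed)
    observed = np.mean(pred_a == y_true) - np.mean(pred_b == y_true)
    count = 0
    for _ in range(n_perm):
        swap = rng.rand(len(y_true)) < 0.5        # under H0, the two classifiers'
        pa = np.where(swap, pred_b, pred_a)       # predictions are exchangeable
        pb = np.where(swap, pred_a, pred_b)       # per test sample
        if abs(np.mean(pa == y_true) - np.mean(pb == y_true)) >= abs(observed):
            count += 1
    return count / n_perm                          # two-sided p-value

y        = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
pred_svm = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 0])
pred_knn = np.array([0, 1, 0, 0, 1, 1, 1, 0, 0, 1])
print(permutation_test(pred_svm, pred_knn, y))
```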


Microarray datasets

Dataset name     Samples   Variables (genes)   Categories   Reference
11_Tumors        174       12533               11           Su, 2001
14_Tumors        308       15009               26           Ramaswamy, 2001
9_Tumors         60        5726                9            Staunton, 2001
Brain_Tumor1     90        5920                5            Pomeroy, 2002
Brain_Tumor2     50        10367               4            Nutt, 2003
Leukemia1        72        5327                3            Golub, 1999
Leukemia2        72        11225               3            Armstrong, 2002
Lung_Cancer      203       12600               5            Bhattacherjee, 2001
SRBCT            83        2308                4            Khan, 2001
Prostate_Tumor   102       10509               2            Singh, 2002
DLBCL            77        5469                2            Shipp, 2002

Total: ~1300 samples, 74 diagnostic categories, 41 cancer types and 12 normal tissue types.

Summary of methods and datasets

• Gene selection methods (4): S2N one-versus-rest, S2N one-versus-one, non-parametric ANOVA (KW), BW ratio.
• Cross-validation designs (2): 10-fold CV, LOOCV.
• Performance metrics (2): accuracy, RCI.
• Statistical comparison: randomized permutation testing.
• Classifiers (11): MC-SVM one-versus-rest (OVR), MC-SVM one-versus-one (OVO), MC-SVM DAGSVM, MC-SVM method by WW, MC-SVM method by CS, KNN, backpropagation NN, probabilistic NN, weighted voting one-versus-rest, weighted voting one-versus-one, decision trees.
• Ensemble classifiers (7): based on outputs of the MC-SVMs and based on outputs of all classifiers (majority voting, decision trees, MC-SVM OVR, MC-SVM OVO, MC-SVM DAGSVM).
• Gene expression datasets (11), multicategory and binary diagnosis: 11_Tumors, 14_Tumors, 9_Tumors, Brain_Tumor1, Brain_Tumor2, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumors, DLBCL.


Results without gene selection

Figure: classification accuracy (%) of MC-SVM methods (OVR, OVO, DAGSVM, WW, CS) and non-SVM methods (KNN, NN, PNN) on each dataset (9_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2, 11_Tumors, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumor, DLBCL).

Results with gene selection

Figure (left): improvement in accuracy (%) obtained by gene selection, averaged over the four datasets 9_Tumors, 14_Tumors, Brain_Tumor1, and Brain_Tumor2, for OVR, OVO, DAGSVM, WW, CS (SVM methods) and KNN, NN, PNN (non-SVM methods).
Figure (right): diagnostic performance before and after gene selection for SVM and non-SVM methods.

The average reduction in the number of genes is 10–30 times.


Comparison with previously published results

Figure: accuracy (%) of multiclass SVMs (this study) compared with multiple specialized classification methods (original primary studies) on the eleven datasets.

Summary of results

• Multi-class SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV.
• Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms;
• Ensemble classification does not improve performance of SVMs and other classifiers;
• Results obtained by SVMs compare favorably with the literature.


Random Forest (RF) classifiers

• Appealing properties
  – Work when # of predictors > # of samples
  – Embedded gene selection
  – Incorporate interactions
  – Based on theory of ensemble learning
  – Can work with binary & multiclass tasks
  – Do not require much fine-tuning of parameters
• Strong theoretical claims
• Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods.

Key principles of RF classifiers

Figure: 1) generate bootstrap samples from the training data; 2) random gene selection; 3) fit unpruned decision trees; 4) apply to testing data & combine predictions.


Results without gene selection

• SVMs nominally outperform RFs in 15 datasets, RFs outperform SVMs in 4 datasets, and the algorithms perform exactly the same in 3 datasets.
• In 7 datasets SVMs outperform RFs statistically significantly.
• On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI.

Results with gene selection

• SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, and the algorithms perform exactly the same in 2 datasets.
• In 1 dataset SVMs outperform RFs statistically significantly.
• On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.


2. Text categorization in biomedicine

Models to categorize content and quality: main idea

1. Utilize existing (or easy-to-build) training corpora.
2. Simple document representations (i.e., typically stemmed and weighted words in title and abstract, MeSH terms if available; occasionally addition of Metamap CUIs, author info) as "bag-of-words".


Models to categorize content and quality: main idea

3. Train SVM models that capture implicit categories of meaning or quality criteria.
4. Evaluate models' performances
   – with nested cross-validation or other appropriate error estimators;
   – use primarily AUC as well as other metrics (sensitivity, specificity, PPV, precision/recall curves, HIT curves, etc.).
5. Evaluate performance prospectively & compare to prior cross-validation estimates.

Figure: labeled examples are used to train models that are then applied to unseen examples; bar chart comparing estimated performance and 2005 performance for the Txmt, Diag, Prog, and Etio categories.

Models to categorize content and quality: some notable results

Performance of the machine learning (SVM) models by category (ACPJ gold standard):

Category    Average AUC   Range over n folds
Treatment   0.97*         0.96 – 0.98
Etiology    0.94*         0.89 – 0.95
Prognosis   0.95*         0.92 – 0.97
Diagnosis   0.95*         0.93 – 0.98

Comparison with other ranking methods (AUC by category):

Method                        Treatment   Etiology   Prognosis   Diagnosis
Google Pagerank               0.54        0.54       0.43        0.46
Yahoo Webranks                0.56        0.49       0.52        0.52
Impact Factor 2005            0.67        0.62       0.51        0.52
Web page hit count            0.63        0.63       0.58        0.57
Bibliometric Citation Count   0.76        0.69       0.67        0.60
Machine Learning Models       0.96        0.95       0.95        0.95

1. SVM models have excellent ability to identify high-quality PubMed documents according to the ACPJ gold standard.
2. SVM models have better classification performance than PageRank, Yahoo ranks, Impact Factor, Web page hit counts, and bibliometric citation counts on the Web according to the ACPJ gold standard.


Models to categorize content and quality: some notable results

Gold standard: SSOAB             Area under the ROC curve*
SSOAB-specific filters           0.893
Citation Count                   0.791
ACPJ Txmt-specific filters       0.548
Impact Factor (2001)             0.549
Impact Factor (2005)             0.558

Figure: sensitivity at fixed specificity and specificity at fixed sensitivity of query filters vs. learning models, for the Treatment, Etiology, Diagnosis, and Prognosis categories.

3. SVM models have better classification performance than PageRank, Impact Factor, and citation count in Medline for the SSOAB gold standard.
4. SVM models have better sensitivity/specificity in PubMed than CQFs at comparable thresholds according to the ACPJ gold standard.

Other applications of SVMs to text categorization

Model                     Area under the curve
Machine Learning Models   0.93
Quackometer*              0.67
Google                    0.63

1. Identifying web pages with misleading treatment information according to a special-purpose gold standard (Quack Watch). SVM models outperform Quackometer and Google ranks in the tested domain of cancer treatment.
2. Prediction of future paper citation and instrumentality of citations.


3. Prediction of clinical laboratory values

Dataset generation and experimental design

• The StarPanel database contains ~8·10⁶ lab measurements of ~100,000 in-patients from Vanderbilt University Medical Center.
• Lab measurements were taken between 01/1998 and 10/2002.

For each combination of lab test and normal range, we generated the following datasets: 01/1998–05/2001 — Training set (with a Validation set of 25% of Training); 06/2001–10/2002 — Testing set.


Query-based approach for prediction of clinical lab values

Figure: for every data model (from a database of data models), an SVM classifier is trained on the training data and evaluated on the validation data; the optimal data model and SVM classifier are selected based on validation performance; then, for every testing sample, the optimal SVM classifier is applied to produce the prediction.

Classification results

Area under the ROC curve (without feature selection), by lab test and range of normal values.

Including cases with K=0 (i.e., samples with no prior lab measurements):
Lab test   >1       <99      [1, 99]   >2.5     <97.5    [2.5, 97.5]
BUN        80.4%    99.1%    76.6%     87.1%    98.2%    70.7%
Ca         72.8%    93.4%    55.6%     81.4%    81.4%    63.4%
CaIo       74.1%    60.0%    50.1%     64.7%    72.3%    57.7%
CO2        82.0%    93.6%    59.8%     84.4%    94.5%    56.3%
Creat      62.8%    97.7%    89.1%     91.5%    98.1%    87.7%
Mg         56.9%    70.0%    49.1%     58.6%    76.9%    59.1%
Osmol      50.9%    60.8%    60.8%     91.0%    90.5%    68.0%
PCV        74.9%    99.2%    66.3%     80.9%    80.6%    67.1%
Phos       74.5%    93.6%    64.4%     71.7%    92.2%    69.7%

Excluding cases with K=0 (i.e., samples with no prior lab measurements):
Lab test   >1       <99      [1, 99]   >2.5     <97.5    [2.5, 97.5]
BUN        75.9%    93.4%    68.5%     81.8%    92.2%    66.9%
Ca         67.5%    80.4%    55.0%     77.4%    70.8%    60.0%
CaIo       63.5%    52.9%    58.8%     46.4%    66.3%    58.7%
CO2        77.3%    88.0%    53.4%     77.5%    90.5%    58.1%
Creat      62.2%    88.4%    83.5%     88.4%    94.9%    83.8%
Mg         58.4%    71.8%    64.2%     67.0%    72.5%    62.1%
Osmol      77.9%    64.8%    65.2%     79.2%    82.4%    71.5%
PCV        62.3%    91.6%    69.7%     76.5%    84.6%    70.2%
Phos       70.8%    75.4%    60.4%     68.0%    81.8%    65.9%

A total of 84,240 SVM classifiers were built for 16,848 possible data models.


Improving predictive power and parsimony of a BUN model using feature selection

Model description: Test name: BUN; Range of normal values: < 99th percentile; Data modeling: SRT; Number of previous measurements: 5; Use variables corresponding to hospitalization units: yes; Number of prior hospitalizations used: 2.

Dataset description (3442 variables):
                 N samples (total)   N abnormal samples
Training set     3749                78
Validation set   1251                27
Testing set      836                 16

Figure: histogram of test BUN values (frequency of measurements vs. test value, 0–250), showing normal and abnormal values.

Classification performance (area under ROC curve):
                     All       RFE_Linear   RFE_Poly   HITON_PC   HITON_MB
Validation set       95.29%    98.78%       98.76%     99.12%     98.90%
Testing set          94.72%    99.66%       99.63%     99.16%     99.05%
Number of features   3442      26           3          11         17

Features selected by each method (cf. the classification performance table above):

RFE_Linear (26 features): LAB: PM_1(BUN); LAB: PM_2(Cl); LAB: DT(PM_3(K)); LAB: DT(PM_3(Creat)); LAB: Test Unit J018 (Test Ca, PM 3); LAB: DT(PM_4(Cl)); LAB: DT(PM_3(Mg)); LAB: PM_1(Cl); LAB: PM_3(Gluc); LAB: DT(PM_1(CO2)); LAB: DT(PM_4(Gluc)); LAB: PM_3(Mg); LAB: DT(PM_5(Mg)); LAB: PM_1(PCV); LAB: PM_2(BUN); LAB: Test Unit 11NM (Test PCV, PM 2); LAB: Test Unit 7SCC (Test Mg, PM 3); LAB: DT(PM_2(Phos)); LAB: DT(PM_3(CO2)); LAB: DT(PM_2(Gluc)); LAB: DT(PM_5(CaIo)); DEMO: Hospitalization Unit TVOS; LAB: PM_1(Phos); LAB: PM_2(Phos); LAB: Test Unit 11NM (Test K, PM 5); LAB: Test Unit VHR (Test CaIo, PM 1).

RFE_Poly (3 features): LAB: PM_1(BUN); LAB: Indicator(PM_1(Mg)); LAB: Test Unit NO_TEST_MEASUREMENT (Test CaIo, PM 1).

HITON_PC (11 features): LAB: PM_1(BUN); LAB: PM_5(Creat); LAB: PM_1(Phos); LAB: Indicator(PM_1(BUN)); LAB: Indicator(PM_5(Creat)); LAB: Indicator(PM_1(Mg)); LAB: DT(PM_4(Creat)); LAB: Test Unit 7SCC (Test Ca, PM 1); LAB: Test Unit RADR (Test Ca, PM 5); LAB: Test Unit 7SMI (Test PCV, PM 4); DEMO: Gender.

HITON_MB (17 features): LAB: PM_1(BUN); LAB: PM_5(Creat); LAB: PM_3(PCV); LAB: PM_1(Mg); LAB: PM_1(Phos); LAB: Indicator(PM_4(Creat)); LAB: Indicator(PM_5(Creat)); LAB: Indicator(PM_3(PCV)); LAB: Indicator(PM_1(Phos)); LAB: DT(PM_4(Creat)); LAB: Test Unit 11NM (Test BUN, PM 2); LAB: Test Unit 7SCC (Test Ca, PM 1); LAB: Test Unit RADR (Test Ca, PM 5); LAB: Test Unit 7SMI (Test PCV, PM 4); LAB: Test Unit CCL (Test Phos, PM 1); DEMO: Gender; DEMO: Age.


4. Modeling clinical judgment

Methodological framework and study outline

Figure: patients are examined by physicians, who may follow guidelines. For each of the 6 physicians, a table is assembled with each patient's features (f1…fm), the physician's clinical diagnosis (cd), and the histological gold-standard diagnosis (hd); the features and gold standard are the same across physicians, while the clinical diagnoses differ across physicians.

Study aims:
• Predict clinical decisions;
• Identify predictors ignored by physicians;
• Explain each physician's diagnostic model;
• Compare physicians with each other and with guidelines.


Clinical context of experiment

Malignant melanoma is the most dangerous form of skin cancer. Incidence & mortality have been constantly increasing in the last decades.

Physicians and patients

• Patients: N = 177 (76 melanomas, 101 nevi).
• Dermatologists: N = 6 (3 experts, 3 non-experts).
• Data collection: patients seen prospectively from 1999 to 2002 at the Department of Dermatology, S. Chiara Hospital, Trento, Italy; inclusion criteria: histological diagnosis and >1 digital image available. Diagnoses made in 2004.

Features

Lesion location, family history of melanoma, irregular border, streaks (radial streaming, pseudopods), max-diameter, Fitzpatrick's photo-type, number of colors, slate-blue veil, min-diameter, sunburn, atypical pigmented network, whitish veil, evolution, ephelis, abrupt network cut-off, globular elements, age, lentigos, regression-erythema, comedo-like openings / milia-like cysts, gender, asymmetry, hypo-pigmentation, telangiectasia.


Method to explain physician-specific SVM models

Figure: regular learning — feature selection (FS) and building an SVM "black box"; meta-learning — apply the SVM to the data and build a decision tree (DT) on the SVM's outputs to explain it (a sketch of this meta-learning step follows below).
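A minimal sketch of this meta-learning step (illustrative assumptions throughout; not the study's exact pipeline): train the SVM "black box", then fit a decision tree to the SVM's own predictions so that the model can be explained with human-readable rules.

```python
# Decision-tree explanation of an SVM "black box" (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

black_box = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
svm_predictions = black_box.predict(X)

# the tree approximates the SVM's decision rule with readable splits
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, svm_predictions)
print(export_text(surrogate))
```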

Results: Predicting physicians' judgments

Performance (AUC) with the number of features used in parentheses:

Physician     All         HITON_PC   HITON_MB   RFE
Expert 1      0.94 (24)   0.92 (4)   0.92 (5)   0.95 (14)
Expert 2      0.92 (24)   0.89 (7)   0.90 (7)   0.90 (12)
Expert 3      0.98 (24)   0.95 (4)   0.95 (4)   0.97 (19)
NonExpert 1   0.92 (24)   0.89 (5)   0.89 (6)   0.90 (22)
NonExpert 2   1.00 (24)   0.99 (6)   0.99 (6)   0.98 (11)
NonExpert 3   0.89 (24)   0.89 (4)   0.89 (6)   0.87 (10)


Results: Physician-specific models

Figure: physician-specific decision-tree models.

Results: Explaining physician agreement

Example — Patient 001: blue veil = yes, irregular border = no, streaks = yes; traced through the decision-tree explanations of Expert 1 (AUC = 0.92, R² = 99%) and Expert 3 (AUC = 0.95, R² = 99%).


Results: Explaining physician disagreement

Example — Patient 002: blue veil = no, irregular border = no, streaks = yes, number of colors = 3, evolution = no; traced through the decision-tree explanations of Expert 1 (AUC = 0.92, R² = 99%) and Expert 3 (AUC = 0.95, R² = 99%).

Results: Guideline compliance

Physician                         Reported guidelines                      Compliance
Experts 1, 2, 3; non-expert 1     Pattern analysis                         Non-compliant: they ignore the majority of features (17 to 20) recommended by pattern analysis.
Non-expert 2                      ABCDE rule                               Non-compliant: asymmetry, irregular border and evolution are ignored.
Non-expert 3                      Non-standard; reports using 7 features   Non-compliant: 2 out of 7 reported features are ignored while some non-reported ones are not.

On the contrary: in all guidelines, the more predictors present, the higher the likelihood of melanoma. All physicians were compliant with this principle.


5. Using SVMs for feature selection

Feature selection methods (non-causal)
• SVM-RFE (this is an SVM-based feature selection method)
• Univariate + wrapper
• Random forest-based
• LARS-Elastic Net
• RELIEF + wrapper
• L0-norm
• Forward stepwise feature selection
• No feature selection

Causal feature selection methods
• HITON-PC
• HITON-MB (this method outputs a Markov blanket of the response variable, under assumptions)
• IAMB
• BLCD
• K2MB
• …


13 real datasets were used to evaluate feature selection methods

Dataset name       Domain              Number of variables   Number of samples   Target                                       Data type
Infant_Mortality   Clinical            86                    5,337               Died within the first year                   discrete
Ohsumed            Text                14,373                5,000               Relevant to neonatal diseases                continuous
ACPJ_Etiology      Text                28,228                15,779              Relevant to etiology                         continuous
Lymphoma           Gene expression     7,399                 227                 3-year survival: dead vs. alive              continuous
Gisette            Digit recognition   5,000                 7,000               Separate 4 from 9                            continuous
Dexter             Text                19,999                600                 Relevant to corporate acquisitions           continuous
Sylva              Ecology             216                   14,394              Ponderosa pine vs. everything else           continuous & discrete
Ovarian_Cancer     Proteomics          2,190                 216                 Cancer vs. normals                           continuous
Thrombin           Drug discovery      139,351               2,543               Binding to thrombin                          discrete (binary)
Breast_Cancer      Gene expression     17,816                286                 Estrogen-receptor positive (ER+) vs. ER−     continuous
Hiva               Drug discovery      1,617                 4,229               Activity to AIDS HIV infection               discrete (binary)
Nova               Text                16,969                1,929               Separate politics from religion topics       discrete (binary)
Bankruptcy         Financial           147                   7,063               Personal bankruptcy                          continuous & discrete

Classification performance vs. proportion of selected features

Figure: classification performance (AUC) as a function of the proportion of selected features for HITON-PC with G2 test and for RFE (original plot and magnified view).


Statistical comparison of predictivity and reduction of features

SVM-RFE (4 variants) vs. HITON-PC:

Predictivity                    Reduction
P-value    Nominal winner       P-value    Nominal winner
0.9754     SVM-RFE              0.0046     HITON-PC
0.8030     SVM-RFE              0.0042     HITON-PC
0.1312     HITON-PC             0.3634     HITON-PC
0.1008     HITON-PC             0.6816     SVM-RFE

• Null hypothesis: SVM-RFE and HITON-PC perform the same;
• Use permutation-based statistical test with alpha = 0.05.

Simulated datasets with known causal structure used to compare algorithms


Comparison of SVM-RFE and HITON-PC

Comparison of all methods in terms of causal graph distance

Figure: causal graph distance of SVM-RFE compared with HITON-PC based causal methods.


Summary results

Figure: summary comparison of HITON-PC based causal methods, HITON-PC-FDR methods, and SVM-RFE.

Statistical comparison of graph distance

Comparison: average HITON-PC-FDR with G2 test vs. average SVM-RFE

Sample size   P-value    Nominal winner
200           < 0.0001   HITON-PC-FDR
500           0.0028     HITON-PC-FDR
5000          < 0.0001   HITON-PC-FDR

• Null hypothesis: SVM-RFE and HITON-PC-FDR perform the same;
• Use permutation-based statistical test with alpha = 0.05.


6. Outlier detection in ovarian cancer proteomics data

Ovarian cancer data

Figure: Data Set 1 (top) and Data Set 2 (bottom) — mass-spectrometry profiles (clock ticks 4000–12000) of cancer, normal, and other samples. The same set of 216 patients was profiled using the Ciphergen H4 ProteinChip array (dataset 1) and the Ciphergen WCX2 ProteinChip array (dataset 2).

The gross break at the "benign disease" juncture in dataset 1 and the similarity of the profiles to those in dataset 2 suggest a change of protocol in the middle of the first experiment.


Experiments with one-class SVM

Assume that sets {A, B} are normal and {C, D, E, F} are outliers. Also, assume that we do not know which are the normal and which are the outlier samples.

• Experiment 1: train one-class SVM on {A, B, C} and test on {A, B, C}: area under ROC curve = 0.98.
• Experiment 2: train one-class SVM on {A, C} and test on {B, D, E, F}: area under ROC curve = 0.98.

Figure: Data Set 1 (top) and Data Set 2 (bottom) with cancer, normal, and other samples over clock ticks 4000–12000.

Software



Interactive media and animations

SVM applets
• http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
• http://www.smartlab.dibe.unige.it/Files/sw/Applet%20SVM/svmapplet.html
• http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html
• http://www.dsl-lab.org/svm_tutorial/demo.html (requires Java 3D)

Animations
• Support Vector Machines:
  http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2moons.avi
  http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2Gauss.avi
  http://www.youtube.com/watch?v=3liCbRZPrZA
  http://www.youtube.com/watch?v=AC7afKLgGTs
• Support Vector Regression:
  http://www.cs.ust.hk/irproj/Regularization%20Path/movie/ga0.5lam1.avi

Several SVM implementations for beginners

• GEMS: http://www.gems‐system.org

• Weka: http://www.cs.waikato.ac.nz/ml/weka/

• Spider (for Matlab): http://www.kyb.mpg.de/bs/people/spider/

• CLOP (for Matlab): http://clopinet.com/CLOP/



Several SVM implementations for intermediate users

• LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  – General purpose
  – Implements binary SVM, multiclass SVM, SVR, one-class SVM
  – Command-line interface
  – Code/interface for C/C++/C#, Java, Matlab, R, Python, Perl
• SVMLight: http://svmlight.joachims.org/
  – General purpose (designed for text categorization)
  – Implements binary SVM, multiclass SVM, SVR
  – Command-line interface
  – Code/interface for C/C++, Java, Matlab, Python, Perl

More software links at http://www.support-vector-machines.org/SVM_soft.html and http://www.kernel-machines.org/software

Conclusions



• Excellent empirical performance for many biomedical datasets.
• Can learn efficiently both simple linear and very complex nonlinear functions by using the "kernel trick".
• Address outliers and noise by using "slack variables".
• Internal capacity control (regularization) and overfitting avoidance.
• Do not require direct access to data and can work with only dot-products of data points.
• Require solution of a convex QP optimization problem that has a global minimum and can be solved efficiently. Thus optimizing the SVM model parameters is not subject to heuristic optimization failures that plague other machine learning methods (e.g., neural networks, decision trees).
• Produce sparse models that are defined only by a small subset of training points ("support vectors").
• Do not have more free parameters than the number of support vectors, irrespective of the number of variables in the dataset.
• Depend on data scaling/normalization.
• Can be used with established data analysis protocols for model selection and error estimation, such as cross-validation.
• When used for feature selection, employ heuristic strategies to identify relevant features and possible mechanisms.

Bibliography


Part 1: Support vector machines for binary classification: classical formulation

• Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT) 1992:144-152.
• Burges CJC: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 1998, 2:121-167.
• Cristianini N, Shawe-Taylor J: An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press; 2000.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Herbrich R: Learning kernel classifiers: theory and algorithms. Cambridge, Mass: MIT Press; 2002.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning. Cambridge, Mass: MIT Press; 1999.
• Shawe-Taylor J, Cristianini N: Kernel methods for pattern analysis. Cambridge, UK: Cambridge University Press; 2004.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
• Vapnik VN: The nature of statistical learning theory. New York: Springer; 2000.

Part 1: Basic principles of statistical machine learning

• Aliferis CF, Statnikov A, Tsamardinos I: Challenges in the analysis of mass-throughput data: a technical commentary from the statistical machine learning perspective. Cancer Informatics 2006, 2:133-162.
• Duda RO, Hart PE, Stork DG: Pattern classification. 2nd edition. New York: Wiley; 2001.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Mitchell T: Machine learning. New York, NY, USA: McGraw-Hill; 1997.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
• Vapnik VN: The nature of statistical learning theory. New York: Springer; 2000.


Part 2: Model selection for SVMs

• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI) 1995, 2:1137-1145.
• Scheffer T: Error estimation and model selection. Ph.D. Thesis, Technischen Universität Berlin, School of Computer Science; 1999.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005, 74:491-503.
• Weiss SM, Kulikowski CA: Computer systems that learn: classification and prediction methods from statistics, neural nets, machine learning, and expert systems. San Mateo, California: M. Kaufmann Publishers; 1991.

Part 2: SVMs for multicategory data

• Crammer K, Singer Y: On the learnability and design of output codes for multiclass problems. Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT) 2000.
• Platt JC, Cristianini N, Shawe-Taylor J: Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems (NIPS) 2000, 12:547-553.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning. Cambridge, Mass: MIT Press; 1999.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.
• Weston J, Watkins C: Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium On Artificial Neural Networks 1999, 4:6.


Part 2: Support vector regression

• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Smola AJ, Schölkopf B: A tutorial on support vector regression. Statistics and Computing 2004, 14:199-222.

Part 2: Novelty detection with SVM-based methods and Support Vector Clustering

• Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC: Estimating the support of a high-dimensional distribution. Neural Computation 2001, 13:1443-1471.
• Tax DMJ, Duin RPW: Support vector domain description. Pattern Recognition Letters 1999, 20:1191-1199.
• Manevitz LM, Yousef M: One-class SVMs for document classification. The Journal of Machine Learning Research 2002, 2:154.
• Ben-Hur A, Horn D, Siegelmann HT, Vapnik V: Support vector clustering. Journal of Machine Learning Research 2001, 2:125-137.

Part 2: SVM-based variable selection

• Guyon I, Elisseeff A: An introduction to variable and feature selection. Journal of Machine Learning Research 2003, 3:1157-1182.

• Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Machine Learning 2002, 46:389-422.

• Hardin D, Tsamardinos I, Aliferis CF: A theoretical characterization of linear SVM-based feature selection. Proceedings of the Twenty First International Conference on Machine Learning (ICML) 2004.

• Statnikov A, Hardin D, Aliferis CF: Using SVM weight-based methods to identify causally relevant and non-causally relevant variables. Proceedings of the NIPS 2006 Workshop on Causality and Feature Selection 2006.

• Tsamardinos I, Brown LE: Markov Blanket-Based Variable Selection in Feature Space. Technical report DSL-08-01 2008.

• Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V: Feature selection for SVMs. Advances in Neural Information Processing Systems (NIPS) 2000, 13:668-674.

• Weston J, Elisseeff A, Schölkopf B, Tipping M: Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 2003, 3:1439-1461.

• Zhu J, Rosset S, Hastie T, Tibshirani R: 1-norm support vector machines. Advances in Neural Information Processing Systems (NIPS) 2004, 16.

208


Part 2: Computing posterior class probabilities for SVM classifiers

• Platt JC: Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers. Edited by Smola A, Bartlett B, Schölkopf B, Schuurmans D. Cambridge, MA: MIT Press; 2000.

Part 3: Classification of cancer gene expression microarray data (Case Study 1)

• Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7:3.

• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.

• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005, 74:491-503.

• Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008, 9:319.

209

Part 3: Text Categorization in Biomedicine (Case Study 2)

• Aphinyanaphongs Y, Aliferis CF: Learning Boolean queries for article quality filtering. Medinfo 2004, 11:263-267.

• Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Hardin D, Aliferis CF: Text categorization models for high-quality article retrieval in internal medicine. J Am Med Inform Assoc 2005, 12:207-216.

• Aphinyanaphongs Y, Statnikov A, Aliferis CF: A comparison of citation metrics to machine learning filters for the identification of high quality MEDLINE documents. J Am Med Inform Assoc 2006, 13:446-455.

• Aphinyanaphongs Y, Aliferis CF: Prospective validation of text categorization models for identifying high-quality content-specific articles in PubMed. AMIA 2006 Annual Symposium Proceedings 2006.

• Aphinyanaphongs Y, Aliferis C: Categorization Models for Identifying Unproven Cancer Treatments on the Web. MEDINFO 2007.

• Fu L, Aliferis CF: Models for Predicting and Explaining Citation Count of Biomedical Articles. AMIA 2008 Annual Symposium Proceedings 2008.

• Fu L, Aliferis CF: Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics 2010.

210


Part 3: Modeling clinical judgment (Case Study 4)

• Sboner A, Aliferis CF: Modeling clinical judgment and implicit guideline compliance in the diagnosis of melanomas using machine learning. AMIA 2005 Annual Symposium Proceedings 2005:664-668.

Part 3: Using SVMs for feature selection (Case Study 5)

• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification. Part II: Analysis and Extensions. Journal of Machine Learning Research 2010.

• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification. Part I: Algorithms and Empirical Evaluation. Journal of Machine Learning Research 2010.

211

Part 3: Outlier detection in ovarian cancer proteomics data (Case Study 6)

• Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDI‐TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics 2004, 20:777‐785.

• Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359:572‐577.

212


Thank you for your attention!
Questions/Comments?

Email: [email protected]

URL: http://ww.nyuinformatics.org

213

Upcoming book on SVMs
World Scientific Publishing (in press), 2010

214