
Page 1: Kernel Methods and Relational Learning in Computational Biology

Kernel Methods and Relational Learning in Computational Biology

ir. Michiel Stock

Faculty of Bioscience Engineering, Ghent University

November 2014

KERMIT


Page 2: Kernel Methods and Relational Learning in Computational Biology

Outline

1 Introduction

2 Kernel methods
   Theoretical overview
   Dealing with sequences
   Dealing with graphs
   Other kernels

3 Learning relations
   Kronecker kernels
   Conditional ranking

4 Predicting enzyme function
   Defining the problem
   Results

5 Conclusions


Page 3: Kernel Methods and Relational Learning in Computational Biology

Introduction

Introduction


Page 4: Kernel Methods and Relational Learning in Computational Biology

Introduction

Introductory example: drug design

Strategy for curing Alzheimer’s disease

Find compounds with good ADMET properties that selectively bind cholinesterase and amyloid precursor protein


Page 5: Kernel Methods and Relational Learning in Computational Biology

Introduction

Labels: known protein-ligand interactions

[Figure: bipartite graph between proteins and ligands, with edges weighted by the known interaction strengths (e.g. 0.2 to 1)]

Page 6: Kernel Methods and Relational Learning in Computational Biology

Introduction

The targets: features for proteins

Possible representations:

amino acid sequence

3D structure

gene expression

cellular location

phylogenetic profiles

...


Page 7: Kernel Methods and Relational Learning in Computational Biology

Introduction

The ligands: features for compounds

Possible representations:

SMILES format and other text-based representations

coloured graph representation

fingerprints based on physicochemical descriptors

...


Page 8: Kernel Methods and Relational Learning in Computational Biology

Introduction

Computational biology deals with interesting problems

We deal with objects that are:

high-dimensional (e.g. microarray or proteomics data)

structured (e.g. gene sequences, small molecules, interaction networks, phylogenetic trees...)

heterogeneous (e.g. vectors, sequences and graphs describing the same protein)

available in large quantities (e.g. more than 10^6 known protein sequences)

noisy (e.g. many features are not relevant)


Page 9: Kernel Methods and Relational Learning in Computational Biology

Introduction

Computational biology often deals with interactions

Relational learning

Predicting properties of pairs of objects, which can be of different types.


Page 10: Kernel Methods and Relational Learning in Computational Biology

Kernel methods

Kernel methods


Page 11: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Theoretical overview

Formal definition of a kernel

Kernels are (typically non-linear) similarity functions defined on pairs of objects x, x′ ∈ X.

Definition

A function k : X × X → R is called a positive definite kernel if it is symmetric, that is, k(x, x′) = k(x′, x) for any two objects x, x′ ∈ X, and positive semi-definite, that is,

∑_{i=1}^{N} ∑_{j=1}^{N} c_i c_j k(x_i, x_j) ≥ 0

for any N > 0, any choice of N objects x_1, . . . , x_N ∈ X, and any choice of real numbers c_1, . . . , c_N ∈ R.

Can be seen as generalized covariances.
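As a quick sanity check (a minimal sketch of my own, not from the talk), symmetry and positive semi-definiteness of a candidate kernel can be verified numerically on a finite sample; the RBF kernel below is just an example:

```python
import numpy as np

def is_psd_kernel(k, xs, tol=1e-10):
    """Check symmetry and positive semi-definiteness of k on a sample."""
    K = np.array([[k(x, y) for y in xs] for x in xs])
    symmetric = np.allclose(K, K.T)
    # A symmetric matrix is PSD iff all its eigenvalues are non-negative.
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)
    return symmetric and psd

# Example: the Gaussian (RBF) kernel is positive definite.
rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))
xs = [np.random.randn(5) for _ in range(20)]
print(is_psd_kernel(rbf, xs))  # True (up to numerical tolerance)
```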


Page 12: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Theoretical overview

Interpretation of kernels

Suppose an object x has an implicit feature representation φ(x) ∈ F. A kernel function can be seen as a dot product in this feature space:

k(x, x′) = 〈φ(x), φ(x′)〉

Linear models in this feature space F can be made:

y(x) = w^T φ(x) = ∑_n a_n k(x_n, x)

[Figure: the feature map φ sends objects from X to the feature space F, where the kernel evaluates to 〈φ(x), φ(x′)〉]
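To make the implicit feature map tangible, here is a tiny sketch (my own illustration) showing that the homogeneous polynomial kernel k(x, x′) = 〈x, x′〉² equals an explicit dot product in the space of degree-2 monomials:

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, x') = <x, x'>^2 on 2D inputs:
    the degree-2 monomials (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, xp = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(x, xp) ** 2, np.dot(phi(x), phi(xp)))  # both equal 16.0
```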


Page 13: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Theoretical overview

Many kernel methods exist

Examples of popular kernel methods:

Support vector machines (SVM)

Regularized least squares (RLS)

Kernel principal component analysis (KPCA)

The learning algorithm is independent of the kernel representation!

[Figures: illustrations of an SVM and of KPCA]


Page 14: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Dealing with sequences

Kernels using sequence alignment

sequence alignment optimises a score of how well the residues of two sequences match

use this score as a kernel value (a similarity measure for sequences); note that raw alignment scores are not guaranteed to be positive semi-definite, so corrected variants are often used in practice


Page 15: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Dealing with sequences

Kernels using substrings

Spectrum kernel (SK)

The SK considers the number of k-mers m that two sequences s_i and s_j have in common:

SK_k(s_i, s_j) = ∑_{m∈Σ^k} N(m, s_i) · N(m, s_j)

with N(m, s) the number of occurrences of k-mer m in sequence s. Many modifications exist.
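A minimal Python sketch of the spectrum kernel (my own illustration, not code from the talk):

```python
from collections import Counter

def spectrum_kernel(s1: str, s2: str, k: int = 3) -> int:
    """SK_k(s1, s2): sum over all k-mers m of N(m, s1) * N(m, s2)."""
    c1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    c2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    # Only k-mers occurring in both sequences contribute to the sum.
    return sum(n1 * c2[m] for m, n1 in c1.items() if m in c2)

print(spectrum_kernel("GATTACA", "ATTACCA", k=2))  # 5
```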


Page 16: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Dealing with graphs

What is a graph?

Graph

A graph is a set of objects, called vertices (or nodes), connected through edges.

Graphs can show the structure of an object or interactions between different objects.

Graphs are important in bioinformatics!

Page 17: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Dealing with graphs

Comparing nodes within a graph

Diffusion kernel

Constructing a similarity between vertices within the same graph.

Based on performing a random walk on a graph.

Captures the long-range relationships between vertices.

Inspired by the heat equation: the kernel quantifies how quickly 'heat' can spread from one node to another.
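A small sketch of one classical construction (my own illustration, assuming the heat-equation form K = exp(−βL), with L the graph Laplacian and β a diffusion parameter):

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Diffusion kernel K = exp(-beta * L), with L the graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A  # Laplacian of the (undirected) graph
    return expm(-beta * L)          # matrix exponential: 'heat' spread

# Path graph 0 - 1 - 2: node 0 should be more similar to 1 than to 2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
K = diffusion_kernel(A, beta=0.5)
print(K[0, 1] > K[0, 2])  # True
```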


Page 18: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Dealing with graphs

Comparing two separate graphs

Graph kernel

Constructing a similarity between graphs.

Also based on performing arandom walk on both graphsand counting the number ofmatching walks.Usually very computationallydemanding!

[Figures: applications in chemoinformatics (comparing small-molecule graphs) and in structural bioinformatics (comparing two protein structures, A and B)]
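A minimal sketch of a geometric random-walk graph kernel (my own illustration): walks are matched by moving to the direct product graph, whose adjacency matrix is the Kronecker product of the two input adjacency matrices, which is exactly why the computation becomes demanding for larger graphs:

```python
import numpy as np

def random_walk_kernel(A1: np.ndarray, A2: np.ndarray, lam: float = 0.1) -> float:
    """Geometric random-walk kernel: weighted count of matching walks.

    A walk exists in the direct product graph exactly when the
    corresponding walks exist in both input graphs simultaneously.
    """
    Ax = np.kron(A1, A2)  # adjacency matrix of the direct product graph
    n = Ax.shape[0]
    # (I - lam*Ax)^-1 = I + lam*Ax + lam^2*Ax^2 + ... sums walks of all
    # lengths; lam must be small enough for the series to converge.
    W = np.linalg.inv(np.eye(n) - lam * Ax)
    return float(W.sum())

# Two tiny graphs: a triangle and a path on three vertices.
A_tri = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
A_path = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
print(random_walk_kernel(A_tri, A_tri), random_walk_kernel(A_tri, A_path))
```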


Page 19: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Other kernels

Kernels for fingerprints

Objects that can be described by a long binary vector x can be compared with the Tanimoto kernel:

K_Tan(x_m, x_n) = 〈x_m, x_n〉 / (〈x_m, x_m〉 + 〈x_n, x_n〉 − 〈x_m, x_n〉)
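For binary fingerprints this is simply 'bits in common' over 'bits in either'; a minimal sketch (my own illustration):

```python
import numpy as np

def tanimoto_kernel(xm: np.ndarray, xn: np.ndarray) -> float:
    """Tanimoto kernel for binary (0/1) fingerprint vectors."""
    dot = float(xm @ xn)  # number of bits set in both fingerprints
    return dot / (float(xm @ xm) + float(xn @ xn) - dot)

a = np.array([1, 1, 0, 1, 0])
b = np.array([1, 0, 0, 1, 1])
print(tanimoto_kernel(a, b))  # 2 / (3 + 3 - 2) = 0.5
```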

[Figure: fingerprint representation of a molecule]


Page 20: Kernel Methods and Relational Learning in Computational Biology

Kernel methods Other kernels

Kernels for other objects

Kernels for texts: often based on word counts (example: medical papers)

Kernels for point clouds (example: using 3D structure of proteins)

Fisher kernels: use information from a generative model (example: using a Hidden Markov Model)


Page 21: Kernel Methods and Relational Learning in Computational Biology

Learning relations

Learning relations


Page 22: Kernel Methods and Relational Learning in Computational Biology

Learning relations Kronecker kernels

A little math...

A = [ a_{11} a_{12} ; a_{21} a_{22} ]  and  B = [ b_{11} b_{12} ; b_{21} b_{22} ]

We define the vectorization operator, which stacks the columns of a matrix (this column-stacking convention is required for the key equation below):

vec(A) = (a_{11}, a_{21}, a_{12}, a_{22})^T

And the Kronecker product:

A ⊗ B =
[ a_{11}b_{11}  a_{11}b_{12}  a_{12}b_{11}  a_{12}b_{12} ]
[ a_{11}b_{21}  a_{11}b_{22}  a_{12}b_{21}  a_{12}b_{22} ]
[ a_{21}b_{11}  a_{21}b_{12}  a_{22}b_{11}  a_{22}b_{12} ]
[ a_{21}b_{21}  a_{21}b_{22}  a_{22}b_{21}  a_{22}b_{22} ]

Key equation: (B^T ⊗ A) vec(X) = vec(AXB)
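A quick numerical check of the key equation (my own sketch; note that NumPy flattens row-major by default, so order='F' is used to get the column-stacking vec):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, X = (rng.standard_normal((2, 2)) for _ in range(3))

vec = lambda M: M.ravel(order="F")  # column-stacking vectorization

lhs = np.kron(B.T, A) @ vec(X)  # (B^T (x) A) vec(X)
rhs = vec(A @ X @ B)            # vec(A X B)
print(np.allclose(lhs, rhs))    # True
```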

Page 23: Kernel Methods and Relational Learning in Computational Biology

Learning relations Kronecker kernels

Kernels for pairs of objects

Pairwise kernel

Combine the kernel matrices of the individual objects to construct a kernel matrix for pairs of objects.

[Figure: conference poster 'Relational Learning and Ranking Algorithms for Bioinformatics Applications' by Michiel Stock, Willem Waegeman and Bernard De Baets (KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics), shown here as an illustration; it introduces a chemogenomics example, builds the Kronecker product pairwise kernel from the object kernels, and trains a conditional ranking model by minimizing a regularized ranking loss]

Kronecker kernel: KΦ = Kφ ⊗ Kψ


Page 24: Kernel Methods and Relational Learning in Computational Biology

Learning relations Kronecker kernels

Kernel ridge regression for relations

Set y = vec(Y) and KΦ = Kφ ⊗ Kψ.

We can just use the usual kernel ridge regression:

arg min_a (y − KΦ a)^T (y − KΦ a) + λ a^T KΦ a

This is equivalent to solving the following linear system:

(KΦ + λ I_{NM×NM}) a = y

with:

N objects of type U (e.g. proteins)

M objects of type V (e.g. ligands)

Y: the N × M label matrix (e.g. molecular interactions)

Kφ: the N × N kernel matrix for objects of type U

Kψ: the M × M kernel matrix for objects of type V
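A compact sketch of this relational kernel ridge regression on random toy data (my own illustration; for realistic N and M one would avoid forming the NM × NM matrix explicitly, e.g. by working with the eigendecompositions of Kφ and Kψ):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 10, 8, 1.0

# Toy positive semi-definite kernel matrices for the two object types.
U = rng.standard_normal((N, 5)); K_u = U @ U.T   # e.g. proteins
V = rng.standard_normal((M, 4)); K_v = V @ V.T   # e.g. ligands
Y = rng.standard_normal((N, M))                  # label matrix (interactions)

K_pair = np.kron(K_u, K_v)       # Kronecker pairwise kernel, NM x NM
y = Y.ravel()                    # row-major vec(Y) matches kron(K_u, K_v)
a = np.linalg.solve(K_pair + lam * np.eye(N * M), y)

Y_hat = (K_pair @ a).reshape(N, M)   # fitted interaction values
print(float(np.abs(Y - Y_hat).mean()))
```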


Page 25: Kernel Methods and Relational Learning in Computational Biology

Learning relations Conditional ranking

Conditional ranking

Motivation

Suppose one is not particularly interested in the exact value of the interaction, but in the order of the proteins for a given ligand.

[Figure: the same poster as on the previous slide, here highlighting the conditional ranking panel: for each query, the database objects are ordered from more to less relevant]


Page 26: Kernel Methods and Relational Learning in Computational Biology

Learning relations Conditional ranking

Conditional ranking

Suppose: e = (u, v) ∈ E = U × V

Train the model:

h(e) = w^T Φ(e) = ∑_{e′∈E} a_{e′} KΦ(e, e′)

by solving:

A(T) = arg min_{h∈H} L(h, T) + λ‖h‖²_H

where we use a ranking loss:

L(h, T) = ∑_{u,u′∈U} ∑_{v,v′∈V} (y_{u,v} − y_{u′,v′} − h(u, v) + h(u′, v′))²
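A small sketch of this ranking loss (my own illustration; it uses the algebraic identity that the sum over all pairs of squared residual differences collapses to two moments of the residuals; in the conditional variant the pairs would be restricted to edges sharing a conditioning vertex):

```python
import numpy as np

def ranking_loss(Y: np.ndarray, H: np.ndarray) -> float:
    """Sum over all pairs of edges e, e' of (y_e - y_e' - h(e) + h(e'))^2."""
    r = (Y - H).ravel()  # per-edge residuals r_e = y_e - h(e)
    n = r.size
    # sum_{e,e'} (r_e - r_e')^2 = 2*n*sum(r^2) - 2*(sum r)^2
    return float(2 * n * (r ** 2).sum() - 2 * r.sum() ** 2)

Y = np.array([[1.0, 0.0], [0.0, 3.0]])      # observed labels
H = np.array([[0.8, 0.1], [0.2, 2.5]])      # model predictions
print(ranking_loss(Y, H))
```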

preference graph:

[Figure 1: example of a preference multi-graph, reproduced from Pahikkala et al. (2010). If this graph were used for ranking the elements conditioned on C, then A would score better than E, which in its turn ranks higher than D, and D ranks higher than B. There is no information about the relation of C to F and G, respectively; our model could be used to include these two instances in the ranking if features are available. Notice that in this setting an unconditional ranking of these objects is meaningless, as the graph is clearly intransitive.]

The proposed framework is based on the Kronecker product kernel for generating implicit joint feature representations of queries and the sets of objects to be ranked. Exactly this kernel construction allows a straightforward extension of the existing framework to dyadic relations and multi-task learning problems. It has been proposed independently by three research groups for modelling pairwise inputs in different application domains (Basilico et al., 2004; Oyama et al., 2004; Ben-Hur et al., 2005). From a different perspective, it has been considered in structured output prediction methods for defining joint feature representations of inputs and outputs (Tsochantaridis et al., 2005; Weston et al., 2007).

While the usefulness of Kronecker product kernels for pairwise learning has been clearly established, the computational efficiency of the resulting algorithms remains a major challenge. Previously proposed methods require the explicit computation of the kernel matrix over the data object pairs, thereby introducing bottlenecks in terms of processing and memory usage, even for modest dataset sizes. To overcome this problem, one typically applies sampling strategies over the kernel matrix for training. An alternative approach, known as the Cartesian kernel, has been proposed by Kashima et al. (2009). This kernel exhibits interesting computational properties, but it can only be employed in selected applications, because it cannot make predictions for (couples of) objects that are not observed in the training dataset.

When modelling interactions between two types of objects one gets close to the field of collaborative filtering (Pessiot et al., 2007). Matrix factorization methods, used especially in collaborative filtering, may be applied to conditional ranking problems by exploiting the known labels for pairs of objects to generate a latent feature representation that allows predicting these labels for pairs for which this information is missing. Such methods can be combined with our machine learning approach as a preprocessing step in which additional latent features are generated.


Page 27: Kernel Methods and Relational Learning in Computational Biology

Predicting enzyme function

Predicting enzyme function


Page 28: Kernel Methods and Relational Learning in Computational Biology

Predicting enzyme function

The data set

Data:

two data sets of ca. 1600 enzymes with 21 different functions

five different similarity measures of the active site

[Figure: the active site of an enzyme]


Page 29: Kernel Methods and Relational Learning in Computational Biology

Predicting enzyme function

The enzyme commission number

The Enzyme Commission (EC) number is a four-level hierarchical code (e.g. EC 2.7.7.12) that classifies an enzyme by the reaction it catalyses, from the general reaction class down to the specific reaction.


Page 30: Kernel Methods and Relational Learning in Computational Biology

Predicting enzyme function Defining the problem

Quantifying enzyme function similarity

[Figure: enzymes labelled with their EC numbers (EC 2.7.7.12, EC 2.7.7.34, EC 2.7.1.12, EC 4.2.3.90, EC 4.6.1.11) plus a query enzyme with unknown function (EC ?.?.?.?); the edges are labelled with the catalytic similarity (0 to 4), the number of leading EC positions the two enzymes share, e.g. 3 for EC 2.7.7.12 vs. EC 2.7.7.34]


Page 31: Kernel Methods and Relational Learning in Computational Biology

Predicting enzyme function Defining the problem

Conditional ranking of enzymes

Ranking enzymes

For an unannotated enzyme, rank the annotated enzymes so that the top has a similar function w.r.t. the query.

Minimize the ranking error: the number of switches needed for a perfect ranking (see the sketch after the example below)

Example: suppose one has an enzyme with unknown function: EC ?.?.?.?

1. EC 2.7.7.12
2. EC 2.7.7.12
3. EC 2.7.7.34
4. EC 2.7.1.12
5. EC 2.7.7.34
6. EC 4.2.3.90
7. EC 1.14.11
8. EC 4.6.1.11

⇒ EC 2.7.7.12
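The ranking error above can be computed as the number of discordant pairs, which equals the number of adjacent switches needed to repair the ranking; a minimal sketch (my own illustration, with hypothetical relevance scores for the example):

```python
def ranking_error(relevances):
    """Number of pairs ranked in the wrong order (inversions).

    `relevances` holds the true relevance of each item in the predicted
    order; each inversion corresponds to one adjacent switch needed to
    repair the ranking.
    """
    return sum(1
               for i in range(len(relevances))
               for j in range(i + 1, len(relevances))
               if relevances[i] < relevances[j])

# Hypothetical catalytic similarities (0-4) of the ranked enzymes
# w.r.t. the true function of the query.
print(ranking_error([4, 4, 3, 2, 3, 0, 0, 0]))  # 1: positions 4 and 5 switched
```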


Page 32: Kernel Methods and Relational Learning in Computational Biology

Predicting enzyme function Defining the problem

Learning the catalytic similarity

pair of enzymes: e = (v, v′)

label y_e ∈ {0, 1, 2, 3, 4}: the catalytic similarity

five different structural similarities: Kφ(v, v′)

Catalytic similarity matrix between enzymes A to G (values for F and G are missing):

      A  B  C  D  E  F  G
A     4  4  0  0  0
B     4  4  0  0  0
C     0  0  4  2  1
D     0  0  2  4  3
E     0  0  1  3  4
F
G
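A tiny helper consistent with the matrix above (my own sketch): the catalytic similarity as the number of leading EC positions two annotations share:

```python
def catalytic_similarity(ec1: str, ec2: str) -> int:
    """Number of leading EC-number positions two enzymes share (0 to 4)."""
    sim = 0
    for a, b in zip(ec1.split("."), ec2.split(".")):
        if a != b:
            break
        sim += 1
    return sim

print(catalytic_similarity("2.7.7.12", "2.7.7.34"))  # 3
print(catalytic_similarity("2.7.7.12", "4.6.1.11"))  # 0
```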


Page 33: Kernel Methods and Relational Learning in Computational Biology

Predicting enzyme function Results

Qualitative improvement in the enzyme similarities

Example for CavBase structural similarity:

[Figure: enzyme similarity matrices: ground truth, supervised and unsupervised; lighter colour = higher similarity]


Page 34: Kernel Methods and Relational Learning in Computational Biology

Predicting enzyme function Results

Improvement of the ROC curves

ROC curves for the five different structural similarity measures: unsupervised and supervised.

[Figure: ROC curves (average true positive rate vs. false positive rate) for the different enzyme similarity measures of data set I: CB, FP, LPCS, MCS and SW, each unsupervised (baseline) and supervised; the supervised curves lie above the unsupervised ones, marking a clear improvement]

Increase of AUC from ca. 0.7 to more than 0.8!

Page 35: Kernel Methods and Relational Learning in Computational Biology

Conclusions

Conclusions

kernels can be used to work with structured objects...

... and can encode your prior knowledge

many problems in computational biology can be seen as 'learning relations'

relations between objects can be learned elegantly and efficiently using Kronecker kernels


Page 36: Kernel Methods and Relational Learning in Computational Biology

Conclusions

Kernel Methods and Relational Learning in Computational Biology

ir. Michiel Stock

Faculty of Bioscience Engineering, Ghent University

November 2014

KERMIT
