phdc exam presentation

Functional Characterisation of Metabolic Networks Carlos Manuel Estévez-Bretón MSc Doctorate in Systems Engineering and Computer Sciences Advisors: Luis Fernando Niño PhD Liliana Lopez Kleine PhD Intelligent Systems Research Laboratory - LISI Bioinformatics and Computational Biology research line “BioLisi” Examining Committee: Dr. Jason Papin, -U. of Virginia, Bioengineering. Dr. Andres Gonzalez, - U. de los Andes, Chemical Engineering. Dr. Fabio Gonzalez, U. Nacional, Systems Engineering.

Upload: carlos-manuel-estevez-breton

Post on 26-May-2015

DESCRIPTION

These are the slides I used in my PhD candidature exam.

TRANSCRIPT

Page 1: PhDc exam presentation

Functional Characterisation of Metabolic Networks

Carlos Manuel Estévez-Bretón, MSc

Doctorate in Systems Engineering and Computer Sciences

Advisors: Luis Fernando Niño, PhD; Liliana Lopez Kleine, PhD

Intelligent Systems Research Laboratory - LISI
Bioinformatics and Computational Biology research line “BioLisi”

Examining Committee:

Dr. Jason Papin, U. of Virginia, Bioengineering.

Dr. Andres Gonzalez, U. de los Andes, Chemical Engineering.

Dr. Fabio Gonzalez, U. Nacional, Systems Engineering.

Page 2: PhDc exam presentation

Agenda

What... Why...

Research Question

How...

Progress...

Goals

Evaluation

Deliverables

Page 4: PhDc exam presentation

Metabolism is the complete set of metabolic networks and physical processes that determine the physiological and biochemical properties of a cell.

With the sequencing of complete genomes, it is now possible to reconstruct the network of biochemical reactions in many organisms, from bacteria to humans...

Page 5: PhDc exam presentation

Introduction

Systems Biology

Ecological Scale

Lucas B. Edelman, James A. Eddy, and Nathan D. Price. Wiley Interdiscip Rev Syst Biol Med. 2010 Jul-Aug; 2(4): 438–459. doi: 10.1002/wsbm.75. PMC 2011 August 17.

Page 9: PhDc exam presentation

Introduction

Systems Biology

Ecological Scale

Multilevel field

Lucas B. Edelman, James A. Eddy, and Nathan D. Price. Wiley Interdiscip Rev Syst Biol Med. 2010 Jul-Aug; 2(4): 438–459. doi: 10.1002/wsbm.75. PMC 2011 August 17.

Page 10: PhDc exam presentation

Introduction

Systems Biology

Ecological Scale

Multilevel field

Studied Interdisciplinary

Lucas B. Edelman, James A. Eddy, and Nathan D. Price. Wiley Interdiscip Rev Syst Biol Med. 2010 Jul-Aug; 2(4): 438–459. doi: 10.1002/wsbm.75. PMC 2011 August 17.

Page 11: PhDc exam presentation

Introduction

Better and cheaper processing power

Page 12: PhDc exam presentation

Introduction

Multilevel Information

Better and cheaper processing power

Page 13: PhDc exam presentation

Introduction

Regulatory Networks

Protein-Protein Interaction Networks

Metabolic Networks

Ecological Networks

Page 14: PhDc exam presentation

Introduction

Regulatory Networks

Protein-Protein Interaction Networks

Metabolic Networks

Ecological Networks

Main Data Sources

Page 15: PhDc exam presentation

“Techniques such as high-throughput (HT) sequencing and gene/protein profiling have transformed biological research” (Khatri et al, 2012)

“In this way, the advent of HT profiling technologies presents a new challenge, that of extracting meaning from a long list of differentially expressed genes and proteins”.

(Khatri et al, 2012)

Page 16: PhDc exam presentation

These biological techniques change the way we study biological science. An interdisciplinary effort is needed to extract meaning, analyze, and obtain information with high levels of confidence and quality.

Page 17: PhDc exam presentation

BIOINFORMATICS EDITORIAL Vol. 27 no. 24 2011, pages 3331–3332. doi:10.1093/bioinformatics/btr585

Editorial

The rise and fall of supervised machine learning techniques

Lars Juhl Jensen1,* and Alex Bateman2

1Department of Disease Systems Biology, The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA UK

Machine learning is of immense importance in bioinformatics and biomedical science more generally (Larrañaga et al., 2006; Tarca et al., 2007). In particular, supervised machine learning has been used to great effect in numerous bioinformatics prediction methods. Through many years of editing and reviewing manuscripts, we noticed that some supervised machine learning techniques seem to be gaining in popularity while others seemed, at least to our eyes, to be looking ‘unfashionable’.

We were motivated to create a league table of machine learning techniques to learn what is hot and what is not in the machine learning field. In this editorial, we only include those that we considered major league and leave analysis of the minor league methods as an exercise for the interested reader. To create our league table, we created a list of supervised machine learning techniques commonly used in bioinformatics and their common synonyms, plural forms and abbreviations. We then searched this list against the PubMed titles and abstracts to identify the number of papers published per year for each machine learning technique. To match as many papers as possible, searches were case insensitive and allowed for variation in hyphenation.

Fig. 1. The growth of supervised machine learning methods in PubMed.

*To whom correspondence should be addressed

To our surprise, the artificial neural network (ANN) is not only the dominant league leader in 2011 but has been in this position since at least the 1970s (see Fig. 1). However, in recent years the usage of support vector machines (SVMs) grew tremendously, and we predict that SVMs will challenge ANNs for the dominant position in the coming decade. Since 2007 the number of publications using ANNs has decreased by 21%, which we hypothesize may be directly attributed to researchers increasingly using SVMs in place of ANNs. SVMs caught up with and overtook Markov models in 2004 to gain second spot in our machine learning league.

As for the question of ‘what is hot?’, one can see that Random forests are a rapidly growing method with not a single mention of them before 2003 and now a total of 407 papers published to date.

We were hoping to find techniques that were not so hot and perhaps going out of fashion. The results show that none of the major league methods has gone out of fashion, but we do see moderate decreases in the use of both ANNs and Markov models in the literature.

We were also curious to find out if certain machine learning techniques were used in combination with each other. To investigate this, we looked at what machine learning methods are co-mentioned in articles (See Fig. 2). For all pairs of methods from the Supervised

Fig. 2. Heatmap showing the co-occurrence of machine learning techniques within articles.

© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

“Hot techniques”: ANN, Markov Models, and “new ones” SVM and Random Forests.

(Jensen & Bateman, 2011)

Intelligent Systems

Latent Topic Analysis is not in the list of methods.

Page 18: PhDc exam presentation

“In particular, supervised machine learning has been used to great effect in numerous bioinformatics prediction methods”. (Jensen & Bateman, 2011)

Machine learning is of immense importance in bioinformatics and more generally for biomedical sciences (Larrañaga et al., 2006; Tarca et al., 2007).

Because machine learning is not common in metabolic systems analysis, I think it is important to emphasise that:

Page 19: PhDc exam presentation

There are no references in the literature to the analysis of metabolic pathways from a functional approach, or using the proposed machine learning methods.

Intelligent Systems

Page 20: PhDc exam presentation

In addition to all these applications, computational techniques are used to solve other problems, such as efficient primer design for PCR, biological image analysis and backtranslation of proteins (which is, given the degeneration of the genetic code, a complex combinatorial problem).

Machine learning consists in programming computers to optimize a performance criterion by using example data or past experience. The optimized criterion can be the accuracy provided by a predictive model (in a modelling problem), and the value of a fitness or evaluation function (in an optimization problem).

In a modelling problem, the ‘learning’ term refers to running a computer program to induce a model by using training data or past experience. Machine learning uses statistical theory when building computational models since the objective is to make inferences from a sample. The two main steps in this process are to induce the model by processing the huge amount of data and to represent the model and making inferences efficiently. It must be noticed that the efficiency of the learning and inference algorithms, as well as their space and time complexity and their transparency and interpretability, can be as important as their predictive accuracy. The process of transforming data into knowledge is both iterative and interactive. The iterative phase consists of several steps. In the first step, we need to integrate and merge the different sources of information into only one format. By using data warehouse techniques, the detection and resolution of outliers and inconsistencies are solved. In the second step, it is necessary to select, clean and transform the data. To carry out this step, we need to eliminate or correct the uncorrected data, as well as

Figure 1: Classification of the topics where machine learning methods are applied.

Larrañaga et al., p. 88. bib.oxfordjournals.org, downloaded at The Reference Shelf on May 30, 2011

Machine Learning

Page 21: PhDc exam presentation

Machine Learning

Bayesian classifiers, feature subset selection

SVM, ANN, classification trees, evolutionary algorithms

tabu search

nearest neighbour, SVM, Bayesian classifier, fuzzy k-NN

Bayesian generalization of the SVM, ANN, linear discriminant analysis, classification trees, ANN

SVM and HMM

linear discriminant analysis, quadratic discriminant analysis, k-NN classifier, bagging and boosting classification trees, SVM and random forest

Page 22: PhDc exam presentation

Machine Learning

Bayesian classifiers, feature subset selection

SVM, ANN, classification trees, evolutionary algorithms

tabu search

nearest neighbour, SVM, Bayesian classifier, fuzzy k-NN

Bayesian generalization of the SVM, ANN, linear discriminant analysis, classification trees, ANN

probabilistic graphical models, classification trees, boosting with classification trees

SVM and HMM

linear discriminant analysis, quadratic discriminant analysis, k-NN classifier, bagging and boosting classification trees, SVM and random forest

Page 23: PhDc exam presentation

Why?

http://www.perftrends.com/images/why.jpg

Page 24: PhDc exam presentation
Page 25: PhDc exam presentation

... or methods are not applied to Metabolic Pathways...

... or are based on Topological (Graph Based) network representations

Page 26: PhDc exam presentation

• It should be possible to make some advances in understanding the underlying functional conformation of metabolic pathways.

Statement

http://www.scriptmag.com/wp-content/uploads/BrainStorm-NewColor-12-22_32-1280x980at86.jpg

Page 27: PhDc exam presentation

http://www.scriptmag.com/wp-content/uploads/BrainStorm-NewColor-12-22_32-1280x980at86.jpg

• Supervised Clustering - useful to test the given representation - by classifying the biochemical reactions.

http://www.ee.ryerson.ca/~courses/ele888/ele_888_pat_class.gif

Statement

Page 29: PhDc exam presentation

• Information Retrieval algebraic models, like vector space based ones, should “reveal” topics that occur in document collections.

• Is it possible to generate new - “really new” pathways?

• ...I’m talking about synthetic biology.

http://diversity-mining-lab.wikispaces.com/

Statement
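As a rough illustration of the vector space idea mentioned on this slide, the sketch below turns a tiny collection of "documents" into term-frequency vectors and compares them with cosine similarity. The toy documents, the function names and the use of raw term counts (rather than any particular weighting) are assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection: each "document" is a list of molecule names.
docs = {
    "d1": "glucose atp glucose-6p adp",
    "d2": "glucose atp fructose adp",
    "d3": "palmitate coa atp amp",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

sim_12 = cosine(vectors["d1"], vectors["d2"])  # share three terms
sim_13 = cosine(vectors["d1"], vectors["d3"])  # share one term
```

Documents that share more vocabulary end up closer together in this space, which is the property that topic-oriented models build on.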

Page 30: PhDc exam presentation

Research Question

Is it possible to classify metabolic networks only using functional features?

Page 32: PhDc exam presentation

Goals

• To classify metabolic pathways functionally (without considering the topological structure) using machine learning methods.

Page 33: PhDc exam presentation

Goals

• To classify metabolic pathways functionally (without considering the topological structure) using machine learning methods.

• To build or adapt a system of functional representation for metabolic networks.

Page 34: PhDc exam presentation

Goals

• To classify metabolic pathways functionally (without considering the topological structure) using machine learning methods.

• To build or adapt a system of functional representation for metabolic networks.

• To classify metabolic networks using machine learning methods.

Page 35: PhDc exam presentation

Goals

• To classify metabolic pathways functionally (without considering the topological structure) using machine learning methods.

• To build or adapt a system of functional representation for metabolic networks.

• To classify metabolic networks using machine learning methods.

• To apply (in new ways) machine learning methods in the study of systems biology.

Page 36: PhDc exam presentation

Methodology

Pipeline (Carlos Manuel Estévez-Bretón R. 2012)

Data Source: MetaCyc, KEGG12

Representation:

General Metabolic Reaction Model - GMRM

$S_1 + S_2 + \dots + S_n \xrightarrow[\text{Cofactor, Coenzyme}]{\text{Enzyme}} P_1 + P_2 + \dots + P_n$

Vectorization of GMRM

$(S_1, S_2, S_3, \text{Enzyme}, \text{CoF}, \text{CoE}, P_1, P_2, P_3)$

Classification: Method 1, Method 2

Evaluation: ROC, Confusion matrix, Entropy, purity, adjusted Rand Index, Accuracy

paper, paper, paper

Page 37: PhDc exam presentation

Data Sources

MetaCyc

KEGG12

Page 38: PhDc exam presentation

Data Representation

General Metabolic Reaction Model - GMRM

$S_1 + S_2 + \dots + S_n \xrightarrow[\text{Cofactor, Coenzyme}]{\text{Enzyme}} P_1 + P_2 + \dots + P_n$

Vectorization of GMRM

$(S_1, S_2, S_3, \text{Enzyme}, \text{CoF}, \text{CoE}, P_1, P_2, P_3)$
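One possible reading of the vectorization shown on this slide is a fixed-slot layout: three substrate slots, then enzyme, cofactor and coenzyme, then three product slots. The sketch below follows that reading. The `Reaction` class, the `vectorize` function, the `PAD` placeholder and the decision to truncate reactions with more than three substrates or products are all assumptions of this example, not the author's implementation; the molecule names come from the glucose example used elsewhere in the deck.

```python
from dataclasses import dataclass
from typing import List

PAD = "-"  # placeholder for an empty slot (an assumption of this sketch)


@dataclass
class Reaction:
    substrates: List[str]
    products: List[str]
    enzyme: str = PAD
    cofactor: str = PAD
    coenzyme: str = PAD


def vectorize(reaction, n_slots=3):
    """Lay a reaction out as [S1..Sn, Enzyme, CoF, CoE, P1..Pn]."""
    def fit(items):
        # Truncate or pad so every reaction fills exactly n_slots positions.
        items = list(items)[:n_slots]
        return items + [PAD] * (n_slots - len(items))

    return (fit(reaction.substrates)
            + [reaction.enzyme, reaction.cofactor, reaction.coenzyme]
            + fit(reaction.products))


example = Reaction(
    substrates=["Glucose", "ATP"],
    products=["Glucose 6P", "ADP"],
    enzyme="Hydrolase",
)
vector = vectorize(example)
```

A fixed length keeps every reaction comparable slot by slot, at the cost of discarding information for reactions that exceed the slot count.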

Page 39: PhDc exam presentation

Classification

Supervised Classification

Method 1

Page 40: PhDc exam presentation

• Let’s think about clustering without any prior knowledge...

• Applying Information Retrieval methods to Metabolic Pathways data.

Method 2

Page 41: PhDc exam presentation

Evaluation

ROC, Confusion matrix, Entropy, purity, adjusted Rand Index, Accuracy

http://www.intechopen.com/source/html/38584/media/image56.jpeg

Confusion matrix:

| Really is \ Classified as | Positive | Negative |
| Positive | True Positive | False Negative |
| Negative | False Positive | True Negative |
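Entropy and purity, listed among the evaluation measures above, score a clustering against known class labels. The following is a minimal sketch of both, assuming the usual definitions (purity as the fraction of items in the majority class of their cluster, entropy as the size-weighted average of per-cluster class entropy). The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cluster_counts(cluster_ids, true_labels):
    """Count how many items of each class fall in each cluster."""
    counts = defaultdict(Counter)
    for c, y in zip(cluster_ids, true_labels):
        counts[c][y] += 1
    return counts


def purity(cluster_ids, true_labels):
    """Fraction of items that belong to their cluster's majority class."""
    counts = cluster_counts(cluster_ids, true_labels)
    majority = sum(max(c.values()) for c in counts.values())
    return majority / len(true_labels)


def entropy(cluster_ids, true_labels):
    """Size-weighted average class entropy (in bits) over clusters."""
    counts = cluster_counts(cluster_ids, true_labels)
    n = len(true_labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((k / size) * math.log2(k / size) for k in c.values())
        total += (size / n) * h
    return total


# Hypothetical toy clustering of six reactions.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["carbohydrate", "carbohydrate", "lipid", "lipid", "lipid", "lipid"]
p = purity(clusters, classes)
e = entropy(clusters, classes)
```

Higher purity and lower entropy both indicate clusters that align better with the reference classes.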

Page 42: PhDc exam presentation

Evaluation

ROC, Confusion matrix, Entropy, purity, adjusted Rand Index, Accuracy

http://www.intechopen.com/source/html/38584/media/image56.jpeg

Confusion matrix:

| Really is \ Classified as | Positive | Negative |
| Positive | True Positive | False Negative |
| Negative | False Positive | True Negative |

Error Rate

Recall/Sensitivity

Specificity/True Negative Rate

Precision

1-Specificity/False Alarm Rate
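The measures named on this slide can all be derived from the four cells of the confusion matrix. The sketch below uses their standard definitions; the function name `confusion_metrics` and the example counts are made up for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive the usual rates from the four confusion-matrix cells."""
    total = tp + fp + tn + fn

    def ratio(num, den):
        # Guard against empty rows or columns in the matrix.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, total),
        "error_rate": ratio(fp + fn, total),
        "recall": ratio(tp, tp + fn),            # sensitivity
        "specificity": ratio(tn, tn + fp),       # true negative rate
        "precision": ratio(tp, tp + fp),
        "false_alarm_rate": ratio(fp, fp + tn),  # 1 - specificity
    }


# Hypothetical counts, not results from the study.
m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```

Accuracy and error rate sum to one, as do specificity and the false alarm rate, which is a quick sanity check on any reported table.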

Page 43: PhDc exam presentation

Evaluation

ROC, Confusion matrix, Entropy, purity, adjusted Rand Index, Accuracy

http://www.intechopen.com/source/html/38584/media/image56.jpeg

http://wwww.cbgstat.com/v2/method_ROC_curve_MedCalc/images/ROC_curve_MedCalc_Snap17.gif
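A ROC curve like the one pictured is obtained by sweeping the decision threshold over the classifier's scores and recording the false positive rate and true positive rate at each step. The sketch below does this and estimates the area under the curve with the trapezoidal rule. It assumes binary labels (1 for positive, 0 for negative) with at least one example of each; the scores, labels and function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples sharing this score so ties form one step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

An area of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positives from negatives.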

Page 44: PhDc exam presentation

Deliverables

A computational metabolic representation proposal

A computational metabolic classification method

A generative metabolic pathways model

A pipeline for metabolic pathways analysis

Page 46: PhDc exam presentation

Preliminary Results

Pipeline (Carlos Manuel Estévez-Bretón R. 2012)

Data Source: MetaCyc, KEGG12

Representation:

General Metabolic Reaction Model - GMRM

$S_1 + S_2 + \dots + S_n \xrightarrow[\text{Cofactor, Coenzyme}]{\text{Enzyme}} P_1 + P_2 + \dots + P_n$

Vectorization of GMRM

$(S_1, S_2, S_3, \text{Enzyme}, \text{CoF}, \text{CoE}, P_1, P_2, P_3)$

Classification: Method 1, Method 2

Evaluation: ROC, Confusion matrix, Entropy, purity, adjusted Rand Index, Accuracy

paper, paper, paper

Page 47: PhDc exam presentation

Linguistic Analogy

Complexity

Metabolic Pathway

Reaction

Metabolites/ome

Metabolic Switch

Document

Phrase

Paragraph

Vocabulary

Words / Molecules

Words: the, frog, soap, Murder, for, a, jar, of, red, rum

Phrase: Murder for a jar of red rum

Molecules: Glucose, Glucose 6P, ATP, ADP, Hydrolase, Pyrophosphate

$\text{Glucose} + \text{ATP} \xrightarrow{\text{Hydrolase}} \text{Glucose 6P} + \text{ADP}$

General Metabolic Reaction Model - GMRM

$S_1 + S_2 + \dots + S_n \xrightarrow[\text{Cofactor, Coenzyme}]{\text{Enzyme}} P_1 + P_2 + \dots + P_n$

Vectorization of GMRM

$(S_1, S_2, S_3, \text{Enzyme}, \text{CoF}, \text{CoE}, P_1, P_2, P_3)$

Page 48: PhDc exam presentation

Representation

General Metabolic Reaction Model - GMRM

$S_1 + S_2 + \dots + S_n \xrightarrow[\text{Cofactor, Coenzyme}]{\text{Enzyme}} P_1 + P_2 + \dots + P_n$

Vectorization of GMRM

$(S_1, S_2, S_3, \text{Enzyme}, \text{CoF}, \text{CoE}, P_1, P_2, P_3)$

Page 49: PhDc exam presentation

Classification

Supervised

4 Pathways

2 carbohydrate metabolism

1 lipid metabolism

1 from nucleotide metabolism

24 organisms

Support Vector Machines (SVM)

Classification Tree

K Nearest Neighbour

CN2

Naive Bayes

Method 1
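To make the supervised setting concrete, here is a small sketch of one of the listed classifiers, k nearest neighbours, applied to reaction vectors in the nine-slot layout. It is only an illustration: the slot-wise mismatch (Hamming) distance, the training reactions, the enzyme placeholders `E1` to `E6` and the pathway labels are all assumptions made up for this example, not the data or the settings used in the study.

```python
from collections import Counter


def hamming(u, v):
    """Number of slots in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    neighbours = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled reactions: (S1, S2, S3, Enzyme, CoF, CoE, P1, P2, P3).
train = [
    (("Glucose", "ATP", "-", "E1", "-", "-", "Glucose 6P", "ADP", "-"),
     "carbohydrate"),
    (("Fructose", "ATP", "-", "E2", "-", "-", "Fructose 6P", "ADP", "-"),
     "carbohydrate"),
    (("Glucose 6P", "-", "-", "E3", "-", "-", "Fructose 6P", "-", "-"),
     "carbohydrate"),
    (("Palmitate", "CoA", "ATP", "E4", "-", "-", "Palmitoyl-CoA", "AMP", "PPi"),
     "lipid"),
    (("Acetyl-CoA", "CO2", "ATP", "E5", "-", "-", "Malonyl-CoA", "ADP", "Pi"),
     "lipid"),
]

query = ("Mannose", "ATP", "-", "E6", "-", "-", "Mannose 6P", "ADP", "-")
predicted = knn_predict(train, query, k=3)
```

The choice of distance matters here: a slot-wise comparison treats the same molecule in a different slot as a mismatch, which a set-based measure would not.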

Page 50: PhDc exam presentation

Pipeline

SVM

Page 51: PhDc exam presentation

Review

SVM

- Proposing a vector representation of biochemical reactions, based on a linguistic analogy.

I'm going to classify metabolic networks using only functional features... to find patterns that suggest constitution rules of metabolic pathways.

- Searching for patterns by clustering.

Page 52: PhDc exam presentation

Thanks

@karelman