phdc exam presentation
DESCRIPTION
These are the slides I used in my PhD candidature exam.
TRANSCRIPT
Functional Characterisation of Metabolic Networks
Carlos Manuel Estévez-Bretón MSc
Doctorate in Systems Engineering and Computer Sciences
Advisors: Luis Fernando Niño PhD, Liliana Lopez Kleine PhD
Intelligent Systems Research Laboratory - LISI Bioinformatics and Computational Biology research line “BioLisi”
Examining Committee:
Dr. Jason Papin, U. of Virginia, Bioengineering.
Dr. Andres Gonzalez, U. de los Andes, Chemical Engineering.
Dr. Fabio Gonzalez, U. Nacional, Systems Engineering.
Agenda
What...
Why...
Research Question
How...
Goals
Evaluation
Deliverables
Progress ...
What?
http://www.impactcommunicationsinc.com/wp-content/uploads/2011/10/11-11_speak_up.jpg
Metabolism is the complete set of metabolic networks and physical processes that determine the physiological and biochemical properties of a cell.
With the sequencing of complete genomes, it is now possible to
reconstruct the network of biochemical reactions in many organisms, from
bacteria to humans...
Introduction
Systems Biology
Ecological Scale
Lucas B. Edelman, James A. Eddy, and Nathan D. Price. Wiley Interdiscip Rev Syst Biol Med. 2010 Jul-Aug; 2(4): 438–459. doi: 10.1002/wsbm.75. PMC 2011 August 17.
Multilevel field
Studied Interdisciplinary
Better and cheaper processing power
Multilevel Information
Regulatory Networks
Protein Protein Interaction Networks
Metabolic Networks
Ecological Networks
Main Data Sources
“Techniques such as high-throughput (HT) sequencing and gene/protein profiling have transformed biological research” (Khatri et al., 2012)
“In this way, the advent of HT profiling technologies presents a new challenge, that of extracting meaning from a long list of differentially expressed genes and proteins” (Khatri et al., 2012)
These biological techniques change the way we study biological science.
It takes an interdisciplinary effort to extract meaning, analyze, and obtain information with high levels of confidence and quality.
BIOINFORMATICS EDITORIAL Vol. 27 no. 24 2011, pages 3331–3332. doi:10.1093/bioinformatics/btr585
Editorial
The rise and fall of supervised machine learning techniques
Lars Juhl Jensen1,* and Alex Bateman2
1 Department of Disease Systems Biology, The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark and 2 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA UK
* To whom correspondence should be addressed
Machine learning is of immense importance in bioinformatics and biomedical science more generally (Larrañaga et al., 2006; Tarca et al., 2007). In particular, supervised machine learning has been used to great effect in numerous bioinformatics prediction methods. Through many years of editing and reviewing manuscripts, we noticed that some supervised machine learning techniques seem to be gaining in popularity while others seemed, at least to our eyes, to be looking ‘unfashionable’.
We were motivated to create a league table of machine learning techniques to learn what is hot and what is not in the machine learning field. In this editorial, we only include those that we considered major league and leave analysis of the minor league methods as an exercise for the interested reader. To create our league table, we created a list of supervised machine learning techniques commonly used in bioinformatics and their common synonyms, plural forms and abbreviations. We then searched this list against the PubMed titles and abstracts to identify the number of papers published per year for each machine learning technique. To match as many papers as possible, searches were case insensitive and allowed for variation in hyphenation.
Fig. 1. The growth of supervised machine learning methods in PubMed.
To our surprise, the artificial neural network (ANN) is not only the dominant league leader in 2011 but has been in this position since at least the 1970s (see Fig. 1). However, in recent years the usage of support vector machines (SVMs) grew tremendously, and we predict that SVMs will challenge ANNs for the dominant position in the coming decade. Since 2007 the number of publications using ANNs has decreased by 21%, which we hypothesize may be directly attributed to researchers increasingly using SVMs in place of ANNs. SVMs caught up with and overtook Markov models in 2004 to gain second spot in our machine learning league.
As for the question of ‘what is hot?’, one can see that Random forests are a rapidly growing method with not a single mention of them before 2003 and now a total of 407 papers published to date.
We were hoping to find techniques that were not so hot and perhaps going out of fashion. The results show that none of the major league methods has gone out of fashion, but we do see moderate decreases in the use of both ANNs and Markov models in the literature.
We were also curious to find out if certain machine learning techniques were used in combination with each other. To investigate this, we looked at what machine learning methods are co-mentioned in articles (See Fig. 2). For all pairs of methods from the Supervised [...]
Fig. 2. Heatmap showing the co-occurrence of machine learning techniques within articles.
© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
“Hot techniques”: ANN, Markov Models, and “new ones” SVM and Random Forests (Jensen & Bateman, 2011).
Intelligent Systems
Latent Topic Analysis is not in the list of methods.
“In particular, supervised machine learning has been used to great effect in numerous bioinformatics prediction methods” (Jensen & Bateman, 2011).
Machine learning is of immense importance in bioinformatics and more generally for biomedical sciences (Larrañaga et al., 2006; Tarca et al., 2007).
Because machine learning is not common in metabolic systems analysis, I think it is important to emphasise that:
There are no references in the literature to the analysis of metabolic pathways from a functional approach, or using the proposed machine learning methods.
Machine Learning
In addition to all these applications, computational techniques are used to solve other problems, such as efficient primer design for PCR, biological image analysis and backtranslation of proteins (which is, given the degeneration of the genetic code, a complex combinatorial problem).
Machine learning consists in programming computers to optimize a performance criterion by using example data or past experience. The optimized criterion can be the accuracy provided by a predictive model (in a modelling problem), and the value of a fitness or evaluation function (in an optimization problem).
In a modelling problem, the ‘learning’ term refers to running a computer program to induce a model by using training data or past experience. Machine learning uses statistical theory when building computational models since the objective is to make inferences from a sample. The two main steps in this process are to induce the model by processing the huge amount of data and to represent the model and making inferences efficiently. It must be noticed that the efficiency of the learning and inference algorithms, as well as their space and time complexity and their transparency and interpretability, can be as important as their predictive accuracy. The process of transforming data into knowledge is both iterative and interactive. The iterative phase consists of several steps. In the first step, we need to integrate and merge the different sources of information into only one format. By using data warehouse techniques, the detection and resolution of outliers and inconsistencies are solved. In the second step, it is necessary to select, clean and transform the data. To carry out this step, we need to eliminate or correct the uncorrected data, as well as [...]
Figure 1: Classification of the topics where machine learning methods are applied.
Larrañaga et al., p. 88. bib.oxfordjournals.org at The Reference Shelf on May 30, 2011
Bayesian classifiers, Feature subset selection
SVM, ANN, classification trees, Evolutionary algorithms
tabu search
nearest neighbour, SVM, Bayesian classifier, fuzzy k-NN
Bayesian generalization of the SVM, ANN, linear discriminant analysis, classification trees, ANN
probabilistic graphical models, classification trees, boosting with classification trees
SVM and HMM
linear discriminant analysis, quadratic discriminant analysis, k-NN classifier, bagging and boosting classification trees, SVM and random forest
Why?
http://www.perftrends.com/images/why.jpg
... or Methods are not applied to Metabolic Pathways...
...or are based on Topological (Graph Based) network representations
• It should be possible to make some advances in understanding the underlying functional conformation of metabolic pathways.
Statement
http://www.scriptmag.com/wp-content/uploads/BrainStorm-NewColor-12-22_32-1280x980at86.jpg
• Supervised Clustering - useful to test the given representation - by classifying the biochemical reactions.
http://www.ee.ryerson.ca/~courses/ele888/ele_888_pat_class.gif
http://diversity-mining-lab.wikispaces.com/
• Information Retrieval algebraic models, like vector space based ones, should “reveal” topics that occur in document collections.
• Is it possible to generate new - “really new” pathways?
• ...I’m talking about synthetic biology.
Research Question
Is it possible to classify metabolic networks only using functional features?
How?
http://www.wired.com/images_blogs/threatlevel/2012/10/harris002.jpg
Goals
• To classify metabolic pathways functionally (without considering their topological structure), based on machine learning methods.
• To build or adapt a system of functional representation for metabolic networks.
• To classify metabolic networks using machine learning methods.
• To apply machine learning methods (in new ways) to the study of systems biology.
Methodology
General Metabolic Reaction Model - GMRM
S1 + S2 + … Sn → P1 + P2 + … Pn (Enzyme, CoFactor, CoEnzyme)
Vectorization of GMRM
S1 S2 S3 Enzyme CoF CoE P1 P2 P3
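A minimal sketch of how a reaction written in the GMRM form could be flattened into the fixed-slot vector listed above. The three substrate and three product slots follow the slide; the `Reaction` class, the `vectorize` helper, the `-` padding token and the example values are illustrative assumptions, not the representation actually implemented in this work.

```python
from dataclasses import dataclass
from typing import List

PAD = "-"  # hypothetical filler for slots a reaction does not use


@dataclass
class Reaction:
    """One reaction in GMRM form: substrates -> products, plus its helpers."""
    substrates: List[str]
    products: List[str]
    enzyme: str = PAD
    cofactor: str = PAD
    coenzyme: str = PAD


def _fit(items: List[str], width: int) -> List[str]:
    """Truncate or pad a list of names so it fills exactly `width` slots."""
    return (items + [PAD] * width)[:width]


def vectorize(r: Reaction, width: int = 3) -> List[str]:
    """Return the slots S1..S3, Enzyme, CoF, CoE, P1..P3 as one flat list."""
    return (_fit(r.substrates, width)
            + [r.enzyme, r.cofactor, r.coenzyme]
            + _fit(r.products, width))


# Toy reaction built from the molecule names used in the slides' Glucose example.
example = Reaction(
    substrates=["Glucose", "ATP"],
    products=["Glucose 6P", "ADP"],
    enzyme="Hydrolase",
)
print(vectorize(example))
```

Keeping every reaction at the same length (nine slots here) is what would let standard classifiers consume the vectors directly; reactions with more than three substrates or products are simply truncated in this sketch.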
Pipeline (Carlos Manuel Estévez-Bretón R. 2012)
Data Source: MetaCyc, KEGG12
Representation
Classification: Method 1, Method 2
Evaluation: ROC, Confusion matrix, Entropy, purity, adjusted Rand Index, Accuracy
paper, paper, paper
Data Sources
MetaCyc
KEGG12
Data Representation
General Metabolic Reaction Model - GMRM
Vectorization of GMRM
Classification
Method 1
Supervised Classification
Method 2
• Let’s think about clustering without any prior knowledge...
• Applying Information Retrieval methods to Metabolic Pathways data.
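As a rough illustration of the Information Retrieval idea above, the sketch below treats each pathway as a document whose words are molecule names, weights them with TF-IDF, and compares pathways by cosine similarity. The pathway names, their molecule lists and the choice of TF-IDF weighting are assumptions made for the example; the slides do not say which vector space model was used.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical toy corpus: pathway name -> molecules appearing in it.
pathways: Dict[str, List[str]] = {
    "pathway_A": ["Glucose", "ATP", "Glucose 6P", "ADP"],
    "pathway_B": ["Glucose 6P", "Fructose 6P", "ATP", "ADP"],
    "pathway_C": ["Acetyl-CoA", "Malonyl-CoA", "ATP", "ADP"],
}


def tf_idf(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Weight each term by its frequency in a document and its rarity overall."""
    n_docs = len(docs)
    doc_freq = Counter(term for words in docs.values() for term in set(words))
    weights = {}
    for name, words in docs.items():
        counts = Counter(words)
        weights[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


weights = tf_idf(pathways)
print(cosine(weights["pathway_A"], weights["pathway_B"]))
print(cosine(weights["pathway_A"], weights["pathway_C"]))
```

Molecules that occur in every pathway (ATP and ADP in this toy corpus) receive zero weight, so similarity is driven by the molecules that actually distinguish one pathway from another.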
Evaluation
ROC
Confusion matrix
Entropy
purity
adjusted Rand Index
Accuracy
http://www.intechopen.com/source/html/38584/media/image56.jpeg
                  Classified as:
Really is:        Positive          Negative
Positive          True Positive     False Negative
Negative          False Positive    True Negative
Error Rate
Recall / Sensitivity
Specificity / True Negative Rate
Precision
1 - Specificity / False Alarm Rate
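The measures listed above all derive from the four cells of the confusion matrix. The sketch below uses their standard definitions; the `ConfusionMatrix` class and the counts in the example are made up for illustration.

```python
from dataclasses import dataclass


def _ratio(num: int, den: int) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # really positive, classified positive
    fn: int  # really positive, classified negative
    fp: int  # really negative, classified positive
    tn: int  # really negative, classified negative

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn

    def accuracy(self) -> float:
        return _ratio(self.tp + self.tn, self.total)

    def error_rate(self) -> float:
        return _ratio(self.fp + self.fn, self.total)

    def recall(self) -> float:  # sensitivity
        return _ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:  # true negative rate
        return _ratio(self.tn, self.tn + self.fp)

    def precision(self) -> float:
        return _ratio(self.tp, self.tp + self.fp)

    def false_alarm_rate(self) -> float:  # 1 - specificity
        return _ratio(self.fp, self.fp + self.tn)


cm = ConfusionMatrix(tp=40, fn=10, fp=5, tn=45)  # made-up counts
print(cm.accuracy(), cm.error_rate(), cm.recall(),
      cm.specificity(), cm.precision(), cm.false_alarm_rate())
```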
http://wwww.cbgstat.com/v2/method_ROC_curve_MedCalc/images/ROC_curve_MedCalc_Snap17.gif
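A short sketch of how an ROC curve such as the one pictured could be computed from classifier scores: sort the items by score, sweep the threshold downward, and record the false positive rate against the true positive rate. The labels and scores are invented, and the function assumes both classes are present in the data.

```python
from typing import List, Tuple


def roc_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]               # 1 = positive class (toy data)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical classifier scores
curve = roc_points(labels, scores)
print(curve)
print(auc(curve))
```

An area of 1.0 would mean every positive is scored above every negative, while 0.5 corresponds to random ranking.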
Deliverables
A computational metabolic representation proposal
A computational metabolic classification method
A generative metabolic pathways model
A pipeline for metabolic pathways analysis
Progress ...
http://desktop.freewallpaper4.me/view/original/3714/the-lonely-man.jpg
Preliminary Results
Complexity
Metabolic Pathway
Reaction
Metabolites/ome
Metabolic Switch
Glucose
Glucose 6P
ATP
Hydrolase
Pyrophosphate
Linguistic Analogy
Vocabulary: Words / Molecules
Words: the, frog, soap
Phrase: Murder for a jar of red rum
Document, Paragraph, Phrase
Words of the phrase: rum, Murder, for, jar, a, of, red
Reaction: Glucose + ATP → Glucose 6P + ADP (Hydrolase)
Representation
General Metabolic Reaction Model - GMRM
S1 + S2 + … Sn → P1 + P2 + … Pn (Enzyme, CoFactor, CoEnzyme)
Vectorization of GMRM
S1 S2 S3 Enzyme CoF CoE P1 P2 P3
Classification
Method 1: Supervised
4 Pathways
2 from carbohydrate metabolism
1 from lipid metabolism
1 from nucleotide metabolism
24 organisms
Classifiers: Support Vector Machines (SVM), Classification Tree, K Nearest Neighbour, CN2, Naive Bayes
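To make the supervised setting concrete, here is a small sketch of one of the classifiers named above, K Nearest Neighbour, applied to reactions described only by the molecules they contain. The training reactions, their class labels and the use of Jaccard distance are assumptions for illustration; the experiments reported in the slides are not reproduced here.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

Sample = Tuple[FrozenSet[str], str]  # (molecules in a reaction, pathway class)

# Hypothetical labelled reactions.
training: List[Sample] = [
    (frozenset({"Glucose", "ATP", "Glucose 6P", "ADP"}), "carbohydrate"),
    (frozenset({"Glucose 6P", "Fructose 6P"}), "carbohydrate"),
    (frozenset({"Fructose 6P", "ATP", "Fructose 1,6-bisphosphate", "ADP"}), "carbohydrate"),
    (frozenset({"Acetyl-CoA", "ATP", "Malonyl-CoA", "ADP"}), "lipid"),
    (frozenset({"Malonyl-CoA", "Malonyl-ACP", "CoA"}), "lipid"),
    (frozenset({"Acetyl-CoA", "Acetyl-ACP", "CoA"}), "lipid"),
]


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """1 - |intersection| / |union|; 0.0 means identical molecule sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def knn_predict(query: FrozenSet[str], samples: List[Sample], k: int = 3) -> str:
    """Vote among the k training reactions closest to the query."""
    nearest = sorted(samples, key=lambda s: jaccard_distance(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict(frozenset({"Glucose", "Glucose 6P"}), training))
print(knn_predict(frozenset({"Malonyl-CoA", "CoA", "Acetyl-CoA"}), training))
```

The classifier sees no graph structure at all, only which molecules take part in each reaction, which is the sense in which the classification is functional rather than topological.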
Pipeline
SVM
Review
SVM
- Proposing a vector representation of biochemical reactions, based on a linguistic analogy.
I’m going to classify metabolic networks using only functional features... to find patterns that suggest constitution rules in metabolic pathways.
- Searching for patterns by clustering.
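When patterns are searched for by clustering, the resulting groups can be compared against known pathway classes with the purity and entropy measures named in the evaluation plan. The sketch below uses the usual definitions; the cluster assignments and class labels are invented.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def _group(clusters: Sequence[Hashable], classes: Sequence[str]) -> Dict[Hashable, List[str]]:
    """Collect the true class labels that fall inside each cluster."""
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for cluster_id, cls in zip(clusters, classes):
        groups[cluster_id].append(cls)
    return groups


def purity(clusters: Sequence[Hashable], classes: Sequence[str]) -> float:
    """Fraction of items that belong to the majority class of their cluster."""
    groups = _group(clusters, classes)
    majority = sum(Counter(m).most_common(1)[0][1] for m in groups.values())
    return majority / len(classes)


def entropy(clusters: Sequence[Hashable], classes: Sequence[str]) -> float:
    """Size-weighted average of the class entropy (in bits) inside each cluster."""
    total = 0.0
    for members in _group(clusters, classes).values():
        counts = Counter(members)
        h = -sum((c / len(members)) * math.log2(c / len(members))
                 for c in counts.values())
        total += len(members) / len(classes) * h
    return total


clusters = [0, 0, 0, 1, 1, 1]  # hypothetical cluster assignments
classes = ["carbohydrate", "carbohydrate", "lipid", "lipid", "lipid", "lipid"]
print(purity(clusters, classes))
print(entropy(clusters, classes))
```

Higher purity and lower entropy both indicate clusters that line up with the known classes; a perfect clustering gives purity 1.0 and entropy 0.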
Thanks
@karelman