Predicting gene functions using hierarchical multi-label decision tree ensembles

Celine Vens, Leander Schietgat, Jan Struyf, Hendrik Blockeel, Dragi Kocev, Sašo Džeroski

K.U.Leuven, Department of Computer Science



Hierarchical Multi-Label Classification (HMC) for Gene Function Prediction

• Classification: a common machine learning task, e.g.,
  – Given: genes with known function
  – Task: predict function for new genes
• Special case: hierarchical multi-label classification (HMC)
  – a gene can have multiple functions
  – functions are organized in a hierarchy
    – tree (e.g., MIPS FunCat)
    – DAG (e.g., Gene Ontology)
• Hierarchy constraint: if a gene is labeled with function X, then it is also labeled with all parents of X (and so, recursively, with all ancestors of X)
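
The hierarchy constraint can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `PARENTS` is a hypothetical toy hierarchy whose identifiers only imitate FunCat-style codes, the two-parent node "x" is invented to show the DAG case, and `close_under_hierarchy` is not a function of any tool mentioned in the talk.

```python
# Toy class hierarchy as a child -> parents map (hypothetical example data).
# A class may list several parents, so the same code covers trees and DAGs.
PARENTS = {
    "5": [],
    "5/1": ["5"],
    "40": [],
    "40/3": ["40"],
    "40/16": ["40"],
    "x": ["5/1", "40/3"],  # invented DAG-style node with two parents
}


def close_under_hierarchy(labels, parents=PARENTS):
    """Return the label set extended with every ancestor of every label."""
    closed = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in closed:
            continue  # already handled via another path in the DAG
        closed.add(label)
        stack.extend(parents[label])
    return closed


print(sorted(close_under_hierarchy({"40/16"})))
print(sorted(close_under_hierarchy({"x"})))
```

A label set that already satisfies the constraint is returned unchanged, so the helper can also serve as a consistency check on annotations.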

Predictions in Functional Genomics

• S. cerevisiae (13 datasets) and A. thaliana (12 datasets)
  – two of biology's model organisms
  – most genes are annotated, ideal for testing purposes
  – method can be applied to other organisms
• Data
  – based on sequence statistics, phenotype, secondary structure, homology, microarray data, …

Predictive Clustering Trees

• Our focus is on decision trees
• Advantages: fast to build, noise-resistant, fast to apply, accurate predictions, easy to interpret
• General framework: predictive clustering trees (PCTs)

[Figure: Input (genes with features and known functions) → Algorithm (PCT-algo: top-down induction of PCTs) → Output (PCT)]

[Table: one row per gene (G1, G2, …, G6, …); columns are Name, the attributes A1, A2, …, An, and one column per function class (1, …, 5, 5/1, …, 40, 40/3, 40/16, …); an x marks each function a gene is annotated with]
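
Top-down induction grows a tree by repeatedly choosing the test that best splits the current set of genes. The slide does not say which heuristic is used, so the sketch below assumes a variance-reduction criterion over binary function vectors with plain (unweighted) squared Euclidean distance. The function names and the toy data are hypothetical, and only a single numeric feature is searched.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def variance(vectors):
    """Mean squared Euclidean distance of the label vectors to their mean."""
    if not vectors:
        return 0.0
    mean = mean_vector(vectors)
    total = sum(
        sum((value - m) ** 2 for value, m in zip(vector, mean))
        for vector in vectors
    )
    return total / len(vectors)


def best_threshold(feature_values, label_vectors):
    """Return (threshold, variance reduction) for the best test `x <= threshold`."""
    n = len(label_vectors)
    total = variance(label_vectors)
    best = (None, 0.0)
    # The largest value is skipped: it would leave the right branch empty.
    for threshold in sorted(set(feature_values))[:-1]:
        pairs = list(zip(feature_values, label_vectors))
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        reduction = (
            total
            - len(left) / n * variance(left)
            - len(right) / n * variance(right)
        )
        if reduction > best[1]:
            best = (threshold, reduction)
    return best


# Hypothetical toy data: one feature value and a 3-class function vector per gene.
features = [0.1, 0.2, 0.8, 0.9]
labels = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
threshold, gain = best_threshold(features, labels)
print(threshold, gain)
```

A full learner would apply this search to every feature, split on the best test, and recurse on both branches until a stopping criterion holds.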

Decision Trees for HMC: Different Approaches

• Clus-SC: standard approach, learns one tree per class
• Clus-HSC: special-purpose approach, learns one tree per class + hierarchy constraint
• Clus-HMC: our approach, learns one single tree for all classes

[Table: Clus-SC, Clus-HSC, and Clus-HMC compared on hierarchy constraint, identification of global features, predictive performance, model size, and efficiency; cell values not recoverable]

Predictive Clustering Forests

• Ensembles
  – Less interpretability
  – Better performance
• Algorithm: Clus-HMC-Ens

[Figure: the training set is resampled into 50 bootstrap replicates (1, 2, 3, …, n); Clus-HMC learns one PCT on each replicate (50 PCTs); applied to the test set, the trees give 50 predictions (L1, L2, L3, …, Ln), which are merged into a combined prediction L]
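
The bootstrap-and-combine scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Clus-HMC-Ens implementation: `train_stub` is a hypothetical stand-in for the tree learner (it ignores the features and predicts the mean label vector of its replicate), and combining by per-class averaging is assumed, since the slide only says "combined prediction".

```python
import random


def bootstrap_replicate(dataset, rng):
    """Sample len(dataset) examples from the dataset with replacement."""
    return [rng.choice(dataset) for _ in dataset]


def train_stub(replicate):
    """Hypothetical stand-in for the tree learner.

    Returns a model that ignores its input and predicts the mean label
    vector of the replicate it was trained on.
    """
    vectors = [labels for _, labels in replicate]
    n = len(vectors)
    mean = [sum(column) / n for column in zip(*vectors)]
    return lambda features: mean


def train_ensemble(dataset, n_models=50, seed=0):
    """Train one model per bootstrap replicate."""
    rng = random.Random(seed)
    return [train_stub(bootstrap_replicate(dataset, rng)) for _ in range(n_models)]


def predict_ensemble(models, features):
    """Combine the individual predictions by averaging them per class (assumed)."""
    predictions = [model(features) for model in models]
    return [sum(column) / len(predictions) for column in zip(*predictions)]


# Hypothetical toy data: (feature vector, binary function vector) per gene.
training_set = [
    ([0.1], [1, 0]),
    ([0.2], [1, 0]),
    ([0.8], [0, 1]),
    ([0.9], [1, 1]),
]
models = train_ensemble(training_set)
combined = predict_ensemble(models, [0.5])
print(combined)
```

Replacing `train_stub` with a real tree learner leaves the rest of the scheme unchanged, which is why the ensemble costs interpretability but nothing in the base algorithm.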

Decision Trees for HMC: Different Approaches

• Clus-SC: standard approach, learns one tree per class
• Clus-HSC: special-purpose approach, learns one tree per class + hierarchy constraint
• Clus-HMC: our approach, learns one single tree for all classes
• Clus-HMC-Ens: variant of our approach, learns a forest

[Table: Clus-SC, Clus-HSC, Clus-HMC, and Clus-HMC-Ens compared on hierarchy constraint, identification of global features, predictive performance, model size, and efficiency; cell values not recoverable]

Evaluation

• Evaluation: precision-recall
  – precision: percentage of predicted functions that are correct (TP/(TP+FP))
  – recall: percentage of actual functions predicted by the algorithm (TP/(TP+FN))
• Average PR curve
  – Consider (instance, class) couples
  – A couple is true if the instance has the class, and predicted true if the instance is predicted to have the class

                   predicted true   predicted false
  actually true          TP               FN
  actually false         FP               TN
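
Counting over (instance, class) couples can be made concrete with a short Python sketch. The function name `pr_point`, the toy scores, and the thresholds are hypothetical. Returning 1.0 when a denominator is zero is a convention assumed here, not something stated on the slide.

```python
def pr_point(scores, truth, threshold):
    """Precision and recall over all (instance, class) couples at one threshold.

    scores[i][c] is the predicted score of class c for instance i;
    truth[i][c] is 1 if instance i actually has class c, else 0.
    """
    tp = fp = fn = 0
    for score_row, truth_row in zip(scores, truth):
        for score, actual in zip(score_row, truth_row):
            predicted = score >= threshold
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical toy data: 2 instances, 3 classes.
scores = [[0.9, 0.6, 0.1], [0.8, 0.3, 0.7]]
truth = [[1, 1, 0], [1, 0, 0]]

# Sweeping the threshold traces out points of the PR curve.
curve = [pr_point(scores, truth, t) for t in (0.2, 0.5, 0.75)]
print(curve)
```

Because every couple is pooled before the ratios are taken, frequent classes weigh more heavily in the resulting curve than rare ones.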

[Figure: four result panels: S. cerevisiae-FunCat (hom), A. thaliana-GO (seq), S. cerevisiae-FunCat (expr), A. thaliana-GO (interpro)]

• Clus-HMC-Ens better than Clus-HMC (average AUC improvement of 7%)
• Clus-HMC better than C4.5H (state-of-the-art system for HMC): for the same recall as C4.5H, average precision improvement of 20.9%

• Comparison with SVMs (Barutcuoglu et al.)
  – Learn one SVM per class
  – Correct for hierarchy constraint (HC) violations with a Bayesian model

Conclusions

• Clus-HMC outperforms (or is comparable to) state-of-the-art methods on functional genomics tasks
• Ensembles of Clus-HMC are able to boost performance, if the user is willing to give up on interpretability
• "Revenge of the decision trees"