My PhD thesis presentation slides

Hierarchical information representation and efficient classification of gene expression microarray data PhD candidate: Mattia Bosio Advisors: Philippe Salembier Albert Oliveras Vergés 27/06/2014 Mattia Bosio PhD thesis defense 1

Upload: mattia-bosio

Post on 05-Jul-2015


DESCRIPTION

Presentation slides for my PhD thesis defense on developing machine learning algorithms to analyze multidimensional genomic data such as microarrays.

TRANSCRIPT

Page 1: My PhD thesis presentation slides

Hierarchical information representation

and efficient classification

of gene expression microarray data

PhD candidate:

Mattia Bosio Advisors:

Philippe Salembier, Albert Oliveras Vergés

27/06/2014 Mattia Bosio PhD thesis defense 1

Page 2: My PhD thesis presentation slides

Thesis objective

Develop algorithms for microarray classification

–Predictive performance

–Results stability

–Biological interpretability


Page 3: My PhD thesis presentation slides

Roadmap


1- Microarrays

2- Challenges & Opportunities

3- Contributions

4- How did we get there?

5- Conclusions

Page 4: My PhD thesis presentation slides

1- MICROARRAYS

Page 5: My PhD thesis presentation slides

A platform to measure gene expression


• Give a picture of the whole cellular state

• Thousands of parallel measures

• Measure how much each gene is being used

• Can be used to discriminate between populations

Page 6: My PhD thesis presentation slides

Microarrays: what do they measure


Page 7: My PhD thesis presentation slides

Microarrays: what do they look like

[Figure: expression matrix of 72 samples × 45,000 ‘genes’]

Page 8: My PhD thesis presentation slides

2- CHALLENGES & OPPORTUNITIES

Page 9: My PhD thesis presentation slides

Challenges


Lack of structure

Noise

Sample size vs dimensions

[Figure: expression matrix of 72 samples × 45,000 ‘genes’]

Page 10: My PhD thesis presentation slides

Opportunities

• Established research tool, but no optimal classification algorithm yet

• Machine learning has already been used

– Good results that can be improved

• Signal processing has dealt with similar problems

Page 11: My PhD thesis presentation slides

3- CONTRIBUTIONS

Page 12: My PhD thesis presentation slides

Two-step classification framework

[Diagram: Train Data (Genes) → Feature set Enhancement → Metagenes → Feature Selection → Classifier; Validation Data → Classifier → Class Estimations]

Contributions:

1. Metagenes

2. IFFS

3. Ensemble

4. Knowledge Integration

5. Multiclass algorithm

Page 13: My PhD thesis presentation slides

4- HOW DID WE GET THERE?


Page 14: My PhD thesis presentation slides

4.1 FEATURE SET ENHANCEMENT

A structure is inferred from the data and new metagenes are created.


Page 15: My PhD thesis presentation slides

Feature set enhancement

Addresses noise and lack of structure

• A binary tree is inferred

• Each node is a new feature

• New features are called metagenes

• Metagenes reduce noise by clustering similar genes


Page 16: My PhD thesis presentation slides

Feature set enhancement

The iterative process of metagene generation

• Iterative process based on Treelets [1]

• The two most similar features are substituted by a metagene

• Two key elements:

– Similarity metric

– Metagene generation algorithm

[1] A. B. Lee, B. Nadler, L. Wasserman, Treelets - an adaptive multi-scale basis for sparse unordered data, Annals of Applied Statistics 2 (2) (2008) 435-471.

Page 17: My PhD thesis presentation slides

4.2 FEATURE SELECTION: IFFS

How to select the right features to discriminate between classes with an iterative wrapper algorithm

Page 18: My PhD thesis presentation slides

IFFS: Find the few best features to classify

• “Improved Forward Floating Selection (IFFS)” [2]:

– Sequential, deterministic wrapper algorithm

• Flexible method: at each iteration, decide whether to add, delete, or substitute a feature

• Alternatives are compared by a J(·) score

[2] S. Nakariyakul, D. Casasent, An improvement on floating search algorithms for feature subset selection, Pattern Recognition.
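The add / delete / substitute loop can be sketched as below. This is a simplified floating search written for illustration only, not the algorithm of [2] verbatim nor the thesis code. The scoring callable `J`, the function name `floating_search` and the stopping rule `max_size` are assumptions; higher `J` is taken to be better.

```python
def floating_search(n_features, J, max_size):
    """Simplified floating forward search with a substitution step.

    J maps a sorted tuple of feature indices to a score (higher is better).
    Returns best[k] = (score, subset) for every subset size k visited.
    ASSUMPTION: illustrative simplification, not the exact IFFS procedure.
    """
    selected = []
    best = {}

    def record(subset):
        # Keep the subset only if it strictly improves the best of its size.
        score = J(tuple(sorted(subset)))
        k = len(subset)
        if k not in best or score > best[k][0]:
            best[k] = (score, sorted(subset))
            return True
        return False

    while len(selected) < max_size:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Add: include the feature that maximizes J.
        f_add = max(candidates, key=lambda f: J(tuple(sorted(selected + [f]))))
        selected = selected + [f_add]
        record(selected)
        improved = True
        while improved and len(selected) > 2:
            improved = False
            # Delete: drop the least useful feature if that beats best[k-1].
            f_del = max(selected,
                        key=lambda f: J(tuple(sorted(set(selected) - {f}))))
            reduced = [f for f in selected if f != f_del]
            if record(reduced):
                selected, improved = reduced, True
                continue
            # Substitute: swap one selected feature for an unselected one.
            for f_out in selected:
                base = [f for f in selected if f != f_out]
                for f_in in range(n_features):
                    if f_in not in selected and record(base + [f_in]):
                        selected, improved = base + [f_in], True
                        break
                if improved:
                    break
    return best

# Toy score: features 1 and 3 are informative, every other feature costs 0.1.
def toy_J(subset):
    s = set(subset)
    return len(s & {1, 3}) - 0.1 * len(s - {1, 3})

best = floating_search(n_features=5, J=toy_J, max_size=3)
```

Because a delete or substitute move is accepted only when it strictly improves the best score recorded for that subset size, the loop cannot cycle.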

Page 19: My PhD thesis presentation slides

IFFS: Find the few best features to classify

Deterministic sequential wrapper algorithm

• All decisions are determined by a J(·) score

• J(·) is usually an error rate estimate

– Ties are frequent due to sample scarcity

Page 20: My PhD thesis presentation slides

J(·) score tailored for microarrays

Reliability measure to break ties in J(·)

Three rules to define the score combining error rate and reliability:

1. Lexicographic sorting

2. Exponential penalization

3. Linear combination

The J(·) score depends on 2 parameters:

1. Error rate

2. Reliability
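The slide names the three rules but does not give their formulas, so the sketch below uses hypothetical functional forms purely to show how reliability can break ties in error rate. The parameters `lam` and `w`, the assumption that reliability lies in [0, 1] with higher being better, and the candidate values are all illustrative.

```python
import math

def j_lexicographic(error, reliability):
    # Compare by error rate first; reliability only decides ties.
    return (-error, reliability)

def j_exponential(error, reliability, lam=1.0):
    # HYPOTHETICAL form: accuracy damped exponentially by low reliability.
    return (1.0 - error) * math.exp(-lam * (1.0 - reliability))

def j_linear(error, reliability, w=0.1):
    # HYPOTHETICAL form: weighted sum of accuracy and reliability.
    return (1.0 - w) * (1.0 - error) + w * reliability

# Candidates A and B tie on error rate; reliability separates them.
candidates = {"A": (0.10, 0.60), "B": (0.10, 0.85), "C": (0.15, 0.95)}
best_lex = max(candidates, key=lambda k: j_lexicographic(*candidates[k]))
best_lin = max(candidates, key=lambda k: j_linear(*candidates[k]))
```

Under the lexicographic rule, reliability never overrides a difference in error rate; the other two rules let a sufficiently reliable candidate win despite a slightly higher error.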

Page 21: My PhD thesis presentation slides

IFFS: Experimental setup

• Datasets from MAQC study phase II [4]

• 7 datasets with hundreds of samples

– 30,000+ models evaluated

– Independent validation sets available

– Common evaluation procedure

[4] L. Shi, et al., The microarray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models, Nature Biotechnology 28 (2010) 827-38.

Page 22: My PhD thesis presentation slides

IFFS: experiment objectives

• Evaluate if metagenes are useful

• Benchmark with state of the art

• Comparison following MAQC standard:

Matthews Correlation Coefficient

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
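As a quick check of the metric, a minimal sketch computing the Matthews Correlation Coefficient from confusion-matrix counts is given below. Returning 0 when the denominator vanishes is a common convention and an assumption here, not something stated on the slide.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # ASSUMED convention when a row or column is empty
    return (tp * tn - fp * fn) / denom

perfect = mcc(tp=50, tn=50, fp=0, fn=0)     # all predictions correct
chance = mcc(tp=25, tn=25, fp=25, fn=25)    # no better than random
inverted = mcc(tp=0, tn=0, fp=50, fn=50)    # all predictions wrong
```

MCC ranges from -1 to +1 and stays informative with unbalanced classes, which makes it a reasonable common yardstick across datasets.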

Page 23: My PhD thesis presentation slides

Results: Metagenes are useful


• Introducing metagenes gives better results

Page 24: My PhD thesis presentation slides

The proposed framework improves on state-of-the-art results

[Bar chart: MCC per method; bar values 0.423, 0.486, 0.495, 0.490; axis from 0.25 to 0.55]

Page 25: My PhD thesis presentation slides

Observations

• The proposed framework works with both its key elements

• Metagenes are useful (contrib #1)

• IFFS adapted to microarrays improves the state of the art (contrib #2)


Page 26: My PhD thesis presentation slides

4.3 FEATURE SELECTION: ENSEMBLE

How to select the right features to discriminate between classes with a novel ensemble learning algorithm


Page 27: My PhD thesis presentation slides

Ensemble learning - voting scheme

• Ensembles combine experts with a voting scheme

• One expert for each available feature

– Expert = trained classifier output on analyzed data

– 1 expert = 1 feature

• Feature selection becomes an expert subset selection problem

Page 28: My PhD thesis presentation slides

Accuracy In Diversity [7]

the original algorithm

• Starts with p experts: one for each feature

• Sequentially removes the expert with the worst error rate on a subset S

• In [6], a simpler version is defined: the Kun algorithm

[6] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.

[7] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, “A new ensemble diversity measure applied to thinning ensembles,” in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, T. Windeatt and F. Roli, Eds., vol. 2709. Springer, 2003, pp. 306–316.

Page 29: My PhD thesis presentation slides

Accuracy In Diversity

the original algorithm

• PCDM (d) = % of experts correctly classifying sample i

• S: set of samples with $lb \le d \le Ub$

• The expert with the worst error rate on S is excluded

[Figure: EXPERTS × SAMPLES vote matrix with per-sample PCDM values (90%, 50%, 80%, 100%, 100%)]

AID bounds: $lb = \mu \cdot d + \frac{1 - d}{n}$, $Ub = \alpha \cdot d + \mu (1 - d)$

Kun bounds: $lb = 10\%$, $Ub = 90\%$
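A minimal sketch of the thinning loop with the fixed Kun bounds (10% and 90%) follows. The stopping rule `n_keep`, the fallback to all samples when S is empty, and first-index tie-breaking are assumptions made to keep the example runnable; the slides do not specify them at this point.

```python
import numpy as np

def thin_ensemble(correct, n_keep, lb=0.10, ub=0.90):
    """Remove experts one at a time, always the worst on the subset S.

    correct[e, i] is True when expert e classifies sample i correctly.
    ASSUMPTIONS: stop at n_keep experts; ties broken by lowest index;
    if S is empty, all samples are used.
    """
    active = list(range(correct.shape[0]))
    while len(active) > n_keep:
        sub = correct[active]
        pcdm = sub.mean(axis=0)              # fraction of experts correct per sample
        in_s = (pcdm >= lb) & (pcdm <= ub)   # samples the ensemble disagrees on
        if not in_s.any():
            in_s[:] = True
        errors = 1.0 - sub[:, in_s].mean(axis=1)
        worst = active[int(np.argmax(errors))]
        active.remove(worst)
    return active

# 4 experts x 5 samples; sample 0 is classified correctly by everyone,
# so it falls outside S and does not influence which expert is removed.
correct = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
], dtype=bool)
kept = thin_ensemble(correct, n_keep=3)
```

Restricting the error estimate to S focuses the comparison on samples where the experts actually disagree, since unanimous samples cannot distinguish one expert from another.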

Page 30: My PhD thesis presentation slides

Adaptations to microarrays

• Nonexpert: exclude experts unable to find 2 classes in the training set

• Metagenes: included as experts

• Tie-break rule: the expert higher in the tree is excluded

Page 31: My PhD thesis presentation slides

Ensemble: experiment objectives

• Comparison between AID and Kun ensemble algorithms.

• Benchmark with state of the art.

• Comparison following MAQC standard:

Matthews Correlation Coefficient

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Page 32: My PhD thesis presentation slides

Ensemble algorithms improve the state of the art

• Both algorithms improve the state of the art

• The simpler Kun algorithm is the best option

[Bar chart: MCC per method; bar values 0.230, 0.490, 0.495, 0.514, 0.533; axis from 0.2 to 0.6]

Page 33: My PhD thesis presentation slides

Observations

• Ensemble learning feature selection led to encouraging results.

• The proposed ensemble learning improves the state of the art. (contrib #3)

• Tailoring the algorithm to the data benefits the results.


Page 34: My PhD thesis presentation slides

4.4 KNOWLEDGE INTEGRATION

Introducing prior biological knowledge to improve the metagene generation phase. The aim is to obtain more robust performance and more biologically interpretable gene selections.

Page 35: My PhD thesis presentation slides

Integration of external biological data when producing metagenes

[Diagram: the two-step framework (Train Data genes → Feature set Enhancement → New metagenes → Feature Selection → Classifier; Validation Data → Class Estimations), with Biological Knowledge (MSigDb...) feeding the Feature set Enhancement step]

Page 36: My PhD thesis presentation slides

Objectives of this section

• Define measures to quantify biological similarity

• Develop ways to integrate both sources of information: numerical correlation & biological similarity

• Benchmarking: predictive power | results stability | biological interpretability

Page 37: My PhD thesis presentation slides

Distances and merging algorithms

• 4 similarity metrics studied:

Goodall | Smirnov | NoisyOR | Anderberg

• 2 criteria to merge numerical and biological information:

Average | pdf equalization

Page 38: My PhD thesis presentation slides

Experimental setup

• 7 MAQC datasets

• 50-run Monte Carlo experiments

• Novel scoring system integrating numerical results and biological analysis tools

Page 39: My PhD thesis presentation slides

Comparative scoring system

Predictive performance

• $d = \frac{\mu}{\epsilon + \sigma}$, computed from MCC values

• Rank by decreasing d (largest d = best)

Biological analysis

• 4 parallel analysis tools: GSEA | Biograph | Genie | Enrichr

• 4 parallel rankings

• Average the biological rankings

Final score = average of the predictive rank and the average biological rank

Example from the slide: predictive rank 1; biological rankings 1, 3, 6, 2 (average 3); final score 2

The best algorithm has the smallest final score
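The scoring pipeline can be sketched in a few lines. The value of ε is not given on the slide, so `eps=1e-3` is an assumption, as are the function names; the final-score example reuses the numbers shown on the slide.

```python
import numpy as np

def predictive_score(mcc_values, eps=1e-3):
    # d = mu / (eps + sigma): rewards high mean MCC and low spread.
    # ASSUMPTION: eps value chosen only to avoid division by zero.
    m = np.asarray(mcc_values, dtype=float)
    return m.mean() / (eps + m.std())

def ranks_desc(scores):
    # Rank 1 goes to the largest score.
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def final_score(predictive_rank, biological_ranks):
    # Average of the predictive rank and the mean biological rank.
    return (predictive_rank + np.mean(biological_ranks)) / 2.0

example = final_score(1, [1, 3, 6, 2])      # slide example
example_ranks = ranks_desc([0.2, 0.9, 0.5])
```

Working on ranks rather than raw values lets heterogeneous outputs, MCC statistics and four different biological analysis tools, be combined on one scale.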

Page 40: My PhD thesis presentation slides

Predictive power scoring & ranking shows G_pdf as the best solution

The smallest final score is the best alternative

[Figure: predictive rank (MCC), biological analysis rank (BIO) and final score for each alternative, grouped by merging criterion (pdf_equalization, average)]

Page 41: My PhD thesis presentation slides

Compared with the state of the art, G_pdf is confirmed as the best alternative

The smallest final score is the best alternative

[Figure: MCC rank, BIO rank and final score for G_pdf and state-of-the-art alternatives]

Page 42: My PhD thesis presentation slides

Observations about knowledge integration

• Improved results in terms of stability and interpretability

• Goodall similarity with the pdf-equalization scheme is the best way to integrate prior databases

• G_pdf performance is also confirmed against state-of-the-art alternatives (contrib #4)

Page 43: My PhD thesis presentation slides

4.5 MULTICLASS CLASSIFICATION

Study of a novel algorithm for multiclass classification applying coding theory on multiple binary classifiers


Page 44: My PhD thesis presentation slides

Multiclass approach combining multiple binary classifiers

• Common methods like One Against All (OAA) or One Against One (OAO) can be improved.

• Information coding gives good results [119]

• Propose a novel approach based on error-correcting output code (ECOC) ideas

[119] E. Tapia, L. Ornella, P. Bulacio, and L. Angelone. Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics 2011.

Page 45: My PhD thesis presentation slides

Our proposal: OAA+PAA

• Choice to combine several experts:

– OAA = one classifier per class

– PAA (Pair Against All) = one classifier per class pair, separating that pair from all other classes

• Expert = bit in a codeword

• Class estimation by distance to reference words

Codewords for N = 4 classes and M = 10 binary classifiers (h1 … hM):

      h1  h2  h3  h4  h5  h6  h7  h8  h9  h10
c1     1   0   0   0   1   1   1   0   0   0
c2     0   1   0   0   1   0   0   1   1   0
c3     0   0   1   0   0   1   0   1   0   1
c4     0   0   0   1   0   0   1   0   1   1
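The codeword matrix above can be generated and decoded with the short sketch below. It assumes L1 (Hamming) distance to the reference words and hard 0/1 classifier outputs; the function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def oaa_paa_codebook(n_classes):
    """One reference codeword per class: OAA bits followed by PAA bits."""
    oaa = np.eye(n_classes, dtype=int)              # one classifier per class
    pairs = list(combinations(range(n_classes), 2))
    paa = np.zeros((n_classes, len(pairs)), dtype=int)
    for col, (a, b) in enumerate(pairs):
        paa[[a, b], col] = 1                        # the pair versus all the rest
    return np.hstack([oaa, paa])

def decode(bits, codebook):
    # ASSUMPTION: L1 distance between classifier outputs and reference words.
    distances = np.abs(codebook - np.asarray(bits)).sum(axis=1)
    return int(np.argmin(distances))

codebook = oaa_paa_codebook(4)
# Codeword of class c3 with its last bit flipped by a classifier error.
noisy = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
predicted = decode(noisy, codebook)                 # index 2 corresponds to c3
```

For N classes this gives M = N + N(N-1)/2 classifiers. The PAA bits add redundancy, so a single wrong expert still leaves the output closest to the correct reference word.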

Page 46: My PhD thesis presentation slides

Experiments on 7 public datasets

• Binary classifiers trained with Treelet + IFFS

• Compared with OAA, OAO and state-of-the-art alternatives [119]

• 50 Monte Carlo runs of 4:1 cross-validation

Page 47: My PhD thesis presentation slides

Average accuracy

OAA+PAA is better than OAA, OAO and state-of-the-art alternatives

[Bar chart: accuracy (axis from 70% to 85%) for OAA, OAO, [119] LDPC, [119] OAA and OAA+PAA L1]

Page 48: My PhD thesis presentation slides

Observations about OAA+PAA

• It consistently outperforms OAA and OAO algorithms

• Obtains better accuracy than the state-of-the-art alternatives from [119]

• OAA+PAA is a valid multiclass algorithm (contrib #5)

Page 49: My PhD thesis presentation slides


5- CONCLUSIONS

Page 50: My PhD thesis presentation slides

Two-step approach is the main contribution

• Feature set enhancement

– Addresses lack of structure

– Addresses noise

• Feature selection & classification

– Choose the best variables among thousands available with new algorithms


Page 51: My PhD thesis presentation slides

Validated contributions

• Metagenes are helpful for classification

• Tailored IFFS algorithm improves state of the art

• Ensemble learning algorithm led to interesting results

• Knowledge integration framework improves interpretability and robustness

• OAA+PAA as a valid multiclass algorithm


Page 52: My PhD thesis presentation slides

Publications

Bosio M, Bellot P, Salembier P, Oliveras A. “Gene Expression Data Classification Combining Hierarchical Representation and Efficient Feature Selection”. Journal of Biological Systems. 2012;20:349-375.

Bosio M, Bellot P, Salembier P, Oliveras A. “Feature set enhancement via hierarchical clustering for microarray classification”. In: IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2011); 2011. pp. 226-229.

Bosio M, Bellot P, Salembier P, Oliveras A. “Microarray classification with hierarchical data representation and novel feature selection criteria”. In: IEEE 12th International Conference on BioInformatics and BioEngineering. Larnaca, Cyprus; 2012.

Bosio M, Bellot P, Salembier P, Oliveras A. “Multiclass cancer microarray classification algorithm with Pair-Against-All redundancy”. In: The 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS’12). Washington, DC, USA; 2012.

Bosio M, Salembier P, Bellot P, Oliveras A. “Hierarchical clustering combining numerical and biological similarities for gene expression data classification”. In: 35th Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'13). Osaka, Japan; 07/2013.

Bosio M, Salembier P, Oliveras A, Bellot P. “Ensemble feature selection and hierarchical data representation for microarray classification”. In: 13th IEEE International Conference on BioInformatics and BioEngineering (BIBE). Chania, Crete; 2013.

[Slide labels marking each publication's topic: IFFS, KUN, BIO INFO, MCLASS, METAGENES]

Page 53: My PhD thesis presentation slides

Future research directions

• Study a better use of the tree structure

• Integrate more information sources

• Deepen knowledge for ensemble learning

• Study applicability to Next Generation Sequencing analysis or other ‘omics’ platforms
