Winning the rat race

Post on 29-Jan-2016


The sbv IMPROVER species translation challenge

Sometimes you can trust a rat

Sahand Hormoz Adel Dayarian KITP, UC Santa Barbara

Gyan Bhanot Rutgers Univ.

Michael Biehl University of Groningen, Johann Bernoulli Institute

www.cs.rug.nl/biehl


sbv IMPROVER species translation challenge

systems biology verification combined with industrial methodology for process verification in research

IBM Research, Yorktown Heights
Philip Morris International Research and Development
www.sbvimprover.com


protein phosphorylation

reversible protein phosphorylation

addition or removal of a phosphate group

alters shape and function of proteins


protein phosphorylation

chemical stimuli

gene expression

reversible protein phosphorylation

addition or removal of a phosphate group

alters shape and function of proteins


www.sbvimprover.com

chemical stimuli

phosphorylation status (measured)

gene expression (Δ measured)

complex network (incomplete snapshot)


challenge data

• normal bronchial epithelial cells, derived from human and rat
• 52 different chemical stimuli (26 (A) + 26 (B)), additional controls
• phosphorylation status after 5 minutes and 25 minutes
• gene expression after 6 hours

• rather low noise levels
• subtract control, median of replicates

challenge organizers: activation abs(P) > 3 @5min. or @25min.
• ~ 10% positive examples

• noisy data (microarray)
• correct for saturation effects

N= 20110 (human)

N= 13841 (rat)


challenge set-up and goals

1 intra-species prediction of phosphorylation from gene expression

2 predict the response in human using data available for rat cells

3 predict gene expression response across species


intra-species phosphorylation prediction

sub-challenge 1

combination of two approaches:

• voter method
  gene selection based on mutual information
• machine learning analysis
  Principal Components representation + Linear Discriminant Analysis
• weighted combination
  based on Leave-One-Out cross validation


voter method

binarize data by thresholding

gene expression: G=1 if p < 0.01 (p-value for differential expression)

phosphorylation : P=1 if abs(P) > 3 (@5min. or @25 min.)

for all pairs of genes and proteins:

calculate separate and joint entropies

using frequencies over stimuli

mutual information

assumption: high I indicates that a gene is predictive for the

corresponding protein status
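The entropy-based gene scoring described here can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes the standard identity I(G;P) = H(G) + H(P) - H(G,P) with probabilities estimated as relative frequencies over the stimuli.

```python
import math
from collections import Counter


def binarize_gene(p_values, alpha=0.01):
    """G=1 if the differential-expression p-value is below alpha (0.01 on the slide)."""
    return [1 if pv < alpha else 0 for pv in p_values]


def binarize_protein(p_5min, p_25min, threshold=3.0):
    """P=1 if abs(P) exceeds the threshold at 5 min. or at 25 min."""
    return [1 if abs(a) > threshold or abs(b) > threshold else 0
            for a, b in zip(p_5min, p_25min)]


def entropy(symbols):
    """Shannon entropy in bits, using relative frequencies over the stimuli."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(g, p):
    """I(G;P) = H(G) + H(P) - H(G,P) for two binarized sequences of equal length."""
    joint = list(zip(g, p))  # one (gene, protein) pair per stimulus
    return entropy(g) + entropy(p) - entropy(joint)
```

With only 26 stimuli per data set, such frequency estimates are coarse, so the resulting values are better read as a ranking criterion than as precise information measures.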


example:

SYNPR level predictive of AKT1 activation

green = significant phosphorylation
red = significant gene expression

SYNPR under-expressed AKT1 phosphorylated

voter method

for each protein:

- determine a set of most predictive genes (varying number ~ 30-70)

- vote according to the presence of significant gene expressions

relative frequency of positive votes determines certainty score in [0,1]
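A rough sketch of the voting step, under stated assumptions: genes are ranked by a precomputed mutual information value, the top n are kept, and each selected gene casts a positive vote when it is significantly expressed for the stimulus in question. The names `select_predictive_genes` and `vote_certainty` are hypothetical, and the slides do not say how ties or the direction of the association are handled.

```python
def select_predictive_genes(mi_by_gene, n_genes):
    """Return the n_genes gene names with the highest mutual information."""
    ranked = sorted(mi_by_gene, key=mi_by_gene.get, reverse=True)
    return ranked[:n_genes]


def vote_certainty(selected_genes, significant):
    """Relative frequency of positive votes, a certainty score in [0, 1].

    `significant` maps gene name -> 1 if the gene is significantly
    expressed for this stimulus, else 0 (missing genes count as 0).
    """
    if not selected_genes:
        return 0.0  # assumption: no voters means no evidence for activation
    votes = sum(1 for gene in selected_genes if significant.get(gene, 0) == 1)
    return votes / len(selected_genes)
```

For example, if 2 of 4 selected genes are significantly expressed for a stimulus, the certainty for that protein and stimulus is 0.5.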

Leave-One-Out (L-1-O) validation:

consider mutual information only over 25 stimuli, predict the 26th

performance estimate with respect to predicting novel data


voter method prediction

[Figure: voter method certainties; proteins 1-16 (rows) × stimuli 27-52 (columns)]

• voting schemes obtained from examples in A, applied to the 26 new stimuli of data set B

416 predictions w.r.t. data set B

• certainties in [0,1] on average over the 26 L-1-O runs


machine learning approach

low-dimensional representation of gene expression data

• omit all genes with zero variation or only insignificant (p>0.05) expression values over all 26 training stimuli (13841 -> 6033 genes)

• Principal Component Analysis (PCA) (pcascat, www.mloss.org c/o Marc Strickert)

- error free representation of all data possible by max. 52 PCs

- here: use k ≤ 22 leading PCs only (remove small variations due to noise)

• Linear Discriminant Analysis (LDA) (Matlab, Statistics: classify)

- identifies discriminative directions in k-dim. space

based on within-class and between-class variation

- probabilistic output provided, interpreted as certainty score

- if all training examples negative, score 0 is assigned


machine learning approach

• Leave-One-Out procedure with varying number k of PC projections

for each of the 16 target proteins for k=1:22

- repeat 26 times: LDA based on 25 stimuli, predict the 26th

yields probabilistic prediction 0 ≤ c(k) ≤ 1 (crisp threshold 0.5)

- compute Matthews Correlation Coefficient (0 ≤ mcc ≤ 1)

- determine the number of false positives (fp), true positives (tp),

false negatives (fn), true negatives (tn)
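The evaluation step can be illustrated with a short sketch. The textbook Matthews correlation coefficient lies in [-1, 1]; the range 0 ≤ mcc ≤ 1 quoted here may mean that negative values were set to 0, which is only an assumption and is exposed below as the optional `clip` flag. Treating a certainty exactly equal to 0.5 as negative is likewise an assumption.

```python
import math


def confusion_counts(certainties, labels, threshold=0.5):
    """Count tp, fp, fn, tn after crisp thresholding of the certainties."""
    tp = fp = fn = tn = 0
    for c, y in zip(certainties, labels):
        predicted = 1 if c > threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn, clip=False):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention: a degenerate confusion matrix carries no weight
    value = (tp * tn - fp * fn) / denominator
    return max(0.0, value) if clip else value
```

In the Leave-One-Out setting, the 26 held-out certainties for one protein and one value of k would be passed to `confusion_counts` together with the binarized phosphorylation labels.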


machine learning approach

• perform protein-specific

weighted average to obtain certainties:

• prediction: apply to test set (B) (binarized)

[Figure: two panels, each showing proteins (rows) × stimuli 27-52 (columns)]


machine learning approach

• for fair comparison with voter method:

Nested Leave-One-Out procedure

for each protein, repeat 26 times:

L-1-O using 24 out of 25 stimuli, varying k

mcc-weighted prediction for the 26th stimulus

• averaged certainties as weighted means (unweighted mean if both mcc=0)
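One possible reading of the weighted mean is sketched below: two certainties for the same protein and stimulus (for instance from the voter method and from LDA) are averaged with weights given by their Leave-One-Out mcc values, falling back to the unweighted mean if both mcc are 0. The function name and the clipping of negative mcc values to 0 are assumptions.

```python
def combine_certainties(c_first, c_second, mcc_first, mcc_second):
    """mcc-weighted mean of two certainties in [0, 1]."""
    # Assumption: a negative mcc is treated as weight 0.
    w_first = max(mcc_first, 0.0)
    w_second = max(mcc_second, 0.0)
    total = w_first + w_second
    if total == 0:
        # Both methods carry no weight: use the unweighted mean.
        return 0.5 * (c_first + c_second)
    return (w_first * c_first + w_second * c_second) / total
```

Because the weights are protein-specific, a method that validates poorly for one protein contributes little there without being discarded for the others.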


combined prediction

[Figure: combined prediction; proteins 1-16 (rows) × stimuli 27-52 (columns)]


LDA 0.34 0.71 0.67 2

voting 0.40 0.67 0.65 2

111

combination improved the performance!


inter-species phosphorylation prediction

sub-challenge 2


sub-challenge 2 set-up


restrict ourselves to the use of phosphorylation data only

reasoning: immediate response to stimuli should be comparable between species

www.sbvimprover.com


data

[Diagram: phosphorylation data for 16 proteins (rows) and 52 stimuli (columns); data set A = stimuli 1-26, data set B = stimuli 27-52. Known: ratP in rat data sets A and B, humP in human data set A. Prediction: | humP | > 3 ? in human data set B.]


assume similar activation in both species: “human ≈ rat”

naïve prediction

prediction score, corresponding to threshold 3 for activation

- precise (monotonic!) form is irrelevant for ROC, PR etc.

- threshold 0.5 for crisp classification

- here: scaling factor yields values well-spread in [0,1]
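The score formula itself is not preserved in this transcript. A minimal sketch of one function with the listed properties (monotonic in |ratP|, equal to 0.5 at the activation threshold 3, spread over [0,1] by a scaling factor) is a logistic curve; the logistic form and the default `scale` are assumptions, not the authors' choice.

```python
import math


def naive_score(rat_p, threshold=3.0, scale=1.0):
    """Map a rat phosphorylation value to a certainty in (0, 1).

    The score is 0.5 exactly at abs(rat_p) == threshold, so a crisp
    decision at 0.5 reproduces the rule abs(P) > 3.
    """
    return 1.0 / (1.0 + math.exp(-(abs(rat_p) - threshold) / scale))
```

Since ROC and precision-recall curves depend only on the ordering of the scores, any other strictly increasing function of |ratP| would give the same curves.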


naïve prediction

[Figure: ROC curve (sensitivity vs. 1-specificity), AUC ≈ 0.83, with respect to the full panel (416 predictions) of | humP | > 3]


naïve prediction

[Figure: color-coded certainty for | humP | > 3 in data set B; proteins 1-16 (rows) × stimuli 27-52 (columns)]


machine learning approach

[Diagram: 16 proteins (rows) and 52 stimuli (columns). Training on data set A (stimuli 1-26): 16-dim. vectors of ratP as inputs, | humP | > 3 ? as targets. Prediction for human data set B (stimuli 27-52) from ratP in rat data set B. 16 separate binary classification problems.]


LVQ prediction

LVQ1, one prototype per class

Nearest prototype classification:

here: 16-dim. data
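A compact sketch of LVQ1 with one prototype per class and nearest prototype classification is given below. It uses the textbook update rule (the winning prototype moves towards a sample of its own class and away from a sample of the other class); the initialization at the class means, the learning rate, the number of epochs, and the squared Euclidean distance are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(x, prototypes):
    """Nearest prototype classification: label of the closest prototype."""
    return min(prototypes, key=lambda label: squared_distance(x, prototypes[label]))


def train_lvq1(X, y, eta=0.05, epochs=50, seed=0):
    """LVQ1 with one prototype per class, initialized at the class means."""
    rng = random.Random(seed)
    prototypes = {label: class_mean([x for x, t in zip(X, y) if t == label])
                  for label in sorted(set(y))}
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # present the samples in random order
        for i in order:
            winner = classify(X[i], prototypes)
            sign = 1.0 if winner == y[i] else -1.0  # attract or repel
            prototypes[winner] = [w + sign * eta * (xi - w)
                                  for w, xi in zip(prototypes[winner], X[i])]
    return prototypes
```

In the setting of this sub-challenge, each row of `X` would be a 16-dim. vector of rat phosphorylation values for one stimulus, and one such classifier would be trained per human protein.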


prediction score / certainty for activation

- precise (monotonic!) form is irrelevant for ROC, PR etc.

- crisp classification for threshold 0.5

- here: scaling factor yields range of values similar to naïve prediction

validation: 26 Leave-One-Out training processes:

split data set A in 25 training / 1 test sample

(if training set is all negative: accept naïve prediction)

prediction: ensemble average of certainties over the 26 LVQ systems
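The validation and ensemble scheme can be sketched generically. The callables `train`, `score`, and `fallback` are hypothetical placeholders for an LVQ training routine, its certainty function, and the naïve prediction; only the Leave-One-Out splitting, the all-negative fallback, and the averaging follow the description above.

```python
def leave_one_out_splits(n):
    """Yield (training indices, held-out index) for each of the n samples."""
    for held_out in range(n):
        yield [i for i in range(n) if i != held_out], held_out


def ensemble_certainty(x_new, X, y, train, score, fallback):
    """Average the certainty for x_new over all Leave-One-Out systems."""
    certainties = []
    for train_idx, _ in leave_one_out_splits(len(X)):
        y_train = [y[i] for i in train_idx]
        if not any(y_train):
            # All training labels negative: accept the fallback prediction.
            certainties.append(fallback(x_new))
        else:
            model = train([X[i] for i in train_idx], y_train)
            certainties.append(score(x_new, model))
    return sum(certainties) / len(certainties)
```

For data set A this yields 26 systems per protein, each trained on 25 stimuli; the held-out sample of each split gives the validation estimate, while the average over all systems gives the prediction for a new stimulus from data set B.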

LVQ prediction


LVQ prediction

[Figure: ROC curve (sensitivity vs. 1-specificity), AUC ≈ 0.88, with respect to the full panel (416 predictions) of | humP | > 3, obtained in the Leave-One-Out validation scheme]


naïve prediction

[Figure: ROC curve (sensitivity vs. 1-specificity), AUC ≈ 0.83, with respect to the full panel (416 predictions) of | humP | > 3]


combined prediction

[Figure: two panels, each showing proteins 1-16 (rows) × stimuli 27-52 (columns)]

combined prediction: weighted average according to

protein-specific performance (AUROC)
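A minimal sketch of an AUROC-weighted combination follows. The AUROC is computed as the fraction of (positive, negative) pairs ranked correctly, with ties counted as one half. Using the raw AUROC values as weights is an assumption, as is returning 0.5 when a protein has only one class in the validation data; the exact weighting formula is not given in the slides.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, t in zip(scores, labels) if t == 1]
    negatives = [s for s, t in zip(scores, labels) if t == 0]
    if not positives or not negatives:
        return 0.5  # assumption: undefined AUROC is treated as chance level
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def combine_by_auroc(c_naive, c_lvq, auc_naive, auc_lvq):
    """Weighted average of two certainties with AUROC values as weights."""
    total = auc_naive + auc_lvq
    if total == 0:
        return 0.5 * (c_naive + c_lvq)
    return (auc_naive * c_naive + auc_lvq * c_lvq) / total
```

Both AUROC values would be estimated per protein from data set A, so the weighting can differ from protein to protein.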


combined prediction

[Figure: color-coded certainty for | humP | > 3 in data set B; proteins 1-16 (rows) × stimuli 27-52 (columns)]


naïve (rat) 0.45 0.74 0.79 1

LVQ 0.37 0.69 0.76 3

naïve scheme: best individual prediction

• L-1-O not confirmed in the test set

combination improves performance!

confirmed in “wisdom of the crowd”

analysis


inter-species pathway perturbation prediction

sub-challenge 3


additional data / domain knowledge

1) mapping of rat genes to human orthologs
   HGNC Comparison of Ortholog Predictions, HCOP
   www.genenames.org/cgi-bin/hcop.pl

2) annotation of gene sets representing known pathways and function
   246 gene sets from the C2CP collection (Broad Institute)
   www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP

3) gene set enrichment analysis
   www.broadinstitute.org/gsea/index.jsp
   NES: normalized enrichment scores, representing expression
   FDR: false discovery rate, i.e. statistical significance
   threshold: FDR < 0.25


[Figure: rat vs. human; gene sets (rows) with FDR < 0.25 in stimuli (set A)]

frequent observation:

negative correlations between significant

rat and human gene sets

biology? data (pre-)processing?


machine learning approach

training data: 26 stimuli in rat data set A
246-dim. vectors of rat NES
targets: binarized human FDR (<0.25?)
246 classification problems

• PCA: dimension and noise reduction
  rat gene set data A and B represented by k (≤52) projections
• LDA: linear classifier using k projections as features (probabilistic output)
• Leave-One-Out validation: determine optimal k from data set A
• use k=8 to make predictions for data set B (averaged over 26 L-1-O runs)


human gene set prediction

[Figure: final prediction, certainties; gene sets (rows) × stimuli 27-52 (columns)]


summary

sc-1: intra-species prediction of phosphorylation

gene expression is predictive for phosphorylation status

sc-2: inter-species prediction of phosphorylation

rat phosphorylation is predictive for human cell response

sc-3: inter-species prediction of gene sets

weakly predictive, presence of negative correlations between rat and human genes and gene sets


outlook

• more sophisticated learning schemes / classifiers
  e.g. feature weighting schemes, Matrix Relevance LVQ

• ‘joint’ predictions of protein or gene set tableaus
  e.g. predict 1 protein from 16 + 15 values in set A
  two-step procedure for set B

• include gene expression in sub-challenge 2

• investigate difficult to predict proteins / gene sets

• infer and enhance network models from experimental data
  on-going, new challenge (runs until February 2014)
  Network Verification Challenge (NVC) www.sbvimprover.com


take home messages

• team work works (and skype is great)

• in case of doubt: PCA

• the smaller the data set, the simpler the method

• committees can be useful!

• if you have won the rat race, you might be a rat