Computational Diagnostics based on Large Scale Gene Expression Profiles using MCMC Rainer Spang, Max Planck Institute for Molecular Genetics, Berlin Harry Zuzan, Carrie Blanchette, Erich Huang, Holly Dressman, Jeff Marks, Joe Nevins, Mike West

Post on 03-Jan-2016

Page 1

Computational Diagnostics based on Large Scale Gene Expression Profiles using MCMC

Rainer Spang,

Max Planck Institute for Molecular Genetics, Berlin

Harry Zuzan, Carrie Blanchette, Erich Huang, Holly Dressman, Jeff Marks, Joe Nevins, Mike West

Duke Medical Center & Duke University

Page 2

Estrogen Receptor Status

• 7000 genes
• 49 breast tumors
• 25 ER+
• 24 ER-

Page 3

Tumor – Chip – 7000 Numbers

Page 4

Given: 7000 numbers

Wanted: the probability that the tumor is ER+ (for example, 89%)

Page 5

7000 Numbers Are More Numbers Than We Need

Predict ER status based on the expression levels of super-genes

Page 6

Singular Value Decomposition

$$X = A \, D \, F$$

• $X$: data
• $A$: loadings
• $D$: singular values
• $F$: expression levels of super genes (orthogonal matrix)
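A minimal NumPy sketch of this decomposition, assuming a random stand-in matrix in place of real expression data (7000 genes by 49 tumors). The variable names `A`, `d`, and `F` are chosen here to mirror loadings, singular values, and super-gene expression levels. It is an illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_tumors = 7000, 49

# Stand-in for the expression matrix (rows: genes, columns: tumors).
X = rng.normal(size=(n_genes, n_tumors))

# Thin SVD: X = A @ diag(d) @ F
A, d, F = np.linalg.svd(X, full_matrices=False)

# Each row of F is one super gene: 49 expression levels, one per tumor.
print(A.shape, d.shape, F.shape)   # (7000, 49) (49,) (49, 49)
```

With 49 tumors there are at most 49 super genes, so the 7000 numbers per tumor are summarized by 49.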

Page 7

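Probit Model

$$P[Y_i = 1 \mid \boldsymbol{\beta}] = \Phi\Big(\beta_0 + \sum_{j \,\in\, \text{all super genes}} \beta_j x_{ij}\Big)$$

• $Y_i$: class of tumor $i$
• $\Phi$: distribution function of a standard normal
• $\beta_j$: regression weight for super gene $j$
• $x_{ij}$: expression level of super gene $j$ in tumor $i$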
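The probit probability can be sketched in a few lines of standard-library Python. The function name `prob_er_positive` and the weights and expression levels below are hypothetical values made up for illustration; only the formula itself comes from the slide.

```python
from statistics import NormalDist

def prob_er_positive(x, beta, beta0=0.0):
    """Probit probability Phi(beta0 + sum_j beta_j * x_j) for one tumor."""
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return NormalDist().cdf(eta)

# Hypothetical weights and super-gene levels for a single tumor.
beta = [0.8, -0.3, 0.5]
x = [1.2, 0.4, 0.9]
p = prob_er_positive(x, beta)
print(f"P(ER+) = {p:.2f}")
```

A linear predictor of zero gives a probability of exactly one half, so the sign of the weighted sum decides which class is favored.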

Page 8

Overfitting

• Using only a small number of super genes is not robust at all

• When using many (all) supergenes, the linear model can be easily saturated, i.e. we have several models that fit perfectly well

• Consequence: For a new patient we find among these models some that support that she is ER+ and others that predict she is ER-
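The saturation problem can be illustrated with a simplified stand-in: a linear score fitted to labels coded as +1 / -1, rather than the probit model itself. The data are random and the names (`Z`, `w1`, `w2`, `z_new`) are hypothetical. With an intercept plus 49 super genes there are 50 weights for 49 tumors, so more than one weight vector reproduces the training labels exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tumors, n_super = 49, 49

# Stand-in design matrix: intercept column plus 49 super-gene levels.
Z = np.hstack([np.ones((n_tumors, 1)), rng.normal(size=(n_tumors, n_super))])
y = np.where(np.arange(n_tumors) < 25, 1.0, -1.0)   # 25 ER+, 24 ER-

# 50 weights, 49 tumors: the minimum-norm solution fits exactly ...
w1 = np.linalg.pinv(Z) @ y
# ... and so does w1 plus any multiple of a null-space direction of Z.
v = np.linalg.svd(Z)[2][-1]

z_new = np.concatenate([[1.0], rng.normal(size=n_super)])  # a new patient
c = -2.0 * (z_new @ w1) / (z_new @ v)                      # chosen to flip the score
w2 = w1 + c * v

print(np.allclose(Z @ w1, y), np.allclose(Z @ w2, y))      # both fit the training data
print(np.sign(z_new @ w1), np.sign(z_new @ w2))            # opposite calls for the new patient
```

Both models are indistinguishable on the training profiles, yet they give opposite diagnoses for the new patient.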

Page 9

Given the Few Profiles With Known Diagnosis:

• The uncertainty about the right model is high

• The variance of the model weights is large

• The likelihood landscape is flat

• We need additional model assumptions to solve the problem

Page 10

Informative Priors

Likelihood × Prior ∝ Posterior

Page 11

If the Prior Is Chosen Badly:

• We cannot reproduce the diagnosis of the training profiles any more

• We still cannot identify the model

• The diagnosis is driven mostly by the additional assumptions and not by the data

Page 12

The Prior Needs to Be Designed in 49 Dimensions

• Shape?
• Center?
• Orientation?
• Not too narrow ... not too wide

Page 13

Shape

Multidimensional normal, for simplicity

Page 14

Center

Assumptions on the model correspond to assumptions on the diagnosis

$$P[Y_i = 1 \mid \boldsymbol{\beta}]$$

Page 15

Orientation

Orthogonal super-genes!

Page 16

Not Too Narrow ... Not Too Wide

Auto adjusting model

Scales are hyper parameters with their own priors

Page 17

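$$p(\boldsymbol{\beta} \mid T) = \prod_{j=1}^{n} N(\beta_j \mid 0, \tau_j^2 / d_j^2)$$

• Prior given the hyper parameter
• Hyper parameter: $T = (\tau_1, \dots, \tau_n)$
• Independent super genes: the prior is a product over $j$
• Unbiased prior: each weight has mean 0
• Rescaling by singular values: each variance is divided by $d_j^2$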
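One way to read the rescaling, assuming the prior is $\beta_j \mid \tau_j \sim N(0, \tau_j^2 / d_j^2)$ with $d_j$ the $j$-th singular value. The rescaled weight $\theta_j$ is a symbol introduced here for illustration and does not appear on the slide:

```latex
\theta_j = d_j \beta_j
\quad\Longrightarrow\quad
\operatorname{Var}(\theta_j \mid \tau_j) = d_j^2 \cdot \frac{\tau_j^2}{d_j^2} = \tau_j^2 ,
\qquad
\theta_j \mid \tau_j \sim N(0, \tau_j^2).
```

On the rescaled weights the prior has the same form for every super gene, and $\tau_j$ alone sets its width.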

Page 18

A prior for the hyper parameters

$$\tau_j^{-2} \sim \mathrm{Gamma}(k/2, \, k/2)$$

- Conjugate prior
- Flexibility for $\beta_j$
- Symmetric U-shaped prior for $P[Y_i = 1 \mid \boldsymbol{\beta}]$

$k = 2$ or $k = 3$
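The U shape can be checked by simulation. This sketch rests on several assumptions that are not confirmed by the transcript: the gamma prior is placed on the precision $\tau^{-2}$ with shape $k/2$ and rate $k/2$, there is a single super gene with $d = 1$ and unit expression level, and the intercept is zero.

```python
import random
from statistics import NormalDist

random.seed(0)
k = 2
phi = NormalDist().cdf

probs = []
for _ in range(20000):
    # Assumed reading of the slide: precision tau^-2 ~ Gamma(shape k/2, rate k/2).
    precision = random.gammavariate(k / 2, 2 / k)   # second argument is a scale
    beta = random.gauss(0.0, precision ** -0.5)     # beta | tau ~ N(0, tau^2), d = 1
    probs.append(phi(beta))                         # P[Y = 1] for a unit expression level

# Ten equal-width bins on [0, 1]; a U shape has heavy outer bins.
counts = [0] * 10
for p in probs:
    counts[min(int(p * 10), 9)] += 1
print(counts)
```

Under these assumptions the outer bins collect more draws than the central ones, so the prior favors confident calls in either direction without preferring ER+ or ER-.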

Page 19

Latent Variable

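$$h_i = \beta_0 + \sum_{j} \beta_j x_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, 1)$$

$$Y_i = 1 \iff h_i > 0$$

$$P[Y_i = 1 \mid \boldsymbol{\beta}] = \Phi\Big(\beta_0 + \sum_{j \,\in\, \text{all super genes}} \beta_j x_{ij}\Big)$$

Albert & Chib 1993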
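A quick simulation shows why thresholding a latent normal variable gives the probit probability. The value of `eta`, standing for the linear predictor $\beta_0 + \sum_j \beta_j x_{ij}$ of one tumor, is a hypothetical number picked for illustration.

```python
import random
from statistics import NormalDist

random.seed(0)
eta = 0.7            # hypothetical linear predictor for one tumor
n_draws = 100_000

# Draw h = eta + eps with eps ~ N(0, 1) and record how often h > 0.
hits = sum(1 for _ in range(n_draws) if eta + random.gauss(0.0, 1.0) > 0)

print(hits / n_draws, NormalDist().cdf(eta))   # empirical frequency vs. Phi(eta)
```

The two numbers agree up to Monte Carlo error, since $P(h_i > 0) = P(\varepsilon_i > -\eta) = \Phi(\eta)$.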

Page 20

MCMC

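- Gibbs Sampler

- Sequential updates of conditional distributions

$$\begin{aligned}
p(h \mid X, \boldsymbol{\beta}, T) &\sim \text{truncated normal} \\
p(T \mid X, \boldsymbol{\beta}, h) &\sim \text{gamma} \\
p(\boldsymbol{\beta} \mid X, h, T) &\sim \text{normal}
\end{aligned}$$

All conditional posteriors can be calculated analytically

West 2001, Albert & Chib 1993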
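The three updates can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: there is no intercept, the gamma prior is assumed to sit on the precisions $\tau_j^{-2}$ (shape $k/2$, rate $k/2$), the truncation of $h$ uses the labels $Y$, and the names `draw_latent` and `gibbs_probit` as well as the toy data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_std = NormalDist()

def draw_latent(mu, y, rng):
    """h ~ N(mu, 1) truncated to h > 0 if y == 1, else to h <= 0 (inverse CDF)."""
    p0 = _std.cdf(-mu)                       # P(h <= 0)
    lo, hi = (p0, 1.0) if y == 1 else (0.0, p0)
    u = min(max(rng.uniform(lo, hi), 1e-12), 1 - 1e-12)
    h = mu + _std.inv_cdf(u)
    # Guard against round-off in the far tails so the sign still matches y.
    return max(h, 1e-12) if y == 1 else min(h, 0.0)

def gibbs_probit(X, y, d, k=2, n_iter=500, seed=0):
    """Gibbs sampler sketch; X is (tumors x super genes), y holds 0/1 labels."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, prec = np.zeros(p), np.ones(p)     # prec[j] stands for tau_j^-2
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # 1. Latent variables: truncated normal given beta and the labels.
        mu = X @ beta
        h = np.array([draw_latent(m, yi, rng) for m, yi in zip(mu, y)])
        # 2. Weights: multivariate normal given h and the prior scales.
        V = np.linalg.inv(X.T @ X + np.diag(prec * d**2))
        V = (V + V.T) / 2                    # enforce symmetry numerically
        beta = rng.multivariate_normal(V @ X.T @ h, V)
        # 3. Hyper parameters: gamma given beta (NumPy takes shape and scale).
        prec = rng.gamma((k + 1) / 2, 2.0 / (k + d**2 * beta**2))
        draws[t] = beta
    return draws

# Toy usage with made-up data: 30 tumors, 5 super genes.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
true_beta = np.array([2.0, -1.5, 0.0, 0.0, 1.0])
y = (X @ true_beta + rng.normal(size=30) > 0).astype(int)
d = np.ones(5)                               # placeholder singular values
draws = gibbs_probit(X, y, d)
print(draws[100:].mean(axis=0))              # posterior means after burn-in
```

Each step samples from a standard distribution, which is what makes the Gibbs sampler practical here: no tuning of proposal distributions is needed.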

Page 21

What are the additional assumptions introduced by the prior?

• The model cannot be dominated by only a few super-genes (genes!)

• The diagnosis is based on global changes in the expression profiles, influenced by many genes

• The assumptions are neutral with respect to the individual diagnosis

Page 22

Page 23

Which Genes Have Driven the Prediction?

Gene                          Weight
nuclear factor 3 alpha        0.853
cysteine rich heart protein   0.842
estrogen receptor             0.840
intestinal trefoil factor     0.840
x box binding protein 1       0.835
gata 3                        0.818
ps 2                          0.818
liv1                          0.812
... many many more ...        ...

Page 24

Thank you!