gene expression analysis using bayesian networks

Post on 04-Feb-2016


1

Gene Expression Analysis Using Bayesian Networks

Éric Paquet

LBIT Université de Montréal

2

Biological basis

DNA (storage of genetic information)

mRNA (storage and transport of genetic information)

Proteins (expression of genetic information)

RNA polymerase (copies DNA into RNA)

Ribosome (translates genetic information into proteins)

*-PDB file 1L3A, Transcriptional Regulator Pbf-2

3

Biological basis

How do proteins get regulated? Example: the lactose operon of E. coli.

Normally, E. coli uses glucose to get energy. How does it react when there is no more glucose, only lactose?

4

Biological basis

[Figure: the lactose operon and RNA polymerase in an E. coli environment (glucose, lactose; one of them crossed out). Genes shown: lactose decomposer (β-galactosidase), lactose getter (permease).]

Polymerase action is blocked by a DNA lock: the protein associated with gene lacI.

5

Biological basis

[Figure: the lactose operon and RNA polymerase in an E. coli environment (glucose, lactose; one of them crossed out). Genes shown: lactose decomposer (β-galactosidase), lactose getter (permease).]

Lactose recruits the protein associated with gene lacI… unlocking the DNA, which is then accessible to the polymerase.

6

Biological basis

[Figure: regulation diagram involving lactose decomposer (β-galactosidase) and lactose getter (permease). Legend: edge type "inhibit".]

7

Biological basis

[Figure: the lactose operon and RNA polymerase in an E. coli environment (glucose crossed out, lactose present). Labels: CAP, c-AMP, lactose decomposer (β-galactosidase), lactose getter (permease).]

In the absence of glucose, a polymerase magnet binds to the DNA and speeds up production of the gene products that help decompose lactose.

8

Biological basis

[Figure: regulation diagram involving lactose decomposer (β-galactosidase) and lactose getter (permease). Legend: edge types "inhibit" and "activate".]

Research goal: infer these links.

9

Why?

• Get insights about cellular processes
• Help understand diseases
• Find drug targets

10

How?

Using gene expression data and tools for learning Bayesian networks

[Figure: gene expression data* (axes: experiments, [mRNA]) + tools for learning Bayesian networks, leading to the links around lactose decomposer (β-galactosidase) and lactose getter (permease).]

*-Spellman et al. (1998) Mol Biol Cell 9:3273-97

11

What is gene expression data?

Data showing the concentration of a specific mRNA at a given time in the life of the cell.

Each real value comes from one spot and tells whether the concentration of a specific mRNA is higher (+) or lower (-) than the normal value.

[Figure: gene expression matrix*; axes: experiments, [mRNA]. Each column is the result of one image.]

*-Spellman et al. (1998) Mol Biol Cell 9:3273-97

12

What are Bayesian networks?

A graphical representation of a joint distribution over a set of random variables.

[Figure: DAG over nodes A, B, C, D, E with edges A→C, A→D, B→D, D→E.]

P(A,B,C,D,E) = P(A) * P(B) * P(C|A) * P(D|A,B) * P(E|D)

Nodes represent gene expression, while edges encode the interactions (e.g. inhibition, activation).
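
As a minimal sketch of the factorization above, the joint probability can be computed as a product of local terms. The variables are assumed binary and every number in the tables is made up for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables (values 0/1).
# Each entry gives P(variable = 1 | parents); the numbers are invented.
p_a1 = 0.3
p_b1 = 0.6
p_c1_given_a = {0: 0.2, 1: 0.9}
p_d1_given_ab = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.95}
p_e1_given_d = {0: 0.25, 1: 0.8}


def bern(p_one, value):
    """Probability that a binary variable takes `value` when P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B) P(C|A) P(D|A,B) P(E|D)."""
    return (bern(p_a1, a) * bern(p_b1, b) * bern(p_c1_given_a[a], c)
            * bern(p_d1_given_ab[(a, b)], d) * bern(p_e1_given_d[d], e))


# The 32 joint probabilities should sum to 1 (up to rounding).
total = sum(joint(*values) for values in product((0, 1), repeat=5))
print(round(total, 10))
```

Only 1 + 1 + 2 + 4 + 2 = 10 numbers are needed here, against 31 for an unstructured joint table over five binary variables.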

13

A small problem with Bayesian networks

A Bayesian network must be a DAG (Directed Acyclic Graph), but there are many examples of regulatory networks that have directed cycles.*

[Figure: three regulatory networks with cycles: hysteretic oscillator, switch, transcription factor dimer.]

*-Husmeier D., Bioinformatics, Vol. 19 no. 17 2003, pages 2271–2282

14

How can we deal with that?

Using DBN (Dynamic Bayesian Networks*) and sequential gene expression data

[Figure: a network over A and B is unfolded in time into nodes A1, B1 (time t) and A2, B2 (time t+1).]

We unfold the network in time.

*-Friedman, Murphy, Russell, Learning the Structure of Dynamic Probabilistic Networks

DBN = BN with constraints on parent and child nodes.
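
A small sketch of why unfolding helps, assuming a two-gene feedback loop in which A and B regulate each other. The helper names `has_cycle` and `unfold` are hypothetical, and each edge X → Y is assumed to become X(t) → Y(t+1).

```python
def has_cycle(edges):
    """Return True if the directed graph given as (parent, child) pairs has a cycle."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
    state = {node: 0 for node in children}  # 0 = unvisited, 1 = in progress, 2 = done

    def visit(node):
        state[node] = 1
        for nxt in children[node]:
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[node] == 0 and visit(node) for node in children)


def unfold(edges, n_slices):
    """Replace each edge X -> Y by X(t) -> Y(t+1) for consecutive time slices."""
    return [((parent, t), (child, t + 1))
            for t in range(n_slices - 1)
            for parent, child in edges]


feedback_loop = [("A", "B"), ("B", "A")]   # A and B regulate each other
unfolded = unfold(feedback_loop, n_slices=2)

print(has_cycle(feedback_loop))  # True: not a valid Bayesian network
print(has_cycle(unfolded))       # False: every edge points forward in time
```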

15

What are we searching for?

A Bayesian network that is most probable given the data D (gene expression).

We find this BN as follows:

$$BN^{*} = \arg\max_{BN} P(BN \mid D)$$

where

$$P(BN \mid D) = \frac{P(D \mid BN)\,P(BN)}{P(D)}$$

with P(D|BN) the marginal likelihood, P(BN) the prior on network structure and P(D) the data probability.

Naive approach to the problem: try all possible DAGs and keep the best one!

16

It is impossible to try all possible DAGs because the number of DAGs increases super-exponentially with the number of nodes:

n = 3 → 25 DAGs
n = 4 → 543 DAGs
n = 5 → 29281 DAGs
n = 6 → 3781503 DAGs
n = 7 → 1138779265 DAGs
n = 8 → 783702329343 DAGs
…

We are interested in problems with around 60 nodes…
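
The counts above can be reproduced with Robinson's recurrence for the number of labelled DAGs. The recurrence is not named in the slides; it is used here as a standard way to obtain these numbers.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))


for n in range(3, 9):
    print(n, count_dags(n))
```

Python integers have arbitrary precision, so `count_dags(60)` can also be evaluated to see how far out of reach exhaustive search is.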

17

Learning Bayesian networks from data

Choose a method for searching the space of networks and a representation for the conditional distributions.

• Network space search methods
  • Greedy hill-climbing
  • Beam search
  • Stochastic hill-climbing
  • Simulated annealing
  • MCMC simulation

• Conditional distribution representation
  • Linear Gaussian
  • Multinomial, binomial

Search moves: basically add, remove and reverse edges.

[Figure: network over A, B and C, with the quantities to learn: P(a) = ?, P(b) = ?, P(c|a,b) = ?]
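
The add, remove and reverse moves can be sketched as a neighbourhood function over DAGs. This is an illustrative sketch, not the author's implementation: the names `is_acyclic` and `neighbours` are hypothetical, and a network is represented as a set of (parent, child) pairs.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Kahn-style check: repeatedly strip nodes that have no incoming edge."""
    remaining = set(nodes)
    edges = set(edges)
    while remaining:
        roots = {n for n in remaining if not any(c == n for _, c in edges)}
        if not roots:
            return False
        remaining -= roots
        edges = {(p, c) for p, c in edges if p not in roots}
    return True


def neighbours(nodes, edges):
    """All DAGs reachable by adding, removing or reversing one edge."""
    edges = frozenset(edges)
    result = []
    for p, c in permutations(nodes, 2):
        if (p, c) in edges:
            result.append(edges - {(p, c)})              # remove p -> c
            candidate = (edges - {(p, c)}) | {(c, p)}    # reverse p -> c
        elif (c, p) not in edges:
            candidate = edges | {(p, c)}                 # add p -> c
        else:
            continue  # the pair is handled when visited in the other order
        if is_acyclic(nodes, candidate):
            result.append(candidate)
    return result


nodes = ["A", "B", "C"]
network = {("A", "C"), ("B", "C")}   # A and B are parents of C
moves = neighbours(nodes, network)
print(len(moves))  # 6: two removals, two reversals, two additions
```

Hill-climbing would score each neighbour and move to the best one, while MCMC picks one neighbour at random and applies an acceptance test.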


19

We use three levels of gene expression

Sort the values:

-1.06 -0.12 0.18 0.21 1.16 1.19

Split the data into 3 equal buckets:

-1.06 -0.12 → 0
0.18 0.21 → 1
1.16 1.19 → 2

Discretized data:

0 0 2 2 1 1
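
A minimal sketch of this equal-bucket discretization. The measurement order of the six values is not recoverable from the transcript, so the order used below is an assumption chosen to reproduce the discretized row 0 0 2 2 1 1.

```python
def discretize_equal_buckets(values, n_buckets=3):
    """Map each value to a bucket index so that buckets hold equal numbers of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, i in enumerate(order):
        levels[i] = rank * n_buckets // len(values)
    return levels


# Hypothetical measurement order; the sorted values match the slide.
expression = [-1.06, -0.12, 1.16, 1.19, 0.18, 0.21]
print(discretize_equal_buckets(expression))  # [0, 0, 2, 2, 1, 1]
```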

20

Back to:

$$P(BN \mid D) = \frac{P(D \mid BN)\,P(BN)}{P(D)}$$

with P(D|BN) the marginal likelihood, P(BN) the prior on network structure and P(D) the data probability.

21

Insight on each term

P(BN) → prior on the network

• In our research, we always use a prior equal to 1
• We could use it to incorporate knowledge

E.g.: we know that an edge is present. If the edge is in the BN, P(BN) = 1, else P(BN) = 0.

Efforts are made to reduce the search space by using knowledge, e.g. limiting the number of parents or children.

22

Insight on each term

P(D|BN) → marginal likelihood

Easy to calculate using a multinomial distribution with a Dirichlet prior*:

$$P(d \mid bn) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N_{ij})}{\Gamma(N_{ij} + M_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(a_{ijk} + s_{ijk})}{\Gamma(a_{ijk})}$$

*-Heckerman, A Tutorial on Learning With Bayesian Networks, and Neapolitan, Learning Bayesian Networks
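
A hedged sketch of the closed-form marginal likelihood for multinomial variables with Dirichlet priors, computed in log space to avoid overflow. The symbol meanings are assumptions based on the usual textbook convention: for each node, `counts[j][k]` is the number of cases with parent configuration j and node state k, and `prior[j][k]` is the matching Dirichlet hyperparameter. The function names are hypothetical.

```python
from math import exp, lgamma


def log_marginal_likelihood_node(counts, prior):
    """Log marginal likelihood contribution of one node.

    counts[j][k] = number of cases with parent configuration j and node state k.
    prior[j][k]  = Dirichlet hyperparameter for the same cell (assumed > 0).
    """
    total = 0.0
    for s_j, a_j in zip(counts, prior):
        n_ij, m_ij = sum(a_j), sum(s_j)          # prior mass and case count for row j
        total += lgamma(n_ij) - lgamma(n_ij + m_ij)
        total += sum(lgamma(a + s) - lgamma(a) for a, s in zip(a_j, s_j))
    return total


def log_marginal_likelihood(per_node_counts, per_node_priors):
    """Sum the per-node contributions over all nodes of the network."""
    return sum(log_marginal_likelihood_node(c, p)
               for c, p in zip(per_node_counts, per_node_priors))


# One binary node without parents, uniform prior, observed states: 0, 1, 1.
value = exp(log_marginal_likelihood([[[1, 2]]], [[[1.0, 1.0]]]))
print(value)  # close to 1/12
```

For this tiny case the result can be checked by hand: Γ(2)/Γ(5) · Γ(2)Γ(3)/(Γ(1)Γ(1)) = (1/24) · 2 = 1/12.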

23

MCMC (Markov Chain Monte Carlo) simulation

Markov chain part: zoom on a node of the chain

[Figure: a network over A, B and C and its neighbouring networks. Five neighbours are reached with P(BNnew) = 1/5 each; one has P(BNnew) = 0.]

24

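MCMC (Markov Chain Monte Carlo) simulation

Monte Carlo part:

• Choose the next BN with probability P(BNnew)
• Accept the new BN with the following Metropolis-Hastings acceptance criterion:

$$
\begin{aligned}
P_{MH} &= \min\left\{1,\ \frac{P(BN_{new} \mid D)}{P(BN_{old} \mid D)} \cdot P(BN_{new})\right\} \\
&= \min\left\{1,\ \frac{P(D \mid BN_{new})\,P(BN_{new})\,/\,P(D)}{P(D \mid BN_{old})\,P(BN_{old})\,/\,P(D)} \cdot P(BN_{new})\right\} \\
&= \min\left\{1,\ \frac{P(D \mid BN_{new})\,P(BN_{new})}{P(D \mid BN_{old})\,P(BN_{old})} \cdot P(BN_{new})\right\}
\end{aligned}
$$

P(D) is gone!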
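
A minimal sketch of a Metropolis-Hastings acceptance test in log space. It uses the textbook form, in which the proposal probabilities enter as a ratio (probability of proposing BNold from BNnew over probability of proposing BNnew from BNold). How the slide's proposal factor P(BNnew) is meant to enter is an assumption here, and the function and parameter names are hypothetical.

```python
import math
import random


def mh_acceptance(log_score_new, log_score_old, q_new_given_old, q_old_given_new):
    """Metropolis-Hastings acceptance probability.

    log_score_*      = log(P(D|BN) * P(BN)); P(D) cancels in the ratio.
    q_new_given_old  = probability of proposing BNnew when standing on BNold.
    q_old_given_new  = probability of proposing BNold when standing on BNnew.
    """
    log_ratio = (log_score_new - log_score_old
                 + math.log(q_old_given_new) - math.log(q_new_given_old))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def accept(log_score_new, log_score_old, q_new_given_old, q_old_given_new, rng=random):
    """Draw a uniform number and accept BNnew if it falls below the criterion."""
    return rng.random() < mh_acceptance(log_score_new, log_score_old,
                                        q_new_given_old, q_old_given_new)


# With five equally likely moves in both directions the proposal terms cancel.
print(mh_acceptance(-10.0, -12.0, 1 / 5, 1 / 5))  # 1.0: a better network is always accepted
print(mh_acceptance(-12.0, -10.0, 1 / 5, 1 / 5))  # exp(-2): a worse one only sometimes
```

Working with log scores matters in practice because P(D|BN) is a product of many small terms and underflows quickly.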

25

Monte Carlo part example:

1. Choose a random path, each path having a P(BNnew) of 1/5.
2. Choose another random number. If it is smaller than the Metropolis-Hastings criterion, accept BNnew; otherwise return to BNold.

[Figure: a network over A, B and C and its neighbouring networks. Five neighbours are reached with P(BNnew) = 1/5 each; one has P(BNnew) = 0.]

26

MCMC (Markov Chain Monte Carlo) simulation recap:

• Choose a starting BN at random
• Burn-in phase (generate 5*N BNs from MCMC without storing them)
• Storing phase (get 100*N BN structures from MCMC)

[Figure: log(P(D|BN)P(BN)) versus iteration, with the burn-in phase and the storing phase marked.]
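
The recap can be sketched as a generic sampler with a discarded burn-in phase. To keep the example short and runnable, three toy states with made-up scores stand in for networks; `run_mcmc`, `toy_propose` and `toy_log_score` are hypothetical names, and N = 12 is borrowed from the value quoted later in the slides.

```python
import math
import random


def run_mcmc(start, propose, log_score, n_burn, n_store, rng):
    """Metropolis-Hastings sampler with a discarded burn-in phase.

    propose(state, rng) -> (new_state, q_new_given_old, q_old_given_new)
    log_score(state)    -> log(P(D|BN) * P(BN))
    """
    state, score = start, log_score(start)
    stored = []
    for step in range(n_burn + n_store):
        new_state, q_fwd, q_back = propose(state, rng)
        new_score = log_score(new_state)
        log_ratio = new_score - score + math.log(q_back) - math.log(q_fwd)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state, score = new_state, new_score
        if step >= n_burn:          # storing phase
            stored.append(state)
    return stored


weights = [1.0, 2.0, 7.0]           # made-up unnormalised scores for 3 toy states


def toy_propose(state, rng):
    """Pick one of the two other states uniformly (symmetric proposal)."""
    new_state = rng.choice([s for s in range(3) if s != state])
    return new_state, 0.5, 0.5


def toy_log_score(state):
    return math.log(weights[state])


N = 12
rng = random.Random(0)
samples = run_mcmc(rng.randrange(3), toy_propose, toy_log_score,
                   n_burn=5 * N, n_store=100 * N, rng=rng)
print(len(samples))  # 1200 stored states; the 60 burn-in states are discarded
```

For real networks, `propose` would draw one add, remove or reverse move and `log_score` would return the log marginal likelihood plus the log prior.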

27

Why 100*N BNs and not only 1?

• Because we don't have enough data, and there are many high-scoring networks.
• Instead, we associate a confidence with each edge. E.g.: how many times in the sample do we find an edge going from A to B?
• We can fix a threshold on confidence and retrieve a global network built from the edges whose confidence is over the threshold.
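
Edge confidence and thresholding can be sketched as follows. The four sampled networks are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def edge_confidence(sampled_networks):
    """Fraction of sampled networks that contain each directed edge."""
    counts = Counter(edge for net in sampled_networks for edge in set(net))
    return {edge: c / len(sampled_networks) for edge, c in counts.items()}


def consensus_network(sampled_networks, threshold):
    """Edges whose confidence is strictly over the threshold."""
    confidence = edge_confidence(sampled_networks)
    return {edge for edge, conf in confidence.items() if conf > threshold}


# Hypothetical sample of four networks, each a set of (parent, child) edges.
sample = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("A", "B"), ("C", "B")},
    {("B", "A"), ("B", "C")},
]
print(edge_confidence(sample)[("A", "B")])   # 0.75: A -> B appears in 3 of 4 networks
print(consensus_network(sample, 0.5))
```

Note that A → B and B → A are counted separately, so the direction of an interaction gets its own confidence.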

28

What we are working on:

Mixing sequential and non-sequential data to retrieve interesting relations between genes.

How? Using DBN and MCMC for sequential data + BN and MCMC for non-sequential data.

[Figure: 100*N networks from DBN and 100*N networks from BN feed an information tuner, which outputs the learned network.]

29

How to test the approach:

• Problem: there is no way to test it on real data, because there is no completely known network.
• Solution: work on a realistic simulation where we know the network structure.

Example:

[Figure: example network* with nodes 0 to 13, used to simulate data.]

*-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"

30

How to test the approach:

[Figure: the known network* (nodes 0 to 13) is used to simulate sequential and non-sequential data. DBN MCMC runs on the sequential data and BN MCMC on the non-sequential data; the info tuner combines both into a learned network, which is compared with the known network using ROC curves.]

*-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"

31

Test description:

• Generate 60 sequential data points
• Generate 120 non-sequential data points (roughly the proportion found in reality)
• Run DBN MCMC on the sequential data and keep 100*N sampled networks
• Run BN MCMC on the non-sequential data and keep 100*N sampled networks
• Test performance using weights on the samples:

0 BN, 1 DBN
0.05 BN, 0.95 DBN
…
0.95 BN, 0.05 DBN
1 BN, 0 DBN

The metric used is the area under the ROC curve. A perfect learner gets 1.0, a random one gets 0.5 and the worst one gets 0.
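
The area under the ROC curve can be computed without drawing the curve, as the fraction of (true edge, non-edge) pairs in which the true edge gets the higher confidence. This is a minimal sketch with made-up scores; `roc_auc` is a hypothetical helper, not a library call.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores[i] = confidence assigned to candidate edge i
    labels[i] = 1 if the edge exists in the true network, else 0
    Ties count for one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one true edge and one non-edge")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
print(roc_auc([0.9, 0.8, 0.2, 0.1], labels))  # 1.0, perfect ranking
print(roc_auc([0.1, 0.2, 0.8, 0.9], labels))  # 0.0, worst ranking
print(roc_auc([0.5, 0.5, 0.5, 0.5], labels))  # 0.5, uninformative scores
```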

32

Results:

[Figure: area under ROC curve (0 to 1) as the weighting goes from 1 DBN to 0 BN.]

33

Perspective:

• Working on more sophisticated ways to mix sequential and non-sequential data
• Working on real cases:
  • Yeast cell cycle
  • Arabidopsis thaliana circadian rhythm
• Real data also means missing values
  • Evaluate solutions for missing values (EM, KNNImpute)

34

Acknowledgements:

François Major

35

Why are there missing data?

• Low correlation
• Experimental problems

36

ROC Curve

Receiver Operating Characteristic curve

[Figure: example ROC curve*]

*-http://gim.unmc.edu/dxtests/roc2.htm

37

MCMC simulation and number of sampled networks

[Figure: ROC curve area as a function of the number of networks sampled from the MCMC simulation, for N=12. x-axis: # of samples from MCMC (500 to 5000); y-axis: ROC area (0.86 to 0.895).]
