
Gene Expression Analysis Using Bayesian Networks

Éric Paquet

LBIT Université de Montréal

Biological basis

• DNA (storage of genetic information)
• RNA polymerase (copies DNA into RNA)
• mRNA (storage and transport of genetic information)
• Ribosome (translates genetic information into proteins)
• Proteins (expression of genetic information)

*-PDB file 1L3A, Transcriptional Regulator Pbf-2

Biological basis

How do proteins get regulated? Example: the E. coli lactose operon

Under normal conditions, E. coli uses glucose for energy, but how does it react when there is no more glucose, only lactose?

Biological basis

[Figure: lac operon in the E. coli environment (glucose, lactose). RNA polymerase action is blocked because the protein associated with gene lacI locks the DNA. Genes shown: lactose decomposer (β-galactosidase) and lactose getter (permease).]

Biological basis

Lactose

Lactose recruits the protein associated with gene lacI… unlocking the DNA, which is then accessible to the polymerase

Biological basis

[Figure: lac operon with RNA polymerase in the E. coli environment (glucose, lactose); legend: inhibit]

Biological basis

In the absence of glucose, a polymerase magnet binds to the DNA to accelerate production of the gene products that help decompose lactose

Lactose

Biological basis

[Figure legend: inhibit, activate]

Research goal: infer these links

• Get insights about cellular processes
• Help understand diseases
• Find drug targets

Using gene expression data and tools for learning Bayesian networks

*-Spellman et al.(1998) Mol Biol Cell 9:3273-97

[Figure: gene expression data across experiments, passed to tools for learning Bayesian networks]

What is gene expression data?

Data showing the concentration of a specific mRNA at a given time in the life of the cell.

Each real value comes from one spot and indicates whether the concentration of a specific mRNA is higher (+) or lower (-) than the normal value.

[Figure: expression matrix across experiments; each column is the result of one image]

*-Spellman et al. (1998) Mol Biol Cell 9:3273-97

What is a Bayesian network?

A graphical representation of a joint distribution over a set of random variables.

P(A,B,C,D,E) = P(A) * P(B) * P(C|A) * P(D|A,B) * P(E|D)

Nodes represent gene expression, while edges encode the interactions (e.g. inhibition, activation)
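As an illustration of the factorization above, here is a minimal sketch that evaluates the joint probability for binary variables. The probability tables are made-up numbers and the helper names (`bern`, `joint`) are hypothetical; only the structure A → C, A → D, B → D, D → E comes from the formula.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative values only).
p_a = {1: 0.3, 0: 0.7}                     # P(A)
p_b = {1: 0.6, 0: 0.4}                     # P(B)
p_c1_given_a = {0: 0.1, 1: 0.8}            # P(C=1 | A=a)
p_d1_given_ab = {(0, 0): 0.05, (0, 1): 0.5,
                 (1, 0): 0.4, (1, 1): 0.9}  # P(D=1 | A=a, B=b)
p_e1_given_d = {0: 0.2, 1: 0.7}            # P(E=1 | D=d)


def bern(p_one, value):
    """Probability of a binary `value` when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) * P(B) * P(C|A) * P(D|A,B) * P(E|D)."""
    return (p_a[a] * p_b[b]
            * bern(p_c1_given_a[a], c)
            * bern(p_d1_given_ab[(a, b)], d)
            * bern(p_e1_given_d[d], e))


# Summing the joint over all 2^5 assignments should give 1.
total = sum(joint(*values) for values in product((0, 1), repeat=5))
```

The point of the factorization is that five small tables replace one table with 2^5 entries.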

A small problem with Bayesian networks

A Bayesian network must be a DAG (directed acyclic graph), but many regulatory networks contain directed cycles.

*-Husmeier D., Bioinformatics, Vol. 19 no. 17, 2003, pages 2271–2282

[Figure: example regulatory networks: hysteretic oscillator, switch, transcription factor dimer]

How can we deal with that?

Using DBN (Dynamic Bayesian Networks*) and sequential gene expression data

We unfold the network in time

*-Friedman, Murphy, Russell, Learning the Structure of Dynamic Probabilistic Networks

DBN = BN with constraints on parent and child nodes
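A minimal sketch of the unfolding idea, assuming the simplest constraint: every edge X → Y of the regulatory network becomes an edge from X at time t to Y at time t+1. The function names are hypothetical. It checks that a feedback loop, which is not a DAG, becomes acyclic once unfolded in time.

```python
def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no directed cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return visited == len(nodes)


def unfold(nodes, edges, steps):
    """Replace each edge X -> Y by X(t) -> Y(t+1) for consecutive time slices."""
    t_nodes = [(n, t) for t in range(steps) for n in nodes]
    t_edges = [((p, t), (c, t + 1)) for t in range(steps - 1) for p, c in edges]
    return t_nodes, t_edges


genes = ["A", "B"]
feedback_loop = [("A", "B"), ("B", "A")]           # directed cycle A -> B -> A
static_ok = is_acyclic(genes, feedback_loop)       # the loop itself is not a DAG
t_nodes, t_edges = unfold(genes, feedback_loop, steps=3)
unfolded_ok = is_acyclic(t_nodes, t_edges)         # edges only go forward in time
```

Because every unfolded edge points from one time slice to the next, no directed cycle can appear, whatever the original network looks like.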

What are we searching for?

A Bayesian network that is most probable given the data D (gene expression)

We find this BN as follows:

$$BN^* = \arg\max_{BN} P(BN \mid D)$$

Where:

$$P(BN \mid D) = \frac{P(D \mid BN)\, P(BN)}{P(D)}$$

• P(D|BN): marginal likelihood
• P(BN): prior on network structure
• P(D): data probability

Naïve approach to the problem: try all possible DAGs and keep the best one!

It is impossible to try all possible DAGs because the number of DAGs increases super-exponentially with the number of nodes:

n = 3 → 25 DAGs
n = 4 → 543 DAGs
n = 5 → 29281 DAGs
n = 6 → 3781503 DAGs
n = 7 → 1138779265 DAGs
n = 8 → 783702329343 DAGs
…

We are interested in problems with around 60 nodes…
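The counts above can be reproduced with Robinson's recurrence for the number of labelled DAGs, a(n) = Σ_{k=1..n} (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k) with a(0) = 1. The slides do not say how the numbers were obtained, so the sketch below is only one way to check them; the function name is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))


for n in range(3, 9):
    print(n, count_dags(n))
```

Calling `count_dags(60)` returns an integer with hundreds of digits, which is why exhaustive enumeration is out of the question for the networks of interest.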

Learning Bayesian Networks from data?

Choosing a network-space search method and a conditional distribution representation

• Network-space search methods (basically add, remove and reverse edges)
  • Greedy hill-climbing
  • Beam search
  • Stochastic hill-climbing
  • Simulated annealing
  • MCMC simulation

• Conditional distribution representation (for a node C with parents a and b: P(a) = ?, P(b) = ?, P(c|a,b) = ?)
  • Linear Gaussian
  • Multinomial, binomial
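All the search methods listed rely on the same elementary moves: add, remove or reverse one edge while staying a DAG. Here is a minimal sketch of that neighbourhood; the function names and the edge representation (a set of `(parent, child)` pairs) are assumptions made for illustration.

```python
def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no directed cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return visited == len(nodes)


def neighbours(nodes, edges):
    """All DAGs reachable from `edges` by one add, remove or reverse move."""
    edges = frozenset(edges)
    candidates = []
    for x in nodes:
        for y in nodes:
            if x == y:
                continue
            if (x, y) in edges:
                candidates.append(edges - {(x, y)})               # remove x -> y
                candidates.append((edges - {(x, y)}) | {(y, x)})  # reverse x -> y
            elif (y, x) not in edges:
                candidates.append(edges | {(x, y)})               # add x -> y
    # Discard moves that would create a directed cycle.
    return [c for c in candidates if is_acyclic(nodes, c)]


genes = ["A", "B", "C"]
chain = {("A", "B"), ("B", "C")}
moves = neighbours(genes, chain)
```

A greedy hill-climber would score every element of `moves` and keep the best; a stochastic method would pick one at random.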


We use three gene expression levels

Split the data into 3 equal buckets:

-1.06 -0.12 0.18 0.21 1.16 1.19 (raw data)

0 0 2 2 1 1 (discretized data)
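A minimal sketch of the bucketing, assuming "equal buckets" means equal-frequency buckets (two values per bucket in the example). The function name is hypothetical, and numbering the buckets 0, 1, 2 from lowest to highest is an assumption: the slide labels its buckets differently, which should not matter for a multinomial model as long as the coding is used consistently.

```python
def discretize_equal_frequency(values, n_buckets=3):
    """Assign each value a bucket index so that buckets hold equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # Rank 0 is the smallest value; integer division spreads the ranks
        # evenly over the buckets.
        labels[i] = rank * n_buckets // len(values)
    return labels


raw = [-1.06, -0.12, 0.18, 0.21, 1.16, 1.19]
levels = discretize_equal_frequency(raw)
```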

Back to:

$$P(BN \mid D) = \frac{P(D \mid BN)\, P(BN)}{P(D)}$$

• P(D|BN): marginal likelihood
• P(BN): prior on network structure
• P(D): data probability

Insight into each term

P(BN) → prior on the network

• In our research, we always use a prior equal to 1
• We could use it to incorporate knowledge. E.g.: we know that an edge is present. If the edge is in the BN, P(BN) = 1, else P(BN) = 0
• Efforts are made to reduce the search space by using knowledge, e.g. limiting the number of parents or children
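A minimal sketch of such a knowledge-based prior, written in log form so it can be added to a log-likelihood. The function name and its parameters (`required_edges`, `max_parents`) are hypothetical; the 0/1 rule for a known edge comes from the example above, while treating the parent limit as a hard constraint is an assumption.

```python
import math


def log_structure_prior(edges, required_edges=(), max_parents=None):
    """log P(BN): 0.0 (P = 1) if the structure is allowed, -inf (P = 0) otherwise."""
    edges = set(edges)
    # Known edges must be present in the network.
    if not set(required_edges) <= edges:
        return -math.inf
    # Optional limit on the number of parents per node.
    if max_parents is not None:
        n_parents = {}
        for _, child in edges:
            n_parents[child] = n_parents.get(child, 0) + 1
        if any(count > max_parents for count in n_parents.values()):
            return -math.inf
    return 0.0


bn = {("A", "C"), ("B", "C")}
with_known_edge = log_structure_prior(bn, required_edges=[("A", "C")])
missing_known_edge = log_structure_prior(bn, required_edges=[("C", "D")])
too_many_parents = log_structure_prior(bn, max_parents=1)
```

With no knowledge supplied, the function returns 0.0 for every structure, which corresponds to the constant prior of 1.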

Insight into each term

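P(D|BN) → marginal likelihood. Easy to calculate using a multinomial distribution with a Dirichlet prior *

$$P(d \mid bn) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N_{ij})}{\Gamma(N_{ij} + M_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(a_{ijk} + s_{ijk})}{\Gamma(a_{ijk})}$$

where n is the number of nodes, q_i the number of parent configurations of node i, r_i the number of states of node i, a_ijk the Dirichlet hyperparameters, s_ijk the number of cases in which node i is in state k with its parents in configuration j, N_ij = Σ_k a_ijk and M_ij = Σ_k s_ijk.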

*-Heckerman, A Tutorial on Learning With Bayesian Networks; Neapolitan, Learning Bayesian Networks
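A minimal sketch of the multinomial-Dirichlet marginal likelihood, computed in log space with `math.lgamma` to avoid overflow. It assumes a uniform hyperparameter `alpha` for every Dirichlet cell and the same number of states for every variable; the function name and the data layout (one dict per case) are hypothetical.

```python
from collections import Counter
from itertools import product
from math import lgamma


def log_marginal_likelihood(data, parents, n_states=3, alpha=1.0):
    """log P(D|BN) for discretized data.

    data:    list of dicts mapping variable name -> state in range(n_states)
    parents: dict mapping variable name -> tuple of parent names
    """
    total = 0.0
    for var, pa in parents.items():
        # Count cases per (parent configuration, state of var).
        counts = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        for config in product(range(n_states), repeat=len(pa)):
            s = [counts[(config, k)] for k in range(n_states)]
            prior_total = alpha * n_states   # sum of the Dirichlet hyperparameters
            case_total = sum(s)              # number of cases with this configuration
            total += lgamma(prior_total) - lgamma(prior_total + case_total)
            total += sum(lgamma(alpha + s_k) - lgamma(alpha) for s_k in s)
    return total


# Toy data in which B always copies A.
data = [{"A": a, "B": a} for a in (0, 1, 2)] * 3
independent = log_marginal_likelihood(data, {"A": (), "B": ()})
dependent = log_marginal_likelihood(data, {"A": (), "B": ("A",)})
```

On this toy data the structure with the edge A → B scores higher than the structure without it, which is the behaviour a structure search relies on.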

MCMC (Markov Chain Monte Carlo) simulation

Markov chain part: zoom on a node of the chain

[Figure: the current BN with outgoing paths to neighbouring networks, each path labelled P(BNnew) = 1/5]

MCMC (Markov Chain Monte Carlo) simulation

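Monte Carlo part:

• Choose the next BN with probability P(BNnew)
• Accept the new BN with the following Metropolis-Hastings acceptance criterion:

$$P_{MH} = \min\left\{1, \frac{P(BN_{new} \mid D)}{P(BN_{old} \mid D)} \cdot P(BN_{new})\right\}$$

$$= \min\left\{1, \frac{P(D \mid BN_{new})\, P(BN_{new}) / P(D)}{P(D \mid BN_{old})\, P(BN_{old}) / P(D)} \cdot P(BN_{new})\right\}$$

$$= \min\left\{1, \frac{P(D \mid BN_{new})\, P(BN_{new})}{P(D \mid BN_{old})\, P(BN_{old})} \cdot P(BN_{new})\right\}$$

P(D) is gone!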

Monte Carlo part example:

1. Choose a random path. Each path has a P(BNnew) of 1/5.
2. Draw another random number. If it is smaller than the Metropolis-Hastings criterion, accept BNnew; otherwise return to BNold.
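The two steps above can be sketched as follows. This is not the presentation's exact criterion: the sketch uses the textbook Metropolis-Hastings ratio, in which the posterior ratio is corrected by the ratio of proposal probabilities (1 / number of neighbours of each network). The function names and the `log_post` and `neighbours` callables are hypothetical placeholders for a scoring function and a move generator.

```python
import math
import random


def acceptance_probability(log_post_new, log_post_old, n_nb_old, n_nb_new):
    """min(1, posterior ratio * proposal ratio), computed in log space.

    The proposal picks one neighbour uniformly, so Q(new|old) = 1 / n_nb_old
    and Q(old|new) = 1 / n_nb_new.
    """
    log_ratio = (log_post_new - log_post_old) + math.log(n_nb_old) - math.log(n_nb_new)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def mh_step(current, log_post, neighbours, rng):
    """One MCMC move: propose a random neighbour, then accept or stay."""
    nb_old = neighbours(current)
    proposal = rng.choice(nb_old)          # step 1: choose a random path
    nb_new = neighbours(proposal)
    p_accept = acceptance_probability(log_post(proposal), log_post(current),
                                      len(nb_old), len(nb_new))
    # Step 2: draw another random number and compare it with the criterion.
    return proposal if rng.random() < p_accept else current


rng = random.Random(0)
# Toy chain with two states of equal posterior: every proposal is accepted.
state = mh_step(0, log_post=lambda s: 0.0, neighbours=lambda s: [1 - s], rng=rng)
```

Working with log posteriors means P(D) never needs to be computed: it cancels in the difference `log_post_new - log_post_old`.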

MCMC (Markov Chain Monte Carlo) simulation recap:

1. Choose a starting BN at random
2. Burn-in phase (generate 5*N BNs from the MCMC without storing them)
3. Storing phase (collect 100*N BN structures from the MCMC)

[Figure: MCMC run plotted against iteration, with the burn-in phase and the storing phase marked]

Why 100*N BNs and not only 1:

• Because we don’t have enough data and there are many high-scoring networks
• Instead, we associate a confidence with each edge. E.g.: how many times in the sample do we find an edge going from A to B?
• We could fix a threshold on the confidence and retrieve a global network built from the edges whose confidence is over the threshold
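A minimal sketch of the edge-confidence idea, assuming each sampled network is stored as a set of `(parent, child)` edges. The function names and the toy sample are hypothetical.

```python
from collections import Counter


def edge_confidence(sampled_networks):
    """Fraction of sampled networks that contain each directed edge."""
    counts = Counter(edge for net in sampled_networks for edge in set(net))
    n = len(sampled_networks)
    return {edge: count / n for edge, count in counts.items()}


def consensus_network(confidence, threshold):
    """Edges whose confidence is strictly over the threshold."""
    return {edge for edge, c in confidence.items() if c > threshold}


# Toy sample of four networks (a real run would store 100*N of them).
samples = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("A", "B"), ("C", "B")},
    {("B", "A"), ("B", "C")},
]
confidence = edge_confidence(samples)
network = consensus_network(confidence, threshold=0.4)
```

Note that the thresholded edge set is a summary of the sample and is not guaranteed to be a DAG itself.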

What we are working on:

Mixing both sequential and non-sequential data to retrieve interesting relations between genes

How?

Using DBN and MCMC for sequential data + BN and MCMC for non-sequential data

[Figure: 100*N networks from DBN and 100*N networks from BN are passed to the information tuner, which produces the learned network]

How to test the approach:

• Problem: there is no way to test it on real data, because no network is completely known
• Solution: work on realistic simulations where we know the network structure
• Example:

[Figure: known network (numbered nodes) from which the data are simulated]

*-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"


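[Figure: the simulated sequential data are passed to DBN MCMC and the non-sequential data to BN MCMC; the info tuner combines the two, and the learned network (numbered nodes) is compared with the true one using ROC curves]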

Test description:

• Generate 60 sequential data points
• Generate 120 non-sequential data points (~ the proportion found in reality)
• Run DBN MCMC on the sequential data and keep the 100*N sampled networks
• Run BN MCMC on the non-sequential data and keep the 100*N sampled networks
• Test performance using weights on the samples:

0 BN, 1 DBN
0.05 BN, 0.95 DBN
…
0.95 BN, 0.05 DBN
1 BN, 0 DBN

The metric used is the area under the ROC curve. A perfect learner gets 1.0, a random one gets 0.5 and the worst one gets 0.
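A minimal sketch of this evaluation. The slides do not spell out how the weights are applied, so the mixing rule below (a weighted average of the edge confidences from the BN and DBN samples) is only one plausible reading. The area under the ROC curve is computed as the probability that a true edge receives a higher score than a false one. All names and the toy numbers are hypothetical.

```python
def mix_confidences(conf_bn, conf_dbn, w_bn):
    """Weighted average of edge confidences: w_bn for BN, 1 - w_bn for DBN."""
    edges = set(conf_bn) | set(conf_dbn)
    return {e: w_bn * conf_bn.get(e, 0.0) + (1.0 - w_bn) * conf_dbn.get(e, 0.0)
            for e in edges}


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: the true network is known, as in the simulation setting.
true_edges = {("A", "B"), ("B", "C")}
candidates = [("A", "B"), ("B", "C"), ("C", "A"), ("B", "A")]
conf_bn = {("A", "B"): 0.9, ("C", "A"): 0.4}
conf_dbn = {("A", "B"): 0.7, ("B", "C"): 0.8, ("B", "A"): 0.1}

results = {}
for w_bn in (0.0, 0.5, 1.0):
    mixed = mix_confidences(conf_bn, conf_dbn, w_bn)
    scores = [mixed.get(e, 0.0) for e in candidates]
    labels = [e in true_edges for e in candidates]
    results[w_bn] = roc_auc(scores, labels)
```

Sweeping `w_bn` from 0 to 1 in steps of 0.05 would reproduce the weight grid listed above.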

Results:

[Figure: results plot; recoverable axis labels: 0, 1, 1 DBN]

Perspective:

• Working on more sophisticated ways to mix sequential and non-sequential data
• Working on real cases:
  • Yeast cell cycle
  • Arabidopsis thaliana circadian rhythm
• Real data also means missing values
  • Evaluate missing-value solutions (EM, KNNImpute)

Acknowledgements:

François Major

Why are there missing data?

• Low correlation
• Experimental problems

ROC Curve

Receiver Operating Characteristic curve

*-http://gim.unmc.edu/dxtests/roc2.htm

MCMC simulation and number of sampled networks

Area under the ROC curve as a function of the number of networks sampled from the MCMC simulation, for N=12

[Figure: x-axis: # of samples from MCMC]
