Using Bayesian Networks to Analyze Expression Data


Page 1: Using Bayesian Networks to Analyze Expression Data

Using Bayesian Networks to Analyze Expression Data

Shelly Bar-Nahor

Page 2: Using Bayesian Networks to Analyze Expression Data

Today

1. Introduction to Bayesian Networks.

2. Describe a method for recovering gene interactions from microarray data, using tools for learning Bayesian Networks.

3. Apply the method to the S. cerevisiae cell-cycle measurements.

Page 3: Using Bayesian Networks to Analyze Expression Data

Bayesian Networks – Compact representation of probability distributions via conditional independence.

The representation consists of two components:

1. G – a directed acyclic graph (DAG)

Nodes – random variables (X1, …, Xn).

Edges – direct influence.

2. Θ – a set of conditional probability distributions.

Together, these two components specify a unique joint distribution over (X1, …, Xn).

Example (figure): a small network in which A is a parent of B and C, together with the conditional probability table for C given A:

P(C|A)    c0     c1
a0        0.95   0.05
a1        0.1    0.9

Page 4: Using Bayesian Networks to Analyze Expression Data

The graph G encodes the Markov assumption:

Each variable Xi is independent of its non-descendants given its parents in G.

By applying the chain rule of probabilities:

P(X1, …, Xn) = ∏i=1..n P(Xi | PaG(Xi))

P(Xi | PaG(Xi)) is the conditional distribution for each variable Xi. We denote the parameters that specify these distributions by Θ.
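As an illustration of this factorization, here is a minimal Python sketch that evaluates the joint probability of a full assignment by multiplying the local conditional distributions. The toy network (A is a parent of B and C) and its CPT values are assumptions made for the example, not taken from the paper.

```python
# Minimal sketch: joint probability of a full assignment in a Bayesian network,
# computed as the product of local conditionals P(Xi | Pa(Xi)).
# The toy structure and CPT values below are illustrative only.

parents = {"A": [], "B": ["A"], "C": ["A"]}

# CPTs: map (child value, tuple of parent values) -> probability
cpts = {
    "A": {("a0", ()): 0.6, ("a1", ()): 0.4},
    "B": {("b0", ("a0",)): 0.7, ("b1", ("a0",)): 0.3,
          ("b0", ("a1",)): 0.2, ("b1", ("a1",)): 0.8},
    "C": {("c0", ("a0",)): 0.95, ("c1", ("a0",)): 0.05,
          ("c0", ("a1",)): 0.1,  ("c1", ("a1",)): 0.9},
}

def joint_probability(assignment):
    """P(x1,...,xn) = prod_i P(xi | pa_i), by the chain rule and the Markov assumption."""
    p = 1.0
    for var, cpt in cpts.items():
        pa_values = tuple(assignment[u] for u in parents[var])
        p *= cpt[(assignment[var], pa_values)]
    return p

print(joint_probability({"A": "a0", "B": "b1", "C": "c0"}))  # 0.6 * 0.3 * 0.95
```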

Page 5: Using Bayesian Networks to Analyze Expression Data

The Learning Problem:

Let m be the number of samples and n the number of variables. Given a training set D = (x[1], …, x[m]), where each x[i] is a complete assignment to (X1, …, Xn), find a network B = <G, Θ> that best matches D.

Let's assume we know the graph's structure. We now want to estimate the parameters Θ given the data D.

To do so, we use the likelihood function:

L(Θ:D) = P(D | Θ) = P(x[1], …, x[m] | Θ) = ∏i=1..m P(x[i] | Θ)

Page 6: Using Bayesian Networks to Analyze Expression Data

Learning Parameters - Likelihood Function

Example network (figure): E and B are parents of A, and A is a parent of C. The data D is an m × 4 matrix of joint observations:

D =
E[1] B[1] A[1] C[1]
  :    :    :    :
E[m] B[m] A[m] C[m]

Assuming i.i.d. samples, the likelihood function is:

L(Θ:D) = ∏m P(E[m], B[m], A[m], C[m] | Θ)   (the product runs over the m samples)

Page 7: Using Bayesian Networks to Analyze Expression Data

Learning Parameters - Likelihood Function

For the same example network (E and B are parents of A, A is a parent of C), the likelihood factors according to the graph:

L(Θ:D) = ∏m P(E[m], B[m], A[m], C[m] : Θ)
       = ∏m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ)
       = [∏m P(E[m] : Θ)] · [∏m P(B[m] : Θ)] · [∏m P(A[m] | B[m], E[m] : Θ)] · [∏m P(C[m] | A[m] : Θ)]

Page 8: Using Bayesian Networks to Analyze Expression Data

Learning Parameters - Likelihood Function

General Bayesian Networks:

L(Θ:D) = P(x[1], …, x[m] | Θ) = ∏i ∏m P(Xi[m] | Pai[m] : Θi) = ∏i Li(Θi : D)

Decomposition – the likelihood splits into independent estimation problems, one for each variable and its parents.
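Because the likelihood decomposes by family, the maximum-likelihood parameters of a discrete network are just normalized counts per (variable, parent-configuration) pair. A minimal Python sketch of this per-family estimation; the data layout and names are illustrative, not from the paper.

```python
from collections import Counter

# Minimal sketch: decomposed MLE for a discrete Bayesian network.
# Each family (Xi, Pa(Xi)) is estimated independently from counts.

def mle_parameters(data, parents):
    """data: list of dicts {variable: value}; parents: {variable: [parent names]}."""
    params = {}
    for var, pa in parents.items():
        joint = Counter()      # counts of (parent values, child value)
        marginal = Counter()   # counts of parent values alone
        for sample in data:
            pa_vals = tuple(sample[p] for p in pa)
            joint[(pa_vals, sample[var])] += 1
            marginal[pa_vals] += 1
        params[var] = {key: cnt / marginal[key[0]] for key, cnt in joint.items()}
    return params

# Toy example with the E, B -> A -> C network from the slides (made-up data).
parents = {"E": [], "B": [], "A": ["E", "B"], "C": ["A"]}
data = [{"E": 0, "B": 1, "A": 1, "C": 0},
        {"E": 0, "B": 0, "A": 0, "C": 0},
        {"E": 1, "B": 1, "A": 1, "C": 1}]
print(mle_parameters(data, parents)["C"])  # P(C | A) as normalized counts
```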

Page 9: Using Bayesian Networks to Analyze Expression Data

Learning Parameters

Two approaches: MLE (maximum likelihood estimation) and Bayesian inference (learning using Bayes' rule).

Represent uncertainty about the parameters using a Bayesian network in which Θ is a parent of each of X[1], X[2], …, X[m] (figure).

The values of the X[i] are independent given Θ.

P(X[m] | Θ) = Θ

Bayesian prediction is inference in this network.

Page 10: Using Bayesian Networks to Analyze Expression Data

P(x[m+1] | x[1], …, x[m]) = ∫ P(x[m+1] | Θ, x[1], …, x[m]) P(Θ | x[1], …, x[m]) dΘ = ∫ P(x[m+1] | Θ) P(Θ | x[1], …, x[m]) dΘ

(Figure: Θ is a parent of the observed data X[1], …, X[m] and of the query X[m+1]; Bayesian prediction is inference in this network.)

Bayes' rule:

P(Θ | x[1], …, x[m]) = P(x[1], …, x[m] | Θ) · P(Θ) / P(x[1], …, x[m])

The numerator is the likelihood times the prior; the denominator is the probability of the data.
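For a single binary variable with a Beta prior, the integral above has a closed form: the posterior predictive is a "counts plus pseudo-counts" ratio. A small illustrative Python sketch; the Beta(1, 1) prior and the data are assumptions for the example, not part of the slides.

```python
# Minimal sketch: Bayesian prediction for one binary variable with a Beta prior.
# The posterior predictive P(x[m+1]=1 | x[1..m]) integrates out Theta in closed form.

def posterior_predictive(samples, alpha=1.0, beta=1.0):
    """samples: list of 0/1 observations; (alpha, beta): Beta prior pseudo-counts."""
    ones = sum(samples)
    # E[Theta | data] under a Beta(alpha, beta) prior
    return (ones + alpha) / (len(samples) + alpha + beta)

data = [1, 0, 1, 1, 0, 1]
print(posterior_predictive(data))  # (4 + 1) / (6 + 2) = 0.625
```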

Page 11: Using Bayesian Networks to Analyze Expression Data

Equivalence Classes of Bayesian Networks

Problem: the joint distribution represented by one graph can equally be represented by another.

Ind(G) – the set of independence statements that hold in all distributions satisfying the Markov assumption on G.

G and G' are equivalent if Ind(G) = Ind(G').

Example (figure): the chains A → C → B and A ← C ← B and the fork A ← C → B are all equivalent:

P(x) = P(A)·P(C|A)·P(B|C) = P(C)·P(A|C)·P(B|C), since P(C|A)·P(A) = P(A|C)·P(C), and in the same way for the third structure.

Page 12: Using Bayesian Networks to Analyze Expression Data

Equivalence Classes of Bayesian Networks

Two DAGs are equivalent if and only if they have the same underlying undirected graph and the same v-structures.

We represent an equivalence class of network structures by a partially directed acyclic graph (PDAG): a directed edge means that all members of the class contain that directed arc; an undirected edge means the members disagree on its direction.
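This characterization translates directly into a check: compare skeletons (undirected edge sets) and v-structures (pairs of non-adjacent parents converging on a common child). A small Python sketch; the parent-dict graph representation is an assumption made for illustration.

```python
from itertools import combinations

# Minimal sketch: test whether two DAGs are equivalent
# (same skeleton and same v-structures).

def skeleton(parents):
    return {frozenset((child, p)) for child, ps in parents.items() for p in ps}

def v_structures(parents):
    skel = skeleton(parents)
    vs = set()
    for child, ps in parents.items():
        for a, b in combinations(ps, 2):
            if frozenset((a, b)) not in skel:   # parents not adjacent -> v-structure
                vs.add((frozenset((a, b)), child))
    return vs

def equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# A -> C -> B  versus  A <- C -> B : equivalent (same skeleton, no v-structures).
g1 = {"A": [], "C": ["A"], "B": ["C"]}
g2 = {"C": [], "A": ["C"], "B": ["C"]}
print(equivalent(g1, g2))  # True
```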

Page 13: Using Bayesian Networks to Analyze Expression Data

Learning Causal Patterns

• We want to model the mechanism that generates the dependencies (e.g., gene transcription).

• A causal network is a model of such causal processes.

• Representation: a DAG where each node represents a random variable with a local probability model. The parents of a variable are its immediate causes.

Page 14: Using Bayesian Networks to Analyze Expression Data

Learning Causal Patterns

A causal network models not only the distribution of observations, but also the effects of interventions.

• Observations – a passive measurement of our domain (i.e., a sample from X).

• Intervention – setting the values of some variables by forces outside the causal model.

Page 15: Using Bayesian Networks to Analyze Expression Data

Learning Causal Patterns

• If X causes Y, then manipulating the value of X affects the value of Y.

• If Y causes X, then manipulating the value of X will not affect Y.

• X → Y and Y → X are equivalent Bayesian networks.

Causal Markov Assumption: given the values of a variable's immediate causes, it is independent of its earlier causes.

Page 16: Using Bayesian Networks to Analyze Expression Data

Learning Causal Patterns

• Under the causal Markov assumption, a causal network can be interpreted as a Bayesian network.

• From observations alone we cannot distinguish between causal networks that belong to the same equivalence class.

• From a directed edge in the PDAG we can infer a causal direction.

Page 17: Using Bayesian Networks to Analyze Expression Data

So Far…

• The likelihood function

• Parameter estimation and the decomposition principle

• Equivalence classes of Bayesian networks

• Learning causal patterns

Page 18: Using Bayesian Networks to Analyze Expression Data

Analyzing Expression Data

1. Present modeling assumptions

2. Find high scoring networks.

3. Characterize features.

4. Estimate Statistical Confidence in features.

5. Present local probability models.

Page 19: Using Bayesian Networks to Analyze Expression Data

We consider probability distributions over all possible states of the system.

A state is described using random variables.

Random Variables –

• expression level of an individual gene

• experimental conditions

• temporal indicators (time/stage the sample was taken).

• background variables (which clinical procedure was used to take the sample)

Modeling assumptions

Page 20: Using Bayesian Networks to Analyze Expression Data

Analyzing Expression Data

1. Present modeling assumptions

2. Find high scoring networks.

3. Characterize features.

4. Estimate Statistical Confidence in features.

5. Present local probability models.

Page 21: Using Bayesian Networks to Analyze Expression Data

The Learning Problem:

Given a training set D = (x[1], …, x[N]) of independent instances, find an equivalence class of networks B = <G, Θ> that best matches D. We use a scoring function:

S(G:D) = log P(G|D) = … = log P(D|G) + log P(G) + C

where log P(D|G) is the marginal likelihood and log P(G) is the prior over structures. The marginal likelihood averages the likelihood over the prior on parameters:

P(D|G) = ∫ P(D|G, Θ) P(Θ|G) dΘ

Page 22: Using Bayesian Networks to Analyze Expression Data

The learning problem - Scoring, cont.

Properties of the selected priors:

• structure equivalence – equivalent structures receive the same score

• decomposability

S(G:D) = ∑i ScoreContribution(Xi, Pa(Xi) : D)

Now, learning amounts to finding the structure G that maximizes the score.

This problem is NP-hard, so we resort to heuristic search.

Page 23: Using Bayesian Networks to Analyze Expression Data

Local Search Strategy

Using decomposition, we can change one arc at a time and evaluate the gain made by that change (a search sketch follows below).

(Figure: an initial structure G over A, B and C, and the neighboring structures G' obtained by adding, deleting or reversing a single arc.)

If an arc into Xi is added or deleted, only score(Xi, Pa(Xi)) needs to be re-evaluated. If an arc between Xi and Xj is reversed, only score(Xi, Pa(Xi)) and score(Xj, Pa(Xj)) need to be re-evaluated.
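The greedy local search can be sketched directly from this decomposition. Below is a minimal Python sketch of hill climbing over add/delete moves (arc reversal omitted for brevity); family_score is a hypothetical placeholder for whatever decomposable scoring function is used (e.g. a Bayesian score), and all names are illustrative rather than the paper's implementation.

```python
# Minimal sketch of greedy hill-climbing structure search with a decomposable score.
# `family_score(var, parents_tuple, data)` is a hypothetical stand-in for a real
# decomposable scoring function.

def creates_cycle(parents, frm, to):
    """Would adding the arc frm -> to create a directed cycle?"""
    stack, seen = [frm], set()
    while stack:
        node = stack.pop()
        if node == to:               # `to` is already an ancestor of `frm`
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(variables, data, family_score):
    parents = {v: [] for v in variables}     # start from the empty graph
    improved = True
    while improved:
        improved = False
        best_delta, best_move = 0.0, None
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                if y in parents[x]:                          # candidate: delete y -> x
                    new_pa = [p for p in parents[x] if p != y]
                elif not creates_cycle(parents, y, x):       # candidate: add y -> x
                    new_pa = parents[x] + [y]
                else:
                    continue
                # Decomposability: only the family of x changes, so only its score does.
                delta = (family_score(x, tuple(new_pa), data)
                         - family_score(x, tuple(parents[x]), data))
                if delta > best_delta:
                    best_delta, best_move = delta, (x, new_pa)
        if best_move:
            x, new_pa = best_move
            parents[x] = new_pa
            improved = True
    return parents
```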

Page 24: Using Bayesian Networks to Analyze Expression Data

Find High-Scoring Networks

Problem:

Small data sets are not sufficiently informative to determine that a single model is the “right” model (or equivalence class of models).

Solution:

analyze a set of high-scoring networks. Attempt to characterize features that are common to most of these networks and focus on learning them.

Page 25: Using Bayesian Networks to Analyze Expression Data

Analyzing Expression Data

1. Present modeling assumptions

2. Find high scoring networks.

3. Characterize features.

4. Estimate Statistical Confidence in features.

5. Present local probability models.

Page 26: Using Bayesian Networks to Analyze Expression Data

Features

We will use two classes of features involving pairs of variables.

1. Markov relations – is Y in the Markov blanket of X?

Y is in X's Markov blanket if and only if there is either an edge between them, or both are parents of another variable.

A Markov relation indicates that the two genes are related in some joint biological interaction or process.
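The Markov-blanket feature can be read directly off a structure. A small Python sketch, reusing the hypothetical parent-dict graph representation from the earlier examples:

```python
# Minimal sketch: is Y in the Markov blanket of X?
# True if X and Y share an edge, or are both parents of some common child.

def in_markov_blanket(parents, x, y):
    if y in parents[x] or x in parents[y]:   # direct edge in either direction
        return True
    for ps in parents.values():              # common child?
        if x in ps and y in ps:
            return True
    return False

g = {"A": [], "B": ["A"], "C": ["A", "D"], "D": []}
print(in_markov_blanket(g, "A", "D"))  # True: both are parents of C
```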

Page 27: Using Bayesian Networks to Analyze Expression Data

Features

2. Order relations – is X an ancestor of Y in all the networks of a given equivalence class?

Does the PDAG contain a path from X to Y in which all the edges are directed?

This is an indication that X might be a cause of Y.
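A small sketch of the ancestor test over the directed part of a PDAG; the edge-set representation and names are assumptions made for illustration.

```python
# Minimal sketch: does a PDAG contain an all-directed path from x to y?
# Only directed edges are followed; undirected edges are ignored.

def is_ancestor(directed_edges, x, y):
    """directed_edges: set of (parent, child) pairs."""
    children = {}
    for a, b in directed_edges:
        children.setdefault(a, []).append(b)
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        if node == y:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return False

edges = {("X", "Z"), ("Z", "Y")}
print(is_ancestor(edges, "X", "Y"))  # True: X -> Z -> Y
```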

Page 28: Using Bayesian Networks to Analyze Expression Data

Analyzing Expression Data

1. Present modeling assumptions

2. Find high scoring networks.

3. Characterize features.

4. Estimate Statistical Confidence in features.

5. Present local probability models.

Page 29: Using Bayesian Networks to Analyze Expression Data

Estimate Statistical Confidence in Features

• We want to estimate to what extent the data support a given feature.

• We use the bootstrap method: we generate "perturbed" versions of the original data and learn from them.

• We should be more confident in features that are still induced from the "perturbed" data.

Page 30: Using Bayesian Networks to Analyze Expression Data

Estimate Statistical Confidence in Features

We use the bootstrap as follows:

For i = 1 … m (in our experiments m = 200):

o Re-sample N instances from D with replacement; denote the resulting dataset Di.

o Apply the learning procedure to Di to induce a network structure Gi.

For each feature f of interest calculate:

conf(f) = (1/m) ∑i f(Gi)

where f(G) is 1 if f is a feature in G, and 0 otherwise.
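A compact sketch of this bootstrap confidence estimate; learn_structure and the feature test are hypothetical placeholders for the structure-learning procedure and for features such as the Markov and order relations above.

```python
import random

# Minimal sketch of bootstrap confidence estimation for a network feature.
# `learn_structure(dataset)` is a hypothetical stand-in for the learning procedure;
# `feature(graph)` returns True/False for the feature of interest.

def bootstrap_confidence(data, learn_structure, feature, m=200, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(m):
        resampled = [rng.choice(data) for _ in range(len(data))]  # N samples, with replacement
        graph = learn_structure(resampled)
        hits += 1 if feature(graph) else 0
    return hits / m

# Example usage (everything here is illustrative):
# conf = bootstrap_confidence(data, hill_climb_learner,
#                             lambda g: in_markov_blanket(g, "CLN2", "SVS1"))
```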

Page 31: Using Bayesian Networks to Analyze Expression Data

Estimate Statistical Confidence in Features

• Features induced with high confidence are rarely false positives.

• The bootstrap procedure is especially robust for the Markov and order features.

• The conclusions established from high-confidence features are reliable even when the data sets are small relative to the model being induced.

Page 32: Using Bayesian Networks to Analyze Expression Data

Analyzing Expression Data

1. Present modeling assumptions

2. Find high scoring networks.

3. Characterize features.

4. Estimate Statistical Confidence in features.

5. Present local probability models.

Page 33: Using Bayesian Networks to Analyze Expression Data

Local Probability Models

• In order to specify a Bayesian network model we need to choose the type of local probability model we use.

• The choice of representation depends on the type of variables we use:

Discrete variables – can be represented with a table.

Continuous variables – no single representation can capture all possible densities.

Page 34: Using Bayesian Networks to Analyze Expression Data

Local Probability Models

We consider two approaches:

1. Multinomial model – treat each variable as discrete and learn a multinomial distribution that describes the probability of each possible state of a child variable given the state of its parents.

We discretize by thresholding the ratio between measured expression and control: ratios lower than 2^-0.5 are under-expressed (-1), ratios higher than 2^0.5 are over-expressed (+1), and ratios in between are treated as unchanged (0).
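A one-function sketch of this thresholding; the middle "0" category and the handling of boundary values follow the description above and are assumptions where the slide is silent.

```python
# Minimal sketch of the expression-ratio discretization described above:
# ratio < 2**-0.5 -> -1 (under-expressed), ratio > 2**0.5 -> +1 (over-expressed),
# otherwise 0 (roughly unchanged relative to control).

LOW = 2 ** -0.5
HIGH = 2 ** 0.5

def discretize(ratio):
    if ratio < LOW:
        return -1
    if ratio > HIGH:
        return 1
    return 0

print([discretize(r) for r in (0.5, 1.0, 2.0)])  # [-1, 0, 1]
```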

Page 35: Using Bayesian Networks to Analyze Expression Data

Local Probability Models

2. Linear Gaussian model – learn a linear regression model for a child variable given its parents.

If U1, …, Uk are the parents of variable X, then

P(X | u1, …, uk) ~ N(a0 + ∑i ai·ui, σ²).

That is, X is normally distributed around a mean that depends linearly on the values of its parents.
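One way to estimate such a linear Gaussian CPD is ordinary least squares on samples of the parents and the child; this is a sketch under that assumption, with NumPy and toy data, not the paper's own fitting procedure.

```python
import numpy as np

# Minimal sketch: fit a linear Gaussian CPD P(X | U1..Uk) = N(a0 + sum_i ai*ui, sigma^2)
# by least squares on samples of the parents U and the child X.

def fit_linear_gaussian(U, x):
    """U: (m, k) array of parent values; x: (m,) array of child values."""
    design = np.hstack([np.ones((U.shape[0], 1)), U])     # intercept column for a0
    coeffs, *_ = np.linalg.lstsq(design, x, rcond=None)   # [a0, a1, ..., ak]
    residuals = x - design @ coeffs
    return coeffs, residuals.var()                        # coefficients and sigma^2

# Toy example: X depends linearly on one parent plus noise.
rng = np.random.default_rng(0)
u = rng.normal(size=(100, 1))
x = 0.5 + 2.0 * u[:, 0] + rng.normal(scale=0.1, size=100)
print(fit_linear_gaussian(u, x))
```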

Page 36: Using Bayesian Networks to Analyze Expression Data

Local Probability Models

• In the multinomial model, by discretizing the measured expression levels we lose information.

• The linear Gaussian model can only detect dependencies that are close to linear. In particular, it is not likely to discover combinatorial effects (e.g., a gene is over-expressed only if several other genes are jointly over-expressed).

Page 37: Using Bayesian Networks to Analyze Expression Data

Application to Cell Cycle Expression Patterns

• Data from Spellman et al., 'Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization', Molecular Biology of the Cell.

• Contains 76 gene expression measurements of the mRNA levels of yeast.

• Spellman et al identified 800 genes whose expression varied over the different cell-cycle stages.

Page 38: Using Bayesian Networks to Analyze Expression Data

Application to Cell Cycle Expression Patterns

• Treat each measurement as an independent sample, and do not take into account the temporal aspect of the measurements.

• Compensate by adding, as the root of all learned networks, a variable denoting the cell-cycle phase.

• Performed two experiments: one with the multinomial distribution, and the other with the linear Gaussian distribution. http://www.cs.huji.ac.il/~nirf/GeneExpression/top800/

Page 39: Using Bayesian Networks to Analyze Expression Data

Robustness Analysis – Credibility of confidence assessment

(Figure: linear Gaussian model, order features. x-axis: confidence threshold; y-axis: number of features with confidence equal to or higher than the x-value.)

Page 40: Using Bayesian Networks to Analyze Expression Data

Robustness Analysis – Credibility of confidence assessment

(Figure: multinomial model, order features. x-axis: confidence threshold; y-axis: number of features with confidence equal to or higher than the x-value.)

Page 41: Using Bayesian Networks to Analyze Expression Data

Robustness Analysis – Adding more genes

(Figure: multinomial model. x-axis: confidence with 250 genes; y-axis: confidence with 800 genes.)

Page 42: Using Bayesian Networks to Analyze Expression Data

Robustness Analysis – Discretization

• The discretization method penalizes genes whose natural range of variation is small, since a fixed threshold is used.

• Avoid the problem by normalizing the expression of genes in the data.

• The top 20 Markov relations highlighted by this method were somewhat different, and the order relations were more robust, possibly because order relations depend on the global network structure rather than on local features.

Page 43: Using Bayesian Networks to Analyze Expression Data

Robustness Analysis – Comparison between the linear Gaussian and multinomial experiments

(Figure: x-axis: confidence in the multinomial experiment; y-axis: confidence in the linear Gaussian experiment.)

Page 44: Using Bayesian Networks to Analyze Expression Data

Biological Analysis – Order Relations

• Found the existence of dominant genes: out of all 800 genes, only a few seem to dominate the order.

• Among them are genes that are:

directly involved in initiation of the cell cycle and its control.

components of pre-replication complexes.

involved in DNA repair, which is associated with transcription initiation.

Page 45: Using Bayesian Networks to Analyze Expression Data

Biological Analysis - Markov Relations

• Among the top-scoring relations, all those involving two known genes make biological sense.

• Several of the unknown pairs are physically adjacent on the chromosome, and are presumably regulated by the same mechanism.

• Some of the relations found are beyond the limitations of clustering.

Page 46: Using Bayesian Networks to Analyze Expression Data

Example: CLN2, RNR3, SVS1, SRO4 and RAD51 all appear in the same cluster in Spellman et al. In our network CLN2 is a parent of the other four, while no links were found between them. This suits biological knowledge: CLN2 is a central and early cell-cycle control, while there is no clear biological relationship among the others.

Page 47: Using Bayesian Networks to Analyze Expression Data

Discussion

The approach:

is capable of handling noise and of estimating the confidence in the different features of the network.

managed to extract many biologically plausible conclusions.

is capable of learning rich structures from the data, such as causal relationships and interactions between genes beyond positive correlation.

Page 48: Using Bayesian Networks to Analyze Expression Data

Discussion

Ideas:

Learn models over "clustered" genes.

Recover all relationships in one analysis.

Improve the confidence estimation.

Incorporate biological knowledge as prior knowledge in the analysis.

Learn causal patterns with the addition of interventional data.

Page 49: Using Bayesian Networks to Analyze Expression Data

The End