
Gene Network Inference From Microarray Data

1

Copyright notice

• Many of the images in this PowerPoint presentation come from other people. The copyright belongs to the original authors. Thanks!

2

Gene Network Inference

3

Level of Biochemical Detail

• Detailed models require lots of data!

• Highly detailed biochemical models are only feasible for very small systems which are extensively studied

• Example: Arkin et al. (1998), Genetics 149(4):1633-48, lysis-lysogeny switch in Lambda: 5 genes, 67 parameters based on 50 years of research, stochastic simulation required a supercomputer

4

Example: Lysis-Lysogeny

Arkin et al. (1998), Genetics 149(4):1633-48

5

Level of Biochemical Detail

• In-depth biochemical simulation of e.g. a whole cell is infeasible (so far)

• Less detailed network models are useful when data is scarce and/or network structure is unknown

• Once network structure has been determined, we can refine the model

6

Boolean or Continuous?

• Boolean networks (Kauffman (1993), The Origins of Order) assume ON/OFF gene states.

• Allow analysis at the network level

• Provide useful insights into network dynamics

• Algorithms exist for network inference from binary data

[Figure: example network in which genes A and B regulate gene C, with C = A AND B]

7

Boolean or Continuous?

• Boolean abstraction is a poor fit to real data

• Cannot model important concepts:
– amplification of a signal
– subtraction and addition of signals
– compensating for a smoothly varying environmental parameter (e.g. temperature, nutrients)
– varying dynamical behavior (e.g. cell cycle period)

• Feedback control: negative feedback is used to stabilize expression, but it causes oscillation in a Boolean model
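A minimal Python sketch of the two Boolean ideas on these slides: the rule C = A AND B, and a negative feedback loop that oscillates instead of settling. The single self-repressing gene and the synchronous update scheme are illustrative assumptions, not taken from the slides.

```python
from itertools import product

def step(state):
    """Synchronous update: A and B keep their values, C becomes A AND B."""
    a, b, _ = state
    return (a, b, a and b)

def negative_feedback_step(x):
    """Hypothetical self-repressing gene: it switches off when on, and on when off."""
    return not x

# Truth table of the rule C = A AND B.
truth_table = {(a, b): step((a, b, False))[2]
               for a, b in product([False, True], repeat=2)}

# Negative feedback never reaches a steady state in the Boolean model.
trajectory = [True]
for _ in range(5):
    trajectory.append(negative_feedback_step(trajectory[-1]))
```

Here `trajectory` alternates between ON and OFF forever, whereas a continuous model of the same loop can relax to an intermediate expression level.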

8

Deterministic or Stochastic?

• Use of concentrations assumes individual molecules can be ignored

• Known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis-lysogeny in lambda)

• Requires stochastic simulation (Arkin et al. (1998), Genetics 149(4):1633-48), or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5)

• Significantly increases model complexity

9

Deterministic or Stochastic?

• Eukaryotes: larger cell volume, typically longer half-lives. Few known stochastic effects.

• Yeast: 80% of the transcriptome is expressed at 0.1-2 mRNA copies/cell. Holstege et al. (1998), Cell 95:717-728.

• Human: 95% of the transcriptome is expressed at <5 copies/cell. Velculescu et al. (1997), Cell 88:243-251.

10

Spatial or Non-Spatial

• Spatiality introduces additional complexity:
– intercellular interactions
– spatial differentiation
– cell compartments
– cell types

• Spatial patterns also provide more data, e.g. stripe formation in Drosophila: Mjolsness et al. (1991), J. Theor. Biol. 152: 429-454.

• Few (no?) large-scale spatial gene expression data sets available so far.

11

Data Requirements: Lower Bounds from Information Theory

• How many bits of information are needed just to specify the connection pattern of a network?

• $N^2$ possible connections between N nodes → $N^2$ bits needed to specify which connections are present or absent

• O(N) bits of information per “data point” → O(N) data points needed

12

Effect of Limited Connectivity

• Assume only K inputs per gene (on average), i.e. NK connections out of $N^2$ possible:

$\binom{N^2}{NK}$ possible connection patterns

• Number of bits needed to fully specify the connection pattern:

$\log_2 \binom{N^2}{NK} \approx NK \log(N/K)$

→ O(K log(N/K)) data points needed
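As a rough numerical check of this counting argument, the sketch below compares the bits needed for a fully connected network with the bits needed when each gene has K inputs. The function names and the values N = 100, K = 3 are illustrative assumptions.

```python
import math

def bits_fully_connected(n):
    """One bit per possible connection among n nodes."""
    return n * n

def bits_limited_connectivity(n, k):
    """log2 of the number of ways to pick n*k connections out of n*n."""
    return math.log2(math.comb(n * n, n * k))

def bits_approximation(n, k):
    """Leading-order estimate N K log2(N / K)."""
    return n * k * math.log2(n / k)

n, k = 100, 3
full = bits_fully_connected(n)
exact = bits_limited_connectivity(n, k)
approx = bits_approximation(n, k)
# Dividing by O(N) bits per data point gives the number of data points needed.
points_full = full / n
points_sparse = exact / n
```

The approximation drops lower-order terms, so it agrees with the exact count only up to a modest constant factor; the scaling in N and K is what matters for the data requirement.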

13

Comparison with clustering

• Use pairwise correlation comparisons as a stand-in for clustering

• As the number of genes increases, the number of false positives increases as well → need to use a more stringent correlation test

• If we want to use the same correlation cutoff value r, we need to increase the number of data points as N increases

O(log(N)) data points needed

14

Summary

Model               Data points needed
Fully connected     N (thousands)
Connectivity K      K log(N/K) (hundreds?)
Clustering          log(N) (tens)

• Additional constraints reduce data requirements:
– choice of regulatory functions
– limited connectivity

• Network inference is feasible, but does require much more data than clustering

15

Reverse Engineering Gene Network Methods

• Boolean network

• Relevance network (co-expression network)

• Bayesian network

• Graphical Gaussian models

• Differential equation

16

Gene Networks: reverse engineering

• Dynamical gene networks:
– discrete models: Boolean networks, Bayesian networks, Petri nets
– continuous models: neural networks, differential equations

• Static gene networks:
– statistical correlation analysis
– graph theory approach

17

Problems

• Static models: require less data but give lower accuracy

• Dynamical models: require more data but give higher accuracy

• Noise and time delay master equations

Problem: scarcity of time series data or dimensionality problem, e.g. number of genes typically far exceeds the number of time points for which data are available, making the problem an ill-posed one

18

Gene Co-expression Relation

• The relation among n gene expression profiles can be represented by an n×n symmetric correlation (e.g. Pearson correlation) matrix M.

• Coexistence of collectivity and noise: M = Mn + Mc.
– The strong correlation part Mc indicates modular collectivity.
– The weak correlation part Mn indicates “noise” between unrelated genes.

19

Relevance networks(Butte and Kohane, 2000)

1. Choose a measure of association A(.,.)

2. Define a threshold value tA

3. For all pairs of domain variables (X,Y) compute their association A(X,Y)

4. Connect those variables (X,Y) by an undirected edge whose association A(X,Y) exceeds the predefined threshold value tA
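The four steps above can be sketched in a few lines of Python. This sketch assumes the absolute Pearson correlation as the association measure A and a hypothetical threshold of 0.8; the slides leave both choices open, and the toy data are made up for illustration.

```python
import numpy as np

def relevance_network(data, threshold):
    """data: genes x samples array. Returns undirected edges (i, j), i < j,
    whose absolute Pearson correlation exceeds the threshold."""
    corr = np.corrcoef(data)                  # step 3: all pairwise associations
    n = corr.shape[0]
    return [(i, j)
            for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) > threshold]   # step 4: keep strong pairs only

rng = np.random.default_rng(0)
base = rng.normal(size=50)
data = np.vstack([
    base,                                     # gene 0
    base + 0.1 * rng.normal(size=50),         # gene 1, co-expressed with gene 0
    rng.normal(size=50),                      # gene 2, unrelated
])
edges = relevance_network(data, threshold=0.8)
```

Only the co-expressed pair survives the threshold. Any other symmetric association measure can be swapped in for `np.corrcoef` without changing the rest.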

20

Relevance networks(Butte and Kohane, 2000)

21

Determining the Threshold by Random Matrix Theory

• Construct a series of correlation matrices with different cutoff values.
– For a given cutoff, correlation coefficients with absolute values less than the cutoff are set to zero.
– Only the correlation coefficients with absolute values beyond the cutoff are kept.

• Calculate the nearest neighbor spacing distribution (NNSD) of the eigenvalues for each matrix in the series.

• Determine the cutoff threshold by testing goodness of fit to the Poisson distribution using a chi-square test.
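A rough Python sketch of this procedure, under simplifying assumptions: the eigenvalue spacings are rescaled to mean 1 instead of being properly unfolded, the chi-square statistic is computed by hand against P(s) = exp(-s), and the data and cutoff values are arbitrary. It illustrates the mechanics only and is not the published method.

```python
import numpy as np

def thresholded(corr, cutoff):
    """Zero every correlation whose absolute value is below the cutoff."""
    m = np.where(np.abs(corr) >= cutoff, corr, 0.0)
    np.fill_diagonal(m, 1.0)
    return m

def normalized_spacings(matrix):
    """Nearest-neighbour eigenvalue spacings, rescaled to mean 1 (crude unfolding)."""
    eig = np.sort(np.linalg.eigvalsh(matrix))
    s = np.diff(eig)
    return s / s.mean()

def chi2_vs_poisson(spacings, bins=8, s_max=4.0):
    """Chi-square statistic of the spacing histogram against P(s) = exp(-s).
    Spacings above s_max are ignored in the observed counts."""
    edges = np.linspace(0.0, s_max, bins + 1)
    observed, _ = np.histogram(spacings, bins=edges)
    expected = len(spacings) * (np.exp(-edges[:-1]) - np.exp(-edges[1:]))
    return float(np.sum((observed - expected) ** 2 / expected))

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 40))      # 60 hypothetical genes, 40 samples
corr = np.corrcoef(data)
scores = {c: chi2_vs_poisson(normalized_spacings(thresholded(corr, c)))
          for c in (0.1, 0.2, 0.3)}
```

In a real analysis one would scan a fine grid of cutoffs and pick the point where the statistic stops rejecting the Poisson hypothesis.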

22

Yeast Gene Co-expression Network at Cutoff 0.77

23

Graphical Gaussian Models

• GGMs are undirected probabilistic graphical models that allow the identification of conditional independence relations among the nodes under the assumption of a multivariate Gaussian distribution of the data.

• The inference of GGMs is based on a (stable) estimation of the covariance matrix of this distribution. A high correlation coefficient Cik between two nodes may indicate a direct interaction. The strengths of these direct interactions are measured by the partial correlation coefficient πik, which describes the correlation between nodes Xi and Xk conditional on all the other nodes in the network.

24


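$\pi_{ij} = -\dfrac{(C^{-1})_{ij}}{\sqrt{(C^{-1})_{ii}\,(C^{-1})_{jj}}}$

[Figure: nodes 1 and 2 joined by an edge labelled "direct interaction"; a strong partial correlation π12 indicates a direct interaction]

Partial correlation, i.e. correlation conditional on all other domain variables: Corr(X1,X2|X3,…,Xn)

But usually: #observations < #variables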
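To illustrate why partial correlation separates direct from indirect interactions, here is a small Python sketch on a simulated chain X1 → X2 → X3. The chain, the noise level, and the sample size are assumptions chosen for illustration; there are many more observations than variables here, so the covariance matrix can be inverted directly.

```python
import numpy as np

def partial_correlations(data):
    """data: samples x variables. Partial correlations from the inverse covariance."""
    cov = np.cov(data, rowvar=False)
    prec = np.linalg.inv(cov)             # C^{-1}
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)         # -(C^-1)_ij / sqrt((C^-1)_ii (C^-1)_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = x1 + 0.5 * rng.normal(size=2000)     # X2 depends directly on X1
x3 = x2 + 0.5 * rng.normal(size=2000)     # X3 depends on X1 only through X2
data = np.column_stack([x1, x2, x3])

corr = np.corrcoef(data, rowvar=False)
pcor = partial_correlations(data)
```

X1 and X3 are strongly correlated, yet their partial correlation is close to zero because X2 explains the association, while the direct pair X1, X2 keeps a clearly non-zero partial correlation.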

25


• To infer a GGM, one typically employs the following procedure.
– From the given data, the empirical covariance matrix is computed and inverted, and the partial correlations ρik are computed.
– The distribution of |ρik| is inspected, and edges (i, k) corresponding to significantly small values of |ρik| are removed from the graph.
– The critical step in the application of this procedure is the stable estimation of the covariance matrix and its inverse.

• Schäfer and Strimmer (2005) propose a novel covariance matrix estimator regularized by a shrinkage approach, after extensively exploring alternative regularization methods based on bagging.
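A minimal sketch of the shrinkage idea in Python, for the case of fewer observations than variables. It assumes a fixed, hand-picked shrinkage intensity `lam` and a diagonal target; Schäfer and Strimmer estimate the intensity from the data, which is not reproduced here.

```python
import numpy as np

def shrinkage_partial_correlations(data, lam=0.2):
    """data: samples x variables. Shrinks the empirical covariance toward its
    diagonal with a fixed intensity lam (an assumption), then inverts it."""
    emp = np.cov(data, rowvar=False)
    target = np.diag(np.diag(emp))                # keep variances, drop covariances
    shrunk = (1.0 - lam) * emp + lam * target     # positive definite for lam > 0
    prec = np.linalg.inv(shrunk)
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(1)
data = rng.normal(size=(10, 30))                  # 10 observations, 30 variables
emp_rank = np.linalg.matrix_rank(np.cov(data, rowvar=False))
pcor = shrinkage_partial_correlations(data, lam=0.2)
```

The empirical covariance has rank below 30 and cannot be inverted on its own; the shrunk estimate can. Edges would then be selected by testing which |ρik| values are significantly different from zero.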

26

Further drawbacks

• Relevance networks and Graphical Gaussian models can extract undirected edges only.

• Bayesian networks promise to extract at least some directed edges. But can we trust in these edge directions?

It may be better to learn undirected edges than learning directed edges with false orientations.

27

Bayesian networks (BN) in brief

• Graphs in which nodes represent random variables

• (Lack of) arcs represent conditional independence assumptions

• Present and absent arcs provide a compact representation of joint probability distributions

• BNs have a complicated notion of independence, which takes into account the directionality of the arcs

28

Bayes’ Rule

We can rearrange the conditional probability formula to get P(A|B) P(B) = P(A,B), but by symmetry we can also get P(B|A) P(A) = P(A,B). It follows that:

$P(A|B) = \dfrac{P(B|A)\,P(A)}{P(B)}$

The power of Bayes' rule is that in many situations where we want to compute P(A|B) it turns out that it is difficult to do so directly, yet we might have direct information about P(B|A). Bayes' rule enables us to compute P(A|B) in terms of P(B|A).
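A short worked example in Python. The numbers are made up for illustration: A is a rare event with prior 0.01, and B is an observation that is likely when A holds and unlikely otherwise.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) from P(B|A), P(A) and P(B|not A), using the law of total probability."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b

# Hypothetical values: P(A) = 0.01, P(B|A) = 0.9, P(B|not A) = 0.05
posterior = bayes_posterior(0.9, 0.01, 0.05)
```

With these values P(B) = 0.0585, so P(A|B) = 0.009 / 0.0585 = 2/13, about 0.154: even a fairly reliable observation leaves A improbable when its prior is small.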

29

Bayesian networks

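[Figure: directed acyclic graph with nodes A to F and edges A→B, A→C, B→D, C→D, D→E, C→F, D→F; the labels NODES and EDGES point to a node and an edge]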

•Marriage between graph theory and probability theory.

•Directed acyclic graph (DAG) represents conditional independence relations.

•Markov assumption leads to a factorization of the joint probability distribution:

$P(A,B,C,D,E,F) = P(A)\,P(B|A)\,P(C|A)\,P(D|B,C)\,P(E|D)\,P(F|C,D)$

30

31

Simple Bayesian network example, from “Bayesian Networks Without Tears” article

P(hear your dog bark as you get home) = P(hb) = ?

32

Need prior probabilities for root nodes and, for nonroot nodes, conditional probabilities that consider all possible values of the parent nodes

33

Major benefit of BN

• We can know P(hb) based only on the conditional probabilities of hb and its parent node. We don’t need to know/include all the ancestor probabilities between hb and the root nodes.

34

Independence assumptions

• Source of savings in # of values needed

• From our simple example: are ‘family-out’ and ‘hear-bark’ independent, i.e. P(hb|fo)=P(hb)? Intuition might say they are not independent…

35

Independence assumptions

• …but in fact they can be assumed to be independent if some conditions are met.

• Conditions are symbolized by presence/absence and direction of arrows between nodes.

• Knowing whether the dog is or is not in the house is all that is needed to know the probability of hearing a bark, so given the dog's location, hearing a bark is independent of whether the family is in or out. This kind of independence assumption is what allows savings in how many numbers must be specified for probabilities.
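The dog example can be worked through by brute-force enumeration. In the Python sketch below the probability values are illustrative assumptions (the slides give none), and the network is reduced to four variables: family-out (fo), bowel-problem (bp), dog-out (do) and hear-bark (hb).

```python
from itertools import product

# Illustrative (assumed) probabilities, not taken from the slides.
P_FO = 0.15                                    # prior: family out
P_BP = 0.01                                    # prior: dog has a bowel problem
P_DO = {(True, True): 0.99, (True, False): 0.90,
        (False, True): 0.97, (False, False): 0.30}   # P(do | fo, bp)
P_HB = {True: 0.70, False: 0.01}               # P(hb | do)

def bern(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p

def joint(fo, bp, do, hb):
    """Joint probability from the Bayesian network factorization."""
    return (bern(P_FO, fo) * bern(P_BP, bp)
            * bern(P_DO[(fo, bp)], do) * bern(P_HB[do], hb))

def prob(query, given=lambda s: True):
    """P(query | given), summing the joint over all 16 states."""
    num = den = 0.0
    for values in product([True, False], repeat=4):
        s = dict(zip(("fo", "bp", "do", "hb"), values))
        p = joint(*values)
        if given(s):
            den += p
            if query(s):
                num += p
    return num / den

p_hb = prob(lambda s: s["hb"])
p_hb_given_fo = prob(lambda s: s["hb"], lambda s: s["fo"])
p_hb_given_do = prob(lambda s: s["hb"], lambda s: s["do"])
p_hb_given_do_fo = prob(lambda s: s["hb"], lambda s: s["do"] and s["fo"])
```

On their own, fo and hb are dependent: `p_hb_given_fo` differs from `p_hb`. Once do is known, fo adds nothing: `p_hb_given_do_fo` equals `p_hb_given_do`. That conditional independence is why only P(hb|do) has to be stored for the hear-bark node.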

Learning Bayesian Belief Networks

1. The network structure is given in advance and all the variables are fully observable in the training examples. ==> Trivial Case: just estimate the conditional probabilities.

2. The network structure is given in advance but only some of the variables are observable in the training data. ==> Similar to learning the weights for the hidden units of a Neural Net: Gradient Ascent Procedure

3. The network structure is not known in advance. ==> Use a heuristic search or constraint-based technique to search through potential structures.
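For case 1 (known structure, fully observed data), estimating a conditional probability table reduces to counting. A minimal Python sketch, assuming binary variables stored as dictionaries and using made-up observations:

```python
from collections import Counter

def estimate_cpt(samples, child, parents):
    """Maximum-likelihood estimate of P(child = True | parents) by counting.
    samples: list of dicts mapping variable name to a bool."""
    hits, totals = Counter(), Counter()
    for s in samples:
        key = tuple(s[p] for p in parents)     # one table row per parent configuration
        totals[key] += 1
        if s[child]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

# Hypothetical observations of dog-out (do) and hear-bark (hb).
samples = [
    {"do": True, "hb": True},
    {"do": True, "hb": True},
    {"do": True, "hb": False},
    {"do": False, "hb": False},
    {"do": False, "hb": False},
]
cpt = estimate_cpt(samples, child="hb", parents=["do"])
```

Parent configurations that never occur in the data get no entry; in practice a prior or pseudo-count is usually added to handle them.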

36

BN from microarray

• “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N, Nature Genetics, June 2003

37

38

Results of SSR article

• Expression data set, from other researchers circa 2000, is for genes of yeast subjected to various kinds of stress

• Compiled list of 466 candidate regulators

• Applied analysis to 2355 genes in all 173 arrays of yeast data set

• This gave automatic inference of 50 modules of genes

• All modules were analyzed with external data sources to check functional coherence of gene products and validity of regulatory program

• Three novel hypotheses suggested by the method were tested in the lab and found to be accurate

Differential Equations

• Typically uses linear differential equations to model the gene trajectories:

dxi(t) / dt = ai,0 + ai,1 x1(t) + ai,2 x2(t) + … + ai,n xn(t)

• Several reasons for that choice:
– lower number of parameters implies that we are less likely to overfit the data
– sufficient to model complex interactions between the genes

41

Small Network Example

dx1(t) / dt = 0.491 - 0.248 x1(t)

dx2(t) / dt = -0.473 x3(t) + 0.374 x4(t)

dx3(t) / dt = -0.427 + 0.376 x1(t) - 0.241 x3(t)

dx4(t) / dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)

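[Figure: network diagram of x1, x2, x3 and x4, with each edge marked + or − according to the sign of the corresponding coefficient in the equations above]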
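The four equations of this small network can be integrated numerically. The Python sketch below uses a simple forward Euler scheme; the zero initial state, the step size and the simulation length are assumptions made for illustration.

```python
import numpy as np

# Constant terms and interaction coefficients copied from the four equations.
A0 = np.array([0.491, 0.0, -0.427, 0.0])
A = np.array([
    [-0.248, 0.0,  0.0,    0.0  ],   # dx1/dt
    [ 0.0,   0.0, -0.473,  0.374],   # dx2/dt
    [ 0.376, 0.0, -0.241,  0.0  ],   # dx3/dt
    [ 0.435, 0.0, -0.315, -0.437],   # dx4/dt
])

def simulate(x0, dt=0.01, steps=10000):
    """Forward Euler integration of dx/dt = A0 + A x."""
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(steps):
        x = x + dt * (A0 + A @ x)
        trajectory.append(x.copy())
    return np.array(trajectory)

trajectory = simulate([0.0, 0.0, 0.0, 0.0])
x1_steady = 0.491 / 0.248             # where dx1/dt = 0
```

x1 relaxes to 0.491 / 0.248 because its equation depends only on itself. x2 has no self-degradation term in this model, so it keeps drifting at a constant rate once x3 and x4 have settled.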

42


[Figure: the same four equations and network diagram as on the previous slide, with one interaction coefficient highlighted]

43


[Figure: the same four equations and network diagram, with the constant coefficients highlighted]

44

Issues with Differential Equations

• Even under the simplest linear model, there are m(m+1) unknown parameters to estimate:
– m(m-1) directional effects
– m self effects
– m constant effects

• Number of data points is mn and we typically have n << m (few time points).

• To avoid overfitting, extra constraints must be incorporated into the model, such as:
– smoothness of the equations
– sparseness of the network (few non-null interaction coefficients)

45

Collins et al. PNAS

• Using SVD for a family of possible solutions

• Using robust regression to choose from them

46

Goal is to use as few measurements as possible. By this method (with exact measurements):

M = O(log(N))

47

If the system is near a steady state, the dynamics can be approximated by a linear system of differential equations:

$\dfrac{dx_i(t)}{dt} = -\lambda_i x_i(t) + \sum_{j=1}^{N} W_{ij} x_j(t) + b_i(t) + \xi_i(t)$

xi = concentration of mRNA (reflects expression level of genes)

λi = self-degradation rates

bi = external stimuli

ξi = noise

Wij = type and strength of effect of jth gene on ith gene

48

Suppositions made:

• No time-dependency in connections (so W is not time-dependent), and they are not changed by the tests

• System near steady state

• Noise will be discarded, so exact measurements are assumed

• $\dot{X}$ can be calculated exactly enough

49

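System becomes:

$\dot{X}_{N \times M} = A_{N \times N} \, X_{N \times M} + B_{N \times M}$

With $A = W + \mathrm{diag}(-\lambda_i)$

Compute $\dot{X}$ by using several measurements of the data for X (e.g. using interpolation).

Goal = deduce W (or A) from the rest

If M = N, compute $(X^T)^{-1}$, but mostly M << N (this is our goal: M = O(log(N)))

In transposed form:

$\dot{X}^T_{M \times N} = X^T_{M \times N} \, A^T_{N \times N} + B^T_{M \times N}$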

50

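Therefore, use SVD (to find the least squares solution):

$X^T_{M \times N} = U_{M \times N} \, W_{N \times N} \, V^T_{N \times N}$

Here, U and V are orthogonal ($U^T = U^{-1}$) and W is $\mathrm{diag}(w_1, \ldots, w_N)$ with $w_i$ the singular values of X. This diagonal W is not the connectivity matrix $W_{ij}$ of the model.

Suppose all $w_i = 0$ come first, so $w_i = 0$ for i = 1…L and $w_i \neq 0$ for i = L+1…N.

The system then reads:

$\dot{X}^T_{M \times N} - B^T_{M \times N} = U_{M \times N} \, \mathrm{diag}(w_i)_{N \times N} \, V^T_{N \times N} \, A^T_{N \times N}$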

51

Then the least squares (L2) solution to the problem is:

$A_0 = (\dot{X}_{N \times M} - B_{N \times M}) \, U_{M \times N} \, \mathrm{diag}(1/w_j) \, V^T_{N \times N}$

With 1/wj replaced by 0 if wj = 0

So this formula tries to match every data point as closely as possible to the solution.
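A small numerical sketch of this SVD step in Python with NumPy. The network size, the random sparse matrix `A_true` and the exact (noise-free) derivatives are assumptions made for illustration; the reduced SVD is used, so the zero singular values are simply absent instead of being listed first.

```python
import numpy as np

def svd_solution(X, Xdot, B, tol=1e-10):
    """Particular solution A0 of Xdot = A X + B via the SVD of X^T.
    X, Xdot, B are N x M arrays (N genes, M measurements)."""
    U, w, Vt = np.linalg.svd(X.T, full_matrices=False)   # X^T = U diag(w) V^T
    w_inv = np.zeros_like(w)
    nonzero = w > tol
    w_inv[nonzero] = 1.0 / w[nonzero]                     # 1/w_j, or 0 if w_j = 0
    return (Xdot - B) @ U @ np.diag(w_inv) @ Vt

rng = np.random.default_rng(0)
N, M = 8, 4                                               # fewer measurements than genes
A_true = np.zeros((N, N))
A_true[np.arange(N), np.arange(N)] = -1.0                 # self-degradation
A_true[0, 3] = 0.8                                        # a couple of hypothetical links
A_true[5, 1] = -0.6
X = rng.normal(size=(N, M))
B = rng.normal(size=(N, M))
Xdot = A_true @ X + B                                     # exact data, no noise

A0 = svd_solution(X, Xdot, B)
```

`A0` reproduces the measurements exactly but is generally dense, so it does not recover the sparse `A_true` by itself when M < N. It coincides with `(Xdot - B) @ np.linalg.pinv(X)`.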

52

But all possible solutions are:

$A = A_0 + C V^T$

with C = (cij)NxN where cij = 0 if j > L and otherwise just a scalar coefficient

How to choose from the family of solutions?

The least squares method tries to match every data point as closely as possible → a not-so-sparse matrix with a lot of small entries.

53

1. Based on prior biological knowledge, impose constraints on the solutions. E.g.: when we know 2 genes are related, the solution must reflect this in the matrix.

2. Work from the assumption that normal gene networks are sparse, and look for the sparsest matrix. Thus: search for the cij that maximize the number of zero entries in A.

54

So:

• get as many zero entries as you can

• therefore get a sparse matrix

• the non-zero entries form the connections

• fit as many measurements as you can, exactly: “robust regression”

(So you suppose exact measurements)

55

Do this using L1 regression. Thus, when considering

$A = A_0 + C V^T$

we want to “minimize” A.

The L1 regression idea is then to look for the solution C where $\| A_0 + C V^T \|_1$ is minimal.

This produces as many zeros as possible.

Implementation was done using the simplex method (linear adjustment method)
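The L1 step can be sketched row by row: each row a of A must satisfy X^T a = y, where y is the matching row of Ẋ − B, and we look for the row with the smallest L1 norm. The slides use the simplex method; the Python sketch below swaps in iteratively reweighted least squares (IRLS) so that it needs only NumPy. The problem size, the smoothing constant `eps` and the test data are assumptions.

```python
import numpy as np

def sparsest_row(Xt, y, iters=50, eps=1e-6):
    """Approximately minimise ||a||_1 subject to Xt @ a = y (Xt is M x N, M < N).
    IRLS stand-in for the simplex-based L1 regression."""
    a = np.linalg.pinv(Xt) @ y                   # start from the least squares (SVD) row
    for _ in range(iters):
        w = np.sqrt(a ** 2 + eps ** 2)           # small entries get small weights
        G = Xt * w                               # Xt @ diag(w)
        a = w * (Xt.T @ np.linalg.solve(G @ Xt.T, y))   # weighted minimum-norm solution
    return a

rng = np.random.default_rng(0)
N, M = 20, 8
a_true = np.zeros(N)
a_true[[2, 11]] = [1.5, -0.7]                    # hypothetical sparse row of A
Xt = rng.normal(size=(M, N))                     # M measurements of N genes
y = Xt @ a_true                                  # exact data

a_l2 = np.linalg.pinv(Xt) @ y                    # dense least squares row
a_l1 = sparsest_row(Xt, y)                       # sparser row fitting the same data
```

Both rows fit the measurements, but `a_l1` has the smaller L1 norm and concentrates its weight on fewer entries. Because every row is solved independently, the rows can be processed in parallel.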

56

• Results: Mc = O(log(N))

• Better than only SVD,

without regression:

57

Thus, to reverse-engineer a network of N genes, we “only” need Mc = O(log N) experiments.

Then Mc << N, and the computational cost will be $O(N^4)$

(Brute-force methods would have a cost of O(N!/(k!(N-k)!)) with k non-zero entries)

58

Discussion

Advantages:

• Few data needed, in comparison with neural networks, Bayesian models

• No prior knowledge needed

• Easy to parallelize, as it recovers the connectivity matrix row by row (gene by gene)

• Also applicable to protein networks

59
