Systems Biology: The inference of networks from high dimensional genomics data

Ka Yee Yeung
Nov 3, 2011

Systems Biology

• “Systems biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behavior of that system” (Nir Friedman)

• The goal is to construct models of complex biological systems and diseases (Trey Ideker)


An iterative approach

• Experiments
• Data handling
• Mathematical modeling
• High-throughput assays
• Integration of multiple forms of experiments and knowledge

Multi-disciplinary Science

• Biology
• Biotechnology
• Computer Science
• Mathematics and Statistics
• Physics and Chemistry
• Engineering
• …

Networks as a universal language

[Figure panels: gene regulatory network, social network, electronic circuit, Internet]

We are caught in an inescapable network of mutuality. ... Whatever affects one directly, affects all indirectly. —Martin Luther King Jr.

Science Special Online Collection: Complex systems and networks: http://www.sciencemag.org/complexity/


Road Map

• Definitions: graphical representation of networks
• Different types of molecular networks
• What can we do with networks?
• Network construction methods
– Co-expression networks
– Bayesian networks
– Regression-based methods

Graphical Representation of Gene Networks

• G = (V, E), where
– V: set of nodes (vertices)
– E: set of edges between the nodes that represent the relationships between nodes
• Directed vs. undirected
• Network topology: connectivity structure
• Modules: subsets of nodes that are more highly interconnected with each other than with other nodes in the network

[Figure: an undirected graph and a directed graph]

Degree

• The degree k of a node is the number of edges connected to it.

• In a directed graph, each node has an in-degree and an out-degree.

Courtesy of Bill Noble

Degree distribution

• The degree distribution plots the number of nodes that have a given degree k as a function of k.

• The shape of the degree distribution allows us to distinguish among types of networks.
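The definitions of degree and degree distribution above can be sketched in a few lines of Python. This is a minimal illustration, assuming the graph is given as a plain edge list; the function names and the toy star-shaped graph are hypothetical and not part of the lecture.

```python
from collections import Counter

def degrees(nodes, edges):
    """Degree of each node in an undirected graph given as an edge list."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def in_out_degrees(nodes, directed_edges):
    """In-degree and out-degree of each node; edges are (source, target) pairs."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for source, target in directed_edges:
        outdeg[source] += 1
        indeg[target] += 1
    return indeg, outdeg

def degree_distribution(nodes, edges):
    """Map each degree k to the number of nodes that have that degree."""
    return dict(Counter(degrees(nodes, edges).values()))

# Toy example (hypothetical): a hub "a" connected to three other nodes.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d")]
distribution = degree_distribution(nodes, edges)
indeg, outdeg = in_out_degrees(nodes, edges)
```

In the toy graph, one node has degree 3 and three nodes have degree 1; read as a directed graph, the hub has out-degree 3 and in-degree 0.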


Scale-free networks

• Most nodes have only a few connections; a few hub nodes are highly connected.

• The degree distribution follows a power law, which yields a straight line on a log-log plot.

• Many biological networks are approximately scale-free.


Molecular or Biochemical Pathways

• A set of coupled chemical reactions or signaling events.
• Nodes are molecules (often substrates) and edges represent chemical reactions.
• Represent decades of work in which the underlying chemical reactions are validated.

Example: KEGG (Kyoto Encyclopedia of Genes and Genomes)
• Contains 410 “pathways” that represent molecular interaction and reaction networks that are manually curated from 149,937 published references (as of 10/25/2011).

Molecular Networks Constructed from High-throughput assays (1)

Physical interaction network:
• A graphical representation of molecular binding interactions such as a protein-protein interaction (PPI) network.
• Nodes are molecules; edges represent physical interactions between molecules.
• Example: Yeast PPI network in which most interactions are derived from large-scale experiments such as yeast 2-hybrid data (high false-positive and false-negative rates).

Molecular Networks Constructed from High-throughput assays (2)

Bayesian networks: A directed, graphical representation of the probabilities of one observation given another. Nodes represent mRNA molecules; edges represent the probability of a particular expression value given the expression values of the parent nodes.

Correlation or co-expression network: A graphical representation that averages over observed expression data. Nodes are mRNA molecules, edges represent correlations between expression levels of connected nodes.


What can we do with these molecular networks?

• Using the position in networks to describe function
• Guilt by association
• Finding the causal regulator (the "Blame Game")

Courtesy of Mark Gerstein

What can we do with these molecular networks?

Hubs tend to be essential!

Courtesy of Mark Gerstein

[Figure: power-law distribution; y-axis: log(Frequency)]

Success stories:
• Network modeling links breast cancer susceptibility and centrosome dysfunction. Pujana et al. Nature Genetics 2007.
• Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target. PNAS 2006.

What can we do with these molecular networks? Network-based drug discovery

[Figure: Fig. 4 of Schadt et al. Nature Reviews Drug Discovery 2009]

Success stories:
• Variations in DNA elucidate molecular networks that cause disease. Chen et al. Nature 2008.
• Genetics of gene expression and its effect on disease. Emilsson et al. Nature 2008.

Motivation of gene network inference

• Using biochemical methods, it takes thousands of person-years to assign genes to pathways.

• Even for well-studied genomes, the majority of genes are not mapped to known pathways.

• As more genomes are being sequenced, and more genes are discovered, we need systematic methods to assign genes to pathways.

A gene-regulation function describes how inputs, such as transcription factors and regulatory elements, are transformed into a gene’s mRNA level.

Kim et al. Science 2009

Modeling DNA sequence-based cis-regulatory gene networks. Bolouri & Davidson. 2002.

Network construction methods

• Co-expression networks
• Bayesian networks
• Regression-based methods

Goal: construct gene networks

Early inference of transcriptional regulation: Clustering

• Clustering: extract groups of genes that are tightly co-expressed over a range of different experiments.
• Pattern discovery
• No prior knowledge required
• Applications:
– Guilt by association (functional annotations)
– Extraction of regulatory motifs
– Molecular signatures for tissue sub-types

Correlation: pairwise similarity

[Figure: the raw data matrix (n genes × p experiments) is converted into an n × n gene-by-gene similarity matrix; example expression profiles of genes X, Y, Z, and W across experiments 1 to 4]

Correlation(X, Y) = 1

Correlation(X, Z) = -1

Correlation(X, W) = 1
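As a rough illustration of pairwise similarity, the sketch below computes the Pearson correlation between expression profiles and assembles a gene-by-gene similarity matrix. The four profiles are hypothetical stand-ins (the values plotted on the slide are not recoverable); they are chosen so that Y is a shifted copy of X, Z is a mirrored copy, and W is a scaled copy.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def similarity_matrix(profiles):
    """Gene-by-gene correlation matrix from a {gene: profile} dictionary."""
    genes = list(profiles)
    return {g: {h: pearson(profiles[g], profiles[h]) for h in genes} for g in genes}

# Hypothetical profiles over four experiments (not the slide's actual values).
profiles = {
    "X": [1.0, 2.0, 3.0, 4.0],
    "Y": [0.0, 1.0, 2.0, 3.0],      # X shifted down
    "Z": [-1.0, -2.0, -3.0, -4.0],  # X mirrored
    "W": [2.0, 4.0, 6.0, 8.0],      # X scaled
}
sim = similarity_matrix(profiles)
```

Shifting or scaling a profile leaves the correlation at 1, while mirroring it gives -1, which is the pattern listed above for (X,Y), (X,W), and (X,Z).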

Clustering algorithms

• Inputs:
– Similarity matrix
– Number of clusters or some other parameters
• Many different classifications of clustering algorithms:
– Hierarchical vs partitional
– Heuristic-based vs model-based
– Soft vs hard

Hierarchical Clustering

• Agglomerative (bottom-up)

• Algorithm:
– Initialize: each item is its own cluster
– Iterate:
  • select the two most similar clusters
  • merge them
– Halt: when the required number of clusters is reached

[Figure: dendrogram]

Hierarchical: Single Link

• cluster similarity = similarity of two most similar members

- Potentially long and skinny clusters

+ Fast


Hierarchical: Complete Link

• cluster similarity = similarity of two least similar members

+ tight clusters

- slow


Hierarchical: Average Link

• cluster similarity = average similarity of all pairs

+ tight clusters

- slow
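The agglomerative procedure and the three linkage rules (single, complete, average) can be sketched as follows. This is a naive illustration, assuming the input is a symmetric similarity matrix stored as a list of lists; the function name and the toy matrix are hypothetical, and the pair search is deliberately simple rather than efficient.

```python
def agglomerative(sim, n_clusters, linkage="average"):
    """Bottom-up clustering on a symmetric similarity matrix (list of lists)."""
    clusters = [[i] for i in range(len(sim))]  # initialize: each item is a cluster

    def cluster_similarity(a, b):
        pairs = [sim[i][j] for i in a for j in b]
        if linkage == "single":      # two most similar members
            return max(pairs)
        if linkage == "complete":    # two least similar members
            return min(pairs)
        return sum(pairs) / len(pairs)  # average over all pairs

    while len(clusters) > n_clusters:
        # Select the two most similar clusters.
        _, i, j = max(
            (cluster_similarity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge them.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy similarity matrix (hypothetical): items 0,1 are close, items 2,3 are close.
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
result = agglomerative(toy_sim, n_clusters=2, linkage="average")
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the information needed to draw a dendrogram.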


Co-expression Networks

• Co-expression networks
– Aka: correlation networks, association networks
– Use microarray data only
– Nodes are connected if they have a significant pairwise expression profile association across environmental perturbations

• References:
– A general framework for weighted gene co-expression network analysis (Zhang, Horvath SAGMB 2005)
– WGCNA: an R package for weighted correlation network analysis (Langfelder, Horvath BMC Bioinformatics 2008)

Steps for constructing a co-expression network

A) Microarray gene expression data

B) Measure concordance of gene expression with correlation

C) The Pearson correlation matrix is either thresholded to arrive at an adjacency matrix (unweighted network) or transformed continuously with the power adjacency function (weighted network)
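Step C can be illustrated with a short sketch. It assumes an unsigned network (absolute correlations), a hard threshold named `tau`, and a power adjacency function of the form |correlation|^beta as in the weighted co-expression framework cited above; the parameter names and the small matrix are illustrative assumptions.

```python
def hard_adjacency(cor, tau):
    """Unweighted network: edge i-j exists when |cor[i][j]| >= tau (no self-loops)."""
    n = len(cor)
    return [[1 if i != j and abs(cor[i][j]) >= tau else 0 for j in range(n)]
            for i in range(n)]

def soft_adjacency(cor, beta):
    """Weighted network: connection strength |cor[i][j]| ** beta (no self-loops)."""
    n = len(cor)
    return [[abs(cor[i][j]) ** beta if i != j else 0.0 for j in range(n)]
            for i in range(n)]

# Hypothetical 3-gene correlation matrix.
cor = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, -0.8],
    [0.2, -0.8, 1.0],
]
unweighted = hard_adjacency(cor, tau=0.85)
weighted = soft_adjacency(cor, beta=2)
```

The hard threshold discards the -0.8 correlation entirely, while the power function keeps it as a weaker connection, which is the practical difference between the two options.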

Example: co-expression network

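Correlation matrix

   A     B     C     D     E     F     G     H     I     J
A  .     0.91  0.72  0.84  .     0.78  0.88  .     .     .
B  0.91  .     .     .     .     .     .     .     .     .
C  0.72  .     .     .     .     .     .     .     .     .
D  0.84  .     .     .     .     .     .     .     .     .
E  .     .     .     .     .     0.94  .     .     .     .
F  0.78  .     .     .     0.94  .     0.75  .     .     .
G  0.88  .     .     .     .     0.75  .     0.92  .     .
H  .     .     .     .     .     .     0.92  .     .     .
I  .     .     .     .     .     .     .     .     .     0.98
J  .     .     .     .     .     .     .     .     0.98  .

Correlation threshold = 1

[Figure: network on nodes A to J at this threshold]

k    P(k)
0    10/10

At a threshold of 1, there are no edges, so all nodes have degree (k) = 0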


Correlation threshold = 0.9

[Figure: correlation matrix and the resulting network on nodes A to J at this threshold]

k    P(k)
0    2/10
1    8/10

At a threshold of 0.9, there are 4 edges, so 8 nodes have degree (k) = 1


Correlation threshold = 0.7

[Figure: correlation matrix and the resulting network on nodes A to J at this threshold]

k    P(k)
1    7/10
3    2/10
5    1/10

[Figure: plot of log(P(k)) against log(k)]
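The worked example can be reproduced with a short script. The sketch below stores the nine correlation values from the example as gene pairs; the assignment of each value to a pair is a reconstruction from the flattened matrix, chosen so that it reproduces the P(k) tables on the slides. The function name is hypothetical.

```python
from collections import Counter

NODES = list("ABCDEFGHIJ")

# Pairwise correlations from the example (pairing reconstructed from the slide).
CORRELATIONS = {
    ("A", "B"): 0.91, ("A", "C"): 0.72, ("A", "D"): 0.84,
    ("A", "F"): 0.78, ("A", "G"): 0.88, ("E", "F"): 0.94,
    ("F", "G"): 0.75, ("G", "H"): 0.92, ("I", "J"): 0.98,
}

def degree_counts(threshold):
    """Number of nodes at each degree k after thresholding the correlations."""
    degree = {v: 0 for v in NODES}
    for (u, v), r in CORRELATIONS.items():
        if r >= threshold:  # keep the edge only if the correlation is high enough
            degree[u] += 1
            degree[v] += 1
    return dict(sorted(Counter(degree.values()).items()))

for threshold in (1.0, 0.9, 0.7):
    print(threshold, degree_counts(threshold))
```

Lowering the threshold from 1 to 0.7 takes the network from no edges to nine edges, with node A emerging as a hub of degree 5.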

Bayesian networks

• A directed acyclic graph (DAG) in which the nodes represent mRNA expression levels and the edges represent the probability of observing an expression value given the values of the parent nodes.

• The probability distribution for a gene depends only on its regulators (parents) in the network.

Example: G4 and G5 share a common regulator G2, i.e., they are conditionally independent given G2. This allows factorization of the full joint probability distribution into component conditional distributions.

Needham et al. PLOS Comp Bio 2007

Independent Events

[Figure: nodes G1, G2, G3, G4, G5]

If G1, …, G5 are independent, then the joint probability p(G1, G2, G3, G4, G5) = p(G1) p(G2) p(G3) p(G4) p(G5)

Example:
K = “Ka Yee gives the lecture today”.
R = “It is raining outside today”.
Whether it rains or shines outside doesn’t affect whether Ka Yee is giving the lecture today.
p(K,R) = p(K) * p(R)

Conditional Probability Distributions

• Conditional probability distributions: p(B|A) = the probability of B given A.

• Score a network (fit) in light of the data: p(M|D), where D = data and M = network structure. This score measures how well a particular network explains the observed data.

Example:
K = “Ka Yee gives the lecture today”.
E = “today’s lecture contains equations”.
P(E, K) = probability that Ka Yee gives the lecture today and today’s lecture contains equations = 0.05.
P(K) = 1/10 = 0.1.

P(E|K) = P(E, K) / P(K) = 0.05/0.1 = 0.5.

Conditional Independence

• In Bayesian networks, each node is independent of its non-descendants, given its parents in the DAG.

• Using conditional independence between variables, the joint probability distribution of the models may be represented in a compact manner.

Example:
K = “Ka Yee gives the lecture today”.
E = “today’s lecture contains equations”.
C = “today’s slides are in Comic Sans font”.

If Ka Yee is giving the lecture today, then whether today’s lecture contains equations doesn’t affect whether today’s slides are in Comic Sans.
P(E|K,C) = P(E|K)
E and C are conditionally independent given K.

[Figure: graph with node K above nodes E and C]

Joint Probability Distribution


p(G1, G2, G3, G4, G5) = p(G1) p(G2|G1) p(G3|G1) p(G4|G2) p(G5|G1, G2, G3)
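To make the factorization concrete, the sketch below evaluates it for binary (on/off) genes. The conditional probability values are invented for illustration only; the parent sets follow the factorization above. It also counts how many numbers the factorized form needs compared with a full joint table.

```python
from itertools import product

# Hypothetical conditional probability tables: P(gene = 1 | parents).
P_G1 = 0.3
P_G2 = {0: 0.2, 1: 0.8}   # indexed by G1
P_G3 = {0: 0.5, 1: 0.1}   # indexed by G1
P_G4 = {0: 0.1, 1: 0.7}   # indexed by G2
P_G5 = {(a, b, c): 0.1 + 0.25 * (a + b + c)   # indexed by (G1, G2, G3)
        for a, b, c in product((0, 1), repeat=3)}

def bernoulli(p_one, value):
    """Probability of a binary value when P(value = 1) is p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(g1, g2, g3, g4, g5):
    """p(G1..G5) = p(G1) p(G2|G1) p(G3|G1) p(G4|G2) p(G5|G1,G2,G3)."""
    return (bernoulli(P_G1, g1)
            * bernoulli(P_G2[g1], g2)
            * bernoulli(P_G3[g1], g3)
            * bernoulli(P_G4[g2], g4)
            * bernoulli(P_G5[(g1, g2, g3)], g5))

# Sanity check: the joint probabilities of all 32 states sum to 1.
total = sum(joint(*state) for state in product((0, 1), repeat=5))

# Compactness: numbers needed by the factorization vs. a full joint table.
n_params_factorized = 1 + len(P_G2) + len(P_G3) + len(P_G4) + len(P_G5)
n_params_full = 2 ** 5 - 1
```

With these parent sets the factorization needs 15 numbers instead of 31, and the saving grows quickly as genes are added while parent sets stay small.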

Constructing a Bayesian network

• Variables (nodes in the graph)
• Add edges to the graph by computing conditional probabilities that characterize the distribution of states of each node given the state of its parents.

• The number of possible network structures grows super-exponentially with the number of nodes, so an exhaustive search of all possible structures to find the one best supported by the data is not feasible.

• Markov chain Monte Carlo (MCMC) algorithm:
– Start with a random network.
– Small random changes are then made to the network by flipping, adding, or deleting individual edges.
– Accept changes that improve the fit of the network to the data.
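The search loop can be sketched as a small Metropolis-style sampler over DAG structures. Several things here are assumptions for illustration: the score is taken to be a log-score supplied by the caller, the chain starts from an empty graph rather than a random one, worse networks are occasionally accepted (the usual Metropolis rule, which goes beyond "accept improvements"), and the toy score simply rewards agreement with a made-up target edge set instead of fitting real data.

```python
import math
import random

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {v: 0 for v in nodes}
    for _, target in edges:
        indegree[target] += 1
    ready = [v for v in nodes if indegree[v] == 0]
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for source, target in edges:
            if source == u:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    return visited == len(nodes)

def propose(nodes, edges, rng):
    """Symmetric proposal: toggle (add/delete) or flip one random edge."""
    u, v = rng.sample(nodes, 2)
    new_edges = set(edges)
    if rng.random() < 0.5:                 # toggle: delete if present, else add
        if (u, v) in new_edges:
            new_edges.remove((u, v))
        else:
            new_edges.add((u, v))
    elif (u, v) in new_edges:              # flip the direction of an existing edge
        new_edges.remove((u, v))
        new_edges.add((v, u))
    return frozenset(new_edges)

def structure_search(nodes, log_score, steps=5000, seed=0):
    """Return the best-scoring DAG visited by the chain and its score."""
    rng = random.Random(seed)
    current = frozenset()                  # assumption: start from the empty graph
    current_score = log_score(current)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(nodes, current, rng)
        if not is_acyclic(nodes, candidate):
            continue                       # reject anything that is not a DAG
        candidate_score = log_score(candidate)
        accept = (candidate_score >= current_score
                  or rng.random() < math.exp(candidate_score - current_score))
        if accept:
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

# Toy log-score (hypothetical): penalize each edge that differs from a target DAG.
NODES = ["G1", "G2", "G3", "G4", "G5"]
TARGET = frozenset({("G1", "G2"), ("G1", "G3"), ("G2", "G4"),
                    ("G1", "G5"), ("G2", "G5"), ("G3", "G5")})

def toy_log_score(edges):
    return -3.0 * len(edges ^ TARGET)

best_edges, best_score = structure_search(NODES, toy_log_score)
```

In a real analysis the toy score would be replaced by a data-driven quantity such as the log of p(M|D) up to a constant.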


Bayesian networks

• Advantages:
– Compact and intuitive representation
– Integration of prior knowledge
– Probabilistic framework for data integration

• Limitation: no feedback loops. Dynamic Bayesian networks (variables are indexed by time and replicated in the network) address this limitation.

• References:
– Using Bayesian Networks to Analyze Expression Data. Friedman et al. J. Computational Biology 7:601-620, 2000.
– A Primer on Learning in Bayesian Networks for Computational Biology. Needham et al. PLOS Computational Biology 2007.

What kinds of data contain potential information about gene networks?

Large expression sets
• Co-expression (correlation of expression levels) implies connectivity

• But correlation ≠ causality

[Figure: alternative network configurations involving A and B, including one with a third node C]

Adding causality
• Genetic perturbation: DNA variation at A influences RNA variation at B.
• Time series: A goes up prior to B.
• Prior knowledge

Adding genetics data

• Quantitative trait locus (QTL): a region of DNA that is associated with a particular trait (e.g. height)

• QTL mapping (linkage analysis): correlate the genotypic and phenotypic variation

Our data

[Figure: cross of BY (lab) × RM (wild) yielding 95 segregants; DNA genotype for each segregant; phenotype: RNA levels in response to drug perturbation at 6 time points]

Our experimental design:
• Time dependencies: ordering of regulatory events.
• Genotype data: correlate DNA variations in the segregants to measured expression levels.

Experimental design: Roger Bumgarner, Kenneth Dombek, Eric Schadt, Jun Zhu.

Genetics of global gene expression. Rockman & Kruglyak. 2006.

[Figure: workflow. External data sources (expression data, genome-wide binding data, literature, and other data, e.g. protein-protein interaction, genetic interaction, genotype) feed "Supervised learning: integration of external data", which gives, for each gene g, the probability that R regulates g (e.g. 0.95, 0.23, 0.78, …). The regulators constrained by the external data sources and the time series expression data then enter "Variable selection", which yields the gene regulatory network.]

Yeung et al. To appear in PNAS.

Integration of external data

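[Figure: table of (TF, gene) pairs with response Y and variables Xi computed from the other data sources; output: probability of regulation for each gene (e.g. 0.95, 0.23, 0.78, …)]

• Compute variables (Xi) that capture evidence of regulation for (TF-gene) pairs.
• Training data: positive (Y=1) vs. negative (Y=0) training examples.
• Apply logistic regression to determine the weights of the Xi’s.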
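A minimal sketch of the supervised step is given below: logistic regression fitted by plain gradient descent on a handful of made-up (TF, gene) training pairs. The two evidence variables, their values, the learning rate, and the function names are all hypothetical; a real analysis would use an established statistics package and the actual external data sources.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights [intercept, w1, w2, ...] by gradient descent on the log-loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(X, y):
            p = sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], features)))
            error = p - label
            gradient[0] += error
            for j, x in enumerate(features):
                gradient[j + 1] += error * x
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, gradient)]
    return weights

def probability_of_regulation(weights, features):
    """Predicted probability that the TF regulates the gene."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], features)))

# Hypothetical training data: each row holds two evidence scores for a (TF, gene) pair.
X_train = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.6, 0.2],
           [0.2, 0.3], [0.1, 0.1], [0.3, 0.7], [0.2, 0.1]]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = known regulatory pair, 0 = negative example

weights = fit_logistic(X_train, y_train)
strong_evidence = probability_of_regulation(weights, [0.9, 0.9])
weak_evidence = probability_of_regulation(weights, [0.1, 0.1])
```

The fitted model turns the evidence variables for any new (TF, gene) pair into a probability of regulation, which is the kind of per-gene list of probabilities shown in the workflow.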

Constraining candidate regulators

• Without prior knowledge, every gene is a potential regulator of every other gene. We want to restrict the search to the most likely regulators.

• For each gene g, we estimated a priori how likely it is that each regulator R regulates g, using the supervised framework and the external data sources.

[Figure: gene g with candidate parent nodes R1, R2, R3]

Graphical representation of the network as a set of nodes and edges.
Goal: To infer parent nodes (regulators) for each gene g using the time series expression data

Regression-based approach

Let X(g,t,s) = expression level of gene g at time t in segregant s

$$X(g,t,s) = \sum_{R \text{ is a potential regulator}} \beta_{g,s} \, X(R,t-1,s) + \varepsilon$$

[Figure: the expression levels of the potential regulators R at time t-1 are linked to gene g at time t; variable selection]

Use the expression level at time (t-1) to predict the expression levels at time t in the same segregant

Yeung et al. To appear in PNAS.
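The lagged regression can be sketched with ordinary least squares. This is only an illustration under stated assumptions: one coefficient per candidate regulator, no intercept, observations pooled over all segregants and time points, and noise-free synthetic data in which the target is driven by R1 alone. The lecture's method selects regulators with Bayesian model averaging rather than fitting a single least-squares model, and all names below are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [value] for row, value in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_lagged_regression(expr, target, regulators):
    """Regress target expression at time t on regulator expression at time t-1.

    expr[gene][s][t] is the expression of a gene in segregant s at time t.
    """
    rows, responses = [], []
    for s in range(len(expr[target])):
        for t in range(1, len(expr[target][s])):
            rows.append([expr[r][s][t - 1] for r in regulators])
            responses.append(expr[target][s][t])
    p = len(regulators)
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(rows, responses)) for i in range(p)]
    return dict(zip(regulators, solve_linear_system(xtx, xty)))

# Synthetic data (hypothetical): 2 segregants, 4 time points; g(t) = 2 * R1(t-1).
expr = {
    "R1": [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.0, 1.0]],
    "R2": [[0.0, 1.0, 0.0, 1.0], [1.0, 3.0, 2.0, 0.0]],
    "g":  [[0.0, 2.0, 4.0, 6.0], [0.0, 4.0, 2.0, 0.0]],
}
coefficients = fit_lagged_regression(expr, "g", ["R1", "R2"])
```

On this synthetic data the fit recovers a coefficient of about 2 for R1 and about 0 for R2, so only R1 would be kept as a parent of g.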

Bayesian Model Averaging (BMA)

[Raftery 1995], [Hoeting et al. 1999]

• BMA takes model uncertainty into account by averaging over the posterior distributions of a quantity of interest based on multiple models, weighted by their posterior model probabilities.

• Output: Posterior probabilities for selected genes and selected models

$$\Pr(\Delta \mid D) = \sum_{k=1}^{K} \Pr(\Delta \mid D, M_k) \, \Pr(M_k \mid D)$$
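The averaging formula can be illustrated with a small sketch. Here the quantity of interest is "regulator R is in the model", so Pr(Δ | D, M_k) is 1 or 0 and the sum reduces to adding up the posterior probabilities of the models that contain R. The posterior model probabilities are approximated from BIC values with equal prior weights, a common approximation; the candidate models and BIC values are invented.

```python
import math

def model_posteriors(bics):
    """Approximate Pr(M_k | D) as proportional to exp(-BIC_k / 2), equal priors."""
    best = min(bics)                      # subtract the minimum for numerical stability
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

def inclusion_probabilities(models, posteriors):
    """Posterior probability that each regulator appears in the model."""
    probabilities = {}
    for regulators, posterior in zip(models, posteriors):
        for r in regulators:
            probabilities[r] = probabilities.get(r, 0.0) + posterior
    return probabilities

# Hypothetical candidate models for one target gene, with made-up BIC values.
models = [{"R1"}, {"R1", "R2"}, {"R3"}]
bics = [100.0, 102.0, 110.0]

posteriors = model_posteriors(bics)
inclusion = inclusion_probabilities(models, posteriors)
```

R1 appears in the two best-supported models, so its posterior inclusion probability is close to 1, while R3 appears only in a poorly supported model and receives almost none.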

Assessment

• Recovery of known regulatory relationships:
– We showed significant enrichment between our inferred network and the assessment criteria.

• Lab validation of selected sub-networks

• Comparison to other methods in the literature.

[Figure: child nodes of selected TFs in the inferred network are compared with the genes that respond to deletion of the TF (WT vs. TF deletion) under rapamycin perturbation]

Systematic name   Common name   # references in SGD   # child nodes in network A   Expression pattern over time   Known binding site from JASPAR?
YDR421W           ARO80         19                    51                           Increasing over time           yes
YML113W           DAT1          17                    57                           Decreasing over time           no
YBL103C           RTG3          83                    47                           Increasing over time           yes

Description from SGD:
• ARO80: Zinc finger transcriptional activator of the Zn2Cys6 family; activates transcription of aromatic amino acid catabolic genes in the presence of aromatic amino acids
• DAT1: DNA binding protein that recognizes oligo(dA).oligo(dT) tracts; Arg side chain in its N-terminal pentad Gly-Arg-Lys-Pro-Gly repeat is required for DNA-binding; not essential for viability
• RTG3: Basic helix-loop-helix-leucine zipper (bHLH/Zip) transcription factor that forms a complex with another bHLH/Zip protein, Rtg1p, to activate the retrograde (RTG) and TOR pathways

Comparing our networks to the deletion data

             Our inferred network   Validation experiment
Deleted TF   # child nodes          Genes that respond to the deletion   # overlap   Fisher's test p-value
ARO80        51                     10                                   4           9.3 x 10^-6
DAT1         57                     784                                  20          0.04
RTG3         47                     2288                                 39          0.03
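The overlap test in this comparison can be sketched as a one-sided Fisher's exact test, computed as a hypergeometric tail probability. Two things are assumptions: that the test is one-sided (enrichment), and the size of the gene universe, which is not stated on the slides. The value used below is a placeholder, so the resulting p-value is not expected to match the table.

```python
from math import comb

def enrichment_p_value(universe, n_children, n_responders, overlap):
    """P(overlap >= observed) when n_responders genes are drawn at random
    from a universe that contains n_children child nodes."""
    largest_possible = min(n_children, n_responders)
    total = comb(universe, n_responders)
    tail = sum(
        comb(n_children, k) * comb(universe - n_children, n_responders - k)
        for k in range(overlap, largest_possible + 1)
    )
    return tail / total

# ARO80 row of the table: 51 child nodes, 10 responding genes, overlap of 4.
# The universe size is a hypothetical placeholder, not taken from the lecture.
ASSUMED_UNIVERSE = 6000
p_aro80 = enrichment_p_value(ASSUMED_UNIVERSE, n_children=51, n_responders=10, overlap=4)
```

A small p-value means that an overlap this large would rarely arise if the responding genes were unrelated to the inferred child nodes.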

Legend:
Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes.
Magenta: Target genes with known ARO80 binding site.

Aro80p is a known regulator of ARO9 and ARO10 (Iraqui et al. Molecular and Cellular Biology 1999, 19:3360-3371).

Amazingly, all 4 genes that respond to deletion (ARO9, ARO10, NAF1, ESBP6) contain the known ARO80 binding site upstream!


[Figure: the integration workflow again: external data sources, supervised learning, regulators constrained by the external data sources, variable selection on time series expression data, gene regulatory network]

Goal: incorporate prior probabilities in the variable selection step.

Revisiting our Road Map

• Definitions: graphical representation of networks
• Different types of molecular networks
• What can we do with networks?
• Network construction methods
– Co-expression networks
– Bayesian networks
– Regression-based methods
– Assessment

Thank you’s

Special thanks
Dr. Rachel Brem
Dr. Su-In Lee

Method development
Adrian Raftery
Kenneth Lo
John Mittler

Data + Biological interpretation
Roger Bumgarner
Kenneth Dombek
Eric Schadt
Jun Zhu

R01GM084163, R01GM084163-02S2
