systems biology: the inference of networks from high dimensional genomics data
DESCRIPTION
Systems Biology: The inference of networks from high dimensional genomics data. Ka Yee Yeung, Nov 3, 2011.
1
Systems Biology: The inference of networks from high dimensional
genomics data
Ka Yee Yeung, Nov 3, 2011
Systems Biology
• “Systems biology is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behavior of that system” (Nir Friedman)
• The goal is to construct models of complex biological systems and diseases (Trey Ideker)
2
An iterative approach
3
Experiments
Data handling
Mathematical modeling
High-throughput assays
Integration of multiple forms of experiments and knowledge
Multi-disciplinary Science
• Biology
• Biotechnology
• Computer Science
• Mathematics and Statistics
• Physics and chemistry
• Engineering…
4
Networks as a universal language
[Figure panels: gene regulatory network, social network, electronic circuit, Internet]
We are caught in an inescapable network of mutuality. ... Whatever affects one directly, affects all indirectly. —Martin Luther King Jr.
Science Special Online Collection: Complex systems and networks: http://www.sciencemag.org/complexity/
5
Road Map
• Definitions: graphical representation of networks
• Different types of molecular networks
• What can we do with networks?
• Network construction methods
  – Co-expression networks
  – Bayesian networks
  – Regression-based methods
6
Graphical Representation of Gene Networks
• G=(V,E) where
  – V: set of nodes (vertices)
  – E: set of edges between the nodes that represent the relationships between nodes
• Directed vs. undirected
• Network topology: connectivity structure
• Modules: subsets of nodes that are more highly interconnected with each other than with other nodes in the network
7
Undirected
Directed
Degree
• The degree k of a node is the number of edges connected to it.
• In a directed graph, each node has an in-degree and an out-degree.
Courtesy of Bill Noble
Degree distribution
• The degree distribution plots the number of nodes that have a given degree k as a function of k.
• The shape of the degree distribution allows us to distinguish among types of networks.
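A minimal sketch of these two definitions, assuming a graph stored as a plain edge list. The toy graph and the function names are hypothetical, not taken from the slides.

```python
from collections import Counter

def degrees(nodes, edges, directed=False):
    """Degree per node (undirected), or (in-degree, out-degree) dicts (directed)."""
    if directed:
        indeg = {v: 0 for v in nodes}
        outdeg = {v: 0 for v in nodes}
        for u, v in edges:          # edge u -> v
            outdeg[u] += 1
            indeg[v] += 1
        return indeg, outdeg
    deg = {v: 0 for v in nodes}
    for u, v in edges:              # each edge touches both endpoints
        deg[u] += 1
        deg[v] += 1
    return deg

def degree_distribution(deg):
    """Map each degree k to the number of nodes that have degree k."""
    return dict(Counter(deg.values()))

# Hypothetical toy graph: one hub "a" connected to three other nodes.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d")]

deg = degrees(nodes, edges)
dist = degree_distribution(deg)
indeg, outdeg = degrees(nodes, edges, directed=True)
print(deg, dist)
```

Read as a directed graph, the same edge list gives the hub an out-degree of 3 and an in-degree of 0.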
Courtesy of Bill Noble
Scale-free networks
• Most nodes have only a few connections; a few hub nodes are highly connected.
• The degree distribution follows a power law, which yields a straight line on a log-log plot.
• Many biological networks are approximately scale-free.
Courtesy of Bill Noble
Molecular or Biochemical Pathways
Example: KEGG (Kyoto Encyclopedia of Genes and Genomes)
• Contains 410 “pathways” that represent molecular interaction and reaction networks, manually curated from 149,937 published references (as of 10/25/2011).
11
• A set of coupled chemical reactions or signaling events.
• Nodes are molecules (often substrates) and edges represent chemical reactions.
• Represents decades of work in which the underlying chemical reactions are validated.
Molecular Networks Constructed from High-throughput assays (1)
Physical interaction network:
• A graphical representation of molecular binding interactions, such as a protein-protein interaction (PPI) network.
• Nodes are molecules; edges represent physical interactions between molecules.
• Example: yeast PPI network in which most interactions are derived from large-scale experiments such as yeast two-hybrid screens (high false-positive and false-negative rates).
12
Molecular Networks Constructed from High-throughput assays (2)
Bayesian networks: A directed, graphical representation of the probabilities of one observation given another. Nodes represent mRNA molecules; edges represent the probability of a particular expression value given the expression values of the parent nodes.
Correlation or co-expression network: A graphical representation that averages over observed expression data. Nodes are mRNA molecules, edges represent correlations between expression levels of connected nodes.
13
What can we do with these molecular networks?
Using the position in networks to describe function
Guilt by association
Finding the causal regulator (the "Blame Game")
Courtesy of Mark Gerstein
What can we do with these molecular networks?
Hubs tend to be essential!
Courtesy of Mark Gerstein
[Figure: power-law degree distribution; vertical axis: log(Frequency)]
Success stories:
• Network modeling links breast cancer susceptibility and centrosome dysfunction. Pujana et al. Nature Genetics 2007
• Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target. PNAS 2006
15
Fig 4 Schadt et al. Nature Reviews Drug Discovery 2009
What can we do with these molecular networks? Network-based drug discovery
Success stories:
• Variations in DNA elucidate molecular networks that cause disease. Chen et al. Nature 2008.
• Genetics of gene expression and its effect on disease. Emilsson et al. Nature 2008.
16
Motivation of gene network inference
• Using biochemical methods, it takes thousands of person-years to assign genes to pathways.
• Even for well-studied genomes, the majority of genes are not mapped to known pathways.
• As more genomes are being sequenced, and more genes are discovered, we need systematic methods to assign genes to pathways.
17
18
A gene-regulation function describes how inputs, such as transcription factors and regulatory elements, are transformed into a gene’s mRNA level.
Kim et al. Science 2009
19
Modeling DNA sequence-based cis-regulatory gene networks. Bolouri & Davidson. 2002.
20
Network construction methods
• Co-expression networks
• Bayesian networks
• Regression-based methods
Goal: construct gene networks
21
Early inference of transcriptional regulation: Clustering
• Clustering: extract groups of genes that are tightly co-expressed over a range of different experiments.
• Pattern discovery
• No prior knowledge required
• Applications:
  – Guilt by association (functional annotations)
  – Extraction of regulatory motifs
  – Molecular signatures for tissue sub-types
22
Correlation: pairwise similarity
23
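[Figure: a raw data matrix (n genes × p experiments, with rows X and Y highlighted) is converted into a similarity matrix (n genes × n genes)]
[Figure: expression profiles of genes X, Y, Z and W over experiments 1 to 4; vertical axis from -3 to 4]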
Correlation (X,Y) = 1
Correlation (X,Z) = -1
Correlation (X,W) = 1
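The three correlations can be reproduced with made-up profiles. The values below are hypothetical (the plotted values are not recoverable from the transcript); they are chosen so that Y is a shifted copy of X, Z is its mirror image, and W has the same shape with a smaller amplitude. The point is that Pearson correlation compares the shape of two profiles, not their offset or magnitude.

```python
import numpy as np

# Hypothetical expression profiles over 4 experiments.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = X - 3.0        # same shape as X, shifted down
Z = -X + 3.0       # mirror image of X
W = 0.5 * X        # same shape as X, smaller amplitude

def pearson(a, b):
    """Pearson correlation of two equal-length profiles."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

print(pearson(X, Y), pearson(X, Z), pearson(X, W))
```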
24
Clustering algorithms
• Inputs:
  – Similarity matrix
  – Number of clusters or some other parameters
• Many different classifications of clustering algorithms:
  – Hierarchical vs partitional
  – Heuristic-based vs model-based
  – Soft vs hard
25
Hierarchical Clustering
• Agglomerative (bottom-up)
• Algorithm:
  – Initialize: each item is its own cluster
  – Iterate:
    • select the two most similar clusters
    • merge them
  – Halt: when the required number of clusters is reached
[Figure: dendrogram]
26
Hierarchical: Single Link
• cluster similarity = similarity of two most similar members
- Potentially long and skinny clusters
+ Fast
27
Hierarchical: Complete Link
• cluster similarity = similarity of two least similar members
+ tight clusters
- slow
28
Hierarchical: Average Link
• cluster similarity = average similarity of all pairs
+ tight clusters
- slow
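The agglomerative procedure and the three linkage rules can be sketched in a few lines. This is a naive illustration, not an optimized implementation; the function name and the 4-gene similarity matrix are hypothetical.

```python
def agglomerative(sim, n_clusters, linkage="average"):
    """Merge the two most similar clusters until n_clusters remain.

    sim: symmetric similarity matrix (list of lists); larger = more similar.
    linkage: "single", "complete" or "average".
    """
    clusters = [[i] for i in range(len(sim))]   # initialize: each item a cluster

    def cluster_sim(a, b):
        pair_sims = [sim[i][j] for i in a for j in b]
        if linkage == "single":
            return max(pair_sims)               # two most similar members
        if linkage == "complete":
            return min(pair_sims)               # two least similar members
        if linkage == "average":
            return sum(pair_sims) / len(pair_sims)
        raise ValueError(f"unknown linkage: {linkage}")

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the best pair
        del clusters[y]
    return [sorted(c) for c in clusters]

# Hypothetical similarities: genes 0 and 1 are alike, genes 2 and 3 are alike.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
result = agglomerative(sim, n_clusters=2)
print(result)
```

On data this clean all three linkage rules give the same two clusters; they differ when clusters are elongated or overlap.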
29
Co-expression Networks
• Co-expression networks
  – Also known as correlation networks or association networks
  – Use microarray data only
  – Nodes are connected if they have a significant pairwise expression profile association across environmental perturbations
• References:
  – A general framework for weighted gene co-expression network analysis (Zhang, Horvath SAGMB 2005)
  – WGCNA: an R package for weighted correlation network analysis. (Langfelder, Horvath BMC Bioinformatics 2008)
Steps for constructing a co-expression network
A) Microarray gene expression data
B) Measure concordance of gene expression with correlation
C) The Pearson correlation matrix is either thresholded to arrive at an adjacency matrix (unweighted network) or transformed continuously with the power adjacency function (weighted network)
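A minimal sketch of step C, assuming the unsigned form of both adjacency functions (absolute correlation). The parameter names tau and beta and the 3-gene correlation matrix are assumptions made for illustration.

```python
import numpy as np

def unweighted_adjacency(cor, tau):
    """Hard threshold: edge wherever |correlation| >= tau (no self-edges)."""
    adj = (np.abs(cor) >= tau).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

def weighted_adjacency(cor, beta):
    """Power adjacency function: a_ij = |cor_ij| ** beta (no self-edges)."""
    adj = np.abs(cor) ** beta
    np.fill_diagonal(adj, 0.0)
    return adj

# Hypothetical Pearson correlation matrix for 3 genes.
cor = np.array([
    [1.0, 0.9, 0.3],
    [0.9, 1.0, -0.8],
    [0.3, -0.8, 1.0],
])

hard = unweighted_adjacency(cor, tau=0.85)
soft = weighted_adjacency(cor, beta=2)
print(hard)
print(soft)
```

The hard threshold keeps only the 0.9 pair, while the power function keeps every pair and shrinks weak correlations toward zero.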
30
Example: co-expression network
31
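Correlation matrix (. = no value shown on the slide):

     A     B     C     D     E     F     G     H     I     J
A    .     0.91  0.72  0.84  .     0.78  0.88  .     .     .
B    0.91  .     .     .     .     .     .     .     .     .
C    0.72  .     .     .     .     .     .     .     .     .
D    0.84  .     .     .     .     .     .     .     .     .
E    .     .     .     .     .     0.94  .     .     .     .
F    0.78  .     .     .     0.94  .     0.75  .     .     .
G    0.88  .     .     .     .     0.75  .     0.92  .     .
H    .     .     .     .     .     .     0.92  .     .     .
I    .     .     .     .     .     .     .     .     .     0.98
J    .     .     .     .     .     .     .     .     0.98  .

Correlation threshold = 1
[Figure: nodes A to J drawn with no edges]
k    P(k)
0    10/10
At a threshold of 1, there are no edges, so all nodes have degree k = 0.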
Example: co-expression network
32
Correlation matrix: same as in the previous example
Correlation threshold = 0.9
[Figure: nodes A to J with edges A-B, E-F, G-H and I-J]
k    P(k)
0    2/10
1    8/10
At a threshold of 0.9, there are 4 edges, so 8 nodes have degree k = 1.
Example: co-expression network
33
Correlation matrix: same as in the previous example
Correlation threshold = 0.7
[Figure: nodes A to J with all 9 edges of the correlation matrix]
k    P(k)
1    7/10
3    2/10
5    1/10
[Figure: plot of log(P(k)) against log(k)]
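The three thresholds of this example can be checked with a short script. The placement of each correlation value is an assumption: it is inferred from the symmetry of the flattened matrix, e.g. row B holding 0.91 pairs with the first value in row A.

```python
from collections import Counter

GENES = "ABCDEFGHIJ"

# Upper-triangle entries of the example correlation matrix (positions inferred).
CORRELATIONS = {
    ("A", "B"): 0.91, ("A", "C"): 0.72, ("A", "D"): 0.84,
    ("A", "F"): 0.78, ("A", "G"): 0.88, ("E", "F"): 0.94,
    ("F", "G"): 0.75, ("G", "H"): 0.92, ("I", "J"): 0.98,
}

def degree_distribution(threshold):
    """Return (number of edges, {degree k: number of nodes}) at a threshold."""
    edges = [pair for pair, r in CORRELATIONS.items() if r >= threshold]
    degree = {g: 0 for g in GENES}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    return len(edges), dict(sorted(counts.items()))

for threshold in (1.0, 0.9, 0.7):
    n_edges, dist = degree_distribution(threshold)
    print(threshold, n_edges, dist)
```

At 0.7 node A has degree 5 and nodes F and G have degree 3, which gives the first hint of a hub structure.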
34
Bayesian networks
• A directed acyclic graph (DAG) in which the nodes represent mRNA expression levels and the edges represent the probability of observing an expression value given the values of the parent nodes.
• The probability distribution for a gene depends only on its regulators (parents) in the network.
Example: G4 and G5 share a common regulator G2, i.e., they are conditionally independent given G2. This allows factorization of the full joint probability distribution into component conditional distributions.
Needham et al. PLOS Comp Bio 2007
Independent Events
35
[Figure: genes G1, G2, G3, G4 and G5]
If G1, …, G5 are independent, then the joint probability p(G1, G2, G3, G4, G5) = p(G1) p(G2) p(G3) p(G4) p(G5)
Example:
K = “KaYee gives the lecture today”.
R = “It is raining outside today”.
Whether it rains or shines outside doesn’t affect whether KaYee is giving the lecture today.
p(K,R) = p(K) * p(R)
Conditional Probability Distributions
• Conditional probability distributions: p(B|A) = the probability of B given A.
36
• Score a network (fit) in light of the data: p(M|D), where D = data and M = network structure, measures how well a particular network explains the observed data.
Example:
K = “KaYee gives the lecture today”.
E = “today’s lecture contains equations”.
P(E, K) = probability that Ka Yee gives the lecture today and today’s lecture contains equations = 0.05.
P(K) = 1/10 = 0.1.
P(E|K) = P(E, K) / P(K) = 0.05/0.1 = 0.5.
Conditional Independence
• In Bayesian networks, each node is independent of its non-descendants, given its parents in the DAG.
• Using conditional independence between variables, the joint probability distribution of the models may be represented in a compact manner.
37
Example:
K = “KaYee gives the lecture today”.
E = “today’s lecture contains equations”.
C = “today’s slides are in Comic Sans font”.
If Ka Yee is giving the lecture today, then whether today’s lecture contains equations doesn’t affect whether today’s slides are in Comic Sans.
P(E|K,C) = P(E|K)
E and C are conditionally independent given K.
[Figure: DAG with edges K → E and K → C]
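The lecture example can be checked numerically. P(K) = 0.1 and P(E|K) = 0.5 come from the slides; every other probability below is a hypothetical value chosen only so that the joint distribution is fully specified. The network K → E, K → C is assumed.

```python
from itertools import product

p_K = {True: 0.1, False: 0.9}             # P(K) = 0.1 (from the slides)
p_E_given_K = {True: 0.5, False: 0.2}     # P(E|K) = 0.5 (slides); 0.2 is hypothetical
p_C_given_K = {True: 0.3, False: 0.05}    # both values hypothetical

def bern(p_true, value):
    """Probability of a binary outcome given P(outcome is True)."""
    return p_true if value else 1.0 - p_true

# Joint distribution built from the factorization P(K) P(E|K) P(C|K).
joint = {
    (k, e, c): p_K[k] * bern(p_E_given_K[k], e) * bern(p_C_given_K[k], c)
    for k, e, c in product([True, False], repeat=3)
}

def prob(**fixed):
    """Marginal probability of the given assignments, e.g. prob(K=True, E=True)."""
    index = {"K": 0, "E": 1, "C": 2}
    return sum(
        p for state, p in joint.items()
        if all(state[index[name]] == value for name, value in fixed.items())
    )

p_E_given_KC = prob(K=True, E=True, C=True) / prob(K=True, C=True)
p_E_given_K_only = prob(K=True, E=True) / prob(K=True)
print(p_E_given_KC, p_E_given_K_only)
```

With these numbers E and C are not independent once K is ignored: P(E, C) differs from P(E) P(C), because both depend on who is lecturing.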
Joint Probability Distribution
38
p(G1, G2, G3, G4, G5) = p(G1) p(G2|G1) p(G3|G1) p(G4|G2) p(G5|G1, G2, G3)
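A small sketch of why this factorization is compact, assuming binary (on/off) genes. The conditional probability rule p_on is hypothetical; only the parent sets are taken from the factorization above.

```python
from itertools import product

# Parent sets read off the factorization above.
PARENTS = {
    "G1": (),
    "G2": ("G1",),
    "G3": ("G1",),
    "G4": ("G2",),
    "G5": ("G1", "G2", "G3"),
}

def p_on(node, parent_values):
    """Hypothetical conditional probability P(node = 1 | parent values)."""
    if not parent_values:
        return 0.5
    return 0.1 + 0.8 * sum(parent_values) / len(parent_values)

def joint(state):
    """Joint probability of a full assignment as a product of local terms."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p1 = p_on(node, [state[p] for p in parents])
        prob *= p1 if state[node] == 1 else 1.0 - p1
    return prob

states = [dict(zip(PARENTS, values)) for values in product([0, 1], repeat=5)]
total = sum(joint(s) for s in states)

# Free parameters: one per parent configuration per node, versus the full table.
n_params_factorized = sum(2 ** len(ps) for ps in PARENTS.values())
n_params_full = 2 ** len(PARENTS) - 1
print(total, n_params_factorized, n_params_full)
```

For five binary genes the factorized form needs 15 parameters instead of 31, and the gap widens quickly as the network grows.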
Constructing a Bayesian network
• Variables (nodes in the graph)
• Add edges to the graph by computing conditional probabilities that characterize the distribution of states of each node given the state of its parents.
• The number of possible network structures grows super-exponentially with the number of nodes, so an exhaustive search of all possible structures to find the one best supported by the data is not feasible.
• Markov chain Monte Carlo (MCMC) algorithm:
  – Start with a random network.
  – Small random changes are then made to the network by flipping, adding, or deleting individual edges.
  – Accept changes that improve the fit of the network to the data; changes that worsen the fit are accepted only with some probability.
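The search loop can be sketched as follows. This is a simplified Metropolis-style illustration, not the procedure used in any particular package: it starts from the empty network rather than a random one, ignores proposal asymmetry, and uses a toy score that simply rewards a hypothetical target edge set in place of a real data likelihood.

```python
import math
import random

def is_acyclic(edges, n):
    """Kahn's algorithm: True if the directed graph on nodes 0..n-1 has no cycle."""
    indeg = [0] * n
    for _, v in edges:
        indeg[v] += 1
    queue = [v for v in range(n) if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == n

def propose(edges, n, rng):
    """Return a copy of the edge set with one edge added, deleted, or flipped."""
    u, v = rng.sample(range(n), 2)
    new = set(edges)
    if (u, v) in new:
        new.remove((u, v))
        if rng.random() < 0.5:
            new.add((v, u))        # flip instead of delete
    else:
        new.discard((v, u))        # if the reverse edge exists, this is a flip
        new.add((u, v))            # otherwise a plain addition
    return frozenset(new)

def structure_search(log_score, n, steps, seed=0):
    rng = random.Random(seed)
    current = frozenset()
    current_score = log_score(current)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, n, rng)
        if not is_acyclic(candidate, n):
            continue               # a Bayesian network must stay a DAG
        delta = log_score(candidate) - current_score
        # Always accept improvements; accept worse fits with probability exp(delta).
        if delta >= 0 or rng.random() < math.exp(delta):
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

# Toy stand-in for the fit of a network to data (hypothetical).
TARGET = frozenset({(0, 1), (0, 2), (1, 3)})

def toy_log_score(edges):
    return 2.0 * len(edges & TARGET) - 2.0 * len(edges - TARGET)

best, best_score = structure_search(toy_log_score, n=4, steps=2000)
print(sorted(best), best_score)
```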
40
Bayesian networks
• Advantages:
  – Compact and intuitive representation
  – Integration of prior knowledge
  – Probabilistic framework for data integration
• Limitation: no feedback loops. Dynamic Bayesian networks address this (variables are indexed by time and replicated in the network).
• References:
  – Using Bayesian Networks to Analyze Expression Data. Friedman et al. J. Computational Biology 7:601-620, 2000.
  – A Primer on Learning in Bayesian Networks for Computational Biology. Needham et al. PLOS Computational Biology 2007.
What kinds of data contain potential information about gene networks?
Large expression sets
• Co-expression (correlation of expression levels) implies connectivity
• But correlation ≠ causality
41
[Figure: possible relationships between correlated genes A and B, one involving a third gene C]
Adding causality
• Genetic perturbation: DNA variation at A influences RNA variation at B.
• Time series: A goes up prior to B.
• Prior knowledge
42
Adding genetics data
• Quantitative trait locus (QTL): a region of DNA that is associated with a particular trait (e.g., height)
• QTL mapping (linkage analysis): correlate the genotypic and phenotypic variation
43
[Figure: cross BY (lab) × RM (wild) yielding 95 segregants; our data: DNA genotype, and phenotype = RNA levels in response to drug perturbation at 6 time points]
Our experimental design:
• Time dependencies: ordering of regulatory events.
• Genotype data: correlate DNA variations in the segregants to measured expression levels.
Experimental design: Roger Bumgarner, Kenneth Dombek, Eric Schadt, Jun Zhu.
Genetics of global gene expression. Rockman & Kruglyak. 2006.
44
[Figure: workflow. Expression data, genome-wide binding data, literature and other data (e.g. protein-protein interaction, genetic interaction, genotype) feed supervised learning (integration of external data), which gives the probability that R regulates g for each gene (e.g. 0.95, 0.23, 0.78, …). The regulators constrained by the external data sources, together with the time series expression data, feed variable selection, which yields the gene regulatory network.]
Yeung et al. To appear in PNAS.
Integration of external data
45
[Figure: expression data, genome-wide binding data, literature and other data are used to compute variables (Xi) that capture evidence of regulation for (TF-gene) pairs; the result is a table with one row per TF-gene pair and columns Y and Xi]
Training data: positive (Y=1) vs. negative (Y=0) training examples.
Apply logistic regression to determine the weights of the Xi’s.
Output: probability that R regulates g for each gene (e.g. 0.95, 0.23, 0.78, …)
• Without prior knowledge, every gene is a potential regulator of every other gene. We want to restrict the search to the most likely regulators.
• For each gene g, we estimated a priori how likely it is that each regulator R regulates g, using the supervised framework and the external data sources.
46
[Figure: gene g with candidate regulators R1, R2 and R3]
Graphical representation of network as a set of nodes and edges.
Goal: To infer parent nodes (regulators) for each gene g using the time series expression data
Regression-based approach
Let X(g,t,s) = expression level of gene g at time t in segregant s
47
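$$X(g,t,s) = \sum_{R \text{ is a potential regulator}} \beta_{g,s}\, X(R, t-1, s) + \varepsilon$$
[Figure: expression of the potential regulators R at time t-1 predicts expression of gene g at time t]
Variable selection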
Use the expression level at time (t-1) to predict the expression levels at time t in the same segregant
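A sketch of the lagged regression on simulated data, assuming one coefficient per candidate regulator that is shared across segregants, and using ordinary least squares as a stand-in for the variable selection step. The 95 segregants and 6 time points follow the experimental design; the 3 regulators, their coefficients and the noise-free simulation are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regulators, n_times, n_segregants = 3, 6, 95

# Hypothetical regulator expression, indexed as (regulator, time, segregant).
reg = rng.normal(size=(n_regulators, n_times, n_segregants))
true_beta = np.array([1.5, 0.0, -0.7])    # hypothetical coefficients

# Target gene at times 1..5, generated from the regulators at times 0..4
# (noise-free so that the fit can be checked exactly).
lagged = reg[:, :-1, :]
target = np.einsum("r,rts->ts", true_beta, lagged)

# One row per (time, segregant) pair, one column per candidate regulator.
design = lagged.reshape(n_regulators, -1).T
response = target.reshape(-1)

beta_hat, *_ = np.linalg.lstsq(design, response, rcond=None)
print(beta_hat)
```

The regulator with a true coefficient of 0 gets an estimate near 0, which is the kind of candidate a variable selection method would drop.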
Yeung et al. To appear in PNAS.
Bayesian Model Averaging (BMA)
[Raftery 1995], [Hoeting et al. 1999]
• BMA takes model uncertainty into account by averaging over the posterior distributions of a quantity of interest based on multiple models, weighted by their posterior model probabilities.
• Output: Posterior probabilities for selected genes and selected models
48
$$\Pr(\Delta \mid D) = \sum_{k=1}^{K} \Pr(\Delta \mid D, M_k)\, \Pr(M_k \mid D)$$
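A toy illustration of the averaging, assuming the posterior model probabilities are already known. The three candidate models, their probabilities and their estimates are hypothetical; in practice they come from fitting each model to the data.

```python
# Each entry: (regulators in model M_k, Pr(M_k | D), estimate of the quantity
# of interest under M_k, here the coefficient of R1). All values hypothetical.
models = [
    ({"R1"}, 0.50, 0.80),
    ({"R1", "R2"}, 0.30, 0.60),
    ({"R2", "R3"}, 0.20, 0.00),
]

def posterior_inclusion(models, regulator):
    """Posterior probability that a regulator is in the model: the sum of
    Pr(M_k | D) over the models that contain it."""
    return sum(p for regs, p, _ in models if regulator in regs)

def bma_mean(models):
    """Model-averaged estimate: per-model estimates weighted by Pr(M_k | D)."""
    return sum(p * estimate for _, p, estimate in models)

for r in ("R1", "R2", "R3"):
    print(r, posterior_inclusion(models, r))
print(bma_mean(models))
```

The inclusion probabilities correspond to the "posterior probabilities for selected genes" in the output above: a regulator backed by several plausible models scores high even if no single model is clearly best.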
Assessment
• Recovery of known regulatory relationships:
  – We showed significant enrichment between our inferred network and the assessment criteria.
• Lab validation of selected sub-networks
• Comparison to other methods in the literature.
[Figure: child nodes of selected TFs in the inferred network compared with genes that respond to deletion of the TF (WT vs. TF deletion) with rapamycin perturbation]
50
Systematic name  Common name  # references in SGD  # child nodes in network A  Expression pattern over time  Known binding site from JASPAR?
YDR421W          ARO80        19                   51                          Increasing over time          yes
YML113W          DAT1         17                   57                          Decreasing over time          no
YBL103C          RTG3         83                   47                          Increasing over time          yes

Description from SGD:
• ARO80: Zinc finger transcriptional activator of the Zn2Cys6 family; activates transcription of aromatic amino acid catabolic genes in the presence of aromatic amino acids
• DAT1: DNA binding protein that recognizes oligo(dA).oligo(dT) tracts; Arg side chain in its N-terminal pentad Gly-Arg-Lys-Pro-Gly repeat is required for DNA-binding; not essential for viability
• RTG3: Basic helix-loop-helix-leucine zipper (bHLH/Zip) transcription factor that forms a complex with another bHLH/Zip protein, Rtg1p, to activate the retrograde (RTG) and TOR pathways
Comparing our networks to the deletion data
51
             Our inferred network   Validation experiment
Deleted TF   # child nodes          Genes that respond to the deletion   # overlap   Fisher's test p-value
ARO80        51                     10                                   4           9.3 × 10^-6
DAT1         57                     784                                  20          0.04
RTG3         47                     2288                                 39          0.03
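The overlap test can be sketched as a one-sided Fisher's exact test, i.e. a hypergeometric tail probability. The size of the gene universe is not stated on the slide, so N_GENES below is a hypothetical value; the resulting p-value depends on it and is not expected to reproduce the figures in the table.

```python
from math import comb

def overlap_p_value(n_universe, n_children, n_responders, n_overlap):
    """P(overlap >= n_overlap) when n_responders genes are drawn at random
    from a universe that contains n_children child nodes."""
    total = comb(n_universe, n_responders)
    upper = min(n_children, n_responders)
    tail = sum(
        comb(n_children, k) * comb(n_universe - n_children, n_responders - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

N_GENES = 6000   # hypothetical number of genes in the universe

# ARO80 row: 51 child nodes, 10 responding genes, 4 in common.
p_aro80 = overlap_p_value(N_GENES, 51, 10, 4)
print(p_aro80)
```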
52
Legend:
Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes.
Aro80p is a known regulator of ARO9 and ARO10 (Iraqui et al. Molecular and Cellular Biology 1999, 19:3360-3371).
53
Legend:
Green: Genes that respond to deletion of ARO80 under rapamycin in BY at 50 minutes.
Magenta: Target genes with known ARO80 binding site.
Amazingly, all 4 genes that respond to deletion (ARO9, ARO10, NAF1, ESBP6) contain the known ARO80 binding site upstream!
54
[Figure: workflow repeated from the earlier slide: external data sources → supervised learning → probability that R regulates g → variable selection with time series expression data → gene regulatory network]
Goal: incorporate prior probabilities in the variable selection step.
Revisiting our Road Map
• Definitions: graphical representation of networks
• Different types of molecular networks
• What can we do with networks?
• Network construction methods
  – Co-expression networks
  – Bayesian networks
  – Regression-based methods
  – Assessment
55
56
Thank you’s
Special thanks: Dr. Rachel Brem, Dr. Su-In Lee
Method development: Adrian Raftery, Kenneth Lo, John Mittler
Data + Biological interpretation: Roger Bumgarner, Kenneth Dombek, Eric Schadt, Jun Zhu
Grants: R01GM084163, R01GM084163-02S2