CS 6293 Advanced Topics: Translational Bioinformatics
Biological networks: Theory and applications
Lecture outline
• Basic terminology and concepts in networks
• Some interesting results between network properties and biological functions
• Network clustering / community discovery
• Applications of network clustering methods
Network
• A network refers to a graph
• A useful concept in analyzing the interactions of different components in a system
Biological networks
• An abstraction of the complex relationships among molecules in the cell
• Many types
  – Protein-protein interaction networks
  – Protein-DNA (RNA) interaction networks
  – Genetic interaction networks
  – Metabolic networks
  – Signal transduction networks
  – (Real) neural networks
  – Many others
• In some networks, edges have a precise meaning; in others, the meaning of an edge is unclear
Protein-protein interaction networks
• Yeast PPI network
• Nodes – proteins
• Edges – interactions
The color of a node indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown).
Obtaining biological networks
• Direct experimental methods
  – Protein-protein interaction networks
    • Yeast-2-hybrid
    • Tandem affinity purification
    • Co-immunoprecipitation
  – Protein-DNA interaction
    • Chromatin immunoprecipitation (followed by microarray or sequencing: ChIP-chip, ChIP-seq)
  – High level of noise (false positives and false negatives)
• Computational prediction methods
  – Often cannot differentiate direct and indirect interactions
Why networks?
• Studying genes/proteins on the network level allows us to:
  – Assess the role of individual genes/proteins in the overall pathway
  – Evaluate redundancy of network components
  – Identify candidate genes involved in genetic diseases
  – Set up the framework for mathematical models
• For complex systems, the actual output may not be predictable by looking at only individual components: the whole is greater than the sum of its parts
Graphs
• A graph G = (V, E)
  – V = set of vertices
  – E = set of edges = subset of V × V
  – Thus |E| = O(|V|^2)
[Figure: example graph on four vertices.]
Vertices: {1, 2, 3, 4}
Edges: {(1, 2), (2, 3), (1, 3), (4, 3)}
Graph Variations (1)
• Directed / undirected:
  – In an undirected graph:
    • Edge (u,v) ∈ E implies edge (v,u) ∈ E
    • E.g., road networks between cities
  – In a directed graph:
    • Edge (u,v): u→v does not imply v→u
    • E.g., street networks in a downtown area
  – Degree of vertex v:
    • The number of edges adjacent to v
    • For a directed graph, there are in-degree and out-degree
[Figure: directed and undirected versions of the example graph. For vertex 3: degree = 3 (undirected); in-degree = 3, out-degree = 0 (directed).]
Graph Variations (2)
• Weighted / unweighted:
  – In a weighted graph, each edge or vertex has an associated weight (numerical value)
    • E.g., a road map: edges might be weighted with distance
[Figure: unweighted and weighted versions of the example graph; the weighted edges carry the values 0.3, 0.4, 1.2 and 1.9.]
Graph Variations (3)
• Connected / disconnected:
  – A connected graph has a path from every vertex to every other vertex
  – A directed graph is strongly connected if there is a directed path between any two vertices
[Figure: the directed example graph, which is connected but not strongly connected.]
Graph Variations (4)
• Dense / sparse:
  – Graphs are sparse when the number of edges is linear in the number of vertices
    • |E| = O(|V|)
  – Graphs are dense when the number of edges is quadratic in the number of vertices
    • |E| = Θ(|V|^2)
  – Most graphs of interest are sparse
  – If you know you are dealing with dense or sparse graphs, different data structures may make sense
Representing Graphs
• Assume V = {1, 2, …, n}
• An adjacency matrix represents the graph as an n × n matrix A:
  – A[i, j] = 1 if edge (i, j) ∈ E
  – A[i, j] = 0 if edge (i, j) ∉ E
• For a weighted graph
  – A[i, j] = w_ij if edge (i, j) ∈ E
  – A[i, j] = 0 if edge (i, j) ∉ E
• For an undirected graph
  – The matrix is symmetric: A[i, j] = A[j, i]
Graphs: Adjacency Matrix
• Example (directed graph with edges (1, 2), (2, 3), (1, 3), (4, 3)):
A 1 2 3 4
1 0 1 1 0
2 0 0 1 0
3 0 0 0 0
4 0 0 1 0
• How much storage does the adjacency matrix require?
  – A: O(V^2)
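The matrix above can be rebuilt from the edge list in a few lines. This is a minimal sketch in plain Python; the function name `adjacency_matrix` and the `directed` flag are illustrative choices, not part of the lecture.

```python
def adjacency_matrix(n, edges, directed=True):
    """Return an n x n 0/1 matrix for vertices numbered 1..n."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1          # edge u -> v
        if not directed:
            A[v - 1][u - 1] = 1      # undirected: mirror the entry
    return A

# Edge list of the four-vertex example graph from the slides.
EDGES = [(1, 2), (2, 3), (1, 3), (4, 3)]

A_directed = adjacency_matrix(4, EDGES)
A_undirected = adjacency_matrix(4, EDGES, directed=False)

for row in A_directed:
    print(row)
```

The matrix always holds n × n entries, however few edges the graph has, which is the O(V^2) storage cost noted on the slide.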
Graphs: Adjacency Matrix
• Example (undirected graph):
A 1 2 3 4
1 0 1 1 0
2 1 0 1 0
3 1 1 0 1
4 0 0 1 0
Graphs: Adjacency Matrix
• Example (weighted graph; edge weights w12 = 5, w13 = 6, w23 = 9, w34 = 4):
A 1 2 3 4
1 0 5 6 0
2 5 0 9 0
3 6 9 0 4
4 0 0 4 0
Graphs: Adjacency Matrix
• Time to answer if there is an edge between vertex u and v: Θ(1)
• Memory required: Θ(n^2) regardless of |E|
  – Usually too much storage for large graphs
  – But can be very efficient for small graphs
• Most large interesting graphs are sparse
  – E.g., road networks (due to limit on junctions)
  – For this reason the adjacency list is often a more appropriate representation
Graphs: Adjacency List
• Adjacency list: for each vertex v ∈ V, store a list of vertices adjacent to v
• Example:
  – Adj[1] = {2,3}
  – Adj[2] = {3}
  – Adj[3] = {}
  – Adj[4] = {3}
• Variation: can also keep a list of edges coming into vertex
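A minimal sketch of the adjacency-list representation for the same example graph, including the incoming-edge variation mentioned above. The helper names `adjacency_list` and `incoming_list` are hypothetical.

```python
def adjacency_list(vertices, edges, directed=True):
    """Map each vertex to the list of vertices it points to."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)
    return adj

def incoming_list(vertices, edges):
    """Variation: map each vertex to the vertices that point to it."""
    inc = {v: [] for v in vertices}
    for u, v in edges:
        inc[v].append(u)
    return inc

VERTICES = [1, 2, 3, 4]
EDGES = [(1, 2), (2, 3), (1, 3), (4, 3)]

adj = adjacency_list(VERTICES, EDGES)
adj_undirected = adjacency_list(VERTICES, EDGES, directed=False)
inc = incoming_list(VERTICES, EDGES)

# Total list length is |E| for the directed graph and 2|E| for the undirected one.
total_directed = sum(len(nbrs) for nbrs in adj.values())
total_undirected = sum(len(nbrs) for nbrs in adj_undirected.values())
```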
Graph representations
• Adjacency list (directed example)
  – 1: 2, 3
  – 2: 3
  – 3: (empty)
  – 4: 3
• How much storage does the adjacency list require?
  – A: O(V+E)
Graph representations
• Undirected graph
A 1 2 3 4
1 0 1 1 0
2 1 0 1 0
3 1 1 0 1
4 0 0 1 0
• Adjacency list
  – 1: 2, 3
  – 2: 1, 3
  – 3: 1, 2, 4
  – 4: 3
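Graph representations
• Weighted graph
A 1 2 3 4
1 0 5 6 0
2 5 0 9 0
3 6 9 0 4
4 0 0 4 0
• Adjacency list, entries written as (neighbor, weight)
  – 1: (2, 5), (3, 6)
  – 2: (1, 5), (3, 9)
  – 3: (1, 6), (2, 9), (4, 4)
  – 4: (3, 4)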
Graphs: Adjacency List
• How much storage is required?
• For directed graphs
  – |adj[v]| = out-degree(v)
  – Total # of items in adjacency lists is Σ out-degree(v) = |E|
• For undirected graphs
  – |adj[v]| = degree(v)
  – # of items in adjacency lists is Σ degree(v) = 2|E|
• So: adjacency lists take Θ(V+E) storage
• Time needed to test if edge (u, v) ∈ E is O(n)
Tradeoffs between the two representations
Operation        Adj Matrix  Adj List
test (u, v) ∈ E  Θ(1)        O(n)
Degree(u)        Θ(n)        O(n)
Memory           Θ(n^2)      Θ(n+m)
Edge insertion   Θ(1)        Θ(1)
Edge deletion    Θ(1)        O(n)
Graph traversal  Θ(n^2)      Θ(n+m)
|V| = n, |E| = m
Both representations are very useful and have different properties, although adjacency lists are probably better for most problems
Structural properties of networks
• Degree distribution
• Average shortest path length
• Clustering coefficient
• Community structure
• Degree correlation
• Motivation to study structural properties:
  – Structure determines function
  – Functional structural properties may be shared by different types of real networks (bio or non-bio)
Degree distribution P(k)
• The probability that a selected node has exactly (or approximately) k links
  – P(k) is obtained by counting the number of nodes N(k) with k = 1, 2, … links and dividing by the total number of nodes N
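As a small illustration of P(k) = N(k) / N, the sketch below counts degrees in the undirected four-vertex example graph used earlier. The function name `degree_distribution` is an assumption made for this example.

```python
from collections import Counter

def degree_distribution(adj):
    """adj maps each vertex to its neighbours (undirected graph).
    Returns {k: P(k)} with P(k) = N(k) / N."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

# Undirected version of the example graph.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
P = degree_distribution(adj)
print(P)
```

Here one vertex has degree 1, two have degree 2 and one has degree 3, so the probabilities sum to 1 over k = 1, 2, 3.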
Erdos-Renyi model
• Each pair of nodes forms an edge with probability p
• Most nodes have about the same # of connections
• Degree distribution is binomial or Poisson
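The model can be simulated directly from its definition. The sketch below is one plain-Python way to do it; the sizes n = 500 and p = 0.02 and the fixed seed are arbitrary choices for illustration.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=0):
    """Sample a random graph: each pair of nodes is linked with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

n, p = 500, 0.02
g = erdos_renyi(n, p)
degrees = [len(nbrs) for nbrs in g.values()]
mean_degree = sum(degrees) / n
expected_degree = p * (n - 1)   # mean of the binomial degree distribution
print(mean_degree, expected_degree)
```

The sample mean degree should land close to p(n − 1), and very few nodes should sit far from it, which is the "most nodes have about the same # of connections" behavior described above.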
Real networks: scale-free
• Heavy-tailed distribution
  – Power-law distribution
    • P(k) ∝ k^(-r)
[Figure: histogram of number of genes (y-axis) vs. number of connections (x-axis).]
Comparing Random and Scale-free distribution
• In the random network, the five nodes with the most links (in red) are connected to only 27% of all nodes (green). In the scale-free network, the five most connected nodes (red) are connected to 60% of all nodes (green) (source: Nature)
Robust yet fragile nature of networks
Shortest and mean path length
• Distance in networks is measured by path length
• As there are many alternative paths between two nodes, the shortest path between the selected nodes has a special role
• In directed networks,
  – the path A→B is often different from the path B→A
  – often there is no directed path between two nodes
• The average path length between all pairs of nodes offers a measure of a network's overall navigability
• Most pairs of vertices in a biological network seem to be connected by a short path: the small-world property
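For unweighted networks, shortest path lengths can be found by breadth-first search. The sketch below averages over reachable ordered pairs only, which is one possible convention for handling pairs with no path; that convention and the function names are assumptions of this example.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(adj):
    """Mean shortest path length over ordered pairs (s, t) where t is reachable from s."""
    total = pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")

directed = {1: [2, 3], 2: [3], 3: [], 4: [3]}
undirected = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}

avg_undirected = average_path_length(undirected)
from_2 = bfs_distances(directed, 2)   # vertex 1 is unreachable from 2
```

In the directed example, vertex 2 is reachable from vertex 1 but not the other way round, matching the remark that A→B and B→A differ.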
Clustering coefficient
• Your clustering coefficient: the probability that two of your friends are also friends
  – You have m friends
  – Among your m friends, there are n pairs of friends
    • The maximum is m * (m-1) / 2
    • C = 2 n / (m^2-m)
• Clustering coefficient of a network: the average clustering coefficient of all individuals
Clustering Coefficient
• C_i = \frac{2 E_i}{k_i (k_i - 1)} (= 2/9 for the node in the figure)
  – The ith node has k_i neighbors linked to it
  – E_i is the actual number of links between the k_i neighbors
  – The maximal number of links between the k_i neighbors is k_i(k_i-1)/2
• The probability that two of your friends are also friends
• Clustering coefficient of a network: average clustering coefficient of all nodes
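The definition translates almost line by line into code. In this sketch, nodes with fewer than two neighbors are given a coefficient of 0; the slides do not say how to treat them, so that is an assumption, as are the function names.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """C_v = 2 E_v / (k_v (k_v - 1)); set to 0.0 when k_v < 2 (assumed convention)."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

def average_clustering(adj):
    """Network clustering coefficient: mean over all nodes."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Undirected example graph: triangle 1-2-3 plus a pendant vertex 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

c3 = clustering_coefficient(adj, 3)   # one linked pair out of three possible
c_avg = average_clustering(adj)
```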
Degree correlation
• Do rich people tend to hang together with rich people (rich-club)?
• Or do they tend to interact with less wealthy people?
• Do high degree nodes tend to connect to low degree nodes or high degree ones?
Some interesting findings from biological networks
• Jeong et al., Lethality and centrality in protein networks. Nature 411, 41-42 (3 May 2001)
• Roger Guimerà and Luís A. Nunes Amaral, Functional cartography of complex metabolic networks. Nature 433, 895-900 (24 February 2005)
• Han et al., Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature 430, 88-93 (1 July 2004)
Connectivity vs essentiality
[Figure: % of essential proteins vs. number of connections. Jeong et al., Nature 2001.]
Community role vs essentiality
• Effect of a perturbation cannot depend on the node's degree only!
• Many hub genes are not essential
• Some non-hub genes are essential
• Maybe a gene's role in its community is also important
  – Local leader? Global leader? Ambassador?
  – Guimerà and Amaral, Nature 433, 2005
Community structure
• Role 1, 2, 3: non-hubs with increasing participation indices
• Role 5, 6: hubs with increasing participation indices
Dynamically organized modularity in the yeast PPI network
• Protein interaction networks are static
• Two proteins cannot interact if one is not expressed
• We should look at the gene expression level
• Han et al., Nature 430, 2004
Obtaining Data
Distinguish party hubs from date hubs
• Red curve – hubs
• Cyan curve – nonhubs
• Black curve – randomized
• Partners of date hubs are significantly more diverse in spatial distribution than partners of party hubs
Effect of removal of nodes on average geodesic distance
• Green – nonhub nodes
• Brown – hubs
• Red – date hubs
• Blue – party hubs
• The ‘breakdown point’ is the threshold after which the main component of the network starts disintegrating
[Figure: the original network, the network on removal of date hubs, and the network on removal of party hubs.]
Dynamically organized modularity
• Red circles – Date hubs
• Blue squares – Modules
Han-Yu Chuang, Eunjung Lee, Yu-Tseung Liu, Doheon Lee, Trey Ideker. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007; 3: 140.
Challenge: Predict Metastasis
• If metastasis is likely => aggressive adjuvant therapy
  – How to decide the likelihood?
• Traditional predictive factors are not good
Recently: Gene Marker Sets
• Examine genome-wide expression profiles
  – Score individual genes for how well they discriminate between different classes of disease
• Establish gene expression signature
  – Problem: # genes >> # patients
Pathway Expression vs. PPI Subnetwork as Marker
• Score known pathways for coherence of gene expression changes?
  – Majority of human genes not yet assigned to a definitive pathway
• Large protein-protein interaction networks recently became available
  – Extract subnetworks from PPI networks as markers
Subnetwork Marker Identification: Data Used
• 2 separate cohorts of breast cancer patients
  – van 't Veer et al. and Wang et al.
  – Roughly half had developed metastasis
• Used a protein-protein interaction network obtained by assembling a pooled dataset
  – 57,235 interactions among 11,203 proteins
Goal: Find Significantly Discriminative Subnetworks
• Use a scoring system to search for subnetworks highly discriminative of metastasis
Discriminative Score Function S
Step 1: Assign activity scores to a subnetwork of genes
Step 2: Assign discriminative score S to the subnetwork
• Score(subnetwork) = mutual information between a subnetwork's activity score vector a and the phenotype vector c over all patients
  – S(k) = MI(a, c)
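The two steps can be sketched as follows. This is an illustration under stated assumptions, not the published scoring procedure: the activity score is taken to be the plain mean of member-gene expression per patient, activities are discretized into equal-width bins, and the toy expression values and gene names are made up.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI in bits between two equally long discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def activity_scores(expr, genes):
    """Step 1 (assumed form): per-patient mean expression of the subnetwork's genes."""
    n_patients = len(next(iter(expr.values())))
    return [sum(expr[g][p] for g in genes) / len(genes) for p in range(n_patients)]

def discretize(values, bins=2):
    """Equal-width binning (assumed); MI needs discrete activity levels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def subnetwork_score(expr, genes, phenotype, bins=2):
    """Step 2: S = MI(a, c) between discretized activity a and phenotype c."""
    return mutual_information(discretize(activity_scores(expr, genes), bins), phenotype)

# Hypothetical toy data: 3 genes, 4 patients, phenotype 1 = metastasis.
expr = {"g1": [2.0, 1.8, 0.1, 0.2],
        "g2": [1.5, 1.9, 0.3, 0.0],
        "g3": [1.0, 0.0, 1.0, 0.0]}
phenotype = [1, 1, 0, 0]

s_informative = subnetwork_score(expr, ["g1", "g2"], phenotype)
s_uninformative = subnetwork_score(expr, ["g3"], phenotype)
```

In the toy data, the pair g1, g2 separates the two phenotype classes perfectly and scores 1 bit, while g3 carries no information about the phenotype and scores 0.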
Find Candidate Subnetworks using S and Greedy Search
• Use a single PPI node as seed
  – At each iteration, add the neighbor resulting in the highest score improvement
  – Stop when no addition increases the score by rate r = 0.05, or distance from seed > 2
  – Report candidate subnetwork and repeat with next node as seed
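A sketch of the greedy expansion for one seed, with the scoring function passed in as a parameter. The reading of "rate r" as a relative improvement of at least r over the current score is an assumption, and the toy path network and additive score are hypothetical stand-ins for a PPI network and the mutual-information score.

```python
def greedy_subnetwork(seed, adj, score, r=0.05, max_dist=2):
    """Grow a subnetwork from seed, adding the best-scoring neighbour each round."""
    # Vertices within max_dist of the seed (breadth-first layers).
    dist = {seed: 0}
    frontier = [seed]
    for d in range(1, max_dist + 1):
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = d
                    nxt.append(v)
        frontier = nxt

    members = {seed}
    current = score(members)
    while True:
        candidates = sorted({v for u in members for v in adj[u]
                             if v in dist and v not in members})
        scored = [(score(members | {v}), v) for v in candidates]
        if not scored:
            break
        best_score, best = max(scored, key=lambda t: t[0])
        # Assumed stopping rule: improvement must be positive and at least r * current.
        if best_score <= current or best_score - current < r * abs(current):
            break
        members.add(best)
        current = best_score
    return members, current

# Toy path network a-b-c-d-e with an additive per-gene score.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
w_small = {"a": 1, "b": 1, "c": 0.01, "d": 5, "e": 5}
w_large = {"a": 1, "b": 1, "c": 1, "d": 5, "e": 5}

stop_by_rate = greedy_subnetwork("a", adj, lambda s: sum(w_small[g] for g in s))
stop_by_distance = greedy_subnetwork("a", adj, lambda s: sum(w_large[g] for g in s))
```

The first run stops because adding c improves the score by less than 5%; the second stops because d lies more than two steps from the seed even though it would raise the score.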
Identify Significant Subnets from 3 Null Distributions
• p1: 100 expression permutation trials, p < 0.05
  – Expression vectors of individual genes randomly permuted on the network
• p2: 100 random subnetworks seeded at protein i, p < 0.05
• p3: 20,000 phenotype permutation trials, p < 0.00005
Results: Correspondence to hallmarks of cancer
• For two datasets of 295 and 286 patients, 149 and 243 (resp.) discriminative subnets found
• 47% and 65% of subnets enriched for a common biological process
• 66 and 153 subnets were enriched for processes involved in major events of cancer progression
Results: Reproducibility
• Subnetwork markers significantly more reproducible between datasets than individual gene markers
[Figure: reproducibility results for Dataset 1 and Dataset 2.]
• Shared network motifs with differences in differential expression
  – Left-hand side is from Dataset 1 and right-hand side is from Dataset 2
Results: Subnetwork Markers as Classifiers
• Averaged expression values for each subnetwork were used as features for a classifier based on logistic regression
• For comparison, the top individual gene markers were instead used as features
• Markers from one dataset were used as predictors of metastasis on the other dataset
  – Dataset 1 markers tested on Dataset 2, and vice versa
Results: Informative of Non-discriminative Disease Genes
• Network analyses can identify proteins not differentially expressed, but required to connect higher scoring proteins in a significant subnetwork
• 85.9% and 96.7% of the significant subnetworks contained at least one protein that was not significantly differentially expressed in metastasis
• Several established prognostic markers were not present in individual gene expression markers, but played a central, interconnecting role in discriminative subnetworks
  – MYC, ERBB2
Community discovery: motivations
• Biological networks are modular
  – Metabolic pathways
  – Protein complexes
  – Transcriptional regulatory modules
• Provide a high-level overview of the networks
• Predict gene functions based on communities
Community discovery problem
• Divide a network into relatively densely connected sub-networks
[Figure: adjacency matrix before and after vertex reordering.]
Challenges
• How many communities?
• Is there any community at all?
Community structures
• Also known as modules
• Relatively densely connected sub-network
• Quite common in real networks
  – Social networks
  – Internet
  – Biological networks
  – Transportation
  – Power grid
History
• Social science: clustering
  – Based on affinities / similarities
  – Need to give # of clusters
  – Can always find clusters
• Computer science: graph partitioning
  – Minimizing cut / cut ratio
  – Need to give # of partitions
  – Can always produce partitions
• Preferred approach: natural division
  – Automatically determine # of communities
  – Do not partition if no community
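Modularity function (Q)
• Measures the strength of community structures
  – Newman, Phys Rev E, 2003
• For two communities with entries e_11, e_12, e_21, e_22:
  – a_1 = e_11 + e_12, a_2 = e_21 + e_22, M = a_1 + a_2
• Q = \sum_{i=1}^{k} \left[ \frac{e_{ii}}{M} - \left( \frac{a_i}{M} \right)^2 \right]
  – k: number of communities
  – e_ii / M: observed fraction of edges falling in community i
  – (a_i / M)^2: expected fraction of edges falling in community i
• -1 < Q < 1; Q = 0 if k = 1
[Figure: example partitions of a network with Q = 0, Q = 0.40, Q = 0.45, Q = 0.54 and Q = 0.56.]
• Goal: find the partition that has the highest Q value
• But: optimizing Q is NP-hard (Brandes et al., 2006)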
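A small sketch of how Q can be computed for a given partition. It assumes one common reading of the e_ij entries: e_ii counts the edges inside community i, and an edge between two different communities i and j contributes one half to e_ij and one half to e_ji, so that M = a_1 + ... + a_k equals the number of edges. The function name and the toy graph are illustrative.

```python
def modularity(edges, community):
    """Q = sum_i [ e_ii / M - (a_i / M)^2 ] for an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict vertex -> community label.
    """
    labels = sorted(set(community.values()))
    e = {(i, j): 0.0 for i in labels for j in labels}
    for u, v in edges:
        ci, cj = community[u], community[v]
        if ci == cj:
            e[ci, ci] += 1.0          # edge inside community ci
        else:
            e[ci, cj] += 0.5          # cross edge split between (ci, cj) and (cj, ci)
            e[cj, ci] += 0.5
    a = {i: sum(e[i, j] for j in labels) for i in labels}
    M = sum(a.values())               # equals the number of edges
    return sum(e[i, i] / M - (a[i] / M) ** 2 for i in labels)

# Two triangles joined by a single bridge edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_communities = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_community = {v: "A" for v in range(1, 7)}

q_two = modularity(edges, two_communities)
q_one = modularity(edges, one_community)
```

Splitting the graph into its two triangles gives Q = 5/14, while putting everything in one community gives Q = 0, in line with the k = 1 case on the slide.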
Heuristic algorithms
• k-way spectral partitioning approximately optimizes Q if k is known
  – White & Smyth, SDM 2005
• k is unknown: test all possible k's
[Figure: matrix of eigenvectors ("eig") clustered with k-means ("kmeans").]
k-way spectral partitioning
[Figure: partitions for k = 2, k = 3 and k = 4; the panels are labeled Q = 0.56, Q = 0.40 and Q = 0.54 (panel order not recoverable).]
• Good accuracy
• ~O(n^3) time complexity; n: # of vertices
Recursive bi-partitioning
[Figure: recursive bisection example; panels labeled Q = 0.56, Q = 0.40 and Q = 0.54, with one split marked x.]
• ~O(m log n) time complexity; m: # of edges
• Accuracy worse than k-way partitioning
Can we do better?
• Objectives
  – Efficiency of the recursive algorithm
  – Accuracy of the k-way algorithm (or even better)
• Ideas
  – Flexible l-way recursive partition (l = 2-5)
    • As efficient as recursive bi-partition
    • Accuracy similar to k-way algorithm
    • Ruan and Zhang, ICDM 2007
  – Take the results of the recursive algorithm as the starting point, then do local improvement
    • Ruan and Zhang, Physical Review E 2008
Algorithm Qcut
1. Recursive partitioning until local maximum of Q
2. Refine solution by greedy search
  – Consider two types of operations
    • Move a vertex to a different community
    • Merge two communities
  – Take the one with the largest improvement of Q
  – Repeat until no improvement of Q can be made
  – Go back to step 1 if necessary
• Key: quickly find the operation that gives the largest improvement of Q
Identifying candidate moves
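• If vertex v moves from community i to j:
  – \Delta Q = \frac{x_j - x_i}{M} + \frac{(a_i - a_j - x)\, x}{2 M^2}
  – x_i: degree of v in community i
  – x: degree of v
  – a_i: total degree for vertices in community i
  – M: number of edges (the convention under which this expression follows from the definition of Q)
• Compute all potential ΔQ from the initial state
• Update is almost constant time for scale-free networks
• Additional heuristics to improve efficiency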
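The gain from moving one vertex can be checked numerically against a full recomputation of Q. The sketch assumes M is the number of edges and a_i is the total degree of community i, so that Q = Σ_i [ L_i / M − (a_i / 2M)^2 ] with L_i the number of edges inside community i; under that convention the move gain is ΔQ = (x_j − x_i)/M + (a_i − a_j − x)·x / (2M²). Function names and the toy graph are illustrative, and self-loops are assumed absent.

```python
def modularity(edges, comm):
    """Q = sum_i [ L_i / M - (a_i / (2M))^2 ], M = number of edges."""
    M = len(edges)
    inside, degree = {}, {}
    for u, v in edges:
        degree[comm[u]] = degree.get(comm[u], 0) + 1
        degree[comm[v]] = degree.get(comm[v], 0) + 1
        if comm[u] == comm[v]:
            inside[comm[u]] = inside.get(comm[u], 0) + 1
    return sum(inside.get(c, 0) / M - (a / (2 * M)) ** 2 for c, a in degree.items())

def delta_q(edges, comm, v, j):
    """Change in Q if vertex v moves from its community i to community j."""
    M = len(edges)
    i = comm[v]
    nbrs = [b if a == v else a for a, b in edges if v in (a, b)]
    x = len(nbrs)                                   # degree of v
    x_i = sum(1 for n in nbrs if comm[n] == i)      # links from v into community i
    x_j = sum(1 for n in nbrs if comm[n] == j)      # links from v into community j
    a_i = sum(1 for a, b in edges for w in (a, b) if comm[w] == i)
    a_j = sum(1 for a, b in edges for w in (a, b) if comm[w] == j)
    return (x_j - x_i) / M + (a_i - a_j - x) * x / (2 * M ** 2)

# Two triangles joined by the bridge (3, 4); try moving vertex 3 from A to B.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
before = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
after = dict(before)
after[3] = "B"

predicted = delta_q(edges, before, 3, "B")
recomputed = modularity(edges, after) - modularity(edges, before)
```

The incremental value needs only the neighbors of v and two community totals, which is why a candidate move can be scored without recomputing Q over the whole network.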
Results on synthetic networks
• Relative Q = Qfound − Qtrue
[Figure: relative Q vs. N_out, and accuracy vs. N_out.]
• State of the art: Newman, PNAS 2006
An example
[Figure: real structure; vertex-reordered matrix; result of Qcut (accuracy: 99%); result of Newman (accuracy: 77%).]
Results on real-world networks
Modularity
Network     #Vertices  #Edges  Newman  SA     Qcut
Social      67         142     0.573   0.608  0.587
Neuron      297        2359    0.396   0.408  0.398
Ecoli Reg   418        519     0.766   0.752  0.776
Circuit     512        819     0.804   0.670  0.815
Yeast Reg   688        1079    0.759   0.740  0.766
Ecoli PPI   1440       5871    0.367   0.387  0.387
Internet    3015       5156    0.611   0.624  0.632
Physicists  27519      116181  --      --     0.744
SA: Simulated annealing, Guimerà & Amaral, Nature 2005
Running time (seconds)
Network     #Vertices  #Edges  Newman  SA     Qcut
Social      67         142     0.0     5.4    2.0
Neuron      297        2359    0.4     139    1.9
Ecoli Reg   418        519     0.7     147    12.7
Circuit     512        819     1.8     143    6.1
Yeast Reg   688        1079    3.0     1350   13.4
Ecoli PPI   1440       5871    33.2    5868   41.5
Internet    3015       5156    253.7   11040  43.0
Physicists  27519      116181  --      --     2852
Graphical user interface for biologists
A real-world example
• A classic social network: Karate club
• Node – club member; edge – friendship
• Club was split due to a dispute
• Can we predict the split given the network?
Network of football teams
• Vertices: football teams in NCAA Division I-A
• Edges: games played in year 2000
• 110 teams
• 11 conferences (excluding independents)
• Most games are within conferences
[Figure: network of teams; labeled conferences include Big 12 and Big East.]
Conference vs. Community
[Figure: conferences vs. communities discovered by Qcut / Newman; Mountain West and Pacific Ten are labeled.]
Whose fault is it?
• Communities discovered by Qcut / Newman: Q = 0.6239
• Force the two conferences to be separated: Q = 0.6237
Resolution limit of the Q function
• C1 and C2 are separable only if Q2 – Q1 > 0
• Q2 – Q1 ∝ a1a2/2M – e12
  – a1a2/2M: expected # of edges between C1 and C2
  – e12: actual # of edges between C1 and C2
• If C1 and C2 are small relative to the network
  – Expected # of edges < 1
  – C1 and C2 are non-separable even if connected by only one edge
  – But the edge may be due to noise in the data
[Figure: two small communities c1 and c2 attached to a large network, drawn under the two partitions being compared (Q1 and Q2).]
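The expression for Q2 – Q1 can be derived in a few lines. The derivation below is a sketch under these assumptions: M is the number of edges, a_i is the total degree of community C_i, e_ii is the number of edges inside C_i, e_12 is the number of edges between C1 and C2, Q2 keeps C1 and C2 apart, and Q1 merges them. All other communities contribute equally to both partitions and cancel.

```latex
\begin{align*}
Q_2 &= \frac{e_{11}}{M} - \left(\frac{a_1}{2M}\right)^2
     + \frac{e_{22}}{M} - \left(\frac{a_2}{2M}\right)^2 + \text{(rest)} \\
Q_1 &= \frac{e_{11} + e_{22} + e_{12}}{M}
     - \left(\frac{a_1 + a_2}{2M}\right)^2 + \text{(rest)} \\
Q_2 - Q_1 &= -\frac{e_{12}}{M} + \frac{2 a_1 a_2}{4M^2}
           = \frac{1}{M}\left(\frac{a_1 a_2}{2M} - e_{12}\right)
\end{align*}
```

As a numeric illustration with made-up sizes: for a_1 = a_2 = 20 and M = 1000, the expected number of edges is a_1 a_2 / 2M = 0.2, so a single edge between C1 and C2 (e_12 = 1) already makes Q2 – Q1 negative and the two communities are merged.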
Resolution limit
• Optimizing Q
  – may miss small communities
  – is sensitive to false-positive edges
  – cannot reveal hierarchical structures
    • A community containing some sub-communities
• Real-world networks
  – contain both large and small communities
  – may have false positive edges
    • Biological data are extremely noisy
  – have hierarchies
A solution: HQcut
• Ruan & Zhang, Physical Review E 2008
• Apply Qcut to get communities with largest Q
• Recursively search for sub-communities within each community
• When to stop?
  – Q value of sub-network is small, or
  – Q is not statistically significant
    • Estimated by Monte-Carlo method
[Figure: three sub-networks of a large network, each compared with randomized versions of itself.]
  – Q = 0.49, randQ = 0.15 ± 0.016, Z-score = (0.49 - 0.15) / 0.016 = 21
  – Q = 0.18, randQ = 0.15 ± 0.016, Z-score = (0.18 - 0.15) / 0.016 = 1.9
  – Q = 0.49, randQ = 0.52 ± 0.031, Z-score = (0.49 - 0.52) / 0.031 = -1.3
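The Z-score in the figure compares the observed Q with the Q values obtained on randomized versions of the same sub-network. A minimal sketch follows; the randomization procedure itself is not shown, the list of randomized Q values is made up so that its mean and spread match the 0.15 ± 0.016 on the slide, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def modularity_z_score(q_observed, q_randomized):
    """Z = (Q - mean(randQ)) / std(randQ), estimated from Monte-Carlo samples."""
    mu = mean(q_randomized)
    sigma = pstdev(q_randomized)   # population std; the slides do not say which is used
    return (q_observed - mu) / sigma

# Hypothetical randomized Q values with mean 0.15 and standard deviation 0.016.
rand_q = [0.134, 0.166]

z_strong = modularity_z_score(0.49, rand_q)   # about 21: clear community structure
z_weak = modularity_z_score(0.18, rand_q)     # about 1.9: weak evidence
```

A sub-network whose Q is many standard deviations above its randomized counterparts is worth splitting further; one whose Q is within the random range is left intact.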
Test on synthetic networks
• Network: 1000 vertices
• Community sizes vary from 15 to 100
[Figure: accuracy; example communities discovered by Qcut and by HQcut.]
Results for the NCAA teams
[Figure: communities by Qcut/Newman vs. communities by HQcut; Mountain West and Pacific Ten are labeled.]
Applications to a PPI network
• Protein-protein interaction (PPI) network
  – Vertices: proteins
  – Edges: interactions detected by experiments
• Motivation:
  – Community = protein complex?
• Protein complex
  – Group of proteins associated via interactions
  – Elementary functional unit in the cell
  – Prediction from PPI network is important
Experiments
• Data set
  – A yeast protein-protein interaction network
    • Krogan et al., Nature, 2006
  – 2708 proteins, 7123 interactions
• Algorithms:
  – Qcut, HQcut, Newman
• Evaluation
  – ~300 known protein complexes in MIPS
  – How well does a community match a known protein complex?
Results
Measure                              Newman  Qcut     HQcut
# of communities                     56      93       316
Max community size                   312     264      60
# of matched communities             53      52       216
Communities with matching score = 1  5 (9%)  7 (13%)  43 (20%)
Average matching score               0.56    0.55     0.70
# of novel predictions               3       41       100
Communities found by HQcut
• Small ribosomal subunit (90%)
• RNA poly II mediator (83%)
• Proteasome core (90%)
• Exosome (94%)
• gamma-tubulin (77%)
• Respiratory chain complex IV (82%)
Example hierarchical community
Microarray data
• Data organized into a matrix
  – Rows are genes
  – Columns are samples representing different time points, conditions, tissues, etc.
• Analysis techniques
  – Differential expression analysis
  – Classification and clustering
  – Regulatory network construction
  – Enrichment analysis
• Characteristics of microarray data
  – High dimensionality and noise
  – Underlying topology unknown, often irregular shape
[Figure: gene × sample heat map. Red: high activity; green: low activity.]
Microarray data clustering
• Many clustering algorithms available
  – K-means
  – Hierarchical
  – Self organizing maps
  – Parameters hard to tune
  – Do not consider network topology
[Figure: gene × sample heat map. Red: high activity; green: low activity.]
• Analyze genes in each cluster
  – Common functions?
  – Common regulation?
  – Predict functions for unknown genes?
Network-based data analysis
• Genes i and j connected if their expression patterns are “sufficiently similar”
  – Similarity > threshold
    • Long list of references
  – K nearest neighbors
    • Recently became popular
• Many interesting applications beyond clustering
• Focus here is clustering
[Figure: gene × sample matrix used to construct a co-expression network linking genes i and j.]
Motivation
• Can we use the idea of community finding for clustering microarray data?
• Advantages: – Parameter free– Network topology considered– Constructed network may have other uses
Network-based microarray data analysis
• How to get the networks?
  – Threshold-based
  – Nearest neighbors
• Can we use a complete weight matrix?
  – Complete graph, with weighted edges
  – In general, no, since Q is ill-defined on weighted networks
• How to determine the right cutoff?
[Figure: gene × sample matrix used to construct a co-expression network linking genes i and j.]
Network-based microarray data analysis
• There is an implicit network structure
• Motivation: true network should be naturally modular– Can be measured by modularity (Q)– If constructed right, should have the highest Q
[Figure: gene × condition matrix and its clustering.]
Method overview
[Figure: pipeline. Microarray data → similarity matrix → network series (Net_1, most dense, …, Net_m, most sparse) → Qcut applied to each network.]
Method overview (cont’d)
[Figure: modularity vs. network density for the true network and for a random network; the difference between the two curves is ∆Q.]
• Therefore, use ∆Q to determine the best network parameter and obtain the best community structure
• We actually run HQcut, a variant of Qcut, in order to avoid the resolution limit (Ruan & Zhang, Phys Rev E 2008)
Network construction methods
• Value-based method
  – Remove edges with similarities < ε
  – Fixed ε for all vertices
  – May have problems detecting weakly correlated modules
• Asymmetric k-nearest neighbors (aKNN)
  – Connect each vertex to k other vertices
  – Fixed k for all vertices (k < 10 good enough)
  – Minimum degree = k, max = ?
  – Sensitive to outliers
• Mutual k-nearest-neighbors (mKNN)
  – Association confirmed by both ends
  – Maximum degree = k, min = 0 (k larger than in aKNN)
  – Outliers can be detected
  – Ruan, ICDM 2009
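The difference between the two nearest-neighbor constructions shows up even on a tiny example. The sketch below uses a hypothetical one-dimensional data set with one outlier and negative distance as the similarity; function names and data are illustrative, and ties are broken by vertex index.

```python
def knn_lists(sim, k):
    """For each vertex, the set of its k most similar other vertices."""
    n = len(sim)
    top = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (-sim[i][j], j))
        top.append(set(order[:k]))
    return top

def aknn_edges(sim, k):
    """Asymmetric kNN: keep edge (i, j) if j is among i's k nearest, or vice versa."""
    top = knn_lists(sim, k)
    return {tuple(sorted((i, j))) for i in range(len(sim)) for j in top[i]}

def mknn_edges(sim, k):
    """Mutual kNN: keep edge (i, j) only if each is among the other's k nearest."""
    top = knn_lists(sim, k)
    return {(i, j) for i in range(len(sim)) for j in top[i]
            if i < j and i in top[j]}

# Three nearby points and one outlier (index 3).
positions = [0.0, 1.0, 2.0, 10.0]
sim = [[-abs(p - q) for q in positions] for p in positions]

aknn = aknn_edges(sim, k=2)
mknn = mknn_edges(sim, k=2)
```

With aKNN the outlier is attached to the cluster because it must pick k neighbors; with mKNN it ends up with degree 0 because no other point names it among its nearest neighbors, which is how outliers can be detected.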
Results: synthetic data set 1
• High dimensional data generated by synDeca
  – 20 clusters of high dimensional points, plus some scatter points
  – Clusters are of various shapes: ellipse, rectangle, random
[Figure: the synthetic data matrix, and a plot against number of neighbors (0 to 300) of Q_real, Q_random, ∆Q = Q_real − Q_random, and clustering accuracy.]
Comparison
[Figure: clustering accuracy vs. dimension (10 to 100); legend entries as extracted: "This work", "kmeans", "optimal knn", "HQcut". Panels: mKNN-HQcut with the optimum k, and mKNN-HQcut with automatically determined k.]
Results: synthetic data set 2
• Gene expression data
  – Thalamuthu et al, 2006
  – 600 data sets
  – ~600 genes, 50 conditions, 15 clusters
  – 0 or 1x outliers
[Figure: results without outliers and with outliers, for mKNN-HQcut with optimal k and mKNN-HQcut with automatically determined k.]
Comparison with other methods
Results on yeast stress response data
• 3000 genes, 173 samples
• Best k = 140, resulting in 75 clusters
Results on yeast stress response data
• Enrichment of common functions
  – Accumulative hyper-geometric test
[Figure: genes vs. GO function terms. Labeled clusters:]
  – Protein biosynthesis (p < 10^-96)
  – Nuclear transport (p < 10^-50)
  – mt ribosome (p < 10^-63)
  – DNA repair (p < 10^-66)
  – RNA splicing (p < 10^-105)
  – Nitrogen compound metabolism (p < 10^-37)
  – Peroxisome (p < 10^-13)
Comparison with k-means
[Figure: overall function coherence for K-means vs. mKNN-HQcut using automatically determined k = 140.]
Application to Arabidopsis data
• ~22000 genes, 1138 samples
• 1150 singletons
• 800 (300) modules of size >= 10 (20)
• > 80% (90%) of modules have enriched functions
• Much more significant than all five existing studies on the same data set
Top 40 most significant modules
Cis-regulatory network of Arabidopsis
[Figure: network linking motifs and modules.]
Beyond gene clusters (1)
• Gene specific studies
  – Collaborator is interested in gibberellins
  – A hormone important for the growth and development of plants
  – Commercially important
  – Biosynthesis and signaling well studied
  – Transcriptional regulation of biosynthesis and signaling not yet clear
  – 3 important gene families for biosynthesis: GA20ox, GA3ox and GA2ox
  – Receptor gene family: GID1A, B, C
  – Analyze the co-expression network around these genes
[Figure: co-expression network around GID1A, GID1B, GID1C; 20ox1 to 20ox5; 3ox1 to 3ox4; 2ox1 to 2ox4 and 2ox6 to 2ox8; groups labeled GA3, 20ox, 3ox and 2ox.]
Beyond gene clusters (2)
• Cancer classification
  – Alizadeh et al., Nature, 2000
  – Sample: tumor/normal cells
[Figure: gene × sample matrix converted to a sample × sample network and clustered with Qcut.]
[Figure: network of cell samples (black: normal cells; blue: tumor cells). Labeled groups: Activated Blood B, Chronic lymphocytic leukemia (CLL), Follicular lymphoma (FL), Blood T, Transformed cell lines, Diffuse large B-cell lymphoma (DLBCL, three groups), Resting Blood B.]
Survival rate after chemotherapy
[Figure: three DLBCL groups (DLBCL-1, DLBCL-2, DLBCL-3) with the following outcomes.]
• Survival rate: 73%; median survival time: 71.3 months
• Survival rate: 40%; median survival time: 22.3 months
• Survival rate: 20%; median survival time: 12.5 months
Beyond gene clustering (3)
• Topology vs function
[Figure: % of essential proteins vs. number of connections. Jeong et al., Nature 2001.]
Community participation vs. essentiality
• Key: how to systematically search for such relationships?
[Figure: % essential vs. community participation, and % essential vs. number of connections, for non-hubs and hubs and for participation < 0.2 and participation >= 0.2.]