CS 6293 Advanced Topics: Translational Bioinformatics

Biological networks: Theory and applications

Lecture outline

• Basic terminology and concepts in networks

• Some interesting results between network properties and biological functions

• Network clustering / community discovery
• Applications of network clustering methods

Network

• A network refers to a graph
• A useful concept in analyzing the interactions of different components in a system

Biological networks

• An abstraction of the complex relationships among molecules in the cell
• Many types:
  – Protein-protein interaction networks
  – Protein-DNA (RNA) interaction networks
  – Genetic interaction networks
  – Metabolic networks
  – Signal transduction networks
  – (Real) neural networks
  – Many others
• In some networks, edges have a precise meaning. In others, the meaning of edges is obscure

Protein-protein interaction networks

• Yeast PPI network
• Nodes – proteins
• Edges – interactions

The color of a node indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown).

Obtaining biological networks

• Direct experimental methods
  – Protein-protein interaction networks
    • Yeast two-hybrid
    • Tandem affinity purification
    • Co-immunoprecipitation
  – Protein-DNA interaction
    • Chromatin immunoprecipitation (followed by microarray or sequencing: ChIP-chip, ChIP-seq)
  – High level of noise (false positives and false negatives)
• Computational prediction methods
  – Often cannot differentiate direct and indirect interactions

Why networks?

• Studying genes/proteins at the network level allows us to:
  – Assess the role of individual genes/proteins in the overall pathway
  – Evaluate redundancy of network components
  – Identify candidate genes involved in genetic diseases
  – Set up the framework for mathematical models

For complex systems, the actual output may not be predictable by looking at only individual components:

The whole is greater than the sum of its parts

Graphs

• A graph G = (V, E)
  – V = set of vertices
  – E = set of edges = subset of V × V
  – Thus |E| = O(|V|^2)

[Figure: example graph on four vertices]

Vertices: {1, 2, 3, 4}

Edges: {(1, 2), (2, 3), (1, 3), (4, 3)}

Graph Variations (1)

• Directed / undirected:
  – In an undirected graph:
    • Edge (u, v) ∈ E implies edge (v, u) ∈ E
    • Example: road networks between cities
  – In a directed graph:
    • Edge (u, v): u → v does not imply v → u
    • Example: street networks in downtown
  – Degree of vertex v:
    • The number of edges adjacent to v
    • For a directed graph, there are in-degree and out-degree

[Figure: the example graph drawn as directed and as undirected. Vertex 3 has degree = 3 in the undirected graph; in the directed graph its in-degree = 3 and out-degree = 0]

Graph Variations (2)

• Weighted / unweighted:
  – In a weighted graph, each edge or vertex has an associated weight (numerical value)
    • E.g., a road map: edges might be weighted with distance

[Figure: the example graph drawn unweighted and weighted, with edge weights 0.3, 0.4, 1.2, and 1.9]


• Connected / disconnected:
  – A connected graph has a path from every vertex to every other
  – A directed graph is strongly connected if there is a directed path between any two vertices

[Figure: a directed graph that is connected but not strongly connected]


• Dense / sparse:
  – Graphs are sparse when the number of edges is linear in the number of vertices
    • |E| ≈ O(|V|)
  – Graphs are dense when the number of edges is quadratic in the number of vertices
    • |E| ≈ O(|V|^2)
  – Most graphs of interest are sparse
  – If you know you are dealing with dense or sparse graphs, different data structures may make sense

Representing Graphs

• Assume V = {1, 2, …, n}
• An adjacency matrix represents the graph as an n × n matrix A:
  – A[i, j] = 1 if edge (i, j) ∈ E
            = 0 if edge (i, j) ∉ E
• For a weighted graph
  – A[i, j] = w_ij if edge (i, j) ∈ E
            = 0 if edge (i, j) ∉ E
• For an undirected graph
  – Matrix is symmetric: A[i, j] = A[j, i]

Graphs: Adjacency Matrix

• Example (directed graph with edges 1→2, 1→3, 2→3, 4→3):

A   1  2  3  4
1   0  1  1  0
2   0  0  1  0
3   0  0  0  0
4   0  0  1  0

How much storage does the adjacency matrix require?
A: O(V^2)
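The adjacency matrix of the example graph can be built in a few lines. This is a minimal sketch, not part of the original slides; the function name `adjacency_matrix` and the `directed` flag are illustrative choices.

```python
def adjacency_matrix(n, edges, directed=True):
    """Build an n x n adjacency matrix (list of lists) for vertices 1..n."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1          # vertices are numbered from 1 on the slides
        if not directed:
            A[v - 1][u - 1] = 1      # undirected: keep the matrix symmetric
    return A

# Edge set of the running example
edges = [(1, 2), (2, 3), (1, 3), (4, 3)]
A_directed = adjacency_matrix(4, edges)
A_undirected = adjacency_matrix(4, edges, directed=False)

for row in A_directed:
    print(row)
```

The matrix always holds n × n entries, whatever the number of edges, which is the O(V^2) storage cost noted above.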


• Example (undirected graph):

A   1  2  3  4
1   0  1  1  0
2   1  0  1  0
3   1  1  0  1
4   0  0  1  0


• Example (weighted graph; edge weights 1–2: 5, 1–3: 6, 2–3: 9, 3–4: 4):

A   1  2  3  4
1   0  5  6  0
2   5  0  9  0
3   6  9  0  4
4   0  0  4  0


• Time to answer if there is an edge between vertex u and v: Θ(1)

• Memory required: Θ(n^2) regardless of |E|
  – Usually too much storage for large graphs
  – But can be very efficient for small graphs
• Most large interesting graphs are sparse
  – E.g., road networks (due to limit on junctions)
  – For this reason the adjacency list is often a more appropriate representation

Graphs: Adjacency List

• Adjacency list: for each vertex v ∈ V, store a list of vertices adjacent to v
• Example:
  – Adj[1] = {2, 3}
  – Adj[2] = {3}
  – Adj[3] = {}
  – Adj[4] = {3}
• Variation: can also keep a list of edges coming into each vertex

Graph representations

• Adjacency list (directed example graph)

1: 2, 3
2: 3
3: (empty)
4: 3

How much storage does the adjacency list require?
A: O(V + E)
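The adjacency list of the same directed example, with in-degree and out-degree read off from it. This is a sketch; `adjacency_list` is an illustrative helper name, not from the slides.

```python
from collections import defaultdict

def adjacency_list(edges, directed=True):
    """Map each vertex to the list of vertices it points to."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)
    return adj

edges = [(1, 2), (2, 3), (1, 3), (4, 3)]
vertices = (1, 2, 3, 4)
adj = adjacency_list(edges)

# Out-degree is the length of a vertex's own list;
# in-degree counts how many other lists mention the vertex.
out_degree = {v: len(adj[v]) for v in vertices}
in_degree = {v: sum(v in nbrs for nbrs in adj.values()) for v in vertices}
```

Storage is one list entry per directed edge plus one list head per vertex, i.e. O(V + E).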

Graph representations

• Undirected graph

A   1  2  3  4
1   0  1  1  0
2   1  0  1  0
3   1  1  0  1
4   0  0  1  0

Adjacency list:
1: 2, 3
2: 1, 3
3: 1, 2, 4
4: 3

Graph representations

• Weighted graph

A   1  2  3  4
1   0  5  6  0
2   5  0  9  0
3   6  9  0  4
4   0  0  4  0

Adjacency list (neighbor, weight):
1: (2, 5), (3, 6)
2: (1, 5), (3, 9)
3: (1, 6), (2, 9), (4, 4)
4: (3, 4)

Tradeoffs between the two representations

                    Adj Matrix   Adj List
Test (u, v) ∈ E     Θ(1)         O(n)
Degree(u)           Θ(n)         O(n)
Memory              Θ(n^2)       Θ(n + m)
Edge insertion      Θ(1)         Θ(1)
Edge deletion       Θ(1)         O(n)
Graph traversal     Θ(n^2)       Θ(n + m)

|V| = n, |E| = m

Both representations are very useful and have different properties, although adjacency lists are probably better for most problems

Structural properties of networks

• Degree distribution
• Average shortest path length
• Clustering coefficient
• Community structure
• Degree correlation
• Motivation to study structural properties:
  – Structure determines function
  – Functional structural properties may be shared by different types of real networks (bio or non-bio)

Degree distribution P(k)

• The probability that a selected node has exactly (or approximately) k links.
  – P(k) is obtained by counting the number of nodes N(k) with k = 1, 2, … links and dividing by the total number of nodes N.
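As a small illustration of P(k) = N(k) / N, the sketch below counts degrees in the undirected version of the four-vertex example graph. The function name is an illustrative choice.

```python
from collections import Counter

def degree_distribution(adj):
    """Return {k: P(k)} with P(k) = N(k) / N for an undirected adjacency list."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())  # N(k)
    return {k: counts[k] / n for k in sorted(counts)}

# Undirected example graph: edges 1-2, 1-3, 2-3, 3-4
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
P = degree_distribution(adj)
print(P)
```

Here the degrees are 2, 2, 3, and 1, so half of the nodes have k = 2.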

Erdos-Renyi model

• Each pair of nodes has a probability p of forming an edge

• Most nodes have about the same # of connections

• Degree distribution is binomial or Poisson

Real networks: scale-free

• Heavy-tailed distribution
  – Power-law distribution
    • P(k) ~ k^(-r)

[Figure: degree histogram. x-axis: number of connections (0 to 60); y-axis: number of genes (0 to 100)]

Comparing Random and Scale-free distribution

• In the random network, the five nodes with the most links (in red) are connected to only 27% of all nodes (green). In the scale-free network, the five most connected nodes (red) are connected to 60% of all nodes (green) (source: Nature)

Robust yet fragile nature of networks

Shortest and mean path length

• Distance in networks is measured with the path length
• As there are many alternative paths between two nodes, the shortest path between the selected nodes has a special role.
• In directed networks,
  – The distance A → B is often different from the distance B → A
  – Often there is no direct path between two nodes.
• The average path length between all pairs of nodes offers a measure of a network’s overall navigability.
• Most pairs of vertices in a biological network seem to be connected by a short path – the small-world property
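For an unweighted network, shortest path lengths can be found by breadth-first search. The sketch below averages over all reachable ordered pairs; skipping unreachable pairs is an assumption made here, since the slides do not say how they are handled.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(adj):
    """Mean shortest path length over all reachable ordered pairs (s, t), s != t."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else float("inf")

# Undirected example graph: edges 1-2, 1-3, 2-3, 3-4
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
avg = average_path_length(adj)
```

Running one BFS per vertex costs O(n(n + m)) in total, which is practical for sparse networks.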

Clustering coefficient

• Your clustering coefficient: the probability that two of your friends are also friends
  – You have m friends
  – Among your m friends, there are n pairs of friends
    • The maximum is m(m − 1) / 2
    • C = 2n / (m^2 − m)
• Clustering coefficient of a network: the average clustering coefficient of all individuals

Clustering Coefficient

C_i = 2 E_i / (k_i (k_i − 1)) = 2/9 for the node in the pictured example

• The i-th node has k_i neighbors linking with it
• E_i is the actual number of links between the k_i neighbors
• The maximal number of links between k_i neighbors is k_i (k_i − 1) / 2

The probability that two of your friends are also friends

• Clustering coefficient of a network: average clustering coefficient of all nodes
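A direct translation of C_i = 2 E_i / (k_i (k_i − 1)) follows. Returning 0 for nodes with fewer than two neighbors is a convention assumed here, not something the slides specify.

```python
from itertools import combinations

def clustering_coefficient(adj, i):
    """C_i = 2 * E_i / (k_i * (k_i - 1)) for node i of an undirected graph."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: undefined case treated as 0
    # E_i: number of links among the neighbors of i
    links = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in adj[u])
    return 2 * links / (k * (k - 1))

def network_clustering(adj):
    """Average clustering coefficient over all nodes."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)

# Undirected example graph: edges 1-2, 1-3, 2-3, 3-4
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
C3 = clustering_coefficient(adj, 3)
C_net = network_clustering(adj)
```

Node 3 has three neighbors but only one link among them (1–2), so C_3 = 2·1 / (3·2) = 1/3.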

Degree correlation

• Do rich people tend to hang together with rich people (rich-club)?

• Or do they tend to interact with less wealthy people?

• Do high degree nodes tend to connect to low degree nodes or high degree ones?

Some interesting findings from biological networks

• Jeong et al., Lethality and centrality in protein networks. Nature 411, 41-42 (3 May 2001)

• Roger Guimerà and Luís A. Nunes Amaral, Functional cartography of complex metabolic networks. Nature 433, 895-900 (24 February 2005)

• Han et al., Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature 430, 88-93 (1 July 2004)

Connectivity vs essentiality


[Figure: % of essential proteins as a function of connectivity. Jeong et al., Nature 2001]

Community role vs essentiality

• Effect of a perturbation cannot depend on the node’s degree only!

• Many hub genes are not essential
• Some non-hub genes are essential
• Maybe a gene’s role in its community is also important
  – Local leader? Global leader? Ambassador?
  – Guimerà and Amaral, Nature 433, 2005

Community structure

• Role 1, 2, 3: non-hubs with increasing participation indices

• Role 5, 6: hubs with increasing participation indices

Dynamically organized modularity in the yeast PPI network

• Protein interaction networks are static
• Two proteins cannot interact if one is not expressed
• We should look at the gene expression level
• Han et al., Nature 430, 2004

Obtaining Data

Distinguish party hubs from date hubs

Red curve – hubs
Cyan curve – nonhubs
Black curve – randomized

• Partners of date hubs are significantly more diverse in spatial distribution than partners of party hubs

Effect of removal of nodes on average geodesic distance

Green – nonhub nodes
Brown – hubs
Red – date hubs
Blue – party hubs

The ‘breakdown point’ is the threshold after which the main component of the network starts disintegrating.

Original Network

On removal of date hubs

On removal of party hubs

Dynamically organized modularity

Red circles – date hubs
Blue squares – modules

Han-Yu Chuang, Eunjung Lee, Yu-Tseung Liu, Doheon Lee, Trey Ideker, Network-based classification of breast cancer metastasis, Mol Syst Biol. 2007; 3: 140.

Challenge: Predict Metastasis

• If metastasis is likely => aggressive adjuvant therapy
  – How to decide the likelihood?
• Traditional predictive factors are not good

Recently: Gene Marker Sets

• Examine genome-wide expression profiles
  – Score individual genes for how well they discriminate between different classes of disease
• Establish gene expression signature
  – Problem: # genes >> # patients

Pathway Expression vs. PPI Subnetwork as Marker

• Score known pathways for coherence of gene expression changes?
  – Majority of human genes not yet assigned to a definitive pathway
• Large protein-protein interaction networks recently became available
  – Extract subnetworks from PPI networks as markers

Subnetwork Marker Identification: Data Used

• 2 separate cohorts of breast cancer patients
  – van 't Veer et al. and Wang et al.
  – Roughly half had developed metastasis
• Used protein-protein interaction network obtained by assembling a pooled dataset
  – 57,235 interactions among 11,203 proteins

Goal: Find Significantly Discriminative Subnetworks

• Use a scoring system to search for subnetworks highly discriminative of metastasis

Discriminative Score Function S

Step 1: Assign activity scores to a subnetwork of genes

Step 2: Assign discriminative score S to the subnetwork

• Score(subnetwork) = mutual information between a subnetwork’s activity score vector and the phenotype vector over all patients
  – S(k) = MI(a, c)
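The score S = MI(a, c) can be sketched as follows. The slides do not say how the continuous activity vector a is turned into a discrete variable; equal-width binning with a `bins` parameter is an assumption made here for illustration, as are the function names.

```python
import math
from collections import Counter

def discretize(values, bins):
    """Equal-width binning (an assumed choice; the slides do not specify one)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # avoid division by zero for constant input
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(a, c, bins=4):
    """MI in bits between activity scores a (floats) and class labels c."""
    a_binned = discretize(a, bins)
    n = len(a)
    p_a, p_c = Counter(a_binned), Counter(c)
    p_ac = Counter(zip(a_binned, c))
    return sum(
        (n_xy / n) * math.log2(n_xy * n / (p_a[x] * p_c[y]))
        for (x, y), n_xy in p_ac.items()
    )

# Toy data: activity separates the two phenotype classes perfectly
activity = [0.1, 0.2, 0.9, 1.0]
phenotype = [0, 0, 1, 1]
score = mutual_information(activity, phenotype, bins=2)
```

A subnetwork whose activity carries no information about the phenotype scores 0; with two balanced classes, a perfectly discriminative one scores 1 bit.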

Find Candidate Subnetworks using S and Greedy Search

• Use a single PPI node as seed
  – At each iteration, add the neighbor resulting in the highest score improvement
  – Stop when no addition increases the score by rate r = 0.05, or distance from seed > 2
  – Report candidate subnetwork and repeat with next node as seed
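The greedy search can be sketched generically, with the score function passed in as a parameter. Two points are assumptions made for this sketch: "increases score by rate r" is read as a relative improvement (new score must exceed old score × (1 + r)), and ties are broken by sorted node order. The toy score below is hypothetical and stands in for the mutual information score.

```python
def greedy_subnetwork(adj, seed, score, rate=0.05, max_dist=2):
    """Grow a subnetwork from seed, adding the best-scoring neighbor each round."""
    # Level-by-level BFS: nodes within max_dist of the seed are eligible
    dist, frontier = {seed: 0}, [seed]
    while frontier:
        nxt = []
        for u in frontier:
            if dist[u] == max_dist:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt

    sub, best = {seed}, score({seed})
    while True:
        candidates = {v for u in sub for v in adj[u] if v in dist and v not in sub}
        if not candidates:
            break
        top = max(sorted(candidates), key=lambda v: score(sub | {v}))
        new = score(sub | {top})
        if new <= best * (1 + rate):   # assumed reading of the stopping rule
            break
        sub.add(top)
        best = new
    return sub, best

# Toy path graph A-B-C-D-E with a hypothetical additive score
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
weight = {"A": 1, "B": 2, "C": 3, "D": 10, "E": 10}
sub, best = greedy_subnetwork(adj, "A", lambda nodes: sum(weight[v] for v in nodes))
```

D and E would raise the score further, but they lie more than 2 steps from the seed, so the search stops at {A, B, C}.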

Identify Significant Subnets from 3 Null Distributions

• p1: 100 expression permutation trials, p < 0.05
  – Expression vectors of individual genes randomly permuted on the network
• p2: 100 random subnetworks seeded at protein i, p < 0.05
• p3: 20,000 phenotype permutation trials, p < 0.00005

Results: Correspondence to hallmarks of cancer

• For two datasets of 295 and 286 patients, 149 and 243 (resp.) discriminative subnets found
• 47% and 65% of subnets enriched for common biological process
• 66 and 153 subnets were enriched for processes involved in major events of cancer progression

Results: Reproducibility

• Subnetwork markers significantly more reproducible between datasets than individual gene markers

[Figure: marker reproducibility for Dataset 1 and Dataset 2]

Results: Reproducibility

• Shared network motifs with differences in differential expression
• Left-hand side is from Dataset 1 and right-hand side is from Dataset 2

Results: Subnetwork Markers as Classifiers

• Averaged expression values for each subnetwork were used as features for a classifier based on logistic regression
• For comparison, the top individual gene markers were instead used as features
• Markers from one dataset were used as predictors of metastasis on the other dataset
  – Dataset 1 markers tested on Dataset 2, and vice versa

Results: Subnetwork Markers as Classifiers

Results: Informative of Non-discriminative Disease Genes

• Network analyses can identify proteins not differentially expressed, but required to connect higher scoring proteins in a significant subnetwork
• 85.9% and 96.7% of the significant subnetworks contained at least one protein that was not significantly differentially expressed in metastasis

Results: Informative of Non-discriminative Disease Genes

• Several established prognostic markers were not present in individual gene expression markers, but played a central, interconnecting role in discriminative subnetworks
  – MYC, ERBB2

Community discovery: motivations

• Biological networks are modular
  – Metabolic pathways
  – Protein complexes
  – Transcriptional regulatory modules

• Provide a high-level overview of the networks

• Predict gene functions based on communities

Community discovery problem

• Divide a network into relatively densely connected sub-networks

[Figure: vertex reorder]

Challenges

• How many communities?
• Is there any community at all?

Community structures

• Also known as modules
• Relatively densely connected sub-network
• Quite common in real networks
  – Social networks
  – Internet
  – Biological networks
  – Transportation
  – Power grid


History

• Social science: clustering
  – Based on affinities / similarities
  – Need to give # of clusters
  – Can always find clusters
• Computer science: graph partitioning
  – Minimizing cut / cut ratio
  – Need to give # of partitions
  – Can always produce partitions
• Preferred approach: natural division
  – Automatically determine # of communities
  – Do not partition if no community

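Modularity function (Q)

• Measure strength of community structures
  – Newman, Phys Rev E, 2003

For two communities, with community-level edge matrix entries e_11, e_12, e_21, e_22:

a_1 = e_11 + e_21, a_2 = e_12 + e_22, M = a_1 + a_2

Q = \sum_{i=1}^{k} \left[ \frac{e_{ii}}{M} - \left( \frac{a_i}{M} \right)^2 \right]

• e_ii / M: observed fraction of edges falling in community i
• (a_i / M)^2: expected fraction of edges falling in community i
• k: number of communities
• -1 < Q < 1; Q = 0 if k = 1

[Figure: example partitions with Q = 0.45, Q = 0.56, Q = 0, Q = 0.40, and Q = 0.54]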

Goal: find the partition that has the highest Q value
But: optimizing Q is NP-hard (Brandes et al., 2006)
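Evaluating Q for a given partition is cheap; the hard part is searching over partitions. The sketch below assumes an undirected, unweighted graph and takes M as the sum of all a_i (the total degree), so that e_ii counts each within-community edge once from each end. The example graph (two triangles joined by one edge) is hypothetical.

```python
def modularity(adj, community):
    """Q = sum_i [ e_ii / M - (a_i / M)^2 ], with M = sum of all a_i."""
    M = sum(len(nbrs) for nbrs in adj.values())   # total degree
    e_in, a = {}, {}
    for u, nbrs in adj.items():
        cu = community[u]
        a[cu] = a.get(cu, 0) + len(nbrs)           # a_i: total degree of community
        e_in[cu] = e_in.get(cu, 0) + sum(1 for v in nbrs if community[v] == cu)
    return sum(e_in[c] / M - (a[c] / M) ** 2 for c in a)

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
two_communities = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_community = {v: "A" for v in adj}

Q_two = modularity(adj, two_communities)
Q_one = modularity(adj, one_community)
```

Splitting along the bridge gives Q = 5/14 ≈ 0.357, while putting everything in one community gives Q = 0, matching the k = 1 case on the slide.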

Heuristic algorithms

• k-way spectral partitioning approximately optimizes Q if k is known
  – White & Smyth, SDM 2005

• k is unknown: test all possible k’s

[Figure: adjacency matrix → eigenvectors (eig) → k-means clustering]

k-way spectral partitioning

[Figure: partitions for k = 2, k = 3, and k = 4; Q values shown: 0.56, 0.40, 0.54]

• Good accuracy
• ~O(n^3) time complexity; n: # of vertices

Recursive bi-partitioning

[Figure: recursive bi-partitioning tree with Q = 0.56, Q = 0.40, and Q = 0.54; one split is marked x]

• ~O(m log n) time complexity; m: # of edges
• Accuracy worse than k-way partitioning

Can we do better?

• Objectives
  – Efficiency of the recursive algorithm
  – Accuracy of the k-way algorithm (or even better)
• Ideas
  – Flexible l-way recursive partition (l = 2–5)
    • As efficient as recursive bi-partition
    • Accuracy similar to k-way algorithm
    • Ruan and Zhang, ICDM 2007
  – Take the results of the recursive algorithm as the starting point, do local improvement
    • Ruan and Zhang, Physical Review E 2008

Algorithm Qcut

1. Recursive partitioning until local maximum of Q

2. Refine solution by greedy search
   Consider two types of operations
   • Move a vertex to a different community
   • Merge two communities
   – Take the one with the largest improvement of Q
   – Repeat until no improvement of Q can be made
   – Go back to step 1 if necessary

• Key: quickly find out the operation that can give the largest improvement of Q

Identifying candidate moves

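• If vertex v moves from community i to j

  x_i – degree of v in community i
  x – degree of v
  a_i – total degree for vertices in community i

  ∆Q has two parts: the change in observed within-community edges, proportional to (x_j − x_i) / M, and the change in expected edges, proportional to x (a_i − a_j − x) / M^2

• Compute all potential ∆Q from initial state
• Update is almost constant for scale-free networks
• Additional heuristics to improve efficiency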
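The gain from moving one vertex can be computed from x_i, x_j, x, a_i, and a_j alone, without re-evaluating Q. The closed form below is derived here, not copied from the slides. It assumes Q = Σ [e_ii / M − (a_i / M)^2] with M equal to the total degree, an undirected graph without self-loops, and j different from v's current community; under those assumptions ∆Q = 2 (x_j − x_i) / M + 2 x (a_i − a_j − x) / M^2. The sketch checks it against a full recomputation.

```python
def modularity(adj, comm):
    """Q = sum_i [ e_ii / M - (a_i / M)^2 ], with M = total degree."""
    M = sum(len(n) for n in adj.values())
    e_in, a = {}, {}
    for u, nbrs in adj.items():
        a[comm[u]] = a.get(comm[u], 0) + len(nbrs)
        e_in[comm[u]] = e_in.get(comm[u], 0) + sum(1 for v in nbrs if comm[v] == comm[u])
    return sum(e_in[c] / M - (a[c] / M) ** 2 for c in a)

def delta_q(adj, comm, v, j):
    """Change in Q when v moves from its community i to a different community j."""
    i = comm[v]
    M = sum(len(n) for n in adj.values())
    x = len(adj[v])                                      # degree of v
    x_i = sum(1 for u in adj[v] if comm[u] == i)         # links from v into i
    x_j = sum(1 for u in adj[v] if comm[u] == j)         # links from v into j
    a_i = sum(len(adj[u]) for u in adj if comm[u] == i)  # total degree of i
    a_j = sum(len(adj[u]) for u in adj if comm[u] == j)  # total degree of j
    return 2 * (x_j - x_i) / M + 2 * x * (a_i - a_j - x) / M ** 2

# Two triangles joined by the edge 2-3; try moving vertex 2 into community "B"
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
before = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
after = dict(before)
after[2] = "B"

predicted = delta_q(adj, before, 2, "B")
actual = modularity(adj, after) - modularity(adj, before)
```

The move lowers Q, so a greedy refinement step would reject it. For clarity this sketch recomputes a_i and a_j on every call; a real implementation would keep those totals cached.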

Results on synthetic networks

• Relative Q = Q_found − Q_true

[Figure: relative Q versus N_out, and accuracy versus N_out]

• State of the art: Newman, PNAS 2006

An example

[Figure panels: real structure; vertex reordered; result of Qcut (accuracy: 99%); result of Newman (accuracy: 77%)]

Results on real-world networks

Modularity obtained by each method:

Network      #Vertices  #Edges   Newman  SA     Qcut
Social       67         142      0.573   0.608  0.587
Neuron       297        2359     0.396   0.408  0.398
Ecoli Reg    418        519      0.766   0.752  0.776
Circuit      512        819      0.804   0.670  0.815
Yeast Reg    688        1079     0.759   0.740  0.766
Ecoli PPI    1440       5871     0.367   0.387  0.387
Internet     3015       5156     0.611   0.624  0.632
Physicists   27519      116181   --      --     0.744

SA: Simulated annealing, Guimerà & Amaral, Nature 2005

Running time (seconds)

Network      #Vertices  #Edges   Newman  SA     Qcut
Social       67         142      0.0     5.4    2.0
Neuron       297        2359     0.4     139    1.9
Ecoli Reg    418        519      0.7     147    12.7
Circuit      512        819      1.8     143    6.1
Yeast Reg    688        1079     3.0     1350   13.4
Ecoli PPI    1440       5871     33.2    5868   41.5
Internet     3015       5156     253.7   11040  43.0
Physicists   27519      116181   --      --     2852

Graphical user interface for biologists

A real-world example

• A classic social network: Karate club
• Node – club member; edge – friendship
• Club was split due to a dispute
• Can we predict the split given the network?

Network of football teams

• Vertices: football teams in NCAA Division I-A

• Edges: games played in year 2000

• 110 teams
• 11 conferences (excluding independents)
• Most games are within conferences

Conference vs. Community

[Figure: conferences compared with communities discovered by Qcut / Newman; labeled conferences: Big 12, Big East, Mountain West, Pacific Ten]

Whose fault is it?

• Communities discovered by Qcut / Newman: Q = 0.6239
• Force the two conferences to be separated: Q = 0.6237

Resolution limit of the Q function

• C1 and C2 separable only if Q2 – Q1 > 0
• Q2 – Q1 ∝ a1a2/2M – e12
  – a1a2/2M: expected # of edges between C1 and C2
  – e12: actual # of edges between C1 and C2
• If C1 and C2 are small relative to the network
  – Expected # edges < 1
  – C1 and C2 non-separable even if connected by one edge
  – But the edge may be due to noise in data

[Figure: Q2 is the modularity when C1 and C2 are kept as separate communities within a large network; Q1 is the modularity when they are merged]
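A small numeric illustration of the resolution limit, using the criterion a1·a2 / 2M − e12 > 0 from the slide. The network sizes are made-up numbers chosen only to show the effect; a1 and a2 are taken as total degrees and M as the number of edges.

```python
def expected_edges_between(a1, a2, M):
    """Expected number of edges between C1 and C2: a1 * a2 / (2 * M)."""
    return a1 * a2 / (2 * M)

def separable(a1, a2, e12, M):
    """C1 and C2 stay separate under Q only if expected edges exceed actual edges."""
    return expected_edges_between(a1, a2, M) - e12 > 0

M = 10_000                      # hypothetical network with 10,000 edges

# Two small communities (total degree 100 each) joined by a single edge
small_expected = expected_edges_between(100, 100, M)
small_ok = separable(100, 100, e12=1, M=M)

# Two larger communities (total degree 200 each), also joined by a single edge
large_expected = expected_edges_between(200, 200, M)
large_ok = separable(200, 200, e12=1, M=M)
```

With total degree 100 each, the expected count is 0.5, so one connecting edge (possibly a false positive) is enough for Q to prefer merging the two communities.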

Resolution limit

• Optimizing Q
  – may miss small communities
  – is sensitive to false-positive edges
  – cannot reveal hierarchical structures
    • A community containing some sub-communities
• Real-world networks
  – contain both large and small communities
  – may have false-positive edges
    • Biological data are extremely noisy
  – have hierarchies

A solution: HQcut

• Ruan & Zhang, Physical Review E 2008
• Apply Qcut to get communities with largest Q
• Recursively search for sub-communities within each community
• When to stop?
  – Q value of sub-network is small, or
  – Q is not statistically significant
    • Estimated by Monte-Carlo method

[Figure: three candidate sub-networks of a large network, each compared with randomized versions of itself]

• Q = 0.49, randQ = 0.15 ± 0.016: Z-score = (0.49 − 0.15) / 0.016 = 21
• Q = 0.18, randQ = 0.15 ± 0.016: Z-score = (0.18 − 0.15) / 0.016 = 1.9
• Q = 0.49, randQ = 0.52 ± 0.031: Z-score = (0.49 − 0.52) / 0.031 = −1.3
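The significance test reduces to a z-score of the observed Q against the Q values of randomized networks. The sketch below takes those randomized values as given; how the networks are randomized is not shown on the slides and is left out. Using the sample standard deviation is an assumption.

```python
import statistics

def modularity_z_score(q_observed, q_randomized):
    """Z-score of the observed Q against Q values from randomized networks."""
    mean = statistics.mean(q_randomized)
    std = statistics.stdev(q_randomized)   # sample standard deviation (assumed)
    return (q_observed - mean) / std

# Hypothetical Monte-Carlo results: Q of three randomized copies of a sub-network
rand_q = [0.13, 0.15, 0.17]
z_strong = modularity_z_score(0.49, rand_q)
z_weak = modularity_z_score(0.18, rand_q)
```

A sub-network whose Q sits many standard deviations above the randomized mean is split further; one whose Q is within noise of the randomized values is left intact.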

Test on synthetic networks

• Network: 1000 vertices
• Community sizes vary from 15 to 100

[Figure: accuracy, and example communities discovered by Qcut and by HQcut]

Results for the NCAA teams

[Figure: communities by Qcut/Newman versus communities by HQcut; Mountain West and Pacific Ten labeled]

Applications to a PPI network

• Protein-protein interaction (PPI) network
  – Vertices: proteins
  – Edges: interactions detected by experiments
• Motivation:
  – Community = protein complex?
• Protein complex
  – Group of proteins associated via interactions
  – Elementary functional unit in the cell
  – Prediction from PPI network is important

Experiments

• Data set
  – A yeast protein-protein interaction network
    • Krogan et al., Nature 2006
  – 2708 proteins, 7123 interactions
• Algorithms:
  – Qcut, HQcut, Newman
• Evaluation
  – ~300 known protein complexes in MIPS
  – How well does a community match a known protein complex?

Results

                                      Newman    Qcut      HQcut
# of communities                      56        93        316
Max community size                    312       264       60
# of matched communities              53        52        216
Communities with matching score = 1   5 (9%)    7 (13%)   43 (20%)
Average matching score                0.56      0.55      0.70
# of novel predictions                3         41        100

Communities found by HQcut

Small ribosomal subunit (90%)

RNA poly II mediator (83%)

Proteasome core (90%)

Exosome (94%)

gamma-tubulin (77%)

respiratory chain complex IV (82%)

Example hierarchical community

Microarray data

• Data organized into a matrix
  – Rows are genes
  – Columns are samples representing different time points, conditions, tissues, etc.
• Analysis techniques
  – Differential expression analysis
  – Classification and clustering
  – Regulatory network construction
  – Enrichment analysis
• Characteristics of microarray data
  – High dimensionality and noise
  – Underlying topology unknown, often irregular shape

[Figure: expression heat map, genes (rows) by samples (columns). Red: high activity; green: low activity]

Microarray data clustering

• Many clustering algorithms available
  – K-means
  – Hierarchical
  – Self-organizing maps
  – Parameters hard to tune
  – Do not consider network topology

[Figure: clustered expression heat map, genes by samples. Red: high activity; green: low activity]

Analyze genes in each cluster

• Common functions?
• Common regulation?
• Predict functions for unknown genes?

Network-based data analysis

• Genes i and j connected if their expression patterns are “sufficiently similar”
  – Similarity > threshold
  – K nearest neighbors
• Long list of references
• Recently became popular
• Many interesting applications beyond clustering
• Focus here is clustering

[Figure: gene-by-sample expression matrix → construct co-expression network, with an edge between genes i and j]

Motivation

• Can we use the idea of community finding for clustering microarray data?

• Advantages:
  – Parameter free
  – Network topology considered
  – Constructed network may have other uses

Network-based microarray data analysis

• How to get the networks?
  – Threshold-based
  – Nearest neighbors
• Can we use a complete weight matrix?
  – Complete graph, with weighted edges
  – In general, no, since Q is ill-defined on weighted networks

[Figure: gene-by-sample expression matrix → construct co-expression network, with an edge between genes i and j]

How to determine the right cutoff?

Network-based microarray data analysis

• There is an implicit network structure

• Motivation: true network should be naturally modular
  – Can be measured by modularity (Q)
  – If constructed right, should have the highest Q

[Figure: gene-by-condition expression matrix → clustering]

Method overview

[Figure: microarray data → similarity matrix → network series (Net_1, most dense, …, Net_m, most sparse) → Qcut applied to each network]

Method overview (cont’d)

[Figure: modularity versus network density for the true network and for a random network; the difference between the two curves is ∆Q]

• Therefore, use ∆Q to determine the best network parameter and obtain the best community structure

• We actually run HQcut, a variant of Qcut, in order to avoid resolution limit (Ruan & Zhang, Phys Rev E 2008)

Network construction methods

• Value-based method
  – Remove edges with similarities < ε
  – Fixed ε for all vertices
  – May have problem detecting weakly correlated modules
• Asymmetric k-nearest neighbors (aKNN)
  – Connect each vertex to k other vertices
  – Fixed k for all vertices (k < 10 good enough)
  – Minimum degree = k, max = ?
  – Sensitive to outliers
• Mutual k-nearest neighbors (mKNN)
  – Association confirmed by both ends
  – Maximum degree = k, min = 0 (k larger than in aKNN)
  – Outliers can be detected
  – Ruan, ICDM 2009
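The difference between aKNN and mKNN shows up on a tiny similarity matrix. This is a sketch with illustrative function names and a made-up 4 × 4 matrix in which vertex 3 is an outlier; it is not the implementation from the cited paper.

```python
def knn_lists(sim, k):
    """For each vertex, the set of its k most similar other vertices."""
    n = len(sim)
    return [
        set(sorted((j for j in range(n) if j != i), key=lambda j: -sim[i][j])[:k])
        for i in range(n)
    ]

def aknn_edges(sim, k):
    """Asymmetric kNN: edge i-j if j is among i's k nearest (or vice versa)."""
    nn = knn_lists(sim, k)
    return {frozenset((i, j)) for i in range(len(sim)) for j in nn[i]}

def mknn_edges(sim, k):
    """Mutual kNN: edge i-j only if each is among the other's k nearest."""
    nn = knn_lists(sim, k)
    return {frozenset((i, j)) for i in range(len(sim)) for j in nn[i] if i in nn[j]}

# Hypothetical similarities: 0 and 1 are close, 2 is near 1, 3 is an outlier
sim = [
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.6, 0.1],
    [0.5, 0.6, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
aknn = aknn_edges(sim, k=1)
mknn = mknn_edges(sim, k=1)
```

With k = 1, aKNN still attaches the outlier (3–2), while mKNN keeps only the mutually confirmed pair 0–1 and leaves vertex 3 with degree 0, which is how outliers become detectable.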

Results: synthetic data set 1

• High dimensional data generated by synDeca.
  – 20 clusters of high dimensional points, plus some scatter points
  – Clusters are of various shapes: ellipse, rectangle, random

[Figure: left, the synthetic data matrix (1000 points by 100 dimensions); right, Q_real, Q_random, ∆Q = Q_real − Q_random, and clustering accuracy versus number of neighbors (0 to 300)]

Comparison

[Figure: clustering accuracy versus dimension (10 to 100) for k-means, mKNN-HQcut with the optimum k, and this work (mKNN-HQcut with automatically determined k)]

Results: synthetic data set 2

• Gene expression data
  – Thalamuthu et al., 2006
  – 600 data sets
  – ~600 genes, 50 conditions, 15 clusters
  – 0 or 1x outliers

[Figure: comparison with other methods, without outliers and with outliers; mKNN-HQcut with optimal k and mKNN-HQcut with auto k]

Results on yeast stress response data

• 3000 genes, 173 samples
• Best k = 140, resulting in 75 clusters

Results on yeast stress response data

• Enrichment of common functions
  – Cumulative hypergeometric test

[Figure: gene modules versus GO function terms]

• Protein biosynthesis (p < 10^-96)
• Nuclear transport (p < 10^-50)
• mt ribosome (p < 10^-63)
• DNA repair (p < 10^-66)
• RNA splicing (p < 10^-105)
• Nitrogen compound metabolism (p < 10^-37)
• Peroxisome (p < 10^-13)
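The cumulative hypergeometric p-value for enrichment can be computed exactly with integer binomials. The parameter names (N genes in total, K annotated with the term, n in the cluster, x annotated genes in the cluster) are choices made for this sketch.

```python
from math import comb

def hypergeom_tail(N, K, n, x):
    """P(X >= x): chance of at least x annotated genes in a cluster of n genes,
    drawn without replacement from N genes of which K carry the annotation."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, min(K, n) + 1))
    return hits / total

# Toy numbers: 10 genes, 5 annotated, a cluster of 2 genes that are both annotated
p_both = hypergeom_tail(N=10, K=5, n=2, x=2)

# Requiring zero or more annotated genes always has probability 1
p_any = hypergeom_tail(N=10, K=5, n=2, x=0)
```

Here p_both = C(5,2) / C(10,2) = 10/45. Summing exact integers before dividing avoids the rounding problems that arise when very small tail probabilities are accumulated in floating point.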

Comparison with k-means

[Figure: overall function coherence of K-means compared with mKNN-HQcut using automatically determined k = 140]

Application to Arabidopsis data

• ~22000 genes, 1138 samples
• 1150 singletons
• 800 (300) modules of size >= 10 (20)
• > 80% (90%) of modules have enriched functions
• Much more significant than all five existing studies on the same data set

[Figure: top 40 most significant modules]

Cis-regulatory network of Arabidopsis

[Figure: motif-module network]

Beyond gene clusters (1)

• Gene specific studies
  – Collaborator is interested in gibberellins
  – A hormone important for the growth and development of plants
  – Commercially important
  – Biosynthesis and signaling well studied
  – Transcriptional regulation of biosynthesis and signaling not yet clear
  – 3 important gene families, GA20ox, GA3ox and GA2ox, for biosynthesis
  – Receptor gene family: GID1A, B, C
  – Analyze the co-expression network around these genes

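[Figure: co-expression network around the gibberellin genes. Labels: GA3; receptors GID1A, GID1B, GID1C; 20ox family (20ox1 to 20ox5); 3ox family (3ox1 to 3ox4); 2ox family (2ox1, 2ox2, 2ox3, 2ox4, 2ox6, 2ox7, 2ox8)]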

Beyond gene clusters (2)

• Cancer classification

[Figure: gene-by-sample expression matrix → sample-by-sample network → Qcut. Alizadeh et al., Nature 2000. Sample: tumor/normal cells]

[Figure: network of cell samples (black: normal cells; blue: tumor cells), with groups labeled activated blood B, resting blood B, blood T, transformed cell lines, chronic lymphocytic leukemia (CLL), follicular lymphoma (FL), and diffuse large B-cell lymphoma (DLBCL)]

Survival rate after chemotherapy

[Figure: survival of subgroups DLBCL-1, DLBCL-2, and DLBCL-3. Survival rate: 73%; median survival time: 71.3 months]

Beyond gene clustering (3)

• Topology vs function

[Figure: % of essential proteins. Jeong et al., Nature 2001]

Community participation vs. essentiality

• Key: how to systematically search for such relationships?

[Figure: % essential versus community participation for non-hubs and hubs, split at participation < 0.2 and participation >= 0.2]
