1.on finding clusters in undirected simple graphs: application to protein complex detection 2.dpclus...


Upload: clarissa-fowler

Post on 01-Jan-2016


Today’s lecture will cover the following topics

1. On finding clusters in undirected simple graphs: application to protein complex detection
2. DPClus software tool
3. Introduction to DPClusO
4. Concept of BiClustering
5. Concept of DNA sequencing


Outline

•Introduction

•Some basic concepts

•The proposed algorithm

•The DPClus software

•Results & Discussion

•Conclusions

On finding clusters in undirected simple graphs: application to protein complex detection


Introduction

•There is no universal definition of a cluster.

•But clustering is an important issue.

•Consequently there are diverse definitions and various methods.

•The major purpose of clustering is finding cohesive groups.

•Here, we are going to discuss a graph clustering algorithm.


Regarding a graph, a cluster is a subgraph whose nodes are densely connected with each other compared to their connections with other nodes in the graph.

This is a flexible definition of a cluster.

Intuitively, we can recognize two clusters in this arbitrary graph.

Introduction

But it is difficult to draw a big graph revealing its clusters.

An E. coli protein-protein interaction network consisting of 3007 proteins and 11531 interactions (from Mori Lab, NAIST, Japan)

An algorithm is needed to detect the locally dense regions of such a network.

Introduction


Md. Altaf-Ul-Amin, Yoko Shinbo, Kenji Mihara, Ken Kurokawa and Shigehiko Kanaya, “Development and implementation of an algorithm for detection of protein complexes in large interaction networks”, BMC Bioinformatics 7:207, April 2006.

Introduction

Some basic concepts

Two nodes that belong to the same cluster are likely to have more common neighbors than two nodes that do not.


•The density d of a cluster is the ratio of the number of edges present in it and the maximum possible number of edges in it.

•It is easy to see that $d = |E| / |E|_{max} = 2|E| / (|N|(|N|-1))$, where |E| is the number of edges and |N| the number of nodes of the cluster.

•d is a real number ranging from 0 to 1.
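The density formula above can be sketched in a few lines of Python. This is only an illustration: the function name `density` and the edge-list representation are assumptions made here, not part of DPClus, and the function refuses clusters with fewer than two nodes because the formula is undefined for them.

```python
def density(nodes, edges):
    """Density d = 2|E| / (|N| (|N| - 1)) of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        raise ValueError("density is undefined for fewer than two nodes")
    # Keep each undirected edge once; drop self-loops and edges leaving `nodes`.
    inside = {frozenset(e) for e in edges if len(set(e)) == 2 and set(e) <= nodes}
    return 2 * len(inside) / (n * (n - 1))


# A clique on 4 nodes has all 6 possible edges, so d = 1.
clique = [(a, b) for a in range(4) for b in range(a + 1, 4)]
print(density(range(4), clique))      # 1.0
# Keeping only 3 of the 6 edges halves the density.
print(density(range(4), clique[:3]))  # 0.5
```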

Some basic concepts


Density of the total graph = 0.241

d=0.9

d=1.0

The densities of the complexes are relatively high compared with the density of the total graph.

Some basic concepts


Considering density alone is not enough

Such situations can be tackled by keeping track of the periphery

Some basic concepts

•Both the graphs consist of 8 nodes and both are of density 0.5

•But one of them seems to be a single cluster while the other is divided into two clusters

[Figure: two example graphs, each drawn on the eight nodes a to h.]


Some basic concepts

The cluster property of any node n with respect to any cluster k of density dk and size Nk is defined as follows:

$cp_{nk} = |E_{nk}| / (d_k \times |N_k|)$

Here, |Enk| is the total number of edges between the node n and each of the nodes of cluster k.
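As a rough illustration of the definition above, the cluster property can be computed from an adjacency structure. The function name `cluster_property` and the dictionary-of-sets graph format are assumptions chosen for this sketch, and the small graph is made up: a 4-node clique plus a node x attached to three of its members.

```python
def cluster_property(node, cluster, adj):
    """cp_nk = |E_nk| / (d_k * |N_k|) of `node` with respect to `cluster`.

    `adj` maps every node to the set of its neighbors (undirected simple graph).
    The cluster must contain at least two nodes and at least one edge.
    """
    cluster = set(cluster) - {node}
    n_k = len(cluster)
    # Every internal edge is counted from both endpoints, hence the division by 2.
    e_k = sum(len(adj[u] & cluster) for u in cluster) // 2
    d_k = 2 * e_k / (n_k * (n_k - 1))
    e_nk = len(adj[node] & cluster)  # edges between the node and the cluster
    return e_nk / (d_k * n_k)


# Hypothetical graph: clique {a, b, c, d}; x is linked to a, b and c.
adj = {
    "a": {"b", "c", "d", "x"},
    "b": {"a", "c", "d", "x"},
    "c": {"a", "b", "d", "x"},
    "d": {"a", "b", "c"},
    "x": {"a", "b", "c"},
}
print(cluster_property("x", {"a", "b", "c", "d"}, adj))  # 3 / (1.0 * 4) = 0.75
```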

[Figure: the two example graphs on the nodes a to h, each annotated with the cluster property of node f.]

Cluster property of node f = 0.57

Cluster property of node f = 0.2


The proposed algorithm is a sequential constructive algorithm:

It initializes the complex/cluster by choosing a seed node.

It then repeatedly adds other nodes on the basis of priority and some conditions.

The major methods of the algorithm

•Choosing a seed node.

•Selecting a priority node.

•Checking necessary conditions before adding a node to a complex.

The proposed Algorithm


Inputs to the algorithm are:

•The associated matrix of the network.

•A minimum threshold density for the generated clusters.

•A parameter to determine how we separate a complex from its periphery.

Outputs of the algorithm are:

Overlapping/non-overlapping complexes whose densities are greater than or equal to the given threshold density.

The proposed Algorithm

The proposed Algorithm

Flowchart of the proposed Algorithm

Input & Initialization
•Input an undirected simple graph G.
•Set thresholds din and cpin and initialize cluster ID k = 1.

Termination check
•Generate degrees of the nodes of G. Determine the highest node degree (Dh).
•Dh = 0? Yes: End. No: go to seed selection.

Seed selection
•Generate weight of each node of G.
•Highest node weight = 0? Yes: start at the highest degree node of G as the kth cluster. No: start at the highest weight node of G as the kth cluster.

Cluster formation
•Generate the neighbors of the kth cluster in G and sort them according to priority. Add the highest priority neighbor (p) to the cluster.
•dk > din? No: deduct the last added node from the kth cluster and go to output & update. Yes: check cpp(k-p).
•cpp(k-p) > cpin? Yes: keep p and generate the neighbors of the kth cluster again. No: check whether all neighbors are checked.
•All neighbors of kth cluster are checked? No: add the next priority neighbor (p) to the kth cluster and repeat the checks. Yes: deduct the last added node from the kth cluster and go to output & update.

Output & update
•Print kth cluster. G ← G − kth cluster. k ← k+1. Return to the termination check.

M =

0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 1 0 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 1 1
0 0 0 0 0 0 0 0 1 0 1 0 1 1
0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 0 0 1 0

Muv = 1 if there is an edge between nodes u and v and 0 otherwise.

The proposed Algorithm

M² =

1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 4 2 2 3 2 1 1 0 0 0 0 0 0
1 2 4 3 2 3 1 1 0 0 0 0 0 0
1 2 3 5 2 3 1 0 1 0 0 0 0 0
0 3 2 2 3 2 1 1 0 0 0 0 0 0
1 2 3 3 2 5 0 1 0 0 1 0 0 0
0 1 1 1 1 0 2 0 0 1 0 0 0 0
0 1 1 0 1 1 0 2 0 1 0 0 1 1
0 0 0 1 0 0 0 0 4 2 1 1 2 2
0 0 0 0 0 0 1 1 2 4 0 1 2 2
0 0 0 0 0 1 0 0 1 0 2 0 1 1
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 1 2 2 1 0 4 2
0 0 0 0 0 0 0 1 2 2 1 1 2 3

(M²)uv for u ≠ v represents the number of common neighbors of the nodes u and v.
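The relation between M² and common neighbors can be checked with a short sketch. The edge list below is read off the matrix M of the example, with the nodes numbered 0 to 13 in row order; that numbering is an assumption of this sketch (it agrees with the labels P0, P1, ... used in the worked example).

```python
# Edges of the 14-node example graph, read from the nonzero entries of M.
EDGES = [(0, 1), (1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4),
         (3, 5), (3, 7), (4, 5), (5, 6), (6, 10), (7, 8), (8, 9), (8, 12),
         (8, 13), (9, 10), (9, 12), (9, 13), (11, 12), (12, 13)]
N = 14

# Symmetric 0/1 matrix of the undirected simple graph.
M = [[0] * N for _ in range(N)]
for u, v in EDGES:
    M[u][v] = M[v][u] = 1

# Plain matrix product M x M.
M2 = [[sum(M[u][k] * M[k][v] for k in range(N)) for v in range(N)]
      for u in range(N)]

print(M2[2][3])  # 3: nodes 2 and 3 share the neighbors 1, 4 and 5
print(M2[3][3])  # 5: a diagonal entry is the degree of the node
```

The diagonal of M² holds the node degrees, which is why only the off-diagonal entries are read as common-neighbor counts.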

The proposed Algorithm


[Figure: the example graph with every edge labeled by its weight.]

The proposed Algorithm

The weights of edges are derived by squaring the associated matrix of the graph

[Figure: the example graph with every edge labeled by its weight and every node labeled by its weight.]

The proposed Algorithm

The weights of nodes (sum of the weights of the connecting edges)
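A small sketch of both weight definitions, applied to the 14-node example graph. The edge list is read from the matrix M with nodes numbered 0 to 13 in row order, which is an assumption of this sketch; the variable names are illustrative only.

```python
EDGES = [(0, 1), (1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4),
         (3, 5), (3, 7), (4, 5), (5, 6), (6, 10), (7, 8), (8, 9), (8, 12),
         (8, 13), (9, 10), (9, 12), (9, 13), (11, 12), (12, 13)]

# Adjacency sets of the undirected graph.
adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Edge weight = number of common neighbors of the two end nodes,
# i.e. the entry (M^2)uv of the squared matrix.
edge_weight = {(u, v): len(adj[u] & adj[v]) for u, v in EDGES}

# Node weight = sum of the weights of the edges connected to the node.
node_weight = {n: 0 for n in adj}
for (u, v), w in edge_weight.items():
    node_weight[u] += w
    node_weight[v] += w

print([node_weight[n] for n in sorted(node_weight)])
# [0, 6, 10, 10, 6, 10, 0, 0, 6, 6, 0, 0, 6, 6]
```

Nodes 2, 3 and 5 share the highest weight (10), so one of them serves as the first seed.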

The proposed Algorithm

[Figure: the example graph with edge and node weights; labels: Seed, Neighbors.]

Neighbor  Sum of edge weights  # of edges
P1        2                    1
P3        3                    1
P4        2                    1
P5        3                    1

The proposed Algorithm

[Figure: the example graph with edge and node weights; label: Neighbors.]

Neighbor  Sum of edge weights  # of edges
P3        3                    1
P5        3                    1
P1        2                    1
P4        2                    1

cp of P3 = 1

The proposed Algorithm

[Figure: the example graph with edge and node weights; labels: Neighbors, d = 1.0.]

Neighbor  Sum of edge weights  # of edges
P1        4                    2
P4        4                    2
P5        6                    2
P7        0                    1

The proposed Algorithm

[Figure: the example graph with edge and node weights; labels: Neighbors, d = 1.0.]

Neighbor  Sum of edge weights  # of edges
P5        6                    2
P1        4                    2
P4        4                    2
P7        0                    1

cp of P5 = 1

The proposed Algorithm

[Figure: the example graph with edge and node weights; labels: Neighbors, d = 1.0.]

Neighbor  Sum of edge weights  # of edges
P1        4                    2
P4        4                    2
P6        0                    1
P7        0                    1

cp of P1 = 1

The proposed Algorithm

[Figure: the example graph with edge and node weights; labels: Neighbors, d = 1.0.]

Neighbor  Sum of edge weights  # of edges
P0        0                    1
P4        4                    2
P6        0                    1
P7        0                    1

The proposed Algorithm

[Figure: the example graph with edge and node weights; labels: Neighbors, d = 1.0.]

Neighbor  Sum of edge weights  # of edges
P4        4                    2
P0        0                    1
P6        0                    1
P7        0                    1

cp of P4 = 0.75

The proposed Algorithm

[Figure: the example graph with edge and node weights; labels: Neighbors, d = 0.9.]

Neighbor  Sum of edge weights  # of edges  cp-value
P0        0                    1           ~0.22
P6        0                    1           ~0.22
P7        0                    1           ~0.22

The proposed Algorithm

The remaining graph

[Figure: the remaining graph with its edge and node weights; label: Seed.]

The proposed Algorithm

[Figures: three successive slides showing the remaining graph with its edge and node weights; label on each: d = 1.0.]


The proposed Algorithm

The remaining graph


The proposed Algorithm

Clustering by the proposed algorithm
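The walkthrough above can be condensed into a simplified sketch. It is not the DPClus implementation: it covers only the non-overlapping mode, and several details are assumptions made here, namely the thresholds d_in = 0.6 and cp_in = 0.5, the use of ">=" in both checks, ties broken by the smaller node number, and a lone seed treated as having density 1 (which reproduces "cp of P3 = 1"). The function names and the edge list, read from M with nodes numbered 0 to 13, are also illustrative.

```python
EDGES = [(0, 1), (1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4),
         (3, 5), (3, 7), (4, 5), (5, 6), (6, 10), (7, 8), (8, 9), (8, 12),
         (8, 13), (9, 10), (9, 12), (9, 13), (11, 12), (12, 13)]


def density(cluster, adj):
    n = len(cluster)
    if n < 2:
        return 1.0  # assumption: a lone seed counts as fully dense
    e = sum(len(adj[u] & cluster) for u in cluster) // 2
    return 2 * e / (n * (n - 1))


def dpclus_sketch(edges, d_in, cp_in):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    clusters = []
    while any(adj.values()):  # termination check: stop when no edge is left
        def weight(a, b):  # edge weight = number of common neighbors
            return len(adj[a] & adj[b])
        node_w = {u: sum(weight(u, v) for v in adj[u]) for u in adj}
        # Seed selection: highest weight, or highest degree if all weights are 0.
        if max(node_w.values()) > 0:
            seed = max(sorted(adj), key=node_w.get)
        else:
            seed = max(sorted(adj), key=lambda u: len(adj[u]))
        cluster = {seed}
        grown = True
        while grown:
            grown = False
            nbrs = set().union(*(adj[u] for u in cluster)) - cluster
            # Priority: sum of edge weights to the cluster, then number of edges.
            ranked = sorted(nbrs, key=lambda p: (
                -sum(weight(p, c) for c in adj[p] & cluster),
                -len(adj[p] & cluster), p))
            for p in ranked:
                if density(cluster | {p}, adj) < d_in:
                    break  # density check failed: the cluster is finished
                cp = len(adj[p] & cluster) / (density(cluster, adj) * len(cluster))
                if cp >= cp_in:
                    cluster.add(p)
                    grown = True
                    break  # neighbors are generated and ranked again
        clusters.append(cluster)
        # Output & update: G <- G - kth cluster.
        adj = {u: vs - cluster for u, vs in adj.items() if u not in cluster}
    return clusters


clusters = dpclus_sketch(EDGES, d_in=0.6, cp_in=0.5)
print(clusters)  # [{1, 2, 3, 4, 5}, {8, 9, 12, 13}, {6, 10}]
```

With these assumed settings the first cluster grows from seed 2 by adding 3, 5, 1 and 4, as in the slides, and stops because the cp-value of the remaining neighbors (about 0.22) is below cp_in.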


Results: Complexes in the E. coli PPI Network

The network of E. coli proteins consists of 363 interactions involving a total of 336 proteins

DIP:339N GroEL DIP:1081N PrnP

DIP:1025N CarB DIP:1026N CarA

DIP:539N MalG DIP:508N MalE

DIP:124N XerD DIP:726N XerC

DIP:367N PntB DIP:366N PntA

DIP:342N SbcC DIP:572N Gam

-------------- --------- -------------- ---------

-------------- --------- -------------- ---------

http://dip.mbi.ucla.edu/

Components of RNA polymerase (RpoA, RpoB, RpoC, Rsd, RpoZ, RpoD, RpoN, FliA)

Results: Complexes in the E. coli PPI Network


components of ATP synthetase (AtpA, AtpB, AtpE, AtpF, AtpG, AtpH, AtpL);

Results: Complexes in the E. coli PPI Network


Proteins involved in cell division (FtsQ, FtsI, FtsW, FtsN, FtsK and FtsL)

Results: Complexes in the E. coli PPI Network


components of DNA polymerase (DnaX, HolA, HolB, HolD, and HolC);

Results: Complexes in the E. coli PPI Network


We extract a set of 12487 unique binary interactions involving 4648 proteins by discarding self-interactions of the PPI data obtained from ftp://ftpmips.gsf.de/yeast/PPI/.

Results: Complexes in the S. cerevisiae PPI Network

Results: Details of a Group of Predicted Complexes

Information on the complexes of size 6 or more in the set generated using din=0.7, cpin=0.50 and non-overlapping mode.

(a) Table of the 29 complexes. Columns: ID, N, d, Function Class, Gene Name, Corrected P-value. N is the number of proteins in the complex.

ID  N   Gene Name
1   28  CTF4,CTF8,CTF18,CTF19,CIN1,CIN2,CIN8,GIM3,GIM4,GIM5,MAD1,MAD2,MAD3,BUB1,BUB3,PAC2,PAC10,ARP6,BIK1,BIM1,CHL1,CSM3,DCC1,HTZ1,KAR3,SCC1-73,TUB3,YKE2
2   17  CHS3,CHS5,CHS7,BNI1,BNI4,RVS161,RVS167,ARC40,ARP2,BCK1,CLA4,FKS1,KRE1,SKT5,SLT2,SMI1,SWI4
3   14  TAF17,TAF25,TAF60,TAF61,TAF90,SPT3,SPT7,SPT8,SPT20,ADA2,GCN5,HFI1,NGG1,TRA1
4   14  LSM1,LSM2,LSM3,LSM4,LSM5,LSM6,LSM7,LSM8,DCP1,KEM1,MRNa,PAT1,SNRNa,U6
5   13  RAD27,RAD50,CDC45-1,ELG1,ESC2,HPR5,MMS4,MRC1,POL32,RRM3,SGS1,TOF1,TOP3
6   12  TRS20,TRS23,TRS31,TRS33,TRS65,TRS85,TRS120,TRS130,BET3,BET5,GSG1,KRE11
7   12  COG5,COG6,COG7,COG8,ARL1,ARL3,GOS1,GYP1,RIC1,SWF1,TLG2,YPT6
8   11  APC1,APC2,APC4,APC5,APC9,APC11,CDC16,CDC23,CDC26,CDC27,DOC1
9   9   CDC73,CTI6,DEP1,LEO1,SAP30,SET2,SIF2,SWR1,VPS71
10  8   CFT1,CFT2,FIP1,PAP1,PFS2,PTA1,YSH1,YTH1
11  8   MED2,MED4,MED7,MED8,PGD1,RPB3,SOH1,SRB4
12  8   BEM1,BEM2,BOI1,BOI2,CDC24,CDC42,MSB1,STE20
13  8   ARP1,ASE1,CLB4,JNM1,KAR9,KIP3,NIP100,PAC11
14  8   CDC4,CDC34,CDC53,CLN1,CLN2,CLN3,SIC1,SKP1
15  7   CDC3,CDC10,CDC11,CDC12,GIN4,SEP7,SHS1
16  7   CKA1,CKA2,CKB1,CKB2,CDC7-1,RHO3,TOP2
17  7   SNR3,SNR10,SNR11,SNR189,GAR1,NHP2,NOP10
18  7   SPC19,SPC24,NNF1,NUF2,SMC1,TID3,YDR295c
19  6   YGL161c,YGL198w,GCS1,YDR425w,YIP1,YPL095c
20  6   PRP5,PRP9,PRP11,PRP21,NOG2,YNR053c
21  6   NUP49,NUP57,APG17,NIC96,NSP1,SEC35
22  6   KTR3,LAS17,SLA1,YFR024c,YOR284w,YSC84
23  6   ECM31,GCD7,NIP29,TEM1,YJL199c,YPL070w
24  6   ERB1,HAS1,NIP7,NOP7,NUG1,SSF1
25  6   SEC2,SEC4,SEC10,SEC15,MYO2,SMY1
26  6   MYO3,MYO5,BBC1,BZZ1,UBP7,VRP1
27  6   DBF2,DBF20,CDC15,LTE1,MOB1,SPO12
28  6   HHF1,HHF2,HHT1,HHT2,SPT6,STH1
29  6   CBF1,CEP3,CHL4,CTF13,MCM21,MIF2

d column, in the order the values appear in the transcript: 0.71, 0.72, 1.00, 0.83, 0.71, 0.94, 0.71, 0.98, 0.72, 0.93, 0.72, 0.71, 0.71, 0.71, 0.95, 0.76, 0.71, 0.71, 0.80, 0.80, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73

Corrected P-value column, in the order the values appear in the transcript: 3.9×10^-17, 9.0×10^-13, 1.7×10^-11, 1.1×10^-6, 3.7×10^-4, 3.4×10^-11, 4.0×10^-6, 2.1×10^-10, 1.9×10^-5, 4.8×10^-7, 3.4×10^-5, 3.1×10^-9, 4.5×10^-7, 6.8×10^-7, 3.5×10^-6, 5.4×10^-3, 1.3×10^-4, 3.5×10^-6, 9.5×10^-4, 1.3×10^-7, 6.3×10^-10, 1.0×10^-4, 4.8×10^-1, 2.3×10^-3, 2.4×10^-5, 1.0×10^-4, 1.2×10^-3, 1.8×10^-5, 2.3×10^-5

Function Class column: [Chart over the functional classes, with axis ticks 1, 5, 10, 15.]

(b) [Figure: network of the proteins YIP1, GCS1, YGL161c, YPL095c, YGL198w and YDR425w.]

We considered 15 functional classes: (1) Cell cycle and DNA processing, (2) Protein with binding function or cofactor requirement (structural or catalytic), (3) Protein fate (folding, modification, destination), (4) Biogenesis of cellular components, (5) Cellular transport, transport facilitation and transport routes, (6) Metabolism, (7) Interaction with the cellular environment, (8) Transcription, (9) Energy, (10) Cell rescue, defense and virulence, (11) Cell type differentiation, (12) Cellular communication/signal transduction mechanism, (13) Protein activity regulation, (14) Protein synthesis, and (15) Transposable elements, viral and plasmid proteins

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}}$$

Results: Hypergeometric distribution

N= Total number of proteins in the network

F= Number of proteins of a functional group in the network

C= Number of proteins in a cluster

k= Number of proteins of a functional group in a cluster

The p-value of a cluster is the probability that a randomly selected group of C proteins contains at least k proteins of the functional group.

The lower the p-value the higher the statistical significance


3 green and 4 red balls

Put them in a box

Randomly choose any 3

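P0 (# of red balls is 0): $P_0 = \binom{4}{0}\binom{3}{3} / \binom{7}{3} = 1/35$

P1 (# of red balls is 1): $P_1 = \binom{4}{1}\binom{3}{2} / \binom{7}{3} = 12/35$

P2 (# of red balls is 2): $P_2 = \binom{4}{2}\binom{3}{1} / \binom{7}{3} = 18/35$

P3 (# of red balls is 3): $P_3 = \binom{4}{3}\binom{3}{0} / \binom{7}{3} = 4/35$

Notice that P0 + P1 + P2 + P3 = 1.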

P-value & Hyper geometric distribution

[Figure: bar chart of the probabilities P0, P1, P2 and P3 against the number of red balls (0 to 3); vertical axis from 0 to 0.6.]

P-value & Hyper geometric distribution

P(# of red balls ≤ 1) = P0 + P1

P(# of red balls ≥ 2) = 1 − (P0 + P1)

P(# of red balls ≥ k) = 1 − (P0 + P1 + … + Pk−1)

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}}$$

Here N = 7, F = 4 and C = 3.

P-value & Hyper geometric distribution
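The ball example can be reproduced with a few lines of Python. The function name `enrichment_p_value` is an assumption of this sketch; it evaluates the cumulative hypergeometric formula with N, F, C and k as defined in the slides and uses only `math.comb` from the standard library.

```python
from math import comb


def enrichment_p_value(N, F, C, k):
    """P(at least k of the C drawn items belong to the group of size F)."""
    # Sum of P0 ... P(k-1); math.comb returns 0 when C - i exceeds N - F.
    below_k = sum(comb(F, i) * comb(N - F, C - i) for i in range(k))
    return 1 - below_k / comb(N, C)


# 4 red and 3 green balls, 3 drawn: individual terms P0..P3 as numerators over 35.
terms = [comb(4, i) * comb(3, 3 - i) for i in range(4)]
print(terms, comb(7, 3))               # [1, 12, 18, 4] 35
# P(# of red balls >= 2) = 1 - (P0 + P1) = 22/35
print(enrichment_p_value(7, 4, 3, 2))
```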


Results: Details of a Group of Predicted Complexes

Information on the complexes of size 6 or more in the set generated using din=0.7, cpin=0.50 and non-overlapping mode.

Protein YDR425w of complex 19 is related to cellular transport and YIP1, YGL198w, YGL161c and GCS1 are related to vesicular transport. Hence, we predict the function-unknown protein YPL095c of this complex is a transport related protein most likely related to vesicular transport.


Conclusions

•In this work, we present an algorithm to detect locally dense regions in undirected simple graphs.

•The algorithm can be used to detect protein complexes in large protein-protein interaction networks or co-expressed gene clusters based on microarray data.

•It can also be used for protein/gene function prediction by finding complexes/clusters in networks that contain proteins of both known and unknown function.

•DPClus can also be applied to other networks in which finding cohesive groups is of interest.

The DPClus software is available at http://kanaya.naist.jp/DPClus/


Md. Altaf-Ul-Amin, Hisashi Tsuji, Ken Kurokawa, Hiroko Asahi, Yoko Shinbo, Shigehiko Kanaya, “DPClus: A Density-periphery Based Graph Clustering Software Mainly Focused on Detection of Protein Complexes in Interaction Networks”, Journal of Computer Aided Chemistry , Vol.7, 150-156, 2006.

2. The DPClus Software

The DPClus software is available at http://kanaya.naist.jp/DPClus/

The DPClus software has been developed based on the proposed algorithm.


The main window of DPClus

The DPClus Software

The DPClus Software

The input file format

List of edges

AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH

[Figure: corresponding network.]

Adjacency matrix

0 0 1 0 1
0 0 0 1 1
1 0 0 0 1
0 1 0 0 1
1 1 1 1 0

Adjacency list

AtpA AtpB, AtpH
AtpB AtpA, AtpH
AtpH AtpB, AtpA, AtpG, AtpE
AtpG AtpH, AtpE
AtpE AtpG
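The three representations describe the same network, which a short sketch can make concrete. It parses the list of edges and rebuilds the adjacency matrix shown on the slide. The whitespace-separated parsing and the row order AtpB, AtpG, AtpA, AtpE, AtpH are assumptions made here (the slide does not state the order; this one reproduces its matrix), not a specification of the DPClus file reader.

```python
from collections import defaultdict

edge_list = """AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH"""

# Adjacency list: each protein mapped to the set of its interaction partners.
adj = defaultdict(set)
for line in edge_list.splitlines():
    u, v = line.split()
    adj[u].add(v)
    adj[v].add(u)

# Assumed row/column order of the adjacency matrix.
names = ["AtpB", "AtpG", "AtpA", "AtpE", "AtpH"]
matrix = [[1 if b in adj[a] else 0 for b in names] for a in names]

for row in matrix:
    print(*row)
```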

Cluster
Length of cluster 1 is: 8
RpoA
RpoB
RpoC
Rsd
RpoZ
RpoD
RpoN
FliA
Cluster
Length of cluster 2 is: 8
AtpH
AtpG
AtpB
AtpA
AtpF
AtpL
AtpE
AtpB(A)
Cluster
Length of cluster 3 is: 5
----------------------------------------------------------------------------

Output file format

The DPClus Software


Click!

Intra cluster edges are green and inter cluster edges are red

Nodes have been arranged by dragging

The DPClus Software


Click

Click

Click

Hierarchical graph of the clusters

The DPClus Software


Clustering of microarray data

Sample microarray data

To apply DPClus, we need to convert this data to a network

The DPClus Software

[Figure: microarray data matrix indexed by Genes and Experiment ID.]

$$R_{ij} = \frac{\sum_{k=1}^{m}(x_{ik}-\bar{x}_i)(x_{jk}-\bar{x}_j)}{\sqrt{\sum_{k=1}^{m}(x_{ik}-\bar{x}_i)^2}\,\sqrt{\sum_{k=1}^{m}(x_{jk}-\bar{x}_j)^2}}$$

Gene-Gene correlation

Select highly correlated gene pairs

Edges of a Network

At3g10060 At3g54150
At3g10060 At3g63140
At3g10060 At5g07020
-------------- --------------

The DPClus Software
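The conversion from expression data to a network can be sketched as follows: compute the correlation of every gene pair and keep the pairs at or above a threshold as edges. The gene names and expression values are invented for illustration, and the function names are assumptions of this sketch; a gene with constant expression would cause a division by zero and is not handled here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation R_ij of two expression profiles of equal length."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def correlation_edges(expr, threshold):
    """Return the gene pairs whose correlation is at least `threshold`."""
    genes = sorted(expr)
    return [(g, h) for i, g in enumerate(genes) for h in genes[i + 1:]
            if pearson(expr[g], expr[h]) >= threshold]


# Hypothetical data: 3 genes measured in 4 experiments.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.1],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
print(correlation_edges(expr, 0.95))  # [('geneA', 'geneB')]
```

The resulting pairs have the same shape as the list of edges accepted as input, so they can be written out one pair per line.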

# of experiments: 626
Threshold correlation: 0.95
cp value: 0.5
density value: 0.9
Minimum cluster size: 3

The DPClus Software

Ribosomal protein clusters

Electron transport clusters

Photosynthesis clusters

The DPClus Software

Md. Altaf-Ul-Amin, Masayoshi Wada and Shigehiko Kanaya, “Partitioning a PPI Network into Overlapping Modules Constrained by High-Density and Periphery Tracking”, ISRN Biomathematics, Volume 2012 (2012), Article ID 726429.

The DPClusO Algorithm

DPClusO has been developed on concepts similar to those of DPClus, but DPClusO is more general and has several advantages.

•Each node goes to at least one cluster.
•No two clusters are completely the same.
•The density of each cluster is greater than or equal to the user-given density.
•Clusters are constrained by the periphery, if one exists.

Major differences with DPClus

•each node goes to at least one cluster as big as possible

•Memory efficient

•Faster computation

[Figure: an example network on the nodes A to T, shown twice: clustering by DPClus and clustering by DPClusO.]

Example showing difference in clustering by DPClus and DPClusO

In both cases clustering was done using din = 0.6 and cpin = 0.5


Evaluation of DPClusO


Measures used for Evaluation

Overlapping score: How two clusters match with each other

How a set of predicted clusters match with a set of known clusters

How rich a cluster is with similar function proteins
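The formula for the overlapping score is not preserved in this transcript. A definition that is widely used for comparing protein complexes is |A ∩ B|² / (|A| × |B|); the sketch below uses it purely as an assumption, and the function name `overlap_score` is likewise illustrative.

```python
def overlap_score(a, b):
    """Assumed overlapping score: |A ∩ B|^2 / (|A| * |B|), between 0 and 1."""
    a, b = set(a), set(b)
    return len(a & b) ** 2 / (len(a) * len(b))


print(overlap_score({1, 2, 3, 4}, {3, 4, 5, 6}))  # 2^2 / (4 * 4) = 0.25
print(overlap_score({1, 2, 3}, {1, 2, 3}))        # identical clusters: 1.0
print(overlap_score({1, 2}, {3, 4}))              # disjoint clusters: 0.0
```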


Plot of the number of clusters generated by DPClusO with respect to maximum overlapping. OVmax=0 means all modules are completely non-overlapping. For other points OVmax indicates the maximum overlapping score between any two modules.

DPClusO generated clusters are not too overlapping

[Figure: # of matched clusters (0 to 500) against OV (0 to 1) for DPClusO, Coach and Core, and for DPClusO/3, Coach/3 and Core/3; panels (a) Union, (b) Krogan, (c) DIP, (d) Gavin, (e) MIPS.]

(e) MIPS

Plots showing how many and to what extent the known protein complexes (all complexes and size 3 or more complexes shown separately) of yeast matched with modules predicted by DPClusO, COACH and CORE corresponding to five different datasets.

DPClusO detected more known protein complexes

Page 65: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept


Variation of F-measure with maximum overlapping score (used as a filtering parameter) for modules of size 3 or more generated by DPClusO, COACH and CORE. The marked horizontal lines indicate F-measures for three algorithms in case of no filtering.

By adding simple filtering DPClusO achieved the best F-measure

Page 66: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

[Figure: number of matched clusters versus OV for the original network and for networks altered by Add, Remove and Rearrange; panels (a) for 5% changes, (b) for 5% changes (S3), (c) for 10% changes, (d) for 10% changes (S3)]

Verifying robustness of DPClusO by comparing generated modules from real and randomly altered PPI networks in the context of matching with known complexes. (a) & (b) In case of addition, removal and rearrangement of 5% edges in the context of all and size 3 or more known complexes respectively. (c) & (d) In case of addition, removal and rearrangement of 10% edges in the context of all and size 3 or more known complexes respectively.

DPClusO is a robust algorithm

Page 67: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept


Comparison between the distributions of the high density modules and randomly selected protein groups with respect to –log(p-value) in the contexts of three types of gene ontology terms: (a), (b) biological process (BP), (c), (d) cellular compartment (CC), (e), (f) molecular function (MF).

DPClusO detected modules are rich with similar function proteins

Page 68: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

[Figure: number of clusters versus -log(p-value); panels (a) BP, (b) BP (random), (c) CC, (d) CC (random), (e) MF, (f) MF (random)]

Comparison between the distributions of the star and star-like modules and randomly selected protein groups with respect to –log(p-value) in the contexts of three types of gene ontology terms: (a), (b) biological process (BP), (c), (d) cellular compartment (CC), (e), (f) molecular function (MF).

Also as a consequence of DPClusO clustering it was learnt that a PPI network is a combination of mainly high density and star-like modules.

Page 69: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

DPClusO is a network clustering algorithm

We can easily convert multivariate data into networks and apply DPClusO for clustering

DPClusO is freely available at: http://kanaya.naist.jp/DPClusO

Page 70: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Given an n×p data matrix X, where n is the number of objects (e.g. genes) and p is the number of conditions (e.g. arrays), a bicluster is defined as a submatrix XIJ of X within which a subset of objects I exhibits similar behavior across the subset of conditions J.

An n×p data matrix X can easily be converted to a bipartite graph, for example by applying a threshold to its entries.

Finding biclusters (densely connected regions) in a bipartite graph is a similar problem.

Definition of a bicluster

Page 71: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

A Graph G=(V,E) is bipartite if its vertex set V can be partitioned into two subsets V1, V2 such that each edge of E has one end vertex in V1 and another in V2.

V1

V2
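The definition can be checked mechanically by trying to two-colour the graph. The following is a minimal sketch using breadth-first search; the function name `bipartition` and the edge-list input format are illustrative choices, not part of the lecture.

```python
from collections import deque


def bipartition(edges):
    """Return (V1, V2) if the undirected graph is bipartite, else None."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    side = {}  # node -> 0 (goes to V1) or 1 (goes to V2)
    for start in adj:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in side:
                    side[v] = 1 - side[u]  # neighbours go to the other subset
                    queue.append(v)
                elif side[v] == side[u]:
                    return None  # an edge with both ends in the same subset
    v1 = {n for n, s in side.items() if s == 0}
    return v1, set(side) - v1


print(bipartition([("A", "a"), ("A", "b"), ("B", "a")]))
print(bipartition([("x", "y"), ("y", "z"), ("z", "x")]))  # a triangle
```

The first call splits the nodes into {A, B} and {a, b}; the triangle has no valid split, so the second call returns None.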

Page 72: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Biclusters are densely connected regions in a bipartite graph

C d A a G g I f K k

D c A b G h I g L i

D d B a H e I h L j

E c B b H f J f L k

E d C a H g J g M l

F c C b H h K h M m

F d D a I e G f N l

G d D b K i C c N m

K j

Page 73: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Gene expression data can be represented as bipartite graphs

gene/cond. cond0 cond1 cond2 cond3 cond4

YAL005C 2.85 3.34 0 0 0

YAL012W 0.21 0.03 0.18 -0.27 -0.32

YAL014C -0.03 -0.07 0.28 0.32 -0.27

YAL015C -0.25 0.58 0.77 0.28 0.32

YAL016W 0.11 0.04 0.75 0.82 0.21

YAL017W 0.24 0.31 0.95 0.12 0.18

YAL021C -0.3 0.22 0.02 -0.64 0.06

gene/cond. cond0 cond1 cond2 cond3 cond4

YAL005C 1 1 0 0 0

YAL012W 0 0 0 0 0

YAL014C 0 0 0 0 0

YAL015C 0 0 0 0 0

YAL016W 0 0 0 1 0

YAL017W 0 0 1 0 0

YAL021C 0 0 0 0 0

The second table is obtained by transforming the highest 5% of the values to 1 and all other values to 0

Before transforming, the data can be normalized

Biclusters in gene expression data represent transcription modules/co-expressed gene groups
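The conversion from an expression table to a bipartite graph can be sketched as below. In the lecture the cutoff comes from the highest 5% of values in the full data set; for the seven-gene excerpt shown here, a fixed cutoff of 0.8 is assumed purely because it reproduces the 0/1 table above. The function names are hypothetical.

```python
GENES = ["YAL005C", "YAL012W", "YAL014C", "YAL015C",
         "YAL016W", "YAL017W", "YAL021C"]
CONDS = ["cond0", "cond1", "cond2", "cond3", "cond4"]
X = [
    [2.85, 3.34, 0, 0, 0],
    [0.21, 0.03, 0.18, -0.27, -0.32],
    [-0.03, -0.07, 0.28, 0.32, -0.27],
    [-0.25, 0.58, 0.77, 0.28, 0.32],
    [0.11, 0.04, 0.75, 0.82, 0.21],
    [0.24, 0.31, 0.95, 0.12, 0.18],
    [-0.3, 0.22, 0.02, -0.64, 0.06],
]


def binarize(matrix, cutoff):
    """Set entries at or above the cutoff to 1 and all others to 0."""
    return [[1 if value >= cutoff else 0 for value in row] for row in matrix]


def bipartite_edges(binary, row_names, col_names):
    """Each 1 in the binary matrix becomes a gene-condition edge."""
    return [(row_names[i], col_names[j])
            for i, row in enumerate(binary)
            for j, bit in enumerate(row) if bit]


# ASSUMPTION: 0.8 is an illustrative cutoff for this small excerpt only.
B = binarize(X, cutoff=0.8)
EDGES = bipartite_edges(B, GENES, CONDS)
print(EDGES)
```

Genes form one vertex subset and conditions the other, so every edge joins the two subsets and the result is bipartite by construction.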

Page 74: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

•Tanay,A. et al. (2002) Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18 (Suppl. 1), S136–S144.

•Ihmels,J. et al. (2002) Revealing modular organization in the yeast transcriptional network. Nat. Genet., 31, 370–377.

•Ben-Dor,A., Chor,B., Karp,R. and Yakhini,Z. (2002) Discovering local structure in gene expression data: the order-preserving sub-matrix problem. In Proceedings of the 6th Annual International Conference on Computational Biology, ACM Press, New York, NY, USA, pp. 49–57.

•Cheng,Y. and Church,G. (2000) Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol. pp. 93–103.

•Murali,T.M. and Kasif,S. (2003) Extracting conserved gene expression motifs from gene expression data. Pac. Symp. Biocomput., 8, 77–88.

Page 75: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

We propose a biclustering method incorporating DPClus

G/E a b c d e f g h i j k l m

A 1 1 0 0 0 0 0 0 0 0 0 0 0

B 1 1 0 0 0 0 0 0 0 0 0 0 0

C 1 1 1 1 0 0 0 0 0 0 0 0 0

D 1 1 1 1 0 0 0 0 0 0 0 0 0

E 0 0 1 1 0 0 0 0 0 0 0 0 0

F 0 0 1 1 0 0 0 0 0 0 0 0 0

G 0 0 0 1 1 1 1 0 0 0 0 0 0

H 0 0 0 0 1 1 1 1 0 0 0 0 0

I 0 0 0 0 1 1 1 1 0 0 0 0 0

J 0 0 0 0 1 1 0 0 0 0 0 0 0

K 0 0 0 0 0 0 0 1 1 1 1 0 0

L 0 0 0 0 0 0 0 0 1 1 1 0 0

M 0 0 0 0 0 0 0 0 0 0 0 1 1

N 0 0 0 0 0 0 0 0 0 0 0 1 1

An example bipartite graph and its corresponding matrix

$$(M_{CN})_{ik} = \sum_{j=0}^{|C|-1} (M_{BG})_{ij}\,(M_{BG})_{kj} \quad (\text{for } i \neq k)$$
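Entry (i, k) of the common neighbor matrix counts the columns in which rows i and k of the bipartite matrix both have a 1, with the diagonal left at 0. A small sketch on the example matrix above (rows A to N, columns a to m) follows; the function name `common_neighbor_matrix` is an illustrative choice.

```python
# Bipartite adjacency matrix of the example: rows A..N, columns a..m.
M_BG = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # A
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # B
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # C
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # D
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # E
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # F
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # G
    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # H
    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # I
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # J
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0],  # K
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0],  # L
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # M
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # N
]


def common_neighbor_matrix(m_bg):
    """Count shared column neighbours for every pair of distinct rows."""
    n = len(m_bg)
    m_cn = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if i != k:  # the diagonal stays 0
                m_cn[i][k] = sum(a * b for a, b in zip(m_bg[i], m_bg[k]))
    return m_cn


M_CN = common_neighbor_matrix(M_BG)
print(M_CN[2][3])  # C and D share the columns a, b, c, d
```

The result is symmetric, since the product of two rows does not depend on their order.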

Page 76: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

BiClus:Biclustering method incorporating DPClus

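Concerning each row i (i = 0 to |G|-1) of $M_{CN}$, we calculate $\mathrm{threshold}_i = \mathrm{avg}_i + (\mathrm{max}_i - \mathrm{avg}_i) \times \mathrm{Gmargin}$ and set $(M_{SG})_{ik} = (M_{SG})_{ki} = 1$ if $(M_{CN})_{ik} \geq \mathrm{threshold}_i$ and $\mathrm{threshold}_i$ is not an indeterminate number (for k = 0 to |G|-1).

Here, $\mathrm{avg}_i = \mathrm{SUM}_i / n_i$, where $\mathrm{SUM}_i$ is the sum of the entries in row i of $M_{CN}$, $n_i$ is the number of non-zero entries in row i of $M_{CN}$, and $\mathrm{max}_i$ is the maximum value of the entries in row i of $M_{CN}$.

Gmargin is a user-defined value ≤ 1.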
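The row-wise thresholding can be sketched as follows. Two points are assumptions, because the operators did not survive in this transcript: the comparison is read as "greater than or equal to", and the margin is multiplied by (max - avg). A row with no non-zero entries would give avg = 0/0, so it is skipped. The function name `simple_graph_matrix` is hypothetical.

```python
# Common neighbor matrix of the example (rows and columns A..N).
M_CN = [
    [0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # A
    [2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # B
    [2, 2, 0, 4, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0],  # C
    [2, 2, 4, 0, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0],  # D
    [0, 0, 2, 2, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0],  # E
    [0, 0, 2, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # F
    [0, 0, 1, 1, 1, 1, 0, 3, 3, 2, 0, 0, 0, 0],  # G
    [0, 0, 0, 0, 0, 0, 3, 0, 4, 2, 1, 0, 0, 0],  # H
    [0, 0, 0, 0, 0, 0, 3, 4, 0, 2, 1, 0, 0, 0],  # I
    [0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0],  # J
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 3, 0, 0],  # K
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0],  # L
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],  # M
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],  # N
]


def simple_graph_matrix(m_cn, g_margin):
    """Apply the row-wise threshold and symmetrise the result."""
    n = len(m_cn)
    m_sg = [[0] * n for _ in range(n)]
    for i, row in enumerate(m_cn):
        n_i = sum(1 for value in row if value != 0)
        if n_i == 0:
            continue  # avg_i would be 0/0, i.e. indeterminate
        avg = sum(row) / n_i
        threshold = avg + (max(row) - avg) * g_margin
        for k, value in enumerate(row):
            # ASSUMPTION: ">=" is inferred, the operator is missing in the source
            if k != i and value >= threshold:
                m_sg[i][k] = m_sg[k][i] = 1
    return m_sg


M_SG = simple_graph_matrix(M_CN, g_margin=0.5)  # 0.5 is an arbitrary choice
print(M_SG[0])
```

Because an entry is set whenever either of the two rows passes its own threshold, a weak link for one node can still be kept if it is a strong link for its partner. With Gmargin = 0.5 this reading keeps, for example, A-B and K-L and drops C-G; it agrees with the simple-graph matrix shown in the lecture except for the E-G entry, which this sketch leaves at 0.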

A B C D E F G H I J K L M N

A 0 2 2 2 0 0 0 0 0 0 0 0 0 0

B 2 0 2 2 0 0 0 0 0 0 0 0 0 0

C 2 2 0 4 2 2 1 0 0 0 0 0 0 0

D 2 2 4 0 2 2 1 0 0 0 0 0 0 0

E 0 0 2 2 0 2 1 0 0 0 0 0 0 0

F 0 0 2 2 2 0 1 0 0 0 0 0 0 0

G 0 0 1 1 1 1 0 3 3 2 0 0 0 0

H 0 0 0 0 0 0 3 0 4 2 1 0 0 0

I 0 0 0 0 0 0 3 4 0 2 1 0 0 0

J 0 0 0 0 0 0 2 2 2 0 0 0 0 0

K 0 0 0 0 0 0 0 1 1 0 0 3 0 0

L 0 0 0 0 0 0 0 0 0 0 3 0 0 0

M 0 0 0 0 0 0 0 0 0 0 0 0 0 2

N 0 0 0 0 0 0 0 0 0 0 0 0 2 0

Common neighbor matrix of the bipartite graph

Page 77: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

A B C D E F G H I J K L M N

A 0 1 1 1 0 0 0 0 0 0 0 0 0 0

B 1 0 1 1 0 0 0 0 0 0 0 0 0 0

C 1 1 0 1 1 1 0 0 0 0 0 0 0 0

D 1 1 1 0 1 1 0 0 0 0 0 0 0 0

E 0 0 1 1 0 1 1 0 0 0 0 0 0 0

F 0 0 1 1 1 0 0 0 0 0 0 0 0 0

G 0 0 0 0 1 0 0 1 1 1 0 0 0 0

H 0 0 0 0 0 0 1 0 1 1 0 0 0 0

I 0 0 0 0 0 0 1 1 0 1 0 0 0 0

J 0 0 0 0 0 0 1 1 1 0 0 0 0 0

K 0 0 0 0 0 0 0 0 0 0 0 1 0 0

L 0 0 0 0 0 0 0 0 0 0 1 0 0 0

M 0 0 0 0 0 0 0 0 0 0 0 0 0 1

N 0 0 0 0 0 0 0 0 0 0 0 0 1 0

BiClus:Biclustering method incorporating DPClus

This matrix represents a simple graph

Page 78: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

BiClus:Biclustering method incorporating DPClus

Simple graph derived from the common neighbor matrix.

We can use DPClus to find clusters in the simple graph.

Page 79: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

BiClus:Biclustering method incorporating DPClus

Clustering by DPClus

Page 80: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

BiClus:Biclustering method incorporating DPClus

Clustering by DPClus

Page 81: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

BiClus:Biclustering method incorporating DPClus

Finally determined biclusters

Page 82: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Evaluation of BiClus

-Using synthetic data
-Using real data

Page 83: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Synthetic data

Artificially embedded biclusters with noise

Evaluation of BiClus

Page 84: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Synthetic data

Artificially embedded biclusters with overlap

Evaluation of BiClus

Page 85: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Let M1, M2 be two sets of biclusters. The gene match score of M1 with respect to M2 is given by the function

$$S^{*}_{G}(M_1, M_2) = \frac{1}{|M_1|} \sum_{(G_1, C_1) \in M_1} \; \max_{(G_2, C_2) \in M_2} \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$$
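Read in words, the gene match score takes each bicluster in M1, finds the best Jaccard overlap between its gene set and the gene set of any bicluster in M2, and averages these best values over M1. A minimal sketch under that reading follows; representing a bicluster as a (genes, conditions) pair of sets and the name `gene_match_score` are illustrative choices.

```python
def gene_match_score(m1, m2):
    """Average best gene-set Jaccard overlap of m1's biclusters against m2.

    Each bicluster is a (genes, conditions) pair of sets; only the gene
    sets enter this score.
    """
    if not m1:
        raise ValueError("m1 must contain at least one bicluster")
    total = 0.0
    for genes1, _conds1 in m1:
        best = max(
            (len(genes1 & genes2) / len(genes1 | genes2)
             for genes2, _conds2 in m2 if genes1 | genes2),
            default=0.0,  # m2 empty: nothing matches
        )
        total += best
    return total / len(m1)


found = [({"A", "B", "C", "D"}, {"a", "b"})]
known = [({"A", "B"}, {"a", "b"}), ({"C", "D", "E", "F"}, {"c", "d"})]
print(gene_match_score(found, known))
print(gene_match_score(known, found))
```

The score is not symmetric: the two calls above give different values, which is why it is stated "with respect to" the second set.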

Evaluation of BiClus

•Prelić,A., Bleuler,S., Zimmermann,P., Wille,A., Bühlmann,P., Gruissem,W., Hennig,L., Thiele,L. and Zitzler,E. (2006) A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22 (9), 1122–1129.

Page 86: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

[Figure: "effect of relevance of BCs", avg matching score versus noise level (0 to 0.3) for SAMBA and BiClus]

Evaluation of BiClus

Synthetic data

Artificially embedded biclusters with noise

Page 87: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Evaluation of BiClus

[Figure: "regulatory complexity: relevance of BCs", avg matching score versus overlap degree (0 to 9) for SAMBA and BiClus]

Synthetic data

Artificially embedded biclusters with overlap

Page 88: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Gasch,A.P. et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241–4257.

Gene expression data collected from the above work

Page 89: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Gene expression data can be represented as bipartite graphs

gene/cond. cond0 cond1 cond2 cond3 cond4

YAL005C 2.85 3.34 0 0 0

YAL012W 0.21 0.03 0.18 -0.27 -0.32

YAL014C -0.03 -0.07 0.28 0.32 -0.27

YAL015C -0.25 0.58 0.77 0.28 0.32

YAL016W 0.11 0.04 0.75 0.82 0.21

YAL017W 0.24 0.31 0.95 0.12 0.18

YAL021C -0.3 0.22 0.02 -0.64 0.06

gene/cond. cond0 cond1 cond2 cond3 cond4

YAL005C 1 1 0 0 0

YAL012W 0 0 0 0 0

YAL014C 0 0 0 0 0

YAL015C 0 0 0 0 0

YAL016W 0 0 0 1 0

YAL017W 0 0 1 0 0

YAL021C 0 0 0 0 0

The second table is obtained by transforming the highest 5% of the values to 1 and all other values to 0

Before transforming, the data can be normalized

Biclusters in gene expression data represent transcription modules

Page 90: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

[Figure: plot with p-value labels 0.001, 0.002, 0.003 and 0.01]

Evaluation of BiClus

Real gene expression data of yeast

P-values represent the statistical significance of the functional richness of the modules

P-values were calculated using FuncAssociate: The Gene Set Functionator from http://llama.med.harvard.edu/cgi/func/funcassociate

Page 91: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Application of network concepts in DNA sequencing

Page 92: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Sequencing by hybridization (SBH)

Input: A spectrum S representing all l-mers from an unknown string s

Output: The string s such that spectrum (s,l) = S.

Given an unknown DNA sequence, an array provides information about all strings of length l that the sequence contains

s=TATGGTGC

S(s,l)={TAT, ATG, TGG, GGT, GTG, TGC} (orderly placed)

S(s,l)={GTG, ATG, TGG, TAT, GGT, TGC} (randomly placed)
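The spectrum of a known string is easy to compute, which makes the inverse nature of the SBH problem clear: the array gives only the unordered set, and the string has to be recovered from it. A minimal sketch (the function name `spectrum` mirrors the notation above):

```python
def spectrum(s, l):
    """Return the set of all substrings of length l (l-mers) of s."""
    return {s[i:i + l] for i in range(len(s) - l + 1)}


S = spectrum("TATGGTGC", 3)
print(sorted(S))
```

Since the result is a set, the order in which the l-mers are listed carries no information, and an l-mer that occurs several times in s appears only once.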

Page 93: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

Input: A spectrum S representing all l-mers from an unknown string s

Output: The string s such that spectrum (s,l) = S.

The reduction of the SBH problem to an Eulerian path problem is to construct a graph whose edges correspond to l-mers from spectrum(s,l) and then to find a path in this graph visiting every edge exactly once.

Sequencing by hybridization (SBH)

Page 94: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

The reduction of the SBH problem to an Eulerian path problem is to construct a graph whose nodes correspond to (l-1)-mers and edges correspond to l-mers from spectrum(s,l) and then to find a path in this graph visiting every edge exactly once.

S(s,l)={GTG, ATG, TGG, TAT, GGT, TGC}

(l-1)-mers: GT, TG, AT, TG, TG, GG, TA, AT, GG, GT, TG, GC

(l-1)-mers(redundancy removed): GT, TG, AT, GG, TA, GC

[Figure: directed graph on the nodes TA, AT, TG, GG, GT and GC, with one edge per l-mer]

s=TATGGTGC
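The reduction can be sketched in code: every l-mer becomes an edge from its (l-1)-mer prefix to its (l-1)-mer suffix, and an Eulerian path spells out the string. The path search below uses Hierholzer's algorithm, which is one standard choice and not necessarily the one used in the lecture; it assumes an Eulerian path exists and does not check for it. All function names are illustrative.

```python
from collections import defaultdict


def build_graph(spectrum):
    """Map each (l-1)-mer prefix to the list of suffixes it points to."""
    graph = defaultdict(list)
    for lmer in sorted(spectrum):
        graph[lmer[:-1]].append(lmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    indeg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            indeg[v] += 1
    nodes = set(graph) | set(indeg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - indeg[n] == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Join overlapping (l-1)-mers along the path into one string."""
    return path[0] + "".join(node[-1] for node in path[1:])


S = {"GTG", "ATG", "TGG", "TAT", "GGT", "TGC"}
PATH = eulerian_path(build_graph(S))
print(PATH, spell(PATH))
```

For this spectrum the path visits TA, AT, TG, GG, GT, TG, GC and spells TATGGTGC.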

Sequencing by hybridization (SBH)

Page 95: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

A path in a graph that visits every edge exactly once is called an Eulerian (pronounced "Oilerian") path.

A connected graph has an Eulerian path if and only if it contains at most two semibalanced nodes and all other nodes are balanced.

Balanced node: indegree = outdegree

Semibalanced node: |indegree - outdegree| = 1

[Figure: the graph on the nodes TA, AT, TG, GG, GT and GC, with the semibalanced nodes marked]
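The degree condition can be checked directly from the spectrum, without drawing the graph. The sketch below computes indegree minus outdegree for every (l-1)-mer node and tests the "at most two semibalanced, all others balanced" condition; connectivity is assumed and not checked. The function names are illustrative.

```python
from collections import Counter


def node_balance(spectrum):
    """Return indegree - outdegree for every (l-1)-mer node."""
    indeg, outdeg = Counter(), Counter()
    for lmer in spectrum:
        outdeg[lmer[:-1]] += 1  # edge leaves the prefix node
        indeg[lmer[1:]] += 1    # edge enters the suffix node
    return {node: indeg[node] - outdeg[node]
            for node in set(indeg) | set(outdeg)}


def satisfies_euler_condition(spectrum):
    """Degree part of the Eulerian path criterion (connectivity assumed)."""
    diffs = node_balance(spectrum).values()
    semibalanced = sum(1 for d in diffs if abs(d) == 1)
    unbalanced = sum(1 for d in diffs if abs(d) > 1)
    return unbalanced == 0 and semibalanced <= 2


S = {"GTG", "ATG", "TGG", "TAT", "GGT", "TGC"}
BALANCE = node_balance(S)
print(sorted(n for n, d in BALANCE.items() if abs(d) == 1))
print(satisfies_euler_condition(S))
```

For the example spectrum the semibalanced nodes are TA (one more outgoing edge) and GC (one more incoming edge); these are the start and end of the Eulerian path.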

Sequencing by hybridization (SBH)

Page 96: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

S(s,l)={ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}

(l-1)-mers:AT, TG, TG, GG, TG, GC, GT, TG, GG, GC, GC, CA, GC, CG, CG, GT

(l-1)-mers(redundancy removed):AT, TG, GG, GC, GT, CA, CG

[Figure: directed graph on the nodes AT, TG, GG, GC, GT, CA and CG]

ATGGCGTGCA

Sequencing by hybridization (SBH)

Another example

Page 97: 1.On finding clusters in undirected simple graphs: application to protein complex detection 2.DPClus software tool 3.Introduction to DPClusO 4.Concept

S(s,l)={ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}

(l-1)-mers:AT, TG, TG, GG, TG, GC, GT, TG, GG, GC, GC, CA, GC, CG, CG, GT

(l-1)-mers(redundancy removed):AT, TG, GG, GC, GT, CA, CG

[Figure: directed graph on the nodes AT, TG, GG, GC, GT, CA and CG]

ATGCGTGGCA

Sequencing by hybridization (SBH)
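The two reconstructions of the second example show that a spectrum does not always determine a unique string: the graph can have more than one Eulerian path. A brute-force sketch that enumerates every string using each l-mer exactly once makes this visible. It is exponential in the worst case and meant only for small examples; the function name `all_reconstructions` is hypothetical.

```python
def all_reconstructions(spectrum):
    """Enumerate all strings that use every l-mer of the spectrum once."""
    lmers = frozenset(spectrum)
    results = set()

    def extend(text, unused):
        if not unused:
            results.add(text)
            return
        for lmer in unused:
            # The next l-mer must overlap the current end in l-1 letters.
            if text[-(len(lmer) - 1):] == lmer[:-1]:
                extend(text + lmer[-1], unused - {lmer})

    for first in lmers:
        extend(first, lmers - {first})
    return sorted(results)


S1 = {"GTG", "ATG", "TGG", "TAT", "GGT", "TGC"}
S2 = {"ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"}
print(all_reconstructions(S1))
print(all_reconstructions(S2))
```

The first spectrum yields only TATGGTGC, while the second yields both ATGCGTGGCA and ATGGCGTGCA, matching the two paths shown above.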