Dense Subgraphs with Restrictions & Applications to Gene Annotations Graphs

Samir Khuller University of Maryland Joint Work with Barna Saha, Allie Hoch, Louiqa Raschid, Xiao-Ning Zhang RECOMB 2010


Page 1: Dense Subgraphs with Restrictions & Applications to Gene Annotations Graphs

Samir Khuller, University of Maryland

Joint Work with Barna Saha, Allie Hoch, Louiqa Raschid, Xiao-Ning Zhang

RECOMB 2010

Page 2

Sitting in a talk on community detection... How do we define a community?

Perhaps we want to capture a group of individuals with strong interactions within the group?

Page 3

[Figure: example graph on nodes 1 to 7]

The density of {1,2,3,4,5,6,7} = 9/7 ≈ 1.29

The density of {1,2,3,4} = 6/4 = 1.5

The densest subgraph is {1,2,3,4}.

How do we compute the densest subgraph?

Surprisingly, this can be solved optimally in polynomial time!

[Goldberg 84, Lawler 76, Queyranne 75, GGT]

Extends to weighted graphs.

Subgraph density = (sum of weights of edges in the induced subgraph) / (number of nodes in the induced subgraph)
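The density definition above can be checked with a few lines of Python. This is a minimal sketch: the edge list is hypothetical (the slide's figure is not recoverable from the transcript) and was chosen only so that {1,2,3,4} induces 6 unit edges and the whole graph has 9, matching the counts on the slide. The function name `density` is also an illustrative choice.

```python
from fractions import Fraction

def density(edges, nodes):
    """Total weight of edges with both endpoints in `nodes`, divided by |nodes|."""
    nodes = set(nodes)
    total = sum(w for (u, v), w in edges.items() if u in nodes and v in nodes)
    return Fraction(total, len(nodes))

# Hypothetical unit-weight graph: a 4-clique on {1,2,3,4} plus three more edges.
edges = {(1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 3): 1, (2, 4): 1, (3, 4): 1,
         (4, 5): 1, (5, 6): 1, (6, 7): 1}

print(density(edges, {1, 2, 3, 4}))   # 3/2
print(density(edges, range(1, 8)))    # 9/7
```

Exact fractions are used so that densities can be compared without rounding; the same function works for weighted graphs because it sums edge weights rather than counting edges.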

Page 4

[Figure: example graph on nodes 1 to 8]

Subgraph density = (sum of weights of edges in the induced subgraph) / (number of nodes in the induced subgraph)

Density of entire graph is 13/8 > 1.5

Page 5

Are all dense subgraphs meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are found?
◦ Putting size constraints makes the problem intractable immediately.

Densest subgraph of size >= k.
◦ NP-hard, 2-approximation [Anderson] [Khuller, Saha]. Greedily union densest subgraphs...

Densest subgraph of size <= k (or = k).
◦ NP-hard, and some approximations are known [Feige, Kortsarz, Peleg] [Charikar et al].

Are Dense Subgraphs Useful?

Page 6

Goldberg's algorithm: a new flow network is created with "directed" edges.
◦ An s-t min cut is computed in order to find the densest subgraph after guessing the max density. Nodes on the "s" side of the cut are part of a densest subgraph.
◦ GGT speeds everything up to a single flow computation!

Lawler's algorithm: slightly different flow construction, more intuitive.

Greedy algorithm: recursively delete low degree nodes. Gives a 2-approximation to density! Fast!
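The greedy algorithm can be sketched as follows: repeatedly remove a node of minimum (weighted) degree and remember the densest intermediate subgraph. This is an illustrative sketch, not the authors' implementation; the function name `greedy_densest`, the heap-based bookkeeping, and the example graph are assumptions made here, and the input is assumed to contain at least one edge.

```python
import heapq
from collections import defaultdict

def greedy_densest(edges):
    """Peel off a minimum-degree node repeatedly; return the densest set seen."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    total = sum(edges.values())            # weight of edges among alive nodes
    alive = set(adj)
    heap = [(d, u) for u, d in degree.items()]
    heapq.heapify(heap)

    best_density, best_size = total / len(alive), len(alive)
    removed = []                           # removal order
    while len(alive) > 1:
        d, u = heapq.heappop(heap)
        if u not in alive or d != degree[u]:
            continue                       # stale heap entry
        alive.remove(u)
        removed.append(u)
        total -= degree[u]
        for v, w in adj[u].items():
            if v in alive:
                degree[v] -= w
                heapq.heappush(heap, (degree[v], v))
        if total / len(alive) > best_density:
            best_density, best_size = total / len(alive), len(alive)
    # The best set is everything except the earliest removals.
    best_set = set(adj) - set(removed[: len(adj) - best_size])
    return best_set, best_density

# Hypothetical graph: a 4-clique on {1,2,3,4} with a path 4-5-6-7 attached.
edges = {(1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 3): 1, (2, 4): 1, (3, 4): 1,
         (4, 5): 1, (5, 6): 1, (6, 7): 1}
print(greedy_densest(edges))   # ({1, 2, 3, 4}, 1.5)
```

On this small example the peeling order removes the path nodes first, so the greedy answer coincides with the true densest subgraph; in general it is only guaranteed to be within a factor of 2.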

Page 7

[Figure: flow network with a source, a sink, and nodes 1 to 4; every edge label shown is 1; the node sets V1 and V2 are marked]

We use s-t min cuts

Page 8

Original Graph:

[Figure: path on nodes 1, 2, 3 with edge weights 5 and 2]

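Page 9

[Figure: flow network built from the original graph with guess g = 2; each source edge has capacity 7; the sink edges have capacities 9, 6, and 4, for example 6 = 7 + 2*2 - 5]

Edges from source to nodes: m' = sum of all edge weights in the graph

Edge from node i to sink: m' + 2g - d(i), where d(i) is the weighted degree of node i

g = guess = 2

CUT = m'|V| + 2|V1|(g - D1), where V1 is the set of source-side nodes and D1 is the density of the subgraph induced by V1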
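The construction on this slide can be reproduced in code. The sketch below builds the network for a guess g and computes an s-t min cut with a plain Edmonds-Karp max flow. Several things are assumptions made here for illustration: the node labeling of the path (edge 1-2 of weight 2, edge 2-3 of weight 5), the names `goldberg_network` and `min_cut`, the labels "s" and "t", and the choice of Edmonds-Karp (the slides do not say which flow algorithm was used).

```python
from collections import deque

def goldberg_network(edges, g):
    """Capacities for guess g: s->i is m', i->t is m' + 2g - d(i), i<->j is w(i,j)."""
    nodes = {u for e in edges for u in e}
    m = sum(edges.values())                      # m' on the slide
    deg = {u: 0 for u in nodes}
    for (u, v), w in edges.items():
        deg[u] += w
        deg[v] += w
    cap = {"s": {u: m for u in nodes}}
    for u in nodes:
        cap[u] = {"t": m + 2 * g - deg[u]}
    for (u, v), w in edges.items():
        cap[u][v] = w
        cap[v][u] = w
    return cap, m * len(nodes)                   # network, value of the trivial cut

def min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns (cut value, nodes reachable from s at the end)."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}   # residual capacities
    for u in cap:
        for v in cap[u]:
            res.setdefault(v, {}).setdefault(u, 0)     # reverse arcs
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:               # BFS for a shortest path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)         # bottleneck capacity
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push

edges = {(1, 2): 2, (2, 3): 5}       # hypothetical labeling of the path
cap, trivial = goldberg_network(edges, 2)
value, side = min_cut(cap, "s", "t")
print(trivial, value, side - {"s"})  # 21 19 {2, 3}
```

With g = 2 the min cut has value 19, below the trivial cut value m'|V| = 21, and the source side contains nodes 2 and 3, the endpoints of the weight-5 edge.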

Page 10

[Figure: the same flow network with guess g = 2; each source edge has capacity 7; the sink edges have capacities 9, 6, and 4]

Edges from source to nodes: m' = sum of all edge weights in the graph

Edge from node i to sink: m' + 2g - d(i)

g = guess = 2

CUT = m'|V| + 2|V1|(g - D1), where V1 is the set of source-side nodes and D1 is the density of the subgraph induced by V1

Since the min cut is < 21 (the value m'|V| of the trivial cut), the guess is too low.

Page 11

[Figure: the same flow network with guess g = 4; each source edge has capacity 7; the sink edges have capacities 13, 10, and 8]

Edges from source to nodes: m' = sum of all edge weights in the graph

Edge from node i to sink: m' + 2g - d(i)

g = guess = 4

Since the min cut is the trivial cut (and unique), the guess is too high.

Page 12

[Figure: the same flow network with guess g = 2 1/3; each source edge has capacity 7; the sink edges have capacities 9 2/3, 6 2/3, and 4 2/3]

Edges from source to nodes: m' = sum of all edge weights in the graph

Edge from node i to sink: m' + 2g - d(i)

g = guess = 2 1/3

Since the min cut is smaller than 21, the guess is too low.

Page 13

[Figure: the same flow network with guess g = 2 1/2; each source edge has capacity 7; the sink edges have capacities 10, 7, and 5]

Edges from source to nodes: m' = sum of all edge weights in the graph

Edge from node i to sink: m' + 2g - d(i)

g = guess = 2 1/2

Since the min cut has value 21 and V1 is not empty, the guess is correct!
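The four guesses walked through on the preceding slides (2, 4, 2 1/3, 2 1/2) can be checked directly against the cut formula CUT = m'|V| + 2|V1|(g - D1). The sketch below does not run a flow algorithm; it simply evaluates the formula for every possible V1 on the three-node example, which is only feasible for tiny graphs. The node labeling (edge 1-2 of weight 2, edge 2-3 of weight 5) and the tie-breaking rule that prefers a larger V1 are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

edges = {(1, 2): 2, (2, 3): 5}     # hypothetical labeling of the path
nodes = [1, 2, 3]
m = sum(edges.values())            # m' = 7, so the trivial cut is m'|V| = 21

def density(subset):
    inside = sum(w for (u, v), w in edges.items() if u in subset and v in subset)
    return Fraction(inside, len(subset))

def min_cut_value(g):
    """Minimum of m'|V| + 2|V1|(g - D1) over all V1; returns (value, V1)."""
    best = (Fraction(m * len(nodes)), ())          # trivial cut, V1 empty
    for k in range(1, len(nodes) + 1):
        for v1 in combinations(nodes, k):
            value = m * len(nodes) + 2 * k * (Fraction(g) - density(v1))
            # On ties, prefer the larger source side.
            if value < best[0] or (value == best[0] and len(v1) > len(best[1])):
                best = (value, v1)
    return best

for g in (2, 4, Fraction(7, 3), Fraction(5, 2)):
    value, v1 = min_cut_value(g)
    print(f"g = {g}: min cut = {value}, V1 = {v1}")
```

The outcomes agree with the slides: for g = 2 and g = 2 1/3 the minimum is below 21 (guess too low), for g = 4 only the trivial cut attains the minimum (guess too high), and for g = 2 1/2 a non-empty V1 = (2, 3) attains exactly 21, so 2 1/2 is the maximum density.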

Page 14

What can we do with this weapon?

Consider gene annotation data from TAIR.

For large networks we can use the fast greedy approximation (gave us the densest subgraph every time!).

Page 15

AT1G15550 GA4

GO:0016707 gibberellin 3-beta-dioxygenase activity

GO:0009686 gibberellin biosynthetic process

GO:0009739 response to gibberellin stimulus

GO:0009639 response to red or far red light

GO:0008134 transcription factor binding

GO:0010114 response to red light

PO:0019018 embryo axis

PO:0009046 flower

PO:0009005 root

PO:0009001 fruit

PO:0020001 ovary placenta

PO:0020148 shoot apical meristem

PO:0020030 cotyledon

PO:0009064 receptacle

PO:0003011 root vascular system

PO:0000014 rosette leaf

PO:0004723 sepal vascular system

PO:0009047 stem

PO:0020141 stem node

PO:0009009 embryo

PO:0004714 terminal floral bud

PO:0009025 leaf

PO:0007057 0 germination

PO:0007131 seedling growth

PO:0009067 filament

GO:0009740 gibberellic acid mediated signalling

GO:0005737 cytoplasm

GO-(gene)-PO tri-partite graph

Page 16

GO:0009686 gibberellin biosynthetic process

GO:0009739 response to gibberellin stimulus

GO:0009639 response to red or far red light

GO:0010114 response to red light

GO:0009740 gibberellic acid mediated signalling

GO:0008135 biological process

GO Ontology

Page 17

PO:0019018 embryo axis

PO:0009046 flower

PO:0009005 root

PO:0009001 fruit

PO:0020001 ovary placenta

PO:0020148 shoot apical meristem

PO:0020030 cotyledon

PO:0009064 receptacle

PO:0003011 root vascular system

PO:0000014 rosette leaf

PO:0004723 sepal vascular system

PO:0009047 stem

PO:0020141 stem node

PO:0009009 embryo

PO:0004714 terminal floral bud

PO:0009025 leaf

PO:0009067 filament

Plant structure

PO Ontology

Page 18

Gene Annotation Graph

Construct graphs for each gene using their GO, PO annotations

Combine the graphs of several genes into one single weighted graph

[Figure: tripartite graph with nodes Gene 1 to Gene 4, GO 1 to GO 4, and PO 1 to PO 4]

Page 19

Scientists like to find patterns in gene annotation graphs, but these are huge!

Need to allow some control over the kind of patterns that are computed

Would like to find biologically meaningful patterns

[Figure: tripartite graph with nodes Gene 1 to Gene 4, GO 1 to GO 4, and PO 1 to PO 4; legend: Node, Edge]

Page 20

[Repeat of the GO-(gene)-PO tri-partite graph for AT1G15550 GA4 shown on Page 15]

Page 21

GO:0016707 gibberellin 3-beta-dioxygenase activity

GO:0009686 gibberellin biosynthetic process

GO:0009739 response to gibberellin stimulus

GO:0009639 response to red or far red light

GO:0008134 transcription factor binding

GO:0010114 response to red light

PO:0019018 embryo axis

PO:0009046 flower

PO:0009005 root

PO:0009001 fruit

PO:0020001 ovary placenta

PO:0020148 shoot apical meristem

PO:0020030 cotyledon

PO:0009064 receptacle

PO:0003011 root vascular system

PO:0000014 rosette leaf

PO:0004723 sepal vascular system

PO:0009047 stem

PO:0020141 stem node

PO:0009009 embryo

PO:0004714 terminal floral bud

PO:0009025 leaf

PO:0007057 0 germination

PO:0007131 seedling growth

PO:0009067 filament

GO:0009740 gibberellic acid mediated signalling

GO:0005737 cytoplasm

GO-PO bipartite graph

Page 22

Gene Annotation Graph

Construct a complete bipartite graph for each gene using its GO and PO annotations

Combine the bipartite graphs of several genes into one single weighted graph

[Figure: weighted bipartite graph between GO 1 to GO 4 and PO 1 to PO 4, with edge weights between 1 and 3]
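The two construction steps above can be sketched in a few lines. The sketch assumes that the weight of a GO-PO edge is the number of genes annotated with both terms, which is one natural reading of "combine into one single weighted graph"; the gene annotations and the function name `annotation_graph` are hypothetical.

```python
from collections import Counter
from itertools import product

def annotation_graph(genes):
    """Sum, over all genes, the complete bipartite graph on each gene's GO x PO terms."""
    weights = Counter()
    for go_terms, po_terms in genes.values():
        for go, po in product(set(go_terms), set(po_terms)):
            weights[(go, po)] += 1     # one unit of weight per supporting gene
    return weights

# Hypothetical annotations: gene -> (GO terms, PO terms)
genes = {
    "Gene 1": (["GO 1", "GO 3"], ["PO 1", "PO 2"]),
    "Gene 2": (["GO 3", "GO 4"], ["PO 1", "PO 4"]),
    "Gene 3": (["GO 3"], ["PO 1", "PO 2", "PO 4"]),
}
graph = annotation_graph(genes)
print(graph[("GO 3", "PO 1")])   # 3: all three genes carry both terms
print(graph[("GO 1", "PO 4")])   # 0: no gene carries both terms
```

The genes themselves drop out of the combined graph; only the GO-PO co-occurrence counts remain, which is what the dense subgraph computation works on.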

Page 23

How can we extract knowledge?

Cliques: these might give us some biological information, but this is a stringent requirement.

However, clique finding is well known to be really hard (NP-hard, hard to approximate).

Why not look for "dense regions"?

Note that the notion of density could be defined for hyper-edges as well, but for our purposes this does not do as well.

Page 24

Dense Subgraphs in Gene Annotation Graph

A collection of GO-PO terms that appear together in the underlying genes.

[Figure: the weighted bipartite graph between GO 1 to GO 4 and PO 1 to PO 4, with edge weights between 1 and 3]

(GO3,PO1),(GO3,PO2),(GO3,PO4),(GO4,PO1),(GO4,PO2),(GO4,PO4) appear frequently in the 4 genes

Page 25

Are all dense subgraphs biologically meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are computed?
◦ In fact we can impose both restrictions at the same time!

Restrictions in dense subgraph computation

Distance Restricted: GO terms and similarly PO terms that appear must be biologically related

Subset Restricted: Certain GO, PO terms must appear in the returned subgraph

Page 26

Are all dense subgraphs biologically meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are found?

Restrictions in dense subgraph computation

Distance Restricted: GO terms that appear in the densest subgraph must be close in the GO ontology graph, and similarly for the PO terms

Subset Restricted

Page 27

Distance threshold = 1

This means that some sets of nodes are not allowed to coexist in the final solution: {GO1, GO2}, {GO1, GO4}, {PO1, PO4}, {PO1, PO2}, {PO2, PO3}.

The final solution is {GO2, GO3, GO4, PO2, PO4}, which has a density of 0.8.

[Figure: GO and PO ontology graphs on GO1 to GO4 and PO1 to PO4, alongside the GO-PO bipartite graph]

Page 28

For arbitrary ontology graph structure
◦ NP-hard even to approximate reasonably (reduction from the Independent Set problem)
◦ Factor 2 relaxation of the distance threshold is enough to get a solution with density as high as the optimum

Trees, interval graphs, graphs in which each edge participates in a small number of cycles
◦ Polynomial time algorithm to compute the optimum

Page 29

Distance Threshold = 2

[Figure: GO-Ontology (nodes 1 to 9) and PO-Ontology (nodes 1 to 8), with the same node labels repeated in the GO-PO bipartite graph]

Page 30

Distance Threshold = 2

Guess two nodes in each ontology that appear in the optimum solution and have maximum distance

[Figure: GO-Ontology (nodes 1 to 9) and PO-Ontology (nodes 1 to 8), with the same node labels repeated in the GO-PO bipartite graph]

Page 31

Distance Threshold = 2

Compute all the nodes that are within the distance threshold of both guessed nodes

[Figure: GO-Ontology (nodes 1 to 9) and PO-Ontology (nodes 1 to 8), with the same node labels repeated in the GO-PO bipartite graph]

Page 32

Distance Threshold = 2

In the gene-annotation bipartite graph, consider only the chosen nodes and compute the densest subgraph

[Figure: GO-Ontology (nodes 1 to 9) and PO-Ontology (nodes 1 to 8), with the same node labels repeated in the GO-PO bipartite graph]

Page 33

Distance Threshold = 2

In the gene-annotation bipartite graph, consider only the chosen nodes and compute the densest subgraph

[Figure: GO-Ontology (nodes 1 to 9) and PO-Ontology (nodes 1 to 8); the bipartite graph is restricted to GO nodes 5, 6, 9, 7 and PO nodes 2, 4, 5]

Page 34

Distance Threshold = 2

In the gene-annotation bipartite graph, consider only the chosen nodes and compute the densest subgraph

[Figure: GO-Ontology (nodes 1 to 9) and PO-Ontology (nodes 1 to 8); the bipartite graph is restricted to GO nodes 5, 6, 9, 7 and PO nodes 2, 4, 5]

Page 35

Distance Threshold = 2

[Figure: GO-Ontology (nodes 1 to 9) and PO-Ontology (nodes 1 to 8); the bipartite graph is restricted to GO nodes 5, 6, 9, 7 and PO nodes 2, 4, 5]

Proof of optimality:
◦ Any node not chosen cannot be in the optimum solution
◦ All the nodes chosen are within the distance threshold

Page 36

Guess a small subset of nodes from the optimum

Choose candidate nodes by considering distance from the guessed nodes

Compute the densest subgraph by restricting the gene annotation graph to only the chosen nodes
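The three steps above can be sketched for the case where both ontologies are trees. This is an illustration under stated assumptions, not the authors' code: the sketch uses the guessed pair's own distance d(u, v) as the filtering radius (on a tree this keeps all candidates pairwise within the threshold), it replaces the flow-based densest subgraph routine with exhaustive search so that the block stays self-contained, and the ontologies, edge weights, and function names are all hypothetical.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations, product

def bfs_dist(tree, src):
    """Hop distances from src in a connected tree given as an adjacency dict."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in tree[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def candidate_sets(tree, threshold):
    """For each guessed pair (u, v) with d(u, v) <= threshold, keep the nodes
    within d(u, v) of both u and v (radius choice is an assumption of this sketch)."""
    dist = {u: bfs_dist(tree, u) for u in tree}
    found = set()
    for u, v in product(tree, repeat=2):
        d = dist[u][v]
        if d <= threshold:
            found.add(frozenset(x for x in tree
                                if dist[u][x] <= d and dist[v][x] <= d))
    return found

def densest_brute_force(edges, allowed):
    """Exhaustive densest subgraph on `allowed`; only suitable for tiny examples."""
    best = (Fraction(0), frozenset())
    allowed = sorted(allowed)
    for k in range(1, len(allowed) + 1):
        for subset in combinations(allowed, k):
            chosen = set(subset)
            weight = sum(w for (a, b), w in edges.items()
                         if a in chosen and b in chosen)
            best = max(best, (Fraction(weight, k), frozenset(subset)),
                       key=lambda t: t[0])
    return best

def distance_restricted_densest(go_tree, po_tree, edges, go_t, po_t):
    best = (Fraction(0), frozenset())
    for go_c, po_c in product(candidate_sets(go_tree, go_t),
                              candidate_sets(po_tree, po_t)):
        best = max(best, densest_brute_force(edges, go_c | po_c),
                   key=lambda t: t[0])
    return best

# Hypothetical ontologies (two paths) and hypothetical GO-PO edge weights.
go_tree = {"G1": ["G2"], "G2": ["G1", "G3"], "G3": ["G2"]}
po_tree = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2"]}
edges = {("G1", "P1"): 2, ("G1", "P2"): 2, ("G2", "P1"): 2, ("G2", "P2"): 2,
         ("G3", "P1"): 3, ("G1", "P3"): 3, ("G3", "P3"): 3}

result = distance_restricted_densest(go_tree, po_tree, edges, 1, 1)
print(result[0], sorted(result[1]))   # 2 ['G1', 'G2', 'P1', 'P2']
```

Without the restriction the whole graph has density 17/6, but it contains G1 together with G3 and P1 together with P3, which are at distance 2 in their ontologies. With threshold 1 in both ontologies, the best feasible answer is {G1, G2, P1, P2} with density 2.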

Page 37

Are all dense subgraphs biologically meaningful?
◦ How do we allow some control over the kind of dense subgraphs that are found?

Restrictions in dense subgraph computation

Distance Restricted

Subset Restricted: Given a subset of GO, PO terms, compute the densest subgraph containing them.

Page 38

[Figure: example graph on nodes 1 to 8 with weighted edges]

•The set {5,6} must be in the solution.

•Density of {1,2,3,4} = (3+2+2+2)/4 = 2.25 (does not contain {5,6})

•Density of {5,6,7,8} = 6/4 = 1.5 (satisfies the subset requirement)

•Density of {1,2,3,4,5,6,7,8} = (2+3+2+2+1*7)/8 = 2.0 (best answer)

Polynomial time algorithm to compute the optimum solution
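The subset-restricted problem itself is easy to state in code, even though the polynomial time algorithm relies on a modified flow construction. The sketch below is an exhaustive search for illustration only. The edge list is hypothetical: it was chosen to reproduce the three densities quoted on this slide (weights 3, 2, 2, 2 inside {1,2,3,4}, six unit edges inside {5,6,7,8}, and one unit edge joining the two groups), since the actual figure is not recoverable.

```python
from fractions import Fraction
from itertools import combinations

def densest_containing(edges, nodes, required):
    """Exhaustive search over node sets that contain every node in `required`."""
    required = set(required)
    optional = sorted(set(nodes) - required)
    best = None
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            chosen = required | set(extra)
            weight = sum(w for (u, v), w in edges.items()
                         if u in chosen and v in chosen)
            cand = (Fraction(weight, len(chosen)), chosen)
            if best is None or cand[0] > best[0]:
                best = cand
    return best

# Hypothetical graph consistent with the densities on the slide.
edges = {(1, 2): 3, (2, 3): 2, (3, 4): 2, (1, 4): 2,        # heavy group
         (4, 5): 1,                                          # connecting edge
         (5, 6): 1, (5, 7): 1, (5, 8): 1,
         (6, 7): 1, (6, 8): 1, (7, 8): 1}                    # unit 4-clique

best_density, best_set = densest_containing(edges, range(1, 9), {5, 6})
print(best_density, sorted(best_set))   # 2 [1, 2, 3, 4, 5, 6, 7, 8]
```

Forcing {5,6} into the answer rules out the unrestricted optimum {1,2,3,4}; on this example the best set that contains {5,6} is the entire graph, with density 16/8 = 2.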

Page 39

For this problem we modified Lawler’s method of finding densest subgraphs. Let’s assume that we have a graph in which we want to force {5,6} to be in the final solution.

Page 40

The guess "g" is iteratively updated, as in Goldberg's algorithm, until the min cut computation yields more than one minimum cut: one whose source side contains just {s', s}, and another that specifies the densest subgraph.

Page 41

A graph may contain multiple subgraphs of equal (or close to equal) density

Computing just one subgraph may not be sufficient

Compute all subgraphs close to maximum density

Extension of Picard and Queyranne's result
◦ Polynomial time algorithm to find all almost dense subgraphs, given that the number of such subgraphs is polynomial in the number of vertices.
◦ Their method encodes all possible s-t min cuts.
◦ After a max-flow is found, we set residual capacities that are close to zero to exactly zero, and then use the [PQ] method to list all s-t min cuts.

Can be extended to consider both distance and subset restriction

Page 42

[Figure: example graph on nodes 1 to 9 with weighted edges]

•Density of {1,2,3,4} = 9/4 = 2.25

•Density of {5,6,7,8,9} = 11/5 = 2.2

•Density of {1,2,3,4,5,6,7,8,9} = 21/9 ≈ 2.333

•The entire graph is the densest subgraph, but {1,2,3,4} and {5,6,7,8,9} are "almost" dense subgraphs

Page 43

10 Photomorphogenesis genes

CIB5 CRY2 HFR1 COP1 PHOT1 PHOT2 HY5 SHB1 CRY1 CIB1

66 GO CV terms; 41 PO CV terms; 2230 GO-PO edges.

Generate distance restricted dense subgraph.
◦ GO distance = 2. PO distance = 3.
◦ Dense subgraph with 3 GO terms & 13 PO terms

Photomorphogenesis Experiment

Page 44

[Figure: the set of 10 genes (HFR1, COP1, PHOT1, PHOT2, HY5, CRY2, CIB5, SHB1, CIB1, CRY1) drawn between 13 PO CV terms and 3 GO CV terms; numeric labels in the figure: 0 annotation edges, 8, 26, 12, 13, 13, 12, 13, 2, 13]

(partial) dense subgraph; 3 GO terms; 13 PO terms; 10 genes

Page 45

Potential Discovery

Genes CRY2 and PHOT1 are both observed in the dense subgraph with the following two GO and PO combinations:
◦ 5773: vacuole; cellular_component
◦ 13: cauline leaf; plant_structure
◦ 37: shoot apex; plant_structure
◦ (5773, 13) (5773, 37)

This pattern has not been reported in the literature. Two independent studies [Kang et al. Planta 08, Ohgishi PNAS 04] have suggested that there may be some functional interactions between the members of PHOT1 and CRY2 in vacuole.

Page 46

[Figure: example graph on nodes 1 to 8]

Density of entire graph is 13/8 > 1.5

Back to communities...?

Can we use distance based cutoffs to define a subgraph of interest?

Page 47

Identifying dense subgraphs with distance and subset restriction may help in identifying interesting patterns

Potential applications in other domains:
◦ Distance restricted dense subgraph for community detection
◦ Subset restricted dense subgraph in PPI network for deriving protein complexes

Ranking almost dense subgraphs

Change the notion of density [K, Mukherjee, Saha]?