graph based clustering

Graph Based Clustering Summer School “Achievements and Applications of Contemporary Informatics, Mathematics and Physics” (AACIMP 2011) August 8-20, 2011, Kiev, Ukraine Erik Kropat University of the Bundeswehr Munich Institute for Theoretical Computer Science, Mathematics and Operations Research Neubiberg, Germany

Upload: ssa-kpi

Post on 11-May-2015

DESCRIPTION

AACIMP 2011 Summer School. Operational Research Stream. Lecture by Erik Kropat.

TRANSCRIPT

Page 1: Graph Based Clustering

Graph Based Clustering

Summer School

“Achievements and Applications of Contemporary Informatics,

Mathematics and Physics” (AACIMP 2011)

August 8-20, 2011, Kiev, Ukraine

Erik Kropat

University of the Bundeswehr Munich

Institute for Theoretical Computer Science, Mathematics and Operations Research

Neubiberg, Germany

Page 2: Graph Based Clustering

Real World Networks

• Biological Networks

− Gene regulatory networks

− Metabolic networks

− Neural networks

− Food webs

• Technological Networks

− Telecommunication networks

− Internet

− Power grids

[Figures: food web; power grid]

Page 3: Graph Based Clustering

Real World Networks

• Social Networks

− Communication networks

− Organizational networks

− Social media

− Online communities

• Economic Networks

− Financial market networks

− Trade networks

− Collaboration networks

[Figures: social networks; economic networks]

Source: Frank Schweitzer et al., “Economic Networks: The New Challenges,” Science 325, no. 5939 (July 24, 2009): 422-425.

Page 4: Graph Based Clustering

Graph Theory

• Graph theory can provide more detailed information about the inner structure of the data set in terms of

− cliques (subsets of nodes where each pair of elements is connected)

− clusters (highly connected groups of nodes)

− centrality (important nodes, hubs)

− outliers (unimportant nodes)

• Applications

− social network analysis

− diffusion of information

− spreading of diseases or rumours

⇒ marketing campaigns, viral marketing, social network advertising

Page 5: Graph Based Clustering

Graph-Based Clustering

• A collection of widely used clustering algorithms that are based on graph theory.

• They organize the information in large datasets so that users can access the required information faster.

Page 6: Graph Based Clustering

Idea

• Objects are represented as nodes in a complete or connected graph.

• Assign a weight to each branch between two nodes x and y.

The weight is the distance d(x,y) between the two objects.

[Diagram labels: distance between objects, distance between clusters, clustering]
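The idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a dict from node name to coordinates), the function name `complete_graph`, the sample points and the choice of the Euclidean distance for d(x,y) are not from the lecture.

```python
from itertools import combinations
from math import dist


def complete_graph(points):
    """Return the weighted branches of the complete graph on the given objects.

    points: dict mapping a node name to its coordinates (assumed input format).
    Every pair of nodes x, y is joined by a branch whose weight is the
    Euclidean distance d(x, y); any other distance could be used instead.
    """
    return [(dist(points[x], points[y]), x, y)
            for x, y in combinations(sorted(points), 2)]


# Hypothetical data: four objects in the plane.
points = {"a": (0, 0), "b": (3, 4), "c": (6, 8), "d": (0, 1)}
edges = complete_graph(points)
for weight, x, y in edges:
    print(f"d({x},{y}) = {weight:.2f}")
```

A complete graph on n objects has n(n-1)/2 branches, so for large datasets a sparser connected graph (for example, nearest neighbours only) may be preferable.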

Page 7: Graph Based Clustering

Idea

[Figure: graph → minimal spanning tree → clusters]

Page 8: Graph Based Clustering

Graph Based Clustering

Hierarchical method

(1) Determine a minimal spanning tree (MST)

(2) Delete branches iteratively

New connected components = Cluster

[Figure: example graph with branch weights 1, 3, 5, 8, 4, 6]

Page 9: Graph Based Clustering

Minimal Spanning Trees

Page 10: Graph Based Clustering

Minimal Spanning Tree

A minimal spanning tree of a connected, weighted graph G = (V,E)

is a connected subgraph with minimal total weight

that contains all nodes of G and has no cycles.

[Figure: graph G = (V, E) on the nodes a, b, c, d with branch weights 1, 3, 5, 8, 4, 6, shown together with its minimal spanning tree]
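In symbols, the definition can be read as follows. The notation is an assumption made for illustration: d_e denotes the weight of branch e, T = (V_T, E_T) a spanning tree of G, and d(T) its total weight.

```latex
d(T) = \sum_{e \in E_T} d_e ,
\qquad
T^{*} \in \operatorname*{arg\,min}_{T \text{ spanning tree of } G} d(T),
\qquad
V_T = V, \quad |E_T| = |V| - 1 .
```

The last identity holds for every spanning tree: a connected graph without cycles on |V| nodes has exactly |V| - 1 branches. If several branches have equal weight, the minimal spanning tree need not be unique.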

Page 11: Graph Based Clustering

Minimal spanning trees can be calculated with...

(1) Prim’s algorithm.

(2) Kruskal’s algorithm.

[Figure: graph on the nodes a, b, c, d with branch weights 1, 3, 5, 8, 4, 6]

Page 12: Graph Based Clustering

Example – Prim’s Algorithm

[Figure: graph on the nodes a, b, c, d with branch weights 1, 3, 5, 8, 4, 6]

Set VT = {a}, ET = { }

[Figure: the graph on the nodes a, b, c, d, branch weights 1, 3, 5, 8, 4, 6]

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

VT = {a,b} and ET = { (a,b) }.

Page 13: Graph Based Clustering

Example – Prim’s Algorithm

[Figure: the graph on the nodes a, b, c, d, branch weights 1, 3, 5, 8, 4, 6]

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

VT = {a,b,d} and ET = { (a,b), (a,d) }.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

VT = {a,b,c,d} and ET = { (a,b), (a,d),(b,c) }.

[Figure: the graph on the nodes a, b, c, d, branch weights 1, 3, 5, 8, 4, 6]

Page 14: Graph Based Clustering

Prim’s Algorithm

INPUT: Weighted graph G = (V, E), undirected + connected

OUTPUT: Minimal spanning tree T = (VT, ET)

(1) Set VT = {v}, ET = { }, where v is an arbitrary node from V (starting point).

(2) REPEAT

(3) Choose an edge (a,b) with minimal weight, such that a ∈ VT and b ∉ VT.

(4) Set VT = VT ∪ {b} and ET = ET ∪ { (a,b) }.

(5) UNTIL VT = V
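Steps (1) to (5) above could be implemented as in the following sketch. The binary heap used to find the cheapest branch leaving VT is an implementation choice, not part of the slide. The weight assignment in `weights` is an assumption as well: the transcript only preserves the six weights 1, 3, 4, 5, 6, 8, so they are placed here such that the run reproduces the tree ET = { (a,b), (a,d), (b,c) } from the example.

```python
import heapq


def prim_mst(graph, start):
    """Prim's algorithm on an undirected, connected, weighted graph.

    graph: dict node -> dict neighbour -> weight (assumed input format).
    Returns the branches of a minimal spanning tree as (x, y, weight).
    """
    visited = {start}                                   # VT
    tree = []                                           # ET
    frontier = [(w, start, y) for y, w in graph[start].items()]
    heapq.heapify(frontier)
    while frontier and len(visited) < len(graph):       # UNTIL VT = V
        w, x, y = heapq.heappop(frontier)               # cheapest candidate
        if y in visited:
            continue                                    # both ends in VT: skip
        visited.add(y)
        tree.append((x, y, w))
        for z, wz in graph[y].items():
            if z not in visited:
                heapq.heappush(frontier, (wz, y, z))
    return tree


# Hypothetical weight assignment for the four-node example graph.
weights = {("a", "b"): 1, ("a", "d"): 3, ("b", "c"): 4,
           ("a", "c"): 5, ("b", "d"): 6, ("c", "d"): 8}
graph = {}
for (x, y), w in weights.items():
    graph.setdefault(x, {})[y] = w
    graph.setdefault(y, {})[x] = w

print(prim_mst(graph, "a"))
```

With these assumed weights the branches are chosen in the order (a,b), (a,d), (b,c), which matches the three steps of the example.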

Page 15: Graph Based Clustering

Kruskal’s Algorithm

INPUT: Weighted graph G = (V, E), undirected + connected

OUTPUT: Minimal spanning tree T = (VT, ET)

(1) Set VT = V, ET = { }, H = E.

(2) Initialize a queue to contain all edges in G, using the weights in ascending order as keys.

(3) WHILE H ≠ { }

(4) Choose an edge e ∈ H with minimal weight.

(5) Set H = H \ {e}.

(6) If (VT, ET ∪ {e}) has no cycles, then ET = ET ∪ {e} .

(7) END
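A possible Python version of the steps above is sketched below. Sorting the branches once plays the role of the queue H, and the cycle test in step (6) is done with a union-find structure, which is a common implementation choice rather than something stated on the slide. The example branches use the same hypothetical weight assignment as an illustration; only the six weights themselves come from the lecture.

```python
def kruskal_mst(nodes, edges):
    """Kruskal's algorithm on an undirected, connected, weighted graph.

    edges: iterable of (weight, x, y) triples (assumed input format).
    Returns the branches of a minimal spanning tree as (x, y, weight).
    """
    parent = {v: v for v in nodes}        # union-find forest, one tree per node

    def find(v):
        """Return the representative of the component containing v."""
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []                              # ET
    for w, x, y in sorted(edges):          # H in ascending order of weight
        root_x, root_y = find(x), find(y)
        if root_x != root_y:               # adding (x, y) creates no cycle
            parent[root_x] = root_y
            tree.append((x, y, w))
    return tree


# Hypothetical weight assignment for the four-node example graph.
edges = [(1, "a", "b"), (3, "a", "d"), (4, "b", "c"),
         (5, "a", "c"), (6, "b", "d"), (8, "c", "d")]
print(kruskal_mst("abcd", edges))
```

Two nodes lie in the same union-find component exactly when they are already connected in (VT, ET), so a branch closes a cycle if and only if both of its ends have the same representative.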

Page 16: Graph Based Clustering

Branch Deletion

Page 17: Graph Based Clustering

Delete Branches - Different Strategies

(1) Delete the branch with maximum weight.

(2) Delete inconsistent branches.

(3) Delete by analysis of weights.

Page 18: Graph Based Clustering

(1) Delete the branch with maximum weight

• In each step, create two new clusters by deleting the branch with maximum weight.

• Repeat until the given number of clusters is reached.

[Figure: minimal spanning tree with branch weights 2, 2, 6, 3, 4, 2, 2, 2]

Page 19: Graph Based Clustering

Example: Delete the branch with maximum weight

[Figure: minimum spanning tree with branch weights 2, 2, 6, 3, 4, 2, 2, 2]

Ordered weights of branches: 6, 4, 3, 2, 2, 2, 2, 2.

Page 20: Graph Based Clustering

Example: Delete the branch with maximum weight

[Figure: minimum spanning tree with branch weights 2, 2, 6, 3, 4, 2, 2, 2]

Ordered weights of branches: 6, 4, 3, 2, 2, 2, 2, 2.

Step 1: Delete branch (weight 6) ⇒ 2 clusters

Page 21: Graph Based Clustering

Example: Delete the branch with maximum weight

[Figure: minimum spanning tree with branch weights 2, 2, 6, 3, 4, 2, 2, 2]

Ordered weights of branches: 6, 4, 3, 2, 2, 2, 2, 2.

Step 1: Delete branch (weight 6) ⇒ 2 clusters

Step 2: Delete branch (weight 4) ⇒ 3 clusters
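Strategy (1) can be sketched as follows. The tree topology is hypothetical: the transcript preserves only the eight branch weights, so they are arranged here along a path n1, ..., n9. The helper names are likewise chosen for illustration.

```python
def connected_components(nodes, edges):
    """Return the connected components of (nodes, edges) as a list of sets."""
    adjacency = {v: set() for v in nodes}
    for x, y, _ in edges:
        adjacency[x].add(y)
        adjacency[y].add(x)
    seen, components = set(), []
    for v in nodes:
        if v in seen:
            continue
        stack, component = [v], set()
        while stack:                       # depth-first search from v
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(adjacency[u] - component)
        seen |= component
        components.append(component)
    return components


def clusters_by_max_weight(nodes, tree, k):
    """Delete the k-1 heaviest branches of a spanning tree; return k clusters."""
    kept = sorted(tree, key=lambda e: e[2])[:len(tree) - (k - 1)]
    return connected_components(nodes, kept)


# Hypothetical path-shaped MST carrying the weights 2, 2, 6, 3, 4, 2, 2, 2.
nodes = [f"n{i}" for i in range(1, 10)]
branch_weights = [2, 2, 6, 3, 4, 2, 2, 2]
tree = [(nodes[i], nodes[i + 1], w) for i, w in enumerate(branch_weights)]

clusters = clusters_by_max_weight(nodes, tree, k=3)
print(clusters)
```

Deleting one branch of a tree always splits exactly one component into two, so k - 1 deletions give k clusters. If several branches tie for the maximum weight, this sketch deletes an arbitrary one of them.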

Page 22: Graph Based Clustering

(2) Delete inconsistent branches

• A branch e is inconsistent if its weight $d_e$ is (much) larger than a reference value $\bar{d}_e$.

• The reference value $\bar{d}_e$ can be defined as the average weight of all branches adjacent to e.

Example: the branch e has weight 6 and its adjacent branches have weights 3, 2 and 1.

$\bar{d}_e = \frac{3 + 2 + 1}{3} = 2$

$d_e = 6 > 2 = \bar{d}_e$ ⇒ e is inconsistent
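The inconsistency test can be written down as a short sketch. The topology of the small tree is an assumption (a branch e = (u,v) of weight 6 with neighbouring branches of weights 3, 2 and 1), and the parameter `factor`, which makes "(much) larger" adjustable, is hypothetical.

```python
def reference_value(tree, edge):
    """Average weight of all branches of `tree` sharing a node with `edge`.

    Returns None if `edge` has no adjacent branches.
    """
    x, y, _ = edge
    adjacent = [w for (u, v, w) in tree
                if (u, v) != (x, y) and {u, v} & {x, y}]
    return sum(adjacent) / len(adjacent) if adjacent else None


def inconsistent_branches(tree, factor=1.0):
    """Branches whose weight exceeds factor times their reference value."""
    result = []
    for edge in tree:
        ref = reference_value(tree, edge)
        if ref is not None and edge[2] > factor * ref:
            result.append(edge)
    return result


# Hypothetical tree: branch e = (u, v) with weight 6 and three neighbours.
tree = [("u", "v", 6), ("u", "p", 3), ("u", "q", 2), ("v", "r", 1)]

print(reference_value(tree, tree[0]))    # (3 + 2 + 1) / 3
print(inconsistent_branches(tree))
```

Only the branch of weight 6 is reported: the other three branches have reference values 4, 4.5 and 6, each larger than their own weight.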

Page 23: Graph Based Clustering

(3) Delete by analysis of weights

• Perform an “analysis” of all weights of branches in the MST. Determine a threshold S.

• The threshold can be estimated by histograms on the weights of branches (= length of branches).

• Delete a branch if its weight is higher than the threshold S.

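[Figure: two histograms of the branch weights (horizontal axis: weight of branch = length of branch; vertical axis: number of branches), the second one with the threshold S marked]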
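A minimal sketch of strategy (3) is given below. The tree, the text histogram and in particular the value of S are assumptions for illustration; the lecture does not prescribe how the threshold is read off the histogram.

```python
from collections import Counter

# Hypothetical MST as (x, y, weight) triples on a path n1, ..., n9.
nodes = [f"n{i}" for i in range(1, 10)]
branch_weights = [2, 2, 6, 3, 4, 2, 2, 2]
tree = [(nodes[i], nodes[i + 1], w) for i, w in enumerate(branch_weights)]


def weight_histogram(tree):
    """Count how many branches of the tree have each weight."""
    return Counter(w for _, _, w in tree)


def delete_above_threshold(tree, S):
    """Keep only the branches whose weight does not exceed the threshold S."""
    return [edge for edge in tree if edge[2] <= S]


histogram = weight_histogram(tree)
for weight in sorted(histogram):
    print(f"{weight}: {'#' * histogram[weight]}")

S = 3.5  # assumed threshold, chosen by eye from the histogram
kept = delete_above_threshold(tree, S)
print(len(tree) - len(kept), "branches deleted")
```

The clusters are then the connected components of the remaining branches. Unlike strategy (1), the number of clusters is not fixed in advance but follows from the choice of S.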

Page 24: Graph Based Clustering

Exercise

Find a minimal spanning tree and provide a clustering of the graph by deleting all inconsistent branches.

[Figure: weighted graph on the nodes a to g with twelve branches of weights 10, 2, 12, 4, 1, 3, 20, 8, 5, 9, 15, 6]

Page 25: Graph Based Clustering

Example

Set VT = {a}, ET = { }

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Page 26: Graph Based Clustering

Example

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Page 27: Graph Based Clustering

Example

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

Page 28: Graph Based Clustering

Example

Choose an edge (x,y) with minimal weight such that x ∈ VT and y ∉ VT.

minimal spanning tree

Page 29: Graph Based Clustering

Example

For each branch calculate the reference value

(average weight of adjacent branches)

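[Figure: minimal spanning tree on the nodes a to g with branch weights 2, 4, 1, 3, 5, 6; reference values shown in parentheses: (3), (3), (4.5), (3.6), (5), (4)]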

Page 30: Graph Based Clustering

Example

Delete inconsistent branches

(weight is larger than the reference value)

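[Figure: the tree on the nodes a to g after deletion of the inconsistent branches; labels of the remaining branches: weights 4, 1, 3 and reference values (3), (3), (4)]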

2 clusters

Noise?

Page 31: Graph Based Clustering

Summary

Page 32: Graph Based Clustering

Summary

• In graph-based clustering, objects are represented as nodes in a complete or connected graph.

• The distance between two objects is given by the weight of the corresponding branch.

• Hierarchical method

(1) Determine a minimal spanning tree (MST)

(2) Delete branches iteratively

• Visualization of information in large datasets.

Page 33: Graph Based Clustering

Literature

• V. Kumar, M. Steinbach, P.-N. Tan

Introduction to Data Mining.

Addison Wesley, 2005.

Other work mentioned in the presentation

• J.A. Dunne, R.J. Williams, N.D. Martinez, R.A. Wood, D.H. Erwin

Compilation and Network Analyses of Cambrian Food Webs.

PLoS Biol 6(4): e102. doi:10.1371/journal.pbio.0060102

• F. Schweitzer, G. Fagiolo, D. Sornette, F. Vega-Redondo, A. Vespignani, D.R. White

Economic Networks: The New Challenges.

Science 325, no. 5939 (July 24, 2009): 422-425.

Page 34: Graph Based Clustering

Thank you very much!