Applying Electromagnetic Field Theory Concepts to Clustering with Constraints Huseyin Hakkoymaz, Georgios Chatzimilioudis, Dimitrios Gunopulos and Heikki Mannila


Post on 15-Jan-2016


Page 1: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Huseyin Hakkoymaz, Georgios Chatzimilioudis, Dimitrios Gunopulos and Heikki Mannila

Page 2: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Motivation
• Well-known problem, the curse of dimensionality:
  – As the number of dimensions increases, distance metrics start losing their usefulness
• Relative distances
  – Unlike exact distances, relative distance-based metrics have some immunity to the curse
    • e.g. the shortest path calculated over the edges of a graph
• Local distance adjustments
  – In many domains, local changes affect the whole system
    • Cancer cells in a body, sensor depletion in a network, etc.
    • The same idea is valid for distance metrics
  – Relative distances supported by pairwise constraints perform much better
    • Constraints cause changes in local distances
[Figure: (a) Change of a unit shape as dimensionality increases. (b) The distance matrix becomes useless if the number of dimensions keeps increasing.]
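The loss of contrast between distances can be illustrated numerically. The sketch below is not from the slides: it assumes uniformly random points in the unit hypercube and uses a hypothetical helper, relative_contrast, that reports how much the farthest point differs from the nearest one relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast shrinks as the dimensionality grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

A small contrast means the nearest and farthest neighbors are almost equally far away, which is the sense in which a raw distance matrix stops being informative.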

Page 3: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Motivation (2)
• Best environment to realize these objectives: a graph
  – The graph is treated as an electromagnetic field (EMF)
• Pairwise constraints are expressed naturally
  – Constraints act as EMF sources exerting force over edges
  – The force causes reduction or escalation of edge weights
• No limit on the amount of reduction/escalation, thanks to the graph domain
  – Cartesian space metrics are bounded by the triangle inequality

[Figure: example graph on vertices a to j, s and t, drawn twice; a negative constraint is marked.]

Page 4: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Related Work
• Distance metric learning [Xing et al.]:
  – Global linear transformation of the data points
    • Different weights for each dimension
  – Shortcomings:
    • May fail in some cases
    • Plain Euclidean distance may do better
• Integrating constraints and metric learning in semi-supervised clustering [Bilenko et al.]:
  – Local weights for each cluster
    • Readjustment of weights at each iteration
  – Combines constraints and metric learning in the objective function
  – Shortcomings:
    • Sometimes fails to adjust weights locally
    • No guarantee of better accuracy with more constraints

[Figure: panels labeled K-Means, K-Means + Dist. Metric and MPCK-Means; weight annotations: w1x = w1y, w2x = w2y and w1x > w1y, w2x < w2y.]

Page 5: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Related Work
• Semi-supervised Graph Clustering: A Kernel Approach [Kulis et al.]:
  – Mapping of data points into a new feature space
  – Similarity between the Kernel-KMeans and graph clustering objectives
  – Works for both vector and graph data
  – Shortcomings:
    • An optimal kernel is required for good results
    • Computing the optimal kernel takes a long time
    • Relies mostly on the min-cut objective, not on distance
[Figure: Correct Clustering vs. SS-Kernel-KMeans Approach]

Page 6: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Magnetically Affected Paths (MAP)
• Two special edges for constraints:
  – Positive edge: must-link constraint
  – Negative edge: cannot-link constraint
• Definitions:
  – Reduction ratio: amount of decrement in edge weight (+)
  – Escalation ratio: amount of increment in edge weight (-)
[Figure: example with positive edges and a negative edge]

Page 7: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Magnetically Affected Paths (MAP)
• Each constraint edge affects regular edges based on:
  – Constraint type
  – Vertical distance (vd): distance to the constraint axis
  – Horizontal distance (hd): distance to the midpoint of the constraint axis
• Vertical and horizontal effects: probabilistic model
  – If vd increases, the effect decreases for both (+) and (-) constraints
  – If hd increases, the effect decreases for (-) constraints
  – hd has no effect on (+) constraints
[Figure: two panels, Vertical Distance and Horizontal Distance, showing vd(u,v) and hd(u,v) of an edge e(u,v) relative to the axis through s and t and its midpoint; a plot labeled vd, hd and effect.]
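As a geometric illustration of vd and hd, the sketch below computes both for a single edge. It rests on assumptions that the slides do not state: points are 2-D coordinate tuples, the edge (u, v) is represented by its midpoint, and the function name vd_hd is hypothetical.

```python
import math


def vd_hd(u, v, s, t):
    """Vertical and horizontal distance of edge (u, v) w.r.t. constraint (s, t).

    Assumption: the edge is represented by its midpoint. vd is the
    perpendicular distance to the line through s and t; hd is the distance,
    measured along that line, from the projection of the edge midpoint to
    the midpoint of (s, t).
    """
    mx, my = (u[0] + v[0]) / 2, (u[1] + v[1]) / 2   # edge midpoint
    cx, cy = (s[0] + t[0]) / 2, (s[1] + t[1]) / 2   # constraint midpoint
    ax, ay = t[0] - s[0], t[1] - s[1]               # axis direction
    length = math.hypot(ax, ay)
    if length == 0:
        raise ValueError("s and t must be distinct points")
    ax, ay = ax / length, ay / length               # unit axis vector
    rx, ry = mx - cx, my - cy                       # offset from the midpoint
    hd = abs(rx * ax + ry * ay)                     # component along the axis
    vd = abs(rx * ay - ry * ax)                     # component across the axis
    return vd, hd


if __name__ == "__main__":
    # Constraint axis on the x-axis from (0, 0) to (4, 0); edge midpoint (3, 3).
    print(vd_hd((3, 2), (3, 4), (0, 0), (4, 0)))    # (3.0, 1.0)
```

How vd and hd are turned into an effect is left out here, since the slides give only the qualitative rules listed above.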

Page 8: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Magnetically Affected Paths (MAP)
• Compute escalation/reduction ratios of each constraint
[Equations, partially recoverable: escalationRatio and reductionRatio are each a norm(·) expression in r and the constraint weight (qe for escalation, qr for reduction); the escalationRatio expression carries one further parenthesized factor. Two auxiliary quantities are each formed from d(u,s), d(u,t), d(v,s) and d(v,t) over d(s,t); the operators joining these terms are missing.]
where
  r = vertical distance effect
  ∆ = horizontal distance effect
  qe = weight of a cannot-link constraint
  qr = weight of a must-link constraint
Typically, qe/qr ≈ 1.6
[Figure: example graph on vertices a to j, s and t with an edge (u,v) of weight w(u,v), a negative constraint and an origin marked, drawn twice; coordinate labels: (14,0), (0,14), (2,17), (4,16), (6,8), (11,7), (12,2), (5,12), (7,10), (13,4), (7,14), (18,3).]

Page 9: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Magnetically Affected Paths (MAP)
• Compute overall escalation/reduction ratio on an edge
  – The overall effect on an edge is quantified as the total effect of all constraints
• Multiply overall ratio by edge weight to assign new edge weight (1<α<∞)
[Equations, partially recoverable: overallRatio(u,v) combines a sum of escalationRatio_i(u,v) over i = 1..|C| with a sum of reductionRatio_j(u,v) over j = 1..|M|; w_new(u,v) is then obtained from w(u,v) and overallRatio(u,v). The operators and the placement of α are missing.]
[Figure: vertices a and b among constraint pairs (s1,t1), (s2,t2) and (s3,t3), with constraints labeled pos1, neg1 and neg2.]

Page 10: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

EMC (ElectroMagnetic Field Based Clustering) Framework
• Three-step clustering framework
  – Graph construction
  – Readjustment of edge weights
  – Clustering process
[Figure: example graph on vertices a to j, s and t, drawn twice.]

Page 11: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

EMC (ElectroMagnetic Field Based Clustering) Framework
• Graph construction
  – Select the n nearest neighbors of each object
  – Connect each neighborhood and use Euclidean distance as the edge weight
  – If the graph is not connected, add new edges between the disconnected components
• Readjustment of edge weights
  – Apply the MAP concept to the graph
    • All (+) and (-) edges are applied before the clustering step
  – Extract a new affinity matrix using the new edge weights
  – Employ the k-shortest-path distance as the distance metric
    • Better than a single shortest path
    • Can utilize MAP better
    • Very slow for large graphs
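The graph construction step can be sketched as follows. This is a minimal illustration, not the authors' Java/Matlab code: the function name knn_graph and the adjacency-dict representation are hypothetical, and joining disconnected components by their closest pair of points is an assumed rule (the slides only say that new edges are added).

```python
import math


def knn_graph(points, n_neighbors):
    """Build an undirected n-nearest-neighbor graph with Euclidean weights.

    Returns {i: {j: weight}}. Disconnected components are joined by their
    closest pair of points (an assumed joining rule).
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:n_neighbors]:
            w = math.dist(points[i], points[j])
            graph[i][j] = w
            graph[j][i] = w

    def component(start):
        """Set of vertices reachable from start (depth-first search)."""
        seen, stack = {start}, [start]
        while stack:
            for nxt in graph[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reached = component(0) if n else set()
    while len(reached) < n:
        # Closest pair between the reached component and everything else.
        i, j = min(((a, b) for a in reached
                    for b in range(n) if b not in reached),
                   key=lambda ab: math.dist(points[ab[0]], points[ab[1]]))
        w = math.dist(points[i], points[j])
        graph[i][j] = w
        graph[j][i] = w
        reached = component(0)
    return graph
```

For example, knn_graph([(0, 0), (1, 0), (10, 0), (11, 0)], 1) first links the two close pairs and then adds one bridging edge of weight 9 between vertices 1 and 2.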

Page 12: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

EMC (ElectroMagnetic Field Based Clustering) Framework
• Clustering process
  – Run a clustering algorithm using the new affinity matrix
  – Any clustering algorithm compatible with graphs can be used
    • K-Means
    • Hierarchical
    • SS-Kernel-KMeans, etc.
  – We used the K-Medoids and hierarchical clustering algorithms
    • Since they give similar results, we report only the K-Medoids results
  – A small number of constraints improves accuracy significantly
    • Other algorithms need more constraints to achieve the same performance
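Because K-Medoids needs only pairwise distances, it can run directly on a matrix derived from the readjusted graph. The sketch below is a simple alternating variant written for illustration; the slides do not say which K-Medoids variant was used, and the deterministic choice of the first k objects as initial medoids is an assumption made to keep the example reproducible.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster n objects given an n x n distance matrix (list of lists).

    Alternates two steps until the medoids stop changing: assign each
    object to its nearest medoid, then move each medoid to the cluster
    member with the smallest total distance to the other members.
    """
    n = len(dist)
    medoids = list(range(k))          # assumed initialization: first k objects
    labels = [0] * n
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:           # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


if __name__ == "__main__":
    values = [0, 1, 2, 10, 11, 12]
    matrix = [[abs(a - b) for b in values] for a in values]
    print(k_medoids(matrix, 2))       # ([1, 4], [0, 0, 0, 1, 1, 1])
```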

Page 13: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Two improvements for k-shortest paths
• K-SD shortest path algorithm
  – Based on Dijkstra's algorithm
  – Each vertex keeps k distance entries
    • Paths are distinct (two paths cannot have a common edge)
  – Just k times slower than Dijkstra's algorithm
• Divide-and-conquer approach (multilevel approach)
  – Partition the graph using multilevel graph partitioning
    • Kmetis: partitions large graphs into equal-sized subgraphs
    • Very fast (takes just a few seconds to partition very large graphs)
  – Identify hubs
    • The nodes residing on the boundary of a partition
    • Connected to at least two partitions
    • These are the only way from one partition to the next
[Figure: hubs between two partitions]
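The idea of k edge-distinct shortest paths can be illustrated with a simpler stand-in technique: run Dijkstra repeatedly and delete the edges of each path found. This greedy procedure is not the K-SD algorithm from the slides (which keeps k distance entries per vertex in a single pass); it only shows what "paths without a common edge" means. The function names and the {node: {neighbor: weight}} representation are assumptions.

```python
import heapq


def dijkstra_path(graph, src, dst):
    """Return (length, path) of a shortest src-dst path, or None if unreachable.

    graph is {node: {neighbor: weight}} with non-negative weights.
    """
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                        # stale heap entry
        if u == dst:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]


def k_edge_disjoint_lengths(graph, src, dst, k):
    """Lengths of up to k edge-disjoint src-dst paths, found greedily."""
    work = {u: dict(nbrs) for u, nbrs in graph.items()}   # private copy
    lengths = []
    for _ in range(k):
        found = dijkstra_path(work, src, dst)
        if found is None:
            break
        length, path = found
        lengths.append(length)
        for a, b in zip(path, path[1:]):    # drop used edges (undirected)
            work[a].pop(b, None)
            work[b].pop(a, None)
    return lengths
```

On a graph with two routes from s to t, s-a-t of length 2 and s-b-t of length 4, asking for k = 3 returns [2.0, 4.0] because no third edge-disjoint path exists.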

Page 14: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Two improvements for k-shortest paths
• Divide-and-conquer approach (cont.)
  – Extract the distance matrix for each partition
  – Merge the distance matrices using the hubs
  – At least 20 times faster than the original K-SD shortest path algorithm
  – Applicable to very large graphs
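Given a partition assignment (for example one produced by a graph partitioner), hubs can be picked out as the nodes with at least one neighbor in a different partition. The sketch below assumes that reading of "boundary node"; find_hubs and the dict-based inputs are hypothetical names, not part of the original implementation.

```python
def find_hubs(graph, part):
    """Return the set of nodes that have a neighbor in another partition.

    graph: {node: {neighbor: weight}}; part: {node: partition id}.
    """
    return {u for u, nbrs in graph.items()
            if any(part[v] != part[u] for v in nbrs)}


if __name__ == "__main__":
    # Path a - b - c - d split into partitions {a, b} and {c, d}.
    graph = {"a": {"b": 1}, "b": {"a": 1, "c": 1},
             "c": {"b": 1, "d": 1}, "d": {"c": 1}}
    part = {"a": 0, "b": 0, "c": 1, "d": 1}
    print(sorted(find_hubs(graph, part)))   # ['b', 'c']
```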

Page 15: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Divide-and-Conquer Approach

Page 16: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Constructing Hub graph and extracting SHub matrix

Page 17: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Constructing Hub graph and extracting SHub matrix

SHub

Page 18: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Computing of K-SD shortest path distance

Page 19: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Computing of K-SD shortest path distance

SHub

• Update the distances from the first partition's node1 to the second partition's hubs through the first hub
• SHub is used for the transition from first-partition hubs to second-partition hubs

Page 20: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Computing of K-SD shortest path distance

SHub

• Update the distances from the first partition's node1 to the second partition's hubs through the second hub
• SHub is used for the transition from first-partition hubs to second-partition hubs

Page 21: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Computing of K-SD shortest path distance

SHub

• Update the distances from the first partition's node1 to the second partition's hubs through the last hub
• SHub is used for the transition from first-partition hubs to second-partition hubs

Page 22: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Computing of K-SD shortest path distance

SHub

• Update the distances from the second partition's nodes to the first partition's node1 through the second partition's hubs
• SHub is used for the transition from first-partition hubs to second-partition hubs
• At this point, all second-partition hubs have their distances to the first partition's node1
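The hub-by-hub updates walked through on these slides amount to routing every cross-partition distance through a pair of hubs. The sketch below shows that for a single (not k-distinct) distance; the function name and the nested-dict layout of the per-partition matrices and of SHub are assumptions made for illustration.

```python
def cross_partition_distance(x, y, d1, d2, s_hub, hubs1, hubs2):
    """Distance from x (partition 1) to y (partition 2) routed through hubs.

    d1[a][b] and d2[a][b] are within-partition distances; s_hub[h1][h2] is
    the distance between a hub of partition 1 and a hub of partition 2.
    """
    best = float("inf")
    for h1 in hubs1:                 # leave partition 1 through hub h1
        for h2 in hubs2:             # enter partition 2 through hub h2
            candidate = d1[x][h1] + s_hub[h1][h2] + d2[h2][y]
            best = min(best, candidate)
    return best


if __name__ == "__main__":
    d1 = {"x": {"p": 1.0, "q": 5.0}}
    d2 = {"r": {"y": 2.0}}
    s_hub = {"p": {"r": 4.0}, "q": {"r": 1.0}}
    # Via p: 1 + 4 + 2 = 7; via q: 5 + 1 + 2 = 8.
    print(cross_partition_distance("x", "y", d1, d2, s_hub, ["p", "q"], ["r"]))
```

Only hub rows and columns are consulted when crossing partitions, which is why the merge cost depends on the number of hubs rather than on the full graph size.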

Page 23: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Experiments

• Implemented in Java and Matlab

• Synthetic and real datasets

• Datasets from UCI Machine Learning Repository:

– Soybean, Iris, Wine, Ionosphere, Balance, Breast cancer, Satellite

Page 24: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Experiments
• EMCK-Means experiments:
  – Graph construction
    • Varied the number of paths and the number of nearest neighbors
  – Readjustment phase
    • The number of constraints is increased in steps of 10% · |Dataset|
• Compared algorithms:
  – MPCK-Means: unifies constraint-based and metric-based approaches
  – Diagonal Metric: learns a distance metric with weighted dimensions
  – EMCK-Means: MAP implementation with K-Medoids
  – SS-Kernel-KMeans: performs graph clustering based on the min-cut objective
• Experimental setup:
  – The same constraint sets are used for each algorithm
  – Constraints are chosen at random
    • x% · N constraints, where N is the dataset size
  – Each algorithm is run 200 times
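Drawing x% · N random constraints can be sketched as below. The slides do not say how a constraint's type is decided; this sketch assumes the common practice of deriving it from ground-truth class labels, and sample_constraints is a hypothetical helper.

```python
import random


def sample_constraints(labels, percent, seed=0):
    """Draw round(percent / 100 * N) random pairwise constraints.

    Returns (must_link, cannot_link) as lists of index pairs (i, j), i < j.
    Assumption: a pair is must-link when its ground-truth labels agree and
    cannot-link otherwise.
    """
    rng = random.Random(seed)
    n = len(labels)
    count = round(percent / 100 * n)
    max_pairs = n * (n - 1) // 2
    must, cannot, seen = [], [], set()
    while len(seen) < min(count, max_pairs):
        i, j = sorted(rng.sample(range(n), 2))
        if (i, j) in seen:
            continue                  # skip pairs that were already drawn
        seen.add((i, j))
        (must if labels[i] == labels[j] else cannot).append((i, j))
    return must, cannot
```

Fixing the seed makes it straightforward to hand the same constraint set to every algorithm under comparison, as the setup above requires.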

Page 25: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Experiments
• Clustering results for EMCK-Means on:
  – Wine, Balance, Breast Cancer, Ionosphere, Iris and Soybean datasets
    • The number of shortest paths is varied from 5 to 20

Page 26: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Comparison of Algorithms
• Comparison of EMC, MPCK-Means, K-Means + Diagonal Metric and SS-Kernel-KMeans
  – EMC outperforms the others on Iris, Balance and Ionosphere
  – Reasonable results for Soybean and Breast Cancer
  – Almost no gain at all for Wine

Page 27: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Conclusions
• The EMC framework offers flexible and more accurate clustering in the graph domain
  – Other clustering algorithms can be integrated into the framework
  – A small number of constraints improves accuracy significantly
  – More constraints can be applied at any time
  – Running time drops significantly as the number of partitions, p, increases
• Future work
  – Multilevel EMC
    • Coarsen the graph
    • Perform clustering
    • Refinement
  – Performs much faster than other algorithms without any significant change in accuracy
  – No hubs or merge process

Page 28: Applying Electromagnetic Field Theory Concepts to Clustering with Constraints

Thank you!