algorithms for geometric data streams 

56
Komplexitätstheorie und effiziente Algorithme Christian Sohler, TU Dortmund Algorithms for geometric data streams

Upload: ananda

Post on 17-Jan-2016

54 views

Category:

Documents


0 download

DESCRIPTION

Algorithms for geometric data streams . Christian  Sohler, TU Dortmund. Introduction. Data streams Massive data set arriving sequentially Different ways of „arriving“ Examples Network traffic Query logs … Approach - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Algorithms for  geometric data streams 

Komplexitätstheorie und effiziente Algorithmen

Christian  Sohler, TU Dortmund

Algorithms for geometric data streams 

Page 2: Algorithms for  geometric data streams 

2

Komplexitätstheorie und effiziente Algorithmen

Data streams• Massive data set arriving sequentially• Different ways of „arriving“

Examples• Network traffic• Query logs• …

Approach• Find algorithms that make a single (a few) pass(es) and process

data sequentially

Introduction

Page 3: Algorithms for  geometric data streams 

3

Komplexitätstheorie und effiziente Algorithmen

Geometric data streams• Massive sets of geometric objects arriving sequentially• Objects are typically points• Different form of arrival:

- sequence of points- sequence of updates

Questions• Find ways to analyze the geometric structure of the input data using

small space

Introduction

Page 4: Algorithms for  geometric data streams 

4

Komplexitätstheorie und effiziente Algorithmen

Motivation• Many computational tasks can be interpreted geometrically• Geometric features may be useful in learning and classification• Geometry plays an important role in the application

Examples• Learning • Clustering• How ‚clusterable‘ is a data set?• Road traffic prediction

Introduction

Page 5: Algorithms for  geometric data streams 

5

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects

Introduction

Page 6: Algorithms for  geometric data streams 

6

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects

Introduction

Page 7: Algorithms for  geometric data streams 

7

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes

Introduction

Page 8: Algorithms for  geometric data streams 

8

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes

Introduction

Page 9: Algorithms for  geometric data streams 

9

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes• Learn from examples to which class

future objects belong

Introduction

?

Page 10: Algorithms for  geometric data streams 

10

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes• Learn from examples to which class

future objects belong• Map object‘s description to Euclidean space

Introduction

?

Page 11: Algorithms for  geometric data streams 

11

Komplexitätstheorie und effiziente Algorithmen

A basic learning problem• We have two classes of objects• We are given examples from both classes• Learn from examples to which class

future objects belong• Map object‘s description to Euclidean space

SVM approach • Compute maximum margin hyperplane• Classifiy points according to their side

Introduction

?

Page 12: Algorithms for  geometric data streams 

12

Komplexitätstheorie und effiziente Algorithmen

SVM and SEB (smallest enclosing balls)• Dual of certain SVM formulation is SEB

[Tax, Duin, Pattern Recognition Letters, ‘99]• Geometric streaming SEB can be used as SVM heuristic

[Rai, Daume III, Venkatasubramanian, IJCAI‘09]• Also: Coresets have been used

to construct CSVMs[Tsang, Kwok, Cheung, Journal of Machine Learning Research, ’05]

Introduction

?

Page 13: Algorithms for  geometric data streams 

13

Komplexitätstheorie und effiziente Algorithmen

Outline • Merge & Reduce• Embeddings into tree metrics• Estimation of distribution of local neighborhoods• Balanced partitions• Approximating properties of balanced partitions

Introduction

Page 14: Algorithms for  geometric data streams 

14

Komplexitätstheorie und effiziente Algorithmen

Insertion-only streams• Sequence of points p ,…, p from R

Merge & Reduce

1 nd

Page 15: Algorithms for  geometric data streams 

15

Komplexitätstheorie und effiziente Algorithmen

Definition [k-median clustering]Given a weighted set P of points in R the k-median problem is to find

a set CR of k points (centers) such that

cost(P,C) = wmin ||p-c||

is minimized, where w >0 is the weight of point p.

Merge & Reduce

d

pP cC

d

p

p

Page 16: Algorithms for  geometric data streams 

16

Komplexitätstheorie und effiziente Algorithmen

Coreset [Har-Peled, Mazumdar, STOC’04]

A weighted point set S is a (k,)-coreset of a weighted point set P, if for every set C of k centers

| cost(P,C) – cost(S,C) | cost(P,C).

Merge & Reduce

3

3

3

33

4

4

Page 17: Algorithms for  geometric data streams 

17

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 18: Algorithms for  geometric data streams 

18

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Coreset

Page 19: Algorithms for  geometric data streams 

19

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 20: Algorithms for  geometric data streams 

20

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Coreset

Page 21: Algorithms for  geometric data streams 

21

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Coreset of Union of Coreset

Page 22: Algorithms for  geometric data streams 

22

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 23: Algorithms for  geometric data streams 

23

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 24: Algorithms for  geometric data streams 

24

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 25: Algorithms for  geometric data streams 

25

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 26: Algorithms for  geometric data streams 

26

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 27: Algorithms for  geometric data streams 

27

Komplexitätstheorie und effiziente Algorithmen

Observation• Union of two (k,)-coresets is a (k,)-coreset• Can compute coreset of a coreset

Merge & Reduce

… Input Stream

Page 28: Algorithms for  geometric data streams 

28

Komplexitätstheorie und effiziente Algorithmen

Coresets by pre-clustering [Guha, Mishra, Motwani, O‘Callaghan, FOCS’00; Har-Peled, Mazumdar, STOC’04; Frahling, S., STOC‘05]

• Compute a pre-clustering S with >k centers and cost(P,S) Opt• Size exponential in d

Merge & Reduce

3

3

3

33

4

4

k

Page 29: Algorithms for  geometric data streams 

29

Komplexitätstheorie und effiziente Algorithmen

Coresets by sampling [Chen, SICOMP’09; Feldman, Monemizadeh, S., SoCG‘07]

• Compute a random non-uniform sample• Show that sample approximates all solutions from a net• Size polynomial in d

Merge & Reduce

MM M/4

Page 30: Algorithms for  geometric data streams 

30

Komplexitätstheorie und effiziente Algorithmen

Coresets by reduction to 1D [Har-Peled, Kushal, DCG’07, Feldman, Fiat, Sharir, FOCS‘06]

• Uses geometric arguments to solve 1D• Combine with preclusting using line centers• For k-median: Size independent of n (but exponential in d)

Merge & Reduce

Page 31: Algorithms for  geometric data streams 

31

Komplexitätstheorie und effiziente Algorithmen

Open problems • Coresets for k-median of size independent of n and d ?

(Partial result in [Feldman, Monemizadeh, S., SoCG’07])• Coresets for k-median of size O(d/²)• Coresets for k-median of size poly(d, log n)/for

constant c=c(d)>0• Coresets for j-subspace 1-median of size poly(, d, j, log n) ?• Same questions for k-means objective function

Remark: Open questions refer to the definition of coresets from this talk.

Merge & Reduce

2-c

Page 32: Algorithms for  geometric data streams 

32

Komplexitätstheorie und effiziente Algorithmen

Insertion/deletion model• Stream consists of Insert(p), Delete(p) operations• Points are from {1,…, }• Stream is consistent, i.e. no Delete(p), if p is not present and no

Insert(p), if p is already present in the current set

Geometric update streams

d

Page 33: Algorithms for  geometric data streams 

33

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

s

t

Page 34: Algorithms for  geometric data streams 

34

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

s

t tsr

p q

Page 35: Algorithms for  geometric data streams 

35

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

s

t tsr

p q

p q sr t

2 i

2 i2 i

2 i

Page 36: Algorithms for  geometric data streams 

36

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

s

t tsr

p q

p q

2 i-1

2 i2 i

2 i

qps

r

s tr

2 i-1 2 i-1 2 i-1

Page 37: Algorithms for  geometric data streams 

37

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

s

t tsr

p q

p q

2 i2 i

2 i

qps

r

s tr

2 i-1 2 i-1 2 i-1

r s

2 i-22 i-2

Page 38: Algorithms for  geometric data streams 

38

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics

p

q

r

s

t tsr

p q

p q

2 i2 i

2 i

qps

r

s tr

2 i-1 2 i-1 2 i-1

r s

2 i-22 i-2

Page 39: Algorithms for  geometric data streams 

39

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms via embeddings into tree metrics

Embeddings in tree metrics D(.,.)• ||p-q|| D(p,q)• E[D(p,q)] = O(log )||p-q||

[Bartal, FOCS’96;Charikar, Chekuri, Goel, Guha,Plotkin, FOCS’98]

tsr

p q

p q

2 i2 i

2 i

qps

r

s tr

2 i-1 2 i-1 2 i-1

r s

2 i-22 i-2

Page 40: Algorithms for  geometric data streams 

40

Komplexitätstheorie und effiziente Algorithmen

Estimator for cost of Euclidean minimum spanning tree (EMST) [Indyk, STOC’04]

• Write EMST for cost of EMST• Write MST for cost of minimum spanning tree of tree metric D• E[MST ] = O(log ) EMST (linearity of expectation)• Use cost of MST of D as estimator

Streaming algorithms viaembeddings into tree metrics

D

D

Page 41: Algorithms for  geometric data streams 

41

Komplexitätstheorie und effiziente Algorithmen

Observation [Indyk, STOC’04]

• The MST of D(.,.) is given by the tree defining the tree metric • #edges of length 2 = #non-empty cells in corresponding grid

Streaming algorithms via embeddings into tree metrics

p

q

r

s

t

tsr

p q

p q sr t

2 i2 i

2 i

i

2 i

Page 42: Algorithms for  geometric data streams 

42

Komplexitätstheorie und effiziente Algorithmen

Euclidean minimum spanning tree1. Use O(log nested grids G(i) with side length 2

2. for each grid

3. approximate |G(i)| := #nonempty cells in G(i) using F sketch

4. return 2 |G(i)|

Theorem [Indyk, STOC’04]

The above algorithm computes a O(log )-approximation to the cost of the minimum spanning tree.

Streaming algorithms viaembeddings into tree metrics

i

i

0

Page 43: Algorithms for  geometric data streams 

43

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms viaembeddings into tree metrics

Results using a similar approach [Indyk, STOC’04]

Earth mover‘s distance O(log )

Facility location O(log² )

Matching O(log )

k-Median O(1)

1+ with huge extraction time

Problem Approx. factor

Page 44: Algorithms for  geometric data streams 

44

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms viaestimating the distribution of local neighborhoods

Distribution of neighborhoods • Grids G(i) as before

• R-neighborhood of C: cells within distance at most R from C

• m (i) is number of points in i-th cell of the R-neighborhood of C

1 2 3

4 5 6 7 8

9 10 11 12 13

14 15 16 17 18

19 20 21

C,R

A cell and its 2-neighborhood

Page 45: Algorithms for  geometric data streams 

45

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms viaestimating the distribution of local neighborhoods

EMST estimator• Define Z (i) = ( m (i) > 0 )• EMST can be approximated from the Z (i)• Approx. ratio goes to 1 as R goes to

C,R C,R

C,R

Page 46: Algorithms for  geometric data streams 

46

Komplexitätstheorie und effiziente Algorithmen

Streaming algorithms viaestimating the distribution of local neighborhoods

EMST estimator• K: Size of R-neighborhood• Z are functions from {1,…,K} to {0,1}• Random (nonempty) C defines distribution over neighborhoods,

i.e. over functions Z:{1,…,K} {0,1}• Can still estimate EMST from this distribution

C,R

Page 47: Algorithms for  geometric data streams 

47

Komplexitätstheorie und effiziente Algorithmen

Algorithm• Sample a certain number of nonempty grid cells and maintain

number of points for each cell in their neighborhood• Sample gives estimation of the distribution of the Z (.)• Obtain estimation for EMST from estimated distribution

Theorem [Frahling, Indyk, S., IJCGA’07]

Let >0, d be constants.The cost of a Euclidean minimum spanning tree of a point set in R given as an update stream can be estimated with a factor of 1 using polylog() space.

Streaming algorithms viaestimating the distribution of local neighborhoods

C,R

d

Page 48: Algorithms for  geometric data streams 

48

Komplexitätstheorie und effiziente Algorithmen

Open Problems• (1+)-approximation for matching and/or earth mover‘s distance• Other problems? Approach is not very well understood• General characterization of problems solvable via approximation of

the distribution of local neighborhoods

Streaming algorithms viaestimating the distribution of local neighborhoods

Page 49: Algorithms for  geometric data streams 

49

Komplexitätstheorie und effiziente Algorithmen

Estimating the distribution [Frahling, S., STOC’05]

• Divide space into regions• For each region maintain #points inside• Balance „error“ among regions• Notion of error depends on problem

Example• 1-Median in 1D• Error cell width #points in cell

Streaming algorithms viabalanced partitions

Page 50: Algorithms for  geometric data streams 

50

Komplexitätstheorie und effiziente Algorithmen

Small space?• Problem dependent • Need to show that decomposition in few regions with sufficiently

small error exists

Streaming algorithms viabalanced partitions

Page 51: Algorithms for  geometric data streams 

51

Komplexitätstheorie und effiziente Algorithmen

One approach [Frahling, S., STOC’05]

• Nested grids G(i)• For each grid maintain cells intersected by random sample

(sample sizes differ for different grids)• #sample points inside cell -> #points inside cell• Combine cells from different grids to space decomposition

Streaming algorithms viabalanced partitions

Page 52: Algorithms for  geometric data streams 

52

Komplexitätstheorie und effiziente Algorithmen

Works for• k-median• k-means• MaxTSP, MaxMatching, Maximum spanning tree, Average distance,

MaxCut

Why?• Require proof for k-median and k-means• Last 5 problems can be reduced to 1-median

Streaming algorithms viabalanced partitions

Page 53: Algorithms for  geometric data streams 

53

Komplexitätstheorie und effiziente Algorithmen

Approximating properties of balanced partitions[Lammersen, S., ESA‘08]• Previous approach may lead to many regions• Example: facility location• Can approximate properties of balanced partitions, e.g. #regions• Only gives approximation of cost of solution• More details in Christiane‘s talk

Streaming algorithms viaapproximation of balanced partitions

Page 54: Algorithms for  geometric data streams 

54

Komplexitätstheorie und effiziente Algorithmen

Open problems• Min-sum-k-clustering• Other problems?

Streaming algorithms viabalanced partitions

Page 55: Algorithms for  geometric data streams 

55

Komplexitätstheorie und effiziente Algorithmen

(Some) Techniques in geometric streaming:• Merge & Reduce• Embeddings into tree metrics• Estimation of distribution of local neighborhoods• Balanced partitions• Approximating properties of balanced partitions

And lots of open problems to work on…

Summary

Page 56: Algorithms for  geometric data streams 

56

Komplexitätstheorie und effiziente Algorithmen

Thank you!