Testing of clustering
Article by: Noga Alon, Seannie Dar, Michal Parnas and Dana Ron
Presented by: Nir Eitan

Post on 19-Dec-2015


Page 1: Testing of clustering

Article by: Noga Alon, Seannie Dar, Michal Parnas and Dana Ron
Presented by: Nir Eitan

Page 2: What will I talk about?

- General definition of clustering and motivations
- Being (k,b) clusterable
- Sublinear property tester
- Solving for a general metric
- Better result for a specific metric & cost function

Page 3: Motivation

What is a clustering problem? Cluster analysis, or clustering, is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. It is a method of unsupervised learning.

Page 4: Motivation

What is it used for?
- Image segmentation, object recognition, face detection
- Social network analysis
- Bioinformatics: grouping sequences into gene families
- Crime analysis
- Market research
- And many more

Page 5: Clustering

Being (k,b) clusterable:
- Input: a set X of n d-dimensional points
- Output: can X be partitioned into k subsets, so that the cost of each is at most b?
Different cost measures: radius cost and diameter cost.
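The two cost measures can be sketched in a few lines of Python (an illustrative sketch: the function names are mine, and restricting radius centers to the cluster's own points is a common discrete simplification, not a definition from the article):

```python
import math

def diameter_cost(cluster):
    """Diameter cost: the largest pairwise distance within the cluster."""
    return max(math.dist(p, q) for p in cluster for q in cluster)

def radius_cost(cluster):
    """Radius cost: the smallest r such that some center point covers the
    whole cluster within distance r.  Here the candidate centers are
    restricted to the cluster's own points (a discrete simplification)."""
    return min(max(math.dist(c, p) for p in cluster) for c in cluster)
```

For the triangle {(0,0), (1,0), (0,1)} the diameter cost is sqrt(2) while the radius cost (with center (0,0)) is 1, showing the two measures can differ.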

Page 6: Hardness

How hard is it?
- NP-complete! (for both cost measures, when d > 1)
- For a general metric, it is hard to approximate the cost of an optimal clustering to within a factor of 2.
- Diameter cost can be solved in (O(n))^(dk^2) time (disjoint convex hulls).

Page 7: Sublinearity

We would like a sublinear tester which tells us whether the input is (k,b) clusterable, or far from it: property testing.
Input: a set X of n d-dimensional points
Output:
- If X is (k,b) clusterable, answer yes.
- If X is ε-far from being (k,(1+β)b) clusterable, reject with probability at least 2/3.
Being ε-far means there is no such clustering even after removing any nε points.

Page 8: Testers covered in the article

- Solving for general metrics and β = 1
- L2 metric, radius cost: can be solved for β = 0 (no approximation)
- L2 metric, diameter cost: can be solved with O(p(d,k)·β^(-2d)) samples
- Lower bounds
I will focus on the first and the third.

Page 9: Testing of clustering under general metric

Will show an algorithm with β = 1, for radius clustering. Assumes the triangle inequality.
Idea: find representatives, i.e. points whose pairwise distances are all greater than 2b.
Algorithm:
- Maintain a list of representatives, and greedily try to add valid points to it (choosing the points uniformly and independently).
- Do this for up to m iterations.
- If at any stage |rep| > k, reject; otherwise accept.
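The greedy representative algorithm can be sketched as follows (a Python sketch; the names are mine, and the sample count m = 6k/ε anticipates the analysis on the next slides). If X were (k,b) clusterable, two representatives more than 2b apart could not share a cluster of radius b (triangle inequality), so more than k representatives certify non-clusterability:

```python
import math
import random

def general_metric_tester(X, k, b, eps, dist=math.dist, rng=random):
    """Greedy representative tester (beta = 1, radius cost).  Samples
    m = 6k/eps points; a point farther than 2b from all current
    representatives becomes a new representative.  More than k
    representatives means X cannot be (k, b) clusterable, so reject."""
    m = math.ceil(6 * k / eps)
    reps = []
    for _ in range(m):
        x = rng.choice(X)  # uniform, independent sample
        if all(dist(x, r) > 2 * b for r in reps):
            reps.append(x)
            if len(reps) > k:
                return False  # reject
    return True  # accept
```

Note the one-sided error: a (k,b) clusterable input is always accepted, whichever points are sampled.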

Page 10: Testing of clustering under general metric - Analysis

Case 1: X is (k,b) clusterable. The algorithm will always accept.
Case 2: X is ε-far from being (k,2b) clusterable.
- More than εn candidate representatives exist at every stage: otherwise, removing the at most εn candidates would leave every point within 2b of one of the at most k representatives, giving a (k,2b) clustering and contradicting ε-farness.
- So the probability of drawing a candidate at every stage is ≥ ε.
- Can use a Chernoff bound for m samples of Bernoulli trials with p = ε.

Page 11: Testing of clustering under general metric - Analysis

Case 2, continued:
- Take m to be 6k/ε.
- Expected number of representatives after m iterations > mε.
- The algorithm fails if fewer than k = (1/6)mε are found.
- Use the Chernoff bound to get a failure probability < 1/3: Pr[ΣXi < (1-γ)pm] < exp(-(1/2)γ²pm).
- Running time is O(mk) = O(k²/ε).
- Can be done similarly for diameter cost.
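Plugging the slide's parameters into this Chernoff form makes the 1/3 bound concrete (a worked check of the arithmetic):

```latex
% With m = 6k/\epsilon and p = \epsilon, the expected number of
% representatives is pm = 6k.  Failing means finding fewer than
% k = \tfrac{1}{6}pm, i.e. \gamma = \tfrac{5}{6}:
\Pr\Big[\textstyle\sum X_i < k\Big]
  = \Pr\Big[\textstyle\sum X_i < \big(1-\tfrac{5}{6}\big)pm\Big]
  \le \exp\Big(-\tfrac{1}{2}\cdot\big(\tfrac{5}{6}\big)^2\cdot 6k\Big)
  = e^{-25k/12} < \tfrac{1}{3} \qquad (k \ge 1).
```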

Page 12: Finding a clustering under general metric

Finding an approximately good clustering: if the set is (k,b) clusterable, return t ≤ k clusters of radius ≤ 2b with at most εn points left outside, w.h.p.
- Use the same algorithm as before, and return the representatives list.
- The probability that more than εn points fall outside the enlarged radius is < 1/3.

Page 13: L2 metric - diameter clustering

Can handle any 0 < β < 1. Proof stages:
- Prove for d = 1
- Prove for d = 2, k = 1
- Prove for any d ≥ 2, k = 1
- Prove for any d and k

Page 14: 1-dimensional clustering

- Can be solved deterministically in polynomial time.
- No real difference between diameter and radius cost.
- A sublinear algorithm with β = 0 will be shown here:
  - Select uniformly and independently m = Θ((k/ε)·log(k/ε)) random points.
  - Check whether they can be (k,b) clustered.
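The sampled 1-D check can be sketched as follows (a Python sketch assuming diameter cost, so each cluster fits in an interval of length b; the constant inside the Θ(...) sample size is an arbitrary placeholder, not taken from the article):

```python
import math
import random

def fits_k_intervals(points, k, b):
    """Exact 1-D check: can the points be covered by at most k intervals
    of length b (diameter cost; radius cost would use length 2b)?
    Greedy: open each interval at the leftmost uncovered point."""
    pts = sorted(points)
    used, i = 0, 0
    while i < len(pts):
        used += 1
        if used > k:
            return False
        right = pts[i] + b
        while i < len(pts) and pts[i] <= right:
            i += 1
    return True

def one_dim_tester(X, k, b, eps, rng=random):
    """Sample Theta((k/eps) * log(k/eps)) points uniformly and
    independently, then run the exact 1-D check on the sample."""
    m = math.ceil(4 * (k / eps) * max(1.0, math.log(k / eps)))
    sample = [rng.choice(X) for _ in range(m)]
    return fits_k_intervals(sample, k, b)
```

The greedy interval cover is what makes the 1-D case easy: opening each interval at the leftmost uncovered point is optimal, which is exactly why the problem is deterministically solvable in polynomial time.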

Page 15: 1-dimensional clustering

If X is (k,b) clusterable, clearly any subset of it is (k,b) clusterable as well, and the algorithm will accept.
Lemma: let X be ε-far from being (k,b) clusterable. Then there exist k nonintersecting segments, each of length 2b, such that there are at least εn/(k+1) points of X between every two consecutive segments, as well as to the left of the leftmost segment and to the right of the rightmost segment.

[Figure: example with k = 4; each of the k+1 gaps contains nε/(k+1) points of X]

Page 16: 1-dimensional clustering

From a balls-and-bins analysis, one gets that with good probability (> 2/3) a point is chosen from each one of those inter-segment gaps, so the algorithm rejects in this case.

Page 17: 2-dimensional clustering with L2

A sublinear algorithm, dependent on β, will be shown for d = 2 and the L2 metric, with diameter clustering.
Algorithm:
- Take m samples, and check if they form a (k,b) clustering.
- Start with k = 1 (one cluster).

Page 18: Some definitions

- Cx denotes the disk of radius b centered at x.
- I(T) denotes the intersection of all disks Cx over the points x in T.
- A(R) denotes the area of a region R.
- Uj denotes the union of all sampled points up to phase j.
- A point is influential with respect to I(Uj) if it causes a significant decrease in the area of I(Uj): more than 0.5(βb)².
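A(I(T)) and the influential-point test can be approximated numerically, e.g. by Monte-Carlo sampling (an illustrative sketch of the definitions, not the article's method; it estimates the area by sampling the bounding square of one disk, which always contains I(T)):

```python
import random

def area_I(T, b, trials=200_000, rng=random.Random(0)):
    """Monte-Carlo estimate of A(I(T)): the area of the intersection of
    the radius-b disks C_x centered at the points x of T.  Samples the
    bounding square of the first disk, which contains I(T)."""
    cx, cy = T[0]
    hits = 0
    for _ in range(trials):
        x = cx + rng.uniform(-b, b)
        y = cy + rng.uniform(-b, b)
        if all((x - px) ** 2 + (y - py) ** 2 <= b * b for px, py in T):
            hits += 1
    return (hits / trials) * (2 * b) ** 2

def is_influential(p, U, b, beta):
    """p is influential w.r.t. I(U) if adding it shrinks the area of the
    intersection by more than 0.5 * (beta * b)**2."""
    return area_I(U, b) - area_I(U + [p], b) > 0.5 * (beta * b) ** 2
```

For a single point T = [x], I(T) is just the disk Cx, so the estimate should come out near πb².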

Page 19: 2-dimensional clustering with L2

- Divide the m samples into phases.
- For phase = 1 to p = 2π/β²:
  - Choose (uniformly and independently) ln(3p)/ε points.
- Claim: for X which is ε-far from being (k,(1+β)b) clusterable, in every phase j there are at least εn influential points with respect to I(Uj-1).
- This will be proved using the next lemmas.

Page 20: Geometric claim

Let C be a circle of radius at most b. Let s and t be any two points on C, and let o be a point on the segment connecting s and t such that dist(s,o) ≥ b. Consider the line perpendicular to the line through s and t at o, and let w be its closer meeting point with the circle C. Then dist(w,o) ≥ dist(o,t)/2.

[Figure: circle C with chord l through s and t = (α,η), point o on the chord, and the perpendicular line l' at o meeting C at w]
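One way to see the claim is the power of the point o: if w' is the other intersection of the perpendicular with C, then dist(o,w)·dist(o,w') = dist(o,s)·dist(o,t), with dist(o,w') ≤ 2b and dist(o,s) ≥ b. A numeric spot-check for the case r = b = 1 can be sketched as (names mine, not from the article):

```python
import math
import random

def check_geometric_claim(trials=2000, rng=random.Random(1)):
    """Numeric spot-check of the claim on the unit circle (r = b = 1):
    if o lies on the chord st with dist(s,o) >= b, the closer
    perpendicular intersection w satisfies dist(w,o) >= dist(o,t)/2."""
    for _ in range(trials):
        a1, a2 = rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi)
        s = (math.cos(a1), math.sin(a1))
        t = (math.cos(a2), math.sin(a2))
        chord = math.dist(s, t)
        if chord <= 1.0:                 # need room for dist(s,o) >= b = 1
            continue
        d = rng.uniform(1.0, chord)      # o = s + d * unit(t - s)
        ux, uy = (t[0] - s[0]) / chord, (t[1] - s[1]) / chord
        o = (s[0] + d * ux, s[1] + d * uy)
        # perpendicular at o: o + lam * (-uy, ux); intersect the unit circle
        # |o + lam*n|^2 = 1  ->  lam^2 + 2*lam*(o.n) + |o|^2 - 1 = 0
        on = -o[0] * uy + o[1] * ux
        disc = on * on - (o[0] ** 2 + o[1] ** 2) + 1.0
        root = math.sqrt(max(disc, 0.0))
        closer = min(abs(-on + root), abs(-on - root))  # dist(w, o)
        if closer < (chord - d) / 2 - 1e-9:             # vs dist(o, t) / 2
            return False
    return True
```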

Page 21: Lemma

Let T be any finite subset of R². Then for every x, y in I(T) such that x is noninfluential with respect to T, dist(x,y) ≤ (1+β)b.
- Use the geometric claim to prove it.
- Reminder: a point is influential if it reduces the area by more than 0.5(βb)².

Page 22: 2 dimensions - Conclusion

- This means that if X is ε-far from being (k,(1+β)b) clusterable, there are at least εn influential points in each phase.
- Given the sample size, the probability of getting an influential point in each phase is at least 2/3 (union bound).
- If there is an influential point in each phase, then by the end of the sampling the set T of sampled points would have to satisfy A(I(T)) < 0, which is impossible; therefore the algorithm must reject.
- For d = 2, the sample size is m = Θ((1/ε)·(1/β)²·log(1/β)).
- Running time: O(m²).

Page 23: Getting to higher dimensions

- In the general case the sample size needed is Θ((1/ε)·d^(3/2)·log(1/β)·(2/β)^d).
- Define an influential point as one which reduces the volume by more than (βb)^d·V_(d-1)/(d·2^(d-1)), where V_d denotes the volume of the d-dimensional unit ball.
- Number of phases: d·V_d·(2/β)^d/(2·V_(d-1)).
- For every plane that contains the line xy, the same geometric argument as before can be used, giving a base of area (h/2)^(d-1)·V_(d-1), which yields h ≤ βb as needed.

Page 24: Getting to higher k

- For general k, the sample size needed is m = Θ((k²·log(k)/ε)·d·(2/β)^(2d)).
- Running time is exponential in k and d.
- Uses roughly the same idea as before; now take p(k) = k·(p+1), where p was the number of phases taken for k = 1.
- An influential point is now a point which is influential for all current clusters (same threshold value as for k = 1).

Page 25: Getting to higher k

So can we set the number of samples in every phase to be ln(3p(k))/ε, as before? The answer is no, as there are multiple possibilities of influential partitions.
An influential partition is a k-partition of all the influential points found up to the given phase.

Page 26: Getting to higher k

- Consider all the possible partitions of the samples taken up to phase j.
- The total number of possible influential partitions after phase j is at most k^j.
- Take a different sample size for every phase: mj = ((j-1)·ln(k) + ln(3p(k)))/ε. A union bound then gives the needed result.
- Summing over all mj gives m = Θ((k²·log(k)/ε)·d·(2/β)^(2d)).
- We again get that A(I(T)) < 0 would be forced, and the algorithm rejects w.h.p.
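A short worked check of why this choice of mj absorbs the influential partitions (per the slides, phase j must hit an influential point for every partition built from the previous phases, of which there are at most k^(j-1)):

```latex
% For a fixed influential partition, phase j misses all of its at least
% \epsilon n influential points with probability at most
% (1-\epsilon)^{m_j} \le e^{-\epsilon m_j}.
% With m_j = \frac{(j-1)\ln k + \ln(3p(k))}{\epsilon}:
e^{-\epsilon m_j} = k^{-(j-1)} \cdot \frac{1}{3p(k)},
% so a union bound over the at most k^{j-1} partitions entering phase j,
% and then over the p(k) phases, bounds the total failure probability by
\sum_{j=1}^{p(k)} k^{j-1} \cdot k^{-(j-1)} \cdot \frac{1}{3p(k)} = \frac{1}{3}.
```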

Page 27: Thank you for listening

[Image: Star Cluster R136 Bursts Out]