Density estimation in linear time (+approximating L1-distances)

Satyaki Mahalanabis, Daniel Štefankovič

University of Rochester


TRANSCRIPT

Page 1: Satyaki Mahalanabis Daniel Štefankovič

Satyaki Mahalanabis, Daniel Štefankovič

University of Rochester

Density estimation in linear time (+approximating L1-distances)

Page 2

Density estimation

[Diagram: DATA + a family F = {f1, ..., f6} → output density]

F = a family of densities

Page 3

Density estimation - example

[Diagram: data + a family of shifted normal densities N(μ,1)]

0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625

F = a family of normal densities with σ=1

Page 4

Measure of quality:

L1 - distance from the truth

|f-g|1 = ∫ |f(x)-g(x)| dx     (g = TRUTH, f = OUTPUT)

Why L1?

1) small L1 ⇒ all events estimated with small additive error
2) scale invariant

Page 5

Obstacles to "quality":

DATA + F

weak class of densities → dist1(g,F)

bad data → ?

Page 6

What is bad data?

g = TRUTH, h = DATA (empirical density)

Δ := 2 max |h(A)-g(A)| over A ∈ Y(F)     (compare |h-g|1 = 2 sup |h(A)-g(A)| over all events A)

Y(F) = Yatracos class of F:  Aij = { x | fi(x) > fj(x) }

[Diagram: densities f1, f2, f3 and the induced sets A12, A13, A23]

Page 7

Density estimation

DATA (h) + F → f with small |g-f|1

assuming these are small:  dist1(g,F)  and  Δ = 2 max |h(A)-g(A)| over A ∈ Y(F)

Page 8

Why would these be small?    (dist1(g,F) and Δ = 2 max |h(A)-g(A)| over A ∈ Y(F))

They will be if:

1) pick a large enough F (so dist1(g,F) is small)
2) pick a small enough F so that the VC-dimension of Y(F) is small
3) data are i.i.d. from g

Theorem (Haussler, Dudley, Vapnik, Chervonenkis):

E[ max |h(A)-g(A)| over A ∈ Y ] = O( sqrt( VC(Y) / #samples ) )

Page 9-12

How to choose from 2 densities?

[Diagram: densities f1, f2; the set T = {x : f1(x) > f2(x)} is labeled +1, its complement -1]

Compare ∫T f1, ∫T f2, and ∫T h (h = empirical measure of the data).

Scheffé test: if ∫T h > ∫T (f1+f2)/2, output f1; else output f2.

Theorem (see DL'01): |f-g|1 ≤ 3 dist1(g,F) + 2Δ
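The Scheffé test above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: densities are assumed given as callables, the integrals over T are approximated on a user-supplied uniform grid, and `scheffe_choose` is a hypothetical name.

```python
def scheffe_choose(f1, f2, samples, grid):
    """Choose between densities f1, f2 by the Scheffe test.

    T = {x : f1(x) > f2(x)}; output f1 iff h(T) > (f1(T) + f2(T)) / 2,
    where h is the empirical measure of the samples and f(T) is computed
    by numerical integration over `grid` (assumed equally spaced).
    """
    dx = grid[1] - grid[0]
    in_T = [f1(x) > f2(x) for x in grid]
    f1_T = sum(f1(x) for x, t in zip(grid, in_T) if t) * dx
    f2_T = sum(f2(x) for x, t in zip(grid, in_T) if t) * dx
    # empirical mass of T: fraction of samples falling in T
    h_T = sum(1 for s in samples if f1(s) > f2(s)) / len(samples)
    return f1 if h_T > (f1_T + f2_T) / 2 else f2
```

For example, with f1 uniform on [0,1), f2 uniform on [1,2), and all samples in [0,1), the test returns f1.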


Page 14

Test functions

Tij(x) = sgn(fi(x) - fj(x))

∫ Tij (fi - fj) = ∫ (fi - fj) sgn(fi - fj) = |fi - fj|1

F = {f1, f2, ..., fN}

Compare ∫ Tij h against ∫ Tij fi and ∫ Tij fj: fi wins if ∫ Tij h is closer to ∫ Tij fi (i.e. ∫ Tij h > ∫ Tij (fi+fj)/2); otherwise fj wins.

Page 15-16

Density estimation algorithms

Scheffé tournament (n² tests): pick the density with the most wins.
Theorem (DL'01): |f-g|1 ≤ 9 dist1(g,F) + 8Δ

Minimum distance estimate (Y'85) (n³ time): output fk ∈ F that minimizes max over i,j of |∫ (fk-h) Tij|.
Theorem (DL'01): |f-g|1 ≤ 3 dist1(g,F) + 2Δ

Can we do better?
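The Scheffé tournament's quadratic pair loop is easy to sketch. This is an illustration only: `wins` stands in for the pairwise Scheffé test (a hypothetical callback, not an API from the paper).

```python
def scheffe_tournament(F, wins):
    """F: list of N densities; wins(i, j) -> True iff f_i beats f_j
    on the Scheffe test for the pair (f_i, f_j), called with i < j.
    Returns the index of the density with the most wins."""
    score = [0] * len(F)
    for i in range(len(F)):
        for j in range(i + 1, len(F)):  # all N*(N-1)/2 pairwise tests
            if wins(i, j):
                score[i] += 1
            else:
                score[j] += 1
    return max(range(len(F)), key=lambda i: score[i])
```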

Page 17

Our algorithm: Efficient minimum loss-weight

repeat until one distribution left:
1) pick the pair of distributions in F that are furthest apart (in L1)
2) eliminate the loser

Take the most "discriminative" action.

Theorem [MS'08]: |f-g|1 ≤ 3 dist1(g,F) + 2Δ, using only a linear number of tests*

* after preprocessing F
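The elimination loop can be sketched directly. This is the naive version (each round scans all surviving pairs, so O(N³) overall), not the preprocessed one the later slides discuss; `l1` and `beats` are hypothetical inputs standing in for precomputed L1-distances and the pairwise Scheffé test.

```python
def min_loss_weight(n, l1, beats):
    """n densities 0..n-1; l1[(i, j)] = |f_i - f_j|_1 for i < j;
    beats(i, j) -> index of the Scheffe winner of the pair.
    Repeatedly tests the furthest-apart surviving pair and
    eliminates the loser; returns the surviving index."""
    alive = set(range(n))
    while len(alive) > 1:
        # 1) pick the furthest-apart surviving pair (in L1)
        i, j = max(
            ((a, b) for a in alive for b in alive if a < b),
            key=lambda p: l1[p],
        )
        # 2) eliminate the loser of their Scheffe test
        loser = i if beats(i, j) == j else j
        alive.remove(loser)
    return alive.pop()
```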

Page 18: Satyaki Mahalanabis Daniel Štefankovič

Tournament revelation problem INPUT: a weighed undirected graph G (wlog all edge-weights distinct)

OUTPUT: REPORT: heaviest edge {u1,v1} in G ADVERSARY eliminates u1 or v1 G1

REPORT: heaviest edge {u2,v2} in G1

ADVERSARY eliminates u2 or v2 G2

.....OBJECTIVE: minimize total time spent generating reports
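One simple way to serve these reports is a max-heap of edges that lazily discards edges with an eliminated endpoint. This corresponds to the O(|F|² log |F|) preprocessing / O(|F|²) total run-time regime mentioned on a later slide; the class and its methods are an illustrative sketch, not code from the paper.

```python
import heapq

class TournamentRevelation:
    def __init__(self, edges):
        # edges: list of (weight, u, v); weights negated for a max-heap
        self.heap = [(-w, u, v) for w, u, v in edges]
        heapq.heapify(self.heap)
        self.alive = {x for _, u, v in edges for x in (u, v)}

    def heaviest_edge(self):
        # pop stale edges (an endpoint was eliminated) until the top is valid
        while self.heap:
            _, u, v = self.heap[0]
            if u in self.alive and v in self.alive:
                return u, v
            heapq.heappop(self.heap)
        return None

    def eliminate(self, x):
        self.alive.discard(x)
```

With a weighting consistent with the example on the next slides (BC heaviest), the report sequence BC, AD, CD comes out as the adversary eliminates B and then A.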

Page 19-23

Tournament revelation problem - example

[Diagram: graph on vertices A, B, C, D with edge weights 1-6]

report the heaviest edge: BC; adversary eliminates B
report the heaviest edge: AD; adversary eliminates A
report the heaviest edge: CD

Page 24

Tournament revelation problem

[Diagram: decision tree of reports — root BC; each adversary choice (eliminate B or C) leads to a subtree of next reports: AD, BD, DB, DC, AC, AB, ...]

2^O(|F|) preprocessing, O(|F|) run-time
O(|F|² log |F|) preprocessing, O(|F|²) run-time

WE DO NOT KNOW: can we get O(|F|) run-time with polynomial preprocessing?

Page 25

Efficient minimum loss-weight

repeat until one distribution left:
1) pick the pair of distributions that are furthest apart (in L1)
2) eliminate the loser

2^O(|F|) preprocessing, O(|F|) run-time
O(|F|² log |F|) preprocessing, O(|F|²) run-time

WE DO NOT KNOW: can we get O(|F|) run-time with polynomial preprocessing?

(in practice step 2 is more costly)

Page 26

Efficient minimum loss-weight

repeat until one distribution left:
1) pick the pair of distributions that are furthest apart (in L1)
2) eliminate the loser

Theorem: |f-g|1 ≤ 3 dist1(g,F) + 2Δ

Proof: for every f' to which f loses,

|f-f'|1 ≤ max |f'-f''|1 over f'' to which f' loses

"that guy lost even more badly!"

Page 27

Proof: for every f' to which f loses,

|f-f'|1 ≤ max |f'-f''|1 over f'' to which f' loses

"that guy lost even more badly!"

[Diagram: f1 suffers a bad loss; BEST = f2; f3]

2 ∫h T23 ≥ ∫f2 T23 + ∫f3 T23

∫(f1-f2) T12 ≤ ∫(f2-f3) T23

∫(f4-h) T23

∫(fi-fj)(Tij-Tkl) ≥ 0

⇒ |f1-g|1 ≤ 3|f2-g|1 + 2Δ

Page 28

Application: kernel density estimates (Akaike'54, Parzen'62, Rosenblatt'56)

K = kernel, used to smooth the empirical measure
h = density; g = empirical measure of x1, x2, ..., xn, i.i.d. samples from h

g * K = (1/n) Σ over i=1..n of K(y - xi)

g * K → h * K  as n → ∞
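The displayed estimate g * K = (1/n) Σ K(y - xi) is a one-liner to evaluate. A minimal sketch, using a Gaussian kernel with bandwidth s (the bandwidth is introduced on the next slide); nothing here is specific to the paper beyond the formula itself.

```python
import math

def kde(samples, y, s=1.0):
    """Evaluate the kernel density estimate at y with bandwidth s,
    using the Gaussian kernel K(x) = exp(-x^2/2) / sqrt(2*pi)."""
    n = len(samples)
    k = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    # (g * K_s)(y) = (1/(n*s)) * sum_i K((y - x_i) / s)
    return sum(k((y - xi) / s) for xi in samples) / (n * s)
```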

Page 29

g * K = (1/n) Σ over i=1..n of K(y - xi) → h * K  as n → ∞

What K should we choose?

Dirac would be good (h * Dirac = h); Dirac is not good (g * Dirac = g does not converge).

Something in-between: bandwidth selection for kernel density estimates

Ks(x) = K(x/s) / s;   as s → 0, Ks → Dirac

Theorem (see DL'01): as s → 0 with sn → ∞,  |g * Ks - h|1 → 0

Page 30

Data splitting methods for kernel density estimates

g * Ks = (1/(ns)) Σ over i=1..n of K((y - xi)/s)

How to pick the smoothing factor s?

Split x1, x2, ..., xn into x1, ..., x_{n-m} and x_{n-m+1}, ..., xn.

fs = (1/((n-m)s)) Σ over i=1..n-m of K((y - xi)/s)

choose s using density estimation
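The data-splitting scheme can be sketched as follows: one candidate estimate fs per bandwidth from the first n-m points, selection on the held-out m points. Note the selection rule here is a simple stand-in (held-out log-likelihood), NOT the Scheffé-test-based density estimation selection the slides describe; all names are illustrative.

```python
import math

def pick_bandwidth(xs, m, candidates):
    """Split xs into train (first n-m) and holdout (last m) points,
    build one Gaussian KDE f_s per candidate bandwidth s from the
    train points, and pick the s whose f_s scores best on holdout."""
    train, holdout = xs[:-m], xs[-m:]
    k = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

    def f_s(y, s):
        return sum(k((y - xi) / s) for xi in train) / (len(train) * s)

    def score(s):  # held-out log-likelihood of f_s (stand-in criterion)
        return sum(math.log(f_s(y, s) + 1e-300) for y in holdout)

    return max(candidates, key=score)
```

On clustered data this rejects both near-Dirac and near-flat bandwidths in favor of a moderate one.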

Page 31

Kernels we will use:

(1/(ns)) Σ over i of K((y - xi)/s)

[Diagram: a piecewise uniform kernel and a piecewise linear kernel]

Page 32-34

Bandwidth selection for uniform kernels

N distributions, each piecewise uniform with n pieces; m datapoints. E.g. N ~ n^{1/2}, m ~ n^{5/4}.

Goal: run the density estimation algorithm efficiently

The tests compare ∫g Tij with ∫(fi+fj) Tij / 2, so the quantities needed are the distances |fi-fj|1 and the integrals ∫(fk-h) Tkj.

[Table: EMLW and MD both need the N² distances |fi-fj|1 (time n each); EMLW needs N of the integrals ∫(fk-h)Tkj, MD needs N² of them (time n + m log n each).]

Can we speed this up?

absolute error bad, relative error good

Page 35

Approximating L1-distances between distributions

N piecewise uniform densities (each with n pieces)

TRIVIAL (exact): N²n

WE WILL DO: (N² + Nn)(log N)²

Page 36

Dimension reduction for L2

Johnson-Lindenstrauss Lemma ('82)

φ: L2 → L2^t,  t = O(ε⁻² ln n)

(∀ x, y ∈ S, |S| = n)   d(x,y) ≤ d(φ(x), φ(y)) ≤ (1+ε) d(x,y)

[projection matrix entries drawn from N(0, t^{-1/2})]

Page 37

Dimension reduction for L1

Cauchy Random Projection (Indyk'00)

φ: L1 → L1^t,  t = O(ε⁻² ln n)

(∀ x, y ∈ S, |S| = n)   d(x,y) ≤ est(φ(x), φ(y)) ≤ (1+ε) d(x,y)

[projection entries: N(0, t^{-1/2}) replaced by C(0, 1/t)]

(Charikar, Brinkman '03: cannot replace est by d)

Page 38: Satyaki Mahalanabis Daniel Štefankovič

Cauchy distribution C(0,1)density function: 1

(1+x2)

XC(0,1) aXC(0,|a|)

XC(0,a), YC(0,b)X+YC(0,a+b)

FACTS:
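These stability facts are easy to check numerically. A small seeded sketch: C(0,s) is sampled as s·tan(π(U - 1/2)), and since the median of |Z| for Z ~ C(0,s) is exactly s, the median of |X+Y| with X ~ C(0,1), Y ~ C(0,2) should land near 3.

```python
import math, random, statistics

def cauchy(s, rng):
    """Sample from C(0, s) by the inverse-CDF formula."""
    return s * math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(0)
n = 200_000
# sum of independent C(0,1) and C(0,2) should be distributed as C(0,3)
z = [cauchy(1, rng) + cauchy(2, rng) for _ in range(n)]
med = statistics.median(abs(v) for v in z)  # estimates the scale, ~3
```

The median of |·| is used rather than the mean because Cauchy variables have no mean.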

Page 39-40

Cauchy random projection for L1 (Indyk'00)

[Diagram: independent Cauchy increments X1, ..., X9 over the cells of a partition; two step densities with values A and B; another density with value D; a cell of length z]

the increment over a cell of length z is X1 ~ C(0,z); a step density projects to a weighted sum of increments, e.g. A(X2+X3) + B(X5+X6+X7+X8), or D(X1+X2+...+X8+X9)

the difference of the projections of two step densities is distributed as Cauchy(0, |·-·|1), i.e. with scale equal to their L1-distance
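The projection idea for step densities can be sketched end to end. This is an illustration under simplifying assumptions (both densities given on a common grid of cells; the scale is recovered by the median-of-|·| estimator over t independent projections); the function name and interface are hypothetical.

```python
import math, random, statistics

def l1_estimate(A, B, lengths, t, rng):
    """A, B: cell values of two step densities on cells of the given
    lengths. Each projection draws one Cauchy increment per cell, so
    proj(A) - proj(B) ~ C(0, |A - B|_1); the median of the absolute
    differences over t projections estimates that scale."""
    diffs = []
    for _ in range(t):
        # increment over a cell of length l is distributed as C(0, l)
        x = [l * math.tan(math.pi * (rng.random() - 0.5)) for l in lengths]
        pa = sum(a * xi for a, xi in zip(A, x))
        pb = sum(b * xi for b, xi in zip(B, x))
        diffs.append(abs(pa - pb))
    return statistics.median(diffs)  # median of |C(0,s)| is s
```

For the uniform densities on [0,1] and [1,2] (two unit cells, values [1,0] and [0,1]), the true L1-distance is 2 and the estimate concentrates around it.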

Page 41-42

All pairs L1-distances: piecewise linear densities

[Diagram: two crossing linear pieces R and B over one cell]

X1, X2 ~ C(0, 1/2)

R = (3/4)X1 + (1/4)X2,  B = (3/4)X2 + (1/4)X1

R - B = (1/2)(X1 - X2) ~ C(0, 1/2)

Page 43

All pairs L1-distances: piecewise linear densities

Problem: too many intersections!

Solution: cut into even smaller pieces!

Stochastic measures are useful.

Page 44

Brownian motion: increment density (1/(2π)^{1/2}) exp(-x²/2)
Cauchy motion: increment density 1/(π(1+x²))

[Plots: a Brownian motion path and a Cauchy motion path on [0,1]]

Page 45

Brownian motion: increment density (1/(2π)^{1/2}) exp(-x²/2)

[Plot: a Brownian motion path on [0,1]]

∫ f dL = Y ~ N(0, Σ),  f: R → R^d

computing integrals is easy

Page 46

Cauchy motion: increment density 1/(π(1+x²))

[Plot: a Cauchy motion path on [0,1]]

∫ f dL = Y ~ C(0, s) for d=1,  f: R → R^d

computing integrals is easy for d=1; hard* for d>1

* obtaining an explicit expression for the density

Page 47-48

What were we doing?

[Diagram: Cauchy increments X1, ..., X9 over the cells of a partition]

∫ (f1, f2, f3) dL = ((w1)1, (w2)1, (w3)1)

Can we efficiently compute integrals ∫ φ dL for φ piecewise linear?

Page 49-50

Can we efficiently compute integrals ∫ φ dL for φ piecewise linear?

φ: R → R²,  φ(z) = (1, z)

(X, Y) = ∫ φ dL

(2(X-Y), 2Y) has density at (u+v, u-v)/2

Page 51

All pairs L1-distances for mixtures of uniform densities in time O((N² + Nn)(log N)²)

All pairs L1-distances for piecewise linear densities in time O((N² + Nn)(log N)²)

Page 52

QUESTIONS

1) φ: R → R³,  φ(z) = (1, z, z²),  (X, Y, Z) = ∫ φ dL ?

2) higher dimensions?