Density estimation in linear time (+approximating L1-distances)

Satyaki Mahalanabis, Daniel Štefankovič

University of Rochester


TRANSCRIPT

Page 1: Satyaki Mahalanabis Daniel Štefankovič

Satyaki Mahalanabis, Daniel Štefankovič

University of Rochester

Density estimation in linear time (+approximating L1-distances)

Page 2

Density estimation

[Diagram: DATA + a family F = {f1, ..., f6} → output density]

F = a family of densities

Page 3

Density estimation - example

[Diagram: data + a family of shifted normal densities N(μ,1)]

0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625

F = a family of normal densities with σ=1

Page 4

Measure of quality:

L1 - distance from the truth

|f-g|1 = ∫ |f(x)-g(x)| dx     (g = TRUTH, f = OUTPUT)

Why L1?

1) small L1 ⇒ all events estimated with small additive error
2) scale invariant

Page 5

Obstacles to "quality":

DATA + F

weak class of densities → dist1(g,F)

bad data → ?

Page 6

What is bad data?

g = TRUTH, h = DATA (empirical density)

Δ := 2 max |h(A)-g(A)| over A ∈ Y(F)     (compare |h-g|1 = 2 sup |h(A)-g(A)| over all events A)

Y(F) = Yatracos class of F:  Aij = { x | fi(x) > fj(x) }

[Diagram: densities f1, f2, f3 and the induced sets A12, A13, A23]

Page 7

Density estimation

DATA (h) + F → f with small |g-f|1

assuming these are small:  dist1(g,F)  and  Δ = 2 max |h(A)-g(A)| over A ∈ Y(F)

Page 8

Why would these be small?    (dist1(g,F) and Δ = 2 max |h(A)-g(A)| over A ∈ Y(F))

They will be if:

1) pick a large enough F (so dist1(g,F) is small)
2) pick a small enough F so that the VC-dimension of Y(F) is small
3) data are i.i.d. from g

Theorem (Haussler, Dudley, Vapnik, Chervonenkis):

E[ max |h(A)-g(A)| over A ∈ Y ] = O( sqrt( VC(Y) / #samples ) )

Page 9-12

How to choose from 2 densities?

[Diagram: densities f1, f2; the set T = {x : f1(x) > f2(x)} is labeled +1, its complement -1]

Compare ∫T f1, ∫T f2, and ∫T h (h = empirical measure of the data).

Scheffé test: if ∫T h > ∫T (f1+f2)/2, output f1; else output f2.

Theorem (see DL'01): |f-g|1 ≤ 3 dist1(g,F) + 2Δ
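The Scheffé test above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: densities are assumed given as callables, the integrals over T are approximated on a user-supplied uniform grid, and `scheffe_choose` is a hypothetical name.

```python
def scheffe_choose(f1, f2, samples, grid):
    """Choose between densities f1, f2 by the Scheffe test.

    T = {x : f1(x) > f2(x)}; output f1 iff h(T) > (f1(T) + f2(T)) / 2,
    where h is the empirical measure of the samples and f(T) is computed
    by numerical integration over `grid` (assumed equally spaced).
    """
    dx = grid[1] - grid[0]
    in_T = [f1(x) > f2(x) for x in grid]
    f1_T = sum(f1(x) for x, t in zip(grid, in_T) if t) * dx
    f2_T = sum(f2(x) for x, t in zip(grid, in_T) if t) * dx
    # empirical mass of T: fraction of samples falling in T
    h_T = sum(1 for s in samples if f1(s) > f2(s)) / len(samples)
    return f1 if h_T > (f1_T + f2_T) / 2 else f2
```

For example, with f1 uniform on [0,1), f2 uniform on [1,2), and all samples in [0,1), the test returns f1.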


Page 14

Test functions

Tij(x) = sgn(fi(x) - fj(x))

∫ Tij (fi - fj) = ∫ (fi - fj) sgn(fi - fj) = |fi - fj|1

F = {f1, f2, ..., fN}

Compare ∫ Tij h against ∫ Tij fi and ∫ Tij fj: fi wins if ∫ Tij h is closer to ∫ Tij fi (i.e. ∫ Tij h > ∫ Tij (fi+fj)/2); otherwise fj wins.

Page 15-16

Density estimation algorithms

Scheffé tournament (n² tests): pick the density with the most wins.
Theorem (DL'01): |f-g|1 ≤ 9 dist1(g,F) + 8Δ

Minimum distance estimate (Y'85) (n³ time): output fk ∈ F that minimizes max over i,j of |∫ (fk-h) Tij|.
Theorem (DL'01): |f-g|1 ≤ 3 dist1(g,F) + 2Δ

Can we do better?
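The Scheffé tournament's quadratic pair loop is easy to sketch. This is an illustration only: `wins` stands in for the pairwise Scheffé test (a hypothetical callback, not an API from the paper).

```python
def scheffe_tournament(F, wins):
    """F: list of N densities; wins(i, j) -> True iff f_i beats f_j
    on the Scheffe test for the pair (f_i, f_j), called with i < j.
    Returns the index of the density with the most wins."""
    score = [0] * len(F)
    for i in range(len(F)):
        for j in range(i + 1, len(F)):  # all N*(N-1)/2 pairwise tests
            if wins(i, j):
                score[i] += 1
            else:
                score[j] += 1
    return max(range(len(F)), key=lambda i: score[i])
```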

Page 17

Our algorithm: Efficient minimum loss-weight

repeat until one distribution left:
1) pick the pair of distributions in F that are furthest apart (in L1)
2) eliminate the loser

Take the most "discriminative" action.

Theorem [MS'08]: |f-g|1 ≤ 3 dist1(g,F) + 2Δ, using only a linear number of tests*

* after preprocessing F
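The elimination loop can be sketched directly. This is the naive version (each round scans all surviving pairs, so O(N³) overall), not the preprocessed one the later slides discuss; `l1` and `beats` are hypothetical inputs standing in for precomputed L1-distances and the pairwise Scheffé test.

```python
def min_loss_weight(n, l1, beats):
    """n densities 0..n-1; l1[(i, j)] = |f_i - f_j|_1 for i < j;
    beats(i, j) -> index of the Scheffe winner of the pair.
    Repeatedly tests the furthest-apart surviving pair and
    eliminates the loser; returns the surviving index."""
    alive = set(range(n))
    while len(alive) > 1:
        # 1) pick the furthest-apart surviving pair (in L1)
        i, j = max(
            ((a, b) for a in alive for b in alive if a < b),
            key=lambda p: l1[p],
        )
        # 2) eliminate the loser of their Scheffe test
        loser = i if beats(i, j) == j else j
        alive.remove(loser)
    return alive.pop()
```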

Page 18: Satyaki Mahalanabis Daniel Štefankovič

Tournament revelation problem INPUT: a weighed undirected graph G (wlog all edge-weights distinct)

OUTPUT: REPORT: heaviest edge {u1,v1} in G ADVERSARY eliminates u1 or v1 G1

REPORT: heaviest edge {u2,v2} in G1

ADVERSARY eliminates u2 or v2 G2

.....OBJECTIVE: minimize total time spent generating reports
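One simple way to serve these reports is a max-heap of edges that lazily discards edges with an eliminated endpoint. This corresponds to the O(|F|² log |F|) preprocessing / O(|F|²) total run-time regime mentioned on a later slide; the class and its methods are an illustrative sketch, not code from the paper.

```python
import heapq

class TournamentRevelation:
    def __init__(self, edges):
        # edges: list of (weight, u, v); weights negated for a max-heap
        self.heap = [(-w, u, v) for w, u, v in edges]
        heapq.heapify(self.heap)
        self.alive = {x for _, u, v in edges for x in (u, v)}

    def heaviest_edge(self):
        # pop stale edges (an endpoint was eliminated) until the top is valid
        while self.heap:
            _, u, v = self.heap[0]
            if u in self.alive and v in self.alive:
                return u, v
            heapq.heappop(self.heap)
        return None

    def eliminate(self, x):
        self.alive.discard(x)
```

With a weighting consistent with the example on the next slides (BC heaviest), the report sequence BC, AD, CD comes out as the adversary eliminates B and then A.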

Page 19-23

Tournament revelation problem - example

[Diagram: graph on vertices A, B, C, D with edge weights 1-6]

report the heaviest edge: BC; adversary eliminates B
report the heaviest edge: AD; adversary eliminates A
report the heaviest edge: CD

Page 24

Tournament revelation problem

[Diagram: decision tree of reports — root BC; each adversary choice (eliminate B or C) leads to a subtree of next reports: AD, BD, DB, DC, AC, AB, ...]

2^O(|F|) preprocessing, O(|F|) run-time
O(|F|² log |F|) preprocessing, O(|F|²) run-time

WE DO NOT KNOW: can we get O(|F|) run-time with polynomial preprocessing?

Page 25

Efficient minimum loss-weight

repeat until one distribution left:
1) pick the pair of distributions that are furthest apart (in L1)
2) eliminate the loser

2^O(|F|) preprocessing, O(|F|) run-time
O(|F|² log |F|) preprocessing, O(|F|²) run-time

WE DO NOT KNOW: can we get O(|F|) run-time with polynomial preprocessing?

(in practice step 2 is more costly)

Page 26

Efficient minimum loss-weight

repeat until one distribution left:
1) pick the pair of distributions that are furthest apart (in L1)
2) eliminate the loser

Theorem: |f-g|1 ≤ 3 dist1(g,F) + 2Δ

Proof: for every f' to which f loses,

|f-f'|1 ≤ max |f'-f''|1 over f'' to which f' loses

"that guy lost even more badly!"

Page 27

Proof: for every f' to which f loses,

|f-f'|1 ≤ max |f'-f''|1 over f'' to which f' loses

"that guy lost even more badly!"

[Diagram: f1 suffers a bad loss; BEST = f2; f3]

2 ∫h T23 ≥ ∫f2 T23 + ∫f3 T23

∫(f1-f2) T12 ≤ ∫(f2-f3) T23

∫(f4-h) T23

∫(fi-fj)(Tij-Tkl) ≥ 0

⇒ |f1-g|1 ≤ 3|f2-g|1 + 2Δ

Page 28

Application: kernel density estimates (Akaike'54, Parzen'62, Rosenblatt'56)

K = kernel, used to smooth the empirical measure
h = density; g = empirical measure of x1, x2, ..., xn, i.i.d. samples from h

g * K = (1/n) Σ over i=1..n of K(y - xi)

g * K → h * K  as n → ∞
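The displayed estimate g * K = (1/n) Σ K(y - xi) is a one-liner to evaluate. A minimal sketch, using a Gaussian kernel with bandwidth s (the bandwidth is introduced on the next slide); nothing here is specific to the paper beyond the formula itself.

```python
import math

def kde(samples, y, s=1.0):
    """Evaluate the kernel density estimate at y with bandwidth s,
    using the Gaussian kernel K(x) = exp(-x^2/2) / sqrt(2*pi)."""
    n = len(samples)
    k = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    # (g * K_s)(y) = (1/(n*s)) * sum_i K((y - x_i) / s)
    return sum(k((y - xi) / s) for xi in samples) / (n * s)
```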

Page 29

g * K = (1/n) Σ over i=1..n of K(y - xi) → h * K  as n → ∞

What K should we choose?

Dirac would be good (h * Dirac = h); Dirac is not good (g * Dirac = g does not converge).

Something in-between: bandwidth selection for kernel density estimates

Ks(x) = K(x/s) / s;   as s → 0, Ks → Dirac

Theorem (see DL'01): as s → 0 with sn → ∞,  |g * Ks - h|1 → 0

Page 30

Data splitting methods for kernel density estimates

g * Ks = (1/(ns)) Σ over i=1..n of K((y - xi)/s)

How to pick the smoothing factor s?

Split x1, x2, ..., xn into x1, ..., x_{n-m} and x_{n-m+1}, ..., xn.

fs = (1/((n-m)s)) Σ over i=1..n-m of K((y - xi)/s)

choose s using density estimation
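The data-splitting scheme can be sketched as follows: one candidate estimate fs per bandwidth from the first n-m points, selection on the held-out m points. Note the selection rule here is a simple stand-in (held-out log-likelihood), NOT the Scheffé-test-based density estimation selection the slides describe; all names are illustrative.

```python
import math

def pick_bandwidth(xs, m, candidates):
    """Split xs into train (first n-m) and holdout (last m) points,
    build one Gaussian KDE f_s per candidate bandwidth s from the
    train points, and pick the s whose f_s scores best on holdout."""
    train, holdout = xs[:-m], xs[-m:]
    k = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

    def f_s(y, s):
        return sum(k((y - xi) / s) for xi in train) / (len(train) * s)

    def score(s):  # held-out log-likelihood of f_s (stand-in criterion)
        return sum(math.log(f_s(y, s) + 1e-300) for y in holdout)

    return max(candidates, key=score)
```

On clustered data this rejects both near-Dirac and near-flat bandwidths in favor of a moderate one.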

Page 31

Kernels we will use:

(1/(ns)) Σ over i of K((y - xi)/s)

[Diagram: a piecewise uniform kernel and a piecewise linear kernel]

Page 32-34

Bandwidth selection for uniform kernels

N distributions, each piecewise uniform with n pieces; m datapoints. E.g. N ~ n^{1/2}, m ~ n^{5/4}.

Goal: run the density estimation algorithm efficiently

The tests compare ∫g Tij with ∫(fi+fj) Tij / 2, so the quantities needed are the distances |fi-fj|1 and the integrals ∫(fk-h) Tkj.

[Table: EMLW and MD both need the N² distances |fi-fj|1 (time n each); EMLW needs N of the integrals ∫(fk-h)Tkj, MD needs N² of them (time n + m log n each).]

Can we speed this up?

absolute error bad, relative error good

Page 35

Approximating L1-distances between distributions

N piecewise uniform densities (each with n pieces)

TRIVIAL (exact): N²n

WE WILL DO: (N² + Nn)(log N)²

Page 36

Dimension reduction for L2

Johnson-Lindenstrauss Lemma ('82)

φ: L2 → L2^t,  t = O(ε⁻² ln n)

(∀ x, y ∈ S, |S| = n)   d(x,y) ≤ d(φ(x), φ(y)) ≤ (1+ε) d(x,y)

[projection matrix entries drawn from N(0, t^{-1/2})]

Page 37

Dimension reduction for L1

Cauchy Random Projection (Indyk'00)

φ: L1 → L1^t,  t = O(ε⁻² ln n)

(∀ x, y ∈ S, |S| = n)   d(x,y) ≤ est(φ(x), φ(y)) ≤ (1+ε) d(x,y)

[projection entries: N(0, t^{-1/2}) replaced by C(0, 1/t)]

(Charikar, Brinkman '03: cannot replace est by d)

Page 38: Satyaki Mahalanabis Daniel Štefankovič

Cauchy distribution C(0,1)density function: 1

(1+x2)

XC(0,1) aXC(0,|a|)

XC(0,a), YC(0,b)X+YC(0,a+b)

FACTS:
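These stability facts are easy to check numerically. A small seeded sketch: C(0,s) is sampled as s·tan(π(U - 1/2)), and since the median of |Z| for Z ~ C(0,s) is exactly s, the median of |X+Y| with X ~ C(0,1), Y ~ C(0,2) should land near 3.

```python
import math, random, statistics

def cauchy(s, rng):
    """Sample from C(0, s) by the inverse-CDF formula."""
    return s * math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(0)
n = 200_000
# sum of independent C(0,1) and C(0,2) should be distributed as C(0,3)
z = [cauchy(1, rng) + cauchy(2, rng) for _ in range(n)]
med = statistics.median(abs(v) for v in z)  # estimates the scale, ~3
```

The median of |·| is used rather than the mean because Cauchy variables have no mean.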

Page 39-40

Cauchy random projection for L1 (Indyk'00)

[Diagram: independent Cauchy increments X1, ..., X9 over the cells of a partition; two step densities with values A and B; another density with value D; a cell of length z]

the increment over a cell of length z is X1 ~ C(0,z); a step density projects to a weighted sum of increments, e.g. A(X2+X3) + B(X5+X6+X7+X8), or D(X1+X2+...+X8+X9)

the difference of the projections of two step densities is distributed as Cauchy(0, |·-·|1), i.e. with scale equal to their L1-distance
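The projection idea for step densities can be sketched end to end. This is an illustration under simplifying assumptions (both densities given on a common grid of cells; the scale is recovered by the median-of-|·| estimator over t independent projections); the function name and interface are hypothetical.

```python
import math, random, statistics

def l1_estimate(A, B, lengths, t, rng):
    """A, B: cell values of two step densities on cells of the given
    lengths. Each projection draws one Cauchy increment per cell, so
    proj(A) - proj(B) ~ C(0, |A - B|_1); the median of the absolute
    differences over t projections estimates that scale."""
    diffs = []
    for _ in range(t):
        # increment over a cell of length l is distributed as C(0, l)
        x = [l * math.tan(math.pi * (rng.random() - 0.5)) for l in lengths]
        pa = sum(a * xi for a, xi in zip(A, x))
        pb = sum(b * xi for b, xi in zip(B, x))
        diffs.append(abs(pa - pb))
    return statistics.median(diffs)  # median of |C(0,s)| is s
```

For the uniform densities on [0,1] and [1,2] (two unit cells, values [1,0] and [0,1]), the true L1-distance is 2 and the estimate concentrates around it.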

Page 41-42

All pairs L1-distances: piecewise linear densities

[Diagram: two crossing linear pieces R and B over one cell]

X1, X2 ~ C(0, 1/2)

R = (3/4)X1 + (1/4)X2,  B = (3/4)X2 + (1/4)X1

R - B = (1/2)(X1 - X2) ~ C(0, 1/2)

Page 43

All pairs L1-distances: piecewise linear densities

Problem: too many intersections!

Solution: cut into even smaller pieces!

Stochastic measures are useful.

Page 44

Brownian motion: increment density (1/(2π)^{1/2}) exp(-x²/2)
Cauchy motion: increment density 1/(π(1+x²))

[Plots: a Brownian motion path and a Cauchy motion path on [0,1]]

Page 45

Brownian motion: increment density (1/(2π)^{1/2}) exp(-x²/2)

[Plot: a Brownian motion path on [0,1]]

∫ f dL = Y ~ N(0, Σ),  f: R → R^d

computing integrals is easy

Page 46

Cauchy motion: increment density 1/(π(1+x²))

[Plot: a Cauchy motion path on [0,1]]

∫ f dL = Y ~ C(0, s) for d=1,  f: R → R^d

computing integrals is easy for d=1; hard* for d>1

* obtaining an explicit expression for the density

Page 47-48

What were we doing?

[Diagram: Cauchy increments X1, ..., X9 over the cells of a partition]

∫ (f1, f2, f3) dL = ((w1)1, (w2)1, (w3)1)

Can we efficiently compute integrals ∫ φ dL for φ piecewise linear?

Page 49-50

Can we efficiently compute integrals ∫ φ dL for φ piecewise linear?

φ: R → R²,  φ(z) = (1, z)

(X, Y) = ∫ φ dL

(2(X-Y), 2Y) has density at (u+v, u-v)/2

Page 51

All pairs L1-distances for mixtures of uniform densities in time O((N² + Nn)(log N)²)

All pairs L1-distances for piecewise linear densities in time O((N² + Nn)(log N)²)

Page 52

QUESTIONS

1) φ: R → R³,  φ(z) = (1, z, z²),  (X, Y, Z) = ∫ φ dL ?

2) higher dimensions?