Density estimation in linear time (+ approximating L1-distances)
Satyaki Mahalanabis, Daniel Štefankovič
University of Rochester
Density estimation

[Figure: DATA + F = a family of densities {f1, ..., f6}  →  output density]
Density estimation - example

DATA: 0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625
F = a family of normal densities N(μ, 1) (i.e., σ = 1)
Measure of quality: L1-distance from the truth

|f - g|_1 = ∫ |f(x) - g(x)| dx        (g = TRUTH, f = OUTPUT)

Why L1?
1) small L1 ⇒ all events are estimated with small additive error
2) scale invariant
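As a concrete illustration (not on the slides), the L1 distance can be approximated numerically; below is a minimal Python sketch for two normal densities, where `normal_pdf`, the integration bounds, and the step count are choices of this example:

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def l1_distance(f, g, lo, hi, steps=100000):
    """Approximate |f - g|_1 = integral of |f(x) - g(x)| dx by a Riemann sum."""
    dx = (hi - lo) / steps
    return sum(abs(f(lo + i * dx) - g(lo + i * dx)) for i in range(steps)) * dx

# identical densities are at L1 distance 0; essentially disjoint ones approach 2
d_same = l1_distance(lambda x: normal_pdf(x, 0), lambda x: normal_pdf(x, 0), -10, 10)
d_far = l1_distance(lambda x: normal_pdf(x, 0), lambda x: normal_pdf(x, 100), -10, 110)
```

The second value illustrates the remark on the slide: any two densities are at L1 distance at most 2, attained when their supports are essentially disjoint.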
Obstacles to "quality":

1) weak class of densities F  (dist_1(g, F) is large)
2) bad data
What is bad data?

g = TRUTH, h = DATA (empirical density)

|h - g|_1  ≥  Δ := 2 max_{A ∈ Y(F)} |h(A) - g(A)|

Y(F) = Yatracos class of F:  A_ij = { x | f_i(x) > f_j(x) }

[Figure: densities f1, f2, f3 and the sets A12, A13, A23]
Density estimation

DATA (h) + F  →  f with small |g - f|_1, assuming these are small:
  dist_1(g, F)   and   Δ = 2 max_{A ∈ Y(F)} |h(A) - g(A)|

Why would these be small??? They will be if:
1) we pick a large enough F
2) we pick a small enough F, so that the VC-dimension of Y(F) is small
3) the data are i.i.d. from g

Theorem (Haussler, Dudley, Vapnik, Chervonenkis):
  E[ max_{A ∈ Y} |h(A) - g(A)| ]  ≤  c · sqrt( VC(Y) / #samples )
How to choose from 2 densities?

[Figure: densities f1, f2 and data points labeled +1 +1 +1 -1]

T = { x | f1(x) > f2(x) }
Compute ∫_T f1,  ∫_T f2,  and  ∫_T h = h(T).

Scheffé: if ∫_T h > ∫_T (f1 + f2)/2, output f1; else output f2.
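The Scheffé test above can be sketched directly; in this hypothetical setup the two candidates are N(0,1) and N(3,1), the truth is N(0,1), and the numeric grid is a choice of the example:

```python
import math, random

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def scheffe_choose(f1, f2, samples, lo=-20.0, hi=20.0, steps=20000):
    """Scheffé test: T = {x : f1(x) > f2(x)}; output f1 iff h(T) > (F1(T)+F2(T))/2."""
    dx = (hi - lo) / steps
    grid = [lo + i * dx for i in range(steps)]
    T = [x for x in grid if f1(x) > f2(x)]
    F1T = sum(f1(x) for x in T) * dx                              # \int_T f1
    F2T = sum(f2(x) for x in T) * dx                              # \int_T f2
    hT = sum(1 for s in samples if f1(s) > f2(s)) / len(samples)  # empirical h(T)
    return "f1" if hT > (F1T + F2T) / 2 else "f2"

random.seed(0)
data = [random.gauss(0, 1) for _ in range(1000)]   # truth g = N(0,1)
winner = scheffe_choose(lambda x: normal_pdf(x, 0), lambda x: normal_pdf(x, 3), data)
```

Since the truth equals the first candidate here, the test should pick f1.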
Theorem (see DL'01):  |f - g|_1  ≤  3 dist_1(g, F) + 2Δ,
where Δ = 2 max_{A ∈ Y(F)} |h(A) - g(A)|.
Test functions

T_ij(x) = sgn(f_i(x) - f_j(x))

∫ T_ij (f_i - f_j) = ∫ (f_i - f_j) sgn(f_i - f_j) = |f_i - f_j|_1

F = {f1, f2, ..., fN}

[Figure: number line with ∫ T_ij f_j and ∫ T_ij f_i; f_i wins if ∫ T_ij h
falls closer to ∫ T_ij f_i, else f_j wins]
Density estimation algorithms

Scheffé tournament (n^2 tests): pick the density with the most wins.
  Theorem (DL'01): |f - g|_1 ≤ 9 dist_1(g, F) + 8Δ

Minimum distance estimate (Y'85) (n^3 time): output the f_k ∈ F that minimizes
  max_{ij} | ∫ (f_k - h) T_ij |
  Theorem (DL'01): |f - g|_1 ≤ 3 dist_1(g, F) + 2Δ
Can we do better?
Our algorithm: Efficient minimum loss-weight

repeat until one distribution is left:
  1) pick the pair of distributions in F that are furthest apart (in L1)
  2) eliminate the loser

Theorem [MS'08]: |f - g|_1 ≤ 3 dist_1(g, F) + 2Δ, using only O(n) tests*
  (* after preprocessing F)

Idea: take the most "discriminative" action.
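The elimination loop above can be sketched as follows; this is a toy version, assuming the pairwise L1 distances are precomputed and a hypothetical `wins(i, j)` predicate stands in for the Scheffé test against the data:

```python
def min_loss_weight(dist, wins):
    """dist[i][j] = |f_i - f_j|_1; wins(i, j) -> True if f_i beats f_j on the data.
    Repeatedly take the farthest-apart remaining pair and eliminate the loser."""
    alive = set(range(len(dist)))
    while len(alive) > 1:
        # 1) pick the pair of remaining distributions furthest apart in L1
        i, j = max(((a, b) for a in alive for b in alive if a < b),
                   key=lambda p: dist[p[0]][p[1]])
        # 2) eliminate the loser of the test between them
        alive.discard(j if wins(i, j) else i)
    return alive.pop()

# toy example: 3 candidates; here candidate 0 beats everyone it meets
dist = [[0, 5, 2],
        [5, 0, 4],
        [2, 4, 0]]
best = min_loss_weight(dist, wins=lambda i, j: i < j)
```

This naive loop scans all remaining pairs each round; the point of the talk is that the farthest pair can be revealed in O(|F|) total time after preprocessing, which is the tournament revelation problem below.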
Tournament revelation problem

INPUT: a weighted undirected graph G (wlog all edge-weights distinct)
OUTPUT:
  REPORT the heaviest edge {u1, v1} in G;  ADVERSARY eliminates u1 or v1  →  G1
  REPORT the heaviest edge {u2, v2} in G1; ADVERSARY eliminates u2 or v2  →  G2
  .....
OBJECTIVE: minimize the total time spent generating reports
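One way to see the exponential-preprocessing extreme of the trade-off discussed later: precompute the heaviest edge for every subset of vertices, so that each report is a single table lookup. A toy sketch (the edge weights below are hypothetical, chosen only for illustration):

```python
from itertools import combinations

def precompute(vertices, weight):
    """For every vertex subset of size >= 2, store its heaviest edge
    (edge weights are assumed distinct)."""
    table = {}
    verts = list(vertices)
    for r in range(2, len(verts) + 1):
        for sub in combinations(verts, r):
            edges = [(weight[frozenset(e)], e) for e in combinations(sub, 2)]
            table[frozenset(sub)] = max(edges)[1]
    return table

# toy instance on vertices A, B, C, D (hypothetical weights)
w = {frozenset('AB'): 2, frozenset('AC'): 3, frozenset('AD'): 5,
     frozenset('BC'): 6, frozenset('BD'): 4, frozenset('CD'): 1}
table = precompute('ABCD', w)
report1 = table[frozenset('ABCD')]   # heaviest edge overall
report2 = table[frozenset('ACD')]    # after the adversary eliminates B
```

Each report is now O(1), but the table has 2^{O(|V|)} entries, matching the exponential-preprocessing row of the trade-off on the later slide.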
Tournament revelation problem - example

[Figure: graph on vertices A, B, C, D with edge weights 1-6]

report the heaviest edge: BC  →  adversary eliminates B
report the heaviest edge: AD  →  adversary eliminates A
report the heaviest edge: CD
Tournament revelation problem

[Figure: the precomputed tree of reports - root BC; each branch fixes which
endpoint the adversary eliminates, and the children are the next reports
(AD, BD, ..., DC, AC, AD, AB)]

2^O(|F|) preprocessing  →  O(|F|) run-time
O(|F|^2 log |F|) preprocessing  →  O(|F|^2) run-time

WE DO NOT KNOW: can we get O(|F|) run-time with polynomial preprocessing?
Efficient minimum loss-weight

repeat until one distribution is left:
  1) pick the pair of distributions that are furthest apart (in L1)
  2) eliminate the loser
(in practice step 2 is more costly)

2^O(|F|) preprocessing  →  O(|F|) run-time
O(|F|^2 log |F|) preprocessing  →  O(|F|^2) run-time

WE DO NOT KNOW: can we get O(|F|) run-time with polynomial preprocessing?
Efficient minimum loss-weight

repeat until one distribution is left:
  1) pick the pair of distributions that are furthest apart (in L1)
  2) eliminate the loser

Theorem: |f - g|_1 ≤ 3 dist_1(g, F) + 2Δ

Proof: for every f' to which f loses,
  |f - f'|_1 ≤ max { |f' - f''|_1 : f' loses to f'' }
("that guy lost even more badly!")
[Figure: f2 = BEST; f1 suffers a bad loss to f3]

  2 ∫ h T23 ≥ ∫ f2 T23 + ∫ f3 T23
  ∫ (f1 - f2) T12 ≤ ∫ (f2 - f3) T23
  ∫ (f4 - h) T23
  ∫ (f_i - f_j)(T_ij - T_kl) ≥ 0
  ⇒  |f1 - g|_1 ≤ 3 |f2 - g|_1 + 2Δ
Application: kernel density estimates (Akaike'54, Parzen'62, Rosenblatt'56)

K = kernel, used to smooth the empirical density g
(x1, x2, ..., xn are i.i.d. samples from h)

g * K = (1/n) Σ_{i=1}^{n} K(y - x_i)  →  h * K  as n → ∞
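A minimal kernel density estimate in this spirit (a sketch only: Gaussian kernel, fixed bandwidth s, and all sample parameters are hypothetical choices of this example):

```python
import math, random

def kde(samples, s):
    """g * K_s: average of scaled kernels (1/s) K((y - x_i)/s), Gaussian K."""
    n = len(samples)
    def estimate(y):
        return sum(math.exp(-((y - x) / s) ** 2 / 2) / (s * math.sqrt(2 * math.pi))
                   for x in samples) / n
    return estimate

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(5000)]  # i.i.d. from h = N(0,1)
f = kde(xs, s=0.3)
# f(0) should be close to the true density value 1/sqrt(2*pi) ~ 0.399
```

The choice s = 0.3 is arbitrary here; picking s well is exactly the bandwidth-selection problem the next slides address.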
What K should we choose?

Dirac would be good... Dirac is not good.
Something in-between: bandwidth selection for kernel density estimates.

K_s(x) = (1/s) K(x/s);  as s → 0, K_s → Dirac

Theorem (see DL'01): as s → 0 with sn → ∞,  |g * K_s - h|_1 → 0
Data splitting methods for kernel density estimates

g * K_s = (1/(ns)) Σ_{i=1}^{n} K((y - x_i)/s)

How to pick the smoothing factor s?  Split the data x1, x2, ..., xn:
  x1, ..., x_{n-m}       →  f_s = (1/((n-m)s)) Σ_{i=1}^{n-m} K((y - x_i)/s)
  x_{n-m+1}, ..., x_n    →  choose s using density estimation
Kernels we will use: piecewise uniform, piecewise linear.
Bandwidth selection for uniform kernels

N distributions, each piecewise uniform with n pieces; m datapoints.
E.g. N ≈ n^{1/2}, m ≈ n^{5/4}.

Goal: run the density estimation algorithm efficiently.

Quantities needed (EMLW = efficient minimum loss-weight, MD = minimum
distance estimate), with the time to compute each:

  ∫ g T_ij  vs  ∫ (f_i + f_j) T_ij / 2     N^2 of them     time n + m log n
  |f_i - f_j|_1                            N^2 of them     time n
  ∫ (f_k - h) T_kj                         N of them       time n + m log n
Can we speed this up?
Absolute error is bad; relative error is good.
Approximating L1-distances between distributions

N piecewise uniform densities (each with n pieces).

TRIVIAL (exact): N^2 n
WE WILL DO: O((N^2 + Nn) (log N)^2)
Dimension reduction for L2

Johnson-Lindenstrauss Lemma ('84):
  Φ: L2 → L2^t,  t = O(ε^{-2} ln n),  entries of Φ drawn from N(0, t^{-1/2}),
  (∀ x, y ∈ S, |S| = n)   d(x, y) ≤ d(Φ(x), Φ(y)) ≤ (1 + ε) d(x, y)
Dimension reduction for L1

Cauchy Random Projection (Indyk'00):
  Φ: L1 → L1^t,  t = O(ε^{-2} ln n),  entries of Φ drawn from C(0, 1/t),
  (∀ x, y ∈ S, |S| = n)   d(x, y) ≤ est(Φ(x), Φ(y)) ≤ (1 + ε) d(x, y)

(Brinkman, Charikar'03: est cannot be replaced by d)
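A sketch of the Cauchy-projection idea for finite vectors; the slides do not specify `est`, so taking the median of coordinatewise absolute differences is an assumption here (one standard choice, since the median of |C(0, s)| equals s):

```python
import math, random, statistics

def l1_sketch(vecs, t, seed=0):
    """Project d-dimensional vectors to t coordinates with i.i.d. standard
    Cauchy entries; by 1-stability, each coordinate of the difference of two
    sketches is Cauchy with scale |x - y|_1."""
    rng = random.Random(seed)
    d = len(vecs[0])
    M = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(d)]
         for _ in range(t)]  # standard Cauchy via inverse CDF
    return [[sum(row[k] * v[k] for k in range(d)) for row in M] for v in vecs]

def est(px, py):
    """Median of coordinatewise |differences| estimates |x - y|_1."""
    return statistics.median(abs(a - b) for a, b in zip(px, py))

x = [1.0, 0.0, 2.0, 0.0]
y = [0.0, 0.0, 0.0, 1.0]   # |x - y|_1 = 1 + 0 + 2 + 1 = 4
sx, sy = l1_sketch([x, y], t=5001)
approx = est(sx, sy)        # close to 4
```

The median is used instead of a mean because Cauchy variables have no finite expectation; this is exactly why est cannot be a metric d (Brinkman, Charikar'03).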
Cauchy distribution C(0,1), density function:  1 / (π (1 + x^2))

FACTS:
  X ~ C(0,1)  ⇒  aX ~ C(0, |a|)
  X ~ C(0,a), Y ~ C(0,b) independent  ⇒  X + Y ~ C(0, a+b)

[Figure: an interval of length z cut into unit pieces with independent C(0,1)
increments X1, ..., X9 and weights A, B:
  X1 ~ C(0, z),   A(X2+X3) + B(X5+X6+X7+X8) ~ C(0, 2|A| + 4|B|)]
Cauchy random projection for L1 (Indyk'00)

[Figure: a step density D over the pieces X1, ..., X9; its projection is
D(X1 + X2 + ... + X8 + X9)]

The difference of the projections of two densities is distributed
Cauchy(0, |α - β|_1).
All pairs L1-distances, piecewise linear densities

X1, X2 ~ C(0, 1/2)
R = (3/4) X1 + (1/4) X2,   B = (3/4) X2 + (1/4) X1
R - B = (1/2)(X1 - X2) ~ C(0, 1/2)

Problem: too many intersections!
Solution: cut into even smaller pieces!
Stochastic measures are useful.

Brownian motion L: increments have density (1/(2π)^{1/2}) exp(-x^2/2)
[Figure: a sample path of Brownian motion on [0, 1]]
  ∫ f dL = Y ~ N(0, s),  f: R → R^d  -  computing integrals is easy

Cauchy motion L: increments have density 1 / (π (1 + x^2))
[Figure: a sample path of Cauchy motion on [0, 1]]
  ∫ f dL = Y ~ C(0, s)  for d = 1  -  computing integrals is easy
  for d > 1, computing integrals is hard*
  (* obtaining an explicit expression for the density)
What were we doing?

[Figure: the pieces X1, ..., X9]

∫ (f1, f2, f3) dL = ((w1)_1, (w2)_1, (w3)_1)
Can we efficiently compute integrals ∫ φ dL for piecewise linear φ?

φ: R → R^2,  φ(z) = (1, z);  (X, Y) = ∫ φ dL

(2(X - Y), 2Y) has density obtained at (u+v, u-v)/2.
All pairs L1-distances for mixtures of uniform densities: time O((N^2 + Nn) (log N)^2)
All pairs L1-distances for piecewise linear densities: time O((N^2 + Nn) (log N)^2)
QUESTIONS

1) φ: R → R^3,  φ(z) = (1, z, z^2);  (X, Y, Z) = ∫ φ dL ?
2) higher dimensions?