Minimax and Bayesian experimental design:
Bridging the gap between statistical and worst-case approaches to least squares regression

Michael W. Mahoney
ICSI and Department of Statistics, UC Berkeley

Joint work with Michał Dereziński, Feynman Liang, Manfred Warmuth, and Ken Clarkson

September 2019
Outline
Correcting the bias in least squares regression
Volume-rescaled sampling
Minimax experimental design
Bayesian experimental design
Conclusions
Bias of the least-squares estimator

S = (x1, y1), ..., (xn, yn) i.i.d. ~ D

w*(S) = argmin_w Σ_i (xi · w − yi)²

[Figure: one-dimensional data (x, y) with the fitted regression line.]

Statistical regression: y = x · w* + ξ, E[ξ] = 0.
  Unbiased: E[w*(S)] = w*.

Worst-case regression: w* = argmin_w E_D[(x · w − y)²].
  Biased: E[w*(S)] ≠ w*.
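In one dimension the worst-case bias can be verified by exact enumeration. The sketch below uses hypothetical data: D is uniform over two (x, y) pairs with the deterministic nonlinear response y = x², so the population optimum is w* = E[xy]/E[x²] = 9/5, while the size-2 i.i.d. least squares estimator has expectation 33/20 ≠ 9/5.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical d = 1 instance: D is uniform over two (x, y) pairs with the
# deterministic nonlinear response y = x^2.
points = [(F(1), F(1)), (F(2), F(4))]

# Population optimum: w* = argmin_w E_D[(x*w - y)^2] = E[xy] / E[x^2].
w_star = sum(x * y for x, y in points) / sum(x * x for x, _ in points)

# Exact expectation of the least squares estimator over all i.i.d. samples
# of size n = 2 (each of the 4 samples has probability 1/4).
n = 2
exp_w = F(0)
for sample in product(points, repeat=n):
    w_ls = sum(x * y for x, y in sample) / sum(x * x for x, _ in sample)
    exp_w += F(1, len(points) ** n) * w_ls

print(exp_w, w_star)   # 33/20 vs 9/5: the estimator is biased
```

Averaging over more i.i.d. samples tightens the variance but cannot remove this gap.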
Correcting the worst-case bias

S = (x1, y1), ..., (xn, yn) i.i.d. ~ D

Worst-case regression (shown here for d = 1):
  Sample x_{n+1} ~ x² · D_X
  Query y_{n+1} ~ D_Y | x = x_{n+1}
  S' ← S ∪ {(x_{n+1}, y_{n+1})}

Unbiased: E[w*(S')] = w*.
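The one-dimensional correction can also be checked by exact enumeration. In this minimal sketch with hypothetical data (D uniform over two points, y = x² deterministic), each i.i.d. sample of size n is augmented with one extra point drawn with probability proportional to x² · D_X, and the expectation of the augmented least squares estimator lands exactly on w*.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical d = 1 instance: D is uniform over two (x, y) pairs, y = x^2.
points = [(F(1), F(1)), (F(2), F(4))]
w_star = sum(x * y for x, y in points) / sum(x * x for x, _ in points)

# Rescaled distribution for the extra point: Pr(x_i) = x_i^2 / sum_j x_j^2
# (the uniform D_X weights cancel in the normalization).
z = sum(x * x for x, _ in points)
rescaled = [(x * x / z, (x, y)) for x, y in points]

n = 2
exp_w = F(0)
for sample in product(points, repeat=n):          # i.i.d. part, prob 1/4 each
    for p_extra, extra in rescaled:               # x^2-rescaled extra point
        s = list(sample) + [extra]
        w_ls = sum(x * y for x, y in s) / sum(x * x for x, _ in s)
        exp_w += F(1, len(points) ** n) * p_extra * w_ls

print(exp_w, w_star)   # both equal 9/5: the augmented estimator is unbiased
```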
In general: add dimension-many points (Dereziński and Warmuth)

Worst-case regression in d dimensions:

  S = (x1, y1), ..., (xn, yn) i.i.d. ~ D,  (x, y) ∈ R^d × R

Estimate the optimum:

  w* = argmin_{w ∈ R^d} E_D[(x^T w − y)²]

Volume-rescaled sampling:
  Sample x_{n+1}, ..., x_{n+d} jointly, with density proportional to
  det(X̄)² · (D_X)^d, where X̄ is the d × d matrix with rows x_{n+1}^T, ..., x_{n+d}^T.
  Query y_{n+i} ~ D_Y | x = x_{n+i}, for i = 1..d.
  Add S̄ = (x_{n+1}, y_{n+1}), ..., (x_{n+d}, y_{n+d}) to S.

Theorem. E[w*(S ∪ S̄)] = w*, even though E[w*(S)] ≠ w*.
Effect of correcting the bias

Let w̄ = (1/T) Σ_{t=1}^T w*(S_t), for independent samples S1, ..., ST.

Question: is the estimation error ‖w̄ − w*‖ converging to 0?

Example: x^T = (x1, ..., x5) with xi i.i.d. ~ N(0, 1), and

  y = Σ_{i=1}^5 (xi + xi³/3) + ε,

where the cubic term is the nonlinearity.

[Figure: log-log plot of estimation error versus the number of averaged
estimators T, comparing i.i.d. samples with i.i.d. + volume-rescaled
samples, for sample sizes k = 10, 20, 40.]
Discussion

- First-of-a-kind unbiased estimator for random designs, different from
  RandNLA sampling theory.
- The augmentation uses a determinantal point process (DPP) we call
  volume-rescaled sampling.
- There are many efficient DPP algorithms.
- A new mathematical framework for computing expectations.

Key application: experimental design
- Bridge the gap between statistical and worst-case perspectives.
Outline
Correcting the bias in least squares regression
Volume-rescaled sampling
Minimax experimental design
Bayesian experimental design
Conclusions
Volume-rescaled sampling (Dereziński and Warmuth)

Let x1, x2, ..., xk be i.i.d. random vectors sampled from x ~ D_X, and let
D_X^k denote the distribution of the random k × d matrix X whose rows are
x1^T, ..., xk^T.

Volume-rescaled sampling of size k from D_X:

  VS_{D_X}^k(X) ∝ det(X^T X) · D_X^k(X)

Note: for k = d, we have det(X^T X) = det(X)².

Question: what is the normalization factor of VS_{D_X}^k, i.e. what is
E_{D_X^k}[det(X^T X)]?

It can be found through a new proof of the Cauchy-Binet formula.
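For a finite ground set, the Cauchy-Binet identity behind this normalization can be checked directly: the sum of det(X_S)² over all size-d row subsets equals det(X^T X). A minimal sketch with hypothetical integer data (n = 4, d = 2):

```python
from itertools import combinations

# Rows of a tall matrix X (n = 4 points in d = 2 dimensions).
X = [[1, 2], [3, 1], [0, 4], [2, 2]]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Gram matrix G = X^T X.
G = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]

# Cauchy-Binet: det(X^T X) = sum of det(X_S)^2 over all size-d row subsets S.
total = sum(det2([X[i], X[j]]) ** 2 for i, j in combinations(range(4), 2))

print(total, det2(G))   # the two quantities coincide (both 269 here)
```

In particular, the volume sampling probabilities Pr(S) = det(X_S)² / det(X^T X) sum to 1.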
The decomposition of volume-rescaled sampling (Dereziński and Warmuth)

Let X ~ VS_{D_X}^k and let S ⊆ [k] be a random size-d index set such that

  Pr(S | X) ∝ det(X_S)².

Then:
- X_S ~ VS_{D_X}^d,
- X_{[k]\S} ~ D_X^{k−d},
- S is uniformly random,

and the three are independent.
Consequences for least squares (Dereziński and Warmuth)

Theorem ([DWH19]). Let S = (x1, y1), ..., (xk, yk) i.i.d. ~ D^k, for any
k ≥ 0. Sample x̄1, ..., x̄d ~ VS_{D_X}^d and query ȳi ~ D_Y | x = x̄i, for
i = 1..d. Then, for S̄ = (x̄1, ȳ1), ..., (x̄d, ȳd),

  E[w*(S ∪ S̄)] = E_{S ~ D^k}[ E_{S̄ ~ VS_D^d}[ w*(S ∪ S̄) ] ]
               = E_{S̄ ~ VS_D^{k+d}}[ w*(S̄) ]     (decomposition)
               = w*.                              (d-modularity)
Outline
Correcting the bias in least squares regression
Volume-rescaled sampling
Minimax experimental design
Bayesian experimental design
Conclusions
Classical statistical regression

We consider n parameterized experiments: x1, ..., xn ∈ R^d. Each experiment
has a real random outcome Yi, for i = 1..n.

Classical setup:

  Yi = xi^T w* + ξi,  E[ξi] = 0,  Var[ξi] = σ²,  Cov[ξi, ξj] = 0 for i ≠ j

The ordinary least squares estimator w_LS = X†Y satisfies:

(unbiasedness)  E[w_LS] = w*

(mean squared error, letting b = tr((X^T X)^{-1}))

  MSE(w_LS) = E‖w_LS − w*‖² = σ² tr((X^T X)^{-1}) = (b/n) · E‖ξ‖²

(mean squared prediction error)

  MSPE(w_LS) = E‖X(w_LS − w*)‖² = σ² d = (d/n) · E‖ξ‖²
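These two identities are easy to sanity-check numerically. A minimal Monte Carlo sketch in one dimension (d = 1, so X^T X and its inverse trace are the scalar Σ xi² and 1/Σ xi²), with hypothetical design values and w*:

```python
import random

random.seed(0)

# d = 1 design: fixed experiment values x_i, noise sigma = 1, true w* = 2.
xs = [1.0, 2.0, 3.0]
w_star, sigma = 2.0, 1.0
sx2 = sum(x * x for x in xs)            # X^T X (a scalar here)

trials = 200_000
errs = []
for _ in range(trials):
    ys = [x * w_star + random.gauss(0.0, sigma) for x in xs]
    w_ls = sum(x * y for x, y in zip(xs, ys)) / sx2   # least squares in 1D
    errs.append(w_ls - w_star)

mean_err = sum(errs) / trials            # should be close to 0 (unbiasedness)
mse = sum(e * e for e in errs) / trials  # should be close to sigma^2 / (X^T X)
print(mean_err, mse, sigma ** 2 / sx2)
```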
Experimental design in the classical setting (summary)

Suppose we have a budget of k experiments out of the n choices.
Goal: select a subset of k experiments S ⊆ [n].
Question: how large does k need to be so that

  excess estimation error (MSE or MSPE) ≤ ε · total noise E‖ξ‖² ?

Denote L* = E‖ξ‖² = nσ².

Prior result: there is a design (S, ŵ) of size k such that E[ŵ_S] = w* and:

  MSE(ŵ_S) − MSE(w_LS) ≤ ε · L*, for k ≥ d + b/ε,
  MSPE(ŵ_S) − MSPE(w_LS) ≤ ε · L*, for k ≥ d + d/ε,

where b = tr((X^T X)^{-1}).
Experimental design in the general setting (summary)

No assumptions on Yi. We define

  w* := E[w_LS] = X† E[Y],

and define the "total noise" as L* := E‖ξ‖², where ξ := Y − Xw*.

Theorem 1 (MSE). There is a random design (S, ŵ) such that E[ŵ_S] = w* and

  MSE(ŵ_S) − MSE(w_LS) ≤ ε · L*, for k = O(d log n + b/ε),

where b = tr((X^T X)^{-1}).

Theorem 2 (MSPE). There is a random design (S, ŵ) such that E[ŵ_S] = w* and

  MSPE(ŵ_S) − MSPE(w_LS) ≤ ε · L*, for k = O(d log n + d/ε).
Classical experimental design

Consider n parameterized experiments: x1, ..., xn ∈ R^d. Each experiment
has a real random response yi such that

  yi = xi^T w* + ξi,  ξi ~ N(0, σ²).

Goal: select k ≪ n experiments to best estimate w*.
Example: select S = {4, 6, 9} and receive y4, y6, y9.

[Figure: the n × d matrix X with selected rows x4^T, x6^T, x9^T, and the
corresponding entries y4, y6, y9 of the response vector y.]
A-optimal design

Find an unbiased estimator ŵ with the smallest mean squared error:

  min_ŵ max_{w*} E_ŵ[‖ŵ − w*‖²]   (= MSE[ŵ])
  subject to E[ŵ] = w* for all w*.

Given all of y1, ..., yn, the optimum is least squares: ŵ = X†y, with

  MSE[X†y] = tr(Var[X†y]) = σ² tr((X^T X)^{-1}).

Given only {yi : i ∈ S}, the optimum is ŵ = X_S† y_S, with

  MSE[X_S† y_S] = tr(Var[X_S† y_S]) = σ² tr((X_S^T X_S)^{-1}).

A-optimal design:  min_{S : |S| ≤ k} tr((X_S^T X_S)^{-1})

Typical required assumption: yi = xi^T w* + ξi, ξi ~ N(0, σ²).
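For small n the A-optimal subset can be found by brute force over all size-k subsets. A minimal sketch with hypothetical d = 2 data, using the closed-form inverse of a 2 × 2 Gram matrix:

```python
from itertools import combinations

# Hypothetical candidate experiments: n = 5 points in d = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]

def a_criterion(rows):
    # tr((X_S^T X_S)^{-1}) via the closed-form inverse of a 2x2 Gram matrix.
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    c = sum(r[1] * r[1] for r in rows)
    det = a * c - b * b
    return float("inf") if det <= 1e-12 else (a + c) / det

k = 3
subsets = list(combinations(range(len(X)), k))
best = min(subsets, key=lambda S: a_criterion([X[i] for i in S]))
print(best, a_criterion([X[i] for i in best]))
```

Note that adding experiments can only shrink the criterion (the Gram matrix grows in the Loewner order), so the full design always scores at least as well as the best size-k subset.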
A-optimal design: a simple guarantee

Theorem (Avron and Boutsidis, 2013). For any X and k ≥ d, there is a subset
S of size k such that

  tr((X_S^T X_S)^{-1}) ≤ [(n − d + 1)/(k − d + 1)] · tr((X^T X)^{-1}),

where we denote φ = tr((X^T X)^{-1}).

Corollary. If y = Xw* + ξ where Var[ξ] = σ²I and E[ξ] = 0 (is this
assumption necessary?), then

  tr(Var[X_S† y_S]) = σ² tr((X_S^T X_S)^{-1})
                    ≤ σ² [(n − d + 1)/(k − d + 1)] · φ
                    ≤ [φ/(k − d + 1)] · tr(Var[ξ]),

using tr(Var[ξ]) = nσ². So for k = d + φ/ε we get
MSE[X_S† y_S] ≤ ε · tr(Var[ξ]).
General response model (what if ξi is not N(0, σ²)?)

Let F_n be the set of all random vectors in R^n with finite second moment,
and let y ∈ F_n. Define

  w* := argmin_w E_y[‖Xw − y‖²] = X† E[y],

  ξ_{y|X} := y − Xw* = y − XX† E[y]   (deviation from the best linear predictor).

Two special cases:
1. Statistical regression: E[ξ_{y|X}] = 0 (mean-zero noise)
2. Worst-case regression: Var[ξ_{y|X}] = 0 (deterministic y)
Random experimental designs

Statistical: a fixed S is OK.
Worst-case: a fixed S can be exploited by the adversary.

Definition. A random experimental design (S, ŵ) of size k is:
1. a random set variable S ⊆ {1..n} such that |S| ≤ k,
2. a (jointly with S) random function ŵ: R^{|S|} → R^d.

Mean squared error of a random experimental design (S, ŵ):

  MSE[ŵ(y_S)] = E_{S,ŵ,y}[‖ŵ(y_S) − w*‖²]

W_k(X) denotes the family of unbiased random experimental designs (S, ŵ):

  E_{S,ŵ,y}[ŵ(y_S)] = X† E[y] = w*   for all y ∈ F_n.
Main result

Theorem. For any ε > 0, there is a random experimental design (S, ŵ) of size

  k = O(d log n + φ/ε),  where φ = tr((X^T X)^{-1}),

such that (S, ŵ) ∈ W_k(X) (unbiasedness) and, for any y ∈ F_n,

  MSE[ŵ(y_S)] − MSE[X†y] ≤ ε · E[‖ξ_{y|X}‖²].

Toy example: Var[ξ_{y|X}] = σ²I and E[ξ_{y|X}] = 0. Then:
1. E[‖ξ_{y|X}‖²] = tr(Var[ξ_{y|X}]),
2. MSE[X†y] = (φ/n) · tr(Var[ξ_{y|X}]).
Important special instances

1. Statistical regression: y = Xw* + ξ, E[ξ] = 0:

     MSE[ŵ(y_S)] − MSE[X†y] ≤ ε · tr(Var[ξ])

   - Weighted regression: Var[ξ] = diag([σ1², ..., σn²])
   - Generalized regression: Var[ξ] is arbitrary
   - Bayesian regression: w* ~ N(0, I)

2. Worst-case regression: y is any fixed vector in R^n:

     E_{S,ŵ}[‖ŵ(y_S) − w*‖²] ≤ ε · ‖y − Xw*‖²,  where w* = X†y.
Main result: proof outline

1. Volume sampling:
   - to get unbiasedness and expected bounds
   - controls the MSE in the tail of the distribution
   1.1 well-conditioned matrices
   1.2 unbiased estimators

2. Error bounds via i.i.d. sampling:
   - to bound the sample size k
   - controls the MSE in the bulk of the distribution
   2.1 Leverage score sampling: Pr(i) := (1/d) xi^T (X^T X)^{-1} xi
   2.2 Inverse score sampling: Pr(i) := (1/φ) xi^T (X^T X)^{-2} xi (new)

3. Proving expected error bounds for least squares.
Volume sampling

Definition. Given a full rank matrix X ∈ R^{n×d}, we define volume sampling
VS(X) as the distribution over sets S ⊆ [n] of size d given by

  Pr(S) = det(X_S)² / det(X^T X).

Pr(S) is proportional to the squared volume of the parallelepiped spanned by
{xi : i ∈ S}.

Computational cost: O(nnz(X) log n + d⁴ log d).
Unbiased estimators via volume sampling

Under an arbitrary response model, any i.i.d. sampling is biased.

Theorem ([DWH19]). Volume sampling corrects the least squares bias of i.i.d.
sampling. Let q = (q1, ..., qn) define some i.i.d. importance sampling
distribution, and sample

  x_{i1}, ..., x_{id} ~ VS(X)           (volume part),
  x_{i_{d+1}}, ..., x_{ik} ~ q^{k−d}    (i.i.d. part).

Then

  E[ argmin_w Σ_{t=1}^k (1/q_{it}) (x_{it}^T w − y_{it})² ] = w*_{y|X}.
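For k = d (the pure volume part), the unbiasedness E_S[X_S† y_S] = X†y holds exactly and can be verified by enumerating all size-d subsets with their volume sampling weights. A minimal sketch with hypothetical data (n = 4, d = 2), in exact rational arithmetic; every size-2 subset here is nonsingular, so each subset solve is well defined:

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical fixed design (n = 4, d = 2) and an arbitrary response vector y.
X = [[F(1), F(0)], [F(0), F(1)], [F(1), F(1)], [F(2), F(1)]]
y = [F(3), F(-1), F(2), F(4)]

def det2(r1, r2):
    return r1[0] * r2[1] - r1[1] * r2[0]

def lstsq(rows, ys):
    # Solve the normal equations (X^T X) w = X^T y exactly via Cramer's rule.
    a = sum(r[0] * r[0] for r in rows); b = sum(r[0] * r[1] for r in rows)
    c = sum(r[1] * r[1] for r in rows)
    t0 = sum(r[0] * v for r, v in zip(rows, ys))
    t1 = sum(r[1] * v for r, v in zip(rows, ys))
    det = a * c - b * b
    return [(t0 * c - b * t1) / det, (a * t1 - b * t0) / det]

w_full = lstsq(X, y)          # least squares on all n points: X^+ y

# Volume sampling: Pr(S) = det(X_S)^2 / det(X^T X); average subset solutions.
weights, avg = [], [F(0), F(0)]
for i, j in combinations(range(4), 2):
    wgt = det2(X[i], X[j]) ** 2
    w_s = lstsq([X[i], X[j]], [y[i], y[j]])
    weights.append(wgt)
    avg = [avg[0] + wgt * w_s[0], avg[1] + wgt * w_s[1]]
Z = sum(weights)
avg = [avg[0] / Z, avg[1] / Z]

print(avg, w_full)            # identical: E_S[X_S^+ y_S] = X^+ y exactly
```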
Key idea: volume-rescaled importance sampling

Simple volume-rescaled sampling:
- let D_X be a uniformly random xi,
- sample (X_S, y_S) ~ VS_D^k and set ŵ = X_S† y_S.
Then E[ŵ] = w*_{y|X}.

Problem: not robust to worst-case noise.
Solution: volume-rescaled importance sampling.
- Let p = (p1, ..., pn) be an importance sampling distribution,
- define x ~ D_X as x = (1/√pi) xi for i ~ p.
Then, for (X_S, y_S) ~ VS_D^k and ŵ = X_S† y_S, we have E[ŵ] = w*_{y|X}.
Importance sampling for experimental design

1. Leverage score sampling: Pr(i) = p_i^lev := (1/d) xi^T (X^T X)^{-1} xi.
   A standard sampling method for worst-case linear regression.

2. Inverse score sampling: Pr(i) = p_i^inv := (1/φ) xi^T (X^T X)^{-2} xi.
   A novel sampling technique, essential for achieving O(φ/ε) sample size.
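Both are proper probability distributions: the leverage scores sum to d and the inverse scores sum to φ, so each normalized version sums to 1. A minimal sketch with hypothetical d = 2 data, using the closed-form 2 × 2 inverse:

```python
# Hypothetical d = 2 design; compute leverage and inverse score distributions.
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, 1.0]]

# Gram matrix G = X^T X and its closed-form 2x2 inverse.
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
c = sum(r[1] * r[1] for r in X)
det = a * c - b * b
Ginv = [[c / det, -b / det], [-b / det, a / det]]
Ginv2 = [[sum(Ginv[i][t] * Ginv[t][j] for t in range(2)) for j in range(2)]
         for i in range(2)]                     # (X^T X)^{-2}
phi = Ginv[0][0] + Ginv[1][1]                   # phi = tr((X^T X)^{-1})

def quad(M, x):
    # x^T M x
    return sum(x[i] * M[i][j] * x[j] for i in range(2) for j in range(2))

lev = [quad(Ginv, x) / 2 for x in X]      # Pr(i) = (1/d) x_i^T (X^T X)^{-1} x_i
inv = [quad(Ginv2, x) / phi for x in X]   # Pr(i) = (1/phi) x_i^T (X^T X)^{-2} x_i

print(sum(lev), sum(inv))   # both sum to 1
```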
Minimax A-optimality and minimax experimental design

Definition. The minimax A-optimal value for experimental design is

  R*_k(X) := min_{(S,ŵ) ∈ W_k(X)} max_{y ∈ F_n \ Sp(X)}
             ( MSE[ŵ(y_S)] − MSE[X†y] ) / E[‖ξ_{y|X}‖²].

Fact. X†y is the minimum variance unbiased estimator for F_n: if
E_{y,ŵ}[ŵ(y)] = X† E[y] for all y ∈ F_n, then Var[ŵ(y)] ⪰ Var[X†y] for all
y ∈ F_n.

- If d ≤ k ≤ n, then R*_k(X) ∈ [0, ∞).
- If k ≥ C · d log n, then R*_k(X) ≤ C · φ/k for some constant C.
- If k² < εnd/3, then R*_k(X) ≥ (1 − ε) · φ/k for some X.
Alternative: mean squared prediction error

Definition (V-optimality). MSPE[ŵ] = E[‖X(ŵ − w*)‖²].

Theorem. There is (S, ŵ) of size k = O(d log n + d/ε) such that, for any
y ∈ F_n,

  MSPE[ŵ(y_S)] − MSPE[X†y] ≤ ε · E[‖ξ_{y|X}‖²].

This follows from the MSE bound by a reduction to the case X^T X = I, where
MSPE[ŵ] = MSE[ŵ] and φ = d.

Minimax V-optimal value:

  min_{(S,ŵ) ∈ W_k(X)} max_{y ∈ F_n \ Sp(X)}
    ( MSPE[ŵ(y_S)] − MSPE[X†y] ) / E[‖ξ_{y|X}‖²].
Questions about minimax experimental design

1. Can R*_k(X) be found, exactly or approximately?
2. What happens in the regime k ≤ C · d log n?
3. Can we restrict W_k(X) to only tractable experimental designs?
4. Does the minimax value change when we restrict F_n?
   4.1 Weighted regression
   4.2 Generalized regression
   4.3 Bayesian regression
   4.4 Worst-case regression
Reduction to worst-case regression

Theorem. W.l.o.g. we can replace random y ∈ F_n with fixed y ∈ R^n:

  R*_k(X) = min_{(S,ŵ) ∈ W_k(X)} max_{y ∈ R^n \ Sp(X)}
            E_{S,ŵ}[‖ŵ(y_S) − X†y‖²] / ‖y − XX†y‖².

Suppose (S, ŵ), for all fixed response vectors y ∈ R^n, satisfies

  E[ŵ(y_S)] = X†y  and  E[‖ŵ(y_S) − X†y‖²] ≤ ε · ‖y − XX†y‖².

Then, for all random response vectors y ∈ F_n and all w* ∈ R^d,

  MSE[ŵ(y_S)] = E[‖ŵ(y_S) − w*‖²]
              ≤ E[‖X†y − w*‖²] + ε · E[‖y − Xw*‖²]
              = MSE[X†y] + ε · E[‖y − Xw*‖²].
Outline
Correcting the bias in least squares regression
Volume-rescaled sampling
Minimax experimental design
Bayesian experimental design
Conclusions
Bayesian experimental design

Consider n parameterized experiments: x1, ..., xn ∈ R^d. Each experiment
has a real random response yi such that

  yi = xi^T w* + ξi,  ξi ~ N(0, σ²),  w* ~ N(0, σ² A^{-1}).

Goal: select k ≪ n experiments to best estimate w*.
Example: select S = {4, 6, 9} and receive y4, y6, y9.
Bayesian A-optimal design

Given the Bayesian assumptions, we have the posterior

  w | y_S ~ N( (X_S^T X_S + A)^{-1} X_S^T y_S,  σ² (X_S^T X_S + A)^{-1} ).

Bayesian A-optimality criterion:

  f_A(X_S^T X_S) = tr((X_S^T X_S + A)^{-1}).

Goal: efficiently find a subset S of size k such that

  f_A(X_S^T X_S) ≤ (1 + ε) · min_{S' : |S'| = k} f_A(X_{S'}^T X_{S'})   (=: OPT_k).
Relaxation to a semi-definite program

SDP relaxation. The following can be found via an SDP solver in polynomial
time:

  p* = argmin_{p1,...,pn} f_A( Σ_{i=1}^n pi xi xi^T ),
  subject to 0 ≤ pi ≤ 1 for all i, and Σ_i pi = k.

The solution p* satisfies f_A(Σ_i p*_i xi xi^T) ≤ OPT_k.

Question: for what k can we efficiently round this to a set S of size k?
Efficient rounding for effective-dimension many points

Definition. Define the A-effective dimension as

  d_A = tr(X^T X (X^T X + A)^{-1}) ≤ d.

Theorem ([DLM19]). If k = Ω(d_A/ε + log(1/ε)/ε²), then there is a
polynomial time algorithm that finds a subset S of size k such that

  f_A(X_S^T X_S) ≤ (1 + ε) · OPT_k.

Remark: extends to other Bayesian criteria (C/D/V-optimality).

Key idea: rounding with A-regularized volume-rescaled sampling, a new kind
of determinantal point process.
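The A-effective dimension is cheap to compute, stays below d for any positive regularization, and tends to d as the regularization vanishes. A minimal sketch with hypothetical d = 2 data and A = λI:

```python
# Hypothetical d = 2 design; A = lam * I, effective dimension
# d_A = tr(X^T X (X^T X + A)^{-1}).
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
G = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]

def eff_dim(G, lam):
    # Invert the regularized 2x2 Gram matrix in closed form, then take
    # the trace of G (G + lam I)^{-1}.
    a, b = G[0][0] + lam, G[0][1]
    c, d = G[1][0], G[1][1] + lam
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return sum(G[i][t] * inv[t][i] for i in range(2) for t in range(2))

print(eff_dim(G, 1.0), eff_dim(G, 1e-9))   # regularized value < 2; limit -> d = 2
```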
Comparison with prior work

                        Criteria            Bayesian   k = Ω(·)
  [WYS17]               A, V                no         d²/ε
  [AZLSW17]             A, C, D, E, G, V    yes        d/ε²
  [NSTT19]              A, D                no         d/ε + log(1/ε)/ε²
  our result [DLM19]    A, C, D, V          yes        d_A/ε + log(1/ε)/ε²
Outline
Correcting the bias in least squares regression
Volume-rescaled sampling
Minimax experimental design
Bayesian experimental design
Conclusions
Conclusions

Unbiased estimators for least squares, using volume sampling.

Recent developments:
- Experimental design without any noise assumptions, i.e., arbitrary
  responses.
- Minimax experimental design: bridging the gap between statistical and
  worst-case perspectives.
- Applications in Bayesian experimental design: bridging the gap between
  experimental design and determinantal point processes.

Going beyond least squares:
- extensions to non-square losses,
- applications in distributed optimization.
References

[AB13] Haim Avron and Christos Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464-1499, 2013.

[AZLSW17] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal design of experiments via regret minimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 126-135, Sydney, Australia, August 2017.

[DCMW19] Michał Dereziński, Kenneth L. Clarkson, Michael W. Mahoney, and Manfred K. Warmuth. Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression. In Proceedings of the 32nd Conference on Learning Theory, 2019.

[DLM19] Michał Dereziński, Feynman Liang, and Michael W. Mahoney. Distributed estimation of the inverse Hessian by determinantal averaging. arXiv e-prints (to appear), June 2019.

[DW18] Michał Dereziński and Manfred K. Warmuth. Reverse iterative volume sampling for linear regression. Journal of Machine Learning Research, 19(23):1-39, 2018.

[DWH19] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Correcting the bias in least squares regression with volume-rescaled sampling. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.

[NSTT19] Aleksandar Nikolov, Mohit Singh, and Uthaipon Tao Tantipongpipat. Proportional volume sampling and approximation algorithms for A-optimal design. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1369-1386, January 2019.

[WYS17] Yining Wang, Adams W. Yu, and Aarti Singh. On computationally tractable selection of experiments in measurement-constrained regression models. Journal of Machine Learning Research, 18(1):5238-5278, January 2017.