Robust inversion and randomized sampling
Michael P. Friedlander, University of British Columbia
International Symposium on Mathematical Programming, August 19–24, 2012
Collaborators: Sasha Aravkin, Felix Herrmann, Tristan van Leeuwen, and Mark Schmidt
Model problem
minimize_x  f(x) := (1/m) Σ_{i=1}^m f_i(x)

Examples:
• least-squares: f_i(x) = (a_iᵀx − b_i)²,  f(x) = ‖Ax − b‖²
• log likelihood: f_i(x) = −log p(b_i; x)
• sample average: f_i(x) = f(x, ω_i),  f(x) ≈ E_ω[f(x, ω)]

Context:
• m large
• each f_i(x) and ∇f_i(x) expensive to evaluate

Computing costs:
• count f_i/∇f_i evaluations, not f/∇f evaluations
• minimize passes through the full data set (f_1, . . . , f_m)
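As a concrete illustration of this counting convention, here is a minimal sketch (toy least-squares instance, not from the talk) that tallies individual f_i evaluations and reports cost in passes through the data:

```python
import numpy as np

# Toy least-squares instance: f_i(x) = (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
m, n = 1000, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

evals = {"fi": 0}                       # count f_i evaluations, not f evaluations

def fi(x, i):
    evals["fi"] += 1
    return (A[i] @ x - b[i]) ** 2

def f(x):
    # One evaluation of f costs m evaluations of f_i: a full pass through the data.
    return np.mean([fi(x, i) for i in range(m)])

print(f(np.zeros(n)), evals["fi"] / m, "passes through the data")
```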
SEISMIC INVERSION
Reflection seismology
Full waveform inversion: velocity models
[Figure: velocity models. Panel 1: lateral distance 1–9 km, depth 0.5–3 km, velocity 2000–5500 m/s. Panel 2: x 0–6 km, z 0–2 km, velocity 2–4 km/s.]
Full waveform inversion

Each of m experiments yields a vector of measurements:
sources: q_1, . . . , q_m
measurements: d_1, . . . , d_m

Forward problem: given m, predict data. If m, the velocity on the grid, is known, then we can predict (frequency-domain) data via the Helmholtz PDE:

[ω²m + ∇²] u = q,   d = Pu

q: known source
m: known velocity
ω: frequency
u: predicted wavefield (on entire grid)
P: restriction to surface (where data is observed)
d: predicted data

[Figure: Seismic Laboratory for Imaging and Modeling]
1 source, 1 frequency:

minimize_{x,u}  ‖d − Pu‖²  subject to  H_ω(x)u = q

All sources, all frequencies (e.g., 1k sources, ∼10 frequencies):

minimize_x  Σ_i Σ_{ω∈Ω} ‖d_i − P H_ω(x)⁻¹ q_i‖²

Main cost is the solution of the Helmholtz equation for each (i, ω) pair:

H_ω(x)u = q_i
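A structural sketch of how this misfit is assembled, not an actual solver: `helmholtz_matrix` is a hypothetical callback standing in for the discretized H_ω(x), and in practice each system is large, sparse, and solved with specialized methods rather than a dense solve:

```python
import numpy as np

def fwi_misfit(x, freqs, Q, D, helmholtz_matrix, P):
    """Sum of ||d_i - P H_w(x)^{-1} q_i||^2 over all sources i and frequencies w.

    Q[w], D[w]             : columns are the sources q_i / observed data d_i at frequency w
    helmholtz_matrix(x, w) : hypothetical callback returning the discretized H_w(x)
    P                      : restriction of the wavefield to receiver locations
    """
    total = 0.0
    for w in freqs:
        H = helmholtz_matrix(x, w)        # discretized [w^2 m + Laplacian] on the grid
        U = np.linalg.solve(H, Q[w])      # one wavefield per source; the dominant cost
        R = D[w] - P @ U                  # residuals d_i - P u_i for every source
        total += np.sum(np.abs(R) ** 2)   # frequency-domain data may be complex
    return total
```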
Dimensionality reduction and stochastic optimization
Use all your data:

f(x) = Σ_i ‖r_i(x)‖²   with   r_i(x) = d_i − F(x)q_i

Dimensionality reduction: form s weighted averages (s ≪ m)

d̄_j := Σ_i w_ij d_i   and   q̄_j := Σ_i w_ij q_i,   j = 1, . . . , s

Stochastic approximation of the misfit:

f̄(x) = Σ_j ‖r̄_j(x)‖²   with   r̄_j(x) = d̄_j − F(x)q̄_j

Stochastic optimization interpretation holds if E[WWᵀ] = I:

E[f̄(x)] = f(x)   and   E[∇f̄(x)] = ∇f(x)
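A minimal sketch of the data-mixing step, assuming W has iid Rademacher entries scaled by 1/√s so that E[WWᵀ] = I (one valid choice of W; the data here are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix(D, Q, s):
    """Replace m experiments (columns of D, Q) with s random weighted averages."""
    m = D.shape[1]
    W = rng.choice([-1.0, 1.0], size=(m, s)) / np.sqrt(s)   # E[W W^T] = I
    return D @ W, Q @ W      # columns are d_bar_j = sum_i w_ij d_i and q_bar_j

# Hypothetical data: m = 500 experiments reduced to s = 20 mixed experiments.
D = rng.standard_normal((64, 500))
Q = rng.standard_normal((64, 500))
D_bar, Q_bar = mix(D, Q, s=20)
print(D_bar.shape)           # (64, 20)
```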
Stochastic trace estimation

Least-squares misfit is a matrix trace:

f(x) = Σ_i ‖r_i‖² = trace(RᵀR)   with   R = [ r_1 | · · · | r_m ]

Data mixing (dimensionality reduction), equivalent to

R̄ = RW,

gives the sample-average function

f̄(x) = Σ_j ‖r̄_j‖² = trace(R̄ᵀR̄)   with   R̄ = [ r̄_1 | · · · | r̄_s ],   s ≪ m

Connected to stochastic trace estimation (sketched below):
• Hutchinson ('90) — minimize var[trace(R̄ᵀR̄)] with W ∼ Rademacher
• Avron & Toledo ('11) — other optimal choices for W
• Haber, Chung, Herrmann ('12) — connection to inverse problems
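A small sketch of the underlying trace estimator, with a toy matrix standing in for RᵀR: Hutchinson's method averages wᵀAw over Rademacher vectors w, which is what the mixed misfit trace(R̄ᵀR̄) computes for A = RᵀR:

```python
import numpy as np

rng = np.random.default_rng(1)

def hutchinson_trace(A, s):
    """Estimate trace(A) as the average of w^T A w over s Rademacher vectors w."""
    n = A.shape[0]
    total = 0.0
    for _ in range(s):
        w = rng.choice([-1.0, 1.0], size=n)
        total += w @ A @ w
    return total / s

R = rng.standard_normal((100, 400))             # toy residual matrix
A = R.T @ R
print(np.trace(A), hutchinson_trace(A, s=30))   # unbiased estimate of the trace
```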
Nonlinear least-squares with corrupted data
[Figure: two data panels, "good data" and "4% bad data"; horizontal axis 50–550, vertical axis 20–120.]
collaboration with Total
Robust misfit measures
d_i = F(x)q_i + ε

[Figure: density functions and corresponding penalty functions for the Normal (½x²), Laplace (|x|), and Student's t (log(k + x²)) distributions.]
• Least-squares penalty assumes ε ∼ N (0, 1)
• Theorem: Log-concave densities (i.e., those induced by convex penalties) are all exponentially bounded:

  prob(|x| > t + ∆t given |x| > t) = O(e^{−∆t})

• Heavy-tailed densities: robust to wrong data & approximate models
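A small sketch (synthetic residuals, assumed) comparing the three penalties when a few residuals are grossly corrupted; k is the Student's t parameter from the figure above:

```python
import numpy as np

rng = np.random.default_rng(2)
r = rng.standard_normal(1000)
r[:40] += 50.0 * rng.standard_normal(40)    # 4% of residuals grossly corrupted

k = 1.0
normal    = 0.5 * np.sum(r ** 2)            # least-squares penalty
laplace   = np.sum(np.abs(r))               # convex, robust
student_t = np.sum(np.log(k + r ** 2))      # heavy-tailed, non-convex

# Outliers inflate the least-squares penalty far more than the robust ones.
print(normal, laplace, student_t)
```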
Robust full waveform inversion results
• Recover velocity on a 2D grid: 201 × 301
• 151 sources, 6 frequencies: 906 PDE solves per f/∇f evaluation
• 50% corrupted data
[Figure: recovered velocity model (axes 0–30 and 0–20) with a residual histogram over [−3, 3].]
minimize_x  ρ(R(x))   with   R(x) = D − F(x)Q

Normal: ‖R(x)‖²_F     Laplace: Σ_ij |R_ij(x)|     Student's-t: Σ_ij log(k + R_ij(x)²)
[Figure: recovered models (axes 0–30 and 0–20) and residual histograms over [−3, 3] for the Normal, Laplace, and Student's-t penalties.]
Sampling strategies for dimensionality reduction
Generic inverse problem (assuming iid observations):
min_x  f(x) := (1/m) Σ_i ρ(r_i)   with   R(x) = [ r_1 | · · · | r_m ]

Moment condition E[WWᵀ] = I is not generally sufficient to guarantee

E[f̄(x)] = f(x)   and   E[∇f̄(x)] = ∇f(x)   with   R̄(x) = R(x)W

Random subset selection without replacement, i.e.,

R̄(x) = [ r_{i(1)} | r_{i(2)} | · · · | r_{i(s)} ]

does yield the desired "expected objective" property (checked numerically below).
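A quick numerical check of that property with toy residuals and an arbitrary penalty ρ (all assumed here): the average misfit over a uniformly random subset drawn without replacement matches the full misfit in expectation:

```python
import numpy as np

rng = np.random.default_rng(3)
m, s = 200, 20
r = rng.standard_normal((m, 5))                 # toy residual vectors r_i (rows)
rho = lambda ri: np.log(1.0 + ri @ ri)          # an arbitrary per-residual penalty

f_full = np.mean([rho(ri) for ri in r])

estimates = []
for _ in range(5000):
    S = rng.choice(m, size=s, replace=False)    # uniform subset, no replacement
    estimates.append(np.mean([rho(r[i]) for i in S]))

print(f_full, np.mean(estimates))               # the two averages agree closely
```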
MODEL PROBLEM
minimize_x  f(x) := (1/m) Σ_{i=1}^m f_i(x)
Complexity of steepest descent
Baseline Algorithm:

x_{k+1} ← x_k − α_k ∇f(x_k),   α_k ≡ 1/L

Assume: convex f; Lipschitz ∇f with parameter L

Sublinear rate:
• f(x_k) − f(x_*) = O(1/k)  (constant stepsize)
• f(x_k) − f(x_*) = O(1/k²)  (optimal rate with extrapolation)  [Nesterov '83; Tseng '10]

Linear rate: additionally assume that f is strongly convex with parameter µ
• f(x_k) − f(x_*) = O([1 − µ/L]^k)

Note: if f is twice differentiable, µI ⪯ ∇²f(x) ⪯ LI
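A minimal illustration on an assumed toy quadratic, where L and µ are the extreme eigenvalues of the Hessian and the linear rate [1 − µ/L]^k is visible directly:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
M = rng.standard_normal((n, n))
H = M.T @ M + np.eye(n)                  # Hessian of f(x) = 0.5 x^T H x, minimizer x* = 0
eigs = np.linalg.eigvalsh(H)
mu, L = eigs[0], eigs[-1]                # strong-convexity and Lipschitz constants

x0 = rng.standard_normal(n)
x = x0.copy()
for k in range(200):
    x = x - (1.0 / L) * (H @ x)          # x_{k+1} = x_k - (1/L) grad f(x_k)

# Observed contraction vs. the predicted worst-case factor (1 - mu/L)^k.
print(np.linalg.norm(x) / np.linalg.norm(x0), (1 - mu / L) ** 200)
```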
Incremental gradient methods
Algorithm:

x_{k+1} ← x_k − α ∇f_i(x_k),   i ∈ {1, . . . , m},  cyclic or randomized

Assume: f strongly convex (µ) and Lipschitz ∇f (L)

Constant stepsize: α_k ≡ α
• ‖x_k − x_*‖² ≤ O([1 − µ/L]^k) + O(m²α)   (k full cycles)
• E‖x_k − x_*‖² ≤ O([1 − µ/L]^k) + O(mα)   (k iterations)

Decreasing stepsize: Σ_k α_k = ∞,  Σ_k α_k² < ∞
• ‖x_k − x_*‖² = O(1/k)   (k full cycles)
• E‖x_k − x_*‖² = O(1/k)   (k iterations)

Many variations: Luo/Tseng '94; Nedic/Bertsekas '00/'10; Blatt et al. '08; Bottou '10
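A minimal sketch of the randomized variant on an assumed toy least-squares problem, contrasting a constant stepsize (fast progress, then a plateau) with a decreasing one:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 500, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def grad_fi(x, i):                      # gradient of f_i(x) = (a_i^T x - b_i)^2
    return 2.0 * (A[i] @ x - b[i]) * A[i]

for schedule in ("constant", "decreasing"):
    x = np.zeros(n)
    for k in range(20 * m):             # 20 passes through the data
        i = rng.integers(m)
        alpha = 1e-3 if schedule == "constant" else 1.0 / (100.0 + k)
        x = x - alpha * grad_fi(x, i)
    print(schedule, np.linalg.norm(x - x_star))
```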
EXAMPLES
Seismic inversion

Recover image of geological structures via nonlinear least squares

minimize_x  Σ_i Σ_{ω∈Ω} ‖d_i − P H_ω(x)⁻¹ q_i‖²

Observations: each of m "shots" is an experiment:
sources: q_1, . . . , q_m;  measurements: d_1, . . . , d_m
[Figure: velocity model, x 0–3 km, z 0–2 km.]
[Figure: velocity-model reconstructions (each panel x 0–3 km, z 0–2 km) at increasing numbers of passes through the data, up to 39 of 39 passes, comparing full gradient, incremental gradient, and gradient sampling.]
Image denoising
• Statistical denoising via conditional random fields
• Dataset of 50 synthetic 64 × 64 images [Kumar/Hebert '04]
• Generalization of logistic model to capture dependencies among labels

maximize_x  Σ_{i=1}^m log p(b_i; x)
• p is intractable and approximated
[Figure: true image and noisy sample; denoising results from a fraction of a pass up to 5 of 5 passes through the data, comparing full gradient, incremental gradient, and gradient sampling.]
Sampling approach
Increasing batch:  [Bertsekas/Tsitsiklis '96, Shapiro/H-do-M '00]

S_k ⊆ {1, . . . , m},   s_k → m (slowly)

Sample-average gradient:

g_k(x) := (1/s_k) Σ_{i∈S_k} ∇f_i(x)

Algorithm:

x_{k+1} ← x_k − α_k d   with   H_k d = −g_k(x_k)

Goal: non-asymptotic analysis based on controlling gradient error
• g_k(x_k) = ∇f(x_k) + e_k  where  ‖e_k‖² ≤ ε_k  or  E[‖e_k‖²] ≤ ε_k
• How to control sample size s_k?
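A minimal sketch of this scheme with H_k = I (plain gradient steps) on an assumed toy least-squares problem; the geometric growth rule for s_k is just one simple choice, not the specific schedule analyzed next:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5000, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

def g(x, idx):                          # sample-average gradient over the rows in idx
    return 2.0 * A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

x, s, alpha = np.zeros(n), 10, 0.1
for k in range(100):
    S = rng.choice(m, size=min(s, m), replace=False)   # sample without replacement
    x = x - alpha * g(x, S)                            # step along -g_k(x_k)
    s = int(np.ceil(1.1 * s))                          # grow s_k -> m (slowly)

print(np.linalg.norm(g(x, np.arange(m))))              # full gradient is now small
```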
Gradient with generic errors
Prototype algorithm:

x_{k+1} ← x_k − α g_k,   g_k = ∇f(x_k) + e_k,   α fixed

Assumptions:
• Lipschitz gradient (L); strong convexity (µ)
• ‖e_k‖² ≤ ε_k

Convergence rate: for all k = 1, 2, . . .   [F. & Schmidt '12]

‖x_k − x_*‖² ≤ O([1 − µ/L]^k) + O(ε_k)

Examples:
• linear: ε_k = O(γ^k)  ⟹  f(x_k) − f_* = O(max{γ, 1 − µ/L}^k)
• sublinear: ε_k = O(1/k²)  ⟹  f(x_k) − f_* = O(1/k²)
• persistent error: inf_k ε_k > 0  ⟹  f(x_k) − f_* = O(ε_k)
Growing the sample size
Prototype algorithm:

x_{k+1} ← x_k − α g_k,   g_k = (1/s_k) Σ_{i∈S_k} ∇f_i(x_k),   S_k ⊆ {1, . . . , m}

Sampling strategies:
• Deterministic: pre-determined sample sequence
• Randomized: uniform sampling

Convergence rates, for all k = 1, 2, . . .   [F. & Schmidt '12]
• deterministic:  ‖x_k − x_*‖² = O([1 − µ/L]^k) + O([(m − s_k)/m]²)
• sampling w/o replacement:  E‖x_k − x_*‖² = O([1 − µ/L]^k) + O((m − s_k)/m · 1/s_k)
• sampling w/ replacement:  E‖x_k − x_*‖² = O([1 − µ/L]^k) + O(1/s_k)
Illustration
[Figure: sample-size schedule (s/m vs. iteration k) and cumulative samples (vs. iteration k) for the deterministic, sampling-without-replacement, and sampling-with-replacement strategies.]
In practice
Algorithm:

x_{k+1} ← x_k − α_k d,   B_k d = −g_k(x_k),   g_k(x) = (1/s_k) Σ_{i∈S_k} ∇f_i(x)

Hessian approximations B_k:
• (limited-memory) quasi-Newton  [Schraudolph et al. '07]:
  s_k := x_{k+1} − x_k,   y_k := g_k(x_{k+1}) − g_k(x_k)
• sample-average Hessian  [Byrd et al. '11] (sketched below):
  B_k := (1/h_k) Σ_{i∈H_k} ∇²f_i(x_k),   h_k ≪ s_k
• Fisher information, for f(x) = −Σ_i log p_i(ω_i; x)  [Osborne '92]:
  B_k ≈ E_ω[∇²f(x)] = E_ω[∇f(x)∇f(x)ᵀ]
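A minimal sketch of the second option: a sample-average Hessian built from a small subsample H_k, shown for the binary logistic-regression objective of the next section (synthetic data; a dense solve stands in for the matrix-free solve one would use at scale):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 2000, 15
A = rng.standard_normal((m, n))
b = np.sign(A @ rng.standard_normal(n) + 0.5 * rng.standard_normal(m))

def grad(x, idx):                        # averaged gradient of -log p(b_i | a_i, x)
    z = 1.0 / (1.0 + np.exp(b[idx] * (A[idx] @ x)))
    return -(A[idx].T @ (b[idx] * z)) / len(idx)

def hess(x, idx):                        # averaged Hessian over the subsample idx
    p = 1.0 / (1.0 + np.exp(-(A[idx] @ x)))
    return (A[idx].T * (p * (1.0 - p))) @ A[idx] / len(idx)

x = np.zeros(n)
for k in range(10):
    S = rng.choice(m, size=500, replace=False)   # gradient sample S_k
    Hk = rng.choice(m, size=50, replace=False)   # smaller Hessian sample, h_k << s_k
    B = hess(x, Hk) + 1e-4 * np.eye(n)           # regularized subsampled Hessian
    x = x + np.linalg.solve(B, -grad(x, S))      # B_k d = -g_k(x_k), then x <- x + d

print(np.linalg.norm(grad(x, np.arange(m))))     # full gradient after 10 steps
```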
APPLICATIONS
Binary logistic regression
max_x  Σ_i log p(b_i | a_i, x),   p(b_i | a_i, x) = 1 / (1 + exp(−b_i a_iᵀx)),   b_i ∈ {−1, 1}

• Email spam classifier (Cormack and Lynam, 2005)
• TREC 2005 dataset: 92,189 email messages from the Enron investigation
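A toy re-creation (random synthetic data, not the TREC corpus) of the kind of comparison in the figure that follows: a deterministic full-gradient method, a constant-stepsize stochastic method, and a hybrid method whose batch grows, with cost measured in passes through the data:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 5000, 20
A = rng.standard_normal((m, n))
b = np.sign(A @ rng.standard_normal(n) + rng.standard_normal(m))

def f(x):                                    # full average negative log-likelihood
    return np.mean(np.logaddexp(0.0, -b * (A @ x)))

def grad(x, idx):                            # averaged gradient over examples in idx
    z = 1.0 / (1.0 + np.exp(b[idx] * (A[idx] @ x)))
    return -(A[idx].T @ (b[idx] * z)) / len(idx)

def run(method, alpha, passes=20):
    x, used, s = np.zeros(n), 0, 10
    while used < passes * m:                 # budget measured in grad f_i evaluations
        if method == "deterministic":
            idx = np.arange(m)
        elif method == "stochastic":
            idx = rng.integers(m, size=1)
        else:                                # hybrid: batch grows toward the full data set
            idx = rng.choice(m, size=min(s, m), replace=False)
            s = int(np.ceil(1.05 * s))
        x = x - alpha * grad(x, idx)
        used += len(idx)
    return f(x)

for method, alpha in (("deterministic", 0.5), ("stochastic", 0.05), ("hybrid", 0.5)):
    print(method, run(method, alpha))
```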
[Figure: objective minus optimal (log scale, 10⁰–10⁴) vs. passes through the data (0–100) for Stochastic (stepsizes 1e+01, 1e+00, 1e−01), Hybrid, and Deterministic methods.]
Multinomial logistic regression
max_x  Σ_i log p(b_i = j | a_i, {x_j}_{j∈C}),   b_i ∈ C

• Digit classification
• MNIST dataset: 70,000 handwritten 28 × 28 images of digits
[Figure: objective minus optimal (log scale, 10³–10⁵) vs. passes through the data (0–100) for Stochastic (stepsizes 1e−02, 1e−03, 1e−04), Hybrid, and Deterministic methods.]
Chain-structured conditional random fields
max_x  Σ_i log p({b_i^k = j_k}_{k∈Ω} | {a_i^k}_{k∈Ω}, {x_j}_{j∈C}),   b_i^k ∈ C,  k ∈ Ω

• Noun-phrase chunking task from natural-language processing
• CoNLL-2000 Shared Task dataset: 211,727 words in 8,936 sentences
[Figure: objective minus optimal (log scale, 10²–10⁵) vs. passes through the data (0–100) for Stochastic (stepsizes 1e−01, 1e−02, 1e−03), Hybrid, and Deterministic methods.]
General conditional random fields
min_x  Σ_{i=1}^m −log p(b_i; x)

• Statistical denoising
• Kumar/Hebert dataset of 50 synthetic 64 × 64 images
[Figure: objective minus optimal (log scale, 10⁻⁸–10⁴) vs. passes through the data (0–100) for Stochastic (stepsizes 1e−03, 1e−04, 1e−05), Hybrid, and Deterministic methods.]
Seismic inversion
min_x  Σ_i Σ_{ω∈Ω} ‖d_i − P H_ω(x)⁻¹ q_i‖²

• Recover seismic image via nonlinear least squares
• Marmousi 2D acoustic model; 101 sources/receivers; 8 frequencies
[Figure: objective minus optimal (log scale, 10⁴–10⁷) vs. passes through the data (0–25) for Stochastic (stepsizes 1e−08, 1e−09, 1e−10), Hybrid, and Deterministic methods.]
No strong convexity, no linear convergence
Steepest descent

x_{k+1} ← x_k − α ∇f(x_k)

is sublinear:

f(x_k) − f(x_*) = O(1/k)

Approximate steepest descent

x_{k+1} ← x_k − α d_k   with   d_k = ∇f(x_k) + e_k

Summable errors, i.e., Σ_{k=1}^∞ ‖e_k‖ < ∞, give a sublinear rate for the iterate average:

f(x̄_k) − f(x_*) = O(1/k),   x̄_k := (1/k) Σ_{i=1}^k x_i
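A small sketch of this on an assumed toy problem that is convex but not strongly convex (a rank-deficient least-squares objective), with gradient errors that decay like 1/k² and a running average of the iterates:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 200, 50
A = rng.standard_normal((m, 10)) @ rng.standard_normal((10, n))  # rank-deficient design
b = rng.standard_normal(m)

f = lambda x: np.mean((A @ x - b) ** 2)          # convex, but not strongly convex
grad = lambda x: 2.0 * A.T @ (A @ x - b) / m
L = 2.0 * np.linalg.norm(A, 2) ** 2 / m          # Lipschitz constant of the gradient

x, x_bar = np.zeros(n), np.zeros(n)
for k in range(1, 2001):
    e = rng.standard_normal(n) / k ** 2          # summable gradient errors, ||e_k|| ~ 1/k^2
    x = x - (1.0 / L) * (grad(x) + e)
    x_bar += (x - x_bar) / k                     # running average of the iterates

print(f(x_bar), f(x))
```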
References I

• H. Avron and S. Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM, 58:8:1–8:34, April 2011.
• D. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, chapter 4, pages 85–115. MIT Press, 2012.
• D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
• D. Blatt, A. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM J. Optim., 18(1):29–51, 2008.
• L. Bottou. Large-scale machine learning with stochastic gradient descent. In Y. Lechevallier and G. Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pages 177–187, Paris, France, August 2010. Springer.
• R. H. Byrd, G. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Math. Program., 134:127–155, 2012.
• R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3):977–995, 2011.
• L. Grippo. A class of unconstrained minimization methods for neural network training. Optim. Methods Softw., 4(2):135–150, 1994.
• E. Haber, M. Chung, and F. J. Herrmann. An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM J. Optim., 22(3):739–757, 2012.
References II

• M. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Comm. Statist. Simulation Comput., 19(2):433–450, 1990.
• A. Kleywegt, A. Shapiro, and T. Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM J. Optim., 12(2):479–502, 2002.
• Z. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res., 46(1):157–178, 1993.
• A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. Stochastic Optimization: Algorithms and Applications, pages 263–304, 2000.
• Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Soviet Math. Dokl., 269, 1983.
• M. Osborne. Fisher's method of scoring. Intern. Stat. Rev., pages 99–117, 1992.
• N. N. Schraudolph, J. Yu, and S. Gunter. A stochastic quasi-Newton method for online convex optimization. In M. Meila and X. Shen, editors, Proc. 11th Intl. Conf. Artificial Intelligence and Statistics (AIStats), volume 2 of Workshop Conf. Proc., pages 436–443, San Juan, Puerto Rico, 2007. J. Machine Learning Res.
• P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program., 125:263–295, 2010.
• K. van den Doel and U. M. Ascher. Adaptive and stochastic algorithms for electrical impedance tomography and DC resistivity problems with piecewise constant solutions and many measurements. SIAM J. Comput., 34(1), 2012.
Thanks!
Read:
• M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3), April 2012.
• A. Aravkin, M. P. Friedlander, F. Herrmann, and T. van Leeuwen. Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming, 134(1):101–125, 2012.

Email: [email protected]

Surf: http://www.cs.ubc.ca/~mpf