Robust inversion and randomized sampling
Michael P. Friedlander, University of British Columbia
International Symposium on Mathematical Programming, August 19–24, 2012
Collaborators: Sasha Aravkin, Felix Herrmann, Tristan van Leeuwen, and Mark Schmidt
Model problem
minimize_x  f(x) := (1/m) Σ_{i=1}^m f_i(x)

Examples:
• least-squares: f_i(x) = (a_iᵀx − b_i)²,  f(x) = ‖Ax − b‖²
• log likelihood: f_i(x) = −log p(b_i; x)
• sample average: f_i(x) = f(x, ω_i),  f(x) ≈ E_ω[f(x, ω)]

Context:
• m large
• each f_i(x) and ∇f_i(x) expensive to evaluate

Computing costs:
• count f_i/∇f_i evaluations, not f/∇f evaluations
• minimize passes through the full data set (f_1, . . . , f_m)
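As a concrete illustration of this counting convention, here is a minimal sketch (toy least-squares instance, not from the talk) that tallies individual f_i evaluations and reports cost in passes through the data:

```python
import numpy as np

# Toy least-squares instance: f_i(x) = (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
m, n = 1000, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

evals = {"fi": 0}                       # count f_i evaluations, not f evaluations

def fi(x, i):
    evals["fi"] += 1
    return (A[i] @ x - b[i]) ** 2

def f(x):
    # One evaluation of f costs m evaluations of f_i: a full pass through the data.
    return np.mean([fi(x, i) for i in range(m)])

print(f(np.zeros(n)), evals["fi"] / m, "passes through the data")
```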
SEISMIC INVERSION
Reflection seismology
Full waveform inversion: velocity models
[Figure: velocity models. Panel 1: lateral distance 1–9 km, depth 0.5–3 km, velocity 2000–5500 m/s. Panel 2: x 0–6 km, z 0–2 km, velocity 2–4 km/s.]
Full waveform inversion

Each of m experiments yields a vector of measurements:
sources: q_1, . . . , q_m
measurements: d_1, . . . , d_m

Forward problem: given m, predict data. If m, the velocity on the grid, is known, then we can predict (frequency-domain) data via the Helmholtz PDE:

[ω²m + ∇²] u = q,   d = Pu

q: known source
m: known velocity
ω: frequency
u: predicted wavefield (on entire grid)
P: restriction to surface (where data is observed)
d: predicted data

[Figure: Seismic Laboratory for Imaging and Modeling]
1 source, 1 frequency:

minimize_{x,u}  ‖d − Pu‖²  subject to  H_ω(x)u = q

All sources, all frequencies (e.g., 1k sources, ∼10 frequencies):

minimize_x  Σ_i Σ_{ω∈Ω} ‖d_i − P H_ω(x)⁻¹ q_i‖²

Main cost is the solution of the Helmholtz equation for each (i, ω) pair:

H_ω(x)u = q_i
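A structural sketch of how this misfit is assembled, not an actual solver: `helmholtz_matrix` is a hypothetical callback standing in for the discretized H_ω(x), and in practice each system is large, sparse, and solved with specialized methods rather than a dense solve:

```python
import numpy as np

def fwi_misfit(x, freqs, Q, D, helmholtz_matrix, P):
    """Sum of ||d_i - P H_w(x)^{-1} q_i||^2 over all sources i and frequencies w.

    Q[w], D[w]             : columns are the sources q_i / observed data d_i at frequency w
    helmholtz_matrix(x, w) : hypothetical callback returning the discretized H_w(x)
    P                      : restriction of the wavefield to receiver locations
    """
    total = 0.0
    for w in freqs:
        H = helmholtz_matrix(x, w)        # discretized [w^2 m + Laplacian] on the grid
        U = np.linalg.solve(H, Q[w])      # one wavefield per source; the dominant cost
        R = D[w] - P @ U                  # residuals d_i - P u_i for every source
        total += np.sum(np.abs(R) ** 2)   # frequency-domain data may be complex
    return total
```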
Dimensionality reduction and stochastic optimization
Use all your data:

f(x) = Σ_i ‖r_i(x)‖²   with   r_i(x) = d_i − F(x)q_i

Dimensionality reduction: form s weighted averages (s ≪ m)

d̄_j := Σ_i w_ij d_i   and   q̄_j := Σ_i w_ij q_i,   j = 1, . . . , s

Stochastic approximation of the misfit:

f̄(x) = Σ_j ‖r̄_j(x)‖²   with   r̄_j(x) = d̄_j − F(x)q̄_j

Stochastic optimization interpretation holds if E[WWᵀ] = I:

E[f̄(x)] = f(x)   and   E[∇f̄(x)] = ∇f(x)
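A minimal sketch of the data-mixing step, assuming W has iid Rademacher entries scaled by 1/√s so that E[WWᵀ] = I (one valid choice of W; the data here are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix(D, Q, s):
    """Replace m experiments (columns of D, Q) with s random weighted averages."""
    m = D.shape[1]
    W = rng.choice([-1.0, 1.0], size=(m, s)) / np.sqrt(s)   # E[W W^T] = I
    return D @ W, Q @ W      # columns are d_bar_j = sum_i w_ij d_i and q_bar_j

# Hypothetical data: m = 500 experiments reduced to s = 20 mixed experiments.
D = rng.standard_normal((64, 500))
Q = rng.standard_normal((64, 500))
D_bar, Q_bar = mix(D, Q, s=20)
print(D_bar.shape)           # (64, 20)
```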
Stochastic trace estimation

Least-squares misfit is a matrix trace:

f(x) = Σ_i ‖r_i‖² = trace(RᵀR)   with   R = [ r_1 | · · · | r_m ]

Data mixing (dimensionality reduction), equivalent to

R̄ = RW,

gives the sample-average function

f̄(x) = Σ_j ‖r̄_j‖² = trace(R̄ᵀR̄)   with   R̄ = [ r̄_1 | · · · | r̄_s ],   s ≪ m

Connected to stochastic trace estimation (sketched below):
• Hutchinson ('90) — minimize var[trace(R̄ᵀR̄)] with W ∼ Rademacher
• Avron & Toledo ('11) — other optimal choices for W
• Haber, Chung, Herrmann ('12) — connection to inverse problems
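A small sketch of the underlying trace estimator, with a toy matrix standing in for RᵀR: Hutchinson's method averages wᵀAw over Rademacher vectors w, which is what the mixed misfit trace(R̄ᵀR̄) computes for A = RᵀR:

```python
import numpy as np

rng = np.random.default_rng(1)

def hutchinson_trace(A, s):
    """Estimate trace(A) as the average of w^T A w over s Rademacher vectors w."""
    n = A.shape[0]
    total = 0.0
    for _ in range(s):
        w = rng.choice([-1.0, 1.0], size=n)
        total += w @ A @ w
    return total / s

R = rng.standard_normal((100, 400))             # toy residual matrix
A = R.T @ R
print(np.trace(A), hutchinson_trace(A, s=30))   # unbiased estimate of the trace
```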
Nonlinear least-squares with corrupted data
[Figure: two data panels, "good data" and "4% bad data"; horizontal axis 50–550, vertical axis 20–120.]
collaboration with Total
Robust misfit measures
d_i = F(x)q_i + ε

[Figure: density functions and corresponding penalty functions for the Normal (½x²), Laplace (|x|), and Student's t (log(k + x²)) distributions.]
• Least-squares penalty assumes ε ∼ N (0, 1)
• Theorem: Log-concave densities (i.e., those induced by convex penalties) are all exponentially bounded:

  prob(|x| > t + ∆t given |x| > t) = O(e^{−∆t})

• Heavy-tailed densities: robust to wrong data & approximate models
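A small sketch (synthetic residuals, assumed) comparing the three penalties when a few residuals are grossly corrupted; k is the Student's t parameter from the figure above:

```python
import numpy as np

rng = np.random.default_rng(2)
r = rng.standard_normal(1000)
r[:40] += 50.0 * rng.standard_normal(40)    # 4% of residuals grossly corrupted

k = 1.0
normal    = 0.5 * np.sum(r ** 2)            # least-squares penalty
laplace   = np.sum(np.abs(r))               # convex, robust
student_t = np.sum(np.log(k + r ** 2))      # heavy-tailed, non-convex

# Outliers inflate the least-squares penalty far more than the robust ones.
print(normal, laplace, student_t)
```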
Robust full waveform inversion results
• Recover velocity on a 2D grid: 201 × 301
• 151 sources, 6 frequencies: 906 PDE solves per f/∇f evaluation
• 50% corrupted data
[Figure: recovered velocity model (axes 0–30 and 0–20) with a residual histogram over [−3, 3].]
minimize_x  ρ(R(x))   with   R(x) = D − F(x)Q

Normal: ‖R(x)‖²_F     Laplace: Σ_ij |R_ij(x)|     Student's-t: Σ_ij log(k + R_ij(x)²)
[Figure: recovered models (axes 0–30 and 0–20) and residual histograms over [−3, 3] for the Normal, Laplace, and Student's-t penalties.]
Sampling strategies for dimensionality reduction
Generic inverse problem (assuming iid observations):
min_x  f(x) := (1/m) Σ_i ρ(r_i)   with   R(x) = [ r_1 | · · · | r_m ]

Moment condition E[WWᵀ] = I is not generally sufficient to guarantee

E[f̄(x)] = f(x)   and   E[∇f̄(x)] = ∇f(x)   with   R̄(x) = R(x)W

Random subset selection without replacement, i.e.,

R̄(x) = [ r_{i(1)} | r_{i(2)} | · · · | r_{i(s)} ]

does yield the desired "expected objective" property (checked numerically below).
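A quick numerical check of that property with toy residuals and an arbitrary penalty ρ (all assumed here): the average misfit over a uniformly random subset drawn without replacement matches the full misfit in expectation:

```python
import numpy as np

rng = np.random.default_rng(3)
m, s = 200, 20
r = rng.standard_normal((m, 5))                 # toy residual vectors r_i (rows)
rho = lambda ri: np.log(1.0 + ri @ ri)          # an arbitrary per-residual penalty

f_full = np.mean([rho(ri) for ri in r])

estimates = []
for _ in range(5000):
    S = rng.choice(m, size=s, replace=False)    # uniform subset, no replacement
    estimates.append(np.mean([rho(r[i]) for i in S]))

print(f_full, np.mean(estimates))               # the two averages agree closely
```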
MODEL PROBLEM
minimize_x  f(x) := (1/m) Σ_{i=1}^m f_i(x)
Complexity of steepest descent
Baseline Algorithm:

x_{k+1} ← x_k − α_k ∇f(x_k),   α_k ≡ 1/L

Assume: convex f; Lipschitz ∇f with parameter L

Sublinear rate:
• f(x_k) − f(x_*) = O(1/k)  (constant stepsize)
• f(x_k) − f(x_*) = O(1/k²)  (optimal rate with extrapolation)  [Nesterov '83; Tseng '10]

Linear rate: additionally assume that f is strongly convex with parameter µ
• f(x_k) − f(x_*) = O([1 − µ/L]^k)

Note: if f is twice differentiable, µI ⪯ ∇²f(x) ⪯ LI
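A minimal illustration on an assumed toy quadratic, where L and µ are the extreme eigenvalues of the Hessian and the linear rate [1 − µ/L]^k is visible directly:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
M = rng.standard_normal((n, n))
H = M.T @ M + np.eye(n)                  # Hessian of f(x) = 0.5 x^T H x, minimizer x* = 0
eigs = np.linalg.eigvalsh(H)
mu, L = eigs[0], eigs[-1]                # strong-convexity and Lipschitz constants

x0 = rng.standard_normal(n)
x = x0.copy()
for k in range(200):
    x = x - (1.0 / L) * (H @ x)          # x_{k+1} = x_k - (1/L) grad f(x_k)

# Observed contraction vs. the predicted worst-case factor (1 - mu/L)^k.
print(np.linalg.norm(x) / np.linalg.norm(x0), (1 - mu / L) ** 200)
```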
Incremental gradient methods
Algorithm:

x_{k+1} ← x_k − α ∇f_i(x_k),   i ∈ {1, . . . , m},  cyclic or randomized

Assume: f strongly convex (µ) and Lipschitz ∇f (L)

Constant stepsize: α_k ≡ α
• ‖x_k − x_*‖² ≤ O([1 − µ/L]^k) + O(m²α)   (k full cycles)
• E‖x_k − x_*‖² ≤ O([1 − µ/L]^k) + O(mα)   (k iterations)

Decreasing stepsize: Σ_k α_k = ∞,  Σ_k α_k² < ∞
• ‖x_k − x_*‖² = O(1/k)   (k full cycles)
• E‖x_k − x_*‖² = O(1/k)   (k iterations)

Many variations: Luo/Tseng '94; Nedic/Bertsekas '00/'10; Blatt et al. '08; Bottou '10
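A minimal sketch of the randomized variant on an assumed toy least-squares problem, contrasting a constant stepsize (fast progress, then a plateau) with a decreasing one:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 500, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def grad_fi(x, i):                      # gradient of f_i(x) = (a_i^T x - b_i)^2
    return 2.0 * (A[i] @ x - b[i]) * A[i]

for schedule in ("constant", "decreasing"):
    x = np.zeros(n)
    for k in range(20 * m):             # 20 passes through the data
        i = rng.integers(m)
        alpha = 1e-3 if schedule == "constant" else 1.0 / (100.0 + k)
        x = x - alpha * grad_fi(x, i)
    print(schedule, np.linalg.norm(x - x_star))
```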
EXAMPLES
Seismic inversion

Recover image of geological structures via nonlinear least squares

minimize_x  Σ_i Σ_{ω∈Ω} ‖d_i − P H_ω(x)⁻¹ q_i‖²

Observations: each of m "shots" is an experiment:
sources: q_1, . . . , q_m;  measurements: d_1, . . . , d_m
[Figure: velocity model, x 0–3 km, z 0–2 km.]
[Figure: velocity-model reconstructions (each panel x 0–3 km, z 0–2 km) at increasing numbers of passes through the data, up to 39 of 39 passes, comparing full gradient, incremental gradient, and gradient sampling.]
Image denoising
• Statistical denoising via conditional random fields
• Dataset of 50 synthetic 64 × 64 images [Kumar/Hebert '04]
• Generalization of logistic model to capture dependencies among labels

maximize_x  Σ_{i=1}^m log p(b_i; x)
• p is intractable and approximated
[Figure: true image and noisy sample; denoising results from a fraction of a pass up to 5 of 5 passes through the data, comparing full gradient, incremental gradient, and gradient sampling.]
Sampling approach
Increasing batch:  [Bertsekas/Tsitsiklis '96, Shapiro/H-do-M '00]

S_k ⊆ {1, . . . , m},   s_k → m (slowly)

Sample-average gradient:

g_k(x) := (1/s_k) Σ_{i∈S_k} ∇f_i(x)

Algorithm:

x_{k+1} ← x_k − α_k d   with   H_k d = −g_k(x_k)

Goal: non-asymptotic analysis based on controlling gradient error
• g_k(x_k) = ∇f(x_k) + e_k  where  ‖e_k‖² ≤ ε_k  or  E[‖e_k‖²] ≤ ε_k
• How to control sample size s_k?
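A minimal sketch of this scheme with H_k = I (plain gradient steps) on an assumed toy least-squares problem; the geometric growth rule for s_k is just one simple choice, not the specific schedule analyzed next:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5000, 10
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

def g(x, idx):                          # sample-average gradient over the rows in idx
    return 2.0 * A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

x, s, alpha = np.zeros(n), 10, 0.1
for k in range(100):
    S = rng.choice(m, size=min(s, m), replace=False)   # sample without replacement
    x = x - alpha * g(x, S)                            # step along -g_k(x_k)
    s = int(np.ceil(1.1 * s))                          # grow s_k -> m (slowly)

print(np.linalg.norm(g(x, np.arange(m))))              # full gradient is now small
```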
Gradient with generic errors
Prototype algorithm:

x_{k+1} ← x_k − α g_k,   g_k = ∇f(x_k) + e_k,   α fixed

Assumptions:
• Lipschitz gradient (L); strong convexity (µ)
• ‖e_k‖² ≤ ε_k

Convergence rate: for all k = 1, 2, . . .   [F. & Schmidt '12]

‖x_k − x_*‖² ≤ O([1 − µ/L]^k) + O(ε_k)

Examples:
• linear: ε_k = O(γ^k)  ⟹  f(x_k) − f_* = O(max{γ, 1 − µ/L}^k)
• sublinear: ε_k = O(1/k²)  ⟹  f(x_k) − f_* = O(1/k²)
• persistent error: inf_k ε_k > 0  ⟹  f(x_k) − f_* = O(ε_k)
Growing the sample size
Prototype algorithm:

x_{k+1} ← x_k − α g_k,   g_k = (1/s_k) Σ_{i∈S_k} ∇f_i(x_k),   S_k ⊆ {1, . . . , m}

Sampling strategies:
• Deterministic: pre-determined sample sequence
• Randomized: uniform sampling

Convergence rates, for all k = 1, 2, . . .   [F. & Schmidt '12]
• deterministic:  ‖x_k − x_*‖² = O([1 − µ/L]^k) + O([(m − s_k)/m]²)
• sampling w/o replacement:  E‖x_k − x_*‖² = O([1 − µ/L]^k) + O((m − s_k)/m · 1/s_k)
• sampling w/ replacement:  E‖x_k − x_*‖² = O([1 − µ/L]^k) + O(1/s_k)
Illustration
[Figure: sample-size schedule (s/m vs. iteration k) and cumulative samples (vs. iteration k) for the deterministic, sampling-without-replacement, and sampling-with-replacement strategies.]
In practice
Algorithm:

x_{k+1} ← x_k − α_k d,   B_k d = −g_k(x_k),   g_k(x) = (1/s_k) Σ_{i∈S_k} ∇f_i(x)

Hessian approximations B_k:
• (limited-memory) quasi-Newton  [Schraudolph et al. '07]:
  s_k := x_{k+1} − x_k,   y_k := g_k(x_{k+1}) − g_k(x_k)
• sample-average Hessian  [Byrd et al. '11] (sketched below):
  B_k := (1/h_k) Σ_{i∈H_k} ∇²f_i(x_k),   h_k ≪ s_k
• Fisher information, for f(x) = −Σ_i log p_i(ω_i; x)  [Osborne '92]:
  B_k ≈ E_ω[∇²f(x)] = E_ω[∇f(x)∇f(x)ᵀ]
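A minimal sketch of the second option: a sample-average Hessian built from a small subsample H_k, shown for the binary logistic-regression objective of the next section (synthetic data; a dense solve stands in for the matrix-free solve one would use at scale):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 2000, 15
A = rng.standard_normal((m, n))
b = np.sign(A @ rng.standard_normal(n) + 0.5 * rng.standard_normal(m))

def grad(x, idx):                        # averaged gradient of -log p(b_i | a_i, x)
    z = 1.0 / (1.0 + np.exp(b[idx] * (A[idx] @ x)))
    return -(A[idx].T @ (b[idx] * z)) / len(idx)

def hess(x, idx):                        # averaged Hessian over the subsample idx
    p = 1.0 / (1.0 + np.exp(-(A[idx] @ x)))
    return (A[idx].T * (p * (1.0 - p))) @ A[idx] / len(idx)

x = np.zeros(n)
for k in range(10):
    S = rng.choice(m, size=500, replace=False)   # gradient sample S_k
    Hk = rng.choice(m, size=50, replace=False)   # smaller Hessian sample, h_k << s_k
    B = hess(x, Hk) + 1e-4 * np.eye(n)           # regularized subsampled Hessian
    x = x + np.linalg.solve(B, -grad(x, S))      # B_k d = -g_k(x_k), then x <- x + d

print(np.linalg.norm(grad(x, np.arange(m))))     # full gradient after 10 steps
```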
APPLICATIONS
Binary logistic regression
max_x  Σ_i log p(b_i | a_i, x),   p(b_i | a_i, x) = 1 / (1 + exp(−b_i a_iᵀx)),   b_i ∈ {−1, 1}

• Email spam classifier (Cormack and Lynam, 2005)
• TREC 2005 dataset: 92,189 email messages from the Enron investigation
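A toy re-creation (random synthetic data, not the TREC corpus) of the kind of comparison in the figure that follows: a deterministic full-gradient method, a constant-stepsize stochastic method, and a hybrid method whose batch grows, with cost measured in passes through the data:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 5000, 20
A = rng.standard_normal((m, n))
b = np.sign(A @ rng.standard_normal(n) + rng.standard_normal(m))

def f(x):                                    # full average negative log-likelihood
    return np.mean(np.logaddexp(0.0, -b * (A @ x)))

def grad(x, idx):                            # averaged gradient over examples in idx
    z = 1.0 / (1.0 + np.exp(b[idx] * (A[idx] @ x)))
    return -(A[idx].T @ (b[idx] * z)) / len(idx)

def run(method, alpha, passes=20):
    x, used, s = np.zeros(n), 0, 10
    while used < passes * m:                 # budget measured in grad f_i evaluations
        if method == "deterministic":
            idx = np.arange(m)
        elif method == "stochastic":
            idx = rng.integers(m, size=1)
        else:                                # hybrid: batch grows toward the full data set
            idx = rng.choice(m, size=min(s, m), replace=False)
            s = int(np.ceil(1.05 * s))
        x = x - alpha * grad(x, idx)
        used += len(idx)
    return f(x)

for method, alpha in (("deterministic", 0.5), ("stochastic", 0.05), ("hybrid", 0.5)):
    print(method, run(method, alpha))
```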
[Figure: objective minus optimal (log scale, 10⁰–10⁴) vs. passes through the data (0–100) for Stochastic (stepsizes 1e+01, 1e+00, 1e−01), Hybrid, and Deterministic methods.]
Multinomial logistic regression
max_x  Σ_i log p(b_i = j | a_i, {x_j}_{j∈C}),   b_i ∈ C

• Digit classification
• MNIST dataset: 70,000 handwritten 28 × 28 images of digits
[Figure: objective minus optimal (log scale, 10³–10⁵) vs. passes through the data (0–100) for Stochastic (stepsizes 1e−02, 1e−03, 1e−04), Hybrid, and Deterministic methods.]
Chain-structured conditional random fields
max_x  Σ_i log p({b_i^k = j_k}_{k∈Ω} | {a_i^k}_{k∈Ω}, {x_j}_{j∈C}),   b_i^k ∈ C,  k ∈ Ω

• Noun-phrase chunking task from natural-language processing
• CoNLL-2000 Shared Task dataset: 211,727 words in 8,936 sentences
[Figure: objective minus optimal (log scale, 10²–10⁵) vs. passes through the data (0–100) for Stochastic (stepsizes 1e−01, 1e−02, 1e−03), Hybrid, and Deterministic methods.]
General conditional random fields
min_x  Σ_{i=1}^m −log p(b_i; x)

• Statistical denoising
• Kumar/Hebert dataset of 50 synthetic 64 × 64 images
[Figure: objective minus optimal (log scale, 10⁻⁸–10⁴) vs. passes through the data (0–100) for Stochastic (stepsizes 1e−03, 1e−04, 1e−05), Hybrid, and Deterministic methods.]
Seismic inversion
min_x  Σ_i Σ_{ω∈Ω} ‖d_i − P H_ω(x)⁻¹ q_i‖²

• Recover seismic image via nonlinear least squares
• Marmousi 2D acoustic model; 101 sources/receivers; 8 frequencies
[Figure: objective minus optimal (log scale, 10⁴–10⁷) vs. passes through the data (0–25) for Stochastic (stepsizes 1e−08, 1e−09, 1e−10), Hybrid, and Deterministic methods.]
No strong convexity, no linear convergence
Steepest descent

x_{k+1} ← x_k − α ∇f(x_k)

is sublinear:

f(x_k) − f(x_*) = O(1/k)

Approximate steepest descent

x_{k+1} ← x_k − α d_k   with   d_k = ∇f(x_k) + e_k

Summable errors, i.e., Σ_{k=1}^∞ ‖e_k‖ < ∞, give a sublinear rate for the iterate average:

f(x̄_k) − f(x_*) = O(1/k),   x̄_k := (1/k) Σ_{i=1}^k x_i
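A small sketch of this on an assumed toy problem that is convex but not strongly convex (a rank-deficient least-squares objective), with gradient errors that decay like 1/k² and a running average of the iterates:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 200, 50
A = rng.standard_normal((m, 10)) @ rng.standard_normal((10, n))  # rank-deficient design
b = rng.standard_normal(m)

f = lambda x: np.mean((A @ x - b) ** 2)          # convex, but not strongly convex
grad = lambda x: 2.0 * A.T @ (A @ x - b) / m
L = 2.0 * np.linalg.norm(A, 2) ** 2 / m          # Lipschitz constant of the gradient

x, x_bar = np.zeros(n), np.zeros(n)
for k in range(1, 2001):
    e = rng.standard_normal(n) / k ** 2          # summable gradient errors, ||e_k|| ~ 1/k^2
    x = x - (1.0 / L) * (grad(x) + e)
    x_bar += (x - x_bar) / k                     # running average of the iterates

print(f(x_bar), f(x))
```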
References I

• H. Avron and S. Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM, 58:8:1–8:34, April 2011.
• D. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, chapter 4, pages 85–115. MIT Press, 2012.
• D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
• D. Blatt, A. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM J. Optim., 18(1):29–51, 2008.
• L. Bottou. Large-scale machine learning with stochastic gradient descent. In Y. Lechevallier and G. Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pages 177–187, Paris, France, August 2010. Springer.
• R. H. Byrd, G. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Math. Program., 134:127–155, 2012.
• R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3):977–995, 2011.
• L. Grippo. A class of unconstrained minimization methods for neural network training. Optim. Methods Softw., 4(2):135–150, 1994.
• E. Haber, M. Chung, and F. J. Herrmann. An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM J. Optim., 22(3):739–757, 2012.
References II

• M. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Comm. Statist. Simulation Comput., 19(2):433–450, 1990.
• A. Kleywegt, A. Shapiro, and T. Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM J. Optim., 12(2):479–502, 2002.
• Z. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res., 46(1):157–178, 1993.
• A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. Stochastic Optimization: Algorithms and Applications, pages 263–304, 2000.
• Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Soviet Math. Dokl., 269, 1983.
• M. Osborne. Fisher's method of scoring. Intern. Stat. Rev., pages 99–117, 1992.
• N. N. Schraudolph, J. Yu, and S. Gunter. A stochastic quasi-Newton method for online convex optimization. In M. Meila and X. Shen, editors, Proc. 11th Intl. Conf. Artificial Intelligence and Statistics (AIStats), volume 2 of Workshop Conf. Proc., pages 436–443, San Juan, Puerto Rico, 2007. J. Machine Learning Res.
• P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program., 125:263–295, 2010.
• K. van den Doel and U. M. Ascher. Adaptive and stochastic algorithms for electrical impedance tomography and DC resistivity problems with piecewise constant solutions and many measurements. SIAM J. Comput., 34(1), 2012.
Thanks!
Read:
• M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Scientific Computing, 34(3), April 2012.
• A. Aravkin, M. P. Friedlander, F. Herrmann, and T. van Leeuwen. Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming, 134(1):101–125, 2012.

Email: [email protected]

Surf: http://www.cs.ubc.ca/~mpf