
Basics of Numerical Optimization: Preliminaries

Ju Sun

Computer Science & Engineering

University of Minnesota, Twin Cities

February 11, 2020


Supervised learning as function approximation

– Underlying true function: $f_0$

– Training data: $(x_i, y_i)$ with $y_i \approx f_0(x_i)$

– Choose a family of functions $\mathcal{H}$, so that $\exists f \in \mathcal{H}$ with $f$ and $f_0$ close

– Find $f$, i.e., optimization:
  $$\min_{f \in \mathcal{H}} \sum_i \ell\left(y_i, f(x_i)\right) + \Omega(f)$$

– Approximation capacity: universal approximation theorems (UAT)
  $\Longrightarrow$ replace $\mathcal{H}$ by $\mathrm{DNN}_W$, i.e., a deep neural network with weights $W$

– Optimization:
  $$\min_{W} \sum_i \ell\left(y_i, \mathrm{DNN}_W(x_i)\right) + \Omega(W)$$

– Generalization: how to avoid an over-complicated $\mathrm{DNN}_W$ in view of UAT

Now we start to focus on optimization.
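A minimal sketch of the "$\min_W \sum_i \ell(y_i, f_W(x_i)) + \Omega(W)$" template above (not from the slides): a linear model $f_W(x) = Wx$, squared loss, and a ridge regularizer, all illustrative stand-ins for the DNN case.

```python
import numpy as np

# min_W  sum_i ||y_i - W x_i||^2 + lam * ||W||_F^2, solved by plain gradient descent.
# (Illustrative choices only; the slides put a DNN in place of the linear model.)
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                           # inputs x_i as rows
W_true = rng.normal(size=(1, d))
Y = X @ W_true.T + 0.01 * rng.normal(size=(n, 1))     # labels y_i ~ f_0(x_i)

lam, step = 1e-2, 1e-3
W = np.zeros((1, d))                                  # optimization variable

def objective(W):
    r = Y - X @ W.T                                   # residuals y_i - f_W(x_i)
    return np.sum(r**2) + lam * np.sum(W**2)

for _ in range(2000):                                 # plain gradient descent
    r = Y - X @ W.T
    grad = -2 * r.T @ X + 2 * lam * W
    W -= step * grad

print(objective(W), np.linalg.norm(W - W_true))
```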


Outline

Elements of multivariate calculus

Optimality conditions of unconstrained optimization


Recommended references

[Munkres, 1997, Zorich, 2015, Coleman, 2012]


Our notation

– scalars: $x$, vectors: $\boldsymbol{x}$ (bold lowercase), matrices: $\boldsymbol{X}$ (bold uppercase), tensors: $\boldsymbol{\mathcal{X}}$, sets: $\mathcal{S}$

– vectors are always column vectors, unless stated otherwise

– $x_i$: $i$-th element of $\boldsymbol{x}$; $x_{ij}$: $(i,j)$-th element of $\boldsymbol{X}$; $\boldsymbol{x}^{i}$: $i$-th row of $\boldsymbol{X}$ as a row vector; $\boldsymbol{x}_{j}$: $j$-th column of $\boldsymbol{X}$ as a column vector

– $\mathbb{R}$: real numbers, $\mathbb{R}_+$: positive reals, $\mathbb{R}^n$: space of $n$-dimensional vectors, $\mathbb{R}^{m \times n}$: space of $m \times n$ matrices, $\mathbb{R}^{m \times n \times k}$: space of $m \times n \times k$ tensors, etc.

– $[n] \doteq \{1, \dots, n\}$


Differentiability — first order

Consider $f(x): \mathbb{R}^n \to \mathbb{R}^m$.

– Definition: $f$ is first-order differentiable at a point $x$ if there exists a matrix $B \in \mathbb{R}^{m \times n}$ such that
  $$\frac{f(x+\delta) - f(x) - B\delta}{\|\delta\|_2} \to 0 \quad \text{as } \delta \to 0,$$
  i.e., $f(x+\delta) = f(x) + B\delta + o(\|\delta\|_2)$ as $\delta \to 0$.

– $B$ is called the (Fréchet) derivative. When $m = 1$, $B$ is a row vector and $B^\top$ is called the gradient, denoted $\nabla f(x)$. For general $m$, $B$ is also called the Jacobian matrix, denoted $J_f(x)$.

– Calculation: $b_{ij} = \dfrac{\partial f_i}{\partial x_j}(x)$

– Sufficient condition: if all partial derivatives exist and are continuous at $x$, then $f$ is differentiable at $x$.
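A quick numerical sanity check of the definition (not from the slides): for a smooth map, the remainder $f(x+\delta) - f(x) - J_f(x)\delta$ should be $o(\|\delta\|_2)$. The test function below is an illustrative choice.

```python
import numpy as np

# f : R^2 -> R^2, f(x) = (x1^2 * x2, sin(x1) + x2)
def f(x):
    return np.array([x[0]**2 * x[1], np.sin(x[0]) + x[1]])

# analytic Jacobian, J_f(x)[i, j] = d f_i / d x_j
def J_f(x):
    return np.array([[2 * x[0] * x[1], x[0]**2],
                     [np.cos(x[0]),    1.0    ]])

x = np.array([0.7, -1.3])
rng = np.random.default_rng(0)
delta = rng.normal(size=2)

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    d = t * delta
    remainder = f(x + d) - f(x) - J_f(x) @ d
    print(t, np.linalg.norm(remainder) / np.linalg.norm(d))
# the ratios shrink roughly linearly in t, consistent with a remainder of o(||delta||_2)
```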


Calculus rules

Assume $f, g: \mathbb{R}^n \to \mathbb{R}^m$ are differentiable at a point $x \in \mathbb{R}^n$.

– Linearity: $\lambda_1 f + \lambda_2 g$ is differentiable at $x$, and
  $$\nabla[\lambda_1 f + \lambda_2 g](x) = \lambda_1 \nabla f(x) + \lambda_2 \nabla g(x).$$

– Product: assume $m = 1$; $fg$ is differentiable at $x$, and
  $$\nabla[fg](x) = f(x)\nabla g(x) + g(x)\nabla f(x).$$

– Quotient: assume $m = 1$ and $g(x) \neq 0$; $f/g$ is differentiable at $x$, and
  $$\nabla\!\left[\frac{f}{g}\right](x) = \frac{g(x)\nabla f(x) - f(x)\nabla g(x)}{g^2(x)}.$$

– Chain rule: let $f: \mathbb{R}^n \to \mathbb{R}^m$ and $h: \mathbb{R}^m \to \mathbb{R}^k$, with $f$ differentiable at $x$, $y = f(x)$, and $h$ differentiable at $y$. Then $h \circ f: \mathbb{R}^n \to \mathbb{R}^k$ is differentiable at $x$, and
  $$J_{h \circ f}(x) = J_h(f(x))\, J_f(x).$$
  When $k = 1$,
  $$\nabla[h \circ f](x) = J_f^\top(x)\, \nabla h(f(x)).$$
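A numerical check of the $k = 1$ chain-rule formula above, under illustrative choices of $f$ and $h$ (not from the slides).

```python
import numpy as np

# f : R^2 -> R^3 and h : R^3 -> R, so h o f : R^2 -> R
def f(x):
    return np.array([x[0] * x[1], x[0]**2, np.exp(x[1])])

def J_f(x):  # 3 x 2 Jacobian
    return np.array([[x[1],     x[0]],
                     [2 * x[0], 0.0 ],
                     [0.0,      np.exp(x[1])]])

def h(y):
    return np.sum(y**2)

def grad_h(y):
    return 2 * y

x = np.array([0.5, -0.2])

# chain rule: gradient of h o f at x equals J_f(x)^T grad_h(f(x))
grad_chain = J_f(x).T @ grad_h(f(x))

# compare against central finite differences of h(f(.))
eps = 1e-6
grad_fd = np.array([
    (h(f(x + eps * e)) - h(f(x - eps * e))) / (2 * eps)
    for e in np.eye(2)
])
print(grad_chain, grad_fd)   # the two should agree to ~1e-6
```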


Differentiability — second order

Consider $f(x): \mathbb{R}^n \to \mathbb{R}$, and assume $f$ is 1st-order differentiable in a small ball around $x$.

– Write $\dfrac{\partial^2 f}{\partial x_j \partial x_i}(x) \doteq \left[\dfrac{\partial}{\partial x_j}\left(\dfrac{\partial f}{\partial x_i}\right)\right](x)$, provided the right side is well defined.

– Symmetry: if both $\dfrac{\partial^2 f}{\partial x_j \partial x_i}(x)$ and $\dfrac{\partial^2 f}{\partial x_i \partial x_j}(x)$ exist and both are continuous at $x$, then they are equal.

– Hessian (matrix):
  $$\nabla^2 f(x) \doteq \left[\frac{\partial^2 f}{\partial x_j \partial x_i}(x)\right]_{j,i} \in \mathbb{R}^{n \times n},$$
  whose $(j,i)$-th element is $\dfrac{\partial^2 f}{\partial x_j \partial x_i}(x)$.

– $\nabla^2 f$ is symmetric.

– Sufficient condition: if all $\dfrac{\partial^2 f}{\partial x_j \partial x_i}(x)$ exist and are continuous, then $f$ is 2nd-order differentiable at $x$ (not conversely; we omit the definition due to its technicality).
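A finite-difference sketch (not from the slides) that builds the Hessian entry by entry from the mixed-partial formula above and checks its symmetry; the test function is an illustrative assumption.

```python
import numpy as np

def f(x):
    # illustrative smooth function on R^3
    return x[0]**2 * x[1] + np.sin(x[1] * x[2])

def hessian_fd(f, x, eps=1e-5):
    """Central finite differences for d^2 f / (dx_j dx_i)."""
    n = x.size
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[j, i] = (f(x + eps*I[i] + eps*I[j]) - f(x + eps*I[i] - eps*I[j])
                       - f(x - eps*I[i] + eps*I[j]) + f(x - eps*I[i] - eps*I[j])) / (4 * eps**2)
    return H

x = np.array([0.3, -0.7, 1.1])
H = hessian_fd(f, x)
print(np.max(np.abs(H - H.T)))   # ~0 up to finite-difference error: the Hessian is symmetric
```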


Taylor's theorem

Vector version: consider $f(x): \mathbb{R}^n \to \mathbb{R}$.

– If $f$ is 1st-order differentiable at $x$, then
  $$f(x+\delta) = f(x) + \langle \nabla f(x), \delta \rangle + o(\|\delta\|_2) \quad \text{as } \delta \to 0.$$

– If $f$ is 2nd-order differentiable at $x$, then
  $$f(x+\delta) = f(x) + \langle \nabla f(x), \delta \rangle + \tfrac{1}{2}\langle \delta, \nabla^2 f(x)\, \delta \rangle + o(\|\delta\|_2^2) \quad \text{as } \delta \to 0.$$

Matrix version: consider $f(X): \mathbb{R}^{m \times n} \to \mathbb{R}$.

– If $f$ is 1st-order differentiable at $X$, then
  $$f(X+\Delta) = f(X) + \langle \nabla f(X), \Delta \rangle + o(\|\Delta\|_F) \quad \text{as } \Delta \to 0.$$

– If $f$ is 2nd-order differentiable at $X$, then
  $$f(X+\Delta) = f(X) + \langle \nabla f(X), \Delta \rangle + \tfrac{1}{2}\langle \Delta, \nabla^2 f(X)\, \Delta \rangle + o(\|\Delta\|_F^2) \quad \text{as } \Delta \to 0.$$
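A quick numerical illustration (not from the slides) that the second-order remainder really is $o(\|\delta\|_2^2)$; the smooth test function below is an illustrative choice.

```python
import numpy as np

def f(x):
    return np.exp(x[0]) * np.cos(x[1])

def grad_f(x):
    return np.array([np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])])

def hess_f(x):
    return np.array([[ np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])],
                     [-np.exp(x[0]) * np.sin(x[1]), -np.exp(x[0]) * np.cos(x[1])]])

x = np.array([0.2, 0.4])
d = np.array([0.3, -0.5])

for t in [1e-1, 1e-2, 1e-3]:
    delta = t * d
    taylor2 = f(x) + grad_f(x) @ delta + 0.5 * delta @ hess_f(x) @ delta
    remainder = f(x + delta) - taylor2
    print(t, abs(remainder) / np.dot(delta, delta))
# the ratio remainder / ||delta||_2^2 keeps shrinking with t, i.e., remainder = o(||delta||_2^2)
```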


Taylor approximation — asymptotic uniqueness

Let $f: \mathbb{R} \to \mathbb{R}$ be $k$ times ($k \ge 1$ integer) differentiable at a point $x$. If $P(\delta)$ is a $k$-th order polynomial satisfying $f(x+\delta) - P(\delta) = o(\delta^k)$ as $\delta \to 0$, then
$$P(\delta) = P_k(\delta) \doteq f(x) + \sum_{i=1}^{k} \frac{1}{i!} f^{(i)}(x)\, \delta^i.$$

Generalization to the vector version:

– Assume $f(x): \mathbb{R}^n \to \mathbb{R}$ is 1st-order differentiable at $x$. If $P(\delta) \doteq f(x) + \langle v, \delta \rangle$ satisfies
  $$f(x+\delta) - P(\delta) = o(\|\delta\|_2) \quad \text{as } \delta \to 0,$$
  then $P(\delta) = f(x) + \langle \nabla f(x), \delta \rangle$, i.e., the 1st-order Taylor expansion.

– Assume $f(x): \mathbb{R}^n \to \mathbb{R}$ is 2nd-order differentiable at $x$. If $P(\delta) \doteq f(x) + \langle v, \delta \rangle + \tfrac{1}{2}\langle \delta, H\delta \rangle$ with $H$ symmetric satisfies
  $$f(x+\delta) - P(\delta) = o(\|\delta\|_2^2) \quad \text{as } \delta \to 0,$$
  then $P(\delta) = f(x) + \langle \nabla f(x), \delta \rangle + \tfrac{1}{2}\langle \delta, \nabla^2 f(x)\, \delta \rangle$, i.e., the 2nd-order Taylor expansion. We can read off $\nabla f$ and $\nabla^2 f$ if we know the expansion!

Similarly for the matrix version. See Chap. 5 of [Coleman, 2012] for other forms of Taylor's theorem and proofs of the asymptotic uniqueness.
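A short worked example of reading off $\nabla f$ and $\nabla^2 f$ from an expansion (added here; not in this transcript): take $f(x) = \tfrac{1}{2} x^\top A x + b^\top x$ with $A$ symmetric. Expanding directly,
$$f(x+\delta) = \tfrac{1}{2}(x+\delta)^\top A (x+\delta) + b^\top (x+\delta) = \underbrace{\tfrac{1}{2} x^\top A x + b^\top x}_{f(x)} + \langle Ax + b,\ \delta \rangle + \tfrac{1}{2} \langle \delta,\ A \delta \rangle,$$
with no remainder at all, so by asymptotic uniqueness $\nabla f(x) = Ax + b$ and $\nabla^2 f(x) = A$.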


Asymptotic uniqueness — why interesting?

Two ways of deriving gradients and Hessians (Recall HW0!)


Asymptotic uniqueness — why interesting?

Think of neural networks with identity activation functions:
$$f(W) = \sum_i \big\| y_i - W_k W_{k-1} \cdots W_2 W_1 x_i \big\|_F^2$$

How to derive the gradient?

– Scalar chain rule?

– Vector chain rule?

– First-order Taylor expansion (sketched below)

Why interesting? See, e.g., [Kawaguchi, 2016, Lampinen and Ganguli, 2018].
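A sketch of the third route (added here; not spelled out in this transcript): perturb a single factor $W_j \to W_j + \Delta$ and keep only the first-order term. Write $A \doteq W_k \cdots W_{j+1}$, $B \doteq W_{j-1} \cdots W_1$ (empty products read as identities), and $r_i \doteq y_i - W_k \cdots W_1 x_i$. Then
$$f(\ldots, W_j + \Delta, \ldots) = \sum_i \big\| r_i - A\,\Delta\, B x_i \big\|_F^2 = f(W) - 2\sum_i \big\langle A^\top r_i\, (B x_i)^\top,\ \Delta \big\rangle + O(\|\Delta\|_F^2),$$
so by asymptotic uniqueness
$$\nabla_{W_j} f = -2 \sum_i (W_k \cdots W_{j+1})^\top \big( y_i - W_k \cdots W_1 x_i \big) \big( W_{j-1} \cdots W_1 x_i \big)^\top,$$
with no entrywise index bookkeeping required.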


Directional derivatives and curvatures

Consider $f(x): \mathbb{R}^n \to \mathbb{R}$.

– Directional derivative: $D_v f(x) \doteq \dfrac{d}{dt} f(x + tv)\Big|_{t=0}$

– When $f$ is 1st-order differentiable at $x$,
  $$D_v f(x) = \langle \nabla f(x), v \rangle.$$

– Now $D_v f(x): \mathbb{R}^n \to \mathbb{R}$; what is $D_u(D_v f)(x)$?
  $$D_u(D_v f)(x) = \langle u, \nabla^2 f(x)\, v \rangle.$$

– When $u = v$,
  $$D_u(D_u f)(x) = \langle u, \nabla^2 f(x)\, u \rangle = \frac{d^2}{dt^2} f(x + tu)\Big|_{t=0}.$$

– $\dfrac{\langle u, \nabla^2 f(x)\, u \rangle}{\|u\|_2^2}$ is the directional curvature along $u$, independent of the norm of $u$.
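A small numerical check (not from the slides) of the identities $D_u f(x) = \langle \nabla f(x), u\rangle$ and $\frac{d^2}{dt^2} f(x+tu)\big|_{t=0} = \langle u, \nabla^2 f(x)\, u\rangle$; the test function is an illustrative choice.

```python
import numpy as np

def f(x):
    return x[0]**2 * x[1] + x[1]**3

def grad_f(x):
    return np.array([2*x[0]*x[1], x[0]**2 + 3*x[1]**2])

def hess_f(x):
    return np.array([[2*x[1], 2*x[0]],
                     [2*x[0], 6*x[1]]])

x = np.array([1.0, -0.5])
u = np.array([0.3, 0.8])
phi = lambda s: f(x + s*u)          # 1-D slice of f along direction u
t = 1e-4

# directional derivative vs <grad f(x), u>
print((phi(t) - phi(-t)) / (2*t), grad_f(x) @ u)

# directional second derivative vs <u, Hessian u>
print((phi(t) - 2*phi(0) + phi(-t)) / t**2, u @ hess_f(x) @ u)

# directional curvature along u (invariant to rescaling u)
print(u @ hess_f(x) @ u / (u @ u))
```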


Directional curvature

$\dfrac{\langle u, \nabla^2 f(x)\, u \rangle}{\|u\|_2^2}$ is the directional curvature along $u$, independent of the norm of $u$.

[Figure: 1-D slices of a surface through a point. Blue: negative curvature (bending down); red: positive curvature (bending up).]


Outline

Elements of multivariate calculus

Optimality conditions of unconstrained optimization


Optimization problems

"Nothing takes place in the world whose meaning is not that of some maximum or minimum." – Euler

$$\min_{x} f(x) \quad \text{s.t.} \quad x \in \mathcal{C}.$$

– $x$: optimization variable(s), $f(x)$: objective function, $\mathcal{C}$: constraint (or feasible) set

– $\mathcal{C}$ consists of discrete values (e.g., $\{-1,+1\}^n$): discrete optimization; $\mathcal{C}$ consists of continuous values (e.g., $\mathbb{R}^n$, $[0,1]^n$): continuous optimization

– $\mathcal{C}$ is the whole space $\mathbb{R}^n$: unconstrained optimization; $\mathcal{C}$ is a strict subset of the space: constrained optimization

We focus on continuous, unconstrained optimization here.


Global and local mins

[Figure credit: study.com]

Let $f(x): \mathbb{R}^n \to \mathbb{R}$, and consider
$$\min_{x \in \mathbb{R}^n} f(x).$$

– $x_0$ is a local minimizer if $\exists\, \varepsilon > 0$ such that $f(x_0) \le f(x)$ for all $x$ satisfying $\|x - x_0\|_2 < \varepsilon$. The value $f(x_0)$ is called a local minimum.

– $x_0$ is a global minimizer if $f(x_0) \le f(x)$ for all $x \in \mathbb{R}^n$. The value $f(x_0)$ is called the global minimum.


A naive solution

Grid search:

– For a 1-D problem, assume we know the global min lies in $[-1, 1]$.

– Take uniformly spaced grid points in $[-1, 1]$ so that adjacent points are separated by $\varepsilon$.

– Need $O(\varepsilon^{-1})$ points to get an $\varepsilon$-close point to the global min by exhaustive search.

For $n$-D problems, this needs $O(\varepsilon^{-n})$ computation.

Better characterization of the local/global mins may help avoid this.
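A throwaway sketch (not from the slides) making the $O(\varepsilon^{-n})$ count concrete; the resolution, dimensions, and toy objective are arbitrary illustrative choices.

```python
import itertools
import numpy as np

def grid_search_min(f, n, eps):
    """Exhaustive search over a uniform grid of spacing eps on [-1, 1]^n."""
    pts_1d = np.arange(-1.0, 1.0 + eps, eps)
    best_x, best_val = None, np.inf
    for x in itertools.product(pts_1d, repeat=n):
        val = f(np.array(x))
        if val < best_val:
            best_x, best_val = np.array(x), val
    return best_x, best_val, len(pts_1d) ** n    # O(eps^{-n}) points in total

f = lambda x: np.sum((x - 0.3) ** 2)             # toy objective with a known minimizer
for n in [1, 2, 3]:
    x_hat, v, n_pts = grid_search_min(f, n, eps=0.1)
    print(n, n_pts, x_hat, v)                    # the point count grows exponentially in n
```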


First-order optimality condition

Necessary condition: assume $f$ is 1st-order differentiable at $x_0$. If $x_0$ is a local minimizer, then $\nabla f(x_0) = 0$.

Intuition: $\nabla f$ is the "rate of change" of the function value. If the rate is not zero at $x_0$, it is possible to decrease $f$ along $-\nabla f(x_0)$.

Taylor's: $f(x_0 + \delta) = f(x_0) + \langle \nabla f(x_0), \delta \rangle + o(\|\delta\|_2)$. If $x_0$ is a local min:

– For all $\delta$ sufficiently small,
  $$f(x_0 + \delta) - f(x_0) = \langle \nabla f(x_0), \delta \rangle + o(\|\delta\|_2) \ge 0.$$

– For all $\delta$ sufficiently small, the sign of $\langle \nabla f(x_0), \delta \rangle + o(\|\delta\|_2)$ is determined by the sign of $\langle \nabla f(x_0), \delta \rangle$, i.e., $\langle \nabla f(x_0), \delta \rangle \ge 0$.

– So for all $\delta$ sufficiently small, $\langle \nabla f(x_0), \delta \rangle \ge 0$ and $\langle \nabla f(x_0), -\delta \rangle = -\langle \nabla f(x_0), \delta \rangle \ge 0$ $\Longrightarrow$ $\langle \nabla f(x_0), \delta \rangle = 0$.

– So $\nabla f(x_0) = 0$.


First-order optimality condition

Necessary condition: assume $f$ is 1st-order differentiable at $x_0$. If $x_0$ is a local minimizer, then $\nabla f(x_0) = 0$.

When is it sufficient? For convex functions.

[Figure credit: Wikipedia]

– Geometric def.: a function for which any line segment connecting two points of its graph always lies on or above the graph

– Algebraic def.: $\forall\, x, y$ and $\alpha \in [0, 1]$:
  $$f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y).$$

Any convex function has only one local minimum (value!), which is also global!

Proof sketch: if $x, z$ are both local minimizers and $f(z) < f(x)$, then
$$f(\alpha z + (1-\alpha) x) \le \alpha f(z) + (1-\alpha) f(x) < \alpha f(x) + (1-\alpha) f(x) = f(x).$$
But $\alpha z + (1-\alpha) x \to x$ as $\alpha \to 0$, contradicting that $x$ is a local minimizer.
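An empirical check (not part of the slides) of the algebraic definition on a convex and a nonconvex example; the test functions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def violates_convexity(f, dim, trials=10000):
    """Return True if some random (x, y, alpha) violates
    f(a x + (1-a) y) <= a f(x) + (1-a) f(y)."""
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        a = rng.uniform()
        if f(a*x + (1-a)*y) > a*f(x) + (1-a)*f(y) + 1e-12:
            return True
    return False

convex_f = lambda x: np.sum(x**2) + np.abs(x[0])       # convex
nonconvex_g = lambda x: np.sin(3*x[0]) + np.sum(x**2)  # nonconvex
print(violates_convexity(convex_f, 3))     # False: no violation found
print(violates_convexity(nonconvex_g, 3))  # True (with high probability)
```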


First-order optimality condition

Necessary condition: assume $f$ is 1st-order differentiable at $x_0$. If $x_0$ is a local minimizer, then $\nabla f(x_0) = 0$.

Sufficient condition: assume $f$ is convex and 1st-order differentiable. If $\nabla f(x) = 0$ at a point $x = x_0$, then $x_0$ is a local/global minimizer.

– Convex analysis (i.e., theory) and optimization (i.e., numerical methods) are relatively mature. Recommended resources: analysis: [Hiriart-Urruty and Lemarechal, 2001]; optimization: [Boyd and Vandenberghe, 2004].

– We don't assume convexity unless stated, as DNN objectives are almost always nonconvex.


Second-order optimality condition

Necessary condition: assume $f(x)$ is 2nd-order differentiable at $x_0$. If $x_0$ is a local min, then $\nabla f(x_0) = 0$ and $\nabla^2 f(x_0) \succeq 0$ (i.e., positive semidefinite).

Sufficient condition: assume $f(x)$ is 2nd-order differentiable at $x_0$. If $\nabla f(x_0) = 0$ and $\nabla^2 f(x_0) \succ 0$ (i.e., positive definite), then $x_0$ is a local min.

Taylor's: $f(x_0 + \delta) = f(x_0) + \langle \nabla f(x_0), \delta \rangle + \tfrac{1}{2}\langle \delta, \nabla^2 f(x_0)\, \delta \rangle + o(\|\delta\|_2^2)$.

– If $x_0$ is a local min, $\nabla f(x_0) = 0$ (1st-order condition) and
  $$f(x_0 + \delta) = f(x_0) + \tfrac{1}{2}\langle \delta, \nabla^2 f(x_0)\, \delta \rangle + o(\|\delta\|_2^2).$$

– So $f(x_0 + \delta) - f(x_0) = \tfrac{1}{2}\langle \delta, \nabla^2 f(x_0)\, \delta \rangle + o(\|\delta\|_2^2) \ge 0$ for all $\delta$ sufficiently small.

– For all $\delta$ sufficiently small, the sign of $\tfrac{1}{2}\langle \delta, \nabla^2 f(x_0)\, \delta \rangle + o(\|\delta\|_2^2)$ is determined by the sign of $\tfrac{1}{2}\langle \delta, \nabla^2 f(x_0)\, \delta \rangle$ $\Longrightarrow$ $\tfrac{1}{2}\langle \delta, \nabla^2 f(x_0)\, \delta \rangle \ge 0$.

– So $\nabla^2 f(x_0) \succeq 0$.


What's in between?

2nd-order sufficient: $\nabla f(x_0) = 0$ and $\nabla^2 f(x_0) \succ 0$

2nd-order necessary: $\nabla f(x_0) = 0$ and $\nabla^2 f(x_0) \succeq 0$

$$\nabla f = \begin{bmatrix} 2x \\ -2y \end{bmatrix}, \quad
\nabla^2 f = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix}, \qquad
\nabla g = \begin{bmatrix} 3x^2 \\ -3y^2 \end{bmatrix}, \quad
\nabla^2 g = \begin{bmatrix} 6x & 0 \\ 0 & -6y \end{bmatrix}$$
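The displayed derivatives are consistent with, for instance, $f(x,y) = x^2 - y^2$ and $g(x,y) = x^3 - y^3$ (an assumption; the slide's plots are not in this transcript). At the origin both have zero gradient, yet neither point is a local min: $\nabla^2 f(0)$ is indefinite, while $\nabla^2 g(0) = 0$ satisfies the necessary condition but not the sufficient one. A quick check:

```python
import numpy as np

# Hessians at the origin for f(x, y) = x^2 - y^2 and g(x, y) = x^3 - y^3
# (illustrative functions consistent with the gradients/Hessians shown above).
Hf = np.array([[2.0, 0.0], [0.0, -2.0]])
Hg = np.array([[0.0, 0.0], [0.0,  0.0]])

for name, H in [("f", Hf), ("g", Hg)]:
    eig = np.linalg.eigvalsh(H)
    print(name, eig,
          "PSD (necessary cond.):", bool(np.all(eig >= 0)),
          "PD (sufficient cond.):", bool(np.all(eig > 0)))
# f: eigenvalues [-2, 2], indefinite -- the origin is a saddle
# g: eigenvalues [0, 0], PSD but not PD -- yet g(0, t) = -t^3 < 0 for t > 0,
#    so the origin is still not a local min even though the necessary condition holds
```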


References

[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

[Coleman, 2012] Coleman, R. (2012). Calculus on Normed Vector Spaces. Springer New York.

[Hiriart-Urruty and Lemarechal, 2001] Hiriart-Urruty, J.-B. and Lemarechal, C. (2001). Fundamentals of Convex Analysis. Springer Berlin Heidelberg.

[Kawaguchi, 2016] Kawaguchi, K. (2016). Deep learning without poor local minima. arXiv:1605.07110.

[Lampinen and Ganguli, 2018] Lampinen, A. K. and Ganguli, S. (2018). An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv:1809.10374.

[Munkres, 1997] Munkres, J. R. (1997). Analysis On Manifolds. Taylor & Francis Inc.

[Zorich, 2015] Zorich, V. A. (2015). Mathematical Analysis I. Springer Berlin Heidelberg.
