Is Depth Needed for Deep Learning? Circuit Complexity in Neural Networks

Ohad Shamir
Weizmann Institute of Science and Microsoft Research

STOC Deep Learning Workshop, June 2017

Neural Networks (a.k.a. Deep Learning)

A single neuron:
$x \mapsto \sigma(w^\top x + b)$

Activation $\sigma$ examples:
ReLU: $[z]_+ := \max\{0, z\}$

Feedforward (deep) neural networks:
$x \,(\in \mathbb{R}^d) \mapsto W_k\, \sigma_{k-1}\big(\cdots \sigma_2(W_2\, \sigma_1(W_1 x + b_1) + b_2) \cdots\big) + b_k$

Depth: $k$. Width: maximal dimension of $W_1, \dots, W_k$.
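The map above is easy to state in code. A minimal numpy sketch (the layer sizes and random weights are illustrative choices of mine, not from the talk):

```python
import numpy as np

def relu(z):
    # ReLU activation: [z]_+ = max{0, z}, applied elementwise
    return np.maximum(0.0, z)

def feedforward(x, weights, biases, sigma=relu):
    # Computes W_k sigma_{k-1}(... sigma_1(W_1 x + b_1) ...) + b_k:
    # the activation is applied after every layer except the last.
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigma(W @ h + b)
    return weights[-1] @ h + biases[-1]

# Illustrative depth-3 network on R^2 with hidden width 4.
rng = np.random.default_rng(0)
dims = [2, 4, 4, 1]  # input dim, two hidden widths, output dim
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
biases = [rng.standard_normal(dims[i + 1]) for i in range(3)]
print(feedforward(np.array([0.5, -1.0]), weights, biases))
```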

Deep Learning

Winner of the ImageNet challenge 2012: AlexNet, 8 layers
Winner of the ImageNet challenge 2014: VGG, 19 layers
Winner of the ImageNet challenge 2015: ResNet, 152 layers

Is Depth Needed for Deep Learning?

Overwhelming empirical evidence

Intuitive:
Many tasks are naturally modelled as a pipeline; deep networks allow end-to-end learning.

image → hand-crafted features → predictor → "dog"

明天打电话给我。 (Chinese: "Call me tomorrow.") → "call me tomorrow"

Is Depth Needed for Deep Learning?

No (in some sense):

Universal Approximation Theorems [Cybenko 1989, Hornik 1991, Leshno et al. 1993, ...]:
2-layer networks, with any non-polynomial activation $\sigma$, can approximate any continuous $f : [0,1]^d \to \mathbb{R}$ to arbitrary accuracy.

Catch: the construction uses $\exp(d)$-wide networks. What about poly-sized networks?
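To see concretely how width buys accuracy at depth 2, here is a 1-D sketch (my own illustration, not the construction from the cited papers): a width-$n$ 2-layer ReLU network realizing the piecewise-linear interpolant of a target function, with error shrinking as the width grows.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def two_layer_interpolant(f, n_knots):
    # Piecewise-linear interpolation of f on [0, 1], written as a width-n
    # 2-layer ReLU network: h(x) = f(0) + sum_i c_i * relu(x - t_i).
    t = np.linspace(0.0, 1.0, n_knots + 1)   # knots t_0 < ... < t_n
    slopes = np.diff(f(t)) / np.diff(t)      # slope on each segment
    c = np.diff(slopes, prepend=0.0)         # slope change at each knot
    return lambda x: f(t[0]) + relu(x[:, None] - t[:-1]).dot(c)

f = lambda x: np.sin(4 * np.pi * x)
x = np.linspace(0.0, 1.0, 2000)
for n in (10, 100, 1000):
    err = np.max(np.abs(two_layer_interpolant(f, n)(x) - f(x)))
    print(f"width {n:5d}: sup-error ~ {err:.4f}")  # error shrinks with width
```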

Is Depth Needed for Deep Learning?

Main Question
Are there real-valued functions which are
- expressible by a depth-$h$, width-$w$ neural network, but
- not even approximable by any depth-$< h$ network, unless the width is much larger than $w$?

Approximation metric: expected loss w.r.t. some data distribution $\mathcal{D}$:
$$d(n, f) = \mathbb{E}_{x \sim \mathcal{D}}\, \ell(n(x), f(x))$$

In this talk: $\ell(y, y') = (y - y')^2$.
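In code, this metric is just an expectation that can be estimated by sampling. A hedged sketch (the distribution, target, and "network" below are placeholders of my choosing):

```python
import numpy as np

def approx_error(n, f, sample, m=100_000, seed=0):
    # Monte Carlo estimate of d(n, f) = E_{x~D} (n(x) - f(x))^2,
    # where `sample` draws m i.i.d. points from D.
    x = sample(m, np.random.default_rng(seed))
    return np.mean((n(x) - f(x)) ** 2)

# Placeholder example: D = uniform on [0,1], f(x) = x^2, n = a crude line.
d = approx_error(lambda x: x, lambda x: x**2,
                 lambda m, rng: rng.uniform(0.0, 1.0, m))
print(d)  # ~ integral of (x - x^2)^2 on [0,1] = 1/30
```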

Should Sound Familiar...

The same question is asked in circuit complexity! (just with a different motivation)

Boolean circuits (e.g. $AC^0$):
- Separation between any two constant depths [Håstad 1986, Rossman et al. 2015]

Threshold circuits (e.g. $TC^0$):
- Neural networks with $\sigma(z) = \mathbb{1}\{z \ge 0\}$ activations; Boolean input/output
- Separation between depth 2 and 3, if the weights are bounded [Hajnal et al. 1987]
- Sufficiently larger depths are known to hit the natural proofs barrier

Arithmetic circuits:
- Neural networks computing polynomials; each neuron computes a sum or a product

Should Sound Familiar...

But: modern neural networks have non-Boolean inputs/outputs and non-polynomial activations:
- Not Boolean circuits
- Not threshold circuits
- Not arithmetic circuits

Unlike work from the 80's/90's (e.g. Parberry [1994]), we are interested in real-valued inputs/outputs, not just Boolean functions.

This Talk

Depth separations for modern neural networks

A nascent field in the machine learning community; some examples of results and techniques

Focus on clean lower bounds and standard activations

Many open questions...

Comments/feedback welcome!

Separating Depth 2 and 3 via Correlations

Depth-2 networks: $x \mapsto w_2^\top \sigma(W_1 x + b_1) + b_2$
(a linear combination of neurons $\sigma(w^\top x + b)$)

Depth-3 networks: $x \mapsto w_3^\top \sigma(W_2\, \sigma(W_1 x + b_1) + b_2) + b_3$

Theorem (Daniely 2017)
Let $(x, y) \mapsto f(x^\top y)$, where $x, y$ are uniform on $\mathbb{S}^{d-1}$ and $f(z) = \sin(\pi d^3 z)$. This function is:
- $\epsilon$-approximable by a depth-3 ReLU network of $\mathrm{poly}(d, 1/\epsilon)$ width and weight sizes
- not $\Omega(1)$-approximable by any depth-2 ReLU network of $\exp(o(d \log d))$ width and $O(\exp(d))$-sized weights

More generally: other activations; any $f$ which is inapproximable by an $O(d^{1+\epsilon})$-degree polynomial

Lower Bound Proof Idea

Based on harmonic analysis over $\mathbb{S}^{d-1}$:
- $(x, y) \mapsto f(x^\top y)$ is almost orthogonal to any $(x, y) \mapsto \psi(x^\top w, v^\top y)$ (e.g. a single neuron)
- So many neurons (or huge weights) are needed to correlate with $f(x^\top y)$ (see the calculation below)

Comparison to threshold circuit results:
- Correlation bounds were also used for the depth-2/3 separation of threshold circuits [Hajnal et al. 1987]
- But this is a stronger separation: $\exp(\Omega(d \log d))$ vs. $\exp(\Omega(d))$ width
- Boolean functions never require more than $O(2^d)$ width...
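To spell out the step from "almost orthogonal" to "many neurons or huge weights" (a standard reconstruction of the argument, not copied from the slides): if the network is $n = \sum_{i=1}^{w} c_i \psi_i$ and each neuron satisfies $|\langle \psi_i, f \rangle| \le \delta$ in $L^2(\mathcal{D})$, then
$$\langle n, f \rangle = \sum_{i=1}^{w} c_i \langle \psi_i, f \rangle \le \delta \sum_{i=1}^{w} |c_i|,$$
and hence
$$\mathbb{E}\big[(n(x) - f(x))^2\big] = \|n\|^2 - 2\langle n, f \rangle + \|f\|^2 \ge \|f\|^2 - 2\delta \sum_{i=1}^{w} |c_i|.$$
So the error stays $\Omega(1)$ unless (width) × (largest weight) is at least $\|f\|^2 / (2\delta)$; with $\delta$ exponentially small, either the width or the weights must be exponentially large.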

Weight Restrictions

The result assumes the weights are not too large. Is this really necessary?
- For threshold circuits: a 30-year-old open question
- Next: separating depth-2/3 neural networks without any weight restrictions, using a different technique

Theorem (Eldan and S., 2016)
There exist a function $f$ and a distribution on $\mathbb{R}^d$ such that $f$ is:
- $\epsilon$-approximable by a 3-layer, $\mathrm{poly}(d, 1/\epsilon)$-wide network
- not $\Omega(1)$-approximable by any 2-layer, $\exp(o(d))$-wide network

Applies to virtually any measurable $\sigma(\cdot)$ s.t. $|\sigma(x)| \le \mathrm{poly}(x)$

Proof Idea

Use radial functions: $f(x) = g(\|x\|_2)$, where $x \in \mathbb{R}^d$ and $g : \mathbb{R} \to \mathbb{R}$

If $g$ is Lipschitz, this is easy to approximate with depth 3:
- One layer: approximate $x \mapsto x^2$, hence also $x \mapsto \|x\|^2 = \sum_i x_i^2$
- Second layer + output neuron: approximate a univariate function of $\|x\|^2$

With two layers, this is difficult to do

Proof Idea

Fourier transform on $\mathbb{R}^d$: given a function $f$, $\hat{f}(\xi) = \int f(x) \exp(-2\pi i\, \xi^\top x)\, dx$

If $x$ is sampled from a distribution with density $\varphi^2$, then (by Plancherel and the convolution theorem)
$$\mathbb{E}_{x \sim \varphi^2}\big[(n(x) - f(x))^2\big] = \int (n(x) - f(x))^2 \varphi^2(x)\, dx = \int \big(n(x)\varphi(x) - f(x)\varphi(x)\big)^2 dx = \int \big(\hat{n} * \hat{\varphi}\,(\xi) - \hat{f} * \hat{\varphi}\,(\xi)\big)^2 d\xi$$

For a two-layer network, $n(x) = \sum_i n_{i, w_i}(x) := \sum_i n_i(w_i^\top x)$, so this equals
$$\int \Big(\sum_i \hat{n}_{i, w_i} * \hat{\varphi}\,(\xi) - \hat{f} * \hat{\varphi}\,(\xi)\Big)^2 d\xi$$

Proof Idea

$$\int \Big(\sum_i \hat{n}_{i, w_i} * \hat{\varphi}\,(\xi) - \hat{f} * \hat{\varphi}\,(\xi)\Big)^2 d\xi$$

For (say) a Gaussian $\varphi^2$, $\hat{\varphi}$ is Gaussian. Since each $n_{i, w_i}$ depends on $x$ only through $w_i^\top x$, its Fourier transform is supported on the line spanned by $w_i$; so each $\hat{n}_{i, w_i} * \hat{\varphi}$ concentrates near that line, whereas $\hat{f} * \hat{\varphi}$ is spread out.

Intuition: one can't approximate a "fat" function with few "thin" functions in high dimension

Proof Idea

But: the Gaussian tail is hard to handle.
Idea: use a density $\varphi^2$ s.t.
- $\hat{f} * \hat{\varphi}$ is "sufficiently fat"
- $\varphi$ has bounded support

[Figure: the network's transform $\sum_i \hat{n}_{i, w_i} * \hat{\varphi}\,(\xi)$]

Proof Idea

Explicit construction in $\mathbb{R}^d$:

Density: $\varphi^2(x) = \big(\tfrac{R_d}{\|x\|}\big)^d \cdot J_{d/2}^2(2\pi R_d \|x\|)$, where $J_{d/2}$ is a Bessel function of the first kind

Function: $f(x) = \sum_{i=1}^{\mathrm{poly}(d)} \epsilon_i\, \mathbb{1}\{\|x\|_2 \in \Delta_i\}$, where $\epsilon_i \in \{-1, +1\}$ and the $\Delta_i$ are disjoint intervals

Higher Depth

So far: separations between depths 2 and 3, in terms of dimension

Open Question: Can we show separations for higher depths?
- For threshold circuits: a longstanding open problem; probably very difficult beyond some constant depth (natural proofs barrier)
- But: these are not threshold circuits... Perhaps there are "hard" functions in Euclidean space?

Next: higher-depth separations, in terms of quantities other than dimension

Highly Oscillatory Functions

Theorem (Telgarsky, 2016)
There exists a family of functions $\{\varphi_k\}_{k=1}^{\infty}$ on $[0, 1]$ s.t. for any $k$:
- $\varphi_k$ is expressible by a depth-$k$, $O(1)$-width ReLU network
- $\varphi_k$ is not approximable by any $o(k/\log k)$-depth, $\mathrm{poly}(k)$-width ReLU network

* Approximation w.r.t. the uniform distribution on $[0, 1]$

Again, this can be generalized to other activations

Construction

$\varphi_1(x) = [2x]_+ - [4x - 2]_+$ (a "tent" rising from 0 to 1 on $[0, \tfrac12]$ and falling back to 0 on $[\tfrac12, 1]$)

$\varphi_2(x) = \varphi_1(\varphi_1(x))$

$\varphi_k(x) = \varphi_1^{\circ k}(x)$, the $k$-fold composition of $\varphi_1$ (sketched in code below)
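A small numpy sketch of the construction (the level-crossing count is my way of making the oscillations visible): each extra composition with $\varphi_1$ doubles the number of oscillations on $[0, 1]$.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def phi1(x):
    # Building block: phi_1(x) = [2x]_+ - [4x - 2]_+,
    # a tent rising 0 -> 1 on [0, 1/2] and falling 1 -> 0 on [1/2, 1].
    return relu(2 * x) - relu(4 * x - 2)

def phi(k, x):
    # phi_k = k-fold composition of phi_1: computable by a depth-O(k),
    # O(1)-width ReLU network, yet it has 2^(k-1) peaks on [0, 1].
    for _ in range(k):
        x = phi1(x)
    return x

# Count how often phi_k crosses the level 1/2: 2^k times, doubling with k.
x = np.linspace(0.0, 1.0, 999_983)  # prime point count, so samples miss the dyadic crossing points
for k in (1, 2, 3, 4, 5):
    flips = np.count_nonzero(np.diff(np.sign(phi(k, x) - 0.5)))
    print(f"k={k}: phi_k crosses 1/2 {flips} times")
```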

Construction

- $\varphi_k$ is expressible by an $O(k)$-depth, $O(1)$-width ReLU network
- $\varphi_k$ is composed of $2^{k+1}$ linear segments, and it can't be approximated by a piecewise-linear function with $o(2^k)$ segments
- A depth-$h$, width-$w$ network expresses at most $(2w)^h$ linear segments
- ⇒ If $h = o(k/\log k)$, can't approximate with width $w = \mathrm{poly}(k)$
  (since $(2w)^h \ge 2^k$ forces $h \ge k/\log_2(2w)$, and $\log_2(2w) = O(\log k)$ for $w = \mathrm{poly}(k)$)

Separations in Accuracy

Theorem (Safran and S., 2016)
There exists a large family $\mathcal{F}$ of $C^2$ functions on $[0,1]^d$ (including $x \mapsto x^2$) s.t. any $f \in \mathcal{F}$:
- can be $\epsilon$-approximated by a ReLU network of $\mathrm{polylog}(1/\epsilon)$ depth and width
- cannot be $\epsilon$-approximated by an $O(1)$-depth ReLU network, unless the width is $\mathrm{poly}(1/\epsilon)$

* Approximation w.r.t. the uniform distribution

$\mathcal{F} \approx$ non-linear functions expressible by a fixed number of additions and multiplications.

Note: broadly similar and independent results appear in [Yarotsky 2016], [Liang and Srikant 2016]

Proof Idea for $x \mapsto x^2$

Upper bound:
- Use variants of $\varphi_1, \varphi_2, \dots, \varphi_{O(\log(1/\epsilon))}$ to extract the first $O(\log(1/\epsilon))$ bits of $x$
- Given the bit vector, do long multiplication to get the first $O(\log(1/\epsilon))$ bits of $x^2$
- Convert back to $\mathbb{R}$

Representable via an $O(\log(1/\epsilon))$ depth/width network
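A closely related constructive route, easy to check numerically, is the sawtooth identity from Yarotsky (2016), one of the independent works cited above (this is that construction, not the bit-extraction argument): on $[0,1]$, $x^2 = x - \sum_{s \ge 1} \varphi_s(x)/4^s$, so truncating at $m$ terms gives a depth-$O(m)$ ReLU network with error $O(4^{-m})$, i.e. $O(\log(1/\epsilon))$ depth for error $\epsilon$.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def phi1(x):
    # Tent map from the earlier construction: peak 1 at x = 1/2.
    return relu(2 * x) - relu(4 * x - 2)

def sq_approx(x, m):
    # Truncated Yarotsky identity: x^2 ~ x - sum_{s=1}^m phi_s(x) / 4^s,
    # where phi_s is the s-fold composition of the tent map.
    out, g = x.copy(), x.copy()
    for s in range(1, m + 1):
        g = phi1(g)
        out -= g / 4**s
    return out

x = np.linspace(0.0, 1.0, 10_001)
for m in (2, 4, 8):
    err = np.max(np.abs(sq_approx(x, m) - x**2))
    print(f"m={m}: sup-error = {err:.1e}")  # decays like 4^{-m}
```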

Proof Idea for $x \mapsto x^2$

Lower bound:
- If $h$ is linear, $\int_a^{a+\Delta} \big(x^2 - h(x)\big)^2 dx = \Omega(\Delta^5)$ (the exact computation is below)
- ⇒ If $h$ is piecewise-linear with $O(n)$ segments, $\int_0^1 \big(x^2 - h(x)\big)^2 dx = \Omega(n^{-4})$
- But: any $O(1)$-depth, $w$-width network can express only $\mathrm{poly}(w)$ segments
- ⇒ For an $\epsilon$-approximation, $\mathrm{poly}(1/\epsilon)$ width is needed

Similar ideas also work in higher dimensions
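The $\Omega(\Delta^5)$ step is a direct calculation (my derivation; the constant is not from the slides): the best affine $L^2$-fit of $x^2$ on an interval of length $\Delta$ satisfies
$$\min_{a, b} \int_{c}^{c+\Delta} \big(x^2 - (ax + b)\big)^2 dx = \frac{\Delta^5}{180},$$
since under the change of variables $x = c + \tfrac{\Delta}{2}(t + 1)$, subtracting the best affine fit leaves the residual $(\Delta/2)^2\big(t^2 - \tfrac13\big)$. Summing over $O(n)$ segments covering $[0,1]$, each of length $\approx 1/n$, gives $n \cdot \Omega(n^{-5}) = \Omega(n^{-4})$, the bound used above.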

Natural Depth Separations

So far: depth separations for some functions

But for machine learning, this is only 1/3 of the picture:
expressiveness | statistical error | optimization error

Are there depth separations for functions that we can hope to learn with standard optimization methods?
- $(x, y) \mapsto f(x^\top y)$, with $f$ highly oscillating
- $x \mapsto f(\|x\|)$, with $f$ highly oscillating
- $x \mapsto f(x)$, with $f$ highly oscillating
- $x \mapsto x^2$: bit-extraction (highly oscillating) + long-multiplication networks

First Example: Indicator of the L2 Ball

Theorem (Safran and S., 2016)
Let $f(x) = \mathbb{1}\{\|x\| \le 1\}$ on $\mathbb{R}^d$. There exists a distribution s.t. $f$ is:
- $\epsilon$-approximable with a depth-3, $\mathrm{poly}(d, 1/\epsilon)$-wide ReLU network
- not $\Omega(d^{-4})$-approximable by any depth-2, $\exp(o(d))$-wide ReLU network

Can be generalized to indicators of arbitrary ellipsoids

Proof idea: reduction from the construction of Eldan and S.

Experiment: Unit L2 Ball

[Figure: validation-set RMSE vs. batch number (×1000), for $d = 100$. Curves: a 3-layer network of width 100 and 2-layer networks of widths 100, 200, 400, and 800; the RMSE axis spans 0.15 to 0.3.]

Second Example: L1 Ball

Theorem (Safran and S., 2016)
Let $f(x) = [\|x\|_1 - 1]_+$ on $\mathbb{R}^d$. There exists a distribution s.t. $f$ is:
- expressible with a depth-3, width-$2d$ ReLU network
- not $\epsilon$-approximable by any depth-2 ReLU network of width $\min\{1/\epsilon, \exp(d)\}$

Proof Idea

Upper bound (checked numerically below):
$$[\|x\|_1 - 1]_+ = \Big[\sum_{i=1}^{d} \big([x_i]_+ + [-x_i]_+\big) - 1\Big]_+$$

Lower bound:
- The function "breaks" along the $2^d$ facets of the L1 ball
- For a good approximation, most facets must have a ReLU neuron breaking close to them
- The bound can probably be improved...
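A quick numpy check of the upper-bound identity, written as an explicit depth-3, width-$2d$ ReLU network (the packaging into weight matrices is mine):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def l1_net(x):
    # Depth-3 ReLU network for [ ||x||_1 - 1 ]_+ :
    # layer 1 (width 2d): computes [x_i]_+ and [-x_i]_+ for every coordinate;
    # layer 2 (width 1):  sums them (= ||x||_1), subtracts 1, applies ReLU.
    d = x.shape[-1]
    W1 = np.vstack([np.eye(d), -np.eye(d)])  # (2d x d) first-layer weights
    h = relu(x @ W1.T)                       # [x_i]_+ and [-x_i]_+
    return relu(h.sum(axis=-1) - 1.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))              # five points in R^4
print(np.allclose(l1_net(x), relu(np.abs(x).sum(axis=-1) - 1.0)))  # True
```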

Experiment: L1 Ball

[Figure: experimental results for the L1-ball target (plot not preserved)]

Other Directions

Many other works and directions!
- Study architectural properties of neural networks (depth and beyond) using arithmetic-style circuits
- Depth separations using metrics other than approximation error
- Study realistic architectures via upper bounds
- ...

[Delalleau and Bengio 2011], [Pascanu et al. 2013], [Martens et al. 2013], [Montufar et al. 2014], [Cohen et al. 2015], [Cohen et al. 2016], [Raghu et al. 2016], [Poole et al. 2016], [Arora et al. 2016], [Mhaskar and Poggio 2016], [Shaham et al. 2016], [Mossel 2016], [McCane and Szymanski 2016], [Poggio et al. 2017], [Sharir and Shashua 2017], [Rolnick and Tegmark 2017], [Nguyen and Hein 2017], [Petersen and Voigtlaender 2017], [Lu et al. 2017], [Montanelli and Du 2017], [Telgarsky 2017], [Lee et al. 2017], [Khrulkov et al. 2017], [Serra et al. 2017], [Guss and Salakhutdinov 2017], [Mukherjee and Basu 2017], ...

Summary and Discussion

Depth separations for modern neural networks

Take-Home Message
Similar questions as in circuit complexity, but not standard circuits, and a different playing field:
- Euclidean (not Boolean) input/output: continuity, large Lipschitz constants, Fourier analysis in $\mathbb{R}^d$...
- No clear algebraic structure (as in arithmetic circuits); use geometric properties instead: curvature, piecewise linearity, sparsity in the Fourier domain...

AFAIK, there has been little study of the connections between the two fields

Open Questions

- Separations w.r.t. dimension for depths > 3?
- Alternatively, is there a "natural proofs" barrier? How would one even define it?
- Strong separations w.r.t. dimension for $O(1)$-Lipschitz functions?
- Circuit complexity techniques to analyze neural networks? And vice versa?
- Is there any function which is both (1) provably deep and (2) easily learned with neural networks?
- Architecture and expressiveness of modern neural networks beyond depth: convolutions, pooling, recurrences, skip connections...