Page 1:

Optimal Approximation with Sparse Deep Neural Networks

Gitta Kutyniok
(Technische Universität Berlin)

joint with

Helmut Bölcskei (ETH Zürich)
Philipp Grohs (Universität Wien)
Philipp Petersen (Technische Universität Berlin)

SPARS 2017, University of Lisbon, June 5–8, 2017

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 1 / 51

Page 2:

Deep Neural Networks

Deep neural networks have recently shown impressive results in a variety of real-world applications.

Some Examples...

Image classification (ImageNet). Hierarchical classification of images.

Games (AlphaGo). Success against world class players.

Speech Recognition (Siri). Current standard in numerous cell phones.

Very few theoretical results explaining their success!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 2 / 51

Page 4:

Delving Deeper into Neural Networks

Main Task: Deep neural networks approximate highly complex functions, typically based on given sampling points.

Image Classification:

$f : \mathcal{M} \to \{1, 2, \dots, K\}$

Speech Recognition:

$f : \mathbb{R}^{S_1} \to \mathbb{R}^{S_2}$

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 3 / 51

Page 5:

Sparsely Connected Deep Neural Networks

Key Problem:

Deep neural networks employed in practice often consist of hundreds of layers.

Training and operation of such networks pose a formidable computational challenge.

Employ deep neural networks with sparse connectivity!

Example of Speech Recognition:

Typically, speech recognition is performed in the cloud (e.g., Siri).

New speech recognition systems (e.g., on Android) can operate offline and are based on a sparsely connected deep neural network.

Key Challenge: Approximation accuracy ↔ Complexity of approximating DNN
in terms of sparse connectivity

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 4 / 51

Page 6:

Outline

1 Deep Neural Networks
  Mathematical Model
  Networks and Approximation Theory

2 Complexity of Networks
  Networks with Sparse Connectivity
  A Fundamental Lower Bound on Sparsity

3 Optimally Sparse Networks
  Ingredients from Applied Harmonic Analysis
  Optimal Sparse Approximation Result

4 Numerical Experiments

5 Conclusions

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 5 / 51

Page 7:

Deep Neural Networks

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 6 / 51

Page 8:

Neural Networks

Definition: Assume the following notions:

$d \in \mathbb{N}$: dimension of the input layer.

$L$: number of layers.

$N$: number of neurons.

$\rho : \mathbb{R} \to \mathbb{R}$: (non-linear) function called rectifier.

$W_\ell : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $\ell = 1, \dots, L$: affine linear maps.

Then $\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by

$$\Phi(x) = W_L\,\rho(W_{L-1}\,\rho(\cdots \rho(W_1(x)))), \qquad x \in \mathbb{R}^d,$$

is called a (deep) neural network (DNN).

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 7 / 51
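
To make the definition concrete, here is a minimal NumPy sketch (added for this transcript, not from the slides) that evaluates $\Phi(x) = W_L\,\rho(W_{L-1}\,\rho(\cdots\rho(W_1(x))))$ and counts the non-zero weights; the ReLU choice for $\rho$, the layer sizes, and all helper names are illustrative assumptions.

```python
import numpy as np

def relu(x):
    # One possible choice of rectifier rho; the slides keep rho generic.
    return np.maximum(x, 0.0)

def dnn(x, layers, rho=relu):
    """Evaluate Phi(x) = W_L rho(W_{L-1} rho(... rho(W_1(x)))) for layers = [(A_1, b_1), ...]."""
    for A, b in layers[:-1]:
        x = rho(A @ x + b)
    A, b = layers[-1]
    return A @ x + b                       # no rectifier after the last affine map

def num_edges(layers):
    # M = number of edges with non-zero weight (the sparse-connectivity measure used next).
    return sum(int(np.count_nonzero(A)) for A, _ in layers)

# Toy example: d = 3 inputs, two hidden layers, one output, randomly sparsified weights.
rng = np.random.default_rng(0)
layers = []
for m, n in [(4, 3), (4, 4), (1, 4)]:
    A = rng.standard_normal((m, n)) * (rng.random((m, n)) < 0.4)   # roughly 60% zero weights
    layers.append((A, rng.standard_normal(m)))

x = rng.standard_normal(3)
print("Phi(x) =", dnn(x, layers), "  M =", num_edges(layers))
```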

Page 9:

Sparse Connectivity

Remark: The affine linear map $W_\ell$ is defined by a matrix $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and a bias vector $b_\ell \in \mathbb{R}^{N_\ell}$ via

$$W_\ell(x) = A_\ell x + b_\ell.$$

$$A_1 = \begin{pmatrix} (A_1)_{1,1} & (A_1)_{1,2} & 0 \\ 0 & 0 & (A_1)_{2,3} \\ 0 & 0 & (A_1)_{3,3} \end{pmatrix}, \qquad A_2 = \begin{pmatrix} (A_2)_{1,1} & (A_2)_{1,2} & 0 \\ 0 & 0 & (A_2)_{2,3} \end{pmatrix}$$

(Figure: a small network with input neurons $x_1, x_2, x_3$ whose sparse connections are encoded by $A_1$ and $A_2$.)

Definition: We write $\Phi \in \mathcal{NN}_{L,M,d,\rho}$, where $M$ is the number of edges with non-zero weight. A DNN with small $M$ is called sparsely connected.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 8 / 51

Page 10:

The Start

Universal Approximation Theorem (Cybenko, 1989) (Hornik, 1991) (Pinkus, 1999): Let $d \in \mathbb{N}$, $K \subset \mathbb{R}^d$ compact, $f : K \to \mathbb{R}$ continuous, $\rho : \mathbb{R} \to \mathbb{R}$ continuous and not a polynomial. Then, for each $\varepsilon > 0$, there exist $N \in \mathbb{N}$, $a_k, b_k \in \mathbb{R}$, $w_k \in \mathbb{R}^d$ such that

$$\Big\| f - \sum_{k=1}^{N} a_k\, \rho(\langle w_k, \cdot \rangle - b_k) \Big\|_\infty \le \varepsilon.$$

Interpretation: Every continuous function can be approximated up to an error of $\varepsilon > 0$ by a neural network with a single hidden layer and $O(N)$ neurons.

What is the connection between ε and N?

Approximation accuracy ↔ Complexity of approximating DNN?

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 9 / 51
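
As a toy illustration of the "ε versus N" question (an addition, not part of the slides), one can fit the single-hidden-layer form $\sum_k a_k\,\rho(\langle w_k, \cdot\rangle - b_k)$ to a continuous target by solving for the outer coefficients $a_k$ in the least-squares sense and watching the error shrink as $N$ grows; the target function, the random draw of $w_k, b_k$, and the evaluation grid are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 400)            # sample points in the compact set K = [-1, 1]
f = np.exp(-3 * x**2) * np.sin(4 * x)      # some continuous target on K

def one_layer_error(N):
    # Random inner weights w_k and shifts b_k; outer coefficients a_k by least squares.
    w = rng.standard_normal(N)
    b = rng.uniform(-1.0, 1.0, N)
    features = np.maximum(np.outer(x, w) - b, 0.0)     # rho = ReLU (continuous, not a polynomial)
    a, *_ = np.linalg.lstsq(features, f, rcond=None)
    return np.max(np.abs(features @ a - f))            # sup-norm error on the grid

for N in [5, 10, 20, 40, 80]:
    print(N, one_layer_error(N))
```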

Page 13:

Function Approximation in a Nutshell

Goal: Given $\mathcal{C} \subseteq L^2(\mathbb{R}^d)$ and $(\varphi_i)_{i \in I} \subseteq L^2(\mathbb{R}^d)$, measure the suitability of $(\varphi_i)_{i \in I}$ for uniformly approximating functions from $\mathcal{C}$.

Definition: The error of best $M$-term approximation of some $f \in \mathcal{C}$ is given by

$$\|f - f_M\|_{L^2(\mathbb{R}^d)} := \inf_{I_M \subset I,\, \# I_M = M,\, (c_i)_{i \in I_M}} \Big\| f - \sum_{i \in I_M} c_i \varphi_i \Big\|_{L^2(\mathbb{R}^d)}.$$

The largest $\gamma > 0$ such that

$$\sup_{f \in \mathcal{C}} \|f - f_M\|_{L^2(\mathbb{R}^d)} = O(M^{-\gamma}) \quad \text{as } M \to \infty$$

determines the optimal (sparse) approximation rate of $\mathcal{C}$ by $(\varphi_i)_{i \in I}$.

Approximation accuracy ↔ Complexity of approximating system
in terms of sparsity

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 10 / 51
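
For intuition (a sketch added here, not from the slides): with respect to an orthonormal system, a best M-term approximation is obtained simply by keeping the M largest coefficients. The Haar basis and the piecewise-smooth target below are illustrative assumptions.

```python
import numpy as np

def haar_matrix(n):
    # Orthonormal Haar basis on n = 2^J points, built recursively; rows are the atoms.
    if n == 1:
        return np.array([[1.0]])
    H = haar_matrix(n // 2)
    top = np.kron(H, [1.0, 1.0])                   # coarser-scale atoms
    bot = np.kron(np.eye(n // 2), [1.0, -1.0])     # finest-scale wavelets
    return np.vstack([top, bot]) / np.sqrt(2.0)

n = 1024
t = np.linspace(0.0, 1.0, n, endpoint=False)
f = np.sin(2 * np.pi * t) + (t > 0.37)             # smooth part plus a jump

Phi = haar_matrix(n)
c = Phi @ f                                        # coefficients <f, phi_i>

for M in [8, 32, 128, 512]:
    idx = np.argsort(np.abs(c))[::-1][:M]          # keep the M largest coefficients
    fM = Phi[idx].T @ c[idx]                       # best M-term approximation
    print(M, np.linalg.norm(f - fM) / np.sqrt(n))  # discrete L2 error decays with M
```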

Page 15:

Approximation with Sparse Deep Neural Networks

Complexity: Sparse connectivity = Few non-zero weights.

Definition: Given $\mathcal{C} \subseteq L^2(\mathbb{R}^d)$ and fixed $\rho$. For each $M \in \mathbb{N}$ and $f \in \mathcal{C}$, we define

$$\Gamma_M(f) := \inf_{\Phi \in \mathcal{NN}_{\infty,M,d,\rho}} \|f - \Phi\|_{L^2(\mathbb{R}^d)}.$$

The largest $\gamma > 0$ such that

$$\sup_{f \in \mathcal{C}} \Gamma_M(f) = O(M^{-\gamma}) \quad \text{as } M \to \infty$$

then determines the optimal (sparse) approximation rate of $\mathcal{C}$ by $\mathcal{NN}_{\infty,\infty,d,\rho}$.

Approximation accuracy ↔ Complexity of approximating DNN
in terms of sparse connectivity

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 11 / 51

Page 17:

Non-Exhaustive List of Previous Results

Approximation by NNs with a Single Hidden Layer:
Bounds in terms of nodes and sample size (Barron; 1993, 1994).

Localized approximations (Chui, Li, and Mhaskar; 1994).

Fundamental lower bound on approximation rates (DeVore, Oskolkov, and Petrushev; 1997) (Candès; 1998).

Lower bounds on the sparsity in terms of the number of neurons (Schmitt; 1999).

Approximation using specific rectifiers (Cybenko; 1989).

Approximation of specific function classes (Mhaskar and Micchelli; 1995), (Mhaskar; 1996).

Approximation by NNs with Multiple Hidden Layers:
Approximation with sigmoidal rectifiers (Hornik, Stinchcombe, and White; 1989).

Approximation of continuous functions (Funahashi; 1998).

Approximation of functions together with their derivatives (Nguyen-Thien and Tran-Cong; 1999).

Relation between single and multiple layers (Eldan and Shamir; 2016), (Mhaskar and Poggio; 2016).

Approximation by DNNs versus best M-term approximations by wavelets (Shaham, Cloninger, and Coifman; 2017).

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 12 / 51

Page 19:

Goal for Today

Challenges:

Theoretical analysis of the relation between sparse connectivity and approximation accuracy.
⇒ A conceptual bound on the approximation properties of a sparse deep neural network, which every learning algorithm obeys!

Determine how neural networks trained by backpropagation behave.
⇒ Understand how far practical applications are from optimality!

Roadmap and Contributions:

(1) Fundamental lower bound on sparse connectivity.

(2) Sharpness of this bound ⇒ Optimality of the bound!

(3) Success of certain network topologies trained by backpropagation to reach the optimal bound.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 13 / 51

Page 20:

Complexity of Approximating Deep Neural Networks

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 14 / 51

Page 21:

Rate Distortion Theory

To use information-theoretic arguments, we require the following notions from information theory:

Definition: Let $d \in \mathbb{N}$. Then, for each $\ell \in \mathbb{N}$, we let

$$\mathcal{E}^\ell := \{E : L^2(\mathbb{R}^d) \to \{0,1\}^\ell\}$$

denote the set of binary encoders of length $\ell$ and

$$\mathcal{D}^\ell := \{D : \{0,1\}^\ell \to L^2(\mathbb{R}^d)\}$$

the set of binary decoders of length $\ell$.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 15 / 51

Page 22:

Rate Distortion Theory

Definition (continued): For arbitrary $\varepsilon > 0$ and $\mathcal{C} \subset L^2(\mathbb{R}^d)$, the minimax code length $L(\varepsilon, \mathcal{C})$ is given by

$$L(\varepsilon, \mathcal{C}) := \min\Big\{\ell \in \mathbb{N} : \exists (E, D) \in \mathcal{E}^\ell \times \mathcal{D}^\ell : \sup_{f \in \mathcal{C}} \|D(E(f)) - f\|_{L^2(\mathbb{R}^d)} \le \varepsilon\Big\},$$

and the optimal exponent $\gamma^*(\mathcal{C})$ is defined by

$$\gamma^*(\mathcal{C}) := \inf\{\gamma \in \mathbb{R} : L(\varepsilon, \mathcal{C}) = O(\varepsilon^{-\gamma})\}.$$

Interpretation: The optimal exponent $\gamma^*(\mathcal{C})$ describes the dependence of the code length on the required approximation quality.

γ∗(C) is a measure of the complexity of the function class!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 16 / 51

Page 23:

Rate Distortion Theory

Examples of optimal exponents:

Let $\mathcal{C} \subset B^s_{p,q}(\mathbb{R}^d)$ be bounded. Then we have

$$\gamma^*(\mathcal{C}) = \frac{d}{s}.$$

(Donoho; 2001) Let $\mathcal{C} \subset L^2(\mathbb{R}^2)$ contain all piecewise $C^2(\mathbb{R}^2)$ functions with $C^2$ singularities, the so-called cartoon-like functions. Then

$$\gamma^*(\mathcal{C}) \ge 1.$$

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 17 / 51

Page 24:

A Fundamental Lower Bound

Theorem (Bölcskei, Grohs, K, and Petersen; 2017): Let $d \in \mathbb{N}$, $\rho : \mathbb{R} \to \mathbb{R}$, $c > 0$, and let $\mathcal{C} \subset L^2(\mathbb{R}^d)$. Further, let

$$\mathrm{Learn} : (0,1) \times \mathcal{C} \to \mathcal{NN}_{\infty,\infty,d,\rho}$$

satisfy that, for each $f \in \mathcal{C}$ and $0 < \varepsilon < 1$:

(1) each weight of $\mathrm{Learn}(\varepsilon, f)$ can be encoded with fewer than $-c \log_2(\varepsilon)$ bits,

(2) $\sup_{f \in \mathcal{C}} \|f - \mathrm{Learn}(\varepsilon, f)\|_{L^2(\mathbb{R}^d)} \le \varepsilon$.

Then, for all $\gamma < \gamma^*(\mathcal{C})$,

$$\varepsilon^{\gamma} \sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\varepsilon, f)) \to \infty \quad \text{as } \varepsilon \to 0,$$

where $\mathcal{M}(\mathrm{Learn}(\varepsilon, f))$ denotes the number of non-zero weights in $\mathrm{Learn}(\varepsilon, f)$.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 18 / 51

Page 25:

A Fundamental Lower Bound

Some implications and remarks:

If a neural network stems from a fixed learning procedure Learn, then, for all $\gamma < \gamma^*(\mathcal{C})$, there does not exist $C > 0$ such that

$$\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\varepsilon, f)) \le C \varepsilon^{-\gamma} \quad \text{for all } \varepsilon > 0.$$

⇒ There exists a fundamental lower bound on the number of edges.

This bound is in terms of the edges, hence the sparsity of the connectivity, not the neurons. However, the number of neurons is always bounded, up to a uniform constant, by the number of edges.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 19 / 51

Page 26:

Optimally Sparse Networks

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 20 / 51

Page 27:

DNNs and Representation Systems, I

Question: Can we exploit approximation results with representation systems?

Observation: Assume a system $(\varphi_i)_{i \in I} \subset L^2(\mathbb{R}^d)$ satisfies:

For each $i \in I$, there exists a neural network $\Phi_i$ with at most $C > 0$ edges such that $\varphi_i = \Phi_i$.

Then we can construct a network $\Phi$ with $O(M)$ edges with

$$\Phi = \sum_{i \in I_M} c_i \varphi_i, \quad \text{if } |I_M| = M.$$

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 21 / 51
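
A hedged sketch of this observation (illustrative, not the authors' construction): if each atom $\varphi_i$ is realized by a small two-layer network $\Phi_i$, then the $M$ selected atoms can be run in parallel by stacking their first layers, placing their scaled output layers block-diagonally, and adding one summing node, so the total number of edges stays $O(M)$.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def block_diag(blocks):
    # Plain-NumPy block-diagonal stacking, to keep the sketch self-contained.
    out = np.zeros((sum(B.shape[0] for B in blocks), sum(B.shape[1] for B in blocks)))
    r = c = 0
    for B in blocks:
        out[r:r + B.shape[0], c:c + B.shape[1]] = B
        r, c = r + B.shape[0], c + B.shape[1]
    return out

def parallel_sum(subnets, coeffs, x):
    """Evaluate sum_i c_i * Phi_i(x) as ONE network; each subnet is ((A1, b1), (A2, b2))."""
    A1 = np.vstack([A for (A, _), _ in subnets])                     # shared input layer
    b1 = np.concatenate([b for (_, b), _ in subnets])
    A2 = block_diag([c * A for (_, (A, _)), c in zip(subnets, coeffs)])
    b2 = np.concatenate([c * b for (_, (_, b)), c in zip(subnets, coeffs)])
    return (A2 @ relu(A1 @ x + b1) + b2).sum()                       # final summing node

# Tiny check: three random two-layer "atoms" on R^2.
rng = np.random.default_rng(2)
coeffs = [0.5, -1.0, 2.0]
subnets = [((rng.standard_normal((4, 2)), rng.standard_normal(4)),
            (rng.standard_normal((1, 4)), rng.standard_normal(1))) for _ in range(3)]

x = rng.standard_normal(2)
direct = sum(c * (A2 @ relu(A1 @ x + b1) + b2)[0]
             for ((A1, b1), (A2, b2)), c in zip(subnets, coeffs))
print(parallel_sum(subnets, coeffs, x), direct)    # the two evaluations agree
```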

Page 29:

DNNs and Representation Systems, II

Corollary: Assume a system $(\varphi_i)_{i \in I} \subset L^2(\mathbb{R}^d)$ satisfies:

For each $i \in I$, there exists a neural network $\Phi_i$ with at most $C > 0$ edges such that $\varphi_i = \Phi_i$.

There exists $C > 0$ such that, for all $f \in \mathcal{C} \subset L^2(\mathbb{R}^d)$, there exists $I_M \subset I$ with

$$\Big\| f - \sum_{i \in I_M} c_i \varphi_i \Big\| \le C M^{-1/\gamma^*(\mathcal{C})}.$$

Then every $f \in \mathcal{C}$ can be approximated up to an error of $\varepsilon$ by a neural network with only $O(\varepsilon^{-\gamma^*(\mathcal{C})})$ edges.

Recall: If a neural network stems from a fixed learning procedure Learn, then, for all $\gamma < \gamma^*(\mathcal{C})$, there does not exist $C > 0$ such that

$$\sup_{f \in \mathcal{C}} \mathcal{M}(\mathrm{Learn}(\varepsilon, f)) \le C \varepsilon^{-\gamma} \quad \text{for all } \varepsilon > 0.$$

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 22 / 51

Page 31:

Road Map

General Approach:

(1) Determine a class of functions C ⊆ L2(R2).

(2) Determine an associated representation system with the following properties:

▸ The elements of this system can be realized by a neural network with a controlled number of edges.

▸ This system provides optimally sparse approximations for C.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 23 / 51

Page 32:

Applied Harmonic Analysis

Representation systems designed with concepts from applied harmonic analysis have established themselves as a standard tool in applied mathematics, computer science, and engineering.

Examples:

Wavelets.

Ridgelets.

Curvelets.

Shearlets.

...

Key Property: Fast Algorithms combined with Sparse Approximation Properties!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 24 / 51

Page 33:

Affine Transforms

Building Principle: Many systems from applied harmonic analysis, such as

wavelets,

ridgelets,

shearlets,

constitute affine systems:

$$\{|\det A|^{1/2}\, \psi(A \cdot - t) : A \in G \subseteq \mathrm{GL}(d),\ t \in \mathbb{Z}^d\}, \qquad \psi \in L^2(\mathbb{R}^d).$$

Realization by Neural Networks: The following conditions are equivalent:

(i) $|\det A|^{1/2}\, \psi(A \cdot - t)$ can be realized by a neural network Φ1.

(ii) ψ can be realized by a neural network Φ2.

Also, Φ1 and Φ2 have the same number of edges up to a constant factor.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 25 / 51
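
The equivalence can be made concrete (a sketch under assumed ReLU networks, not the slides' argument): if $\psi$ is realized by a network whose first layer is $W_1(x) = A_1 x + b_1$, then $x \mapsto |\det A|^{1/2}\,\psi(Ax - t)$ is realized by the same network with the first layer replaced by $(A_1 A,\ b_1 - A_1 t)$ and the last layer scaled by $|\det A|^{1/2}$, so the number of edges changes at most by a constant factor.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def evaluate(layers, x):
    # Phi(x) = W_L rho(... rho(W_1(x))) with rho = ReLU (illustrative choice).
    for A, b in layers[:-1]:
        x = relu(A @ x + b)
    A, b = layers[-1]
    return A @ x + b

def dilate_translate(layers, A, t):
    """Return a network realizing |det A|^{1/2} * psi(A x - t), given a network for psi."""
    (A1, b1), *rest = layers
    scale = np.sqrt(abs(np.linalg.det(A)))
    new_first = (A1 @ A, b1 - A1 @ t)              # absorb the affine change of variables
    AL, bL = rest[-1]
    return [new_first, *rest[:-1], (scale * AL, scale * bL)]   # absorb the normalization

# Check on a random two-layer network psi : R^2 -> R.
rng = np.random.default_rng(3)
psi = [(rng.standard_normal((5, 2)), rng.standard_normal(5)),
       (rng.standard_normal((1, 5)), rng.standard_normal(1))]
A = np.array([[4.0, 0.0], [0.0, 2.0]])             # e.g. a parabolic scaling matrix
t = np.array([0.3, -0.7])
x = rng.standard_normal(2)

lhs = evaluate(dilate_translate(psi, A, t), x)
rhs = np.sqrt(abs(np.linalg.det(A))) * evaluate(psi, A @ x - t)
print(lhs, rhs)                                    # agree up to floating-point error
```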

Page 35:

Shearlets

"Definition":

$$A_j := \begin{pmatrix} 2^j & 0 \\ 0 & 2^{j/2} \end{pmatrix}, \qquad S_k := \begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix}, \qquad j, k \in \mathbb{Z}.$$

Then

$$\psi_{j,k,m} := 2^{3j/4}\, \psi(S_k A_j \cdot - m).$$

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 26 / 51
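
As a small numerical illustration of this definition (added here; the particular generator $\psi$ and the evaluation grid are assumptions, not the slides' choice):

```python
import numpy as np

def psi(x1, x2):
    # Illustrative generator: an oscillating profile in x1 times a bump in x2.
    return (1.0 - x1**2) * np.exp(-(x1**2 + x2**2) / 2.0)

def shearlet(j, k, m, x1, x2):
    """Evaluate psi_{j,k,m}(x) = 2^{3j/4} * psi(S_k A_j x - m)."""
    A = np.array([[2.0**j, 0.0], [0.0, 2.0**(j / 2.0)]])   # parabolic scaling
    S = np.array([[1.0, float(k)], [0.0, 1.0]])            # shearing
    M = S @ A
    y1 = M[0, 0] * x1 + M[0, 1] * x2 - m[0]
    y2 = M[1, 0] * x1 + M[1, 1] * x2 - m[1]
    return 2.0**(3.0 * j / 4.0) * psi(y1, y2)

# Sample psi_{2,1,(0,0)} on a coarse grid; larger j gives more elongated, anisotropic elements.
g = np.linspace(-1.0, 1.0, 5)
X1, X2 = np.meshgrid(g, g)
print(np.round(shearlet(2, 1, (0.0, 0.0), X1, X2), 3))
```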

Page 39:

Cone-Adapted Shearlets (K, Labate; 2006)

Intuition: $\Psi := \{\psi_{j,k,m,\iota} : (j, k, m, \iota) \in \Lambda\}$

Scale: $j \in \mathbb{N}_0$,

Shearing: $k \in \mathbb{Z}$, $|k| \le 2^{j/2}$,

Translation: $m \in c\mathbb{Z}^2$,

Cone: $\iota \in \{-1, 0, 1\}$.

Some Elements:

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 27 / 51

Page 40:

Extension to α-Shearlets

Main Idea:

Introduction of a parameter $\alpha \in [0,1]$ to measure the amount of anisotropy. For $j \in \mathbb{Z}$, define

$$A_{\alpha,j} = \begin{pmatrix} 2^j & 0 \\ 0 & 2^{\alpha j} \end{pmatrix}.$$

(Figure: essential support of size $2^{-j} \times 2^{-\alpha j}$.)

Illustration:

$\alpha = 0$: Ridgelets, $\alpha = \tfrac{1}{2}$: Curvelets/Shearlets, $\alpha = 1$: Wavelets.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 28 / 51

Page 41:

α-Shearlets

Definition (Grohs, Keiper, K, and Schäfer; 2016) (Voigtlaender; 2017): For $c \in \mathbb{R}_+$ and $\alpha \in [0,1]$, the cone-adapted α-shearlet system $\mathcal{SH}_\alpha(\phi, \psi, \tilde\psi, c)$ generated by $\phi, \psi, \tilde\psi \in L^2(\mathbb{R}^2)$ is defined by

$$\mathcal{SH}_\alpha(\phi, \psi, \tilde\psi, c) := \Phi(\phi, c, \alpha) \cup \Psi(\psi, c, \alpha) \cup \tilde\Psi(\tilde\psi, c, \alpha),$$

where

$$\Phi(\phi, c, \alpha) := \{\phi(\cdot - m) : m \in c\mathbb{Z}^2\},$$
$$\Psi(\psi, c, \alpha) := \{2^{j(1+\alpha)/2}\, \psi(S_k A_{\alpha,j} \cdot - m) : j \ge 0,\ |k| \le \lceil 2^{j(1-\alpha)} \rceil,\ m \in c\mathbb{Z}^2,\ k \in \mathbb{Z}\},$$
$$\tilde\Psi(\tilde\psi, c, \alpha) := \{2^{j(1+\alpha)/2}\, \tilde\psi(S_k^T A_{\alpha,j} \cdot - m) : j \ge 0,\ |k| \le \lceil 2^{j(1-\alpha)} \rceil,\ m \in c\mathbb{Z}^2,\ k \in \mathbb{Z}\}.$$

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 29 / 51

Page 42:

Cartoon-Like Functions

Definition (Donoho; 2001) (Grohs, Keiper, K, and Schäfer; 2016): Let $\alpha \in [\tfrac{1}{2}, 1]$ and $\nu > 0$. We then define the class of α-cartoon-like functions by

$$\mathcal{E}^{\frac{1}{\alpha}}(\mathbb{R}^2) = \{f \in L^2(\mathbb{R}^2) : f = f_1 + \chi_B f_2\},$$

where $B \subset [0,1]^2$ with $\partial B \in C^{\frac{1}{\alpha}}$, and the functions $f_1$ and $f_2$ satisfy

$$f_1, f_2 \in C_0^{\frac{1}{\alpha}}([0,1]^2), \qquad \|f_1\|_{C^{\frac{1}{\alpha}}},\ \|f_2\|_{C^{\frac{1}{\alpha}}},\ \|\partial B\|_{C^{\frac{1}{\alpha}}} < \nu.$$

Illustration:

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 30 / 51

Page 43:

Optimal Sparse Approximation with α-Shearlets

Theorem (Grohs, Keiper, K, and Schäfer; 2016) (Voigtlaender; 2017): Let $\alpha \in [\tfrac{1}{2}, 1]$, let $\phi, \psi \in L^2(\mathbb{R}^2)$ be sufficiently smooth and compactly supported, and let $\psi$ have sufficiently many vanishing moments. Also set $\tilde\psi(x_1, x_2) := \psi(x_2, x_1)$ for all $x_1, x_2 \in \mathbb{R}$. Then there exists some $c^* > 0$ such that, for every $\varepsilon > 0$, there exists a constant $C_\varepsilon > 0$ with

$$\|f - f_N\|_{L^2(\mathbb{R}^2)} \le C_\varepsilon N^{-\frac{1}{2\alpha} + \varepsilon} \quad \text{for all } f \in \mathcal{E}^{\frac{1}{\alpha}}(\mathbb{R}^2),$$

where $f_N$ is a best $N$-term approximation with respect to $\mathcal{SH}_\alpha(\phi, \psi, \tilde\psi, c)$ and $0 < c < c^*$.

This is the (almost) optimal sparse approximation rate!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 31 / 51

Page 48:

Road Map

General Approach:

(1) Determine a class of functions C ⊆ L2(R2). α-Cartoon-like functions!

(2) Determine an associated representation system with the following properties: α-Shearlets!

▸ The elements of this system can be realized by a neural network with a controlled number of edges. Still to be analyzed!

▸ This system provides optimally sparse approximations for C. This has been proven!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 32 / 51

Page 49:

Construction of Generators

Next Task: Realize sufficiently smooth functions with sufficiently many vanishing moments with a neural network.

Wavelet generators (LeCun; 1987), (Shaham, Cloninger, and Coifman; 2017):

Assume rectifiers $\rho(x) = \max\{x, 0\}$ (ReLUs).

Define
$$t(x) := \rho(x) - \rho(x-1) - \rho(x-2) + \rho(x-3).$$

(Figure: graphs of the rectifier ρ and of the trapezoidal function t supported on [0, 3].)

t can be constructed with a two-layer network.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 33 / 51
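
The construction can be written out directly; a minimal sketch (the evaluation points are an assumption):

```python
import numpy as np

def rho(x):
    # ReLU rectifier.
    return np.maximum(x, 0.0)

def t(x):
    # Trapezoid: 0 outside [0, 3], linear ramps on [0, 1] and [2, 3], equal to 1 on [1, 2].
    return rho(x) - rho(x - 1) - rho(x - 2) + rho(x - 3)

def phi(x1, x2):
    # 2D bump obtained from t with one additional ReLU layer (see the next slide).
    return rho(t(x1) + t(x2) - 1.0)

xs = np.linspace(-1.0, 4.0, 11)
print(np.round(t(xs), 2))                  # piecewise-linear bump in 1D
print(phi(1.5, 1.5), phi(-1.0, 1.5))       # 1.0 on the plateau, 0.0 away from it
```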

Page 51:

Construction of Wavelet Generators

Construction by (Shaham, Cloninger, and Coifman; 2017) continued:

Observe that
$$\phi(x_1, x_2) := \rho(t(x_1) + t(x_2) - 1)$$

yields a 2D bump function.

Summing up shifted versions of φ yields a function ψ with vanishing moments.

Then ψ can be realized by a three-layer neural network.

This cannot yield differentiable functions ψ!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 34 / 51

Page 53:

New Class of Rectifiers

Definition (Bölcskei, Grohs, K, Petersen; 2017): Let $\rho : \mathbb{R} \to \mathbb{R}_+$, $\rho \in C^\infty(\mathbb{R})$ satisfy

$$\rho(x) = \begin{cases} 0 & \text{for } x \le 0, \\ x & \text{for } x \ge K, \end{cases}$$

for some constant $K > 0$. Then we call ρ an admissible smooth rectifier.

Construction of 'good' generators: Let ρ be an admissible smooth rectifier, and define

$$t(x) := \rho(x) - \rho(x-1) - \rho(x-2) + \rho(x-3),$$

$$\phi(x_1, x_2) := \rho(t(x_1) + t(x_2) - 1).$$

This yields smooth bump functions φ, and thus smooth functions ψ with many vanishing moments. ⇒ Leads to appropriate shearlet generators!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 35 / 51
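
One concrete way to obtain such a ρ (an assumption of this sketch, not necessarily the construction used by the authors) is to multiply x by a $C^\infty$ step function that switches from 0 to 1 on [0, K]:

```python
import numpy as np

def _h(x):
    # Standard C-infinity function: exp(-1/x) for x > 0 and 0 for x <= 0.
    out = np.zeros_like(x, dtype=float)
    pos = x > 0
    out[pos] = np.exp(-1.0 / x[pos])
    return out

def smooth_rectifier(x, K=1.0):
    """Candidate admissible smooth rectifier: 0 for x <= 0 and exactly x for x >= K."""
    x = np.asarray(x, dtype=float)
    step = _h(x) / (_h(x) + _h(K - x))     # C-infinity, equals 0 on (-inf, 0] and 1 on [K, inf)
    return x * step

xs = np.array([-2.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 2.0, 5.0])
print(np.round(smooth_rectifier(xs), 4))   # 0 for x <= 0, equal to x for x >= K = 1
```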

Page 55:

Optimal Approximation

Theorem (Bölcskei, Grohs, K, and Petersen; 2017): Let ρ be an admissible smooth rectifier, and let $\varepsilon > 0$. Then there exists $C_\varepsilon > 0$ such that the following holds:

For all $f \in \mathcal{E}^{\frac{1}{\alpha}}(\mathbb{R}^2)$ and $N \in \mathbb{N}$, there exists $\Phi \in \mathcal{NN}_{3, O(N), 2, \rho}$ such that

$$\|f - \Phi\|_{L^2(\mathbb{R}^2)} \le C_\varepsilon N^{-\frac{1}{2\alpha} + \varepsilon}.$$

This is the optimal approximation rate:

Theorem (Grohs, Keiper, K, Schäfer; 2016): We have

$$\gamma^*(\mathcal{E}^{\frac{1}{\alpha}}(\mathbb{R}^2)) = 2\alpha.$$

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 36 / 51

Page 57:

Different Choice of Rectifiers

Similar results can be obtained if ρ is not smooth, but a sigmoidal function.

Definition (Cybenko; 1989): A function $\rho : \mathbb{R} \to \mathbb{R}$ is called a sigmoidal function of order $k \ge 2$ if

$$\lim_{x \to -\infty} \frac{1}{x^k}\,\rho(x) = 0, \qquad \lim_{x \to \infty} \frac{1}{x^k}\,\rho(x) = 1,$$

and, for all $x \in \mathbb{R}$,

$$|\rho(x)| \le C (1 + |x|)^k,$$

where $C \ge 1$ is a constant.

Key Idea: Use a result by (Chui, Mhaskar; 1994) about approximation of B-splines with DNNs.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 37 / 51
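
For instance (an illustrative check added here, not from the slides), ρ(x) = max(x, 0)^k satisfies this definition with C = 1, as a quick numerical sanity check suggests:

```python
import numpy as np

def rho_k(x, k):
    # rho(x) = max(x, 0)^k: one concrete function satisfying the order-k conditions.
    return np.maximum(x, 0.0) ** k

k = 3
for x in [-1e6, -10.0, 10.0, 1e6]:
    print(x, rho_k(x, k) / x**k)           # tends to 0 as x -> -inf and to 1 as x -> +inf

xs = np.linspace(-50.0, 50.0, 101)
print(np.all(rho_k(xs, k) <= (1 + np.abs(xs)) ** k))   # growth condition with C = 1
```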

Page 58:

Functions on Manifolds

Situation: We now consider $f : \mathcal{M} \subseteq \mathbb{R}^d \to \mathbb{R}$, where $\mathcal{M}$ is an immersed submanifold of dimension $m < d$.

Road Map for Extension of Results:

Construct an atlas for $\mathcal{M}$ by covering it with open balls.

Obtain a smooth partition of unity of $\mathcal{M}$.

Represent any function on $\mathcal{M}$ as a sum of functions on $\mathbb{R}^m$.

⇒ We require function classes $\mathcal{C}$ which are invariant with respect to diffeomorphisms and multiplication by smooth functions, such as $\mathcal{E}^{\frac{1}{\alpha}}(\mathbb{R}^2)$.

Theorem (Bölcskei, Grohs, K, and Petersen; 2017): "Deep neural networks are optimal for the approximation of piecewise smooth functions on manifolds."

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 38 / 51

Page 59:

Finally some Numerics...

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 39 / 51

Page 60:

Approximation by Learned Networks

Typically, weights are learnt by backpropagation. This raises the following question:

Does this lead to the optimal sparse connectivity?

Our setup:

Fixed network topology with ReLUs.

Specific functions to learn.

Learning through backpropagation.

Analysis of the connection between approximation error and number of edges.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 40 / 51
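
The slides' experiments use their specific topology and training setup; as a stand-in, here is a hedged toy sketch in plain NumPy (one hidden ReLU layer, full-batch gradient descent, a univariate target with a kink) that trains by backpropagation and then relates the approximation error to the number of non-zero edges left after thresholding small weights. All of these choices are assumptions and do not reproduce the authors' experiments.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1.0, 1.0, 256).reshape(-1, 1)
y = np.abs(x) - 0.5                        # toy target with a linear singularity (a kink)

n_hidden, lr = 32, 0.05
W1 = rng.standard_normal((1, n_hidden)) * 0.5
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, 1)) * 0.5
b2 = np.zeros(1)

for _ in range(5000):                      # plain full-batch gradient descent on the MSE
    h = np.maximum(x @ W1 + b1, 0.0)       # hidden ReLU layer
    pred = h @ W2 + b2
    grad = 2.0 * (pred - y) / len(x)       # d(MSE)/d(pred)
    gW2, gb2 = h.T @ grad, grad.sum(0)
    gh = (grad @ W2.T) * (h > 0)           # backpropagation through the ReLU
    gW1, gb1 = x.T @ gh, gh.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# Prune small weights and relate the error to the number of remaining edges.
for thresh in [0.0, 0.01, 0.05, 0.2]:
    W1p = W1 * (np.abs(W1) > thresh)
    W2p = W2 * (np.abs(W2) > thresh)
    pred = np.maximum(x @ W1p + b1, 0.0) @ W2p + b2
    edges = np.count_nonzero(W1p) + np.count_nonzero(W2p)
    print(edges, float(np.sqrt(np.mean((pred - y) ** 2))))
```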

Page 61:

Chosen Network Topology

Topology inspired by previous construction of functions:

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 41 / 51

Page 62:

Example: Function 1

We train the network using the following function:


Function 1: Linear Singularity

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 42 / 51

Page 63:

Training with Function 1

Approximation Error:

(Plot: approximation error, from $10^{-5}$ to $10^{-1}$ on a logarithmic scale, versus the number of edges, roughly 100 to 200.)

Observation: The decay is exponential. This is expected if the network is a sum of 0-shearlets, which are ridgelets.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 43 / 51

Page 64:

Training with Function 1

The network with fixed topology naturally admits subnetworks.

Examples of Subnetworks:


These have indeed the shape of ridgelets!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 44 / 51

Page 65:

Example: Function 2

We train the network using the following function:


Function 2: Curvilinear Singularity

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 45 / 51

Page 66:

Training with Function 2

Approximation Error:

(Plot: approximation error, from $10^{-4}$ to $10^{-1}$, versus the number of edges, from $10^2$ to $10^5$, on logarithmic axes.)

Observation: The decay is of the order $M^{-1}$. This is expected if the network is a sum of $\tfrac{1}{2}$-shearlets.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 46 / 51

Page 67:

Training with Function 2

Examples of Subnetworks:

These seem to be indeed anisotropic!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 47 / 51

Page 68:

Training with Function 2

Form of Approximation:

(Figure panels: largest elements, medium-sized elements, smallest elements.)

The learned neural network exhibits multiscale behavior!

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 48 / 51

Page 69:

Let’s conclude...

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 49 / 51

Page 70:

What to take Home...?

One central task of a (deep) neural network is to approximate functions for solving tasks such as classification.

The number of edges determines the sparsity of the connectivity.

We derive a fundamental lower bound on this sparsity with respect to the approximation accuracy, which learning procedures need to obey.

Using the framework of α-shearlets, this bound can be shown to be sharp, leading to an optimality result.

For certain topologies, the standard backpropagation algorithm generates a deep neural network which provides those optimal approximation rates, interestingly even yielding α-shearlet-like functions.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 50 / 51

Page 71:

THANK YOU!

References available at:

www.math.tu-berlin.de/∼kutyniok

Related Books:

G. Kutyniok and D. Labate, Shearlets: Multiscale Analysis for Multivariate Data, Birkhäuser-Springer, 2012.

Bolcskei, Grohs, Kutyniok & Petersen Sparse Deep Neural Networks SPARS 2017 51 / 51