
From approximation theory to machine learning

New perspectives in the theory of function spaces and their applications

September 2017, Bedlewo, Poland

Jan Vybíral

Charles University / Czech Technical University, Prague, Czech Republic


Introduction

• Approximation Theory vs. Data Analysis (or Machine Learning, or Learning Theory)

• Example I: ℓ1-minimization

• Example II: Support vector machines

• Example III: Ridge functions and neural networks


Approximation theory

• K: class of objects of interest (vectors, functions, operators, ...)

• ... approximated by (the same or simpler) objects ...

• ... usually with only limited information about the unknown object ...

• with the similarity (approximation error) measured by some sort of distance

• worst case: supremum of the approximation error over K

• average case: mean value of the (square of the) approximation error over K w.r.t. some probability measure on K


Approximation of multivariate functions

Let f : Ω ⊂ R^d → R be a function of many (d ≫ 1) variables.

We want to approximate f using only (a small number of) function values f(x_1), ..., f(x_n).

Questions:

• Which functions (assumptions on f )?

• How to measure the error?

• How to choose the sampling points?

• Decay of the error with growing number of sampling points?

• Algorithms and optimality?

• Dependence on d?


Data Analysis

Discovering structure in given data, allowing for predictions, conclusions, etc.

Given data can have different (and also heterogeneous) formats. For inputs x_1, ..., x_n ∈ R^d and outputs y_1, ..., y_n ∈ R we would like to study the functional dependence y_i ≈ f(x_i).

In practice, the performance is often tested by learning on a subset of the data and testing on the rest.

To allow theoretical results, we often assume that the data is drawn independently from some (unknown) underlying probability distribution, i.e. (x_i, y_i) ∈ A with probability P(A).


Least squares & ridge regression

Least squares:

• x1, . . . , xn ∈ Rd : n data points (inputs) in Rd

• y1, . . . , yn ∈ R: n real values (outputs)

• We look for dependence yi ≈ 〈ω, xi 〉

• We minimize $\sum_{i=1}^n |y_i - \langle \omega, x_i\rangle|^2 = \|y - X\omega\|_2^2$ over ω ∈ R^d

Ridge regression: adding the regularization term $\lambda\|\omega\|_2^2$

LASSO: adding the regularization term $\lambda\|\omega\|_1$
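A minimal numerical sketch of these two fits on synthetic data (the data, λ, and the use of numpy are illustrative assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))            # inputs x_1, ..., x_n as rows
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Least squares: minimize ||y - X w||_2^2 over w
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge regression: minimize ||y - X w||_2^2 + lam * ||w||_2^2
# (closed form: w = (X^T X + lam * I)^{-1} X^T y)
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```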


Principal Component Analysis (PCA)

PCA: Classical dimensionality-reduction technique

• Given data points x1, . . . , xn ∈ Rd

• minimize, over subspaces L with dim L = k, the distance of the x_i's to L, i.e.

  $\min_{L : \dim L = k} \sum_{i=1}^n \|x_i - P_L x_i\|_2^2$

• the answer is given by the singular value decomposition of the n × d data matrix.

“... the Swiss army knife of numerical linear algebra ...”
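A minimal sketch of the SVD computation (synthetic data and k are illustrative assumptions; the data is centered first, as is usual for PCA):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))        # data points x_1, ..., x_n as rows
Xc = X - X.mean(axis=0)                  # center the data

# SVD of the n x d (centered) data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
B = Vt[:k]                               # orthonormal basis of the optimal subspace L
X_proj = Xc @ B.T @ B                    # projections P_L x_i
err = np.sum((Xc - X_proj) ** 2)         # minimized sum of squared distances
```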


Support Vector Machines

V. N. Vapnik and A. Ya. Chervonenkis (1963)
B. E. Boser, I. M. Guyon and V. N. Vapnik (1992): non-linear
C. Cortes and V. N. Vapnik (1995): soft margin

Given data points x_1, ..., x_n ∈ R^d and labels y_i ∈ {−1, 1}, separate the sets

{x_i : y_i = −1} and {x_i : y_i = +1}

by a linear hyperplane, i.e. f(x) = g(〈a, x〉), where

$g(t) = \begin{cases} 1, & t \ge 0,\\ -1, & t < 0 \end{cases}$

is known.


Soft margin SVM:

$\min_{\omega \in \mathbb{R}^d} \sum_{i=1}^n (1 - y_i\langle \omega, x_i\rangle)_+ + \lambda\|\omega\|_2^2$
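A minimal sketch of minimizing this objective by plain (sub)gradient descent on synthetic labeled data (the data, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))   # labels in {-1, +1}

lam, step = 0.1, 0.01
w = np.zeros(d)
for _ in range(500):
    margins = 1.0 - y * (X @ w)
    active = margins > 0                  # points with (1 - y_i <w, x_i>)_+ > 0
    grad = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
    w -= step * grad

objective = np.maximum(1.0 - y * (X @ w), 0.0).sum() + lam * np.dot(w, w)
```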


Limits of the theory...

All theory, dear friend, is grey, but the golden tree of actual life springs ever green. (Johann Wolfgang von Goethe)

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. (Sherlock Holmes, A Scandal in Bohemia)

Approximation theory:

• The class of functions of interest is unclear.
• The data points sometimes cannot be designed.

Data analysis - learning theory:

• The independence is not always ensured
• The underlying probability distribution (and its basic properties) is unknown
• The success measured on held-out data differs from one case to the other


Active choice of points!
A link between data analysis and approximation theory

Difference between regression and sampling? ... the active choice of the points ...

Nowadays, more and more data analysis is done on calculated data!

In materials science, material properties (like color or conductivity) are calculated ab initio from the molecular information, not measured in the lab.

Typically, these properties are governed by a PDE (like Schrödinger's equation), where the structure of the material comes in as initial data.


Calculating one property of one material is then a numerical solution of a parametric PDE: PDE(u, p, x) = 0

The (unknown) functional dependence f = f(p) of the material property on the input parameters is calculated for materials described by parameters p_1, ..., p_m

Sampling of f at p_j ⇔ querying the solution ⇔ running an expensive simulation

Nearly noise-free sampling possible!

Trade-off: many noisy samples vs. fewer, less noisy samples

The materials with p1, . . . , pm can be actively chosen!


Summary of the introduction

• Approximation theory and Data Analysis often study similar problems from different points of view

• The choice of approximation algorithms is free in both areas

• Approximation theory (usually) assumes that the data can be designed/chosen

• The aim of the talk is to show the benefits of the combination of both approaches


ℓ1-norm minimization
Data Analysis

LASSO (least absolute shrinkage and selection operator):

• Tibshirani (1996)

• Least squares with penalization term λ‖ω‖1

• X = (x1, . . . , xn) ∈ Rn×d : n data points (inputs) in Rd

• y = (y1, . . . , yn) ∈ Rn: n real values (outputs)

• LASSO: $\arg\min_{\omega\in\mathbb{R}^d} \|y - X\omega\|_2^2 + \lambda\|\omega\|_1$ weighs the fit y ≈ Xω against the number of non-zero coordinates of ω

• Leaves open a number of questions: how many data points, and distributed how, are sufficient to identify/approximate ω?


ℓ1-norm minimization
Approximation theory

Compressed Sensing: Let w ∈ R^d be sparse (with a small number s ≪ d of non-zero coefficients at unknown positions) and consider the linear function g(x) = 〈x, w〉.

Design n (as small as possible) sampling points x_1, ..., x_n, take the function values y_j = g(x_j) = 〈x_j, w〉, and find a recovery mapping ∆ : R^n → R^d such that ∆(y) is equal/close to w.

• Take the x_j's independently at random, i.e. y = Xw
• Use ℓ1-minimization for recovery:

$\Delta(y) = \arg\min_{\omega\in\mathbb{R}^d} \|\omega\|_1 \quad \text{s.t. } X\omega = y$

• It is enough to take n ≈ s log(d)
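A minimal sketch of this recovery on synthetic data, writing the ℓ1-minimization as a linear program in the variables (ω, t) with |ω_i| ≤ t_i (the dimensions and the use of scipy's linprog are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
d, s = 200, 5
n = 60                                   # on the order of s * log(d) measurements
w_true = np.zeros(d)
w_true[rng.choice(d, s, replace=False)] = rng.standard_normal(s)

X = rng.standard_normal((n, d))          # Gaussian sampling points as rows
y = X @ w_true

c = np.concatenate([np.zeros(d), np.ones(d)])             # minimize sum(t) = ||w||_1
A_ub = np.block([[np.eye(d), -np.eye(d)],                  #  w - t <= 0
                 [-np.eye(d), -np.eye(d)]])                # -w - t <= 0
A_eq = np.hstack([X, np.zeros((n, d))])                    #  X w = y
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * d), A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
w_hat = res.x[:d]                                          # recovered sparse vector
```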


ℓ1-norm minimization
Approximation theory

Theorem (Donoho; Candès, Romberg, Tao, 2006). There is a constant C > 0 such that if n ≥ C s log(d) and the components of the x_i are independent Gaussian random variables, then ∆(y) = w (with high probability).

Many different aspects (other x_i's, noisy sampling, nearly-sparse vectors w ∈ R^d, etc.) were studied intensively.

Benefits for Data Analysis:

• Estimates of the minimal number of data points

• If possible, choose mostly uncorrelated data points

• Better information about possibilities and limits of LASSO


ℓ1-SVM
Data Analysis

ℓ1-SVM:

• P. S. Bradley, O. L. Mangasarian (1998)
  J. Zhu, S. Rosset, T. Hastie, R. Tibshirani (2004)

• Support vector machine with ℓ1-penalization term

• X = (x_1, ..., x_n) ∈ R^{n×d}: n data points (inputs) in R^d

• y = (y_1, ..., y_n) ∈ {−1, +1}^n: n labels

• ℓ1-SVM:

  $\min_{\omega\in\mathbb{R}^d} \sum_{i=1}^n (1 - y_i\langle \omega, x_i\rangle)_+ + \lambda\|\omega\|_1$

• Weights the quality of the classification against the number of non-zero coordinates of ω

• Leaves open a number of questions: how many data points, and distributed how, are sufficient to identify ω?


ℓ1-SVM
Data Analysis

• Standard technique for sparse classification

• Bioinformatics

• gene selection

• microarray classification

• cancer classification

• feature selection

• face recognition

• . . .


ℓ1-SVM
Data Analysis

Results:
I. Steinwart, Support vector machines are universally consistent, J. Complexity (2002)
I. Steinwart, Consistency of support vector machines and other regularized kernel classifiers, IEEE Trans. Inf. Theory (2005):

SVMs are consistent (in the general setting with kernels), i.e. they can learn the hidden classifier as n → ∞ almost surely; the parameter of the SVM has to grow to infinity.


ℓ1-SVM
Approximation theory

• We want to identify sparse (or nearly sparse) ω ∈ R^d

• We assume ω ∈ K = {x ∈ R^d : ‖x‖_p ≤ 1}, p < 1

• We are allowed to take only 1-bit measurements y_i = sign(〈ω, x_i〉) - non-linear!

• We are allowed to design the “sampling points” x_1, ..., x_n ∈ R^d and a (non-linear) recovery algorithm

• Recent area of 1-bit Compressed Sensing


ℓ1-SVM
Approximation theory

1-bit Compressed Sensing:
Boufounos & Baraniuk (2008), Y. Plan & R. Vershynin (2013)

$\omega \in K = \{x \in \mathbb{R}^d : \|x\|_2 \le 1,\ \|x\|_1 \le \sqrt{s}\}$, $\quad x_i \sim \mathcal{N}(0, \mathrm{id})$, $\quad y_i = \mathrm{sign}(\langle \omega, x_i\rangle)$

$\hat\omega := \arg\max_{w \in K} \sum_{i=1}^n y_i \langle x_i, w\rangle$

Result: For $n \ge C\,\delta^{-6}\, w(K)^2$,

$\|\omega - \hat\omega\|_2^2 \le \delta\sqrt{\log(e/\delta)}$ with probability $\ge 1 - 8\exp(-c\delta^2 n)$

Mean Gaussian width: $w(K) = \mathbb{E}\, \sup_{x \in K-K} \langle g, x\rangle$
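A minimal sketch of this measurement model (the dimensions are illustrative assumptions; instead of maximizing over K, the sketch uses the unconstrained, renormalized correlation estimator (1/n) Σ_i y_i x_i, the simplest relative of the maximizer above):

```python
import numpy as np

rng = np.random.default_rng(4)
d, s, n = 500, 5, 2000
w = np.zeros(d)
w[rng.choice(d, s, replace=False)] = 1.0
w /= np.linalg.norm(w)                   # ||w||_2 = 1 and ||w||_1 = sqrt(s), so w lies in K

X = rng.standard_normal((n, d))          # x_i ~ N(0, id), rows of X
y = np.sign(X @ w)                       # 1-bit measurements

w_hat = (y @ X) / n                      # correlation estimator
w_hat /= np.linalg.norm(w_hat)           # renormalize to the unit sphere
err = np.linalg.norm(w - w_hat)
```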


ℓ1-SVM
Approximation theory

What if we insist on the ℓ1-SVM as the recovery algorithm?

Setting:

• ‖ω‖2 = 1, ‖ω‖1 ≤ R

• x_i ∼ N(0, r²·id), i = 1, ..., n

• y_i = sign(〈x_i, ω〉), i = 1, ..., n

• $\hat\omega \in \mathbb{R}^d$ a minimizer of the ℓ1-SVM

  $\min_{w\in\mathbb{R}^d} \sum_{i=1}^n [1 - y_i\langle x_i, w\rangle]_+ \quad \text{subject to } \|w\|_1 \le R.$

Theorem (Kolleck, V. (2016))

Let $0 < \varepsilon < 0.18$, $r > \sqrt{2\pi}\,(0.57 - \pi\varepsilon)^{-1}$, $n \ge C\varepsilon^{-2} r^2 R^2 \ln(d)$. Then it holds that

$\left\| \omega - \frac{\hat\omega}{\|\hat\omega\|_2} \right\|_2 \le C'\left(\varepsilon + \frac{1}{r}\right)$

with probability at least $1 - \exp(-C'' \ln(d))$.
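A minimal sketch of this constrained hinge-loss minimization, solved here by projected subgradient descent onto the ℓ1-ball of radius R (the projection routine, step size, and iteration count are illustrative assumptions; the slides specify only the minimization problem itself):

```python
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto {w : ||w||_1 <= R} (sorting-based)."""
    if np.abs(v).sum() <= R:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - R))[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(3)
d, s, r = 300, 5, 10.0
omega = np.zeros(d)
omega[rng.choice(d, s, replace=False)] = 1.0 / np.sqrt(s)   # ||omega||_2 = 1
R = np.abs(omega).sum()                                      # ||omega||_1 <= R

n = 2000
X = r * rng.standard_normal((n, d))        # x_i ~ N(0, r^2 * id)
y = np.sign(X @ omega)                     # 1-bit labels

w, step = np.zeros(d), 1e-3
for _ in range(300):
    active = 1.0 - y * (X @ w) > 0                       # hinge-loss subgradient support
    g = -(y[active, None] * X[active]).sum(axis=0)
    w = project_l1_ball(w - step * g, R)

err = np.linalg.norm(omega - w / np.linalg.norm(w))
```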


ℓ1-SVM

Summary on SVMs:

• Non-asymptotic analysis of (ℓ1-)SVMs allows one to predict the limits of the algorithm

• For random (i.e. highly uncorrelated) data, a small number n of measurements/data points is sufficient

• Motivated by the analysis, we proposed an SVM which combines the ℓ1 and the ℓ2 penalties; this method performs better in the analysis, but also in numerical simulations!

• Use of the 1-Bit CS algorithm in Machine Learning?


Ridge functions & Neural networks

Ridge functions

Let g : R → R and a ∈ R^d \ {0}. The ridge function with ridge profile g and ridge vector a is the function

f(x) := g(〈a, x〉).

It is constant along the hyperplane a⊥ = {y ∈ R^d : 〈y, a〉 = 0} and its translates.

More generally, if g : R^k → R and A ∈ R^{k×d} with k ≪ d, then

f(x) := g(Ax)

is a k-ridge function.


Ridge functions in mathematics

• Kernels of important transforms (Fourier, Radon)

• Plane waves in PDEs: solutions to

  $\prod_{i=1}^r \left(b_i \frac{\partial}{\partial x} - a_i \frac{\partial}{\partial y}\right) F = 0$

  are of the form

  $F(x, y) = \sum_{i=1}^r f_i(a_i x + b_i y).$

• Ridgelets, curvelets, shearlets, ...: wavelet-like frames capturing singularities along curves and edges (Candès, Donoho, Kutyniok, Labate, ...)


Ridge functions in approximation theory

Approximation of a function by functions from the dictionary

$D_{\mathrm{ridge}} = \{\varrho(\langle k, x\rangle - b) : k \in \mathbb{R}^d,\ b \in \mathbb{R}\}$

• Fundamentality

• Greedy algorithms

Lin & Pinkus, Fundamentality of ridge functions, J. Approx. Theory 75 (1993), no. 3, 295–311

Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signals Systems 2 (1989), 303–314

Leshno, Lin, Pinkus & Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks 6 (1993), 861–867


Neural networks

Motivated by biological research on the human brain and neurons
W. McCulloch, W. Pitts (1943); M. Minsky, S. Papert (1969)

Artificial neuron: ... gets activated if a linear combination of its inputs grows over a certain threshold ...

• Inputs x = (x1, . . . , xn) ∈ Rn

• Weights w = (w1, . . . ,wn) ∈ Rn

• Comparing 〈w, x〉 with a threshold b ∈ R
• Plugging the result into the “activation function”: a jump (or smoothed jump) function σ

An artificial neuron is a function

x → σ(〈x, w〉 − b),

where σ : R → R might be σ(x) = sign(x) or σ(x) = e^x/(1 + e^x), etc.
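A minimal sketch of a single artificial neuron with this smoothed-jump activation (the example input, weights, and threshold are illustrative assumptions):

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = e^t / (1 + e^t), written in a numerically equivalent form
    return 1.0 / (1.0 + np.exp(-t))

def neuron(x, w, b, activation=sigmoid):
    # artificial neuron: x -> sigma(<x, w> - b)
    return activation(np.dot(w, x) - b)

x = np.array([0.5, -1.2, 3.0])
w = np.array([1.0, 0.3, -0.7])
print(neuron(x, w, b=0.1))
```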


Neural networks

An artificial neural network is a directed, acyclic graph of artificial neurons

• Input: x = (x1, . . . , xn) ∈ Rn

• First layer of neurons: $y_1 = \sigma(\langle x, w^1_1\rangle - b^1_1),\ \ldots,\ y_{n_1} = \sigma(\langle x, w^1_{n_1}\rangle - b^1_{n_1})$

• The outputs y = (y_1, ..., y_{n_1}) become inputs for the next layer ...; the last layer outputs y ∈ R

• “Deep Learning” relies on an artificial neural network with ∼ 100–1000 layers

• Training the network: given inputs x^1, ..., x^N and outputs y^1, ..., y^N, optimize over the weights w's and b's

• Non-convex minimization over a huge space...???
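A minimal sketch of the forward pass of such a network, stacking layers of the neuron above (the layer sizes and random weights are illustrative assumptions; training them is the non-convex minimization just mentioned):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, layers):
    """layers: list of (W, b); each layer maps y -> sigma(W y - b) componentwise."""
    y = x
    for W, b in layers:
        y = sigmoid(W @ y - b)
    return y

rng = np.random.default_rng(5)
layers = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),   # first layer: R^3 -> R^4
          (rng.standard_normal((1, 4)), rng.standard_normal(1))]   # last layer: R^4 -> R
print(forward(np.array([0.5, -1.2, 3.0]), layers))
```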


Approximation of ridge functions (and their sums)

k = 1: f (x) = g(〈a, x〉), ‖a‖2 = 1, g smooth

Approximation has two parts: approximation of g and of a

Recovery of a from ∇f(x):

$\nabla f(x) = g'(\langle a, x\rangle)\, a, \qquad \nabla f(0) = g'(0)\, a.$

After recovering a, the problem becomes essentially one-dimensional and one can use an arbitrary sampling method to approximate g.

We assume g′(0) ≠ 0 ... and normalize to g′(0) = 1.
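A minimal sketch of this recovery step (the profile g and ridge vector a are illustrative assumptions): estimate ∇f(0) by first-order differences and normalize to obtain a.

```python
import numpy as np

d = 6
a = np.array([0.6, 0.8, 0.0, 0.0, 0.0, 0.0])   # ridge vector, ||a||_2 = 1
g = np.tanh                                     # smooth ridge profile with g'(0) = 1
f = lambda x: g(x @ a)

# gradient of f at 0 by first-order differences: grad f(0) = g'(0) * a
h = 1e-5
grad = np.array([(f(h * e) - f(np.zeros(d))) / h for e in np.eye(d)])

a_hat = grad / np.linalg.norm(grad)             # recovered ridge direction
```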


Buhmann & Pinkus ’99
Identifying linear combinations of ridge functions

Approximation of functions

$f(x) = \sum_{i=1}^m g_i(\langle a_i, x\rangle), \quad x \in \mathbb{R}^d$

• $g_i \in C^{2m-1}(\mathbb{R})$, i = 1, ..., m;

• $g_i^{(2m-1)}(0) \ne 0$, i = 1, ..., m;

$(D_u^{2m-1-k} D_v^{k} f)(0) = \sum_{i=1}^m \langle u, a_i\rangle^{2m-1-k}\, \langle v, a_i\rangle^{k}\, g_i^{(2m-1)}(0)$

for k = 0, ..., 2m − 1 and v_1, ..., v_d ∈ R^d, and solving this system of equations.


A. Cohen, I. Daubechies, R. DeVore, G. Kerkyacharian, D. Picard ’12
Capturing ridge functions in high dimensions from point queries

• k = 1: f(x) = g(〈a, x〉)

• f : [0, 1]^d → R

• g ∈ C^s([0, 1]), 1 < s

• ‖g‖_{C^s} ≤ M_0

• ‖a‖_{ℓ_q^d} ≤ M_1, 0 < q ≤ 1

• a ≥ 0

Then

$\|f - \hat f\|_\infty \le C M_0 \left[ L^{-s} + M_1 \left( \frac{1 + \log(d/L)}{L} \right)^{1/q - 1} \right]$

using 3L + 2 sampling points


• First, sampling along the diagonal $\frac{i}{L}\mathbf{1} = \frac{i}{L}(1, \ldots, 1)$, i = 0, ..., L:

  $f\left(\frac{i}{L}\mathbf{1}\right) = g\left(\left\langle \frac{i}{L}\mathbf{1}, a\right\rangle\right) = g(i\|a\|_1/L)$

• Recovery of g on a grid of [0, ‖a‖1]

• Finding i_0 with the largest g((i_0 + 1)‖a‖_1/L) − g(i_0‖a‖_1/L)

• Approximating $D_{\varphi_j} f\left(\frac{i_0}{L}\mathbf{1}\right) = g'(i_0\|a\|_1/L)\,\langle a, \varphi_j\rangle$ by first-order differences

• Then recovery of a from 〈a, φ_1〉, ..., 〈a, φ_m〉 by methods of compressed sensing (CS)


M. Fornasier, K. Schnass, J. V., Learning functions of few arbitrary linear parameters in high dimensions (2012)

• f : B(0, 1)→ R

• ‖a‖2 = 1

• 0 < q ≤ 1, ‖a‖q ≤ c

• g ∈ C^2[−1, 1]

Put

$y_j := \frac{f(h\varphi_j) - f(0)}{h} \approx g'(0)\,\langle a, \varphi_j\rangle, \quad j = 1, \ldots, m_\Phi, \quad m_\Phi \le d,$

where h > 0 is small, and

$\varphi_{j,k} = \pm\frac{1}{\sqrt{m_\Phi}}, \quad k = 1, \ldots, d$


yj are scalar products 〈a, ϕj〉 corrupted by deterministic noise

$\hat a = \arg\min_{z\in\mathbb{R}^d} \|z\|_1 \quad \text{s.t. } \langle \varphi_j, z\rangle = y_j, \ j = 1, \ldots, m_\Phi.$

$\tilde a = \hat a/\|\hat a\|_2$ is the approximation of a.

$\hat g$ is obtained by sampling f along $\tilde a$: $\hat g(t) := f(\tilde a \cdot t)$, t ∈ (−1, 1).

Then

$\hat f(x) := \hat g(\langle \tilde a, x\rangle)$

has the approximation property

$\|f - \hat f\|_\infty \le C\left[\left(\frac{m_\Phi}{\log(d/m_\Phi) + 1}\right)^{-\left(\frac{1}{q} - \frac{1}{2}\right)} + h\sqrt{\cdots}\,\right].$
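A minimal sketch of the recovery steps described above on synthetic data (the profile, the vector a, h, and the use of scipy's linprog for the ℓ1-minimization are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
d, m_phi, h = 100, 40, 1e-3
a = np.zeros(d)
a[rng.choice(d, 4, replace=False)] = 0.5              # compressible a with ||a||_2 = 1
g = np.tanh                                           # profile with g'(0) = 1
f = lambda x: g(x @ a)

Phi = rng.choice([-1.0, 1.0], size=(m_phi, d)) / np.sqrt(m_phi)   # phi_{j,k} = +-1/sqrt(m_phi)
y = np.array([(f(h * phi) - f(np.zeros(d))) / h for phi in Phi])  # y_j ~ g'(0) <a, phi_j>

# l1-minimization: min ||z||_1 s.t. <phi_j, z> = y_j, as a linear program in (z, t)
c = np.concatenate([np.zeros(d), np.ones(d)])
A_ub = np.block([[np.eye(d), -np.eye(d)], [-np.eye(d), -np.eye(d)]])
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * d),
              A_eq=np.hstack([Phi, np.zeros((m_phi, d))]), b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
a_hat = res.x[:d]
a_tilde = a_hat / np.linalg.norm(a_hat)               # approximation of a
g_hat = lambda t: f(a_tilde * t)                      # recovered profile on (-1, 1)
```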


Active coordinates:
R. DeVore, G. Petrova, P. Wojtaszczyk, Approximation of functions of few variables in high dimensions, Constr. Approx. ’11

K. Schnass, J. V., Compressed learning of high-dimensional sparse functions, Proceedings of ICASSP ’11

$f(x) = g(x_{i_1}, \ldots, x_{i_k})$

Use of low-rank matrix recovery:
H. Tyagi, V. Cevher, Learning non-parametric basis independent models from point queries via low-rank methods, ACHA ’14


Sums of ridge functions
I. Daubechies, M. Fornasier, J. V., Approximation of sums of ridge functions, in preparation

Recovery of

$f(x) = \sum_{j=1}^k g_j(\langle a_j, x\rangle)$

• We would like to identify a1, . . . , ak , then g1, . . . , gk

• Step 1: Sampling of

  $\nabla f(x) = \sum_{j=1}^k g_j'(\langle a_j, x\rangle)\, a_j$

  at different points gives elements of span{a_1, ..., a_k} ⊂ R^d

• Afterwards, we can reduce the dimension to d = k ... one k-dimensional problem ...
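A minimal sketch of Step 1 (the profiles and directions are illustrative assumptions): finite-difference gradients at a few random points, followed by an SVD, give an approximate basis of span{a_1, ..., a_k}.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 8, 2
A = rng.standard_normal((k, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)     # ridge vectors a_1, ..., a_k
f = lambda x: np.tanh(A[0] @ x) + np.sin(A[1] @ x)

def grad(f, x, h=1e-5):
    # central first-order differences for nabla f(x)
    e = np.eye(len(x))
    return np.array([(f(x + h * e[i]) - f(x - h * e[i])) / (2 * h) for i in range(len(x))])

# gradients at several points all lie (approximately) in span{a_1, ..., a_k}
G = np.stack([grad(f, rng.standard_normal(d)) for _ in range(20)])
U, S, Vt = np.linalg.svd(G, full_matrices=False)
span_basis = Vt[:k]                               # approximate orthonormal basis of the span
```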


Recovery of individual ai ’s for d = k?

• Step 2: Second order derivatives:

  $\nabla^2 f(x) = \sum_{j=1}^k g_j''(\langle a_j, x\rangle)\, a_j \otimes a_j$

• We can recover $\hat L$, an approximation of

  $L = \mathrm{span}\{a_i \otimes a_i,\ i = 1, \ldots, k\} \subset \mathbb{R}^{k\times k}$


• Step 3: We try to find a_i ⊗ a_i in L

• We look for matrices in L with the smallest rank

• We analyze the non-convex problem

$\arg\max \|M\| \quad \text{s.t. } M \in L, \ \|M\|_F \le 1$

• Every algorithm which is able to find a_1 ⊗ a_1 can also find a_j ⊗ a_j, j = 2, ..., k; hence it must be non-convex


Summary

• Approximation theory and Machine Learning study similar problems from different points of view

• They can inspire/enrich one another

• Non-linear, non-convex classes; non-linear information

• Sparsity and other structural assumptions

• Open problem: geometrical properties of the class of functions which can be modeled by neural networks

$N_{d,L} = \{f : \mathbb{R}^d \to \mathbb{R} : f \text{ is a neural network with } L \text{ layers}\}$


Thank you for your attention!
