
From approximation theory to machine learning

New perspectives in the theory of function spaces and their applications

September 2017, Bedlewo, Poland

Jan Vybíral

Charles University / Czech Technical University, Prague, Czech Republic


Introduction

• Approximation Theory vs. Data Analysis (or Machine Learning, or Learning Theory)

• Example I: ℓ1-minimization

• Example II: Support vector machines

• Example III: Ridge functions and neural networks


Approximation theory

• K: class of objects of interest (vectors, functions, operators, ...)

• ... approximated by (the same or simpler) objects ...

• ... usually with only limited information about the unknown object ...

• with the similarity (approximation error) measured by some sort of distance

• worst case: supremum of the approximation error over K

• average case: mean value of the (square of the) approximation error over K w.r.t. some probability measure on K


Approximation of multivariate functions

Let f : Ω ⊂ R^d → R be a function of many (d ≫ 1) variables.

We want to approximate f using only (a small number of) function values f(x_1), ..., f(x_n).

Questions:

• Which functions (assumptions on f )?

• How to measure the error?

• How to choose the sampling points?

• Decay of the error with growing number of sampling points?

• Algorithms and optimality?

• Dependence on d?


Data Analysis

Discovering structure in given data, allowing for predictions, conclusions, etc.

Given data can have different (and also heterogeneous) formats. For inputs x_1, ..., x_n ∈ R^d and outputs y_1, ..., y_n ∈ R we would like to study the functional dependence y_i ≈ f(x_i).

In practice, the performance is often tested by learning on a subset of the data and testing on the rest.

To allow theoretical results, we often assume that the data is drawn independently from some (unknown) underlying probability distribution, i.e. (x_i, y_i) ∈ A with probability P(A).


Least squares & ridge regression

Least squares:

• x1, . . . , xn ∈ Rd : n data points (inputs) in Rd

• y1, . . . , yn ∈ R: n real values (outputs)

• We look for dependence yi ≈ 〈ω, xi 〉

• We minimize $\sum_{i=1}^n |y_i - \langle \omega, x_i\rangle|^2 = \|y - X\omega\|_2^2$ over ω ∈ R^d

Ridge regression: adding the regularization term $\lambda\|\omega\|_2^2$

LASSO: adding the regularization term $\lambda\|\omega\|_1$
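A minimal numerical sketch of these two fits on synthetic data (the data, λ, and the use of numpy are illustrative assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))            # inputs x_1, ..., x_n as rows
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Least squares: minimize ||y - X w||_2^2 over w
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge regression: minimize ||y - X w||_2^2 + lam * ||w||_2^2
# (closed form: w = (X^T X + lam * I)^{-1} X^T y)
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```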


Principal Component Analysis (PCA)

PCA: Classical dimensionality-reduction technique

• Given data points x1, . . . , xn ∈ Rd

• minimize, over subspaces L with dim L = k, the distance of the x_i's to L, i.e.

  $\min_{L : \dim L = k} \sum_{i=1}^n \|x_i - P_L x_i\|_2^2$

• the answer is given by the singular value decomposition of the n × d data matrix.

“... the Swiss army knife of numerical linear algebra ...”
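A minimal sketch of the SVD computation (synthetic data and k are illustrative assumptions; the data is centered first, as is usual for PCA):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))        # data points x_1, ..., x_n as rows
Xc = X - X.mean(axis=0)                  # center the data

# SVD of the n x d (centered) data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
B = Vt[:k]                               # orthonormal basis of the optimal subspace L
X_proj = Xc @ B.T @ B                    # projections P_L x_i
err = np.sum((Xc - X_proj) ** 2)         # minimized sum of squared distances
```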


Support Vector Machines

V. N. Vapnik and A. Ya. Chervonenkis (1963)
B. E. Boser, I. M. Guyon and V. N. Vapnik (1992): non-linear
C. Cortes and V. N. Vapnik (1995): soft margin

Given data points x_1, ..., x_n ∈ R^d and labels y_i ∈ {−1, 1}, separate the sets

{x_i : y_i = −1} and {x_i : y_i = +1}

by a linear hyperplane, i.e. f(x) = g(〈a, x〉), where

$g(t) = \begin{cases} 1, & t \ge 0,\\ -1, & t < 0 \end{cases}$

is known.


Soft margin SVM:

$\min_{\omega \in \mathbb{R}^d} \sum_{i=1}^n (1 - y_i\langle \omega, x_i\rangle)_+ + \lambda\|\omega\|_2^2$
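A minimal sketch of minimizing this objective by plain (sub)gradient descent on synthetic labeled data (the data, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))   # labels in {-1, +1}

lam, step = 0.1, 0.01
w = np.zeros(d)
for _ in range(500):
    margins = 1.0 - y * (X @ w)
    active = margins > 0                  # points with (1 - y_i <w, x_i>)_+ > 0
    grad = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
    w -= step * grad

objective = np.maximum(1.0 - y * (X @ w), 0.0).sum() + lam * np.dot(w, w)
```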


Limits of the theory...

All theory, dear friend, is grey, but the golden tree of actual life springs ever green. (Johann Wolfgang von Goethe)

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. (Sherlock Holmes, A Scandal in Bohemia)

Approximation theory:

• The class of functions of interest is unclear.
• The data points sometimes cannot be designed.

Data analysis - learning theory:

• The independence is not always ensured
• The underlying probability distribution (and its basic properties) is unknown
• The success measured on held-out data differs from one case to the other


Active choice of points!
A link between data analysis and approximation theory

Difference between regression and sampling? ... the active choice of the points ...

Nowadays, more and more data analysis is done on calculated data!

In materials science, material properties (like color or conductivity) are calculated ab initio from the molecular information, not measured in the lab.

Typically, these properties are governed by a PDE (like Schrödinger's equation), where the structure of the material comes in as initial data.


Calculating one property of one material is then a numerical solution of a parametric PDE: PDE(u, p, x) = 0

The (unknown) functional dependence f = f(p) of the material property on the input parameters is calculated for materials described by parameters p_1, ..., p_m

Sampling of f at p_j ⇔ querying the solution ⇔ running an expensive simulation

Nearly noise-free sampling possible!

Trade-off: many noisy samples vs. fewer, less noisy samples

The materials with p1, . . . , pm can be actively chosen!


Summary of the introduction

• Approximation theory and Data Analysis often study similar problems from different points of view

• The choice of approximation algorithms is free in both areas

• Approximation theory (usually) assumes that the data can be designed/chosen

• The aim of the talk is to show the benefits of the combination of both approaches


ℓ1-norm minimization
Data Analysis

LASSO (least absolute shrinkage and selection operator):

• Tibshirani (1996)

• Least squares with penalization term λ‖ω‖1

• X = (x1, . . . , xn) ∈ Rn×d : n data points (inputs) in Rd

• y = (y1, . . . , yn) ∈ Rn: n real values (outputs)

• LASSO: $\arg\min_{\omega\in\mathbb{R}^d} \|y - X\omega\|_2^2 + \lambda\|\omega\|_1$ weighs the fit y ≈ Xω against the number of non-zero coordinates of ω

• Leaves open a number of questions: how many data points, and distributed how, are sufficient to identify/approximate ω?


ℓ1-norm minimization
Approximation theory

Compressed Sensing: Let w ∈ R^d be sparse (with a small number s ≪ d of non-zero coefficients at unknown positions) and consider the linear function g(x) = 〈x, w〉.

Design n (as small as possible) sampling points x_1, ..., x_n, take the function values y_j = g(x_j) = 〈x_j, w〉, and find a recovery mapping ∆ : R^n → R^d such that ∆(y) is equal/close to w.

• Take the x_j's independently at random, i.e. y = Xw
• Use ℓ1-minimization for recovery:

$\Delta(y) = \arg\min_{\omega\in\mathbb{R}^d} \|\omega\|_1 \quad \text{s.t. } X\omega = y$

• It is enough to take n ≈ s log(d)
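A minimal sketch of this recovery on synthetic data, writing the ℓ1-minimization as a linear program in the variables (ω, t) with |ω_i| ≤ t_i (the dimensions and the use of scipy's linprog are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
d, s = 200, 5
n = 60                                   # on the order of s * log(d) measurements
w_true = np.zeros(d)
w_true[rng.choice(d, s, replace=False)] = rng.standard_normal(s)

X = rng.standard_normal((n, d))          # Gaussian sampling points as rows
y = X @ w_true

c = np.concatenate([np.zeros(d), np.ones(d)])             # minimize sum(t) = ||w||_1
A_ub = np.block([[np.eye(d), -np.eye(d)],                  #  w - t <= 0
                 [-np.eye(d), -np.eye(d)]])                # -w - t <= 0
A_eq = np.hstack([X, np.zeros((n, d))])                    #  X w = y
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * d), A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
w_hat = res.x[:d]                                          # recovered sparse vector
```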


ℓ1-norm minimization
Approximation theory

Theorem (Donoho; Candès, Romberg, Tao, 2006). There is a constant C > 0 such that if n ≥ C s log(d) and the components of the x_i are independent Gaussian random variables, then ∆(y) = w (with high probability).

Many different aspects (other x_i's, noisy sampling, nearly-sparse vectors w ∈ R^d, etc.) were studied intensively.

Benefits for Data Analysis:

• Estimates of the minimal number of data points

• If possible, choose mostly uncorrelated data points

• Better information about possibilities and limits of LASSO


ℓ1-SVM
Data Analysis

ℓ1-SVM:

• P. S. Bradley, O. L. Mangasarian (1998)
  J. Zhu, S. Rosset, T. Hastie, R. Tibshirani (2004)

• Support vector machine with ℓ1-penalization term

• X = (x_1, ..., x_n) ∈ R^{n×d}: n data points (inputs) in R^d

• y = (y_1, ..., y_n) ∈ {−1, +1}^n: n labels

• ℓ1-SVM:

  $\min_{\omega\in\mathbb{R}^d} \sum_{i=1}^n (1 - y_i\langle \omega, x_i\rangle)_+ + \lambda\|\omega\|_1$

• Weights the quality of the classification against the number of non-zero coordinates of ω

• Leaves open a number of questions: how many data points, and distributed how, are sufficient to identify ω?


ℓ1-SVM
Data Analysis

• Standard technique for sparse classification

• Bioinformatics

• gene selection

• microarray classification

• cancer classification

• feature selection

• face recognition

• . . .


ℓ1-SVM
Data Analysis

Results:
I. Steinwart, Support vector machines are universally consistent, J. Complexity (2002)
I. Steinwart, Consistency of support vector machines and other regularized kernel classifiers, IEEE Trans. Inf. Theory (2005):

SVMs are consistent (in the general setting with kernels), i.e. they can learn the hidden classifier as n → ∞ almost surely; the parameter of the SVM has to grow to infinity.


ℓ1-SVM
Approximation theory

• We want to identify sparse (or nearly sparse) ω ∈ R^d

• We assume ω ∈ K = {x ∈ R^d : ‖x‖_p ≤ 1}, p < 1

• We are allowed to take only 1-bit measurements y_i = sign(〈ω, x_i〉) - non-linear!

• We are allowed to design the “sampling points” x_1, ..., x_n ∈ R^d and a (non-linear) recovery algorithm

• Recent area of 1-bit Compressed Sensing


ℓ1-SVM
Approximation theory

1-bit Compressed Sensing:
Boufounos & Baraniuk (2008), Y. Plan & R. Vershynin (2013)

$\omega \in K = \{x \in \mathbb{R}^d : \|x\|_2 \le 1,\ \|x\|_1 \le \sqrt{s}\}$, $\quad x_i \sim \mathcal{N}(0, \mathrm{id})$, $\quad y_i = \mathrm{sign}(\langle \omega, x_i\rangle)$

$\hat\omega := \arg\max_{w \in K} \sum_{i=1}^n y_i \langle x_i, w\rangle$

Result: For $n \ge C\,\delta^{-6}\, w(K)^2$,

$\|\omega - \hat\omega\|_2^2 \le \delta\sqrt{\log(e/\delta)}$ with probability $\ge 1 - 8\exp(-c\delta^2 n)$

Mean Gaussian width: $w(K) = \mathbb{E}\, \sup_{x \in K-K} \langle g, x\rangle$
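A minimal sketch of this measurement model (the dimensions are illustrative assumptions; instead of maximizing over K, the sketch uses the unconstrained, renormalized correlation estimator (1/n) Σ_i y_i x_i, the simplest relative of the maximizer above):

```python
import numpy as np

rng = np.random.default_rng(4)
d, s, n = 500, 5, 2000
w = np.zeros(d)
w[rng.choice(d, s, replace=False)] = 1.0
w /= np.linalg.norm(w)                   # ||w||_2 = 1 and ||w||_1 = sqrt(s), so w lies in K

X = rng.standard_normal((n, d))          # x_i ~ N(0, id), rows of X
y = np.sign(X @ w)                       # 1-bit measurements

w_hat = (y @ X) / n                      # correlation estimator
w_hat /= np.linalg.norm(w_hat)           # renormalize to the unit sphere
err = np.linalg.norm(w - w_hat)
```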


ℓ1-SVM
Approximation theory

What if we insist on the ℓ1-SVM as the recovery algorithm?

Setting:

• ‖ω‖2 = 1, ‖ω‖1 ≤ R

• x_i ∼ N(0, r²·id), i = 1, ..., n

• y_i = sign(〈x_i, ω〉), i = 1, ..., n

• $\hat\omega \in \mathbb{R}^d$ a minimizer of the ℓ1-SVM

  $\min_{w\in\mathbb{R}^d} \sum_{i=1}^n [1 - y_i\langle x_i, w\rangle]_+ \quad \text{subject to } \|w\|_1 \le R.$

Theorem (Kolleck, V. (2016))

Let $0 < \varepsilon < 0.18$, $r > \sqrt{2\pi}\,(0.57 - \pi\varepsilon)^{-1}$, $n \ge C\varepsilon^{-2} r^2 R^2 \ln(d)$. Then it holds that

$\left\| \omega - \frac{\hat\omega}{\|\hat\omega\|_2} \right\|_2 \le C'\left(\varepsilon + \frac{1}{r}\right)$

with probability at least $1 - \exp(-C'' \ln(d))$.
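A minimal sketch of this constrained hinge-loss minimization, solved here by projected subgradient descent onto the ℓ1-ball of radius R (the projection routine, step size, and iteration count are illustrative assumptions; the slides specify only the minimization problem itself):

```python
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto {w : ||w||_1 <= R} (sorting-based)."""
    if np.abs(v).sum() <= R:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - R))[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(3)
d, s, r = 300, 5, 10.0
omega = np.zeros(d)
omega[rng.choice(d, s, replace=False)] = 1.0 / np.sqrt(s)   # ||omega||_2 = 1
R = np.abs(omega).sum()                                      # ||omega||_1 <= R

n = 2000
X = r * rng.standard_normal((n, d))        # x_i ~ N(0, r^2 * id)
y = np.sign(X @ omega)                     # 1-bit labels

w, step = np.zeros(d), 1e-3
for _ in range(300):
    active = 1.0 - y * (X @ w) > 0                       # hinge-loss subgradient support
    g = -(y[active, None] * X[active]).sum(axis=0)
    w = project_l1_ball(w - step * g, R)

err = np.linalg.norm(omega - w / np.linalg.norm(w))
```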


ℓ1-SVM

Summary on SVMs:

• Non-asymptotic analysis of (ℓ1-)SVMs allows one to predict the limits of the algorithm

• For random (i.e. highly uncorrelated) data, a small number n of measurements/data points is sufficient

• Motivated by the analysis, we proposed an SVM which combines the ℓ1 and the ℓ2 penalties; this method performs better in the analysis, but also in numerical simulations!

• Use of the 1-Bit CS algorithm in Machine Learning?


Ridge functions & Neural networks

Ridge functions

Let g : R → R and a ∈ R^d \ {0}. The ridge function with ridge profile g and ridge vector a is the function

f(x) := g(〈a, x〉).

It is constant along the hyperplane a⊥ = {y ∈ R^d : 〈y, a〉 = 0} and its translates.

More generally, if g : R^k → R and A ∈ R^{k×d} with k ≪ d, then

f(x) := g(Ax)

is a k-ridge function.


Ridge functions in mathematics

• Kernels of important transforms (Fourier, Radon)

• Plane waves in PDEs: solutions to

  $\prod_{i=1}^r \left(b_i \frac{\partial}{\partial x} - a_i \frac{\partial}{\partial y}\right) F = 0$

  are of the form

  $F(x, y) = \sum_{i=1}^r f_i(a_i x + b_i y).$

• Ridgelets, curvelets, shearlets, ...: wavelet-like frames capturing singularities along curves and edges (Candès, Donoho, Kutyniok, Labate, ...)


Ridge functions in approximation theory

Approximation of a function by functions from the dictionary

$D_{\mathrm{ridge}} = \{\varrho(\langle k, x\rangle - b) : k \in \mathbb{R}^d,\ b \in \mathbb{R}\}$

• Fundamentality

• Greedy algorithms

Lin & Pinkus, Fundamentality of ridge functions, J. Approx. Theory 75 (1993), no. 3, 295–311

Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signals Systems 2 (1989), 303–314

Leshno, Lin, Pinkus & Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks 6 (1993), 861–867


Neural networks

Motivated by biological research on the human brain and neurons
W. McCulloch, W. Pitts (1943); M. Minsky, S. Papert (1969)

Artificial neuron: ... gets activated if a linear combination of its inputs grows over a certain threshold ...

• Inputs x = (x1, . . . , xn) ∈ Rn

• Weights w = (w1, . . . ,wn) ∈ Rn

• Comparing 〈w, x〉 with a threshold b ∈ R
• Plugging the result into the “activation function”: a jump (or smoothed jump) function σ

An artificial neuron is a function

x → σ(〈x, w〉 − b),

where σ : R → R might be σ(x) = sign(x) or σ(x) = e^x/(1 + e^x), etc.
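A minimal sketch of a single artificial neuron with this smoothed-jump activation (the example input, weights, and threshold are illustrative assumptions):

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = e^t / (1 + e^t), written in a numerically equivalent form
    return 1.0 / (1.0 + np.exp(-t))

def neuron(x, w, b, activation=sigmoid):
    # artificial neuron: x -> sigma(<x, w> - b)
    return activation(np.dot(w, x) - b)

x = np.array([0.5, -1.2, 3.0])
w = np.array([1.0, 0.3, -0.7])
print(neuron(x, w, b=0.1))
```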


Neural networks

An artificial neural network is a directed, acyclic graph of artificial neurons

• Input: x = (x1, . . . , xn) ∈ Rn

• First layer of neurons: $y_1 = \sigma(\langle x, w^1_1\rangle - b^1_1),\ \ldots,\ y_{n_1} = \sigma(\langle x, w^1_{n_1}\rangle - b^1_{n_1})$

• The outputs y = (y_1, ..., y_{n_1}) become inputs for the next layer ...; the last layer outputs y ∈ R

• “Deep Learning” relies on an artificial neural network with ∼ 100–1000 layers

• Training the network: given inputs x^1, ..., x^N and outputs y^1, ..., y^N, optimize over the weights w's and b's

• Non-convex minimization over a huge space...???
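A minimal sketch of the forward pass of such a network, stacking layers of the neuron above (the layer sizes and random weights are illustrative assumptions; training them is the non-convex minimization just mentioned):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, layers):
    """layers: list of (W, b); each layer maps y -> sigma(W y - b) componentwise."""
    y = x
    for W, b in layers:
        y = sigmoid(W @ y - b)
    return y

rng = np.random.default_rng(5)
layers = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),   # first layer: R^3 -> R^4
          (rng.standard_normal((1, 4)), rng.standard_normal(1))]   # last layer: R^4 -> R
print(forward(np.array([0.5, -1.2, 3.0]), layers))
```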


Approximation of ridge functions (and their sums)

k = 1: f (x) = g(〈a, x〉), ‖a‖2 = 1, g smooth

Approximation has two parts: approximation of g and of a

Recovery of a from ∇f(x):

$\nabla f(x) = g'(\langle a, x\rangle)\, a, \qquad \nabla f(0) = g'(0)\, a.$

After recovering a, the problem becomes essentially one-dimensional and one can use an arbitrary sampling method to approximate g.

We assume g′(0) ≠ 0 ... and normalize to g′(0) = 1.
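A minimal sketch of this recovery step (the profile g and ridge vector a are illustrative assumptions): estimate ∇f(0) by first-order differences and normalize to obtain a.

```python
import numpy as np

d = 6
a = np.array([0.6, 0.8, 0.0, 0.0, 0.0, 0.0])   # ridge vector, ||a||_2 = 1
g = np.tanh                                     # smooth ridge profile with g'(0) = 1
f = lambda x: g(x @ a)

# gradient of f at 0 by first-order differences: grad f(0) = g'(0) * a
h = 1e-5
grad = np.array([(f(h * e) - f(np.zeros(d))) / h for e in np.eye(d)])

a_hat = grad / np.linalg.norm(grad)             # recovered ridge direction
```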


Buhmann & Pinkus ’99
Identifying linear combinations of ridge functions

Approximation of functions

$f(x) = \sum_{i=1}^m g_i(\langle a_i, x\rangle), \quad x \in \mathbb{R}^d$

• $g_i \in C^{2m-1}(\mathbb{R})$, i = 1, ..., m;

• $g_i^{(2m-1)}(0) \ne 0$, i = 1, ..., m;

$(D_u^{2m-1-k} D_v^{k} f)(0) = \sum_{i=1}^m \langle u, a_i\rangle^{2m-1-k}\, \langle v, a_i\rangle^{k}\, g_i^{(2m-1)}(0)$

for k = 0, ..., 2m − 1 and v_1, ..., v_d ∈ R^d, and solving this system of equations.


A. Cohen, I. Daubechies, R. DeVore, G. Kerkyacharian, D. Picard ’12
Capturing ridge functions in high dimensions from point queries

• k = 1: f(x) = g(〈a, x〉)

• f : [0, 1]^d → R

• g ∈ C^s([0, 1]), 1 < s

• ‖g‖_{C^s} ≤ M_0

• ‖a‖_{ℓ_q^d} ≤ M_1, 0 < q ≤ 1

• a ≥ 0

Then

$\|f - \hat f\|_\infty \le C M_0 \left[ L^{-s} + M_1 \left( \frac{1 + \log(d/L)}{L} \right)^{1/q - 1} \right]$

using 3L + 2 sampling points


• First, sampling along the diagonal $\frac{i}{L}\mathbf{1} = \frac{i}{L}(1, \ldots, 1)$, i = 0, ..., L:

  $f\left(\frac{i}{L}\mathbf{1}\right) = g\left(\left\langle \frac{i}{L}\mathbf{1}, a\right\rangle\right) = g(i\|a\|_1/L)$

• Recovery of g on a grid of [0, ‖a‖1]

• Finding i_0 with the largest g((i_0 + 1)‖a‖_1/L) − g(i_0‖a‖_1/L)

• Approximating $D_{\varphi_j} f\left(\frac{i_0}{L}\mathbf{1}\right) = g'(i_0\|a\|_1/L)\,\langle a, \varphi_j\rangle$ by first-order differences

• Then recovery of a from 〈a, φ_1〉, ..., 〈a, φ_m〉 by methods of compressed sensing (CS)


M. Fornasier, K. Schnass, J. V., Learning functions of few arbitrary linear parameters in high dimensions (2012)

• f : B(0, 1)→ R

• ‖a‖2 = 1

• 0 < q ≤ 1, ‖a‖q ≤ c

• g ∈ C^2[−1, 1]

Put

$y_j := \frac{f(h\varphi_j) - f(0)}{h} \approx g'(0)\,\langle a, \varphi_j\rangle, \quad j = 1, \ldots, m_\Phi, \quad m_\Phi \le d,$

where h > 0 is small, and

$\varphi_{j,k} = \pm\frac{1}{\sqrt{m_\Phi}}, \quad k = 1, \ldots, d$


yj are scalar products 〈a, ϕj〉 corrupted by deterministic noise

$\hat a = \arg\min_{z\in\mathbb{R}^d} \|z\|_1 \quad \text{s.t. } \langle \varphi_j, z\rangle = y_j, \ j = 1, \ldots, m_\Phi.$

$\tilde a = \hat a/\|\hat a\|_2$ is the approximation of a.

$\hat g$ is obtained by sampling f along $\tilde a$: $\hat g(t) := f(\tilde a \cdot t)$, t ∈ (−1, 1).

Then

$\hat f(x) := \hat g(\langle \tilde a, x\rangle)$

has the approximation property

$\|f - \hat f\|_\infty \le C\left[\left(\frac{m_\Phi}{\log(d/m_\Phi) + 1}\right)^{-\left(\frac{1}{q} - \frac{1}{2}\right)} + h\sqrt{\cdots}\,\right].$
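A minimal sketch of the recovery steps described above on synthetic data (the profile, the vector a, h, and the use of scipy's linprog for the ℓ1-minimization are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
d, m_phi, h = 100, 40, 1e-3
a = np.zeros(d)
a[rng.choice(d, 4, replace=False)] = 0.5              # compressible a with ||a||_2 = 1
g = np.tanh                                           # profile with g'(0) = 1
f = lambda x: g(x @ a)

Phi = rng.choice([-1.0, 1.0], size=(m_phi, d)) / np.sqrt(m_phi)   # phi_{j,k} = +-1/sqrt(m_phi)
y = np.array([(f(h * phi) - f(np.zeros(d))) / h for phi in Phi])  # y_j ~ g'(0) <a, phi_j>

# l1-minimization: min ||z||_1 s.t. <phi_j, z> = y_j, as a linear program in (z, t)
c = np.concatenate([np.zeros(d), np.ones(d)])
A_ub = np.block([[np.eye(d), -np.eye(d)], [-np.eye(d), -np.eye(d)]])
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * d),
              A_eq=np.hstack([Phi, np.zeros((m_phi, d))]), b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
a_hat = res.x[:d]
a_tilde = a_hat / np.linalg.norm(a_hat)               # approximation of a
g_hat = lambda t: f(a_tilde * t)                      # recovered profile on (-1, 1)
```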


Active coordinates:
R. DeVore, G. Petrova, P. Wojtaszczyk, Approximation of functions of few variables in high dimensions, Constr. Approx. ’11

K. Schnass, J. V., Compressed learning of high-dimensional sparse functions, Proceedings of ICASSP ’11

$f(x) = g(x_{i_1}, \ldots, x_{i_k})$

Use of low-rank matrix recovery:
H. Tyagi, V. Cevher, Learning non-parametric basis independent models from point queries via low-rank methods, ACHA ’14


Sums of ridge functions
I. Daubechies, M. Fornasier, J. V., Approximation of sums of ridge functions, in preparation

Recovery of

$f(x) = \sum_{j=1}^k g_j(\langle a_j, x\rangle)$

• We would like to identify a1, . . . , ak , then g1, . . . , gk

• Step 1: Sampling of

  $\nabla f(x) = \sum_{j=1}^k g_j'(\langle a_j, x\rangle)\, a_j$

  at different points gives elements of span{a_1, ..., a_k} ⊂ R^d

• Afterwards, we can reduce the dimension to d = k ... one k-dimensional problem ...
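A minimal sketch of Step 1 (the profiles and directions are illustrative assumptions): finite-difference gradients at a few random points, followed by an SVD, give an approximate basis of span{a_1, ..., a_k}.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 8, 2
A = rng.standard_normal((k, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)     # ridge vectors a_1, ..., a_k
f = lambda x: np.tanh(A[0] @ x) + np.sin(A[1] @ x)

def grad(f, x, h=1e-5):
    # central first-order differences for nabla f(x)
    e = np.eye(len(x))
    return np.array([(f(x + h * e[i]) - f(x - h * e[i])) / (2 * h) for i in range(len(x))])

# gradients at several points all lie (approximately) in span{a_1, ..., a_k}
G = np.stack([grad(f, rng.standard_normal(d)) for _ in range(20)])
U, S, Vt = np.linalg.svd(G, full_matrices=False)
span_basis = Vt[:k]                               # approximate orthonormal basis of the span
```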


Recovery of individual ai ’s for d = k?

• Step 2: Second order derivatives:

  $\nabla^2 f(x) = \sum_{j=1}^k g_j''(\langle a_j, x\rangle)\, a_j \otimes a_j$

• We can recover $\hat L$, an approximation of

  $L = \mathrm{span}\{a_i \otimes a_i,\ i = 1, \ldots, k\} \subset \mathbb{R}^{k\times k}$


• Step 3: We try to find a_i ⊗ a_i in L

• We look for matrices in L with the smallest rank

• We analyze the non-convex problem

$\arg\max \|M\| \quad \text{s.t. } M \in L, \ \|M\|_F \le 1$

• Every algorithm which is able to find a_1 ⊗ a_1 can also find a_j ⊗ a_j, j = 2, ..., k; hence it must be non-convex


Summary

• Approximation theory and Machine Learning study similar problems from different points of view

• They can inspire/enrich one another

• Non-linear, non-convex classes; non-linear information

• Sparsity and other structural assumptions

• Open problem: geometrical properties of the class of functions which can be modeled by neural networks

$N_{d,L} = \{f : \mathbb{R}^d \to \mathbb{R} : f \text{ is a neural network with } L \text{ layers}\}$


Thank you for your attention!
