
Towards Deep Kernel Machines

Julien Mairal

Inria, Grenoble

Prague, April 2017


Part I: Scientific Context


A quick zoom on multilayer neural networks

The goal is to learn a prediction function f : R^p → R given labeled training data (xi, yi), i = 1, ..., n, with xi in R^p and yi in R:

$$\min_{f \in \mathcal{F}} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))}_{\text{empirical risk, data fit}} + \underbrace{\lambda\, \Omega(f)}_{\text{regularization}}.$$


What is specific to multilayer neural networks?

The “neural network” space F is explicitly parametrized by:

$$f(x) = \sigma_k\big(A_k\, \sigma_{k-1}(A_{k-1} \cdots \sigma_2(A_2\, \sigma_1(A_1 x)) \cdots)\big).$$

Finding the optimal A1, A2, ..., Ak yields a non-convex optimization problem in huge dimension.
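As a concrete illustration (a minimal sketch, not from the slides), the parametrization above in NumPy, assuming every σj is a ReLU:

```python
# A minimal sketch of f(x) = sigma_k(A_k sigma_{k-1}(... sigma_1(A_1 x))),
# assuming every sigma_j is a ReLU; the matrices A_j are the parameters to learn.
import numpy as np

def f(x, As):
    for A in As:
        x = np.maximum(A @ x, 0.0)  # linear map followed by pointwise non-linearity
    return x

rng = np.random.default_rng(0)
As = [rng.standard_normal((64, 128)), rng.standard_normal((1, 64))]  # two layers
print(f(rng.standard_normal(128), As))  # a scalar prediction (shape (1,))
```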


A quick zoom on convolutional neural networks

Figure: Picture from LeCun et al. [1998]

CNNs perform “simple” operations such as convolutions, pointwise non-linearities, and subsampling.

For most successful applications of CNNs, training is supervised.


A quick zoom on convolutional neural networks

What are the main features of CNNs?

they capture compositional and multiscale structures in images;

they provide some invariance;

they model local stationarity of images at several scales.

What are the main open problems?

very little theoretical understanding;

they require large amounts of labeled data;

they require manual design and parameter tuning.

Nonetheless...

they are the focus of a huge academic and industrial effort;

there is efficient and well-documented open-source software.


Context of kernel methods

Figure: a set of three sequences S = (aatcgagtcac, atggacgtct, tgcactact) in X is represented, via ϕ, by the kernel matrix

$$K = \begin{pmatrix} 1 & 0.5 & 0.3 \\ 0.5 & 1 & 0.6 \\ 0.3 & 0.6 & 1 \end{pmatrix}.$$

Idea: representation by pairwise comparisons

Define a “comparison function” K : X × X → R. Represent a set of n data points S = {x1, . . . , xn} by the n × n matrix

$$K_{ij} := K(x_i, x_j).$$

[Shawe-Taylor and Cristianini, 2004, Schölkopf and Smola, 2002].
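As a toy illustration of this idea (a sketch; the slide does not specify which kernel produced the matrix above), one can build the n × n matrix for the three sequences with an assumed k-mer spectrum kernel, normalized so that Kii = 1:

```python
# Build K_ij = K(x_i, x_j) for strings with an (assumed) 2-mer spectrum kernel.
from collections import Counter
import math

def spectrum(s, t, k=2):
    """Inner product of the k-mer count vectors of strings s and t."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[w] * ct[w] for w in cs)

S = ["aatcgagtcac", "atggacgtct", "tgcactact"]
K = [[spectrum(s, t) for t in S] for s in S]
d = [math.sqrt(K[i][i]) for i in range(len(S))]        # normalize: K_ii = 1
K = [[K[i][j] / (d[i] * d[j]) for j in range(len(S))] for i in range(len(S))]
for row in K:
    print(["%.2f" % v for v in row])
```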


Context of kernel methods

Theorem (Aronszajn, 1950)

K : X × X → R is a positive definite kernel if and only if there exists a Hilbert space H and a mapping

$$\varphi : \mathcal{X} \to \mathcal{H},$$

such that, for any x, x′ in X,

$$K(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}.$$
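For a concrete finite-dimensional instance of the theorem (a standard textbook example, not on the slide), take X = R² and the homogeneous polynomial kernel of degree 2:

$$K(x, x') = (x^\top x')^2 = x_1^2 x_1'^2 + 2\, x_1 x_2\, x_1' x_2' + x_2^2 x_2'^2 = \langle \varphi(x), \varphi(x') \rangle_{\mathbb{R}^3}, \quad \text{with } \varphi(x) = \big(x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2\big).$$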


Context of kernel methods

Figure: data points in R² (axes x1, x2) are mapped to a feature space where they become linearly separable.

The classical challenge of kernel methods

Find a kernel K such that

the data in the feature space H has nice properties, e.g., linear separability, cluster structure.

K is fast to compute and mathematically valid (p.d.).


Context of kernel methods (supervised learning)

The goal is to learn a prediction function f : X → R given labeled training data (xi, yi), i = 1, ..., n, with xi in X and yi in R:

$$\min_{f \in \mathcal{H}} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))}_{\text{empirical risk, data fit}} + \underbrace{\lambda \|f\|_{\mathcal{H}}^2}_{\text{regularization}}.$$

What is specific here to kernel methods?

The “kernel method” space H is possibly infinite-dimensional.

Optimization over f is done implicitly by (often) minimizing a convex function.

X does not need to be a vector space.
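For instance, with the square loss, the problem above becomes kernel ridge regression; a minimal sketch (assuming a Gaussian kernel), where by the representer theorem the solution has the form f = Σi αi K(xi, ·):

```python
# Kernel ridge regression sketch: alpha solves (K + n*lambda*I) alpha = y,
# and f(x) = sum_i alpha_i K(x_i, x). The Gaussian kernel is an assumed choice.
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)

lam = 0.1
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + len(y) * lam * np.eye(len(y)), y)
predict = lambda Xt: gaussian_kernel(Xt, X) @ alpha    # the learned function f
print(predict(X[:3]), y[:3])
```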


Context of kernel methods

What are the main features of kernel methods?

decoupling of data representation and learning algorithm;

a huge number of unsupervised and supervised algorithms;

typically, convex optimization problems in a supervised context;

versatility: applies to vectors, sequences, graphs, sets,. . . ;

natural regularization function to control the learning capacity;

well studied theoretical framework.

But...

poor scalability in n, at least O(n²);

decoupling of data representation and learning may not be a good thing, judging from the recent successes of supervised deep learning.


Context of kernel methods

Challenges

Scaling up kernel methods with approximate feature maps (see the random-features sketch after this list);

$$K(x, x') \approx \langle \psi(x), \psi(x') \rangle.$$

[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and Zisserman, 2012, Le et al., 2013].

Design data-adaptive and task-adaptive kernels;

Build kernel hierarchies to capture compositional structures.

Introduce supervision in the kernel design.
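A sketch of the first challenge, using random Fourier features for the Gaussian kernel [Rahimi and Recht, 2007]; the feature dimension D and the bandwidth are illustrative choices:

```python
# Random Fourier features: psi(x) = sqrt(2/D) cos(W'x + b) with w ~ N(0, I/sigma^2)
# and b ~ U[0, 2pi], so that <psi(x), psi(y)> ~ exp(-||x - y||^2 / (2 sigma^2)).
import numpy as np

def rff(X, D=4096, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 10))
approx = rff(X) @ rff(X).T                               # <psi(x), psi(y)>
exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
print(np.abs(approx - exact).max())                      # shrinks as D grows
```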


We need deep kernel machines!


Some more motivation

Longer term objectives

build a kernel for images (an abstract object), for which we can precisely quantify the invariance, stability to perturbations, recovery, and complexity properties.

build deep networks which can be easily regularized.

build deep networks for structured objects (graphs, sequences)...

add more geometric interpretation to deep networks.

. . .


Part II: Basic Principles of Deep Kernel Machines


Basic principles of deep kernel machines: composition

Composition of feature spaces

Consider a p.d. kernel K1 : X² → R and its RKHS H1 with mapping ϕ1 : X → H1. Consider also a p.d. kernel K2 : H1² → R and its RKHS H2 with mapping ϕ2 : H1 → H2. Then K3 : X² → R below is also p.d.,

$$K_3(x, x') = K_2(\varphi_1(x), \varphi_1(x')),$$

and its RKHS mapping is ϕ3 = ϕ2 ∘ ϕ1.

Examples

$$K_3(x, x') = e^{-\frac{1}{2\sigma^2} \|\varphi_1(x) - \varphi_1(x')\|_{\mathcal{H}_1}^2}.$$

$$K_3(x, x') = \langle \varphi_1(x), \varphi_1(x') \rangle_{\mathcal{H}_1}^2 = K_1(x, x')^2.$$
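A small sketch of the first example: since ‖ϕ1(x) − ϕ1(x′)‖² = K1(x, x) + K1(x′, x′) − 2 K1(x, x′), the composed kernel K3 can be evaluated with calls to K1 only; the polynomial base kernel below is an assumed choice:

```python
# Evaluate K3(x, x') = exp(-||phi1(x) - phi1(x')||^2 / (2 sigma^2)) via the
# kernel trick, without ever computing phi1 explicitly.
import numpy as np

def K1(x, xp):
    return (x @ xp + 1.0) ** 2     # assumed p.d. base kernel (polynomial)

def K3(x, xp, sigma=1.0):
    d2 = K1(x, x) + K1(xp, xp) - 2.0 * K1(x, xp)   # squared distance in H1
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(4), rng.standard_normal(4)
print(K3(x, xp))
```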


Basic principles of deep kernel machines: composition

Remarks on the composition of feature spaces

we can iterate the process many times.

the idea appears early in the literature of kernel methods [see Schölkopf et al., 1998, for a multilayer variant of kernel PCA].

Is this idea sufficient to make kernel methods more powerful?

Probably not:

K2 is doomed to be a simple kernel (dot-product or RBF kernel).

it does not address any of the previous challenges.

K3 and K1 operate on the same type of object; it is not clear why designing K3 is easier than designing K1 directly.

Nonetheless, we will see later that this idea can be used to build a hierarchy of kernels that operate on more and more complex objects.


Basic principles of deep kernel machines: infinite NN

A large class of kernels on Rp may be defined as an expectation

$$K(x, y) = \mathbf{E}_w\big[s(w^\top x)\, s(w^\top y)\big],$$

where s : R → R is a nonlinear function. The encoding can be seen as a one-layer neural network with an infinite number of random weights.

Examples

random Fourier features:

$$\kappa(x - y) = \mathbf{E}_{w \sim q(w),\, b \sim U[0,2\pi]}\Big[\sqrt{2}\cos(w^\top x + b)\, \sqrt{2}\cos(w^\top y + b)\Big];$$

Gaussian kernel:

$$e^{-\frac{1}{2\sigma^2}\|x - y\|_2^2} \propto \mathbf{E}_w\Big[e^{\frac{2}{\sigma^2} w^\top x}\, e^{\frac{2}{\sigma^2} w^\top y}\Big] \quad \text{with } w \sim \mathcal{N}\big(0, (\sigma^2/4)\, \mathbf{I}\big).$$


Basic principles of deep kernel machines: infinite NN

Example: arc-cosine kernels

$$K(x, y) \propto \mathbf{E}_w\Big[\max(w^\top x, 0)^\alpha \, \max(w^\top y, 0)^\alpha\Big] \quad \text{with } w \sim \mathcal{N}(0, \mathbf{I}),$$

for x, y on the hyper-sphere S^{m−1}. Interestingly, the non-linearities s are typical ones from the neural network literature:

s(u) = max(0, u) (rectified linear units) leads to K1(x, y) = sin(θ) + (π − θ) cos(θ) with θ = cos⁻¹(x⊤y);

s(u) = max(0, u)² (squared rectified linear units) leads to K2(x, y) = 3 sin(θ) cos(θ) + (π − θ)(1 + 2 cos²(θ)).
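A quick Monte Carlo sanity check of the α = 1 case (a sketch; the constant 1/(2π) is hidden in the “∝” above):

```python
# E_w[max(w'x,0) max(w'y,0)] with w ~ N(0,I) equals, for unit-norm x and y,
# (1/(2*pi)) * (sin(theta) + (pi - theta) * cos(theta)).
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(3)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)    # points on the sphere

W = rng.standard_normal((500_000, 3))
mc = np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))

theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
closed = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)
print(mc, closed)   # the two values should agree closely
```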

Remarks

infinite neural nets were discovered by Neal, 1994, and then revisited many times [Le Roux, 2007, Cho and Saul, 2009].

the concept does not lead to more powerful kernel methods...


Basic principles of DKM: dot-product kernels

Another basic link between kernels and neural networks can be obtained by considering dot-product kernels.

A classical result

Let X = S^{d−1} be the unit sphere of R^d. The kernel K : X² → R,

$$K(x, y) = \kappa\big(\langle x, y \rangle_{\mathbb{R}^d}\big),$$

is positive definite if and only if κ is smooth and its Taylor expansion coefficients are non-negative.

Remark

the proposition holds if X is the unit sphere of some Hilbert space and 〈x, y〉 is replaced by the corresponding inner-product.


Basic principles of DKM: dot-product kernels

The Nyström method consists of replacing any point ϕ(x) in H, for x in X, by its orthogonal projection onto a finite-dimensional subspace

$$\mathcal{F} = \mathrm{span}\big(\varphi(z_1), \dots, \varphi(z_p)\big),$$

for some anchor points Z = [z1, . . . , zp] in R^{d×p}.

Figure: ϕ(x) and ϕ(x′) in the Hilbert space H are projected onto the finite-dimensional subspace F.

[Williams and Seeger, 2001, Smola and Schölkopf, 2000, Fine and Scheinberg, 2001].


Basic principles of DKM: dot-product kernels

The projection is equivalent to

$$\Pi_{\mathcal{F}}[x] := \sum_{j=1}^{p} \beta_j^\star \varphi(z_j) \quad \text{with} \quad \beta^\star \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \Big\| \varphi(x) - \sum_{j=1}^{p} \beta_j \varphi(z_j) \Big\|_{\mathcal{H}}^2.$$

Then, it is possible to show that with K(x, y) = κ(〈x, y〉),

$$K(x, y) \approx \langle \Pi_{\mathcal{F}}[x], \Pi_{\mathcal{F}}[y] \rangle_{\mathcal{H}} = \langle \psi(x), \psi(y) \rangle_{\mathbb{R}^p}, \quad \text{with} \quad \psi(x) = \kappa(Z^\top Z)^{-1/2}\, \kappa(Z^\top x),$$

where the function κ is applied pointwise to its arguments. The resulting ψ can be interpreted as a neural network performing (i) a linear operation, (ii) a pointwise non-linearity, (iii) a linear operation.
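A sketch of this approximation; κ(u) = e^{u−1} is an assumed valid choice (non-negative Taylor coefficients), with data and anchors on the sphere:

```python
# Nystrom feature map for a dot-product kernel: psi(x) = kappa(Z'Z)^{-1/2} kappa(Z'x),
# so that <psi(x), psi(y)> approximates kappa(<x, y>).
import numpy as np

def inv_sqrt(M, eps=1e-8):
    """Inverse square root of a symmetric p.s.d. matrix."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.maximum(vals, eps) ** -0.5) @ vecs.T

kappa = lambda u: np.exp(u - 1.0)   # assumed kernel function

rng = np.random.default_rng(0)
Z = rng.standard_normal((10, 64)); Z /= np.linalg.norm(Z, axis=0)   # anchor points
A = inv_sqrt(kappa(Z.T @ Z))
psi = lambda x: A @ kappa(Z.T @ x)

x, y = rng.standard_normal(10), rng.standard_normal(10)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)
print(psi(x) @ psi(y), kappa(x @ y))  # accuracy improves with better anchors
```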


Part III: Convolutional Kernel Networks


Convolutional kernel networks

The (happy?) marriage of kernel methods and CNNs

1. a multilayer convolutional kernel for images: a hierarchy of kernels for local image neighborhoods (aka receptive fields).

2. an unsupervised scheme for large-scale learning: the kernel being too computationally expensive, the Nyström approximation at each layer yields a new type of unsupervised deep neural network.

3. end-to-end learning: learning subspaces in the RKHSs can be achieved with a supervised loss function.

First proof of concept with unsupervised learning

J. Mairal, P. Koniusz, Z. Harchaoui and C. Schmid. Convolutional Kernel Networks. NIPS 2014.

The model of this presentation

J. Mairal. End-to-End Kernel Learning with Supervised Convolutional Kernel Networks. NIPS 2016.

This presentation follows the NIPS’16 paper.


Related work

proof of concept for combining kernels and deep learning [Cho and Saul, 2009];

hierarchical kernel descriptors [Bo et al., 2011];

other multilayer models [Bouvrie et al., 2009, Montavon et al.,2011, Anselmi et al., 2015];

deep Gaussian processes [Damianou and Lawrence, 2013].

multilayer PCA [Schölkopf et al., 1998].

old kernels for images [Schölkopf, 1997].

RBF networks [Broomhead and Lowe, 1988].


The multilayer convolutional kernel

Definition: image feature maps

An image feature map is a function I : Ω → H, where Ω is a 2D grid representing “coordinates” in the image and H is a Hilbert space.

Figure: starting from I0 : Ω0 → H0, a patch Pω1 in P0 is mapped by the kernel trick to I0.5(ω1) = ϕ1(Pω1) in H1, giving I0.5 : Ω0 → H1; linear pooling then yields I1 : Ω1 → H1 with I1(ω2) in H1.


Motivation and examples

Each point I(ω) carries information about an image neighborhood, which is motivated by the local stationarity of natural images.

We will construct a sequence of maps I0, . . . , Ik. Going up in the hierarchy yields larger receptive fields with more invariance.

I0 may simply be the input image, where H0 = R3 for RGB.

How do we go from I0 : Ω0 → H0 to I1 : Ω1 → H1?

First, define a p.d. kernel on patches of I0!


The multilayer convolutional kernel

Going from I0 to I0.5: kernel trick

Patches of size e0 × e0 can be defined as elements of the Cartesian product P0 := H0^{e0×e0}, endowed with its natural inner-product.

Define a p.d. kernel on such patches: for all x, x′ in P0,

$$K_1(x, x') = \|x\|_{\mathcal{P}_0}\, \|x'\|_{\mathcal{P}_0}\; \kappa_1\!\left( \frac{\langle x, x' \rangle_{\mathcal{P}_0}}{\|x\|_{\mathcal{P}_0}\, \|x'\|_{\mathcal{P}_0}} \right) \quad \text{if } x, x' \neq 0, \text{ and } 0 \text{ otherwise}.$$

Note that for y, y′ normalized, we may choose

$$\kappa_1\big(\langle y, y' \rangle_{\mathcal{P}_0}\big) = e^{\alpha_1 (\langle y, y' \rangle_{\mathcal{P}_0} - 1)} = e^{-\frac{\alpha_1}{2} \|y - y'\|_{\mathcal{P}_0}^2}.$$

We call H1 the RKHS and define a mapping ϕ1 : P0 → H1.

Then, we may define the map I0.5 : Ω0 → H1 that carries the representations in H1 of the patches from I0 at all locations in Ω0.


How do we go from I0.5 : Ω0 → H1 to I1 : Ω1 → H1?

Linear pooling!


The multilayer convolutional kernel

Going from I0.5 to I1: linear pooling

For all ω in Ω1,

$$I_1(\omega) = \sum_{\omega' \in \Omega_0} I_{0.5}(\omega')\, e^{-\beta_1 \|\omega' - \omega\|_2^2}.$$

The Gaussian weights can be interpreted as an anti-aliasing filter for downsampling the map I0.5 to a different resolution.

Linear pooling is compatible with the kernel interpretation: linear combinations of points in the RKHS are still points in the RKHS.

Finally,

We may now repeat the process and build I0, I1, . . . , Ik .

and obtain the multilayer convolutional kernel

$$K(I_k, I'_k) = \sum_{\omega \in \Omega_k} \langle I_k(\omega), I'_k(\omega) \rangle_{\mathcal{H}_k}.$$
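A small sketch of the pooling step on a finite-dimensional map (the array shapes and the stride are assumptions; in the kernel formulation the summed points live in H1):

```python
# Gaussian linear pooling: each output location omega aggregates I0.5(omega')
# with weights exp(-beta * ||omega' - omega||^2), then the grid is subsampled.
import numpy as np

def gaussian_pooling(M, beta=0.5, stride=2):
    """M: (h, w, p) feature map. Returns the pooled map on a coarser grid."""
    h, w, _ = M.shape
    ys, xs = np.arange(h), np.arange(w)
    rows = []
    for oy in range(0, h, stride):
        row = []
        for ox in range(0, w, stride):
            d2 = (ys[:, None] - oy) ** 2 + (xs[None, :] - ox) ** 2
            wgt = np.exp(-beta * d2)
            row.append((wgt[..., None] * M).sum(axis=(0, 1)))
        rows.append(row)
    return np.array(rows)

M = np.random.default_rng(0).standard_normal((8, 8, 4))   # toy map I0.5
print(gaussian_pooling(M).shape)                          # (4, 4, 4)
```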


The multilayer convolutional kernel

In summary

The multilayer convolutional kernel builds upon similar principles as a convolutional neural net (multiscale, local stationarity).

Invariance to local translations is achieved through linear pooling in the RKHS.

It remains a conceptual object due to its high complexity.

Learning and modelling are still decoupled.

Let us first address the second point (scalability).


Unsupervised learning for convolutional kernel networks

Learn finite-dimensional linear subspaces onto which we project the data:


Figure: The convolutional kernel network model between layers 0 and 1.


Unsupervised learning for convolutional kernel networks

Formally, this means using the Nyström approximation.

We now manipulate finite-dimensional maps Mj : Ωj → R^{p_j}.

Every linear subspace is parametrized by anchor points

$$\mathcal{F}_j := \mathrm{Span}\big(\varphi(z_{j,1}), \dots, \varphi(z_{j,p_j})\big),$$

where the z_{j,l}'s are in R^{p_{j−1} e_{j−1}^2} for patches of size e_{j−1} × e_{j−1}.

The encoding function at layer j is

$$\psi_j(x) := \|x\|\; \kappa_j(Z_j^\top Z_j)^{-1/2}\, \kappa_j\!\left(Z_j^\top \frac{x}{\|x\|}\right) \quad \text{if } x \neq 0, \text{ and } 0 \text{ otherwise},$$

where Z_j = [z_{j,1}, . . . , z_{j,p_j}] and ‖·‖ is the Euclidean norm.

The interpretation is convolution with the filters Zj, pointwise non-linearity, 1 × 1 convolution, and contrast normalization.


Unsupervised learning for convolutional kernel networks

The pooling operation keeps points in the linear subspace Fj, and pooling M0.5 : Ω0 → R^{p_1} is equivalent to pooling I0.5 : Ω0 → H1.


Unsupervised learning for convolutional kernel networks

How do we learn the filters with no supervision?

we learn one layer at a time, starting from the bottom one.

we extract a large number of patches (say, 100 000) from the maps at layer j − 1 computed on an image database, and normalize them;

we perform a spherical K-means algorithm to learn the filters Zj;

we compute the projection matrix κ_j(Z_j^⊤ Z_j)^{−1/2}.
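A minimal sketch of the spherical K-means step (one assumed variant: cosine-similarity assignments, with centroids re-normalized to the sphere at each iteration):

```python
# Spherical K-means: cluster unit-norm patches by cosine similarity; the
# resulting unit-norm centroids serve as the filters Z_j.
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """X: (n, d) rows with unit L2 norm. Returns a (d, k) matrix of centroids."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), k, replace=False)].T        # init from data points
    for _ in range(iters):
        assign = (X @ Z).argmax(axis=1)                  # nearest centroid by cosine
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                c = pts.sum(axis=0)
                Z[:, j] = c / np.linalg.norm(c)          # project back onto the sphere
    return Z

rng = np.random.default_rng(1)
patches = rng.standard_normal((1000, 27))                # toy 3x3x3 patches
patches /= np.linalg.norm(patches, axis=1, keepdims=True)
print(spherical_kmeans(patches, k=16).shape)             # (27, 16)
```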

Remarks

with kernels, we map patches to an infinite-dimensional space; with the projection, we manipulate finite-dimensional objects.

we obtain an unsupervised convolutional net with a geometric interpretation, where we perform projections in the RKHSs.


Unsupervised learning for convolutional kernel networks

Remark on input image pre-processing

Unsupervised CKNs are sensitive to pre-processing; we have tested

RAW RGB input;

local centering of every color channel;

local whitening of each color channel;

2D image gradients.

Figure: (a) RAW RGB input; (b) local centering.


Figure: (c) RAW RGB input; (d) local whitening.


Unsupervised learning for convolutional kernel networks

Remark on pre-processing with image gradients and 1× 1 patches

Every pixel/patch can be represented as a two-dimensional vector

$$x = \rho\, [\cos(\theta), \sin(\theta)],$$

where ρ = ‖x‖ is the gradient intensity and θ is the orientation.

A natural choice of filters Z would be

$$z_j = [\cos(\theta_j), \sin(\theta_j)] \quad \text{with} \quad \theta_j = 2j\pi/p_0.$$

Then, the vector

$$\psi(x) = \|x\|\, \kappa_1(Z^\top Z)^{-1/2}\, \kappa_1\!\left(Z^\top \frac{x}{\|x\|}\right)$$

can be interpreted as a “soft-binning” of the gradient orientation.

After pooling, the representation of this first layer is very close to SIFT/HOG descriptors [see Bo et al., 2011].
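A toy check of this interpretation (a sketch, assuming κ1(u) = e^{α1(u−1)}): ψ(x) should peak at the bin whose θj is closest to the gradient orientation, and scale with the intensity ρ:

```python
# Soft-binning of the gradient orientation with oriented filters z_j.
import numpy as np

p0, alpha = 8, 8.0
thetas = 2 * np.pi * np.arange(p0) / p0
Z = np.vstack([np.cos(thetas), np.sin(thetas)])     # 2 x p0 filters on the circle

kappa = lambda u: np.exp(alpha * (u - 1.0))
vals, vecs = np.linalg.eigh(kappa(Z.T @ Z))
A = vecs @ np.diag(np.maximum(vals, 1e-10) ** -0.5) @ vecs.T   # kappa(Z'Z)^{-1/2}

rho, theta = 2.0, 0.8
x = rho * np.array([np.cos(theta), np.sin(theta)])  # a gradient vector
psi = np.linalg.norm(x) * (A @ kappa(Z.T @ (x / np.linalg.norm(x))))
print(np.round(psi, 3))
print("strongest bin at angle", thetas[np.argmax(psi)])   # ~0.785, closest to 0.8
```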


Convolutional kernel networks with supervised learning

How do we learn the filters with supervision?

Given a kernel K and RKHS H, the ERM objective is

$$\min_{f \in \mathcal{H}} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))}_{\text{empirical risk, data fit}} + \underbrace{\frac{\lambda}{2} \|f\|_{\mathcal{H}}^2}_{\text{regularization}}.$$

here, we use the parametrized kernel

$$K_Z(I_0, I'_0) = \sum_{\omega \in \Omega_k} \langle M_k(\omega), M'_k(\omega) \rangle = \langle M_k, M'_k \rangle_{\mathrm{F}},$$

and we obtain the simple formulation

$$\min_{\mathbf{W} \in \mathbb{R}^{p_k \times |\Omega_k|}} \; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, \langle \mathbf{W}, M_k^i \rangle_{\mathrm{F}}\big) + \frac{\lambda}{2}\, \|\mathbf{W}\|_{\mathrm{F}}^2. \quad (1)$$


Convolutional kernel networks with supervised learning

How do we learn the filters with supervision?

we jointly optimize w.r.t. Z (set of filters) and W.

we alternate between the optimization of Z and of W;

for W, the problem is strongly convex and can be tackled with recent algorithms that are much faster than SGD;

for Z, we derive backpropagation rules and use classical tricks for learning CNNs (SGD + momentum).

The only tricky part is to differentiate κ_j(Z_j^⊤ Z_j)^{−1/2} w.r.t. Z_j, which is a non-standard operation in classical CNNs.


Convolutional kernel networks

In summary

a multilayer kernel for images, which builds upon similar principles as a convolutional neural net (multiscale, local stationarity).

A new type of convolutional neural network with a geometric interpretation: orthogonal projections in RKHSs.

Learning may be unsupervised: align subspaces with data.

Learning may be supervised: subspace learning in RKHSs.


Part IV: Applications


Image classification

Experiments were conducted on classical “deep learning” datasets, on CPUs, with no model averaging and no data augmentation.

Dataset    # classes   im. size   n_train   n_test
CIFAR-10   10          32 × 32    50 000    10 000
SVHN       10          32 × 32    604 388   26 032

Figure: error rates in percent, from the NIPS’16 paper.

Remarks on CIFAR-10

10% is the standard “good” result for a single model with no data augmentation.

the best unsupervised architecture has two layers, is wide (1024–16384 filters), and achieves 14.2%.


Image super-resolution

The task is to predict a high-resolution image y from a low-resolution one x. This may be formulated as a multivariate regression problem.

Figure: (a) low-resolution input x; (b) high-resolution target y.


Figure: (c) low-resolution input x; (d) bicubic interpolation.


Image super-resolution

Fact.   Dataset   Bicubic   SC      CNN     CSCN    SCKN
x2      Set5      33.66     35.78   36.66   36.93   37.07
x2      Set14     30.23     31.80   32.45   32.56   32.76
x2      Kodim     30.84     32.19   32.80   32.94   33.21
x3      Set5      30.39     31.90   32.75   33.10   33.08
x3      Set14     27.54     28.67   29.29   29.41   29.50
x3      Kodim     28.43     29.21   29.64   29.76   29.88

Table: Reconstruction accuracy for super-resolution in PSNR (the higher, the better). All CNN approaches are without data augmentation at test time.

Remarks

CNN is a “vanilla CNN” [Dong et al., 2016];

Very recent work does better with very deep CNNs and residual learning [Kim et al., 2016];

CSCN combines ideas from sparse coding and CNNs;

[Zeyde et al., 2010, Dong et al., 2016, Wang et al., 2015, Kim et al., 2016].


Image super-resolution

Figure: results for ×3 upscaling; from left to right: bicubic, sparse coding, CNN, SCKN (ours).

Figure: further side-by-side crops for ×3 upscaling, comparing bicubic interpolation with SCKN on several images.

References I

Fabio Anselmi, Lorenzo Rosasco, Cheston Tan, and Tomaso Poggio. Deep convolutional networks are hierarchical kernel machines. arXiv preprint arXiv:1508.01084, 2015.

L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In Proc. CVPR, 2011.

J. V. Bouvrie, L. Rosasco, and T. Poggio. On invariance in hierarchical models. In Adv. NIPS, 2009.

David S. Broomhead and David Lowe. Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, DTIC Document, 1988.

Y. Cho and L. K. Saul. Kernel methods for deep learning. In Adv. NIPS, 2009.

A. Damianou and N. Lawrence. Deep Gaussian processes. In Proc. AISTATS, 2013.


References II

C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE T. Pattern Anal., 38(2):295–307, 2016.

Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res., 2(Dec):243–264, 2001.

Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proc. CVPR, 2016.

Quoc Le, Tamas Sarlos, and Alexander Smola. Fastfood: computing Hilbert space expansions in loglinear time. In Proc. ICML, 2013.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. P. IEEE, 86(11):2278–2324, 1998.


References III

Grégoire Montavon, Mikio L. Braun, and Klaus-Robert Müller. Kernel analysis of deep networks. Journal of Machine Learning Research, 12(Sep):2563–2581, 2011.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Adv. NIPS, 2007.

B. Schölkopf. Support Vector Learning. PhD thesis, Technische Universität Berlin, 1997.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.


References IV

Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In Proc. ICML, 2000.

A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE T. Pattern Anal., 34(3):480–492, 2012.

Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang. Deep networks for image super-resolution with sparse prior. In Proc. ICCV, 2015.

C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Adv. NIPS, 2001.

R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In Curves and Surfaces, pages 711–730, 2010.
