
Lecture 3: Theory of Kernel Functions

Pavel Laskov, Blaine Nelson

Cognitive Systems Group, Wilhelm Schickard Institute for Computer Science, Universität Tübingen, Germany

Advanced Topics in Machine Learning, 2012

Part I

Introduction: Kernel Functions


Overview

In this lecture, we will formally define kernel functions

Recall the advantages of kernel-based learning:
1. Kernels allow for learning in high-dimensional feature spaces without explicit mapping into feature space
2. Kernels make learning in high-dimensional feature spaces computationally feasible
3. Kernel methods learn non-linear functions with the machinery of algorithms for learning linear functions
4. Kernels provide an abstraction that separates data representation & learning

Questions to be addressed:
1. What properties do kernels have & what properties does a function need to be a kernel?
2. How can we verify that a kernel function is valid?
3. How does one construct a kernel function?

Recall: Kernel Magic
Example 2: 2-dimensional Polynomials of Degree 3

Consider a (slightly modified) feature space for 2-dimensional polynomials of degree 3:

Φ(x) = [x_1^3, x_2^3, √3 x_1^2 x_2, √3 x_1 x_2^2, √3 x_1^2, √3 x_2^2, √6 x_1 x_2, √3 x_1, √3 x_2, 1]

Let us compute the inner product between two points in the feature space:

Φ(x)⊤Φ(y) = x_1^3 y_1^3 + x_2^3 y_2^3 + 3 x_1^2 x_2 y_1^2 y_2 + 3 x_1 x_2^2 y_1 y_2^2 + 3 x_1^2 y_1^2 + 3 x_2^2 y_2^2
            + 6 x_1 x_2 y_1 y_2 + 3 x_1 y_1 + 3 x_2 y_2 + 1
          = (x_1 y_1 + x_2 y_2 + 1)^3

Complexity: 3 multiplications instead of 10.
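As a quick numerical check (an added sketch, not part of the original slides; the names phi and kappa are ours), the explicit feature map and the closed-form kernel can be compared in NumPy:

```python
import numpy as np

def phi(x):
    """Explicit degree-3 feature map for 2-D inputs (10 features)."""
    x1, x2 = x
    s3, s6 = np.sqrt(3), np.sqrt(6)
    return np.array([x1**3, x2**3, s3*x1**2*x2, s3*x1*x2**2,
                     s3*x1**2, s3*x2**2, s6*x1*x2, s3*x1, s3*x2, 1.0])

def kappa(x, y):
    """Closed-form degree-3 polynomial kernel (x^T y + 1)^3."""
    return (np.dot(x, y) + 1.0) ** 3

x, y = np.random.randn(2), np.random.randn(2)
print(np.allclose(phi(x) @ phi(y), kappa(x, y)))  # True
```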


Kernel Questions

Which of the following functions are kernels?

κ_1(x, z) = Σ_{i=1}^D (x_i + z_i)          κ_2(x, z) = Π_{i=1}^D h((x_i − c)/a) h((z_i − c)/a)

κ_3(x, z) = − 〈x, z〉 / (‖x‖_2 ‖z‖_2)      κ_4(x, z) = √(‖x − z‖_2^2 + 1)

where h(x) = cos(1.75 x) exp(−x^2/2)

Part II

Linear Algebra Review


Vector Space

Definition 1

A set X is a vector space (over the reals) if it is closed under an addition operator '+' (i.e., ∀ x, z ∈ X: x + z ∈ X) & a scalar multiplication operator '·' (i.e., ∀ x ∈ X, a ∈ ℜ: a · x ∈ X), and these operators satisfy:

1. (Additive Associativity) u + (v + w) = (u + v) + w
2. (Additive Commutativity) u + v = v + u
3. (Additive Identity) ∃ 0 ∈ X s.t. ∀ u ∈ X: u + 0 = u
4. (Additive Inverse) ∀ u ∈ X ∃ −u ∈ X s.t. u + (−u) = 0
5. (Distributivity) a · (u + v) = a · u + a · v  &  (a + b) · u = a · u + b · u
6. (Multiplicative Associativity) a · (b · u) = (a · b) · u
7. (Multiplicative Identity) 1 · u = u

Example: For any D ∈ ℕ, ℜ^D is a vector space.

Vectors I

A D-dimensional vector x is a list of D reals in the vector space ℜ^D

Vectors {x_i}_{i=1}^N are linearly dependent if there exist c_1, c_2, . . . , c_N (at least one not 0) such that Σ_{i=1}^N c_i · x_i = 0; otherwise, they are linearly independent.

The inner product of x and z is defined as: x⊤z = Σ_{i=1}^D x_i · z_i

Non-trivial vectors {x_i}_{i=1}^N are orthogonal if for i ≠ j, x_i⊤x_j = 0. They are normal vectors if for all i, x_i⊤x_i = 1. They are orthonormal if both hold; i.e., ∀ i, j: 〈x_i, x_j〉 = δ_{i,j} = I[i == j]

A set of orthogonal vectors is linearly independent

Euclidean norm of a vector: ‖x‖_2 = √〈x, x〉

Angle between vectors: θ(u, v) = arccos( 〈u, v〉 / (‖u‖ ‖v‖) )

Projection onto vector z: proj_z(x) = (〈z, x〉 / 〈z, z〉) z

Vectors II

Vectors {x_i}_{i=1}^N span X if for every x ∈ X, x = Σ_{i=1}^N α_i x_i

{x_i}_{i=1}^N is a basis for X if it spans X & is linearly independent

The dimension of X is the number of elements in any basis of X

Every vector in X can be represented as its projection onto an orthonormal basis {x_i}_{i=1}^D of X:

z = Σ_{i=1}^D proj_{x_i}(z) = Σ_{i=1}^D 〈x_i, z〉 · x_i

This is a Fourier decomposition of z in that basis.
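To make the Fourier decomposition concrete, here is a small NumPy illustration (ours, not from the slides) that rebuilds a vector from its coefficients in a random orthonormal basis:

```python
import numpy as np

D = 4
# Build a random orthonormal basis of R^D via QR decomposition
Q, _ = np.linalg.qr(np.random.randn(D, D))
basis = [Q[:, i] for i in range(D)]

z = np.random.randn(D)
# Fourier decomposition: z = sum_i <x_i, z> x_i
z_reconstructed = sum(np.dot(x_i, z) * x_i for x_i in basis)
print(np.allclose(z, z_reconstructed))  # True
```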

Matrices

Matrices are an N × D grid of reals:

A = [ A_{1,1}  A_{1,2}  . . .  A_{1,D}
      A_{2,1}  A_{2,2}  . . .  A_{2,D}
        ...      ...    . . .    ...
      A_{N,1}  A_{N,2}  . . .  A_{N,D} ]

Its rows are A_{1,•}⊤, A_{2,•}⊤, . . . , A_{N,•}⊤ and its columns are A_{•,1}, A_{•,2}, . . . , A_{•,D}.

Matrix-Vector Multiplication (x is a D-vector & z is an N-vector):

Ax = [ A_{1,•}⊤x, A_{2,•}⊤x, . . . , A_{N,•}⊤x ]⊤          z⊤A = [ z⊤A_{•,1}, z⊤A_{•,2}, . . . , z⊤A_{•,D} ]

Matrix-Matrix Multiplication: [AB]_{i,j} = A_{i,•}⊤ B_{•,j}, i.e., the i-th row of A times the j-th column of B.

Matrices

Matrix multiplication as summations:

[Ax]_i = Σ_ℓ A_{i,ℓ} x_ℓ          [z⊤A]_j = Σ_k z_k A_{k,j}

z⊤Ax = Σ_{k,ℓ} z_k A_{k,ℓ} x_ℓ          [AB]_{i,k} = Σ_j A_{i,j} B_{j,k}

Special forms of matrices:

Lower Triangular: all entries above the diagonal are 0
Diagonal Matrix: all entries off the diagonal are 0
Upper Triangular: all entries below the diagonal are 0

Basic Linear Algebra

Suppose matrix A is an N × D matrix

A square matrix has the same number of rows & columns; i.e., N = D

The identity matrix I_N is an N × N diagonal matrix of 1's

For any A, A I_D = A and I_N A = A

The transpose of A is denoted by A⊤ (it is D × N)

A symmetric matrix is its own transpose: A = A⊤

The inverse of A is denoted by A^{-1}: A A^{-1} = I & A^{-1} A = I

An orthonormal or unitary matrix has its transpose as its inverse: A A⊤ = A⊤ A = I . . . both the columns & rows of A form a basis

The rank of matrix A is the maximum number of columns of A that are linearly independent (i.e., the dimension of its column space). A is full-rank if rank(A) = min(N, D)

Singularity

Definition 2

A matrix A is singular if there exists some x ≠ 0 such that Ax = 0; otherwise, A is nonsingular.

Theorem 3

The following are equivalent:

Matrix A is invertible

Matrix A is nonsingular

Matrix A is full-rank

The spectrum of A does not contain 0; i.e., 0 ∉ eig(A)

If A is square, the (linear) function f(x) = Ax is one-to-one & onto, f(x) = b has exactly one solution, and f(x) = 0 has only the solution x = 0

Eigenvalues & Eigenvectors

Given an N × N matrix A, an eigenvector of A is a non-trivial vector v that satisfies

Av = λv ;

the corresponding value λ is an eigenvalue

The Rayleigh quotient is defined by

λ = (v⊤Av) / (v⊤v)

In fact, the maximum eigenvalue/eigenvector pair of A is a solution to

max_{‖x‖=1} (x⊤Ax) / (x⊤x)

with x restricted to have norm 1
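A brief NumPy illustration (added here, not from the slides) of the Rayleigh quotient and the maximizing eigenvector:

```python
import numpy as np

# A random symmetric matrix
M = np.random.randn(5, 5)
A = (M + M.T) / 2

eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
v = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

rayleigh = (v @ A @ v) / (v @ v)
print(np.isclose(rayleigh, eigvals[-1]))  # True: the Rayleigh quotient recovers lambda_max

# The largest eigenvalue upper-bounds x^T A x over unit vectors x
x = np.random.randn(5); x /= np.linalg.norm(x)
print(x @ A @ x <= eigvals[-1] + 1e-12)   # True
```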

Eigen-Decomposition & Deflation

Deflation: for any eigenvalue/eigenvector pair (λ, v) of A, the transform

A ← A − λvv⊤

deflates the matrix; i.e., v is still an eigenvector of A but now has eigenvalue 0

A symmetric matrix has N orthonormal eigenvectors {v_i} corresponding to N eigenvalues—its spectrum, eig(A):

λ_1(A) ≥ λ_2(A) ≥ . . . ≥ λ_N(A)

The eigenvectors/eigenvalues form an orthonormal matrix V = [v_1 v_2 . . . v_N] & a diagonal matrix Λ = diag(λ_1(A), λ_2(A), . . . , λ_N(A)),

which form the eigen-decomposition of A: A = VΛV⊤

The Spectral Theorem

Theorem 4

If A is a symmetric N × N real-valued matrix, it can be written as

A = Σ_{i=1}^N λ_i(A) v_i v_i⊤

where (λ_i, v_i) are eigenvalue/eigenvector pairs of A. This is called the spectral decomposition of A

For a matrix with rank K < N, the spectral decomposition of A only has K non-zero summands

Matrix Functions

Properties of a diagonal matrix D with entries D_{i,i}:

For k = 0, 1, 2, . . .: D^k is diagonal with entries [D^k]_{i,i} = (D_{i,i})^k

If there is no i with D_{i,i} = 0, then D^{-1} exists, is diagonal, & [D^{-1}]_{i,i} = (D_{i,i})^{-1}

√D is diagonal with entries [√D]_{i,i} = √(D_{i,i})

Functions of A are defined by its eigen-decomposition A = VΛV⊤ & the fact that V⊤V = I:

For k = 0, 1, 2, . . .: A^k = VΛ^kV⊤

If A is non-singular, then A^{-1} = VΛ^{-1}V⊤

√A = V√ΛV⊤ (Note: this satisfies √A√A = A)

exp(A) = V exp(Λ)V⊤

log(A) = V log(Λ)V⊤
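A small NumPy sketch (illustrative, and assuming a symmetric PSD matrix so that √Λ is real) of computing matrix functions through the eigen-decomposition:

```python
import numpy as np

B = np.random.randn(4, 4)
A = B @ B.T                      # symmetric PSD, so sqrt(Lambda) is real

lam, V = np.linalg.eigh(A)       # eigen-decomposition: A = V diag(lam) V^T

A_sq   = V @ np.diag(lam**2) @ V.T          # A^2   = V Lambda^2     V^T
sqrt_A = V @ np.diag(np.sqrt(lam)) @ V.T    # sqrt(A) = V sqrt(Lambda) V^T

print(np.allclose(A_sq, A @ A))             # True
print(np.allclose(sqrt_A @ sqrt_A, A))      # True
```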

Part III

Positive (Semi-)Definiteness


Positive (Semi-)Definite Matrices

Definition 5 (Positive Semi-Definite Matrix)

Matrix A is positive semi-definite (PSD) if all its eigenvalues are non-negative (∀ i: λ_i(A) ≥ 0); i.e., for all x:

x⊤Ax ≥ 0

from the Rayleigh quotient. We use A ⪰ 0 to denote that A is PSD

Definition 6 (Positive Definite Matrix)

Matrix A is positive definite if all its eigenvalues are positive (∀ i: λ_i(A) > 0); i.e., A is PSD &

x⊤Ax = 0 ⇔ x = 0

We denote this as A ≻ 0.

PSD Matrices

Proposition 7

Matrix A is PSD iff there exists a real matrix B such that A = B⊤B

Proof
Case ⇐: Suppose A = B⊤B; then for any x,

x⊤Ax = x⊤B⊤Bx = ‖Bx‖^2 ≥ 0

Case ⇒: If A ⪰ 0, then its eigen-decomposition (A = VΛV⊤) has only non-negative eigenvalues and thus √Λ is a real-valued matrix. Thus, let B = √ΛV⊤ and we have

B⊤B = V√Λ√ΛV⊤ = VΛV⊤ = A
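As an added illustration (not from the slides), the factorization used in the proof can be reproduced directly with NumPy:

```python
import numpy as np

M = np.random.randn(4, 4)
A = M @ M.T                             # a symmetric PSD matrix by construction

lam, V = np.linalg.eigh(A)
print(np.all(lam >= -1e-10))            # True: all eigenvalues are (numerically) non-negative

B = np.diag(np.sqrt(np.clip(lam, 0, None))) @ V.T   # B = sqrt(Lambda) V^T as in the proof
print(np.allclose(B.T @ B, A))          # True: A = B^T B
```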

Part IV

Reproducing Kernel Hilbert Spaces


Inner Product Space

Definition 8

An inner product space X is a vector space with an associated inner product 〈·, ·〉 : X × X → ℜ that satisfies:

1. (Symmetry) 〈x, z〉 = 〈z, x〉
2. (Linearity) 〈a · x, z〉 = a · 〈x, z〉 & 〈w + x, z〉 = 〈w, z〉 + 〈x, z〉
3. (Positive Semi-Definiteness) 〈x, x〉 ≥ 0

The inner product space is strict if 〈x, x〉 = 0 ⇔ x = 0

A strict inner product space X has a natural norm given by ‖x‖_2 = √〈x, x〉. The associated metric is d(x, z) = ‖x − z‖_2

The space ℜ^D has the inner product 〈x, z〉 = x⊤z, which yields the Euclidean norm:

‖x‖_2^2 = Σ_{i=1}^D x_i^2

Hilbert Space

Definition 9

A strict inner product space F is a Hilbert space if it is

1. Complete: Every (Cauchy) sequence {h_i ∈ F}_{i=1}^∞ such that

   lim_{n→∞} sup_{m>n} ‖h_n − h_m‖ = 0

   converges to an element h ∈ F; i.e., h_i → h

   (Technical condition required for potentially infinite-dimensional sets)

2. Separable: There is a countable subset {h_i ∈ F}_{i=1}^∞ such that for all h ∈ F and ε > 0, there exists an h_i in this subset such that

   ‖h_i − h‖ < ε

   (Condition required to establish Hilbert space isomorphisms)

Hilbert Space Examples: the interval [0, 1], the reals ℜ, the complex numbers C, & the Euclidean spaces ℜ^D for D ∈ ℕ.

Function Spaces

What is a vector? An ordered list of D elements from ℜ indexed by the index set I_D = {1, 2, . . . , D}. The set of all such lists is ℜ^D

We can extend this notion to countable sequences x = (x_1, x_2, . . .) by using the index set I = ℕ

The inner product generalizes naturally as

〈x, y〉 = Σ_{i∈ℕ} x_i · y_i

However, we need additional restrictions to make such spaces well-behaved: the subspace ℓ_2, for which 〈x, x〉 < ∞ for all x, is a Hilbert space

Further, a function f : X → ℜ maps each x ∈ X to exactly one y ∈ ℜ; i.e., it is also a vector with an uncountable index set (e.g., I = ℜ^D)

The inner product again generalizes naturally as

〈f, g〉 = ∫_X f(x) g(x) dx

The subspace L_2(X), defined on X, a compact subspace of ℜ^D, for which 〈f, f〉 < ∞ for all f ∈ L_2(X), is a Hilbert space

Properties of (Separable) Hilbert Spaces

Hilbert space F is isomorphic to H if there is a one-to-one linear mapping T : F → H such that for all x, z ∈ F

〈T(x), T(z)〉_H = 〈x, z〉_F

Every separable Hilbert space (A) of dimension D is isomorphic to ℜ^D and (B) of infinite dimension is isomorphic to ℓ_2

Since Hilbert space F is isomorphic to ℜ^D or ℓ_2, F has an orthonormal basis {φ_i} & every element x ∈ F has a Fourier decomposition:

x = Σ_i 〈φ_i, x〉_F · φ_i

Part V

Characterizing Kernel Functions


Kernel Terminology

Definition 10

A kernel is a two-argument real-valued function over X × X (κ : X × X → ℜ) such that for any x, z ∈ X

κ(x, z) = 〈φ(x), φ(z)〉_F     (1)

for some inner-product space F such that ∀ x ∈ X: φ(x) ∈ F

Kernel functions must be symmetric since inner products are symmetric

To show that κ is a valid kernel, it is sufficient to show that a mapping φ exists that yields Eq. 1. However, such a mapping is generally difficult to construct.

In the rest of this lecture, we will demonstrate additional ways to construct & validate kernels

Kernel Matrices

Definition 11

A kernel matrix (or Gram matrix) K is the matrix that results from applying κ to all pairs of datapoints in the set {x_i}_{i=1}^N:

K = [ κ(x_1, x_1)  κ(x_1, x_2)  . . .  κ(x_1, x_N)
      κ(x_2, x_1)  κ(x_2, x_2)  . . .  κ(x_2, x_N)
          ...          ...      . . .      ...
      κ(x_N, x_1)  κ(x_N, x_2)  . . .  κ(x_N, x_N) ]

that is, K_{i,j} = κ(x_i, x_j)

Kernel matrices are square & symmetric
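For concreteness, here is an added sketch (not from the slides; the helper gram_matrix is ours) that assembles a Gram matrix and checks its positive semi-definiteness numerically:

```python
import numpy as np

def gram_matrix(kernel, X):
    """Kernel matrix K with K[i, j] = kernel(x_i, x_j) for rows x_i of X."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.randn(6, 3)
K = gram_matrix(lambda x, z: (x @ z + 1) ** 3, X)   # degree-3 polynomial kernel

print(np.allclose(K, K.T))                          # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-8)         # (numerically) PSD
```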

Kernel Matrices are PSD

Proposition 12

Kernel matrices, which are constructed from a kernel corresponding to a strict inner product space F, are PSD.

Proof
By definition of a kernel matrix, for all i, j ∈ 1, . . . , N,

K_{i,j} = κ(x_i, x_j) = 〈φ(x_i), φ(x_j)〉_F

Thus, for any v ∈ ℜ^N:

v⊤Kv = Σ_{i,j} v_i K_{i,j} v_j = Σ_{i,j} v_i 〈φ(x_i), φ(x_j)〉_F v_j
     = 〈 Σ_i v_i φ(x_i), Σ_j v_j φ(x_j) 〉_F
     = ‖ Σ_i v_i φ(x_i) ‖_F^2 ≥ 0

Reproducing Kernel Function

Definition 13 (Reproducing Kernel Function (Aronszajn, 1950) [1])

Suppose F is a Hilbert space of functions over X; the function κ : X × X → ℜ is a reproducing kernel of F if

1. For every x ∈ X, the function f_x(·) = κ(·, x) is in F.
2. Reproducing Property: for every z ∈ X and every f ∈ F,

   f(z) = 〈f, κ(·, z)〉_F

Further, the space is called a Reproducing Kernel Hilbert Space (RKHS)

By the 1st property & closure of F, for any α_i ∈ ℜ and x_i ∈ X, we have Σ_{i=1}^N α_i · κ(·, x_i) ∈ F

Applying f_x from the 1st property to the 2nd property, for any x, z ∈ X, we have

κ(x, z) = 〈κ(·, x), κ(·, z)〉_F

Kernel Functions

Definition 14 (Finitely Positive Semi-definite)

A function κ : X × X → ℜ is finitely positive semi-definite (FPSD) if

It is symmetric; i.e., ∀ x, z ∈ X κ (x, z) = κ (z, x)

The matrix K formed by applying κ to any finite subset of X is positive semi-definite: K ⪰ 0


Kernel Functions I

Theorem 15

κ : X × X → ℜ (either continuous or with a countable domain) is FPSD iff ∃ a Hilbert space F with feature map φ : X → F s.t.

κ(x, z) = 〈φ(x), φ(z)〉

Proof
Case ⇐: Follows from Proposition 12.
Case ⇒: Suppose κ is FPSD & we construct a Hilbert space F_κ with κ as its reproducing kernel; i.e., F_κ is the closure of the functions f_x(·) = κ(·, x). Thus, for any α_i, x_i, the function g(·) = Σ_i α_i κ(·, x_i) is in F_κ &, by the reproducing property,

〈g, g〉 = Σ_{i,j} α_i α_j κ(x_i, x_j) = α⊤Kα

where K is the kernel matrix of {x_i}, & thus α⊤Kα ≥ 0 since K ⪰ 0.

Kernel Functions II

(Completeness) Follows from the Cauchy–Schwarz inequality, but is beyond the scope of this course.

(Separability) Separability follows from κ being continuous or having a countable domain, but is not shown here.

Finally, the mapping φ is specified by κ as φ(x) = κ(·, x) ∈ F_κ.

Note that the inner product defined above is strict since if ‖f‖ = 0, then for all x, |f(x)| = |〈f, κ(·, x)〉| ≤ ‖f‖ ‖φ(x)‖ = 0

Part VI

Kernel Constructions


Simple Kernels

Clearly, the linear kernel defined by

κ_lin(x, z) = 〈x, z〉 = x⊤z

is a valid kernel function since it is an inner product in X

For any D × D matrix B ⪰ 0,

κ_B(x, z) = 〈x |B| z〉 = x⊤Bz

is a valid kernel function

Closure Properties of Kernels I

Proposition 16

Suppose κ_1 & κ_2 are kernels on X, a > 0, f : X → ℜ, φ : X → ℜ^M, & κ_3 is a kernel on ℜ^M. Then these are all kernel functions on X:

16.1 κ(x, z) = κ_1(x, z) + κ_2(x, z)
16.2 κ(x, z) = a · κ_1(x, z)
16.3 κ(x, z) = κ_1(x, z) · κ_2(x, z)
16.4 κ(x, z) = f(x) f(z)
16.5 κ(x, z) = κ_3(φ(x), φ(z))

Closure Properties of Kernels II

Proof
Let K_1 & K_2 be the kernel matrices of κ_1 & κ_2 applied to any set {x_i}_{i=1}^N—both of these matrices are PSD. Also let α be any N-vector:

(Part 1): K = K_1 + K_2 ⇒ α⊤Kα = α⊤K_1α + α⊤K_2α ≥ 0

(Part 2): K = aK_1 ⇒ α⊤Kα = a · α⊤K_1α ≥ 0

(Part 3): Take the spectral decompositions K_1 = Σ_{i=1}^N λ_i v_i v_i⊤ and K_2 = Σ_{i=1}^N γ_i w_i w_i⊤. Their element-wise (Hadamard) product, K = K_1 ⊙ K_2, can then be written as K = Σ_{i,j=1}^N λ_i γ_j (v_i ⊙ w_j)(v_i ⊙ w_j)⊤; i.e., a summation of rank-1 matrices with non-negative coefficients ⇒ PSD.

(Part 4): κ(x, z) = 〈ψ(x), ψ(z)〉 where ψ : x ↦ f(x); thus, κ is PSD.

(Part 5): Since κ_3 is a kernel, applying it to any set of vectors {φ(x_i)}_{i=1}^N yields a PSD matrix.
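A short NumPy check (added for illustration) that sums, positive scalings, and element-wise products of kernel matrices stay PSD:

```python
import numpy as np

X = np.random.randn(8, 3)
K1 = X @ X.T                                   # linear kernel matrix
K2 = (X @ X.T + 1.0) ** 2                      # degree-2 polynomial kernel matrix

min_eig = lambda K: np.linalg.eigvalsh(K).min()
print(min_eig(K1 + K2)  >= -1e-8)              # sum of kernels: PSD
print(min_eig(3.0 * K1) >= -1e-8)              # positive scaling: PSD
print(min_eig(K1 * K2)  >= -1e-8)              # element-wise (Hadamard) product: PSD
```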

Closure Properties of Kernels III

The feature spaces for these kernels are as follows:

For kernel κ_1(x, z) + κ_2(x, z), the new feature map is equivalent to stacking the feature maps of κ_1 & κ_2:

φ(x) = [ φ_1(x) ; φ_2(x) ]

For kernel a · κ_1(x, z), its feature space is scaled by √a

For kernel κ_1(x, z) · κ_2(x, z), if φ_1 has dimension N_1 and φ_2 has dimension N_2, then φ has N_1 N_2 features given by

[φ(x)]_{ij} = [φ_1(x)]_i · [φ_2(x)]_j

It follows that the features of κ_1(x, z)^d are all monomials of the form

[φ_1(x)]_1^{d_1} [φ_1(x)]_2^{d_2} . . . [φ_1(x)]_N^{d_N}   with   Σ_i d_i = d

Additional Kernel Functions

Proposition 17

Suppose κ_1 is a kernel on X & p : ℜ → ℜ is a polynomial with non-negative coefficients. Then, the following are kernels:

1. κ(x, z) = p(κ_1(x, z))

2. κ(x, z) = exp(κ_1(x, z))

3. Gaussian or RBF kernel: κ(x, z) = exp( −‖x − z‖_2^2 / (2σ^2) )

Proof
(Part 1) Constructing a polynomial kernel from the base kernel κ_1 proceeds directly from Propositions 16.1, 16.2, & 16.3
(Part 2) Consider that exp(x) = 1 + x + x^2/2 + . . . + x^i/i! + . . .. Thus, it is a limit of polynomials & the PSD property is closed under pointwise limits.
(Part 3) Left as an exercise.

Common Kernel Functions

Linear Kernel: κ_lin(x, z) = x⊤z

Polynomial Kernel: κ_poly(x, z) = (x⊤z + R)^d

RBF Kernel: κ_rbf(x, z) = exp( −‖x − z‖_2^2 / (2σ^2) )
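A minimal sketch of these three kernels in NumPy (the function and parameter names R, d, sigma are ours, chosen to mirror the formulas above):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, R=1.0, d=3):
    return (x @ z + R) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.random.randn(5), np.random.randn(5)
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```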

Kernel Questions

Which of the following functions are kernels?

κ_1(x, z) = Σ_{i=1}^D (x_i + z_i)          κ_2(x, z) = Π_{i=1}^D h((x_i − c)/a) h((z_i − c)/a)

κ_3(x, z) = − 〈x, z〉 / (‖x‖_2 ‖z‖_2)      κ_4(x, z) = √(‖x − z‖_2^2 + 1)

where h(x) = cos(1.75 x) exp(−x^2/2)

κ_1 is not a kernel. Consider x_1 = [1 0]⊤ & x_2 = [0 2]⊤. Their kernel matrix [[2, 3], [3, 4]] has eigenvalues 3 ± √10, one of which is negative.

κ_2 is a kernel because it can be written as the product f(x) f(z) where f(x) = Π_{i=1}^D h((x_i − c)/a)

κ_3 is not a kernel because it is the negation of a valid non-trivial kernel & thus will have negative eigenvalues

κ_4 is not a kernel. Consider x_1 = [1 0]⊤ & x_2 = [0 1]⊤. Again, their kernel matrix has a negative eigenvalue
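These claims are easy to probe numerically. The following added sketch (with the arbitrary choices c = 0 and a = 1 for κ_2) evaluates each candidate on random points and reports the smallest Gram-matrix eigenvalue:

```python
import numpy as np

h = lambda x: np.cos(1.75 * x) * np.exp(-x**2 / 2)

def k1(x, z): return np.sum(x + z)
def k2(x, z): return np.prod(h(x)) * np.prod(h(z))       # c = 0, a = 1
def k3(x, z): return -(x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))
def k4(x, z): return np.sqrt(np.sum((x - z) ** 2) + 1)

X = np.random.randn(10, 2)
for name, k in [("k1", k1), ("k2", k2), ("k3", k3), ("k4", k4)]:
    K = np.array([[k(u, v) for v in X] for u in X])
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())
# k2 always yields a PSD matrix; the others typically show a negative eigenvalue.
```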

Part VII

Transforming Kernel Matrices


Operations on Kernel Matrices
Simple Transformations

Adding a non-negative constant to the kernel matrix corresponds to adding a new constant feature to each training example; i.e., given the matrix Φ of features such that K = ΦΦ⊤,

[ Φ  c1 ] [ Φ  c1 ]⊤ = K + c^2 11⊤

Adding a non-negative constant to its diagonal corresponds to adding an indicator feature for every data point; i.e., augmenting the i-th feature vector φ(x_i) with N extra coordinates that are c in position i and 0 elsewhere gives

K + c^2 I
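A quick NumPy sanity check (ours) of this feature-augmentation view:

```python
import numpy as np

Phi = np.random.randn(6, 4)                      # explicit feature matrix, K = Phi Phi^T
K = Phi @ Phi.T
c = 0.5
ones = np.ones((6, 1))

Phi_aug = np.hstack([Phi, c * ones])             # append a constant feature c
print(np.allclose(Phi_aug @ Phi_aug.T, K + c**2 * ones @ ones.T))   # True

Phi_ind = np.hstack([Phi, c * np.eye(6)])        # append an indicator feature per point
print(np.allclose(Phi_ind @ Phi_ind.T, K + c**2 * np.eye(6)))       # True
```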

Operations on Kernel Matrices
Centering Data

Suppose we want to translate the origin to the data's center of mass. . .

[Figure: data points x_1, . . . , x_12 and their center of mass]

As we will see next lecture, this transformation can be expressed as a kernel transform

K ← K − (1/N) 11⊤K − (1/N) K11⊤ + (1⊤K1 / N^2) 11⊤
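A minimal NumPy sketch (ours) of the centering transform, verified against explicitly centering the features:

```python
import numpy as np

Phi = np.random.randn(12, 3)                 # explicit features, K = Phi Phi^T
K = Phi @ Phi.T
N = K.shape[0]
one = np.ones((N, N)) / N                    # the matrix (1/N) 1 1^T

K_centered = K - one @ K - K @ one + one @ K @ one

Phi_c = Phi - Phi.mean(axis=0)               # features with the mean subtracted
print(np.allclose(K_centered, Phi_c @ Phi_c.T))   # True
```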

Operations on Kernel Matrices
Normalizing Data

Suppose we want to project all data to be norm 1; i.e., ‖x‖ = 1. . .

[Figure: data points x_1, . . . , x_12 projected onto the unit sphere]

This transformation can be achieved using only the information from the kernel matrix:

κ(x, z) ← κ(x, z) / √( κ(x, x) κ(z, z) )
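And a corresponding NumPy sketch (ours) of normalizing a kernel matrix:

```python
import numpy as np

X = np.random.randn(12, 3)
K = (X @ X.T + 1.0) ** 2                      # any valid kernel matrix

d = np.sqrt(np.diag(K))                       # sqrt(kappa(x_i, x_i))
K_norm = K / np.outer(d, d)                   # kappa(x,z) / sqrt(kappa(x,x) kappa(z,z))

print(np.allclose(np.diag(K_norm), 1.0))      # every point now has unit norm in feature space
```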

Summary

We explored a formal framework for kernels

We saw a formal definition for kernel functions & matrices

We saw the properties that kernels must exhibit and how those properties can be used to validate kernel functions & construct new kernels from existing kernels

We explored some operations that allow us to manipulate data in feature space

Next Lecture: we will see basic kernel-based learning algorithms

We will explore how to take the mean of data in feature space & use that to construct a novelty detection algorithm

We will explore how to project data in feature space & use that for a basic subspace algorithm

Bibliography I

The majority of the material from this talk can be found in the lecture's accompanying book, "Kernel Methods for Pattern Analysis."

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.