
Dot-product kernels

Nuno Vasconcelos, ECE Department, UCSD

Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world

Perceptron: classifier implements the linear decision rule
$$h(x) = \mathrm{sgn}[g(x)] \quad \text{with} \quad g(x) = w^T x + b$$
appropriate when the classes are linearly separable
to deal with non-linear separability we introduce a kernel

Kernel summary
1. D not linearly separable in X: apply a feature transformation Φ: X → Z, such that dim(Z) >> dim(X)
2. computing Φ(x) is too expensive:
• write your learning algorithm in dot-product form
• instead of Φ(xi), we only need Φ(xi)^TΦ(xj) ∀ij
3. instead of computing Φ(xi)^TΦ(xj) ∀ij, define the “dot-product kernel”
$$K(x,z) = \Phi(x)^T \Phi(z)$$
and compute K(xi,xj) ∀ij directly
• note: the matrix
$$K = \begin{bmatrix} & \vdots & \\ \cdots & K(x_i,x_j) & \cdots \\ & \vdots & \end{bmatrix}$$
is called the “kernel” or Gram matrix
4. forget about Φ(x) and use K(x,z) from the start!
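Step 3 can be sketched in a few lines of Python. This is only an illustration: the helper names `dot` and `gram_matrix` and the three sample points are assumptions made for the example, not part of the lecture.

```python
def dot(x, z):
    """Standard dot product x^T z of two equal-length sequences."""
    return sum(xi * zi for xi, zi in zip(x, z))


def gram_matrix(kernel, points):
    """Return the Gram matrix K with K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]


# Illustrative data: three points in R^2 and the kernel K(x,z) = (x^T z)^2.
points = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
K = gram_matrix(lambda x, z: dot(x, z) ** 2, points)

for row in K:
    print(row)
```

The feature map Φ never appears: only kernel evaluations on pairs of points are needed.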

Polynomial kernels
this makes a significant difference when K(x,z) is easier to compute than Φ(x)^TΦ(z)
e.g., we have seen that
$$K(x,z) = (x^T z)^2 = \Phi(x)^T \Phi(z)$$
with
$$\Phi: \Re^d \to \Re^{d^2}, \qquad (x_1, \ldots, x_d)^T \to (x_1 x_1, x_1 x_2, \ldots, x_1 x_d, \ldots, x_d x_1, \ldots, x_d x_d)^T$$
while K(x,z) has complexity O(d), Φ(x)^TΦ(z) is O(d^2)
for K(x,z) = (x^Tz)^k we go from O(d) to O(d^k)
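The identity between the kernel and the explicit feature map can be checked numerically. The sketch below assumes the feature map that lists all d^2 products x_i x_j; the function names and the test vectors are illustrative only.

```python
def dot(x, z):
    """Standard dot product x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))


def phi(x):
    """Explicit feature map for K(x,z) = (x^T z)^2: all d^2 products x_i x_j."""
    return [xi * xj for xi in x for xj in x]


def poly2_kernel(x, z):
    """K(x,z) = (x^T z)^2, computed with O(d) work."""
    return dot(x, z) ** 2


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]

lhs = poly2_kernel(x, z)      # one dot product in R^d
rhs = dot(phi(x), phi(z))     # one dot product in R^(d^2)
print(lhs, rhs)
```

Both routes give the same number, but the second one builds and multiplies vectors of length d^2.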

Question
what is a good dot-product kernel?
• intuitively, a good kernel is one that maximizes the margin γ in range space
• however, nobody knows how to do this effectively
in practice:
• pick a kernel from a library of known kernels
• we talked about
• the linear kernel K(x,z) = x^Tz
• the Gaussian family
$$K(x,z) = e^{-\frac{\|x-z\|^2}{\sigma}}$$
• the polynomial family
$$K(x,z) = (1 + x^T z)^k, \quad k \in \{1, 2, \ldots\}$$
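A minimal sketch of such a kernel library follows. The function names, default parameter values, and sample points are assumptions for illustration; the Gaussian is written with σ (not σ^2) in the denominator, matching the form used on these slides.

```python
import math


def linear_kernel(x, z):
    """K(x,z) = x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))


def gaussian_kernel(x, z, sigma=1.0):
    """K(x,z) = exp(-||x - z||^2 / sigma)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / sigma)


def polynomial_kernel(x, z, k=2):
    """K(x,z) = (1 + x^T z)^k, k in {1, 2, ...}."""
    return (1.0 + linear_kernel(x, z)) ** k


x, z = (1.0, 2.0), (3.0, -1.0)
print(linear_kernel(x, z), gaussian_kernel(x, z), polynomial_kernel(x, z))
```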

Question
“this problem of mine is really asking for the kernel k’(x,z) = ...”
• how do I know if this is a dot-product kernel?
let’s start with the definition
Definition: a mapping
k: X x X → ℜ
(x,y) → k(x,y)
is a dot-product kernel if and only if
k(x,y) = <Φ(x),Φ(y)>
where Φ: X → H, H is a vector space, and <.,.> is a dot-product in H
[figure: points x1, x2, x3, ..., xn in X, marked x and o]

Vector spaces
note that both H and <.,.> can be abstract, not necessarily ℜ^d
Definition: a vector space is a set H where
• addition and scalar multiplication are defined and satisfy:
1) x+(x’+x’’) = (x+x’)+x’’
2) x+x’ = x’+x ∈ H
3) 0 ∈ H, 0 + x = x
4) -x ∈ H, -x + x = 0
5) λx ∈ H
6) 1x = x
7) λ(λ’x) = (λλ’)x
8) λ(x+x’) = λx + λx’
9) (λ+λ’)x = λx + λ’x
the canonical example is ℜ^d with standard vector addition and scalar multiplication
another example is the space of mappings X → ℜ with
(f+g)(x) = f(x) + g(x), (λf)(x) = λf(x)

Bilinear forms
to define the dot-product we first need to recall the notion of a bilinear form
Definition: a bilinear form on a vector space H is a mapping
Q: H x H → ℜ
(x,x’) → Q(x,x’)
such that ∀ x,x’,x’’ ∈ H
i) Q[(λx+λ’x’),x’’] = λQ(x,x’’) + λ’Q(x’,x’’)
ii) Q[x’’,(λx+λ’x’)] = λQ(x’’,x) + λ’Q(x’’,x’)
in ℜ^d the canonical bilinear form is Q(x,x’) = x^TAx’
if Q(x,x’) = Q(x’,x) ∀ x,x’ ∈ H, the form is symmetric

Dot products
Definition: a dot-product on a vector space H is a symmetric bilinear form
<.,.>: H x H → ℜ
(x,x’) → <x,x’>
such that
i) <x,x> ≥ 0, ∀ x ∈ H
ii) <x,x> = 0 if and only if x = 0
note that for the canonical bilinear form in ℜ^d
<x,x> = x^TAx
this means that A must be positive definite
x^TAx > 0, ∀ x ≠ 0

Positive definite matrices
recall that (e.g. Linear Algebra and Its Applications, Strang)
Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (semi) positive definite:
i) x^TAx ≥ 0, ∀ x ≠ 0
ii) all eigenvalues of A satisfy λi ≥ 0
iii) all upper-left submatrices Ak have non-negative determinant (in the semidefinite case this is necessary, but sufficiency requires the condition on all principal submatrices, not only the upper-left ones)
iv) there is a matrix R such that A = R^TR (R has independent columns when A is strictly positive definite)
upper-left submatrices:
$$A_1 = a_{1,1}, \quad A_2 = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}, \quad A_3 = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}, \quad \ldots$$

Positive definite matrices
property iv) is particularly interesting
• in ℜ^d, <x,y> = x^TAy is a dot-product kernel if and only if A is positive definite
• from iv) this holds if and only if there is R such that A = R^TR
• hence
<x,y> = x^TAy = (Rx)^T(Ry) = Φ(x)^TΦ(y)
with
Φ: ℜ^d → ℜ^d
x → Rx
i.e. the dot-product kernel
k(x,z) = x^TAz, (A positive definite)
is the standard dot-product in the range space of the mapping Φ(x) = Rx
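One way to obtain such an R for a strictly positive definite A is a Cholesky factorization. The sketch below is an illustration under that assumption: the 2x2 matrix A, the vectors, and the helper names are made up for the example, and the hand-written factorization is not claimed to be the construction used in the lecture.

```python
import math


def cholesky_upper(A):
    """Return upper-triangular R with A = R^T R (A strictly positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    # R is the transpose of the lower-triangular Cholesky factor L.
    return [[L[j][i] for j in range(n)] for i in range(n)]


def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


def dot(x, z):
    """Standard dot product x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))


A = [[4.0, 2.0], [2.0, 3.0]]
R = cholesky_upper(A)
x, z = [1.0, -2.0], [0.5, 3.0]

k_direct = dot(x, matvec(A, z))               # x^T A z
k_feature = dot(matvec(R, x), matvec(R, z))   # (Rx)^T (Rz) = Phi(x)^T Phi(z)
print(k_direct, k_feature)
```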

Note
there are positive semidefinite matrices
x^TAx ≥ 0
and positive definite matrices
x^TAx > 0
we will work with semidefinite but, to simplify, will call them definite
if we really need > 0 we will say “strictly positive definite”

Positive definite kernels
how do we define a positive definite function?
Definition: a function k(x,y) is a positive definite kernel on X x X if ∀ l and ∀ {x1, ..., xl}, xi ∈ X, the Gram matrix
$$K = \begin{bmatrix} & \vdots & \\ \cdots & k(x_i,x_j) & \cdots \\ & \vdots & \end{bmatrix}$$
is positive definite.
Note: this implies that
• k(x,x) ≥ 0 ∀ x ∈ X
• the matrix
$$\begin{bmatrix} k(x,x) & k(x,y) \\ k(y,x) & k(y,y) \end{bmatrix} \qquad (*)$$
is PD ∀ x,y ∈ X
• etc...

Positive definite kernels
this proves some simple properties
• a PD kernel is symmetric
$$k(x,y) = k(y,x), \quad \forall x,y \in X$$
Proof: since PD means symmetric, (*) implies k(x,y) = k(y,x) ∀ x,y ∈ X
• Cauchy-Schwarz inequality for kernels: if k(x,y) is a PD kernel, then
$$k(x,y)^2 \le k(x,x)\,k(y,y), \quad \forall x,y \in X$$
Proof: from (*), and property iii) of PD matrices, the determinant of the 2x2 matrix of (*) is non-negative. This means that
k(x,x)k(y,y) - k(x,y)^2 ≥ 0
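Both properties can be spot-checked on random points. The sketch below uses the polynomial kernel with k = 2 as the PD kernel; the sample size, the seed, and the small relative tolerance for rounding are assumptions of the example.

```python
import random


def polynomial_kernel(x, z, k=2):
    """Polynomial kernel K(x,z) = (1 + x^T z)^k."""
    return (1.0 + sum(xi * zi for xi, zi in zip(x, z))) ** k


random.seed(0)
points = [[random.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(20)]

symmetric = True
cauchy_schwarz = True
for x in points:
    for y in points:
        kxy = polynomial_kernel(x, y)
        if kxy != polynomial_kernel(y, x):
            symmetric = False
        # k(x,y)^2 <= k(x,x) k(y,y), with a small relative tolerance for rounding
        bound = polynomial_kernel(x, x) * polynomial_kernel(y, y)
        if kxy ** 2 > bound * (1.0 + 1e-12):
            cauchy_schwarz = False

print(symmetric, cauchy_schwarz)
```

A check like this can only falsify the properties on the sampled pairs; the proof above is what establishes them for all x, y.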

Positive definite kernels
it is not hard to show that all dot product kernels are PD
Lemma 1: Let k(x,y) be a dot-product kernel. Then k(x,y) is positive definite
proof:
• k(x,y) dot product kernel implies that
• ∃ Φ and some dot product <.,.> such that
k(x,y) = <Φ(x),Φ(y)>
• this implies that if:
• we pick any l, and any sequence {x1, ..., xl},
• and let K be the associated Gram matrix
$$K = \begin{bmatrix} & \vdots & \\ \cdots & k(x_i,x_j) & \cdots \\ & \vdots & \end{bmatrix}$$
• then, for ∀ c ≠ 0
$$\begin{aligned} c^T K c &= \sum_{ij} c_i c_j k(x_i,x_j) \\ &= \sum_{ij} c_i c_j \langle \Phi(x_i), \Phi(x_j) \rangle && \text{(k is dot product)} \\ &= \Big\langle \sum_i c_i \Phi(x_i), \sum_j c_j \Phi(x_j) \Big\rangle && \text{(<.,.> is a bilinear form)} \\ &= \Big\| \sum_i c_i \Phi(x_i) \Big\|^2 \ge 0 && \text{(from def of dot product)} \end{aligned}$$
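The chain of equalities in this proof can be traced numerically for a kernel whose feature map is known explicitly. The sketch assumes K(x,z) = (x^T z)^2 with the map to all pairwise products; the random points, coefficients, and seed are illustrative.

```python
import random


def dot(x, z):
    """Standard dot product x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))


def phi(x):
    """Explicit feature map of K(x,z) = (x^T z)^2: all products x_i x_j."""
    return [xi * xj for xi in x for xj in x]


random.seed(1)
n, d = 5, 3
points = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
c = [random.gauss(0.0, 1.0) for _ in range(n)]

# Left-hand side: c^T K c with K[i][j] = k(x_i, x_j).
K = [[dot(xi, xj) ** 2 for xj in points] for xi in points]
quad_form = sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))

# Right-hand side: squared norm of v = sum_i c_i Phi(x_i).
features = [phi(x) for x in points]
v = [sum(c[i] * features[i][t] for i in range(n)) for t in range(d * d)]
norm_sq = dot(v, v)

print(quad_form, norm_sq)
```

The two numbers agree up to rounding, and the second is a squared norm, so it cannot be negative.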

Positive definite kernels
the converse is also true but more difficult to prove
Lemma 2: Let k(x,y), x,y ∈ X, be a positive definite kernel. Then k(x,y) is a dot product kernel
proof:
• we need to show that there is a transformation Φ, a vector space H = Φ(X), and a dot product <.,.>* in H such that
k(x,y) = <Φ(x),Φ(y)>*
• we proceed in three steps
1. construct a vector space H
2. define the dot-product <.,.>* on H
3. show that k(x,y) = <Φ(x),Φ(y)>* holds
[figure: points x1, x2, x3, ..., xn in X, marked x and o]

The vector space H
we define H as the space spanned by linear combinations of k(.,xi)
$$H = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i k(., x_i), \ \forall m, \ \forall x_i \in X \right\}$$
notation: by k(.,xi) we mean the function g(y) = k(y,xi) of y, with xi fixed.
homework: check that H is a vector space
• e.g. 2)
$$f(.) = \sum_{i=1}^{m} \alpha_i k(., x_i), \quad f'(.) = \sum_{i=1}^{m'} \alpha'_i k(., x'_i) \ \Rightarrow \ f(.) + f'(.) = f'(.) + f(.) \in H$$

Example
when we use the Gaussian kernel
$$K(., x_i) = e^{-\frac{\|. - x_i\|^2}{\sigma}}$$
k(.,xi) is a Gaussian centered on xi with covariance σI, and
$$H = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i e^{-\frac{\|. - x_i\|^2}{\sigma}}, \ \forall m, \ \forall x_i \right\}$$
is the space of all linear combinations of Gaussians
note that these are not mixtures, but close

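The operator <.,.>*
if f(.) and g(.) ∈ H, with
$$f(.) = \sum_{i=1}^{m} \alpha_i k(., x_i), \qquad g(.) = \sum_{j=1}^{m'} \beta_j k(., x'_j) \qquad (**)$$
we define the operator <.,.>* as
$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j) \qquad (***)$$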

Example
when we use the Gaussian kernel
$$K(., x_i) = e^{-\frac{\|. - x_i\|^2}{\sigma}}$$
the operator <.,.>* is a weighted sum of Gaussian terms
$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j e^{-\frac{\|x_i - x'_j\|^2}{\sigma}}$$
you can look at this as either:
• a dot product in H (still need to prove this)
• a non-linear measure of similarity in X, somewhat related to likelihoods
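The operator is straightforward to compute once a function in H is stored as its coefficients and centers. The sketch below assumes that representation (two parallel lists) and a Gaussian kernel with σ = 1; the names `inner_star`, `f_alphas`, `g_betas` and the particular functions f and g are illustrative choices.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """K(x,z) = exp(-||x - z||^2 / sigma)."""
    return math.exp(-sum((xi - zi) ** 2 for xi, zi in zip(x, z)) / sigma)


def inner_star(alphas, centers_f, betas, centers_g, kernel):
    """<f,g>* = sum_i sum_j alpha_i beta_j k(x_i, x'_j)."""
    return sum(
        a * b * kernel(xi, xj)
        for a, xi in zip(alphas, centers_f)
        for b, xj in zip(betas, centers_g)
    )


# f = 2 k(., 0) - k(., 1)   and   g = 0.5 k(., 1), on the real line.
f_alphas, f_centers = [2.0, -1.0], [(0.0,), (1.0,)]
g_betas, g_centers = [0.5], [(1.0,)]

fg = inner_star(f_alphas, f_centers, g_betas, g_centers, gaussian_kernel)
gf = inner_star(g_betas, g_centers, f_alphas, f_centers, gaussian_kernel)
ff = inner_star(f_alphas, f_centers, f_alphas, f_centers, gaussian_kernel)
print(fg, gf, ff)
```

Here <f,g>* works out to e^{-1} - 0.5, the value is the same with the arguments swapped, and <f,f>* is positive.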

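The operator <.,.>*
important note: for f(.) and g(.) ∈ H, the operator <.,.>*
$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$$
has the property
$$\langle k(., x_i), k(., x'_j) \rangle_* = k(x_i, x'_j) \qquad (****)$$
proof: just make
$$\alpha_i = 1, \ \alpha_k = 0 \ \forall k \ne i, \qquad \beta_j = 1, \ \beta_k = 0 \ \forall k \ne j$$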

The operator <.,.>*
assume that <.,.>* is a dot product in H (proof in moments)
since
$$\langle k(., x_i), k(., x_j) \rangle_* = k(x_i, x_j)$$
then, clearly
$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_*$$
with
Φ: X → H
x → k(.,x)
i.e. the kernel is a dot-product on H, which results from the feature transformation Φ
this proves Lemma 2

Example
when we use the Gaussian kernel
$$K(x, x_i) = e^{-\frac{\|x - x_i\|^2}{\sigma}}$$
• the point xi ∈ ℜ^d is mapped into the Gaussian G(x,xi,σI)
• H is the space of all functions that are linear combinations of Gaussians
• this has infinite dimension
• the kernel is a dot product in H, and a non-linear similarity on X
[figure: mapping from ℜ^d to H]

In summary
to show that k(x,y), x,y ∈ X, positive definite ⇒ k(x,y) is a dot product kernel
we need to show that
$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$$
is a dot product on
$$H = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i k(., x_i), \ \forall m, \ \forall x_i \in X \right\}$$
this reduces to verifying the dot product conditions

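The operator <.,.>*
1) is <.,.>* a bilinear form on H ?
by definition of f(.) and g(.) in (**)
$$\langle f, g \rangle_* = \Big\langle \sum_{i=1}^{m} \alpha_i k(., x_i), \sum_{j=1}^{m'} \beta_j k(., x'_j) \Big\rangle_*$$
on the other hand,
$$\begin{aligned} \langle f, g \rangle_* &= \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j) && \text{from (***)} \\ &= \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j \langle k(., x_i), k(., x'_j) \rangle_* && \text{from (****)} \end{aligned}$$
equality of the two left hand sides is the definition of bilinearity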

The operator <.,.>*
2) is <.,.>* symmetric?
note that
$$\langle g, f \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x'_j, x_i) = \langle f, g \rangle_*$$
if and only if k(xi,xj’) = k(xj’,xi) for all xi, xj’.
but this follows from the positive definiteness of k(x,y)
we have seen that a PD kernel is always symmetric
hence, <.,.>* is symmetric

The operator <.,.>*
3) is <f,f>* ≥ 0, ∀ f ∈ H ?
by definition of f(.) in (**)
$$\langle f, f \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) = \alpha^T K \alpha$$
where α ∈ ℜ^m and K is the Gram matrix
since k(x,y) is positive definite, K is positive definite by definition and <f,f>* ≥ 0
the only non-trivial part of the proof is to show that
<f,f>* = 0 ⇒ f = 0
we need two more results

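The operator <.,.>*
Lemma 3: <.,.>* is itself a positive definite kernel on H x H
proof:
• consider any sequence {f1, ..., fm}, fi ∈ H
• then, for any coefficients γ1, ..., γm
$$\begin{aligned} \sum_{ij} \gamma_i \gamma_j \langle f_i, f_j \rangle_* &= \Big\langle \sum_i \gamma_i f_i, \sum_j \gamma_j f_j \Big\rangle_* && \text{(by bilinearity of <.,.>*)} \\ &= \langle g, g \rangle_* && \text{(for some g ∈ H)} \\ &\ge 0 && \text{(by 3))} \end{aligned}$$
• hence the Gram matrix is always PD and the kernel <.,.>* is PD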

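The operator <.,.>*
Lemma 4: ∀ f ∈ H, <k(.,x),f(.)>* = f(x)
proof:
$$\begin{aligned} \langle k(., x), f(.) \rangle_* &= \Big\langle k(., x), \sum_i \alpha_i k(., x_i) \Big\rangle_* && \text{(by (**))} \\ &= \sum_i \alpha_i \langle k(., x), k(., x_i) \rangle_* && \text{(by bilinearity of <.,.>*)} \\ &= \sum_i \alpha_i k(x, x_i) = f(x) && \text{(by (****))} \end{aligned}$$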

The operator <.,.>*
4) we are now ready to prove that <f,f>* = 0 ⇒ f = 0
proof:
• since <.,.>* is a PD kernel (lemma 3) we can apply Cauchy-Schwarz
$$k(x,y)^2 \le k(x,x)\,k(y,y), \quad \forall x,y \in X$$
• using k(.,x) as x and f(.) as y this becomes
$$\langle k(., x), k(., x) \rangle_* \, \langle f, f \rangle_* \ge \langle k(., x), f \rangle_*^2$$
• and using lemma 4
$$k(x,x) \, \langle f, f \rangle_* \ge f(x)^2$$
• from which <f,f>* = 0 ⇒ f(x) = 0 ∀ x, i.e. f = 0
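Lemma 4 and the bound used in step 4 can both be checked on a concrete f. The sketch assumes a Gaussian kernel with σ = 1 and stores f as coefficients and centers; the helper names, the three centers, and the evaluation point x are illustrative.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """K(x,z) = exp(-||x - z||^2 / sigma)."""
    return math.exp(-sum((xi - zi) ** 2 for xi, zi in zip(x, z)) / sigma)


def evaluate(alphas, centers, x, kernel):
    """f(x) = sum_i alpha_i k(x, x_i)."""
    return sum(a * kernel(x, c) for a, c in zip(alphas, centers))


def inner_star(alphas, centers_f, betas, centers_g, kernel):
    """<f,g>* = sum_i sum_j alpha_i beta_j k(x_i, x'_j)."""
    return sum(
        a * b * kernel(xi, xj)
        for a, xi in zip(alphas, centers_f)
        for b, xj in zip(betas, centers_g)
    )


alphas = [2.0, -1.0, 0.5]
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
x = (0.5, 0.5)

f_x = evaluate(alphas, centers, x, gaussian_kernel)
# k(., x) is the element of H with a single coefficient 1 centered on x.
reproduced = inner_star([1.0], [x], alphas, centers, gaussian_kernel)
f_norm_sq = inner_star(alphas, centers, alphas, centers, gaussian_kernel)
# Step 4: f(x)^2 <= k(x,x) <f,f>*
bound_holds = f_x ** 2 <= gaussian_kernel(x, x) * f_norm_sq

print(f_x, reproduced, f_norm_sq, bound_holds)
```

Taking the <.,.>* product of f with k(., x) returns the value f(x), and that value is controlled by <f,f>*, which is why <f,f>* = 0 forces f to vanish everywhere.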

In summary
we have shown that
$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$$
is a dot product on
$$H = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i k(., x_i), \ \forall m, \ \forall x_i \in X \right\}$$
and this shows that if k(x,y), x,y ∈ X, is a positive definite kernel, then k(x,y) is a dot product kernel.
since we had initially proven the converse, we have the following theorem.

Dot product kernels
Theorem: k(x,y), x,y ∈ X, is a dot-product kernel if and only if it is a positive definite kernel
this is interesting because it allows us to check whether a kernel is a dot product or not!
• check if the Gram matrix is positive definite for all possible sequences {x1, ..., xl}, xi ∈ X
but the proof is much more interesting than this result alone
it actually gives us insight on what the kernel is doing
let’s summarize
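In practice the check can only be run on finitely many samples. The sketch below tests one Gram matrix by attempting a Cholesky factorization, which succeeds exactly when the matrix is strictly positive definite (a stronger condition than the semidefinite convention used in these slides). The sample points, the tolerance, and the squared-distance function used as a counterexample are assumptions of the example.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """K(x,z) = exp(-||x - z||^2 / sigma)."""
    return math.exp(-sum((xi - zi) ** 2 for xi, zi in zip(x, z)) / sigma)


def squared_distance(x, z):
    """Candidate 'kernel' ||x - z||^2, which is not positive definite."""
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z))


def gram_matrix(kernel, points):
    """Gram matrix K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]


def is_strictly_pd(K, tol=1e-12):
    """Attempt a Cholesky factorization; fail as soon as a pivot is <= tol."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                pivot = K[i][i] - s
                if pivot <= tol:
                    return False
                L[i][i] = math.sqrt(pivot)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return True


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
gaussian_ok = is_strictly_pd(gram_matrix(gaussian_kernel, points))
sqdist_ok = is_strictly_pd(gram_matrix(squared_distance, points))
print(gaussian_ok, sqdist_ok)
```

A failure on any sample disproves positive definiteness of the kernel; a pass on one sample is only evidence, since the theorem quantifies over all sequences.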

Dot product kernels
a dot product kernel k(x,y), x,y ∈ X:
• applies a feature transformation
Φ: X → H
x → k(.,x)
• to the vector space
$$H = \left\{ f(.) \;\middle|\; f(.) = \sum_{i=1}^{m} \alpha_i k(., x_i), \ \forall m, \ \forall x_i \in X \right\}$$
• where the kernel implements the dot product
$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$$

Dot product kernels
the dot product
$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$$
has the reproducing property
$$\langle k(., x), f(.) \rangle_* = f(x)$$
you can think of this as analogous to the convolution with a Dirac delta
we will talk about this a lot in the coming lectures
finally, <.,.>* is itself a positive definite kernel on H x H

A good picture to remember
when we use the Gaussian kernel
$$K(x, x_i) = e^{-\frac{\|x - x_i\|^2}{\sigma}}$$
• the point xi ∈ ℜ^d is mapped into the Gaussian G(x,xi,σI)
• H is the space of all functions that are linear combinations of Gaussians
• the kernel is a dot product in H
• the dot product with one of the Gaussians has the reproducing property
[figure: mapping from ℜ^d to H]
