Lecture 3: Dependence measures using RKHS embeddings
MLSS Tübingen, 2015
Arthur Gretton, Gatsby Unit, CSML, UCL


Page 1:

Lecture 3: Dependence measures using RKHS embeddings

MLSS Tübingen, 2015

Arthur Gretton

Gatsby Unit, CSML, UCL

Page 2:

Outline

• Three or more variable interactions, comparison with conditional dependence testing [Sejdinovic et al., 2013a]

• Dependence detection in detail, covariance operators

• Bayesian inference without models, comparison with approximate Bayesian computation (ABC) [Fukumizu et al., 2013]

• Recent work (2014/2015) (not in this talk, see my webpage)
  – Testing for time series [Chwialkowski and Gretton, 2014; Chwialkowski et al., 2014]
  – Nonparametric adaptive expectation propagation [Jitkrittum et al., 2015]
  – Infinite dimensional exponential families [Sriperumbudur et al., 2014]
  – Adaptive MCMC, and adaptive Hamiltonian Monte Carlo [Sejdinovic et al., 2014; Strathmann et al., 2015]

Page 3:

Lancaster (3-way) Interactions

Pages 4–6:

Detecting a higher order interaction

• How to detect V-structures with pairwise weak individual dependence?

[Graphical model: X → Z ← Y]

• X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z

[Scatter plots: X vs Y, Y vs Z, X vs Z, XY vs Z]

• X, Y i.i.d. ∼ N(0, 1)

• Z | X, Y ∼ sign(XY) Exp(1/√2)

Faithfulness is violated here.

Pages 7–8:

V-structure Discovery

[Graphical model: X → Z ← Y]

Assume X ⊥⊥ Y has been established. The V-structure can then be detected by:

• Consistent CI test: H0: X ⊥⊥ Y | Z [Fukumizu et al., 2008, Zhang et al., 2011], or

• Factorisation test: H0: (X,Y) ⊥⊥ Z ∨ (X,Z) ⊥⊥ Y ∨ (Y,Z) ⊥⊥ X (multiple standard two-variable tests)
  – compute p-values for each of the marginal tests for (Y,Z) ⊥⊥ X, (X,Z) ⊥⊥ Y, and (X,Y) ⊥⊥ Z
  – apply the Holm–Bonferroni (HB) sequentially rejective correction (Holm, 1979)
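
The HB correction itself is easy to implement; the following is a minimal sketch (the helper name and the example p-values are hypothetical, and the decision rule shown, rejecting the factorisation hypothesis only when all three marginal nulls are rejected, is one natural reading of the disjunctive null above):

```python
import numpy as np

def holm_bonferroni_reject(p_values, alpha=0.05):
    """Holm-Bonferroni sequentially rejective correction (Holm, 1979).
    Returns a boolean array, True where the corresponding null is rejected."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                 # test hypotheses from the smallest p-value upward
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (m - rank):  # k-th smallest p compared against alpha/(m-k+1)
            reject[idx] = True
        else:
            break                         # first failure: this and all larger p-values are accepted
    return reject

# factorisation test: reject the factorisation null only if all three
# marginal-independence nulls are rejected after the correction
p_marginal = [0.003, 0.012, 0.040]        # hypothetical p-values for (Y,Z)⊥⊥X, (X,Z)⊥⊥Y, (X,Y)⊥⊥Z
print(holm_bonferroni_reject(p_marginal).all())
```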

Page 9:

V-structure Discovery (2)

• How to detect V-structures with pairwise weak (or nonexistent) dependence?

• X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z

[Scatter plots: X1 vs Y1, Y1 vs Z1, X1 vs Z1, X1·Y1 vs Z1; graphical model X → Z ← Y]

• X1, Y1 i.i.d. ∼ N(0, 1)

• Z1 | X1, Y1 ∼ sign(X1 Y1) Exp(1/√2)

• X2:p, Y2:p, Z2:p i.i.d. ∼ N(0, I_{p−1})

Faithfulness is violated here.

Page 10:

V-structure Discovery (3)

[Plot: null acceptance rate (Type II error) vs. dimension (1–19), Dataset A; curves for the CI test X ⊥⊥ Y | Z and the two-variable factorisation test]

Figure 1: CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test with a HB correction, n = 500

Pages 11–16:

Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] The interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

• D = 2 : ∆LP = PXY − PXPY

• D = 3 : ∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ

[Diagram: each term of ∆LP shown as a graphical model over X, Y, Z]

Case of PX ⊥⊥ PY Z: ∆LP = 0.

(X,Y) ⊥⊥ Z ∨ (X,Z) ⊥⊥ Y ∨ (Y,Z) ⊥⊥ X ⇒ ∆LP = 0.

...so what might be missed?

∆LP = 0 ⇏ (X,Y) ⊥⊥ Z ∨ (X,Z) ⊥⊥ Y ∨ (Y,Z) ⊥⊥ X

Example:

P(0, 0, 0) = 0.2   P(0, 0, 1) = 0.1   P(1, 0, 0) = 0.1   P(1, 0, 1) = 0.1
P(0, 1, 0) = 0.1   P(0, 1, 1) = 0.1   P(1, 1, 0) = 0.1   P(1, 1, 1) = 0.2

Page 17:

A Test using Lancaster Measure

• Test statistic is the empirical estimate of ‖µκ(∆LP)‖²_Hκ, where κ = k ⊗ l ⊗ m:

  ‖µκ(PXYZ − PXY PZ − · · · )‖²_Hκ = 〈µκPXYZ, µκPXYZ〉_Hκ − 2〈µκPXYZ, µκPXY PZ〉_Hκ + · · ·

Pages 18–19:

Inner Product Estimators

Table 1: V-statistic estimators of 〈µκν, µκν′〉_Hκ

ν \ ν′      PXYZ               PXY PZ            PXZ PY           PYZ PX           PX PY PZ
PXYZ        (K◦L◦M)++          ((K◦L)M)++        ((K◦M)L)++       ((M◦L)K)++       tr(K+ ◦ L+ ◦ M+)
PXY PZ                         (K◦L)++ M++       (MKL)++          (KLM)++          (KL)++ M++
PXZ PY                                           (K◦M)++ L++      (KML)++          (KM)++ L++
PYZ PX                                                            (L◦M)++ K++      (LM)++ K++
PX PY PZ                                                                           K++ L++ M++

Combining these terms, the test statistic becomes

‖µκ(∆LP)‖²_Hκ = (1/n²) (HKH ◦ HLH ◦ HMH)++ ,

the empirical joint central moment in the feature space.
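
The final expression needs only the three Gram matrices; a minimal numerical sketch (the function names are illustrative, not from the slides):

```python
import numpy as np

def center_gram(K):
    """Doubly centre a Gram matrix: HKH with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def lancaster_statistic(K, L, M):
    """Empirical ||mu_kappa(Delta_L P)||^2_{H_kappa} = (1/n^2) (HKH o HLH o HMH)_{++},
    where 'o' is the entrywise product and (.)_{++} sums all matrix entries."""
    n = K.shape[0]
    return (center_gram(K) * center_gram(L) * center_gram(M)).sum() / n**2
```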

Page 20:

Example A: factorisation tests

[Plot: null acceptance rate (Type II error) vs. dimension (1–19), Dataset A; curves for the CI test X ⊥⊥ Y | Z, the Lancaster factorisation test ∆L, and the two-variable factorisation test]

Figure 2: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500

Page 21:

Example B: Joint dependence can be easier to detect

• X1, Y1 i.i.d. ∼ N(0, 1)

• Z1 =
    X1² + ε,    w.p. 1/3,
    Y1² + ε,    w.p. 1/3,
    X1Y1 + ε,   w.p. 1/3,
  where ε ∼ N(0, 0.1²).

• X2:p, Y2:p, Z2:p i.i.d. ∼ N(0, I_{p−1})

• dependence of Z on the pair (X, Y) is stronger than on X and Y individually

• Satisfies faithfulness

Page 22:

Example B: factorisation tests

[Plot: null acceptance rate (Type II error) vs. dimension (1–19), Dataset B; curves for the CI test X ⊥⊥ Y | Z, the Lancaster factorisation test ∆L, and the two-variable factorisation test]

Figure 3: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500

Pages 23–25:

Interaction for D ≥ 4

• Interaction measure valid for all D (Streitberg, 1990):

  ∆SP = Σ_π (−1)^{|π|−1} (|π| − 1)! JπP

  – For a partition π, Jπ associates to the joint the corresponding factorisation, e.g., J_{13|2|4}P = P_{X1X3} P_{X2} P_{X4}.

[Plot: number of partitions of {1, . . . , D} vs. D (1–25), growing as the Bell numbers]

joint central moments (Lancaster interaction)
vs.
joint cumulants (Streitberg interaction)
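
To get a feel for why D ≥ 4 is costly, note that the number of partitions π summed over in ∆SP is the Bell number B_D; a small sketch of its growth (the Bell-triangle recursion used here is standard combinatorics, not from the slides):

```python
def bell_numbers(n_max):
    """Bell numbers B_0 .. B_{n_max-1} via the Bell triangle; B_D is the number of
    partitions of a set of D elements, i.e. the number of terms in the Streitberg sum."""
    bells, row = [], [1]
    for _ in range(n_max):
        bells.append(row[0])
        new_row = [row[-1]]              # next row starts with the previous row's last entry
        for v in row:
            new_row.append(new_row[-1] + v)
        row = new_row
    return bells

print(bell_numbers(10))  # [1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147]
```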

Pages 26–27:

Total independence test

• Total independence test:

  H0 : PXYZ = PX PY PZ   vs.   H1 : PXYZ ≠ PX PY PZ

• For (X1, . . . , XD) ∼ PX, and κ = ⊗_{i=1}^D k^{(i)}, with ∆totP := PX − ∏_{i=1}^D PXi:

  ‖µκ(∆totP)‖² = (1/n²) Σ_{a=1}^n Σ_{b=1}^n ∏_{i=1}^D K^{(i)}_{ab}
                 − (2/n^{D+1}) Σ_{a=1}^n ∏_{i=1}^D Σ_{b=1}^n K^{(i)}_{ab}
                 + (1/n^{2D}) ∏_{i=1}^D Σ_{a=1}^n Σ_{b=1}^n K^{(i)}_{ab}.

  (a numerical sketch follows this slide)

• Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions: similar relationship to that between dCov and HSIC (Sejdinovic et al., 2013)
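
The statistic above only needs the per-variable Gram matrices; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def total_independence_statistic(kernel_matrices):
    """V-statistic estimate of ||mu_kappa(P_X - prod_i P_Xi)||^2 for kappa the tensor
    product of per-variable kernels; `kernel_matrices` is a list of D (n x n) Gram
    matrices K^(i)."""
    n = kernel_matrices[0].shape[0]
    D = len(kernel_matrices)
    prod_K = np.ones((n, n))
    for K in kernel_matrices:
        prod_K = prod_K * K                                        # entrywise product over variables
    term1 = prod_K.sum() / n**2
    row_sums = np.stack([K.sum(axis=1) for K in kernel_matrices])  # (D, n): sum_b K^(i)_{ab}
    term2 = 2.0 / n**(D + 1) * np.prod(row_sums, axis=0).sum()
    term3 = np.prod([K.sum() for K in kernel_matrices]) / n**(2 * D)
    return term1 - term2 + term3
```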

Page 28:

Example B: total independence tests

[Plot: null acceptance rate (Type II error) vs. dimension (1–19), Dataset B; curves for the ∆tot and ∆L total independence tests]

Figure 4: Total independence: ∆totP vs. ∆LP, n = 500

Page 29:

Kernel dependence measures - in detail

Page 30:

MMD for independence: HSIC

[Illustration: two matched text excerpts describing dog breeds, used as the paired samples; text from dogtime.com and petfinder.com]

Empirical HSIC(PXY, PX PY):

(1/n²) (HKH ◦ HLH)++
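
The empirical statistic is two centred Gram matrices multiplied entrywise and summed; a minimal sketch, assuming a Gaussian kernel with a user-chosen bandwidth (both helper names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_gram(x, sigma):
    """Gaussian-kernel Gram matrix for samples x of shape (n, d)."""
    return np.exp(-cdist(x, x, 'sqeuclidean') / (2 * sigma**2))

def hsic_biased(K, L):
    """Biased (V-statistic) empirical HSIC = (1/n^2) (HKH o HLH)_{++}."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return ((H @ K @ H) * (H @ L @ H)).sum() / n**2
```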

Pages 31–35:

Covariance to reveal dependence

A more intuitive idea: maximize the covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖_F = 1, ‖g‖_G = 1} ( E_{x,y}[f(x)g(y)] − E_x[f(x)] E_y[g(y)] )

[Panels: scatter plot of (X, Y), correlation −0.00; dependence witness f(x) on X; dependence witness g(y) on Y; scatter plot of (f(X), g(Y)), correlation −0.90, COCO: 0.14]

How do we define covariance in (infinite) feature spaces?

Pages 36–38:

Covariance to reveal dependence

Covariance in RKHS: let's first look at the finite-dimensional linear case.

We have two random vectors x ∈ R^d, y ∈ R^{d′}. Are they linearly dependent?

Compute their covariance matrix (ignoring centering):

Cxy = E(x y⊤)

How to get a single “summary” number?

Solve for vectors f ∈ R^d, g ∈ R^{d′}:

argmax_{‖f‖=1, ‖g‖=1} f⊤ Cxy g = argmax_{‖f‖=1, ‖g‖=1} E_{xy}[(f⊤x)(g⊤y)]
                               = argmax_{‖f‖=1, ‖g‖=1} E_{x,y}[f(x)g(y)] = argmax_{‖f‖=1, ‖g‖=1} cov(f(x), g(y))

(maximum singular value)

Page 39:

Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G:

Challenge 1: Can we define a feature space analog to x y⊤?

YES:

• Given f ∈ R^d, g ∈ R^{d′}, h ∈ R^{d′}, define the matrix f g⊤ such that (f g⊤)h = f(g⊤h).

• Given f ∈ F, g ∈ G, h ∈ G, define the tensor product operator f ⊗ g such that (f ⊗ g)h = f 〈g, h〉_G.

• Now just set f := φ(x), g := ψ(y), to get x y⊤ → φ(x) ⊗ ψ(y)

Pages 40–41:

Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G:

Challenge 2: Does a covariance “matrix” (operator) in feature space exist? I.e., is there some CXY : G → F such that

〈f, CXY g〉_F = E_{x,y}[f(x)g(y)] = cov(f(x), g(y))

YES: via a Bochner integrability argument (as with the mean embedding). Under the condition E_{x,y}(√(k(x,x) l(y,y))) < ∞, we can define:

CXY := E_{x,y}[φ(x) ⊗ ψ(y)]

which is a Hilbert-Schmidt operator (the sum of squared singular values is finite).

Page 42:

REMINDER: functions revealing dependence

COCO(P; F, G) := sup_{‖f‖_F = 1, ‖g‖_G = 1} ( E_{x,y}[f(x)g(y)] − E_x[f(x)] E_y[g(y)] )

[Panels: scatter plot of (X, Y), correlation −0.00; dependence witnesses f(x) and g(y); scatter plot of (f(X), g(Y)), correlation −0.90, COCO: 0.14]

How do we compute this from finite data?

Page 43:

Empirical covariance operator

The empirical covariance given z := (xi, yi), i = 1, . . . , n (now including centering):

CXY := (1/n) Σ_{i=1}^n φ(xi) ⊗ ψ(yi) − µx ⊗ µy,   where µx := (1/n) Σ_{i=1}^n φ(xi).

More concisely,

CXY = (1/n) X H Y⊤,

where H = I_n − n^{−1} 1_n, and 1_n is an n × n matrix of ones, and

X = [φ(x1) . . . φ(xn)]    Y = [ψ(y1) . . . ψ(yn)].

Define the kernel matrices

Kij = (X⊤X)_{ij} = k(xi, xj)    Lij = l(yi, yj),

Page 44:

Functions revealing dependence

Optimization problem:

COCO(z; F, G) := max 〈f, CXY g〉_F   subject to   ‖f‖_F ≤ 1,  ‖g‖_G ≤ 1

Assume

f = Σ_{i=1}^n αi [φ(xi) − µx] = X H α    g = Σ_{j=1}^n βj [ψ(yj) − µy] = Y H β.

The associated Lagrangian is

L(f, g, λ, γ) = f⊤ CXY g − (λ/2)(‖f‖²_F − 1) − (γ/2)(‖g‖²_G − 1),

Pages 45–46:

Covariance to reveal dependence

• Empirical COCO(z; F, G) is the largest eigenvalue γ of the generalized eigenvalue problem

  [ 0            (1/n) K̃L̃ ] [α]       [ K̃  0 ] [α]
  [ (1/n) L̃K̃        0     ] [β]  = γ  [ 0  L̃ ] [β].

  (a numerical sketch follows this slide)

• K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces:

  K̃ = HKH   where   H = I − (1/n) 1 1⊤

• Mapping function for x:

  f(x) = Σ_{i=1}^n αi ( k(xi, x) − (1/n) Σ_{j=1}^n k(xj, x) )
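
A direct numerical sketch of the eigenvalue problem above (the small ridge added to the right-hand block matrix is a numerical convenience, not part of the slides):

```python
import numpy as np
from scipy.linalg import eigh

def empirical_coco(K, L, reg=1e-8):
    """Empirical COCO: largest generalized eigenvalue gamma of
    [[0, Kc Lc / n], [Lc Kc / n, 0]] v = gamma [[Kc, 0], [0, Lc]] v,
    where Kc = HKH and Lc = HLH are the centred Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H
    Z = np.zeros((n, n))
    A = np.block([[Z, Kc @ Lc / n], [Lc @ Kc / n, Z]])
    B = np.block([[Kc, Z], [Z, Lc]]) + reg * np.eye(2 * n)
    return eigh(A, B, eigvals_only=True)[-1]   # eigenvalues are returned in ascending order
```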

Page 47:

Hard-to-detect dependence

[Four panels: a smooth density and 500 samples from it; a rough density and 500 samples from it]

The density takes the form:

Px,y ∝ 1 + sin(ωx) sin(ωy)

Page 48:

Hard-to-detect dependence

• Example: sinusoids of increasing frequency

[Panels: the density for ω = 1, . . . , 6; plot of COCO (empirical average, 1500 samples) vs. frequency ω of the non-constant density component, ω = 0–7]

Pages 49–55:

Hard-to-detect dependence

COCO vs. frequency of perturbation from independence.

• Case of ω = 1: [scatter of (X, Y), correlation 0.27; dependence witnesses f, g; scatter of (f(X), g(Y)), correlation −0.50, COCO: 0.09]
• Case of ω = 2: [correlation 0.04; witness correlation 0.51, COCO: 0.07]
• Case of ω = 3: [correlation 0.03; witness correlation −0.45, COCO: 0.03]
• Case of ω = 4: [correlation 0.03; witness correlation 0.21, COCO: 0.02]
• Case of ω = ??: [correlation 0.00; witness correlation −0.13, COCO: 0.02]
• ...which turns out to be uniform noise! This bias will decrease with increasing sample size.

Page 56:

Hard-to-detect dependence

COCO vs. frequency of perturbation from independence.

• As dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear covariance.

• Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be induced by f, g (bias).

• This bias will decrease with increasing sample size.

Pages 57–60:

More functions revealing dependence

• Can we do better than COCO?

• A second example with zero correlation

[Panels: scatter of (X, Y), correlation 0; first dependence witnesses f, g, scatter of (f(X), g(Y)), correlation −0.80, COCO: 0.11; second dependence witnesses f2, g2, scatter of (f2(X), g2(Y)), correlation −0.37, COCO2: 0.06]

Pages 61–62:

Hilbert-Schmidt Independence Criterion

• Given γi := COCOi(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:

  HSIC(z; F, G) := Σ_{i=1}^n γi²

• In the limit of infinite samples:

  HSIC(P; F, G) := ‖Cxy‖²_HS = 〈Cxy, Cxy〉_HS
    = E_{x,x′,y,y′}[k(x, x′) l(y, y′)] + E_{x,x′}[k(x, x′)] E_{y,y′}[l(y, y′)]
      − 2 E_{x,y}[ E_{x′}[k(x, x′)] E_{y′}[l(y, y′)] ]

  – x′ an independent copy of x, y′ a copy of y

HSIC is identical to MMD(PXY, PX PY)

Page 63:

When does HSIC determine independence?

Theorem: When the kernels k and l are each characteristic, then HSIC = 0 iff Px,y = Px Py [Gretton, 2015].

Weaker than the MMD condition (which requires a kernel characteristic on X × Y to distinguish Px,y from Qx,y).

Pages 64–65:

Intuition: why characteristic needed on both X and Y

Question: Wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F:

f∗ = argmin_{f∈F} ( E_{XY}(Y − 〈f, φ(X)〉_F)² + λ‖f‖²_F ),

Counterexample: a density symmetric about the x-axis, s.t. p(x, y) = p(x, −y)

[Scatter plot of (X, Y), correlation −0.00]

Page 66:

Energy Distance and the MMD

Pages 67–68:

Energy distance and MMD

Distance between probability distributions:

Energy distance: [Baringhaus and Franz, 2004, Szekely and Rizzo, 2004, 2005]

DE(P, Q) = EP‖X − X′‖^q + EQ‖Y − Y′‖^q − 2 EP,Q‖X − Y‖^q,   0 < q ≤ 2

Maximum mean discrepancy: [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012]

MMD²(P, Q; F) = EP k(X, X′) + EQ k(Y, Y′) − 2 EP,Q k(X, Y)

Energy distance is MMD with a particular kernel! [Sejdinovic et al., 2013b]
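
A minimal sketch of the (biased, V-statistic) MMD² estimate corresponding to the population expression above; the Gaussian kernel and its bandwidth here are illustrative choices:

```python
import numpy as np
from scipy.spatial.distance import cdist

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and b."""
    return np.exp(-cdist(a, b, 'sqeuclidean') / (2 * sigma**2))

def mmd2_biased(x, y, kernel=gauss_kernel):
    """Biased estimate of MMD^2(P, Q; F) from samples x ~ P and y ~ Q (rows are samples)."""
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```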

Pages 69–70:

Distance covariance and HSIC

Distance covariance (0 < q, r ≤ 2): [Feuerverger, 1993, Szekely et al., 2007]

V²(X, Y) = E_{XY} E_{X′Y′}[ ‖X − X′‖^q ‖Y − Y′‖^r ]
         + E_X E_{X′}‖X − X′‖^q · E_Y E_{Y′}‖Y − Y′‖^r
         − 2 E_{XY}[ E_{X′}‖X − X′‖^q · E_{Y′}‖Y − Y′‖^r ]

Hilbert-Schmidt Independence Criterion: [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]

Define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then

HSIC(PXY, PX PY) = E_{XY} E_{X′Y′} k(X, X′) l(Y, Y′) + E_X E_{X′} k(X, X′) · E_Y E_{Y′} l(Y, Y′)
                 − 2 E_{X′Y′}[ E_X k(X, X′) · E_Y l(Y, Y′) ].

Distance covariance is HSIC with particular kernels! [Sejdinovic et al., 2013b]

Pages 71–73:

Semimetrics and Hilbert spaces

Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: Let ρ : X × X → R be a semimetric (no triangle inequality) on X. Let z0 ∈ X, and denote

kρ(z, z′) = ρ(z, z0) + ρ(z′, z0) − ρ(z, z′).

Then kρ is positive definite (via Moore–Aronszajn, it defines a unique RKHS) iff ρ is of negative type.

Call kρ a distance induced kernel.

Negative type: The semimetric space (Z, ρ) is said to have negative type if ∀n ≥ 2, z1, . . . , zn ∈ Z, and α1, . . . , αn ∈ R with Σ_{i=1}^n αi = 0,

Σ_{i=1}^n Σ_{j=1}^n αi αj ρ(zi, zj) ≤ 0.    (1)

Special case: Z ⊆ R^d and ρq(z, z′) = ‖z − z′‖^q. Then ρq is a valid semimetric of negative type for 0 < q ≤ 2.

Energy distance is MMD with a distance induced kernel.

Distance covariance is HSIC with distance induced kernels.
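
The distance-induced kernel itself is a one-liner; a minimal sketch for ρq(z, z′) = ‖z − z′‖^q (choosing z0 as the first sample is arbitrary, since the theorem allows any z0):

```python
import numpy as np

def distance_induced_kernel(Z, q=1.0, z0=None):
    """k_rho(z, z') = rho(z, z0) + rho(z', z0) - rho(z, z'), with rho(z, z') = ||z - z'||^q,
    0 < q <= 2. Plugging this kernel into MMD / HSIC recovers the energy distance /
    distance covariance."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    if z0 is None:
        z0 = Z[0]
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1) ** q   # rho(z_i, z_j)
    d0 = np.linalg.norm(Z - z0, axis=-1) ** q                         # rho(z_i, z_0)
    return d0[:, None] + d0[None, :] - D
```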

Page 74:

Two-sample testing benchmark

Two-sample testing example in 1-D:

[Plots: density P(X) vs. three candidate densities Q(X)]

Page 75:

Two-sample test, MMD with distance kernel

We obtain more powerful tests on this problem when q ≠ 1 (the exponent of the distance).

Key:
• Gaussian kernel
• q = 1
• Best: q = 1/3
• Worst: q = 2

Page 76:

Nonparametric Bayesian inference using distribution embeddings

Page 77:

Motivating Example: Bayesian inference without a model

• 3600 downsampled frames of 20 × 20 RGB pixels (Yt ∈ [0, 1]^1200)
• 1800 training frames, remaining for test.
• Gaussian noise added to Yt.

Challenges:

• No parametric model of camera dynamics (only samples)
• No parametric model of the map from camera angle to image (only samples)
• Want to do filtering: Bayesian inference

Page 78:

ABC: an approach to Bayesian inference without a model

Bayes rule:

P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

• P(x|y) is the likelihood
• π(y) is the prior

One approach: Approximate Bayesian Computation (ABC)

Pages 79–84:

ABC: an approach to Bayesian inference without a model

Approximate Bayesian Computation (ABC):

[ABC demonstration: samples (y, x) with y drawn from the prior π(Y) and x from the likelihood P(X|y); an observation x∗; samples whose x falls near x∗ are kept, giving an approximate posterior P(Y|x∗)]

Needed: a distance measure D and a tolerance parameter τ.

Page 85:

ABC: an approach to Bayesian inference without a model

Bayes rule:

P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

• P(x|y) is the likelihood
• π(y) is the prior

ABC generates a sample from P(Y|x∗) as follows:

1. generate a sample yt from the prior π,
2. generate a sample xt from P(X|yt),
3. if D(x∗, xt) < τ, accept y = yt; otherwise reject,
4. go to (1).

In step (3), D is a distance measure, and τ is a tolerance parameter.
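
Steps (1)–(4) translate directly into a rejection sampler; a minimal sketch (all argument names, and the commented toy model, are illustrative):

```python
import numpy as np

def abc_rejection(x_star, sample_prior, sample_likelihood, distance, tau, n_accept):
    """Minimal ABC rejection sampler following steps (1)-(4) above."""
    accepted = []
    while len(accepted) < n_accept:
        y = sample_prior()               # step 1: y_t ~ pi
        x = sample_likelihood(y)         # step 2: x_t ~ P(X | y_t)
        if distance(x_star, x) < tau:    # step 3: accept if within tolerance tau
            accepted.append(y)
    return np.array(accepted)            # step 4: repeat until enough accepted samples

# toy usage: Gaussian prior, Gaussian likelihood centred at y
# posterior = abc_rejection(x_star=1.0,
#                           sample_prior=lambda: np.random.randn(),
#                           sample_likelihood=lambda y: y + 0.5 * np.random.randn(),
#                           distance=lambda a, b: abs(a - b),
#                           tau=0.1, n_accept=200)
```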

Page 86:

Motivating example 2: simple Gaussian case

• p(x, y) is N((0, 1_d⊤)⊤, V) with V a randomly generated covariance

Posterior mean on x: ABC vs. kernel approach

[Plot: mean squared error vs. CPU time (sec), 6 dimensions; curves for KBI, COND, and ABC, annotated with the number of samples used at each point]

Pages 87–88:

Bayes again

Bayes rule:

P(y|x) = P(x|y) π(y) / ∫ P(x|y) π(y) dy

• P(x|y) is the likelihood
• π is the prior

How would this look with kernel embeddings?

Define an RKHS G on Y with feature map ψy and kernel l(y, ·).

We need a conditional mean embedding: for all g ∈ G,

E_{Y|x∗} g(Y) = 〈g, µ_{P(y|x∗)}〉_G

This will be obtained by RKHS-valued ridge regression.

Pages 89–91:

Ridge regression and the conditional feature mean

Ridge regression from X := R^d to a finite vector output Y := R^{d′} (these could be d′ nonlinear features of y):

Define training data

X = [x1 . . . xm] ∈ R^{d×m}    Y = [y1 . . . ym] ∈ R^{d′×m}

Solve

A = argmin_{A ∈ R^{d′×d}} ( ‖Y − AX‖² + λ‖A‖²_HS ),

where

‖A‖²_HS = tr(A⊤A) = Σ_{i=1}^{min{d,d′}} γ²_{A,i}

Solution: A = C_YX (C_XX + mλI)^{−1}

Pages 92–93:

Ridge regression and the conditional feature mean

Prediction at a new point x:

y∗ = Ax = C_YX (C_XX + mλI)^{−1} x = Σ_{i=1}^m βi(x) yi

where βi(x) is the i-th entry of

β(x) = (K + λmI)^{−1} [k(x1, x) . . . k(xm, x)]⊤

and

K := X⊤X,    k(x1, x) = x1⊤ x

What if we do everything in kernel space?

Page 94:

Ridge regression and the conditional feature mean

Recall our setup:

• Given training pairs: (xi, yi) ∼ PXY
• F on X with feature map φx and kernel k(x, ·)
• G on Y with feature map ψy and kernel l(y, ·)

We define the covariance between feature maps:

C_XX = E_X(φX ⊗ φX)    C_XY = E_XY(φX ⊗ ψY)

and matrices of feature-mapped training data

X = [φ_{x1} . . . φ_{xm}]    Y := [ψ_{y1} . . . ψ_{ym}]

Page 95:

Ridge regression and the conditional feature mean

Objective: [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]

A = argmin_{A ∈ HS(F,G)} ( E_XY ‖Y − AX‖²_G + λ‖A‖²_HS ),    ‖A‖²_HS = Σ_{i=1}^∞ γ²_{A,i}

Solution, same as in the vector case:

A = C_YX (C_XX + mλI)^{−1},

Prediction at a new x using kernels:

Aφx = [ψ_{y1} . . . ψ_{ym}] (K + λmI)^{−1} [k(x1, x) . . . k(xm, x)]⊤ = Σ_{i=1}^m βi(x) ψ_{yi}

where Kij = k(xi, xj)
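
In practice the prediction, and hence the empirical conditional mean embedding µ_{Y|x} = Σi βi(x) ψ_{yi}, reduces to solving one regularized linear system; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def conditional_mean_weights(K, k_x, lam):
    """beta(x) = (K + lambda m I)^{-1} [k(x_1, x) ... k(x_m, x)]^T, so that
    E[g(Y) | X = x] is approximated by sum_i beta_i(x) g(y_i)."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * m * np.eye(m), k_x)
```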

Pages 96–97:

Ridge regression and the conditional feature mean

How is the loss ‖Y − AX‖²_G relevant to a conditional expectation E_{Y|x} g(Y)? Define: [Song et al. (2009), Grunewalder et al. (2013)]

µ_{Y|x} := Aφx

We need A to have the property

E_{Y|x} g(Y) ≈ 〈g, µ_{Y|x}〉_G = 〈g, Aφx〉_G = 〈A∗g, φx〉_F = (A∗g)(x)

Page 98:

Ridge regression and the conditional feature mean

Natural risk function for the conditional mean:

L(A, PXY) := sup_{‖g‖≤1} E_X [ (E_{Y|X} g(Y))(X) − (A∗g)(X) ]²,

with (E_{Y|X} g(Y)) the target and (A∗g) the estimator.

Pages 99–105:

Ridge regression and the conditional feature mean

The squared loss risk provides an upper bound on the natural risk:

L(A, PXY) ≤ E_XY ‖ψY − AφX‖²_G

Proof: Jensen and Cauchy-Schwarz.

L(A, PXY) := sup_{‖g‖≤1} E_X [ (E_{Y|X} g(Y))(X) − (A∗g)(X) ]²
           ≤ E_XY sup_{‖g‖≤1} [ g(Y) − (A∗g)(X) ]²
           = E_XY sup_{‖g‖≤1} [ 〈g, ψY〉_G − 〈A∗g, φX〉_F ]²
           = E_XY sup_{‖g‖≤1} [ 〈g, ψY〉_G − 〈g, AφX〉_G ]²
           = E_XY sup_{‖g‖≤1} 〈g, ψY − AφX〉²_G
           ≤ E_XY ‖ψY − AφX‖²_G

If we assume E_Y[g(Y)|X = x] ∈ F, then the upper bound is tight (next slide).

Pages 106–107:

Conditions for ridge regression = conditional mean

The conditional mean is obtained by ridge regression when E_Y[g(Y)|X = x] ∈ F.

Given a function g ∈ G, assume E_{Y|X}[g(Y)|X = ·] ∈ F. Then

C_XX E_{Y|X}[g(Y)|X = ·] = C_XY g.

Why this is useful:

E_{Y|X}[g(Y)|X = x] = 〈E_{Y|X}[g(Y)|X = ·], φx〉_F
                    = 〈C_XX^{−1} C_XY g, φx〉_F
                    = 〈g, C_YX C_XX^{−1} φx〉_G    (C_YX C_XX^{−1} is the regression operator)

Proof: [Fukumizu et al., 2004] For all f ∈ F, by definition of C_XX,

〈f, C_XX E_{Y|X}[g(Y)|X = ·]〉_F = cov( f, E_{Y|X}[g(Y)|X = ·] )
                                 = E_X( f(X) E_{Y|X}[g(Y)|X] )
                                 = E_XY( f(X) g(Y) )
                                 = 〈f, C_XY g〉,

by definition of C_XY.

Pages 108–109:

Kernel Bayes' law

• Prior: Y ∼ π(y)

• Likelihood: (X|y) ∼ P(x|y) with some joint P(x, y)

• Joint distribution: Q(x, y) = P(x|y) π(y)
  Warning: Q ≠ P, change of measure from P(y) to π(y)

• Marginal for x:
  Q(x) := ∫ P(x|y) π(y) dy.

• Bayes' law:
  Q(y|x) = P(x|y) π(y) / Q(x)

Pages 110–111:

Kernel Bayes' law

• Posterior embedding via the usual conditional update,

  µ_{Q(y|x)} = C_{Q(y,x)} C_{Q(x,x)}^{−1} φx.

• Given the mean embedding of the prior: µ_{π(y)}

• Define the marginal covariance:

  C_{Q(x,x)} = ∫ (φx ⊗ φx) P(x|y) π(y) dx dy = C_{(xx)y} C_{yy}^{−1} µ_{π(y)}

• Define the cross-covariance:

  C_{Q(y,x)} = ∫ (ψy ⊗ φx) P(x|y) π(y) dx dy = C_{(yx)y} C_{yy}^{−1} µ_{π(y)}.

Page 112: Lecture3: Dependencemeasures usingRKHSembeddingsmlss.tuebingen.mpg.de/2015/slides/gretton/part_3.pdf · Lecture3: Dependencemeasures usingRKHSembeddings MLSST¨ubingen,2015 Arthur

Kernel Bayes’ law: consistency result

• How to compute posterior expectations from data?

• Given samples: $\{(x_i, y_i)\}_{i=1}^n$ from $\mathbf{P}_{xy}$, and $\{u_j\}_{j=1}^n$ from the prior $\pi$.

• Want to compute $E[g(Y)|X = x]$ for $g \in \mathcal{G}$.

• For any $x \in \mathcal{X}$,
$$\left| \mathbf{g}_y^\top R_{Y|X}\, \mathbf{k}_X(x) - E[g(Y)|X = x] \right| = O_p\!\left(n^{-4/27}\right) \qquad (n \to \infty),$$
where

  – $\mathbf{g}_y = (g(y_1), \ldots, g(y_n))^\top \in \mathbb{R}^n$,

  – $\mathbf{k}_X(x) = (k(x_1, x), \ldots, k(x_n, x))^\top \in \mathbb{R}^n$,

  – $R_{Y|X}$ is learned from the samples, and contains the $u_j$ (a numerical sketch follows below).

Smoothness assumptions:

• $\pi / p_Y \in \mathcal{R}(C_{YY}^{1/2})$, where $p_Y$ is the p.d.f. of $\mathbf{P}_Y$,

• $E[g(Y)|X = \cdot] \in \mathcal{R}(C_{\mathbf{Q}(xx)}^{2})$.
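Below is a rough numpy sketch (not from the lecture) of one way to assemble $R_{Y|X}$ from Gram matrices, following the square-regularized construction in Fukumizu et al. [2013]: coefficients $\mu$ act as an empirical stand-in for $C_{yy}^{-1}\mu_{\pi(y)}$ on the sample points, $\Lambda = \mathrm{diag}(\mu)$, and $R_{Y|X} = \Lambda G_X\left((\Lambda G_X)^2 + \delta I\right)^{-1}\Lambda$. The exact placement of the regularizers and all parameter values here are assumptions for illustration, not the precise recipe behind the rate above.

import numpy as np

def kbr_posterior_expectation(x, y, u, g_y, x_star, k_x, k_y, eps=1e-2, delta=1e-2):
    # Sketch of g_y^T R_{Y|X} k_X(x*), with a square-regularized form of R_{Y|X}
    # assumed here (following Fukumizu et al. [2013]).
    # x, y: joint samples from P_xy; u: samples from the prior pi;
    # g_y: vector (g(y_1), ..., g(y_n)); k_x, k_y: functions returning Gram matrices.
    n = x.shape[0]
    G_x = k_x(x, x)
    G_y = k_y(y, y)
    # coefficients of the prior-reweighted measure: empirical stand-in for C_yy^{-1} mu_pi
    m_pi = k_y(y, u).mean(axis=1)                    # (1/l) sum_j k_Y(y_i, u_j)
    mu = n * np.linalg.solve(G_y + n * eps * np.eye(n), m_pi)
    Lam = np.diag(mu)
    LG = Lam @ G_x
    # Lam @ G_x need not be positive semi-definite, hence the squared regularization
    R = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), Lam)
    return g_y @ R @ k_x(x, x_star)

The kernel functions k_x and k_y could be, for example, the gauss_kernel defined in the earlier sketch.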

Experiment: Kernel Bayes’ law vs EKF

• Compare with extended Kalman filter (EKF) on camera orientation task

• 3600 downsampled frames of 20 × 20 RGB pixels ($Y_t \in [0,1]^{1200}$)

• 1800 training frames, remaining for test.

• Gaussian noise added to $Y_t$.

Average MSE and standard errors (10 runs):

                    KBR (Gauss)      KBR (Tr)         Kalman (9 dim.)   Kalman (Quat.)
  σ² = 10⁻⁴         0.210 ± 0.015    0.146 ± 0.003    1.980 ± 0.083     0.557 ± 0.023
  σ² = 10⁻³         0.222 ± 0.009    0.210 ± 0.008    1.935 ± 0.064     0.541 ± 0.022


Co-authors

• From UCL:

– Luca Baldassarre

– Steffen Grunewalder

– Guy Lever

– Sam Patterson

– Massimiliano Pontil

– Dino Sejdinovic

• External:

– Karsten Borgwardt, MPI

– Wicher Bergsma, LSE

– Kenji Fukumizu, ISM

– Zaid Harchaoui, INRIA

– Bernhard Schoelkopf, MPI

– Alex Smola, CMU/Google

– Le Song, Georgia Tech

– Bharath Sriperumbudur, Cambridge


Selected references

Characteristic kernels and mean embeddings:

• Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.

• Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.

• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Two-sample, independence, conditional independence tests:

• Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS.

• Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence. NIPS.

• Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS.

• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Energy distance, relation to kernel distances:

• Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.

Three-way interaction:

• Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.


Selected references (continued)

Conditional mean embedding, RKHS-valued regression:

• Weston, J., Chapelle, O., Elisseeff, A., Scholkopf, B., and Vapnik, V. (2003). Kernel Dependency Estimation. NIPS.

• Micchelli, C., and Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation.

• Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics.

• Song, L., Huang, J., Smola, A., and Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML.

• Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.

• Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.

Kernel Bayes' rule:

• Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.

• Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR.

Kernel CCA: Definition

• There exists a factorization of $C_{xy}$ such that [Baker, 1973]
$$C_{xy} = C_{xx}^{1/2}\, V_{xy}\, C_{yy}^{1/2}, \qquad \|V_{xy}\|_S \le 1$$

• Regularized empirical estimate of the spectral norm: [JMLR07]
$$\|V_{xy}\|_S := \sup_{f \in \mathcal{F},\, g \in \mathcal{G}} \langle f, C_{xy}\, g \rangle_{\mathcal{F}} \quad \text{subject to} \quad
\langle f, (C_{xx} + \epsilon_n I) f \rangle_{\mathcal{F}} = 1, \quad
\langle g, (C_{yy} + \epsilon_n I) g \rangle_{\mathcal{G}} = 1$$

  – First canonical correlate

• Regularized empirical estimate of the HS norm: [NIPS07b]
$$\mathrm{NOCCO}(z; F, G) := \|V_{xy}\|_{HS}^2 = \mathrm{tr}\left[ R_y R_x \right], \qquad R_x := K_x (K_x + n\epsilon_n I_n)^{-1}$$

(A numerical sketch of this statistic follows below.)
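Here is a minimal numpy sketch (not from the lecture) of the empirical HS-norm statistic $\mathrm{tr}[R_y R_x]$ with $R := \tilde{K}(\tilde{K} + n\epsilon_n I)^{-1}$; centring the Gram matrices ($\tilde{K} = HKH$) and the Gaussian kernel choice are assumptions made here for illustration.

import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    # Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def nocco(x, y, eps=1e-3, sigma_x=1.0, sigma_y=1.0):
    # Empirical NOCCO statistic ||V_xy||_HS^2 = tr[R_y R_x],
    # with R = K~ (K~ + n*eps*I)^{-1} and K~ a centred Gram matrix (assumed here).
    n = x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n              # centring matrix
    Kx = H @ gauss_kernel(x, x, sigma_x) @ H
    Ky = H @ gauss_kernel(y, y, sigma_y) @ H
    Rx = Kx @ np.linalg.inv(Kx + n * eps * np.eye(n))
    Ry = Ky @ np.linalg.inv(Ky + n * eps * np.eye(n))
    return np.trace(Ry @ Rx)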


Kernel CCA: Illustration

• Ring-shaped density, first eigenvalue

[Figure: four panels — the ring-shaped sample of Y against X (Correlation: −0.00), the dependence witness f(x) on X, the dependence witness g(y) on Y, and g(Y) against f(X) (Correlation: 1.00)]


Kernel CCA: Illustration

• Ring-shaped density, third eigenvalue

[Figure: four panels — the ring-shaped sample of Y against X (Correlation: −0.00), the dependence witness f(x) on X, the dependence witness g(y) on Y, and g(Y) against f(X) (Correlation: 0.97)]


NOCCO: HS Norm of Normalized Cross Covariance

• Define NOCCO as
$$\mathrm{NOCCO} := \|V_{xy}\|_{HS}^2$$

• Characteristic kernels: the population NOCCO is the mean-square contingency, independent of the RKHS:
$$\mathrm{NOCCO} = \int\!\!\int_{\mathcal{X} \times \mathcal{Y}} \left( \frac{p_{xy}(x,y)}{p_x(x)\, p_y(y)} - 1 \right)^2 p_x(x)\, p_y(y)\, d\mu(x)\, d\mu(y).$$

  – $\mu(x)$ and $\mu(y)$ Lebesgue measures on $\mathcal{X}$ and $\mathcal{Y}$; $\mathbf{P}_{xy}$ absolutely continuous w.r.t. $\mu(x) \times \mu(y)$, with density $p_{xy}$ and marginal densities $p_x$ and $p_y$

• Convergence result: assume the regularization $\epsilon_n$ satisfies $\epsilon_n \to 0$ and $\epsilon_n^3 n \to \infty$. Then
$$\|\hat{V}_{xy} - V_{xy}\|_{HS} \to 0$$
in probability. (A numeric example on the ring-shaped density follows below.)
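In the spirit of the ring-shaped illustration above, the snippet below (illustrative, reusing the nocco function from the earlier sketch) generates a ring-shaped density on which X and Y are essentially uncorrelated yet strongly dependent, and compares the NOCCO statistic on the dependent pair against a permuted (independent) baseline.

import numpy as np

rng = np.random.default_rng(0)
n = 400
theta = rng.uniform(0, 2 * np.pi, n)
r = 1.0 + 0.1 * rng.standard_normal(n)
x = (r * np.cos(theta)).reshape(-1, 1)      # ring-shaped joint density
y = (r * np.sin(theta)).reshape(-1, 1)

print(np.corrcoef(x.ravel(), y.ravel())[0, 1])   # near 0: no linear correlation
print(nocco(x, y))                               # dependent pair
print(nocco(x, rng.permutation(y)))              # permuted baseline: typically much smaller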


References

C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004.

C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.

K. Chwialkowski and A. Gretton. A kernel independence test for random processes. ICML, 2014.

K. Chwialkowski, D. Sejdinovic, and A. Gretton. A wild bootstrap for degenerate kernel tests. NIPS, 2014.

Andrey Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.

K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Scholkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.

K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14:3753–3783, 2013.

A. Gretton and L. Gyorfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.

A. Gretton, O. Bousquet, A. J. Smola, and B. Scholkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings of the International Conference on Algorithmic Learning Theory, pages 63–77. Springer-Verlag, 2005.

A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520, Cambridge, MA, 2007. MIT Press.

A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Scholkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.

A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.

Arthur Gretton. A simpler condition for consistency of a kernel independence test. Technical Report 1501.06103, arXiv, 2015.

Wittawat Jitkrittum, Arthur Gretton, Nicolas Heess, S. M. Ali Eslami, Balaji Lakshminarayanan, Dino Sejdinovic, and Zoltan Szabo. Kernel-based just-in-time learning for passing expectation propagation messages. UAI, 2015.

D. Sejdinovic, A. Gretton, and W. Bergsma. A kernel test for three-variable interactions. In NIPS, 2013a.

D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263–2702, 2013b.

D. Sejdinovic, H. Strathmann, M. Lomeli Garcia, C. Andrieu, and A. Gretton. Kernel adaptive Metropolis-Hastings. ICML, 2014.

A. J. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.

B. Sriperumbudur, K. Fukumizu, A. Gretton, and A. Hyvarinen. Density estimation in infinite dimensional exponential families. Technical Report 1312.3516, ArXiv e-prints, 2014.

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, and Arthur Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. arXiv, 2015.

G. Szekely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.

G. Szekely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.

G. Szekely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.

K. Zhang, J. Peters, D. Janzing, and B. Scholkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.