TRANSCRIPT

Gaussian Processes: From the Basics to the State-of-the-Art

Dr. Richard E. Turner ([email protected])
Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

with Thang Bui, Yingzhen Li, Jose Miguel Hernandez Lobato, Daniel Hernandez Lobato, Josiah Jan, Alex Navarro, Felipe Tobar, and Maneesh Sahani
Motivation: non-linear regression

[Figure, built up over several slides: observed data points y(x) for x between 1 and 20, with y roughly in the range −2 to 2. The slides ask what value the function takes at an unobserved input ("?") and sketch candidate non-linear fits through the data.]

Can we do this with a plain old Gaussian?
Gaussian distribution

p(y|Σ) ∝ exp(−(1/2) yᵀ Σ⁻¹ y)

[Figure, repeated over several slides: contour plots of the bivariate density over (y1, y2) on [−3, 3]², shown for

Σ = [1 .7; .7 1], [1 .6; .6 1], [1 .4; .4 1], [1 .1; .1 1], [1 .0; .0 1],

and back to Σ = [1 .7; .7 1]. As the off-diagonal correlation shrinks, the elliptical contours relax towards circles.]
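The density on the slide can be evaluated directly; a minimal sketch in NumPy (function and variable names are mine, and the density is left unnormalised exactly as on the slide):

```python
import numpy as np

# Unnormalised log-density from the slide: log p(y|Σ) = -1/2 y^T Σ^{-1} y + const.
def log_density_unnorm(y, Sigma):
    return -0.5 * y @ np.linalg.solve(Sigma, y)

Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

# With positive correlation, a point where y1 and y2 agree is more probable
# than one of the same length where they disagree.
a = log_density_unnorm(np.array([1.0, 1.0]), Sigma)
b = log_density_unnorm(np.array([1.0, -1.0]), Sigma)
```

This is why the contour ellipses on the slide tilt along the y1 = y2 diagonal for positive correlation.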
Gaussian distribution

p(y2|y1, Σ) ∝ exp(−(1/2)(y2 − µ∗) Σ∗⁻¹ (y2 − µ∗))

[Figure, repeated over several slides: the same correlated bivariate Gaussian, now with y1 observed. Slicing the joint density at the observed y1 gives the conditional over y2: a 1-D Gaussian whose mean µ∗ is shifted towards the observation and whose variance Σ∗ is smaller than the prior variance.]
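The conditioning operation these slides illustrate has a closed form for a zero-mean bivariate Gaussian; a minimal sketch (function names are mine):

```python
import numpy as np

# Joint covariance from the slides: unit variances, correlation 0.7.
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

def condition_y2_given_y1(y1, Sigma):
    """Mean and variance of p(y2 | y1) for a zero-mean bivariate Gaussian."""
    s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    mu_star = s12 / s11 * y1           # conditional mean: shifted towards the observation
    var_star = s22 - s12 ** 2 / s11    # conditional variance: smaller than the prior s22
    return mu_star, var_star

mu, var = condition_y2_given_y1(1.0, Sigma)
```

Observing y1 = 1 shifts y2's mean to 0.7 and shrinks its variance from 1 to 0.51, exactly the "slice" behaviour pictured on the slides.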
New visualisation

Σ = [1 .9; .9 1]

[Figure, repeated over many animation frames: the left panel shows samples from the bivariate Gaussian in the (y1, y2) plane; the right panel re-plots each sample against the variable index (1, 2), so a single draw becomes two connected points. With correlation .9 the two values in each draw land close together.]
New visualisation

Σ = [1 .9 .8 .6 .4; .9 1 .9 .8 .6; .8 .9 1 .9 .8; .6 .8 .9 1 .9; .4 .6 .8 .9 1]

[Figure, repeated over many animation frames: samples from this 5-D Gaussian plotted against the variable index 1–5. Because nearby indices are highly correlated and distant ones less so, each draw traces out a smooth curve.]
New visualisation

[Figure, repeated over many animation frames: samples from a 20-dimensional Gaussian plotted against the variable index 1–20, with y on [−3, 3]. Σ is shown as a 20×20 image (axes labelled 1, 10, 20) in which the correlation decays smoothly with index separation, so each draw is a smooth curve over the 20 indices.]
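The index-versus-value plots on these slides are easy to reproduce; a minimal sketch, assuming a squared-exponential form for the pictured covariance (the slide shows Σ only as an image, so the exact matrix and the length-scale of 4 are my assumptions):

```python
import numpy as np

# A smooth 20x20 covariance like the one pictured: correlation decays
# with index separation (squared-exponential form is an assumption).
n, ell = 20, 4.0
idx = np.arange(1, n + 1)
Sigma = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
# Small jitter keeps the near-singular matrix numerically positive definite.
samples = rng.multivariate_normal(np.zeros(n), Sigma + 1e-9 * np.eye(n), size=3)
# Each row, plotted against the variable index, is one smooth curve
# like those on the slides.
```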
Regression using Gaussians

[Figure, built up over several slides: the 20-index Gaussian again, now with a handful of the indices observed as data; the remaining values are predicted by conditioning.]

Σ(x1, x2) = K(x1, x2) + I σ²_y
K(x1, x2) = σ² e^(−(x1−x2)²/(2l²))
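The two formulas above can be sketched directly in NumPy (function names and the illustrative noise level are mine):

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared-exponential kernel K(x1, x2) = sigma^2 exp(-(x1-x2)^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return sigma ** 2 * np.exp(-0.5 * d ** 2 / ell ** 2)

x = np.linspace(1, 20, 20)          # the 20 input locations from the slides
sigma_y = 0.1                       # observation-noise std (illustrative value)
K = se_kernel(x, x, sigma=1.0, ell=2.0)
Sigma = K + sigma_y ** 2 * np.eye(len(x))   # Sigma(x1, x2) = K(x1, x2) + I sigma_y^2
```

The kernel makes the covariance between two outputs a function of the distance between their inputs, which is what turns "correlated indices" into "smooth functions of x".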
Regression: probabilistic inference in function space

Non-parametric (∞-parametric) model:

p(y|θ) = N(0, Σ),  Σ(x1, x2) = K(x1, x2) + I σ²_y,  K(x1, x2) = σ² e^(−(x1−x2)²/(2l²))

Equivalent parametric view:

y(x) = f(x; θ) + σ_y ε,  ε ∼ N(0, 1)

[Figure, built up over several slides: the GP produces a function estimate with uncertainty. Annotations mark σ_y as the noise, l as the horizontal scale, and σ as the vertical scale of the functions.]
Mathematical Foundations: Definition

Gaussian process = generalisation of the multivariate Gaussian distribution to infinitely many variables.

Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.

A Gaussian distribution is fully specified by a mean vector, µ, and covariance matrix Σ:

f = (f1, . . . , fn) ∼ N(µ, Σ), indices i = 1, . . . , n

A Gaussian process is fully specified by a mean function m(x) and covariance function K(x, x′):

f(x) ∼ GP(m(x), K(x, x′)), indices x
Mathematical Foundations: Regression

Q1. What's the formal justification for how we were using GPs for regression?

- generative model (like non-linear regression)
- place a GP prior over the non-linear function (smoothly wiggling functions expected)
- a sum of Gaussian variables is Gaussian: this induces a GP over the noisy observations y
Mathematical Foundations: Marginalisation

Q2. A GP is "like" a Gaussian distribution with an infinitely long mean vector and an "infinite by infinite" covariance matrix, so how do we represent it on a computer?

We are saved by the marginalisation property: we only need to represent finite-dimensional projections of the GP on the computer.
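The marginalisation property is what makes this practical: marginalising a Gaussian onto a subset of its variables requires no integration at all, just reading off sub-blocks. A minimal sketch (the 5-variable example and names are mine):

```python
import numpy as np

# Marginalisation property: the marginal over a subset of Gaussian variables
# just reads off the corresponding entries of mu and sub-block of Sigma.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)      # a random valid 5x5 covariance
mu = np.zeros(5)

keep = [0, 2]                        # marginalise onto variables 1 and 3
mu_marg = mu[keep]
Sigma_marg = Sigma[np.ix_(keep, keep)]
# No integral needed: this is why an infinite-dimensional GP can be handled
# by only ever representing finite-dimensional projections.
```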
Mathematical Foundations: Prediction

Q3. How do we make predictions?

[The slides build up the standard Gaussian conditioning result, with annotations:]

- predictive mean: linear in the data
- predictive covariance: the prior uncertainty minus a reduction in uncertainty, so predictions are more confident than the prior
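The equations on these slides appear only as images; the standard zero-mean conditioning formulas they annotate are µ∗ = K∗ K⁻¹ y and Σ∗ = K∗∗ − K∗ K⁻¹ K∗ᵀ. A minimal sketch (function names, the squared-exponential kernel, and the toy data are mine):

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    d = x1[:, None] - x2[None, :]
    return sigma ** 2 * np.exp(-0.5 * d ** 2 / ell ** 2)

def gp_predict(x_train, y_train, x_test, sigma_y=0.1, **kern):
    """Standard GP regression conditionals (zero prior mean)."""
    K = se_kernel(x_train, x_train, **kern) + sigma_y ** 2 * np.eye(len(x_train))
    Ks = se_kernel(x_test, x_train, **kern)
    Kss = se_kernel(x_test, x_test, **kern)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha                           # predictive mean: linear in the data
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)   # prior uncertainty minus reduction
    return mean, cov

x = np.array([1.0, 3.0, 5.0])
y = np.sin(x)
m, C = gp_predict(x, y, np.array([2.0, 4.0]))
# The posterior variance never exceeds the prior variance at the same input:
# predictions are more confident than the prior.
```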
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
160 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
161 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
162 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
163 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
164 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
165 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
166 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
167 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
168 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
169 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
170 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
171 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
172 / 335
What effect do the hyper-parameters have?
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
−3
−2
−1
0
1
2
3
x
y
Σ =
Σ(x1, x2) = K(x1, x2) + Iσ2y
p(y|θ) = N (0,Σ)
Non-parametric (∞-parametric) Parametric model
y(x) = f (x; θ) + σyε
ε ∼ N (0, 1)K(x1, x2) = σ2e−1
2l2(x1−x2)2
173 / 335
What effect do the hyper-parameters have?

[Figure: samples drawn with a short horizontal length-scale, alongside the covariance matrix Σ]

short horizontal length-scale

Non-parametric (∞-parametric) model:
Σ(x1, x2) = K(x1, x2) + I σ_y²
p(y|θ) = N(0, Σ)
K(x1, x2) = σ² exp(−(x1 − x2)² / (2l²))

Parametric model:
y(x) = f(x; θ) + σ_y ε,  ε ∼ N(0, 1)

181 / 335
What effect do the hyper-parameters have?

[Figure: samples drawn with a long horizontal length-scale, alongside the covariance matrix Σ]

long horizontal length-scale

Non-parametric (∞-parametric) model:
Σ(x1, x2) = K(x1, x2) + I σ_y²
p(y|θ) = N(0, Σ)
K(x1, x2) = σ² exp(−(x1 − x2)² / (2l²))

Parametric model:
y(x) = f(x; θ) + σ_y ε,  ε ∼ N(0, 1)

202 / 335
What effect do the hyper-parameters have?

K(x1, x2) = σ² exp(−(x1 − x2)² / (2l²))

Hyper-parameters have a strong effect:
- l controls the horizontal length-scale
- σ² controls the vertical scale of the data

⟹ need automatic learning of hyper-parameters from data, e.g.

arg max over l, σ² of log p(y|θ)

[Figure: panels of prior samples for l = short, l = medium, and l = long]

223 / 335
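The learning step above can be sketched directly: evaluate log p(y|θ) for a grid of length-scales and keep the maximiser. A minimal sketch under the slide's model (the grid-search approach and all names here are illustrative; in practice the objective is usually optimised with gradients):

```python
import numpy as np

def log_marginal_likelihood(y, x, sigma2, l, sigma2_y=0.1):
    # log p(y | theta) for y ~ N(0, K + I sigma_y^2) under an SE kernel
    d = x[:, None] - x[None, :]
    K = sigma2 * np.exp(-0.5 * d**2 / l**2) + sigma2_y * np.eye(len(x))
    sign, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + logdet + len(x) * np.log(2 * np.pi))

# Generate data from a GP with a known length-scale, then recover it by grid search.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
d = x[:, None] - x[None, :]
K_true = np.exp(-0.5 * d**2 / 0.7**2) + 0.1 * np.eye(len(x))
y = rng.multivariate_normal(np.zeros(len(x)), K_true)

lengthscales = np.linspace(0.1, 3.0, 50)
lmls = [log_marginal_likelihood(y, x, 1.0, l) for l in lengthscales]
l_hat = lengthscales[int(np.argmax(lmls))]
print(l_hat)  # should land near the true value of 0.7
```

The marginal likelihood automatically trades data fit against model complexity, which is why maximising it gives a sensible length-scale rather than always preferring the most flexible (shortest) one.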
Higher dimensional input spaces

[Figure: a sample y from a GP over a two-dimensional input space (x1, x2), plotted as a surface over a 100 × 100 grid]

225 / 335
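Samples over a two-dimensional input space are generated exactly as in one dimension, with the kernel applied to the Euclidean distance over both coordinates. A sketch — the grid size and length-scale here are chosen for speed, not taken from the slide:

```python
import numpy as np

def se_kernel_2d(X1, X2, sigma2=1.0, l=1.0):
    # Squared distance summed over both input coordinates
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma2 * np.exp(-0.5 * d2 / l**2)

g = np.linspace(0, 1, 20)
X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)  # 400 grid points in 2-D
K = se_kernel_2d(X, X, l=0.2)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))  # jitter for numerical stability
y = (L @ np.random.default_rng(0).standard_normal(len(X))).reshape(20, 20)
print(y.shape)  # (20, 20)
```

Reshaping the draw back onto the grid gives a random smooth surface like the one in the figure; the cubic cost of the Cholesky in the number of grid points is what motivates the scalable approximations later in the deck.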
What effect does the form of the covariance function have?

[Figure: samples drawn with the Laplacian covariance function, alongside the covariance matrix Σ]

Laplacian covariance function
(Brownian motion / Ornstein–Uhlenbeck)

K(x1, x2) = σ² exp(−|x1 − x2| / (2l²))

241 / 335
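The qualitative difference from squared-exponential samples — continuous but nowhere-differentiable paths — shows up directly in the size of successive differences along a sample. A sketch using the slide's parameterisation of the kernel:

```python
import numpy as np

def ou_kernel(x1, x2, sigma2=1.0, l=1.0):
    # Laplacian / Ornstein-Uhlenbeck, in the slide's parameterisation:
    # K(x1, x2) = sigma^2 exp(-|x1 - x2| / (2 l^2))
    return sigma2 * np.exp(-np.abs(x1[:, None] - x2[None, :]) / (2 * l**2))

def se_kernel(x1, x2, sigma2=1.0, l=1.0):
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.exp(-0.5 * d**2 / l**2)

def sample(K, rng):
    # Draw one sample from N(0, K) via a jittered Cholesky factor
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(K)))
    return L @ rng.standard_normal(len(K))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
f_ou = sample(ou_kernel(x, x), rng)
f_se = sample(se_kernel(x, x), rng)

# OU samples are rough: neighbouring values differ far more than for the
# smooth squared-exponential samples on the same grid.
print(np.mean(np.abs(np.diff(f_ou))) > np.mean(np.abs(np.diff(f_se))))  # True
```

This roughness is inherited from the kink of the kernel at x1 = x2; any stationary kernel that is not differentiable at zero produces non-differentiable sample paths.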
What effect does the form of the covariance function have?

[Figure: samples drawn with the rational quadratic covariance function, alongside the covariance matrix Σ]

Rational Quadratic

K(x1, x2) = σ² (1 + |x1 − x2| / (2αl²))^(−α)

262 / 335
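A useful property of the rational quadratic is its α → ∞ limit: (1 + d/(2αl²))^(−α) → exp(−d/(2l²)), so for large α it approaches the exponential kernel of the previous slide with the same length-scale. A sketch checking this numerically — note it follows the slide's |x1 − x2| form; the more common variant of the rational quadratic uses the squared distance instead:

```python
import numpy as np

def rq_kernel(x1, x2, sigma2=1.0, l=1.0, alpha=1.0):
    # K(x1, x2) = sigma^2 (1 + |x1 - x2| / (2 alpha l^2))^(-alpha)
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma2 * (1 + d / (2 * alpha * l**2)) ** (-alpha)

x = np.linspace(-3, 3, 5)
K = rq_kernel(x, x, alpha=1000.0)
K_exp = np.exp(-np.abs(x[:, None] - x[None, :]) / (2 * 1.0**2))
print(np.max(np.abs(K - K_exp)))  # small: large alpha recovers the exponential kernel
```

For finite α the kernel behaves like a mixture over a range of length-scales, which is why its samples show structure at several scales at once.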
What effect does the form of the covariance function have?

[Figure: samples drawn with the periodic covariance function, alongside the covariance matrix Σ]

Periodic (sinusoid × squared exponential)

K(x1, x2) = σ² cos(ω(x1 − x2)) exp(−(x1 − x2)² / (2l²))

283 / 335
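This covariance is a product of a cosine and a squared exponential, and a product of valid stationary covariances is itself a valid covariance. Samples therefore oscillate at frequency ω while decorrelating over the length-scale l. A minimal sketch (the value of ω is chosen arbitrarily for illustration):

```python
import numpy as np

def periodic_kernel(x1, x2, sigma2=1.0, l=1.0, omega=2 * np.pi):
    # K(x1, x2) = sigma^2 cos(omega (x1 - x2)) exp(-(x1 - x2)^2 / (2 l^2))
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.cos(omega * d) * np.exp(-0.5 * d**2 / l**2)

x = np.linspace(-3, 3, 200)
K = periodic_kernel(x, x)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))  # jitter for numerical stability
f = L @ np.random.default_rng(0).standard_normal(len(x))
print(f.shape)  # (200,)
```

Because the envelope decays, the samples are only quasi-periodic: oscillations far apart in x drift out of phase, with l controlling how quickly the phase coherence is lost.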
The covariance function has a large effect

[figure: data fits and posterior samples under three covariance functions, plotted over x ∈ [−3, 3]]

OU        RQ        periodic

Bayesian model comparison: place a prior over models, then compare via the marginal likelihood.

304–307 / 335
The covariance function has a large effect

Health warnings: the marginal likelihood is hard to compute (approximations are needed), and the results are often very sensitive to the priors.

[figure: data fits and posterior samples under the OU, RQ and periodic covariance functions]

Bayesian model comparison: prior over models → marginal likelihood

308 / 335
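For GP regression the marginal likelihood is available in closed form, so comparing covariance functions on a given dataset reduces to comparing log p(y | X) = −½ yᵀ(K + σₙ²I)⁻¹y − ½ log|K + σₙ²I| − (n/2) log 2π under each kernel. A hedged sketch on synthetic data (kernels, hyperparameters and data below are illustrative, not from the lecture):

```python
import numpy as np

def ou_kernel(x, ell=1.0, sigma=1.0):
    # Ornstein-Uhlenbeck (exponential) covariance: rough sample paths
    return sigma**2 * np.exp(-np.abs(x[:, None] - x[None, :]) / ell)

def per_kernel(x, omega=2.0, ell=1.0, sigma=1.0):
    # periodic: sinusoid × squared exponential, as on the earlier slide
    d = x[:, None] - x[None, :]
    return sigma**2 * np.cos(omega * d) * np.exp(-0.5 * d**2 / ell**2)

def log_marginal_likelihood(K, y, noise=0.1):
    # log p(y) = -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - n/2 log(2 pi)
    n = len(y)
    L = np.linalg.cholesky(K + noise**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = np.cos(2.0 * x) + 0.1 * rng.standard_normal(60)   # periodic-looking data

lml_ou = log_marginal_likelihood(ou_kernel(x), y)
lml_per = log_marginal_likelihood(per_kernel(x), y)
print(lml_ou, lml_per)   # the periodic kernel should score higher here
```

This is the quantity behind the slide's "marginal likelihood" arrow; the health warnings apply once the likelihood is non-Gaussian and this integral must be approximated.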
Scaling Gaussian Processes to Large Datasets
309 / 335
Motivation: Gaussian Process Regression

[figure: scattered input–output data with a query input marked "?"]

inference & learning → intractabilities: computational and analytic

310 / 335
A Brief History of Gaussian Process Approximations

Methods employing pseudo-data fall into two camps:
approximate generative model / exact inference: FITC, PITC, DTC
exact generative model / approximate inference: VFE, EP, PP

FITC: Snelson et al., "Sparse Gaussian Processes using Pseudo-inputs"
PITC: Snelson et al., "Local and global sparse Gaussian process approximations"
EP: Csató and Opper 2002 / Qi et al., "Sparse-posterior Gaussian Processes for general likelihoods"
VFE: Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes"
DTC / PP: Seeger et al., "Fast Forward Selection to Speed Up Sparse Gaussian Process Regression"

Unifying views:
A Unifying View of Sparse Approximate Gaussian Process Regression, Quiñonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)
A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, Bui, Yan and Turner, 2016 (VFE, EP, FITC, PITC, ...)

311 / 335
EP pseudo-point approximation

[diagram: the true posterior — likelihood factors times the prior, yielding the marginal likelihood and posterior — is replaced by an approximate posterior in which the true likelihoods are swapped for 'pseudo' data: their input locations, outputs and covariance define the exact joint of a new GP regression model]

312 / 335
EP algorithm

1. remove: take out one pseudo-observation likelihood → cavity
2. include: add in one true observation likelihood → tilted
3. project: project onto the approximating family (KL between unnormalised stochastic processes)
4. update: update the pseudo-observation likelihood (rank 1)

Properties of the projection:
1. minimum: moments matched at the pseudo-inputs
2. Gaussian regression: matches moments everywhere

313 / 335
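The four steps can be made concrete on a toy 1-D model. This is an illustrative reconstruction, not the lecture's sparse-GP implementation: a Gaussian prior times heavy-tailed (Cauchy) likelihood factors, with each site held as an unnormalised Gaussian in natural parameters and the moment match done by quadrature on a grid.

```python
import numpy as np

# Toy EP: p(theta) ∝ N(theta; 0, 1) * prod_i t_i(theta), t_i Cauchy-shaped.
y = np.array([-1.0, 0.5, 2.0])           # toy observations
grid = np.linspace(-10, 10, 4001)
dg = grid[1] - grid[0]

tau = np.zeros_like(y)                   # site precisions
nu = np.zeros_like(y)                    # site precision-weighted means
tau0, nu0 = 1.0, 0.0                     # Gaussian prior N(0, 1)

for sweep in range(20):
    for i in range(len(y)):
        # 1. remove: divide site i out of the approximation (cavity)
        tau_cav = tau0 + tau.sum() - tau[i]
        nu_cav = nu0 + nu.sum() - nu[i]
        # 2. include: multiply in the true likelihood (tilted distribution)
        log_tilt = (-0.5 * tau_cav * grid**2 + nu_cav * grid
                    - np.log1p((y[i] - grid) ** 2))   # Cauchy factor
        tilt = np.exp(log_tilt - log_tilt.max())
        Z = tilt.sum() * dg
        # 3. project: match first and second moments (min KL onto Gaussians)
        m = (grid * tilt).sum() * dg / Z
        v = ((grid - m) ** 2 * tilt).sum() * dg / Z
        # 4. update: new site = projected posterior / cavity
        tau[i] = 1.0 / v - tau_cav
        nu[i] = m / v - nu_cav

post_var = 1.0 / (tau0 + tau.sum())
post_mean = post_var * (nu0 + nu.sum())
print(post_mean, post_var)
```

In the sparse-GP case the moment match happens at the pseudo-inputs and each site update is rank 1, but the remove/include/project/update cycle is exactly this loop.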
Fixed points of EP = FITC approximation

This interpretation resolves open issues with FITC: why does it work so well? And are we allowed to increase M with N?

315 / 335
Power EP algorithm (as tractable as EP)

1. remove: take out a fraction of one pseudo-observation likelihood → cavity
2. include: add in a fraction of one true observation likelihood → tilted
3. project: project onto the approximating family (KL between unnormalised stochastic processes)
4. update: update the pseudo-observation likelihood (rank 1)

Properties of the projection:
1. minimum: moments matched at the pseudo-inputs
2. Gaussian regression: matches moments everywhere

317 / 335
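Relative to plain EP, Power EP changes only three things: the cavity removes a fraction α of the site, the tilted distribution includes the likelihood raised to the power α, and the site update is rescaled by 1/α. A sketch on the same illustrative 1-D toy model as before (not the lecture's sparse-GP code); α = 1 recovers EP, and in the sparse-GP setting α → 0 corresponds to the VFE solution.

```python
import numpy as np

def power_ep(y, alpha, sweeps=30):
    # Gaussian prior N(0, 1) times Cauchy-shaped likelihood factors.
    grid = np.linspace(-10, 10, 4001)
    dg = grid[1] - grid[0]
    tau = np.zeros_like(y)               # site precisions
    nu = np.zeros_like(y)                # site precision-weighted means
    tau0, nu0 = 1.0, 0.0                 # prior N(0, 1)
    for _ in range(sweeps):
        for i in range(len(y)):
            # 1. remove a *fraction* alpha of pseudo-observation i (cavity)
            tau_cav = tau0 + tau.sum() - alpha * tau[i]
            nu_cav = nu0 + nu.sum() - alpha * nu[i]
            # 2. include a fraction alpha of the true likelihood (tilted)
            log_tilt = (-0.5 * tau_cav * grid**2 + nu_cav * grid
                        - alpha * np.log1p((y[i] - grid) ** 2))
            tilt = np.exp(log_tilt - log_tilt.max())
            Z = tilt.sum() * dg
            # 3. project: moment match (KL projection onto Gaussians)
            m = (grid * tilt).sum() * dg / Z
            v = ((grid - m) ** 2 * tilt).sum() * dg / Z
            # 4. update the site, rescaling the change by 1/alpha
            tau[i] = (1.0 / v - tau_cav) / alpha
            nu[i] = (m / v - nu_cav) / alpha
    var = 1.0 / (tau0 + tau.sum())
    return var * (nu0 + nu.sum()), var   # posterior mean, variance

print(power_ep(np.array([-1.0, 0.5, 2.0]), alpha=1.0))   # plain EP
print(power_ep(np.array([-1.0, 0.5, 2.0]), alpha=0.5))   # intermediate
```

The per-update cost is unchanged, which is why the slide calls Power EP "as tractable as EP".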
Power EP: a unifying framework

Special cases: FITC (Csató and Opper, 2002; Snelson and Ghahramani, 2005) and VFE (Titsias, 2009).

[table: GP regression and classification methods organised by inference scheme (VFE / PEP / EP), covering FITC, PITC, inter-domain and structured variants]

[4] Quiñonero-Candela et al., 2005; [5] Snelson et al., 2005; [6] Snelson, 2006; [7] Schwaighofer, 2002; [8] Titsias, 2009; [9] Csató, 2002; [10] Csató et al., 2002; [11] Seeger et al., 2003; [12] Naish-Guzman et al., 2007; [13] Qi et al., 2010; [14] Hensman et al., 2015; [15] Hernández-Lobato et al., 2016; [16] Matthews et al., 2016; [17] Figueiras-Vidal et al., 2009
* = optimised pseudo-inputs; ** = structured versions of VFE recover VFE

318–319 / 335
How should I set the power parameter α?

Classification: 6 UCI classification datasets, 20 random splits, M = 10, 50, 100; hypers and inducing inputs optimised.
Regression: 8 UCI regression datasets, 20 random splits, M = 0–200; hypers and inducing inputs optimised.

[plots: average error rank and log-loss rank against α ∈ [0, 1] for regression and classification]

α = 0.5 does well on average

320 / 335
Deep Gaussian Processes for Regression
321 / 335
Pros and cons of Gaussian Process Regression

Probabilistic ✓
we need well-calibrated predictive uncertainty (decision making), even for big models with small data; neural networks do not provide well-calibrated uncertainty estimates

Non-parametric (∞-parametric) ✓
model parameters are an information bottleneck between training and test data:
training data → θ → test data
an unbounded model (θ grows with the data) is pruned by the data

Deep (and wide) ✗
real-world data have intrinsic hierarchical structure; it is simpler to learn a series of weak non-linearities

322 / 335
From Gaussian Processes to Deep Gaussian Processes

[figure: build-up of a deep GP — the input is passed through a GP draw, whose output becomes the input to the next GP draw (one with a long lengthscale, one short)]

DGP = a multi-layer neural network with infinitely wide hidden layers

323 / 335
Deep Gaussian Processes

fl ∼ GP(0, k(·, ·))
yn = g(xn) = fL(fL−1(· · · f2(f1(xn)))) + εn
hL−1,n := fL−1(· · · f1(xn)),  yn = fL(hL−1,n) + εn

Deep GPs [a] are a multi-layer generalisation of Gaussian processes, equivalent to deep neural networks with infinitely wide hidden layers.

Questions:
How do we perform inference and learning tractably?
How do Deep GPs compare to alternatives, e.g. Bayesian neural networks?

[a] Damianou and Lawrence (2013) [unsupervised learning]

[graphical model: xn → h1,n → h2,n → yn via f1, f2, f3, with a plate over N]

324 / 335
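The generative process yn = fL(fL−1(· · · f1(xn))) + εn can be sampled directly: draw a function from each layer's GP on the current inputs and feed its values forward. A minimal sketch of one draw from the prior (squared-exponential kernels and the lengthscales below are illustrative assumptions, not the lecture's setup):

```python
import numpy as np

def se_kernel(a, b, sigma=1.0, ell=1.0):
    # squared exponential: k(a, b) = sigma^2 exp(-(a - b)^2 / (2 ell^2))
    d2 = (a[:, None] - b[None, :]) ** 2
    return sigma**2 * np.exp(-0.5 * d2 / ell**2)

def sample_gp(x, ell, rng):
    # one draw from GP(0, k) at the points x (jitter keeps Cholesky stable)
    K = se_kernel(x, x, ell=ell) + 1e-6 * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(x))

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
h = x.copy()
for ell in [1.0, 0.5, 1.0]:              # three layers, varying lengthscales
    h = sample_gp(h, ell, rng)           # h_l = f_l(h_{l-1})
y = h + 0.05 * rng.standard_normal(len(x))   # observation noise eps_n
print(y.shape)   # (200,)
```

Sampling forwards is easy; the hard part, as the slide's questions note, is inverting this composition to do inference and learning tractably.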
Approximate inference for (Deep) Gaussian Processes

[table: GP regression and classification methods organised by inference scheme (VFE / PEP / EP), covering FITC, PITC, inter-domain and structured variants — as on pages 318–319]

325 / 335
Experiment: Value function of the mountain car problem

[figure: car on a hill; reaching the goal gives r = 10, each step r = −1; the state is position and velocity]
[surface plots over (x1, x2): the true value function, the GP fit, and the DGP fit]

326 / 335
Experiment: Comparison to Bayesian neural networks

Compare DGPs with GPs and Bayesian neural networks with one and two hidden layers using:

VI(G): Graves' VI [diagonal Gaussian, without the reparam. trick]
VI(KW): Kingma and Welling's VI [with the reparam. trick]
PBP: ADF with Probabilistic Backpropagation
Dropout: combining dropout predictions at test time
SGLD: stochastic gradient Langevin dynamics
HMC: Hamiltonian Monte Carlo [only for small networks]

327 / 335
Experiment: Comparison to Bayesian neural networks

Rankings of all methods across all datasets

[plot: average MLL rank (1–15, with critical difference CD) for PBP-1, Dropout-1, SGLD-1, DGP-1 50, SGLD-2, VI(KW)-1, GP 50, DGP-3 100, DGP-2 100, DGP-3 50, DGP-2 50, GP 100, HMC-1, VI(KW)-2, DGP-1 100]

328 / 335
Experiment: Comparison to Bayesian neural networks [Best results]

[plots: average test log-likelihood/nats for BNN-deterministic, BNN-sampling, GP and DGP on ten UCI datasets — boston (N = 506, D = 13), concrete (N = 1030, D = 8), energy (N = 768, D = 8), kin8nm (N = 8192, D = 8), naval (N = 11934, D = 16), power (N = 9568, D = 4), protein (N = 45730, D = 9), red wine (N = 1588, D = 11), yacht (N = 308, D = 6), year (N = 515345, D = 90)]

329 / 335
333 / 335
Pros and cons of Gaussian Process Regression
Probabilistic Xneed well-calibrated predictive uncertainty (decision making)big models (even with small data)neural networks do not provide well calibrated uncertainty estimates
Non-parametric (∞ - parametric) Xmodel parameters: information bottleneck between training and test data
training data→ θθθ → test data
unbounded model (θθθ grows with data) pruned by data
Deep (and wide) Xintrinsic hierarchical structure in real world datasimpler to learn a series of weak non-linearities
334 / 335
References (hyperlinked)

Deep Gaussian Processes:
Deep Gaussian Processes for Regression using Approximate Expectation Propagation, ICML 2016

Approximate inference in GPs:
A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, arXiv preprint 2016

Scalable approximate inference:
Stochastic Expectation Propagation, NIPS 2015
Black-box α-divergence Minimization, ICML 2016

335 / 335