Advances in using GPs with derivative observations
Gaussian Process approximations 2017 workshop, May 30, 2017
by Eero Siivola¹, joint work with Aki Vehtari¹, Juho Piironen¹, Javier González², Jarno Vanhatalo³ and Olli-Pekka Koistinen¹
¹Aalto University, Finland; ²Amazon, Cambridge, UK; ³University of Helsinki, Finland
Contents of this talk
- Theory behind GPs + derivatives
- GP-NEB
- Automatic monotonicity detection with GPs
- Bayesian optimization with derivative sign information
Theory: GP + derivative observations
How to use (partial) derivatives with GPs? We need to consider two parts:
- Covariance function
- Likelihood function
- Posterior -> inference method
Covariance function
Nice property (see e.g. Papoulis [1991, ch. 10]):

$$\operatorname{cov}\!\left(\frac{\partial f^{(1)}}{\partial x^{(1)}_g},\, f^{(2)}\right) = \frac{\partial}{\partial x^{(1)}_g}\operatorname{cov}\!\left(f^{(1)}, f^{(2)}\right) = \frac{\partial}{\partial x^{(1)}_g}\, k\!\left(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}\right)$$

and:

$$\operatorname{cov}\!\left(\frac{\partial f^{(1)}}{\partial x^{(1)}_g},\, \frac{\partial f^{(2)}}{\partial x^{(2)}_h}\right) = \frac{\partial^2}{\partial x^{(1)}_g\, \partial x^{(2)}_h}\operatorname{cov}\!\left(f^{(1)}, f^{(2)}\right) = \frac{\partial^2}{\partial x^{(1)}_g\, \partial x^{(2)}_h}\, k\!\left(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}\right)$$
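This identity is easy to sanity-check numerically. Below is a minimal sketch (not from the talk; the squared-exponential kernel and all parameter values are assumptions) comparing the analytic partial derivative of the kernel against a finite-difference estimate:

```python
import numpy as np

def k_se(x1, x2, sigma2=1.0, ell=1.0):
    """Squared-exponential kernel k(x1, x2)."""
    return sigma2 * np.exp(-0.5 * np.sum((x1 - x2) ** 2) / ell ** 2)

def dk_dx1g(x1, x2, g, sigma2=1.0, ell=1.0):
    """Analytic partial derivative of k with respect to x1[g]."""
    return -k_se(x1, x2, sigma2, ell) * (x1[g] - x2[g]) / ell ** 2

x1, x2, g, eps = np.array([0.3, -0.7]), np.array([1.0, 0.2]), 0, 1e-6
x1p = x1.copy()
x1p[g] += eps
fd = (k_se(x1p, x2) - k_se(x1, x2)) / eps   # finite-difference estimate
print(dk_dx1g(x1, x2, g), fd)               # the two agree closely
```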
Let $X = [\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}]^T$ and $\tilde{X} = [\tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(m)}]^T$ be the points where we observe function values and partial derivative values, respectively. The covariance between the latent function values $\mathbf{f}_X = [f^{(1)}, \ldots, f^{(n)}]^T$ and the latent function derivative values $\mathbf{f}'_{\tilde{X}} = \left[\frac{\partial f^{(1)}}{\partial \tilde{x}^{(1)}_g}, \ldots, \frac{\partial f^{(m)}}{\partial \tilde{x}^{(m)}_g}\right]^T$ is:

$$K_{X,\tilde{X}} =
\begin{bmatrix}
\frac{\partial}{\partial \tilde{x}^{(1)}_g}\operatorname{cov}\big(f^{(1)}, f^{(1)}\big) & \cdots & \frac{\partial}{\partial \tilde{x}^{(m)}_g}\operatorname{cov}\big(f^{(1)}, f^{(m)}\big) \\
\vdots & \ddots & \vdots \\
\frac{\partial}{\partial \tilde{x}^{(1)}_g}\operatorname{cov}\big(f^{(n)}, f^{(1)}\big) & \cdots & \frac{\partial}{\partial \tilde{x}^{(m)}_g}\operatorname{cov}\big(f^{(n)}, f^{(m)}\big)
\end{bmatrix}
= K_{\tilde{X},X}^T$$
And between the latent function derivative values $\mathbf{f}'_{\tilde{X}}$ and $\mathbf{f}'_{\tilde{X}}$:

$$K_{\tilde{X},\tilde{X}} =
\begin{bmatrix}
\frac{\partial^2}{\partial \tilde{x}^{(1)}_g \partial \tilde{x}^{(1)}_g}\operatorname{cov}\big(f^{(1)}, f^{(1)}\big) & \cdots & \frac{\partial^2}{\partial \tilde{x}^{(1)}_g \partial \tilde{x}^{(m)}_g}\operatorname{cov}\big(f^{(1)}, f^{(m)}\big) \\
\vdots & \ddots & \vdots \\
\frac{\partial^2}{\partial \tilde{x}^{(m)}_g \partial \tilde{x}^{(1)}_g}\operatorname{cov}\big(f^{(m)}, f^{(1)}\big) & \cdots & \frac{\partial^2}{\partial \tilde{x}^{(m)}_g \partial \tilde{x}^{(m)}_g}\operatorname{cov}\big(f^{(m)}, f^{(m)}\big)
\end{bmatrix}$$
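To make the block structure concrete, here is a minimal sketch (assumptions: one input dimension, squared-exponential kernel with unit signal variance) that assembles the full joint covariance over $[\mathbf{f}_X;\, \mathbf{f}'_{\tilde{X}}]$ from the three kinds of blocks above:

```python
import numpy as np

def k(A, B, ell=1.0):
    """SE kernel matrix between 1-D point sets A and B."""
    D = A[:, None] - B[None, :]
    return np.exp(-0.5 * D ** 2 / ell ** 2)

def dk(A, B, ell=1.0):
    """d k(a, b) / d b, i.e. derivative w.r.t. the second argument."""
    D = A[:, None] - B[None, :]
    return k(A, B, ell) * D / ell ** 2

def ddk(A, B, ell=1.0):
    """d^2 k(a, b) / (da db)."""
    D = A[:, None] - B[None, :]
    return k(A, B, ell) / ell ** 2 * (1 - D ** 2 / ell ** 2)

def joint_cov(X, Xt, ell=1.0):
    """Covariance of the stacked latent vector [f_X; f'_Xt]."""
    return np.block([[k(X, X, ell),     dk(X, Xt, ell)],
                     [dk(X, Xt, ell).T, ddk(Xt, Xt, ell)]])

X, Xt = np.array([0.0, 1.0, 2.0]), np.array([0.5, 1.5])
C = joint_cov(X, Xt)
# a valid covariance: symmetric and positive semidefinite
print(np.allclose(C, C.T), np.linalg.eigvalsh(C).min() > -1e-9)
```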
Likelihood function

Observations are assumed independent given the latent function values:

$$p(\mathbf{y}, \mathbf{y}' \mid \mathbf{f}_X, \mathbf{f}'_{\tilde{X}}) = \left(\prod_{i=1}^{n} p\big(y^{(i)} \mid f^{(i)}\big)\right)\left(\prod_{i=1}^{m} p\!\left(\frac{\partial y^{(i)}}{\partial \tilde{x}^{(i)}_g} \,\middle|\, \frac{\partial f^{(i)}}{\partial \tilde{x}^{(i)}_g}\right)\right)$$

How to select the likelihood of the derivatives?
- If direct derivative values can be observed: Gaussian likelihood
- If we only have a hint about the direction: probit likelihood with a tuning parameter (Riihimäki and Vehtari, 2010):

$$p\!\left(\frac{\partial y^{(i)}}{\partial \tilde{x}^{(i)}_g} \,\middle|\, \frac{\partial f^{(i)}}{\partial \tilde{x}^{(i)}_g}\right) = \Phi\!\left(\frac{\partial f^{(i)}}{\partial \tilde{x}^{(i)}_g}\cdot\frac{1}{\nu}\right), \quad \text{where}\quad \Phi(a) = \int_{-\infty}^{a} \mathcal{N}(x \mid 0, 1)\, dx$$
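A small numerical sketch of this sign likelihood (the values of ν are just for illustration): as ν shrinks, Φ(f′/ν) approaches a hard step function, which is exactly what the figure below shows.

```python
import numpy as np
from scipy.stats import norm

def probit_sign_likelihood(df, nu):
    """p(positive derivative sign observed | latent derivative df)."""
    return norm.cdf(df / nu)

df = np.linspace(-3, 3, 7)
print(probit_sign_likelihood(df, nu=1.0))    # smooth, gradual step
print(probit_sign_likelihood(df, nu=1e-4))   # nearly a hard 0/1 step
```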
[Figure: The probit likelihood Φ(f′/ν) as a function of f′, for ν = 1 (a smooth step from 0 to 1) and for ν = 1 × 10⁻⁴ (a nearly hard step at 0).]
Posterior distribution

The posterior distribution of the joint values:

$$p(\mathbf{f}, \mathbf{f}' \mid \mathbf{y}, \mathbf{y}', X, \tilde{X}) = \frac{1}{Z}\, p(\mathbf{f}, \mathbf{f}' \mid X, \tilde{X}) \left(\prod_{i=1}^{n} p\big(y^{(i)} \mid f^{(i)}\big)\right) \left(\prod_{i=1}^{m} p\!\left(\frac{\partial y^{(i)}}{\partial \tilde{x}^{(i)}_g} \,\middle|\, \frac{\partial f^{(i)}}{\partial \tilde{x}^{(i)}_g}\right)\right)$$

The different parts:
- $p(\mathbf{f}, \mathbf{f}' \mid X, \tilde{X})$ is Gaussian
- the $p(y^{(i)} \mid f^{(i)})$ are Gaussian
- the $p\big(\partial y^{(i)}/\partial \tilde{x}^{(i)}_g \mid \partial f^{(i)}/\partial \tilde{x}^{(i)}_g\big)$ are Gaussian or probit

The posterior distribution is therefore either Gaussian or of the same form as in classification problems:
- We might need posterior approximation methods
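In the all-Gaussian case everything stays closed form. Continuing the sketch from the covariance slide (it reuses the `k`, `dk` and `joint_cov` helpers defined there; the noise level and example points are assumptions), the posterior mean is ordinary GP regression on the stacked vector [y; y′]:

```python
import numpy as np

def posterior_mean(Xs, X, y, Xt, yd, ell=1.0, s2=1e-4):
    """Posterior mean at test points Xs, given function values y at X and
    direct derivative observations yd at Xt (i.i.d. noise variance s2)."""
    C = joint_cov(X, Xt, ell) + s2 * np.eye(len(X) + len(Xt))
    Kcross = np.hstack([k(Xs, X, ell), dk(Xs, Xt, ell)])  # cov(f(Xs), [f; f'])
    return Kcross @ np.linalg.solve(C, np.concatenate([y, yd]))

# e.g. one value observation at x = 0 and a positive slope observed at x = 1
Xs = np.linspace(-1.0, 2.0, 7)
print(posterior_mean(Xs, np.array([0.0]), np.array([0.0]),
                     np.array([1.0]), np.array([1.5])))
```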
Saddle point search using GPs + derivative observations
- The properties of the system can be described by an energy surface
- Finding a minimum energy path and the saddle point between two states is useful when determining the properties of transitions
Nudged elastic band (NEB)
- Starting from an initial guess, the idea is to move the images downwards on the energy surface but keep them evenly spaced
- The images are moved along a force vector, which is the resultant of two components:
  - the (negative) energy gradient component perpendicular to the path
  - a spring force parallel to the path, which tends to keep the images evenly spaced
- The convergence of NEB may require hundreds or thousands of iterations
- Each iteration requires evaluating the energy gradient for all images, which is often a time-consuming operation
Speedup of NEB
- Repeat until convergence (a high-level sketch follows after this list):
  1. Evaluate the energy (and forces) at the images of the current path
  2. If the path has not converged, approximate the energy surface using machine learning based on the observations so far
  3. Find the predicted minimum energy path on the approximate surface and go to 1
- The details are in the paper by Peterson (2016)
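A high-level sketch of that loop (all helper names here are hypothetical placeholders, not the actual code of Peterson (2016)):

```python
def ml_accelerated_neb(path, evaluate_energy_and_forces, fit_surrogate,
                       relax_on_surrogate, converged, max_outer_iters=100):
    """Outer loop: expensive evaluations -> surrogate fit -> cheap relaxation."""
    observations = []
    for _ in range(max_outer_iters):
        # 1. expensive: true energy and forces at the current images
        observations += [(x, *evaluate_energy_and_forces(x)) for x in path]
        if converged(path, observations):
            return path
        # 2. approximate the energy surface from all observations so far
        surrogate = fit_surrogate(observations)
        # 3. cheap: relax the band to the predicted minimum energy path
        #    on the surrogate surface, then go back to step 1
        path = relax_on_surrogate(surrogate, path)
    return path
```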
Speedup of NEB with GP and derivatives
- Evaluate the energy (and forces) only at the image with the highest uncertainty
- Re-approximate the energy surface and find a new MEP guess after each image evaluation
- Convergence check:
  - If the magnitude of the force (exact or approximated) is below the convergence limit for all images, we do not move the path but evaluate more images, until the convergence limit is exceeded somewhere or all images have been evaluated
  - If we manage to evaluate all images without moving the path, we know for sure that the path has converged
- The details are in the paper by Koistinen, Maras, Vehtari and Jónsson (2016)
- When evaluating the transition rates, the Hessian at the minimum points needs to be evaluated at some phase
- This information can be used to improve the GP approximation, especially in the beginning, when there is little information
Comparison of methods in heptamer case study
Automatic monotonicity detection
- Derivative sign information can be used to find monotonic input-output directions
- The basic idea:
  - Add virtual derivative sign observations to the GP model
  - See if the additions affect the probability of the data
  - The dimension is monotonic if they do not
- The details are in the paper by Siivola, Piironen and Vehtari (2016)
Theoretical background
Energy comparison:

$$E(\mathbf{y}, \mathbf{y}' \mid X, X_m) = -\log p(\mathbf{y}, \mathbf{y}' \mid X, X_m) = -\log\Big(p(\mathbf{y} \mid X)\,\overbrace{p(\mathbf{y}' \mid \mathbf{y}, X, X_m)}^{\approx 1}\Big) \approx E(\mathbf{y} \mid X).$$
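As a sketch in code (the `log_marginal_likelihood` helper and the tolerance are hypothetical placeholders; the actual criterion is in Siivola, Piironen and Vehtari (2016)), the rule is: keep the virtual sign observations for a dimension only when they barely change the energy.

```python
def detect_monotonic_dims(X, y, log_marginal_likelihood, tol=0.1):
    """Flag input dimensions whose virtual derivative sign observations
    leave the energy -log p(y, y' | X, X_m) essentially unchanged."""
    base = -log_marginal_likelihood(X, y, monotonic_dims=[])
    monotonic = []
    for d in range(X.shape[1]):
        energy = -log_marginal_likelihood(X, y, monotonic_dims=[d])
        # p(y' | y, X, X_m) ~ 1  <=>  the energy barely changes
        if abs(energy - base) < tol:
            monotonic.append(d)
    return monotonic
```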
[Figure: Change in energy in reality as a function of the number of virtual derivative sign observations, for a GP with the monotonicity assumption and for a regular GP.]
Using automatic monotonicity detection in modelling
- Monotonic dimensions can be detected from the data and used in modelling
- The method improves the modelling results, especially near the borders
Experiment
- Six different functions of varying monotonicity
- Different amounts of noise added to the training samples (signal-to-noise ratio (SNR) between 0 and 1)
- Measure the log predictive posterior density (lppd) of samples from a hold-out set that resembles the 20 % bordermost samples of the training data:

$$\text{lppd} = \sum_{i=1}^{L} \log \int p(y_i \mid f)\, p_{\text{post}}(f \mid x_i)\, df$$

- Do this 200 times for each of three different models (a Monte Carlo sketch of the lppd metric follows after this list):
  - Use fixed monotonicity
  - Use monotonicity only if it does not change the energy (adaptive monotonicity)
  - Use a model without derivative observations
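A Monte Carlo sketch of the lppd metric above (the Gaussian observation model and the draw-based approximation of the integral are assumptions):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def lppd(y_test, f_draws, s2):
    """lppd ~ sum_i log( mean_s p(y_i | f_draws[s, i]) ).
    y_test: (L,) held-out targets; f_draws: (S, L) posterior draws of the
    latent function at the held-out inputs; s2: noise variance."""
    S = f_draws.shape[0]
    logp = norm.logpdf(y_test[None, :], loc=f_draws, scale=np.sqrt(s2))
    return np.sum(logsumexp(logp, axis=0) - np.log(S))
```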
Results

[Figure: ΔLPPD between the baseline and the named method (y axis) as a function of SNR (x axis), in two rows (N = 30 and N = 90 training samples) and six columns for the test functions f(x) = −x, f(x) = 0, f(x) = x, f(x) = 1/(1 + e^{−0.5x}), f(x) = 1/(1 + e^{−x}) and f(x) = 1/(1 + e^{−2x}); separate curves for adaptive monotonicity and fixed monotonicity.]
Multidimensional experiment
Diabetes data¹:
- Target value: a measure of diabetes progression one year after baseline
- 10 dimensions
- Detect monotonic dimensions and use them if needed

¹ Diabetes data, available at: http://web.stanford.edu/~hastie/Papers/LARS/diabetes.data
Results

[Figure: Target value as a function of each single covariate (age, sex, body mass index (bmi), mean arterial pressure (map), triglycerides (tc), low-density lipoprotein (ldl), high-density lipoprotein (hdl), total cholesterol (tch), low-tension glaucoma (ltg), fasting blood glucose (glu)) while the other covariates are kept at the median of the dataset. Solid black lines correspond to the regular GP mean and 90 % posterior central interval. Red dashed lines correspond to the AMD GP mean and standard deviation, with body mass index and low-tension glaucoma detected as increasing. The black dashed line corresponds to the largest value of the covariate.]
Bayesian optimization with virtual derivative sign observations

Bayesian optimization (BO):
- A global optimization strategy designed to find the minimum of expensive black-box functions:
  1. Fit a GP to the available dataset X, y
  2. Evaluate the function at a new location based on some acquisition function
  3. If the stopping criterion is not met, go to 1
- Usually the search space is selected so that the minimum is not on the border
- Over-exploration of the edges is a typical problem
Figure: Over-exploration of the edges, visualized with LCB as the acquisition function. Circles are initial samples and crosses are acquisitions.
Fixing over-exploration with derivative sign observations
- By adding virtual derivative sign observations to the borders, the over-exploration problem can be solved (see the sketch after this list):
  1. Fit a GP to the available dataset X, y
  2. Find a new location based on some acquisition function
  3. If the new location is at the border: add a derivative sign observation to the border
  4. Else: add the new location
  5. If the stopping criterion is not met, go to 1
- The details are in the paper by Siivola, Vehtari, Vanhatalo and González (2017)
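A sketch of that loop (`fit_gp` and `acquire` are hypothetical placeholders, not the actual implementation of Siivola et al. (2017)); the virtual sign says the function increases toward the border, so a minimum cannot hide there:

```python
import numpy as np

def bo_with_virtual_derivatives(f, lo, hi, fit_gp, acquire, n_iter=50):
    """lo, hi: per-dimension box bounds of the search space."""
    X, y, virtual = [], [], []           # real evaluations and virtual signs
    for _ in range(n_iter):
        gp = fit_gp(X, y, virtual)       # GP with probit likelihood on signs
        x_new = acquire(gp, lo, hi)      # optimize the acquisition function
        at_lo, at_hi = np.isclose(x_new, lo), np.isclose(x_new, hi)
        if at_lo.any() or at_hi.any():
            # at the border: add derivative sign observations pointing
            # "uphill" toward the border instead of evaluating f there
            virtual += [(x_new, d, -1) for d in np.where(at_lo)[0]]
            virtual += [(x_new, d, +1) for d in np.where(at_hi)[0]]
        else:
            X.append(x_new); y.append(f(x_new))  # ordinary acquisition
    return X, y
```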
Figure: GP prior and acquisition functions for a one-dimensional space: a) and c) without virtual derivative sign observations, b) and d) with derivative sign observations.
Experiments
Metrics for comparing the performance of two BO algorithms:
- Percentual minimum difference (PMD): designed to compare the absolute performance of the algorithms; intuitively, it measures the difference between the best values found by the two algorithms.
- Percentual hit difference (PHD): created for comparing the speeds of the algorithms; intuitively, it measures the difference in how fast the two algorithms are able to find good enough values.
- Percentual border hit difference (PBHD): assuming that the minimum is not near the border, PBHD gives the scaled difference in unnecessary samples taken near the borders.
- Average evaluation distance difference (AED): intuitively, AED measures the overall performance of the algorithm before finding the minimum.
- Virtual derivative observations per dimension (VDO): larger VDOs are worse, since they increase the computational burden of the algorithm; GPs scale as O((n + q)³).

The interpretation of the magnitudes of PMD, PHD and PBHD: negative values mean that the proposed method is better; the values are always scaled between −1 and 1, and the further the value is from 0, the bigger the difference between the two methods.
Experiment 1
Experiment 2
- 100 d-dimensional multivariate normal distribution functions, for d = 1, ..., 11
- Different amounts of noise added to the functions
- BO and BO with derivatives run for 100 acquisitions
Results
Experiment 3
- Same as in experiment 2, but for the sigopt function dataset²
- 113 functions, from 1 to 11 dimensions

² Dataset available at: https://github.com/sigopt/sigopt-examples
Results
Summary
Derivatives can be used with GPs in many new ways:
- To improve the accuracy of GPs in the simulation of energy surfaces
- To automatically find monotonic dimensions in data
- To fix the border over-exploration problem of BO
Questions?
- email: [email protected]
References
- Papoulis, A. (1991). Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, third edition.
- Riihimäki, J. and Vehtari, A. (2010). "Gaussian processes with monotonicity information." In Proceedings of AISTATS 2010, vol. 9, pp. 645-652.
- Peterson, A. A. (2016). "Acceleration of saddle-point searches with machine learning." J. Chem. Phys., 145, 074106.
- Koistinen, O.-P., Maras, E., Vehtari, A. and Jónsson, H. (2016). "Minimum energy path calculations with Gaussian process regression." Nanosystems: Physics, Chemistry, Mathematics, 7(6), pp. 925-935.
- Siivola, E., Piironen, J., and Vehtari, A. (2016). "Automatic monotonicity detection for Gaussian processes." arXiv: https://arxiv.org/abs/1610.05440
- Siivola, E., Vehtari, A., Vanhatalo, J., and González, J. (2017). "Bayesian optimization with virtual derivative sign observations." arXiv: https://arxiv.org/abs/1704.00963