
Scaling Gaussian Process Regression with Derivatives

Kun Dong¹  David Eriksson¹  Eric Hans Lee²  David Bindel²  Andrew Gordon Wilson³
Applied Math¹, CS², ORIE³

Gaussian Processes (GPs)

- Multivariate normals are distributions over vectors.
- Gaussian processes are distributions over functions with a mean field $\mu : \mathbb{R}^d \to \mathbb{R}$ and a covariance kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$.
- $f \sim \mathcal{GP}(\mu, k)$ means that for any $X = (x_1, \ldots, x_n)$, $x_i \in \mathbb{R}^d$:
  $$f_X \sim \mathcal{N}(\mu_X, K_{XX}), \quad \text{where} \quad (f_X)_i = f(x_i), \quad (\mu_X)_i = \mu(x_i), \quad (K_{XX})_{ij} = k(x_i, x_j),$$
  with $f_X, \mu_X \in \mathbb{R}^n$ and $K_{XX} \in \mathbb{R}^{n \times n}$. We write $K_{XX}$ as $K$ when unambiguous.

Figure: GP posterior

GP Regression with Derivatives

- Function values and derivatives can be modeled by a multi-output GP:
  $$\begin{bmatrix} f_X \\ \partial_X f_X \end{bmatrix} \sim \mathcal{N}\!\left(\mu^\nabla_X, K^\nabla_{XX}\right), \qquad
  \mu^\nabla(x) = \begin{bmatrix} \mu(x) \\ \partial_x \mu(x) \end{bmatrix}, \qquad
  k^\nabla(x, x') = \begin{bmatrix} k(x, x') & \partial_{x'} k(x, x')^T \\ \partial_x k(x, x') & \partial_x^2 k(x, x') \end{bmatrix}.$$
- With derivatives, we get a larger kernel matrix $K^\nabla_{XX} \in \mathbb{R}^{n(d+1) \times n(d+1)}$.
- Gradient information significantly improves regression accuracy.

Figure: Branin function approximated by GPs with SE kernel.
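As a concrete illustration, here is a minimal sketch (not the authors' implementation; the dense layout with function values stacked above gradient entries is an assumption) of how $k^\nabla$ assembles into $K^\nabla_{XX}$ for the squared exponential kernel $k(x, x') = s^2 \exp(-\|x - x'\|^2 / 2\ell^2)$:

```python
# Minimal sketch: dense K^nabla for the SE kernel k(x, x') = s^2 exp(-|x-x'|^2 / (2 l^2)).
# Layout assumption: rows/columns 0..n-1 hold function values, followed by the
# n*d gradient entries (point-major, then coordinate).
import numpy as np

def se_kernel_with_derivatives(X, lengthscale=1.0, outputscale=1.0):
    """X: (n, d) inputs. Returns K^nabla of shape (n*(d+1), n*(d+1))."""
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]                  # x_i - x_j, shape (n, n, d)
    K = outputscale**2 * np.exp(-0.5 * np.sum(diff**2, -1) / lengthscale**2)

    K_fg = (diff / lengthscale**2) * K[..., None]         # d/dx' k(x_i, x_j), shape (n, n, d)
    K_gf = -K_fg                                          # d/dx  k(x_i, x_j)
    K_gg = (np.eye(d)[None, None] / lengthscale**2
            - np.einsum('ija,ijb->ijab', diff, diff) / lengthscale**4
            ) * K[..., None, None]                        # d^2/(dx dx') k(x_i, x_j), (n, n, d, d)

    top = np.hstack([K, K_fg.reshape(n, n * d)])
    bottom = np.hstack([K_gf.transpose(0, 2, 1).reshape(n * d, n),
                        K_gg.transpose(0, 2, 1, 3).reshape(n * d, n * d)])
    return np.vstack([top, bottom])
```

A quick sanity check: `se_kernel_with_derivatives(np.random.rand(5, 2)).shape` gives `(15, 15)`, i.e. $n(d+1)$ with $n = 5$, $d = 2$, and the result is symmetric positive semidefinite up to round-off. Forming this dense matrix directly is exactly what becomes expensive as $n$ and $d$ grow.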

Kernel Learning with Derivatives

- The kernel depends on hyperparameters $\theta$, which are trained by maximizing the log-likelihood:
  $$\mathcal{L}(\theta \mid y^\nabla) = \mathcal{L}_{y^\nabla} + \mathcal{L}_{|K^\nabla|} - \frac{n(d+1)}{2}\log(2\pi),$$
  where, with $c$ defined by $K^\nabla c = y^\nabla - \mu^\nabla_X$,
  $$\mathcal{L}_{y^\nabla} = -\frac{1}{2}\,(y^\nabla - \mu^\nabla)^T c, \qquad
  \frac{\partial \mathcal{L}_{y^\nabla}}{\partial \theta_i} = \frac{1}{2}\, c^T \frac{\partial K^\nabla}{\partial \theta_i}\, c,$$
  $$\mathcal{L}_{|K^\nabla|} = -\frac{1}{2}\log\det K^\nabla, \qquad
  \frac{\partial \mathcal{L}_{|K^\nabla|}}{\partial \theta_i} = -\frac{1}{2}\operatorname{tr}\!\left( (K^\nabla)^{-1} \frac{\partial K^\nabla}{\partial \theta_i} \right).$$
- The naive approach computes the Cholesky factorization of $K^\nabla$.
- Challenges:
  - Direct methods lead to $O(n^3 d^3)$ training and $O(nd)$ prediction.
  - $K^\nabla$ is far more ill-conditioned than $K$.
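For reference, here is a minimal sketch of the naive Cholesky-based evaluation these formulas describe (an illustration, not the authors' code). It computes the log-likelihood and its gradient with respect to one hyperparameter at $O((n(d+1))^3)$ cost, which is what the fast MVM machinery below avoids:

```python
# Naive Cholesky evaluation of the log-likelihood and its gradient in one
# hyperparameter theta_i, given K^nabla, dK^nabla/dtheta_i, and the residual
# r = y^nabla - mu^nabla. Costs O(m^3) with m = n*(d+1).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_likelihood_and_grad(K, dK, r):
    m = K.shape[0]
    L, lower = cho_factor(K, lower=True)
    c = cho_solve((L, lower), r)                 # c = K^{-1} r
    data_fit = -0.5 * r @ c                      # L_y
    log_det = -np.sum(np.log(np.diag(L)))        # L_|K| = -1/2 log det K
    ll = data_fit + log_det - 0.5 * m * np.log(2 * np.pi)

    # dL/dtheta_i = 1/2 c^T dK c - 1/2 tr(K^{-1} dK)
    trace_term = np.trace(cho_solve((L, lower), dK))
    grad = 0.5 * c @ dK @ c - 0.5 * trace_term
    return ll, grad
```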

Main Ideas

- Apply structured kernel interpolation to enable fast MVMs with $K^\nabla$.
- Combine iterative methods and stochastic estimators to avoid direct computation, with preconditioning to improve convergence (a sketch of the stochastic trace estimator follows this list).
- Use active subspace learning to overcome the curse of dimensionality.
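To make the second point concrete, here is a minimal sketch of estimating the trace term $\operatorname{tr}\!\left((K^\nabla)^{-1}\, \partial K^\nabla / \partial \theta_i\right)$ with only matrix-vector products. It is a generic illustration of the technique, assuming Hutchinson (Rademacher) probes and CG solves, not the authors' implementation:

```python
# Stochastic trace estimation: tr(K^{-1} dK) ~= mean_z z^T K^{-1} dK z, where
# each z is a Rademacher probe and the solve uses CG, so only MVMs are needed.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hutchinson_trace(K_mv, dK_mv, m, num_probes=10, rng=None):
    """K_mv, dK_mv: callables returning K@v and dK@v for a length-m vector v."""
    rng = np.random.default_rng(rng)
    A = LinearOperator((m, m), matvec=K_mv)
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=m)      # Rademacher probe vector
        w, _ = cg(A, dK_mv(z))                   # w ~= K^{-1} dK z
        total += z @ w
    return total / num_probes                    # ~= tr(K^{-1} dK)
```

In practice the probe-vector solves converge far faster with the pivoted Cholesky preconditioner described further down the poster.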

D-SKI: Structured Kernel Interpolation with Derivatives

- Differentiate the interpolation weights to guarantee positive semidefiniteness:
  $$k(x, x') \approx \sum_i w_i(x)\, k(x_i, x') \quad\Longrightarrow\quad \nabla k(x, x') \approx \sum_i \nabla w_i(x)\, k(x_i, x').$$
- Use local quintic interpolation to get a sparse weight matrix and a grid-structured kernel:
  $$K^\nabla \approx W^\nabla K_{UU} (W^\nabla)^T = \begin{bmatrix} W \\ \partial W \end{bmatrix} K_{UU} \begin{bmatrix} W \\ \partial W \end{bmatrix}^T.$$
- Due to the sparsity of $W$ and $\partial W$, and fast MVMs with $K_{UU}$ via FFTs, the overall MVM complexity is $O(n d\, 6^d + m \log m)$.
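The approximation above turns one MVM with $K^\nabla$ into two sparse multiplies and one structured multiply on the grid. A minimal sketch (the names and the sparse-matrix interface are illustrative assumptions, not the authors' code):

```python
# D-SKI style matrix-vector product: v -> [W; dW] K_UU [W; dW]^T v, given sparse
# interpolation weights W, their derivatives dW, and a fast grid multiply kuu_mv.
import numpy as np
import scipy.sparse as sp

def dski_matvec(W, dW, kuu_mv, v):
    """W: (n, m) sparse, dW: (n*d, m) sparse, kuu_mv: u -> K_UU @ u, v: length n*(d+1)."""
    Wfull = sp.vstack([W, dW])          # [W; dW], shape (n*(d+1), m)
    u = Wfull.T @ v                     # interpolate data-space vector onto the grid
    u = kuu_mv(u)                       # structured multiply with K_UU (e.g., FFT-based)
    return Wfull @ u                    # interpolate back to the data points
```

Here `kuu_mv` would exploit the Toeplitz/Kronecker structure of $K_{UU}$ on a regular grid via FFTs, which is where the $m \log m$ term in the complexity comes from.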

D-SKIP: Structured Kernel Interpolation for Products with Derivatives

- For separable kernels, SKIP yields the approximation (with $\odot$ the elementwise product):
  $$K \approx (W_1 K_1 W_1^T) \odot (W_2 K_2 W_2^T) \odot \cdots \odot (W_d K_d W_d^T).$$
- We differentiate and iteratively apply the Lanczos decomposition to obtain
  $$K^\nabla \approx (Q_1 T_1 Q_1^T) \odot (Q_2 T_2 Q_2^T).$$
- For an effective kernel rank of $r$ at each step:
  - Construction cost: $O(d^2(n + m\log m + r^3 n \log d))$.
  - MVM cost: $O(d r^2 n)$.
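The MVM cost rests on a standard identity for Hadamard products of low-rank matrices; the sketch below is an assumption about how such a product can be applied given the final two Lanczos factors, not the authors' code:

```python
# Apply (Q1 T1 Q1^T) ⊙ (Q2 T2 Q2^T) to a vector without forming the n-by-n matrix:
# ((A ⊙ B) v)_i = q1_i^T T1 (Q1^T diag(v) Q2) T2^T q2_i, costing O(n r1 r2).
import numpy as np

def hadamard_lowrank_matvec(Q1, T1, Q2, T2, v):
    """Q1: (n, r1), T1: (r1, r1), Q2: (n, r2), T2: (r2, r2), v: (n,)."""
    M = Q1.T @ (v[:, None] * Q2)        # (r1, r2) core: Q1^T diag(v) Q2
    C = T1 @ M @ T2.T                   # (r1, r2)
    return np.einsum('ia,ab,ib->i', Q1, C, Q2)
```

Each application costs $O(n r_1 r_2)$ flops plus a small $r_1 \times r_2$ core, consistent with the quoted $O(d r^2 n)$ MVM scaling.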

Figure: A comparison of MVM scalings: MVM time versus matrix size (2,500 to 30,000) for exact SE ($O(n^2)$), SE SKI in 2D ($O(n)$), and SE SKIP in 11D ($O(n)$).

Preconditioning

- A (partial) pivoted Cholesky factorization $K^\nabla \approx F F^T$ provides a cheap and effective preconditioner $M = \sigma^2 I + F F^T$.
- The Sherman-Morrison-Woodbury formula for $M^{-1}$ is unstable in practice.
- Instead, we use $M^{-1} f = \sigma^{-2}\left(f - Q_1 (Q_1^T f)\right)$, where
  $$\begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R = \begin{bmatrix} F \\ \sigma I \end{bmatrix}$$
  is a QR factorization.
- The preconditioner is crucial for the convergence of iterative solvers on $K^\nabla$.
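A minimal sketch of this stable application (variable names are illustrative; the QR route avoids the cancellation that makes the direct Sherman-Morrison-Woodbury formula unstable):

```python
# Build and apply M^{-1} = (sigma^2 I + F F^T)^{-1} via a thin QR of [F; sigma I]:
# with F = Q1 R, the identity M^{-1} f = sigma^{-2} (f - Q1 Q1^T f) holds.
import numpy as np

def make_preconditioner(F, sigma):
    m, p = F.shape
    stacked = np.vstack([F, sigma * np.eye(p)])   # (m + p, p)
    Q, _ = np.linalg.qr(stacked, mode='reduced')  # thin QR
    Q1 = Q[:m]                                    # top block, shape (m, p)

    def apply_Minv(f):
        return (f - Q1 @ (Q1.T @ f)) / sigma**2   # M^{-1} f

    return apply_Minv
```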

Dimensionality Reduction via Active Subspace Learning

- Many high-dimensional problems are low-dimensional in practice.
- The active subspace spans the dominant eigenvectors of the covariance matrix
  $$C = \int_\Omega \nabla f(x)\, \nabla f(x)^T \, dx.$$
- The function can be well approximated by a GP on the reduced space using the kernel $k(P^T x, P^T x')$, where $P$ projects onto the active subspace.
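A minimal sketch of the standard recipe, offered as an assumption about the general technique rather than the authors' code: estimate $C$ by Monte Carlo from sampled gradients, then keep its top eigenvectors as the projection $P$ used in $k(P^T x, P^T x')$:

```python
# Estimate the active subspace from sampled gradients.
import numpy as np

def active_subspace(gradients, k):
    """gradients: (N, D) array of sampled gradients; returns P of shape (D, k)."""
    G = np.asarray(gradients)
    C = G.T @ G / G.shape[0]                 # Monte Carlo estimate of C
    eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]           # dominant k eigenvectors
```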

Experiment: Reconstructing Implicit Surfaces from Normals

Figure: (Left) Original surface. (Middle) Noisy surface. (Right) SKI reconstruction from the noisy surface.

- The Stanford bunny is a data set of 25,000 points and associated normals.
- We heavily pollute the data points and normals with noise, and accurately recover the underlying implicit surface from the noisy data.
- We use SKI with an induced grid of size 30 in each dimension.

Bayesian Optimization with Derivatives and Active Subspace Learning

while budget not exhausted do
    Calculate the active subspace projection P ∈ R^{D×d} using sampled gradients
    Optimize the acquisition function: u_{n+1} = argmax_u A(u), and set x_{n+1} = P u_{n+1}
    Sample the point x_{n+1}, value f_{n+1}, and gradient ∇f_{n+1}
    Update the data: D_{i+1} = D_i ∪ {x_{n+1}, f_{n+1}, ∇f_{n+1}}
    Update the GP hyperparameters for the gradient kernel k(P^T x, P^T x')
end while

- We test on 5D Ackley embedded in [−10, 15]^50 and 5D Rastrigin embedded in [−4, 5]^50.
- We pick a 2D active subspace at each iteration, using the D-SKI kernel and the expected improvement acquisition function in this lower-dimensional space; a skeleton of the loop is sketched below.
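Below is a skeleton of this loop, a sketch under stated assumptions rather than the authors' implementation: `fit_gp`, `maximize_acquisition`, and `objective_with_grad` are hypothetical user-supplied callables standing in for the D-SKI gradient GP fit, the expected improvement search, and the black-box objective.

```python
# Skeleton of BO with derivatives and active subspace learning. Only the
# active-subspace and projection logic is spelled out; the GP fit, acquisition
# search, and objective are hypothetical placeholders passed in by the caller.
import numpy as np

def bo_active_subspace(objective_with_grad, fit_gp, maximize_acquisition,
                       x0, d, budget):
    f0, g0 = objective_with_grad(x0)
    X, y, G = [x0], [f0], [g0]
    for _ in range(budget):
        Gmat = np.array(G)
        C = Gmat.T @ Gmat / len(G)                    # Monte Carlo estimate of C
        P = np.linalg.eigh(C)[1][:, ::-1][:, :d]      # (D, d) active subspace
        # Gradient GP on projected inputs; the chain rule projects gradients too.
        model = fit_gp(np.array(X) @ P, np.array(y), Gmat @ P)
        u_next = maximize_acquisition(model, d)       # e.g., expected improvement
        x_next = P @ u_next
        f_next, g_next = objective_with_grad(x_next)
        X.append(x_next); y.append(f_next); G.append(g_next)
    best = int(np.argmin(y))
    return X[best], y[best]
```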

Figure: BO convergence on (a) Ackley and (b) Rastrigin over 500 evaluations, comparing exact BO, BO with D-SKI, BFGS, and random sampling.

Discussion

- Gradient information is valuable for GP regression, but scalability is a problem.
- Our methods make GP computation with derivatives scalable through fast MVMs, preconditioning, and dimensionality reduction.
- Our approach builds upon structured interpolation, but extends to any differentiable MVM.
- BO with derivatives unifies global optimization and gradient-based local optimization.
- Implementation available at: https://github.com/ericlee0803/GP_Derivatives

https://people.cam.cornell.edu/~kd383/    [email protected]