Optimization of Machine Learning Hyperparameters
Dr. Frank Hutter
Head of Emmy Noether Research Group on Learning, Optimization, and Automated Algorithm Design
Computer Science Institute University of Freiburg, Germany
July 2014
Motivation
• The machine learning algorithms you have learned about had several degrees of freedom
– E.g., in neural networks: regularization, momentum, learning rate, number of layers, number of units, …
• So far, how have you been setting these in practice?
– Changing one parameter at a time
– Grid search
• Was this tedious? Time-consuming?
– Imagine you have millions of data points and each evaluation takes hours or days…
High-level Learning Goals
• After this module, you can …
– Effectively use modern hyperparameter optimization methods
– Explain the concept of over-fitting
– Describe what measures can be taken to avoid over-fitting
– Describe the core mechanisms of several types of hyperparameter optimization methods
– Reason about the pros and cons of using a particular hyperparameter optimization method for a particular problem
– Derive the mechanisms behind Bayesian optimization
Outline of Today’s Class
• Generalization to previously unseen data
• Overview of hyperparameter optimization methods
• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes
Learning and Generalization
• Much of supervised machine learning is about selecting a model from a given hypothesis space that
– Explains the seen data well
– Is likely to also work well for new data
• Example: Which model will describe new data better? The polynomial or the line?
Image source: Wikipedia
Occam’s razor (or Ockham’s razor)
“Numquam ponenda est pluralitas sine necessitate”
[Plurality must never be posited without necessity.]
• General problem solving principle
– In the absence of evidence to the contrary, prefer the simplest explanation.
– Adapted to machine learning: all else being equal, prefer the simplest model.
William of Ockham, 1287-1347, philosopher and theologian.
Image source: Wikipedia
Occam’s razor in practice
• We need to trade off model complexity and model fit
• Model fit
– E.g., likelihood of the data under the model: P(data|model)
– In general: some loss of the predictor on the training data
• Model complexity
– E.g., number of free parameters
– E.g., number of effective dimensions
– E.g., VC dimension [Vapnik & Chervonenkis, 1971]
• Use regularization to penalize complex models: minimize training loss + C * regularization cost
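A minimal sketch of the regularized objective above, assuming a one-parameter linear model y = w·x, mean squared error as the training loss, and an L2 penalty as the regularization cost. The function name and the toy data are illustrative and not part of the lecture.

```python
def regularized_loss(w, xs, ys, C):
    """Training loss + C * regularization cost for the model y = w * x."""
    # Training loss: mean squared error on the training data.
    training_loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    # Regularization cost (assumed here): squared weight, i.e. an L2 penalty.
    regularization_cost = w ** 2
    return training_loss + C * regularization_cost


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x, so w = 2 has zero training loss

unregularized = regularized_loss(2.0, xs, ys, C=0.0)
regularized = regularized_loss(2.0, xs, ys, C=0.5)
```

With C = 0 only the fit counts; with C > 0 the same weight is charged for its size, which is how the trade-off between fit and complexity enters the objective.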
Parameters vs. Hyperparameters
• Most machine learning algorithms optimize parameters under the hood
– E.g., weights in linear regression and neural networks
– E.g., deep learning: millions of parameters
• Standard approach: minimize training loss + C * regularization cost
– Using standard gradient-based optimizers
• Hyperparameters: decisions left to algorithm designer
– How complex a model to use?
– How to set C?
– How many layers/which structure of deep networks to use?
How to set the hyperparameters?
• We wish to achieve good generalization performance
• In practice, we need to try several values and empirically evaluate how well they generalize
– Train the model for a given hyperparameter setting
– Evaluate the model’s generalization performance
• Which data set should we use to evaluate the model’s generalization performance?
1. The same data set that we use all the time: all the data we have
2. We split the data we have available: use one part for training the model, another disjoint part for evaluating generalization
Interactive question
• Which data set should we use to evaluate a model’s generalization performance empirically?
– We split the data we have available: use one part for training the model, another disjoint part for evaluating performance
• Why?
– The assumption we make is that future data will come from the same “true” distribution as our current data.
– Then, using an unseen sample of that distribution gives us an unbiased estimate of generalization to future data
– If our assumption is false, then we must control for concept drift … a topic for another lecture ;-)
Overfitting & early stopping heuristic
• Too little data / too little regularization:
– The error on the training data keeps on decreasing
– After too much training, the error on separate validation data starts to increase
• Early stopping heuristic: stop training at that point
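The heuristic can be sketched as follows, assuming the validation error is recorded once per epoch and that training stops after a fixed number of epochs without improvement. The `patience` parameter and the error curve are hypothetical choices for illustration.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return the index of the epoch whose model should be kept.

    Stops once the validation error has not improved for `patience`
    consecutive epochs and returns the best epoch seen up to that point.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


validation_curve = [0.50, 0.40, 0.35, 0.36, 0.38, 0.30]
stop_at = early_stopping_epoch(validation_curve)
```

On this curve the sketch keeps the model from epoch 2 and never sees the later, lower error of 0.30, which shows why early stopping is a heuristic rather than a guarantee.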
[Figure: training and validation error as a function of training time]
Image source: Wikipedia
Generalization of performance
• The dark ages
– Student tweaks hyperparameters until it works
– Supervisor may not even know about the tuning
– Results get published without acknowledging the tuning
– Of course, the approach does not generalize
• A step further
– Optimize parameters on a training set
– Evaluate generalization on a test set
• Another step further: avoid “peeking” at the test set
– Put test set into a vault (i.e., never look at it)
– Split training set again into training and validation set
– Only use test set in the end to generate results for publication
Cross-validation for model selection
• Problem: single split of training data into training/validation might not be representative
• Standard solution: average performance across k cross-validation folds (here: k=3)
[Figure: the training data split into training and validation parts in k=3 different ways]
Cross-validation for model selection
• Standard model selection using cross-validation (CV):
• A is a learning algorithm
• We apply A to dataset D_train and evaluate the resulting model on dataset D_valid
• We call the resulting loss L(A, D_train, D_valid)
• We average these losses over the k cross-validation folds and pick the best-performing learning algorithm
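The procedure above can be sketched in a few lines. This is an illustration under stated assumptions: the two "learning algorithms" (a constant predictor and a least-squares line), the squared-error loss, and the contiguous folds are all hypothetical choices; in practice the data would usually be shuffled before splitting.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, valid_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, valid
        start += size


def fit_mean(xs, ys):
    """Learning algorithm 1: always predict the mean training target."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Learning algorithm 2: least-squares line through the training data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def cv_loss(fit, xs, ys, k=3):
    """Average validation mean squared error over the k folds."""
    losses = []
    for train, valid in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_loss = sum((model(xs[i]) - ys[i]) ** 2 for i in valid) / len(valid)
        losses.append(fold_loss)
    return sum(losses) / k


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.1, 4.9]
algorithms = {"mean": fit_mean, "line": fit_line}
cv_losses = {name: cv_loss(fit, xs, ys) for name, fit in algorithms.items()}
best = min(cv_losses, key=cv_losses.get)
```

The algorithm with the lowest average validation loss is selected; the test set plays no role in this choice.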
Cross-validation for further tasks
• Standard model selection using cross-validation (CV):
• Standard hyperparameter optimization using CV:
• Combination of the two:
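The formulas on this slide were lost in extraction. One common way to write the three problems is sketched below. All symbols are assumptions rather than recovered content: L(A, D_train, D_valid) is the validation loss of algorithm A, \mathcal{A} is a set of candidate algorithms, \Lambda is a hyperparameter space, and the superscript (i) indexes the k cross-validation folds.

```latex
% Model selection over a set \mathcal{A} of learning algorithms:
A^{*} \in \operatorname*{arg\,min}_{A \in \mathcal{A}} \;
  \frac{1}{k} \sum_{i=1}^{k} L\bigl(A, D_{\mathrm{train}}^{(i)}, D_{\mathrm{valid}}^{(i)}\bigr)

% Hyperparameter optimization of a single algorithm A over its space \Lambda:
\lambda^{*} \in \operatorname*{arg\,min}_{\lambda \in \Lambda} \;
  \frac{1}{k} \sum_{i=1}^{k} L\bigl(A_{\lambda}, D_{\mathrm{train}}^{(i)}, D_{\mathrm{valid}}^{(i)}\bigr)

% Combination: choose the algorithm and its hyperparameters jointly:
(A^{*}, \lambda^{*}) \in \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}} \;
  \frac{1}{k} \sum_{i=1}^{k} L\bigl(A^{(j)}_{\lambda}, D_{\mathrm{train}}^{(i)}, D_{\mathrm{valid}}^{(i)}\bigr)
```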
Cross-validation Details
• How to choose the number of folds k?
– Too low: noisy approximations of generalization → poor generalization to test instances
– Too high: evaluating a configuration is expensive → optimization process is slow
– Also, performance in folds is not independent, so increasing k does not always improve generalization
• Theory is lacking
• In practice, typically choose k=5 or k=10 [Kohavi, 1995]
• Practical speedup trick [Hutter, Hoos & Leyton-Brown, 2011]
– We do not need to evaluate all folds for each configuration
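– Example: best configuration so far has average CV error 0.1 based on 5 folds; new configuration has error 0.6 in first fold. Even with zero error on the remaining four folds, its average would be 0.6/5 = 0.12 > 0.1, so the remaining folds can be skipped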
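The idea of not evaluating all folds can be sketched as a simple lower-bound check. This is a simplification for illustration, not the procedure of the cited paper; it assumes errors are non-negative, and the function and argument names are hypothetical.

```python
def evaluate_with_cutoff(fold_errors, incumbent_mean, k):
    """Evaluate folds one by one and stop once the configuration cannot win.

    `fold_errors` is an iterable producing one error per fold (in practice
    each item would be an expensive training run). Because errors are
    assumed non-negative, total / k is a lower bound on the final mean.
    Returns (mean_error_or_None, number_of_folds_evaluated).
    """
    total = 0.0
    for folds_done, error in enumerate(fold_errors, start=1):
        total += error
        if total / k > incumbent_mean:
            return None, folds_done  # cannot beat the incumbent any more
        if folds_done == k:
            break
    return total / k, k


# New configuration from the example: error 0.6 in the first fold.
rejected = evaluate_with_cutoff(iter([0.6, 0.1, 0.1, 0.1, 0.1]), 0.1, 5)

# A competitive configuration is evaluated on all five folds.
accepted = evaluate_with_cutoff(iter([0.05, 0.1, 0.1, 0.1, 0.1]), 0.1, 5)
```

The first configuration is discarded after one fold; the second needs all five folds before its mean can be compared with the incumbent.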
Outline of Today’s Class
• Generalization to previously unseen data
• Overview of hyperparameter optimization methods
• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes
Manual Search
Start with some configuration
repeat
    Modify a single parameter
    if performance on a benchmark set degrades then
        undo modification
until no more improvement possible (or “good enough”)
(manually-executed hill climbing)
Aka “Optimization by Graduate Student”
Pros and cons of manual search
• Pros
– Student gains some intuition → helps understanding
– Student can notice irregularities, e.g.
• A configuration is worse than expected → find bugs
• E.g., aliasing in filters learned by a convolutional network [Zeiler & Fergus, 2013]
• A run dies because of temporary file system errors → repeat the run
• Cons
– “Blind” search: inefficient use of student’s time
– Sometimes “false intuition”: e.g., based on a different dataset and a different architecture a year ago
Simple Search Strategy: Grid Search
Image source: Bergstra et al, Random Search for Hyperparameter Optimization, JMLR 2012
• Select D values for each of N hyperparameters, try all D^N combinations
• Direct feedback:
– Which values work/don’t work for each setting
– Which parameters are important? Are there interactions?
Simple Search Strategy: Random Search
• Select configurations uniformly at random
– Completely uninformed
– Global search, won’t get stuck in a local region
– Better than grid search for low effective dimensionality:
Image source: Bergstra et al, Random Search for Hyperparameter Optimization, JMLR 2012
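A small sketch of why random search copes better with low effective dimensionality, assuming two hyperparameters on the unit square of which only the first one matters. The helper names and the budget of nine evaluations are illustrative.

```python
import itertools
import random


def grid_search(values_per_dim, n_dims):
    """All combinations of the same candidate values in every dimension."""
    return list(itertools.product(values_per_dim, repeat=n_dims))


def random_search(n_samples, n_dims, rng):
    """Configurations drawn uniformly at random from the unit cube."""
    return [tuple(rng.random() for _ in range(n_dims)) for _ in range(n_samples)]


rng = random.Random(0)
grid = grid_search([0.0, 0.5, 1.0], n_dims=2)        # 3^2 = 9 configurations
rand = random_search(len(grid), n_dims=2, rng=rng)   # same budget of 9

# If only the first hyperparameter matters, what counts is how many
# distinct values of it were tried within the budget.
distinct_grid = len({config[0] for config in grid})
distinct_rand = len({config[0] for config in rand})
```

With the same nine evaluations, the grid tries only three distinct values of the important hyperparameter, whereas random search tries nine.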
Further Benefits of Random Search
• Perfect parallelizability
– Simply start K runs in parallel on a compute cluster
• Fault tolerance
– In practice, some runs often die because of some problem
• File system error
• Parameter combination not legal
• Code crashes
– In grid search, you need the entire grid
– In random search, a design with M < K runs is also valid
Disadvantages of Random Search
• Entirely uninformed
– Cannot follow an obvious gradient (e.g., bigger is better)
• Curse of dimensionality
– Example: only ½ of the values of each dimension is good
– Probability of randomly drawing a good configuration in N dimensions: 0.5^N
• In 1 dimension: 0.5
• In 2 dimensions: 0.25
• In 10 dimensions: < 0.001
• In 20 dimensions: < 0.000001
• Grid search has the same problems
– Random search is the better search method
– Grid search only gives better intuitions
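The probabilities in the curse-of-dimensionality example can be checked directly. This is a minimal sketch; the function name is illustrative, and it assumes the "good" half of each dimension is independent of the others.

```python
def prob_good_configuration(n_dims, good_fraction=0.5):
    """Probability that a uniform random draw is good in every dimension."""
    return good_fraction ** n_dims


probabilities = {n: prob_good_configuration(n) for n in (1, 2, 10, 20)}

# Expected number of random draws until the first good configuration.
expected_draws = {n: 1.0 / p for n, p in probabilities.items()}
```

In 20 dimensions the expected number of draws is 2^20, roughly a million evaluations, which is out of reach when each evaluation takes hours.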
Stochastic Local Search
• Balance intensification and diversification
– Intensification: gradient descent
– Diversification: restarts, random steps, perturbations, …
• Prominent general methods
– Tabu search [Glover, 1986]
– Simulated annealing [Kirkpatrick, Gelatt & Vecchi, 1983]
– Iterated local search [Lourenço, Martin & Stützle, 2003]
[e.g., Hoos and Stützle, 2005]
Population-based Methods
• Population of configurations
– Global + local search via population
– Maintain population fitness & diversity
• Examples
– Genetic algorithms [e.g., Barricelli, ’57, Goldberg, ’89]
– Evolution strategies [e.g., Beyer & Schwefel, ’02]
– Ant colony optimization [e.g., Dorigo & Stützle, ’04]
– Particle swarm optimization [e.g., Kennedy & Eberhart, ’95]
Bayesian Optimization
• Fit a (probabilistic) model of the function
• Use that model to trade off exploitation vs exploration
• Also known as sequential model-based optimization (SMBO)
Bayesian Optimization
• Popular approach in statistics to minimize expensive blackbox functions [Mockus, '78]
– Efficient in the number of function evaluations
– Works when the objective is nonconvex, noisy, has unknown derivatives, etc.
• Recent progress in the machine learning literature: global convergence rates for continuous optimization [Srinivas et al, ICML 2010] [Bull, JMLR 2011] [Bubeck et al., JMLR 2011] [de Freitas, Smola, Zoghi, ICML 2012]
Estimation of Distribution Algorithms (EDA)
• Also uses a probabilistic model
• Also uses that model to inform where to evaluate next
• But models promising configurations: P(x is “good”)
– In contrast to modeling the function: P(f|x)
Image source: Wikipedia
[e.g., Pelikan, Goldberg and Lobo, 2002]
Outline of Today’s Class
• Generalization to previously unseen data
• Overview of hyperparameter optimization methods
• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes
Reminder: Bayesian Optimization
Aside: why is it called “Bayesian” ?
• Often you have causal knowledge
– For example
• P(symptom | disease)
• P(observed noisy function values | true function)
– This is the likelihood: P(evidence e | hypothesis h)
• ... and you want to do evidential reasoning
– For example
• P(disease | symptom)
• P(true function | observed noisy function values)
– This is the posterior: P(hypothesis h | evidence e)
• To compute this posterior, you also need
– the prior P(hypothesis h) and Bayes rule
Bayes rule (or Bayes’ rule)
Thomas Bayes, 1701-1761, English statistician and philosopher. Image source: Wikipedia
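The formula on this slide did not survive extraction. In terms of a hypothesis h and evidence e, Bayes' rule reads as follows; the sum in the denominator assumes a discrete set of hypotheses and becomes an integral for continuous ones.

```latex
P(h \mid e) = \frac{P(e \mid h)\, P(h)}{P(e)},
\qquad
P(e) = \sum_{h'} P(e \mid h')\, P(h')
```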
Bayes rule in Bayesian optimization
• Denote the observed data as D
• Denote our prior over functions as P(f)
• Then the posterior over functions is: P(f | D) ∝ P(D | f) P(f), i.e., posterior ∝ likelihood × prior
Two components of Bayesian optimization
• The probabilistic model
– Typically used: Gaussian process
– Today: Bayesian linear regression & Gaussian processes
– Next time: random forests
• The acquisition function
– Trades off exploration vs. exploitation
Bayesian linear regression & Gaussian processes
• Acknowledgement: The following slides are taken from Philipp Hennig’s tutorial on Gaussian processes at the Machine Learning Summer School 2013
• All of Philipp’s slides are online: http://mlss.tuebingen.mpg.de/hennig_slides1.pdf
• Philipp’s website also has video lectures and more slides: http://www.is.tuebingen.mpg.de/nc/employee/details/phennig.html
Carl Friedrich Gauss (1777–1855): Paying Tolls with a Bell
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
The Gaussian distribution: Multivariate Form
$$\mathcal{N}(x;\mu,\Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right]$$
[Figure: contour plot of a two-dimensional Gaussian with mean (µ1, µ2)]
▸ x, µ ∈ R^N, Σ ∈ R^{N×N}
▸ Σ is positive semidefinite, i.e.
▸ v⊺Σv ≥ 0 for all v ∈ R^N
▸ Hermitian, all eigenvalues ≥ 0
Why Gaussian? an experiment
[Figure: histogram, horizontal axis from −0.1 to 0.1]
▸ nothing in the real world is Gaussian (except sums of i.i.d. variables)
▸ But nothing in the real world is linear either!
Gaussians are for inference what linear maps are for algebra.
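Closure Under Multiplication: multiple Gaussian factors form a Gaussian
$$\mathcal{N}(x;a,A)\,\mathcal{N}(x;b,B) = \mathcal{N}(x;c,C)\,\mathcal{N}(a;b,A+B)$$
$$C := (A^{-1}+B^{-1})^{-1}, \qquad c := C\,(A^{-1}a + B^{-1}b)$$
[Figure: contour plot of the two Gaussian factors and their product, with mean (µ1, µ2)]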
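The product identity can be checked numerically in one dimension, where the covariances A, B and C are plain variances. This is a minimal sketch with arbitrarily chosen numbers; the helper names are illustrative.

```python
import math


def normal_pdf(x, mean, var):
    """Density of the one-dimensional Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def product_parameters(a, A, b, B):
    """Mean c and variance C of the Gaussian factor in N(x;a,A) * N(x;b,B)."""
    C = 1.0 / (1.0 / A + 1.0 / B)
    c = C * (a / A + b / B)
    return c, C


a, A = 1.0, 4.0   # first factor: mean 1, variance 4
b, B = 3.0, 1.0   # second factor: mean 3, variance 1
c, C = product_parameters(a, A, b, B)

x = 2.5           # any evaluation point works
lhs = normal_pdf(x, a, A) * normal_pdf(x, b, B)
rhs = normal_pdf(x, c, C) * normal_pdf(a, b, A + B)
```

The product is a Gaussian in x with mean c and variance C, scaled by the constant N(a; b, A + B), which does not depend on x.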
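Closure under Linear Maps: linear maps of Gaussians are Gaussians
[Figure: contour plot of a two-dimensional Gaussian with mean (µ1, µ2) and its linear projection]
$$p(z) = \mathcal{N}(z;\mu,\Sigma) \;\Rightarrow\; p(Az) = \mathcal{N}(Az;\ A\mu,\ A\Sigma A^\top)$$
Here: A = [1, −0.5]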
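Closure under Marginalization: projections of Gaussians are Gaussian
▸ projection with A = (1 0):
$$\int \mathcal{N}\left[\begin{pmatrix}x\\ y\end{pmatrix};\begin{pmatrix}\mu_x\\ \mu_y\end{pmatrix},\begin{pmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{pmatrix}\right] dy = \mathcal{N}(x;\mu_x,\Sigma_{xx})$$
[Figure: contour plot of a two-dimensional Gaussian with mean (µ1, µ2) and its marginal]
▸ this is the sum rule:
$$\int p(x,y)\,dy = \int p(y \mid x)\,p(x)\,dy = p(x)$$
▸ so every finite-dim Gaussian is a marginal of infinitely many more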
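Closure under Conditioning: cuts through Gaussians are Gaussians
$$p(x \mid y) = \frac{p(x,y)}{p(y)} = \mathcal{N}\left(x;\ \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y),\ \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)$$
[Figure: contour plot of a two-dimensional Gaussian with mean (µ1, µ2) and a cut through it]
▸ this is the product rule
▸ so Gaussians are closed under the rules of probability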
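Bayesian Inference: explaining away
[Figure: densities over (x1, x2), axes from 0 to 5]
$$p(x) = \mathcal{N}(x;\mu,\Sigma) = \mathcal{N}\left[\begin{pmatrix}x_1\\ x_2\end{pmatrix};\begin{pmatrix}1\\ 0.5\end{pmatrix},\begin{pmatrix}3^2 & 0\\ 0 & 3^2\end{pmatrix}\right]$$
$$p(y \mid x,\sigma) = \mathcal{N}(y;\ A^\top x,\ \sigma^2) = \mathcal{N}\left[6;\ (1 \;\; 0.6)\begin{pmatrix}x_1\\ x_2\end{pmatrix},\ \sigma^2\right]$$
$$p(x \mid \sigma^2, y) = \frac{p(x)\,p(y \mid x)}{p(y)} = \mathcal{N}\left(x;\ \mu + \Sigma A(A^\top\Sigma A + \sigma^2)^{-1}(y - A^\top\mu),\ \Sigma - \Sigma A(A^\top\Sigma A + \sigma^2)^{-1}A^\top\Sigma\right)$$
$$= \mathcal{N}\left[\begin{pmatrix}x_1\\ x_2\end{pmatrix};\begin{pmatrix}3.9\\ 2.3\end{pmatrix},\begin{pmatrix}3.4 & -3.4\\ -3.4 & 7.0\end{pmatrix}\right]$$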
What can we do with this? linear regression
given y ∈ R^N, p(y ∣ f), what’s f?
[Figure: scatter plot of the data, y against x]
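A prior over linear functions
$$f(x) = w_1 + w_2 x = \phi_x^\top w, \qquad \phi_x = \begin{pmatrix}1\\ x\end{pmatrix}$$
$$p(w) = \mathcal{N}(w;\mu,\Sigma), \qquad p(f) = \mathcal{N}(f;\ \phi_x^\top\mu,\ \phi_x^\top\Sigma\phi_x)$$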
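The posterior over linear functions
$$p(y \mid w,\phi_X) = \mathcal{N}(y;\ \phi_X^\top w,\ \sigma^2 I)$$
$$p(w \mid y,\phi_X) = \mathcal{N}\left(w;\ \mu + \Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\ \Sigma - \Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\right)$$
The posterior over function values at a test point x:
$$p(f_x \mid y,\phi_X) = \mathcal{N}\left(f_x;\ \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\ \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\phi_x\right)$$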
% prior on w
F = 2;                                  % number of features
phi = @(a)(bsxfun(@power,a,0:F-1));     % phi(a) = [1;a]
mu = zeros(F,1);
Sigma = eye(F);                         % p(w) = N(mu,Sigma)

% prior on f(x)
n = 100; x = linspace(-6,6,n)';         % 'test' points
phix = phi(x);                          % features of x
m = phix * mu;
kxx = phix * Sigma * phix';             % p(fx) = N(m,kxx)
s = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3)); % samples from prior
stdpi = sqrt(diag(kxx));                % marginal stddev, for plotting

load('data.mat'); N = length(Y);        % gives Y,X,sigma

% prior on Y = fX + epsilon
phiX = phi(X);                          % features of data
M = phiX * mu;
kXX = phiX * Sigma * phiX';             % p(fX) = N(M,kXX)
G = kXX + sigma^2 * eye(N);             % p(Y) = N(M,kXX + sigma^2 I)
R = chol(G);                            % most expensive step: O(N^3)
kxX = phix * Sigma * phiX';             % cov(fx,fX) = kxX
A = kxX / R;                            % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));           % p(fx|Y) = N(m + kxX (kXX + sigma^2 I)^-1 (Y-M),
vpost = kxx - A * A';                   %             kxx - kxX (kXX + sigma^2 I)^-1 kXx)
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3)); % samples
stdpo = sqrt(diag(vpost));              % marginal stddev, for plotting
A More Realistic Dataset: General Linear Regression
$$f(x) = \phi_x^\top w \;?$$
[Figure: scatter plot of the data, y against x]
$$f(x) = w_1 + w_2 x = \phi_x^\top w, \qquad \phi_x := \begin{pmatrix}1\\ x\end{pmatrix}$$
Cubic Regression
phi = @(a)(bsxfun(@power,a,[0:3]));
$$f(x) = \phi(x)^\top w, \qquad \phi(x) = (1 \;\; x \;\; x^2 \;\; x^3)^\top$$

Septic Regression?
phi = @(a)(bsxfun(@power,a,[0:7]));
$$f(x) = \phi(x)^\top w, \qquad \phi(x) = (1 \;\; x \;\; x^2 \;\; \cdots \;\; x^7)^\top$$

Fourier Regression
phi = @(a)(2 * [cos(bsxfun(@times,a/8,[0:8])), sin(bsxfun(@times,a/8,[1:8]))]);
$$\phi(x) = (\cos(x) \;\; \cos(2x) \;\; \cos(3x) \;\; \dots \;\; \sin(x) \;\; \sin(2x) \;\; \dots)^\top$$

Step Regression
phi = @(a)(-1 + 2 * bsxfun(@lt,a,linspace(-8,8,16)));
$$\phi(x) = -1 + 2\,(\theta(x-8) \;\; \theta(8-x) \;\; \theta(x-7) \;\; \theta(7-x) \;\; \dots)^\top$$

V Regression
phi = @(a)(bsxfun(@minus,abs(bsxfun(@minus,a,linspace(-8,8,16))),linspace(-8,8,16)));
$$\phi(x) = (|x-8|+8 \;\; |x-7|+7 \;\; |x-6|+6 \;\; \dots)^\top$$

Eiffel Tower Regression
phi = @(a)(exp(-abs(bsxfun(@minus,a,[-8:1:8]))));
$$\phi(x) = (e^{-|x-8|} \;\; e^{-|x-7|} \;\; e^{-|x-6|} \;\; \dots)^\top$$

Bell Curve Regression
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,[-8:1:8]).^2));
$$\phi(x) = (e^{-\frac{1}{2}(x-8)^2} \;\; e^{-\frac{1}{2}(x-7)^2} \;\; e^{-\frac{1}{2}(x-6)^2} \;\; \dots)^\top$$
Multiple Inputs: all this works in multiple dimensions, too
$$\phi : \mathbb{R}^N \to \mathbb{R}, \qquad f : \mathbb{R}^N \to \mathbb{R}$$
How many features should we use? let’s look at that algebra again
$$p(f_x \mid y,\phi_X) = \mathcal{N}\left(f_x;\ \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\ \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\phi_x\right)$$
▸ there’s no lonely φ in there
▸ all objects involving φ are of the form
▸ φ⊺µ: the mean function
▸ φ⊺Σφ: the kernel
▸ once these are known, cost is independent of the number of features
▸ remember the code:
M = phiX * mu;
m = phix * mu;
kXX = phiX * Sigma * phiX';             % p(fX) = N(M,kXX)
kxx = phix * Sigma * phix';             % p(fx) = N(m,kxx)
kxX = phix * Sigma * phiX';             % cov(fx,fX) = kxX
% prior
F = 2;                                  % number of features
phi = @(a)(bsxfun(@power,a,0:F-1));     % phi(a) = [1;a]
k = @(a,b)(phi(a) * phi(b)');           % kernel (Sigma = I); rows index points
mu = @(a)(zeros(size(a,1),1));          % mean function

% belief on f(x)
n = 100; x = linspace(-6,6,n)';         % 'test' points
m = mu(x);
kxx = k(x,x);                           % p(fx) = N(m,kxx)
s = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3)); % samples from prior
stdpi = sqrt(diag(kxx));                % marginal stddev, for plotting

load('data.mat'); N = length(Y);        % gives Y,X,sigma

% prior on Y = fX + epsilon
M = mu(X);
kXX = k(X,X);                           % p(fX) = N(M,kXX)
G = kXX + sigma^2 * eye(N);             % p(Y) = N(M,kXX + sigma^2 I)
R = chol(G);                            % most expensive step: O(N^3)
kxX = k(x,X);                           % cov(fx,fX) = kxX
A = kxX / R;                            % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));           % p(fx|Y) = N(m + kxX (kXX + sigma^2 I)^-1 (Y-M),
vpost = kxx - A * A';                   %             kxx - kxX (kXX + sigma^2 I)^-1 kXx)
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3)); % samples
stdpo = sqrt(diag(vpost));              % marginal stddev, for plotting
Exponentiated Squares
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,10)).^2 ./ell.^2)); % 10 features
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,30)).^2 ./ell.^2)); % 30 features
k = @(a,b)(5*exp(-0.25*bsxfun(@minus,a,b').^2));                        % kernel form
▸ aka. radial basis function, square(d)-exponential kernel
What just happened? kernelization to infinitely many features
Definition: A function k: X × X → R is a Mercer kernel if, for any finite collection X = [x1, . . . , xN], the matrix k_XX ∈ R^{N×N} with elements k_XX,(i,j) = k(xi, xj) is positive semidefinite.
Lemma: Any kernel that can be written as
$$k(x,x') = \int \phi_\ell(x)\,\phi_\ell(x')\,d\ell$$
is a Mercer kernel (the integral may also be a sum over ℓ; assuming integral over positive set).
Proof: for all X ∈ X^N, v ∈ R^N,
$$v^\top k_{XX}\, v = \int \sum_{i}^{N} v_i\,\phi_\ell(x_i) \sum_{j}^{N} v_j\,\phi_\ell(x_j)\,d\ell = \int \Bigl[\sum_{i} v_i\,\phi_\ell(x_i)\Bigr]^2 d\ell \;\geq\; 0 \qquad \square$$

What just happened? Gaussian process priors
Definition: Let µ: X → R be any function, k: X × X → R be a Mercer kernel. A Gaussian process p(f) = GP(f; µ, k) is a probability distribution over the function f: X → R, such that every finite restriction to function values f_X := [f_x1, . . . , f_xN] is a Gaussian distribution p(f_X) = N(f_X; µ_X, k_XX).
The predictive posterior distribution
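The posterior Gaussian process has predictive distribution $p(f_x \mid Y) = \mathcal{N}(f_x;\ m_{post},\ v_{post})$, where
$$m_{post} = m + k_{xX}\,k_{XX}^{-1}\,(Y - M), \qquad v_{post} = k_{xx} - k_{xX}\,k_{XX}^{-1}\,k_{Xx}$$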
The predictive posterior under noise
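The posterior Gaussian process has predictive distribution $p(f_x \mid Y) = \mathcal{N}(f_x;\ m_{post},\ v_{post})$, where
$$m_{post} = m + k_{xX}\,(k_{XX} + \sigma^2 I)^{-1}\,(Y - M), \qquad v_{post} = k_{xx} - k_{xX}\,(k_{XX} + \sigma^2 I)^{-1}\,k_{Xx}$$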
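A minimal, dependency-free sketch of the predictive posterior under noise for a single test point. It assumes a zero mean function and the squared-exponential kernel 5·exp(−0.25·(a − b)²) shown in the MATLAB snippets; the data values and helper names are hypothetical, and the Cholesky-based solve mirrors the structure of the MATLAB code rather than reproducing it.

```python
import math


def sq_exp_kernel(a, b):
    """Squared-exponential kernel: 5 * exp(-0.25 * (a - b)^2)."""
    return 5.0 * math.exp(-0.25 * (a - b) ** 2)


def cholesky(G):
    """Lower-triangular L with L L^T = G, for symmetric positive definite G."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][p] * z[p] for p in range(i))) / L[i][i])
    return z


def gp_posterior(x_test, X, Y, noise_var):
    """Posterior mean and variance of f(x_test) under a zero-mean GP prior."""
    # G = kXX + sigma^2 I, the covariance of the noisy observations Y.
    G = [[sq_exp_kernel(xi, xj) + (noise_var if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    L = cholesky(G)                       # most expensive step: O(t^3)
    k_star = [sq_exp_kernel(x_test, xi) for xi in X]
    v = solve_lower(L, k_star)            # L^-1 kXx
    w = solve_lower(L, Y)                 # L^-1 Y
    mean = sum(vi * wi for vi, wi in zip(v, w))            # kxX G^-1 Y
    var = sq_exp_kernel(x_test, x_test) - sum(vi * vi for vi in v)
    return mean, var


X = [-2.0, 0.0, 3.0]
Y = [1.0, -1.0, 0.5]
mean_at_data, var_at_data = gp_posterior(0.0, X, Y, noise_var=1e-6)
mean_far, var_far = gp_posterior(50.0, X, Y, noise_var=1e-6)
```

With almost no noise, the posterior at an observed input reproduces the observation with near-zero variance, while far from the data it falls back to the prior mean 0 and prior variance 5.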
Computational complexity of GPs
• Let t denote the number of data points in the GP
• Inverting the kernel matrix: O(t^3)
• Predictions of the variance: O(t^2)
• Predictions of the mean: O(t)
Two components of Bayesian optimization
• The probabilistic model
– Typically used: Gaussian process
– Later: other models are possible, e.g., random forests
• The acquisition function
– Trades off exploration vs. exploitation
– We’ll discuss this in detail
Probability of Improvement
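The formula on this slide was lost in extraction. A common definition is sketched here under stated assumptions: the objective is minimized, the model's predictive distribution at x is Gaussian with mean \mu(x) and standard deviation \sigma(x), f_min is the best function value observed so far, and \Phi is the standard normal CDF. These symbols are assumptions, not recovered from the slide.

```latex
\mathrm{PI}(x) = P\bigl(f(x) \le f_{\min}\bigr)
             = \Phi\!\left(\frac{f_{\min} - \mu(x)}{\sigma(x)}\right)
```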
Expected Improvement
(the derivation of this integral’s closed-form solution will be an exercise)
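For reference, a sketch of the well-known closed form for expected improvement, EI(x) = E[max(f_min − f(x), 0)], assuming minimization and a Gaussian prediction N(mean, std²) at x. The function and argument names are illustrative, and the derivation itself is left to the exercise.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, f_min):
    """Closed-form EI for minimization under a prediction N(mean, std^2)."""
    if std <= 0.0:
        return max(f_min - mean, 0.0)  # no uncertainty: improvement is known
    z = (f_min - mean) / std
    return (f_min - mean) * normal_cdf(z) + std * normal_pdf(z)


# Predicted mean equal to the incumbent: all EI comes from the uncertainty.
ei_at_incumbent = expected_improvement(mean=1.0, std=1.0, f_min=1.0)

# Same slightly worse mean, but different uncertainty.
ei_uncertain = expected_improvement(mean=1.2, std=2.0, f_min=1.0)
ei_certain = expected_improvement(mean=1.2, std=0.1, f_min=1.0)
```

The first term rewards a low predicted mean (exploitation) and the second rewards predictive uncertainty (exploration), which is why the uncertain point scores higher than the certain one despite the same mean.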
Upper Confidence Bound (UCB)
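The formula on this slide was lost in extraction. A common form is sketched here, assuming a Gaussian predictive distribution with mean \mu(x) and standard deviation \sigma(x) and a trade-off parameter \kappa > 0; these symbols are assumptions. When maximizing, the next point maximizes the upper bound; when minimizing, the analogous lower confidence bound is minimized.

```latex
\mathrm{UCB}(x) = \mu(x) + \kappa\,\sigma(x)
\quad \text{(maximization)},
\qquad
\mathrm{LCB}(x) = \mu(x) - \kappa\,\sigma(x)
\quad \text{(minimization)}
```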
Entropy Search
• Compute a probability distribution over which configuration is optimal
• Acquisition function: try to push this probability distribution as close to a delta distribution as possible
• One of the most powerful acquisition functions
– Can choose to actively evaluate in one region of the space to learn something about a different region of the space
Putting it all Together
• How to optimize the acquisition function?
– Subsidiary optimization method
– Important: in that subsidiary optimization, function evaluations are cheap (just evaluations of the GP).
Summary of Bayesian Optimization
• Bayesian optimization integrates
– prior information and
– the likelihood of the observed data
• It uses quite involved computation to select which function value to evaluate next
– Thus, it’s most useful for expensive blackbox functions
Overall summary
• Generalization: we need to safeguard against over-fitting
• Overview of hyperparameter optimization methods
• Bayesian optimization
– Based on linear regression & Gaussian processes
• Next week:
– Bayesian optimization with random forests
– Extensions and applications