Hyperparameter optimization with approximate gradient
TRANSCRIPT
HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT
Fabian Pedregosa
Chaire Havas-Dauphine, Paris-Dauphine / École Normale Supérieure
HYPERPARAMETERS
Most machine learning models depend on at least one hyperparameter to control model complexity. Examples include:
- Amount of regularization.
- Kernel parameters.
- Architecture of a neural network.

Model parameters: estimated using some (regularized) goodness of fit on the data.
Hyperparameters: cannot be estimated using the same criterion as model parameters (overfitting).
HYPERPARAMETER SELECTION
Criteria for hyperparameter selection:
- Optimize loss on unseen data: cross-validation.
- Minimize a risk estimator: SURE, AIC/BIC, etc.

Example: least squares with ℓ2 regularization.

    loss = ∑ᵢ₌₁ⁿ (bᵢ − ⟨aᵢ, X(λ)⟩)²

Costly evaluation function, non-convex. Common methods: grid search, random search, SMBO.
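To make the selection problem concrete, here is a minimal sketch of grid search for the ℓ2 regularization strength of ridge regression. The data and helper names are invented for illustration, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data, split into train and validation halves.
A = rng.standard_normal((40, 5))
x_true = rng.standard_normal(5)
b = A @ x_true + 0.1 * rng.standard_normal(40)
A_tr, b_tr, A_val, b_val = A[:20], b[:20], A[20:], b[20:]

def ridge_solution(lam):
    """Inner problem: X(lam) = argmin_x ||A_tr x - b_tr||^2 + lam ||x||^2."""
    p = A_tr.shape[1]
    return np.linalg.solve(A_tr.T @ A_tr + lam * np.eye(p), A_tr.T @ b_tr)

def val_loss(lam):
    """Outer criterion: squared loss of X(lam) on the held-out half."""
    x = ridge_solution(lam)
    return np.sum((b_val - A_val @ x) ** 2)

# Grid search: evaluate the criterion (non-convex in lam) on a log-spaced grid.
grid = np.logspace(-4, 2, 20)
best_lam = min(grid, key=val_loss)
```

Each grid point requires solving the inner problem from scratch, which is what makes the evaluation function costly.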
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
Compute gradients with respect to hyperparameters [Larsen 1996, 1998; Bengio 2000]. Hyperparameter optimization as nested or bi-level optimization:

    argmin over λ ∈ D :  f(λ) ≜ g(X(λ), λ)        ← loss on test set
    s.t.  X(λ) ∈ argmin over x ∈ ℝᵖ :  h(x, λ)    ← loss on train set

where X(λ) denotes the model parameters.
GOAL: COMPUTE ∇f(λ)
By the chain rule,

    ∇f = ∂g/∂λ + (∂X/∂λ)ᵀ ∂g/∂X

where ∂g/∂λ and ∂g/∂X are known, but ∂X/∂λ is unknown.

Two main approaches: implicit differentiation and iterative differentiation [Domke et al. 2012, Maclaurin 2015].
Implicit differentiation [Larsen 1996, Bengio 2000]: formulate the inner optimization as an implicit equation.

    X(λ) = argmin_x h(x, λ)  ⟺  ∇₁h(X(λ), λ) = 0    (implicit equation for X)
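For ridge regression the implicit equation can be differentiated by hand, which allows a small numerical check. The data and function names below are illustrative assumptions, not part of the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)

def X(lam):
    # Inner solution, from grad_1 h(X, lam) = (A^T A + lam I) X - A^T b = 0
    return np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b)

def dX_dlam(lam):
    # Differentiate the implicit equation with respect to lam:
    #   (A^T A + lam I) dX/dlam + X(lam) = 0
    # => dX/dlam = -(A^T A + lam I)^{-1} X(lam)
    return -np.linalg.solve(A.T @ A + lam * np.eye(4), X(lam))

# Sanity check against central finite differences.
lam, eps = 0.5, 1e-6
fd = (X(lam + eps) - X(lam - eps)) / (2 * eps)
```

The derivative of the inner solution costs one extra linear solve with the same matrix, which is the key structure the implicit approach exploits.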
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION

    ∇f = ∇₂g − (∇²₁,₂h)ᵀ (∇²₁h)⁻¹ ∇₁g

Possible to compute the gradient w.r.t. hyperparameters, given:
- the solution to the inner optimization, X(λ);
- the solution to the linear system, (∇²₁h)⁻¹ ∇₁g.

⟹ computationally expensive.
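For the same ridge-regression setup as before, the full formula can be checked against finite differences. The data and names are again illustrative assumptions; note that here g has no direct dependence on λ, so ∇₂g = 0:

```python
import numpy as np

rng = np.random.default_rng(1)
A_tr = rng.standard_normal((30, 4)); b_tr = rng.standard_normal(30)
A_val = rng.standard_normal((20, 4)); b_val = rng.standard_normal(20)
I = np.eye(4)

def X(lam):
    # Inner solution of h(x, lam) = ||A_tr x - b_tr||^2 + lam ||x||^2
    return np.linalg.solve(A_tr.T @ A_tr + lam * I, A_tr.T @ b_tr)

def f(lam):
    # Outer objective g(X(lam)) = validation loss (no direct lam dependence)
    return np.sum((A_val @ X(lam) - b_val) ** 2)

def grad_f(lam):
    x = X(lam)
    grad1_g = 2 * A_val.T @ (A_val @ x - b_val)   # grad_1 g
    hess1_h = 2 * (A_tr.T @ A_tr + lam * I)       # hess_1 h
    cross_h = 2 * x                               # cross derivative, a vector here
    # grad f = grad2_g - cross_h^T (hess1_h)^{-1} grad1_g, with grad2_g = 0
    return -cross_h @ np.linalg.solve(hess1_h, grad1_g)

lam, eps = 0.7, 1e-6
fd = (f(lam + eps) - f(lam - eps)) / (2 * eps)
```

Both the inner solution and the linear system require a p × p solve, which is where the cost concentrates as the model grows.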
HOAG: HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT
- Replace X(λ) by an approximate solution of the inner optimization.
- Approximately solve the linear system.
- Update λ using pₖ ≈ ∇f.

Tradeoff:
- Loose approximation: cheap iterations, might diverge.
- Precise approximation: costly iterations, convergence to a stationary point.
HOAG
At iteration k = 1, 2, … perform the following:
i) Solve the inner optimization problem up to tolerance εₖ, i.e. find xₖ ∈ ℝᵖ such that

    ‖X(λₖ) − xₖ‖ ≤ εₖ .

ii) Solve the linear system up to tolerance εₖ. That is, find qₖ such that

    ‖∇²₁h(xₖ, λₖ) qₖ − ∇₁g(xₖ, λₖ)‖ ≤ εₖ .

iii) Compute the approximate gradient as

    pₖ = ∇₂g(xₖ, λₖ) − (∇²₁,₂h(xₖ, λₖ))ᵀ qₖ .

iv) Update the hyperparameters:

    λₖ₊₁ = P(λₖ − (1/L) pₖ) ,

where P is the projection onto the domain D.
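The four steps can be sketched on an ℓ2-regularized least-squares inner problem with synthetic data. This is a simplified illustration, not the reference implementation: the gradient norm serves as a computable proxy for the tolerance in step i, and the linear system in step ii is solved exactly for brevity (HOAG only requires it up to εₖ):

```python
import numpy as np

rng = np.random.default_rng(2)
A_tr = rng.standard_normal((40, 5)); b_tr = rng.standard_normal(40)
A_val = rng.standard_normal((30, 5)); b_val = rng.standard_normal(30)
I = np.eye(5)

def hoag(lam0=1.0, n_iter=30, L=0.05):
    lam, x = lam0, np.zeros(5)
    for k in range(1, n_iter + 1):
        eps_k = 0.1 / k ** 2                       # quadratically decreasing tolerance
        # i) inexact inner solve: gradient descent on h, warm-started from the
        #    previous x, until ||grad_1 h|| <= eps_k
        step = 1.0 / (2 * np.linalg.norm(A_tr, 2) ** 2 + 2 * lam)
        while True:
            g1h = 2 * (A_tr.T @ (A_tr @ x - b_tr) + lam * x)
            if np.linalg.norm(g1h) <= eps_k:
                break
            x = x - step * g1h
        # ii) linear system hess_1 h q = grad_1 g (exact here for brevity)
        grad1_g = 2 * A_val.T @ (A_val @ x - b_val)
        q = np.linalg.solve(2 * (A_tr.T @ A_tr + lam * I), grad1_g)
        # iii) approximate gradient p_k = grad2_g - (cross derivative)^T q,
        #      with grad2_g = 0 for this choice of g
        p = -(2 * x) @ q
        # iv) projected update: keep lam inside a bounded box
        lam = float(np.clip(lam - L * p, 1e-6, 1e6))
    return lam

lam_star = hoag()
```

Warm-starting the inner solve at xₖ₋₁ is what makes the early, loose-tolerance iterations cheap.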
ANALYSIS - GLOBAL CONVERGENCE
Assumptions:
(A1). Lipschitz ∇g and ∇²h.
(A2). ∇²₁h(X(λ), λ) non-singular.
(A3). Domain D is bounded.

Corollary: If ∑ᵢ₌₁^∞ εᵢ < ∞, then λₖ converges to a stationary point λ₀:

    ⟨∇f(λ₀), α − λ₀⟩ ≥ 0 ,  ∀α ∈ D .

If λ₀ is in the interior of D, then ∇f(λ₀) = 0.
EXPERIMENTS
How to choose the tolerance εₖ?
Different strategies for the tolerance decrease. Quadratic: εₖ = 0.1/k², Cubic: εₖ = 0.1/k³, Exponential: εₖ = 0.1 × 0.9ᵏ.
Approximate-gradient strategies achieve a much faster decrease in early iterations.
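All three schedules are summable, which is exactly the ∑ εᵢ < ∞ condition in the convergence corollary. A plain-Python sketch of the sequences (values as on the slide):

```python
# The three tolerance-decrease schedules, evaluated for k = 1..5.
def eps_quadratic(k):
    return 0.1 / k ** 2

def eps_cubic(k):
    return 0.1 / k ** 3

def eps_exponential(k):
    return 0.1 * 0.9 ** k

schedules = {
    "quadratic":   [eps_quadratic(k) for k in range(1, 6)],
    "cubic":       [eps_cubic(k) for k in range(1, 6)],
    "exponential": [eps_exponential(k) for k in range(1, 6)],
}
```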
EXPERIMENTS I
Model: ℓ2-regularized logistic regression. 1 hyperparameter.
Datasets:
- 20news (18k × 130k)
- real-sim (73k × 20k)
EXPERIMENTS II
- Kernel ridge regression. 2 hyperparameters. Parkinson dataset: 654 × 17.
- Multinomial logistic regression with one hyperparameter per feature [Maclaurin et al. 2015]: 784 × 10 hyperparameters. MNIST dataset: 60k × 784.
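As a rough sketch of the first setup, kernel ridge regression with two hyperparameters (regularization λ and RBF width γ) has a closed-form inner solution. The data below is synthetic, invented for illustration (not the Parkinson dataset):

```python
import numpy as np

rng = np.random.default_rng(3)
A_tr = rng.standard_normal((50, 3));  b_tr = np.sin(A_tr[:, 0])
A_val = rng.standard_normal((25, 3)); b_val = np.sin(A_val[:, 0])

def rbf_kernel(U, V, gamma):
    # K_ij = exp(-gamma * ||u_i - v_j||^2)
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def val_loss(lam, gamma):
    # Inner problem in closed form: alpha = (K + lam I)^{-1} b_tr
    K = rbf_kernel(A_tr, A_tr, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(b_tr)), b_tr)
    pred = rbf_kernel(A_val, A_tr, gamma) @ alpha
    return np.mean((pred - b_val) ** 2)
```

With two hyperparameters the grid already grows quadratically, which is one reason gradient-based search becomes attractive here.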
CONCLUSION
Hyperparameter optimization with inexact gradient:
- can update hyperparameters before model parameters have fully converged.
- is independent of the inner optimization algorithm.
- has convergence guarantees under smoothness assumptions.

Open questions:
- Non-smooth inner optimization (e.g. sparse models)?
- Stochastic / online approximation?
REFERENCES
[Y. Bengio, 2000] Bengio, Yoshua. "Gradient-based optimization of hyperparameters." Neural Computation 12.8 (2000): 1889-1900.
[J. Bergstra, Y. Bengio 2012] Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305.
[J. Snoek et al., 2015] Snoek, J. et al. Scalable Bayesian Optimization Using Deep Neural Networks. (2015). http://arxiv.org/abs/1502.05700
[K. Swersky et al., 2014] Swersky, K., Snoek, J. & Adams, R. Freeze-Thaw Bayesian Optimization. arXiv preprint arXiv:1406.3896, 1-12 (2014). http://arxiv.org/abs/1406.3896
[F. Hutter et al., 2013] Hutter, F., Hoos, H. & Leyton-Brown, K. An evaluation of sequential model-based optimization for expensive blackbox functions.
REFERENCES 2
[M. Schmidt et al., 2013] Schmidt, M., Roux, N. & Bach, F. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 1-45 (2013). http://arxiv.org/abs/1309.2388
[J. Domke et al., 2012] Domke, J. Generic Methods for Optimization-Based Modeling. Proc. Fifteenth Int. Conf. Artif. Intell. Stat. XX, 318-326 (2012).
[M. P. Friedlander et al., 2012] Friedlander, M. P. & Schmidt, M. Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM J. Sci. Comput. 34, A1380-A1405 (2012).
EXPERIMENTS - COST FUNCTION
EXPERIMENTS
Comparison with other hyperparameter optimization methods.
Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.
EXPERIMENTS
Comparison in terms of validation loss.
Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.