Hyperparameter optimization with approximate gradient
TRANSCRIPT
HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT
Fabian Pedregosa
Chaire Havas-Dauphine, Paris-Dauphine / École Normale Supérieure
HYPERPARAMETERS
Most machine learning models depend on at least one hyperparameter to control model complexity. Examples include:
- Amount of regularization.
- Kernel parameters.
- Architecture of a neural network.

Model parameters: estimated using some (regularized) goodness of fit on the data.
Hyperparameters: cannot be estimated using the same criterion as model parameters (overfitting).
HYPERPARAMETER SELECTION
Criteria for hyperparameter selection:
- Optimize loss on unseen data: cross-validation.
- Minimize a risk estimator: SURE, AIC/BIC, etc.

Example: least squares with ℓ2 regularization.

    loss = ∑ᵢ₌₁ⁿ (bᵢ − ⟨aᵢ, X(λ)⟩)²

Costly evaluation function, non-convex. Common methods: grid search, random search, SMBO.
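To make the selection problem concrete, here is a minimal sketch of grid search for the ℓ2 regularization strength of ridge regression. The data and helper names are invented for illustration, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data, split into train and validation halves.
A = rng.standard_normal((40, 5))
x_true = rng.standard_normal(5)
b = A @ x_true + 0.1 * rng.standard_normal(40)
A_tr, b_tr, A_val, b_val = A[:20], b[:20], A[20:], b[20:]

def ridge_solution(lam):
    """Inner problem: X(lam) = argmin_x ||A_tr x - b_tr||^2 + lam ||x||^2."""
    p = A_tr.shape[1]
    return np.linalg.solve(A_tr.T @ A_tr + lam * np.eye(p), A_tr.T @ b_tr)

def val_loss(lam):
    """Outer criterion: squared loss of X(lam) on the held-out half."""
    x = ridge_solution(lam)
    return np.sum((b_val - A_val @ x) ** 2)

# Grid search: evaluate the criterion (non-convex in lam) on a log-spaced grid.
grid = np.logspace(-4, 2, 20)
best_lam = min(grid, key=val_loss)
```

Each grid point requires solving the inner problem from scratch, which is what makes the evaluation function costly.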
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
Compute gradients with respect to hyperparameters [Larsen 1996, 1998; Bengio 2000]. Hyperparameter optimization as nested or bi-level optimization:

    argmin over λ ∈ D :  f(λ) ≜ g(X(λ), λ)        ← loss on test set
    s.t.  X(λ) ∈ argmin over x ∈ ℝᵖ :  h(x, λ)    ← loss on train set

where X(λ) denotes the model parameters.
GOAL: COMPUTE ∇f(λ)
By the chain rule,

    ∇f = ∂g/∂λ + (∂X/∂λ)ᵀ ∂g/∂X

where ∂g/∂λ and ∂g/∂X are known, but ∂X/∂λ is unknown.

Two main approaches: implicit differentiation and iterative differentiation [Domke et al. 2012, Maclaurin 2015].
Implicit differentiation [Larsen 1996, Bengio 2000]: formulate the inner optimization as an implicit equation.

    X(λ) = argmin_x h(x, λ)  ⟺  ∇₁h(X(λ), λ) = 0    (implicit equation for X)
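For ridge regression the implicit equation can be differentiated by hand, which allows a small numerical check. The data and function names below are illustrative assumptions, not part of the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)

def X(lam):
    # Inner solution, from grad_1 h(X, lam) = (A^T A + lam I) X - A^T b = 0
    return np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b)

def dX_dlam(lam):
    # Differentiate the implicit equation with respect to lam:
    #   (A^T A + lam I) dX/dlam + X(lam) = 0
    # => dX/dlam = -(A^T A + lam I)^{-1} X(lam)
    return -np.linalg.solve(A.T @ A + lam * np.eye(4), X(lam))

# Sanity check against central finite differences.
lam, eps = 0.5, 1e-6
fd = (X(lam + eps) - X(lam - eps)) / (2 * eps)
```

The derivative of the inner solution costs one extra linear solve with the same matrix, which is the key structure the implicit approach exploits.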
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION

    ∇f = ∇₂g − (∇²₁,₂h)ᵀ (∇²₁h)⁻¹ ∇₁g

Possible to compute the gradient w.r.t. hyperparameters, given:
- the solution to the inner optimization, X(λ);
- the solution to the linear system, (∇²₁h)⁻¹ ∇₁g.

⟹ computationally expensive.
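For the same ridge-regression setup as before, the full formula can be checked against finite differences. The data and names are again illustrative assumptions; note that here g has no direct dependence on λ, so ∇₂g = 0:

```python
import numpy as np

rng = np.random.default_rng(1)
A_tr = rng.standard_normal((30, 4)); b_tr = rng.standard_normal(30)
A_val = rng.standard_normal((20, 4)); b_val = rng.standard_normal(20)
I = np.eye(4)

def X(lam):
    # Inner solution of h(x, lam) = ||A_tr x - b_tr||^2 + lam ||x||^2
    return np.linalg.solve(A_tr.T @ A_tr + lam * I, A_tr.T @ b_tr)

def f(lam):
    # Outer objective g(X(lam)) = validation loss (no direct lam dependence)
    return np.sum((A_val @ X(lam) - b_val) ** 2)

def grad_f(lam):
    x = X(lam)
    grad1_g = 2 * A_val.T @ (A_val @ x - b_val)   # grad_1 g
    hess1_h = 2 * (A_tr.T @ A_tr + lam * I)       # hess_1 h
    cross_h = 2 * x                               # cross derivative, a vector here
    # grad f = grad2_g - cross_h^T (hess1_h)^{-1} grad1_g, with grad2_g = 0
    return -cross_h @ np.linalg.solve(hess1_h, grad1_g)

lam, eps = 0.7, 1e-6
fd = (f(lam + eps) - f(lam - eps)) / (2 * eps)
```

Both the inner solution and the linear system require a p × p solve, which is where the cost concentrates as the model grows.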
HOAG: HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT
- Replace X(λ) by an approximate solution of the inner optimization.
- Approximately solve the linear system.
- Update λ using pₖ ≈ ∇f.

Tradeoff:
- Loose approximation: cheap iterations, might diverge.
- Precise approximation: costly iterations, convergence to a stationary point.
HOAG
At iteration k = 1, 2, … perform the following:
i) Solve the inner optimization problem up to tolerance εₖ, i.e. find xₖ ∈ ℝᵖ such that

    ‖X(λₖ) − xₖ‖ ≤ εₖ .

ii) Solve the linear system up to tolerance εₖ. That is, find qₖ such that

    ‖∇²₁h(xₖ, λₖ) qₖ − ∇₁g(xₖ, λₖ)‖ ≤ εₖ .

iii) Compute the approximate gradient as

    pₖ = ∇₂g(xₖ, λₖ) − (∇²₁,₂h(xₖ, λₖ))ᵀ qₖ .

iv) Update the hyperparameters:

    λₖ₊₁ = P(λₖ − (1/L) pₖ) ,

where P is the projection onto the domain D.
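The four steps can be sketched on an ℓ2-regularized least-squares inner problem with synthetic data. This is a simplified illustration, not the reference implementation: the gradient norm serves as a computable proxy for the tolerance in step i, and the linear system in step ii is solved exactly for brevity (HOAG only requires it up to εₖ):

```python
import numpy as np

rng = np.random.default_rng(2)
A_tr = rng.standard_normal((40, 5)); b_tr = rng.standard_normal(40)
A_val = rng.standard_normal((30, 5)); b_val = rng.standard_normal(30)
I = np.eye(5)

def hoag(lam0=1.0, n_iter=30, L=0.05):
    lam, x = lam0, np.zeros(5)
    for k in range(1, n_iter + 1):
        eps_k = 0.1 / k ** 2                       # quadratically decreasing tolerance
        # i) inexact inner solve: gradient descent on h, warm-started from the
        #    previous x, until ||grad_1 h|| <= eps_k
        step = 1.0 / (2 * np.linalg.norm(A_tr, 2) ** 2 + 2 * lam)
        while True:
            g1h = 2 * (A_tr.T @ (A_tr @ x - b_tr) + lam * x)
            if np.linalg.norm(g1h) <= eps_k:
                break
            x = x - step * g1h
        # ii) linear system hess_1 h q = grad_1 g (exact here for brevity)
        grad1_g = 2 * A_val.T @ (A_val @ x - b_val)
        q = np.linalg.solve(2 * (A_tr.T @ A_tr + lam * I), grad1_g)
        # iii) approximate gradient p_k = grad2_g - (cross derivative)^T q,
        #      with grad2_g = 0 for this choice of g
        p = -(2 * x) @ q
        # iv) projected update: keep lam inside a bounded box
        lam = float(np.clip(lam - L * p, 1e-6, 1e6))
    return lam

lam_star = hoag()
```

Warm-starting the inner solve at xₖ₋₁ is what makes the early, loose-tolerance iterations cheap.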
ANALYSIS - GLOBAL CONVERGENCE
Assumptions:
(A1). Lipschitz ∇g and ∇²h.
(A2). ∇²₁h(X(λ), λ) non-singular.
(A3). Domain D is bounded.

Corollary: If ∑ᵢ₌₁^∞ εᵢ < ∞, then λₖ converges to a stationary point λ₀:

    ⟨∇f(λ₀), α − λ₀⟩ ≥ 0 ,  ∀α ∈ D .

If λ₀ is in the interior of D, then ∇f(λ₀) = 0.
EXPERIMENTS
How to choose the tolerance εₖ?
Different strategies for the tolerance decrease. Quadratic: εₖ = 0.1/k², Cubic: εₖ = 0.1/k³, Exponential: εₖ = 0.1 × 0.9ᵏ.
Approximate-gradient strategies achieve a much faster decrease in early iterations.
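All three schedules are summable, which is exactly the ∑ εᵢ < ∞ condition in the convergence corollary. A plain-Python sketch of the sequences (values as on the slide):

```python
# The three tolerance-decrease schedules, evaluated for k = 1..5.
def eps_quadratic(k):
    return 0.1 / k ** 2

def eps_cubic(k):
    return 0.1 / k ** 3

def eps_exponential(k):
    return 0.1 * 0.9 ** k

schedules = {
    "quadratic":   [eps_quadratic(k) for k in range(1, 6)],
    "cubic":       [eps_cubic(k) for k in range(1, 6)],
    "exponential": [eps_exponential(k) for k in range(1, 6)],
}
```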
EXPERIMENTS I
Model: ℓ2-regularized logistic regression. 1 hyperparameter.
Datasets:
- 20news (18k × 130k)
- real-sim (73k × 20k)
EXPERIMENTS II
- Kernel ridge regression. 2 hyperparameters. Parkinson dataset: 654 × 17.
- Multinomial logistic regression with one hyperparameter per feature [Maclaurin et al. 2015]: 784 × 10 hyperparameters. MNIST dataset: 60k × 784.
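As a rough sketch of the first setup, kernel ridge regression with two hyperparameters (regularization λ and RBF width γ) has a closed-form inner solution. The data below is synthetic, invented for illustration (not the Parkinson dataset):

```python
import numpy as np

rng = np.random.default_rng(3)
A_tr = rng.standard_normal((50, 3));  b_tr = np.sin(A_tr[:, 0])
A_val = rng.standard_normal((25, 3)); b_val = np.sin(A_val[:, 0])

def rbf_kernel(U, V, gamma):
    # K_ij = exp(-gamma * ||u_i - v_j||^2)
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def val_loss(lam, gamma):
    # Inner problem in closed form: alpha = (K + lam I)^{-1} b_tr
    K = rbf_kernel(A_tr, A_tr, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(b_tr)), b_tr)
    pred = rbf_kernel(A_val, A_tr, gamma) @ alpha
    return np.mean((pred - b_val) ** 2)
```

With two hyperparameters the grid already grows quadratically, which is one reason gradient-based search becomes attractive here.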
CONCLUSION
Hyperparameter optimization with inexact gradient:
- can update hyperparameters before model parameters have fully converged.
- is independent of the inner optimization algorithm.
- has convergence guarantees under smoothness assumptions.

Open questions:
- Non-smooth inner optimization (e.g. sparse models)?
- Stochastic / online approximation?
REFERENCES
[Y. Bengio, 2000] Bengio, Yoshua. "Gradient-based optimization of hyperparameters." Neural Computation 12.8 (2000): 1889-1900.
[J. Bergstra, Y. Bengio 2012] Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305.
[J. Snoek et al., 2015] Snoek, J. et al. Scalable Bayesian Optimization Using Deep Neural Networks. (2015). http://arxiv.org/abs/1502.05700
[K. Swersky et al., 2014] Swersky, K., Snoek, J. & Adams, R. Freeze-Thaw Bayesian Optimization. arXiv preprint arXiv:1406.3896, 1-12 (2014). http://arxiv.org/abs/1406.3896
[F. Hutter et al., 2013] Hutter, F., Hoos, H. & Leyton-Brown, K. An evaluation of sequential model-based optimization for expensive blackbox functions.
REFERENCES 2
[M. Schmidt et al., 2013] Schmidt, M., Roux, N. & Bach, F. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 1-45 (2013). http://arxiv.org/abs/1309.2388
[J. Domke et al., 2012] Domke, J. Generic Methods for Optimization-Based Modeling. Proc. Fifteenth Int. Conf. Artif. Intell. Stat. XX, 318-326 (2012).
[M. P. Friedlander et al., 2012] Friedlander, M. P. & Schmidt, M. Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM J. Sci. Comput. 34, A1380-A1405 (2012).
EXPERIMENTS - COST FUNCTION
EXPERIMENTS
Comparison with other hyperparameter optimization methods.
Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.
EXPERIMENTS
Comparison in terms of validation loss.
Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.