
Source: synergy-twinning.eu/files/bart16ccomplete-170104-1438.pdf (2017-01-04)

A Survey of Model-based Methods for Global Optimization

Thomas Bartz-Beielstein

SPOTSeven Lab, www.spotseven.de

Technology Arts Sciences, TH Köln

BIOMA

Bartz-Beielstein MBO 1 / 94


Keywords

I Abelson & Sussman
I Wolpert
I Netflix
I Deep Learning

Bartz-Beielstein MBO 2 / 94


Agenda

I Part 1: Basics
I Part 2: Example
I Part 3: Considerations
I Part 4: Example
I Part 5: Conclusion

Bartz-Beielstein MBO 3 / 94


Introduction

Overview

Introduction

Stochastic Search Algorithms

Quality Criteria: How to Select Surrogates

Examples

Ensembles: Considerations

SPO2 Part 2

Stacking: Considerations

Bartz-Beielstein MBO 4 / 94


Introduction

Model-based optimization (MBO)

I Prominent role in today's modeling, simulation, and optimization processes
I Most efficient technique for expensive and time-demanding real-world optimization problems
I In the engineering domain, MBO is an important practice
I Recent advances in
  I computer science,
  I statistics, and
  I engineering,
  in combination with progress in high-performance computing, provide
I Tools for handling problems considered unsolvable only a few decades ago

Bartz-Beielstein MBO 5 / 94


Introduction

Global optimization (GO)

I GO can be categorized based on different criteria
I Properties of problems:
  I continuous versus combinatorial
  I linear versus nonlinear
  I convex versus multimodal, etc.
I We present an algorithmic view, i.e., properties of algorithms
I The term GO is used in this talk for algorithms that try to find and explore globally optimal solutions of complex, multimodal objective functions [Preuss, 2015]
I GO problems are difficult: nearly no structural information (e.g., number of local extrema) is available
I GO problems belong to the class of black-box functions, i.e., the analytic form is unknown
I The class of black-box functions also contains functions that are easy to solve, e.g., convex functions

Bartz-Beielstein MBO 6 / 94


Introduction

Problem

I Optimization problem given by

  Minimize: f(~x) subject to ~xl ≤ ~x ≤ ~xu,

  where f : Rn → R is referred to as the objective function and ~xl and ~xu denote the lower and upper bounds of the search space (region of interest), respectively
I This setting arises in many real-world systems:
  I when the explicit form of the objective function f is not readily available,
  I e.g., the user has no access to the source code of a simulator
I We cover stochastic (random) search algorithms; deterministic GO algorithms are not further discussed
I Random and stochastic are used synonymously
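As a concrete sketch of this setting (the objective f below is a hypothetical stand-in for a black-box simulator; the optimizer only ever sees function values and the box bounds):

```python
import numpy as np
from scipy.optimize import differential_evolution

def f(x):
    # hypothetical black-box objective: only function values are available,
    # no gradients and no analytic form are assumed by the optimizer
    return np.sum(x**2) + np.sin(3.0 * x[0])

# lower and upper bounds ~xl, ~xu define the region of interest
bounds = [(-5.0, 5.0), (-5.0, 5.0)]

# a stochastic global search suited to the black-box setting
result = differential_evolution(f, bounds, seed=1)
best_x, best_f = result.x, result.fun
```

The model-based methods surveyed below aim to reach comparable solutions with far fewer evaluations of f.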

Bartz-Beielstein MBO 7 / 94


Introduction

Taxonomy of model-based approaches in GO

[1] Deterministic
[2] Random Search
  [2.1] Instance based
  [2.2] Model based optimization (MBO)
    [2.2.1] Distribution based
    [2.2.2] Surrogate Model Based Optimization (SBO)
      [2.2.2.1] Single surrogate based
      [2.2.2.2] Multi-fidelity based
      [2.2.2.3] Evolutionary surrogate based
      [2.2.2.4] Ensemble surrogate based

Bartz-Beielstein MBO 8 / 94


Stochastic Search Algorithms

Overview

Introduction

Stochastic Search Algorithms

Quality Criteria: How to Select Surrogates

Examples

Ensembles: Considerations

SPO2 Part 2

Stacking: Considerations

Bartz-Beielstein MBO 9 / 94


Stochastic Search Algorithms

Random Search

I Stochastic search algorithm: iterative search algorithm that uses a stochastic procedure to generate the next iterate
I The next iterate can be
  I a candidate solution to the GO problem or
  I a probabilistic model, from which solutions can be drawn
I Stochastic search algorithms do not depend on any structural information of the objective function such as gradient information or convexity ⇒ robust and easy to implement
I Stochastic search algorithms can further be categorized as
  I instance-based or
  I model-based algorithms [Zlochin et al., 2004]

Bartz-Beielstein MBO 10 / 94


Stochastic Search Algorithms

[2.1] Instance-based Algorithms

I Instance-based algorithms use a single solution, ~x, or a population, P(t), of candidate solutions
I Construction of new candidates depends explicitly on previously generated solutions
I Examples: simulated annealing, evolutionary algorithms

1: t = 0. InitPopulation(P).
2: Evaluate(P).
3: while not TerminationCriterion() do
4:   Generate new candidate solutions P'(t) according to a specified random mechanism.
5:   Update the current population P(t+1) based on population P(t) and candidate solutions in P'(t).
6:   Evaluate(P(t+1)).
7:   t = t + 1.
8: end while
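The loop above can be sketched as a minimal runnable example (a toy illustration, assuming Gaussian mutation as the random mechanism and truncation selection as the update, on a hypothetical sphere objective):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical cheap test objective (sphere function)
    return np.sum(x**2, axis=-1)

# t = 0: initialize and evaluate the population P(t)
P = rng.uniform(-5, 5, size=(20, 2))
for t in range(100):
    # generate candidates P'(t) from previous solutions (Gaussian mutation)
    P_new = P + rng.normal(scale=0.5, size=P.shape)
    # update P(t+1) from P(t) and P'(t): keep the 20 best of the union
    union = np.vstack([P, P_new])
    P = union[np.argsort(f(union))[:20]]

best = float(f(P).min())
```

Note that new candidates are built directly from previous solutions, the defining property of instance-based search.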

Bartz-Beielstein MBO 11 / 94


Stochastic Search Algorithms

[2.2] MBO: Model-based Algorithms

I MBO algorithms generate a population of new candidate solutions P'(t) by sampling from a model
I In statistics: model ≡ distribution
I The model (distribution) reflects structural properties of the underlying true function, say f
I By adapting the model (or the distribution), the search is directed into regions with improved solutions
I One of the key ideas: replacement of expensive, high-fidelity, fine-grained function evaluations, f(~x), with evaluations, f̂(~x), of an adequate cheap, low-fidelity, coarse-grained model, M

Bartz-Beielstein MBO 12 / 94


Stochastic Search Algorithms

[2.2.1] Distribution-based Approaches

I The metamodel is a distribution
I Generate a sequence of iterates (probability distributions) {p(t)} with the hope that

  p(t) → p* as t → ∞,

  where p* is the limiting distribution, which assigns most of its probability mass to the set of optimal solutions
I The probability distribution is propagated from one iteration to the next
I Instance-based algorithms propagate candidate solutions instead

1: t = 0. Let p(t) be a probability distribution.
2: while not TerminationCriterion() do
3:   Randomly generate a population of candidate solutions P(t) from p(t).
4:   Evaluate(P(t)).
5:   Update the distribution using population (samples) P(t) to generate a new distribution p(t+1).
6:   t = t + 1.
7: end while
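A toy sketch of this loop, assuming p(t) is an isotropic Gaussian whose mean is re-estimated from the best samples (the fixed shrink factor for sigma is a simplification; real EDAs estimate the spread from the samples as well):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical test objective (sphere function)
    return np.sum(x**2, axis=-1)

# t = 0: p(0) is an isotropic Gaussian, started far from the optimum at 0
mean, sigma = np.full(2, 3.0), 2.0
for t in range(40):
    # randomly generate a population P(t) from p(t)
    P = mean + sigma * rng.standard_normal((50, 2))
    # evaluate P(t) and update the distribution from the best samples
    elite = P[np.argsort(f(P))[:10]]
    mean = elite.mean(axis=0)
    sigma *= 0.9  # simple fixed shrink; real EDAs adapt this too
```

Only the distribution parameters (mean, sigma) are carried from one iteration to the next, not the candidate solutions themselves.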

Bartz-Beielstein MBO 13 / 94


Stochastic Search Algorithms

[2.2.1] Estimation of distribution algorithms (EDA)

I EDA: very popular in the field of evolutionary algorithms (EA)
I Variation operators such as mutation and recombination are replaced by a distribution-based procedure:
  I A probability distribution is estimated from promising candidate solutions of the current population ⇒ generate new population
I Larrañaga and Lozano [2002] review different ways of using probabilistic models
I Hauschild and Pelikan [2011] discuss advantages and outline many of the different types of EDAs
I Hu et al. [2012] present recent approaches and a unified view

Bartz-Beielstein MBO 14 / 94


Stochastic Search Algorithms

[2.2.2] Focus on Surrogates

I Although distribution-based approaches play an important role in GO, they will not be discussed further in this talk
I We will concentrate on surrogate model-based approaches
I Their origin lies in the statistical design and analysis of experiments, especially in response surface methodology [G E P Box, 1951, Montgomery, 2001]

Bartz-Beielstein MBO 15 / 94


Stochastic Search Algorithms

[2.2.2] Surrogate Model-based Approaches

I In general, surrogates are used when the outcome of a process cannot be directly measured
I They imitate the behavior of the real model as closely as possible while being computationally cheaper to evaluate
I Surrogate models are also known as
  I the cheap model,
  I a response surface,
  I meta model,
  I approximation, or
  I coarse-grained model
I Simple surrogate models are constructed using a data-driven approach
I They are refined by integrating additional points or domain knowledge, e.g., constraints

Bartz-Beielstein MBO 16 / 94


Stochastic Search Algorithms

[2.2.2] Surrogate Model-based Approaches

Initial design → Build metamodel → Validate metamodel → Optimize on metamodel → Sample design space → (back to: Build metamodel)

I The validation step (e.g., via CV) is optional
I Samples are generated iteratively to improve the surrogate model accuracy

Bartz-Beielstein MBO 17 / 94


Stochastic Search Algorithms

[2.2.2] Surrogate Model Based Optimization (SBO) Algorithm

1: t = 0. InitPopulation(P(t))
2: Evaluate(P(t))
3: while not TerminationCriterion() do
4:   Use P(t) to build a cheap model M(t)
5:   P'(t+1) = GlobalSearch(M(t))
6:   Evaluate(P'(t+1))
7:   P(t+1) ⊆ P(t) + P'(t+1)
8:   t = t + 1
9: end while
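The loop can be sketched with scikit-learn (a toy sketch: the expensive f is simulated by a cheap stand-in, and GlobalSearch is approximated by taking the best of many evaluations of the cheap model M):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)

def f_expensive(x):
    # stand-in for the expensive objective function
    return np.sum(x**2, axis=-1)

# steps 1-2: initialize and evaluate P(0)
X = rng.uniform(-5, 5, size=(10, 2))
y = f_expensive(X)
for t in range(10):
    # step 4: use P(t) to build a cheap model M(t)
    M = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    # step 5: GlobalSearch(M), here the best of many cheap model evaluations
    cand = rng.uniform(-5, 5, size=(1000, 2))
    x_new = cand[np.argmin(M.predict(cand))]
    # steps 6-7: evaluate only the chosen point expensively, extend P(t+1)
    X = np.vstack([X, x_new])
    y = np.append(y, f_expensive(x_new))
```

Per iteration, the expensive function is evaluated once, while the cheap model absorbs the thousand-point search.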

Bartz-Beielstein MBO 18 / 94


Stochastic Search Algorithms

[2.2.2] Surrogates

I A wide range of surrogates has been developed in the last decades ⇒ complex design decisions [Wang and Shan, 2007]:
  I (a) Metamodels
  I (b) Designs
  I (c) Model fit
I (a) Metamodels:
  I classical regression models such as polynomial regression or response surface methodology [G E P Box, 1951, Montgomery, 2001]
  I support vector machines (SVM) [Vapnik, 1998]
  I neural networks [Zurada, 1992]
  I radial basis functions [Powell, 1987], or
  I Gaussian process (GP) models, design and analysis of computer experiments, Kriging [Schonlau, 1997], [Büche et al., 2005], [Antognini and Zagoraiou, 2010], [Kleijnen, 2009], [Santner et al., 2003]
I A comprehensive introduction to SBO is given in [Forrester et al., 2008]

Bartz-Beielstein MBO 19 / 94


Stochastic Search Algorithms

[2.2.2] Surrogates: Popular metamodeling techniques

I (b) Designs [Wang and Shan, 2007]:
  I Classical
    I Fractional factorial
    I Central composite
    I Box-Behnken
    I A-, D-optimal (alphabetical optimality)
    I Plackett-Burman
  I Space filling
    I Simple grids
    I Latin hypercube
    I Orthogonal
    I Uniform
    I Minimax and Maximin
  I Hybrid methods
  I Random or human selection
  I Sequential methods

Bartz-Beielstein MBO 20 / 94


Stochastic Search Algorithms

[2.2.2] Surrogates: Popular metamodeling techniques

I (c) Model fitting [Wang and Shan, 2007]:
  I Weighted least squares regression
  I Best linear unbiased predictor (BLUP)
  I Likelihood
  I Multipoint approximation
  I Sequential metamodeling
  I Neural networks: backpropagation
  I Decision trees: entropy

Bartz-Beielstein MBO 21 / 94


Stochastic Search Algorithms

[2.2.2] Applications of SBO

I One of the most popular application areas for SBO:
  I simulation-based design of complex engineering problems
    I computational fluid dynamics (CFD)
    I finite element modeling (FEM) methods
I Exact solutions ⇒ solvers require a large number of expensive computer simulations
I Two variants of SBO:
  I (i) metamodel [2.2.2.1]: uses one or several different metamodels
  I (ii) multi-fidelity approximation [2.2.2.2]: uses several instances with different parameterizations of the same metamodel

Bartz-Beielstein MBO 22 / 94


Stochastic Search Algorithms

[2.2.2.1] Applications of Metamodels and [2.2.2.2] Multi-fidelity Approximation

I Meta-modeling approaches
  I 31-variable helicopter rotor design [Booker et al., 1998]
  I Aerodynamic shape design problem [Giannakoglou, 2002]
  I Multi-objective optimal design of a liquid rocket injector [Queipo et al., 2005]
  I Airfoil shape optimization with CFD [Zhou et al., 2007]
  I Aerospace design [Forrester and Keane, 2009]
I Multi-fidelity approximation
  I Several simulation models with different grid sizes in FEM [Huang et al., 2015]
  I Sheet metal forming process [Sun et al., 2011]
I "How far have we really come?" [Simpson et al., 2012]

Bartz-Beielstein MBO 23 / 94


Stochastic Search Algorithms

[2.2.2.3] Surrogate-assisted Evolutionary Algorithms

I Surrogate-assisted EAs: EAs that decouple the evolutionary search from the direct evaluation of the objective function
I A cheap surrogate model replaces evaluations of the expensive objective function

Bartz-Beielstein MBO 24 / 94


Stochastic Search Algorithms

[2.2.2.3] Surrogate-assisted Evolutionary Algorithms

I Combination of a genetic algorithm and neural networks for aerodynamic design optimization [Hajela and Lee, 1997]
I Approximate model of the fitness landscape using Kriging interpolation to accelerate the convergence of EAs [Ratle, 1998]
I Evolution strategy (ES) with neural network based fitness evaluations [Jin et al., 2000]
I Surrogate-assisted EA framework with online learning [Zhou et al., 2007]
I Do not evaluate every candidate solution (individual), but just estimate the objective function value of some of the neighboring individuals [Branke and Schmidt, 2005]
I Survey of surrogate-assisted EA approaches [Jin, 2003]
I SBO approaches for evolution strategies [Emmerich et al., 2002]

Bartz-Beielstein MBO 25 / 94


Stochastic Search Algorithms

[2.2.2.4] Multiple Models

I Instead of using only one surrogate model, several models Mi, i = 1, 2, . . . , p, are generated and evaluated in parallel
I Each model Mi : X → y uses
  I the same candidate solutions, X, from the population P and
  I the same results, y, from expensive function evaluations
I Multiple models can also be used to partition the search space:
  I The treed Gaussian process (TGP) uses regression trees to partition the search space and fits local GP surrogates in each region [Gramacy, 2007]
  I Tree-based partitioning of an aerodynamic design space, with independent Kriging surfaces in each partition [Nelson et al., 2007]
  I Combination of an evolutionary model selection (EMS) algorithm with the expected improvement (EI) criterion: select the best performing surrogate model type at each iteration of the EI algorithm [Couckuyt et al., 2011]

Bartz-Beielstein MBO 26 / 94


Stochastic Search Algorithms

[2.2.2.4] Multiple Models: Ensembles

I Ensembles of surrogate models have gained popularity:
  I Adaptive weighted average model of the individual surrogates [Zerpa et al., 2005]
  I Use the best surrogate model or a weighted average surrogate model instead [Goel et al., 2006]
  I Weighted-sum approach for the selection of model ensembles [Sanchez et al., 2006]
    I Models for the ensemble are chosen based on their performance
    I Weights are adaptive and inversely proportional to the local modeling errors

Bartz-Beielstein MBO 27 / 94


Quality Criteria: How to Select Surrogates

Overview

Introduction

Stochastic Search Algorithms

Quality Criteria: How to Select Surrogates

Examples

Ensembles: Considerations

SPO2 Part 2

Stacking: Considerations

Bartz-Beielstein MBO 28 / 94


Quality Criteria: How to Select Surrogates

Model Refinement: Selection Criteria for Sample Points

I An initial model is refined during the optimization ⇒ adaptive sampling
I Identify new points, so-called infill points
I Balance between
  I exploration, i.e., improving the model quality (related to the model, global), and
  I exploitation, i.e., improving the optimization and determining the optimum (related to the objective function, local)
I Expected improvement (EI): popular adaptive sampling method [Mockus et al., 1978], [Jones et al., 1998]
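For a surrogate with Gaussian predictive distribution N(mu, s²) at a candidate point, EI for minimization has the well-known closed form EI = (f_min − mu) Φ(z) + s φ(z), with z = (f_min − mu)/s and f_min the best observed value; a sketch:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, s, f_min):
    """EI for minimization under a Gaussian prediction N(mu, s**2)."""
    s = np.maximum(s, 1e-12)  # guard against zero predicted variance
    z = (f_min - mu) / s
    return (f_min - mu) * norm.cdf(z) + s * norm.pdf(z)

# exploration: a large predictive spread s gives positive EI even
# where the predicted mean mu is worse than the best value f_min
ei_explore = expected_improvement(mu=1.0, s=2.0, f_min=0.5)
# exploitation: mu below f_min gives large EI even for small s
ei_exploit = expected_improvement(mu=0.0, s=0.1, f_min=0.5)
```

The two calls illustrate how a single criterion balances exploration and exploitation: both uncertain points and promising points receive positive EI.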

Bartz-Beielstein MBO 29 / 94


Quality Criteria: How to Select Surrogates

Model Selection Criteria

I The EI approach handles the initialization and refinement of a surrogate model
I But not the selection of the model itself
I The popular efficient global optimization (EGO) algorithm uses a Kriging model
  I because Kriging inherently determines the prediction variance (necessary for the EI criterion)
I But there is no proof that Kriging is the best choice
I Alternative surrogate models, e.g., neural networks, regression trees, support vector machines, or lasso and ridge regression, may be better suited
I An a priori selection of the best suited surrogate model is conceptually impossible in the framework treated in this talk, because of the black-box setting

Bartz-Beielstein MBO 30 / 94


Quality Criteria: How to Select Surrogates

Single or Ensemble

I Regarding the model choice, the user can decide whether to useI one single model, i.e., one unique global model orI multiple models, i.e., an ensemble of different, possibly local, models

The static SBO uses a single, global surrogate model, usually refined byadaptive sampling, but did not change⇒ category [2.2.2.1]

Bartz-Beielstein MBO 31 / 94


Quality Criteria: How to Select Surrogates

Criteria for Selecting a Surrogate

I Here, we do not consider the selection of a new sample point (as done in EI)
I Instead: criteria for the selection of one (or several) surrogate models
I Usually, surrogate models are chosen according to their estimated true error [Jin et al., 2001], [Shi and Rasheed, 2010]
I Commonly used performance metrics:
  I mean absolute error (MAE)
  I root mean square error (RMSE)
I Generally, attaining a surrogate model that has minimal error is the desired feature
I Methods from statistics, statistical learning [Hastie, 2009], and machine learning [Murphy, 2012]:
  I Simple holdout
  I Cross-validation
  I Bootstrap
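For instance, cross-validated RMSE can be used to pick between candidate surrogates (a sketch on synthetic data; the two models shown are illustrative choices, not a prescribed portfolio):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 2))
y = np.sum(X**2, axis=1) + rng.normal(scale=0.1, size=80)  # toy data

# estimate each surrogate's RMSE by cross-validation, then pick the best
models = {"linear": LinearRegression(),
          "forest": RandomForestRegressor(random_state=0)}
rmse = {name: float(np.sqrt(-cross_val_score(
            m, X, y, cv=5, scoring="neg_mean_squared_error").mean()))
        for name, m in models.items()}
best_name = min(rmse, key=rmse.get)
```

On this quadratic toy landscape the nonlinear surrogate wins; on a different landscape the ranking can flip, which is exactly why the error is estimated rather than assumed.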

Bartz-Beielstein MBO 32 / 94


Examples

Overview

Introduction

Stochastic Search Algorithms

Quality Criteria: How to Select Surrogates

Examples

Ensembles: Considerations

SPO2 Part 2

Stacking: Considerations

Bartz-Beielstein MBO 33 / 94


Examples

Criteria for Selecting a Surrogate: Evolvability

I Model error is not the only criterion for selecting surrogate modelsI Evolvability learning of surrogates approach (EvoLS) [Le et al., 2013]:

I Use fitness improvement for determining the quality of surrogate modelsI EvoLS belongs to the category of surrogate-assisted evolutionary

algorithms ([2.2.2.3])I Distributed, local information

Bartz-Beielstein MBO 34 / 94


Examples

Evolvability Learning of Surrogates

I EvoLS: select a surrogate models that enhance search improvement inthe context of optimization

I Process information about theI (i) different fitness landscapes,I (ii) state of the search, andI (iii) characteristics of the search algorithm to statistically determine the

so-called evolvability of each surrogate modelI Evolvability of a surrogate model estimates the expected improvement of

the objective function value that the new candidate solution has gainedafter a local search has been performed on the related surrogatemodel [Le et al., 2013]

Bartz-Beielstein MBO 35 / 94


Examples

Evolvability

I Local search: after recombination and mutation, a local search is performed
I It uses an individual local meta-model, M, for each offspring
I The local optimizer, ϕM, takes an offspring ~y as input and returns ~y* as the refined offspring
I The evolvability measure can be estimated as follows [Le et al., 2013]:

  EvM(~x) = f(~x) − ∑_{i=1}^{K} f(~y*_i) × wi(~x),

  with weights (selection probabilities of the offspring):

  wi(~x) = P(~yi | P(t), ~x) / ∑_{j=1}^{K} P(~yj | P(t), ~x)
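Numerically, the estimate is a weighted sum of refined-offspring fitness values subtracted from f(~x); a toy sketch with made-up fitness values and selection probabilities:

```python
import numpy as np

def evolvability(f_x, f_refined, p):
    # Ev_M(x) = f(x) - sum_i f(y*_i) * w_i(x), with w_i = p_i / sum_j p_j
    w = np.asarray(p, dtype=float)
    w /= w.sum()  # normalize the selection probabilities
    return f_x - float(np.dot(np.asarray(f_refined, dtype=float), w))

# hypothetical numbers: parent fitness 1.0, three refined offspring y*_i
ev = evolvability(f_x=1.0, f_refined=[0.4, 0.6, 0.8], p=[0.5, 0.3, 0.2])
```

A large positive value means local search on this surrogate is expected to improve markedly on the parent, so the surrogate is ranked highly.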

Bartz-Beielstein MBO 36 / 94


Examples

SPO

I EvoLS: distributed, local information. Now: more centralized, global information ⇒ sequential parameter optimization (SPO)
I Goal: analysis and understanding of algorithms
I Early versions of SPO [Bartz-Beielstein, 2003, Bartz-Beielstein et al., 2005] combined methods from
  I design of experiments (DOE) [Pukelsheim, 1993]
  I response surface methodology (RSM) [Box and Draper, 1987, Montgomery, 2001]
  I design and analysis of computer experiments (DACE) [Lophaven et al., 2002, Santner et al., 2003]
  I regression trees [Breiman et al., 1984]
I Also: SPO as an optimizer

Bartz-Beielstein MBO 37 / 94


Examples

SPO

I SPO: sequential, model-based approach to optimization
I Nowadays: established parameter tuner and optimization algorithm
I Extended in several ways:
  I For example, Hutter et al. [2013] benchmark an SPO derivative, the so-called sequential model-based algorithm configuration (SMAC) procedure, on the BBOB set of blackbox functions
  I With a small budget of 10 × d evaluations of d-dimensional functions, SMAC in most cases outperforms the state-of-the-art blackbox optimizer CMA-ES

Bartz-Beielstein MBO 38 / 94


Examples

SPO

I The most recent version, SPO2, is currently under development
I Integration of state-of-the-art ensemble learners
I SPO2 ensemble engine:
  I Portfolio of surrogate models:
    I regression trees and random forest, least angle regression (lars), and Kriging
  I Uses cross-validation to select an improved model from the portfolio of candidate models
  I Creates a weighted combination of several surrogate models to build the improved model
  I Uses stacked generalization to combine several level-0 models of different types with one level-1 model into an ensemble [Wolpert, 1992]
  I Level-1 training algorithm: simple linear model
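The stacking idea can be sketched with scikit-learn's StackingRegressor (an illustrative sketch on synthetic data, not the actual SPO2 implementation): level-0 models of different types are combined by a level-1 linear model fitted on their cross-validated predictions:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 2))
y = np.sum(X**2, axis=1) + rng.normal(scale=0.1, size=80)  # toy data

# level-0 models of different types; the level-1 combiner is a linear model
stack = StackingRegressor(
    estimators=[("tree", DecisionTreeRegressor(random_state=0)),
                ("forest", RandomForestRegressor(random_state=0))],
    final_estimator=LinearRegression(),
    cv=5)  # level-1 model is fit on cross-validated level-0 predictions
stack.fit(X, y)
r2 = stack.score(X, y)
```

Fitting the combiner on out-of-fold predictions (cv=5) is what keeps the level-1 weights honest; fitting it on in-sample predictions would overweight the most overfitted level-0 model.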

Bartz-Beielstein MBO 39 / 94


Examples

SPO

I Promising preliminary results
I The SPO2 ensemble engine can lead to significant performance improvements
I Rebolledo Coy et al. [2016] present a comparison of different data-driven modeling methods:
  I Bayesian model
  I Several linear regression models
  I Kriging model
  I Genetic programming
I Models are built on industrial data for the development of a robust gas sensor
I Limited amount of samples and a high variance

Bartz-Beielstein MBO 40 / 94


Examples

SPO

I Two sensors are compared
I 1st sensor (MSE):
  I Linear model (0.76), OLS (0.79), Lasso (0.56), Kriging (0.57), Bayes (0.79), and genetic programming (0.58)
  I SPO2: 0.38
I 2nd sensor (MSE):
  I Linear model (0.67), OLS (0.80), Lasso (0.49), Kriging (0.49), Bayes (0.79), and genetic programming (0.27)
  I SPO2: 0.29

Bartz-Beielstein MBO 41 / 94


Examples

Begin: Jupyter Interactive Document

I The following part of this talk is based on an interactive jupyternotebook [Pérez and Granger, 2007]

I The next slides (43 - 56) summarize the output from the jupyter notebook

BIOMA

Bartz-Beielstein MBO 42 / 94


Examples DSPO: Deep Sequential Parameter Optimization

Preparation [jupyter]

I Basically:
  I 1. import libraries and
  I 2. set the SPO2 parameters, i.e., the number of folds
I The code implements ideas from Wolpert [1992], based on Olivetti [2012]
I Libraries are shown explicitly, because we will comment on this topic later!

import matplotlib.pyplot as plt
from IPython.display import set_matplotlib_formats
from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
...
import statsmodels.formula.api as sm

Bartz-Beielstein MBO 43 / 94


Examples Data

The Complete Data Set [jupyter]

I Problem description in Rebolledo Coy et al. [2016]
I One training data set and one test data set

In [2]: dfTrain = read_csv('/Users/bartz/workspace/svnbib/Python.d/projects/pyspot2/data/training.csv')
        dfTest = read_csv('/Users/bartz/workspace/svnbib/Python.d/projects/pyspot2/data/validation.csv')

RangeIndex: 80 entries, 0 to 79
Data columns (total 9 columns):
X1    80 non-null float64
X2    80 non-null float64
..
X7    80 non-null float64
Y1    80 non-null float64
Y2    80 non-null float64
dtypes: float64(9)
memory usage: 5.7 KB

     X1         X2         X3   ...
0    -1.132054  -1.206144  ...
...

Bartz-Beielstein MBO 44 / 94


Examples Data

Data Used in this Study [jupyter]

I Here, we consider data from the second sensor
I There are seven input values and one output value (y)
I The goal of this study: predict the outcome y using the seven input measurements (X1, . . . , X7)
I Output (y) plotted against each input (X1, . . . , X7):

[Figure: seven scatter plots of the output y against each input X1, . . . , X7; y ranges roughly from −3 to 3, the inputs from −2 to 5]

Bartz-Beielstein MBO 45 / 94


Examples Data

2. Cross Validation (CV Splits) [jupyter]

I The training data are split into foldsI KFold divides all the samples in k = nfolds folds, i.e., groups of samples of

equal sizes (if possible)I If k = n, this is equivalent to the Leave One Out strategyI Prediction function is learned using (k − 1) folds, and the fold left out is

used for test
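In scikit-learn this split looks as follows (a small sketch with 10 toy samples and k = 5):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples

kf = KFold(n_splits=5)
# each fold leaves 2 of the 10 samples out for testing, 8 for fitting
sizes = [(len(train), len(test)) for train, test in kf.split(X)]
```

The out-of-fold predictions collected this way form the training data for the level-1 model of the ensemble.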

Bartz-Beielstein MBO 46 / 94


Examples Data

3. Models in the Ensemble [jupyter]

I Linear Regression
  I 1. Normalized predictors
    I LinearRegression(normalize=False)
    I LinearRegression(normalize=True)
  I 2. Intercept
    I LinearRegression(normalize=False, fit_intercept=False)
    I LinearRegression(normalize=True, fit_intercept=False)
I Random Forest
  I RandomForestRegressor(n_estimators=10, random_state=0)
  I RandomForestRegressor(n_estimators=100, oob_score=True, random_state=2)
I Lasso
  I Lasso(alpha=0.1, fit_intercept=False)
  I LassoCV(positive=True)


Examples Data

3. Models in the Ensemble [jupyter]

I Gaussian Processes
  I The kernel specifies the covariance function of the GP
  I Parameterized with different kernels, for example RBF, Matern, RationalQuadratic, ExpSineSquared, DotProduct, ConstantKernel
  I If none is passed, the kernel RBF() is used as default
  I The kernel's hyperparameters are optimized during fitting
  I Kernel combinations are allowed
  I Example:
    kern = 1.0 * RBF(length_scale=100.0, length_scale_bounds=(1e-2, 1e3)) \
           + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))
    ⇒ GaussianProcessRegressor(kern)


Examples Data

3. Models in the Ensemble [jupyter]

In [5]: clfs = [LinearRegression(),
                RandomForestRegressor()
                # , GaussianProcessRegressor()
                ]


Examples Data

Matrix Preparation, Dimensions [jupyter]

I n: size of the training set (samples): X.shape[0]
I k: number of folds for CV: n_folds
I p: number of models: len(clfs)
I m: size of the test data (samples): XTest.shape[0]
I We will use two matrices:
  I 1. YCV is an (n × p)-matrix. It stores the results from the cross validation for each model. The training set is partitioned into k folds (n_folds = k).
  I 2. YBT is an (m × p)-matrix. It stores the aggregated results from the cross-validation models on the test data. For each fold, p separate models are built, which are used for prediction on the test data. The predicted values from the k folds are averaged for each model, which results in (m × p) different values.

In [6]: YCV = np.zeros((X.shape[0], len(clfs)))
        YBT = np.zeros((XTest.shape[0], len(clfs)))


Examples Data

Cross-Validation [jupyter]

I Each of the p algorithms is run separately on the k folds
I The training data set is split into folds
I Each fold contains training data (train) and validation data (val)
I For each fold, a model is built using the train data
I The model is used to make predictions on
  I 1. the validation data from the k-th fold; these values are stored in the matrix YCV
  I 2. the test data ⇒ YBT[i]
I The averages of these predictions are stored in the matrix YBT
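The procedure above can be sketched as a nested loop over models and folds. The data shapes (80 training and 60 test samples) match the study, but the data and the model list here are illustrative stand-ins for the notebook's X, Y, XTest, and clfs:

```python
# Sketch of the CV loop: out-of-fold predictions go to YCV,
# fold-averaged test predictions go to YBT. Data are synthetic.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X, y = rng.randn(80, 7), rng.randn(80)
XTest = rng.randn(60, 7)
clfs = [LinearRegression(),
        RandomForestRegressor(n_estimators=10, random_state=0)]

kf = KFold(n_splits=5)
YCV = np.zeros((X.shape[0], len(clfs)))
YBT = np.zeros((XTest.shape[0], len(clfs)))

for j, clf in enumerate(clfs):
    fold_preds = np.zeros((XTest.shape[0], kf.get_n_splits()))
    for i, (train, val) in enumerate(kf.split(X)):
        clf.fit(X[train], y[train])
        YCV[val, j] = clf.predict(X[val])      # out-of-fold predictions
        fold_preds[:, i] = clf.predict(XTest)  # per-fold test predictions
    YBT[:, j] = fold_preds.mean(axis=1)        # average over the k folds
```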


Examples Data

The YCV Matrix [jupyter]

I The YCV matrix has p columns and n rows
I The i-th row contains the predictions for the i-th sample from the training data set

(80, 2)
[[ 0.98940238  1.72943761]
 [ 0.14445345  0.17159034]
 [-0.0969411   0.86240816]
 [ 1.24820656  0.27214466]
 [ 0.0897393   0.1542136 ]
 ...
 [-0.56801858 -0.50319222]]


Examples Data

The YBT Matrix [jupyter]

I The YBT matrix has p columns and m rows
I The i-th row contains the predictions for the i-th sample from the test data set
I These values are the mean values of the k fold predictions calculated with the training data
I Each fold generates one model, which is used for prediction on the test data
I The mean values of these k predictions are stored in the matrix YBT

(60, 2)
[[  6.85725928e-01   1.18237211e+00]
 [  1.19810293e+00  -2.93516856e-01]
 [ -8.16703631e-01  -1.87229011e+00]
 [  2.06323037e+00   7.25281970e-01]
 [  1.37816630e+00  -1.35272556e-01]
 ...
 [  3.22714124e+00   1.36502285e+00]]


Examples Blending: Level-1 Model

Model Building [jupyter]

I The level-1 model is a function mapping the CV-values of each model to the known training y-values
I It provides an estimate of the influence of the single models
I For example, if a linear level-1 model is used, the coefficient βi represents the effect of the i-th model


Examples Blending: Level-1 Model

Model Prediction [jupyter]

I The level-1 model is used for predictions on the YBT data, i.e., on the averaged predictions of the CV-models
I It is constructed using the effects of the predicted values of the single models (determined by linear regression) on the true values of the training data
I If a model predicts a value similar to the true value during the CV, then it has a strong effect
I The final predictions are made by applying the coefficients (weights) of the single models to the YBT data
I YBT data: predicted values from the corresponding models on the final test data

[[ 0.98940238  1.72943761  0.55617508]
 [ 0.14445345  0.17159034  0.28026176]
 ...
 [-0.56801858 -0.50319222 -0.28009123]]
Intercept: -0.0199327428866
Coefficients: [ 0.34705597  0.68483785]
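A minimal sketch of this blending step, with illustrative stand-ins for YCV, YBT, and the training targets y (the notebook's actual matrices are not reproduced here):

```python
# Fit a linear level-1 model on the out-of-fold predictions YCV against
# the true training targets y, then predict on the averaged test
# predictions YBT. Data are synthetic: the first level-0 "model" is
# deliberately more accurate than the second.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
y = rng.randn(80)
YCV = np.column_stack([y + 0.3 * rng.randn(80),   # accurate level-0 model
                       y + 1.0 * rng.randn(80)])  # weaker level-0 model
YBT = rng.randn(60, 2)

level1 = LinearRegression()
level1.fit(YCV, y)            # coefficients = weights of the single models
yPred = level1.predict(YBT)   # final ensemble predictions

print(level1.intercept_, level1.coef_)
```

Because the first column tracks y more closely, its coefficient dominates, illustrating the "strong effect" remark above.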


Examples Blending: Level-1 Model

Comparing MSE [jupyter]

I Comparison of the mean squared errors of the SPO2 ensemble and the single models:

SPO2 (MSE): 0.284948273406
L (MSE): 0.673695001324
R (MSE): 0.367652881967
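MSE values like these can be computed with scikit-learn's mean_squared_error; the arrays below are illustrative, not the study's data:

```python
# Mean squared error between measured and predicted values.
import numpy as np
from sklearn.metrics import mean_squared_error

yTrue = np.array([0.0, 1.0, 2.0, 3.0])
yHat = np.array([0.1, 0.9, 2.2, 2.8])
mse = mean_squared_error(yTrue, yHat)
print(mse)  # mean of the squared residuals, ≈ 0.025
```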

[Figure: predicted vs. measured values for the SPO2 ensemble and the single models L and R]


Examples Blending: Level-1 Model

End: Jupyter Interactive Document

I End of the interactive jupyter notebook [Pérez and Granger, 2007]
I The next slides (59–64) are based on static slides


Ensembles: Considerations

Overview

Introduction

Stochastic Search Algorithms

Quality Criteria: How to Select Surrogates

Examples

Ensembles: Considerations

SPO2 Part 2

Stacking: Considerations


Ensembles: Considerations

Why are Ensembles Better?

I The following considerations are based on van Veen [2015]

Example 1 (Error Correcting Codes)
I A signal in the form of a binary string like 0111101010100000111001010101 gets corrupted with just one bit flipped, as in the following: 0110101010100000111001010101
I Repetition code: the simplest solution
I Repeat the signal multiple times in equally sized chunks and take a majority vote
  I 1. Original signal: 001010101
  I 2. Encoded (relay multiple times): 000000111000111000111000111
  I 3. Decoding: take three bits and calculate the majority


Ensembles: Considerations

A Simple Machine Learning Example

I Test set of ten samples
I The ground truth is all positive: 1111111111
I Three binary classifiers (A, B, C) with 70% accuracy each
I View the classifiers as pseudo-random number generators that
  I output a 1 70% of the time and
  I a 0 30% of the time
I These pseudo-classifiers are able to obtain 78% accuracy through a voting ensemble [van Veen, 2015]


Ensembles: Considerations

A Simple Machine Learning Example

I All three are correct: 0.7 × 0.7 × 0.7 = 0.343
I Two are correct: 0.7 × 0.7 × 0.3 + 0.7 × 0.3 × 0.7 + 0.3 × 0.7 × 0.7 = 0.441
I Two are wrong: 0.3 × 0.3 × 0.7 + 0.3 × 0.7 × 0.3 + 0.7 × 0.3 × 0.3 = 0.189
I All three are wrong: 0.3 × 0.3 × 0.3 = 0.027
I This majority-vote ensemble will be correct on average ≈ 78% of the time (0.343 + 0.441 = 0.784)
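The same probabilities can be verified by enumerating all eight outcomes (an illustrative check, not from the slides):

```python
# Probability that at least 2 of 3 independent 70%-accurate
# classifiers are correct, by enumerating all 2^3 outcomes.
from itertools import product

p = 0.7
total = 0.0
for outcome in product([True, False], repeat=3):
    prob = 1.0
    for correct in outcome:
        prob *= p if correct else (1 - p)
    if sum(outcome) >= 2:   # the majority of the three is correct
        total += prob

print(round(total, 3))  # 0.784
```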


Ensembles: Considerations

Correlation

I Uncorrelated models clearly do better when ensembled than correlated ones
I Three highly correlated models:
  I 1111111100 = 80% accuracy
  I 1111111100 = 80% accuracy
  I 1011111100 = 70% accuracy
I The models are highly correlated in their predictions; the majority vote results in no improvement:
  I 1111111100 = 80% accuracy
I Now three lower-performing, but highly uncorrelated models:
  I 1111111100 = 80% accuracy
  I 0111011101 = 70% accuracy
  I 1000101111 = 60% accuracy
I Ensembling with a majority vote results in:
  I 1111111101 = 90% accuracy
I Lower correlation between ensemble members seems to result in an increase in the error-correcting capability [van Veen, 2015]
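Both majority votes above can be reproduced with a small helper (the prediction strings are taken from the slide):

```python
# Bitwise majority vote over equally long 0/1 prediction strings.
def majority_vote(*predictions):
    n = len(predictions)
    return "".join(
        "1" if sum(int(p[i]) for p in predictions) > n // 2 else "0"
        for i in range(len(predictions[0]))
    )

correlated = majority_vote("1111111100", "1111111100", "1011111100")
uncorrelated = majority_vote("1111111100", "0111011101", "1000101111")
print(correlated, uncorrelated)  # 1111111100 1111111101
```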


Ensembles: Considerations

Beyond Averaging: Stacking

I Averaging, i.e., taking the mean of individual model predictions, works well for a wide range of problems (classification and regression) and metrics
I Often referred to as bagging
I Averaging predictions often reduces overfitting [van Veen, 2015]
I Wolpert [1992] introduced stacked generalization before bagging was proposed by Breiman [1996]
I Wolpert is famous for the No Free Lunch theorems in search and optimization
I Basic idea behind stacking: use a pool of base classifiers, then use another classifier to combine their predictions


Ensembles: Considerations

2-fold Stacking

I The stacker model gets information on the problem space by using the first-stage predictions as features [van Veen, 2015]:
  I 1. Split the train set in two parts: train_a and train_b
  I 2. Fit a first-stage model on train_a and create predictions for train_b
  I 3. Fit the same model on train_b and create predictions for train_a
  I 4. Finally, fit the model on the entire train set and create predictions for the test set
  I 5. Now train a second-stage stacker model on the predictions from the first-stage model(s)

⇒ Python Example
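The five steps can be sketched as follows, with a single first-stage model and illustrative data (the slide's actual Python example lives in the accompanying notebook):

```python
# 2-fold stacking with one first-stage model; data and models synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X, y = rng.randn(100, 3), rng.randn(100)
XTest = rng.randn(40, 3)

# 1. split the train set in two parts: train_a and train_b
idx_a, idx_b = np.arange(0, 50), np.arange(50, 100)

first = RandomForestRegressor(n_estimators=10, random_state=0)
oof = np.empty(len(X))
# 2. fit on train_a, create predictions for train_b
first.fit(X[idx_a], y[idx_a])
oof[idx_b] = first.predict(X[idx_b])
# 3. fit on train_b, create predictions for train_a
first.fit(X[idx_b], y[idx_b])
oof[idx_a] = first.predict(X[idx_a])
# 4. fit on the entire train set, create predictions for the test set
first.fit(X, y)
test_feature = first.predict(XTest)
# 5. train the second-stage stacker on the first-stage predictions
stacker = LinearRegression()
stacker.fit(oof.reshape(-1, 1), y)
yPred = stacker.predict(test_feature.reshape(-1, 1))
```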


Ensembles: Considerations

Begin: Jupyter Interactive Document

I The following part of this talk is based on an interactive jupyter notebook [Pérez and Granger, 2007]
I The next slides (67–85) summarize the output from the jupyter notebook


SPO2 Part 2

Overview

Introduction

Stochastic Search Algorithms

Quality Criteria: How to Select Surrogates

Examples

Ensembles: Considerations

SPO2 Part 2

Stacking: Considerations


SPO2 Part 2 Artificial Test Functions

Function Definitions [jupyter]

I Motivated by van der Laan and Polley [2010], we consider six test functions
I All simulations involve a univariate X drawn from a uniform distribution in [−4, +4]
I Test functions:
  I f1(x): return -2 * I(x < -3) + 2.55 * I(x > -2) - 2 * I(x > 0) + 4 * I(x > 2) - 1 * I(x > 3) + ε
  I f2(x): return 6 + 0.4 * x - 0.36 * x * x + 0.005 * x * x * x + ε
  I f3(x): return 2.83 * np.sin(math.pi/2 * x) + ε
  I f4(x): return 4.0 * np.sin(3 * math.pi * x) * I(x >= 0) + ε
  I f5(x): return x + ε
  I f6(x): return np.random.normal(0, 1, len(x)) + ε
I I(·) indicator function, ε drawn from an independent standard normal distribution, sample size r = 100 (repeats)
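Runnable versions of the six functions might look as follows; `ind(·)` stands in for the indicator I(·), `eps` adds the noise ε, and the random seed is illustrative:

```python
# The six test functions, vectorized over a 1-D numpy array x.
import numpy as np

def ind(cond):
    # indicator function I(.): 1.0 where the condition holds, else 0.0
    return cond.astype(float)

rng = np.random.RandomState(0)

def eps(n):
    # independent standard-normal noise
    return rng.normal(0, 1, n)

def f1(x): return (-2 * ind(x < -3) + 2.55 * ind(x > -2) - 2 * ind(x > 0)
                   + 4 * ind(x > 2) - 1 * ind(x > 3) + eps(len(x)))
def f2(x): return 6 + 0.4 * x - 0.36 * x**2 + 0.005 * x**3 + eps(len(x))
def f3(x): return 2.83 * np.sin(np.pi / 2 * x) + eps(len(x))
def f4(x): return 4.0 * np.sin(3 * np.pi * x) * ind(x >= 0) + eps(len(x))
def f5(x): return x + eps(len(x))
def f6(x): return rng.normal(0, 1, len(x)) + eps(len(x))

x = rng.uniform(-4, 4, 100)
samples = [f(x) for f in (f1, f2, f3, f4, f5, f6)]
```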


I f1: Step function

[Figure: plot of f1 (step function) on [−4, 4]]

I f2: Polynomial function

[Figure: plot of f2 (polynomial) on [−4, 4]]

I f3: Sine function

[Figure: plot of f3 (sine) on [−4, 4]]

I f4: Composite function

[Figure: plot of f4 (composite) on [−4, 4]]

I f5: Linear function

[Figure: plot of f5 (linear) on [−4, 4]]

I f6: Noise function

[Figure: plot of f6 (noise) on [−4, 4]]

SPO2 Part 2 Experiment 1: Step Function

f1: Coefficients of the Level-1 Model [jupyter]

I The coefficients can be interpreted as weights in the linear combination of the models. 0 = intercept; 1, 2, and 3 denote the β1, β2, and β3 values, respectively

[Figure: box plots of the level-1 model coefficients for f1]

f1: R2 Values [jupyter]

I R2 (larger values are better) and standard deviation:
  I SPO: 0.78211976, 0.03308847
  I L: 0.4024831, 0.07134356
  I R: 0.78556947, 0.03187105
  I G: 0.76547433, 0.03564519

[Figure: box plots of the R2 values for f1]

SPO2 Part 2 Experiment 2: Polynomial Function

f2: Coefficients of the Level-1 Model [jupyter]

I The coefficients can be interpreted as weights in the linear combination of the models. 0 = intercept; 1, 2, and 3 denote the β1, β2, and β3 values, respectively

[Figure: box plots of the level-1 model coefficients for f2]

f2: R2 Values [jupyter]

I R2 (larger values are better) and standard deviation:
  I SPO: 0.79514735, 0.03602018
  I L: 0.21445917, 0.07656562
  I R: 0.79488344, 0.03604606
  I G: 0.79514727, 0.03602018

[Figure: box plots of the R2 values for f2]

SPO2 Part 2 Experiment 3: Sine Function

f3: Coefficients of the Level-1 Model [jupyter]

I The coefficients can be interpreted as weights in the linear combination of the models. 0 = intercept; 1, 2, and 3 denote the β1, β2, and β3 values, respectively

[Figure: box plots of the level-1 model coefficients for f3]

f3: R2 Values [jupyter]

I R2 (larger values are better) and standard deviation:
  I SPO: 0.7939634, 0.02777211
  I L: 0.11677184, 0.05688847
  I R: 0.79244941, 0.02743085
  I G: 0.79396338, 0.02777211

[Figure: box plots of the R2 values for f3]

SPO2 Part 2 Experiment 4: Linear-Sine Function

f4: Coefficients of the Level-1 Model [jupyter]

I The coefficients can be interpreted as weights in the linear combination of the models. 0 = intercept; 1, 2, and 3 denote the β1, β2, and β3 values, respectively

[Figure: box plots of the level-1 model coefficients for f4]

f4: R2 Values [jupyter]

I R2 (larger values are better) and standard deviation:
  I SPO: 0.74144195, 0.05779718
  I L: 0.00651219, 0.01489886
  I R: 0.75301025, 0.05133169
  I G: 0.31721598, 0.07939812

[Figure: box plots of the R2 values for f4]

SPO2 Part 2 Experiment 5: Linear Function

f5: Coefficients of the Level-1 Model [jupyter]

I The coefficients can be interpreted as weights in the linear combination of the models. 0 = intercept; 1, 2, and 3 denote the β1, β2, and β3 values, respectively

[Figure: box plots of the level-1 model coefficients for f5]

f5: R2 Values [jupyter]

I R2 (larger values are better) and standard deviation:
  I SPO: 0.8362937, 0.02381472
  I L: 0.8362937, 0.02381472
  I R: 0.83628043, 0.02374492
  I G: 0.8362937, 0.02381472

[Figure: box plots of the R2 values for f5]

SPO2 Part 2 Experiment 6: Random Noise (normal)

f6: Coefficients of the Level-1 Model [jupyter]

I The coefficients can be interpreted as weights in the linear combination of the models. 0 = intercept; 1, 2, and 3 denote the β1, β2, and β3 values, respectively

[Figure: box plots of the level-1 model coefficients for f6]

f6: R2 Values [jupyter]

I R2 (larger values are better) and standard deviation:
  I SPO: -0.02025601, 0.10308039
  I L: -0.00035958, 0.01505964
  I R: 0.3586063, 0.06232495
  I G: 0.10037904, 0.05356867

[Figure: box plots of the R2 values for f6]

End: Jupyter Interactive Document

I End of the interactive jupyter notebook [Pérez and Granger, 2007]
I The next slides (88–93) are based on static slides


Stacking: Considerations

Overview

Introduction

Stochastic Search Algorithms

Quality Criteria: How to Select Surrogates

Examples

Ensembles: Considerations

SPO2 Part 2

Stacking: Considerations


Stacking: Considerations

Data

[Figure: data-flow diagram for stacking — the training set {(Xi, yi)}i=1,...,n is split into Val_Training {(Xi, yi)}i=1,...,q and Val_Test {(Xi, yi)}i=q+1,...,n; level-0 models A1 and A2 produce validation predictions yAV1 and yAV2, which feed the level-1 combiner A3; the combined model is then applied to the test set {(Xti, yti)}i=1,...,m]


Stacking: Considerations

Blending and Meta-Meta Models

I Continue with the summary of van Veen [2015]:
I Blending is a word introduced by the Netflix winners
I Instead of creating out-of-fold predictions for the train set: use a small holdout set of, say, 10% of the train set
  I Benefit: simpler than stacking
  I Con: the final model may overfit to the holdout set
I Further improvements: combine multiple ensembled models
I Use averaging or voting on manually selected well-performing ensembles:
  I Start with a base ensemble
  I Add a model when it increases the train set score the most
  I By allowing put-back of models, a single model may be picked multiple times (weighting)
I Use genetic algorithms with CV-scores as the fitness function
I van Veen [2015] proposes a fully random method:
  I Create 100 or so ensembles from randomly selected ensembles (without put-back)
  I Then pick the highest-scoring model
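The greedy add-with-put-back heuristic above can be sketched as follows (an illustrative toy version; the candidate predictions and the number of rounds are assumptions, not from the slides):

```python
# Greedy forward selection with put-back: in each round, add whichever
# candidate's predictions most improve the ensemble mean's train score.
import numpy as np

rng = np.random.RandomState(0)
y = rng.randn(50)
# candidate model predictions of varying quality (noise std 0.2/0.5/1.0)
candidates = [y + s * rng.randn(50) for s in (0.2, 0.5, 1.0)]

def mse(a, b):
    return float(np.mean((a - b) ** 2))

chosen = []
for _ in range(5):  # five greedy rounds; put-back allows repeats
    scores = [mse(np.mean(chosen + [c], axis=0), y) for c in candidates]
    chosen.append(candidates[int(np.argmin(scores))])

ensemble = np.mean(chosen, axis=0)  # implicit weighting via repeats
```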


Stacking: Considerations

“Frankenstein Ensembles"

I Why stacking and combining 1000s of models and computational hours?
I van Veen [2015] lists the following pros for these “monster models”:
  I Win competitions (Netflix, Kaggle, . . .)
  I Beat most state-of-the-art academic benchmarks with a single approach
  I Transfer knowledge from the ensemble back to a simpler shallow model
  I Loss of one model is not fatal for creating good predictions
  I Automated large ensembles don’t require much tuning or selection
  I A 1% increase in accuracy may push an investment fund from making a loss into making a little less loss. More seriously: improving healthcare screening methods helps save lives


Stacking: Considerations

Video Sequence

I The following part of this talk refers to a sequence from the video “Gerald Jay Sussman on Flexible Systems, The Power of Generic Operations”, which is available at https://vimeo.com/151465912
I 1:01:42 – 1:04:25 (to be deleted from the videolectures.net version)


Stacking: Considerations

Summary: Structure and Interpretation of ComputerPrograms (SICP)

I PROGRAMMING BY POKING: WHY MIT STOPPED TEACHING SICP http://www.posteriorscience.net/?p=206
  I 1. Hal Abelson and Gerald Jay Sussman got tired of teaching it
  I 2. The SICP curriculum no longer prepared engineers for what engineering is like today
I In the 80s and 90s, engineers built complex systems by combining simple and well-understood parts
I Programming today is
  “[m]ore like science. You grab this piece of library and you poke at it. You write programs that poke it and see what it does. And you say, ‘Can I tweak it to do the thing I want?’”
I MIT chose Python as an alternative for SICP


Stacking: Considerations

Summary

I Keywords:
  I Abelson & Sussman ✓
  I Wolpert ✓
  I Netflix ✓
  I Deep learning ✓
I Related report: “Stacked Generalization of Surrogate Models—A Practical Approach” [Bartz-Beielstein, 2016]
I More: http://www.spotseven.de


Stacking: Considerations

Acknowledgement

I This work has been supported by the Bundesministerium für Wirtschaft und Energie under the grants KF3145101WM3 and KF3145103WM4.
I This work is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 692286.


References

Alessandro Baldi Antognini and Maroussa Zagoraiou. Exact optimal designsfor computer experiments via Kriging metamodelling. Journal of StatisticalPlanning and Inference, 140(9):2607–2617, September 2010.

Thomas Bartz-Beielstein. Experimental Analysis of EvolutionStrategies—Overview and Comprehensive Introduction. Technical report,November 2003.

Thomas Bartz-Beielstein. Stacked Generalization of Surrogate Models—A Practical Approach. Technical Report 05/2016, Cologne Open Science, Cologne, 2016. URL https://cos.bibl.th-koeln.de/solrsearch/index/search/searchtype/series/id/8.

Thomas Bartz-Beielstein, Christian Lasarczyk, and Mike Preuß. SequentialParameter Optimization. In B McKay et al., editors, Proceedings 2005Congress on Evolutionary Computation (CEC’05), Edinburgh, Scotland,pages 773–780, Piscataway NJ, 2005. IEEE Press.

Andrew J Booker, J E Dennis Jr, Paul D Frank, David B Serafini, and VirginiaTorczon. Optimization Using Surrogate Objectives on a Helicopter TestExample. In Computational Methods for Optimal Design and Control, pages49–58. Birkhäuser Boston, Boston, MA, 1998.

G E P Box and N R Draper. Empirical Model Building and ResponseSurfaces. Wiley, New York NY, 1987.


J Branke and C Schmidt. Faster convergence by means of fitness estimation. Soft Computing, 9(1):13–20, January 2005.

L Breiman, J H Friedman, R A Olshen, and C J Stone. Classification andRegression Trees. Wadsworth, Monterey CA, 1984.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. ISSN 1573-0565. doi: 10.1023/A:1018054314350. URL http://dx.doi.org/10.1023/A:1018054314350.

D Büche, N N Schraudolph, and P Koumoutsakos. Accelerating EvolutionaryAlgorithms With Gaussian Process Fitness Function Models. IEEETransactions on Systems, Man and Cybernetics, Part C (Applications andReviews), 35(2):183–194, May 2005.

Ivo Couckuyt, Filip De Turck, Tom Dhaene, and Dirk Gorissen. Automaticsurrogate model type selection during the optimization of expensiveblack-box problems. In 2011 Winter Simulation Conference - (WSC 2011),pages 4269–4279. IEEE, 2011.

Michael Emmerich, Alexios Giotis, Mutlu Özdemir, Thomas Bäck, and Kyriakos Giannakoglou. Metamodel-assisted evolution strategies. In J J Merelo Guervós, P Adamidis, H G Beyer, J L Fernández-Villacañas, and H P Schwefel, editors, Parallel Problem Solving from Nature—PPSN VII,


Proceedings Seventh International Conference, Granada, pages 361–370, Berlin, Heidelberg, New York, 2002. Springer.

Alexander Forrester, András Sóbester, and Andy Keane. Engineering Designvia Surrogate Modelling. Wiley, 2008.

Alexander I J Forrester and Andy J Keane. Recent advances insurrogate-based optimization. Progress in Aerospace Sciences, 45(1-3):50–79, January 2009.

G E P Box and K B Wilson. On the Experimental Attainment of Optimum Conditions. Journal of the Royal Statistical Society. Series B (Methodological), 13(1):1–45, 1951.

K C Giannakoglou. Design of optimal aerodynamic shapes using stochasticoptimization methods and computational intelligence. Progress inAerospace Sciences, 38(1):43–76, January 2002.

Tushar Goel, Raphael T Haftka, Wei Shyy, and Nestor V Queipo. Ensemble of surrogates. Struct. Multidisc. Optim., 33(3):199–216, September 2006.

Robert B Gramacy. tgp: An R Package for Bayesian Nonstationary, Semiparametric Nonlinear Regression and Design by Treed Gaussian Process Models. Journal of Statistical Software, 19(9):1–46, June 2007.


P Hajela and E Lee. Topological optimization of rotorcraft subfloor structures for crashworthiness considerations. Computers & Structures, 64(1-4):65–76, July 1997.

Trevor Hastie. The elements of statistical learning: data mining, inference, and prediction. Springer, New York, 2nd edition, 2009.

Mark Hauschild and Martin Pelikan. An introduction and survey of estimation of distribution algorithms. Swarm and Evolutionary Computation, 1(3):111–128, September 2011.

Jiaqiao Hu, Yongqiang Wang, Enlu Zhou, Michael C Fu, and Steven I Marcus. A Survey of Some Model-Based Methods for Global Optimization. In Daniel Hernández-Hernández and J Adolfo Minjárez-Sosa, editors, Optimization, Control, and Applications of Stochastic Systems, pages 157–179. Birkhäuser Boston, Boston, 2012.

Edward Huang, Jie Xu, Si Zhang, and Chun Hung Chen. Multi-fidelity Model Integration for Engineering Design. Procedia Computer Science, 44:336–344, 2015.

Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An Evaluation of Sequential Model-based Optimization for Expensive Blackbox Functions. In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, pages 1209–1216, New York, NY, USA, 2013. ACM.

R Jin, W Chen, and T W Simpson. Comparative studies of metamodelling techniques under multiple modelling criteria. Struct. Multidisc. Optim., 23(1):1–13, December 2001.

Y Jin. A comprehensive survey of fitness approximation in evolutionary computation. Soft Computing, 9(1):3–12, October 2003.

Y Jin, M Olhofer, and B Sendhoff. On Evolutionary Optimization with Approximate Fitness Functions. GECCO, 2000.

D R Jones, M Schonlau, and W J Welch. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13:455–492, 1998.

Jack P C Kleijnen. Kriging metamodeling in simulation: A review. European Journal of Operational Research, 192(3):707–716, February 2009.

P Larrañaga and J A Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer, Boston MA, 2002.

Minh Nghia Le, Yew Soon Ong, Stefan Menzel, Yaochu Jin, and Bernhard Sendhoff. Evolution by adapting surrogates. Evolutionary Computation, 21(2):313–340, 2013.


S N Lophaven, H B Nielsen, and J Søndergaard. DACE—A Matlab Kriging Toolbox. Technical report, 2002.

J Mockus, V Tiesis, and A Zilinskas. Bayesian Methods for Seeking the Extremum. In L C W Dixon and G P Szegö, editors, Towards Global Optimization, pages 117–129. Amsterdam, 1978.

D C Montgomery. Design and Analysis of Experiments. Wiley, New York NY, 5th edition, 2001.

K P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Andrea Nelson, Juan Alonso, and Thomas Pulliam. Multi-Fidelity Aerodynamic Optimization Using Treed Meta-Models. In Fluid Dynamics and Co-located Conferences. American Institute of Aeronautics and Astronautics, Reston, Virginia, June 2007.

Emanuele Olivetti. blend.py, 2012. Code available on https://github.com/emanuele/kaggle_pbr/blob/master/blend.py. Published under a BSD 3 license.

Fernando Pérez and Brian E. Granger. IPython: a system for interactive scientific computing. Computing in Science and Engineering, 9(3):21–29, May 2007. ISSN 1521-9615. doi: 10.1109/MCSE.2007.53. URL http://ipython.org.

M J D Powell. Radial Basis Functions. Algorithms for Approximation, 1987.


Mike Preuss. Multimodal Optimization by Means of Evolutionary Algorithms. Natural Computing Series. Springer International Publishing, Cham, 2015.

F Pukelsheim. Optimal Design of Experiments. Wiley, New York NY, 1993.

Nestor V Queipo, Raphael T Haftka, Wei Shyy, Tushar Goel, Rajkumar Vaidyanathan, and P Kevin Tucker. Surrogate-based analysis and optimization. Progress in Aerospace Sciences, 41(1):1–28, January 2005.

Alain Ratle. Parallel Problem Solving from Nature — PPSN V: 5th International Conference, Amsterdam, The Netherlands, September 27–30, 1998, Proceedings, pages 87–96. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.

Margarita Alejandra Rebolledo Coy, Sebastian Krey, Thomas Bartz-Beielstein, Oliver Flasch, Andreas Fischbach, and Jörg Stork. Modeling and Optimization of a Robust Gas Sensor. Technical Report 03/2016, Cologne Open Science, Cologne, 2016.

E Sanchez, S Pintos, and N V Queipo. Toward an Optimal Ensemble of Kernel-based Approximations with Engineering Applications. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 2152–2158. IEEE, 2006.

T J Santner, B J Williams, and W I Notz. The Design and Analysis of Computer Experiments. Springer, Berlin, Heidelberg, New York, 2003.


M Schonlau. Computer Experiments and Global Optimization. PhD thesis, University of Waterloo, Ontario, Canada, 1997.

L Shi and K Rasheed. A Survey of Fitness Approximation Methods Applied in Evolutionary Algorithms. In Computational Intelligence in Expensive Optimization Problems, pages 3–28. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

Timothy Simpson, Vasilli Toropov, Vladimir Balabanov, and Felipe Viana. Design and Analysis of Computer Experiments in Multidisciplinary Design Optimization: A Review of How Far We Have Come - Or Not. In 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, pages 1–22, Reston, Virginia, June 2012. American Institute of Aeronautics and Astronautics.

G Sun, G Li, S Zhou, W Xu, X Yang, and Q Li. Multi-fidelity optimization for sheet metal forming process. Structural and Multidisciplinary . . . , 2011.

M J van der Laan and E C Polley. Super Learner in Prediction. UC Berkeley Division of Biostatistics Working Paper . . . , 2010.

Henk van Veen. Using Ensembles in Kaggle Data Science Competitions, 2015. URL http://www.kdnuggets.com/2015/06/ensembles-kaggle-data-science-competition-p1.html.

V N Vapnik. Statistical learning theory. Wiley, 1998.



G Gary Wang and S Shan. Review of Metamodeling Techniques in Support of Engineering Design Optimization. Journal of Mechanical . . . , 129(4):370–380, 2007.

David H Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, January 1992.

Luis E Zerpa, Nestor V Queipo, Salvador Pintos, and Jean-Louis Salager. An optimization methodology of alkaline–surfactant–polymer flooding processes using field scale numerical simulation and multiple surrogates. Journal of Petroleum Science . . . , 47(3-4):197–208, June 2005.

Z Zhou, Y S Ong, P B Nair, A J Keane, and K Y Lum. Combining Global and Local Surrogate Models to Accelerate Evolutionary Optimization. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 37(1):66–76, 2007.

Mark Zlochin, Mauro Birattari, Nicolas Meuleau, and Marco Dorigo. Model-Based Search for Combinatorial Optimization: A Critical Survey. Annals of Operations Research, 131(1-4):373–395, 2004.

J M Zurada. Analog implementation of neural networks. IEEE Circuits and Devices Magazine, 8(5):36–41, 1992.
