arXiv:2111.14514v1 [cs.LG] 29 Nov 2021

Naive Automated Machine Learning

Felix Mohr, Marcel Wever

Abstract An essential task of Automated Machine Learning (AutoML) is the problem of automatically finding the pipeline with the best generalization performance on a given dataset. This problem has been addressed with sophisticated black-box optimization techniques such as Bayesian Optimization, Grammar-Based Genetic Algorithms, and tree search algorithms. Most of the current approaches are motivated by the assumption that optimizing the components of a pipeline in isolation may yield sub-optimal results. We present Naive AutoML, an approach that does precisely this: It optimizes the different algorithms of a pre-defined pipeline scheme in isolation. The finally returned pipeline is obtained by just taking the best algorithm of each slot. The isolated optimization leads to substantially reduced search spaces, and, surprisingly, this approach yields comparable and sometimes even better performance than current state-of-the-art optimizers.

1 Introduction

An important task in Automated Machine Learning (AutoML) is to automatically find the pre-processing and learning algorithms with the best generalization performance on a given dataset. The combination of such algorithms is typically called a (machine learning) pipeline because several algorithms for data manipulation and analysis are put into (partial) order. The choices to be made in pipeline optimization include the algorithms used for feature pre-processing and learning as well as the hyper-parameters of the chosen algorithms.

Maybe surprisingly, all common approaches to this problem try to optimize over all decision variables simultaneously [33,9,27,23], and, to our knowledge, optimizing the different components in isolation has never been tried. While it is plausible that there are significant interactions between the optimization decisions, one can argue that approaching a global optimum by local optimization of components should at least be considered a relevant baseline to compare against.

Felix Mohr, Universidad de La Sabana, Chía, Colombia ([email protected])
Marcel Wever, Paderborn University, Germany ([email protected])

We present two approaches for pipeline optimization that do exactly that: They optimize a pipeline locally instead of globally. The most extreme approach, Naive AutoML, pretends that a locally optimal decision is also globally optimal, i.e., the optimality of a local decision is independent of how other components are chosen. In practice, this means that all components that are not subject to a local optimization process are left blank, except the learner slot, e.g., classifier or regressor, which is configured with some arbitrary default algorithm, e.g., decision tree, in order to obtain a valid pipeline. Since Naive AutoML might sometimes be too naive, we consider a marginally less extreme optimizer, called Quasi-Naive AutoML, which assumes an order in which components are optimized and applies the naivety assumption only for the upcoming decisions. That is, it assumes that the quality of local optimization decisions for a component may be influenced by earlier optimized components but not by components that will be optimized subsequently.

On top of naivety, both Naive AutoML and Quasi-Naive AutoML pretend that parameter optimization is irrelevant for choosing the best algorithm for each slot. That is, they assume that the best algorithm under default parametrization is also the best among all tuned algorithms. Therefore, both Naive AutoML and Quasi-Naive AutoML optimize a slot by first selecting an algorithm and then applying a random search in the space of parameters of each chosen algorithm.

Our experimental evaluation shows that these simple techniques are surprisingly strong when compared against state-of-the-art optimizers used in auto-sklearn and GAMA. While Naive AutoML is outperformed in the long run (24h), it is competitive with state-of-the-art approaches in the short run (1h runtime). In contrast, Quasi-Naive AutoML is not only competitive in the long run (24h), achieving a de-facto optimal performance in over 90% of the cases, but even outperforms the state-of-the-art techniques in the short run.

While these results might suggest Quasi-Naive AutoML as a meaningful baseline over which one should be able to substantially improve, we see the actual role of Quasi-Naive AutoML as the door opener for sequential optimization of pipelines. The currently applied black-box optimizers come with a series of problems discussed in recent literature such as lack of interpretability and flexibility [5,4]. The naive approaches follow a sequential optimization approach, optimizing one component after the other. While interpretability and flexibility are not a topic in this paper, they can arguably be realized more easily in custom sequential optimization approaches than in black-box optimization approaches. The strong results of Quasi-Naive AutoML seem like a promise that extensions of Quasi-Naive AutoML such as [22] could overcome the above problems of black-box optimizers without sacrificing global optimality. We discuss this in more depth in Sec. 5.3.

2 Problem Definition

Even though the vision of AutoML is much broader, a core task of AutoML addressed by most AutoML contributions is to automatically compose and parametrize machine learning algorithms to maximize a given metric such as accuracy.

In this paper, we focus on AutoML for supervised learning. Formally, in the supervised learning context, we assume some instance space $\mathcal{X} \subseteq \mathbb{R}^d$ and a label space $\mathcal{Y}$. A dataset $D \subset \{(x, y) \mid x \in \mathcal{X}, y \in \mathcal{Y}\}$ is a finite relation between the instance space and the label space, and we denote as $\mathcal{D}$ the set of all possible datasets. We consider two types of operations over instance and label spaces:

1. Pre-processors. A pre-processor is a function $t : \mathcal{X}_A \to \mathcal{X}_B$, converting an instance $x$ of instance space $\mathcal{X}_A$ into an instance of another instance space $\mathcal{X}_B$.

2. Predictors. A predictor is a function $p : \mathcal{X}_p \to \mathcal{Y}$, assigning an instance of its instance space $\mathcal{X}_p$ a label in the label space $\mathcal{Y}$.

In this paper, a pipeline $P = t_1 \circ \ldots \circ t_k \circ p$ is a functional concatenation in which $t_i : \mathcal{X}_{i-1} \to \mathcal{X}_i$ are pre-processors with $\mathcal{X}_0 = \mathcal{X}$ being the original instance space, and $p : \mathcal{X}_k \to \mathcal{Y}$ is a predictor. Hence, a pipeline is a function $P : \mathcal{X} \to \mathcal{Y}$ that assigns a label to each object of the instance space. We denote as $\mathcal{P}$ the space of all pipelines of this kind. In general, the first part of a pipeline could be not only a sequence but also a pre-processing tree with several parallel pre-processors that are then merged [27], but we do not consider such structures in this paper since they are not necessary for our key argument. An extension to such tree-shaped pipelines is canonical future work.

In addition to the sequential structure, many AutoML approaches restrict the search space still a bit further. First, a particular best order in which different types of pre-processors should be applied is often assumed. For example, we assume that feature selection should be conducted after feature scaling, so $\mathcal{P}$ will only contain pipelines compatible with this order. Second, it is assumed that the optimal pipeline uses at most one pre-processor of each type. These assumptions allow us to express every element of $\mathcal{P}$ as a concatenation of $k + 1$ functions, where $k$ is the number of considered pre-processor types, e.g., feature scalers, feature selectors, etc. If a pipeline does not adopt an algorithm of one of those types, say the $i$-th type, then $t_i$ will simply be the identity function.

The theoretical goal in supervised machine learning is to find a pipeline that minimizes the prediction error averaged over all instances from the same source as the given data. This performance cannot be computed in practice, so instead one optimizes some function $\phi : \mathcal{D} \times \mathcal{P} \to \mathbb{R}$ that estimates the performance of a candidate pipeline based on some validation data. Typical metrics used for this evaluation include error rate, least squares, AUROC, F1, log-loss, and others.

Consequently, a supervised AutoML problem instance is defined by a dataset $D \in \mathcal{D}$, a search space $\mathcal{P}$ of pipelines, and a performance estimation metric $\phi : \mathcal{D} \times \mathcal{P} \to \mathbb{R}$ for solutions. An AutoML solver $A : \mathcal{D} \to \mathcal{P}$ is a function that creates a pipeline given some training set $D_{\mathrm{train}} \subset D$. The performance of $A$ is given by

$$\mathbb{E}\left[\phi\left(D_{\mathrm{test}}, A(D_{\mathrm{train}})\right)\right],$$

where the expectation is taken with respect to the possible (disjoint) splits of $D$ into $D_{\mathrm{train}}$ and $D_{\mathrm{test}}$. In practice, this score is typically computed taking a series of random binary splits of $D$ and averaging over the observed scores. Naturally, the goal of any AutoML solver is to optimize this metric, and we assume that $A$ has access to $\phi$ (but not to $D_{\mathrm{test}}$) in order to evaluate candidates with respect to the objective function.
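The practical estimate described above can be written out as a simple average. The number of splits $n$ and the superscript $(i)$ indexing the splits are notation introduced here for illustration only:

```latex
\hat{E}_n(A) \;=\; \frac{1}{n} \sum_{i=1}^{n}
  \phi\bigl(D_{\mathrm{test}}^{(i)},\, A(D_{\mathrm{train}}^{(i)})\bigr),
\qquad
D_{\mathrm{train}}^{(i)} \cup D_{\mathrm{test}}^{(i)} = D,
\quad
D_{\mathrm{train}}^{(i)} \cap D_{\mathrm{test}}^{(i)} = \emptyset .
```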

3 Related Work

Even though the foundation of AutoML is often attributed to the proposal of Auto-WEKA, there have been some works on the topic long before. Initial approaches date back to the 90s in the field of “knowledge discovery in database systems” [6,24]. Another early work applied in the medical area is GEMS [32], in which the best pipeline of a pre-defined portfolio of configurations is selected based on a cross-validation. In contrast, the work in [6] searches a huge tree containing all possible pipeline configurations. There are mainly three approaches following this direction, differing in how the search space is defined and how the search process is guided. The first approach we are aware of was designed for the configuration of RapidMiner modules based on hierarchical planning [16,17], most notably MetaMiner [26,25]. With ML-Plan [23], the idea of HTN-based graph definitions was later combined with a best-first search using random roll-outs to obtain node quality estimates. Similarly, [30] introduced AutoML based on Monte-Carlo Tree Search, which is closely related to ML-Plan. However, the authors of [30] do not discuss the layout of the search tree, which is a crucial detail, because it is the primary channel to inject knowledge into the search problem.

Rather recently, it has also been recognized that random search is a quite competitive optimization algorithm for hyper-parameter optimization. Relevant approaches in this respect are successive halving (SH) [15] and Hyperband (HB) [19]. Similar to our naive approaches, these optimizers have a sequential aspect and greedily follow candidates that appear good early in the optimization process. However, in contrast to the naive approaches, SH and HB are sequential in the evaluation budget and not in the pipeline slots. Our naive approaches can be configured with any type of validation function and in fact could adopt HB or SH for local hyper-parameter tuning.

Another line of research based on Bayesian Optimization was initialized with the advent of Auto-WEKA [33,18]. Like Naive AutoML, Auto-WEKA assumes a fixed structure of the pipeline, admitting a feature selection step and a predictor. The decisions are encoded into a large vector that is then optimized using SMAC [14]. Auto-WEKA optimizes pipelines with algorithms of the Java data analysis library WEKA [13]. For the Python framework scikit-learn [28], the same technique was adopted by auto-sklearn [9]. In the original version, auto-sklearn added a data transformation step to the pipeline; meanwhile, the tool has been extended to support some more pre-processing functionalities in the pipeline. Besides, auto-sklearn features warm-starting and ensembling. The main difference between these approaches and tree search is that tree search successively creates solution candidates as paths of a tree instead of obtaining them from an acquisition function as done by Auto-WEKA and auto-sklearn.

The idea of warm-starting introduced by auto-sklearn was also examined in specific works based on recommendations. Approaches here include collaborative filtering like OBOE [36], probabilistic matrix factorization [10], and recommendations based on average ranks [2]. These approaches do not necessarily require, but are specifically designed for, cases in which a database of past experiences on other datasets is available.

Another interesting line of research is the application of evolutionary and swarm algorithms. One of the first approaches was PSMS [8], which used swarm particles for optimization. A more recent approach is TPOT [27]. In contrast to the above approaches, TPOT allows not just one pre-processing step but an arbitrary number of feature extraction techniques at the same time. TPOT adopts a genetic algorithm to find good pipelines and adopts the scikit-learn framework to evaluate candidates. Another approach is RECIPE [31], which uses a grammar-based evolutionary approach to evolve pipeline construction. In this, it is similar to the tree search based approaches. Focusing on the construction of stacking ensembles, another genetic approach was presented with AutoStacker [3]. The most recent development in genetic algorithms for AutoML is GAMA [12], which we also consider in the experiments of this paper.

A recent line of research adopts a type of black-box optimization relying on the framework of the alternating direction method of multipliers (ADMM) [1]. The main idea here is to decompose the optimization problem into two sub-problems for different variable types, considering that algorithm selection variables are Boolean while most parameter variables are continuous. This approach was first presented in [21].

Finally, a related approach is AutoGluon [7]. It is similar to the naive approaches in that it also conducts a kind of sequential optimization process. However, while our naive approach only optimizes base algorithms, the focus of AutoGluon is more on ensemble building through stacking and bagging.

Given this relatively rich list of approaches, it is a bit surprising that the naive approach has never been tried before. Some tools such as ML-Plan consider only the default configurations of all algorithms in a first phase. This is a bit similar to the naive approach presented in this paper but does not yet decompose the search spaces. Also, after this initial phase, the components are again optimized simultaneously.

4 Naive AutoML and Quasi-Naive AutoML

4.1 Naivety Assumption

Naive AutoML pretends that the optimal pipeline is the one that is locally best for each of its pre-processors and the final predictor. In other words, taking into account pipelines with (up to) $k$ pre-processors and a predictor, we assume that for all datasets $D$ and all $1 \leq i \leq k + 1$

$$c_i^* \in \arg\min_{c_i} \phi(D, c_1 \circ \ldots \circ c_{k+1}) \tag{1}$$

is invariant to the choices of $c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_{k+1}$, which are supposed to be fixed in the above equation. Note that we here use the letter $c$ instead of $t$ for pre-processors or $p$ for the predictor because $c$ may be any of the two types.

The typical approach to optimize the $c_i$ is not to directly construct those functions but to adopt parametrized model building processes that create these functions. For example, $c_1$ could be a projection obtained by determining some features which we want to keep, or $c_{k+1}$ could be a trained neural network. These induction processes for the components can be described by an algorithm $a_i$ and a parametrization $\theta_i$ of the algorithm. The component $c_i$ is obtained by running $a_i$ under parameters $\theta_i$ with some training data. So to optimize $c_i$, we need to optimally choose $a_i$ and $\theta_i$.

We dub the approach Naive AutoML, because there is a direct link to the assumption made by the Naive Bayes classifier. Consider $\mathcal{P}$ an urn and denote as $Y$ the event to observe an optimal pipeline in the urn. Then

$$\mathbb{P}(Y \mid c_1, \ldots, c_{k+1}) \propto \mathbb{P}(c_1, \ldots, c_{k+1} \mid Y)\,\mathbb{P}(Y) \overset{\text{naive}}{=} \mathbb{P}(c_i \mid Y) \prod_{j=1, j \neq i}^{k+1} \mathbb{P}(c_j \mid Y)\,\mathbb{P}(Y),$$

in which we consider $c_j$ to be fixed components for $j \neq i$, and only $c_i$ being subject to optimization. Applying Bayes' theorem again to $\mathbb{P}(c_i \mid Y)$ and observing that the remaining product is a constant regardless of the choices of $c_{i \neq j}$, it becomes clear that the optimal solution is the one that maximizes the probability of being locally optimal, and that this choice is independent of the choice of the other components.
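The second application of Bayes' theorem mentioned above can be spelled out as follows. This is our own expansion of the argument, and it additionally assumes that the prior $\mathbb{P}(c_i)$ is the same for every candidate of slot $i$, which the text does not state explicitly:

```latex
\mathbb{P}(c_i \mid Y) = \frac{\mathbb{P}(Y \mid c_i)\,\mathbb{P}(c_i)}{\mathbb{P}(Y)}
\;\;\Longrightarrow\;\;
\mathbb{P}(Y \mid c_1, \ldots, c_{k+1}) \propto
\mathbb{P}(Y \mid c_i)\,\mathbb{P}(c_i) \prod_{j=1, j \neq i}^{k+1} \mathbb{P}(c_j \mid Y),
\qquad
\arg\max_{c_i} \mathbb{P}(Y \mid c_1, \ldots, c_{k+1}) = \arg\max_{c_i} \mathbb{P}(Y \mid c_i).
```

The last equality holds because, with all $c_j$ ($j \neq i$) fixed, the product does not depend on $c_i$.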

A direct consequence of the naivety assumption is that we can even leave all components $c_{j \neq i}$ except the predictor component $c_{k+1}$ blank when optimizing $c_i$. In practice, this should be done since it substantially reduces the runtime of candidate evaluations. The reason why we cannot leave the predictor $c_{k+1}$ blank is, of course, that we cannot assess the performance of a pipeline that only has a pre-processor but no predictor. However, under naivety we could just use any available predictor, perhaps the fastest one.

It is clear that the naivety assumption seldom holds in practice. One way to see this is the fact that it would even enable us to use a guessing predictor to optimize the pre-processing steps. In fact, a reasonable default choice for the predictor would be the fastest learner, and the arguably fastest algorithm is one that just guesses an output (or maybe always predicts the most common label). It is unlikely that such a predictor is of much help when optimizing a pre-processor even if it is part of the candidates for $c_{k+1}$.

However, we can rescue the naivety approach by using some meaningful fixed default predictor that somewhat “represents” the candidates available for $c_{k+1}$. A reasonable choice could be a nearest neighbors or a decision tree predictor, which at least takes the features into account. Of course, this standard predictor should be fixed a priori and not depend on the dataset.

In some cases, the naivety assumption will not hold even after this repair. That is, there are datasets on which combining classifier $c_1$ with pre-processor $p_1$ is better than combining it with $p_2$, but for another classifier $c_2$ it is better to combine it with $p_2$ than with $p_1$. Such situations can be a problem for the naive approach, and the question is then how strong the performance gaps can get.
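To make such an interaction concrete, the following minimal sketch uses invented validation errors (lower is better) for two pre-processors and two classifiers. All numbers, the `None` key for "no pre-processor", and the choice of `c1` as default predictor are hypothetical and only serve to show how the naive choice can miss the joint optimum.

```python
from itertools import product

# Hypothetical validation errors; these numbers are invented for illustration.
error = {
    (None, "c1"): 0.15, (None, "c2"): 0.30,   # classifier without pre-processor
    ("p1", "c1"): 0.10, ("p2", "c1"): 0.20,
    ("p1", "c2"): 0.25, ("p2", "c2"): 0.05,
}
preprocessors = ["p1", "p2"]
classifiers = ["c1", "c2"]
default_classifier = "c1"  # assumed fixed default predictor

# Naive choice: each slot is decided in isolation.
naive_clf = min(classifiers, key=lambda c: error[(None, c)])
naive_pre = min(preprocessors, key=lambda p: error[(p, default_classifier)])
naive = (naive_pre, naive_clf)

# Joint choice: all combinations are evaluated together.
joint = min(product(preprocessors, classifiers), key=lambda pc: error[pc])

gap = error[naive] - error[joint]
print(naive, joint, round(gap, 2))
```

In this toy table the naive approach commits to `("p1", "c1")`, whereas the jointly best combination is `("p2", "c2")`; the difference between the two errors is the kind of performance gap the paragraph above refers to.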

4.2 Separate Algorithm Selection and Algorithm Configuration

On top of the naivety assumption, Naive AutoML additionally pretends that even each component $c_i$ can be optimized by local optimization techniques. More precisely, it is assumed that the algorithm that yields the best component when using the default parametrization is also the algorithm that yields the best component if all algorithms are run with the best parametrization possible.

Just like for the naivety assumption itself, we stress that this assumption is just an algorithmic decision, which does not necessarily hold in practice. In fact, the results on some datasets in the experiments clearly suggest that this assumption is not always correct. However, this does not necessarily imply that the results will overly deteriorate by making these assumptions. In a sense, our goal is precisely to study the extent to which state-of-the-art approaches can improve over the naive approach by not making these kinds of simplifying assumptions. Moreover, this gap is often surprisingly small, as the experiments in Sec. 5 show.

On the other hand, the assumption might be less far-fetched than one might expect. Our preliminary experiments showed that the tuning of parameters often yields no or only a slight improvement over the performance achieved with the default configuration. Keeping some exceptions like neural networks or SVMs in mind, experiments indicate that the variance of the variable describing the improvement of a configuration over the performance with default configuration is relatively low for many learners, at least for the datasets considered here. Additionally, in some cases, we can also expand one algorithm with highly influential parameters into several algorithms in which these parameters have already been set. For example, we could simply treat support vector machines with different kernels and different (orders of) complexity constants as different algorithms. For the large majority of algorithms, this practice does not yield an explosion in the algorithm space. In fact, the only exception we can think of is indeed neural networks.
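The expansion of one algorithm into several pre-configured algorithms can be sketched as follows. The function name `expand_algorithm`, the kernel list, and the grid of complexity constants are assumptions made for this illustration and are not the search space used in the experiments.

```python
from itertools import product

# Hypothetical value grids; not taken from the paper.
kernels = ["linear", "poly", "rbf", "sigmoid"]
complexity_orders = [0.01, 0.1, 1.0, 10.0, 100.0]  # orders of magnitude for C

def expand_algorithm(name, influential_params):
    """Turn one algorithm into several 'algorithms' with fixed key parameters."""
    keys = list(influential_params)
    variants = []
    for values in product(*(influential_params[k] for k in keys)):
        fixed = dict(zip(keys, values))
        label = name + "[" + ",".join(f"{k}={v}" for k, v in fixed.items()) + "]"
        variants.append((label, fixed))
    return variants

svm_variants = expand_algorithm("SVC", {"kernel": kernels, "C": complexity_orders})
print(len(svm_variants), svm_variants[0][0])
```

Each variant can then be treated as a separate candidate in the algorithm selection phase, while the remaining, less influential parameters are left to the tuning phase.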

4.3 The Naive AutoML Optimizer

The Naive AutoML optimizer consists of two phases. In a first phase, it just selects the best component for each slot of the pipeline based on default hyperparameter values. To this end, Naive AutoML iterates over all slots and, for each slot, builds one pipeline for each component that can be filled into that slot. The pipelines are constructed via the getPipeline function, which creates a pipeline that contains only that single component or, in the case of pre-processors, an additional standard prediction component, e.g., a decision tree. In Alg. 1, the first phase spans lines 1 to 14. In a second phase, the algorithm runs in rounds in which it tries a new random parametrization for each of the components (in isolation). If the performance of such a pipeline is better than the currently best, the parameters for that slot's component are updated correspondingly. This is done until the time-bound is hit. The whole algorithm works as a generator, and whenever a new best configuration is found, a new best pipeline $p^*$ is obtained. This pipeline $p^*$ holds for each slot the best-found choice. In Alg. 1, the second phase is described in lines 15 to 26.

In order to make the performance of Naive AutoML, on average, independent of the order in which slots and algorithms are defined in the input, those sets are shuffled at the beginning (l. 1 and 3), so that both the order of the slots themselves and the order in which the candidates per slot are tested are subject to randomness. Of course, in extensions the order of components can be suggested by warm-starting mechanisms, which is, however, not the focus of this paper.

Note that the returned pipelines $p^*$ are never executed as a whole internally, so the algorithm has, in fact, no estimate of their performance. This is precisely where the naivety enters: The algorithm trusts that each local decision is optimal, so the pipeline composed of those locally optimal decisions is also expected to be globally optimal. Hence, it is not necessary to have a concrete estimate of the performance of $p^*$.

Even though the first phase of Naive AutoML entirely fixes the algorithms of the pipeline, the hyperparameter optimization (HPO) phase optimizes each component in isolation. Instead of optimizing slot after slot, each main HPO step performs one optimization step for each slot. This procedure is repeated until the overall timeout is exhausted.

From the above presentation it becomes apparent that Naive AutoML is indeed not even a fully defined AutoML tool but only an optimizer for the AutoML context. One indicator for this is that Naive AutoML itself does not train a final candidate. A minimalistic AutoML tool around Naive AutoML would, after gaining back control, train the last received pipeline on the full data.


Algorithm 1 Naive AutoML - Optimization Routine

Require: Components C = (C_1, .., C_{k+1}) for pipeline slots, validation function Validate
 1: S ← shuffled list of {1, .., k + 1}
 2: for all slot s ∈ S do
 3:     C_s ← shuffled list of candidates for slot s
 4:     c*_s, θ*_s, v*_s, v* ← ⊥, ⊥, −∞, −∞
 5:     for all candidate c_s ∈ C_s do
 6:         v_s ← Validate(getPipeline(k, c_s, ⊥))
 7:         if v_s > v*_s then
 8:             c*_s, v*_s ← c_s, v_s
 9:             if v_s > v* then
10:                 yield ((c*_{s_1}, ⊥), .., (c*_s, ⊥))
11:             end if
12:         end if
13:     end for
14: end for
15: while timeout not reached do
16:     for all slot s ∈ S do
17:         θ_s ← random configuration for c*_s
18:         v_s ← Validate(getPipeline(k, c*_s, θ_s))
19:         if v_s > v*_s then
20:             θ*_s, v*_s ← θ_s, v_s
21:             if v_s > v* then
22:                 yield ((c*_{s_1}, θ*_{s_1}), .., (c*_{s_{k+1}}, θ*_{s_{k+1}}))
23:             end if
24:         end if
25:     end for
26: end while
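The two phases of Alg. 1 can be sketched as a Python generator. This is a simplified illustration under our own assumptions rather than the released implementation: `validate(slot, candidate, config)` is a hypothetical callback that stands in for Validate(getPipeline(...)) and returns a score where higher is better, `sample_config` is a hypothetical sampler for random parametrizations, and the sketch yields on every per-slot improvement instead of applying the additional check against $v^*$.

```python
import random
import time

def naive_automl(components, validate, sample_config, timeout_s, seed=0):
    """Sketch of Alg. 1; `components` maps each slot to its candidate list."""
    rng = random.Random(seed)
    slots = list(components)
    rng.shuffle(slots)
    best = {}  # slot -> [candidate, config, score]

    # Phase 1: algorithm selection under default parameters (config=None).
    for s in slots:
        candidates = list(components[s])
        rng.shuffle(candidates)
        for c in candidates:
            v = validate(s, c, None)
            if s not in best or v > best[s][2]:
                best[s] = [c, None, v]
                yield {t: (best[t][0], best[t][1]) for t in best}

    # Phase 2: random search over parameters, one step per slot and round.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for s in slots:
            theta = sample_config(best[s][0], rng)
            v = validate(s, best[s][0], theta)
            if v > best[s][2]:
                best[s][1], best[s][2] = theta, v
                yield {t: (best[t][0], best[t][1]) for t in best}

# Toy usage with invented scores; real use would train and validate pipelines.
base = {"scaler_a": 0.6, "scaler_b": 0.7, "tree": 0.8, "knn": 0.75}
components = {"scaler": ["scaler_a", "scaler_b"], "learner": ["tree", "knn"]}

def toy_validate(slot, candidate, config):
    return base[candidate] + (0.0 if config is None else config["bonus"])

def toy_sample(candidate, rng):
    return {"bonus": rng.uniform(-0.1, 0.1)}

final = None
for final in naive_automl(components, toy_validate, toy_sample, timeout_s=0.01):
    pass
print(final)
```

As in the text, the yielded pipeline description is never evaluated as a whole; a surrounding tool would train the last yielded pipeline on the full data.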

Even more, it can happen that such an outer algorithm must “repair” the finally built pipeline if it is corrupt in the sense that it cannot be successfully trained. Suppose that the last yielded pipeline is $p^*$. It can happen (and in practice, it does happen occasionally) that $p^*$ is not executable on specific data. For example, a pipeline $p^*$ for scikit-learn [28] may contain a StandardScaler, which produces negative attribute values for some instances, and a MultinomialNB predictor, which cannot work with negative values. Since the two components are never executed together during search, the optimizer does not detect any problem with the two components StandardScaler and MultinomialNB in isolation. Several repair possibilities would be imaginable, e.g., to replace the pre-processors with earlier found candidates for that slot, or to simply try earlier candidates of $p^*$. To keep things simple, in this paper, we just removed pre-processors from left to right until an executable pipeline $p^{*\prime}$ is created; in the extreme case, this leads to just a predictor without pre-processors.
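The left-to-right repair can be sketched as follows. The function `repair_pipeline` and the callback `can_fit` are hypothetical names; `can_fit` stands in for an attempt to train the pipeline, and the toy check below merely encodes the StandardScaler/MultinomialNB incompatibility from the example, with component names used as plain strings.

```python
def repair_pipeline(steps, can_fit):
    """Drop pre-processors from left to right until `can_fit` accepts the pipeline.

    `steps` is a list of component names whose last entry is the predictor.
    """
    steps = list(steps)
    while len(steps) > 1 and not can_fit(steps):
        steps.pop(0)  # remove the left-most remaining pre-processor
    return steps

def toy_can_fit(steps):
    # Assumed incompatibility: scaled (possibly negative) inputs break MultinomialNB.
    return not ("StandardScaler" in steps and steps[-1] == "MultinomialNB")

repaired = repair_pipeline(["StandardScaler", "SelectKBest", "MultinomialNB"], toy_can_fit)
print(repaired)
```

If the offending pre-processor is not the left-most one, this strategy also discards harmless pre-processors to its left, which is the price of keeping the repair simple.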

4.4 The Quasi-Naive AutoML Optimizer

The Quasi-Naive AutoML Optimizer makes two minor changes in the above code of Naive AutoML. First, the shuffle operation in line 1 of the algorithm is replaced by a fixed permutation $\sigma$. Second, the getPipeline routine does not leave components of previous decision steps blank (or plug in the default predictor) but puts in the default configured component $c^*_s$ chosen for the respective slot $s$. More formally, if $\sigma(i) < \sigma(j)$ and the algorithm is building a pipeline with slot $j$ as decision variable, then slot $i$ is filled with $(c^*_{s_i}, \bot)$. Typically, $\sigma$ will order the predictor first and then assume some order of decisions on the pre-processors.
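A minimal sketch of the modified pipeline construction is given below. The function name, the slot names, and the representation of a pipeline as a dictionary are assumptions for illustration; `order` plays the role of the fixed permutation, and slots decided later are left blank (`None`).

```python
def get_pipeline_quasi_naive(order, slot, candidate, chosen):
    """Build the pipeline evaluated when `slot` is the decision variable.

    `order` lists the slots in the order in which they are decided;
    `chosen` maps already decided slots to their selected component.
    """
    position = order.index(slot)
    pipeline = {}
    for i, s in enumerate(order):
        if s == slot:
            pipeline[s] = candidate
        elif i < position:
            pipeline[s] = chosen[s]  # earlier decision is kept
        else:
            pipeline[s] = None       # later slots stay blank
    return pipeline

order = ["learner", "feature_pre", "data_pre"]  # predictor is decided first
chosen = {"learner": "random_forest"}           # hypothetical earlier decision
candidate_pipeline = get_pipeline_quasi_naive(order, "feature_pre", "pca", chosen)
print(candidate_pipeline)
```

Because the predictor is decided first, every later candidate pre-processor is evaluated together with the selected learner rather than with an arbitrary default.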

Under this adjustment, the naivety assumption in Eq. (1) is relaxed as follows. Instead of assuming that all other components are irrelevant for the best choice of a component in the pipeline, one now only pretends that the subsequently chosen components are irrelevant for the optimal choice. In contrast, the previously made decisions are relevant for the current optimization question. Concerning the naivety assumption, they are relevant in the sense that the previously decided components cannot be chosen arbitrarily in the naivety property but are supposed to be fixed according to the choice that was made for that slot.

From a practical viewpoint, the strict Naive AutoML approach has almost no advantage over the Quasi-Naive AutoML approach. The only plus offered by strict Naive AutoML is that one can optimize the different slots in parallel. Intuitively, if no such parallelization is adopted and components are optimized in sequence, there is no good reason not to first find a best learner (classifier or regressor) and then use in each slot the choices made earlier in other slots when filling a pipeline. On the other hand, parallelization can also be adopted for the optimization process of a single slot, so we would argue that pure Naive AutoML is rather of theoretical interest, e.g., in order to verify the appropriateness of the naivety assumption, but does not seem to have any other relevant practical advantage.

5 Evaluation

We compare Naive AutoML and Quasi-Naive AutoML with state-of-the-art optimizers used in the context of AutoML. We stress that we aim at comparing optimizers and not whole AutoML tools. That is, we explicitly abandon previous knowledge that can be used to warm-start an optimizer and also abandon post-processing techniques like ensembling [9,12] or validation-fold-based model selection [23]. Those techniques are (largely) orthogonal to the optimizer and hence irrelevant for its performance analysis. This being said, it is, of course, possible that some optimizers benefit more from certain additional techniques like warm-starting etc. than others. However, this kind of analysis is in the scope of studies that propose those kinds of techniques.

When comparing the naive approaches with state-of-the-art optimizers, we should recognize that the naive approaches are indeed very weak optimizers. First, in contrast to global optimizers, the naive approaches do not necessarily converge to an optimal solution because large parts of the search space are pruned early. In other words, the naive approaches can only lose (or at best be competitive) in the long run. Second, the highly stochastic nature of the algorithms also does not give high hopes for great performance in the short run. Both Naive AutoML and Quasi-Naive AutoML are closely related to random search, which can be considered one of the simplest baselines¹. The only reason to believe in Naive AutoML seems to be that it quickly commits to an apparently locally best component and that hyperparameter-tuned versions of those components will also occur in an optimal pipeline.

¹ In fact, Naive AutoML is a random search in a decomposed search space: While the HPO phase is an explicit random search, the algorithm selection phase simply iterates over all possible algorithms, which is equivalent to a random search due to the small number of candidates (all of them are considered anyway).

These observations then motivate three research questions:

RQ 1: Do the naive approaches find better pipelines than state-of-the-art optimizers in the short run?
RQ 2: How often do global optimizers outperform the naive approaches in the long run, and how long do they need to take the lead?
RQ 3: How large is the performance gap between the solutions found by naive approaches compared to the best ones found by global optimizers?

To operationalize the terms “short run” and “long run”, we choose time windows of 1h and 1d, respectively. These time limits are, of course, arbitrary but are common practice and seem to represent a good compromise taking into account the ecological impact of such extensive experiments.

5.1 Experiment Setup

5.1.1 Compared Optimizers and Search Space Definition

The evaluation is focused on the machine learning package scikit-learn [28]. On the state-of-the-art side, we compare solutions with the competitive AutoML tools auto-sklearn and GAMA. Hence, as one reference optimizer, we consider Bayesian Optimization as implemented in auto-sklearn, which adopts SMAC as its optimizer [14]. We use version 0.12.6, which underwent substantial changes and improvements compared to the original version [9]. As a second baseline, we compare against the genetic algorithm optimizer proposed in GAMA [12]. In order to isolate possibly confounding factors and to only compare optimization techniques, all pre- and post-processing activities such as warm-starting and ensemble building were deactivated in auto-sklearn and GAMA. We are not aware of other approaches that have been shown to substantially outperform these tools at the optimizer level. Some works claim to outperform auto-sklearn but only demonstrate that with rank plots, so the extent of improvement is unclear [30,21].

To maximize comparability, we unified the search space among the compared optimizers as far as possible. Since all optimizers except auto-sklearn can be configured relatively easily in their search space and pipeline structure, we adopted the pipeline structure dictated by auto-sklearn. This pipeline consists of three steps: so-called data-pre-processors, which are mainly feature scalers; feature-pre-processors, which are mainly feature selectors and decomposition techniques; and finally the estimator. The appendix shows the concrete list of algorithms used for each category. We also used the hyperparameter space defined by auto-sklearn for each of the components. Unfortunately, there are some (proprietary) components such as balancing and minority coalescer that cannot be deactivated in auto-sklearn but also cannot be easily used in other tools. This implies that the search spaces are not entirely identical, but an analysis of results suggests that those differences are probably not relevant for the comparison. The search spaces of the naive approaches and GAMA are almost identical. The only difference is that GAMA, at the time of writing, only supports explicitly defined domains for parameter values, which does not match the concept of numerical parameters used in auto-sklearn and the naive approaches through the ConfigSpace library [20]. To overcome this problem, we sampled 10000 values for each parameter and used these as a discrete space; this sampling mechanism already included log-scale sampling where applicable (see footnote 2).
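The discretization of a numerical parameter into 10000 values can be illustrated with a minimal sketch. It assumes uniform random sampling (in log space for log-scale parameters); the function name `discretize`, the seed, and the example range are hypothetical and not taken from the benchmark code.

```python
import numpy as np

def discretize(lower, upper, n_values=10000, log=False, seed=0):
    """Sample n_values points from [lower, upper] and return them sorted.

    Assumption: values are drawn uniformly at random; for log-scale
    parameters the draw is uniform in log space and mapped back.
    """
    rng = np.random.default_rng(seed)
    if log:
        values = np.exp(rng.uniform(np.log(lower), np.log(upper), size=n_values))
    else:
        values = rng.uniform(lower, upper, size=n_values)
    return np.sort(values)

# Hypothetical regularization parameter with a log-scale range
domain = discretize(1e-3, 1e3, log=True)
share_below_one = float(np.mean(domain < 1.0))
```

With log-scale sampling, about half of the sampled values fall below the geometric center of the range, which a uniform draw on the raw scale would not achieve.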

Implementations of the naive approaches and the experiments are publicly available (see footnote 3). The repository also contains the data we used to create the result figures and tables.

5.1.2 Benchmark Datasets

The evaluation is based on the dataset portfolio proposed in [11]. This is a collection of datasets available on openml.org [34]. These datasets cover both binary and multi-class classification with numerical and categorical attributes. Within this scope, the dataset selection is quite diverse in terms of numbers of instances, numbers of attributes, numbers of classes, and distributions of types of attributes. The appendix lists the relevant properties of each of these datasets to confirm this diversity. Our assessment is hence limited to binary and multi-class classification.

For all datasets, categorical attributes were replaced by a Bernoulli encoding (one-hot encoding) and missing values were replaced by 0 prior to passing the data to the optimizer. The sole purpose of this was to avoid implicit search space differences, because auto-sklearn comes with some pre-processors specifically tailored for categorical attributes. Since these are partially proprietary and not easily applicable with GAMA and the naive approaches, we simply eliminated this decision variable from the search space. Hence, the optimizers construct pipelines that are applied to purely numerical datasets. Of course, the imputation with 0 is often sub-optimal, but since the imputation is the same for all optimizers, it does not affect the comparison among them.
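A minimal sketch of such a preparation step is given below. It assumes pandas as the data container and treats every non-numeric column as categorical; the helper name `to_numeric_matrix` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def to_numeric_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode non-numeric columns, then replace missing values by 0."""
    # Assumption: every non-numeric column is a categorical attribute
    categorical = list(df.select_dtypes(exclude="number").columns)
    encoded = pd.get_dummies(df, columns=categorical, dtype=float)
    # A missing category yields an all-zero indicator row; missing numbers become 0
    return encoded.fillna(0.0)

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "color": ["red", "blue", None],
})
X = to_numeric_matrix(df)
```

The result contains only numerical columns, so every optimizer receives the same purely numerical input.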

5.1.3 Validation Mechanism and Performance Metrics

With respect to validation, we standardized both the outer and the inner evaluation mechanism. First, the outer evaluation mechanism splits the original data into train and test data. For this, we chose a 90% train fold size and a 10% test fold size. Running each optimizer 10 times with different such random splits corresponds to a 10-iteration Monte Carlo cross-validation with 90% train fold size. Of course, splits were identical per seed among all optimizers. Second, the evaluation mechanism for a concrete pipeline candidate was fixed among all approaches to 5-fold cross-validation; no early stopping was applied.
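The two evaluation levels can be sketched with scikit-learn as follows. This is only an illustration under assumptions: stratified splitting, the synthetic dataset, and the logistic regression used as a stand-in for a candidate pipeline are choices made here, not details confirmed by the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def evaluate_candidate(pipeline, X_train, y_train, seed):
    """Inner evaluation: mean score of a 5-fold cross-validation on the train fold."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="roc_auc").mean()

results = []
for seed in range(10):  # 10 iterations of Monte Carlo cross-validation
    # Outer split: 90% train, 10% test; the seed fixes the split for all optimizers
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.9, random_state=seed, stratify=y
    )
    candidate = LogisticRegression(max_iter=1000)  # stand-in for a candidate pipeline
    val_score = evaluate_candidate(candidate, X_train, y_train, seed)
    results.append((seed, len(X_train), len(X_test), val_score))
```

Reusing the seed for the outer split is what makes the comparison paired: every optimizer sees the same train and test data in a given repetition.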

Note that our primary focus here is not on test performance but on validation performance. This paper compares optimizers, so we should measure them in terms of what they optimize, namely validation performance. It can clearly happen that strong optimization of that metric yields no better or even worse performance on the test data (over-fitting). Even though test performance is, in our view, not relevant for the research questions, we conduct the outer splits and hence provide test performance results in order to maximize insights.

2 All these efforts were realized in collaboration with and under the approval of the authors of GAMA to ensure a maximally faithful adaptation of the code for the purpose of this benchmark.

3 https://github.com/fmohr/naiveautoml/tree/mlj2021

Following the argumentation of Provost et al. [29], we abstain from the use of accuracy as a performance measure for comparison. Instead, we use the area under the receiver operating characteristic curve (AUROC) as a performance measure of models on binary classification data, as proposed in [29], and log-loss on multi-class classification data, as suggested in the context of the AutoML benchmark [11]. Since these metrics are based on prediction probabilities, they allow for a more fine-grained assessment. A particular advantage of AUROC is that it is agnostic to class imbalance.
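As an illustration of this metric choice, the following sketch selects the measure by the number of classes using scikit-learn's metric functions. The helper `benchmark_score` and the toy predictions are hypothetical.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

def benchmark_score(y_true, proba, labels):
    """Return (metric name, value): AUROC for two classes, log-loss otherwise."""
    proba = np.asarray(proba)
    if len(labels) == 2:
        # AUROC is computed from the predicted probability of the second class
        return "auroc", roc_auc_score(y_true, proba[:, 1])
    return "log_loss", log_loss(y_true, proba, labels=labels)

binary = benchmark_score(
    [0, 0, 1, 1],
    [[0.9, 0.1], [0.6, 0.4], [0.35, 0.65], [0.2, 0.8]],
    labels=[0, 1],
)
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
multi = benchmark_score([0, 1, 2], uniform, labels=[0, 1, 2])
```

Note the opposite orientation of the two measures: a higher AUROC is better, whereas a lower log-loss is better, which matters whenever gaps or ranks are computed.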

5.1.4 Resources and Hardware Used

Timeouts were configured as follows. For the short (long) run, we applied an overall runtime limit of 1h (24h), and the runtime for a single pipeline execution was configured to take up to 5 (20) minutes. The memory was set to 24GB and, despite the technical possibilities, we did not parallelize evaluations. That is, all the tools were configured to run with a single CPU core. The computations were executed in a compute center with Linux machines, each of them equipped with 2.6 GHz Intel Xeon E5-2670 processors and 32GB memory.

5.2 Results

5.2.1 RQ 1: Do the naive approaches find better pipelines than state-of-the-art optimizers in the short run?

To answer this question, we look at the results for an overall timeout of 1h. Taking into account the timeouts allowed in some competitions, this time limit can even be considered rather generous. However, those competitions typically compare full-fledged systems that make massive use of warm-starting, so a timeout of 1h seems appropriate when comparing cold-started optimizers. Also, note that this is not the same as looking at the first part of the 24h runs since the timeout per execution is also lower; it is hence a different setup.

Fig. 1 summarizes the results on a very abstract level. Both plots in this figure show performance ranks. The left plot shows, for each point of time t, the rank obtained by an approach when using the validation performance of the best-seen solution up to t. That is, it shows the internally best 5-CV result observed for any candidate pipeline up to that time. The lines indicate median ranks, and the shaded areas show the rank IQRs. Since we compare optimizers and are not primarily interested in test performance, we focus on validation performances. However, to complement the internal validation results, the right plot shows the rankings of the different optimizers with respect to the performance obtained by the finally returned model on the test data. The vertical bars in the violin plots are the respective medians.
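The rank curves described here can be computed along the following lines. The sketch assumes that the validation scores of all optimizers have already been aligned on a common time grid and that higher scores are better; the array layout and the function name `rank_curves` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def rank_curves(scores):
    """scores[o, d, t]: validation score observed by optimizer o on dataset d at step t.

    Returns median ranks and rank quartiles per optimizer and step (rank 1 is best).
    """
    # Best-seen solution up to each step
    best_so_far = np.maximum.accumulate(scores, axis=2)
    # Rank the optimizers against each other for every dataset and step
    ranks = rankdata(-best_so_far, method="average", axis=0)
    median = np.median(ranks, axis=1)
    q1, q3 = np.percentile(ranks, [25, 75], axis=1)
    return median, q1, q3

# Two optimizers, one dataset, three steps
scores = np.array([
    [[0.7, 0.6, 0.9]],
    [[0.5, 0.8, 0.8]],
])
median, q1, q3 = rank_curves(scores)
```

Ties receive average ranks in this sketch; for log-loss, the sign flip before ranking would have to be dropped and the running maximum replaced by a running minimum.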

The plot shows that the naive approaches are competitive with or even stronger than state-of-the-art tools in the short run. Naive AutoML is competitive with GAMA, and both outperform auto-sklearn’s SMAC in this time horizon. Quasi-Naive AutoML even substantially detaches from that group and maintains a clear advantage over the whole time span. This advantage is also preserved on the test set, yielding the best test set performance among all approaches. In at least 50% of the cases, Quasi-Naive AutoML ranks best or is the runner-up.

Fig. 1 Validation ranks (left) and test ranks (right) for 1h timeouts. [Left: validation rank (1.0 to 4.0) over runtime (0 to 3500 s); right: test rank (1.0 to 4.0) per optimizer; optimizers: auto-sklearn, GAMA, naive, quasi-naive.]

Fig. 2 Empirical gaps on the validation performance for binary (left) and multi-class (right) classification on 1h timeouts. [Left: "AUROC gaps over time", gap in AUROC (0.00 to 0.10) over time (0 to 3500 s); right: "Log-Loss gaps over time", gap in log-loss (0.00 to 0.20) over time (0 to 3500 s); optimizers: auto-sklearn, GAMA, naive, quasi-naive.]

We now discuss a more quantitative metric based on the empirical gap. This metric considers, for each point of time t, the best performance observed for any candidate by any optimizer up to time t and then computes for each optimizer the gap between its best-found solution and that reference score. These empirical gaps over time are shown in Fig. 2. Since the empirical gap metric requires comparable scales, it can only be averaged over instances of identical problem types (and hence identical base metrics). Therefore, we provide it once for binary classification based on AUROC (left) and once for the multi-class classification datasets based on log-loss (right). At each point of time, the observations for all datasets corresponding to the respective type (binary or multi-class) are aggregated: The solid lines are median gaps, the shaded areas are the interquartile ranges (IQR), and the dashed lines are 10%-trimmed mean gaps.
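The empirical gap and its aggregation can be sketched as follows. The array layout is an assumption, and so is the reading of "10%-trimmed" as cutting 10% of the observations from each tail, which is what `scipy.stats.trim_mean` does; the text does not specify this detail.

```python
import numpy as np
from scipy.stats import trim_mean

def empirical_gaps(best_so_far, higher_is_better=True):
    """best_so_far[o, d, t] -> non-negative gaps to the best optimizer at each step."""
    if higher_is_better:  # e.g. AUROC
        return best_so_far.max(axis=0, keepdims=True) - best_so_far
    return best_so_far - best_so_far.min(axis=0, keepdims=True)  # e.g. log-loss

def aggregate(gaps):
    """Aggregate over datasets: median, quartiles, and trimmed mean per optimizer and step."""
    median = np.median(gaps, axis=1)
    q1, q3 = np.percentile(gaps, [25, 75], axis=1)
    trimmed = trim_mean(gaps, proportiontocut=0.1, axis=1)
    return median, q1, q3, trimmed

# Two optimizers, ten datasets, one step; optimizer 1 fails on two datasets
best = np.full((2, 10, 1), 0.9)
best[1, :2, 0] = 0.4
median, q1, q3, trimmed = aggregate(empirical_gaps(best))
```

In this toy example the median gap of the second optimizer is zero while its trimmed mean is not, which mirrors the situation where a trimmed-mean curve lies above the IQR area because of a few datasets with large gaps.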

First, we can see that for both performance measures, AUROC and log-loss, the median gaps are tiny. Our interpretation of this is that for at least 50% of the datasets, the different optimization approaches perform more or less equally. In terms of log-loss, auto-sklearn is slightly worse than the other approaches by a margin of approximately 0.05.

In general, we argue that gaps in log-loss below 0.1 are somewhat negligible. If the difference in log-loss between two models is below 0.1, this means that the ratio of probabilities assigned to the correct class is, on (geometric) average, at most around 1.1. For a three-class problem, this means that, even for situations of rather high uncertainty, if the better model assigns 55% probability to the correct class, the weaker model still assigns roughly 50% probability (0.55/1.1) to the correct class and will hence typically still choose it. This degree of irrelevance increases with a higher certainty of the better model or with higher numbers of classes. In other words, in concrete situations where the two or three classes with the highest probability are on par, small differences in log-loss will not necessarily but often result in identical behaviors of the models.
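The relation between a log-loss difference and the probability ratio follows from the definition of log-loss as the mean of -ln p over the instances, where p is the probability assigned to the correct class: a difference of Δ between two models equals the mean of ln(p_better/p_weaker), so the geometric mean of that ratio is exp(Δ). The short sketch below recomputes the example from the text; the function and variable names are hypothetical.

```python
import math

def probability_ratio(log_loss_gap):
    """Geometric-mean ratio p_better / p_weaker implied by a log-loss difference."""
    return math.exp(log_loss_gap)

ratio = probability_ratio(0.1)   # ratio implied by a log-loss gap of 0.1
p_better = 0.55                  # probability the better model assigns to the correct class
p_weaker = p_better / ratio      # corresponding probability of the weaker model
```

Since this is a statement about the geometric mean over all instances, the ratio on an individual instance can be larger or smaller.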


Fig. 3 Empirical gaps on the test performance for binary (left) and multi-class (right) classification on 1h timeouts. [Left: "Gaps on AUROC datasets" (0.00 to 0.10); right: "Gaps on log-loss datasets" (up to 0.200); one distribution per optimizer: auto-sklearn, GAMA, naive, quasi-naive.]

As a second observation, we can see that the trimmed mean statistic does show a substantial difference between the approaches. This holds specifically for auto-sklearn and for GAMA in the case of log-loss performance. The fact that those curves are consistently above the IQR area indicates that there are some datasets on which a substantial gap can be observed; the concrete performance curves per dataset in Sec. D of the Appendix confirm this observation.

To complete this analysis, we also look at the test performance gaps of the different approaches. These are summarized in Fig. 3. For each dataset, the best test score among the four models returned by the optimizers is computed, and then the gap of each approach is the difference between the test performance of its model and that best score, oriented such that gaps are non-negative (best score minus own score for AUROC, own score minus best score for log-loss).

We can observe that Quasi-Naive AutoML plays a reasonably dominant role in this comparison. On both problem classes, i.e., binary classification and multi-class classification, it has a median gap of 0, indicating that it provides the best among all found solutions in at least 50% of the cases. While the advantage of Quasi-Naive AutoML in the multi-class scenario is quite pronounced, its tail behavior in the case of AUROC does not seem as strong as that of GAMA. However, there are only two datasets on which the gap in AUROC is above 0.03.

Putting everything together, our assessment is that the naive approaches indeed compete with or even outperform the other approaches in the short run. Neither auto-sklearn nor GAMA can substantially outperform the strict Naive AutoML approach in terms of validation performance; they achieve, however, a slightly better test performance in some cases. Overall, Naive AutoML is competitive with auto-sklearn and GAMA and even outperforms one of them on either binary or multi-class classification. For Quasi-Naive AutoML we observe the same, but the advantage is much more pronounced. In the short run, Quasi-Naive AutoML seems to be by far the best and most stable choice among the four optimizers.

5.2.2 RQ 2: How often do global optimizers outperform the naive approaches in the long run and how long do they need to take the lead?

To get a first idea about the behavior of the optimizers in the long run, we again consider the rank plots, now for the timeout of 24h, in Fig. 4. As expected, we can observe that, over time, the more sophisticated optimizers gain an advantage over the naive approaches. We added a vertical dotted black line at the respective points of time where auto-sklearn and GAMA start to rank better than Quasi-Naive AutoML. For auto-sklearn, this point sets in after approximately 4h of runtime, and for GAMA after 7h of runtime. However, even though auto-sklearn takes and keeps the lead after 4h, it barely improves over a rank of 2, which means that it is on average on par with the set of other optimizers.

Fig. 4 Validation ranks (left) and test ranks (right) for 24h timeouts. [Left: validation rank (1.0 to 4.0) over runtime (0 to 80000 s); right: test rank (1.0 to 4.0) per optimizer; optimizers: auto-sklearn, GAMA, naive, quasi-naive.]

The ranking observed on the internal validation performance can also roughly be observed on the test performance ranks. In general, auto-sklearn produces the best or second-best test performance in 50% of the cases, whereas GAMA and Quasi-Naive AutoML have a slightly worse test rank performance. These two have the same median rank, but GAMA scores slightly better at the first quartile (q1). Naive AutoML is outperformed in terms of ranks in this time horizon.

In order to get a slightly better understanding of the points of time at which, and the numbers of datasets on which, Naive AutoML and Quasi-Naive AutoML start to be outperformed, Fig. 5 plots the numbers of wins in the duels between the naive approaches on one side and auto-sklearn and GAMA on the other side over time. Each of these plots contains two lines, one corresponding to each of the dueling optimizers. The left plots show the duels between Naive AutoML and auto-sklearn and GAMA, and the right plots show the respective duels of Quasi-Naive AutoML against auto-sklearn and GAMA. On the x-axis, we show the runtime on a log scale. On the y-axis, we count the number of datasets on which the respective optimizer has the lead (best-observed validation performance up to that point of time).
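Counting wins in such a duel can be sketched as below. The sketch assumes best-so-far validation curves on a common time grid and counts ties for neither side; both are assumptions, as are the function name `duel_wins` and the toy numbers.

```python
import numpy as np

def duel_wins(best_a, best_b, higher_is_better):
    """Count, per step, the datasets on which optimizer A or B has the lead.

    best_a, best_b: arrays [d, t] with best-so-far validation scores.
    higher_is_better: boolean array [d] (True for AUROC, False for log-loss).
    """
    sign = np.where(higher_is_better, 1.0, -1.0)[:, None]
    diff = sign * (best_a - best_b)
    wins_a = (diff > 0).sum(axis=0)
    wins_b = (diff < 0).sum(axis=0)  # ties count for neither optimizer
    return wins_a, wins_b

best_a = np.array([[0.8, 0.8], [0.6, 0.7], [0.5, 0.4]])
best_b = np.array([[0.7, 0.9], [0.6, 0.8], [0.6, 0.3]])
higher = np.array([True, True, False])  # third dataset is scored by log-loss
wins_a, wins_b = duel_wins(best_a, best_b, higher)
```

Because only the sign of the difference matters, datasets with different base metrics can be combined in one count, unlike the gap statistics.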

Looking at the left side of the plots, one observes that Naive AutoML is clearly inferior to both auto-sklearn and GAMA. Naive AutoML is on par with auto-sklearn and GAMA in the beginning, but the number of datasets on which the others improve over Naive AutoML steadily increases after 3 hours.

Assessing the plots for Quasi-Naive AutoML on the right, we see a similar but less extreme picture. For the first three hours, the algorithms are fairly balanced, with slight but consistent advantages of Quasi-Naive AutoML over both auto-sklearn and GAMA. From this viewpoint, Quasi-Naive AutoML is clearly performing better than Naive AutoML in the short run. Likewise, the advantage of auto-sklearn and GAMA after the first three hours is much less pronounced than in the case of Naive AutoML. In fact, until the end, Quasi-Naive AutoML remains the winner on 30% of the datasets.

To summarize, we can answer the second research question as follows. We observe that both auto-sklearn and GAMA manage to outperform Naive AutoML in 75% of the cases in the long run. While GAMA is constantly the winner on a higher number of datasets when dueling with Naive AutoML, auto-sklearn has slight disadvantages in the first 20 minutes but then takes the lead. With respect to Quasi-Naive AutoML, we observe that auto-sklearn wins in 70% and GAMA in 65% of the cases in the long run, and it takes them roughly 3 hours to take the lead in this aggregated view. Needless to say, this is a very condensed view, so we refer to the detailed plots per dataset over time in the appendix.

Fig. 5 Validation performance duels: Count of datasets on which each of the dueling optimizers obtained the better validation score up to some point of time (x-axis on log-scale). [Four panels showing absolute wins (0 to 60) over time (s): auto-sklearn vs. naive, auto-sklearn vs. quasi-naive, GAMA vs. naive, GAMA vs. quasi-naive.]

The discussion around RQ2 has been entirely qualitative. We only looked at orderings of optimizers but not at absolute performance. While ranks are exactly what is needed to answer the binary question of whether an optimizer outperforms another one, we are clearly also interested in the extent to which the better optimizers outperform the others. After all, auto-sklearn seems to make a pretty strong case in the evaluation above, but how substantial are those advantages?

5.2.3 RQ 3: How large is the performance gap between the solutions found by naive approaches compared to the best ones found by global optimizers?

To answer this question, we just look at the performance of the final solution on each problem produced by each optimizer. For each dataset, we take the median performance of each algorithm and identify the best among them. The gap of an algorithm is the difference between its own score and the best one. For the two different problem classes, i.e., binary and multi-class classification, the gaps are summarized in Fig. 6. Whiskers show the median, the 90% quantile, and the maximum observation, respectively.
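A minimal sketch of this gap computation is shown below. It assumes that the median is taken over the repeated runs (seeds) of an optimizer on a dataset, which the text does not state explicitly; the array layout, function names, and toy scores are hypothetical.

```python
import numpy as np

def final_gaps(test_scores, higher_is_better=True):
    """test_scores[o, d, r]: test score of optimizer o on dataset d in run r.

    Returns gaps[o, d] between the median score and the best median per dataset.
    """
    median_scores = np.median(test_scores, axis=2)  # assumption: median over runs
    if higher_is_better:  # e.g. AUROC
        return median_scores.max(axis=0) - median_scores
    return median_scores - median_scores.min(axis=0)  # e.g. log-loss

def whiskers(gaps):
    """Median, 90% quantile, and maximum of the gaps per optimizer."""
    return np.median(gaps, axis=1), np.quantile(gaps, 0.9, axis=1), gaps.max(axis=1)

scores = np.array([
    [[0.90, 0.80, 0.85], [0.70, 0.70, 0.70]],
    [[0.80, 0.80, 0.90], [0.75, 0.76, 0.74]],
])
gaps = final_gaps(scores)
med, q90, mx = whiskers(gaps)
```

On each dataset at least one optimizer has a gap of exactly zero, so a median gap of zero means that an optimizer provides the best median score on at least half of the datasets.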

These plots now clearly put the apparent dominance of auto-sklearn suggested in the rank plots into perspective. Looking first at the left plot for the AUROC in binary classification, we can see that all of the optimizers have a close-to-zero median; that is, each of the optimizers is, on 50% of the datasets, competitive with the best one. Both auto-sklearn and GAMA are in 90% of the cases less than 0.02 away from the best performance (lower auxiliary line). However, Quasi-Naive AutoML is also competitive up to 0.02 in 85% of the cases and up to 0.04 in 95% of the cases (worse only on one dataset; upper auxiliary line). So while the advantages of auto-sklearn and GAMA are quantitatively measurable, they are indeed fairly small in the great majority of the cases. Looking now at the right-hand side, the situation is even more balanced on the benchmarks for multi-class classification. In fact, auto-sklearn and Quasi-Naive AutoML have a comparable 90% quantile, which is below 0.1 (auxiliary line). As discussed already for RQ1, we consider differences of less than 0.1 rather negligible. Put differently, in over 90% of the cases, both auto-sklearn and Quasi-Naive AutoML exhibit essentially optimal performance on multi-class classification datasets after 24h. The performance of GAMA is not substantially worse though, since here, too, 80% of the runs are at most 0.1 worse than the best solution, which can still be regarded as reasonably good. In fact, even the performance of Naive AutoML is not too bad in that the performance gap is below 0.1 in at least 60% of the cases. However, there is also a good number of cases in which the gap of Naive AutoML is substantial.

Fig. 6 Empirical gaps on the test performance for binary (left) and multi-class (right) classification on 24h timeouts. Whiskers show the medians, 90% quantiles, and the maxima. [Left: "Gaps on AUROC datasets" (up to 0.10); right: "Gaps on log-loss datasets" (up to 0.200).]

This being said, we answer the research question as follows. auto-sklearn, as the algorithm that shows the best performance on most datasets after 24h, exhibits virtually no performance advantage over any of the naive approaches on 50% of the datasets for both binary and multi-class classification. While it is able to significantly outperform Naive AutoML in the long run on some datasets, it rarely ever does so against Quasi-Naive AutoML. In the case of multi-class classification benchmarks, Quasi-Naive AutoML is almost fully on par with auto-sklearn, and in binary classification there are 5 datasets on which the performance gap of Quasi-Naive AutoML is bigger than 0.02, while it is bigger than 0.04 only once. On binary classification, the same comparison holds for GAMA against Quasi-Naive AutoML, while on multi-class classification Quasi-Naive AutoML even performs better than GAMA.

5.3 Discussion

Putting all the results together, the naive approach seems to make a perhaps unexpectedly strong case against established optimizers for standard classification problems. Even the fully naive approach is competitive in the long run in 50% of the cases. A possible reason for this could be that some datasets are “too easy” to optimize over, but there is no specific reason to believe that real-world datasets are necessarily harder in this sense. When applying the quasi-naive assumption, we obtain an optimizer that is hardly ever significantly outperformed by either auto-sklearn or GAMA. Both auto-sklearn and GAMA manage to gain slight qualitative advantages over Quasi-Naive AutoML as runtime increases, but the associated quantitative advantages are negligible most of the time.

In our view, these are thrilling results as they suggest an entirely new way of thinking about the optimization process in AutoML. Until now, pipeline optimization has almost always been treated as a complete black-box. One strength but also weakness of the black-box approaches is that they do not require but also do not efficiently support domain knowledge. That is, the knowledge about machine learning is either encoded into the problem or wrapped around the optimizers in pre-processing (e.g., warm-starting) or post-processing (e.g., ensembling). However, little knowledge can be considered within the optimization approach, and the naive approach is different in this regard. In contrast to black-box optimizers, it suggests that the optimization process can be realized sequentially. The ability to optimize sequentially opens the door to optimization flows, which in turn give room for specialized components within the optimization process [22]. For example, based on the observations in the optimization of one slot, it would be possible to activate or deactivate certain optimization modules in the subsequent optimization workflow. Since this paper has shown even Quasi-Naive AutoML to be competitive, there is some reason to believe that such more sophisticated approaches might even be superior to black-box optimization.

Besides potential superiority in terms of absolute performance, sequential optimization frameworks come with a couple of qualitative advantages. As recent studies have shown, black-box approaches have, besides their difficulties of including knowledge into the optimization process, some substantial drawbacks that hinder their application in practice, such as a lack of flexibility, interactivity, and understandability [5,35]. There are many techniques in the spectrum between Quasi-Naive AutoML and full black-box optimization that are still to be explored and that have the potential to combine high performance with the important additional “soft” requirements mentioned above. The competitive results of Quasi-Naive AutoML clearly motivate research in this direction.

6 Conclusion

In this paper, we have presented two naive approaches for the optimization of machine learning pipelines. Contrary to previous works, these approaches fully (Naive AutoML) or largely (Quasi-Naive AutoML) ignore the general assumption of dependencies between the choices of algorithms within a pipeline. Furthermore, algorithm selection and hyperparameter optimization are decoupled by first selecting the algorithms of a pipeline only considering their default parametrizations. Only when the algorithms are fixed are their hyperparameters optimized.
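The isolated, default-parameter algorithm selection summarized here can be illustrated with a small scikit-learn sketch. It is not the implementation from the repository: the candidate lists are illustrative, and judging each pre-processor slot together with one fixed reference estimator is an assumption made only to have a score for slots that cannot be evaluated on their own. The subsequent hyperparameter tuning of the fixed algorithms is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

def cv_score(steps):
    """5-fold cross-validated AUROC of a pipeline built from the given steps."""
    return cross_val_score(make_pipeline(*steps), X, y, cv=5, scoring="roc_auc").mean()

# Illustrative candidates with default parameters; None leaves a slot empty
slots = {
    "data_pre_processor": [None, StandardScaler(), MinMaxScaler()],
    "feature_pre_processor": [None, PCA(n_components=4), SelectKBest(k=4)],
    "estimator": [LogisticRegression(max_iter=1000),
                  DecisionTreeClassifier(random_state=0),
                  KNeighborsClassifier()],
}
reference = KNeighborsClassifier()  # assumption: fixed estimator for scoring pre-processors

best = {}
for name, candidates in slots.items():
    def slot_score(candidate):
        if name == "estimator":
            return cv_score([candidate])
        return cv_score([reference] if candidate is None else [candidate, reference])
    # Each slot is decided on its own, ignoring the choices made for other slots
    best[name] = max(candidates, key=slot_score)

# The returned pipeline just combines the best algorithm of each slot
final_steps = [best[name] for name in slots if best[name] is not None]
final_pipeline = make_pipeline(*final_steps)
final_score = cv_score(final_steps)
```

The number of evaluations grows with the sum of the slot sizes (here 3 + 3 + 3) rather than with their product (27), which is the source of the reduced search space.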

The results show that the naive approaches are much more competitive than one might expect. For shorter timeouts of 1h, both naive algorithms perform highly competitively with the optimization algorithms of state-of-the-art AutoML tools and are sometimes (Quasi-Naive AutoML in fact even consistently) superior. In the long run, the 24h experiments show that Quasi-Naive AutoML is largely on par with auto-sklearn and GAMA in terms of gaps to the best solution.

The naive approach is a door-opener for sequential optimization flows and hence naturally motivates a series of future work. Besides the canonical extensions towards more complex pipeline structures such as the shape of a tree or a directed acyclic graph, it seems imperative to further explore the potential of a less naive approach as suggested in [22], which adopts a stage-based optimization scheme. Another interesting direction is to create a more interactive version of Naive AutoML in which the expert obtains visual summaries of what choices have been made and has the option to intervene, e.g., by revising some of the choices. This could lead to an approach considering different optimization rounds for different slots.

Acknowledgements: We thank Matthias Feurer and Pieter Gijsbers for their remarkable support in adjusting auto-sklearn and GAMA for our evaluations.


Declarations

Funding: This work was supported by the German Research Foundation (DFG) within the Collaborative Research Center “On-The-Fly Computing” (SFB 901)

Conflicts of interest/Competing interests: Eyke Hullermeier

Availability of data and material: https://github.com/fmohr/naiveautoml/tree/mlj2021

Code availability: https://github.com/fmohr/naiveautoml/tree/mlj2021

Authors’ contributions: Felix Mohr is the main author of both paper and implementation. Marcel Wever contributed to the manuscript revision as well as the resolution of technical aspects of the evaluation.

Ethics approval: not applicable

Consent to participate: not applicable

Consent for publication: not applicable

References

1. Boyd, S., Parikh, N., Chu, E.: Distributed optimization and statistical learning via thealternating direction method of multipliers. Now Publishers Inc (2011)

2. Cachada, M., Abdulrahman, S.M., Brazdil, P.: Combining feature and algorithm hy-perparameter selection using some metalearning methods. In: P. Brazdil, J. Van-schoren, F. Hutter, H.H. Hoos (eds.) Proceedings of the International Workshop on Au-toML@PKDD/ECML 2017, CEUR Workshop Proceedings, vol. 1998, pp. 69–83 (2017)

3. Chen, B., Wu, H., Mo, W., Chattopadhyay, I., Lipson, H.: Autostacker: A compositionalevolutionary learning system. In: Proceedings of the Genetic and Evolutionary Computa-tion Conference, pp. 402–409 (2018)

4. Crisan, A., Fiore-Gartland, B.: Fits and starts: Enterprise use of automl and the role ofhumans in the loop. arXiv preprint arXiv:2101.04296 (2021)

5. Drozdal, J., Weisz, J., Wang, D., Dass, G., Yao, B., Zhao, C., Muller, M., Ju, L., Su, H.:Trust in automl: Exploring information needs for establishing trust in automated machinelearning systems. In: Proceedings of the 25th International Conference on Intelligent UserInterfaces, pp. 297–307 (2020)

6. Engels, R.: Planning tasks for knowledge discovery in databases; performing task-orienteduser-guidance. In: E. Simoudis, J. Han, U.M. Fayyad (eds.) Proceedings of the SecondInternational Conference on Knowledge Discovery and Data Mining (KDD-96), Portland,Oregon, USA, pp. 170–175. AAAI Press (1996)

7. Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., Smola, A.: Autogluon-tabular: Robust and accurate automl for structured data. arXiv preprint arXiv:2003.06505(2020)

8. Escalante, H.J., Montes-y-Gomez, M., Sucar, L.E.: Particle swarm model selection. J.Mach. Learn. Res. 10, 405–440 (2009)

9. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficientand robust automated machine learning. In: Advances in Neural Information ProcessingSystems, pp. 2962–2970 (2015)

10. Fusi, N., Sheth, R., Elibol, M.: Probabilistic matrix factorization for automated machinelearning. In: Proceedings of the 32nd International Conference on Neural InformationProcessing Systems, pp. 3352–3361 (2018)

11. Gijsbers, P., LeDell, E., Thomas, J., Poirier, S., Bischl, B., Vanschoren, J.: An open sourceautoml benchmark. arXiv preprint arXiv:1907.00909 (2019)

12. Gijsbers, P., Vanschoren, J.: GAMA: genetic automated machine learning assistant. J.Open Source Softw. 4(33), 1132 (2019)

13. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The wekadata mining software: an update. ACM SIGKDD Explorations 11 (2009)

14. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for generalalgorithm configuration. LION 5, 507–523 (2011)

15. Jamieson, K., Talwalkar, A.: Non-stochastic best arm identification and hyperparameteroptimization. In: Artificial Intelligence and Statistics, AISTATS’16, pp. 240–248 (2016)


16. Kietz, J., Serban, F., Bernstein, A., Fischer, S.: Towards cooperative planning of datamining workflows. In: Proceedings of the Third Generation Data Mining Workshop at the2009 European Conference on Machine Learning, pp. 1–12. Citeseer (2009)

17. Kietz, J.U., Serban, F., Bernstein, A., Fischer, S.: Designing KDD-workflows via HTN-planning for intelligent discovery assistance. In: 5th Planning to Learn Workshop WS28at ECAI 2012, p. 10 (2012)

18. Kotthoff, L., Thornton, C., Hoos, H.H., Hutter, F., Leyton-Brown, K.: Auto-weka 2.0:Automatic model selection and hyperparameter optimization in weka. The Journal ofMachine Learning Research 18(1), 826–830 (2017)

19. Li, L., Jamieson, K.G., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: Anovel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18,185:1–185:52 (2017). URL http://jmlr.org/papers/v18/16-558.html

20. Lindauer, M., Eggensperger, K., Feurer, M., Biedenkapp, A., Marben, J., Muller, P., Hut-ter, F.: Boah: A tool suite for multi-fidelity bayesian optimization & analysis of hyperpa-rameters. arXiv:1908.06756 [cs.LG]

21. Liu, S., Ram, P., Vijaykeerthy, D., Bouneffouf, D., Bramble, G., Samulowitz, H., Wang,D., Conn, A., Gray, A.: An admm based framework for automl pipeline configuration.In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 4892–4899(2020)

22. Mohr, F., Wever, M.: Replacing the ex-def baseline in automl by naive automl. In: 8thICML Workshop on Automated Machine Learning (AutoML) (2021)

23. Mohr, F., Wever, M., Hullermeier, E.: Ml-plan: Automated machine learning via hierar-chical planning. Machine Learning 107(8), 1495–1515 (2018)

24. Morik, K., Scholz, M.: The miningmart approach to knowledge discovery in databases. In:Intelligent technologies for information analysis, pp. 47–65. Springer (2004)

25. Nguyen, P., Hilario, M., Kalousis, A.: Using meta-mining to support data mining workflowplanning and optimization. Journal of Artificial Intelligence Research 51, 605–644 (2014)

26. Nguyen, P., Kalousis, A., Hilario, M.: Experimental evaluation of the e-lico meta-miner.In: 5th planning to learn workshop WS28 at ECAI, pp. 18–19 (2012)

27. Olson, R.S., Moore, J.H.: Tpot: A tree-based pipeline optimization tool for automatingmachine learning. In: Workshop on Automatic Machine Learning, pp. 66–74 (2016)

28. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning inpython. the Journal of machine Learning research 12, 2825–2830 (2011)

29. Provost, F.J., Fawcett, T., Kohavi, R.: The case against accuracy estimation for comparinginduction algorithms. In: J.W. Shavlik (ed.) Proceedings of the Fifteenth InternationalConference on Machine Learning (ICML 1998), Madison, Wisconsin, USA, July 24-27,1998, pp. 445–453. Morgan Kaufmann (1998)

30. Rakotoarison, H., Schoenauer, M., Sebag, M.: Automated machine learning with monte-carlo tree search. In: S. Kraus (ed.) Proceedings of the Twenty-Eighth International JointConference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp.3296–3303. ijcai.org (2019). DOI 10.24963/ijcai.2019/457

31. de Sa, A.G., Pinto, W.J.G., Oliveira, L.O.V., Pappa, G.L.: Recipe: a grammar-basedframework for automatically evolving classification pipelines. In: European Conference onGenetic Programming, pp. 246–261. Springer (2017)

32. Statnikov, A.R., Tsamardinos, I., Dosbayev, Y., Aliferis, C.F.: GEMS: A system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int. J. Medical Informatics 74(7-8), 491–503 (2005). DOI 10.1016/j.ijmedinf.2005.05.002

33. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, pp. 847–855 (2013)

34. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: Networked science in machine learning. SIGKDD Explorations 15(2), 49–60 (2013). DOI 10.1145/2641190.2641198

35. Wang, D., Weisz, J.D., Muller, M., Ram, P., Geyer, W., Dugan, C., Tausczik, Y., Samulowitz, H., Gray, A.: Human-ai collaboration in data science: Exploring data scientists’ perceptions of automated ai. Proceedings of the ACM on Human-Computer Interaction 3(CSCW), 1–24 (2019)

36. Yang, C., Akimoto, Y., Kim, D.W., Udell, M.: Oboe: Collaborative filtering for automl model selection. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1173–1183 (2019)

http://jmlr.org/papers/v18/16-558.html


Appendix

A Datasets

All datasets are available via the openml.org platform [34].
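As an illustration of how a dataset from the table below might be retrieved by its OpenML id, here is a minimal sketch using scikit-learn's `fetch_openml`. The helper name `load_openml_dataset` is hypothetical and not part of the original experiments; the call downloads data, so it needs network access, and `as_frame=True` assumes pandas is installed.

```python
from sklearn.datasets import fetch_openml


def load_openml_dataset(openml_id):
    """Download one dataset by its OpenML id (hypothetical helper).

    Returns the feature table and the target column as pandas objects.
    """
    bunch = fetch_openml(data_id=openml_id, as_frame=True)
    return bunch.data, bunch.target
```

For example, `load_openml_dataset(3)` would request the dataset listed under id 3 (kr-vs-kp).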

openmlid  name                  instances  features  numeric features  classes  min %  maj %  % missing  % [0,1]  % µ = 0  % σ = 1
3         kr-vs-kp              3196       36        0                 2        47%    52%    0%         n/a      n/a      n/a
12        mfeat-factors         2000       216       216               10       10%    10%    0%         n/a      n/a      n/a
23        cmc                   1473       9         2                 3        22%    42%    0%         n/a      n/a      n/a
31        credit-g              1000       20        7                 2        30%    70%    0%         0%       0%       0%
54        vehicle               846        18        18                4        23%    25%    0%         0%       0%       0%
181       yeast                 1484       8         8                 10       0%     31%    0%         25%      0%       0%
188       eucalyptus            736        19        14                5        14%    29%    3%         0%       0%       0%
1049      pc4                   1458       37        37                2        12%    87%    0%         3%       0%       0%
1067      kc1                   2109       21        21                2        15%    84%    0%         0%       0%       0%
1111      KDDCup09 appetency    50000      230       192               2        1%     98%    69%        0%       0%       0%
1457      amazon-commerce-revi  1500       10000     10000             50       2%     2%     0%         22%      0%       0%
1461      bank-marketing        45211      16        7                 2        11%    88%    0%         0%       0%       0%
1464      blood-transfusion-se  748        4         4                 2        23%    76%    0%         0%       0%       0%
1468      cnae-9                1080       856       856               9        11%    11%    0%         88%      0%       0%
1475      first-order-theorem-  6118       51        51                6        7%     41%    0%         0%       4%       2%
1485      madelon               2600       500       500               2        50%    50%    0%         0%       0%       0%
1486      nomao                 34465      118       89                2        28%    71%    0%         83%      0%       0%
1487      ozone-level-8hr       2534       72        72                2        6%     93%    0%         0%       0%       0%
1489      phoneme               5404       5         5                 2        29%    70%    0%         0%       100%     100%
1494      qsar-biodeg           1055       41        41                2        33%    66%    0%         7%       0%       0%
1515      micro-mass            571        1300      1300              20       1%     10%    0%         0%       17%      0%
1590      adult                 48842      14        6                 2        23%    76%    0%         0%       0%       0%
4134      Bioresponse           3751       1776      1776              2        45%    54%    0%         81%      1%       0%
4135      Amazon employee acce  32769      9         0                 2        5%     94%    0%         n/a      n/a      n/a
4534      PhishingWebsites      11055      30        0                 2        44%    55%    0%         n/a      n/a      n/a
4538      GesturePhaseSegmenta  9873       32        32                5        10%    29%    0%         0%       75%      0%
4541      Diabetes130US         101766     49        13                3        11%    53%    0%         0%       0%       0%
23512     higgs                 98050      28        28                2        47%    52%    0%         0%       7%       0%
23517     numerai28.6           96320      21        21                2        49%    50%    0%         100%     0%       0%
40498     wine-quality-white    4898       11        11                7        0%     44%    0%         0%       0%       0%
40668     connect-4             67557      42        0                 3        9%     65%    0%         n/a      n/a      n/a
40670     dna                   3186       180       0                 3        24%    51%    0%         n/a      n/a      n/a
40685     shuttle               58000      9         9                 7        0%     78%    0%         0%       0%       0%
40701     churn                 5000       20        16                2        14%    85%    0%         0%       0%       0%
40900     Satellite             5100       36        36                2        1%     98%    0%         0%       0%       0%
40975     car                   1728       6         0                 4        3%     70%    0%         n/a      n/a      n/a
40978     Internet-Advertiseme  3279       1558      3                 2        13%    86%    0%         0%       0%       0%
40981     Australian            690        14        6                 2        44%    55%    0%         0%       0%       0%
40982     steel-plates-fault    1941       27        27                7        2%     34%    0%         11%      0%       0%
40983     wilt                  4839       5         5                 2        5%     94%    0%         0%       0%       0%
40984     segment               2310       19        19                7        14%    14%    0%         6%       0%       0%
40996     Fashion-MNIST         70000      784       784               10       10%    10%    0%         0%       0%       0%
41027     jungle chess 2pcs ra  44819      6         6                 3        9%     51%    0%         0%       0%       0%
41138     APSFailure            76000      170       170               2        1%     98%    8%         0%       0%       0%
41142     christine             5418       1636      1599              2        50%    50%    0%         0%       0%       0%
41143     jasmine               2984       144       8                 2        50%    50%    0%         0%       0%       0%
41144     madeline              3140       259       259               2        49%    50%    0%         0%       0%       0%
41145     philippine            5832       308       308               2        50%    50%    0%         0%       3%       0%
41146     sylvine               5124       20        20                2        50%    50%    0%         0%       0%       0%
41147     albert                425240     78        26                2        50%    50%    8%         0%       0%       0%
41150     MiniBooNE             130064     50        50                2        28%    71%    0%         0%       0%       0%
41156     ada                   4147       48        48                2        24%    75%    0%         83%      8%       0%
41157     arcene                100        10000     10000             2        44%    56%    0%         0%       1%       0%
41158     gina                  3153       970       970               2        49%    50%    0%         0%       0%       0%
41159     guillermo             20000      4296      4296              2        40%    59%    0%         0%       0%       0%
41161     riccardo              20000      4296      4296              2        25%    75%    0%         0%       0%       0%
41162     kick                  72983      32        14                2        12%    87%    6%         0%       0%       0%
41163     dilbert               10000      2000      2000              5        19%    20%    0%         0%       0%       0%
41164     fabert                8237       800       800               7        6%     23%    0%         96%      4%       0%
41165     robert                10000      7200      7200              10       9%     10%    0%         0%       0%       0%
41166     volkert               58310      180       180               10       2%     21%    0%         15%      18%      0%
41167     dionis                416188     60        60                355      0%     0%     0%         0%       10%      0%
41168     jannis                83733      54        54                4        2%     46%    0%         4%       0%       0%
41169     helena                65196      27        27                100      0%     6%     0%         4%       0%       0%
42732     sf-police-incidents   2215023    9         3                 2        12%    87%    0%         0%       0%       0%
42733     Click prediction sma  39948      11        5                 2        16%    83%    0%         0%       0%       0%

Table 1 Overview of datasets used in the evaluation.
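The summary columns of Table 1 could be computed along the following lines. This is a sketch under assumptions the source does not spell out: "min %" and "maj %" are read as the shares of the smallest and largest class, "% missing" as the share of missing cells, and the last three columns as the share of numeric features whose values lie in [0,1], whose mean is 0, and whose standard deviation is 1. The function name `dataset_summary` is hypothetical.

```python
import numpy as np


def dataset_summary(X, y):
    """Summarise numeric features X (NaN = missing) and class labels y.

    The column interpretations are assumptions, not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    class_share = counts / counts.sum()
    col_min = np.nanmin(X, axis=0)
    col_max = np.nanmax(X, axis=0)
    col_mean = np.nanmean(X, axis=0)
    col_std = np.nanstd(X, axis=0)  # population standard deviation
    return {
        "instances": X.shape[0],
        "features": X.shape[1],
        "classes": len(counts),
        "min %": 100 * class_share.min(),
        "maj %": 100 * class_share.max(),
        "% missing": 100 * np.isnan(X).mean(),
        "% [0,1]": 100 * np.mean((col_min >= 0) & (col_max <= 1)),
        "% mu = 0": 100 * np.mean(np.isclose(col_mean, 0.0, atol=1e-6)),
        "% sigma = 1": 100 * np.mean(np.isclose(col_std, 1.0, atol=1e-6)),
    }


# Toy data: the first feature lies in [0,1], the second has mean 0 and std 1.
X_toy = [[0.0, -1.0], [1.0, 1.0], [0.5, -1.0], [0.5, 1.0]]
y_toy = [0, 0, 0, 1]
summary = dataset_summary(X_toy, y_toy)
```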


B Considered Algorithms

The following algorithms from the scikit-learn library were considered for the three pipeline slots (same setup for all optimizers). Please refer to https://github.com/fmohr/naiveautoml/tree/mlj2021 for the exact specification of the search space including the hyper-parameter spaces.

Data-Pre-Processors

– Normalizer
– VarianceThreshold
– QuantileTransformer
– StandardScaler
– MinMaxScaler
– PowerTransformer
– RobustScaler

Feature-Pre-Processors

– FeatureAgglomeration
– PCA
– PolynomialFeatures
– Nystroem
– SelectPercentile
– KernelPCA
– GenericUnivariateSelect
– RBFSampler
– FastICA

Classifiers

– SVC (once for each out of four kernels)
– KNeighborsClassifier
– QuadraticDiscriminantAnalysis
– RandomForestClassifier
– MultinomialNB
– LinearDiscriminantAnalysis
– ExtraTreesClassifier
– BernoulliNB
– MLPClassifier
– GradientBoostingClassifier
– GaussianNB
– DecisionTreeClassifier
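To make the three-slot scheme concrete, the following sketch assembles one scikit-learn pipeline with one algorithm per slot and scores it by cross-validated AUROC. The specific picks (StandardScaler, PCA, RandomForestClassifier), their hyper-parameter values, the step names, and the synthetic data are arbitrary choices for illustration, not the configuration used in the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One candidate per slot, picked arbitrarily from the lists above.
pipeline = Pipeline([
    ("data_pre_processor", StandardScaler()),
    ("feature_pre_processor", PCA(n_components=5)),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Synthetic binary data stands in for a real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 3-fold cross-validated AUROC of the assembled pipeline.
scores = cross_val_score(pipeline, X, y, cv=3, scoring="roc_auc")
print(scores.mean())
```

Swapping the estimator in one slot while keeping the others fixed is one simple way to compare candidates for that slot.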




C Final Result Tables

The following tables show the mean test score of each approach on each dataset together with the standard deviation. Best performances are in bold. Entries that are not at least 0.01 (AUROC) or 0.1 (log-loss) worse than the best one, or that are not statistically significantly different from it (according to a Wilcoxon signed-rank test with p = 0.05), are underlined.
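The marking rule can be sketched as follows. This assumes that the per-run scores of two approaches on one dataset are paired (for example, by random seed), which the text does not state. The function name `is_underlined` is hypothetical, and `scipy.stats.wilcoxon` is used as the signed-rank test.

```python
import numpy as np
from scipy.stats import wilcoxon


def is_underlined(best_scores, other_scores, margin, alpha=0.05,
                  higher_is_better=True):
    """Return True if `other` counts as competitive with `best`.

    Competitive means: mean gap below `margin`, or no significant
    difference under a paired Wilcoxon signed-rank test.
    """
    best = np.asarray(best_scores, dtype=float)
    other = np.asarray(other_scores, dtype=float)
    gap = best.mean() - other.mean()
    if not higher_is_better:  # for log-loss, lower values are better
        gap = -gap
    if gap < margin:
        return True
    if np.allclose(best - other, 0.0):  # the test is undefined for all-zero differences
        return True
    return wilcoxon(best, other).pvalue >= alpha


best_auroc = np.linspace(0.90, 0.99, 10)
close_auroc = best_auroc - 0.005                      # within the 0.01 margin
far_auroc = best_auroc - np.linspace(0.05, 0.14, 10)  # clearly worse in every run
```

With these toy scores, `is_underlined(best_auroc, close_auroc, margin=0.01)` is true because of the margin, while `far_auroc` fails both criteria.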

C.1 Binary Classification Datasets

id     auto-sklearn  GAMA          naive         quasi-naive
3      1.0±0.0       1.0±0.0       1.0±0.0       1.0±0.0
31     0.78±0.04     0.78±0.05     0.75±0.09     0.77±0.05
1049   0.94±0.02     0.95±0.02     0.91±0.04     0.95±0.02
1067   0.83±0.02     0.83±0.03     0.84±0.03     0.84±0.02
1111   0.68±0.03     0.67±0.06     0.63±0.02     0.69±0.03
1461   0.78±0.01     0.78±0.01     0.78±0.01     0.78±0.01
1464   0.73±0.06     0.73±0.05     0.74±0.04     0.74±0.06
1485   0.93±0.01     0.94±0.13     0.92±0.01     0.91±0.03
1486   0.98±0.0      0.98±0.0      0.98±0.0      0.98±0.0
1487   0.93±0.04     0.91±0.09     0.93±0.05     0.92±0.06
1489   0.97±0.01     0.96±0.01     0.97±0.01     0.97±0.04
1494   0.93±0.03     0.93±0.03     0.91±0.03     0.93±0.03
1590   0.89±0.0      0.89±0.01     0.89±0.0      0.89±0.0
4134   0.89±0.01     0.88±0.01     0.88±0.01     0.88±0.03
4135   0.83±0.02     0.84±0.02     0.66±0.17     0.84±0.03
4534   1.0±0.0       1.0±0.0       1.0±0.0       1.0±0.0
23512  0.79±0.01     0.8±0.04      0.79±0.04     0.79±0.01
23517  0.53±0.01     0.53±0.01     0.53±0.01     0.53±0.01
40701  0.91±0.03     0.9±0.08      0.9±0.02      0.92±0.02
40900  0.99±0.01     0.99±0.01     0.9±0.12      0.99±0.15
40978  0.98±0.01     0.97±0.02     0.97±0.03     0.98±0.02
40981  0.95±0.03     0.94±0.03     0.93±0.04     0.95±0.03
40983  1.0±0.0       1.0±0.0       0.61±0.31     0.68±0.26
41138  0.99±0.01     0.99±0.0      0.99±0.01     0.99±0.01
41142  0.82±0.03     0.81±0.02     0.81±0.02     0.82±0.02
41143  0.88±0.02     0.89±0.02     0.87±0.02     0.88±0.02
41144  0.95±0.02     0.94±0.01     0.94±0.02     0.82±0.2
41145  0.9±0.02      0.86±0.07     0.79±0.03     0.84±0.09
41146  0.99±0.0      0.99±0.0      0.99±0.0      0.89±0.19
41147  0.64±0.05     0.72±0.04     0.73±0.0      0.73±0.0
41150  0.97±0.0      0.98±0.02     0.98±0.0      0.98±0.0
41156  0.91±0.02     0.91±0.02     0.91±0.02     0.91±0.01
41157  0.87±0.11     0.86±0.12     0.89±0.13     0.95±0.06
41158  0.99±0.01     0.99±0.0      0.99±0.0      0.99±0.0
41159  0.5±0.0       0.71±0.12     0.88±0.01     0.89±0.01
41161  0.5±0.0       0.93±0.09     1.0±0.0       1.0±0.0
41162  0.74±0.01     0.72±0.02     0.73±0.01     0.73±0.01
42732  0.5±0.0       0.62±0.05     0.64±0.0      0.64±0.0
42733  0.62±0.01     0.62±0.01     0.65±0.05     0.72±0.01

Table 2 Avg. test AUROC on binary classification datasets (1h timeout).



id     auto-sklearn  GAMA          naive         quasi-naive
3      1.0±0.0       1.0±0.0       1.0±0.03      1.0±0.0
31     0.77±0.05     0.78±0.05     0.77±0.05     0.77±0.05
1049   0.95±0.02     0.95±0.02     0.92±0.05     0.94±0.03
1067   0.85±0.02     0.83±0.03     0.8±0.04      0.81±0.04
1111   0.77±0.02     0.7±0.04      0.71±0.03     0.74±0.02
1461   0.78±0.01     0.78±0.01     0.78±0.01     0.78±0.01
1464   0.72±0.06     0.73±0.04     0.73±0.06     0.73±0.05
1485   0.95±0.01     0.94±0.01     0.93±0.01     0.91±0.02
1486   0.99±0.0      0.99±0.0      0.99±0.0      0.99±0.0
1487   0.93±0.04     0.93±0.04     0.92±0.04     0.92±0.05
1489   0.97±0.01     0.97±0.01     0.97±0.01     0.97±0.01
1494   0.93±0.03     0.94±0.02     0.92±0.02     0.92±0.13
1590   0.89±0.0      0.89±0.0      0.89±0.0      0.89±0.0
4134   0.89±0.01     0.88±0.01     0.89±0.01     0.89±0.03
4135   0.88±0.02     0.85±0.03     0.88±0.02     0.87±0.02
4534   1.0±0.0       1.0±0.0       1.0±0.0       1.0±0.0
23512  0.81±0.01     0.81±0.0      0.79±0.01     0.8±0.01
23517  0.53±0.01     0.53±0.01     0.52±0.01     0.53±0.01
40701  0.92±0.03     0.92±0.03     0.91±0.02     0.92±0.09
40900  1.0±0.0       1.0±0.01      0.91±0.1      0.95±0.09
40978  0.98±0.02     0.98±0.01     0.97±0.04     0.98±0.01
40981  0.95±0.03     0.95±0.03     0.93±0.04     0.95±0.03
40983  1.0±0.0       1.0±0.0       0.71±0.28     0.75±0.21
41138  0.99±0.0      0.99±0.0      0.99±0.0      0.99±0.0
41142  0.84±0.02     0.83±0.03     0.83±0.03     0.82±0.07
41143  0.88±0.02     0.89±0.02     0.87±0.02     0.88±0.02
41144  0.97±0.01     0.95±0.01     0.95±0.01     0.88±0.18
41145  0.93±0.01     0.91±0.01     0.79±0.02     0.9±0.02
41146  0.99±0.0      0.99±0.0      0.99±0.0      0.94±0.17
41147  0.77±0.0      0.76±0.02     0.73±0.0      0.73±0.0
41150  0.99±0.0      0.99±0.16     0.97±0.0      0.92±0.16
41156  0.92±0.02     0.91±0.02     0.91±0.01     0.91±0.02
41157  0.9±0.08      0.84±0.12     0.87±0.1      0.94±0.06
41158  0.99±0.0      0.99±0.0      0.99±0.0      0.99±0.0
41159  0.92±0.01     0.9±0.02      0.89±0.01     0.92±0.01
41161  1.0±0.0       1.0±0.0       0.87±0.23     1.0±0.0
41162  0.75±0.01     0.74±0.01     0.73±0.01     0.73±0.01
42732  0.64±0.01     0.65±0.0      0.64±0.0      0.65±0.0
42733  0.64±0.01     0.63±0.03     0.62±0.02     0.61±0.03

Table 3 Avg. test AUROC on binary classification datasets (1d timeout).


C.2 Multi-Class Classification Datasets


id     auto-sklearn  GAMA          naive         quasi-naive
12     0.13±0.06     0.17±7.05     1.54±5.17     0.13±0.04
23     0.91±0.04     0.91±0.04     2.02±3.08     0.94±0.22
54     0.47±0.06     0.44±7.79     1.05±0.36     0.42±0.05
181    1.05±0.08     1.13±0.19     1.09±0.19     1.07±0.08
188    1.15±0.07     1.16±0.07     1.28±0.09     1.21±0.07
1457   2.14±0.05     2.04±0.52     1.3±0.8       0.75±0.15
1468   0.27±0.09     0.2±0.06      0.18±0.62     0.13±0.04
1475   1.12±0.05     1.08±0.06     1.12±0.07     1.15±0.04
1515   0.57±0.1      0.56±0.2      0.5±0.08      0.53±0.09
4538   0.92±0.04     0.85±0.05     0.87±0.03     0.89±0.15
4541   0.94±0.01     0.93±0.01     0.93±0.0      0.92±0.0
40498  0.8±0.03      1.17±0.58     0.77±0.25     0.79±0.2
40668  0.65±0.12     0.45±0.3      0.59±0.01     0.59±0.01
40670  0.11±0.05     0.17±0.03     0.19±0.07     0.1±0.02
40685  0.0±0.0       0.0±0.0       0.0±0.0       0.0±0.0
40975  0.0±0.01      0.0±0.0       0.24±0.22     0.02±0.21
40982  0.53±0.05     0.76±0.6      0.68±0.06     0.54±0.46
40984  0.08±0.03     0.07±0.03     0.07±0.03     0.06±0.03
40996  2.3±0.0       1.37±0.58     0.37±0.01     0.36±0.01
41027  0.23±0.06     0.2±0.03      0.42±0.1      0.27±0.03
41163  0.22±0.06     0.17±0.23     0.05±0.04     0.05±0.01
41164  0.83±0.02     0.96±0.12     0.89±0.08     0.84±0.08
41165  2.3±0.0       2.26±2.37     1.71±0.02     1.72±0.02
41166  1.18±0.35     1.91±5.4      1.04±0.01     1.02±0.03
41167  5.87±0.0      5.37±6.06     2.31±0.03     2.31±0.03
41168  0.76±0.04     0.76±0.05     0.8±0.06      0.78±0.02
41169  3.83±0.33     3.45±0.44     3.26±0.02     3.07±0.02
42734  0.72±0.05     0.65±0.04     0.64±0.05     0.62±0.01

Table 4 Avg. test Log-Loss on multi-class classification datasets (1h timeout).
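As a reading aid for the log-loss tables, the sketch below computes the multi-class log-loss and the value obtained by a predictor that assigns the same probability to each of K classes, which is ln K regardless of the data. For K = 10 this is about 2.30 and for K = 355 about 5.87, which coincides with the entries 2.3±0.0 (datasets 40996 and 41165, 10 classes each) and 5.87±0.0 (dataset 41167, 355 classes) in the first result column of Table 4. Whether those runs actually fell back to a uniform prediction is an assumption, not something the source states. Both function names are hypothetical.

```python
import math


def multiclass_log_loss(y_true, probabilities):
    """Mean negative log-probability assigned to the true class.

    y_true holds class indices; probabilities holds one list per instance.
    """
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        total -= math.log(probs[label])
    return total / len(y_true)


def uniform_log_loss(n_classes):
    """Log-loss of predicting probability 1/n_classes for every class."""
    return math.log(n_classes)


print(round(uniform_log_loss(10), 2))   # 10 classes
print(round(uniform_log_loss(355), 2))  # 355 classes
```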



id     auto-sklearn  GAMA          naive         quasi-naive
12     0.12±0.06     0.47±0.96     1.45±5.03     0.13±0.05
23     0.9±0.04      0.91±0.04     2.73±5.54     0.93±0.06
54     0.36±0.04     3.7±10.47     nan           0.43±0.11
181    1.05±0.05     4.04±8.23     1.18±0.23     1.06±0.09
188    1.12±0.07     1.16±0.06     1.3±0.1       1.19±0.07
1457   1.31±0.24     1.23±0.41     2.3±0.8       1.13±0.29
1468   0.19±0.09     0.19±0.1      0.18±1.08     0.16±0.37
1475   1.07±0.03     1.07±0.05     1.12±0.07     1.07±0.03
1515   0.41±0.14     0.48±0.17     0.43±0.07     0.46±0.06
4538   0.85±0.02     0.8±0.03      0.85±0.02     0.91±0.23
4541   0.9±0.0       0.9±0.01      0.92±0.01     0.9±0.0
40498  0.76±0.03     1.39±4.88     nan           0.85±0.32
40668  0.24±0.12     0.39±0.13     0.38±0.05     0.36±0.05
40670  0.09±0.02     0.12±0.02     0.17±0.05     0.11±0.09
40685  0.0±0.0       0.0±0.0       0.0±0.01      0.0±0.0
40975  0.0±0.0       0.0±0.0       0.36±0.25     0.01±0.14
40982  0.52±0.06     0.53±0.08     0.62±0.0      0.51±0.07
40984  0.07±0.02     0.07±0.03     0.07±0.03     0.32±1.03
40996  0.29±0.03     0.39±0.07     0.37±0.01     0.53±0.36
41027  0.18±0.05     0.11±0.07     0.49±0.21     0.19±0.02
41163  0.03±0.02     0.03±0.01     0.05±0.01     0.04±0.01
41164  0.8±0.03      0.81±0.05     0.85±0.03     0.82±0.03
41165  1.64±0.15     1.62±0.14     1.7±0.03      1.69±0.18
41166  0.86±0.07     0.86±0.07     1.02±0.03     0.94±0.03
41167  2.14±0.57     10.68±10.68   2.69±0.11     2.31±0.03
41168  0.68±0.02     0.68±0.2      0.78±0.01     0.74±0.01
41169  2.77±0.17     3.44±1.52     2.79±0.06     2.62±0.03
42734  0.61±0.01     0.62±0.03     0.63±0.04     0.61±0.0

Table 5 Avg. test Log-Loss on multi-class classification datasets (1d timeout).


D Performance Plots over time

[Figure panels not reproduced. For each dataset there are two panels titled "Results on <openmlid> (<name>)": one with Runtime (s) on a linear axis up to 3500 and one with Runtime (s) on a logarithmic axis from 10^1 to 10^5. The vertical axis shows AUROC for binary datasets and Neg Log-Loss for multi-class datasets.]

AUROC panels: 3 (kr-vs-kp), 31 (credit-g), 1049 (pc4), 1067 (kc1), 1111 (KDDCup09 appetency), 1461 (bank-marketing), 1464 (blood-transfusion-service-center), 1485 (madelon), 1486 (nomao), 1487 (ozone-level-8hr), 1489 (phoneme), 1494 (qsar-biodeg), 1590 (adult), 4134 (Bioresponse), 4135 (Amazon employee access), 4534 (PhishingWebsites), 23512 (higgs), 23517 (numerai28.6), 40701 (churn), 40900 (Satellite), 40978 (Internet-Advertisements), 40981 (Australian), 40983 (wilt), 41138 (APSFailure), 41142 (christine), 41143 (jasmine), 41144 (madeline), 41145 (philippine), 41146 (sylvine), 41147 (albert), 41150 (MiniBooNE), 41156 (ada), 41157 (arcene), 41158 (gina), 41159 (guillermo), 41161 (riccardo), 41162 (kick), 42732 (sf-police-incidents), 42733 (Click prediction small).

Neg Log-Loss panels: 12 (mfeat-factors), 23 (cmc), 54 (vehicle), 181 (yeast), 188 (eucalyptus), 1457 (amazon-commerce-reviews), 1468 (cnae-9), 1475 (first-order-theorem-proving), 1515 (micro-mass), 4538 (GesturePhaseSegmentationProcessed), 4541 (Diabetes130US), 40498 (wine-quality-white), 40668 (connect-4), 40670 (dna), 40685 (shuttle), 40975 (car), 40982 (steel-plates-fault), 40984 (segment), 40996 (Fashion-MNIST), 41027 (jungle chess 2pcs raw endgame complete), 41163 (dilbert), 41164 (fabert), 41165 (robert), 41166 (volkert), 41167 (dionis), 41168 (jannis), 41169 (helena), 42734 (okcupid-stem).
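Performance-over-time curves of this kind are commonly drawn as the best score found so far at each point in time. The sketch below shows one way to derive such a curve from a list of (runtime, score) evaluations. It assumes that the plots track the incumbent in this way, which the source does not confirm, and the function name `incumbent_curve` is hypothetical. Log-loss values are negated first so that higher is better, matching the "Neg Log-Loss" axis label.

```python
def incumbent_curve(history, higher_is_better=True):
    """Turn (runtime_seconds, score) evaluations into a best-so-far curve.

    Returns one (runtime, best score up to that runtime) pair per evaluation.
    """
    curve = []
    best = None
    for runtime, score in sorted(history):
        improved = best is None or (
            score > best if higher_is_better else score < best
        )
        if improved:
            best = score
        curve.append((runtime, best))
    return curve


# Toy log-loss evaluations, negated so that larger values are better.
log_loss_history = [(10, 0.9), (5, 1.2), (20, 1.0)]
neg_history = [(t, -loss) for t, loss in log_loss_history]
curve = incumbent_curve(neg_history)
```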

