
Fast Cross-Validation via Sequential Analysis

Tammo Krueger, Danny Panknin, Mikio Braun

Technische Universitaet Berlin, Machine Learning Group, 10587 Berlin

[email protected], {panknin|mikio}@cs.tu-berlin.de

Abstract

With the increasing size of today’s data sets, finding the right parameter configuration via cross-validation can be an extremely time-consuming task. In this paper we propose an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible the method speeds up the computation while preserving the capability of the full cross-validation. The experimental evaluation shows that our method reduces the computation time by a factor of up to 70 compared to a full cross-validation with a negligible impact on the accuracy.

1 Introduction

Unarguably, a lot of computing time is spent on cross-validation [1] to tune free parameters of machine learning methods. While cross-validation can be parallelized easily, with every instance evaluating a single candidate parameter setting, an enormous amount of computing resources is still spent on cross-validation, which could probably be put to better use in the actual learning methods. Just to give you an idea, if you perform five-fold cross-validation over two parameters, and you only take five candidates for each parameter, you have to train 125 times to perform the cross-validation. Thus, even a training time of one second becomes more than two minutes without parallelization.

In practice, almost no one performs cross-validation on the whole data set, though, as the parameters can often already be inferred reliably on a small subset of the data, thereby speeding up the computation time substantially. However, the choice of the subset depends a lot on the structure of the data set. If the subset is too small compared to the complexity of the learning task, the wrong parameter is chosen. Usually, researchers can tell from experience what subset sizes are necessary for specific learning problems, but one would like to have a robust method which is able to deal with a whole range of learning problems in an automatic fashion.

In this paper, we propose a method which is based on the sequential analysis framework to achieve exactly this: speed up cross-validation by taking subsets of the data, while being robust with respect to different problem complexities. To achieve this, the method performs cross-validation on subsets of increasing size up to the full data set size, eliminating suboptimal parameter choices quickly. The statistical tests used for the elimination are tuned such that they try to retain promising parameters as long as possible to guard against unreliable measurements at small sample sizes.

In experiments, we show that even using such conservative tests, we can achieve significant speed-ups of typically 25 times up to 70 times, which translate to literally hours of computing time freed up on our clusters.


[Figure 1 graphic: a pointwise performance matrix (test errors of configurations c1, ..., ck on test points d1, ..., dn, each configuration labeled top or flop), the resulting binary trace matrix over steps 1–10, a cumulative-sum plot of a trace against the sequential-test boundaries Sa(π0, π1, βl, αl) and ∆H0(π0, π1, βl, αl) separating the WINNER and LOSER regions, and the early-stopping check similarPerformance(·) on the last trace columns; with maxSteps = 20, ∆ = N/maxSteps, modelSize = s∆, and n = N − s∆.]

Figure 1: One step of the fast cross-validation procedure. Shown is the situation in step s = 10. Ê A model with modelSize data points is learned for each configuration (c1 to ck). Test errors are calculated on the current test set (d1 to dn) and transformed into a binary performance indicator. Ë Traces of configurations are filtered via sequential analysis (c1 and c2 are dropped). Ì At the end of each step the procedure checks whether the remaining configurations perform equally well in a time window and stops if this is the case (see Sec. 5 in the Appendix for a complete example run).

2 Fast Cross-Validation

We consider the usual supervised learning setting: We have a data set consisting of data points d1 = (X1, Y1), . . . , dN = (XN, YN) ∈ X × Y which we assume to be drawn i.i.d. from PX×Y. We have a learning algorithm A which depends on several parameters p. The goal is to select the parameter p∗ such that the learned predictor g has the best generalization error with respect to some loss function ℓ : Y × Y → R. Full k-fold cross-validation estimates the best parameter by splitting the data into k parts, using k − 1 parts for training and estimating the error on the remaining part.
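For reference, the baseline this work speeds up can be written down in a few lines. The sketch below is illustrative only: the estimator interface (a train_fn returning a predictor and a loss_fn returning pointwise losses) is an assumption of this example, not part of the paper.

```python
import numpy as np

def full_cross_validation(X, Y, candidates, train_fn, loss_fn, k=5, seed=0):
    """Plain k-fold cross-validation over all candidate parameter settings.

    candidates -- list of parameter settings p
    train_fn   -- callable (X_train, Y_train, p) -> predictor g, with g(X) -> predictions
    loss_fn    -- callable (Y_true, Y_pred) -> array of pointwise losses
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    mean_loss = []
    for p in candidates:
        errs = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            g = train_fn(X[train_idx], Y[train_idx], p)
            errs.append(np.mean(loss_fn(Y[test_idx], g(X[test_idx]))))
        mean_loss.append(np.mean(errs))
    return candidates[int(np.argmin(mean_loss))]   # parameter with smallest estimated error
```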

Our approach attempts to speed up the process by taking subsamples of size [sN/maxSteps] for 1 ≤ s ≤ maxSteps, starting with the full set of parameter candidates and eliminating clearly underperforming candidates at each step. Each execution of the main loop of the algorithm depicted in Figure 1 performs the following main parts given a subset of the data: Ê The procedure transforms the pointwise test errors of the remaining configurations into a binary “top or flop” scheme. Ë It drops significant loser configurations along the way using tests from the sequential analysis framework. Ì Applying robust, distribution-free testing techniques allows for an early stopping of the procedure when we have seen enough data for a stable parameter estimation. In the following we discuss the individual steps of the algorithm.
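The control flow might be sketched as follows. This is a simplified reading of Figure 1, not the authors' implementation: the helpers top_or_flop, is_significant_loser, similar_performance, and pick_winner correspond to steps Ê, Ë, and Ì and are sketched in the step discussions below, and train_fn/loss_fn are the same illustrative interface as above.

```python
def fast_cross_validation(X, Y, candidates, train_fn, loss_fn,
                          max_steps=10, early_window=3):
    """Sketch of the main loop: grow the training subset linearly, keep a binary
    top/flop trace per configuration, drop losers, stop early when the
    survivors are indistinguishable. Data is assumed to be shuffled."""
    N = len(X)
    remaining = list(range(len(candidates)))
    traces = {c: [] for c in remaining}              # binary top/flop history per configuration

    for s in range(1, max_steps + 1):
        model_size = s * N // max_steps              # modelSize = s * Delta
        train_idx = slice(0, model_size)
        test_idx = slice(model_size, N)              # n = N - s * Delta test points
        # (the boundary case s = max_steps, where no test points remain, is glossed over here)

        # step 1: pointwise losses -> binary "top or flop" indicators
        losses = {c: loss_fn(Y[test_idx],
                             train_fn(X[train_idx], Y[train_idx], candidates[c])(X[test_idx]))
                  for c in remaining}
        tops = top_or_flop(losses)

        # step 2: sequential test on the accumulated trace drops significant losers
        for c in list(remaining):
            traces[c].append(tops[c])
            if is_significant_loser(traces[c]):
                remaining.remove(c)
        if len(remaining) <= 1:
            break

        # step 3: stop early once the survivors behaved alike in a recent window
        if s >= early_window and similar_performance(
                {c: traces[c][-early_window:] for c in remaining}):
            break

    return candidates[pick_winner(traces, remaining)]
```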

Ê Robust Transformation of Test Errors: As the first step, the pointwise test errors for each configuration are transformed into a binary value encoding whether the configuration is among the best ones or not. We call this the “top or flop” scheme. This step abstracts from the underlying loss function or the scale of the errors, encoding whether a configuration looks promising for further analysis or not. From the point of view of statistical test theory, the question now is to find the k top-performing configurations which show a similar behavior on all tested samples. Traditionally, this test could be performed using ANOVA; however, we propose to use the following non-parametric tests in order to increase robustness: For classification, we use the Cochran Q test [2] applied to the binary information whether a sample has been correctly classified or not. For regression problems we apply the Friedman test [3] directly on the residuals of the prediction. Note that both tests use a paired approach on the pointwise performance measure, thereby increasing the statistical power of the result (see Sec. 6 in the Appendix for a summary of these tests).
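As an illustration of this step, the following hedged sketch turns paired pointwise losses into top/flop indicators. It uses the Friedman test to check whether the configurations are distinguishable at all and a simple mean-rank split to form the top group; the paper's exact splitting rule (detailed in its appendix) may differ, and at least three remaining configurations are assumed.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def top_or_flop(losses, alpha=0.05):
    """Map each configuration to 1 (top) or 0 (flop) from paired pointwise losses.

    losses -- dict: configuration index -> array of pointwise losses on the shared test points
    """
    configs = sorted(losses)
    L = np.vstack([losses[c] for c in configs])      # rows: configurations, columns: test points
    # Friedman test on the paired losses: can the configurations be distinguished at all?
    _, p_value = friedmanchisquare(*L)
    if p_value > alpha:
        return {c: 1 for c in configs}               # indistinguishable -> everyone stays "top"
    # Otherwise rank the configurations per test point (rank 1 = smallest loss) and
    # call those with a better-than-average mean rank the top group.
    ranks = np.apply_along_axis(rankdata, 0, L)
    mean_rank = ranks.mean(axis=1)
    return {c: int(mean_rank[i] <= mean_rank.mean()) for i, c in enumerate(configs)}
```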

Ë Determining Significant Losers: Having transformed the test errors into a scale-independent top or flop scheme, we can now test whether a given parameter configuration is an overall loser. Sequential testing of binary random variables is addressed in the sequential analysis framework developed by Wald [4]. The main idea is the following: One observes a sequence of i.i.d. binary random variables Z1, Z2, . . ., and one wants to test whether these variables are distributed according to H0 : π0 or H1 : π1 with π0 < π1. Both significance levels for the acceptance of H1 and H0 can be controlled via the meta-parameters αl and βl. The test computes the likelihood for the data observed so far and rejects one of the hypotheses when the respective likelihood ratio is larger than some factor controlled by the meta-parameters. It can be shown that the procedure has a very intuitive geometric representation, shown in Figure 1, lower left: The binary observations are recorded as cumulative sums at each time step. If this sum exceeds the upper red line, we accept H1; if the sum is below the lower red line, we accept H0; if the sum stays between the two red lines, we have to draw another sample. Since our main goal is to use the sequential test to eliminate underperformers, we choose the parameters π0 and π1 of the test such that H1 (a configuration wins) is postponed as long as possible. At the same time, we want to maximize the area where configurations are eliminated (the region denoted by “LOSER” in Fig. 1), rejecting as many loser configurations along the way as possible (see Sec. 1–3 in the Appendix for the concrete derivation of these parameters of the test).
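For concreteness, a minimal sketch of Wald's test applied to one binary trace is given below. The numeric defaults for π0, π1, αl, and βl are placeholders; the paper derives its own values so that winning is postponed and the loser region is maximized.

```python
import numpy as np

def sprt_decision(trace, pi0=0.1, pi1=0.9, alpha_l=0.01, beta_l=0.01):
    """Wald's sequential probability ratio test on a binary top/flop trace.

    Returns 'loser' (H0 accepted), 'winner' (H1 accepted) or 'undecided'.
    """
    n, s = len(trace), sum(trace)
    # log-likelihood ratio of the observed trace under H1 (pi1) versus H0 (pi0)
    llr = s * np.log(pi1 / pi0) + (n - s) * np.log((1 - pi1) / (1 - pi0))
    if llr >= np.log((1 - beta_l) / alpha_l):        # upper boundary: accept H1
        return "winner"
    if llr <= np.log(beta_l / (1 - alpha_l)):        # lower boundary: accept H0
        return "loser"
    return "undecided"                               # keep sampling

def is_significant_loser(trace):
    return sprt_decision(trace) == "loser"
```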

Ì Early Stopping and Final Winner: Finally, we employ an early stopping rule which takes the last earlyStoppingWindow columns from the trace matrix and checks whether all remaining configurations performed equally well in the past. If that is the case, the procedure is stopped. For the test, we again use the Cochran Q test, which is illustrated in Figure 1, lower right: the last three traces at step 10 perform nearly optimally in the given window, but c3 shows a significantly different behavior, so the test will indicate a significant effect and the procedure will go on. To determine the final winner after the procedure has stopped, we iteratively go back in time among the winning configurations of each step until we have found an exclusive winner. This way, we make the most of the data accumulated during the course of the procedure.
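A hedged sketch of the early stopping check and the winner selection follows. Cochran's Q is implemented directly here to keep the example self-contained, and the backward winner search is a simplified reading of the rule described above, not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import chi2

def cochran_q_pvalue(X):
    """Cochran's Q test for paired binary observations (rows: steps, columns: configurations)."""
    X = np.asarray(X)
    b, k = X.shape
    col, row = X.sum(axis=0), X.sum(axis=1)
    denom = k * row.sum() - (row ** 2).sum()
    if denom == 0:                                   # all rows constant -> no detectable difference
        return 1.0
    q = k * (k - 1) * ((col - col.mean()) ** 2).sum() / denom
    return chi2.sf(q, k - 1)                         # Q is asymptotically chi-square with k-1 dof

def similar_performance(window_traces, alpha=0.05):
    """Stop early if the surviving configurations are indistinguishable on the recent window."""
    X = np.array([window_traces[c] for c in sorted(window_traces)]).T
    return cochran_q_pvalue(X) > alpha

def pick_winner(traces, remaining):
    """Walk back in time over the top/flop traces until a single configuration stands out."""
    candidates, step = list(remaining), len(traces[next(iter(remaining))]) - 1
    while len(candidates) > 1 and step >= 0:
        score = {c: sum(traces[c][step:]) for c in candidates}
        best = max(score.values())
        candidates = [c for c in candidates if score[c] == best]
        step -= 1
    return candidates[0]
```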

Efficient Parallelization: As for normal cross-validation, the parallelization setup for the fast cross-validation procedure is a straightforward map-reduce scheme: the model for each remaining configuration in each step of the procedure can be calculated on a single cluster node. Only the results of the model on the data points d1, d2, . . . , dn have to be transferred back to a central instance to calculate the binary Ê “top or flop” scheme. This central reduce node then updates the trace matrix accordingly and Ë tests for significant losers. After eliminating underperforming configurations, the Ì early stopping rule checks whether the procedure should iterate once more and schedule the remaining configurations on the cluster. This stepwise elimination of underperforming configurations results in a significant speed-up, as will be shown in the next section.
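One possible realization of the map step, using a local process pool as a stand-in for a cluster, is sketched below; the helper names follow the earlier sketches and are not from the paper, and module-level (picklable) train_fn/loss_fn are assumed.

```python
from concurrent.futures import ProcessPoolExecutor

def evaluate_configuration(args):
    """Map step: train one configuration on the current subset and return its pointwise losses."""
    c, p, X_train, Y_train, X_test, Y_test, train_fn, loss_fn = args
    g = train_fn(X_train, Y_train, p)
    return c, loss_fn(Y_test, g(X_test))

def parallel_losses(remaining, candidates, X_train, Y_train, X_test, Y_test,
                    train_fn, loss_fn, workers=8):
    """Schedule all remaining configurations and collect their losses at the central node."""
    jobs = [(c, candidates[c], X_train, Y_train, X_test, Y_test, train_fn, loss_fn)
            for c in remaining]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Reduce step: the central node gathers the pointwise losses, updates the
        # trace matrix, runs the sequential tests, and decides whether to iterate.
        return dict(pool.map(evaluate_configuration, jobs))
```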

3 Experiments

In this section we explore the performance of the fast cross-validation procedure on real-world data sets. First, we use the benchmark repository introduced by Ratsch et al. [5]. We split each data set into two halves, using one half for the parameter estimation via full and fast cross-validation and the other half for the calculation of the test error. Additionally, we use the covertype data set [6]: After scaling the data we use the first two classes with the most entries and follow the procedure of the paper in sampling 2,000 data points of each class for the model learning, and estimate the test error on the remaining data points. For all setups we use a ν-SVM with a Gaussian kernel and 610 parameter configurations (σ ∈ [−3, 3], ν ∈ [0.05, 0.5]). The fast cross-validation procedure is carried out with 10 steps (fast), once with the early stopping rule and once without. For each data set we repeat the process 50 times, each with a different split.

Figure 2 shows that the speed improvement of the fast setup with early stopping often ranges between 20 and 30 and reaches up to 70 for the covertype data set. Without the early stopping rule the speed gain drops, but for most data sets it stays between 10 and 20. The absolute test error difference of the fast cross-validation procedure compared to the normal cross-validation almost always stays below 1 percentage point (data in Sec. 4 in the Appendix).


[Figure 2 graphic: boxplots of the relative speed-up factor (full/fast), one panel for the fast/early setting and one for the fast setting, across the data sets banana, breastCancer, diabetis, flareSolar, german, image, ringnorm, splice, thyroid, twonorm, waveform, and covertype; the vertical axis (relative speed-up) has tick marks at 20, 40, 60, and 80.]

Figure 2: Distribution of relative speed gains of the fast cross-validation on the benchmark data sets.

These results illustrate that the huge speed improvement of the fast cross-validation comes at a very low price in terms of absolute test error difference.

4 Related Work

Using statistical tests in order to speed up learning has been the topic of several lines of research. However, the existing body of work mostly focuses on reducing the number of test evaluations, while we focus on the overall process of eliminating candidates themselves. To the best of our knowledge, this is a new concept, and it can apparently be combined with the already available racing techniques to further reduce the total calculation time.

Maron and Moore introduce the so-called Hoeffding Races [7, 8], which are based on the non-parametric Hoeffding bound for the mean of the test error. At each step of the algorithm a new test point is evaluated by all remaining models and the confidence intervals of the test errors are updated accordingly. Models whose confidence interval of the test error lies outside of at least one interval of a better performing model are dropped. Chien et al. [9, 10] devise a similar range of algorithms using concepts of PAC learning and game theory: different hypotheses are ordered by their expected utility according to the test data the algorithm has seen so far. This concept of racing is further extended by Domingos and Hulten [11]: By introducing an upper bound for the learner’s loss as a function of the examples, the procedure allows for an early stopping of the learning process if the loss is nearly as optimal as for infinite data. While Bradley and Schapire [12] use similar concepts in the context of boosting (FilterBoost), Mnih et al. [13] introduce empirical Bernstein bounds to extend both the FilterBoost framework and the racing algorithms. In both cases the bounds are used to estimate the error within a specific ε region with a given probability. These racing concepts are applied in a wide variety of domains like reinforcement learning [14], multi-armed bandit problems [15], and timetabling [16], showing the relevance of the topic.

5 Conclusion and Further Work

We have proposed a procedure to significantly accelerate cross-validation by performing it on subsets of increasing size and eliminating underperforming candidates. We first transform the cross-validation problem into a binary trace matrix which contains the winners/losers for each configuration for each subset size. To speed up cross-validation, the goal is to identify overall losers as early as possible. Note that the distribution of the matrix is very complex and in general unknown, as it depends on the data distribution, the learning algorithm, and the sample sizes. We can assume that the distribution of the columns of the matrix converges as the sample size becomes larger, but there may also be significant shifts in what the top candidates are at smaller sample sizes.

Our approach is therefore a first step towards solving the problem by applying robust testing and the sequential analysis framework, which makes several simplifying assumptions. Better understanding the true distribution of the problem is an interesting question for future research.

Acknowledgments: This work is generously funded by the BMBF project ALICE (01IB10003B).


References

[1] Sylvain Arlot, Alain Celisse, and Paul Painleve. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

[2] W. G. Cochran. The comparison of percentages in matched samples. Biometrika, 37(3-4):256–266, 1950.

[3] Milton Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701, 1937.

[4] Abraham Wald. Sequential Analysis. Wiley, 1947.

[5] G. Ratsch, T. Onoda, and K.-R. Muller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.

[6] J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24:131–151, 1999.

[7] Oded Maron and Andrew W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems 6, pages 59–66. Morgan Kaufmann, 1994.

[8] Oded Maron and Andrew W. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11:193–225, February 1997.

[9] Steve Chien, Jonathan Gratch, and Michael Burl. On the efficient allocation of resources for hypothesis evaluation: A statistical approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:652–665, July 1995.

[10] Steve Chien, Andre Stechert, and Darren Mutz. Efficient heuristic hypothesis ranking. Journal of Artificial Intelligence Research, 10:375–397, June 1999.

[11] Pedro Domingos and Geoff Hulten. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 106–113, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[12] Joseph K. Bradley and Robert Schapire. FilterBoost: Regression and classification on large datasets. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 185–192, Cambridge, MA, 2008. MIT Press.

[13] Volodymyr Mnih, Csaba Szepesvari, and Jean-Yves Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 672–679, New York, NY, USA, 2008. ACM.

[14] Verena Heidrich-Meisner and Christian Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 401–408, New York, NY, USA, 2009. ACM.

[15] Jean-Yves Audibert, Remi Munos, and Csaba Szepesvari. Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, ALT '07, pages 150–165, Berlin, Heidelberg, 2007. Springer-Verlag.

[16] Mauro Birattari. Tuning Metaheuristics: A Machine Learning Perspective. Springer, 2009.
