RESULTS OF THE WCCI 2006 PERFORMANCE PREDICTION CHALLENGE Isabelle Guyon Amir Reza Saffari Azar Alamdari Gideon Dror


Page 1: RESULTS OF THE WCCI 2006  PERFORMANCE PREDICTION  CHALLENGE Isabelle Guyon

RESULTS OF THE WCCI 2006 PERFORMANCE PREDICTION

CHALLENGE

Isabelle Guyon
Amir Reza Saffari Azar Alamdari

Gideon Dror

Page 2:

Part I

INTRODUCTION

Page 3:

Model selection

• Selecting models (neural net, decision tree, SVM, …).

• Selecting hyperparameters (number of hidden units, weight decay/ridge, kernel parameters, …).

• Selecting variables or features (space dimensionality reduction).

• Selecting patterns (data cleaning; data reduction, e.g., by clustering).

Page 4:

Performance prediction

How good are you at predicting how good you are?

• Practically important in pilot studies.

• Good performance predictions render model selection trivial: simply pick the model with the best predicted performance.

Page 5:

Why a challenge?

• Stimulate research and push the state-of-the art.

• Move towards fair comparisons and give a voice to methods that work but may not be backed up by theory (yet).

• Find practical solutions to real problems.

• Have fun…

Page 6:

History

• USPS/NIST.
• Unipen (with Lambert Schomaker): 40 institutions share 5 million handwritten characters.
• KDD cup, TREC, CASP, CAMDA, ICDAR, etc.
• NIPS challenge on unlabeled data.
• Feature selection challenge (with Steve Gunn): success! ~75 entrants, thousands of entries.
• Pascal challenges.
• Performance prediction challenge …

[Timeline: 1980–2005]

Page 7:

Challenge

• Date started: Friday September 30, 2005.

• Date ended: Monday March 1, 2006.

• Duration: 21 weeks.

• Estimated number of entrants: 145.

• Number of development entries: 4228.

• Number of ranked participants: 28.

• Number of ranked submissions: 117.

Page 8:

Datasets

Dataset | Domain         | Type          | Features | Training ex. | Validation ex. | Test ex.
ADA     | Marketing      | Dense         | 48       | 4147         | 415            | 41471
GINA    | Digits         | Dense         | 970      | 3153         | 315            | 31532
HIVA    | Drug discovery | Dense         | 1617     | 3845         | 384            | 38449
NOVA    | Text classif.  | Sparse binary | 16969    | 1754         | 175            | 17537
SYLVA   | Ecology        | Dense         | 216      | 13086        | 1308           | 130858

http://www.modelselect.inf.ethz.ch/

Page 9:

BER distribution

[Figure: histograms of test BER over all entries, one panel per dataset (ADA, GINA, HIVA, NOVA, SYLVA); x-axis: test BER (0 to 0.5), y-axis: number of entries (0 to 150).]

Page 10:

Results

Overall winners for ranked entries:

• Ave. rank: Roman Lutz with LB tree mix cut adapted.
• Ave. score: Gavin Cawley with Final #2.
• ADA: Marc Boullé with SNB(CMA) + 10k F(2D) tv or SNB(CMA) + 100k F(2D) tv.
• GINA: Kari Torkkola & Eugene Tuv with ACE+RLSC.
• HIVA: Gavin Cawley with Final #3 (corrected).
• NOVA: Gavin Cawley with Final #1.
• SYLVA: Marc Boullé with SNB(CMA) + 10k F(3D) tv.
• Best AUC: Radford Neal with Bayesian neural networks.

Page 11:

Part II

PROTOCOL and

SCORING

Page 12:

Protocol

• Data split: training/validation/test.
• Data proportions: 10/1/100.
• Online feedback on validation data.
• Validation label release one month before end of challenge.
• Final ranking on test data using the five last complete submissions for each entrant.

Page 13:

Performance metrics

• Balanced Error Rate (BER): average of error rates of positive class and negative class.

• Guess error: ΔBER = abs(testBER − guessedBER).

• Area Under the ROC Curve (AUC).
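The two error measures above can be made concrete with a minimal sketch (in Python for illustration; the challenge kit itself is Matlab-based, and these function names are our own):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: average of the error rates on the positive and the negative class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == 1
    neg = y_true == -1
    err_pos = np.mean(y_pred[pos] != 1)   # fraction of positives misclassified
    err_neg = np.mean(y_pred[neg] != -1)  # fraction of negatives misclassified
    return 0.5 * (err_pos + err_neg)

def guess_error(test_ber, guessed_ber):
    """Guess error: absolute gap between measured and guessed BER."""
    return abs(test_ber - guessed_ber)
```

Because BER averages the two per-class error rates, a classifier that always predicts the majority class still scores 0.5 on it, which is why the challenge prefers it to plain error rate on unbalanced data such as HIVA.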

Page 14:

Optimistic guesses

[Figure: guessed BER versus test BER (both 0 to 0.7) for ADA, GINA, HIVA, NOVA, SYLVA.]

Page 15:

Scoring method

E = testBER + ΔBER × [1 − exp(−ΔBER/σ)], where ΔBER = abs(testBER − guessedBER).

[Figure: challenge score as a function of guessed BER; the test BER is marked.]
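For illustration, the challenge score E = testBER + ΔBER × [1 − exp(−ΔBER/σ)] can be computed directly (a Python sketch; `sigma` stands for the σ scale in the slide and must be supplied by the caller):

```python
import math

def challenge_score(test_ber, guessed_ber, sigma):
    """E = testBER + dBER * (1 - exp(-dBER / sigma)),
    with dBER = |testBER - guessedBER|.

    An exact guess costs nothing; a guess error much larger than
    sigma adds roughly the full dBER to the score.
    """
    d_ber = abs(test_ber - guessed_ber)
    return test_ber + d_ber * (1.0 - math.exp(-d_ber / sigma))
```

The exponential term makes the penalty negligible for guess errors well below σ and close to the full ΔBER for guess errors well above it.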

Page 16:

ΔBER/σ

[Figure: ΔBER/σ (log scale, 10⁻⁴ to 10⁴) versus test BER (0 to 0.5) for ADA, GINA, HIVA, NOVA, SYLVA; when ΔBER/σ is large, E ≈ testBER + ΔBER.]

Page 17:

Score

E = testBER + ΔBER × [1 − exp(−ΔBER/σ)]

[Figure: challenge score versus log(γ) on GINA for Roman Lutz, Gavin Cawley, Radford Neal, Corinne Dahinden, Wei Chu, and Nicolai Meinshausen; the score ranges between testBER and testBER + ΔBER.]

Page 18:

Score (continued)

[Figure: challenge score versus log(γ) for each dataset (ADA, GINA, SYLVA, HIVA, NOVA).]

Page 19:

Part III

RESULT ANALYSIS

Page 20:

What did we expect?

• Learn about new competitive machine learning techniques.

• Identify competitive methods of performance prediction, model selection, and ensemble learning (theory put into practice).

• Drive research in the direction of refining such methods (ongoing benchmark).

Page 21:

Method comparison

[Figure: ΔBER (log scale) versus test BER for the five datasets (SYLVA, GINA, NOVA, ADA, HIVA), with entries grouped by method family: TREE, NN/BNN, NB, and LD/SVM/KLS/GP.]

Page 22:

Danger of overfitting

[Figure: BER (0 to 0.5) versus time (days, 0 to 160) over the course of the challenge for ADA, GINA, HIVA, NOVA, SYLVA. Full line: test BER; dashed line: validation BER.]

Page 23:

How to estimate the BER?

• Statistical tests (Stats): Compute it on training data; compare with a “null hypothesis” e.g. the results obtained with a random permutation of the labels.

• Cross-validation (CV): Split the training data many times into training and validation sets; average the validation results.

• Guaranteed risk minimization (GRM): Use of theoretical performance bounds.
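The CV strategy above can be sketched as follows (a minimal Python illustration, not the challenge kit; the `fit`/`predict` callables and split settings are our own):

```python
import numpy as np

def cv_ber_estimate(X, y, fit, predict, n_splits=100, val_frac=0.1, seed=0):
    """Estimate the BER by averaging many random 90%/10%
    train/validation splits of the training data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_val = max(1, int(val_frac * n))
    bers = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        val, tr = perm[:n_val], perm[n_val:]
        model = fit(X[tr], y[tr])                   # train on ~90% of the data
        y_hat = np.asarray(predict(model, X[val]))  # predict the held-out ~10%
        pos, neg = y[val] == 1, y[val] == -1
        e_pos = np.mean(y_hat[pos] != 1) if pos.any() else 0.0
        e_neg = np.mean(y_hat[neg] != -1) if neg.any() else 0.0
        bers.append(0.5 * (e_pos + e_neg))
    return float(np.mean(bers))
```

Averaging over many random splits reduces the variance of the estimate compared with a single held-out set, at the cost of repeated training.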

Page 24:

Stats / CV / GRM ???

Page 25:

Top ranking methods

• Performance prediction:
  – CV with many splits, 90% train / 10% validation.
  – Nested CV loops.

• Model selection:
  – Use of a single model family.
  – Regularized risk / Bayesian priors.
  – Ensemble methods.
  – Nested CV loops, made computationally efficient with virtual leave-one-out (VLOO).

Page 26:

Other methods

• Use of training data only:
  – Training BER.
  – Statistical tests.

• Bayesian evidence.

• Performance bounds.

• Bilevel optimization.

Page 27:

Part IV

CONCLUSIONS AND FURTHER WORK

Page 28:

Open problems

Bridge the gap between theory and practice…
• What are the best estimators of the variance of CV?
• What should k be in k-fold?
• Are other cross-validation methods better than k-fold (e.g., bootstrap, 5x2CV)?
• Are there better “hybrid” methods?
• What search strategies are best?
• More than 2 levels of inference?

Page 29:

Future work

• Game of model selection.

• JMLR special topic on model selection.

• IJCNN 2007 challenge!

Page 30:

Benchmarking model selection?

• Performance prediction: Participants just need to provide a guess of their test performance. If they can solve that problem, they can perform model selection efficiently. Easy and motivating.

• Selection of a model from a finite toolbox: In principle a more controlled benchmark, but less attractive to participants.

Page 31:

CLOP

• CLOP=Challenge Learning Object Package.

• Based on the Spider developed at the Max Planck Institute.

• Two basic abstractions:– Data object– Model object

http://clopinet.com/isabelle/Projects/modelselect/MFAQ.html

Page 32:

CLOP tutorial

At the Matlab prompt:

% Wrap the training data in a data object.
D = data(X, Y);
% Hyperparameters for kernel ridge regression.
hyper = {'degree=3', 'shrinkage=0.1'};
% Create the model and train it.
model = kridge(hyper);
[resu, model] = train(model, D);
% Apply the trained model to test data.
tresu = test(model, testD);
% Models can be chained, e.g., standardization followed by kernel ridge.
model = chain({standardize, kridge(hyper)});

Page 33:

Conclusions

• Twice the volume of participation of the feature selection challenge.

• Top methods as before (in a different order):
  – Ensembles of trees.
  – Kernel methods (RLSC/LS-SVM, SVM).
  – Bayesian neural networks.
  – Naïve Bayes.

• Danger of overfitting.

• Triumph of cross-validation?