
Upload: agatha-owen

Post on 29-Dec-2015


TRANSCRIPT


Participation in the NIPS 2003 Challenge

Theodor Mader

ETH Zurich, [email protected]

Five datasets were provided for experiments:
• ARCENE: cancer diagnosis
• DEXTER: text categorization
• DOROTHEA: drug discovery
• GISETTE: digit recognition
• MADELON: artificial data
All problems were two-class problems.

DATASETS

BACKGROUND

Feature selection is a vital part of any classification procedure, so it is important to understand common procedures and algorithms and to see how they perform on real-world data. The students had the chance to apply the knowledge acquired in class to real-world datasets, learning about the strengths and weaknesses of the individual approaches.

SUMMARY

In the feature extraction course of the winter semester 2005–2006, the students took part in the NIPS 2003 feature selection challenge. They experimented with different feature selection and classification methods in order to achieve high rankings.

METHODS

The students were provided with a Matlab machine learning package containing a variety of classification and feature extraction algorithms, such as support vector machines and the random probe method, among others.

The results were ranked according to the balanced error rate (BER), i.e. the average of the error rates on the two classes, so that each class counts equally regardless of how many examples it has. For this ranking a test set was used that was not available to the students.
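The BER can be computed in a few lines; this is a minimal illustrative Python sketch (the challenge used its own Matlab scoring code), with labels assumed to be ±1:

```python
def balanced_error_rate(y_true, y_pred):
    """Mean of the per-class error rates: a small class counts exactly
    as much as a large one, unlike the plain error rate."""
    classes = sorted(set(y_true))
    rates = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        errors = sum(1 for i in idx if y_pred[i] != c)
        rates.append(errors / len(idx))
    return sum(rates) / len(rates)

# Predicting +1 everywhere: 0% error on the positive class,
# 100% on the negative class, hence BER = 50%.
y_true = [+1, +1, +1, +1, -1]
y_pred = [+1, +1, +1, +1, +1]
print(balanced_error_rate(y_true, y_pred))  # 0.5
```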

RESULTS (BER in %)

[Bar chart: BER (%) for Our Entry, Baseline, and Winner; scale 0–16]

DOROTHEA: drug discovery
Size: 100000 features, 800 training examples, 350 validation examples

• Our results indicate that feature selection by the TP statistic was already sufficient; the Relief criterion did not give much improvement.
• We optimized the number of features in order to beat the baseline model.

Submission:
my_model=chain({TP('f_max=1400'), naive, bias})
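The package's `naive` module is not reproduced here, but since DOROTHEA's features are binary, a Bernoulli naive Bayes with Laplace smoothing is a reasonable stand-in; the following is an illustrative Python sketch, not the challenge code:

```python
from collections import defaultdict
import math

def train_bernoulli_nb(X, y):
    """Fit Bernoulli naive Bayes on binary feature vectors.
    Returns, per class, the log prior and the Laplace-smoothed
    probability of each feature being 1."""
    n_features = len(X[0])
    counts = defaultdict(int)                      # class -> #examples
    ones = defaultdict(lambda: [0] * n_features)   # class -> per-feature 1-counts
    for x, c in zip(X, y):
        counts[c] += 1
        for j, v in enumerate(x):
            ones[c][j] += v
    model = {}
    for c in counts:
        prior = math.log(counts[c] / len(y))
        # +1 / +2 Laplace smoothing keeps every probability strictly in (0, 1)
        p = [(ones[c][j] + 1) / (counts[c] + 2) for j in range(n_features)]
        model[c] = (prior, p)
    return model

def predict_bernoulli_nb(model, x):
    """Return the class with the highest posterior log-probability."""
    best, best_score = None, -math.inf
    for c, (prior, p) in model.items():
        score = prior + sum(math.log(p[j]) if v else math.log(1.0 - p[j])
                            for j, v in enumerate(x))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy binary data: feature 0 tracks the class, feature 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [+1, +1, -1, -1]
model = train_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [1, 0]))  # 1
```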

CONCLUSIONS

Participating in the NIPS 2003 challenge was a very interesting opportunity. In most cases there was clearly no need for very complicated classifiers: simple techniques, applied correctly and combined with powerful feature selection methods, were faster and more successful. The importance of feature selection was hence shown very clearly: using only a fraction of the original features improved classification results and speed significantly.

[Bar chart: BER (%) for Our Entry, Baseline, and Winner; scale 0–16]

ARCENE: cancer diagnosis
10000 features, 100 training examples, 100 validation examples

• Very small number of training examples; merging the training and validation sets improved performance significantly.

• We empirically found that the probe method performed better than the signal-to-noise ratio.

• Optimal number of features found by cross-validation.

[Plot: cross-validation BER (%) with error bars vs. number of features, 2000–2600]

Submission:

my_classif=svc({'coef0=1','degree=3','gamma=0','shrinkage=0.1'})

my_model=chain({probe(relief,{'p_num=2000', 'f_max=2400'}), normalize, my_classif})
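The Relief criterion behind `probe(relief, ...)` can be sketched as follows; this is an illustrative stdlib-Python version of the classic two-class Relief weight update, not the package's implementation:

```python
import math

def relief_weights(X, y, n_iter=None):
    """Two-class Relief: for each example, find its nearest neighbour of
    the same class (hit) and of the other class (miss); features that
    differ on the miss gain weight, features that differ on the hit lose it."""
    n, d = len(X), len(X[0])
    n_iter = n if n_iter is None else n_iter
    w = [0.0] * d

    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    for t in range(n_iter):
        i = t % n
        x, c = X[i], y[i]
        hit = min((X[k] for k in range(n) if k != i and y[k] == c),
                  key=lambda z: dist(x, z))
        miss = min((X[k] for k in range(n) if y[k] != c),
                   key=lambda z: dist(x, z))
        for j in range(d):
            w[j] += abs(x[j] - miss[j]) - abs(x[j] - hit[j])
    return w

# Feature 0 separates the classes; feature 1 is pure noise,
# so Relief should rank feature 0 above feature 1.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.8], [0.9, 0.2]]
y = [-1, -1, +1, +1]
w = relief_weights(X, y)
print(w[0] > w[1])  # True
```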

[Bar chart: BER (%) for Our Entry, Baseline, and Winner; scale 0–6]

DEXTER: text categorization, 20000 features, 300 training examples, 300 validation examples.

• Also a very small number of training examples, so we merged the training and validation sets.

•Initial experiments with principal component analysis showed no improvement

• We stuck to the baseline model and optimized the shrinkage and number-of-features parameters.

Our final submission:
• Feature selection by signal-to-noise ratio (1210 features kept)
• Normalization of features
• Support vector classifier with linear kernel
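The signal-to-noise criterion ranks each feature by |μ₊ − μ₋| / (σ₊ + σ₋), the gap between the class means relative to the class spreads. A minimal illustrative Python sketch (the small epsilon guard against zero spread is our addition, not part of the original criterion):

```python
import math

def s2n_ranking(X, y):
    """Rank features by the signal-to-noise ratio
    |mu_pos - mu_neg| / (sigma_pos + sigma_neg); larger means more
    discriminative. Returns feature indices, best first."""
    pos = [x for x, c in zip(X, y) if c == +1]
    neg = [x for x, c in zip(X, y) if c == -1]

    def stats(rows, j):
        vals = [r[j] for r in rows]
        mu = sum(vals) / len(vals)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
        return mu, sigma

    scores = []
    for j in range(len(X[0])):
        mp, sp = stats(pos, j)
        mn, sn = stats(neg, j)
        scores.append(abs(mp - mn) / (sp + sn + 1e-12))  # guard against zero spread
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# Feature 1 separates the classes cleanly; feature 0 is noise,
# so feature 1 should be ranked first. Keeping the top f_max indices
# of this ranking gives the selected feature set.
X = [[0.2, 1.0], [0.8, 1.1], [0.7, -1.0], [0.1, -0.9]]
y = [+1, +1, -1, -1]
print(s2n_ranking(X, y))  # [1, 0]
```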

GISETTE: digit recognition, 5000 features, 6000 training examples, 1000 validation examples

• Garbage features were easily removed by feature selection according to the signal-to-noise ratio.

• Applying a Gaussian blur to the input images brought significant improvement.

• Shrinkage could improve the results further.

[Bar chart: BER (%) for Our Entry, Baseline, and Winner; scale 0–2]

my_classif=svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.5'});

my_model=chain({convolve('dim1=5','dim2=5'), normalize, my_classif})
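The `convolve('dim1=5','dim2=5')` step corresponds to smoothing each image with a small 2-D kernel. A Gaussian blur can be sketched in plain Python using the fact that a 2-D Gaussian is separable into two 1-D passes; this is an illustrative sketch, not the package's convolve module:

```python
import math

def gaussian_kernel(size, sigma):
    """1-D Gaussian kernel, normalised to sum to 1."""
    half = size // 2
    k = [math.exp(-((i - half) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    s = sum(k)
    return [v / s for v in k]

def blur(image, size=5, sigma=1.0):
    """Gaussian blur of a 2-D list-of-lists image. A 2-D Gaussian is
    separable, so we blur along rows, transpose, blur rows again, and
    transpose back. Borders are clamped to the nearest pixel."""
    k = gaussian_kernel(size, sigma)
    half = size // 2

    def conv_rows(img):
        h, w = len(img), len(img[0])
        return [[sum(k[i] * img[r][min(max(c + i - half, 0), w - 1)]
                     for i in range(size))
                 for c in range(w)]
                for r in range(h)]

    def transpose(img):
        return [list(col) for col in zip(*img)]

    return transpose(conv_rows(transpose(conv_rows(image))))

# Blurring a single bright pixel spreads it out but conserves total mass.
img = [[0.0] * 7 for _ in range(7)]
img[3][3] = 1.0
out = blur(img, size=5, sigma=1.0)
total = sum(sum(row) for row in out)
print(round(total, 6))  # 1.0
```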

MADELON: artificial data
500 features, 2000 training examples, 600 validation examples

• The Relief method performed extremely well: with only 20 features the classification error could be reduced to below 7%.
• Shrinkage proved to be nearly as important as the kernel width for the Gaussian kernel.

[Bar chart: BER (%) for Our Entry, Baseline, and Winner; scale 6.2–7.4]

my_classif=svc({'coef0=1', 'degree=0', 'gamma=0.5', 'shrinkage=0.3'})

my_model=chain({probe(relief,{'f_max=20', 'pval_max=0'}), standardize, my_classif})
