Lab 1: Getting started with CLOP and the Spider package
TRANSCRIPT
Lab 1
Getting started with CLOP and the Spider package
CLOP Tutorial
• CLOP = Challenge Learning Object Package.
• Based on the Spider package developed at the Max Planck Institute.
• Two basic abstractions:
  – Data object
  – Model object
http://clopinet.com/isabelle/Projects/modelselect/MFAQ.html
CLOP Data Objects
At the Matlab prompt:

cd <code_dir>
use_spider_clop;
X=rand(10,8);
Y=[1 1 1 1 1 -1 -1 -1 -1 -1]';
D=data(X,Y); % constructor
[p,n]=get_dim(D)
get_x(D)
get_y(D)
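The data object is a thin wrapper around the pair (X, Y). As an illustration of the same abstraction outside Matlab, here is a hypothetical Python sketch; the class and method names mirror the calls above, but this is not the actual CLOP code:

```python
import numpy as np

class Data:
    """Hypothetical Python stand-in for CLOP's data object."""
    def __init__(self, X, Y):
        self.X = np.asarray(X)   # patterns-by-features matrix
        self.Y = np.asarray(Y)   # one target value per pattern

    def get_dim(self):
        # (number of patterns, number of features), like [p,n]=get_dim(D)
        return self.X.shape

    def get_x(self):
        return self.X

    def get_y(self):
        return self.Y

rng = np.random.default_rng(0)
D = Data(rng.random((10, 8)), np.array([1]*5 + [-1]*5))
p, n = D.get_dim()   # (10, 8), like [p,n]=get_dim(D) above
```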
CLOP Model Objects
D is a data object previously defined.

model = kridge; % constructor
[resu, model] = train(model, D);
resu, model.W, model.b0
Yhat = D.X*model.W' + model.b0
testD = data(rand(3,8), [-1 -1 1]');
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)
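The trained model exposes a weight vector W and bias b0, so predictions are simply D.X*model.W' + model.b0. A minimal Python sketch of the same train/predict contract, using plain linear ridge regression rather than CLOP's actual kridge implementation (the class name and shrinkage default are made up for illustration):

```python
import numpy as np

class RidgeModel:
    """Linear ridge regression exposing explicit weights W and bias b0."""
    def __init__(self, shrinkage=0.1):
        self.shrinkage = shrinkage

    def train(self, X, Y):
        # centre the data, solve (Xc'Xc + s*I) W = Xc'Yc, recover the bias
        Xm, Ym = X.mean(axis=0), Y.mean()
        Xc, Yc = X - Xm, Y - Ym
        n = X.shape[1]
        self.W = np.linalg.solve(Xc.T @ Xc + self.shrinkage * np.eye(n),
                                 Xc.T @ Yc)
        self.b0 = Ym - Xm @ self.W
        return self.predict(X)

    def predict(self, X):
        # same form as Yhat = D.X*model.W' + model.b0 on the slide
        return X @ self.W + self.b0

rng = np.random.default_rng(0)
Y = np.array([1.0]*5 + [-1.0]*5)
X = rng.random((10, 8))
X[:, 0] = Y                      # make one feature perfectly predictive
model = RidgeModel(shrinkage=0.1)
Yhat = model.train(X, Y)
train_err = np.mean(np.sign(Yhat) != Y)
```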
Hyperparameters and Chains

A model often has hyperparameters:

default(kridge)
hyper = {'degree=3', 'shrinkage=0.1'};
model = kridge(hyper);

Models can be chained:

model = chain({standardize, kridge(hyper)});
[resu, model] = train(model, D);
tresu = test(model, testD);
balanced_errate(tresu.X, tresu.Y)
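A chain simply threads the data through its elements: train is called on each element in sequence (each one trained on the output of the previous), and test runs the same pipeline at prediction time. A hypothetical Python sketch of that control flow, with a standardize step as the only element (names are illustrative, not CLOP's API):

```python
import numpy as np

class Standardize:
    """Remove the mean and scale each feature to unit standard deviation."""
    def train(self, X):
        self.mu = X.mean(axis=0)
        self.sigma = X.std(axis=0) + 1e-12   # guard against constant features
        return self.test(X)

    def test(self, X):
        return (X - self.mu) / self.sigma

class Chain:
    """Pass the data through each element in order, for train and test alike."""
    def __init__(self, models):
        self.models = models

    def train(self, X):
        for m in self.models:
            X = m.train(X)
        return X

    def test(self, X):
        for m in self.models:
            X = m.test(X)
        return X

rng = np.random.default_rng(0)
X = 5 + 3 * rng.random((20, 4))
Z = Chain([Standardize()]).train(X)   # each column now has mean 0, std 1
```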
Hyper-parameters
• http://clopinet.com/isabelle/Projects/modelselect/MFAQ.html
• Kernel methods (kridge and svc):
  k(x, y) = (coef0 + x · y)^degree · exp(-gamma ||x - y||^2)
  kij = k(xi, xj)
  kii <- kii + shrinkage
• Naïve Bayes (naive): none.
• Neural network (neural): units, shrinkage, maxiter.
• Random Forest (rf, Windows only): mtry.
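The kernel above combines a polynomial part and an RBF part, with shrinkage added to the diagonal as regularization. A sketch of that formula in Python (a hypothetical helper for illustration, not part of CLOP):

```python
import numpy as np

def kernel_matrix(X, coef0=1.0, degree=3, gamma=0.0, shrinkage=0.0):
    """k(x,y) = (coef0 + x.y)^degree * exp(-gamma*||x-y||^2); kii += shrinkage."""
    dots = X @ X.T
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * dots
    K = (coef0 + dots) ** degree * np.exp(-gamma * sq_dists)
    return K + shrinkage * np.eye(len(X))   # kii <- kii + shrinkage

rng = np.random.default_rng(0)
X = rng.random((5, 3))
# degree=1, coef0=0, gamma=0 reduces to the plain linear kernel X @ X.T
K_lin = kernel_matrix(X, coef0=0.0, degree=1, gamma=0.0)
```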
Lab 2
Getting started with the NIPS 2003 FS challenge
The Datasets
• Arcene: cancer vs. normal with mass-spectrometry analysis of blood serum.
• Dexter: filter texts about corporate acquisitions from the Reuters collection.
• Dorothea: predict which compounds bind to Thrombin, from the KDD Cup 2001.
• Gisette: OCR, digit “4” vs. digit “9”, from NIST.
• Madelon: artificial data.
http://clopinet.com/isabelle/Projects/NIPS2003/Slides/NIPS2003-Datasets.pdf
Data Preparation
• Preprocessing and scaling to numerical range 0 to 999 for continuous data and 0/1 for binary data.
• Probes: Addition of “random” features distributed similarly to the real features.
• Shuffling: Randomization of the order of the patterns and the features.
• Baseline error rates (errate): Training and testing on various data splits with simple methods.
• Test set size: Number of test examples needed, using the rule of thumb ntest = 100/errate.
Data Statistics
Dataset    Size     Type            Features  Training Ex.  Validation Ex.  Test Ex.
Arcene     8.7 MB   Dense           10000     100           100             700
Gisette    22.5 MB  Dense           5000      6000          1000            6500
Dexter     0.9 MB   Sparse integer  20000     300           300             2000
Dorothea   4.7 MB   Sparse binary   100000    800           350             800
Madelon    2.9 MB   Dense           500       2000          600             1800
ARCENE
• Sources: National Cancer Institute (NCI) and Eastern Virginia Medical School (EVMS).
• Three datasets: one ovarian cancer, two prostate cancer; all preprocessed similarly.
• Task: Separate cancer vs. normal.
ARCENE is the cancer dataset
[Figure: a sample ARCENE mass spectrum]
DEXTER
• Sources: Carnegie Group, Inc. and Reuters, Ltd.
• Preprocessing: Thorsten Joachims.
• Task: Filter “corporate acquisition” texts.
DEXTER filters texts
NEW YORK, October 2, 2001 – Instinet Group Incorporated (Nasdaq: INET), the world’s largest electronic agency securities broker, today announced that it has completed the acquisition of ProTrader Group, LP, a provider of advanced trading technologies and electronic brokerage services primarily for retail active traders and hedge funds. The acquisition excludes ProTrader’s proprietary trading business. ProTrader’s 2000 annual revenues exceeded $83 million.
DOROTHEA
• Sources: DuPont Pharmaceuticals Research Laboratories and KDD Cup 2001.
• Task: Predict compounds that bind to Thrombin.
DOROTHEA is the Thrombin dataset
GISETTE
• Source: National Institute of Standards and Technologies (NIST).
• Preprocessing: Yann LeCun and collaborators.
• Task: Separate digits “4” and “9”.
GISETTE contains handwritten digits
[Figure: sample GISETTE digit images]
MADELON
• Source: Isabelle Guyon, inspired by Simon Perkins et al.
• Type of data: Clusters on the summits of a hypercube.
MADELON is random data
Performance Measures
Confusion matrix:

                    Prediction
                Class -1   Class +1
Truth  Class -1     a          b
       Class +1     c          d
• Balanced Error Rate (BER): the average of the error rates for each class: BER = 0.5*(b/(a+b) + c/(c+d)).
• Area Under Curve (AUC): the area under the ROC curve obtained by plotting a/(a+b) against d/(c+d) for each confidence value, starting at (0,1) and ending at (1,0).
• Fraction of Features (FF): the ratio of the number of features selected to the total number of features in the dataset.
• Fraction of Probes (FP): the ratio of the number of “garbage features” (probes) selected to the total number of features selected.
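With the confusion-matrix entries named a, b, c, d as above, the BER is straightforward to compute. A small Python sketch (CLOP's own balanced_errate works from predictions and targets instead; this helper is purely illustrative):

```python
def balanced_error_rate(a, b, c, d):
    """BER from confusion counts: truth -1 gives (a, b), truth +1 gives (c, d)."""
    error_neg = b / (a + b)   # fraction of class -1 examples predicted +1
    error_pos = c / (c + d)   # fraction of class +1 examples predicted -1
    return 0.5 * (error_neg + error_pos)

# 10% error on class -1 and 20% on class +1 average to a BER of 0.15
ber = balanced_error_rate(a=90, b=10, c=20, d=80)
```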
BER distribution
[Figure: histograms of test error (%), range 0-50%, for ARCENE, DEXTER, DOROTHEA, GISETTE, and MADELON]
Power of Feature Selection
Dataset    Best frac. features  Actual frac. probes
ARCENE     5%                   30%
DEXTER     1.5%                 50%
DOROTHEA   0.3%                 50%
GISETTE    18%                  50%
MADELON    1.6%                 96%
Visualization
1) Create a heatmap of the data matrix: show(D.train);
2) Look at individual patterns: browse(D.train);
3) Make a scatter plot of the 2 first features: show(D.train);
4) Visualize the results:
   [Dat,Model]=train(model, D.train);
   Dat=test(Model, D.valid);
   roc(Dat);
BER = f(threshold)
Training set: Theta = -37.40. Test set: Theta = -38.14.
No bias adjustment: test BER = 22.54%; with bias adjustment: test BER = 12.37%.

[Figure: sensitivity, specificity, and BER as a function of the threshold, DOROTHEA, training and test sets]
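The bias adjustment amounts to sweeping the decision threshold theta and keeping the value that minimizes the BER of sign(score - theta) on some reference set. A hedged Python sketch of that sweep (an illustrative stand-in, not CLOP's actual bias object):

```python
import numpy as np

def best_threshold(scores, y):
    """Return the threshold minimizing the BER of sign(score - theta); y in {-1,+1}."""
    pos, neg = (y == 1), (y == -1)
    best_ber, best_theta = np.inf, 0.0
    for theta in np.sort(scores):      # candidate thresholds: the scores themselves
        yhat = np.where(scores > theta, 1, -1)
        ber = 0.5 * ((yhat[pos] != 1).mean() + (yhat[neg] != -1).mean())
        if ber < best_ber:
            best_ber, best_theta = ber, theta
    return best_theta

scores = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, -1, 1, 1, 1])
theta = best_threshold(scores, y)   # thresholding above -1.0 yields BER = 0 here
```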
ROC curve
[Figure: ROC curve for DOROTHEA, sensitivity vs. specificity, AUC = 0.91]
Feature Selection
MADELON (pval_max=0.1)
[Figure: weight W and FDR as a function of feature rank]
Heat map
ARCENE

[Figure: heat map of the ARCENE data matrix]
Scatter plots
chain({standardize, s2n('f_max=2'), normalize, my_svc})
Test BER=49%
[Figure: scatter plots of the two selected features for each chain]
chain({standardize, s2n('f_max=1100'), normalize, gs('f_max=2'), my_svc})
Test BER=29.37%
ARCENE
Lab 3
Playing with FS filters and classifiers on Madelon and Dexter
Lab 3 software
1) Try the examples in Lab3 README.m.
2) Taking inspiration from the examples, write a new feature-ranking filter object. Choose one from Chapter 3 or invent your own.
3) Provide the p-value and FDR (using a tabulated distribution or the probe method).
Filters: see chapter 3
Filters Implemented
• @s2n
• @Relief
• @Ttest
• @Pearson (uses Matlab corrcoef; gives the same results as Ttest when the classes are balanced)
• @Ftest (gives the same results as Ttest; important for the p-values: the Fisher criterion must be multiplied by num_patt_per_class, or use anovan)
• @aucfs (ranksum test)
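For reference, the s2n criterion at the top of that list scores each feature by the separation of the class means relative to the within-class standard deviations, |mu+ - mu-| / (sigma+ + sigma-). A small Python version (illustrative, not the CLOP object):

```python
import numpy as np

def s2n(X, y):
    """Signal-to-noise score per feature: |mu+ - mu-| / (sigma+ + sigma-)."""
    pos, neg = X[y == 1], X[y == -1]
    mu_diff = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    sigma_sum = pos.std(axis=0) + neg.std(axis=0)
    return mu_diff / (sigma_sum + 1e-12)   # guard against zero variance

X = np.array([[ 1.0,  0.5],
              [ 1.1, -0.5],
              [-1.0,  0.5],
              [-1.1, -0.5]])
y = np.array([1, 1, -1, -1])
ranking = np.argsort(-s2n(X, y))   # feature 0 separates the classes, so it ranks first
```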
Evaluation of pval and FDR
• Ttest object:
  – computes pval analytically
  – FDR~pval*nsc/n
• probe object:
  – takes any feature ranking object as an argument (e.g. s2n, relief, Ttest)
  – pval~nsp/np
  – FDR~pval*nsc/n
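The probe object's p-value estimate is simple: rank the random probes together with the real features and, for each real feature, count the fraction of probes that score at least as well (nsp/np in the slide's notation). A hedged Python sketch of that estimate (the real probe object wraps a ranking object; this is a standalone simplification):

```python
import numpy as np

def probe_pvalues(real_scores, probe_scores):
    """pval ~ nsp/np: fraction of probe features scoring >= each real feature."""
    probe_scores = np.sort(np.asarray(probe_scores))
    n_probes = len(probe_scores)
    # number of probes with score >= s, via binary search in the sorted probes
    n_better = n_probes - np.searchsorted(probe_scores, real_scores, side='left')
    return n_better / n_probes

real = np.array([5.0, 0.5])            # one strong feature, one weak one
probes = np.array([0.0, 1.0, 2.0, 3.0])
pvals = probe_pvalues(real, probes)    # [0.0, 0.75]
```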
Analytic vs. probe
[Figure: FDR as a function of feature rank for Arcene, Dexter, Dorothea, Gisette, and Madelon; red = analytic, blue = probe]
Relief vs. Ttest (Madelon)
[Figure: pval and FDR as a function of feature rank for Relief and Ttest]
Lab 4
Playing with feature construction on Gisette
Gisette
• Handwritten digits.
• Goal: get familiar with the data and result formats. Make a first submission.
• Easiest learning machines for Gisette: naive and svc.
• Best preprocessing: normalize.
• Easiest feature selection method: s2n.
• Many training examples (6000): unsuitable for kridge unless subsampling is used.
• Many features (5000): select features before running neural or rf.
Baseline Model
baselineGisette (BER=1.8%, feat=20%)
my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model=chain({normalize, s2n('f_max=1000'), my_classif});
Baseline methods
baselineGisette (CV=1.91%, test=1.80%, feat=20%)
my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model=chain({normalize, s2n('f_max=1000'), my_classif});

baselineGisette2 (CV=1.34%, test=1.17%, feat=20%)
my_model=chain({s2n('f_max=1000'), normalize, my_classif});

pixelGisette (CV=1.31%, test=0.91%)
my_classif=svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'});
my_model=chain({normalize, my_classif});
Convolutions
GISETTE (pixelGisette_exp_conv): exponential kernel, sigma1=1.8, sigma2=1.8.

chain({convolve(exp_ker({'dim1=9', 'dim2=9'})), normalize, my_classif})

prepro=my_model{1};
show(prepro.child);
DD=test(prepro,D.train);
browse_digit(DD.X, D.train.Y);

[Figure: the 9x9 convolution kernel and a sample digit (index 4, class 1) before and after convolution]
Principal Components
[Figure: the first principal components displayed as digit images]
@pca_bank: Filter bank object retaining the first f_max principal components of the data matrix
@kmeans_bank: Filter bank containing templates corresponding to f_max cluster centers.
Hadamard bank
[Figure: Hadamard filter bank templates]

@hadamard_bank: Filter bank object performing a Hadamard transform.
Fourier Transform
[Figure: a sample digit and its two-dimensional Fourier transform]
@fourier: Two dimensional Fourier transform.
Epilogue
Becoming a pro and playing with other datasets
Baseline Methods for the Feature Extraction Class Isabelle Guyon
GISETTE Best BER = 1.26 ± 0.14% – n0=1000 (20%) – BER0=1.80%
my_classif=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model=chain({normalize, s2n('f_max=1000'), my_classif});
ARCENE Best BER = 11.9 ± 1.2% – n0=1100 (11%) – BER0=14.7%
my_svc=svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.1'});
my_model=chain({standardize, s2n('f_max=1100'), normalize, my_svc})
BACKGROUND
We present supplementary course material complementing the book “Feature Extraction, Fundamentals and Applications”, I. Guyon et al., Eds., to appear in Springer. Classical algorithms of feature extraction were reviewed in class. More attention was given to feature selection than to feature construction because of the recent success of methods involving a large number of "low-level" features.
The book includes the results of a NIPS 2003 feature selection challenge. The students learned techniques employed by the best challengers and tried to match the best performances. A Matlab® toolbox was provided with sample code. The students could make post-challenge entries to: http://www.nipsfsc.ecs.soton.ac.uk/.
Challenge: Good performance + few features. Tasks: Two-class classification. Data split: Training/validation/test. Valid entry: Results on all 5 datasets.
DATASETS
Dataset    Size (MB)  Type    Features  Training  Validation  Test
Arcene     8.7        Dense   10000     100       100         700
Gisette    22.5       Dense   5000      6000      1000        6500
Dexter     0.9        Sparse  20000     300       300         2000
Dorothea   4.7        Sparse  100000    800       350         800
Madelon    2.9        Dense   500       2000      600         1800
MADELON Best BER = 6.22 ± 0.57% – n0=20 (4%) – BER0=7.33%
my_classif=svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
my_model=chain({probe(relief,{'p_num=2000', 'pval_max=0'}), standardize, my_classif})
METHODS
Scoring:
• Ranking according to test set balanced error rate (BER), i.e. the average of the positive class error rate and the negative class error rate.
• Ties broken by the feature set size.

Learning objects:
• CLOP learning objects implemented in Matlab®.
• Two simple abstractions: data and algorithm.
• Download: http://www.modelselect.inf.ethz.ch/models.php.

Task of the students:
• Baseline method provided, with BER0 performance and n0 features.
• Get BER<BER0, or BER=BER0 but n<n0.
• Extra credit for beating the best challenge entry.
• OK to use the validation set labels for training.
DOROTHEA Best BER = 8.54 ± 0.99% – n0=1000 (1%) – BER0=12.37%
my_model=chain({TP('f_max=1000'), naive, bias});
DEXTER Best BER = 3.30 ± 0.40% – n0=300 (1.5%) – BER0=5%
my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model=chain({s2n('f_max=300'), normalize, my_classif})
RESULTS

[Figure: test BER (%) distributions, range 0-50%, for ARCENE (cancer diagnosis), DEXTER (text categorization), GISETTE (digit recognition), DOROTHEA (drug discovery), and MADELON (artificial data)]
CONCLUSIONS
The performances of the challengers could be matched with the CLOP library. Simple filter methods (S2N and Relief) were sufficient to obtain a space dimensionality reduction comparable to what the winners obtained. SVMs are easy to use and generally work better than other methods. We experimented with adding prior knowledge about the task on Gisette and could outperform the winners. Further work includes using prior knowledge for the other datasets.
Best student results
http://clopinet.com/isabelle/Projects/ETH/Feature_Selection_w_CLOP.html
Agnostic Learning vs. Prior Knowledge challenge
Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley, Olivier Guyon,
and many other volunteers, see http://www.agnostic.inf.ethz.ch/credits.php
Open until August 1st, 2007
Datasets
Dataset  Domain          Type           Features  Training Ex.  Validation Ex.  Test Ex.
ADA      Marketing       Dense          48        4147          415             41471
GINA     Digits          Dense          970       3153          315             31532
HIVA     Drug discovery  Dense          1617      3845          384             38449
NOVA     Text classif.   Sparse binary  16969     1754          175             17537
SYLVA    Ecology         Dense          216       13086         1308            130858
http://www.agnostic.inf.ethz.ch
ADA
ADA is the marketing database
• Task: Discover high-revenue people from census data. Two-class problem.
• Source: Census bureau, “Adult” database from the UCI machine-learning repository.
• Features: 14 original attributes, including age, workclass, education, marital status, occupation, native country. Continuous, binary, and categorical features.
GINA
• Task: Handwritten digit recognition. Separate the odd from the even digits. Two-class problem with heterogeneous classes.
• Source: MNIST database formatted by LeCun and Cortes.
• Features: 28x28 pixel map.
GINA is the digit database
HIVA
HIVA is the HIV database
• Task: Find compounds active against the AIDS HIV infection. We brought it back to a two-class problem (active vs. inactive), but provide the original labels (active, moderately active, and inactive).
• Data source: National Cancer Institute.
• Data representation: The compounds are represented by their 3D molecular structure.
NOVA
NOVA is the text classification database
• Task: Classify newsgroup emails into politics or religion vs. other topics.
• Source: The 20-Newsgroup dataset from the UCI machine-learning repository.
• Data representation: The raw text, with an estimated 17000 words of vocabulary.
Subject: Re: Goalie masks
Lines: 21

Tom Barrasso wore a great mask, one time, last season. He unveiled it at a game in Boston.
It was all black, with Pgh city scenes on it. The "Golden Triangle" graced the top, along with a steel mill on one side and the Civic Arena on the other. On the back of the helmet was the old Pens' logo, the current (at the time) Pens logo, and a space for the "new" logo.
A great mask done in by a goalie's superstition.
Lori
SYLVA
SYLVA is the ecology database
• Task: Classify forest cover types into Ponderosa pine vs. everything else.
• Source: US Forest Service (USFS).
• Data representation: Forest cover type for 30 x 30 meter cells, encoded with 108 features (elevation, hill shade, wilderness type, soil type, etc.)
BER distribution (March 1st)
Agnostic learning Prior knowledge
The black vertical line indicates the best ranked entry (only the last 5 entries of each participant were ranked). Beware of overfitting!
CLOP models
Preprocessing and FS
Model grouping
for k=1:10
base_model{k}=chain({standardize, naive});
end
my_model=ensemble(base_model);
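The ensemble object groups base models and averages their outputs. A hypothetical Python sketch of that grouping, with a toy nearest-centroid learner standing in for chain({standardize, naive}) (names and details are illustrative, not CLOP's implementation):

```python
import numpy as np

class Centroid:
    """Toy base learner: project onto the difference of the class means."""
    def train(self, X, y):
        mu_pos, mu_neg = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
        self.w = mu_pos - mu_neg
        self.b = -0.5 * (mu_pos + mu_neg) @ self.w
        return self.predict(X)

    def predict(self, X):
        return X @ self.w + self.b

class Ensemble:
    """Train every base model, then average their decision values."""
    def __init__(self, models):
        self.models = models

    def train(self, X, y):
        for m in self.models:
            m.train(X, y)
        return self.predict(X)

    def predict(self, X):
        return np.mean([m.predict(X) for m in self.models], axis=0)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
my_model = Ensemble([Centroid() for _ in range(10)])
out = my_model.train(X, y)   # averaged decision values, signs match y here
```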
CLOP models (best entrant)
Dataset CLOP models selected
ADA 2*{sns,std,norm,gentleboost(neural),bias}; 2*{std,norm,gentleboost(kridge),bias}; 1*{rf,bias}
GINA 6*{std,gs,svc(degree=1)}; 3*{std,svc(degree=2)}
HIVA 3*{norm,svc(degree=1),bias}
NOVA 5*{norm,gentleboost(kridge),bias}
SYLVA 4*{std,norm,gentleboost(neural),bias}; 4*{std,neural}; 1*{rf,bias}
Juha Reunanen, cross-indexing-7
sns = shift’n’scale, std = standardize, norm = normalize (some details of hyperparameters not shown)
CLOP models (2nd best entrant)
Dataset CLOP models selected
ADA {sns, std, norm, neural(units=5), bias}
GINA {norm, svc(degree=5, shrinkage=0.01), bias}
HIVA {std, norm, gentleboost(kridge), bias}
NOVA {norm,gentleboost(neural), bias}
SYLVA {std, norm, neural(units=1), bias}
Hugo Jair Escalante Balderas, BRun2311062
sns = shift’n’scale, std = standardize, norm = normalize (some details of hyperparameters not shown)
Note: entry Boosting_1_001_x900 gave better results, but was older.