CSI5388
Data Sets: Running Proper Comparative Studies with Large Data Repositories
[Based on Salzberg, S.L., 1997, “On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach”]



Slide 2: Advantages of Large Data Repositories

• The researcher can easily experiment with real-world data sets (rather than artificial data).
• New algorithms can be tested in real-world settings.
• Since many researchers use the same data sets, the comparison between new and old classifiers is easy.
• Problems arising in real-world settings can be identified and focused on.

Slide 3: Disadvantages of Large Data Repositories

• The Multiplicity Effect: when running large numbers of experiments, more stringent requirements are needed to establish statistical significance than when only a few experiments are considered.
• The Community Experiments Problem: if many researchers run the same experiments, it is possible that, by chance, some of them will obtain statistically significant results. These are the people who will publish their results (even though the results may have been obtained only by chance!).
• The Repeated Tuning Problem: to be valid, all tuning should be done before the test set is seen. This is seldom the case.
• The Problem of Generalizing Results: it is not necessarily correct to generalize from the UCI Repository to other data sets.

Slide 4: The Multiplicity Effect: An Example

• 14 different algorithms are compared to a default classifier on 11 data sets.
• Differences are reported as significant if a two-tailed, paired t-test produces a p-value smaller than 0.05.
• This is not stringent enough: by running 14 * 11 = 154 experiments, one has 154 chances to reach significance. So the expected number of significant results obtained by chance is 154 * 0.05 = 7.7.
• This is not desirable. In fact, in such a setting, the acceptable p-value should be much smaller than 0.05 in order to obtain a true significance level of 0.05.
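A quick sanity check of this arithmetic (a minimal Python sketch; the counts 154 and 0.05 are the ones from the example above):

```python
n_experiments = 14 * 11   # 154 comparisons, as in the example
alpha = 0.05              # per-test significance level

# Under the null hypothesis (no real differences), each test has an
# alpha chance of reaching significance, so the expected number of
# chance "discoveries" is simply n * alpha.
print(f"{n_experiments * alpha:.1f}")   # 7.7
```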

Slide 5: The Multiplicity Effect: More Formally

• Let the acceptable significance level for each of our experiments be α*.
• Then the chance of drawing the right conclusion for one experiment is 1 - α*.
• If we conduct n independent experiments, the chance of getting them all right is (1 - α*)^n.
• Suppose that there is no real difference among the algorithms being tested. Then the chance that we will make at least one mistake is:

  α = 1 - (1 - α*)^n

Slide 6: The Multiplicity Effect: The Example Continued

• We assume that there is no real difference among the 14 algorithms being compared.
• If our acceptable significance level is set to α* = 0.05, then the odds of making at least one mistake in our 154 experiments are

  1 - (1 - 0.05)^154 = 0.9996

• That is, we are 99.96% certain that at least one of our conclusions will incorrectly reach significance at the 0.05 level.
• If we wanted to reach a true significance level of 0.05, we would need to set

  1 - (1 - α*)^154 ≤ 0.05, i.e., α* ≤ 0.0003

• This is called the Bonferroni adjustment.
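These numbers are easy to verify (a minimal Python sketch; note that solving 1 - (1 - α*)^n ≤ α exactly is sometimes called the Šidák correction, while the simpler bound α/n is the classical Bonferroni form; both round to the 0.0003 quoted above):

```python
alpha_family = 0.05   # desired true (family-wise) significance level
n = 154               # number of experiments, as in the example

# Chance of at least one spurious significance at alpha* = 0.05
print(f"{1 - (1 - 0.05) ** n:.4f}")                  # 0.9996

# Per-test level solving 1 - (1 - alpha*)^n = 0.05 exactly
print(f"{1 - (1 - alpha_family) ** (1 / n):.6f}")    # 0.000333

# The simpler Bonferroni bound alpha/n gives nearly the same value
print(f"{alpha_family / n:.6f}")                     # 0.000325
```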

Slide 7: Experiment Independence

• The Bonferroni adjustment is valid as long as the experiments are independent. However:
  - If different algorithms are compared on the same test data, then the tests are not independent.
  - If the training and testing data are drawn from the same data set, then the experiments are not independent.
• In these cases, it is even more likely that the statistical tests will find significance where none exists, even if the Bonferroni adjustment is used.
• In conclusion, the t-test, often used by researchers, is the wrong test to use in this particular experimental setting.

Slide 8: Alternative Statistical Test to deal with the Multiplicity Effect I

• The first approach suggested as an alternative for dealing with the multiplicity effect is the following. When comparing only two algorithms, A and B (a sketch of this test follows below):
  - Count the number of examples that A got right and B got wrong (A>B), and the number of examples that B got right and A got wrong (B>A).
  - Compare the percentage of times where A>B and B>A, throwing out the ties.
  - Use a binomial test (or the McNemar test, which is nearly identical and easier to compute) for the comparison, with the Bonferroni adjustment for multiple tests. (See Salzberg, 1997.)
• However, the binomial test does not handle quantitative differences between algorithms or more than two algorithms, and it does not consider the frequency of agreements between the two algorithms.
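As a concrete illustration, here is a minimal sketch of this comparison in Python (the prediction vectors are hypothetical; SciPy's binomtest is used for the two-sided binomial test):

```python
import numpy as np
from scipy.stats import binomtest

# Hypothetical per-example correctness of classifiers A and B on the
# same test set (1 = classified correctly, 0 = misclassified).
a_correct = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])
b_correct = np.array([1, 0, 0, 1, 1, 1, 0, 1, 0, 0])

a_wins = int(np.sum((a_correct == 1) & (b_correct == 0)))  # A>B
b_wins = int(np.sum((b_correct == 1) & (a_correct == 0)))  # B>A
# Ties (both right or both wrong) are thrown out.

# Under the null hypothesis of no real difference, A's wins among the
# disagreements follow a Binomial(a_wins + b_wins, 0.5) distribution.
result = binomtest(a_wins, n=a_wins + b_wins, p=0.5, alternative="two-sided")
print(a_wins, b_wins, result.pvalue)
```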

Slide 9: Alternative Statistical Tests to deal with the Multiplicity Effect II

• The other approaches suggested as alternatives for dealing with the multiplicity effect are the following (a sketch of the second one follows below):
  - Use random, distinct samples of the data to test each algorithm, and use an analysis of variance (ANOVA) to compare the results.
  - Use the following randomization testing approach:
    - For each trial, the data set is copied and the class labels are replaced with random class labels.
    - An algorithm is used to find the most accurate classifier it can, using the same methodology that is used with the original data.
    - Any estimate of accuracy greater than random on the copied data reflects the bias in the methodology, and this reference distribution can then be used to adjust the estimates on the real data.
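A minimal sketch of the randomization testing approach, assuming scikit-learn and a decision tree as the (hypothetical) algorithm under study; scikit-learn also packages this whole procedure as sklearn.model_selection.permutation_test_score:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Accuracy estimate on the real labels, with the same methodology
# (here 10-fold cross-validation) used throughout.
real_score = cross_val_score(clf, X, y, cv=10).mean()

# Reference distribution: repeat the whole procedure on copies of the
# data whose class labels have been randomly shuffled.
null_scores = np.array([
    cross_val_score(clf, X, rng.permutation(y), cv=10).mean()
    for _ in range(100)
])

# Accuracy above chance on shuffled labels reflects the bias of the
# methodology; the empirical p-value compares the real score to it.
p_value = (1 + np.sum(null_scores >= real_score)) / (1 + len(null_scores))
print(real_score, null_scores.mean(), p_value)
```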

Slide 10: Community Experiments

• The multiplicity effect is not the only problem plaguing the current experimental process.
• There is another phenomenon, which can be referred to as the community experiments effect, that occurs even if all the statistical tests are conducted properly.
• Suppose that 100 different people are trying to compare the accuracy of algorithms A and B which, we assume, have the same mean accuracy on a very large population of data sets.
• If these 100 people are studying these algorithms and looking for a significance level of 0.05, then we can expect 5 of them, by chance, to find a significant difference between A and B.
• One of these 5 people may publish their results, while the others move on to different experiments. (A small simulation of this effect follows below.)
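The effect is easy to simulate (a minimal Python sketch with entirely hypothetical numbers: both algorithms are given the same mean accuracy, and each "research group" runs its own paired t-test):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
n_groups, n_datasets = 100, 30

significant = 0
for _ in range(n_groups):
    # A and B have identical mean accuracy: any difference is noise.
    acc_a = rng.normal(loc=0.85, scale=0.05, size=n_datasets)
    acc_b = rng.normal(loc=0.85, scale=0.05, size=n_datasets)
    if ttest_rel(acc_a, acc_b).pvalue < 0.05:
        significant += 1

# Roughly 5 of the 100 groups will see a "significant" difference
# even though none exists, and those are the publishable results.
print(significant)
```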

Slide 11: How to Deal with Community Experiments

• The way to guard against the community experiments effect is to duplicate the results.
• Proper duplication requires drawing a new random sample from the population and repeating the study.
• Unfortunately, since benchmark databases are static and small in size, it is not possible to draw new random samples of the same data sets.
• Using a different partitioning of the data into training and test sets does not help with this problem either.
• Should we rely on artificial data sets?

Slide 12: Repeated Tuning

• Most experiments need tuning. In many cases the algorithms themselves need tuning, and in most cases various data representations need to be tried.
• If the results of all this tuning are tested on the same data set as the one used to report the final results, then each adjustment should be counted as a separate experiment. E.g., if 10 different combinations of parameters are tried, then we would need to use a significance level of 0.005 in order to truly reach one of 0.05.
• The solution to this problem is to do any kind of parameter tuning, algorithmic tweaking, and so on before seeing the test set. Once a result has been produced from the test set, it is not possible to go back to it. (A sketch of this discipline follows below.)
• This is a problem because it makes exploratory research impossible if one wants to report statistically significant results; conversely, it makes reporting statistically significant results impossible if one wants to do exploratory research.
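One common way to enforce this discipline is to set the test set aside before any tuning, tune only inside the training portion (e.g., by cross-validation), and touch the test set exactly once. A minimal sketch, assuming scikit-learn and a hypothetical parameter grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The test set is carved out before any tuning takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# All tuning (a hypothetical grid over tree depth and split size)
# happens via cross-validation inside the training set only.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "min_samples_split": [2, 10]},
    cv=10)
grid.fit(X_train, y_train)

# The test set is used exactly once, for the final reported result.
print(grid.best_params_, grid.score(X_test, y_test))
```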

Slide 13: Generalizing Results

• It is often believed that if some effect concerning a learning algorithm is shown to hold on a random subset of the UCI data sets, then this effect should also hold on other data sets.
• This is not necessarily the case because the UCI Repository, as was shown by Holte (and others), represents only a very limited sample of problems, many of which are easy for a classifier. In other words, the repository is not an unbiased sample of classification problems.
• A second problem with too much reliance on community data sets such as the UCI Repository is that, consciously or not, researchers start developing algorithms tailored to those data sets. [E.g., they may develop algorithms for, say, missing data because the repository contains such problems, even if in reality this turns out not to be a very prevalent problem.]

Slide 14: A Recommended Approach

Slide 15: Two Extra Points regarding Valid Testing

• Running many cross-validations on the same data set and reporting each cross-validation as a single trial does not produce valid statistics, because the trials in such a design are highly interdependent.
• If one wishes to extend the recommended procedure to several data sets rather than a single one, one should use the Bonferroni adjustment to adjust the significance levels accordingly.