CSI5388 Model Selection


Page 1: CSI5388 Model Selection


CSI5388 Model Selection

Based on “Key Concepts in Model Selection: Performance and Generalizability” by Malcolm R. Forster

Page 2: CSI5388 Model Selection


What is Model Selection?

Model Selection refers to the process of optimizing a model (e.g., a classifier, a regression analyzer, and so on).

Model Selection encompasses both the selection of a model (e.g., C4.5 versus Naïve Bayes) and the adjustment of a particular model’s parameters (e.g., adjusting the number of hidden units in a neural network).

Page 3: CSI5388 Model Selection


What are potential issues with Model Selection?

It is usually possible to improve a model’s fit to the data, up to a certain point (e.g., adding hidden units allows a neural network to fit the data on which it is trained more closely).

The question, however, is where the line should be drawn between improving the model and hurting its performance on novel data (overfitting).

We want the model to use enough information from the data set to be as unbiased as possible, but we also want it to discard whatever information it must in order to generalize as well as it can (i.e., fare as well as possible in a variety of different contexts).

As such, model selection is very tightly linked with the issue of the bias/variance tradeoff.
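A minimal sketch (not from the slides; a synthetic polynomial-regression setup is assumed) of the overfitting question raised above: training error keeps falling as the model gets more complex, while error on fresh data eventually worsens.

```python
# Illustrative sketch: training fit improves with complexity,
# but held-out error eventually degrades (overfitting).
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)   # true signal + noise
    return x, y

x_train, y_train = make_data(30)
x_new, y_new = make_data(1000)                   # stands in for novel data

for degree in (1, 3, 5, 9, 15):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    new_mse = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, novel-data MSE {new_mse:.3f}")
```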

Page 4: CSI5388 Model Selection


Why is the issue of Model Selection considered in a course on evaluation?

By evaluation, in this course, we are principally concerned with the issue of evaluating a classifier once its tuning is finalized.

However, we must keep in mind that evaluation has a broader meaning in the sense that while classifiers are being chosen and tuned, another (non-final) evaluation must take place to make sure that we are on the right track.

In fact, there is a view that does not distinguish between the two aspects of evaluation above, but rather assumes that the final evaluation is nothing but a continuation of the model selection process.

Page 5: CSI5388 Model Selection


Different Approaches to Model Selection

We will survey different approaches to Model Selection, not all of them equally useful to our problem of maximizing predictive performance.

In particular, we will consider:
• The Method of Maximum Likelihood
• Classical Hypothesis Testing
• Akaike’s Information Criterion
• Cross-Validation Techniques
• Bayes Method
• Minimum Description Length

Page 6: CSI5388 Model Selection


The Method of Maximum Likelihood (ML)

Out of the Maximum Likelihood (ML) hypotheses in the competing models, select the one that has the greatest likelihood or log-likelihood.

This method is the antithesis of Occam’s razor as, in the case of nested models, it can never favour anything less than the most complex of all competing models.
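A minimal sketch of this point, assuming Gaussian noise and nested polynomial models (an illustrative setup, not from the slides): the maximized log-likelihood cannot decrease as parameters are added, so pure ML selection always prefers the most complex candidate.

```python
# Pure ML selection among nested models: the maximized log-likelihood
# is non-decreasing in complexity, so the largest model wins.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
y = 2.0 * x + rng.normal(0, 0.5, 40)             # the true model is linear

def max_log_likelihood(x, y, degree):
    """Gaussian log-likelihood at the ML estimates (coefficients + noise variance)."""
    residuals = y - np.polyval(np.polyfit(x, y, degree), x)
    n = len(y)
    sigma2_hat = np.mean(residuals ** 2)          # ML estimate of the noise variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

for degree in range(1, 8):
    print(f"degree {degree}: max log-likelihood = {max_log_likelihood(x, y, degree):.2f}")
# The highest-degree model always scores best, even though the data are linear.
```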

Page 7: CSI5388 Model Selection


Classical Hypothesis Testing I

We consider the comparison of nested models, in which we decide to add or omit a single parameter θ. So we are choosing between the hypotheses θ = 0 and θ ≠ 0.

θ = 0 is considered the null hypothesis.

We set up a 5% critical region such that if θ^, the maximum likelihood (ML) estimate of θ, is sufficiently close to 0 (i.e., falls outside the two-tailed critical region), then the null hypothesis is not rejected.

Note that when the test fails to reject the null hypothesis, it is favouring the simpler hypothesis in spite of its poorer fit (because θ^ fits better than θ = 0), if the null hypothesis is the simpler of the two models.

So classical hypothesis testing succeeds in trading off goodness-of-fit for simplicity.
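The slide describes a two-tailed test on θ^. A rough sketch of one asymptotically equivalent way to carry this out for nested models (an assumption for illustration, not the slides' exact procedure) is a likelihood-ratio test of θ = 0 against θ ≠ 0, again with Gaussian noise.

```python
# Likelihood-ratio test between nested models differing by one parameter (the slope).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 50)
y = 0.2 * x + rng.normal(0, 1.0, 50)              # weak true effect

def gaussian_max_loglik(residuals):
    n = len(residuals)
    sigma2_hat = np.mean(residuals ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

# Null model: theta = 0 (intercept only).  Alternative: intercept + slope theta.
loglik_null = gaussian_max_loglik(y - np.mean(y))
slope, intercept = np.polyfit(x, y, 1)
loglik_alt = gaussian_max_loglik(y - (slope * x + intercept))

lr_statistic = 2 * (loglik_alt - loglik_null)     # asymptotically chi-square with 1 df
p_value = chi2.sf(lr_statistic, df=1)
print(f"theta_hat = {slope:.3f}, LR = {lr_statistic:.3f}, p = {p_value:.3f}")
print("reject theta = 0" if p_value < 0.05 else "keep the simpler model (theta = 0)")
```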

Page 8: CSI5388 Model Selection


Classical Hypothesis Testing II

Question: Since classical hypothesis testing succeeds in trading off goodness-of-fit for simplicity, why do we need any other method for model selection?

That’s because it doesn’t apply well to non-nested models (when the issue is not one of adding or omitting a parameter).

In fact, classical hypothesis testing works on some model selection problems only by chance: it was not purposely designed to work on them.

Page 9: CSI5388 Model Selection


Akaike’s Information Criterion I

Akaike’s Information Criterion (AIC) minimizes the Kullback-Leibler distance of the selected density from the true density.

In other words, the AIC rule maximizes

log f(x; θ = θ^)/n − k/n

where n is the number of observed data points, k is the number of adjustable parameters, and f(x; θ) is the density.

The first term in the formula above measures fit per datum, while the second term penalizes complex models.

AIC without the second term would be the same as Maximum Likelihood (ML).
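A minimal sketch of the per-datum AIC score above, again assuming Gaussian noise and nested polynomial models (the parameter count k and the data-generating process are illustrative assumptions; the conventional AIC = −2·logL + 2k ranks models the same way).

```python
# Per-datum AIC score: log f(x; theta_hat)/n - k/n, maximized over candidates.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 60)
y = 1.5 * x - x ** 2 + rng.normal(0, 0.4, 60)     # the true model is quadratic

def aic_score(x, y, degree):
    n = len(y)
    residuals = y - np.polyval(np.polyfit(x, y, degree), x)
    loglik = -0.5 * n * (np.log(2 * np.pi * np.mean(residuals ** 2)) + 1)
    k = degree + 2                                 # polynomial coefficients + noise variance
    return loglik / n - k / n                      # fit per datum minus complexity penalty

scores = {d: aic_score(x, y, d) for d in range(1, 8)}
print(scores, "-> best degree:", max(scores, key=scores.get))
```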

Page 10: CSI5388 Model Selection


Akaike’s Information Criterion II

What is the difference between AIC and classical hypothesis testing?

• AIC applies to nested and non-nested models. All that is needed for AIC are the ML values of the models, and their k and n values; there is no need to choose a null hypothesis.

• AIC effectively trades off Type I against Type II error. As a result, AIC may give less weight to simplicity relative to fit than classical hypothesis testing does.

Page 11: CSI5388 Model Selection


Cross-Validation Techniques I

Use a calibration set (training set) and a test set to determine the best model.

Note, however, that this test set cannot be the same as the test set we are used to, so in fact, we need three sets: training, validation, and test.

Because the training set alone is different from the training and validation sets taken together (and our goal is to optimize the model on that larger set), it is best to use leave-one-out on training + validation to select a model.

This is because training + validation minus one data point is closer to training + validation than training alone.
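A minimal sketch of the procedure just described (the pooled data and polynomial candidates are illustrative assumptions): pool training + validation, score each candidate by leave-one-out error, select the winner, and leave the test set untouched for the final evaluation.

```python
# Leave-one-out model selection on the pooled training + validation data.
import numpy as np

rng = np.random.default_rng(4)
x_pool = rng.uniform(-1, 1, 40)                    # training + validation, pooled
y_pool = np.sin(2 * x_pool) + rng.normal(0, 0.3, 40)

def loo_error(x, y, degree):
    """Mean squared leave-one-out prediction error for a polynomial model."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        coefs = np.polyfit(x[mask], y[mask], degree)
        errors.append((np.polyval(coefs, x[i]) - y[i]) ** 2)
    return np.mean(errors)

loo = {d: loo_error(x_pool, y_pool, d) for d in range(1, 8)}
print(loo, "-> selected degree:", min(loo, key=loo.get))
# Only after this selection is the untouched test set used to evaluate the chosen model.
```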

Page 12: CSI5388 Model Selection


Cross-Validation Techniques II

Although cross-validation makes no appeal to simplicity whatsoever, it is asymptotically equivalent to AIC.

This is because minimizing the Kullback-Leibler distance of the ML density is the same as maximizing predictive accuracy, if the latter is defined in terms of the expected log-likelihood of new data generated by the true density (Forster and Sober, 1994).

More simply, I would guess that this can be explained by the fact that there is truth to Occam’s razor: the simpler models are the best at predicting the future, so by optimizing predictive accuracy, we are unwittingly trading off goodness-of-fit against model simplicity.

Page 13: CSI5388 Model Selection


Bayes Method I

Bayes method says that models should be compared by their posterior probabilities.

Schwarz (1978) assumed that the prior probabilities of all models were equal, and then derived an asymptotic expression for the likelihood of a model as follows:

A model can be viewed as a big disjunction which asserts that either the first density in the set is the true density, or the second, or the third, and so on.

By the probability calculus, the likelihood of a model is, therefore, the average likelihood of its members, where each likelihood is weighted by the prior probability of the particular density given that the model is true.

In other words, the Bayesian Information Criterion (BIC) rule is to favour the model with the highest value of

log f(x; θ = θ^)/n − [log(n)/2] · k/n

Note: The Bayes method and the BIC criterion are not always the same thing.
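A minimal sketch of the per-datum BIC score above, in the same assumed Gaussian polynomial setup as the AIC sketch (not part of the slides). BIC penalizes extra parameters more heavily than AIC whenever log(n)/2 > 1, i.e., for n larger than about 7.

```python
# Per-datum BIC score: log f(x; theta_hat)/n - (log(n)/2) * k/n.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 80)
y = 1.5 * x - x ** 2 + rng.normal(0, 0.4, 80)

def bic_score(x, y, degree):
    n = len(y)
    residuals = y - np.polyval(np.polyfit(x, y, degree), x)
    loglik = -0.5 * n * (np.log(2 * np.pi * np.mean(residuals ** 2)) + 1)
    k = degree + 2                                 # coefficients + noise variance
    return loglik / n - (np.log(n) / 2) * k / n

scores = {d: round(bic_score(x, y, d), 4) for d in range(1, 8)}
print(scores, "-> best degree:", max(scores, key=scores.get))
```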

Page 14: CSI5388 Model Selection


Bayes Method II

There is a philosophical disagreement between the Bayesian school and other researchers.

The Bayesians assume that BIC is an approximation of the Bayesian method, but this is the case only if the models are quasi-nested. If they are truly nested, there is no implementation of Occam’s razor whatsoever.

Bayes method and AIC optimize entirely different things, and this is why they don’t always agree.

Page 15: CSI5388 Model Selection


Minimum Description Length Criteria

In Computer Science, the best known implementation of Occam’s razor is the minimum description length criterion (MDL) or the minimum message length criterion (MML).

The motivating idea is that the best model is the one that facilitates the shortest encoding of the observed data.

Among the various implementations of MML and MDL, one is asymptotically equivalent to BIC.
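A very rough two-part-code sketch of the motivating idea (an assumption for illustration, not a faithful MDL/MML implementation): the description length is the cost of encoding the k fitted parameters, roughly (k/2)·log(n) nats, plus the cost of encoding the data given the model, the negative maximized log-likelihood. Minimizing this crude length ranks models the same way as the BIC score above, in line with the asymptotic equivalence just mentioned.

```python
# Crude two-part description length (in nats): model cost + data-given-model cost.
import numpy as np

def description_length(loglik, k, n):
    return (k / 2) * np.log(n) - loglik

print(description_length(loglik=-52.3, k=4, n=60))   # hypothetical values for illustration
```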

Page 16: CSI5388 Model Selection


Limitations of the different approaches to Model Selection I

One issue with all these model selection methods is called selection bias, and it cannot be easily corrected.

Selection bias corresponds to the fact that model selection criteria are particularly risky when a selection is made from a large number of competing models.

The random fluctuation in the data will increase the scores of some models more than others. The more models there are, the greater the chance that the winner won by luck rather than by merit.
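A minimal simulation sketch of this effect (the numbers are illustrative assumptions): every candidate model has the same true accuracy, yet the best score observed on a finite evaluation sample climbs as the pool of candidates grows, so the apparent winner increasingly owes its margin to luck.

```python
# Selection bias: the maximum of many noisy scores overstates true merit.
import numpy as np

rng = np.random.default_rng(6)
true_accuracy, n_cases = 0.70, 100                 # all models are equally good

for n_models in (1, 10, 100, 1000):
    observed = rng.binomial(n_cases, true_accuracy, size=n_models) / n_cases
    print(f"{n_models:4d} candidate models -> best observed accuracy {observed.max():.2f}")
```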

Page 17: CSI5388 Model Selection


Limitations of the different approaches to Model Selection II

The Bayesian method is not as sensitive to the problem of selection bias because its predictive density is a weighted average of all the densities in all domains. However, this advantage comes at the expense of making the prediction rather imprecise.

To mitigate problems of selection bias, Golden (2000) suggests a three-way statistical test whose possible outcomes are: accept, reject, or suspend judgement.

Browne (2000) emphasizes that selection criteria should not be followed blindly. He warns that the term ‘selection’ suggests something definite, which in fact has not been reached.

Page 18: CSI5388 Model Selection


Limitations of the different approaches to Model Selection III

If model selection is seen as using data sampled from one distribution in order to predict data sampled from another, then the methods previously discussed will not work well, since they assume that the two distributions are the same.

In this case, errors of estimation do not arise solely from small-sample fluctuations, but also from the failure of the sampled data to properly represent the domain of prediction.

We will now discuss a method by Busemeyer and Wang (2000) that deals with this issue of extrapolation or generalization to new data.

Page 19: CSI5388 Model Selection


Model Selection for New Data [Busemeyer and Wang, 2000]

One response, when extrapolation to new data does not work, is: “there is nothing we can do about that”.

Busemeyer and Wang (2000), as well as Forster, do not share this conviction. Instead, they designed the generalization criterion methodology, which states that successful extrapolation in the past may be a useful indicator of further extrapolation.

The idea is to find out whether there are situations in which past extrapolation is a useful indicator of future extrapolation, and whether this empirical information is not already exploited by the standard selection criteria.

Page 20: CSI5388 Model Selection


Experimental Results I

Forster ran some experiments to test this idea.

He found that on the task of fitting data coming from the same distribution, the model selection methods we discussed were adequate at predicting the best models: the most complex models were always the better ones (we are in a situation where a lot of data is available to model the domain, and thus the fear of overfitting a complex model is not present).

On the task of extrapolating from one domain to the next, the model selection methods were not adequate, since they did not reflect the fact that the best classifiers were not necessarily the most complex ones.

Page 21: CSI5388 Model Selection


Experimental Results II

The generalization methodology divides the training set into two subdomains, but the subdomains are chosen so that the direction of the test extrapolation is most likely to indicate the success of the wider extrapolation.

This approach seems to yield better results than standard model selection.

For a practical example of this kind of approach, see Henchiri & Japkowicz, 2007.
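A minimal sketch of one reading of the generalization criterion idea (an assumed setup, not Busemeyer and Wang’s exact procedure): split the available data into two subdomains along the direction of the intended extrapolation, here by fitting on low x and scoring on high x, and prefer the model that extrapolates best across that internal split.

```python
# Score candidate models by how well they extrapolate from one subdomain to the other.
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 2, 60))
y = np.log1p(x) + rng.normal(0, 0.05, 60)

fit_region = x < 1.2                               # "past" subdomain used for fitting
extrap_region = ~fit_region                        # "future" subdomain used for scoring

def extrapolation_error(degree):
    coefs = np.polyfit(x[fit_region], y[fit_region], degree)
    return np.mean((np.polyval(coefs, x[extrap_region]) - y[extrap_region]) ** 2)

errors = {d: extrapolation_error(d) for d in range(1, 6)}
print(errors, "-> selected degree:", min(errors, key=errors.get))
# Complex polynomials often fit the low-x region well but extrapolate poorly,
# which is exactly the kind of information the standard criteria do not use.
```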

Page 22: CSI5388 Model Selection


Experimental Results III

Overall, it does appear that the generalization scores provide us with useful empirical information that is not exploited by the standard selection criteria.

There are some cases where the information is not only unexploited, but also relatively clear-cut and decisive.

Such information might at least supplement the standard criteria.