the expected performance curve samy bengio, johnny mariéthoz, mikaela keller

1

The Expected Performance Curve
Samy Bengio, Johnny Mariéthoz, Mikaela Keller

MI – 25 October 2007
Kresten Toftgaard Andersen

2

Introduction to the paper

• By Samy Bengio, Johnny Mariéthoz and Mikaela Keller, 2005.
• Written for the machine learning community and other researchers who need to compare models.

Content of the paper:
• Introduces ROC curves very briefly.
• Points out some risks of using ROC curves to compare classification models.
• Argues, by showing some results, that ROC curves can be misleading.
• Contributes the so-called “Expected Performance Curve” (EPC) and argues why it is better for comparing models.
• Extends the EPC with confidence intervals and statistical difference tests.
• Concludes by summarizing the contribution and listing strengths and weaknesses of ROC and EPC.
• Acknowledgements and references.

3

Content

• Motivation
• Introduce terminology and notation, define the problem
• Introduce ROC curves
• Example: how to calculate a ROC
• Present arguments for why ROC curves should be used with great care
• Introduce EPC
• Continue the example, showing how to calculate an EPC
• Present arguments for why EPC might be better than ROC
• Confidence interval
• My opinion
• Discussion

4

Motivation

ROC analysis is an important way to compare binary classifier models.

Can be used to select optimal models and discard suboptimal models.

Areas of use:
• Medicine (diagnostic testing, evaluating evidence-based medicine approaches)
• Epidemiology (factors affecting health, evaluating optimal treatment approaches)
• Radiology (radar signals, evaluating new radiology techniques)
• Psychology (signal detection, assessing human detection of weak signals)
• Machine Learning (evaluation of machine learning techniques)
• …

5

Definition of 2-class classifiers

Definition of 2-class classification problems:

Apply the function and its associated threshold to a separate test data set (the true class must be known) and count the outcomes.

6

Confusion matrix

Given a 2-class classifier and an instance, there are four possible outcomes:

• TP: instance is positive and is classified as positive
• FN: instance is positive and is classified as negative
• TN: instance is negative and is classified as negative
• FP: instance is negative and is classified as positive
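The four outcomes can be counted with a few lines of Python. This is only a minimal sketch: the names `Counts` and `confusion_counts`, the toy scores and labels, and the convention that a score at or above the threshold means "positive" are illustrative assumptions, not taken from the paper.

```python
from typing import NamedTuple, Sequence


class Counts(NamedTuple):
    tp: int  # positive, classified as positive
    fn: int  # positive, classified as negative
    tn: int  # negative, classified as negative
    fp: int  # negative, classified as positive


def confusion_counts(scores: Sequence[float], labels: Sequence[bool],
                     threshold: float) -> Counts:
    """Count the four outcomes; assumes score >= threshold means 'positive'."""
    tp = fn = tn = fp = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= threshold
        if is_positive and predicted_positive:
            tp += 1
        elif is_positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return Counts(tp, fn, tn, fp)


# Toy data, invented for illustration only.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
counts = confusion_counts(scores, labels, threshold=0.5)
print(counts)
```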

7

Performance metrics

• The selected measure is a pair, generically called V1 and V2.
• V1 and V2 can be calculated in many ways depending on the situation. All are simple combinations of TP, TN, FP and FN.
• The exact calculation of V1 and V2 is not important in this paper.

8

Performance metrics

• A single measure, generically called V, combines V1 and V2.
• V can also be calculated in several ways depending on the situation, for example as the HTER (Half Total Error Rate).
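As one concrete instance, a sketch assuming that V1 is the false acceptance rate (FAR), V2 is the false rejection rate (FRR) and V is the HTER, defined in the usual way as their average. The function name `far_frr_hter` and the example counts are hypothetical.

```python
def far_frr_hter(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float, float]:
    """Return (FAR, FRR, HTER) from the four confusion-matrix counts."""
    far = fp / (fp + tn)  # share of negatives that were accepted
    frr = fn / (fn + tp)  # share of positives that were rejected
    hter = (far + frr) / 2  # Half Total Error Rate: average of the two
    return far, frr, hter


# Invented counts: 50 positives (40 accepted), 100 negatives (10 accepted).
far, frr, hter = far_frr_hter(tp=40, fn=10, tn=90, fp=10)
print(f"FAR={far:.3f} FRR={frr:.3f} HTER={hter:.3f}")
```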

9

What is a ROC curve?

ROC
• Abbreviation for “Receiver Operating Characteristics”.
• Technique for visualizing, organizing and selecting classifiers based on their performance.
• A ROC can be presented either as a graph or as a curve.

Classifiers
• Discrete classifiers (decision trees, rule sets, etc.)
• Probabilistic classifiers (Naive Bayes, neural networks, etc.)
• Varying the threshold of a probabilistic classifier traces a curve (the ROC).

The following example shows this.
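Alongside the slide example, the threshold sweep can also be sketched in Python. The sketch assumes the common choice of plotting the false positive rate against the true positive rate; the function name `roc_points` and the toy data are illustrative and do not come from the paper.

```python
from typing import Sequence


def roc_points(scores: Sequence[float],
               labels: Sequence[bool]) -> list[tuple[float, float]]:
    """Return one (FPR, TPR) point per threshold, from strictest to loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # One threshold above every score, then one at each distinct score value.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data, invented for illustration only.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Each lowering of the threshold can only accept more instances, so the points move monotonically from (0, 0) towards (1, 1).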

10

Example: calculating a ROC curve by moving the threshold step by step (slides 10 to 17; figures not reproduced)

18

ROC curves

• BEP = Break-Even Point

• The BEP corresponds to the threshold nearest to a solution such that V1 = V2.

• The selected threshold has a significant impact on the model.

• The threshold represents a trade-off between giving importance to V1 or to V2.
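A minimal sketch of a BEP search, assuming V1 = FAR and V2 = FRR and trying every observed score as a candidate threshold. The helper names `error_rates` and `break_even_threshold` and the toy data are hypothetical.

```python
from typing import Sequence


def error_rates(scores: Sequence[float], labels: Sequence[bool],
                threshold: float) -> tuple[float, float]:
    """Return (FAR, FRR); assumes score >= threshold means 'accept'."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    return fp / n_neg, fn / n_pos


def break_even_threshold(scores: Sequence[float],
                         labels: Sequence[bool]) -> float:
    """Pick the candidate threshold where FAR and FRR are closest to equal."""
    def gap(threshold: float) -> float:
        far, frr = error_rates(scores, labels, threshold)
        return abs(far - frr)

    return min(sorted(set(scores)), key=gap)


# Toy data, invented for illustration only.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
bep = break_even_threshold(scores, labels)
print(bep, error_rates(scores, labels, bep))
```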

19

Potential risk of using ROC

• Each point corresponds to a particular setting of the threshold. But in “real applications” the threshold needs to be decided before seeing the test set.
• Normally the threshold is found by searching for the BEP using some equation.
• Possibility of mismatch because the training set is different from the test set.
• Situations may occur where the optimal threshold found using the training set doesn’t correspond to the optimal threshold on the test set.
• One parameter, the threshold, is tuned using the training set. It is risky to expect that the training error reflects the general error.
• “Real applications often suffer from an additional mismatch between training and test conditions”.
• Risk of a different trade-off (V1, V2) in the test set.
• ROC curves do not take the risk of a mismatch into account. This probability should be reflected in the procedure when calculating the performance curve.

20

Potential risk of using ROC

ROCs of two real models for a Text-Independent Speaker Verification task.

Looking at the curves only, model B seems to be better than model A.

Looking at the thresholds, A is actually the better model.

21

Expected performance curve

• The EPC presents a range of possible expected performances on the test set.
• The calculation takes the possible mismatch into account while estimating the desired threshold.
• A parameter α (alpha) is used to estimate the possible mismatch of the threshold.

Framework:

Parametric performance measure: C(V1(θ, D), V2(θ, D); α)
Depends on: the parameter α, and V1 and V2 computed on some data D for the threshold θ.

Example:
C(V1(θ, D), V2(θ, D); α) = C(Precision(θ, D), Recall(θ, D); α) = −(α · Precision(θ, D) + (1 − α) · Recall(θ, D))

Procedure:
Vary α inside a reasonable range and, for each α, estimate the θ that minimizes C(·, ·; α) on a development set; then use the obtained θ to compute V on the test set. Finally, plot V with respect to α.
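The procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes V1 = FAR and V2 = FRR, uses the criterion C = α · FAR + (1 − α) · FRR on the development set, and reports V = HTER on the test set. All function names and data are made up for the example.

```python
from typing import Sequence


def error_rates(scores: Sequence[float], labels: Sequence[bool],
                threshold: float) -> tuple[float, float]:
    """Return (FAR, FRR); assumes score >= threshold means 'accept'."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    return fp / n_neg, fn / n_pos


def select_threshold(dev_scores: Sequence[float], dev_labels: Sequence[bool],
                     alpha: float) -> float:
    """Threshold minimizing alpha*FAR + (1-alpha)*FRR on the development set."""
    candidates = sorted(set(dev_scores)) + [float("inf")]

    def cost(threshold: float) -> float:
        far, frr = error_rates(dev_scores, dev_labels, threshold)
        return alpha * far + (1 - alpha) * frr

    return min(candidates, key=cost)


def expected_performance_curve(dev_scores, dev_labels, test_scores,
                               test_labels, alphas):
    """Return (alpha, test HTER) pairs; thresholds never see the test set."""
    curve = []
    for alpha in alphas:
        theta = select_threshold(dev_scores, dev_labels, alpha)
        far, frr = error_rates(test_scores, test_labels, theta)
        curve.append((alpha, (far + frr) / 2))
    return curve


# Toy development and test sets, invented for illustration only.
dev_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
dev_labels = [True, True, False, True, False, False]
test_scores = [0.95, 0.7, 0.55, 0.5, 0.35, 0.2]
test_labels = [True, True, False, True, False, False]
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
curve = expected_performance_curve(dev_scores, dev_labels,
                                   test_scores, test_labels, alphas)
for alpha, hter in curve:
    print(f"alpha={alpha:.2f} HTER={hter:.3f}")
```

The key design point is that the test labels are used only to score a threshold that was already fixed on the development data, which is what distinguishes the EPC from reading a threshold off a test-set ROC.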

22

EPC Algorithm

23

Example: continuing the example to calculate an EPC (slides 23 to 27; figures not reproduced)

28

Example of a typical EPC

α > 0.5: more importance to false acceptance errors

α < 0.5: more importance to false rejection errors

29

EPC in real applications

Expected Performance Curves for person authentication, where one wants to trade off false acceptance rates against false rejection rates.

Expected Performance Curves for text categorization, where one wants to trade off precision against recall and plot the F1 measure.

30

Confidence Interval

Confidence intervals are used to indicate the reliability of an estimate.
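As a worked illustration, one way to attach an interval to a point of the curve when V is the HTER. This is a sketch under assumptions the slide does not state: FAR and FRR are treated as independent binomial proportions, estimated on N_I negative (impostor) and N_C positive (client) test examples, and a normal approximation is used. The symbols N_I and N_C are introduced here for the example; the paper's own formulation may differ.

```latex
\mathrm{HTER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{2},
\qquad
\sigma = \sqrt{\frac{\mathrm{FAR}\,(1 - \mathrm{FAR})}{4\,N_I}
             + \frac{\mathrm{FRR}\,(1 - \mathrm{FRR})}{4\,N_C}},
\qquad
\text{95\% interval: } \mathrm{HTER} \pm 1.96\,\sigma
```

The factor 4 in each denominator comes from the 1/2 weight on each rate: halving a quantity divides its variance by four.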

31

My opinion

• The authors have a point and the idea is good.
• Good for comparing models…
• …but hard to read much from an EPC; a ROC is more informative.
• Cumbersome to compute an EPC.
• Useful… maybe? Apparently only used by the authors?

32

End of Line

Questions
Discussion
