

A Less Biased Evaluation of Out-of-distribution Sample Detectors

Alireza Shafaei (http://www.shafaei.ca)
Mark Schmidt (http://cs.ubc.ca/~schmidtm)
James J. Little (http://cs.ubc.ca/~little)

Computer Science Department, University of British Columbia, Vancouver, Canada

Abstract

In the real world, a learning system could receive an input that is unlike anything it has seen during training. Unfortunately, out-of-distribution samples can lead to unpredictable behaviour. We need to know whether any given input belongs to the population distribution of the training/evaluation data to prevent unpredictable behaviour in deployed systems. A recent surge of interest in this problem has led to the development of sophisticated techniques in the deep learning literature. However, due to the absence of a standard problem definition or an exhaustive evaluation, it is not evident if we can rely on these methods. What makes this problem different from a typical supervised learning setting is that the distribution of outliers used in training may not be the same as the distribution of outliers encountered in the application. Classical approaches that learn inliers vs. outliers with only two datasets can yield optimistic results. We introduce OD-test, a three-dataset evaluation scheme, as a more reliable strategy to assess progress on this problem. We present an exhaustive evaluation of a broad set of methods from related areas on image classification tasks. Contrary to the existing results, we show that for realistic applications of high-dimensional images the previous techniques have low accuracy and are not reliable in practice.

1 Introduction

If we present a natural image of an unknown class to a currently popular deep neural network that is trained to discriminate ImageNet [32] classes, we will get a prediction with a high (softmax) probability of an arbitrary class (see Fig. 1). With English-speaking phone assistants, if we talk in another language, they will generate an English sentence that most often is not even remotely similar to what we said. The silent failure of these systems is due to an implicit assumption: the input to the ImageNet classifier will be from the same ImageNet distribution, and the user will be speaking in English. However, in practice, any automation pipeline that involves a deep neural network faces a critical challenge:

Can we trust the output of a neural network for a particular input?

© 2019. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.


Figure 1: The predictions of several popular networks [14, 17, 18, 23, 37] that are trained on ImageNet, on unseen data. The red predictions are entirely wrong, the green predictions are justifiable, and the orange predictions are less justifiable. The middle image is Gaussian noise. We show that thresholding the output probability is not a reliable defence.

One solution is to add a None class to the models to account for the absence of the other classes. The first challenge is defining the None class. Would None mean all other image vectors or only vectors of natural images? Another non-trivial challenge is capturing the diversity of the None class with a finite sample set to use for training. The third and most prohibitive problem is that we would have to significantly increase the complexity of our models to capture the diversity of the None class. Despite these challenges, we might be able to achieve reasonable results on low-dimensional problems [16], but as the input dimension grows, the problem becomes more severe. Our argument rests on the assumption that learning a reliable None class is not a viable strategy for practical applications.

In supervised learning we typically assume the samples are independently and identically distributed (IID) – we expect the train and test examples to be drawn from a fixed population distribution. However, this condition cannot be easily enforced in deployed applications. When the IID assumption is not satisfied, the empirical error no longer predicts performance. Without any assumptions, the outputs on out-of-distribution (OOD) samples can be arbitrarily bad. The OOD samples violate the identically distributed (ID) assumption. Thus, as long as we rely on the empirical error alone to train and evaluate deep neural networks, the first condition for making a reliable prediction (with bounded error) on any input is whether the ID assumption is satisfied. When we expect a priori that the underlying distribution will change, the traditional applicable frameworks are transfer learning, multi-task learning, and zero-shot learning. In our setting, we only wish to detect out-of-distribution samples.

In real-life deployment of products that use complex machinery such as deep neural networks (DNNs), we have very little control over the input. In the absence of extrapolation guarantees, when the IID assumption is violated, the behaviour of the pipeline may be unpredictable. A reliable pipeline would first determine whether it can process a given sample, and only then use the prediction of the target neural network. Successful detection of such violations could also be used in active learning, unsupervised learning, and learning with noisy data, or simply as a condition for invoking transfer learning strategies.

There has been a recent surge of interest specifically in OOD sample detection with deep neural networks [2, 15, 16, 24, 28, 34]. However, the commonly applied evaluation mechanisms are susceptible to overly optimistic results and cannot provide conclusive evidence of reliability (we demonstrate this). There is a pressing need for a standard problem definition and a benchmark that is both realistic and principled, so that previous and future approaches can be compared reliably.


Our contributions are as follows:

1. We introduce OD-test: a more realistic formulation of the OOD detection task.
2. We present a new benchmark to evaluate the existing techniques for OOD detection.
3. We provide an analysis of several previously proposed methods under OD-test.
4. We release a PyTorch [30] package to replicate all the results (https://github.com/ashafaei/OD-test).

We demonstrate that the performance of current techniques quickly approaches the random-prediction baseline as we make the transition to realistic high-dimensional images.

2 Related Work

The violation of the ID assumption is not the only way to wreak havoc on deep learning pipelines. Adversarial-example [38] attacks are crafted signals in otherwise innocent-looking images that fool neural networks into misclassification. While OOD detection is a model-independent problem, adversarial images exploit the inductive bias of the model families. We limit our attention to the OOD detection problem.

The Uncertainty View  An uncertainty measure could be directly applied to reject OOD samples, as we would expect the uncertainty to be high on such inputs. The MC-Dropout [8] approach is a feasible uncertainty estimation method for a variety of applications [8, 9, 19]. Lakshminarayanan et al. [24] show that an ensemble of five neural networks (DeepEnsemble) trained with an adversarial-sample-augmented loss is sufficient to provide a measure of predictive uncertainty. We evaluate DeepEnsemble and MC-Dropout.

The Abstention View  If we abstain from prediction at the cost of a penalty, we end up with the abstention view. A reject function makes the abstention choice; it could be a threshold on the magnitude of the prediction [1] or chosen from a reject-hypothesis set [5, 6]. The reject function is chosen simultaneously with the predictive function through an abstention-augmented loss. The reliability of such a reject function is contingent on evaluation on a fixed distribution. If we encounter an OOD sample, we do not know for sure whether it would be rejected. Our formulation of the OOD detection task is similar to the abstention view, with key differences that we will discuss in Section 3. We show that the prior work on OOD detection can be reduced to an abstract problem of choosing a reject function from a specific space.

The Anomaly View  Density-estimation-based techniques assume that low-measure samples are outliers. These approaches tend to work well mostly within low-dimensional or well-defined distributions. PixelCNN++ [33] is an auto-regressive model with a tractable likelihood that could be used within a density estimation scheme. Note that density estimation is not equivalent to binary OOD detection: a perfect density estimator can solve the outlier detection problem, but a perfect outlier detector does not necessarily have the information needed to solve the density estimation problem. Proximity-based methods use a distance measure and the train data to flag anomalies. A simple strategy is based on the K-nearest neighbours of a given input – we call this K-NNSVM. Clustering methods reject points that do not conform to any of the identified clusters. The one-class SVM [35] with a radial basis function learns a conservative region that covers the train data. Goldstein and Uchida [10] show that the proximity-based approaches are empirically the most effective outlier detectors over a range of datasets. Reconstruction-based methods learn to reconstruct the train data, then try to reconstruct each given input. Samples that cannot be reconstructed well are then flagged as anomalies. We use an autoencoder with a reconstruction threshold to test this idea (AEThreshold).


The Novelty View  Open-set recognition and novelty detection study the detection of anomalies at a semantic level. These methods are typically concerned with the recognition of unseen classes, e.g., new objects in the scene. This is a special case of OOD detection where the OOD samples explicitly differ by their semantic content. However, the notion of OOD is more granular: an unseen viewpoint of a specific object violates the ID assumption, but it does not necessarily constitute a novelty. The notion of novelty is often underspecified in practice, and results are limited to particular assumptions and problem definitions. Bendale and Boult [2] present OpenMax, a replacement for the softmax layer that detects unknown classes through evaluation against a representative neural activation of each class.

Deep Learning Literature  Guo et al. [13] observed that neural networks tend to be over-confident in their predictions. They show that temperature scaling is an effective calibration strategy for neural networks. Hendrycks and Gimpel [15] show that it is possible to detect OOD samples by thresholding the softmax probabilities. More recently, Liang et al. [28] presented ODIN, a method based on temperature rescaling and input perturbation to detect OOD samples. Further extensions rely on statistics of hidden representations [27, 31], construct classifier ensembles with subsets of data [39], or perform regression on word embeddings [36].

There are also ideas that rely on GANs [7, 21, 26, 34] to detect anomalies or novelty in the data. To the best of our knowledge, training GANs [11] is an active area of research, and it is not apparent what design decisions would be appropriate to implement these ideas in practice. We are therefore unable to evaluate these ideas fairly at this time.

All the previous studies primarily focus on the low-dimensional MNIST [25], SVHN [29], and CIFAR [22] datasets. We evaluate several previously proposed solutions in controlled experiments on datasets with varying complexity. We show that, in such low-dimensional spaces, simple anomaly detection methods work just as well, thus stressing that a more comprehensive evaluation is necessary for assessments of the current and future work.

3 OD-test: A Less Biased Evaluation of Outlier Detectors

Let us define the source distribution Ds to be our input distribution. The objective is to decide whether a given sample belongs to Ds. We define a reject function r : X → {0,1} that makes this binary decision. Note that this decision can be made independently from the ultimate prediction task. While in the abstention view we reject the samples that the predictive function is likely to mislabel, here we reject the samples that do not belong to the source distribution Ds, hence decoupling the reject function from the predictive function.

If the reject function flags an input, then the sample does not belong to the source distribution; thus, the output of the pipeline may not be reliable. On the other hand, if the function accepts an input, we can continue the pipeline under the ID assumption. This form of rejection is more general than the previous work. In addition to the previous methods, we can study new approaches that operate in the input space directly (e.g., K-NNSVM, AEThreshold).

The r function is a binary classifier; the classes are in-distribution vs. out-of-distribution. To learn the classifier, the standard approach is to adopt the supervised learning assumptions and use an outlier dataset Dv. In that scenario, what would happen if an outlier that we encounter is not represented by Dv? We end up with the original problem again. The traditional supervised outlier detection may overestimate our ability to detect outliers. A high accuracy in this scenario may not translate into an accurate model in practice, in the many settings where the outliers do not look like samples from Dv.

[Figure 2a: schematic of the OD-test steps (1-4) and the train/valid/test splits used at each step.]

Dataset summary (Figure 2b):

Dataset         Train     Valid    Test     All       dim(X)    |Y|
MNIST           (50 k     10 k)    10 k     70 k      784       10
FashionMNIST    (50 k     10 k)    10 k     70 k      784       10
NotMNIST        -         -        -        18.6 k    784       10
CIFAR10         (40 k     10 k)    10 k     60 k      3072      10
CIFAR100        (40 k     10 k)    10 k     60 k      3072      100
TinyImagenet    100 k     10 k     10 k     110 k     12,288    200
STL10           5 k       (4 k     4 k)     13 k      27,648    10

Figure 2: The OD-test. a.1) A set of distributions visualized as shapes within X. The source distribution Ds is identified in the image; everything else is an outlier. a.2) We pick one validation distribution Dv and learn a binary reject function r that partitions the input space X based on Ds and Dv only. a.3) We evaluate r on other distributions (as Dt) and measure the accuracy. a.4) The dataset splits for each step. b) A summary of the datasets. The datasets are split as indicated by the parentheses.

The actual OOD samples are beyond our direct reach, and our models can easily overfit to distinguishing Ds from Dv (we verify this empirically, see Fig. 3). We present a less optimistic evaluation framework that prevents scoring high through overfitting.

Specifically, we introduce a third "target" distribution Dt to measure whether a method can actually detect outliers that are not only outside of Ds but that also might be outside of Dv. The idea is to treat the problem as a binary classification between three different datasets. Similar to the supervised outlier detection, we begin by learning a reject function to distinguish Ds from Dv. For evaluation, we use a third unseen distribution Dt instead of Dv (see Fig. 2a). Dt represents OOD examples that were not encountered during training – a more realistic evaluation setting for uncontrolled scenarios. Ideally, Dv and Dt should have no similarity. In practice, we recommend choosing a collection of datasets with no label overlap. In our experiments, we cycle through all choices of outliers (Dv, Dt from Tab. 1b) and average the results. The pseudocode of the evaluation procedure is outlined in Alg. 1.

Algorithm 1: OD-test
input: Ds = (Dtrain_s, Dvalid_s, Dtest_s), the source dataset.
input: D = {Di}, the outlier set.
input: M : D → R, the method under evaluation.
1  begin
2      A ← {}
       /* Generate a reject-hypothesis class R using Dtrain_s. */
3      R ← M(Dtrain_s)
4      for Dv ∈ D do
           /* Find the best binary classifier in R. */
5          r ← train(R, {Dvalid_s : 0, Dv : 1})
6          for Dt ∈ D \ {Dv} do
               /* Evaluate the accuracy of r. */
7              acc ← eval(r, {Dtest_s : 0, Dt : 1})
8              add acc to A
9      return mean(A)

As an example, let us walk through the stages of evaluation for PbThreshold, the method that thresholds the maximum probability of a discriminative neural network. We need a trained deep neural network and a threshold on the maximum probability that would reject the outliers. We first train a deep neural network on Ds to discriminate the image classes (line 3). The reject-hypothesis class R returned on line 3 has a single free parameter τ, the threshold to use on the underlying classifier. We pick the optimal threshold τ in the next step. On line 5, we pick the best threshold τ to discriminate between Ds and Dv. After finding the best threshold τ, on line 7 we evaluate the learned reject function on Ds and Dt. For unsupervised OOD detection methods the evaluation is a single loop over Dt (see supplementary material).
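To make the procedure concrete, here is a minimal Python sketch of Algorithm 1 for a PbThreshold-style method. It is illustrative only: the helper max_softmax_scores (a forward pass that returns the maximum class probability per sample) and the plain NumPy handling of the splits are assumptions for this sketch, not part of the released package, and the 50/50 weighting mirrors the equalized classes described below.

import numpy as np

def best_threshold(scores_in, scores_out):
    # Pick the threshold that best separates Dvalid_s (label 0) from Dv (label 1).
    # A low maximum softmax probability is treated as evidence of an outlier.
    candidates = np.concatenate([scores_in, scores_out])
    best_tau, best_acc = candidates[0], 0.0
    for tau in candidates:
        acc = 0.5 * np.mean(scores_in >= tau) + 0.5 * np.mean(scores_out < tau)
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau

def od_test(classifier, d_valid_s, d_test_s, outlier_sets, max_softmax_scores):
    # OD-test outer loop: tune on (Dvalid_s, Dv), evaluate on (Dtest_s, Dt != Dv).
    accuracies = []
    s_valid = max_softmax_scores(classifier, d_valid_s)
    s_test = max_softmax_scores(classifier, d_test_s)
    scores = {name: max_softmax_scores(classifier, data) for name, data in outlier_sets.items()}
    for dv_name in outlier_sets:
        tau = best_threshold(s_valid, scores[dv_name])        # line 5 of Alg. 1
        for dt_name in outlier_sets:
            if dt_name == dv_name:
                continue
            acc = 0.5 * np.mean(s_test >= tau) + 0.5 * np.mean(scores[dt_name] < tau)
            accuracies.append(acc)                            # lines 7-8 of Alg. 1
    return float(np.mean(accuracies))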

Note that to score high on OD-test we do not need to perform density estimation. However, if we can perform density estimation well, we can score high on OD-test too. If the method M successfully learns a density function for Ds, we only have to pick a single threshold to reject OOD samples. Methods that yield a confidence (or uncertainty) estimate can be used similarly. A binary classifier trained with the traditional supervised learning approaches is also not sufficient to score high on this benchmark, since we change the second distribution during the test stage. In Section 4 we show how a traditional binary classifier falls short. The methods that learn conservative boundaries around Ds will have a higher chance of success. All the existing approaches can be implemented within this framework.

We equalize the binary classes and only measure accuracy. Furthermore, we require the methods to pick the optimal parameters, such as the threshold. These choices simplify aggregation, analysis, and comparison of the results. We can meaningfully average over multiple experiments and robustly compare methods in a variety of conditions. Similar to the supervised learning regime, we can incorporate prior knowledge on the abundance of, and the risk associated with, OOD samples into the evaluation by modifying the implicit objective on lines 5 and 7. We leave the proper choice of application-dependent schemes to the practitioner and focus on assessing the discriminative power of the methods in the fixed 50/50 scenario without any prior knowledge. Through this constraint, we ensure that methods that rely on the prior likelihood cannot perform better than random prediction.

We extend the previous work by evaluating over a broader set of datasets with varying levels of complexity. The variation in complexity allows for a fine-grained evaluation of the techniques. Since OOD detection is closely related to the problem of density estimation, the dimensionality of the input image is of vital importance in practical assessments. As the input dimensionality increases, we expect the task to become much more difficult. Therefore, to provide a more accurate picture of performance, it is crucial to evaluate the methods on high-dimensional data. Table 1b summarizes the datasets that we use. We also evaluate with uniform and Gaussian noise as outliers. We evaluate the following methods:

• BinClass. A traditional binary classifier that is directly trained on Ds vs. Dv.
• PbThreshold. A threshold on the softmax.
• ScoreSVM. An SVM [4] classifier on the logits.
• LogisticSVM. Similar to ScoreSVM, but the underlying classifier is trained with logistic loss.
• ODIN. A threshold on the scaled softmax outputs of the perturbed input.
• K-NNSVM. A linear SVM on the sorted Euclidean distances between the input and the k-nearest training samples. Note that a threshold on the average distance is a special case of K-NNSVM.
• AEThreshold. A threshold on the autoencoder (AE) reconstruction error of the given input. We train AEs with binary cross-entropy (BCE) and mean squared error (MSE).


• K-MNNSVM, K-BNNSVM. A K-NNSVM applied on the learned hidden representations of an autoencoder trained with MSE or BCE.
• K-VNNSVM. Similar to the previous, except we use the learned representation of a variational autoencoder [20].
• MC-Dropout. A threshold on the entropy of the average prediction over 7 evaluations per input.
• DeepEnsemble. Similar to MC-Dropout, except we average over the predictions of 5 networks that are trained independently with an adversarially-augmented loss.
• PixelCNN++. A threshold on the log-likelihood of each input.
• OpenMax. Similar to ScoreSVM, but we use the calibrated output of the OpenMax module, which also includes a probability for an unknown class.

Figure 3: Evaluation with two datasets versus OD-test. Evaluating OOD detectors with only two distributions can be misleading for practical applications. The error bars are the 95% confidence level. The two-dataset evaluations are over all possible pairs of datasets (n = 46), whereas the OD-test evaluations are over all possible triplets (n = 308).

We use two generic architectures, VGG-16 [37] and Resnet-50 [14], and reuse the same base classifier for all the methods to provide a fair and meaningful comparison. We train these architectures with the cross-entropy loss (CE) and the k-way logistic loss (KWL). The CE loss enforces mutual exclusion in the predictions while the KWL loss does not. We test these two loss functions to see if the exclusivity assumption of CE hurts the ability to predict OOD samples. CE loss cannot make a None prediction without an explicitly defined None class, but KWL loss can make None predictions through low activations of all the classes.
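The following is a minimal sketch contrasting the two training losses, assuming a logits tensor of shape (batch, num_classes) and integer targets; the function names are ours. With the KWL loss each class gets an independent sigmoid, so all activations can stay low for a None-like input, whereas softmax CE always distributes probability mass over the known classes.

import torch
import torch.nn.functional as F

def ce_loss(logits, targets):
    # Mutually exclusive classes: softmax plus negative log-likelihood.
    return F.cross_entropy(logits, targets)

def kwl_loss(logits, targets):
    # One independent binary (logistic) classifier per class.
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    return F.binary_cross_entropy_with_logits(logits, one_hot)

logits = torch.randn(4, 10)
targets = torch.tensor([0, 3, 7, 9])
print(ce_loss(logits, targets).item(), kwl_loss(logits, targets).item())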

Note that our formulation of the problem separates the target task from OOD sample detection. Thus, it is plausible to use, for OOD detection, a different predictive model from the actual predictive model of the target problem if there is an advantage. We tune the hyper-parameters of these methods following the best practices and the published guidelines in the respective articles. The implementation details, a discussion of the evaluation cost, and the performance statistics of the above methods are in the supplementary material. The PyTorch implementation with the pre-trained models and all the numerical results is available at https://github.com/ashafaei/OD-test.

4 Results and Discussion

We evaluate the methods in a controlled regime against the datasets in Tab. 1b. We run over 10,000 experiments on all combinations of Ds, Dv, and Dt.


[Figure 4: bar chart of the 34 method variants sorted by mean test accuracy.]

Figure 4: The average test accuracy of the OOD detection methods over 308 experiments/method with 95% confidence level. /VGG or /Res indicates the backing network architecture. #-NN. is the number of nearest neighbours. A random prediction would have an accuracy of 0.5.

First we analyze the aggregated results, then we look at the breakdown of the accuracy per source dataset. Figure 3 and Figure 4 show the average accuracy. Each method under OD-test is tested over the same set of 308 experiments consisting of all compatible triplets of datasets. The two-dataset evaluations are averaged over 46 experiments (all possible pairs). See the project page for the list of experiments.

Figure 3 compares the mean test accuracy of the methods within OD-test and within the two-dataset setting. Methods that perform well in distinguishing two datasets fail when a third dataset is introduced. The gap in relative performance within each evaluation highlights the importance of having a more realistic assessment in practice. The general trend of the evaluation indicates that the methods that have a higher degree of freedom for OOD detection (the space characterized by R) are the most susceptible to overestimating accuracy within the traditional evaluation approach. This observation is consistent with the bias-variance tradeoff in learning theory.

BinClass, the direct binary classifier, consistently achieves near-perfect accuracy on the train and validation splits of the same datasets (see Fig. 3); however, once tested on a third dataset that it has not seen before, the average accuracy drops to 68% for both VGG and Resnet. This demonstrates why we need to adopt a new evaluation scheme. The classifier overfits to the two distributions (Ds and Dv), but it cannot distinguish a third distribution (Dt). Because of the diversity of OOD samples, we may always encounter new input that we have not seen before.

MC-Dropout and DeepEnsemble, the two uncertainty techniques, do not seem to provide a strong enough signal to distinguish the two classes compared to the simpler ScoreSVM. Interestingly, MC-Dropout has a higher accuracy than DeepEnsemble. Considering the training cost of DeepEnsemble, MC-Dropout is a more favourable choice.

VGG-backed and Resnet-backed methods significantly differ in accuracy. The gap indicates the sensitivity of the methods to the underlying networks. PbThreshold, ScoreSVM, and ODIN all prefer VGG over Resnet, even though Resnet networks outperform the VGG variants in image classification. This means that the image classification accuracy may not be the only relevant factor in the performance of these methods. ODIN is less sensitive to the underlying network. Furthermore, training the networks with KWL loss consistently reduces the accuracy of the OOD detection methods on average. ScoreSVM/VGG and ScoreSVM/Res outperform LogisticSVM/VGG and LogisticSVM/Res respectively. Similarly, the autoencoders that were trained with BCE loss (AEThre./BCE) outperform the ones trained with MSE loss (AEThre./MSE). Note that we are comparing identical architectures.


[Figure 5: bar chart of mean test accuracy per source dataset Ds (MNIST, FashionMNIST, CIFAR10, CIFAR100, TinyImagenet, STL10) for each method, with the random baseline marked.]

Figure 5: The test accuracy over 50 experiments/bar with 95% confidence level.

Within the nearest-neighbour methods, #-(X)NNSVM, the number of nearest neighbours does not significantly impact the accuracy on average. However, performing the nearest-neighbour search in the input space directly outperforms nearest-neighbour search in the latent representations of autoencoders (BNNSVM and MNNSVM) and the VAE (VNNSVM). Interestingly, 1-NNSVM has a higher accuracy on average than thresholding the probability (PbThresh) and DeepEnsemble within both OD-test and the two-dataset evaluation (Fig. 3). For #-NNSVM, if the reference samples fit in GPU memory, a naive implementation can be faster than a forward pass on the neural networks of large datasets like TinyImagenet.

PixelCNN++, the method that estimates the log-likelihood, has a surprisingly low accuracy on this problem on average. We suspect the auto-regressive nature of the model, specifically when coupled with the IID assumption, may be the reason for its failure. The network approximates the likelihood only in the vicinity of the training data. If we evaluate the model on points that are far from the training data, the estimates are no longer reliable.

Figure 5 shows the average test accuracy across each source dataset Ds. For the full figure, see the supplementary material. Our quantification of performance shows that all the methods have a much lower accuracy on high-dimensional data than on low-dimensional data.

In low-dimensional datasets, K-NNSVM performs similarly to or better than the other methods. In the high-dimensional case, however, its accuracy approaches the random baseline quickly. Interestingly, K-NNSVM performs better on STL10 (96×96) than on TinyImagenet (64×64), which might be due to the higher diversity in TinyImagenet compared to STL10. Given the high accuracy of K-NNSVM in 784 dimensions, it might be feasible to learn an embedding for high-dimensional data, or learn a kernel function, to replace the original image space and enjoy a high accuracy. Going from CIFAR10 to CIFAR100, the dimensionality and the dataset size remain the same; the only changing factor is the diversity of the data, and that seems to make the problem as difficult as the higher-dimensional datasets. Except for K-NNSVM, the accuracy of the other methods drops significantly in this transition.

AEThreshold has near-perfect accuracy on MNIST; however, the performance drops quickly on complex datasets. AEThreshold also outperforms the density estimation method PixelCNN++ on all datasets. We did not explore autoencoder architectures – more research on better architectures or reconstruction constraints for AEThreshold may potentially have a high pay-off. The top-performing method, ODIN, is influenced by the number of classes in the dataset. Similar to PbThreshold, ODIN depends on the maximum signal in the class predictions; therefore, the increased number of classes directly affects both methods. Furthermore, neither of them consistently prefers VGG over Resnet within all datasets. Overall, ODIN consistently outperforms the others in high-dimensional settings, but all the methods have a relatively low average accuracy in the 60%-78% range.

Overall, we can summarize the main observations as follows: (i) Outlier detection with two datasets yielded overly optimistic results with VGG and Resnet. (ii) The MC-Dropout and DeepEnsemble uncertainty methods were not reliable enough for OOD detection. (iii) A more accurate image classifier did not lead to a more accurate outlier detector on average. (iv) The nearest-neighbour methods were competitive in low-dimensional settings, both in computational cost and accuracy. (v) The latent representations of vanilla (variational) autoencoders were not useful for this task when combined with nearest-neighbour methods. (vi) The state-of-the-art auto-regressive density estimation method had a surprisingly low accuracy, performing worse than the random prediction baseline in some settings. (vii) Although ODIN outperforms the other methods in realistic high-dimensional settings, its average accuracy is still below 80%.

To perform supervised OOD sample detection in practice, we have to pick a method and choose a training outlier set Dv. Assuming that Dv may not represent the full spectrum of anomalies, we should pick methods that do not overfit to Dv. OD-test tells us which methods are less likely to overfit to a chosen Dv and are therefore more reliable in the face of an unseen OOD sample. Our results show that a two-dataset evaluation scheme can be too optimistic in identifying the best available method. OD-test is a more realistic evaluation of OOD sample detectors. In practice, the outlier set Dv should contain the largest variety of anomalies that we can use, and the method should be the one that is more accurate and less likely to overfit.

5 Conclusion

By detecting OOD samples (the outliers), we can ensure that deep learning pipelines operate as expected. We assume that collecting a representative set of outliers for a None prediction is not always practical. Therefore, we prefer methods that detect unseen outliers (the unknown unknowns) over methods that only detect previously seen outliers (the known unknowns). To assess how well methods cope with unknown unknowns, we introduced a third dataset in the evaluation of outlier detectors. Furthermore, we averaged the accuracy across all combinations of outlier datasets to reduce the measurement bias. We called this evaluation strategy OD-test. OD-test is a new formulation of the problem that provides a more realistic assessment of OOD detection methods. We also presented a new benchmark for OOD sample detection within image classification pipelines based on OD-test. We showed that the traditional supervised learning approach to OOD detection does not always yield reliable results – the previous assessments of OOD detectors are too optimistic to be practical in many scenarios. We presented a comprehensive evaluation of a diverse set of approaches across a wide variety of datasets for the OOD detection problem. Furthermore, we showed that none of the methods is suitable out-of-the-box for high-dimensional images. We release the open-source PyTorch project with the pre-trained models to replicate the results of our study. We invite the community to tackle the challenges outlined in this work.

Acknowledgments  We would like to thank the reviewers for their helpful comments. We gratefully acknowledge the support of NVIDIA Corporation through the donation of the GPUs used for this research. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institute for Advanced Research (CIFAR) Learning in Machines and Brains program.


References

[1] Peter L. Bartlett and Marten H. Wegkamp. Classification with a Reject Option using a Hinge Loss. JMLR, 2008.
[2] Abhijit Bendale and Terrance E. Boult. Towards Open Set Deep Networks. In CVPR, 2016.
[3] Adam Coates, Andrew Ng, and Honglak Lee. An Analysis of Single-layer Networks in Unsupervised Feature Learning. In AISTATS, 2011.
[4] Corinna Cortes and Vladimir Vapnik. Support-vector Networks. Machine Learning, 20(3):273–297, 1995.
[5] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with Rejection. International Conference on Algorithmic Learning Theory (ALT), 2016.
[6] Corinna Cortes, Giulia DeSalvo, Mehryar Mohri, and Scott Yang. On-line Learning with Abstention. ArXiv e-prints, 2017.
[7] Lucas Deecke, Robert Vandermeulen, Lukas Ruff, Stephan Mandt, and Marius Kloft. Anomaly Detection with Generative Adversarial Networks, 2018.
[8] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In ICML, 2016.
[9] Yarin Gal and Zoubin Ghahramani. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In NIPS, 2016.
[10] Markus Goldstein and Seiichi Uchida. A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data. PLoS ONE, 11(4), 2016.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In NIPS, 2014.
[12] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. ICLR, 2015.
[13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In ICML, 2017.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[15] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In ICLR, 2017.
[16] Dan Hendrycks, Mantas Mazeika, and Thomas G. Dietterich. Deep Anomaly Detection with Outlier Exposure. arXiv preprint arXiv:1812.04606, 2018.
[17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely Connected Convolutional Networks. In CVPR, 2017.


[18] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size. ArXiv e-prints, 2016.
[19] Alex Kendall and Yarin Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In NIPS, 2017.
[20] Diederik P. Kingma and Max Welling. Auto-encoding Variational Bayes. In ICLR, 2014.
[21] Mark Kliger and Shachar Fleishman. Novelty Detection with GAN. ArXiv e-prints, 2018.
[22] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. 2009.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
[24] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In NIPS, 2017.
[25] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[26] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples. In ICLR, 2018.
[27] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In NeurIPS, 2018.
[28] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. In ICLR, 2018.
[29] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshops, 2011.
[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic Differentiation in PyTorch. NIPS Workshops, 2017.
[31] Igor M. Quintanilha, Roberto de M. E. Filho, José Lezama, Mauricio Delbracio, and Leonardo O. Nunes. Detecting Out-Of-Distribution Samples Using Low-Order Deep Features Statistics, 2018.
[32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[33] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. In ICLR, 2017.


[34] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In International Conference on Information Processing in Medical Imaging, 2017.
[35] Bernhard Schölkopf, Robert C. Williamson, Alex J. Smola, John Shawe-Taylor, and John C. Platt. Support Vector Method for Novelty Detection. In NIPS, 2000.
[36] Gabi Shalev, Yossi Adi, and Joseph Keshet. Out-of-distribution Detection using Multiple Semantic Label Representations. In NeurIPS, 2018.
[37] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-scale Image Recognition. ArXiv e-prints, 2014.
[38] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing Properties of Neural Networks. In ICLR, 2014.
[39] Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu, Dipankar Das, Bharat Kaul, and Theodore L. Willke. Out-of-distribution Detection using an Ensemble of Self-Supervised Leave-out Classifiers. In ECCV, 2018.
[40] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv e-prints, 2017.


6 Appendix: Formulation and Evaluation

6.1 Evaluation of Unsupervised Techniques

Algorithm 2: OD-test – the evaluation procedure for an unsupervised method M.
input: Ds = (Dtrain_s, Dvalid_s, Dtest_s), the source dataset.
input: D = {Di}, the outlier set.
input: M : D → R, the method under evaluation.
1  begin
2      A ← {}
       /* Generate a rejection hypothesis r using Dtrain_s. */
3      r ← M(Dtrain_s)
4      for Dt ∈ D do
           /* Evaluate the accuracy of r. */
5          acc ← eval(r, {Dtest_s : 0, Dt : 1})
6          add acc to A
7      return mean(A)

Algorithm 2 outlines the steps of evaluation for an unsupervised method M. Note that in this setting M returns a single binary classifier r. To make the performance of supervised and unsupervised methods comparable, we use the same splits of the datasets to guarantee fairness of evaluation. In this work, we do not evaluate any unsupervised method. This additional information is provided for clarity and completeness.

6.2 Implementation Details

For each training procedure, we randomly separate 80% and 20% of the (sub-)data for training and testing respectively. We return the model that has the highest performance on the test (sub-)subset. For classification tasks, we measure the performance by classification accuracy, while for other tasks such as AEThreshold we measure the performance through the respective loss value on the test set.

VGG, Resnet  We train two generic classifier architectures, VGG-16 and Resnet-50, on Dtrain_s to perform the corresponding classification task within the datasets. The network architectures slightly differ across datasets to account for the change in the spatial size or the number of classes. We apply our modifications to the reference implementations available in PyTorch's torchvision package. These trained networks are subsequently used in PbThreshold, ScoreSVM, ODIN, Log.SVM, and DeepEnsemble. For MC-Dropout we only use the VGG variant, as the Resnet variants do not have dropout. Table 1 shows the summary of the networks' classification accuracies on the entire Dtrain_s set.

Data Augmentation  We only allow mirroring of the images for data augmentation. We apply data augmentation on all the datasets except MNIST, for which mirror augmentation does not make sense. We explicitly instantiate mirrored samples, as opposed to implicit on-air augmentation, to ensure methods such as K-NNSVM are not at a disadvantage.


                      VGG                                      Resnet
                      CE-Accuracy  KL-Accuracy  Size           CE-Accuracy  KL-Accuracy  Size
MNIST [25]            99.89%       99.91%       19 MB          99.89%       99.91%       70 MB
FashionMNIST [40]     98.82%       98.36%       19 MB          98.75%       98.73%       70 MB
CIFAR10 [22]          97.63%       97.34%       159.8 MB       97.75%       97.51%       94.3 MB
CIFAR100 [22]         91.40%       91.87%       161.3 MB       92.05%       91.82%       95.1 MB
TinyImagenet*         69.71%       72.23%       162.9 MB       89.59%       65.95%       95.9 MB
STL10 [3]             93.62%       95.18%       201.7 MB       92.32%       93.34%       94.3 MB
Mean                  91.84%       92.48%                      95.05%       91.21%

Table 1: The classification accuracy of the trained networks on Dtrain_s using the cross-entropy (CE) and K-way logistic (KL) loss functions. In both scenarios, the prediction is the maximum activation. Note that because of the difference in training data, this table is not comparable to the state-of-the-art performance on the respective datasets.
* https://tiny-imagenet.herokuapp.com/

We do not apply any other data augmentation. We initially experimented with no data augmentation. Without any augmentation, the performance of all the methods reduces by 3-4%, but the relative ranking stays the same.

BinClass  A binary classifier that is directly trained on Ds and Dv. We use the reference Resnet or VGG architectures with an additional linear layer to transform the output of the network to a one-dimensional activation. We train the network with a binary cross-entropy loss on (Dtrain_s + Dvalid_s : 0, Dv : 1) to ensure the method has access to the same data as the other methods. The networks typically achieve near-perfect accuracy after only a few epochs.

PbThreshold [15]  One threshold parameter on top of the maximum of the softmax output. The cost of the evaluation is a single forward pass on the network. We reuse the trained reference VGG or Resnet architectures for this.
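A minimal sketch of the PbThreshold reject function, assuming model is one of the trained reference classifiers and tau is the threshold tuned on (Dvalid_s, Dv); the function name is ours.

import torch

@torch.no_grad()
def pb_threshold_reject(model, x, tau):
    probs = torch.softmax(model(x), dim=1)   # (batch, num_classes)
    max_prob, _ = probs.max(dim=1)
    return (max_prob < tau).long()           # 1 = rejected as OOD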

ScoreSVM  A natural generalization of PbThreshold is to train an SVM [4] classifier on the pre-softmax activations. The cost of evaluation is a single forward pass on the network with the additional SVM layer. We reuse the trained reference VGG or Resnet architectures for this. We set the weight-decay regularization to 1/m, where m is the size of the training set.

ODIN [28]  A threshold on the softmax outputs of the perturbed input. The cost of the evaluation is two forward passes and one backward pass. We do a grid search over ε, the perturbation step size, and γ, the temperature of the softmax operation. The range for the grid search is the same as the suggested range in [28]. We reuse the trained reference VGG or Resnet architectures for this.
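A minimal sketch of the ODIN scoring rule (temperature scaling plus input perturbation) following [28]; the default temperature and epsilon below are illustrative placeholders, since the actual values come from the grid search described above.

import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    x = x.clone().requires_grad_(True)
    logits = model(x) / temperature
    # Perturb the input to increase the softmax score of the predicted class.
    loss = F.nll_loss(F.log_softmax(logits, dim=1), logits.argmax(dim=1))
    grad = torch.autograd.grad(loss, x)[0]
    x_perturbed = (x - epsilon * grad.sign()).detach()
    with torch.no_grad():
        probs = F.softmax(model(x_perturbed) / temperature, dim=1)
    return probs.max(dim=1).values   # thresholded downstream to reject OOD inputs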

K-NNSVM  A linear SVM on the sorted Euclidean distances between the input and the k-nearest training samples. Note that a threshold on the average distance is a special case of K-NNSVM. The cost of the evaluation is finding the k-nearest neighbours in the training data. We use Dtrain_s as the reference set, and tune the parameters with (Dvalid_s, Dv).
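A minimal sketch of K-NNSVM, assuming flattened image arrays: the feature vector of each sample is its sorted Euclidean distances to the k nearest points of Dtrain_s, and a linear SVM separates Dvalid_s (label 0) from Dv (label 1) on top of those features. The function names are ours.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def knn_features(reference, x, k=1):
    # Distances to the k nearest reference samples, returned sorted ascending.
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dists, _ = nn.kneighbors(x)
    return dists

def fit_knnsvm(d_train_s, d_valid_s, d_v, k=1):
    feats = np.vstack([knn_features(d_train_s, d_valid_s, k),
                       knn_features(d_train_s, d_v, k)])
    labels = np.concatenate([np.zeros(len(d_valid_s)), np.ones(len(d_v))])
    return LinearSVC().fit(feats, labels)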


K-MNNSVM, K-BNNSVM, K-VNNSVM  The same as K-NNSVM, except we use the low-dimensional representations of an autoencoder trained with MSE or BCE, or of the VAE.

AEThreshold  A threshold on the autoencoder reconstruction error of the given input. The evaluation cost is a single forward pass on the autoencoder. We train the autoencoder on Dtrain_s and train the threshold parameters with (Dvalid_s, Dv). We use the binary cross-entropy loss with continuous targets¹ or the mean squared error to train and measure the reconstruction error of a given input. The bottleneck dimensionality varies between 32 and 1024. Our decision rule given a reconstruction error e_x for an input x is r(x) = (e_x − µ)² > τ, where τ is the threshold and µ is the center around which we are thresholding with τ. If we set µ = 0, this decision function reduces to a basic threshold operator. We found that this simple decision rule improves the final accuracy of the model. The reconstruction errors of the in-distribution samples tend to stay more or less similar, whereas the reconstruction error for OOD samples can be either too low or too high. This decision rule is meant to utilize this observation. The network architectures are procedurally generated. See https://github.com/ashafaei/OD-test for the models.
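A minimal sketch of the AEThreshold decision rule r(x) = (e_x − µ)² > τ, assuming autoencoder maps images to reconstructions in [0, 1] (a sigmoid decoder) and that mu and tau have already been tuned on (Dvalid_s, Dv); the helper name is ours.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ae_threshold_reject(autoencoder, x, mu, tau, use_bce=True):
    recon = autoencoder(x)
    if use_bce:
        err = F.binary_cross_entropy(recon, x, reduction='none')
    else:
        err = F.mse_loss(recon, x, reduction='none')
    e_x = err.flatten(1).mean(dim=1)        # one reconstruction error per sample
    return ((e_x - mu) ** 2 > tau).long()   # 1 = rejected as OOD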

MC-Dropout  A threshold on the entropy of the average prediction over 7 evaluations per input. The dropout probability is p = 0.5. This approach follows the work of [24] and Kendall and Gal [19]. We did not evaluate this approach on Resnet because the original structure does not have dropout; therefore, it is not trivial to identify where the dropouts should be located without sabotaging the performance of Resnet. We reuse the trained reference VGG architecture for this.
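A minimal sketch of the MC-Dropout score: the entropy of the average softmax over 7 stochastic forward passes with dropout kept active at test time. The helper that flips only the Dropout modules to train mode (so batch-norm layers stay in eval mode) is an implementation choice of this sketch, not a detail from the paper.

import torch

def enable_dropout(model):
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

@torch.no_grad()
def mc_dropout_entropy(model, x, passes=7):
    enable_dropout(model)
    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(passes)])
    mean_probs = probs.mean(dim=0)                          # (batch, num_classes)
    return -(mean_probs * (mean_probs + 1e-12).log()).sum(dim=1)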

DeepEnsemble  Similar to MC-Dropout, except we average over the predictions of 5 networks that are trained independently with the adversarial strategy of [24]. In this approach, we augment the original loss function with a similar loss function on the adversarially-generated examples of the same batch. The adversarial samples are generated through the fast gradient-sign method (FGSM) [12].
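A minimal sketch of an adversarially augmented training loss of this kind: the cross-entropy on the batch plus the cross-entropy on its FGSM-perturbed copy [12]. The epsilon value and the equal weighting of the two terms are assumptions of this sketch.

import torch
import torch.nn.functional as F

def adversarial_augmented_loss(model, x, y, epsilon=0.01):
    x = x.clone().requires_grad_(True)
    clean_loss = F.cross_entropy(model(x), y)
    # Fast gradient-sign method: one signed gradient step on the input.
    grad = torch.autograd.grad(clean_loss, x, retain_graph=True)[0]
    x_adv = (x + epsilon * grad.sign()).detach()
    return clean_loss + F.cross_entropy(model(x_adv), y)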

PixelCNN++  We use the implementation from https://github.com/pclucas14/pixel-cnn-pp. We train the models using Ds until they plateau on the test (sub-)subset, then learn a threshold parameter with Dv. Our models achieve 0.89 BPD for MNIST, 2.65 BPD for FashionMNIST, 2.98 BPD for CIFAR10, 3.01 BPD for CIFAR100, 2.70 BPD for TinyImagenet, and 3.59 BPD for STL10 on the test (sub-)subset. Because of the auto-regressive nature, these models are prohibitively expensive to train. The PixelCNN++ authors note that they used 8 Titan X GPUs for five days to achieve state-of-the-art performance for CIFAR10² (2.92 BPD). For TinyImagenet and STL10 we process a version downsampled to 32-pixel width to be able to train and evaluate the models. Our experiments with AEThreshold indicate that the downsampled versions of TinyImagenet and STL10 are easier problems. However, even with this simplification, PixelCNN++ does not perform up to expectations. Figure 6 shows some of the generated samples.

¹ See http://pytorch.org/docs/0.3.1/nn.html#bcewithlogitsloss.
² https://github.com/openai/pixel-cnn


Figure 6: Samples of the learned PixelCNN++ model. a) MNIST, b) FashionMNIST, c) CIFAR10, d) CIFAR100, e) TinyImagenet (downsampled), f) STL10 (downsampled).


OpenMax  This method is a replacement for the softmax layer after the training has finished. It fits a Weibull distribution on the distances of the logits from the representatives of each class to reweight the logits and provide probabilities for encountering an unknown class. The output of OpenMax is similar to softmax, except with the addition of the probability for an unknown class. We learn the MAV vectors and the Weibull distribution on Ds. We use Dv to learn the reject function on the calibrated probability outputs.

You can access all the results at https://github.com/ashafaei/OD-test, where you will find the full list of evaluations for OD-test (n = 34 × 308 = 10,472) and the two-dataset evaluation scheme (n = 22 × 46 = 1012).

7 Appendix: More Results

Figure 7 shows the average performance of all the methods per source dataset Ds.


[Figure 7: mean test accuracy of every method per source dataset Ds (MNIST, FashionMNIST, CIFAR10, CIFAR100, TinyImagenet, STL10), with the random baseline marked.]

Figure 7: The average test accuracy over 50 experiments per bar. The error bars indicate the 95% confidence level. The figure is best viewed in color.