arXiv:2006.04183v1 [cs.LG] 7 Jun 2020


Uncertainty-Aware Deep Classifiers using Generative Models

Murat Sensoy,1,2 Lance Kaplan,3 Federico Cerutti,4,5 Maryam Saleki2

1 Blue Prism AI Labs, London, UK
2 Department of Computer Science, Ozyegin University, Istanbul, Turkey
3 US Army Research Lab, Adelphi, MD 20783, USA
4 Department of Information Engineering, University of Brescia, 25123 Brescia, Italy
5 Cardiff University, Cardiff, CF10 3AT, UK
[email protected], [email protected], [email protected], [email protected]

Abstract

Deep neural networks are often ignorant about what they do not know and overconfident when they make uninformed predictions. Some recent approaches quantify classification uncertainty directly by training the model to output high uncertainty for the data samples close to class boundaries or from the outside of the training distribution. These approaches use an auxiliary data set during training to represent out-of-distribution samples. However, selection or creation of such an auxiliary data set is non-trivial, especially for high-dimensional data such as images. In this work we develop a novel neural network model that is able to express both aleatoric and epistemic uncertainty to distinguish decision boundary and out-of-distribution regions of the feature space. To this end, variational autoencoders and generative adversarial networks are incorporated to automatically generate out-of-distribution exemplars for training. Through extensive analysis, we demonstrate that the proposed approach provides better estimates of uncertainty for in- and out-of-distribution samples, and adversarial examples on well-known data sets against state-of-the-art approaches including recent Bayesian approaches for neural networks and anomaly detection methods.

Introduction

While deep learning models demonstrate remarkable generalization performance in light of the large number of parameters they exploit, they can be misleadingly overconfident when they do make mistakes. The false sense of trust these models create may have serious consequences, especially if they are used for high-risk tasks. A striking example is the misclassification of the white side of a trailer as bright sky: this caused a car operating with automated vehicle control systems to crash against a tractor-semitrailer truck near Williston, Florida, USA on 7th May 2016. The car driver died due to the sustained injury (NHTSA 2016).

There are two categories of uncertainty (Matthies 2007). Epistemic uncertainty, or model uncertainty, results from limited knowledge and could in principle be reduced: uncertain predictions for out-of-distribution samples fall into this category. Among other approaches, Bayesian deep learning methods try to estimate epistemic uncertainty by modeling the distributions of the parameter values, distributions that seldom admit closed-form representations and hence require expensive Monte Carlo sampling methods (Bishop 2006).

This is a post-refereed version of the paper published in AAAI 2020.

Figure 1: Class boundaries for models on a simple 2D classification problem (green vs red dots): (a) Standard Nets, (b) Evidential Nets, (c) Generated Points, (d) Proposed Model. Prediction confidence is depicted on a color scale from maroon (low confidence) to blue (high confidence). The generated samples are shown as blue dots in (c) along with the original data points.

Aleatoric uncertainty, or data uncertainty, is the noise inherent in the observations (e.g., label noise) or class overlap: unlike epistemic uncertainty, aleatoric uncertainty cannot be reduced by observing more data samples. For instance, having identical samples with different labels, e.g., on the class boundary, is an example of aleatoric uncertainty. Approaches such as Evidential Neural Networks (EDL) and Lightweight Probabilistic Deep Networks have recently been proposed to estimate aleatoric uncertainties in deep neural networks by directly estimating the parameters of the predictive posterior as their output (Sensoy, Kaplan, and Kandemir 2018; Gast and Roth 2018). These approaches do not require any sampling and assume minimal changes to the architecture of standard neural networks.

The aforementioned approaches may still make misleading and overconfident predictions for samples outside the training distribution. Figures 1(a), (b), and (d) demonstrate predicted class boundaries for a simple 2D classification problem of green vs red dots. Standard deterministic neural networks (Fig. 1(a)) do not decrease their prediction confidence when classifying samples around the class boundary, while EDL (Fig. 1(b)) does. However, both models have high prediction confidence when tested with out-of-distribution samples.

To avoid such overconfident predictions, other approaches such as (Malinin and Gales 2018) propose to hand-pick an auxiliary data set as the out-of-distribution samples and explicitly train the neural networks to give highly uncertain output for them. This is often infeasible in high-dimensional real-life settings, given the very large space of possibilities.

In this paper, we propose a deterministic neural network that can effectively and efficiently estimate classification uncertainty for both in- and out-of-distribution samples. Using a generative model, it synthesizes out-of-distribution samples close to the training samples, e.g., the blue dots in Fig. 1(c). Then, it trains a classifier using both the training and the generated samples. Figure 1(d) depicts the result of our approach on our toy example: our model shows higher prediction confidence only for regions close to the training samples.

Our contribution in this paper is threefold. (1) We consider the output of the neural network as the parameters of a Dirichlet distribution with uniform prior, instead of a categorical distribution over possible labels. (2) These parameters are calculated by learning an implicit density estimate for each category: each output of the network is treated as the output of a binary classifier, which learns to discriminate samples of the category from the samples of other categories and out-of-distribution samples. (3) We also propose a novel generative adversarial network, which learns to distort the training samples to automatically generate the most informative out-of-distribution samples during training, thereby overcoming the need to hand-pick an auxiliary data set as the out-of-distribution samples.

Through extensive experiments, we compare our model not only with state-of-the-art Bayesian networks and other models for uncertainty estimation, but also with recent anomaly detection models, which are specifically designed to determine out-of-distribution samples using deep neural networks. Our experiments on the MNIST and CIFAR10 data sets and adversarial examples indicate that our approach outperforms the existing approaches significantly in these tasks.

Generative Evidential Neural Networks

Our work can be considered as an extension of approaches for classification which take the output of a neural network for an input sample to estimate the parameters of a Dirichlet distribution (Sensoy, Kaplan, and Kandemir 2018; Malinin and Gales 2018; Gast and Roth 2018) for its classification. That is, the resulting Dirichlet distribution represents the likelihood of each possible categorical distribution over the labels for the classification of the sample.

More formally, the Dirichlet distribution is a probability density function (pdf) for possible values of the probability mass function (pmf) $\mathbf{p}$. It is characterized by $K$ parameters $\boldsymbol{\alpha} = [\alpha_1, \cdots, \alpha_K]$ and is given by

$$
D(\mathbf{p} \mid \boldsymbol{\alpha}) =
\begin{cases}
\dfrac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{K} p_i^{\alpha_i - 1} & \text{for } \mathbf{p} \in \mathcal{S}_K, \\[4pt]
0 & \text{otherwise},
\end{cases}
\tag{1}
$$

where $\mathcal{S}_K$ is the $K$-dimensional unit simplex and $B(\boldsymbol{\alpha})$ is the $K$-dimensional multinomial beta function (Kotz, Balakrishnan, and Johnson 2000).

In classical neural networks for classification, the softmax function is used to predict class assignment probabilities. However, it provides only a point estimate for the class probabilities of a sample and does not provide the associated uncertainty of this prediction. On the other hand, Dirichlet distributions can be used to model a probability distribution over the class probabilities. For instance, a Dirichlet distribution all of whose parameters are one, i.e., $D(\mathbf{p} \mid \langle 1, \ldots, 1 \rangle)$ or shortly $D(\mathbf{p} \mid \mathbf{1})$, represents the uniform distribution over all possible assignments of class probabilities and means total uncertainty for the classification of a sample. As the parameter referring to a specific class increases, the likelihood of probability assignments with higher values for this class also increases. For instance, $D(\mathbf{p} \mid \langle 2, \ldots, 1 \rangle)$ indicates that probability distributions placing more mass on the first class are slightly more likely, while their likelihood increases further for $D(\mathbf{p} \mid \langle 10, \ldots, 1 \rangle)$, which indicates that 8 more pieces of evidence have been observed for the assignment of the sample to the first class (Josang 2016).

The parameters of a Dirichlet distribution are associated with pseudocounts representing the number of observations or pieces of evidence for each class. Hence, the predicted Dirichlet distribution for a sample may refer to the amount of evidence observed on the training set for the assignment of the sample to the classes. If there is no evidence for the assignment, we consider a uniform prior, i.e., $D(\mathbf{p} \mid \langle 1, \ldots, 1 \rangle)$, and any evidence $e_i$ for class $i$ should be added to the relevant parameter of this prior (i.e., $\alpha_i = 1 + e_i$) to generate the predicted Dirichlet distribution for the sample. The mean and the variance of a Dirichlet distribution for the class probability $p_k$ are computed as

$$
\hat{p}_k = \frac{\alpha_k}{S} \quad \text{and} \quad \mathrm{Var}(p_k) = \frac{\alpha_k (S - \alpha_k)}{S^2 (S + 1)}, \tag{2}
$$

where $S = \sum_{i=1}^{K} \alpha_i$. After the incorporation of a number of pieces of evidence, the mean $\hat{\mathbf{p}}$ of the Dirichlet distribution represents the minimum mean square error (MMSE) estimate of the ground-truth appearance probabilities given these observations.

In this paper, we consider Dirichlet distributions with uniform prior, which means that $S \geq K$. Then, the total evidence used to update the uniform Dirichlet to the predicted Dirichlet distribution becomes $S - K$. After predicting the parameters of the Dirichlet distribution for each sample, previous approaches have used its mean, i.e., $\hat{\mathbf{p}}$, as the class assignment probabilities for decision making, while using $K/S \in [0, 1]$ as the associated uncertainty of this assignment (Sensoy, Kaplan, and Kandemir 2018). As the total evidence increases, the variance of the Dirichlet distribution decreases, and so does the uncertainty of the prediction.
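For concreteness, the following is a minimal NumPy sketch of these quantities: given a non-negative evidence vector, it forms α = e + 1 and computes the mean, variance (Eq. 2), and the K/S uncertainty mass. The evidence values in the example are illustrative only, not taken from the paper.

```python
import numpy as np

def dirichlet_summary(evidence):
    """Mean, per-class variance, and uncertainty mass for alpha = evidence + 1."""
    alpha = np.asarray(evidence, dtype=float) + 1.0   # uniform Dirichlet prior: alpha_k = e_k + 1
    K, S = alpha.size, alpha.sum()                    # S is the Dirichlet strength
    mean = alpha / S                                  # Eq. (2): expected class probabilities
    var = alpha * (S - alpha) / (S**2 * (S + 1.0))    # Eq. (2): variance of each p_k
    uncertainty = K / S                               # in [0, 1]; equals 1 when there is no evidence
    return mean, var, uncertainty

print(dirichlet_summary([0.0, 0.0, 0.0]))   # no evidence: uniform mean, uncertainty = 1
print(dirichlet_summary([9.0, 0.5, 0.5]))   # strong evidence for class 0: low uncertainty
```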

In this work, we also use the mean of the Dirichlet distribution as the predictive categorical distribution for classification. However, to be consistent with the literature in general and for benchmark comparisons, we adopt the entropy of the class probabilities as a proxy for classification uncertainty (Gal and Ghahramani 2016b; Louizos and Welling 2017).

Learning to Quantify Classification Uncertainty

Recent approaches also used the Dirichlet distribution to quantify classification uncertainty in deep neural nets. However, they failed to link the calculated Dirichlet parameters to the observations or the evidence derived from the distribution of the training set. That is why these models can still derive large amounts of evidence and become overconfident in their predictions for out-of-distribution samples.

We use ideas from implicit density models (Mohamed and Lakshminarayanan 2016) and noise-contrastive estimation (NCE) (Gutmann and Hyvarinen 2012) to derive Dirichlet parameters for samples. Let us consider a classification problem with $K$ classes and assume $P_{in}$, $P_k$, and $P_{out}$ represent respectively the data distributions of the training set, class $k$, and out-of-distribution samples, i.e., the samples that do not belong to any of the $K$ classes. A convenient way to describe the density of samples from a class $k$ is to describe it relative to the density of some other reference data. By using the same reference data for all classes in the training set, we obtain comparable quantities for their density estimates. In NCE, noisy training data (Hafner et al. 2018) is usually used as the reference; here, we generalize this to out-of-distribution samples, which may also include noisy data.

Using the dummy labels $y$, we can reformulate the ratio of the densities $P_k(x)$ and $P_{out}(x)$ for a sample $x$ as follows:

$$
\frac{P_k(x)}{P_{out}(x)} = \frac{p(x \mid y = k)}{p(x \mid y = out)} = \frac{p(y = k \mid x)}{p(y = out \mid x)} \left( \frac{1 - \pi_k}{\pi_k} \right), \tag{3}
$$

where $\pi_k$ is the marginal probability $p(y = k)$ and $(1 - \pi_k)/\pi_k$ can be approximated as the ratio of sample sizes, i.e., $n_{out}/n_k$, which is taken as one in this work for simplicity and without loss of generality.

As shown in Eq. 3, one can approximate the log density ratio $\log\big(P_k(x)/P_{out}(x)\big)$ as the logit output of a binary classifier (Mohamed and Lakshminarayanan 2016), which is trained to discriminate between the samples from $P_k$ and $P_{out}$. Let a neural network classifier $f(x \mid \theta)$ parameterized by weights $\theta$ have $K$ outputs for a given sample $x$, where each output $f_k(x \mid \theta)$ corresponds to a logit for one of the $K$ classes and approximates $\log\big(P_k(x)/P_{out}(x)\big)$. To train such a network, we use the Bernoulli (logarithmic) loss given by

$$
\mathcal{L}_1(\theta) = -\sum_{k=1}^{K} \Big[ \mathbb{E}_{P_k(x)}\big[\log \sigma(f_k(x \mid \theta))\big] + \mathbb{E}_{P_{out}(x)}\big[\log\big(1 - \sigma(f_k(x \mid \theta))\big)\big] \Big]. \tag{4}
$$

The expectations in Eq. 4 are computed by Monte Carlo integration using an equal number of samples from $P_k$ and $P_{out}$. In this work, we use samples from $P_{out}$ that are generated by perturbing the training set using a novel generative adversarial network. The generated samples are separable from the training samples in the high-dimensional input space, so they are out of distribution, while still having many similarities to the training samples in a lower-dimensional representation space, which is learned to reconstruct the training samples.

As a result, $\exp(f_k(x \mid \theta))$ approximates the relative density $P_k(x)/P_{out}(x)$ of class $k$, as it is trained using the samples from class $k$ and the samples close to, but easily differentiable from, the samples belonging to all $K$ classes in the training set. For each sample $x$ in the training set, we take $\mathbf{e} = \exp(f(x \mid \theta))$ as the pseudocount (i.e., evidence) vector, where each element $e_k = \exp(f_k(x \mid \theta))$ is the pseudocount for $x$ being assigned to class $k$. Then, the parameters of the Dirichlet distribution given the uniform Dirichlet prior are calculated as $\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$. If the sample $x$ is more similar to the samples from $P_{out}$, then almost zero evidence is generated by the neural network and the predicted Dirichlet distribution becomes very close to the uniform Dirichlet distribution. This should be the case for samples from $P_{out}$ and for the outliers in the training set. On the other hand, if the sample is labelled as $k$ in the training set and it is not an outlier, we expect $e_k > e_j \geq 0$ for any $j \neq k$.
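To make this mapping and the loss in Eq. 4 concrete, here is a small NumPy sketch that turns logits into evidence and Dirichlet parameters and evaluates a Monte Carlo estimate of the Bernoulli loss on a batch of labelled training samples and a batch of generated samples. The logits and labels are placeholders standing in for a trained network's outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def evidence_and_alpha(logits):
    """Each logit f_k approximates log(P_k(x)/P_out(x)); exp(f_k) is the pseudocount e_k."""
    evidence = np.exp(logits)          # non-negative evidence per class
    return evidence, evidence + 1.0    # alpha = e + 1 (uniform Dirichlet prior)

def bernoulli_loss(logits_in, labels_in, logits_out, eps=1e-8):
    """Monte Carlo estimate of Eq. (4).

    logits_in:  (N, K) logits for training samples; labels_in: (N,) integer class labels
    logits_out: (M, K) logits for generated out-of-distribution samples
    """
    K = logits_in.shape[1]
    loss = 0.0
    for k in range(K):
        mask = labels_in == k
        if mask.any():   # E_{P_k(x)}[log sigma(f_k)] over samples of class k
            loss -= np.log(sigmoid(logits_in[mask, k]) + eps).mean()
        # E_{P_out(x)}[log(1 - sigma(f_k))] over generated samples, for every class k
        loss -= np.log(1.0 - sigmoid(logits_out[:, k]) + eps).mean()
    return loss

# placeholder batch: 6 training samples and 6 generated samples, K = 3 classes
rng = np.random.default_rng(0)
logits_in, logits_out = rng.normal(size=(6, 3)), rng.normal(size=(6, 3))
labels_in = np.array([0, 1, 2, 0, 1, 2])
print(bernoulli_loss(logits_in, labels_in, logits_out))
print(evidence_and_alpha(logits_in[0]))
```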

Uncertainty for Misclassified Samples

The computed $\boldsymbol{\alpha}$ parameters define a Dirichlet distribution $D(\mathbf{p} \mid \boldsymbol{\alpha})$, from which one can sample categorical distributions over the possible classes of $x$. However, only one of these classes is correct, and the assignment of $x$ to any other class is considered a misclassification. If $k$ is the true class of $x$, then the marginal distribution for $p_k$, i.e., the probability of correctly classifying $x$, is a two-parameter Dirichlet distribution (also known as a Beta distribution) with parameters $\langle \alpha_k, \sum_{j \neq k} \alpha_j \rangle$. Let $\mathbf{p}_{-k}$ refer to the vector of probabilities $p_j$ such that $j \neq k$. The probabilities of misclassifying $x$ to each class other than the true one are distributed according to the conditional Dirichlet distribution $\mathbf{p}'_{-k} \mid p_k \sim D(\mathbf{p}'_{-k} \mid \boldsymbol{\alpha}_{-k})$, where $\mathbf{p}'_{-k}$ is a categorical distribution over the misclassified classes, created by normalizing $\mathbf{p}_{-k}$ with $(1 - p_k)$, which is also equal to $\sum_{i \neq k} p_i$. Since we desire a classifier to be totally uncertain in its misclassifications (except near decision boundaries), we minimize the Kullback–Leibler (KL) divergence between $D(\mathbf{p}'_{-k} \mid \boldsymbol{\alpha}_{-k})$ and the uniform Dirichlet distribution $D(\mathbf{p}_{-k} \mid \mathbf{1})$ using the following regularizer:

$$
\mathcal{L}_2(\theta \mid x) = \beta \, \mathrm{KL}\big[ D(\mathbf{p}_{-k} \mid \boldsymbol{\alpha}_{-k}) \,\|\, D(\mathbf{p}_{-k} \mid \mathbf{1}) \big], \tag{5}
$$

where $\beta$ is the weight of the KL term. It can be set to $(1 - \hat{p}_k)$, which is the expectation of the probability of misclassification (i.e., $1 - p_k$); its use as the weight of the KL term enables learned loss attenuation, that is, it places a higher weight on the epistemic uncertainty enforcement as the aleatoric uncertainty of misclassification decreases. Then, the generative evidential neural network learns the parameters $\theta$ by minimizing the overall loss defined as

$$
\mathcal{L}(\theta) = \mathcal{L}_1(\theta) + \mathbb{E}_{P_{in}(x)}\big[\mathcal{L}_2(\theta \mid x)\big]. \tag{6}
$$
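As a reference point, a small SciPy-based sketch of the regularizer in Eq. 5: the KL divergence between a Dirichlet with parameters α and the uniform Dirichlet has a closed form, and β is set to 1 − p̂_k as described above. The example α values are hypothetical.

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_dirichlet_uniform(alpha):
    """KL( Dir(alpha) || Dir(1, ..., 1) ) in closed form."""
    alpha = np.asarray(alpha, dtype=float)
    K, S = alpha.size, alpha.sum()
    return (gammaln(S) - gammaln(alpha).sum() - gammaln(K)
            + ((alpha - 1.0) * (digamma(alpha) - digamma(S))).sum())

def misclassification_regularizer(alpha, true_class):
    """Eq. (5): beta * KL over the parameters of the wrong classes, with beta = 1 - p_hat_k."""
    alpha = np.asarray(alpha, dtype=float)
    p_hat_k = alpha[true_class] / alpha.sum()        # expected probability of the true class
    alpha_minus_k = np.delete(alpha, true_class)     # Dirichlet parameters of the other classes
    return (1.0 - p_hat_k) * kl_dirichlet_uniform(alpha_minus_k)

# hypothetical Dirichlet parameters for a 4-class problem with true class 0
print(misclassification_regularizer([8.0, 1.2, 3.5, 1.1], true_class=0))
```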

Generating Out-of-Distribution Samples

Previous approaches generate out-of-distribution samples by perturbing training samples. However, they usually require a manual determination of how much perturbation should be applied to these samples. Too little perturbation may not set the resulting samples apart from the actual samples, while too much perturbation may cast them far away from the training data and deteriorate their usefulness.

Variational Autoencoders (VAEs) are probabilistic generative models that create low-dimensional latent representations for high-dimensional data by maximizing

$$
\max_{q_\theta,\, p_\phi} \; \sum_{i=1}^{N} \mathbb{E}_{q_\theta(z \mid x_i)}\Big[\log p_\phi(x_i \mid z) - \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big)\Big], \tag{7}
$$

where $q_\theta(z \mid x_i)$ is the latent space distribution for each sample $x_i$ and $p_\phi(x_i \mid z)$ is the decoder likelihood distribution that is maximized for each sample $x_i$. The KL term enforces $q_\theta(z \mid x_i)$ to be close to a prior distribution $p(z)$ and to have a denser latent space.

Figure 2: Original training samples (top), samples reconstructed by the VAE (middle), and the samples generated by the proposed method (bottom) over a number of epochs.

Proximity of the encoded samples in the latent space of a VAE is commonly used as an indication of their semantic similarity and exploited for few-shot classification and anomaly detection tasks. In this work, we also use the latent space of a VAE as a proxy for the semantic similarity between samples in the input space. Hence, we exploit it to generate out-of-distribution samples, which are similar to, but at the same time clearly separable from, the training examples in the input space.

For each $x_i$ in the training set, we sample a latent point $z$ from $q_\theta(z \mid x_i)$ and perturb it by $\epsilon \sim q_\gamma(\epsilon \mid z)$, which is implemented as a multivariate Gaussian distribution $\mathcal{N}(0, G(z))$, where $G(\cdot)$ is a fully connected neural network with non-negative output that is trained via

$$
\max_{G} \; \mathbb{E}_{q_\theta(z \mid x_i),\, q_\gamma(\epsilon \mid z),\, p_\phi(\tilde{x}_i \mid z + \epsilon)}
\Big[ \underbrace{\log D'(z + \epsilon)}_{(a)} + \underbrace{\log\big(1 - D(\tilde{x}_i)\big)}_{(b)} \Big], \tag{8}
$$

where $\tilde{x}_i \sim p_\phi(\tilde{x}_i \mid z + \epsilon)$ is the decoded out-of-distribution sample from the perturbed latent point $z + \epsilon$. The discriminators $D$ and $D'$ are binary classifiers with sigmoid output that try to distinguish real samples from the generated ones. That is, given an input, a discriminator outputs the probability that the sample comes from the training set distribution. In Eq. 8, (a) forces the generated points to be similar to the real latent points by making them indistinguishable by $D'$ in the latent space of the VAE, and (b) encourages the generated samples to be distinguishable by $D$ in the input space. The discriminators are optimized via

$$
\max_{D'} \; \log D'(z) + \underbrace{\log\big(1 - D'(z + \epsilon)\big)}_{(c)}, \tag{9}
$$

$$
\max_{D} \; \log D(x_i) + \log\big(1 - D(\tilde{x}_i)\big). \tag{10}
$$

Note that (c) of Eq. 9 is also included in the objective of the VAE (Eq. 8) to adapt the latent space during the training of the generator. We trained the VAE, generator, and discriminators by iterating between maximizing Eq. 7 through Eq. 10 until convergence, as in the regular training of the generator and discriminator in GANs. We demonstrate this approach in Fig. 1(c) and Fig. 2, where a number of real and generated MNIST images are shown.
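The following NumPy sketch illustrates a single forward pass of this perturbation scheme under simplifying assumptions: the VAE encoder, the decoder, and the generator G are replaced by untrained stand-in functions, and G is treated as producing a diagonal scale for the Gaussian perturbation. It is meant only to show how a generated out-of-distribution sample x̃ is obtained; the actual training alternates the objectives in Eqs. 7–10.

```python
import numpy as np

rng = np.random.default_rng(0)
CODE, DIM = 50, 784                              # latent code size and (assumed) flattened input size

def softplus(x):
    return np.log1p(np.exp(x))

# Untrained stand-ins; in the paper these are the trained VAE encoder/decoder and the
# fully connected generator network G with a softplus (non-negative) output layer.
W_dec = rng.standard_normal((CODE, DIM)) * 0.05
W_g = rng.standard_normal((CODE, CODE)) * 0.05

def encode(x):                                   # q_theta(z | x): mean and log-variance (stub)
    return x[:, :CODE] * 0.1, np.full((x.shape[0], CODE), -2.0)

def decode(z):                                   # p_phi(x | z): deterministic decoder stub
    return np.tanh(z @ W_dec)

def generate_ood(x):
    """Sample z ~ q(z|x), perturb it by eps ~ N(0, G(z)), and decode z + eps."""
    mu, logvar = encode(x)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    sigma = softplus(z @ W_g)                    # G(z): non-negative scale per latent dimension
    eps = sigma * rng.standard_normal(z.shape)
    return decode(z + eps)                       # decoded out-of-distribution sample x_tilde

x_batch = rng.standard_normal((8, DIM))          # hypothetical flattened inputs
x_ood = generate_ood(x_batch)
print(x_ood.shape)
```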

Evaluation

To be able to compare our approach with recent work, we adopted the same strategy used for evaluation in (Louizos and Welling 2017; Sensoy, Kaplan, and Kandemir 2018; Pawlowski et al. 2017). That is, we use LeNet-5 (LeCun et al. 1998) with ReLU non-linearities and max pooling as the neural network architecture and evaluated our approach on the MNIST and CIFAR10 datasets, to be able to make a fair comparison with the most related recent work. We implemented our approach1 using Python and TensorFlow.

In this section, we compared our approach with the following approaches: (a) L2 corresponds to the standard neural nets with softmax probabilities and L2 regularization, (b) Dropout refers to the Bayesian model used in (Gal and Ghahramani 2016a), (c) Deep Ensemble refers to the model proposed in (Lakshminarayanan, Pritzel, and Blundell 2017), (d) FFG refers to the fully factorized Bayesian model with Gaussian posteriors from (Blundell et al. 2015), which is widely known as Bayes by Backprop (BBB), (e) FFLU refers to the Bayesian model used in (Kingma, Salimans, and Welling 2015) with the additive parametrization (Molchanov, Ashukha, and Vetrov 2017), (f) MNFG2 refers to the variational approximation based model in (Louizos and Welling 2017), (g) EDL refers to the model in (Sensoy, Kaplan, and Kandemir 2018), (h) BBH3 refers to the Bayesian model based on implicit weight uncertainty (Pawlowski et al. 2017), and (i) GEN refers to the proposed approach.

Predictive Uncertainty Estimation

We used the network architectures in Table 1 to train our model for the MNIST dataset. For CIFAR10, we used the same architectures; however, the classifier uses 192 filters for Conv1 and Conv2 and has 1000 neurons in FC1, as described in (Louizos and Welling 2017). We used L2 regularization with coefficient 0.005 in the fully-connected layers. The other approaches are also trained using the same classifier architecture with the priors and posteriors described in (Louizos and Welling 2017) and (Pawlowski et al. 2017). The classification accuracy of each model on the MNIST test set can be seen in Table 2. While we do not explicitly aim for high classification accuracy, our results indicate that our approach is doing better than most of the other approaches.

We train models for MNIST using the images from the 10 digit categories from the training set as usual. However, we then tested these models on the notMNIST dataset,4 which contains 10 letters A-J instead of digits. For CIFAR10, we trained models using the training data from the first five categories (referred to as CIFAR5) and tested these models using the images from the last five categories.

1 https://muratsensoy.github.io/gen.html
2 https://github.com/AMLab-Amsterdam/MNF_VBNN
3 https://github.com/pawni/BayesByHypernet
4 https://www.kaggle.com/lubaroli/notmnist


Block       Layer      Filters/Neurons  Patch Size  Stride  Activation
Classifier  Conv1      20               5 x 5       1       relu
            Max Pool   -                2 x 2       2       -
            Conv2      50               5 x 5       1       relu
            Max Pool   -                2 x 2       2       -
            FC1        500              -           -       relu
            FC2        K = 10           -           -       -
D           Conv1-FC1  repeat           repeat      repeat  repeat
            FC3        1                -           -       sigmoid
G           FC4        32               -           -       relu
            FC5        32               -           -       relu
            FC6        32               -           -       relu
            FC7        code sz = 50     -           -       softplus
D'          FC4-FC6    repeat           repeat      repeat  repeat
            FC8        1                -           -       sigmoid

Table 1: Network architectures.

Model          MNIST  CIFAR5
L2             99.4   76
Dropout        99.5   84
Deep Ensemble  99.3   79
FFG            99.1   78
FFLU           99.1   77
MNFG           99.3   84
BBH            99.1   80
EDL            99.3   83
GEN            99.3   83

Table 2: Test accuracies (%) for MNIST and CIFAR5.

Figure 3: Empirical CDF for the entropy of the predictive distributions on the notMNIST dataset (left) and on samples from the last five categories of the CIFAR10 dataset (right).

For both MNIST and CIFAR10, the predicted label for any test sample is guaranteed to be wrong, since the test samples come from a different distribution than the one for the training set. Hence, an ideal classifier should report totally uncertain predictions instead of associating a higher likelihood with a specific label for an out-of-distribution test sample.

To be consistent with recent works, we use the entropy of the predicted categorical distribution over labels as a proxy to quantify prediction uncertainty. That is, a prediction gets more uncertain as its entropy approaches the maximum entropy, i.e., the entropy of the uniform categorical distribution. As in (Louizos and Welling 2017), we used the empirical CDF of the entropy distribution of the predictions to quantify how uncertain they are. That is, as the predictions get more uncertain, the area under their entropy CDF curve gets smaller.
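For reference, such curves can be produced from a model's predicted class probabilities as in the short sketch below (a smaller area under the resulting CDF curve means the predictions are, on the whole, more uncertain); the probability array here is a random placeholder, not model output.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each predictive categorical distribution (one row per prediction)."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -(probs * np.log(probs)).sum(axis=1)

def empirical_cdf(values):
    """Sorted values and the fraction of predictions at or below each value."""
    x = np.sort(values)
    return x, np.arange(1, x.size + 1) / x.size

# placeholder predictions for a 10-class problem; in the evaluation these come from each model
probs = np.random.dirichlet(np.ones(10), size=1000)
xs, ys = empirical_cdf(predictive_entropy(probs))   # plotting xs vs. ys gives one CDF curve
```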

Figure 3 shows our results for the MNIST and CIFAR10 datasets. The standard neural network (referred to as L2) is very confident in its predictions, as indicated by its entropy CDF curves in the figure. On the other hand, Bayesian neural network models appear to be more uncertain about their predictions with respect to the standard neural networks. The performance of these models in terms of predictive uncertainty varies for MNIST, while they perform almost the same for CIFAR10. EDL and GEN perform much better than the Bayesian approaches on both MNIST and CIFAR10. In particular, GEN associates very high uncertainty with its predictions for out-of-distribution samples.

After conducting this benchmark analysis by following the very same procedure proposed in (Louizos and Welling 2017), we also tested these models with in-distribution samples and analyzed the certainty they assign to correct and incorrect predictions. Figure 4 shows entropy CDF curves for successful and failed predictions on the MNIST test set for different models. The figure indicates that standard networks and Bayesian neural networks are overconfident (i.e., have low entropy) for their failed predictions; that is, they have a large area under the entropy CDF curve for their failed predictions (i.e., misclassifications). However, both EDL and GEN have significantly higher predictive uncertainty for their failed predictions. Furthermore, GEN gives a better disparity between the successful and failed predictions in terms of uncertainty. We also conducted the same analysis for the CIFAR10 dataset and obtained similar results.

Robustness to Adversarial Examples

Robustness to adversarial examples is an important challenge for machine learning models. While it is very hard to provide correct predictions for carefully crafted adversarial examples, a model should associate very high uncertainty with its prediction when tested on them. Hence, in this section, we test different models using a well-known white-box attack strategy, the Fast Gradient Sign Method (FGSM), proposed by (Goodfellow, Shlens, and Szegedy 2014), and analyze how uncertain these models are when they fail to correctly classify the generated adversarial examples.

White-box attacks have access to the model parameters and exploit gradients of the loss with respect to an input to perturb the input and create an adversarial example. The amount of perturbation is defined by the ε ∈ [0, 1] parameter. Figure 5 shows our results in terms of both accuracy and uncertainty on the MNIST test set for different ε values. The figure indicates that GEN demonstrates the ideal behavior; it associates the highest uncertainty (maximum entropy) with its predictions as it starts to fail to make the right predictions for high values of ε. We observe the same behavior for the CIFAR10 dataset, as shown in Figure 6.
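As a point of reference, here is a minimal sketch of the FGSM perturbation described above, written against the TensorFlow 2 API rather than the paper's original code; `model` and `loss_fn` are assumed to be an already trained classifier and its training loss.

```python
import tensorflow as tf

def fgsm(model, loss_fn, x, y, eps):
    """Fast Gradient Sign Method: move x by eps in the direction that increases the loss."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)                          # track gradients with respect to the input
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    x_adv = x + eps * tf.sign(grad)
    return tf.clip_by_value(x_adv, 0.0, 1.0)   # keep pixel values in the valid range
```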

Comparisons with Anomaly Detection Methods

Our work is also related to anomaly detection approaches, which are specifically designed to detect out-of-distribution samples. Hence, in this section, we compare our approach with existing and recent anomaly detection methods on the MNIST and CIFAR10 datasets.


Figure 4: Entropy CDF curves of different models for their successful and failed predictions on the MNIST test set.

Figure 5: Accuracy and entropy as a function of the adversarial perturbation ε on the MNIST dataset.

Figure 6: Accuracy and entropy as a function of the adversarial perturbation ε on the CIFAR10 dataset.

We compared our approach with the following models: (a) Calibrated refers to the calibration-based model for out-of-distribution detection in (Lee et al. 2018), (b) GEOTRANS is the anomaly detection model in (Golan and El-Yaniv 2018), (c) SVM refers to the one-class SVM applied to the latent space of a convolutional autoencoder as described in (Golan and El-Yaniv 2018), (d) ADGAN is the anomaly detection method based on generative adversarial networks in (Deecke et al. 2018), (e) DAGMM is the deep autoencoding Gaussian mixture model in (Zong et al. 2018), and (f) DSEBM is the deep structured energy-based model in (Zhai et al. 2016). We use the publicly available implementations of Calibrated5 and GEOTRANS6 by their authors, which also contain implementations of the other models above. Unlike our approach, these models predict a score for out-of-distribution classification. To evaluate these approaches, the area under the ROC curve (AUC) is used as a measure of how well the produced scores can distinguish between in- and out-of-distribution samples.

5 https://github.com/alinlab/Confident_classifier
6 https://github.com/izikgo/AnomalyDetectionTransformations

Similarly, as before, we train models using samples only from the first five categories of the MNIST and CIFAR10 datasets. Then, we evaluate these models on test samples, half of which come from the first five categories and the other half from the last five categories. We use the entropy of the predictive probabilities from our model as a score to differentiate between in- and out-of-distribution samples. Let us note that, as shown before, our approach provides highly uncertain predictions not only for out-of-distribution samples, but also for misclassified in-distribution samples. Hence, the in-distribution samples lying on the class boundaries of the first five categories may also be classified as out-of-distribution samples based on their entropy-based score.
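In other words, this evaluation reduces to scoring each test sample by its predictive entropy and computing the AUC against in/out labels, along the lines of the sketch below (scikit-learn assumed available; the scores and labels are placeholders, not experimental data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# entropy-based scores: higher should indicate "more likely out-of-distribution"
scores = np.concatenate([np.random.uniform(0.0, 1.0, 500),    # placeholder in-distribution scores
                         np.random.uniform(0.5, 2.3, 500)])   # placeholder out-of-distribution scores
labels = np.concatenate([np.zeros(500), np.ones(500)])        # 1 marks an out-of-distribution sample

print("AUC:", roc_auc_score(labels, scores))
```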

Figure 7 shows our results against anomaly detection methods on the MNIST and CIFAR10 datasets. Both Calibrated and GEOTRANS perform better than SVM, ADGAN, DAGMM, and DSEBM in our experiments. For MNIST, GEN and Calibrated achieve the best AUC values, respectively 0.965 and 0.966. For CIFAR10, GEN achieves the best AUC (0.775) and performs significantly better than state-of-the-art anomaly detectors in recognizing out-of-distribution samples.


Figure 7: Anomaly detection results for MNIST (left) and CIFAR10 (right).

While our approach has higher entropy in its predictions for both out-of-distribution samples and misclassified in-distribution samples,7 its entropy-based score still performs at least as well as the state-of-the-art anomaly detection methods.

Related Work

Quantification of predictive uncertainty has always been very important for machine learning models. Gaussian Processes (GPs) (Rasmussen and Williams 2006) have been very good both at making accurate predictions and at estimating their predictive uncertainty. However, these kernel-based non-parametric models cannot easily deal with high-dimensional data such as images due to the curse of dimensionality.

In recent years, Bayesian deep learning has emerged as a field combining deep neural networks with Bayesian probability theory, which provides a principled way of modeling the uncertainty of machine learning models by placing prior distributions on their parameters and inferring the posterior distribution for these parameters using approximations such as Variational Bayes (Blundell et al. 2015; Gal and Ghahramani 2016b). Then, the posterior predictive distribution is approximated with sampling methods, which brings a significant computational overhead and leads to noise in predictive uncertainty estimates.

In these models, predictive uncertainty is modelled by taking samples from the posterior distributions of the model parameters and using the sampled parameters to create a distribution of predictions for each input of the network. However, as we show in our experiments, modeling the uncertainty of the network parameters may not necessarily lead to good estimates of the predictive uncertainty of neural networks (Hafner et al. 2018). This is the case especially for misclassified in-distribution samples, where Bayesian models associate similar levels of uncertainty with their successful and failed predictions.

Recently, a number of approaches (Sensoy, Kaplan, and Kandemir 2018; Malinin and Gales 2018) have been proposed to use the outputs of neural networks to estimate the parameters of the Dirichlet prior of the categorical distribution for classification, instead of predicting a categorical distribution through the softmax function. The resulting Dirichlet distribution is then used to calculate the predictive uncertainty of the classification. While similar in principle, our work is distinguished from this line of work in two ways: (1) it relates the parameters (i.e., the pseudocounts) of the resulting Dirichlet distribution to the density of the training data through noise-contrastive estimation, and (2) it automatically synthesizes out-of-distribution samples sufficiently close to the training data, instead of hand-picking an auxiliary dataset.

7 Table 2 indicates that 17% of CIFAR5 test samples are misclassified.

Previous approaches used manually tuned noise (Hafner et al. 2018) or a GAN in the input space (Lee et al. 2018) to create out-of-distribution samples. The approaches based on GANs may suffer from the so-called mode collapse problem. To avoid it in this work, we created samples by automatically perturbing each training example in the latent space separately. Also, to avoid generating samples that are too similar to, or too different from, the training examples, we used a generator with joint objectives defined over the outputs of two discriminators.

Conclusions

In this work, we proposed to combine ideas from implicit density models, noise-contrastive estimation, and evidential deep learning in a novel way to quantify classification uncertainty in neural networks. We also proposed to generate out-of-distribution samples by combining the strengths of VAEs and GANs. The generated examples are used for learning an implicit density model of the training data, which is then utilized to generate pseudocounts (i.e., evidence) for the Dirichlet parameters. Through extensive experiments with well-studied datasets and comprehensive comparisons with recent approaches, we show that our approach significantly enhances the state of the art in two uncertainty estimation benchmarks: i) detection of out-of-distribution samples, and ii) robustness to adversarial examples.

Acknowledgments

This research was sponsored by the U.S. Army Research Laboratory (ARL) and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. Also, Dr. Sensoy thanks ARL for its support under grant W911NF-16-2-0173, and the Newton-Katip Celebi Fund and TUBITAK for their support under grant 116E918.


References

[Bishop 2006] Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.
[Blundell et al. 2015] Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. In ICML.
[Deecke et al. 2018] Deecke, L.; Vandermeulen, R.; Ruff, L.; Mandt, S.; and Kloft, M. 2018. Anomaly detection with generative adversarial networks.
[Gal and Ghahramani 2016a] Gal, Y., and Ghahramani, Z. 2016a. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv:1506.02158.
[Gal and Ghahramani 2016b] Gal, Y., and Ghahramani, Z. 2016b. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML.
[Gast and Roth 2018] Gast, J., and Roth, S. 2018. Lightweight probabilistic deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3369-3378.
[Golan and El-Yaniv 2018] Golan, I., and El-Yaniv, R. 2018. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems.
[Goodfellow, Shlens, and Szegedy 2014] Goodfellow, I.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[Gutmann and Hyvarinen 2012] Gutmann, M. U., and Hyvarinen, A. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13(Feb):307-361.
[Hafner et al. 2018] Hafner, D.; Tran, D.; Irpan, A.; Lillicrap, T.; and Davidson, J. 2018. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv:1807.09289.
[Josang 2016] Josang, A. 2016. Subjective Logic: A Formalism for Reasoning Under Uncertainty. Springer.
[Kingma, Salimans, and Welling 2015] Kingma, D.; Salimans, T.; and Welling, M. 2015. Variational dropout and the local reparameterization trick. In NIPS.
[Kotz, Balakrishnan, and Johnson 2000] Kotz, S.; Balakrishnan, N.; and Johnson, N. 2000. Continuous Multivariate Distributions, volume 1. New York: Wiley.
[Lakshminarayanan, Pritzel, and Blundell 2017] Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS.
[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P.; et al. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.
[Lee et al. 2018] Lee, K.; Lee, H.; Lee, K.; and Shin, J. 2018. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations.
[Louizos and Welling 2017] Louizos, C., and Welling, M. 2017. Multiplicative normalizing flows for variational Bayesian neural networks. In ICML.
[Malinin and Gales 2018] Malinin, A., and Gales, M. 2018. Predictive uncertainty estimation via prior networks. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31. 7047-7058.
[Matthies 2007] Matthies, H. G. 2007. Quantifying uncertainty: Modern computational representation of probability and applications. In Extreme Man-Made and Natural Hazards in Dynamics of Structures. 105-135.
[Mohamed and Lakshminarayanan 2016] Mohamed, S., and Lakshminarayanan, B. 2016. Learning in implicit generative models. arXiv preprint arXiv:1610.03483.
[Molchanov, Ashukha, and Vetrov 2017] Molchanov, D.; Ashukha, A.; and Vetrov, D. 2017. Variational dropout sparsifies deep neural networks. In ICML.
[NHTSA 2016] NHTSA. 2016. Tesla crash preliminary evaluation report (PE 16-007), U.S. Department of Transportation.
[Pawlowski et al. 2017] Pawlowski, N.; Brock, A.; Lee, M. C.; Rajchl, M.; and Glocker, B. 2017. Implicit weight uncertainty in neural networks. arXiv:1711.01297.
[Rasmussen and Williams 2006] Rasmussen, C., and Williams, C. 2006. Gaussian Processes for Machine Learning. MIT Press.
[Sensoy, Kaplan, and Kandemir 2018] Sensoy, M.; Kaplan, L.; and Kandemir, M. 2018. Evidential deep learning to quantify classification uncertainty. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31. 3179-3189.
[Zhai et al. 2016] Zhai, S.; Cheng, Y.; Lu, W.; and Zhang, Z. 2016. Deep structured energy based models for anomaly detection. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, 1100-1109.
[Zong et al. 2018] Zong, B.; Song, Q.; Min, M. R.; Cheng, W.; Lumezanu, C.; Cho, D.; and Chen, H. 2018. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations.