

Challenges in Markov chain Monte Carlo for Bayesian neural networks

Theodore Papamarkou, Jacob Hinkle, M. Todd Young and David Womble

Abstract. Markov chain Monte Carlo (MCMC) methods have not been broadly adopted in Bayesian neural networks (BNNs). This paper initially reviews the main challenges in sampling from the parameter posterior of a neural network via MCMC. Such challenges culminate in a lack of convergence to the parameter posterior. Nevertheless, this paper shows that a non-converged Markov chain, generated via MCMC sampling from the parameter space of a neural network, can yield via Bayesian marginalization a valuable posterior predictive distribution of the output of the neural network. Classification examples based on multilayer perceptrons showcase highly accurate posterior predictive distributions. The postulate of limited scope for MCMC developments in BNNs is partially valid; an asymptotically exact parameter posterior seems less plausible, yet an accurate posterior predictive distribution is a tenable research avenue.

Key words and phrases: Bayesian inference, Bayesian neural networks, convergence diagnostics, Markov chain Monte Carlo, posterior predictive distribution.

1. MOTIVATION

The universal approximation theorem (Cybenko, 1989) and its subsequent extensions (Hornik, 1991; Lu et al., 2017) state that feedforward neural networks with exponentially large width and width-bounded deep neural networks can approximate any continuous function arbitrarily well. This universal approximation capacity of neural networks, along with available computing power, explains the widespread use of deep learning nowadays.

Bayesian inference for neural networks is typically performed via stochastic Bayesian optimization, stochastic variational inference (Polson and Sokolov, 2017) or ensemble methods (Ashukha et al., 2020; Wilson and Izmailov, 2020). MCMC methods have been explored in the context of neural networks, but have not become part of the Bayesian deep learning toolbox.

Department of Mathematics, The University of Manchester, Manchester, UK, and Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA

The slower evolution of MCMC methods for neural networks is partly attributed to the lack of scalability of existing MCMC algorithms for big data and for high-dimensional parameter spaces. Furthermore, additional factors hinder the adaptation of existing MCMC methods in deep learning, including the hierarchical structure of neural networks and the associated covariance between parameters, lack of identifiability arising from weight symmetries, lack of a priori knowledge about the parameter space, and ultimately lack of convergence.

The purpose of this paper is twofold. Initially, a literature review is conducted to identify inferential challenges in MCMC developments for neural networks. Subsequently, Bayesian marginalization based on MCMC samples of neural network parameters is used for attaining accurate posterior predictive distributions of the respective neural network output, despite the lack of convergence of the MCMC samples to the parameter posterior.

An outline of the paper layout follows. Section 2 reviews the inferential challenges arising from the application of MCMC to neural networks. Section 3 provides an overview of the employed inferential framework, including the multilayer perceptron (MLP) model and its likelihood for binary and multiclass classification, the MCMC algorithms for sampling from MLP parameters, the multivariate MCMC diagnostics for assessing convergence and sampling effectiveness, and the Bayesian marginalization for attaining posterior predictive distributions of MLP outputs. Section 4 showcases Bayesian parameter estimation via MCMC and Bayesian predictions via marginalization by fitting different MLPs to four datasets. Section 5 posits predictive inference for neural networks, among other things by combining Bayesian marginalization with approximate MCMC sampling or with ensemble training.

arXiv:1910.06539v6 [stat.ML] 1 Oct 2021

2. PARAMETER INFERENCE CHALLENGES

A literature review of inferential challenges in the application of MCMC methods to neural networks is conducted in this section thematically, with each subsection being focused on a different challenge.

2.1 Computational cost

Existing MCMC algorithms do not scale well with an increasing number of parameters or of data points. For this reason, approximate inference methods, including variational inference (VI), are preferred in high-dimensional parameter spaces or in big data problems from a time complexity standpoint (MacKay, 1995; Blei, Kucukelbir and McAuliffe, 2017; Blier and Ollivier, 2018). On the other hand, MCMC methods are better than VI in terms of approximating the log-likelihood (Dupuy and Bach, 2017).

Literature on MCMC methods for neural networks is limited due to associated computational complexity implications. Sequential Monte Carlo and reversible jump MCMC have been applied to two types of neural network architectures, namely MLPs and radial basis function networks (RBFs), see for instance Andrieu, de Freitas and Doucet (1999); de Freitas (1999); Andrieu, de Freitas and Doucet (2000); de Freitas et al. (2001). For a review of Bayesian approaches to neural networks, see Titterington (2004).

Many research developments have been made to scale MCMC algorithms to big data. The main focus has been on designing Metropolis-Hastings or Gibbs sampling variants that evaluate a costly log-likelihood on a subset (minibatch) of the data rather than on the entire data set (Welling and Teh, 2011; Chen, Fox and Guestrin, 2014; Ma, Foti and Fox, 2017; Mandt, Hoffman and Blei, 2017; De Sa, Chen and Wong, 2018; Nemeth and Sherlock, 2018; Robert et al., 2018; Seita et al., 2018; Quiroz et al., 2019).
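The minibatch idea can be illustrated with a short sketch. Assuming that the log-likelihood is a sum of per-observation terms, rescaling the sum over a random subset of size b by s/b gives an unbiased estimate of the full-data log-likelihood. The function names and the Gaussian toy model below are hypothetical, chosen only for illustration, and the sketch does not reproduce any of the cited algorithms.

```python
import numpy as np

def minibatch_log_lik(log_lik_terms, theta, data, batch_size, rng):
    """Unbiased estimate of the full-data log-likelihood from a random minibatch."""
    s = len(data)
    idx = rng.choice(s, size=batch_size, replace=False)  # subset without replacement
    return s / batch_size * np.sum(log_lik_terms(theta, data[idx]))

# Hypothetical toy model: unit-variance Gaussian with unknown mean theta.
def gaussian_terms(theta, y):
    return -0.5 * (y - theta) ** 2 - 0.5 * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, size=1000)
full = np.sum(gaussian_terms(1.0, data))                        # all s terms
estimate = minibatch_log_lik(gaussian_terms, 1.0, data, 100, rng)  # 100 terms
```

Plugging a noisy estimate of this kind into an acceptance ratio generally changes the stationary distribution of the chain, so the sketch is not by itself a valid sampler.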

Among minibatch MCMC algorithms for big data applications, there exists a subset of studies applying such algorithms to neural networks (Chen, Fox and Guestrin, 2014; Gu, Ghahramani and Turner, 2015; Gong, Li and Hernandez-Lobato, 2019). Minibatch MCMC approaches to neural networks pave the way towards data-parallel deep learning. On the other hand, to the best of the authors’ knowledge, there is no published research on MCMC methods that evaluate the log-likelihood on a subset of neural network parameters rather than on the whole set of parameters, and therefore no reported research on model-parallel deep learning via MCMC.

Minibatch MCMC has been studied analytically by Johndrow, Pillai and Smith (2020). Their theoretical findings point out that some minibatching schemes can yield inexact approximations and that minibatch MCMC cannot greatly expedite the rate of convergence.

2.2 Model structure

A neural network with $\rho$ layers can be viewed as a hierarchical model with $\rho$ levels, each network layer representing a level (Williams, 2000). Due to its nested layers and its non-linear activations, a neural network is a non-linear hierarchical model.

MCMC methods for non-linear hierarchical models have been developed, see for example Bennett, Racine-Poon and Wakefield; Gilks and Roberts; Daniels and Kass (1998); Sargent, Hodges and Carlin (2000). However, existing MCMC methods for non-linear hierarchical models have not been harnessed for neural networks due to time complexity and convergence implications.

Although not designed to mirror the hierarchical structure of a neural network, recent hierarchical VI (Ranganath, Tran and Blei, 2016; Esmaeili et al., 2019; Huang et al., 2019; Titsias and Ruiz, 2019) provides more general variational approximations of the parameter posterior of the neural network than mean-field VI. Introducing a hierarchical structure in the variational distribution induces correlation among parameters, in contrast to the mean-field variational distribution that assumes independent parameters. So, one of the Bayesian inference strategies for neural networks is to approximate the covariance structure among network parameters. In fact, there are published comparisons between MCMC and VI in terms of speed and accuracy of convergence to the posterior covariance, both for linear or mixture models (Giordano, Broderick and Jordan, 2015; Mandt, Hoffman and Blei, 2017; Ong, Nott and Smith, 2018) and for neural networks (Zhang et al., 2018a).

2.3 Weight symmetries

The output of a feedforward neural network given some fixed input remains unchanged under a set of transformations determined by the choice of activations and by the network architecture more generally. For instance, certain weight permutations and sign flips in MLPs with hyperbolic tangent activations leave the output unchanged (Chen, Lu and Hecht-Nielsen, 1993).

If a parameter transformation leaves the output of a neural network unchanged given some fixed input, then the likelihood is invariant under the transformation. In other words, transformations, such as weight permutations and sign flips, render neural networks non-identifiable (Pourzanjani, Jiang and Petzold, 2017).
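As an illustration of these symmetries, the following sketch checks numerically that permuting hidden neurons, or flipping the sign of the weights attached to one hidden neuron, leaves the output of a small tanh MLP unchanged. The architecture (two inputs, three hidden neurons, one linear output), the random weights and all variable names are assumptions made for the example.

```python
import numpy as np

def mlp_tanh(x, W1, b1, W2, b2):
    """One-hidden-layer MLP with tanh hidden activation and linear output."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x = rng.normal(size=2)
out = mlp_tanh(x, W1, b1, W2, b2)

# Permutation: reorder the rows of W1 and b1 and the columns of W2 alike.
perm = np.array([2, 0, 1])
out_perm = mlp_tanh(x, W1[perm], b1[perm], W2[:, perm], b2)

# Sign flip of hidden neuron 0: tanh is odd, so negating its incoming
# weights and bias together with its outgoing weight cancels out.
sign = np.array([-1.0, 1.0, 1.0])
out_flip = mlp_tanh(x, sign[:, None] * W1, sign * b1, W2 * sign, b2)
```

All three parameter vectors differ, yet they give the same output and hence the same likelihood value, which is the source of the equivalent posterior modes discussed in this subsection.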

It is known that the set of linear invertible parameter transformations that leaves the output unchanged is a subgroup $T$ of the group of invertible linear mappings from the parameter space $\mathbb{R}^n$ to itself (Hecht-Nielsen, 1990). $T$ is a transformation group acting on the parameter space $\mathbb{R}^n$. It can be shown that for each permutable feedforward neural network, there exists a cone $H \subset \mathbb{R}^n$ dependent only on the network architecture such that for any parameter $\theta \in \mathbb{R}^n$ there exist $\eta \in H$ and $\tau \in T$ such that $\tau\eta = \theta$. This relation means that every network parameter is equivalent to a parameter in the proper subset $H$ of $\mathbb{R}^n$ (Hecht-Nielsen, 1990). Neural networks with convolutions, max-pooling and batch-normalization contain more types of weight symmetries than MLPs (Badrinarayanan, Mishra and Cipolla, 2015).

In practice, the parameter space of a neural network is set to be the whole of $\mathbb{R}^n$ rather than a cone $H$ of $\mathbb{R}^n$. Since a neural network likelihood with support in the non-reduced parameter space of $\mathbb{R}^n$ is invariant under weight permutations, sign flips or other transformations, the posterior landscape includes multiple equally likely modes. This implies a low acceptance rate, entrapment in local modes and convergence challenges for MCMC. Additionally, computational time is wasted during MCMC, since posterior modes represent equivalent solutions (Nalisnick, 2018). Such challenges manifest themselves in the MLP examples of section 4. For neural networks with a higher number $n$ of parameters in $\mathbb{R}^n$, the topology of the likelihood is characterized by local optima embedded in high-dimensional flat plateaus (Brea et al., 2019). Thereby, larger neural networks lead to a multimodal target density with symmetric modes for MCMC.

Seeking parameter symmetries in neural networks can lead to a variety of NP-hard problems (Ensign et al., 2017). Moreover, symmetries in neural networks pose identifiability and associated inferential challenges in Bayesian inference, but they also provide opportunities to develop inferential methods with reduced computational cost (Hu, Zagoruyko and Komodakis, 2019) or with improved predictive performance (Moore, 2016). Empirical evidence from stochastic optimization simulations suggests that removing weight symmetries has a negative effect on prediction accuracy in smaller and shallower convolutional neural networks (CNNs), but has no effect on prediction accuracy in larger and deeper CNNs (Maddison et al., 2015).

Imposing constraints on neural network weights is one way of removing symmetries, leading to better mixing for MCMC (Sen, Papamarkou and Dunson, 2020). More generally, exploitation of weight symmetries provides scope for scalable Bayesian inference in deep learning by reducing the measure or dimension of parameter space. Bayesian inference in subspaces of parameter space for deep learning has been proposed before (Izmailov et al., 2020).

Lack of identifiability is not unique to neural networks. For instance, the likelihood of mixture models is invariant under relabelling of the mixture components, a condition known as the label switching problem (Stephens, 2000).

The high-dimensional parameter space of neural networks is another source of non-identifiability. A necessary condition for identifiability is that the number of data points must be larger than the number of parameters. This is one reason why big datasets are required for training neural networks.

2.4 Prior specification

Parameter priors have been used for generating Bayesian smoothing or regularization effects. For instance, de Freitas (1999) develops sequential Monte Carlo methods with smoothing priors for MLPs and Williams (1995) introduces Bayesian regularization and pruning for neural networks via a Laplace prior.

When parameter prior specification for a neural network is not driven by smoothing or regularization, the question becomes how to choose the prior. The choice of parameter prior for a neural network is crucial in that it affects the parameter posterior (Lee, 2004), and consequently the posterior predictive distribution (Lee, 2005).

Neural networks are commonly applied to big data. For large amounts of data, practitioners may not have intuition about the relationship between input and output variables. Furthermore, it is an open research question how to interpret neural network weights and biases. As a priori knowledge about big datasets and about neural network parameters is typically not available, prior elicitation from experts is not applicable to neural networks.

It seems logical to choose a prior that reflects a priori ignorance about the parameters. A constant-valued prior is a possible candidate, with the caveat of being improper for unbounded parameter spaces, such as $\mathbb{R}^n$. However, for neural networks, an improper prior can result in an improper parameter posterior (Lee, 2005).

Typically, a truncated flat prior for neural networks is sufficient for ensuring a valid parameter posterior (Lee, 2005). At the same time, the choice of truncation bounds depends on weight symmetry and consequently on the allocation of equivalent points in the parameter space. Lee (2003) proposes a restricted flat prior for feedforward neural networks by bounding some of the parameters and by imposing constraints that guarantee layer-wise linear independence between activations, while Lee (2000) shows that this prior is asymptotically consistent for the posterior. Moreover, Lee (2003) demonstrates that such a restricted flat prior enables more effective MCMC sampling in comparison to alternative prior choices.

Objective prior specification is an area of statistics that has not infiltrated Bayesian inference for neural networks. Alternative ideas for constructing objective priors with minimal effect on posterior inference exist in the statistics literature. For example, Jeffreys priors are invariant to differentiable one-to-one transformations of the parameters (Jeffreys, 1962), maximum entropy priors maximize the Shannon entropy and therefore provide the least possible information (Jaynes, 1968), reference priors maximize the expected Kullback-Leibler divergence from the associated posteriors and in that sense are the least informative priors (Bernardo, 1979), and penalised complexity priors penalise the complexity induced by deviating from a simpler base model (Simpson et al., 2017).

To the best of the authors’ knowledge, there are only two published lines of research on objective priors for neural networks: a theoretical derivation of Jeffreys and reference priors for feedforward neural networks by Lee (2007), and an approximation of reference priors via Monte Carlo sampling of a differentiable non-centered parameterization of MLPs and CNNs by Nalisnick (2018).

More broadly, research on prior specification for BNNs has been published recently (Pearce et al., 2019; Vladimirova et al., 2019). For a more thorough review of prior specification for BNNs, see Lee (2005).

2.5 Convergence

MCMC convergence depends on the target density, namely on its multi-modality and level of smoothness. An MLP with fewer than a hundred parameters fitted to a non-linearly separable dataset makes convergence in fixed MCMC sampling time challenging (see subsection 4.3).

Attaining MCMC convergence is not the only challenge. Assessing whether a finite sample from an MCMC algorithm represents an underlying target density cannot be done with certainty (Cowles and Carlin, 1996). MCMC diagnostics can fail to detect the type of convergence failure they were designed to identify. Combinations of diagnostics are thus used in practice to evaluate MCMC convergence with reduced risk of false diagnosis.

MCMC diagnostics were initially designed for asymptotically exact MCMC. Research activity on approximate MCMC has emerged recently. Minibatch MCMC methods (see subsection 2.1) are one class of approximate MCMC methods. Alternative approximate MCMC techniques without minibatching have been developed (Rudolf and Schweizer, 2018; Chen et al., 2019) along with new approaches to quantify convergence (Chwialkowski, Strathmann and Gretton, 2016).

Quantization and discrepancy are two notions pertinent to approximate MCMC methods. The quantization of a target density $p$ by an empirical measure $\hat{p}$ provides an approximation to the target $p$ (Graf and Luschgy, 2007), while the notion of discrepancy quantifies how well the empirical measure $\hat{p}$ approximates the target $p$ (Chen et al., 2019). The kernel Stein discrepancy (KSD) and the maximum mean discrepancy (MMD) constitute two instances of discrepancy; for more details, see Chen et al. (2019) and Gretton et al. (2012), respectively. Rudolf and Schweizer (2018) provide an alternative way of assessing the quality of approximation of a target density $p$ by an empirical measure $\hat{p}$ in the context of approximate MCMC using the notion of Wasserstein distance between $p$ and $\hat{p}$.
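As a rough illustration of a discrepancy, the squared MMD between two samples can be estimated with a kernel. The sketch below uses a Gaussian kernel and the biased V-statistic estimator; the bandwidth, the toy samples and the function name mmd2 are assumptions made for illustration and are not taken from the cited works.

```python
import numpy as np

def mmd2(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples X and Y.

    X and Y have shapes (a, n) and (b, n); a Gaussian kernel is assumed.
    """
    def kernel(A, B):
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
same = mmd2(X, rng.normal(size=(200, 2)))               # same density
shifted = mmd2(X, rng.normal(loc=2.0, size=(200, 2)))   # shifted density
```

A sample drawn from a shifted density yields a visibly larger value than a second sample from the same density, which is the sense in which a discrepancy measures approximation quality.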

3. INFERENTIAL FRAMEWORK OVERVIEW

An overview of the inferential framework used in this paper follows, including the MLP model and its likelihood for classification, MCMC samplers for parameter estimation, MCMC diagnostics for assessing convergence and sampling effectiveness, and Bayesian marginalization for prediction.

3.1 The MLP model

MLPs have been chosen as a more tractable class of neural networks. CNNs are the most widely used deep learning models. However, even small CNNs, such as AlexNet (Krizhevsky, Sutskever and Hinton, 2012), SqueezeNet (Iandola et al., 2016), Xception (Chollet, 2017), MobileNet (Howard et al., 2017), ShuffleNet (Zhang et al., 2018b), EffNet (Freeman, Roese-Koerner and Kummert, 2018) or DCTI (Truong, Nguyen and Tran, 2018), have at least two orders of magnitude more parameters, thus amplifying issues of computational complexity, model structure, weight symmetry, prior specification, posterior shape, MCMC convergence and sampling effectiveness.

3.1.1 Model definition. An MLP is a feedforward neural network consisting of an input layer, one or more hidden layers and an output layer (Rosenblatt, 1958; Minsky and Papert, 1988; Hastie, Tibshirani and Friedman, 2016). Let $\rho \ge 2$ be a natural number. Consider an index $j \in \{0, 1, \ldots, \rho\}$ indicating the layer, where $j = 0$ refers to the input layer, $j = 1, 2, \ldots, \rho - 1$ to one of the $\rho - 1$ hidden layers and $j = \rho$ to the output layer. Let $\kappa_j$ be the number of neurons in layer $j$ and use $\kappa_{0:\rho} = (\kappa_0, \kappa_1, \ldots, \kappa_\rho)$ as a shorthand for the sequence of neuron counts per layer. Under such notation, MLP($\kappa_{0:\rho}$) refers to an MLP with $\rho - 1$ hidden layers and $\kappa_j$ neurons at layer $j$.

An MLP($\kappa_{0:\rho}$) with $\rho - 1 \ge 1$ hidden layers and $\kappa_j$ neurons at layer $j$ is defined recursively as

\begin{align}
g_j(x_i, \theta_{1:j}) &= W_j h_{j-1}(x_i, \theta_{1:j-1}) + b_j, \tag{3.1} \\
h_j(x_i, \theta_{1:j}) &= \phi_j(g_j(x_i, \theta_{1:j})), \tag{3.2}
\end{align}

for $j = 1, 2, \ldots, \rho$. A data point $x_i \in \mathbb{R}^{\kappa_0}$ corresponds to the input layer $h_0(x_i) = x_i$, yielding the sequence $g_1(x_i, \theta_1) = W_1 x_i + b_1$ in the first hidden layer. $W_j$ and $b_j$ are the respective weights and biases at layer $j = 1, 2, \ldots, \rho$, which constitute the parameters $\theta_j = (W_j, b_j)$ at layer $j$. The shorthand $\theta_{1:j} = (\theta_1, \theta_2, \ldots, \theta_j)$ denotes all weights and biases up to layer $j$. Functions $\phi_j$, known as activations, are applied elementwise to their input $g_j$.

The default recommendation of activation in neural networks is a rectified linear unit (ReLU), see for instance Jarrett et al. (2009); Nair and Hinton (2009); Goodfellow, Bengio and Courville (2016). Other activations are the ELU, leaky ReLU, tanh and sigmoid (Nwankpa et al., 2018). If an activation is not present at layer $j$, then the identity function $\phi_j(g_j) = g_j$ is used as $\phi_j$ in (3.2).

The weight matrix $W_j$ in (3.1) has $\kappa_j$ rows and $\kappa_{j-1}$ columns, while the vector $b_j$ of biases has length $\kappa_j$. Concatenating all $\theta_j$ across hidden and output layers gives a parameter vector $\theta = \theta_{1:\rho} \in \mathbb{R}^n$ of length $n = \sum_{j=1}^{\rho} \kappa_j(\kappa_{j-1} + 1)$. To define $\theta$ uniquely, the convention to traverse weight matrix elements row-wise is made. Evidently, each of $g_j$ in (3.1) and $h_j$ in (3.2) has length $\kappa_j$.


The notation $W_{j,k,l}$ is introduced to point to the $(k, l)$-th element of weight matrix $W_j$ at layer $j$. Analogously, $b_{j,k}$ points to the $k$-th coordinate of bias vector $b_j$ at layer $j$.
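The recursion (3.1)-(3.2) and the parameter count $n$ can be sketched in a few lines. The sketch assumes that $\theta$ stores, layer by layer, the row-wise flattened $W_j$ followed by $b_j$, and that all hidden layers share one activation (tanh here); the function names and the choice of activation are illustrative assumptions.

```python
import numpy as np

def num_params(kappa):
    """n = sum over j = 1..rho of kappa_j * (kappa_{j-1} + 1)."""
    return sum(kappa[j] * (kappa[j - 1] + 1) for j in range(1, len(kappa)))

def unpack(theta, kappa):
    """Split theta into per-layer (W_j, b_j), reading each W_j row-wise."""
    layers, pos = [], 0
    for j in range(1, len(kappa)):
        rows, cols = kappa[j], kappa[j - 1]
        W = theta[pos:pos + rows * cols].reshape(rows, cols)  # row-major order
        pos += rows * cols
        b = theta[pos:pos + rows]
        pos += rows
        layers.append((W, b))
    return layers

def mlp_pre_output(x, theta, kappa, phi=np.tanh):
    """Return g_rho(x, theta), the pre-activation of the output layer."""
    layers = unpack(theta, kappa)
    h = x                               # h_0(x) = x
    for j, (W, b) in enumerate(layers, start=1):
        g = W @ h + b                   # equation (3.1)
        if j < len(layers):
            h = phi(g)                  # equation (3.2) at hidden layers
    return g

kappa = (2, 3, 1)                       # MLP(2, 3, 1): one hidden layer
n = num_params(kappa)                   # 3 * (2 + 1) + 1 * (3 + 1)
```

Returning the output pre-activation rather than $h_\rho$ leaves the choice of output activation (sigmoid or softmax) to the likelihood.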

3.1.2 Likelihood for binary classification. Consider $s$ samples $(x_i, y_i)$, $i = 1, 2, \ldots, s$, consisting of some input $x_i \in \mathbb{R}^{\kappa_0}$ and of a binary output $y_i \in \{0, 1\}$. An MLP($\kappa_0, \kappa_1, \ldots, \kappa_\rho = 1$) with a single neuron in its output layer can be used for setting the likelihood function $L(y_{1:s}|x_{1:s}, \theta)$ of labels $y_{1:s} = (y_1, y_2, \ldots, y_s)$ given the input $x_{1:s} = (x_1, x_2, \ldots, x_s)$ and MLP parameters $\theta$.

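Firstly, the sigmoid activation function $\phi_\rho(g_\rho) = 1/(1 + \exp(-g_\rho))$ is applied at the output layer of the MLP. So, the event probabilities $\Pr(y_i = 1|x_i, \theta)$ are set to

\begin{equation}
\begin{aligned}
\Pr(y_i = 1|x_i, \theta) &= h_\rho(x_i, \theta) = \phi_\rho(g_\rho(x_i, \theta)) \\
&= \frac{1}{1 + \exp(-g_\rho(x_i, \theta))}.
\end{aligned}
\tag{3.3}
\end{equation}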

Assuming that the labels are outcomes of $s$ independent draws from Bernoulli probability mass functions with event probabilities given by (3.3), the likelihood becomes

\begin{equation}
L(y_{1:s}|x_{1:s}, \theta) = \prod_{i=1}^{s} \prod_{k=1}^{2} (z_{\rho,k}(x_i, \theta))^{\mathbf{1}_{\{y_i = k-1\}}}.
\tag{3.4}
\end{equation}

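$z_{\rho,k}(x_i, \theta)$, $k = 1, 2$, denotes the $k$-th coordinate of the vector $z_\rho(x_i, \theta) = (1 - h_\rho(x_i, \theta), h_\rho(x_i, \theta))$ of event probabilities for sample $i = 1, 2, \ldots, s$. Furthermore, $\mathbf{1}$ denotes the indicator function, that is $\mathbf{1}_{\{y_i = k-1\}} = 1$ if $y_i = k - 1$, and $\mathbf{1}_{\{y_i = k-1\}} = 0$ otherwise. The log-likelihood follows as

\begin{equation}
\ell(y_{1:s}|x_{1:s}, \theta) = \sum_{i=1}^{s} \sum_{k=1}^{2} \mathbf{1}_{\{y_i = k-1\}} \log(z_{\rho,k}(x_i, \theta)).
\tag{3.5}
\end{equation}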

The negative value of log-likelihood (3.5) is known as the binary cross entropy (BCE). To infer the parameters $\theta$ of MLP($\kappa_0, \kappa_1, \ldots, \kappa_\rho = 1$), the binary cross entropy or a different loss function is minimized using stochastic optimization methods, such as stochastic gradient descent (SGD).
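Log-likelihood (3.5) can be evaluated directly from the output pre-activations $g_\rho(x_i, \theta)$. The sketch below is one numerically stable way of doing so, using the identities $\log h_\rho = -\log(1 + e^{-g_\rho})$ and $\log(1 - h_\rho) = -\log(1 + e^{g_\rho})$; the function name and the toy values of g and y are assumptions for illustration.

```python
import numpy as np

def bernoulli_log_lik(g, y):
    """Log-likelihood (3.5) from pre-activations g, shape (s,), and labels y in {0, 1}."""
    log_p1 = -np.logaddexp(0.0, -g)   # log Pr(y_i = 1 | x_i, theta)
    log_p0 = -np.logaddexp(0.0, g)    # log Pr(y_i = 0 | x_i, theta)
    return np.sum(np.where(y == 1, log_p1, log_p0))

g = np.array([2.0, -1.0, 0.0])        # hypothetical output pre-activations
y = np.array([1, 0, 1])               # hypothetical labels
bce = -bernoulli_log_lik(g, y)        # binary cross entropy as a sum over samples
```

Working with `logaddexp` avoids overflow for large absolute values of $g_\rho$, where evaluating the sigmoid first and taking its logarithm afterwards would return minus infinity.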

3.1.3 Likelihood for multiclass classification. Let $y_i \in \{1, 2, \ldots, \kappa_\rho\}$ be an output variable, which can take $\kappa_\rho \ge 2$ values. Moreover, consider an MLP($\kappa_{0:\rho}$) with $\kappa_\rho$ neurons in its output layer.

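Initially, a softmax activation function $\phi_\rho(g_\rho) = \exp(g_\rho) / \sum_{k=1}^{\kappa_\rho} \exp(g_{\rho,k})$ is applied at the output layer of the MLP, where $g_{\rho,k}$ denotes the $k$-th coordinate of the $\kappa_\rho$-length vector $g_\rho$. Thus, the event probabilities $\Pr(y_i = k|x_i, \theta)$ are

\begin{equation}
\begin{aligned}
\Pr(y_i = k|x_i, \theta) &= h_{\rho,k}(x_i, \theta) \\
&= \phi_\rho(g_{\rho,k}(x_i, \theta)) \\
&= \frac{\exp(g_{\rho,k}(x_i, \theta))}{\sum_{r=1}^{\kappa_\rho} \exp(g_{\rho,r}(x_i, \theta))}.
\end{aligned}
\tag{3.6}
\end{equation}

$h_{\rho,k}(x_i, \theta)$ denotes the $k$-th coordinate of the MLP output $h_\rho(x_i, \theta)$.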

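It is assumed that the labels are outcomes of $s$ independent draws from categorical probability mass functions with event probabilities given by (3.6), so the likelihood is

\begin{equation}
L(y_{1:s}|x_{1:s}, \theta) = \prod_{i=1}^{s} \prod_{k=1}^{\kappa_\rho} (h_{\rho,k}(x_i, \theta))^{\mathbf{1}_{\{y_i = k\}}}.
\tag{3.7}
\end{equation}

The log-likelihood follows as

\begin{equation}
\ell(y_{1:s}|x_{1:s}, \theta) = \sum_{i=1}^{s} \sum_{k=1}^{\kappa_\rho} \mathbf{1}_{\{y_i = k\}} \log(h_{\rho,k}(x_i, \theta)).
\tag{3.8}
\end{equation}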

The negative value of log-likelihood (3.8) is known as cross entropy, and it is used as loss function for stochastic optimization in multiclass classification MLPs.

An MLP($\kappa_0, \kappa_1, \ldots, \kappa_\rho = 2$) with two neurons at the output layer, event probabilities given by softmax activation (3.6) and log-likelihood (3.8) can be used for binary classification. Such a formulation is an alternative to an MLP($\kappa_0, \kappa_1, \ldots, \kappa_\rho = 1$) with one neuron at the output layer, event probabilities given by sigmoid activation (3.3) and log-likelihood (3.5). The difference between the two MLP models is the parameterization of event probabilities, since a categorical distribution with $\kappa_\rho = 2$ levels otherwise coincides with a Bernoulli distribution.
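The multiclass log-likelihood (3.8) and the relation between the two binary parameterizations can be checked numerically. With two output neurons, the softmax probability of the second class equals a sigmoid applied to the difference $g_{\rho,2} - g_{\rho,1}$. The function names and the toy pre-activations below are assumptions made for the sketch.

```python
import numpy as np

def log_softmax(G):
    """Row-wise log-softmax of an (s, kappa_rho) array of output pre-activations."""
    G = G - G.max(axis=1, keepdims=True)      # stabilise the exponentials
    return G - np.log(np.exp(G).sum(axis=1, keepdims=True))

def categorical_log_lik(G, y):
    """Log-likelihood (3.8); labels y take values in {1, ..., kappa_rho}."""
    return log_softmax(G)[np.arange(len(y)), y - 1].sum()

# Two output neurons: Pr(y = 2) under softmax versus sigmoid(g_2 - g_1).
G = np.array([[0.3, 1.2], [-0.5, 0.1]])       # hypothetical pre-activations
p2_softmax = np.exp(log_softmax(G))[:, 1]
p2_sigmoid = 1.0 / (1.0 + np.exp(-(G[:, 1] - G[:, 0])))
```

The two probability vectors agree, so the softmax model with $\kappa_\rho = 2$ carries one redundant degree of freedom in its output layer compared with the sigmoid model.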

3.2 MCMC sampling for parameter estimation

Interest is in sampling from the parameter posterior $p(\theta|x_{1:s}, y_{1:s}) \propto L(y_{1:s}|x_{1:s}, \theta)\pi(\theta)$ of a neural network given the neural network likelihood $L(y_{1:s}|x_{1:s}, \theta)$ and parameter prior $\pi(\theta)$. For MLPs, the likelihood $L(y_{1:s}|x_{1:s}, \theta)$ for binary and multiclass classification is provided by (3.4) and (3.7), respectively.

The parameter posterior $p(\theta|x_{1:s}, y_{1:s})$ is alternatively denoted by $p(\theta|D_{1:s})$ for brevity. $D_{1:s} = (x_{1:s}, y_{1:s})$ is a dataset of size $s$ consisting of input $x_{1:s}$ and output $y_{1:s}$.

This subsection provides an introduction to the MCMC algorithms and MCMC diagnostics used in the examples of section 4. Three MCMC algorithms are outlined, namely Metropolis-Hastings, Hamiltonian Monte Carlo, and power posterior sampling. Two MCMC diagnostics are described, the multivariate potential scale reduction factor (PSRF) and the multivariate effective sample size (ESS).

3.2.1 Metropolis-Hastings algorithm. One of the most general algorithms for sampling from a posterior $p(\theta|D_{1:s})$ is the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970). Given the current state $\theta$, the MH algorithm initially samples a state $\theta^*$ from a proposal density $g_\theta$ and subsequently accepts the proposed state $\theta^*$ with probability

\[
\begin{cases}
\min\left\{ \dfrac{p(\theta^*|D_{1:s})\, g_{\theta^*}(\theta)}{p(\theta|D_{1:s})\, g_\theta(\theta^*)},\, 1 \right\} & \text{if } p(\theta|D_{1:s})\, g_\theta(\theta^*) > 0, \\
1 & \text{otherwise.}
\end{cases}
\]

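Typically, a normal proposal density $g_\theta = \mathcal{N}(\theta, \Lambda)$ with a constant covariance matrix $\Lambda$ is used. Such a normal $g_\theta$ is symmetric, that is $g_\theta(\theta^*) = g_{\theta^*}(\theta)$, so the acceptance probability simplifies to $\min\{p(\theta^*|D_{1:s})/p(\theta|D_{1:s}), 1\}$, yielding the so-called random walk Metropolis algorithm.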
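A minimal sketch of random walk Metropolis follows, assuming an isotropic proposal covariance $\Lambda$ equal to a squared step size times the identity and a user-supplied unnormalised log-posterior. The function name, the step size and the bivariate normal toy target are illustrative assumptions rather than settings used in this paper.

```python
import numpy as np

def random_walk_metropolis(log_post, theta0, n_iter, step=0.5, seed=0):
    """Random walk Metropolis with proposal N(theta, step^2 * I)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    n_accept = 0
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        # Accept with probability min{p(prop | D) / p(theta | D), 1}, on the log scale.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
            n_accept += 1
        chain[t] = theta                 # a rejected proposal repeats the current state
    return chain, n_accept / n_iter

# Hypothetical target: standard bivariate normal, known up to a constant.
chain, acc_rate = random_walk_metropolis(lambda th: -0.5 * th @ th, np.zeros(2), 5000)
```

Only ratios of posterior densities enter the acceptance step, so the normalising constant of $p(\theta|D_{1:s})$ is never needed.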

3.2.2 Hamiltonian Monte Carlo. Hamiltonian Monte Carlo (HMC) draws samples from an augmented parameter space via Gibbs steps, by computing a trajectory in the parameter space according to Hamiltonian dynamics. For a more detailed review of HMC, see Neal (2011).
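A single HMC transition can be sketched as follows, assuming an identity mass matrix, a leapfrog integrator and an MH correction, as in the standard formulation reviewed by Neal (2011). The step size eps, the number of leapfrog steps, the function names and the standard normal toy target are illustrative assumptions and not settings used in this paper.

```python
import numpy as np

def hmc_step(log_post, grad_log_post, theta, eps, n_leapfrog, rng):
    """One HMC transition; returns the next state and an acceptance flag."""
    p = rng.standard_normal(theta.size)              # resample the momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new += 0.5 * eps * grad_log_post(theta_new)    # initial half step for momentum
    for i in range(n_leapfrog):
        theta_new += eps * p_new                     # full step for position
        if i < n_leapfrog - 1:
            p_new += eps * grad_log_post(theta_new)  # full step for momentum
    p_new += 0.5 * eps * grad_log_post(theta_new)    # final half step for momentum
    # Hamiltonian: negative log-posterior plus kinetic energy.
    h_old = -log_post(theta) + 0.5 * p @ p
    h_new = -log_post(theta_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < h_old - h_new:        # MH correction
        return theta_new, True
    return theta, False

# Hypothetical target: standard bivariate normal with gradient -theta.
rng = np.random.default_rng(1)
theta0 = np.array([1.0, -1.0])
theta1, accepted = hmc_step(lambda th: -0.5 * th @ th, lambda th: -th,
                            theta0, 1e-3, 10, rng)
```

The gradient of the log-posterior is required at every leapfrog step, which is one source of the computational cost discussed in subsection 2.1.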

3.2.3 Power posterior sampling. Power posterior (PP) sampling by Friel and Pettitt (2008) is a population Monte Carlo algorithm. It involves $m + 1$ chains drawn from tempered versions $p_{t_i}(\theta|D_{1:s})$ of a target posterior $p(\theta|D_{1:s})$ for a temperature schedule $t_i \in [0, 1]$, $i \in \{0, 1, \ldots, m\}$, where $t_m = 1$. At each iteration, the state of each chain is updated using an MCMC sampler associated with that chain and subsequently states between pairs of chains are swapped according to an MH algorithm. For the $i$-th chain, a sample $j$ is drawn from a probability mass function $p_i$ with probability $p_i(j)$, in order to determine the pair $(i, j)$ for a possible swap.

Power posteriors $p_{t_i}(\theta|D_{1:s})$, $t_i < t_m$, are smooth approximations of the target density $p_{t_m}(\theta|D_{1:s}) = p(\theta|D_{1:s})$, facilitating exploration of the parameter space via state transitions between chains of $p_{t_i}(\theta|D_{1:s})$ and of $p(\theta|D_{1:s})$. In this paper, a categorical probability mass function $p_i$ is used in PP sampling for determining candidate pairs of chains for state swaps (see Appendix A).
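The swap move can be sketched under two assumptions that are not spelled out in this subsection: the tempered posterior takes the form $p_t(\theta|D_{1:s}) \propto L(y_{1:s}|x_{1:s}, \theta)^t \pi(\theta)$, and the pair $(i, j)$ is proposed symmetrically, so that no proposal correction enters the ratio. Under these assumptions the priors cancel and the log acceptance ratio reduces to $(t_i - t_j)(\ell_j - \ell_i)$, where $\ell_i$ is the log-likelihood at the state of chain $i$. The function names are hypothetical, and the pair-selection scheme of Appendix A is not reproduced.

```python
import numpy as np

def swap_log_ratio(loglik_i, loglik_j, t_i, t_j):
    """Log MH ratio for exchanging the states of chains i and j (assumptions above)."""
    return (t_i - t_j) * (loglik_j - loglik_i)

def maybe_swap(states, logliks, temps, i, j, rng):
    """Swap the states of chains i and j in place with the MH probability.

    states and logliks are Python lists with one entry per chain.
    """
    log_ratio = swap_log_ratio(logliks[i], logliks[j], temps[i], temps[j])
    if np.log(rng.uniform()) < log_ratio:
        states[i], states[j] = states[j], states[i]
        logliks[i], logliks[j] = logliks[j], logliks[i]
        return True
    return False
```

A swap that moves a state of higher likelihood to the chain with the higher temperature parameter has a non-negative log ratio and is always accepted.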

3.2.4 Multivariate PSRF. PSRF, commonly denoted by $\hat{R}$, is an MCMC diagnostic of convergence conceived by Gelman and Rubin (1992) and extended to its multivariate version by Brooks and Gelman (1998). This paper uses the multivariate PSRF by Brooks and Gelman (1998), which provides a single-number summary of convergence across the $n$ dimensions of a parameter, requiring a Monte Carlo covariance matrix estimator for the parameter.

To acquire the multivariate PSRF, the multivariate initial monotone sequence estimator (MINSE) of Monte Carlo covariance is employed (Dai and Jones, 2017). In a Bayesian setting, the MINSE estimates the covariance matrix of a parameter posterior $p(\theta|D_{1:s})$.

To compute PSRF, several independent Markov chains are simulated. Gelman et al. (2004) recommend terminating MCMC sampling as soon as $\hat{R} < 1.1$. More recently, Vats and Knudson (2018) make an argument based on ESS that a cut-off of 1.1 for $\hat{R}$ is too high to estimate a Monte Carlo mean with reasonable uncertainty. Vehtari et al. (2019) recommend simulating at least $m = 4$ chains to compute $\hat{R}$ and using a threshold of $\hat{R} < 1.01$.
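For intuition, the univariate PSRF of Gelman and Rubin (1992) for a single coordinate can be computed from within-chain and between-chain variances. The sketch below implements that classical univariate version, not the multivariate PSRF with MINSE used in this paper; the function name and the simulated chains are assumptions for illustration.

```python
import numpy as np

def psrf(chains):
    """Univariate PSRF from an (m, v) array: m chains of length v for one coordinate."""
    m, v = chains.shape
    within = chains.var(axis=1, ddof=1).mean()           # W: mean within-chain variance
    between_over_v = chains.mean(axis=1).var(ddof=1)     # B / v: variance of chain means
    var_plus = (v - 1) / v * within + between_over_v     # pooled variance estimate
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))             # four chains on the same target
stuck = mixed + np.arange(4)[:, None]          # four chains stuck around different modes
r_mixed, r_stuck = psrf(mixed), psrf(stuck)
```

Chains that explore the same region give a value close to 1, whereas chains trapped around different modes inflate the between-chain variance and push the value well above the thresholds quoted above.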

3.2.5 Multivariate ESS. The ESS of an estimate obtained from a Markov chain realization is interpreted as the number of independent samples that provide an estimate with variance equal to the variance of the estimate obtained from the Markov chain realization. For a more extensive treatment entailing univariate approaches to ESS, see Vats and Flegal (2018); Gong and Flegal (2016); Kass et al. (1998).

R̂ and its variants can fail to diagnose poor mixing of a Markov chain, whereas low values of ESS are an indicator of poor mixing. It is thus recommended to check both R̂ and ESS (Vehtari et al., 2019). For a theoretical treatment of the relation between R̂ and ESS, see Vats and Knudson (2018).

Univariate ESS pertains to a single coordinate of an n-dimensional parameter. Vats, Flegal and Jones (2019) introduce a multivariate version of ESS, which provides a single-number summary of sampling effectiveness across the n dimensions of a parameter. Similarly to multivariate PSRF (Brooks and Gelman, 1998), multivariate ESS (Vats, Flegal and Jones, 2019) requires a Monte Carlo covariance matrix estimator for the parameter.

Given a single Markov chain realization of length v for an n-dimensional parameter, Vats, Flegal and Jones (2019) define multivariate ESS as

$$S = v \left( \frac{\det(E)}{\det(C)} \right)^{1/n}.$$

det(E) is the determinant of the empirical covariance matrix E and det(C) is the determinant of a Monte Carlo covariance matrix estimate C for the chain. In this paper, the multivariate ESS by Vats, Flegal and Jones (2019) is used, setting C to be the MINSE for the chain.
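The multivariate ESS formula can be sketched in a few lines. As a loudly labelled substitution, the example below uses a batch means estimator for C instead of the MINSE employed in this paper, purely because batch means is short to write down. All function names, the batch size and the AR(1) toy chain are assumptions made for illustration.

```python
import math
import random


def covariance(rows):
    """Sample covariance matrix of a list of n-dimensional rows."""
    k, n = len(rows), len(rows[0])
    mean = [sum(r[d] for r in rows) / k for d in range(n)]
    cov = [[0.0] * n for _ in range(n)]
    for r in rows:
        dev = [r[d] - mean[d] for d in range(n)]
        for a in range(n):
            for b in range(n):
                cov[a][b] += dev[a] * dev[b] / (k - 1)
    return cov


def log_det(matrix):
    """Log-determinant of a positive definite matrix via elimination."""
    a = [row[:] for row in matrix]
    total = 0.0
    for p in range(len(a)):
        total += math.log(a[p][p])
        for r in range(p + 1, len(a)):
            factor = a[r][p] / a[p][p]
            for c in range(p, len(a)):
                a[r][c] -= factor * a[p][c]
    return total


def batch_means_cov(chain, batch_size):
    """Batch means estimate of Monte Carlo covariance (stand-in for MINSE)."""
    n = len(chain[0])
    means = []
    for k in range(len(chain) // batch_size):
        batch = chain[k * batch_size:(k + 1) * batch_size]
        means.append([sum(r[d] for r in batch) / batch_size for d in range(n)])
    return [[batch_size * x for x in row] for row in covariance(means)]


def multivariate_ess(chain, batch_size):
    """S = v * (det(E) / det(C))^(1/n), computed on the log scale."""
    v, n = len(chain), len(chain[0])
    log_ratio = log_det(covariance(chain)) - log_det(batch_means_cov(chain, batch_size))
    return v * math.exp(log_ratio / n)


def simulate_ar1(v, n, rho, seed=0):
    """Toy chain: n independent AR(1) coordinates with autocorrelation rho."""
    rng = random.Random(seed)
    state = [0.0] * n
    chain = []
    for _ in range(v):
        state = [rho * s + rng.gauss(0.0, 1.0) for s in state]
        chain.append(state)
    return chain
```

For an uncorrelated toy chain (rho = 0) the estimate stays near the chain length v, whereas strong autocorrelation (rho = 0.9) reduces it to a small fraction of v.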

3.3 Bayesian marginalization for prediction

This subsection briefly reviews the notion of posterior predictive distribution based on Bayesian marginalization, posterior predictive distribution approximation via Monte Carlo integration, and associated binary and multiclass classification.

3.3.1 Posterior predictive distribution. Consider a set D_{1:s} = (x_{1:s}, y_{1:s}) of s training data points and a single test data point (x, y) consisting of some test input x and test output y. Integrating out the parameters θ of a model fitted to D_{1:s} yields the posterior predictive distribution

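$$\underbrace{p(y \mid x, D_{1:s})}_{\text{Predictive distribution}} = \int \underbrace{p(y \mid x, \theta)}_{\text{Likelihood}} \, \underbrace{p(\theta \mid D_{1:s})}_{\text{Parameter posterior}} \, d\theta. \tag{3.9}$$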

Appendix B provides a derivation of (3.9).

3.3.2 Monte Carlo approximation. (3.9) can be written as

$$p(y \mid x, D_{1:s}) = \mathrm{E}_{\theta \mid D_{1:s}}[p(y \mid x, \theta)]. \tag{3.10}$$

(3.10) states the posterior predictive distribution p(y|x, D_{1:s}) as an expectation of the likelihood p(y|x, θ) evaluated at the test output y with respect to the parameter posterior p(θ|D_{1:s}) learnt from the training set D_{1:s}.

The expectation in (3.10) can be approximated via Monte Carlo integration. More specifically, a Monte Carlo approximation of the posterior predictive distribution is given by

$$p(y \mid x, D_{1:s}) \simeq \frac{1}{v} \sum_{k=1}^{v} p(y \mid x, \omega_k). \tag{3.11}$$

The sum in (3.11) involves evaluations of the likelihood across v iterations ω_k, k = 1, 2, . . . , v, of a Markov chain realization ω_{1:v} obtained from the parameter posterior p(θ|D_{1:s}).
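The Monte Carlo approximation of the posterior predictive distribution amounts to averaging the likelihood over the retained iterations of a chain. The sketch below illustrates this with a toy logistic likelihood, which is a hypothetical stand-in for an MLP; the function names are assumptions and do not correspond to the eeyore API.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def likelihood(y, x, theta):
    """Bernoulli likelihood p(y | x, theta) of a toy logistic model."""
    p_one = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return p_one if y == 1 else 1.0 - p_one


def posterior_predictive(y, x, chain):
    """Average the likelihood over the v iterations of a chain realization."""
    return sum(likelihood(y, x, omega) for omega in chain) / len(chain)
```

Because each likelihood evaluation is a valid probability mass over y, the averaged values also sum to one across the possible labels.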

3.3.3 Classification rule. In the case of binary classification, the prediction ŷ for the test label y ∈ {0, 1} is

$$\hat{y} = \begin{cases} 1 & \text{if } p(y = 1 \mid x, D_{1:s}) \geq 0.5, \\ 0 & \text{otherwise.} \end{cases} \tag{3.12}$$

For multiclass classification, the prediction label ŷ for the test label y ∈ {1, 2, . . . , κ_ρ} is

$$\hat{y} = \arg\max_{y} \{ p(y \mid x, D_{1:s}) \}. \tag{3.13}$$

The classification rules (3.12) and (3.13) for binary and multiclass classification maximize the posterior predictive distribution. This way, predictions are made based on the Bayesian principle. The uncertainty of predictions is quantified, since the posterior predictive probability p(ŷ|x, D_{1:s}) of each predicted label ŷ is available.
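The two classification rules translate directly into code. The following minimal sketch assumes that the binary rule thresholds the posterior predictive probability of label 1 and that multiclass labels are numbered from 1; the function names are hypothetical.

```python
def predict_binary(prob_one):
    """Binary rule: predict 1 if p(y = 1 | x, D) >= 0.5, otherwise 0."""
    return 1 if prob_one >= 0.5 else 0


def predict_multiclass(probs):
    """Multiclass rule: return the 1-based label of maximum probability."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return best + 1
```

The probability attached to the returned label can be kept alongside the prediction as a measure of predictive uncertainty.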

4. EXAMPLES

Four examples of Bayesian inference for MLPs based on MCMC are presented. A different dataset is used for each example. The four datasets entail simulated noisy data from the exclusive-or (XOR) function, and observations collected from Pima Indians, penguins and hawks. Section 4.1 introduces the four datasets. Each of the four datasets is split into a training and a test set for parameter inference and for predictions, respectively. MLPs with one neuron in the output layer are fitted to the noisy XOR and Pima datasets to perform binary classification, while MLPs with three neurons in the output layer are fitted to the penguin and hawk datasets to perform multiclass classification with three classes. Table 1 shows the training and test sample sizes of the four datasets, and the fitted MLP models with their associated number n of parameters.

Table 1: Training and test sample sizes of the four datasets of section 4, architectures of fitted MLP models and associated number n of MLP parameters.

Dataset      Training size   Test size   Model             n
Noisy XOR    500             120         MLP(2, 2, 1)      9
Pima         262             130         MLP(8, 2, 2, 1)   27
Penguins     223             110         MLP(6, 2, 2, 3)   29
Hawks        596             295         MLP(6, 2, 2, 3)   29

In the examples, samples are drawn via MCMC from the unnormalized log-posterior

$$\log(p(\theta \mid x_{1:s}, y_{1:s})) = \ell(y_{1:s} \mid x_{1:s}, \theta) + \log(\pi(\theta))$$

of MLP parameters. The log-likelihood ℓ(y_{1:s}|x_{1:s}, θ) for binary or multiclass classification corresponds to (3.5) or (3.8). log(π(θ)) is the log-prior of MLP parameters.

4.1 Datasets

An introduction to the four datasets used in this paper follows. The simulated noisy XOR dataset does not contain missing values, while the real datasets for Pima, penguins and hawks come with missing values. Data points containing missing values in the chosen variables have been dropped from the three real datasets. All features (input variables) in the three real datasets have been standardized. The four datasets, in their final form used for inference and prediction, are available at https://github.com/papamarkou/bnn_mcmc_examples.

4.1.1 XOR dataset. The so-called XOR function f : {0, 1} × {0, 1} → {0, 1} returns 1 if exactly one of its binary input values is equal to 1, otherwise it returns 0. The s = 4 data points defining XOR are (x_1, y_1) = ((0, 0), 0), (x_2, y_2) = ((0, 1), 1), (x_3, y_3) = ((1, 0), 1) and (x_4, y_4) = ((1, 1), 0).

A perceptron without a hidden layer cannot learn the XOR function (Minsky and Papert, 1988). On the other hand, an MLP(2, 2, 1) with a single hidden layer of two neurons can learn the XOR function (Goodfellow, Bengio and Courville, 2016).

An MLP(2, 2, 1) has a parameter vector θ of length n = 9, as W_1, b_1, W_2 and b_2 have respective dimensions 2 · 2, 2 · 1, 2 · 1 and 1 · 1. Since the number s = 4 of data points defined by the exact XOR function is less than the number n = 9 of parameters in the fitted MLP(2, 2, 1), the parameters cannot be fully identified.

To circumvent the lack of identifiability arising from the limited number of data points, a larger dataset is simulated by introducing a noisy version of XOR. Firstly, consider the auxiliary function ψ : [−c, 1 + c] × [−c, 1 + c] → {0, 1} × {0, 1} given by

$$\begin{aligned}
\psi(u - c, u - c) &= (0, 0), \\
\psi(u - c, u + c) &= (0, 1), \\
\psi(u + c, u - c) &= (1, 0), \\
\psi(u + c, u + c) &= (1, 1).
\end{aligned}$$

ψ is presented in parametrized form, in terms of a constant c ∈ (0.5, 1) and a uniformly distributed random variable u ∼ U(0, 1). The noisy XOR function is then defined as the function composition f ◦ ψ.

A training and a test set of noisy XOR points, generated using f ◦ ψ and c = 0.55, are shown in figure 2a. 125 and 30 noisy XOR points per exact XOR point (x_i, y_i), i = 1, 2, 3, 4, are contained in the training and test set, respectively. So, the training and test sample sizes are 500 and 120, as reported in table 1 and as visualized in figure 2a.
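The construction of noisy XOR points can be sketched as follows. The sketch assumes that an independent u ∼ U(0, 1) is drawn for each of the two coordinates, which the text does not state explicitly; the function name noisy_xor and the seed argument are hypothetical, and the actual datasets are those provided in the repository.

```python
import random


def noisy_xor(points_per_corner, c=0.55, seed=0):
    """Simulate noisy XOR inputs and labels around the four exact XOR points."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for a in (0, 1):
        for b in (0, 1):
            for _ in range(points_per_corner):
                # Assumption: one independent uniform draw per coordinate.
                x1 = rng.random() + (c if a == 1 else -c)
                x2 = rng.random() + (c if b == 1 else -c)
                inputs.append((x1, x2))
                # The label is the exact XOR of the underlying corner (a, b).
                labels.append(a ^ b)
    return inputs, labels
```

Calling noisy_xor(125) and noisy_xor(30, seed=1) would produce sets of 500 and 120 points, matching the training and test sample sizes.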

In figure 2a, the training and test sets of noisy XOR points consist of two input variables (u ± 0.55, u ± 0.55) ∈ [−0.55, 1.55] × [−0.55, 1.55] and of one output variable f ◦ ψ(u ± 0.55, u ± 0.55) ∈ {0, 1}. The four colours classify noisy XOR input (u ± 0.55, u ± 0.55) with respect to the corresponding exact XOR input ψ(u ± 0.55, u ± 0.55) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}; the two different shapes classify noisy XOR output, with circle and triangle corresponding to 0 and 1.

4.1.2 Pima dataset. The Pima dataset contains observations taken from female patients of Pima Indian heritage. The binary output variable indicates whether or not a patient has diabetes. Eight features are used as diagnostics of diabetes, namely the number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, insulin level, body mass index, diabetes pedigree function and age.

For more information about the Pima dataset, see Smith et al. (1988). The original data, prior to removal of missing values and feature standardization, are available as the PimaIndiansDiabetes2 data frame of the mlbench R package.

4.1.3 Penguin dataset. The penguin dataset consists of body measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. Adelie, Chinstrap and Gentoo penguins are the three observed species. Four body measurements per penguin are taken, specifically body mass, flipper length, bill length and bill depth. The four body measurements, sex and location (island) make up a total of six features utilized for deducing the species to which a penguin belongs. Thus, the penguin species is used as output variable.

Horst, Hill and Gorman (2020) provide more details about the penguin dataset. In their original form, prior to data filtering, the data are available at https://github.com/allisonhorst/palmerpenguins.

4.1.4 Hawk dataset. The hawk dataset is composed of observations for three hawk species collected from Lake MacBride near Iowa City, Iowa. Cooper’s, red-tailed and sharp-shinned hawks are the three observed species. Age, wing length, body weight, culmen length, hallux length and tail length are the six hawk features employed in this paper for deducing the species to which a hawk belongs. So, the hawk species is used as output variable.

Cannon et al. (2019) mention that Emeritus Professor Bob Black at Cornell College shared the hawk dataset publicly. The original data, prior to data filtering, are available as the Hawks data frame of the Stat2Data R package.

4.2 Experimental configuration

To fully specify the MLP models of table 1, their activations are listed. A sigmoid activation function is applied at each hidden layer of each MLP. Additionally, a sigmoid activation function is applied at the output layer of MLP(2, 2, 1) and of MLP(8, 2, 2, 1), conforming to log-likelihood (3.5) for binary classification. A softmax activation function is applied at the output layer of MLP(6, 2, 2, 3), in accordance with log-likelihood (3.8) for multiclass classification. The same MLP(6, 2, 2, 3) model is fitted to the penguin and hawk datasets.

A normal prior π(θ) = N(0, 10I) is adopted for the parameters θ ∈ R^n of each MLP model shown in table 1. An isotropic covariance matrix 10I assigns relatively high prior variance, equal to 10, to each coordinate of θ, thus setting empirically a seemingly non-informative prior.
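To make the target of MCMC sampling concrete, the sketch below evaluates an unnormalized log-posterior for an MLP(2, 2, 1) with sigmoid activations, a Bernoulli log-likelihood and a N(0, 10I) prior. The ordering of the nine parameters inside theta, the clipping constant and the function names are assumptions made for illustration; eeyore implements the model in PyTorch rather than in plain Python.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mlp_2_2_1(x, theta):
    """Output of an MLP(2, 2, 1) with sigmoid activations.

    Assumed (hypothetical) ordering: theta = (W1 row-wise, b1, W2, b2).
    """
    w1, b1, w2, b2 = theta[0:4], theta[4:6], theta[6:8], theta[8]
    hidden = [
        sigmoid(w1[2 * k] * x[0] + w1[2 * k + 1] * x[1] + b1[k])
        for k in range(2)
    ]
    return sigmoid(w2[0] * hidden[0] + w2[1] * hidden[1] + b2)


def log_target(theta, inputs, labels, prior_var=10.0):
    """Bernoulli log-likelihood plus N(0, prior_var * I) log-prior."""
    eps = 1e-12  # clipping guards against log(0)
    loglik = 0.0
    for x, y in zip(inputs, labels):
        p = min(max(mlp_2_2_1(x, theta), eps), 1.0 - eps)
        loglik += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    n = len(theta)
    logprior = -0.5 * n * math.log(2.0 * math.pi * prior_var)
    logprior -= 0.5 * sum(t * t for t in theta) / prior_var
    return loglik + logprior
```

An MH or HMC sampler only needs such a function, and for HMC its gradient, to draw samples of θ.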

MH and HMC are run for each of the four examples of table 1. PP sampling incurs higher computational cost than MH and HMC; for this reason, PP sampling is run only for noisy XOR. Ten power posteriors are employed for PP sampling, and MH is used for within-chain moves. On the basis of pilot runs, the PP temperature schedule is set to t_i = 1, i = 0, 1, . . . , 9; this implies that each power posterior is set to be the parameter posterior and consequently between-chain moves are made among ten chains realized from the parameter posterior. Empirical hyperparameter tuning for MH, HMC and PP is carried out. The chosen MH proposal variance, HMC number of leapfrog steps and HMC leapfrog step size for each example can be found in https://github.com/papamarkou/bnn_mcmc_examples.

m = 10 Markov chains are realized for each combination of training dataset shown in table 1 and of MCMC sampler. 110,000 iterations are run per chain realization, 10,000 of which are discarded as burn-in. Thereby, v = 100,000 post-burnin iterations are retained per chain realization.

MINSE computation, required by multivariate PSRF and multivariate ESS, is carried out using v = 100,000 post-burnin iterations per realized chain. The multivariate PSRF for each dataset-sampler setting is computed across the m = 10 realized chains for the setting. On the other hand, the multivariate ESS is computed for each realized chain, and the mean across m = 10 ESSs is reported for each dataset-sampler setting.

Monte Carlo approximations of posterior predictive distributions are computed according to (3.11) for each data point of each test set. To reduce the computational cost, the last v = 10,000 iterations of each realized chain are used in (3.11).

Predictions for binary and multiclass classification are made using (3.12) and (3.13), respectively. Given a single chain realization from an MCMC sampler, predictions are made for every point in a test set; the predictive accuracy is then computed as the number of correct predictions over the total number of points in the test set. Subsequently, the mean of predictive accuracies across the m = 10 chains realized from the sampler is reported for the test set.
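The accuracy summary described above can be sketched as follows, assuming that the predictions of each chain are already available as a list of labels; the function names are hypothetical.

```python
def predictive_accuracy(predictions, labels):
    """Fraction of test points whose predicted label equals the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def mean_accuracy_across_chains(predictions_per_chain, labels):
    """Mean of the per-chain predictive accuracies on one test set."""
    accuracies = [predictive_accuracy(p, labels) for p in predictions_per_chain]
    return sum(accuracies) / len(accuracies)
```

Each element of predictions_per_chain would hold the predictions obtained from one of the m = 10 chain realizations.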

4.3 Numerical summaries

Table 2 shows numerical summaries for each set of m = 10 Markov chains realized by an MCMC sampler for a dataset-MLP combination of table 1. Multivariate PSRF and multivariate ESS diagnose the capacity of MCMC sampling to perform parameter inference. Predictive accuracy via Bayesian marginalization (3.11), based on classification rules (3.12) and (3.13) for binary and multiclass classification, demonstrates the predictive performance of MCMC sampling. The last column of table 2 displays the predictive accuracy via (3.11) with samples ω_k, k = 1, 2, . . . , v, drawn from the prior π(θ) = N(0, 10I), thus providing an approximation of the expected posterior predictive probability

$$\mathrm{E}_{\theta}[p(y \mid x, \theta)] = \int p(y \mid x, \theta) \, \pi(\theta) \, d\theta \tag{4.1}$$

with respect to prior π(θ).

PSRF is above 1.01 (Vehtari et al., 2019), indicating lack of convergence, in three out of four datasets. ESS is low considering the post-burnin length of v = 100,000 of each chain realization, indicating slow mixing. MCMC sampling for Pima data is the only case of attaining PSRF less than 1.01, yet the ESS values for Pima are the lowest among the four datasets. Overall, simultaneous low PSRF and high ESS are not reached in any of the examples.

The predictive accuracy is high in multiclass classification, despite the lack of convergence and slow mixing. Bayesian marginalization based on HMC samples yields 100% and 98.03% predictive accuracy on the penguin and hawk test datasets, despite the PSRF values of 1.6082 and 1.4421 on the penguin and hawk training datasets.

PP sampling for the binary classification problem of noisy XOR leads to higher predictive accuracy (87.58%) than MH (75.92%) or HMC (74.75%). The 87.58% predictive accuracy is attained by PP sampling despite the associated PSRF value of 2.2885.

Table 2: Multivariate PSRF, multivariate ESS and predictive accuracy for each set of ten Markov chains realized by an MCMC sampler for a dataset-MLP combination. Predictive accuracies based on samples from the prior are reported as model-agnostic baselines, once per dataset-MLP combination.

Sampler   PSRF      ESS     Accuracy (%), MCMC   Accuracy (%), prior

Noisy XOR, MLP(2, 2, 1)
MH        1.2057    540     75.92                48.33
HMC       13.8689   25448   74.75
PP        2.2885    4083    87.58

Pima, MLP(8, 2, 2, 1)
MH        1.0007    93      79.31                51.69
HMC       1.0001    718     80.38

Penguins, MLP(6, 2, 2, 3)
MH        1.0229    217     100.00               36.45
HMC       1.6082    3127    100.00

Hawks, MLP(6, 2, 2, 3)
MH        1.0319    168     97.97                28.85
HMC       1.4421    1838    98.03

Bayesian marginalization based on MCMC sampling outperforms prior beliefs or random guesses in terms of predictive inference, despite MCMC diagnostic failures. For instance, Bayesian marginalization via non-converged HMC chain realizations yields 74.75%, 100% and 98.03% predictive accuracy on the noisy XOR, penguin and hawk datasets. Approximating the posterior predictive distribution with samples from the parameter prior yields 48.33%, 36.45% and 28.85% predictive accuracy on the same datasets. It is noted that 48.33% is close to a 50/50 random guess for binary classification, while 36.45% and 28.85% are close to a 1/3 random guess for multiclass classification with three classes.

4.4 Visual summaries for parameters

Visual summaries for MLP parameters are presented in this subsection. In particular, Markov chain traceplots and a comparison between MCMC sampling and ensemble training are displayed.

4.4.1 Non-converged chain realizations. Figure 1 shows chain traceplots of four parameters of MLP models introduced in table 1. These traceplots visually demonstrate entrapment in local modes, mode switching and more generally lack of convergence.

Fig 1: Markov chain traceplots of four parameter coordinates of MLP models introduced in table 1. The vertical dotted lines indicate the end of burnin.

All 110,000 iterations per realized chain, which include burnin, are shown in the traceplots of figure 1. The vertical dotted lines delineate the first 10,000 burnin iterations.

Two realized MH chains for parameter θ_8 of the MLP(2, 2, 1) model fitted to the noisy XOR training data are plotted. The traces in orange and in blue gravitate during burnin towards modes in the vicinity of 8 and −8, respectively, and then get entrapped for the entire simulation time in these modes. Parameter θ_8 corresponds to a weight connecting a neuron in the hidden layer with the neuron of the output layer of MLP(2, 2, 1). The two realized chains for θ_8 explore two regions symmetric about zero associated with symmetries of weight θ_8.

Two realized MH chains for parameter θ_18 of the MLP(6, 2, 2, 3) model fitted to the penguin training data are plotted, one shown in orange and one in blue. Each of these two traces initially explores a mode, transits to a seemingly symmetric mode about halfway through the simulation time (post-burnin) and explores the symmetric mode in the second half of the simulation.

One HMC chain traceplot for parameter θ_23 and one HMC chain traceplot for parameter θ_26 of the MLP(6, 2, 2, 3) model fitted to the penguin and hawk training data, respectively, are shown. The traces of these two parameters exhibit similar behaviour, each of them switching between two symmetric regions about zero.

Switching between symmetric modes, as seen in the displayed traceplots, manifests weight symmetries. These traceplots exemplify how computational time is wasted during MCMC to explore equivariant parameter posterior modes of a neural network (Nalisnick, 2018). Consequently, the realized chains do not converge.

4.4.2 MCMC sampling vs ensemble training. An exemplified comparison between MCMC sampling and ensemble training for neural networks follows. To this end, the same noisy XOR training data and the same MLP(2, 2, 1) model, previously used for MCMC sampling, are used for ensemble training.

To recap, the noisy XOR dataset is introduced in subsection 4.1 and is displayed in figure 2a; a sigmoid activation function is applied to the hidden and output layer of MLP(2, 2, 1), and the BCE loss function is employed, which is the negative value of log-likelihood (3.5).

Ensemble learning is conducted by training the MLP(2, 2, 1) model on the noisy XOR training set multiple times. At each training session, SGD is used for minimizing the BCE loss. SGD is initialized by drawing a sample from π(θ) = N(0, 10I), which is the same density used as prior for MCMC sampling. 2,000 epochs are run per training session, with a batch size of 50 and a learning rate of 0.002. The SGD solution from the training session is accepted if its predictive accuracy on the noisy XOR test set is above 85%, otherwise it is rejected. Ensemble learning is terminated as soon as 1,000 SGD solutions with the required level of accuracy are obtained.
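The accept-reject logic of ensemble training can be sketched independently of the optimizer. In the sketch below, train_once and test_accuracy are hypothetical callables standing in for one SGD training session and for the evaluation of predictive accuracy on the test set; the max_sessions safeguard is an addition that is not mentioned in the text.

```python
def build_ensemble(train_once, test_accuracy, n_solutions, threshold=0.85,
                   max_sessions=100000):
    """Collect n_solutions trained parameter vectors that pass the threshold.

    train_once() runs one training session and returns a solution;
    test_accuracy(solution) returns its accuracy in [0, 1].
    """
    accepted = []
    sessions = 0
    while len(accepted) < n_solutions and sessions < max_sessions:
        sessions += 1
        solution = train_once()
        # Accept only solutions whose test accuracy exceeds the threshold.
        if test_accuracy(solution) > threshold:
            accepted.append(solution)
    return accepted, sessions
```

With n_solutions=1000 and threshold=0.85 the loop would mirror the stopping rule stated above, with the number of sessions indicating how many SGD runs were rejected along the way.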

Fig 2: Comparison between MH sampling and ensemble training of an MLP(2, 2, 1) model fitted to noisy XOR data. SGD is used for ensemble training. Each accepted SGD solution has predictive accuracy above 85% on the noisy XOR test set. (a) Noisy XOR training set (left) and test set (right). (b) 100 SGD solutions from training MLP(2, 2, 1). (c) Histograms of parameter θ_3 of MLP(2, 2, 1).

Figure 2b shows a parallel coordinates plot of 100 SGD solutions. Each line connects the nine coordinates of a solution. Overlaying lines of different SGD solutions visualizes parameter symmetries.

Figure 2c displays histograms associated with parameter θ_3 of MLP(2, 2, 1). The green histogram represents all 1,000 SGD solutions for θ_3 obtained from ensemble training based on noisy XOR. These 1,000 modes cluster in two regions approximately symmetric about zero. The orange histogram belongs to one of ten realized MH chains for θ_3 based on noisy XOR. This realized chain is entrapped in a local mode in the vicinity of 5, where the orange histogram concentrates its mass. The overlaid green and orange histograms show that MH sampling explores a region of the marginal posterior of θ_3 also explored by ensemble training.

The blue histogram in figure 2c comes from a chain realization for θ_3 using MH sampling to fit MLP(2, 2, 1) to the four exact XOR data points. The pink line in figure 2c shows the marginal prior π(θ_3) = N(0, σ² = 10). Four data points are not sufficient for learning, given that MLP(2, 2, 1) has nine parameters. For this reason, the blue histogram coincides with the pink line, which means that the marginal posterior p(θ_3) obtained from exact XOR via MH sampling and the marginal prior π(θ_3) coincide.

4.5 Visual summaries for predictions

Visual summaries for MLP predictions and for MLP posterior predictive probabilities are presented in this subsection. MLP posterior predictive probabilities are visually shown to quantify predictive uncertainty in classification.

4.5.1 Predictive accuracy. Figure 3 shows boxplots of predictive accuracies, hereinafter referred to as accuracies, for the examples introduced in table 1. Each boxplot summarizes m = 10 accuracies associated with the ten chains realized per sampler for a test set. Accuracy computation is based on Bayesian marginalization, as outlined in subsections 3.3 and 4.2. Horizontal red lines represent accuracy medians. Figure 3 and table 2 provide complementary summaries, as they present respective quartiles and means of accuracies across chains per sampler.

Boxplot medians show high accuracy on the penguin and hawk test sets. Moreover, narrow boxplots indicate accuracies with small variation on the penguin and hawk test sets. Thereby, Bayesian marginalization based on non-converged chain realizations attains high accuracy with small variability on the two multiclass classification examples.

Figure 3 also displays boxplots of accuracies based on expected posterior predictive distribution approximation (4.1) with respect to the prior. For all four test sets and regardless of Markov chain convergence, Bayesian marginalization outperforms the agnostic prior-based baseline (4.1).

The PP boxplot has a more elevated median and is narrower than its MH and HMC counterparts for the noisy XOR test set. This implies that PP sampling attains higher accuracy with smaller variation than MH and HMC sampling on the noisy XOR test set.

Fig 3: Boxplots of predictive accuracies for the examples introduced in table 1. Each boxplot summarizes m = 10 predictive accuracies associated with the ten chains realized by an MCMC sampler for a test set.

4.5.2 Uncertainty quantification on a grid. Figure 4 visualizes heatmaps of the ground truth and of posterior predictive distribution approximations for noisy XOR. More specifically, the posterior predictive probability p(y = 1|(x_1, x_2), D_{1:500}) is approximated at the centre (x_1, x_2) of each square cell of a 22 × 22 grid in [−0.5, 1.5] × [−0.5, 1.5]. D_{1:500} refers to the noisy XOR training dataset of size s = 500 introduced in subsection 4.1. (3.11) is used for approximating p(y = 1|(x_1, x_2), D_{1:500}). Previously acquired Markov chain realizations (subsection 4.3) via MCMC sampling of MLP(2, 2, 1) parameters, using the noisy XOR training dataset D_{1:500}, are passed to (3.11).

The approximation p(y = 1|(x_1, x_2), D_{1:500}) = c at the centre (x_1, x_2) of a square cell determines the colour of the cell in figure 4. If c is closer to 1, 0, or 0.5, the cell is plotted with a shade of red, blue or white, respectively. So, darker shades of red indicate that y = 1 with higher certainty, darker shades of blue indicate that y = 0 with higher certainty, and shades of white indicate high uncertainty about the binary label of noisy XOR.
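The grid evaluation can be sketched as follows. The sketch computes the centres of the 22 × 22 cells in [−0.5, 1.5] × [−0.5, 1.5] and evaluates a function of (x_1, x_2) at each centre; the ground truth used here thresholds both features at 0.5, and all function names are hypothetical. In practice the evaluated function would be the Monte Carlo approximation of p(y = 1|(x_1, x_2), D_{1:500}).

```python
def cell_centres(lower=-0.5, upper=1.5, cells=22):
    """Centres of equally wide cells partitioning [lower, upper]."""
    width = (upper - lower) / cells
    return [lower + (k + 0.5) * width for k in range(cells)]


def evaluate_on_grid(predictive, lower=-0.5, upper=1.5, cells=22):
    """Evaluate predictive(x1, x2) at every cell centre, row by row in x2."""
    centres = cell_centres(lower, upper, cells)
    return [[predictive(x1, x2) for x1 in centres] for x2 in centres]


def xor_ground_truth(x1, x2):
    """True label: 1 if exactly one feature exceeds 0.5, otherwise 0."""
    return int((x1 > 0.5) != (x2 > 0.5))
```

The resulting nested list can be passed to a plotting routine with a diverging colour palette to obtain heatmaps of the kind described above.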

Two posterior predictive distribution approximations based on two HMC chain realizations learn different regions of the exact posterior predictive distribution. Each of the two HMC chain realizations uncovers about half of the ground truth of grid labels, while it remains highly uncertain for the other half of grid labels. Moreover, both HMC chain realizations exhibit higher uncertainty closer to the decision boundaries of ground truth. These decision boundaries are the vertical straight line x_1 = 0.5 and horizontal straight line x_2 = 0.5.

Fig 4: Heatmaps of ground truth and of posterior predictive probabilities p(y = 1|(x_1, x_2), D_{1:500}) = c on a grid of noisy XOR features (x_1, x_2). The heatmap colour palette represents values of c. The ground truth heatmap visualizes true labels, while the other three heatmaps use approximate Bayesian marginalization based on HMC and PP chain realizations.

A posterior predictive distribution approximation based on a PP chain realization is displayed. PP sampling uncovers larger regions of the ground truth of grid labels than HMC sampling in the considered grid of noisy XOR features (x_1, x_2). Although HMC and PP samples do not converge to the parameter posterior of MLP(2, 2, 1), approximate Bayesian marginalization using these samples predicts a subset of noisy XOR labels.

4.5.3 Uncertainty quantification on a test set. Figures 5 and 6 show approximations of posterior predictive probabilities for a binary classification (noisy XOR) and a multiclass classification (hawks) example. Two posterior predictive probabilities are interpreted contextually in each example to quantify predictive uncertainty.

Fig 5: Quantification of uncertainty in predictions for the noisy XOR test set. Approximate Bayesian marginalization via MH sampling is used for computing posterior predictive probabilities. (a) Scatterplot of noisy XOR features (x_1, x_2). (b) Posterior predictive probabilities for noisy XOR.

Figure 5a visualizes the noisy XOR test set of subsection 4.1. This is the same test set shown in figure 2a, but with test points coloured according to their labels. Figure 5b shows the posterior predictive probability p(y = c|(x_1, x_2), D_{1:500}) of true label c ∈ {0, 1} for each noisy XOR test point ((x_1, x_2), y = c) given noisy XOR training set D_{1:500} of subsection 4.1. The posterior probabilities p(y = c|(x_1, x_2), D_{1:500}) of predicting true class c are ordered within class c. Moreover, each p(y = c|(x_1, x_2), D_{1:500}) is coloured as red or pale green depending on whether the resulting prediction is correct or not. One of the ten MH chain realizations for MLP(2, 2, 1) parameter inference from noisy XOR data is used for approximating p(y = c|(x_1, x_2), D_{1:500}) via (3.11) and for making predictions via (3.12).

Two points in the noisy XOR test set are marked in figure 5 using a square and a rhombus. These two points have the same true label c = 1. Given posterior predictive probabilities 0.5269 and 0.9750 for the rhombus and square-shaped test points, the label c = 1 is correctly predicted for both points. However, the rhombus-shaped point is closer to the decision boundary x_2 = 0.5 than the square-shaped point, so classifying the former entails higher uncertainty. As 0.5269 < 0.9750, Bayesian marginalization quantifies the increased predictive uncertainty associated with the rhombus-shaped point despite using a non-converged MH chain realization.

Figure 6a shows a scatterplot of weight against tail length for the hawk test set of subsection 4.1. Blue, red and green test points belong to Cooper’s, red-tailed and sharp-shinned hawk classes. Figure 6b shows the posterior predictive probabilities p(y = c|x, D_{1:596}) for a subset of 100 hawk test points, where c ∈ {Cooper’s, red-tailed, sharp-shinned} denotes the true label of test point (x, y = c) and D_{1:596} denotes the hawk training set of subsection 4.1. These posterior predictive probabilities are shown ordered within each class, and are coloured red or pale green depending on whether they yield correct or wrong predictions. One of the ten MH chain realizations for MLP(6, 2, 2, 3) parameter inference is used for approximating p(y = c|x, D_{1:596}) via (3.11) and for making predictions via (3.13).

Two points in the hawk test set are marked in figure 6 using a square and a rhombus. Each of these two points represents weight and tail length measurements from a red-tailed hawk. The red-tailed hawk class is correctly predicted for both points. The square-shaped observation belongs to the main cluster of red-tailed hawks in figure 6a and it is predicted with high posterior predictive probability (0.9961). On the other hand, the rhombus-shaped observation, which falls in the cluster of Cooper’s hawk, is correctly predicted with a lower posterior predictive probability (0.5271). Bayesian marginalization provides approximate posterior predictive probabilities that signify the level of uncertainty in predictions despite using a non-converged MH chain realization.

4.6 Source code

The source code for this paper is split into three Python packages, namely eeyore, kanga and bnn_mcmc_examples. eeyore implements MCMC algorithms for Bayesian neural networks. kanga implements MCMC diagnostics. bnn_mcmc_examples includes the examples of this paper.

Fig 6: Quantification of uncertainty in predictions for the hawk test set. Bayesian marginalization via MH sampling is used for approximating posterior predictive probabilities. (a) Scatterplot of hawks’ weight against tail length. (b) Posterior predictive probabilities for hawks.

eeyore is available via pip, via conda and at https://github.com/papamarkou/eeyore. eeyore implements the MLP model, as defined by (3.1)-(3.2), using PyTorch. An MLP class is set to be a subclass of torch.nn.Module, with log-likelihood (3.5) for binary classification equal to the negative value of torch.nn.BCELoss and with log-likelihood (3.8) for multiclass classification equal to the negative value of torch.nn.CrossEntropyLoss. Each MCMC algorithm takes an instance of torch.nn.Module as input, with the logarithm of the target density being a log_target method of the instance. Log-target density gradients for HMC are computed via the automatic differentiation functionality of the torch.autograd package of PyTorch. The MLP class of eeyore provides a predictive_posterior method, which implements the posterior predictive distribution approximation (3.11) given a realized Markov chain.

kanga is available via pip, via conda and at https://github.com/papamarkou/kanga. kanga is a collection of MCMC diagnostics implemented using numpy. MINSE, multivariate PSRF and multivariate ESS are available in kanga.

bnn_mcmc_examples organizes the examples of this paper in a package. bnn_mcmc_examples relies on eeyore for MCMC simulations and posterior predictive distribution approximations, and on kanga for MCMC diagnostics. For more details, see https://github.com/papamarkou/bnn_mcmc_examples.

Optimization via SGD for the example involving MLP(2, 2, 1) and noisy XOR data (figure 2) is run using PyTorch. The loss function for optimization is computed via torch.nn.BCELoss. This loss function corresponds to the negative log-likelihood function (3.5) involved in MCMC, thus linking the SGD and MH simulations shown in figure 2c. SGD is coded manually instead of calling an optimization algorithm of the torch.optim package of PyTorch. Gradients for optimization are computed by calling the backward method. The SGD code related to the example of figure 2 is available at https://github.com/papamarkou/bnn_mcmc_examples.
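A manually coded SGD loop of the kind described above might look as follows. This is a sketch under stated assumptions, not the code of bnn_mcmc_examples: the function names manual_sgd and noisy_xor, the learning rate, the batch size, the number of epochs and the Gaussian noise added to the XOR corners are all hypothetical choices.

```python
import torch


def noisy_xor(num_per_corner=25, noise_std=0.1, seed=0):
    """Generate XOR corners perturbed by Gaussian noise (assumed noise model)."""
    gen = torch.Generator().manual_seed(seed)
    corners = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    labels = torch.tensor([0., 1., 1., 0.])
    x = corners.repeat_interleave(num_per_corner, dim=0)
    x = x + noise_std * torch.randn(x.shape, generator=gen)
    y = labels.repeat_interleave(num_per_corner).unsqueeze(1)
    return x, y


def manual_sgd(model, x, y, lr=0.1, num_epochs=100, batch_size=10):
    """Minibatch SGD without torch.optim; returns the full-data loss per epoch."""
    loss_fn = torch.nn.BCELoss()
    losses = []
    for _ in range(num_epochs):
        perm = torch.randperm(x.shape[0])
        for start in range(0, x.shape[0], batch_size):
            idx = perm[start:start + batch_size]
            model.zero_grad()
            loss = loss_fn(model(x[idx]), y[idx])
            loss.backward()  # gradients via the backward method
            with torch.no_grad():
                for p in model.parameters():
                    p -= lr * p.grad  # plain gradient step
        with torch.no_grad():
            losses.append(loss_fn(model(x), y).item())
    return losses


if __name__ == "__main__":
    torch.manual_seed(0)
    # MLP(2, 2, 1) with sigmoid activations
    mlp = torch.nn.Sequential(
        torch.nn.Linear(2, 2), torch.nn.Sigmoid(),
        torch.nn.Linear(2, 1), torch.nn.Sigmoid(),
    )
    x, y = noisy_xor()
    history = manual_sgd(mlp, x, y)
    print(history[0], history[-1])
```

Updating the parameters inside torch.no_grad() keeps the update step itself out of the computational graph, which is the usual way to write a gradient step by hand in PyTorch.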

4.7 Hardware

Pilot MCMC runs indicated an increase in speed by using CPUs instead of GPUs; accordingly, computations were performed on CPUs for this paper. The GPU slowdown is explained by the overhead of copying PyTorch tensors between GPUs and CPUs for small neural networks, such as the ones used in section 4.

The computations for section 4 were run on Google Cloud Platform (GCP). Eleven virtual machine (VM) instances with virtual CPUs were created on GCP to spread the workload.

Setting aside heterogeneities in hardware configuration between GCP VM instances and in order to provide an indication of computational cost, MCMC simulation runtimes are provided for the example of applying an MLP(6, 2, 2, 3) to the hawk training dataset. The mean runtimes across the ten realized chains per MH and HMC are 0:42:54 and 1:10:48, respectively (runtimes are formatted as 'hours:minutes:seconds').

5. PREDICTIVE INFERENCE SCOPE

Bayesian marginalization can attain high predictive accuracy and can quantify predictive uncertainty using non-converged MCMC samples of neural network parameters. Thus, MCMC sampling elicits some information about the parameter posterior of a neural network and conveys such information to the posterior predictive distribution. It is possible that MCMC sampling learns about the statistical dependence among neural network parameters. Along these lines, groups of weights or biases can be formed, with strong within-group and weak between-group dependence, to investigate scalable block Gibbs sampling methods for neural networks.

Another possibility for MCMC developments for neural networks entails shifting attention from the parameter space to the output space, since the latter is related to predictive inference directly. Approximate MCMC methods that measure the discrepancy or Wasserstein distance between neural network predictions and output data (Rudolf and Schweizer, 2018) can be investigated.

Bayesian marginalization provides scope to develop predictive inference for neural networks. For instance, Bayesian marginalization can be examined in the context of approximate MCMC sampling from a neural network parameter posterior, regardless of convergence to the parameter posterior and in analogy to the workings of this paper. Moreover, the idea of Wilson and Izmailov (2020) to interpret ensemble training of neural networks from a viewpoint of Bayesian marginalization can be studied using the notion of quantization of probability distributions.

APPENDIX A: POWER POSTERIORS

This appendix provides the probability mass function $p_i(j)$ for proposing a chain $j$ for a possible swap of states between chains $i$ and $j$ in PP sampling. Assuming $m+1$ power posteriors, a neighbouring chain $j$ of $i$ is chosen randomly from the categorical probability mass function $p_i = \mathcal{C}(\alpha_i(0), \alpha_i(1), \ldots, \alpha_i(i-1), \alpha_i(i+1), \ldots, \alpha_i(m))$ with event probabilities

$$
\alpha_i(j) = \frac{\exp(-\beta |j-i|)}{\gamma_i},
$$

where $i \in \{0, 1, \ldots, m\}$, $j \in \{0, 1, \ldots, m\} \setminus \{i\}$, $\beta$ is a hyperparameter and $\gamma_i$ is a normalizing constant. The hyperparameter $\beta$ is typically set to $\beta = 0.5$, a value which makes a jump to $j = i \pm 1$ roughly three times more likely than a jump to $j = i \pm 3$ (Friel and Pettitt, 2008).
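As a quick check of the stated ratio, note that the normalizing constant cancels whenever both candidate chains exist, so that

$$
\frac{\alpha_i(i \pm 1)}{\alpha_i(i \pm 3)} = \frac{\exp(-\beta)}{\exp(-3\beta)} = \exp(2\beta) = e \approx 2.72 \quad \text{for } \beta = 0.5.
$$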

The normalizing constant $\gamma_i$ is given by

$$
\gamma_i = \frac{\exp(-\beta)\left(2 - \exp(-\beta i) - \exp(-\beta(m-i))\right)}{1 - \exp(-\beta)}.
$$

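Starting from the fact that the event probabilities $\alpha_i(j)$ add up to one, $\gamma_i$ is derived as follows:

$$
\begin{aligned}
1 &= \sum_{\substack{j=0 \\ j \neq i}}^{m} \alpha_i(j) \Rightarrow \\
\gamma_i &= \sum_{j=0}^{i-1} \exp(-\beta(i-j)) + \sum_{j=i+1}^{m} \exp(-\beta(j-i)) \\
&= \sum_{j=1}^{i} \exp(-\beta j) + \sum_{j=1}^{m-i} \exp(-\beta j) \\
&= \exp(-\beta)\left(\frac{1 - \exp(-\beta i)}{1 - \exp(-\beta)}\right) + \exp(-\beta)\left(\frac{1 - \exp(-\beta(m-i))}{1 - \exp(-\beta)}\right) \\
&= \frac{\exp(-\beta)\left(2 - \exp(-\beta i) - \exp(-\beta(m-i))\right)}{1 - \exp(-\beta)}.
\end{aligned}
$$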
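The closed form of the normalizing constant can be checked numerically against the direct sum. The following sketch is for illustration only; the function names swap_proposal_pmf, gamma_closed_form and propose_neighbour are hypothetical and do not come from the paper's software.

```python
import math
import random


def swap_proposal_pmf(i, m, beta=0.5):
    """Return a dict mapping each j in {0, ..., m}, j != i, to alpha_i(j)."""
    weights = {j: math.exp(-beta * abs(j - i)) for j in range(m + 1) if j != i}
    gamma = sum(weights.values())  # normalizing constant by direct summation
    return {j: w / gamma for j, w in weights.items()}


def gamma_closed_form(i, m, beta=0.5):
    """Normalizing constant gamma_i from the geometric-series closed form."""
    e = math.exp(-beta)
    return e * (2 - math.exp(-beta * i) - math.exp(-beta * (m - i))) / (1 - e)


def propose_neighbour(i, m, beta=0.5, rng=random):
    """Draw a chain index j != i from the categorical distribution p_i."""
    pmf = swap_proposal_pmf(i, m, beta)
    return rng.choices(list(pmf), weights=list(pmf.values()), k=1)[0]
```

Passing a seeded random.Random instance as rng makes the proposal reproducible.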

APPENDIX B: PREDICTIVE DISTRIBUTION

This appendix derives the posterior predictive distribution (3.9). Applying the law of total probability and the definition of conditional probability yields

$$
\begin{aligned}
p(y|x, D_{1:s}) &= \int p(y, \theta|x, D_{1:s})\, d\theta \\
&= \int p(y|x, D_{1:s}, \theta)\, p(\theta|x, D_{1:s})\, d\theta.
\end{aligned}
$$

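The term $p(y|x, D_{1:s}, \theta)$ is equal to the likelihood $p(y|x, \theta)$:

$$
p(y|x, D_{1:s}, \theta) = \frac{p(y, D_{1:s}|x, \theta)}{p(D_{1:s}|x, \theta)} = \frac{p(y|x, \theta)\, p(D_{1:s}|x, \theta)}{p(D_{1:s}|x, \theta)} = p(y|x, \theta).
$$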


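Furthermore, $p(\theta|x, D_{1:s})$ is equal to the parameter posterior $p(\theta|D_{1:s})$:

$$
\begin{aligned}
p(\theta|x, D_{1:s}) &= \frac{p(\theta, x, D_{1:s})}{p(x, D_{1:s})} \\
&= \frac{p(\theta, x|D_{1:s})\, p(D_{1:s})}{p(x)\, p(D_{1:s})} \\
&= \frac{p(\theta|D_{1:s})\, p(x|D_{1:s})}{p(x)} \\
&= \frac{p(\theta|D_{1:s})\, p(x, D_{1:s})}{p(x)\, p(D_{1:s})} \\
&= \frac{p(\theta|D_{1:s})\, p(x)\, p(D_{1:s})}{p(x)\, p(D_{1:s})} \\
&= p(\theta|D_{1:s}).
\end{aligned}
$$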
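Substituting these two identities into the integral at the start of this appendix expresses the posterior predictive distribution as the expectation of the likelihood under the parameter posterior:

$$
p(y|x, D_{1:s}) = \int p(y|x, \theta)\, p(\theta|D_{1:s})\, d\theta.
$$

Assuming the usual Monte Carlo estimator, a realized chain $\omega_1, \ldots, \omega_n$ (notation introduced here for illustration only) approximates this integral by the average $\frac{1}{n} \sum_{k=1}^{n} p(y|x, \omega_k)$.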

ACKNOWLEDGEMENTS

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725.

The first author would like to thank Google for the provision of free credit on Google Cloud Platform.

REFERENCES

Andrieu, C., de Freitas, J. F. G. and Doucet, A. (1999). Sequential Bayesian estimation and model selection applied to neural networks.

Andrieu, C., de Freitas, N. and Doucet, A. (2000). Reversible jump MCMC simulated annealing for neural networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence 11–18.

Ashukha, A., Lyzhov, A., Molchanov, D. and Vetrov, D. (2020). Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations.

Badrinarayanan, V., Mishra, B. and Cipolla, R. (2015). Symmetry-invariant optimization in deep networks. arXiv.

Bennett, J. E., Racine-Poon, A. and Wakefield, J. C. MCMC for nonlinear hierarchical models. 339–358.

Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society. Series B (Methodological) 41 113–147.

Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112 859–877.

Blier, L. and Ollivier, Y. (2018). The description length of deep learning models. In Advances in Neural Information Processing Systems 31.

Brea, J., Simsek, B., Illing, B. and Gerstner, W. (2019). Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv.

Brooks, S. P. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7 434–455.

Cannon, A., Cobb, G., Hartlaub, B., Legler, J., Lock, R., Moore, T., Rossman, A. and Witmer, J. (2019). Stat2Data: datasets for Stat2. R package version 2.0.0.

Chen, T., Fox, E. and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning 32 1683–1691.

Chen, A. M., Lu, H. and Hecht-Nielsen, R. (1993). On the geometry of feedforward neural network error surfaces. Neural Computation 5 910–927.

Chen, W. Y., Barp, A., Briol, F.-X., Gorham, J., Girolami, M., Mackey, L. and Oates, C. (2019). Stein point Markov chain Monte Carlo. In Proceedings of the 36th International Conference on Machine Learning 97 1011–1021.

Chollet, F. (2017). Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1251–1258.

Chwialkowski, K., Strathmann, H. and Gretton, A. (2016). A kernel test of goodness of fit. In Proceedings of The 33rd International Conference on Machine Learning 48 2606–2615.

Cowles, M. K. and Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association 91 883–904.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 303–314.

Dai, N. and Jones, G. L. (2017). Multivariate initial sequence estimators in Markov chain Monte Carlo. Journal of Multivariate Analysis 159 184–199.

Daniels, M. J. and Kass, R. E. (1998). A note on first-stage approximation in two-stage hierarchical models. Sankhya: The Indian Journal of Statistics, Series B (1960-2002) 60 19–30.

de Freitas, N. (1999). Bayesian methods for neural networks, PhD thesis, University of Cambridge.

de Freitas, N., Andrieu, C., Højen-Sørensen, P., Niranjan, M. and Gee, A. (2001). Sequential Monte Carlo methods for neural networks. In Sequential Monte Carlo Methods in Practice 359–379.

De Sa, C., Chen, V. and Wong, W. (2018). Minibatch Gibbs sampling on large graphical models. In Proceedings of the 35th International Conference on Machine Learning 80 1165–1173.

Dupuy, C. and Bach, F. (2017). Online but accurate inference for latent variable models with local Gibbs sampling. Journal of Machine Learning Research 18 1–45.

Ensign, D., Neville, S., Paul, A. and Venkatasubramanian, S. (2017). The complexity of explaining neural networks through (group) invariants. In Proceedings of the 28th International Conference on Algorithmic Learning Theory 76 341–359.

Esmaeili, B., Wu, H., Jain, S., Bozkurt, A., Siddharth, N., Paige, B., Brooks, D. H., Dy, J. and van de Meent, J.-W. (2019). Structured disentangled representations. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics 89 2525–2534.

Freeman, I., Roese-Koerner, L. and Kummert, A. (2018). Effnet: An efficient structure for convolutional neural networks. In 25th IEEE International Conference on Image Processing 6–10.

Friel, N. and Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 589–607.

Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7 457–472.

Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2004). Bayesian data analysis, 2nd ed. Chapman and Hall/CRC.

Gilks, W. R. and Roberts, G. O. Strategies for improving MCMC. 89–114.

Giordano, R. J., Broderick, T. and Jordan, M. I. (2015). Linear response methods for accurate covariance estimates from mean field variational Bayes. In Advances in Neural Information Processing Systems 28 1441–1449.

Gong, L. and Flegal, J. M. (2016). A practical sequential stopping rule for high-dimensional Markov chain Monte Carlo. Journal of Computational and Graphical Statistics 25 684–700.

Gong, W., Li, Y. and Hernández-Lobato, J. M. (2019). Meta-learning for stochastic gradient MCMC. In International Conference on Learning Representations.

Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep learning. MIT Press.

Graf, S. and Luschgy, H. (2007). Foundations of quantization for probability distributions. Springer.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research 13 723–773.

Gu, S. S., Ghahramani, Z. and Turner, R. E. (2015). Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems 28 2629–2637.

Hastie, T., Tibshirani, R. and Friedman, J. (2016). The elements of statistical learning: data mining, inference and prediction, 2nd ed. Springer.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109.

Hecht-Nielsen, R. (1990). On the algebraic structure of feedforward network weight spaces. In Advanced Neural Computers 129–135.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4 251–257.

Horst, A. M., Hill, A. P. and Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H. (2017). Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv.

Hu, S. X., Zagoruyko, S. and Komodakis, N. (2019). Exploring weight symmetry in deep neural networks. Computer Vision and Image Understanding 187 102786.

Huang, C.-W., Sankaran, K., Dhekane, E., Lacoste, A. and Courville, A. (2019). Hierarchical importance weighted autoencoders. In Proceedings of the 36th International Conference on Machine Learning 97 2869–2878.

Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J. and Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv.

Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D. and Wilson, A. G. (2020). Subspace inference for Bayesian deep learning. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference 115 1169–1179.

Jarrett, K., Kavukcuoglu, K., Ranzato, M. and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In IEEE 12th International Conference on Computer Vision 2146–2153.

Jaynes, E. T. (1968). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics 4 227–241.

Jeffreys, H. (1962). The theory of probability, 3rd ed. OUP Oxford.

Johndrow, J. E., Pillai, N. S. and Smith, A. (2020). No free lunch for approximate MCMC. arXiv.

Kass, R. E., Carlin, B. P., Gelman, A. and Neal, R. M. (1998). Markov chain Monte Carlo in practice: a roundtable discussion. The American Statistician 52 93–100.

Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 1097–1105.

Lee, H. K. H. (2000). Consistency of posterior distributions for neural networks. Neural Networks 13 629–642.

Lee, H. K. H. (2003). A noninformative prior for neural networks. Machine Learning 50 197–212.

Lee, H. K. (2004). Priors for neural networks. In Classification, Clustering, and Data Mining Applications (D. Banks, F. R. McMorris, P. Arabie and W. Gaul, eds.) 141–150.

Lee, H. K. (2005). Neural networks and default priors. In Proceedings of the American Statistical Association, Section on Bayesian Statistical Science.

Lee, H. K. (2007). Default priors for neural network classification. Journal of Classification 24 53–70.

Lu, Z., Pu, H., Wang, F., Hu, Z. and Wang, L. (2017). The expressive power of neural networks: a view from the width. In Advances in Neural Information Processing Systems 30 6231–6239.

Ma, Y.-A., Foti, N. J. and Fox, E. B. (2017). Stochastic gradient MCMC methods for hidden Markov models. In Proceedings of the 34th International Conference on Machine Learning 70 2265–2274.

MacKay, D. J. (1995). Developments in probabilistic modelling with neural networks—ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications 191–198.

Maddison, C. J., Huang, A., Sutskever, I. and Silver, D. (2015). Move evaluation in Go using deep convolutional neural networks. In International Conference on Learning Representations.

Mandt, S., Hoffman, M. D. and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research 18 1–35.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21 1087–1092.

Minsky, M. L. and Papert, S. A. (1988). Perceptrons: expanded edition. MIT Press.

Moore, D. A. (2016). Symmetrized variational inference. In NIPS Workshop on Advances in Approximate Bayesian Inference.

Nair, V. and Hinton, G. E. (2009). 3D object recognition with deep belief nets. In Advances in Neural Information Processing Systems 22 1339–1347.

Nalisnick, E. T. (2018). On priors for Bayesian neural networks, PhD thesis, UC Irvine.

Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo 5. CRC Press.

Nemeth, C. and Sherlock, C. (2018). Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Analysis 13 507–530.

Nwankpa, C., Ijomah, W., Gachagan, A. and Marshall, S. (2018). Activation functions: comparison of trends in practice and research for deep learning. arXiv.

Ong, V. M. H., Nott, D. J. and Smith, M. S. (2018). Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics 27 465–478.

Pearce, T., Zaki, M., Brintrup, A. and Neely, A. (2019). Expressive priors in Bayesian neural networks: kernel combinations and periodic functions. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence.

Polson, N. G. and Sokolov, V. (2017). Deep learning: a Bayesian perspective. Bayesian Analysis 12 1275–1304.

Pourzanjani, A. A., Jiang, R. M. and Petzold, L. R. (2017). Improving the identifiability of neural networks for Bayesian inference. In NIPS Workshop on Bayesian Deep Learning.

Quiroz, M., Kohn, R., Villani, M. and Tran, M.-N. (2019). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association 114 831–843.

Ranganath, R., Tran, D. and Blei, D. (2016). Hierarchical variational models. In Proceedings of The 33rd International Conference on Machine Learning 48 324–333.

Robert, C. P., Elvira, V., Tawn, N. and Wu, C. (2018). Accelerating MCMC algorithms. Wiley Interdisciplinary Reviews: Computational Statistics 10 e1435.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65 386.

Rudolf, D. and Schweizer, N. (2018). Perturbation theory for Markov chains via Wasserstein distance. Bernoulli 24 2610–2639.

Sargent, D. J., Hodges, J. S. and Carlin, B. P. (2000). Structured Markov chain Monte Carlo. Journal of Computational and Graphical Statistics 9 217–234.

Seita, D., Pan, X., Chen, H. and Canny, J. (2018). An efficient minibatch acceptance test for Metropolis-Hastings. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence 5359–5363.

Sen, D., Papamarkou, T. and Dunson, D. (2020). Bayesian neural networks and dimensionality reduction. arXiv.

Simpson, D., Rue, H., Riebler, A., Martins, T. G. and Sørbye, S. H. (2017). Penalising model component complexity: a principled, practical approach to constructing priors. Statistical Science 32 1–28.

Smith, J. W., Everhart, J., Dickson, W., Knowler, W. and Johannes, R. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care 261.

Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 795–809.

Titsias, M. K. and Ruiz, F. (2019). Unbiased implicit variational inference. In Proceedings of Machine Learning Research 89 167–176.

Titterington, D. M. (2004). Bayesian methods for neural networks and related models. Statistical Science 19 128–139.

Truong, T.-D., Nguyen, V.-T. and Tran, M.-T. (2018). Lightweight deep convolutional network for tiny object recognition. In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods 675–682.

Vats, D. and Flegal, J. M. (2018). Lugsail lag windows and their application to MCMC. arXiv.

Vats, D., Flegal, J. M. and Jones, G. L. (2019). Multivariate output analysis for Markov chain Monte Carlo. Biometrika 106 321–337.

Vats, D. and Knudson, C. (2018). Revisiting the Gelman-Rubin diagnostic. arXiv.

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. and Bürkner, P.-C. (2019). Rank-normalization, folding, and localization: an improved $\widehat{R}$ for assessing convergence of MCMC. arXiv.

Vladimirova, M., Verbeek, J., Mesejo, P. and Arbel, J. (2019). Understanding priors in Bayesian neural networks at the unit level. In Proceedings of the 36th International Conference on Machine Learning 97 6458–6467.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning 681–688.


Williams, P. M. (1995). Bayesian regularization and pruning using a Laplace prior. Neural Computation 7 117–143.

Williams, C. K. I. (2000). An MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems 12 680–686.

Wilson, A. G. and Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. arXiv.

Zhang, G., Sun, S., Duvenaud, D. and Grosse, R. (2018a). Noisy natural gradient as variational inference. In Proceedings of the 35th International Conference on Machine Learning 80 5852–5861.

Zhang, X., Zhou, X., Lin, M. and Sun, J. (2018b). Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6848–6856.