
OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

DANIEL TAYLOR-RODRIGUEZ

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2014

© 2014 Daniel Taylor-Rodríguez


In memory of George Casella

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.

–Sherlock Holmes, A Scandal in Bohemia


ACKNOWLEDGMENTS

Completing this dissertation would not have been possible without the support from the people that have helped me remain focused, motivated, and inspired throughout the years. I am undeservingly fortunate to be surrounded by such amazing people.

First of all, I would like to express my gratitude to Professor George Casella. It was an unsurpassable honor to work with him. His wisdom, generosity, optimism, and unyielding resolve will forever inspire me. I will always treasure his teachings and the fond memories I have of him. I thank him and Anne for treating me and my wife as family.

I would like to acknowledge all of my committee members. My heartfelt thanks to my advisor, Professor Linda J. Young; I will carry her thoughtful and patient recommendations throughout my life. I have no words to express how thankful I am to her for guiding me through the difficult times that followed Dr. Casella's passing. Also, she has my gratitude for sharing her knowledge and wealth of experience, and for providing me with so many amazing opportunities. I am forever grateful to my local advisor, Professor Nikolay Bliznyuk, for unsparingly sharing his insightful reflections and knowledge. His generosity and drive to help students develop are a model to follow. His kind and extensive efforts, our many conversations, and his suggestions and advice in all aspects of academic and non-academic life have made me a better statistician and have had a profound influence on my way of thinking. My appreciation to Professor Madan Oli for his enlightening advice and for helping me advance my understanding of ecology.

I would like to express my absolute gratitude to Dr. Andrew Womack, my friend and young mentor. His love for good science and hard work, although impossible to keep up with, made my doctoral training one of the most exciting times in my life. I have sincerely enjoyed working with and learning from him the last couple of years. I offer my gratitude to Dr. Salvador Gezan for his friendship and the patience with which he taught me so much more about statistics (boring our wives to death in the process). I am grateful to Professor Mary Christman for her mentorship and enormous support. I would like to thank Dr. Mihai Giurcanu for spending countless hours helping me think more deeply about statistics; his insight has been instrumental to shaping my own ideas. Thanks to Dr. Claudio Fuentes for taking an interest in my work and for his advice, support, and kind words, which helped me retain the confidence to continue.

I would like to acknowledge my friends at UF: Juan Jose Acosta, Mauricio Mosquera, Diana Falla, Salvador and Emma Weeks, and Anna Denicol, thanks for becoming my family away from home. Andreas, Tavis, Emily, Alex, Sasha, Mike, Yeonhee, and Laura, thanks for being there for me; I truly enjoyed sharing these years with you. Vitor, Paula, Rafa, Leandro, Fabio, Eduardo, Marcelo, and all the other Brazilians in the Animal Science Department, thanks for your friendship and for the many unforgettable (though blurry) weekends.

Also, I would like to thank Pablo Arboleda for believing in me. Because of him I was able to take the first step towards fulfilling my educational goals. My gratitude to Grupo Bancolombia, Fulbright Colombia, Colfuturo, and the IGERT QSE3 program for supporting me throughout my studies. Also, thanks to Marc Kery and Christian Monnerat for providing data to validate our methods. Thanks to the staff in the Statistics Department, especially to Ryan Chance, to the staff at the HPC, and also to Karen Bray at SNRE.

Above all else, I would like to thank my wife and family. Nata, you have always been there for me, pushing me forward, believing in me, helping me make better decisions, and, regardless of how hard things get, you have always managed to give me true and lasting happiness. Thank you for your love, strength, and patience. Mom, Dad, Alejandro, Alberto, Laura, Sammy, Vale, and Tommy, without your love, trust, and support getting this far would not have been possible. Thank you for giving me so much. Gustavo, Lilia, Angelica, and Juan Pablo, thanks for taking me into your family; your words of encouragement have led the way.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 GENERAL INTRODUCTION
  1.1 Occupancy Modeling
  1.2 A Primer on Objective Bayesian Testing
  1.3 Overview of the Chapters

2 MODEL ESTIMATION METHODS
  2.1 Introduction
    2.1.1 The Occupancy Model
    2.1.2 Data Augmentation Algorithms for Binary Models
  2.2 Single Season Occupancy
    2.2.1 Probit Link Model
    2.2.2 Logit Link Model
  2.3 Temporal Dynamics and Spatial Structure
    2.3.1 Dynamic Mixture Occupancy State-Space Model
    2.3.2 Incorporating Spatial Dependence
  2.4 Summary

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS
  3.1 Introduction
  3.2 Objective Bayesian Inference
    3.2.1 The Intrinsic Methodology
    3.2.2 Mixtures of g-Priors
      3.2.2.1 Intrinsic priors
      3.2.2.2 Other mixtures of g-priors
  3.3 Objective Bayes Occupancy Model Selection
    3.3.1 Preliminaries
    3.3.2 Intrinsic Priors for the Occupancy Problem
    3.3.3 Model Posterior Probabilities
    3.3.4 Model Selection Algorithm
  3.4 Alternative Formulation
  3.5 Simulation Experiments
    3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
    3.5.2 Summary Statistics for the Highest Posterior Probability Model
  3.6 Case Study: Blue Hawker Data Analysis
    3.6.1 Results: Variable Selection Procedure
    3.6.2 Validation for the Selection Procedure
  3.7 Discussion

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
  4.1 Introduction
  4.2 Setup for Well-Formulated Models
    4.2.1 Well-Formulated Model Spaces
  4.3 Priors on the Model Space
    4.3.1 Model Prior Definition
    4.3.2 Choice of Prior Structure and Hyper-Parameters
    4.3.3 Posterior Sensitivity to the Choice of Prior
  4.4 Random Walks on the Model Space
    4.4.1 Simple Pruning and Growing
    4.4.2 Degree Based Pruning and Growing
  4.5 Simulation Study
    4.5.1 SNR and Sample Size Effect
    4.5.2 Coefficient Magnitude
    4.5.3 Special Points on the Scale
  4.6 Case Study: Ozone Data Analysis
  4.7 Discussion

5 CONCLUSIONS

APPENDIX

A FULL CONDITIONAL DENSITIES DYMOSS

B RANDOM WALK ALGORITHMS

C WFM SIMULATION DETAILS

D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 Interpretation of BF_ji when contrasting M_j and M_i

3-1 Simulation control parameters, occupancy model selector

3-2 Comparison of average minOddsMPIP under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors

3-3 Comparison of average minOddsMPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data

3-10 MPIP, presence component

3-11 MPIP, detection component

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors

4-1 Characterization of the full models M_F and corresponding model spaces M considered in simulations

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of M_F increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-5 Variables used in the analyses of the ozone contamination dataset

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso

C-1 Experimental conditions, WFM simulations

D-1 Variables used in the analyses of the ozone contamination dataset

D-2 Marginal inclusion probabilities, intrinsic prior

D-3 Marginal inclusion probabilities, Zellner-Siow prior

D-4 Marginal inclusion probabilities, Hyper-g11

D-5 Marginal inclusion probabilities, Hyper-g21

LIST OF FIGURES

2-1 Graphical representation, occupancy model

2-2 Graphical representation, occupancy model after data augmentation

2-3 Graphical representation, multiseason model for a single site

2-4 Graphical representation, data-augmented multiseason model

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors

4-1 Graphs of well-formulated polynomial models for p = 2

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects, for model M = {1, x1, x1^2}

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects

4-4 Prior probabilities for the space of well-formulated models associated to the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)}

4-5 Prior probabilities for the space of well-formulated models associated to three main effects and one interaction term, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)}

4-6 MT, DAG of the largest true model used in simulations

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1,ch)

C-1 SNR vs. n. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

C-2 SNR vs. coefficient values. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

C-3 SNR vs. different true models MT. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology

The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure, and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity, using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary, but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection, based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.


CHAPTER 1
GENERAL INTRODUCTION

As with any other branch of science, ecology strives to grasp truths about the world that surrounds us, and in particular about nature. The objective truth sought by ecology may well be beyond our grasp; however, it is reasonable to think that, at least partially, "Nature is capable of being understood" (Dewey 1958). We can observe and interpret nature to formulate hypotheses, which can then be tested against reality. Hypotheses that encounter no or little opposition when confronted with reality may become contextual versions of the truth, and may be generalized by scaling them spatially and/or temporally so as to delimit the bounds within which they are valid.

To formulate hypotheses accurately, and in a fashion amenable to scientific inquiry, not only must the point of view and assumptions considered be made explicit, but also the object of interest, the properties of that object worthy of consideration, and the methods used in studying such properties (Reiners & Lockwood 2009; Rigler & Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms." This characterizes organisms and their interactions as the objects of interest to ecology, and prescribes distribution and abundance as a relevant property of these organisms.

With regard to the methods used to acquire ecological scientific knowledge, traditionally theoretical mathematical models (such as deterministic PDEs) have been used. However, naturally varying systems are imprecisely observed and, as such, are subject to multiple sources of uncertainty that must be explicitly accounted for. Because of this, the ecological scientific community is developing a growing interest in flexible and powerful statistical methods, and among these, Bayesian hierarchical models predominate. These methods rely on empirical observations and can accommodate fairly complex relationships between empirical observations and theoretical process models, while accounting for diverse sources of uncertainty (Hooten 2006).


Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern, one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black box approaches (Kery 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures (Berger & Pericchi 1996; Berger et al 2001; Kass & Raftery 1995; Moreno et al 1998; Robert et al 2009; Robert 1993), e.g., the use of flat priors in hypothesis testing.

Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure, and dynamics of a population. In order to address some of the concerns stated in the previous paragraph, we concentrate on the occupancy framework to develop estimation and testing tools that will allow ecologists, first, to gain insight about the estimation procedure, and second, to conduct statistically sound model selection for site-occupancy data.

1.1 Occupancy Modeling

Since MacKenzie et al (2002) and Tyre et al (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected it must be present, while if it is not detected it might or might not be. Occupancy models improve upon traditional binary regression by accounting for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.

The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size, and metapopulation dynamics (MacKenzie et al 2004; Royle & Kery 2007).

The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.

A more viable alternative, which builds on Hanski (1994), is the extension of the single-season occupancy model proposed by MacKenzie et al (2003). In this model, the heterogeneity of occupancy probabilities across seasons arises from local colonization and extinction processes. The model is flexible enough to let detection, occurrence, extinction, and colonization probabilities each depend upon its own set of covariates. Model parameters are obtained through likelihood-based estimation.

Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, which are obtained from implementation of the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, after integrating out the latent state variables (occupancy indicators) they are no longer available; therefore, finite sample estimates cannot be calculated directly. Instead, a supplementary parametric bootstrapping step is necessary. Further, additional structure such as temporal or spatial variation cannot be introduced by means of random effects (Royle & Kery 2007).

1.2 A Primer on Objective Bayesian Testing

With the advent of high dimensional data, such as that found in modern problems in ecology, genetics, physics, etc., coupled with evolving computing capability, objective Bayesian inferential methods have gained increasing popularity. This, however, is by no means a new approach to the way Bayesian inference is conducted. In fact, starting with Bayes and Laplace, and continuing for almost 200 years, Bayesian analysis was primarily based on "noninformative" priors (Berger & Bernardo 1992).

Now, subjective elicitation of prior probabilities in Bayesian analysis is widely recognized as the ideal (Berger et al 2001); however, it is often the case that the available information is insufficient to specify appropriate prior probabilistic statements. Commonly, as in model selection problems where large model spaces have to be explored, the number of model parameters is prohibitively large, preventing one from eliciting prior information for the entire parameter space. As a consequence, in practice the determination of priors through the definition of structural rules has become the alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors arising from these rules are known in the literature as noninformative, objective, default, or reference priors. Many of these connotations generate controversy and are accused, perhaps rightly, of providing a false pretension of objectivity. Nevertheless, we will avoid that discussion and refer to them herein interchangeably as noninformative or objective priors, to convey the sense that no attempt to introduce an informed opinion is made in defining prior probabilities.

A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al (2001); Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al (2008); Moreno et al (1998); Spiegelhalter & Smith (1982); Wasserman (2000); and the references therein). We find particularly interesting those derived from the model structure, in which no tuning parameters are required, especially since these can be regarded as automatic methods. Among them, methods based on the Bayes factor for intrinsic priors have proven their worth in a variety of inferential problems, given their excellent performance, flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For now, some basic notation and notions of Bayesian inferential procedures are introduced.

Hypothesis testing and the Bayes factor

Bayesian model selection techniques that aim to find the true model, as opposed to searching for the model that best predicts the data, are fundamentally extensions of Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis testing and model selection relies on determining the amount of evidence found in favor of one hypothesis (or model) over the other, given an observed set of data. Approached from a Bayesian standpoint, this type of problem can be formulated in great generality, using a natural, well defined probabilistic framework that incorporates both model and parameter uncertainty.


Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and, consequently, for the model selection problem. Bayesian model selection within a model space M = {M_1, M_2, ..., M_J}, where each model is associated with a parameter θ_j (which may be a vector of parameters itself), incorporates three types of probability distributions: (1) a prior probability distribution for each model, π(M_j); (2) a prior probability distribution for the parameters in each model, π(θ_j | M_j); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x | θ_j, M_j). These three probability densities induce the joint distribution p(x, θ_j, M_j) = f(x | θ_j, M_j) · π(θ_j | M_j) · π(M_j), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes rule:

    p(M_j | x) = \frac{m(x | M_j)\, \pi(M_j)}{\sum_{i=1}^{J} m(x | M_i)\, \pi(M_i)},    (1-1)

where m(x | M_j) = \int f(x | θ_j, M_j)\, π(θ_j | M_j)\, dθ_j is the marginal likelihood of M_j.

Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons, using posterior odds:

    \frac{p(M_j | x)}{p(M_k | x)} = \frac{m(x | M_j)}{m(x | M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}.    (1-2)

The first term on the right hand side of (1-2), m(x | M_j)/m(x | M_k), is known as the Bayes factor comparing model M_j to model M_k, and it is denoted by BF_{jk}(x). The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, given by π(M_j)/π(M_k), to produce the posterior odds.

Note that the model posterior probability in (1-1) can be expressed as a function of Bayes factors. To illustrate, let model M_* ∈ M be a reference model, and let all other models in M be compared to this reference model. Then, dividing both the numerator and denominator in (1-1) by m(x | M_*) π(M_*) yields

    p(M_j | x) = \frac{BF_{j*}(x)\, \frac{\pi(M_j)}{\pi(M_*)}}{1 + \sum_{M_i \in M,\, M_i \neq M_*} BF_{i*}(x)\, \frac{\pi(M_i)}{\pi(M_*)}}.    (1-3)
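Equation (1-3) is straightforward to evaluate once the Bayes factors against a common reference model are available. The following is a minimal numerical sketch (in Python, with illustrative numbers that are not from the dissertation); the function name and inputs are ours, and the computation is done on the log scale for numerical stability.

```python
import numpy as np

def posterior_model_probs(log_bf, prior_probs):
    """Posterior model probabilities via equation (1-3).

    log_bf[i]      : log Bayes factor of model i against a common reference
                     model M* (the reference model itself has log_bf = 0).
    prior_probs[i] : prior probability pi(M_i).
    """
    log_w = np.asarray(log_bf, float) + np.log(np.asarray(prior_probs, float))
    log_w -= log_w.max()              # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()                # normalizing reproduces (1-3)

# Three candidate models with equal priors; model 1 is the reference,
# and models 2 and 3 have Bayes factors 8 and 0.5 against it.
print(posterior_model_probs(np.log([1.0, 8.0, 0.5]), [1/3, 1/3, 1/3]))
```

With equal prior probabilities, the posterior odds reduce to the Bayes factors themselves, which is the situation described next.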

Therefore, as the Bayes factor increases, the posterior probability of model M_j given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.

Assuming equal model prior probabilities in (1-3), the prior odds are set equal to one and the model posterior odds in (1-2) become p(M_j | x)/p(M_k | x) = BF_{jk}(x). Based on the Bayes factor, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).

Table 1-1. Interpretation of BF_ji when contrasting M_j and M_i

ln BF_jk     BF_jk        Evidence in favor of M_j     P(M_j | x)
0 to 2       1 to 3       Weak evidence                0.5-0.75
2 to 6       3 to 20      Positive evidence            0.75-0.95
6 to 10      20 to 150    Strong evidence              0.95-0.99
>10          >150         Very strong evidence         >0.99

Bayesian hypothesis testing and model selection procedures through Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does not require having nested hypotheses (i.e., having the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic to perform the test. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty, by using the Bayesian machinery for model-averaged predictions and confidence bounds (Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a fully frequentist approach.

1.3 Overview of the Chapters

In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply a foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or the detectability features of a particular survey conducted at a specific site. On the other hand, the Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection, and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.

Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single season occupancy model with covariates is formulated using both probit and logit links. Based on the simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature in our framework is introduced.


Chapter 3 develops a new, fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, first some issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of a variable selection algorithm for "objective" variable selection, tailored to the occupancy model framework.

Chapter 4 touches on two important and interconnected issues when conducting model testing that have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing given the size of the model space, i.e., given the number of tests performed; and (2) non-invariance to location transformations of the variable selection procedures in the face of polynomial predictor structure. These elements both depend on the definition of prior probabilities on the model space. In this chapter, a set of priors on the model space and a stochastic search algorithm are proposed. Together, these control for model multiplicity and account for the polynomial structure among the predictors.


CHAPTER 2
MODEL ESTIMATION METHODS

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches

2.1 Introduction

Prior to the introduction of site-occupancy models (MacKenzie et al 2002; Tyre et al 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distribution through space and time, or to model the habitat of a species (Tyre et al 2003). These efforts, however, were suspect due to false-negative errors not being accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.

Site-occupancy models, developed independently by MacKenzie et al (2002) and Tyre et al (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications and numerous extensions for it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.

Several extensions to the basic single-season, closed population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al 2003; Royle & Kery 2007), to understand age/stage structure within populations (Nichols et al 2007), and to model species co-occurrence (MacKenzie et al 2004; Ovaskainen et al 2010; Waddle et al 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols 2004). MacKenzie et al suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, to investigate metapopulation dynamics, occupancy models improve upon incidence function models (Hanski 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation that is at a colonization-extinction equilibrium.

Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyper-parameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kery 2010). As such, the potential of using the methodology incorrectly is high. Commonly, these procedures are fitted with packages such as BUGS or JAGS. Although the packages' ease of use has led to a wide-spread adoption of the methods, the user may be oblivious as to the assumptions underpinning the analysis.

We believe providing straightforward and robust alternatives to implement these methods will help practitioners gain insight about how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, first we develop a versatile method to estimate the single season, closed population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation to incorporate spatial dependence among neighboring sites.

2.1.1 The Occupancy Model

In this section of the document we first introduce our results published in Dorazio & Taylor-Rodríguez (2012), and build upon them to propose relevant extensions. Under the standard sampling protocol for collecting site-occupancy data, J > 1 independent surveys are conducted at each of N representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let y_ij denote a binary random variable that indicates detection (y = 1) or non-detection (y = 0) during the j-th survey of site i. Without loss of generality, J may be assumed constant among all N sites to simplify description of the model; in practice, however, site-specific variation in J poses no real difficulties and is easily implemented. This sampling protocol therefore yields an N × J matrix Y of detection/non-detection data.

Note that the observed process y_ij is an imperfect representation of the underlying occupancy or presence process. Hence, letting z_i denote the presence indicator at site i, this model specification can be represented through the hierarchy

    y_{ij} | z_i, λ ~ Bernoulli(z_i p_{ij})
    z_i | α ~ Bernoulli(ψ_i),    (2-1)

where p_{ij} is the probability of correctly classifying as occupied the i-th site during the j-th survey, and ψ_i is the presence probability at the i-th site. The graphical representation of this process is shown in Figure 2-1.

[Figure 2-1. Graphical representation of the occupancy model, with nodes ψ_i → z_i → y_i ← p_i.]
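To make the data-generating process in (2-1) concrete, the short sketch below simulates detection histories under a probit link. The dimensions, covariates, and coefficient values are purely illustrative assumptions, not quantities taken from the dissertation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative dimensions and parameters (assumed for this example only).
N, J = 100, 4                                                  # sites and surveys per site
X = np.column_stack([np.ones(N), rng.normal(size=N)])          # presence design matrix
Q = np.dstack([np.ones((N, J)), rng.normal(size=(N, J, 1))])   # detection design array
alpha_true = np.array([-0.3, 1.0])                             # presence coefficients
lambda_true = np.array([0.4, -0.8])                            # detection coefficients

psi = norm.cdf(X @ alpha_true)                  # presence probabilities psi_i
z = rng.binomial(1, psi)                        # latent occupancy indicators z_i
p = norm.cdf(Q @ lambda_true)                   # detection probabilities p_ij
Y = rng.binomial(1, z[:, None] * p)             # N x J detection/non-detection matrix
```

Only Y (and the covariates) would be observed; z is latent, which is precisely what the estimation methods below address.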

Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process (z_i), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a longstanding strategy in the Bayesian statistical literature that models binary outcomes using a simple Gibbs sampler. This procedure, which is described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models

Probit model: data augmentation with latent normal variables

At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support (−∞, 0], and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on (0, ∞). To understand the reasoning behind this strategy, let Y ~ Bern(Φ(x^T β)) and V = x^T β + ε, with ε ~ N(0, 1). In such a case note that

    Pr(y = 1 | x^T β) = Φ(x^T β) = Pr(ε < x^T β) = Pr(ε > −x^T β) = Pr(v > 0 | x^T β).

Thus, whenever y = 1 then v > 0, and v ≤ 0 otherwise. In other words, we may think of y as a truncated version of v. Thus we can sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we are able to obtain full conditional posterior distributions for the model parameters that are easy to draw from (equation 2-3 below). Further, just as we may sample the latent variables, we may also sample the parameters.

Given some initial values for all model parameters, values for the latent variables can be simulated. By conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions. These samples can be used to generate new values for the latent variables, etc. The process is iterated using a Gibbs sampling approach. Generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.

Assume that each outcome Y_1, Y_2, ..., Y_n is such that Y_i | x_i, β ~ Bernoulli(q_i), where q_i = Φ(x_i^T β) is the standard normal CDF evaluated at x_i^T β, and where x_i and β are the p-dimensional vectors of observed covariates for the i-th observation and their corresponding parameters, respectively.

Now let y = (y_1, y_2, ..., y_n) be the vector of observed outcomes and let [β] represent the prior distribution of the model parameters. The posterior distribution of β is then given by

    [β | y] ∝ [β] \prod_{i=1}^{n} Φ(x_i^T β)^{y_i} (1 − Φ(x_i^T β))^{1−y_i},    (2-2)

which is intractable. Nevertheless, introducing latent random variables V = (V_1, ..., V_n), such that V_i ~ N(x_i^T β, 1), resolves this difficulty by specifying that whenever Y_i = 1 then V_i > 0, and if Y_i = 0 then V_i ≤ 0. This yields

    [β, v | y] ∝ [β] \prod_{i=1}^{n} φ(v_i | x_i^T β, 1) { I(v_i ≤ 0) I(y_i = 0) + I(v_i > 0) I(y_i = 1) },    (2-3)

where φ(x | μ, τ²) is the probability density function of a normal random variable x with mean μ and variance τ². The data augmentation artifact works since [β | y] = ∫ [β, v | y] dv; hence, if we sample from the joint posterior (2-3) and extract only the sampled values for β, they will correspond to samples from [β | y].

From the expression above it is possible to obtain the full conditional distributions for V and β, and thus a Gibbs sampler can be proposed. For example, if we use a flat prior for β (i.e., [β] ∝ 1), the full conditionals are given by

    β | V, y ~ MVN_p( (X^T X)^{−1} (X^T V), (X^T X)^{−1} )    (2-4)

    V | β, y ~ \prod_{i=1}^{n} tr N(x_i^T β, 1, Q_i),    (2-5)

where MVN_q(μ, Σ) represents a multivariate normal distribution with mean vector μ and variance-covariance matrix Σ, and tr N(ξ, σ², Q) stands for the truncated normal distribution with mean ξ, variance σ², and truncation region Q. For each i = 1, 2, ..., n, the support of the truncated variables is given by Q = (−∞, 0] if y_i = 0 and Q = (0, ∞) otherwise. Note that conjugate normal priors could be used alternatively.

At iteration m + 1 the Gibbs sampler draws V^(m+1) conditional on β^(m) from (2-5), and then samples β^(m+1) conditional on V^(m+1) from (2-4). This process is repeated for s = 0, 1, ..., n_sim, where n_sim is the number of iterations in the Gibbs sampler.
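A compact sketch of this two-step sampler is given below (Python, flat prior on β). It is only meant to mirror equations (2-4) and (2-5); the function name and defaults are ours, not part of the dissertation.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(y, X, n_iter=2000, rng=None):
    """Albert-Chib Gibbs sampler for probit regression with a flat prior on beta."""
    rng = rng or np.random.default_rng()
    n, k = X.shape
    Sigma = np.linalg.inv(X.T @ X)              # (X'X)^{-1}, as in eq. (2-4)
    L = np.linalg.cholesky(Sigma)
    beta = np.zeros(k)
    draws = np.empty((n_iter, k))
    for s in range(n_iter):
        mu = X @ beta
        # eq. (2-5): V_i truncated to (0, inf) if y_i = 1, to (-inf, 0] if y_i = 0;
        # bounds are standardized around mu since sd = 1
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        v = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # eq. (2-4): beta | v ~ N((X'X)^{-1} X'v, (X'X)^{-1})
        beta = Sigma @ (X.T @ v) + L @ rng.standard_normal(k)
        draws[s] = beta
    return draws
```

The same structure carries over essentially unchanged to the occupancy samplers developed later in this chapter.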

Logit model: data augmentation with latent Polya-gamma variables

Recently, Polson et al (2013) developed a novel and efficient approach for Bayesian inference for logistic models using Polya-gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of the iid sequence of Exp(1) random variables {E_k}_{k=1}^{∞} given by

    ω = \frac{2}{\pi^2} \sum_{k=1}^{\infty} \frac{E_k}{(2k − 1)^2},

with probability density function

    g(ω) = \sum_{k=0}^{\infty} (−1)^k \frac{2k + 1}{\sqrt{2\pi\omega^3}} e^{−(2k+1)^2 / (8ω)} I(ω ∈ (0, ∞))    (2-6)

and Laplace transform E[e^{−tω}] = cosh^{−1}(\sqrt{t/2}).


The Polya-gamma family of densities is obtained through an exponential tilting of the density g from (2-6). These densities, indexed by c ≥ 0, are characterized by

    f(ω | c) = cosh(c/2) e^{−c^2 ω / 2} g(ω).

The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume y_i ~ Bernoulli(δ_i), with predictors x_i' = (x_{i1}, ..., x_{ip}) and success probability δ_i = e^{x_i'β} / (1 + e^{x_i'β}). Hence the posterior for the model parameters can be represented as

    [β | y] = \frac{[β] \prod_{i=1}^{n} δ_i^{y_i} (1 − δ_i)^{1−y_i}}{c(y)},

where c(y) is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing a Polya-gamma random variable ω ~ PG(1, x'β). This yields the data-augmented posterior

    [β, ω | y] = \frac{\left( \prod_{i=1}^{n} \Pr(y_i = 1 | β) \right) f(ω | x'β)\, [β]}{c(y)},    (2-7)

such that [β | y] = \int_{\mathbb{R}^+} [β, ω | y] dω.

Thus, from the augmented model, the full conditional density for β is given by

    [β | ω, y] ∝ \left( \prod_{i=1}^{n} \Pr(y_i = 1 | β) \right) f(ω | x'β)\, [β]
               = \prod_{i=1}^{n} \frac{(e^{x_i'β})^{y_i}}{1 + e^{x_i'β}} \prod_{i=1}^{n} \cosh\left( \frac{|x_i'β|}{2} \right) \exp\left[ −\frac{(x_i'β)^2 ω_i}{2} \right] g(ω_i).    (2-8)

This expression yields a normal posterior distribution if β is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate β in the occupancy framework.
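Before turning to the occupancy setting, the sketch below illustrates this two-step Polya-gamma update for ordinary logistic regression. It assumes a Polya-gamma sampler is available (for instance, from an external PG library); the helper sample_pg below is only a placeholder, and the prior on β is taken to be flat.

```python
import numpy as np

def sample_pg(b, c, rng):
    """Placeholder for a PG(b, c) draw; wrap an existing Polya-gamma sampler here."""
    raise NotImplementedError  # assumption: supplied by the user or a PG package

def logit_gibbs_step(y, X, beta, rng):
    """One Polson et al. (2013)-style update for logistic regression, flat prior.

    Draws omega_i ~ PG(1, |x_i'beta|), then beta | omega, y ~ N(m, V) with
    V = (X' diag(omega) X)^{-1} and m = V X' kappa, where kappa_i = y_i - 1/2.
    """
    psi = X @ beta
    omega = np.array([sample_pg(1, abs(c), rng) for c in psi])
    kappa = y - 0.5
    V = np.linalg.inv(X.T @ (omega[:, None] * X))
    m = V @ (X.T @ kappa)
    return m + np.linalg.cholesky(V) @ rng.standard_normal(len(m))
```

The logit-link occupancy sampler described later in this chapter relies on updates of this form for both the presence and the detection coefficients.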

2.2 Single Season Occupancy

Let p_{ij} = F(q_{ij}^T λ) be the probability of correctly classifying the i-th site as occupied during the j-th survey, conditional on the site being occupied, and let ψ_i = F(x_i^T α) correspond to the presence probability at the i-th site. Further, let F^{−1}(·) denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by λ and α, respectively, the r-variate and p-variate coefficient vectors for the detection and for the presence probabilities. Then the joint posterior probability for the presence indicators and the model parameters is

    π*(z, α, λ) ∝ π_α(α) π_λ(λ) \prod_{i=1}^{N} F(x_i'α)^{z_i} (1 − F(x_i'α))^{(1−z_i)} × \prod_{j=1}^{J} (z_i F(q_{ij}'λ))^{y_{ij}} (1 − z_i F(q_{ij}'λ))^{1−y_{ij}}.    (2-9)

As in the simple probit regression problem, this posterior is intractable; consequently, sampling from it directly is not possible. But the procedures of Albert & Chib for the probit model and of Polson et al for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they do not require tuning parameters nor eliciting parameter priors subjectively.

2.2.1 Probit Link Model

To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent variables, denoted by w_{ij} and v_i, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

    y_{ij} | z_i, w_{ij} ~ Bernoulli( z_i I(w_{ij} > 0) )
    w_{ij} | λ ~ N(q_{ij}'λ, 1)
    λ ~ [λ]
    z_i | v_i ~ I(v_i > 0)
    v_i | α ~ N(x_i'α, 1)
    α ~ [α],    (2-10)

represented by the directed graph found in Figure 2-2.

[Figure 2-2. Graphical representation of the occupancy model after data augmentation, with nodes α → v_i → z_i → y_i ← w_i ← λ.]

Under this hierarchical model, the joint density is given by

    π*(z, v, α, w, λ) ∝ C_y π_α(α) π_λ(λ) \prod_{i=1}^{N} φ(v_i; x_i'α, 1) I(v_i > 0)^{z_i} I(v_i ≤ 0)^{(1−z_i)} × \prod_{j=1}^{J} (z_i I(w_{ij} > 0))^{y_{ij}} (1 − z_i I(w_{ij} > 0))^{1−y_{ij}} φ(w_{ij}; q_{ij}'λ, 1).    (2-11)

The full conditional densities derived from the posterior in equation (2-11) are detailed below.

1. These are obtained from the full conditional of z after integrating out v and w:

    f(z | α, λ) = \prod_{i=1}^{N} f(z_i | α, λ) = \prod_{i=1}^{N} ψ*_i^{z_i} (1 − ψ*_i)^{1−z_i},
    where ψ*_i = \frac{ψ_i \prod_{j=1}^{J} p_{ij}^{y_{ij}} (1 − p_{ij})^{1−y_{ij}}}{ψ_i \prod_{j=1}^{J} p_{ij}^{y_{ij}} (1 − p_{ij})^{1−y_{ij}} + (1 − ψ_i) \prod_{j=1}^{J} I(y_{ij} = 0)}.    (2-12)

2.  f(v | z, α) = \prod_{i=1}^{N} f(v_i | z_i, α) = \prod_{i=1}^{N} tr N(x_i'α, 1, A_i),
    where A_i = (−∞, 0] if z_i = 0 and A_i = (0, ∞) if z_i = 1,    (2-13)

    and tr N(μ, σ², A) denotes the pdf of a truncated normal random variable with mean μ, variance σ², and truncation region A.

3.  f(α | v) = φ_p(α; Σ_α X'v, Σ_α),    (2-14)

    where Σ_α = (X'X)^{−1}, and φ_k(x; μ, Σ) represents the k-variate normal density with mean vector μ and variance matrix Σ.

4.  f(w | y, z, λ) = \prod_{i=1}^{N} \prod_{j=1}^{J} f(w_{ij} | y_{ij}, z_i, λ) = \prod_{i=1}^{N} \prod_{j=1}^{J} tr N(q_{ij}'λ, 1, B_{ij}),
    where B_{ij} = (−∞, ∞) if z_i = 0; (−∞, 0] if z_i = 1 and y_{ij} = 0; (0, ∞) if z_i = 1 and y_{ij} = 1.    (2-15)

5.  f(λ | w) = φ_r(λ; Σ_λ Q'w, Σ_λ),    (2-16)

    where Σ_λ = (Q'Q)^{−1}.

The Gibbs sampling algorithm for the model can then be summarized as follows (a code sketch of this sampler is given after the list):

1. Initialize z, α, v, λ, and w.
2. Sample z_i ~ Bern(ψ*_i).
3. Sample v_i from a truncated normal with μ = x_i'α, σ = 1, and truncation region depending on z_i.
4. Sample α ~ N(Σ_α X'v, Σ_α), with Σ_α = (X'X)^{−1}.
5. Sample w_{ij} from a truncated normal with μ = q_{ij}'λ, σ = 1, and truncation region depending on y_{ij} and z_i.
6. Sample λ ~ N(Σ_λ Q'w, Σ_λ), with Σ_λ = (Q'Q)^{−1}.
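The following is a minimal sketch of steps 1-6 in Python, assuming flat priors on α and λ and treating the detection design as an N × J × r array; variable names and defaults are ours, not part of the dissertation.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def occupancy_probit_gibbs(Y, X, Q, n_iter=3000, rng=None):
    """Single-season probit occupancy sampler (steps 1-6), flat priors."""
    rng = rng or np.random.default_rng()
    N, J = Y.shape
    p, r = X.shape[1], Q.shape[2]
    Qf = Q.reshape(N * J, r)                       # stacked survey-level design
    Sig_a, Sig_l = np.linalg.inv(X.T @ X), np.linalg.inv(Qf.T @ Qf)
    La, Ll = np.linalg.cholesky(Sig_a), np.linalg.cholesky(Sig_l)
    alpha, lam = np.zeros(p), np.zeros(r)
    seen = Y.sum(axis=1) > 0                       # sites with at least one detection
    draws = {"alpha": np.empty((n_iter, p)), "lambda": np.empty((n_iter, r))}

    def tnorm(mu, lo, hi):                         # truncated normal draws, sd = 1
        return mu + truncnorm.rvs(lo - mu, hi - mu, size=mu.shape, random_state=rng)

    for s in range(n_iter):
        psi, pdet = norm.cdf(X @ alpha), norm.cdf(Q @ lam)
        # step 2 (eq. 2-12): psi*_i only matters for sites with all-zero histories;
        # sites with detections are occupied with certainty
        lik1 = np.prod(pdet**Y * (1 - pdet)**(1 - Y), axis=1)
        psi_star = psi * lik1 / (psi * lik1 + (1 - psi))
        z = np.where(seen, 1, rng.binomial(1, psi_star))
        # step 3 (eq. 2-13): v_i truncated according to z_i
        v = tnorm(X @ alpha, np.where(z == 1, 0.0, -np.inf),
                  np.where(z == 1, np.inf, 0.0))
        # step 4 (eq. 2-14)
        alpha = Sig_a @ (X.T @ v) + La @ rng.standard_normal(p)
        # step 5 (eq. 2-15): w_ij truncated according to (z_i, y_ij)
        zJ = np.repeat(z[:, None], J, axis=1)
        w = tnorm(Q @ lam, np.where((zJ == 1) & (Y == 1), 0.0, -np.inf),
                  np.where((zJ == 1) & (Y == 0), 0.0, np.inf))
        # step 6 (eq. 2-16)
        lam = Sig_l @ (Qf.T @ w.ravel()) + Ll @ rng.standard_normal(r)
        draws["alpha"][s], draws["lambda"][s] = alpha, lam
    return draws
```

Because the priors are flat, no tuning parameters or subjective hyper-parameters enter the sampler, in line with the motivation given above.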

2.2.2 Logit Link Model

Now turning to the logit link version of the occupancy model, again let y_{ij} be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site, and let z_i be the indicator variable that denotes presence (z_i = 1) or absence (z_i = 0) of the target species at the i-th site. The model is now defined by

    y_{ij} | z_i, λ ~ Bernoulli(z_i p_{ij}), where p_{ij} = \frac{e^{q_{ij}'λ}}{1 + e^{q_{ij}'λ}}
    λ ~ [λ]
    z_i | α ~ Bernoulli(ψ_i), where ψ_i = \frac{e^{x_i'α}}{1 + e^{x_i'α}}
    α ~ [α].

In this hierarchy, the contribution of a single site to the likelihood is

    L_i(α, λ) = \frac{(e^{x_i'α})^{z_i}}{1 + e^{x_i'α}} \prod_{j=1}^{J} \left( z_i \frac{e^{q_{ij}'λ}}{1 + e^{q_{ij}'λ}} \right)^{y_{ij}} \left( 1 − z_i \frac{e^{q_{ij}'λ}}{1 + e^{q_{ij}'λ}} \right)^{1−y_{ij}}.    (2-17)

As in the probit case, we data-augment the likelihood with two separate sets of latent variables, in this case each of them having a Polya-gamma distribution. Augmenting the model and using the posterior in (2-7), the joint is

    [z, v, w, α, λ | y] ∝ [α] [λ] \prod_{i=1}^{N} \frac{(e^{x_i'α})^{z_i}}{1 + e^{x_i'α}} \cosh\left( \frac{|x_i'α|}{2} \right) \exp\left[ −\frac{(x_i'α)^2 v_i}{2} \right] g(v_i)
    × \prod_{j=1}^{J} \left( z_i \frac{e^{q_{ij}'λ}}{1 + e^{q_{ij}'λ}} \right)^{y_{ij}} \left( 1 − z_i \frac{e^{q_{ij}'λ}}{1 + e^{q_{ij}'λ}} \right)^{1−y_{ij}} \cosh\left( \frac{|z_i q_{ij}'λ|}{2} \right) \exp\left[ −\frac{(z_i q_{ij}'λ)^2 w_{ij}}{2} \right] g(w_{ij}).    (2-18)

The full conditionals for z, α, v, λ, and w obtained from (2-18) are provided below.

1. The full conditional for z is obtained after marginalizing the latent variables, and yields (a short code sketch of this update is given after the list)

    f(z | α, λ) = \prod_{i=1}^{N} f(z_i | α, λ) = \prod_{i=1}^{N} ψ*_i^{z_i} (1 − ψ*_i)^{1−z_i},
    where ψ*_i = \frac{ψ_i \prod_{j=1}^{J} p_{ij}^{y_{ij}} (1 − p_{ij})^{1−y_{ij}}}{ψ_i \prod_{j=1}^{J} p_{ij}^{y_{ij}} (1 − p_{ij})^{1−y_{ij}} + (1 − ψ_i) \prod_{j=1}^{J} I(y_{ij} = 0)}.    (2-19)

2. Using the result derived in Polson et al (2013), we have that

    f(v | z, α) = \prod_{i=1}^{N} f(v_i | z_i, α) = \prod_{i=1}^{N} PG(1, x_i'α).    (2-20)

3.  f(α | v) ∝ [α] \prod_{i=1}^{N} \exp\left[ z_i x_i'α − \frac{x_i'α}{2} − \frac{(x_i'α)^2 v_i}{2} \right].    (2-21)

4. By the same result as that used for v, the full conditional for w is

    f(w | y, z, λ) = \prod_{i=1}^{N} \prod_{j=1}^{J} f(w_{ij} | y_{ij}, z_i, λ) = \left( \prod_{i ∈ S_1} \prod_{j=1}^{J} PG(1, |q_{ij}'λ|) \right) \left( \prod_{i ∉ S_1} \prod_{j=1}^{J} PG(1, 0) \right),    (2-22)

    with S_1 = { i ∈ {1, 2, ..., N} : z_i = 1 }.

5.  f(λ | z, y, w) ∝ [λ] \prod_{i ∈ S_1} \prod_{j=1}^{J} \exp\left[ y_{ij} q_{ij}'λ − \frac{q_{ij}'λ}{2} − \frac{(q_{ij}'λ)^2 w_{ij}}{2} \right],    (2-23)

    with S_1 as defined above.
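As an illustration of the marginalized update in (2-19), the short sketch below draws the presence indicators under the logit link; the function name and inputs are ours, and the coefficient draws would come from Polya-gamma updates as in the earlier sketch.

```python
import numpy as np

def sample_z_logit(Y, X, Q, alpha, lam, rng):
    """Draw presence indicators z from eq. (2-19) under the logit link."""
    psi = 1.0 / (1.0 + np.exp(-(X @ alpha)))          # presence probabilities
    p = 1.0 / (1.0 + np.exp(-(Q @ lam)))              # detection probabilities
    lik1 = np.prod(p**Y * (1 - p)**(1 - Y), axis=1)   # detection-history likelihood if z_i = 1
    seen = Y.sum(axis=1) > 0                          # sites with at least one detection
    psi_star = psi * lik1 / (psi * lik1 + (1 - psi))  # valid for all-zero histories
    return np.where(seen, 1, rng.binomial(1, psi_star))
```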

The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure

The uses of the single-season model are limited to very specific problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival, and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.

However, the approach of MacKenzie et al presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from implementation of the delta method), making it sensitive to sample size. And second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy to solve the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such finite sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes incorporation of additional structural dependence using random effects. Thus, the model cannot account for spatial dependence, which plays a fundamental role in this setting.

To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kery developed what they refer to as a dynamic occupancy state space model (DOSS), alluding to the conceptual similarity found between this model and the class of state space models found in the time series literature. In particular, this model allows one to retain the latent process (occupancy indicators), in order to obtain small sample estimates and to eventually generate extensions that incorporate structure in time and/or space through random effects.


The data used in the DOSS model come from standard repeated presence/absence surveys, with N sampling locations (patches or sites) indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for t = 1, 2, ..., T seasons. Here, an important assumption is that the site occupancy status is closed within, but not across, seasons.

As is usual in the occupancy modeling framework, two different processes are considered. The first one is the detection process per site-visit-season combination, denoted by y_{ijt}. The y_{ijt} are indicator functions that take the value 1 if the species is detected at site i, survey j, and season t, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicators z_{it}. These are indicator variables which are equal to 1 whenever y_{ijt} = 1 for one or more of the visits made to site i during season t; otherwise the values of the z_{it}'s are unknown. Royle & Kery refer to these two processes as the observation (y_{ijt}) and the state (z_{it}) models.

In this setting, the parameters of greatest interest are the occurrence or site occupancy probabilities, denoted by ψ_it, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season t−1, at season t it can either be colonized or remain unoccupied. On the other hand, if the site was in fact occupied at season t−1, it can remain that way (survival) or become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t−1 to season t at the i-th site are denoted by θ_{i(t−1)} and γ_{i(t−1)}, respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (Equation 2–24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (Equation 2–25). In particular,

z_{i1} ~ Bernoulli(ψ_{i1})   (2–24)

z_{it} | z_{i(t−1)} ~ Bernoulli( z_{i(t−1)} θ_{i(t−1)} + (1 − z_{i(t−1)}) γ_{i(t−1)} )   (2–25)

The observation model, conditional on the latent process z_{it}, is defined by

y_{ijt} | z_{it} ~ Bernoulli( z_{it} p_{ijt} )   (2–26)

Royle & Kery induce the heterogeneity by site, by site-season, and by site-survey-season, respectively, in the occupancy, in the survival and colonization, and in the detection probabilities through the following specification:

logit(ψ_{i1}) = x_1 + r_i,   r_i ~ N(0, σ²_ψ),   logit⁻¹(x_1) ~ Unif(0, 1)
logit(θ_{it}) = a_t + u_i,   u_i ~ N(0, σ²_θ),   logit⁻¹(a_t) ~ Unif(0, 1)
logit(γ_{it}) = b_t + v_i,   v_i ~ N(0, σ²_γ),   logit⁻¹(b_t) ~ Unif(0, 1)
logit(p_{ijt}) = c_t + w_{ij},   w_{ij} ~ N(0, σ²_p),   logit⁻¹(c_t) ~ Unif(0, 1)   (2–27)

where x_1, a_t, b_t, c_t are the season fixed effects for the corresponding probabilities, and where (r_i, u_i, v_i) and w_{ij} are the site and site-survey random effects, respectively. Additionally, all variance components assume the usual inverse gamma priors. As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model

We assume that the probabilities for occupancy, survival, colonization, and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kery (2007). In essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time t may again be occupied at time t + 1 if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as δ, is


linked to a linear combination of covariates x′ξ through the relationship defined by δ = F(x′ξ), where F(·) represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State Space model (DYMOSS).

As before, let y_{ijt} be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site during the t-th season, and let z_{it} be the indicator variable that denotes presence (z_{it} = 1) or absence (z_{it} = 0) of the target species at the i-th site during the t-th season, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., J}, and t ∈ {1, 2, ..., T}.

Additionally, assume that the probabilities for occupancy at time t = 1, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors α, Δ(s) = {δ(s)_{t−1}}_{t=2}^T, B(c) = {β(c)_{t−1}}_{t=2}^T, and Λ = {λ_t}_{t=1}^T, and covariate matrices X_(o), X = {X_{t−1}}_{t=2}^T, and Q = {Q_t}_{t=1}^T, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:

z_{i1} | α ~ Bernoulli(ψ_{i1}),  where ψ_{i1} = F( x′_(o)i α )

z_{it} | z_{i(t−1)}, δ(s)_{t−1}, β(c)_{t−1} ~ Bernoulli( z_{i(t−1)} θ_{i(t−1)} + (1 − z_{i(t−1)}) γ_{i(t−1)} ),

  where θ_{i(t−1)} = F( δ(s)_{t−1} + x′_{i(t−1)} β(c)_{t−1} )  and  γ_{i(t−1)} = F( x′_{i(t−1)} β(c)_{t−1} )   (2–28)

Observed model:

y_{ijt} | z_{it}, λ_t ~ Bernoulli( z_{it} p_{ijt} ),  where p_{ijt} = F( q′_{ijt} λ_t )   (2–29)

In the hierarchical setup given by Equations 2–28 and 2–29, θ_{i(t−1)} corresponds to the probability of persistence from time t−1 to time t at site i, and γ_{i(t−1)} denotes the colonization probability. Note that θ_{i(t−1)} − γ_{i(t−1)} yields the survival probability from t−1 to t. The effect of survival is introduced by changing the intercept of the linear predictor by a quantity δ(s)_{t−1}. Although in this version of the model this effect is accomplished by just modifying the intercept, it can be extended to have covariates determining δ(s)_{t−1} as well.
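To make the persistence and colonization dynamics concrete, the following minimal Python sketch simulates the latent state process and the detections of a DYMOSS-type model with a probit link and a single covariate. All variable names, dimensions, and parameter values are illustrative assumptions, not part of the formal model specification.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, T, J = 100, 5, 3                                   # sites, seasons, surveys per season
x = rng.normal(size=(N, T))                           # one site-by-season covariate (illustrative)
alpha, beta_c, delta_s, lam = 0.3, 0.8, 0.5, 0.2      # assumed "true" parameters

z = np.zeros((N, T), dtype=int)
z[:, 0] = rng.binomial(1, norm.cdf(alpha), size=N)    # first-season occupancy, psi_i1 = F(alpha)
for t in range(1, T):
    gamma = norm.cdf(beta_c * x[:, t - 1])            # colonization probability, Eq. 2-28
    theta = norm.cdf(delta_s + beta_c * x[:, t - 1])  # persistence probability, Eq. 2-28
    z[:, t] = rng.binomial(1, z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma)

p_det = norm.cdf(lam)                                 # constant detection probability (illustrative)
y = rng.binomial(1, z[:, None, :] * p_det, size=(N, J, T))   # detections: only occupied sites can yield 1

Under this sketch, an occupied site is retained with the persistence probability F(δ(s) + x′β(c)) and an empty site is colonized with probability F(x′β(c)), mirroring the state model above.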


The graphical representation of the model for a single site is shown in Figure 2-3: the node α feeds the first-season state z_{i1}; each state z_{it} feeds its successor z_{i(t+1)} through δ(s)_t and β(c)_t; and each z_{it} feeds the corresponding detections y_{it} through λ_t.

Figure 2-3. Graphical representation of the multiseason model for a single site.

The joint posterior for the model defined by this hierarchical setting is

[z, α, B(c), Δ(s), Λ | y] = C_y ∏_{i=1}^N { ψ_{i1} ∏_{j=1}^J p_{ij1}^{y_{ij1}} (1 − p_{ij1})^{(1−y_{ij1})} }^{z_{i1}} { (1 − ψ_{i1}) ∏_{j=1}^J I_{y_{ij1}=0} }^{1−z_{i1}} [λ_1][α]
  × ∏_{t=2}^T ∏_{i=1}^N [ ( θ_{i(t−1)}^{z_{it}} (1 − θ_{i(t−1)})^{1−z_{it}} ) z_{i(t−1)} + ( γ_{i(t−1)}^{z_{it}} (1 − γ_{i(t−1)})^{1−z_{it}} ) (1 − z_{i(t−1)}) ]
     × { ∏_{j=1}^J p_{ijt}^{y_{ijt}} (1 − p_{ijt})^{1−y_{ijt}} }^{z_{it}} { ∏_{j=1}^J I_{y_{ijt}=0} }^{1−z_{it}} [λ_t][β(c)_{t−1}][δ(s)_{t−1}],   (2–30)

which, as in the single season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.

Figure 2-4. Graphical representation of the data-augmented multiseason model (the nodes of Figure 2-3 augmented with the latent variables u_{i1}, v_{i,t−1}, and w_{it}).

Probit link normal-mixture DYMOSS model


We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combination of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, persistence and colonization, and detection, are

• u_i ~ N( x′_(o)i α, 1 ),
• v_{i(t−1)} ~ z_{i(t−1)} N( δ(s)_{t−1} + x′_{i(t−1)} β(c)_{t−1}, 1 ) + (1 − z_{i(t−1)}) N( x′_{i(t−1)} β(c)_{t−1}, 1 ), and
• w_{ijt} ~ N( q′_{ijt} λ_t, 1 ).

Introducing these latent variables into the hierarchical formulation yields the following.

State model:

u_{i1} | α ~ N( x′_(o)i α, 1 )
z_{i1} | u_i ~ Bernoulli( I_{u_i>0} )

and, for t > 1,

v_{i(t−1)} | z_{i(t−1)}, β_{t−1} ~ z_{i(t−1)} N( δ(s)_{t−1} + x′_{i(t−1)} β(c)_{t−1}, 1 ) + (1 − z_{i(t−1)}) N( x′_{i(t−1)} β(c)_{t−1}, 1 )
z_{it} | v_{i(t−1)} ~ Bernoulli( I_{v_{i(t−1)}>0} )   (2–31)

Observed model:

w_{ijt} | λ_t ~ N( q′_{ijt} λ_t, 1 )
y_{ijt} | z_{it}, w_{ijt} ~ Bernoulli( z_{it} I_{w_{ijt}>0} )   (2–32)

Note that the result presented in Section 2.2 corresponds to the particular case for T = 1 of the model specified by Equations 2–31 and 2–32.
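To illustrate how the data augmentation operates in practice, here is a minimal Python sketch of the truncated-normal draws for the latent v_i and w_ij in one Gibbs iteration, in the spirit of Equations 2–31 and 2–32. The design matrices, "current" parameter values, and indicators are synthetic placeholders.

import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
N, J = 6, 3
X = rng.normal(size=(N, 2));  alpha = np.array([0.2, 0.7])       # presence design / current coefficients
Q = rng.normal(size=(N * J, 2)); lam = np.array([-0.1, 0.5])     # detection design / current coefficients
z = rng.binomial(1, 0.5, size=N)                                  # current occupancy draw
y = rng.binomial(1, 0.3, size=N * J) * np.repeat(z, J)            # detections only at occupied sites

def trunc_norm(mean, lower, upper):
    # draw N(mean, 1) truncated to (lower, upper), elementwise
    return truncnorm.rvs(lower - mean, upper - mean, loc=mean, scale=1.0, random_state=rng)

# v_i | z_i, alpha: positive when z_i = 1, non-positive otherwise
mu_v = X @ alpha
v = trunc_norm(mu_v, np.where(z == 1, 0.0, -np.inf), np.where(z == 1, np.inf, 0.0))

# w_ij | y_ij, z_i, lambda: sign constrained by y_ij at occupied sites, unconstrained otherwise
mu_w = Q @ lam
z_rep = np.repeat(z, J)
lo = np.where((z_rep == 1) & (y == 1), 0.0, -np.inf)
hi = np.where((z_rep == 1) & (y == 1), np.inf, np.where(z_rep == 1, 0.0, np.inf))
w = trunc_norm(mu_w, lo, hi)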

As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let ϕ(x | μ, σ²) denote the pdf of a normally distributed random variable x with mean μ and variance σ². Also let

1. W_t = (w_{1t}, w_{2t}, ..., w_{Nt}) with w_{it} = (w_{i1t}, w_{i2t}, ..., w_{iJ_it t}) (for i = 1, 2, ..., N and t = 1, 2, ..., T),
2. u = (u_1, u_2, ..., u_N), and
3. V = (v_1, ..., v_{T−1}) with v_t = (v_{1t}, v_{2t}, ..., v_{Nt}).

For the probit link model, the joint posterior distribution is

π( Z, u, V, {W_t}_{t=1}^T, α, B(c), Δ(s), Λ ) ∝ [α] ∏_{i=1}^N ϕ( u_i | x′_(o)i α, 1 ) I_{u_i>0}^{z_{i1}} I_{u_i≤0}^{1−z_{i1}}
  × ∏_{t=2}^T [β(c)_{t−1}, δ(s)_{t−1}] ∏_{i=1}^N ϕ( v_{i(t−1)} | μ(v)_{i(t−1)}, 1 ) I_{v_{i(t−1)}>0}^{z_{it}} I_{v_{i(t−1)}≤0}^{1−z_{it}}
  × ∏_{t=1}^T [λ_t] ∏_{i=1}^N ∏_{j=1}^{J_{it}} ϕ( w_{ijt} | q′_{ijt} λ_t, 1 ) ( z_{it} I_{w_{ijt}>0} )^{y_{ijt}} ( 1 − z_{it} I_{w_{ijt}>0} )^{(1−y_{ijt})},

where μ(v)_{i(t−1)} = z_{i(t−1)} δ(s)_{t−1} + x′_{i(t−1)} β(c)_{t−1}.   (2–33)

Initialize the Gibbs sampler at α^(0), B(c)(0), Δ(s)(0), and Λ^(0). For m = 0, 1, ..., n_sim, the sampler proceeds iteratively by block sampling, sequentially for each primary sampling period, as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting [·|·] denote the full conditional probability density function of a component, conditional on all other unknown parameters and the observed data, the sampling procedure for iteration m can be summarized as

[z_1^(m) | ·] → [u^(m) | ·] → [α^(m) | ·] → [W_1^(m) | ·] → [λ_1^(m) | ·] → [z_2^(m) | ·] → [V_1^(m) | ·] → [β(c)(m)_1, δ(s)(m)_1 | ·] → [W_2^(m) | ·] → [λ_2^(m) | ·] → · · ·

· · · → [z_T^(m) | ·] → [V_{T−1}^(m) | ·] → [β(c)(m)_{T−1}, δ(s)(m)_{T−1} | ·] → [W_T^(m) | ·] → [λ_T^(m) | ·].

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail within Appendix A.


Logit link Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by the following.

State model:

u_{i1} | α ~ PG( 1, |x′_(o)i α| )
z_{i1} | u_i ~ Bernoulli( I_{u_i>0} )

and, for t > 1,

v_{i(t−1)} | · ~ PG( 1, | z_{i(t−1)} δ(s)_{t−1} + x′_{i(t−1)} β(c)_{t−1} | )
z_{it} | v_{i(t−1)} ~ Bernoulli( I_{v_{i(t−1)}>0} )   (2–34)

Observed model:

w_{ijt} | λ_t ~ PG( 1, |q′_{ijt} λ_t| )
y_{ijt} | z_{it}, w_{ijt} ~ Bernoulli( z_{it} I_{w_{ijt}>0} )   (2–35)

The logit link version of the joint posterior is given by

π( Z, u, V, {W_t}_{t=1}^T, α, B(c), Δ(s), Λ ) ∝ [α][λ_1] ∏_{i=1}^N [ exp(x′_(o)i α) ]^{z_{i1}} / [ 1 + exp(x′_(o)i α) ] · PG( u_i; 1, |x′_(o)i α| )
  × ∏_{j=1}^{J_{i1}} ( z_{i1} exp(q′_{ij1} λ_1) / [1 + exp(q′_{ij1} λ_1)] )^{y_{ij1}} ( 1 − z_{i1} exp(q′_{ij1} λ_1) / [1 + exp(q′_{ij1} λ_1)] )^{1−y_{ij1}} PG( w_{ij1}; 1, |z_{i1} q′_{ij1} λ_1| )
  × ∏_{t=2}^T [δ(s)_{t−1}][β(c)_{t−1}][λ_t] ∏_{i=1}^N [ exp(μ(v)_{i(t−1)}) ]^{z_{it}} / [ 1 + exp(μ(v)_{i(t−1)}) ] · PG( v_{it}; 1, |μ(v)_{i(t−1)}| )
  × ∏_{j=1}^{J_{it}} ( z_{it} exp(q′_{ijt} λ_t) / [1 + exp(q′_{ijt} λ_t)] )^{y_{ijt}} ( 1 − z_{it} exp(q′_{ijt} λ_t) / [1 + exp(q′_{ijt} λ_t)] )^{1−y_{ijt}} PG( w_{ijt}; 1, |z_{it} q′_{ijt} λ_t| ),   (2–36)

with μ(v)_{i(t−1)} = z_{i(t−1)} δ(s)_{t−1} + x′_{i(t−1)} β(c)_{t−1}.
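For the logit version, one update of a detection coefficient block under the Polya-Gamma scheme can be sketched as follows. The sketch assumes a PG(1, c) sampler is available (here, random_polyagamma from the polyagamma package), restricts attention to records at currently occupied sites, and uses a flat prior on the coefficients purely for illustration; it is not the exact full conditional derived in Appendix A.

import numpy as np
from polyagamma import random_polyagamma   # assumed PG(1, c) sampler

rng = np.random.default_rng(3)
n = 40
Q = np.column_stack([np.ones(n), rng.normal(size=n)])        # detection design, occupied-site records only
lam = np.array([-0.5, 1.0])                                   # current value of lambda_t
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-Q @ lam)))           # synthetic detections

omega = random_polyagamma(1, Q @ lam, random_state=rng)       # w ~ PG(1, |q' lambda|)
V = np.linalg.inv(Q.T @ (omega[:, None] * Q))                 # conditional covariance (flat prior assumed)
m = V @ (Q.T @ (y - 0.5))                                     # conditional mean, with kappa = y - 1/2
lam_new = rng.multivariate_normal(m, V)                       # updated draw of lambda_t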


The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2–36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence

In this section we describe how an additional layer of complexity, space, can also be accounted for by continuing to use the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single-season, closed-population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the Spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran 2013).

Succinctly, the spatial dependence is accounted for in the model by adding a random vector η, assumed to have a conditionally-autoregressive (CAR) prior (also known as the Gaussian Markov random field prior). To define the prior, let the pair G = (V, E) represent the undirected graph for the entire spatial region studied, where V = (1, 2, ..., N) denotes the vertices of the graph (sites) and E the set of edges between sites; E is constituted by elements of the form (i, j), indicating that sites i and j are spatially adjacent, for i, j ∈ V. The prior for the spatial effects is then characterized by

[η | τ] ∝ τ^{rank(Q)/2} exp[ −(τ/2) η′ Q η ],   (2–37)


where Q = (diag(A1) − A) is the precision matrix, with A denoting the adjacency matrix. The entries of the adjacency matrix A are such that diag(A) = 0 and A_ij = I_{(i,j)∈E}. The matrix Q is singular; hence the probability density defined in Equation 2–37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper the posterior for the model parameters is proper. If a constraint such as Σ_k η_k = 0 is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
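As a small illustration of this construction, the following Python sketch builds the precision matrix Q = diag(A1) − A for a 3 × 3 grid with rook adjacency (an assumed layout used only for demonstration) and confirms that Q is rank deficient, which is what makes the prior in Equation 2–37 improper.

import numpy as np

# CAR/GMRF precision matrix Q = diag(A 1) - A for a 3 x 3 grid with rook adjacency
n_side = 3
N = n_side ** 2
A = np.zeros((N, N))
for i in range(n_side):
    for j in range(n_side):
        k = i * n_side + j
        if i + 1 < n_side:
            A[k, k + n_side] = A[k + n_side, k] = 1.0   # vertical neighbour
        if j + 1 < n_side:
            A[k, k + 1] = A[k + 1, k] = 1.0             # horizontal neighbour

Q = np.diag(A.sum(axis=1)) - A
print(np.linalg.matrix_rank(Q))   # N - 1 = 8: Q is singular, so the CAR prior is improper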

Assuming that all but the detection process are subject to spatial correlation, and using the notation we have developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by Equations 2–38 and 2–39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding the steps to sample η_(o) and {η_t}_{t=2}^T conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.

State model:

z_{i1} | α ~ Bernoulli(ψ_{i1}),  where ψ_{i1} = F( x′_(o)i α + η_(o)i )

[η_(o) | τ] ∝ τ^{rank(Q)/2} exp[ −(τ/2) η′_(o) Q η_(o) ]

z_{it} | z_{i(t−1)}, α, β_{t−1}, λ_{t−1} ~ Bernoulli( z_{i(t−1)} θ_{i(t−1)} + (1 − z_{i(t−1)}) γ_{i(t−1)} ),

  where θ_{i(t−1)} = F( δ(s)_{t−1} + x′_{i(t−1)} β(c)_{t−1} + η_{it} )  and  γ_{i(t−1)} = F( x′_{i(t−1)} β(c)_{t−1} + η_{it} )

[η_t | τ] ∝ τ^{rank(Q)/2} exp[ −(τ/2) η′_t Q η_t ]   (2–38)


Observed model:

y_{ijt} | z_{it}, λ_t ~ Bernoulli( z_{it} p_{ijt} ),  where p_{ijt} = F( q′_{ijt} λ_t )   (2–39)

In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran 2013; Reich et al. 2006): (1) model parameters have no clear interpretation, due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work of Reich et al. (2006). This methodology is summarized in what follows.

Let a vector of spatial effects η have the CAR prior given by 2–37 above. Now consider a random vector ζ ~ MVN(0, τ K′QK), with Q defined as above and where τ K′QK corresponds to the precision of the distribution (not the covariance matrix), with the matrix K satisfying K′K = I.

This last condition implies that the linear predictor Xβ + η = Xβ + Kζ. With respect to how the matrix K is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as P⊥AP⊥, with P⊥ = I − X(X′X)⁻¹X′, and where A is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix X. These elements are captured by its spectral decomposition: its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to X, while its eigenvectors provide the patterns of spatial dependence residual to X. Thus, the matrix K is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
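The construction of K can be sketched in a few lines of Python; the adjacency matrix, the design matrix, and the number of retained eigenvectors below are synthetic, illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
N = 9
A = (rng.random((N, N)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                            # symmetric 0/1 adjacency, zero diagonal
X = np.column_stack([np.ones(N), rng.normal(size=N)])     # design matrix of fixed effects

P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)    # projection orthogonal to X
M = P_perp @ A @ P_perp                                   # Moran operator
eigval, eigvec = np.linalg.eigh(M)
order = np.argsort(eigval)[::-1]
K = eigvec[:, order[:4]]          # keep the leading eigenvectors (strongest positive spatial patterns)
eta = K @ rng.normal(size=4)      # spatial effect now lives in the reduced space: eta = K zeta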


Using this strategy, the new hierarchical formulation of our model is simply modified by letting η_(o) = K_(o) ζ_(o) and η_t = K_t ζ_t, with

1. ζ_(o) ~ MVN( 0, τ_(o) K′_(o) Q K_(o) ), where K_(o) is the eigenvector matrix for P⊥_(o) A P⊥_(o), and
2. ζ_t ~ MVN( 0, τ_t K′_t Q K_t ), where K_t is the eigenvector matrix for P⊥_t A P⊥_t, for t = 2, 3, ..., T.

The algorithms for the probit and logit link from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for (α, ζ_(o)) and (β(c)_{t−1}, δ(s)_{t−1}, ζ_t), making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary

With a few exceptions (Dorazio & Taylor-Rodríguez 2012; Johnson et al. 2013; Royle & Kery 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.

In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, and allowed detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993). There, the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and the logit link yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single season model to work with the logit link function.

The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence its assumptions may not be appropriate to address problems where the interest lies in the temporal dynamics of the population. We therefore developed a dynamic model that incorporates the notion that occupancy at a previously occupied site takes place through persistence, which depends both on survival and habitat suitability. By this we mean that a site occupied at time t may again be occupied at time t + 1 if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by x′_{i(t−1)} β(c)_{t−1}. They only differ in that persistence is also influenced by whether the site being occupied during season t − 1 enhances the suitability of the site or harms it through density dependence.

Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites in the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).

Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the chapter, and (2) build a suite of additional extensions of this framework for occupancy models. The first of them will be used to incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model. This can be accomplished


by adding a layer to the hierarchy where the source and spatial scale of the data are accounted for. The second extension is a single-season, spatially explicit, multiple-species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.


CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes, The Sign of Four

3.1 Introduction

Occupancy models are often used to understand the mechanisms that dictate the distribution of a species. Therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs 2014; Link & Barker 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). In practice, this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores, in a principled manner, the entire model space. It is completely automatic, precluding the need for both tuning parameters in the sampling algorithm and subjective elicitation of parameter prior distributions.

As mentioned above in ecological modeling if model selection or less frequently

model averaging is considered the Akaike Information Criterion (AIC) (Akaike 1983)

or a version of it is the measure of choice for comparing candidate models (Fiske amp

Chandler 2011 Mazerolle amp Mazerolle 2013) The AIC is designed to find the model

that has on average the density closest in Kullback-Leibler distance to the density

of the true data generating mechanism The model with the smallest AIC is selected

However if nested models are considered one of them being the true one generally the

AIC will not select it (Wasserman 2000) Commonly the model selected by AIC will be

more complex than the true one The reason for this is that the AIC has a weak signal to

noise ratio and as such it tends to overfit (Rao amp Wu 2001) Other versions of the AIC

provide a bias correction that enhances the signal to noise ratio leading to a stronger

penalization for model complexity Some examples are the AICc (Hurvich amp Tsai 1989)

and AICu (McQuarrie et al 1997) however these are also not consistent for selection

albeit asymptotically efficient (Rao amp Wu 2001)

If we are interested in prediction as opposed to testing the AIC is certainly

appropriate However when conducting inference the use of Bayesian model averaging

and selection methods is more fitting If the true data generating mechanism is among

those considered asymptotically Bayesian methods choose the true model with

probability one Conversely if the true model is not among the alternatives and a

suitable parameter prior is used the posterior probability of the most parsimonious

model closest to the true one tends asymptotically to one

In spite of this in general for Bayesian testing direct elicitation of prior probabilistic

statements is often impeded because the problems studied may not be sufficiently

well understood to make an informed decision about the priors Conversely there may

be a prohibitively large number of parameters making specifying priors for each of


these parameters an arduous task In addition to this seemingly innocuous subjective

choices for the priors on the parameter space may drastically affect test outcomes

This has been a recurring argument in favor of objective Bayesian procedures

which appeal to the use of formal rules to build parameter priors that incorporate the

structural information inside the likelihood while utilizing some objective criterion (Kass amp

Wasserman 1996)

One popular choice of ldquoobjectiverdquo prior is the reference prior (Berger amp Bernardo

1992) which is the prior that maximizes the amount of signal extracted from the

data These priors have proven to be effective as they are fully automatic and can

be frequentist matching in the sense that the posterior credible interval agrees with the

frequentist confidence interval from repeated sampling with equal coverage-probability

(Kass amp Wasserman 1996) Reference priors however are improper and while

they yield reasonable posterior parameter probabilities the derived model posterior

probabilities may be ill defined To avoid this shortcoming Berger amp Pericchi (1996)

introduced the intrinsic Bayes factor (IBF) for model comparison Moreno et al (1998)

building on the IBF of Berger amp Pericchi (1996) developed a limiting procedure to

generate a system of priors that yield well-defined posteriors even though these

priors may sometimes be improper The IBF is built using a data-dependent prior to

automatically generate Bayes factors however the extension introduced by Moreno

et al (1998) generates the intrinsic prior by taking a theoretical average over the space

of training samples freeing the prior from data dependence

In our view in the face of a large number of predictors the best alternative is to run

a stochastic search algorithm using good ldquoobjectiverdquo testing parameter priors and to

incorporate suitable model priors This being said the discussion about model priors is

deferred until Chapter 4 this Chapter focuses on the priors on the parameter space

The Chapter is structured as follows First issues surrounding multimodel inference

are described and insight about objective Bayesian inferential procedures is provided


Then building on modern methods for ldquoobjectiverdquo Bayesian testing to generate priors

on the parameter space the intrinsic priors for the parameters of the occupancy model

are derived These are used in the construction of an algorithm for ldquoobjectiverdquo model

selection tailored to the occupancy model framework To assess the performance of our

methods we provide results from a simulation study in which distinct scenarios both

favorable and unfavorable are used to determine the robustness of these tools and

analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez 2012; Kery et al. 2010).

3.2 Objective Bayesian Inference

As mentioned before in practice noninformative priors arising from structural

rules are an alternative to subjective elicitation of priors Some of the rules used in

defining noninformative priors include the principle of insufficient reason parametrization

invariance maximum entropy geometric arguments coverage matching and decision

theoretic approaches (see Kass amp Wasserman (1996) for a discussion)

These rules reflect one of two attitudes (1) noninformative priors either aim to

convey unique representations of ignorance or (2) they attempt to produce probability

statements that may be accepted by convention This latter attitude is in the same

spirit as how weights and distances are defined (Kass amp Wasserman 1996) and

characterizes the way in which Bayesian reference methods are interpreted today ie

noninformative priors are seen to be chosen by convention according to the situation

A word of caution must be given when using noninformative priors Difficulties arise

in their implementation that should not be taken lightly In particular these difficulties

may occur because noninformative priors are generally improper (meaning that they do

not integrate or sum to a finite number) and as such are said to depend on arbitrary

constants

Bayes factors strongly depend upon the prior distributions for the parameters

included in each of the models being compared This can be an important limitation


considering that when using noninformative priors their introduction will result in the

Bayes factors being a function of the ratio of arbitrary constants given that these priors

are typically improper (see Jeffreys 1961 Pericchi 2005 and references therein)

Many different approaches have been developed to deal with the arbitrary constants

when using improper priors since then These include the use of partial Bayes factors

(Berger amp Pericchi 1996 Good 1950 Lempers 1971) setting the ratio of arbitrary

constants to a predefined value (Spiegelhalter amp Smith 1982) and approximating to the

Bayes factor (see Haughton 1988, as cited in Berger & Pericchi 1996; Kass & Raftery 1995; Tierney & Kadane 1986).

3.2.1 The Intrinsic Methodology

Berger amp Pericchi (1996) cleverly dealt with the arbitrary constants that arise when

using improper priors by introducing the intrinsic Bayes factor (IBF) procedure This

solution based on partial Bayes factors provides the means to replace the improper

priors by proper ldquoposteriorrdquo priors The IBF is obtained from combining the model

structure with information contained in the observed data Furthermore they showed

that as the sample size tends to infinity the Intrinsic Bayes factor corresponds to the

proper Bayes factor arising from the intrinsic priors

Intrinsic priors however are not unique The asymptotic correspondence between

the IBF and the Bayes factor arising from the intrinsic prior yields two functional

equations that are solved by a whole class of intrinsic priors Because all the priors

in the class produce Bayes factors that are asymptotically equivalent to the IBF for

finite sample sizes the resulting Bayes factor is not unique To address this issue

Moreno et al (1998) formalized the methodology through the ldquolimiting procedurerdquo

This procedure allows one to obtain a unique Bayes factor consolidating the method

as a valid objective Bayesian model selection procedure which we will refer to as the

Bayes factor for intrinsic priors (BFIP) This result is particularly valid for nested models

although the methodology may be extended with some caution to nonnested models


As mentioned before the Bayesian hypothesis testing procedure is highly sensitive

to parameter-prior specification and not all priors that are useful for estimation are

recommended for hypothesis testing or model selection Evidence of this is provided

by the Jeffreys-Lindley paradox which states that a point null hypothesis will always

be accepted when the variance of a conjugate prior goes to infinity (Robert 1993)

Additionally when comparing nested models the null model should correspond to

a substantial reduction in complexity from that of larger alternative models Hence

priors for the larger alternative models that place probability mass away from the null

model are wasteful If the true model is ldquofarrdquo from the null it will be easily detected by

any statistical procedure Therefore the prior on the alternative models should ldquowork

harderrdquo at selecting competitive models that are ldquocloserdquo to the null This principle known

as the Savage continuity condition (Gunel amp Dickey 1974) is widely recognized by

statisticians

Interestingly the intrinsic prior in correspondence with the BFIP automatically

satisfies the Savage continuity condition That is when comparing nested models the

intrinsic prior for the more complex model is centered around the null model and in spite

of being a limiting procedure it is not subject to the Jeffreys-Lindley paradox

Moreover beyond the usual pairwise consistency of the Bayes factor for nested

models Casella et al (2009) show that the corresponding Bayesian procedure with

intrinsic priors for variable selection in normal regression is consistent in the entire

class of normal linear models adding an important feature to the list of virtues of the

procedure Consistency of the BFIP for the case where the dimension of the alternative

model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors

As previously mentioned in the Bayesian paradigm a model M in M is defined

by a sampling density and a prior distribution The sampling density associated with

model M is denoted by f (y|βM σ2M M) where (βM σ

2M) is a vector of model-specific


unknown parameters The prior for model M and its corresponding set of parameters is

π(βM σ2M M|M) = π(βM σ

2M |MM) middot π(M|M)

Objective local priors for the model parameters (βM σ2M) are achieved through

modifications and extensions of Zellnerrsquos g-prior (Liang et al 2008 Womack et al

2014) In particular below we focus on the intrinsic prior and provide some details for

other scaled mixtures of g-priors We defer the discussion on priors over the model

space until Chapter 5 where we describe them in detail and develop a few alternatives

of our own.

3.2.2.1 Intrinsic priors

An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi 1996; Moreno et al. 1998). Because MB ⊆ M for all M ∈ M, the intrinsic prior for (βM, σ²_M) is defined as an expected posterior prior

πI(βM, σ²_M | M) = ∫ pR(βM, σ²_M | ỹ, M) mR(ỹ | MB) dỹ,   (3–1)

where ỹ is a minimal training sample for model M, I denotes the intrinsic distributions, and R denotes distributions derived from the reference prior πR(βM, σ²_M | M) dβM dσ²_M = cM dβM dσ²_M / σ²_M. In (3–1), mR(ỹ | M) = ∫∫ f(ỹ | βM, σ²_M, M) πR(βM, σ²_M | M) dβM dσ²_M is the reference marginal of ỹ under model M, and pR(βM, σ²_M | ỹ, M) = f(ỹ | βM, σ²_M, M) πR(βM, σ²_M | M) / mR(ỹ | M) is the reference posterior density.

In the regression framework, the reference marginal mR is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model M to the base model MB is well defined and given by

BF^I_{M,MB}(y) = (1 − R²_M)^{−(n−|MB|)/2} × ∫₀¹ [ ( n + sin²(πθ/2)·(|M|+1) ) / ( n + sin²(πθ/2)·(|M|+1)/(1−R²_M) ) ]^{(n−|M|)/2} [ ( sin²(πθ/2)·(|M|+1) ) / ( n + sin²(πθ/2)·(|M|+1)/(1−R²_M) ) ]^{(|M|−|MB|)/2} dθ,   (3–2)

where R²_M is the coefficient of determination of model M versus model MB. The Bayes factor between two models M and M′ is defined as BF^I_{M,M′}(y) = BF^I_{M,MB}(y) / BF^I_{M′,MB}(y).

The "goodness" of the model M based on the intrinsic priors is given by its posterior probability

pI(M | y, M) = BF^I_{M,MB}(y) π(M | M) / Σ_{M′∈M} BF^I_{M′,MB}(y) π(M′ | M).   (3–3)
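Because Equation 3–2 involves only a one-dimensional integral over θ, the Bayes factor based on intrinsic priors can be evaluated by simple quadrature. The Python sketch below does exactly that; the inputs (R²_M, n, and the two model dimensions) are illustrative values, not results from any analysis in this document.

import numpy as np
from scipy.integrate import quad

def bf_intrinsic(R2_M, n, k_M, k_B):
    """Intrinsic Bayes factor of model M (size k_M) against base model M_B (size k_B), Eq. 3-2."""
    def integrand(theta):
        s = np.sin(np.pi * theta / 2.0) ** 2 * (k_M + 1)
        denom = n + s / (1.0 - R2_M)
        return ((n + s) / denom) ** ((n - k_M) / 2.0) * (s / denom) ** ((k_M - k_B) / 2.0)
    integral, _ = quad(integrand, 0.0, 1.0)
    return (1.0 - R2_M) ** (-(n - k_B) / 2.0) * integral

# Illustrative call: n = 100 sites, intercept-only base model, candidate model with 3 extra predictors
print(bf_intrinsic(R2_M=0.25, n=100, k_M=4, k_B=1))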

It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al. 2009; Girón et al. 2010). In the context of well-formulated models, the true model MT is the smallest well-formulated model M ∈ M such that α ∈ M whenever β_α ≠ 0. If MT is the true model, then the posterior probability of model MT based on Equation 3–3 converges to 1.

3.2.2.2 Other mixtures of g-priors

Scaled mixtures of g-priors place a reference prior on (βMB, σ²) and a multivariate normal distribution on β in M \ MB; that is, a normal with mean 0 and precision matrix

(qM w / (n σ²)) Z′_M (I − H0) Z_M,

where H0 is the hat matrix associated with Z_MB. The prior is completed by a prior on w and a choice of scaling qM, which is set at |M| + 1 to account for the minimal sample size of M. Under these assumptions, the Bayes factor for M to MB is given by

BF_{M,MB}(y) = (1 − R²_M)^{−(n−|MB|)/2} ∫ [ ( n + w(|M|+1) ) / ( n + w(|M|+1)/(1−R²_M) ) ]^{(n−|M|)/2} [ w(|M|+1) / ( n + w(|M|+1)/(1−R²_M) ) ]^{(|M|−|MB|)/2} π(w) dw.

We consider the following priors on w. The intrinsic prior is π(w) = Beta(w; 0.5, 0.5), which is only defined for w ∈ (0, 1). A version of the Zellner-Siow prior is given by w ~ Gamma(0.5, 0.5), which produces a multivariate Cauchy distribution on β. A family of hyper-g priors is defined by π(w) ∝ w^{−1/2}(β + w)^{−(α+1)/2}, which have Cauchy-like tails but produce more shrinkage than the Cauchy prior.


3.3 Objective Bayes Occupancy Model Selection

As mentioned before, Bayesian inferential approaches used for ecological models are lacking. In particular, there exists a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore thoroughly the model space considered. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single season site-occupancy models. We refer to this method as automatic and objective given that, in its implementation, no hyperparameter tuning is required and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).

An inferential method for the occupancy problem is possible using the intrinsic approach, given that we are able to link intrinsic-Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single season probit occupancy model through the hierarchy

y_ij | z_i, w_ij ~ Bernoulli( z_i I_{w_ij>0} )
w_ij | λ ~ N( q′_ij λ, 1 )
z_i | v_i ~ Bernoulli( I_{v_i>0} )
v_i | α ~ N( x′_i α, 1 ),

it is possible to solve the selection problem on the latent scale variables w_ij and v_i, and to use those results at the level of the occupancy and detection processes.

In what follows first we provide some necessary notation Then a derivation of

the intrinsic priors for the parameters of the detection and occupancy components

is outlined Using these priors we obtain the general form of the model posterior

probabilities Finally the results are incorporated in a model selection algorithm for

site-occupancy data Although the priors on the model space are not discussed in this

Chapter the software and methods developed have different choices of model priors

built in


3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well Namely

presence will be denoted by z detection by y their corresponding latent processes are

v and w and the model parameters are denoted by α and λ However some additional

notation is also necessary Let M0 =M0y M0z

denote the ldquobaserdquo model defined by

the smallest models considered for the detection and presence processes The base

models M0y and M0z include predictors that must be contained in every model that

belongs to the model space Some examples of base models are the intercept only

model a model with covariates related to the sampling design and a model including

some predictors important to the researcher that should be included in every model

Furthermore let the sets [Kz ] = 1 2 Kz and [Ky ] = 1 2 Ky index

the covariates considered for the variable selection procedure for the presence and

detection processes respectively That is these sets denote the covariates that can

be added from the base models in M0 or removed from the largest possible models

considered MF z and MF y which we will refer to as the ldquofullrdquo models The model space

can then be represented by the Cartesian product of subsets such that Ay sube [Ky ]

and Az sube [Kz ] The entire model space is populated by models of the form MA =MAy

MAz

isin M = My timesMz with MAy

isin My and MAzisin Mz

For the presence process z the design matrix for model MAzis given by the block

matrix XAz= (X0|Xr A) X0 corresponds to the design matrix of the base model ndash which

is such that M0z sube MAzisin Mz for all Az isin [Kz ] ndash and Xr A corresponds to the submatrix

that contains the covariates indexed by Az Analogously for the detection process y the

design matrix is given by QAy= (Q0|Qr A) Similarly the coefficients for models MAz

and

MAyare given by αA = (αprime

0αprimer A)

prime and λA = (λprime0λ

primer A)

prime

With these elements in place the model selection problem consists of finding

subsets of covariates indexed by A = Az Ay that have a high posterior probability

given the detection and occupancy processes This is equivalent to finding models with


high posterior odds when compared to a suitable base model. These posterior odds are given by

p(MA | y, z) / p(M0 | y, z) = [ m(y, z | MA) π(MA) ] / [ m(y, z | M0) π(M0) ] = BF_{MA,M0}(y, z) · π(MA) / π(M0).

Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem in the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that solely depends on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as this yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.

At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for the simple probit regression. In the occupancy setting the argument is

p(MA | y, z, w, v) = m(y, z, v, w | MA) π(MA) / m(y, z, w, v)
  = f_{yz}(y, z | w, v) ( ∫ f_{vw}(v, w | α, λ, MA) π_{αλ}(α, λ | MA) d(α, λ) ) π(MA) / [ f_{yz}(y, z | w, v) Σ_{M*∈M} ( ∫ f_{vw}(v, w | α, λ, M*) π_{αλ}(α, λ | M*) d(α, λ) ) π(M*) ]
  = m(v | MAz) m(w | MAy) π(MA) / [ m(v) m(w) ]
  ∝ m(v | MAz) m(w | MAy) π(MA),   (3–4)

where

1. f_{yz}(y, z | w, v) = ∏_{i=1}^N I_{z_i v_i>0} I_{(1−z_i) v_i≤0} ∏_{j=1}^J ( z_i I_{w_ij>0} )^{y_ij} ( 1 − z_i I_{w_ij>0} )^{1−y_ij},

2. f_{vw}(v, w | α, λ, MA) = ( ∏_{i=1}^N ϕ( v_i; x′_i α_{MAz}, 1 ) ) ( ∏_{i=1}^N ∏_{j=1}^{J_i} ϕ( w_ij; q′_ij λ_{MAy}, 1 ) ), where the first factor is f(v | α_rA, α_0, MAz) and the second is f(w | λ_rA, λ_0, MAy), and

3. π_{αλ}(α, λ | MA) = π_α(α | MAz) π_λ(λ | MAy).

This result implies that, once the occupancy and detection indicators are conditioned on the latent processes v and w, respectively, the model posterior probabilities only depend on the latent variables. Hence, in this case the model selection problem is driven by the posterior odds

p(MA | y, z, w, v) / p(M0 | y, z, w, v) = [ m(w, v | MA) / m(w, v | M0) ] · [ π(MA) / π(M0) ],   (3–5)

where m(w, v | MA) = m(w | MAy) · m(v | MAz), with

m(v | MAz) = ∫∫ f(v | α_rA, α_0, MAz) π(α_rA | α_0, MAz) π(α_0) dα_rA dα_0,   (3–6)

m(w | MAy) = ∫∫ f(w | λ_rA, λ_0, MAy) π(λ_rA | λ_0, MAy) π(λ_0) dλ_0 dλ_rA.   (3–7)

3.3.2 Intrinsic Priors for the Occupancy Problem

In general, the intrinsic priors as defined by Moreno et al. (1998) use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes we assume noninformative improper priors for the parameters, denoted by πN(·|·). Specifically, the intrinsic priors πIP(θM* | M*) for a vector of parameters θM* corresponding to model M* ∈ {M0, M} ⊂ M, for a response vector s with probability density (or mass) function f(s | θM*), are defined by

πIP(θM0 | M0) = πN(θM0 | M0)

πIP(θM | M) = πN(θM | M) ∫ [ m(s̃ | M0) / m(s̃ | M) ] f(s̃ | θM, M) ds̃,

where s̃ is a theoretical training sample.

In what follows whenever it is clear from the context in an attempt to simplify the

notation MA will be used to refer to MAzor MAy

and A will denote Az or Ay To derive


the parameter priors involved in equations 3ndash6 and 3ndash7 using the objective intrinsic prior

strategy we start by assuming flat priors πN(αA|MA) prop cA and πN(λA|MA) prop dA where

cA and dA are unknown constants

The intrinsic prior for the parameters associated with the occupancy process αA

conditional on model MA is

πIP(αA|MA) = πN(αA|MA)

intm(~v|MA)

m(~v|M0)f (~v|αAMA)d~v

where the marginals m(~v|Mj) with j isin A 0 are obtained by solving the analogous

equation 3ndash6 for the (theoretical) training sample ~v These marginals are given by

m(~v|Mj) = cj (2π)pjminusp0

2 |~X primej~Xj |

12 eminus

12~vprime(Iminus~Hj )~v

The training sample ~v has dimension pAz=∣∣MAz

∣∣ that is the total number of

parameters in model MAz Note that without ambiguity we use

∣∣ middot ∣∣ to denote both

the cardinality of a set and also the determinant of a matrix The design matrix ~XA

corresponds to the training sample ~v and is chosen such that ~X primeA~XA =

pAzNX primeAXA

(Leon-Novelo et al 2012) and ~Hj is the corresponding hat matrix

Replacing m(~v|MA) and m(~v|M0) in πIP(αA|MA) and solving the integral with

respect to the theoretical training sample ~v we have

πIP(αA|MA) = cA

int ((2π)minus

pAzminusp0z2

(c0

cA

)eminus

12~vprime((Iminus~HA)minus(Iminus~H0))~v |~X

primeA~XA|12

|~X prime0~X0|12

)times(

(2π)minuspAz2 eminus

12(~vminus~XAαA)

prime(~vminus~XAαA))d~v

= c0(2π)minus

pAzminusp0z2 |~X prime

Ar~XAr |

12 2minus

pAzminusp0z2 exp

[minus1

2αprimer A

(1

2~X primer A

~Xr A

)αr A

]= πN(α0)timesN

(αr A

∣∣ 0 2 middot ( ~X primer A

~Xr A)minus1)

(3ndash8)


Analogously the intrinsic prior for the parameters associated to the detection

process is

πIP(λA|MA) = d0(2π)minus

pAyminusp0y2 | ~Q prime

Ar~QAr |

12 2minus

pAyminusp0y2 exp

[minus1

2λprimer A

(1

2~Q primer A

~Qr A

)λr A

]= πN(λ0)timesN

(λr A

∣∣ 0 2 middot ( ~Q primeA~QA)

minus1)

(3ndash9)

In short, the intrinsic priors for αA = (α′_0, α′_rA)′ and λA = (λ′_0, λ′_rA)′ are the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by Az and Ay, respectively.

3.3.3 Model Posterior Probabilities

We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that p(MA | y, z, w, v) ∝ m(w, v | MA) π(MA); hence determining this posterior probability only requires calculating m(w, v | MA). Note that, since w and v are independent, obtaining the model posteriors from expression 3–4 reduces to finding closed-form expressions for the marginals m(v | MAz) and m(w | MAy), respectively, from Equations 3–6 and 3–7. Therefore

m(w, v | MA) = ∫∫ f(v, w | α, λ, MA) πIP(α | MAz) πIP(λ | MAy) dα dλ.   (3–10)

For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3–8 into Equation 3–6 (recalling that X̃′_A X̃_A = (pAz/N) X′_A X_A) and integrating out αA yields

m(v | MA) = ∫∫ c0 N( v | X_0 α_0 + X_rA α_rA, I ) N( α_rA | 0, 2 (X̃′_rA X̃_rA)⁻¹ ) dα_rA dα_0

  = c0 (2π)^{−n/2} ∫ ( pAz / (2N + pAz) )^{(pAz−p0z)/2} × exp[ −(1/2)(v − X_0 α_0)′ ( I − (2N/(2N+pAz)) H_rAz ) (v − X_0 α_0) ] dα_0

  = c0 (2π)^{−(n−p0z)/2} ( pAz / (2N + pAz) )^{(pAz−p0z)/2} |X′_0 X_0|^{−1/2} × exp[ −(1/2) v′ ( I − H_0z − (2N/(2N+pAz)) H_rAz ) v ],   (3–11)

with H_rAz = H_Az − H_0z, where H_Az is the hat matrix for the entire model MAz and H_0z is the hat matrix for the base model.

Similarly, the marginal distribution for w is

m(w | MA) = d0 (2π)^{−(J−p0y)/2} ( pAy / (2J + pAy) )^{(pAy−p0y)/2} |Q′_0 Q_0|^{−1/2} × exp[ −(1/2) w′ ( I − H_0y − (2J/(2J+pAy)) H_rAy ) w ],   (3–12)

where J = Σ_{i=1}^N J_i; in other words, J denotes the total number of surveys conducted.

Now, the marginals under the base model M0 = {M0y, M0z} are

m(v | M0) = ∫ c0 N( v | X_0 α_0, I ) dα_0 = c0 (2π)^{−(n−p0z)/2} |X′_0 X_0|^{−1/2} exp[ −(1/2) v′ (I − H_0z) v ]   (3–13)

and

m(w | M0) = d0 (2π)^{−(J−p0y)/2} |Q′_0 Q_0|^{−1/2} exp[ −(1/2) w′ (I − H_0y) w ].   (3–14)
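In the stochastic search described next, only ratios of these marginals are needed, so the arbitrary constants c0 and d0 cancel. A minimal Python sketch of the resulting log Bayes factor on the latent occupancy scale, implied by Equations 3–11 and 3–13, is given below; the latent vector and design matrices are synthetic, illustrative inputs.

import numpy as np

def log_bf_latent(v, X0, Xr):
    """log m(v | M_A) - log m(v | M_0) from Equations 3-11 and 3-13 (sketch).

    X0: base-model design matrix; Xr: columns added by model M_A."""
    N = len(v)
    XA = np.column_stack([X0, Xr])
    p0, pA = X0.shape[1], XA.shape[1]
    H0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)
    HA = XA @ np.linalg.solve(XA.T @ XA, XA.T)
    Hr = HA - H0
    return 0.5 * (pA - p0) * np.log(pA / (2 * N + pA)) + 0.5 * (2 * N / (2 * N + pA)) * (v @ Hr @ v)

# Illustrative use with synthetic latent values
rng = np.random.default_rng(4)
N = 50
X0 = np.ones((N, 1)); Xr = rng.normal(size=(N, 2))
v = rng.normal(size=N)
print(log_bf_latent(v, X0, Xr))

The analogous quantity for the detection component follows from Equations 3–12 and 3–14 with (w, Q, J) in place of (v, X, N).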

3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place, and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.

For each of the two components of the model (occupancy and detection), the algorithm first draws the set of active predictors (i.e., Az and Ay) together with their corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings correction with proposal distributions given by

q(A*_z | z_o, z_u^(t), v^(t), MAz) = (1/2) ( p( MA*_z | z_o, z_u^(t), v^(t), M_z, MA*_z ∈ L(MAz) ) + 1 / |L(MAz)| )

q(A*_y | y, z_o, z_u^(t), w^(t), MAy) = (1/2) ( p( MA*_y | y, z_o, z_u^(t), w^(t), M_y, MA*_y ∈ L(MAy) ) + 1 / |L(MAy)| )   (3–15)

where L(MAz) and L(MAy) denote the sets of models obtained from adding or removing one predictor at a time from MAz and MAy, respectively.

To promote mixing, this step is followed by an additional draw from the full conditionals of α and λ. The densities p(α_0 | ·), p(α_rA | ·), p(λ_0 | ·), and p(λ_rA | ·) can be sampled from directly with Gibbs steps. Using the notation a | · to denote the random variable a conditioned on all other parameters and on the data, these densities are given by

• α_0 | · ~ N( (X′_0 X_0)⁻¹ X′_0 v, (X′_0 X_0)⁻¹ ),
• α_rA | · ~ N( μ_{α_rA}, Σ_{α_rA} ), where the covariance matrix and mean vector are given by Σ_{α_rA} = (2N / (2N + pAz)) (X′_rA X_rA)⁻¹ and μ_{α_rA} = Σ_{α_rA} X′_rA v,
• λ_0 | · ~ N( (Q′_0 Q_0)⁻¹ Q′_0 w, (Q′_0 Q_0)⁻¹ ), and
• λ_rA | · ~ N( μ_{λ_rA}, Σ_{λ_rA} ), analogously, with covariance matrix and mean given by Σ_{λ_rA} = (2J / (2J + pAy)) (Q′_rA Q_rA)⁻¹ and μ_{λ_rA} = Σ_{λ_rA} Q′_rA w.

Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators z_u and for the corresponding latent variables v and w. The full conditional posterior densities for z_u^(t+1), v^(t+1), and w^(t+1) are those introduced in Chapter 2 for the single season probit model.

The following steps summarize the stochastic search algorithm.

1. Initialize A_y^(0), A_z^(0), z_u^(0), v^(0), w^(0), α_0^(0), λ_0^(0).

2. Sample the model indices and corresponding parameters.
(a) Draw simultaneously
   • A*_z ~ q(A_z | z_o, z_u^(t), v^(t), MAz),
   • α*_0 ~ p(α_0 | MA*_z, z_o, z_u^(t), v^(t)), and
   • α*_rA* ~ p(α_rA | MA*_z, z_o, z_u^(t), v^(t)).
(b) Accept (MAz^(t+1), α_0^(t+1,1), α_rA^(t+1,1)) = (MA*_z, α*_0, α*_rA*) with probability
   δ_z = min( 1, [ p(MA*_z | z_o, z_u^(t), v^(t)) / p(MAz^(t) | z_o, z_u^(t), v^(t)) ] · [ q(A_z^(t) | z_o, z_u^(t), v^(t), MA*_z) / q(A*_z | z_o, z_u^(t), v^(t), MAz^(t)) ] );
   otherwise let (MAz^(t+1), α_0^(t+1,1), α_rA^(t+1,1)) = (MAz^(t), α_0^(t,2), α_rA^(t,2)).
(c) Draw simultaneously
   • A*_y ~ q(A_y | y, z_o, z_u^(t), w^(t), MAy),
   • λ*_0 ~ p(λ_0 | MA*_y, y, z_o, z_u^(t), w^(t)), and
   • λ*_rA* ~ p(λ_rA | MA*_y, y, z_o, z_u^(t), w^(t)).
(d) Accept (MAy^(t+1), λ_0^(t+1,1), λ_rA^(t+1,1)) = (MA*_y, λ*_0, λ*_rA*) with probability
   δ_y = min( 1, [ p(MA*_y | y, z_o, z_u^(t), w^(t)) / p(MAy^(t) | y, z_o, z_u^(t), w^(t)) ] · [ q(A_y^(t) | y, z_o, z_u^(t), w^(t), MA*_y) / q(A*_y | y, z_o, z_u^(t), w^(t), MAy^(t)) ] );
   otherwise let (MAy^(t+1), λ_0^(t+1,1), λ_rA^(t+1,1)) = (MAy^(t), λ_0^(t,2), λ_rA^(t,2)).

3. Sample the base model parameters.
(a) Draw α_0^(t+1,2) ~ p(α_0 | MAz^(t+1), z_o, z_u^(t), v^(t)).
(b) Draw λ_0^(t+1,2) ~ p(λ_0 | MAy^(t+1), y, z_o, z_u^(t), w^(t)).

4. To improve mixing, resample the model coefficients that are not in the base model but are in MA.
(a) Draw α_rA^(t+1,2) ~ p(α_rA | MAz^(t+1), z_o, z_u^(t), v^(t)).
(b) Draw λ_rA^(t+1,2) ~ p(λ_rA | MAy^(t+1), y, z_o, z_u^(t), w^(t)).

5. Sample the latent and missing (unobserved) variables.
(a) Sample z_u^(t+1) ~ p(z_u | MAz^(t+1), y, α_rA^(t+1,2), α_0^(t+1,2), λ_rA^(t+1,2), λ_0^(t+1,2)).
(b) Sample v^(t+1) ~ p(v | MAz^(t+1), z_o, z_u^(t+1), α_rA^(t+1,2), α_0^(t+1,2)).
(c) Sample w^(t+1) ~ p(w | MAy^(t+1), z_o, z_u^(t+1), λ_rA^(t+1,2), λ_0^(t+1,2)).
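As an illustration of steps 3 and 4, the Gibbs draws of α_0 and α_rA given the latent vector v can be written directly from the full conditionals listed above. The Python sketch below uses a small synthetic design purely for demonstration.

import numpy as np

def draw_alpha(v, X0, Xr, rng):
    """One Gibbs draw of (alpha_0, alpha_rA) from the full conditionals stated in Section 3.3.4 (sketch)."""
    N = len(v)
    pA = X0.shape[1] + Xr.shape[1]
    S0 = np.linalg.inv(X0.T @ X0)
    alpha0 = rng.multivariate_normal(S0 @ X0.T @ v, S0)
    Sr = (2 * N / (2 * N + pA)) * np.linalg.inv(Xr.T @ Xr)
    alpha_r = rng.multivariate_normal(Sr @ Xr.T @ v, Sr)
    return alpha0, alpha_r

rng = np.random.default_rng(5)
N = 50
X0 = np.ones((N, 1)); Xr = rng.normal(size=(N, 2)); v = rng.normal(size=N)
print(draw_alpha(v, X0, Xr, rng))

The detection-component draws of λ_0 and λ_rA have the same form with (w, Q, J) replacing (v, X, N).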

3.4 Alternative Formulation

Because the occupancy process is partially observed it is reasonable to consider

the posterior odds in terms of the observed responses that is the detections y and

the presences at sites where at least one detection takes place Partitioning the vector

of presences into observed and unobserved z = (zprimeo zprimeu)

prime and integrating out the

unobserved component the model posterior for MA can be obtained as

p(MA|y zo) prop Ezu [m(y z|MA)] π(MA) (3ndash16)

Data-augmenting the model in terms of latent normal variables a la Albert and Chib

the marginals for any model My Mz = M isin M of z and y inside of the expectation in

equation 3ndash16 can be expressed in terms of the latent variables

m(y z|M) =

intT (z)

intT (yz)

m(w v|M)dwdv

=

(intT (z)

m(v| Mz)dv

)(intT (yz)

m(w|My)dw

) (3ndash17)

where T (z) and T (y z) denote the corresponding truncation regions for v and w which

depend on the values taken by z and y and

m(v|Mz) =

intf (v|αMz)π(α|Mz)dα (3ndash18)

m(w|My) =

intf (w|λMy)π(λ|My)dλ (3ndash19)

The last equality in equation 3ndash17 is a consequence of the independence of the

latent processes v and w Using expressions 3ndash18 and 3ndash19 allows one to embed this

model selection problem in the classical linear normal regression setting where many

ldquoobjectiverdquo Bayesian inferential tools are available In particular these expressions

facilitate deriving the parameter intrinsic priors (Berger amp Pericchi 1996 Moreno

et al 1998) for this problem This approach is an extension of the one implemented in

Leon-Novelo et al (2012) for the simple probit regression problem


Using this alternative approach all that is left is to integrate m(v|MA) and m(w|MA)

over their corresponding truncation regions T (z) and T (y z) which yields m(y z|MA)

and then to obtain the expectation with respect to the unobserved zrsquos Note however

two issues arise First such integrals are not available in closed form Second

calculating the expectation over the limit of integration further complicates things To

address these difficulties it is possible to express E [m(y z|MA)] as

Ezu [m(y z|MA)] = Ezu

[(intT (z)

m(v| MAz)dv

)(intT (yz)

m(w|MAy)dw

)](3ndash20)

= Ezu

[(intT (z)

intm(v| MAz

α0)πIP(α0|MAz

)dα0dv

)times(int

T (yz)

intm(w| MAy

λ0)πIP(λ0|MAy

)dλ0dw

)]

= Ezu

int (int

T (z)

m(v| MAzα0)dv

)︸ ︷︷ ︸

g1(T (z)|MAz α0)

πIP(α0|MAz)dα0 times

int (intT (yz)

m(w|MAyλ0)dw

)︸ ︷︷ ︸

g2(T (yz)|MAy λ0)

πIP(λ0|MAy)dλ0

= Ezu

[intg1(T (z)|MAz

α0)πIP(α0|MAz

)dα0 timesintg2(T (y z)|MAy

λ0)πIP(λ0|MAy

)dλ0

]= c0 d0

int intEzu

[g1(T (z)|MAz

α0)g2(T (y z)|MAyλ0)

]dα0 dλ0

where the last equality follows from Fubinirsquos theorem since m(v|MAzα0) and

m(w|MAyλ0) are proper densities From 3ndash21 the posterior odds are

p(MA|y zo)p(M0|y zo)

=

int intEzu

[g1(T (z)|MAz

α0)g2(T (y z)|MAyλ0)

]dα0 dλ0int int

Ezu

[g1(T (z)|M0z α0)g2(T (y z)|M0y λ0)

]dα0 dλ0

π(MA)

π(M0)

(3ndash21)


3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component, and, finally, the amount of signal in the predictors for the detection component.

For each model component the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection have, respectively, 30 and 20 predictors. Therefore the model space contains 2^30 × 2^20 ≈ 1.12 × 10^15 candidate models.

To control the amount of signal in the presence and detection components values

for the model parameter were purposefully chosen so that quantiles 10 50 and 90 of the

occupancy and detection probabilities match some pre-specified probabilities Because

presence and detection are binary variables the amount of signal in each model

component associates to the spread and center of the distribution for the occupancy and

detection probabilities respectively Low signal levels relate to occupancy or detection

probabilities close to 50 High signal levels associate with probabilities close to 0 or 1

Large spreads of the distributions for the occupancy and detection probabilities reflect

greater heterogeneity among the observations collected improving the discrimination

capability of the model and viceversa

Therefore for the presence component the parameter values of the true model

were chosen to set the median for the occupancy probabilities equal 05 The chosen

parameter values also fix quantiles 10 and 90 symmetrically about 05 at small (Qz10 =

03Qz90 = 07) intermediate (Qz

10 = 02Qz90 = 08) and large (Qz

10 = 01Qz90 = 09)

distances For the detection component the model parameters are obtained to reflect

detection probabilities concentrated about low values (Qy50 = 02) intermediate values

(Qy50 = 05) and high values (Qy

50 = 08) while keeping quantiles 10 and 90 fixed at 01

and 09 respectively

68

Table 3-1 Simulation control parameters occupancy model selectorParameter Values considered

N 50 100

J 3 5

(Qz10Q

z50Q

z90)

(03 05 07) (02 05 08) (01 05 09)

(Qy

10Qy50Q

y90)

(01 02 09) (01 05 09) (01 08 09)

There are in total 36 scenarios these result from crossing all the levels of the

simulation control parameters (Table 3-1) Under each of these scenarios 20 data sets

were generated at random True presence and detection indicators were generated

with the probit model formulation from Chapter 2 This with the assumed true models

MTz = 1 x2 x15 x16 x22 x28 for the presence and MTy = 1 q7 q10 q12 q17 for

the detection with the predictors included in the randomly generated datasets In this

context 1 represents the intercept term Throughout the Section we refer to predictors

included in the true models as true predictors and to those absent as false predictors

The selection procedure was conducted using each one of these data sets with

two different priors on the model space the uniform or equal probability prior and a

multiplicity correcting prior

The results are summarized through the marginal posterior inclusion probabilities

(MPIPs) for each predictor and also the five highest posterior probability models (HPM)

The MPIP for a given predictor under a specific scenario and for a particular data set is

defined as

p(predictor is included|y zw v) =sumMisinM

I(predictorisinM)p(M|y zw vM) (3ndash22)

In addition we compare the MPIP odds between predictors present in the true model

and predictors absent from it Specifically we consider the minimum odds of marginal

posterior inclusion probabilities for the predictors Let ~ξ and ξ denote respectively a

69

predictor in the true model MT and a predictor absent from MT We define the minimum

MPIP odds between the probabilities of true and false predictor as

minOddsMPIP =min~ξisinMT

p(I~ξ = 1|~ξ isin MT )

maxξ isinMTp(Iξ = 1|ξ isin MT )

(3ndash23)

If the variable selection procedure adequately discriminates true and false predictors

minOddsMPIP will take values larger than one The ability of the method to discriminate

between the least probable true predictor and the most probable false predictor worsens

as the indicator approaches 0351 Marginal Posterior Inclusion Probabilities for Model Predictors

For clarity in Figures 3-1 through 3-5 only predictors in the true models are labeled

and are emphasized with a dotted line passing through them The left hand side plots

in these figures contain the results for the presence component and the ones on the

right correspond to predictors in the detection component The results obtained with

the uniform model priors correspond to the black lines and those for the multiplicity

correcting prior are in red In these Figures the MPIPrsquos have been averaged over all

datasets corresponding scenarios matching the condition observed

In Figure 3-1 we contrast the mean MPIPrsquos of the predictors over all datasets from

scenarios with 50 sites to the mean MPIPrsquos obtained for the scenarios with 100 sites

Similarly Figure 3-2 compares the mean MPIPrsquos of scenarios where 3 surveys are

performed to those of scenarios having 5 surveys per site Figures 3-4 and 3-5 show the

effect of the different levels of signal considered in the occupancy probabilities and in the

detection probabilities

From these figures mainly three results can be drawn (1) the effect of the model

prior is substantial (2) the proposed methods yield MPIPrsquos that clearly separate

true predictors from false predictors and (3) the separation between MPIPrsquos of true

predictors and false predictors is noticeably larger in the detection component

70

Regardless of the simulation scenario and model component observed under the

uniform prior false predictors obtain a relatively high MPIP Conversely the multiplicity

correction prior strongly shrinks towards 0 the MPIP for false predictors In the presence

component the MPIP for the true predictors is shrunk substantially under the multiplicity

prior however there remains a clear separation between true and false predictors In

contrast in the detection component the MPIP for true predictors remains relatively high

(Figures 3-1 through 3-5)

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

Unif N=50MC N=50

Unif N=100MC N=100

Figure 3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites usinguniform (U) and multiplicity correction (MC) priors

71

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

Unif J=3MC J=3

Unif J=5MC J=5

Figure 3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per siteusing uniform (U) and multiplicity correction (MC) priors

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

Unif N=50 J=3Unif N=50 J=5

Unif N=100 J=3Unif N=100 J=5

MC N=50 J=3MC N=50 J=5

MC N=100 J=3MC N=100 J=5

Figure 3-3 Predictor MPIP averaged over scenarios with the interaction between thenumber of sites and the surveys per site using uniform (U) and multiplicitycorrection (MC) priors

72

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

U(03 05 07)MC(03 05 07)

U(02 05 08)MC(02 05 08)

U(01 05 09)MC(01 05 09)

Figure 3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancyprobabilities using uniform (U) and multiplicity correction (MC) priors

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

U(01 02 09)MC(01 02 09)

U(01 05 09)MC(01 05 09)

U(01 08 09)MC(01 08 09)

Figure 3-5 Predictor MPIP averaged over scenarios with equal signal in the detectionprobabilities using uniform (U) and multiplicity correction (MC) priors

73

In scenarios where more sites were surveyed the separation between the MPIP of

true and false predictors grew in both model components (Figure 3-1) Increasing the

number of sites has an effect over both components given that every time a new site is

included covariate information is added to the design matrix of both the presence and

the detection components

On the hand increasing the number of surveys affects the MPIP of predictors in the

detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors

of the presence component This may appear to be counterintuitive however increasing

the number of surveys only increases the number of observation in the design matrix

for the detection while leaving unaltered the design matrix for the presence The small

changes observed in the MPIP for the presence predictors J increases are exclusively

a result of having additional detection indicators equal to 1 in sites where with less

surveys would only have 0 valued detections

From Figure 3-3 it is clear that for the presence component the effect of the number

of sites dominates the behavior of the MPIP especially when using the multiplicity

correction priors In the detection component the MPIP is influenced by both the number

of sites and number of surveys The influence of increasing the number of surveys is

larger when considering a smaller number of sites and viceversa

Regarding the effect of the distribution for the occupancy probabilities we observe

that mostly the detection component is affected There is stronger discrimination

between true and false predictors as the distribution has a higher variability (Figure

3-4) This is consistent with intuition since having the presence probabilities more

concentrated about 05 implies that the predictors do not vary much from one site to

the next whereas having the occupancy probabilities more spread out would have the

opposite effect

Finally concentrating the detection probabilities about high or low values For

predictors in the detection component the separation between MPIP of true and false

74

predictors is larger both in scenarios where the distribution of the detection probability

is centered about 02 or 08 when compared to those scenarios where this distribution

is centered about 05 (where the signal of the predictors is weakest) For predictors in

the presence component having the detection probabilities centered at higher values

slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and

reduces that of false predictors

Table 3-2 Comparison of average minOddsMPIP under scenarios having differentnumber of sites (N=50 N=100) and under scenarios having different numberof surveys per site (J=3 J=5) for the presence and detection componentsusing uniform and multiplicity correction priors

Sites SurveysComp π(M) N=50 N=100 J=3 J=5

Presence Unif 112 131 119 124MC 320 846 420 674

Detection Unif 203 264 211 257MC 2115 3246 2139 3252

Table 3-3 Comparison of average minOddsMPIP for different levels of signal consideredin the occupancy and detection probabilities for the presence and detectioncomponents using uniform and multiplicity correction priors

(Qz10Q

z50Q

z90) (Qy

10Qy50Q

y90)

Comp π(M) (030507) (020508) (010509) (010209) (010509) (010809)

Presence Unif 105 120 134 110 123 124MC 202 455 805 238 619 640

Detection Unif 234 234 230 257 200 238MC 2537 2077 2528 2933 1852 2849

The separation between the MPIP of true and false predictors is even more

notorious in Tables 3-2 and 3-3 where the minimum MPIP odds between true and

false predictors are shown Under every scenario the value for the minOddsMPIP (as

defined in 3ndash23) was greater than 1 implying that on average even the lowest MPIP

for a true predictor is higher than the maximum MPIP for a false predictor In both

components of the model the minOddsMPIP are markedly larger under the multiplicity

correction prior and increase with the number of sites and with the number of surveys

75

For the presence component increasing the signal in the occupancy probabilities

or having the detection probabilities concentrate about higher values has a positive and

considerable effect on the magnitude of the odds For the detection component these

odds are particularly high specially under the multiplicity correction prior Also having

the distribution for the detection probabilities center about low or high values increases

the minOddsMPIP 352 Summary Statistics for the Highest Posterior Probability Model

Tables 3-4 through 3-7 show the number of true predictors that are included in

the HPM (True +) and the number of false predictors excluded from it (True minus)

The mean percentages observed in these Tables provide one clear message The

highest probability models chosen with either model prior commonly differ from the

corresponding true models The multiplicity correction priorrsquos strong shrinkage only

allows a few true predictors to be selected but at the same time it prevents from

including in the HPM any false predictors On the other hand the uniform prior includes

in the HPM a larger proportion of true predictors but at the expense of also introducing

a large number of false predictors This situation is exacerbated in the presence

component but also occurs to a minor extent in the detection component

Table 3-4 Comparison between scenarios with 50 and 100 sites in terms of the averagepercentage of true positive and true negative terms over the highestprobability models for the presence and the detection components usinguniform and multiplicity correcting priors on the model space

True + True minusComp π(M) N=50 N=100 N=50 N=100

Presence Unif 057 063 051 055MC 006 013 100 100

Detection Unif 077 085 087 093MC 049 070 100 100

Having more sites or surveys improves the inclusion of true predictors and exclusion

of false ones in the HPM for both the presence and detection components (Tables 3-4

and 3-5) On the other hand if the distribution for the occupancy probabilities is more

76

Table 3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of thepercentage of true positive and true negative predictors averaged over thehighest probability models for the presence and the detection componentsusing uniform and multiplicity correcting priors on the model space

True + True minusComp π(M) J=3 J=5 J=3 J=5

Presence Unif 059 061 052 054MC 008 010 100 100

Detection Unif 078 085 087 092MC 050 068 100 100

spread out the HPM includes more true predictors and less false ones in the presence

component In contrast the effect of the spread of the occupancy probabilities in the

detection HPM is negligible (Table 3-6) Finally there is a positive relationship between

the location of the median for the detection probabilities and the number of correctly

classified true and false predictors for the presence The HPM in the detection part of

the model responds positively to low and high values of the median detection probability

(increased signal levels) in terms of correctly classified true and false predictors (Table

3-7)

Table 3-6 Comparison between scenarios with different level of signal in the occupancycomponent in terms of the percentage of true positive and true negativepredictors averaged over the highest probability models for the presence andthe detection components using uniform and multiplicity correcting priors onthe model space

True + True minusComp π(M) (030507) (020508) (010509) (030507) (020508) (010509)

Presence Unif 055 061 064 050 054 055MC 002 008 018 100 100 100

Detection Unif 081 082 081 090 089 089MC 057 061 059 100 100 100

36 Case Study Blue Hawker Data Analysis

During 1999 and 2000 an intensive volunteer surveying effort coordinated by the

Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze

the distribution of the blue hawker Ashna cyanea (Odonata Aeshnidae) a common

dragonfly in Switzerland Given that Switzerland is a small and mountainous country

77

Table 3-7 Comparison between scenarios with different level of signal in the detectioncomponent in terms of the percentage of true positive and true negativepredictors averaged over the highest probability models for the presence andthe detection components using uniform and multiplicity correcting priors onthe model space

True + True minusComp π(M) (010209) (010509) (010809) (010209) (010509) (010809)

Presence Unif 059 059 062 051 054 054MC 006 010 011 100 100 100

Detection Unif 089 077 078 091 087 091MC 070 048 059 100 100 100

there is large variation in its topography and physio-geography as such elevation is a

good candidate covariate to predict species occurrence at a large spatial scale It can

be used as a proxy for habitat type intensity of land use temperature as well as some

biotic factors (Kery et al 2010)

Repeated visits to 1-ha pixels took place to obtain the corresponding detection

history In addition to the survey outcome the x and y-coordinates thermal-level the

date of the survey and the elevation were recorded Surveys were restricted to the

known flight period of the blue hawker which takes place between May 1 and October

10 In total 2572 sites were surveyed at least once during the surveying period The

number of surveys per site ranges from 1 to 22 times within each survey year

Kery et al (2010) summarize the results of this effort using AIC-based model

comparisons first by following a backwards elimination approach for the detection

process while keeping the occupancy component fixed at the most complex model and

then for the presence component choosing among a group of three models while using

the detection model chosen In our analysis of this dataset for the detection and the

presence we consider as the full models those used in Kery et al (2010) namely

minus1(ψ) = α0 + α1year+ α2elev+ α3elev2 + α4elev

3

minus1(p) = λ0 + λ1year+ λ2elev+ λ3elev2 + λ4elev

3 + λ5date+ λ6date2

78

where year = Iyear=2000

The model spaces for this data contain 26 = 64 and 24 = 16 models respectively

for the detection and occupancy components That is in total the model space contains

24+6 = 1 024 models Although this model space can be enumerated entirely for

illustration we implemented the algorithm from section 334 generating 10000 draws

from the Gibbs sampler Each one of the models sampled were chosen from the set of

models that could be reached by changing the state of a single term in the current model

(to inclusion or exclusion accordingly) This allows a more thorough exploration of the

model space because for each of the 10000 models drawn the posterior probabilities

for many more models can be observed Below the labels for the predictors are followed

by either ldquozrdquo or ldquoyrdquo accordingly to represent the component they pertain to Finally

using the results from the model selection procedure we conducted a validation step to

determine the predictive accuracy of the HPMrsquos and of the median probability models

(MPMrsquos) The performance of these models is then contrasted with that of the model

ultimately selected by Kery et al (2010)361 Results Variable Selection Procedure

The model finally chosen for the presence component in Kery et al (2010) was not

found among the highest five probability models under either model prior 3-8 Moreover

the year indicator was never chosen under the multiplicity correcting prior hinting that

this term might correspond to a falsely identified predictor under the uniform prior

Results in Table 3-10 support this claim the marginal inclusion posterior probability for

the year predictor is 7 under the multiplicity correction prior The multiplicity correction

prior concentrates more densely the model posterior probability mass in the highest

ranked models (90 of the mass is in the top five models) than the uniform prior (which

account for 40 of the mass)

For the detection component the HPM under both priors is the intercept only model

which we represent in Table 3-9 with a blank label In both cases this model obtains very

79

Table 3-8 Posterior probability for the five highest probability models in the presencecomponent of the blue hawker data

Uniform model priorRank Mz selected p(Mz |y)

1 yrz+elevz 0102 yrz+elevz+elevz3 0083 elevz2+elevz3 0084 yrz+elevz2 0075 yrz+elevz3 007

Multiplicity correcting model priorRank Mz selected p(Mz |y)

1 elevz+elevz3 0532 0153 elevz+elevz2 0094 elevz2 0065 elevz+elevz2+elevz3 005

high posterior probabilities The terms contained in cubic polynomial for the elevation

appear to contain some relevant information however this conflicts with the MPIPs

observed in Table 3-11 which under both model priors are relatively low (lt 20 with the

uniform and le 4 with the multiplicity correcting prior)

Table 3-9 Posterior probability for the five highest probability models in the detectioncomponent of the blue hawker data

Uniform model priorRank Mz selected p(Mz |y)

1 0452 elevy3 0063 elevy2 0054 elevy 0055 yry 004

Multiplicity correcting model priorRank Mz selected p(Mz |y)

1 0862 elevy3 0023 datey2 0024 elevy2 0025 yry 002

Finally it is possible to use the MPIPs to obtain the median probability model which

contains the terms that have a MPIP higher than 50 For the occupancy process

(Table 3-10) under the uniform prior the model with the year the elevation and the

elevation cubed are included The MPM with multiplicity correction prior coincides with

the HPM from this prior The MPM chosen for the detection component (Table 3-11)

under both priors is the intercept only model coinciding again with the HPM

Given the outcomes of the simulation studies from Section 35 especially those

pertaining to the detection component the results in Table 3-11 appear to indicate that

none of the predictors considered belong to the true model especially when considering

80

Table 3-10 MPIP presence component

Predictor p(predictor isin MTz |y z w v)

Unif MultCorryrz 053 007elevz 051 073elevz2 045 023elevz3 050 067

Table 3-11 MPIP detection component

Predictor p(predictor isin MTy |y z w v)

Unif MultCorryry 019 003elevy 018 003elevy2 018 003elevy 3 019 004datey 016 003datey2 015 004

those derived with the multiplicity correction prior On the other hand for the presence

component (Table 3-10) there is an indication that terms related to the cubic polynomial

in elevz can explain the occupancy patterns362 Validation for the Selection Procedure

Approximately half of the sites were selected at random for training (ie for model

selection and parameter estimation) and the remaining half were used as test data In

the previous section we observed that using the marginal posterior inclusion probability

of the predictors the our method effectively separates predictors in the true model from

those that are not in it However in Tables 3-10 and 3-11 this separation is only clear for

the presence component using the multiplicity correction prior

Therefore in the validation procedure we observe the misclassification rates for the

detections using the following models (1) the model ultimately recommended in Kery

et al (2010) (yrz+elevz+elevz2+elevz3 + elevy+ elevy2+ datey+ datey2) (2) the

highest probability model (HPM) with a uniform prior (yrz+elevz) (3) the HPM with a

multiplicity correcting prior (elevz + elevz3 ) (4) the median probability model (MPM)

ndashthe model including only predictors with a MPIP larger than 50ndash with the uniform

prior (yrz+elevz+elevz3) and finally (5) the MPM with a multiplicity correction prior

(elevz+elevz3 same as the HPM with multiplicity correction)

We must emphasize that the models resulting from the implement ion of our model

selection procedure used exclusively the training dataset On the other hand the model

in Kery et al (2010) was chosen to minimize the prediction error of the complete data

81

Because this model was obtained from the full dataset results derived from it can only

be considered as a lower bound for the prediction errors

The benchmark misclassification error rate for true 1rsquos is high (close to 70)

However the misclassification rate for true 0rsquos which accounts for most of the

responses is less pronounced (15) Overall the performance of the selected models

is comparable They yield considerably worse results than the benchmark for the true

1rsquos but achieve rates close to the benchmark for the true zeros Pooling together

the results for true ones and true zeros the selected models with either prior have

misclassification rates close to 30 The benchmark model performs comparably with a

joint misclassification error of 23 (Table 3-12)

Table 3-12 Mean misclassification rate for HPMrsquos and MPMrsquos using uniform andmultiplicity correction model priors

Model True 1 True 0 Jointbenchmark (Kery et al 2010) yrz+elevz+elevz2+elevz3 + 066 015 023

elevy+ elevy2+ datey+ datey2

HPM Unif yrz+elevz 083 017 028HPMHPM MC elevz + elevz3 082 018 028MPM Unif yrz+elevz+elevz3 082 018 029

37 Discussion

In this Chapter we proposed an objective and fully automatic Bayes methodology for

the single season site-occupancy model The methodology is said to be fully automatic

because no hyper-parameter specification is necessary in defining the parameter priors

and objective because it relies on the intrinsic priors derived from noninformative priors

The intrinsic priors have been shown to have desirable properties as testing priors We

also propose a fast stochastic search algorithm to explore large model spaces using our

model selection procedure

Our simulation experiments demonstrated the ability of the method to single out the

predictors present in the true model when considering the marginal posterior inclusion

probabilities for the predictors For predictors in the true model these probabilities

were comparatively larger than those for predictors absent from it Also the simulations

82

indicated that the method has a greater discrimination capability for predictors in the

detection component of the model especially when using multiplicity correction priors

Multiplicity correction priors were not described in this Chapter however their

influence on the selection outcome is significant This behavior was observed in the

simulation experiment and in the analysis of the Blue Hawker data Model priors play an

essential role As the number of predictors grows these are instrumental in controlling

for selection of false positive predictors Additionally model priors can be used to

account for predictor structure in the selection process which helps both to reduce the

size of the model space and to make the selection more robust These issues are the

topic of the next Chapter

Accounting for the polynomial hierarchy in the predictors within the occupancy

context is a straightforward extension of the procedures we describe in Chapter 4

Hence our next step is to develop efficient software for it An additional direction we

plan to pursue is developing methods for occupancy variable selection in a multivariate

setting This can be used to conduct hypothesis testing in scenarios with varying

conditions through time or in the case where multiple species are co-observed A

final variation we will investigate for this problem is that of occupancy model selection

incorporating random effects

83

CHAPTER 4PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely themost important

ndashSherlock HolmesA Case of Identity

41 Introduction

In regression problems if a large number of potential predictors is available the

complete model space is too large to enumerate and automatic selection algorithms are

necessary to find informative parsimonious models This multiple testing problem

is difficult and even more so when interactions or powers of the predictors are

considered In the ecological literature models with interactions andor higher order

polynomial terms are ubiquitous (Johnson et al 2013 Kery et al 2010 Zeller et al

2011) given the complexity and non-linearities found in ecological processes Several

model selection procedures even in the classical normal linear setting fail to address

two fundamental issues (1) the model selection outcome is not invariant to affine

transformations when interactions or polynomial structures are found among the

predictors and (2) additional penalization is required to control for false positives as the

model space grows (ie as more covariates are considered)

These two issues motivate the developments developed throughout this Chapter

Building on the results of Chipman (1996) we propose investigate and provide

recommendations for three different prior distributions on the model space These

priors help control for test multiplicity while accounting for polynomial structure in the

predictors They improve upon those proposed by Chipman first by avoiding the need

for specific values for the prior inclusion probabilities of the predictors and second

by formulating principled alternatives to introduce additional structure in the model

84

priors Finally we design a stochastic search algorithm that allows fast and thorough

exploration of model spaces with polynomial structure

Having structure in the predictors can determine the selection outcome As an

illustration consider the model E [y ] = β00 + β01x2 + β20x21 where the order one

term x1 is not present (this choice of subscripts for the coefficients is defined in the

following section) Transforming x1 7rarr xlowast1 = x1 + c for some c = 0 the model

becomes E [y ] = β00 + β01x2 + βlowast20x

lowast21 Note that in terms of the original predictors

xlowast21 = x21 + 2c middot x1 + c2 implying that this seemingly innocuous transformation of x1

modifies the column space of the design matrix by including x1 which was not in the

original model That is when lower order terms in the hierarchy are omitted from the

model the column space of the design matrix is not invariant to afine transformations

As the hat matrix depends on the column space the modelrsquos predictive capability is also

affected by how the covariates in the model are coded an undesirable feature for any

model selection procedure To make model selection invariant to afine transformations

the selection must be constrained to the subset of models that respect the hierarchy

(Griepentrog et al 1982 Khuri 2002 McCullagh amp Nelder 1989 Nelder 2000

Peixoto 1987 1990) These models are known as well-formulated models (WFMs)

Succinctly a model is well-formulated if for any predictor in the model every lower order

predictor associated with it is also in the model The model above is not well-formulated

as it contains x21 but not x1

WFMs exhibit strong heredity in that all lower order terms dividing higher order

terms in the model must also be included An alternative is to only require weak heredity

(Chipman 1996) which only forces some of the lower terms in the corresponding

polynomial hierarchy to be in the model However Nelder (1998) demonstrated that the

conditions under which weak heredity allows the design matrix to be invariant to afine

transformations of the predictors are too restrictive to be useful in practice

85

Although this topic appeared in the literature more than three decades ago (Nelder

1977) only recently have modern variable selection techniques been adapted to

account for the constraints imposed by heredity As described in Bien et al (2013)

the current literature on variable selection for polynomial response surface models

can be classified into three broad groups mult-istep procedures (Brusco et al 2009

Peixoto 1987) regularized regression methods (Bien et al 2013 Yuan et al 2009)

and Bayesian approaches (Chipman 1996) The methods introduced in this Chapter

take a Bayesian approach towards variable selection for well-formulated models with

particular emphasis on model priors

As mentioned in previous chapters the Bayesian variable selection problem

consists of finding models with high posterior probabilities within a pre-specified model

space M The model posterior probability for M isin M is given by

p(M|yM) prop m(y|M)π(M|M) (4ndash1)

Model posterior probabilities depend on the prior distribution on the model space

as well as on the prior distributions for the model specific parameters implicitly through

the marginals m(y|M) Priors on the model specific parameters have been extensively

discussed in the literature (Berger amp Pericchi 1996 Berger et al 2001 George 2000

Jeffreys 1961 Kass amp Wasserman 1996 Liang et al 2008 Zellner amp Siow 1980) In

contrast the effect of the prior on the model space has until recently been neglected

A few authors (eg Casella et al (2014) Scott amp Berger (2010) Wilson et al (2010))

have highlighted the relevance of the priors on the model space in the context of multiple

testing Adequately formulating priors on the model space can both account for structure

in the predictors and provide additional control on the detection of false positive terms

In addition using the popular uniform prior over the model space may lead to the

undesirable and ldquoinformativerdquo implication of favoring models of size p2 (where p is the

86

total number of covariates) since this is the most abundant model size contained in the

model space

Variable selection within the model space of well-formulated polynomial models

poses two challenges for automatic objective model selection procedures First the

notion of model complexity takes on a new dimension Complexity is not exclusively

a function of the number of predictors but also depends upon the depth and

connectedness of the associations defined by the polynomial hierarchy Second

because the model space is shaped by such relationships stochastic search algorithms

used to explore the models must also conform to these restrictions

Models without polynomial hierarchy constitute a special case of WFMs where

all predictors are of order one Hence all the methods developed throughout this

Chapter also apply to models with no predictor structure Additionally although our

proposed methods are presented for the normal linear case to simplify the exposition

these methods are general enough to be embedded in many Bayesian selection

and averaging procedures including of course the occupancy framework previously

discussed

In this Chapter first we provide the necessary definitions to characterize the

well-formulated model selection problem Then we proceed to introduce three new prior

structures on the well-formulated model space and characterize their behavior with

simple examples and simulations With the model priors in place we build a stochastic

search algorithm to explore spaces of well-formulated models that relies on intrinsic

priors for the model specific parameters mdash though this assumption can be relaxed

to use other mixtures of g-priors Finally we implement our procedures using both

simulated and real data

87

42 Setup for Well-Formulated Models

Suppose that the observations yi are modeled using the polynomial regression of

the covariates xi 1 xi p given by

yi =sum

β(α1αp)

pprodj=1

xαji j + ϵi (4ndash2)

where α = (α1 αp) belongs to Np0 the p-dimensional space of natural numbers

including 0 with ϵiiidsim N(0σ2) and only finitely many βα are allowed to be non-zero

As an illustration consider a model space that includes polynomial terms incorporating

covariates xi 1 and xi 2 only The terms x2i 2 and x2i 1xi 2 can be represented by α = (0 2)

and α = (2 1) respectively

The notation y = Z(X)β + ϵ is used to denote that observed response y =

(y1 yn) is modeled via a polynomial function Z of the original covariates contained

in X = (x1 xp) (where xj = (x1j xnj)prime) and the coefficients of the polynomial

terms are given by β A specific polynomial model M is defined by the set of coefficients

βα that are allowed to be non-zero This definition is equivalent to characterizing M

through a collection of multi-indices α isin Np0 In particular model M is specified by

M = αM1 αM|M| for αMk isin Np0 where βα = 0 for α isin M

Any particular model M uses a subset XM of the original covariates X to form the

polynomial terms in the design matrix ZM(X) Without ambiguity a polynomial model

ZM(X) on X can be identified with a polynomial model ZM(XM) on the covariates XM

The number of terms used by M to model the response y denoted by |M| corresponds

to the number of columns of ZM(XM) The coefficient vector and error variance of

the model M are denoted by βM and σ2M respectively Thus M models the data as

y = ZM(XM)βM + ϵM where ϵM sim N(0 Iσ2M

) Model M is said to be nested in model M prime

if M sub M prime M models the response of the covariates in two distinct ways choosing the

set of meaningful covariates XM as well as choosing the polynomial structure of these

covariates ZM(XM)

88

The set Np0 constitutes a partially ordered set or more succinctly a poset A poset

is a set partially ordered through a binary relation ldquo≼rdquo In this context the binary relation

on the poset Np0 is defined between pairs (ααprime) by αprime ≼ α whenever αj ge αprime

j for all

j = 1 prime with αprime ≺ α if additionally αj gt αprimej for some j The order of a term α isin Np

0

is given by the sum of its elements order(α) =sumαj When order(α) = order(αprime) + 1

and αprime ≺ α then αprime is said to immediately precede α which is denoted by αprime rarr α

The parent set of α is defined by P(α) = αprime isin Np0 αprime rarr α and is given by the

set of nodes that immediately precede the given node A polynomial model M is said to

be well-formulated if α isin M implies that P(α) sub M For example any well-formulated

model using x2i 1xi 2 to model yi must also include the parent terms xi 1xi 2 and x2i 1 their

corresponding parent terms xi 1 and xi 2 and the intercept term 1

The poset Np0 can be represented by a Directed Acyclic Graph (DAG) denoted

by (Np0) Without ambiguity we can identify nodes in the graph α isin Np

0 with terms in

the set of covariates The graph has directed edges to a node from its parents Any

well-formulated model M is represented by a subgraph (M) of (Np0) with the property

that if node α isin (M) then the nodes corresponding to P(α) are also in (M) Figure

4-1 shows examples of well-formulated polynomial models where α isin Np0 is identified

withprodp

j=1 xαjj

The motivation for considering only well-formulated polynomial models is

compelling Let ZM be the design matrix associated with a polynomial model The

subspace of y modeled by ZM given by the hat matrix HM = ZM(ZprimeMZM)

minus1ZprimeM is

invariant to affine transformations of the matrix XM if and only if M corresponds to a

well-formulated polynomial model (Peixoto 1990)

89

A B

Figure 4-1 Graphs of well-formulated polynomial models for p = 2

For example if p = 2 and yi = β(00) + β(10)xi 1 + β(01)xi 2 + β(11)xi 1xi 2 + ϵi then

the hat matrix is invariant to any covariate transformation of the form A(xi 1xi 2

)+ b for any

real-valued positive definite 2 times 2 matrix A and any real-valued vector of dimension two

b In contrast if yi = β(00) + β(20)x2i 1 + ϵi then the hat matrix formed after applying the

transformation xi 1 7rarr xi 1 + c for real c = 0 is not the same as the hat matrix formed by

the original xi 1421 Well-Formulated Model Spaces

The spaces of WFMs M considered in this paper can be characterized in terms

of two WFMs MB the base model and MF the full model The base model contains at

least the intercept term and is nested in the full model The model space M is populated

by all well formulated models M that nest MB and are nested in MF

M = M MB sube M sube MF and M is well-formulated

For M to be well-formulated the entire ancestry of each node in M must also be

included in M Because of this M isin M can be uniquely identified by two different sets

of nodes in MF the set of extreme nodes and the set of children nodes For M isin M

90

the sets of extreme and children nodes respectively denoted by E(M) and C(M) are

defined by

E(M) = α isin M MB α isin P(αprime) forall αprime isin M

C(M) = α isin MF M α cupM is well-formulated

The extreme nodes are those nodes that when removed from M give rise to a WFM in

M The children nodes are those nodes that when added to M give rise to a WFM in

M Because MB sube M for all M isin M the set of nodes E(M)cupMB determine M by

beginning with this set and iteratively adding parent nodes Similarly the nodes in C(M)

determine the set αprime isin P(α) α isin C(M)cupαprime isin E(MF ) α ≼ αprime for all α isin C(M)

which contains E(M)cupMB and thus uniquely identifies M

1

x1

x2

x21

x1x2

x22

A Extreme node set

1

x1

x2

x21

x1x2

x22

B Children node set

Figure 4-2

In Figure 4-2 the extreme and children sets for model M = 1 x1 x21 are shown for

the model space characterized by MF = 1 x1 x2 x21 x1x2 x22 In Figure 4-2A the solid

nodes represent nodes α isin M E(M) the dashed node corresponds to α isin E(M) and

the dotted nodes are not in M Solid nodes in Figure 4-2B correspond to those in M

The dashed node is the single node in C(M) and the dotted nodes are not in M cup C(M)43 Priors on the Model Space

As discussed in Scott amp Berger (2010) the Ockhamrsquos-razor effect found

automatically in Bayesian variable selection through the Bayes factor does not correct

91

for multiple testing This penalization acts against more complex models but does not

account for the collection of models in the model space which describes the multiplicity

of the testing problem This is where the role of the prior on the model space becomes

important As Scott amp Berger explain the multiplicity penalty is ldquohidden awayrdquo in the

model prior probabilities π(M|M)

In what follows we propose three different prior structures on the model space

for WFMs discuss their advantages and disadvantages and describe reasonable

choices for their hyper-parameters In addition we investigate how the choice of

prior structure and hyper-parameter combinations affect the posterior probabilities for

predictor inclusion providing some recommendations for different situations431 Model Prior Definition

The graphical structure for the model spaces suggests a method for prior

construction on M guided by the notion of inheritance A node α is said to inherit from

a node αprime if there is a directed path from αprime to α in the graph (MF ) The inheritance

is said to be immediate if order(α) = order(αprime) + 1 (equivalently if αprime isin P(α) or if αprime

immediately precedes α)

For convenience define (M) = M MB to be the set of nodes in M that are not

in the base model MB For α isin (MF ) let γα(M) be the indicator function describing

whether α is included in M ie γα(M) = I(αisinM) Denote by γν(M) the set of indicators

of inclusion in M for all order ν nodes in (MF ) Finally let γltν(M) =cupνminus1

j=0 γ j(M)

the set of indicators for inclusion in M for all nodes in (MF ) of order less than ν With

these definitions the prior probability of any model M isin M can be factored as

π(M|M) =

JmaxMprod

j=JminM

π(γ j(M)|γltj(M)M) (4ndash3)

where JminM and Jmax

M are respectively the minimum and maximum order of nodes in

(MF ) and π(γJminM (M)|γltJmin

M (M)M) = π(γJminM (M)|M)

92

Prior distributions on M can be simplified by making two assumptions First if

order(α) = order(αprime) = j then γα and γαprime are assumed to be conditionally independent

when conditioned on γltj denoted by γα perpperp γαprime|γltj Second immediate inheritance is

invoked and it is assumed that if order(α) = j then γα(M)|γltj(M) = γα(M)|γP(α)(M)

where γP(α)(M) is the inclusion indicator for the set of parent nodes of α This indicator

is one if the complete parent set of α is contained in M and zero otherwise

In Figure 4-3 these two assumptions are depicted with MF being an order two

surface in two main effects The conditional independence assumption (Figure 4-3A)

implies that the inclusion indicators for x21 x22 and x1x2 is independent when conditioned

on all the lower order terms In this same space immediate inheritance implies that

the inclusion of x21 conditioned on the inclusion of all lower order nodes is equivalent to

conditioning it on its parent set (x1 in this case)

x21 perpperp x1x2 perpperp x22

∣∣∣∣∣

1

x1

x2

A Conditional independence

x21∣∣∣∣∣

1

x1

x2

=

x21

∣∣∣∣∣ x1

B Immediate inheritance

Figure 4-3

Denote the conditional inclusion probability of node α in model M by πα =

π(γα(M) = 1|γP(α)(M)M) Under the assumptions of conditional independence

93

and immediate inheritance the prior probability of M is

π(M|πMM) =prod

αisin(MF )

πγα(M)α (1minus πα)

1minusγα(M) (4ndash4)

with πM = πα α isin (MF ) Because M must be well-formulated πα = γα =

0 if γP(α)(M) = 0 Thus the product in 4ndash4 can be restricted to the set of nodes

α isin (M)cup

C(M) Additional structure can be built into the prior on M by making

assumptions about the inclusion probabilities πα such as equality assumptions or

assumptions of a hyper-prior for these parameters Three such prior classes are

developed next first by assigning hyperpriors on πM assuming some structure among

its elements and then marginalizing out the πM

Hierarchical Uniform Prior (HUP) The HUP assumes that the non-zero πα

are all equal Specifically for a model M isin M it is assumed that πα = π for all

α isin (M)cupC(M) A complete Bayesian specification of the HUP is completed by

assuming a prior distribution for π The choice of π sim Beta(a b) produces

πHUP(M|M a b) =B(|(M)|+ a |C(M)|+ b)

B(a b) (4ndash5)

where B is the beta function Setting a = b = 1 gives the particular value of

πHUP(M|M a = 1 b = 1) =1

|(M)|+ |C(M)|+ 1

(|(M)|+ |C(M)|

|(M)|

)minus1

(4ndash6)

The HUP assigns equal probabilities to all models for which the sets of nodes (M)

and C(M) have the same cardinality This prior provides a combinatorial penalization

but essentially fails to account for the hierarchical structure of the model space An

additional penalization for model complexity can be incorporated into the HUP by

changing the values of a and b Because πα = π for all α this penalization can only

depend on some aspect of the entire graph of MF such as the total number of nodes

not in the null model |(MF )|

94

Hierarchical Independence Prior (HIP) The HIP assumes that there are no

equality constraints among the non-zero πα Each non-zero πα is given its own prior

which is assumed to be a Beta distribution with parameters aα and bα Thus the prior

probability of M under the HIP is

πHIP(M|M ab) =

prodαisin(M)

aα + bα

prodαisinC(M)

aα + bα

(4ndash7)

where the product over empty is taken to be 1 Because the πα are totally independent any

choice of aα and bα is equivalent to choosing a probability of success πα for a given α

Setting aα = bα = 1 for all α isin (M)cup

C(M) gives the particular value of

πHIP(M|M a = 1b = 1) =

(1

2

)|(M)|+|C(M)|

(4ndash8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical

structure of the model space it essentially provides no penalization for combinatorial

complexity at different levels of the hierarchy This can be observed by considering a

model space with main effects only the exponent in 4ndash8 is the same for every model in

the space because each node is either in the model or in the children set

Additional penalizations for model complexity can be incorporated into the HIP

Because each γ j is conditioned on γltj in the prior construction the aα and bα for α of

order j can be conditioned on γltj One such additional penalization utilizes the number

of nodes of order j that could be added to produce a WFM conditioned on the inclusion

vector γltj which is denoted as chj(γltj) Choosing aα = 1 and bα(M) = chj(γ

ltj) is

equivalent to choosing a probability of success πα = 1chj(γltj) This penalization can

drive down the false positive rate when chj(γltj) is large but may produce more false

negatives

Hierarchical Order Prior (HOP) A compromise between complete equality and

complete independence of the πα is to assume equality between the πα of a given

order and independence across the different orders Define j(M) = α isin (M)

95

order(α) = j and Cj(M) = α isin C(M) order(α) = j The HOP assumes that πα = πj

for all α isin j(M)cupCj(M) Assuming that πj sim Beta(aj bj) provides a prior probability of

πHOP(M|M ab) =

JmaxMprod

j=JminM

B(|j(M)|+ aj |Cj(M)|+ bj)

B(aj bj)(4ndash9)

The specific choice of aj = bj = 1 for all j gives a value of

πHOP(M|M a = 1b = 1) =prodj

[1

|j(M)|+ |Cj(M)|+ 1

(|j(M)|+ |Cj(M)|

|j(M)|

)minus1]

(4ndash10)

and produces a hierarchical version of the Scott and Berger multiplicity correction

The HOP arises from a conditional exchangeability assumption on the indicator

variables Conditioned on γltj(M) the indicators γα α isin j(M)cup

Cj(M) are

assumed to be exchangeable Bernoulli random variables By de Finettirsquos theorem these

arise from independent Bernoulli random variables with common probability of success

πj with a prior distribution Our construction of the HOP assumes that this prior is a

beta distribution Additional complexity penalizations can be incorporated into the HOP

in a similar fashion to the HIP The number of possible nodes that could be added of

order j while maintaining a WFM is given by chj(M) = chj(γltj(M)) = |j(M)

cupCj(M)|

Using aj = 1 and bj(M) = chj(M) produces a prior with two desirable properties

First if M prime sub M then π(M) le π(M prime) Second for each order j the conditional

probability of including k nodes is greater than or equal to that of including k + 1 nodes

for k = 0 1 chj(M)minus 1432 Choice of Prior Structure and Hyper-Parameters

Each of the priors introduced in Section 31 defines a whole family of model priors

characterized by the probability distribution assumed for the inclusion probabilities πM

For the sake of simplicity this paper focuses on those arising from Beta distributions

and concentrates on particular choices of hyper-parameters which can be specified

automatically First we describe some general features about how each of the three

prior structures (HUP HIP HOP) allocates mass to the models in the model space

96

Second as there is an infinite number of ways in which the hyper-parameters can be

specified focused is placed on the default choice a = b = 1 as well as the complexity

penalizations described in Section 31 The second alternative is referred to as a =

1b = ch where b = ch has a slightly different interpretation depending on the prior

structure Accordingly b = ch is given by bj(M) = bα(M) = chj(M) = |j(M)cup

Cj(M)|

for the HOP and HIP where j = order(α) while b = ch denotes that b = |(MF )| for

the HUP The prior behavior for two model spaces In both cases the base model MB is

taken to be the intercept only model and MF is the DAG shown (Figures 4-4 and 4-5)

The priors considered treat model complexity differently and some general properties

can be seen in these examples

ModelHIP HOP HUP

(1 1) (1 ch) (1 1) (1 ch) (1 1) (1 ch)

1 1 14 49 13 12 13 572 1 x1 18 19 112 112 112 5563 1 x2 18 19 112 112 112 5564 1 x1 x

21 18 19 112 112 112 5168

5 1 x2 x22 18 19 112 112 112 5168

6 1 x1 x2 132 364 112 112 160 1727 1 x1 x2 x

21 132 164 136 160 160 1168

8 1 x1 x2 x1x2 132 164 136 160 160 11689 1 x1 x2 x

22 132 164 136 160 160 1168

10 1 x1 x2 x21 x1x2 132 1192 136 1120 130 1252

11 1 x1 x2 x21 x

22 132 1192 136 1120 130 1252

12 1 x1 x2 x1x2 x22 132 1192 136 1120 130 1252

13 1 x1 x2 x21 x1x2 x

22 132 1576 112 1120 16 1252

Figure 4-4 Prior probabilities for the space of well-formulated models associated to thequadratic surface on two variables where MB is taken to be the interceptonly model and (ab) isin (1 1) (1 ch)

First contrast the choice of HIP HUP and HOP for the choice of (ab) = (1 1) The

HIP induces a complexity penalization that only accounts for the order of the terms in

the model This is best exhibited by the model space in Figure 4-4 Models including x1

and x2 models 6 through 13 are given the same prior probability and no penalization is

incurred for the inclusion of any or all of the quadratic terms In contrast to the HIP the

97

ModelHIP HOP HUP

(1 1) (1 ch) (1 1) (1 ch) (1 1) (1 ch)

1 1 18 2764 14 12 14 472 1 x1 18 964 112 110 112 2213 1 x2 18 964 112 110 112 2214 1 x3 18 964 112 110 112 2215 1 x1 x3 18 364 112 120 112 41056 1 x2 x3 18 364 112 120 112 41057 1 x1 x2 116 3128 124 140 130 1428 1 x1 x2 x1x2 116 3128 124 140 120 1709 1 x1 x2 x3 116 1128 18 140 120 17010 1 x1 x2 x3 x1x2 116 1128 18 140 15 170

Figure 4-5 Prior probabilities for the space of well-formulated models associated tothree main effects and one interaction term where MB is taken to be theintercept only model and (ab) isin (1 1) (1 ch)

HUP induces a penalization for model complexity but it does not adequately penalize

models for including additional terms Using the HIP models including all of the terms

are given at least as much probability as any model containing any non-empty set of

terms (Figures 4-4 and 4-5) This lack of penalization of the full model is originates from

its combinatorial simplicity (ie this is the only model that contains every term) and

as an unfortunate consequence this model space distribution favors the base and full

models Similar behavior is observed with the HOP with (ab) = (1 1) As models

become more complex they are appropriately penalized for their size However after a

sufficient number of nodes are added the number of possible models of that particular

size is considerably reduced Thus combinatorial complexity is negligible on the largest

models This is best exhibited in Figure 4-5 where the HOP places more mass on

the full model than on any model containing a single order one node highlighting an

undesirable behavior of the priors with this choice of hyper-parameters

In contrast if (ab) = (1 ch) all three priors produce strong penalization as

models become more complex both in terms of the number and order of the nodes

contained in the model For all of the priors adding a node α to a model M to form M prime

produces p(M) ge p(M prime) However differences between the priors are apparent The

98

HIP penalizes the full model the most with the HOP penalizing it the least and the HUP

lying between them At face value the HOP creates the most compelling penalization

of model complexity In Figure 4-5 the penalization of the HOP is the least dramatic

producing prior odds of 20 for MB versus MF as opposed to the HUP and HIP which

produce prior odds of 40 and 54 respectively Similarly the prior odds in Figure 4-4 are

60 180 and 256 for the HOP HUP and HIP respectively433 Posterior Sensitivity to the Choice of Prior

To determine how the proposed priors are adjusting the posterior probabilities to

account for multiplicity a simple simulation was performed The goal of this exercise

was to understand how the priors respond to increasing complexity First the priors are

compared as the number of main effects p grows Second they are compared as the

depth of the hierarchy increases or in other words as the orderJMmax increases

The quality of a node is characterized by its marginal posterior inclusion

probabilities defined as pα =sum

MisinM I(αisinM)p(M|yM) for α isin MF These posteriors

were obtained for the proposed priors as well as the Equal Probability Prior (EPP)

on M For all prior structures both the default hyper-parameters a = b = 1 and

the penalizing choice of a = 1 and b = ch are considered The results for the

different combinations of MF and MT incorporated in the analysis were obtained

from 100 random replications (ie generating at random 100 matrices of main effects

and responses) The simulation proceeds as follows

1 Randomly generate main effects matrices X = (x1 x18) for xiiidsim Nn(0 In) and

error vectors ϵ sim Nn(0 In) for n = 60

2 Setting all coefficient values equal to one calculate y = ZMTβ + ϵ for the true

models given byMT 1 = x1 x2 x3 x

21 x1x2 x

22 x2x3 with |MT 1| = 7

MT 2 = x1 x2 x16 with |MT 2| = 16MT 3 = x1 x2 x3 x4 with |MT 3| = 4MT 4 = x1 x2 x8 x

21 x3x4 with |MT 4| = 10

MT 5 = x1 x2 x3 x4 x21 x3x4 with |MT 5| = 6

99

Table 4-1 Characterization of the full models MF and corresponding model spaces Mconsidered in simulationsgrowing p fixed JM

max fixed p growing JMmax

MF

∣∣MF

∣∣ ∣∣M∣∣ MT used MF

∣∣MF

∣∣ ∣∣M∣∣ MT used(x1 + x2 + x3)

2 9 95 MT 1 (x1 + x2 + x3)2 9 95 MT 1

(x1 + + x4)2 14 1337 MT 1 (x1 + x2 + x3)

3 19 2497 MT 1

(x1 + + x5)2 20 38619 MT 1 (x1 + x2 + x3)

4 34 161421 MT 1

Other model spacesMF

∣∣MF

∣∣ ∣∣M∣∣ MT usedx1 + x2 + middot middot middot+ x18 18 262144 MT 2MT 3

(x1 + x2 + x4)2 + x5+ 20 85568 MT 4MT 5x6 + + x10

3 In all simulations the base model MB is the intercept only model The notation(x1 + + xp)

d is used to represent the full order-d polynomial response surface inp main effects The model spaces characterized by their corresponding full modelMF are presented in Table 4-1 as well as the true models used in each case

4. Enumerate the model spaces and calculate p(M | y, M) for all M ∈ M using the EPP, HUP, HIP, and HOP, the hierarchical priors each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each M for the different priors.
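To make steps 1, 2, and 5 concrete, the following R sketch generates one replicate for the true model MT1 and counts true and false positives from a vector of marginal inclusion probabilities. It is a minimal illustration, not the full experiment: the vector `incl_probs` stands in for the output of step 4 (enumeration of the well-formulated model space and computation of p(M | y, M)), and the object names are chosen here for convenience.

    set.seed(1)
    n <- 60; p <- 18

    # Step 1: main-effects matrix and error vector
    X   <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
    eps <- rnorm(n)

    # Step 2: design matrix and response for MT1 = {x1, x2, x3, x1^2, x1*x2, x2^2, x2*x3}
    Z_MT1 <- with(as.data.frame(X),
                  cbind(x1, x2, x3, x1^2, x1 * x2, x2^2, x2 * x3))
    colnames(Z_MT1) <- c("x1", "x2", "x3", "x1^2", "x1:x2", "x2^2", "x2:x3")
    true_terms <- colnames(Z_MT1)
    y <- drop(Z_MT1 %*% rep(1, ncol(Z_MT1))) + eps   # all coefficients equal to one

    # Step 5: TP/FP counts from marginal inclusion probabilities p_alpha
    # (incl_probs is a named vector over all candidate terms, produced in step 4)
    count_tp_fp <- function(incl_probs, true_terms, fp_cut = c(0.10, 0.20, 0.50)) {
      tp <- sum(incl_probs[names(incl_probs) %in% true_terms] > 0.50)
      fp <- sapply(fp_cut, function(cut)
        sum(incl_probs[!names(incl_probs) %in% true_terms] > cut))
      c(TP = tp, setNames(fp, paste0("FP>", fp_cut)))
    }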

The true positives (TP) are defined as those nodes α ∈ MT such that p_α > 0.5. For the false positives (FP), three different cutoffs on p_α are considered, elucidating the adjustment for multiplicity induced by the model priors; these cutoffs are 0.10, 0.20 and 0.50 for α ∉ MT. The results from this exercise provide insight about the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1 the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of MF, the full model that defines the DAG for M.

Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be MT1 and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.

First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance becomes more pronounced as p increases. These priors also considerably outperform those using the default hyper-parameters a = b = 1 in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in MT1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 0.50 cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.

Growing polynomial degree, fixed main effects. For these examples the true model is once again MT1. When the complexity is increased by making the order of MF larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.5 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high inclusion posterior probability both with the EPP and with the a = b = 1 priors.


Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic surface, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP) and the hierarchical uniform prior (HUP).

                                           a = 1, b = 1          a = 1, b = ch
Cutoff      |MT|  MF                 EPP    HIP   HUP   HOP     HIP   HUP   HOP
FP(>0.10)    7    (x1+x2+x3)^2       1.78   1.78  2.00  2.00    0.11  1.31  1.06
FP(>0.20)                            0.43   0.43  2.00  1.98    0.01  0.28  0.24
FP(>0.50)                            0.04   0.04  0.97  0.36    0.00  0.03  0.02
TP(>0.50)         (MT1)              7.00   7.00  7.00  7.00    6.97  6.99  6.99
FP(>0.10)    7    (x1+x2+x3+x4)^2    3.62   1.94  2.33  2.45    0.10  0.63  1.07
FP(>0.20)                            1.60   0.47  2.17  2.15    0.01  0.17  0.24
FP(>0.50)                            0.25   0.06  0.35  0.36    0.00  0.02  0.02
TP(>0.50)         (MT1)              7.00   7.00  7.00  7.00    6.97  6.99  6.99
FP(>0.10)    7    (x1+...+x5)^2      6.00   2.16  2.60  2.55    0.12  0.43  1.15
FP(>0.20)                            2.91   0.55  2.13  2.18    0.02  0.19  0.27
FP(>0.50)                            0.66   0.11  0.25  0.37    0.00  0.03  0.01
TP(>0.50)         (MT1)              7.00   7.00  7.00  7.00    6.97  6.99  6.99

In contrast, any of the a = 1 and b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.

Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.


Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP) and the hierarchical uniform prior (HUP).

                                          a = 1, b = 1          a = 1, b = ch
Cutoff      |MT|  MF                EPP    HIP   HUP   HOP     HIP   HUP   HOP
FP(>0.10)    7    (x1+x2+x3)^2      1.78   1.78  2.00  2.00    0.11  1.31  1.06
FP(>0.20)                           0.43   0.43  2.00  1.98    0.01  0.28  0.24
FP(>0.50)                           0.04   0.04  0.97  0.36    0.00  0.03  0.02
TP(>0.50)         (MT1)             7.00   7.00  7.00  7.00    6.97  6.99  6.99
FP(>0.10)    7    (x1+x2+x3)^3      7.37   5.21  6.06  2.91    0.55  1.05  1.39
FP(>0.20)                           2.91   1.55  3.61  2.08    0.17  0.34  0.31
FP(>0.50)                           0.40   0.21  0.50  0.26    0.03  0.03  0.04
TP(>0.50)         (MT1)             7.00   7.00  7.00  7.00    6.97  6.98  7.00
FP(>0.10)    7    (x1+x2+x3)^4      8.22   4.00  4.69  2.61    0.52  0.55  1.32
FP(>0.20)                           4.21   1.13  1.76  2.03    0.12  0.15  0.31
FP(>0.50)                           0.56   0.17  0.22  0.27    0.03  0.03  0.04
TP(>0.50)         (MT1)             7.00   7.00  7.00  7.00    6.97  6.97  6.99

By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models contain 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in MF, so there is little potential for false positives. The a = 1 and b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 0.50 cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 0.50 cutoff.


If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and the choice a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from MT4 with ten terms and from MT5 with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff; at the 0.50 cutoff some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and a = b = 1 priors always include all of the predictors in MT4 and MT5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and the hierarchical priors with a = b = 1. At the 0.50 cutoff these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as being good default priors on the model space.

4.4 Random Walks on the Model Space

When the model space M is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space M both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which


Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP) and the hierarchical uniform prior (HUP).

                                              a = 1, b = 1              a = 1, b = ch
Cutoff      |MT|  MF                    EPP     HIP    HUP    HOP      HIP   HUP    HOP
FP(>0.10)   16    x1 + x2 + ... + x18   1.93    1.93   2.00   2.00     0.03  1.80   1.80
FP(>0.20)                               0.52    0.52   2.00   2.00     0.01  0.46   0.46
FP(>0.50)                               0.07    0.07   2.00   2.00     0.01  0.04   0.04
TP(>0.50)         (MT2)                 15.99   15.99  16.00  16.00    6.99  15.99  15.99
FP(>0.10)    4    x1 + x2 + ... + x18   13.95   13.95  9.15   9.15     0.26  1.31   1.31
FP(>0.20)                               5.45    5.45   3.03   3.03     0.05  0.45   0.45
FP(>0.50)                               0.84    0.84   0.45   0.45     0.02  0.06   0.06
TP(>0.50)         (MT3)                 4.00    4.00   4.00   4.00     4.00  4.00   4.00
FP(>0.10)   10    (x1+...+x4)^2 +       9.73    9.71   10.00  5.60     0.34  2.33   2.20
FP(>0.20)         x5 + ... + x10        2.65    2.65   8.73   3.05     0.12  0.74   0.69
FP(>0.50)                               0.35    0.35   1.36   1.68     0.02  0.11   0.12
TP(>0.50)         (MT4)                 10.00   10.00  10.00  9.99     9.94  9.98   9.99
FP(>0.10)    6    (x1+...+x4)^2 +       13.52   13.52  11.06  9.94     0.44  1.63   1.96
FP(>0.20)         x5 + ... + x10        4.22    4.21   3.60   5.01     0.15  0.48   0.68
FP(>0.50)                               0.53    0.53   0.57   0.75     0.01  0.08   0.11
TP(>0.50)         (MT5)                 6.00    6.00   6.00   6.00     5.99  5.99   5.99

are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing

This first strategy relies on small, localized jumps around the model space, turning on or off a single node at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step in the algorithm, assume that the current state of the chain is model M. Let pG be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M+ = M ∪ {α} for some α ∈ C(M), or M− = M \ {α} for some α ∈ E(M).

An example transition kernel is defined by the mixture

   g(M′|M) = pG · qGrow(M′|M) + (1 − pG) · qPrune(M′|M)
           = [ I{M ≠ MF} / (1 + I{M ≠ MB}) ] · [ I{α ∈ C(M)} / |C(M)| ]
             + [ I{M ≠ MB} / (1 + I{M ≠ MF}) ] · [ I{α ∈ E(M)} / |E(M)| ],      (4–11)

where pG has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to or deletion from M uniformly at random.
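As a concrete illustration, the sketch below implements one grow/prune proposal of this kernel in R. The helpers `children_set` and `extreme_set`, which return C(M) and E(M) for a well-formulated model, are assumed to be available (they are not defined in the text), and models are represented simply as character vectors of node labels.

    # One proposal from the kernel in (4-11), assuming children_set() and
    # extreme_set() return C(M) and E(M) as character vectors of node labels.
    propose_grow_prune <- function(M, children_set, extreme_set) {
      C_M <- children_set(M)
      E_M <- extreme_set(M)

      # pG = 0.5 when both sets are non-empty, 1 if E(M) is empty, 0 if C(M) is empty
      pG <- if (length(C_M) == 0) 0 else if (length(E_M) == 0) 1 else 0.5

      if (runif(1) < pG) {
        alpha <- sample(C_M, 1)                      # grow: add one child node
        M_new <- c(M, alpha)
        log_q_fwd <- log(pG) - log(length(C_M))      # forward proposal probability
      } else {
        alpha <- sample(E_M, 1)                      # prune: remove one extreme node
        M_new <- setdiff(M, alpha)
        log_q_fwd <- log(1 - pG) - log(length(E_M))
      }
      list(model = M_new, log_q_forward = log_q_fwd)
    }

The Metropolis-Hastings acceptance ratio additionally requires the reverse proposal probability, which is computed in the same way starting from the proposed model.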

For this simple algorithm, pruning has the reverse kernel of growing and vice-versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree-Based Pruning and Growing

In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order. One can update the vector of inclusion indicators in blocks defined by the order of the nodes. Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = Jmin up to j = Jmax, with Jmin and Jmax being the lowest and highest orders of nodes in MF \ MB, respectively. Define Mt,(Jmin−1) = M and set j = Jmin. The growth kernel comprises the following steps, proceeding from j = Jmin to j = Jmax:


1) Propose a model M′ by selecting a set of nodes from Cj(Mt,(j−1)) through the kernel qGrow,j(·|Mt,(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt,(j−1). If M′ is accepted, then set Mt,(j) = M′; otherwise set Mt,(j) = Mt,(j−1).

3) If j < Jmax, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set Mt = Mt,(Jmax).

The pruning step is defined in a similar fashion; however, it starts at order j = Jmax and proceeds down to j = Jmin. Let Ej(M′) denote the set of order-j nodes in E(M′), that is, the nodes of order j that can be removed from the model M′ to produce a WFM. Define Mt,(Jmax+1) = M and set j = Jmax. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from Ej(Mt,(j+1)) through the kernel qPrune,j(·|Mt,(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt,(j+1). If M′ is accepted, then set Mt,(j) = M′; otherwise set Mt,(j) = Mt,(j+1).

3) If j > Jmin, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt,(Jmin).

It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can be combined for each j. The forward kernel proceeds from j = Jmin to j = Jmax and proposes adding or removing sets of nodes from Cj(M) ∪ Ej(M). The reverse kernel simply reverses the direction of j, proceeding from j = Jmax to j = Jmin.

4.5 Simulation Study

To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points on the scale (McCullagh & Nelder, 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.

With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normal. The response vectors are drawn from the n-variate normal distribution as y ~ N_n(Z_MT(X) β_γ, I_n), where MT is the true model and I_n is the n × n identity matrix.

The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that Z_MF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1² has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.

The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior over the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.

Figure 4-6. MT: DAG of the largest true model used in simulations.

The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included, for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect

As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with the HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with the EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are smallest. Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the numbers of FPs are somewhat lower with the HOP.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch).

4.5.2 Coefficient Magnitude

Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient. Finally, in the third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1_o1, 1_o2, 1_o3), β(2) = c(1_o1, 0.5_o2, 0.25_o3) and β(3) = c(0.25_o1, 0.5_o2, 1_o3), respectively. In Figure 4-7 the first four scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7 the effect of using β(2) is evident, as it corresponds to the lowest values of the TPs regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale

Four true models were considered: (1) the model from Figure 4-6 (MT1); (2) the model without the order-one terms (MT2); (3) the model without order-two terms (MT3); and (4) the model without x1² and x2x5 (MT4). The last three are clearly not well-formulated. In Figure 4-7 the leftmost point on the horizontal axis corresponds to scenarios with MT1, the next point is for scenarios with MT2, followed by those with MT3, then with MT4, then MT1 again, and so on. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis

This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The model space contains approximately 7.1 billion models, so computation of all model posterior probabilities is not feasible.

Table 4-5. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily max 1hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in equation 3-3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in equation 3-2), hyper-g (HG) priors (Liang et al., 2008) with hyper-parameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability (MPM) models. Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
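A minimal R sketch of the validation exercise is given below. The data frame name `ozone_dat` is a placeholder for the 330 complete cases with the variables of Table 4-5, the selected terms follow the ZS/HOP median probability model reported in Table 4-6, and the refit is done here by ordinary least squares purely for illustration (the dissertation's reported RMSEs come from the Bayesian fits under the corresponding parameter priors).

    # Random split into training and validation halves, as described in the text
    set.seed(2014)
    idx   <- sample(nrow(ozone_dat), 165)
    train <- ozone_dat[idx, ]
    valid <- ozone_dat[-idx, ]

    # Refit of one median probability model (ZS/HOP MPM from Table 4-6) by OLS
    mpm_fit <- lm(ozone ~ hum + dpg + ibt + I(hum^2) + hum:ibt + I(dpg^2) + I(ibt^2),
                  data = train)

    # Out-of-sample root mean squared error on the validation half
    pred <- predict(mpm_fit, newdata = valid)
    rmse <- sqrt(mean((valid$ozone - pred)^2))
    rmse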

Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model, with the exception of dpg², which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.

Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered for Bayesian model selection.


Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso.

BF      Prior    Model                                               R2      RMSE
IP      EPP      hum, dpg, ibt, hum2, hum*dpg, hum*ibt, dpg2, ibt2   0.8054  4.2739
IP      HIP      hum, ibt, hum2, hum*ibt, ibt2                       0.7740  4.3396
IP      HOP      hum, dpg, ibt, hum2, hum*ibt, ibt2                  0.7848  4.3175
IP      HUP      hum, dpg, ibt, hum*ibt, ibt2                        0.7767  4.3508
ZS      EPP      hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2            0.7896  4.2518
ZS      HIP      hum, ibt, hum*ibt, ibt2                             0.7525  4.3505
ZS      HOP      hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2            0.7896  4.2518
ZS      HUP      hum, dpg, ibt, hum*ibt, ibt2                        0.7767  4.3508
HG11    EPP      vh, hum, dpg, ibt, hum2, hum*ibt, dpg2              0.7701  4.3049
HG11    HIP      hum, ibt, hum*ibt, ibt2                             0.7525  4.3505
HG11    HOP      hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2            0.7896  4.2518
HG11    HUP      hum, dpg, ibt, hum*ibt, ibt2                        0.7767  4.3508
HG21    EPP      hum, dpg, ibt, hum2, hum*ibt, dpg2                  0.7701  4.3037
HG21    HIP      hum, dpg, ibt, hum*ibt, ibt2                        0.7767  4.3508
HG21    HOP      hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2            0.7896  4.2518
HG21    HUP      hum, dpg, ibt, hum*ibt                              0.7526  4.4036
HierNet          hum, temp, ibh, dpg, ibt, vis, hum2, hum*ibt,       0.7651  4.3680
                 temp2, temp*ibt, dpg2

4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are set up (e.g., in what units the predictors are expressed).

In this chapter we investigated a solution to these two issues. We defined prior structures for well-formulated models and developed random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP using the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.


In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.


CHAPTER 5
CONCLUSIONS

Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible: they allow integrating, under a single methodology, empirical observations and theoretical process models, and they can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of the Bayesian methods used in ecology and, hopefully, will shed light on the differences between Bayesian estimation and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models; it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.

For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggest a strategy to add spatial dependence among neighboring sites.

Ecological inquiry usually requires weighing competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which avoids introducing (commonly unavailable) subjective information into the model. In simulation experiments we observed that the methods accurately single out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. Also, the simulations indicated that the method provides better discrimination for predictors in the detection component of the model.

In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are coded (e.g., in what units the predictors are expressed).

To confront this situation, we proposed three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP with the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.

Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit version of the model.


APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS

In this section we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.

Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms.

Let φ(ν | μ, σ²) represent the density of a normal random variable ν with mean μ and variance σ², and recall that ψ_i1 = F(x′_(o)i α) and p_ijt = F(q′_ijt λ_t), where F(·) is the inverse link function. The full conditional for z_it is given by:

1. For t = 1,

   π(z_i1 | v_i1, α, λ_1, β^c_1, δ^s_1) = (ψ*_i1)^{z_i1} (1 − ψ*_i1)^{1−z_i1} = Bernoulli(ψ*_i1),      (A–1)

   where

   ψ*_i1 = [ ψ_i1 φ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1) ]
           / [ ψ_i1 φ(v_i1 | x′_i1 β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_i1} (1 − p_ij1)
               + (1 − ψ_i1) φ(v_i1 | x′_i1 β^c_1, 1) ∏_{j=1}^{J_i1} I{y_ij1 = 0} ].

2. For 1 < t < T,

   π(z_it | z_i(t−1), z_i(t+1), λ_t, β^c_{t−1}, δ^s_{t−1}) = (ψ*_it)^{z_it} (1 − ψ*_it)^{1−z_it} = Bernoulli(ψ*_it),      (A–2)

   where

   ψ*_it = [ κ_it ∏_{j=1}^{J_it} (1 − p_ijt) ]
           / [ κ_it ∏_{j=1}^{J_it} (1 − p_ijt) + ∇_it ∏_{j=1}^{J_it} I{y_ijt = 0} ],

   with

   (a) κ_it = F(x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1}) φ(v_it | x′_it β^c_t + δ^s_t, 1), and
   (b) ∇_it = (1 − F(x′_i(t−1) β^c_{t−1} + z_i(t−1) δ^s_{t−1})) φ(v_it | x′_it β^c_t, 1).

3. For t = T,

   π(z_iT | z_i(T−1), λ_T, β^c_{T−1}, δ^s_{T−1}) = (ψ⋆_iT)^{z_iT} (1 − ψ⋆_iT)^{1−z_iT} = Bernoulli(ψ⋆_iT),      (A–3)

   where

   ψ⋆_iT = [ κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT) ]
           / [ κ⋆_iT ∏_{j=1}^{J_iT} (1 − p_ijT) + ∇⋆_iT ∏_{j=1}^{J_iT} I{y_ijT = 0} ],

   with

   (a) κ⋆_iT = F(x′_i(T−1) β^c_{T−1} + z_i(T−1) δ^s_{T−1}), and
   (b) ∇⋆_iT = 1 − F(x′_i(T−1) β^c_{T−1} + z_i(T−1) δ^s_{T−1}).

Sampler u_i

1. π(u_i | z_i1, α) = trN(x′_(o)i α, 1, trunc(z_i1)),      (A–4)

   where trunc(z_i1) = (−∞, 0] if z_i1 = 0 and (0, ∞) if z_i1 = 1, and trN(μ, σ², A) denotes the pdf of a truncated normal random variable with mean μ, variance σ², and truncation region A.

Sampler α

1. π(α | u) ∝ [α] ∏_{i=1}^{N} φ(u_i | x′_(o)i α, 1).      (A–5)

   If [α] ∝ 1, then

   α | u ~ N(m(α), Σ_α),

   with m(α) = Σ_α X′_(o) u and Σ_α = (X′_(o) X_(o))^{−1}.
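Under a probit link and the flat prior [α] ∝ 1, these two full conditionals translate directly into code. The following sketch of one Gibbs update of (u, α) assumes the MASS and msm packages (for multivariate normal and truncated normal draws), a design matrix `X_o`, and first-season occupancy indicators `z1`; it is a schematic data-augmentation step in the style of Albert & Chib (1993), not the dissertation's full DYMOSS sampler.

    library(MASS)   # mvrnorm for multivariate normal draws
    library(msm)    # rtnorm for truncated normal draws

    # One Gibbs update of (u, alpha), assuming X_o is the N x k occupancy design
    # matrix, z1 a length-N 0/1 vector, and alpha the current coefficient vector.
    update_u_alpha <- function(alpha, X_o, z1) {
      N  <- nrow(X_o)
      mu <- drop(X_o %*% alpha)

      # Sampler u_i: truncated normal, truncation region determined by z_i1 (A-4)
      u <- ifelse(z1 == 1,
                  rtnorm(N, mean = mu, sd = 1, lower = 0),
                  rtnorm(N, mean = mu, sd = 1, upper = 0))

      # Sampler alpha: N(m(alpha), Sigma_alpha) under a flat prior (A-5)
      Sigma_alpha <- solve(crossprod(X_o))
      m_alpha     <- Sigma_alpha %*% crossprod(X_o, u)
      alpha_new   <- mvrnorm(1, mu = m_alpha, Sigma = Sigma_alpha)

      list(u = u, alpha = alpha_new)
    }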

Sampler v_it

1. (For t > 1)

   π(v_i(t−1) | z_i(t−1), z_it, β^c_{t−1}, δ^s_{t−1}) = trN(μ^(v)_{i(t−1)}, 1, trunc(z_it)),      (A–6)

   where μ^(v)_{i(t−1)} = x′_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1}, and trunc(z_it) defines the corresponding truncation region given by z_it.


Sampler (β^c_{t−1}, δ^s_{t−1})

1. (For t > 1)

   π(β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1}) ∝ [β^c_{t−1}, δ^s_{t−1}] ∏_{i=1}^{N} φ(v_it | x′_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1}, 1).      (A–7)

   If [β^c_{t−1}, δ^s_{t−1}] ∝ 1, then

   β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1} ~ N(m(β^c_{t−1}, δ^s_{t−1}), Σ_{t−1}),

   with m(β^c_{t−1}, δ^s_{t−1}) = Σ_{t−1} X̃′_{t−1} v_{t−1} and Σ_{t−1} = (X̃′_{t−1} X̃_{t−1})^{−1}, where X̃_{t−1} = (X_{t−1}, z_{t−1}).

Sampler w_ijt

1. (For t ≥ 1 and z_it = 1)

   π(w_ijt | z_it = 1, y_ijt, λ_t) = trN(q′_ijt λ_t, 1, trunc(y_ijt)).      (A–8)

Sampler λ_t

1. (For t = 1, 2, ..., T)

   π(λ_t | z_t, w_t) ∝ [λ_t] ∏_{i: z_it = 1} ∏_{j=1}^{J_it} φ(w_ijt | q′_ijt λ_t, 1).      (A–9)

   If [λ_t] ∝ 1, then

   λ_t | w_t, z_t ~ N(m(λ_t), Σ_{λ_t}),

   with m(λ_t) = Σ_{λ_t} Q′_t w_t and Σ_{λ_t} = (Q′_t Q_t)^{−1}, where Q_t and w_t, respectively, are the design matrix and the vector of latent variables for surveys of sites such that z_it = 1.


APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state M, the global jump is performed by drawing a model M′ at random from the model space. This is achieved by beginning at the base model and increasing the order from J^min_M to J^max_M, the minimum and maximum orders of nodes in MF \ MB; at each order a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is

   α = min{ 1, m(y | M′, M) / m(y | M, M) }.

Local Jump. From the current state M, the local jump is performed by drawing a model from the set of models L(M) = {M_α : α ∈ E(M) ∪ C(M)}, where M_α is M \ {α} for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities are computed as a mixture of p(M′ | y, M, M′ ∈ L(M)) and the discrete uniform distribution. The proposal kernel is

   q(M′ | y, M, M′ ∈ L(M)) = (1/2) ( p(M′ | y, M, M′ ∈ L(M)) + 1/|L(M)| ).

This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is

   α = min{ 1, [ m(y | M′, M) / m(y | M, M) ] · [ q(M | y, M′, M ∈ L(M′)) / q(M′ | y, M, M′ ∈ L(M)) ] }.
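The mixture proposal for the local jump can be formed as in the R sketch below. The helpers `local_neighborhood`, `log_marginal`, and `log_prior`, which enumerate L(M) and evaluate the log marginal likelihood and log model prior of a candidate model, are assumed to exist and are not given in the text.

    # Mixture proposal for the local jump: 0.5 * (renormalized posterior over L(M))
    # + 0.5 * (uniform over L(M)), using log scale for numerical stability.
    local_proposal <- function(M, local_neighborhood, log_marginal, log_prior) {
      L_M  <- local_neighborhood(M)                     # list of candidate models
      logp <- sapply(L_M, function(m) log_marginal(m) + log_prior(m))
      post <- exp(logp - max(logp))
      post <- post / sum(post)

      probs <- 0.5 * post + 0.5 / length(L_M)           # mixture proposal weights
      idx   <- sample(length(L_M), 1, prob = probs)
      list(model = L_M[[idx]], q_forward = probs[idx])
    }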

Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M′, define Lj(M′) = {M′} ∪ {M′_α : α ∈ E(M′) ∪ C(M′), with α of order j}. From a state M, the kernel chooses at random whether to increase or decrease the order. If M = MF, then decreasing the order is chosen with probability 1, and if M = MB, then increasing the order is chosen with probability 1; in all other cases the probability of increasing or decreasing the order is 1/2. The proposal kernels are given by:


Increasing order proposal kernel:

1. Set j = J^min_M − 1 and M′_j = M.

2. Draw M′_{j+1} from q_inc,j+1(M′ | y, M, M′ ∈ L_{j+1}(M′_j)), where

   q_inc,j+1(M′ | y, M, M′ ∈ L_{j+1}(M′_j)) = (1/2) ( p(M′ | y, M, M′ ∈ L_{j+1}(M′_j)) + 1/|L_{j+1}(M′_j)| ).

3. Set j = j + 1.

4. If j < J^max_M, then return to 2; otherwise proceed to 5.

5. Set M′ = M′_{J^max_M} and compute the proposal probability

   q_inc(M′ | y, M, M) = ∏_{j = J^min_M − 1}^{J^max_M − 1} q_inc,j+1(M′_{j+1} | y, M, M′ ∈ L_{j+1}(M′_j)).      (B–1)

Decreasing order proposal kernel:

1. Set j = J^max_M + 1 and M′_j = M.

2. Draw M′_{j−1} from q_dec,j−1(M′ | y, M, M′ ∈ L_{j−1}(M′_j)), where

   q_dec,j−1(M′ | y, M, M′ ∈ L_{j−1}(M′_j)) = (1/2) ( p(M′ | y, M, M′ ∈ L_{j−1}(M′_j)) + 1/|L_{j−1}(M′_j)| ).

3. Set j = j − 1.

4. If j > J^min_M, then return to 2; otherwise proceed to 5.

5. Set M′ = M′_{J^min_M} and compute the proposal probability

   q_dec(M′ | y, M, M) = ∏_{j = J^min_M + 1}^{J^max_M + 1} q_dec,j−1(M′_{j−1} | y, M, M′ ∈ L_{j−1}(M′_j)).      (B–2)

If increasing order is chosen, then the MH correction is given by

   α = min{ 1, [ (1 + I(M′ = MF)) / (1 + I(M = MB)) ] · [ q_dec(M | y, M, M′) / q_inc(M′ | y, M, M) ] · [ p(M′ | y, M) / p(M | y, M) ] },      (B–3)

and similarly if decreasing order is chosen.

Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.

APPENDIX C
WFM SIMULATION DETAILS

Briefly, the idea is to let Z_MT(X) β_MT = (QR) β_MT = Q η_MT (i.e., β_MT = R^{−1} η_MT), using the QR decomposition. As such, setting all values in η_MT proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation y_i is var(y_i) = var(E[y_i | z_i]) + E[var(y_i | z_i)], where z_i is the i-th row of the design matrix Z_MT. Hence we take the signal-to-noise ratio for each observation to be

   SNR(η) = η′_MT R^{−T} Σ_z R^{−1} η_MT / σ²,

where Σ_z = var(z_i). We determine how the signal is distributed across predictors up to a proportionality constant, so that the signal-to-noise ratio can be controlled simultaneously.
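A small R sketch of this construction is given below, under the assumption that the polynomial design matrix for the true model is available as `Z_MT` and that the relative allocation of signal across terms is given by a weight vector `eta_weights`; the final step rescales the coefficients so the SNR matches a target value.

    # Sphering-based coefficient construction: beta = R^{-1} eta, with eta scaled
    # so that SNR = beta' Sigma_z beta / sigma^2 equals the target value.
    make_beta <- function(Z_MT, eta_weights, target_snr, sigma2 = 1) {
      R       <- qr.R(qr(Z_MT))                 # QR decomposition: Z_MT = Q R
      Sigma_z <- cov(Z_MT)                      # sample variance of the design rows
      beta_u  <- backsolve(R, eta_weights)      # unscaled beta = R^{-1} eta
      snr_u   <- drop(t(beta_u) %*% Sigma_z %*% beta_u) / sigma2
      beta_u * sqrt(target_snr / snr_u)         # rescale to hit the target SNR
    }

    # Example use: equal signal in every term of the true model
    # beta <- make_beta(Z_MT, eta_weights = rep(1, ncol(Z_MT)), target_snr = 1)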

Additionally, to investigate the ability of the method to correctly capture the hierarchical structure, we specify four different 0-1 vectors that determine the predictors in MT, the model that generates the data in the different scenarios.

Table C-1. Experimental conditions, WFM simulations.

Parameter           Values considered
SNR(η_MT) = k       0.25, 1, 4
η_MT ∝              (1, 1_3, 1_4, 1_2), (1, 1_3, (1/2)1_4, (1/4)1_2), (1, (1/4)1_3, (1/2)1_4, 1_2)
γ_MT                (1, 1_3, 1_4, 1_2), (1, 1_3, 1_4, 0_2), (1, 1_3, 0_4, 1_2), (1, 0_3, 0, 1, 1, 0, 1_2)
n                   130, 260, 1040

The results presented below are somewhat different from those found in the main body of the chapter in Section 4.5. These are obtained by averaging the numbers of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios, for the 20 highest-probability models.


SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect more pronounced when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from being impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) has strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either having a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control on the number of FPs included when considering small sample sizes combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.


Figure C-1. SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these values are denoted by β = c(1_o1, 0.5_o2, 0.25_o3), β = c(1_o1, 1_o2, 1_o3) and β = c(0.25_o1, 0.5_o2, 1_o3), respectively.

Observe that the number of FPs is insensitive to how the signal is distributed across predictors when using the HOP(1, ch); conversely, when using the EPP the number of FPs decreases as the SNR grows, always being slightly higher than that obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β = c(0.25_o1, 0.5_o2, 1_o3) or β = c(1_o1, 1_o2, 1_o3)), the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

Special Points on the Scale

In Nelder (1998) the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in practice in this context. In addition, the author states that considering well-formulated models only does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified due to the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data-generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model MT from Figure 4-6 but have, respectively, zero-valued coefficients for all the order-one terms, all the order-two terms, and for x1² and x2x5.

Figure C-2. SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order terms.

Figure C-3. SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

These differences, however, are not very large.

Relatively smaller models are selected whenever some terms in the hierarchy are missing, but with high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for x1² and x2x5, and it is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x1x2x5, which depends on the inclusion of three order-two terms (x1x2, x1x5, x2x5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single or at most a few covariates. A true data-generating mechanism that removes all terms of a given order in the context of polynomial models is clearly not justified; here this was only done for comparison purposes.


APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang et al. (2008); these are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily max 1hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The marginal posterior inclusion probability corresponds to the probability of including a given term of the full model MF after summing over all models in the model space. For each node α ∈ MF, this probability is given by p_α = Σ_{M∈M} I(α ∈ M) p(M | y, M). In problems with a large model space, such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible. Thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space M.
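For instance, if the random walk stores the distinct models it visits together with their unnormalized log posterior values, the estimate can be formed as in the sketch below; `visited_models` (a list of character vectors of included terms) and `log_post` (their log unnormalized posterior values) are assumed to be outputs of the sampler.

    # Estimate marginal inclusion probabilities p_alpha from the distinct models
    # visited by the random walk, weighting each model by its renormalized
    # posterior mass rather than by raw visit counts.
    estimate_incl_probs <- function(visited_models, log_post, all_terms) {
      w <- exp(log_post - max(log_post))
      w <- w / sum(w)                                  # renormalized model weights
      sapply(all_terms, function(a)
        sum(w[sapply(visited_models, function(m) a %in% m)]))
    }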

Given that there are in total 44 potential predictors, for convenience Tables D-2 to D-5 below display only the marginal posterior probabilities for the terms included under at least one of the model priors considered (EPP, HIP, HUP and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(1,1) and Hyper-g(2,1)).


Table D-2. Marginal inclusion probabilities, intrinsic prior.

           EPP    HIP    HUP    HOP
hum        0.99   0.69   0.85   0.76
dpg        0.85   0.48   0.52   0.53
ibt        0.99   1.00   1.00   1.00
hum2       0.76   0.51   0.43   0.62
hum*dpg    0.55   0.02   0.03   0.17
hum*ibt    0.98   0.69   0.84   0.75
dpg2       0.72   0.36   0.25   0.46
ibt2       0.59   0.78   0.57   0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior.

           EPP    HIP    HUP    HOP
hum        0.76   0.67   0.80   0.69
dpg        0.89   0.50   0.55   0.58
ibt        0.99   1.00   1.00   1.00
hum2       0.57   0.49   0.40   0.57
hum*ibt    0.72   0.66   0.78   0.68
dpg2       0.81   0.38   0.31   0.51
ibt2       0.54   0.76   0.55   0.77

Table D-4. Marginal inclusion probabilities, Hyper-g(1,1).

           EPP    HIP    HUP    HOP
vh         0.54   0.05   0.10   0.11
hum        0.81   0.67   0.80   0.69
dpg        0.90   0.50   0.55   0.58
ibt        0.99   1.00   0.99   0.99
hum2       0.61   0.49   0.40   0.57
hum*ibt    0.78   0.66   0.78   0.68
dpg2       0.83   0.38   0.30   0.51
ibt2       0.49   0.76   0.54   0.77

Table D-5. Marginal inclusion probabilities, Hyper-g(2,1).

           EPP    HIP    HUP    HOP
hum        0.79   0.64   0.73   0.67
dpg        0.90   0.52   0.60   0.59
ibt        0.99   1.00   0.99   1.00
hum2       0.60   0.47   0.37   0.55
hum*ibt    0.76   0.64   0.71   0.67
dpg2       0.82   0.41   0.36   0.52
ibt2       0.47   0.73   0.49   0.75


REFERENCES

Akaike H (1983) Information measures and model selection Bull Int Statist Inst 50277ndash290

Albert J H amp Chib S (1993) Bayesian-analysis of binary and polychotomousresponse data Journal of the American Statistical Association 88(422) 669ndash679

Berger J amp Bernardo J (1992) On the development of reference priors BayesianStatistics 4 (pp 35ndash60)

URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf

Berger J amp Pericchi L (1996) The intrinsic Bayes factor for model selection andprediction Journal of the American Statistical Association 91(433) 109ndash122

URL httpamstattandfonlinecomdoiabs10108001621459199610476668

Berger J Pericchi L amp Ghosh J (2001) Objective Bayesian methods for modelselection introduction and comparison In Model selection vol 38 of IMS LectureNotes Monogr Ser (pp 135ndash207) Inst Math Statist

URL httpwwwjstororgstable1023074356165

Besag J York J amp Mollie A (1991) Bayesian Image-Restoration with 2 Applicationsin Spatial Statistics Annals of the Institute of Statistical Mathematics 43 1ndash20

Bien, J., Taylor, J. & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.

URL httpprojecteuclidorgeuclidaos1371150895

Breiman, L. & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco M J Steinley D amp Cradit J D (2009) An exact algorithm for hierarchicallywell-formulated subsets in second-order polynomial regression Technometrics 51(3)306ndash315

Casella G Giron F J Martınez M L amp Moreno E (2009) Consistency of Bayesianprocedures for variable selection The Annals of Statistics 37 (3) 1207ndash1228

URL httpprojecteuclidorgeuclidaos1239369020

Casella G Moreno E amp Giron F (2014) Cluster Analysis Model Selection and PriorDistributions on Models Bayesian Analysis TBA(TBA) 1ndash46

URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf


Chipman H (1996) Bayesian variable selection with related predictors CanadianJournal of Statistics 24(1) 17ndash36

URL httponlinelibrarywileycomdoi1023073315687abstract

Clyde M amp George E I (2004) Model Uncertainty Statistical Science 19(1) 81ndash94

URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274

Dewey J (1958) Experience and nature New York Dover Publications

Dorazio R M amp Taylor-Rodrıguez D (2012) A Gibbs sampler for Bayesian analysis ofsite-occupancy data Methods in Ecology and Evolution 3 1093ndash1098

Ellison A M (2004) Bayesian inference in ecology Ecology Letters 7 509ndash520

Fiske I amp Chandler R (2011) unmarked An R package for fitting hierarchical modelsof wildlife occurrence and abundance Journal of Statistical Software 43(10)

URL httpcorekmiopenacukdownloadpdf5701760pdf

George E (2000) The variable selection problem Journal of the American StatisticalAssociation 95(452) 1304ndash1308

URL httpwwwtandfonlinecomdoiabs10108001621459200010474336

Giron F J Moreno E Casella G amp Martınez M L (2010) Consistency of objectiveBayes factors for nonnested linear models and increasing model dimension Revistade la Real Academia de Ciencias Exactas Fisicas y Naturales Serie A Matematicas104(1) 57ndash67

URL httpwwwspringerlinkcomindex105052RACSAM201006

Good I J (1950) Probability and the Weighing of Evidence New York Haffner

Griepentrog G L Ryan J M amp Smith L D (1982) Linear transformations ofpolynomial regression-models American Statistician 36(3) 171ndash174

Gunel E amp Dickey J (1974) Bayes factors for independence in contingency tablesBiometrika 61 545ndash557

Hanski I (1994) A Practical Model of Metapopulation Dynamics Journal of AnimalEcology 63 151ndash162

Hooten M (2006) Hierarchical spatio-temporal models for ecological processesDoctoral dissertation University of Missouri-Columbia

URL httpsmospacelibraryumsystemeduxmluihandle103554500

Hooten M B amp Hobbs N T (2014) A Guide to Bayesian Model Selection forEcologists Ecological Monographs (In Press)


Hughes, J. & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75, 139–159.

Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

Jeffreys, H. (1935). Some tests of significance treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). London: Oxford University Press.

Johnson, D., Conn, P., Hooten, M., Ray, J. & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.

Kass, R. & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).

Kass, R. E. & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

Kass, R. E. & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: A Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses (1st ed.). Academic Press.

Kéry, M., Gardner, B. & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.

Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E. & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–365.

Liang, F., Paulo, R., Molina, G., Clyde, M. A. & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.

Link, W. & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D. & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.

MacKenzie, D., Nichols, J. & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.

MacKenzie, D. I., Bailey, L. L. & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A. & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M. (2013). Package 'AICcmodavg'.

McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R. & Tsai, C.-L. (1997). The model selection criterion AICu.

Moreno, E., Bertolino, F. & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J. & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.

Nelder, J. A. (1977). Reformulation of linear models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: how strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J. & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.

Ovaskainen, O., Hottola, J. & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–2521.

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G. & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.

Rao, C. R. & Wu, Y. (2001). On model selection. Lecture Notes–Monograph Series, Vol. 38 (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.

Reich, B. J., Hodges, J. S. & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W. & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.

Rigler, F. & Peters, R. (1995). Excellence in Ecology: Science and Limnology. Germany: Ecology Institute.

Robert, C., Chopin, N. & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A. & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–1823.

Scott, J. & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics.

Spiegelhalter, D. J. & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L. & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K. & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J. & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C. & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.

Womack, A. J., León-Novelo, L. & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association.

Yuan, M., Joseph, V. R. & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H. & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: a case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A. & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).

BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a BS degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a PhD in interdisciplinary ecology with a concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.



TABLE OF CONTENTS

page

ACKNOWLEDGMENTS 4
LIST OF TABLES 8
LIST OF FIGURES 10
ABSTRACT 12

CHAPTER

1 GENERAL INTRODUCTION 14
  1.1 Occupancy Modeling 15
  1.2 A Primer on Objective Bayesian Testing 17
  1.3 Overview of the Chapters 21

2 MODEL ESTIMATION METHODS 23
  2.1 Introduction 23
    2.1.1 The Occupancy Model 24
    2.1.2 Data Augmentation Algorithms for Binary Models 26
  2.2 Single Season Occupancy 29
    2.2.1 Probit Link Model 30
    2.2.2 Logit Link Model 32
  2.3 Temporal Dynamics and Spatial Structure 34
    2.3.1 Dynamic Mixture Occupancy State-Space Model 37
    2.3.2 Incorporating Spatial Dependence 43
  2.4 Summary 46

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS 49
  3.1 Introduction 49
  3.2 Objective Bayesian Inference 52
    3.2.1 The Intrinsic Methodology 53
    3.2.2 Mixtures of g-Priors 54
      3.2.2.1 Intrinsic priors 55
      3.2.2.2 Other mixtures of g-priors 56
  3.3 Objective Bayes Occupancy Model Selection 57
    3.3.1 Preliminaries 58
    3.3.2 Intrinsic Priors for the Occupancy Problem 60
    3.3.3 Model Posterior Probabilities 62
    3.3.4 Model Selection Algorithm 63
  3.4 Alternative Formulation 66
  3.5 Simulation Experiments 68
    3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors 70
    3.5.2 Summary Statistics for the Highest Posterior Probability Model 76
  3.6 Case Study: Blue Hawker Data Analysis 77
    3.6.1 Results: Variable Selection Procedure 79
    3.6.2 Validation for the Selection Procedure 81
  3.7 Discussion 82

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS 84
  4.1 Introduction 84
  4.2 Setup for Well-Formulated Models 88
    4.2.1 Well-Formulated Model Spaces 90
  4.3 Priors on the Model Space 91
    4.3.1 Model Prior Definition 92
    4.3.2 Choice of Prior Structure and Hyper-Parameters 96
    4.3.3 Posterior Sensitivity to the Choice of Prior 99
  4.4 Random Walks on the Model Space 104
    4.4.1 Simple Pruning and Growing 105
    4.4.2 Degree Based Pruning and Growing 106
  4.5 Simulation Study 107
    4.5.1 SNR and Sample Size Effect 109
    4.5.2 Coefficient Magnitude 110
    4.5.3 Special Points on the Scale 111
  4.6 Case Study: Ozone Data Analysis 111
  4.7 Discussion 113

5 CONCLUSIONS 115

APPENDIX

A FULL CONDITIONAL DENSITIES DYMOSS 118
B RANDOM WALK ALGORITHMS 121
C WFM SIMULATION DETAILS 124
D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS 131

REFERENCES 133

BIOGRAPHICAL SKETCH 140

LIST OF TABLES

Table page

1-1 Interpretation of BFji when contrasting Mj and Mi 20

3-1 Simulation control parameters occupancy model selector 69

3-2 Comparison of average minOddsMPIP under scenarios having different numberof sites (N=50 N=100) and under scenarios having different number of surveysper site (J=3 J=5) for the presence and detection components using uniformand multiplicity correction priors 75

3-3 Comparison of average minOddsMPIP for different levels of signal consideredin the occupancy and detection probabilities for the presence and detectioncomponents using uniform and multiplicity correction priors 75

3-4 Comparison between scenarios with 50 and 100 sites in terms of the averagepercentage of true positive and true negative terms over the highest probabilitymodels for the presence and the detection components using uniform andmultiplicity correcting priors on the model space 76

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of thepercentage of true positive and true negative predictors averaged over thehighest probability models for the presence and the detection componentsusing uniform and multiplicity correcting priors on the model space 77

3-6 Comparison between scenarios with different level of signal in the occupancycomponent in terms of the percentage of true positive and true negative predictorsaveraged over the highest probability models for the presence and the detectioncomponents using uniform and multiplicity correcting priors on the model space 77

3-7 Comparison between scenarios with different level of signal in the detectioncomponent in terms of the percentage of true positive and true negative predictorsaveraged over the highest probability models for the presence and the detectioncomponents using uniform and multiplicity correcting priors on the model space 78

3-8 Posterior probability for the five highest probability models in the presencecomponent of the blue hawker data 80

3-9 Posterior probability for the five highest probability models in the detectioncomponent of the blue hawker data 80

3-10 MPIP presence component 81

3-11 MPIP detection component 81

3-12 Mean misclassification rate for HPMrsquos and MPMrsquos using uniform and multiplicitycorrection model priors 82

8

4-1 Characterization of the full models MF and corresponding model spaces Mconsidered in simulations 100

4-2 Mean number of false and true positives in 100 randomly generated datasetsas the number of main effects increases from three to five predictors in a is afull quadratic under the equal probability prior (EPP) the hierarchical independenceprior (HIP) the hierarchical order prior (HOP) and the hierarchical uniformprior (HUP) 102

4-3 Mean number of false and true positives in 100 randomly generated datasetsas the maximum order of MF increases from two to four in a full model withthree main effects under the equal probability prior (EPP) the hierarchicalindependence prior (HIP) the hierarchical order prior (HOP) and the hierarchicaluniform prior (HUP) 103

4-4 Mean number of false and true positives in 100 randomly generated datasetswith unstructured or irregular model spaces under the equal probability prior(EPP) the hierarchical independence prior (HIP) the hierarchical order prior(HOP) and the hierarchical uniform prior (HUP) 105

4-5 Variables used in the analyses of the ozone contamination dataset 112

4-6 Median probability models (MPM) from different combinations of parameterand model priors vs model selected using the hierarchical lasso 113

C-1 Experimental conditions WFM simulations 124

D-1 Variables used in the analyses of the ozone contamination dataset 131

D-2 Marginal inclusion probabilities intrinsic prior 132

D-3 Marginal inclusion probabilities Zellner-Siow prior 132

D-4 Marginal inclusion probabilities Hyper-g11 132

D-5 Marginal inclusion probabilities Hyper-g21 132

9

LIST OF FIGURES

Figure page

2-1 Graphical representation occupancy model 25

2-2 Graphical representation occupancy model after data-augmentation 31

2-3 Graphical representation multiseason model for a single site 39

2-4 Graphical representation data-augmented multiseason model 39

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites usinguniform (U) and multiplicity correction (MC) priors 71

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per siteusing uniform (U) and multiplicity correction (MC) priors 72

3-3 Predictor MPIP averaged over scenarios with the interaction between the numberof sites and the surveys per site using uniform (U) and multiplicity correction(MC) priors 72

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancyprobabilities using uniform (U) and multiplicity correction (MC) priors 73

3-5 Predictor MPIP averaged over scenarios with equal signal in the detectionprobabilities using uniform (U) and multiplicity correction (MC) priors 73

4-1 Graphs of well-formulated polynomial models for p = 2 90

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects formodel M = 1 x1 x21 91

4-3 Graphical representation of assumptions on M defined by the quadratic surfacein two main effects 93

4-4 Prior probabilities for the space of well-formulated models associated to thequadratic surface on two variables where MB is taken to be the intercept onlymodel and (ab) isin (1 1) (1 ch) 97

4-5 Prior probabilities for the space of well-formulated models associated to threemain effects and one interaction term where MB is taken to be the interceptonly model and (ab) isin (1 1) (1 ch) 98

4-6 MT DAG of the largest true model used in simulations 109

4-7 Average true positives (TP) and average false positives (FP) in all simulatedscenarios for the median probability model with EPP and HOP(1 ch) 110

C-1 SNR vs n Average model size average true positives and average false positivesfor all simulated scenarios by model ranking according to model posterior probabilities126

10

C-2 SNR vs coefficient values Average model size average true positives andaverage false positives for all simulated scenarios by model ranking accordingto model posterior probabilities 128

C-3 SNR vs different true models MT Average model size average true positivesand average false positives for all simulated scenarios by model ranking accordingto model posterior probabilities 129

11

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology

The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure, and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity, using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.

13

CHAPTER 1
GENERAL INTRODUCTION

As with any other branch of science ecology strives to grasp truths about the

world that surrounds us and in particular about nature The objective truth sought

by ecology may well be beyond our grasp however it is reasonable to think that at

least partially ldquoNature is capable of being understoodrdquo (Dewey 1958) We can observe

and interpret nature to formulate hypotheses which can then be tested against reality

Hypotheses that encounter no or little opposition when confronted with reality may

become contextual versions of the truth and may be generalized by scaling them

spatially andor temporally accordingly to delimit the bounds within which they are valid

To formulate hypotheses accurately and in a fashion amenable to scientific inquiry

not only the point of view and assumptions considered must be made explicit but

also the object of interest the properties worthy of consideration of that object and

the methods used in studying such properties (Reiners amp Lockwood 2009 Rigler amp

Peters 1995) Ecology as defined by Krebs (1972) is ldquothe study of interactions that

determine the distribution and abundance of organismsrdquo This characterizes organisms

and their interactions as the objects of interest to ecology and prescribes distribution

and abundance as a relevant property of these organisms

With regards to the methods used to acquire ecological scientific knowledge

traditionally theoretical mathematical models (such as deterministic PDEs) have been

used However naturally varying systems are imprecisely observed and as such are

subject to multiple sources of uncertainty that must be explicitly accounted for Because

of this the ecological scientific community is developing a growing interest in flexible

and powerful statistical methods and among these Bayesian hierarchical models

predominate These methods rely on empirical observations and can accommodate

fairly complex relationships between empirical observations and theoretical process

models while accounting for diverse sources of uncertainty (Hooten 2006)

14

Bayesian approaches are now used extensively in ecological modeling however

there are two issues of concern one from the standpoint of ecological practitioners

and another from the perspective of scientific ecological endeavors First Bayesian

modeling tools require a considerable understanding of probability and statistical theory

leading practitioners to view them as black box approaches (Kery 2010) Second

although Bayesian applications proliferate in the literature in general there is a lack of

awareness of the distinction between approaches specifically devised for testing and

those for estimation (Ellison 2004) Furthermore there is a dangerous unfamiliarity with

the proven risks of using tools designed for estimation in testing procedures (Berger amp

Pericchi 1996 Berger et al 2001 Kass amp Raftery 1995 Moreno et al 1998 Robert

et al 2009 Robert 1993) (eg use of flat priors in hypothesis testing)

Occupancy models have played a leading role during the past decade in large

biological population surveys The flexibility of the occupancy framework has allowed

the development of useful extensions to determine several key population parameters

which provide robust notions of the distribution structure and dynamics of a population

In order to address some of the concerns stated in previous paragraph we concentrate

in the occupancy framework to develop estimation and testing tools that will allow

ecologists first to gain insight about the estimation procedure and second to conduct

statistically sound model selection for site-occupancy data

1.1 Occupancy Modeling

Since MacKenzie et al (2002) and Tyre et al (2003) introduced the site-occupancy

framework countless applications and extensions of the method have been developed

in the ecological literature as evidenced by the 438000 hits on Google Scholar for

a search of rdquooccupancy modelrdquo This class of models acknowledges that techniques

used to conduct biological population surveys are prone to detection errors ndashif an

individual is detected it must be present while if it is not detected it might or might

not be Occupancy models improve upon traditional binary regression by accounting

15

for observed detection and partially observed presence as two separate but related

components In the site occupancy setting the chosen locations are surveyed

repeatedly in order to reduce the ambiguity caused by the observed zeros This

approach therefore allows probabilities of both presence (occurrence) and detection

to be estimated

The uses of site-occupancy models are many For example metapopulation

and island biogeography models are often parameterized in terms of site (or patch)

occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)) and

occupancy may be used as a surrogate for abundance to answer questions regarding

geographic distribution range size and metapopulation dynamics (MacKenzie et al

2004 Royle amp Kery 2007)

The basic occupancy framework which assumes a single closed population with

fixed probabilities through time has proven to be quite useful however it might be of

limited utility when addressing some problems In particular assumptions for the basic

model may become too restrictive or unrealistic whenever the study period extends

throughout multiple years or seasons especially given the increasingly changing

environmental conditions that most ecosystems are currently experiencing

Among the extensions found in the literature one that we consider particularly

relevant incorporates heterogenous occupancy probabilities through time Models

that incorporate temporally varying probabilities stem from important meta-population

notions provided by Hanski (1994) such as occupancy probabilities depending on local

colonization and local extinction processes In spite of the conceptual usefulness of

Hanskirsquos model several strong and untenable assumptions (eg all patches being

homogenous in quality) are required for it to provide practically meaningful results

A more viable alternative which builds on Hanski (1994) is an extension of

the single season occupancy model of MacKenzie et al (2003) In this model the

heterogeneity of occupancy probabilities across seasons arises from local colonization

16

and extinction processes This model is flexible enough to let detection occurrence

extinction and colonization probabilities to each depend upon its own set of covariates

Model parameters are obtained through likelihood-based estimation

Using a maximum likelihood approach presents two drawbacks First the

uncertainty assessment for maximum likelihood parameter estimates relies on

asymptotic results which are obtained from implementation of the delta method

making it sensitive to sample size Second to obtain parameter estimates the latent

process (occupancy) is marginalized out of the likelihood leading to the usual zero

inflated Bernoulli model Although this is a convenient strategy for solving the estimation

problem after integrating the latent state variables (occupancy indicators) they are

no longer available Therefore finite sample estimates cannot be calculated directly

Instead a supplementary parametric bootstrapping step is necessary Further

additional structure such as temporal or spatial variation cannot be introduced by

means of random effects (Royle amp Kery 2007)

1.2 A Primer on Objective Bayesian Testing

With the advent of high dimensional data such as that found in modern problems

in ecology genetics physics etc coupled with evolving computing capability objective

Bayesian inferential methods have gained increasing popularity This however is by no

means a new approach in the way Bayesian inference is conducted In fact starting with

Bayes and Laplace and continuing for almost 200 years Bayesian analysis was primarily

based on ldquononinformativerdquo priors (Berger amp Bernardo 1992)

Now subjective elicitation of prior probabilities in Bayesian analysis is widely

recognized as the ideal (Berger et al 2001) however it is often the case that the

available information is insufficient to specify appropriate prior probabilistic statements

Commonly as in model selection problems where large model spaces have to be

explored the number of model parameters is prohibitively large preventing one from

eliciting prior information for the entire parameter space As a consequence in practice

17

the determination of priors through the definition of structural rules has become the

alternative to subjective elicitation for a variety of problems in Bayesian testing Priors

arising from these rules are known in the literature as noninformative objective default

or reference Many of these connotations generate controversy and are accused

perhaps rightly of providing a false pretension of objectivity Nevertheless we will avoid

that discussion and refer to them herein exchangeably as noninformative or objective

priors to convey the sense that no attempt to introduce an informed opinion is made in

defining prior probabilities

A plethora of ldquononinformativerdquo methods has been developed in the past few

decades (see Berger amp Bernardo (1992) Berger amp Pericchi (1996) Berger et al (2001)

Clyde amp George (2004) Kass amp Wasserman (1995 1996) Liang et al (2008) Moreno

et al (1998) Spiegelhalter amp Smith (1982) Wasserman (2000) and the references

therein) We find particularly interesting those derived from the model structure in which

no tuning parameters are required especially since these can be regarded as automatic

methods Among them methods based on the Bayes factor for Intrinsic Priors have

proven their worth in a variety of inferential problems given their excellent performance

flexibility and ease of use This class of priors is discussed in detail in chapter 3 For

now some basic notation and notions of Bayesian inferential procedures are introduced

Hypothesis testing and the Bayes factor

Bayesian model selection techniques that aim to find the true model as opposed

to searching for the model that best predicts the data are fundamentally extensions to

Bayesian hypothesis testing strategies In general this Bayesian approach to hypothesis

testing and model selection relies on determining the amount of evidence found in favor

of one hypothesis (or model) over the other given an observed set of data Approached

from a Bayesian standpoint this type of problem can be formulated in great generality

using a natural well defined probabilistic framework that incorporates both model and

parameter uncertainty

18

Jeffreys (1935) first developed the Bayesian strategy to hypothesis testing and, consequently, to the model selection problem. Bayesian model selection within a model space M = (M_1, M_2, ..., M_J), where each model is associated with a parameter θ_j (which may itself be a vector of parameters), incorporates three types of probability distributions: (1) a prior probability distribution for each model, π(M_j); (2) a prior probability distribution for the parameters in each model, π(θ_j | M_j); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x | θ_j, M_j). These three probability densities induce the joint distribution p(x, θ_j, M_j) = f(x | θ_j, M_j) · π(θ_j | M_j) · π(M_j), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes rule:

p(M_j | x) = \frac{m(x | M_j) π(M_j)}{\sum_{i=1}^{J} m(x | M_i) π(M_i)},        (1-1)

where m(x | M_j) = \int f(x | θ_j, M_j) π(θ_j | M_j) dθ_j is the marginal likelihood of M_j.

Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:

\frac{p(M_j | x)}{p(M_k | x)} = \frac{m(x | M_j)}{m(x | M_k)} · \frac{π(M_j)}{π(M_k)}.        (1-2)

The first term on the right hand side of (1-2), m(x | M_j)/m(x | M_k), is known as the Bayes factor comparing model M_j to model M_k, and it is denoted by BF_{jk}(x). The Bayes factor provides a measure of the evidence in favor of either model given the data and updates the model prior odds, given by π(M_j)/π(M_k), to produce the posterior odds.

Note that the model posterior probability in (1-1) can be expressed as a function of Bayes factors. To illustrate, let model M* ∈ M be a reference model to which all other models in M are compared. Then, dividing both the numerator and denominator in (1-1) by m(x | M*)π(M*) yields

p(M_j | x) = \frac{BF_{j*}(x) \, π(M_j)/π(M*)}{1 + \sum_{M_i ∈ M, M_i ≠ M*} BF_{i*}(x) \, π(M_i)/π(M*)}.        (1-3)

Therefore, as the Bayes factor increases, the posterior probability of model M_j given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.

Assuming equal model prior probabilities in (1-3), the prior odds are set equal to one and the model posterior odds in (1-2) become p(M_j | x)/p(M_k | x) = BF_{jk}(x). Based on the Bayes factors, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).

Table 1-1. Interpretation of BF_{jk} when contrasting M_j and M_k

ln BF_{jk}    BF_{jk}       Evidence in favor of M_j    P(M_j | x)
0 to 2        1 to 3        Weak evidence               0.50 to 0.75
2 to 6        3 to 20       Positive evidence           0.75 to 0.95
6 to 10       20 to 150     Strong evidence             0.95 to 0.99
> 10          > 150         Very strong evidence        > 0.99
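To make the mapping from Bayes factors to posterior model probabilities concrete, the short R sketch below evaluates (1-3) for a handful of hypothetical models under equal prior probabilities; the Bayes factor values are made up purely for illustration.

    # Illustrative only: hypothetical Bayes factors of models M1-M4 against the
    # reference model M1 (so the Bayes factor for M1 itself is 1). Under equal
    # model priors, equation (1-3) reduces to normalizing the Bayes factors.
    bf <- c(M1 = 1, M2 = 3.5, M3 = 12, M4 = 150)
    post_prob <- bf / sum(bf)
    round(post_prob, 3)
    # The ranking of models by posterior probability matches the ranking by
    # Bayes factor, as noted in the text.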

Bayesian hypothesis testing and model selection procedures through Bayes factors

and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model

(or hypothesis) posterior probabilities Second these methods can yield frequentist

matching confidence bounds when implemented with good testing priors (Kass amp

Wasserman 1996) such as the reference priors of Berger amp Bernardo (1992) Third

since the Bayes factor contains the ratio of marginal densities it automatically penalizes

complexity according to the number of parameters in each model this property is

known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does

20

not require having nested hypotheses (ie having the null hypothesis nested in the

alternative) standard distributions or regular asymptotics (eg convergence to normal

or chi squared distributions) (Berger et al 2001) In contrast this is not always the case

with frequentist and likelihood ratio tests which depend on known distributions (at least

asymptotically) for the test statistic to perform the test Finally Bayesian hypothesis

testing procedures using the Bayes factor can naturally incorporate model uncertainty by

using the Bayesian machinery for model averaged predictions and confidence bounds

(Kass amp Raftery 1995) It is not clear how to account for this uncertainty rigorously in a

fully frequentist approach

1.3 Overview of the Chapters

In the chapters that follow we develop a flexible and straightforward hierarchical

Bayesian framework for occupancy models allowing us to obtain estimates and conduct

robust testing from an ldquoobjectiverdquo Bayesian perspective Latent mixtures of random

variables supply a foundation for our methodology This approach provides a means to

directly incorporate spatial dependency and temporal heterogeneity through predictors

that characterize either habitat quality of a given site or detectability features of a

particular survey conducted in a specific site On the other hand the Bayesian testing

methods we propose are (1) a fully automatic and objective method for occupancy

model selection and (2) an objective Bayesian testing tool that accounts for multiple

testing and for polynomial hierarchical structure in the space of predictors

Chapter 2 introduces the methods proposed for estimation of occupancy model

parameters A simple estimation procedure for the single season occupancy model

with covariates is formulated using both probit and logit links Based on the simple

version an extension is provided to cope with metapopulation dynamics by introducing

persistence and colonization processes Finally given the fundamental role that spatial

dependence plays in defining temporal dynamics a strategy to seamlessly account for

this feature in our framework is introduced

21

Chapter 3 develops a new fully automatic and objective method for occupancy

model selection that is asymptotically consistent for variable selection and averts the

use of tuning parameters In this Chapter first some issues surrounding multimodel

inference are described and insight about objective Bayesian inferential procedures is

provided Then building on modern methods for ldquoobjectiverdquo Bayesian testing to generate

priors on the parameter space the intrinsic priors for the parameters of the occupancy

model are obtained These are used in the construction of a variable selection algorithm

for ldquoobjectiverdquo variable selection tailored to the occupancy model framework

Chapter 4 touches on two important and interconnected issues when conducting

model testing that have yet to receive the attention they deserve (1) controlling for false

discovery in hypothesis testing given the size of the model space ie given the number

of tests performed and (2) non-invariance to location transformations of the variable

selection procedures in the face of polynomial predictor structure These elements both

depend on the definition of prior probabilities on the model space In this chapter a set

of priors on the model space and a stochastic search algorithm are proposed Together

these control for model multiplicity and account for the polynomial structure among the

predictors

22

CHAPTER 2
MODEL ESTIMATION METHODS

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches

2.1 Introduction

Prior to the introduction of site-occupancy models (MacKenzie et al 2002 Tyre

et al 2003) presence-absence data from ecological monitoring programs were used

without any adjustment to assess the impact of management actions to observe trends

in species distribution through space and time or to model the habitat of a species (Tyre

et al 2003) These efforts however were suspect due to false-negative errors not

being accounted for False-negative errors occur whenever a species is present at a site

but goes undetected during the survey

Site-occupancy models developed independently by MacKenzie et al (2002)

and Tyre et al (2003) extend simple binary-regression models to account for the

aforementioned errors in detection of individuals common in surveys of animal or plant

populations Since their introduction the site-occupancy framework has been used in

countless applications and numerous extensions for it have been proposed Occupancy

models improve upon traditional binary regression by analyzing observed detection

and partially observed presence as two separate but related components In the site

occupancy setting the chosen locations are surveyed repeatedly in order to reduce the

ambiguity caused by the observed zeros This approach therefore allows simultaneous

estimation of the probabilities of presence (occurrence) and detection

Several extensions to the basic single-season closed population model are

now available The occupancy approach has been used to determine species range

dynamics (MacKenzie et al. 2003; Royle & Kéry 2007) and to understand age/stage

23

structure within populations (Nichols et al 2007) model species co-occurrence

(MacKenzie et al 2004 Ovaskainen et al 2010 Waddle et al 2010) It has even been

suggested as a surrogate for abundance (MacKenzie amp Nichols 2004) MacKenzie et al

suggested using occupancy models to conduct large-scale monitoring programs since

this approach avoids the high costs associated with surveys designed for abundance

estimation Also to investigate metapopulation dynamics occupancy models improve

upon incidence function models (Hanski 1994) which are often parameterized in terms

of site (or patch) occupancy and assume homogenous patches and a metapopulation

that is at a colonization-extinction equilibrium

Nevertheless the implementation of Bayesian occupancy models commonly resorts

to sampling strategies dependent on hyper-parameters subjective prior elicitation

and relatively elaborate algorithms From the standpoint of practitioners these are

often treated as black-box methods (Kery 2010) As such the potential of using the

methodology incorrectly is high Commonly these procedures are fitted with packages

such as BUGS or JAGS Although the packagersquos ease of use has led to a wide-spread

adoption of the methods the user may be oblivious as to the assumptions underpinning

the analysis

We believe providing straightforward and robust alternatives to implement these

methods will help practitioners gain insight about how occupancy modeling and more

generally Bayesian modeling is performed In this Chapter using a simple Gibbs

sampling approach first we develop a versatile method to estimate the single season

closed population site-occupancy model then extend it to analyze metapopulation

dynamics through time and finally provide a further adaptation to incorporate spatial

dependence among neighboring sites.

2.1.1 The Occupancy Model

In this section of the document, we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. For

24

the standard sampling protocol for collecting site-occupancy data, J > 1 independent

surveys are conducted at each of N representative sample locations (sites) noting

whether a species is detected or not detected during each survey Let yij denote a binary

random variable that indicates detection (y = 1) or non-detection (y = 0) during the

j th survey of site i Without loss of generality J may be assumed constant among all N

sites to simplify description of the model In practice however site-specific variation in

J poses no real difficulties and is easily implemented This sampling protocol therefore

yields an N × J matrix Y of detection/non-detection data.

Note that the observed process yij is an imperfect representation of the underlying

occupancy or presence process Hence letting zi denote the presence indicator at site i

this model specification can therefore be represented through the hierarchy

y_{ij} | z_i, λ ~ Bernoulli(z_i p_{ij})
z_i | α ~ Bernoulli(ψ_i),        (2-1)

where p_{ij} is the probability of correctly classifying as occupied the i-th site during the j-th survey, and ψ_i is the presence probability at the i-th site. The graphical representation of this process is shown in Figure 2-1.

Figure 2-1. Graphical representation of the occupancy model (nodes: ψ_i → z_i → y_{ij} ← p_{ij}).
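To fix ideas, the hierarchy in (2-1) is easy to simulate from. The short R sketch below generates a detection/non-detection matrix Y under constant occupancy and detection probabilities; all numerical settings are hypothetical and chosen only for illustration.

    # Minimal R sketch: simulating detection/non-detection data from hierarchy (2-1).
    # N, J, psi and p are illustrative values, not estimates from any data set.
    set.seed(1)
    N <- 100          # number of sites
    J <- 5            # surveys per site
    psi <- 0.6        # occupancy (presence) probability
    p <- 0.4          # detection probability given presence

    z <- rbinom(N, 1, psi)                        # latent presence indicators z_i
    Y <- matrix(rbinom(N * J, 1, rep(z * p, J)),  # y_ij | z_i ~ Bernoulli(z_i * p)
                nrow = N, ncol = J)

    mean(z)               # realized proportion of occupied sites
    mean(rowSums(Y) > 0)  # sites with at least one detection (never exceeds mean(z))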

Probabilities of detection and occupancy can both be made functions of covariates

and their corresponding parameter estimates can be obtained using either a maximum

25

likelihood or a Bayesian approach Existing methodologies from the likelihood

perspective marginalize over the latent occupancy process (zi ) making the estimation

procedure depend only on the detections Most Bayesian strategies rely on MCMC

algorithms that require parameter prior specification and tuning However Albert amp Chib

(1993) proposed a longstanding strategy in the Bayesian statistical literature that models

binary outcomes using a simple Gibbs sampler This procedure which is described in

the following section can be extrapolated to the occupancy setting eliminating the need

for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models

Probit model: data augmentation with latent normal variables

At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support (-∞, 0], and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on (0, ∞). To understand the reasoning behind this strategy, let Y ~ Bern(Φ(x^T β)) and V = x^T β + ε with ε ~ N(0, 1). In such a case, note that

Pr(y = 1 | x^T β) = Φ(x^T β) = Pr(ε < x^T β)
                             = Pr(ε > -x^T β)
                             = Pr(v > 0 | x^T β).

Thus, whenever y = 1 then v > 0, and v ≤ 0 otherwise. In other words, we may think of y as a truncated version of v. Thus, we can sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we are able to obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2-3 below). Further, because we can sample the latent variables, we can also sample the parameters.

Given some initial values for all model parameters, values for the latent variables can be simulated. By conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions. These samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach. Generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.

Assume that each outcome Y_1, Y_2, ..., Y_n is such that Y_i | x_i, β ~ Bernoulli(q_i), where q_i = Φ(x_i^T β) is the standard normal CDF evaluated at x_i^T β, and x_i and β are the p-dimensional vectors of observed covariates for the i-th observation and their corresponding parameters, respectively.

Now let y = (y_1, y_2, ..., y_n) be the vector of observed outcomes and let [β] represent the prior distribution of the model parameters. Therefore, the posterior distribution of β is given by

[β | y] ∝ [β] \prod_{i=1}^{n} Φ(x_i^T β)^{y_i} (1 - Φ(x_i^T β))^{1 - y_i},        (2-2)

which is intractable. Nevertheless, introducing latent random variables V = (V_1, ..., V_n) such that V_i ~ N(x_i^T β, 1) resolves this difficulty by specifying that whenever Y_i = 1 then V_i > 0, and if Y_i = 0 then V_i ≤ 0. This yields

[β, v | y] ∝ [β] \prod_{i=1}^{n} φ(v_i | x_i^T β, 1) { I(v_i ≤ 0) I(y_i = 0) + I(v_i > 0) I(y_i = 1) },        (2-3)

where φ(x | μ, τ²) is the probability density function of a normal random variable x with mean μ and variance τ². The data augmentation artifact works since [β | y] = ∫ [β, v | y] dv; hence, if we sample from the joint posterior (2-3) and extract only the sampled values for β, they will correspond to samples from [β | y].

From the expression above it is possible to obtain the full conditional distributions

for V and β Thus a Gibbs sampler can be proposed For example if we use a flat prior

27

for β (ie [ β ] prop 1) the full conditionals are given by

β|V y sim MVNk

((XTX )minus1(XTV ) (XTX )minus1

)(2ndash4)

V|β y simnprodi=1

tr N (xTi β 1Qi) (2ndash5)

where MVNq(micro ) represents a multivariate normal distribution with mean vector micro

and variance-covariance matrix and tr N (ξσ2Q) stands for the truncated normal

distribution with mean ξ variance σ2 and truncation region Q For each i = 1 2 n

the support of the truncated variables is given by Q = (minusinfin 0 ] if yi = 0 and Q = (0infin)

otherwise Note that conjugate normal priors could be used alternatively

At iteration m + 1, the Gibbs sampler draws V^(m+1) conditional on β^(m) from (2–5), and then samples β^(m+1) conditional on V^(m+1) from (2–4). This process is repeated for m = 0, 1, ..., n_sim, where n_sim is the number of iterations in the Gibbs sampler.
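To make the two-step sampler concrete, the following is a minimal sketch of the Albert & Chib updates (2–4) and (2–5) for probit regression under a flat prior. It is illustrative only (the simulated data, sample sizes and variable names are ours, not part of this dissertation); it assumes numpy and scipy are available.

import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)

# simulated probit data (illustrative assumption only)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, -1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

XtX_inv = np.linalg.inv(X.T @ X)      # (X'X)^{-1}, posterior covariance in (2-4)
chol = np.linalg.cholesky(XtX_inv)

beta = np.zeros(p)
draws = []
for m in range(2000):
    mu = X @ beta
    # (2-5): v_i | beta, y_i truncated normal on (0, inf) if y_i = 1, (-inf, 0] otherwise
    lo = np.where(y == 1, 0.0, -np.inf)
    hi = np.where(y == 1, np.inf, 0.0)
    v = truncnorm.rvs(lo - mu, hi - mu, loc=mu, scale=1.0, random_state=rng)
    # (2-4): beta | v ~ MVN((X'X)^{-1} X'v, (X'X)^{-1})
    beta = XtX_inv @ (X.T @ v) + chol @ rng.normal(size=p)
    draws.append(beta.copy())

print(np.mean(draws[500:], axis=0))   # posterior means after burn-in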

Logit model Data-augmentation with latent Polya-gamma variables

Recently, Polson et al. (2013) developed a novel and efficient approach for Bayesian inference in logistic models using Polya-gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of the iid sequence of Exp(1) random variables {Ek}_{k=1}^{∞} given by

ω = (2/π²) ∑_{k=1}^{∞} Ek / (2k − 1)²,

with probability density function

g(ω) = ∑_{k=0}^{∞} (−1)^k [ (2k + 1) / √(2πω³) ] e^{−(2k+1)²/(8ω)} I_{ω∈(0,∞)}   (2–6)

and Laplace transform E[e^{−tω}] = cosh^{−1}(√(t/2)).

The Polya-gamma family of densities is obtained through an exponential tilting of the density g in 2–6. These densities, indexed by c ≥ 0, are characterized by

f(ω | c) = cosh(c/2) e^{−c²ω/2} g(ω).
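A draw from this family can be approximated directly from the sum-of-gammas representation of the Polya-gamma distribution in Polson et al. (2013), truncated at a finite number of terms. The sketch below is an assumption-laden illustration (function name, truncation level and use of numpy are ours); exact samplers exist in dedicated software and should be preferred in practice.

import numpy as np

def rpg_approx(b, c, n_terms=200, rng=None):
    """Approximate draw from PG(b, c) by truncating the infinite mixture
    omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Gamma(b, 1) (Polson et al. 2013).  Truncation makes this
    approximate; it is a sketch, not an exact sampler."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=b, scale=1.0, size=n_terms)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2))) / (2.0 * np.pi ** 2)

# e.g., one approximate draw from the tilted density f(omega | c) with c = 1.3
omega = rpg_approx(1.0, 1.3)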

The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume yi ∼ Bernoulli(δi), with predictors x′i = (xi1, ..., xip) and success probability δi = e^{x′iβ} / (1 + e^{x′iβ}). Hence the posterior for the model parameters can be represented as

[β | y] = [β] ∏_{i=1}^{n} δi^{yi} (1 − δi)^{1−yi} / c(y),

where c(y) is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing Polya-gamma random variables ωi ∼ PG(1, x′iβ). This yields the data-augmented posterior

[β, ω | y] = ( ∏_{i=1}^{n} Pr(yi = 1 | β) ) f(ω | x′β) [β] / c(y),   (2–7)

such that [β | y] = ∫_{R+} [β, ω | y] dω.

Thus, from the augmented model, the full conditional density for β is given by

[β | ω, y] ∝ ( ∏_{i=1}^{n} Pr(yi = 1 | β) ) f(ω | x′β) [β]
           = ∏_{i=1}^{n} [ (e^{x′iβ})^{yi} / (1 + e^{x′iβ}) ] ∏_{i=1}^{n} cosh(|x′iβ| / 2) exp[ −(x′iβ)² ωi / 2 ] g(ωi).   (2–8)

This expression yields a normal full conditional distribution for β if β is assigned a flat or normal prior. Hence a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate β in the occupancy framework.

2.2 Single Season Occupancy

Let pij = F(q′ijλ) be the probability of correctly classifying the i-th site as occupied during the j-th survey, conditional on the site being occupied, and let ψi = F(x′iα) correspond to the presence probability at the i-th site. Further, let F^{−1}(·) denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by λ and α, respectively, the r-variate and p-variate coefficient vectors for the detection and for the presence probabilities. Then the following is the joint posterior probability for the presence indicators and the model parameters:

π*(z, α, λ) ∝ πα(α) πλ(λ) ∏_{i=1}^{N} F(x′iα)^{zi} (1 − F(x′iα))^{(1−zi)} × ∏_{j=1}^{J} (zi F(q′ijλ))^{yij} (1 − zi F(q′ijλ))^{1−yij}.   (2–9)

As in the simple probit regression problem, this posterior is intractable; consequently, sampling from it directly is not possible. But the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they do not require tuning parameters nor eliciting parameter priors subjectively.

2.2.1 Probit Link Model

To extend Albert & Chib's algorithm to the occupancy framework with a probit link, first we introduce two sets of latent variables, denoted by wij and vi, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

yij | zi, wij ∼ Bernoulli(zi I_{wij>0})
wij | λ ∼ N(q′ijλ, 1)
λ ∼ [λ]
zi | vi ∼ Bernoulli(I_{vi>0})
vi | α ∼ N(x′iα, 1)
α ∼ [α],   (2–10)

represented by the directed graph found in Figure 2-2.

[Figure 2-2. Graphical representation of the occupancy model after data augmentation; nodes: α, vi, zi, yij, wij, λ.]

Under this hierarchical model, the joint density is given by

π*(z, v, α, w, λ) ∝ Cy πα(α) πλ(λ) ∏_{i=1}^{N} ϕ(vi; x′iα, 1) I^{zi}_{vi>0} I^{(1−zi)}_{vi≤0} × ∏_{j=1}^{J} (zi I_{wij>0})^{yij} (1 − zi I_{wij>0})^{1−yij} ϕ(wij; q′ijλ, 1).   (2–11)

The full conditional densities derived from the posterior in equation 2ndash11 are

detailed below

1. These are obtained from the full conditional of z after integrating out v and w:

f(z | α, λ) = ∏_{i=1}^{N} f(zi | α, λ) = ∏_{i=1}^{N} ψ*i^{zi} (1 − ψ*i)^{1−zi},

where ψ*i = ψi ∏_{j=1}^{J} pij^{yij}(1 − pij)^{1−yij} / [ ψi ∏_{j=1}^{J} pij^{yij}(1 − pij)^{1−yij} + (1 − ψi) ∏_{j=1}^{J} I_{yij=0} ].   (2–12)

2.

f(v | z, α) = ∏_{i=1}^{N} f(vi | zi, α) = ∏_{i=1}^{N} tr N(x′iα, 1, Ai), where Ai = (−∞, 0] if zi = 0 and Ai = (0, ∞) if zi = 1,   (2–13)

and tr N(µ, σ², A) denotes the pdf of a truncated normal random variable with mean µ, variance σ², and truncation region A.

3.

f(α | v) = ϕp(α; Σα X′v, Σα),   (2–14)

where Σα = (X′X)^{−1} and ϕk(x; µ, Σ) represents the k-variate normal density with mean vector µ and variance matrix Σ.

4.

f(w | y, z, λ) = ∏_{i=1}^{N} ∏_{j=1}^{J} f(wij | yij, zi, λ) = ∏_{i=1}^{N} ∏_{j=1}^{J} tr N(q′ijλ, 1, Bij),

where Bij = (−∞, ∞) if zi = 0; Bij = (−∞, 0] if zi = 1 and yij = 0; and Bij = (0, ∞) if zi = 1 and yij = 1.   (2–15)

5.

f(λ | w) = ϕr(λ; Σλ Q′w, Σλ),   (2–16)

where Σλ = (Q′Q)^{−1}.

The Gibbs sampling algorithm for the model can then be summarized as follows (a code sketch of these six steps is given after this list):

1. Initialize z, α, v, λ and w.
2. Sample zi ∼ Bern(ψ*i).
3. Sample vi from a truncated normal with µ = x′iα, σ = 1, and truncation region depending on zi.
4. Sample α ∼ N(Σα X′v, Σα), with Σα = (X′X)^{−1}.
5. Sample wij from a truncated normal with µ = q′ijλ, σ = 1, and truncation region depending on yij and zi.
6. Sample λ ∼ N(Σλ Q′w, Σλ), with Σλ = (Q′Q)^{−1}.
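The sketch below implements the six steps above for balanced data (every site surveyed J times) under flat priors. It is a minimal illustration, not this dissertation's software: array layouts, function and variable names are our own assumptions.

import numpy as np
from scipy.stats import truncnorm, norm

def probit_occupancy_gibbs(y, X, Q, n_sim=2000, rng=None):
    """y: (N, J) detection histories; X: (N, p) occupancy covariates;
    Q: (N, J, r) detection covariates.  Sketch of the six-step sampler."""
    rng = np.random.default_rng() if rng is None else rng
    N, J = y.shape
    p, r = X.shape[1], Q.shape[2]
    Qf = Q.reshape(N * J, r)                   # flattened detection design
    Sa = np.linalg.inv(X.T @ X)                # Sigma_alpha = (X'X)^{-1}
    Sl = np.linalg.inv(Qf.T @ Qf)              # Sigma_lambda = (Q'Q)^{-1}
    ca, cl = np.linalg.cholesky(Sa), np.linalg.cholesky(Sl)
    alpha, lam = np.zeros(p), np.zeros(r)
    keep = {"alpha": [], "lambda": []}

    def tnorm(mu, lower, upper):
        return truncnorm.rvs(lower - mu, upper - mu, loc=mu, scale=1.0, random_state=rng)

    for _ in range(n_sim):
        psi = norm.cdf(X @ alpha)
        mu_w = np.einsum("ijr,r->ij", Q, lam)
        pdet = norm.cdf(mu_w)
        # step 2: z_i ~ Bern(psi*_i); sites with a detection stay occupied
        lik1 = np.prod(pdet**y * (1 - pdet) ** (1 - y), axis=1)
        psi_star = psi * lik1 / (psi * lik1 + (1 - psi) * (y.sum(axis=1) == 0))
        z = np.where(y.sum(axis=1) > 0, 1.0, rng.binomial(1, psi_star))
        # step 3: v_i | z_i truncated normal (region set by z_i)
        v = tnorm(X @ alpha, np.where(z == 1, 0, -np.inf), np.where(z == 1, np.inf, 0))
        # step 4: alpha | v
        alpha = Sa @ (X.T @ v) + ca @ rng.normal(size=p)
        # step 5: w_ij | y_ij, z_i truncated normal (unconstrained when z_i = 0)
        lo = np.where((z[:, None] == 1) & (y == 1), 0.0, -np.inf)
        hi = np.where((z[:, None] == 1) & (y == 0), 0.0, np.inf)
        w = tnorm(mu_w, lo, hi)
        # step 6: lambda | w
        lam = Sl @ (Qf.T @ w.reshape(-1)) + cl @ rng.normal(size=r)
        keep["alpha"].append(alpha.copy()); keep["lambda"].append(lam.copy())
    return keep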

2.2.2 Logit Link Model

Now turning to the logit link version of the occupancy model, again let yij be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site, and let zi be the indicator variable that denotes presence (zi = 1) or absence (zi = 0) of the target species at the i-th site. The model is now defined by

yij | zi, λ ∼ Bernoulli(zi pij), where pij = e^{q′ijλ} / (1 + e^{q′ijλ})
λ ∼ [λ]
zi | α ∼ Bernoulli(ψi), where ψi = e^{x′iα} / (1 + e^{x′iα})
α ∼ [α].

In this hierarchy, the contribution of a single site to the likelihood is

Li(α, λ) = [ (e^{x′iα})^{zi} / (1 + e^{x′iα}) ] ∏_{j=1}^{J} ( zi e^{q′ijλ} / (1 + e^{q′ijλ}) )^{yij} ( 1 − zi e^{q′ijλ} / (1 + e^{q′ijλ}) )^{1−yij}.   (2–17)

As in the probit case, we data-augment the likelihood with two separate sets of latent variables, in this case each having a Polya-gamma distribution. Augmenting the model and using the posterior in (2–7), the joint is

[z, v, w, α, λ | y] ∝ [α][λ] ∏_{i=1}^{N} { [ (e^{x′iα})^{zi} / (1 + e^{x′iα}) ] cosh(|x′iα| / 2) exp[ −(x′iα)² vi / 2 ] g(vi)
× ∏_{j=1}^{J} ( zi e^{q′ijλ} / (1 + e^{q′ijλ}) )^{yij} ( 1 − zi e^{q′ijλ} / (1 + e^{q′ijλ}) )^{1−yij}
× cosh(|zi q′ijλ| / 2) exp[ −(zi q′ijλ)² wij / 2 ] g(wij) }.   (2–18)

The full conditionals for z, α, v, λ and w obtained from (2–18) are provided below.

1. The full conditional for z is obtained after marginalizing the latent variables and yields

f(z | α, λ) = ∏_{i=1}^{N} f(zi | α, λ) = ∏_{i=1}^{N} ψ*i^{zi} (1 − ψ*i)^{1−zi},

where ψ*i = ψi ∏_{j=1}^{J} pij^{yij}(1 − pij)^{1−yij} / [ ψi ∏_{j=1}^{J} pij^{yij}(1 − pij)^{1−yij} + (1 − ψi) ∏_{j=1}^{J} I_{yij=0} ].   (2–19)

2. Using the result derived in Polson et al. (2013), we have that

f(v | z, α) = ∏_{i=1}^{N} f(vi | zi, α) = ∏_{i=1}^{N} PG(1, x′iα).   (2–20)

3.

f(α | v) ∝ [α] ∏_{i=1}^{N} exp[ zi x′iα − x′iα/2 − (x′iα)² vi / 2 ].   (2–21)

4. By the same result as that used for v, the full conditional for w is

f(w | y, z, λ) = ∏_{i=1}^{N} ∏_{j=1}^{J} f(wij | yij, zi, λ) = ( ∏_{i∈S1} ∏_{j=1}^{J} PG(1, |q′ijλ|) ) ( ∏_{i∉S1} ∏_{j=1}^{J} PG(1, 0) ),   (2–22)

with S1 = {i ∈ 1, 2, ..., N : zi = 1}.

5.

f(λ | z, y, w) ∝ [λ] ∏_{i∈S1} ∏_{j=1}^{J} exp[ yij q′ijλ − q′ijλ/2 − (q′ijλ)² wij / 2 ],   (2–23)

with S1 as defined above.
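Conditional on the Polya-gamma draws, the updates for the regression coefficients are Gaussian. As an illustration, a sketch of the draw implied by (2–21) under a flat prior [α] ∝ 1 is given below (names are ours; with a normal prior, the precision and mean simply pick up the prior terms). The λ update is analogous, using the wij and restricting to sites in S1.

import numpy as np

def update_alpha_pg(X, z, v, rng):
    """Gaussian full-conditional draw for alpha implied by (2-21) under a
    flat prior: with V = diag(v) and kappa = z - 1/2,
    alpha | v, z ~ N((X'VX)^{-1} X' kappa, (X'VX)^{-1}).  Sketch only;
    v_i would be drawn from PG(1, x_i' alpha), e.g. with an approximate
    Polya-gamma sampler such as the one sketched earlier."""
    prec = X.T @ (v[:, None] * X)              # X' V X
    cov = np.linalg.inv(prec)
    mean = cov @ (X.T @ (z - 0.5))
    return mean + np.linalg.cholesky(cov) @ rng.normal(size=X.shape[1])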

The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure

The uses of the single-season model are limited to very specific problems In

particular assumptions for the basic model may become too restrictive or unrealistic

whenever the study period extends throughout multiple years or seasons especially

given the increasingly changing environmental conditions that most ecosystems are

currently experiencing

Among the many extensions found in the literature one that we consider particularly

relevant incorporates heterogenous occupancy probabilities through time Extensions of


site-occupancy models that incorporate temporally varying probabilities can be traced

back to Hanski (1994) The heterogeneity of occupancy probabilities through time arises

from local colonization and extinction processes MacKenzie et al (2003) proposed an

alternative to Hanskirsquos approach in order to incorporate imperfect detection The method

is flexible enough to let detection occurrence survival and colonization probabilities

each depend upon its own set of covariates using likelihood-based estimation for the

model parameters

However the approach of MacKenzie et al presents two drawbacks First

the uncertainty assessment for maximum likelihood parameter estimates relies on

asymptotic results (obtained from implementation of the delta method) making it

sensitive to sample size And second to obtain parameter estimates the latent process

(occupancy) is marginalized out of the likelihood leading to the usual zero-inflated

Bernoulli model Although this is a convenient strategy to solve the estimation problem

the latent state variables (occupancy indicators) are no longer available and as such

finite sample estimates cannot be calculated unless an additional (and computationally

expensive) parametric bootstrap step is performed (Royle amp Kery 2007) Additionally as

the occupancy process is integrated out the likelihood approach precludes incorporation

of additional structural dependence using random effects Thus the model cannot

account for spatial dependence which plays a fundamental role in this setting

To work around some of the shortcomings encountered when fitting dynamic

occupancy models via likelihood based methods Royle amp Kery developed what they

refer to as a dynamic occupancy state space model (DOSS) alluding to the conceptual

similarity found between this model and the class of state space models found in the

time series literature In particular this model allows one to retain the latent process

(occupancy indicators) in order to obtain small sample estimates and to eventually

generate extensions that incorporate structure in time andor space through random

effects


The data used in the DOSS model come from standard repeated presence/absence surveys with N sampling locations (patches or sites) indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for t = 1, 2, ..., T seasons. Here an important assumption is that the site occupancy status is closed within, but not across, seasons.

As is usual in the occupancy modeling framework two different processes are

considered The first one is the detection process per site-visit-season combination

denoted by yijt The yijt are indicator functions that take the value 1 if the species is

present at site i survey j and season t and 0 otherwise These detection indicators

are assumed to be independent within each site and season The second response

considered is the partially observed presence (occupancy) indicators zit These are

indicator variables which are equal to 1 whenever yijt = 1 for one or more of the visits

made to site i during season t otherwise the values of the zit rsquos are unknown Royle amp

Kery refer to these two processes as the observation (yijt) and the state (zit) models

In this setting the parameters of greatest interest are the occurrence or site

occupancy probabilities denoted by ψit as well as those representing the population

dynamics which are accounted for by introducing changes in occupancy status over

time through local colonization and survival That is if a site was not occupied at season

t minus 1 at season t it can either be colonized or remain unoccupied On the other hand

if the site was in fact occupied at season t minus 1 it can remain that way (survival) or

become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t − 1 to season t at the i-th site are denoted by θi(t−1) and γi(t−1), respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (equation 2–24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (equation 2–25); in particular,

zi1 ∼ Bernoulli(ψi1)   (2–24)

zit | zi(t−1) ∼ Bernoulli( zi(t−1) θi(t−1) + (1 − zi(t−1)) γi(t−1) )   (2–25)

The observation model, conditional on the latent process zit, is defined by

yijt | zit ∼ Bernoulli(zit pijt)   (2–26)
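For intuition, the state and observation models (2–24)–(2–26) are easy to simulate forward. The sketch below uses scalar probabilities for simplicity; the arguments and toy values are our own illustrative assumptions.

import numpy as np

def simulate_doss(N, J, T, psi1, theta, gamma, p, rng=None):
    """Simulate one realization of (2-24)-(2-26) with scalar initial
    occupancy psi1, survival theta, colonization gamma, and detection p."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.zeros((N, T), dtype=int)
    y = np.zeros((N, J, T), dtype=int)
    z[:, 0] = rng.binomial(1, psi1, size=N)                     # (2-24)
    for t in range(1, T):
        pr = z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma    # (2-25)
        z[:, t] = rng.binomial(1, pr)
    for t in range(T):
        y[:, :, t] = rng.binomial(1, z[:, t, None] * p)         # (2-26)
    return z, y

z, y = simulate_doss(N=50, J=4, T=5, psi1=0.6, theta=0.8, gamma=0.2, p=0.5)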

Royle & Kery induce the heterogeneity by site, site-season and site-survey-season, respectively, in the occupancy, in the survival and colonization, and in the detection probabilities through the following specification:

logit(ψi1) = x1 + ri,   ri ∼ N(0, σ²ψ),   logit^{−1}(x1) ∼ Unif(0, 1)
logit(θit) = at + ui,   ui ∼ N(0, σ²θ),   logit^{−1}(at) ∼ Unif(0, 1)
logit(γit) = bt + vi,   vi ∼ N(0, σ²γ),   logit^{−1}(bt) ∼ Unif(0, 1)
logit(pijt) = ct + wij,  wij ∼ N(0, σ²p),  logit^{−1}(ct) ∼ Unif(0, 1)   (2–27)

where x1, at, bt, ct are the season fixed effects for the corresponding probabilities, and where (ri, ui, vi) and wij are the site and site-survey random effects, respectively. Additionally, all variance components assume the usual inverse gamma priors.

As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model

We assume that the probabilities for occupancy survival colonization and detection

are all functions of linear combinations of covariates However our setup varies

slightly from that considered by Royle amp Kery (2007) In essence we modify the way in

which the estimates for survival and colonization probabilities are attained Our model

incorporates the notion that occupancy at a site occupied during the previous season

takes place through persistence where we define persistence as a function of both

survival and colonization That is a site occupied at time t may again be occupied

at time t + 1 if the current settlers survive if they perish and new settlers colonize

simultaneously or if both current settlers survive and new ones colonize

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as δ, is linked to a linear combination of covariates x′ξ through the relationship defined by δ = F(x′ξ), where F(·) represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State Space model (DYMOSS).

As before let yijt be the indicator variable used to mark detection of the target

species on the j th survey at the i th site during the tth season and let zit be the indicator

variable that denotes presence (zit = 1) or absence (zit = 0) of the target species at the

i th site tth season with i isin 1 2 N j isin 1 2 J and t isin 1 2 T

Additionally, assume that the probabilities for occupancy at time t = 1, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors α, Δ^(s) = {δ^(s)_{t−1}}_{t=2}^{T}, B^(c) = {β^(c)_{t−1}}_{t=2}^{T}, and Λ = {λt}_{t=1}^{T}, and covariate matrices X^(o), X = {X_{t−1}}_{t=2}^{T}, and Q = {Qt}_{t=1}^{T}, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:

zi1 | α ∼ Bernoulli(ψi1), where ψi1 = F(x′_(o)i α)

zit | zi(t−1), δ^(s)_{t−1}, β^(c)_{t−1} ∼ Bernoulli( zi(t−1) θi(t−1) + (1 − zi(t−1)) γi(t−1) ),
where θi(t−1) = F(δ^(s)_{t−1} + x′_{i(t−1)} β^(c)_{t−1}) and γi(t−1) = F(x′_{i(t−1)} β^(c)_{t−1})   (2–28)

Observed model:

yijt | zit, λt ∼ Bernoulli(zit pijt), where pijt = F(q′ijt λt)   (2–29)

In the hierarchical setup given by Equations 2–28 and 2–29, θi(t−1) corresponds to the probability of persistence from time t − 1 to time t at site i, and γi(t−1) denotes the colonization probability. Note that θi(t−1) − γi(t−1) yields the survival probability from t − 1 to t. The effect of survival is introduced by changing the intercept of the linear predictor by a quantity δ^(s)_{t−1}. Although in this version of the model this effect is accomplished by just modifying the intercept, it can be extended to have covariates determining δ^(s)_{t−1} as well. The graphical representation of the model for a single site is given in Figure 2-3.

[Figure 2-3. Graphical representation of the multiseason model for a single site; nodes: α, zit, yit, λt, δ^(s)_{t−1}, β^(c)_{t−1}.]

The joint posterior for the model defined by this hierarchical setting is

[z, α, B^(c), Δ^(s), Λ | y] = Cy ∏_{i=1}^{N} { ψi1 ∏_{j=1}^{J} pij1^{yij1} (1 − pij1)^{(1−yij1)} }^{zi1} { (1 − ψi1) ∏_{j=1}^{J} I_{yij1=0} }^{1−zi1} [λ1][α]
× ∏_{t=2}^{T} ∏_{i=1}^{N} [ zi(t−1) θ_{i(t−1)}^{zit} (1 − θ_{i(t−1)})^{1−zit} + (1 − zi(t−1)) γ_{i(t−1)}^{zit} (1 − γ_{i(t−1)})^{1−zit} ] { ∏_{j=1}^{J} pijt^{yijt} (1 − pijt)^{1−yijt} }^{zit}
× { ∏_{j=1}^{J} I_{yijt=0} }^{1−zit} [λt][β^(c)_{t−1}][δ^(s)_{t−1}],   (2–30)

which as in the single season case is intractable Once again a Gibbs sampler cannot

be constructed directly to sample from this joint posterior The graphical representation

of the model for one site incorporating the latent variables is provided in Figure 2-4

[Figure 2-4. Graphical representation of the data-augmented multiseason model; nodes: α, ui, zit, yit, wit, λt, vi,t−1, δ^(s)_{t−1}, β^(c)_{t−1}.]

Probit link normal-mixture DYMOSS model

We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combination of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first season occupancy, for persistence and colonization, and for detection, are

• ui ∼ N(x′_(o)i α, 1),
• vi(t−1) ∼ zi(t−1) N(δ^(s)_{t−1} + x′_{i(t−1)} β^(c)_{t−1}, 1) + (1 − zi(t−1)) N(x′_{i(t−1)} β^(c)_{t−1}, 1), and
• wijt ∼ N(q′ijt λt, 1).

Introducing these latent variables into the hierarchical formulation yields:

State model:

ui | α ∼ N(x′_(o)i α, 1)
zi1 | ui ∼ Bernoulli(I_{ui>0})

for t > 1:
vi(t−1) | zi(t−1), β^(c)_{t−1}, δ^(s)_{t−1} ∼ zi(t−1) N(δ^(s)_{t−1} + x′_{i(t−1)} β^(c)_{t−1}, 1) + (1 − zi(t−1)) N(x′_{i(t−1)} β^(c)_{t−1}, 1)
zit | vi(t−1) ∼ Bernoulli(I_{vi(t−1)>0})   (2–31)

Observed model:

wijt | λt ∼ N(q′ijt λt, 1)
yijt | zit, wijt ∼ Bernoulli(zit I_{wijt>0})   (2–32)

Note that the result presented in Section 2.2 corresponds to the particular case for T = 1 of the model specified by Equations 2–31 and 2–32.

As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let ϕ(x | µ, σ²) denote the pdf of a normally distributed random variable x with mean µ and standard deviation σ. Also let

1. Wt = (w1t, w2t, ..., wNt), with wit = (wi1t, wi2t, ..., wiJitt), for i = 1, 2, ..., N and t = 1, 2, ..., T;
2. u = (u1, u2, ..., uN);
3. V = (v1, ..., vT−1), with vt = (v1t, v2t, ..., vNt).

For the probit link model, the joint posterior distribution is

π(Z, u, V, {Wt}_{t=1}^{T}, α, B^(c), Δ^(s), Λ) ∝ [α] ∏_{i=1}^{N} ϕ(ui | x′_(o)i α, 1) I^{zi1}_{ui>0} I^{1−zi1}_{ui≤0}
× ∏_{t=2}^{T} [β^(c)_{t−1}, δ^(s)_{t−1}] ∏_{i=1}^{N} ϕ(vi(t−1) | µ^(v)_{i(t−1)}, 1) I^{zit}_{vi(t−1)>0} I^{1−zit}_{vi(t−1)≤0}
× ∏_{t=1}^{T} [λt] ∏_{i=1}^{N} ∏_{j=1}^{Jit} ϕ(wijt | q′ijt λt, 1) (zit I_{wijt>0})^{yijt} (1 − zit I_{wijt>0})^{(1−yijt)},

where µ^(v)_{i(t−1)} = zi(t−1) δ^(s)_{t−1} + x′_{i(t−1)} β^(c)_{t−1}.   (2–33)

The Gibbs sampler is initialized at α^(0), B^(c)(0), Δ^(s)(0) and Λ^(0). The sampler proceeds iteratively by block sampling sequentially for each primary sampling period as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting [· | ·] denote the full conditional probability density function of the component conditional on all other unknown parameters and the observed data, for m = 1, ..., n_sim the sampling procedure can be summarized as

[z1^(m) | ·] → [u^(m) | ·] → [α^(m) | ·] → [W1^(m) | ·] → [λ1^(m) | ·] → [z2^(m) | ·] → [V1^(m) | ·] → [β^(c)(m)_1, δ^(s)(m)_1 | ·] → [W2^(m) | ·] → [λ2^(m) | ·] → ⋯
⋯ → [zT^(m) | ·] → [V_{T−1}^(m) | ·] → [β^(c)(m)_{T−1}, δ^(s)(m)_{T−1} | ·] → [WT^(m) | ·] → [λT^(m) | ·]

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail within Appendix A.
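Although the exact full conditionals are deferred to Appendix A, the truncated-normal structure of the state-level latent draws is already visible in (2–33): ui is truncated on the side given by zi1, and vi(t−1) on the side given by zit. The following sketch illustrates one such sweep; the function and argument names, array shapes and flat-prior assumption are ours, not the dissertation's implementation.

import numpy as np
from scipy.stats import truncnorm

def draw_state_latents(z, X_o, X, alpha, beta, delta, rng):
    """One sweep of the state-level latent draws suggested by (2-33).
    z: (N, T) occupancy states; X_o: (N, p) first-season design;
    X: (N, T-1, p_c) persistence/colonization designs; beta: (T-1, p_c);
    delta: (T-1,) persistence intercept shifts.  Sketch only."""
    N, T = z.shape

    def tnorm(mu, positive):
        lo = np.where(positive, 0.0, -np.inf)
        hi = np.where(positive, np.inf, 0.0)
        return truncnorm.rvs(lo - mu, hi - mu, loc=mu, scale=1.0, random_state=rng)

    u = tnorm(X_o @ alpha, z[:, 0] == 1)
    v = np.empty((N, T - 1))
    for t in range(1, T):
        mu_v = z[:, t - 1] * delta[t - 1] + X[:, t - 1, :] @ beta[t - 1]
        v[:, t - 1] = tnorm(mu_v, z[:, t] == 1)
    return u, v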


Logit link Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by:

State model:

ui | α ∼ PG(1, x′_(o)i α)
zi1 | ui ∼ Bernoulli(I_{ui>0})

for t > 1:
vi(t−1) | · ∼ PG(1, | zi(t−1) δ^(s)_{t−1} + x′_{i(t−1)} β^(c)_{t−1} |)
zit | vi(t−1) ∼ Bernoulli(I_{vi(t−1)>0})   (2–34)

Observed model:

wijt | λt ∼ PG(1, q′ijt λt)
yijt | zit, wijt ∼ Bernoulli(zit I_{wijt>0})   (2–35)

The logit link version of the joint posterior is given by

π(Z, u, V, {Wt}_{t=1}^{T}, α, B^(c), Δ^(s), Λ) ∝ ∏_{i=1}^{N} [ (e^{x′_(o)i α})^{zi1} / (1 + e^{x′_(o)i α}) ] PG(ui; 1, |x′_(o)i α|) [λ1][α]
× ∏_{j=1}^{Ji1} ( zi1 e^{q′ij1 λ1} / (1 + e^{q′ij1 λ1}) )^{yij1} ( 1 − zi1 e^{q′ij1 λ1} / (1 + e^{q′ij1 λ1}) )^{1−yij1} PG(wij1; 1, |zi1 q′ij1 λ1|)
× ∏_{t=2}^{T} [δ^(s)_{t−1}][β^(c)_{t−1}][λt] ∏_{i=1}^{N} [ (exp[µ^(v)_{i(t−1)}])^{zit} / (1 + exp[µ^(v)_{i(t−1)}]) ] PG(vit; 1, |µ^(v)_{i(t−1)}|)
× ∏_{j=1}^{Jit} ( zit e^{q′ijt λt} / (1 + e^{q′ijt λt}) )^{yijt} ( 1 − zit e^{q′ijt λt} / (1 + e^{q′ijt λt}) )^{1−yijt} PG(wijt; 1, |zit q′ijt λt|),   (2–36)

with µ^(v)_{i(t−1)} = zi(t−1) δ^(s)_{t−1} + x′_{i(t−1)} β^(c)_{t−1}.

The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2–36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence

In this section we describe how the additional layer of complexity space can also

be accounted for by continuing to use the same data-augmentation framework The

method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single season closed

population occupancy model

The traditional approach consists of using spatial random effects to induce a

correlation structure among adjacent sites This formulation introduced by Besag et al

(1991) assumes that the spatial random effect corresponds to a Gaussian Markov

Random Field (GMRF) The model known as the Spatial GLMM (SGLMM) is used to

analyze areal data It has been applied extensively given the flexibility of its hierarchical

formulation and the availability of software for its implementation (Hughes amp Haran

2013)

Succinctly, the spatial dependence is accounted for in the model by adding a random vector η assumed to have a conditionally-autoregressive (CAR) prior (also known as the Gaussian Markov random field prior). To define the prior, let the pair G = (V, E) represent the undirected graph for the entire spatial region studied, where V = (1, 2, ..., N) denotes the vertices of the graph (sites) and E the set of edges between sites; E is constituted by elements of the form (i, j), indicating that sites i and j are spatially adjacent, for some i, j ∈ V. The prior for the spatial effects is then characterized by

[η | τ] ∝ τ^{rank(Ω)/2} exp[ −(τ/2) η′Ωη ],   (2–37)
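The precision matrix of this prior is built directly from the adjacency structure of the study region. A small sketch (the toy adjacency matrix and the symbol Omega, standing for the precision matrix above, are our own illustrative choices):

import numpy as np

# GMRF precision Omega = diag(A 1) - A, on a small 4-site chain graph
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Omega = np.diag(A.sum(axis=1)) - A
# Omega is singular (rank N - 1 on a connected graph), which is why the
# CAR prior in (2-37) is improper.
print(np.linalg.matrix_rank(Omega))   # -> 3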

where Ω = (diag(A1) − A) is the precision matrix, with A denoting the adjacency matrix. The entries of the adjacency matrix A are such that diag(A) = 0 and Aij = I_{(i,j)∈E}.

The matrix Ω is singular; hence the probability density defined in equation 2–37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper the posterior for the model parameters is proper. If a constraint such as Σk ηk = 0 is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.

Assuming that all but the detection process are subject to spatial correlations and

using the notation we have developed up to this point the spatially explicit version of the

DYMOSS model is characterized by the hierarchy represented by equations 2ndash38 and

2ndash39

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding the steps to sample η^(o) and {ηt}_{t=2}^{T}, conditional

on all other parameters Furthermore the corresponding parameters and spatial

random effects of a given component (ie occupancy survival and colonization)

can be effortlessly pooled together into a single parameter vector to perform block

sampling For each of the latent variables the only modification required is to sum the

corresponding spatial effect to the linear predictor so that these retain their conditional

independence given the linear combination of fixed effects and the spatial effects

State model:

zi1 | α ∼ Bernoulli(ψi1), where ψi1 = F(x′_(o)i α + η^(o)_i)
[η^(o) | τ] ∝ τ^{rank(Ω)/2} exp[ −(τ/2) η^(o)′ Ω η^(o) ]

zit | zi(t−1), α, β_{t−1}, λ_{t−1} ∼ Bernoulli( zi(t−1) θi(t−1) + (1 − zi(t−1)) γi(t−1) ),
where θi(t−1) = F(δ^(s)_{t−1} + x′_{i(t−1)} β^(c)_{t−1} + ηit) and γi(t−1) = F(x′_{i(t−1)} β^(c)_{t−1} + ηit)
[ηt | τ] ∝ τ^{rank(Ω)/2} exp[ −(τ/2) η′t Ω ηt ]   (2–38)

Observed model:

yijt | zit, λt ∼ Bernoulli(zit pijt), where pijt = F(q′ijt λt)   (2–39)

In spite of the popularity of this approach to incorporating spatial dependence three

shortcomings have been reported in the literature (Hughes amp Haran 2013 Reich et al

2006) (1) model parameters have no clear interpretation due to spatial confounding

of the predictors with the spatial effect (2) there is variance inflation due to spatial

confounding and (3) the high dimensionality of the latent spatial variables leads to

high computational costs To avoid such difficulties we follow the approach used by

Hughes amp Haran (2013) which builds upon the earlier work by Reich et al (2006) This

methodology is summarized in what follows

Let a vector of spatial effects η have the CAR model given by 2–37 above. Now consider a random vector ζ ∼ MVN(0, τK′ΩK), with Ω defined as above, and where τK′ΩK corresponds to the precision of the distribution and not the covariance matrix, with the matrix K satisfying K′K = I.

This last condition implies that the linear predictor Xβ + η = Xβ + Kζ. With respect to how the matrix K is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as P⊥AP⊥, with P⊥ = I − X(X′X)^{−1}X′, and where A is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix X. These elements are incorporated into its spectral decomposition. That is, its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to X, while its eigenvectors provide the patterns of spatial dependence residual to X. Thus, the matrix K is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.

Using this strategy, the new hierarchical formulation of our model is simply modified by letting η^(o) = K^(o)ζ^(o) and ηt = Ktζt, with

1. ζ^(o) ∼ MVN(0, τ^(o) K^(o)′ Ω K^(o)), where K^(o) is the eigenvector matrix for P^(o)⊥ A P^(o)⊥, and
2. ζt ∼ MVN(0, τt K′t Ω Kt), where Kt is the eigenvector matrix for P⊥t A P⊥t, for t = 2, 3, ..., T.
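A basis matrix K of this kind can be computed in a few lines. The sketch below keeps the leading eigenvectors of the Moran operator (Hughes & Haran 2013 recommend retaining those associated with large positive eigenvalues); the function name and the choice of how many eigenvectors to keep are our own assumptions.

import numpy as np

def moran_basis(X, A, n_eig=2):
    """Columns of K: eigenvectors of the Moran operator P_perp A P_perp,
    keeping the n_eig leading eigenvalues.  X is the fixed-effects design
    and A the adjacency matrix.  Sketch only."""
    n = X.shape[0]
    P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # I - X(X'X)^{-1}X'
    M = P_perp @ A @ P_perp
    vals, vecs = np.linalg.eigh(M)            # symmetric operator, so eigh applies
    order = np.argsort(vals)[::-1]            # largest Moran's I values first
    return vecs[:, order[:n_eig]]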

The algorithms for the probit and logit link from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for (α, ζ^(o)) and (β^(c)_{t−1}, δ^(s)_{t−1}, ζt), making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary

With a few exceptions (Dorazio amp Taylor-Rodrıguez 2012 Johnson et al 2013

Royle amp Kery 2007) recent Bayesian approaches to site-occupancy modeling with

covariates have relied on model configurations (eg as multivariate normal priors of

parameters in logit scale) that lead to unfamiliar conditional posterior distributions thus

precluding the use of a direct sampling approach Therefore the sampling strategies

available are based on algorithms (eg Metropolis Hastings) that require tuning and the

knowledge to do so correctly

In Dorazio amp Taylor-Rodrıguez (2012) we proposed a Bayesian specification for

which a Gibbs sampler of the basic occupancy model is available and allowed detection

and occupancy probabilities to depend on linear combinations of predictors This

method described in section 221 is based on the data augmentation algorithm of

Albert amp Chib (1993) There the full conditional posteriors of the parameters of the probit

regression model are cast as latent mixtures of normal random variables The probit and

the logit link yield similar results with large sample sizes however their results may be

different when small to moderate sample sizes are considered because the logit link

function places more mass in the tails of the distribution than the probit link does In


section 222 we adapt the method for the single season model to work with the logit link

function

The basic occupancy framework is useful but it assumes a single closed population

with fixed probabilities through time Hence its assumptions may not be appropriate to

address problems where the interest lies in the temporal dynamics of the population

Hence we developed a dynamic model that incorporates the notion that occupancy

at a site previously occupied takes place through persistence which depends both on

survival and habitat suitability By this we mean that a site occupied at time t may again

be occupied at time t + 1 if (1) the current settlers survive (2) the existing settlers

perish but new settlers simultaneously colonize or (3) current settlers survive and new

ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by x′_{i(t−1)} β^(c)_{t−1}. They only differ in that persistence is also influenced by whether the site being occupied during season t − 1 enhances the suitability of the site or harms it through density dependence.

Additionally the study of the dynamics that govern distribution and abundance of

biological populations requires an understanding of the physical and biotic processes

that act upon them and these vary in time and space Consequently as a final step in

this Chapter we described a straightforward strategy to add spatial dependence among

neighboring sites in the dynamic metapopulation model This extension is based on the

popular Bayesian spatial modeling technique of Besag et al (1991) updated using the

methods described in (Hughes amp Haran 2013)

Future steps along these lines are (1) to develop the software necessary to implement the tools described throughout the Chapter, and (2) to build a suite of additional extensions for occupancy models using this framework. The first of

them will be used to incorporate information from different sources such as tracks

scats surveys and direct observations into a single model This can be accomplished


by adding a layer to the hierarchy where the source and spatial scale of the data is

accounted for The second extension is a single season spatially explicit multiple

species co-occupancy model This model will allow studying complex interactions

and testing hypotheses about species interactions at a given point in time Lastly this

co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of

the DYMOSS model

CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors and the one which remains must be the truth.
–Sherlock Holmes
The Sign of Four

3.1 Introduction

Occupancy models are often used to understand the mechanisms that dictate

the distribution of a species Therefore variable selection plays a fundamental role in

achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs 2014; Link & Barker 2009), AIC is the method used to choose from

competing site-occupancy models In addition the procedures currently implemented

and accessible to ecologists require enumerating and estimating all the candidate

models (Fiske amp Chandler 2011 Mazerolle amp Mazerolle 2013) In practice this

can be achieved if the model space considered is small enough which is possible

if the choice of the model space is guided by substantial prior knowledge about the

underlying ecological processes Nevertheless many site-occupancy surveys collect

large amounts of covariate information about the sampled sites Given that the total

number of candidate models grows exponentially fast with the number of predictors

considered choosing a reduced set of models guided by ecological intuition becomes

increasingly difficult This is even more so the case in the occupancy model context

where the model space is the cartesian product of models for presence and models for

detection Given the issues mentioned above we propose the first objective Bayesian

variable selection method for the single-season occupancy model framework This

approach explores in a principled manner the entire model space It is completely


automatic precluding the need for both tuning parameters in the sampling algorithm and

subjective elicitation of parameter prior distributions

As mentioned above in ecological modeling if model selection or less frequently

model averaging is considered the Akaike Information Criterion (AIC) (Akaike 1983)

or a version of it is the measure of choice for comparing candidate models (Fiske amp

Chandler 2011 Mazerolle amp Mazerolle 2013) The AIC is designed to find the model

that has on average the density closest in Kullback-Leibler distance to the density

of the true data generating mechanism The model with the smallest AIC is selected

However if nested models are considered one of them being the true one generally the

AIC will not select it (Wasserman 2000) Commonly the model selected by AIC will be

more complex than the true one The reason for this is that the AIC has a weak signal to

noise ratio and as such it tends to overfit (Rao amp Wu 2001) Other versions of the AIC

provide a bias correction that enhances the signal to noise ratio leading to a stronger

penalization for model complexity Some examples are the AICc (Hurvich amp Tsai 1989)

and AICu (McQuarrie et al 1997) however these are also not consistent for selection

albeit asymptotically efficient (Rao amp Wu 2001)

If we are interested in prediction as opposed to testing the AIC is certainly

appropriate However when conducting inference the use of Bayesian model averaging

and selection methods is more fitting If the true data generating mechanism is among

those considered asymptotically Bayesian methods choose the true model with

probability one Conversely if the true model is not among the alternatives and a

suitable parameter prior is used the posterior probability of the most parsimonious

model closest to the true one tends asymptotically to one

In spite of this in general for Bayesian testing direct elicitation of prior probabilistic

statements is often impeded because the problems studied may not be sufficiently

well understood to make an informed decision about the priors Conversely there may

be a prohibitively large number of parameters making specifying priors for each of


these parameters an arduous task In addition to this seemingly innocuous subjective

choices for the priors on the parameter space may drastically affect test outcomes

This has been a recurring argument in favor of objective Bayesian procedures

which appeal to the use of formal rules to build parameter priors that incorporate the

structural information inside the likelihood while utilizing some objective criterion (Kass amp

Wasserman 1996)

One popular choice of ldquoobjectiverdquo prior is the reference prior (Berger amp Bernardo

1992) which is the prior that maximizes the amount of signal extracted from the

data These priors have proven to be effective as they are fully automatic and can

be frequentist matching in the sense that the posterior credible interval agrees with the

frequentist confidence interval from repeated sampling with equal coverage-probability

(Kass amp Wasserman 1996) Reference priors however are improper and while

they yield reasonable posterior parameter probabilities the derived model posterior

probabilities may be ill defined To avoid this shortcoming Berger amp Pericchi (1996)

introduced the intrinsic Bayes factor (IBF) for model comparison Moreno et al (1998)

building on the IBF of Berger amp Pericchi (1996) developed a limiting procedure to

generate a system of priors that yield well-defined posteriors even though these

priors may sometimes be improper The IBF is built using a data-dependent prior to

automatically generate Bayes factors however the extension introduced by Moreno

et al (1998) generates the intrinsic prior by taking a theoretical average over the space

of training samples freeing the prior from data dependence

In our view in the face of a large number of predictors the best alternative is to run

a stochastic search algorithm using good ldquoobjectiverdquo testing parameter priors and to

incorporate suitable model priors This being said the discussion about model priors is

deferred until Chapter 4 this Chapter focuses on the priors on the parameter space

The Chapter is structured as follows First issues surrounding multimodel inference

are described and insight about objective Bayesian inferential procedures is provided


Then building on modern methods for ldquoobjectiverdquo Bayesian testing to generate priors

on the parameter space the intrinsic priors for the parameters of the occupancy model

are derived These are used in the construction of an algorithm for ldquoobjectiverdquo model

selection tailored to the occupancy model framework To assess the performance of our

methods we provide results from a simulation study in which distinct scenarios both

favorable and unfavorable are used to determine the robustness of these tools and

analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodriguez 2012; Kery et al. 2010).

3.2 Objective Bayesian Inference

As mentioned before in practice noninformative priors arising from structural

rules are an alternative to subjective elicitation of priors Some of the rules used in

defining noninformative priors include the principle of insufficient reason parametrization

invariance maximum entropy geometric arguments coverage matching and decision

theoretic approaches (see Kass amp Wasserman (1996) for a discussion)

These rules reflect one of two attitudes (1) noninformative priors either aim to

convey unique representations of ignorance or (2) they attempt to produce probability

statements that may be accepted by convention This latter attitude is in the same

spirit as how weights and distances are defined (Kass amp Wasserman 1996) and

characterizes the way in which Bayesian reference methods are interpreted today ie

noninformative priors are seen to be chosen by convention according to the situation

A word of caution must be given when using noninformative priors Difficulties arise

in their implementation that should not be taken lightly In particular these difficulties

may occur because noninformative priors are generally improper (meaning that they do

not integrate or sum to a finite number) and as such are said to depend on arbitrary

constants

Bayes factors strongly depend upon the prior distributions for the parameters

included in each of the models being compared This can be an important limitation


considering that when using noninformative priors their introduction will result in the

Bayes factors being a function of the ratio of arbitrary constants given that these priors

are typically improper (see Jeffreys 1961 Pericchi 2005 and references therein)

Many different approaches have been developed to deal with the arbitrary constants

when using improper priors since then These include the use of partial Bayes factors

(Berger & Pericchi 1996; Good 1950; Lempers 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith 1982), and approximations to the Bayes factor (see Haughton 1988, as cited in Berger & Pericchi 1996; Kass & Raftery 1995; Tierney & Kadane 1986).

3.2.1 The Intrinsic Methodology

Berger amp Pericchi (1996) cleverly dealt with the arbitrary constants that arise when

using improper priors by introducing the intrinsic Bayes factor (IBF) procedure This

solution based on partial Bayes factors provides the means to replace the improper

priors by proper ldquoposteriorrdquo priors The IBF is obtained from combining the model

structure with information contained in the observed data Furthermore they showed

that as the sample size tends to infinity the Intrinsic Bayes factor corresponds to the

proper Bayes factor arising from the intrinsic priors

Intrinsic priors however are not unique The asymptotic correspondence between

the IBF and the Bayes factor arising from the intrinsic prior yields two functional

equations that are solved by a whole class of intrinsic priors Because all the priors

in the class produce Bayes factors that are asymptotically equivalent to the IBF for

finite sample sizes the resulting Bayes factor is not unique To address this issue

Moreno et al (1998) formalized the methodology through the ldquolimiting procedurerdquo

This procedure allows one to obtain a unique Bayes factor consolidating the method

as a valid objective Bayesian model selection procedure which we will refer to as the

Bayes factor for intrinsic priors (BFIP) This result is particularly valid for nested models

although the methodology may be extended with some caution to nonnested models


As mentioned before the Bayesian hypothesis testing procedure is highly sensitive

to parameter-prior specification and not all priors that are useful for estimation are

recommended for hypothesis testing or model selection Evidence of this is provided

by the Jeffreys-Lindley paradox which states that a point null hypothesis will always

be accepted when the variance of a conjugate prior goes to infinity (Robert 1993)

Additionally when comparing nested models the null model should correspond to

a substantial reduction in complexity from that of larger alternative models Hence

priors for the larger alternative models that place probability mass away from the null

model are wasteful If the true model is ldquofarrdquo from the null it will be easily detected by

any statistical procedure Therefore the prior on the alternative models should ldquowork

harderrdquo at selecting competitive models that are ldquocloserdquo to the null This principle known

as the Savage continuity condition (Gunel amp Dickey 1974) is widely recognized by

statisticians

Interestingly the intrinsic prior in correspondence with the BFIP automatically

satisfies the Savage continuity condition That is when comparing nested models the

intrinsic prior for the more complex model is centered around the null model and in spite

of being a limiting procedure it is not subject to the Jeffreys-Lindley paradox

Moreover beyond the usual pairwise consistency of the Bayes factor for nested

models Casella et al (2009) show that the corresponding Bayesian procedure with

intrinsic priors for variable selection in normal regression is consistent in the entire

class of normal linear models adding an important feature to the list of virtues of the

procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors

As previously mentioned, in the Bayesian paradigm a model M ∈ M is defined by a sampling density and a prior distribution. The sampling density associated with model M is denoted by f(y | βM, σ²M, M), where (βM, σ²M) is the vector of model-specific unknown parameters. The prior for model M and its corresponding set of parameters is

π(βM, σ²M, M | M) = π(βM, σ²M | M, M) · π(M | M).

Objective local priors for the model parameters (βM, σ²M) are achieved through modifications and extensions of Zellner's g-prior (Liang et al. 2008; Womack et al. 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion on priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors

An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi 1996; Moreno et al. 1998). Because MB ⊆ M for all M ∈ M, the intrinsic prior for (βM, σ²M) is defined as an expected posterior prior

πI(βM, σ²M | M) = ∫ pR(βM, σ²M | ~y, M) mR(~y | MB) d~y,   (3–1)

where ~y is a minimal training sample for model M, I denotes the intrinsic distributions, and R denotes distributions derived from the reference prior πR(βM, σ²M | M) = cM dβM dσ²M / σ²M. In (3–1), mR(~y | M) = ∫∫ f(~y | βM, σ²M, M) πR(βM, σ²M | M) dβM dσ²M is the reference marginal of ~y under model M, and pR(βM, σ²M | ~y, M) = f(~y | βM, σ²M, M) πR(βM, σ²M | M) / mR(~y | M) is the reference posterior density.

In the regression framework the reference marginal mR is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model M to the base model MB is well-defined and given by

BF^I_{M,MB}(y) = (1 − R²M)^{−(n−|MB|)/2} × ∫₀¹ [ (n + sin²(πθ/2)·(|M| + 1)) / (n + sin²(πθ/2)·(|M| + 1)/(1 − R²M)) ]^{(n−|M|)/2} [ (sin²(πθ/2)·(|M| + 1)) / (n + sin²(πθ/2)·(|M| + 1)/(1 − R²M)) ]^{(|M|−|MB|)/2} dθ,   (3–2)

where R²M is the coefficient of determination of model M versus model MB. The Bayes factor between two models M and M′ is defined as BF^I_{M,M′}(y) = BF^I_{M,MB}(y) / BF^I_{M′,MB}(y).

The "goodness" of the model M based on the intrinsic priors is given by its posterior probability

pI(M | y, M) = BF^I_{M,MB}(y) π(M | M) / ∑_{M′∈M} BF^I_{M′,MB}(y) π(M′ | M).   (3–3)
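Both (3–2) and (3–3) are easy to evaluate numerically once R²M, n, and the model dimensions are in hand. The sketch below is a toy illustration of that computation (the example models, their R² values and dimensions, and the uniform model prior are our own assumptions, not results from this dissertation).

import numpy as np
from scipy.integrate import quad

def intrinsic_bf(R2, n, k, k0):
    """Numerically evaluate (3-2) for a model with k parameters and
    coefficient of determination R2 against a base model with k0 parameters."""
    def integrand(theta):
        s = np.sin(np.pi * theta / 2.0) ** 2 * (k + 1)
        denom = n + s / (1.0 - R2)
        return ((n + s) / denom) ** ((n - k) / 2.0) * (s / denom) ** ((k - k0) / 2.0)
    val, _ = quad(integrand, 0.0, 1.0)
    return (1.0 - R2) ** (-(n - k0) / 2.0) * val

# toy model set (R2, dimension); base model included; uniform model prior
models = {"base": (0.0, 2), "M1": (0.30, 3), "M2": (0.32, 5)}
n = 100
bf = {m: intrinsic_bf(r2, n, k, k0=2) for m, (r2, k) in models.items()}
post = {m: b / sum(bf.values()) for m, b in bf.items()}   # (3-3)
print(post)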

It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al. 2009; Giron et al. 2010). In the context of well-formulated models, the true model MT is the smallest well-formulated model M ∈ M such that α ∈ M if βα ≠ 0. If MT is the true model, then the posterior probability of model MT based on equation (3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors

Scaled mixtures of g-priors place a reference prior on (βMB, σ²) and a multivariate normal distribution on β in M \ MB, that is, normal with mean 0 and precision matrix

(qM w / (n σ²)) Z′M (I − H0) ZM,

where H0 is the hat matrix associated with ZMB. The prior is completed by a prior on w and a choice of scaling qM, which is set at |M| + 1 to account for the minimal sample size of M. Under these assumptions, the Bayes factor for M to MB is given by

BF_{M,MB}(y) = (1 − R²M)^{−(n−|MB|)/2} ∫ [ (n + w(|M| + 1)) / (n + w(|M| + 1)/(1 − R²M)) ]^{(n−|M|)/2} [ (w(|M| + 1)) / (n + w(|M| + 1)/(1 − R²M)) ]^{(|M|−|MB|)/2} π(w) dw.

We consider the following priors on w. The intrinsic prior is π(w) = Beta(w; 1/2, 1/2), which is only defined for w ∈ (0, 1). A version of the Zellner-Siow prior is given by w ∼ Gamma(1/2, 1/2), which produces a multivariate Cauchy distribution on β. A family of hyper-g priors is defined by π(w) ∝ w^{−1/2}(β + w)^{−(α+1)/2}, which have Cauchy-like tails but produce more shrinkage than the Cauchy prior.
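The Bayes factor above only changes through the mixing density π(w) and its support, so a single numerical routine covers all three choices. A brief sketch (function name, toy inputs and the prefactor exponent following (3–2) are our own assumptions):

import numpy as np
from scipy.integrate import quad
from scipy.stats import beta as beta_dist

def gprior_mixture_bf(R2, n, k, k0, pi_w, upper):
    """Bayes factor of M against M_B under a scaled mixture of g-priors,
    integrating over the mixing density pi_w with support (0, upper)."""
    q = k + 1                                   # scaling q_M = |M| + 1
    def integrand(w):
        denom = n + w * q / (1.0 - R2)
        core = ((n + w * q) / denom) ** ((n - k) / 2.0) * (w * q / denom) ** ((k - k0) / 2.0)
        return core * pi_w(w)
    val, _ = quad(integrand, 0.0, upper)
    return (1.0 - R2) ** (-(n - k0) / 2.0) * val

# intrinsic case: w ~ Beta(1/2, 1/2) on (0, 1)
bf_intrinsic = gprior_mixture_bf(0.3, 100, 4, 2, beta_dist(0.5, 0.5).pdf, upper=1.0)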

3.3 Objective Bayes Occupancy Model Selection

As mentioned before, Bayesian inferential approaches used for ecological models are lacking. In particular, there exists a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore thoroughly the model space considered. With this goal in mind, in this section we develop an objective, intrinsic and fully automatic Bayesian model selection methodology for single season site-occupancy models. We refer to this method as automatic and objective given that in its implementation no hyperparameter tuning is required and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).

An inferential method for the occupancy problem is possible using the intrinsic approach given that we are able to link intrinsic-Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single season probit occupancy model through the hierarchy

yij | zi, wij ∼ Bernoulli(zi I_{wij>0})
wij | λ ∼ N(q′ijλ, 1)
zi | vi ∼ Bernoulli(I_{vi>0})
vi | α ∼ N(x′iα, 1),

it is possible to solve the selection problem on the latent scale variables wij and vi, and to use those results at the level of the occupancy and detection processes.

In what follows first we provide some necessary notation Then a derivation of

the intrinsic priors for the parameters of the detection and occupancy components

is outlined Using these priors we obtain the general form of the model posterior

probabilities Finally the results are incorporated in a model selection algorithm for

site-occupancy data Although the priors on the model space are not discussed in this

Chapter the software and methods developed have different choices of model priors

built in

3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well. Namely, presence will be denoted by z, detection by y, their corresponding latent processes are v and w, and the model parameters are denoted by α and λ. However, some additional notation is also necessary. Let M0 = {M0y, M0z} denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models M0y and M0z include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including some predictors important to the researcher that should be included in every model.

Furthermore, let the sets [Kz] = {1, 2, ..., Kz} and [Ky] = {1, 2, ..., Ky} index the covariates considered for the variable selection procedure for the presence and detection processes, respectively. That is, these sets denote the covariates that can be added to the base models in M0, or removed from the largest possible models considered, MFz and MFy, which we will refer to as the "full" models. The model space can then be represented by the Cartesian product of subsets Ay ⊆ [Ky] and Az ⊆ [Kz]. The entire model space is populated by models of the form MA = {MAy, MAz} ∈ M = My × Mz, with MAy ∈ My and MAz ∈ Mz.
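The Cartesian-product structure of the model space is easy to enumerate for small problems, which is useful for checking the exponential growth mentioned in the introduction. A toy sketch (the covariate counts are arbitrary examples of ours):

from itertools import chain, combinations

def all_subsets(K):
    """All subsets A of {1, ..., K} (covariates that may enter a model)."""
    items = range(1, K + 1)
    return list(chain.from_iterable(combinations(items, r) for r in range(K + 1)))

# toy model space: Ky = 2 candidate detection covariates, Kz = 3 for presence;
# each model is a pair (A_y, A_z), so there are 2^Ky * 2^Kz = 32 models here
model_space = [(Ay, Az) for Ay in all_subsets(2) for Az in all_subsets(3)]
print(len(model_space))   # -> 32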

For the presence process z, the design matrix for model MAz is given by the block matrix XAz = (X0 | Xr,A); X0 corresponds to the design matrix of the base model – which is such that M0z ⊆ MAz ∈ Mz for all Az ⊆ [Kz] – and Xr,A corresponds to the submatrix that contains the covariates indexed by Az. Analogously, for the detection process y, the design matrix is given by QAy = (Q0 | Qr,A). Similarly, the coefficients for models MAz and MAy are given by αA = (α′0, α′r,A)′ and λA = (λ′0, λ′r,A)′.

With these elements in place, the model selection problem consists of finding subsets of covariates indexed by A = {Az, Ay} that have a high posterior probability given the detection and occupancy processes. This is equivalent to finding models with high posterior odds when compared to a suitable base model. These posterior odds are given by

p(MA | y, z) / p(M0 | y, z) = [ m(y, z | MA) π(MA) ] / [ m(y, z | M0) π(M0) ] = BF_{MA,M0}(y, z) π(MA) / π(M0).

Since we are able to represent the occupancy model as a truncation of latent

normal variables it is possible to work through the occupancy model selection problem

in the latent normal scale used for the presence and detection processes We formulate

two solutions to this problem one that depends on the observed and latent components

and another that solely depends on the latent level variables used to data-augment the

problem We will however focus on the latter approach as this yields a straightforward

MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.

At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for the simple probit regression. In the occupancy setting the argument is

p(MA | y, z, w, v) = m(y, z, v, w | MA) π(MA) / m(y, z, w, v)
= fyz(y, z | w, v) ( ∫ fvw(v, w | α, λ, MA) παλ(α, λ | MA) d(α, λ) ) π(MA) / [ fyz(y, z | w, v) ∑_{M*∈M} ( ∫ fvw(v, w | α, λ, M*) παλ(α, λ | M*) d(α, λ) ) π(M*) ]
= m(v | MAz) m(w | MAy) π(MA) / [ m(v) m(w) ]
∝ m(v | MAz) m(w | MAy) π(MA),   (3–4)

where

1. fyz(y, z | w, v) = ∏_{i=1}^{N} I^{zi}_{vi>0} I^{(1−zi)}_{vi≤0} ∏_{j=1}^{J} (zi I_{wij>0})^{yij} (1 − zi I_{wij>0})^{1−yij},

2. fvw(v, w | α, λ, MA) = ( ∏_{i=1}^{N} ϕ(vi; x′i αMAz, 1) ) ( ∏_{i=1}^{N} ∏_{j=1}^{Ji} ϕ(wij; q′ij λMAy, 1) ), where the first factor is f(v | αr,A, α0, MAz) and the second is f(w | λr,A, λ0, MAy), and

3. παλ(α, λ | MA) = πα(α | MAz) πλ(λ | MAy).

This result implies that once the occupancy and detection indicators are conditioned on the latent processes v and w, respectively, the model posterior probabilities only depend on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds

p(MA | y, z, w, v) / p(M0 | y, z, w, v) = [ m(w, v | MA) / m(w, v | M0) ] [ π(MA) / π(M0) ],   (3–5)

where m(w, v | MA) = m(w | MAy) · m(v | MAz), with

m(v | MAz) = ∫∫ f(v | αr,A, α0, MAz) π(αr,A | α0, MAz) π(α0) dαr,A dα0   (3–6)

m(w | MAy) = ∫∫ f(w | λr,A, λ0, MAy) π(λr,A | λ0, MAy) π(λ0) dλ0 dλr,A.   (3–7)

3.3.2 Intrinsic Priors for the Occupancy Problem

In general, the intrinsic priors as defined by Moreno et al. (1998) use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes we assume noninformative improper priors for the parameters, denoted by $\pi^N(\cdot|\cdot)$. Specifically, the intrinsic priors $\pi^{IP}(\theta_{M^*}|M^*)$ for a vector of parameters $\theta_{M^*}$ corresponding to model $M^* \in \{M_0, M\} \subset \mathcal{M}$, for a response vector $s$ with probability density (or mass) function $f(s|\theta_{M^*})$, are defined by

$\pi^{IP}(\theta_{M_0}|M_0) = \pi^N(\theta_{M_0}|M_0)$

$\pi^{IP}(\theta_{M}|M) = \pi^N(\theta_{M}|M) \int \dfrac{m(\tilde{s}|M_0)}{m(\tilde{s}|M)}\, f(\tilde{s}|\theta_M, M)\, d\tilde{s},$

where $\tilde{s}$ is a theoretical training sample.

In what follows, whenever it is clear from the context and in an attempt to simplify the notation, $M_A$ will be used to refer to $M_{A_z}$ or $M_{A_y}$, and $A$ will denote $A_z$ or $A_y$. To derive the parameter priors involved in equations 3-6 and 3-7 using the objective intrinsic prior strategy, we start by assuming flat priors $\pi^N(\alpha_A|M_A) \propto c_A$ and $\pi^N(\lambda_A|M_A) \propto d_A$, where $c_A$ and $d_A$ are unknown constants.

The intrinsic prior for the parameters associated with the occupancy process, $\alpha_A$, conditional on model $M_A$, is

$\pi^{IP}(\alpha_A|M_A) = \pi^N(\alpha_A|M_A) \int \dfrac{m(\tilde{v}|M_0)}{m(\tilde{v}|M_A)}\, f(\tilde{v}|\alpha_A, M_A)\, d\tilde{v},$

where the marginals $m(\tilde{v}|M_j)$, with $j \in \{A, 0\}$, are obtained by solving the analogue of equation 3-6 for the (theoretical) training sample $\tilde{v}$. These marginals are given by

$m(\tilde{v}|M_j) = c_j\,(2\pi)^{-\frac{p_j - p_0}{2}}\,|\tilde{X}_j'\tilde{X}_j|^{\frac{1}{2}}\,e^{-\frac{1}{2}\tilde{v}'(I - \tilde{H}_j)\tilde{v}}.$

The training sample $\tilde{v}$ has dimension $p_{A_z} = |M_{A_z}|$, that is, the total number of parameters in model $M_{A_z}$. Note that, without ambiguity, we use $|\cdot|$ to denote both the cardinality of a set and the determinant of a matrix. The design matrix $\tilde{X}_A$ corresponds to the training sample $\tilde{v}$ and is chosen such that $\tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A$ (Leon-Novelo et al. 2012), and $\tilde{H}_j$ is the corresponding hat matrix.

Replacing $m(\tilde{v}|M_A)$ and $m(\tilde{v}|M_0)$ in $\pi^{IP}(\alpha_A|M_A)$ and solving the integral with respect to the theoretical training sample $\tilde{v}$, we have

$\pi^{IP}(\alpha_A|M_A) = c_A \int \left( (2\pi)^{-\frac{p_{A_z}-p_{0_z}}{2}}\left(\frac{c_0}{c_A}\right) e^{-\frac{1}{2}\tilde{v}'\left((I-\tilde{H}_0)-(I-\tilde{H}_A)\right)\tilde{v}}\, \frac{|\tilde{X}_A'\tilde{X}_A|^{1/2}}{|\tilde{X}_0'\tilde{X}_0|^{1/2}} \right) \times \left( (2\pi)^{-\frac{p_{A_z}}{2}} e^{-\frac{1}{2}(\tilde{v} - \tilde{X}_A\alpha_A)'(\tilde{v} - \tilde{X}_A\alpha_A)} \right) d\tilde{v}$

$\quad = c_0\,(2\pi)^{-\frac{p_{A_z}-p_{0_z}}{2}}\,|\tilde{X}_{r,A}'\tilde{X}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_z}-p_{0_z}}{2}} \exp\left[-\frac{1}{2}\alpha_{r,A}'\left(\frac{1}{2}\tilde{X}_{r,A}'\tilde{X}_{r,A}\right)\alpha_{r,A}\right] = \pi^N(\alpha_0)\times N\!\left(\alpha_{r,A}\,\big|\,0,\; 2\,(\tilde{X}_{r,A}'\tilde{X}_{r,A})^{-1}\right).$   (3-8)


Analogously, the intrinsic prior for the parameters associated with the detection process is

$\pi^{IP}(\lambda_A|M_A) = d_0\,(2\pi)^{-\frac{p_{A_y}-p_{0_y}}{2}}\,|\tilde{Q}_{r,A}'\tilde{Q}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_y}-p_{0_y}}{2}} \exp\left[-\frac{1}{2}\lambda_{r,A}'\left(\frac{1}{2}\tilde{Q}_{r,A}'\tilde{Q}_{r,A}\right)\lambda_{r,A}\right] = \pi^N(\lambda_0)\times N\!\left(\lambda_{r,A}\,\big|\,0,\; 2\,(\tilde{Q}_{r,A}'\tilde{Q}_{r,A})^{-1}\right).$   (3-9)
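To make the construction concrete, the prior covariance matrices appearing in 3-8 and 3-9 can be assembled directly from the observed design matrices. The following sketch is illustrative only (the function name and the numpy-based interface are assumptions of this example, not part of the original derivation); it builds the covariance $2(\tilde{X}_{r,A}'\tilde{X}_{r,A})^{-1}$ using the training-sample scaling $\tilde{X}_A'\tilde{X}_A = (p_{A_z}/N)\,X_A'X_A$.

import numpy as np

def intrinsic_prior_cov(X_r, p_A, N):
    # Training-sample cross-product: X_r_tilde' X_r_tilde = (p_A / N) * X_r' X_r
    XtX_tilde = (p_A / N) * X_r.T @ X_r
    # Intrinsic-prior covariance from 3-8: 2 * (X_r_tilde' X_r_tilde)^{-1}
    return 2.0 * np.linalg.inv(XtX_tilde)

The same helper applied to the detection design block $Q_{r,A}$ (with $N$ replaced by the total number of surveys) gives the covariance in 3-9.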

In short, the intrinsic priors for $\alpha_A = (\alpha_0', \alpha_{r,A}')'$ and $\lambda_A = (\lambda_0', \lambda_{r,A}')'$ are the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by $A_z$ and $A_y$, respectively.

3.3.3 Model Posterior Probabilities

We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that $p(M_A|y, z, w, v) \propto m(w, v|M_A)\,\pi(M_A)$. Hence, determining this posterior probability only requires calculating $m(w, v|M_A)$.

Note that, since $w$ and $v$ are independent, obtaining the model posteriors from expression 3-4 reduces to finding closed form expressions for the marginals $m(v|M_{A_z})$ and $m(w|M_{A_y})$, respectively, from equations 3-6 and 3-7. Therefore,

$m(w, v|M_A) = \int\!\!\int f(v, w|\alpha, \lambda, M_A)\,\pi^{IP}(\alpha|M_{A_z})\,\pi^{IP}(\lambda|M_{A_y})\,d\alpha\,d\lambda.$   (3-10)

For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3-8 into equation 3-6 (recalling that $\tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A$) and integrating out $\alpha_A$ yields

$m(v|M_A) = \int\!\!\int c_0\, N(v|X_0\alpha_0 + X_{r,A}\alpha_{r,A},\, I)\, N\!\left(\alpha_{r,A}|0,\; 2(\tilde{X}_{r,A}'\tilde{X}_{r,A})^{-1}\right) d\alpha_{r,A}\, d\alpha_0$

$\quad = c_0 (2\pi)^{-n/2} \int \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0_z}}{2}} \times \exp\left[-\frac{1}{2}(v - X_0\alpha_0)'\left(I - \left(\frac{2N}{2N + p_{A_z}}\right)H_{r,A_z}\right)(v - X_0\alpha_0)\right] d\alpha_0$

$\quad = c_0\,(2\pi)^{-(n-p_{0_z})/2} \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0_z}}{2}} |X_0'X_0|^{-\frac{1}{2}} \times \exp\left[-\frac{1}{2}v'\left(I - H_{0_z} - \left(\frac{2N}{2N + p_{A_z}}\right)H_{r,A_z}\right)v\right],$   (3-11)

with $H_{r,A_z} = H_{A_z} - H_{0_z}$, where $H_{A_z}$ is the hat matrix for the entire model $M_{A_z}$ and $H_{0_z}$ is the hat matrix for the base model.

Similarly, the marginal distribution for $w$ is

$m(w|M_A) = d_0\,(2\pi)^{-(J-p_{0_y})/2} \left(\frac{p_{A_y}}{2J + p_{A_y}}\right)^{\frac{p_{A_y}-p_{0_y}}{2}} |Q_0'Q_0|^{-\frac{1}{2}} \times \exp\left[-\frac{1}{2}w'\left(I - H_{0_y} - \left(\frac{2J}{2J + p_{A_y}}\right)H_{r,A_y}\right)w\right],$   (3-12)

where $J = \sum_{i=1}^{N} J_i$; in other words, $J$ denotes the total number of surveys conducted.

Now, the marginals under the base model $M_0 = \{M_{0_y}, M_{0_z}\}$ are

$m(v|M_0) = \int c_0\, N(v|X_0\alpha_0,\, I)\, d\alpha_0 = c_0(2\pi)^{-(n-p_{0_z})/2}\, |X_0'X_0|^{-1/2} \exp\left[-\frac{1}{2}v'(I - H_{0_z})v\right]$   (3-13)

and

$m(w|M_0) = d_0(2\pi)^{-(J-p_{0_y})/2}\, |Q_0'Q_0|^{-1/2} \exp\left[-\frac{1}{2}w'(I - H_{0_y})w\right].$   (3-14)

3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.

For each of the two components of the model (occupancy and detection) the algorithm first draws the set of active predictors (i.e., $A_z$ and $A_y$) together with their corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings correction with proposal distributions given by

$q(A_z^*|z_o, z_u^{(t)}, v^{(t)}, M_{A_z}) = \frac{1}{2}\left( p\!\left(M_{A_z^*}\,\middle|\,z_o, z_u^{(t)}, v^{(t)}, \mathcal{M}_z,\, M_{A_z^*} \in L(M_{A_z})\right) + \frac{1}{|L(M_{A_z})|}\right)$

$q(A_y^*|y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}) = \frac{1}{2}\left( p\!\left(M_{A_y^*}\,\middle|\,y, z_o, z_u^{(t)}, w^{(t)}, \mathcal{M}_y,\, M_{A_y^*} \in L(M_{A_y})\right) + \frac{1}{|L(M_{A_y})|}\right)$   (3-15)

where $L(M_{A_z})$ and $L(M_{A_y})$ denote the sets of models obtained by adding or removing one predictor at a time from $M_{A_z}$ and $M_{A_y}$, respectively.

To promote mixing, this step is followed by an additional draw from the full conditionals of $\alpha$ and $\lambda$. The densities $p(\alpha_0|\cdot)$, $p(\alpha_{r,A}|\cdot)$, $p(\lambda_0|\cdot)$ and $p(\lambda_{r,A}|\cdot)$ can be sampled from directly with Gibbs steps. Using the notation $a|\cdot$ to denote the random variable $a$ conditioned on all other parameters and on the data, these densities are given by

• $\alpha_0|\cdot \sim N\!\left((X_0'X_0)^{-1}X_0'v,\; (X_0'X_0)^{-1}\right)$;

• $\alpha_{r,A}|\cdot \sim N\!\left(\mu_{\alpha_{r,A}}, \Sigma_{\alpha_{r,A}}\right)$, where the covariance matrix and mean vector are given by $\Sigma_{\alpha_{r,A}} = \frac{2N}{2N+p_{A_z}}(X_{r,A}'X_{r,A})^{-1}$ and $\mu_{\alpha_{r,A}} = \Sigma_{\alpha_{r,A}} X_{r,A}'v$;

• $\lambda_0|\cdot \sim N\!\left((Q_0'Q_0)^{-1}Q_0'w,\; (Q_0'Q_0)^{-1}\right)$; and

• $\lambda_{r,A}|\cdot \sim N\!\left(\mu_{\lambda_{r,A}}, \Sigma_{\lambda_{r,A}}\right)$, analogously, with mean and covariance matrix given by $\Sigma_{\lambda_{r,A}} = \frac{2J}{2J+p_{A_y}}(Q_{r,A}'Q_{r,A})^{-1}$ and $\mu_{\lambda_{r,A}} = \Sigma_{\lambda_{r,A}} Q_{r,A}'w$.

A code sketch of these four draws is given below.
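The following minimal sketch implements the four conditional draws listed above, assuming numpy arrays X0, Xr, Q0 and Qr hold the design blocks of the current model and v and w hold the latent variables; all names are hypothetical and this is only one possible implementation of the conditionals, not the original code.

import numpy as np

def gibbs_coefficient_draws(X0, Xr, v, Q0, Qr, w, rng):
    N, J = X0.shape[0], Q0.shape[0]
    p_Az, p_Ay = X0.shape[1] + Xr.shape[1], Q0.shape[1] + Qr.shape[1]

    # alpha_0 | .  ~ N((X0'X0)^{-1} X0' v, (X0'X0)^{-1})
    V0 = np.linalg.inv(X0.T @ X0)
    alpha0 = rng.multivariate_normal(V0 @ X0.T @ v, V0)

    # alpha_rA | . ~ N(Sigma X_r' v, Sigma), Sigma = 2N/(2N + p_Az) (X_r'X_r)^{-1}
    Sa = (2 * N / (2 * N + p_Az)) * np.linalg.inv(Xr.T @ Xr)
    alpha_r = rng.multivariate_normal(Sa @ Xr.T @ v, Sa)

    # lambda_0 | . ~ N((Q0'Q0)^{-1} Q0' w, (Q0'Q0)^{-1})
    W0 = np.linalg.inv(Q0.T @ Q0)
    lambda0 = rng.multivariate_normal(W0 @ Q0.T @ w, W0)

    # lambda_rA | . ~ N(Sigma Q_r' w, Sigma), Sigma = 2J/(2J + p_Ay) (Q_r'Q_r)^{-1}
    Sl = (2 * J / (2 * J + p_Ay)) * np.linalg.inv(Qr.T @ Qr)
    lambda_r = rng.multivariate_normal(Sl @ Qr.T @ w, Sl)

    return alpha0, alpha_r, lambda0, lambda_r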

Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators $z_u$ and for the corresponding latent variables $v$ and $w$. The full conditional posterior densities for $z_u^{(t+1)}$, $v^{(t+1)}$ and $w^{(t+1)}$ are those introduced in Chapter 2 for the single season probit model.

The following steps summarize the stochastic search algorithm; a schematic implementation is sketched after the list.

1. Initialize $A_y^{(0)}, A_z^{(0)}, z_u^{(0)}, v^{(0)}, w^{(0)}, \alpha_0^{(0)}, \lambda_0^{(0)}$.

2. Sample the model indices and corresponding parameters:

(a) Draw simultaneously
• $A_z^* \sim q(A_z|z_o, z_u^{(t)}, v^{(t)}, M_{A_z})$,
• $\alpha_0^* \sim p(\alpha_0|M_{A_z^*}, z_o, z_u^{(t)}, v^{(t)})$, and
• $\alpha_{r,A^*}^* \sim p(\alpha_{r,A}|M_{A_z^*}, z_o, z_u^{(t)}, v^{(t)})$.

(b) Accept $(M_{A_z}^{(t+1)}, \alpha_0^{(t+1),1}, \alpha_{r,A}^{(t+1),1}) = (M_{A_z^*}, \alpha_0^*, \alpha_{r,A^*}^*)$ with probability

$\delta_z = \min\left(1,\; \dfrac{p(M_{A_z^*}|z_o, z_u^{(t)}, v^{(t)})}{p(M_{A_z^{(t)}}|z_o, z_u^{(t)}, v^{(t)})}\; \dfrac{q(A_z^{(t)}|z_o, z_u^{(t)}, v^{(t)}, M_{A_z^*})}{q(A_z^*|z_o, z_u^{(t)}, v^{(t)}, M_{A_z})}\right);$

otherwise let $(M_{A_z}^{(t+1)}, \alpha_0^{(t+1),1}, \alpha_{r,A}^{(t+1),1}) = (M_{A_z^{(t)}}, \alpha_0^{(t),2}, \alpha_{r,A}^{(t),2})$.

(c) Sample simultaneously
• $A_y^* \sim q(A_y|y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y})$,
• $\lambda_0^* \sim p(\lambda_0|M_{A_y^*}, y, z_o, z_u^{(t)}, w^{(t)})$, and
• $\lambda_{r,A^*}^* \sim p(\lambda_{r,A}|M_{A_y^*}, y, z_o, z_u^{(t)}, w^{(t)})$.

(d) Accept $(M_{A_y}^{(t+1)}, \lambda_0^{(t+1),1}, \lambda_{r,A}^{(t+1),1}) = (M_{A_y^*}, \lambda_0^*, \lambda_{r,A^*}^*)$ with probability

$\delta_y = \min\left(1,\; \dfrac{p(M_{A_y^*}|y, z_o, z_u^{(t)}, w^{(t)})}{p(M_{A_y^{(t)}}|y, z_o, z_u^{(t)}, w^{(t)})}\; \dfrac{q(A_y^{(t)}|y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y^*})}{q(A_y^*|y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y})}\right);$

otherwise let $(M_{A_y}^{(t+1)}, \lambda_0^{(t+1),1}, \lambda_{r,A}^{(t+1),1}) = (M_{A_y^{(t)}}, \lambda_0^{(t),2}, \lambda_{r,A}^{(t),2})$.

3. Sample base model parameters:

(a) Draw $\alpha_0^{(t+1),2} \sim p(\alpha_0|M_{A_z^{(t+1)}}, z_o, z_u^{(t)}, v^{(t)})$.
(b) Draw $\lambda_0^{(t+1),2} \sim p(\lambda_0|M_{A_y^{(t+1)}}, y, z_o, z_u^{(t)}, w^{(t)})$.

4. To improve mixing, resample the model coefficients that are not in the base model but are in $M_A$:

(a) Draw $\alpha_{r,A}^{(t+1),2} \sim p(\alpha_{r,A}|M_{A_z^{(t+1)}}, z_o, z_u^{(t)}, v^{(t)})$.
(b) Draw $\lambda_{r,A}^{(t+1),2} \sim p(\lambda_{r,A}|M_{A_y^{(t+1)}}, y, z_o, z_u^{(t)}, w^{(t)})$.

5. Sample latent and missing (unobserved) variables:

(a) Sample $z_u^{(t+1)} \sim p(z_u|M_{A_z^{(t+1)}}, y, \alpha_{r,A}^{(t+1),2}, \alpha_0^{(t+1),2}, \lambda_{r,A}^{(t+1),2}, \lambda_0^{(t+1),2})$.
(b) Sample $v^{(t+1)} \sim p(v|M_{A_z^{(t+1)}}, z_o, z_u^{(t+1)}, \alpha_{r,A}^{(t+1),2}, \alpha_0^{(t+1),2})$.
(c) Sample $w^{(t+1)} \sim p(w|M_{A_y^{(t+1)}}, z_o, z_u^{(t+1)}, \lambda_{r,A}^{(t+1),2}, \lambda_0^{(t+1),2})$.
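The skeleton below shows how steps 1-5 fit together as a Metropolis-Hastings-within-Gibbs loop. It is a sketch under stated assumptions: the callables passed in (propose, log_post, log_q, gibbs_update) are hypothetical stand-ins for the proposal 3-15, the model posterior, and the conditional draws described above, and are supplied by the user. In the occupancy setting, the model update is applied once for the presence component and once for the detection component within each iteration.

import numpy as np

def stochastic_search(init_state, propose, log_post, log_q, gibbs_update, n_iter, seed=0):
    """Generic driver for the model search of Section 3.3.4.
    propose(state, rng) -> candidate model; log_post(model, state) -> log p(M | latents);
    log_q(model_from, model_to, state) -> log proposal density; gibbs_update(state, rng)
    refreshes coefficients and latent variables (steps 3-5)."""
    rng = np.random.default_rng(seed)
    state = init_state
    visited = []
    for _ in range(n_iter):
        cand = propose(state, rng)                              # step 2(a)/(c)
        log_acc = (log_post(cand, state) - log_post(state["model"], state)
                   + log_q(cand, state["model"], state) - log_q(state["model"], cand, state))
        if np.log(rng.uniform()) < log_acc:                     # step 2(b)/(d)
            state["model"] = cand
        state = gibbs_update(state, rng)                        # steps 3-5
        visited.append(state["model"])
    return visited

The list of visited models can then be used to estimate model posterior probabilities and the marginal posterior inclusion probabilities discussed in Section 3.5.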

3.4 Alternative Formulation

Because the occupancy process is partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections $y$ and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, $z = (z_o', z_u')'$, and integrating out the unobserved component, the model posterior for $M_A$ can be obtained as

$p(M_A|y, z_o) \propto E_{z_u}\left[ m(y, z|M_A) \right] \pi(M_A).$   (3-16)

Data-augmenting the model in terms of latent normal variables à la Albert and Chib, the marginals for any model $\{M_y, M_z\} = M \in \mathcal{M}$ of $z$ and $y$ inside the expectation in equation 3-16 can be expressed in terms of the latent variables:

$m(y, z|M) = \int_{T(z)}\int_{T(y,z)} m(w, v|M)\, dw\, dv = \left(\int_{T(z)} m(v|M_z)\, dv\right)\left(\int_{T(y,z)} m(w|M_y)\, dw\right),$   (3-17)

where $T(z)$ and $T(y, z)$ denote the corresponding truncation regions for $v$ and $w$, which depend on the values taken by $z$ and $y$, and

$m(v|M_z) = \int f(v|\alpha, M_z)\,\pi(\alpha|M_z)\,d\alpha$   (3-18)

$m(w|M_y) = \int f(w|\lambda, M_y)\,\pi(\lambda|M_y)\,d\lambda.$   (3-19)

The last equality in equation 3-17 is a consequence of the independence of the latent processes $v$ and $w$. Using expressions 3-18 and 3-19 allows one to embed this model selection problem in the classical linear normal regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi 1996; Moreno et al. 1998) for this problem. This approach is an extension of the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.

Using this alternative approach, all that is left is to integrate $m(v|M_A)$ and $m(w|M_A)$ over their corresponding truncation regions $T(z)$ and $T(y, z)$, which yields $m(y, z|M_A)$, and then to obtain the expectation with respect to the unobserved $z$'s. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express $E_{z_u}[m(y, z|M_A)]$ as

$E_{z_u}\left[m(y, z|M_A)\right] = E_{z_u}\left[\left(\int_{T(z)} m(v|M_{A_z})\, dv\right)\left(\int_{T(y,z)} m(w|M_{A_y})\, dw\right)\right]$   (3-20)

$\quad = E_{z_u}\left[\left(\int_{T(z)}\!\int m(v|M_{A_z}, \alpha_0)\,\pi^{IP}(\alpha_0|M_{A_z})\,d\alpha_0\,dv\right) \times \left(\int_{T(y,z)}\!\int m(w|M_{A_y}, \lambda_0)\,\pi^{IP}(\lambda_0|M_{A_y})\,d\lambda_0\,dw\right)\right]$

$\quad = E_{z_u}\left[\int \underbrace{\left(\int_{T(z)} m(v|M_{A_z}, \alpha_0)\, dv\right)}_{g_1(T(z)|M_{A_z}, \alpha_0)} \pi^{IP}(\alpha_0|M_{A_z})\,d\alpha_0 \times \int \underbrace{\left(\int_{T(y,z)} m(w|M_{A_y}, \lambda_0)\, dw\right)}_{g_2(T(y,z)|M_{A_y}, \lambda_0)} \pi^{IP}(\lambda_0|M_{A_y})\,d\lambda_0\right]$

$\quad = E_{z_u}\left[\int g_1(T(z)|M_{A_z}, \alpha_0)\,\pi^{IP}(\alpha_0|M_{A_z})\,d\alpha_0 \times \int g_2(T(y,z)|M_{A_y}, \lambda_0)\,\pi^{IP}(\lambda_0|M_{A_y})\,d\lambda_0\right]$

$\quad = c_0\, d_0 \int\!\!\int E_{z_u}\left[ g_1(T(z)|M_{A_z}, \alpha_0)\, g_2(T(y,z)|M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0,$

where the last equality follows from Fubini's theorem, since $m(v|M_{A_z}, \alpha_0)$ and $m(w|M_{A_y}, \lambda_0)$ are proper densities. From 3-20, the posterior odds are

$\dfrac{p(M_A|y, z_o)}{p(M_0|y, z_o)} = \dfrac{\int\!\!\int E_{z_u}\left[ g_1(T(z)|M_{A_z}, \alpha_0)\, g_2(T(y,z)|M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}{\int\!\!\int E_{z_u}\left[ g_1(T(z)|M_{0_z}, \alpha_0)\, g_2(T(y,z)|M_{0_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}\;\dfrac{\pi(M_A)}{\pi(M_0)}.$   (3-21)

3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, where we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component and, finally, the amount of signal in the predictors for the detection component.

For each model component the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection have, respectively, 30 and 20 predictors. Therefore, the model space contains $2^{30}\times 2^{20} \approx 1.12\times 10^{15}$ candidate models.

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50 and 90 of the occupancy and detection probabilities match some pre-specified probabilities. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distribution of the occupancy and detection probabilities, respectively. Low signal levels relate to occupancy or detection probabilities close to 0.5; high signal levels are associated with probabilities close to 0 or 1. Large spreads of the distributions for the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.

Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5 at small ($Q^z_{10} = 0.3$, $Q^z_{90} = 0.7$), intermediate ($Q^z_{10} = 0.2$, $Q^z_{90} = 0.8$) and large ($Q^z_{10} = 0.1$, $Q^z_{90} = 0.9$) distances. For the detection component, the model parameters are obtained to reflect detection probabilities concentrated about low values ($Q^y_{50} = 0.2$), intermediate values ($Q^y_{50} = 0.5$) and high values ($Q^y_{50} = 0.8$), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.
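One simple way to back out probit coefficients that hit such quantile targets is sketched below, under the simplifying assumption (made only for this illustration, and not necessarily the exact scheme used in the simulations) that the covariates are independent standard normal, so the linear predictor is normal with mean equal to the intercept and standard deviation equal to the norm of the slope vector.

import numpy as np
from scipy.stats import norm

def probit_signal_coefficients(q50, q90, beta_direction):
    # Intercept matches the median: Phi(intercept) = q50
    intercept = norm.ppf(q50)
    # Linear predictor ~ N(intercept, s^2) with s = ||slope||; choose s so that
    # the 90th percentile of Phi(linear predictor) equals q90
    s = (norm.ppf(q90) - intercept) / norm.ppf(0.9)
    beta_direction = np.asarray(beta_direction, dtype=float)
    slope = s * beta_direction / np.linalg.norm(beta_direction)
    # When the targets are symmetric about q50 on the probit scale, the 10th
    # percentile Phi(intercept - s * z_0.9) is matched automatically
    return intercept, slope

Here beta_direction is a hypothetical vector fixing the relative sizes of the nonzero slopes; only its direction matters, since the overall scale is set by the quantile targets.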

Table 3-1. Simulation control parameters, occupancy model selector

Parameter                           Values considered
N                                   50, 100
J                                   3, 5
(Q^z_10, Q^z_50, Q^z_90)            (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_10, Q^y_50, Q^y_90)            (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)

There are in total 36 scenarios; these result from crossing all the levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, using the assumed true models $M_{T_z} = \{1, x_2, x_{15}, x_{16}, x_{22}, x_{28}\}$ for the presence and $M_{T_y} = \{1, q_7, q_{10}, q_{12}, q_{17}\}$ for the detection, with the predictors included in the randomly generated data sets. In this context, 1 represents the intercept term. Throughout this section we refer to predictors included in the true models as true predictors, and to those absent from them as false predictors.

The selection procedure was conducted using each one of these data sets with two different priors on the model space: the uniform or equal probability prior, and a multiplicity correcting prior.

The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor, and also through the five highest posterior probability models (HPM). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as

$p(\text{predictor is included}|y, z, w, v) = \sum_{M \in \mathcal{M}} I_{(\text{predictor} \in M)}\; p(M|y, z, w, v).$   (3-22)

In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of marginal posterior inclusion probabilities for the predictors. Let $\tilde{\xi}$ and $\xi$ denote, respectively, a predictor in the true model $M_T$ and a predictor absent from $M_T$. We define the minimum MPIP odds between the probabilities of true and false predictors as

$\text{minOdds}_{\text{MPIP}} = \dfrac{\min_{\tilde{\xi}\in M_T} p(I_{\tilde{\xi}} = 1|\tilde{\xi} \in M_T)}{\max_{\xi\notin M_T} p(I_{\xi} = 1|\xi \notin M_T)}.$   (3-23)
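Both summaries can be computed directly from the output of the stochastic search. The sketch below is illustrative (the function names and the representation of models as frozensets of predictor labels are assumptions of this example, not part of the original text).

def mpip(model_probs):
    # Marginal posterior inclusion probabilities (3-22) from a dict mapping each
    # visited model (a frozenset of predictor labels) to its estimated posterior probability
    probs = {}
    for model, p in model_probs.items():
        for predictor in model:
            probs[predictor] = probs.get(predictor, 0.0) + p
    return probs

def min_odds_mpip(mpips, true_predictors):
    # Minimum MPIP odds between true and false predictors (3-23)
    true_min = min(mpips.get(p, 0.0) for p in true_predictors)
    false_max = max(p for name, p in mpips.items() if name not in true_predictors)
    return true_min / false_max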

If the variable selection procedure adequately discriminates true and false predictors, minOdds_MPIP will take values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors

For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left hand side plots in these figures contain the results for the presence component, and the ones on the right correspond to predictors in the detection component. The results obtained with the uniform model priors correspond to the black lines, and those for the multiplicity correcting prior are in red. In these figures the MPIPs have been averaged over all data sets from scenarios matching the condition observed.

In Figure 3-1 we contrast the mean MPIPs of the predictors over all data sets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.

From these figures mainly three results can be drawn: (1) the effect of the model prior is substantial; (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors; and (3) the separation between MPIPs of true predictors and false predictors is noticeably larger in the detection component.

Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIP for false predictors towards 0. In the presence component the MPIP for the true predictors is shrunk substantially under the multiplicity prior; however, there remains a clear separation between true and false predictors. In contrast, in the detection component the MPIP for true predictors remains relatively high (Figures 3-1 through 3-5).

Figure 3-1. Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (Unif) and multiplicity correction (MC) priors.

Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (Unif) and multiplicity correction (MC) priors.

Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (Unif) and multiplicity correction (MC) priors.

Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.

Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.

In scenarios where more sites were surveyed, the separation between the MPIP of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites has an effect on both components, given that every time a new site is included, covariate information is added to the design matrix of both the presence and the detection components.

On the other hand, increasing the number of surveys affects the MPIP of predictors in the detection component (Figures 3-2 and 3-3), but has only a marginal effect on predictors of the presence component. This may appear to be counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection, while leaving the design matrix for the presence unaltered. The small changes observed in the MPIP for the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 in sites that, with fewer surveys, would only have 0-valued detections.

From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when considering a smaller number of sites, and vice versa.

Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the detection component is affected. There is stronger discrimination between true and false predictors as the distribution has higher variability (Figure 3-4). This is consistent with intuition, since having the presence probabilities more concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.

Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between MPIPs of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and reduces those of false predictors.

Table 3-2. Comparison of average minOdds_MPIP under scenarios having different numbers of sites (N=50, N=100) and different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors

                            Sites                 Surveys
Comp         π(M)       N=50     N=100         J=3      J=5
Presence     Unif        1.12      1.31         1.19     1.24
             MC          3.20      8.46         4.20     6.74
Detection    Unif        2.03      2.64         2.11     2.57
             MC         21.15     32.46        21.39    32.52

Table 3-3. Comparison of average minOdds_MPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors

                         (Q^z_10, Q^z_50, Q^z_90)                          (Q^y_10, Q^y_50, Q^y_90)
Comp        π(M)   (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)   (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence    Unif       1.05           1.20           1.34            1.10           1.23           1.24
            MC         2.02           4.55           8.05            2.38           6.19           6.40
Detection   Unif       2.34           2.34           2.30            2.57           2.00           2.38
            MC        25.37          20.77          25.28           29.33          18.52          28.49

The separation between the MPIP of true and false predictors is even more evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario, the value of minOdds_MPIP (as defined in 3-23) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model, the minOdds_MPIP are markedly larger under the multiplicity correction prior and increase with the number of sites and with the number of surveys.

For the presence component, increasing the signal in the occupancy probabilities, or having the detection probabilities concentrate about higher values, has a positive and considerable effect on the magnitude of the odds. For the detection component these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution of the detection probabilities center about low or high values increases the minOdds_MPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model

Tables 3-4 through 3-7 show the number of true predictors that are included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages observed in these tables provide one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The multiplicity correction prior's strong shrinkage only allows a few true predictors to be selected, but at the same time it prevents any false predictors from being included in the HPM. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but also occurs to a lesser extent in the detection component.

Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

                             True +              True −
Comp         π(M)        N=50    N=100       N=50    N=100
Presence     Unif         0.57     0.63       0.51     0.55
             MC           0.06     0.13       1.00     1.00
Detection    Unif         0.77     0.85       0.87     0.93
             MC           0.49     0.70       1.00     1.00

Having more sites or surveys improves the inclusion of true predictors and the exclusion of false ones in the HPM, for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution of the occupancy probabilities is more

Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

                             True +             True −
Comp         π(M)         J=3     J=5        J=3     J=5
Presence     Unif         0.59    0.61       0.52    0.54
             MC           0.08    0.10       1.00    1.00
Detection    Unif         0.78    0.85       0.87    0.92
             MC           0.50    0.68       1.00    1.00

spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).

Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

                                      True +                                    True −
Comp        π(M)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)
Presence    Unif       0.55          0.61          0.64            0.50          0.54          0.55
            MC         0.02          0.08          0.18            1.00          1.00          1.00
Detection   Unif       0.81          0.82          0.81            0.90          0.89          0.89
            MC         0.57          0.61          0.59            1.00          1.00          1.00

3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,

Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

                                      True +                                    True −
Comp        π(M)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence    Unif       0.59          0.59          0.62            0.51          0.54          0.54
            MC         0.06          0.10          0.11            1.00          1.00          1.00
Detection   Unif       0.89          0.77          0.78            0.91          0.87          0.91
            MC         0.70          0.48          0.59            1.00          1.00          1.00

there is large variation in its topography and physio-geography; as such, elevation is a good candidate covariate to predict species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, temperature, as well as some biotic factors (Kery et al. 2010).

Repeated visits to 1-ha pixels took place to obtain the corresponding detection history. In addition to the survey outcome, the x- and y-coordinates, the thermal level, the date of the survey and the elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranges from 1 to 22 within each survey year.

Kery et al. (2010) summarize the results of this effort using AIC-based model comparisons: first by following a backwards elimination approach for the detection process, while keeping the occupancy component fixed at the most complex model, and then, for the presence component, choosing among a group of three models while using the detection model already chosen. In our analysis of this data set, for the detection and the presence we consider as the full models those used in Kery et al. (2010), namely

$\Phi^{-1}(\psi) = \alpha_0 + \alpha_1\,\mathrm{year} + \alpha_2\,\mathrm{elev} + \alpha_3\,\mathrm{elev}^2 + \alpha_4\,\mathrm{elev}^3$

$\Phi^{-1}(p) = \lambda_0 + \lambda_1\,\mathrm{year} + \lambda_2\,\mathrm{elev} + \lambda_3\,\mathrm{elev}^2 + \lambda_4\,\mathrm{elev}^3 + \lambda_5\,\mathrm{date} + \lambda_6\,\mathrm{date}^2,$

where year $= I_{\{\mathrm{year} = 2000\}}$.

The model spaces for these data contain $2^6 = 64$ and $2^4 = 16$ models, respectively, for the detection and occupancy components. That is, in total the model space contains $2^{4+6} = 1{,}024$ models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each one of the models sampled was chosen from the set of models that could be reached by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space, because for each of the 10,000 models drawn, the posterior probabilities of many more models can be observed. Below, the labels for the predictors are followed by either "z" or "y", accordingly, to represent the component they pertain to. Finally, using the results from the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kery et al. (2010).

3.6.1 Results: Variable Selection Procedure

The model finally chosen for the presence component in Kery et al. (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might correspond to a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability for the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior concentrates the model posterior probability mass more densely in the highest ranked models (90% of the mass is in the top five models) than the uniform prior (for which the top five models account for 40% of the mass).

For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases this model obtains very

Table 3-8. Posterior probability for the five highest probability models in the presence component of the blue hawker data

Uniform model prior                              Multiplicity correcting model prior
Rank   Mz selected               p(Mz|y)         Rank   Mz selected               p(Mz|y)
1      yrz+elevz                 0.10            1      elevz+elevz3              0.53
2      yrz+elevz+elevz3          0.08            2                                0.15
3      elevz2+elevz3             0.08            3      elevz+elevz2              0.09
4      yrz+elevz2                0.07            4      elevz2                    0.06
5      yrz+elevz3                0.07            5      elevz+elevz2+elevz3       0.05

high posterior probabilities. The terms contained in the cubic polynomial for the elevation appear to contain some relevant information; however, this conflicts with the MPIPs observed in Table 3-11, which under both model priors are relatively low (< 20% with the uniform, and ≤ 4% with the multiplicity correcting prior).

Table 3-9. Posterior probability for the five highest probability models in the detection component of the blue hawker data

Uniform model prior                        Multiplicity correcting model prior
Rank   My selected        p(My|y)          Rank   My selected        p(My|y)
1                         0.45             1                         0.86
2      elevy3             0.06             2      elevy3             0.02
3      elevy2             0.05             3      datey2             0.02
4      elevy              0.05             4      elevy2             0.02
5      yry                0.04             5      yry                0.02

Finally, it is possible to use the MPIPs to obtain the median probability model, which contains the terms that have an MPIP higher than 50%. For the occupancy process (Table 3-10), under the uniform prior the model with the year, the elevation and the elevation cubed is obtained. The MPM under the multiplicity correction prior coincides with the HPM from this prior. The MPM chosen for the detection component (Table 3-11), under both priors, is the intercept-only model, coinciding again with the HPM.

Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belong to the true model, especially when considering

Table 3-10. MPIP, presence component

Predictor    p(predictor ∈ MTz | y, z, w, v)
             Unif      MultCorr
yrz          0.53      0.07
elevz        0.51      0.73
elevz2       0.45      0.23
elevz3       0.50      0.67

Table 3-11. MPIP, detection component

Predictor    p(predictor ∈ MTy | y, z, w, v)
             Unif      MultCorr
yry          0.19      0.03
elevy        0.18      0.03
elevy2       0.18      0.03
elevy3       0.19      0.04
datey        0.16      0.03
datey2       0.15      0.04

those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10), there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure

Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation), and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component using the multiplicity correction prior.

Therefore, in the validation procedure, we observe the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM), the model including only predictors with an MPIP larger than 50%, with the uniform prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity correction prior (elevz+elevz3, the same as the HPM with multiplicity correction).

We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training data set. On the other hand, the model in Kery et al. (2010) was chosen to minimize the prediction error of the complete data. Because this model was obtained from the full data set, results derived from it can only be considered as a lower bound for the prediction errors.

The benchmark misclassification error rate for true 1's is high (close to 70%). However, the misclassification rate for true 0's, which account for most of the responses, is less pronounced (15%). Overall, the performance of the selected models is comparable. They yield considerably worse results than the benchmark for the true 1's, but achieve rates close to the benchmark for the true zeros. Pooling together the results for true ones and true zeros, the selected models with either prior have misclassification rates close to 30%. The benchmark model performs comparably, with a joint misclassification error of 23% (Table 3-12).

Table 3-12. Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors

Model                           Terms                                                 True 1   True 0   Joint
Benchmark (Kery et al. 2010)    yrz+elevz+elevz2+elevz3+elevy+elevy2+datey+datey2      0.66     0.15    0.23
HPM Unif                        yrz+elevz                                              0.83     0.17    0.28
HPM/MPM MC                      elevz+elevz3                                           0.82     0.18    0.28
MPM Unif                        yrz+elevz+elevz3                                       0.82     0.18    0.29

3.7 Discussion

In this chapter we proposed an objective and fully automatic Bayes methodology for the single season site-occupancy model. The methodology is said to be fully automatic because no hyper-parameter specification is necessary in defining the parameter priors, and objective because it relies on intrinsic priors derived from noninformative priors. The intrinsic priors have been shown to have desirable properties as testing priors. We also propose a fast stochastic search algorithm to explore large model spaces using our model selection procedure.

Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors absent from it. Also, the simulations indicated that the method has a greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.

Multiplicity correction priors were not described in this chapter; however, their influence on the selection outcome is significant. This behavior was observed in the simulation experiment and in the analysis of the blue hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next chapter.

Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4; hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is developing methods for occupancy variable selection in a multivariate setting. This can be used to conduct hypothesis testing in scenarios with varying conditions through time, or in the case where multiple species are co-observed. A final variation we will investigate for this problem is that of occupancy model selection incorporating random effects.

CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.
–Sherlock Holmes, A Case of Identity

4.1 Introduction

In regression problems, if a large number of potential predictors is available, the complete model space is too large to enumerate and automatic selection algorithms are necessary to find informative, parsimonious models. This multiple testing problem is difficult, and even more so when interactions or powers of the predictors are considered. In the ecological literature, models with interactions and/or higher order polynomial terms are ubiquitous (Johnson et al. 2013; Kery et al. 2010; Zeller et al. 2011), given the complexity and non-linearities found in ecological processes. Several model selection procedures, even in the classical normal linear setting, fail to address two fundamental issues: (1) the model selection outcome is not invariant to affine transformations when interactions or polynomial structures are found among the predictors, and (2) additional penalization is required to control for false positives as the model space grows (i.e., as more covariates are considered).

These two issues motivate the developments presented throughout this chapter. Building on the results of Chipman (1996), we propose, investigate and provide recommendations for three different prior distributions on the model space. These priors help control for test multiplicity while accounting for polynomial structure in the predictors. They improve upon those proposed by Chipman, first by avoiding the need for specific values for the prior inclusion probabilities of the predictors, and second by formulating principled alternatives to introduce additional structure into the model priors. Finally, we design a stochastic search algorithm that allows fast and thorough exploration of model spaces with polynomial structure.

Having structure in the predictors can determine the selection outcome. As an illustration, consider the model $E[y] = \beta_{00} + \beta_{01}x_2 + \beta_{20}x_1^2$, where the order one term $x_1$ is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming $x_1 \mapsto x_1^* = x_1 + c$ for some $c \neq 0$, the model becomes $E[y] = \beta_{00} + \beta_{01}x_2 + \beta_{20}^* x_1^{*2}$. Note that, in terms of the original predictors, $x_1^{*2} = x_1^2 + 2c\,x_1 + c^2$, implying that this seemingly innocuous transformation of $x_1$ modifies the column space of the design matrix by including $x_1$, which was not in the original model. That is, when lower order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al. 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000; Peixoto 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower order predictor associated with it is also in the model. The model above is not well-formulated, as it contains $x_1^2$ but not $x_1$.

WFMs exhibit strong heredity, in that all lower order terms dividing higher order terms in the model must also be included. An alternative is to only require weak heredity (Chipman 1996), which only forces some of the lower terms in the corresponding polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the conditions under which weak heredity allows the design matrix to be invariant to affine transformations of the predictors are too restrictive to be useful in practice.

Although this topic appeared in the literature more than three decades ago (Nelder 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al. 2009; Peixoto 1987), regularized regression methods (Bien et al. 2013; Yuan et al. 2009), and Bayesian approaches (Chipman 1996). The methods introduced in this chapter take a Bayesian approach towards variable selection for well-formulated models, with particular emphasis on model priors.

As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probabilities within a pre-specified model space $\mathcal{M}$. The model posterior probability for $M \in \mathcal{M}$ is given by

$p(M|y, \mathcal{M}) \propto m(y|M)\,\pi(M|\mathcal{M}).$   (4-1)

Model posterior probabilities depend on the prior distribution on the model space, as well as on the prior distributions for the model specific parameters, implicitly through the marginals $m(y|M)$. Priors on the model specific parameters have been extensively discussed in the literature (Berger & Pericchi 1996; Berger et al. 2001; George 2000; Jeffreys 1961; Kass & Wasserman 1996; Liang et al. 2008; Zellner & Siow 1980). In contrast, the effect of the prior on the model space has, until recently, been neglected. A few authors (e.g., Casella et al. (2014), Scott & Berger (2010), Wilson et al. (2010)) have highlighted the relevance of the priors on the model space in the context of multiple testing. Adequately formulating priors on the model space can both account for structure in the predictors and provide additional control on the detection of false positive terms. In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size $p/2$ (where $p$ is the total number of covariates), since this is the most abundant model size contained in the model space.

Variable selection within the model space of well-formulated polynomial models poses two challenges for automatic, objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.

Models without polynomial hierarchy constitute a special case of WFMs where all predictors are of order one. Hence, all the methods developed throughout this chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, these methods are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.

In this chapter, first we provide the necessary definitions to characterize the well-formulated model selection problem. Then we proceed to introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model specific parameters, though this assumption can be relaxed to use other mixtures of g-priors. Finally, we implement our procedures using both simulated and real data.

4.2 Setup for Well-Formulated Models

Suppose that the observations $y_i$ are modeled using the polynomial regression on the covariates $x_{i,1}, \ldots, x_{i,p}$ given by

$y_i = \sum_{\alpha \in \mathbb{N}_0^p} \beta_{(\alpha_1, \ldots, \alpha_p)} \prod_{j=1}^{p} x_{i,j}^{\alpha_j} + \epsilon_i,$   (4-2)

where $\alpha = (\alpha_1, \ldots, \alpha_p)$ belongs to $\mathbb{N}_0^p$, the $p$-dimensional space of natural numbers including 0, with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, and only finitely many $\beta_{\alpha}$ are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates $x_{i,1}$ and $x_{i,2}$ only. The terms $x_{i,2}^2$ and $x_{i,1}^2 x_{i,2}$ can be represented by $\alpha = (0, 2)$ and $\alpha = (2, 1)$, respectively.
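Under this representation a model is just a collection of multi-indices, and its design matrix can be built directly from them. The sketch below is illustrative only (the helper names and the numpy-based interface are assumptions of this example).

import numpy as np

def polynomial_term(X, alpha):
    # Column of Z(X) for multi-index alpha = (alpha_1, ..., alpha_p): prod_j x_j^alpha_j
    return np.prod(X ** np.asarray(alpha), axis=1)

def design_matrix(X, model):
    # Design matrix Z_M(X) for a model given as a list of multi-indices
    return np.column_stack([polynomial_term(X, a) for a in model])

# Example: M = {1, x1, x2, x1*x2} for p = 2 corresponds to
# Z = design_matrix(X, [(0, 0), (1, 0), (0, 1), (1, 1)])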

The notation $y = Z(X)\beta + \epsilon$ is used to denote that the observed response $y = (y_1, \ldots, y_n)'$ is modeled via a polynomial function $Z$ of the original covariates contained in $X = (x_1, \ldots, x_p)$ (where $x_j = (x_{1,j}, \ldots, x_{n,j})'$), and the coefficients of the polynomial terms are given by $\beta$. A specific polynomial model $M$ is defined by the set of coefficients $\beta_{\alpha}$ that are allowed to be non-zero. This definition is equivalent to characterizing $M$ through a collection of multi-indices $\alpha \in \mathbb{N}_0^p$. In particular, model $M$ is specified by $M = \{\alpha_1^M, \ldots, \alpha_{|M|}^M\}$ for $\alpha_k^M \in \mathbb{N}_0^p$, where $\beta_{\alpha} = 0$ for $\alpha \notin M$.

Any particular model $M$ uses a subset $X_M$ of the original covariates $X$ to form the polynomial terms in the design matrix $Z_M(X)$. Without ambiguity, a polynomial model $Z_M(X)$ on $X$ can be identified with a polynomial model $Z_M(X_M)$ on the covariates $X_M$. The number of terms used by $M$ to model the response $y$, denoted by $|M|$, corresponds to the number of columns of $Z_M(X_M)$. The coefficient vector and error variance of the model $M$ are denoted by $\beta_M$ and $\sigma_M^2$, respectively. Thus $M$ models the data as $y = Z_M(X_M)\beta_M + \epsilon_M$, where $\epsilon_M \sim N(0, I\sigma_M^2)$. Model $M$ is said to be nested in model $M'$ if $M \subset M'$. $M$ models the response of the covariates in two distinct ways: choosing the set of meaningful covariates $X_M$, as well as choosing the polynomial structure of these covariates, $Z_M(X_M)$.

The set $\mathbb{N}_0^p$ constitutes a partially ordered set, or more succinctly a poset. A poset is a set partially ordered through a binary relation "$\preceq$". In this context, the binary relation on the poset $\mathbb{N}_0^p$ is defined between pairs $(\alpha, \alpha')$ by $\alpha' \preceq \alpha$ whenever $\alpha_j \ge \alpha_j'$ for all $j = 1, \ldots, p$, with $\alpha' \prec \alpha$ if, additionally, $\alpha_j > \alpha_j'$ for some $j$. The order of a term $\alpha \in \mathbb{N}_0^p$ is given by the sum of its elements, $\mathrm{order}(\alpha) = \sum_j \alpha_j$. When $\mathrm{order}(\alpha) = \mathrm{order}(\alpha') + 1$ and $\alpha' \prec \alpha$, then $\alpha'$ is said to immediately precede $\alpha$, which is denoted by $\alpha' \to \alpha$. The parent set of $\alpha$ is defined by $P(\alpha) = \{\alpha' \in \mathbb{N}_0^p : \alpha' \to \alpha\}$, and is given by the set of nodes that immediately precede the given node. A polynomial model $M$ is said to be well-formulated if $\alpha \in M$ implies that $P(\alpha) \subset M$. For example, any well-formulated model using $x_{i,1}^2 x_{i,2}$ to model $y_i$ must also include the parent terms $x_{i,1}x_{i,2}$ and $x_{i,1}^2$, their corresponding parent terms $x_{i,1}$ and $x_{i,2}$, and the intercept term 1.
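The parent-set operation and the well-formulation condition translate directly into code. The small sketch below (helper names are hypothetical, with models represented as sets of multi-index tuples) is one way to check the condition.

def parents(alpha):
    # Parent set P(alpha): multi-indices obtained by lowering one positive exponent by 1
    return {tuple(a - (1 if j == k else 0) for j, a in enumerate(alpha))
            for k, ak in enumerate(alpha) if ak > 0}

def is_well_formulated(model):
    # A model (set of multi-indices) is well-formulated if it contains the parents of each term
    return all(parents(alpha) <= model for alpha in model)

# is_well_formulated({(0, 0), (1, 0), (2, 0)})  -> True  (1, x1, x1^2)
# is_well_formulated({(0, 0), (0, 1), (2, 0)})  -> False (x1^2 without x1)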

The poset $\mathbb{N}_0^p$ can be represented by a Directed Acyclic Graph (DAG), denoted by $\Gamma(\mathbb{N}_0^p)$. Without ambiguity, we can identify nodes in the graph, $\alpha \in \mathbb{N}_0^p$, with terms in the set of covariates. The graph has directed edges to a node from its parents. Any well-formulated model $M$ is represented by a subgraph $\Gamma(M)$ of $\Gamma(\mathbb{N}_0^p)$ with the property that if node $\alpha \in \Gamma(M)$, then the nodes corresponding to $P(\alpha)$ are also in $\Gamma(M)$. Figure 4-1 shows examples of well-formulated polynomial models, where $\alpha \in \mathbb{N}_0^p$ is identified with $\prod_{j=1}^{p} x_j^{\alpha_j}$.

The motivation for considering only well-formulated polynomial models is compelling. Let $Z_M$ be the design matrix associated with a polynomial model. The subspace of $y$ modeled by $Z_M$, given by the hat matrix $H_M = Z_M(Z_M'Z_M)^{-1}Z_M'$, is invariant to affine transformations of the matrix $X_M$ if and only if $M$ corresponds to a well-formulated polynomial model (Peixoto 1990).

Figure 4-1. Graphs of well-formulated polynomial models for p = 2.

For example, if $p = 2$ and $y_i = \beta_{(0,0)} + \beta_{(1,0)}x_{i,1} + \beta_{(0,1)}x_{i,2} + \beta_{(1,1)}x_{i,1}x_{i,2} + \epsilon_i$, then the hat matrix is invariant to any covariate transformation of the form $A(x_{i,1}, x_{i,2})' + b$, for any real-valued positive definite $2 \times 2$ matrix $A$ and any real-valued vector $b$ of dimension two. In contrast, if $y_i = \beta_{(0,0)} + \beta_{(2,0)}x_{i,1}^2 + \epsilon_i$, then the hat matrix formed after applying the transformation $x_{i,1} \mapsto x_{i,1} + c$, for real $c \neq 0$, is not the same as the hat matrix formed by the original $x_{i,1}$.

4.2.1 Well-Formulated Model Spaces

The spaces of WFMs $\mathcal{M}$ considered in this chapter can be characterized in terms of two WFMs: $M_B$, the base model, and $M_F$, the full model. The base model contains at least the intercept term and is nested in the full model. The model space $\mathcal{M}$ is populated by all well-formulated models $M$ that nest $M_B$ and are nested in $M_F$:

$\mathcal{M} = \{M : M_B \subseteq M \subseteq M_F \text{ and } M \text{ is well-formulated}\}.$

For $M$ to be well-formulated, the entire ancestry of each node in $M$ must also be included in $M$. Because of this, $M \in \mathcal{M}$ can be uniquely identified by two different sets of nodes in $M_F$: the set of extreme nodes and the set of children nodes. For $M \in \mathcal{M}$, the sets of extreme and children nodes, respectively denoted by $E(M)$ and $C(M)$, are defined by

$E(M) = \{\alpha \in M \setminus M_B : \alpha \notin P(\alpha')\ \forall\ \alpha' \in M\}$

$C(M) = \{\alpha \in M_F \setminus M : \{\alpha\} \cup M \text{ is well-formulated}\}.$

The extreme nodes are those nodes that, when removed from $M$, give rise to a WFM in $\mathcal{M}$. The children nodes are those nodes that, when added to $M$, give rise to a WFM in $\mathcal{M}$. Because $M_B \subseteq M$ for all $M \in \mathcal{M}$, the set of nodes $E(M) \cup M_B$ determines $M$, by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in $C(M)$ determine the set $\{\alpha' \in P(\alpha) : \alpha \in C(M)\} \cup \{\alpha' \in E(M_F) : \alpha \not\preceq \alpha' \text{ for all } \alpha \in C(M)\}$, which contains $E(M) \cup M_B$ and thus uniquely identifies $M$.
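Using the is_well_formulated helper sketched above, the extreme and children node sets can be computed directly from their defining properties. The functions below are illustrative stand-ins (names and interfaces are assumptions of this example), with models, base model and full model represented as sets of multi-index tuples.

def extreme_nodes(model, base):
    # E(M): nodes outside the base model whose removal leaves a well-formulated model
    return {a for a in model - base if is_well_formulated(model - {a})}

def children_nodes(model, full):
    # C(M): nodes of the full model, outside M, whose addition yields a well-formulated model
    return {a for a in full - model if is_well_formulated(model | {a})}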

Figure 4-2. A) Extreme node set. B) Children node set.

In Figure 4-2, the extreme and children sets for model $M = \{1, x_1, x_1^2\}$ are shown for the model space characterized by $M_F = \{1, x_1, x_2, x_1^2, x_1x_2, x_2^2\}$. In Figure 4-2A, the solid nodes represent nodes $\alpha \in M \setminus E(M)$, the dashed node corresponds to $\alpha \in E(M)$, and the dotted nodes are not in $M$. Solid nodes in Figure 4-2B correspond to those in $M$; the dashed node is the single node in $C(M)$, and the dotted nodes are not in $M \cup C(M)$.

4.3 Priors on the Model Space

As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities $\pi(M|\mathcal{M})$.

In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition

The graphical structure of the model spaces suggests a method for prior construction on $\mathcal{M}$ guided by the notion of inheritance. A node $\alpha$ is said to inherit from a node $\alpha'$ if there is a directed path from $\alpha'$ to $\alpha$ in the graph $\Gamma(M_F)$. The inheritance is said to be immediate if $\mathrm{order}(\alpha) = \mathrm{order}(\alpha') + 1$ (equivalently, if $\alpha' \in P(\alpha)$, or if $\alpha'$ immediately precedes $\alpha$).

For convenience, define $\Delta(M) = M \setminus M_B$ to be the set of nodes in $M$ that are not in the base model $M_B$. For $\alpha \in \Delta(M_F)$, let $\gamma_{\alpha}(M)$ be the indicator function describing whether $\alpha$ is included in $M$, i.e., $\gamma_{\alpha}(M) = I_{(\alpha \in M)}$. Denote by $\gamma^{\nu}(M)$ the set of indicators of inclusion in $M$ for all order $\nu$ nodes in $\Delta(M_F)$. Finally, let $\gamma^{<\nu}(M) = \bigcup_{j=0}^{\nu-1}\gamma^{j}(M)$, the set of indicators of inclusion in $M$ for all nodes in $\Delta(M_F)$ of order less than $\nu$. With these definitions, the prior probability of any model $M \in \mathcal{M}$ can be factored as

$\pi(M|\mathcal{M}) = \prod_{j=J^{\min}_M}^{J^{\max}_M} \pi\!\left(\gamma^{j}(M)\,\middle|\,\gamma^{<j}(M), \mathcal{M}\right),$   (4-3)

where $J^{\min}_M$ and $J^{\max}_M$ are, respectively, the minimum and maximum order of nodes in $\Delta(M_F)$, and $\pi(\gamma^{J^{\min}_M}(M)|\gamma^{<J^{\min}_M}(M), \mathcal{M}) = \pi(\gamma^{J^{\min}_M}(M)|\mathcal{M})$.

Prior distributions on $\mathcal{M}$ can be simplified by making two assumptions. First, if $\mathrm{order}(\alpha) = \mathrm{order}(\alpha') = j$, then $\gamma_{\alpha}$ and $\gamma_{\alpha'}$ are assumed to be conditionally independent when conditioned on $\gamma^{<j}$, denoted by $\gamma_{\alpha} \perp\!\!\!\perp \gamma_{\alpha'}|\gamma^{<j}$. Second, immediate inheritance is invoked, and it is assumed that if $\mathrm{order}(\alpha) = j$, then $\gamma_{\alpha}(M)|\gamma^{<j}(M) = \gamma_{\alpha}(M)|\gamma_{P(\alpha)}(M)$, where $\gamma_{P(\alpha)}(M)$ is the inclusion indicator for the set of parent nodes of $\alpha$. This indicator is one if the complete parent set of $\alpha$ is contained in $M$, and zero otherwise.

In Figure 4-3 these two assumptions are depicted, with $M_F$ being an order two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for $x_1^2$, $x_2^2$ and $x_1x_2$ are independent when conditioned on all the lower order terms. In this same space, immediate inheritance implies that the inclusion of $x_1^2$, conditioned on the inclusion of all lower order nodes, is equivalent to conditioning it on its parent set ($x_1$ in this case).

Figure 4-3. A) Conditional independence: $x_1^2 \perp\!\!\!\perp x_1x_2 \perp\!\!\!\perp x_2^2 \mid \{1, x_1, x_2\}$. B) Immediate inheritance: $x_1^2 \mid \{1, x_1, x_2\} = x_1^2 \mid x_1$.

Denote the conditional inclusion probability of node $\alpha$ in model $M$ by $\pi_\alpha = \pi(\gamma_\alpha(M) = 1\,|\,\gamma_{P(\alpha)}(M), \mathcal{M})$. Under the assumptions of conditional independence and immediate inheritance, the prior probability of $M$ is
\[ \pi(M\,|\,\pi_{\mathcal{M}}, \mathcal{M}) = \prod_{\alpha \in \Delta(M_F)} \pi_\alpha^{\gamma_\alpha(M)} (1 - \pi_\alpha)^{1 - \gamma_\alpha(M)} \]  (4–4)

with $\pi_{\mathcal{M}} = \{\pi_\alpha : \alpha \in \Delta(M_F)\}$. Because $M$ must be well-formulated, $\pi_\alpha = \gamma_\alpha = 0$ if $\gamma_{P(\alpha)}(M) = 0$. Thus, the product in (4–4) can be restricted to the set of nodes $\alpha \in \Delta(M) \cup C(M)$. Additional structure can be built into the prior on $\mathcal{M}$ by making assumptions about the inclusion probabilities $\pi_\alpha$, such as equality assumptions or assumptions of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on $\pi_{\mathcal{M}}$ assuming some structure among its elements, and then marginalizing out $\pi_{\mathcal{M}}$.

Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero $\pi_\alpha$ are all equal. Specifically, for a model $M \in \mathcal{M}$, it is assumed that $\pi_\alpha = \pi$ for all $\alpha \in \Delta(M) \cup C(M)$. The Bayesian specification of the HUP is completed by assuming a prior distribution for $\pi$. The choice of $\pi \sim \mathrm{Beta}(a, b)$ produces
\[ \pi_{HUP}(M\,|\,\mathcal{M}, a, b) = \frac{B(|\Delta(M)| + a,\; |C(M)| + b)}{B(a, b)} \]  (4–5)
where $B$ is the beta function. Setting $a = b = 1$ gives the particular value of
\[ \pi_{HUP}(M\,|\,\mathcal{M}, a = 1, b = 1) = \frac{1}{|\Delta(M)| + |C(M)| + 1} \binom{|\Delta(M)| + |C(M)|}{|\Delta(M)|}^{-1} \]  (4–6)

The HUP assigns equal probabilities to all models for which the sets of nodes $\Delta(M)$ and $C(M)$ have the same cardinality. This prior provides a combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of $a$ and $b$. Because $\pi_\alpha = \pi$ for all $\alpha$, this penalization can only depend on some aspect of the entire graph of $M_F$, such as the total number of nodes not in the null model, $|\Delta(M_F)|$.
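As a numerical check of (4–5) and (4–6), the following R sketch (the function name and argument names are ours, not part of the dissertation's software) evaluates the HUP prior probability from the two cardinalities it depends on.

```r
# Sketch: HUP prior probability of a model, equation (4-5).
# nD = |Delta(M)| (nodes above the base model), nC = |C(M)| (children set size).
hup_prior <- function(nD, nC, a = 1, b = 1) {
  beta(nD + a, nC + b) / beta(a, b)
}

hup_prior(2, 3)          # model {1, x1, x2} of Figure 4-4: 1/60
hup_prior(2, 3, b = 5)   # same model with b = |Delta(M_F)| = 5: 1/72
```

Both values match the HUP columns of Figure 4-4 below.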

Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero $\pi_\alpha$. Each non-zero $\pi_\alpha$ is given its own prior, which is assumed to be a Beta distribution with parameters $a_\alpha$ and $b_\alpha$. Thus, the prior probability of $M$ under the HIP is
\[ \pi_{HIP}(M\,|\,\mathcal{M}, \mathbf{a}, \mathbf{b}) = \prod_{\alpha \in \Delta(M)} \frac{a_\alpha}{a_\alpha + b_\alpha} \prod_{\alpha \in C(M)} \frac{b_\alpha}{a_\alpha + b_\alpha} \]  (4–7)

where the product over the empty set is taken to be 1. Because the $\pi_\alpha$ are totally independent, any choice of $a_\alpha$ and $b_\alpha$ is equivalent to choosing a probability of success $\pi_\alpha$ for a given $\alpha$. Setting $a_\alpha = b_\alpha = 1$ for all $\alpha \in \Delta(M) \cup C(M)$ gives the particular value of
\[ \pi_{HIP}(M\,|\,\mathcal{M}, \mathbf{a} = \mathbf{1}, \mathbf{b} = \mathbf{1}) = \left(\frac{1}{2}\right)^{|\Delta(M)| + |C(M)|} \]  (4–8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in (4–8) is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each $\gamma_j$ is conditioned on $\gamma_{<j}$ in the prior construction, the $a_\alpha$ and $b_\alpha$ for $\alpha$ of order $j$ can be conditioned on $\gamma_{<j}$. One such additional penalization utilizes the number of nodes of order $j$ that could be added to produce a WFM conditioned on the inclusion vector $\gamma_{<j}$, which is denoted as $ch_j(\gamma_{<j})$. Choosing $a_\alpha = 1$ and $b_\alpha(M) = ch_j(\gamma_{<j})$ is equivalent to choosing a probability of success $\pi_\alpha = 1/(1 + ch_j(\gamma_{<j}))$. This penalization can drive down the false positive rate when $ch_j(\gamma_{<j})$ is large, but may produce more false negatives.
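A small R sketch (our own illustration, with hypothetical argument names) makes the HIP computation in (4–7) concrete; the Beta$(a_\alpha, b_\alpha)$ hyper-priors enter only through the marginal probabilities $a_\alpha/(a_\alpha + b_\alpha)$ for included nodes and $b_\alpha/(a_\alpha + b_\alpha)$ for children.

```r
# Sketch: HIP prior probability of a model, equation (4-7).
# a_in, b_in: hyper-parameters for nodes in Delta(M); a_out, b_out: for nodes in C(M).
hip_prior <- function(a_in, b_in, a_out, b_out) {
  prod(a_in / (a_in + b_in)) * prod(b_out / (a_out + b_out))
}

# Model {1, x1, x2} in the space of Figure 4-4 with a = 1, b = ch:
# the order-one nodes x1, x2 have ch_1 = 2; the order-two children have ch_2 = 3.
hip_prior(a_in  = c(1, 1), b_in  = c(2, 2),
          a_out = c(1, 1, 1), b_out = c(3, 3, 3))   # 3/64
```

The result agrees with the HIP(1, ch) entry for that model in Figure 4-4.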

Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the $\pi_\alpha$ is to assume equality between the $\pi_\alpha$ of a given order and independence across the different orders. Define $\Delta_j(M) = \{\alpha \in \Delta(M) : \mathrm{order}(\alpha) = j\}$ and $C_j(M) = \{\alpha \in C(M) : \mathrm{order}(\alpha) = j\}$. The HOP assumes that $\pi_\alpha = \pi_j$ for all $\alpha \in \Delta_j(M) \cup C_j(M)$. Assuming that $\pi_j \sim \mathrm{Beta}(a_j, b_j)$ provides a prior probability of
\[ \pi_{HOP}(M\,|\,\mathcal{M}, \mathbf{a}, \mathbf{b}) = \prod_{j=J_{\min}}^{J_{\max}} \frac{B(|\Delta_j(M)| + a_j,\; |C_j(M)| + b_j)}{B(a_j, b_j)} \]  (4–9)

The specific choice of $a_j = b_j = 1$ for all $j$ gives a value of
\[ \pi_{HOP}(M\,|\,\mathcal{M}, \mathbf{a} = \mathbf{1}, \mathbf{b} = \mathbf{1}) = \prod_j \left[ \frac{1}{|\Delta_j(M)| + |C_j(M)| + 1} \binom{|\Delta_j(M)| + |C_j(M)|}{|\Delta_j(M)|}^{-1} \right] \]  (4–10)
and produces a hierarchical version of the Scott and Berger multiplicity correction.
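The following R sketch (again an illustration with names of our choosing, not the dissertation's code) evaluates (4–9) from the per-order counts; with $a_j = b_j = 1$ it reproduces the HOP(1, 1) column of Figure 4-4.

```r
# Sketch: HOP prior probability of a model, equation (4-9).
# nD[j] = |Delta_j(M)|, nC[j] = |C_j(M)| for each order j present in Delta(M_F).
hop_prior <- function(nD, nC, a = 1, b = 1) {
  prod(beta(nD + a, nC + b) / beta(a, b))
}

# Model {1, x1, x2} in Figure 4-4: order-1 counts (2, 0), order-2 counts (0, 3).
hop_prior(nD = c(2, 0), nC = c(0, 3))                  # 1/12, the HOP(1,1) value
hop_prior(nD = c(2, 0), nC = c(0, 3), b = c(2, 3))     # 1/12, the HOP(1,ch) value
```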

The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on $\gamma_{<j}(M)$, the indicators $\{\gamma_\alpha : \alpha \in \Delta_j(M) \cup C_j(M)\}$ are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with common probability of success $\pi_j$, with $\pi_j$ having a prior distribution. Our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order $j$ that could be added while maintaining a WFM is given by $ch_j(M) = ch_j(\gamma_{<j}(M)) = |\Delta_j(M) \cup C_j(M)|$. Using $a_j = 1$ and $b_j(M) = ch_j(M)$ produces a prior with two desirable properties. First, if $M' \subset M$ then $\pi(M) \leq \pi(M')$. Second, for each order $j$, the conditional probability of including $k$ nodes is greater than or equal to that of including $k + 1$ nodes, for $k = 0, 1, \ldots, ch_j(M) - 1$.

4.3.2 Choice of Prior Structure and Hyper-Parameters

Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities $\pi_{\mathcal{M}}$. For the sake of simplicity, this chapter focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters that can be specified automatically. First, we describe some general features about how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space.

Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice $a = b = 1$, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as $a = 1, b = ch$, where $b = ch$ has a slightly different interpretation depending on the prior structure. Accordingly, $b = ch$ is given by $b_j(M) = b_\alpha(M) = ch_j(M) = |\Delta_j(M) \cup C_j(M)|$ for the HOP and HIP, where $j = \mathrm{order}(\alpha)$, while $b = ch$ denotes that $b = |\Delta(M_F)|$ for the HUP. The prior behavior is illustrated for two model spaces. In both cases, the base model $M_B$ is taken to be the intercept-only model, and $M_F$ is as shown in Figures 4-4 and 4-5. The priors considered treat model complexity differently, and some general properties can be seen in these examples.

                                          HIP              HOP              HUP
    Model                            (1,1)   (1,ch)   (1,1)   (1,ch)   (1,1)   (1,ch)
 1  {1}                              1/4     4/9      1/3     1/2      1/3     5/7
 2  {1, x1}                          1/8     1/9      1/12    1/12     1/12    5/56
 3  {1, x2}                          1/8     1/9      1/12    1/12     1/12    5/56
 4  {1, x1, x1^2}                    1/8     1/9      1/12    1/12     1/12    5/168
 5  {1, x2, x2^2}                    1/8     1/9      1/12    1/12     1/12    5/168
 6  {1, x1, x2}                      1/32    3/64     1/12    1/12     1/60    1/72
 7  {1, x1, x2, x1^2}                1/32    1/64     1/36    1/60     1/60    1/168
 8  {1, x1, x2, x1x2}                1/32    1/64     1/36    1/60     1/60    1/168
 9  {1, x1, x2, x2^2}                1/32    1/64     1/36    1/60     1/60    1/168
10  {1, x1, x2, x1^2, x1x2}          1/32    1/192    1/36    1/120    1/30    1/252
11  {1, x1, x2, x1^2, x2^2}          1/32    1/192    1/36    1/120    1/30    1/252
12  {1, x1, x2, x1x2, x2^2}          1/32    1/192    1/36    1/120    1/30    1/252
13  {1, x1, x2, x1^2, x1x2, x2^2}    1/32    1/576    1/12    1/120    1/6     1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated to the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

First, contrast the choice of HIP, HUP, and HOP for the choice of $(a, b) = (1, 1)$. The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including $x_1$ and $x_2$ (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms.

                                  HIP              HOP              HUP
    Model                    (1,1)   (1,ch)   (1,1)   (1,ch)   (1,1)   (1,ch)
 1  {1}                      1/8     27/64    1/4     1/2      1/4     4/7
 2  {1, x1}                  1/8     9/64     1/12    1/10     1/12    2/21
 3  {1, x2}                  1/8     9/64     1/12    1/10     1/12    2/21
 4  {1, x3}                  1/8     9/64     1/12    1/10     1/12    2/21
 5  {1, x1, x3}              1/8     3/64     1/12    1/20     1/12    4/105
 6  {1, x2, x3}              1/8     3/64     1/12    1/20     1/12    4/105
 7  {1, x1, x2}              1/16    3/128    1/24    1/40     1/30    1/42
 8  {1, x1, x2, x1x2}        1/16    3/128    1/24    1/40     1/20    1/70
 9  {1, x1, x2, x3}          1/16    1/128    1/8     1/40     1/20    1/70
10  {1, x1, x2, x3, x1x2}    1/16    1/128    1/8     1/40     1/5     1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated to three main effects and one interaction term, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

In contrast to the HIP, the HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Using the HUP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., this is the only model that contains every term), and, as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with $(a, b) = (1, 1)$. As models become more complex, they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced. Thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.

In contrast, if $(a, b) = (1, ch)$, all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node $\alpha$ to a model $M$ to form $M'$ produces $p(M) \geq p(M')$. However, differences between the priors are apparent. The

HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5, the penalization of the HOP is the least dramatic, producing prior odds of 20 for $M_B$ versus $M_F$, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior

To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects $p$ grows. Second, they are compared as the depth of the hierarchy increases, or, in other words, as the maximum order $J_{\max}$ increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as $p_\alpha = \sum_{M \in \mathcal{M}} I(\alpha \in M)\, p(M\,|\,y, \mathcal{M})$ for $\alpha \in M_F$.
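For an enumerable model space this quantity is a single weighted sum over models; a minimal R sketch (object names are illustrative) is:

```r
# Sketch: marginal posterior inclusion probabilities from an enumerated space.
# `gamma_mat` is a (#models x #nodes) 0/1 inclusion matrix and `post` the vector
# of posterior model probabilities p(M | y, M); p_alpha = t(gamma_mat) %*% post.
inclusion_probs <- function(gamma_mat, post) {
  as.vector(crossprod(gamma_mat, post))
}
```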

These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on $\mathcal{M}$. For all prior structures, both the default hyper-parameters $a = b = 1$ and the penalizing choice of $a = 1$ and $b = ch$ are considered. The results for the different combinations of $M_F$ and $M_T$ incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:

1. Randomly generate main-effects matrices $X = (x_1, \ldots, x_{18})$ with $x_i \overset{iid}{\sim} N_n(0, I_n)$, and error vectors $\epsilon \sim N_n(0, I_n)$, for $n = 60$.

2. Setting all coefficient values equal to one, calculate $y = Z_{M_T}\beta + \epsilon$ for the true models given by
   $M_{T1} = \{x_1, x_2, x_3, x_1^2, x_1x_2, x_2^2, x_2x_3\}$ with $|M_{T1}| = 7$,
   $M_{T2} = \{x_1, x_2, \ldots, x_{16}\}$ with $|M_{T2}| = 16$,
   $M_{T3} = \{x_1, x_2, x_3, x_4\}$ with $|M_{T3}| = 4$,
   $M_{T4} = \{x_1, x_2, \ldots, x_8, x_1^2, x_3x_4\}$ with $|M_{T4}| = 10$,
   $M_{T5} = \{x_1, x_2, x_3, x_4, x_1^2, x_3x_4\}$ with $|M_{T5}| = 6$.

Table 4-1. Characterization of the full models MF and corresponding model spaces M considered in simulations.

Growing p, fixed maximum order          Fixed p, growing maximum order
MF               |MF|   |M|     MT used     MF               |MF|   |M|      MT used
(x1+x2+x3)^2      9      95      MT1        (x1+x2+x3)^2      9      95       MT1
(x1+...+x4)^2     14     1337    MT1        (x1+x2+x3)^3      19     2497     MT1
(x1+...+x5)^2     20     38619   MT1        (x1+x2+x3)^4      34     161421   MT1

Other model spaces
MF                              |MF|   |M|      MT used
x1 + x2 + ... + x18              18     262144   MT2, MT3
(x1+...+x4)^2 + x5 + ... + x10   20     85568    MT4, MT5

3. In all simulations, the base model $M_B$ is the intercept-only model. The notation $(x_1 + \cdots + x_p)^d$ is used to represent the full order-$d$ polynomial response surface in $p$ main effects. The model spaces, characterized by their corresponding full model $M_F$, are presented in Table 4-1, as well as the true models used in each case.

4. Enumerate the model spaces and calculate $p(M\,|\,y, \mathcal{M})$ for all $M \in \mathcal{M}$ using the EPP, HUP, HIP, and HOP, the latter three each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each $\mathcal{M}$ for the different priors.

The true positives (TP) are defined as those nodes $\alpha \in M_T$ such that $p_\alpha > 0.5$. For the false positives (FP), three different cutoffs are considered for $p_\alpha$, elucidating the adjustment for multiplicity induced by the model priors. These cutoffs are 0.10, 0.20, and 0.50, for $\alpha \notin M_T$. The results from this exercise provide insight about the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1, the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of $M_F$, the full model that defines the DAG for $\mathcal{M}$.

Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be $M_{T1}$ and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.

First, focus on the posterior when $(a, b) = (1, 1)$. As $p$ increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 50% cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, $(1, ch)$, the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as $p$ increases. These also considerably outperform the priors using the default hyper-parameters $a = b = 1$ in terms of the false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in $M_{T1}$ for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with $a = 1, b = ch$ are slightly lower for the true positives. With a 50% cutoff, the hierarchical priors keep a tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.

Growing polynomial degree, fixed main effects. For these examples, the true model is once again $M_{T1}$. When the complexity is increased by making the order of $M_F$ larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with $a = b = 1$, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.5 false positive cutoff, some false positives are included both for the EPP and for all the priors with $a = b = 1$, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high inclusion posterior probability, both with the EPP and with the $a = b = 1$ priors.

Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                 a = 1, b = 1          a = 1, b = ch
Cutoff      |MT|   MF                     EPP    HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)    7     (x1+x2+x3)^2           1.78   1.78   2.00   2.00    0.11   1.31   1.06
FP(>0.20)                                 0.43   0.43   2.00   1.98    0.01   0.28   0.24
FP(>0.50)                                 0.04   0.04   0.97   0.36    0.00   0.03   0.02
TP(>0.50)   (MT1)                         7.00   7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)    7     (x1+...+x4)^2          3.62   1.94   2.33   2.45    0.10   0.63   1.07
FP(>0.20)                                 1.60   0.47   2.17   2.15    0.01   0.17   0.24
FP(>0.50)                                 0.25   0.06   0.35   0.36    0.00   0.02   0.02
TP(>0.50)   (MT1)                         7.00   7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)    7     (x1+...+x5)^2          6.00   2.16   2.60   2.55    0.12   0.43   1.15
FP(>0.20)                                 2.91   0.55   2.13   2.18    0.02   0.19   0.27
FP(>0.50)                                 0.66   0.11   0.25   0.37    0.00   0.03   0.01
TP(>0.50)   (MT1)                         7.00   7.00   7.00   7.00    6.97   6.99   6.99

In contrast, any of the $a = 1$ and $b = ch$ priors dramatically improve upon their $a = b = 1$ counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 50% cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.

Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.

Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                 a = 1, b = 1          a = 1, b = ch
Cutoff      |MT|   MF                     EPP    HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)    7     (x1+x2+x3)^2           1.78   1.78   2.00   2.00    0.11   1.31   1.06
FP(>0.20)                                 0.43   0.43   2.00   1.98    0.01   0.28   0.24
FP(>0.50)                                 0.04   0.04   0.97   0.36    0.00   0.03   0.02
TP(>0.50)   (MT1)                         7.00   7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)    7     (x1+x2+x3)^3           7.37   5.21   6.06   2.91    0.55   1.05   1.39
FP(>0.20)                                 2.91   1.55   3.61   2.08    0.17   0.34   0.31
FP(>0.50)                                 0.40   0.21   0.50   0.26    0.03   0.03   0.04
TP(>0.50)   (MT1)                         7.00   7.00   7.00   7.00    6.97   6.98   7.00
FP(>0.10)    7     (x1+x2+x3)^4           8.22   4.00   4.69   2.61    0.52   0.55   1.32
FP(>0.20)                                 4.21   1.13   1.76   2.03    0.12   0.15   0.31
FP(>0.50)                                 0.56   0.17   0.22   0.27    0.03   0.03   0.04
TP(>0.50)   (MT1)                         7.00   7.00   7.00   7.00    6.97   6.97   6.99

By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models are models with 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 50% cutoff. In contrast, the HIP (and thus the EPP) with the 50% cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in $M_F$, so there is little potential for false positives. The $a = 1$ and $b = ch$ priors show dramatically different behavior. The HIP controls false positives well, but fails to identify the true coefficients at the 50% cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 50% cutoff.

If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with $a = 1, b = ch$ are substantially better than the EPP (and the choice of $a = b = 1$) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from $M_{T4}$ with ten terms and $M_{T5}$ with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff. At the 0.50 cutoff, some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 50% cutoff. In terms of the true positives, the EPP and the $a = b = 1$ priors always include all of the predictors in $M_{T4}$ and $M_{T5}$. On the other hand, the ability of the $a = 1, b = ch$ priors to control false positives is markedly better than that of the EPP and of the hierarchical priors with the choice $a = b = 1$. At the 50% cutoff, these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as being good default priors on the model space.

4.4 Random Walks on the Model Space

When the model space $\mathcal{M}$ is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space $\mathcal{M}$ both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which are outlined in this section.

Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                         a = 1, b = 1             a = 1, b = ch
Cutoff      |MT|   MF                         EPP     HIP     HUP     HOP      HIP     HUP     HOP
FP(>0.10)    16    x1 + x2 + ... + x18        1.93    1.93    2.00    2.00     0.03    1.80    1.80
FP(>0.20)                                     0.52    0.52    2.00    2.00     0.01    0.46    0.46
FP(>0.50)                                     0.07    0.07    2.00    2.00     0.01    0.04    0.04
TP(>0.50)   (MT2)                             15.99   15.99   16.00   16.00    6.99    15.99   15.99
FP(>0.10)    4     x1 + x2 + ... + x18        13.95   13.95   9.15    9.15     0.26    1.31    1.31
FP(>0.20)                                     5.45    5.45    3.03    3.03     0.05    0.45    0.45
FP(>0.50)                                     0.84    0.84    0.45    0.45     0.02    0.06    0.06
TP(>0.50)   (MT3)                             4.00    4.00    4.00    4.00     4.00    4.00    4.00
FP(>0.10)    10    (x1+...+x4)^2              9.73    9.71    10.00   5.60     0.34    2.33    2.20
FP(>0.20)          + x5 + ... + x10           2.65    2.65    8.73    3.05     0.12    0.74    0.69
FP(>0.50)                                     0.35    0.35    1.36    1.68     0.02    0.11    0.12
TP(>0.50)   (MT4)                             10.00   10.00   10.00   9.99     9.94    9.98    9.99
FP(>0.10)    6     (x1+...+x4)^2              13.52   13.52   11.06   9.94     0.44    1.63    1.96
FP(>0.20)          + x5 + ... + x10           4.22    4.21    3.60    5.01     0.15    0.48    0.68
FP(>0.50)                                     0.53    0.53    0.57    0.75     0.01    0.08    0.11
TP(>0.50)   (MT5)                             6.00    6.00    6.00    6.00     5.99    5.99    5.99

Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing

This first strategy relies on small, localized jumps around the model space, turning on or off a single node at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step in the algorithm, assume that the current state of the chain is model $M$. Let $p_G$ be the probability that the algorithm chooses the growth step. The proposed model $M'$ can either be $M^+ = M \cup \{\alpha\}$ for some $\alpha \in C(M)$, or $M^- = M \setminus \{\alpha\}$ for some $\alpha \in E(M)$.

An example transition kernel is defined by the mixture
\[ g(M'\,|\,M) = p_G \, q_{\mathrm{Grow}}(M'\,|\,M) + (1 - p_G)\, q_{\mathrm{Prune}}(M'\,|\,M) = \frac{I_{\{M \neq M_F\}}}{1 + I_{\{M \neq M_B\}}} \cdot \frac{I_{\{\alpha \in C(M)\}}}{|C(M)|} + \frac{I_{\{M \neq M_B\}}}{1 + I_{\{M \neq M_F\}}} \cdot \frac{I_{\{\alpha \in E(M)\}}}{|E(M)|} \]  (4–11)
where $p_G$ has explicitly been defined as 0.5 when both $C(M)$ and $E(M)$ are non-empty, and as 0 (or 1) when $C(M) = \emptyset$ (or $E(M) = \emptyset$). After choosing pruning or growing, a single node is proposed for addition to, or deletion from, $M$ uniformly at random.
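A schematic R implementation of this kernel follows; it is a sketch only, and the helper functions children() and extreme(), which return $C(M)$ and $E(M)$ for a model, are assumed rather than provided.

```r
# Sketch of the simple grow/prune proposal of equation (4-11).
# `children` and `extreme` are assumed helpers returning C(M) and E(M).
propose_local <- function(M, children, extreme) {
  C <- children(M); E <- extreme(M)
  pG <- if (length(C) == 0) 0 else if (length(E) == 0) 1 else 0.5
  if (runif(1) < pG) {                        # grow: add one node from C(M)
    alpha <- C[sample.int(length(C), 1)]
    list(model = union(M, alpha), fwd = pG / length(C))
  } else {                                    # prune: remove one node from E(M)
    alpha <- E[sample.int(length(E), 1)]
    list(model = setdiff(M, alpha), fwd = (1 - pG) / length(E))
  }
}
```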

For this simple algorithm, pruning has the reverse kernel of growing, and vice-versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step, by sampling one node $\alpha \in C(M) \cup E(M)$ and adding or removing it accordingly. Fourth, the sets of nodes from $C(M) \cup E(M)$ that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set $C(M) \cup E(M)$.

4.4.2 Degree Based Pruning and Growing

In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order. One can update the vector of inclusion indicators by blocks, denoted $\gamma_j(M)$. Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say $t$, the algorithm is at $M$. If growing, the strategy proceeds successively by order class, going from $j = J_{\min}$ up to $j = J_{\max}$, with $J_{\min}$ and $J_{\max}$ being the lowest and highest orders of nodes in $M_F \setminus M_B$, respectively. Define $M_{t(J_{\min}-1)} = M$ and set $j = J_{\min}$. The growth kernel comprises the following steps, proceeding from $j = J_{\min}$ to $j = J_{\max}$:

1) Propose a model $M'$ by selecting a set of nodes from $C_j(M_{t(j-1)})$ through the kernel $q_{\mathrm{Grow},j}(\cdot\,|\,M_{t(j-1)})$.

2) Compute the Metropolis-Hastings correction for $M'$ versus $M_{t(j-1)}$. If $M'$ is accepted, then set $M_{t(j)} = M'$; otherwise set $M_{t(j)} = M_{t(j-1)}$.

3) If $j < J_{\max}$, then set $j = j + 1$ and return to 1); otherwise proceed to 4).

4) Set $M_t = M_{t(J_{\max})}$.

The pruning step is defined in a similar fashion; however, it starts at order $j = J_{\max}$ and proceeds down to $j = J_{\min}$. Let $E_j(M') = E(M') \cap \Delta_j(M_F)$ be the set of nodes of order $j$ that can be removed from the model $M'$ to produce a WFM. Define $M_{t(J_{\max}+1)} = M$ and set $j = J_{\max}$. The pruning kernel comprises the following steps:

1) Propose a model $M'$ by selecting a set of nodes from $E_j(M_{t(j+1)})$ through the kernel $q_{\mathrm{Prune},j}(\cdot\,|\,M_{t(j+1)})$.

2) Compute the Metropolis-Hastings correction for $M'$ versus $M_{t(j+1)}$. If $M'$ is accepted, then set $M_{t(j)} = M'$; otherwise set $M_{t(j)} = M_{t(j+1)}$.

3) If $j > J_{\min}$, then set $j = j - 1$ and return to Step 1); otherwise proceed to Step 4).

4) Set $M_t = M_{t(J_{\min})}$.

It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can be combined for each $j$. The forward kernel proceeds from $j = J_{\min}$ to $j = J_{\max}$ and proposes adding or removing sets of nodes from $C_j(M) \cup E_j(M)$. The reverse kernel simply reverses the direction of $j$, proceeding from $j = J_{\max}$ to $j = J_{\min}$.

4.5 Simulation Study

To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points on the scale (McCullagh & Nelder 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.

With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with $M_B$ being the intercept-only model and $M_F$ being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normal. The response vectors are drawn from the $n$-variate normal distribution as $y \sim N_n(Z_{M_T}(X)\beta_{M_T}, I_n)$, where $M_T$ is the true model and $I_n$ is the $n \times n$ identity matrix.

The sample sizes considered are $n \in \{130, 260, 1040\}$, which ensures that $Z_{M_F}(X)$ is of full rank. The cardinality of this model space is $|\mathcal{M}| > 1.2 \times 10^{22}$, which makes enumeration of all models unfeasible. Because the value of the $2k$-th moment of the standard normal distribution increases with $k = 1, 2, \ldots$, higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., $x_1^2$ has more signal than $x_1$). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with $a = 1, b = ch$ is considered, with the EPP included for comparison.

The total number of combinations of SNR, sample size, regression coefficient values, and nodes in $M_T$ amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior of the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from $M_T$ is always the one shown in Figure 4-6.

Figure 4-6. MT: DAG of the largest true model used in simulations.

The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect

As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that, although the number of TPs is especially low with HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with the EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are smallest.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch).

Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the number of FPs is somewhat lower with the HOP.

4.5.2 Coefficient Magnitude

Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal, regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient. Finally, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. These choices are denoted by $\beta_{(1)} = c(1_{o1}, 1_{o2}, 1_{o3})$, $\beta_{(2)} = c(1_{o1}, 0.5_{o2}, 0.25_{o3})$, and $\beta_{(3)} = c(0.25_{o1}, 0.5_{o2}, 1_{o3})$, respectively. In Figure 4-7, the first four scenarios correspond to simulations with $\beta_{(1)}$, the next four use $\beta_{(2)}$, the next four correspond to $\beta_{(3)}$, and then the values are cycled in

the same way. The results show that scenarios using either $\beta_{(1)}$ or $\beta_{(3)}$ behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through $\beta_{(2)}$. In Figure 4-7, the effect of using $\beta_{(2)}$ is evident, as it corresponds to the lowest values for the TPs, regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale

Four true models were considered: (1) the model from Figure 4-6 ($M_{T1}$), (2) the model without the order-one terms ($M_{T2}$), (3) the model without the order-two terms ($M_{T3}$), and (4) the model without $x_1^2$ and $x_2x_5$ ($M_{T4}$). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with $M_{T1}$, the next point is for scenarios with $M_{T2}$, followed by those with $M_{T3}$, then with $M_{T4}$, then $M_{T1}$, etc. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms ($\beta_{(2)}$) are used.

4.6 Case Study: Ozone Data Analysis

This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper g-priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table D). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model $M_B$ is the intercept-only model and that $M_F$ is the quadratic surface in the eight meteorological variables. The model space contains approximately 71 billion models, and computation of all model posterior probabilities is not feasible.

Table 4-5. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily max 1-hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The HOP, HUP, and HIP with $a = 1$ and $b = ch$, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in equation 3–2), hyper-g (HG) priors (Liang et al. 2008) with hyper-parameters $\alpha = 2, \beta = 1$ and $\alpha = \beta = 1$, and Zellner-Siow (ZS) priors (Zellner & Siow 1980). The results were extracted for the median probability models (MPM). Additionally, the model is estimated using the R package hierNet (Bien et al. 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al. 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.

Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model, with the exception of dpg², which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.

Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.

Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. model selected using the hierarchical lasso.

BF      Prior     Model                                                      R^2      RMSE
IP      EPP       hum, dpg, ibt, hum^2, hum*dpg, hum*ibt, dpg^2, ibt^2       0.8054   4.2739
IP      HIP       hum, ibt, hum^2, hum*ibt, ibt^2                            0.7740   4.3396
IP      HOP       hum, dpg, ibt, hum^2, hum*ibt, ibt^2                       0.7848   4.3175
IP      HUP       hum, dpg, ibt, hum*ibt, ibt^2                              0.7767   4.3508
ZS      EPP       hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2                0.7896   4.2518
ZS      HIP       hum, ibt, hum*ibt, ibt^2                                   0.7525   4.3505
ZS      HOP       hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2                0.7896   4.2518
ZS      HUP       hum, dpg, ibt, hum*ibt, ibt^2                              0.7767   4.3508
HG11    EPP       vh, hum, dpg, ibt, hum^2, hum*ibt, dpg^2                   0.7701   4.3049
HG11    HIP       hum, ibt, hum*ibt, ibt^2                                   0.7525   4.3505
HG11    HOP       hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2                0.7896   4.2518
HG11    HUP       hum, dpg, ibt, hum*ibt, ibt^2                              0.7767   4.3508
HG21    EPP       hum, dpg, ibt, hum^2, hum*ibt, dpg^2                       0.7701   4.3037
HG21    HIP       hum, dpg, ibt, hum*ibt, ibt^2                              0.7767   4.3508
HG21    HOP       hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2                0.7896   4.2518
HG21    HUP       hum, dpg, ibt, hum*ibt                                     0.7526   4.4036
        HierNet   hum, temp, ibh, dpg, ibt, vis, hum^2, hum*ibt,             0.7651   4.3680
                  temp^2, temp*ibt, dpg^2

4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all $M \in \mathcal{M}$, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. The multiplicity penalty is "hidden away" in the model prior probabilities $\pi(M\,|\,\mathcal{M})$.

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are set up (e.g., in what units these predictors are expressed).

In this chapter we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP using the hyper-parameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.

In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper g-priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.

CHAPTER 5
CONCLUSIONS

Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible. They allow integrating, under a single methodology, empirical observations and theoretical process models, and can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout the document will contribute to the understanding of Bayesian methods used in ecology, and hopefully these will shed light on the differences between Bayesian estimation and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models: it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.

occupancy model for which a Gibbs sampler is available using both logit and probit

link functions This setup allows detection and occupancy probabilities to depend

on linear combinations of predictors Then we developed a dynamic version of this

approach incorporating the notion that occupancy at a previously occupied site depends

both on survival of current settlers and habitat suitability Additionally because these

dynamics also vary in space we suggest a strategy to add spatial dependence among

neighboring sites

Ecological inquiry usually involves competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which avoids introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the method accurately singles out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. Also, the simulations indicated that the method provides better discrimination for predictors in the detection component of the model.

In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities $\pi(M\,|\,\mathcal{M})$. In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are coded (e.g., in what units these predictors are expressed).

To confront this situation, we propose three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP using the hyper-parameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.

Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently, we envision three future developments: (1) occupancy models that incorporate various sources of information, (2) multi-species models that make use of spatial and interspecific dependence, and (3) methods to conduct model selection for the dynamic and spatially explicit version of the model.

APPENDIX A
FULL CONDITIONAL DENSITIES DYMOSS

In this section we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.

Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases $t = 1$, $1 < t < T$, and $t = T$, since their corresponding probabilities take on slightly different forms.

Let $\phi(\nu\,|\,\mu, \sigma^2)$ represent the density for a normal random variable $\nu$ with mean $\mu$ and variance $\sigma^2$, and recall that $\psi_{i1} = F(x'_{(o)i}\alpha)$ and $p_{ijt} = F(q'_{ijt}\lambda_t)$, where $F(\cdot)$ is the inverse link function. The full conditional for $z_{it}$ is given by:

1. For $t = 1$:
\[ \pi(z_{i1}\,|\,v_{i1}, \alpha, \lambda_1, \beta^c_1, \delta^s_1) = (\psi^*_{i1})^{z_{i1}} (1 - \psi^*_{i1})^{1 - z_{i1}} = \mathrm{Bernoulli}(\psi^*_{i1}) \]  (A–1)
where
\[ \psi^*_{i1} = \frac{\psi_{i1}\,\phi(v_{i1}\,|\,x'_{i1}\beta^c_1 + \delta^s_1, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1})}{\psi_{i1}\,\phi(v_{i1}\,|\,x'_{i1}\beta^c_1 + \delta^s_1, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1}) + (1 - \psi_{i1})\,\phi(v_{i1}\,|\,x'_{i1}\beta^c_1, 1)\prod_{j=1}^{J} I_{\{y_{ij1}=0\}}} \]

2. For $1 < t < T$:
\[ \pi(z_{it}\,|\,z_{i(t-1)}, z_{i(t+1)}, \lambda_t, \beta^c_{t-1}, \delta^s_{t-1}) = (\psi^*_{it})^{z_{it}} (1 - \psi^*_{it})^{1 - z_{it}} = \mathrm{Bernoulli}(\psi^*_{it}) \]  (A–2)
where
\[ \psi^*_{it} = \frac{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt})}{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt}) + \nabla_{it}\prod_{j=1}^{J} I_{\{y_{ijt}=0\}}} \]
with
(a) $\kappa_{it} = F(x'_{i(t-1)}\beta^c_{t-1} + z_{i(t-1)}\delta^s_{t-1})\,\phi(v_{it}\,|\,x'_{it}\beta^c_t + \delta^s_t, 1)$, and
(b) $\nabla_{it} = \left(1 - F(x'_{i(t-1)}\beta^c_{t-1} + z_{i(t-1)}\delta^s_{t-1})\right)\phi(v_{it}\,|\,x'_{it}\beta^c_t, 1)$.

3. For $t = T$:
\[ \pi(z_{iT}\,|\,z_{i(T-1)}, \lambda_T, \beta^c_{T-1}, \delta^s_{T-1}) = (\psi^\star_{iT})^{z_{iT}} (1 - \psi^\star_{iT})^{1 - z_{iT}} = \mathrm{Bernoulli}(\psi^\star_{iT}) \]  (A–3)
where
\[ \psi^\star_{iT} = \frac{\kappa^\star_{iT}\prod_{j=1}^{J_{iT}}(1 - p_{ijT})}{\kappa^\star_{iT}\prod_{j=1}^{J_{iT}}(1 - p_{ijT}) + \nabla^\star_{iT}\prod_{j=1}^{J} I_{\{y_{ijT}=0\}}} \]
with
(a) $\kappa^\star_{iT} = F(x'_{i(T-1)}\beta^c_{T-1} + z_{i(T-1)}\delta^s_{T-1})$, and
(b) $\nabla^\star_{iT} = 1 - F(x'_{i(T-1)}\beta^c_{T-1} + z_{i(T-1)}\delta^s_{T-1})$.

Sampler $u_i$

1.
\[ \pi(u_i\,|\,z_{i1}, \alpha) = \mathrm{tr}\,N\!\left(x'_{(o)i}\alpha,\; 1,\; \mathrm{trunc}(z_{i1})\right), \quad \text{where } \mathrm{trunc}(z_{i1}) = \begin{cases} (-\infty, 0] & z_{i1} = 0 \\ (0, \infty) & z_{i1} = 1 \end{cases} \]  (A–4)
and $\mathrm{tr}\,N(\mu, \sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$, and truncation region $A$.

Sampler $\alpha$

1.
\[ \pi(\alpha\,|\,u) \propto [\alpha]\prod_{i=1}^{N}\phi(u_i;\, x'_{(o)i}\alpha,\, 1) \]  (A–5)
If $[\alpha] \propto 1$, then
\[ \alpha\,|\,u \sim N(m(\alpha), \Sigma_\alpha), \]
with $m(\alpha) = \Sigma_\alpha X'_{(o)}u$ and $\Sigma_\alpha = (X'_{(o)}X_{(o)})^{-1}$.
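For reference, a minimal R sketch of this multivariate normal draw (assuming the flat prior $[\alpha] \propto 1$ and using illustrative argument names) is:

```r
# Sketch: draw alpha | u ~ N((X'X)^{-1} X'u, (X'X)^{-1}) for the probit occupancy step.
draw_alpha <- function(X_o, u) {
  Sigma <- solve(crossprod(X_o))              # (X'X)^{-1}
  m     <- Sigma %*% crossprod(X_o, u)        # posterior mean m(alpha)
  drop(m + t(chol(Sigma)) %*% rnorm(ncol(X_o)))
}
```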

Sampler $v_{it}$

1. (For $t > 1$)
\[ \pi(v_{i(t-1)}\,|\,z_{i(t-1)}, z_{it}, \beta^c_{t-1}, \delta^s_{t-1}) = \mathrm{tr}\,N\!\left(\mu^{(v)}_{i(t-1)},\; 1,\; \mathrm{trunc}(z_{it})\right) \]  (A–6)
where $\mu^{(v)}_{i(t-1)} = x'_{i(t-1)}\beta^c_{t-1} + z_{i(t-1)}\delta^s_{t-1}$ and $\mathrm{trunc}(z_{it})$ defines the corresponding truncation region given by $z_{it}$.

Sampler $\left(\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}\right)$

1. (For $t > 1$)
\[ \pi(\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}\,|\,v_{t-1}, z_{t-1}) \propto [\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}]\prod_{i=1}^{N}\phi(v_{i(t-1)};\, x'_{i(t-1)}\beta^{(c)}_{t-1} + z_{i(t-1)}\delta^{(s)}_{t-1},\, 1) \]  (A–7)
If $[\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}] \propto 1$, then
\[ \beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}\,|\,v_{t-1}, z_{t-1} \sim N\!\left(m(\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}),\, \Sigma_{t-1}\right), \]
with $m(\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}) = \Sigma_{t-1}\tilde{X}'_{t-1}v_{t-1}$ and $\Sigma_{t-1} = (\tilde{X}'_{t-1}\tilde{X}_{t-1})^{-1}$, where $\tilde{X}_{t-1} = (X_{t-1}, z_{t-1})$.

Sampler $w_{ijt}$

1. (For $t \geq 1$ and $z_{it} = 1$)
\[ \pi(w_{ijt}\,|\,z_{it} = 1, y_{ijt}, \lambda_t) = \mathrm{tr}\,N\!\left(q'_{ijt}\lambda_t,\; 1,\; \mathrm{trunc}(y_{ijt})\right) \]  (A–8)

Sampler $\lambda_t$

1. (For $t = 1, 2, \ldots, T$)
\[ \pi(\lambda_t\,|\,z_t, w_t) \propto [\lambda_t]\prod_{i:\,z_{it}=1}\ \prod_{j=1}^{J_{it}}\phi(w_{ijt};\, q'_{ijt}\lambda_t,\, 1) \]  (A–9)
If $[\lambda_t] \propto 1$, then
\[ \lambda_t\,|\,w_t, z_t \sim N(m(\lambda_t), \Sigma_{\lambda_t}), \]
with $m(\lambda_t) = \Sigma_{\lambda_t}Q'_t w_t$ and $\Sigma_{\lambda_t} = (Q'_tQ_t)^{-1}$, where $Q_t$ and $w_t$, respectively, are the design matrix and the vector of latent variables for surveys of sites such that $z_{it} = 1$.

APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state $M$, the global jump is performed by drawing a model $M'$ at random from the model space. This is achieved by beginning at the base model and increasing the order from $J_{\min}$ to $J_{\max}$, the minimum and maximum orders of nodes in $\Delta(M_F) = M_F \setminus M_B$; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is
\[ \alpha = \min\left\{1,\; \frac{m(y\,|\,M', \mathcal{M})}{m(y\,|\,M, \mathcal{M})}\right\} \]

Local Jump. From the current state $M$, the local jump is performed by drawing a model from the set of models $L(M) = \{M_\alpha : \alpha \in E(M) \cup C(M)\}$, where $M_\alpha$ is $M \setminus \{\alpha\}$ for $\alpha \in E(M)$ and $M \cup \{\alpha\}$ for $\alpha \in C(M)$. The proposal probabilities for the model are computed as a mixture of $p(M'\,|\,y, \mathcal{M}, M' \in L(M))$ and the discrete uniform distribution. The proposal kernel is
\[ q(M'\,|\,y, \mathcal{M}, M' \in L(M)) = \frac{1}{2}\left(p(M'\,|\,y, \mathcal{M}, M' \in L(M)) + \frac{1}{|L(M)|}\right) \]
This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is
\[ \alpha = \min\left\{1,\; \frac{m(y\,|\,M', \mathcal{M})}{m(y\,|\,M, \mathcal{M})}\cdot\frac{q(M\,|\,y, \mathcal{M}, M \in L(M'))}{q(M'\,|\,y, \mathcal{M}, M' \in L(M))}\right\} \]

Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model $M'$, define $L_j(M') = \{M'\} \cup \{M'_\alpha : \alpha \in (E(M') \cup C(M')) \cap \Delta_j(M_F)\}$. From a state $M$, the kernel chooses at random whether to increase or decrease the order. If $M = M_F$, then decreasing the order is chosen with probability 1, and if $M = M_B$, then increasing the order is chosen with probability 1; in all other cases, the probability of increasing or decreasing the order is 1/2. The proposal kernels are given by:

Increasing order proposal kernel

1. Set $j = J_{\min} - 1$ and $M'_j = M$.

2. Draw $M'_{j+1}$ from $q_{\mathrm{inc},j+1}(\cdot\,|\,y, \mathcal{M}, M' \in L_{j+1}(M'_j))$, where
\[ q_{\mathrm{inc},j+1}(M'\,|\,y, \mathcal{M}, M' \in L_{j+1}(M'_j)) = \frac{1}{2}\left(p(M'\,|\,y, \mathcal{M}, M' \in L_{j+1}(M'_j)) + \frac{1}{|L_{j+1}(M'_j)|}\right) \]

3. Set $j = j + 1$.

4. If $j < J_{\max}$, then return to 2; otherwise proceed to 5.

5. Set $M' = M'_{J_{\max}}$ and compute the proposal probability
\[ q_{\mathrm{inc}}(M'\,|\,y, \mathcal{M}, M) = \prod_{j=J_{\min}-1}^{J_{\max}-1} q_{\mathrm{inc},j+1}(M'_{j+1}\,|\,y, \mathcal{M}, M' \in L_{j+1}(M'_j)) \]  (B–1)

Decreasing order proposal kernel

1. Set $j = J_{\max} + 1$ and $M'_j = M$.

2. Draw $M'_{j-1}$ from $q_{\mathrm{dec},j-1}(\cdot\,|\,y, \mathcal{M}, M' \in L_{j-1}(M'_j))$, where
\[ q_{\mathrm{dec},j-1}(M'\,|\,y, \mathcal{M}, M' \in L_{j-1}(M'_j)) = \frac{1}{2}\left(p(M'\,|\,y, \mathcal{M}, M' \in L_{j-1}(M'_j)) + \frac{1}{|L_{j-1}(M'_j)|}\right) \]

3. Set $j = j - 1$.

4. If $j > J_{\min}$, then return to Step 2; otherwise proceed to Step 5.

5. Set $M' = M'_{J_{\min}}$ and compute the proposal probability
\[ q_{\mathrm{dec}}(M'\,|\,y, \mathcal{M}, M) = \prod_{j=J_{\min}+1}^{J_{\max}+1} q_{\mathrm{dec},j-1}(M'_{j-1}\,|\,y, \mathcal{M}, M' \in L_{j-1}(M'_j)) \]  (B–2)

If increasing order is chosen, then the MH correction is given by
\[ \alpha = \min\left\{1,\; \left(\frac{1 + I(M' = M_F)}{1 + I(M = M_B)}\right)\frac{q_{\mathrm{dec}}(M\,|\,y, \mathcal{M}, M')}{q_{\mathrm{inc}}(M'\,|\,y, \mathcal{M}, M)}\cdot\frac{p(M'\,|\,y, \mathcal{M})}{p(M\,|\,y, \mathcal{M})}\right\} \]  (B–3)
and similarly if decreasing order is chosen.

Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel $q$ can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.

APPENDIX C
WFM SIMULATION DETAILS

Briefly, the idea is to let $Z_{M_T}(X)\beta_{M_T} = (QR)\beta_{M_T} = Q\eta_{M_T}$ (i.e., $\beta_{M_T} = R^{-1}\eta_{M_T}$), using the QR decomposition. As such, setting all values in $\eta_{M_T}$ proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation $y_i$ is $\mathrm{var}(y_i) = \mathrm{var}(E[y_i\,|\,z_i]) + E[\mathrm{var}(y_i\,|\,z_i)]$, where $z_i$ is the $i$-th row of the design matrix $Z_{M_T}$. Hence, we take the signal-to-noise ratio for each observation to be
\[ \mathrm{SNR}(\eta) = \frac{\eta'_{M_T} R^{-T}\,\Sigma_z\, R^{-1}\eta_{M_T}}{\sigma^2}, \]
where $\Sigma_z = \mathrm{var}(z_i)$. We determine how the signal is distributed across predictors up to a proportionality constant, in order to control the signal-to-noise ratio simultaneously.
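A short R sketch of this construction (the toy design matrix, target SNR, and object names are chosen for illustration only) is:

```r
# Sketch: distribute the signal uniformly via the QR decomposition and rescale
# the coefficients so that each observation has a target signal-to-noise ratio.
set.seed(1)
n <- 130
X <- matrix(rnorm(n * 3), n, 3)
Z <- cbind(X, X[, 1]^2, X[, 1] * X[, 2])     # toy Z_{M_T}(X)

eta    <- rep(1, ncol(Z))                    # equal signal in every direction
R      <- qr.R(qr(Z))
beta   <- backsolve(R, eta)                  # beta = R^{-1} eta, so Z beta = Q eta
sigma2 <- 1
snr    <- drop(crossprod(beta, cov(Z) %*% beta)) / sigma2
beta   <- beta * sqrt(0.25 / snr)            # rescale to SNR(eta) = 0.25
```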

Additionally, to investigate the ability of the model to correctly capture the hierarchical structure, we specify four different 0-1 vectors that determine the predictors in $M_T$, which generates the data in the different scenarios.

Table C-1. Experimental conditions, WFM simulations (here $1_m$ denotes a vector of $m$ ones for the order-one, order-two, and order-three blocks of $M_T$).

Parameter            Values considered
SNR(η_{M_T}) = k     0.25, 1, 4
η_{M_T} ∝            (1, 1_3, 1_4, 1_2),  (1, 1_3, (1/2)1_4, (1/4)1_2),  (1, (1/4)1_3, (1/2)1_4, 1_2)
γ_{M_T}              inclusion vectors for M_T1 and for M_T1 without the order-one terms,
                     without the order-two terms, and without x_1^2 and x_2x_5
n                    130, 260, 1040

The results presented below are somewhat different from those found in the main body of the chapter in Section 4.5. They are obtained by averaging the number of FPs, TPs, and model sizes over the 100 independent runs and across the corresponding scenarios for the 20 highest-probability models.

SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect more pronounced when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that, although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from being impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) has strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either having a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides a stronger control on the number of FPs included when considering small sample sizes combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.

Figure C-1. SNR vs. n. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal, regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2, these values are denoted by $\beta = c(1_{o1}, 0.5_{o2}, 0.25_{o3})$, $\beta = c(1_{o1}, 1_{o2}, 1_{o3})$, and $\beta = c(0.25_{o1}, 0.5_{o2}, 1_{o3})$, respectively.

Observe that the number of FPs is insensitive to how the SNR is distributed across predictors when using the HOP(1, ch); conversely, when using the EPP, the number of FPs decreases as the SNR grows, always remaining slightly higher than that obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β = c(1_o1, 1_o2, 1_o3) or β = c(0.25_o1, 0.5_o2, 1_o3)), the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.
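For reference, the R sketch below illustrates the strong-heredity (well-formulated model) requirement invoked here, using a hypothetical representation of polynomial terms as exponent vectors over the main effects; it is a simplified illustration of the rule, not the selection algorithm of Chapter 4.

```r
# Sketch: check strong heredity for polynomial terms coded as exponent vectors over the
# main effects, e.g. x1*x2^2 -> c(1, 2, 0). A model is well formulated if, for every term,
# each "parent" obtained by lowering one of its exponents by 1 is also in the model.
is_well_formulated <- function(terms) {            # `terms`: list of exponent vectors
  has_term <- function(t) any(sapply(terms, function(s) all(s == t)))
  for (t in terms) {
    for (j in seq_along(t)) {
      if (t[j] > 0) {
        parent <- t
        parent[j] <- parent[j] - 1                 # lower-order parent of t
        if (sum(parent) > 0 && !has_term(parent)) return(FALSE)
      }
    }
  }
  TRUE
}

# Example: {x1, x2, x1*x2} is well formulated, {x1, x1*x2} is not (x2 is missing).
# is_well_formulated(list(c(1, 0), c(0, 1), c(1, 1)))  # TRUE
# is_well_formulated(list(c(1, 0), c(1, 1)))           # FALSE
```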

Special Points on the Scale

In Nelder (1998), the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that, in this context, the principle is commonly not valid in practice. In addition, the author states that considering only well-formulated models does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified by the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data-generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model M_T from Figure 4-6, but have, respectively, zero-valued coefficients for all the order-one terms, for all the order-two terms, and for x_1^2 and x_2x_5.

Figure C-2. SNR vs. coefficient values. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar across the four models in terms of both the TPs and the FPs. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order terms.

Figure C-3. SNR vs. different true models M_T. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing, but with high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for x_1^2 and x_2x_5, and it is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms forces their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x_1x_2x_5, which depends on the inclusion of the three order-two terms (x_1x_2, x_1x_5, x_2x_5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single covariate or, at most, on a few covariates. A true data-generating mechanism that removes all terms of a given order is, in the context of polynomial models, clearly not justified; here, this was done only for comparison purposes.


APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang et al. (2008); they are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset

Name    Description
ozone   Daily max 1-hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The marginal posterior inclusion probability corresponds to the probability of including a given term in the full model M_F after summing over all models in the model space. For each node α ∈ M_F this probability is given by

\[
p_\alpha \;=\; \sum_{M \in \mathcal{M}} \mathrm{I}(\alpha \in M)\, p(M \mid \mathbf{y}).
\]

In problems with a large model space, such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible. Thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space M.
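A minimal sketch of this estimator is given below, under the assumption that the random-walk output is available as a list of visited models together with (possibly unnormalized) posterior weights; the object names models, weights, and terms are hypothetical.

```r
# Sketch: estimate marginal posterior inclusion probabilities from the models visited
# by the random walk. `models` is a list of character vectors of included terms and
# `weights` the corresponding (unnormalized) posterior weights of those models.
estimate_mpip <- function(models, weights, terms) {
  w <- weights / sum(weights)                              # renormalize over visited models
  sapply(terms, function(a)
    sum(w[sapply(models, function(m) a %in% m)]))          # p_alpha: total weight of models containing alpha
}

# Example (hypothetical): estimate_mpip(models, weights, terms = c("hum", "dpg", "ibt"))
```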

Given that there are 44 potential predictors in total, for convenience Tables D-2 to D-5 below display only the marginal posterior inclusion probabilities for the terms included under at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(11), and Hyper-g(21)).


Table D-2. Marginal inclusion probabilities, intrinsic prior

          EPP    HIP    HUP    HOP
hum       0.99   0.69   0.85   0.76
dpg       0.85   0.48   0.52   0.53
ibt       0.99   1.00   1.00   1.00
hum^2     0.76   0.51   0.43   0.62
hum·dpg   0.55   0.02   0.03   0.17
hum·ibt   0.98   0.69   0.84   0.75
dpg^2     0.72   0.36   0.25   0.46
ibt^2     0.59   0.78   0.57   0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior

          EPP    HIP    HUP    HOP
hum       0.76   0.67   0.80   0.69
dpg       0.89   0.50   0.55   0.58
ibt       0.99   1.00   1.00   1.00
hum^2     0.57   0.49   0.40   0.57
hum·ibt   0.72   0.66   0.78   0.68
dpg^2     0.81   0.38   0.31   0.51
ibt^2     0.54   0.76   0.55   0.77

Table D-4. Marginal inclusion probabilities, Hyper-g11

          EPP    HIP    HUP    HOP
vh        0.54   0.05   0.10   0.11
hum       0.81   0.67   0.80   0.69
dpg       0.90   0.50   0.55   0.58
ibt       0.99   1.00   0.99   0.99
hum^2     0.61   0.49   0.40   0.57
hum·ibt   0.78   0.66   0.78   0.68
dpg^2     0.83   0.38   0.30   0.51
ibt^2     0.49   0.76   0.54   0.77

Table D-5. Marginal inclusion probabilities, Hyper-g21

          EPP    HIP    HUP    HOP
hum       0.79   0.64   0.73   0.67
dpg       0.90   0.52   0.60   0.59
ibt       0.99   1.00   0.99   1.00
hum^2     0.60   0.47   0.37   0.55
hum·ibt   0.76   0.64   0.71   0.67
dpg^2     0.82   0.41   0.36   0.52
ibt^2     0.47   0.73   0.49   0.75


REFERENCES

Akaike H (1983) Information measures and model selection Bull Int Statist Inst 50277ndash290

Albert J H amp Chib S (1993) Bayesian-analysis of binary and polychotomousresponse data Journal of the American Statistical Association 88(422) 669ndash679

Berger J amp Bernardo J (1992) On the development of reference priors BayesianStatistics 4 (pp 35ndash60)

URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf

Berger J amp Pericchi L (1996) The intrinsic Bayes factor for model selection andprediction Journal of the American Statistical Association 91(433) 109ndash122

URL httpamstattandfonlinecomdoiabs10108001621459199610476668

Berger J Pericchi L amp Ghosh J (2001) Objective Bayesian methods for modelselection introduction and comparison In Model selection vol 38 of IMS LectureNotes Monogr Ser (pp 135ndash207) Inst Math Statist

URL httpwwwjstororgstable1023074356165

Besag J York J amp Mollie A (1991) Bayesian Image-Restoration with 2 Applicationsin Spatial Statistics Annals of the Institute of Statistical Mathematics 43 1ndash20

Bien J Taylor J amp Tibshirani R (2013) A lasso for hierarchical interactions TheAnnals of Statistics 41(3) 1111ndash1141

URL httpprojecteuclidorgeuclidaos1371150895

Breiman L amp Friedman J (1985) Estimating optimal transformations for multipleregression and correlation Journal of the American Statistical Association 80580ndash598

Brusco M J Steinley D amp Cradit J D (2009) An exact algorithm for hierarchicallywell-formulated subsets in second-order polynomial regression Technometrics 51(3)306ndash315

Casella G Giron F J Martınez M L amp Moreno E (2009) Consistency of Bayesianprocedures for variable selection The Annals of Statistics 37 (3) 1207ndash1228

URL httpprojecteuclidorgeuclidaos1239369020

Casella G Moreno E amp Giron F (2014) Cluster Analysis Model Selection and PriorDistributions on Models Bayesian Analysis TBA(TBA) 1ndash46

URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf


Chipman H (1996) Bayesian variable selection with related predictors CanadianJournal of Statistics 24(1) 17ndash36

URL httponlinelibrarywileycomdoi1023073315687abstract

Clyde M amp George E I (2004) Model Uncertainty Statistical Science 19(1) 81ndash94

URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274

Dewey J (1958) Experience and nature New York Dover Publications

Dorazio R M amp Taylor-Rodrıguez D (2012) A Gibbs sampler for Bayesian analysis ofsite-occupancy data Methods in Ecology and Evolution 3 1093ndash1098

Ellison A M (2004) Bayesian inference in ecology Ecology Letters 7 509ndash520

Fiske I amp Chandler R (2011) unmarked An R package for fitting hierarchical modelsof wildlife occurrence and abundance Journal of Statistical Software 43(10)

URL httpcorekmiopenacukdownloadpdf5701760pdf

George E (2000) The variable selection problem Journal of the American StatisticalAssociation 95(452) 1304ndash1308

URL httpwwwtandfonlinecomdoiabs10108001621459200010474336

Giron F J Moreno E Casella G amp Martınez M L (2010) Consistency of objectiveBayes factors for nonnested linear models and increasing model dimension Revistade la Real Academia de Ciencias Exactas Fisicas y Naturales Serie A Matematicas104(1) 57ndash67

URL httpwwwspringerlinkcomindex105052RACSAM201006

Good I J (1950) Probability and the Weighing of Evidence New York Haffner

Griepentrog G L Ryan J M amp Smith L D (1982) Linear transformations ofpolynomial regression-models American Statistician 36(3) 171ndash174

Gunel E amp Dickey J (1974) Bayes factors for independence in contingency tablesBiometrika 61 545ndash557

Hanski I (1994) A Practical Model of Metapopulation Dynamics Journal of AnimalEcology 63 151ndash162

Hooten M (2006) Hierarchical spatio-temporal models for ecological processesDoctoral dissertation University of Missouri-Columbia

URL httpsmospacelibraryumsystemeduxmluihandle103554500

Hooten M B amp Hobbs N T (2014) A Guide to Bayesian Model Selection forEcologists Ecological Monographs (In Press)


Hughes J amp Haran M (2013) Dimension reduction and alleviation of confoundingfor spatial generalized linear mixed models Journal of the Royal Statistical SocietySeries B Statistical Methodology 75 139ndash159

Hurvich C M amp Tsai C-L (1989) Regression and time series model selection insmall samples Biometrika 76 297ndash307

URL httpbiometoxfordjournalsorgcontent762297abstract

Jeffreys H (1935) Some tests of significance treated by the theory of probabilityProcedings of the Cambridge Philosophy Society 31 203ndash222

Jeffreys H (1961) Theory of Probability London Oxford University Press 3rd ed

Johnson D Conn P Hooten M Ray J amp Pond B (2013) Spatial occupancymodels for large data sets Ecology 94(4) 801ndash808

URL httpwwwesajournalsorgdoiabs10189012-05641mi=3eywlhampaf=R

ampsearchText=human+population

Kass R amp Wasserman L (1995) A reference Bayesian test for nested hypothesesand its relationship to the Schwarz criterion Journal of the American StatisticalAssociation 90(431)

URL httpamstattandfonlinecomdoiabs10108001621459199510476592

Kass R E amp Raftery A E (1995) Bayes Factors Journal of the American StatisticalAssociation 90 773ndash795

URL httpwwwtandfonlinecomdoiabs10108001621459199510476572$

delimiter026E30F$nhttpwwwtandfonlinecomdoiabs10108001621459

199510476572UvBybrTIgcs

Kass R E amp Wasserman L (1996) The Selection of Prior Distributions by FormalRules Journal of the American Statistical Association 91(435) 1343

URL httpwwwjstororgstable2291752origin=crossref

Kery M (2010) Introduction to WinBUGS for Ecologists Bayesian Approach toRegression ANOVA Mixed Models and Related Analyses Academic Press 1st ed

Kery M Gardner B amp Monnerat C (2010) Predicting species distributions fromchecklist data using site-occupancy models Journal of Biogeography 37 (10)1851ndash1862 Kery Marc Gardner Beth Monnerat Christian

Khuri A (2002) Nonsingular linear transformations of the control variables in responsesurface models Technical Report

Krebs C J (1972) Ecology the experimental analysis of distribution and abundance


Lempers F B (1971) Posterior probabilities of alternative linear models University ofRotterdam Press Rotterdam

Leon-Novelo L Moreno E amp Casella G (2012) Objective Bayes model selection inprobit models Statistics in medicine 31(4) 353ndash65

URL httpwwwncbinlmnihgovpubmed22162041

Liang F Paulo R Molina G Clyde M a amp Berger J O (2008) Mixtures of g Priorsfor Bayesian Variable Selection Journal of the American Statistical Association103(481) 410ndash423

URL httpwwwtandfonlinecomdoiabs101198016214507000001337

Link W amp Barker R (2009) Bayesian inference with ecological applications Elsevier

URL httpbooksgooglecombookshl=enamplr=ampid=hecon2l2QPcCampoi=fnd

amppg=PP2ampdq=Bayesian+Inference+with+ecological+applicationsampots=S82_

0pxrNmampsig=L3xbsSQcKD8FV6rxCMp2pmP2JKk

MacKenzie D amp Nichols J (2004) Occupancy as a surrogate for abundanceestimation Animal biodiversity and conservation 1 461ndash467

URL httpcrsitbacidmediajurnalrefslandscapemackenzie2004zhpdf

MacKenzie D Nichols J amp Hines J (2003) Estimating site occupancy colonizationand local extinction when a species is detected imperfectly Ecology 84(8)2200ndash2207

URL httpwwwesajournalsorgdoiabs10189002-3090

MacKenzie D I Bailey L L amp Nichols J D (2004) Investigating speciesco-occurrence patterns when species Journal of Animal Ecology 73 546ndash555

MacKenzie D I Nichols J D Lachman G B Droege S Royle J A amp LangtimmC A (2002) Estimating site occupancy rates when detection probabilities are lessthan one Ecology 83(8) 2248ndash2255

Mazerolle M amp Mazerolle M (2013) Package rsquoAICcmodavgrsquo (c)

URL ftpheanetarchivegnewsenseorgdisk1CRANwebpackages

AICcmodavgAICcmodavgpdf

McCullagh P amp Nelder J A (1989) Generalized linear models (2nd ed) LondonEngland Chapman amp Hall

McQuarrie A Shumway R amp Tsai C-L (1997) The model selection criterion AICu


Moreno E Bertolino F amp Racugno W (1998) An intrinsic limiting procedure for modelselection and hypotheses testing Journal of the American Statistical Association93(444) 1451ndash1460

Moreno E Giron F J amp Casella G (2010) Consistency of objective Bayes factors asthe model dimension grows The Annals of Statistics 38(4) 1937ndash1952

URL httpprojecteuclidorgeuclidaos1278861238

Nelder J A (1977) Reformulation of linear-models Journal of the Royal StatisticalSociety Series A - Statistics in Society 140 48ndash77

Nelder J A (1998) The selection of terms in response-surface models - how strong isthe weak-heredity principle American Statistician 52(4) 315ndash318

Nelder J A (2000) Functional marginality and response-surface fitting Journal ofApplied Statistics 27 (1) 109ndash112

Nichols J Hines J amp Mackenzie D (2007) Occupancy estimation and modeling withmultiple states and state uncertainty Ecology 88(6) 1395ndash1400

URL httpwwwesajournalsorgdoipdf10189006-1474

Ovaskainen O Hottola J amp Siitonen J (2010) Modeling species co-occurrenceby multivariate logistic regression generates new hypotheses on fungal interactionsEcology 91(9) 2514ndash21

URL httpwwwncbinlmnihgovpubmed20957941

Peixoto J L (1987) Hierarchical variable selection in polynomial regression-modelsAmerican Statistician 41(4) 311ndash313

Peixoto J L (1990) A property of well-formulated polynomial regression-modelsAmerican Statistician 44(1) 26ndash30

Pericchi L R (2005) Model selection and hypothesis testing based on objectiveprobabilities and bayes factors In Handbook of Statistics Elsevier

Polson N G Scott J G amp Windle J (2013) Bayesian Inference for Logistic ModelsUsing Polya-Gamma Latent Variables Journal of the American Statistical Association108 1339ndash1349

URL httpdxdoiorg101080016214592013829001

Rao C R amp Wu Y (2001) On model selection vol Volume 38 of Lecture NotesndashMonograph Series (pp 1ndash57) Beachwood OH Institute of Mathematical Statistics

URL httpdxdoiorg101214lnms1215540960


Reich B J Hodges J S amp Zadnik V (2006) Effects of residual smoothing on theposterior of the fixed effects in disease-mapping models Biometrics 62 1197ndash1206

Reiners W amp Lockwood J (2009) Philosophical Foundations for the Practices ofEcology Cambridge University Press

URL httpbooksgooglecombooksid=dr9cPgAACAAJ

Rigler F amp Peters R (1995) Excellence in Ecology Science and Limnology EcologyInstitute Germany

URL httportoncatieaccrcgi-binwxisexeIsisScript=CIENLxis

ampmethod=postampformato=2ampcantidad=1ampexpresion=mfn=008268

Robert C Chopin N amp Rousseau J (2009) Harold Jeffreysrsquo Theory of Probabilityrevisited Statistical Science Volume 24(2) 141ndash179

URL httpswwwnewtonacukpreprintsNI08021pdf

Robert C P (1993) A note on jeffreys-lindley paradox Statistica Sinica 3 601ndash608

Royle J A amp Kery M (2007) A Bayesian state-space formulation of dynamicoccupancy models Ecology 88(7) 1813ndash23

URL httpwwwncbinlmnihgovpubmed17645027

Scott J amp Berger J (2010) Bayes and Empirical-Bayes Multiplicity Adjustment in thevariable selection problem The Annals of Statistics

URL httpprojecteuclidorgeuclidaos1278861454

Spiegelhalter D J amp Smith A F M (1982) Bayes factor for linear and log-linearmodels with vague prior information J R Statist Soc B 44 377ndash387

Tierney L amp Kadane J B (1986) Accurate approximations for posterior moments andmarginal densities Journal of the American Statistical Association 81 82ndash86

Tyre A J Tenhumberg B Field S a Niejalke D Parris K amp Possingham H P(2003) Improving Precision and Reducing Bias in Biological Surveys EstimatingFalse-Negative Error Rates Ecological Applications 13(6) 1790ndash1801

URL httpwwwesajournalsorgdoiabs10189002-5078

Waddle J H Dorazio R M Walls S C Rice K G Beauchamp J Schuman M Jamp Mazzotti F J (2010) A new parameterization for estimating co-occurrence ofinteracting species Ecological applications a publication of the Ecological Society ofAmerica 20 1467ndash1475

Wasserman L (2000) Bayesian Model Selection and Model Averaging Journal ofmathematical psychology 44(1) 92ndash107


URL httpwwwncbinlmnihgovpubmed10733859

Wilson M Iversen E Clyde M A Schmidler S C amp Schildkraut J M (2010)Bayesian model search and multilevel inference for SNP association studies TheAnnals of Applied Statistics 4(3) 1342ndash1364

URL httpwwwncbinlmnihgovpmcarticlesPMC3004292

Womack A J Leon-Novelo L amp Casella G (2014) Inference from Intrinsic BayesProcedures Under Model Selection and Uncertainty Journal of the AmericanStatistical Association (June) 140114063448000

URL httpwwwtandfonlinecomdoiabs101080016214592014880348

Yuan M Joseph V R amp Zou H (2009) Structured variable selection and estimationThe Annals of Applied Statistics 3(4) 1738ndash1757

URL httpprojecteuclidorgeuclidaoas1267453962

Zeller K A Nijhawan S Salom-Perez R Potosme S H amp Hines J E (2011)Integrating occupancy modeling and interview data for corridor identification A casestudy for jaguars in nicaragua Biological Conservation 144(2) 892ndash901

Zellner A amp Siow A (1980) Posterior odds ratios for selected regression hypothesesIn Trabajos de estadıstica y de investigacion operativa (pp 585ndash603)

URL httpwwwspringerlinkcomindex5300770UP12246M9pdf


BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a BS degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a PhD in interdisciplinary ecology with a concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.


  • ACKNOWLEDGMENTS
  • TABLE OF CONTENTS
  • LIST OF TABLES
  • LIST OF FIGURES
  • ABSTRACT
  • 1 GENERAL INTRODUCTION
    • 11 Occupancy Modeling
    • 12 A Primer on Objective Bayesian Testing
    • 13 Overview of the Chapters
      • 2 MODEL ESTIMATION METHODS
        • 21 Introduction
          • 211 The Occupancy Model
          • 212 Data Augmentation Algorithms for Binary Models
            • 22 Single Season Occupancy
              • 221 Probit Link Model
              • 222 Logit Link Model
                • 23 Temporal Dynamics and Spatial Structure
                  • 231 Dynamic Mixture Occupancy State-Space Model
                  • 232 Incorporating Spatial Dependence
                    • 24 Summary
                      • 3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS
                        • 31 Introduction
                        • 32 Objective Bayesian Inference
                          • 321 The Intrinsic Methodology
                          • 322 Mixtures of g-Priors
                            • 3221 Intrinsic priors
                            • 3222 Other mixtures of g-priors
                                • 33 Objective Bayes Occupancy Model Selection
                                  • 331 Preliminaries
                                  • 332 Intrinsic Priors for the Occupancy Problem
                                  • 333 Model Posterior Probabilities
                                  • 334 Model Selection Algorithm
                                    • 34 Alternative Formulation
                                    • 35 Simulation Experiments
                                      • 351 Marginal Posterior Inclusion Probabilities for Model Predictors
                                      • 352 Summary Statistics for the Highest Posterior Probability Model
                                        • 36 Case Study Blue Hawker Data Analysis
                                          • 361 Results Variable Selection Procedure
                                          • 362 Validation for the Selection Procedure
                                            • 37 Discussion
                                              • 4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
                                                • 41 Introduction
                                                • 42 Setup for Well-Formulated Models
                                                  • 421 Well-Formulated Model Spaces
                                                    • 43 Priors on the Model Space
                                                      • 431 Model Prior Definition
                                                      • 432 Choice of Prior Structure and Hyper-Parameters
                                                      • 433 Posterior Sensitivity to the Choice of Prior
                                                        • 44 Random Walks on the Model Space
                                                          • 441 Simple Pruning and Growing
                                                          • 442 Degree Based Pruning and Growing
                                                            • 45 Simulation Study
                                                              • 451 SNR and Sample Size Effect
                                                              • 452 Coefficient Magnitude
                                                              • 453 Special Points on the Scale
                                                                • 46 Case Study Ozone Data Analysis
                                                                • 47 Discussion
                                                                  • 5 CONCLUSIONS
                                                                  • A FULL CONDITIONAL DENSITIES DYMOSS
                                                                  • B RANDOM WALK ALGORITHMS
                                                                  • C WFM SIMULATION DETAILS
                                                                  • D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS
                                                                  • REFERENCES
                                                                  • BIOGRAPHICAL SKETCH
Page 3: ufdcimages.uflib.ufl.edu€¦ · ACKNOWLEDGMENTS Completing this dissertation would not have been possible without the support from the people that have helped me remain focused,

In memory of George Casella

It is a capital mistake to theorize before one has data Insensibly onebegins to twist facts to suit theories instead of theories to suit facts

ndashSherlock HolmesA Scandal in Bohemia

3

ACKNOWLEDGMENTS

Completing this dissertation would not have been possible without the support from

the people that have helped me remain focused motivated and inspired throughout the

years I am undeservingly fortunate to be surrounded by such amazing people

First of all I would like to express my gratitude to Professor George Casella It

was an unsurpassable honor to work with him His wisdom generosity optimism and

unyielding resolve will forever inspire me I will always treasure his teachings and the

fond memories I have of him I thank him and Anne for treating me and my wife as

family

I would like to acknowledge all of my committee members My heartfelt thanks to

my advisor Professor Linda J Young I will carry her thoughtful and patient recommendations

throughout my life I have no words to express how thankful I am to her for guiding me

through the difficult times that followed Dr Casellarsquos passing Also she has my gratitude

for sharing her knowledge and wealth of experience and for providing me with so many

amazing opportunities I am forever grateful to my local advisor Professor Nikolay

Bliznyuk for unsparingly sharing his insightful reflections and knowledge His generosity

and drive to help students develop are a model to follow His kind and extensive efforts

our many conversations his suggestions and advise in all aspects of academic and

non-academic life have made me a better statistician and have had a profound influence

on my way of thinking My appreciation to Professor Madan Oli for his enlightening

advise and for helping me advance my understanding of ecology

I would like to express my absolute gratitude to Dr Andrew Womack my friend and

young mentor His love for good science and hard work although impossible to keep up

with made my doctoral training one of the most exciting times in my life I have sincerely

enjoyed working and learning from him the last couple of years I offer my gratitude

to Dr Salvador Gezan for his friendship and the patience with which he taught me so

much more about statistics (boring our wives to death in the process) I am grateful to

4

Professor Mary Christman for her mentorship and enormous support I would like to

thank Dr Mihai Giurcanu for spending countless hours helping me think more deeply

about statistics his insight has been instrumental to shaping my own ideas Thanks to

Dr Claudio Fuentes for taking an interest in my work and for his advise support and

kind words which helped me retain the confidence to continue

I would like to acknowledge my friends at UF Juan Jose Acosta Mauricio

Mosquera Diana Falla Salvador and Emma Weeks and Anna Denicol thanks for

becoming my family away from home Andreas Tavis Emily Alex Sasha Mike

Yeonhee and Laura thanks for being there for me I truly enjoyed sharing these

years with you Vitor Paula Rafa Leandro Fabio Eduardo Marcelo and all the other

Brazilians in the Animal Science Department thanks for your friendship and for the

many unforgettable (though blurry) weekends

Also I would like to thank Pablo Arboleda for believing in me Because of him I

was able to take the first step towards fulfilling my educational goals My gratitude to

Grupo Bancolombia Fulbright Colombia Colfuturo and the IGERT QSE3 program

for supporting me throughout my studies Also thanks to Marc Kery and Christian

Monnerat for providing data to validate our methods Thanks to the staff in the Statistics

Department specially to Ryan Chance to the staff at the HPC and also to Karen Bray

at SNRE

Above all else I would like to thank my wife and family Nata you have always been

there for me pushing me forward believing in me helping me make better decisions

and regardless of how hard things get you have always managed to give me true and

lasting happiness Thank you for your love strength and patience Mom Dad Alejandro

Alberto Laura Sammy Vale and Tommy without your love trust and support getting

this far would not have been possible Thank you for giving me so much Gustavo

Lilia Angelica and Juan Pablo thanks for taking me into your family your words of

encouragement have led the way

5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS 4

LIST OF TABLES 8

LIST OF FIGURES 10

ABSTRACT 12

CHAPTER

1 GENERAL INTRODUCTION 14

11 Occupancy Modeling 1512 A Primer on Objective Bayesian Testing 1713 Overview of the Chapters 21

2 MODEL ESTIMATION METHODS 23

21 Introduction 23211 The Occupancy Model 24212 Data Augmentation Algorithms for Binary Models 26

22 Single Season Occupancy 29221 Probit Link Model 30222 Logit Link Model 32

23 Temporal Dynamics and Spatial Structure 34231 Dynamic Mixture Occupancy State-Space Model 37232 Incorporating Spatial Dependence 43

24 Summary 46

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS 49

31 Introduction 4932 Objective Bayesian Inference 52

321 The Intrinsic Methodology 53322 Mixtures of g-Priors 54

3221 Intrinsic priors 553222 Other mixtures of g-priors 56

33 Objective Bayes Occupancy Model Selection 57331 Preliminaries 58332 Intrinsic Priors for the Occupancy Problem 60333 Model Posterior Probabilities 62334 Model Selection Algorithm 63

34 Alternative Formulation 6635 Simulation Experiments 68

351 Marginal Posterior Inclusion Probabilities for Model Predictors 70

6

352 Summary Statistics for the Highest Posterior Probability Model 7636 Case Study Blue Hawker Data Analysis 77

361 Results Variable Selection Procedure 79362 Validation for the Selection Procedure 81

37 Discussion 82

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS 84

41 Introduction 8442 Setup for Well-Formulated Models 88

421 Well-Formulated Model Spaces 9043 Priors on the Model Space 91

431 Model Prior Definition 92432 Choice of Prior Structure and Hyper-Parameters 96433 Posterior Sensitivity to the Choice of Prior 99

44 Random Walks on the Model Space 104441 Simple Pruning and Growing 105442 Degree Based Pruning and Growing 106

45 Simulation Study 107451 SNR and Sample Size Effect 109452 Coefficient Magnitude 110453 Special Points on the Scale 111

46 Case Study Ozone Data Analysis 11147 Discussion 113

5 CONCLUSIONS 115

APPENDIX

A FULL CONDITIONAL DENSITIES DYMOSS 118

B RANDOM WALK ALGORITHMS 121

C WFM SIMULATION DETAILS 124

D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS 131

REFERENCES 133

BIOGRAPHICAL SKETCH 140

7

LIST OF TABLES

Table page

1-1 Interpretation of BFji when contrasting Mj and Mi 20

3-1 Simulation control parameters occupancy model selector 69

3-2 Comparison of average minOddsMPIP under scenarios having different numberof sites (N=50 N=100) and under scenarios having different number of surveysper site (J=3 J=5) for the presence and detection components using uniformand multiplicity correction priors 75

3-3 Comparison of average minOddsMPIP for different levels of signal consideredin the occupancy and detection probabilities for the presence and detectioncomponents using uniform and multiplicity correction priors 75

3-4 Comparison between scenarios with 50 and 100 sites in terms of the averagepercentage of true positive and true negative terms over the highest probabilitymodels for the presence and the detection components using uniform andmultiplicity correcting priors on the model space 76

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of thepercentage of true positive and true negative predictors averaged over thehighest probability models for the presence and the detection componentsusing uniform and multiplicity correcting priors on the model space 77

3-6 Comparison between scenarios with different level of signal in the occupancycomponent in terms of the percentage of true positive and true negative predictorsaveraged over the highest probability models for the presence and the detectioncomponents using uniform and multiplicity correcting priors on the model space 77

3-7 Comparison between scenarios with different level of signal in the detectioncomponent in terms of the percentage of true positive and true negative predictorsaveraged over the highest probability models for the presence and the detectioncomponents using uniform and multiplicity correcting priors on the model space 78

3-8 Posterior probability for the five highest probability models in the presencecomponent of the blue hawker data 80

3-9 Posterior probability for the five highest probability models in the detectioncomponent of the blue hawker data 80

3-10 MPIP presence component 81

3-11 MPIP detection component 81

3-12 Mean misclassification rate for HPMrsquos and MPMrsquos using uniform and multiplicitycorrection model priors 82

8

4-1 Characterization of the full models MF and corresponding model spaces Mconsidered in simulations 100

4-2 Mean number of false and true positives in 100 randomly generated datasetsas the number of main effects increases from three to five predictors in a is afull quadratic under the equal probability prior (EPP) the hierarchical independenceprior (HIP) the hierarchical order prior (HOP) and the hierarchical uniformprior (HUP) 102

4-3 Mean number of false and true positives in 100 randomly generated datasetsas the maximum order of MF increases from two to four in a full model withthree main effects under the equal probability prior (EPP) the hierarchicalindependence prior (HIP) the hierarchical order prior (HOP) and the hierarchicaluniform prior (HUP) 103

4-4 Mean number of false and true positives in 100 randomly generated datasetswith unstructured or irregular model spaces under the equal probability prior(EPP) the hierarchical independence prior (HIP) the hierarchical order prior(HOP) and the hierarchical uniform prior (HUP) 105

4-5 Variables used in the analyses of the ozone contamination dataset 112

4-6 Median probability models (MPM) from different combinations of parameterand model priors vs model selected using the hierarchical lasso 113

C-1 Experimental conditions WFM simulations 124

D-1 Variables used in the analyses of the ozone contamination dataset 131

D-2 Marginal inclusion probabilities intrinsic prior 132

D-3 Marginal inclusion probabilities Zellner-Siow prior 132

D-4 Marginal inclusion probabilities Hyper-g11 132

D-5 Marginal inclusion probabilities Hyper-g21 132

9

LIST OF FIGURES

Figure page

2-1 Graphical representation occupancy model 25

2-2 Graphical representation occupancy model after data-augmentation 31

2-3 Graphical representation multiseason model for a single site 39

2-4 Graphical representation data-augmented multiseason model 39

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites usinguniform (U) and multiplicity correction (MC) priors 71

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per siteusing uniform (U) and multiplicity correction (MC) priors 72

3-3 Predictor MPIP averaged over scenarios with the interaction between the numberof sites and the surveys per site using uniform (U) and multiplicity correction(MC) priors 72

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancyprobabilities using uniform (U) and multiplicity correction (MC) priors 73

3-5 Predictor MPIP averaged over scenarios with equal signal in the detectionprobabilities using uniform (U) and multiplicity correction (MC) priors 73

4-1 Graphs of well-formulated polynomial models for p = 2 90

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects formodel M = 1 x1 x21 91

4-3 Graphical representation of assumptions on M defined by the quadratic surfacein two main effects 93

4-4 Prior probabilities for the space of well-formulated models associated to thequadratic surface on two variables where MB is taken to be the intercept onlymodel and (ab) isin (1 1) (1 ch) 97

4-5 Prior probabilities for the space of well-formulated models associated to threemain effects and one interaction term where MB is taken to be the interceptonly model and (ab) isin (1 1) (1 ch) 98

4-6 MT DAG of the largest true model used in simulations 109

4-7 Average true positives (TP) and average false positives (FP) in all simulatedscenarios for the median probability model with EPP and HOP(1 ch) 110

C-1 SNR vs n Average model size average true positives and average false positivesfor all simulated scenarios by model ranking according to model posterior probabilities126

10

C-2 SNR vs coefficient values Average model size average true positives andaverage false positives for all simulated scenarios by model ranking accordingto model posterior probabilities 128

C-3 SNR vs different true models MT Average model size average true positivesand average false positives for all simulated scenarios by model ranking accordingto model posterior probabilities 129

11

Abstract of Dissertation Presented to the Graduate Schoolof the University of Florida in Partial Fulfillment of theRequirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION ANDSELECTION

By

Daniel Taylor-Rodrıguez

August 2014

Chair Linda J YoungCochair Nikolay BliznyukMajor Interdisciplinary Ecology

The ecological literature contains numerous methods for conducting inference about

the dynamics that govern biological populations Among these methods occupancy

models have played a leading role during the past decade in the analysis of large

biological population surveys The flexibility of the occupancy framework has brought

about useful extensions for determining key population parameters which provide

insights about the distribution structure and dynamics of a population However the

methods used to fit the models and to conduct inference have gradually grown in

complexity leaving practitioners unable to fully understand their implicit assumptions

increasing the potential for misuse This motivated our first contribution We develop

a flexible and straightforward estimation method for occupancy models that provides

the means to directly incorporate temporal and spatial heterogeneity using covariate

information that characterizes habitat quality and the detectability of a species

Adding to the issue mentioned above studies of complex ecological systems now

collect large amounts of information To identify the drivers of these systems robust

techniques that account for test multiplicity and for the structure in the predictors are

necessary but unavailable for ecological models We develop tools to address this

methodological gap First working in an ldquoobjectiverdquo Bayesian framework we develop

the first fully automatic and objective method for occupancy model selection based

12

on intrinsic parameter priors Moreover for the general variable selection problem we

propose three sets of prior structures on the model space that correct for multiple testing

and a stochastic search algorithm that relies on the priors on the models space to

account for the polynomial structure in the predictors

13

CHAPTER 1GENERAL INTRODUCTION

As with any other branch of science ecology strives to grasp truths about the

world that surrounds us and in particular about nature The objective truth sought

by ecology may well be beyond our grasp however it is reasonable to think that at

least partially ldquoNature is capable of being understoodrdquo (Dewey 1958) We can observe

and interpret nature to formulate hypotheses which can then be tested against reality

Hypotheses that encounter no or little opposition when confronted with reality may

become contextual versions of the truth and may be generalized by scaling them

spatially andor temporally accordingly to delimit the bounds within which they are valid

To formulate hypotheses accurately and in a fashion amenable to scientific inquiry

not only the point of view and assumptions considered must be made explicit but

also the object of interest the properties worthy of consideration of that object and

the methods used in studying such properties (Reiners amp Lockwood 2009 Rigler amp

Peters 1995) Ecology as defined by Krebs (1972) is ldquothe study of interactions that

determine the distribution and abundance of organismsrdquo This characterizes organisms

and their interactions as the objects of interest to ecology and prescribes distribution

and abundance as a relevant property of these organisms

With regards to the methods used to acquire ecological scientific knowledge

traditionally theoretical mathematical models (such as deterministic PDEs) have been

used However naturally varying systems are imprecisely observed and as such are

subject to multiple sources of uncertainty that must be explicitly accounted for Because

of this the ecological scientific community is developing a growing interest in flexible

and powerful statistical methods and among these Bayesian hierarchical models

predominate These methods rely on empirical observations and can accommodate

fairly complex relationships between empirical observations and theoretical process

models while accounting for diverse sources of uncertainty (Hooten 2006)

14

Bayesian approaches are now used extensively in ecological modeling however

there are two issues of concern one from the standpoint of ecological practitioners

and another from the perspective of scientific ecological endeavors First Bayesian

modeling tools require a considerable understanding of probability and statistical theory

leading practitioners to view them as black box approaches (Kery 2010) Second

although Bayesian applications proliferate in the literature in general there is a lack of

awareness of the distinction between approaches specifically devised for testing and

those for estimation (Ellison 2004) Furthermore there is a dangerous unfamiliarity with

the proven risks of using tools designed for estimation in testing procedures (Berger amp

Pericchi 1996 Berger et al 2001 Kass amp Raftery 1995 Moreno et al 1998 Robert

et al 2009 Robert 1993) (eg use of flat priors in hypothesis testing)

Occupancy models have played a leading role during the past decade in large

biological population surveys The flexibility of the occupancy framework has allowed

the development of useful extensions to determine several key population parameters

which provide robust notions of the distribution structure and dynamics of a population

In order to address some of the concerns stated in previous paragraph we concentrate

in the occupancy framework to develop estimation and testing tools that will allow

ecologists first to gain insight about the estimation procedure and second to conduct

statistically sound model selection for site-occupancy data

11 Occupancy Modeling

Since MacKenzie et al (2002) and Tyre et al (2003) introduced the site-occupancy

framework countless applications and extensions of the method have been developed

in the ecological literature as evidenced by the 438000 hits on Google Scholar for

a search of rdquooccupancy modelrdquo This class of models acknowledges that techniques

used to conduct biological population surveys are prone to detection errors ndashif an

individual is detected it must be present while if it is not detected it might or might

not be Occupancy models improve upon traditional binary regression by accounting

15

for observed detection and partially observed presence as two separate but related

components In the site occupancy setting the chosen locations are surveyed

repeatedly in order to reduce the ambiguity caused by the observed zeros This

approach therefore allows probabilities of both presence (occurrence) and detection

to be estimated

The uses of site-occupancy models are many For example metapopulation

and island biogeography models are often parameterized in terms of site (or patch)

occupancy (Hansky 19921994 1997 as cited in MacKenzie et al (2003)) and

occupancy may be used as a surrogate for abundance to answer questions regarding

geographic distribution range size and metapopulation dynamics (MacKenzie et al

2004 Royle amp Kery 2007)

The basic occupancy framework which assumes a single closed population with

fixed probabilities through time has proven to be quite useful however it might be of

limited utility when addressing some problems In particular assumptions for the basic

model may become too restrictive or unrealistic whenever the study period extends

throughout multiple years or seasons especially given the increasingly changing

environmental conditions that most ecosystems are currently experiencing

Among the extensions found in the literature one that we consider particularly

relevant incorporates heterogenous occupancy probabilities through time Models

that incorporate temporally varying probabilities stem from important meta-population

notions provided by Hanski (1994) such as occupancy probabilities depending on local

colonization and local extinction processes In spite of the conceptual usefulness of

Hanskirsquos model several strong and untenable assumptions (eg all patches being

homogenous in quality) are required for it to provide practically meaningful results

A more viable alternative which builds on Hanski (1994) is an extension of

the single season occupancy model of MacKenzie et al (2003) In this model the

heterogeneity of occupancy probabilities across seasons arises from local colonization

16

and extinction processes This model is flexible enough to let detection occurrence

extinction and colonization probabilities to each depend upon its own set of covariates

Model parameters are obtained through likelihood-based estimation

Using a maximum likelihood approach presents two drawbacks First the

uncertainty assessment for maximum likelihood parameter estimates relies on

asymptotic results which are obtained from implementation of the delta method

making it sensitive to sample size Second to obtain parameter estimates the latent

process (occupancy) is marginalized out of the likelihood leading to the usual zero

inflated Bernoulli model Although this is a convenient strategy for solving the estimation

problem after integrating the latent state variables (occupancy indicators) they are

no longer available Therefore finite sample estimates cannot be calculated directly

Instead a supplementary parametric bootstrapping step is necessary Further

additional structure such as temporal or spatial variation cannot be introduced by

means of random effects (Royle amp Kery 2007)

12 A Primer on Objective Bayesian Testing

With the advent of high dimensional data such as that found in modern problems

in ecology genetics physics etc coupled with evolving computing capability objective

Bayesian inferential methods have gained increasing popularity This however is by no

means a new approach in the way Bayesian inference is conducted In fact starting with

Bayes and Laplace and continuing for almost 200 years Bayesian analysis was primarily

based on ldquononinformativerdquo priors (Berger amp Bernardo 1992)

Now subjective elicitation of prior probabilities in Bayesian analysis is widely

recognized as the ideal (Berger et al 2001) however it is often the case that the

available information is insufficient to specify appropriate prior probabilistic statements

Commonly as in model selection problems where large model spaces have to be

explored the number of model parameters is prohibitively large preventing one from

eliciting prior information for the entire parameter space As a consequence in practice

17

the determination of priors through the definition of structural rules has become the

alternative to subjective elicitation for a variety of problems in Bayesian testing Priors

arising from these rules are known in the literature as noninformative objective default

or reference Many of these connotations generate controversy and are accused

perhaps rightly of providing a false pretension of objectivity Nevertheless we will avoid

that discussion and refer to them herein exchangeably as noninformative or objective

priors to convey the sense that no attempt to introduce an informed opinion is made in

defining prior probabilities

A plethora of ldquononinformativerdquo methods has been developed in the past few

decades (see Berger amp Bernardo (1992) Berger amp Pericchi (1996) Berger et al (2001)

Clyde amp George (2004) Kass amp Wasserman (1995 1996) Liang et al (2008) Moreno

et al (1998) Spiegelhalter amp Smith (1982) Wasserman (2000) and the references

therein) We find particularly interesting those derived from the model structure in which

no tuning parameters are required especially since these can be regarded as automatic

methods Among them methods based on the Bayes factor for Intrinsic Priors have

proven their worth in a variety of inferential problems given their excellent performance

flexibility and ease of use This class of priors is discussed in detail in chapter 3 For

now some basic notation and notions of Bayesian inferential procedures are introduced

Hypothesis testing and the Bayes factor

Bayesian model selection techniques that aim to find the true model as opposed

to searching for the model that best predicts the data are fundamentally extensions to

Bayesian hypothesis testing strategies In general this Bayesian approach to hypothesis

testing and model selection relies on determining the amount of evidence found in favor

of one hypothesis (or model) over the other given an observed set of data Approached

from a Bayesian standpoint this type of problem can be formulated in great generality

using a natural well defined probabilistic framework that incorporates both model and

parameter uncertainty

18

Jeffreys (1935) first developed the Bayesian strategy to hypothesis testing and

consequently to the model selection problem Bayesian model selection within

a model space M = (M1M2 MJ) where each model is associated with a

parameter θj which may be a vector of parameters itself incorporates three types

of probability distributions (1) a prior probability distribution for each model π(Mj)

(2) a prior probability distribution for the parameters in each model π(θj |Mj) and (3)

the distribution of the data conditional on both the model and the modelrsquos parameters

f (x|θj Mj) These three probability densities induce the joint distribution p(x θj Mj) =

f (x|θj Mj) middot π(θj |Mj) middot π(Mj) which is instrumental in producing model posterior

probabilities The model posterior probability is the probability that a model is true given

the data It is obtained by marginalizing over the parameter space and using Bayes rule

p(Mj |x) =m(x|Mj)π(Mj)sumJ

i=1m(x|Mi)π(Mi) (1ndash1)

where m(x|Mj) =intf (x|θj Mj)π(θj |Mj)dθj is the marginal likelihood of Mj

Given that interest lies in comparing different models evidence in favor of one or

another model is assessed with pairwise comparisons using posterior odds

p(Mj |x)p(Mk |x)

=m(x|Mj)

m(x|Mk)middot π(Mj)

π(Mk) (1ndash2)

The first term on the right hand side of (1ndash2) m(x|Mj )

m(x|Mk) is known as the Bayes factor

comparing model Mj to model Mk and it is denoted by BFjk(x) The Bayes factor

provides a measure of the evidence in favor of either model given the data and updates

the model prior odds given by π(Mj )

π(Mk) to produce the posterior odds

Note that the model posterior probability in (1ndash1) can be expressed as a function of

Bayes factors To illustrate let model Mlowast isin M be a reference model All other models

compare in M are compared to the reference model Then dividing both the numerator

19

and denominator in (1ndash1) by m(x|Mlowast)π(Mlowast) yields

p(Mj |x) =BFjlowast(x)

π(Mj )

π(Mlowast)

1 +sum

MiisinMMi =Mlowast

BFilowast(x)π(Mi )π(Mlowast)

(1ndash3)

Therefore as the Bayes factor increases the posterior probability of model Mj given the

data increases If all models have equal prior probabilities a straightforward criterion

to select the best among all candidate models is to choose the model with the largest

Bayes factor As such the Bayes factor is not only useful for identifying models favored

by the data but it also provides a means to rank models in terms of their posterior

probabilities

Assuming equal model prior probabilities in (1ndash3) the prior odds are set equal to

one and the model posterior odds in (1ndash2) become p(Mj |x)p(Mk |x) = BFjk(x) Based

on the Bayes factors the evidence in favor of one or another model can be interpreted

using Table 1-1 adapted from Kass amp Raftery (1995)

Table 1-1 Interpretation of BFji when contrasting Mj and Mi

lnBFjk BFjk Evidence in favor of Mj P(Mj |x)0 to 2 1 to 3 Weak evidence 05-0752 to 6 3 to 20 Positive evidence 075-095

6 to 10 20 to 150 Strong evidence 095-099gt10 gt150 Very strong evidence gt 099

Bayesian hypothesis testing and model selection procedures through Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist matching confidence bounds when implemented with good testing priors (Kass & Wasserman, 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery, 1995). Fourth, the use of Bayes factors does not require having nested hypotheses (i.e., having the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al., 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic to perform the test. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty by using the Bayesian machinery for model-averaged predictions and confidence bounds (Kass & Raftery, 1995). It is not clear how to account for this uncertainty rigorously in a fully frequentist approach.

1.3 Overview of the Chapters

In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply a foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either habitat quality of a given site or detectability features of a particular survey conducted in a specific site. On the other hand, the Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection, and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.

Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single season occupancy model with covariates is formulated using both probit and logit links. Based on the simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature in our framework is introduced.

Chapter 3 develops a new fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In that chapter, first some issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of a variable selection algorithm for "objective" variable selection tailored to the occupancy model framework.

Chapter 4 touches on two important and interconnected issues when conducting model testing that have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing given the size of the model space, i.e., given the number of tests performed, and (2) non-invariance to location transformations of the variable selection procedures in the face of polynomial predictor structure. These elements both depend on the definition of prior probabilities on the model space. In that chapter, a set of priors on the model space and a stochastic search algorithm are proposed. Together, these control for model multiplicity and account for the polynomial structure among the predictors.

CHAPTER 2
MODEL ESTIMATION METHODS

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."
--Sherlock Holmes, The Adventure of the Copper Beeches

2.1 Introduction

Prior to the introduction of site-occupancy models (MacKenzie et al., 2002; Tyre et al., 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distribution through space and time, or to model the habitat of a species (Tyre et al., 2003). These efforts, however, were suspect due to false-negative errors not being accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.

Site-occupancy models, developed independently by MacKenzie et al. (2002) and Tyre et al. (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications and numerous extensions for it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.

Several extensions to the basic single-season closed population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al., 2003; Royle & Kery, 2007), to understand age/stage structure within populations (Nichols et al., 2007), and to model species co-occurrence (MacKenzie et al., 2004; Ovaskainen et al., 2010; Waddle et al., 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols, 2004). MacKenzie et al. suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, to investigate metapopulation dynamics, occupancy models improve upon incidence function models (Hanski, 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation that is at a colonization-extinction equilibrium.

Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyper-parameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kery, 2010). As such, the potential of using the methodology incorrectly is high. Commonly these procedures are fitted with packages such as BUGS or JAGS. Although these packages' ease of use has led to a wide-spread adoption of the methods, the user may be oblivious as to the assumptions underpinning the analysis.

We believe providing straightforward and robust alternatives to implement these methods will help practitioners gain insight about how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single season closed population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation to incorporate spatial dependence among neighboring sites.

2.1.1 The Occupancy Model

In this section of the document we first introduce our results published in Dorazio & Taylor-Rodriguez (2012) and build upon them to propose relevant extensions. For the standard sampling protocol for collecting site-occupancy data, $J > 1$ independent surveys are conducted at each of $N$ representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let $y_{ij}$ denote a binary random variable that indicates detection ($y_{ij} = 1$) or non-detection ($y_{ij} = 0$) during the $j$th survey of site $i$. Without loss of generality, $J$ may be assumed constant among all $N$ sites to simplify description of the model. In practice, however, site-specific variation in $J$ poses no real difficulties and is easily implemented. This sampling protocol therefore yields an $N \times J$ matrix $\mathbf{Y}$ of detection/non-detection data.

Note that the observed process $y_{ij}$ is an imperfect representation of the underlying occupancy or presence process. Hence, letting $z_i$ denote the presence indicator at site $i$, this model specification can therefore be represented through the hierarchy

$$y_{ij} \mid z_i, \lambda \sim \text{Bernoulli}(z_i p_{ij})$$
$$z_i \mid \alpha \sim \text{Bernoulli}(\psi_i), \qquad (2-1)$$

where $p_{ij}$ is the probability of correctly classifying as occupied the $i$th site during the $j$th survey, and $\psi_i$ is the presence probability at the $i$th site. The graphical representation of this process is given in Figure 2-1.

[Figure 2-1. Graphical representation of the occupancy model (nodes: $\psi_i$, $z_i$, $y_i$, $p_i$).]
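As an illustration of the hierarchy in (2-1), the short Python sketch below (not part of the original analysis; all values are hypothetical) simulates detection/non-detection data for $N$ sites and $J$ surveys with constant occupancy and detection probabilities.

    import numpy as np

    rng = np.random.default_rng(1)
    N, J = 100, 4          # sites and surveys per site (hypothetical)
    psi, p = 0.6, 0.4      # occupancy and detection probabilities (hypothetical)

    z = rng.binomial(1, psi, size=N)                   # latent presence, z_i ~ Bernoulli(psi)
    Y = rng.binomial(1, p * z[:, None], size=(N, J))   # y_ij | z_i ~ Bernoulli(z_i * p)

    # Sites with all-zero rows are ambiguous: unoccupied, or occupied but never detected
    print("naive occupancy estimate:", (Y.sum(axis=1) > 0).mean(), "true psi:", psi)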

Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process ($z_i$), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a longstanding strategy in the Bayesian statistical literature that models binary outcomes using a simple Gibbs sampler. This procedure, which is described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models

Probit model: data augmentation with latent normal variables

At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support $(-\infty, 0]$; and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on $(0, \infty)$. To understand the reasoning behind this strategy, let $Y \sim \text{Bern}(\Phi(\mathbf{x}^T\beta))$ and $V = \mathbf{x}^T\beta + \varepsilon$ with $\varepsilon \sim N(0, 1)$. In such a case, note that

$$\Pr(y = 1 \mid \mathbf{x}^T\beta) = \Phi(\mathbf{x}^T\beta) = \Pr(\varepsilon < \mathbf{x}^T\beta) = \Pr(\varepsilon > -\mathbf{x}^T\beta) = \Pr(v > 0 \mid \mathbf{x}^T\beta).$$

Thus, whenever $y = 1$, then $v > 0$, and $v \leq 0$ otherwise. In other words, we may think of $y$ as a truncated version of $v$. Thus, we can sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we are able to obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2-3 below); further, just as we may sample the latent variables, we may also sample the parameters.

Given some initial values for all model parameters, values for the latent variables can be simulated. By conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions. These samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach. Generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.

Assume that each outcome $Y_1, Y_2, \ldots, Y_n$ is such that $Y_i \mid \mathbf{x}_i, \beta \sim \text{Bernoulli}(q_i)$, where $q_i = \Phi(\mathbf{x}_i^T\beta)$ is the standard normal CDF evaluated at $\mathbf{x}_i^T\beta$, and where $\mathbf{x}_i$ and $\beta$ are the $p$-dimensional vectors of observed covariates for the $i$th observation and their corresponding parameters, respectively.

Now let $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ be the vector of observed outcomes, and let $[\,\beta\,]$ represent the prior distribution of the model parameters. Therefore, the posterior distribution of $\beta$ is given by

$$[\,\beta \mid \mathbf{y}\,] \propto [\,\beta\,] \prod_{i=1}^{n} \Phi(\mathbf{x}_i^T\beta)^{y_i}\left(1 - \Phi(\mathbf{x}_i^T\beta)\right)^{1-y_i}, \qquad (2-2)$$

which is intractable. Nevertheless, introducing latent random variables $\mathbf{V} = (V_1, \ldots, V_n)$, such that $V_i \sim N(\mathbf{x}_i^T\beta, 1)$, resolves this difficulty by specifying that whenever $Y_i = 1$ then $V_i > 0$, and if $Y_i = 0$ then $V_i \leq 0$. This yields

$$[\,\beta, \mathbf{v} \mid \mathbf{y}\,] \propto [\,\beta\,] \prod_{i=1}^{n} \phi(v_i \mid \mathbf{x}_i^T\beta, 1)\left\{ I_{v_i \leq 0} I_{y_i = 0} + I_{v_i > 0} I_{y_i = 1} \right\}, \qquad (2-3)$$

where $\phi(x \mid \mu, \tau^2)$ is the probability density function of a normal random variable $x$ with mean $\mu$ and variance $\tau^2$. The data augmentation artifact works since $[\,\beta \mid \mathbf{y}\,] = \int [\,\beta, \mathbf{v} \mid \mathbf{y}\,]\,d\mathbf{v}$; hence, if we sample from the joint posterior (2-3) and extract only the sampled values for $\beta$, they will correspond to samples from $[\,\beta \mid \mathbf{y}\,]$.

From the expression above it is possible to obtain the full conditional distributions for $\mathbf{V}$ and $\beta$, and thus a Gibbs sampler can be proposed. For example, if we use a flat prior for $\beta$ (i.e., $[\,\beta\,] \propto 1$), the full conditionals are given by

$$\beta \mid \mathbf{V}, \mathbf{y} \sim \text{MVN}_k\left((X^TX)^{-1}(X^T\mathbf{V}),\ (X^TX)^{-1}\right) \qquad (2-4)$$
$$\mathbf{V} \mid \beta, \mathbf{y} \sim \prod_{i=1}^{n} \text{tr}\,N(\mathbf{x}_i^T\beta, 1, Q_i), \qquad (2-5)$$

where $\text{MVN}_q(\mu, \Sigma)$ represents a multivariate normal distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$, and $\text{tr}\,N(\xi, \sigma^2, Q)$ stands for the truncated normal distribution with mean $\xi$, variance $\sigma^2$, and truncation region $Q$. For each $i = 1, 2, \ldots, n$, the support of the truncated variables is given by $Q = (-\infty, 0\,]$ if $y_i = 0$ and $Q = (0, \infty)$ otherwise. Note that conjugate normal priors could be used alternatively.

At iteration $m + 1$, the Gibbs sampler draws $\mathbf{V}^{(m+1)}$ conditional on $\beta^{(m)}$ from (2-5), and then samples $\beta^{(m+1)}$ conditional on $\mathbf{V}^{(m+1)}$ from (2-4). This process is repeated for $m = 0, 1, \ldots, n_{sim}$, where $n_{sim}$ is the number of iterations in the Gibbs sampler.
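The following Python sketch illustrates one possible implementation of this two-step sampler for ordinary probit regression under the flat prior, using (2-4) and (2-5). It is a minimal illustration, not the dissertation's own code, and the simulated data are hypothetical.

    import numpy as np
    from scipy.stats import norm, truncnorm

    def probit_gibbs(y, X, n_sim=2000, seed=0):
        """Albert & Chib (1993)-style sampler for y_i ~ Bern(Phi(x_i' beta)), flat prior on beta."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        chol = np.linalg.cholesky(XtX_inv)
        beta = np.zeros(p)
        draws = np.empty((n_sim, p))
        for m in range(n_sim):
            # (2-5): v_i | beta, y from a truncated normal with mean x_i' beta, variance 1
            mu = X @ beta
            lo = np.where(y == 1, -mu, -np.inf)   # v > 0 when y = 1
            hi = np.where(y == 1, np.inf, -mu)    # v <= 0 when y = 0
            v = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
            # (2-4): beta | v ~ MVN((X'X)^{-1} X'v, (X'X)^{-1})
            beta = XtX_inv @ (X.T @ v) + chol @ rng.standard_normal(p)
            draws[m] = beta
        return draws

    # Hypothetical example
    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(500), rng.standard_normal(500)])
    y = rng.binomial(1, norm.cdf(X @ np.array([-0.5, 1.0])))
    print(probit_gibbs(y, X, 1000)[500:].mean(axis=0))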

Logit model: data augmentation with latent Polya-gamma variables

Recently, Polson et al. (2013) developed a novel and efficient approach for Bayesian inference for logistic models using Polya-gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of the iid sequence of $\text{Exp}(1)$ random variables $\{E_k\}_{k=1}^{\infty}$ given by

$$\omega = \frac{2}{\pi^2}\sum_{k=1}^{\infty}\frac{E_k}{(2k-1)^2},$$

with probability density function

$$g(\omega) = \sum_{k=0}^{\infty}(-1)^k\,\frac{2k+1}{\sqrt{2\pi\omega^3}}\,e^{-\frac{(2k+1)^2}{8\omega}}\,I_{\omega\in(0,\infty)}, \qquad (2-6)$$

and Laplace density transform $E[e^{-t\omega}] = \cosh^{-1}\left(\sqrt{t/2}\right)$.

The Polya-gamma family of densities is obtained through an exponential tilting of the density $g$ from (2-6). These densities, indexed by $c \geq 0$, are characterized by

$$f(\omega \mid c) = \cosh\left(\frac{c}{2}\right)e^{-\frac{c^2\omega}{2}}g(\omega).$$
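As a rough illustration of what a PG$(1, c)$ draw looks like, the sketch below uses a truncated version of the infinite gamma-sum representation of the Polya-gamma distribution given in Polson et al. (2013). This is only an approximation for illustration, not the exact sampler those authors propose, and the truncation level is an arbitrary choice.

    import numpy as np

    def rpg_approx(c, n_terms=200, rng=None):
        """Approximate draw of omega ~ PG(1, c) by truncating the representation
        omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
        with g_k iid Gamma(1, 1), i.e., Exp(1)."""
        rng = rng or np.random.default_rng()
        k = np.arange(1, n_terms + 1)
        g = rng.gamma(shape=1.0, scale=1.0, size=n_terms)
        return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

    draws = [rpg_approx(1.5) for _ in range(5000)]
    print(np.mean(draws))   # should be near tanh(1.5 / 2) / (2 * 1.5), roughly 0.21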

The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume $y_i \sim \text{Bernoulli}(\delta_i)$ with predictors $\mathbf{x}_i' = (x_{i1}, \ldots, x_{ip})$ and success probability $\delta_i = e^{\mathbf{x}_i'\beta}/(1 + e^{\mathbf{x}_i'\beta})$. Hence, the posterior for the model parameters can be represented as

$$[\,\beta \mid \mathbf{y}\,] = \frac{[\,\beta\,]\prod_{i=1}^{n}\delta_i^{y_i}(1-\delta_i)^{1-y_i}}{c(\mathbf{y})},$$

where $c(\mathbf{y})$ is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing a Polya-gamma random variable $\omega \sim \text{PG}(\mathbf{x}'\beta, 1)$. This yields the data-augmented posterior

$$[\,\beta, \omega \mid \mathbf{y}\,] = \frac{\left(\prod_{i=1}^{n}\Pr(y_i = 1 \mid \beta)\right)f(\omega \mid \mathbf{x}'\beta)\,[\,\beta\,]\,d\omega}{c(\mathbf{y})}, \qquad (2-7)$$

such that $[\,\beta \mid \mathbf{y}\,] = \int_{\mathbb{R}^+}[\,\beta, \omega \mid \mathbf{y}\,]\,d\omega$.

Thus, from the augmented model, the full conditional density for $\beta$ is given by

$$[\,\beta \mid \omega, \mathbf{y}\,] \propto \left(\prod_{i=1}^{n}\Pr(y_i = 1 \mid \beta)\right)f(\omega \mid \mathbf{x}'\beta)\,[\,\beta\,]\,d\omega
= \prod_{i=1}^{n}\frac{(e^{\mathbf{x}_i'\beta})^{y_i}}{1 + e^{\mathbf{x}_i'\beta}}\prod_{i=1}^{n}\cosh\left(\frac{|\mathbf{x}_i'\beta|}{2}\right)\exp\left[-\frac{(\mathbf{x}_i'\beta)^2\omega_i}{2}\right]g(\omega_i). \qquad (2-8)$$

This expression yields a normal posterior distribution if $\beta$ is assigned flat or normal priors. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate $\beta$ in the occupancy framework.

2.2 Single Season Occupancy

Let $p_{ij} = F(\mathbf{q}_{ij}^T\lambda)$ be the probability of correctly classifying as occupied the $i$th site during the $j$th survey, conditional on the site being occupied, and let $\psi_i = F(\mathbf{x}_i^T\alpha)$ correspond to the presence probability at the $i$th site. Further, let $F^{-1}(\cdot)$ denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by $\lambda$ and $\alpha$, respectively, the $r$-variate and $p$-variate coefficient vectors for the detection and for the presence probabilities. Then the following is the joint posterior probability for the presence indicators and the model parameters:

$$\pi^*(\mathbf{z}, \alpha, \lambda) \propto \pi_{\alpha}(\alpha)\,\pi_{\lambda}(\lambda)\prod_{i=1}^{N}F(\mathbf{x}_i'\alpha)^{z_i}\left(1 - F(\mathbf{x}_i'\alpha)\right)^{(1-z_i)} \times \prod_{j=1}^{J}\left(z_iF(\mathbf{q}_{ij}'\lambda)\right)^{y_{ij}}\left(1 - z_iF(\mathbf{q}_{ij}'\lambda)\right)^{1-y_{ij}}. \qquad (2-9)$$

As in the simple probit regression problem, this posterior is intractable; consequently, sampling from it directly is not possible. But the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they do not require tuning parameters nor eliciting parameter priors subjectively.

2.2.1 Probit Link Model

To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent variables, denoted by $w_{ij}$ and $v_i$, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

$$y_{ij} \mid z_i, w_{ij} \sim \text{Bernoulli}\left(z_iI_{w_{ij}>0}\right)$$
$$w_{ij} \mid \lambda \sim N\left(\mathbf{q}_{ij}'\lambda, 1\right), \qquad \lambda \sim [\,\lambda\,]$$
$$z_i \mid v_i \sim \text{Bernoulli}\left(I_{v_i>0}\right)$$
$$v_i \mid \alpha \sim N(\mathbf{x}_i'\alpha, 1), \qquad \alpha \sim [\,\alpha\,], \qquad (2-10)$$


represented by the directed graph found in Figure 2-2.

[Figure 2-2. Graphical representation of the occupancy model after data augmentation (nodes: $\alpha$, $v_i$, $z_i$, $y_i$, $w_i$, $\lambda$).]

Under this hierarchical model, the joint density is given by

$$\pi^*(\mathbf{z}, \mathbf{v}, \alpha, \mathbf{w}, \lambda) \propto C_{\mathbf{y}}\,\pi_{\alpha}(\alpha)\,\pi_{\lambda}(\lambda)\prod_{i=1}^{N}\phi(v_i;\, \mathbf{x}_i'\alpha, 1)\,I_{v_i>0}^{z_i}\,I_{v_i\leq 0}^{(1-z_i)} \times \prod_{j=1}^{J}(z_iI_{w_{ij}>0})^{y_{ij}}(1 - z_iI_{w_{ij}>0})^{1-y_{ij}}\,\phi(w_{ij};\, \mathbf{q}_{ij}'\lambda, 1). \qquad (2-11)$$

The full conditional densities derived from the posterior in Equation 2-11 are detailed below.

1. These are obtained from the full conditional of $\mathbf{z}$ after integrating out $\mathbf{v}$ and $\mathbf{w}$:

$$f(\mathbf{z} \mid \alpha, \lambda) = \prod_{i=1}^{N}f(z_i \mid \alpha, \lambda) = \prod_{i=1}^{N}\psi_i^{*\,z_i}(1 - \psi_i^*)^{1-z_i},
\quad\text{where } \psi_i^* = \frac{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}}}{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}} + (1-\psi_i)\prod_{j=1}^{J}I_{y_{ij}=0}}. \qquad (2-12)$$

2.
$$f(\mathbf{v} \mid \mathbf{z}, \alpha) = \prod_{i=1}^{N}f(v_i \mid z_i, \alpha) = \prod_{i=1}^{N}\text{tr}\,N(\mathbf{x}_i'\alpha, 1, A_i),
\quad\text{where } A_i = \begin{cases}(-\infty, 0] & z_i = 0\\ (0, \infty) & z_i = 1\end{cases} \qquad (2-13)$$

and $\text{tr}\,N(\mu, \sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$, and truncation region $A$.

3.
$$f(\alpha \mid \mathbf{v}) = \phi_p\left(\alpha;\ \Sigma_{\alpha}X'\mathbf{v},\ \Sigma_{\alpha}\right), \qquad (2-14)$$

where $\Sigma_{\alpha} = (X'X)^{-1}$, and $\phi_k(\mathbf{x};\, \mu, \Sigma)$ represents the $k$-variate normal density with mean vector $\mu$ and variance matrix $\Sigma$.

4.
$$f(\mathbf{w} \mid \mathbf{y}, \mathbf{z}, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J}f(w_{ij} \mid y_{ij}, z_i, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J}\text{tr}\,N(\mathbf{q}_{ij}'\lambda, 1, B_{ij}),
\quad\text{where } B_{ij} = \begin{cases}(-\infty, \infty) & z_i = 0\\ (-\infty, 0] & z_i = 1 \text{ and } y_{ij} = 0\\ (0, \infty) & z_i = 1 \text{ and } y_{ij} = 1\end{cases} \qquad (2-15)$$

5.
$$f(\lambda \mid \mathbf{w}) = \phi_r\left(\lambda;\ \Sigma_{\lambda}Q'\mathbf{w},\ \Sigma_{\lambda}\right), \qquad (2-16)$$

where $\Sigma_{\lambda} = (Q'Q)^{-1}$.

The Gibbs sampling algorithm for the model can then be summarized as follows (a code sketch is given after this list):

1. Initialize $\mathbf{z}$, $\alpha$, $\mathbf{v}$, $\lambda$, and $\mathbf{w}$.

2. Sample $z_i \sim \text{Bern}(\psi_i^*)$.

3. Sample $v_i$ from a truncated normal with $\mu = \mathbf{x}_i'\alpha$, $\sigma = 1$, and truncation region depending on $z_i$.

4. Sample $\alpha \sim N(\Sigma_{\alpha}X'\mathbf{v}, \Sigma_{\alpha})$, with $\Sigma_{\alpha} = (X'X)^{-1}$.

5. Sample $w_{ij}$ from a truncated normal with $\mu = \mathbf{q}_{ij}'\lambda$, $\sigma = 1$, and truncation region depending on $y_{ij}$ and $z_i$.

6. Sample $\lambda \sim N(\Sigma_{\lambda}Q'\mathbf{w}, \Sigma_{\lambda})$, with $\Sigma_{\lambda} = (Q'Q)^{-1}$.
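A minimal Python sketch of these six steps follows. It is an illustration of the algorithm just described, not the dissertation's own implementation; the design matrices and any usage values are hypothetical.

    import numpy as np
    from scipy.stats import norm, truncnorm

    def occu_probit_gibbs(Y, X, Q, n_sim=2000, seed=0):
        """Single-season occupancy Gibbs sampler with probit links (steps 1-6).
        Y: N x J detections; X: N x p occupancy design; Q: N x J x r (or N*J x r) detection design."""
        rng = np.random.default_rng(seed)
        N, J = Y.shape
        Qf = np.asarray(Q).reshape(N * J, -1)
        Sa = np.linalg.inv(X.T @ X); La = np.linalg.cholesky(Sa)
        Sl = np.linalg.inv(Qf.T @ Qf); Ll = np.linalg.cholesky(Sl)
        alpha = np.zeros(X.shape[1]); lam = np.zeros(Qf.shape[1])
        out = []
        for m in range(n_sim):
            psi = norm.cdf(X @ alpha)
            p = norm.cdf(Qf @ lam).reshape(N, J)
            # step 2: z_i ~ Bern(psi*_i) from (2-12); sites with a detection get z_i = 1
            lik1 = psi * np.prod(p ** Y * (1 - p) ** (1 - Y), axis=1)
            lik0 = (1 - psi) * (Y.sum(axis=1) == 0)
            z = rng.binomial(1, lik1 / (lik1 + lik0)).astype(float)
            # step 3: v_i | z_i, truncated normal around x_i' alpha
            mu_v = X @ alpha
            v = mu_v + truncnorm.rvs(np.where(z == 1, -mu_v, -np.inf),
                                     np.where(z == 1, np.inf, -mu_v), random_state=rng)
            # step 4: alpha | v
            alpha = Sa @ (X.T @ v) + La @ rng.standard_normal(X.shape[1])
            # step 5: w_ij | y_ij, z_i; unoccupied sites are untruncated
            mu_w = (Qf @ lam).reshape(N, J)
            lo = np.where(z[:, None] == 0, -np.inf, np.where(Y == 1, -mu_w, -np.inf))
            hi = np.where(z[:, None] == 0, np.inf, np.where(Y == 1, np.inf, -mu_w))
            w = mu_w + truncnorm.rvs(lo, hi, random_state=rng)
            # step 6: lambda | w
            lam = Sl @ (Qf.T @ w.ravel()) + Ll @ rng.standard_normal(Qf.shape[1])
            out.append(np.concatenate([alpha, lam]))
        return np.array(out)

    # Hypothetical intercept-only usage: X = np.ones((N, 1)), Q = np.ones((N, J, 1))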

2.2.2 Logit Link Model

Now turning to the logit link version of the occupancy model, again let $y_{ij}$ be the indicator variable used to mark detection of the target species on the $j$th survey at the $i$th site, and let $z_i$ be the indicator variable that denotes presence ($z_i = 1$) or absence ($z_i = 0$) of the target species at the $i$th site. The model is now defined by

$$y_{ij} \mid z_i, \lambda \sim \text{Bernoulli}(z_ip_{ij}), \quad\text{where } p_{ij} = \frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}, \qquad \lambda \sim [\,\lambda\,]$$
$$z_i \mid \alpha \sim \text{Bernoulli}(\psi_i), \quad\text{where } \psi_i = \frac{e^{\mathbf{x}_i'\alpha}}{1 + e^{\mathbf{x}_i'\alpha}}, \qquad \alpha \sim [\,\alpha\,].$$

In this hierarchy, the contribution of a single site to the likelihood is

$$L_i(\alpha, \lambda) = \frac{(e^{\mathbf{x}_i'\alpha})^{z_i}}{1 + e^{\mathbf{x}_i'\alpha}}\prod_{j=1}^{J}\left(z_i\frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}\right)^{y_{ij}}\left(1 - z_i\frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}\right)^{1-y_{ij}}. \qquad (2-17)$$

As in the probit case, we data-augment the likelihood with two separate sets of latent variables; however, in this case each of them has a Polya-gamma distribution. Augmenting the model and using the posterior in (2-7), the joint is

$$[\,\mathbf{z}, \alpha, \lambda \mid \mathbf{y}\,] \propto [\,\alpha\,][\,\lambda\,]\prod_{i=1}^{N}\frac{(e^{\mathbf{x}_i'\alpha})^{z_i}}{1 + e^{\mathbf{x}_i'\alpha}}\cosh\left(\frac{|\mathbf{x}_i'\alpha|}{2}\right)\exp\left[-\frac{(\mathbf{x}_i'\alpha)^2v_i}{2}\right]g(v_i) \times$$
$$\prod_{j=1}^{J}\left(z_i\frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}\right)^{y_{ij}}\left(1 - z_i\frac{e^{\mathbf{q}_{ij}'\lambda}}{1 + e^{\mathbf{q}_{ij}'\lambda}}\right)^{1-y_{ij}}\cosh\left(\frac{|z_i\mathbf{q}_{ij}'\lambda|}{2}\right)\exp\left[-\frac{(z_i\mathbf{q}_{ij}'\lambda)^2w_{ij}}{2}\right]g(w_{ij}). \qquad (2-18)$$

The full conditionals for $\mathbf{z}$, $\alpha$, $\mathbf{v}$, $\lambda$, and $\mathbf{w}$ obtained from (2-18) are provided below.

1. The full conditional for $\mathbf{z}$ is obtained after marginalizing the latent variables and yields

$$f(\mathbf{z} \mid \alpha, \lambda) = \prod_{i=1}^{N}f(z_i \mid \alpha, \lambda) = \prod_{i=1}^{N}\psi_i^{*\,z_i}(1 - \psi_i^*)^{1-z_i},
\quad\text{where } \psi_i^* = \frac{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}}}{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}} + (1-\psi_i)\prod_{j=1}^{J}I_{y_{ij}=0}}. \qquad (2-19)$$

2. Using the result derived in Polson et al. (2013), we have that

$$f(\mathbf{v} \mid \mathbf{z}, \alpha) = \prod_{i=1}^{N}f(v_i \mid z_i, \alpha) = \prod_{i=1}^{N}\text{PG}(1, \mathbf{x}_i'\alpha). \qquad (2-20)$$

3.
$$f(\alpha \mid \mathbf{v}) \propto [\,\alpha\,]\prod_{i=1}^{N}\exp\left[z_i\mathbf{x}_i'\alpha - \frac{\mathbf{x}_i'\alpha}{2} - \frac{(\mathbf{x}_i'\alpha)^2v_i}{2}\right]. \qquad (2-21)$$

4. By the same result as that used for $\mathbf{v}$, the full conditional for $\mathbf{w}$ is

$$f(\mathbf{w} \mid \mathbf{y}, \mathbf{z}, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J}f(w_{ij} \mid y_{ij}, z_i, \lambda) = \left(\prod_{i\in S_1}\prod_{j=1}^{J}\text{PG}(1, |\mathbf{q}_{ij}'\lambda|)\right)\left(\prod_{i\notin S_1}\prod_{j=1}^{J}\text{PG}(1, 0)\right), \qquad (2-22)$$

with $S_1 = \{i \in \{1, 2, \ldots, N\} : z_i = 1\}$.

5.
$$f(\lambda \mid \mathbf{z}, \mathbf{y}, \mathbf{w}) \propto [\,\lambda\,]\prod_{i\in S_1}\prod_{j=1}^{J}\exp\left[y_{ij}\mathbf{q}_{ij}'\lambda - \frac{\mathbf{q}_{ij}'\lambda}{2} - \frac{(\mathbf{q}_{ij}'\lambda)^2w_{ij}}{2}\right], \qquad (2-23)$$

with $S_1$ as defined above.
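For intuition, the sketch below shows what a single conditional update of $\alpha$ based on (2-20) and (2-21) could look like when $\alpha$ is given a $N(0, \sigma^2 I)$ prior (an assumption made here for illustration), in which case the exponent in (2-21) is quadratic in $\alpha$ and the conditional is Gaussian. The approximate Polya-gamma draw reuses the truncated-sum idea sketched earlier; an exact PG sampler would normally be preferred. None of the names below come from the dissertation.

    import numpy as np

    def rpg1(c, rng, n_terms=200):
        """Approximate PG(1, c) draw via a truncated gamma-sum representation (illustrative only)."""
        k = np.arange(1, n_terms + 1)
        return np.sum(rng.gamma(1.0, 1.0, n_terms) /
                      ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

    def update_alpha_pg(alpha, z, X, sigma2_prior=10.0, rng=None):
        """One Gibbs update of alpha | z for the occupancy component with a logit link,
        under a hypothetical N(0, sigma2_prior * I) prior on alpha."""
        rng = rng or np.random.default_rng()
        v = np.array([rpg1(abs(float(xb)), rng) for xb in X @ alpha])     # draws as in (2-20)
        prec = X.T @ (v[:, None] * X) + np.eye(X.shape[1]) / sigma2_prior # quadratic term of (2-21)
        cov = np.linalg.inv(prec)
        mean = cov @ (X.T @ (z - 0.5))                                    # linear term of (2-21)
        return rng.multivariate_normal(mean, cov)

The detection update for $\lambda$ proceeds analogously from (2-22) and (2-23), restricting the sums to the currently occupied sites in $S_1$.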

The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure

The uses of the single-season model are limited to very specific problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival, and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.

However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from implementation of the delta method), making it sensitive to sample size. And second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy to solve the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such, finite sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery, 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes incorporation of additional structural dependence using random effects. Thus, the model cannot account for spatial dependence, which plays a fundamental role in this setting.

To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood based methods, Royle & Kery developed what they refer to as a dynamic occupancy state space model (DOSS), alluding to the conceptual similarity found between this model and the class of state space models found in the time series literature. In particular, this model allows one to retain the latent process (occupancy indicators) in order to obtain small sample estimates and to eventually generate extensions that incorporate structure in time and/or space through random effects.


The data used in the DOSS model come from standard repeated presence/absence surveys with $N$ sampling locations (patches or sites) indexed by $i = 1, 2, \ldots, N$. Within a given season (e.g., year, month, week, depending on the biology of the species), each sampling location is visited (surveyed) $j = 1, 2, \ldots, J$ times. This process is repeated for $t = 1, 2, \ldots, T$ seasons. Here, an important assumption is that the site occupancy status is closed within, but not across, seasons.

As is usual in the occupancy modeling framework, two different processes are considered. The first one is the detection process per site-visit-season combination, denoted by $y_{ijt}$. The $y_{ijt}$ are indicator functions that take the value 1 if the species is detected at site $i$, survey $j$, and season $t$, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicators $z_{it}$. These are indicator variables which are equal to 1 whenever $y_{ijt} = 1$ for one or more of the visits made to site $i$ during season $t$; otherwise, the values of the $z_{it}$'s are unknown. Royle & Kery refer to these two processes as the observation ($y_{ijt}$) and the state ($z_{it}$) models.

In this setting, the parameters of greatest interest are the occurrence or site occupancy probabilities, denoted by $\psi_{it}$, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season $t-1$, at season $t$ it can either be colonized or remain unoccupied. On the other hand, if the site was in fact occupied at season $t-1$, it can remain that way (survival) or become abandoned (local extinction) at season $t$. The probabilities of survival and colonization from season $t-1$ to season $t$ at the $i$th site are denoted by $\theta_{i(t-1)}$ and $\gamma_{i(t-1)}$, respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (Equation 2-24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (Equation 2-25); in particular,

$$z_{i1} \sim \text{Bernoulli}(\psi_{i1}) \qquad (2-24)$$
$$z_{it} \mid z_{i(t-1)} \sim \text{Bernoulli}\left(z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)}\right). \qquad (2-25)$$

The observation model, conditional on the latent process $z_{it}$, is defined by

$$y_{ijt} \mid z_{it} \sim \text{Bernoulli}\left(z_{it}p_{ijt}\right). \qquad (2-26)$$
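To make the state dynamics in (2-24) through (2-26) concrete, the following sketch simulates presence and detection histories over $T$ seasons with constant survival, colonization, and detection probabilities; it is illustrative only, and all probability values are hypothetical.

    import numpy as np

    rng = np.random.default_rng(2)
    N, J, T = 200, 3, 5                             # sites, surveys per season, seasons (hypothetical)
    psi1, theta, gamma, p = 0.5, 0.8, 0.2, 0.45     # hypothetical probabilities

    z = np.empty((N, T), dtype=int)
    z[:, 0] = rng.binomial(1, psi1, N)                        # (2-24)
    for t in range(1, T):
        pr = z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma  # (2-25)
        z[:, t] = rng.binomial(1, pr)
    y = rng.binomial(1, p * z[:, None, :], size=(N, J, T))    # (2-26)

    print("proportion of occupied sites per season:", z.mean(axis=0))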

Royle & Kery induce the heterogeneity by site, site-season, and site-survey-season, respectively, in the occupancy, survival and colonization, and detection probabilities through the following specification:

$$\text{logit}(\psi_{i1}) = x_1 + r_i, \quad r_i \sim N(0, \sigma^2_{\psi}), \quad \text{logit}^{-1}(x_1) \sim \text{Unif}(0, 1)$$
$$\text{logit}(\theta_{it}) = a_t + u_i, \quad u_i \sim N(0, \sigma^2_{\theta}), \quad \text{logit}^{-1}(a_t) \sim \text{Unif}(0, 1)$$
$$\text{logit}(\gamma_{it}) = b_t + v_i, \quad v_i \sim N(0, \sigma^2_{\gamma}), \quad \text{logit}^{-1}(b_t) \sim \text{Unif}(0, 1)$$
$$\text{logit}(p_{ijt}) = c_t + w_{ij}, \quad w_{ij} \sim N(0, \sigma^2_{p}), \quad \text{logit}^{-1}(c_t) \sim \text{Unif}(0, 1), \qquad (2-27)$$

where $x_1$, $a_t$, $b_t$, $c_t$ are the season fixed effects for the corresponding probabilities, and where $(r_i, u_i, v_i)$ and $w_{ij}$ are the site and site-survey random effects, respectively. Additionally, all variance components assume the usual inverse gamma priors.

As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model

We assume that the probabilities for occupancy, survival, colonization, and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kery (2007). In essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time $t$ may again be occupied at time $t+1$ if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as $\delta$, is linked to a linear combination of covariates $\mathbf{x}'\xi$ through the relationship defined by $\delta = F(\mathbf{x}^T\xi)$, where $F(\cdot)$ represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State Space model (DYMOSS).

As before, let $y_{ijt}$ be the indicator variable used to mark detection of the target species on the $j$th survey at the $i$th site during the $t$th season, and let $z_{it}$ be the indicator variable that denotes presence ($z_{it} = 1$) or absence ($z_{it} = 0$) of the target species at the $i$th site, $t$th season, with $i \in \{1, 2, \ldots, N\}$, $j \in \{1, 2, \ldots, J\}$, and $t \in \{1, 2, \ldots, T\}$. Additionally, assume that the probabilities for occupancy at time $t = 1$, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors $\alpha$, $\{\delta^{(s)}_{t-1}\}_{t=2}^{T}$, $\mathbf{B}^{(c)} = \{\beta^{(c)}_{t-1}\}_{t=2}^{T}$, and $\Lambda = \{\lambda_t\}_{t=1}^{T}$, and covariate matrices $X^{(o)}$, $X = \{X_{t-1}\}_{t=2}^{T}$, and $Q = \{Q_t\}_{t=1}^{T}$, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:

$$z_{i1} \mid \alpha \sim \text{Bernoulli}(\psi_{i1}), \quad\text{where } \psi_{i1} = F\left(\mathbf{x}_{(o)i}'\alpha\right)$$
$$z_{it} \mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} \sim \text{Bernoulli}\left(z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)}\right),$$
$$\text{where } \theta_{i(t-1)} = F\left(\delta^{(s)}_{t-1} + \mathbf{x}_{i(t-1)}'\beta^{(c)}_{t-1}\right) \text{ and } \gamma_{i(t-1)} = F\left(\mathbf{x}_{i(t-1)}'\beta^{(c)}_{t-1}\right) \qquad (2-28)$$

Observed model:
$$y_{ijt} \mid z_{it}, \lambda_t \sim \text{Bernoulli}(z_{it}p_{ijt}), \quad\text{where } p_{ijt} = F(\mathbf{q}_{ijt}^T\lambda_t). \qquad (2-29)$$

In the hierarchical setup given by Equations 2-28 and 2-29, $\theta_{i(t-1)}$ corresponds to the probability of persistence from time $t-1$ to time $t$ at site $i$, and $\gamma_{i(t-1)}$ denotes the colonization probability. Note that $\theta_{i(t-1)} - \gamma_{i(t-1)}$ yields the survival probability from $t-1$ to $t$. The effect of survival is introduced by changing the intercept of the linear predictor by a quantity $\delta^{(s)}_{t-1}$. Although in this version of the model this effect is accomplished by just modifying the intercept, it can be extended to have covariates determining $\delta^{(s)}_{t-1}$ as well. The graphical representation of the model for a single site is given in Figure 2-3.


[Figure 2-3. Graphical representation of the multiseason model for a single site, with nodes $\alpha$, $z_{it}$, $y_{it}$, $\lambda_t$, $\delta^{(s)}_{t-1}$, and $\beta^{(c)}_{t-1}$ for $t = 1, \ldots, T$.]

The joint posterior for the model defined by this hierarchical setting is

$$[\,\mathbf{z}, \boldsymbol{\eta}, \alpha, \boldsymbol{\beta}, \boldsymbol{\lambda} \mid \mathbf{y}\,] = C_{\mathbf{y}}\prod_{i=1}^{N}\left\{\psi_{i1}\prod_{j=1}^{J}p_{ij1}^{y_{ij1}}(1-p_{ij1})^{(1-y_{ij1})}\right\}^{z_{i1}}\left\{(1-\psi_{i1})\prod_{j=1}^{J}I_{y_{ij1}=0}\right\}^{1-z_{i1}}[\,\eta_1\,][\,\alpha\,] \times$$
$$\prod_{t=2}^{T}\prod_{i=1}^{N}\left[\left(\theta_{i(t-1)}^{z_{it}}(1-\theta_{i(t-1)})^{1-z_{it}}\right)^{z_{i(t-1)}} + \left(\gamma_{i(t-1)}^{z_{it}}(1-\gamma_{i(t-1)})^{1-z_{it}}\right)^{1-z_{i(t-1)}}\right]\left\{\prod_{j=1}^{J}p_{ijt}^{y_{ijt}}(1-p_{ijt})^{1-y_{ijt}}\right\}^{z_{it}}\left\{\prod_{j=1}^{J}I_{y_{ijt}=0}\right\}^{1-z_{it}}[\,\eta_t\,][\,\beta_{t-1}\,][\,\lambda_{t-1}\,], \qquad (2-30)$$

which, as in the single season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.

[Figure 2-4. Graphical representation of the data-augmented multiseason model, adding the latent variables $u_{i1}$, $v_{i,t-1}$, and $w_{it}$ to the nodes of Figure 2-3.]

Probit link normal-mixture DYMOSS model

Probit link normal-mixture DYMOSS model


We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combinations of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distributions of the model parameters. For the probit link, the sets of latent random variables, respectively for first season occupancy, persistence and colonization, and detection, are

- $u_i \sim N\left(\mathbf{x}_{(o)i}^T\alpha, 1\right)$,
- $v_{i(t-1)} \sim z_{i(t-1)}N\left(\delta^{(s)}_{(t-1)} + \mathbf{x}_{i(t-1)}^T\beta^{(c)}_{(t-1)}, 1\right) + (1 - z_{i(t-1)})N\left(\mathbf{x}_{i(t-1)}^T\beta^{(c)}_{(t-1)}, 1\right)$, and
- $w_{ijt} \sim N\left(\mathbf{q}_{ijt}^T\lambda_t, 1\right)$.

Introducing these latent variables into the hierarchical formulation yields:

State model:
$$u_{i1} \mid \alpha \sim N\left(\mathbf{x}_{(o)i}'\alpha, 1\right), \qquad z_{i1} \mid u_i \sim \text{Bernoulli}\left(I_{u_i>0}\right)$$
for $t > 1$:
$$v_{i(t-1)} \mid z_{i(t-1)}, \beta_{t-1} \sim z_{i(t-1)}N\left(\delta^{(s)}_{(t-1)} + \mathbf{x}_{i(t-1)}'\beta^{(c)}_{(t-1)}, 1\right) + (1 - z_{i(t-1)})N\left(\mathbf{x}_{i(t-1)}'\beta^{(c)}_{(t-1)}, 1\right)$$
$$z_{it} \mid v_{i(t-1)} \sim \text{Bernoulli}\left(I_{v_{i(t-1)}>0}\right) \qquad (2-31)$$

Observed model:
$$w_{ijt} \mid \lambda_t \sim N\left(\mathbf{q}_{ijt}^T\lambda_t, 1\right), \qquad y_{ijt} \mid z_{it}, w_{ijt} \sim \text{Bernoulli}\left(z_{it}I_{w_{ijt}>0}\right) \qquad (2-32)$$

Note that the result presented in Section 2.2 corresponds to the particular case for $T = 1$ of the model specified by Equations 2-31 and 2-32.

As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let $\phi(x \mid \mu, \sigma^2)$ denote the pdf of a normally distributed random variable $x$ with mean $\mu$ and standard deviation $\sigma$. Also let

1. $\mathbf{W}_t = (\mathbf{w}_{1t}, \mathbf{w}_{2t}, \ldots, \mathbf{w}_{Nt})$ with $\mathbf{w}_{it} = (w_{i1t}, w_{i2t}, \ldots, w_{iJ_{it}t})$ (for $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$),

2. $\mathbf{u} = (u_1, u_2, \ldots, u_N)$, and

3. $\mathbf{V} = (\mathbf{v}_1, \ldots, \mathbf{v}_{T-1})$ with $\mathbf{v}_t = (v_{1t}, v_{2t}, \ldots, v_{Nt})$.

For the probit link model, the joint posterior distribution is

$$\pi\left(\mathbf{Z}, \mathbf{u}, \mathbf{V}, \{\mathbf{W}_t\}_{t=1}^T, \alpha, \mathbf{B}^{(c)}, \boldsymbol{\delta}^{(s)}\right) \propto [\,\alpha\,]\prod_{i=1}^{N}\phi\left(u_i \mid \mathbf{x}_{(o)i}'\alpha, 1\right)I_{u_i>0}^{z_{i1}}\,I_{u_i\leq 0}^{1-z_{i1}} \times$$
$$\prod_{t=2}^{T}\left[\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}\right]\prod_{i=1}^{N}\phi\left(v_{i(t-1)} \mid \mu^{(v)}_{i(t-1)}, 1\right)I_{v_{i(t-1)}>0}^{z_{it}}\,I_{v_{i(t-1)}\leq 0}^{1-z_{it}} \times$$
$$\prod_{t=1}^{T}[\,\lambda_t\,]\prod_{i=1}^{N}\prod_{j=1}^{J_{it}}\phi\left(w_{ijt} \mid \mathbf{q}_{ijt}'\lambda_t, 1\right)\left(z_{it}I_{w_{ijt}>0}\right)^{y_{ijt}}\left(1 - z_{it}I_{w_{ijt}>0}\right)^{(1-y_{ijt})},$$
$$\text{where } \mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + \mathbf{x}_{i(t-1)}'\beta^{(c)}_{t-1}. \qquad (2-33)$$

Initialize the Gibbs sampler at $\alpha^{(0)}$, $\mathbf{B}^{(c)(0)}$, $\boldsymbol{\delta}^{(s)(0)}$, and $\Lambda^{(0)}$.

The sampler proceeds iteratively by block sampling sequentially for each primary sampling period as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting $[\,\cdot \mid \cdot\,]$ denote the full conditional probability density function of a component, conditional on all other unknown parameters and the observed data, for $m = 1, \ldots, n_{sim}$ the sampling procedure can be summarized as

$$\left[\mathbf{z}_1^{(m)} \mid \cdot\right] \rightarrow \left[\mathbf{u}^{(m)} \mid \cdot\right] \rightarrow \left[\alpha^{(m)} \mid \cdot\right] \rightarrow \left[\mathbf{W}_1^{(m)} \mid \cdot\right] \rightarrow \left[\lambda_1^{(m)} \mid \cdot\right] \rightarrow \left[\mathbf{z}_2^{(m)} \mid \cdot\right] \rightarrow \left[\mathbf{V}_{2-1}^{(m)} \mid \cdot\right] \rightarrow \left[\beta^{(c)(m)}_{2-1}, \delta^{(s)(m)}_{2-1} \mid \cdot\right] \rightarrow \left[\mathbf{W}_2^{(m)} \mid \cdot\right] \rightarrow \left[\lambda_2^{(m)} \mid \cdot\right] \rightarrow \cdots$$
$$\cdots \rightarrow \left[\mathbf{z}_T^{(m)} \mid \cdot\right] \rightarrow \left[\mathbf{V}_{T-1}^{(m)} \mid \cdot\right] \rightarrow \left[\beta^{(c)(m)}_{T-1}, \delta^{(s)(m)}_{T-1} \mid \cdot\right] \rightarrow \left[\mathbf{W}_T^{(m)} \mid \cdot\right] \rightarrow \left[\lambda_T^{(m)} \mid \cdot\right]$$

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail within Appendix A.


Logit link Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by:

State model:
$$u_{i1} \mid \alpha \sim \text{PG}\left(\mathbf{x}_{(o)i}^T\alpha, 1\right), \qquad z_{i1} \mid u_i \sim \text{Bernoulli}\left(I_{u_i>0}\right)$$
for $t > 1$:
$$v_{i(t-1)} \mid \cdot \sim \text{PG}\left(1, \left|z_{i(t-1)}\delta^{(s)}_{(t-1)} + \mathbf{x}_{i(t-1)}'\beta^{(c)}_{(t-1)}\right|\right), \qquad z_{it} \mid v_{i(t-1)} \sim \text{Bernoulli}\left(I_{v_{i(t-1)}>0}\right) \qquad (2-34)$$

Observed model:
$$w_{ijt} \mid \lambda_t \sim \text{PG}\left(\mathbf{q}_{ijt}^T\lambda_t, 1\right), \qquad y_{ijt} \mid z_{it}, w_{ijt} \sim \text{Bernoulli}\left(z_{it}I_{w_{ijt}>0}\right) \qquad (2-35)$$

The logit link version of the joint posterior is given by

$$\pi\left(\mathbf{Z}, \mathbf{u}, \mathbf{V}, \{\mathbf{W}_t\}_{t=1}^T, \alpha, \mathbf{B}^{(s)}, \mathbf{B}^{(c)}\right) \propto \prod_{i=1}^{N}\frac{\left(e^{\mathbf{x}_{(o)i}'\alpha}\right)^{z_{i1}}}{1 + e^{\mathbf{x}_{(o)i}'\alpha}}\,\text{PG}\left(u_i;\, 1, |\mathbf{x}_{(o)i}'\alpha|\right)[\,\lambda_1\,][\,\alpha\,] \times$$
$$\prod_{j=1}^{J_{i1}}\left(z_{i1}\frac{e^{\mathbf{q}_{ij1}'\lambda_1}}{1 + e^{\mathbf{q}_{ij1}'\lambda_1}}\right)^{y_{ij1}}\left(1 - z_{i1}\frac{e^{\mathbf{q}_{ij1}'\lambda_1}}{1 + e^{\mathbf{q}_{ij1}'\lambda_1}}\right)^{1-y_{ij1}}\text{PG}\left(w_{ij1};\, 1, |z_{i1}\mathbf{q}_{ij1}'\lambda_1|\right) \times$$
$$\prod_{t=2}^{T}[\,\delta^{(s)}_{t-1}\,][\,\beta^{(c)}_{t-1}\,][\,\lambda_t\,]\prod_{i=1}^{N}\frac{\left(\exp\left[\mu^{(v)}_{i(t-1)}\right]\right)^{z_{it}}}{1 + \exp\left[\mu^{(v)}_{i(t-1)}\right]}\,\text{PG}\left(v_{it};\, 1, \left|\mu^{(v)}_{i(t-1)}\right|\right) \times$$
$$\prod_{j=1}^{J_{it}}\left(z_{it}\frac{e^{\mathbf{q}_{ijt}'\lambda_t}}{1 + e^{\mathbf{q}_{ijt}'\lambda_t}}\right)^{y_{ijt}}\left(1 - z_{it}\frac{e^{\mathbf{q}_{ijt}'\lambda_t}}{1 + e^{\mathbf{q}_{ijt}'\lambda_t}}\right)^{1-y_{ijt}}\text{PG}\left(w_{ijt};\, 1, |z_{it}\mathbf{q}_{ijt}'\lambda_t|\right), \qquad (2-36)$$

with $\mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + \mathbf{x}_{i(t-1)}'\beta^{(c)}_{t-1}$.


The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression (2-36) are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence

In this section we describe how the additional layer of complexity, space, can also be accounted for by continuing to use the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs) and extends the model proposed by Johnson et al. (2013) for the single season closed population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the Spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran, 2013).

Succinctly, the spatial dependence is accounted for in the model by adding a random vector $\eta$ assumed to have a conditionally-autoregressive (CAR) prior (also known as the Gaussian Markov random field prior). To define the prior, let the pair $G = (V, E)$ represent the undirected graph for the entire spatial region studied, where $V = (1, 2, \ldots, N)$ denotes the vertices of the graph (sites) and $E$ the set of edges between sites; $E$ is constituted by elements of the form $(i, j)$ indicating that sites $i$ and $j$ are spatially adjacent, for some $i, j \in V$. The prior for the spatial effects is then characterized by

$$[\,\eta \mid \tau\,] \propto \tau^{\text{rank}(Q)/2}\exp\left[-\frac{\tau}{2}\eta'Q\eta\right], \qquad (2-37)$$


where $Q = (\text{diag}(A\mathbf{1}) - A)$ is the precision matrix, with $A$ denoting the adjacency matrix. The entries of the adjacency matrix $A$ are such that $\text{diag}(A) = 0$ and $A_{ij} = I_{(i,j)\in E}$.
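As a small illustration (with a made-up neighborhood structure), the precision matrix of the CAR prior can be assembled directly from the adjacency matrix:

    import numpy as np

    # Hypothetical adjacency for 4 sites arranged on a line: 1-2-3-4
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    Q_car = np.diag(A.sum(axis=1)) - A         # Q = diag(A 1) - A
    print(np.linalg.matrix_rank(Q_car))        # rank N - 1, so the prior in (2-37) is improper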

The matrix $Q$ is singular; hence, the probability density defined in Equation 2-37 is improper, i.e., it doesn't integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper, the posterior for the model parameters is proper. If a constraint such as $\sum_k \eta_k = 0$ is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.

Assuming that all but the detection process are subject to spatial correlations, and using the notation we have developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by Equations 2-38 and 2-39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding the steps to sample $\eta^{(o)}$ and $\{\eta_t\}_{t=2}^{T}$, conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to sum the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.

State model:
$$z_{i1} \mid \alpha \sim \text{Bernoulli}(\psi_{i1}), \quad\text{where } \psi_{i1} = F\left(\mathbf{x}_{(o)i}^T\alpha + \eta^{(o)}_i\right)$$
$$\left[\eta^{(o)} \mid \tau\right] \propto \tau^{\text{rank}(Q)/2}\exp\left[-\frac{\tau}{2}\eta^{(o)\prime}Q\eta^{(o)}\right]$$
$$z_{it} \mid z_{i(t-1)}, \alpha, \beta_{t-1}, \lambda_{t-1} \sim \text{Bernoulli}\left(z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)}\right),$$
$$\text{where } \theta_{i(t-1)} = F\left(\delta^{(s)}_{(t-1)} + \mathbf{x}_{i(t-1)}^T\beta^{(c)}_{t-1} + \eta_{it}\right) \text{ and } \gamma_{i(t-1)} = F\left(\mathbf{x}_{i(t-1)}^T\beta^{(c)}_{t-1} + \eta_{it}\right)$$
$$[\,\eta_t \mid \tau\,] \propto \tau^{\text{rank}(Q)/2}\exp\left[-\frac{\tau}{2}\eta_t'Q\eta_t\right] \qquad (2-38)$$

Observed model:
$$y_{ijt} \mid z_{it}, \lambda_t \sim \text{Bernoulli}(z_{it}p_{ijt}), \quad\text{where } p_{ijt} = F(\mathbf{q}_{ijt}^T\lambda_t). \qquad (2-39)$$

In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran, 2013; Reich et al., 2006): (1) model parameters have no clear interpretation due to spatial confounding of the predictors with the spatial effect, (2) there is variance inflation due to spatial confounding, and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work by Reich et al. (2006). This methodology is summarized in what follows.

Let a vector of spatial effects $\eta$ have the CAR model given by (2-37) above. Now consider a random vector $\zeta \sim \text{MVN}\left(\mathbf{0}, \tau K'QK\right)$, with $Q$ defined as above, and where $\tau K'QK$ corresponds to the precision of the distribution (and not the covariance matrix), with the matrix $K$ satisfying $K'K = I$.

This last condition implies that the linear predictor $X\beta + \eta = X\beta + K\zeta$. With respect to how the matrix $K$ is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as $P^{\perp}AP^{\perp}$, with $P^{\perp} = I - X(X'X)^{-1}X'$, and where $A$ is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix $X$. These elements are incorporated into the spectral decomposition of the Moran operator. That is, its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to $X$, while its eigenvectors provide the patterns of spatial dependence residual to $X$. Thus, the matrix $K$ is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
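A short sketch of this construction follows, reusing the same hypothetical 4-site adjacency matrix as in the earlier spatial example; the number of eigenvectors retained is an arbitrary illustrative choice rather than a recommendation.

    import numpy as np

    def moran_basis(X, A, n_keep=None):
        """Columns of K: eigenvectors of the Moran operator P_perp A P_perp,
        ordered by decreasing eigenvalue (strongest positive spatial autocorrelation first)."""
        n = X.shape[0]
        P_perp = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
        M = P_perp @ A @ P_perp
        eigvals, eigvecs = np.linalg.eigh(M)       # M is symmetric, so eigh is appropriate
        order = np.argsort(eigvals)[::-1]
        K = eigvecs[:, order]
        return K[:, :n_keep] if n_keep is not None else K

    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)      # same hypothetical 4-site chain as before
    X = np.ones((4, 1))                            # intercept-only design
    print(moran_basis(X, A, n_keep=2).shape)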


Using this strategy, the new hierarchical formulation of our model is simply modified by letting $\eta^{(o)} = K^{(o)}\zeta^{(o)}$ and $\eta_t = K_t\zeta_t$, with

1. $\zeta^{(o)} \sim \text{MVN}\left(\mathbf{0}, \tau^{(o)}K^{(o)\prime}QK^{(o)}\right)$, where $K^{(o)}$ is the eigenvector matrix for $P^{(o)\perp}AP^{(o)\perp}$, and

2. $\zeta_t \sim \text{MVN}\left(\mathbf{0}, \tau_tK_t'QK_t\right)$, where $K_t$ is the eigenvector matrix for $P^{\perp}_tAP^{\perp}_t$, for $t = 2, 3, \ldots, T$.

The algorithms for the probit and logit link from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for $(\alpha, \zeta^{(o)})$ and $(\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}, \zeta_t)$, making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary

With a few exceptions (Dorazio & Taylor-Rodriguez, 2012; Johnson et al., 2013; Royle & Kery, 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.

In Dorazio & Taylor-Rodriguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, and allowed detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993). There, the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and the logit link yield similar results with large sample sizes; however, their results may be different when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single season model to work with the logit link function.

The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence, its assumptions may not be appropriate to address problems where the interest lies in the temporal dynamics of the population. Consequently, we developed a dynamic model that incorporates the notion that occupancy at a site previously occupied takes place through persistence, which depends both on survival and habitat suitability. By this we mean that a site occupied at time $t$ may again be occupied at time $t+1$ if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by $\mathbf{x}_{i(t-1)}'\beta^{(c)}_{t-1}$. They only differ in that persistence is also influenced by whether the site being occupied during season $t-1$ enhances the suitability of the site or harms it through density dependence.

Additionally, the study of the dynamics that govern distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites in the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).

Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the chapter, and (2) build a suite of additional extensions for occupancy models using this framework. The first of them will be used to incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy where the source and spatial scale of the data are accounted for. The second extension is a single season, spatially explicit, multiple species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.


CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
--Sherlock Holmes, The Sign of Four

3.1 Introduction

Occupancy models are often used to understand the mechanisms that dictate the distribution of a species. Therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs, 2014; Link & Barker, 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). In practice, this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores in a principled manner the entire model space. It is completely automatic, precluding the need for both tuning parameters in the sampling algorithm and subjective elicitation of parameter prior distributions.

As mentioned above, in ecological modeling, if model selection, or less frequently model averaging, is considered, the Akaike Information Criterion (AIC) (Akaike, 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). The AIC is designed to find the model that has, on average, the density closest in Kullback-Leibler distance to the density of the true data generating mechanism; the model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman, 2000). Commonly the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal to noise ratio, and as such it tends to overfit (Rao & Wu, 2001). Other versions of the AIC provide a bias correction that enhances the signal to noise ratio, leading to a stronger penalization for model complexity. Some examples are the AICc (Hurvich & Tsai, 1989) and AICu (McQuarrie et al., 1997); however, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu, 2001).

If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data generating mechanism is among those considered, asymptotically, Bayesian methods choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.

In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Conversely, there may be a prohibitively large number of parameters, making specifying priors for each of these parameters an arduous task. In addition to this, seemingly innocuous subjective choices for the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman, 1996).

One popular choice of "objective" prior is the reference prior (Berger & Bernardo, 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling with equal coverage-probability (Kass & Wasserman, 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.

In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion about model priors is deferred until Chapter 4; this chapter focuses on the priors on the parameter space.

The chapter is structured as follows. First, issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided.

Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and we analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodriguez, 2012; Kery et al., 2010).

3.2 Objective Bayesian Inference

As mentioned before, in practice, noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman, 1996) and characterizes the way in which Bayesian reference methods are interpreted today, i.e., noninformative priors are seen to be chosen by convention according to the situation.

A word of caution must be given when using noninformative priors. Difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number) and as such are said to depend on arbitrary constants.

Bayes factors strongly depend upon the prior distributions for the parameters included in each of the models being compared. This can be an important limitation considering that, when using noninformative priors, their introduction will result in the Bayes factors being a function of the ratio of arbitrary constants, given that these priors are typically improper (see Jeffreys, 1961; Pericchi, 2005; and references therein). Many different approaches have since been developed to deal with the arbitrary constants that arise when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi, 1996; Good, 1950; Lempers, 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith, 1982), and approximating the Bayes factor (see Haughton, 1988, as cited in Berger & Pericchi, 1996; Kass & Raftery, 1995; Tierney & Kadane, 1986).

3.2.1 The Intrinsic Methodology

Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained from combining the model structure with information contained in the observed data. Furthermore, they showed that as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.

Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure". This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended with some caution to nonnested models.

53

As mentioned before the Bayesian hypothesis testing procedure is highly sensitive

to parameter-prior specification and not all priors that are useful for estimation are

recommended for hypothesis testing or model selection Evidence of this is provided

by the Jeffreys-Lindley paradox which states that a point null hypothesis will always

be accepted when the variance of a conjugate prior goes to infinity (Robert 1993)

Additionally when comparing nested models the null model should correspond to

a substantial reduction in complexity from that of larger alternative models Hence

priors for the larger alternative models that place probability mass away from the null

model are wasteful If the true model is ldquofarrdquo from the null it will be easily detected by

any statistical procedure Therefore the prior on the alternative models should ldquowork

harderrdquo at selecting competitive models that are ldquocloserdquo to the null This principle known

as the Savage continuity condition (Gunel amp Dickey 1974) is widely recognized by

statisticians

Interestingly the intrinsic prior in correspondence with the BFIP automatically

satisfies the Savage continuity condition That is when comparing nested models the

intrinsic prior for the more complex model is centered around the null model and in spite

of being a limiting procedure it is not subject to the Jeffreys-Lindley paradox

Moreover beyond the usual pairwise consistency of the Bayes factor for nested

models Casella et al (2009) show that the corresponding Bayesian procedure with

intrinsic priors for variable selection in normal regression is consistent in the entire

class of normal linear models adding an important feature to the list of virtues of the

procedure Consistency of the BFIP for the case where the dimension of the alternative

model grows with the sample size is discussed in Moreno et al (2010)322 Mixtures of g-Priors

As previously mentioned in the Bayesian paradigm a model M in M is defined

by a sampling density and a prior distribution The sampling density associated with

model M is denoted by f (y|βM σ2M M) where (βM σ

2M) is a vector of model-specific

54

unknown parameters The prior for model M and its corresponding set of parameters is

π(βM σ2M M|M) = π(βM σ

2M |MM) middot π(M|M)

Objective local priors for the model parameters (βM σ2M) are achieved through

modifications and extensions of Zellnerrsquos g-prior (Liang et al 2008 Womack et al

2014) In particular below we focus on the intrinsic prior and provide some details for

other scaled mixtures of g-priors We defer the discussion on priors over the model

space until Chapter 5 where we describe them in detail and develop a few alternatives

of our own3221 Intrinsic priors

An automatic choice of an objective prior is the intrinsic prior (Berger amp Pericchi

1996 Moreno et al 1998) Because MB sube M for all M isin M the intrinsic prior for

(βM σ2M) is defined as an expected posterior prior

πI (βM σ2M |M) =

intpR(βM σ

2M |~yM)mR(~y|MB)d~y (3ndash1)

where ~y is a minimal training sample for model M I denotes the intrinsic distributions

and R denotes distributions derived from the reference prior πR(βM σ2M |M) = cM

dβMdσ2M

σ2M

In (3ndash1) mR(~y|M) =intint

f (~y|βM σ2M M)πR(βM σ

2M |M)dβMdσ2M is the reference marginal

of ~y under model M and pR(βM σ2M |~yM) =

f (~y|βM σ2MM)πR(βM σ2

M|M)

mR(~y|M)is the reference

posterior density

In the regression framework the reference marginal mR is improper and produces

improper intrinsic priors However the intrinsic Bayes factor of model M to the base

model MB is well-defined and given by

BF IMMB

(y) = (1minus R2M)

minus nminus|MB |2 times

int 1

0

n + sin2(π2θ) middot (|M|+ 1)

n +sin2(π

2θ)middot(|M|+1)1minusR2

M

nminus|M|

2sin2(π

2θ) middot (|M|+ 1)

n +sin2(π

2θ)middot(|M|+1)1minusR2

M

|M|minus|MB |

2

dθ (3ndash2)

55

where R2M is the coefficient of determination of model M versus model MB The Bayes

factor between two models M and M prime is defined as BF IMMprime(y) = BF I

MMB(y)BF I

MprimeMB(y)

The ldquogoodnessrdquo of the model M based on the intrinsic priors is given by its posterior

probability

pI (M|yM) =BF I

MMB(y)π(M|M)sum

MprimeisinM BF IMprimeMB

(y)π(M prime|M) (3ndash3)

It has been shown that the system of intrinsic priors produces consistent model selection

(Casella et al 2009 Giron et al 2010) In the context of well-formulated models the

true model MT is the smallest well-formulated model M isin M such that α isin M if βα = 0

If MT is the true model then the posterior probability of model MT based on equation

(3ndash3) converges to 13222 Other mixtures of g-priors

Scaled mixtures of g-priors place a reference prior on (βMBσ2) and a multivariate

normal distribution on β in M MB that is normal with mean 0 and precision matrix

qMw

nσ2ZprimeM (IminusH0)ZM

where H0 is the hat matrix associated with ZMB The prior is completed by a prior on w

and choice of scaling qM that is set at |M| + 1 to account for the minimal sample size of

M Under these assumptions the Bayesrsquo factor for M to MB is given by

BFMMB(y) =

(1minus R2

M

) nminus|MB |2

int n + w(|M|+ 1)

n + w(|M|+1)1minusR2

M

nminus|M|

2w(|M|+ 1)

n + w(|M|+1)1minusR2

M

|M|minus|MB |

2

π(w)dw

We consider the following priors on w The intrinsic prior is π(w) = Beta(w 05 05)

which is only defined for w isin (0 1) A version of the Zellner-Siow prior is given by

w sim Gamma(05 05) which produces a multivariate Cauchy distribution on β A family

of hyper-g priors are defined by π(w) prop wminus12(β + w)(α+1)2 which have Cauchy-like

tails but produce more shrinkage than the Cauchy prior

56

33 Objective Bayes Occupancy Model Selection

As mentioned before Bayesian inferential approaches used for ecological models

are lacking In particular there exists a need for suitable objective and automatic

Bayesian testing procedures and software implementations that explore thoroughly the

model space considered With this goal in mind in this section we develop an objective

intrinsic and fully automatic Bayesian model selection methodology for single season

site-occupancy models We refer to this method as automatic and objective given that

in its implementation no hyperparameter tuning is required and that it is built using

noninformative priors with good testing properties (eg intrinsic priors)

An inferential method for the occupancy problem is possible using the intrinsic

approach given that we are able to link intrinsic-Bayesian tools for the normal linear

model through our probit formulation of the occupancy model In other words because

we can represent the single season probit occupancy model through the hierarchy

yij |zi wij sim Bernoulli(ziIwijgt0

)wij |λ sim N

(qprimeijλ 1

)zi |vi sim Bernoulli

(Ivigt0

)vi |α sim N (x primeiα 1)

it is possible to solve the selection problem on the latent scale variables wij and vi and

to use those results at the level of the occupancy and detection processes

In what follows first we provide some necessary notation Then a derivation of

the intrinsic priors for the parameters of the detection and occupancy components

is outlined Using these priors we obtain the general form of the model posterior

probabilities Finally the results are incorporated in a model selection algorithm for

site-occupancy data Although the priors on the model space are not discussed in this

Chapter the software and methods developed have different choices of model priors

built in

57

331 Preliminaries

The notation used in Chapter 2 will be considered in this section as well Namely

presence will be denoted by z detection by y their corresponding latent processes are

v and w and the model parameters are denoted by α and λ However some additional

notation is also necessary Let M0 =M0y M0z

denote the ldquobaserdquo model defined by

the smallest models considered for the detection and presence processes The base

models M0y and M0z include predictors that must be contained in every model that

belongs to the model space Some examples of base models are the intercept only

model a model with covariates related to the sampling design and a model including

some predictors important to the researcher that should be included in every model

Furthermore let the sets [Kz ] = 1 2 Kz and [Ky ] = 1 2 Ky index

the covariates considered for the variable selection procedure for the presence and

detection processes respectively That is these sets denote the covariates that can

be added from the base models in M0 or removed from the largest possible models

considered MF z and MF y which we will refer to as the ldquofullrdquo models The model space

can then be represented by the Cartesian product of subsets such that Ay sube [Ky ]

and Az sube [Kz ] The entire model space is populated by models of the form MA =MAy

MAz

isin M = My timesMz with MAy

isin My and MAzisin Mz

For the presence process z the design matrix for model MAzis given by the block

matrix XAz= (X0|Xr A) X0 corresponds to the design matrix of the base model ndash which

is such that M0z sube MAzisin Mz for all Az isin [Kz ] ndash and Xr A corresponds to the submatrix

that contains the covariates indexed by Az Analogously for the detection process y the

design matrix is given by QAy= (Q0|Qr A) Similarly the coefficients for models MAz

and

MAyare given by αA = (αprime

0αprimer A)

prime and λA = (λprime0λ

primer A)

prime

With these elements in place the model selection problem consists of finding

subsets of covariates indexed by A = Az Ay that have a high posterior probability

given the detection and occupancy processes This is equivalent to finding models with

58

high posterior odds when compared to a suitable base model These posterior odds are

given by

p(MA|y z)p(M0|y z)

=m(y z|MA)π(MA)

m(y z|M0)π(M0)= BFMAM0

(y z)π(MA)

π(M0)

Since we are able to represent the occupancy model as a truncation of latent

normal variables it is possible to work through the occupancy model selection problem

in the latent normal scale used for the presence and detection processes We formulate

two solutions to this problem one that depends on the observed and latent components

and another that solely depends on the latent level variables used to data-augment the

problem We will however focus on the latter approach as this yields a straightforward

MCMC sampling scheme For completeness the other alternative is described in

Section 34

At the root of our objective inferential procedure for occupancy models lies the

conditional argument introduced by Womack et al (work in progress) for the simple

probit regression In the occupancy setting the argument is

p(MA|y zw v) =m(y z vw|MA)π(MA)

m(y zw v)

=fyz(y z|w v)

(intfvw(vw|αλMA)παλ(αλ|MA)d(αλ)

)π(MA)

fyz(y z|w v)sum

MlowastisinM(int

fvw(vw|αλMlowast)παλ(αλ|Mlowast)d(αλ))π(Mlowast)

=m(v|MAz

)m(w|MAy)π(MA)

m(v)m(w)

prop m(v|MAz)m(w|MAy

)π(MA) (3ndash4)

where

1 fyz(y z|w v) =prodN

i=1 Izivigt0I

(1minuszi )vile0

prodJ

j=1(ziIwijgt0)yij (1minus ziIwijgt0)

1minusyij

2 fvw(vw|αλMA) =

(Nprodi=1

ϕ(vi xprimeiαMAz

1)

)︸ ︷︷ ︸

f (v|αr Aα0MAz )

(Nprodi=1

Jiprodj=1

ϕ(wij qprimeijλMAy

1)

)︸ ︷︷ ︸

f (w|λr Aλ0MAy )

and

59

3 παλ(αλ|MA) = πα(α|MAz)πλ(λ|MAy

)

This result implies that once the occupancy and detection indicators are

conditioned on the latent processes v and w respectively the model posterior

probabilities only depend on the latent variables Hence in this case the model

selection problem is driven by the posterior odds

p(MA|y zw v)p(M0|y zw v)

=m(w v|MA)

m(w v|M0)

π(MA)

π(M0) (3ndash5)

where m(w v|MA) = m(w|MAy) middotm(v|MAz

) with

m(v|MAz) =

int intf (v|αr Aα0MAz

)π(αr A|α0MAz)π(α0)dαr Adα0

(3ndash6)

m(w|MAy) =

int intf (w|λr Aλ0MAy

)π(λr A|λ0MAy)π(λ0)dλ0dλr A

(3ndash7)

332 Intrinsic Priors for the Occupancy Problem

In general the intrinsic priors as defined by Moreno et al (1998) use the functional

form of the response to inform their construction assuming some preliminary prior

distribution proper or improper on the model parameters For our purposes we assume

noninformative improper priors for the parameters denoted by πN(middot|middot) Specifically the

intrinsic priors πIP(θMlowast|Mlowast) for a vector of parameters θMlowast corresponding to model

Mlowast isin M0M sub M for a response vector s with probability density (or mass) function

f (s|θMlowast) are defined by

πIP(θM0|M0) = πN(θM0

|M0)

πIP(θM |M) = πN(θM |M)

intm(~s|M)

m(~s|M0)f (~s|θM M)d~s

where ~s is a theoretical training sample

In what follows whenever it is clear from the context in an attempt to simplify the

notation MA will be used to refer to MAzor MAy

and A will denote Az or Ay To derive

60

the parameter priors involved in equations 3ndash6 and 3ndash7 using the objective intrinsic prior

strategy we start by assuming flat priors πN(αA|MA) prop cA and πN(λA|MA) prop dA where

cA and dA are unknown constants

The intrinsic prior for the parameters associated with the occupancy process αA

conditional on model MA is

πIP(αA|MA) = πN(αA|MA)

intm(~v|MA)

m(~v|M0)f (~v|αAMA)d~v

where the marginals m(~v|Mj) with j isin A 0 are obtained by solving the analogous

equation 3ndash6 for the (theoretical) training sample ~v These marginals are given by

m(~v|Mj) = cj (2π)pjminusp0

2 |~X primej~Xj |

12 eminus

12~vprime(Iminus~Hj )~v

The training sample ~v has dimension pAz=∣∣MAz

∣∣ that is the total number of

parameters in model MAz Note that without ambiguity we use

∣∣ middot ∣∣ to denote both

the cardinality of a set and also the determinant of a matrix The design matrix ~XA

corresponds to the training sample ~v and is chosen such that ~X primeA~XA =

pAzNX primeAXA

(Leon-Novelo et al 2012) and ~Hj is the corresponding hat matrix

Replacing m(~v|MA) and m(~v|M0) in πIP(αA|MA) and solving the integral with

respect to the theoretical training sample ~v we have

πIP(αA|MA) = cA

int ((2π)minus

pAzminusp0z2

(c0

cA

)eminus

12~vprime((Iminus~HA)minus(Iminus~H0))~v |~X

primeA~XA|12

|~X prime0~X0|12

)times(

(2π)minuspAz2 eminus

12(~vminus~XAαA)

prime(~vminus~XAαA))d~v

= c0(2π)minus

pAzminusp0z2 |~X prime

Ar~XAr |

12 2minus

pAzminusp0z2 exp

[minus1

2αprimer A

(1

2~X primer A

~Xr A

)αr A

]= πN(α0)timesN

(αr A

∣∣ 0 2 middot ( ~X primer A

~Xr A)minus1)

(3ndash8)

61

Analogously the intrinsic prior for the parameters associated to the detection

process is

πIP(λA|MA) = d0(2π)minus

pAyminusp0y2 | ~Q prime

Ar~QAr |

12 2minus

pAyminusp0y2 exp

[minus1

2λprimer A

(1

2~Q primer A

~Qr A

)λr A

]= πN(λ0)timesN

(λr A

∣∣ 0 2 middot ( ~Q primeA~QA)

minus1)

(3ndash9)

In short the intrinsic priors for αA = (αprime0α

primer A)

prime and λprimeA = (λprime

0λprimer A)

prime are the product

of a reference prior on the parameters of the base model and a normal density on the

parameters indexed by Az and Ay respectively333 Model Posterior Probabilities

We now derive the expressions involved in the calculations of the model posterior

probabilities First recall that p(MA|y zw v) prop m(w v|MA)π(MA) Hence determining

this posterior probability only requires calculating m(w v|MA)

Note that since w and v are independent obtaining the model posteriors from

expression 3ndash4 reduces to finding closed form expressions for the marginals m(v |MAz)

and m(w |MAy) respectively from equations 3ndash6 and 3ndash7 Therefore

m(w v|MA) =

int intf (vw|αλMA)π

IP (α|MAz)πIP

(λ|MAy

)dαdλ

(3ndash10)

For the latent variable associated with the occupancy process plugging the

parameter intrinsic prior given by 3ndash8 into equation 3ndash6 (recalling that ~X primeA~XA =

pAzNX primeAXA)

and integrating out αA yields

m(v|MA) =

int intc0N (v|X0α0 + Xr Aαr A I)N

(αr A|0 2( ~X prime

r A~Xr A)

minus1)dαr Adα0

= c0(2π)minusn2

int (pAz

2N + pAz

) (pAzminusp0z

)

2

times

exp[minus1

2(v minus X0α0)

prime(I minus

(2N

2N + pAz

)Hr Az

)(v minus X0α0)

]dα0

62

= c0 (2π)minus(nminusp0z )2

(pAz

2N + pAz

) (pAzminusp0z

)

2

|X prime0X0|minus

12 times

exp[minus1

2vprime(I minus H0z minus

(2N

2N + pAz

)Hr Az

)v

] (3ndash11)

with Hr Az= HAz

minus H0z where HAzis the hat matrix for the entire model MAz

and H0z is

the hat matrix for the base model

Similarly the marginal distribution for w is

m(w|MA) = d0 (2π)minus(Jminusp0y )2

(pAy

2J + pAy

) (pAyminusp0y

)

2

|Q prime0Q0|minus

12 times

exp[minus1

2wprime(I minus H0y minus

(2J

2J + pAy

)Hr Ay

)w

] (3ndash12)

where J =sumN

i=1 Ji or in other words J denotes the total number of surveys conducted

Now the posteriors for the base model M0 =M0y M0z

are

m(v|M0) =

intc0N (v|X0α0 I) dα0

= c0(2π)minus(nminusp0z )2 |X prime

0X0|minus12 exp

[minus1

2(v (I minus H0z ) v)

](3ndash13)

and

m(w|M0) = d0(2π)minus(Jminusp0y )2 |Q prime

0Q0|minus12 exp

[minus1

2

(w(I minus H0y

)w)]

(3ndash14)

334 Model Selection Algorithm

Having the parameter intrinsic priors in place and knowing the form of the model

posterior probabilities it is finally possible to develop a strategy to conduct model

selection for the occupancy framework

For each of the two components of the model ndashoccupancy and detectionndash the

algorithm first draws the set of active predictors (ie Az and Ay ) together with their

corresponding parameters This is a reversible jump step which uses a Metropolis

63

Hastings correction with proposal distributions given by

q(Alowastz |zo z(t)u v(t)MAz

) =1

2

(p(MAlowast

z|zo z(t)u v(t)Mz MAlowast

zisin L(MAz

)) +1

|L(MAz)|

)q(Alowast

y |y zo z(t)u w(t)MAy) =

1

2

(p(MAlowast

w|y zo z(t)u w(t)My MAlowast

yisin L(MAy

)) +1

|L(MAy)|

)(3ndash15)

where L(MAz) and L(MAy

) denote the sets of models obtained from adding or removing

one predictor at a time from MAzand MAy

respectively

To promote mixing this step is followed by an additional draw from the full

conditionals of α and λ The densities p(α0|) p(αr A|) p(λ0|) and p(λr A|) can

be sampled from directly with Gibbs steps Using the notation a|middot to denote the random

variable a conditioned on all other parameters and on the data these densities are given

by

bull α0|middot sim N((X

prime0X0)

minus1Xprime0v (X

prime0X0)

minus1)bull αr A|middot sim N

(microαr A

αr A

) where the mean vector and the covariance matrix are

given by αr A= 2N

2N+pAz(X

prime

r AXr A)minus1 and microαr A

=(αr A

Xprime

r Av)

bull λ0|middot sim N((Q

prime0Q0)

minus1Qprime0w (Q

prime0Q0)

minus1) and

bull λr A|middot sim N(microλr A

λr A

) analogously with mean and covariance matrix given by

λr A= 2J

2J+pAy(Q

prime

r AQr A)minus1 and microλr A

=(λr A

Qprime

r Aw)

Finally Gibbs sampling steps are also available for the unobserved occupancy

indicators zu and for the corresponding latent variables v and w The full conditional

posterior densities for z(t+1)u v(t+1) and w(t+1) are those introduced in Chapter 2 for the

single season probit model

The following steps summarize the stochastic search algorithm

1 Initialize A(0)y A

(0)z z

(0)u v(0)w(0)α(0)

0 λ(0)0

2 Sample the model indices and corresponding parameters

(a) Draw simultaneously

64

bull Alowastz sim q(Az |zo z(t)u v(t)MAz

)

bull αlowast0 sim p(α0|MAlowast

z zo z

(t)u v(t)) and

bull αlowastr Alowast sim p(αr A|MAlowast

z zo z

(t)u v(t))

(b) Accept (M(t+1)Az

α(t+1)10 α(t+1)1

r A ) = (MAlowastzαlowast

0αlowastr Alowast) with probability

δz = min

(1

p(MAlowastz|zo z(t)u v(t))

p(MA(t)z|zo z(t)u v(t))

q(A(t)z |zo z(t)u v(t)MAlowast

z)

q(Alowastz |zo z

(t)u v(t)MAz

)

)

otherwise let (M(t+1)Az

α(t+1)10 α(t+1)1

r A ) = (A(t)z α(t)2

0 α(t)2r A )

(c) Sample simultaneously

bull Alowasty sim q(Ay |y zo z(t)u w(t)MAy

)

bull λlowast0 sim p(λ0|MAlowast

y y zo z

(t)u w(t)) and

bull λlowastr Alowast sim p(λr A|MAlowast

y y zo z

(t)u w(t))

(d) Accept (M(t+1)Ay

λ(t+1)10 λ(t+1)1

r A ) = (MAlowastyλlowast

0λlowastr Alowast) with probability

δy = min

(1

p(MAlowastz|y zo z(t)u w(t))

p(MA(t)z|y zo z(t)u w(t))

q(A(t)z |y zo z(t)u w(t)MAlowast

y)

q(Alowastz |y zo z

(t)u w(t)MAy

)

)

otherwise let (M(t+1)Ay

λ(t+1)10 λ(t+1)1

r A ) = (A(t)y λ(t)2

0 λ(t)2r A )

3 Sample base model parameters

(a) Draw α(t+1)20 sim p(α0|MA

(t+1)z

zo z(t)u v(t))

(b) Draw λ(t+1)20 sim p(λ0|MA(t+1)y

y zo z(t)u v(t))

4 To improve mixing resample model coefficients not present the base model butare in MA

(a) Draw α(t+1)2r A sim p(αr A|MA

(t+1)z

zo z(t)u v(t))

(b) Draw λ(t+1)2r A sim p(λr A|MA

(t+1)y

yzo z(t)u v(t))

5 Sample latent and missing (unobserved) variables

(a) Sample z(t+1)u sim p(zu|MA(t+1)z

yα(t+1)2r A α(t+1)2

0 λ(t+1)2r A λ(t+1)2

0 )

(b) Sample v(t+1) sim p(v|MA(t+1)z

zo z(t+1)u α(t+1)2

r A α(t+1)20 )

65

(c) Sample w(t+1) sim p(w|MA(t+1)y

zo z(t+1)u λ(t+1)2

r A λ(t+1)20 )

34 Alternative Formulation

Because the occupancy process is partially observed it is reasonable to consider

the posterior odds in terms of the observed responses that is the detections y and

the presences at sites where at least one detection takes place Partitioning the vector

of presences into observed and unobserved z = (zprimeo zprimeu)

prime and integrating out the

unobserved component the model posterior for MA can be obtained as

p(MA|y zo) prop Ezu [m(y z|MA)] π(MA) (3ndash16)

Data-augmenting the model in terms of latent normal variables a la Albert and Chib

the marginals for any model My Mz = M isin M of z and y inside of the expectation in

equation 3ndash16 can be expressed in terms of the latent variables

m(y z|M) =

intT (z)

intT (yz)

m(w v|M)dwdv

=

(intT (z)

m(v| Mz)dv

)(intT (yz)

m(w|My)dw

) (3ndash17)

where T (z) and T (y z) denote the corresponding truncation regions for v and w which

depend on the values taken by z and y and

m(v|Mz) =

intf (v|αMz)π(α|Mz)dα (3ndash18)

m(w|My) =

intf (w|λMy)π(λ|My)dλ (3ndash19)

The last equality in equation 3ndash17 is a consequence of the independence of the

latent processes v and w Using expressions 3ndash18 and 3ndash19 allows one to embed this

model selection problem in the classical linear normal regression setting where many

ldquoobjectiverdquo Bayesian inferential tools are available In particular these expressions

facilitate deriving the parameter intrinsic priors (Berger amp Pericchi 1996 Moreno

et al 1998) for this problem This approach is an extension of the one implemented in

Leon-Novelo et al (2012) for the simple probit regression problem

66

Using this alternative approach all that is left is to integrate m(v|MA) and m(w|MA)

over their corresponding truncation regions T (z) and T (y z) which yields m(y z|MA)

and then to obtain the expectation with respect to the unobserved zrsquos Note however

two issues arise First such integrals are not available in closed form Second

calculating the expectation over the limit of integration further complicates things To

address these difficulties it is possible to express E [m(y z|MA)] as

Ezu [m(y z|MA)] = Ezu

[(intT (z)

m(v| MAz)dv

)(intT (yz)

m(w|MAy)dw

)](3ndash20)

= Ezu

[(intT (z)

intm(v| MAz

α0)πIP(α0|MAz

)dα0dv

)times(int

T (yz)

intm(w| MAy

λ0)πIP(λ0|MAy

)dλ0dw

)]

= Ezu

int (int

T (z)

m(v| MAzα0)dv

)︸ ︷︷ ︸

g1(T (z)|MAz α0)

πIP(α0|MAz)dα0 times

int (intT (yz)

m(w|MAyλ0)dw

)︸ ︷︷ ︸

g2(T (yz)|MAy λ0)

πIP(λ0|MAy)dλ0

= Ezu

[intg1(T (z)|MAz

α0)πIP(α0|MAz

)dα0 timesintg2(T (y z)|MAy

λ0)πIP(λ0|MAy

)dλ0

]= c0 d0

int intEzu

[g1(T (z)|MAz

α0)g2(T (y z)|MAyλ0)

]dα0 dλ0

where the last equality follows from Fubinirsquos theorem since m(v|MAzα0) and

m(w|MAyλ0) are proper densities From 3ndash21 the posterior odds are

p(MA|y zo)p(M0|y zo)

=

int intEzu

[g1(T (z)|MAz

α0)g2(T (y z)|MAyλ0)

]dα0 dλ0int int

Ezu

[g1(T (z)|M0z α0)g2(T (y z)|M0y λ0)

]dα0 dλ0

π(MA)

π(M0)

(3ndash21)

67

35 Simulation Experiments

The proposed methodology was tested under 36 different scenarios where we

evaluate the behavior of the algorithm by varying the number of sites the number of

surveys the amount of signal in the predictors for the presence component and finally

the amount of signal in the predictors for the detection component

For each model component the base model is taken to be the intercept only model

and the full models considered for the presence and the detection have respectively 30

and 20 predictors Therefore the model space contains 230times220 asymp 112times1015 candidate

models

To control the amount of signal in the presence and detection components values

for the model parameter were purposefully chosen so that quantiles 10 50 and 90 of the

occupancy and detection probabilities match some pre-specified probabilities Because

presence and detection are binary variables the amount of signal in each model

component associates to the spread and center of the distribution for the occupancy and

detection probabilities respectively Low signal levels relate to occupancy or detection

probabilities close to 50 High signal levels associate with probabilities close to 0 or 1

Large spreads of the distributions for the occupancy and detection probabilities reflect

greater heterogeneity among the observations collected improving the discrimination

capability of the model and viceversa

Therefore for the presence component the parameter values of the true model

were chosen to set the median for the occupancy probabilities equal 05 The chosen

parameter values also fix quantiles 10 and 90 symmetrically about 05 at small (Qz10 =

03Qz90 = 07) intermediate (Qz

10 = 02Qz90 = 08) and large (Qz

10 = 01Qz90 = 09)

distances For the detection component the model parameters are obtained to reflect

detection probabilities concentrated about low values (Qy50 = 02) intermediate values

(Qy50 = 05) and high values (Qy

50 = 08) while keeping quantiles 10 and 90 fixed at 01

and 09 respectively

68

Table 3-1 Simulation control parameters occupancy model selectorParameter Values considered

N 50 100

J 3 5

(Qz10Q

z50Q

z90)

(03 05 07) (02 05 08) (01 05 09)

(Qy

10Qy50Q

y90)

(01 02 09) (01 05 09) (01 08 09)

There are in total 36 scenarios these result from crossing all the levels of the

simulation control parameters (Table 3-1) Under each of these scenarios 20 data sets

were generated at random True presence and detection indicators were generated

with the probit model formulation from Chapter 2 This with the assumed true models

MTz = 1 x2 x15 x16 x22 x28 for the presence and MTy = 1 q7 q10 q12 q17 for

the detection with the predictors included in the randomly generated datasets In this

context 1 represents the intercept term Throughout the Section we refer to predictors

included in the true models as true predictors and to those absent as false predictors

The selection procedure was conducted using each one of these data sets with

two different priors on the model space the uniform or equal probability prior and a

multiplicity correcting prior

The results are summarized through the marginal posterior inclusion probabilities

(MPIPs) for each predictor and also the five highest posterior probability models (HPM)

The MPIP for a given predictor under a specific scenario and for a particular data set is

defined as

p(predictor is included|y zw v) =sumMisinM

I(predictorisinM)p(M|y zw vM) (3ndash22)

In addition we compare the MPIP odds between predictors present in the true model

and predictors absent from it Specifically we consider the minimum odds of marginal

posterior inclusion probabilities for the predictors Let ~ξ and ξ denote respectively a

69

predictor in the true model MT and a predictor absent from MT We define the minimum

MPIP odds between the probabilities of true and false predictor as

minOddsMPIP =min~ξisinMT

p(I~ξ = 1|~ξ isin MT )

maxξ isinMTp(Iξ = 1|ξ isin MT )

(3ndash23)

If the variable selection procedure adequately discriminates true and false predictors

minOddsMPIP will take values larger than one The ability of the method to discriminate

between the least probable true predictor and the most probable false predictor worsens

as the indicator approaches 0351 Marginal Posterior Inclusion Probabilities for Model Predictors

For clarity in Figures 3-1 through 3-5 only predictors in the true models are labeled

and are emphasized with a dotted line passing through them The left hand side plots

in these figures contain the results for the presence component and the ones on the

right correspond to predictors in the detection component The results obtained with

the uniform model priors correspond to the black lines and those for the multiplicity

correcting prior are in red In these Figures the MPIPrsquos have been averaged over all

datasets corresponding scenarios matching the condition observed

In Figure 3-1 we contrast the mean MPIPrsquos of the predictors over all datasets from

scenarios with 50 sites to the mean MPIPrsquos obtained for the scenarios with 100 sites

Similarly Figure 3-2 compares the mean MPIPrsquos of scenarios where 3 surveys are

performed to those of scenarios having 5 surveys per site Figures 3-4 and 3-5 show the

effect of the different levels of signal considered in the occupancy probabilities and in the

detection probabilities

From these figures mainly three results can be drawn (1) the effect of the model

prior is substantial (2) the proposed methods yield MPIPrsquos that clearly separate

true predictors from false predictors and (3) the separation between MPIPrsquos of true

predictors and false predictors is noticeably larger in the detection component

70

Regardless of the simulation scenario and model component observed under the

uniform prior false predictors obtain a relatively high MPIP Conversely the multiplicity

correction prior strongly shrinks towards 0 the MPIP for false predictors In the presence

component the MPIP for the true predictors is shrunk substantially under the multiplicity

prior however there remains a clear separation between true and false predictors In

contrast in the detection component the MPIP for true predictors remains relatively high

(Figures 3-1 through 3-5)

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

Unif N=50MC N=50

Unif N=100MC N=100

Figure 3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites usinguniform (U) and multiplicity correction (MC) priors

71

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

Unif J=3MC J=3

Unif J=5MC J=5

Figure 3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per siteusing uniform (U) and multiplicity correction (MC) priors

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

Unif N=50 J=3Unif N=50 J=5

Unif N=100 J=3Unif N=100 J=5

MC N=50 J=3MC N=50 J=5

MC N=100 J=3MC N=100 J=5

Figure 3-3 Predictor MPIP averaged over scenarios with the interaction between thenumber of sites and the surveys per site using uniform (U) and multiplicitycorrection (MC) priors

72

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

U(03 05 07)MC(03 05 07)

U(02 05 08)MC(02 05 08)

U(01 05 09)MC(01 05 09)

Figure 3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancyprobabilities using uniform (U) and multiplicity correction (MC) priors

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

U(01 02 09)MC(01 02 09)

U(01 05 09)MC(01 05 09)

U(01 08 09)MC(01 08 09)

Figure 3-5 Predictor MPIP averaged over scenarios with equal signal in the detectionprobabilities using uniform (U) and multiplicity correction (MC) priors

73

In scenarios where more sites were surveyed the separation between the MPIP of

true and false predictors grew in both model components (Figure 3-1) Increasing the

number of sites has an effect over both components given that every time a new site is

included covariate information is added to the design matrix of both the presence and

the detection components

On the hand increasing the number of surveys affects the MPIP of predictors in the

detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors

of the presence component This may appear to be counterintuitive however increasing

the number of surveys only increases the number of observation in the design matrix

for the detection while leaving unaltered the design matrix for the presence The small

changes observed in the MPIP for the presence predictors J increases are exclusively

a result of having additional detection indicators equal to 1 in sites where with less

surveys would only have 0 valued detections

From Figure 3-3 it is clear that for the presence component the effect of the number

of sites dominates the behavior of the MPIP especially when using the multiplicity

correction priors In the detection component the MPIP is influenced by both the number

of sites and number of surveys The influence of increasing the number of surveys is

larger when considering a smaller number of sites and viceversa

Regarding the effect of the distribution for the occupancy probabilities we observe

that mostly the detection component is affected There is stronger discrimination

between true and false predictors as the distribution has a higher variability (Figure

3-4) This is consistent with intuition since having the presence probabilities more

concentrated about 05 implies that the predictors do not vary much from one site to

the next whereas having the occupancy probabilities more spread out would have the

opposite effect

Finally concentrating the detection probabilities about high or low values For

predictors in the detection component the separation between MPIP of true and false

74

predictors is larger both in scenarios where the distribution of the detection probability

is centered about 02 or 08 when compared to those scenarios where this distribution

is centered about 05 (where the signal of the predictors is weakest) For predictors in

the presence component having the detection probabilities centered at higher values

slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and

reduces that of false predictors

Table 3-2 Comparison of average minOddsMPIP under scenarios having differentnumber of sites (N=50 N=100) and under scenarios having different numberof surveys per site (J=3 J=5) for the presence and detection componentsusing uniform and multiplicity correction priors

Sites SurveysComp π(M) N=50 N=100 J=3 J=5

Presence Unif 112 131 119 124MC 320 846 420 674

Detection Unif 203 264 211 257MC 2115 3246 2139 3252

Table 3-3 Comparison of average minOddsMPIP for different levels of signal consideredin the occupancy and detection probabilities for the presence and detectioncomponents using uniform and multiplicity correction priors

(Qz10Q

z50Q

z90) (Qy

10Qy50Q

y90)

Comp π(M) (030507) (020508) (010509) (010209) (010509) (010809)

Presence Unif 105 120 134 110 123 124MC 202 455 805 238 619 640

Detection Unif 234 234 230 257 200 238MC 2537 2077 2528 2933 1852 2849

The separation between the MPIP of true and false predictors is even more

notorious in Tables 3-2 and 3-3 where the minimum MPIP odds between true and

false predictors are shown Under every scenario the value for the minOddsMPIP (as

defined in 3ndash23) was greater than 1 implying that on average even the lowest MPIP

for a true predictor is higher than the maximum MPIP for a false predictor In both

components of the model the minOddsMPIP are markedly larger under the multiplicity

correction prior and increase with the number of sites and with the number of surveys

75

For the presence component increasing the signal in the occupancy probabilities

or having the detection probabilities concentrate about higher values has a positive and

considerable effect on the magnitude of the odds For the detection component these

odds are particularly high specially under the multiplicity correction prior Also having

the distribution for the detection probabilities center about low or high values increases

the minOddsMPIP 352 Summary Statistics for the Highest Posterior Probability Model

Tables 3-4 through 3-7 show the number of true predictors that are included in

the HPM (True +) and the number of false predictors excluded from it (True minus)

The mean percentages observed in these Tables provide one clear message The

highest probability models chosen with either model prior commonly differ from the

corresponding true models The multiplicity correction priorrsquos strong shrinkage only

allows a few true predictors to be selected but at the same time it prevents from

including in the HPM any false predictors On the other hand the uniform prior includes

in the HPM a larger proportion of true predictors but at the expense of also introducing

a large number of false predictors This situation is exacerbated in the presence

component but also occurs to a minor extent in the detection component

Table 3-4 Comparison between scenarios with 50 and 100 sites in terms of the averagepercentage of true positive and true negative terms over the highestprobability models for the presence and the detection components usinguniform and multiplicity correcting priors on the model space

True + True minusComp π(M) N=50 N=100 N=50 N=100

Presence Unif 057 063 051 055MC 006 013 100 100

Detection Unif 077 085 087 093MC 049 070 100 100

Having more sites or surveys improves the inclusion of true predictors and exclusion

of false ones in the HPM for both the presence and detection components (Tables 3-4

and 3-5) On the other hand if the distribution for the occupancy probabilities is more

76

Table 3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of thepercentage of true positive and true negative predictors averaged over thehighest probability models for the presence and the detection componentsusing uniform and multiplicity correcting priors on the model space

True + True minusComp π(M) J=3 J=5 J=3 J=5

Presence Unif 059 061 052 054MC 008 010 100 100

Detection Unif 078 085 087 092MC 050 068 100 100

spread out the HPM includes more true predictors and less false ones in the presence

component In contrast the effect of the spread of the occupancy probabilities in the

detection HPM is negligible (Table 3-6) Finally there is a positive relationship between

the location of the median for the detection probabilities and the number of correctly

classified true and false predictors for the presence The HPM in the detection part of

the model responds positively to low and high values of the median detection probability

(increased signal levels) in terms of correctly classified true and false predictors (Table

3-7)

Table 3-6 Comparison between scenarios with different level of signal in the occupancycomponent in terms of the percentage of true positive and true negativepredictors averaged over the highest probability models for the presence andthe detection components using uniform and multiplicity correcting priors onthe model space

True + True minusComp π(M) (030507) (020508) (010509) (030507) (020508) (010509)

Presence Unif 055 061 064 050 054 055MC 002 008 018 100 100 100

Detection Unif 081 082 081 090 089 089MC 057 061 059 100 100 100

36 Case Study Blue Hawker Data Analysis

During 1999 and 2000 an intensive volunteer surveying effort coordinated by the

Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze

the distribution of the blue hawker Ashna cyanea (Odonata Aeshnidae) a common

dragonfly in Switzerland Given that Switzerland is a small and mountainous country

77

Table 3-7 Comparison between scenarios with different level of signal in the detectioncomponent in terms of the percentage of true positive and true negativepredictors averaged over the highest probability models for the presence andthe detection components using uniform and multiplicity correcting priors onthe model space

True + True minusComp π(M) (010209) (010509) (010809) (010209) (010509) (010809)

Presence Unif 059 059 062 051 054 054MC 006 010 011 100 100 100

Detection Unif 089 077 078 091 087 091MC 070 048 059 100 100 100

there is large variation in its topography and physio-geography as such elevation is a

good candidate covariate to predict species occurrence at a large spatial scale It can

be used as a proxy for habitat type intensity of land use temperature as well as some

biotic factors (Kery et al 2010)

Repeated visits to 1-ha pixels took place to obtain the corresponding detection

history In addition to the survey outcome the x and y-coordinates thermal-level the

date of the survey and the elevation were recorded Surveys were restricted to the

known flight period of the blue hawker which takes place between May 1 and October

10 In total 2572 sites were surveyed at least once during the surveying period The

number of surveys per site ranges from 1 to 22 times within each survey year

Kery et al (2010) summarize the results of this effort using AIC-based model

comparisons first by following a backwards elimination approach for the detection

process while keeping the occupancy component fixed at the most complex model and

then for the presence component choosing among a group of three models while using

the detection model chosen In our analysis of this dataset for the detection and the

presence we consider as the full models those used in Kery et al (2010) namely

minus1(ψ) = α0 + α1year+ α2elev+ α3elev2 + α4elev

3

minus1(p) = λ0 + λ1year+ λ2elev+ λ3elev2 + λ4elev

3 + λ5date+ λ6date2

78

where year = Iyear=2000

The model spaces for this data contain 26 = 64 and 24 = 16 models respectively

for the detection and occupancy components That is in total the model space contains

24+6 = 1 024 models Although this model space can be enumerated entirely for

illustration we implemented the algorithm from section 334 generating 10000 draws

from the Gibbs sampler Each one of the models sampled were chosen from the set of

models that could be reached by changing the state of a single term in the current model

(to inclusion or exclusion accordingly) This allows a more thorough exploration of the

model space because for each of the 10000 models drawn the posterior probabilities

for many more models can be observed Below the labels for the predictors are followed

by either ldquozrdquo or ldquoyrdquo accordingly to represent the component they pertain to Finally

using the results from the model selection procedure we conducted a validation step to

determine the predictive accuracy of the HPMrsquos and of the median probability models

(MPMrsquos) The performance of these models is then contrasted with that of the model

ultimately selected by Kery et al (2010)361 Results Variable Selection Procedure

The model finally chosen for the presence component in Kery et al (2010) was not

found among the highest five probability models under either model prior 3-8 Moreover

the year indicator was never chosen under the multiplicity correcting prior hinting that

this term might correspond to a falsely identified predictor under the uniform prior

Results in Table 3-10 support this claim the marginal inclusion posterior probability for

the year predictor is 7 under the multiplicity correction prior The multiplicity correction

prior concentrates more densely the model posterior probability mass in the highest

ranked models (90 of the mass is in the top five models) than the uniform prior (which

account for 40 of the mass)

For the detection component the HPM under both priors is the intercept only model

which we represent in Table 3-9 with a blank label In both cases this model obtains very

79

Table 3-8 Posterior probability for the five highest probability models in the presencecomponent of the blue hawker data

Uniform model priorRank Mz selected p(Mz |y)

1 yrz+elevz 0102 yrz+elevz+elevz3 0083 elevz2+elevz3 0084 yrz+elevz2 0075 yrz+elevz3 007

Multiplicity correcting model priorRank Mz selected p(Mz |y)

1 elevz+elevz3 0532 0153 elevz+elevz2 0094 elevz2 0065 elevz+elevz2+elevz3 005

high posterior probabilities The terms contained in cubic polynomial for the elevation

appear to contain some relevant information however this conflicts with the MPIPs

observed in Table 3-11 which under both model priors are relatively low (lt 20 with the

uniform and le 4 with the multiplicity correcting prior)

Table 3-9 Posterior probability for the five highest probability models in the detectioncomponent of the blue hawker data

Uniform model priorRank Mz selected p(Mz |y)

1 0452 elevy3 0063 elevy2 0054 elevy 0055 yry 004

Multiplicity correcting model priorRank Mz selected p(Mz |y)

1 0862 elevy3 0023 datey2 0024 elevy2 0025 yry 002

Finally it is possible to use the MPIPs to obtain the median probability model which

contains the terms that have a MPIP higher than 50 For the occupancy process

(Table 3-10) under the uniform prior the model with the year the elevation and the

elevation cubed are included The MPM with multiplicity correction prior coincides with

the HPM from this prior The MPM chosen for the detection component (Table 3-11)

under both priors is the intercept only model coinciding again with the HPM

Given the outcomes of the simulation studies from Section 35 especially those

pertaining to the detection component the results in Table 3-11 appear to indicate that

none of the predictors considered belong to the true model especially when considering

80

Table 3-10 MPIP presence component

Predictor p(predictor isin MTz |y z w v)

Unif MultCorryrz 053 007elevz 051 073elevz2 045 023elevz3 050 067

Table 3-11 MPIP detection component

Predictor p(predictor isin MTy |y z w v)

Unif MultCorryry 019 003elevy 018 003elevy2 018 003elevy 3 019 004datey 016 003datey2 015 004

those derived with the multiplicity correction prior On the other hand for the presence

component (Table 3-10) there is an indication that terms related to the cubic polynomial

in elevz can explain the occupancy patterns362 Validation for the Selection Procedure

Approximately half of the sites were selected at random for training (ie for model

selection and parameter estimation) and the remaining half were used as test data In

the previous section we observed that using the marginal posterior inclusion probability

of the predictors the our method effectively separates predictors in the true model from

those that are not in it However in Tables 3-10 and 3-11 this separation is only clear for

the presence component using the multiplicity correction prior

Therefore in the validation procedure we observe the misclassification rates for the

detections using the following models (1) the model ultimately recommended in Kery

et al (2010) (yrz+elevz+elevz2+elevz3 + elevy+ elevy2+ datey+ datey2) (2) the

highest probability model (HPM) with a uniform prior (yrz+elevz) (3) the HPM with a

multiplicity correcting prior (elevz + elevz3 ) (4) the median probability model (MPM)

ndashthe model including only predictors with a MPIP larger than 50ndash with the uniform

prior (yrz+elevz+elevz3) and finally (5) the MPM with a multiplicity correction prior

(elevz+elevz3 same as the HPM with multiplicity correction)

We must emphasize that the models resulting from the implement ion of our model

selection procedure used exclusively the training dataset On the other hand the model

in Kery et al (2010) was chosen to minimize the prediction error of the complete data

81

Because this model was obtained from the full dataset results derived from it can only

be considered as a lower bound for the prediction errors

The benchmark misclassification error rate for true 1rsquos is high (close to 70)

However the misclassification rate for true 0rsquos which accounts for most of the

responses is less pronounced (15) Overall the performance of the selected models

is comparable They yield considerably worse results than the benchmark for the true

1rsquos but achieve rates close to the benchmark for the true zeros Pooling together

the results for true ones and true zeros the selected models with either prior have

misclassification rates close to 30 The benchmark model performs comparably with a

joint misclassification error of 23 (Table 3-12)

Table 3-12 Mean misclassification rate for HPMrsquos and MPMrsquos using uniform andmultiplicity correction model priors

Model True 1 True 0 Jointbenchmark (Kery et al 2010) yrz+elevz+elevz2+elevz3 + 066 015 023

elevy+ elevy2+ datey+ datey2

HPM Unif yrz+elevz 083 017 028HPMHPM MC elevz + elevz3 082 018 028MPM Unif yrz+elevz+elevz3 082 018 029

37 Discussion

In this Chapter we proposed an objective and fully automatic Bayes methodology for

the single season site-occupancy model The methodology is said to be fully automatic

because no hyper-parameter specification is necessary in defining the parameter priors

and objective because it relies on the intrinsic priors derived from noninformative priors

The intrinsic priors have been shown to have desirable properties as testing priors We

also propose a fast stochastic search algorithm to explore large model spaces using our

model selection procedure

Our simulation experiments demonstrated the ability of the method to single out the

predictors present in the true model when considering the marginal posterior inclusion

probabilities for the predictors For predictors in the true model these probabilities

were comparatively larger than those for predictors absent from it Also the simulations

82

indicated that the method has a greater discrimination capability for predictors in the

detection component of the model especially when using multiplicity correction priors

Multiplicity correction priors were not described in this Chapter however their

influence on the selection outcome is significant This behavior was observed in the

simulation experiment and in the analysis of the Blue Hawker data Model priors play an

essential role As the number of predictors grows these are instrumental in controlling

for selection of false positive predictors Additionally model priors can be used to

account for predictor structure in the selection process which helps both to reduce the

size of the model space and to make the selection more robust These issues are the

topic of the next Chapter

Accounting for the polynomial hierarchy in the predictors within the occupancy

context is a straightforward extension of the procedures we describe in Chapter 4

Hence our next step is to develop efficient software for it An additional direction we

plan to pursue is developing methods for occupancy variable selection in a multivariate

setting This can be used to conduct hypothesis testing in scenarios with varying

conditions through time or in the case where multiple species are co-observed A

final variation we will investigate for this problem is that of occupancy model selection

incorporating random effects

83

CHAPTER 4PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely themost important

ndashSherlock HolmesA Case of Identity

41 Introduction

In regression problems if a large number of potential predictors is available the

complete model space is too large to enumerate and automatic selection algorithms are

necessary to find informative parsimonious models This multiple testing problem

is difficult and even more so when interactions or powers of the predictors are

considered In the ecological literature models with interactions andor higher order

polynomial terms are ubiquitous (Johnson et al 2013 Kery et al 2010 Zeller et al

2011) given the complexity and non-linearities found in ecological processes Several

model selection procedures even in the classical normal linear setting fail to address

two fundamental issues (1) the model selection outcome is not invariant to affine

transformations when interactions or polynomial structures are found among the

predictors and (2) additional penalization is required to control for false positives as the

model space grows (ie as more covariates are considered)

These two issues motivate the developments developed throughout this Chapter

Building on the results of Chipman (1996) we propose investigate and provide

recommendations for three different prior distributions on the model space These

priors help control for test multiplicity while accounting for polynomial structure in the

predictors They improve upon those proposed by Chipman first by avoiding the need

for specific values for the prior inclusion probabilities of the predictors and second

by formulating principled alternatives to introduce additional structure in the model

84

priors Finally we design a stochastic search algorithm that allows fast and thorough

exploration of model spaces with polynomial structure

Having structure in the predictors can determine the selection outcome As an

illustration consider the model E [y ] = β00 + β01x2 + β20x21 where the order one

term x1 is not present (this choice of subscripts for the coefficients is defined in the

following section) Transforming x1 7rarr xlowast1 = x1 + c for some c = 0 the model

becomes E [y ] = β00 + β01x2 + βlowast20x

lowast21 Note that in terms of the original predictors

xlowast21 = x21 + 2c middot x1 + c2 implying that this seemingly innocuous transformation of x1

modifies the column space of the design matrix by including x1 which was not in the

original model That is when lower order terms in the hierarchy are omitted from the

model the column space of the design matrix is not invariant to afine transformations

As the hat matrix depends on the column space the modelrsquos predictive capability is also

affected by how the covariates in the model are coded an undesirable feature for any

model selection procedure To make model selection invariant to afine transformations

the selection must be constrained to the subset of models that respect the hierarchy

(Griepentrog et al 1982 Khuri 2002 McCullagh amp Nelder 1989 Nelder 2000

Peixoto 1987 1990) These models are known as well-formulated models (WFMs)

Succinctly a model is well-formulated if for any predictor in the model every lower order

predictor associated with it is also in the model The model above is not well-formulated

as it contains x21 but not x1

WFMs exhibit strong heredity in that all lower order terms dividing higher order

terms in the model must also be included An alternative is to only require weak heredity

(Chipman 1996) which only forces some of the lower terms in the corresponding

polynomial hierarchy to be in the model However Nelder (1998) demonstrated that the

conditions under which weak heredity allows the design matrix to be invariant to afine

transformations of the predictors are too restrictive to be useful in practice

85

Although this topic appeared in the literature more than three decades ago (Nelder, 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al., 2009; Peixoto, 1987), regularized regression methods (Bien et al., 2013; Yuan et al., 2009), and Bayesian approaches (Chipman, 1996). The methods introduced in this chapter take a Bayesian approach towards variable selection for well-formulated models, with particular emphasis on model priors.

As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probabilities within a pre-specified model space ℳ. The model posterior probability for M ∈ ℳ is given by

p(M | y, ℳ) ∝ m(y | M) π(M | ℳ).    (4-1)

Model posterior probabilities depend on the prior distribution on the model space as well as on the prior distributions for the model-specific parameters, implicitly through the marginals m(y | M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger & Pericchi, 1996; Berger et al., 2001; George, 2000; Jeffreys, 1961; Kass & Wasserman, 1996; Liang et al., 2008; Zellner & Siow, 1980). In contrast, the effect of the prior on the model space has, until recently, been neglected. A few authors (e.g., Casella et al. (2014), Scott & Berger (2010), Wilson et al. (2010)) have highlighted the relevance of the priors on the model space in the context of multiple testing. Adequately formulating priors on the model space can both account for structure in the predictors and provide additional control on the detection of false positive terms. In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the total number of covariates), since this is the most abundant model size contained in the model space.

Variable selection within the model space of well-formulated polynomial models poses two challenges for automatic objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.

Models without polynomial hierarchy constitute a special case of WFMs, where all predictors are of order one. Hence, all the methods developed throughout this chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, these methods are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.

In this chapter, we first provide the necessary definitions to characterize the well-formulated model selection problem. Then we proceed to introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model-specific parameters (though this assumption can be relaxed to use other mixtures of g-priors). Finally, we implement our procedures using both simulated and real data.


4.2 Setup for Well-Formulated Models

Suppose that the observations y_i are modeled using the polynomial regression of the covariates x_{i1}, ..., x_{ip} given by

y_i = Σ_α β_{(α_1,...,α_p)} ∏_{j=1}^{p} x_{ij}^{α_j} + ε_i,    (4-2)

where α = (α_1, ..., α_p) belongs to N_0^p, the p-dimensional space of natural numbers including 0, with ε_i ~ iid N(0, σ^2), and only finitely many β_α are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates x_{i1} and x_{i2} only. The terms x_{i2}^2 and x_{i1}^2 x_{i2} can be represented by α = (0, 2) and α = (2, 1), respectively.
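As a concrete illustration of the multi-index representation in 4-2, the following sketch (not from the dissertation; Python with NumPy is assumed, and the function name is ours) builds the polynomial columns ∏_j x_j^{α_j} for an arbitrary collection of multi-indices.

    # A minimal sketch of how a model, stored as multi-indices in N_0^p, maps to
    # polynomial columns prod_j x_j^{alpha_j} of the design matrix.
    import numpy as np

    def design_matrix(X, model):
        """X: n-by-p matrix of main effects; model: iterable of length-p multi-indices."""
        cols = [np.prod(X ** np.asarray(alpha), axis=1) for alpha in model]
        return np.column_stack(cols)

    # Example with p = 2: x2^2 and x1^2*x2 correspond to alpha = (0, 2) and (2, 1).
    X = np.random.default_rng(1).normal(size=(5, 2))
    Z = design_matrix(X, [(0, 0), (1, 0), (0, 1), (0, 2), (2, 1)])  # (0, 0) is the intercept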

The notation y = Z(X)β + ε is used to denote that the observed response y = (y_1, ..., y_n) is modeled via a polynomial function Z of the original covariates contained in X = (x_1, ..., x_p) (where x_j = (x_{1j}, ..., x_{nj})′), and the coefficients of the polynomial terms are given by β. A specific polynomial model M is defined by the set of coefficients β_α that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices α ∈ N_0^p. In particular, model M is specified by M = {α_1^M, ..., α_{|M|}^M} for α_k^M ∈ N_0^p, where β_α = 0 for α ∉ M.

Any particular model M uses a subset X_M of the original covariates X to form the polynomial terms in the design matrix Z_M(X). Without ambiguity, a polynomial model Z_M(X) on X can be identified with a polynomial model Z_M(X_M) on the covariates X_M. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of Z_M(X_M). The coefficient vector and error variance of the model M are denoted by β_M and σ_M^2, respectively. Thus, M models the data as y = Z_M(X_M)β_M + ε_M, where ε_M ~ N(0, Iσ_M^2). Model M is said to be nested in model M′ if M ⊂ M′. M models the response of the covariates in two distinct ways: choosing the set of meaningful covariates X_M, as well as choosing the polynomial structure of these covariates, Z_M(X_M).


The set N_0^p constitutes a partially ordered set or, more succinctly, a poset. A poset is a set partially ordered through a binary relation "≼". In this context, the binary relation on the poset N_0^p is defined between pairs (α, α′) by α′ ≼ α whenever α_j ≥ α′_j for all j = 1, ..., p, with α′ ≺ α if, additionally, α_j > α′_j for some j. The order of a term α ∈ N_0^p is given by the sum of its elements, order(α) = Σ_j α_j. When order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by P(α) = {α′ ∈ N_0^p : α′ → α} and is given by the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated model using x_{i1}^2 x_{i2} to model y_i must also include the parent terms x_{i1}x_{i2} and x_{i1}^2, their corresponding parent terms x_{i1} and x_{i2}, and the intercept term 1.
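The parent-set operator and the well-formulated property translate directly into a short check. The sketch below is illustrative only (not the dissertation's code); models are assumed to be stored as sets of integer tuples (multi-indices), a representation chosen here for convenience.

    # Parent-set operator P(alpha) and the strong-heredity (well-formulated) check.
    def parents(alpha):
        """Multi-indices that immediately precede alpha (one exponent lowered by 1)."""
        return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
                for j in range(len(alpha)) if alpha[j] > 0}

    def is_well_formulated(model):
        """True if every term's parent set is also contained in the model."""
        model = set(model)
        return all(parents(alpha) <= model for alpha in model)

    # The motivating example: a model with x1^2 but without x1 is not well-formulated.
    is_well_formulated({(0, 0), (0, 1), (2, 0)})           # False: (1, 0) is missing
    is_well_formulated({(0, 0), (1, 0), (0, 1), (2, 0)})   # True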

The poset N_0^p can be represented by a directed acyclic graph (DAG). Without ambiguity, we can identify nodes in the graph, α ∈ N_0^p, with terms in the set of covariates. The graph has directed edges to a node from its parents. Any well-formulated model M is represented by a subgraph of this DAG with the property that if a node α is in the subgraph, then the nodes corresponding to P(α) are also in it. Figure 4-1 shows examples of well-formulated polynomial models, where α ∈ N_0^p is identified with ∏_{j=1}^p x_j^{α_j}.

The motivation for considering only well-formulated polynomial models is compelling. Let Z_M be the design matrix associated with a polynomial model. The subspace of y modeled by Z_M, given by the hat matrix H_M = Z_M(Z_M′ Z_M)^{-1} Z_M′, is invariant to affine transformations of the matrix X_M if and only if M corresponds to a well-formulated polynomial model (Peixoto, 1990).
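This invariance can be verified numerically. The following sketch is an illustrative check (not part of the original analysis; the variable and function names are ours): it contrasts the well-formulated model {1, x1, x1^2} with the non-well-formulated model {1, x1^2} under a shift of x1.

    # Shifting x1 leaves the hat matrix unchanged only for the well-formulated model.
    import numpy as np

    def hat(Z):
        return Z @ np.linalg.solve(Z.T @ Z, Z.T)

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=20)
    one = np.ones(20)
    c = 2.5

    wfm = lambda x: np.column_stack([one, x, x ** 2])   # {1, x1, x1^2}: well-formulated
    bad = lambda x: np.column_stack([one, x ** 2])      # {1, x1^2}: not well-formulated
    np.allclose(hat(wfm(x1)), hat(wfm(x1 + c)))         # True: same column space
    np.allclose(hat(bad(x1)), hat(bad(x1 + c)))         # False: column space has changed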


Figure 4-1. Graphs of well-formulated polynomial models for p = 2 (panels A and B).

For example, if p = 2 and y_i = β_{(0,0)} + β_{(1,0)}x_{i1} + β_{(0,1)}x_{i2} + β_{(1,1)}x_{i1}x_{i2} + ε_i, then the hat matrix is invariant to any covariate transformation of the form A(x_{i1}, x_{i2})′ + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two. In contrast, if y_i = β_{(0,0)} + β_{(2,0)}x_{i1}^2 + ε_i, then the hat matrix formed after applying the transformation x_{i1} ↦ x_{i1} + c for real c ≠ 0 is not the same as the hat matrix formed with the original x_{i1}.

4.2.1 Well-Formulated Model Spaces

The spaces of WFMs ℳ considered in this chapter can be characterized in terms of two WFMs: M_B, the base model, and M_F, the full model. The base model contains at least the intercept term and is nested in the full model. The model space ℳ is populated by all well-formulated models M that nest M_B and are nested in M_F:

ℳ = {M : M_B ⊆ M ⊆ M_F and M is well-formulated}.

For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, M ∈ ℳ can be uniquely identified by two different sets of nodes in M_F: the set of extreme nodes and the set of children nodes. For M ∈ ℳ, the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

E(M) = {α ∈ M \ M_B : α ∉ P(α′) for all α′ ∈ M},
C(M) = {α ∈ M_F \ M : {α} ∪ M is well-formulated}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in ℳ. The children nodes are those nodes that, when added to M, give rise to a WFM in ℳ. Because M_B ⊆ M for all M ∈ ℳ, the set of nodes E(M) ∪ M_B determines M, by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(M_F) : α ⋠ α′ for all α ∈ C(M)}, which contains E(M) ∪ M_B and thus uniquely identifies M.
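The sets E(M) and C(M) are simple to compute from the parent-set operator. The sketch below is a hypothetical implementation (ours, not the dissertation's) using the same tuple representation as before; it reproduces the extreme and children sets of the example shown in Figure 4-2 below.

    # Extreme and children node sets for a well-formulated model M within MB and MF.
    def parents(alpha):
        return {alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
                for j in range(len(alpha)) if alpha[j] > 0}

    def extreme_nodes(M, MB):
        """Nodes in M \ MB that are not a parent of any other node in M."""
        return {a for a in M - MB if not any(a in parents(b) for b in M)}

    def children_nodes(M, MF):
        """Nodes in MF \ M whose addition to M keeps the model well-formulated."""
        return {a for a in MF - M if parents(a) <= M}

    # Figure 4-2 example: MF is the quadratic surface in x1, x2 and M = {1, x1, x1^2}.
    MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
    M = {(0, 0), (1, 0), (2, 0)}
    extreme_nodes(M, {(0, 0)})   # {(2, 0)}, i.e. x1^2
    children_nodes(M, MF)        # {(0, 1)}, i.e. x2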

Figure 4-2. A) Extreme node set. B) Children node set.

In Figure 4-2, the extreme and children sets for model M = {1, x_1, x_1^2} are shown for the model space characterized by M_F = {1, x_1, x_2, x_1^2, x_1x_2, x_2^2}. In Figure 4-2A, the solid nodes represent nodes α ∈ M \ E(M), the dashed node corresponds to α ∈ E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).

4.3 Priors on the Model Space

As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities π(M | ℳ).

In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition

The graphical structure of the model space suggests a method for prior construction on ℳ guided by the notion of inheritance. A node α is said to inherit from a node α′ if there is a directed path from α′ to α in the graph of M_F. The inheritance is said to be immediate if order(α) = order(α′) + 1 (equivalently, if α′ ∈ P(α), or if α′ immediately precedes α).

For convenience, define Λ(M) = M \ M_B to be the set of nodes in M that are not in the base model M_B. For α ∈ Λ(M_F), let γ_α(M) be the indicator function describing whether α is included in M, i.e., γ_α(M) = I_{(α ∈ M)}. Denote by γ_ν(M) the set of indicators of inclusion in M for all order-ν nodes in Λ(M_F). Finally, let γ^{<ν}(M) = ∪_{j=0}^{ν−1} γ_j(M), the set of indicators of inclusion in M for all nodes in Λ(M_F) of order less than ν. With these definitions, the prior probability of any model M ∈ ℳ can be factored as

π(M | ℳ) = ∏_{j=J_min}^{J_max} π(γ_j(M) | γ^{<j}(M), ℳ),    (4-3)

where J_min and J_max are, respectively, the minimum and maximum order of nodes in Λ(M_F), and π(γ_{J_min}(M) | γ^{<J_min}(M), ℳ) = π(γ_{J_min}(M) | ℳ).


Prior distributions on ℳ can be simplified by making two assumptions. First, if order(α) = order(α′) = j, then γ_α and γ_α′ are assumed to be conditionally independent when conditioned on γ^{<j}, denoted by γ_α ⊥⊥ γ_α′ | γ^{<j}. Second, immediate inheritance is invoked, and it is assumed that if order(α) = j, then γ_α(M) | γ^{<j}(M) = γ_α(M) | γ_{P(α)}(M), where γ_{P(α)}(M) is the inclusion indicator for the set of parent nodes of α. This indicator is one if the complete parent set of α is contained in M and zero otherwise.

In Figure 4-3, these two assumptions are depicted with M_F being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x_1^2, x_2^2, and x_1x_2 are independent when conditioned on all the lower-order terms. In this same space, immediate inheritance implies that the inclusion of x_1^2, conditioned on the inclusion of all lower-order nodes, is equivalent to conditioning it on its parent set (x_1 in this case).

Figure 4-3. A) Conditional independence. B) Immediate inheritance.

Denote the conditional inclusion probability of node α in model M by π_α = π(γ_α(M) = 1 | γ_{P(α)}(M), ℳ). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

π(M | π_ℳ, ℳ) = ∏_{α ∈ Λ(M_F)} π_α^{γ_α(M)} (1 − π_α)^{1−γ_α(M)},    (4-4)

with π_ℳ = {π_α : α ∈ Λ(M_F)}. Because M must be well-formulated, π_α = γ_α = 0 if γ_{P(α)}(M) = 0. Thus, the product in 4-4 can be restricted to the set of nodes α ∈ Λ(M) ∪ C(M). Additional structure can be built into the prior on ℳ by making assumptions about the inclusion probabilities π_α, such as equality assumptions or assumptions of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on π_ℳ, assuming some structure among its elements, and then marginalizing out the π_ℳ.

Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero π_α are all equal. Specifically, for a model M ∈ ℳ, it is assumed that π_α = π for all α ∈ Λ(M) ∪ C(M). A complete Bayesian specification of the HUP is obtained by assuming a prior distribution for π. The choice of π ~ Beta(a, b) produces

π_HUP(M | ℳ, a, b) = B(|Λ(M)| + a, |C(M)| + b) / B(a, b),    (4-5)

where B is the beta function. Setting a = b = 1 gives the particular value of

π_HUP(M | ℳ, a = 1, b = 1) = [1 / (|Λ(M)| + |C(M)| + 1)] · [ (|Λ(M)| + |C(M)|) choose |Λ(M)| ]^{-1}.    (4-6)

The HUP assigns equal probabilities to all models for which the sets of nodes Λ(M) and C(M) have the same cardinality. This prior provides a combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because π_α = π for all α, this penalization can only depend on some aspect of the entire graph of M_F, such as the total number of nodes not in the null model, |Λ(M_F)|.
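For reference, the HUP probability in 4-5 is straightforward to evaluate once |Λ(M)| and |C(M)| are known (e.g., with the helpers sketched earlier). The sketch below is illustrative only, and the function names are ours; it checks the value 1/3 reported for the intercept-only model in Figure 4-4.

    # HUP prior probability (equation 4-5) on the log scale, using only the standard library.
    import math

    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def hup_log_prior(n_lambda, n_children, a=1.0, b=1.0):
        """log pi_HUP(M | a, b) = log B(|Lambda(M)| + a, |C(M)| + b) - log B(a, b)."""
        return log_beta(n_lambda + a, n_children + b) - log_beta(a, b)

    # Intercept-only model in the quadratic surface on two variables (Figure 4-4, model 1):
    # |Lambda(M)| = 0 and |C(M)| = 2 give prior probability 1/3 under a = b = 1.
    math.exp(hup_log_prior(0, 2))   # 0.333...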


Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero π_α. Each non-zero π_α is given its own prior, which is assumed to be a Beta distribution with parameters a_α and b_α. Thus, the prior probability of M under the HIP is

π_HIP(M | ℳ, a, b) = ∏_{α ∈ Λ(M)} [a_α / (a_α + b_α)] × ∏_{α ∈ C(M)} [b_α / (a_α + b_α)],    (4-7)

where the product over ∅ is taken to be 1. Because the π_α are totally independent, any choice of a_α and b_α is equivalent to choosing a probability of success π_α for a given α. Setting a_α = b_α = 1 for all α ∈ Λ(M) ∪ C(M) gives the particular value of

π_HIP(M | ℳ, a = 1, b = 1) = (1/2)^{|Λ(M)| + |C(M)|}.    (4-8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4-8 is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each γ_j is conditioned on γ^{<j} in the prior construction, the a_α and b_α for α of order j can be conditioned on γ^{<j}. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM conditioned on the inclusion vector γ^{<j}, which is denoted by ch_j(γ^{<j}). Choosing a_α = 1 and b_α(M) = ch_j(γ^{<j}) is equivalent to choosing a probability of success π_α = 1/ch_j(γ^{<j}). This penalization can drive down the false positive rate when ch_j(γ^{<j}) is large, but may produce more false negatives.
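A corresponding sketch for the HIP of 4-7 is given below; it is illustrative only and simply multiplies the per-node contributions, with the hyper-parameter pairs assumed to be supplied by the user (e.g., (1, 1) for the default, or (1, ch_j) for the penalized version).

    # HIP prior probability (equation 4-7) from per-node Beta hyper-parameters.
    def hip_prior(included_ab, children_ab):
        """included_ab / children_ab: lists of (a_alpha, b_alpha) pairs for Lambda(M) and C(M)."""
        p = 1.0
        for a, b in included_ab:
            p *= a / (a + b)      # each included node contributes a/(a + b)
        for a, b in children_ab:
            p *= b / (a + b)      # each excluded child node contributes b/(a + b)
        return p

    # Model {1, x1, x2} in the quadratic surface on two variables: two included nodes and
    # three children, so the default a = b = 1 gives (1/2)^5 = 1/32 (Figure 4-4, model 6).
    hip_prior([(1, 1)] * 2, [(1, 1)] * 3)   # 0.03125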

Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the π_α is to assume equality between the π_α of a given order and independence across the different orders. Define Λ_j(M) = {α ∈ Λ(M) : order(α) = j} and C_j(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that π_α = π_j for all α ∈ Λ_j(M) ∪ C_j(M). Assuming that π_j ~ Beta(a_j, b_j) provides a prior probability of

π_HOP(M | ℳ, a, b) = ∏_{j=J_min}^{J_max} B(|Λ_j(M)| + a_j, |C_j(M)| + b_j) / B(a_j, b_j).    (4-9)

The specific choice of a_j = b_j = 1 for all j gives a value of

π_HOP(M | ℳ, a = 1, b = 1) = ∏_j [1 / (|Λ_j(M)| + |C_j(M)| + 1)] · [ (|Λ_j(M)| + |C_j(M)|) choose |Λ_j(M)| ]^{-1},    (4-10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.
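The HOP probability in 4-9 factors over orders, so it can be evaluated from the per-order counts alone. The sketch below is illustrative only (the function names are ours); it reproduces the value 1/12 reported for the full model in Figure 4-4.

    # HOP prior probability (equation 4-9) on the log scale, from per-order counts.
    import math

    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def hop_log_prior(counts, a=1.0, b=1.0):
        """counts: iterable of (|Lambda_j(M)|, |C_j(M)|) pairs, one per order j."""
        return sum(log_beta(n_in + a, n_ch + b) - log_beta(a, b) for n_in, n_ch in counts)

    # Full quadratic surface in two variables (Figure 4-4, model 13): order one contributes
    # (2, 0) and order two contributes (3, 0), giving (1/3) * (1/4) = 1/12 under a = b = 1.
    math.exp(hop_log_prior([(2, 0), (3, 0)]))   # 0.0833...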

The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on γ^{<j}(M), the indicators {γ_α : α ∈ Λ_j(M) ∪ C_j(M)} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with a common probability of success π_j having a prior distribution. Our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order j that could be added while maintaining a WFM is given by ch_j(M) = ch_j(γ^{<j}(M)) = |Λ_j(M) ∪ C_j(M)|. Using a_j = 1 and b_j(M) = ch_j(M) produces a prior with two desirable properties. First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., ch_j(M) − 1.

4.3.2 Choice of Prior Structure and Hyper-Parameters

Each of the priors introduced in Section 4.3.1 defines a whole family of model priors characterized by the probability distribution assumed for the inclusion probabilities π_ℳ. For the sake of simplicity, this chapter focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters, which can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by b_j(M) = b_α(M) = ch_j(M) = |Λ_j(M) ∪ C_j(M)| for the HOP and HIP, where j = order(α), while b = ch denotes that b = |Λ(M_F)| for the HUP. The prior behavior is illustrated for two model spaces; in both cases, the base model M_B is taken to be the intercept-only model and M_F is the DAG shown (Figures 4-4 and 4-5). The priors considered treat model complexity differently, and some general properties can be seen in these examples.

     Model                                  HIP              HOP              HUP
                                       (1,1)  (1,ch)    (1,1)  (1,ch)    (1,1)  (1,ch)
  1  1                                  1/4    4/9       1/3    1/2       1/3    5/7
  2  1, x1                              1/8    1/9       1/12   1/12      1/12   5/56
  3  1, x2                              1/8    1/9       1/12   1/12      1/12   5/56
  4  1, x1, x1^2                        1/8    1/9       1/12   1/12      1/12   5/168
  5  1, x2, x2^2                        1/8    1/9       1/12   1/12      1/12   5/168
  6  1, x1, x2                          1/32   3/64      1/12   1/12      1/60   1/72
  7  1, x1, x2, x1^2                    1/32   1/64      1/36   1/60      1/60   1/168
  8  1, x1, x2, x1x2                    1/32   1/64      1/36   1/60      1/60   1/168
  9  1, x1, x2, x2^2                    1/32   1/64      1/36   1/60      1/60   1/168
 10  1, x1, x2, x1^2, x1x2              1/32   1/192     1/36   1/120     1/30   1/252
 11  1, x1, x2, x1^2, x2^2              1/32   1/192     1/36   1/120     1/30   1/252
 12  1, x1, x2, x1x2, x2^2              1/32   1/192     1/36   1/120     1/30   1/252
 13  1, x1, x2, x1^2, x1x2, x2^2        1/32   1/576     1/12   1/120     1/6    1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

First, contrast the choices of HIP, HUP, and HOP for (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x_1 and x_2, models 6 through 13, are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms.


     Model                        HIP              HOP              HUP
                             (1,1)  (1,ch)    (1,1)  (1,ch)    (1,1)  (1,ch)
  1  1                        1/8    27/64     1/4    1/2       1/4    4/7
  2  1, x1                    1/8    9/64      1/12   1/10      1/12   2/21
  3  1, x2                    1/8    9/64      1/12   1/10      1/12   2/21
  4  1, x3                    1/8    9/64      1/12   1/10      1/12   2/21
  5  1, x1, x3                1/8    3/64      1/12   1/20      1/12   4/105
  6  1, x2, x3                1/8    3/64      1/12   1/20      1/12   4/105
  7  1, x1, x2                1/16   3/128     1/24   1/40      1/30   1/42
  8  1, x1, x2, x1x2          1/16   3/128     1/24   1/40      1/20   1/70
  9  1, x1, x2, x3            1/16   1/128     1/8    1/40      1/20   1/70
 10  1, x1, x2, x3, x1x2      1/16   1/128     1/8    1/40      1/5    1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

In contrast to the HIP, the HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Under the HUP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., this is the only model that contains every term), and as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex, they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced. Thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.

In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces p(M) ≥ p(M′). However, differences between the priors are apparent. The HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5, the penalization of the HOP is the least dramatic, producing prior odds of 20 for MB versus MF, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior

To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases, or in other words, as the maximum order J_max increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as p_α = Σ_{M ∈ ℳ} I_{(α ∈ M)} p(M | y, ℳ) for α ∈ M_F (a small sketch of this computation follows the simulation steps below). These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on ℳ. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice of a = 1 and b = ch are considered. The results for the different combinations of M_F and M_T incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:

1. Randomly generate main-effects matrices X = (x_1, ..., x_{18}), with x_j ~ iid N_n(0, I_n), and error vectors ε ~ N_n(0, I_n), for n = 60.

2. Setting all coefficient values equal to one, calculate y = Z_{M_T}β + ε for the true models given by
   M_T1 = {x_1, x_2, x_3, x_1^2, x_1x_2, x_2^2, x_2x_3}, with |M_T1| = 7;
   M_T2 = {x_1, x_2, ..., x_{16}}, with |M_T2| = 16;
   M_T3 = {x_1, x_2, x_3, x_4}, with |M_T3| = 4;
   M_T4 = {x_1, x_2, ..., x_8, x_1^2, x_3x_4}, with |M_T4| = 10;
   M_T5 = {x_1, x_2, x_3, x_4, x_1^2, x_3x_4}, with |M_T5| = 6.


Table 4-1. Characterization of the full models MF and corresponding model spaces ℳ considered in the simulations.

Growing p, fixed maximum order:
  MF                   |MF|    |ℳ|       MT used
  (x1 + x2 + x3)^2       9        95     MT1
  (x1 + ... + x4)^2     14      1337     MT1
  (x1 + ... + x5)^2     20     38619     MT1

Fixed p, growing maximum order:
  MF                   |MF|    |ℳ|       MT used
  (x1 + x2 + x3)^2       9        95     MT1
  (x1 + x2 + x3)^3      19      2497     MT1
  (x1 + x2 + x3)^4      34    161421     MT1

Other model spaces:
  MF                                   |MF|    |ℳ|       MT used
  x1 + x2 + ... + x18                   18    262144     MT2, MT3
  (x1 + ... + x4)^2 + x5 + ... + x10    20     85568     MT4, MT5

3. In all simulations, the base model MB is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full model MF, are presented in Table 4-1, along with the true models used in each case.

4. Enumerate the model spaces and calculate p(M | y, ℳ) for all M ∈ ℳ using the EPP, HUP, HIP, and HOP, the latter two each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each ℳ for the different priors.

The true positives (TP) are defined as those nodes α ∈ M_T such that p_α > 0.5. For the false positives (FP), three different cutoffs are considered for p_α, elucidating the adjustment for multiplicity induced by the model priors. These cutoffs are 0.10, 0.20, and 0.50, applied to α ∉ M_T. The results from this exercise provide insight about the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1, the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of M_F, the full model that defines the DAG for ℳ.

Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be M_T1 and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.

First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as p increases. These also considerably outperform the priors using the default hyper-parameters a = b = 1 in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in M_T1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 0.50 cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.

Growing polynomial degree, fixed main effects. For these examples, the true model is once again M_T1. When the complexity is increased by making the order of M_F larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.5 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high inclusion posterior probability, both with the EPP and with the a = b = 1 priors.


Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects in a full quadratic surface increases from three to five, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                             a = 1, b = 1          a = 1, b = ch
Cutoff      |MT|  MF                            EPP      HIP    HUP    HOP      HIP    HUP    HOP
FP(>0.10)    7    (x1 + x2 + x3)^2              1.78     1.78   2.00   2.00     0.11   1.31   1.06
FP(>0.20)                                       0.43     0.43   2.00   1.98     0.01   0.28   0.24
FP(>0.50)                                       0.04     0.04   0.97   0.36     0.00   0.03   0.02
TP(>0.50)   (MT1)                               7.00     7.00   7.00   7.00     6.97   6.99   6.99
FP(>0.10)    7    (x1 + x2 + x3 + x4)^2         3.62     1.94   2.33   2.45     0.10   0.63   1.07
FP(>0.20)                                       1.60     0.47   2.17   2.15     0.01   0.17   0.24
FP(>0.50)                                       0.25     0.06   0.35   0.36     0.00   0.02   0.02
TP(>0.50)   (MT1)                               7.00     7.00   7.00   7.00     6.97   6.99   6.99
FP(>0.10)    7    (x1 + x2 + x3 + x4 + x5)^2    6.00     2.16   2.60   2.55     0.12   0.43   1.15
FP(>0.20)                                       2.91     0.55   2.13   2.18     0.02   0.19   0.27
FP(>0.50)                                       0.66     0.11   0.25   0.37     0.00   0.03   0.01
TP(>0.50)   (MT1)                               7.00     7.00   7.00   7.00     6.97   6.99   6.99

In contrast, any of the a = 1, b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.

Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space, to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.


Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                 a = 1, b = 1          a = 1, b = ch
Cutoff      |MT|  MF                EPP      HIP    HUP    HOP      HIP    HUP    HOP
FP(>0.10)    7    (x1 + x2 + x3)^2  1.78     1.78   2.00   2.00     0.11   1.31   1.06
FP(>0.20)                           0.43     0.43   2.00   1.98     0.01   0.28   0.24
FP(>0.50)                           0.04     0.04   0.97   0.36     0.00   0.03   0.02
TP(>0.50)   (MT1)                   7.00     7.00   7.00   7.00     6.97   6.99   6.99
FP(>0.10)    7    (x1 + x2 + x3)^3  7.37     5.21   6.06   2.91     0.55   1.05   1.39
FP(>0.20)                           2.91     1.55   3.61   2.08     0.17   0.34   0.31
FP(>0.50)                           0.40     0.21   0.50   0.26     0.03   0.03   0.04
TP(>0.50)   (MT1)                   7.00     7.00   7.00   7.00     6.97   6.98   7.00
FP(>0.10)    7    (x1 + x2 + x3)^4  8.22     4.00   4.69   2.61     0.52   0.55   1.32
FP(>0.20)                           4.21     1.13   1.76   2.03     0.12   0.15   0.31
FP(>0.50)                           0.56     0.17   0.22   0.27     0.03   0.03   0.04
TP(>0.50)   (MT1)                   7.00     7.00   7.00   7.00     6.97   6.97   6.99

By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models have 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in M_F, so there is little potential for false positives. The a = 1, b = ch priors show dramatically different behavior. The HIP controls false positives well, but fails to identify the true coefficients at the 0.50 cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 0.50 cutoff.


If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and the choice of a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from M_T4 with ten terms and from M_T5 with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff; at the 0.50 cutoff, some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and the a = b = 1 priors always include all of the predictors in M_T4 and M_T5. On the other hand, the ability of the a = 1, b = ch priors to control for false positives is markedly better than that of the EPP and the hierarchical priors with the choice a = b = 1. At the 0.50 cutoff, these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as being good default priors on the model space.

4.4 Random Walks on the Model Space

When the model space ℳ is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space ℳ both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively quickly.


Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                                 a = 1, b = 1             a = 1, b = ch
Cutoff      |MT|  MF                                  EPP      HIP     HUP     HOP      HIP     HUP     HOP
FP(>0.10)   16    x1 + x2 + ... + x18                 1.93     1.93    2.00    2.00     0.03    1.80    1.80
FP(>0.20)                                             0.52     0.52    2.00    2.00     0.01    0.46    0.46
FP(>0.50)                                             0.07     0.07    2.00    2.00     0.01    0.04    0.04
TP(>0.50)   (MT2)                                    15.99    15.99   16.00   16.00     6.99   15.99   15.99
FP(>0.10)    4    x1 + x2 + ... + x18                13.95    13.95    9.15    9.15     0.26    1.31    1.31
FP(>0.20)                                             5.45     5.45    3.03    3.03     0.05    0.45    0.45
FP(>0.50)                                             0.84     0.84    0.45    0.45     0.02    0.06    0.06
TP(>0.50)   (MT3)                                     4.00     4.00    4.00    4.00     4.00    4.00    4.00
FP(>0.10)   10    (x1 + ... + x4)^2 + x5 + ... + x10  9.73     9.71   10.00    5.60     0.34    2.33    2.20
FP(>0.20)                                             2.65     2.65    8.73    3.05     0.12    0.74    0.69
FP(>0.50)                                             0.35     0.35    1.36    1.68     0.02    0.11    0.12
TP(>0.50)   (MT4)                                    10.00    10.00   10.00    9.99     9.94    9.98    9.99
FP(>0.10)    6    (x1 + ... + x4)^2 + x5 + ... + x10 13.52    13.52   11.06    9.94     0.44    1.63    1.96
FP(>0.20)                                             4.22     4.21    3.60    5.01     0.15    0.48    0.68
FP(>0.50)                                             0.53     0.53    0.57    0.75     0.01    0.08    0.11
TP(>0.50)   (MT5)                                     6.00     6.00    6.00    6.00     5.99    5.99    5.99

4.4.1 Simple Pruning and Growing

This first strategy relies on small, localized jumps around the model space, turning on or off a single node at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step in the algorithm, assume that the current state of the chain is model M. Let p_G be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M+ = M ∪ {α} for some α ∈ C(M), or M− = M \ {α} for some α ∈ E(M).

An example transition kernel is defined by the mixture

g(M′ | M) = p_G · q_Grow(M′ | M) + (1 − p_G) · q_Prune(M′ | M)
          = [I_{(M ≠ M_F)} / (1 + I_{(M ≠ M_B)})] · [I_{(α ∈ C(M))} / |C(M)|] + [I_{(M ≠ M_B)} / (1 + I_{(M ≠ M_F)})] · [I_{(α ∈ E(M))} / |E(M)|],    (4-11)

where p_G has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to or deletion from M, uniformly at random.
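One possible implementation of this single-node proposal is sketched below. It is illustrative rather than the dissertation's code: it assumes the tuple representation of nodes used earlier, fixes p_G = 1/2 whenever both moves are available, and returns the log proposal ratio needed for the Metropolis-Hastings correction.

    # Single-node grow/prune proposal following the kernel in equation 4-11.
    import math
    import random

    def parents(a):
        return {a[:j] + (a[j] - 1,) + a[j + 1:] for j in range(len(a)) if a[j] > 0}

    def children(M, MF):
        return {a for a in MF - M if parents(a) <= M}

    def extreme(M, MB):
        return {a for a in M - MB if not any(a in parents(b) for b in M)}

    def grow_prune_proposal(M, MB, MF, rng=random):
        """Returns (M_new, log q(M | M_new) - log q(M_new | M))."""
        C, E = children(M, MF), extreme(M, MB)
        p_grow = 0.5 if (C and E) else (1.0 if C else 0.0)
        if rng.random() < p_grow:                      # grow: add one child node
            alpha = rng.choice(sorted(C))
            M_new = M | {alpha}
            log_fwd = math.log(p_grow) - math.log(len(C))
            C2, E2 = children(M_new, MF), extreme(M_new, MB)
            log_rev = math.log(0.5 if C2 else 1.0) - math.log(len(E2))
        else:                                          # prune: drop one extreme node
            alpha = rng.choice(sorted(E))
            M_new = M - {alpha}
            log_fwd = math.log(1.0 - p_grow) - math.log(len(E))
            C2, E2 = children(M_new, MF), extreme(M_new, MB)
            log_rev = math.log(0.5 if E2 else 1.0) - math.log(len(C2))
        return M_new, log_rev - log_fwd

A Metropolis-Hastings step would then accept M_new with probability min(1, exp(log posterior difference + the returned log proposal ratio)), where the log posterior combines log m(y | M) and log π(M | ℳ).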

For this simple algorithm, pruning has the reverse kernel of growing, and vice-versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree-Based Pruning and Growing

In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order. One can update the vector of inclusion indicators by blocks corresponding to the order classes Λ_j(M). Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = J_min up to j = J_max, with J_min and J_max being the lowest and highest orders of nodes in M_F \ M_B, respectively. Define M_t(J_min − 1) = M and set j = J_min. The growth kernel comprises the following steps, proceeding from j = J_min to j = J_max:


1) Propose a model M′ by selecting a set of nodes from C_j(M_t(j−1)) through the kernel q_{Grow,j}(· | M_t(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus M_t(j−1). If M′ is accepted, then set M_t(j) = M′; otherwise set M_t(j) = M_t(j−1).

3) If j < J_max, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set M_t = M_t(J_max).

The pruning step is defined in a similar fashion; however, it starts at order j = J_max and proceeds down to j = J_min. Let E_j(M′) = E(M′) ∩ Λ_j(M_F) be the set of nodes of order j that can be removed from the model to produce a WFM. Define M_t(J_max + 1) = M and set j = J_max. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from E_j(M_t(j+1)) through the kernel q_{Prune,j}(· | M_t(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus M_t(j+1). If M′ is accepted, then set M_t(j) = M′; otherwise set M_t(j) = M_t(j+1).

3) If j > J_min, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set M_t = M_t(J_min).

It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can be combined for each j: the forward kernel proceeds from j = J_min to j = J_max and proposes adding sets of nodes from C_j(M) ∪ E_j(M); the reverse kernel simply reverses the direction of j, proceeding from j = J_max to j = J_min.

4.5 Simulation Study

To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points on the scale (McCullagh & Nelder, 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.

With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with M_B being the intercept-only model and M_F being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normals. The response vectors are drawn from the n-variate normal distribution as y ~ N_n(Z_{M_T}(X)β_γ, I_n), where M_T is the true model and I_n is the n × n identity matrix.

The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that Z_{M_F}(X) is of full rank. The cardinality of this model space is |ℳ| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x_1^2 has more signal than x_1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.

The total number of combinations of SNR, sample size, regression coefficient values, and nodes in M_T amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior over the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from M_T is always the one shown in Figure 4-6.

Figure 4-6. M_T: DAG of the largest true model used in the simulations.

The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included, for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect

As expected, small sample sizes combined with a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with the HOP(1, ch), most of the few predictors that are discovered do in fact belong to the true model. In comparison to the results with the EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are smallest.


Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch).

terms of TPs is similar between both priors but the number of FPs are somewhat lower

with the HOP452 Coefficient Magnitude

Three ways to allocate the amount of signal across predictors are considered For

the first choice all coefficients contain the same amount of signal regardless of their

order In the second each order-one coefficient contains twice as much signal as any

order-two coefficient and four times as much as any order-three coefficient Finally

each order-one coefficient contains a half as much signal as any order-two coefficient

and a quarter of what any order-three coefficient has These choices are denoted by

β(1) = c(1o1 1o2 1o3) β(2) = c(1o1 05o2 025o3) and β(3) = c(025o1 05o2 1o3)

respectively In Figure 4-7 the first 4 scenarios correspond to simulations with β(1) the

next four use β(2) the next four correspond to β(3) and then the values are cycled in

110

the same way The results show that scenarios using either β(1) or β(3) behave similarly

contrasting with the negative impact of having the highest signal in the order-one terms

through β(2) In Figure 4-7 the effect of using β(2) is evident as it corresponds to the

lowest values for the TPs regardless of the sample size the SNR or the prior used This

is an intuitive result since giving more signal to higher-order terms makes it easier to

detect higher-order terms and consequently by strong heredity the algorithm will also

select the corresponding lower-order terms included in the true model453 Special Points on the Scale

Four true models were considered: (1) the model from Figure 4-6 (M_T1); (2) the model without the order-one terms (M_T2); (3) the model without order-two terms (M_T3); and (4) the model without x_1^2 and x_2x_5 (M_T4). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with M_T1, the next point is for scenarios with M_T2, followed by those with M_T3, then those with M_T4, then M_T1 again, and so on. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar across the four models in terms of both TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis

This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper g-priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables.


This model space contains approximately 71 billion models, and computation of all model posterior probabilities is not feasible.

Table 4-5. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily max 1hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in equation 3-3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in equation 3-2), hyper-g (HG) priors (Liang et al., 2008) with hyper-parameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability model (MPM). Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.

Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg^2, which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.

Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-to-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.


Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso.

BF     Prior    Model                                                   R2       RMSE
IP     EPP      hum, dpg, ibt, hum^2, hum*dpg, hum*ibt, dpg^2, ibt^2    0.8054   4.2739
IP     HIP      hum, ibt, hum^2, hum*ibt, ibt^2                         0.7740   4.3396
IP     HOP      hum, dpg, ibt, hum^2, hum*ibt, ibt^2                    0.7848   4.3175
IP     HUP      hum, dpg, ibt, hum*ibt, ibt^2                           0.7767   4.3508
ZS     EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2             0.7896   4.2518
ZS     HIP      hum, ibt, hum*ibt, ibt^2                                0.7525   4.3505
ZS     HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2             0.7896   4.2518
ZS     HUP      hum, dpg, ibt, hum*ibt, ibt^2                           0.7767   4.3508
HG11   EPP      vh, hum, dpg, ibt, hum^2, hum*ibt, dpg^2                0.7701   4.3049
HG11   HIP      hum, ibt, hum*ibt, ibt^2                                0.7525   4.3505
HG11   HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2             0.7896   4.2518
HG11   HUP      hum, dpg, ibt, hum*ibt, ibt^2                           0.7767   4.3508
HG21   EPP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2                    0.7701   4.3037
HG21   HIP      hum, dpg, ibt, hum*ibt, ibt^2                           0.7767   4.3508
HG21   HOP      hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2             0.7896   4.2518
HG21   HUP      hum, dpg, ibt, hum*ibt                                  0.7526   4.4036
       HierNet  hum, temp, ibh, dpg, ibt, vis, hum^2, hum*ibt,          0.7651   4.3680
                temp^2, temp*ibt, dpg^2

4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ ℳ, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M | ℳ).

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are set up (e.g., in what units these predictors are expressed).

In this chapter, we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP using the hyper-parameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.


In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper g-priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.


CHAPTER 5
CONCLUSIONS

Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible. They allow integrating, under a single methodology, empirical observations and theoretical process models, and they can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of the Bayesian methods used in ecology, and hopefully they will shed light on the differences between Bayesian estimation and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models: it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.

For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on the survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggest a strategy to add spatial dependence among neighboring sites.

Ecological inquiry usually requires competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which prevents introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the methods accurately single out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. Also, the simulations indicated that the method provides better discrimination for predictors in the detection component of the model.

In our simulations and in the analysis of the Blue Hawker data we observed that the

effect from using the multiplicity correction prior was substantial This occurs because

the Bayes factor only penalizes complexity of the alternative model according to its

number of parameters in excess to those of the null model As the number of predictors

grows the number of models in the models space also grows increasing the chances

of making false positive decisions on the inclusion of predictors This is where the role

of the prior on the model space becomes important The multiplicity penalty is ldquohidden

awayrdquo in the model prior probabilities π(M|M) In addition to the multiplicity of the

testing problem disregarding the hierarchical polynomial structure in the predictors in

model selection procedures has the potential to lead to different results according to

how the predictors are coded (eg in what units these predictors are expressed)

To confront this situation we propose three prior structures for well-formulated

models take advantage of the hierarchical structure of the predictors Of the priors

proposed we recommend the HOP using the hyperparameter choice (1 ch) which

provides the best control of false positives while maintaining a reasonable true positive

rate

Overall considering the flexibility of the latent approach several other extensions of

these methods follow Currently we envision three future developments (1) occupancy

models incorporate various sources of information (2) multi-species models that make

116

use of spatial and interspecific dependence and (3) investigate methods to conduct

model selection for the dynamic and spatially explicit version of the model


APPENDIX A
FULL CONDITIONAL DENSITIES DYMOSS

In this section we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.

Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases $t = 1$, $1 < t < T$ and $t = T$, since their corresponding probabilities take on slightly different forms. Let $\phi(\nu \mid \mu, \sigma^2)$ represent the density of a normal random variable $\nu$ with mean $\mu$ and variance $\sigma^2$, and recall that $\psi_{i1} = F(x_{(o)i}'\alpha)$ and $p_{ijt} = F(q_{ijt}'\lambda_t)$, where $F(\cdot)$ is the inverse link function. The full conditional for $z_{it}$ is given by:

1. For $t = 1$:

$\pi(z_{i1} \mid v_{i1}, \alpha, \lambda_1, \beta_1^c, \delta_1^s) = (\psi_{i1}^{*})^{z_{i1}} (1 - \psi_{i1}^{*})^{1 - z_{i1}} = \mathrm{Bernoulli}(\psi_{i1}^{*})$   (A-1)

where

$\psi_{i1}^{*} = \dfrac{\psi_{i1}\,\phi(v_{i1} \mid x_{i1}'\beta_1^c + \delta_1^s,\, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1})}{\psi_{i1}\,\phi(v_{i1} \mid x_{i1}'\beta_1^c + \delta_1^s,\, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1}) + (1 - \psi_{i1})\,\phi(v_{i1} \mid x_{i1}'\beta_1^c,\, 1)\prod_{j=1}^{J_{i1}} I_{\{y_{ij1}=0\}}}$

2. For $1 < t < T$:

$\pi(z_{it} \mid z_{i(t-1)}, z_{i(t+1)}, \lambda_t, \beta_{t-1}^c, \delta_{t-1}^s) = (\psi_{it}^{*})^{z_{it}} (1 - \psi_{it}^{*})^{1 - z_{it}} = \mathrm{Bernoulli}(\psi_{it}^{*})$   (A-2)

where

$\psi_{it}^{*} = \dfrac{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt})}{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt}) + \nabla_{it}\prod_{j=1}^{J_{it}} I_{\{y_{ijt}=0\}}}$

with

(a) $\kappa_{it} = F(x_{i(t-1)}'\beta_{t-1}^c + z_{i(t-1)}\delta_{t-1}^s)\,\phi(v_{it} \mid x_{it}'\beta_t^c + \delta_t^s,\, 1)$, and

(b) $\nabla_{it} = \left(1 - F(x_{i(t-1)}'\beta_{t-1}^c + z_{i(t-1)}\delta_{t-1}^s)\right)\phi(v_{it} \mid x_{it}'\beta_t^c,\, 1)$

3. For $t = T$:

$\pi(z_{iT} \mid z_{i(T-1)}, \lambda_T, \beta_{T-1}^c, \delta_{T-1}^s) = (\psi_{iT}^{\star})^{z_{iT}} (1 - \psi_{iT}^{\star})^{1 - z_{iT}} = \mathrm{Bernoulli}(\psi_{iT}^{\star})$   (A-3)

where

$\psi_{iT}^{\star} = \dfrac{\kappa_{iT}^{\star}\prod_{j=1}^{J_{iT}}(1 - p_{ijT})}{\kappa_{iT}^{\star}\prod_{j=1}^{J_{iT}}(1 - p_{ijT}) + \nabla_{iT}^{\star}\prod_{j=1}^{J_{iT}} I_{\{y_{ijT}=0\}}}$

with

(a) $\kappa_{iT}^{\star} = F(x_{i(T-1)}'\beta_{T-1}^c + z_{i(T-1)}\delta_{T-1}^s)$, and

(b) $\nabla_{iT}^{\star} = 1 - F(x_{i(T-1)}'\beta_{T-1}^c + z_{i(T-1)}\delta_{T-1}^s)$

Sampler $u_i$

1.

$\pi(u_i \mid z_{i1}, \alpha) = \mathrm{tr}\,N\!\left(x_{(o)i}'\alpha,\, 1,\, \mathrm{trunc}(z_{i1})\right)$, where $\mathrm{trunc}(z_{i1}) = \begin{cases} (-\infty, 0] & z_{i1} = 0 \\ (0, \infty) & z_{i1} = 1 \end{cases}$   (A-4)

and $\mathrm{tr}\,N(\mu, \sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$ and truncation region $A$.

Sampler $\alpha$

1.

$\pi(\alpha \mid u) \propto [\alpha]\prod_{i=1}^{N}\phi(u_i;\, x_{(o)i}'\alpha,\, 1)$   (A-5)

If $[\alpha] \propto 1$, then

$\alpha \mid u \sim N(m(\alpha), \Sigma_{\alpha})$

with $m(\alpha) = \Sigma_{\alpha} X_{(o)}' u$ and $\Sigma_{\alpha} = (X_{(o)}' X_{(o)})^{-1}$.
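As an illustration of the two flat-prior updates above, the following R lines sketch one Gibbs cycle for the probit-link draws of $u_i \mid z_{i1}, \alpha$ (Equation A-4) and $\alpha \mid u$ (Equation A-5). This is a minimal sketch, not the dissertation's code; the argument names (X_o, z1) are illustrative and the truncnorm package is an assumed helper for the truncated normal draws.

library(truncnorm)

# Latent utilities u_i | z_i1, alpha (Equation A-4): truncated to (0, Inf) when
# z_i1 = 1 and to (-Inf, 0] when z_i1 = 0.
draw_u <- function(X_o, alpha, z1) {
  mu <- as.vector(X_o %*% alpha)
  rtruncnorm(length(z1),
             a = ifelse(z1 == 1, 0, -Inf),
             b = ifelse(z1 == 1, Inf, 0),
             mean = mu, sd = 1)
}

# alpha | u under the flat prior [alpha] proportional to 1 (Equation A-5):
# alpha | u ~ N((X'X)^{-1} X'u, (X'X)^{-1}).
draw_alpha <- function(X_o, u) {
  V <- solve(crossprod(X_o))                     # (X'X)^{-1}
  m <- V %*% crossprod(X_o, u)                   # posterior mean
  as.vector(m + t(chol(V)) %*% rnorm(ncol(X_o))) # one multivariate normal draw
}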

Sampler $v_{it}$

1. (For $t > 1$)

$\pi(v_{i(t-1)} \mid z_{i(t-1)}, z_{it}, \beta_{t-1}^c, \delta_{t-1}^s) = \mathrm{tr}\,N\!\left(\mu_{i(t-1)}^{(v)},\, 1,\, \mathrm{trunc}(z_{it})\right)$   (A-6)

where $\mu_{i(t-1)}^{(v)} = x_{i(t-1)}'\beta_{t-1}^c + z_{i(t-1)}\delta_{t-1}^s$ and $\mathrm{trunc}(z_{it})$ defines the corresponding truncation region given by $z_{it}$.


Sampler $(\beta_{t-1}^c, \delta_{t-1}^s)$

1. (For $t > 1$)

$\pi(\beta_{t-1}^c, \delta_{t-1}^s \mid v_{t-1}, z_{t-1}) \propto [\beta_{t-1}^c, \delta_{t-1}^s]\prod_{i=1}^{N}\phi(v_{it};\, x_{i(t-1)}'\beta_{t-1}^c + z_{i(t-1)}\delta_{t-1}^s,\, 1)$   (A-7)

If $[\beta_{t-1}^c, \delta_{t-1}^s] \propto 1$, then

$\beta_{t-1}^c, \delta_{t-1}^s \mid v_{t-1}, z_{t-1} \sim N(m(\beta_{t-1}^c, \delta_{t-1}^s), \Sigma_{t-1})$

with $m(\beta_{t-1}^c, \delta_{t-1}^s) = \Sigma_{t-1}\tilde{X}_{t-1}' v_{t-1}$ and $\Sigma_{t-1} = (\tilde{X}_{t-1}'\tilde{X}_{t-1})^{-1}$, where $\tilde{X}_{t-1} = (X_{t-1}\;\; z_{t-1})$.

Sampler $w_{ijt}$

1. (For $t > 1$ and $z_{it} = 1$)

$\pi(w_{ijt} \mid z_{it} = 1, y_{ijt}, \lambda) = \mathrm{tr}\,N\!\left(q_{ijt}'\lambda_t,\, 1,\, \mathrm{trunc}(y_{ijt})\right)$   (A-8)

Sampler $\lambda_t$

1. (For $t = 1, 2, \ldots, T$)

$\pi(\lambda_t \mid z_t, w_t) \propto [\lambda_t]\prod_{i:\, z_{it}=1}\prod_{j=1}^{J_{it}}\phi(w_{ijt};\, q_{ijt}'\lambda_t,\, 1)$   (A-9)

If $[\lambda_t] \propto 1$, then

$\lambda_t \mid w_t, z_t \sim N(m(\lambda_t), \Sigma_{\lambda_t})$

with $m(\lambda_t) = \Sigma_{\lambda_t} Q_t' w_t$ and $\Sigma_{\lambda_t} = (Q_t' Q_t)^{-1}$, where $Q_t$ and $w_t$, respectively, are the design matrix and the vector of latent variables for surveys of sites such that $z_{it} = 1$.
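The detection-stage update has the same conjugate form as the draw for $\alpha$, except that only surveys at currently occupied sites enter. The R sketch below illustrates that restriction under stated assumptions: the argument names and the site_id vector mapping survey rows to sites are hypothetical, not taken from the DYMOSS software.

# One draw of lambda_t | w_t, z_t under a flat prior (Equation A-9), using only
# surveys whose site is currently occupied (z_it = 1).
draw_lambda_t <- function(Q_t, w_t, site_id, z_t) {
  keep <- z_t[site_id] == 1                 # rows of Q_t belonging to occupied sites
  Q <- Q_t[keep, , drop = FALSE]
  w <- w_t[keep]
  V <- solve(crossprod(Q))                  # (Q'Q)^{-1}
  m <- V %*% crossprod(Q, w)                # posterior mean
  as.vector(m + t(chol(V)) %*% rnorm(ncol(Q)))
}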


APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state $M$, the global jump is performed by drawing a model $M'$ at random from the model space. This is achieved by beginning at the base model and increasing the order from $J^{\min}_M$ to $J^{\max}_M$, the minimum and maximum orders of the nodes in $M_F \setminus M_B$; at each order a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is

$\alpha = \min\left\{1,\; \dfrac{m(y \mid M', \mathcal{M})}{m(y \mid M, \mathcal{M})}\right\}$

Local Jump. From the current state $M$, the local jump is performed by drawing a model from the set of models $L(M) = \{M_\alpha : \alpha \in E(M) \cup C(M)\}$, where $M_\alpha$ is $M \setminus \{\alpha\}$ for $\alpha \in E(M)$ and $M \cup \{\alpha\}$ for $\alpha \in C(M)$. The proposal probabilities for the model are computed as a mixture of $p(M' \mid y, M' \in L(M))$ and the discrete uniform distribution. The proposal kernel is

$q(M' \mid y, M' \in L(M)) = \dfrac{1}{2}\left( p(M' \mid y, M' \in L(M)) + \dfrac{1}{|L(M)|} \right)$

This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is

$\alpha = \min\left\{1,\; \dfrac{m(y \mid M', \mathcal{M})}{m(y \mid M, \mathcal{M})} \cdot \dfrac{q(M \mid y, M \in L(M'))}{q(M' \mid y, M' \in L(M))}\right\}$
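A minimal R sketch of this local-jump proposal is given below, under the assumption that helper functions neighbors() (returning $L(M)$), log_marg() (returning $\log m(y \mid M)$) and log_prior() (returning the log model prior) are available; these names are illustrative and are not part of the varSelectIP code.

# Draw a proposal from the 50/50 mixture of the neighborhood-restricted posterior
# and the discrete uniform distribution over L(M).
propose_local <- function(M, neighbors, log_marg, log_prior) {
  L <- neighbors(M)                                    # candidate models M_alpha
  lp <- sapply(L, function(Mp) log_marg(Mp) + log_prior(Mp))
  p_post <- exp(lp - max(lp))                          # stabilize before normalizing
  p_post <- p_post / sum(p_post)                       # p(M' | y, M' in L(M))
  q <- 0.5 * p_post + 0.5 / length(L)                  # mixture proposal probabilities
  k <- sample(seq_along(L), 1, prob = q)
  list(model = L[[k]], q_forward = q[k])               # q_forward enters the MH ratio
}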

Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model $M'$, define $L_j(M') = \{M'\} \cup \{M'_\alpha : \alpha \in E(M') \cup C(M')$, with $\alpha$ of order $j$ in $M_F\}$. From a state $M$, the kernel chooses at random whether to increase or decrease the order. If $M = M_F$ then decreasing the order is chosen with probability 1, and if $M = M_B$ then increasing the order is chosen with probability 1; in all other cases the probability of increasing and decreasing order is 1/2. The proposal kernels are given by:

Increasing order proposal kernel:

1. Set $j = J^{\min}_M - 1$ and $M'_j = M$.

2. Draw $M'_{j+1}$ from $q^{inc}_{j+1}(M' \mid y, M' \in L_{j+1}(M'_j))$, where

$q^{inc}_{j+1}(M' \mid y, M' \in L_{j+1}(M'_j)) = \dfrac{1}{2}\left( p(M' \mid y, M' \in L_{j+1}(M'_j)) + \dfrac{1}{|L_{j+1}(M'_j)|} \right)$

3. Set $j = j + 1$.

4. If $j < J^{\max}_M$ then return to 2. Otherwise proceed to 5.

5. Set $M' = M'_{J^{\max}_M}$ and compute the proposal probability

$q_{inc}(M' \mid y, M) = \prod_{j=J^{\min}_M - 1}^{J^{\max}_M - 1} q^{inc}_{j+1}(M'_{j+1} \mid y, M' \in L_{j+1}(M'_j))$   (B-1)

Decreasing order proposal kernel:

1. Set $j = J^{\max}_M + 1$ and $M'_j = M$.

2. Draw $M'_{j-1}$ from $q^{dec}_{j-1}(M' \mid y, M' \in L_{j-1}(M'_j))$, where

$q^{dec}_{j-1}(M' \mid y, M' \in L_{j-1}(M'_j)) = \dfrac{1}{2}\left( p(M' \mid y, M' \in L_{j-1}(M'_j)) + \dfrac{1}{|L_{j-1}(M'_j)|} \right)$

3. Set $j = j - 1$.

4. If $j > J^{\min}_M$ then return to 2. Otherwise proceed to 5.

5. Set $M' = M'_{J^{\min}_M}$ and compute the proposal probability

$q_{dec}(M' \mid y, M) = \prod_{j=J^{\max}_M + 1}^{J^{\min}_M + 1} q^{dec}_{j-1}(M'_{j-1} \mid y, M' \in L_{j-1}(M'_j))$   (B-2)

If increasing order is chosen, then the MH correction is given by

$\alpha = \min\left\{1,\; \dfrac{1 + I(M' = M_F)}{1 + I(M = M_B)} \cdot \dfrac{q_{dec}(M \mid y, M')}{q_{inc}(M' \mid y, M)} \cdot \dfrac{p(M' \mid y, \mathcal{M})}{p(M \mid y, \mathcal{M})}\right\}$   (B-3)

and similarly if decreasing order is chosen.

Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel $q$ can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.


APPENDIX C
WFM SIMULATION DETAILS

Briefly, the idea is to let $Z_{M_T}(X)\beta_{M_T} = (QR)\beta_{M_T} = Q\eta_{M_T}$ (i.e., $\beta_{M_T} = R^{-1}\eta_{M_T}$), using the QR decomposition. As such, setting all values in $\eta_{M_T}$ proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation $y_i$ is $\mathrm{var}(y_i) = \mathrm{var}(E[y_i \mid z_i]) + E[\mathrm{var}(y_i \mid z_i)]$, where $z_i$ is the $i$-th row of the design matrix $Z_{M_T}$. Hence we take the signal-to-noise ratio for each observation to be

$\mathrm{SNR}(\eta) = \dfrac{\eta_{M_T}' R^{-T} \Sigma_z R^{-1} \eta_{M_T}}{\sigma^2}$

where $\Sigma_z = \mathrm{var}(z_i)$. We determine how the signal is distributed across predictors up to a proportionality constant, to be able to control simultaneously the signal-to-noise ratio.
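The following R lines sketch this construction for a generic design matrix. They are an assumed, illustrative implementation (the function name and arguments are not from the simulation code): relative weights eta are rescaled so that the implied SNR matches a target k, and the coefficients are recovered as $\beta_{M_T} = R^{-1}\eta_{M_T}$.

# Given design matrix Z (columns of the true model M_T), relative weights eta,
# error sd sigma and target SNR k, return coefficients beta with that SNR.
make_beta <- function(Z, eta, sigma, k) {
  R <- qr.R(qr(Z))                           # R factor of the QR decomposition
  Sigma_z <- cov(Z)                          # var(z_i), covariance of a design row
  Rinv_eta <- backsolve(R, eta)              # R^{-1} eta
  snr0 <- drop(t(Rinv_eta) %*% Sigma_z %*% Rinv_eta) / sigma^2
  eta_scaled <- eta * sqrt(k / snr0)         # scale eta so that SNR equals k
  backsolve(R, eta_scaled)                   # beta_{M_T} = R^{-1} eta_scaled
}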

Additionally, to investigate the ability of the model to capture correctly the hierarchical structure, we specify four different 0-1 vectors that determine the predictors in $M_T$, which generates the data in the different scenarios.

Table C-1. Experimental conditions, WFM simulations.

Parameter            Values considered
SNR(η_{M_T}) = k     0.25, 1, 4
η_{M_T} ∝            order-based weights (1_{o1}, 0.5_{o2}, 0.25_{o3}), (1_{o1}, 1_{o2}, 1_{o3}), (0.25_{o1}, 0.5_{o2}, 1_{o3})
γ_{M_T}              the full WFM M_T of Figure 4-6, and M_T with zero coefficients for all order-one terms, for all order-two terms, or for x_1^2 and x_2 x_5
n                    130, 260, 1040

The results presented below are somewhat different from those found in the main body of the document in Section 4.5. These are extracted by averaging the number of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios for the 20 highest probability models.


SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect more noticeable when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from being impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) has strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either having a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control on the number of FPs included when considering small sample sizes combined with small SNRs. As either sample size or SNR grows, the differences between the two priors become indistinct.


Figure C-1. SNR vs. n. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal, regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these values are denoted by β = c(1_{o1}, 0.5_{o2}, 0.25_{o3}), β = c(1_{o1}, 1_{o2}, 1_{o3}) and β = c(0.25_{o1}, 0.5_{o2}, 1_{o3}), respectively.

Observe that the number of FPs is insensitive to how the SNR is distributed across predictors when using the HOP(1, ch); conversely, when using the EPP the number of FPs decreases as the SNR grows, always being slightly higher than those obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β = c(1_{o1}, 0.5_{o2}, 0.25_{o3}) or β = c(1_{o1}, 1_{o2}, 1_{o3})) the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher order terms makes it easier to detect higher order terms and, consequently, by strong heredity, the algorithm will also select the corresponding lower order terms included in the true model.

Special Points on the Scale

In Nelder (1998) the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in practice in this context. In addition, the author states that considering well-formulated models only does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower order terms is justified due to the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data generating mechanism has zero-valued coefficients for some lower order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model $M_T$ from Figure 4-6, but have, respectively, zero-valued coefficients for all the order-one terms, all the order-two terms, and for $x_1^2$ and $x_2 x_5$.

Figure C-2. SNR vs. coefficient values. Average model size, average true positives and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower order terms. These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing; but with high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for $x_1^2$ and $x_2 x_5$, and is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term $x_1 x_2 x_5$, which depends on the inclusion of three order-two terms ($x_1 x_2$, $x_1 x_5$, $x_2 x_5$) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single or at most on a few covariates; a true data generating mechanism that removes all terms of a given order in the context of polynomial models is clearly not justified, and here this was only done for comparison purposes.

Figure C-3. SNR vs. different true models $M_T$. Average model size, average true positives and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.


APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang et al. (2008); these are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The marginal posterior inclusion probability corresponds to the probability of including a given term in the full model $M_F$ after summing over all models in the model space. For each node $\alpha \in M_F$ this probability is given by $p_\alpha = \sum_{M \in \mathcal{M}} I_{(\alpha \in M)}\, p(M \mid y, \mathcal{M})$. In problems with a large model space, such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible. Thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space $\mathcal{M}$.
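One common way to carry out this estimation is to use the visit frequencies of the sampled models; the short R sketch below takes that route and is an illustrative assumption rather than the exact estimator used here. Each visited model is represented, hypothetically, as a character vector of included terms.

# Estimate marginal posterior inclusion probabilities from the models visited by
# the random walk: the proportion of sampled models containing each term.
estimate_mpip <- function(draws, terms) {
  sapply(terms, function(a) mean(vapply(draws, function(m) a %in% m, logical(1))))
}

# Hypothetical usage with three sampled models:
draws <- list(c("hum", "ibt"), c("hum", "dpg", "ibt"), c("ibt"))
estimate_mpip(draws, terms = c("hum", "dpg", "ibt", "vis"))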

Given that there are in total 44 potential predictors, for convenience in Tables D-2 to D-5 below we only display the marginal posterior inclusion probabilities for the terms included under at least one of the model priors considered (EPP, HIP, HUP and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(11) and Hyper-g(21)).


Table D-2. Marginal inclusion probabilities, intrinsic prior.

          EPP    HIP    HUP    HOP
hum       0.99   0.69   0.85   0.76
dpg       0.85   0.48   0.52   0.53
ibt       0.99   1.00   1.00   1.00
hum^2     0.76   0.51   0.43   0.62
hum*dpg   0.55   0.02   0.03   0.17
hum*ibt   0.98   0.69   0.84   0.75
dpg^2     0.72   0.36   0.25   0.46
ibt^2     0.59   0.78   0.57   0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior.

          EPP    HIP    HUP    HOP
hum       0.76   0.67   0.80   0.69
dpg       0.89   0.50   0.55   0.58
ibt       0.99   1.00   1.00   1.00
hum^2     0.57   0.49   0.40   0.57
hum*ibt   0.72   0.66   0.78   0.68
dpg^2     0.81   0.38   0.31   0.51
ibt^2     0.54   0.76   0.55   0.77

Table D-4. Marginal inclusion probabilities, Hyper-g(11).

          EPP    HIP    HUP    HOP
vh        0.54   0.05   0.10   0.11
hum       0.81   0.67   0.80   0.69
dpg       0.90   0.50   0.55   0.58
ibt       0.99   1.00   0.99   0.99
hum^2     0.61   0.49   0.40   0.57
hum*ibt   0.78   0.66   0.78   0.68
dpg^2     0.83   0.38   0.30   0.51
ibt^2     0.49   0.76   0.54   0.77

Table D-5. Marginal inclusion probabilities, Hyper-g(21).

          EPP    HIP    HUP    HOP
hum       0.79   0.64   0.73   0.67
dpg       0.90   0.52   0.60   0.59
ibt       0.99   1.00   0.99   1.00
hum^2     0.60   0.47   0.37   0.55
hum*ibt   0.76   0.64   0.71   0.67
dpg^2     0.82   0.41   0.36   0.52
ibt^2     0.47   0.73   0.49   0.75


REFERENCES

Akaike, H. (1983). Information measures and model selection. Bull. Int. Statist. Inst., 50, 277-290.

Albert, J. H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669-679.

Berger, J. & Bernardo, J. (1992). On the development of reference priors. Bayesian Statistics, 4, (pp. 35-60). URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf

Berger, J. & Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433), 109-122. URL httpamstattandfonlinecomdoiabs10108001621459199610476668

Berger, J., Pericchi, L. & Ghosh, J. (2001). Objective Bayesian methods for model selection: introduction and comparison. In Model Selection, vol. 38 of IMS Lecture Notes Monogr. Ser., (pp. 135-207). Inst. Math. Statist. URL httpwwwjstororgstable1023074356165

Besag, J., York, J. & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1-20.

Bien, J., Taylor, J. & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111-1141. URL httpprojecteuclidorgeuclidaos1371150895

Breiman, L. & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580-598.

Brusco, M. J., Steinley, D. & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306-315.

Casella, G., Girón, F. J., Martínez, M. L. & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207-1228. URL httpprojecteuclidorgeuclidaos1239369020

Casella, G., Moreno, E. & Girón, F. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, TBA(TBA), 1-46. URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf

Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17-36. URL httponlinelibrarywileycomdoi1023073315687abstract

Clyde, M. & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81-94. URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M. & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093-1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509-520.

Fiske, I. & Chandler, R. (2011). unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10). URL httpcorekmiopenacukdownloadpdf5701760pdf

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304-1308. URL httpwwwtandfonlinecomdoiabs10108001621459200010474336

Girón, F. J., Moreno, E., Casella, G. & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales, Serie A, Matemáticas, 104(1), 57-67. URL httpwwwspringerlinkcomindex105052RACSAM201006

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M. & Smith, L. D. (1982). Linear transformations of polynomial regression models. American Statistician, 36(3), 171-174.

Gunel, E. & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545-557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151-162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia. URL httpsmospacelibraryumsystemeduxmluihandle103554500

Hooten, M. B. & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs, (in press).

Hughes, J. & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75, 139-159.

Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297-307. URL httpbiometoxfordjournalsorgcontent762297abstract

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203-222.

Jeffreys, H. (1961). Theory of Probability. London: Oxford University Press, 3rd ed.

Johnson, D., Conn, P., Hooten, M., Ray, J. & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801-808. URL httpwwwesajournalsorgdoiabs10189012-05641

Kass, R. & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431). URL httpamstattandfonlinecomdoiabs10108001621459199510476592

Kass, R. E. & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795. URL httpwwwtandfonlinecomdoiabs10108001621459199510476572

Kass, R. E. & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343. URL httpwwwjstororgstable2291752origin=crossref

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses. Academic Press, 1st ed.

Kéry, M., Gardner, B. & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851-1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical Report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.

Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E. & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353-65. URL httpwwwncbinlmnihgovpubmed22162041

Liang, F., Paulo, R., Molina, G., Clyde, M. A. & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410-423. URL httpwwwtandfonlinecomdoiabs101198016214507000001337

Link, W. & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D. & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461-467. URL httpcrsitbacidmediajurnalrefslandscapemackenzie2004zhpdf

MacKenzie, D., Nichols, J. & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200-2207. URL httpwwwesajournalsorgdoiabs10189002-3090

MacKenzie, D. I., Bailey, L. L. & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546-555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A. & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248-2255.

Mazerolle, M. & Mazerolle, M. (2013). Package 'AICcmodavg'. URL ftpheanetarchivegnewsenseorgdisk1CRANwebpackagesAICcmodavgAICcmodavgpdf

McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R. & Tsai, C.-L. (1997). The model selection criterion AICu.

Moreno, E., Bertolino, F. & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451-1460.

Moreno, E., Girón, F. J. & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937-1952. URL httpprojecteuclidorgeuclidaos1278861238

Nelder, J. A. (1977). Reformulation of linear models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 140, 48-77.

Nelder, J. A. (1998). The selection of terms in response-surface models - how strong is the weak-heredity principle? American Statistician, 52(4), 315-318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109-112.

Nichols, J., Hines, J. & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395-1400. URL httpwwwesajournalsorgdoipdf10189006-1474

Ovaskainen, O., Hottola, J. & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514-21. URL httpwwwncbinlmnihgovpubmed20957941

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311-313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26-30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G. & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339-1349. URL httpdxdoiorg101080016214592013829001

Rao, C. R. & Wu, Y. (2001). On model selection. Volume 38 of Lecture Notes-Monograph Series, (pp. 1-57). Beachwood, OH: Institute of Mathematical Statistics. URL httpdxdoiorg101214lnms1215540960

Reich, B. J., Hodges, J. S. & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197-1206.

Reiners, W. & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press. URL httpbooksgooglecombooksid=dr9cPgAACAAJ

Rigler, F. & Peters, R. (1995). Excellence in Ecology: Science and Limnology. Ecology Institute, Germany.

Robert, C., Chopin, N. & Rousseau, J. (2009). Harold Jeffreys' Theory of Probability revisited. Statistical Science, 24(2), 141-179. URL httpswwwnewtonacukpreprintsNI08021pdf

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601-608.

Royle, J. A. & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813-23. URL httpwwwncbinlmnihgovpubmed17645027

Scott, J. & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable selection problem. The Annals of Statistics. URL httpprojecteuclidorgeuclidaos1278861454

Spiegelhalter, D. J. & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. J. R. Statist. Soc. B, 44, 377-387.

Tierney, L. & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82-86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K. & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790-1801. URL httpwwwesajournalsorgdoiabs10189002-5078

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J. & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467-1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92-107. URL httpwwwncbinlmnihgovpubmed10733859

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C. & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342-1364. URL httpwwwncbinlmnihgovpmcarticlesPMC3004292

Womack, A. J., León-Novelo, L. & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association, (June). URL httpwwwtandfonlinecomdoiabs101080016214592014880348

Yuan, M., Joseph, V. R. & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738-1757. URL httpprojecteuclidorgeuclidaoas1267453962

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H. & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: a case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892-901.

Zellner, A. & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa, (pp. 585-603). URL httpwwwspringerlinkcomindex5300770UP12246M9pdf


BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a B.S. degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a Ph.D. in interdisciplinary ecology with concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has currently accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.



Dr Claudio Fuentes for taking an interest in my work and for his advise support and

kind words which helped me retain the confidence to continue

I would like to acknowledge my friends at UF Juan Jose Acosta Mauricio

Mosquera Diana Falla Salvador and Emma Weeks and Anna Denicol thanks for

becoming my family away from home Andreas Tavis Emily Alex Sasha Mike

Yeonhee and Laura thanks for being there for me I truly enjoyed sharing these

years with you Vitor Paula Rafa Leandro Fabio Eduardo Marcelo and all the other

Brazilians in the Animal Science Department thanks for your friendship and for the

many unforgettable (though blurry) weekends

Also I would like to thank Pablo Arboleda for believing in me Because of him I

was able to take the first step towards fulfilling my educational goals My gratitude to

Grupo Bancolombia Fulbright Colombia Colfuturo and the IGERT QSE3 program

for supporting me throughout my studies Also thanks to Marc Kery and Christian

Monnerat for providing data to validate our methods Thanks to the staff in the Statistics

Department specially to Ryan Chance to the staff at the HPC and also to Karen Bray

at SNRE

Above all else I would like to thank my wife and family Nata you have always been

there for me pushing me forward believing in me helping me make better decisions

and regardless of how hard things get you have always managed to give me true and

lasting happiness Thank you for your love strength and patience Mom Dad Alejandro

Alberto Laura Sammy Vale and Tommy without your love trust and support getting

this far would not have been possible Thank you for giving me so much Gustavo

Lilia Angelica and Juan Pablo thanks for taking me into your family your words of

encouragement have led the way

5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS 4

LIST OF TABLES 8

LIST OF FIGURES 10

ABSTRACT 12

CHAPTER

1 GENERAL INTRODUCTION 14

11 Occupancy Modeling 1512 A Primer on Objective Bayesian Testing 1713 Overview of the Chapters 21

2 MODEL ESTIMATION METHODS 23

21 Introduction 23211 The Occupancy Model 24212 Data Augmentation Algorithms for Binary Models 26

22 Single Season Occupancy 29221 Probit Link Model 30222 Logit Link Model 32

23 Temporal Dynamics and Spatial Structure 34231 Dynamic Mixture Occupancy State-Space Model 37232 Incorporating Spatial Dependence 43

24 Summary 46

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS 49

31 Introduction 4932 Objective Bayesian Inference 52

321 The Intrinsic Methodology 53322 Mixtures of g-Priors 54

3221 Intrinsic priors 553222 Other mixtures of g-priors 56

33 Objective Bayes Occupancy Model Selection 57331 Preliminaries 58332 Intrinsic Priors for the Occupancy Problem 60333 Model Posterior Probabilities 62334 Model Selection Algorithm 63

34 Alternative Formulation 6635 Simulation Experiments 68

351 Marginal Posterior Inclusion Probabilities for Model Predictors 70

6

352 Summary Statistics for the Highest Posterior Probability Model 7636 Case Study Blue Hawker Data Analysis 77

361 Results Variable Selection Procedure 79362 Validation for the Selection Procedure 81

37 Discussion 82

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS 84

41 Introduction 8442 Setup for Well-Formulated Models 88

421 Well-Formulated Model Spaces 9043 Priors on the Model Space 91

431 Model Prior Definition 92432 Choice of Prior Structure and Hyper-Parameters 96433 Posterior Sensitivity to the Choice of Prior 99

44 Random Walks on the Model Space 104441 Simple Pruning and Growing 105442 Degree Based Pruning and Growing 106

45 Simulation Study 107451 SNR and Sample Size Effect 109452 Coefficient Magnitude 110453 Special Points on the Scale 111

46 Case Study Ozone Data Analysis 11147 Discussion 113

5 CONCLUSIONS 115

APPENDIX

A FULL CONDITIONAL DENSITIES DYMOSS 118

B RANDOM WALK ALGORITHMS 121

C WFM SIMULATION DETAILS 124

D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS 131

REFERENCES 133

BIOGRAPHICAL SKETCH 140

7

LIST OF TABLES

Table page

1-1 Interpretation of BFji when contrasting Mj and Mi 20

3-1 Simulation control parameters occupancy model selector 69

3-2 Comparison of average minOddsMPIP under scenarios having different numberof sites (N=50 N=100) and under scenarios having different number of surveysper site (J=3 J=5) for the presence and detection components using uniformand multiplicity correction priors 75

3-3 Comparison of average minOddsMPIP for different levels of signal consideredin the occupancy and detection probabilities for the presence and detectioncomponents using uniform and multiplicity correction priors 75

3-4 Comparison between scenarios with 50 and 100 sites in terms of the averagepercentage of true positive and true negative terms over the highest probabilitymodels for the presence and the detection components using uniform andmultiplicity correcting priors on the model space 76

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of thepercentage of true positive and true negative predictors averaged over thehighest probability models for the presence and the detection componentsusing uniform and multiplicity correcting priors on the model space 77

3-6 Comparison between scenarios with different level of signal in the occupancycomponent in terms of the percentage of true positive and true negative predictorsaveraged over the highest probability models for the presence and the detectioncomponents using uniform and multiplicity correcting priors on the model space 77

3-7 Comparison between scenarios with different level of signal in the detectioncomponent in terms of the percentage of true positive and true negative predictorsaveraged over the highest probability models for the presence and the detectioncomponents using uniform and multiplicity correcting priors on the model space 78

3-8 Posterior probability for the five highest probability models in the presencecomponent of the blue hawker data 80

3-9 Posterior probability for the five highest probability models in the detectioncomponent of the blue hawker data 80

3-10 MPIP presence component 81

3-11 MPIP detection component 81

3-12 Mean misclassification rate for HPMrsquos and MPMrsquos using uniform and multiplicitycorrection model priors 82

8

4-1 Characterization of the full models MF and corresponding model spaces Mconsidered in simulations 100

4-2 Mean number of false and true positives in 100 randomly generated datasetsas the number of main effects increases from three to five predictors in a is afull quadratic under the equal probability prior (EPP) the hierarchical independenceprior (HIP) the hierarchical order prior (HOP) and the hierarchical uniformprior (HUP) 102

4-3 Mean number of false and true positives in 100 randomly generated datasetsas the maximum order of MF increases from two to four in a full model withthree main effects under the equal probability prior (EPP) the hierarchicalindependence prior (HIP) the hierarchical order prior (HOP) and the hierarchicaluniform prior (HUP) 103

4-4 Mean number of false and true positives in 100 randomly generated datasetswith unstructured or irregular model spaces under the equal probability prior(EPP) the hierarchical independence prior (HIP) the hierarchical order prior(HOP) and the hierarchical uniform prior (HUP) 105

4-5 Variables used in the analyses of the ozone contamination dataset 112

4-6 Median probability models (MPM) from different combinations of parameterand model priors vs model selected using the hierarchical lasso 113

C-1 Experimental conditions WFM simulations 124

D-1 Variables used in the analyses of the ozone contamination dataset 131

D-2 Marginal inclusion probabilities intrinsic prior 132

D-3 Marginal inclusion probabilities Zellner-Siow prior 132

D-4 Marginal inclusion probabilities Hyper-g11 132

D-5 Marginal inclusion probabilities Hyper-g21 132

9

LIST OF FIGURES

Figure page

2-1 Graphical representation occupancy model 25

2-2 Graphical representation occupancy model after data-augmentation 31

2-3 Graphical representation multiseason model for a single site 39

2-4 Graphical representation data-augmented multiseason model 39

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites usinguniform (U) and multiplicity correction (MC) priors 71

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per siteusing uniform (U) and multiplicity correction (MC) priors 72

3-3 Predictor MPIP averaged over scenarios with the interaction between the numberof sites and the surveys per site using uniform (U) and multiplicity correction(MC) priors 72

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancyprobabilities using uniform (U) and multiplicity correction (MC) priors 73

3-5 Predictor MPIP averaged over scenarios with equal signal in the detectionprobabilities using uniform (U) and multiplicity correction (MC) priors 73

4-1 Graphs of well-formulated polynomial models for p = 2 90

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects formodel M = 1 x1 x21 91

4-3 Graphical representation of assumptions on M defined by the quadratic surfacein two main effects 93

4-4 Prior probabilities for the space of well-formulated models associated to thequadratic surface on two variables where MB is taken to be the intercept onlymodel and (ab) isin (1 1) (1 ch) 97

4-5 Prior probabilities for the space of well-formulated models associated to threemain effects and one interaction term where MB is taken to be the interceptonly model and (ab) isin (1 1) (1 ch) 98

4-6 MT DAG of the largest true model used in simulations 109

4-7 Average true positives (TP) and average false positives (FP) in all simulatedscenarios for the median probability model with EPP and HOP(1 ch) 110

C-1 SNR vs n Average model size average true positives and average false positivesfor all simulated scenarios by model ranking according to model posterior probabilities126

10

C-2 SNR vs coefficient values Average model size average true positives andaverage false positives for all simulated scenarios by model ranking accordingto model posterior probabilities 128

C-3 SNR vs different true models MT Average model size average true positivesand average false positives for all simulated scenarios by model ranking accordingto model posterior probabilities 129

11

Abstract of Dissertation Presented to the Graduate Schoolof the University of Florida in Partial Fulfillment of theRequirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION ANDSELECTION

By

Daniel Taylor-Rodrıguez

August 2014

Chair Linda J YoungCochair Nikolay BliznyukMajor Interdisciplinary Ecology

The ecological literature contains numerous methods for conducting inference about

the dynamics that govern biological populations Among these methods occupancy

models have played a leading role during the past decade in the analysis of large

biological population surveys The flexibility of the occupancy framework has brought

about useful extensions for determining key population parameters which provide

insights about the distribution structure and dynamics of a population However the

methods used to fit the models and to conduct inference have gradually grown in

complexity leaving practitioners unable to fully understand their implicit assumptions

increasing the potential for misuse This motivated our first contribution We develop

a flexible and straightforward estimation method for occupancy models that provides

the means to directly incorporate temporal and spatial heterogeneity using covariate

information that characterizes habitat quality and the detectability of a species

Adding to the issue mentioned above studies of complex ecological systems now

collect large amounts of information To identify the drivers of these systems robust

techniques that account for test multiplicity and for the structure in the predictors are

necessary but unavailable for ecological models We develop tools to address this

methodological gap First working in an ldquoobjectiverdquo Bayesian framework we develop

the first fully automatic and objective method for occupancy model selection based

12

on intrinsic parameter priors Moreover for the general variable selection problem we

propose three sets of prior structures on the model space that correct for multiple testing

and a stochastic search algorithm that relies on the priors on the models space to

account for the polynomial structure in the predictors

13

CHAPTER 1GENERAL INTRODUCTION

As with any other branch of science ecology strives to grasp truths about the

world that surrounds us and in particular about nature The objective truth sought

by ecology may well be beyond our grasp however it is reasonable to think that at

least partially ldquoNature is capable of being understoodrdquo (Dewey 1958) We can observe

and interpret nature to formulate hypotheses which can then be tested against reality

Hypotheses that encounter no or little opposition when confronted with reality may

become contextual versions of the truth and may be generalized by scaling them

spatially andor temporally accordingly to delimit the bounds within which they are valid

To formulate hypotheses accurately and in a fashion amenable to scientific inquiry

not only the point of view and assumptions considered must be made explicit but

also the object of interest the properties worthy of consideration of that object and

the methods used in studying such properties (Reiners amp Lockwood 2009 Rigler amp

Peters 1995) Ecology as defined by Krebs (1972) is ldquothe study of interactions that

determine the distribution and abundance of organismsrdquo This characterizes organisms

and their interactions as the objects of interest to ecology and prescribes distribution

and abundance as a relevant property of these organisms

With regards to the methods used to acquire ecological scientific knowledge

traditionally theoretical mathematical models (such as deterministic PDEs) have been

used However naturally varying systems are imprecisely observed and as such are

subject to multiple sources of uncertainty that must be explicitly accounted for Because

of this the ecological scientific community is developing a growing interest in flexible

and powerful statistical methods and among these Bayesian hierarchical models

predominate These methods rely on empirical observations and can accommodate

fairly complex relationships between empirical observations and theoretical process

models while accounting for diverse sources of uncertainty (Hooten 2006)

14

Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern: one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black-box approaches (Kery 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures, for example the use of flat priors in hypothesis testing (Berger & Pericchi 1996; Berger et al. 2001; Kass & Raftery 1995; Moreno et al. 1998; Robert et al. 2009; Robert 1993).

Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure, and dynamics of a population. In order to address some of the concerns stated in the previous paragraph, we concentrate on the occupancy framework to develop estimation and testing tools that will allow ecologists, first, to gain insight about the estimation procedure and, second, to conduct statistically sound model selection for site-occupancy data.

1.1 Occupancy Modeling

Since MacKenzie et al. (2002) and Tyre et al. (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected, it must be present, while if it is not detected, it might or might not be. Occupancy models improve upon traditional binary regression by accounting for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.

The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size, and metapopulation dynamics (MacKenzie et al. 2004; Royle & Kery 2007).

The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.

A more viable alternative, which builds on Hanski (1994), is the extension of the single-season occupancy model proposed by MacKenzie et al. (2003). In this model, the heterogeneity of occupancy probabilities across seasons arises from local colonization and extinction processes. This model is flexible enough to let detection, occurrence, extinction, and colonization probabilities each depend upon its own set of covariates. Model parameters are obtained through likelihood-based estimation.

Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, obtained from implementation of the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, after integrating out the latent state variables (occupancy indicators) they are no longer available. Therefore, finite sample estimates cannot be calculated directly; instead, a supplementary parametric bootstrapping step is necessary. Further, additional structure, such as temporal or spatial variation, cannot be introduced by means of random effects (Royle & Kery 2007).

1.2 A Primer on Objective Bayesian Testing

With the advent of high dimensional data, such as that found in modern problems in ecology, genetics, physics, etc., coupled with evolving computing capability, objective Bayesian inferential methods have gained increasing popularity. This, however, is by no means a new approach to the way Bayesian inference is conducted. In fact, starting with Bayes and Laplace and continuing for almost 200 years, Bayesian analysis was primarily based on "noninformative" priors (Berger & Bernardo 1992).

Now, subjective elicitation of prior probabilities in Bayesian analysis is widely recognized as the ideal (Berger et al. 2001); however, it is often the case that the available information is insufficient to specify appropriate prior probabilistic statements. Commonly, as in model selection problems where large model spaces have to be explored, the number of model parameters is prohibitively large, preventing one from eliciting prior information for the entire parameter space. As a consequence, in practice, the determination of priors through the definition of structural rules has become the alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors arising from these rules are known in the literature as noninformative, objective, default, or reference priors. Many of these connotations generate controversy and are accused, perhaps rightly, of providing a false pretension of objectivity. Nevertheless, we will avoid that discussion and refer to them herein exchangeably as noninformative or objective priors, to convey the sense that no attempt to introduce an informed opinion is made in defining prior probabilities.

A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001); Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000), and the references therein). We find particularly interesting those derived from the model structure, in which no tuning parameters are required, especially since these can be regarded as automatic methods. Among them, methods based on the Bayes factor for intrinsic priors have proven their worth in a variety of inferential problems given their excellent performance, flexibility, and ease of use. This class of priors is discussed in detail in Chapter 3. For now, some basic notation and notions of Bayesian inferential procedures are introduced.

Hypothesis testing and the Bayes factor

Bayesian model selection techniques that aim to find the true model, as opposed to searching for the model that best predicts the data, are fundamentally extensions of Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis testing and model selection relies on determining the amount of evidence found in favor of one hypothesis (or model) over another, given an observed set of data. Approached from a Bayesian standpoint, this type of problem can be formulated in great generality using a natural, well-defined probabilistic framework that incorporates both model and parameter uncertainty.

Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and, consequently, for the model selection problem. Bayesian model selection within a model space $\mathcal{M} = (M_1, M_2, \ldots, M_J)$, where each model is associated with a parameter $\theta_j$ (which may itself be a vector of parameters), incorporates three types of probability distributions: (1) a prior probability distribution for each model, $\pi(M_j)$; (2) a prior probability distribution for the parameters in each model, $\pi(\theta_j | M_j)$; and (3) the distribution of the data conditional on both the model and the model's parameters, $f(x | \theta_j, M_j)$. These three probability densities induce the joint distribution $p(x, \theta_j, M_j) = f(x | \theta_j, M_j) \cdot \pi(\theta_j | M_j) \cdot \pi(M_j)$, which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes' rule:

$$p(M_j | x) = \frac{m(x | M_j)\,\pi(M_j)}{\sum_{i=1}^{J} m(x | M_i)\,\pi(M_i)}, \qquad (1-1)$$

where $m(x | M_j) = \int f(x | \theta_j, M_j)\,\pi(\theta_j | M_j)\,d\theta_j$ is the marginal likelihood of $M_j$.

Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:

$$\frac{p(M_j | x)}{p(M_k | x)} = \frac{m(x | M_j)}{m(x | M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}. \qquad (1-2)$$

The first term on the right-hand side of (1-2), $\frac{m(x|M_j)}{m(x|M_k)}$, is known as the Bayes factor comparing model $M_j$ to model $M_k$, and it is denoted by $BF_{jk}(x)$. The Bayes factor provides a measure of the evidence in favor of either model given the data and updates the model prior odds, given by $\frac{\pi(M_j)}{\pi(M_k)}$, to produce the posterior odds.

Note that the model posterior probability in (1-1) can be expressed as a function of Bayes factors. To illustrate, let model $M_* \in \mathcal{M}$ be a reference model to which all other models in $\mathcal{M}$ are compared. Then, dividing both the numerator and denominator in (1-1) by $m(x | M_*)\pi(M_*)$ yields

$$p(M_j | x) = \frac{BF_{j*}(x)\,\frac{\pi(M_j)}{\pi(M_*)}}{1 + \sum_{M_i \in \mathcal{M},\, M_i \neq M_*} BF_{i*}(x)\,\frac{\pi(M_i)}{\pi(M_*)}}. \qquad (1-3)$$

Therefore, as the Bayes factor increases, the posterior probability of model $M_j$ given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.

Assuming equal model prior probabilities in (1-3), the prior odds are set equal to one and the model posterior odds in (1-2) become $p(M_j | x)/p(M_k | x) = BF_{jk}(x)$. Based on the Bayes factors, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).

Table 1-1. Interpretation of $BF_{jk}$ when contrasting $M_j$ and $M_k$.

    ln BF_jk    BF_jk        Evidence in favor of M_j    P(M_j | x)
    0 to 2      1 to 3       Weak evidence               0.5 - 0.75
    2 to 6      3 to 20      Positive evidence           0.75 - 0.95
    6 to 10     20 to 150    Strong evidence             0.95 - 0.99
    > 10        > 150        Very strong evidence        > 0.99
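To make the mapping from Bayes factors to posterior model probabilities in (1-3) concrete, the short sketch below (Python; the Bayes factor values are made up for illustration and do not come from this dissertation) renormalizes a set of Bayes factors against a reference model under equal prior odds.

    import numpy as np

    def posterior_model_probs(bf_ref, prior_odds=None):
        """Posterior model probabilities from Bayes factors BF_{j*} against a
        reference model M_*, following equation (1-3).

        bf_ref     : BF_{j*}(x) for every model in the space, with the reference
                     model itself contributing BF_{**} = 1.
        prior_odds : pi(M_j)/pi(M_*) for every model; defaults to equal prior odds.
        """
        bf_ref = np.asarray(bf_ref, dtype=float)
        if prior_odds is None:
            prior_odds = np.ones_like(bf_ref)
        weights = bf_ref * prior_odds
        return weights / weights.sum()

    # Hypothetical Bayes factors for three models versus the reference model
    print(posterior_model_probs([1.0, 4.2, 0.3]))   # approx. [0.18, 0.76, 0.05]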

Bayesian hypothesis testing and model selection procedures based on Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist-matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does not require nested hypotheses (i.e., the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic in order to perform the test. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty by using the Bayesian machinery for model-averaged predictions and confidence bounds (Kass & Raftery 1995); it is not clear how to account for this uncertainty rigorously in a fully frequentist approach.

1.3 Overview of the Chapters

In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply a foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or the detectability features of a particular survey conducted at a specific site. On the other hand, the Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.

Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single-season occupancy model with covariates is formulated using both probit and logit links. Based on the simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature in our framework is introduced.

Chapter 3 develops a new, fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, first, some issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of a variable selection algorithm for "objective" variable selection tailored to the occupancy model framework.

Chapter 4 touches on two important and interconnected issues in model testing that have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing given the size of the model space, i.e., given the number of tests performed, and (2) non-invariance to location transformations of the variable selection procedures in the presence of polynomial predictor structure. These elements both depend on the definition of prior probabilities on the model space. In this chapter, a set of priors on the model space and a stochastic search algorithm are proposed; together, these control for model multiplicity and account for the polynomial structure among the predictors.

CHAPTER 2
MODEL ESTIMATION METHODS

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."
–Sherlock Holmes, The Adventure of the Copper Beeches

2.1 Introduction

Prior to the introduction of site-occupancy models (MacKenzie et al. 2002; Tyre et al. 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distribution through space and time, or to model the habitat of a species (Tyre et al. 2003). These efforts, however, were suspect due to false-negative errors not being accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.

Site-occupancy models, developed independently by MacKenzie et al. (2002) and Tyre et al. (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, which are common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications and numerous extensions for it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.

Several extensions to the basic single-season, closed-population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al. 2003; Royle & Kery 2007), to understand age/stage structure within populations (Nichols et al. 2007), and to model species co-occurrence (MacKenzie et al. 2004; Ovaskainen et al. 2010; Waddle et al. 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols 2004): MacKenzie et al. suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, to investigate metapopulation dynamics, occupancy models improve upon incidence function models (Hanski 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation at a colonization-extinction equilibrium.

Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyper-parameters, subjective prior elicitation, and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kery 2010); as such, the potential for using the methodology incorrectly is high. Commonly, these procedures are fitted with packages such as BUGS or JAGS. Although the packages' ease of use has led to a widespread adoption of the methods, the user may be oblivious to the assumptions underpinning the analysis.

We believe that providing straightforward and robust alternatives for implementing these methods will help practitioners gain insight about how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single-season, closed-population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation to incorporate spatial dependence among neighboring sites.

2.1.1 The Occupancy Model

In this section of the document, we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. For the standard sampling protocol for collecting site-occupancy data, $J > 1$ independent surveys are conducted at each of $N$ representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let $y_{ij}$ denote a binary random variable that indicates detection ($y = 1$) or non-detection ($y = 0$) during the $j$th survey of site $i$. Without loss of generality, $J$ may be assumed constant among all $N$ sites to simplify description of the model; in practice, however, site-specific variation in $J$ poses no real difficulties and is easily implemented. This sampling protocol therefore yields an $N \times J$ matrix $Y$ of detection/non-detection data.

Note that the observed process $y_{ij}$ is an imperfect representation of the underlying occupancy or presence process. Hence, letting $z_i$ denote the presence indicator at site $i$, this model specification can be represented through the hierarchy

$$y_{ij} | z_i, \lambda \sim \text{Bernoulli}(z_i p_{ij})$$
$$z_i | \alpha \sim \text{Bernoulli}(\psi_i), \qquad (2-1)$$

where $p_{ij}$ is the probability of correctly classifying the $i$th site as occupied during the $j$th survey, and $\psi_i$ is the presence probability at the $i$th site. The graphical representation of this process is shown in Figure 2-1.

[Figure 2-1. Graphical representation of the occupancy model; nodes: ψ_i, z_i, y_i, p_i.]
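As a quick illustration of the data-generating process in (2-1), the following sketch (Python; the site counts and probabilities are hypothetical placeholders) simulates detection/non-detection data for $N$ sites and $J$ surveys.

    import numpy as np

    rng = np.random.default_rng(1)
    N, J = 100, 4                 # number of sites and surveys per site (hypothetical)
    psi, p = 0.6, 0.4             # constant occupancy and detection probabilities

    z = rng.binomial(1, psi, size=N)                   # latent presence, z_i ~ Bern(psi)
    y = rng.binomial(1, p, size=(N, J)) * z[:, None]   # y_ij ~ Bern(z_i * p)

    # A site with an all-zero detection history may be unoccupied or simply undetected;
    # this is exactly the ambiguity that the repeated surveys help resolve.
    print(y.sum(), z.sum())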

Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process ($z_i$), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed what is now a longstanding strategy in the Bayesian statistical literature that models binary outcomes using a simple Gibbs sampler. This procedure, which is described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models

Probit model: data augmentation with latent normal variables

At the root of Albert & Chib's algorithm lies the idea that, if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support $(-\infty, 0]$, and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on $(0, \infty)$. To understand the reasoning behind this strategy, let $Y \sim \text{Bern}(\Phi(x^T\beta))$ and $V = x^T\beta + \varepsilon$ with $\varepsilon \sim N(0, 1)$. In such a case, note that

$$\Pr(y = 1 | x^T\beta) = \Phi(x^T\beta) = \Pr(\varepsilon < x^T\beta) = \Pr(\varepsilon > -x^T\beta) = \Pr(v > 0 | x^T\beta).$$

Thus, whenever $y = 1$ then $v > 0$, and $v \le 0$ otherwise. In other words, we may think of $y$ as a truncated version of $v$. We can therefore sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we are able to obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2-3 below). Further, just as we may sample the latent variables given the parameters, we may also sample the parameters given the latent variables.

Given some initial values for all model parameters, values for the latent variables can be simulated. By conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions. These samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach; generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters conditional on the observed outcome values. We formalize the procedure below.

Assume that each outcome $Y_1, Y_2, \ldots, Y_n$ is such that $Y_i | x_i, \beta \sim \text{Bernoulli}(q_i)$, where $q_i = \Phi(x_i^T\beta)$ is the standard normal CDF evaluated at $x_i^T\beta$, and where $x_i$ and $\beta$ are the $p$-dimensional vectors of observed covariates for the $i$th observation and their corresponding parameters, respectively.

Now let $y = (y_1, y_2, \ldots, y_n)$ be the vector of observed outcomes and let $[\beta]$ represent the prior distribution of the model parameters. The posterior distribution of $\beta$ is then given by

$$[\beta | y] \propto [\beta]\prod_{i=1}^{n}\Phi(x_i^T\beta)^{y_i}\left(1 - \Phi(x_i^T\beta)\right)^{1-y_i}, \qquad (2-2)$$

which is intractable. Nevertheless, introducing latent random variables $V = (V_1, \ldots, V_n)$, such that $V_i \sim N(x_i^T\beta, 1)$, resolves this difficulty by specifying that whenever $Y_i = 1$ then $V_i > 0$, and if $Y_i = 0$ then $V_i \le 0$. This yields

$$[\beta, v | y] \propto [\beta]\prod_{i=1}^{n}\phi(v_i | x_i^T\beta, 1)\left\{I_{v_i \le 0}I_{y_i = 0} + I_{v_i > 0}I_{y_i = 1}\right\}, \qquad (2-3)$$

where $\phi(x | \mu, \tau^2)$ is the probability density function of a normal random variable $x$ with mean $\mu$ and variance $\tau^2$. The data augmentation artifact works since $[\beta | y] = \int[\beta, v | y]\,dv$; hence, if we sample from the joint posterior (2-3) and extract only the sampled values for $\beta$, they will correspond to samples from $[\beta | y]$.

From the expression above it is possible to obtain the full conditional distributions for $V$ and $\beta$, so a Gibbs sampler can be proposed. For example, if we use a flat prior for $\beta$ (i.e., $[\beta] \propto 1$), the full conditionals are given by

$$\beta | V, y \sim \text{MVN}_p\left((X^TX)^{-1}(X^TV),\, (X^TX)^{-1}\right) \qquad (2-4)$$

$$V | \beta, y \sim \prod_{i=1}^{n}\text{tr}\,N(x_i^T\beta, 1, Q_i), \qquad (2-5)$$

where $\text{MVN}_q(\mu, \Sigma)$ represents a multivariate normal distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$, and $\text{tr}\,N(\xi, \sigma^2, Q)$ stands for the truncated normal distribution with mean $\xi$, variance $\sigma^2$, and truncation region $Q$. For each $i = 1, 2, \ldots, n$, the support of the truncated variable is given by $Q_i = (-\infty, 0]$ if $y_i = 0$ and $Q_i = (0, \infty)$ otherwise. Note that conjugate normal priors could be used instead.

At iteration $m + 1$, the Gibbs sampler draws $V^{(m+1)}$ conditional on $\beta^{(m)}$ from (2-5) and then samples $\beta^{(m+1)}$ conditional on $V^{(m+1)}$ from (2-4). This process is repeated for $m = 0, 1, \ldots, n_{sim}$, where $n_{sim}$ is the number of iterations of the Gibbs sampler.
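A minimal sketch of this sampler for plain probit regression is given below (Python with NumPy/SciPy; it assumes the flat-prior full conditionals (2-4)-(2-5), and the design matrix X and binary vector y are supplied by the user).

    import numpy as np
    from scipy.stats import truncnorm

    def albert_chib_probit(y, X, n_sim=2000, seed=0):
        """Gibbs sampler for probit regression via Albert & Chib (1993) data augmentation."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)      # posterior covariance under a flat prior
        chol = np.linalg.cholesky(XtX_inv)
        beta = np.zeros(p)
        draws = np.empty((n_sim, p))
        for m in range(n_sim):
            # Draw latent v_i from a normal truncated to (-inf, 0] if y_i = 0, (0, inf) if y_i = 1
            mu = X @ beta
            lower = np.where(y == 1, 0.0, -np.inf)
            upper = np.where(y == 1, np.inf, 0.0)
            v = truncnorm.rvs(lower - mu, upper - mu, loc=mu, scale=1.0, random_state=rng)
            # Draw beta | v, y ~ MVN((X'X)^{-1} X'v, (X'X)^{-1})
            beta = XtX_inv @ (X.T @ v) + chol @ rng.standard_normal(p)
            draws[m] = beta
        return draws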

Logit model: data augmentation with latent Polya-gamma variables

Recently, Polson et al. (2013) developed a novel and efficient approach to Bayesian inference for logistic models using Polya-gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of the i.i.d. sequence of $\text{Exp}(1)$ random variables $\{E_k\}_{k=1}^{\infty}$ given by

$$\omega = \frac{2}{\pi^2}\sum_{k=1}^{\infty}\frac{E_k}{(2k-1)^2},$$

with probability density function

$$g(\omega) = \sum_{k=0}^{\infty}(-1)^k\,\frac{2k+1}{\sqrt{2\pi\omega^3}}\,e^{-\frac{(2k+1)^2}{8\omega}}\,I_{\omega\in(0,\infty)} \qquad (2-6)$$

and Laplace transform $E[e^{-t\omega}] = \cosh^{-1}\!\left(\sqrt{t/2}\right)$.

The Polya-gamma family of densities is obtained through an exponential tilting of the density $g$ from (2-6). These densities, indexed by $c \ge 0$, are characterized by

$$f(\omega | c) = \cosh\left(\frac{c}{2}\right)e^{-\frac{c^2\omega}{2}}\,g(\omega).$$

The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume $y_i \sim \text{Bernoulli}(\delta_i)$, with predictors $x_i' = (x_{i1}, \ldots, x_{ip})$ and success probability $\delta_i = e^{x_i'\beta}/(1 + e^{x_i'\beta})$. Hence, the posterior for the model parameters can be represented as

$$[\beta | y] = \frac{[\beta]\prod_{i=1}^{n}\delta_i^{y_i}(1 - \delta_i)^{1-y_i}}{c(y)},$$

where $c(y)$ is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing a Polya-gamma random variable $\omega \sim PG(1, x'\beta)$. This yields the data-augmented posterior

$$[\beta, \omega | y] = \frac{\left(\prod_{i=1}^{n}\Pr(y_i = 1 | \beta)\right)f(\omega | x'\beta)\,[\beta]}{c(y)}, \qquad (2-7)$$

such that $[\beta | y] = \int_{\mathbb{R}^+}[\beta, \omega | y]\,d\omega$.

Thus, from the augmented model, the full conditional density for $\beta$ is given by

$$[\beta | \omega, y] \propto \left(\prod_{i=1}^{n}\Pr(y_i = 1 | \beta)\right)f(\omega | x'\beta)\,[\beta] = [\beta]\prod_{i=1}^{n}\frac{(e^{x_i'\beta})^{y_i}}{1 + e^{x_i'\beta}}\prod_{i=1}^{n}\cosh\left(\frac{|x_i'\beta|}{2}\right)\exp\left[-\frac{(x_i'\beta)^2\omega_i}{2}\right]g(\omega_i). \qquad (2-8)$$

This expression yields a normal posterior distribution if $\beta$ is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate $\beta$ in the occupancy framework.
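A minimal sketch of this two-step sampler for plain logistic regression is given below (Python). The import of `random_polyagamma` from the third-party `polyagamma` package is an assumption made for illustration; any Polya-gamma sampler could be substituted.

    import numpy as np
    from polyagamma import random_polyagamma   # assumed third-party PG sampler

    def polya_gamma_logit(y, X, n_sim=2000, seed=0):
        """Gibbs sampler for logistic regression a la Polson et al. (2013), flat prior on beta."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        kappa = y - 0.5                     # y_i - 1/2 enters the Gaussian full conditional
        beta = np.zeros(p)
        draws = np.empty((n_sim, p))
        for m in range(n_sim):
            # Step 1: omega_i | beta ~ PG(1, x_i' beta)
            omega = random_polyagamma(1, X @ beta, random_state=rng)
            # Step 2: beta | omega, y is Gaussian with precision X' diag(omega) X
            cov = np.linalg.inv(X.T @ (X * omega[:, None]))
            mean = cov @ (X.T @ kappa)
            beta = rng.multivariate_normal(mean, cov)
            draws[m] = beta
        return draws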

2.2 Single Season Occupancy

Let $p_{ij} = F(q_{ij}^T\lambda)$ be the probability of correctly classifying the $i$th site as occupied during the $j$th survey, conditional on the site being occupied, and let $\psi_i = F(x_i^T\alpha)$ correspond to the presence probability at the $i$th site. Further, let $F^{-1}(\cdot)$ denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by $\lambda$ and $\alpha$, respectively, the $r$-variate and $p$-variate coefficient vectors for the detection and presence probabilities. Then the joint posterior for the presence indicators and the model parameters is

$$\pi^*(z, \alpha, \lambda) \propto \pi_\alpha(\alpha)\,\pi_\lambda(\lambda)\prod_{i=1}^{N}F(x_i'\alpha)^{z_i}\left(1 - F(x_i'\alpha)\right)^{1-z_i}\times\prod_{j=1}^{J}\left(z_iF(q_{ij}'\lambda)\right)^{y_{ij}}\left(1 - z_iF(q_{ij}'\lambda)\right)^{1-y_{ij}}. \qquad (2-9)$$

As in the simple probit regression problem, this posterior is intractable; consequently, sampling from it directly is not possible. However, the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows, we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective elicitation of parameter priors.

2.2.1 Probit Link Model

To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent variables, denoted by $w_{ij}$ and $v_i$, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

$$y_{ij} | z_i, w_{ij} \sim \text{Bernoulli}\left(z_iI_{w_{ij}>0}\right) \qquad\qquad z_i | v_i \sim I_{v_i>0}$$
$$w_{ij} | \lambda \sim N\left(q_{ij}'\lambda, 1\right) \qquad\qquad\qquad v_i | \alpha \sim N(x_i'\alpha, 1)$$
$$\lambda \sim [\lambda] \qquad\qquad\qquad\qquad\qquad \alpha \sim [\alpha] \qquad (2-10)$$

and is represented by the directed graph found in Figure 2-2.

[Figure 2-2. Graphical representation of the occupancy model after data augmentation; nodes: α, v_i, z_i, y_i, w_i, λ.]

Under this hierarchical model, the joint density is given by

$$\pi^*(z, v, \alpha, w, \lambda) \propto C_y\,\pi_\alpha(\alpha)\,\pi_\lambda(\lambda)\prod_{i=1}^{N}\phi(v_i; x_i'\alpha, 1)\,I_{v_i>0}^{z_i}\,I_{v_i\le0}^{(1-z_i)}\times\prod_{j=1}^{J}\left(z_iI_{w_{ij}>0}\right)^{y_{ij}}\left(1 - z_iI_{w_{ij}>0}\right)^{1-y_{ij}}\phi(w_{ij}; q_{ij}'\lambda, 1). \qquad (2-11)$$

The full conditional densities derived from the posterior in Equation 2-11 are detailed below.

1. These are obtained from the full conditional of $z$ after integrating out $v$ and $w$:

$$f(z | \alpha, \lambda) = \prod_{i=1}^{N}f(z_i | \alpha, \lambda) = \prod_{i=1}^{N}\psi_i^{*\,z_i}(1 - \psi_i^*)^{1-z_i},$$
$$\text{where}\quad \psi_i^* = \frac{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1 - p_{ij})^{1-y_{ij}}}{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1 - p_{ij})^{1-y_{ij}} + (1 - \psi_i)\prod_{j=1}^{J}I_{y_{ij}=0}}. \qquad (2-12)$$

2.

$$f(v | z, \alpha) = \prod_{i=1}^{N}f(v_i | z_i, \alpha) = \prod_{i=1}^{N}\text{tr}\,N(x_i'\alpha, 1, A_i),\quad\text{where}\quad A_i = \begin{cases}(-\infty, 0] & z_i = 0\\ (0, \infty) & z_i = 1\end{cases} \qquad (2-13)$$

and $\text{tr}\,N(\mu, \sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$, and truncation region $A$.

3.

$$f(\alpha | v) = \phi_p\left(\alpha;\, \Sigma_\alpha X'v,\, \Sigma_\alpha\right), \qquad (2-14)$$

where $\Sigma_\alpha = (X'X)^{-1}$ and $\phi_k(x; \mu, \Sigma)$ represents the $k$-variate normal density with mean vector $\mu$ and variance matrix $\Sigma$.

4.

$$f(w | y, z, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J}f(w_{ij} | y_{ij}, z_i, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J}\text{tr}\,N(q_{ij}'\lambda, 1, B_{ij}),$$
$$\text{where}\quad B_{ij} = \begin{cases}(-\infty, \infty) & z_i = 0\\ (-\infty, 0] & z_i = 1 \text{ and } y_{ij} = 0\\ (0, \infty) & z_i = 1 \text{ and } y_{ij} = 1\end{cases} \qquad (2-15)$$

5.

$$f(\lambda | w) = \phi_r\left(\lambda;\, \Sigma_\lambda Q'w,\, \Sigma_\lambda\right), \qquad (2-16)$$

where $\Sigma_\lambda = (Q'Q)^{-1}$.

The Gibbs sampling algorithm for the model can then be summarized as follows (a short sketch of the latent-occupancy update in step 2 is given after the list):

1. Initialize $z$, $\alpha$, $v$, $\lambda$, and $w$.
2. Sample $z_i \sim \text{Bern}(\psi_i^*)$.
3. Sample $v_i$ from a truncated normal with $\mu = x_i'\alpha$, $\sigma = 1$, and truncation region depending on $z_i$.
4. Sample $\alpha \sim N(\Sigma_\alpha X'v, \Sigma_\alpha)$, with $\Sigma_\alpha = (X'X)^{-1}$.
5. Sample $w_{ij}$ from a truncated normal with $\mu = q_{ij}'\lambda$, $\sigma = 1$, and truncation region depending on $y_{ij}$ and $z_i$.
6. Sample $\lambda \sim N(\Sigma_\lambda Q'w, \Sigma_\lambda)$, with $\Sigma_\lambda = (Q'Q)^{-1}$.
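The only step that differs substantively from ordinary probit regression is the update of the latent presence indicators, which uses the conditional occupancy probability $\psi_i^*$ of Equation 2-12. A minimal sketch (Python; the function name, array shapes, and argument layout are illustrative assumptions) is:

    import numpy as np
    from scipy.stats import norm

    def sample_z(y, X, Q, alpha, lam, rng):
        """Draw z_i ~ Bern(psi_i*) as in equation (2-12), probit link.

        y : (N, J) detection histories; X : (N, p) occupancy design matrix;
        Q : (N, J, r) detection design array; alpha, lam : current parameter draws.
        """
        psi = norm.cdf(X @ alpha)            # psi_i = Phi(x_i' alpha)
        p = norm.cdf(Q @ lam)                # p_ij = Phi(q_ij' lam), shape (N, J)
        like_occ = np.prod(p**y * (1 - p)**(1 - y), axis=1)   # detection likelihood if z_i = 1
        like_unocc = np.all(y == 0, axis=1).astype(float)     # only possible without detections
        psi_star = psi * like_occ / (psi * like_occ + (1 - psi) * like_unocc)
        return rng.binomial(1, psi_star)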

2.2.2 Logit Link Model

Now turning to the logit link version of the occupancy model, again let $y_{ij}$ be the indicator variable used to mark detection of the target species on the $j$th survey at the $i$th site, and let $z_i$ be the indicator variable that denotes presence ($z_i = 1$) or absence ($z_i = 0$) of the target species at the $i$th site. The model is now defined by

$$y_{ij} | z_i, \lambda \sim \text{Bernoulli}(z_ip_{ij}), \quad\text{where } p_{ij} = \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}, \qquad \lambda \sim [\lambda]$$
$$z_i | \alpha \sim \text{Bernoulli}(\psi_i), \quad\text{where } \psi_i = \frac{e^{x_i'\alpha}}{1 + e^{x_i'\alpha}}, \qquad \alpha \sim [\alpha].$$

In this hierarchy, the contribution of a single site to the likelihood is

$$L_i(\alpha, \lambda) = \frac{(e^{x_i'\alpha})^{z_i}}{1 + e^{x_i'\alpha}}\prod_{j=1}^{J}\left(z_i\frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}}\left(1 - z_i\frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1-y_{ij}}. \qquad (2-17)$$

As in the probit case, we data-augment the likelihood with two separate sets of latent variables, in this case each of them having a Polya-gamma distribution. Augmenting the model and using the posterior in (2-7), the joint is

$$[z, \alpha, \lambda | y] \propto [\alpha][\lambda]\prod_{i=1}^{N}\frac{(e^{x_i'\alpha})^{z_i}}{1 + e^{x_i'\alpha}}\cosh\left(\frac{|x_i'\alpha|}{2}\right)\exp\left[-\frac{(x_i'\alpha)^2v_i}{2}\right]g(v_i)\;\times$$
$$\prod_{j=1}^{J}\left(z_i\frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}}\left(1 - z_i\frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1-y_{ij}}\cosh\left(\frac{|z_iq_{ij}'\lambda|}{2}\right)\exp\left[-\frac{(z_iq_{ij}'\lambda)^2w_{ij}}{2}\right]g(w_{ij}). \qquad (2-18)$$

The full conditionals for $z$, $\alpha$, $v$, $\lambda$, and $w$ obtained from (2-18) are provided below.

1. The full conditional for $z$ is obtained after marginalizing the latent variables and yields

$$f(z | \alpha, \lambda) = \prod_{i=1}^{N}f(z_i | \alpha, \lambda) = \prod_{i=1}^{N}\psi_i^{*\,z_i}(1 - \psi_i^*)^{1-z_i},$$
$$\text{where}\quad \psi_i^* = \frac{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1 - p_{ij})^{1-y_{ij}}}{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1 - p_{ij})^{1-y_{ij}} + (1 - \psi_i)\prod_{j=1}^{J}I_{y_{ij}=0}}. \qquad (2-19)$$

2. Using the result derived in Polson et al. (2013), we have that

$$f(v | z, \alpha) = \prod_{i=1}^{N}f(v_i | z_i, \alpha) = \prod_{i=1}^{N}PG(1, x_i'\alpha). \qquad (2-20)$$

3.

$$f(\alpha | v) \propto [\alpha]\prod_{i=1}^{N}\exp\left[z_ix_i'\alpha - \frac{x_i'\alpha}{2} - \frac{(x_i'\alpha)^2v_i}{2}\right]. \qquad (2-21)$$

4. By the same result as that used for $v$, the full conditional for $w$ is

$$f(w | y, z, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J}f(w_{ij} | y_{ij}, z_i, \lambda) = \left(\prod_{i\in S_1}\prod_{j=1}^{J}PG(1, |q_{ij}'\lambda|)\right)\left(\prod_{i\notin S_1}\prod_{j=1}^{J}PG(1, 0)\right), \qquad (2-22)$$

with $S_1 = \{i \in \{1, 2, \ldots, N\} : z_i = 1\}$.

5.

$$f(\lambda | z, y, w) \propto [\lambda]\prod_{i\in S_1}\prod_{j=1}^{J}\exp\left[y_{ij}q_{ij}'\lambda - \frac{q_{ij}'\lambda}{2} - \frac{(q_{ij}'\lambda)^2w_{ij}}{2}\right], \qquad (2-23)$$

with $S_1$ as defined above.

The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure

The uses of the single-season model are limited to very specific problems. In particular, the assumptions of the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994); the heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival, and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.

However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from implementation of the delta method), making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy to solve the estimation problem, the latent state variables (occupancy indicators) are no longer available and, as such, finite sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes the incorporation of additional structural dependence using random effects. Thus, the model cannot account for spatial dependence, which plays a fundamental role in this setting.

To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kery developed what they refer to as a dynamic occupancy state-space model (DOSS), alluding to the conceptual similarity between this model and the class of state-space models found in the time series literature. In particular, this model allows one to retain the latent process (occupancy indicators) in order to obtain small sample estimates and, eventually, to generate extensions that incorporate structure in time and/or space through random effects.

The data used in the DOSS model come from standard repeated presence/absence surveys with $N$ sampling locations (patches or sites), indexed by $i = 1, 2, \ldots, N$. Within a given season (e.g., year, month, or week, depending on the biology of the species), each sampling location is visited (surveyed) $j = 1, 2, \ldots, J$ times. This process is repeated for $t = 1, 2, \ldots, T$ seasons. Here, an important assumption is that the site occupancy status is closed within, but not across, seasons.

As is usual in the occupancy modeling framework, two different processes are considered. The first one is the detection process per site-visit-season combination, denoted by $y_{ijt}$. The $y_{ijt}$ are indicator variables that take the value 1 if the species is detected at site $i$, survey $j$, and season $t$, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the set of partially observed presence (occupancy) indicators $z_{it}$. These are indicator variables that are equal to 1 whenever $y_{ijt} = 1$ for one or more of the visits made to site $i$ during season $t$; otherwise, the values of the $z_{it}$'s are unknown. Royle & Kery refer to these two processes as the observation ($y_{ijt}$) and the state ($z_{it}$) models.

In this setting, the parameters of greatest interest are the occurrence or site-occupancy probabilities, denoted by $\psi_{it}$, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season $t-1$, at season $t$ it can either be colonized or remain unoccupied. On the other hand, if the site was in fact occupied at season $t-1$, it can remain that way (survival) or become abandoned (local extinction) at season $t$. The probabilities of survival and colonization from season $t-1$ to season $t$ at the $i$th site are denoted by $\theta_{i(t-1)}$ and $\gamma_{i(t-1)}$, respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (Equation 2-24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (Equation 2-25). In particular,

$$z_{i1} \sim \text{Bernoulli}(\psi_{i1}) \qquad (2-24)$$

$$z_{it} | z_{i(t-1)} \sim \text{Bernoulli}\left(z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)}\right). \qquad (2-25)$$

The observation model, conditional on the latent process $z_{it}$, is defined by

$$y_{ijt} | z_{it} \sim \text{Bernoulli}(z_{it}p_{ijt}). \qquad (2-26)$$

Royle & Kery induce heterogeneity by site, site-season, and site-survey-season, respectively, in the occupancy, survival, colonization, and detection probabilities through the following specification:

$$\text{logit}(\psi_{i1}) = x_1 + r_i, \quad r_i \sim N(0, \sigma^2_\psi), \quad \text{logit}^{-1}(x_1) \sim \text{Unif}(0, 1)$$
$$\text{logit}(\theta_{it}) = a_t + u_i, \quad u_i \sim N(0, \sigma^2_\theta), \quad \text{logit}^{-1}(a_t) \sim \text{Unif}(0, 1)$$
$$\text{logit}(\gamma_{it}) = b_t + v_i, \quad v_i \sim N(0, \sigma^2_\gamma), \quad \text{logit}^{-1}(b_t) \sim \text{Unif}(0, 1)$$
$$\text{logit}(p_{ijt}) = c_t + w_{ij}, \quad w_{ij} \sim N(0, \sigma^2_p), \quad \text{logit}^{-1}(c_t) \sim \text{Unif}(0, 1) \qquad (2-27)$$

where $x_1$, $a_t$, $b_t$, $c_t$ are the season fixed effects for the corresponding probabilities, and where $(r_i, u_i, v_i)$ and $w_{ij}$ are the site and site-survey random effects, respectively. Additionally, all variance components assume the usual inverse gamma priors.

As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model

We assume that the probabilities for occupancy, survival, colonization, and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kery (2007); in essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time $t$ may again be occupied at time $t+1$ if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as $\delta$, is linked to a linear combination of covariates $x'\xi$ through the relationship defined by $\delta = F(x^T\xi)$, where $F(\cdot)$ represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State-Space model (DYMOSS).

As before, let $y_{ijt}$ be the indicator variable used to mark detection of the target species on the $j$th survey at the $i$th site during the $t$th season, and let $z_{it}$ be the indicator variable that denotes presence ($z_{it} = 1$) or absence ($z_{it} = 0$) of the target species at the $i$th site during the $t$th season, with $i \in \{1, 2, \ldots, N\}$, $j \in \{1, 2, \ldots, J\}$, and $t \in \{1, 2, \ldots, T\}$.

Additionally, assume that the probabilities for occupancy at time $t = 1$, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors $\alpha$, $\Delta^{(s)} = \{\delta^{(s)}_{t-1}\}_{t=2}^{T}$, $B^{(c)} = \{\beta^{(c)}_{t-1}\}_{t=2}^{T}$, and $\Lambda = \{\lambda_t\}_{t=1}^{T}$, and covariate matrices $X^{(o)}$, $X = \{X_{t-1}\}_{t=2}^{T}$, and $Q = \{Q_t\}_{t=1}^{T}$, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy:

State model
$$z_{i1} | \alpha \sim \text{Bernoulli}(\psi_{i1}), \quad\text{where } \psi_{i1} = F\left(x_{(o)i}'\alpha\right)$$
$$z_{it} | z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} \sim \text{Bernoulli}\left(z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)}\right),$$
$$\text{where } \theta_{i(t-1)} = F\left(\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}\right) \text{ and } \gamma_{i(t-1)} = F\left(x_{i(t-1)}'\beta^{(c)}_{t-1}\right) \qquad (2-28)$$

Observed model
$$y_{ijt} | z_{it}, \lambda_t \sim \text{Bernoulli}(z_{it}p_{ijt}), \quad\text{where } p_{ijt} = F(q_{ijt}^T\lambda_t) \qquad (2-29)$$

In the hierarchical setup given by Equations 2-28 and 2-29, $\theta_{i(t-1)}$ corresponds to the probability of persistence from time $t-1$ to time $t$ at site $i$, and $\gamma_{i(t-1)}$ denotes the colonization probability. Note that $\theta_{i(t-1)} - \gamma_{i(t-1)}$ yields the survival probability from $t-1$ to $t$. The effect of survival is introduced by changing the intercept of the linear predictor by a quantity $\delta^{(s)}_{t-1}$. Although in this version of the model this effect is accomplished by just modifying the intercept, it can be extended to have covariates determining $\delta^{(s)}_{t-1}$ as well. The graphical representation of the model for a single site is shown in Figure 2-3.

[Figure 2-3. Graphical representation of the multiseason model for a single site; nodes: α, z_it, y_it, λ_t, δ^(s)_{t-1}, β^(c)_{t-1} for t = 1, ..., T.]
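To make the state dynamics in (2-28) concrete, the sketch below (Python; the number of seasons and all probabilities are hypothetical placeholders) simulates the latent occupancy sequence of one site across T seasons using persistence and colonization probabilities.

    import numpy as np

    rng = np.random.default_rng(7)
    T = 5
    psi1 = 0.5                        # first-season occupancy probability (hypothetical)
    theta = np.full(T - 1, 0.7)       # persistence probabilities theta_{t-1}
    gamma = np.full(T - 1, 0.2)       # colonization probabilities gamma_{t-1}

    z = np.empty(T, dtype=int)
    z[0] = rng.binomial(1, psi1)
    for t in range(1, T):
        # Occupied sites persist with prob theta; empty sites are colonized with prob gamma
        prob = theta[t - 1] if z[t - 1] == 1 else gamma[t - 1]
        z[t] = rng.binomial(1, prob)
    print(z)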

The joint posterior for the model defined by this hierarchical setting is

$$[z, \alpha, B^{(c)}, \Delta^{(s)}, \Lambda | y] = C_y\prod_{i=1}^{N}\left\{\psi_{i1}\prod_{j=1}^{J}p_{ij1}^{y_{ij1}}(1 - p_{ij1})^{(1-y_{ij1})}\right\}^{z_{i1}}\left\{(1 - \psi_{i1})\prod_{j=1}^{J}I_{y_{ij1}=0}\right\}^{1-z_{i1}}[\lambda_1][\alpha]\;\times$$
$$\prod_{t=2}^{T}\prod_{i=1}^{N}\left[z_{i(t-1)}\,\theta_{i(t-1)}^{z_{it}}(1 - \theta_{i(t-1)})^{1-z_{it}} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)}^{z_{it}}(1 - \gamma_{i(t-1)})^{1-z_{it}}\right]\left\{\prod_{j=1}^{J}p_{ijt}^{y_{ijt}}(1 - p_{ijt})^{1-y_{ijt}}\right\}^{z_{it}}\times$$
$$\left\{\prod_{j=1}^{J}I_{y_{ijt}=0}\right\}^{1-z_{it}}[\lambda_t][\beta^{(c)}_{t-1}][\delta^{(s)}_{t-1}], \qquad (2-30)$$

which, as in the single-season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.

[Figure 2-4. Graphical representation of the data-augmented multiseason model; the latent nodes u_i, v_it, and w_ijt are added to the graph of Figure 2-3.]

Probit link normal-mixture DYMOSS model


We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combination of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, persistence and colonization, and detection, are

• $u_i \sim N(x_{(o)i}^T\alpha, 1)$,
• $v_{i(t-1)} \sim z_{i(t-1)}N\left(\delta^{(s)}_{(t-1)} + x_{i(t-1)}^T\beta^{(c)}_{(t-1)}, 1\right) + (1 - z_{i(t-1)})N\left(x_{i(t-1)}^T\beta^{(c)}_{(t-1)}, 1\right)$, and
• $w_{ijt} \sim N(q_{ijt}^T\lambda_t, 1)$.

Introducing these latent variables into the hierarchical formulation yields:

State model
$$u_{i1} | \alpha \sim N\left(x_{(o)i}'\alpha, 1\right), \qquad z_{i1} | u_i \sim \text{Bernoulli}\left(I_{u_i>0}\right)$$
and, for $t > 1$,
$$v_{i(t-1)} | z_{i(t-1)}, \beta_{t-1} \sim z_{i(t-1)}N\left(\delta^{(s)}_{(t-1)} + x_{i(t-1)}'\beta^{(c)}_{(t-1)}, 1\right) + (1 - z_{i(t-1)})N\left(x_{i(t-1)}'\beta^{(c)}_{(t-1)}, 1\right)$$
$$z_{it} | v_{i(t-1)} \sim \text{Bernoulli}\left(I_{v_{i(t-1)}>0}\right) \qquad (2-31)$$

Observed model
$$w_{ijt} | \lambda_t \sim N\left(q_{ijt}^T\lambda_t, 1\right), \qquad y_{ijt} | z_{it}, w_{ijt} \sim \text{Bernoulli}\left(z_{it}I_{w_{ijt}>0}\right) \qquad (2-32)$$

Note that the result presented in Section 2.2 corresponds to the particular case $T = 1$ of the model specified by Equations 2-31 and 2-32.

As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let $\phi(x | \mu, \sigma^2)$ denote the pdf of a normally distributed random variable $x$ with mean $\mu$ and standard deviation $\sigma$. Also let

1. $W_t = (w_{1t}, w_{2t}, \ldots, w_{Nt})$, with $w_{it} = (w_{i1t}, w_{i2t}, \ldots, w_{iJ_{it}t})$ (for $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$),
2. $u = (u_1, u_2, \ldots, u_N)$, and
3. $V = (v_1, \ldots, v_{T-1})$, with $v_t = (v_{1t}, v_{2t}, \ldots, v_{Nt})$.

For the probit link model, the joint posterior distribution is

$$\pi\left(Z, u, V, \{W_t\}_{t=1}^{T}, \alpha, B^{(c)}, \Delta^{(s)}, \Lambda\right) \propto [\alpha]\prod_{i=1}^{N}\phi\left(u_i \mid x_{(o)i}'\alpha, 1\right)I_{u_i>0}^{z_{i1}}I_{u_i\le0}^{1-z_{i1}}\;\times$$
$$\prod_{t=2}^{T}\left[\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}\right]\prod_{i=1}^{N}\phi\left(v_{i(t-1)} \mid \mu^{(v)}_{i(t-1)}, 1\right)I_{v_{i(t-1)}>0}^{z_{it}}I_{v_{i(t-1)}\le0}^{1-z_{it}}\;\times$$
$$\prod_{t=1}^{T}[\lambda_t]\prod_{i=1}^{N}\prod_{j=1}^{J_{it}}\phi\left(w_{ijt} \mid q_{ijt}'\lambda_t, 1\right)\left(z_{it}I_{w_{ijt}>0}\right)^{y_{ijt}}\left(1 - z_{it}I_{w_{ijt}>0}\right)^{(1-y_{ijt})},$$
$$\text{where}\quad \mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}. \qquad (2-33)$$

Initialize the Gibbs sampler at $\alpha^{(0)}$, $B^{(c)(0)}$, $\Delta^{(s)(0)}$, and $\Lambda^{(0)}$. The sampler proceeds iteratively by block sampling sequentially for each primary sampling period as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting $[\,\cdot\,|\,\cdot\,]$ denote the full conditional probability density function of a component, conditional on all other unknown parameters and the observed data, for $m = 1, \ldots, n_{sim}$ the sampling procedure can be summarized as

$$\left[z_1^{(m)} | \cdot\right]\rightarrow\left[u^{(m)} | \cdot\right]\rightarrow\left[\alpha^{(m)} | \cdot\right]\rightarrow\left[W_1^{(m)} | \cdot\right]\rightarrow\left[\lambda_1^{(m)} | \cdot\right]\rightarrow\left[z_2^{(m)} | \cdot\right]\rightarrow\left[V_{2-1}^{(m)} | \cdot\right]\rightarrow\left[\beta^{(c)(m)}_{2-1}, \delta^{(s)(m)}_{2-1} | \cdot\right]\rightarrow\left[W_2^{(m)} | \cdot\right]\rightarrow\left[\lambda_2^{(m)} | \cdot\right]\rightarrow\cdots$$
$$\cdots\rightarrow\left[z_T^{(m)} | \cdot\right]\rightarrow\left[V_{T-1}^{(m)} | \cdot\right]\rightarrow\left[\beta^{(c)(m)}_{T-1}, \delta^{(s)(m)}_{T-1} | \cdot\right]\rightarrow\left[W_T^{(m)} | \cdot\right]\rightarrow\left[\lambda_T^{(m)} | \cdot\right]$$

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail in Appendix A.


Logit link Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by:

State model
$$u_{i1} | \alpha \sim PG\left(x_{(o)i}^T\alpha, 1\right), \qquad z_{i1} | u_i \sim \text{Bernoulli}\left(I_{u_i>0}\right)$$
and, for $t > 1$,
$$v_{i(t-1)} | \cdot \sim PG\left(1, \left|z_{i(t-1)}\delta^{(s)}_{(t-1)} + x_{i(t-1)}'\beta^{(c)}_{(t-1)}\right|\right), \qquad z_{it} | v_{i(t-1)} \sim \text{Bernoulli}\left(I_{v_{i(t-1)}>0}\right) \qquad (2-34)$$

Observed model
$$w_{ijt} | \lambda_t \sim PG\left(q_{ijt}^T\lambda_t, 1\right), \qquad y_{ijt} | z_{it}, w_{ijt} \sim \text{Bernoulli}\left(z_{it}I_{w_{ijt}>0}\right) \qquad (2-35)$$

The logit link version of the joint posterior is given by

$$\pi\left(Z, u, V, \{W_t\}_{t=1}^{T}, \alpha, B^{(c)}, \Delta^{(s)}, \Lambda\right) \propto [\alpha][\lambda_1]\prod_{i=1}^{N}\frac{\left(e^{x_{(o)i}'\alpha}\right)^{z_{i1}}}{1 + e^{x_{(o)i}'\alpha}}\,PG\left(u_i;\, 1, |x_{(o)i}'\alpha|\right)\;\times$$
$$\prod_{j=1}^{J_{i1}}\left(z_{i1}\frac{e^{q_{ij1}'\lambda_1}}{1 + e^{q_{ij1}'\lambda_1}}\right)^{y_{ij1}}\left(1 - z_{i1}\frac{e^{q_{ij1}'\lambda_1}}{1 + e^{q_{ij1}'\lambda_1}}\right)^{1-y_{ij1}}PG\left(w_{ij1};\, 1, |z_{i1}q_{ij1}'\lambda_1|\right)\;\times$$
$$\prod_{t=2}^{T}[\delta^{(s)}_{t-1}][\beta^{(c)}_{t-1}][\lambda_t]\prod_{i=1}^{N}\frac{\left(\exp\left[\mu^{(v)}_{i(t-1)}\right]\right)^{z_{it}}}{1 + \exp\left[\mu^{(v)}_{i(t-1)}\right]}\,PG\left(v_{it};\, 1, \left|\mu^{(v)}_{i(t-1)}\right|\right)\;\times$$
$$\prod_{j=1}^{J_{it}}\left(z_{it}\frac{e^{q_{ijt}'\lambda_t}}{1 + e^{q_{ijt}'\lambda_t}}\right)^{y_{ijt}}\left(1 - z_{it}\frac{e^{q_{ijt}'\lambda_t}}{1 + e^{q_{ijt}'\lambda_t}}\right)^{1-y_{ijt}}PG\left(w_{ijt};\, 1, |z_{it}q_{ijt}'\lambda_t|\right), \qquad (2-36)$$

with $\mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + x_{i(t-1)}'\beta^{(c)}_{t-1}$.


The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2-36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence

In this section, we describe how the additional layer of complexity, space, can also be accounted for by continuing to use the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and it extends the model proposed by Johnson et al. (2013) for the single-season, closed-population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov random field (GMRF). The model, known as the spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran 2013).

Succinctly, the spatial dependence is accounted for in the model by adding a random vector $\eta$ assumed to have a conditionally autoregressive (CAR) prior (also known as the Gaussian Markov random field prior). To define the prior, let the pair $G = (V, E)$ represent the undirected graph for the entire spatial region studied, where $V = (1, 2, \ldots, N)$ denotes the vertices of the graph (sites) and $E$ the set of edges between sites; $E$ is constituted by elements of the form $(i, j)$ indicating that sites $i$ and $j$ are spatially adjacent, for some $i, j \in V$. The prior for the spatial effects is then characterized by

$$[\eta | \tau] \propto \tau^{\text{rank}(Q_s)/2}\exp\left[-\frac{\tau}{2}\eta'Q_s\eta\right], \qquad (2-37)$$


where $Q_s = (\text{diag}(A\mathbf{1}) - A)$ is the precision matrix, with $A$ denoting the adjacency matrix. The entries of the adjacency matrix $A$ are such that $\text{diag}(A) = 0$ and $A_{ij} = I_{(i,j)\in E}$.

The matrix $Q_s$ is singular; hence, the probability density defined in Equation 2-37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper, the posterior for the model parameters is proper. If a constraint such as $\sum_k \eta_k = 0$ is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
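For concreteness, the sketch below (Python; the 4-site line adjacency is a made-up toy example) builds the precision matrix diag(A1) - A from an adjacency matrix and verifies that it is singular, which is why the prior in (2-37) is improper.

    import numpy as np

    # Toy adjacency matrix for 4 sites on a line: 1-2-3-4 (hypothetical example)
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    Q = np.diag(A.sum(axis=1)) - A      # CAR precision matrix, diag(A1) - A
    print(np.linalg.matrix_rank(Q))     # rank 3 < 4: Q is singular, so the prior is improper
    print(Q @ np.ones(4))               # the vector of ones lies in the null space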

Assuming that all but the detection process are subject to spatial correlation, and using the notation we have developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by Equations 2-38 and 2-39.

Hence, adding spatial structure to the DYMOSS framework described in the previous section only involves adding the steps to sample $\eta^{(o)}$ and $\{\eta_t\}_{t=2}^{T}$ conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.

State model
$$z_{i1} | \alpha \sim \text{Bernoulli}(\psi_{i1}), \quad\text{where } \psi_{i1} = F\left(x_{(o)i}^T\alpha + \eta^{(o)}_i\right)$$
$$\left[\eta^{(o)} | \tau\right] \propto \tau^{\text{rank}(Q_s)/2}\exp\left[-\frac{\tau}{2}\eta^{(o)\prime}Q_s\eta^{(o)}\right]$$
$$z_{it} | z_{i(t-1)}, \alpha, \beta_{t-1}, \lambda_{t-1} \sim \text{Bernoulli}\left(z_{i(t-1)}\theta_{i(t-1)} + \left(1 - z_{i(t-1)}\right)\gamma_{i(t-1)}\right),$$
$$\text{where } \theta_{i(t-1)} = F\left(\delta^{(s)}_{(t-1)} + x_{i(t-1)}^T\beta^{(c)}_{t-1} + \eta_{it}\right) \text{ and } \gamma_{i(t-1)} = F\left(x_{i(t-1)}^T\beta^{(c)}_{t-1} + \eta_{it}\right)$$
$$[\eta_t | \tau] \propto \tau^{\text{rank}(Q_s)/2}\exp\left[-\frac{\tau}{2}\eta_t'Q_s\eta_t\right] \qquad (2-38)$$

Observed model
$$y_{ijt} | z_{it}, \lambda_t \sim \text{Bernoulli}(z_{it}p_{ijt}), \quad\text{where } p_{ijt} = F(q_{ijt}^T\lambda_t) \qquad (2-39)$$

In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran 2013; Reich et al. 2006): (1) model parameters have no clear interpretation due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work of Reich et al. (2006). This methodology is summarized in what follows.

Let a vector of spatial effects $\eta$ have the CAR model given by (2-37) above. Now consider a random vector $\zeta \sim \text{MVN}\left(0, \tau K'Q_sK\right)$, with $Q_s$ defined as above and where $\tau K'Q_sK$ corresponds to the precision of the distribution (not the covariance matrix), and with the matrix $K$ satisfying $K'K = I$.

This last condition implies that the linear predictor becomes $X\beta + \eta = X\beta + K\zeta$. With respect to how the matrix $K$ is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as $P^{\perp}AP^{\perp}$, with $P^{\perp} = I - X(X'X)^{-1}X'$ and where $A$ is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix $X$. These elements are incorporated into the spectral decomposition of the Moran operator: its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to $X$, while its eigenvectors provide the patterns of spatial dependence residual to $X$. Thus, the matrix $K$ is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
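A small sketch of this construction is given below (Python; the function name and the number of retained eigenvectors are arbitrary choices made for illustration). The spatial effect is then represented as eta = K @ zeta, reducing its dimension from N to the number of retained columns.

    import numpy as np

    def moran_basis(X, A, n_eigs=10):
        """Columns of K: leading eigenvectors of the Moran operator P_perp A P_perp."""
        n = X.shape[0]
        P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # I - X(X'X)^{-1}X'
        M = P_perp @ A @ P_perp                                   # Moran operator
        eigvals, eigvecs = np.linalg.eigh(M)                      # symmetric, so eigh applies
        order = np.argsort(eigvals)[::-1]                         # largest Moran's I values first
        return eigvecs[:, order[:n_eigs]]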


Using this strategy, the new hierarchical formulation of our model is simply modified by letting $\eta^{(o)} = K^{(o)}\zeta^{(o)}$ and $\eta_t = K_t\zeta_t$, with

1. $\zeta^{(o)} \sim \text{MVN}\left(0, \tau^{(o)}K^{(o)\prime}Q_sK^{(o)}\right)$, where $K^{(o)}$ is the eigenvector matrix for $P^{(o)\perp}AP^{(o)\perp}$, and
2. $\zeta_t \sim \text{MVN}\left(0, \tau_tK_t'Q_sK_t\right)$, where $K_t$ is the eigenvector matrix for $P_t^{\perp}AP_t^{\perp}$, for $t = 2, 3, \ldots, T$.

The algorithms for the probit and logit links from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for $(\alpha, \zeta^{(o)})$ and $(\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}, \zeta_t)$ and making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary

With a few exceptions (Dorazio & Taylor-Rodríguez 2012; Johnson et al. 2013; Royle & Kery 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.

In Dorazio & Taylor-Rodríguez (2012), we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available and allowed detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993), where the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and the logit links yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single-season model to work with the logit link function.

The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence its assumptions may not be appropriate to address problems where the interest lies in the temporal dynamics of the population. For this reason we developed a dynamic model that incorporates the notion that occupancy at a previously occupied site takes place through persistence, which depends both on survival and on habitat suitability. By this we mean that a site occupied at time $t$ may again be occupied at time $t+1$ if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS model, both colonization and persistence depend on habitat suitability, characterized by $x_{i(t-1)}'\beta^{(c)}_{t-1}$; they differ only in that persistence is also influenced by whether occupancy of the site during season $t-1$ enhances or harms the suitability of the site through density dependence.

Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites to the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).

Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the chapter, and (2) build a suite of additional extensions for occupancy models using this framework. The first of these extensions will incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy where the source and spatial scale of the data are accounted for. The second extension is a single-season, spatially explicit, multiple-species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.


CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes
The Sign of Four

3.1 Introduction

Occupancy models are often used to understand the mechanisms that dictate the distribution of a species; therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and, with a few exceptions (Hooten & Hobbs, 2014; Link & Barker, 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). In practice this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores the entire model space in a principled manner. It is completely automatic, precluding the need both for tuning parameters in the sampling algorithm and for subjective elicitation of parameter prior distributions.

As mentioned above, in ecological modeling, if model selection or, less frequently, model averaging is considered, the Akaike Information Criterion (AIC) (Akaike, 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). The AIC is designed to find the model whose density is, on average, closest in Kullback-Leibler distance to the density of the true data generating mechanism; the model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman, 2000). Commonly the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal-to-noise ratio and, as such, tends to overfit (Rao & Wu, 2001). Other versions of the AIC provide a bias correction that enhances the signal-to-noise ratio, leading to a stronger penalization of model complexity; some examples are the AICc (Hurvich & Tsai, 1989) and AICu (McQuarrie et al., 1997). However, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu, 2001).

If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data generating mechanism is among those considered, Bayesian methods asymptotically choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.

In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Alternatively, there may be a prohibitively large number of parameters, making the specification of priors for each of them an arduous task. In addition, seemingly innocuous subjective choices for the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman, 1996).

One popular choice of "objective" prior is the reference prior (Berger & Bernardo, 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling with equal coverage probability (Kass & Wasserman, 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.

In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion about model priors is deferred until Chapter 4; this chapter focuses on the priors on the parameter space.

The chapter is structured as follows. First, issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and we analyze the blue hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez, 2012; Kéry et al., 2010).

3.2 Objective Bayesian Inference

As mentioned before, in practice, noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision-theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman, 1996), and it characterizes the way in which Bayesian reference methods are interpreted today; that is, noninformative priors are seen as chosen by convention according to the situation.

A word of caution must be given when using noninformative priors. Difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number) and, as such, are said to depend on arbitrary constants.

Bayes factors strongly depend upon the prior distributions for the parameters included in each of the models being compared. This can be an important limitation when using noninformative priors, since their introduction results in the Bayes factors being a function of a ratio of arbitrary constants, given that these priors are typically improper (see Jeffreys, 1961; Pericchi, 2005; and references therein). Many different approaches have since been developed to deal with the arbitrary constants that arise when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi, 1996; Good, 1950; Lempers, 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith, 1982), and approximations to the Bayes factor (see Haughton, 1988, as cited in Berger & Pericchi, 1996; Kass & Raftery, 1995; Tierney & Kadane, 1986).

3.2.1 The Intrinsic Methodology

Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained by combining the model structure with information contained in the observed data. Furthermore, they showed that, as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.

Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure". This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended, with some caution, to nonnested models.


As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert, 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of the larger alternative models. Hence priors for the larger alternative models that place probability mass away from the null model are wasteful: if the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey, 1974), is widely recognized by statisticians.

Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model and, in spite of being obtained through a limiting procedure, it is not subject to the Jeffreys-Lindley paradox.

Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent over the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors

As previously mentioned, in the Bayesian paradigm a model $M \in \mathcal{M}$ is defined by a sampling density and a prior distribution. The sampling density associated with model $M$ is denoted by $f(\mathbf{y} \mid \beta_M, \sigma^2_M, M)$, where $(\beta_M, \sigma^2_M)$ is a vector of model-specific unknown parameters. The prior for model $M$ and its corresponding set of parameters is
\[
\pi(\beta_M, \sigma^2_M, M \mid \mathcal{M}) = \pi(\beta_M, \sigma^2_M \mid M, \mathcal{M}) \cdot \pi(M \mid \mathcal{M}).
\]
Objective local priors for the model parameters $(\beta_M, \sigma^2_M)$ are achieved through modifications and extensions of Zellner's g-prior (Liang et al., 2008; Womack et al., 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion on priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors

An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi, 1996; Moreno et al., 1998). Because $M_B \subseteq M$ for all $M \in \mathcal{M}$, the intrinsic prior for $(\beta_M, \sigma^2_M)$ is defined as an expected posterior prior
\[
\pi^{I}(\beta_M, \sigma^2_M \mid M) = \int p^{R}(\beta_M, \sigma^2_M \mid \tilde{\mathbf{y}}, M)\, m^{R}(\tilde{\mathbf{y}} \mid M_B)\, d\tilde{\mathbf{y}}, \qquad (3\text{–}1)
\]
where $\tilde{\mathbf{y}}$ is a minimal training sample for model $M$, $I$ denotes the intrinsic distributions, and $R$ denotes distributions derived from the reference prior $\pi^{R}(\beta_M, \sigma^2_M \mid M) = c_M \frac{d\beta_M\, d\sigma^2_M}{\sigma^2_M}$.

In (3–1), $m^{R}(\tilde{\mathbf{y}} \mid M) = \int\!\int f(\tilde{\mathbf{y}} \mid \beta_M, \sigma^2_M, M)\, \pi^{R}(\beta_M, \sigma^2_M \mid M)\, d\beta_M\, d\sigma^2_M$ is the reference marginal of $\tilde{\mathbf{y}}$ under model $M$, and $p^{R}(\beta_M, \sigma^2_M \mid \tilde{\mathbf{y}}, M) = \frac{f(\tilde{\mathbf{y}} \mid \beta_M, \sigma^2_M, M)\, \pi^{R}(\beta_M, \sigma^2_M \mid M)}{m^{R}(\tilde{\mathbf{y}} \mid M)}$ is the reference posterior density.

In the regression framework, the reference marginal $m^{R}$ is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model $M$ to the base model $M_B$ is well defined and given by
\[
BF^{I}_{M,M_B}(\mathbf{y}) = \left(1 - R^2_M\right)^{-\frac{n-|M_B|}{2}} \int_0^1 \left(\frac{n + \sin^2\!\left(\tfrac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\tfrac{\pi}{2}\theta\right)(|M|+1)}{1-R^2_M}}\right)^{\frac{n-|M|}{2}} \left(\frac{\sin^2\!\left(\tfrac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\tfrac{\pi}{2}\theta\right)(|M|+1)}{1-R^2_M}}\right)^{\frac{|M|-|M_B|}{2}} d\theta, \qquad (3\text{–}2)
\]
where $R^2_M$ is the coefficient of determination of model $M$ versus model $M_B$. The Bayes factor between two models $M$ and $M'$ is defined as $BF^{I}_{M,M'}(\mathbf{y}) = BF^{I}_{M,M_B}(\mathbf{y}) / BF^{I}_{M',M_B}(\mathbf{y})$.

The "goodness" of the model $M$ based on the intrinsic priors is given by its posterior probability
\[
p^{I}(M \mid \mathbf{y}, \mathcal{M}) = \frac{BF^{I}_{M,M_B}(\mathbf{y})\, \pi(M \mid \mathcal{M})}{\sum_{M' \in \mathcal{M}} BF^{I}_{M',M_B}(\mathbf{y})\, \pi(M' \mid \mathcal{M})}. \qquad (3\text{–}3)
\]
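As an illustration, the following minimal Python sketch (function names are ours, and it assumes numpy and scipy are available) evaluates equation 3–2 by one-dimensional numerical quadrature and converts a collection of Bayes factors against the base model into the posterior probabilities of equation 3–3.

```python
import numpy as np
from scipy.integrate import quad

def log_bf_intrinsic(r2, n, p_m, p_b):
    """log intrinsic Bayes factor of model M (p_m parameters, coefficient of
    determination r2 vs the base model) against the base model M_B (p_b parameters),
    following the form of equation 3-2."""
    q = p_m + 1  # minimal training-sample size for M

    def integrand(theta):
        s = np.sin(np.pi * theta / 2.0) ** 2 * q
        denom = n + s / (1.0 - r2)
        return ((n + s) / denom) ** ((n - p_m) / 2.0) * (s / denom) ** ((p_m - p_b) / 2.0)

    integral, _ = quad(integrand, 0.0, 1.0)
    return -0.5 * (n - p_b) * np.log(1.0 - r2) + np.log(integral)

def model_posteriors(log_bfs, prior_probs):
    """Posterior model probabilities (equation 3-3) from BFs against the base model."""
    log_w = np.asarray(log_bfs) + np.log(prior_probs)
    log_w -= log_w.max()                      # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()
```

The same routine applies to the scaled mixtures of g-priors discussed next by replacing the Beta(0.5, 0.5) mixing implicit in the change of variable $w = \sin^2(\pi\theta/2)$ with another mixing density for $w$.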

It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al., 2009; Girón et al., 2010). In the context of well-formulated models, the true model $M_T$ is the smallest well-formulated model $M \in \mathcal{M}$ such that $\alpha \in M$ if $\beta_\alpha \neq 0$. If $M_T$ is the true model, then the posterior probability of model $M_T$ based on equation (3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors

Scaled mixtures of g-priors place a reference prior on $(\beta_{M_B}, \sigma^2)$ and a multivariate normal distribution on $\beta$ in $M \setminus M_B$, namely a normal with mean $0$ and precision matrix
\[
\frac{q_M w}{n\sigma^2}\, Z_M'(I - H_0)Z_M,
\]
where $H_0$ is the hat matrix associated with $Z_{M_B}$. The prior is completed by a prior on $w$ and a choice of scaling $q_M$, which is set at $|M|+1$ to account for the minimal sample size of $M$. Under these assumptions, the Bayes factor for $M$ to $M_B$ is given by
\[
BF_{M,M_B}(\mathbf{y}) = \left(1 - R^2_M\right)^{-\frac{n-|M_B|}{2}} \int \left(\frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}}\right)^{\frac{n-|M|}{2}} \left(\frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}}\right)^{\frac{|M|-|M_B|}{2}} \pi(w)\, dw.
\]
We consider the following priors on $w$. The intrinsic prior is $\pi(w) = \mathrm{Beta}(w \mid 0.5, 0.5)$, which is only defined for $w \in (0,1)$. A version of the Zellner-Siow prior is given by $w \sim \mathrm{Gamma}(0.5, 0.5)$, which produces a multivariate Cauchy distribution on $\beta$. A family of hyper-g priors is defined by $\pi(w) \propto w^{-1/2}(\beta + w)^{-(\alpha+1)/2}$; these have Cauchy-like tails but produce more shrinkage than the Cauchy prior.


3.3 Objective Bayes Occupancy Model Selection

As mentioned before, Bayesian inferential approaches for ecological models are lacking. In particular, there is a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore the model space considered thoroughly. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single-season site-occupancy models. We refer to this method as automatic and objective because its implementation requires no hyperparameter tuning and it is built using noninformative priors with good testing properties (e.g., intrinsic priors).

An inferential method for the occupancy problem is possible using the intrinsic approach, given that we are able to link intrinsic Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single-season probit occupancy model through the hierarchy
\[
\begin{aligned}
y_{ij} \mid z_i, w_{ij} &\sim \mathrm{Bernoulli}\left(z_i\, I_{w_{ij} > 0}\right)\\
w_{ij} \mid \lambda &\sim N\left(q_{ij}'\lambda,\; 1\right)\\
z_i \mid v_i &\sim \mathrm{Bernoulli}\left(I_{v_i > 0}\right)\\
v_i \mid \alpha &\sim N\left(x_i'\alpha,\; 1\right),
\end{aligned}
\]
it is possible to solve the selection problem on the latent-scale variables $w_{ij}$ and $v_i$ and to use those results at the level of the occupancy and detection processes.
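For concreteness, the hierarchy can be simulated directly. The following short Python sketch (illustrative only; the coefficient values and covariates are made up, not taken from the dissertation) generates data from the probit occupancy model above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 100, 5                                    # sites and surveys per site
alpha = np.array([0.2, 1.0, -0.7])               # occupancy coefficients (illustrative)
lam   = np.array([-0.3, 0.8])                    # detection coefficients (illustrative)

X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # site-level covariates
Q = np.dstack([np.ones((N, J)), rng.normal(size=(N, J))])    # survey-level covariates

v = X @ alpha + rng.normal(size=N)               # latent occupancy scale
z = (v > 0).astype(int)                          # presence indicators
w = np.einsum('ijk,k->ij', Q, lam) + rng.normal(size=(N, J)) # latent detection scale
y = (z[:, None] * (w > 0)).astype(int)           # detections, only possible where z = 1
```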

In what follows, first we provide some necessary notation. Then a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated into a model selection algorithm for site-occupancy data. Although the priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.


3.3.1 Preliminaries

The notation used in Chapter 2 is used in this section as well. Namely, presence is denoted by $z$, detection by $y$, their corresponding latent processes by $v$ and $w$, and the model parameters by $\alpha$ and $\lambda$. However, some additional notation is also necessary. Let $M_0 = \{M_{0y}, M_{0z}\}$ denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models $M_{0y}$ and $M_{0z}$ include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including predictors deemed important by the researcher that should be included in every model.

Furthermore, let the sets $[K_z] = \{1, 2, \ldots, K_z\}$ and $[K_y] = \{1, 2, \ldots, K_y\}$ index the covariates considered in the variable selection procedure for the presence and detection processes, respectively. That is, these sets denote the covariates that can be added to the base models in $M_0$, or removed from the largest possible models considered, $M_{Fz}$ and $M_{Fy}$, which we refer to as the "full" models. The model space can then be represented by the Cartesian product of subsets $A_y \subseteq [K_y]$ and $A_z \subseteq [K_z]$. The entire model space is populated by models of the form $M_A = \{M_{A_y}, M_{A_z}\} \in \mathcal{M} = \mathcal{M}_y \times \mathcal{M}_z$, with $M_{A_y} \in \mathcal{M}_y$ and $M_{A_z} \in \mathcal{M}_z$.

For the presence process $z$, the design matrix for model $M_{A_z}$ is given by the block matrix $X_{A_z} = (X_0 \mid X_{r,A})$, where $X_0$ corresponds to the design matrix of the base model (which is such that $M_{0z} \subseteq M_{A_z} \in \mathcal{M}_z$ for all $A_z \subseteq [K_z]$) and $X_{r,A}$ corresponds to the submatrix containing the covariates indexed by $A_z$. Analogously, for the detection process $y$, the design matrix is given by $Q_{A_y} = (Q_0 \mid Q_{r,A})$. Similarly, the coefficients for models $M_{A_z}$ and $M_{A_y}$ are given by $\alpha_A = (\alpha_0', \alpha_{r,A}')'$ and $\lambda_A = (\lambda_0', \lambda_{r,A}')'$.

With these elements in place, the model selection problem consists of finding subsets of covariates, indexed by $A = \{A_z, A_y\}$, that have a high posterior probability given the detection and occupancy processes. This is equivalent to finding models with high posterior odds when compared to a suitable base model. These posterior odds are given by
\[
\frac{p(M_A \mid \mathbf{y}, \mathbf{z})}{p(M_0 \mid \mathbf{y}, \mathbf{z})} = \frac{m(\mathbf{y}, \mathbf{z} \mid M_A)\, \pi(M_A)}{m(\mathbf{y}, \mathbf{z} \mid M_0)\, \pi(M_0)} = BF_{M_A, M_0}(\mathbf{y}, \mathbf{z})\, \frac{\pi(M_A)}{\pi(M_0)}.
\]

Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem on the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that depends solely on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as it yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.

At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for simple probit regression. In the occupancy setting the argument is
\[
\begin{aligned}
p(M_A \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v}) &= \frac{m(\mathbf{y}, \mathbf{z}, \mathbf{v}, \mathbf{w} \mid M_A)\, \pi(M_A)}{m(\mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})}\\
&= \frac{f_{yz}(\mathbf{y}, \mathbf{z} \mid \mathbf{w}, \mathbf{v}) \left(\int f_{vw}(\mathbf{v}, \mathbf{w} \mid \alpha, \lambda, M_A)\, \pi_{\alpha\lambda}(\alpha, \lambda \mid M_A)\, d(\alpha, \lambda)\right) \pi(M_A)}{f_{yz}(\mathbf{y}, \mathbf{z} \mid \mathbf{w}, \mathbf{v}) \sum_{M^* \in \mathcal{M}} \left(\int f_{vw}(\mathbf{v}, \mathbf{w} \mid \alpha, \lambda, M^*)\, \pi_{\alpha\lambda}(\alpha, \lambda \mid M^*)\, d(\alpha, \lambda)\right) \pi(M^*)}\\
&= \frac{m(\mathbf{v} \mid M_{A_z})\, m(\mathbf{w} \mid M_{A_y})\, \pi(M_A)}{m(\mathbf{v})\, m(\mathbf{w})}\\
&\propto m(\mathbf{v} \mid M_{A_z})\, m(\mathbf{w} \mid M_{A_y})\, \pi(M_A), \qquad (3\text{–}4)
\end{aligned}
\]
where

1. $f_{yz}(\mathbf{y}, \mathbf{z} \mid \mathbf{w}, \mathbf{v}) = \prod_{i=1}^{N} I_{z_i v_i > 0}\, I_{(1-z_i) v_i \le 0} \prod_{j=1}^{J_i} \left(z_i I_{w_{ij}>0}\right)^{y_{ij}} \left(1 - z_i I_{w_{ij}>0}\right)^{1-y_{ij}}$,

2. $f_{vw}(\mathbf{v}, \mathbf{w} \mid \alpha, \lambda, M_A) = \underbrace{\left(\prod_{i=1}^{N} \phi\!\left(v_i;\, x_i'\alpha_{M_{A_z}},\, 1\right)\right)}_{f(\mathbf{v} \mid \alpha_{r,A},\, \alpha_0,\, M_{A_z})} \underbrace{\left(\prod_{i=1}^{N}\prod_{j=1}^{J_i} \phi\!\left(w_{ij};\, q_{ij}'\lambda_{M_{A_y}},\, 1\right)\right)}_{f(\mathbf{w} \mid \lambda_{r,A},\, \lambda_0,\, M_{A_y})}$, and

3. $\pi_{\alpha\lambda}(\alpha, \lambda \mid M_A) = \pi_\alpha(\alpha \mid M_{A_z})\, \pi_\lambda(\lambda \mid M_{A_y})$.

This result implies that, once the occupancy and detection indicators are conditioned on the latent processes $\mathbf{v}$ and $\mathbf{w}$, respectively, the model posterior probabilities only depend on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds
\[
\frac{p(M_A \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})}{p(M_0 \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})} = \frac{m(\mathbf{w}, \mathbf{v} \mid M_A)}{m(\mathbf{w}, \mathbf{v} \mid M_0)}\, \frac{\pi(M_A)}{\pi(M_0)}, \qquad (3\text{–}5)
\]
where $m(\mathbf{w}, \mathbf{v} \mid M_A) = m(\mathbf{w} \mid M_{A_y}) \cdot m(\mathbf{v} \mid M_{A_z})$, with
\[
m(\mathbf{v} \mid M_{A_z}) = \int\!\int f(\mathbf{v} \mid \alpha_{r,A}, \alpha_0, M_{A_z})\, \pi(\alpha_{r,A} \mid \alpha_0, M_{A_z})\, \pi(\alpha_0)\, d\alpha_{r,A}\, d\alpha_0, \qquad (3\text{–}6)
\]
\[
m(\mathbf{w} \mid M_{A_y}) = \int\!\int f(\mathbf{w} \mid \lambda_{r,A}, \lambda_0, M_{A_y})\, \pi(\lambda_{r,A} \mid \lambda_0, M_{A_y})\, \pi(\lambda_0)\, d\lambda_0\, d\lambda_{r,A}. \qquad (3\text{–}7)
\]

3.3.2 Intrinsic Priors for the Occupancy Problem

In general, the intrinsic priors, as defined by Moreno et al. (1998), use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes we assume noninformative improper priors for the parameters, denoted by $\pi^N(\cdot \mid \cdot)$. Specifically, the intrinsic priors $\pi^{IP}(\theta_{M^*} \mid M^*)$ for a vector of parameters $\theta_{M^*}$ corresponding to model $M^* \in \{M_0, M\} \subset \mathcal{M}$, for a response vector $\mathbf{s}$ with probability density (or mass) function $f(\mathbf{s} \mid \theta_{M^*})$, are defined by
\[
\begin{aligned}
\pi^{IP}(\theta_{M_0} \mid M_0) &= \pi^{N}(\theta_{M_0} \mid M_0)\\
\pi^{IP}(\theta_{M} \mid M) &= \pi^{N}(\theta_{M} \mid M) \int \frac{m(\tilde{\mathbf{s}} \mid M_0)}{m(\tilde{\mathbf{s}} \mid M)}\, f(\tilde{\mathbf{s}} \mid \theta_M, M)\, d\tilde{\mathbf{s}},
\end{aligned}
\]
where $\tilde{\mathbf{s}}$ is a theoretical training sample.

In what follows, whenever it is clear from the context, in an attempt to simplify the notation, $M_A$ will be used to refer to $M_{A_z}$ or $M_{A_y}$, and $A$ will denote $A_z$ or $A_y$. To derive the parameter priors involved in equations 3–6 and 3–7 using the objective intrinsic prior strategy, we start by assuming flat priors $\pi^N(\alpha_A \mid M_A) \propto c_A$ and $\pi^N(\lambda_A \mid M_A) \propto d_A$, where $c_A$ and $d_A$ are unknown constants.

The intrinsic prior for the parameters associated with the occupancy process, $\alpha_A$, conditional on model $M_A$, is
\[
\pi^{IP}(\alpha_A \mid M_A) = \pi^{N}(\alpha_A \mid M_A) \int \frac{m(\tilde{\mathbf{v}} \mid M_0)}{m(\tilde{\mathbf{v}} \mid M_A)}\, f(\tilde{\mathbf{v}} \mid \alpha_A, M_A)\, d\tilde{\mathbf{v}},
\]
where the marginals $m(\tilde{\mathbf{v}} \mid M_j)$, with $j \in \{A, 0\}$, are obtained by solving the analogue of equation 3–6 for the (theoretical) training sample $\tilde{\mathbf{v}}$. These marginals are given by
\[
m(\tilde{\mathbf{v}} \mid M_j) = c_j\, (2\pi)^{-\frac{p_{A_z}-p_j}{2}}\, |\tilde{X}_j'\tilde{X}_j|^{-\frac{1}{2}}\, e^{-\frac{1}{2}\tilde{\mathbf{v}}'(I-\tilde{H}_j)\tilde{\mathbf{v}}}.
\]
The training sample $\tilde{\mathbf{v}}$ has dimension $p_{A_z} = |M_{A_z}|$, that is, the total number of parameters in model $M_{A_z}$. Note that, without ambiguity, we use $|\cdot|$ to denote both the cardinality of a set and the determinant of a matrix. The design matrix $\tilde{X}_A$ corresponds to the training sample $\tilde{\mathbf{v}}$ and is chosen such that $\tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A$ (León-Novelo et al., 2012), and $\tilde{H}_j$ is the corresponding hat matrix.

Replacing $m(\tilde{\mathbf{v}} \mid M_A)$ and $m(\tilde{\mathbf{v}} \mid M_0)$ in $\pi^{IP}(\alpha_A \mid M_A)$ and solving the integral with respect to the theoretical training sample $\tilde{\mathbf{v}}$, we have
\[
\begin{aligned}
\pi^{IP}(\alpha_A \mid M_A) &= c_A \int \left((2\pi)^{-\frac{p_{A_z}-p_{0z}}{2}} \left(\frac{c_0}{c_A}\right) e^{-\frac{1}{2}\tilde{\mathbf{v}}'\left((I-\tilde{H}_0)-(I-\tilde{H}_A)\right)\tilde{\mathbf{v}}}\, \frac{|\tilde{X}_A'\tilde{X}_A|^{1/2}}{|\tilde{X}_0'\tilde{X}_0|^{1/2}}\right) \left((2\pi)^{-\frac{p_{A_z}}{2}} e^{-\frac{1}{2}(\tilde{\mathbf{v}}-\tilde{X}_A\alpha_A)'(\tilde{\mathbf{v}}-\tilde{X}_A\alpha_A)}\right) d\tilde{\mathbf{v}}\\
&= c_0\, (2\pi)^{-\frac{p_{A_z}-p_{0z}}{2}}\, |\tilde{X}_{r,A}'\tilde{X}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_z}-p_{0z}}{2}} \exp\left[-\frac{1}{2}\alpha_{r,A}'\left(\frac{1}{2}\tilde{X}_{r,A}'\tilde{X}_{r,A}\right)\alpha_{r,A}\right]\\
&= \pi^{N}(\alpha_0) \times N\!\left(\alpha_{r,A} \,\middle|\, 0,\; 2\left(\tilde{X}_{r,A}'\tilde{X}_{r,A}\right)^{-1}\right). \qquad (3\text{–}8)
\end{aligned}
\]


Analogously, the intrinsic prior for the parameters associated with the detection process is
\[
\begin{aligned}
\pi^{IP}(\lambda_A \mid M_A) &= d_0\, (2\pi)^{-\frac{p_{A_y}-p_{0y}}{2}}\, |\tilde{Q}_{r,A}'\tilde{Q}_{r,A}|^{\frac{1}{2}}\, 2^{-\frac{p_{A_y}-p_{0y}}{2}} \exp\left[-\frac{1}{2}\lambda_{r,A}'\left(\frac{1}{2}\tilde{Q}_{r,A}'\tilde{Q}_{r,A}\right)\lambda_{r,A}\right]\\
&= \pi^{N}(\lambda_0) \times N\!\left(\lambda_{r,A} \,\middle|\, 0,\; 2\left(\tilde{Q}_{r,A}'\tilde{Q}_{r,A}\right)^{-1}\right). \qquad (3\text{–}9)
\end{aligned}
\]

In short, the intrinsic priors for $\alpha_A = (\alpha_0', \alpha_{r,A}')'$ and $\lambda_A = (\lambda_0', \lambda_{r,A}')'$ are the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by $A_z$ and $A_y$, respectively.

3.3.3 Model Posterior Probabilities

We now derive the expressions involved in the calculation of the model posterior probabilities. First, recall that $p(M_A \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v}) \propto m(\mathbf{w}, \mathbf{v} \mid M_A)\, \pi(M_A)$; hence determining this posterior probability only requires calculating $m(\mathbf{w}, \mathbf{v} \mid M_A)$.

Note that, since $\mathbf{w}$ and $\mathbf{v}$ are independent, obtaining the model posteriors from expression 3–4 reduces to finding closed-form expressions for the marginals $m(\mathbf{v} \mid M_{A_z})$ and $m(\mathbf{w} \mid M_{A_y})$ from equations 3–6 and 3–7, respectively. Therefore
\[
m(\mathbf{w}, \mathbf{v} \mid M_A) = \int\!\int f(\mathbf{v}, \mathbf{w} \mid \alpha, \lambda, M_A)\, \pi^{IP}(\alpha \mid M_{A_z})\, \pi^{IP}(\lambda \mid M_{A_y})\, d\alpha\, d\lambda. \qquad (3\text{–}10)
\]
For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3–8 into equation 3–6 (recalling that $\tilde{X}_A'\tilde{X}_A = \frac{p_{A_z}}{N} X_A'X_A$) and integrating out $\alpha_A$ yields
\[
\begin{aligned}
m(\mathbf{v} \mid M_A) &= \int\!\int c_0\, N\!\left(\mathbf{v} \mid X_0\alpha_0 + X_{r,A}\alpha_{r,A},\, I\right) N\!\left(\alpha_{r,A} \mid 0,\, 2(\tilde{X}_{r,A}'\tilde{X}_{r,A})^{-1}\right) d\alpha_{r,A}\, d\alpha_0\\
&= c_0 (2\pi)^{-\frac{n}{2}} \int \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0z}}{2}} \exp\left[-\frac{1}{2}(\mathbf{v} - X_0\alpha_0)'\left(I - \frac{2N}{2N + p_{A_z}}H_{r,A_z}\right)(\mathbf{v} - X_0\alpha_0)\right] d\alpha_0\\
&= c_0\, (2\pi)^{-\frac{n-p_{0z}}{2}} \left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0z}}{2}} |X_0'X_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\mathbf{v}'\left(I - H_{0z} - \frac{2N}{2N + p_{A_z}}H_{r,A_z}\right)\mathbf{v}\right], \qquad (3\text{–}11)
\end{aligned}
\]
with $H_{r,A_z} = H_{A_z} - H_{0z}$, where $H_{A_z}$ is the hat matrix for the entire model $M_{A_z}$ and $H_{0z}$ is the hat matrix for the base model.

Similarly, the marginal distribution for $\mathbf{w}$ is
\[
m(\mathbf{w} \mid M_A) = d_0\, (2\pi)^{-\frac{J-p_{0y}}{2}} \left(\frac{p_{A_y}}{2J + p_{A_y}}\right)^{\frac{p_{A_y}-p_{0y}}{2}} |Q_0'Q_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\mathbf{w}'\left(I - H_{0y} - \frac{2J}{2J + p_{A_y}}H_{r,A_y}\right)\mathbf{w}\right], \qquad (3\text{–}12)
\]
where $J = \sum_{i=1}^{N} J_i$; in other words, $J$ denotes the total number of surveys conducted.

Now, the marginals for the base model $M_0 = \{M_{0y}, M_{0z}\}$ are
\[
m(\mathbf{v} \mid M_0) = \int c_0\, N(\mathbf{v} \mid X_0\alpha_0,\, I)\, d\alpha_0 = c_0 (2\pi)^{-\frac{n-p_{0z}}{2}} |X_0'X_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\mathbf{v}'(I - H_{0z})\mathbf{v}\right] \qquad (3\text{–}13)
\]
and
\[
m(\mathbf{w} \mid M_0) = d_0 (2\pi)^{-\frac{J-p_{0y}}{2}} |Q_0'Q_0|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\mathbf{w}'(I - H_{0y})\mathbf{w}\right]. \qquad (3\text{–}14)
\]
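The closed forms in 3–11 through 3–14 depend on the data only through quadratic forms in the latent vectors, so they are straightforward to evaluate. Below is a minimal Python sketch (our own helper names; it assumes numpy and full-column-rank design matrices) of the log-marginal for the occupancy latent vector; the detection marginal 3–12 has the same form with Q, w, and J in place of X, v, and N.

```python
import numpy as np

def hat(X):
    """Hat (projection) matrix of a full-column-rank design matrix."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def log_m_v(v, X0, Xr=None):
    """log m(v | M), up to the common constant c0 (equations 3-11 and 3-13).

    X0 : base-model design matrix; Xr : extra columns indexed by A_z (None for M_0)."""
    n, p0 = X0.shape
    H0 = hat(X0)
    _, logdet = np.linalg.slogdet(X0.T @ X0)
    if Xr is None:                               # base model, equation 3-13
        quad = v @ (v - H0 @ v)
        return -0.5 * (n - p0) * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad
    XA = np.hstack([X0, Xr])
    pA = XA.shape[1]
    Hr = hat(XA) - H0                            # residual hat matrix H_{r,A}
    shrink = 2 * n / (2 * n + pA)
    quad = v @ ((np.eye(n) - H0 - shrink * Hr) @ v)
    return (-0.5 * (n - p0) * np.log(2 * np.pi)
            + 0.5 * (pA - p0) * np.log(pA / (2 * n + pA))
            - 0.5 * logdet - 0.5 * quad)

# The posterior odds of M_A vs M_0 in equation 3-5 then combine
# log_m_v(v, X0, Xr) - log_m_v(v, X0), the analogous detection term, and the prior odds.
```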

3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.

For each of the two components of the model (occupancy and detection), the algorithm first draws the set of active predictors (i.e., $A_z$ and $A_y$) together with their corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings correction with proposal distributions given by
\[
\begin{aligned}
q(A^*_z \mid \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)}, M_{A_z}) &= \frac{1}{2}\left(p\!\left(M_{A^*_z} \mid \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)}, \mathcal{M}_z,\; M_{A^*_z} \in L(M_{A_z})\right) + \frac{1}{|L(M_{A_z})|}\right)\\
q(A^*_y \mid \mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)}, M_{A_y}) &= \frac{1}{2}\left(p\!\left(M_{A^*_y} \mid \mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)}, \mathcal{M}_y,\; M_{A^*_y} \in L(M_{A_y})\right) + \frac{1}{|L(M_{A_y})|}\right), \qquad (3\text{–}15)
\end{aligned}
\]

where $L(M_{A_z})$ and $L(M_{A_y})$ denote the sets of models obtained by adding or removing one predictor at a time from $M_{A_z}$ and $M_{A_y}$, respectively.

To promote mixing, this step is followed by an additional draw from the full conditionals of $\alpha$ and $\lambda$. The densities $p(\alpha_0 \mid \cdot)$, $p(\alpha_{r,A} \mid \cdot)$, $p(\lambda_0 \mid \cdot)$, and $p(\lambda_{r,A} \mid \cdot)$ can be sampled from directly with Gibbs steps. Using the notation $a \mid \cdot$ to denote the random variable $a$ conditioned on all other parameters and on the data, these densities are given by

• $\alpha_0 \mid \cdot \sim N\!\left((X_0'X_0)^{-1}X_0'\mathbf{v},\; (X_0'X_0)^{-1}\right)$;
• $\alpha_{r,A} \mid \cdot \sim N\!\left(\mu_{\alpha_{r,A}},\, \Sigma_{\alpha_{r,A}}\right)$, where the covariance matrix and the mean vector are given by $\Sigma_{\alpha_{r,A}} = \frac{2N}{2N + p_{A_z}}(X_{r,A}'X_{r,A})^{-1}$ and $\mu_{\alpha_{r,A}} = \Sigma_{\alpha_{r,A}} X_{r,A}'\mathbf{v}$;
• $\lambda_0 \mid \cdot \sim N\!\left((Q_0'Q_0)^{-1}Q_0'\mathbf{w},\; (Q_0'Q_0)^{-1}\right)$; and
• $\lambda_{r,A} \mid \cdot \sim N\!\left(\mu_{\lambda_{r,A}},\, \Sigma_{\lambda_{r,A}}\right)$, analogously, with covariance matrix and mean given by $\Sigma_{\lambda_{r,A}} = \frac{2J}{2J + p_{A_y}}(Q_{r,A}'Q_{r,A})^{-1}$ and $\mu_{\lambda_{r,A}} = \Sigma_{\lambda_{r,A}} Q_{r,A}'\mathbf{w}$.

Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators $\mathbf{z}_u$ and for the corresponding latent variables $\mathbf{v}$ and $\mathbf{w}$. The full conditional posterior densities for $\mathbf{z}^{(t+1)}_u$, $\mathbf{v}^{(t+1)}$, and $\mathbf{w}^{(t+1)}$ are those introduced in Chapter 2 for the single-season probit model.
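A compact Python sketch of the coefficient Gibbs updates above (illustrative helper names; it assumes numpy) could look as follows; the detection update is identical with (w, Q0, Qr, p_{A_y}) in place of (v, X0, Xr, p_{A_z}).

```python
import numpy as np

def draw_coefficients(v, X0, Xr, pA, rng):
    """One Gibbs update of (alpha_0, alpha_rA) given the latent occupancy vector v."""
    N = X0.shape[0]
    # alpha_0 | .  ~  N((X0'X0)^{-1} X0' v, (X0'X0)^{-1})
    V0 = np.linalg.inv(X0.T @ X0)
    alpha0 = rng.multivariate_normal(V0 @ X0.T @ v, V0)
    if Xr is None or Xr.shape[1] == 0:
        return alpha0, np.empty(0)
    # alpha_rA | . ~  N(mu, Sigma), Sigma = 2N/(2N + pA) (Xr'Xr)^{-1}, mu = Sigma Xr' v
    Sigma = (2 * N / (2 * N + pA)) * np.linalg.inv(Xr.T @ Xr)
    mu = Sigma @ Xr.T @ v
    alpha_r = rng.multivariate_normal(mu, Sigma)
    return alpha0, alpha_r
```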

The following steps summarize the stochastic search algorithm.

1. Initialize $A_y^{(0)}, A_z^{(0)}, \mathbf{z}_u^{(0)}, \mathbf{v}^{(0)}, \mathbf{w}^{(0)}, \alpha_0^{(0)}, \lambda_0^{(0)}$.

2. Sample the model indices and corresponding parameters:
   (a) Draw simultaneously $A_z^* \sim q(A_z \mid \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{v}^{(t)}, M_{A_z})$, $\alpha_0^* \sim p(\alpha_0 \mid M_{A_z^*}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{v}^{(t)})$, and $\alpha_{r,A^*}^* \sim p(\alpha_{r,A} \mid M_{A_z^*}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{v}^{(t)})$.
   (b) Accept $(M_{A_z}^{(t+1)}, \alpha_0^{(t+1,1)}, \alpha_{r,A}^{(t+1,1)}) = (M_{A_z^*}, \alpha_0^*, \alpha_{r,A^*}^*)$ with probability
   \[
   \delta_z = \min\left(1,\; \frac{p(M_{A_z^*} \mid \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{v}^{(t)})}{p(M_{A_z^{(t)}} \mid \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{v}^{(t)})}\; \frac{q(A_z^{(t)} \mid \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{v}^{(t)}, M_{A_z^*})}{q(A_z^* \mid \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{v}^{(t)}, M_{A_z})}\right);
   \]
   otherwise let $(M_{A_z}^{(t+1)}, \alpha_0^{(t+1,1)}, \alpha_{r,A}^{(t+1,1)}) = (M_{A_z^{(t)}}, \alpha_0^{(t,2)}, \alpha_{r,A}^{(t,2)})$.
   (c) Draw simultaneously $A_y^* \sim q(A_y \mid \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{w}^{(t)}, M_{A_y})$, $\lambda_0^* \sim p(\lambda_0 \mid M_{A_y^*}, \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{w}^{(t)})$, and $\lambda_{r,A^*}^* \sim p(\lambda_{r,A} \mid M_{A_y^*}, \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{w}^{(t)})$.
   (d) Accept $(M_{A_y}^{(t+1)}, \lambda_0^{(t+1,1)}, \lambda_{r,A}^{(t+1,1)}) = (M_{A_y^*}, \lambda_0^*, \lambda_{r,A^*}^*)$ with probability
   \[
   \delta_y = \min\left(1,\; \frac{p(M_{A_y^*} \mid \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{w}^{(t)})}{p(M_{A_y^{(t)}} \mid \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{w}^{(t)})}\; \frac{q(A_y^{(t)} \mid \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{w}^{(t)}, M_{A_y^*})}{q(A_y^* \mid \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{w}^{(t)}, M_{A_y})}\right);
   \]
   otherwise let $(M_{A_y}^{(t+1)}, \lambda_0^{(t+1,1)}, \lambda_{r,A}^{(t+1,1)}) = (M_{A_y^{(t)}}, \lambda_0^{(t,2)}, \lambda_{r,A}^{(t,2)})$.

3. Sample base model parameters:
   (a) Draw $\alpha_0^{(t+1,2)} \sim p(\alpha_0 \mid M_{A_z^{(t+1)}}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{v}^{(t)})$.
   (b) Draw $\lambda_0^{(t+1,2)} \sim p(\lambda_0 \mid M_{A_y^{(t+1)}}, \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{w}^{(t)})$.

4. To improve mixing, resample the model coefficients that are not in the base model but are in $M_A$:
   (a) Draw $\alpha_{r,A}^{(t+1,2)} \sim p(\alpha_{r,A} \mid M_{A_z^{(t+1)}}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{v}^{(t)})$.
   (b) Draw $\lambda_{r,A}^{(t+1,2)} \sim p(\lambda_{r,A} \mid M_{A_y^{(t+1)}}, \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t)}, \mathbf{w}^{(t)})$.

5. Sample latent and missing (unobserved) variables:
   (a) Sample $\mathbf{z}_u^{(t+1)} \sim p(\mathbf{z}_u \mid M_{A_z^{(t+1)}}, \mathbf{y}, \alpha_{r,A}^{(t+1,2)}, \alpha_0^{(t+1,2)}, \lambda_{r,A}^{(t+1,2)}, \lambda_0^{(t+1,2)})$.
   (b) Sample $\mathbf{v}^{(t+1)} \sim p(\mathbf{v} \mid M_{A_z^{(t+1)}}, \mathbf{z}_o, \mathbf{z}_u^{(t+1)}, \alpha_{r,A}^{(t+1,2)}, \alpha_0^{(t+1,2)})$.
   (c) Sample $\mathbf{w}^{(t+1)} \sim p(\mathbf{w} \mid M_{A_y^{(t+1)}}, \mathbf{y}, \mathbf{z}_o, \mathbf{z}_u^{(t+1)}, \lambda_{r,A}^{(t+1,2)}, \lambda_0^{(t+1,2)})$.
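As an illustration of the proposal used in step 2 (equation 3–15), the following Python sketch (illustrative helper names; models are encoded as frozen sets of predictor indices, and `log_post` is any user-supplied function returning the log model posterior up to a constant) enumerates the one-predictor-away neighborhood L(M) and draws a candidate by mixing the locally renormalized posterior weights with a uniform draw.

```python
import numpy as np

def neighborhood(model, K):
    """All models reachable from `model` by adding or removing one of K predictors."""
    return [model ^ {k} for k in range(K)]      # symmetric difference toggles predictor k

def propose(model, K, log_post, rng):
    """Draw A* from q(A*|.) = 0.5 * (renormalized posterior over L(M)) + 0.5 * uniform."""
    nbrs = neighborhood(model, K)
    lp = np.array([log_post(m) for m in nbrs])
    w = np.exp(lp - lp.max())
    probs = 0.5 * w / w.sum() + 0.5 / len(nbrs)
    idx = rng.choice(len(nbrs), p=probs)
    return nbrs[idx], probs[idx]

# Usage sketch: candidate, q_forward = propose(frozenset({0, 3}), K=30, log_post=f, rng=np.random.default_rng(0))
```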

3.4 Alternative Formulation

Because the occupancy process is only partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections $\mathbf{y}$ and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, $\mathbf{z} = (\mathbf{z}_o', \mathbf{z}_u')'$, and integrating out the unobserved component, the model posterior for $M_A$ can be obtained as
\[
p(M_A \mid \mathbf{y}, \mathbf{z}_o) \propto E_{\mathbf{z}_u}\left[m(\mathbf{y}, \mathbf{z} \mid M_A)\right] \pi(M_A). \qquad (3\text{–}16)
\]
Data-augmenting the model in terms of latent normal variables à la Albert and Chib, the marginals of $\mathbf{z}$ and $\mathbf{y}$ for any model $\{M_y, M_z\} = M \in \mathcal{M}$ inside the expectation in equation 3–16 can be expressed in terms of the latent variables:
\[
m(\mathbf{y}, \mathbf{z} \mid M) = \int_{T(\mathbf{z})} \int_{T(\mathbf{y}, \mathbf{z})} m(\mathbf{w}, \mathbf{v} \mid M)\, d\mathbf{w}\, d\mathbf{v} = \left(\int_{T(\mathbf{z})} m(\mathbf{v} \mid M_z)\, d\mathbf{v}\right)\left(\int_{T(\mathbf{y}, \mathbf{z})} m(\mathbf{w} \mid M_y)\, d\mathbf{w}\right), \qquad (3\text{–}17)
\]
where $T(\mathbf{z})$ and $T(\mathbf{y}, \mathbf{z})$ denote the corresponding truncation regions for $\mathbf{v}$ and $\mathbf{w}$, which depend on the values taken by $\mathbf{z}$ and $\mathbf{y}$, and
\[
m(\mathbf{v} \mid M_z) = \int f(\mathbf{v} \mid \alpha, M_z)\, \pi(\alpha \mid M_z)\, d\alpha, \qquad (3\text{–}18)
\]
\[
m(\mathbf{w} \mid M_y) = \int f(\mathbf{w} \mid \lambda, M_y)\, \pi(\lambda \mid M_y)\, d\lambda. \qquad (3\text{–}19)
\]
The last equality in equation 3–17 is a consequence of the independence of the latent processes $\mathbf{v}$ and $\mathbf{w}$. Using expressions 3–18 and 3–19 allows one to embed this model selection problem in the classical normal linear regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno et al., 1998) for this problem. This approach is an extension of the one implemented in León-Novelo et al. (2012) for the simple probit regression problem.


Using this alternative approach, all that is left is to integrate $m(\mathbf{v} \mid M_A)$ and $m(\mathbf{w} \mid M_A)$ over their corresponding truncation regions $T(\mathbf{z})$ and $T(\mathbf{y}, \mathbf{z})$, which yields $m(\mathbf{y}, \mathbf{z} \mid M_A)$, and then to obtain the expectation with respect to the unobserved $\mathbf{z}$'s. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express $E[m(\mathbf{y}, \mathbf{z} \mid M_A)]$ as
\[
\begin{aligned}
E_{\mathbf{z}_u}\left[m(\mathbf{y}, \mathbf{z} \mid M_A)\right] &= E_{\mathbf{z}_u}\left[\left(\int_{T(\mathbf{z})} m(\mathbf{v} \mid M_{A_z})\, d\mathbf{v}\right)\left(\int_{T(\mathbf{y},\mathbf{z})} m(\mathbf{w} \mid M_{A_y})\, d\mathbf{w}\right)\right] \qquad (3\text{–}20)\\
&= E_{\mathbf{z}_u}\left[\left(\int_{T(\mathbf{z})}\!\int m(\mathbf{v} \mid M_{A_z}, \alpha_0)\, \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0\, d\mathbf{v}\right) \left(\int_{T(\mathbf{y},\mathbf{z})}\!\int m(\mathbf{w} \mid M_{A_y}, \lambda_0)\, \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\, d\mathbf{w}\right)\right]\\
&= E_{\mathbf{z}_u}\left[\int \underbrace{\left(\int_{T(\mathbf{z})} m(\mathbf{v} \mid M_{A_z}, \alpha_0)\, d\mathbf{v}\right)}_{g_1(T(\mathbf{z}) \mid M_{A_z}, \alpha_0)} \pi^{IP}(\alpha_0 \mid M_{A_z})\, d\alpha_0 \times \int \underbrace{\left(\int_{T(\mathbf{y},\mathbf{z})} m(\mathbf{w} \mid M_{A_y}, \lambda_0)\, d\mathbf{w}\right)}_{g_2(T(\mathbf{y},\mathbf{z}) \mid M_{A_y}, \lambda_0)} \pi^{IP}(\lambda_0 \mid M_{A_y})\, d\lambda_0\right]\\
&= c_0\, d_0 \int\!\int E_{\mathbf{z}_u}\left[g_1(T(\mathbf{z}) \mid M_{A_z}, \alpha_0)\, g_2(T(\mathbf{y},\mathbf{z}) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0,
\end{aligned}
\]
where the last equality follows from Fubini's theorem, since $m(\mathbf{v} \mid M_{A_z}, \alpha_0)$ and $m(\mathbf{w} \mid M_{A_y}, \lambda_0)$ are proper densities. From 3–20, the posterior odds are
\[
\frac{p(M_A \mid \mathbf{y}, \mathbf{z}_o)}{p(M_0 \mid \mathbf{y}, \mathbf{z}_o)} = \frac{\int\!\int E_{\mathbf{z}_u}\left[g_1(T(\mathbf{z}) \mid M_{A_z}, \alpha_0)\, g_2(T(\mathbf{y},\mathbf{z}) \mid M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}{\int\!\int E_{\mathbf{z}_u}\left[g_1(T(\mathbf{z}) \mid M_{0_z}, \alpha_0)\, g_2(T(\mathbf{y},\mathbf{z}) \mid M_{0_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}\; \frac{\pi(M_A)}{\pi(M_0)}. \qquad (3\text{–}21)
\]


3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component, and, finally, the amount of signal in the predictors for the detection component.

For each model component the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection components have, respectively, 30 and 20 predictors. Therefore, the model space contains $2^{30} \times 2^{20} \approx 1.12 \times 10^{15}$ candidate models.

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50, and 90 of the occupancy and detection probabilities match some pre-specified probabilities. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distributions of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels are associated with probabilities close to 0 or 1. Large spreads of the distributions of the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.

Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5 at small ($Q^z_{10} = 0.3, Q^z_{90} = 0.7$), intermediate ($Q^z_{10} = 0.2, Q^z_{90} = 0.8$), and large ($Q^z_{10} = 0.1, Q^z_{90} = 0.9$) distances. For the detection component, the model parameters are obtained to reflect detection probabilities concentrated about low values ($Q^y_{50} = 0.2$), intermediate values ($Q^y_{50} = 0.5$), and high values ($Q^y_{50} = 0.8$), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.
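One simple way to hit such quantile targets, sketched below in Python for a single standard-normal covariate on the probit scale (a deliberate simplification of the multi-predictor setup actually used, so the numbers are illustrative only), is to solve for the intercept from the target median and for the slope from a target upper quantile.

```python
from scipy.stats import norm

def probit_coefficients(q50, q90):
    """Intercept and slope so that, with x ~ N(0,1), Phi(b0 + b1*x) has the requested
    median and 90th percentile (the 10th percentile then follows by symmetry)."""
    b0 = norm.ppf(q50)                         # median of Phi(b0 + b1*x) is Phi(b0)
    b1 = (norm.ppf(q90) - b0) / norm.ppf(0.9)  # match the 90th percentile
    return b0, b1

# Example: the large-signal occupancy scenario (Q10, Q50, Q90) = (0.1, 0.5, 0.9)
b0, b1 = probit_coefficients(0.5, 0.9)         # -> b0 = 0.0, b1 = 1.0
```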


Table 3-1. Simulation control parameters for the occupancy model selector.

Parameter                         Values considered
N                                 50, 100
J                                 3, 5
(Q^z_10, Q^z_50, Q^z_90)          (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_10, Q^y_50, Q^y_90)          (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)

There are in total 36 scenarios; these result from crossing all the levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, with the assumed true models $M_{T_z} = \{1, x_2, x_{15}, x_{16}, x_{22}, x_{28}\}$ for the presence and $M_{T_y} = \{1, q_7, q_{10}, q_{12}, q_{17}\}$ for the detection, using the predictors included in the randomly generated data sets. In this context, 1 represents the intercept term. Throughout this section we refer to predictors included in the true models as true predictors and to those absent as false predictors. The selection procedure was conducted on each one of these data sets with two different priors on the model space: the uniform (equal probability) prior and a multiplicity-correcting prior.

The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor, and also through the five highest posterior probability models (HPMs). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as
\[
p(\text{predictor is included} \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v}) = \sum_{M \in \mathcal{M}} I_{(\text{predictor} \in M)}\; p(M \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v}, \mathcal{M}). \qquad (3\text{–}22)
\]
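Given posterior samples of the visited models, equation 3–22 is estimated by the fraction of posterior mass (or of MCMC visits) given to models containing each predictor; a minimal Python sketch (illustrative names, assuming numpy) follows.

```python
import numpy as np

def mpip(model_draws, K):
    """Marginal posterior inclusion probabilities from MCMC model draws.

    model_draws : iterable of sets of predictor indices (one set per MCMC iteration)
    K           : number of candidate predictors"""
    draws = list(model_draws)
    counts = np.zeros(K)
    for m in draws:
        for k in m:
            counts[k] += 1
    return counts / len(draws)

# Example: inclusion probabilities for 3 candidate predictors over 4 draws
print(mpip([{0, 2}, {0}, {0, 1}, {0, 2}], K=3))   # -> [1.0, 0.25, 0.5]
```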

In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of the marginal posterior inclusion probabilities between true and false predictors. Let $\tilde{\xi}$ and $\xi$ denote, respectively, a predictor in the true model $M_T$ and a predictor absent from $M_T$. We define the minimum MPIP odds between the probabilities of true and false predictors as
\[
\text{minOdds}_{MPIP} = \frac{\min_{\tilde{\xi} \in M_T}\, p(I_{\tilde{\xi}} = 1 \mid \tilde{\xi} \in M_T)}{\max_{\xi \notin M_T}\, p(I_{\xi} = 1 \mid \xi \notin M_T)}. \qquad (3\text{–}23)
\]
If the variable selection procedure adequately discriminates true and false predictors, minOdds_MPIP takes values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors

For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and the ones on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity-correcting prior are in red. In these figures, the MPIPs have been averaged over all data sets from the scenarios matching the condition indicated.

In Figure 3-1 we contrast the mean MPIPs of the predictors over all data sets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.

From these figures, three main results can be drawn: (1) the effect of the model prior is substantial; (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors; and (3) the separation between the MPIPs of true and false predictors is noticeably larger in the detection component.

Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIPs of false predictors toward 0. In the presence component, the MPIPs of the true predictors are shrunk substantially under the multiplicity prior; however, there remains a clear separation between true and false predictors. In contrast, in the detection component the MPIPs of true predictors remain relatively high (Figures 3-1 through 3-5).

[Figure 3-1. Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors. Panels: presence component (left) and detection component (right); y-axis: marginal inclusion probability.]


[Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors. Panels: presence component (left) and detection component (right); y-axis: marginal inclusion probability.]

[Figure 3-3. Predictor MPIP averaged over scenarios for the interaction between the number of sites and the number of surveys per site, using uniform (U) and multiplicity correction (MC) priors. Panels: presence component (left) and detection component (right); y-axis: marginal inclusion probability.]


[Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, for levels (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), and (0.1, 0.5, 0.9), using uniform (U) and multiplicity correction (MC) priors. Panels: presence component (left) and detection component (right); y-axis: marginal inclusion probability.]

[Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, for levels (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), and (0.1, 0.8, 0.9), using uniform (U) and multiplicity correction (MC) priors. Panels: presence component (left) and detection component (right); y-axis: marginal inclusion probability.]


In scenarios where more sites were surveyed, the separation between the MPIPs of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites has an effect on both components, given that every time a new site is included, covariate information is added to the design matrices of both the presence and the detection components.

On the other hand, increasing the number of surveys affects the MPIPs of predictors in the detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors of the presence component. This may appear counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection while leaving the design matrix for the presence unaltered. The small changes observed in the MPIPs of the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 at sites that, with fewer surveys, would only have 0-valued detections.

From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component, the MPIP is influenced by both the number of sites and the number of surveys; the influence of increasing the number of surveys is larger when a smaller number of sites is considered, and vice versa.

Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the detection component is affected: there is stronger discrimination between true and false predictors when the distribution has higher variability (Figure 3-4). This is consistent with intuition, since having the presence probabilities more concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.

Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between the MPIPs of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors and reduces those of false predictors (Figure 3-5).

Table 3-2. Comparison of average minOdds_MPIP under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors.

                                Sites                Surveys
Component   π(M)         N=50      N=100       J=3       J=5
Presence    Unif         1.12       1.31       1.19      1.24
            MC           3.20       8.46       4.20      6.74
Detection   Unif         2.03       2.64       2.11      2.57
            MC          21.15      32.46      21.39     32.52

Table 3-3. Comparison of average minOdds_MPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors.

                            (Q^z_10, Q^z_50, Q^z_90)                      (Q^y_10, Q^y_50, Q^y_90)
Component  π(M)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence   Unif        1.05          1.20          1.34            1.10          1.23          1.24
           MC          2.02          4.55          8.05            2.38          6.19          6.40
Detection  Unif        2.34          2.34          2.30            2.57          2.00          2.38
           MC         25.37         20.77         25.28           29.33         18.52         28.49

The separation between the MPIPs of true and false predictors is even more evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario the value of minOdds_MPIP (as defined in 3–23) was greater than 1, implying that, on average, even the lowest MPIP of a true predictor is higher than the maximum MPIP of a false predictor. In both components of the model, the minOdds_MPIP are markedly larger under the multiplicity correction prior, and they increase with the number of sites and with the number of surveys.

For the presence component, increasing the signal in the occupancy probabilities, or having the detection probabilities concentrated about higher values, has a positive and considerable effect on the magnitude of the odds. For the detection component these odds are particularly high, especially under the multiplicity correction prior; moreover, having the distribution of the detection probabilities centered about low or high values increases the minOdds_MPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model

Tables 3-4 through 3-7 show the number of true predictors that are included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages observed in these tables provide one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The strong shrinkage of the multiplicity correction prior allows only a few true predictors to be selected, but at the same time it prevents any false predictors from being included in the HPM. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but it also occurs to a lesser extent in the detection component.

Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms in the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                           True +              True −
Component  π(M)       N=50     N=100      N=50     N=100
Presence   Unif       0.57      0.63      0.51      0.55
           MC         0.06      0.13      1.00      1.00
Detection  Unif       0.77      0.85      0.87      0.93
           MC         0.49      0.70      1.00      1.00

Having more sites or surveys improves the inclusion of true predictors in, and the exclusion of false ones from, the HPM for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution of the occupancy probabilities is more spread out, the HPM includes more true predictors and fewer false ones in the presence component; in contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence component. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).

Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                           True +              True −
Component  π(M)        J=3      J=5        J=3      J=5
Presence   Unif       0.59      0.61      0.52      0.54
           MC         0.08      0.10      1.00      1.00
Detection  Unif       0.78      0.85      0.87      0.92
           MC         0.50      0.68      1.00      1.00

Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                True +                                 True −
Component  π(M)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)
Presence   Unif        0.55          0.61          0.64            0.50          0.54          0.55
           MC          0.02          0.08          0.18            1.00          1.00          1.00
Detection  Unif        0.81          0.82          0.81            0.90          0.89          0.89
           MC          0.57          0.61          0.59            1.00          1.00          1.00

Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                True +                                 True −
Component  π(M)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence   Unif        0.59          0.59          0.62            0.51          0.54          0.54
           MC          0.06          0.10          0.11            1.00          1.00          1.00
Detection  Unif        0.89          0.77          0.78            0.91          0.87          0.91
           MC          0.70          0.48          0.59            1.00          1.00          1.00

3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,

there is large variation in its topography and physio-geography; as such, elevation is a good candidate covariate to predict species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, temperature, and some biotic factors (Kéry et al., 2010).

Repeated visits to 1-ha pixels took place to obtain the corresponding detection histories. In addition to the survey outcome, the x- and y-coordinates, thermal level, date of the survey, and elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranges from 1 to 22 within each survey year.

Kéry et al. (2010) summarize the results of this effort using AIC-based model comparisons: first following a backwards elimination approach for the detection process while keeping the occupancy component fixed at the most complex model, and then, for the presence component, choosing among a group of three models while using the detection model already selected. In our analysis of this dataset, for the detection and the presence we consider as the full models those used in Kéry et al. (2010), namely
\[
\begin{aligned}
\Phi^{-1}(\psi) &= \alpha_0 + \alpha_1\,\text{year} + \alpha_2\,\text{elev} + \alpha_3\,\text{elev}^2 + \alpha_4\,\text{elev}^3\\
\Phi^{-1}(p) &= \lambda_0 + \lambda_1\,\text{year} + \lambda_2\,\text{elev} + \lambda_3\,\text{elev}^2 + \lambda_4\,\text{elev}^3 + \lambda_5\,\text{date} + \lambda_6\,\text{date}^2,
\end{aligned}
\]

where year $= I_{\{\text{year} = 2000\}}$.
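As an illustration of how these two full-model design matrices could be assembled (a sketch with hypothetical variable names; the actual data handling used in the dissertation's software may differ, and survey-level dates are simplified here to a single value per site), consider:

```python
import numpy as np

def full_design_matrices(year, elev, date):
    """Full-model design matrices for the blue hawker analysis (illustrative only).

    year : indicator I(year == 2000); elev : site elevation; date : survey date."""
    X_full = np.column_stack([np.ones_like(elev), year, elev, elev**2, elev**3])
    Q_full = np.column_stack([np.ones_like(elev), year, elev, elev**2, elev**3,
                              date, date**2])
    return X_full, Q_full
```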

The model spaces for these data contain $2^6 = 64$ and $2^4 = 16$ models, respectively, for the detection and occupancy components; that is, in total the model space contains $2^{4+6} = 1{,}024$ models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each one of the models sampled was chosen from the set of models that could be reached by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space because, for each of the 10,000 models drawn, the posterior probabilities of many more models can be observed. Below, the labels for the predictors are followed by either "z" or "y", accordingly, to indicate the component they pertain to. Finally, using the results from the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kéry et al. (2010).

3.6.1 Results: Variable Selection Procedure

The model finally chosen for the presence component in Kéry et al. (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might correspond to a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability for the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior also concentrates the model posterior probability mass more densely in the highest ranked models (90% of the mass is in the top five models) than the uniform prior does (whose top five models account for 40% of the mass).

For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases, this model obtains very


Table 3-8. Posterior probability for the five highest probability models in the presence
component of the blue hawker data.

Uniform model prior                        Multiplicity correcting model prior
Rank  Mz selected            p(Mz|y)       Rank  Mz selected              p(Mz|y)
1     yrz+elevz              0.10          1     elevz+elevz3             0.53
2     yrz+elevz+elevz3       0.08          2                              0.15
3     elevz2+elevz3          0.08          3     elevz+elevz2             0.09
4     yrz+elevz2             0.07          4     elevz2                   0.06
5     yrz+elevz3             0.07          5     elevz+elevz2+elevz3      0.05

high posterior probabilities. The terms contained in the cubic polynomial for the elevation appear to contain some relevant information; however, this conflicts with the MPIPs observed in Table 3-11, which under both model priors are relatively low (< 20% with the uniform prior and ≤ 4% with the multiplicity correcting prior).

Table 3-9. Posterior probability for the five highest probability models in the detection
component of the blue hawker data.

Uniform model prior                  Multiplicity correcting model prior
Rank  My selected   p(My|y)          Rank  My selected   p(My|y)
1                   0.45             1                   0.86
2     elevy3        0.06             2     elevy3        0.02
3     elevy2        0.05             3     datey2        0.02
4     elevy         0.05             4     elevy2        0.02
5     yry           0.04             5     yry           0.02

Finally, it is possible to use the MPIPs to obtain the median probability model, which contains the terms that have a MPIP higher than 50%. For the occupancy process (Table 3-10), under the uniform prior the year, the elevation, and the elevation cubed are included. The MPM with the multiplicity correction prior coincides with the HPM under this prior. The MPM chosen for the detection component (Table 3-11) under both priors is the intercept-only model, coinciding again with the HPM.

Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belong to the true model, especially when considering


Table 3-10. MPIP, presence component.

Predictor   p(predictor ∈ MTz | y, z, w, v)
            Unif    MultCorr
yrz         0.53    0.07
elevz       0.51    0.73
elevz2      0.45    0.23
elevz3      0.50    0.67

Table 3-11. MPIP, detection component.

Predictor   p(predictor ∈ MTy | y, z, w, v)
            Unif    MultCorr
yry         0.19    0.03
elevy       0.18    0.03
elevy2      0.18    0.03
elevy3      0.19    0.04
datey       0.16    0.03
datey2      0.15    0.04

those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10), there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure

Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation) and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component using the multiplicity correction prior.

Therefore, in the validation procedure we observe the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM), that is, the model including only predictors with a MPIP larger than 50%, with the uniform prior (yrz+elevz+elevz3); and, finally, (5) the MPM with a multiplicity correction prior (elevz+elevz3, the same as the HPM with multiplicity correction).
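The validation bookkeeping itself is simple; the following is a small sketch, assuming y_obs holds the detection outcomes of the hold-out surveys and y_hat the corresponding 0/1 predictions from one of the fitted models. Variable names are illustrative; only the rates of the kind reported in Table 3-12 are computed.

    import numpy as np

    def misclassification_rates(y_obs, y_hat):
        """Error rates for true 1's, true 0's, and jointly, as in Table 3-12."""
        y_obs, y_hat = np.asarray(y_obs), np.asarray(y_hat)
        err_one = np.mean(y_hat[y_obs == 1] != 1)    # share of true 1's predicted as 0
        err_zero = np.mean(y_hat[y_obs == 0] != 0)   # share of true 0's predicted as 1
        joint = np.mean(y_hat != y_obs)
        return err_one, err_zero, joint

    # Example with illustrative data only:
    # err1, err0, joint = misclassification_rates([1, 0, 0, 1], [0, 0, 1, 1])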

We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training dataset. On the other hand, the model in Kery et al. (2010) was chosen to minimize the prediction error of the complete data.


Because this model was obtained from the full dataset, results derived from it can only be considered as a lower bound for the prediction errors. The benchmark misclassification error rate for true 1's is high (close to 70%). However, the misclassification rate for true 0's, which account for most of the responses, is less pronounced (15%). Overall, the performance of the selected models is comparable: they yield considerably worse results than the benchmark for the true 1's, but achieve rates close to the benchmark for the true zeros. Pooling together the results for true ones and true zeros, the selected models with either prior have misclassification rates close to 30%. The benchmark model performs comparably, with a joint misclassification error of 23% (Table 3-12).

Table 3-12. Mean misclassification rate for HPMs and MPMs using uniform and
multiplicity correction model priors.

Model                                                               True 1  True 0  Joint
Benchmark (Kery et al. 2010)  yrz+elevz+elevz2+elevz3                0.66    0.15    0.23
                              + elevy+elevy2+datey+datey2
HPM Unif                      yrz+elevz                              0.83    0.17    0.28
HPM / MPM MC                  elevz+elevz3                           0.82    0.18    0.28
MPM Unif                      yrz+elevz+elevz3                       0.82    0.18    0.29

3.7 Discussion

In this Chapter we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. The methodology is said to be fully automatic because no hyper-parameter specification is necessary in defining the parameter priors, and objective because it relies on intrinsic priors derived from noninformative priors. The intrinsic priors have been shown to have desirable properties as testing priors. We also proposed a fast stochastic search algorithm to explore large model spaces using our model selection procedure.

Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering their marginal posterior inclusion probabilities: for predictors in the true model, these probabilities were comparatively larger than those for predictors absent from it. The simulations also indicated that the method has greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.

Multiplicity correction priors were not described in detail in this Chapter; however, their influence on the selection outcome is significant. This behavior was observed both in the simulation experiment and in the analysis of the blue hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next Chapter.

Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4; hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is developing methods for occupancy variable selection in a multivariate setting. This can be used to conduct hypothesis testing in scenarios with conditions that vary through time, or in the case where multiple species are co-observed. A final variation we will investigate for this problem is occupancy model selection incorporating random effects.


CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.

–Sherlock Holmes, A Case of Identity

4.1 Introduction

In regression problems, if a large number of potential predictors is available, the complete model space is too large to enumerate and automatic selection algorithms are necessary to find informative, parsimonious models. This multiple testing problem is difficult, and even more so when interactions or powers of the predictors are considered. In the ecological literature, models with interactions and/or higher-order polynomial terms are ubiquitous (Johnson et al. 2013; Kery et al. 2010; Zeller et al. 2011), given the complexity and non-linearities found in ecological processes. Several model selection procedures, even in the classical normal linear setting, fail to address two fundamental issues: (1) the model selection outcome is not invariant to affine transformations when interactions or polynomial structures are found among the predictors, and (2) additional penalization is required to control for false positives as the model space grows (i.e., as more covariates are considered).

These two issues motivate the methods developed throughout this Chapter. Building on the results of Chipman (1996), we propose, investigate, and provide recommendations for three different prior distributions on the model space. These priors help control for test multiplicity while accounting for polynomial structure in the predictors. They improve upon those proposed by Chipman, first by avoiding the need to specify values for the prior inclusion probabilities of the predictors, and second by formulating principled alternatives to introduce additional structure into the model


priors. Finally, we design a stochastic search algorithm that allows fast and thorough exploration of model spaces with polynomial structure.

Having structure in the predictors can determine the selection outcome. As an illustration, consider the model E[y] = β_(0,0) + β_(0,1)x2 + β_(2,0)x1², where the order-one term x1 is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming x1 ↦ x1* = x1 + c for some c ≠ 0, the model becomes E[y] = β_(0,0) + β_(0,1)x2 + β*_(2,0)x1*². Note that, in terms of the original predictors, x1*² = x1² + 2c·x1 + c², implying that this seemingly innocuous transformation of x1 modifies the column space of the design matrix by including x1, which was not in the original model. That is, when lower-order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al. 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000; Peixoto 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower-order predictor associated with it is also in the model. The model above is not well-formulated, as it contains x1² but not x1.

WFMs exhibit strong heredity, in that all lower-order terms dividing higher-order terms in the model must also be included. An alternative is to require only weak heredity (Chipman 1996), which forces only some of the lower-order terms in the corresponding polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the conditions under which weak heredity allows the design matrix to be invariant to affine transformations of the predictors are too restrictive to be useful in practice.


Although this topic appeared in the literature more than three decades ago (Nelder 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al. 2009; Peixoto 1987), regularized regression methods (Bien et al. 2013; Yuan et al. 2009), and Bayesian approaches (Chipman 1996). The methods introduced in this Chapter take a Bayesian approach towards variable selection for well-formulated models, with particular emphasis on model priors.

As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probability within a pre-specified model space ℳ. The model posterior probability for M ∈ ℳ is given by

$p(M \mid \mathbf{y}, \mathcal{M}) \propto m(\mathbf{y} \mid M)\,\pi(M \mid \mathcal{M}).$   (4–1)

Model posterior probabilities depend on the prior distribution on the model space, as well as on the prior distributions for the model-specific parameters, implicitly through the marginals m(y | M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger & Pericchi 1996; Berger et al. 2001; George 2000; Jeffreys 1961; Kass & Wasserman 1996; Liang et al. 2008; Zellner & Siow 1980). In contrast, the effect of the prior on the model space has until recently been neglected. A few authors (e.g., Casella et al. (2014), Scott & Berger (2010), Wilson et al. (2010)) have highlighted the relevance of priors on the model space in the context of multiple testing. Adequately formulating priors on the model space can both account for structure in the predictors and provide additional control on the detection of false positive terms. In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the


total number of covariates), since this is the most abundant model size contained in the model space.

Variable selection within the space of well-formulated polynomial models poses two challenges for automatic objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.

Models without a polynomial hierarchy constitute a special case of WFMs in which all predictors are of order one; hence, all the methods developed throughout this Chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, they are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.

In this Chapter, we first provide the necessary definitions to characterize the well-formulated model selection problem. We then introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model-specific parameters, though this assumption can be relaxed to use other mixtures of g-priors. Finally, we implement our procedures using both simulated and real data.


4.2 Setup for Well-Formulated Models

Suppose that the observations y_i are modeled using the polynomial regression on the covariates x_i1, ..., x_ip given by

$y_i = \sum_{\alpha} \beta_{(\alpha_1,\ldots,\alpha_p)} \prod_{j=1}^{p} x_{ij}^{\alpha_j} + \epsilon_i,$   (4–2)

where α = (α_1, ..., α_p) belongs to N_0^p, the p-dimensional space of natural numbers including 0, with ε_i iid ~ N(0, σ²), and only finitely many β_α are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates x_i1 and x_i2 only. The terms x_i2² and x_i1²x_i2 can be represented by α = (0, 2) and α = (2, 1), respectively.

The notation y = Z(X)β + ε is used to denote that the observed response y = (y_1, ..., y_n)′ is modeled via a polynomial function Z of the original covariates contained in X = (x_1, ..., x_p) (where x_j = (x_1j, ..., x_nj)′), and the coefficients of the polynomial terms are given by β. A specific polynomial model M is defined by the set of coefficients β_α that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices α ∈ N_0^p. In particular, model M is specified by M = {α^M_1, ..., α^M_|M|} for α^M_k ∈ N_0^p, where β_α = 0 for α ∉ M.

Any particular model M uses a subset X_M of the original covariates X to form the polynomial terms in the design matrix Z_M(X). Without ambiguity, a polynomial model Z_M(X) on X can be identified with a polynomial model Z_M(X_M) on the covariates X_M. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of Z_M(X_M). The coefficient vector and error variance of the model M are denoted by β_M and σ²_M, respectively. Thus, M models the data as y = Z_M(X_M)β_M + ε_M, where ε_M ~ N(0, Iσ²_M). Model M is said to be nested in model M′ if M ⊂ M′. M models the response in two distinct ways: choosing the set of meaningful covariates X_M, as well as choosing the polynomial structure of these covariates, Z_M(X_M).


The set N_0^p constitutes a partially ordered set or, more succinctly, a poset. A poset is a set partially ordered through a binary relation "≼". In this context, the binary relation on the poset N_0^p is defined between pairs (α, α′) by α′ ≼ α whenever α_j ≥ α′_j for all j = 1, ..., p, with α′ ≺ α if, additionally, α_j > α′_j for some j. The order of a term α ∈ N_0^p is given by the sum of its elements, order(α) = Σ_j α_j. When order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by P(α) = {α′ ∈ N_0^p : α′ → α}, the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated model using x_i1²x_i2 to model y_i must also include the parent terms x_i1x_i2 and x_i1², their corresponding parent terms x_i1 and x_i2, and the intercept term 1.

The poset N_0^p can be represented by a Directed Acyclic Graph (DAG). Without ambiguity, we can identify the nodes of the graph, α ∈ N_0^p, with terms in the set of covariates. The graph has directed edges to a node from its parents. Any well-formulated model M is represented by a subgraph of the DAG with the property that if a node α belongs to it, then the nodes corresponding to P(α) also belong to it. Figure 4-1 shows examples of well-formulated polynomial models, where α ∈ N_0^p is identified with ∏_{j=1}^p x_j^{α_j}.

The motivation for considering only well-formulated polynomial models is compelling. Let Z_M be the design matrix associated with a polynomial model. The subspace of y modeled by Z_M, given by the hat matrix H_M = Z_M(Z′_M Z_M)^{-1} Z′_M, is invariant to affine transformations of the matrix X_M if and only if M corresponds to a well-formulated polynomial model (Peixoto 1990).


Figure 4-1. Graphs of well-formulated polynomial models for p = 2 (panels A and B).

For example, if p = 2 and y_i = β_(0,0) + β_(1,0)x_i1 + β_(0,1)x_i2 + β_(1,1)x_i1x_i2 + ε_i, then the hat matrix is invariant to any covariate transformation of the form A(x_i1, x_i2)′ + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two. In contrast, if y_i = β_(0,0) + β_(2,0)x_i1² + ε_i, then the hat matrix formed after applying the transformation x_i1 ↦ x_i1 + c, for real c ≠ 0, is not the same as the hat matrix formed from the original x_i1.
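This lack of invariance is easy to verify numerically. The sketch below (illustrative data only, not from the dissertation) compares the hat matrix of the non-well-formulated model {1, x1²} before and after shifting x1, and contrasts it with the well-formulated model {1, x1, x1²}.

    import numpy as np

    def hat(Z):
        # Projection matrix onto the column space of Z
        return Z @ np.linalg.solve(Z.T @ Z, Z.T)

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=20)
    c = 3.0                                   # any nonzero shift

    # Non-well-formulated model {1, x1^2}: hat matrix changes under x1 -> x1 + c
    Z_bad = np.column_stack([np.ones(20), x1 ** 2])
    Z_bad_shift = np.column_stack([np.ones(20), (x1 + c) ** 2])
    print(np.allclose(hat(Z_bad), hat(Z_bad_shift)))      # False

    # Well-formulated model {1, x1, x1^2}: hat matrix is unchanged
    Z_ok = np.column_stack([np.ones(20), x1, x1 ** 2])
    Z_ok_shift = np.column_stack([np.ones(20), x1 + c, (x1 + c) ** 2])
    print(np.allclose(hat(Z_ok), hat(Z_ok_shift)))        # True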

4.2.1 Well-Formulated Model Spaces

The spaces of WFMs considered in this paper can be characterized in terms of two WFMs: MB, the base model, and MF, the full model. The base model contains at least the intercept term and is nested in the full model. The model space ℳ is populated by all well-formulated models M that nest MB and are nested in MF:

ℳ = {M : MB ⊆ M ⊆ MF and M is well-formulated}.

For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, any M ∈ ℳ can be uniquely identified by two different sets of nodes in MF: the set of extreme nodes and the set of children nodes. For M ∈ ℳ,


the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

$E(M) = \{\alpha \in M \setminus M_B : \alpha \notin P(\alpha')\ \forall\, \alpha' \in M\}$

$C(M) = \{\alpha \in M_F \setminus M : \{\alpha\} \cup M \text{ is well-formulated}\}.$

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in ℳ. The children nodes are those nodes that, when added to M, give rise to a WFM in ℳ. Because MB ⊆ M for all M ∈ ℳ, the set of nodes E(M) ∪ MB determines M by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(MF) : α ⋠ α′ for all α ∈ C(M)}, which contains E(M) ∪ MB and thus uniquely identifies M.
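These poset operations translate directly into code. The sketch below, with function names of our choosing, represents each term α as a tuple of exponents and computes parent sets, the well-formulation check, and the sets E(M) and C(M) for the p = 2 quadratic example used in Figure 4-2.

    def order(alpha):
        return sum(alpha)

    def parents(alpha):
        """P(alpha): terms obtained by lowering one positive exponent by one."""
        return [alpha[:j] + (a - 1,) + alpha[j + 1:]
                for j, a in enumerate(alpha) if a > 0]

    def is_well_formulated(model):
        """alpha in M implies P(alpha) is a subset of M (the intercept has no parents)."""
        return all(p in model for a in model for p in parents(a))

    def extreme_nodes(model, base):
        """E(M): nodes whose removal keeps the model well-formulated."""
        return {a for a in model - base
                if all(a not in parents(b) for b in model)}

    def children_nodes(model, full):
        """C(M): nodes of MF whose addition keeps the model well-formulated."""
        return {a for a in full - model if set(parents(a)) <= model}

    # The p = 2 quadratic example of Figure 4-2 (exponent tuples (a1, a2)):
    full = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
    base = {(0, 0)}
    M = {(0, 0), (1, 0), (2, 0)}                  # the model {1, x1, x1^2}
    print(is_well_formulated(M))                  # True
    print(extreme_nodes(M, base))                 # {(2, 0)}, i.e. x1^2
    print(children_nodes(M, full))                # {(0, 1)}, i.e. x2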

Figure 4-2. A) Extreme node set. B) Children node set. Nodes shown in each panel: 1, x1, x2, x1², x1x2, x2².

In Figure 4-2, the extreme and children sets for the model M = {1, x1, x1²} are shown for the model space characterized by MF = {1, x1, x2, x1², x1x2, x2²}. In Figure 4-2A, the solid nodes represent nodes α ∈ M ∖ E(M), the dashed node corresponds to α ∈ E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).

4.3 Priors on the Model Space

As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct


for multiple testing. This penalization acts against more complex models, but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities π(M | ℳ).

In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combination affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition

The graphical structure of the model space suggests a method for prior construction on ℳ guided by the notion of inheritance. A node α is said to inherit from a node α′ if there is a directed path from α′ to α in the graph of MF. The inheritance is said to be immediate if order(α) = order(α′) + 1 (equivalently, if α′ ∈ P(α), or if α′ immediately precedes α).

For convenience, define Γ(M) = M ∖ MB to be the set of nodes in M that are not in the base model MB. For α ∈ Γ(MF), let γ_α(M) be the indicator function describing whether α is included in M, i.e., γ_α(M) = I(α ∈ M). Denote by γ_ν(M) the set of indicators of inclusion in M for all order-ν nodes in Γ(MF). Finally, let γ_{<ν}(M) = ∪_{j=0}^{ν−1} γ_j(M) be the set of indicators of inclusion in M for all nodes in Γ(MF) of order less than ν. With these definitions, the prior probability of any model M ∈ ℳ can be factored as

$\pi(M \mid \mathcal{M}) = \prod_{j=J_{\min}}^{J_{\max}} \pi\big(\gamma_j(M) \mid \gamma_{<j}(M), \mathcal{M}\big),$   (4–3)

where J_min and J_max are, respectively, the minimum and maximum order of nodes in Γ(MF), and π(γ_{J_min}(M) | γ_{<J_min}(M), ℳ) = π(γ_{J_min}(M) | ℳ).


Prior distributions on ℳ can be simplified by making two assumptions. First, if order(α) = order(α′) = j, then γ_α and γ_α′ are assumed to be conditionally independent given γ_{<j}, denoted by γ_α ⊥⊥ γ_α′ | γ_{<j}. Second, immediate inheritance is invoked, and it is assumed that if order(α) = j, then γ_α(M) | γ_{<j}(M) = γ_α(M) | γ_{P(α)}(M), where γ_{P(α)}(M) is the inclusion indicator for the set of parent nodes of α. This indicator is one if the complete parent set of α is contained in M and zero otherwise.

In Figure 4-3, these two assumptions are depicted with MF being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x1², x2², and x1x2 are independent when conditioned on all the lower-order terms. In this same space, immediate inheritance implies that the inclusion of x1², conditioned on the inclusion of all lower-order nodes, is equivalent to conditioning it on its parent set (x1 in this case).

Figure 4-3. A) Conditional independence: x1² ⊥⊥ x1x2 ⊥⊥ x2², given {1, x1, x2}. B) Immediate inheritance: x1² | {1, x1, x2} = x1² | x1.

Denote the conditional inclusion probability of node α in model M by π_α = π(γ_α(M) = 1 | γ_{P(α)}(M), ℳ). Under the assumptions of conditional independence


and immediate inheritance, the prior probability of M is

$\pi(M \mid \boldsymbol{\pi}_M, \mathcal{M}) = \prod_{\alpha \in \Gamma(M_F)} \pi_\alpha^{\gamma_\alpha(M)} (1 - \pi_\alpha)^{1 - \gamma_\alpha(M)},$   (4–4)

with π_M = {π_α : α ∈ Γ(MF)}. Because M must be well-formulated, π_α = γ_α = 0 if γ_{P(α)}(M) = 0. Thus, the product in 4–4 can be restricted to the set of nodes α ∈ Γ(M) ∪ C(M). Additional structure can be built into the prior on ℳ by making

assumptions about the inclusion probabilities π_α, such as equality constraints or a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors to π_M that assume some structure among its elements, and then marginalizing out π_M.

Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero π_α are all equal. Specifically, for a model M ∈ ℳ, it is assumed that π_α = π for all α ∈ Γ(M) ∪ C(M). The Bayesian specification of the HUP is completed by assuming a prior distribution for π. The choice π ~ Beta(a, b) produces

$\pi_{HUP}(M \mid \mathcal{M}, a, b) = \frac{B(|\Gamma(M)| + a,\ |C(M)| + b)}{B(a, b)},$   (4–5)

where B is the beta function. Setting a = b = 1 gives the particular value

$\pi_{HUP}(M \mid \mathcal{M}, a = 1, b = 1) = \frac{1}{|\Gamma(M)| + |C(M)| + 1} \binom{|\Gamma(M)| + |C(M)|}{|\Gamma(M)|}^{-1}.$   (4–6)

The HUP assigns equal probabilities to all models for which the sets of nodes Γ(M) and C(M) have the same cardinality. This prior provides a combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because π_α = π for all α, this penalization can only depend on some aspect of the entire graph of MF, such as the total number of nodes not in the null model, |Γ(MF)|.


Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero π_α. Each non-zero π_α is given its own prior, which is assumed to be a Beta distribution with parameters a_α and b_α. Thus, the prior probability of M under the HIP is

$\pi_{HIP}(M \mid \mathcal{M}, \mathbf{a}, \mathbf{b}) = \prod_{\alpha \in \Gamma(M)} \frac{a_\alpha}{a_\alpha + b_\alpha} \prod_{\alpha \in C(M)} \frac{b_\alpha}{a_\alpha + b_\alpha},$   (4–7)

where a product over the empty set is taken to be 1. Because the π_α are totally independent, any choice of a_α and b_α is equivalent to choosing a probability of success π_α for a given α. Setting a_α = b_α = 1 for all α ∈ Γ(M) ∪ C(M) gives the particular value

$\pi_{HIP}(M \mid \mathcal{M}, \mathbf{a} = \mathbf{1}, \mathbf{b} = \mathbf{1}) = \left(\tfrac{1}{2}\right)^{|\Gamma(M)| + |C(M)|}.$   (4–8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it provides essentially no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4–8 is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each γ_j is conditioned on γ_{<j} in the prior construction, the a_α and b_α for α of order j can be conditioned on γ_{<j}. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM conditioned on the inclusion vector γ_{<j}, which is denoted by ch_j(γ_{<j}). Choosing a_α = 1 and b_α(M) = ch_j(γ_{<j}) is equivalent to choosing a probability of success π_α = 1/ch_j(γ_{<j}). This penalization can drive down the false positive rate when ch_j(γ_{<j}) is large, but may produce more false negatives.

Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the π_α is to assume equality between the π_α of a given order and independence across the different orders. Define Γ_j(M) = {α ∈ Γ(M) : order(α) = j} and C_j(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that π_α = π_j for all α ∈ Γ_j(M) ∪ C_j(M). Assuming that π_j ~ Beta(a_j, b_j) provides the prior probability

$\pi_{HOP}(M \mid \mathcal{M}, \mathbf{a}, \mathbf{b}) = \prod_{j=J_{\min}}^{J_{\max}} \frac{B(|\Gamma_j(M)| + a_j,\ |C_j(M)| + b_j)}{B(a_j, b_j)}.$   (4–9)

The specific choice a_j = b_j = 1 for all j gives

$\pi_{HOP}(M \mid \mathcal{M}, \mathbf{a} = \mathbf{1}, \mathbf{b} = \mathbf{1}) = \prod_{j} \left[\frac{1}{|\Gamma_j(M)| + |C_j(M)| + 1} \binom{|\Gamma_j(M)| + |C_j(M)|}{|\Gamma_j(M)|}^{-1}\right],$   (4–10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.

The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on γ_{<j}(M), the indicators {γ_α : α ∈ Γ_j(M) ∪ C_j(M)} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with a common probability of success π_j that has some prior distribution; our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of nodes of order j that could be added while maintaining a WFM is given by ch_j(M) = ch_j(γ_{<j}(M)) = |Γ_j(M) ∪ C_j(M)|.

Using a_j = 1 and b_j(M) = ch_j(M) produces a prior with two desirable properties. First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., ch_j(M) − 1.
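As a numerical check on Equations 4–5, 4–7, and 4–9, the sketch below evaluates the three priors for one model of the quadratic example in Figure 4-4 under the default choice a = b = 1. The helper functions mirror the poset sketch of Section 4.2, and all names are ours rather than the dissertation's software.

    from math import lgamma, exp

    def parents(alpha):
        return [alpha[:j] + (a - 1,) + alpha[j + 1:]
                for j, a in enumerate(alpha) if a > 0]

    def children_nodes(model, full):
        return {a for a in full - model if set(parents(a)) <= model}

    def order(alpha):
        return sum(alpha)

    def log_beta(a, b):
        return lgamma(a) + lgamma(b) - lgamma(a + b)

    def hup_prob(model, base, full, a=1.0, b=1.0):
        """Eq. 4-5: a single inclusion probability pi ~ Beta(a, b) for all nodes."""
        g, c = len(model - base), len(children_nodes(model, full))
        return exp(log_beta(g + a, c + b) - log_beta(a, b))

    def hip_prob(model, base, full, a=1.0, b=1.0):
        """Eq. 4-7 with a_alpha = a and b_alpha = b for every node."""
        g, c = len(model - base), len(children_nodes(model, full))
        return (a / (a + b)) ** g * (b / (a + b)) ** c

    def hop_prob(model, base, full, a=1.0, b=1.0):
        """Eq. 4-9: equal inclusion probabilities within each order, independent across orders."""
        gamma, kids = model - base, children_nodes(model, full)
        prob = 1.0
        for j in {order(x) for x in gamma | kids}:
            gj = sum(order(x) == j for x in gamma)
            cj = sum(order(x) == j for x in kids)
            prob *= exp(log_beta(gj + a, cj + b) - log_beta(a, b))
        return prob

    # Model 7 of Figure 4-4, M = {1, x1, x2, x1^2}, in the quadratic space on (x1, x2):
    full = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
    base = {(0, 0)}
    M = {(0, 0), (1, 0), (0, 1), (2, 0)}
    print(hip_prob(M, base, full))   # 1/32  = 0.03125
    print(hup_prob(M, base, full))   # 1/60  ~ 0.01667
    print(hop_prob(M, base, full))   # 1/36  ~ 0.02778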

4.3.2 Choice of Prior Structure and Hyper-Parameters

Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities π_M. For the sake of simplicity, this paper focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters that can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by b_j(M) = b_α(M) = ch_j(M) = |Γ_j(M) ∪ C_j(M)| for the HOP and HIP, where j = order(α), while b = ch denotes b = |Γ(MF)| for the HUP. The prior behavior is illustrated for two model spaces; in both cases, the base model MB is taken to be the intercept-only model and MF is the DAG shown (Figures 4-4 and 4-5). The priors considered treat model complexity differently, and some general properties can be seen in these examples.

    Model                             HIP              HOP              HUP
                                 (1,1)   (1,ch)   (1,1)   (1,ch)   (1,1)   (1,ch)
1   1                            1/4     4/9      1/3     1/2      1/3     5/7
2   1, x1                        1/8     1/9      1/12    1/12     1/12    5/56
3   1, x2                        1/8     1/9      1/12    1/12     1/12    5/56
4   1, x1, x1²                   1/8     1/9      1/12    1/12     1/12    5/168
5   1, x2, x2²                   1/8     1/9      1/12    1/12     1/12    5/168
6   1, x1, x2                    1/32    3/64     1/12    1/12     1/60    1/72
7   1, x1, x2, x1²               1/32    1/64     1/36    1/60     1/60    1/168
8   1, x1, x2, x1x2              1/32    1/64     1/36    1/60     1/60    1/168
9   1, x1, x2, x2²               1/32    1/64     1/36    1/60     1/60    1/168
10  1, x1, x2, x1², x1x2         1/32    1/192    1/36    1/120    1/30    1/252
11  1, x1, x2, x1², x2²          1/32    1/192    1/36    1/120    1/30    1/252
12  1, x1, x2, x1x2, x2²         1/32    1/192    1/36    1/120    1/30    1/252
13  1, x1, x2, x1², x1x2, x2²    1/32    1/576    1/12    1/120    1/6     1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated with the
quadratic surface in two variables, where MB is taken to be the intercept-only model and
(a, b) ∈ {(1, 1), (1, ch)}.

First, contrast the HIP, HUP, and HOP for the choice (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2, models 6 through 13, are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the


    Model                        HIP              HOP              HUP
                            (1,1)   (1,ch)   (1,1)   (1,ch)   (1,1)   (1,ch)
1   1                       1/8     27/64    1/4     1/2      1/4     4/7
2   1, x1                   1/8     9/64     1/12    1/10     1/12    2/21
3   1, x2                   1/8     9/64     1/12    1/10     1/12    2/21
4   1, x3                   1/8     9/64     1/12    1/10     1/12    2/21
5   1, x1, x3               1/8     3/64     1/12    1/20     1/12    4/105
6   1, x2, x3               1/8     3/64     1/12    1/20     1/12    4/105
7   1, x1, x2               1/16    3/128    1/24    1/40     1/30    1/42
8   1, x1, x2, x1x2         1/16    3/128    1/24    1/40     1/20    1/70
9   1, x1, x2, x3           1/16    1/128    1/8     1/40     1/20    1/70
10  1, x1, x2, x3, x1x2     1/16    1/128    1/8     1/40     1/5     1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated with
three main effects and one interaction term, where MB is taken to be the intercept-only
model and (a, b) ∈ {(1, 1), (1, ch)}.

HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Under the HUP, models including all of the terms are given at least as much probability as any model containing a non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., it is the only model that contains every term), and as an unfortunate consequence this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex, they are appropriately penalized for their size; however, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced. Thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.

In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces p(M) ≥ p(M′). However, differences between the priors are apparent. The


HIP penalizes the full model the most, the HOP penalizes it the least, and the HUP lies between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5, the penalization of the HOP is the least dramatic, producing prior odds of 20 for MB versus MF, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior

To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases, or, in other words, as the order J_max increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as p_α = Σ_{M ∈ ℳ} I(α ∈ M) p(M | y, ℳ) for α ∈ MF. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on ℳ. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice a = 1, b = ch are considered. The results for the different combinations of MF and MT incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:

1. Randomly generate main-effects matrices X = (x1, ..., x18), with xi iid ~ N_n(0, I_n), and error vectors ε ~ N_n(0, I_n), for n = 60.

2. Setting all coefficient values equal to one, calculate y = Z_{MT}β + ε for the true models given by
   MT1 = {x1, x2, x3, x1², x1x2, x2², x2x3}, with |MT1| = 7;
   MT2 = {x1, x2, ..., x16}, with |MT2| = 16;
   MT3 = {x1, x2, x3, x4}, with |MT3| = 4;
   MT4 = {x1, x2, ..., x8, x1², x3x4}, with |MT4| = 10;
   MT5 = {x1, x2, x3, x4, x1², x3x4}, with |MT5| = 6.


Table 4-1. Characterization of the full models MF and corresponding model spaces ℳ considered in the simulations.

Growing p, fixed J_max                         Fixed p, growing J_max
MF              |MF|   |ℳ|      MT used        MF              |MF|   |ℳ|       MT used
(x1+x2+x3)²      9      95       MT1           (x1+x2+x3)²      9      95        MT1
(x1+...+x4)²     14     1337     MT1           (x1+x2+x3)³      19     2497      MT1
(x1+...+x5)²     20     38619    MT1           (x1+x2+x3)⁴      34     161421    MT1

Other model spaces
MF                              |MF|   |ℳ|       MT used
x1 + x2 + ... + x18              18     262144    MT2, MT3
(x1+...+x4)² + x5 + ... + x10    20     85568     MT4, MT5

3. In all simulations, the base model MB is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full models MF, are presented in Table 4-1, as well as the true models used in each case.

4. Enumerate the model spaces and calculate p(M | y, ℳ) for all M ∈ ℳ using the EPP, HUP, HIP, and HOP, the latter two each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each ℳ for the different priors.

The true positives (TP) are defined as those nodes α ∈ MT such that p_α > 0.5. For the false positives (FP), three different cutoffs on p_α are considered, elucidating the adjustment for multiplicity induced by the model priors; these cutoffs are 0.10, 0.20, and 0.50 for α ∉ MT. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1, the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of MF, the full model that defines the DAG for ℳ.
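The bookkeeping used for these counts is straightforward. The sketch below, with illustrative names, computes the marginal posterior inclusion probabilities p_α from an enumerated (or sampled and renormalized) set of model posterior probabilities and tallies TP and FP at the cutoffs just described.

    def inclusion_probabilities(post, all_terms):
        """p_alpha = sum of p(M | y) over the models M that contain alpha."""
        return {a: sum(p for M, p in post.items() if a in M) for a in all_terms}

    def tp_fp_counts(p_alpha, true_model, fp_cutoffs=(0.10, 0.20, 0.50)):
        """True positives at the 0.5 cutoff; false positives at each FP cutoff."""
        tp = sum(p > 0.50 for a, p in p_alpha.items() if a in true_model)
        fp = {c: sum(p > c for a, p in p_alpha.items() if a not in true_model)
              for c in fp_cutoffs}
        return tp, fp

    # `post` maps each model (e.g., a frozenset of terms) to its posterior
    # probability, obtained by full enumeration of the model space under a prior.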

Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial


surface of degree two. The true model is assumed to be MT1 and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.

First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 50% cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance becomes more pronounced as p increases. These priors also considerably outperform their counterparts with the default hyper-parameters a = b = 1 in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in MT1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 50% cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.

Growing polynomial degree, fixed main effects. For these examples, the true model is once again MT1. When the complexity is increased by making the order of MF larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.5 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high posterior inclusion probability both with the EPP and with the a = b = 1 priors.


Table 4-2. Mean number of false and true positives in 100 randomly generated datasets, as the number of main effects increases from three to five and MF is a full quadratic surface, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                             a = 1, b = 1           a = 1, b = ch
Cutoff      |MT|  MF               EPP    HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)    7    (x1+x2+x3)²      1.78   1.78   2.00   2.00    0.11   1.31   1.06
FP(>0.20)                          0.43   0.43   2.00   1.98    0.01   0.28   0.24
FP(>0.50)                          0.04   0.04   0.97   0.36    0.00   0.03   0.02
TP(>0.50)   (MT1)                  7.00   7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)    7    (x1+...+x4)²     3.62   1.94   2.33   2.45    0.10   0.63   1.07
FP(>0.20)                          1.60   0.47   2.17   2.15    0.01   0.17   0.24
FP(>0.50)                          0.25   0.06   0.35   0.36    0.00   0.02   0.02
TP(>0.50)   (MT1)                  7.00   7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)    7    (x1+...+x5)²     6.00   2.16   2.60   2.55    0.12   0.43   1.15
FP(>0.20)                          2.91   0.55   2.13   2.18    0.02   0.19   0.27
FP(>0.50)                          0.66   0.11   0.25   0.37    0.00   0.03   0.01
TP(>0.50)   (MT1)                  7.00   7.00   7.00   7.00    6.97   6.99   6.99

In contrast, the a = 1, b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 50% cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.

Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.


Table 4-3. Mean number of false and true positives in 100 randomly generated datasets, as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                             a = 1, b = 1           a = 1, b = ch
Cutoff      |MT|  MF               EPP    HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)    7    (x1+x2+x3)²      1.78   1.78   2.00   2.00    0.11   1.31   1.06
FP(>0.20)                          0.43   0.43   2.00   1.98    0.01   0.28   0.24
FP(>0.50)                          0.04   0.04   0.97   0.36    0.00   0.03   0.02
TP(>0.50)   (MT1)                  7.00   7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)    7    (x1+x2+x3)³      7.37   5.21   6.06   2.91    0.55   1.05   1.39
FP(>0.20)                          2.91   1.55   3.61   2.08    0.17   0.34   0.31
FP(>0.50)                          0.40   0.21   0.50   0.26    0.03   0.03   0.04
TP(>0.50)   (MT1)                  7.00   7.00   7.00   7.00    6.97   6.98   7.00
FP(>0.10)    7    (x1+x2+x3)⁴      8.22   4.00   4.69   2.61    0.52   0.55   1.32
FP(>0.20)                          4.21   1.13   1.76   2.03    0.12   0.15   0.31
FP(>0.50)                          0.56   0.17   0.22   0.27    0.03   0.03   0.04
TP(>0.50)   (MT1)                  7.00   7.00   7.00   7.00    6.97   6.97   6.99

By construction, in model spaces with main effects only, HIP(1,1) and the EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models contain 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 50% cutoff. In contrast, the HIP (and thus the EPP) with the 50% cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in MF, so there is little potential for false positives. The a = 1, b = ch priors show dramatically different behavior. The HIP controls false positives well, but fails to identify the true coefficients at the 50% cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 50% cutoff.


If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and than the choice a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from MT4, with ten terms, and MT5, with six terms. HIP(1,1) and the EPP again behave quite similarly, incorporating a large number of false positives at the 0.1 cutoff; at the 0.5 cutoff, some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 50% cutoff. In terms of the true positives, the EPP and the a = b = 1 priors always include all of the predictors in MT4 and MT5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and of the hierarchical priors with a = b = 1. At the 50% cutoff, these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as good default priors on the model space.

4.4 Random Walks on the Model Space

When the model space ℳ is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space ℳ both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which


Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                    a = 1, b = 1            a = 1, b = ch
Cutoff      |MT|  MF                     EPP     HIP     HUP     HOP     HIP     HUP     HOP
FP(>0.10)   16    x1 + x2 + ... + x18    1.93    1.93    2.00    2.00    0.03    1.80    1.80
FP(>0.20)                                0.52    0.52    2.00    2.00    0.01    0.46    0.46
FP(>0.50)                                0.07    0.07    2.00    2.00    0.01    0.04    0.04
TP(>0.50)   (MT2)                       15.99   15.99   16.00   16.00    6.99   15.99   15.99
FP(>0.10)    4    x1 + x2 + ... + x18   13.95   13.95    9.15    9.15    0.26    1.31    1.31
FP(>0.20)                                5.45    5.45    3.03    3.03    0.05    0.45    0.45
FP(>0.50)                                0.84    0.84    0.45    0.45    0.02    0.06    0.06
TP(>0.50)   (MT3)                        4.00    4.00    4.00    4.00    4.00    4.00    4.00
FP(>0.10)   10    (x1+...+x4)²           9.73    9.71   10.00    5.60    0.34    2.33    2.20
FP(>0.20)         + x5 + ... + x10       2.65    2.65    8.73    3.05    0.12    0.74    0.69
FP(>0.50)                                0.35    0.35    1.36    1.68    0.02    0.11    0.12
TP(>0.50)   (MT4)                       10.00   10.00   10.00    9.99    9.94    9.98    9.99
FP(>0.10)    6    (x1+...+x4)²          13.52   13.52   11.06    9.94    0.44    1.63    1.96
FP(>0.20)         + x5 + ... + x10       4.22    4.21    3.60    5.01    0.15    0.48    0.68
FP(>0.50)                                0.53    0.53    0.57    0.75    0.01    0.08    0.11
TP(>0.50)   (MT5)                        6.00    6.00    6.00    6.00    5.99    5.99    5.99

are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively quickly.

4.4.1 Simple Pruning and Growing

This first strategy relies on small, localized jumps around the model space, turning on or off a single node at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step of the algorithm, assume that the current state of the chain is model M, and let p_G be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M⁺ = M ∪ {α} for some α ∈ C(M), or M⁻ = M ∖ {α} for some α ∈ E(M).

An example transition kernel is defined by the mixture

$g(M' \mid M) = p_G \cdot q_{Grow}(M' \mid M) + (1 - p_G) \cdot q_{Prune}(M' \mid M) = \frac{I_{\{M \neq M_F\}}}{1 + I_{\{M \neq M_B\}}} \cdot \frac{I_{\{\alpha \in C(M)\}}}{|C(M)|} + \frac{I_{\{M \neq M_B\}}}{1 + I_{\{M \neq M_F\}}} \cdot \frac{I_{\{\alpha \in E(M)\}}}{|E(M)|},$   (4–11)

where p_G has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to, or deletion from, M uniformly at random.

For this simple algorithm, pruning is the reverse kernel of growing and vice versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing them. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).
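A compact sketch of one iteration of this simple kernel is given below. It reuses the multi-index helpers from the Section 4.2 sketch and assumes a hypothetical log_post(model) returning the log of the unnormalized model posterior; it is an illustration of the move type, not the dissertation's implementation.

    import numpy as np

    def parents(alpha):
        return [alpha[:j] + (a - 1,) + alpha[j + 1:]
                for j, a in enumerate(alpha) if a > 0]

    def children_nodes(model, full):
        return {a for a in full - model if set(parents(a)) <= model}

    def extreme_nodes(model, base):
        return {a for a in model - base if all(a not in parents(b) for b in model)}

    def grow_prune_step(model, base, full, log_post, rng):
        """One Metropolis-Hastings move of the simple pruning/growing kernel (4-11)."""
        C, E = children_nodes(model, full), extreme_nodes(model, base)
        grow = bool(C) and (not E or rng.uniform() < 0.5)
        pool = sorted(C) if grow else sorted(E)
        alpha = pool[rng.integers(len(pool))]
        proposal = (model | {alpha}) if grow else (model - {alpha})

        def log_q(m_from, m_to):
            # log proposal density of moving from m_from to m_to under the mixture kernel
            Cf, Ef = children_nodes(m_from, full), extreme_nodes(m_from, base)
            p_grow = 0.5 if (Cf and Ef) else (1.0 if Cf else 0.0)
            if len(m_to) > len(m_from):                    # growth move
                return np.log(p_grow) - np.log(len(Cf))
            return np.log(1.0 - p_grow) - np.log(len(Ef))  # pruning move

        log_ratio = (log_post(proposal) - log_post(model)
                     + log_q(proposal, model) - log_q(model, proposal))
        return proposal if np.log(rng.uniform()) < log_ratio else model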

4.4.2 Degree-Based Pruning and Growing

In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order: one can update the vector of inclusion indicators by blocks, γ_j(M). Two flavors of this algorithm are proposed, one that separates the pruning and growing steps and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = J_min up to j = J_max, with J_min and J_max being the lowest and highest orders of nodes in MF ∖ MB, respectively. Define M_t(J_min − 1) = M and set j = J_min. The growth kernel comprises the following steps, proceeding from j = J_min to j = J_max:


1) Propose a model M′ by selecting a set of nodes from C_j(M_t(j−1)) through the kernel q_{Grow,j}(· | M_t(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus M_t(j−1). If M′ is accepted, set M_t(j) = M′; otherwise, set M_t(j) = M_t(j−1).

3) If j < J_max, set j = j + 1 and return to Step 1); otherwise, proceed to Step 4).

4) Set M_t = M_t(J_max).

The pruning step is defined in a similar fashion; however, it starts at order j = J_max and proceeds down to j = J_min. Let E_j(M′) = {α ∈ E(M′) : order(α) = j} be the set of nodes of order j that can be removed from the model to produce a WFM. Define M_t(J_max + 1) = M and set j = J_max. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from E_j(M_t(j+1)) through the kernel q_{Prune,j}(· | M_t(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus M_t(j+1). If M′ is accepted, set M_t(j) = M′; otherwise, set M_t(j) = M_t(j+1).

3) If j > J_min, set j = j − 1 and return to Step 1); otherwise, proceed to Step 4).

4) Set M_t = M_t(J_min).
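One way to realize the growth sweep in Steps 1)–4) is sketched below, under the simplifying assumption that q_{Grow,j} proposes a uniformly chosen subset of C_j (possibly empty) and that the reverse pruning proposal is uniform over subsets of the order-j extreme nodes. The proposal densities and all names are our own illustrative choices, reusing the helpers defined in the previous sketch; this is not the dissertation's implementation.

    import numpy as np

    def growth_sweep(model, base, full, log_post, rng):
        """Upward sweep of the degree-based growing kernel: one MH update per order j."""
        orders = sorted({sum(a) for a in full - base})
        current = set(model)
        for j in orders:
            Cj = sorted(a for a in children_nodes(current, full) if sum(a) == j)
            if not Cj:
                continue
            # propose adding a uniformly chosen subset of C_j (possibly empty)
            add = {a for a in Cj if rng.uniform() < 0.5}
            proposal = current | add
            # reverse move removes the same nodes among the order-j extreme nodes of the proposal
            Ej_rev = [a for a in extreme_nodes(proposal, base) if sum(a) == j]
            log_fwd = -len(Cj) * np.log(2.0)
            log_rev = -len(Ej_rev) * np.log(2.0)
            log_ratio = log_post(proposal) - log_post(current) + log_rev - log_fwd
            if np.log(rng.uniform()) < log_ratio:
                current = proposal
        return current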

It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can also be combined for each j: the forward kernel proceeds from j = J_min to j = J_max and proposes changing the state of sets of nodes from C_j(M) ∪ E_j(M), while the reverse kernel simply reverses the direction of j, proceeding from j = J_max to j = J_min.

4.5 Simulation Study

To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points on the scale (McCullagh & Nelder 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.

With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normals. The response vectors are drawn from the n-variate normal distribution as y ~ N_n(Z_{MT}(X)β, I_n), where MT is the true model and I_n is the n × n identity matrix.

The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that Z_{MF}(X) is of full rank. The cardinality of this model space is |ℳ| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1² has more signal than x1), and once a higher-order term is selected its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the intended magnitude in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
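The sphering step can be sketched as follows: the polynomial design matrix of the true model is orthonormalized, coefficients of the intended size are assigned on the orthonormal scale, and then mapped back to the original scale. The QR-based construction, the snr value, and the term list below are our illustrative choices for how such a calculation could be set up; the actual MT is the one shown in Figure 4-6.

    import numpy as np

    rng = np.random.default_rng(2014)
    n, p = 130, 5
    X = rng.standard_normal((n, p))                      # main effects

    # Illustrative polynomial terms (exponent vectors for x1..x5), not the actual MT:
    terms = [(1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (2, 0, 0, 0, 0), (1, 1, 0, 0, 0)]
    Z = np.column_stack([np.prod(X ** np.array(a), axis=1) for a in terms])

    # Sphering: work on an orthonormal version of Z so each direction carries the
    # intended amount of signal, then map the coefficients back to the original scale.
    Q, R = np.linalg.qr(Z)
    snr = 1.0
    beta_tilde = np.full(Z.shape[1], snr)                # signal per orthonormal direction
    beta = np.linalg.solve(R / np.sqrt(n), beta_tilde)   # coefficients on the original scale

    y = Z @ beta + rng.standard_normal(n)                # response with unit error variance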

The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior across the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors in terms of the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.

Figure 4-6. DAG of the largest true model MT used in the simulations.

The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included, for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect

As expected, small sample sizes combined with a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect being greater for the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that, although the number of TPs is especially low with the HOP(1, ch), most of the few predictors that are discovered do in fact belong to the true model. In comparison with the EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are smallest. Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the number of FPs is somewhat lower with the HOP.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch).

4.5.2 Coefficient Magnitude

Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient. Finally, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. These choices are denoted by $\beta^{(1)} = c(1_{o1}, 1_{o2}, 1_{o3})$, $\beta^{(2)} = c(1_{o1}, 0.5_{o2}, 0.25_{o3})$, and $\beta^{(3)} = c(0.25_{o1}, 0.5_{o2}, 1_{o3})$, respectively. In Figure 4-7, the first four scenarios correspond to simulations with $\beta^{(1)}$, the next four use $\beta^{(2)}$, the next four correspond to $\beta^{(3)}$, and then the values are cycled in the same way. The results show that scenarios using either $\beta^{(1)}$ or $\beta^{(3)}$ behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through $\beta^{(2)}$. In Figure 4-7, the effect of using $\beta^{(2)}$ is evident, as it corresponds to the lowest values for the TPs regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale

Four true models were considered: (1) the model from Figure 4-6 ($M_{T1}$), (2) the model without the order-one terms ($M_{T2}$), (3) the model without order-two terms ($M_{T3}$), and (4) the model without $x_1^2$ and $x_2x_5$ ($M_{T4}$). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with $M_{T1}$, the next point is for scenarios with $M_{T2}$, followed by those with $M_{T3}$, then with $M_{T4}$, then $M_{T1}$, and so on. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms ($\beta^{(2)}$) are used.

4.6 Case Study: Ozone Data Analysis

This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper g-priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The model space contains approximately 7.1 billion models, and computation of all model posterior probabilities is not feasible.

Table 4-5. Variables used in the analyses of the ozone contamination dataset
Name    Description
ozone   Daily max 1hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in equation 3–2), hyper-g (HG) priors (Liang et al., 2008) with hyper-parameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability models (MPM). Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
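As a concrete illustration of the setup just described, the sketch below builds the full quadratic design (8 main effects, 8 squares, and 28 pairwise interactions, 44 terms in total), splits the 330 observations into two halves, fits a given term subset by least squares, and reports the validation RMSE. It is only a schematic of the evaluation step under stated assumptions, not the selection procedure itself; the array X of the eight meteorological variables, the response ozone, and the example term subset (taken here from Table 4-6) are assumed inputs.

import numpy as np
from itertools import combinations

def quadratic_design(X, names):
    # Columns: main effects, squares, and pairwise interactions (44 terms for 8 inputs).
    cols, labels = [], []
    for j, nm in enumerate(names):
        cols.append(X[:, j]); labels.append(nm)
    for j, nm in enumerate(names):
        cols.append(X[:, j] ** 2); labels.append(nm + "^2")
    for j, k in combinations(range(len(names)), 2):
        cols.append(X[:, j] * X[:, k]); labels.append(names[j] + "*" + names[k])
    return np.column_stack(cols), labels

def validation_rmse(Z, y, train_idx, valid_idx, keep):
    # Least-squares fit (with intercept) on the training half using the columns in
    # `keep`; RMSE on the held-out half.
    Zt = np.column_stack([np.ones(len(train_idx)), Z[np.ix_(train_idx, keep)]])
    Zv = np.column_stack([np.ones(len(valid_idx)), Z[np.ix_(valid_idx, keep)]])
    coef, *_ = np.linalg.lstsq(Zt, y[train_idx], rcond=None)
    resid = y[valid_idx] - Zv @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

# Hypothetical usage, assuming X (330 x 8) and ozone (length 330) are available:
# names = ["vh", "wind", "hum", "temp", "ibh", "dpg", "vis", "ibt"]
# Z, labels = quadratic_design(X, names)
# rng = np.random.default_rng(0)
# train = rng.choice(330, size=165, replace=False)
# valid = np.setdiff1d(np.arange(330), train)
# keep = [labels.index(t) for t in ["hum", "dpg", "ibt", "hum^2", "hum*ibt", "dpg^2", "ibt^2"]]
# print(validation_rmse(Z, ozone, train, valid, keep))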

Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg2, which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.

Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-to-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.

Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. model selected using the hierarchical lasso

BF prior  Model prior  Model                                                              R2      RMSE
IP        EPP          hum, dpg, ibt, hum2, hum*dpg, hum*ibt, dpg2, ibt2                  0.8054  4.2739
IP        HIP          hum, ibt, hum2, hum*ibt, ibt2                                      0.7740  4.3396
IP        HOP          hum, dpg, ibt, hum2, hum*ibt, ibt2                                 0.7848  4.3175
IP        HUP          hum, dpg, ibt, hum*ibt, ibt2                                       0.7767  4.3508
ZS        EPP          hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2                           0.7896  4.2518
ZS        HIP          hum, ibt, hum*ibt, ibt2                                            0.7525  4.3505
ZS        HOP          hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2                           0.7896  4.2518
ZS        HUP          hum, dpg, ibt, hum*ibt, ibt2                                       0.7767  4.3508
HG11      EPP          vh, hum, dpg, ibt, hum2, hum*ibt, dpg2                             0.7701  4.3049
HG11      HIP          hum, ibt, hum*ibt, ibt2                                            0.7525  4.3505
HG11      HOP          hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2                           0.7896  4.2518
HG11      HUP          hum, dpg, ibt, hum*ibt, ibt2                                       0.7767  4.3508
HG21      EPP          hum, dpg, ibt, hum2, hum*ibt, dpg2                                 0.7701  4.3037
HG21      HIP          hum, dpg, ibt, hum*ibt, ibt2                                       0.7767  4.3508
HG21      HOP          hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2                           0.7896  4.2518
HG21      HUP          hum, dpg, ibt, hum*ibt                                             0.7526  4.4036
          HierNet      hum, temp, ibh, dpg, ibt, vis, hum2, hum*ibt, temp2, temp*ibt, dpg2  0.7651  4.3680

4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all $M \in \mathcal{M}$, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities $\pi(M|\mathcal{M})$.

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are set up (e.g., in what units these predictors are expressed).

In this chapter we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP using the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.

In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, Zellner-Siow prior, and hyper g-priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.

CHAPTER 5
CONCLUSIONS

Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible: they allow integrating, under a single methodology, empirical observations and theoretical process models, and can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout the document will contribute to the understanding of Bayesian methods used in ecology, and hopefully these will shed light on the differences between estimation and testing Bayesian tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models; it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.

For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. Then we developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggest a strategy to add spatial dependence among neighboring sites.

Ecological inquiry usually requires competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model or a set of probable models should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which prevents introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the methods accurately single out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. Also, the simulations indicated that the method provides better discrimination for predictors in the detection component of the model.

In our simulations and in the analysis of the Blue Hawker data, we observed that the effect from using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities $\pi(M|\mathcal{M})$. In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are coded (e.g., in what units these predictors are expressed).

To confront this situation, we propose three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP using the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.

Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently, we envision three future developments: (1) occupancy models that incorporate various sources of information, (2) multi-species models that make use of spatial and interspecific dependence, and (3) methods to conduct model selection for the dynamic and spatially explicit version of the model.

APPENDIX A
FULL CONDITIONAL DENSITIES DYMOSS

In this section we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.

Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases $t = 1$, $1 < t < T$, and $t = T$, since their corresponding probabilities take on slightly different forms.

Let $\phi(\nu \mid \mu, \sigma^2)$ represent the density of a normal random variable $\nu$ with mean $\mu$ and variance $\sigma^2$, and recall that $\psi_{i1} = F(\mathbf{x}_{(o)i}'\boldsymbol{\alpha})$ and $p_{ijt} = F(\mathbf{q}_{ijt}'\boldsymbol{\lambda}_t)$, where $F(\cdot)$ is the inverse link function. The full conditional for $z_{it}$ is given by:

1. For $t = 1$,
$$\pi(z_{i1} \mid v_{i1}, \boldsymbol{\alpha}, \boldsymbol{\lambda}_1, \boldsymbol{\beta}^c_1, \delta^s_1) = (\psi^*_{i1})^{z_{i1}}(1 - \psi^*_{i1})^{1 - z_{i1}} = \mathrm{Bernoulli}(\psi^*_{i1}), \quad (A–1)$$
where
$$\psi^*_{i1} = \frac{\psi_{i1}\,\phi(v_{i1} \mid \mathbf{x}_{i1}'\boldsymbol{\beta}^c_1 + \delta^s_1, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1})}{\psi_{i1}\,\phi(v_{i1} \mid \mathbf{x}_{i1}'\boldsymbol{\beta}^c_1 + \delta^s_1, 1)\prod_{j=1}^{J_{i1}}(1 - p_{ij1}) + (1 - \psi_{i1})\,\phi(v_{i1} \mid \mathbf{x}_{i1}'\boldsymbol{\beta}^c_1, 1)\prod_{j=1}^{J_{i1}} I_{\{y_{ij1}=0\}}}.$$

2. For $1 < t < T$,
$$\pi(z_{it} \mid z_{i(t-1)}, z_{i(t+1)}, \boldsymbol{\lambda}_t, \boldsymbol{\beta}^c_{t-1}, \delta^s_{t-1}) = (\psi^*_{it})^{z_{it}}(1 - \psi^*_{it})^{1 - z_{it}} = \mathrm{Bernoulli}(\psi^*_{it}), \quad (A–2)$$
where
$$\psi^*_{it} = \frac{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt})}{\kappa_{it}\prod_{j=1}^{J_{it}}(1 - p_{ijt}) + \nabla_{it}\prod_{j=1}^{J_{it}} I_{\{y_{ijt}=0\}}},$$
with
(a) $\kappa_{it} = F(\mathbf{x}_{i(t-1)}'\boldsymbol{\beta}^c_{t-1} + z_{i(t-1)}\delta^s_{t-1})\,\phi(v_{it} \mid \mathbf{x}_{it}'\boldsymbol{\beta}^c_t + \delta^s_t, 1)$, and
(b) $\nabla_{it} = \left(1 - F(\mathbf{x}_{i(t-1)}'\boldsymbol{\beta}^c_{t-1} + z_{i(t-1)}\delta^s_{t-1})\right)\phi(v_{it} \mid \mathbf{x}_{it}'\boldsymbol{\beta}^c_t, 1)$.

3. For $t = T$,
$$\pi(z_{iT} \mid z_{i(T-1)}, \boldsymbol{\lambda}_T, \boldsymbol{\beta}^c_{T-1}, \delta^s_{T-1}) = (\psi^\star_{iT})^{z_{iT}}(1 - \psi^\star_{iT})^{1 - z_{iT}} = \mathrm{Bernoulli}(\psi^\star_{iT}), \quad (A–3)$$
where
$$\psi^\star_{iT} = \frac{\kappa^\star_{iT}\prod_{j=1}^{J_{iT}}(1 - p_{ijT})}{\kappa^\star_{iT}\prod_{j=1}^{J_{iT}}(1 - p_{ijT}) + \nabla^\star_{iT}\prod_{j=1}^{J_{iT}} I_{\{y_{ijT}=0\}}},$$
with
(a) $\kappa^\star_{iT} = F(\mathbf{x}_{i(T-1)}'\boldsymbol{\beta}^c_{T-1} + z_{i(T-1)}\delta^s_{T-1})$, and
(b) $\nabla^\star_{iT} = 1 - F(\mathbf{x}_{i(T-1)}'\boldsymbol{\beta}^c_{T-1} + z_{i(T-1)}\delta^s_{T-1})$.

Sampler $u_i$

$$\pi(u_i \mid z_{i1}, \boldsymbol{\alpha}) = \mathrm{tr}\,N\!\left(\mathbf{x}_{(o)i}'\boldsymbol{\alpha},\, 1,\, \mathrm{trunc}(z_{i1})\right), \quad \text{where } \mathrm{trunc}(z_{i1}) = \begin{cases}(-\infty, 0] & z_{i1} = 0 \\ (0, \infty) & z_{i1} = 1\end{cases} \quad (A–4)$$
and $\mathrm{tr}\,N(\mu, \sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$, and truncation region $A$.

Sampler $\boldsymbol{\alpha}$

$$\pi(\boldsymbol{\alpha} \mid \mathbf{u}) \propto [\boldsymbol{\alpha}]\prod_{i=1}^{N}\phi(u_i;\, \mathbf{x}_{(o)i}'\boldsymbol{\alpha}, 1). \quad (A–5)$$
If $[\boldsymbol{\alpha}] \propto 1$, then
$$\boldsymbol{\alpha} \mid \mathbf{u} \sim N(\mathbf{m}(\boldsymbol{\alpha}), \Sigma_{\boldsymbol{\alpha}}),$$
with $\mathbf{m}(\boldsymbol{\alpha}) = \Sigma_{\boldsymbol{\alpha}}X_{(o)}'\mathbf{u}$ and $\Sigma_{\boldsymbol{\alpha}} = (X_{(o)}'X_{(o)})^{-1}$.

Sampler $v_{it}$

1. (For $t > 1$)
$$\pi(v_{i(t-1)} \mid z_{i(t-1)}, z_{it}, \boldsymbol{\beta}^c_{t-1}, \delta^s_{t-1}) = \mathrm{tr}\,N\!\left(\mu^{(v)}_{i(t-1)},\, 1,\, \mathrm{trunc}(z_{it})\right), \quad (A–6)$$
where $\mu^{(v)}_{i(t-1)} = \mathbf{x}_{i(t-1)}'\boldsymbol{\beta}^c_{t-1} + z_{i(t-1)}\delta^s_{t-1}$ and $\mathrm{trunc}(z_{it})$ defines the corresponding truncation region given by $z_{it}$.

Sampler $(\boldsymbol{\beta}^c_{t-1}, \delta^s_{t-1})$

1. (For $t > 1$)
$$\pi(\boldsymbol{\beta}^c_{t-1}, \delta^s_{t-1} \mid \mathbf{v}_{t-1}, \mathbf{z}_{t-1}) \propto [\boldsymbol{\beta}^c_{t-1}, \delta^s_{t-1}]\prod_{i=1}^{N}\phi(v_{it};\, \mathbf{x}_{i(t-1)}'\boldsymbol{\beta}^c_{t-1} + z_{i(t-1)}\delta^s_{t-1},\, 1). \quad (A–7)$$
If $[\boldsymbol{\beta}^c_{t-1}, \delta^s_{t-1}] \propto 1$, then
$$\boldsymbol{\beta}^c_{t-1}, \delta^s_{t-1} \mid \mathbf{v}_{t-1}, \mathbf{z}_{t-1} \sim N(\mathbf{m}(\boldsymbol{\beta}^c_{t-1}, \delta^s_{t-1}), \Sigma_{t-1}),$$
with $\mathbf{m}(\boldsymbol{\beta}^c_{t-1}, \delta^s_{t-1}) = \Sigma_{t-1}\tilde{X}_{t-1}'\mathbf{v}_{t-1}$ and $\Sigma_{t-1} = (\tilde{X}_{t-1}'\tilde{X}_{t-1})^{-1}$, where $\tilde{X}_{t-1} = (X_{t-1}, \mathbf{z}_{t-1})$.

Sampler $w_{ijt}$

1. (For $t > 1$ and $z_{it} = 1$)
$$\pi(w_{ijt} \mid z_{it} = 1, y_{ijt}, \boldsymbol{\lambda}) = \mathrm{tr}\,N\!\left(\mathbf{q}_{ijt}'\boldsymbol{\lambda}_t,\, 1,\, \mathrm{trunc}(y_{ijt})\right). \quad (A–8)$$

Sampler $\boldsymbol{\lambda}_t$

1. (For $t = 1, 2, \ldots, T$)
$$\pi(\boldsymbol{\lambda}_t \mid \mathbf{z}_t, \mathbf{w}_t) \propto [\boldsymbol{\lambda}_t]\prod_{i:\, z_{it}=1}\;\prod_{j=1}^{J_{it}}\phi(w_{ijt};\, \mathbf{q}_{ijt}'\boldsymbol{\lambda}_t, 1). \quad (A–9)$$
If $[\boldsymbol{\lambda}_t] \propto 1$, then
$$\boldsymbol{\lambda}_t \mid \mathbf{w}_t, \mathbf{z}_t \sim N(\mathbf{m}(\boldsymbol{\lambda}_t), \Sigma_{\boldsymbol{\lambda}_t}),$$
with $\mathbf{m}(\boldsymbol{\lambda}_t) = \Sigma_{\boldsymbol{\lambda}_t}Q_t'\mathbf{w}_t$ and $\Sigma_{\boldsymbol{\lambda}_t} = (Q_t'Q_t)^{-1}$, where $Q_t$ and $\mathbf{w}_t$, respectively, are the design matrix and the vector of latent variables for surveys of sites such that $z_{it} = 1$.
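To make the use of these full conditionals concrete, the following Python sketch shows two of the updates above under the probit link ($F = \Phi$): the Bernoulli draw for $z_{i1}$ in (A–1) and the conjugate normal draw for $\boldsymbol{\alpha}$ in (A–5) with a flat prior. It is a simplified single-site illustration with assumed inputs (current values of $v_{i1}$, $\boldsymbol{\beta}^c_1$, $\delta^s_1$, $\boldsymbol{\lambda}_1$, and the detection design rows $\mathbf{q}_{ij1}$), not the full DYMOSS sampler.

import numpy as np
from scipy.stats import norm

def update_z_i1(y_i1, v_i1, x_oi, alpha, x_i1, beta_c1, delta_s1, q_i1, lambda_1, rng):
    # Draw z_i1 from its full conditional (A-1) under the probit link.
    psi_i1 = norm.cdf(x_oi @ alpha)               # occupancy probability for site i
    p_i1 = norm.cdf(q_i1 @ lambda_1)              # detection probabilities, one per survey
    num = psi_i1 * norm.pdf(v_i1, x_i1 @ beta_c1 + delta_s1, 1) * np.prod(1 - p_i1)
    alt = (1 - psi_i1) * norm.pdf(v_i1, x_i1 @ beta_c1, 1) * np.all(y_i1 == 0)
    psi_star = num / (num + alt)                  # equals 1 whenever a detection occurred
    return rng.binomial(1, psi_star)

def update_alpha(X_o, u, rng):
    # Draw alpha | u ~ N((X'X)^{-1} X'u, (X'X)^{-1}) as in (A-5) with a flat prior.
    Sigma = np.linalg.inv(X_o.T @ X_o)
    mean = Sigma @ X_o.T @ u
    return rng.multivariate_normal(mean, Sigma)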

APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state $M$, the global jump is performed by drawing a model $M'$ at random from the model space. This is achieved by beginning at the base model and increasing the order from $J^{\min}$ to $J^{\max}$, the minimum and maximum orders of nodes in $M_F \setminus M_B$; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is
$$\alpha = \min\left\{1,\; \frac{m(\mathbf{y} \mid M', \mathcal{M})}{m(\mathbf{y} \mid M, \mathcal{M})}\right\}.$$

Local Jump. From the current state $M$, the local jump is performed by drawing a model from the set of models $L(M) = \{M_\alpha : \alpha \in E(M) \cup C(M)\}$, where $M_\alpha$ is $M \setminus \{\alpha\}$ for $\alpha \in E(M)$ and $M \cup \{\alpha\}$ for $\alpha \in C(M)$. The proposal probabilities for the model are computed as a mixture of $p(M' \mid \mathbf{y}, \mathcal{M}, M' \in L(M))$ and the discrete uniform distribution. The proposal kernel is
$$q(M' \mid \mathbf{y}, \mathcal{M}, M' \in L(M)) = \frac{1}{2}\left(p(M' \mid \mathbf{y}, \mathcal{M}, M' \in L(M)) + \frac{1}{|L(M)|}\right).$$
This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is
$$\alpha = \min\left\{1,\; \frac{m(\mathbf{y} \mid M', \mathcal{M})}{m(\mathbf{y} \mid M, \mathcal{M})}\, \frac{q(M \mid \mathbf{y}, \mathcal{M}, M \in L(M'))}{q(M' \mid \mathbf{y}, \mathcal{M}, M' \in L(M))}\right\}.$$
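A schematic implementation of the local jump is given below. It is a sketch only, under stated assumptions: log_marginal(M) (the log of $m(\mathbf{y} \mid M, \mathcal{M})$ up to a constant), log_model_prior(M), and the neighborhood functions extreme_nodes(M) (nodes that may be removed) and children(M) (nodes that may be added) are hypothetical helpers assumed to be supplied by the surrounding code; models are represented as frozensets of terms, and the reverse move is assumed to lie in the neighborhood of the proposal, as it does on a well-formulated model space.

import numpy as np

def local_jump(M, log_marginal, log_model_prior, extreme_nodes, children, rng):
    # One local-jump move: propose from the mixture of the conditional posterior over
    # L(M) and the uniform distribution on L(M), then accept with the MH correction.
    def neighborhood(model):
        out = [model - {a} for a in extreme_nodes(model)]
        out += [model | {a} for a in children(model)]
        return [frozenset(m) for m in out]

    def proposal_probs(model):
        nbrs = neighborhood(model)
        logp = np.array([log_marginal(m) + log_model_prior(m) for m in nbrs])
        post = np.exp(logp - logp.max()); post /= post.sum()
        return nbrs, 0.5 * post + 0.5 / len(nbrs)      # mixture with the uniform

    nbrs, q_fwd = proposal_probs(M)
    idx = rng.choice(len(nbrs), p=q_fwd)
    M_new = nbrs[idx]
    back_nbrs, q_back = proposal_probs(M_new)
    q_to_new = q_fwd[idx]
    q_to_old = q_back[back_nbrs.index(frozenset(M))]
    log_ratio = (log_marginal(M_new) - log_marginal(M)) + np.log(q_to_old) - np.log(q_to_new)
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return M_new
    return M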

Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model $M'$, define $L_j(M') = \{M'\} \cup \{M'_\alpha : \alpha \in (E(M') \cup C(M')) \cap \mathcal{O}_j(M_F)\}$, where $\mathcal{O}_j(M_F)$ denotes the order-$j$ nodes of $M_F$. From a state $M$, the kernel chooses at random whether to increase or decrease the order. If $M = M_F$, then decreasing the order is chosen with probability 1, and if $M = M_B$, then increasing the order is chosen with probability 1; in all other cases the probability of increasing and decreasing order is 1/2. The proposal kernels are given by:

Increasing order proposal kernel

1. Set $j = J^{\min}_M - 1$ and $M'_j = M$.
2. Draw $M'_{j+1}$ from $q^{inc}_{j+1}(M' \mid \mathbf{y}, \mathcal{M}, M' \in L_{j+1}(M'_j))$, where
$$q^{inc}_{j+1}(M' \mid \mathbf{y}, \mathcal{M}, M' \in L_{j+1}(M'_j)) = \frac{1}{2}\left(p(M' \mid \mathbf{y}, \mathcal{M}, M' \in L_{j+1}(M'_j)) + \frac{1}{|L_{j+1}(M'_j)|}\right).$$
3. Set $j = j + 1$.
4. If $j < J^{\max}_M$, then return to 2; otherwise proceed to 5.
5. Set $M' = M'_{J^{\max}_M}$ and compute the proposal probability
$$q^{inc}(M' \mid \mathbf{y}, \mathcal{M}, M) = \prod_{j=J^{\min}_M - 1}^{J^{\max}_M - 1} q^{inc}_{j+1}(M'_{j+1} \mid \mathbf{y}, \mathcal{M}, M' \in L_{j+1}(M'_j)). \quad (B–1)$$

Decreasing order proposal kernel

1. Set $j = J^{\max}_M + 1$ and $M'_j = M$.
2. Draw $M'_{j-1}$ from $q^{dec}_{j-1}(M' \mid \mathbf{y}, \mathcal{M}, M' \in L_{j-1}(M'_j))$, where
$$q^{dec}_{j-1}(M' \mid \mathbf{y}, \mathcal{M}, M' \in L_{j-1}(M'_j)) = \frac{1}{2}\left(p(M' \mid \mathbf{y}, \mathcal{M}, M' \in L_{j-1}(M'_j)) + \frac{1}{|L_{j-1}(M'_j)|}\right).$$
3. Set $j = j - 1$.
4. If $j > J^{\min}_M$, then return to 2; otherwise proceed to 5.
5. Set $M' = M'_{J^{\min}_M}$ and compute the proposal probability
$$q^{dec}(M' \mid \mathbf{y}, \mathcal{M}, M) = \prod_{j=J^{\max}_M + 1}^{J^{\min}_M + 1} q^{dec}_{j-1}(M'_{j-1} \mid \mathbf{y}, \mathcal{M}, M' \in L_{j-1}(M'_j)). \quad (B–2)$$

If increasing order is chosen, then the MH correction is given by
$$\alpha = \min\left\{1,\; \frac{1 + I(M' = M_F)}{1 + I(M = M_B)}\, \frac{q^{dec}(M \mid \mathbf{y}, \mathcal{M}, M')}{q^{inc}(M' \mid \mathbf{y}, \mathcal{M}, M)}\, \frac{p(M' \mid \mathbf{y}, \mathcal{M})}{p(M \mid \mathbf{y}, \mathcal{M})}\right\}, \quad (B–3)$$
and similarly if decreasing order is chosen.

Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel $q$ can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.

APPENDIX C
WFM SIMULATION DETAILS

Briefly, the idea is to let $Z_{M_T}(X)\boldsymbol{\beta}_{M_T} = (QR)\boldsymbol{\beta}_{M_T} = Q\boldsymbol{\eta}_{M_T}$ (i.e., $\boldsymbol{\beta}_{M_T} = R^{-1}\boldsymbol{\eta}_{M_T}$) using the QR decomposition. As such, setting all values in $\boldsymbol{\eta}_{M_T}$ proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation $y_i$ is $\mathrm{var}(y_i) = \mathrm{var}(E[y_i \mid \mathbf{z}_i]) + E[\mathrm{var}(y_i \mid \mathbf{z}_i)]$, where $\mathbf{z}_i$ is the $i$-th row of the design matrix $Z_{M_T}$. Hence we take the signal-to-noise ratio for each observation to be
$$SNR(\boldsymbol{\eta}) = \frac{\boldsymbol{\eta}_{M_T}'R^{-T}\Sigma_z R^{-1}\boldsymbol{\eta}_{M_T}}{\sigma^2},$$
where $\Sigma_z = \mathrm{var}(\mathbf{z}_i)$. We determine how the signal is distributed across predictors up to a proportionality constant, which allows us to simultaneously control the signal-to-noise ratio.

Additionally, to investigate the ability of the model to capture the hierarchical structure correctly, we specify four different 0-1 vectors that determine the predictors in $M_T$, which generates the data in the different scenarios.

Table C-1. Experimental conditions, WFM simulations
Parameter                          Values considered
$SNR(\boldsymbol{\eta}_{M_T}) = k$              0.25, 1, 4
$\boldsymbol{\eta}_{M_T} \propto$               $(1, \mathbf{1}_3, \mathbf{1}_4, \mathbf{1}_2)$, $(1, \mathbf{1}_3, \tfrac{1}{2}\mathbf{1}_4, \tfrac{1}{4}\mathbf{1}_2)$, $(1, \tfrac{1}{4}\mathbf{1}_3, \tfrac{1}{2}\mathbf{1}_4, \mathbf{1}_2)$
$\boldsymbol{\gamma}_{M_T}$                     $(1, \mathbf{1}_3, \mathbf{1}_4, \mathbf{1}_2)$, $(1, \mathbf{1}_3, \mathbf{1}_4, \mathbf{0}_2)$, $(1, \mathbf{1}_3, \mathbf{0}_4, \mathbf{1}_2)$, $(1, \mathbf{0}_3, (0,1,1,0), \mathbf{1}_2)$
$n$                                130, 260, 1040

The results presented below are somewhat different from those found in the main body (Section 4.5). They are extracted by averaging the number of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios for the 20 highest probability models.


SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect more notorious when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with SNR of 0.25 and a relatively small sample size are far from being impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) has a strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either having a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, HOP(1, ch) provides a strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides a stronger control on the amount of FPs included when considering small sample sizes combined with small SNRs. As either sample size or SNR grows, the differences between the two priors become indistinct.

Figure C-1. SNR vs. n. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these values are denoted by $\beta = c(1_{o1}, 0.5_{o2}, 0.25_{o3})$, $\beta = c(1_{o1}, 1_{o2}, 1_{o3})$, and $\beta = c(0.25_{o1}, 0.5_{o2}, 1_{o3})$, respectively.

Observe that the number of FPs is invulnerable to how the SNR is distributed across predictors using the HOP(1, ch); conversely, when using the EPP, the number of FPs decreases as the SNR grows, always being slightly higher than those obtained with the HOP. With either prior structure the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with $\beta = c(0.25_{o1}, 0.5_{o2}, 1_{o3})$ or $\beta = c(1_{o1}, 1_{o2}, 1_{o3})$) the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

Special Points on the Scale

In Nelder (1998) the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in practice in this context. In addition, the author states that considering well-formulated models only does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified due to the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data-generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model $M_T$ from Figure 4-6, but have, respectively, zero-valued coefficients for all the order-one terms, all the order-two terms, and for $x_1^2$ and $x_2x_5$.

Figure C-2. SNR vs. coefficient values. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order terms. These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing, but with high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and the model with zero coefficients for $x_1^2$ and $x_2x_5$, and is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term $x_1x_2x_5$, which depends on the inclusion of three order-two terms ($x_1x_2$, $x_1x_5$, $x_2x_5$) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single or at most a few covariates. A true data-generating mechanism that removes all terms of a given order in the context of polynomial models is clearly not justified; here this was only done for comparison purposes.

Figure C-3. SNR vs. different true models $M_T$. Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.
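The strong-heredity bookkeeping discussed above (e.g., $x_1x_2x_5$ requiring $x_1x_2$, $x_1x_5$, and $x_2x_5$, which in turn require the main effects) is easy to make explicit. Below is a small, self-contained Python sketch, with terms encoded as exponent tuples, that lists the ancestry of a term and checks whether a set of terms is well formulated; it is an illustration of the heredity rule, not code from the dissertation.

def parents(term):
    # Immediate parents: reduce one positive exponent by one (drop the intercept).
    out = set()
    for i, e in enumerate(term):
        if e > 0:
            p = term[:i] + (e - 1,) + term[i + 1:]
            if any(p):
                out.add(p)
    return out

def ancestry(term):
    # All lower-order terms the strong-heredity principle forces into the model.
    todo, seen = set(parents(term)), set()
    while todo:
        t = todo.pop()
        if t not in seen:
            seen.add(t)
            todo |= parents(t)
    return seen

def is_well_formulated(model):
    # Every term's ancestry must be contained in the model.
    model = set(model)
    return all(ancestry(t) <= model for t in model)

# Terms over (x1, x2, x5): x1*x2*x5 is (1, 1, 1); x1^2 is (2, 0, 0).
print(sorted(ancestry((1, 1, 1))))                 # the two-way interactions and main effects
print(is_well_formulated({(1, 0, 0), (2, 0, 0)}))  # True
print(is_well_formulated({(2, 0, 0)}))             # False: x1^2 without x1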

APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang et al. (2008); these are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset
Name    Description
ozone   Daily max 1hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The marginal posterior inclusion probability corresponds to the probability of including a given term of the full model $M_F$ after summing over all models in the model space. For each node $\alpha \in M_F$ this probability is given by $p_\alpha = \sum_{M \in \mathcal{M}} I(\alpha \in M)\, p(M \mid \mathbf{y}, \mathcal{M})$. In problems with a large model space, such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible; thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space $\mathcal{M}$.
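A minimal sketch of this estimation step is shown below: given the distinct models visited by the random walk together with their (unnormalized) posterior weights, it renormalizes the weights, accumulates $p_\alpha$ for every term, and reads off the median probability model as the set of terms with $p_\alpha > 1/2$. The model representation (frozensets of term labels) and the weights are assumed inputs; the usage example is a toy.

from collections import defaultdict

def inclusion_probabilities(models, weights):
    # models: list of frozensets of terms visited by the chain;
    # weights: matching unnormalized posterior masses.
    # Returns the renormalized inclusion probability p_alpha for every term.
    total = sum(weights)
    p = defaultdict(float)
    for M, w in zip(models, weights):
        for term in M:
            p[term] += w / total
    return dict(p)

def median_probability_model(models, weights):
    # Terms whose estimated marginal inclusion probability exceeds 1/2.
    p = inclusion_probabilities(models, weights)
    return {term for term, prob in p.items() if prob > 0.5}

# Toy usage with three hypothetical visited models:
models = [frozenset({"hum", "ibt"}), frozenset({"hum", "ibt", "ibt2"}), frozenset({"ibt"})]
weights = [0.5, 0.3, 0.2]
print(inclusion_probabilities(models, weights))   # {'hum': 0.8, 'ibt': 1.0, 'ibt2': 0.3}
print(median_probability_model(models, weights))  # {'hum', 'ibt'}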

Given that there are in total 44 potential predictors, for convenience, in Tables D-2 to D-5 below we only display the marginal posterior probabilities for the terms included under at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, hyper-g(1,1), and hyper-g(2,1)).

Table D-2. Marginal inclusion probabilities, intrinsic prior
          EPP   HIP   HUP   HOP
hum       0.99  0.69  0.85  0.76
dpg       0.85  0.48  0.52  0.53
ibt       0.99  1.00  1.00  1.00
hum2      0.76  0.51  0.43  0.62
hum*dpg   0.55  0.02  0.03  0.17
hum*ibt   0.98  0.69  0.84  0.75
dpg2      0.72  0.36  0.25  0.46
ibt2      0.59  0.78  0.57  0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior
          EPP   HIP   HUP   HOP
hum       0.76  0.67  0.80  0.69
dpg       0.89  0.50  0.55  0.58
ibt       0.99  1.00  1.00  1.00
hum2      0.57  0.49  0.40  0.57
hum*ibt   0.72  0.66  0.78  0.68
dpg2      0.81  0.38  0.31  0.51
ibt2      0.54  0.76  0.55  0.77

Table D-4. Marginal inclusion probabilities, hyper-g(1,1)
          EPP   HIP   HUP   HOP
vh        0.54  0.05  0.10  0.11
hum       0.81  0.67  0.80  0.69
dpg       0.90  0.50  0.55  0.58
ibt       0.99  1.00  0.99  0.99
hum2      0.61  0.49  0.40  0.57
hum*ibt   0.78  0.66  0.78  0.68
dpg2      0.83  0.38  0.30  0.51
ibt2      0.49  0.76  0.54  0.77

Table D-5. Marginal inclusion probabilities, hyper-g(2,1)
          EPP   HIP   HUP   HOP
hum       0.79  0.64  0.73  0.67
dpg       0.90  0.52  0.60  0.59
ibt       0.99  1.00  0.99  1.00
hum2      0.60  0.47  0.37  0.55
hum*ibt   0.76  0.64  0.71  0.67
dpg2      0.82  0.41  0.36  0.52
ibt2      0.47  0.73  0.49  0.75

REFERENCES

Akaike, H. (1983). Information measures and model selection. Bulletin of the International Statistical Institute, 50, 277–290.

Albert, J. H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Berger, J. & Bernardo, J. (1992). On the development of reference priors. Bayesian Statistics 4 (pp. 35–60).
URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf

Berger, J. & Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433), 109–122.
URL httpamstattandfonlinecomdoiabs10108001621459199610476668

Berger, J., Pericchi, L., & Ghosh, J. (2001). Objective Bayesian methods for model selection: introduction and comparison. In Model Selection, vol. 38 of IMS Lecture Notes Monogr. Ser. (pp. 135–207). Inst. Math. Statist.
URL httpwwwjstororgstable1023074356165

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–20.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.
URL httpprojecteuclidorgeuclidaos1371150895

Breiman, L. & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco, M. J., Steinley, D., & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306–315.

Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.
URL httpprojecteuclidorgeuclidaos1239369020

Casella, G., Moreno, E., & Girón, F. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, TBA(TBA), 1–46.
URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf

Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.
URL httponlinelibrarywileycomdoi1023073315687abstract

Clyde, M. & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81–94.
URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M. & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509–520.

Fiske, I. & Chandler, R. (2011). unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10).
URL httpcorekmiopenacukdownloadpdf5701760pdf

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304–1308.
URL httpwwwtandfonlinecomdoiabs10108001621459200010474336

Girón, F. J., Moreno, E., Casella, G., & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales, Serie A, Matemáticas, 104(1), 57–67.
URL httpwwwspringerlinkcomindex105052RACSAM201006

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M., & Smith, L. D. (1982). Linear transformations of polynomial regression models. American Statistician, 36(3), 171–174.

Gunel, E. & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545–557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151–162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia.
URL httpsmospacelibraryumsystemeduxmluihandle103554500

Hooten, M. B. & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs (in press).

Hughes, J. & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75, 139–159.

Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
URL httpbiometoxfordjournalsorgcontent762297abstract

Jeffreys, H. (1935). Some tests of significance treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability. London: Oxford University Press, 3rd ed.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.
URL httpwwwesajournalsorgdoiabs10189012-05641

Kass, R. & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).
URL httpamstattandfonlinecomdoiabs10108001621459199510476592

Kass, R. E. & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
URL httpwwwtandfonlinecomdoiabs10108001621459199510476572

Kass, R. E. & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.
URL httpwwwjstororgstable2291752origin=crossref

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses. Academic Press, 1st ed.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical Report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.

Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–65.
URL httpwwwncbinlmnihgovpubmed22162041

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.
URL httpwwwtandfonlinecomdoiabs101198016214507000001337

Link, W. & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.

MacKenzie, D. & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.
URL httpcrsitbacidmediajurnalrefslandscapemackenzie2004zhpdf

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.
URL httpwwwesajournalsorgdoiabs10189002-3090

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M. (2013). Package 'AICcmodavg'.
URL ftpheanetarchivegnewsenseorgdisk1CRANwebpackagesAICcmodavgAICcmodavgpdf

McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.

Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.
URL httpprojecteuclidorgeuclidaos1278861238

Nelder, J. A. (1977). Reformulation of linear models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models - how strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & Mackenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.
URL httpwwwesajournalsorgdoipdf10189006-1474

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–21.
URL httpwwwncbinlmnihgovpubmed20957941

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Polya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
URL httpdxdoiorg101080016214592013829001

Rao, C. R. & Wu, Y. (2001). On model selection. Vol. 38 of Lecture Notes–Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
URL httpdxdoiorg101214lnms1215540960

Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W. & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.
URL httpbooksgooglecombooksid=dr9cPgAACAAJ

Rigler, F. & Peters, R. (1995). Excellence in Ecology: Science and Limnology. Ecology Institute, Germany.

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.
URL httpswwwnewtonacukpreprintsNI08021pdf

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A. & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–23.
URL httpwwwncbinlmnihgovpubmed17645027

Scott, J. & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics.
URL httpprojecteuclidorgeuclidaos1278861454

Spiegelhalter, D. J. & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L. & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.
URL httpwwwesajournalsorgdoiabs10189002-5078

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.
URL httpwwwncbinlmnihgovpubmed10733859

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.
URL httpwwwncbinlmnihgovpmcarticlesPMC3004292

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association (June).
URL httpwwwtandfonlinecomdoiabs101080016214592014880348

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.
URL httpprojecteuclidorgeuclidaoas1267453962

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: a case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A. & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
URL httpwwwspringerlinkcomindex5300770UP12246M9pdf

BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a BS degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a PhD in interdisciplinary ecology with concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has currently accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.


Professor Mary Christman for her mentorship and enormous support I would like to

thank Dr Mihai Giurcanu for spending countless hours helping me think more deeply

about statistics his insight has been instrumental to shaping my own ideas Thanks to

Dr Claudio Fuentes for taking an interest in my work and for his advise support and

kind words which helped me retain the confidence to continue

I would like to acknowledge my friends at UF Juan Jose Acosta Mauricio

Mosquera Diana Falla Salvador and Emma Weeks and Anna Denicol thanks for

becoming my family away from home Andreas Tavis Emily Alex Sasha Mike

Yeonhee and Laura thanks for being there for me I truly enjoyed sharing these

years with you Vitor Paula Rafa Leandro Fabio Eduardo Marcelo and all the other

Brazilians in the Animal Science Department thanks for your friendship and for the

many unforgettable (though blurry) weekends

Also I would like to thank Pablo Arboleda for believing in me Because of him I

was able to take the first step towards fulfilling my educational goals My gratitude to

Grupo Bancolombia Fulbright Colombia Colfuturo and the IGERT QSE3 program

for supporting me throughout my studies Also thanks to Marc Kery and Christian

Monnerat for providing data to validate our methods Thanks to the staff in the Statistics

Department specially to Ryan Chance to the staff at the HPC and also to Karen Bray

at SNRE

Above all else I would like to thank my wife and family Nata you have always been

there for me pushing me forward believing in me helping me make better decisions

and regardless of how hard things get you have always managed to give me true and

lasting happiness Thank you for your love strength and patience Mom Dad Alejandro

Alberto Laura Sammy Vale and Tommy without your love trust and support getting

this far would not have been possible Thank you for giving me so much Gustavo

Lilia Angelica and Juan Pablo thanks for taking me into your family your words of

encouragement have led the way

5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS 4

LIST OF TABLES 8

LIST OF FIGURES 10

ABSTRACT 12

CHAPTER

1 GENERAL INTRODUCTION 14

11 Occupancy Modeling 1512 A Primer on Objective Bayesian Testing 1713 Overview of the Chapters 21

2 MODEL ESTIMATION METHODS 23

21 Introduction 23211 The Occupancy Model 24212 Data Augmentation Algorithms for Binary Models 26

22 Single Season Occupancy 29221 Probit Link Model 30222 Logit Link Model 32

23 Temporal Dynamics and Spatial Structure 34231 Dynamic Mixture Occupancy State-Space Model 37232 Incorporating Spatial Dependence 43

24 Summary 46

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS 49

31 Introduction 4932 Objective Bayesian Inference 52

321 The Intrinsic Methodology 53322 Mixtures of g-Priors 54

3221 Intrinsic priors 553222 Other mixtures of g-priors 56

33 Objective Bayes Occupancy Model Selection 57331 Preliminaries 58332 Intrinsic Priors for the Occupancy Problem 60333 Model Posterior Probabilities 62334 Model Selection Algorithm 63

34 Alternative Formulation 6635 Simulation Experiments 68

351 Marginal Posterior Inclusion Probabilities for Model Predictors 70

6

352 Summary Statistics for the Highest Posterior Probability Model 7636 Case Study Blue Hawker Data Analysis 77

361 Results Variable Selection Procedure 79362 Validation for the Selection Procedure 81

37 Discussion 82

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS 84

41 Introduction 8442 Setup for Well-Formulated Models 88

421 Well-Formulated Model Spaces 9043 Priors on the Model Space 91

431 Model Prior Definition 92432 Choice of Prior Structure and Hyper-Parameters 96433 Posterior Sensitivity to the Choice of Prior 99

44 Random Walks on the Model Space 104441 Simple Pruning and Growing 105442 Degree Based Pruning and Growing 106

45 Simulation Study 107451 SNR and Sample Size Effect 109452 Coefficient Magnitude 110453 Special Points on the Scale 111

46 Case Study Ozone Data Analysis 11147 Discussion 113

5 CONCLUSIONS 115

APPENDIX

A FULL CONDITIONAL DENSITIES DYMOSS 118

B RANDOM WALK ALGORITHMS 121

C WFM SIMULATION DETAILS 124

D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS 131

REFERENCES 133

BIOGRAPHICAL SKETCH 140

7

LIST OF TABLES

1-1 Interpretation of BF_ji when contrasting M_j and M_i

3-1 Simulation control parameters, occupancy model selector

3-2 Comparison of average minOddsMPIP under scenarios having different number of sites (N=50, N=100) and under scenarios having different number of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors

3-3 Comparison of average minOddsMPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data

3-10 MPIP, presence component

3-11 MPIP, detection component

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors

4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP) and the hierarchical uniform prior (HUP)

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP) and the hierarchical uniform prior (HUP)

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP) and the hierarchical uniform prior (HUP)

4-5 Variables used in the analyses of the ozone contamination dataset

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. model selected using the hierarchical lasso

C-1 Experimental conditions, WFM simulations

D-1 Variables used in the analyses of the ozone contamination dataset

D-2 Marginal inclusion probabilities: intrinsic prior

D-3 Marginal inclusion probabilities: Zellner-Siow prior

D-4 Marginal inclusion probabilities: Hyper-g11

D-5 Marginal inclusion probabilities: Hyper-g21

LIST OF FIGURES

2-1 Graphical representation, occupancy model

2-2 Graphical representation, occupancy model after data-augmentation

2-3 Graphical representation, multiseason model for a single site

2-4 Graphical representation, data-augmented multiseason model

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors

4-1 Graphs of well-formulated polynomial models for p = 2

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects, for model M = {1, x1, x1^2}

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects

4-4 Prior probabilities for the space of well-formulated models associated to the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)}

4-5 Prior probabilities for the space of well-formulated models associated to three main effects and one interaction term, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)}

4-6 MT: DAG of the largest true model used in simulations

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1,ch)

C-1 SNR vs. n: average model size, average true positives and average false positives for all simulated scenarios by model ranking according to model posterior probabilities

C-2 SNR vs. coefficient values: average model size, average true positives and average false positives for all simulated scenarios by model ranking according to model posterior probabilities

C-3 SNR vs. different true models MT: average model size, average true positives and average false positives for all simulated scenarios by model ranking according to model posterior probabilities

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology

The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.


CHAPTER 1
GENERAL INTRODUCTION

As with any other branch of science, ecology strives to grasp truths about the world that surrounds us and, in particular, about nature. The objective truth sought by ecology may well be beyond our grasp; however, it is reasonable to think that, at least partially, "Nature is capable of being understood" (Dewey 1958). We can observe and interpret nature to formulate hypotheses, which can then be tested against reality. Hypotheses that encounter no or little opposition when confronted with reality may become contextual versions of the truth, and may be generalized by scaling them spatially and/or temporally accordingly to delimit the bounds within which they are valid.

To formulate hypotheses accurately and in a fashion amenable to scientific inquiry, not only must the point of view and assumptions considered be made explicit, but also the object of interest, the properties worthy of consideration of that object, and the methods used in studying such properties (Reiners & Lockwood 2009; Rigler & Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms". This characterizes organisms and their interactions as the objects of interest to ecology, and prescribes distribution and abundance as a relevant property of these organisms.

With regards to the methods used to acquire ecological scientific knowledge, traditionally theoretical mathematical models (such as deterministic PDEs) have been used. However, naturally varying systems are imprecisely observed and, as such, are subject to multiple sources of uncertainty that must be explicitly accounted for. Because of this, the ecological scientific community is developing a growing interest in flexible and powerful statistical methods, among which Bayesian hierarchical models predominate. These methods rely on empirical observations and can accommodate fairly complex relationships between empirical observations and theoretical process models, while accounting for diverse sources of uncertainty (Hooten 2006).

Bayesian approaches are now used extensively in ecological modeling; however, there are two issues of concern, one from the standpoint of ecological practitioners and another from the perspective of scientific ecological endeavors. First, Bayesian modeling tools require a considerable understanding of probability and statistical theory, leading practitioners to view them as black-box approaches (Kery 2010). Second, although Bayesian applications proliferate in the literature, in general there is a lack of awareness of the distinction between approaches specifically devised for testing and those for estimation (Ellison 2004). Furthermore, there is a dangerous unfamiliarity with the proven risks of using tools designed for estimation in testing procedures (Berger & Pericchi 1996; Berger et al. 2001; Kass & Raftery 1995; Moreno et al. 1998; Robert et al. 2009; Robert 1993), e.g., the use of flat priors in hypothesis testing.

Occupancy models have played a leading role during the past decade in large biological population surveys. The flexibility of the occupancy framework has allowed the development of useful extensions to determine several key population parameters, which provide robust notions of the distribution, structure and dynamics of a population. In order to address some of the concerns stated in the previous paragraph, we concentrate on the occupancy framework to develop estimation and testing tools that will allow ecologists, first, to gain insight about the estimation procedure and, second, to conduct statistically sound model selection for site-occupancy data.

1.1 Occupancy Modeling

Since MacKenzie et al. (2002) and Tyre et al. (2003) introduced the site-occupancy framework, countless applications and extensions of the method have been developed in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors: if an individual is detected, it must be present, while if it is not detected, it might or might not be. Occupancy models improve upon traditional binary regression by accounting for observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows probabilities of both presence (occurrence) and detection to be estimated.

The uses of site-occupancy models are many. For example, metapopulation and island biogeography models are often parameterized in terms of site (or patch) occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and occupancy may be used as a surrogate for abundance to answer questions regarding geographic distribution, range size and metapopulation dynamics (MacKenzie et al. 2004; Royle & Kery 2007).

The basic occupancy framework, which assumes a single closed population with fixed probabilities through time, has proven to be quite useful; however, it might be of limited utility when addressing some problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Models that incorporate temporally varying probabilities stem from important metapopulation notions provided by Hanski (1994), such as occupancy probabilities depending on local colonization and local extinction processes. In spite of the conceptual usefulness of Hanski's model, several strong and untenable assumptions (e.g., all patches being homogeneous in quality) are required for it to provide practically meaningful results.

A more viable alternative, which builds on Hanski (1994), is an extension of the single-season occupancy model of MacKenzie et al. (2003). In this model the heterogeneity of occupancy probabilities across seasons arises from local colonization and extinction processes. This model is flexible enough to let detection, occurrence, extinction and colonization probabilities each depend upon its own set of covariates. Model parameters are obtained through likelihood-based estimation.

Using a maximum likelihood approach presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results, which are obtained from implementation of the delta method, making it sensitive to sample size. Second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy for solving the estimation problem, after integrating out the latent state variables (occupancy indicators) they are no longer available. Therefore, finite sample estimates cannot be calculated directly; instead, a supplementary parametric bootstrapping step is necessary. Further, additional structure, such as temporal or spatial variation, cannot be introduced by means of random effects (Royle & Kery 2007).

1.2 A Primer on Objective Bayesian Testing

With the advent of high-dimensional data, such as that found in modern problems in ecology, genetics, physics, etc., coupled with evolving computing capability, objective Bayesian inferential methods have gained increasing popularity. This, however, is by no means a new approach to the way Bayesian inference is conducted. In fact, starting with Bayes and Laplace and continuing for almost 200 years, Bayesian analysis was primarily based on "noninformative" priors (Berger & Bernardo 1992).

Now, subjective elicitation of prior probabilities in Bayesian analysis is widely recognized as the ideal (Berger et al. 2001); however, it is often the case that the available information is insufficient to specify appropriate prior probabilistic statements. Commonly, as in model selection problems where large model spaces have to be explored, the number of model parameters is prohibitively large, preventing one from eliciting prior information for the entire parameter space. As a consequence, in practice the determination of priors through the definition of structural rules has become the alternative to subjective elicitation for a variety of problems in Bayesian testing. Priors arising from these rules are known in the literature as noninformative, objective, default or reference priors. Many of these connotations generate controversy and are accused, perhaps rightly, of providing a false pretension of objectivity. Nevertheless, we will avoid that discussion and refer to them herein exchangeably as noninformative or objective priors, to convey the sense that no attempt to introduce an informed opinion is made in defining prior probabilities.

A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001); Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000) and the references therein). We find particularly interesting those derived from the model structure, in which no tuning parameters are required, especially since these can be regarded as automatic methods. Among them, methods based on the Bayes factor for intrinsic priors have proven their worth in a variety of inferential problems, given their excellent performance, flexibility and ease of use. This class of priors is discussed in detail in Chapter 3. For now, some basic notation and notions of Bayesian inferential procedures are introduced.

Hypothesis testing and the Bayes factor

Bayesian model selection techniques that aim to find the true model, as opposed to searching for the model that best predicts the data, are fundamentally extensions of Bayesian hypothesis testing strategies. In general, this Bayesian approach to hypothesis testing and model selection relies on determining the amount of evidence found in favor of one hypothesis (or model) over the other, given an observed set of data. Approached from a Bayesian standpoint, this type of problem can be formulated in great generality using a natural, well-defined probabilistic framework that incorporates both model and parameter uncertainty.

Jeffreys (1935) first developed the Bayesian strategy for hypothesis testing and, consequently, for the model selection problem. Bayesian model selection within a model space M = {M_1, M_2, ..., M_J}, where each model is associated with a parameter θ_j (which may itself be a vector of parameters), incorporates three types of probability distributions: (1) a prior probability distribution for each model, π(M_j); (2) a prior probability distribution for the parameters in each model, π(θ_j | M_j); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x | θ_j, M_j). These three probability densities induce the joint distribution p(x, θ_j, M_j) = f(x | θ_j, M_j) · π(θ_j | M_j) · π(M_j), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes' rule:

p(M_j | x) = \frac{m(x | M_j)\,\pi(M_j)}{\sum_{i=1}^{J} m(x | M_i)\,\pi(M_i)},    (1-1)

where m(x | M_j) = \int f(x | θ_j, M_j)\,π(θ_j | M_j)\,dθ_j is the marginal likelihood of M_j.

Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:

\frac{p(M_j | x)}{p(M_k | x)} = \frac{m(x | M_j)}{m(x | M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}.    (1-2)

The first term on the right-hand side of (1-2), m(x | M_j)/m(x | M_k), is known as the Bayes factor comparing model M_j to model M_k, and it is denoted by BF_{jk}(x). The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, given by π(M_j)/π(M_k), to produce the posterior odds.

Bayes factors To illustrate let model Mlowast isin M be a reference model All other models

compare in M are compared to the reference model Then dividing both the numerator

19

and denominator in (1ndash1) by m(x|Mlowast)π(Mlowast) yields

p(Mj |x) =BFjlowast(x)

π(Mj )

π(Mlowast)

1 +sum

MiisinMMi =Mlowast

BFilowast(x)π(Mi )π(Mlowast)

(1ndash3)

Therefore, as the Bayes factor increases, the posterior probability of model M_j given the data increases. If all models have equal prior probabilities, a straightforward criterion for selecting the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.

Assuming equal model prior probabilities in (1-3), the prior odds are set equal to one and the model posterior odds in (1-2) become p(M_j | x)/p(M_k | x) = BF_{jk}(x). Based on the Bayes factors, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).

Table 1-1. Interpretation of BF_ji when contrasting M_j and M_i.

  ln BF_jk    BF_jk       Evidence in favor of M_j    P(M_j | x)
  0 to 2      1 to 3      Weak evidence               0.5-0.75
  2 to 6      3 to 20     Positive evidence           0.75-0.95
  6 to 10     20 to 150   Strong evidence             0.95-0.99
  >10         >150        Very strong evidence        >0.99
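To make the mapping from Bayes factors and prior odds to posterior model probabilities in (1-1)-(1-3) concrete, the following short Python sketch computes posterior probabilities for a small model space; the marginal-likelihood values are purely illustrative and not taken from any analysis in this work.

```python
import numpy as np

# Hypothetical log marginal likelihoods log m(x | M_j) for three candidate models
log_marglik = np.array([-102.3, -100.1, -104.8])
prior = np.array([1 / 3, 1 / 3, 1 / 3])          # equal model prior probabilities

# Posterior probabilities via (1-1), computed stably on the log scale
log_post = log_marglik + np.log(prior)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Bayes factor of model 2 against model 1, as in (1-2)
bf_21 = np.exp(log_marglik[1] - log_marglik[0])
print(post, bf_21)
```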

Bayesian hypothesis testing and model selection procedures based on Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does not require having nested hypotheses (i.e., having the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic in order to perform the test. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty by using the Bayesian machinery for model-averaged predictions and confidence bounds (Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a fully frequentist approach.

1.3 Overview of the Chapters

In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply a foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or detectability features of a particular survey conducted at a specific site. On the other hand, the Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection, and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.

Chapter 2 introduces the methods proposed for estimation of occupancy model parameters. A simple estimation procedure for the single-season occupancy model with covariates is formulated using both probit and logit links. Based on the simple version, an extension is provided to cope with metapopulation dynamics by introducing persistence and colonization processes. Finally, given the fundamental role that spatial dependence plays in defining temporal dynamics, a strategy to seamlessly account for this feature in our framework is introduced.

Chapter 3 develops a new, fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, first some issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of a variable selection algorithm for "objective" variable selection tailored to the occupancy model framework.

Chapter 4 touches on two important and interconnected issues in model testing that have yet to receive the attention they deserve: (1) controlling for false discovery in hypothesis testing given the size of the model space, i.e., given the number of tests performed; and (2) non-invariance to location transformations of the variable selection procedures in the face of polynomial predictor structure. These elements both depend on the definition of prior probabilities on the model space. In this chapter, a set of priors on the model space and a stochastic search algorithm are proposed; together, these control for model multiplicity and account for the polynomial structure among the predictors.

CHAPTER 2
MODEL ESTIMATION METHODS

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches

2.1 Introduction

Prior to the introduction of site-occupancy models (MacKenzie et al. 2002; Tyre et al. 2003), presence-absence data from ecological monitoring programs were used without any adjustment to assess the impact of management actions, to observe trends in species distribution through space and time, or to model the habitat of a species (Tyre et al. 2003). These efforts, however, were suspect because false-negative errors were not accounted for. False-negative errors occur whenever a species is present at a site but goes undetected during the survey.

Site-occupancy models, developed independently by MacKenzie et al. (2002) and Tyre et al. (2003), extend simple binary-regression models to account for the aforementioned errors in detection of individuals, which are common in surveys of animal or plant populations. Since their introduction, the site-occupancy framework has been used in countless applications and numerous extensions for it have been proposed. Occupancy models improve upon traditional binary regression by analyzing observed detection and partially observed presence as two separate but related components. In the site-occupancy setting, the chosen locations are surveyed repeatedly in order to reduce the ambiguity caused by the observed zeros. This approach therefore allows simultaneous estimation of the probabilities of presence (occurrence) and detection.

Several extensions to the basic single-season closed-population model are now available. The occupancy approach has been used to determine species range dynamics (MacKenzie et al. 2003; Royle & Kery 2007), to understand age/stage structure within populations (Nichols et al. 2007), and to model species co-occurrence (MacKenzie et al. 2004; Ovaskainen et al. 2010; Waddle et al. 2010). It has even been suggested as a surrogate for abundance (MacKenzie & Nichols 2004). MacKenzie et al. suggested using occupancy models to conduct large-scale monitoring programs, since this approach avoids the high costs associated with surveys designed for abundance estimation. Also, to investigate metapopulation dynamics, occupancy models improve upon incidence function models (Hanski 1994), which are often parameterized in terms of site (or patch) occupancy and assume homogeneous patches and a metapopulation that is at a colonization-extinction equilibrium.

Nevertheless, the implementation of Bayesian occupancy models commonly resorts to sampling strategies dependent on hyper-parameters, subjective prior elicitation and relatively elaborate algorithms. From the standpoint of practitioners, these are often treated as black-box methods (Kery 2010); as such, the potential of using the methodology incorrectly is high. Commonly, these procedures are fitted with packages such as BUGS or JAGS. Although the packages' ease of use has led to a widespread adoption of the methods, the user may be oblivious to the assumptions underpinning the analysis.

We believe providing straightforward and robust alternatives to implement these methods will help practitioners gain insight about how occupancy modeling, and more generally Bayesian modeling, is performed. In this chapter, using a simple Gibbs sampling approach, we first develop a versatile method to estimate the single-season closed-population site-occupancy model, then extend it to analyze metapopulation dynamics through time, and finally provide a further adaptation to incorporate spatial dependence among neighboring sites.

2.1.1 The Occupancy Model

In this section of the document we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. Under the standard sampling protocol for collecting site-occupancy data, J > 1 independent surveys are conducted at each of N representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let y_ij denote a binary random variable that indicates detection (y_ij = 1) or non-detection (y_ij = 0) during the j-th survey of site i. Without loss of generality, J may be assumed constant among all N sites to simplify description of the model; in practice, however, site-specific variation in J poses no real difficulties and is easily implemented. This sampling protocol therefore yields an N × J matrix Y of detection/non-detection data.

Note that the observed process y_ij is an imperfect representation of the underlying occupancy or presence process. Hence, letting z_i denote the presence indicator at site i, this model specification can be represented through the hierarchy

y_ij | z_i, λ ~ Bernoulli(z_i p_ij)
z_i | α ~ Bernoulli(ψ_i),    (2-1)

where p_ij is the probability of correctly classifying the i-th site as occupied during the j-th survey, and ψ_i is the presence probability at the i-th site. The graphical representation of this process is shown in Figure 2-1.

Figure 2-1. Graphical representation, occupancy model.
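As a quick illustration of the hierarchy in (2-1), the following sketch simulates detection histories for N sites and J surveys under a probit link; the covariate structure and parameter values are hypothetical and serve only to show how z and y are generated.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, J = 100, 5
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # site-level covariates
Q = rng.normal(size=(N, J, 2))                           # survey-level covariates
Q[:, :, 0] = 1.0                                         # intercept column
alpha_true = np.array([0.3, 1.0])                        # presence coefficients (hypothetical)
lam_true = np.array([-0.2, 0.7])                         # detection coefficients (hypothetical)

psi = norm.cdf(X @ alpha_true)                           # presence probabilities
z = rng.binomial(1, psi)                                 # latent occupancy indicators
p = norm.cdf(Q @ lam_true)                               # detection probabilities per survey
y = rng.binomial(1, z[:, None] * p)                      # observed detection histories
```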

Probabilities of detection and occupancy can both be made functions of covariates, and their corresponding parameter estimates can be obtained using either a maximum likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process (z_i), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a longstanding strategy in the Bayesian statistical literature that models binary outcomes using a simple Gibbs sampler. This procedure, which is described in the following section, can be extrapolated to the occupancy setting, eliminating the need for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models

Probit model: data augmentation with latent normal variables

At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support (-∞, 0], and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on (0, ∞). To understand the reasoning behind this strategy, let Y ~ Bern(Φ(x^T β)) and V = x^T β + ε, with ε ~ N(0, 1). In such a case, note that

Pr(y = 1 | x^T β) = Φ(x^T β) = Pr(ε < x^T β) = Pr(ε > -x^T β) = Pr(v > 0 | x^T β).

Thus, whenever y = 1, then v > 0, and v ≤ 0 otherwise. In other words, we may think of y as a truncated version of v. Thus we can sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2-3 below). Further, having sampled the latent variables, we may also sample the parameters.

Given some initial values for all model parameters, values for the latent variables can be simulated. By conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions; these samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach. Generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters conditional on the observed outcome values. We formalize the procedure below.

Assume that each outcome Y_1, Y_2, ..., Y_n is such that Y_i | x_i, β ~ Bernoulli(q_i), where q_i = Φ(x_i^T β) is the standard normal CDF evaluated at x_i^T β, and where x_i and β are the p-dimensional vectors of observed covariates for the i-th observation and their corresponding parameters, respectively.

Now let y = (y_1, y_2, ..., y_n) be the vector of observed outcomes and [β] represent the prior distribution of the model parameters. The posterior distribution of β is therefore given by

[β | y] ∝ [β] \prod_{i=1}^{n} Φ(x_i^T β)^{y_i} (1 - Φ(x_i^T β))^{1-y_i},    (2-2)

which is intractable. Nevertheless, introducing latent random variables V = (V_1, ..., V_n) such that V_i ~ N(x_i^T β, 1), and specifying that whenever Y_i = 1 then V_i > 0 and whenever Y_i = 0 then V_i ≤ 0, resolves this difficulty. This yields

[β, v | y] ∝ [β] \prod_{i=1}^{n} φ(v_i | x_i^T β, 1) { I_{(v_i ≤ 0)} I_{(y_i = 0)} + I_{(v_i > 0)} I_{(y_i = 1)} },    (2-3)

where φ(x | μ, τ²) is the probability density function of a normal random variable x with mean μ and variance τ². The data augmentation artifact works since [β | y] = \int [β, v | y] dv; hence, if we sample from the joint posterior (2-3) and extract only the sampled values for β, they will correspond to samples from [β | y].

From the expression above it is possible to obtain the full conditional distributions for V and β; thus, a Gibbs sampler can be proposed. For example, if we use a flat prior for β (i.e., [β] ∝ 1), the full conditionals are given by

β | V, y ~ MVN_p( (X^T X)^{-1} (X^T V), (X^T X)^{-1} ),    (2-4)

V | β, y ~ \prod_{i=1}^{n} trN(x_i^T β, 1, Q_i),    (2-5)

where MVN_p(μ, Σ) represents a multivariate normal distribution with mean vector μ and variance-covariance matrix Σ, and trN(ξ, σ², Q) stands for the truncated normal distribution with mean ξ, variance σ² and truncation region Q. For each i = 1, 2, ..., n, the support of the truncated variables is given by Q_i = (-∞, 0] if y_i = 0 and Q_i = (0, ∞) otherwise. Note that conjugate normal priors could be used alternatively.

At iteration m + 1 the Gibbs sampler draws V^(m+1) conditional on β^(m) from (2-5), and then samples β^(m+1) conditional on V^(m+1) from (2-4). This process is repeated for m = 0, 1, ..., nsim, where nsim is the number of iterations of the Gibbs sampler.
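A minimal sketch of this two-step sampler for plain probit regression, assuming a flat prior on β, is given below; the truncated-normal draws use scipy, and the data arrays y and X are assumed to come from a simulation such as the one above.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(y, X, nsim=2000, rng=None):
    """Albert & Chib (1993)-style sampler for probit regression, flat prior on beta."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(p)
    draws = np.empty((nsim, p))
    for m in range(nsim):
        mu = X @ beta
        # v_i is truncated to (0, inf) when y_i = 1 and to (-inf, 0] when y_i = 0
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        v = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # beta | v ~ N((X'X)^{-1} X'v, (X'X)^{-1}), per (2-4)
        beta = XtX_inv @ (X.T @ v) + chol @ rng.normal(size=p)
        draws[m] = beta
    return draws
```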

Logit model: data augmentation with latent Pólya-Gamma variables

Recently, Polson et al. (2013) developed a novel and efficient approach to Bayesian inference for logistic models using Pólya-Gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Pólya-Gamma distribution. To construct a random variable from this family, consider the infinite mixture of the i.i.d. sequence of Exp(1) random variables {E_k}_{k=1}^∞ given by

ω = \frac{2}{π^2} \sum_{k=1}^{∞} \frac{E_k}{(2k-1)^2},

with probability density function

g(ω) = \sum_{k=1}^{∞} (-1)^k \frac{2k+1}{\sqrt{2πω^3}} e^{-(2k+1)^2 / (8ω)} I_{(ω ∈ (0,∞))}    (2-6)

and Laplace density transform E[e^{-tω}] = cosh^{-1}(\sqrt{t/2}).

The Pólya-Gamma family of densities is obtained through an exponential tilting of the density g in (2-6). These densities, indexed by c ≥ 0, are characterized by

f(ω | c) = cosh(c/2) e^{-c^2 ω / 2} g(ω).
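For intuition, a PG(1, c) draw can be approximated by truncating the infinite-sum representation given in Polson et al. (2013); the sketch below is a naive truncation intended only for illustration (exact samplers exist in dedicated Pólya-Gamma software), and the truncation level is an arbitrary choice. Setting c = 0 recovers the mixture defining g in (2-6).

```python
import numpy as np

def rpg1_approx(c, trunc=200, rng=None):
    """Approximate draw from PG(1, c) by truncating
    omega = (1 / (2*pi^2)) * sum_k E_k / ((k - 1/2)^2 + c^2 / (4*pi^2)),
    with E_k iid Exp(1)."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, trunc + 1)
    e = rng.exponential(scale=1.0, size=trunc)
    return np.sum(e / ((k - 0.5) ** 2 + (c / (2.0 * np.pi)) ** 2)) / (2.0 * np.pi ** 2)
```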

The likelihood for the binomial logistic model can be expressed in terms of latent Pólya-Gamma variables as follows. Assume y_i ~ Bernoulli(δ_i), with predictors x_i' = (x_{i1}, ..., x_{ip}) and success probability δ_i = e^{x_i'β} / (1 + e^{x_i'β}). Hence the posterior for the model parameters can be represented as

[β | y] = \frac{[β] \prod_{i=1}^{n} δ_i^{y_i} (1 - δ_i)^{1-y_i}}{c(y)},

where c(y) is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing a Pólya-Gamma random variable ω ~ PG(1, x'β). This yields the data-augmented posterior

[β, ω | y] = \frac{ ( \prod_{i=1}^{n} Pr(y_i = 1 | β) ) f(ω | x'β) [β] }{ c(y) },    (2-7)

such that [β | y] = \int_{R_+} [β, ω | y] dω.

Thus, from the augmented model, the full conditional density for β is given by

[β | ω, y] ∝ ( \prod_{i=1}^{n} Pr(y_i = 1 | β) ) f(ω | x'β) [β]
          = \prod_{i=1}^{n} \frac{(e^{x_i'β})^{y_i}}{1 + e^{x_i'β}} \prod_{i=1}^{n} cosh( |x_i'β| / 2 ) exp[ -(x_i'β)^2 ω_i / 2 ] g(ω_i).    (2-8)

This expression yields a normal posterior distribution if β is assigned a flat or normal prior. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate β in the occupancy framework.

2.2 Single Season Occupancy

Let p_ij = F(q_ij^T λ) be the probability of correctly classifying the i-th site as occupied during the j-th survey, conditional on the site being occupied, and let ψ_i = F(x_i^T α) correspond to the presence probability at the i-th site. Further, let F^{-1}(·) denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by λ and α, respectively, the r-variate and p-variate coefficient vectors for the detection and presence probabilities. Then the joint posterior probability for the presence indicators and the model parameters is

π*(z, α, λ) ∝ π_α(α) π_λ(λ) \prod_{i=1}^{N} F(x_i'α)^{z_i} (1 - F(x_i'α))^{1-z_i} \prod_{j=1}^{J} ( z_i F(q_ij'λ) )^{y_ij} ( 1 - z_i F(q_ij'λ) )^{1-y_ij}.    (2-9)

As in the simple probit regression problem, this posterior is intractable; consequently, sampling from it directly is not possible. But the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective elicitation of parameter priors.

2.2.1 Probit Link Model

To extend Albert & Chib's algorithm to the occupancy framework with a probit link, we first introduce two sets of latent variables, denoted by w_ij and v_i, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

y_ij | z_i, w_ij ~ Bernoulli( z_i I_{(w_ij > 0)} )
w_ij | λ ~ N( q_ij'λ, 1 )
λ ~ [λ]
z_i | v_i ~ Bernoulli( I_{(v_i > 0)} )
v_i | α ~ N( x_i'α, 1 )
α ~ [α],    (2-10)

represented by the directed graph found in Figure 2-2.

Figure 2-2. Graphical representation, occupancy model after data-augmentation.

Under this hierarchical model, the joint density is given by

π*(z, v, α, w, λ) ∝ C_y π_α(α) π_λ(λ) \prod_{i=1}^{N} φ(v_i; x_i'α, 1) I^{z_i}_{(v_i > 0)} I^{(1-z_i)}_{(v_i ≤ 0)} × \prod_{j=1}^{J} ( z_i I_{(w_ij > 0)} )^{y_ij} ( 1 - z_i I_{(w_ij > 0)} )^{1-y_ij} φ(w_ij; q_ij'λ, 1).    (2-11)

The full conditional densities derived from the posterior in Equation 2-11 are detailed below.

1. The full conditional for z is obtained after integrating out v and w:

f(z | α, λ) = \prod_{i=1}^{N} f(z_i | α, λ) = \prod_{i=1}^{N} ψ*_i^{z_i} (1 - ψ*_i)^{1-z_i},
where
ψ*_i = \frac{ ψ_i \prod_{j=1}^{J} p_ij^{y_ij} (1 - p_ij)^{1-y_ij} }{ ψ_i \prod_{j=1}^{J} p_ij^{y_ij} (1 - p_ij)^{1-y_ij} + (1 - ψ_i) \prod_{j=1}^{J} I_{(y_ij = 0)} }.    (2-12)

2.

f(v | z, α) = \prod_{i=1}^{N} f(v_i | z_i, α) = \prod_{i=1}^{N} trN( x_i'α, 1, A_i ),
where A_i = (-∞, 0] if z_i = 0 and A_i = (0, ∞) if z_i = 1,    (2-13)

and trN(μ, σ², A) denotes the pdf of a truncated normal random variable with mean μ, variance σ² and truncation region A.

3.

f(α | v) = φ_p( α; Σ_α X'v, Σ_α ),    (2-14)

where Σ_α = (X'X)^{-1} and φ_k(x; μ, Σ) represents the k-variate normal density with mean vector μ and variance matrix Σ.

4.

f(w | y, z, λ) = \prod_{i=1}^{N} \prod_{j=1}^{J} f(w_ij | y_ij, z_i, λ) = \prod_{i=1}^{N} \prod_{j=1}^{J} trN( q_ij'λ, 1, B_ij ),
where B_ij = (-∞, ∞) if z_i = 0; B_ij = (-∞, 0] if z_i = 1 and y_ij = 0; and B_ij = (0, ∞) if z_i = 1 and y_ij = 1.    (2-15)

5.

f(λ | w) = φ_r( λ; Σ_λ Q'w, Σ_λ ),    (2-16)

where Σ_λ = (Q'Q)^{-1}.

The Gibbs sampling algorithm for the model can then be summarized as:

1. Initialize z, α, v, λ and w.
2. Sample z_i ~ Bern(ψ*_i).
3. Sample v_i from a truncated normal with μ = x_i'α, σ = 1 and truncation region depending on z_i.
4. Sample α ~ N(Σ_α X'v, Σ_α), with Σ_α = (X'X)^{-1}.
5. Sample w_ij from a truncated normal with μ = q_ij'λ, σ = 1 and truncation region depending on y_ij and z_i.
6. Sample λ ~ N(Σ_λ Q'w, Σ_λ), with Σ_λ = (Q'Q)^{-1}.
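A compact sketch of steps 1-6, assuming flat priors on α and λ and data shaped as in the simulation sketch of Section 2.1.1 (y of size N × J, X of size N × p, Q of size N × J × r), might look as follows; variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def occu_probit_gibbs(y, X, Q, nsim=2000, rng=None):
    """Gibbs sampler sketch for the single-season probit occupancy model."""
    rng = np.random.default_rng() if rng is None else rng
    N, J = y.shape
    p, r = X.shape[1], Q.shape[2]
    Sa = np.linalg.inv(X.T @ X)                     # Sigma_alpha
    La = np.linalg.cholesky(Sa)
    alpha, lam = np.zeros(p), np.zeros(r)
    out = {"alpha": np.empty((nsim, p)), "lambda": np.empty((nsim, r))}
    for m in range(nsim):
        psi, pdet = norm.cdf(X @ alpha), norm.cdf(Q @ lam)
        # step 2: presence indicators from (2-12); detected sites are occupied w.p. 1
        num = psi * np.prod(pdet**y * (1 - pdet) ** (1 - y), axis=1)
        den = num + (1 - psi) * (y.sum(axis=1) == 0)
        z = rng.binomial(1, num / den).astype(float)
        # steps 3-4: latent v and alpha
        mu_v = X @ alpha
        lo = np.where(z == 1, -mu_v, -np.inf)
        hi = np.where(z == 1, np.inf, -mu_v)
        v = mu_v + truncnorm.rvs(lo, hi, size=N, random_state=rng)
        alpha = Sa @ (X.T @ v) + La @ rng.normal(size=p)
        # steps 5-6: latent w and lambda (w unconstrained where z_i = 0)
        mu_w = Q @ lam
        lo_w = np.where((z[:, None] == 1) & (y == 1), -mu_w, -np.inf)
        hi_w = np.where((z[:, None] == 1) & (y == 0), -mu_w, np.inf)
        w = mu_w + truncnorm.rvs(lo_w, hi_w, size=(N, J), random_state=rng)
        Qf, wf = Q.reshape(N * J, r), w.reshape(N * J)
        Sl = np.linalg.inv(Qf.T @ Qf)
        lam = Sl @ (Qf.T @ wf) + np.linalg.cholesky(Sl) @ rng.normal(size=r)
        out["alpha"][m], out["lambda"][m] = alpha, lam
    return out
```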

2.2.2 Logit Link Model

Turning now to the logit link version of the occupancy model, again let y_ij be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site, and let z_i be the indicator variable that denotes presence (z_i = 1) or absence (z_i = 0) of the target species at the i-th site. The model is now defined by

y_ij | z_i, λ ~ Bernoulli( z_i p_ij ), where p_ij = e^{q_ij'λ} / (1 + e^{q_ij'λ}),
λ ~ [λ],
z_i | α ~ Bernoulli( ψ_i ), where ψ_i = e^{x_i'α} / (1 + e^{x_i'α}),
α ~ [α].

In this hierarchy, the contribution of a single site to the likelihood is

L_i(α, λ) = \frac{(e^{x_i'α})^{z_i}}{1 + e^{x_i'α}} \prod_{j=1}^{J} ( z_i \frac{e^{q_ij'λ}}{1 + e^{q_ij'λ}} )^{y_ij} ( 1 - z_i \frac{e^{q_ij'λ}}{1 + e^{q_ij'λ}} )^{1-y_ij}.    (2-17)

As in the probit case, we augment the likelihood with two separate sets of latent variables, in this case each having a Pólya-Gamma distribution. Augmenting the model and using the posterior in (2-7), the joint is

[z, α, λ | y] ∝ [α][λ] \prod_{i=1}^{N} \frac{(e^{x_i'α})^{z_i}}{1 + e^{x_i'α}} cosh( |x_i'α| / 2 ) exp[ -(x_i'α)^2 v_i / 2 ] g(v_i) × \prod_{j=1}^{J} ( z_i \frac{e^{q_ij'λ}}{1 + e^{q_ij'λ}} )^{y_ij} ( 1 - z_i \frac{e^{q_ij'λ}}{1 + e^{q_ij'λ}} )^{1-y_ij} cosh( |z_i q_ij'λ| / 2 ) exp[ -(z_i q_ij'λ)^2 w_ij / 2 ] g(w_ij).    (2-18)

The full conditionals for z, α, v, λ and w obtained from (2-18) are provided below.

1. The full conditional for z is obtained after marginalizing the latent variables, and yields

f(z | α, λ) = \prod_{i=1}^{N} f(z_i | α, λ) = \prod_{i=1}^{N} ψ*_i^{z_i} (1 - ψ*_i)^{1-z_i},
where
ψ*_i = \frac{ ψ_i \prod_{j=1}^{J} p_ij^{y_ij} (1 - p_ij)^{1-y_ij} }{ ψ_i \prod_{j=1}^{J} p_ij^{y_ij} (1 - p_ij)^{1-y_ij} + (1 - ψ_i) \prod_{j=1}^{J} I_{(y_ij = 0)} }.    (2-19)

2. Using the result derived in Polson et al. (2013), we have that

f(v | z, α) = \prod_{i=1}^{N} f(v_i | z_i, α) = \prod_{i=1}^{N} PG(1, x_i'α).    (2-20)

3.

f(α | v, z) ∝ [α] \prod_{i=1}^{N} exp[ z_i x_i'α - x_i'α / 2 - (x_i'α)^2 v_i / 2 ].    (2-21)

4. By the same result as that used for v, the full conditional for w is

f(w | y, z, λ) = \prod_{i=1}^{N} \prod_{j=1}^{J} f(w_ij | y_ij, z_i, λ) = ( \prod_{i ∈ S_1} \prod_{j=1}^{J} PG(1, |q_ij'λ|) ) ( \prod_{i ∉ S_1} \prod_{j=1}^{J} PG(1, 0) ),    (2-22)

with S_1 = { i ∈ {1, 2, ..., N} : z_i = 1 }.

5.

f(λ | z, y, w) ∝ [λ] \prod_{i ∈ S_1} \prod_{j=1}^{J} exp[ y_ij q_ij'λ - q_ij'λ / 2 - (q_ij'λ)^2 w_ij / 2 ],    (2-23)

with S_1 as defined above.
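For reference, completing the square in (2-21) under a flat prior shows that α given the Pólya-Gamma draws is multivariate normal, with κ_i = z_i − 1/2; the short sketch below illustrates that update (names are chosen for illustration, and the same form applies to λ in (2-23) with the relevant rows of Q).

```python
import numpy as np

def draw_alpha_pg(z, X, v, rng):
    """Draw alpha from the Gaussian full conditional implied by (2-21), flat prior:
    precision = X' diag(v) X, mean = precision^{-1} X' kappa, kappa_i = z_i - 1/2."""
    kappa = z - 0.5
    prec = X.T @ (v[:, None] * X)
    cov = np.linalg.inv(prec)
    mean = cov @ (X.T @ kappa)
    return mean + np.linalg.cholesky(cov) @ rng.normal(size=X.shape[1])
```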

The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Pólya-Gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure

The uses of the single-season model are limited to very specific problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.

However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from implementation of the delta method), making it sensitive to sample size. And second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy to solve the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such, finite sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes incorporation of additional structural dependence using random effects. Thus the model cannot account for spatial dependence, which plays a fundamental role in this setting.

To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood-based methods, Royle & Kery developed what they refer to as a dynamic occupancy state-space model (DOSS), alluding to the conceptual similarity between this model and the class of state-space models found in the time series literature. In particular, this model allows one to retain the latent process (occupancy indicators) in order to obtain small sample estimates and to eventually generate extensions that incorporate structure in time and/or space through random effects.

The data used in the DOSS model come from standard repeated presence/absence surveys with N sampling locations (patches or sites), indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for t = 1, 2, ..., T seasons. Here an important assumption is that the site occupancy status is closed within, but not across, seasons.

As is usual in the occupancy modeling framework, two different processes are considered. The first is the detection process per site-visit-season combination, denoted by y_ijt. The y_ijt are indicator variables that take the value 1 if the species is detected at site i, survey j and season t, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicator z_it. These are indicator variables which are equal to 1 whenever y_ijt = 1 for one or more of the visits made to site i during season t; otherwise, the values of the z_it's are unknown. Royle & Kery refer to these two processes as the observation (y_ijt) and the state (z_it) models.

In this setting, the parameters of greatest interest are the occurrence or site occupancy probabilities, denoted by ψ_it, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season t − 1, at season t it can either be colonized or remain unoccupied. On the other hand, if the site was in fact occupied at season t − 1, it can remain that way (survival) or become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t − 1 to season t at the i-th site are denoted by θ_{i(t−1)} and γ_{i(t−1)}, respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (Equation 2-24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (Equation 2-25). In particular,

z_i1 ~ Bernoulli( ψ_i1 ),    (2-24)

z_it | z_{i(t−1)} ~ Bernoulli( z_{i(t−1)} θ_{i(t−1)} + (1 − z_{i(t−1)}) γ_{i(t−1)} ).    (2-25)

The observation model, conditional on the latent process z_it, is defined by

y_ijt | z_it ~ Bernoulli( z_it p_ijt ).    (2-26)

Royle & Kery induce the heterogeneity by site, site-season and site-survey-season, respectively, in the occupancy, in the survival and colonization, and in the detection probabilities, through the following specification:

logit(ψ_i1) = x_1 + r_i,   r_i ~ N(0, σ²_ψ),   logit^{-1}(x_1) ~ Unif(0, 1)
logit(θ_it) = a_t + u_i,   u_i ~ N(0, σ²_θ),   logit^{-1}(a_t) ~ Unif(0, 1)
logit(γ_it) = b_t + v_i,   v_i ~ N(0, σ²_γ),   logit^{-1}(b_t) ~ Unif(0, 1)
logit(p_ijt) = c_t + w_ij,  w_ij ~ N(0, σ²_p),  logit^{-1}(c_t) ~ Unif(0, 1),    (2-27)

where x_1, a_t, b_t, c_t are the season fixed effects for the corresponding probabilities, and where (r_i, u_i, v_i) and w_ij are the site and site-survey random effects, respectively. Additionally, all variance components are assigned the usual inverse gamma priors.

As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model

We assume that the probabilities for occupancy, survival, colonization and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kery (2007); in essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time t may again be occupied at time t + 1 if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as δ, is linked to a linear combination of covariates x'ξ through the relationship defined by δ = F(x^T ξ), where F(·) represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State-Space model (DYMOSS).

As before, let y_ijt be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site during the t-th season, and let z_it be the indicator variable that denotes presence (z_it = 1) or absence (z_it = 0) of the target species at the i-th site in the t-th season, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., J} and t ∈ {1, 2, ..., T}. Additionally, assume that the probabilities for occupancy at time t = 1, persistence, colonization and detection are all functions of covariates, with corresponding parameter vectors α, Δ^(s) = {δ^(s)_{t−1}}_{t=2}^{T}, B^(c) = {β^(c)_{t−1}}_{t=2}^{T} and Λ = {λ_t}_{t=1}^{T}, and covariate matrices X^(o), X = {X_{t−1}}_{t=2}^{T} and Q = {Q_t}_{t=1}^{T}, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:

z_i1 | α ~ Bernoulli( ψ_i1 ), where ψ_i1 = F( x'_(o)i α ),
z_it | z_{i(t−1)}, δ^(s)_{t−1}, β^(c)_{t−1} ~ Bernoulli( z_{i(t−1)} θ_{i(t−1)} + (1 − z_{i(t−1)}) γ_{i(t−1)} ),
where θ_{i(t−1)} = F( δ^(s)_{t−1} + x'_{i(t−1)} β^(c)_{t−1} ) and γ_{i(t−1)} = F( x'_{i(t−1)} β^(c)_{t−1} ).    (2-28)

Observed model:
y_ijt | z_it, λ_t ~ Bernoulli( z_it p_ijt ), where p_ijt = F( q^T_ijt λ_t ).    (2-29)

In the hierarchical setup given by Equations 2-28 and 2-29, θ_{i(t−1)} corresponds to the probability of persistence from time t − 1 to time t at site i, and γ_{i(t−1)} denotes the colonization probability. Note that θ_{i(t−1)} − γ_{i(t−1)} yields the survival probability from t − 1 to t. The effect of survival is introduced by changing the intercept of the linear predictor by a quantity δ^(s)_{t−1}. Although in this version of the model this effect is accomplished by just modifying the intercept, it can be extended to have covariates determining δ^(s)_{t−1} as well. The graphical representation of the model for a single site is given in Figure 2-3.

Figure 2-3. Graphical representation, multiseason model for a single site.
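To make the state dynamics in (2-28) tangible, a short simulation sketch of the latent occupancy sequence under a probit link is given below; the covariates and coefficient values are hypothetical and used only for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
N, T = 50, 4
Xo = np.column_stack([np.ones(N), rng.normal(size=N)])     # first-season covariates
Xt = rng.normal(size=(T - 1, N, 2))                         # season-specific covariates
Xt[:, :, 0] = 1.0
alpha = np.array([0.2, 0.8])                                # first-season occupancy coefficients
beta_c = np.tile(np.array([-0.9, 0.5]), (T - 1, 1))         # colonization coefficients
delta_s = np.full(T - 1, 1.2)                               # persistence intercept shifts

z = np.empty((N, T), dtype=int)
z[:, 0] = rng.binomial(1, norm.cdf(Xo @ alpha))
for t in range(1, T):
    gamma = norm.cdf(Xt[t - 1] @ beta_c[t - 1])                     # colonization prob.
    theta = norm.cdf(delta_s[t - 1] + Xt[t - 1] @ beta_c[t - 1])    # persistence prob.
    prob = np.where(z[:, t - 1] == 1, theta, gamma)
    z[:, t] = rng.binomial(1, prob)
```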

The joint posterior for the model defined by this hierarchical setting is

[z, Λ, α, B^(c), Δ^(s) | y] = C_y \prod_{i=1}^{N} { ψ_i1 \prod_{j=1}^{J} p_ij1^{y_ij1} (1 − p_ij1)^{1−y_ij1} }^{z_i1} { (1 − ψ_i1) \prod_{j=1}^{J} I_{(y_ij1 = 0)} }^{1−z_i1} [λ_1][α] × \prod_{t=2}^{T} \prod_{i=1}^{N} [ ( θ_{i(t−1)}^{z_it} (1 − θ_{i(t−1)})^{1−z_it} )^{z_{i(t−1)}} ( γ_{i(t−1)}^{z_it} (1 − γ_{i(t−1)})^{1−z_it} )^{1−z_{i(t−1)}} ] { \prod_{j=1}^{J} p_ijt^{y_ijt} (1 − p_ijt)^{1−y_ijt} }^{z_it} { \prod_{j=1}^{J} I_{(y_ijt = 0)} }^{1−z_it} [λ_t][β^(c)_{t−1}][δ^(s)_{t−1}],    (2-30)

which, as in the single-season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.

Figure 2-4. Graphical representation, data-augmented multiseason model.

Probit link normal-mixture DYMOSS model


We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combination of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, for persistence and colonization, and for detection, are

• u_i ~ N( x'_(o)i α, 1 ),
• v_{i(t−1)} ~ z_{i(t−1)} N( δ^(s)_{t−1} + x^T_{i(t−1)} β^(c)_{t−1}, 1 ) + (1 − z_{i(t−1)}) N( x^T_{i(t−1)} β^(c)_{t−1}, 1 ), and
• w_ijt ~ N( q^T_ijt λ_t, 1 ).

Introducing these latent variables into the hierarchical formulation yields:

State model:
u_i1 | α ~ N( x'_(o)i α, 1 ),
z_i1 | u_i1 ~ Bernoulli( I_{(u_i1 > 0)} );
for t > 1,
v_{i(t−1)} | z_{i(t−1)}, β^(c)_{t−1}, δ^(s)_{t−1} ~ z_{i(t−1)} N( δ^(s)_{t−1} + x'_{i(t−1)} β^(c)_{t−1}, 1 ) + (1 − z_{i(t−1)}) N( x'_{i(t−1)} β^(c)_{t−1}, 1 ),
z_it | v_{i(t−1)} ~ Bernoulli( I_{(v_{i(t−1)} > 0)} ).    (2-31)

Observed model:
w_ijt | λ_t ~ N( q^T_ijt λ_t, 1 ),
y_ijt | z_it, w_ijt ~ Bernoulli( z_it I_{(w_ijt > 0)} ).    (2-32)

Note that the result presented in Section 2.2 corresponds to the particular case T = 1 of the model specified by Equations 2-31 and 2-32.

As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let φ(x | μ, σ²) denote the pdf of a normally distributed random variable x with mean μ and variance σ². Also let

1. W_t = (w_1t, w_2t, ..., w_Nt), with w_it = (w_i1t, w_i2t, ..., w_iJ_it t), for i = 1, 2, ..., N and t = 1, 2, ..., T;
2. u = (u_1, u_2, ..., u_N);
3. V = (v_1, ..., v_{T−1}), with v_t = (v_1t, v_2t, ..., v_Nt).

For the probit link model, the joint posterior distribution is

π( Z, u, V, {W_t}_{t=1}^{T}, α, B^(c), Δ^(s), Λ ) ∝ [α] \prod_{i=1}^{N} φ( u_i | x'_(o)i α, 1 ) I^{z_i1}_{(u_i > 0)} I^{1−z_i1}_{(u_i ≤ 0)} × \prod_{t=2}^{T} [β^(c)_{t−1}, δ^(s)_{t−1}] \prod_{i=1}^{N} φ( v_{i(t−1)} | μ^(v)_{i(t−1)}, 1 ) I^{z_it}_{(v_{i(t−1)} > 0)} I^{1−z_it}_{(v_{i(t−1)} ≤ 0)} × \prod_{t=1}^{T} [λ_t] \prod_{i=1}^{N} \prod_{j=1}^{J_it} φ( w_ijt | q'_ijt λ_t, 1 ) ( z_it I_{(w_ijt > 0)} )^{y_ijt} ( 1 − z_it I_{(w_ijt > 0)} )^{1−y_ijt},

where μ^(v)_{i(t−1)} = z_{i(t−1)} δ^(s)_{t−1} + x'_{i(t−1)} β^(c)_{t−1}.    (2-33)

The Gibbs sampler is initialized at $\boldsymbol{\alpha}^{(0)}$, $\mathbf{B}^{(c)(0)}$, $\boldsymbol{\delta}^{(s)(0)}$, and $\boldsymbol{\lambda}^{(0)}$, and run for iterations $m = 0, 1, \ldots, n_{\text{sim}}$.

The sampler proceeds iteratively by block sampling sequentially for each primary sampling period, as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting $[\,\cdot\,|\,\cdot\,]$ denote the full conditional probability density function of a component given all other unknown parameters and the observed data, for $m = 1, \ldots, n_{\text{sim}}$ the sampling procedure can be summarized as

$$\left[\mathbf{z}^{(m)}_1|\cdot\right]\rightarrow\left[\mathbf{u}^{(m)}|\cdot\right]\rightarrow\left[\boldsymbol{\alpha}^{(m)}|\cdot\right]\rightarrow\left[\mathbf{W}^{(m)}_1|\cdot\right]\rightarrow\left[\boldsymbol{\lambda}^{(m)}_1|\cdot\right]\rightarrow\left[\mathbf{z}^{(m)}_2|\cdot\right]\rightarrow\left[\mathbf{V}^{(m)}_{2-1}|\cdot\right]\rightarrow\left[\boldsymbol{\beta}^{(c)(m)}_{2-1}, \delta^{(s)(m)}_{2-1}|\cdot\right]\rightarrow\left[\mathbf{W}^{(m)}_2|\cdot\right]\rightarrow\left[\boldsymbol{\lambda}^{(m)}_2|\cdot\right]\rightarrow\cdots$$
$$\cdots\rightarrow\left[\mathbf{z}^{(m)}_T|\cdot\right]\rightarrow\left[\mathbf{V}^{(m)}_{T-1}|\cdot\right]\rightarrow\left[\boldsymbol{\beta}^{(c)(m)}_{T-1}, \delta^{(s)(m)}_{T-1}|\cdot\right]\rightarrow\left[\mathbf{W}^{(m)}_T|\cdot\right]\rightarrow\left[\boldsymbol{\lambda}^{(m)}_T|\cdot\right]$$

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail within Appendix A.
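To illustrate the kind of draws used within each block, the following is a minimal Python sketch of the first-season block $[\mathbf{z}_1|\cdot]\rightarrow[\mathbf{u}|\cdot]\rightarrow[\boldsymbol{\alpha}|\cdot]$ only; it assumes a flat prior on $\boldsymbol{\alpha}$ and, purely for brevity, a fixed detection probability. The design matrix, toy data, and iteration count are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np
from scipy.stats import truncnorm, norm

rng = np.random.default_rng(1)
N, J, p_dim = 50, 3, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # first-season design
alpha = np.zeros(p_dim)
z1 = rng.integers(0, 2, size=N)                          # current presence states
y1 = rng.binomial(1, 0.4 * z1[:, None], size=(N, J))     # toy detection histories
p_det = 0.4                                              # detection prob (held fixed here)

for it in range(1000):
    # [z_1 | .]: sites with a detection are occupied; otherwise draw presence
    # from its conditional posterior given no detections.
    psi = norm.cdf(X @ alpha)
    no_det = y1.sum(axis=1) == 0
    cond = psi * (1 - p_det) ** J / (psi * (1 - p_det) ** J + 1 - psi)
    z1 = np.where(no_det, rng.binomial(1, cond), 1)

    # [u | .]: truncated normal, positive when z=1 and non-positive when z=0.
    mu = X @ alpha
    lo = np.where(z1 == 1, -mu, -np.inf)
    hi = np.where(z1 == 1, np.inf, -mu)
    u = mu + truncnorm.rvs(lo, hi, size=N, random_state=rng)

    # [alpha | .]: conjugate normal draw, N((X'X)^{-1} X'u, (X'X)^{-1}).
    V = np.linalg.inv(X.T @ X)
    alpha = rng.multivariate_normal(V @ X.T @ u, V)
```

The remaining blocks (the $v$, $\mathbf{W}_t$, and season-specific parameter draws) repeat the same truncated-normal / conjugate-normal pattern with the appropriate design matrices.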

Logit link Pólya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by

State model:
$$u_{i1}\,|\,\boldsymbol{\alpha} \sim PG\left(1,\, \mathbf{x}^{T}_{(o)i}\boldsymbol{\alpha}\right), \qquad z_{i1}\,|\,u_{i} \sim \text{Bernoulli}\left(I_{\{u_{i}>0\}}\right),$$
and, for $t>1$,
$$v_{i(t-1)}\,|\,\cdot \sim PG\left(1,\, \left|z_{i(t-1)}\delta^{(s)}_{(t-1)} + \mathbf{x}'_{i(t-1)}\boldsymbol{\beta}^{(c)}_{(t-1)}\right|\right), \qquad z_{it}\,|\,v_{i(t-1)} \sim \text{Bernoulli}\left(I_{\{v_{i(t-1)}>0\}}\right). \qquad (2\text{--}34)$$

Observed model:
$$w_{ijt}\,|\,\boldsymbol{\lambda}_t \sim PG\left(1,\, \mathbf{q}^{T}_{ijt}\boldsymbol{\lambda}_t\right), \qquad y_{ijt}\,|\,z_{it}, w_{ijt} \sim \text{Bernoulli}\left(z_{it}I_{\{w_{ijt}>0\}}\right). \qquad (2\text{--}35)$$

The logit link version of the joint posterior is given by

$$\pi\left(\mathbf{Z}, \mathbf{u}, \mathbf{V}, \{\mathbf{W}_t\}_{t=1}^{T}, \boldsymbol{\alpha}, \mathbf{B}^{(s)}, \mathbf{B}^{(c)}\right) \propto \prod_{i=1}^{N}\frac{\left(e^{\mathbf{x}'_{(o)i}\boldsymbol{\alpha}}\right)^{z_{i1}}}{1 + e^{\mathbf{x}'_{(o)i}\boldsymbol{\alpha}}}\,PG\left(u_i;\, 1,\, \left|\mathbf{x}'_{(o)i}\boldsymbol{\alpha}\right|\right)[\boldsymbol{\lambda}_1][\boldsymbol{\alpha}]\times$$
$$\prod_{j=1}^{J_{i1}}\left(z_{i1}\frac{e^{\mathbf{q}'_{ij1}\boldsymbol{\lambda}_1}}{1 + e^{\mathbf{q}'_{ij1}\boldsymbol{\lambda}_1}}\right)^{y_{ij1}}\left(1 - z_{i1}\frac{e^{\mathbf{q}'_{ij1}\boldsymbol{\lambda}_1}}{1 + e^{\mathbf{q}'_{ij1}\boldsymbol{\lambda}_1}}\right)^{1-y_{ij1}}PG\left(w_{ij1};\, 1,\, \left|z_{i1}\mathbf{q}'_{ij1}\boldsymbol{\lambda}_1\right|\right)\times$$
$$\prod_{t=2}^{T}[\delta^{(s)}_{t-1}][\boldsymbol{\beta}^{(c)}_{t-1}][\boldsymbol{\lambda}_t]\prod_{i=1}^{N}\frac{\left(\exp\left[\mu^{(v)}_{i(t-1)}\right]\right)^{z_{it}}}{1 + \exp\left[\mu^{(v)}_{i(t-1)}\right]}\,PG\left(v_{it};\, 1,\, \left|\mu^{(v)}_{i(t-1)}\right|\right)\times$$
$$\prod_{j=1}^{J_{it}}\left(z_{it}\frac{e^{\mathbf{q}'_{ijt}\boldsymbol{\lambda}_t}}{1 + e^{\mathbf{q}'_{ijt}\boldsymbol{\lambda}_t}}\right)^{y_{ijt}}\left(1 - z_{it}\frac{e^{\mathbf{q}'_{ijt}\boldsymbol{\lambda}_t}}{1 + e^{\mathbf{q}'_{ijt}\boldsymbol{\lambda}_t}}\right)^{1-y_{ijt}}PG\left(w_{ijt};\, 1,\, \left|z_{it}\mathbf{q}'_{ijt}\boldsymbol{\lambda}_t\right|\right), \qquad (2\text{--}36)$$

with $\mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + \mathbf{x}'_{i(t-1)}\boldsymbol{\beta}^{(c)}_{t-1}$.
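Although the full conditionals are deferred to Appendix A, it may help to recall the generic form that Pólya-Gamma augmentation produces for the regression coefficients. As a sketch, assuming a normal prior $\boldsymbol{\lambda}_t \sim N(\mathbf{m}_0, \mathbf{V}_0)$ (the prior actually adopted may differ) and restricting attention to the site-visits with $z_{it}=1$, conditioning on the draws $w_{ijt}$ gives a conjugate Gaussian update,
$$\boldsymbol{\lambda}_t\,|\,\cdot \sim N\left(\mathbf{V}_w\left(\mathbf{Q}_t'\boldsymbol{\kappa}_t + \mathbf{V}_0^{-1}\mathbf{m}_0\right),\; \mathbf{V}_w\right), \qquad \mathbf{V}_w = \left(\mathbf{Q}_t'\,\mathrm{diag}(w_{ijt})\,\mathbf{Q}_t + \mathbf{V}_0^{-1}\right)^{-1},$$
where $\mathbf{Q}_t$ stacks the rows $\mathbf{q}'_{ijt}$ for the occupied site-visits and $\kappa_{ijt} = y_{ijt} - 1/2$. The updates for $\boldsymbol{\alpha}$ and for $(\delta^{(s)}_{t-1}, \boldsymbol{\beta}^{(c)}_{t-1})$ have the same structure.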

The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2-36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence

In this section we describe how an additional layer of complexity, space, can also be accounted for while continuing to use the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single season, closed population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the Spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran, 2013).

Succinctly, the spatial dependence is accounted for in the model by adding a random vector $\boldsymbol{\eta}$ assumed to have a conditionally-autoregressive (CAR) prior (also known as the Gaussian Markov random field prior). To define the prior, let the pair $G = (V, E)$ represent the undirected graph for the entire spatial region studied, where $V = (1, 2, \ldots, N)$ denotes the vertices of the graph (sites) and $E$ the set of edges between sites; $E$ is constituted by elements of the form $(i, j)$ indicating that sites $i$ and $j$ are spatially adjacent. The prior for the spatial effects is then characterized by
$$[\boldsymbol{\eta}\,|\,\tau] \propto \tau^{\mathrm{rank}(\boldsymbol{\Omega})/2}\exp\left[-\frac{\tau}{2}\boldsymbol{\eta}'\boldsymbol{\Omega}\boldsymbol{\eta}\right], \qquad (2\text{--}37)$$

where $\boldsymbol{\Omega} = \left(\mathrm{diag}(\mathbf{A}\mathbf{1}) - \mathbf{A}\right)$ is the precision matrix, with $\mathbf{A}$ denoting the adjacency matrix. The entries of the adjacency matrix $\mathbf{A}$ are such that $\mathrm{diag}(\mathbf{A}) = \mathbf{0}$ and $A_{ij} = I_{\{(i,j)\in E\}}$.

The matrix $\boldsymbol{\Omega}$ is singular; hence, the probability density defined in Equation 2-37 is improper, i.e., it does not integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper the posterior for the model parameters is proper. If a constraint such as $\sum_k \eta_k = 0$ is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.

Assuming that all but the detection process are subject to spatial correlation, and using the notation developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by Equations 2-38 and 2-39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding the steps to sample $\boldsymbol{\eta}^{(o)}$ and $\{\boldsymbol{\eta}_t\}_{t=2}^{T}$ conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival, and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.

State model:
$$z_{i1}\,|\,\boldsymbol{\alpha} \sim \text{Bernoulli}(\psi_{i1}), \quad \text{where } \psi_{i1} = F\left(\mathbf{x}^{T}_{(o)i}\boldsymbol{\alpha} + \eta^{(o)}_{i}\right),$$
$$\left[\boldsymbol{\eta}^{(o)}\,|\,\tau\right] \propto \tau^{\mathrm{rank}(\boldsymbol{\Omega})/2}\exp\left[-\frac{\tau}{2}\boldsymbol{\eta}^{(o)\prime}\boldsymbol{\Omega}\boldsymbol{\eta}^{(o)}\right],$$
$$z_{it}\,|\,z_{i(t-1)}, \boldsymbol{\alpha}, \boldsymbol{\beta}_{t-1}, \boldsymbol{\lambda}_{t-1} \sim \text{Bernoulli}\left(z_{i(t-1)}\theta_{i(t-1)} + \left(1-z_{i(t-1)}\right)\gamma_{i(t-1)}\right),$$
$$\text{where } \theta_{i(t-1)} = F\left(\delta^{(s)}_{(t-1)} + \mathbf{x}^{T}_{i(t-1)}\boldsymbol{\beta}^{(c)}_{t-1} + \eta_{it}\right), \quad \gamma_{i(t-1)} = F\left(\mathbf{x}^{T}_{i(t-1)}\boldsymbol{\beta}^{(c)}_{t-1} + \eta_{it}\right),$$
$$[\boldsymbol{\eta}_t\,|\,\tau] \propto \tau^{\mathrm{rank}(\boldsymbol{\Omega})/2}\exp\left[-\frac{\tau}{2}\boldsymbol{\eta}'_t\boldsymbol{\Omega}\boldsymbol{\eta}_t\right]. \qquad (2\text{--}38)$$

Observed model:
$$y_{ijt}\,|\,z_{it}, \boldsymbol{\eta}_t \sim \text{Bernoulli}\left(z_{it}p_{ijt}\right), \quad \text{where } p_{ijt} = F\left(\mathbf{q}^{T}_{ijt}\boldsymbol{\lambda}_t\right). \qquad (2\text{--}39)$$

In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran, 2013; Reich et al., 2006): (1) model parameters have no clear interpretation due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work by Reich et al. (2006). This methodology is summarized in what follows.

Let a vector of spatial effects $\boldsymbol{\eta}$ have the CAR prior given by 2-37 above. Now consider a random vector $\boldsymbol{\zeta} \sim MVN\left(\mathbf{0},\, \tau\mathbf{K}'\boldsymbol{\Omega}\mathbf{K}\right)$, with $\boldsymbol{\Omega}$ defined as above and where $\tau\mathbf{K}'\boldsymbol{\Omega}\mathbf{K}$ corresponds to the precision of the distribution and not the covariance matrix, with the matrix $\mathbf{K}$ satisfying $\mathbf{K}'\mathbf{K} = \mathbf{I}$.

This last condition implies that the linear predictor can be written as $\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} + \mathbf{K}\boldsymbol{\zeta}$. With respect to how the matrix $\mathbf{K}$ is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as $\mathbf{P}^{\perp}\mathbf{A}\mathbf{P}^{\perp}$, with $\mathbf{P}^{\perp} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ and where $\mathbf{A}$ is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix $\mathbf{X}$. Both elements are reflected in the spectral decomposition of the Moran operator: its eigenvalues correspond to values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to $\mathbf{X}$, while its eigenvectors provide the patterns of spatial dependence residual to $\mathbf{X}$. Thus, the matrix $\mathbf{K}$ is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix, as sketched in the example following the list below.

Using this strategy, the new hierarchical formulation of our model is simply modified by letting $\boldsymbol{\eta}^{(o)} = \mathbf{K}^{(o)}\boldsymbol{\zeta}^{(o)}$ and $\boldsymbol{\eta}_t = \mathbf{K}_t\boldsymbol{\zeta}_t$, with

1. $\boldsymbol{\zeta}^{(o)} \sim MVN\left(\mathbf{0},\, \tau^{(o)}\mathbf{K}^{(o)\prime}\boldsymbol{\Omega}\mathbf{K}^{(o)}\right)$, where $\mathbf{K}^{(o)}$ is the eigenvector matrix for $\mathbf{P}^{(o)\perp}\mathbf{A}\mathbf{P}^{(o)\perp}$, and
2. $\boldsymbol{\zeta}_t \sim MVN\left(\mathbf{0},\, \tau_t\mathbf{K}'_t\boldsymbol{\Omega}\mathbf{K}_t\right)$, where $\mathbf{K}_t$ is the eigenvector matrix for $\mathbf{P}^{\perp}_t\mathbf{A}\mathbf{P}^{\perp}_t$, for $t = 2, 3, \ldots, T$.
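The construction of $\mathbf{K}$ is purely a linear-algebra step. The following minimal Python sketch builds the Moran operator for a given design matrix and adjacency matrix and keeps its leading eigenvectors; the function name, the toy chain graph, and the choice of basis size are illustrative assumptions.

```python
import numpy as np

def moran_basis(X, A, q):
    """Leading q eigenvectors of the Moran operator P_perp A P_perp."""
    n = X.shape[0]
    P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # I - X(X'X)^{-1}X'
    M = P_perp @ A @ P_perp                                  # Moran operator
    eigval, eigvec = np.linalg.eigh(M)                       # symmetric eigendecomposition
    order = np.argsort(eigval)[::-1]                         # largest Moran's I first
    return eigvec[:, order[:q]]

# toy example: a chain of 10 sites with an intercept-only design matrix
A = np.diag(np.ones(9), 1) + np.diag(np.ones(9), -1)
X = np.ones((10, 1))
K = moran_basis(X, A, q=3)   # 10 x 3 basis of spatial patterns residual to X
```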

The algorithms for the probit and logit links from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for $(\boldsymbol{\alpha}, \boldsymbol{\zeta}^{(o)})$ and $(\boldsymbol{\beta}^{(c)}_{t-1}, \delta^{(s)}_{t-1}, \boldsymbol{\zeta}_t)$ and making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary

With a few exceptions (Dorazio & Taylor-Rodríguez, 2012; Johnson et al., 2013; Royle & Kéry, 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.

In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, and which allows detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993), where the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and logit links yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single season model to work with the logit link function.

The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence, its assumptions may not be appropriate to address problems where the interest lies in the temporal dynamics of the population. We therefore developed a dynamic model that incorporates the notion that occupancy at a previously occupied site takes place through persistence, which depends both on survival and on habitat suitability. By this we mean that a site occupied at time $t$ may again be occupied at time $t+1$ if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by $\mathbf{x}'_{i(t-1)}\boldsymbol{\beta}^{(c)}_{t-1}$. They only differ in that persistence is also influenced by whether occupancy of the site during season $t-1$ enhances the suitability of the site or harms it through density dependence.

Additionally, the study of the dynamics that govern the distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this chapter, we described a straightforward strategy to add spatial dependence among neighboring sites to the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).

Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the chapter, and (2) build a suite of additional extensions of this framework for occupancy models. The first of these extensions will be used to incorporate information from different sources, such as tracks, scats, surveys, and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy where the source and spatial scale of the data are accounted for. The second extension is a single season, spatially explicit, multiple species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.

CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes, The Sign of Four

3.1 Introduction

Occupancy models are often used to understand the mechanisms that dictate the distribution of a species. Therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and, with a few exceptions (Hooten & Hobbs, 2014; Link & Barker, 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). In practice, this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the Cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores the entire model space in a principled manner. It is completely automatic, precluding the need both for tuning parameters in the sampling algorithm and for subjective elicitation of parameter prior distributions.

As mentioned above, in ecological modeling, if model selection (or, less frequently, model averaging) is considered, the Akaike Information Criterion (AIC) (Akaike, 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). The AIC is designed to find the model whose density is, on average, closest in Kullback-Leibler distance to the density of the true data generating mechanism; the model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman, 2000). Commonly, the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal-to-noise ratio, and as such it tends to overfit (Rao & Wu, 2001). Other versions of the AIC provide a bias correction that enhances the signal-to-noise ratio, leading to a stronger penalization for model complexity; some examples are the AICc (Hurvich & Tsai, 1989) and AICu (McQuarrie et al., 1997). However, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu, 2001).

If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data generating mechanism is among those considered, asymptotically Bayesian methods choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.

In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Alternatively, there may be a prohibitively large number of parameters, making specifying priors for each of these parameters an arduous task. In addition, seemingly innocuous subjective choices for the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman, 1996).

One popular choice of "objective" prior is the reference prior (Berger & Bernardo, 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling with equal coverage probability (Kass & Wasserman, 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; the extension introduced by Moreno et al. (1998), however, generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.

In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion about model priors is deferred until Chapter 4; this chapter focuses on the priors on the parameter space.

The chapter is structured as follows. First, issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and we analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez, 2012; Kéry et al., 2010).

3.2 Objective Bayesian Inference

As mentioned before, in practice noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman, 1996), and characterizes the way in which Bayesian reference methods are interpreted today; i.e., noninformative priors are seen to be chosen by convention according to the situation.

A word of caution must be given when using noninformative priors. Difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number) and, as such, are said to depend on arbitrary constants.

Bayes factors strongly depend upon the prior distributions for the parameters included in each of the models being compared. This can be an important limitation, considering that when noninformative priors are used, the Bayes factors become a function of the ratio of arbitrary constants, given that these priors are typically improper (see Jeffreys, 1961; Pericchi, 2005; and references therein). Many different approaches have since been developed to deal with the arbitrary constants that arise when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi, 1996; Good, 1950; Lempers, 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith, 1982), and approximating the Bayes factor (see Haughton, 1988, as cited in Berger & Pericchi, 1996; Kass & Raftery, 1995; Tierney & Kadane, 1986).

3.2.1 The Intrinsic Methodology

Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained by combining the model structure with information contained in the observed data. Furthermore, they showed that, as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.

Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure". This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended, with some caution, to nonnested models.

As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert, 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of the larger alternative models. Hence, priors for the larger alternative models that place probability mass away from the null model are wasteful: if the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey, 1974), is widely recognized by statisticians.

Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model and, in spite of being a limiting procedure, it is not subject to the Jeffreys-Lindley paradox.

Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent over the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors

As previously mentioned, in the Bayesian paradigm a model $M \in \mathcal{M}$ is defined by a sampling density and a prior distribution. The sampling density associated with model $M$ is denoted by $f(\mathbf{y}\,|\,\boldsymbol{\beta}_M, \sigma^2_M, M)$, where $(\boldsymbol{\beta}_M, \sigma^2_M)$ is a vector of model-specific unknown parameters. The prior for model $M$ and its corresponding set of parameters is
$$\pi(\boldsymbol{\beta}_M, \sigma^2_M, M\,|\,\mathcal{M}) = \pi(\boldsymbol{\beta}_M, \sigma^2_M\,|\,M, \mathcal{M})\cdot\pi(M\,|\,\mathcal{M}).$$
Objective local priors for the model parameters $(\boldsymbol{\beta}_M, \sigma^2_M)$ are achieved through modifications and extensions of Zellner's g-prior (Liang et al., 2008; Womack et al., 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion of priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors

An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi, 1996; Moreno et al., 1998). Because $M_B \subseteq M$ for all $M \in \mathcal{M}$, the intrinsic prior for $(\boldsymbol{\beta}_M, \sigma^2_M)$ is defined as an expected posterior prior,
$$\pi^{I}(\boldsymbol{\beta}_M, \sigma^2_M\,|\,M) = \int p^{R}(\boldsymbol{\beta}_M, \sigma^2_M\,|\,\tilde{\mathbf{y}}, M)\, m^{R}(\tilde{\mathbf{y}}\,|\,M_B)\, d\tilde{\mathbf{y}}, \qquad (3\text{--}1)$$
where $\tilde{\mathbf{y}}$ is a minimal training sample for model $M$, $I$ denotes intrinsic distributions, and $R$ denotes distributions derived from the reference prior $\pi^{R}(\boldsymbol{\beta}_M, \sigma^2_M\,|\,M) = c_M/\sigma^2_M$. In (3-1), $m^{R}(\tilde{\mathbf{y}}\,|\,M) = \iint f(\tilde{\mathbf{y}}\,|\,\boldsymbol{\beta}_M, \sigma^2_M, M)\,\pi^{R}(\boldsymbol{\beta}_M, \sigma^2_M\,|\,M)\,d\boldsymbol{\beta}_M\,d\sigma^2_M$ is the reference marginal of $\tilde{\mathbf{y}}$ under model $M$, and $p^{R}(\boldsymbol{\beta}_M, \sigma^2_M\,|\,\tilde{\mathbf{y}}, M) = f(\tilde{\mathbf{y}}\,|\,\boldsymbol{\beta}_M, \sigma^2_M, M)\,\pi^{R}(\boldsymbol{\beta}_M, \sigma^2_M\,|\,M)\,/\,m^{R}(\tilde{\mathbf{y}}\,|\,M)$ is the reference posterior density.

In the regression framework, the reference marginal $m^R$ is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model $M$ to the base model $M_B$ is well-defined and given by
$$BF^{I}_{M,M_B}(\mathbf{y}) = \left(1-R^2_M\right)^{-\frac{n-|M_B|}{2}}\int_0^1\left(\frac{n + \sin^2\!\left(\tfrac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{1-R^2_M}}\right)^{\frac{n-|M|}{2}}\left(\frac{\sin^2\!\left(\tfrac{\pi}{2}\theta\right)(|M|+1)}{n + \frac{\sin^2\left(\frac{\pi}{2}\theta\right)(|M|+1)}{1-R^2_M}}\right)^{\frac{|M|-|M_B|}{2}}d\theta, \qquad (3\text{--}2)$$

where $R^2_M$ is the coefficient of determination of model $M$ versus model $M_B$. The Bayes factor between two models $M$ and $M'$ is defined as $BF^{I}_{M,M'}(\mathbf{y}) = BF^{I}_{M,M_B}(\mathbf{y})/BF^{I}_{M',M_B}(\mathbf{y})$.

The "goodness" of the model $M$ based on the intrinsic priors is given by its posterior probability
$$p^{I}(M\,|\,\mathbf{y}, \mathcal{M}) = \frac{BF^{I}_{M,M_B}(\mathbf{y})\,\pi(M\,|\,\mathcal{M})}{\sum_{M'\in\mathcal{M}} BF^{I}_{M',M_B}(\mathbf{y})\,\pi(M'\,|\,\mathcal{M})}. \qquad (3\text{--}3)$$
It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al., 2009; Girón et al., 2010). In the context of well-formulated models, the true model $M_T$ is the smallest well-formulated model $M\in\mathcal{M}$ such that $\alpha \in M$ whenever $\beta_\alpha \neq 0$. If $M_T$ is the true model, then the posterior probability of model $M_T$ based on Equation 3-3 converges to 1.

3.2.2.2 Other mixtures of g-priors

Scaled mixtures of g-priors place a reference prior on $(\boldsymbol{\beta}_{M_B}, \sigma^2)$ and a multivariate normal distribution on $\boldsymbol{\beta}$ in $M \setminus M_B$; that is, a normal with mean $\mathbf{0}$ and precision matrix
$$\frac{q_M w}{n\sigma^2}\,\mathbf{Z}'_M(\mathbf{I}-\mathbf{H}_0)\mathbf{Z}_M,$$
where $\mathbf{H}_0$ is the hat matrix associated with $\mathbf{Z}_{M_B}$. The prior is completed by a prior on $w$ and a choice of scaling $q_M$, which is set to $|M|+1$ to account for the minimal sample size of $M$. Under these assumptions, the Bayes factor of $M$ to $M_B$ is given by
$$BF_{M,M_B}(\mathbf{y}) = \left(1-R^2_M\right)^{\frac{n-|M_B|}{2}}\int\left(\frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}}\right)^{\frac{n-|M|}{2}}\left(\frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1-R^2_M}}\right)^{\frac{|M|-|M_B|}{2}}\pi(w)\,dw.$$

We consider the following priors on $w$. The intrinsic prior is $\pi(w) = \text{Beta}(w;\, 0.5, 0.5)$, which is only defined for $w\in(0,1)$. A version of the Zellner-Siow prior is given by $w\sim\text{Gamma}(0.5, 0.5)$, which produces a multivariate Cauchy distribution on $\boldsymbol{\beta}$. A family of hyper-g priors is defined by $\pi(w)\propto w^{-1/2}(\beta+w)^{-(\alpha+1)/2}$, which has Cauchy-like tails but produces more shrinkage than the Cauchy prior.
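For a fixed model, the one-dimensional integral in the Bayes factor above can be evaluated numerically. The following is a minimal Python sketch under the intrinsic prior $w\sim\text{Beta}(0.5, 0.5)$; the numerical values of $R^2_M$, $n$, $|M|$, and $|M_B|$ are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

def bf_mixture_g(R2, n, pM, p0, w_pdf):
    """Bayes factor of M to M_B under a scaled mixture of g-priors."""
    def integrand(w):
        d = n + w * (pM + 1) / (1.0 - R2)
        return ((n + w * (pM + 1)) / d) ** ((n - pM) / 2.0) * \
               ((w * (pM + 1)) / d) ** ((pM - p0) / 2.0) * w_pdf(w)
    val, _ = quad(integrand, 0.0, 1.0)
    return (1.0 - R2) ** ((n - p0) / 2.0) * val

# intrinsic prior on w: Beta(0.5, 0.5) on (0, 1)
bf = bf_mixture_g(R2=0.4, n=100, pM=4, p0=1,
                  w_pdf=lambda w: beta.pdf(w, 0.5, 0.5))
```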

3.3 Objective Bayes Occupancy Model Selection

As mentioned before, objective Bayesian inferential approaches for ecological models are lacking. In particular, there is a need for suitable objective and automatic Bayesian testing procedures, and for software implementations that explore the model space considered thoroughly. With this goal in mind, in this section we develop an objective, intrinsic, and fully automatic Bayesian model selection methodology for single season site-occupancy models. We refer to this method as automatic and objective given that its implementation requires no hyperparameter tuning and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).

An inferential method for the occupancy problem is possible using the intrinsic approach because we are able to link intrinsic-Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single season probit occupancy model through the hierarchy

$$y_{ij}\,|\,z_i, w_{ij} \sim \text{Bernoulli}\left(z_i I_{\{w_{ij}>0\}}\right), \qquad w_{ij}\,|\,\boldsymbol{\lambda} \sim N\left(\mathbf{q}'_{ij}\boldsymbol{\lambda},\, 1\right),$$
$$z_i\,|\,v_i \sim \text{Bernoulli}\left(I_{\{v_i>0\}}\right), \qquad v_i\,|\,\boldsymbol{\alpha} \sim N\left(\mathbf{x}'_i\boldsymbol{\alpha},\, 1\right),$$

it is possible to solve the selection problem on the latent-scale variables $w_{ij}$ and $v_i$, and to use those results at the level of the occupancy and detection processes.

In what follows, first we provide some necessary notation. Then a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated in a model selection algorithm for site-occupancy data. Although the priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.

3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well. Namely, presence is denoted by $z$, detection by $y$, their corresponding latent processes are $v$ and $w$, and the model parameters are denoted by $\boldsymbol{\alpha}$ and $\boldsymbol{\lambda}$. However, some additional notation is also necessary. Let $M_0 = \{M_{0y}, M_{0z}\}$ denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models $M_{0y}$ and $M_{0z}$ include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including some predictors important to the researcher that should be included in every model.

Furthermore, let the sets $[K_z] = \{1, 2, \ldots, K_z\}$ and $[K_y] = \{1, 2, \ldots, K_y\}$ index the covariates considered for the variable selection procedure for the presence and detection processes, respectively. That is, these sets denote the covariates that can be added to the base models in $M_0$, or removed from the largest possible models considered, $M_{Fz}$ and $M_{Fy}$, which we will refer to as the "full" models. The model space can then be represented by the Cartesian product of subsets $A_y \subseteq [K_y]$ and $A_z \subseteq [K_z]$. The entire model space is populated by models of the form $M_A = \{M_{A_y}, M_{A_z}\} \in \mathcal{M} = \mathcal{M}_y \times \mathcal{M}_z$, with $M_{A_y} \in \mathcal{M}_y$ and $M_{A_z} \in \mathcal{M}_z$.

For the presence process $z$, the design matrix for model $M_{A_z}$ is given by the block matrix $\mathbf{X}_{A_z} = (\mathbf{X}_0\,|\,\mathbf{X}_{r,A})$; $\mathbf{X}_0$ corresponds to the design matrix of the base model (which is such that $M_{0z} \subseteq M_{A_z} \in \mathcal{M}_z$ for all $A_z \subseteq [K_z]$), and $\mathbf{X}_{r,A}$ corresponds to the submatrix that contains the covariates indexed by $A_z$. Analogously, for the detection process $y$, the design matrix is given by $\mathbf{Q}_{A_y} = (\mathbf{Q}_0\,|\,\mathbf{Q}_{r,A})$. Similarly, the coefficients for models $M_{A_z}$ and $M_{A_y}$ are given by $\boldsymbol{\alpha}_A = (\boldsymbol{\alpha}'_0, \boldsymbol{\alpha}'_{r,A})'$ and $\boldsymbol{\lambda}_A = (\boldsymbol{\lambda}'_0, \boldsymbol{\lambda}'_{r,A})'$.

With these elements in place, the model selection problem consists of finding subsets of covariates, indexed by $A = \{A_z, A_y\}$, that have high posterior probability given the detection and occupancy processes. This is equivalent to finding models with

high posterior odds when compared to a suitable base model. These posterior odds are given by
$$\frac{p(M_A\,|\,\mathbf{y}, \mathbf{z})}{p(M_0\,|\,\mathbf{y}, \mathbf{z})} = \frac{m(\mathbf{y}, \mathbf{z}\,|\,M_A)\,\pi(M_A)}{m(\mathbf{y}, \mathbf{z}\,|\,M_0)\,\pi(M_0)} = BF_{M_A,M_0}(\mathbf{y}, \mathbf{z})\,\frac{\pi(M_A)}{\pi(M_0)}.$$

Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem on the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that depends solely on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as it yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.

At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for simple probit regression. In the occupancy setting, the argument is
$$p(M_A\,|\,\mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v}) = \frac{m(\mathbf{y}, \mathbf{z}, \mathbf{v}, \mathbf{w}\,|\,M_A)\,\pi(M_A)}{m(\mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})}$$
$$= \frac{f_{yz}(\mathbf{y}, \mathbf{z}\,|\,\mathbf{w}, \mathbf{v})\left(\int f_{vw}(\mathbf{v}, \mathbf{w}\,|\,\boldsymbol{\alpha}, \boldsymbol{\lambda}, M_A)\,\pi_{\alpha\lambda}(\boldsymbol{\alpha}, \boldsymbol{\lambda}\,|\,M_A)\,d(\boldsymbol{\alpha}, \boldsymbol{\lambda})\right)\pi(M_A)}{f_{yz}(\mathbf{y}, \mathbf{z}\,|\,\mathbf{w}, \mathbf{v})\sum_{M^*\in\mathcal{M}}\left(\int f_{vw}(\mathbf{v}, \mathbf{w}\,|\,\boldsymbol{\alpha}, \boldsymbol{\lambda}, M^*)\,\pi_{\alpha\lambda}(\boldsymbol{\alpha}, \boldsymbol{\lambda}\,|\,M^*)\,d(\boldsymbol{\alpha}, \boldsymbol{\lambda})\right)\pi(M^*)}$$
$$= \frac{m(\mathbf{v}\,|\,M_{A_z})\,m(\mathbf{w}\,|\,M_{A_y})\,\pi(M_A)}{m(\mathbf{v})\,m(\mathbf{w})} \propto m(\mathbf{v}\,|\,M_{A_z})\,m(\mathbf{w}\,|\,M_{A_y})\,\pi(M_A), \qquad (3\text{--}4)$$
where

where

1 fyz(y z|w v) =prodN

i=1 Izivigt0I

(1minuszi )vile0

prodJ

j=1(ziIwijgt0)yij (1minus ziIwijgt0)

1minusyij

2 fvw(vw|αλMA) =

(Nprodi=1

ϕ(vi xprimeiαMAz

1)

)︸ ︷︷ ︸

f (v|αr Aα0MAz )

(Nprodi=1

Jiprodj=1

ϕ(wij qprimeijλMAy

1)

)︸ ︷︷ ︸

f (w|λr Aλ0MAy )

and

59

3 παλ(αλ|MA) = πα(α|MAz)πλ(λ|MAy

)

This result implies that, once the occupancy and detection indicators are conditioned on the latent processes $\mathbf{v}$ and $\mathbf{w}$, respectively, the model posterior probabilities only depend on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds
$$\frac{p(M_A\,|\,\mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})}{p(M_0\,|\,\mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})} = \frac{m(\mathbf{w}, \mathbf{v}\,|\,M_A)}{m(\mathbf{w}, \mathbf{v}\,|\,M_0)}\,\frac{\pi(M_A)}{\pi(M_0)}, \qquad (3\text{--}5)$$

where $m(\mathbf{w}, \mathbf{v}\,|\,M_A) = m(\mathbf{w}\,|\,M_{A_y})\cdot m(\mathbf{v}\,|\,M_{A_z})$, with
$$m(\mathbf{v}\,|\,M_{A_z}) = \iint f(\mathbf{v}\,|\,\boldsymbol{\alpha}_{r,A}, \boldsymbol{\alpha}_0, M_{A_z})\,\pi(\boldsymbol{\alpha}_{r,A}\,|\,\boldsymbol{\alpha}_0, M_{A_z})\,\pi(\boldsymbol{\alpha}_0)\,d\boldsymbol{\alpha}_{r,A}\,d\boldsymbol{\alpha}_0, \qquad (3\text{--}6)$$
$$m(\mathbf{w}\,|\,M_{A_y}) = \iint f(\mathbf{w}\,|\,\boldsymbol{\lambda}_{r,A}, \boldsymbol{\lambda}_0, M_{A_y})\,\pi(\boldsymbol{\lambda}_{r,A}\,|\,\boldsymbol{\lambda}_0, M_{A_y})\,\pi(\boldsymbol{\lambda}_0)\,d\boldsymbol{\lambda}_0\,d\boldsymbol{\lambda}_{r,A}. \qquad (3\text{--}7)$$

3.3.2 Intrinsic Priors for the Occupancy Problem

In general, the intrinsic priors as defined by Moreno et al. (1998) use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes we assume noninformative improper priors for the parameters, denoted by $\pi^{N}(\cdot\,|\,\cdot)$. Specifically, the intrinsic priors $\pi^{IP}(\boldsymbol{\theta}_{M^*}\,|\,M^*)$, for a vector of parameters $\boldsymbol{\theta}_{M^*}$ corresponding to model $M^* \in \{M_0, M\} \subset \mathcal{M}$ and a response vector $\mathbf{s}$ with probability density (or mass) function $f(\mathbf{s}\,|\,\boldsymbol{\theta}_{M^*})$, are defined by
$$\pi^{IP}(\boldsymbol{\theta}_{M_0}\,|\,M_0) = \pi^{N}(\boldsymbol{\theta}_{M_0}\,|\,M_0),$$
$$\pi^{IP}(\boldsymbol{\theta}_{M}\,|\,M) = \pi^{N}(\boldsymbol{\theta}_{M}\,|\,M)\int \frac{m(\tilde{\mathbf{s}}\,|\,M)}{m(\tilde{\mathbf{s}}\,|\,M_0)}\,f(\tilde{\mathbf{s}}\,|\,\boldsymbol{\theta}_{M}, M)\,d\tilde{\mathbf{s}},$$
where $\tilde{\mathbf{s}}$ is a theoretical training sample.

where ~s is a theoretical training sample

In what follows whenever it is clear from the context in an attempt to simplify the

notation MA will be used to refer to MAzor MAy

and A will denote Az or Ay To derive

60

the parameter priors involved in equations 3ndash6 and 3ndash7 using the objective intrinsic prior

strategy we start by assuming flat priors πN(αA|MA) prop cA and πN(λA|MA) prop dA where

cA and dA are unknown constants

The intrinsic prior for the parameters associated with the occupancy process, $\boldsymbol{\alpha}_A$, conditional on model $M_A$, is
$$\pi^{IP}(\boldsymbol{\alpha}_A\,|\,M_A) = \pi^{N}(\boldsymbol{\alpha}_A\,|\,M_A)\int \frac{m(\tilde{\mathbf{v}}\,|\,M_A)}{m(\tilde{\mathbf{v}}\,|\,M_0)}\,f(\tilde{\mathbf{v}}\,|\,\boldsymbol{\alpha}_A, M_A)\,d\tilde{\mathbf{v}},$$
where the marginals $m(\tilde{\mathbf{v}}\,|\,M_j)$, with $j \in \{A, 0\}$, are obtained by solving the analogue of Equation 3-6 for the (theoretical) training sample $\tilde{\mathbf{v}}$. These marginals are given by
$$m(\tilde{\mathbf{v}}\,|\,M_j) = c_j\,(2\pi)^{\frac{p_j-p_{0z}}{2}}\,|\tilde{\mathbf{X}}'_j\tilde{\mathbf{X}}_j|^{\frac{1}{2}}\,e^{-\frac{1}{2}\tilde{\mathbf{v}}'(\mathbf{I}-\tilde{\mathbf{H}}_j)\tilde{\mathbf{v}}}.$$
The training sample $\tilde{\mathbf{v}}$ has dimension $p_{A_z}=\left|M_{A_z}\right|$, that is, the total number of parameters in model $M_{A_z}$. Note that, without ambiguity, we use $|\cdot|$ to denote both the cardinality of a set and the determinant of a matrix. The design matrix $\tilde{\mathbf{X}}_A$ corresponds to the training sample $\tilde{\mathbf{v}}$ and is chosen such that $\tilde{\mathbf{X}}'_A\tilde{\mathbf{X}}_A = \frac{p_{A_z}}{N}\mathbf{X}'_A\mathbf{X}_A$ (Leon-Novelo et al., 2012), and $\tilde{\mathbf{H}}_j$ is the corresponding hat matrix.

Replacing $m(\tilde{\mathbf{v}}\,|\,M_A)$ and $m(\tilde{\mathbf{v}}\,|\,M_0)$ in $\pi^{IP}(\boldsymbol{\alpha}_A\,|\,M_A)$, and solving the integral with respect to the theoretical training sample $\tilde{\mathbf{v}}$, we have
$$\pi^{IP}(\boldsymbol{\alpha}_A\,|\,M_A) = c_A\int\left((2\pi)^{-\frac{p_{A_z}-p_{0z}}{2}}\left(\frac{c_0}{c_A}\right)e^{-\frac{1}{2}\tilde{\mathbf{v}}'\left((\mathbf{I}-\tilde{\mathbf{H}}_A)-(\mathbf{I}-\tilde{\mathbf{H}}_0)\right)\tilde{\mathbf{v}}}\,\frac{|\tilde{\mathbf{X}}'_A\tilde{\mathbf{X}}_A|^{1/2}}{|\tilde{\mathbf{X}}'_0\tilde{\mathbf{X}}_0|^{1/2}}\right)\left((2\pi)^{-\frac{p_{A_z}}{2}}e^{-\frac{1}{2}(\tilde{\mathbf{v}}-\tilde{\mathbf{X}}_A\boldsymbol{\alpha}_A)'(\tilde{\mathbf{v}}-\tilde{\mathbf{X}}_A\boldsymbol{\alpha}_A)}\right)d\tilde{\mathbf{v}}$$
$$= c_0\,(2\pi)^{-\frac{p_{A_z}-p_{0z}}{2}}\,|\tilde{\mathbf{X}}'_{r,A}\tilde{\mathbf{X}}_{r,A}|^{\frac{1}{2}}\,2^{-\frac{p_{A_z}-p_{0z}}{2}}\exp\left[-\frac{1}{2}\boldsymbol{\alpha}'_{r,A}\left(\frac{1}{2}\tilde{\mathbf{X}}'_{r,A}\tilde{\mathbf{X}}_{r,A}\right)\boldsymbol{\alpha}_{r,A}\right]$$
$$= \pi^{N}(\boldsymbol{\alpha}_0)\times N\left(\boldsymbol{\alpha}_{r,A}\,\big|\,\mathbf{0},\; 2\cdot\left(\tilde{\mathbf{X}}'_{r,A}\tilde{\mathbf{X}}_{r,A}\right)^{-1}\right). \qquad (3\text{--}8)$$

Analogously, the intrinsic prior for the parameters associated with the detection process is
$$\pi^{IP}(\boldsymbol{\lambda}_A\,|\,M_A) = d_0\,(2\pi)^{-\frac{p_{A_y}-p_{0y}}{2}}\,|\tilde{\mathbf{Q}}'_{r,A}\tilde{\mathbf{Q}}_{r,A}|^{\frac{1}{2}}\,2^{-\frac{p_{A_y}-p_{0y}}{2}}\exp\left[-\frac{1}{2}\boldsymbol{\lambda}'_{r,A}\left(\frac{1}{2}\tilde{\mathbf{Q}}'_{r,A}\tilde{\mathbf{Q}}_{r,A}\right)\boldsymbol{\lambda}_{r,A}\right]$$
$$= \pi^{N}(\boldsymbol{\lambda}_0)\times N\left(\boldsymbol{\lambda}_{r,A}\,\big|\,\mathbf{0},\; 2\cdot\left(\tilde{\mathbf{Q}}'_{r,A}\tilde{\mathbf{Q}}_{r,A}\right)^{-1}\right). \qquad (3\text{--}9)$$

(3ndash9)

In short the intrinsic priors for αA = (αprime0α

primer A)

prime and λprimeA = (λprime

0λprimer A)

prime are the product

of a reference prior on the parameters of the base model and a normal density on the

parameters indexed by Az and Ay respectively333 Model Posterior Probabilities

We now derive the expressions involved in the calculations of the model posterior

probabilities First recall that p(MA|y zw v) prop m(w v|MA)π(MA) Hence determining

this posterior probability only requires calculating m(w v|MA)

Note that since w and v are independent obtaining the model posteriors from

expression 3ndash4 reduces to finding closed form expressions for the marginals m(v |MAz)

and m(w |MAy) respectively from equations 3ndash6 and 3ndash7 Therefore

m(w v|MA) =

int intf (vw|αλMA)π

IP (α|MAz)πIP

(λ|MAy

)dαdλ

(3ndash10)

For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3-8 into Equation 3-6 (recalling that $\tilde{\mathbf{X}}'_A\tilde{\mathbf{X}}_A = \frac{p_{A_z}}{N}\mathbf{X}'_A\mathbf{X}_A$) and integrating out $\boldsymbol{\alpha}_A$ yields
$$m(\mathbf{v}\,|\,M_A) = \iint c_0\,N\left(\mathbf{v}\,\big|\,\mathbf{X}_0\boldsymbol{\alpha}_0 + \mathbf{X}_{r,A}\boldsymbol{\alpha}_{r,A},\, \mathbf{I}\right)\,N\left(\boldsymbol{\alpha}_{r,A}\,\big|\,\mathbf{0},\, 2(\tilde{\mathbf{X}}'_{r,A}\tilde{\mathbf{X}}_{r,A})^{-1}\right)d\boldsymbol{\alpha}_{r,A}\,d\boldsymbol{\alpha}_0$$
$$= c_0\,(2\pi)^{-n/2}\int\left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0z}}{2}}\exp\left[-\frac{1}{2}(\mathbf{v} - \mathbf{X}_0\boldsymbol{\alpha}_0)'\left(\mathbf{I} - \left(\frac{2N}{2N + p_{A_z}}\right)\mathbf{H}_{r,A_z}\right)(\mathbf{v} - \mathbf{X}_0\boldsymbol{\alpha}_0)\right]d\boldsymbol{\alpha}_0$$
$$= c_0\,(2\pi)^{-(n-p_{0z})/2}\left(\frac{p_{A_z}}{2N + p_{A_z}}\right)^{\frac{p_{A_z}-p_{0z}}{2}}|\mathbf{X}'_0\mathbf{X}_0|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}\mathbf{v}'\left(\mathbf{I} - \mathbf{H}_{0z} - \left(\frac{2N}{2N + p_{A_z}}\right)\mathbf{H}_{r,A_z}\right)\mathbf{v}\right], \qquad (3\text{--}11)$$
with $\mathbf{H}_{r,A_z} = \mathbf{H}_{A_z} - \mathbf{H}_{0z}$, where $\mathbf{H}_{A_z}$ is the hat matrix for the entire model $M_{A_z}$ and $\mathbf{H}_{0z}$ is the hat matrix for the base model.

with Hr Az= HAz

minus H0z where HAzis the hat matrix for the entire model MAz

and H0z is

the hat matrix for the base model

Similarly the marginal distribution for w is

m(w|MA) = d0 (2π)minus(Jminusp0y )2

(pAy

2J + pAy

) (pAyminusp0y

)

2

|Q prime0Q0|minus

12 times

exp[minus1

2wprime(I minus H0y minus

(2J

2J + pAy

)Hr Ay

)w

] (3ndash12)

where J =sumN

i=1 Ji or in other words J denotes the total number of surveys conducted

Now, the marginals under the base model $M_0 = \{M_{0y}, M_{0z}\}$ are
$$m(\mathbf{v}\,|\,M_0) = \int c_0\,N\left(\mathbf{v}\,\big|\,\mathbf{X}_0\boldsymbol{\alpha}_0,\, \mathbf{I}\right)d\boldsymbol{\alpha}_0 = c_0\,(2\pi)^{-(n-p_{0z})/2}\,|\mathbf{X}'_0\mathbf{X}_0|^{-1/2}\exp\left[-\frac{1}{2}\mathbf{v}'(\mathbf{I} - \mathbf{H}_{0z})\mathbf{v}\right] \qquad (3\text{--}13)$$
and
$$m(\mathbf{w}\,|\,M_0) = d_0\,(2\pi)^{-(J-p_{0y})/2}\,|\mathbf{Q}'_0\mathbf{Q}_0|^{-1/2}\exp\left[-\frac{1}{2}\mathbf{w}'(\mathbf{I} - \mathbf{H}_{0y})\mathbf{w}\right]. \qquad (3\text{--}14)$$
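Because the odds against the base model only involve the ratio of these marginals, they reduce to a simple quadratic form in the latent vector. The following Python sketch evaluates $\log m(\mathbf{v}\,|\,M_A) - \log m(\mathbf{v}\,|\,M_0)$ implied by Equations 3-11 and 3-13; the inputs are illustrative, and the analogous computation for $\mathbf{w}$ replaces $N$ by $J$ and $\mathbf{X}$ by $\mathbf{Q}$.

```python
import numpy as np

def hat(X):
    """Hat matrix X (X'X)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def log_marginal_ratio(v, X0, Xr):
    """log m(v | M_A) - log m(v | M_0) for the occupancy latent vector v."""
    N = len(v)
    XA = np.column_stack([X0, Xr])
    pA, p0 = XA.shape[1], X0.shape[1]     # p_{A_z} and p_{0z}
    Hr = hat(XA) - hat(X0)                # residual hat matrix H_{r,A_z}
    quad_form = v @ Hr @ v
    return 0.5 * (pA - p0) * np.log(pA / (2 * N + pA)) + \
           0.5 * (2 * N / (2 * N + pA)) * quad_form
```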

3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place, and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.

For each of the two components of the model (occupancy and detection), the algorithm first draws the set of active predictors (i.e., $A_z$ and $A_y$) together with their corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings correction with proposal distributions given by

$$q\left(A^*_z\,\big|\,\mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)}, M_{A_z}\right) = \frac{1}{2}\left(p\left(M_{A^*_z}\,\big|\,\mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)}, \mathcal{M}_z, M_{A^*_z}\in L(M_{A_z})\right) + \frac{1}{|L(M_{A_z})|}\right),$$
$$q\left(A^*_y\,\big|\,\mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)}, M_{A_y}\right) = \frac{1}{2}\left(p\left(M_{A^*_y}\,\big|\,\mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)}, \mathcal{M}_y, M_{A^*_y}\in L(M_{A_y})\right) + \frac{1}{|L(M_{A_y})|}\right), \qquad (3\text{--}15)$$
where $L(M_{A_z})$ and $L(M_{A_y})$ denote the sets of models obtained by adding or removing one predictor at a time from $M_{A_z}$ and $M_{A_y}$, respectively.
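The proposal in Equation 3-15 is a mixture of a posterior-weighted draw over the one-change neighborhood and a uniform draw over that same neighborhood. The Python sketch below constructs the neighborhood and draws from the mixture; here `log_post` is a hypothetical placeholder for a function returning the log of the unnormalized model posterior (in the actual algorithm these weights come from the conditional model posteriors given the latent variables).

```python
import numpy as np

def propose_neighbor(A, K, log_post, rng):
    """Draw A* from 0.5*(renormalized posterior over L(M_A)) + 0.5*uniform."""
    A = set(A)
    neighbors = [tuple(sorted(A ^ {k})) for k in range(K)]   # toggle one predictor
    lp = np.array([log_post(nb) for nb in neighbors])
    w = np.exp(lp - lp.max())
    probs = 0.5 * w / w.sum() + 0.5 / len(neighbors)
    idx = rng.choice(len(neighbors), p=probs)
    return neighbors[idx], probs[idx]        # proposed set and its proposal probability

rng = np.random.default_rng(0)
# toy scoring rule favoring predictors {0, 2} among K = 5 candidates (illustrative)
toy_log_post = lambda m: -float(len(set(m) ^ {0, 2}))
new_A, q_prob = propose_neighbor((0, 1), K=5, log_post=toy_log_post, rng=rng)
```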

To promote mixing, this step is followed by an additional draw from the full conditionals of $\boldsymbol{\alpha}$ and $\boldsymbol{\lambda}$. The densities $p(\boldsymbol{\alpha}_0\,|\,\cdot)$, $p(\boldsymbol{\alpha}_{r,A}\,|\,\cdot)$, $p(\boldsymbol{\lambda}_0\,|\,\cdot)$, and $p(\boldsymbol{\lambda}_{r,A}\,|\,\cdot)$ can be sampled from directly with Gibbs steps. Using the notation $a\,|\,\cdot$ to denote the random variable $a$ conditioned on all other parameters and on the data, these densities are given by

• $\boldsymbol{\alpha}_0\,|\,\cdot \sim N\left((\mathbf{X}'_0\mathbf{X}_0)^{-1}\mathbf{X}'_0\mathbf{v},\; (\mathbf{X}'_0\mathbf{X}_0)^{-1}\right)$,
• $\boldsymbol{\alpha}_{r,A}\,|\,\cdot \sim N\left(\boldsymbol{\mu}_{\alpha_{r,A}},\, \boldsymbol{\Sigma}_{\alpha_{r,A}}\right)$, where the covariance matrix and mean vector are given by $\boldsymbol{\Sigma}_{\alpha_{r,A}} = \frac{2N}{2N+p_{A_z}}(\mathbf{X}'_{r,A}\mathbf{X}_{r,A})^{-1}$ and $\boldsymbol{\mu}_{\alpha_{r,A}} = \boldsymbol{\Sigma}_{\alpha_{r,A}}\mathbf{X}'_{r,A}\mathbf{v}$,
• $\boldsymbol{\lambda}_0\,|\,\cdot \sim N\left((\mathbf{Q}'_0\mathbf{Q}_0)^{-1}\mathbf{Q}'_0\mathbf{w},\; (\mathbf{Q}'_0\mathbf{Q}_0)^{-1}\right)$, and
• $\boldsymbol{\lambda}_{r,A}\,|\,\cdot \sim N\left(\boldsymbol{\mu}_{\lambda_{r,A}},\, \boldsymbol{\Sigma}_{\lambda_{r,A}}\right)$, analogously, with $\boldsymbol{\Sigma}_{\lambda_{r,A}} = \frac{2J}{2J+p_{A_y}}(\mathbf{Q}'_{r,A}\mathbf{Q}_{r,A})^{-1}$ and $\boldsymbol{\mu}_{\lambda_{r,A}} = \boldsymbol{\Sigma}_{\lambda_{r,A}}\mathbf{Q}'_{r,A}\mathbf{w}$.

Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators $\mathbf{z}_u$ and for the corresponding latent variables $\mathbf{v}$ and $\mathbf{w}$. The full conditional posterior densities for $\mathbf{z}^{(t+1)}_u$, $\mathbf{v}^{(t+1)}$, and $\mathbf{w}^{(t+1)}$ are those introduced in Chapter 2 for the single season probit model.

The following steps summarize the stochastic search algorithm.

1. Initialize $A^{(0)}_y$, $A^{(0)}_z$, $\mathbf{z}^{(0)}_u$, $\mathbf{v}^{(0)}$, $\mathbf{w}^{(0)}$, $\boldsymbol{\alpha}^{(0)}_0$, $\boldsymbol{\lambda}^{(0)}_0$.

2. Sample the model indices and corresponding parameters:
(a) Draw simultaneously
• $A^*_z \sim q(A_z\,|\,\mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)}, M_{A_z})$,
• $\boldsymbol{\alpha}^*_0 \sim p(\boldsymbol{\alpha}_0\,|\,M_{A^*_z}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)})$, and
• $\boldsymbol{\alpha}^*_{r,A^*} \sim p(\boldsymbol{\alpha}_{r,A}\,|\,M_{A^*_z}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)})$.
(b) Accept $(M^{(t+1)}_{A_z}, \boldsymbol{\alpha}^{(t+1),1}_0, \boldsymbol{\alpha}^{(t+1),1}_{r,A}) = (M_{A^*_z}, \boldsymbol{\alpha}^*_0, \boldsymbol{\alpha}^*_{r,A^*})$ with probability
$$\delta_z = \min\left(1,\; \frac{p(M_{A^*_z}\,|\,\mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)})}{p(M_{A^{(t)}_z}\,|\,\mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)})}\,\frac{q(A^{(t)}_z\,|\,\mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)}, M_{A^*_z})}{q(A^*_z\,|\,\mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)}, M_{A_z})}\right);$$
otherwise let $(M^{(t+1)}_{A_z}, \boldsymbol{\alpha}^{(t+1),1}_0, \boldsymbol{\alpha}^{(t+1),1}_{r,A}) = (A^{(t)}_z, \boldsymbol{\alpha}^{(t),2}_0, \boldsymbol{\alpha}^{(t),2}_{r,A})$.
(c) Sample simultaneously
• $A^*_y \sim q(A_y\,|\,\mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)}, M_{A_y})$,
• $\boldsymbol{\lambda}^*_0 \sim p(\boldsymbol{\lambda}_0\,|\,M_{A^*_y}, \mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)})$, and
• $\boldsymbol{\lambda}^*_{r,A^*} \sim p(\boldsymbol{\lambda}_{r,A}\,|\,M_{A^*_y}, \mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)})$.
(d) Accept $(M^{(t+1)}_{A_y}, \boldsymbol{\lambda}^{(t+1),1}_0, \boldsymbol{\lambda}^{(t+1),1}_{r,A}) = (M_{A^*_y}, \boldsymbol{\lambda}^*_0, \boldsymbol{\lambda}^*_{r,A^*})$ with probability
$$\delta_y = \min\left(1,\; \frac{p(M_{A^*_y}\,|\,\mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)})}{p(M_{A^{(t)}_y}\,|\,\mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)})}\,\frac{q(A^{(t)}_y\,|\,\mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)}, M_{A^*_y})}{q(A^*_y\,|\,\mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{w}^{(t)}, M_{A_y})}\right);$$
otherwise let $(M^{(t+1)}_{A_y}, \boldsymbol{\lambda}^{(t+1),1}_0, \boldsymbol{\lambda}^{(t+1),1}_{r,A}) = (A^{(t)}_y, \boldsymbol{\lambda}^{(t),2}_0, \boldsymbol{\lambda}^{(t),2}_{r,A})$.

3. Sample base model parameters:
(a) Draw $\boldsymbol{\alpha}^{(t+1),2}_0 \sim p(\boldsymbol{\alpha}_0\,|\,M_{A^{(t+1)}_z}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)})$.
(b) Draw $\boldsymbol{\lambda}^{(t+1),2}_0 \sim p(\boldsymbol{\lambda}_0\,|\,M_{A^{(t+1)}_y}, \mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)})$.

4. To improve mixing, resample the model coefficients that are not in the base model but are in $M_A$:
(a) Draw $\boldsymbol{\alpha}^{(t+1),2}_{r,A} \sim p(\boldsymbol{\alpha}_{r,A}\,|\,M_{A^{(t+1)}_z}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)})$.
(b) Draw $\boldsymbol{\lambda}^{(t+1),2}_{r,A} \sim p(\boldsymbol{\lambda}_{r,A}\,|\,M_{A^{(t+1)}_y}, \mathbf{y}, \mathbf{z}_o, \mathbf{z}^{(t)}_u, \mathbf{v}^{(t)})$.

5. Sample latent and missing (unobserved) variables:
(a) Sample $\mathbf{z}^{(t+1)}_u \sim p(\mathbf{z}_u\,|\,M_{A^{(t+1)}_z}, \mathbf{y}, \boldsymbol{\alpha}^{(t+1),2}_{r,A}, \boldsymbol{\alpha}^{(t+1),2}_0, \boldsymbol{\lambda}^{(t+1),2}_{r,A}, \boldsymbol{\lambda}^{(t+1),2}_0)$.
(b) Sample $\mathbf{v}^{(t+1)} \sim p(\mathbf{v}\,|\,M_{A^{(t+1)}_z}, \mathbf{z}_o, \mathbf{z}^{(t+1)}_u, \boldsymbol{\alpha}^{(t+1),2}_{r,A}, \boldsymbol{\alpha}^{(t+1),2}_0)$.
(c) Sample $\mathbf{w}^{(t+1)} \sim p(\mathbf{w}\,|\,M_{A^{(t+1)}_y}, \mathbf{z}_o, \mathbf{z}^{(t+1)}_u, \boldsymbol{\lambda}^{(t+1),2}_{r,A}, \boldsymbol{\lambda}^{(t+1),2}_0)$.

3.4 Alternative Formulation

Because the occupancy process is partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections $\mathbf{y}$ and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, $\mathbf{z} = (\mathbf{z}'_o, \mathbf{z}'_u)'$, and integrating out the unobserved component, the model posterior for $M_A$ can be obtained as
$$p(M_A\,|\,\mathbf{y}, \mathbf{z}_o) \propto E_{\mathbf{z}_u}\left[m(\mathbf{y}, \mathbf{z}\,|\,M_A)\right]\pi(M_A). \qquad (3\text{--}16)$$

p(MA|y zo) prop Ezu [m(y z|MA)] π(MA) (3ndash16)

Data-augmenting the model in terms of latent normal variables a la Albert and Chib

the marginals for any model My Mz = M isin M of z and y inside of the expectation in

equation 3ndash16 can be expressed in terms of the latent variables

m(y z|M) =

intT (z)

intT (yz)

m(w v|M)dwdv

=

(intT (z)

m(v| Mz)dv

)(intT (yz)

m(w|My)dw

) (3ndash17)

where T (z) and T (y z) denote the corresponding truncation regions for v and w which

depend on the values taken by z and y and

$$m(\mathbf{v}\,|\,M_z) = \int f(\mathbf{v}\,|\,\boldsymbol{\alpha}, M_z)\,\pi(\boldsymbol{\alpha}\,|\,M_z)\,d\boldsymbol{\alpha}, \qquad (3\text{--}18)$$
$$m(\mathbf{w}\,|\,M_y) = \int f(\mathbf{w}\,|\,\boldsymbol{\lambda}, M_y)\,\pi(\boldsymbol{\lambda}\,|\,M_y)\,d\boldsymbol{\lambda}. \qquad (3\text{--}19)$$

The last equality in Equation 3-17 is a consequence of the independence of the latent processes $\mathbf{v}$ and $\mathbf{w}$. Using expressions 3-18 and 3-19 allows one to embed this model selection problem in the classical normal linear regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno et al., 1998) for this problem. This approach is an extension of the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.

Using this alternative approach, all that is left is to integrate $m(\mathbf{v}\,|\,M_A)$ and $m(\mathbf{w}\,|\,M_A)$ over their corresponding truncation regions $T(\mathbf{z})$ and $T(\mathbf{y}, \mathbf{z})$, which yields $m(\mathbf{y}, \mathbf{z}\,|\,M_A)$, and then to obtain the expectation with respect to the unobserved $\mathbf{z}_u$. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express $E[m(\mathbf{y}, \mathbf{z}\,|\,M_A)]$ as

$$E_{\mathbf{z}_u}[m(\mathbf{y}, \mathbf{z}\,|\,M_A)] = E_{\mathbf{z}_u}\left[\left(\int_{T(\mathbf{z})} m(\mathbf{v}\,|\,M_{A_z})\,d\mathbf{v}\right)\left(\int_{T(\mathbf{y},\mathbf{z})} m(\mathbf{w}\,|\,M_{A_y})\,d\mathbf{w}\right)\right] \qquad (3\text{--}20)$$
$$= E_{\mathbf{z}_u}\left[\left(\int_{T(\mathbf{z})}\int m(\mathbf{v}\,|\,M_{A_z}, \boldsymbol{\alpha}_0)\,\pi^{IP}(\boldsymbol{\alpha}_0\,|\,M_{A_z})\,d\boldsymbol{\alpha}_0\,d\mathbf{v}\right)\left(\int_{T(\mathbf{y},\mathbf{z})}\int m(\mathbf{w}\,|\,M_{A_y}, \boldsymbol{\lambda}_0)\,\pi^{IP}(\boldsymbol{\lambda}_0\,|\,M_{A_y})\,d\boldsymbol{\lambda}_0\,d\mathbf{w}\right)\right]$$
$$= E_{\mathbf{z}_u}\left[\int \underbrace{\left(\int_{T(\mathbf{z})} m(\mathbf{v}\,|\,M_{A_z}, \boldsymbol{\alpha}_0)\,d\mathbf{v}\right)}_{g_1(T(\mathbf{z})\,|\,M_{A_z}, \boldsymbol{\alpha}_0)}\pi^{IP}(\boldsymbol{\alpha}_0\,|\,M_{A_z})\,d\boldsymbol{\alpha}_0\;\int\underbrace{\left(\int_{T(\mathbf{y},\mathbf{z})} m(\mathbf{w}\,|\,M_{A_y}, \boldsymbol{\lambda}_0)\,d\mathbf{w}\right)}_{g_2(T(\mathbf{y},\mathbf{z})\,|\,M_{A_y}, \boldsymbol{\lambda}_0)}\pi^{IP}(\boldsymbol{\lambda}_0\,|\,M_{A_y})\,d\boldsymbol{\lambda}_0\right]$$
$$= c_0\,d_0\iint E_{\mathbf{z}_u}\left[g_1(T(\mathbf{z})\,|\,M_{A_z}, \boldsymbol{\alpha}_0)\,g_2(T(\mathbf{y}, \mathbf{z})\,|\,M_{A_y}, \boldsymbol{\lambda}_0)\right]d\boldsymbol{\alpha}_0\,d\boldsymbol{\lambda}_0,$$
where the last equality follows from Fubini's theorem, since $m(\mathbf{v}\,|\,M_{A_z}, \boldsymbol{\alpha}_0)$ and $m(\mathbf{w}\,|\,M_{A_y}, \boldsymbol{\lambda}_0)$ are proper densities. From 3-20, the posterior odds are
$$\frac{p(M_A\,|\,\mathbf{y}, \mathbf{z}_o)}{p(M_0\,|\,\mathbf{y}, \mathbf{z}_o)} = \frac{\iint E_{\mathbf{z}_u}\left[g_1(T(\mathbf{z})\,|\,M_{A_z}, \boldsymbol{\alpha}_0)\,g_2(T(\mathbf{y}, \mathbf{z})\,|\,M_{A_y}, \boldsymbol{\lambda}_0)\right]d\boldsymbol{\alpha}_0\,d\boldsymbol{\lambda}_0}{\iint E_{\mathbf{z}_u}\left[g_1(T(\mathbf{z})\,|\,M_{0_z}, \boldsymbol{\alpha}_0)\,g_2(T(\mathbf{y}, \mathbf{z})\,|\,M_{0_y}, \boldsymbol{\lambda}_0)\right]d\boldsymbol{\alpha}_0\,d\boldsymbol{\lambda}_0}\,\frac{\pi(M_A)}{\pi(M_0)}. \qquad (3\text{--}21)$$

3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component, and, finally, the amount of signal in the predictors for the detection component.

For each model component, the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection components have, respectively, 30 and 20 predictors. Therefore, the model space contains $2^{30}\times 2^{20} \approx 1.12\times 10^{15}$ candidate models.

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50, and 90 of the occupancy and detection probabilities match some pre-specified probabilities. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distribution of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels are associated with probabilities close to 0 or 1. Large spreads of the distributions of the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.

Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5, at small ($Q^z_{10} = 0.3$, $Q^z_{90} = 0.7$), intermediate ($Q^z_{10} = 0.2$, $Q^z_{90} = 0.8$), and large ($Q^z_{10} = 0.1$, $Q^z_{90} = 0.9$) distances. For the detection component, the model parameters are obtained so as to reflect detection probabilities concentrated about low values ($Q^y_{50} = 0.2$), intermediate values ($Q^y_{50} = 0.5$), and high values ($Q^y_{50} = 0.8$), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.

Table 3-1. Simulation control parameters, occupancy model selector.
Parameter                      Values considered
N                              50, 100
J                              3, 5
(Q^z_10, Q^z_50, Q^z_90)       (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_10, Q^y_50, Q^y_90)       (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)
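On the probit scale, matching quantiles of the occupancy probabilities amounts to matching quantiles of the linear predictor through $\Phi^{-1}$. The Python sketch below shows one way this calibration can be done for the symmetric presence-component targets, assuming for illustration a single standard-normal covariate and a linear predictor $a_0 + a_1 x$ (the actual simulation design uses many covariates, and the asymmetric detection targets would need an analogous but separate calibration).

```python
from scipy.stats import norm

def probit_signal(q10, q50, q90):
    """Intercept and slope so that P(occupancy) has the given 10/50/90 quantiles
    when the single covariate x is standard normal (symmetric targets assumed)."""
    a0 = norm.ppf(q50)
    a1 = (norm.ppf(q90) - norm.ppf(q10)) / (norm.ppf(0.9) - norm.ppf(0.1))
    return a0, a1

a0, a1 = probit_signal(0.2, 0.5, 0.8)   # "intermediate" occupancy signal
```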

There are in total 36 scenarios; these result from crossing all the levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, assuming true models $M_{T_z} = \{1, x_2, x_{15}, x_{16}, x_{22}, x_{28}\}$ for the presence and $M_{T_y} = \{1, q_7, q_{10}, q_{12}, q_{17}\}$ for the detection, with the predictors included in the randomly generated data sets. In this context, 1 represents the intercept term. Throughout this section we refer to predictors included in the true models as true predictors, and to those absent as false predictors. The selection procedure was conducted using each one of these data sets with two different priors on the model space: the uniform, or equal probability, prior and a multiplicity correcting prior.

The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor, and also through the five highest posterior probability models (HPM). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as
$$p(\text{predictor is included}\,|\,\mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v}) = \sum_{M\in\mathcal{M}} I_{\{\text{predictor}\,\in\, M\}}\,p(M\,|\,\mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v}, \mathcal{M}). \qquad (3\text{--}22)$$
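In practice, with MCMC output the posterior model probabilities in 3-22 are estimated by visit frequencies, so the MPIP reduces to the proportion of sampled models containing the predictor. A minimal Python sketch with an illustrative toy input:

```python
import numpy as np

def mpip(sampled_models, K):
    """Marginal posterior inclusion probability of each of K predictors,
    estimated from a list of sampled models (each a set of predictor indices)."""
    counts = np.zeros(K)
    for model in sampled_models:
        for k in model:
            counts[k] += 1
    return counts / len(sampled_models)

draws = [{0, 2}, {0, 2, 3}, {0, 2}, {2}]   # four posterior draws over K = 4 predictors
print(mpip(draws, K=4))                     # -> [0.75, 0.0, 1.0, 0.25]
```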

I(predictorisinM)p(M|y zw vM) (3ndash22)

In addition we compare the MPIP odds between predictors present in the true model

and predictors absent from it Specifically we consider the minimum odds of marginal

posterior inclusion probabilities for the predictors Let ~ξ and ξ denote respectively a

69

predictor in the true model MT and a predictor absent from MT We define the minimum

MPIP odds between the probabilities of true and false predictor as

minOddsMPIP =min~ξisinMT

p(I~ξ = 1|~ξ isin MT )

maxξ isinMTp(Iξ = 1|ξ isin MT )

(3ndash23)

If the variable selection procedure adequately discriminates true and false predictors, minOdds_MPIP takes values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors

For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and the ones on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity correcting prior are in red. In these figures, the MPIPs have been averaged over all data sets generated under scenarios matching the condition indicated.

In Figure 3-1 we contrast the mean MPIPs of the predictors over all data sets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.

From these figures, mainly three results can be drawn: (1) the effect of the model prior is substantial, (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors, and (3) the separation between MPIPs of true predictors and false predictors is noticeably larger in the detection component.

Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIP for false predictors towards 0. In the presence component, the MPIP for the true predictors is shrunk substantially under the multiplicity prior; however, there remains a clear separation between true and false predictors. In contrast, in the detection component the MPIP for true predictors remains relatively high (Figures 3-1 through 3-5).

[Figure 3-1. Predictor MPIP (marginal inclusion probability, 0 to 1) averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors; presence component (x2, x15, x22, x28) and detection component (q7, q10, q17).]

[Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.]

[Figure 3-3. Predictor MPIP averaged over scenarios defined by the interaction between the number of sites and the number of surveys per site, using uniform (U) and multiplicity correction (MC) priors.]

[Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.]

[Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.]

In scenarios where more sites were surveyed, the separation between the MPIP of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites has an effect on both components, given that every time a new site is included, covariate information is added to the design matrices of both the presence and the detection components.

On the other hand, increasing the number of surveys affects the MPIP of predictors in the detection component (Figures 3-2 and 3-3), but has only a marginal effect on predictors of the presence component. This may appear counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection component, while leaving the design matrix for the presence component unaltered. The small changes observed in the MPIP for the presence predictors as J increases are exclusively the result of having additional detection indicators equal to 1 at sites that, with fewer surveys, would only have had detections valued 0.

From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction prior. In the detection component, the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when a smaller number of sites is considered, and vice versa.

Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the detection component is affected. There is stronger discrimination between true and false predictors as the distribution becomes more variable (Figure 3-4). This is consistent with intuition: having the presence probabilities more concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.

Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between the MPIP of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and reduces those of false predictors.

Table 3-2. Comparison of average minOddsMPIP under scenarios having different numbers of sites (N=50, N=100) and different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors.

                              Sites                 Surveys
Comp.       π(M)         N=50     N=100         J=3      J=5
Presence    Unif         1.12      1.31         1.19     1.24
            MC           3.20      8.46         4.20     6.74
Detection   Unif         2.03      2.64         2.11     2.57
            MC          21.15     32.46        21.39    32.52

Table 3-3. Comparison of average minOddsMPIP for different levels of signal in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors.

                        Occupancy quantiles (Qz10, Qz50, Qz90)            Detection quantiles (Qy10, Qy50, Qy90)
Comp.      π(M)   (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)     (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence   Unif        1.05           1.20           1.34              1.10           1.23           1.24
           MC          2.02           4.55           8.05              2.38           6.19           6.40
Detection  Unif        2.34           2.34           2.30              2.57           2.00           2.38
           MC         25.37          20.77          25.28             29.33          18.52          28.49

The separation between the MPIP of true and false predictors is even more evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario the value of the minOddsMPIP (as defined in Equation 3–23) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model, the minOddsMPIP are markedly larger under the multiplicity correction prior and increase with the number of sites and with the number of surveys.
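Equation 3–23 is not reproduced in this excerpt; as a rough illustration of how such a summary could be computed, the sketch below assumes that minOddsMPIP is the ratio of the smallest MPIP among true predictors to the largest MPIP among false predictors, so that values above 1 mean every true predictor outranks every false one. The function name and toy numbers are illustrative only.

def min_odds_mpip(mpip, true_predictors):
    # mpip: dict mapping predictor name -> marginal posterior inclusion probability
    true_vals = [p for name, p in mpip.items() if name in true_predictors]
    false_vals = [p for name, p in mpip.items() if name not in true_predictors]
    return min(true_vals) / max(false_vals)

# Toy example: the true predictors are clearly separated from the false ones.
mpip = {"x2": 0.91, "x15": 0.88, "q7": 0.95, "x5": 0.12, "q3": 0.08}
print(min_odds_mpip(mpip, {"x2", "x15", "q7"}))  # 0.88 / 0.12, i.e., well above 1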


For the presence component, increasing the signal in the occupancy probabilities or having the detection probabilities concentrated about higher values has a positive and considerable effect on the magnitude of the odds. For the detection component, these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution of the detection probabilities centered about low or high values increases the minOddsMPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model

Tables 3-4 through 3-7 show the number of true predictors that are included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages observed in these tables provide one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The strong shrinkage of the multiplicity correction prior only allows a few true predictors to be selected, but at the same time it prevents any false predictors from being included in the HPM. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but also occurs to a lesser extent in the detection component.

Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                            True +              True −
Comp.       π(M)        N=50    N=100       N=50    N=100
Presence    Unif        0.57     0.63       0.51     0.55
            MC          0.06     0.13       1.00     1.00
Detection   Unif        0.77     0.85       0.87     0.93
            MC          0.49     0.70       1.00     1.00

Having more sites or surveys improves the inclusion of true predictors and the exclusion of false ones in the HPM, for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution of the occupancy probabilities is more


Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                            True +              True −
Comp.       π(M)        J=3      J=5        J=3      J=5
Presence    Unif        0.59     0.61       0.52     0.54
            MC          0.08     0.10       1.00     1.00
Detection   Unif        0.78     0.85       0.87     0.92
            MC          0.50     0.68       1.00     1.00

spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence component. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).

Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                   True +                                         True −
Comp.       π(M)    (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)     (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)
Presence    Unif        0.55          0.61          0.64              0.50          0.54          0.55
            MC          0.02          0.08          0.18              1.00          1.00          1.00
Detection   Unif        0.81          0.82          0.81              0.90          0.89          0.89
            MC          0.57          0.61          0.59              1.00          1.00          1.00

3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,


Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                   True +                                         True −
Comp.       π(M)    (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)     (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence    Unif        0.59          0.59          0.62              0.51          0.54          0.54
            MC          0.06          0.10          0.11              1.00          1.00          1.00
Detection   Unif        0.89          0.77          0.78              0.91          0.87          0.91
            MC          0.70          0.48          0.59              1.00          1.00          1.00

there is large variation in its topography and physio-geography; as such, elevation is a good candidate covariate to predict species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, temperature, as well as some biotic factors (Kery et al 2010).

Repeated visits to 1-ha pixels took place to obtain the corresponding detection history. In addition to the survey outcome, the x- and y-coordinates, thermal level, date of the survey, and elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranged from 1 to 22 within each survey year.

Kery et al (2010) summarize the results of this effort using AIC-based model comparisons, first following a backwards elimination approach for the detection process while keeping the occupancy component fixed at the most complex model, and then, for the presence component, choosing among a group of three models while using the detection model already selected. In our analysis of this dataset, for the detection and the presence components we consider as the full models those used in Kery et al (2010), namely

Φ⁻¹(ψ) = α0 + α1 year + α2 elev + α3 elev² + α4 elev³

Φ⁻¹(p) = λ0 + λ1 year + λ2 elev + λ3 elev² + λ4 elev³ + λ5 date + λ6 date²


where year = I(year = 2000).

The model spaces for these data contain 2⁶ = 64 and 2⁴ = 16 models, respectively, for the detection and occupancy components. That is, in total the model space contains 2^(4+6) = 1,024 models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each of the sampled models was chosen from the set of models that could be reached by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space because, for each of the 10,000 models drawn, the posterior probabilities of many more models can be observed. Below, the labels of the predictors are followed by either "z" or "y" to indicate the component they pertain to. Finally, using the results from the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kery et al (2010).

3.6.1 Results: Variable Selection Procedure

The model finally chosen for the presence component in Kery et al (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might correspond to a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability for the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior concentrates the model posterior probability mass more densely on the highest ranked models (90% of the mass is in the top five models) than the uniform prior (under which the top five models account for 40% of the mass).

For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases this model obtains very


Table 3-8. Posterior probability for the five highest probability models in the presence component of the blue hawker data.

Uniform model prior                              Multiplicity correcting model prior
Rank  Mz selected              p(Mz|y)           Rank  Mz selected               p(Mz|y)
1     yrz+elevz                0.10              1     elevz+elevz3              0.53
2     yrz+elevz+elevz3         0.08              2     (intercept only)          0.15
3     elevz2+elevz3            0.08              3     elevz+elevz2              0.09
4     yrz+elevz2               0.07              4     elevz2                    0.06
5     yrz+elevz3               0.07              5     elevz+elevz2+elevz3       0.05

high posterior probabilities. The terms contained in the cubic polynomial for elevation appear to contain some relevant information; however, this conflicts with the MPIPs observed in Table 3-11, which under both model priors are relatively low (< 20% with the uniform prior and ≤ 4% with the multiplicity correcting prior).

Table 3-9. Posterior probability for the five highest probability models in the detection component of the blue hawker data.

Uniform model prior                              Multiplicity correcting model prior
Rank  My selected              p(My|y)           Rank  My selected               p(My|y)
1     (intercept only)         0.45              1     (intercept only)          0.86
2     elevy3                   0.06              2     elevy3                    0.02
3     elevy2                   0.05              3     datey2                    0.02
4     elevy                    0.05              4     elevy2                    0.02
5     yry                      0.04              5     yry                       0.02

Finally, it is possible to use the MPIPs to obtain the median probability model (MPM), which contains the terms that have a MPIP higher than 50%. For the occupancy process (Table 3-10), under the uniform prior the model with the year, the elevation, and the elevation cubed is obtained. The MPM under the multiplicity correction prior coincides with the HPM from this prior. The MPM chosen for the detection component (Table 3-11) under both priors is the intercept-only model, coinciding again with the HPM.
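As a hedged illustration of how the HPM, the MPIPs, and the MPM can be read off once posterior probabilities are available for the visited models, consider the following sketch. The model representation (sets of predictor labels) and the helper name are assumptions rather than the dissertation's implementation, and the toy probabilities echo the multiplicity-correction column of Table 3-8.

def summarize(post):
    # post: dict {frozenset of predictor labels: posterior probability of that model}
    total = sum(post.values())
    post = {m: p / total for m, p in post.items()}       # renormalize over visited models
    hpm = max(post, key=post.get)                         # highest probability model
    predictors = set().union(*post)
    mpip = {x: sum(p for m, p in post.items() if x in m) for x in predictors}
    mpm = {x for x, p in mpip.items() if p > 0.5}         # median probability model
    return hpm, mpip, mpm

post = {frozenset({"elevz", "elevz3"}): 0.53, frozenset(): 0.15,
        frozenset({"elevz", "elevz2"}): 0.09, frozenset({"elevz2"}): 0.06}
hpm, mpip, mpm = summarize(post)                          # mpm == {"elevz", "elevz3"}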

Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belong to the true model, especially when considering


Table 3-10. MPIP, presence component.

Predictor    p(predictor ∈ MTz | y, z, w, v)
             Unif      MultCorr
yrz          0.53      0.07
elevz        0.51      0.73
elevz2       0.45      0.23
elevz3       0.50      0.67

Table 3-11. MPIP, detection component.

Predictor    p(predictor ∈ MTy | y, z, w, v)
             Unif      MultCorr
yry          0.19      0.03
elevy        0.18      0.03
elevy2       0.18      0.03
elevy3       0.19      0.04
datey        0.16      0.03
datey2       0.15      0.04

those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10) there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure

Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation) and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component under the multiplicity correction prior.

Therefore, in the validation procedure we observe the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM), that is, the model including only predictors with a MPIP larger than 50%, with the uniform prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity correction prior (elevz+elevz3, the same as the HPM with multiplicity correction).

We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training dataset. On the other hand, the model in Kery et al (2010) was chosen to minimize the prediction error of the complete data. Because this model was obtained from the full dataset, results derived from it can only be considered as a lower bound for the prediction errors.

The benchmark misclassification error rate for true 1's is high (close to 70%). However, the misclassification rate for true 0's, which account for most of the responses, is less pronounced (15%). Overall, the performance of the selected models is comparable. They yield considerably worse results than the benchmark for the true 1's, but achieve rates close to the benchmark for the true zeros. Pooling together the results for true ones and true zeros, the selected models with either prior have misclassification rates close to 30%. The benchmark model performs comparably, with a joint misclassification error of 23% (Table 3-12).

Table 3-12. Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors.

Model                            Terms                                                  True 1   True 0   Joint
Benchmark (Kery et al 2010)      yrz+elevz+elevz2+elevz3+elevy+elevy2+datey+datey2      0.66     0.15     0.23
HPM Unif                         yrz+elevz                                              0.83     0.17     0.28
HPM/MPM MC                       elevz+elevz3                                           0.82     0.18     0.28
MPM Unif                         yrz+elevz+elevz3                                       0.82     0.18     0.29
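The following sketch illustrates one way the validation summary of Table 3-12 could be computed; thresholding predicted detection probabilities at 0.5 is an assumption made for illustration, not necessarily the rule used in the dissertation.

def misclassification_rates(y_obs, p_hat, threshold=0.5):
    # y_obs: observed 0/1 detections in the test set; p_hat: model-based detection probabilities
    y_pred = [int(p > threshold) for p in p_hat]
    ones = [(o, p) for o, p in zip(y_obs, y_pred) if o == 1]
    zeros = [(o, p) for o, p in zip(y_obs, y_pred) if o == 0]
    err_ones = sum(o != p for o, p in ones) / len(ones)      # error rate on true 1's
    err_zeros = sum(o != p for o, p in zeros) / len(zeros)   # error rate on true 0's
    joint = sum(o != p for o, p in zip(y_obs, y_pred)) / len(y_obs)
    return err_ones, err_zeros, joint

print(misclassification_rates([1, 1, 0, 0, 0, 1], [0.2, 0.7, 0.1, 0.6, 0.3, 0.4]))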

3.7 Discussion

In this Chapter we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. The methodology is said to be fully automatic because no hyper-parameter specification is necessary in defining the parameter priors, and objective because it relies on intrinsic priors derived from noninformative priors. The intrinsic priors have been shown to have desirable properties as testing priors. We also propose a fast stochastic search algorithm to explore large model spaces using our model selection procedure.

Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than for predictors absent from it. Also, the simulations indicated that the method has greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.

Multiplicity correction priors were not described in detail in this Chapter; however, their influence on the selection outcome is significant. This behavior was observed in the simulation experiment and in the analysis of the Blue Hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next Chapter.

Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4; hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is developing methods for occupancy variable selection in a multivariate setting. This can be used to conduct hypothesis testing in scenarios with varying conditions through time, or in the case where multiple species are co-observed. A final variation we will investigate for this problem is that of occupancy model selection incorporating random effects.


CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.
–Sherlock Holmes, A Case of Identity

4.1 Introduction

In regression problems, if a large number of potential predictors is available, the complete model space is too large to enumerate, and automatic selection algorithms are necessary to find informative, parsimonious models. This multiple testing problem is difficult, and even more so when interactions or powers of the predictors are considered. In the ecological literature, models with interactions and/or higher order polynomial terms are ubiquitous (Johnson et al 2013; Kery et al 2010; Zeller et al 2011), given the complexity and non-linearities found in ecological processes. Several model selection procedures, even in the classical normal linear setting, fail to address two fundamental issues: (1) the model selection outcome is not invariant to affine transformations when interactions or polynomial structures are found among the predictors, and (2) additional penalization is required to control for false positives as the model space grows (i.e., as more covariates are considered).

These two issues motivate the methods developed throughout this Chapter. Building on the results of Chipman (1996), we propose, investigate, and provide recommendations for three different prior distributions on the model space. These priors help control for test multiplicity while accounting for polynomial structure in the predictors. They improve upon those proposed by Chipman, first by avoiding the need to specify values for the prior inclusion probabilities of the predictors, and second by formulating principled alternatives to introduce additional structure into the model priors. Finally, we design a stochastic search algorithm that allows fast and thorough exploration of model spaces with polynomial structure.

Having structure in the predictors can determine the selection outcome. As an illustration, consider the model E[y] = β00 + β01 x2 + β20 x1², where the order-one term x1 is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming x1 ↦ x1* = x1 + c for some c ≠ 0, the model becomes E[y] = β00 + β01 x2 + β20* x1*². Note that, in terms of the original predictors, x1*² = x1² + 2c x1 + c², implying that this seemingly innocuous transformation of x1 modifies the column space of the design matrix by including x1, which was not in the original model. That is, when lower order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al 1982; Khuri 2002; McCullagh & Nelder 1989; Nelder 2000; Peixoto 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower order predictor associated with it is also in the model. The model above is not well-formulated, as it contains x1² but not x1.

WFMs exhibit strong heredity, in that all lower order terms dividing higher order terms in the model must also be included. An alternative is to only require weak heredity (Chipman 1996), which only forces some of the lower order terms in the corresponding polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the conditions under which weak heredity keeps the design matrix invariant to affine transformations of the predictors are too restrictive to be useful in practice.


Although this topic appeared in the literature more than three decades ago (Nelder 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al 2009; Peixoto 1987), regularized regression methods (Bien et al 2013; Yuan et al 2009), and Bayesian approaches (Chipman 1996). The methods introduced in this Chapter take a Bayesian approach towards variable selection for well-formulated models, with particular emphasis on model priors.

As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probability within a pre-specified model space M. The model posterior probability for M ∈ M is given by

p(M|y, M) ∝ m(y|M) π(M|M).    (4–1)

Model posterior probabilities depend on the prior distribution on the model space, as well as on the prior distributions for the model-specific parameters, implicitly through the marginals m(y|M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger & Pericchi 1996; Berger et al 2001; George 2000; Jeffreys 1961; Kass & Wasserman 1996; Liang et al 2008; Zellner & Siow 1980). In contrast, the effect of the prior on the model space has until recently been neglected. A few authors (e.g., Casella et al (2014), Scott & Berger (2010), Wilson et al (2010)) have highlighted the relevance of the priors on the model space in the context of multiple testing. Adequately formulating priors on the model space can both account for structure in the predictors and provide additional control over the detection of false positive terms. In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the total number of covariates), since this is the most abundant model size contained in the model space.

Variable selection within the model space of well-formulated polynomial models poses two challenges for automatic objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.

Models without polynomial hierarchy constitute a special case of WFMs where all predictors are of order one. Hence, all the methods developed throughout this Chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, they are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.

In this Chapter, we first provide the necessary definitions to characterize the well-formulated model selection problem. Then we proceed to introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model-specific parameters (though this assumption can be relaxed to use other mixtures of g-priors). Finally, we implement our procedures using both simulated and real data.


4.2 Setup for Well-Formulated Models

Suppose that the observations yi are modeled using the polynomial regression on the covariates xi1, ..., xip given by

yi = Σ_α β_α ∏_{j=1}^p x_ij^{α_j} + ϵi,    (4–2)

where α = (α1, ..., αp) belongs to N₀^p, the p-dimensional space of natural numbers including 0, the ϵi are iid N(0, σ²), and only finitely many βα are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates xi1 and xi2 only. The terms xi2² and xi1² xi2 can be represented by α = (0, 2) and α = (2, 1), respectively.
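To make the multi-index representation concrete, the sketch below (an illustration in Python, not the software used in the dissertation) builds a design matrix whose columns are the monomials ∏j x_ij^{αj} for the multi-indices in a model M; the helper name is ours.

import numpy as np

def design_matrix(X, M):
    """X: (n, p) array of main effects; M: list of multi-indices (tuples of length p)."""
    cols = [np.prod(X ** np.asarray(alpha), axis=1) for alpha in M]
    return np.column_stack(cols)

X = np.random.default_rng(1).normal(size=(60, 2))
# The terms 1, x1, x2, x2^2, and x1^2*x2 written as multi-indices
# (the last two mirror the illustrative terms alpha = (0, 2) and alpha = (2, 1) above):
M = [(0, 0), (1, 0), (0, 1), (0, 2), (2, 1)]
Z = design_matrix(X, M)   # shape (60, 5); the (0, 0) column is the intercept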

The notation y = Z(X)β + ϵ is used to denote that the observed response y = (y1, ..., yn)′ is modeled via a polynomial function Z of the original covariates contained in X = (x1, ..., xp) (where xj = (x1j, ..., xnj)′), with the coefficients of the polynomial terms given by β. A specific polynomial model M is defined by the set of coefficients βα that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices α ∈ N₀^p. In particular, model M is specified by M = {αM1, ..., αM|M|}, for αMk ∈ N₀^p, where βα = 0 for α ∉ M. Any particular model M uses a subset XM of the original covariates X to form the polynomial terms in the design matrix ZM(X). Without ambiguity, a polynomial model ZM(X) on X can be identified with a polynomial model ZM(XM) on the covariates XM. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of ZM(XM). The coefficient vector and error variance of the model M are denoted by βM and σ²M, respectively. Thus M models the data as y = ZM(XM)βM + ϵM, where ϵM ~ N(0, I σ²M). Model M is said to be nested in model M′ if M ⊂ M′. M models the response through the covariates in two distinct ways: by choosing the set of meaningful covariates XM, and by choosing the polynomial structure ZM(XM) of these covariates.


The set N₀^p constitutes a partially ordered set or, more succinctly, a poset. A poset is a set partially ordered through a binary relation "≼". In this context, the binary relation on the poset N₀^p is defined between pairs (α, α′) by α′ ≼ α whenever αj ≥ α′j for all j = 1, ..., p, with α′ ≺ α if, additionally, αj > α′j for some j. The order of a term α ∈ N₀^p is given by the sum of its elements, order(α) = Σj αj. When order(α) = order(α′) + 1 and α′ ≺ α, then α′ is said to immediately precede α, which is denoted by α′ → α. The parent set of α is defined by P(α) = {α′ ∈ N₀^p : α′ → α}, the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if α ∈ M implies that P(α) ⊂ M. For example, any well-formulated model using xi1² xi2 to model yi must also include the parent terms xi1 xi2 and xi1², their corresponding parent terms xi1 and xi2, and the intercept term 1.
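A minimal sketch of the parent-set and well-formulation definitions above, using tuples of exponents as multi-indices; the helper names are ours, and the examples echo the x1² and x1² x2 illustrations in the text.

def parents(alpha):
    """All multi-indices obtained by lowering one positive exponent of alpha by 1."""
    return [tuple(a - (j == i) for j, a in enumerate(alpha))
            for i, a_i in enumerate(alpha) if a_i > 0]

def is_well_formulated(model):
    # Every term's immediate parents must be in the model; by induction this forces
    # the full ancestry (down to the intercept) to be present.
    model = set(model)
    return all(set(parents(alpha)) <= model for alpha in model)

# {1, x1, x1^2} is well-formulated; adding x1^2*x2 without x1*x2 (and x2) is not.
print(is_well_formulated({(0, 0), (1, 0), (2, 0)}))           # True
print(is_well_formulated({(0, 0), (1, 0), (2, 0), (2, 1)}))   # False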

The poset N₀^p can be represented by a directed acyclic graph (DAG), denoted here by Γ(N₀^p). Without ambiguity, we can identify nodes in the graph, α ∈ N₀^p, with terms in the set of covariates. The graph has directed edges to a node from its parents. Any well-formulated model M is represented by a subgraph Γ(M) of Γ(N₀^p) with the property that if node α ∈ Γ(M), then the nodes corresponding to P(α) are also in Γ(M). Figure 4-1 shows examples of well-formulated polynomial models, where α ∈ N₀^p is identified with ∏_{j=1}^p xj^{αj}.

The motivation for considering only well-formulated polynomial models is compelling. Let ZM be the design matrix associated with a polynomial model. The subspace of y modeled by ZM, given by the hat matrix HM = ZM(Z′M ZM)⁻¹ Z′M, is invariant to affine transformations of the matrix XM if and only if M corresponds to a well-formulated polynomial model (Peixoto 1990).


Figure 4-1. Graphs of well-formulated polynomial models for p = 2 (panels A and B).

For example, if p = 2 and yi = β(0,0) + β(1,0) xi1 + β(0,1) xi2 + β(1,1) xi1 xi2 + ϵi, then the hat matrix is invariant to any covariate transformation of the form A (xi1, xi2)′ + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two. In contrast, if yi = β(0,0) + β(2,0) xi1² + ϵi, then the hat matrix formed after applying the transformation xi1 ↦ xi1 + c, for real c ≠ 0, is not the same as the hat matrix formed from the original xi1.

4.2.1 Well-Formulated Model Spaces

The spaces of WFMs M considered in this chapter can be characterized in terms of two WFMs: MB, the base model, and MF, the full model. The base model contains at least the intercept term and is nested in the full model. The model space M is populated by all well-formulated models M that nest MB and are nested in MF:

M = {M : MB ⊆ M ⊆ MF and M is well-formulated}.

For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, M ∈ M can be uniquely identified by two different sets of nodes in MF: the set of extreme nodes and the set of children nodes. For M ∈ M, the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

E(M) = {α ∈ M \ MB : α ∉ P(α′) for all α′ ∈ M},
C(M) = {α ∈ MF \ M : {α} ∪ M is well-formulated}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in M. The children nodes are those nodes that, when added to M, give rise to a WFM in M. Because MB ⊆ M for all M ∈ M, the set of nodes E(M) ∪ MB determines M, by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set {α′ ∈ P(α) : α ∈ C(M)} ∪ {α′ ∈ E(MF) : α ⋠ α′ for all α ∈ C(M)}, which contains E(M) ∪ MB and thus uniquely identifies M.
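Continuing the sketch, E(M) and C(M) can be computed directly from these definitions (the parents() helper is repeated so the block runs on its own); the example reproduces Figure 4-2, where E(M) = {x1²} and C(M) = {x2}.

def parents(alpha):
    return [tuple(a - (j == i) for j, a in enumerate(alpha))
            for i, a_i in enumerate(alpha) if a_i > 0]

def extreme_nodes(M, MB):
    """Nodes of M \\ MB that are not a parent of any other node in M."""
    return {a for a in M - MB if all(a not in parents(b) for b in M)}

def children_nodes(M, MF):
    """Nodes of MF \\ M whose full parent set is already contained in M."""
    return {a for a in MF - M if set(parents(a)) <= M}

MB = {(0, 0)}
MF = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}   # quadratic surface in x1, x2
M  = {(0, 0), (1, 0), (2, 0)}                            # the model {1, x1, x1^2}
print(extreme_nodes(M, MB))    # {(2, 0)}  -> x1^2
print(children_nodes(M, MF))   # {(0, 1)}  -> x2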

Figure 4-2. A) Extreme node set. B) Children node set. Nodes shown: 1, x1, x2, x1², x1x2, x2².

In Figure 4-2, the extreme and children sets for the model M = {1, x1, x1²} are shown for the model space characterized by MF = {1, x1, x2, x1², x1x2, x2²}. In Figure 4-2A, the solid nodes represent nodes α ∈ M \ E(M), the dashed node corresponds to α ∈ E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M ∪ C(M).

4.3 Priors on the Model Space

As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but it does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).

In what follows we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition

The graphical structure of the model spaces suggests a method for prior construction on M guided by the notion of inheritance. A node α is said to inherit from a node α′ if there is a directed path from α′ to α in the graph Γ(MF). The inheritance is said to be immediate if order(α) = order(α′) + 1 (equivalently, if α′ ∈ P(α), or if α′ immediately precedes α).

For convenience, define δ(M) = M \ MB to be the set of nodes in M that are not in the base model MB. For α ∈ δ(MF), let γα(M) be the indicator function describing whether α is included in M, i.e., γα(M) = I(α ∈ M). Denote by γ^ν(M) the set of indicators of inclusion in M for all order-ν nodes in δ(MF). Finally, let γ^{<ν}(M) = ∪_{j=0}^{ν−1} γ^j(M), the set of indicators of inclusion in M for all nodes in δ(MF) of order less than ν. With these definitions, the prior probability of any model M ∈ M can be factored as

π(M|M) = ∏_{j=Jmin_M}^{Jmax_M} π(γ^j(M) | γ^{<j}(M), M),    (4–3)

where Jmin_M and Jmax_M are, respectively, the minimum and maximum order of nodes in δ(MF), and π(γ^{Jmin_M}(M) | γ^{<Jmin_M}(M), M) = π(γ^{Jmin_M}(M) | M).


Prior distributions on M can be simplified by making two assumptions. First, if order(α) = order(α′) = j, then γα and γα′ are assumed to be conditionally independent given γ^{<j}, denoted by γα ⊥⊥ γα′ | γ^{<j}. Second, immediate inheritance is invoked: it is assumed that if order(α) = j, then γα(M) | γ^{<j}(M) = γα(M) | γ_P(α)(M), where γ_P(α)(M) is the inclusion indicator for the set of parent nodes of α. This indicator is one if the complete parent set of α is contained in M and zero otherwise.

In Figure 4-3, these two assumptions are depicted with MF being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x1², x2², and x1x2 are independent when conditioned on all the lower order terms. In this same space, immediate inheritance implies that the inclusion of x1², conditioned on the inclusion of all lower order nodes, is equivalent to conditioning it on its parent set (x1 in this case).

Figure 4-3. A) Conditional independence: x1² ⊥⊥ x1x2 ⊥⊥ x2², conditioned on {1, x1, x2}. B) Immediate inheritance: the inclusion of x1² given {1, x1, x2} is equivalent to its inclusion given x1.

Denote the conditional inclusion probability of node α in model M by πα = π(γα(M) = 1 | γ_P(α)(M), M). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

π(M | πM, M) = ∏_{α ∈ δ(MF)} πα^{γα(M)} (1 − πα)^{1−γα(M)},    (4–4)

with πM = {πα : α ∈ δ(MF)}. Because M must be well-formulated, πα = γα = 0 if γ_P(α)(M) = 0. Thus the product in 4–4 can be restricted to the set of nodes α ∈ δ(M) ∪ C(M). Additional structure can be built into the prior on M by making assumptions about the inclusion probabilities πα, such as equality assumptions or the assumption of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on πM that assume some structure among its elements, and then marginalizing out the πM.

Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero πα are all equal. Specifically, for a model M ∈ M it is assumed that πα = π for all α ∈ δ(M) ∪ C(M). A complete Bayesian specification of the HUP is obtained by assuming a prior distribution for π. The choice π ~ Beta(a, b) produces

π_HUP(M | M, a, b) = B(|δ(M)| + a, |C(M)| + b) / B(a, b),    (4–5)

where B is the beta function. Setting a = b = 1 gives the particular value

π_HUP(M | M, a = 1, b = 1) = [ (|δ(M)| + |C(M)| + 1) · binom(|δ(M)| + |C(M)|, |δ(M)|) ]⁻¹,    (4–6)

where binom(n, k) denotes the binomial coefficient. The HUP assigns equal probabilities to all models for which the sets of nodes δ(M) and C(M) have the same cardinalities. This prior provides a combinatorial penalization but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because πα = π for all α, this penalization can only depend on some aspect of the entire graph of MF, such as the total number of nodes not in the null model, |δ(MF)|.


Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero πα. Each non-zero πα is given its own prior, which is assumed to be a Beta distribution with parameters aα and bα. Thus the prior probability of M under the HIP is

π_HIP(M | M, a, b) = ∏_{α ∈ δ(M)} [aα / (aα + bα)] ∏_{α ∈ C(M)} [bα / (aα + bα)],    (4–7)

where a product over the empty set is taken to be 1. Because the πα are totally independent, any choice of aα and bα is equivalent to choosing a probability of success πα for a given α. Setting aα = bα = 1 for all α ∈ δ(M) ∪ C(M) gives the particular value

π_HIP(M | M, a = 1, b = 1) = (1/2)^{|δ(M)| + |C(M)|}.    (4–8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it provides essentially no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4–8 is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each γ^j is conditioned on γ^{<j} in the prior construction, the aα and bα for α of order j can be conditioned on γ^{<j}. One such additional penalization utilizes the number of nodes of order j that could be included while producing a WFM, conditioned on the inclusion vector γ^{<j}, which is denoted chj(γ^{<j}). Choosing aα = 1 and bα(M) = chj(γ^{<j}) is equivalent to choosing a probability of success πα = 1/(1 + chj(γ^{<j})). This penalization can drive down the false positive rate when chj(γ^{<j}) is large, but may produce more false negatives.

Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the πα is to assume equality between the πα of a given order and independence across different orders. Define δj(M) = {α ∈ δ(M) : order(α) = j} and Cj(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that πα = πj for all α ∈ δj(M) ∪ Cj(M). Assuming that πj ~ Beta(aj, bj) provides the prior probability

π_HOP(M | M, a, b) = ∏_{j=Jmin_M}^{Jmax_M} B(|δj(M)| + aj, |Cj(M)| + bj) / B(aj, bj).    (4–9)

The specific choice aj = bj = 1 for all j gives the value

π_HOP(M | M, a = 1, b = 1) = ∏_j [ (|δj(M)| + |Cj(M)| + 1) · binom(|δj(M)| + |Cj(M)|, |δj(M)|) ]⁻¹,    (4–10)

and produces a hierarchical version of the Scott and Berger multiplicity correction. The HOP arises from a conditional exchangeability assumption on the indicator variables: conditioned on γ^{<j}(M), the indicators {γα : α ∈ δj(M) ∪ Cj(M)} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with a common probability of success πj having some prior distribution; our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order j that could be included while maintaining a WFM, given the lower order nodes, is chj(M) = chj(γ^{<j}(M)) = |δj(M) ∪ Cj(M)|. Using aj = 1 and bj(M) = chj(M) produces a prior with two desirable properties. First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., chj(M) − 1.
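As an illustration of Equations 4–5 through 4–10, the sketch below evaluates the HOP prior probability from the per-order counts |δj(M)| and |Cj(M)|; with a single block pooling all orders the same computation gives the HUP, and the default arguments implement the aj = 1, bj = chj(M) penalization. The function and argument names are ours, and the example reproduces the value 1/12 for the model {1, x1, x2} in Figure 4-4.

from math import exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hop_prior(delta_counts, child_counts, a=None, b=None):
    # delta_counts[j] = |delta_j(M)|, child_counts[j] = |C_j(M)| for each order j
    log_p = 0.0
    for j in delta_counts:
        dj, cj = delta_counts[j], child_counts[j]
        if dj + cj == 0:          # no order-j nodes are available: the factor is 1
            continue
        aj = 1.0 if a is None else a[j]
        bj = float(dj + cj) if b is None else b[j]   # default: the b_j = ch_j(M) penalization
        log_p += log_beta(dj + aj, cj + bj) - log_beta(aj, bj)
    return exp(log_p)

# Model {1, x1, x2} inside MF = full quadratic surface in (x1, x2):
# order 1: both main effects included, none addable; order 2: none included, three addable.
print(hop_prior({1: 2, 2: 0}, {1: 0, 2: 3}, a={1: 1, 2: 1}, b={1: 1, 2: 1}))  # 1/12 (Figure 4-4)
print(hop_prior({1: 2, 2: 0}, {1: 0, 2: 3}))                                  # 1/12 under b = ch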

4.3.2 Choice of Prior Structure and Hyper-Parameters

Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities πM. For the sake of simplicity, this chapter focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters that can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1 as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by bj(M) = bα(M) = chj(M) = |δj(M) ∪ Cj(M)| for the HOP and HIP (where j = order(α)), while for the HUP, b = ch denotes b = |δ(MF)|. The prior behavior is illustrated for two model spaces; in both cases the base model MB is taken to be the intercept-only model and MF is the DAG shown (Figures 4-4 and 4-5). The priors treat model complexity differently, and some general properties can be seen in these examples.

    Model                                HIP                  HOP                  HUP
                                  (1,1)    (1,ch)       (1,1)    (1,ch)       (1,1)    (1,ch)
1   1                              1/4      4/9          1/3      1/2          1/3      5/7
2   1, x1                          1/8      1/9          1/12     1/12         1/12     5/56
3   1, x2                          1/8      1/9          1/12     1/12         1/12     5/56
4   1, x1, x1²                     1/8      1/9          1/12     1/12         1/12     5/168
5   1, x2, x2²                     1/8      1/9          1/12     1/12         1/12     5/168
6   1, x1, x2                      1/32     3/64         1/12     1/12         1/60     1/72
7   1, x1, x2, x1²                 1/32     1/64         1/36     1/60         1/60     1/168
8   1, x1, x2, x1x2                1/32     1/64         1/36     1/60         1/60     1/168
9   1, x1, x2, x2²                 1/32     1/64         1/36     1/60         1/60     1/168
10  1, x1, x2, x1², x1x2           1/32     1/192        1/36     1/120        1/30     1/252
11  1, x1, x2, x1², x2²            1/32     1/192        1/36     1/120        1/30     1/252
12  1, x1, x2, x1x2, x2²           1/32     1/192        1/36     1/120        1/30     1/252
13  1, x1, x2, x1², x1x2, x2²      1/32     1/576        1/12     1/120        1/6      1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a,b) ∈ {(1,1), (1,ch)}.

First, contrast the HIP, HUP, and HOP for the choice (a,b) = (1,1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the


    Model                           HIP                  HOP                  HUP
                             (1,1)    (1,ch)       (1,1)    (1,ch)       (1,1)    (1,ch)
1   1                         1/8      27/64        1/4      1/2          1/4      4/7
2   1, x1                     1/8      9/64         1/12     1/10         1/12     2/21
3   1, x2                     1/8      9/64         1/12     1/10         1/12     2/21
4   1, x3                     1/8      9/64         1/12     1/10         1/12     2/21
5   1, x1, x3                 1/8      3/64         1/12     1/20         1/12     4/105
6   1, x2, x3                 1/8      3/64         1/12     1/20         1/12     4/105
7   1, x1, x2                 1/16     3/128        1/24     1/40         1/30     1/42
8   1, x1, x2, x1x2           1/16     3/128        1/24     1/40         1/20     1/70
9   1, x1, x2, x3             1/16     1/128        1/8      1/40         1/20     1/70
10  1, x1, x2, x3, x1x2       1/16     1/128        1/8      1/40         1/5      1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a,b) ∈ {(1,1), (1,ch)}.

HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Under the HUP, models including all of the terms are given at least as much probability as any model containing a non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., it is the only model that contains every term) and, as an unfortunate consequence, this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a,b) = (1,1). As models become more complex they are appropriately penalized for their size; however, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced. Thus combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.

In contrast, if (a,b) = (1,ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces p(M) ≥ p(M′). However, differences between the priors are apparent. The HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5 the penalization of the HOP is the least dramatic, producing prior odds of 20 for MB versus MF, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior

To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity: first, the priors are compared as the number of main effects p grows; second, they are compared as the depth of the hierarchy increases, or in other words, as the maximum order Jmax_M increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as pα = Σ_{M∈M} I(α∈M) p(M|y, M) for α ∈ MF. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on M. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice a = 1, b = ch are considered. The results for the different combinations of MF and MT incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:

1. Randomly generate main effects matrices X = (x1, ..., x18), with xi ~iid Nn(0, In), and error vectors ϵ ~ Nn(0, In), for n = 60.

2. Setting all coefficient values equal to one, calculate y = Z_MT β + ϵ for the true models given by:
   MT1 = {x1, x2, x3, x1², x1x2, x2², x2x3}, with |MT1| = 7;
   MT2 = {x1, x2, ..., x16}, with |MT2| = 16;
   MT3 = {x1, x2, x3, x4}, with |MT3| = 4;
   MT4 = {x1, x2, ..., x8, x1², x3x4}, with |MT4| = 10;
   MT5 = {x1, x2, x3, x4, x1², x3x4}, with |MT5| = 6.


Table 4-1. Characterization of the full models MF and corresponding model spaces M considered in the simulations.

Growing p, fixed Jmax_M                                 Fixed p, growing Jmax_M
MF                |MF|     |M|      MT used             MF               |MF|     |M|       MT used
(x1+x2+x3)²         9       95      MT1                 (x1+x2+x3)²        9       95       MT1
(x1+...+x4)²       14     1337      MT1                 (x1+x2+x3)³       19     2497       MT1
(x1+...+x5)²       20    38619      MT1                 (x1+x2+x3)⁴       34   161421       MT1

Other model spaces
MF                               |MF|      |M|       MT used
x1 + x2 + ... + x18               18     262144      MT2, MT3
(x1+...+x4)² + x5 + ... + x10     20      85568      MT4, MT5

3. In all simulations, the base model MB is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full models MF, are presented in Table 4-1, along with the true models used in each case.

4. Enumerate the model spaces and calculate p(M|y, M) for all M ∈ M using the EPP, HUP, HIP, and HOP, the latter three each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each M for the different priors.

The true positives (TP) are defined as those nodes α ∈ MT such that pα > 0.5. For the false positives (FP), three different cutoffs on pα are considered, elucidating the adjustment for multiplicity induced by the model priors. These cutoffs are 0.10, 0.20, and 0.50, for α ∉ MT. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1 the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of MF, the full model that defines the DAG for M.

Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be MT1, which has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.

First, focus on the posterior when (a,b) = (1,1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 50% cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as p increases. These priors also considerably outperform their counterparts with the default hyper-parameters a = b = 1 in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in MT1 in most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 50% cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.

Growing polynomial degree, fixed main effects. For these examples the true model is once again MT1. When the complexity is increased by making the order of MF larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases the HIP is the best at filtering out the false positives. Using the 0.5 false positive cutoff, some false positives are included both under the EPP and under all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high posterior inclusion probability both with the EPP and with the a = b = 1 priors.


Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five and MF is a full quadratic surface, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                    a = 1, b = 1            a = 1, b = ch
Cutoff       |MT|   MF                    EPP     HIP    HUP    HOP       HIP    HUP    HOP
FP(>0.10)     7     (x1+x2+x3)²          1.78    1.78   2.00   2.00      0.11   1.31   1.06
FP(>0.20)                                0.43    0.43   2.00   1.98      0.01   0.28   0.24
FP(>0.50)                                0.04    0.04   0.97   0.36      0.00   0.03   0.02
TP(>0.50)   (MT1)                        7.00    7.00   7.00   7.00      6.97   6.99   6.99
FP(>0.10)     7     (x1+x2+x3+x4)²       3.62    1.94   2.33   2.45      0.10   0.63   1.07
FP(>0.20)                                1.60    0.47   2.17   2.15      0.01   0.17   0.24
FP(>0.50)                                0.25    0.06   0.35   0.36      0.00   0.02   0.02
TP(>0.50)   (MT1)                        7.00    7.00   7.00   7.00      6.97   6.99   6.99
FP(>0.10)     7     (x1+...+x5)²         6.00    2.16   2.60   2.55      0.12   0.43   1.15
FP(>0.20)                                2.91    0.55   2.13   2.18      0.02   0.19   0.27
FP(>0.50)                                0.66    0.11   0.25   0.37      0.00   0.03   0.01
TP(>0.50)   (MT1)                        7.00    7.00   7.00   7.00      6.97   6.99   6.99

In contrast, any of the a = 1, b = ch priors dramatically improves upon its a = b = 1 counterpart, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 50% cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.

Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.


Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                    a = 1, b = 1            a = 1, b = ch
Cutoff       |MT|   MF                    EPP     HIP    HUP    HOP       HIP    HUP    HOP
FP(>0.10)     7     (x1+x2+x3)²          1.78    1.78   2.00   2.00      0.11   1.31   1.06
FP(>0.20)                                0.43    0.43   2.00   1.98      0.01   0.28   0.24
FP(>0.50)                                0.04    0.04   0.97   0.36      0.00   0.03   0.02
TP(>0.50)   (MT1)                        7.00    7.00   7.00   7.00      6.97   6.99   6.99
FP(>0.10)     7     (x1+x2+x3)³          7.37    5.21   6.06   2.91      0.55   1.05   1.39
FP(>0.20)                                2.91    1.55   3.61   2.08      0.17   0.34   0.31
FP(>0.50)                                0.40    0.21   0.50   0.26      0.03   0.03   0.04
TP(>0.50)   (MT1)                        7.00    7.00   7.00   7.00      6.97   6.98   7.00
FP(>0.10)     7     (x1+x2+x3)⁴          8.22    4.00   4.69   2.61      0.52   0.55   1.32
FP(>0.20)                                4.21    1.13   1.76   2.03      0.12   0.15   0.31
FP(>0.50)                                0.56    0.17   0.22   0.27      0.03   0.03   0.04
TP(>0.50)   (MT1)                        7.00    7.00   7.00   7.00      6.97   6.97   6.99

By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models have 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 50% cutoff. In contrast, the HIP (and thus the EPP) with the 50% cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in MF, so there is little potential for false positives. The a = 1, b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 50% cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 50% cutoff.


If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and than the choice a = b = 1) at controlling false positives while capturing all true positives through the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from MT4, with ten terms, and from MT5, with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.1 cutoff; at the 0.5 cutoff some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 50% cutoff. In terms of the true positives, the EPP and the a = b = 1 priors always include all of the predictors in MT4 and MT5. On the other hand, the ability of the a = 1, b = ch priors to control false positives is markedly better than that of the EPP and of the hierarchical priors with a = b = 1. At the 50% cutoff, these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as good default priors on the model space.

4.4 Random Walks on the Model Space

When the model space M is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space M both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which


Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with
unstructured or irregular model spaces under the equal probability prior (EPP), the
hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the
hierarchical uniform prior (HUP).

                                                        a = 1, b = 1               a = 1, b = ch
Cutoff           |MT|  MF                       EPP     HIP    HUP    HOP      HIP   HUP    HOP
FP(>0.10)        16    x1 + x2 + ... + x18      1.93    1.93   2.00   2.00     0.03  1.80   1.80
FP(>0.20)                                       0.52    0.52   2.00   2.00     0.01  0.46   0.46
FP(>0.50)                                       0.07    0.07   2.00   2.00     0.01  0.04   0.04
TP(>0.50) (MT2)                                15.99   15.99  16.00  16.00     6.99 15.99  15.99
FP(>0.10)         4    x1 + x2 + ... + x18     13.95   13.95   9.15   9.15     0.26  1.31   1.31
FP(>0.20)                                       5.45    5.45   3.03   3.03     0.05  0.45   0.45
FP(>0.50)                                       0.84    0.84   0.45   0.45     0.02  0.06   0.06
TP(>0.50) (MT3)                                 4.00    4.00   4.00   4.00     4.00  4.00   4.00
FP(>0.10)        10    (x1 + ... + x4)^2        9.73    9.71  10.00   5.60     0.34  2.33   2.20
FP(>0.20)              + x5 + ... + x10         2.65    2.65   8.73   3.05     0.12  0.74   0.69
FP(>0.50)                                       0.35    0.35   1.36   1.68     0.02  0.11   0.12
TP(>0.50) (MT4)                                10.00   10.00  10.00   9.99     9.94  9.98   9.99
FP(>0.10)         6    (x1 + ... + x4)^2       13.52   13.52  11.06   9.94     0.44  1.63   1.96
FP(>0.20)              + x5 + ... + x10         4.22    4.21   3.60   5.01     0.15  0.48   0.68
FP(>0.50)                                       0.53    0.53   0.57   0.75     0.01  0.08   0.11
TP(>0.50) (MT5)                                 6.00    6.00   6.00   6.00     5.99  5.99   5.99

are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively quickly.

4.4.1 Simple Pruning and Growing

This first strategy relies on small, localized jumps around the model space, turning on or off a single node at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step in the algorithm, assume that the current state of the chain is model M. Let pG be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M+ = M ∪ {α} for some α ∈ C(M), or M− = M \ {α} for some α ∈ E(M).

An example transition kernel is defined by the mixture

g(M′|M) = pG · qGrow(M′|M) + (1 − pG) · qPrune(M′|M)
        = [I(M ≠ MF) / (1 + I(M ≠ MB))] · I(α ∈ C(M)) / |C(M)|
          + [I(M ≠ MB) / (1 + I(M ≠ MF))] · I(α ∈ E(M)) / |E(M)|,    (4–11)

where M′ = M ∪ {α} in the first term and M′ = M \ {α} in the second, and where pG has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to or deletion from M uniformly at random.
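To make the proposal in equation 4–11 concrete, the following is a minimal R sketch (not the implementation used in this work) of a single grow/prune proposal. It assumes a model is stored as a character vector of active nodes and that children() and extreme() are hypothetical helper functions returning C(M) and E(M); the Metropolis-Hastings correction would then be computed from the corresponding marginal likelihoods.

## Minimal sketch of the simple grow/prune proposal of Section 4.4.1.
## 'children(M)' and 'extreme(M)' are assumed user-supplied functions returning
## C(M) and E(M) for the well-formulated model space.
propose_grow_prune <- function(M, children, extreme) {
  C_M <- children(M)
  E_M <- extreme(M)
  # pG = 1/2 when both sets are non-empty, 1 when E(M) is empty, 0 when C(M) is empty
  pG <- if (length(C_M) == 0) 0 else if (length(E_M) == 0) 1 else 0.5
  if (runif(1) < pG) {
    alpha <- C_M[sample.int(length(C_M), 1)]   # grow: activate one child node
    M_new <- union(M, alpha)
  } else {
    alpha <- E_M[sample.int(length(E_M), 1)]   # prune: remove one extreme node
    M_new <- setdiff(M, alpha)
  }
  M_new
}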

For this simple algorithm, pruning has the reverse kernel of growing and vice-versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, the sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree Based Pruning and Growing

In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order. One can update the vector of inclusion indicators by blocks, where each block contains the nodes of a common order j. Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = Jmin up to j = Jmax, with Jmin and Jmax being the lowest and highest orders of nodes in MF \ MB, respectively. Define Mt(Jmin−1) = M and set j = Jmin. The growth kernel comprises the following steps, proceeding from j = Jmin to j = Jmax:


1) Propose a model M′ by selecting a set of nodes from Cj(Mt(j−1)) through the kernel qGrow,j(·|Mt(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j−1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j−1).

3) If j < Jmax, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set Mt = Mt(Jmax).

The pruning step is defined in a similar fashion; however, it starts at order j = Jmax and proceeds down to j = Jmin. Let Ej(M′) = E(M′) ∩ {α ∈ MF : α is of order j} be the set of nodes of order j that can be removed from a model M′ to produce a WFM. Define Mt(Jmax+1) = M and set j = Jmax. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from Ej(Mt(j+1)) through the kernel qPrune,j(·|Mt(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j+1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j+1).

3) If j > Jmin, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt(Jmin).
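The following is a minimal R sketch of the order-by-order growth sweep described above (the pruning sweep is analogous, with j running downward). The helpers children_of_order(), propose_block(), and log_post() are hypothetical placeholders for the order-j children set, the block proposal kernel qGrow,j, and the log posterior, respectively.

## Sketch of the degree-based growth sweep: at each order j, a block of candidate
## nodes of that order is proposed and accepted or rejected with a Metropolis-Hastings
## step.  All three helper functions are assumed to be supplied by the user.
grow_by_order <- function(M, y, J_min, J_max, children_of_order, propose_block, log_post) {
  M_cur <- M
  for (j in J_min:J_max) {
    Cj <- children_of_order(M_cur, j)          # order-j nodes that can be added
    if (length(Cj) == 0) next
    prop  <- propose_block(M_cur, Cj, y)       # returns list(model, log_q_fwd, log_q_rev)
    log_a <- log_post(prop$model, y) - log_post(M_cur, y) + prop$log_q_rev - prop$log_q_fwd
    if (log(runif(1)) < log_a) M_cur <- prop$model
  }
  M_cur
}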

It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can be combined for each j. The forward kernel proceeds from j = Jmin to j = Jmax and proposes adding sets of nodes from Cj(M) ∪ Ej(M). The reverse kernel simply reverses the direction of j, proceeding from j = Jmax to j = Jmin.

4.5 Simulation Study

To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points on the scale (McCullagh & Nelder 1989), i.e., when the true


model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.

With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normal. The response vectors are drawn from the n-variate normal distribution as y ~ Nn(ZMT(X) βγ, In), where MT is the true model and In is the n × n identity matrix.

The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that ZMF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1^2 has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
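As an illustration of the data-generating scheme just described, the sketch below builds a small design with five standard normal main effects, forms the polynomial design matrix for an assumed true model, and draws y ~ Nn(ZMT(X)βγ, In). The specific true model and coefficient values here are placeholders; the coefficient values actually used in the experiment are obtained by sphering, as detailed in Appendix C.

## Minimal sketch of the simulation data-generating step (assumed setup, not the
## exact simulation code): five standard-normal main effects, a polynomial design
## for an illustrative true model, and a Gaussian response with unit error variance.
set.seed(1)
n <- 130
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
## Illustrative true model: the five main effects plus x1^2 and x2*x5
Z <- cbind(1, X, x1sq = X[, "x1"]^2, x2x5 = X[, "x2"] * X[, "x5"])
beta <- rep(1, ncol(Z))            # placeholder coefficients; Appendix C rescales these
y <- drop(Z %*% beta + rnorm(n))   # y ~ N_n(Z_MT(X) beta, I_n)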

The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior across the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows


the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.

Figure 4-6. MT: DAG of the largest true model used in simulations.

The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect

As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are smallest. Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the number of FPs is somewhat lower with the HOP.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated
scenarios for the median probability model with EPP and HOP(1, ch).

4.5.2 Coefficient Magnitude

Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient. Finally, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. These choices are denoted by β(1) ∝ (1, 1, 1), β(2) ∝ (1, 0.5, 0.25), and β(3) ∝ (0.25, 0.5, 1), respectively, where the entries give the relative weights of the order-one, order-two, and order-three coefficients. In Figure 4-7 the first 4 scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in


the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7 the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale

Four true models were considered: (1) the model from Figure 4-6 (MT1); (2) the model without the order-one terms (MT2); (3) the model without order-two terms (MT3); and (4) the model without x1^2 and x2x5 (MT4). The last three are clearly not well-formulated. In Figure 4-7, the leftmost point on the horizontal axis corresponds to scenarios with MT1, the next point is for scenarios with MT2, followed by those with MT3, then with MT4, then MT1, etc. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis

This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The


model space contains approximately 71 billion models, and computation of all model posterior probabilities is not feasible.

Table 4-5. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in equation 3–2), hyper-g (HG) priors (Liang et al. 2008) with hyper-parameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow 1980). The results were extracted for the median posterior probability (MPM) models. Additionally, the model is estimated using the R package hierNet (Bien et al. 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al. 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
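The predictive assessment uses the root mean squared error of prediction on the 165 held-out observations; a minimal sketch of this criterion is given below, assuming the selected median probability model has been refit on the training half (here as a hypothetical lm object).

## Sketch of the validation criterion: RMSE on the held-out half of the data for a
## model refit on the training half ('fit' is assumed to be an lm object).
rmse_holdout <- function(fit, newdata, y_new) {
  sqrt(mean((y_new - predict(fit, newdata = newdata))^2))
}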

Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg^2, which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.

Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-to-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.


Table 4-6. Median probability models (MPM) from different combinations of parameter and
model priors vs. the model selected using the hierarchical lasso.

BF       Prior    Model                                               R2      RMSE
IP       EPP      hum, dpg, ibt, hum2, hum*dpg, hum*ibt, dpg2, ibt2   0.8054  4.2739
IP       HIP      hum, ibt, hum2, hum*ibt, ibt2                       0.7740  4.3396
IP       HOP      hum, dpg, ibt, hum2, hum*ibt, ibt2                  0.7848  4.3175
IP       HUP      hum, dpg, ibt, hum*ibt, ibt2                        0.7767  4.3508
ZS       EPP      hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2            0.7896  4.2518
ZS       HIP      hum, ibt, hum*ibt, ibt2                             0.7525  4.3505
ZS       HOP      hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2            0.7896  4.2518
ZS       HUP      hum, dpg, ibt, hum*ibt, ibt2                        0.7767  4.3508
HG11     EPP      vh, hum, dpg, ibt, hum2, hum*ibt, dpg2              0.7701  4.3049
HG11     HIP      hum, ibt, hum*ibt, ibt2                             0.7525  4.3505
HG11     HOP      hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2            0.7896  4.2518
HG11     HUP      hum, dpg, ibt, hum*ibt, ibt2                        0.7767  4.3508
HG21     EPP      hum, dpg, ibt, hum2, hum*ibt, dpg2                  0.7701  4.3037
HG21     HIP      hum, dpg, ibt, hum*ibt, ibt2                        0.7767  4.3508
HG21     HOP      hum, dpg, ibt, hum2, hum*ibt, dpg2, ibt2            0.7896  4.2518
HG21     HUP      hum, dpg, ibt, hum*ibt                              0.7526  4.4036
HierNet           hum, temp, ibh, dpg, ibt, vis, hum2, hum*ibt,       0.7651  4.3680
                  temp2, temp*ibt, dpg2

4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are set up (e.g., in what units these predictors are expressed).

In this chapter we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP using the hyperparameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.


In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, Zellner-Siow prior, and hyper g-priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.


CHAPTER 5
CONCLUSIONS

Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible. They allow integrating, under a single methodology, empirical observations and theoretical process models, and can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of the Bayesian methods used in ecology, and hopefully they will shed light on the differences between Bayesian estimation and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models; it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.

For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggest a strategy to add spatial dependence among neighboring sites.

Ecological inquiry usually requires competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model or a set of probable models should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology


for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which prevents introducing (commonly unavailable) subjective information into the model. In simulation experiments, we observed that the methods accurately single out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. Also, the simulations indicated that the method provides better discrimination for predictors in the detection component of the model.

In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are coded (e.g., in what units these predictors are expressed).

To confront this situation, we propose three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP using the hyperparameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.

Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit version of the model.


APPENDIX A
FULL CONDITIONAL DENSITIES DYMOSS

In this section we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.

Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms.

Let φ(ν | μ, σ²) represent the density of a normal random variable ν with mean μ and variance σ², and recall that ψ_{i1} = F(x'_{(o)i} α) and p_{ijt} = F(q'_{ijt} λ_t), where F(·) is the inverse link function. The full conditional for z_{it} is given by:

1. For t = 1:

   π(z_{i1} | v_{i1}, α, λ_1, β^c_1, δ^s_1) = (ψ*_{i1})^{z_{i1}} (1 − ψ*_{i1})^{1−z_{i1}} = Bernoulli(ψ*_{i1}),   (A–1)

   where

   ψ*_{i1} = [ ψ_{i1} φ(v_{i1} | x'_{i1} β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_{i1}} (1 − p_{ij1}) ] /
             [ ψ_{i1} φ(v_{i1} | x'_{i1} β^c_1 + δ^s_1, 1) ∏_{j=1}^{J_{i1}} (1 − p_{ij1})
               + (1 − ψ_{i1}) φ(v_{i1} | x'_{i1} β^c_1, 1) ∏_{j=1}^{J_{i1}} I(y_{ij1} = 0) ].

2. For 1 < t < T:

   π(z_{it} | z_{i(t−1)}, z_{i(t+1)}, λ_t, β^c_{t−1}, δ^s_{t−1}) = (ψ*_{it})^{z_{it}} (1 − ψ*_{it})^{1−z_{it}} = Bernoulli(ψ*_{it}),   (A–2)

   where

   ψ*_{it} = [ κ_{it} ∏_{j=1}^{J_{it}} (1 − p_{ijt}) ] /
             [ κ_{it} ∏_{j=1}^{J_{it}} (1 − p_{ijt}) + ∇_{it} ∏_{j=1}^{J_{it}} I(y_{ijt} = 0) ],

   with

   (a) κ_{it} = F(x'_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1}) φ(v_{it} | x'_{it} β^c_t + δ^s_t, 1), and

   (b) ∇_{it} = (1 − F(x'_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1})) φ(v_{it} | x'_{it} β^c_t, 1).

3. For t = T:

   π(z_{iT} | z_{i(T−1)}, λ_T, β^c_{T−1}, δ^s_{T−1}) = (ψ⋆_{iT})^{z_{iT}} (1 − ψ⋆_{iT})^{1−z_{iT}} = Bernoulli(ψ⋆_{iT}),   (A–3)

   where

   ψ⋆_{iT} = [ κ⋆_{iT} ∏_{j=1}^{J_{iT}} (1 − p_{ijT}) ] /
             [ κ⋆_{iT} ∏_{j=1}^{J_{iT}} (1 − p_{ijT}) + ∇⋆_{iT} ∏_{j=1}^{J_{iT}} I(y_{ijT} = 0) ],

   with

   (a) κ⋆_{iT} = F(x'_{i(T−1)} β^c_{T−1} + z_{i(T−1)} δ^s_{T−1}), and

   (b) ∇⋆_{iT} = 1 − F(x'_{i(T−1)} β^c_{T−1} + z_{i(T−1)} δ^s_{T−1}).
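As an illustration, the Bernoulli update in equation A–1 can be coded directly. The sketch below (not the packaged implementation) assumes the quantities ψ_{i1}, p_{ij1}, and v_{i1} have already been computed, and that a probit link is used so that φ is the standard normal density.

## Sketch of the z_{i1} draw implied by (A-1): psi1 and p1 are probabilities, v1 the
## latent occupancy propensity, xb = x'_{i1} beta^c_1, delta_s = delta^s_1, y1 the detections.
draw_z1 <- function(psi1, v1, xb, delta_s, p1, y1) {
  num <- psi1 * dnorm(v1, xb + delta_s, 1) * prod(1 - p1)
  den <- num + (1 - psi1) * dnorm(v1, xb, 1) * as.numeric(all(y1 == 0))
  rbinom(1, 1, num / den)   # z_{i1} is forced to 1 whenever any y_{ij1} = 1
}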

Sampler u_i

1. π(u_i | z_{i1}, α) = trN(x'_{(o)i} α, 1, trunc(z_{i1})),   (A–4)

   where trunc(z_{i1}) = (−∞, 0] if z_{i1} = 0 and (0, ∞) if z_{i1} = 1,

   and trN(μ, σ², A) denotes the pdf of a truncated normal random variable with mean μ, variance σ², and truncation region A.

Sampler α

1. π(α | u) ∝ [α] ∏_{i=1}^N φ(u_i | x'_{(o)i} α, 1).   (A–5)

   If [α] ∝ 1, then

   α | u ~ N(m(α), Σ_α),

   with m(α) = Σ_α X'_{(o)} u and Σ_α = (X'_{(o)} X_{(o)})^{−1}.

Sampler v_{it}

1. (For t > 1)

   π(v_{i(t−1)} | z_{i(t−1)}, z_{it}, β^c_{t−1}, δ^s_{t−1}) = trN(μ^{(v)}_{i(t−1)}, 1, trunc(z_{it})),   (A–6)

   where μ^{(v)}_{i(t−1)} = x'_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1}, and trunc(z_{it}) defines the corresponding truncation region given by z_{it}.


Sampler (β^c_{t−1}, δ^s_{t−1})

1. (For t > 1)

   π(β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1}) ∝ [β^c_{t−1}, δ^s_{t−1}] ∏_{i=1}^N φ(v_{i(t−1)} | x'_{i(t−1)} β^c_{t−1} + z_{i(t−1)} δ^s_{t−1}, 1).   (A–7)

   If [β^c_{t−1}, δ^s_{t−1}] ∝ 1, then

   β^c_{t−1}, δ^s_{t−1} | v_{t−1}, z_{t−1} ~ N(m(β^c_{t−1}, δ^s_{t−1}), Σ_{t−1}),

   with m(β^c_{t−1}, δ^s_{t−1}) = Σ_{t−1} X̃'_{t−1} v_{t−1} and Σ_{t−1} = (X̃'_{t−1} X̃_{t−1})^{−1}, where X̃_{t−1} = (X_{t−1}, z_{t−1}).

Sampler w_{ijt}

1. (For t ≥ 1 and z_{it} = 1)

   π(w_{ijt} | z_{it} = 1, y_{ijt}, λ) = trN(q'_{ijt} λ_t, 1, trunc(y_{ijt})).   (A–8)

Sampler λ_t

1. (For t = 1, 2, ..., T)

   π(λ_t | z_t, w_t) ∝ [λ_t] ∏_{i: z_{it}=1} ∏_{j=1}^{J_{it}} φ(w_{ijt} | q'_{ijt} λ_t, 1).   (A–9)

   If [λ_t] ∝ 1, then

   λ_t | w_t, z_t ~ N(m(λ_t), Σ_{λ_t}),

   with m(λ_t) = Σ_{λ_t} Q'_t w_t and Σ_{λ_t} = (Q'_t Q_t)^{−1}, where Q_t and w_t, respectively, are the design matrix and the vector of latent variables for surveys of sites such that z_{it} = 1.
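For the probit link, the u and α updates above reduce to standard truncated-normal and conjugate normal draws (Albert & Chib 1993). A minimal R sketch under a flat prior on α is shown below; it assumes the truncnorm and MASS packages for the truncated-normal and multivariate normal draws, and is illustrative rather than the packaged implementation.

## Minimal sketch of two conjugate updates: u_i | z_i1, alpha and alpha | u.
library(truncnorm)  # rtruncnorm() for truncated normal draws
library(MASS)       # mvrnorm() for multivariate normal draws

draw_u <- function(z1, Xo, alpha) {
  mu <- drop(Xo %*% alpha)
  # trunc(z_i1): (-Inf, 0] when z_i1 = 0 and (0, Inf) when z_i1 = 1
  rtruncnorm(length(z1), a = ifelse(z1 == 1, 0, -Inf),
             b = ifelse(z1 == 1, Inf, 0), mean = mu, sd = 1)
}

draw_alpha <- function(u, Xo) {
  Sigma <- solve(crossprod(Xo))               # (X_(o)' X_(o))^{-1}
  m     <- drop(Sigma %*% crossprod(Xo, u))   # Sigma * X_(o)' u
  drop(mvrnorm(1, mu = m, Sigma = Sigma))
}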


APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state M, the global jump is performed by drawing a model M′ at random from the model space. This is achieved by beginning at the base model and increasing the order from J^min_M to J^max_M, the minimum and maximum orders of nodes in MF \ MB; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is

α = min{ 1, m(y | M′, M) / m(y | M, M) }.

Local Jump. From the current state M, the local jump is performed by drawing a model from the set of models L(M) = {Mα : α ∈ E(M) ∪ C(M)}, where Mα is M \ {α} for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities are computed as a mixture of p(M′ | y, M, M′ ∈ L(M)) and the discrete uniform distribution. The proposal kernel is

q(M′ | y, M, M′ ∈ L(M)) = (1/2) ( p(M′ | y, M, M′ ∈ L(M)) + 1/|L(M)| ).

This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is

α = min{ 1, [m(y | M′, M) / m(y | M, M)] · [q(M | y, M′, M ∈ L(M′)) / q(M′ | y, M, M′ ∈ L(M))] }.
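A minimal R sketch of the local-jump proposal is given below. The functions local_moves(), log_marginal(), and log_prior() are hypothetical placeholders for L(M), log m(y | M, M), and log π(M | M); the returned forward probability is the q(M′ | y, M, M′ ∈ L(M)) needed in the MH correction.

## Sketch of the local-jump proposal: models in L(M) are weighted by a 50/50 mixture
## of renormalized posterior weights and a discrete uniform distribution.
propose_local <- function(M, y, local_moves, log_marginal, log_prior) {
  LM  <- local_moves(M)                               # list of candidate models in L(M)
  lw  <- sapply(LM, function(Mp) log_marginal(Mp, y) + log_prior(Mp))
  pw  <- exp(lw - max(lw)); pw <- pw / sum(pw)        # p(M' | y, M, M' in L(M))
  q   <- 0.5 * pw + 0.5 / length(LM)                  # mixture proposal probabilities
  idx <- sample.int(length(LM), 1, prob = q)
  list(model = LM[[idx]], q_forward = q[idx])         # q used in the MH correction
}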

Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M′, define Lj(M′) = {M′} ∪ {M′_α : α ∈ (E(M′) ∪ C(M′)), α of order j}. From a state M, the kernel chooses at random whether to increase or decrease the order. If M = MF, then decreasing the order is chosen with probability 1, and if M = MB, then increasing the order is chosen with probability 1; in all other cases, the probability of increasing and of decreasing order is 1/2. The proposal kernels are given by:


Increasing order proposal kernel:

1. Set j = J^min_M − 1 and M′_j = M.

2. Draw M′_{j+1} from q_{inc,j+1}(M′ | y, M, M′ ∈ L_{j+1}(M′_j)), where

   q_{inc,j+1}(M′ | y, M, M′ ∈ L_{j+1}(M′_j)) = (1/2) ( p(M′ | y, M, M′ ∈ L_{j+1}(M′_j)) + 1/|L_{j+1}(M′_j)| ).

3. Set j = j + 1.

4. If j < J^max_M, then return to 2; otherwise proceed to 5.

5. Set M′ = M′_{J^max_M} and compute the proposal probability

   q_inc(M′ | y, M, M) = ∏_{j=J^min_M −1}^{J^max_M −1} q_{inc,j+1}(M′_{j+1} | y, M, M′ ∈ L_{j+1}(M′_j)).   (B–1)

Decreasing order proposal kernel:

1. Set j = J^max_M + 1 and M′_j = M.

2. Draw M′_{j−1} from q_{dec,j−1}(M′ | y, M, M′ ∈ L_{j−1}(M′_j)), where

   q_{dec,j−1}(M′ | y, M, M′ ∈ L_{j−1}(M′_j)) = (1/2) ( p(M′ | y, M, M′ ∈ L_{j−1}(M′_j)) + 1/|L_{j−1}(M′_j)| ).

3. Set j = j − 1.

4. If j > J^min_M, then return to 2; otherwise proceed to 5.

5. Set M′ = M′_{J^min_M} and compute the proposal probability

   q_dec(M′ | y, M, M) = ∏_{j=J^max_M +1}^{J^min_M +1} q_{dec,j−1}(M′_{j−1} | y, M, M′ ∈ L_{j−1}(M′_j)).   (B–2)

If increasing order is chosen, then the MH correction is given by

α = min{ 1, [(1 + I(M′ = MF)) / (1 + I(M = MB))] · [q_dec(M | y, M′, M) / q_inc(M′ | y, M, M)] · [p(M′ | y, M) / p(M | y, M)] },   (B–3)

and similarly if decreasing order is chosen.

Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.


APPENDIX C
WFM SIMULATION DETAILS

Briefly, the idea is to let ZMT(X) βMT = (QR) βMT = Q ηMT (i.e., βMT = R^{−1} ηMT), using the QR decomposition. As such, setting all values in ηMT proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation yi is var(yi) = var(E[yi | zi]) + E[var(yi | zi)], where zi is the i-th row of the design matrix ZMT. Hence we take the signal-to-noise ratio for each observation to be

SNR(η) = η'MT R^{−T} Σz R^{−1} ηMT / σ²,

where Σz = var(zi). We determine how the signal is distributed across predictors up to a proportionality constant, so that the signal-to-noise ratio can be controlled simultaneously.

Additionally, to investigate the ability of the model to capture correctly the hierarchical structure, we specify four different 0-1 vectors that determine the predictors in MT, which generates the data in the different scenarios.
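A minimal R sketch of this allocation is given below under the stated assumptions (σ² = 1, uniform signal on the η scale, and a true-model design matrix Z without the intercept column); it is illustrative rather than the simulation code used here.

## Sketch of the QR-based signal allocation: distribute signal uniformly on the eta
## scale, then rescale so that SNR(eta) matches a target value (sigma^2 = 1 assumed).
allocate_beta <- function(Z, target_snr, sigma2 = 1) {
  qr_Z   <- qr(Z)
  R      <- qr.R(qr_Z)
  eta    <- rep(1, ncol(Z))                     # uniform signal across predictors
  Sigmaz <- var(Z)                              # var(z_i), the design covariance
  Rinv   <- solve(R)
  snr1   <- drop(t(eta) %*% t(Rinv) %*% Sigmaz %*% Rinv %*% eta) / sigma2
  eta    <- eta * sqrt(target_snr / snr1)       # rescale eta to hit the target SNR
  drop(Rinv %*% eta)                            # beta_MT = R^{-1} eta_MT
}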

Table C-1. Experimental conditions, WFM simulations.

Parameter        Values considered
SNR(ηMT) = k     0.25, 1, 4
ηMT              equal signal for all orders; signal decreasing with order (1, 0.5, 0.25);
                 signal increasing with order (0.25, 0.5, 1)
γMT              the 0-1 inclusion vectors of the four true models considered (the WFM of
                 Figure 4-6 and the three models with special points described below)
n                130, 260, 1040

The results presented below are somewhat different from those found in the main body of this chapter in Section 4.5. These are obtained by averaging the number of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios, for the 20 highest posterior probability models.


SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect more notorious when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) has strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either having a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control on the number of FPs included when considering small sample sizes combined with small SNRs. As either sample size or SNR grows, the differences between the two priors become indistinct.


Figure C-1. SNR vs. n: Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these choices are denoted by β ∝ (1, 0.5, 0.25), β ∝ (1, 1, 1), and β ∝ (0.25, 0.5, 1), respectively, where the entries give the relative weights of the order-one, order-two, and order-three coefficients.

Observe that the number of FPs is invulnerable to how the SNR is distributed across predictors when using the HOP(1, ch); conversely, when using the EPP, the number of FPs decreases as the SNR grows, always being slightly higher than those obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β ∝ (1, 1, 1) or β ∝ (0.25, 0.5, 1)) the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

Special Points on the Scale

In Nelder (1998), the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in practice in this context. In addition, the author states that considering well-formulated models only does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified due to the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special


Figure C-2. SNR vs. coefficient values: Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

points correspond to the same model MT from Figure 4-6, but have, respectively, zero-valued coefficients for all the order-one terms, for all the order-two terms, and for x1^2 and x2x5.

As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order


Figure C-3. SNR vs. different true models MT: Average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

terms. These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing, but with high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and the model with zero coefficients for x1^2 and x2x5, and is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed,


some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x1x2x5, which depends on the inclusion of three order-two terms (x1x2, x1x5, x2x5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging; the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single covariate, or at most on a few covariates. A true data generating mechanism that removes all terms of a given order in the context of polynomial models is clearly not justified; here this was only done for comparison purposes.


APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang et al. (2008); these are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The marginal posterior inclusion probability corresponds to the probability of including a given term of the full model MF, after summing over all models in the model space. For each node α ∈ MF this probability is given by pα = Σ_{M ∈ M} I(α ∈ M) p(M | y, M). In problems with a large model space, such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible; thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space M.

Given that there are in total 44 potential predictors, for convenience, in Tables D-2 to D-5 below we only display the marginal posterior probabilities for the terms included under at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(1,1), and Hyper-g(2,1)).
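A minimal R sketch of this estimate is shown below, assuming the sampled models are stored as character vectors of active nodes with associated (renormalized) posterior weights; the median probability model is then the set of nodes with estimated inclusion probability above 0.5.

## Sketch of estimating marginal inclusion probabilities from MCMC output:
## 'models' is a list of character vectors of active nodes, 'post_prob' their weights.
inclusion_probs <- function(models, post_prob, nodes) {
  post_prob <- post_prob / sum(post_prob)
  sapply(nodes, function(a)
    sum(post_prob[vapply(models, function(M) a %in% M, logical(1))]))
}
## Example: p_incl <- inclusion_probs(models, post_prob, c("hum", "dpg", "ibt", "hum2"))
##          mpm    <- names(p_incl)[p_incl > 0.5]    # median probability model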


Table D-2. Marginal inclusion probabilities, intrinsic prior.

           EPP    HIP    HUP    HOP
hum        0.99   0.69   0.85   0.76
dpg        0.85   0.48   0.52   0.53
ibt        0.99   1.00   1.00   1.00
hum2       0.76   0.51   0.43   0.62
hum*dpg    0.55   0.02   0.03   0.17
hum*ibt    0.98   0.69   0.84   0.75
dpg2       0.72   0.36   0.25   0.46
ibt2       0.59   0.78   0.57   0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior.

           EPP    HIP    HUP    HOP
hum        0.76   0.67   0.80   0.69
dpg        0.89   0.50   0.55   0.58
ibt        0.99   1.00   1.00   1.00
hum2       0.57   0.49   0.40   0.57
hum*ibt    0.72   0.66   0.78   0.68
dpg2       0.81   0.38   0.31   0.51
ibt2       0.54   0.76   0.55   0.77

Table D-4. Marginal inclusion probabilities, Hyper-g(1,1).

           EPP    HIP    HUP    HOP
vh         0.54   0.05   0.10   0.11
hum        0.81   0.67   0.80   0.69
dpg        0.90   0.50   0.55   0.58
ibt        0.99   1.00   0.99   0.99
hum2       0.61   0.49   0.40   0.57
hum*ibt    0.78   0.66   0.78   0.68
dpg2       0.83   0.38   0.30   0.51
ibt2       0.49   0.76   0.54   0.77

Table D-5. Marginal inclusion probabilities, Hyper-g(2,1).

           EPP    HIP    HUP    HOP
hum        0.79   0.64   0.73   0.67
dpg        0.90   0.52   0.60   0.59
ibt        0.99   1.00   0.99   1.00
hum2       0.60   0.47   0.37   0.55
hum*ibt    0.76   0.64   0.71   0.67
dpg2       0.82   0.41   0.36   0.52
ibt2       0.47   0.73   0.49   0.75


REFERENCES

Akaike H (1983) Information measures and model selection Bull Int Statist Inst 50277ndash290

Albert J H amp Chib S (1993) Bayesian-analysis of binary and polychotomousresponse data Journal of the American Statistical Association 88(422) 669ndash679

Berger J amp Bernardo J (1992) On the development of reference priors BayesianStatistics 4 (pp 35ndash60)

URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf

Berger J amp Pericchi L (1996) The intrinsic Bayes factor for model selection andprediction Journal of the American Statistical Association 91(433) 109ndash122

URL httpamstattandfonlinecomdoiabs10108001621459199610476668

Berger J Pericchi L amp Ghosh J (2001) Objective Bayesian methods for modelselection introduction and comparison In Model selection vol 38 of IMS LectureNotes Monogr Ser (pp 135ndash207) Inst Math Statist

URL httpwwwjstororgstable1023074356165

Besag J York J amp Mollie A (1991) Bayesian Image-Restoration with 2 Applicationsin Spatial Statistics Annals of the Institute of Statistical Mathematics 43 1ndash20

Bien J Taylor J amp Tibshirani R (2013) A lasso for hierarchical interactions TheAnnals of Statistics 41(3) 1111ndash1141

URL httpprojecteuclidorgeuclidaos1371150895

Breiman L amp Friedman J (1985) Estimating optimal transformations for multipleregression and correlation Journal of the American Statistical Association 80580ndash598

Brusco M J Steinley D amp Cradit J D (2009) An exact algorithm for hierarchicallywell-formulated subsets in second-order polynomial regression Technometrics 51(3)306ndash315

Casella G Giron F J Martınez M L amp Moreno E (2009) Consistency of Bayesianprocedures for variable selection The Annals of Statistics 37 (3) 1207ndash1228

URL httpprojecteuclidorgeuclidaos1239369020

Casella G Moreno E amp Giron F (2014) Cluster Analysis Model Selection and PriorDistributions on Models Bayesian Analysis TBA(TBA) 1ndash46

URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf


Chipman H (1996) Bayesian variable selection with related predictors CanadianJournal of Statistics 24(1) 17ndash36

URL httponlinelibrarywileycomdoi1023073315687abstract

Clyde M amp George E I (2004) Model Uncertainty Statistical Science 19(1) 81ndash94

URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274

Dewey J (1958) Experience and nature New York Dover Publications

Dorazio R M amp Taylor-Rodrıguez D (2012) A Gibbs sampler for Bayesian analysis ofsite-occupancy data Methods in Ecology and Evolution 3 1093ndash1098

Ellison A M (2004) Bayesian inference in ecology Ecology Letters 7 509ndash520

Fiske I amp Chandler R (2011) unmarked An R package for fitting hierarchical modelsof wildlife occurrence and abundance Journal of Statistical Software 43(10)

URL httpcorekmiopenacukdownloadpdf5701760pdf

George E (2000) The variable selection problem Journal of the American StatisticalAssociation 95(452) 1304ndash1308

URL httpwwwtandfonlinecomdoiabs10108001621459200010474336

Giron F J Moreno E Casella G amp Martınez M L (2010) Consistency of objectiveBayes factors for nonnested linear models and increasing model dimension Revistade la Real Academia de Ciencias Exactas Fisicas y Naturales Serie A Matematicas104(1) 57ndash67

URL httpwwwspringerlinkcomindex105052RACSAM201006

Good I J (1950) Probability and the Weighing of Evidence New York Haffner

Griepentrog G L Ryan J M amp Smith L D (1982) Linear transformations ofpolynomial regression-models American Statistician 36(3) 171ndash174

Gunel E amp Dickey J (1974) Bayes factors for independence in contingency tablesBiometrika 61 545ndash557

Hanski I (1994) A Practical Model of Metapopulation Dynamics Journal of AnimalEcology 63 151ndash162

Hooten M (2006) Hierarchical spatio-temporal models for ecological processesDoctoral dissertation University of Missouri-Columbia

URL httpsmospacelibraryumsystemeduxmluihandle103554500

Hooten M B amp Hobbs N T (2014) A Guide to Bayesian Model Selection forEcologists Ecological Monographs (In Press)


Hughes J amp Haran M (2013) Dimension reduction and alleviation of confoundingfor spatial generalized linear mixed models Journal of the Royal Statistical SocietySeries B Statistical Methodology 75 139ndash159

Hurvich C M amp Tsai C-L (1989) Regression and time series model selection insmall samples Biometrika 76 297ndash307

URL httpbiometoxfordjournalsorgcontent762297abstract

Jeffreys H (1935) Some tests of significance treated by the theory of probabilityProcedings of the Cambridge Philosophy Society 31 203ndash222

Jeffreys H (1961) Theory of Probability London Oxford University Press 3rd ed

Johnson D Conn P Hooten M Ray J amp Pond B (2013) Spatial occupancymodels for large data sets Ecology 94(4) 801ndash808

URL httpwwwesajournalsorgdoiabs10189012-05641mi=3eywlhampaf=R

ampsearchText=human+population

Kass R amp Wasserman L (1995) A reference Bayesian test for nested hypothesesand its relationship to the Schwarz criterion Journal of the American StatisticalAssociation 90(431)

URL httpamstattandfonlinecomdoiabs10108001621459199510476592

Kass R E amp Raftery A E (1995) Bayes Factors Journal of the American StatisticalAssociation 90 773ndash795

URL httpwwwtandfonlinecomdoiabs10108001621459199510476572

Kass R E amp Wasserman L (1996) The Selection of Prior Distributions by FormalRules Journal of the American Statistical Association 91(435) 1343

URL httpwwwjstororgstable2291752origin=crossref

Kery M (2010) Introduction to WinBUGS for Ecologists Bayesian Approach toRegression ANOVA Mixed Models and Related Analyses Academic Press 1st ed

Kery M Gardner B amp Monnerat C (2010) Predicting species distributions fromchecklist data using site-occupancy models Journal of Biogeography 37 (10)1851ndash1862 Kery Marc Gardner Beth Monnerat Christian

Khuri A (2002) Nonsingular linear transformations of the control variables in responsesurface models Technical Report

Krebs C J (1972) Ecology the experimental analysis of distribution and abundance


Lempers F B (1971) Posterior probabilities of alternative linear models University ofRotterdam Press Rotterdam

Leon-Novelo L Moreno E amp Casella G (2012) Objective Bayes model selection inprobit models Statistics in medicine 31(4) 353ndash65

URL httpwwwncbinlmnihgovpubmed22162041

Liang F Paulo R Molina G Clyde M a amp Berger J O (2008) Mixtures of g Priorsfor Bayesian Variable Selection Journal of the American Statistical Association103(481) 410ndash423

URL httpwwwtandfonlinecomdoiabs101198016214507000001337

Link W amp Barker R (2009) Bayesian inference with ecological applications Elsevier

URL httpbooksgooglecombookshl=enamplr=ampid=hecon2l2QPcCampoi=fnd

amppg=PP2ampdq=Bayesian+Inference+with+ecological+applicationsampots=S82_

0pxrNmampsig=L3xbsSQcKD8FV6rxCMp2pmP2JKk

MacKenzie D amp Nichols J (2004) Occupancy as a surrogate for abundanceestimation Animal biodiversity and conservation 1 461ndash467

URL httpcrsitbacidmediajurnalrefslandscapemackenzie2004zhpdf

MacKenzie D Nichols J amp Hines J (2003) Estimating site occupancy colonizationand local extinction when a species is detected imperfectly Ecology 84(8)2200ndash2207

URL httpwwwesajournalsorgdoiabs10189002-3090

MacKenzie D I Bailey L L amp Nichols J D (2004) Investigating speciesco-occurrence patterns when species Journal of Animal Ecology 73 546ndash555

MacKenzie D I Nichols J D Lachman G B Droege S Royle J A amp LangtimmC A (2002) Estimating site occupancy rates when detection probabilities are lessthan one Ecology 83(8) 2248ndash2255

Mazerolle M amp Mazerolle M (2013) Package rsquoAICcmodavgrsquo (c)

URL ftpheanetarchivegnewsenseorgdisk1CRANwebpackages

AICcmodavgAICcmodavgpdf

McCullagh P amp Nelder J A (1989) Generalized linear models (2nd ed) LondonEngland Chapman amp Hall

McQuarrie A Shumway R amp Tsai C-L (1997) The model selection criterion AICu


Moreno E Bertolino F amp Racugno W (1998) An intrinsic limiting procedure for modelselection and hypotheses testing Journal of the American Statistical Association93(444) 1451ndash1460

Moreno E Giron F J amp Casella G (2010) Consistency of objective Bayes factors asthe model dimension grows The Annals of Statistics 38(4) 1937ndash1952

URL httpprojecteuclidorgeuclidaos1278861238

Nelder J A (1977) Reformulation of linear-models Journal of the Royal StatisticalSociety Series A - Statistics in Society 140 48ndash77

Nelder J A (1998) The selection of terms in response-surface models - how strong isthe weak-heredity principle American Statistician 52(4) 315ndash318

Nelder J A (2000) Functional marginality and response-surface fitting Journal ofApplied Statistics 27 (1) 109ndash112

Nichols J Hines J amp Mackenzie D (2007) Occupancy estimation and modeling withmultiple states and state uncertainty Ecology 88(6) 1395ndash1400

URL httpwwwesajournalsorgdoipdf10189006-1474

Ovaskainen O Hottola J amp Siitonen J (2010) Modeling species co-occurrenceby multivariate logistic regression generates new hypotheses on fungal interactionsEcology 91(9) 2514ndash21

URL httpwwwncbinlmnihgovpubmed20957941

Peixoto J L (1987) Hierarchical variable selection in polynomial regression-modelsAmerican Statistician 41(4) 311ndash313

Peixoto J L (1990) A property of well-formulated polynomial regression-modelsAmerican Statistician 44(1) 26ndash30

Pericchi L R (2005) Model selection and hypothesis testing based on objectiveprobabilities and bayes factors In Handbook of Statistics Elsevier

Polson N G Scott J G amp Windle J (2013) Bayesian Inference for Logistic ModelsUsing Polya-Gamma Latent Variables Journal of the American Statistical Association108 1339ndash1349

URL httpdxdoiorg101080016214592013829001

Rao C R amp Wu Y (2001) On model selection vol Volume 38 of Lecture NotesndashMonograph Series (pp 1ndash57) Beachwood OH Institute of Mathematical Statistics

URL httpdxdoiorg101214lnms1215540960


Reich B J Hodges J S amp Zadnik V (2006) Effects of residual smoothing on theposterior of the fixed effects in disease-mapping models Biometrics 62 1197ndash1206

Reiners W amp Lockwood J (2009) Philosophical Foundations for the Practices ofEcology Cambridge University Press

URL httpbooksgooglecombooksid=dr9cPgAACAAJ

Rigler F amp Peters R (1995) Excellence in Ecology Science and Limnology EcologyInstitute Germany

URL httportoncatieaccrcgi-binwxisexeIsisScript=CIENLxis

ampmethod=postampformato=2ampcantidad=1ampexpresion=mfn=008268

Robert C Chopin N amp Rousseau J (2009) Harold Jeffreysrsquo Theory of Probabilityrevisited Statistical Science Volume 24(2) 141ndash179

URL httpswwwnewtonacukpreprintsNI08021pdf

Robert C P (1993) A note on jeffreys-lindley paradox Statistica Sinica 3 601ndash608

Royle J A amp Kery M (2007) A Bayesian state-space formulation of dynamicoccupancy models Ecology 88(7) 1813ndash23

URL httpwwwncbinlmnihgovpubmed17645027

Scott J amp Berger J (2010) Bayes and Empirical-Bayes Multiplicity Adjustment in thevariable selection problem The Annals of Statistics

URL httpprojecteuclidorgeuclidaos1278861454

Spiegelhalter D J amp Smith A F M (1982) Bayes factor for linear and log-linearmodels with vague prior information J R Statist Soc B 44 377ndash387

Tierney L amp Kadane J B (1986) Accurate approximations for posterior moments andmarginal densities Journal of the American Statistical Association 81 82ndash86

Tyre A J Tenhumberg B Field S a Niejalke D Parris K amp Possingham H P(2003) Improving Precision and Reducing Bias in Biological Surveys EstimatingFalse-Negative Error Rates Ecological Applications 13(6) 1790ndash1801

URL httpwwwesajournalsorgdoiabs10189002-5078

Waddle J H Dorazio R M Walls S C Rice K G Beauchamp J Schuman M Jamp Mazzotti F J (2010) A new parameterization for estimating co-occurrence ofinteracting species Ecological applications a publication of the Ecological Society ofAmerica 20 1467ndash1475

Wasserman L (2000) Bayesian Model Selection and Model Averaging Journal ofmathematical psychology 44(1) 92ndash107


URL httpwwwncbinlmnihgovpubmed10733859

Wilson M Iversen E Clyde M A Schmidler S C amp Schildkraut J M (2010)Bayesian model search and multilevel inference for SNP association studies TheAnnals of Applied Statistics 4(3) 1342ndash1364

URL httpwwwncbinlmnihgovpmcarticlesPMC3004292

Womack A J Leon-Novelo L amp Casella G (2014) Inference from Intrinsic BayesProcedures Under Model Selection and Uncertainty Journal of the AmericanStatistical Association (June) 140114063448000

URL httpwwwtandfonlinecomdoiabs101080016214592014880348

Yuan M Joseph V R amp Zou H (2009) Structured variable selection and estimationThe Annals of Applied Statistics 3(4) 1738ndash1757

URL httpprojecteuclidorgeuclidaoas1267453962

Zeller K A Nijhawan S Salom-Perez R Potosme S H amp Hines J E (2011)Integrating occupancy modeling and interview data for corridor identification A casestudy for jaguars in nicaragua Biological Conservation 144(2) 892ndash901

Zellner A amp Siow A (1980) Posterior odds ratios for selected regression hypothesesIn Trabajos de estadıstica y de investigacion operativa (pp 585ndash603)

URL httpwwwspringerlinkcomindex5300770UP12246M9pdf


BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a B.S. degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a Ph.D. in interdisciplinary ecology with concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has currently accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.

140


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 GENERAL INTRODUCTION
  1.1 Occupancy Modeling
  1.2 A Primer on Objective Bayesian Testing
  1.3 Overview of the Chapters

2 MODEL ESTIMATION METHODS
  2.1 Introduction
    2.1.1 The Occupancy Model
    2.1.2 Data Augmentation Algorithms for Binary Models
  2.2 Single Season Occupancy
    2.2.1 Probit Link Model
    2.2.2 Logit Link Model
  2.3 Temporal Dynamics and Spatial Structure
    2.3.1 Dynamic Mixture Occupancy State-Space Model
    2.3.2 Incorporating Spatial Dependence
  2.4 Summary

3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS
  3.1 Introduction
  3.2 Objective Bayesian Inference
    3.2.1 The Intrinsic Methodology
    3.2.2 Mixtures of g-Priors
      3.2.2.1 Intrinsic priors
      3.2.2.2 Other mixtures of g-priors
  3.3 Objective Bayes Occupancy Model Selection
    3.3.1 Preliminaries
    3.3.2 Intrinsic Priors for the Occupancy Problem
    3.3.3 Model Posterior Probabilities
    3.3.4 Model Selection Algorithm
  3.4 Alternative Formulation
  3.5 Simulation Experiments
    3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
    3.5.2 Summary Statistics for the Highest Posterior Probability Model
  3.6 Case Study: Blue Hawker Data Analysis
    3.6.1 Results: Variable Selection Procedure
    3.6.2 Validation for the Selection Procedure
  3.7 Discussion

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
  4.1 Introduction
  4.2 Setup for Well-Formulated Models
    4.2.1 Well-Formulated Model Spaces
  4.3 Priors on the Model Space
    4.3.1 Model Prior Definition
    4.3.2 Choice of Prior Structure and Hyper-Parameters
    4.3.3 Posterior Sensitivity to the Choice of Prior
  4.4 Random Walks on the Model Space
    4.4.1 Simple Pruning and Growing
    4.4.2 Degree Based Pruning and Growing
  4.5 Simulation Study
    4.5.1 SNR and Sample Size Effect
    4.5.2 Coefficient Magnitude
    4.5.3 Special Points on the Scale
  4.6 Case Study: Ozone Data Analysis
  4.7 Discussion

5 CONCLUSIONS

APPENDIX

A FULL CONDITIONAL DENSITIES DYMOSS
B RANDOM WALK ALGORITHMS
C WFM SIMULATION DETAILS
D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

1-1 Interpretation of BF_ji when contrasting M_j and M_i

3-1 Simulation control parameters, occupancy model selector

3-2 Comparison of average minOdds(MPIP) under scenarios having different number of sites (N=50, N=100) and under scenarios having different number of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors

3-3 Comparison of average minOdds(MPIP) for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-6 Comparison between scenarios with different level of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-7 Comparison between scenarios with different level of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data

3-10 MPIP, presence component

3-11 MPIP, detection component

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors

4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-5 Variables used in the analyses of the ozone contamination dataset

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. model selected using the hierarchical lasso

C-1 Experimental conditions, WFM simulations

D-1 Variables used in the analyses of the ozone contamination dataset

D-2 Marginal inclusion probabilities, intrinsic prior

D-3 Marginal inclusion probabilities, Zellner-Siow prior

D-4 Marginal inclusion probabilities, Hyper-g11

D-5 Marginal inclusion probabilities, Hyper-g21

LIST OF FIGURES

Figure

2-1 Graphical representation, occupancy model

2-2 Graphical representation, occupancy model after data augmentation

2-3 Graphical representation, multiseason model for a single site

2-4 Graphical representation, data-augmented multiseason model

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors

4-1 Graphs of well-formulated polynomial models for p = 2

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects, for model M = {1, x1, x1^2}

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects

4-4 Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)}

4-5 Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a,b) in {(1,1), (1,ch)}

4-6 MT: DAG of the largest true model used in simulations

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1,ch)

C-1 SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios by model ranking according to model posterior probabilities

C-2 SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios by model ranking according to model posterior probabilities

C-3 SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios by model ranking according to model posterior probabilities

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology

The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure, and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity, using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.

CHAPTER 1
GENERAL INTRODUCTION

As with any other branch of science ecology strives to grasp truths about the

world that surrounds us and in particular about nature The objective truth sought

by ecology may well be beyond our grasp however it is reasonable to think that at

least partially, "Nature is capable of being understood" (Dewey 1958). We can observe

and interpret nature to formulate hypotheses which can then be tested against reality

Hypotheses that encounter no or little opposition when confronted with reality may

become contextual versions of the truth and may be generalized by scaling them

spatially and/or temporally, accordingly, to delimit the bounds within which they are valid.

To formulate hypotheses accurately and in a fashion amenable to scientific inquiry

not only the point of view and assumptions considered must be made explicit but

also the object of interest the properties worthy of consideration of that object and

the methods used in studying such properties (Reiners & Lockwood 2009; Rigler & Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms". This characterizes organisms

and their interactions as the objects of interest to ecology and prescribes distribution

and abundance as a relevant property of these organisms

With regards to the methods used to acquire ecological scientific knowledge

traditionally theoretical mathematical models (such as deterministic PDEs) have been

used However naturally varying systems are imprecisely observed and as such are

subject to multiple sources of uncertainty that must be explicitly accounted for Because

of this the ecological scientific community is developing a growing interest in flexible

and powerful statistical methods and among these Bayesian hierarchical models

predominate These methods rely on empirical observations and can accommodate

fairly complex relationships between empirical observations and theoretical process

models while accounting for diverse sources of uncertainty (Hooten 2006)


Bayesian approaches are now used extensively in ecological modeling however

there are two issues of concern one from the standpoint of ecological practitioners

and another from the perspective of scientific ecological endeavors First Bayesian

modeling tools require a considerable understanding of probability and statistical theory

leading practitioners to view them as black box approaches (Kery 2010) Second

although Bayesian applications proliferate in the literature in general there is a lack of

awareness of the distinction between approaches specifically devised for testing and

those for estimation (Ellison 2004) Furthermore there is a dangerous unfamiliarity with

the proven risks of using tools designed for estimation in testing procedures (Berger & Pericchi 1996; Berger et al. 2001; Kass & Raftery 1995; Moreno et al. 1998; Robert et al. 2009; Robert 1993), e.g., the use of flat priors in hypothesis testing.

Occupancy models have played a leading role during the past decade in large

biological population surveys The flexibility of the occupancy framework has allowed

the development of useful extensions to determine several key population parameters

which provide robust notions of the distribution structure and dynamics of a population

In order to address some of the concerns stated in previous paragraph we concentrate

in the occupancy framework to develop estimation and testing tools that will allow

ecologists first to gain insight about the estimation procedure and second to conduct

statistically sound model selection for site-occupancy data

1.1 Occupancy Modeling

Since MacKenzie et al (2002) and Tyre et al (2003) introduced the site-occupancy

framework countless applications and extensions of the method have been developed

in the ecological literature as evidenced by the 438000 hits on Google Scholar for

a search of "occupancy model". This class of models acknowledges that techniques used to conduct biological population surveys are prone to detection errors – if an

individual is detected it must be present while if it is not detected it might or might

not be Occupancy models improve upon traditional binary regression by accounting


for observed detection and partially observed presence as two separate but related

components In the site occupancy setting the chosen locations are surveyed

repeatedly in order to reduce the ambiguity caused by the observed zeros This

approach therefore allows probabilities of both presence (occurrence) and detection

to be estimated

The uses of site-occupancy models are many For example metapopulation

and island biogeography models are often parameterized in terms of site (or patch)

occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and

occupancy may be used as a surrogate for abundance to answer questions regarding

geographic distribution range size and metapopulation dynamics (MacKenzie et al

2004; Royle & Kery 2007).

The basic occupancy framework which assumes a single closed population with

fixed probabilities through time has proven to be quite useful however it might be of

limited utility when addressing some problems In particular assumptions for the basic

model may become too restrictive or unrealistic whenever the study period extends

throughout multiple years or seasons especially given the increasingly changing

environmental conditions that most ecosystems are currently experiencing

Among the extensions found in the literature one that we consider particularly

relevant incorporates heterogeneous occupancy probabilities through time. Models

that incorporate temporally varying probabilities stem from important meta-population

notions provided by Hanski (1994) such as occupancy probabilities depending on local

colonization and local extinction processes In spite of the conceptual usefulness of

Hanski's model, several strong and untenable assumptions (e.g., all patches being

homogenous in quality) are required for it to provide practically meaningful results

A more viable alternative which builds on Hanski (1994) is an extension of

the single season occupancy model of MacKenzie et al (2003) In this model the

heterogeneity of occupancy probabilities across seasons arises from local colonization


and extinction processes This model is flexible enough to let detection occurrence

extinction and colonization probabilities to each depend upon its own set of covariates

Model parameters are obtained through likelihood-based estimation

Using a maximum likelihood approach presents two drawbacks First the

uncertainty assessment for maximum likelihood parameter estimates relies on

asymptotic results which are obtained from implementation of the delta method

making it sensitive to sample size Second to obtain parameter estimates the latent

process (occupancy) is marginalized out of the likelihood leading to the usual zero

inflated Bernoulli model Although this is a convenient strategy for solving the estimation

problem after integrating the latent state variables (occupancy indicators) they are

no longer available Therefore finite sample estimates cannot be calculated directly

Instead a supplementary parametric bootstrapping step is necessary Further

additional structure such as temporal or spatial variation cannot be introduced by

means of random effects (Royle amp Kery 2007)

1.2 A Primer on Objective Bayesian Testing

With the advent of high dimensional data such as that found in modern problems

in ecology genetics physics etc coupled with evolving computing capability objective

Bayesian inferential methods have gained increasing popularity This however is by no

means a new approach in the way Bayesian inference is conducted In fact starting with

Bayes and Laplace and continuing for almost 200 years Bayesian analysis was primarily

based on "noninformative" priors (Berger & Bernardo 1992).

Now subjective elicitation of prior probabilities in Bayesian analysis is widely

recognized as the ideal (Berger et al 2001) however it is often the case that the

available information is insufficient to specify appropriate prior probabilistic statements

Commonly as in model selection problems where large model spaces have to be

explored the number of model parameters is prohibitively large preventing one from

eliciting prior information for the entire parameter space As a consequence in practice


the determination of priors through the definition of structural rules has become the

alternative to subjective elicitation for a variety of problems in Bayesian testing Priors

arising from these rules are known in the literature as noninformative objective default

or reference Many of these connotations generate controversy and are accused

perhaps rightly of providing a false pretension of objectivity Nevertheless we will avoid

that discussion and refer to them herein exchangeably as noninformative or objective

priors to convey the sense that no attempt to introduce an informed opinion is made in

defining prior probabilities

A plethora of "noninformative" methods has been developed in the past few decades (see Berger & Bernardo (1992), Berger & Pericchi (1996), Berger et al. (2001), Clyde & George (2004), Kass & Wasserman (1995, 1996), Liang et al. (2008), Moreno et al. (1998), Spiegelhalter & Smith (1982), Wasserman (2000), and the references

therein) We find particularly interesting those derived from the model structure in which

no tuning parameters are required especially since these can be regarded as automatic

methods Among them methods based on the Bayes factor for Intrinsic Priors have

proven their worth in a variety of inferential problems given their excellent performance

flexibility and ease of use This class of priors is discussed in detail in chapter 3 For

now some basic notation and notions of Bayesian inferential procedures are introduced

Hypothesis testing and the Bayes factor

Bayesian model selection techniques that aim to find the true model as opposed

to searching for the model that best predicts the data are fundamentally extensions to

Bayesian hypothesis testing strategies In general this Bayesian approach to hypothesis

testing and model selection relies on determining the amount of evidence found in favor

of one hypothesis (or model) over the other given an observed set of data Approached

from a Bayesian standpoint this type of problem can be formulated in great generality

using a natural well defined probabilistic framework that incorporates both model and

parameter uncertainty


Jeffreys (1935) first developed the Bayesian strategy to hypothesis testing and

consequently to the model selection problem Bayesian model selection within

a model space M = {M_1, M_2, ..., M_J}, where each model is associated with a parameter θ_j, which may be a vector of parameters itself, incorporates three types of probability distributions: (1) a prior probability distribution for each model, π(M_j); (2) a prior probability distribution for the parameters in each model, π(θ_j | M_j); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x | θ_j, M_j). These three probability densities induce the joint distribution p(x, θ_j, M_j) = f(x | θ_j, M_j) · π(θ_j | M_j) · π(M_j), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes rule:

p(M_j \mid \mathbf{x}) = \frac{m(\mathbf{x} \mid M_j)\,\pi(M_j)}{\sum_{i=1}^{J} m(\mathbf{x} \mid M_i)\,\pi(M_i)},   (1–1)

where m(\mathbf{x} \mid M_j) = \int f(\mathbf{x} \mid \theta_j, M_j)\,\pi(\theta_j \mid M_j)\,d\theta_j is the marginal likelihood of M_j.

Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:

\frac{p(M_j \mid \mathbf{x})}{p(M_k \mid \mathbf{x})} = \frac{m(\mathbf{x} \mid M_j)}{m(\mathbf{x} \mid M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}.   (1–2)

The first term on the right-hand side of (1–2), m(x | M_j)/m(x | M_k), is known as the Bayes factor comparing model M_j to model M_k, and it is denoted by BF_jk(x). The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, given by π(M_j)/π(M_k), to produce the posterior odds.

Note that the model posterior probability in (1–1) can be expressed as a function of Bayes factors. To illustrate, let model M_* in M be a reference model to which all other models in M are compared. Then, dividing both the numerator and denominator in (1–1) by m(x | M_*)π(M_*) yields

p(M_j \mid \mathbf{x}) = \frac{BF_{j*}(\mathbf{x})\,\frac{\pi(M_j)}{\pi(M_*)}}{1 + \sum_{M_i \in \mathcal{M},\, M_i \neq M_*} BF_{i*}(\mathbf{x})\,\frac{\pi(M_i)}{\pi(M_*)}}.   (1–3)

Therefore as the Bayes factor increases the posterior probability of model Mj given the

data increases If all models have equal prior probabilities a straightforward criterion

to select the best among all candidate models is to choose the model with the largest

Bayes factor As such the Bayes factor is not only useful for identifying models favored

by the data but it also provides a means to rank models in terms of their posterior

probabilities

Assuming equal model prior probabilities in (1–3), the prior odds are set equal to one and the model posterior odds in (1–2) become p(M_j | x)/p(M_k | x) = BF_jk(x). Based on the Bayes factors, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).

Table 1-1. Interpretation of BF_ji when contrasting M_j and M_i

ln BF_jk     BF_jk        Evidence in favor of M_j     P(M_j | x)
0 to 2       1 to 3       Weak evidence                0.5-0.75
2 to 6       3 to 20      Positive evidence            0.75-0.95
6 to 10      20 to 150    Strong evidence              0.95-0.99
>10          >150         Very strong evidence         >0.99
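To make (1–3) concrete, the following minimal numpy sketch (an illustration added here, not part of the original text) turns log Bayes factors against a reference model into posterior model probabilities; the log Bayes factor values in the example are hypothetical.

import numpy as np

def model_posteriors(log_bf, prior_odds=None):
    # log_bf[i]: ln BF_{i*} of model i against the reference model M_* (ln BF = 0 for M_* itself)
    # prior_odds[i]: pi(M_i)/pi(M_*); defaults to equal prior model probabilities
    log_bf = np.asarray(log_bf, dtype=float)
    if prior_odds is None:
        prior_odds = np.ones_like(log_bf)
    logw = log_bf + np.log(prior_odds)   # ln( BF_{i*} * prior odds )
    logw -= logw.max()                   # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()                   # equation (1-3), applied to every model at once

# Reference model plus two competitors with ln BF_{j*} of 1.2 and 4.0
print(model_posteriors([0.0, 1.2, 4.0]))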

Bayesian hypothesis testing and model selection procedures through Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does not require having nested hypotheses (i.e., having the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic to perform the test. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty by using the Bayesian machinery for model-averaged predictions and confidence bounds (Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a fully frequentist approach.

1.3 Overview of the Chapters

In the chapters that follow we develop a flexible and straightforward hierarchical

Bayesian framework for occupancy models allowing us to obtain estimates and conduct

robust testing from an "objective" Bayesian perspective. Latent mixtures of random

variables supply a foundation for our methodology This approach provides a means to

directly incorporate spatial dependency and temporal heterogeneity through predictors

that characterize either habitat quality of a given site or detectability features of a

particular survey conducted in a specific site On the other hand the Bayesian testing

methods we propose are (1) a fully automatic and objective method for occupancy

model selection and (2) an objective Bayesian testing tool that accounts for multiple

testing and for polynomial hierarchical structure in the space of predictors

Chapter 2 introduces the methods proposed for estimation of occupancy model

parameters A simple estimation procedure for the single season occupancy model

with covariates is formulated using both probit and logit links Based on the simple

version an extension is provided to cope with metapopulation dynamics by introducing

persistence and colonization processes Finally given the fundamental role that spatial

dependence plays in defining temporal dynamics a strategy to seamlessly account for

this feature in our framework is introduced


Chapter 3 develops a new fully automatic and objective method for occupancy

model selection that is asymptotically consistent for variable selection and averts the

use of tuning parameters In this Chapter first some issues surrounding multimodel

inference are described and insight about objective Bayesian inferential procedures is

provided. Then, building on modern methods for "objective" Bayesian testing to generate

priors on the parameter space the intrinsic priors for the parameters of the occupancy

model are obtained These are used in the construction of a variable selection algorithm

for "objective" variable selection tailored to the occupancy model framework.

Chapter 4 touches on two important and interconnected issues when conducting

model testing that have yet to receive the attention they deserve (1) controlling for false

discovery in hypothesis testing given the size of the model space, i.e., given the number

of tests performed and (2) non-invariance to location transformations of the variable

selection procedures in the face of polynomial predictor structure These elements both

depend on the definition of prior probabilities on the model space In this chapter a set

of priors on the model space and a stochastic search algorithm are proposed Together

these control for model multiplicity and account for the polynomial structure among the

predictors

CHAPTER 2
MODEL ESTIMATION METHODS

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes
The Adventure of the Copper Beeches

2.1 Introduction

Prior to the introduction of site-occupancy models (MacKenzie et al 2002 Tyre

et al 2003) presence-absence data from ecological monitoring programs were used

without any adjustment to assess the impact of management actions to observe trends

in species distribution through space and time or to model the habitat of a species (Tyre

et al 2003) These efforts however were suspect due to false-negative errors not

being accounted for False-negative errors occur whenever a species is present at a site

but goes undetected during the survey

Site-occupancy models developed independently by MacKenzie et al (2002)

and Tyre et al (2003) extend simple binary-regression models to account for the

aforementioned errors in detection of individuals common in surveys of animal or plant

populations Since their introduction the site-occupancy framework has been used in

countless applications and numerous extensions for it have been proposed Occupancy

models improve upon traditional binary regression by analyzing observed detection

and partially observed presence as two separate but related components In the site

occupancy setting the chosen locations are surveyed repeatedly in order to reduce the

ambiguity caused by the observed zeros This approach therefore allows simultaneous

estimation of the probabilities of presence (occurrence) and detection

Several extensions to the basic single-season closed population model are

now available The occupancy approach has been used to determine species range

dynamics (MacKenzie et al. 2003; Royle & Kery 2007) and to understand age/stage

structure within populations (Nichols et al 2007) model species co-occurrence

(MacKenzie et al 2004 Ovaskainen et al 2010 Waddle et al 2010) It has even been

suggested as a surrogate for abundance (MacKenzie & Nichols 2004). MacKenzie et al.

suggested using occupancy models to conduct large-scale monitoring programs since

this approach avoids the high costs associated with surveys designed for abundance

estimation Also to investigate metapopulation dynamics occupancy models improve

upon incidence function models (Hanski 1994) which are often parameterized in terms

of site (or patch) occupancy and assume homogeneous patches and a metapopulation

that is at a colonization-extinction equilibrium

Nevertheless the implementation of Bayesian occupancy models commonly resorts

to sampling strategies dependent on hyper-parameters subjective prior elicitation

and relatively elaborate algorithms From the standpoint of practitioners these are

often treated as black-box methods (Kery 2010) As such the potential of using the

methodology incorrectly is high Commonly these procedures are fitted with packages

such as BUGS or JAGS. Although the package's ease of use has led to a widespread

adoption of the methods the user may be oblivious as to the assumptions underpinning

the analysis

We believe providing straightforward and robust alternatives to implement these

methods will help practitioners gain insight about how occupancy modeling and more

generally Bayesian modeling is performed In this Chapter using a simple Gibbs

sampling approach first we develop a versatile method to estimate the single season

closed population site-occupancy model then extend it to analyze metapopulation

dynamics through time and finally provide a further adaptation to incorporate spatial

dependence among neighboring sites.

2.1.1 The Occupancy Model

In this section of the document we first introduce our results published in Dorazio

& Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. For the standard sampling protocol for collecting site-occupancy data, J > 1 independent

surveys are conducted at each of N representative sample locations (sites) noting

whether a species is detected or not detected during each survey Let yij denote a binary

random variable that indicates detection (y = 1) or non-detection (y = 0) during the

j th survey of site i Without loss of generality J may be assumed constant among all N

sites to simplify description of the model In practice however site-specific variation in

J poses no real difficulties and is easily implemented This sampling protocol therefore

yields an N × J matrix Y of detection/non-detection data.

Note that the observed process yij is an imperfect representation of the underlying

occupancy or presence process Hence letting zi denote the presence indicator at site i

this model specification can therefore be represented through the hierarchy

y_{ij} \mid z_i, \lambda \sim \mathrm{Bernoulli}(z_i\, p_{ij}),
z_i \mid \alpha \sim \mathrm{Bernoulli}(\psi_i),   (2–1)

where pij is the probability of correctly classifying as occupied the i th site during the j th

survey ψi is the presence probability at the i th site The graphical representation of this

process is

Figure 2-1. Graphical representation, occupancy model (nodes: presence probability ψ_i, presence indicator z_i, detections y_i, detection probability p_i).
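To make the data-generating process in (2–1) concrete, here is a minimal simulation sketch added for illustration (not part of the original text); the covariates, coefficient values, and the probit choice for F are hypothetical assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, J = 100, 5                                   # sites and surveys per site

alpha = np.array([-0.5, 1.0])                   # hypothetical presence coefficients
lam = np.array([0.25, 1.5])                     # hypothetical detection coefficients
x = np.column_stack([np.ones(N), rng.normal(size=N)])        # site covariates (N, 2)
q = np.dstack([np.ones((N, J)), rng.normal(size=(N, J))])    # survey covariates (N, J, 2)

psi = norm.cdf(x @ alpha)                       # occupancy probabilities psi_i
z = rng.binomial(1, psi)                        # latent presence indicators z_i
p = norm.cdf(q @ lam)                           # detection probabilities p_ij
y = rng.binomial(1, z[:, None] * p)             # detections: forced to 0 when z_i = 0

print("occupied sites:", z.sum(), "sites with at least one detection:", (y.sum(axis=1) > 0).sum())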

Probabilities of detection and occupancy can both be made functions of covariates

and their corresponding parameter estimates can be obtained using either a maximum


likelihood or a Bayesian approach Existing methodologies from the likelihood

perspective marginalize over the latent occupancy process (zi ) making the estimation

procedure depend only on the detections Most Bayesian strategies rely on MCMC

algorithms that require parameter prior specification and tuning. However, Albert & Chib

(1993) proposed a longstanding strategy in the Bayesian statistical literature that models

binary outcomes using a simple Gibbs sampler This procedure which is described in

the following section can be extrapolated to the occupancy setting eliminating the need

for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models

Probit model Data-augmentation with latent normal variables

At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is

0 the latent variable can be simulated from a truncated normal distribution with support

(−∞, 0]. And if the outcome is 1, the latent variable can be simulated from a truncated normal distribution in (0, ∞). To understand the reasoning behind this strategy, let Y ∼ Bern(Φ(x^T β)) and V = x^T β + ε, with ε ∼ N(0, 1). In such a case, note that

\Pr(y = 1 \mid x^T\beta) = \Phi(x^T\beta) = \Pr(\varepsilon < x^T\beta) = \Pr(\varepsilon > -x^T\beta) = \Pr(v > 0 \mid x^T\beta).

Thus, whenever y = 1, then v > 0, and v ≤ 0 otherwise. In other words, we

may think of y as a truncated version of v Thus we can sample iteratively alternating

between the latent variables conditioned on model parameters and vice versa to draw

from the desired posterior densities By augmenting the data with the latent variables

we are able to obtain full conditional posterior distributions for model parameters that are

easy to draw from (equation 2–3 below). Further, just as we may sample the latent variables, we may also sample the parameters.

Given some initial values for all model parameters values for the latent variables

can be simulated By conditioning on the latter it is then possible to draw samples

from the parameters' posterior distributions. These samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach. Generally, after a large number of iterations, it yields draws from the joint

posterior distribution of the latent variables and the model parameters conditional on the

observed outcome values We formalize the procedure below

Assume that each outcome Y_1, Y_2, ..., Y_n is such that Y_i | x_i, β ∼ Bernoulli(q_i), where q_i = Φ(x_i^T β), the standard normal CDF evaluated at x_i^T β, and where x_i and β

are the p-dimensional vectors of observed covariates for the i -th observation and their

corresponding parameters respectively

Now let y = y1 y2 yn be the vector of observed outcomes and [ β ] represents

the prior distribution of the model parameters Therefore the posterior distribution of β is

given by

[\beta \mid y] \propto [\beta] \prod_{i=1}^{n} \Phi(x_i^T\beta)^{y_i}\left(1 - \Phi(x_i^T\beta)\right)^{1-y_i},   (2–2)

which is intractable Nevertheless introducing latent random variables V = (V1 Vn)

such that Vi sim N (xTi β 1) resolves this difficulty by specifying that whenever Yi = 1

then Vi gt 0 and if Yi = 0 then Vi le 0 This yields

[\beta, v \mid y] \propto [\beta] \prod_{i=1}^{n} \phi(v_i \mid x_i^T\beta, 1)\left\{I_{v_i \le 0}\, I_{y_i = 0} + I_{v_i > 0}\, I_{y_i = 1}\right\},   (2–3)

where ϕ(x | µ, τ²) is the probability density function of a normal random variable x with mean µ and variance τ². The data augmentation artifact works since [β | y] = ∫ [β, v | y] dv; hence, if we sample from the joint posterior (2–3) and extract only the sampled values for β, they will correspond to samples from [β | y].

From the expression above it is possible to obtain the full conditional distributions

for V and β Thus a Gibbs sampler can be proposed For example if we use a flat prior

for β (i.e., [β] ∝ 1), the full conditionals are given by

\beta \mid V, y \sim \mathrm{MVN}_k\left((X'X)^{-1}(X'V),\, (X'X)^{-1}\right)   (2–4)

V \mid \beta, y \sim \prod_{i=1}^{n} \mathrm{tr}\,N(x_i^T\beta, 1, Q_i)   (2–5)

where MVN_k(µ, Σ) represents a multivariate normal distribution with mean vector µ and variance-covariance matrix Σ, and tr N(ξ, σ², Q) stands for the truncated normal distribution with mean ξ, variance σ², and truncation region Q. For each i = 1, 2, ..., n, the support of the truncated variables is given by Q = (−∞, 0] if y_i = 0 and Q = (0, ∞) otherwise. Note that conjugate normal priors could be used alternatively.

At iteration m + 1, the Gibbs sampler draws V^(m+1) conditional on β^(m) from (2–5), and then samples β^(m+1) conditional on V^(m+1) from (2–4). This process is repeated for m = 0, 1, ..., n_sim, where n_sim is the number of iterations in the Gibbs sampler.
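The following is a minimal sketch (added for illustration, not from the original text) of this two-step sampler for plain probit regression with a flat prior on β; it assumes a design matrix X of shape (n, p) and a 0/1 response vector y are already available.

import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, seed=0):
    """Albert & Chib (1993)-style Gibbs sampler for probit regression, flat prior on beta."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)             # posterior covariance under a flat prior, eq. (2-4)
    chol = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for m in range(n_iter):
        mu = X @ beta
        # v_i | beta, y_i: truncated normal, positive if y_i = 1, non-positive otherwise (eq. 2-5)
        lower = np.where(y == 1, 0.0, -np.inf) - mu
        upper = np.where(y == 1, np.inf, 0.0) - mu
        v = mu + truncnorm.rvs(lower, upper, size=n, random_state=rng)
        # beta | v: normal with mean (X'X)^{-1} X'v and covariance (X'X)^{-1} (eq. 2-4)
        beta = XtX_inv @ (X.T @ v) + chol @ rng.standard_normal(p)
        draws[m] = beta
    return draws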

Logit model Data-augmentation with latent Polya-gamma variables

Recently Polson et al (2013) developed a novel and efficient approach for Bayesian

inference for logistic models using Polya-gamma latent variables which is analogous

to the Albert & Chib algorithm. The result arises from what the authors refer to as the

Polya-gamma distribution To construct a random variable from this family consider the

infinite mixture of the iid sequence of Exp(1) random variables {E_k}_{k=1}^{∞} given by

\omega = \frac{2}{\pi^2} \sum_{k=1}^{\infty} \frac{E_k}{(2k-1)^2},

with probability density function

g(\omega) = \sum_{k=1}^{\infty} (-1)^k\, \frac{2k+1}{\sqrt{2\pi\omega^3}}\, e^{-\frac{(2k+1)^2}{8\omega}}\, I_{\omega \in (0,\infty)},   (2–6)

and Laplace transform E[e^{-t\omega}] = \cosh^{-1}\left(\sqrt{t/2}\right).

The Polya-gamma family of densities is obtained through an exponential tilting of

the density g in (2–6). These densities, indexed by c ≥ 0, are characterized by

f(\omega \mid c) = \cosh\left(\frac{c}{2}\right) e^{-\frac{c^2\omega}{2}}\, g(\omega).
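The samplers discussed below repeatedly need draws of ω ∼ PG(1, c). As a self-contained illustration (not part of the original text), the weighted sum-of-gammas representation in Polson et al. (2013), ω = (1/(2π²)) Σ_k g_k / ((k − 1/2)² + c²/(4π²)) with g_k ∼ Gamma(1, 1), can be truncated to give an approximate sampler; dedicated exact Polya-gamma samplers exist and would be preferable in practice.

import numpy as np

def pg_draw(c, rng, trunc=200):
    """Approximate draw from PG(1, c), one draw per entry of c, by truncating the
    sum-of-gammas representation; trunc = 200 terms is an assumption for illustration."""
    c = np.atleast_1d(np.asarray(c, dtype=float))
    k = np.arange(1, trunc + 1)                                  # k = 1, ..., trunc
    g = rng.gamma(shape=1.0, scale=1.0, size=(c.size, trunc))    # g_k ~ Gamma(1, 1)
    denom = (k - 0.5) ** 2 + (c[:, None] ** 2) / (4.0 * np.pi ** 2)
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

rng = np.random.default_rng(0)
print(pg_draw(np.array([0.0, 1.0, 3.0]), rng))                   # e.g. mean of PG(1,0) is 1/4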

The likelihood for the binomial logistic model can be expressed in terms of latent

Polya-gamma variables as follows. Assume y_i ∼ Bernoulli(δ_i), with predictors x_i' = (x_{i1}, ..., x_{ip}) and success probability δ_i = e^{x_i'β}/(1 + e^{x_i'β}). Hence, the posterior for the model parameters can be represented as

[\beta \mid y] = \frac{[\beta] \prod_{i=1}^{n} \delta_i^{y_i}(1 - \delta_i)^{1-y_i}}{c(y)},

where c(y) is the normalizing constant.

To facilitate the sampling procedure a data augmentation step can be performed

by introducing Polya-gamma random variables ω_i ∼ PG(1, x_i'β). This yields the data-augmented posterior

[\beta, \omega \mid y] = \frac{\left(\prod_{i=1}^{n} \Pr(y_i = 1 \mid \beta)\right) f(\omega \mid x'\beta)\,[\beta]}{c(y)},   (2–7)

such that [\beta \mid y] = \int_{\mathbb{R}^{+}} [\beta, \omega \mid y]\, d\omega.

Thus from the augmented model the full conditional density for β is given by

[\beta \mid \omega, y] \propto \left(\prod_{i=1}^{n} \Pr(y_i = 1 \mid \beta)\right) f(\omega \mid x'\beta)\,[\beta]
= \prod_{i=1}^{n} \frac{(e^{x_i'\beta})^{y_i}}{1 + e^{x_i'\beta}} \prod_{i=1}^{n} \cosh\left(\frac{|x_i'\beta|}{2}\right) \exp\left[-\frac{(x_i'\beta)^2\,\omega_i}{2}\right] g(\omega_i).   (2–8)

This expression yields a normal posterior distribution if β is assigned flat or normal priors. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate β in the occupancy framework.

2.2 Single Season Occupancy

Let p_ij = F(q_ij^T λ) be the probability of correctly classifying as occupied the i-th site during the j-th survey, conditional on the site being occupied, and let ψ_i = F(x_i^T α) correspond to the presence probability at the i-th site. Further, let F^{-1}(·) denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by λ and α, respectively, the r-variate and p-variate coefficient vectors for the detection and for the presence probabilities. Then the following is the joint posterior probability for the presence indicators and the model parameters:

\pi^{*}(z, \alpha, \lambda) \propto \pi_{\alpha}(\alpha)\,\pi_{\lambda}(\lambda) \prod_{i=1}^{N} F(x_i'\alpha)^{z_i}\left(1 - F(x_i'\alpha)\right)^{1-z_i} \times \prod_{j=1}^{J} \left(z_i F(q_{ij}'\lambda)\right)^{y_{ij}} \left(1 - z_i F(q_{ij}'\lambda)\right)^{1-y_{ij}}.   (2–9)

As in the simple probit regression problem this posterior is intractable Consequently

sampling from it directly is not possible. But the procedures of Albert & Chib for the

probit model and of Polson et al for the logit model can be extended to generate an

MCMC sampling strategy for the occupancy problem In what follows we make use of

this framework to develop samplers with which occupancy parameter estimates can be

obtained for both probit and logit link functions These algorithms have the added benefit

that they do not require tuning parameters nor eliciting parameter priors subjectively.

2.2.1 Probit Link Model

To extend Albert & Chib's algorithm to the occupancy framework with a probit link,

first we introduce two sets of latent variables denoted by wij and vi corresponding to

the normal latent variables used to augment the data The corresponding hierarchy is

y_{ij} \mid z_i, w_{ij} \sim \mathrm{Bernoulli}\left(z_i\, I_{w_{ij} > 0}\right)
w_{ij} \mid \lambda \sim N(q_{ij}'\lambda,\, 1), \qquad \lambda \sim [\lambda]
z_i \mid v_i \sim I_{v_i > 0}
v_i \mid \alpha \sim N(x_i'\alpha,\, 1), \qquad \alpha \sim [\alpha]   (2–10)

represented by the directed graph found in Figure 2-2

Figure 2-2. Graphical representation, occupancy model after data augmentation (nodes: α, v_i, z_i, y_i, w_i, λ).

Under this hierarchical model the joint density is given by

\pi^{*}(z, v, \alpha, w, \lambda) \propto C_y\, \pi_{\alpha}(\alpha)\, \pi_{\lambda}(\lambda) \prod_{i=1}^{N} \phi(v_i;\, x_i'\alpha, 1)\, I_{\{v_i > 0\}}^{z_i}\, I_{\{v_i \le 0\}}^{1 - z_i} \times \prod_{j=1}^{J} \left(z_i I_{w_{ij} > 0}\right)^{y_{ij}} \left(1 - z_i I_{w_{ij} > 0}\right)^{1 - y_{ij}} \phi(w_{ij};\, q_{ij}'\lambda, 1)   (2–11)

The full conditional densities derived from the posterior in equation (2–11) are

detailed below

1. These are obtained from the full conditional of z after integrating out v and w:

f(z \mid \alpha, \lambda) = \prod_{i=1}^{N} f(z_i \mid \alpha, \lambda) = \prod_{i=1}^{N} (\psi_i^{*})^{z_i} (1 - \psi_i^{*})^{1 - z_i},
\quad \text{where } \psi_i^{*} = \frac{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}}{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}} + (1 - \psi_i) \prod_{j=1}^{J} I_{y_{ij} = 0}}.   (2–12)

2.

f(v \mid z, \alpha) = \prod_{i=1}^{N} f(v_i \mid z_i, \alpha) = \prod_{i=1}^{N} \mathrm{tr}\,N(x_i'\alpha, 1, A_i),
\quad \text{where } A_i = \begin{cases} (-\infty, 0] & z_i = 0 \\ (0, \infty) & z_i = 1 \end{cases}   (2–13)

and tr N(µ, σ², A) denotes the pdf of a truncated normal random variable with mean µ, variance σ², and truncation region A.

3.

f(\alpha \mid v) = \phi_p\left(\alpha;\, \Sigma_{\alpha} X'v,\, \Sigma_{\alpha}\right),   (2–14)

where Σ_α = (X'X)^{-1} and ϕ_k(x; µ, Σ) represents the k-variate normal density with mean vector µ and variance matrix Σ.

4.

f(w \mid y, z, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J} f(w_{ij} \mid y_{ij}, z_i, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J} \mathrm{tr}\,N(q_{ij}'\lambda, 1, B_{ij}),
\quad \text{where } B_{ij} = \begin{cases} (-\infty, \infty) & z_i = 0 \\ (-\infty, 0] & z_i = 1 \text{ and } y_{ij} = 0 \\ (0, \infty) & z_i = 1 \text{ and } y_{ij} = 1 \end{cases}   (2–15)

5.

f(\lambda \mid w) = \phi_r\left(\lambda;\, \Sigma_{\lambda} Q'w,\, \Sigma_{\lambda}\right),   (2–16)

where Σ_λ = (Q'Q)^{-1}.

The Gibbs sampling algorithm for the model can then be summarized as follows (an illustrative code sketch is given after the list):

1. Initialize z, α, v, λ, and w.
2. Sample z_i ∼ Bern(ψ_i*).
3. Sample v_i from a truncated normal with µ = x_i'α, σ = 1, and truncation region depending on z_i.
4. Sample α ∼ N(Σ_α X'v, Σ_α), with Σ_α = (X'X)^{-1}.
5. Sample w_ij from a truncated normal with µ = q_ij'λ, σ = 1, and truncation region depending on y_ij and z_i.
6. Sample λ ∼ N(Σ_λ Q'w, Σ_λ), with Σ_λ = (Q'Q)^{-1}.
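Below is a minimal sketch of one possible implementation of this algorithm with flat priors (an illustration added here, not part of the original text). It assumes an occupancy design matrix x of shape (N, p), a detection design array q of shape (N, J, r), and detections y of shape (N, J), as in the earlier simulation sketch.

import numpy as np
from scipy.stats import norm, truncnorm

def occu_probit_gibbs(y, x, q, n_iter=2000, seed=0):
    """Gibbs sampler for the single-season probit occupancy model, flat priors on alpha and lambda."""
    rng = np.random.default_rng(seed)
    N, J = y.shape
    p, r = x.shape[1], q.shape[2]
    Qflat = q.reshape(N * J, r)                    # one detection-design row per site-survey
    Sig_a = np.linalg.inv(x.T @ x)                 # Sigma_alpha = (X'X)^{-1}
    La = np.linalg.cholesky(Sig_a)
    Sig_l = np.linalg.inv(Qflat.T @ Qflat)         # Sigma_lambda = (Q'Q)^{-1}
    Ll = np.linalg.cholesky(Sig_l)
    alpha, lam = np.zeros(p), np.zeros(r)
    detected = y.sum(axis=1) > 0                   # sites with at least one detection
    out = {"alpha": np.empty((n_iter, p)), "lambda": np.empty((n_iter, r))}
    for m in range(n_iter):
        psi = norm.cdf(x @ alpha)
        pij = norm.cdf(q @ lam)
        # step 2: z_i ~ Bern(psi*_i), eq. (2-12); z_i = 1 automatically where a detection occurred
        num = psi * (pij ** y * (1 - pij) ** (1 - y)).prod(axis=1)
        psi_star = num / (num + (1 - psi) * (y == 0).all(axis=1))
        z = np.where(detected, 1.0, rng.binomial(1, psi_star))
        # step 3: v_i truncated normal around x_i'alpha, region set by z_i (eq. 2-13)
        mu_v = x @ alpha
        lo = np.where(z == 1, 0.0, -np.inf) - mu_v
        hi = np.where(z == 1, np.inf, 0.0) - mu_v
        v = mu_v + truncnorm.rvs(lo, hi, size=N, random_state=rng)
        # step 4: alpha | v ~ N(Sigma_a X'v, Sigma_a), eq. (2-14)
        alpha = Sig_a @ (x.T @ v) + La @ rng.standard_normal(p)
        # step 5: w_ij truncated normal around q_ij'lambda, region set by (z_i, y_ij), eq. (2-15)
        mu_w = q @ lam
        lo_w = np.where((z[:, None] == 1) & (y == 1), 0.0, -np.inf)
        hi_w = np.where((z[:, None] == 1) & (y == 0), 0.0, np.inf)
        w = mu_w + truncnorm.rvs((lo_w - mu_w).ravel(), (hi_w - mu_w).ravel(),
                                 size=N * J, random_state=rng).reshape(N, J)
        # step 6: lambda | w ~ N(Sigma_l Q'w, Sigma_l), eq. (2-16)
        lam = Sig_l @ (Qflat.T @ w.ravel()) + Ll @ rng.standard_normal(r)
        out["alpha"][m], out["lambda"][m] = alpha, lam
    return out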

2.2.2 Logit Link Model

Now turning to the logit link version of the occupancy model again let yij be the

indicator variable used to mark detection of the target species on the j th survey at the

i th site and let zi be the indicator variable that denotes presence (zi = 1) or absence

(z_i = 0) of the target species at the i-th site. The model is now defined by

y_{ij} \mid z_i, \lambda \sim \mathrm{Bernoulli}(z_i\, p_{ij}), \quad \text{where } p_{ij} = \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}, \qquad \lambda \sim [\lambda],
z_i \mid \alpha \sim \mathrm{Bernoulli}(\psi_i), \quad \text{where } \psi_i = \frac{e^{x_i'\alpha}}{1 + e^{x_i'\alpha}}, \qquad \alpha \sim [\alpha].

In this hierarchy the contribution of a single site to the likelihood is

L_i(\alpha, \lambda) = \frac{(e^{x_i'\alpha})^{z_i}}{1 + e^{x_i'\alpha}} \prod_{j=1}^{J} \left(z_i\, \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}} \left(1 - z_i\, \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1 - y_{ij}}.   (2–17)

As in the probit case, we data-augment the likelihood with two separate sets of latent variables, in this case each of them having a Polya-gamma distribution. Augmenting the model and using the posterior in (2–7), the joint is

[z, \alpha, \lambda \mid y] \propto [\alpha]\,[\lambda] \prod_{i=1}^{N} \frac{(e^{x_i'\alpha})^{z_i}}{1 + e^{x_i'\alpha}}\, \cosh\left(\frac{|x_i'\alpha|}{2}\right) \exp\left[-\frac{(x_i'\alpha)^2 v_i}{2}\right] g(v_i) \times \prod_{j=1}^{J} \left(z_i\, \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}} \left(1 - z_i\, \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1 - y_{ij}} \cosh\left(\frac{|z_i q_{ij}'\lambda|}{2}\right) \exp\left[-\frac{(z_i q_{ij}'\lambda)^2 w_{ij}}{2}\right] g(w_{ij}).   (2–18)

The full conditionals for z, α, v, λ, and w obtained from (2–18) are provided below (an illustrative code sketch follows the list):

1. The full conditional for z is obtained after marginalizing the latent variables and yields

f(z \mid \alpha, \lambda) = \prod_{i=1}^{N} f(z_i \mid \alpha, \lambda) = \prod_{i=1}^{N} (\psi_i^{*})^{z_i} (1 - \psi_i^{*})^{1 - z_i},
\quad \text{where } \psi_i^{*} = \frac{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}}{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}} + (1 - \psi_i) \prod_{j=1}^{J} I_{y_{ij} = 0}}.   (2–19)

2. Using the result derived in Polson et al. (2013), we have that

f(v \mid z, \alpha) = \prod_{i=1}^{N} f(v_i \mid z_i, \alpha) = \prod_{i=1}^{N} \mathrm{PG}(1, x_i'\alpha).   (2–20)

3.

f(\alpha \mid v) \propto [\alpha] \prod_{i=1}^{N} \exp\left[z_i x_i'\alpha - \frac{x_i'\alpha}{2} - \frac{(x_i'\alpha)^2 v_i}{2}\right].   (2–21)

4. By the same result as that used for v, the full conditional for w is

f(w \mid y, z, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J} f(w_{ij} \mid y_{ij}, z_i, \lambda) = \left(\prod_{i \in S_1}\prod_{j=1}^{J} \mathrm{PG}(1, |q_{ij}'\lambda|)\right)\left(\prod_{i \notin S_1}\prod_{j=1}^{J} \mathrm{PG}(1, 0)\right),   (2–22)

with S_1 = \{i \in \{1, 2, ..., N\} : z_i = 1\}.

5.

f(\lambda \mid z, y, w) \propto [\lambda] \prod_{i \in S_1}\prod_{j=1}^{J} \exp\left[y_{ij} q_{ij}'\lambda - \frac{q_{ij}'\lambda}{2} - \frac{(q_{ij}'\lambda)^2 w_{ij}}{2}\right],   (2–23)

with S_1 as defined above.
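To make (2–19)–(2–23) concrete, here is a minimal sketch of one sweep of these updates (added for illustration, not from the original text). It assumes flat priors, reuses the approximate pg_draw function sketched in Section 2.1.2, and takes x of shape (N, p), q of shape (N, J, r), and y of shape (N, J); the Gaussian forms for α and λ follow from completing the square in (2–21) and (2–23).

import numpy as np

def logit_occu_sweep(y, x, q, z, alpha, lam, rng):
    """One Gibbs sweep for the logit-link occupancy model using (approximate) Polya-gamma draws."""
    N, J = y.shape
    psi = 1.0 / (1.0 + np.exp(-(x @ alpha)))
    pij = 1.0 / (1.0 + np.exp(-(q @ lam)))
    # (2-19): update z_i; sites with a detection are occupied with probability one
    num = psi * (pij ** y * (1 - pij) ** (1 - y)).prod(axis=1)
    psi_star = num / (num + (1 - psi) * (y == 0).all(axis=1))
    z = np.where(y.sum(axis=1) > 0, 1, rng.binomial(1, psi_star))
    # (2-20): v_i ~ PG(1, x_i'alpha)
    v = pg_draw(x @ alpha, rng)
    # (2-21): alpha | v, z is Gaussian under a flat prior
    Va = np.linalg.inv(x.T @ (v[:, None] * x))
    alpha = rng.multivariate_normal(Va @ (x.T @ (z - 0.5)), Va)
    # (2-22): w_ij ~ PG(1, |q_ij'lambda|) at occupied sites
    occ = z == 1
    Qocc = q[occ].reshape(-1, q.shape[2])
    w = pg_draw(Qocc @ lam, rng)
    # (2-23): lambda | w, y, restricted to occupied sites, Gaussian under a flat prior
    Vl = np.linalg.inv(Qocc.T @ (w[:, None] * Qocc))
    lam = rng.multivariate_normal(Vl @ (Qocc.T @ (y[occ].ravel() - 0.5)), Vl)
    return z, alpha, lam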

The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure

The uses of the single-season model are limited to very specific problems In

particular assumptions for the basic model may become too restrictive or unrealistic

whenever the study period extends throughout multiple years or seasons especially

given the increasingly changing environmental conditions that most ecosystems are

currently experiencing

Among the many extensions found in the literature one that we consider particularly

relevant incorporates heterogeneous occupancy probabilities through time. Extensions of

site-occupancy models that incorporate temporally varying probabilities can be traced

back to Hanski (1994) The heterogeneity of occupancy probabilities through time arises

from local colonization and extinction processes MacKenzie et al (2003) proposed an

alternative to Hanski's approach in order to incorporate imperfect detection. The method

is flexible enough to let detection occurrence survival and colonization probabilities

each depend upon its own set of covariates using likelihood-based estimation for the

model parameters

However the approach of MacKenzie et al presents two drawbacks First

the uncertainty assessment for maximum likelihood parameter estimates relies on

asymptotic results (obtained from implementation of the delta method) making it

sensitive to sample size And second to obtain parameter estimates the latent process

(occupancy) is marginalized out of the likelihood leading to the usual zero-inflated

Bernoulli model Although this is a convenient strategy to solve the estimation problem

the latent state variables (occupancy indicators) are no longer available and as such

finite sample estimates cannot be calculated unless an additional (and computationally

expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, as

the occupancy process is integrated out the likelihood approach precludes incorporation

of additional structural dependence using random effects Thus the model cannot

account for spatial dependence which plays a fundamental role in this setting

To work around some of the shortcomings encountered when fitting dynamic

occupancy models via likelihood-based methods, Royle & Kery developed what they

refer to as a dynamic occupancy state space model (DOSS) alluding to the conceptual

similarity found between this model and the class of state space models found in the

time series literature In particular this model allows one to retain the latent process

(occupancy indicators) in order to obtain small sample estimates and to eventually

generate extensions that incorporate structure in time andor space through random

effects

The data used in the DOSS model come from standard repeated presence-absence surveys, with N sampling locations (patches or sites) indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for t = 1, 2, ..., T seasons. Here an important assumption is that the site occupancy status

is closed within but not across seasons

As is usual in the occupancy modeling framework two different processes are

considered The first one is the detection process per site-visit-season combination

denoted by yijt The yijt are indicator functions that take the value 1 if the species is

present at site i survey j and season t and 0 otherwise These detection indicators

are assumed to be independent within each site and season The second response

considered is the partially observed presence (occupancy) indicators zit These are

indicator variables which are equal to 1 whenever yijt = 1 for one or more of the visits

made to site i during season t; otherwise, the values of the z_it's are unknown. Royle &

Kery refer to these two processes as the observation (yijt) and the state (zit) models

In this setting the parameters of greatest interest are the occurrence or site

occupancy probabilities denoted by ψit as well as those representing the population

dynamics which are accounted for by introducing changes in occupancy status over

time through local colonization and survival That is if a site was not occupied at season

t minus 1 at season t it can either be colonized or remain unoccupied On the other hand

if the site was in fact occupied at season t minus 1 it can remain that way (survival) or

become abandoned (local extinction) at season t The probabilities of survival and

colonization from season t minus 1 to season t at the i th site are denoted by θi(tminus1) and

γ_{i(t-1)}, respectively. During the initial period (or season) the model for the state process is expressed in terms of the occupancy probability (Equation 2-24). For subsequent periods the state process is specified in terms of survival and colonization probabilities (Equation 2-25); in particular,

z_{i1} ~ Bernoulli(ψ_{i1})   (2-24)

z_{it} | z_{i(t-1)} ~ Bernoulli( z_{i(t-1)} θ_{i(t-1)} + (1 - z_{i(t-1)}) γ_{i(t-1)} )   (2-25)

The observation model, conditional on the latent process z_{it}, is defined by

y_{ijt} | z_{it} ~ Bernoulli( z_{it} p_{ijt} )   (2-26)

Royle & Kery induce heterogeneity by site, site-season, and site-survey-season, respectively, in the occupancy, in the survival and colonization, and in the detection probabilities through the following specification:

logit(ψ_{i1}) = x_1 + r_i,   r_i ~ N(0, σ²_ψ),   logit⁻¹(x_1) ~ Unif(0, 1)
logit(θ_{it}) = a_t + u_i,   u_i ~ N(0, σ²_θ),   logit⁻¹(a_t) ~ Unif(0, 1)
logit(γ_{it}) = b_t + v_i,   v_i ~ N(0, σ²_γ),   logit⁻¹(b_t) ~ Unif(0, 1)
logit(p_{ijt}) = c_t + w_{ij},   w_{ij} ~ N(0, σ²_p),   logit⁻¹(c_t) ~ Unif(0, 1)   (2-27)

where x_1, a_t, b_t, c_t are the season fixed effects for the corresponding probabilities, and where (r_i, u_i, v_i) and w_{ij} are the site and site-survey random effects, respectively. Additionally, all variance components are assigned the usual inverse gamma priors.
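To make the data-generating mechanism in Equations 2-24 through 2-26 concrete, the following sketch simulates presence and detection histories for a dynamic occupancy survey. It is a minimal illustration rather than the code used in this work; the constant probabilities psi1, theta, gamma, and p are hypothetical values chosen only for the example.

import numpy as np

rng = np.random.default_rng(1)
N, J, T = 100, 5, 4                           # sites, surveys per season, seasons
psi1, theta, gamma, p = 0.5, 0.7, 0.2, 0.4    # hypothetical constant probabilities

z = np.zeros((N, T), dtype=int)               # latent occupancy indicators
y = np.zeros((N, J, T), dtype=int)            # detection indicators

# Season 1: occupancy drawn from the initial occupancy probability (Eq. 2-24)
z[:, 0] = rng.binomial(1, psi1, size=N)
for t in range(1, T):
    # Eq. 2-25: occupied sites persist with prob. theta, empty sites are colonized with prob. gamma
    prob = z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma
    z[:, t] = rng.binomial(1, prob)
for t in range(T):
    # Eq. 2-26: detections are only possible at occupied sites
    y[:, :, t] = rng.binomial(1, z[:, t][:, None] * p, size=(N, J))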

As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model

We assume that the probabilities for occupancy survival colonization and detection

are all functions of linear combinations of covariates However our setup varies

slightly from that considered by Royle amp Kery (2007) In essence we modify the way in

which the estimates for survival and colonization probabilities are attained Our model

incorporates the notion that occupancy at a site occupied during the previous season

takes place through persistence where we define persistence as a function of both

survival and colonization That is a site occupied at time t may again be occupied

at time t + 1 if the current settlers survive if they perish and new settlers colonize

simultaneously or if both current settlers survive and new ones colonize

Our functional forms of choice are again the probit and logit link functions This

means that each probability of interest, which we will refer to for illustration as δ, is linked to a linear combination of covariates x'ξ through the relationship defined by δ = F(x'ξ), where F(·) represents the inverse link function. This particular assumption

facilitates relating the data augmentation algorithms of Albert amp Chib and Polson et al to

Royle amp Keryrsquos DOSS model We refer to this extension of Royle amp Keryrsquos model as the

Dynamic Mixture Occupancy State Space model (DYMOSS)

As before, let y_{ijt} be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site during the t-th season, and let z_{it} be the indicator variable that denotes presence (z_{it} = 1) or absence (z_{it} = 0) of the target species at the i-th site in the t-th season, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., J}, and t ∈ {1, 2, ..., T}.

Additionally, assume that the probabilities for occupancy at time t = 1, persistence, colonization, and detection are all functions of covariates, with corresponding parameter vectors α, ∆^{(s)} = {δ^{(s)}_{t-1}}_{t=2}^{T}, B^{(c)} = {β^{(c)}_{t-1}}_{t=2}^{T}, and Λ = {λ_t}_{t=1}^{T}, and covariate matrices X_{(o)}, X = {X_{t-1}}_{t=2}^{T}, and Q = {Q_t}_{t=1}^{T}, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy:

State model:

z_{i1} | α ~ Bernoulli(ψ_{i1}), where ψ_{i1} = F( x'_{(o)i} α )

z_{it} | z_{i(t-1)}, δ^{(s)}_{t-1}, β^{(c)}_{t-1} ~ Bernoulli( z_{i(t-1)} θ_{i(t-1)} + (1 - z_{i(t-1)}) γ_{i(t-1)} ),
where θ_{i(t-1)} = F( δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1} ) and γ_{i(t-1)} = F( x'_{i(t-1)} β^{(c)}_{t-1} )   (2-28)

Observed model:

y_{ijt} | z_{it}, λ_t ~ Bernoulli( z_{it} p_{ijt} ), where p_{ijt} = F( q'_{ijt} λ_t )   (2-29)
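As a small illustration of Equation 2-28, the sketch below computes the persistence and colonization probabilities for one season under the probit inverse link; the covariate matrix and coefficient values are hypothetical, and the logit link could be substituted for the probit.

import numpy as np
from scipy.stats import norm

def transition_probs(x_prev, beta_c, delta_s):
    """Persistence (theta) and colonization (gamma) for one season, Eq. 2-28.

    x_prev  : (N, p) covariates for season t-1
    beta_c  : (p,) suitability coefficients beta^(c)_{t-1}
    delta_s : scalar survival offset delta^(s)_{t-1}
    """
    eta = x_prev @ beta_c
    theta = norm.cdf(delta_s + eta)   # persistence: survival shifts the intercept
    gamma = norm.cdf(eta)             # colonization: habitat suitability alone
    return theta, gamma

x_prev = np.random.default_rng(0).normal(size=(100, 3))
theta, gamma = transition_probs(x_prev, beta_c=np.array([0.5, -0.3, 0.2]), delta_s=0.8)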

In the hierarchical setup given by Equations 2ndash28 and 2ndash29 θi(tminus1) corresponds to

the probability of persistence from time t minus 1 to time t at site i and γi(tminus1) denotes the

colonization probability Note that θi(tminus1) minus γi(tminus1) yields the survival probability from t minus 1

to t The effect of survival is introduced by changing the intercept of the linear predictor

by a quantity δ(s)tminus1 Although in this version of the model this effect is accomplished by

just modifying the intercept it can be extended to have covariates determining δ(s)tminus1 as

well The graphical representation of the model for a single site is

Figure 2-3. Graphical representation of the multiseason model for a single site (nodes: α, λ_t, δ^{(s)}_{t-1}, β^{(c)}_{t-1}, z_{it}, and y_{it}).

The joint posterior for the model defined by this hierarchical setting is

[ z, η, α, β, λ | y ] = C_y ∏_{i=1}^{N} [ ψ_{i1} ∏_{j=1}^{J} p_{ij1}^{y_{ij1}} (1 - p_{ij1})^{(1-y_{ij1})} ]^{z_{i1}} [ (1 - ψ_{i1}) ∏_{j=1}^{J} I_{{y_{ij1}=0}} ]^{1-z_{i1}} [η_1][α] ×
∏_{t=2}^{T} ∏_{i=1}^{N} [ ( θ_{i(t-1)}^{z_{it}} (1 - θ_{i(t-1)})^{1-z_{it}} )^{z_{i(t-1)}} ( γ_{i(t-1)}^{z_{it}} (1 - γ_{i(t-1)})^{1-z_{it}} )^{1-z_{i(t-1)}} ] [ ∏_{j=1}^{J} p_{ijt}^{y_{ijt}} (1 - p_{ijt})^{1-y_{ijt}} ]^{z_{it}} × [ ∏_{j=1}^{J} I_{{y_{ijt}=0}} ]^{1-z_{it}} [η_t][β_{t-1}][λ_{t-1}],   (2-30)

which as in the single season case is intractable Once again a Gibbs sampler cannot

be constructed directly to sample from this joint posterior The graphical representation

of the model for one site incorporating the latent variables is provided in Figure 2-4

Figure 2-4. Graphical representation of the data-augmented multiseason model (the nodes of Figure 2-3 augmented with the latent variables u_i, v_{i,t-1}, and w_{it}).

Probit link normal-mixture DYMOSS model


We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combination of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distribution of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, persistence and colonization, and detection, are

• u_i ~ N( x'_{(o)i} α, 1 ),
• v_{i(t-1)} ~ z_{i(t-1)} N( δ^{(s)}_{(t-1)} + x'_{i(t-1)} β^{(c)}_{(t-1)}, 1 ) + (1 - z_{i(t-1)}) N( x'_{i(t-1)} β^{(c)}_{(t-1)}, 1 ), and
• w_{ijt} ~ N( q'_{ijt} λ_t, 1 ).

Introducing these latent variables into the hierarchical formulation yields

State model:

u_{i1} | α ~ N( x'_{(o)i} α, 1 )
z_{i1} | u_i ~ Bernoulli( I_{u_i > 0} )

for t > 1,

v_{i(t-1)} | z_{i(t-1)}, β_{t-1} ~ z_{i(t-1)} N( δ^{(s)}_{(t-1)} + x'_{i(t-1)} β^{(c)}_{(t-1)}, 1 ) + (1 - z_{i(t-1)}) N( x'_{i(t-1)} β^{(c)}_{(t-1)}, 1 )
z_{it} | v_{i(t-1)} ~ Bernoulli( I_{v_{i(t-1)} > 0} )   (2-31)

Observed model:

w_{ijt} | λ_t ~ N( q'_{ijt} λ_t, 1 )
y_{ijt} | z_{it}, w_{ijt} ~ Bernoulli( z_{it} I_{w_{ijt} > 0} )   (2-32)

Note that the result presented in Section 22 corresponds to the particular case for

T = 1 of the model specified by Equations 2ndash31 and 2ndash32

As mentioned previously model parameters are obtained using a Gibbs sampling

approach Let ϕ(x |microσ2) denote the pdf of a normally distributed random variable x

with mean μ and standard deviation σ. Also let

1. W_t = (w_{1t}, w_{2t}, ..., w_{Nt}) with w_{it} = (w_{i1t}, w_{i2t}, ..., w_{iJ_i t}), for i = 1, 2, ..., N and t = 1, 2, ..., T,
2. u = (u_1, u_2, ..., u_N), and
3. V = (v_1, ..., v_{T-1}) with v_t = (v_{1t}, v_{2t}, ..., v_{Nt}).

For the probit link model the joint posterior distribution is

π( Z, u, V, {W_t}_{t=1}^{T}, α, B^{(c)}, ∆^{(s)}, Λ ) ∝ [α] ∏_{i=1}^{N} ϕ( u_i | x'_{(o)i} α, 1 ) I_{{u_i > 0}}^{z_{i1}} I_{{u_i ≤ 0}}^{1-z_{i1}}
× ∏_{t=2}^{T} [β^{(c)}_{t-1}, δ^{(s)}_{t-1}] ∏_{i=1}^{N} ϕ( v_{i(t-1)} | μ^{(v)}_{i(t-1)}, 1 ) I_{{v_{i(t-1)} > 0}}^{z_{it}} I_{{v_{i(t-1)} ≤ 0}}^{1-z_{it}}
× ∏_{t=1}^{T} [λ_t] ∏_{i=1}^{N} ∏_{j=1}^{J_{it}} ϕ( w_{ijt} | q'_{ijt} λ_t, 1 ) ( z_{it} I_{{w_{ijt} > 0}} )^{y_{ijt}} ( 1 - z_{it} I_{{w_{ijt} > 0}} )^{(1-y_{ijt})},
where μ^{(v)}_{i(t-1)} = z_{i(t-1)} δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}.   (2-33)

Initialize the Gibbs sampler at α^{(0)}, B^{(c)(0)}, ∆^{(s)(0)}, and Λ^{(0)}. For m = 0, 1, ..., n_sim:

The sampler proceeds iteratively by block sampling sequentially for each primary

sampling period as follows first the presence process then the latent variables from

the data-augmentation step for the presence component followed by the parameters for

the presence process then the latent variables for the detection component and finally

the parameters for the detection component Letting [|] denote the full conditional

probability density function of the component conditional on all other unknown

parameters and the observed data for m = 1 nsim the sampling procedure can be

summarized as

[ z_1^{(m)} | · ] → [ u^{(m)} | · ] → [ α^{(m)} | · ] → [ W_1^{(m)} | · ] → [ λ_1^{(m)} | · ] → [ z_2^{(m)} | · ] → [ V_{2-1}^{(m)} | · ] → [ β^{(c)(m)}_{2-1}, δ^{(s)(m)}_{2-1} | · ] → [ W_2^{(m)} | · ] → [ λ_2^{(m)} | · ] → ··· → [ z_T^{(m)} | · ] → [ V_{T-1}^{(m)} | · ] → [ β^{(c)(m)}_{T-1}, δ^{(s)(m)}_{T-1} | · ] → [ W_T^{(m)} | · ] → [ λ_T^{(m)} | · ]

The full conditional probability densities for this Gibbs sampling algorithm are

presented in detail within Appendix A
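To illustrate the data-augmentation draws underlying these full conditionals, the sketch below performs one truncated-normal update of the first-season latent scores and a conjugate normal update of the occupancy coefficients. It is a minimal sketch assuming a flat prior on the coefficients being drawn; the design matrix, current occupancy indicators, and helper names are illustrative and not taken from the dissertation's own code.

import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def draw_truncated_latent(mean, positive):
    """Draw from N(mean, 1) truncated to (0, inf) when positive, else (-inf, 0]."""
    a = np.where(positive, -mean, -np.inf)
    b = np.where(positive, np.inf, -mean)
    return truncnorm.rvs(a, b, loc=mean, scale=1.0, random_state=rng)

def draw_coefficients(X, latent):
    """Conjugate normal draw for coefficients given latent normals (flat prior assumed)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    mean = XtX_inv @ X.T @ latent
    chol = np.linalg.cholesky(XtX_inv)
    return mean + chol @ rng.standard_normal(X.shape[1])

# One sweep for the first-season occupancy block (illustrative shapes)
N, p = 200, 4
X_o = rng.normal(size=(N, p))
alpha = np.zeros(p)
z1 = rng.binomial(1, 0.5, size=N)                # current occupancy indicators
u = draw_truncated_latent(X_o @ alpha, z1 == 1)  # latent occupancy scores u_i
alpha = draw_coefficients(X_o, u)                # update alpha | u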


Logit link Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by

State model:

u_{i1} | α ~ PG( 1, | x'_{(o)i} α | )
z_{i1} | u_i ~ Bernoulli( I_{u_i > 0} )

for t > 1,

v_{i(t-1)} | · ~ PG( 1, | z_{i(t-1)} δ^{(s)}_{(t-1)} + x'_{i(t-1)} β^{(c)}_{(t-1)} | )
z_{it} | v_{i(t-1)} ~ Bernoulli( I_{v_{i(t-1)} > 0} )   (2-34)

Observed model:

w_{ijt} | λ_t ~ PG( 1, q'_{ijt} λ_t )
y_{ijt} | z_{it}, w_{ijt} ~ Bernoulli( z_{it} I_{w_{ijt} > 0} )   (2-35)

The logit link version of the joint posterior is given by

π( Z, u, V, {W_t}_{t=1}^{T}, α, B^{(c)}, ∆^{(s)}, Λ ) ∝ [λ_1][α] ∏_{i=1}^{N} ( e^{x'_{(o)i} α} )^{z_{i1}} / ( 1 + e^{x'_{(o)i} α} ) PG( u_i; 1, | x'_{(o)i} α | )
× ∏_{j=1}^{J_{i1}} ( z_{i1} e^{q'_{ij1} λ_1} / (1 + e^{q'_{ij1} λ_1}) )^{y_{ij1}} ( 1 - z_{i1} e^{q'_{ij1} λ_1} / (1 + e^{q'_{ij1} λ_1}) )^{1-y_{ij1}} PG( w_{ij1}; 1, | z_{i1} q'_{ij1} λ_1 | )
× ∏_{t=2}^{T} [δ^{(s)}_{t-1}][β^{(c)}_{t-1}][λ_t] ∏_{i=1}^{N} ( exp[ μ^{(v)}_{i(t-1)} ] )^{z_{it}} / ( 1 + exp[ μ^{(v)}_{i(t-1)} ] ) PG( v_{it}; 1, | μ^{(v)}_{i(t-1)} | )
× ∏_{j=1}^{J_{it}} ( z_{it} e^{q'_{ijt} λ_t} / (1 + e^{q'_{ijt} λ_t}) )^{y_{ijt}} ( 1 - z_{it} e^{q'_{ijt} λ_t} / (1 + e^{q'_{ijt} λ_t}) )^{1-y_{ijt}} PG( w_{ijt}; 1, | z_{it} q'_{ijt} λ_t | ),   (2-36)

with μ^{(v)}_{i(t-1)} = z_{i(t-1)} δ^{(s)}_{t-1} + x'_{i(t-1)} β^{(c)}_{t-1}.
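As a sketch of the Polya-Gamma augmentation implied by 2-36, the code below performs one conditionally conjugate update of the detection coefficients λ_t given the latent PG variables. It assumes a Polya-Gamma sampler is available (here `random_polyagamma` from the external `polyagamma` package, which is an assumption and not a dependency of this work), and the normal prior precision B0_prec is an illustrative choice.

import numpy as np
# Assumption: the `polyagamma` package (or any equivalent PG sampler) is installed.
from polyagamma import random_polyagamma

rng = np.random.default_rng(0)

def update_lambda(Q, y_eff, lam, B0_prec):
    """One Polya-Gamma update of detection coefficients for surveys at occupied sites.

    Q       : (n, p) detection design matrix (rows with z_it = 1 only)
    y_eff   : (n,) detection indicators for those rows
    lam     : (p,) current coefficient value, used to draw the PG latents
    B0_prec : (p, p) prior precision for lambda (illustrative normal prior)
    """
    psi = Q @ lam
    omega = random_polyagamma(1, psi, random_state=rng)   # w ~ PG(1, |q' lambda|)
    kappa = y_eff - 0.5
    prec = Q.T @ (omega[:, None] * Q) + B0_prec
    cov = np.linalg.inv(prec)
    mean = cov @ (Q.T @ kappa)
    return rng.multivariate_normal(mean, cov)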


The sampling procedure is entirely analogous to that described for the probit

version. The full conditional densities derived from expression 2-36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence

In this section we describe how the additional layer of complexity space can also

be accounted for by continuing to use the same data-augmentation framework The

method we employ to incorporate spatial dependence is a slightly modified version of

the traditional approach for spatial generalized linear mixed models (GLMMrsquos) and

extends the model proposed by Johnson et al (2013) for the single season closed

population occupancy model

The traditional approach consists of using spatial random effects to induce a

correlation structure among adjacent sites This formulation introduced by Besag et al

(1991) assumes that the spatial random effect corresponds to a Gaussian Markov

Random Field (GMRF) The model known as the Spatial GLMM (SGLMM) is used to

analyze areal data It has been applied extensively given the flexibility of its hierarchical

formulation and the availability of software for its implementation (Hughes amp Haran

2013)

Succinctly the spatial dependence is accounted for in the model by adding a

random vector η assumed to have a conditionally-autoregressive (CAR) prior (also

known as the Gaussian Markov random field prior) To define the prior let the pair

G = (V E) represent the undirected graph for the entire spatial region studied where

V = (1 2 N) denotes the vertices of the graph (sites) and E the set of edges

between sites E is constituted by elements of the form (i j) indicating that sites i

and j are spatially adjacent for some i j isin V The prior for the spatial effects is then

characterized by

[ η | τ ] ∝ τ^{rank(Q)/2} exp[ -(τ/2) η'Qη ],   (2-37)

where Q = ( diag(A1) - A ) is the precision matrix, with A denoting the adjacency matrix. The entries of the adjacency matrix A are such that diag(A) = 0 and A_{ij} = I_{(i,j) ∈ E}.

The matrix Q is singular; hence the probability density defined in Equation 2-37 is improper, i.e., it doesn't integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper the posterior for the model parameters is proper. If a constraint such as Σ_k η_k = 0 is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
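To fix ideas, the following sketch builds the (improper) CAR precision matrix of Equation 2-37 from a site adjacency matrix and verifies its rank deficiency. The 2 x 3 grid of sites is a made-up example, not data from this study, and the symbol Q follows the notation above.

import numpy as np

# Hypothetical 2 x 3 grid of sites; sites are adjacent if they share a grid edge.
coords = [(r, c) for r in range(2) for c in range(3)]
N = len(coords)
A = np.zeros((N, N))
for i, (ri, ci) in enumerate(coords):
    for j, (rj, cj) in enumerate(coords):
        if abs(ri - rj) + abs(ci - cj) == 1:
            A[i, j] = 1.0

Q = np.diag(A.sum(axis=1)) - A        # precision matrix diag(A1) - A of the CAR prior
rank = np.linalg.matrix_rank(Q)       # rank N - 1, so the prior in Eq. 2-37 is improper
log_kernel = lambda eta, tau: -0.5 * tau * eta @ Q @ eta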

Assuming that all but the detection process are subject to spatial correlations and

using the notation we have developed up to this point the spatially explicit version of the

DYMOSS model is characterized by the hierarchy represented by equations 2ndash38 and

2ndash39

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding the steps to sample η^{(o)} and {η_t}_{t=2}^{T} conditional

on all other parameters Furthermore the corresponding parameters and spatial

random effects of a given component (ie occupancy survival and colonization)

can be effortlessly pooled together into a single parameter vector to perform block

sampling For each of the latent variables the only modification required is to sum the

corresponding spatial effect to the linear predictor so that these retain their conditional

independence given the linear combination of fixed effects and the spatial effects

State model:

z_{i1} | α ~ Bernoulli(ψ_{i1}), where ψ_{i1} = F( x'_{(o)i} α + η^{(o)}_i )
[ η^{(o)} | τ ] ∝ τ^{rank(Q)/2} exp[ -(τ/2) η^{(o)'} Q η^{(o)} ]

z_{it} | z_{i(t-1)}, α, β_{t-1}, λ_{t-1} ~ Bernoulli( z_{i(t-1)} θ_{i(t-1)} + (1 - z_{i(t-1)}) γ_{i(t-1)} ),
where θ_{i(t-1)} = F( δ^{(s)}_{(t-1)} + x'_{i(t-1)} β^{(c)}_{t-1} + η_{it} ) and γ_{i(t-1)} = F( x'_{i(t-1)} β^{(c)}_{t-1} + η_{it} )
[ η_t | τ ] ∝ τ^{rank(Q)/2} exp[ -(τ/2) η'_t Q η_t ]   (2-38)

Observed model:

y_{ijt} | z_{it}, λ_t ~ Bernoulli( z_{it} p_{ijt} ), where p_{ijt} = F( q'_{ijt} λ_t )   (2-39)

In spite of the popularity of this approach to incorporating spatial dependence three

shortcomings have been reported in the literature (Hughes amp Haran 2013 Reich et al

2006) (1) model parameters have no clear interpretation due to spatial confounding

of the predictors with the spatial effect (2) there is variance inflation due to spatial

confounding and (3) the high dimensionality of the latent spatial variables leads to

high computational costs To avoid such difficulties we follow the approach used by

Hughes amp Haran (2013) which builds upon the earlier work by Reich et al (2006) This

methodology is summarized in what follows

Let a vector of spatial effects η have the CAR model given by 2-37 above. Now consider a random vector ζ ~ MVN( 0, τ K'QK ), with Q defined as above and where τ K'QK corresponds to the precision of the distribution (and not the covariance matrix), with the matrix K satisfying K'K = I.

This last condition implies that the linear predictor Xβ + η = Xβ + Kζ With

respect to how the matrix K is chosen Hughes amp Haran (2013) recommend basing its

construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as P⊥AP⊥, with P⊥ = I - X(X'X)^{-1}X', and where A is the adjacency matrix previously described. The choice of the Moran operator is based

on the fact that it accounts for the underlying graph while incorporating the spatial

structure residual to the design matrix X These elements are incorporated into its

spectral decomposition of the Moran operator That is its eigenvalues correspond to the

values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process

orthogonal to X while its eigenvectors provide the patterns of spatial dependence

residual to X Thus the matrix K is chosen to be the matrix whose columns are the

eigenvectors of the Moran operator for a particular adjacency matrix
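The construction of K takes only a few lines of linear algebra. The sketch below forms the Moran operator for a design matrix X and adjacency matrix A and keeps its leading eigenvectors; the number retained (q) and the random inputs are illustrative choices, not values from the analyses in this work.

import numpy as np

def moran_basis(X, A, q):
    """Columns of K: leading eigenvectors of the Moran operator P_perp A P_perp."""
    n = X.shape[0]
    P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # I - X(X'X)^{-1}X'
    M = P_perp @ A @ P_perp                                   # Moran operator
    eigvals, eigvecs = np.linalg.eigh(M)                      # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]                         # largest Moran's I values first
    return eigvecs[:, order[:q]]

# Hypothetical inputs: X an n x p design matrix, A a symmetric n x n adjacency matrix
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                                # symmetric, zero diagonal
K = moran_basis(X, A, q=10)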


Using this strategy, the new hierarchical formulation of our model is simply modified by letting η^{(o)} = K^{(o)} ζ^{(o)} and η_t = K_t ζ_t, with

1. ζ^{(o)} ~ MVN( 0, τ^{(o)} K^{(o)'} Q K^{(o)} ), where K^{(o)} is the eigenvector matrix for the Moran operator P^{(o)⊥} A P^{(o)⊥}, and
2. ζ_t ~ MVN( 0, τ_t K'_t Q K_t ), where K_t is the eigenvector matrix for P^⊥_t A P^⊥_t, for t = 2, 3, ..., T.

The algorithms for the probit and logit link from section 231 can be readily

adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for (α, ζ^{(o)}) and (β^{(c)}_{t-1}, δ^{(s)}_{t-1}, ζ_t) and making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary

With a few exceptions (Dorazio amp Taylor-Rodrıguez 2012 Johnson et al 2013

Royle amp Kery 2007) recent Bayesian approaches to site-occupancy modeling with

covariates have relied on model configurations (eg as multivariate normal priors of

parameters in logit scale) that lead to unfamiliar conditional posterior distributions thus

precluding the use of a direct sampling approach Therefore the sampling strategies

available are based on algorithms (eg Metropolis Hastings) that require tuning and the

knowledge to do so correctly

In Dorazio amp Taylor-Rodrıguez (2012) we proposed a Bayesian specification for

which a Gibbs sampler of the basic occupancy model is available and allowed detection

and occupancy probabilities to depend on linear combinations of predictors This

method described in section 221 is based on the data augmentation algorithm of

Albert amp Chib (1993) There the full conditional posteriors of the parameters of the probit

regression model are cast as latent mixtures of normal random variables The probit and

the logit link yield similar results with large sample sizes however their results may be

different when small to moderate sample sizes are considered because the logit link

function places more mass in the tails of the distribution than the probit link does In


section 222 we adapt the method for the single season model to work with the logit link

function

The basic occupancy framework is useful but it assumes a single closed population

with fixed probabilities through time Hence its assumptions may not be appropriate to

address problems where the interest lies in the temporal dynamics of the population

Hence we developed a dynamic model that incorporates the notion that occupancy

at a site previously occupied takes place through persistence which depends both on

survival and habitat suitability By this we mean that a site occupied at time t may again

be occupied at time t + 1 if (1) the current settlers survive (2) the existing settlers

perish but new settlers simultaneously colonize or (3) current settlers survive and new

ones colonize during the same season In our current formulation of the DYMOSS both

colonization and persistence depend on habitat suitability characterized by xprimei(tminus1)β(c)tminus1

They only differ in that persistence is also influenced by whether the site being occupied

during season t minus 1 enhances the suitability of the site or harms it through density

dependence

Additionally the study of the dynamics that govern distribution and abundance of

biological populations requires an understanding of the physical and biotic processes

that act upon them and these vary in time and space Consequently as a final step in

this Chapter we described a straightforward strategy to add spatial dependence among

neighboring sites in the dynamic metapopulation model This extension is based on the

popular Bayesian spatial modeling technique of Besag et al (1991) updated using the

methods described in (Hughes amp Haran 2013)

Future steps along these lines are (1) develop the software necessary to

implement the tools described throughout the Chapter and (2) build a suite of additional

extensions using this framework for occupancy models will be explored The first of

them will be used to incorporate information from different sources such as tracks

scats surveys and direct observations into a single model This can be accomplished


by adding a layer to the hierarchy where the source and spatial scale of the data is

accounted for The second extension is a single season spatially explicit multiple

species co-occupancy model This model will allow studying complex interactions

and testing hypotheses about species interactions at a given point in time Lastly this

co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of

the DYMOSS model


CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes

The Sign of Four

3.1 Introduction

Occupancy models are often used to understand the mechanisms that dictate

the distribution of a species Therefore variable selection plays a fundamental role in

achieving this goal To the best of our knowledge ldquoobjectiverdquo Bayesian alternatives for

variable selection have not been put forth for this problem and with a few exceptions

(Hooten amp Hobbs 2014 Link amp Barker 2009) AIC is the method used to choose from

competing site-occupancy models In addition the procedures currently implemented

and accessible to ecologists require enumerating and estimating all the candidate

models (Fiske amp Chandler 2011 Mazerolle amp Mazerolle 2013) In practice this

can be achieved if the model space considered is small enough which is possible

if the choice of the model space is guided by substantial prior knowledge about the

underlying ecological processes Nevertheless many site-occupancy surveys collect

large amounts of covariate information about the sampled sites Given that the total

number of candidate models grows exponentially fast with the number of predictors

considered choosing a reduced set of models guided by ecological intuition becomes

increasingly difficult This is even more so the case in the occupancy model context

where the model space is the cartesian product of models for presence and models for

detection Given the issues mentioned above we propose the first objective Bayesian

variable selection method for the single-season occupancy model framework This

approach explores in a principled manner the entire model space It is completely


automatic precluding the need for both tuning parameters in the sampling algorithm and

subjective elicitation of parameter prior distributions

As mentioned above in ecological modeling if model selection or less frequently

model averaging is considered the Akaike Information Criterion (AIC) (Akaike 1983)

or a version of it is the measure of choice for comparing candidate models (Fiske amp

Chandler 2011 Mazerolle amp Mazerolle 2013) The AIC is designed to find the model

that has on average the density closest in Kullback-Leibler distance to the density

of the true data generating mechanism The model with the smallest AIC is selected

However if nested models are considered one of them being the true one generally the

AIC will not select it (Wasserman 2000) Commonly the model selected by AIC will be

more complex than the true one The reason for this is that the AIC has a weak signal to

noise ratio and as such it tends to overfit (Rao amp Wu 2001) Other versions of the AIC

provide a bias correction that enhances the signal to noise ratio leading to a stronger

penalization for model complexity Some examples are the AICc (Hurvich amp Tsai 1989)

and AICu (McQuarrie et al 1997) however these are also not consistent for selection

albeit asymptotically efficient (Rao amp Wu 2001)

If we are interested in prediction as opposed to testing the AIC is certainly

appropriate However when conducting inference the use of Bayesian model averaging

and selection methods is more fitting If the true data generating mechanism is among

those considered asymptotically Bayesian methods choose the true model with

probability one Conversely if the true model is not among the alternatives and a

suitable parameter prior is used the posterior probability of the most parsimonious

model closest to the true one tends asymptotically to one

In spite of this in general for Bayesian testing direct elicitation of prior probabilistic

statements is often impeded because the problems studied may not be sufficiently

well understood to make an informed decision about the priors Conversely there may

be a prohibitively large number of parameters making specifying priors for each of


these parameters an arduous task In addition to this seemingly innocuous subjective

choices for the priors on the parameter space may drastically affect test outcomes

This has been a recurring argument in favor of objective Bayesian procedures

which appeal to the use of formal rules to build parameter priors that incorporate the

structural information inside the likelihood while utilizing some objective criterion (Kass amp

Wasserman 1996)

One popular choice of ldquoobjectiverdquo prior is the reference prior (Berger amp Bernardo

1992) which is the prior that maximizes the amount of signal extracted from the

data These priors have proven to be effective as they are fully automatic and can

be frequentist matching in the sense that the posterior credible interval agrees with the

frequentist confidence interval from repeated sampling with equal coverage-probability

(Kass amp Wasserman 1996) Reference priors however are improper and while

they yield reasonable posterior parameter probabilities the derived model posterior

probabilities may be ill defined To avoid this shortcoming Berger amp Pericchi (1996)

introduced the intrinsic Bayes factor (IBF) for model comparison Moreno et al (1998)

building on the IBF of Berger amp Pericchi (1996) developed a limiting procedure to

generate a system of priors that yield well-defined posteriors even though these

priors may sometimes be improper The IBF is built using a data-dependent prior to

automatically generate Bayes factors however the extension introduced by Moreno

et al (1998) generates the intrinsic prior by taking a theoretical average over the space

of training samples freeing the prior from data dependence

In our view in the face of a large number of predictors the best alternative is to run

a stochastic search algorithm using good ldquoobjectiverdquo testing parameter priors and to

incorporate suitable model priors This being said the discussion about model priors is

deferred until Chapter 4 this Chapter focuses on the priors on the parameter space

The Chapter is structured as follows First issues surrounding multimodel inference

are described and insight about objective Bayesian inferential procedures is provided


Then building on modern methods for ldquoobjectiverdquo Bayesian testing to generate priors

on the parameter space the intrinsic priors for the parameters of the occupancy model

are derived These are used in the construction of an algorithm for ldquoobjectiverdquo model

selection tailored to the occupancy model framework To assess the performance of our

methods we provide results from a simulation study in which distinct scenarios both

favorable and unfavorable are used to determine the robustness of these tools and

analyze the Blue Hawker data set which has been examined previously in the ecological

literature (Dorazio & Taylor-Rodríguez, 2012; Kery et al., 2010).

3.2 Objective Bayesian Inference

As mentioned before in practice noninformative priors arising from structural

rules are an alternative to subjective elicitation of priors Some of the rules used in

defining noninformative priors include the principle of insufficient reason parametrization

invariance maximum entropy geometric arguments coverage matching and decision

theoretic approaches (see Kass amp Wasserman (1996) for a discussion)

These rules reflect one of two attitudes (1) noninformative priors either aim to

convey unique representations of ignorance or (2) they attempt to produce probability

statements that may be accepted by convention This latter attitude is in the same

spirit as how weights and distances are defined (Kass amp Wasserman 1996) and

characterizes the way in which Bayesian reference methods are interpreted today ie

noninformative priors are seen to be chosen by convention according to the situation

A word of caution must be given when using noninformative priors Difficulties arise

in their implementation that should not be taken lightly In particular these difficulties

may occur because noninformative priors are generally improper (meaning that they do

not integrate or sum to a finite number) and as such are said to depend on arbitrary

constants

Bayes factors strongly depend upon the prior distributions for the parameters

included in each of the models being compared This can be an important limitation


considering that when using noninformative priors their introduction will result in the

Bayes factors being a function of the ratio of arbitrary constants given that these priors

are typically improper (see Jeffreys 1961 Pericchi 2005 and references therein)

Many different approaches have been developed to deal with the arbitrary constants

when using improper priors since then These include the use of partial Bayes factors

(Berger amp Pericchi 1996 Good 1950 Lempers 1971) setting the ratio of arbitrary

constants to a predefined value (Spiegelhalter amp Smith 1982) and approximating to the

Bayes factor (see Haughton 1988 as cited in Berger amp Pericchi 1996 Kass amp Raftery

1995; Tierney & Kadane, 1986).

3.2.1 The Intrinsic Methodology

Berger amp Pericchi (1996) cleverly dealt with the arbitrary constants that arise when

using improper priors by introducing the intrinsic Bayes factor (IBF) procedure This

solution based on partial Bayes factors provides the means to replace the improper

priors by proper ldquoposteriorrdquo priors The IBF is obtained from combining the model

structure with information contained in the observed data Furthermore they showed

that as the sample size tends to infinity the Intrinsic Bayes factor corresponds to the

proper Bayes factor arising from the intrinsic priors

Intrinsic priors however are not unique The asymptotic correspondence between

the IBF and the Bayes factor arising from the intrinsic prior yields two functional

equations that are solved by a whole class of intrinsic priors Because all the priors

in the class produce Bayes factors that are asymptotically equivalent to the IBF for

finite sample sizes the resulting Bayes factor is not unique To address this issue

Moreno et al (1998) formalized the methodology through the ldquolimiting procedurerdquo

This procedure allows one to obtain a unique Bayes factor consolidating the method

as a valid objective Bayesian model selection procedure which we will refer to as the

Bayes factor for intrinsic priors (BFIP) This result is particularly valid for nested models

although the methodology may be extended with some caution to nonnested models


As mentioned before the Bayesian hypothesis testing procedure is highly sensitive

to parameter-prior specification and not all priors that are useful for estimation are

recommended for hypothesis testing or model selection Evidence of this is provided

by the Jeffreys-Lindley paradox which states that a point null hypothesis will always

be accepted when the variance of a conjugate prior goes to infinity (Robert 1993)

Additionally when comparing nested models the null model should correspond to

a substantial reduction in complexity from that of larger alternative models Hence

priors for the larger alternative models that place probability mass away from the null

model are wasteful If the true model is ldquofarrdquo from the null it will be easily detected by

any statistical procedure Therefore the prior on the alternative models should ldquowork

harderrdquo at selecting competitive models that are ldquocloserdquo to the null This principle known

as the Savage continuity condition (Gunel amp Dickey 1974) is widely recognized by

statisticians

Interestingly the intrinsic prior in correspondence with the BFIP automatically

satisfies the Savage continuity condition That is when comparing nested models the

intrinsic prior for the more complex model is centered around the null model and in spite

of being a limiting procedure it is not subject to the Jeffreys-Lindley paradox

Moreover beyond the usual pairwise consistency of the Bayes factor for nested

models Casella et al (2009) show that the corresponding Bayesian procedure with

intrinsic priors for variable selection in normal regression is consistent in the entire

class of normal linear models adding an important feature to the list of virtues of the

procedure Consistency of the BFIP for the case where the dimension of the alternative

model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors

As previously mentioned, in the Bayesian paradigm a model M ∈ M is defined by a sampling density and a prior distribution. The sampling density associated with model M is denoted by f( y | β_M, σ²_M, M ), where (β_M, σ²_M) is a vector of model-specific unknown parameters. The prior for model M and its corresponding set of parameters is

π( β_M, σ²_M, M | M ) = π( β_M, σ²_M | M, M ) · π( M | M ).

Objective local priors for the model parameters (β_M, σ²_M) are achieved through modifications and extensions of Zellner's g-prior (Liang et al., 2008; Womack et al., 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion on priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors

An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi, 1996; Moreno et al., 1998). Because M_B ⊆ M for all M ∈ M, the intrinsic prior for (β_M, σ²_M) is defined as an expected posterior prior,

π^I( β_M, σ²_M | M ) = ∫ p^R( β_M, σ²_M | ỹ, M ) m^R( ỹ | M_B ) dỹ,   (3-1)

where ỹ is a minimal training sample for model M, I denotes the intrinsic distributions, and R denotes distributions derived from the reference prior π^R( β_M, σ²_M | M ) = c_M dβ_M dσ²_M / σ²_M. In (3-1), m^R( ỹ | M ) = ∫∫ f( ỹ | β_M, σ²_M, M ) π^R( β_M, σ²_M | M ) dβ_M dσ²_M is the reference marginal of ỹ under model M, and p^R( β_M, σ²_M | ỹ, M ) = f( ỹ | β_M, σ²_M, M ) π^R( β_M, σ²_M | M ) / m^R( ỹ | M ) is the reference posterior density.

In the regression framework the reference marginal m^R is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model M to the base model M_B is well-defined and given by

BF^I_{M,M_B}(y) = (1 - R²_M)^{-(n-|M_B|)/2} ∫_0^1 [ ( n + sin²(πθ/2)(|M|+1) ) / ( n + sin²(πθ/2)(|M|+1)/(1-R²_M) ) ]^{(n-|M|)/2} [ ( sin²(πθ/2)(|M|+1) ) / ( n + sin²(πθ/2)(|M|+1)/(1-R²_M) ) ]^{(|M|-|M_B|)/2} dθ,   (3-2)

where R²_M is the coefficient of determination of model M versus model M_B. The Bayes factor between two models M and M' is defined as BF^I_{M,M'}(y) = BF^I_{M,M_B}(y) / BF^I_{M',M_B}(y).

The "goodness" of the model M based on the intrinsic priors is given by its posterior probability,

p^I( M | y, M ) = BF^I_{M,M_B}(y) π( M | M ) / Σ_{M'∈M} BF^I_{M',M_B}(y) π( M' | M ).   (3-3)
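Equation 3-2 is a one-dimensional integral and is straightforward to evaluate numerically. The sketch below does so with scipy quadrature for hypothetical values of n, R²_M, and the model sizes; it is an illustration of the formula only, not the implementation used later in this chapter.

import numpy as np
from scipy.integrate import quad

def intrinsic_bf(r2, n, k_m, k_b):
    """Intrinsic Bayes factor of model M (size k_m) vs. base model M_B (size k_b), Eq. 3-2."""
    def integrand(theta):
        s = np.sin(np.pi * theta / 2.0) ** 2 * (k_m + 1)
        denom = n + s / (1.0 - r2)
        return ((n + s) / denom) ** ((n - k_m) / 2.0) * (s / denom) ** ((k_m - k_b) / 2.0)
    integral, _ = quad(integrand, 0.0, 1.0)
    return (1.0 - r2) ** (-(n - k_b) / 2.0) * integral

# Hypothetical example: n = 100 sites, intercept-only base model, candidate with 3 extra terms
print(intrinsic_bf(r2=0.25, n=100, k_m=4, k_b=1))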

It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al., 2009; Girón et al., 2010). In the context of well-formulated models, the true model M_T is the smallest well-formulated model M ∈ M such that α ∈ M if β_α ≠ 0. If M_T is the true model, then the posterior probability of model M_T based on Equation 3-3 converges to 1.

3.2.2.2 Other mixtures of g-priors

Scaled mixtures of g-priors place a reference prior on (β_{M_B}, σ²) and a multivariate normal distribution on the coefficients of M not in M_B, namely normal with mean 0 and precision matrix

( q_M w / (n σ²) ) Z'_M ( I - H_0 ) Z_M,

where H_0 is the hat matrix associated with Z_{M_B}. The prior is completed by a prior on w and a choice of scaling q_M, which is set at |M| + 1 to account for the minimal sample size of M. Under these assumptions, the Bayes factor for M to M_B is given by

BF_{M,M_B}(y) = (1 - R²_M)^{-(n-|M_B|)/2} ∫ [ ( n + w(|M|+1) ) / ( n + w(|M|+1)/(1-R²_M) ) ]^{(n-|M|)/2} [ ( w(|M|+1) ) / ( n + w(|M|+1)/(1-R²_M) ) ]^{(|M|-|M_B|)/2} π(w) dw.

We consider the following priors on w. The intrinsic prior is π(w) = Beta(w | 0.5, 0.5), which is only defined for w ∈ (0, 1). A version of the Zellner-Siow prior is given by w ~ Gamma(0.5, 0.5), which produces a multivariate Cauchy distribution on β. A family of hyper-g priors is defined by π(w) ∝ w^{-1/2} (β + w)^{-(α+1)/2}, which have Cauchy-like tails but produce more shrinkage than the Cauchy prior.


3.3 Objective Bayes Occupancy Model Selection

As mentioned before Bayesian inferential approaches used for ecological models

are lacking In particular there exists a need for suitable objective and automatic

Bayesian testing procedures and software implementations that explore thoroughly the

model space considered With this goal in mind in this section we develop an objective

intrinsic and fully automatic Bayesian model selection methodology for single season

site-occupancy models We refer to this method as automatic and objective given that

in its implementation no hyperparameter tuning is required and that it is built using

noninformative priors with good testing properties (eg intrinsic priors)

An inferential method for the occupancy problem is possible using the intrinsic

approach given that we are able to link intrinsic-Bayesian tools for the normal linear

model through our probit formulation of the occupancy model In other words because

we can represent the single season probit occupancy model through the hierarchy

yij |zi wij sim Bernoulli(ziIwijgt0

)wij |λ sim N

(qprimeijλ 1

)zi |vi sim Bernoulli

(Ivigt0

)vi |α sim N (x primeiα 1)

it is possible to solve the selection problem on the latent scale variables wij and vi and

to use those results at the level of the occupancy and detection processes

In what follows first we provide some necessary notation Then a derivation of

the intrinsic priors for the parameters of the detection and occupancy components

is outlined Using these priors we obtain the general form of the model posterior

probabilities Finally the results are incorporated in a model selection algorithm for

site-occupancy data Although the priors on the model space are not discussed in this

Chapter the software and methods developed have different choices of model priors

built in


3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well Namely

presence will be denoted by z detection by y their corresponding latent processes are

v and w and the model parameters are denoted by α and λ However some additional

notation is also necessary. Let M_0 = {M_{0y}, M_{0z}} denote the "base" model defined by

the smallest models considered for the detection and presence processes The base

models M0y and M0z include predictors that must be contained in every model that

belongs to the model space Some examples of base models are the intercept only

model a model with covariates related to the sampling design and a model including

some predictors important to the researcher that should be included in every model

Furthermore let the sets [Kz ] = 1 2 Kz and [Ky ] = 1 2 Ky index

the covariates considered for the variable selection procedure for the presence and

detection processes respectively That is these sets denote the covariates that can

be added from the base models in M0 or removed from the largest possible models

considered MF z and MF y which we will refer to as the ldquofullrdquo models The model space

can then be represented by the Cartesian product of subsets such that Ay sube [Ky ]

and Az sube [Kz ] The entire model space is populated by models of the form MA =MAy

MAz

isin M = My timesMz with MAy

isin My and MAzisin Mz

For the presence process z the design matrix for model MAzis given by the block

matrix XAz= (X0|Xr A) X0 corresponds to the design matrix of the base model ndash which

is such that M0z sube MAzisin Mz for all Az isin [Kz ] ndash and Xr A corresponds to the submatrix

that contains the covariates indexed by Az Analogously for the detection process y the

design matrix is given by QAy= (Q0|Qr A) Similarly the coefficients for models MAz

and

MAyare given by αA = (αprime

0αprimer A)

prime and λA = (λprime0λ

primer A)

prime

With these elements in place the model selection problem consists of finding

subsets of covariates indexed by A = Az Ay that have a high posterior probability

given the detection and occupancy processes This is equivalent to finding models with

high posterior odds when compared to a suitable base model. These posterior odds are given by

p( M_A | y, z ) / p( M_0 | y, z ) = m( y, z | M_A ) π( M_A ) / [ m( y, z | M_0 ) π( M_0 ) ] = BF_{M_A,M_0}( y, z ) π( M_A ) / π( M_0 ).

Since we are able to represent the occupancy model as a truncation of latent

normal variables it is possible to work through the occupancy model selection problem

in the latent normal scale used for the presence and detection processes We formulate

two solutions to this problem one that depends on the observed and latent components

and another that solely depends on the latent level variables used to data-augment the

problem We will however focus on the latter approach as this yields a straightforward

MCMC sampling scheme For completeness the other alternative is described in

Section 34

At the root of our objective inferential procedure for occupancy models lies the

conditional argument introduced by Womack et al (work in progress) for the simple

probit regression. In the occupancy setting the argument is

p( M_A | y, z, w, v ) = m( y, z, v, w | M_A ) π( M_A ) / m( y, z, w, v )
  = f_{yz}( y, z | w, v ) [ ∫ f_{vw}( v, w | α, λ, M_A ) π_{αλ}( α, λ | M_A ) d(α, λ) ] π( M_A ) / { f_{yz}( y, z | w, v ) Σ_{M*∈M} [ ∫ f_{vw}( v, w | α, λ, M* ) π_{αλ}( α, λ | M* ) d(α, λ) ] π( M* ) }
  = m( v | M_{A_z} ) m( w | M_{A_y} ) π( M_A ) / [ m(v) m(w) ]
  ∝ m( v | M_{A_z} ) m( w | M_{A_y} ) π( M_A ),   (3-4)

where

1. f_{yz}( y, z | w, v ) = ∏_{i=1}^{N} I_{{z_i v_i > 0}} I_{{(1-z_i) v_i ≤ 0}} ∏_{j=1}^{J} ( z_i I_{{w_{ij} > 0}} )^{y_{ij}} ( 1 - z_i I_{{w_{ij} > 0}} )^{1-y_{ij}},

2. f_{vw}( v, w | α, λ, M_A ) = [ ∏_{i=1}^{N} ϕ( v_i; x'_i α_{M_{A_z}}, 1 ) ] [ ∏_{i=1}^{N} ∏_{j=1}^{J_i} ϕ( w_{ij}; q'_{ij} λ_{M_{A_y}}, 1 ) ] = f( v | α_{r,A}, α_0, M_{A_z} ) f( w | λ_{r,A}, λ_0, M_{A_y} ), and

3. π_{αλ}( α, λ | M_A ) = π_α( α | M_{A_z} ) π_λ( λ | M_{A_y} ).

This result implies that once the occupancy and detection indicators are

conditioned on the latent processes v and w respectively the model posterior

probabilities only depend on the latent variables Hence in this case the model

selection problem is driven by the posterior odds

p( M_A | y, z, w, v ) / p( M_0 | y, z, w, v ) = [ m( w, v | M_A ) / m( w, v | M_0 ) ] [ π( M_A ) / π( M_0 ) ],   (3-5)

where m( w, v | M_A ) = m( w | M_{A_y} ) · m( v | M_{A_z} ), with

m( v | M_{A_z} ) = ∫∫ f( v | α_{r,A}, α_0, M_{A_z} ) π( α_{r,A} | α_0, M_{A_z} ) π( α_0 ) dα_{r,A} dα_0,   (3-6)

m( w | M_{A_y} ) = ∫∫ f( w | λ_{r,A}, λ_0, M_{A_y} ) π( λ_{r,A} | λ_0, M_{A_y} ) π( λ_0 ) dλ_0 dλ_{r,A}.   (3-7)

3.3.2 Intrinsic Priors for the Occupancy Problem

In general the intrinsic priors, as defined by Moreno et al. (1998), use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes we assume noninformative improper priors for the parameters, denoted by π^N(·|·). Specifically, the intrinsic priors π^IP( θ_{M*} | M* ) for a vector of parameters θ_{M*} corresponding to model M* ∈ {M_0, M} ⊂ M, for a response vector s with probability density (or mass) function f( s | θ_{M*} ), are defined by

π^IP( θ_{M_0} | M_0 ) = π^N( θ_{M_0} | M_0 )

π^IP( θ_M | M ) = π^N( θ_M | M ) ∫ [ m( s̃ | M_0 ) / m( s̃ | M ) ] f( s̃ | θ_M, M ) ds̃,

where s̃ is a theoretical training sample.

In what follows, whenever it is clear from the context, in an attempt to simplify the notation, M_A will be used to refer to M_{A_z} or M_{A_y}, and A will denote A_z or A_y. To derive the parameter priors involved in Equations 3-6 and 3-7 using the objective intrinsic prior strategy, we start by assuming flat priors π^N( α_A | M_A ) ∝ c_A and π^N( λ_A | M_A ) ∝ d_A, where c_A and d_A are unknown constants.

The intrinsic prior for the parameters associated with the occupancy process, α_A, conditional on model M_A, is

π^IP( α_A | M_A ) = π^N( α_A | M_A ) ∫ [ m( ṽ | M_0 ) / m( ṽ | M_A ) ] f( ṽ | α_A, M_A ) dṽ,

where the marginals m( ṽ | M_j ), with j ∈ {A, 0}, are obtained by solving the analogue of Equation 3-6 for the (theoretical) training sample ṽ. These marginals are given by

m( ṽ | M_j ) = c_j (2π)^{(p_j - p_0)/2} |X̃'_j X̃_j|^{1/2} e^{-½ ṽ'(I - H̃_j)ṽ}.

The training sample ṽ has dimension p_{A_z} = |M_{A_z}|, that is, the total number of parameters in model M_{A_z}. Note that, without ambiguity, we use |·| to denote both the cardinality of a set and the determinant of a matrix. The design matrix X̃_A corresponds to the training sample ṽ and is chosen such that X̃'_A X̃_A = (p_{A_z}/N) X'_A X_A (Leon-Novelo et al., 2012), and H̃_j is the corresponding hat matrix.

Replacing m( ṽ | M_A ) and m( ṽ | M_0 ) in π^IP( α_A | M_A ) and solving the integral with respect to the theoretical training sample ṽ, we have

π^IP( α_A | M_A ) = c_A ∫ [ (2π)^{-(p_{A_z} - p_{0z})/2} ( c_0 / c_A ) e^{-½ ṽ'((I - H̃_A) - (I - H̃_0))ṽ} |X̃'_A X̃_A|^{1/2} / |X̃'_0 X̃_0|^{1/2} ] × [ (2π)^{-p_{A_z}/2} e^{-½ (ṽ - X̃_A α_A)'(ṽ - X̃_A α_A)} ] dṽ
  = c_0 (2π)^{-(p_{A_z} - p_{0z})/2} |X̃'_{r,A} X̃_{r,A}|^{1/2} 2^{-(p_{A_z} - p_{0z})/2} exp[ -½ α'_{r,A} ( ½ X̃'_{r,A} X̃_{r,A} ) α_{r,A} ]
  = π^N( α_0 ) × N( α_{r,A} | 0, 2 (X̃'_{r,A} X̃_{r,A})^{-1} ).   (3-8)

Analogously, the intrinsic prior for the parameters associated with the detection process is

π^IP( λ_A | M_A ) = d_0 (2π)^{-(p_{A_y} - p_{0y})/2} |Q̃'_{r,A} Q̃_{r,A}|^{1/2} 2^{-(p_{A_y} - p_{0y})/2} exp[ -½ λ'_{r,A} ( ½ Q̃'_{r,A} Q̃_{r,A} ) λ_{r,A} ]
  = π^N( λ_0 ) × N( λ_{r,A} | 0, 2 (Q̃'_{r,A} Q̃_{r,A})^{-1} ).   (3-9)

In short, the intrinsic priors for α_A = (α'_0, α'_{r,A})' and λ_A = (λ'_0, λ'_{r,A})' are the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by A_z and A_y, respectively.

3.3.3 Model Posterior Probabilities

We now derive the expressions involved in the calculations of the model posterior

probabilities First recall that p(MA|y zw v) prop m(w v|MA)π(MA) Hence determining

this posterior probability only requires calculating m(w v|MA)

Note that since w and v are independent obtaining the model posteriors from

expression 3ndash4 reduces to finding closed form expressions for the marginals m(v |MAz)

and m(w |MAy) respectively from equations 3ndash6 and 3ndash7 Therefore

m( w, v | M_A ) = ∫∫ f( v, w | α, λ, M_A ) π^IP( α | M_{A_z} ) π^IP( λ | M_{A_y} ) dα dλ.   (3-10)

For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3-8 into Equation 3-6 (recalling that X̃'_A X̃_A = (p_{A_z}/N) X'_A X_A) and integrating out α_A yields

m( v | M_A ) = ∫∫ c_0 N( v | X_0 α_0 + X_{r,A} α_{r,A}, I ) N( α_{r,A} | 0, 2 (X̃'_{r,A} X̃_{r,A})^{-1} ) dα_{r,A} dα_0
  = c_0 (2π)^{-n/2} ∫ ( p_{A_z} / (2N + p_{A_z}) )^{(p_{A_z} - p_{0z})/2} exp[ -½ (v - X_0 α_0)' ( I - (2N / (2N + p_{A_z})) H_{r,A_z} ) (v - X_0 α_0) ] dα_0
  = c_0 (2π)^{-(n - p_{0z})/2} ( p_{A_z} / (2N + p_{A_z}) )^{(p_{A_z} - p_{0z})/2} |X'_0 X_0|^{-1/2} exp[ -½ v' ( I - H_{0z} - (2N / (2N + p_{A_z})) H_{r,A_z} ) v ],   (3-11)

with H_{r,A_z} = H_{A_z} - H_{0z}, where H_{A_z} is the hat matrix for the entire model M_{A_z} and H_{0z} is the hat matrix for the base model.

Similarly, the marginal distribution for w is

m( w | M_A ) = d_0 (2π)^{-(J - p_{0y})/2} ( p_{A_y} / (2J + p_{A_y}) )^{(p_{A_y} - p_{0y})/2} |Q'_0 Q_0|^{-1/2} exp[ -½ w' ( I - H_{0y} - (2J / (2J + p_{A_y})) H_{r,A_y} ) w ],   (3-12)

where J = Σ_{i=1}^{N} J_i, or in other words, J denotes the total number of surveys conducted. Now the posteriors for the base model M_0 = {M_{0y}, M_{0z}} are

m( v | M_0 ) = ∫ c_0 N( v | X_0 α_0, I ) dα_0 = c_0 (2π)^{-(n - p_{0z})/2} |X'_0 X_0|^{-1/2} exp[ -½ v'( I - H_{0z} )v ]   (3-13)

and

m( w | M_0 ) = d_0 (2π)^{-(J - p_{0y})/2} |Q'_0 Q_0|^{-1/2} exp[ -½ w'( I - H_{0y} )w ].   (3-14)
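Since Equations 3-11 through 3-14 involve only hat matrices and quadratic forms, the posterior odds of Equation 3-5 can be computed with a few matrix operations. The sketch below evaluates log m(v | M_A) and log m(v | M_0) for given design matrices; the common constant c_0 cancels in the odds and is omitted, and the variable names and random inputs are illustrative.

import numpy as np

def hat(X):
    return X @ np.linalg.solve(X.T @ X, X.T)

def log_marginal_v(v, X0, Xr=None):
    """log m(v | M) up to the common constant c_0 (Eqs. 3-11 and 3-13)."""
    n, p0 = X0.shape
    H0 = hat(X0)
    if Xr is None:  # base model, Eq. 3-13
        quad = v @ (np.eye(n) - H0) @ v
        return (-(n - p0) / 2) * np.log(2 * np.pi) \
               - 0.5 * np.linalg.slogdet(X0.T @ X0)[1] - 0.5 * quad
    XA = np.column_stack([X0, Xr])
    pA = XA.shape[1]
    Hr = hat(XA) - H0
    shrink = 2 * n / (2 * n + pA)
    quad = v @ (np.eye(n) - H0 - shrink * Hr) @ v
    return (-(n - p0) / 2) * np.log(2 * np.pi) \
           + ((pA - p0) / 2) * np.log(pA / (2 * n + pA)) \
           - 0.5 * np.linalg.slogdet(X0.T @ X0)[1] - 0.5 * quad

# log posterior odds of M_A vs M_0 for the occupancy component (equal model priors)
rng = np.random.default_rng(0)
n = 80
X0 = np.ones((n, 1)); Xr = rng.normal(size=(n, 2)); v = rng.normal(size=n)
log_odds = log_marginal_v(v, X0, Xr) - log_marginal_v(v, X0)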

3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place and knowing the form of the model

posterior probabilities it is finally possible to develop a strategy to conduct model

selection for the occupancy framework

For each of the two components of the model ndashoccupancy and detectionndash the

algorithm first draws the set of active predictors (ie Az and Ay ) together with their

corresponding parameters This is a reversible jump step which uses a Metropolis


Hastings correction, with proposal distributions given by

q( A*_z | z_o, z_u^{(t)}, v^{(t)}, M_{A_z} ) = ½ [ p( M_{A*_z} | z_o, z_u^{(t)}, v^{(t)}, M_z, M_{A*_z} ∈ L(M_{A_z}) ) + 1/|L(M_{A_z})| ]

q( A*_y | y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y} ) = ½ [ p( M_{A*_y} | y, z_o, z_u^{(t)}, w^{(t)}, M_y, M_{A*_y} ∈ L(M_{A_y}) ) + 1/|L(M_{A_y})| ],   (3-15)

where L(M_{A_z}) and L(M_{A_y}) denote the sets of models obtained from adding or removing one predictor at a time from M_{A_z} and M_{A_y}, respectively.
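To illustrate how the mixture proposal in Equation 3-15 can be realized, the sketch below enumerates the add/remove-one neighborhood of the current model for one component and mixes the locally renormalized posterior weights with a uniform distribution over the neighborhood. The function log_post stands for any routine returning an (unnormalized) log posterior of a candidate model, for example built from the marginals above; the toy scoring rule at the end is a placeholder, not part of the method.

import numpy as np

def neighborhood(active, K):
    """All models reachable from `active` by adding or removing a single predictor."""
    out = []
    for k in range(1, K + 1):
        cand = set(active)
        cand.symmetric_difference_update({k})   # toggle predictor k
        out.append(frozenset(cand))
    return out

def propose(active, K, log_post, rng):
    """Draw a candidate model from the mixture proposal of Eq. 3-15."""
    nbrs = neighborhood(active, K)
    lp = np.array([log_post(m) for m in nbrs])
    local = np.exp(lp - lp.max()); local /= local.sum()    # renormalized local posterior
    probs = 0.5 * local + 0.5 / len(nbrs)                   # mix with uniform over L(M_A)
    idx = rng.choice(len(nbrs), p=probs)
    return nbrs[idx], probs[idx]

rng = np.random.default_rng(0)
toy_log_post = lambda m: -0.5 * len(m)          # placeholder scoring for illustration only
candidate, q_forward = propose(frozenset({2, 5}), K=10, log_post=toy_log_post, rng=rng)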

To promote mixing this step is followed by an additional draw from the full

conditionals of α and λ The densities p(α0|) p(αr A|) p(λ0|) and p(λr A|) can

be sampled from directly with Gibbs steps Using the notation a|middot to denote the random

variable a conditioned on all other parameters and on the data these densities are given

by

• α_0 | · ~ N( (X'_0 X_0)^{-1} X'_0 v, (X'_0 X_0)^{-1} ),
• α_{r,A} | · ~ N( μ_{α_{r,A}}, Σ_{α_{r,A}} ), where the covariance matrix and mean vector are given by Σ_{α_{r,A}} = ( 2N / (2N + p_{A_z}) ) (X'_{r,A} X_{r,A})^{-1} and μ_{α_{r,A}} = Σ_{α_{r,A}} X'_{r,A} v,
• λ_0 | · ~ N( (Q'_0 Q_0)^{-1} Q'_0 w, (Q'_0 Q_0)^{-1} ), and
• λ_{r,A} | · ~ N( μ_{λ_{r,A}}, Σ_{λ_{r,A}} ), analogously, with covariance matrix and mean given by Σ_{λ_{r,A}} = ( 2J / (2J + p_{A_y}) ) (Q'_{r,A} Q_{r,A})^{-1} and μ_{λ_{r,A}} = Σ_{λ_{r,A}} Q'_{r,A} w.

Finally Gibbs sampling steps are also available for the unobserved occupancy

indicators zu and for the corresponding latent variables v and w The full conditional

posterior densities for z(t+1)u v(t+1) and w(t+1) are those introduced in Chapter 2 for the

single season probit model

The following steps summarize the stochastic search algorithm.

1. Initialize A_y^{(0)}, A_z^{(0)}, z_u^{(0)}, v^{(0)}, w^{(0)}, α_0^{(0)}, λ_0^{(0)}.

2. Sample the model indices and corresponding parameters:
   (a) Draw simultaneously
       • A*_z ~ q( A_z | z_o, z_u^{(t)}, v^{(t)}, M_{A_z} ),
       • α*_0 ~ p( α_0 | M_{A*_z}, z_o, z_u^{(t)}, v^{(t)} ), and
       • α*_{r,A*} ~ p( α_{r,A} | M_{A*_z}, z_o, z_u^{(t)}, v^{(t)} ).
   (b) Accept ( M_{A_z}^{(t+1)}, α_0^{(t+1),1}, α_{r,A}^{(t+1),1} ) = ( M_{A*_z}, α*_0, α*_{r,A*} ) with probability
       δ_z = min( 1, [ p( M_{A*_z} | z_o, z_u^{(t)}, v^{(t)} ) / p( M_{A_z^{(t)}} | z_o, z_u^{(t)}, v^{(t)} ) ] [ q( A_z^{(t)} | z_o, z_u^{(t)}, v^{(t)}, M_{A*_z} ) / q( A*_z | z_o, z_u^{(t)}, v^{(t)}, M_{A_z} ) ] );
       otherwise let ( M_{A_z}^{(t+1)}, α_0^{(t+1),1}, α_{r,A}^{(t+1),1} ) = ( M_{A_z^{(t)}}, α_0^{(t),2}, α_{r,A}^{(t),2} ).
   (c) Sample simultaneously
       • A*_y ~ q( A_y | y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y} ),
       • λ*_0 ~ p( λ_0 | M_{A*_y}, y, z_o, z_u^{(t)}, w^{(t)} ), and
       • λ*_{r,A*} ~ p( λ_{r,A} | M_{A*_y}, y, z_o, z_u^{(t)}, w^{(t)} ).
   (d) Accept ( M_{A_y}^{(t+1)}, λ_0^{(t+1),1}, λ_{r,A}^{(t+1),1} ) = ( M_{A*_y}, λ*_0, λ*_{r,A*} ) with probability
       δ_y = min( 1, [ p( M_{A*_y} | y, z_o, z_u^{(t)}, w^{(t)} ) / p( M_{A_y^{(t)}} | y, z_o, z_u^{(t)}, w^{(t)} ) ] [ q( A_y^{(t)} | y, z_o, z_u^{(t)}, w^{(t)}, M_{A*_y} ) / q( A*_y | y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y} ) ] );
       otherwise let ( M_{A_y}^{(t+1)}, λ_0^{(t+1),1}, λ_{r,A}^{(t+1),1} ) = ( M_{A_y^{(t)}}, λ_0^{(t),2}, λ_{r,A}^{(t),2} ).

3. Sample base model parameters:
   (a) Draw α_0^{(t+1),2} ~ p( α_0 | M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)} ).
   (b) Draw λ_0^{(t+1),2} ~ p( λ_0 | M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, v^{(t)} ).

4. To improve mixing, resample the model coefficients that are not in the base model but are in M_A:
   (a) Draw α_{r,A}^{(t+1),2} ~ p( α_{r,A} | M_{A_z}^{(t+1)}, z_o, z_u^{(t)}, v^{(t)} ).
   (b) Draw λ_{r,A}^{(t+1),2} ~ p( λ_{r,A} | M_{A_y}^{(t+1)}, y, z_o, z_u^{(t)}, v^{(t)} ).

5. Sample latent and missing (unobserved) variables:
   (a) Sample z_u^{(t+1)} ~ p( z_u | M_{A_z}^{(t+1)}, y, α_{r,A}^{(t+1),2}, α_0^{(t+1),2}, λ_{r,A}^{(t+1),2}, λ_0^{(t+1),2} ).
   (b) Sample v^{(t+1)} ~ p( v | M_{A_z}^{(t+1)}, z_o, z_u^{(t+1)}, α_{r,A}^{(t+1),2}, α_0^{(t+1),2} ).
   (c) Sample w^{(t+1)} ~ p( w | M_{A_y}^{(t+1)}, z_o, z_u^{(t+1)}, λ_{r,A}^{(t+1),2}, λ_0^{(t+1),2} ).
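Putting the pieces together, the skeleton below shows how the steps above combine into one MCMC sweep of the stochastic search. The helper names (initialize, propose_model, log_model_post, draw_base, draw_extra, draw_latents, and so on) are stand-ins for the full conditionals and proposal described above, not an existing package API; the sketch only illustrates the control flow.

import numpy as np

def occupancy_model_search(data, n_sim, helpers, rng=np.random.default_rng(0)):
    """Skeleton of the stochastic model search; `helpers` supplies the full conditionals."""
    state = helpers.initialize(data)                 # step 1: A_y, A_z, z_u, v, w, alpha_0, lambda_0
    visited = []
    for m in range(n_sim):
        for comp in ("occupancy", "detection"):      # step 2: reversible-jump move per component
            cand, q_fwd = helpers.propose_model(state, comp)
            q_rev = helpers.proposal_density(cand, state, comp)
            log_acc = (helpers.log_model_post(cand, comp) - helpers.log_model_post(state, comp)
                       + np.log(q_rev) - np.log(q_fwd))
            if np.log(rng.uniform()) < log_acc:
                state = helpers.adopt(state, cand, comp)
        state = helpers.draw_base(state)             # step 3: base-model coefficients
        state = helpers.draw_extra(state)            # step 4: refresh non-base coefficients
        state = helpers.draw_latents(state)          # step 5: z_u, v, w
        visited.append(helpers.model_labels(state))
    return visited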

3.4 Alternative Formulation

Because the occupancy process is partially observed it is reasonable to consider

the posterior odds in terms of the observed responses that is the detections y and

the presences at sites where at least one detection takes place Partitioning the vector

of presences into observed and unobserved components, z = (z'_o, z'_u)', and integrating out the unobserved component, the model posterior for M_A can be obtained as

p( M_A | y, z_o ) ∝ E_{z_u}[ m( y, z | M_A ) ] π( M_A ).   (3-16)

Data-augmenting the model in terms of latent normal variables, a la Albert and Chib, the marginal for any model {M_y, M_z} = M ∈ M of z and y inside the expectation in Equation 3-16 can be expressed in terms of the latent variables:

m( y, z | M ) = ∫_{T(z)} ∫_{T(y,z)} m( w, v | M ) dw dv = [ ∫_{T(z)} m( v | M_z ) dv ] [ ∫_{T(y,z)} m( w | M_y ) dw ],   (3-17)

where T(z) and T(y, z) denote the corresponding truncation regions for v and w, which depend on the values taken by z and y, and

m( v | M_z ) = ∫ f( v | α, M_z ) π( α | M_z ) dα,   (3-18)

m( w | M_y ) = ∫ f( w | λ, M_y ) π( λ | M_y ) dλ.   (3-19)

The last equality in equation 3ndash17 is a consequence of the independence of the

latent processes v and w Using expressions 3ndash18 and 3ndash19 allows one to embed this

model selection problem in the classical linear normal regression setting where many

ldquoobjectiverdquo Bayesian inferential tools are available In particular these expressions

facilitate deriving the parameter intrinsic priors (Berger amp Pericchi 1996 Moreno

et al 1998) for this problem This approach is an extension of the one implemented in

Leon-Novelo et al (2012) for the simple probit regression problem


Using this alternative approach, all that is left is to integrate m( v | M_A ) and m( w | M_A ) over their corresponding truncation regions T(z) and T(y, z), which yields m( y, z | M_A ), and then to obtain the expectation with respect to the unobserved z's. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties it is possible to express E[ m( y, z | M_A ) ] as

E_{z_u}[ m( y, z | M_A ) ] = E_{z_u}[ ( ∫_{T(z)} m( v | M_{A_z} ) dv ) ( ∫_{T(y,z)} m( w | M_{A_y} ) dw ) ]
  = E_{z_u}[ ∫ g_1( T(z) | M_{A_z}, α_0 ) π^IP( α_0 | M_{A_z} ) dα_0 × ∫ g_2( T(y, z) | M_{A_y}, λ_0 ) π^IP( λ_0 | M_{A_y} ) dλ_0 ]
  = c_0 d_0 ∫∫ E_{z_u}[ g_1( T(z) | M_{A_z}, α_0 ) g_2( T(y, z) | M_{A_y}, λ_0 ) ] dα_0 dλ_0,   (3-20)

where g_1( T(z) | M_{A_z}, α_0 ) = ∫_{T(z)} m( v | M_{A_z}, α_0 ) dv and g_2( T(y, z) | M_{A_y}, λ_0 ) = ∫_{T(y,z)} m( w | M_{A_y}, λ_0 ) dw, and where the last equality follows from Fubini's theorem, since m( v | M_{A_z}, α_0 ) and m( w | M_{A_y}, λ_0 ) are proper densities. From 3-20, the posterior odds are

p( M_A | y, z_o ) / p( M_0 | y, z_o ) = { ∫∫ E_{z_u}[ g_1( T(z) | M_{A_z}, α_0 ) g_2( T(y, z) | M_{A_y}, λ_0 ) ] dα_0 dλ_0 / ∫∫ E_{z_u}[ g_1( T(z) | M_{0z}, α_0 ) g_2( T(y, z) | M_{0y}, λ_0 ) ] dα_0 dλ_0 } × π( M_A ) / π( M_0 ).   (3-21)

3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios where we

evaluate the behavior of the algorithm by varying the number of sites the number of

surveys the amount of signal in the predictors for the presence component and finally

the amount of signal in the predictors for the detection component

For each model component the base model is taken to be the intercept only model

and the full models considered for the presence and the detection have respectively 30

and 20 predictors Therefore the model space contains 230times220 asymp 112times1015 candidate

models

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50, and 90 of the occupancy and detection probabilities match some pre-specified probabilities. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distribution of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels correspond to probabilities close to 0 or 1. Large spreads of the distributions for the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.

Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5, at small (Q^z_{10} = 0.3, Q^z_{90} = 0.7), intermediate (Q^z_{10} = 0.2, Q^z_{90} = 0.8), and large (Q^z_{10} = 0.1, Q^z_{90} = 0.9) distances. For the detection component, the model parameters are obtained to reflect detection probabilities concentrated about low values (Q^y_{50} = 0.2), intermediate values (Q^y_{50} = 0.5), and high values (Q^y_{50} = 0.8), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.
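To illustrate how such parameter values can be found, consider a simplified setting with a probit link and a single standard normal covariate, so that the linear predictor is N(\beta_0, \beta_1^2). The sketch below (illustrative only; the helper name is hypothetical) solves for the coefficients that match a target median and 90th percentile of the occupancy probability.

    from scipy.stats import norm

    def probit_signal_coefficients(q50, q90):
        """Return (beta0, beta1) such that Phi(beta0 + beta1*x), with x ~ N(0, 1),
        has median q50 and 90th percentile q90."""
        beta0 = norm.ppf(q50)                             # the median of the linear predictor is beta0
        beta1 = (norm.ppf(q90) - beta0) / norm.ppf(0.9)   # match the 90th percentile
        return beta0, beta1

    # "Small signal" presence setting: median 0.5 with (Q10, Q90) = (0.3, 0.7).
    beta0, beta1 = probit_signal_coefficients(0.5, 0.7)   # beta0 = 0, beta1 approximately 0.41

With several covariates the same idea applies, with beta1 replaced by the standard deviation of the linear predictor.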


Table 3-1. Simulation control parameters, occupancy model selector.

Parameter                         Values considered
N                                 50, 100
J                                 3, 5
(Q^z_10, Q^z_50, Q^z_90)          (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_10, Q^y_50, Q^y_90)          (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)

There are in total 36 scenarios; these result from crossing all the levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, using the assumed true models M_{T_z} = {1, x2, x15, x16, x22, x28} for the presence and M_{T_y} = {1, q7, q10, q12, q17} for the detection, with the predictors included in the randomly generated datasets. In this context, 1 represents the intercept term. Throughout the section we refer to predictors included in the true models as true predictors and to those absent as false predictors. The selection procedure was conducted using each one of these data sets with two different priors on the model space: the uniform or equal probability prior, and a multiplicity correcting prior.

The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor, and also through the five highest posterior probability models (HPM). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as

p(\text{predictor is included} | y, z, w, v) = \sum_{M \in \mathcal{M}} I_{(\text{predictor} \in M)} \, p(M | y, z, w, v).    (3–22)
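A minimal sketch of how the MPIPs in equation 3–22 can be computed from an enumerated (or sampled) collection of models and their posterior probabilities is given below; the data structures are hypothetical stand-ins for the output of the selection procedure.

    def marginal_inclusion_probabilities(models, post_probs):
        """models: list of sets of predictor labels; post_probs: matching posterior
        model probabilities. Returns a dict of MPIPs (equation 3-22)."""
        predictors = set().union(*models)
        return {pred: sum(p for M, p in zip(models, post_probs) if pred in M)
                for pred in predictors}

    # Toy example with three models over two predictors.
    models = [{"x1"}, {"x1", "x2"}, set()]
    post_probs = [0.5, 0.3, 0.2]
    print(marginal_inclusion_probabilities(models, post_probs))  # {'x1': 0.8, 'x2': 0.3}

The minimum MPIP odds of equation 3–23 below then follow by taking the ratio of the smallest MPIP among true predictors to the largest MPIP among false predictors.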

In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of marginal posterior inclusion probabilities for the predictors. Let \tilde{\xi} and \xi denote, respectively, a predictor in the true model M_T and a predictor absent from M_T. We define the minimum MPIP odds between the probabilities of true and false predictors as

\text{minOdds}_{\text{MPIP}} = \frac{\min_{\tilde{\xi} \in M_T} p(I_{\tilde{\xi}} = 1 | \tilde{\xi} \in M_T)}{\max_{\xi \notin M_T} p(I_{\xi} = 1 | \xi \notin M_T)}.    (3–23)

If the variable selection procedure adequately discriminates true and false predictors, minOdds_MPIP will take values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors

For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and the ones on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity correcting prior are in red. In these figures the MPIPs have been averaged over all datasets from the scenarios matching the condition indicated.

In Figure 3-1 we contrast the mean MPIPs of the predictors over all datasets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.

From these figures, mainly three results can be drawn: (1) the effect of the model prior is substantial; (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors; and (3) the separation between MPIPs of true predictors and false predictors is noticeably larger in the detection component.


Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIP for false predictors towards 0. In the presence component the MPIP for the true predictors is shrunk substantially under the multiplicity prior; however, there remains a clear separation between true and false predictors. In contrast, in the detection component the MPIP for true predictors remains relatively high (Figures 3-1 through 3-5).

Figure 3-1. Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors. Marginal inclusion probability (0 to 1) is plotted for the presence-component predictors x2, x15, x22, x28 and the detection-component predictors q7, q10, q17.

Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.

Figure 3-3. Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors.

Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.

Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.

In scenarios where more sites were surveyed, the separation between the MPIP of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites has an effect on both components, given that every time a new site is included, covariate information is added to the design matrix of both the presence and the detection components.

On the other hand, increasing the number of surveys affects the MPIP of predictors in the detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors of the presence component. This may appear to be counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection, while leaving unaltered the design matrix for the presence. The small changes observed in the MPIP for the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 in sites that, with fewer surveys, would only have had 0-valued detections.

From Figure 3-3 it is clear that for the presence component the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when considering a smaller number of sites, and vice versa.

Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the detection component is affected. There is stronger discrimination between true and false predictors as the distribution has higher variability (Figure 3-4). This is consistent with intuition, since having the presence probabilities more concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.

Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between the MPIP of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in those scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and reduces those of false predictors.

Table 3-2. Comparison of average minOdds_MPIP under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors.

                          Sites               Surveys
Comp        π(M)     N=50     N=100      J=3      J=5
Presence    Unif     1.12     1.31       1.19     1.24
            MC       3.20     8.46       4.20     6.74
Detection   Unif     2.03     2.64       2.11     2.57
            MC       21.15    32.46      21.39    32.52

Table 3-3. Comparison of average minOdds_MPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors.

                          (Q^z_10, Q^z_50, Q^z_90)                       (Q^y_10, Q^y_50, Q^y_90)
Comp        π(M)    (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)    (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence    Unif    1.05          1.20          1.34             1.10          1.23          1.24
            MC      2.02          4.55          8.05             2.38          6.19          6.40
Detection   Unif    2.34          2.34          2.30             2.57          2.00          2.38
            MC      25.37         20.77         25.28            29.33         18.52         28.49

The separation between the MPIP of true and false predictors is even more evident in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario the value of minOdds_MPIP (as defined in 3–23) was greater than 1, implying that on average even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model the minOdds_MPIP are markedly larger under the multiplicity correction prior and increase with the number of sites and with the number of surveys.


For the presence component, increasing the signal in the occupancy probabilities or having the detection probabilities concentrate about higher values has a positive and considerable effect on the magnitude of the odds. For the detection component these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution for the detection probabilities center about low or high values increases the minOdds_MPIP.

3.5.2 Summary Statistics for the Highest Posterior Probability Model

Tables 3-4 through 3-7 show the number of true predictors that are included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages observed in these tables provide one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The multiplicity correction prior's strong shrinkage only allows a few true predictors to be selected, but at the same time it prevents the inclusion in the HPM of any false predictors. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but also occurs to a lesser extent in the detection component.

Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                          True +               True −
Comp        π(M)     N=50     N=100       N=50     N=100
Presence    Unif     0.57     0.63        0.51     0.55
            MC       0.06     0.13        1.00     1.00
Detection   Unif     0.77     0.85        0.87     0.93
            MC       0.49     0.70        1.00     1.00

Having more sites or surveys improves the inclusion of true predictors and the exclusion of false ones in the HPM, for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution for the occupancy probabilities is more

Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                          True +             True −
Comp        π(M)     J=3      J=5        J=3      J=5
Presence    Unif     0.59     0.61       0.52     0.54
            MC       0.08     0.10       1.00     1.00
Detection   Unif     0.78     0.85       0.87     0.92
            MC       0.50     0.68       1.00     1.00

spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).

Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                    True +                                      True −
Comp        π(M)   (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)    (0.3,0.5,0.7) (0.2,0.5,0.8) (0.1,0.5,0.9)
Presence    Unif   0.55          0.61          0.64             0.50          0.54          0.55
            MC     0.02          0.08          0.18             1.00          1.00          1.00
Detection   Unif   0.81          0.82          0.81             0.90          0.89          0.89
            MC     0.57          0.61          0.59             1.00          1.00          1.00

3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,

Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                    True +                                      True −
Comp        π(M)   (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)    (0.1,0.2,0.9) (0.1,0.5,0.9) (0.1,0.8,0.9)
Presence    Unif   0.59          0.59          0.62             0.51          0.54          0.54
            MC     0.06          0.10          0.11             1.00          1.00          1.00
Detection   Unif   0.89          0.77          0.78             0.91          0.87          0.91
            MC     0.70          0.48          0.59             1.00          1.00          1.00

there is large variation in its topography and physio-geography; as such, elevation is a good candidate covariate to predict species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, temperature, as well as some biotic factors (Kery et al., 2010).

Repeated visits to 1-ha pixels took place to obtain the corresponding detection history. In addition to the survey outcome, the x and y-coordinates, thermal level, the date of the survey, and the elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranges from 1 to 22 times within each survey year.

Kery et al. (2010) summarize the results of this effort using AIC-based model comparisons: first by following a backwards elimination approach for the detection process while keeping the occupancy component fixed at the most complex model, and then, for the presence component, choosing among a group of three models while using the detection model chosen. In our analysis of this dataset, for the detection and the presence we consider as the full models those used in Kery et al. (2010), namely

\Phi^{-1}(\psi) = \alpha_0 + \alpha_1\,\text{year} + \alpha_2\,\text{elev} + \alpha_3\,\text{elev}^2 + \alpha_4\,\text{elev}^3

\Phi^{-1}(p) = \lambda_0 + \lambda_1\,\text{year} + \lambda_2\,\text{elev} + \lambda_3\,\text{elev}^2 + \lambda_4\,\text{elev}^3 + \lambda_5\,\text{date} + \lambda_6\,\text{date}^2,

where year = I_{\{year = 2000\}}.
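As an illustration, the two full-model design matrices could be assembled as in the sketch below; the data frame and the column names year, elev, and date are hypothetical placeholders for the survey records.

    import numpy as np

    def full_design_matrices(df):
        """Presence (psi) and detection (p) full-model design matrices.
        df is assumed to hold one row per survey with columns 'year', 'elev', 'date'."""
        year = (df["year"] == 2000).astype(float).to_numpy()   # indicator I{year = 2000}
        ones = np.ones(len(df))
        X_psi = np.column_stack([ones, year, df["elev"], df["elev"]**2, df["elev"]**3])
        X_p = np.column_stack([X_psi, df["date"], df["date"]**2])
        return X_psi, X_p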

The model spaces for these data contain 2^6 = 64 and 2^4 = 16 models, respectively, for the detection and occupancy components. That is, in total the model space contains 2^{4+6} = 1,024 models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each one of the models sampled was chosen from the set of models that can be reached by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space because, for each of the 10,000 models drawn, the posterior probabilities of many more models can be observed (see the sketch below).
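The sketch below illustrates this kind of single-term move over a model space indexed by sets of included terms. It is a simplified Metropolis-style illustration rather than the exact Gibbs sampler of Section 3.3.4, and log_post stands for any function returning the log of an unnormalized model posterior probability.

    import math
    import random

    def neighbors(model, full_terms):
        """Models reachable from `model` (a frozenset of term labels) by flipping
        the inclusion state of exactly one term of the full model."""
        return [model ^ frozenset([term]) for term in full_terms]

    def single_flip_step(model, full_terms, log_post):
        """Propose a neighbor uniformly at random (a symmetric proposal) and accept
        it with the usual Metropolis ratio of unnormalized posterior probabilities."""
        proposal = random.choice(neighbors(model, full_terms))
        log_ratio = log_post(proposal) - log_post(model)
        if random.random() < math.exp(min(0.0, log_ratio)):
            return proposal
        return model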

Below, the labels for the predictors are followed by either "z" or "y", accordingly, to represent the component they pertain to. Finally, using the results from the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kery et al. (2010).

3.6.1 Results: Variable Selection Procedure

The model finally chosen for the presence component in Kery et al. (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might correspond to a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability for the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior concentrates the model posterior probability mass more densely in the highest ranked models (90% of the mass is in the top five models) than the uniform prior (for which the top five models account for 40% of the mass).

For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases this model obtains very

Table 3-8. Posterior probability for the five highest probability models in the presence component of the blue hawker data.

Uniform model prior
Rank   M_z selected                       p(M_z|y)
1      yrz + elevz                        0.10
2      yrz + elevz + elevz3               0.08
3      elevz2 + elevz3                    0.08
4      yrz + elevz2                       0.07
5      yrz + elevz3                       0.07

Multiplicity correcting model prior
Rank   M_z selected                       p(M_z|y)
1      elevz + elevz3                     0.53
2                                         0.15
3      elevz + elevz2                     0.09
4      elevz2                             0.06
5      elevz + elevz2 + elevz3            0.05

high posterior probabilities. The terms contained in the cubic polynomial for the elevation appear to contain some relevant information; however, this conflicts with the MPIPs observed in Table 3-11, which under both model priors are relatively low (< 20% with the uniform and ≤ 4% with the multiplicity correcting prior).

Table 3-9. Posterior probability for the five highest probability models in the detection component of the blue hawker data.

Uniform model prior
Rank   M_y selected                       p(M_y|y)
1      (intercept only)                   0.45
2      elevy3                             0.06
3      elevy2                             0.05
4      elevy                              0.05
5      yry                                0.04

Multiplicity correcting model prior
Rank   M_y selected                       p(M_y|y)
1      (intercept only)                   0.86
2      elevy3                             0.02
3      datey2                             0.02
4      elevy2                             0.02
5      yry                                0.02

Finally, it is possible to use the MPIPs to obtain the median probability model, which contains the terms that have a MPIP higher than 50%. For the occupancy process (Table 3-10), under the uniform prior the model with the year, the elevation, and the elevation cubed is obtained. The MPM with the multiplicity correction prior coincides with the HPM from this prior. The MPM chosen for the detection component (Table 3-11) under both priors is the intercept-only model, coinciding again with the HPM.

Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belong to the true model, especially when considering


Table 3-10. MPIP, presence component.

Predictor    p(predictor ∈ M_Tz | y, z, w, v)
             Unif      MultCorr
yrz          0.53      0.07
elevz        0.51      0.73
elevz2       0.45      0.23
elevz3       0.50      0.67

Table 3-11. MPIP, detection component.

Predictor    p(predictor ∈ M_Ty | y, z, w, v)
             Unif      MultCorr
yry          0.19      0.03
elevy        0.18      0.03
elevy2       0.18      0.03
elevy3       0.19      0.04
datey        0.16      0.03
datey2       0.15      0.04

those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10), there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure

Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation), and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component using the multiplicity correction prior.

Therefore, in the validation procedure we observe the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM), that is, the model including only predictors with a MPIP larger than 50%, with the uniform prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity correction prior (elevz+elevz3, the same as the HPM with multiplicity correction).

We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training dataset. On the other hand, the model in Kery et al. (2010) was chosen to minimize the prediction error of the complete data. Because this model was obtained from the full dataset, results derived from it can only be considered as a lower bound for the prediction errors.

The benchmark misclassification error rate for true 1's is high (close to 70%). However, the misclassification rate for true 0's, which account for most of the responses, is less pronounced (15%). Overall, the performance of the selected models is comparable. They yield considerably worse results than the benchmark for the true 1's, but achieve rates close to the benchmark for the true zeros. Pooling together the results for true ones and true zeros, the selected models with either prior have misclassification rates close to 30%. The benchmark model performs comparably, with a joint misclassification error of 23% (Table 3-12).

Table 3-12. Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors.

Model                            Terms                                                    True 1   True 0   Joint
Benchmark (Kery et al. 2010)     yrz+elevz+elevz2+elevz3+elevy+elevy2+datey+datey2        0.66     0.15     0.23
HPM Unif                         yrz+elevz                                                0.83     0.17     0.28
HPM / MPM MC                     elevz+elevz3                                             0.82     0.18     0.28
MPM Unif                         yrz+elevz+elevz3                                         0.82     0.18     0.29

3.7 Discussion

In this Chapter we proposed an objective and fully automatic Bayes methodology for the single-season site-occupancy model. The methodology is said to be fully automatic because no hyper-parameter specification is necessary in defining the parameter priors, and objective because it relies on the intrinsic priors derived from noninformative priors. The intrinsic priors have been shown to have desirable properties as testing priors. We also propose a fast stochastic search algorithm to explore large model spaces using our model selection procedure.

Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering the marginal posterior inclusion probabilities of the predictors. For predictors in the true model these probabilities were comparatively larger than those for predictors absent from it. Also, the simulations indicated that the method has a greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.

Multiplicity correction priors were not described in this Chapter; however, their influence on the selection outcome is significant. This behavior was observed in the simulation experiment and in the analysis of the blue hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next Chapter.

Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4. Hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is developing methods for occupancy variable selection in a multivariate setting. This can be used to conduct hypothesis testing in scenarios with varying conditions through time, or in the case where multiple species are co-observed. A final variation we will investigate for this problem is that of occupancy model selection incorporating random effects.


CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.
–Sherlock Holmes, A Case of Identity

4.1 Introduction

In regression problems, if a large number of potential predictors is available, the complete model space is too large to enumerate, and automatic selection algorithms are necessary to find informative, parsimonious models. This multiple testing problem is difficult, and even more so when interactions or powers of the predictors are considered. In the ecological literature, models with interactions and/or higher order polynomial terms are ubiquitous (Johnson et al., 2013; Kery et al., 2010; Zeller et al., 2011), given the complexity and non-linearities found in ecological processes. Several model selection procedures, even in the classical normal linear setting, fail to address two fundamental issues: (1) the model selection outcome is not invariant to affine transformations when interactions or polynomial structures are found among the predictors, and (2) additional penalization is required to control for false positives as the model space grows (i.e., as more covariates are considered).

These two issues motivate the developments pursued throughout this Chapter. Building on the results of Chipman (1996), we propose, investigate, and provide recommendations for three different prior distributions on the model space. These priors help control for test multiplicity while accounting for polynomial structure in the predictors. They improve upon those proposed by Chipman, first by avoiding the need to specify values for the prior inclusion probabilities of the predictors, and second by formulating principled alternatives to introduce additional structure in the model priors. Finally, we design a stochastic search algorithm that allows fast and thorough exploration of model spaces with polynomial structure.

Having structure in the predictors can determine the selection outcome. As an illustration, consider the model E[y] = β00 + β01 x2 + β20 x1^2, where the order-one term x1 is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming x1 to x1* = x1 + c for some c ≠ 0, the model becomes E[y] = β00 + β01 x2 + β20* x1*^2. Note that, in terms of the original predictors, x1*^2 = x1^2 + 2c x1 + c^2, implying that this seemingly innocuous transformation of x1 modifies the column space of the design matrix by including x1, which was not in the original model. That is, when lower order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al., 1982; Khuri, 2002; McCullagh & Nelder, 1989; Nelder, 2000; Peixoto, 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower order predictor associated with it is also in the model. The model above is not well-formulated, as it contains x1^2 but not x1.
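To make this concrete, the following self-contained numerical check (simulated covariates; an illustration rather than anything taken from the data analyses in this work) compares hat matrices before and after shifting x1, for the non-well-formulated model {1, x2, x1^2} and for the well-formulated model {1, x1, x2, x1^2}.

    import numpy as np

    def hat_matrix(Z):
        """Orthogonal projection onto the column space of Z."""
        return Z @ np.linalg.pinv(Z)

    rng = np.random.default_rng(0)
    n = 30
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    c = 2.0  # affine shift applied to x1

    def design(x1, x2, include_x1):
        cols = [np.ones(n), x2, x1**2]
        if include_x1:
            cols.insert(1, x1)  # adding the parent term x1 makes the model well-formulated
        return np.column_stack(cols)

    for include_x1 in (False, True):
        H_orig = hat_matrix(design(x1, x2, include_x1))
        H_shift = hat_matrix(design(x1 + c, x2, include_x1))
        print(include_x1, np.abs(H_orig - H_shift).max())
    # The difference is numerically zero only when x1 is included in the model.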

WFMs exhibit strong heredity, in that all lower order terms dividing higher order terms in the model must also be included. An alternative is to only require weak heredity (Chipman, 1996), which only forces some of the lower terms in the corresponding polynomial hierarchy to be in the model. However, Nelder (1998) demonstrated that the conditions under which weak heredity allows the design matrix to be invariant to affine transformations of the predictors are too restrictive to be useful in practice.


Although this topic appeared in the literature more than three decades ago (Nelder, 1977), only recently have modern variable selection techniques been adapted to account for the constraints imposed by heredity. As described in Bien et al. (2013), the current literature on variable selection for polynomial response surface models can be classified into three broad groups: multi-step procedures (Brusco et al., 2009; Peixoto, 1987), regularized regression methods (Bien et al., 2013; Yuan et al., 2009), and Bayesian approaches (Chipman, 1996). The methods introduced in this Chapter take a Bayesian approach towards variable selection for well-formulated models, with particular emphasis on model priors.

As mentioned in previous chapters, the Bayesian variable selection problem consists of finding models with high posterior probabilities within a pre-specified model space \mathcal{M}. The model posterior probability for M \in \mathcal{M} is given by

p(M | y, \mathcal{M}) \propto m(y | M) \, \pi(M | \mathcal{M}).    (4–1)

Model posterior probabilities depend on the prior distribution on the model space as well as on the prior distributions for the model-specific parameters, implicitly through the marginals m(y|M). Priors on the model-specific parameters have been extensively discussed in the literature (Berger & Pericchi, 1996; Berger et al., 2001; George, 2000; Jeffreys, 1961; Kass & Wasserman, 1996; Liang et al., 2008; Zellner & Siow, 1980). In contrast, the effect of the prior on the model space has until recently been neglected. A few authors (e.g., Casella et al. (2014), Scott & Berger (2010), Wilson et al. (2010)) have highlighted the relevance of the priors on the model space in the context of multiple testing. Adequately formulating priors on the model space can both account for structure in the predictors and provide additional control on the detection of false positive terms. In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the total number of covariates), since this is the most abundant model size contained in the model space.

Variable selection within the model space of well-formulated polynomial models poses two challenges for automatic objective model selection procedures. First, the notion of model complexity takes on a new dimension: complexity is not exclusively a function of the number of predictors, but also depends upon the depth and connectedness of the associations defined by the polynomial hierarchy. Second, because the model space is shaped by such relationships, stochastic search algorithms used to explore the models must also conform to these restrictions.

Models without polynomial hierarchy constitute a special case of WFMs where all predictors are of order one. Hence, all the methods developed throughout this Chapter also apply to models with no predictor structure. Additionally, although our proposed methods are presented for the normal linear case to simplify the exposition, these methods are general enough to be embedded in many Bayesian selection and averaging procedures, including, of course, the occupancy framework previously discussed.

In this Chapter, first we provide the necessary definitions to characterize the well-formulated model selection problem. Then we proceed to introduce three new prior structures on the well-formulated model space and characterize their behavior with simple examples and simulations. With the model priors in place, we build a stochastic search algorithm to explore spaces of well-formulated models that relies on intrinsic priors for the model-specific parameters, though this assumption can be relaxed to use other mixtures of g-priors. Finally, we implement our procedures using both simulated and real data.


4.2 Setup for Well-Formulated Models

Suppose that the observations y_i are modeled using the polynomial regression on the covariates x_{i1}, ..., x_{ip} given by

y_i = \sum_{\alpha} \beta_{(\alpha_1, \ldots, \alpha_p)} \prod_{j=1}^{p} x_{ij}^{\alpha_j} + \epsilon_i,    (4–2)

where \alpha = (\alpha_1, \ldots, \alpha_p) belongs to N_0^p, the p-dimensional space of natural numbers including 0, with \epsilon_i iid N(0, \sigma^2), and only finitely many \beta_\alpha are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates x_{i1} and x_{i2} only. The terms x_{i2}^2 and x_{i1}^2 x_{i2} can be represented by \alpha = (0, 2) and \alpha = (2, 1), respectively.

The notation y = Z(X)\beta + \epsilon is used to denote that the observed response y = (y_1, ..., y_n)' is modeled via a polynomial function Z of the original covariates contained in X = (x_1, ..., x_p) (where x_j = (x_{1j}, ..., x_{nj})'), and the coefficients of the polynomial terms are given by \beta. A specific polynomial model M is defined by the set of coefficients \beta_\alpha that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices \alpha \in N_0^p. In particular, model M is specified by M = \{\alpha^M_1, ..., \alpha^M_{|M|}\} for \alpha^M_k \in N_0^p, where \beta_\alpha = 0 for \alpha \notin M.

Any particular model M uses a subset X_M of the original covariates X to form the polynomial terms in the design matrix Z_M(X). Without ambiguity, a polynomial model Z_M(X) on X can be identified with a polynomial model Z_M(X_M) on the covariates X_M. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of Z_M(X_M). The coefficient vector and error variance of the model M are denoted by \beta_M and \sigma^2_M, respectively. Thus M models the data as y = Z_M(X_M)\beta_M + \epsilon_M, where \epsilon_M \sim N(0, I\sigma^2_M). Model M is said to be nested in model M' if M \subset M'. M models the response of the covariates in two distinct ways: choosing the set of meaningful covariates X_M, as well as choosing the polynomial structure of these covariates Z_M(X_M).


The set N_0^p constitutes a partially ordered set, or more succinctly a poset. A poset is a set partially ordered through a binary relation "\preceq". In this context, the binary relation on the poset N_0^p is defined between pairs (\alpha, \alpha') by \alpha' \preceq \alpha whenever \alpha_j \ge \alpha'_j for all j = 1, ..., p, with \alpha' \prec \alpha if additionally \alpha_j > \alpha'_j for some j. The order of a term \alpha \in N_0^p is given by the sum of its elements, order(\alpha) = \sum_j \alpha_j. When order(\alpha) = order(\alpha') + 1 and \alpha' \prec \alpha, then \alpha' is said to immediately precede \alpha, which is denoted by \alpha' \to \alpha. The parent set of \alpha is defined by P(\alpha) = \{\alpha' \in N_0^p : \alpha' \to \alpha\}, and is given by the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if \alpha \in M implies that P(\alpha) \subset M. For example, any well-formulated model using x_{i1}^2 x_{i2} to model y_i must also include the parent terms x_{i1} x_{i2} and x_{i1}^2, their corresponding parent terms x_{i1} and x_{i2}, and the intercept term 1.
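These definitions translate directly into code. The sketch below (an illustrative implementation, with each term encoded as a tuple of exponents \alpha \in N_0^p) computes parent sets and checks whether a candidate model is well-formulated.

    def parents(alpha):
        """P(alpha): the multi-indices obtained by lowering one positive exponent by one."""
        return [alpha[:j] + (a - 1,) + alpha[j + 1:]
                for j, a in enumerate(alpha) if a > 0]

    def is_well_formulated(model):
        """model: set of exponent tuples (the intercept is the all-zeros tuple).
        A model is well-formulated if every term's parents are also in the model."""
        return all(parent in model
                   for alpha in model for parent in parents(alpha))

    # With p = 2: {1, x1, x1^2} is well-formulated, {1, x2, x1^2} is not.
    print(is_well_formulated({(0, 0), (1, 0), (2, 0)}))  # True
    print(is_well_formulated({(0, 0), (0, 1), (2, 0)}))  # False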

The poset N_0^p can be represented by a Directed Acyclic Graph (DAG), denoted by \Gamma(N_0^p). Without ambiguity, we can identify nodes in the graph \alpha \in N_0^p with terms in the set of covariates. The graph has directed edges to a node from its parents. Any well-formulated model M is represented by a subgraph \Gamma(M) of \Gamma(N_0^p) with the property that if node \alpha \in \Gamma(M), then the nodes corresponding to P(\alpha) are also in \Gamma(M). Figure 4-1 shows examples of well-formulated polynomial models, where \alpha \in N_0^p is identified with \prod_{j=1}^p x_j^{\alpha_j}.

The motivation for considering only well-formulated polynomial models is compelling. Let Z_M be the design matrix associated with a polynomial model. The subspace of y modeled by Z_M, given by the hat matrix H_M = Z_M(Z_M' Z_M)^{-1} Z_M', is invariant to affine transformations of the matrix X_M if and only if M corresponds to a well-formulated polynomial model (Peixoto, 1990).

Figure 4-1. Graphs of well-formulated polynomial models for p = 2 (panels A and B).

For example, if p = 2 and y_i = \beta_{(0,0)} + \beta_{(1,0)} x_{i1} + \beta_{(0,1)} x_{i2} + \beta_{(1,1)} x_{i1} x_{i2} + \epsilon_i, then the hat matrix is invariant to any covariate transformation of the form A (x_{i1}, x_{i2})' + b, for any real-valued positive definite 2 × 2 matrix A and any real-valued vector b of dimension two. In contrast, if y_i = \beta_{(0,0)} + \beta_{(2,0)} x_{i1}^2 + \epsilon_i, then the hat matrix formed after applying the transformation x_{i1} \mapsto x_{i1} + c for real c ≠ 0 is not the same as the hat matrix formed with the original x_{i1}.

4.2.1 Well-Formulated Model Spaces

The spaces of WFMs \mathcal{M} considered in this paper can be characterized in terms of two WFMs: M_B, the base model, and M_F, the full model. The base model contains at least the intercept term and is nested in the full model. The model space \mathcal{M} is populated by all well-formulated models M that nest M_B and are nested in M_F:

\mathcal{M} = \{M : M_B \subseteq M \subseteq M_F \text{ and } M \text{ is well-formulated}\}.

For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, M \in \mathcal{M} can be uniquely identified by two different sets of nodes in M_F: the set of extreme nodes and the set of children nodes. For M \in \mathcal{M}, the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

E(M) = \{\alpha \in M \setminus M_B : \alpha \notin P(\alpha') \ \forall \ \alpha' \in M\},
C(M) = \{\alpha \in M_F \setminus M : \{\alpha\} \cup M \text{ is well-formulated}\}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in \mathcal{M}. The children nodes are those nodes that, when added to M, give rise to a WFM in \mathcal{M}. Because M_B \subseteq M for all M \in \mathcal{M}, the set of nodes E(M) \cup M_B determines M, by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set \{\alpha' \in P(\alpha) : \alpha \in C(M)\} \cup \{\alpha' \in E(M_F) : \alpha \not\preceq \alpha' \text{ for all } \alpha \in C(M)\}, which contains E(M) \cup M_B and thus uniquely identifies M.
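Continuing the illustrative tuple encoding introduced above, the extreme and children node sets can be computed directly from their definitions, as in the following sketch (the parents helper is repeated so the block is self-contained).

    def parents(alpha):
        return [alpha[:j] + (a - 1,) + alpha[j + 1:]
                for j, a in enumerate(alpha) if a > 0]

    def extreme_nodes(M, M_B):
        """E(M): nodes of M outside the base model that are not a parent of any node in M."""
        all_parents = {p for alpha in M for p in parents(alpha)}
        return {alpha for alpha in M - M_B if alpha not in all_parents}

    def children_nodes(M, M_F):
        """C(M): nodes of M_F outside M whose parents all lie in M."""
        return {alpha for alpha in M_F - M
                if all(p in M for p in parents(alpha))}

    # The example of Figure 4-2: M = {1, x1, x1^2} inside the full quadratic in (x1, x2).
    M_B = {(0, 0)}
    M_F = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
    M = {(0, 0), (1, 0), (2, 0)}
    print(extreme_nodes(M, M_B))    # {(2, 0)}, i.e. x1^2
    print(children_nodes(M, M_F))   # {(0, 1)}, i.e. x2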

Figure 4-2. (A) Extreme node set; (B) children node set. Nodes shown: 1, x1, x2, x1^2, x1x2, x2^2.

In Figure 4-2 the extreme and children sets for the model M = {1, x1, x1^2} are shown for the model space characterized by M_F = {1, x1, x2, x1^2, x1x2, x2^2}. In Figure 4-2A the solid nodes represent nodes \alpha \in M \setminus E(M), the dashed node corresponds to \alpha \in E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M \cup C(M).

4.3 Priors on the Model Space

As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities \pi(M | \mathcal{M}).

In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combinations affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition

The graphical structure of the model spaces suggests a method for prior construction on \mathcal{M}, guided by the notion of inheritance. A node \alpha is said to inherit from a node \alpha' if there is a directed path from \alpha' to \alpha in the graph \Gamma(M_F). The inheritance is said to be immediate if order(\alpha) = order(\alpha') + 1 (equivalently, if \alpha' \in P(\alpha), or if \alpha' immediately precedes \alpha).

For convenience, define \Lambda(M) = M \setminus M_B to be the set of nodes in M that are not in the base model M_B. For \alpha \in \Lambda(M_F), let \gamma_\alpha(M) be the indicator function describing whether \alpha is included in M, i.e., \gamma_\alpha(M) = I_{(\alpha \in M)}. Denote by \gamma_\nu(M) the set of indicators of inclusion in M for all order-\nu nodes in \Lambda(M_F). Finally, let \gamma_{<\nu}(M) = \cup_{j=0}^{\nu-1} \gamma_j(M), the set of indicators of inclusion in M for all nodes in \Lambda(M_F) of order less than \nu. With these definitions, the prior probability of any model M \in \mathcal{M} can be factored as

\pi(M | \mathcal{M}) = \prod_{j = J^{min}_M}^{J^{max}_M} \pi(\gamma_j(M) | \gamma_{<j}(M), \mathcal{M}),    (4–3)

where J^{min}_M and J^{max}_M are, respectively, the minimum and maximum order of nodes in \Lambda(M_F), and \pi(\gamma_{J^{min}_M}(M) | \gamma_{<J^{min}_M}(M), \mathcal{M}) = \pi(\gamma_{J^{min}_M}(M) | \mathcal{M}).


Prior distributions on \mathcal{M} can be simplified by making two assumptions. First, if order(\alpha) = order(\alpha') = j, then \gamma_\alpha and \gamma_{\alpha'} are assumed to be conditionally independent when conditioned on \gamma_{<j}, denoted by \gamma_\alpha \perp\!\!\!\perp \gamma_{\alpha'} | \gamma_{<j}. Second, immediate inheritance is invoked, and it is assumed that if order(\alpha) = j, then \gamma_\alpha(M) | \gamma_{<j}(M) = \gamma_\alpha(M) | \gamma_{P(\alpha)}(M), where \gamma_{P(\alpha)}(M) is the inclusion indicator for the set of parent nodes of \alpha. This indicator is one if the complete parent set of \alpha is contained in M, and zero otherwise.

In Figure 4-3 these two assumptions are depicted, with M_F being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x1^2, x2^2, and x1x2 are independent when conditioned on all the lower order terms. In this same space, immediate inheritance implies that the inclusion of x1^2, conditioned on the inclusion of all lower order nodes, is equivalent to conditioning it on its parent set (x1 in this case).

Figure 4-3. (A) Conditional independence: x1^2 \perp\!\!\!\perp x1x2 \perp\!\!\!\perp x2^2 given {1, x1, x2}. (B) Immediate inheritance: the inclusion of x1^2 given {1, x1, x2} is equivalent to its inclusion given x1.

Denote the conditional inclusion probability of node \alpha in model M by \pi_\alpha = \pi(\gamma_\alpha(M) = 1 | \gamma_{P(\alpha)}(M), \mathcal{M}). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

\pi(M | \pi_M, \mathcal{M}) = \prod_{\alpha \in \Lambda(M_F)} \pi_\alpha^{\gamma_\alpha(M)} (1 - \pi_\alpha)^{1 - \gamma_\alpha(M)},    (4–4)

with \pi_M = \{\pi_\alpha : \alpha \in \Lambda(M_F)\}. Because M must be well-formulated, \pi_\alpha = \gamma_\alpha = 0 if \gamma_{P(\alpha)}(M) = 0. Thus the product in 4–4 can be restricted to the set of nodes \alpha \in \Lambda(M) \cup C(M). Additional structure can be built into the prior on \mathcal{M} by making assumptions about the inclusion probabilities \pi_\alpha, such as equality assumptions or the assumption of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on \pi_M assuming some structure among its elements, and then marginalizing out \pi_M.

Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero \pi_\alpha are all equal. Specifically, for a model M \in \mathcal{M} it is assumed that \pi_\alpha = \pi for all \alpha \in \Lambda(M) \cup C(M). The Bayesian specification of the HUP is completed by assuming a prior distribution for \pi. The choice of \pi \sim Beta(a, b) produces

\pi_{HUP}(M | \mathcal{M}, a, b) = \frac{B(|\Lambda(M)| + a, |C(M)| + b)}{B(a, b)},    (4–5)

where B is the beta function. Setting a = b = 1 gives the particular value of

\pi_{HUP}(M | \mathcal{M}, a = 1, b = 1) = \frac{1}{|\Lambda(M)| + |C(M)| + 1} \binom{|\Lambda(M)| + |C(M)|}{|\Lambda(M)|}^{-1}.    (4–6)

The HUP assigns equal probabilities to all models for which the sets of nodes \Lambda(M) and C(M) have the same cardinality. This prior provides a combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because \pi_\alpha = \pi for all \alpha, this penalization can only depend on some aspect of the entire graph of M_F, such as the total number of nodes not in the null model, |\Lambda(M_F)|.


Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero \pi_\alpha. Each non-zero \pi_\alpha is given its own prior, which is assumed to be a Beta distribution with parameters a_\alpha and b_\alpha. Thus the prior probability of M under the HIP is

\pi_{HIP}(M | \mathcal{M}, a, b) = \prod_{\alpha \in \Lambda(M)} \frac{a_\alpha}{a_\alpha + b_\alpha} \prod_{\alpha \in C(M)} \frac{b_\alpha}{a_\alpha + b_\alpha},    (4–7)

where the product over \emptyset is taken to be 1. Because the \pi_\alpha are totally independent, any choice of a_\alpha and b_\alpha is equivalent to choosing a probability of success \pi_\alpha for a given \alpha. Setting a_\alpha = b_\alpha = 1 for all \alpha \in \Lambda(M) \cup C(M) gives the particular value of

\pi_{HIP}(M | \mathcal{M}, a = 1, b = 1) = \left(\frac{1}{2}\right)^{|\Lambda(M)| + |C(M)|}.    (4–8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4–8 is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each \gamma_j is conditioned on \gamma_{<j} in the prior construction, the a_\alpha and b_\alpha for \alpha of order j can be conditioned on \gamma_{<j}. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM, conditioned on the inclusion vector \gamma_{<j}, which is denoted as ch_j(\gamma_{<j}). Choosing a_\alpha = 1 and b_\alpha(M) = ch_j(\gamma_{<j}) is equivalent to choosing a probability of success \pi_\alpha = 1/(ch_j(\gamma_{<j}) + 1). This penalization can drive down the false positive rate when ch_j(\gamma_{<j}) is large, but may produce more false negatives.

Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the \pi_\alpha is to assume equality between the \pi_\alpha of a given order and independence across the different orders. Define \Lambda_j(M) = \{\alpha \in \Lambda(M) : order(\alpha) = j\} and C_j(M) = \{\alpha \in C(M) : order(\alpha) = j\}. The HOP assumes that \pi_\alpha = \pi_j for all \alpha \in \Lambda_j(M) \cup C_j(M). Assuming that \pi_j \sim Beta(a_j, b_j) provides a prior probability of

\pi_{HOP}(M | \mathcal{M}, a, b) = \prod_{j = J^{min}_M}^{J^{max}_M} \frac{B(|\Lambda_j(M)| + a_j, |C_j(M)| + b_j)}{B(a_j, b_j)}.    (4–9)

The specific choice of a_j = b_j = 1 for all j gives a value of

\pi_{HOP}(M | \mathcal{M}, a = 1, b = 1) = \prod_j \left[ \frac{1}{|\Lambda_j(M)| + |C_j(M)| + 1} \binom{|\Lambda_j(M)| + |C_j(M)|}{|\Lambda_j(M)|}^{-1} \right],    (4–10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.
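The closed forms 4–6, 4–8, and 4–10 are easy to evaluate. The sketch below (an illustrative implementation operating on per-order counts of included nodes |\Lambda_j(M)| and children nodes |C_j(M)|) reproduces, for example, the entries of Figure 4-4 for the model {1, x1, x2}.

    from math import comb

    def hip_prob(counts):
        """HIP with a = b = 1 (equation 4-8); counts = [(|Lambda_j|, |C_j|), ...]."""
        return 0.5 ** sum(lam + ch for lam, ch in counts)

    def hup_prob(counts):
        """HUP with a = b = 1 (equation 4-6)."""
        lam = sum(l for l, _ in counts)
        ch = sum(c for _, c in counts)
        return 1.0 / ((lam + ch + 1) * comb(lam + ch, lam))

    def hop_prob(counts):
        """HOP with a_j = b_j = 1 (equation 4-10): a per-order Scott-Berger correction."""
        out = 1.0
        for lam, ch in counts:
            out *= 1.0 / ((lam + ch + 1) * comb(lam + ch, lam))
        return out

    # Model {1, x1, x2} in the full quadratic on (x1, x2): order 1 has two included
    # nodes and no children; order 2 has three children nodes.
    counts = [(2, 0), (0, 3)]
    print(hip_prob(counts), hup_prob(counts), hop_prob(counts))  # 1/32, 1/60, 1/12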

The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on \gamma_{<j}(M), the indicators \{\gamma_\alpha : \alpha \in \Lambda_j(M) \cup C_j(M)\} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with common probability of success \pi_j, with a prior distribution. Our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order j that could be added while maintaining a WFM is given by ch_j(M) = ch_j(\gamma_{<j}(M)) = |\Lambda_j(M) \cup C_j(M)|. Using a_j = 1 and b_j(M) = ch_j(M) produces a prior with two desirable properties. First, if M' \subset M then \pi(M) \le \pi(M'). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., ch_j(M) − 1.

4.3.2 Choice of Prior Structure and Hyper-Parameters

Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities \pi_M. For the sake of simplicity, this paper focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters, which can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space.

Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1, as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by b_j(M) = b_\alpha(M) = ch_j(M) = |\Lambda_j(M) \cup C_j(M)| for the HOP and HIP (where j = order(\alpha)), while b = ch denotes that b = |\Lambda(M_F)| for the HUP. The prior behavior is illustrated for two model spaces; in both cases the base model M_B is taken to be the intercept-only model and M_F is the DAG shown (Figures 4-4 and 4-5). The priors considered treat model complexity differently, and some general properties can be seen in these examples.

                                      HIP                HOP                HUP
     Model                       (1,1)    (1,ch)    (1,1)    (1,ch)    (1,1)    (1,ch)
1    1                           1/4      4/9       1/3      1/2       1/3      5/7
2    1, x1                       1/8      1/9       1/12     1/12      1/12     5/56
3    1, x2                       1/8      1/9       1/12     1/12      1/12     5/56
4    1, x1, x1^2                 1/8      1/9       1/12     1/12      1/12     5/168
5    1, x2, x2^2                 1/8      1/9       1/12     1/12      1/12     5/168
6    1, x1, x2                   1/32     3/64      1/12     1/12      1/60     1/72
7    1, x1, x2, x1^2             1/32     1/64      1/36     1/60      1/60     1/168
8    1, x1, x2, x1x2             1/32     1/64      1/36     1/60      1/60     1/168
9    1, x1, x2, x2^2             1/32     1/64      1/36     1/60      1/60     1/168
10   1, x1, x2, x1^2, x1x2       1/32     1/192     1/36     1/120     1/30     1/252
11   1, x1, x2, x1^2, x2^2       1/32     1/192     1/36     1/120     1/30     1/252
12   1, x1, x2, x1x2, x2^2       1/32     1/192     1/36     1/120     1/30     1/252
13   1, x1, x2, x1^2, x1x2, x2^2 1/32     1/576     1/12     1/120     1/6      1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

First, contrast the choices of HIP, HUP, and HOP for (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2, models 6 through 13, are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms. In contrast to the HIP, the

                                HIP                HOP                HUP
     Model                 (1,1)    (1,ch)    (1,1)    (1,ch)    (1,1)    (1,ch)
1    1                     1/8      27/64     1/4      1/2       1/4      4/7
2    1, x1                 1/8      9/64      1/12     1/10      1/12     2/21
3    1, x2                 1/8      9/64      1/12     1/10      1/12     2/21
4    1, x3                 1/8      9/64      1/12     1/10      1/12     2/21
5    1, x1, x3             1/8      3/64      1/12     1/20      1/12     4/105
6    1, x2, x3             1/8      3/64      1/12     1/20      1/12     4/105
7    1, x1, x2             1/16     3/128     1/24     1/40      1/30     1/42
8    1, x1, x2, x1x2       1/16     3/128     1/24     1/40      1/20     1/70
9    1, x1, x2, x3         1/16     1/128     1/8      1/40      1/20     1/70
10   1, x1, x2, x3, x1x2   1/16     1/128     1/8      1/40      1/5      1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where M_B is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Under the HUP, models including all of the terms are given at least as much probability as any model containing a non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., this is the only model that contains every term), and as an unfortunate consequence this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex they are appropriately penalized for their size; however, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced. Thus combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.

In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node \alpha to a model M to form M' produces p(M) \ge p(M'). However, differences between the priors are apparent. The

HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5 the penalization of the HOP is the least dramatic, producing prior odds of 20 for M_B versus M_F, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior

To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity: first, the priors are compared as the number of main effects p grows; second, they are compared as the depth of the hierarchy increases, or in other words, as the order J^{max}_M increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as p_\alpha = \sum_{M \in \mathcal{M}} I_{(\alpha \in M)} \, p(M | y, \mathcal{M}) for \alpha \in M_F. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on \mathcal{M}. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice of a = 1 and b = ch are considered. The results for the different combinations of M_F and M_T incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:

1. Randomly generate main effects matrices X = (x1, ..., x18), with x_i iid N_n(0, I_n), and error vectors \epsilon \sim N_n(0, I_n), for n = 60.

2. Setting all coefficient values equal to one, calculate y = Z_{M_T}\beta + \epsilon for the true models given below (see the sketch following this list):
   M_T1 = {x1, x2, x3, x1^2, x1x2, x2^2, x2x3}, with |M_T1| = 7;
   M_T2 = {x1, x2, ..., x16}, with |M_T2| = 16;
   M_T3 = {x1, x2, x3, x4}, with |M_T3| = 4;
   M_T4 = {x1, x2, ..., x8, x1^2, x3x4}, with |M_T4| = 10;
   M_T5 = {x1, x2, x3, x4, x1^2, x3x4}, with |M_T5| = 6.

Table 4-1. Characterization of the full models M_F and corresponding model spaces M considered in simulations.

Growing p, fixed J^max_M                                Fixed p, growing J^max_M
M_F               |M_F|   |M|       M_T used            M_F               |M_F|   |M|       M_T used
(x1 + x2 + x3)^2  9       95        M_T1                (x1 + x2 + x3)^2  9       95        M_T1
(x1 + ... + x4)^2 14      1337      M_T1                (x1 + x2 + x3)^3  19      2497      M_T1
(x1 + ... + x5)^2 20      38619     M_T1                (x1 + x2 + x3)^4  34      161421    M_T1

Other model spaces
M_F                                      |M_F|   |M|       M_T used
x1 + x2 + ... + x18                      18      262144    M_T2, M_T3
(x1 + ... + x4)^2 + x5 + x6 + ... + x10  20      85568     M_T4, M_T5

3. In all simulations the base model M_B is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full model M_F, are presented in Table 4-1, as well as the true models used in each case.

4. Enumerate the model spaces and calculate p(M | y, M) for all M ∈ M, using the EPP, HUP, HIP, and HOP, the hierarchical priors each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each M for the different priors.
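A minimal sketch of the data-generating step in items 1 and 2 above is given below for the true model M_T1; it is illustrative only, and for simplicity the intercept column is omitted from Z_MT1.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 60, 18

    # Item 1: main-effects matrix and error vector.
    X = rng.normal(size=(n, p))
    eps = rng.normal(size=n)

    # Item 2: Z for MT1 = {x1, x2, x3, x1^2, x1*x2, x2^2, x2*x3}, all coefficients one.
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    Z_MT1 = np.column_stack([x1, x2, x3, x1**2, x1 * x2, x2**2, x2 * x3])
    y = Z_MT1 @ np.ones(Z_MT1.shape[1]) + eps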

The true positives (TP) are defined as those nodes α ∈ MT such that pα > 0.5. For the false positives (FP), three different cutoffs are considered for pα, elucidating the adjustment for multiplicity induced by the model priors. These cutoffs are 0.10, 0.20 and 0.50 for α ∉ MT. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1 the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of MF, the full model that defines the DAG for M.
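The sketch below illustrates, with a toy enumerated model space, how the marginal inclusion probabilities and the TP/FP counts above can be computed; the models, posterior probabilities, and node labels are illustrative only.

## Toy illustration: marginal inclusion probabilities and TP/FP counts
models    <- list(c("x1", "x2"),
                  c("x1", "x2", "x1:x2"),
                  c("x1", "x2", "x3", "x1:x2"))
post_prob <- c(0.2, 0.5, 0.3)               # p(M | y, M), summing to one

incl_prob <- function(models, post_prob, nodes)
  sapply(nodes, function(a)
    sum(post_prob[vapply(models, function(m) a %in% m, logical(1))]))

nodes    <- c("x1", "x2", "x3", "x1:x2")
true_mod <- c("x1", "x2", "x1:x2")
p_alpha  <- incl_prob(models, post_prob, nodes)

TP <- sum(p_alpha[nodes %in% true_mod] > 0.5)              # true positives at 0.50
FP <- sapply(c(0.1, 0.2, 0.5),                             # false positives by cutoff
             function(cut) sum(p_alpha[!nodes %in% true_mod] > cut))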

Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be MT1 and has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.

First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance is more pronounced as p increases. These also considerably outperform the priors using the default hyper-parameters a = b = 1 in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in MT1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With a 0.50 cutoff, the hierarchical priors keep a tight control on the number of false positives, but in doing so discard true positives with slightly higher frequency.

Growing polynomial degree, fixed main effects. For these examples the true model is once again MT1. When the complexity is increased by making the order of MF larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.50 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high inclusion posterior probability both with the EPP and with the a = b = 1 priors.


Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                         a = 1, b = 1          a = 1, b = ch
Cutoff           |MT|  MF                       EPP      HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)         7    (x1+x2+x3)^2             1.78     1.78   2.00   2.00    0.11   1.31   1.06
FP(>0.20)                                       0.43     0.43   2.00   1.98    0.01   0.28   0.24
FP(>0.50)                                       0.04     0.04   0.97   0.36    0.00   0.03   0.02
TP(>0.50) (MT1)                                 7.00     7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)         7    (x1+x2+x3+x4)^2          3.62     1.94   2.33   2.45    0.10   0.63   1.07
FP(>0.20)                                       1.60     0.47   2.17   2.15    0.01   0.17   0.24
FP(>0.50)                                       0.25     0.06   0.35   0.36    0.00   0.02   0.02
TP(>0.50) (MT1)                                 7.00     7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)         7    (x1+x2+x3+x4+x5)^2       6.00     2.16   2.60   2.55    0.12   0.43   1.15
FP(>0.20)                                       2.91     0.55   2.13   2.18    0.02   0.19   0.27
FP(>0.50)                                       0.66     0.11   0.25   0.37    0.00   0.03   0.01
TP(>0.50) (MT1)                                 7.00     7.00   7.00   7.00    6.97   6.99   6.99

In contrast, any of the a = 1, b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even more clear. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.

Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2 but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.


Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                         a = 1, b = 1          a = 1, b = ch
Cutoff           |MT|  MF                       EPP      HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)         7    (x1+x2+x3)^2             1.78     1.78   2.00   2.00    0.11   1.31   1.06
FP(>0.20)                                       0.43     0.43   2.00   1.98    0.01   0.28   0.24
FP(>0.50)                                       0.04     0.04   0.97   0.36    0.00   0.03   0.02
TP(>0.50) (MT1)                                 7.00     7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)         7    (x1+x2+x3)^3             7.37     5.21   6.06   2.91    0.55   1.05   1.39
FP(>0.20)                                       2.91     1.55   3.61   2.08    0.17   0.34   0.31
FP(>0.50)                                       0.40     0.21   0.50   0.26    0.03   0.03   0.04
TP(>0.50) (MT1)                                 7.00     7.00   7.00   7.00    6.97   6.98   7.00
FP(>0.10)         7    (x1+x2+x3)^4             8.22     4.00   4.69   2.61    0.52   0.55   1.32
FP(>0.20)                                       4.21     1.13   1.76   2.03    0.12   0.15   0.31
FP(>0.50)                                       0.56     0.17   0.22   0.27    0.03   0.03   0.04
TP(>0.50) (MT1)                                 7.00     7.00   7.00   7.00    6.97   6.97   6.99

By construction, in model spaces with main effects only, HIP(1, 1) and EPP are equivalent, as are HOP(a, b) and HUP(a, b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models are a model with 16 and a model with 4 main effects, respectively. When the number of true coefficients is large, the HUP(1, 1) and HOP(1, 1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well. The true model contains 16 out of the 18 nodes in MF, so there is little potential for false positives. The a = 1, b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 0.50 cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate for the 0.50 cutoff.


If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1, 1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and than the choice a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from MT4, with ten terms, and MT5, with six terms. HIP(1, 1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff. At the 0.50 cutoff some false positives are still included. The HUP(1, 1) and HOP(1, 1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and the a = b = 1 priors always include all of the predictors in MT4 and MT5. On the other hand, the ability of the a = 1, b = ch priors to control for false positives is markedly better than that of the EPP and the hierarchical priors with the choice a = b = 1. At the 0.50 cutoff these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as being good default priors on the model space.

4.4 Random Walks on the Model Space

When the model space M is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space M both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which


Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

                                                                    a = 1, b = 1             a = 1, b = ch
Cutoff           |MT|  MF                              EPP      HIP     HUP     HOP      HIP    HUP     HOP
FP(>0.10)         16   x1 + x2 + ... + x18             1.93     1.93    2.00    2.00     0.03   1.80    1.80
FP(>0.20)                                              0.52     0.52    2.00    2.00     0.01   0.46    0.46
FP(>0.50)                                              0.07     0.07    2.00    2.00     0.01   0.04    0.04
TP(>0.50) (MT2)                                       15.99    15.99   16.00   16.00     6.99  15.99   15.99
FP(>0.10)          4   x1 + x2 + ... + x18            13.95    13.95    9.15    9.15     0.26   1.31    1.31
FP(>0.20)                                              5.45     5.45    3.03    3.03     0.05   0.45    0.45
FP(>0.50)                                              0.84     0.84    0.45    0.45     0.02   0.06    0.06
TP(>0.50) (MT3)                                        4.00     4.00    4.00    4.00     4.00   4.00    4.00
FP(>0.10)         10   (x1+...+x4)^2 + x5+...+x10      9.73     9.71   10.00    5.60     0.34   2.33    2.20
FP(>0.20)                                              2.65     2.65    8.73    3.05     0.12   0.74    0.69
FP(>0.50)                                              0.35     0.35    1.36    1.68     0.02   0.11    0.12
TP(>0.50) (MT4)                                       10.00    10.00   10.00    9.99     9.94   9.98    9.99
FP(>0.10)          6   (x1+...+x4)^2 + x5+...+x10     13.52    13.52   11.06    9.94     0.44   1.63    1.96
FP(>0.20)                                              4.22     4.21    3.60    5.01     0.15   0.48    0.68
FP(>0.50)                                              0.53     0.53    0.57    0.75     0.01   0.08    0.11
TP(>0.50) (MT5)                                        6.00     6.00    6.00    6.00     5.99   5.99    5.99

are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing

This first strategy relies on small, localized jumps around the model space, turning on or off a single node at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step of the algorithm, assume that the current state of the chain is model M. Let pG be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M+ = M ∪ {α} for some α ∈ C(M), or M− = M \ {α} for some α ∈ E(M).

An example transition kernel is defined by the mixture

$$g(M'\mid M) = p_G \cdot q_{\mathrm{Grow}}(M'\mid M) + (1 - p_G)\cdot q_{\mathrm{Prune}}(M'\mid M)
= \frac{I_{\{M \neq M_F\}}}{1 + I_{\{M \neq M_B\}}}\cdot\frac{I_{\{\alpha \in C(M)\}}}{|C(M)|}
+ \frac{I_{\{M \neq M_B\}}}{1 + I_{\{M \neq M_F\}}}\cdot\frac{I_{\{\alpha \in E(M)\}}}{|E(M)|}, \qquad (4\text{-}11)$$

where pG has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing between pruning and growing, a single node is proposed for addition to or deletion from M uniformly at random.
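A minimal R sketch of this grow/prune proposal is given below; C_M and E_M stand for the children and extreme sets of the current model and are assumed to have been computed elsewhere, so the object names are illustrative.

## Sketch of one grow/prune proposal as in Equation 4-11
propose_local <- function(M, C_M, E_M) {
  p_grow <- if (length(C_M) == 0) 0 else if (length(E_M) == 0) 1 else 0.5
  if (runif(1) < p_grow) {
    alpha <- sample(C_M, 1)                  # grow: activate one child node
    list(model = union(M, alpha), q_fwd = p_grow / length(C_M))
  } else {
    alpha <- sample(E_M, 1)                  # prune: remove one extreme node
    list(model = setdiff(M, alpha), q_fwd = (1 - p_grow) / length(E_M))
  }
}

## e.g. M = {x1, x2}, C(M) = {x3, x1^2, x1:x2, x2^2}, E(M) = {x1, x2}
prop <- propose_local(M   = c("x1", "x2"),
                      C_M = c("x3", "x1^2", "x1:x2", "x2^2"),
                      E_M = c("x1", "x2"))

The reverse-move probability needed for the Metropolis-Hastings correction is obtained in the same way from the children and extreme sets of the proposed model.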

For this simple algorithm, pruning is the reverse kernel of growing and vice-versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, the sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree Based Pruning and Growing

In exploring the model space it is possible to take advantage of the hierarchical structure defined between nodes of different order. One can update the vector of inclusion indicators in blocks defined by the order j of the nodes. Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = Jmin up to j = Jmax, with Jmin and Jmax being the lowest and highest orders of nodes in MF \ MB, respectively. Define Mt(Jmin−1) = M and set j = Jmin. The growth kernel comprises the following steps, proceeding from j = Jmin to j = Jmax:


1) Propose a model M′ by selecting a set of nodes from Cj(Mt(j−1)) through the kernel qGrow,j(·|Mt(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j−1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j−1).

3) If j < Jmax, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set Mt = Mt(Jmax).

The pruning step is defined in a similar fashion; however, it starts at order j = Jmax and proceeds down to j = Jmin. Let Ej(M′) ⊆ E(M′) be the set of nodes of order j that can be removed from the model to produce a WFM. Define Mt(Jmax+1) = M and set j = Jmax. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from Ej(Mt(j+1)) through the kernel qPrune,j(·|Mt(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j+1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j+1).

3) If j > Jmin, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt(Jmin).

It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can be combined for each j. The forward kernel proceeds from j = Jmin to j = Jmax and proposes adding sets of nodes from Cj(M) ∪ Ej(M). The reverse kernel simply reverses the direction of j, proceeding from j = Jmax to j = Jmin.

4.5 Simulation Study

To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points in the scale (McCullagh & Nelder, 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.

With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normal variates. The response vectors are drawn from the n-variate normal distribution as $\mathbf{y} \sim N_n(Z_{M_T}(X)\boldsymbol{\beta}_{M_T},\, I_n)$, where MT is the true model and In is the n × n identity matrix.

The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that ZMF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10^22, which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1² has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
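The moment argument above can be checked numerically; the following R snippet is only an illustration of why unscaled higher-order terms carry more signal than their ancestors.

## For x ~ N(0, 1), the variance of x^k grows with k (theoretically 1, 2, and 15
## for k = 1, 2, 3), so equal coefficients would favor higher-order terms.
x <- rnorm(1e6)
c(var(x), var(x^2), var(x^3))   # approximately 1, 2, 15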

The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior across the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows


the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.

Figure 4-6. DAG of the largest true model MT used in simulations.

The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect

As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with the HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with the EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are smallest. Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the number of FPs is somewhat lower with the HOP.

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with the EPP and the HOP(1, ch).

4.5.2 Coefficient Magnitude

Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal, regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient. Finally, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1_o1, 1_o2, 1_o3), β(2) = c(1_o1, 0.5_o2, 0.25_o3), and β(3) = c(0.25_o1, 0.5_o2, 1_o3), respectively. In Figure 4-7 the first four scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7 the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs, regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale

Four true models were considered: (1) the model from Figure 4-6 (MT1); (2) the model without the order-one terms (MT2); (3) the model without order-two terms (MT3); and (4) the model without x1² and x2x5 (MT4). The last three are clearly not well-formulated. In Figure 4-7 the leftmost point on the horizontal axis corresponds to scenarios with MT1; the next point is for scenarios with MT2, followed by those with MT3, then with MT4, then MT1 again, and so on. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis

This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The model space contains approximately 71 billion models, and computation of all model posterior probabilities is not feasible.

Table 4-5. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily maximum one-hour-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in Equation 3-3, four different mixtures of g-priors are utilized: intrinsic priors (IP), which yield the expression in Equation 3-2; hyper-g (HG) priors (Liang et al., 2008) with hyper-parameters α = 2, β = 1 and α = β = 1; and Zellner-Siow (ZS) priors (Zellner & Siow, 1980). The results were extracted for the median probability model (MPM) under each combination. Additionally, the model is estimated using the R package hierNet (Bien et al., 2013) to compare the model selection results to those obtained with the hierarchical lasso (Bien et al., 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.
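For illustration, the validation step can be sketched as follows; `ozone_df` is a hypothetical data frame holding the variables of Table 4-5, and an ordinary least squares fit of the terms in one of the selected models stands in here for the Bayesian fit, so the numbers it produces are not those reported in Table 4-6.

## Sketch: fit a selected model on the training half, score it on the hold-out half
set.seed(1)
train_id <- sample(seq_len(nrow(ozone_df)), 165)
train <- ozone_df[train_id, ]
valid <- ozone_df[-train_id, ]

## Terms of the ZS/HOP median probability model (illustrative formula)
fit  <- lm(ozone ~ hum + dpg + ibt + I(hum^2) + hum:ibt + I(dpg^2) + I(ibt^2),
           data = train)
pred <- predict(fit, newdata = valid)
rmse <- sqrt(mean((valid$ozone - pred)^2))   # validation RMSE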

Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg², which has a relatively high marginal inclusion probability of 0.46. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.

Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.


Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso.

BF prior  Model prior  Model                                                R^2     RMSE
IP        EPP          hum, dpg, ibt, hum^2, hum*dpg, hum*ibt, dpg^2, ibt^2 0.8054  4.2739
IP        HIP          hum, ibt, hum^2, hum*ibt, ibt^2                      0.7740  4.3396
IP        HOP          hum, dpg, ibt, hum^2, hum*ibt, ibt^2                 0.7848  4.3175
IP        HUP          hum, dpg, ibt, hum*ibt, ibt^2                        0.7767  4.3508
ZS        EPP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2          0.7896  4.2518
ZS        HIP          hum, ibt, hum*ibt, ibt^2                             0.7525  4.3505
ZS        HOP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2          0.7896  4.2518
ZS        HUP          hum, dpg, ibt, hum*ibt, ibt^2                        0.7767  4.3508
HG11      EPP          vh, hum, dpg, ibt, hum^2, hum*ibt, dpg^2             0.7701  4.3049
HG11      HIP          hum, ibt, hum*ibt, ibt^2                             0.7525  4.3505
HG11      HOP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2          0.7896  4.2518
HG11      HUP          hum, dpg, ibt, hum*ibt, ibt^2                        0.7767  4.3508
HG21      EPP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2                 0.7701  4.3037
HG21      HIP          hum, dpg, ibt, hum*ibt, ibt^2                        0.7767  4.3508
HG21      HOP          hum, dpg, ibt, hum^2, hum*ibt, dpg^2, ibt^2          0.7896  4.2518
HG21      HUP          hum, dpg, ibt, hum*ibt                               0.7526  4.4036
HierNet                hum, temp, ibh, dpg, ibt, vis, hum^2, hum*ibt,       0.7651  4.3680
                       temp^2, temp*ibt, dpg^2

4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model. Therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M).

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure in the predictors has the potential to lead to different model selection results according to how the predictors are set up (e.g., in what units these predictors are expressed).

In this chapter we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP using the hyper-parameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.


In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.


CHAPTER 5
CONCLUSIONS

Ecologists are now embracing the use of Bayesian methods to investigate the interactions that dictate the distribution and abundance of organisms. These tools are both powerful and flexible. They allow integrating, under a single methodology, empirical observations and theoretical process models, and they can seamlessly account for several sources of uncertainty and dependence. The estimation and testing methods proposed throughout this document will contribute to the understanding of the Bayesian methods used in ecology, and hopefully they will shed light on the differences between Bayesian estimation and testing tools.

All of our contributions exploit the potential of the latent variable formulation. This approach greatly simplifies the analysis of complex models: it redirects the bulk of the inferential burden away from the original response variables and places it on the easy-to-work-with latent scale, for which several time-tested approaches are available. Our methods are distinctly classified into estimation and testing tools.

For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. Then we developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on the survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggest a strategy to add spatial dependence among neighboring sites.

Ecological inquiry usually requires competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which avoids introducing (commonly unavailable) subjective information into the model. In simulation experiments we observed that the methods accurately single out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. Also, the simulations indicated that the method provides better discrimination for predictors in the detection component of the model.

In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M|M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure in the predictors has the potential to lead to different model selection results according to how the predictors are coded (e.g., in what units these predictors are expressed).

To confront this situation, we propose three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP using the hyper-parameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.

Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit version of the model.


APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS

In this section we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using the probit as well as the logit link.

Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms.

Let $\phi(\nu\mid\mu,\sigma^2)$ represent the density of a normal random variable $\nu$ with mean $\mu$ and variance $\sigma^2$, and recall that $\psi_{i1} = F(x_{(o)i}'\alpha)$ and $p_{ijt} = F(q_{ijt}'\lambda_t)$, where $F(\cdot)$ is the inverse link function. The full conditional for $z_{it}$ is given by:

1. For $t = 1$,
$$\pi(z_{i1}\mid v_{i1},\alpha,\lambda_1,\beta^c_1,\delta^s_1) = (\psi^*_{i1})^{z_{i1}}(1-\psi^*_{i1})^{1-z_{i1}} = \mathrm{Bernoulli}(\psi^*_{i1}), \qquad (A\text{-}1)$$
where
$$\psi^*_{i1} = \frac{\psi_{i1}\,\phi(v_{i1}\mid x_{i1}'\beta^c_1 + \delta^s_1, 1)\prod_{j=1}^{J_{i1}}(1-p_{ij1})}
{\psi_{i1}\,\phi(v_{i1}\mid x_{i1}'\beta^c_1 + \delta^s_1, 1)\prod_{j=1}^{J_{i1}}(1-p_{ij1}) + (1-\psi_{i1})\,\phi(v_{i1}\mid x_{i1}'\beta^c_1, 1)\prod_{j} I_{\{y_{ij1}=0\}}}.$$

2. For $1 < t < T$,
$$\pi(z_{it}\mid z_{i(t-1)}, z_{i(t+1)},\lambda_t,\beta^c_{t-1},\delta^s_{t-1}) = (\psi^*_{it})^{z_{it}}(1-\psi^*_{it})^{1-z_{it}} = \mathrm{Bernoulli}(\psi^*_{it}), \qquad (A\text{-}2)$$
where
$$\psi^*_{it} = \frac{\kappa_{it}\prod_{j=1}^{J_{it}}(1-p_{ijt})}{\kappa_{it}\prod_{j=1}^{J_{it}}(1-p_{ijt}) + \nabla_{it}\prod_{j} I_{\{y_{ijt}=0\}}},$$
with
(a) $\kappa_{it} = F(x_{i(t-1)}'\beta^c_{t-1} + z_{i(t-1)}\delta^s_{t-1})\,\phi(v_{it}\mid x_{it}'\beta^c_t + \delta^s_t, 1)$, and
(b) $\nabla_{it} = \left(1 - F(x_{i(t-1)}'\beta^c_{t-1} + z_{i(t-1)}\delta^s_{t-1})\right)\phi(v_{it}\mid x_{it}'\beta^c_t, 1)$.

3. For $t = T$,
$$\pi(z_{iT}\mid z_{i(T-1)},\lambda_T,\beta^c_{T-1},\delta^s_{T-1}) = (\psi^\star_{iT})^{z_{iT}}(1-\psi^\star_{iT})^{1-z_{iT}} = \mathrm{Bernoulli}(\psi^\star_{iT}), \qquad (A\text{-}3)$$
where
$$\psi^\star_{iT} = \frac{\kappa^\star_{iT}\prod_{j=1}^{J_{iT}}(1-p_{ijT})}{\kappa^\star_{iT}\prod_{j=1}^{J_{iT}}(1-p_{ijT}) + \nabla^\star_{iT}\prod_{j} I_{\{y_{ijT}=0\}}},$$
with
(a) $\kappa^\star_{iT} = F(x_{i(T-1)}'\beta^c_{T-1} + z_{i(T-1)}\delta^s_{T-1})$, and
(b) $\nabla^\star_{iT} = 1 - F(x_{i(T-1)}'\beta^c_{T-1} + z_{i(T-1)}\delta^s_{T-1})$.

Sampler $u_i$

$$\pi(u_i\mid z_{i1},\alpha) = \mathrm{tr}\,N\!\left(x_{(o)i}'\alpha,\, 1,\, \mathrm{trunc}(z_{i1})\right),
\quad \text{where } \mathrm{trunc}(z_{i1}) = \begin{cases} (-\infty, 0] & z_{i1} = 0, \\ (0, \infty) & z_{i1} = 1, \end{cases} \qquad (A\text{-}4)$$
and $\mathrm{tr}\,N(\mu,\sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$, and truncation region $A$.

Sampler $\alpha$

$$\pi(\alpha\mid \mathbf{u}) \propto [\alpha]\prod_{i=1}^{N}\phi(u_i\mid x_{(o)i}'\alpha, 1). \qquad (A\text{-}5)$$
If $[\alpha] \propto 1$, then $\alpha\mid\mathbf{u} \sim N(m(\alpha), \Sigma_\alpha)$, with $m(\alpha) = \Sigma_\alpha X_{(o)}'\mathbf{u}$ and $\Sigma_\alpha = (X_{(o)}'X_{(o)})^{-1}$.

Sampler $v_{it}$

For $t > 1$,
$$\pi(v_{i(t-1)}\mid z_{i(t-1)}, z_{it},\beta^c_{t-1},\delta^s_{t-1}) = \mathrm{tr}\,N\!\left(\mu^{(v)}_{i(t-1)},\, 1,\, \mathrm{trunc}(z_{it})\right), \qquad (A\text{-}6)$$
where $\mu^{(v)}_{i(t-1)} = x_{i(t-1)}'\beta^c_{t-1} + z_{i(t-1)}\delta^s_{t-1}$ and $\mathrm{trunc}(z_{it})$ defines the corresponding truncation region given by $z_{it}$.

Sampler $(\beta^c_{t-1}, \delta^s_{t-1})$

For $t > 1$,
$$\pi(\beta^c_{t-1},\delta^s_{t-1}\mid \mathbf{v}_{t-1}, \mathbf{z}_{t-1}) \propto [\beta^c_{t-1},\delta^s_{t-1}]\prod_{i=1}^{N}\phi(v_{it}\mid x_{i(t-1)}'\beta^c_{t-1} + z_{i(t-1)}\delta^s_{t-1}, 1). \qquad (A\text{-}7)$$
If $[\beta^c_{t-1},\delta^s_{t-1}] \propto 1$, then $\beta^c_{t-1},\delta^s_{t-1}\mid\mathbf{v}_{t-1},\mathbf{z}_{t-1} \sim N(m(\beta^c_{t-1},\delta^s_{t-1}), \Sigma_{t-1})$, with $m(\beta^c_{t-1},\delta^s_{t-1}) = \Sigma_{t-1}\tilde{X}_{t-1}'\mathbf{v}_{t-1}$ and $\Sigma_{t-1} = (\tilde{X}_{t-1}'\tilde{X}_{t-1})^{-1}$, where $\tilde{X}_{t-1} = (X_{t-1}, \mathbf{z}_{t-1})$.

Sampler $w_{ijt}$

For $t \geq 1$ and $z_{it} = 1$,
$$\pi(w_{ijt}\mid z_{it} = 1, y_{ijt},\lambda) = \mathrm{tr}\,N\!\left(q_{ijt}'\lambda_t,\, 1,\, \mathrm{trunc}(y_{ijt})\right). \qquad (A\text{-}8)$$

Sampler $\lambda_t$

For $t = 1, 2, \ldots, T$,
$$\pi(\lambda_t\mid \mathbf{z}_t, \mathbf{w}_t) \propto [\lambda_t]\prod_{i:\, z_{it}=1}\ \prod_{j=1}^{J_{it}}\phi(w_{ijt}\mid q_{ijt}'\lambda_t, 1). \qquad (A\text{-}9)$$
If $[\lambda_t] \propto 1$, then $\lambda_t\mid\mathbf{w}_t,\mathbf{z}_t \sim N(m(\lambda_t), \Sigma_{\lambda_t})$, with $m(\lambda_t) = \Sigma_{\lambda_t}Q_t'\mathbf{w}_t$ and $\Sigma_{\lambda_t} = (Q_t'Q_t)^{-1}$, where $Q_t$ and $\mathbf{w}_t$, respectively, are the design matrix and the vector of latent variables for surveys of sites such that $z_{it} = 1$.


APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state M, the global jump is performed by drawing a model M′ at random from the model space. This is achieved by beginning at the base model and increasing the order from JminM to JmaxM, the minimum and maximum orders of nodes in MF \ MB; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is
$$\alpha = \min\left\{1,\ \frac{m(\mathbf{y}\mid M', \mathcal{M})}{m(\mathbf{y}\mid M, \mathcal{M})}\right\}.$$

Local Jump. From the current state M, the local jump is performed by drawing a model from the set of models L(M) = {Mα : α ∈ E(M) ∪ C(M)}, where Mα is M \ {α} for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities for the models are computed as a mixture of p(M′ | y, M, M′ ∈ L(M)) and the discrete uniform distribution. The proposal kernel is
$$q(M'\mid\mathbf{y},\mathcal{M}, M'\in L(M)) = \frac{1}{2}\left(p(M'\mid\mathbf{y},\mathcal{M}, M'\in L(M)) + \frac{1}{|L(M)|}\right).$$
This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is
$$\alpha = \min\left\{1,\ \frac{m(\mathbf{y}\mid M',\mathcal{M})}{m(\mathbf{y}\mid M,\mathcal{M})}\cdot
\frac{q(M\mid\mathbf{y},\mathcal{M}, M\in L(M'))}{q(M'\mid\mathbf{y},\mathcal{M}, M'\in L(M))}\right\}.$$
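A minimal R sketch of this mixture kernel is shown below; `log_marg` is assumed to hold (unnormalized) log posterior scores for the models in L(M), so the names and values are illustrative.

## Local-jump proposal probabilities: equal mix of renormalized posterior and uniform
local_proposal <- function(log_marg) {
  post <- exp(log_marg - max(log_marg))      # renormalize for numerical stability
  post <- post / sum(post)                   # p(M' | y, M' in L(M))
  0.5 * post + 0.5 / length(post)            # mixture with the uniform kernel
}

q <- local_proposal(c(-10.2, -9.7, -12.4))   # toy log scores for three neighbors
j <- sample(seq_along(q), 1, prob = q)       # index of the proposed model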

Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M′, define Lj(M′) = {M′} ∪ {M′α : α ∈ E(M′) ∪ C(M′) with α of order j}. From a state M, the kernel chooses at random whether to increase or decrease the order. If M = MF then decreasing the order is chosen with probability 1, and if M = MB then increasing the order is chosen with probability 1; in all other cases the probability of increasing and of decreasing the order is 1/2. The proposal kernels are given by:


Increasing order proposal kernel.

1. Set j = JminM − 1 and M′j = M.

2. Draw M′j+1 from qinc,j+1(M′ | y, M, M′ ∈ Lj+1(M′j)), where
$$q_{\mathrm{inc},j+1}(M'\mid\mathbf{y},\mathcal{M}, M'\in L_{j+1}(M'_j)) = \frac{1}{2}\left(p(M'\mid\mathbf{y},\mathcal{M}, M'\in L_{j+1}(M'_j)) + \frac{1}{|L_{j+1}(M'_j)|}\right).$$

3. Set j = j + 1.

4. If j < JmaxM, then return to 2; otherwise proceed to 5.

5. Set M′ = M′JmaxM and compute the proposal probability
$$q_{\mathrm{inc}}(M'\mid\mathbf{y},\mathcal{M}, M) = \prod_{j=J^{min}_M-1}^{J^{max}_M-1} q_{\mathrm{inc},j+1}(M'_{j+1}\mid\mathbf{y},\mathcal{M}, M'\in L_{j+1}(M'_j)). \qquad (B\text{-}1)$$

Decreasing order proposal kernel.

1. Set j = JmaxM + 1 and M′j = M.

2. Draw M′j−1 from qdec,j−1(M′ | y, M, M′ ∈ Lj−1(M′j)), where
$$q_{\mathrm{dec},j-1}(M'\mid\mathbf{y},\mathcal{M}, M'\in L_{j-1}(M'_j)) = \frac{1}{2}\left(p(M'\mid\mathbf{y},\mathcal{M}, M'\in L_{j-1}(M'_j)) + \frac{1}{|L_{j-1}(M'_j)|}\right).$$

3. Set j = j − 1.

4. If j > JminM, then return to 2; otherwise proceed to 5.

5. Set M′ = M′JminM and compute the proposal probability
$$q_{\mathrm{dec}}(M'\mid\mathbf{y},\mathcal{M}, M) = \prod_{j=J^{max}_M+1}^{J^{min}_M+1} q_{\mathrm{dec},j-1}(M'_{j-1}\mid\mathbf{y},\mathcal{M}, M'\in L_{j-1}(M'_j)). \qquad (B\text{-}2)$$

If increasing order is chosen, then the MH correction is given by
$$\alpha = \min\left\{1,\ \frac{1 + I(M' = M_F)}{1 + I(M = M_B)}\cdot
\frac{q_{\mathrm{dec}}(M\mid\mathbf{y},\mathcal{M}, M')}{q_{\mathrm{inc}}(M'\mid\mathbf{y},\mathcal{M}, M)}\cdot
\frac{p(M'\mid\mathbf{y},\mathcal{M})}{p(M\mid\mathbf{y},\mathcal{M})}\right\}, \qquad (B\text{-}3)$$
and similarly if decreasing order is chosen.

Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.


APPENDIX C
WFM SIMULATION DETAILS

Briefly, the idea is to let $Z_{M_T}(X)\beta_{M_T} = (QR)\beta_{M_T} = Q\eta_{M_T}$ (i.e., $\beta_{M_T} = R^{-1}\eta_{M_T}$), using the QR decomposition. As such, setting all values in $\eta_{M_T}$ proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation $y_i$ is $\mathrm{var}(y_i) = \mathrm{var}(E[y_i\mid \mathbf{z}_i]) + E[\mathrm{var}(y_i\mid \mathbf{z}_i)]$, where $\mathbf{z}_i$ is the $i$-th row of the design matrix $Z_{M_T}$. Hence we take the signal-to-noise ratio for each observation to be
$$\mathrm{SNR}(\eta) = \frac{\eta_{M_T}'\,R^{-T}\Sigma_z R^{-1}\,\eta_{M_T}}{\sigma^2},$$
where $\Sigma_z = \mathrm{var}(\mathbf{z}_i)$. We determine how the signal is distributed across predictors up to a proportionality constant, so as to be able to control the signal-to-noise ratio simultaneously.
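A minimal R sketch of this calibration is given below; Z_MT is assumed to hold the polynomial terms of the true model, eta0 is the chosen direction for the signal, and the function name is illustrative.

## Sphering-based calibration of the true coefficients to a target SNR
calibrate_beta <- function(Z_MT, eta0, snr, sigma2 = 1) {
  R       <- qr.R(qr(Z_MT))                      # Z_MT = QR, so beta = R^{-1} eta
  Sigma_z <- cov(Z_MT)                           # sample covariance of the rows z_i
  Rinv    <- solve(R)
  signal  <- drop(t(eta0) %*% t(Rinv) %*% Sigma_z %*% Rinv %*% eta0)
  eta     <- eta0 * sqrt(snr * sigma2 / signal)  # rescale eta to hit the target SNR
  drop(Rinv %*% eta)                             # coefficients on the original scale
}

## Example: equal signal in every direction, SNR = 1
Z    <- cbind(x1 = rnorm(130), x2 = rnorm(130))
Z    <- cbind(Z, "x1^2" = Z[, 1]^2)
beta <- calibrate_beta(Z, eta0 = rep(1, 3), snr = 1)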

Additionally, to investigate the ability of the model to capture the hierarchical structure correctly, we specify four different 0-1 vectors that determine the predictors in MT, which generate the data in the different scenarios.

Table C-1. Experimental conditions, WFM simulations.

Parameter            Values considered
SNR(η_MT) = k        0.25, 1, 4
η_MT ∝               (1, 1_3, 1_4, 1_2);  (1, 1_3, (1/2)1_4, (1/4)1_2);  (1, (1/4)1_3, (1/2)1_4, 1_2)
γ_MT                 (1, 1_3, 1_4, 1_2);  (1, 1_3, 1_4, 0_2);  (1, 1_3, 0_4, 1_2);  (1, 0_3, (0,1,1,0), 1_2)
n                    130, 260, 1040
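For reference, the full grid of scenarios can be enumerated directly; the labels below are illustrative, and the count matches the 108 scenarios mentioned in Section 4.5.

## Enumerating the simulation scenarios of Table C-1
scenarios <- expand.grid(snr    = c(0.25, 1, 4),
                         signal = c("equal", "low-order heavy", "high-order heavy"),
                         truth  = c("MT1", "MT2", "MT3", "MT4"),
                         n      = c(130, 260, 1040))
nrow(scenarios)   # 3 x 3 x 4 x 3 = 108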

The results presented below are somewhat different from those found in the main body of the article in Section 5. These are extracted by averaging the number of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios, for the 20 highest probability models.


SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and the HOP(1, ch), with this effect more notorious when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from being impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP. The fact that the HOP(1, ch) has a strong protection against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either having a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, the HOP(1, ch) provides a strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides a stronger control on the number of FPs included when considering small sample sizes combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.


Figure C-1. SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these values are denoted by β = c(1_o1, 0.5_o2, 0.25_o3), β = c(1_o1, 1_o2, 1_o3), and β = c(0.25_o1, 0.5_o2, 1_o3), respectively.

Observe that the number of FPs is invulnerable to how the SNR is distributed across predictors using the HOP(1, ch); conversely, when using the EPP the number of FPs decreases as the SNR grows, always being slightly higher than those obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β = c(1_o1, 0.5_o2, 0.25_o3) or β = c(1_o1, 1_o2, 1_o3)) the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms the algorithm yields slightly worse models at any SNR level. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

Special Points on the Scale

In Nelder (1998) the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in this context in practice. In addition, the author states that considering well-formulated models only does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified due to the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model MT from Figure 4-6, but have, respectively, zero-valued coefficients for all the order-one terms, for all the order-two terms, and for x1² and x2x5.

Figure C-2. SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

As seen before, in comparison to the EPP the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar between the four models in terms of both the TP and FP. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order terms. These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing, but with high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for x1² and x2x5, and is more visible for models without order-two terms. This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x1x2x5, which depends on the inclusion of three order-two terms (x1x2, x1x5, x2x5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single covariate or at most on a few covariates; a true data generating mechanism that removes all terms of a given order is clearly not justified in the context of polynomial models, and this was only done here for comparison purposes.

Figure C-3. SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.


APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang et al. (2008); these are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset.

Name    Description
ozone   Daily maximum one-hour-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The marginal posterior inclusion probability corresponds to the probability of including a given term of the full model MF after summing over all models in the model space. For each node α ∈ MF this probability is given by $p_\alpha = \sum_{M\in\mathcal{M}} I(\alpha\in M)\,p(M\mid\mathbf{y},\mathcal{M})$. In problems with a large model space, such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible; thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space M.

Given that there are in total 44 potential predictors, for convenience Tables D-2 to D-5 below only display the marginal posterior probabilities for the terms included under at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, hyper-g(1,1), and hyper-g(2,1)).


Table D-2. Marginal inclusion probabilities, intrinsic prior.

            EPP    HIP    HUP    HOP
hum         0.99   0.69   0.85   0.76
dpg         0.85   0.48   0.52   0.53
ibt         0.99   1.00   1.00   1.00
hum^2       0.76   0.51   0.43   0.62
hum*dpg     0.55   0.02   0.03   0.17
hum*ibt     0.98   0.69   0.84   0.75
dpg^2       0.72   0.36   0.25   0.46
ibt^2       0.59   0.78   0.57   0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior.

            EPP    HIP    HUP    HOP
hum         0.76   0.67   0.80   0.69
dpg         0.89   0.50   0.55   0.58
ibt         0.99   1.00   1.00   1.00
hum^2       0.57   0.49   0.40   0.57
hum*ibt     0.72   0.66   0.78   0.68
dpg^2       0.81   0.38   0.31   0.51
ibt^2       0.54   0.76   0.55   0.77

Table D-4. Marginal inclusion probabilities, hyper-g(1,1).

            EPP    HIP    HUP    HOP
vh          0.54   0.05   0.10   0.11
hum         0.81   0.67   0.80   0.69
dpg         0.90   0.50   0.55   0.58
ibt         0.99   1.00   0.99   0.99
hum^2       0.61   0.49   0.40   0.57
hum*ibt     0.78   0.66   0.78   0.68
dpg^2       0.83   0.38   0.30   0.51
ibt^2       0.49   0.76   0.54   0.77

Table D-5. Marginal inclusion probabilities, hyper-g(2,1).

            EPP    HIP    HUP    HOP
hum         0.79   0.64   0.73   0.67
dpg         0.90   0.52   0.60   0.59
ibt         0.99   1.00   0.99   1.00
hum^2       0.60   0.47   0.37   0.55
hum*ibt     0.76   0.64   0.71   0.67
dpg^2       0.82   0.41   0.36   0.52
ibt^2       0.47   0.73   0.49   0.75


REFERENCES

Akaike H (1983) Information measures and model selection Bull Int Statist Inst 50277ndash290

Albert J H amp Chib S (1993) Bayesian-analysis of binary and polychotomousresponse data Journal of the American Statistical Association 88(422) 669ndash679

Berger J amp Bernardo J (1992) On the development of reference priors BayesianStatistics 4 (pp 35ndash60)

URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf

Berger J amp Pericchi L (1996) The intrinsic Bayes factor for model selection andprediction Journal of the American Statistical Association 91(433) 109ndash122

URL httpamstattandfonlinecomdoiabs10108001621459199610476668

Berger J Pericchi L amp Ghosh J (2001) Objective Bayesian methods for modelselection introduction and comparison In Model selection vol 38 of IMS LectureNotes Monogr Ser (pp 135ndash207) Inst Math Statist

URL httpwwwjstororgstable1023074356165

Besag J York J amp Mollie A (1991) Bayesian Image-Restoration with 2 Applicationsin Spatial Statistics Annals of the Institute of Statistical Mathematics 43 1ndash20

Bien J Taylor J amp Tibshirani R (2013) A lasso for hierarchical interactions TheAnnals of Statistics 41(3) 1111ndash1141

URL httpprojecteuclidorgeuclidaos1371150895

Breiman L amp Friedman J (1985) Estimating optimal transformations for multipleregression and correlation Journal of the American Statistical Association 80580ndash598

Brusco M J Steinley D amp Cradit J D (2009) An exact algorithm for hierarchicallywell-formulated subsets in second-order polynomial regression Technometrics 51(3)306ndash315

Casella G Giron F J Martınez M L amp Moreno E (2009) Consistency of Bayesianprocedures for variable selection The Annals of Statistics 37 (3) 1207ndash1228

URL httpprojecteuclidorgeuclidaos1239369020

Casella G Moreno E amp Giron F (2014) Cluster Analysis Model Selection and PriorDistributions on Models Bayesian Analysis TBA(TBA) 1ndash46

URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf

133

Chipman H (1996) Bayesian variable selection with related predictors CanadianJournal of Statistics 24(1) 17ndash36

URL httponlinelibrarywileycomdoi1023073315687abstract

Clyde M amp George E I (2004) Model Uncertainty Statistical Science 19(1) 81ndash94

URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274

Dewey J (1958) Experience and nature New York Dover Publications

Dorazio R M amp Taylor-Rodrıguez D (2012) A Gibbs sampler for Bayesian analysis ofsite-occupancy data Methods in Ecology and Evolution 3 1093ndash1098

Ellison A M (2004) Bayesian inference in ecology Ecology Letters 7 509ndash520

Fiske I amp Chandler R (2011) unmarked An R package for fitting hierarchical modelsof wildlife occurrence and abundance Journal of Statistical Software 43(10)

URL httpcorekmiopenacukdownloadpdf5701760pdf

George E (2000) The variable selection problem Journal of the American StatisticalAssociation 95(452) 1304ndash1308

URL httpwwwtandfonlinecomdoiabs10108001621459200010474336

Giron F J Moreno E Casella G amp Martınez M L (2010) Consistency of objectiveBayes factors for nonnested linear models and increasing model dimension Revistade la Real Academia de Ciencias Exactas Fisicas y Naturales Serie A Matematicas104(1) 57ndash67

URL httpwwwspringerlinkcomindex105052RACSAM201006

Good I J (1950) Probability and the Weighing of Evidence New York Haffner

Griepentrog G L Ryan J M amp Smith L D (1982) Linear transformations ofpolynomial regression-models American Statistician 36(3) 171ndash174

Gunel E amp Dickey J (1974) Bayes factors for independence in contingency tablesBiometrika 61 545ndash557

Hanski I (1994) A Practical Model of Metapopulation Dynamics Journal of AnimalEcology 63 151ndash162

Hooten M (2006) Hierarchical spatio-temporal models for ecological processesDoctoral dissertation University of Missouri-Columbia

URL httpsmospacelibraryumsystemeduxmluihandle103554500

Hooten M B amp Hobbs N T (2014) A Guide to Bayesian Model Selection forEcologists Ecological Monographs (In Press)

134

Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75, 139–159.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
URL httpbiometoxfordjournalsorgcontent762297abstract

Jeffreys, H. (1935). Some tests of significance treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). London: Oxford University Press.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.
URL httpwwwesajournalsorgdoiabs10189012-05641mi=3eywlhampaf=RampsearchText=human+population

Kass, R., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).
URL httpamstattandfonlinecomdoiabs10108001621459199510476592

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
URL httpwwwtandfonlinecomdoiabs10108001621459199510476572

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.
URL httpwwwjstororgstable2291752origin=crossref

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: A Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses (1st ed.). Academic Press.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.

Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–365.
URL httpwwwncbinlmnihgovpubmed22162041

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.
URL httpwwwtandfonlinecomdoiabs101198016214507000001337

Link, W., & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.
URL httpbooksgooglecombookshl=enamplr=ampid=hecon2l2QPcCampoi=fndamppg=PP2ampdq=Bayesian+Inference+with+ecological+applicationsampots=S82_0pxrNmampsig=L3xbsSQcKD8FV6rxCMp2pmP2JKk

MacKenzie, D., & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.
URL httpcrsitbacidmediajurnalrefslandscapemackenzie2004zhpdf

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.
URL httpwwwesajournalsorgdoiabs10189002-3090

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M., & Mazerolle, M. (2013). Package 'AICcmodavg'.
URL ftpheanetarchivegnewsenseorgdisk1CRANwebpackagesAICcmodavgAICcmodavgpdf

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.

Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.
URL httpprojecteuclidorgeuclidaos1278861238

Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: how strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & Mackenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.
URL httpwwwesajournalsorgdoipdf10189006-1474

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–2521.
URL httpwwwncbinlmnihgovpubmed20957941

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
URL httpdxdoiorg101080016214592013829001

Rao, C. R., & Wu, Y. (2001). On model selection. Volume 38 of Lecture Notes–Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
URL httpdxdoiorg101214lnms1215540960

Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W., & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.
URL httpbooksgooglecombooksid=dr9cPgAACAAJ

Rigler, F., & Peters, R. (1995). Excellence in Ecology: Science and Limnology. Ecology Institute, Germany.
URL httportoncatieaccrcgi-binwxisexeIsisScript=CIENLxisampmethod=postampformato=2ampcantidad=1ampexpresion=mfn=008268

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.
URL httpswwwnewtonacukpreprintsNI08021pdf

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A., & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–1823.
URL httpwwwncbinlmnihgovpubmed17645027

Scott, J., & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. The Annals of Statistics.
URL httpprojecteuclidorgeuclidaos1278861454

Spiegelhalter, D. J., & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44, 377–387.

Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.
URL httpwwwesajournalsorgdoiabs10189002-5078

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.
URL httpwwwncbinlmnihgovpubmed10733859

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.
URL httpwwwncbinlmnihgovpmcarticlesPMC3004292

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association, (June).
URL httpwwwtandfonlinecomdoiabs101080016214592014880348

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.
URL httpprojecteuclidorgeuclidaoas1267453962

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: A case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A., & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
URL httpwwwspringerlinkcomindex5300770UP12246M9pdf

BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a B.S. degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a Ph.D. in interdisciplinary ecology with concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has currently accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.


3.5.2 Summary Statistics for the Highest Posterior Probability Model
3.6 Case Study: Blue Hawker Data Analysis
3.6.1 Results: Variable Selection Procedure
3.6.2 Validation for the Selection Procedure
3.7 Discussion

4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
4.1 Introduction
4.2 Setup for Well-Formulated Models
4.2.1 Well-Formulated Model Spaces
4.3 Priors on the Model Space
4.3.1 Model Prior Definition
4.3.2 Choice of Prior Structure and Hyper-Parameters
4.3.3 Posterior Sensitivity to the Choice of Prior
4.4 Random Walks on the Model Space
4.4.1 Simple Pruning and Growing
4.4.2 Degree Based Pruning and Growing
4.5 Simulation Study
4.5.1 SNR and Sample Size Effect
4.5.2 Coefficient Magnitude
4.5.3 Special Points on the Scale
4.6 Case Study: Ozone Data Analysis
4.7 Discussion

5 CONCLUSIONS

APPENDIX
A FULL CONDITIONAL DENSITIES: DYMOSS
B RANDOM WALK ALGORITHMS
C WFM SIMULATION DETAILS
D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 Interpretation of BFji when contrasting Mj and Mi

3-1 Simulation control parameters, occupancy model selector

3-2 Comparison of average minOddsMPIP under scenarios having different numbers of sites (N=50, N=100) and under scenarios having different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors

3-3 Comparison of average minOddsMPIP for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors

3-4 Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-6 Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-7 Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space

3-8 Posterior probability for the five highest probability models in the presence component of the blue hawker data

3-9 Posterior probability for the five highest probability models in the detection component of the blue hawker data

3-10 MPIP, presence component

3-11 MPIP, detection component

3-12 Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors

4-1 Characterization of the full models MF and corresponding model spaces M considered in simulations

4-2 Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors in a full quadratic model, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-3 Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-4 Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP)

4-5 Variables used in the analyses of the ozone contamination dataset

4-6 Median probability models (MPM) from different combinations of parameter and model priors vs. model selected using the hierarchical lasso

C-1 Experimental conditions, WFM simulations

D-1 Variables used in the analyses of the ozone contamination dataset

D-2 Marginal inclusion probabilities, intrinsic prior

D-3 Marginal inclusion probabilities, Zellner-Siow prior

D-4 Marginal inclusion probabilities, Hyper-g11

D-5 Marginal inclusion probabilities, Hyper-g21

LIST OF FIGURES

2-1 Graphical representation, occupancy model

2-2 Graphical representation, occupancy model after data augmentation

2-3 Graphical representation, multiseason model for a single site

2-4 Graphical representation, data-augmented multiseason model

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-3 Predictor MPIP averaged over scenarios with the interaction between the number of sites and the surveys per site, using uniform (U) and multiplicity correction (MC) priors

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors

3-5 Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors

4-1 Graphs of well-formulated polynomial models for p = 2

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects, for model M = {1, x1, x1^2}

4-3 Graphical representation of assumptions on M defined by the quadratic surface in two main effects

4-4 Prior probabilities for the space of well-formulated models associated with the quadratic surface on two variables, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}

4-5 Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}

4-6 MT: DAG of the largest true model used in simulations

4-7 Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model, with EPP and HOP(1, ch)

C-1 SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

C-2 SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

C-3 SNR vs. different true models MT: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology

The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure, and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection, based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.

CHAPTER 1
GENERAL INTRODUCTION

As with any other branch of science ecology strives to grasp truths about the

world that surrounds us and in particular about nature The objective truth sought

by ecology may well be beyond our grasp however it is reasonable to think that at

least partially ldquoNature is capable of being understoodrdquo (Dewey 1958) We can observe

and interpret nature to formulate hypotheses which can then be tested against reality

Hypotheses that encounter no or little opposition when confronted with reality may

become contextual versions of the truth and may be generalized by scaling them

spatially and/or temporally, accordingly, to delimit the bounds within which they are valid.

To formulate hypotheses accurately and in a fashion amenable to scientific inquiry

not only the point of view and assumptions considered must be made explicit but

also the object of interest the properties worthy of consideration of that object and

the methods used in studying such properties (Reiners & Lockwood 2009; Rigler &
Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that
determine the distribution and abundance of organisms". This characterizes organisms

and their interactions as the objects of interest to ecology and prescribes distribution

and abundance as a relevant property of these organisms

With regards to the methods used to acquire ecological scientific knowledge

traditionally theoretical mathematical models (such as deterministic PDEs) have been

used However naturally varying systems are imprecisely observed and as such are

subject to multiple sources of uncertainty that must be explicitly accounted for Because

of this the ecological scientific community is developing a growing interest in flexible

and powerful statistical methods and among these Bayesian hierarchical models

predominate These methods rely on empirical observations and can accommodate

fairly complex relationships between empirical observations and theoretical process

models while accounting for diverse sources of uncertainty (Hooten 2006)


Bayesian approaches are now used extensively in ecological modeling however

there are two issues of concern one from the standpoint of ecological practitioners

and another from the perspective of scientific ecological endeavors First Bayesian

modeling tools require a considerable understanding of probability and statistical theory

leading practitioners to view them as black box approaches (Kery 2010) Second

although Bayesian applications proliferate in the literature in general there is a lack of

awareness of the distinction between approaches specifically devised for testing and

those for estimation (Ellison 2004) Furthermore there is a dangerous unfamiliarity with

the proven risks of using tools designed for estimation in testing procedures (Berger &
Pericchi 1996; Berger et al. 2001; Kass & Raftery 1995; Moreno et al. 1998; Robert
et al. 2009; Robert 1993) (e.g., the use of flat priors in hypothesis testing).

Occupancy models have played a leading role during the past decade in large

biological population surveys The flexibility of the occupancy framework has allowed

the development of useful extensions to determine several key population parameters

which provide robust notions of the distribution structure and dynamics of a population

In order to address some of the concerns stated in previous paragraph we concentrate

in the occupancy framework to develop estimation and testing tools that will allow

ecologists first to gain insight about the estimation procedure and second to conduct

statistically sound model selection for site-occupancy data

1.1 Occupancy Modeling

Since MacKenzie et al (2002) and Tyre et al (2003) introduced the site-occupancy

framework countless applications and extensions of the method have been developed

in the ecological literature as evidenced by the 438000 hits on Google Scholar for

a search of "occupancy model". This class of models acknowledges that techniques

used to conduct biological population surveys are prone to detection errors ndashif an

individual is detected it must be present while if it is not detected it might or might

not be Occupancy models improve upon traditional binary regression by accounting


for observed detection and partially observed presence as two separate but related

components In the site occupancy setting the chosen locations are surveyed

repeatedly in order to reduce the ambiguity caused by the observed zeros This

approach therefore allows probabilities of both presence (occurrence) and detection

to be estimated

The uses of site-occupancy models are many For example metapopulation

and island biogeography models are often parameterized in terms of site (or patch)

occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and

occupancy may be used as a surrogate for abundance to answer questions regarding

geographic distribution range size and metapopulation dynamics (MacKenzie et al

2004; Royle & Kéry 2007).

The basic occupancy framework which assumes a single closed population with

fixed probabilities through time has proven to be quite useful however it might be of

limited utility when addressing some problems In particular assumptions for the basic

model may become too restrictive or unrealistic whenever the study period extends

throughout multiple years or seasons especially given the increasingly changing

environmental conditions that most ecosystems are currently experiencing

Among the extensions found in the literature one that we consider particularly

relevant incorporates heterogenous occupancy probabilities through time Models

that incorporate temporally varying probabilities stem from important meta-population

notions provided by Hanski (1994) such as occupancy probabilities depending on local

colonization and local extinction processes In spite of the conceptual usefulness of

Hanskirsquos model several strong and untenable assumptions (eg all patches being

homogenous in quality) are required for it to provide practically meaningful results

A more viable alternative which builds on Hanski (1994) is an extension of

the single season occupancy model of MacKenzie et al (2003) In this model the

heterogeneity of occupancy probabilities across seasons arises from local colonization


and extinction processes This model is flexible enough to let detection occurrence

extinction and colonization probabilities to each depend upon its own set of covariates

Model parameters are obtained through likelihood-based estimation

Using a maximum likelihood approach presents two drawbacks First the

uncertainty assessment for maximum likelihood parameter estimates relies on

asymptotic results which are obtained from implementation of the delta method

making it sensitive to sample size Second to obtain parameter estimates the latent

process (occupancy) is marginalized out of the likelihood leading to the usual zero

inflated Bernoulli model Although this is a convenient strategy for solving the estimation

problem after integrating the latent state variables (occupancy indicators) they are

no longer available Therefore finite sample estimates cannot be calculated directly

Instead a supplementary parametric bootstrapping step is necessary Further

additional structure such as temporal or spatial variation cannot be introduced by

means of random effects (Royle amp Kery 2007)

1.2 A Primer on Objective Bayesian Testing

With the advent of high dimensional data such as that found in modern problems

in ecology genetics physics etc coupled with evolving computing capability objective

Bayesian inferential methods have gained increasing popularity This however is by no

means a new approach in the way Bayesian inference is conducted In fact starting with

Bayes and Laplace and continuing for almost 200 years Bayesian analysis was primarily

based on "noninformative" priors (Berger & Bernardo 1992).

Now subjective elicitation of prior probabilities in Bayesian analysis is widely

recognized as the ideal (Berger et al 2001) however it is often the case that the

available information is insufficient to specify appropriate prior probabilistic statements

Commonly as in model selection problems where large model spaces have to be

explored the number of model parameters is prohibitively large preventing one from

eliciting prior information for the entire parameter space As a consequence in practice


the determination of priors through the definition of structural rules has become the

alternative to subjective elicitation for a variety of problems in Bayesian testing Priors

arising from these rules are known in the literature as noninformative objective default

or reference Many of these connotations generate controversy and are accused

perhaps rightly of providing a false pretension of objectivity Nevertheless we will avoid

that discussion and refer to them herein exchangeably as noninformative or objective

priors to convey the sense that no attempt to introduce an informed opinion is made in

defining prior probabilities

A plethora of "noninformative" methods has been developed in the past few
decades (see Berger & Bernardo (1992); Berger & Pericchi (1996); Berger et al. (2001);
Clyde & George (2004); Kass & Wasserman (1995, 1996); Liang et al. (2008); Moreno
et al. (1998); Spiegelhalter & Smith (1982); Wasserman (2000), and the references

therein) We find particularly interesting those derived from the model structure in which

no tuning parameters are required especially since these can be regarded as automatic

methods Among them methods based on the Bayes factor for Intrinsic Priors have

proven their worth in a variety of inferential problems given their excellent performance

flexibility and ease of use This class of priors is discussed in detail in chapter 3 For

now some basic notation and notions of Bayesian inferential procedures are introduced

Hypothesis testing and the Bayes factor

Bayesian model selection techniques that aim to find the true model as opposed

to searching for the model that best predicts the data are fundamentally extensions to

Bayesian hypothesis testing strategies In general this Bayesian approach to hypothesis

testing and model selection relies on determining the amount of evidence found in favor

of one hypothesis (or model) over the other given an observed set of data Approached

from a Bayesian standpoint this type of problem can be formulated in great generality

using a natural well defined probabilistic framework that incorporates both model and

parameter uncertainty


Jeffreys (1935) first developed the Bayesian strategy to hypothesis testing and

consequently to the model selection problem Bayesian model selection within

a model space M = (M1, M2, ..., MJ), where each model is associated with a

parameter θj which may be a vector of parameters itself incorporates three types

of probability distributions (1) a prior probability distribution for each model π(Mj)

(2) a prior probability distribution for the parameters in each model π(θj |Mj) and (3)

the distribution of the data conditional on both the model and the modelrsquos parameters

f(x | θj, Mj). These three probability densities induce the joint distribution p(x, θj, Mj) =
f(x | θj, Mj) · π(θj | Mj) · π(Mj), which is instrumental in producing model posterior

probabilities The model posterior probability is the probability that a model is true given

the data. It is obtained by marginalizing over the parameter space and using Bayes' rule:

$$p(M_j \mid \mathbf{x}) = \frac{m(\mathbf{x} \mid M_j)\,\pi(M_j)}{\sum_{i=1}^{J} m(\mathbf{x} \mid M_i)\,\pi(M_i)}, \qquad (1\text{--}1)$$

where $m(\mathbf{x} \mid M_j) = \int f(\mathbf{x} \mid \theta_j, M_j)\,\pi(\theta_j \mid M_j)\,d\theta_j$ is the marginal likelihood of $M_j$.

Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:

$$\frac{p(M_j \mid \mathbf{x})}{p(M_k \mid \mathbf{x})} = \frac{m(\mathbf{x} \mid M_j)}{m(\mathbf{x} \mid M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}. \qquad (1\text{--}2)$$

The first term on the right-hand side of (1–2), $m(\mathbf{x} \mid M_j)/m(\mathbf{x} \mid M_k)$, is known as the Bayes factor comparing model $M_j$ to model $M_k$, and it is denoted by $BF_{jk}(\mathbf{x})$. The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, given by $\pi(M_j)/\pi(M_k)$, to produce the posterior odds.

Note that the model posterior probability in (1–1) can be expressed as a function of Bayes factors. To illustrate, let model $M_* \in \mathcal{M}$ be a reference model, and suppose all other models in $\mathcal{M}$ are compared to this reference model. Then, dividing both the numerator

and denominator in (1–1) by $m(\mathbf{x} \mid M_*)\,\pi(M_*)$ yields

$$p(M_j \mid \mathbf{x}) = \frac{BF_{j*}(\mathbf{x})\,\dfrac{\pi(M_j)}{\pi(M_*)}}{1 + \sum_{M_i \in \mathcal{M},\, M_i \neq M_*} BF_{i*}(\mathbf{x})\,\dfrac{\pi(M_i)}{\pi(M_*)}}. \qquad (1\text{--}3)$$

Therefore, as the Bayes factor increases, the posterior probability of model $M_j$ given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.
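As a small numerical illustration of (1–3), the sketch below converts a set of Bayes factors against a reference model into posterior model probabilities. The Bayes factor values and the uniform model prior are made-up inputs for illustration only; they are not taken from the dissertation.

```python
import numpy as np

# Hypothetical Bayes factors BF_{i*} of four models against a reference
# model M_* (the reference has BF = 1 against itself), with equal priors.
bf_vs_ref = np.array([1.0, 3.5, 20.0, 0.4])   # illustrative values only
prior = np.full(4, 1.0 / 4)                   # pi(M_i), here uniform

# Equation (1-3): since BF_{**} = 1, normalizing the weights over all models
# is equivalent to the 1 + sum over models different from M_*.
weights = bf_vs_ref * (prior / prior[0])      # BF_{i*} * pi(M_i)/pi(M_*)
post_prob = weights / weights.sum()

for i, p in enumerate(post_prob):
    print(f"p(M_{i} | x) = {p:.3f}")
```

With equal priors the ranking of posterior probabilities simply follows the ranking of the Bayes factors, as stated above.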

Assuming equal model prior probabilities in (1–3), the prior odds are set equal to one, and the model posterior odds in (1–2) become $p(M_j \mid \mathbf{x})/p(M_k \mid \mathbf{x}) = BF_{jk}(\mathbf{x})$. Based on the Bayes factors, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).

Table 1-1. Interpretation of BFji when contrasting Mj and Mi

ln BFji    BFji        Evidence in favor of Mj    P(Mj|x)
0 to 2     1 to 3      Weak evidence              0.5-0.75
2 to 6     3 to 20     Positive evidence          0.75-0.95
6 to 10    20 to 150   Strong evidence            0.95-0.99
>10        >150        Very strong evidence       >0.99

Bayesian hypothesis testing and model selection procedures through Bayes factors

and posterior probabilities have several desirable features First these methods have a

straight forward interpretation since the Bayes factor is an increasing function of model

(or hypothesis) posterior probabilities Second these methods can yield frequentist

matching confidence bounds when implemented with good testing priors (Kass amp

Wasserman 1996) such as the reference priors of Berger amp Bernardo (1992) Third

since the Bayes factor contains the ratio of marginal densities it automatically penalizes

complexity according to the number of parameters in each model this property is

known as Ockhamrsquos razor (Kass amp Raftery 1995) Four the use of Bayes factors does


not require having nested hypotheses (i.e., having the null hypothesis nested in the
alternative), standard distributions, or regular asymptotics (e.g., convergence to normal
or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case

with frequentist and likelihood ratio tests which depend on known distributions (at least

asymptotically) for the test statistic to perform the test Finally Bayesian hypothesis

testing procedures using the Bayes factor can naturally incorporate model uncertainty by

using the Bayesian machinery for model averaged predictions and confidence bounds

(Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a

fully frequentist approach

1.3 Overview of the Chapters

In the chapters that follow we develop a flexible and straightforward hierarchical

Bayesian framework for occupancy models allowing us to obtain estimates and conduct

robust testing from an "objective" Bayesian perspective. Latent mixtures of random

variables supply a foundation for our methodology This approach provides a means to

directly incorporate spatial dependency and temporal heterogeneity through predictors

that characterize either habitat quality of a given site or detectability features of a

particular survey conducted in a specific site On the other hand the Bayesian testing

methods we propose are (1) a fully automatic and objective method for occupancy

model selection and (2) an objective Bayesian testing tool that accounts for multiple

testing and for polynomial hierarchical structure in the space of predictors

Chapter 2 introduces the methods proposed for estimation of occupancy model

parameters A simple estimation procedure for the single season occupancy model

with covariates is formulated using both probit and logit links Based on the simple

version an extension is provided to cope with metapopulation dynamics by introducing

persistence and colonization processes Finally given the fundamental role that spatial

dependence plays in defining temporal dynamics a strategy to seamlessly account for

this feature in our framework is introduced


Chapter 3 develops a new fully automatic and objective method for occupancy

model selection that is asymptotically consistent for variable selection and averts the

use of tuning parameters In this Chapter first some issues surrounding multimodel

inference are described and insight about objective Bayesian inferential procedures is

provided. Then, building on modern methods for "objective" Bayesian testing to generate

priors on the parameter space the intrinsic priors for the parameters of the occupancy

model are obtained These are used in the construction of a variable selection algorithm

for "objective" variable selection tailored to the occupancy model framework.

Chapter 4 touches on two important and interconnected issues when conducting

model testing that have yet to receive the attention they deserve (1) controlling for false

discovery in hypothesis testing given the size of the model space ie given the number

of tests performed and (2) non-invariance to location transformations of the variable

selection procedures in the face of polynomial predictor structure These elements both

depend on the definition of prior probabilities on the model space In this chapter a set

of priors on the model space and a stochastic search algorithm are proposed Together

these control for model multiplicity and account for the polynomial structure among the

predictors


CHAPTER 2
MODEL ESTIMATION METHODS

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches

2.1 Introduction

Prior to the introduction of site-occupancy models (MacKenzie et al 2002 Tyre

et al 2003) presence-absence data from ecological monitoring programs were used

without any adjustment to assess the impact of management actions to observe trends

in species distribution through space and time or to model the habitat of a species (Tyre

et al 2003) These efforts however were suspect due to false-negative errors not

being accounted for False-negative errors occur whenever a species is present at a site

but goes undetected during the survey

Site-occupancy models developed independently by MacKenzie et al (2002)

and Tyre et al (2003) extend simple binary-regression models to account for the

aforementioned errors in detection of individuals common in surveys of animal or plant

populations Since their introduction the site-occupancy framework has been used in

countless applications and numerous extensions for it have been proposed Occupancy

models improve upon traditional binary regression by analyzing observed detection

and partially observed presence as two separate but related components In the site

occupancy setting the chosen locations are surveyed repeatedly in order to reduce the

ambiguity caused by the observed zeros This approach therefore allows simultaneous

estimation of the probabilities of presence (occurrence) and detection

Several extensions to the basic single-season closed population model are

now available The occupancy approach has been used to determine species range

dynamics (MacKenzie et al 2003 Royle amp Kery 2007) and to understand agestage


structure within populations (Nichols et al. 2007), and model species co-occurrence
(MacKenzie et al. 2004; Ovaskainen et al. 2010; Waddle et al. 2010). It has even been
suggested as a surrogate for abundance (MacKenzie & Nichols 2004). MacKenzie et al.

suggested using occupancy models to conduct large-scale monitoring programs since

this approach avoids the high costs associated with surveys designed for abundance

estimation Also to investigate metapopulation dynamics occupancy models improve

upon incidence function models (Hanski 1994) which are often parameterized in terms

of site (or patch) occupancy and assume homogenous patches and a metapopulation

that is at a colonization-extinction equilibrium

Nevertheless the implementation of Bayesian occupancy models commonly resorts

to sampling strategies dependent on hyper-parameters subjective prior elicitation

and relatively elaborate algorithms From the standpoint of practitioners these are

often treated as black-box methods (Kery 2010) As such the potential of using the

methodology incorrectly is high Commonly these procedures are fitted with packages

such as BUGS or JAGS. Although the package's ease of use has led to a wide-spread

adoption of the methods the user may be oblivious as to the assumptions underpinning

the analysis

We believe providing straightforward and robust alternatives to implement these

methods will help practitioners gain insight about how occupancy modeling and more

generally Bayesian modeling is performed In this Chapter using a simple Gibbs

sampling approach first we develop a versatile method to estimate the single season

closed population site-occupancy model then extend it to analyze metapopulation

dynamics through time and finally provide a further adaptation to incorporate spatial

dependence among neighboring sites.

2.1.1 The Occupancy Model

In this section of the document we first introduce our results published in Dorazio
& Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. For


the standard sampling protocol for collecting site-occupancy data J gt 1 independent

surveys are conducted at each of N representative sample locations (sites) noting

whether a species is detected or not detected during each survey Let yij denote a binary

random variable that indicates detection (y = 1) or non-detection (y = 0) during the

j th survey of site i Without loss of generality J may be assumed constant among all N

sites to simplify description of the model In practice however site-specific variation in

J poses no real difficulties and is easily implemented This sampling protocol therefore

yields an N × J matrix Y of detection/non-detection data.
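To make the sampling protocol concrete, the short sketch below simulates such a detection matrix Y under hypothetical occupancy and detection probabilities. The values of N, J, ψ, and p are arbitrary illustrations, not estimates from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 100, 5          # sites and surveys per site (hypothetical)
psi, p = 0.6, 0.4      # illustrative occupancy and detection probabilities

z = rng.binomial(1, psi, size=N)                    # latent presence at each site
Y = rng.binomial(1, p, size=(N, J)) * z[:, None]    # detections only where present

print("naive (observed) occupancy:", (Y.sum(axis=1) > 0).mean())
print("true occupancy:", z.mean())
```

The gap between the naive and true occupancy rates printed above is exactly the false-negative problem that the repeated-survey design is meant to address.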

Note that the observed process $y_{ij}$ is an imperfect representation of the underlying occupancy or presence process. Hence, letting $z_i$ denote the presence indicator at site $i$, this model specification can be represented through the hierarchy

$$y_{ij} \mid z_i, \boldsymbol{\lambda} \sim \text{Bernoulli}(z_i\, p_{ij})$$
$$z_i \mid \boldsymbol{\alpha} \sim \text{Bernoulli}(\psi_i) \qquad (2\text{--}1)$$

where $p_{ij}$ is the probability of correctly classifying as occupied the $i$th site during the $j$th survey, and $\psi_i$ is the presence probability at the $i$th site. The graphical representation of this process is given in Figure 2-1.

[Figure 2-1. Graphical representation, occupancy model: $\psi_i \rightarrow z_i \rightarrow y_i \leftarrow p_i$]

Probabilities of detection and occupancy can both be made functions of covariates

and their corresponding parameter estimates can be obtained using either a maximum


likelihood or a Bayesian approach Existing methodologies from the likelihood

perspective marginalize over the latent occupancy process (zi ) making the estimation

procedure depend only on the detections Most Bayesian strategies rely on MCMC

algorithms that require parameter prior specification and tuning. However, Albert & Chib
(1993) proposed a longstanding strategy in the Bayesian statistical literature that models

binary outcomes using a simple Gibbs sampler This procedure which is described in

the following section can be extrapolated to the occupancy setting eliminating the need

for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models

Probit model: data augmentation with latent normal variables

At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support $(-\infty, 0]$, and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on $(0, \infty)$. To understand the reasoning behind this strategy, let $Y \sim \text{Bern}(\Phi(\mathbf{x}^T\boldsymbol{\beta}))$ and $V = \mathbf{x}^T\boldsymbol{\beta} + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$. In such a case, note that

$$\Pr(y = 1 \mid \mathbf{x}^T\boldsymbol{\beta}) = \Phi(\mathbf{x}^T\boldsymbol{\beta}) = \Pr(\varepsilon < \mathbf{x}^T\boldsymbol{\beta}) = \Pr(\varepsilon > -\mathbf{x}^T\boldsymbol{\beta}) = \Pr(v > 0 \mid \mathbf{x}^T\boldsymbol{\beta}).$$

Thus, whenever $y = 1$ then $v > 0$, and $v \le 0$ otherwise. In other words, we

may think of y as a truncated version of v Thus we can sample iteratively alternating

between the latent variables conditioned on model parameters and vice versa to draw

from the desired posterior densities By augmenting the data with the latent variables

we are able to obtain full conditional posterior distributions for model parameters that are

easy to draw from (Equation 2–3 below). Further, since we can sample the latent variables,
we may also sample the parameters.

Given some initial values for all model parameters values for the latent variables

can be simulated By conditioning on the latter it is then possible to draw samples


from the parameters' posterior distributions. These samples can be used to generate

new values for the latent variables etc The process is iterated using a Gibbs sampling

approach Generally after a large number iterations it yields draws from the joint

posterior distribution of the latent variables and the model parameters conditional on the

observed outcome values We formalize the procedure below

Assume that each outcome $Y_1, Y_2, \ldots, Y_n$ is such that $Y_i \mid \mathbf{x}_i, \boldsymbol{\beta} \sim \text{Bernoulli}(q_i)$, where $q_i = \Phi(\mathbf{x}_i^T\boldsymbol{\beta})$ is the standard normal CDF evaluated at $\mathbf{x}_i^T\boldsymbol{\beta}$, and where $\mathbf{x}_i$ and $\boldsymbol{\beta}$ are the $p$-dimensional vectors of observed covariates for the $i$th observation and their corresponding parameters, respectively.

Now, let $\mathbf{y} = (y_1, y_2, \ldots, y_n)$ be the vector of observed outcomes, and let $[\,\boldsymbol{\beta}\,]$ represent the prior distribution of the model parameters. Therefore, the posterior distribution of $\boldsymbol{\beta}$ is given by

$$[\,\boldsymbol{\beta} \mid \mathbf{y}\,] \propto [\,\boldsymbol{\beta}\,] \prod_{i=1}^{n} \Phi(\mathbf{x}_i^T\boldsymbol{\beta})^{y_i}\left(1 - \Phi(\mathbf{x}_i^T\boldsymbol{\beta})\right)^{1-y_i}, \qquad (2\text{--}2)$$

which is intractable. Nevertheless, introducing latent random variables $\mathbf{V} = (V_1, \ldots, V_n)$ such that $V_i \sim \mathcal{N}(\mathbf{x}_i^T\boldsymbol{\beta}, 1)$ resolves this difficulty by specifying that whenever $Y_i = 1$ then $V_i > 0$, and if $Y_i = 0$ then $V_i \le 0$. This yields

$$[\,\boldsymbol{\beta}, \mathbf{v} \mid \mathbf{y}\,] \propto [\,\boldsymbol{\beta}\,] \prod_{i=1}^{n} \phi(v_i \mid \mathbf{x}_i^T\boldsymbol{\beta}, 1)\left\{I_{v_i \le 0}\,I_{y_i=0} + I_{v_i > 0}\,I_{y_i=1}\right\}, \qquad (2\text{--}3)$$

where $\phi(x \mid \mu, \tau^2)$ is the probability density function of a normal random variable $x$ with mean $\mu$ and variance $\tau^2$. The data augmentation artifact works since $[\,\boldsymbol{\beta} \mid \mathbf{y}\,] = \int [\,\boldsymbol{\beta}, \mathbf{v} \mid \mathbf{y}\,]\,d\mathbf{v}$; hence, if we sample from the joint posterior (2–3) and extract only the sampled values for $\boldsymbol{\beta}$, they will correspond to samples from $[\,\boldsymbol{\beta} \mid \mathbf{y}\,]$.

From the expression above, it is possible to obtain the full conditional distributions for $\mathbf{V}$ and $\boldsymbol{\beta}$; thus, a Gibbs sampler can be proposed. For example, if we use a flat prior for $\boldsymbol{\beta}$ (i.e., $[\,\boldsymbol{\beta}\,] \propto 1$), the full conditionals are given by

$$\boldsymbol{\beta} \mid \mathbf{V}, \mathbf{y} \sim \text{MVN}_k\left((X^TX)^{-1}(X^T\mathbf{V}),\ (X^TX)^{-1}\right) \qquad (2\text{--}4)$$

$$\mathbf{V} \mid \boldsymbol{\beta}, \mathbf{y} \sim \prod_{i=1}^{n} \text{tr}\,\mathcal{N}(\mathbf{x}_i^T\boldsymbol{\beta},\, 1,\, Q_i) \qquad (2\text{--}5)$$

where $\text{MVN}_q(\boldsymbol{\mu}, \Sigma)$ represents a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\Sigma$, and $\text{tr}\,\mathcal{N}(\xi, \sigma^2, Q)$ stands for the truncated normal distribution with mean $\xi$, variance $\sigma^2$, and truncation region $Q$. For each $i = 1, 2, \ldots, n$, the support of the truncated variables is given by $Q = (-\infty, 0\,]$ if $y_i = 0$ and $Q = (0, \infty)$ otherwise. Note that conjugate normal priors could be used alternatively.

At iteration $m + 1$ the Gibbs sampler draws $\mathbf{V}^{(m+1)}$ conditional on $\boldsymbol{\beta}^{(m)}$ from (2–5), and then samples $\boldsymbol{\beta}^{(m+1)}$ conditional on $\mathbf{V}^{(m+1)}$ from (2–4). This process is repeated for $m = 0, 1, \ldots, n_{sim}$, where $n_{sim}$ is the number of iterations in the Gibbs sampler.
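A minimal Python sketch of this two-step sampler, under the flat-prior assumption stated above, might look as follows. The function name, data layout, and use of numpy/scipy are illustrative choices, not the dissertation's own implementation.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(y, X, n_iter=2000, seed=0):
    """Albert & Chib (1993)-style Gibbs sampler for probit regression with a
    flat prior on beta (a sketch for illustration, not production code)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)             # (X'X)^{-1}, reused every sweep
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for m in range(n_iter):
        mu = X @ beta
        # Full conditional (2-5): truncated normals, truncation set by y_i
        lower = np.where(y == 1, -mu, -np.inf)   # v_i > 0 when y_i = 1
        upper = np.where(y == 1, np.inf, -mu)    # v_i <= 0 when y_i = 0
        v = mu + truncnorm.rvs(lower, upper, size=n, random_state=rng)
        # Full conditional (2-4): multivariate normal draw for beta
        mean = XtX_inv @ (X.T @ v)
        beta = rng.multivariate_normal(mean, XtX_inv)
        draws[m] = beta
    return draws
```

For a design matrix X (n × p) and a 0/1 response vector y, `probit_gibbs(y, X)` returns posterior draws of β; an initial burn-in portion would normally be discarded before summarizing.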

Logit model: data augmentation with latent Pólya-gamma variables

Recently, Polson et al. (2013) developed a novel and efficient approach for Bayesian inference for logistic models using Pólya-gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Pólya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of the iid sequence of $\text{Exp}(1)$ random variables $\{E_k\}_{k=1}^{\infty}$, given by

$$\omega = \frac{2}{\pi^2} \sum_{k=1}^{\infty} \frac{E_k}{(2k - 1)^2},$$

with probability density function

$$g(\omega) = \sum_{k=1}^{\infty} (-1)^k\, \frac{2k + 1}{\sqrt{2\pi\omega^3}}\, e^{-\frac{(2k+1)^2}{8\omega}}\, I_{\omega \in (0,\infty)} \qquad (2\text{--}6)$$

and Laplace transform $E[e^{-t\omega}] = \cosh^{-1}(\sqrt{t/2})$.
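As a quick sanity check of the construction above, the sketch below builds ω from a truncated version of the exponential mixture and verifies the Laplace transform by Monte Carlo. The truncation length, sample size, and test point t are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 200, 20_000
# Build omega from the (truncated) mixture of Exp(1) variables in the text.
k = np.arange(1, K + 1)
E = rng.exponential(1.0, size=(n, K))
omega = (2 / np.pi**2) * (E / (2 * k - 1) ** 2).sum(axis=1)

t = 1.3  # arbitrary test point
print(np.exp(-t * omega).mean())      # Monte Carlo estimate of E[exp(-t*omega)]
print(1 / np.cosh(np.sqrt(t / 2)))    # cosh^{-1}(sqrt(t/2)) from the text
```

The two printed numbers should agree to roughly two or three decimal places, up to Monte Carlo and truncation error.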

The Pólya-gamma family of densities is obtained through an exponential tilting of the density $g$ from (2–6). These densities, indexed by $c \ge 0$, are characterized by

$$f(\omega \mid c) = \cosh\left(\frac{c}{2}\right) e^{-\frac{c^2\omega}{2}}\, g(\omega).$$

The likelihood for the binomial logistic model can be expressed in terms of latent Pólya-gamma variables as follows. Assume $y_i \sim \text{Bernoulli}(\delta_i)$ with predictors $\mathbf{x}_i' = (x_{i1}, \ldots, x_{ip})$ and success probability $\delta_i = e^{\mathbf{x}_i'\boldsymbol{\beta}}/(1 + e^{\mathbf{x}_i'\boldsymbol{\beta}})$. Hence, the posterior for the model parameters can be represented as

$$[\boldsymbol{\beta} \mid \mathbf{y}] = \frac{[\boldsymbol{\beta}]\prod_{i=1}^{n} \delta_i^{y_i}(1 - \delta_i)^{1-y_i}}{c(\mathbf{y})},$$

where $c(\mathbf{y})$ is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing a Pólya-gamma random variable $\omega \sim PG(\mathbf{x}'\boldsymbol{\beta}, 1)$. This yields the data-augmented posterior

$$[\boldsymbol{\beta}, \omega \mid \mathbf{y}] = \frac{\left(\prod_{i=1}^{n} \Pr(y_i = 1 \mid \boldsymbol{\beta})\right) f(\omega \mid \mathbf{x}'\boldsymbol{\beta})\, [\boldsymbol{\beta}]\, d\omega}{c(\mathbf{y})}, \qquad (2\text{--}7)$$

such that $[\boldsymbol{\beta} \mid \mathbf{y}] = \int_{\mathbb{R}^+} [\boldsymbol{\beta}, \omega \mid \mathbf{y}]\, d\omega$.

Thus, from the augmented model, the full conditional density for $\boldsymbol{\beta}$ is given by

$$[\boldsymbol{\beta} \mid \boldsymbol{\omega}, \mathbf{y}] \propto \left(\prod_{i=1}^{n} \Pr(y_i = 1 \mid \boldsymbol{\beta})\right) f(\boldsymbol{\omega} \mid \mathbf{x}'\boldsymbol{\beta})\, [\boldsymbol{\beta}]\, d\boldsymbol{\omega} = \prod_{i=1}^{n} \frac{(e^{\mathbf{x}_i'\boldsymbol{\beta}})^{y_i}}{1 + e^{\mathbf{x}_i'\boldsymbol{\beta}}} \prod_{i=1}^{n} \cosh\left(\frac{|\mathbf{x}_i'\boldsymbol{\beta}|}{2}\right) \exp\left[-\frac{(\mathbf{x}_i'\boldsymbol{\beta})^2\omega_i}{2}\right] g(\omega_i). \qquad (2\text{--}8)$$

This expression yields a normal posterior distribution if $\boldsymbol{\beta}$ is assigned flat or normal priors. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate $\boldsymbol{\beta}$ in the occupancy framework.

2.2 Single Season Occupancy

Let $p_{ij} = F(\mathbf{q}_{ij}^T\boldsymbol{\lambda})$ be the probability of correctly classifying as occupied the $i$th site during the $j$th survey, conditional on the site being occupied, and let $\psi_i = F(\mathbf{x}_i^T\boldsymbol{\alpha})$ correspond to the presence probability at the $i$th site. Further, let $F^{-1}(\cdot)$ denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by $\boldsymbol{\lambda}$ and $\boldsymbol{\alpha}$, respectively, the $r$-variate and $p$-variate coefficient vectors for the detection and for the presence probabilities. Then the following is the joint posterior probability for the presence indicators and the model parameters:

$$\pi^*(\mathbf{z}, \boldsymbol{\alpha}, \boldsymbol{\lambda}) \propto \pi_{\alpha}(\boldsymbol{\alpha})\,\pi_{\lambda}(\boldsymbol{\lambda}) \prod_{i=1}^{N} \left\{ F(\mathbf{x}_i'\boldsymbol{\alpha})^{z_i}\left(1 - F(\mathbf{x}_i'\boldsymbol{\alpha})\right)^{(1-z_i)} \prod_{j=1}^{J} \left(z_i F(\mathbf{q}_{ij}'\boldsymbol{\lambda})\right)^{y_{ij}}\left(1 - z_i F(\mathbf{q}_{ij}'\boldsymbol{\lambda})\right)^{1-y_{ij}} \right\}. \qquad (2\text{--}9)$$

As in the simple probit regression problem, this posterior is intractable; consequently, sampling from it directly is not possible. But the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows, we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they do not require tuning parameters nor eliciting parameter priors subjectively.

2.2.1 Probit Link Model

To extend Albert & Chib's algorithm to the occupancy framework with a probit link, first we introduce two sets of latent variables, denoted by $w_{ij}$ and $v_i$, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

$$y_{ij} \mid z_i, w_{ij} \sim \text{Bernoulli}\left(z_i\, I_{w_{ij} > 0}\right)$$
$$w_{ij} \mid \boldsymbol{\lambda} \sim \mathcal{N}\left(\mathbf{q}_{ij}'\boldsymbol{\lambda},\, 1\right)$$
$$\boldsymbol{\lambda} \sim [\,\boldsymbol{\lambda}\,]$$
$$z_i \mid v_i \sim I_{v_i > 0}$$
$$v_i \mid \boldsymbol{\alpha} \sim \mathcal{N}(\mathbf{x}_i'\boldsymbol{\alpha},\, 1)$$
$$\boldsymbol{\alpha} \sim [\,\boldsymbol{\alpha}\,] \qquad (2\text{--}10)$$

represented by the directed graph found in Figure 2-2.

[Figure 2-2. Graphical representation, occupancy model after data augmentation: $\boldsymbol{\alpha} \rightarrow v_i \rightarrow z_i \rightarrow y_i \leftarrow w_i \leftarrow \boldsymbol{\lambda}$]

Under this hierarchical model, the joint density is given by

$$\pi^*(\mathbf{z}, \mathbf{v}, \boldsymbol{\alpha}, \mathbf{w}, \boldsymbol{\lambda}) \propto C_y\, \pi_{\alpha}(\boldsymbol{\alpha})\,\pi_{\lambda}(\boldsymbol{\lambda}) \prod_{i=1}^{N} \left\{ \phi(v_i;\, \mathbf{x}_i'\boldsymbol{\alpha}, 1)\, I_{v_i > 0}^{\,z_i}\, I_{v_i \le 0}^{\,(1-z_i)} \prod_{j=1}^{J} (z_i I_{w_{ij}>0})^{y_{ij}}\, (1 - z_i I_{w_{ij}>0})^{1-y_{ij}}\, \phi(w_{ij};\, \mathbf{q}_{ij}'\boldsymbol{\lambda}, 1) \right\}. \qquad (2\text{--}11)$$

The full conditional densities derived from the posterior in Equation (2–11) are detailed below.

1. These are obtained from the full conditional of $\mathbf{z}$, after integrating out $\mathbf{v}$ and $\mathbf{w}$:

$$f(\mathbf{z} \mid \boldsymbol{\alpha}, \boldsymbol{\lambda}) = \prod_{i=1}^{N} f(z_i \mid \boldsymbol{\alpha}, \boldsymbol{\lambda}) = \prod_{i=1}^{N} {\psi_i^*}^{z_i} (1 - \psi_i^*)^{1-z_i},$$

$$\text{where}\quad \psi_i^* = \frac{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1 - p_{ij})^{1-y_{ij}}}{\psi_i \prod_{j=1}^{J} p_{ij}^{y_{ij}}(1 - p_{ij})^{1-y_{ij}} + (1 - \psi_i)\prod_{j=1}^{J} I_{y_{ij}=0}}. \qquad (2\text{--}12)$$

2.

$$f(\mathbf{v} \mid \mathbf{z}, \boldsymbol{\alpha}) = \prod_{i=1}^{N} f(v_i \mid z_i, \boldsymbol{\alpha}) = \prod_{i=1}^{N} \text{tr}\,\mathcal{N}(\mathbf{x}_i'\boldsymbol{\alpha},\, 1,\, A_i), \quad \text{where } A_i = \begin{cases} (-\infty, 0] & z_i = 0 \\ (0, \infty) & z_i = 1 \end{cases} \qquad (2\text{--}13)$$

and $\text{tr}\,\mathcal{N}(\mu, \sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$, and truncation region $A$.

3.

$$f(\boldsymbol{\alpha} \mid \mathbf{v}) = \phi_p\left(\boldsymbol{\alpha};\, \Sigma_{\alpha} X'\mathbf{v},\, \Sigma_{\alpha}\right), \qquad (2\text{--}14)$$

where $\Sigma_{\alpha} = (X'X)^{-1}$ and $\phi_k(\mathbf{x};\, \boldsymbol{\mu}, \Sigma)$ represents the $k$-variate normal density with mean vector $\boldsymbol{\mu}$ and variance matrix $\Sigma$.

4.

$$f(\mathbf{w} \mid \mathbf{y}, \mathbf{z}, \boldsymbol{\lambda}) = \prod_{i=1}^{N}\prod_{j=1}^{J} f(w_{ij} \mid y_{ij}, z_i, \boldsymbol{\lambda}) = \prod_{i=1}^{N}\prod_{j=1}^{J} \text{tr}\,\mathcal{N}(\mathbf{q}_{ij}'\boldsymbol{\lambda},\, 1,\, B_{ij}),$$

$$\text{where } B_{ij} = \begin{cases} (-\infty, \infty) & z_i = 0 \\ (-\infty, 0] & z_i = 1 \text{ and } y_{ij} = 0 \\ (0, \infty) & z_i = 1 \text{ and } y_{ij} = 1 \end{cases} \qquad (2\text{--}15)$$

5.

$$f(\boldsymbol{\lambda} \mid \mathbf{w}) = \phi_r\left(\boldsymbol{\lambda};\, \Sigma_{\lambda} Q'\mathbf{w},\, \Sigma_{\lambda}\right), \qquad (2\text{--}16)$$

where $\Sigma_{\lambda} = (Q'Q)^{-1}$.

The Gibbs sampling algorithm for the model can then be summarized as

1. Initialize z, α, v, λ and w.
2. Sample z_i ~ Bern(ψ*_i).
3. Sample v_i from a truncated normal with μ = x_i'α, σ = 1, and truncation region depending on z_i.
4. Sample α ~ N(Σ_α X'v, Σ_α), with Σ_α = (X'X)^{−1}.
5. Sample w_ij from a truncated normal with μ = q_ij'λ, σ = 1, and truncation region depending on y_ij and z_i.
6. Sample λ ~ N(Σ_λ Q'w, Σ_λ), with Σ_λ = (Q'Q)^{−1}.
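The sketch below illustrates one way these six steps could be organized in code. It is a minimal illustration only, assuming the data are stored as numpy arrays y (N × J detections), X (N × p occupancy covariates) and Q (N × J × r detection covariates); the function name and data layout are assumptions made here for exposition, not part of the original text.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def probit_occupancy_gibbs(y, X, Q, n_iter=2000, seed=0):
    """y: N x J detections; X: N x p occupancy covariates; Q: N x J x r detection covariates."""
    rng = np.random.default_rng(seed)
    N, J = y.shape
    p, r = X.shape[1], Q.shape[2]
    XtX_inv = np.linalg.inv(X.T @ X)
    Qf = Q.reshape(-1, r)
    QtQ_inv = np.linalg.inv(Qf.T @ Qf)
    alpha, lam = np.zeros(p), np.zeros(r)
    draws = []
    for _ in range(n_iter):
        # Step 2: presence indicators z_i from psi*_i (equation 2-12)
        psi = norm.cdf(X @ alpha)
        pij = norm.cdf(np.einsum("ijk,k->ij", Q, lam))
        num = psi * np.prod(pij ** y * (1 - pij) ** (1 - y), axis=1)
        den = num + (1 - psi) * np.all(y == 0, axis=1)
        z = rng.binomial(1, num / den)
        # Step 3: latent v_i, truncated normal with sign fixed by z_i (equation 2-13)
        mu_v = X @ alpha
        lo_v = np.where(z == 1, 0.0, -np.inf)
        hi_v = np.where(z == 1, np.inf, 0.0)
        v = truncnorm.rvs(lo_v - mu_v, hi_v - mu_v, loc=mu_v, scale=1.0, random_state=rng)
        # Step 4: alpha | v (equation 2-14)
        alpha = rng.multivariate_normal(XtX_inv @ X.T @ v, XtX_inv)
        # Step 5: latent w_ij, truncation region B_ij (equation 2-15)
        mu_w = np.einsum("ijk,k->ij", Q, lam)
        lo_w = np.where((z[:, None] == 1) & (y == 1), 0.0, -np.inf)
        hi_w = np.where((z[:, None] == 1) & (y == 0), 0.0, np.inf)
        w = truncnorm.rvs(lo_w - mu_w, hi_w - mu_w, loc=mu_w, scale=1.0, random_state=rng)
        # Step 6: lambda | w (equation 2-16)
        lam = rng.multivariate_normal(QtQ_inv @ Qf.T @ w.reshape(-1), QtQ_inv)
        draws.append((alpha.copy(), lam.copy()))
    return draws
```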

2.2.2 Logit Link Model

Now turning to the logit link version of the occupancy model, again let y_ij be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site, and let z_i be the indicator variable that denotes presence (z_i = 1) or absence (z_i = 0) of the target species at the i-th site. The model is now defined by

y_ij | z_i, λ ~ Bernoulli(z_i p_ij), where p_ij = e^{q_ij'λ}/(1 + e^{q_ij'λ}),    λ ~ [λ]
z_i | α ~ Bernoulli(ψ_i), where ψ_i = e^{x_i'α}/(1 + e^{x_i'α}),    α ~ [α]

In this hierarchy, the contribution of a single site to the likelihood is

L_i(α, λ) = [(e^{x_i'α})^{z_i}/(1 + e^{x_i'α})] ∏_{j=1}^{J} (z_i e^{q_ij'λ}/(1 + e^{q_ij'λ}))^{y_ij} (1 − z_i e^{q_ij'λ}/(1 + e^{q_ij'λ}))^{1−y_ij}   (2–17)

As in the probit case, we data-augment the likelihood with two separate sets of latent variables; however, in this case each of them has a Polya-Gamma distribution. Augmenting the model and using the posterior in (2–7), the joint is

[z, α, λ | y] ∝ [α][λ] ∏_{i=1}^{N} [(e^{x_i'α})^{z_i}/(1 + e^{x_i'α})] cosh(|x_i'α|/2) exp[−(x_i'α)² v_i/2] g(v_i) ×
∏_{j=1}^{J} (z_i e^{q_ij'λ}/(1 + e^{q_ij'λ}))^{y_ij} (1 − z_i e^{q_ij'λ}/(1 + e^{q_ij'λ}))^{1−y_ij} × cosh(|z_i q_ij'λ|/2) exp[−(z_i q_ij'λ)² w_ij/2] g(w_ij)   (2–18)

The full conditionals for z, α, v, λ and w obtained from (2–18) are provided below.

1. The full conditional for z is obtained after marginalizing the latent variables, and yields

f(z|α, λ) = ∏_{i=1}^{N} f(z_i|α, λ) = ∏_{i=1}^{N} ψ*_i^{z_i} (1 − ψ*_i)^{1−z_i},
where ψ*_i = ψ_i ∏_{j=1}^{J} p_ij^{y_ij}(1 − p_ij)^{1−y_ij} / [ψ_i ∏_{j=1}^{J} p_ij^{y_ij}(1 − p_ij)^{1−y_ij} + (1 − ψ_i) ∏_{j=1}^{J} I_{y_ij=0}]   (2–19)

2. Using the result derived in Polson et al. (2013), we have that

f(v|z, α) = ∏_{i=1}^{N} f(v_i|z_i, α) = ∏_{i=1}^{N} PG(1, x_i'α)   (2–20)

3. f(α|v) ∝ [α] ∏_{i=1}^{N} exp[z_i x_i'α − x_i'α/2 − (x_i'α)² v_i/2]   (2–21)

4. By the same result as that used for v, the full conditional for w is

f(w|y, z, λ) = ∏_{i=1}^{N} ∏_{j=1}^{J} f(w_ij|y_ij, z_i, λ) = (∏_{i∈S_1} ∏_{j=1}^{J} PG(1, |q_ij'λ|)) (∏_{i∉S_1} ∏_{j=1}^{J} PG(1, 0)),   (2–22)
with S_1 = {i ∈ {1, 2, ..., N} : z_i = 1}.

5. f(λ|z, y, w) ∝ [λ] ∏_{i∈S_1} ∏_{j=1}^{J} exp[y_ij q_ij'λ − q_ij'λ/2 − (q_ij'λ)² w_ij/2],   (2–23)
with S_1 as defined above.

The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-Gamma instead of normal latent variables.
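As a concrete illustration, the sketch below shows how one such Polya-Gamma cycle for the detection parameters (full conditionals 2–22 and 2–23) might look in code. It is an assumed sketch: the routine sample_pg is a hypothetical placeholder for any PG(b, c) sampler, and the data layout mirrors the probit example above.

```python
import numpy as np

def update_lambda_pg(y, Q, z, lam, rng, sample_pg, prior_prec=None):
    """One Polya-Gamma cycle for lambda: draw w_ij ~ PG(1, z_i q_ij' lam), then lambda | w, y, z.
    `sample_pg(b, c)` is a placeholder PG(b, c) sampler; prior_prec = 0 mimics the flat prior [lambda]."""
    N, J, r = Q.shape
    if prior_prec is None:
        prior_prec = np.zeros((r, r))
    eta = np.einsum("ijk,k->ij", Q, lam) * z[:, None]   # z_i * q_ij' lam (the tilt in equation 2-22)
    w = sample_pg(np.ones_like(eta), eta)               # PG(1, |eta|) draws
    occ = z == 1                                        # only occupied sites (the set S_1) carry information
    Qo = Q[occ].reshape(-1, r)
    kappa = (y[occ] - 0.5).reshape(-1)                  # y_ij - 1/2, the linear term in (2-23)
    wo = w[occ].reshape(-1)
    V = np.linalg.inv(Qo.T @ (wo[:, None] * Qo) + prior_prec)
    m = V @ (Qo.T @ kappa)
    return rng.multivariate_normal(m, V)
```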

2.3 Temporal Dynamics and Spatial Structure

The uses of the single-season model are limited to very specific problems. In particular, assumptions for the basic model may become too restrictive or unrealistic whenever the study period extends throughout multiple years or seasons, especially given the increasingly changing environmental conditions that most ecosystems are currently experiencing.

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method is flexible enough to let detection, occurrence, survival and colonization probabilities each depend upon its own set of covariates, using likelihood-based estimation for the model parameters.

However, the approach of MacKenzie et al. presents two drawbacks. First, the uncertainty assessment for maximum likelihood parameter estimates relies on asymptotic results (obtained from implementation of the delta method), making it sensitive to sample size. And second, to obtain parameter estimates, the latent process (occupancy) is marginalized out of the likelihood, leading to the usual zero-inflated Bernoulli model. Although this is a convenient strategy to solve the estimation problem, the latent state variables (occupancy indicators) are no longer available, and as such finite sample estimates cannot be calculated unless an additional (and computationally expensive) parametric bootstrap step is performed (Royle & Kery 2007). Additionally, as the occupancy process is integrated out, the likelihood approach precludes incorporation of additional structural dependence using random effects. Thus the model cannot account for spatial dependence, which plays a fundamental role in this setting.

To work around some of the shortcomings encountered when fitting dynamic occupancy models via likelihood based methods, Royle & Kery developed what they refer to as a dynamic occupancy state space model (DOSS), alluding to the conceptual similarity found between this model and the class of state space models found in the time series literature. In particular, this model allows one to retain the latent process (occupancy indicators) in order to obtain small sample estimates and to eventually generate extensions that incorporate structure in time and/or space through random effects.

The data used in the DOSS model come from standard repeated presence/absence surveys with N sampling locations (patches or sites), indexed by i = 1, 2, ..., N. Within a given season (e.g., year, month, week, depending on the biology of the species), each sampling location is visited (surveyed) j = 1, 2, ..., J times. This process is repeated for t = 1, 2, ..., T seasons. Here an important assumption is that the site occupancy status is closed within, but not across, seasons.

As is usual in the occupancy modeling framework, two different processes are considered. The first one is the detection process per site-visit-season combination, denoted by y_ijt. The y_ijt are indicator functions that take the value 1 if the species is detected at site i, survey j and season t, and 0 otherwise. These detection indicators are assumed to be independent within each site and season. The second response considered is the partially observed presence (occupancy) indicators z_it. These are indicator variables which are equal to 1 whenever y_ijt = 1 for one or more of the visits made to site i during season t; otherwise the values of the z_it's are unknown. Royle & Kery refer to these two processes as the observation (y_ijt) and the state (z_it) models.

In this setting, the parameters of greatest interest are the occurrence or site occupancy probabilities, denoted by ψ_it, as well as those representing the population dynamics, which are accounted for by introducing changes in occupancy status over time through local colonization and survival. That is, if a site was not occupied at season t − 1, at season t it can either be colonized or remain unoccupied. On the other hand, if the site was in fact occupied at season t − 1, it can remain that way (survival) or become abandoned (local extinction) at season t. The probabilities of survival and colonization from season t − 1 to season t at the i-th site are denoted by θ_i(t−1) and γ_i(t−1), respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (equation 2–24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (equation 2–25); in particular,

z_i1 ~ Bernoulli(ψ_i1)   (2–24)

z_it | z_i(t−1) ~ Bernoulli(z_i(t−1) θ_i(t−1) + (1 − z_i(t−1)) γ_i(t−1))   (2–25)

The observation model, conditional on the latent process z_it, is defined by

y_ijt | z_it ~ Bernoulli(z_it p_ijt)   (2–26)
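To make the generative structure of equations 2–24 to 2–26 concrete, the following small sketch simulates the state and observation processes. The probability values and the function name are illustrative assumptions, not estimates from any data set.

```python
import numpy as np

def simulate_doss(N=100, J=3, T=5, psi1=0.6, theta=0.8, gamma=0.2, p=0.7, seed=1):
    """Simulate presences z (N x T) and detections y (N x J x T) under the DOSS structure."""
    rng = np.random.default_rng(seed)
    z = np.zeros((N, T), dtype=int)
    y = np.zeros((N, J, T), dtype=int)
    z[:, 0] = rng.binomial(1, psi1, size=N)                                 # equation 2-24
    for t in range(1, T):
        occ_prob = z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma          # equation 2-25
        z[:, t] = rng.binomial(1, occ_prob)
    for t in range(T):
        y[:, :, t] = rng.binomial(1, p * z[:, t][:, None], size=(N, J))     # equation 2-26
    return z, y
```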

Royle & Kery induce the heterogeneity by site, site-season and site-survey-season, respectively, in the occupancy, survival and colonization, and detection probabilities, through the following specification:

logit(ψ_i1) = x_1 + r_i,    r_i ~ N(0, σ²_ψ),    logit^{−1}(x_1) ~ Unif(0, 1)
logit(θ_it) = a_t + u_i,    u_i ~ N(0, σ²_θ),    logit^{−1}(a_t) ~ Unif(0, 1)
logit(γ_it) = b_t + v_i,    v_i ~ N(0, σ²_γ),    logit^{−1}(b_t) ~ Unif(0, 1)
logit(p_ijt) = c_t + w_ij,    w_ij ~ N(0, σ²_p),    logit^{−1}(c_t) ~ Unif(0, 1)   (2–27)

where x_1, a_t, b_t, c_t are the season fixed effects for the corresponding probabilities, and where (r_i, u_i, v_i) and w_ij are the site and site-survey random effects, respectively. Additionally, all variance components assume the usual inverse gamma priors.

As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model

We assume that the probabilities for occupancy, survival, colonization and detection are all functions of linear combinations of covariates. However, our setup varies slightly from that considered by Royle & Kery (2007). In essence, we modify the way in which the estimates for survival and colonization probabilities are attained. Our model incorporates the notion that occupancy at a site occupied during the previous season takes place through persistence, where we define persistence as a function of both survival and colonization. That is, a site occupied at time t may again be occupied at time t + 1 if the current settlers survive, if they perish and new settlers colonize simultaneously, or if both current settlers survive and new ones colonize.

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as δ, is linked to a linear combination of covariates x'ξ through the relationship defined by δ = F(x^T ξ), where F(·) represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State Space model (DYMOSS).

As before, let y_ijt be the indicator variable used to mark detection of the target species on the j-th survey at the i-th site during the t-th season, and let z_it be the indicator variable that denotes presence (z_it = 1) or absence (z_it = 0) of the target species at the i-th site, t-th season, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., J} and t ∈ {1, 2, ..., T}. Additionally, assume that the probabilities for occupancy at time t = 1, persistence, colonization and detection are all functions of covariates, with corresponding parameter vectors α, Δ^(s) = {δ^(s)_{t−1}}_{t=2}^{T}, B^(c) = {β^(c)_{t−1}}_{t=2}^{T} and Λ = {λ_t}_{t=1}^{T}, and covariate matrices X^(o), X = {X_{t−1}}_{t=2}^{T} and Q = {Q_t}_{t=1}^{T}, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model:

z_i1 | α ~ Bernoulli(ψ_i1), where ψ_i1 = F(x_(o)i'α)
z_it | z_i(t−1), δ^(s)_{t−1}, β^(c)_{t−1} ~ Bernoulli(z_i(t−1) θ_i(t−1) + (1 − z_i(t−1)) γ_i(t−1)),
where θ_i(t−1) = F(δ^(s)_{t−1} + x_i(t−1)'β^(c)_{t−1}) and γ_i(t−1) = F(x_i(t−1)'β^(c)_{t−1})   (2–28)

Observed model:
y_ijt | z_it, λ_t ~ Bernoulli(z_it p_ijt), where p_ijt = F(q_ijt'λ_t)   (2–29)

In the hierarchical setup given by Equations 2–28 and 2–29, θ_i(t−1) corresponds to the probability of persistence from time t − 1 to time t at site i, and γ_i(t−1) denotes the colonization probability. Note that θ_i(t−1) − γ_i(t−1) yields the survival probability from t − 1 to t. The effect of survival is introduced by changing the intercept of the linear predictor by a quantity δ^(s)_{t−1}. Although in this version of the model this effect is accomplished by just modifying the intercept, it can be extended to have covariates determining δ^(s)_{t−1} as well.
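A small numerical illustration of these relationships under the probit link is given below; the covariate value and intercept shift are invented for the example.

```python
from scipy.stats import norm

x_beta = 0.3                          # x_i(t-1)' beta^(c)_(t-1), hypothetical value
delta_s = 0.8                         # intercept shift delta^(s)_(t-1), hypothetical value
theta = norm.cdf(delta_s + x_beta)    # persistence probability
gamma = norm.cdf(x_beta)              # colonization probability
survival = theta - gamma              # implied survival probability
print(theta, gamma, survival)         # approx. 0.864, 0.618, 0.246
```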

The graphical representation of the model for a single site is given in Figure 2-3.

Figure 2-3. Graphical representation of the multiseason model for a single site.

The joint posterior for the model defined by this hierarchical setting is

[z, α, B^(c), Δ^(s), Λ | y] = C_y ∏_{i=1}^{N} {ψ_i1 ∏_{j=1}^{J} p_ij1^{y_ij1} (1 − p_ij1)^{1−y_ij1}}^{z_i1} {(1 − ψ_i1) ∏_{j=1}^{J} I_{y_ij1=0}}^{1−z_i1} [λ_1][α] ×
∏_{t=2}^{T} ∏_{i=1}^{N} [(θ_i(t−1)^{z_it} (1 − θ_i(t−1))^{1−z_it})^{z_i(t−1)} (γ_i(t−1)^{z_it} (1 − γ_i(t−1))^{1−z_it})^{1−z_i(t−1)}] {∏_{j=1}^{J} p_ijt^{y_ijt} (1 − p_ijt)^{1−y_ijt}}^{z_it} × {∏_{j=1}^{J} I_{y_ijt=0}}^{1−z_it} [λ_t][β^(c)_{t−1}][δ^(s)_{t−1}]   (2–30)

which, as in the single season case, is intractable. Once again, a Gibbs sampler cannot be constructed directly to sample from this joint posterior. The graphical representation of the model for one site, incorporating the latent variables, is provided in Figure 2-4.

Figure 2-4. Graphical representation of the data-augmented multiseason model.

Probit link normal-mixture DYMOSS model


We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combinations of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distributions of the model parameters. For the probit link, the sets of latent random variables, respectively for first season occupancy, persistence and colonization, and detection, are

• u_i ~ N(x_(o)i^T α, 1),
• v_i(t−1) ~ z_i(t−1) N(δ^(s)_{t−1} + x_i(t−1)^T β^(c)_{t−1}, 1) + (1 − z_i(t−1)) N(x_i(t−1)^T β^(c)_{t−1}, 1), and
• w_ijt ~ N(q_ijt^T λ_t, 1).

Introducing these latent variables into the hierarchical formulation yields

State model:
u_i | α ~ N(x_(o)i'α, 1)
z_i1 | u_i ~ Bernoulli(I_{u_i>0})
for t > 1:
v_i(t−1) | z_i(t−1), β_{t−1} ~ z_i(t−1) N(δ^(s)_{t−1} + x_i(t−1)'β^(c)_{t−1}, 1) + (1 − z_i(t−1)) N(x_i(t−1)'β^(c)_{t−1}, 1)
z_it | v_i(t−1) ~ Bernoulli(I_{v_i(t−1)>0})   (2–31)

Observed model:
w_ijt | λ_t ~ N(q_ijt^T λ_t, 1)
y_ijt | z_it, w_ijt ~ Bernoulli(z_it I_{w_ijt>0})   (2–32)

Note that the result presented in Section 2.2 corresponds to the particular case for T = 1 of the model specified by Equations 2–31 and 2–32.

As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let φ(x|μ, σ²) denote the pdf of a normally distributed random variable x with mean μ and standard deviation σ. Also let

1. W_t = (w_1t, w_2t, ..., w_Nt), with w_it = (w_i1t, w_i2t, ..., w_iJ_it t) (for i = 1, 2, ..., N and t = 1, 2, ..., T),
2. u = (u_1, u_2, ..., u_N), and
3. V = (v_1, ..., v_{T−1}), with v_t = (v_1t, v_2t, ..., v_Nt).

For the probit link model, the joint posterior distribution is

π(Z, u, V, {W_t}_{t=1}^{T}, α, B^(c), Δ^(s), Λ) ∝ [α] ∏_{i=1}^{N} φ(u_i | x_(o)i'α, 1) I_{u_i>0}^{z_i1} I_{u_i≤0}^{1−z_i1} ×
∏_{t=2}^{T} [β^(c)_{t−1}, δ^(s)_{t−1}] ∏_{i=1}^{N} φ(v_i(t−1) | μ^(v)_{i(t−1)}, 1) I_{v_i(t−1)>0}^{z_it} I_{v_i(t−1)≤0}^{1−z_it} ×
∏_{t=1}^{T} [λ_t] ∏_{i=1}^{N} ∏_{j=1}^{J_it} φ(w_ijt | q_ijt'λ_t, 1) (z_it I_{w_ijt>0})^{y_ijt} (1 − z_it I_{w_ijt>0})^{1−y_ijt},

where μ^(v)_{i(t−1)} = z_i(t−1) δ^(s)_{t−1} + x_i(t−1)'β^(c)_{t−1}.   (2–33)

Initialize the Gibbs sampler at α^(0), B^(c)(0), Δ^(s)(0) and Λ^(0). The sampler proceeds iteratively, block sampling sequentially for each primary sampling period, as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting [·|·] denote the full conditional probability density function of each component, conditional on all other unknown parameters and the observed data, for m = 1, ..., n_sim the sampling procedure can be summarized as

[z_1^(m)|·] → [u^(m)|·] → [α^(m)|·] → [W_1^(m)|·] → [λ_1^(m)|·] → [z_2^(m)|·] → [V_{2−1}^(m)|·] → [β^(c)(m)_{2−1}, δ^(s)(m)_{2−1}|·] → [W_2^(m)|·] → [λ_2^(m)|·] → ⋯ → [z_T^(m)|·] → [V_{T−1}^(m)|·] → [β^(c)(m)_{T−1}, δ^(s)(m)_{T−1}|·] → [W_T^(m)|·] → [λ_T^(m)|·]

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail within Appendix A.
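The block-sampling order above translates naturally into a nested loop over primary periods. The following skeleton (an assumed structure only; the update_* routine names are hypothetical placeholders for the Appendix A full conditionals) illustrates the flow of one possible implementation.

```python
def dymoss_probit_sampler(data, n_sim, state):
    """Skeleton of the DYMOSS block-sampling loop; state holds current parameter values."""
    for m in range(n_sim):
        # season 1: presence, occupancy latents/parameters, detection latents/parameters
        state.z[0]   = update_presence(data, state, t=0)
        state.u      = update_u(data, state)
        state.alpha  = update_alpha(data, state)
        state.W[0]   = update_w(data, state, t=0)
        state.lam[0] = update_lambda(data, state, t=0)
        # seasons t = 2, ..., T: persistence/colonization block, then detection block
        for t in range(1, data.T):
            state.z[t]     = update_presence(data, state, t=t)
            state.V[t - 1] = update_v(data, state, t=t)
            state.beta[t - 1], state.delta[t - 1] = update_beta_delta(data, state, t=t)
            state.W[t]     = update_w(data, state, t=t)
            state.lam[t]   = update_lambda(data, state, t=t)
        save_draw(state, m)
```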


Logit link Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by

State model:

u_i | α ~ PG(1, x_(o)i^T α)
z_i1 | u_i ~ Bernoulli(I_{u_i>0})
for t > 1:
v_i(t−1) | · ~ PG(1, |z_i(t−1) δ^(s)_{t−1} + x_i(t−1)'β^(c)_{t−1}|)
z_it | v_i(t−1) ~ Bernoulli(I_{v_i(t−1)>0})   (2–34)

Observed model:
w_ijt | λ_t ~ PG(1, q_ijt^T λ_t)
y_ijt | z_it, w_ijt ~ Bernoulli(z_it I_{w_ijt>0})   (2–35)

The logit link version of the joint posterior is given by

π(Z, u, V, {W_t}_{t=1}^{T}, α, B^(c), Δ^(s), Λ) ∝ ∏_{i=1}^{N} [(e^{x_(o)i'α})^{z_i1}/(1 + e^{x_(o)i'α})] PG(u_i; 1, |x_(o)i'α|) [λ_1][α] ×
∏_{j=1}^{J_i1} (z_i1 e^{q_ij1'λ_1}/(1 + e^{q_ij1'λ_1}))^{y_ij1} (1 − z_i1 e^{q_ij1'λ_1}/(1 + e^{q_ij1'λ_1}))^{1−y_ij1} PG(w_ij1; 1, |z_i1 q_ij1'λ_1|) ×
∏_{t=2}^{T} [δ^(s)_{t−1}][β^(c)_{t−1}][λ_t] ∏_{i=1}^{N} [(exp[μ^(v)_{i(t−1)}])^{z_it}/(1 + exp[μ^(v)_{i(t−1)}])] PG(v_i(t−1); 1, |μ^(v)_{i(t−1)}|) ×
∏_{j=1}^{J_it} (z_it e^{q_ijt'λ_t}/(1 + e^{q_ijt'λ_t}))^{y_ijt} (1 − z_it e^{q_ijt'λ_t}/(1 + e^{q_ijt'λ_t}))^{1−y_ijt} PG(w_ijt; 1, |z_it q_ijt'λ_t|),   (2–36)

with μ^(v)_{i(t−1)} = z_i(t−1) δ^(s)_{t−1} + x_i(t−1)'β^(c)_{t−1}.

The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2–36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence

In this section we describe how the additional layer of complexity, space, can also be accounted for by continuing to use the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single season closed population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the Spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran 2013).

Succinctly, the spatial dependence is accounted for in the model by adding a random vector η, assumed to have a conditionally-autoregressive (CAR) prior (also known as the Gaussian Markov random field prior). To define the prior, let the pair G = (V, E) represent the undirected graph for the entire spatial region studied, where V = (1, 2, ..., N) denotes the vertices of the graph (sites) and E the set of edges between sites; E is constituted by elements of the form (i, j), indicating that sites i and j are spatially adjacent, for some i, j ∈ V. The prior for the spatial effects is then characterized by

[η|τ] ∝ τ^{rank(Ω)/2} exp[−(τ/2) η'Ωη],   (2–37)

where Ω = (diag(A1) − A) is the precision matrix, with A denoting the adjacency matrix. The entries of the adjacency matrix A are such that diag(A) = 0 and A_ij = I_{(i,j)∈E}.

The matrix Ω is singular. Hence the probability density defined in equation 2–37 is improper, i.e., it doesn't integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper the posterior for the model parameters is proper. If a constraint such as Σ_k η_k = 0 is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.
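A minimal sketch of how this precision matrix can be assembled from a binary adjacency matrix is given below; the symbol Omega mirrors the precision matrix in equation 2–37, and the example graph is invented purely for illustration.

```python
import numpy as np

def car_precision(A):
    """CAR precision matrix Omega = diag(A 1) - A from a binary adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A

# example: four sites arranged on a line (1-2, 2-3, 3-4 adjacent)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Omega = car_precision(A)   # singular: rank(Omega) = 3, so the CAR prior is improper
```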

Assuming that all but the detection process are subject to spatial correlations, and using the notation we have developed up to this point, the spatially explicit version of the DYMOSS model is characterized by the hierarchy represented by equations 2–38 and 2–39.

Hence, adding spatial structure into the DYMOSS framework described in the previous section only involves adding the steps to sample η^(o) and {η_t}_{t=2}^{T}, conditional on all other parameters. Furthermore, the corresponding parameters and spatial random effects of a given component (i.e., occupancy, survival and colonization) can be effortlessly pooled together into a single parameter vector to perform block sampling. For each of the latent variables, the only modification required is to add the corresponding spatial effect to the linear predictor, so that these retain their conditional independence given the linear combination of fixed effects and the spatial effects.

State model:
z_i1 | α ~ Bernoulli(ψ_i1), where ψ_i1 = F(x_(o)i^T α + η^(o)_i)
[η^(o)|τ] ∝ τ^{rank(Ω)/2} exp[−(τ/2) η^(o)'Ωη^(o)]
z_it | z_i(t−1), α, β_{t−1}, λ_{t−1} ~ Bernoulli(z_i(t−1) θ_i(t−1) + (1 − z_i(t−1)) γ_i(t−1)),
where θ_i(t−1) = F(δ^(s)_{t−1} + x_i(t−1)^T β^(c)_{t−1} + η_it) and γ_i(t−1) = F(x_i(t−1)^T β^(c)_{t−1} + η_it)
[η_t|τ] ∝ τ^{rank(Ω)/2} exp[−(τ/2) η_t'Ωη_t]   (2–38)

Observed model:
y_ijt | z_it, λ_t ~ Bernoulli(z_it p_ijt), where p_ijt = F(q_ijt^T λ_t)   (2–39)

In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran 2013; Reich et al. 2006): (1) model parameters have no clear interpretation due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work by Reich et al. (2006). This methodology is summarized in what follows.

Let a vector of spatial effects η have the CAR model given by 2–37 above. Now consider a random vector ζ ~ MVN(0, τ K'ΩK), with Ω defined as above and where τ K'ΩK corresponds to the precision of the distribution (not the covariance matrix), with the matrix K satisfying K'K = I.

This last condition implies that the linear predictor Xβ + η = Xβ + Kζ. With respect to how the matrix K is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as P⊥AP⊥, with P⊥ = I − X(X'X)^{−1}X', and where A is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix X. These elements are incorporated into the spectral decomposition of the Moran operator. That is, its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to X, while its eigenvectors provide the patterns of spatial dependence residual to X. Thus, the matrix K is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
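The construction of K can be sketched directly from this description. In the snippet below, the number of retained eigenvectors q is an assumed tuning choice in the spirit of Hughes & Haran's dimension reduction, not something fixed by the text.

```python
import numpy as np

def moran_basis(X, A, q):
    """Columns of K: leading eigenvectors of the Moran operator P_perp A P_perp."""
    n = X.shape[0]
    P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # I - X (X'X)^{-1} X'
    M = P_perp @ A @ P_perp                                   # Moran operator (symmetric)
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]                         # largest Moran's I values first
    return eigvecs[:, order[:q]]                              # K: n x q
```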


Using this strategy, the new hierarchical formulation of our model is simply modified by letting η^(o) = K^(o)ζ^(o) and η_t = K_t ζ_t, with

1. ζ^(o) ~ MVN(0, τ^(o) K^(o)'ΩK^(o)), where K^(o) is the eigenvector matrix for P^(o)⊥ A P^(o)⊥, and
2. ζ_t ~ MVN(0, τ_t K_t'ΩK_t), where K_t is the eigenvector matrix for P⊥_t A P⊥_t, for t = 2, 3, ..., T.

The algorithms for the probit and logit link from Section 2.3.1 can be readily adapted to incorporate the spatial structure simply by obtaining the joint posteriors for (α, ζ^(o)) and (β^(c)_{t−1}, δ^(s)_{t−1}, ζ_t), making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary

With a few exceptions (Dorazio & Taylor-Rodríguez 2012; Johnson et al. 2013; Royle & Kery 2007), recent Bayesian approaches to site-occupancy modeling with covariates have relied on model configurations (e.g., multivariate normal priors on parameters in the logit scale) that lead to unfamiliar conditional posterior distributions, thus precluding the use of a direct sampling approach. Therefore, the sampling strategies available are based on algorithms (e.g., Metropolis-Hastings) that require tuning and the knowledge to do so correctly.

In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for which a Gibbs sampler of the basic occupancy model is available, and allowed detection and occupancy probabilities to depend on linear combinations of predictors. This method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993), where the full conditional posteriors of the parameters of the probit regression model are cast as latent mixtures of normal random variables. The probit and the logit link yield similar results with large sample sizes; however, their results may differ when small to moderate sample sizes are considered, because the logit link function places more mass in the tails of the distribution than the probit link does. In Section 2.2.2 we adapt the method for the single season model to work with the logit link function.

The basic occupancy framework is useful, but it assumes a single closed population with fixed probabilities through time. Hence, its assumptions may not be appropriate to address problems where the interest lies in the temporal dynamics of the population. Accordingly, we developed a dynamic model that incorporates the notion that occupancy at a site previously occupied takes place through persistence, which depends both on survival and habitat suitability. By this we mean that a site occupied at time t may again be occupied at time t + 1 if (1) the current settlers survive, (2) the existing settlers perish but new settlers simultaneously colonize, or (3) current settlers survive and new ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by x_i(t−1)'β^(c)_{t−1}. They only differ in that persistence is also influenced by whether the site being occupied during season t − 1 enhances the suitability of the site or harms it through density dependence.

Additionally, the study of the dynamics that govern distribution and abundance of biological populations requires an understanding of the physical and biotic processes that act upon them, and these vary in time and space. Consequently, as a final step in this Chapter, we described a straightforward strategy to add spatial dependence among neighboring sites in the dynamic metapopulation model. This extension is based on the popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).

Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout the Chapter, and (2) build a suite of additional extensions using this framework for occupancy models. The first of them will be used to incorporate information from different sources, such as tracks, scats, surveys and direct observations, into a single model. This can be accomplished by adding a layer to the hierarchy where the source and spatial scale of the data are accounted for. The second extension is a single season, spatially explicit, multiple species co-occupancy model. This model will allow studying complex interactions and testing hypotheses about species interactions at a given point in time. Lastly, this co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of the DYMOSS model.


CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors and the one which remains must be the truth.
–Sherlock Holmes, The Sign of Four

3.1 Introduction

Occupancy models are often used to understand the mechanisms that dictate the distribution of a species. Therefore, variable selection plays a fundamental role in achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs 2014; Link & Barker 2009), AIC is the method used to choose from competing site-occupancy models. In addition, the procedures currently implemented and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). In practice, this can be achieved if the model space considered is small enough, which is possible if the choice of the model space is guided by substantial prior knowledge about the underlying ecological processes. Nevertheless, many site-occupancy surveys collect large amounts of covariate information about the sampled sites. Given that the total number of candidate models grows exponentially fast with the number of predictors considered, choosing a reduced set of models guided by ecological intuition becomes increasingly difficult. This is even more so the case in the occupancy model context, where the model space is the cartesian product of models for presence and models for detection. Given the issues mentioned above, we propose the first objective Bayesian variable selection method for the single-season occupancy model framework. This approach explores, in a principled manner, the entire model space. It is completely automatic, precluding the need for both tuning parameters in the sampling algorithm and subjective elicitation of parameter prior distributions.

As mentioned above, in ecological modeling, if model selection or (less frequently) model averaging is considered, the Akaike Information Criterion (AIC) (Akaike 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler 2011; Mazerolle & Mazerolle 2013). The AIC is designed to find the model that has, on average, the density closest in Kullback-Leibler distance to the density of the true data generating mechanism. The model with the smallest AIC is selected. However, if nested models are considered, one of them being the true one, generally the AIC will not select it (Wasserman 2000). Commonly, the model selected by AIC will be more complex than the true one. The reason for this is that the AIC has a weak signal to noise ratio, and as such it tends to overfit (Rao & Wu 2001). Other versions of the AIC provide a bias correction that enhances the signal to noise ratio, leading to a stronger penalization for model complexity. Some examples are the AICc (Hurvich & Tsai 1989) and AICu (McQuarrie et al. 1997); however, these are also not consistent for selection, albeit asymptotically efficient (Rao & Wu 2001).

If we are interested in prediction, as opposed to testing, the AIC is certainly appropriate. However, when conducting inference, the use of Bayesian model averaging and selection methods is more fitting. If the true data generating mechanism is among those considered, asymptotically Bayesian methods choose the true model with probability one. Conversely, if the true model is not among the alternatives and a suitable parameter prior is used, the posterior probability of the most parsimonious model closest to the true one tends asymptotically to one.

In spite of this, in general, for Bayesian testing, direct elicitation of prior probabilistic statements is often impeded because the problems studied may not be sufficiently well understood to make an informed decision about the priors. Conversely, there may be a prohibitively large number of parameters, making specifying priors for each of these parameters an arduous task. In addition to this, seemingly innocuous subjective choices for the priors on the parameter space may drastically affect test outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman 1996).

One popular choice of "objective" prior is the reference prior (Berger & Bernardo 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling with equal coverage-probability (Kass & Wasserman 1996). Reference priors, however, are improper, and while they yield reasonable posterior parameter probabilities, the derived model posterior probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.

In our view, in the face of a large number of predictors, the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors and to incorporate suitable model priors. This being said, the discussion about model priors is deferred until Chapter 4; this Chapter focuses on the priors on the parameter space.

The Chapter is structured as follows. First, issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided.

Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez 2012; Kery et al. 2010).

3.2 Objective Bayesian Inference

As mentioned before, in practice, noninformative priors arising from structural rules are an alternative to subjective elicitation of priors. Some of the rules used in defining noninformative priors include the principle of insufficient reason, parametrization invariance, maximum entropy, geometric arguments, coverage matching, and decision theoretic approaches (see Kass & Wasserman (1996) for a discussion).

These rules reflect one of two attitudes: (1) noninformative priors either aim to convey unique representations of ignorance, or (2) they attempt to produce probability statements that may be accepted by convention. This latter attitude is in the same spirit as how weights and distances are defined (Kass & Wasserman 1996), and characterizes the way in which Bayesian reference methods are interpreted today, i.e., noninformative priors are seen to be chosen by convention according to the situation.

A word of caution must be given when using noninformative priors. Difficulties arise in their implementation that should not be taken lightly. In particular, these difficulties may occur because noninformative priors are generally improper (meaning that they do not integrate or sum to a finite number) and, as such, are said to depend on arbitrary constants.

Bayes factors strongly depend upon the prior distributions for the parameters included in each of the models being compared. This can be an important limitation considering that, when using noninformative priors, their introduction will result in the Bayes factors being a function of the ratio of arbitrary constants, given that these priors are typically improper (see Jeffreys 1961; Pericchi 2005, and references therein). Many different approaches have since been developed to deal with the arbitrary constants when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi 1996; Good 1950; Lempers 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith 1982), and approximating the Bayes factor (see Haughton 1988, as cited in Berger & Pericchi 1996; Kass & Raftery 1995; Tierney & Kadane 1986).

3.2.1 The Intrinsic Methodology

Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained from combining the model structure with information contained in the observed data. Furthermore, they showed that, as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.

Intrinsic priors, however, are not unique. The asymptotic correspondence between the IBF and the Bayes factor arising from the intrinsic prior yields two functional equations that are solved by a whole class of intrinsic priors. Because all the priors in the class produce Bayes factors that are asymptotically equivalent to the IBF, for finite sample sizes the resulting Bayes factor is not unique. To address this issue, Moreno et al. (1998) formalized the methodology through the "limiting procedure". This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended, with some caution, to nonnested models.

As mentioned before, the Bayesian hypothesis testing procedure is highly sensitive to parameter-prior specification, and not all priors that are useful for estimation are recommended for hypothesis testing or model selection. Evidence of this is provided by the Jeffreys-Lindley paradox, which states that a point null hypothesis will always be accepted when the variance of a conjugate prior goes to infinity (Robert 1993). Additionally, when comparing nested models, the null model should correspond to a substantial reduction in complexity from that of larger alternative models. Hence, priors for the larger alternative models that place probability mass away from the null model are wasteful. If the true model is "far" from the null, it will be easily detected by any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey 1974), is widely recognized by statisticians.

Interestingly, the intrinsic prior in correspondence with the BFIP automatically satisfies the Savage continuity condition. That is, when comparing nested models, the intrinsic prior for the more complex model is centered around the null model, and in spite of being a limiting procedure, it is not subject to the Jeffreys-Lindley paradox.

Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent in the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors

As previously mentioned, in the Bayesian paradigm a model M in M is defined by a sampling density and a prior distribution. The sampling density associated with model M is denoted by f(y|β_M, σ²_M, M), where (β_M, σ²_M) is a vector of model-specific unknown parameters. The prior for model M and its corresponding set of parameters is

π(β_M, σ²_M, M|M) = π(β_M, σ²_M|M, M) · π(M|M).

Objective local priors for the model parameters (β_M, σ²_M) are achieved through modifications and extensions of Zellner's g-prior (Liang et al. 2008; Womack et al. 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion on priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors

An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi 1996; Moreno et al. 1998). Because M_B ⊆ M for all M ∈ M, the intrinsic prior for (β_M, σ²_M) is defined as an expected posterior prior,

π^I(β_M, σ²_M|M) = ∫ p^R(β_M, σ²_M|ỹ, M) m^R(ỹ|M_B) dỹ,   (3–1)

where ỹ is a minimal training sample for model M, I denotes the intrinsic distributions, and R denotes distributions derived from the reference prior π^R(β_M, σ²_M|M) = c_M dβ_M dσ²_M / σ²_M. In (3–1), m^R(ỹ|M) = ∫∫ f(ỹ|β_M, σ²_M, M) π^R(β_M, σ²_M|M) dβ_M dσ²_M is the reference marginal of ỹ under model M, and p^R(β_M, σ²_M|ỹ, M) = f(ỹ|β_M, σ²_M, M) π^R(β_M, σ²_M|M) / m^R(ỹ|M) is the reference posterior density.

In the regression framework, the reference marginal m^R is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model M to the base model M_B is well-defined and given by

BF^I_{M,M_B}(y) = (1 − R²_M)^{−(n−|M_B|)/2} × ∫_0^1 [ (n + sin²(πθ/2)·(|M|+1)) / (n + sin²(πθ/2)·(|M|+1)/(1 − R²_M)) ]^{(n−|M|)/2} [ (sin²(πθ/2)·(|M|+1)) / (n + sin²(πθ/2)·(|M|+1)/(1 − R²_M)) ]^{(|M|−|M_B|)/2} dθ,   (3–2)

where R²_M is the coefficient of determination of model M versus model M_B. The Bayes factor between two models M and M′ is defined as BF^I_{M,M′}(y) = BF^I_{M,M_B}(y) / BF^I_{M′,M_B}(y). The "goodness" of the model M based on the intrinsic priors is given by its posterior probability,

p^I(M|y, M) = BF^I_{M,M_B}(y) π(M|M) / Σ_{M′∈M} BF^I_{M′,M_B}(y) π(M′|M).   (3–3)
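Equations 3–2 and 3–3 can be evaluated numerically once R²_M and the model dimensions are available. The sketch below is an assumed illustration using one-dimensional quadrature (it is not part of the original text), showing one way to compute the intrinsic Bayes factors and the resulting posterior model probabilities.

```python
import numpy as np
from scipy.integrate import quad

def intrinsic_bf(R2, pM, pB, n):
    """Intrinsic Bayes factor of model M (size pM, R2 vs. the base model) against M_B (size pB), eq. 3-2."""
    def integrand(theta):
        s = np.sin(np.pi * theta / 2.0) ** 2 * (pM + 1)
        denom = n + s / (1.0 - R2)
        return ((n + s) / denom) ** ((n - pM) / 2.0) * (s / denom) ** ((pM - pB) / 2.0)
    val, _ = quad(integrand, 0.0, 1.0)
    return (1.0 - R2) ** (-(n - pB) / 2.0) * val

def posterior_probs(bfs, prior):
    """Equation 3-3: normalize BF_{M,MB} * pi(M) over the model space."""
    w = np.asarray(bfs) * np.asarray(prior)
    return w / w.sum()
```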

It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al. 2009; Giron et al. 2010). In the context of well-formulated models, the true model M_T is the smallest well-formulated model M ∈ M such that α ∈ M if β_α ≠ 0. If M_T is the true model, then the posterior probability of model M_T based on equation (3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors

Scaled mixtures of g-priors place a reference prior on (β_{M_B}, σ²) and a multivariate normal distribution on β in M \ M_B, that is, normal with mean 0 and precision matrix

(q_M w / (n σ²)) Z′_M (I − H_0) Z_M,

where H_0 is the hat matrix associated with Z_{M_B}. The prior is completed by a prior on w and a choice of scaling q_M, which is set at |M| + 1 to account for the minimal sample size of M. Under these assumptions, the Bayes factor for M to M_B is given by

BF_{M,M_B}(y) = (1 − R²_M)^{−(n−|M_B|)/2} ∫ [ (n + w(|M|+1)) / (n + w(|M|+1)/(1 − R²_M)) ]^{(n−|M|)/2} [ (w(|M|+1)) / (n + w(|M|+1)/(1 − R²_M)) ]^{(|M|−|M_B|)/2} π(w) dw.

We consider the following priors on w. The intrinsic prior is π(w) = Beta(w; 0.5, 0.5), which is only defined for w ∈ (0, 1). A version of the Zellner-Siow prior is given by w ~ Gamma(0.5, 0.5), which produces a multivariate Cauchy distribution on β. A family of hyper-g priors is defined by π(w) ∝ w^{−1/2}(β + w)^{−(α+1)/2}, which have Cauchy-like tails but produce more shrinkage than the Cauchy prior.


3.3 Objective Bayes Occupancy Model Selection

As mentioned before, Bayesian inferential approaches for ecological models are lacking. In particular, there exists a need for suitable objective and automatic Bayesian testing procedures and software implementations that explore thoroughly the model space considered. With this goal in mind, in this section we develop an objective, intrinsic and fully automatic Bayesian model selection methodology for single season site-occupancy models. We refer to this method as automatic and objective given that, in its implementation, no hyperparameter tuning is required and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).

An inferential method for the occupancy problem is possible using the intrinsic approach, given that we are able to link intrinsic-Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single season probit occupancy model through the hierarchy

y_ij | z_i, w_ij ~ Bernoulli(z_i I_{w_ij>0})
w_ij | λ ~ N(q_ij'λ, 1)
z_i | v_i ~ Bernoulli(I_{v_i>0})
v_i | α ~ N(x_i'α, 1),

it is possible to solve the selection problem on the latent scale variables w_ij and v_i, and to use those results at the level of the occupancy and detection processes.

In what follows, first we provide some necessary notation. Then a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated in a model selection algorithm for site-occupancy data. Although the priors on the model space are not discussed in this Chapter, the software and methods developed have different choices of model priors built in.

3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well. Namely, presence will be denoted by z, detection by y, their corresponding latent processes are v and w, and the model parameters are denoted by α and λ. However, some additional notation is also necessary. Let M_0 = {M_{0y}, M_{0z}} denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models M_{0y} and M_{0z} include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including some predictors important to the researcher that should be included in every model.

Furthermore, let the sets [K_z] = {1, 2, ..., K_z} and [K_y] = {1, 2, ..., K_y} index the covariates considered for the variable selection procedure for the presence and detection processes, respectively. That is, these sets denote the covariates that can be added to the base models in M_0, or removed from the largest possible models considered, M_{Fz} and M_{Fy}, which we will refer to as the "full" models. The model space can then be represented by the Cartesian product of subsets Ay ⊆ [K_y] and Az ⊆ [K_z]. The entire model space is populated by models of the form M_A = {M_{Ay}, M_{Az}} ∈ M = M_y × M_z, with M_{Ay} ∈ M_y and M_{Az} ∈ M_z.

For the presence process z, the design matrix for model M_{Az} is given by the block matrix X_{Az} = (X_0 | X_{r,A}); X_0 corresponds to the design matrix of the base model – which is such that M_{0z} ⊆ M_{Az} ∈ M_z for all Az ⊆ [K_z] – and X_{r,A} corresponds to the submatrix that contains the covariates indexed by Az. Analogously, for the detection process y, the design matrix is given by Q_{Ay} = (Q_0 | Q_{r,A}). Similarly, the coefficients for models M_{Az} and M_{Ay} are given by α_A = (α_0', α_{r,A}')' and λ_A = (λ_0', λ_{r,A}')'.

With these elements in place, the model selection problem consists of finding subsets of covariates indexed by A = {Az, Ay} that have a high posterior probability given the detection and occupancy processes. This is equivalent to finding models with high posterior odds when compared to a suitable base model. These posterior odds are given by

p(M_A|y, z) / p(M_0|y, z) = m(y, z|M_A)π(M_A) / (m(y, z|M_0)π(M_0)) = BF_{M_A,M_0}(y, z) π(M_A)/π(M_0).

Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem in the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that solely depends on the latent level variables used to data-augment the problem. We will, however, focus on the latter approach, as this yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.

At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for the simple probit regression. In the occupancy setting, the argument is

p(M_A|y, z, w, v) = m(y, z, v, w|M_A)π(M_A) / m(y, z, w, v)
= f_{yz}(y, z|w, v)(∫ f_{vw}(v, w|α, λ, M_A)π_{αλ}(α, λ|M_A)d(α, λ)) π(M_A) / [f_{yz}(y, z|w, v) Σ_{M*∈M}(∫ f_{vw}(v, w|α, λ, M*)π_{αλ}(α, λ|M*)d(α, λ)) π(M*)]
= m(v|M_{Az}) m(w|M_{Ay}) π(M_A) / (m(v) m(w))
∝ m(v|M_{Az}) m(w|M_{Ay}) π(M_A),   (3–4)

where

1. f_{yz}(y, z|w, v) = ∏_{i=1}^{N} I_{v_i>0}^{z_i} I_{v_i≤0}^{1−z_i} ∏_{j=1}^{J} (z_i I_{w_ij>0})^{y_ij} (1 − z_i I_{w_ij>0})^{1−y_ij},

2. f_{vw}(v, w|α, λ, M_A) = (∏_{i=1}^{N} φ(v_i; x_i'α_{M_{Az}}, 1)) (∏_{i=1}^{N} ∏_{j=1}^{J_i} φ(w_ij; q_ij'λ_{M_{Ay}}, 1)), where the first factor is f(v|α_{r,A}, α_0, M_{Az}) and the second is f(w|λ_{r,A}, λ_0, M_{Ay}), and

3. π_{αλ}(α, λ|M_A) = π_α(α|M_{Az}) π_λ(λ|M_{Ay}).

This result implies that, once the occupancy and detection indicators are conditioned on the latent processes v and w respectively, the model posterior probabilities only depend on the latent variables. Hence, in this case, the model selection problem is driven by the posterior odds

p(M_A|y, z, w, v) / p(M_0|y, z, w, v) = [m(w, v|M_A) / m(w, v|M_0)] [π(M_A) / π(M_0)],   (3–5)

where m(w, v|M_A) = m(w|M_{Ay}) · m(v|M_{Az}), with

m(v|M_{Az}) = ∫∫ f(v|α_{r,A}, α_0, M_{Az}) π(α_{r,A}|α_0, M_{Az}) π(α_0) dα_{r,A} dα_0   (3–6)

m(w|M_{Ay}) = ∫∫ f(w|λ_{r,A}, λ_0, M_{Ay}) π(λ_{r,A}|λ_0, M_{Ay}) π(λ_0) dλ_0 dλ_{r,A}   (3–7)

3.3.2 Intrinsic Priors for the Occupancy Problem

In general, the intrinsic priors, as defined by Moreno et al. (1998), use the functional form of the response to inform their construction, assuming some preliminary prior distribution, proper or improper, on the model parameters. For our purposes, we assume noninformative improper priors for the parameters, denoted by π^N(·|·). Specifically, the intrinsic priors π^{IP}(θ_{M*}|M*), for a vector of parameters θ_{M*} corresponding to model M* ∈ {M_0, M} ⊂ M, for a response vector s with probability density (or mass) function f(s|θ_{M*}), are defined by

π^{IP}(θ_{M_0}|M_0) = π^N(θ_{M_0}|M_0)
π^{IP}(θ_M|M) = π^N(θ_M|M) ∫ [m(s̃|M_0)/m(s̃|M)] f(s̃|θ_M, M) ds̃,

where s̃ is a theoretical training sample.

In what follows, whenever it is clear from the context, in an attempt to simplify the notation, M_A will be used to refer to M_{Az} or M_{Ay}, and A will denote Az or Ay. To derive the parameter priors involved in equations 3–6 and 3–7 using the objective intrinsic prior strategy, we start by assuming flat priors π^N(α_A|M_A) ∝ c_A and π^N(λ_A|M_A) ∝ d_A, where c_A and d_A are unknown constants.

The intrinsic prior for the parameters associated with the occupancy process, α_A, conditional on model M_A, is

π^{IP}(α_A|M_A) = π^N(α_A|M_A) ∫ [m(ṽ|M_0)/m(ṽ|M_A)] f(ṽ|α_A, M_A) dṽ,

where the marginals m(ṽ|M_j), with j ∈ {A, 0}, are obtained by solving the analogue of equation 3–6 for the (theoretical) training sample ṽ. These marginals are given by

m(ṽ|M_j) = c_j (2π)^{−(p_{Az}−p_j)/2} |X̃_j'X̃_j|^{−1/2} exp[−½ ṽ'(I − H̃_j)ṽ].

The training sample ṽ has dimension p_{Az} = |M_{Az}|, that is, the total number of parameters in model M_{Az}. Note that, without ambiguity, we use |·| to denote both the cardinality of a set and the determinant of a matrix. The design matrix X̃_A corresponds to the training sample ṽ and is chosen such that X̃_A'X̃_A = (p_{Az}/N) X_A'X_A (Leon-Novelo et al. 2012), and H̃_j is the corresponding hat matrix.

Replacing m(ṽ|M_A) and m(ṽ|M_0) in π^{IP}(α_A|M_A) and solving the integral with respect to the theoretical training sample ṽ, we have

π^{IP}(α_A|M_A) = c_A ∫ ((2π)^{−(p_{Az}−p_{0z})/2} (c_0/c_A) e^{−½ ṽ'((I−H̃_0)−(I−H̃_A))ṽ} |X̃_A'X̃_A|^{1/2} / |X̃_0'X̃_0|^{1/2}) × ((2π)^{−p_{Az}/2} e^{−½ (ṽ−X̃_Aα_A)'(ṽ−X̃_Aα_A)}) dṽ
= c_0 (2π)^{−(p_{Az}−p_{0z})/2} |X̃_{r,A}'X̃_{r,A}|^{1/2} 2^{−(p_{Az}−p_{0z})/2} exp[−½ α_{r,A}'(½ X̃_{r,A}'X̃_{r,A}) α_{r,A}]
= π^N(α_0) × N(α_{r,A} | 0, 2·(X̃_{r,A}'X̃_{r,A})^{−1})   (3–8)

Analogously, the intrinsic prior for the parameters associated with the detection process is

π^{IP}(λ_A|M_A) = d_0 (2π)^{−(p_{Ay}−p_{0y})/2} |Q̃_{r,A}'Q̃_{r,A}|^{1/2} 2^{−(p_{Ay}−p_{0y})/2} exp[−½ λ_{r,A}'(½ Q̃_{r,A}'Q̃_{r,A}) λ_{r,A}]
= π^N(λ_0) × N(λ_{r,A} | 0, 2·(Q̃_{r,A}'Q̃_{r,A})^{−1})   (3–9)

In short, the intrinsic priors for α_A = (α_0', α_{r,A}')' and λ_A = (λ_0', λ_{r,A}')' are the product of a reference prior on the parameters of the base model and a normal density on the parameters indexed by Az and Ay, respectively.

3.3.3 Model Posterior Probabilities

We now derive the expressions involved in the calculations of the model posterior probabilities. First, recall that p(M_A|y, z, w, v) ∝ m(w, v|M_A)π(M_A). Hence, determining this posterior probability only requires calculating m(w, v|M_A).

Note that, since w and v are independent, obtaining the model posteriors from expression 3–4 reduces to finding closed form expressions for the marginals m(v|M_{Az}) and m(w|M_{Ay}), respectively, from equations 3–6 and 3–7. Therefore,

m(w, v|M_A) = ∫∫ f(v, w|α, λ, M_A) π^{IP}(α|M_{Az}) π^{IP}(λ|M_{Ay}) dα dλ.   (3–10)

For the latent variable associated with the occupancy process, plugging the parameter intrinsic prior given by 3–8 into equation 3–6 (recalling that X̃_A'X̃_A = (p_{Az}/N) X_A'X_A) and integrating out α_A yields

m(v|M_A) = ∫∫ c_0 N(v | X_0α_0 + X_{r,A}α_{r,A}, I) N(α_{r,A} | 0, 2·(X̃_{r,A}'X̃_{r,A})^{−1}) dα_{r,A} dα_0
= c_0 (2π)^{−n/2} ∫ (p_{Az}/(2N + p_{Az}))^{(p_{Az}−p_{0z})/2} × exp[−½ (v − X_0α_0)'(I − (2N/(2N + p_{Az})) H_{r,Az})(v − X_0α_0)] dα_0
= c_0 (2π)^{−(n−p_{0z})/2} (p_{Az}/(2N + p_{Az}))^{(p_{Az}−p_{0z})/2} |X_0'X_0|^{−1/2} × exp[−½ v'(I − H_{0z} − (2N/(2N + p_{Az})) H_{r,Az}) v],   (3–11)

with H_{r,Az} = H_{Az} − H_{0z}, where H_{Az} is the hat matrix for the entire model M_{Az} and H_{0z} is the hat matrix for the base model.

Similarly, the marginal distribution for w is

m(w|M_A) = d_0 (2π)^{−(J−p_{0y})/2} (p_{Ay}/(2J + p_{Ay}))^{(p_{Ay}−p_{0y})/2} |Q_0'Q_0|^{−1/2} × exp[−½ w'(I − H_{0y} − (2J/(2J + p_{Ay})) H_{r,Ay}) w],   (3–12)

where J = Σ_{i=1}^{N} J_i; in other words, J denotes the total number of surveys conducted.

Now, the posteriors for the base model M_0 = {M_{0y}, M_{0z}} are

m(v|M_0) = ∫ c_0 N(v|X_0α_0, I) dα_0 = c_0 (2π)^{−(n−p_{0z})/2} |X_0'X_0|^{−1/2} exp[−½ v'(I − H_{0z}) v]   (3–13)

and

m(w|M_0) = d_0 (2π)^{−(J−p_{0y})/2} |Q_0'Q_0|^{−1/2} exp[−½ w'(I − H_{0y}) w].   (3–14)
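Because the marginals 3–11 to 3–14 are available in closed form, the posterior odds in 3–5 can be computed directly on the log scale. The snippet below sketches this for the occupancy component; it is an assumed illustration, the detection component being identical with Q, w and J in place of X, v and N, and the constants c_0, d_0 canceling in the odds.

```python
import numpy as np

def hat(X):
    """Hat matrix X (X'X)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def log_marginal_v(v, X0, XA, N):
    """log m(v | M_Az) up to the constant c0, following equation 3-11 (XA = (X0 | Xr_A))."""
    p0, pA = X0.shape[1], XA.shape[1]
    H0 = hat(X0)
    HrA = hat(XA) - H0                                # H_{r,Az} = H_Az - H_0z
    shrink = 2.0 * N / (2.0 * N + pA)
    quad_form = v @ (np.eye(len(v)) - H0 - shrink * HrA) @ v
    return (-(len(v) - p0) / 2.0 * np.log(2 * np.pi)
            + (pA - p0) / 2.0 * np.log(pA / (2.0 * N + pA))
            - 0.5 * np.linalg.slogdet(X0.T @ X0)[1]
            - 0.5 * quad_form)

def log_odds_vs_base(v, X0, XA, N):
    """log posterior odds contribution of the occupancy component, equations 3-11 and 3-13."""
    return log_marginal_v(v, X0, XA, N) - log_marginal_v(v, X0, X0, N)
```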

3.3.4 Model Selection Algorithm

Having the parameter intrinsic priors in place, and knowing the form of the model posterior probabilities, it is finally possible to develop a strategy to conduct model selection for the occupancy framework.

For each of the two components of the model – occupancy and detection – the algorithm first draws the set of active predictors (i.e., Az and Ay), together with their corresponding parameters. This is a reversible jump step, which uses a Metropolis-Hastings correction with proposal distributions given by

q(A^*_z | z_o, z_u^{(t)}, v^{(t)}, M_{A_z}) = \frac{1}{2}\left( p\big(M_{A^*_z} \,\big|\, z_o, z_u^{(t)}, v^{(t)}, \mathcal{M}_z,\, M_{A^*_z} \in L(M_{A_z})\big) + \frac{1}{|L(M_{A_z})|} \right)

q(A^*_y | y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}) = \frac{1}{2}\left( p\big(M_{A^*_y} \,\big|\, y, z_o, z_u^{(t)}, w^{(t)}, \mathcal{M}_y,\, M_{A^*_y} \in L(M_{A_y})\big) + \frac{1}{|L(M_{A_y})|} \right), \qquad (3–15)

where L(M_{A_z}) and L(M_{A_y}) denote the sets of models obtained from adding or removing one predictor at a time from M_{A_z} and M_{A_y}, respectively.
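These neighborhoods can be generated mechanically by toggling one predictor at a time. A minimal sketch (illustrative only; the set-based model representation and the function name are ours, not from the dissertation):

```python
def neighborhood(active, candidates):
    """All models reachable from `active` by adding or removing a single predictor.

    active     : frozenset of predictor labels currently in the model (intercept excluded)
    candidates : iterable containing every predictor eligible for selection
    """
    neighbors = []
    for p in candidates:
        if p in active:
            neighbors.append(active - {p})   # remove one predictor
        else:
            neighbors.append(active | {p})   # add one predictor
    return neighbors

# Example: with candidates x1, x2, x3 and current model {x1},
# L(M) = [{}, {x1, x2}, {x1, x3}]; this list is used both to evaluate the
# proposal q(.) in 3-15 and to normalize its uniform component 1/|L(M)|.
```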

To promote mixing, this step is followed by an additional draw from the full conditionals of \alpha and \lambda. The densities p(\alpha_0|\cdot), p(\alpha_{r,A}|\cdot), p(\lambda_0|\cdot), and p(\lambda_{r,A}|\cdot) can be sampled from directly with Gibbs steps. Using the notation a|\cdot to denote the random variable a conditioned on all other parameters and on the data, these densities are given by:

• \alpha_0|\cdot \sim N\big((X'_0X_0)^{-1}X'_0 v,\; (X'_0X_0)^{-1}\big);

• \alpha_{r,A}|\cdot \sim N\big(\mu_{\alpha_{r,A}}, \Sigma_{\alpha_{r,A}}\big), where the covariance matrix and mean vector are given by \Sigma_{\alpha_{r,A}} = \frac{2N}{2N+p_{A_z}}(X'_{r,A}X_{r,A})^{-1} and \mu_{\alpha_{r,A}} = \Sigma_{\alpha_{r,A}} X'_{r,A} v;

• \lambda_0|\cdot \sim N\big((Q'_0Q_0)^{-1}Q'_0 w,\; (Q'_0Q_0)^{-1}\big); and

• \lambda_{r,A}|\cdot \sim N\big(\mu_{\lambda_{r,A}}, \Sigma_{\lambda_{r,A}}\big), analogously, with \Sigma_{\lambda_{r,A}} = \frac{2J}{2J+p_{A_y}}(Q'_{r,A}Q_{r,A})^{-1} and \mu_{\lambda_{r,A}} = \Sigma_{\lambda_{r,A}} Q'_{r,A} w.
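For concreteness, the conjugate draws above can be coded directly. The sketch below is our own illustration (variable names are hypothetical); it performs one Gibbs update of (\alpha_0, \alpha_{r,A}) given the current latent vector v.

```python
import numpy as np

rng = np.random.default_rng()

def draw_alpha(v, X0, Xr, N):
    """One Gibbs update of (alpha_0, alpha_rA) from the full conditionals above."""
    # alpha_0 | . ~ N((X0'X0)^{-1} X0'v, (X0'X0)^{-1})
    V0 = np.linalg.inv(X0.T @ X0)
    alpha0 = rng.multivariate_normal(V0 @ X0.T @ v, V0)

    # alpha_rA | . ~ N(mu, Sigma), Sigma = (2N/(2N + p_Az)) (Xr'Xr)^{-1}
    pAz = X0.shape[1] + Xr.shape[1]
    Sigma = (2 * N / (2 * N + pAz)) * np.linalg.inv(Xr.T @ Xr)
    mu = Sigma @ Xr.T @ v
    alpha_rA = rng.multivariate_normal(mu, Sigma)
    return alpha0, alpha_rA

# The detection-component draws for (lambda_0, lambda_rA) are identical in form,
# with (Q0, Qr, J, w) in place of (X0, Xr, N, v).
```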

Finally, Gibbs sampling steps are also available for the unobserved occupancy indicators z_u and for the corresponding latent variables v and w. The full conditional posterior densities for z_u^{(t+1)}, v^{(t+1)}, and w^{(t+1)} are those introduced in Chapter 2 for the single-season probit model.

The following steps summarize the stochastic search algorithm:

1. Initialize A_y^{(0)}, A_z^{(0)}, z_u^{(0)}, v^{(0)}, w^{(0)}, \alpha_0^{(0)}, \lambda_0^{(0)}.

2. Sample the model indices and corresponding parameters:

(a) Draw simultaneously
• A^*_z \sim q(A_z | z_o, z_u^{(t)}, v^{(t)}, M_{A_z}),
• \alpha^*_0 \sim p(\alpha_0 | M_{A^*_z}, z_o, z_u^{(t)}, v^{(t)}), and
• \alpha^*_{r,A^*} \sim p(\alpha_{r,A} | M_{A^*_z}, z_o, z_u^{(t)}, v^{(t)}).

(b) Accept (M_{A_z}^{(t+1)}, \alpha_0^{(t+1),1}, \alpha_{r,A}^{(t+1),1}) = (M_{A^*_z}, \alpha^*_0, \alpha^*_{r,A^*}) with probability

\delta_z = \min\left(1,\; \frac{p(M_{A^*_z} | z_o, z_u^{(t)}, v^{(t)})}{p(M_{A_z^{(t)}} | z_o, z_u^{(t)}, v^{(t)})} \cdot \frac{q(A_z^{(t)} | z_o, z_u^{(t)}, v^{(t)}, M_{A^*_z})}{q(A^*_z | z_o, z_u^{(t)}, v^{(t)}, M_{A_z})}\right);

otherwise let (M_{A_z}^{(t+1)}, \alpha_0^{(t+1),1}, \alpha_{r,A}^{(t+1),1}) = (M_{A_z^{(t)}}, \alpha_0^{(t),2}, \alpha_{r,A}^{(t),2}).

(c) Sample simultaneously
• A^*_y \sim q(A_y | y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y}),
• \lambda^*_0 \sim p(\lambda_0 | M_{A^*_y}, y, z_o, z_u^{(t)}, w^{(t)}), and
• \lambda^*_{r,A^*} \sim p(\lambda_{r,A} | M_{A^*_y}, y, z_o, z_u^{(t)}, w^{(t)}).

(d) Accept (M_{A_y}^{(t+1)}, \lambda_0^{(t+1),1}, \lambda_{r,A}^{(t+1),1}) = (M_{A^*_y}, \lambda^*_0, \lambda^*_{r,A^*}) with probability

\delta_y = \min\left(1,\; \frac{p(M_{A^*_y} | y, z_o, z_u^{(t)}, w^{(t)})}{p(M_{A_y^{(t)}} | y, z_o, z_u^{(t)}, w^{(t)})} \cdot \frac{q(A_y^{(t)} | y, z_o, z_u^{(t)}, w^{(t)}, M_{A^*_y})}{q(A^*_y | y, z_o, z_u^{(t)}, w^{(t)}, M_{A_y})}\right);

otherwise let (M_{A_y}^{(t+1)}, \lambda_0^{(t+1),1}, \lambda_{r,A}^{(t+1),1}) = (M_{A_y^{(t)}}, \lambda_0^{(t),2}, \lambda_{r,A}^{(t),2}).

3. Sample base model parameters:

(a) Draw \alpha_0^{(t+1),2} \sim p(\alpha_0 | M_{A_z^{(t+1)}}, z_o, z_u^{(t)}, v^{(t)}).

(b) Draw \lambda_0^{(t+1),2} \sim p(\lambda_0 | M_{A_y^{(t+1)}}, y, z_o, z_u^{(t)}, v^{(t)}).

4. To improve mixing, resample the model coefficients that are not in the base model but are in M_A:

(a) Draw \alpha_{r,A}^{(t+1),2} \sim p(\alpha_{r,A} | M_{A_z^{(t+1)}}, z_o, z_u^{(t)}, v^{(t)}).

(b) Draw \lambda_{r,A}^{(t+1),2} \sim p(\lambda_{r,A} | M_{A_y^{(t+1)}}, y, z_o, z_u^{(t)}, v^{(t)}).

5. Sample latent and missing (unobserved) variables:

(a) Sample z_u^{(t+1)} \sim p(z_u | M_{A_z^{(t+1)}}, y, \alpha_{r,A}^{(t+1),2}, \alpha_0^{(t+1),2}, \lambda_{r,A}^{(t+1),2}, \lambda_0^{(t+1),2}).

(b) Sample v^{(t+1)} \sim p(v | M_{A_z^{(t+1)}}, z_o, z_u^{(t+1)}, \alpha_{r,A}^{(t+1),2}, \alpha_0^{(t+1),2}).

(c) Sample w^{(t+1)} \sim p(w | M_{A_y^{(t+1)}}, z_o, z_u^{(t+1)}, \lambda_{r,A}^{(t+1),2}, \lambda_0^{(t+1),2}).

3.4 Alternative Formulation

Because the occupancy process is partially observed, it is reasonable to consider the posterior odds in terms of the observed responses, that is, the detections y and the presences at sites where at least one detection takes place. Partitioning the vector of presences into observed and unobserved components, z = (z'_o, z'_u)', and integrating out the unobserved component, the model posterior for M_A can be obtained as

p(M_A | y, z_o) \propto E_{z_u}[m(y, z | M_A)]\, \pi(M_A). \qquad (3–16)

Data-augmenting the model in terms of latent normal variables à la Albert and Chib, the marginals for any model \{M_y, M_z\} = M \in \mathcal{M} of z and y inside the expectation in equation 3–16 can be expressed in terms of the latent variables:

m(y, z|M) = \int_{T(z)} \int_{T(y,z)} m(w, v|M)\, dw\, dv = \left(\int_{T(z)} m(v|M_z)\, dv\right)\left(\int_{T(y,z)} m(w|M_y)\, dw\right), \qquad (3–17)

where T(z) and T(y, z) denote the corresponding truncation regions for v and w, which depend on the values taken by z and y, and

m(v|M_z) = \int f(v|\alpha, M_z)\, \pi(\alpha|M_z)\, d\alpha \qquad (3–18)

m(w|M_y) = \int f(w|\lambda, M_y)\, \pi(\lambda|M_y)\, d\lambda. \qquad (3–19)

The last equality in equation 3–17 is a consequence of the independence of the latent processes v and w. Using expressions 3–18 and 3–19 allows one to embed this model selection problem in the classical linear normal regression setting, where many "objective" Bayesian inferential tools are available. In particular, these expressions facilitate deriving the parameter intrinsic priors (Berger & Pericchi, 1996; Moreno et al., 1998) for this problem. This approach is an extension of the one implemented in Leon-Novelo et al. (2012) for the simple probit regression problem.


Using this alternative approach, all that is left is to integrate m(v|M_{A_z}) and m(w|M_{A_y}) over their corresponding truncation regions T(z) and T(y, z), which yields m(y, z|M_A), and then to obtain the expectation with respect to the unobserved z's. Note, however, that two issues arise. First, such integrals are not available in closed form. Second, calculating the expectation over the limits of integration further complicates things. To address these difficulties, it is possible to express E_{z_u}[m(y, z|M_A)] as

E_{z_u}[m(y, z|M_A)] = E_{z_u}\left[\left(\int_{T(z)} m(v|M_{A_z})\, dv\right)\left(\int_{T(y,z)} m(w|M_{A_y})\, dw\right)\right] \qquad (3–20)

= E_{z_u}\left[\left(\int_{T(z)}\!\int m(v|M_{A_z}, \alpha_0)\, \pi^{IP}(\alpha_0|M_{A_z})\, d\alpha_0\, dv\right) \times \left(\int_{T(y,z)}\!\int m(w|M_{A_y}, \lambda_0)\, \pi^{IP}(\lambda_0|M_{A_y})\, d\lambda_0\, dw\right)\right]

= E_{z_u}\left[\int \underbrace{\left(\int_{T(z)} m(v|M_{A_z}, \alpha_0)\, dv\right)}_{g_1(T(z)|M_{A_z}, \alpha_0)} \pi^{IP}(\alpha_0|M_{A_z})\, d\alpha_0 \;\times\; \int \underbrace{\left(\int_{T(y,z)} m(w|M_{A_y}, \lambda_0)\, dw\right)}_{g_2(T(y,z)|M_{A_y}, \lambda_0)} \pi^{IP}(\lambda_0|M_{A_y})\, d\lambda_0\right]

= E_{z_u}\left[\int g_1(T(z)|M_{A_z}, \alpha_0)\, \pi^{IP}(\alpha_0|M_{A_z})\, d\alpha_0 \times \int g_2(T(y, z)|M_{A_y}, \lambda_0)\, \pi^{IP}(\lambda_0|M_{A_y})\, d\lambda_0\right]

= c_0\, d_0 \int\!\!\int E_{z_u}\left[g_1(T(z)|M_{A_z}, \alpha_0)\, g_2(T(y, z)|M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0,

where the last equality follows from Fubini's theorem, since m(v|M_{A_z}, \alpha_0) and m(w|M_{A_y}, \lambda_0) are proper densities. From 3–20, the posterior odds are

\frac{p(M_A|y, z_o)}{p(M_0|y, z_o)} = \frac{\int\!\int E_{z_u}\left[g_1(T(z)|M_{A_z}, \alpha_0)\, g_2(T(y, z)|M_{A_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0}{\int\!\int E_{z_u}\left[g_1(T(z)|M_{0_z}, \alpha_0)\, g_2(T(y, z)|M_{0_y}, \lambda_0)\right] d\alpha_0\, d\lambda_0} \cdot \frac{\pi(M_A)}{\pi(M_0)}. \qquad (3–21)


3.5 Simulation Experiments

The proposed methodology was tested under 36 different scenarios, in which we evaluate the behavior of the algorithm by varying the number of sites, the number of surveys, the amount of signal in the predictors for the presence component, and finally the amount of signal in the predictors for the detection component.

For each model component the base model is taken to be the intercept-only model, and the full models considered for the presence and the detection have, respectively, 30 and 20 predictors. Therefore the model space contains 2^30 × 2^20 ≈ 1.12 × 10^15 candidate models.

To control the amount of signal in the presence and detection components, values for the model parameters were purposefully chosen so that quantiles 10, 50, and 90 of the occupancy and detection probabilities match pre-specified probabilities. Because presence and detection are binary variables, the amount of signal in each model component is associated with the spread and center of the distribution of the occupancy and detection probabilities, respectively. Low signal levels correspond to occupancy or detection probabilities close to 0.5; high signal levels correspond to probabilities close to 0 or 1. Larger spreads of the distributions of the occupancy and detection probabilities reflect greater heterogeneity among the observations collected, improving the discrimination capability of the model, and vice versa.

Therefore, for the presence component, the parameter values of the true model were chosen to set the median of the occupancy probabilities equal to 0.5. The chosen parameter values also fix quantiles 10 and 90 symmetrically about 0.5 at small (Q^z_{10} = 0.3, Q^z_{90} = 0.7), intermediate (Q^z_{10} = 0.2, Q^z_{90} = 0.8), and large (Q^z_{10} = 0.1, Q^z_{90} = 0.9) distances. For the detection component, the model parameters are obtained to reflect detection probabilities concentrated about low values (Q^y_{50} = 0.2), intermediate values (Q^y_{50} = 0.5), and high values (Q^y_{50} = 0.8), while keeping quantiles 10 and 90 fixed at 0.1 and 0.9, respectively.


Table 3-1. Simulation control parameters, occupancy model selector.

Parameter                          Values considered
N                                  50, 100
J                                  3, 5
(Q^z_{10}, Q^z_{50}, Q^z_{90})     (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9)
(Q^y_{10}, Q^y_{50}, Q^y_{90})     (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9)

There are in total 36 scenarios; these result from crossing all the levels of the simulation control parameters (Table 3-1). Under each of these scenarios, 20 data sets were generated at random. True presence and detection indicators were generated with the probit model formulation from Chapter 2, with the assumed true models M_{T_z} = {1, x2, x15, x16, x22, x28} for the presence and M_{T_y} = {1, q7, q10, q12, q17} for the detection, where the predictors are those included in the randomly generated datasets. In this context, 1 represents the intercept term. Throughout this Section we refer to predictors included in the true models as true predictors, and to those absent from them as false predictors.

The selection procedure was conducted using each one of these data sets with two different priors on the model space: the uniform (equal probability) prior and a multiplicity correcting prior.

The results are summarized through the marginal posterior inclusion probabilities (MPIPs) for each predictor, and also through the five highest posterior probability models (HPM). The MPIP for a given predictor, under a specific scenario and for a particular data set, is defined as

p(\text{predictor is included} \mid y, z, w, v) = \sum_{M \in \mathcal{M}} I_{(\text{predictor} \in M)}\; p(M \mid y, z, w, v, \mathcal{M}). \qquad (3–22)

In addition, we compare the MPIP odds between predictors present in the true model and predictors absent from it. Specifically, we consider the minimum odds of marginal posterior inclusion probabilities for the predictors. Let \vec{\xi} and \xi denote, respectively, a predictor in the true model M_T and a predictor absent from M_T. We define the minimum MPIP odds between the probabilities of true and false predictors as

\text{minOdds}^{MPIP} = \frac{\min_{\vec{\xi} \in M_T}\, p(I_{\vec{\xi}} = 1 \mid \vec{\xi} \in M_T)}{\max_{\xi \notin M_T}\, p(I_{\xi} = 1 \mid \xi \notin M_T)}. \qquad (3–23)
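Given a posterior sample of visited models (or the exact posterior over an enumerable space), both quantities are simple to compute. The following is a sketch under our own assumption that models are stored as sets of predictor labels together with their posterior probabilities; it is not code from the dissertation.

```python
def mpip(model_probs, predictor):
    """Marginal posterior inclusion probability (equation 3-22).

    model_probs : dict mapping frozenset of predictors -> posterior probability
    """
    return sum(p for model, p in model_probs.items() if predictor in model)

def min_odds_mpip(model_probs, true_predictors, all_predictors):
    """Minimum MPIP odds between true and false predictors (equation 3-23)."""
    true_min = min(mpip(model_probs, t) for t in true_predictors)
    false_max = max(mpip(model_probs, f)
                    for f in all_predictors if f not in true_predictors)
    return true_min / false_max
```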

If the variable selection procedure adequately discriminates true and false predictors, minOdds^{MPIP} will take values larger than one. The ability of the method to discriminate between the least probable true predictor and the most probable false predictor worsens as the indicator approaches 0.

3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors

For clarity, in Figures 3-1 through 3-5 only predictors in the true models are labeled, and they are emphasized with a dotted line passing through them. The left-hand-side plots in these figures contain the results for the presence component, and the ones on the right correspond to predictors in the detection component. The results obtained with the uniform model prior correspond to the black lines, and those for the multiplicity correcting prior are in red. In these Figures the MPIPs have been averaged over all datasets from scenarios matching the condition observed.

In Figure 3-1 we contrast the mean MPIPs of the predictors over all datasets from scenarios with 50 sites to the mean MPIPs obtained for the scenarios with 100 sites. Similarly, Figure 3-2 compares the mean MPIPs of scenarios where 3 surveys are performed to those of scenarios having 5 surveys per site. Figures 3-4 and 3-5 show the effect of the different levels of signal considered in the occupancy probabilities and in the detection probabilities.

From these figures, mainly three results can be drawn: (1) the effect of the model prior is substantial, (2) the proposed methods yield MPIPs that clearly separate true predictors from false predictors, and (3) the separation between MPIPs of true predictors and false predictors is noticeably larger in the detection component.


Regardless of the simulation scenario and model component observed, under the uniform prior false predictors obtain a relatively high MPIP. Conversely, the multiplicity correction prior strongly shrinks the MPIP for false predictors towards 0. In the presence component the MPIP for the true predictors is shrunk substantially under the multiplicity prior; however, there remains a clear separation between true and false predictors. In contrast, in the detection component the MPIP for true predictors remains relatively high (Figures 3-1 through 3-5).

[Plot: marginal inclusion probability (0 to 1) for predictors x2, x15, x22, x28 (presence component) and q7, q10, q17 (detection component); legend: Unif N=50, MC N=50, Unif N=100, MC N=100.]
Figure 3-1. Predictor MPIP averaged over scenarios with N=50 and N=100 sites, using uniform (U) and multiplicity correction (MC) priors.

[Plot: marginal inclusion probability (0 to 1) for predictors x2, x15, x22, x28 (presence component) and q7, q10, q17 (detection component); legend: Unif J=3, MC J=3, Unif J=5, MC J=5.]
Figure 3-2. Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per site, using uniform (U) and multiplicity correction (MC) priors.

[Plot: marginal inclusion probability (0 to 1) for predictors x2, x15, x22, x28 (presence component) and q7, q10, q17 (detection component); legend: Unif and MC lines for each combination of N=50, 100 and J=3, 5.]
Figure 3-3. Predictor MPIP averaged over scenarios defined by the interaction between the number of sites and the number of surveys per site, using uniform (U) and multiplicity correction (MC) priors.

[Plot: marginal inclusion probability (0 to 1) for predictors x2, x15, x22, x28 (presence component) and q7, q10, q17 (detection component); legend: U and MC lines for occupancy quantiles (0.3, 0.5, 0.7), (0.2, 0.5, 0.8), (0.1, 0.5, 0.9).]
Figure 3-4. Predictor MPIP averaged over scenarios with equal signal in the occupancy probabilities, using uniform (U) and multiplicity correction (MC) priors.

[Plot: marginal inclusion probability (0 to 1) for predictors x2, x15, x22, x28 (presence component) and q7, q10, q17 (detection component); legend: U and MC lines for detection quantiles (0.1, 0.2, 0.9), (0.1, 0.5, 0.9), (0.1, 0.8, 0.9).]
Figure 3-5. Predictor MPIP averaged over scenarios with equal signal in the detection probabilities, using uniform (U) and multiplicity correction (MC) priors.

In scenarios where more sites were surveyed, the separation between the MPIP of true and false predictors grew in both model components (Figure 3-1). Increasing the number of sites has an effect over both components, given that every time a new site is included, covariate information is added to the design matrix of both the presence and the detection components.

On the other hand, increasing the number of surveys affects the MPIP of predictors in the detection component (Figures 3-2 and 3-3), but has only a marginal effect on predictors of the presence component. This may appear to be counterintuitive; however, increasing the number of surveys only increases the number of observations in the design matrix for the detection, while leaving unaltered the design matrix for the presence. The small changes observed in the MPIP for the presence predictors as J increases are exclusively a result of having additional detection indicators equal to 1 at sites that, with fewer surveys, would only have 0-valued detections.

From Figure 3-3 it is clear that, for the presence component, the effect of the number of sites dominates the behavior of the MPIP, especially when using the multiplicity correction priors. In the detection component the MPIP is influenced by both the number of sites and the number of surveys. The influence of increasing the number of surveys is larger when considering a smaller number of sites, and vice versa.

Regarding the effect of the distribution of the occupancy probabilities, we observe that mostly the detection component is affected. There is stronger discrimination between true and false predictors as the distribution has higher variability (Figure 3-4). This is consistent with intuition, since having the presence probabilities more concentrated about 0.5 implies that the predictors do not vary much from one site to the next, whereas having the occupancy probabilities more spread out has the opposite effect.

Finally, consider the effect of concentrating the detection probabilities about high or low values. For predictors in the detection component, the separation between MPIP of true and false predictors is larger in scenarios where the distribution of the detection probability is centered about 0.2 or 0.8 than in those scenarios where this distribution is centered about 0.5 (where the signal of the predictors is weakest). For predictors in the presence component, having the detection probabilities centered at higher values slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and reduces those of false predictors.

Table 3-2. Comparison of average minOdds^{MPIP} under scenarios having different numbers of sites (N=50, N=100) and different numbers of surveys per site (J=3, J=5), for the presence and detection components, using uniform and multiplicity correction priors.

                          Sites               Surveys
Comp        π(M)     N=50     N=100       J=3      J=5
Presence    Unif     1.12     1.31        1.19     1.24
            MC       3.20     8.46        4.20     6.74
Detection   Unif     2.03     2.64        2.11     2.57
            MC       21.15    32.46       21.39    32.52

Table 3-3. Comparison of average minOdds^{MPIP} for different levels of signal considered in the occupancy and detection probabilities, for the presence and detection components, using uniform and multiplicity correction priors.

                          (Q^z_{10}, Q^z_{50}, Q^z_{90})                    (Q^y_{10}, Q^y_{50}, Q^y_{90})
Comp        π(M)     (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)    (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence    Unif     1.05           1.20           1.34             1.10           1.23           1.24
            MC       2.02           4.55           8.05             2.38           6.19           6.40
Detection   Unif     2.34           2.34           2.30             2.57           2.00           2.38
            MC       25.37          20.77          25.28            29.33          18.52          28.49

The separation between the MPIP of true and false predictors is even more noticeable in Tables 3-2 and 3-3, where the minimum MPIP odds between true and false predictors are shown. Under every scenario the value of minOdds^{MPIP} (as defined in 3–23) was greater than 1, implying that, on average, even the lowest MPIP for a true predictor is higher than the maximum MPIP for a false predictor. In both components of the model the minOdds^{MPIP} are markedly larger under the multiplicity correction prior and increase with the number of sites and with the number of surveys.

For the presence component, increasing the signal in the occupancy probabilities or having the detection probabilities concentrate about higher values has a positive and considerable effect on the magnitude of the odds. For the detection component these odds are particularly high, especially under the multiplicity correction prior. Also, having the distribution of the detection probabilities center about low or high values increases the minOdds^{MPIP}.

3.5.2 Summary Statistics for the Highest Posterior Probability Model

Tables 3-4 through 3-7 show the number of true predictors that are included in the HPM (True +) and the number of false predictors excluded from it (True −). The mean percentages observed in these Tables provide one clear message: the highest probability models chosen with either model prior commonly differ from the corresponding true models. The multiplicity correction prior's strong shrinkage only allows a few true predictors to be selected, but at the same time it prevents any false predictors from being included in the HPM. On the other hand, the uniform prior includes in the HPM a larger proportion of true predictors, but at the expense of also introducing a large number of false predictors. This situation is exacerbated in the presence component, but also occurs to a lesser extent in the detection component.

Table 3-4. Comparison between scenarios with 50 and 100 sites in terms of the average percentage of true positive and true negative terms over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                          True +              True −
Comp        π(M)     N=50     N=100       N=50     N=100
Presence    Unif     0.57     0.63        0.51     0.55
            MC       0.06     0.13        1.00     1.00
Detection   Unif     0.77     0.85        0.87     0.93
            MC       0.49     0.70        1.00     1.00

Having more sites or surveys improves the inclusion of true predictors and the exclusion of false ones in the HPM for both the presence and detection components (Tables 3-4 and 3-5). On the other hand, if the distribution of the occupancy probabilities is more

Table 3-5. Comparison between scenarios with 3 and 5 surveys per site in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                          True +            True −
Comp        π(M)     J=3      J=5        J=3      J=5
Presence    Unif     0.59     0.61       0.52     0.54
            MC       0.08     0.10       1.00     1.00
Detection   Unif     0.78     0.85       0.87     0.92
            MC       0.50     0.68       1.00     1.00

spread out, the HPM includes more true predictors and fewer false ones in the presence component. In contrast, the effect of the spread of the occupancy probabilities on the detection HPM is negligible (Table 3-6). Finally, there is a positive relationship between the location of the median of the detection probabilities and the number of correctly classified true and false predictors for the presence. The HPM in the detection part of the model responds positively to low and high values of the median detection probability (increased signal levels) in terms of correctly classified true and false predictors (Table 3-7).

Table 3-6. Comparison between scenarios with different levels of signal in the occupancy component in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                   True +                                          True −
Comp        π(M)     (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)    (0.3,0.5,0.7)  (0.2,0.5,0.8)  (0.1,0.5,0.9)
Presence    Unif     0.55           0.61           0.64             0.50           0.54           0.55
            MC       0.02           0.08           0.18             1.00           1.00           1.00
Detection   Unif     0.81           0.82           0.81             0.90           0.89           0.89
            MC       0.57           0.61           0.59             1.00           1.00           1.00

3.6 Case Study: Blue Hawker Data Analysis

During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze the distribution of the blue hawker, Aeshna cyanea (Odonata: Aeshnidae), a common dragonfly in Switzerland. Given that Switzerland is a small and mountainous country,

Table 3-7. Comparison between scenarios with different levels of signal in the detection component in terms of the percentage of true positive and true negative predictors, averaged over the highest probability models, for the presence and the detection components, using uniform and multiplicity correcting priors on the model space.

                                   True +                                          True −
Comp        π(M)     (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)    (0.1,0.2,0.9)  (0.1,0.5,0.9)  (0.1,0.8,0.9)
Presence    Unif     0.59           0.59           0.62             0.51           0.54           0.54
            MC       0.06           0.10           0.11             1.00           1.00           1.00
Detection   Unif     0.89           0.77           0.78             0.91           0.87           0.91
            MC       0.70           0.48           0.59             1.00           1.00           1.00

there is large variation in its topography and physio-geography; as such, elevation is a good candidate covariate to predict species occurrence at a large spatial scale. It can be used as a proxy for habitat type, intensity of land use, and temperature, as well as for some biotic factors (Kery et al., 2010).

Repeated visits to 1-ha pixels took place to obtain the corresponding detection history. In addition to the survey outcome, the x and y coordinates, thermal level, the date of the survey, and the elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which takes place between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranges from 1 to 22 within each survey year.

Kery et al. (2010) summarize the results of this effort using AIC-based model comparisons: first, by following a backwards elimination approach for the detection process while keeping the occupancy component fixed at the most complex model, and then, for the presence component, choosing among a group of three models while using the detection model already selected. In our analysis of this dataset, for the detection and the presence we consider as the full models those used in Kery et al. (2010), namely

\Phi^{-1}(\psi) = \alpha_0 + \alpha_1\,\text{year} + \alpha_2\,\text{elev} + \alpha_3\,\text{elev}^2 + \alpha_4\,\text{elev}^3

\Phi^{-1}(p) = \lambda_0 + \lambda_1\,\text{year} + \lambda_2\,\text{elev} + \lambda_3\,\text{elev}^2 + \lambda_4\,\text{elev}^3 + \lambda_5\,\text{date} + \lambda_6\,\text{date}^2,

where year = I_{\{year=2000\}}.
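Building the full-model design matrices for such an analysis amounts to forming the year indicator and appending the polynomial terms in elevation and date. A sketch with pandas, assuming a data frame with hypothetical columns `year`, `elev`, and `date` (the dissertation does not specify its preprocessing code); in the actual model the occupancy matrix has one row per site while the detection matrix has one row per survey, a distinction the sketch glosses over.

```python
import pandas as pd

def full_design_matrices(df):
    """Occupancy (psi) and detection (p) design matrices for the full models above."""
    year = (df["year"] == 2000).astype(float)                  # I{year = 2000}
    elev = (df["elev"] - df["elev"].mean()) / df["elev"].std()  # standardized elevation
    date = (df["date"] - df["date"].mean()) / df["date"].std()  # standardized survey date

    X_psi = pd.DataFrame({"int": 1.0, "year": year, "elev": elev,
                          "elev2": elev**2, "elev3": elev**3})
    Q_p = pd.DataFrame({"int": 1.0, "year": year, "elev": elev,
                        "elev2": elev**2, "elev3": elev**3,
                        "date": date, "date2": date**2})
    return X_psi, Q_p
```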

The model spaces for these data contain 2^6 = 64 and 2^4 = 16 models, respectively, for the detection and occupancy components. That is, in total the model space contains 2^{4+6} = 1,024 models. Although this model space can be enumerated entirely, for illustration we implemented the algorithm from Section 3.3.4, generating 10,000 draws from the Gibbs sampler. Each one of the models sampled was chosen from the set of models that could be reached by changing the state of a single term in the current model (to inclusion or exclusion, accordingly). This allows a more thorough exploration of the model space because, for each of the 10,000 models drawn, the posterior probabilities of many more models can be observed. Below, the labels for the predictors are followed by either "z" or "y" to indicate the component they pertain to. Finally, using the results from the model selection procedure, we conducted a validation step to determine the predictive accuracy of the HPMs and of the median probability models (MPMs). The performance of these models is then contrasted with that of the model ultimately selected by Kery et al. (2010).

3.6.1 Results: Variable Selection Procedure

The model finally chosen for the presence component in Kery et al. (2010) was not found among the five highest probability models under either model prior (Table 3-8). Moreover, the year indicator was never chosen under the multiplicity correcting prior, hinting that this term might correspond to a falsely identified predictor under the uniform prior. Results in Table 3-10 support this claim: the marginal posterior inclusion probability for the year predictor is 7% under the multiplicity correction prior. The multiplicity correction prior concentrates the model posterior probability mass more densely in the highest ranked models (90% of the mass is in the top five models) than the uniform prior (for which the top five models account for 40% of the mass).

For the detection component, the HPM under both priors is the intercept-only model, which we represent in Table 3-9 with a blank label. In both cases this model obtains very

Table 3-8. Posterior probability for the five highest probability models in the presence component of the blue hawker data.

Uniform model prior
Rank   M_z selected                p(M_z|y)
1      yrz + elevz                 0.10
2      yrz + elevz + elevz3        0.08
3      elevz2 + elevz3             0.08
4      yrz + elevz2                0.07
5      yrz + elevz3                0.07

Multiplicity correcting model prior
Rank   M_z selected                p(M_z|y)
1      elevz + elevz3              0.53
2      (intercept only)            0.15
3      elevz + elevz2              0.09
4      elevz2                      0.06
5      elevz + elevz2 + elevz3     0.05

high posterior probability. The terms contained in the cubic polynomial for the elevation appear to carry some relevant information; however, this conflicts with the MPIPs observed in Table 3-11, which under both model priors are relatively low (< 20% with the uniform prior and ≤ 4% with the multiplicity correcting prior).

Table 3-9. Posterior probability for the five highest probability models in the detection component of the blue hawker data.

Uniform model prior
Rank   M_y selected       p(M_y|y)
1      (intercept only)   0.45
2      elevy3             0.06
3      elevy2             0.05
4      elevy              0.05
5      yry                0.04

Multiplicity correcting model prior
Rank   M_y selected       p(M_y|y)
1      (intercept only)   0.86
2      elevy3             0.02
3      datey2             0.02
4      elevy2             0.02
5      yry                0.02

Finally, it is possible to use the MPIPs to obtain the median probability model (MPM), which contains the terms with an MPIP higher than 50%. For the occupancy process (Table 3-10), under the uniform prior the model including the year, the elevation, and the elevation cubed is obtained. The MPM with the multiplicity correction prior coincides with the HPM from this prior. The MPM chosen for the detection component (Table 3-11) under both priors is the intercept-only model, coinciding again with the HPM.

Given the outcomes of the simulation studies from Section 3.5, especially those pertaining to the detection component, the results in Table 3-11 appear to indicate that none of the predictors considered belong to the true model, especially when considering

Table 3-10. MPIP, presence component.

Predictor   p(predictor ∈ M_{T_z} | y, z, w, v)
            Unif     MultCorr
yrz         0.53     0.07
elevz       0.51     0.73
elevz2      0.45     0.23
elevz3      0.50     0.67

Table 3-11. MPIP, detection component.

Predictor   p(predictor ∈ M_{T_y} | y, z, w, v)
            Unif     MultCorr
yry         0.19     0.03
elevy       0.18     0.03
elevy2      0.18     0.03
elevy3      0.19     0.04
datey       0.16     0.03
datey2      0.15     0.04

those derived with the multiplicity correction prior. On the other hand, for the presence component (Table 3-10), there is an indication that terms related to the cubic polynomial in elevz can explain the occupancy patterns.

3.6.2 Validation for the Selection Procedure

Approximately half of the sites were selected at random for training (i.e., for model selection and parameter estimation), and the remaining half were used as test data. In the previous section we observed that, using the marginal posterior inclusion probabilities of the predictors, our method effectively separates predictors in the true model from those that are not in it. However, in Tables 3-10 and 3-11 this separation is only clear for the presence component using the multiplicity correction prior.

Therefore, in the validation procedure we observe the misclassification rates for the detections using the following models: (1) the model ultimately recommended in Kery et al. (2010) (yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2); (2) the highest probability model (HPM) with a uniform prior (yrz+elevz); (3) the HPM with a multiplicity correcting prior (elevz+elevz3); (4) the median probability model (MPM), that is, the model including only predictors with an MPIP larger than 50%, with the uniform prior (yrz+elevz+elevz3); and finally (5) the MPM with a multiplicity correction prior (elevz+elevz3, the same as the HPM with multiplicity correction).

We must emphasize that the models resulting from the implementation of our model selection procedure used exclusively the training dataset. On the other hand, the model in Kery et al. (2010) was chosen to minimize the prediction error of the complete data.

Because this model was obtained from the full dataset, results derived from it can only be considered a lower bound for the prediction errors.

The benchmark misclassification error rate for true 1's is high (close to 70%). However, the misclassification rate for true 0's, which account for most of the responses, is less pronounced (15%). Overall, the performance of the selected models is comparable. They yield considerably worse results than the benchmark for the true 1's, but achieve rates close to the benchmark for the true zeros. Pooling together the results for true ones and true zeros, the selected models with either prior have misclassification rates close to 30%. The benchmark model performs comparably, with a joint misclassification error of 23% (Table 3-12).

Table 3-12. Mean misclassification rate for HPMs and MPMs using uniform and multiplicity correction model priors.

Model                              Predictors                                            True 1   True 0   Joint
Benchmark (Kery et al. 2010)       yrz+elevz+elevz2+elevz3 + elevy+elevy2+datey+datey2   0.66     0.15     0.23
HPM Unif                           yrz+elevz                                             0.83     0.17     0.28
HPM MC / MPM MC                    elevz+elevz3                                          0.82     0.18     0.28
MPM Unif                           yrz+elevz+elevz3                                      0.82     0.18     0.29

3.7 Discussion

In this Chapter we proposed an objective and fully automatic Bayes methodology for the single-season site-occupancy model. The methodology is said to be fully automatic because no hyper-parameter specification is necessary in defining the parameter priors, and objective because it relies on intrinsic priors derived from noninformative priors. The intrinsic priors have been shown to have desirable properties as testing priors. We also propose a fast stochastic search algorithm to explore large model spaces using our model selection procedure.

Our simulation experiments demonstrated the ability of the method to single out the predictors present in the true model when considering the marginal posterior inclusion probabilities of the predictors. For predictors in the true model these probabilities were comparatively larger than for predictors absent from it. Also, the simulations indicated that the method has a greater discrimination capability for predictors in the detection component of the model, especially when using multiplicity correction priors.

Multiplicity correction priors were not described in this Chapter; however, their influence on the selection outcome is significant. This behavior was observed in the simulation experiment and in the analysis of the blue hawker data. Model priors play an essential role: as the number of predictors grows, they are instrumental in controlling the selection of false positive predictors. Additionally, model priors can be used to account for predictor structure in the selection process, which helps both to reduce the size of the model space and to make the selection more robust. These issues are the topic of the next Chapter.

Accounting for the polynomial hierarchy in the predictors within the occupancy context is a straightforward extension of the procedures we describe in Chapter 4. Hence, our next step is to develop efficient software for it. An additional direction we plan to pursue is developing methods for occupancy variable selection in a multivariate setting. This can be used to conduct hypothesis testing in scenarios with varying conditions through time, or in the case where multiple species are co-observed. A final variation we will investigate for this problem is occupancy model selection incorporating random effects.

CHAPTER 4
PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely the most important.
–Sherlock Holmes, A Case of Identity

41 Introduction

In regression problems if a large number of potential predictors is available the

complete model space is too large to enumerate and automatic selection algorithms are

necessary to find informative parsimonious models This multiple testing problem

is difficult and even more so when interactions or powers of the predictors are

considered In the ecological literature models with interactions andor higher order

polynomial terms are ubiquitous (Johnson et al 2013 Kery et al 2010 Zeller et al

2011) given the complexity and non-linearities found in ecological processes Several

model selection procedures even in the classical normal linear setting fail to address

two fundamental issues (1) the model selection outcome is not invariant to affine

transformations when interactions or polynomial structures are found among the

predictors and (2) additional penalization is required to control for false positives as the

model space grows (ie as more covariates are considered)

These two issues motivate the developments presented throughout this Chapter. Building on the results of Chipman (1996), we propose, investigate, and provide

recommendations for three different prior distributions on the model space These

priors help control for test multiplicity while accounting for polynomial structure in the

predictors They improve upon those proposed by Chipman first by avoiding the need

for specific values for the prior inclusion probabilities of the predictors and second

by formulating principled alternatives to introduce additional structure in the model

84

priors Finally we design a stochastic search algorithm that allows fast and thorough

exploration of model spaces with polynomial structure

Having structure in the predictors can determine the selection outcome. As an illustration, consider the model E[y] = \beta_{00} + \beta_{01}x_2 + \beta_{20}x_1^2, where the order-one term x_1 is not present (this choice of subscripts for the coefficients is defined in the following section). Transforming x_1 \mapsto x_1^* = x_1 + c for some c \neq 0, the model becomes E[y] = \beta_{00} + \beta_{01}x_2 + \beta^*_{20}x_1^{*2}. Note that, in terms of the original predictors, x_1^{*2} = x_1^2 + 2c\,x_1 + c^2, implying that this seemingly innocuous transformation of x_1 modifies the column space of the design matrix by including x_1, which was not in the original model. That is, when lower-order terms in the hierarchy are omitted from the model, the column space of the design matrix is not invariant to affine transformations. As the hat matrix depends on the column space, the model's predictive capability is also affected by how the covariates in the model are coded, an undesirable feature for any model selection procedure. To make model selection invariant to affine transformations, the selection must be constrained to the subset of models that respect the hierarchy (Griepentrog et al., 1982; Khuri, 2002; McCullagh & Nelder, 1989; Nelder, 2000; Peixoto, 1987, 1990). These models are known as well-formulated models (WFMs). Succinctly, a model is well-formulated if, for any predictor in the model, every lower-order predictor associated with it is also in the model. The model above is not well-formulated, as it contains x_1^2 but not x_1.
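This loss of invariance is easy to verify numerically. The short sketch below (ours, not from the text) compares hat matrices before and after the shift x_1 \mapsto x_1 + c, for the non-well-formulated model {1, x_2, x_1^2} and for the well-formulated model {1, x_1, x_2, x_1^2}.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2, c = rng.normal(size=50), rng.normal(size=50), 1.7
ones = np.ones(50)

def hat(*cols):
    Z = np.column_stack(cols)
    return Z @ np.linalg.pinv(Z)   # projection onto the column space of Z

# Non-well-formulated {1, x2, x1^2}: the shift changes the column space.
H_bad = hat(ones, x2, x1**2)
H_bad_shift = hat(ones, x2, (x1 + c)**2)
print(np.allclose(H_bad, H_bad_shift))      # False

# Well-formulated {1, x1, x2, x1^2}: the hat matrix is unchanged by the shift.
H_wf = hat(ones, x1, x2, x1**2)
H_wf_shift = hat(ones, x1 + c, x2, (x1 + c)**2)
print(np.allclose(H_wf, H_wf_shift))        # True
```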

WFMs exhibit strong heredity in that all lower order terms dividing higher order

terms in the model must also be included An alternative is to only require weak heredity

(Chipman 1996) which only forces some of the lower terms in the corresponding

polynomial hierarchy to be in the model However Nelder (1998) demonstrated that the

conditions under which weak heredity allows the design matrix to be invariant to affine transformations of the predictors are too restrictive to be useful in practice.

Although this topic appeared in the literature more than three decades ago (Nelder

1977) only recently have modern variable selection techniques been adapted to

account for the constraints imposed by heredity As described in Bien et al (2013)

the current literature on variable selection for polynomial response surface models

can be classified into three broad groups: multi-step procedures (Brusco et al., 2009;

Peixoto 1987) regularized regression methods (Bien et al 2013 Yuan et al 2009)

and Bayesian approaches (Chipman 1996) The methods introduced in this Chapter

take a Bayesian approach towards variable selection for well-formulated models with

particular emphasis on model priors

As mentioned in previous chapters the Bayesian variable selection problem

consists of finding models with high posterior probabilities within a pre-specified model

space \mathcal{M}. The model posterior probability for M \in \mathcal{M} is given by

p(M | y, \mathcal{M}) \propto m(y|M)\, \pi(M|\mathcal{M}). \qquad (4–1)

Model posterior probabilities depend on the prior distribution on the model space

as well as on the prior distributions for the model specific parameters implicitly through

the marginals m(y|M) Priors on the model specific parameters have been extensively

discussed in the literature (Berger amp Pericchi 1996 Berger et al 2001 George 2000

Jeffreys 1961 Kass amp Wasserman 1996 Liang et al 2008 Zellner amp Siow 1980) In

contrast the effect of the prior on the model space has until recently been neglected

A few authors (eg Casella et al (2014) Scott amp Berger (2010) Wilson et al (2010))

have highlighted the relevance of the priors on the model space in the context of multiple

testing Adequately formulating priors on the model space can both account for structure

in the predictors and provide additional control on the detection of false positive terms

In addition, using the popular uniform prior over the model space may lead to the undesirable and "informative" implication of favoring models of size p/2 (where p is the total number of covariates), since this is the most abundant model size contained in the model space.

Variable selection within the model space of well-formulated polynomial models

poses two challenges for automatic objective model selection procedures First the

notion of model complexity takes on a new dimension Complexity is not exclusively

a function of the number of predictors but also depends upon the depth and

connectedness of the associations defined by the polynomial hierarchy Second

because the model space is shaped by such relationships stochastic search algorithms

used to explore the models must also conform to these restrictions

Models without polynomial hierarchy constitute a special case of WFMs where

all predictors are of order one Hence all the methods developed throughout this

Chapter also apply to models with no predictor structure Additionally although our

proposed methods are presented for the normal linear case to simplify the exposition

these methods are general enough to be embedded in many Bayesian selection

and averaging procedures including of course the occupancy framework previously

discussed

In this Chapter first we provide the necessary definitions to characterize the

well-formulated model selection problem Then we proceed to introduce three new prior

structures on the well-formulated model space and characterize their behavior with

simple examples and simulations With the model priors in place we build a stochastic

search algorithm to explore spaces of well-formulated models that relies on intrinsic

priors for the model-specific parameters, though this assumption can be relaxed

to use other mixtures of g-priors Finally we implement our procedures using both

simulated and real data


4.2 Setup for Well-Formulated Models

Suppose that the observations y_i are modeled using the polynomial regression on the covariates x_{i,1}, ..., x_{i,p} given by

y_i = \sum_{\alpha} \beta_{(\alpha_1, ..., \alpha_p)} \prod_{j=1}^{p} x_{i,j}^{\alpha_j} + \epsilon_i, \qquad (4–2)

where \alpha = (\alpha_1, ..., \alpha_p) belongs to \mathbb{N}_0^p, the p-dimensional space of natural numbers including 0, with \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2), and only finitely many \beta_\alpha are allowed to be non-zero. As an illustration, consider a model space that includes polynomial terms incorporating covariates x_{i,1} and x_{i,2} only. The terms x_{i,2}^2 and x_{i,1}^2 x_{i,2} can be represented by \alpha = (0, 2) and \alpha = (2, 1), respectively.

The notation y = Z(X)\beta + \epsilon is used to denote that the observed response y = (y_1, ..., y_n)' is modeled via a polynomial function Z of the original covariates contained in X = (x_1, ..., x_p) (where x_j = (x_{1j}, ..., x_{nj})'), and the coefficients of the polynomial terms are given by \beta. A specific polynomial model M is defined by the set of coefficients \beta_\alpha that are allowed to be non-zero. This definition is equivalent to characterizing M through a collection of multi-indices \alpha \in \mathbb{N}_0^p. In particular, model M is specified by M = \{\alpha^M_1, ..., \alpha^M_{|M|}\} for \alpha^M_k \in \mathbb{N}_0^p, where \beta_\alpha = 0 for \alpha \notin M.

Any particular model M uses a subset X_M of the original covariates X to form the polynomial terms in the design matrix Z_M(X). Without ambiguity, a polynomial model Z_M(X) on X can be identified with a polynomial model Z_M(X_M) on the covariates X_M. The number of terms used by M to model the response y, denoted by |M|, corresponds to the number of columns of Z_M(X_M). The coefficient vector and error variance of the model M are denoted by \beta_M and \sigma^2_M, respectively. Thus M models the data as y = Z_M(X_M)\beta_M + \epsilon_M, where \epsilon_M \sim N(0, I\sigma^2_M). Model M is said to be nested in model M' if M \subset M'. M models the response through the covariates in two distinct ways: by choosing the set of meaningful covariates X_M, as well as by choosing the polynomial structure on these covariates, Z_M(X_M).


The set \mathbb{N}_0^p constitutes a partially ordered set, or more succinctly a poset. A poset is a set partially ordered through a binary relation "≼". In this context the binary relation on the poset \mathbb{N}_0^p is defined between pairs (\alpha, \alpha') by \alpha' ≼ \alpha whenever \alpha_j \ge \alpha'_j for all j = 1, ..., p, with \alpha' ≺ \alpha if, additionally, \alpha_j > \alpha'_j for some j. The order of a term \alpha \in \mathbb{N}_0^p is given by the sum of its elements, order(\alpha) = \sum_j \alpha_j. When order(\alpha) = order(\alpha') + 1 and \alpha' ≺ \alpha, then \alpha' is said to immediately precede \alpha, which is denoted by \alpha' \to \alpha. The parent set of \alpha is defined by P(\alpha) = \{\alpha' \in \mathbb{N}_0^p : \alpha' \to \alpha\} and is given by the set of nodes that immediately precede the given node. A polynomial model M is said to be well-formulated if \alpha \in M implies that P(\alpha) \subset M. For example, any well-formulated model using x_{i,1}^2 x_{i,2} to model y_i must also include the parent terms x_{i,1}x_{i,2} and x_{i,1}^2, their corresponding parent terms x_{i,1} and x_{i,2}, and the intercept term 1.
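In code, a multi-index representation makes the parent set and the well-formulation check one-liners. A small sketch under our own convention of writing terms as tuples of exponents (so (2, 1) stands for x_1^2 x_2 and (0, 0) for the intercept):

```python
def parents(alpha):
    """Parent set P(alpha): multi-indices obtained by lowering one positive exponent by 1."""
    return [alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0]

def is_well_formulated(model):
    """True if every term's parents are also in the model (model = set of exponent tuples)."""
    return all(p in model for alpha in model for p in parents(alpha))

# {1, x1, x1^2*x2} is not well-formulated: x1*x2 and x1^2 are missing.
print(is_well_formulated({(0, 0), (1, 0), (2, 1)}))                           # False
print(is_well_formulated({(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (2, 1)}))   # True
```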

The poset \mathbb{N}_0^p can be represented by a Directed Acyclic Graph (DAG). Without ambiguity, we can identify nodes in the graph, \alpha \in \mathbb{N}_0^p, with terms in the set of covariates. The graph has directed edges to a node from its parents. Any well-formulated model M is represented by a subgraph of this DAG with the property that if a node \alpha is in the subgraph, then the nodes corresponding to P(\alpha) are also in it. Figure 4-1 shows examples of well-formulated polynomial models, where \alpha \in \mathbb{N}_0^p is identified with \prod_{j=1}^{p} x_j^{\alpha_j}.

The motivation for considering only well-formulated polynomial models is compelling. Let Z_M be the design matrix associated with a polynomial model. The subspace of y modeled by Z_M, given by the hat matrix H_M = Z_M(Z'_MZ_M)^{-1}Z'_M, is invariant to affine transformations of the matrix X_M if and only if M corresponds to a well-formulated polynomial model (Peixoto, 1990).

Figure 4-1. Graphs of well-formulated polynomial models for p = 2. [Two example DAGs, panels A and B.]

For example, if p = 2 and y_i = \beta_{(0,0)} + \beta_{(1,0)}x_{i,1} + \beta_{(0,1)}x_{i,2} + \beta_{(1,1)}x_{i,1}x_{i,2} + \epsilon_i, then the hat matrix is invariant to any covariate transformation of the form A\,(x_{i,1}, x_{i,2})' + b, for any real-valued positive definite 2 \times 2 matrix A and any real-valued vector b of dimension two. In contrast, if y_i = \beta_{(0,0)} + \beta_{(2,0)}x_{i,1}^2 + \epsilon_i, then the hat matrix formed after applying the transformation x_{i,1} \mapsto x_{i,1} + c, for real c \neq 0, is not the same as the hat matrix formed with the original x_{i,1}.

4.2.1 Well-Formulated Model Spaces

The spaces of WFMs \mathcal{M} considered in this paper can be characterized in terms of two WFMs: M_B, the base model, and M_F, the full model. The base model contains at least the intercept term and is nested in the full model. The model space \mathcal{M} is populated by all well-formulated models M that nest M_B and are nested in M_F:

\mathcal{M} = \{M : M_B \subseteq M \subseteq M_F \text{ and } M \text{ is well-formulated}\}.

For M to be well-formulated, the entire ancestry of each node in M must also be included in M. Because of this, M \in \mathcal{M} can be uniquely identified by two different sets of nodes in M_F: the set of extreme nodes and the set of children nodes. For M \in \mathcal{M}, the sets of extreme and children nodes, respectively denoted by E(M) and C(M), are defined by

E(M) = \{\alpha \in M \setminus M_B : \alpha \notin P(\alpha')\ \forall\ \alpha' \in M\}

C(M) = \{\alpha \in M_F \setminus M : \{\alpha\} \cup M \text{ is well-formulated}\}.

The extreme nodes are those nodes that, when removed from M, give rise to a WFM in \mathcal{M}. The children nodes are those nodes that, when added to M, give rise to a WFM in \mathcal{M}. Because M_B \subseteq M for all M \in \mathcal{M}, the set of nodes E(M) \cup M_B determines M by beginning with this set and iteratively adding parent nodes. Similarly, the nodes in C(M) determine the set \{\alpha' \in P(\alpha) : \alpha \in C(M)\} \cup \{\alpha' \in E(M_F) : \alpha \npreceq \alpha' \text{ for all } \alpha \in C(M)\}, which contains E(M) \cup M_B and thus uniquely identifies M.
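Both sets are straightforward to compute from the multi-index representation introduced earlier. A self-contained sketch (ours, not from the dissertation), which reproduces the example of Figure 4-2 below:

```python
def parents(alpha):
    return [alpha[:j] + (alpha[j] - 1,) + alpha[j + 1:]
            for j in range(len(alpha)) if alpha[j] > 0]

def extreme_nodes(M, M_base):
    """E(M): nodes outside the base model that are parents of no other node in M."""
    return {a for a in M - M_base
            if not any(a in parents(b) for b in M)}

def children_nodes(M, M_full):
    """C(M): nodes of M_F not in M whose full parent set is already contained in M."""
    return {a for a in M_full - M
            if all(p in M for p in parents(a))}

# For M = {1, x1, x1^2} inside the quadratic surface on (x1, x2):
M = {(0, 0), (1, 0), (2, 0)}
M_full = {(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)}
print(extreme_nodes(M, {(0, 0)}))   # {(2, 0)} -> x1^2
print(children_nodes(M, M_full))    # {(0, 1)} -> x2
```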

[Figure: two copies of the DAG with nodes 1, x_1, x_2, x_1^2, x_1x_2, x_2^2. Panel A: Extreme node set. Panel B: Children node set.]
Figure 4-2. Extreme and children node sets for the model M = {1, x_1, x_1^2}.

In Figure 4-2 the extreme and children sets for the model M = \{1, x_1, x_1^2\} are shown for the model space characterized by M_F = \{1, x_1, x_2, x_1^2, x_1x_2, x_2^2\}. In Figure 4-2A the solid nodes represent nodes \alpha \in M \setminus E(M), the dashed node corresponds to \alpha \in E(M), and the dotted nodes are not in M. Solid nodes in Figure 4-2B correspond to those in M; the dashed node is the single node in C(M), and the dotted nodes are not in M \cup C(M).

4.3 Priors on the Model Space

As discussed in Scott & Berger (2010), the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. This penalization acts against more complex models, but does not account for the collection of models in the model space, which describes the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important. As Scott & Berger explain, the multiplicity penalty is "hidden away" in the model prior probabilities \pi(M|\mathcal{M}).

In what follows, we propose three different prior structures on the model space for WFMs, discuss their advantages and disadvantages, and describe reasonable choices for their hyper-parameters. In addition, we investigate how the choice of prior structure and hyper-parameter combination affects the posterior probabilities for predictor inclusion, providing some recommendations for different situations.

4.3.1 Model Prior Definition

The graphical structure of the model space suggests a method for prior construction on \mathcal{M} guided by the notion of inheritance. A node \alpha is said to inherit from a node \alpha' if there is a directed path from \alpha' to \alpha in the graph of M_F. The inheritance is said to be immediate if order(\alpha) = order(\alpha') + 1 (equivalently, if \alpha' \in P(\alpha), or if \alpha' immediately precedes \alpha).

For convenience, define \Gamma(M) = M \setminus M_B to be the set of nodes in M that are not in the base model M_B. For \alpha \in \Gamma(M_F), let \gamma_\alpha(M) be the indicator function describing whether \alpha is included in M, i.e., \gamma_\alpha(M) = I_{(\alpha \in M)}. Denote by \gamma^\nu(M) the set of indicators of inclusion in M for all order-\nu nodes in \Gamma(M_F). Finally, let \gamma^{<\nu}(M) = \bigcup_{j=0}^{\nu-1}\gamma^j(M), the set of indicators of inclusion in M for all nodes in \Gamma(M_F) of order less than \nu. With these definitions, the prior probability of any model M \in \mathcal{M} can be factored as

\pi(M|\mathcal{M}) = \prod_{j=J^{\min}_M}^{J^{\max}_M} \pi(\gamma^j(M)\,|\,\gamma^{<j}(M), \mathcal{M}), \qquad (4–3)

where J^{\min}_M and J^{\max}_M are, respectively, the minimum and maximum order of nodes in \Gamma(M_F), and \pi(\gamma^{J^{\min}_M}(M)\,|\,\gamma^{<J^{\min}_M}(M), \mathcal{M}) = \pi(\gamma^{J^{\min}_M}(M)\,|\,\mathcal{M}).

Prior distributions on \mathcal{M} can be simplified by making two assumptions. First, if order(\alpha) = order(\alpha') = j, then \gamma_\alpha and \gamma_{\alpha'} are assumed to be conditionally independent when conditioned on \gamma^{<j}, denoted by \gamma_\alpha \perp\!\!\!\perp \gamma_{\alpha'}\,|\,\gamma^{<j}. Second, immediate inheritance is invoked, and it is assumed that if order(\alpha) = j, then \gamma_\alpha(M)\,|\,\gamma^{<j}(M) = \gamma_\alpha(M)\,|\,\gamma_{P(\alpha)}(M), where \gamma_{P(\alpha)}(M) is the inclusion indicator for the set of parent nodes of \alpha. This indicator is one if the complete parent set of \alpha is contained in M and zero otherwise.

In Figure 4-3 these two assumptions are depicted, with M_F being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x_1^2, x_2^2, and x_1x_2 are independent when conditioned on all the lower-order terms. In this same space, immediate inheritance implies that the inclusion of x_1^2 conditioned on the inclusion of all lower-order nodes is equivalent to conditioning it on its parent set (x_1 in this case).

x_1^2 \perp\!\!\!\perp x_1x_2 \perp\!\!\!\perp x_2^2 \;\big|\; \{1, x_1, x_2\}  (A. Conditional independence)

x_1^2 \;\big|\; \{1, x_1, x_2\} \;=\; x_1^2 \;\big|\; x_1  (B. Immediate inheritance)

Figure 4-3. The conditional independence and immediate inheritance assumptions, depicted for an order-two surface in two main effects.

Denote the conditional inclusion probability of node \alpha in model M by \pi_\alpha = \pi(\gamma_\alpha(M) = 1\,|\,\gamma_{P(\alpha)}(M), \mathcal{M}). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

\pi(M|\pi_M, \mathcal{M}) = \prod_{\alpha \in \Gamma(M_F)} \pi_\alpha^{\gamma_\alpha(M)}(1 - \pi_\alpha)^{1-\gamma_\alpha(M)}, \qquad (4–4)

with \pi_M = \{\pi_\alpha : \alpha \in \Gamma(M_F)\}. Because M must be well-formulated, \pi_\alpha = \gamma_\alpha = 0 if \gamma_{P(\alpha)}(M) = 0. Thus the product in 4–4 can be restricted to the set of nodes \alpha \in \Gamma(M) \cup C(M). Additional structure can be built into the prior on \mathcal{M} by making assumptions about the inclusion probabilities \pi_\alpha, such as equality assumptions or the assumption of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on \pi_M that assume some structure among its elements, and then marginalizing out the \pi_M.

Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero \pi_\alpha are all equal. Specifically, for a model M \in \mathcal{M} it is assumed that \pi_\alpha = \pi for all \alpha \in \Gamma(M) \cup C(M). A complete Bayesian specification of the HUP is obtained by assuming a prior distribution for \pi. The choice of \pi \sim \text{Beta}(a, b) produces

\pi^{HUP}(M|\mathcal{M}, a, b) = \frac{B(|\Gamma(M)| + a,\, |C(M)| + b)}{B(a, b)}, \qquad (4–5)

where B is the beta function. Setting a = b = 1 gives the particular value of

\pi^{HUP}(M|\mathcal{M}, a = 1, b = 1) = \frac{1}{|\Gamma(M)| + |C(M)| + 1}\binom{|\Gamma(M)| + |C(M)|}{|\Gamma(M)|}^{-1}. \qquad (4–6)

The HUP assigns equal probabilities to all models for which the sets of nodes \Gamma(M) and C(M) have the same cardinality. This prior provides a combinatorial penalization, but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because \pi_\alpha = \pi for all \alpha, this penalization can only depend on some aspect of the entire graph of M_F, such as the total number of nodes not in the null model, |\Gamma(M_F)|.


Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero \pi_\alpha. Each non-zero \pi_\alpha is given its own prior, which is assumed to be a Beta distribution with parameters a_\alpha and b_\alpha. Thus the prior probability of M under the HIP is

\pi^{HIP}(M|\mathcal{M}, \mathbf{a}, \mathbf{b}) = \prod_{\alpha \in \Gamma(M)} \frac{a_\alpha}{a_\alpha + b_\alpha}\; \prod_{\alpha \in C(M)} \frac{b_\alpha}{a_\alpha + b_\alpha}, \qquad (4–7)

where the product over \emptyset is taken to be 1. Because the \pi_\alpha are totally independent, any choice of a_\alpha and b_\alpha is equivalent to choosing a probability of success \pi_\alpha for a given \alpha. Setting a_\alpha = b_\alpha = 1 for all \alpha \in \Gamma(M) \cup C(M) gives the particular value of

\pi^{HIP}(M|\mathcal{M}, \mathbf{a} = \mathbf{1}, \mathbf{b} = \mathbf{1}) = \left(\frac{1}{2}\right)^{|\Gamma(M)| + |C(M)|}. \qquad (4–8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at the different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4–8 is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each \gamma^j is conditioned on \gamma^{<j} in the prior construction, the a_\alpha and b_\alpha for \alpha of order j can be conditioned on \gamma^{<j}. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM conditioned on the inclusion vector \gamma^{<j}, which is denoted as ch_j(\gamma^{<j}). Choosing a_\alpha = 1 and b_\alpha(M) = ch_j(\gamma^{<j}) is equivalent to choosing a probability of success \pi_\alpha = 1/ch_j(\gamma^{<j}). This penalization can drive down the false positive rate when ch_j(\gamma^{<j}) is large, but may produce more false negatives.

Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the \pi_\alpha is to assume equality between the \pi_\alpha of a given order and independence across the different orders. Define \Gamma_j(M) = \{\alpha \in \Gamma(M) : \text{order}(\alpha) = j\} and C_j(M) = \{\alpha \in C(M) : \text{order}(\alpha) = j\}. The HOP assumes that \pi_\alpha = \pi_j for all \alpha \in \Gamma_j(M) \cup C_j(M). Assuming that \pi_j \sim \text{Beta}(a_j, b_j) provides a prior probability of

\pi^{HOP}(M|\mathcal{M}, \mathbf{a}, \mathbf{b}) = \prod_{j=J^{\min}_M}^{J^{\max}_M} \frac{B(|\Gamma_j(M)| + a_j,\, |C_j(M)| + b_j)}{B(a_j, b_j)}. \qquad (4–9)

The specific choice of a_j = b_j = 1 for all j gives a value of

\pi^{HOP}(M|\mathcal{M}, \mathbf{a} = \mathbf{1}, \mathbf{b} = \mathbf{1}) = \prod_j \left[\frac{1}{|\Gamma_j(M)| + |C_j(M)| + 1}\binom{|\Gamma_j(M)| + |C_j(M)|}{|\Gamma_j(M)|}^{-1}\right] \qquad (4–10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.

The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on \gamma^{<j}(M), the indicators \{\gamma_\alpha : \alpha \in \Gamma_j(M) \cup C_j(M)\} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with common probability of success \pi_j, with a prior distribution; our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a similar fashion to the HIP. The number of possible nodes of order j that could be added while maintaining a WFM is given by ch_j(M) = ch_j(\gamma^{<j}(M)) = |\Gamma_j(M) \cup C_j(M)|.

Using a_j = 1 and b_j(M) = ch_j(M) produces a prior with two desirable properties. First, if M' \subset M, then \pi(M) \le \pi(M'). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., ch_j(M) - 1.

4.3.2 Choice of Prior Structure and Hyper-Parameters

Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities \pi_M. For the sake of simplicity, this paper focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters, which can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1 as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by b_j(M) = b_\alpha(M) = ch_j(M) = |\Gamma_j(M) \cup C_j(M)| for the HOP and HIP, where j = order(\alpha), while b = ch denotes that b = |\Gamma(M_F)| for the HUP. The prior behavior is illustrated for two model spaces (Figures 4-4 and 4-5); in both cases the base model M_B is taken to be the intercept-only model and M_F is the DAG shown. The priors considered treat model complexity differently, and some general properties can be seen in these examples.

ModelHIP HOP HUP

(1 1) (1 ch) (1 1) (1 ch) (1 1) (1 ch)

1 1 14 49 13 12 13 572 1 x1 18 19 112 112 112 5563 1 x2 18 19 112 112 112 5564 1 x1 x

21 18 19 112 112 112 5168

5 1 x2 x22 18 19 112 112 112 5168

6 1 x1 x2 132 364 112 112 160 1727 1 x1 x2 x

21 132 164 136 160 160 1168

8 1 x1 x2 x1x2 132 164 136 160 160 11689 1 x1 x2 x

22 132 164 136 160 160 1168

10 1 x1 x2 x21 x1x2 132 1192 136 1120 130 1252

11 1 x1 x2 x21 x

22 132 1192 136 1120 130 1252

12 1 x1 x2 x1x2 x22 132 1192 136 1120 130 1252

13 1 x1 x2 x21 x1x2 x

22 132 1576 112 1120 16 1252

Figure 4-4 Prior probabilities for the space of well-formulated models associated to thequadratic surface on two variables where MB is taken to be the interceptonly model and (ab) isin (1 1) (1 ch)

First contrast the choice of HIP HUP and HOP for the choice of (ab) = (1 1) The

HIP induces a complexity penalization that only accounts for the order of the terms in

the model This is best exhibited by the model space in Figure 4-4 Models including x1

and x2 models 6 through 13 are given the same prior probability and no penalization is

incurred for the inclusion of any or all of the quadratic terms In contrast to the HIP the

97

ModelHIP HOP HUP

(1 1) (1 ch) (1 1) (1 ch) (1 1) (1 ch)

1 1 18 2764 14 12 14 472 1 x1 18 964 112 110 112 2213 1 x2 18 964 112 110 112 2214 1 x3 18 964 112 110 112 2215 1 x1 x3 18 364 112 120 112 41056 1 x2 x3 18 364 112 120 112 41057 1 x1 x2 116 3128 124 140 130 1428 1 x1 x2 x1x2 116 3128 124 140 120 1709 1 x1 x2 x3 116 1128 18 140 120 17010 1 x1 x2 x3 x1x2 116 1128 18 140 15 170

Figure 4-5 Prior probabilities for the space of well-formulated models associated tothree main effects and one interaction term where MB is taken to be theintercept only model and (ab) isin (1 1) (1 ch)

HUP induces a penalization for model complexity but it does not adequately penalize

models for including additional terms Using the HIP models including all of the terms

are given at least as much probability as any model containing any non-empty set of

terms (Figures 4-4 and 4-5) This lack of penalization of the full model is originates from

its combinatorial simplicity (ie this is the only model that contains every term) and

as an unfortunate consequence this model space distribution favors the base and full

models Similar behavior is observed with the HOP with (ab) = (1 1) As models

become more complex they are appropriately penalized for their size However after a

sufficient number of nodes are added the number of possible models of that particular

size is considerably reduced Thus combinatorial complexity is negligible on the largest

models This is best exhibited in Figure 4-5 where the HOP places more mass on

the full model than on any model containing a single order one node highlighting an

undesirable behavior of the priors with this choice of hyper-parameters

In contrast if (ab) = (1 ch) all three priors produce strong penalization as

models become more complex both in terms of the number and order of the nodes

contained in the model For all of the priors adding a node α to a model M to form M prime

produces p(M) ge p(M prime) However differences between the priors are apparent The

98

HIP penalizes the full model the most with the HOP penalizing it the least and the HUP

lying between them At face value the HOP creates the most compelling penalization

of model complexity In Figure 4-5 the penalization of the HOP is the least dramatic

producing prior odds of 20 for MB versus MF as opposed to the HUP and HIP which

produce prior odds of 40 and 54 respectively Similarly the prior odds in Figure 4-4 are

60 180 and 256 for the HOP HUP and HIP respectively433 Posterior Sensitivity to the Choice of Prior

To determine how the proposed priors are adjusting the posterior probabilities to

account for multiplicity a simple simulation was performed The goal of this exercise

was to understand how the priors respond to increasing complexity First the priors are

compared as the number of main effects p grows Second they are compared as the

depth of the hierarchy increases or in other words as the orderJMmax increases

The quality of a node is characterized by its marginal posterior inclusion

probabilities defined as pα =sum

MisinM I(αisinM)p(M|yM) for α isin MF These posteriors

were obtained for the proposed priors as well as the Equal Probability Prior (EPP)

on M For all prior structures both the default hyper-parameters a = b = 1 and

the penalizing choice of a = 1 and b = ch are considered The results for the

different combinations of MF and MT incorporated in the analysis were obtained

from 100 random replications (ie generating at random 100 matrices of main effects

and responses) The simulation proceeds as follows

1 Randomly generate main effects matrices X = (x1 x18) for xiiidsim Nn(0 In) and

error vectors ϵ sim Nn(0 In) for n = 60

2 Setting all coefficient values equal to one calculate y = ZMTβ + ϵ for the true

models given byMT 1 = x1 x2 x3 x

21 x1x2 x

22 x2x3 with |MT 1| = 7

MT 2 = x1 x2 x16 with |MT 2| = 16MT 3 = x1 x2 x3 x4 with |MT 3| = 4MT 4 = x1 x2 x8 x

21 x3x4 with |MT 4| = 10

MT 5 = x1 x2 x3 x4 x21 x3x4 with |MT 5| = 6

99

Table 4-1 Characterization of the full models MF and corresponding model spaces Mconsidered in simulationsgrowing p fixed JM

max fixed p growing JMmax

MF

∣∣MF

∣∣ ∣∣M∣∣ MT used MF

∣∣MF

∣∣ ∣∣M∣∣ MT used(x1 + x2 + x3)

2 9 95 MT 1 (x1 + x2 + x3)2 9 95 MT 1

(x1 + + x4)2 14 1337 MT 1 (x1 + x2 + x3)

3 19 2497 MT 1

(x1 + + x5)2 20 38619 MT 1 (x1 + x2 + x3)

4 34 161421 MT 1

Other model spacesMF

∣∣MF

∣∣ ∣∣M∣∣ MT usedx1 + x2 + middot middot middot+ x18 18 262144 MT 2MT 3

(x1 + x2 + x4)2 + x5+ 20 85568 MT 4MT 5x6 + + x10

3 In all simulations the base model MB is the intercept only model The notation(x1 + + xp)

d is used to represent the full order-d polynomial response surface inp main effects The model spaces characterized by their corresponding full modelMF are presented in Table 4-1 as well as the true models used in each case

4 Enumerate the model spaces and calculate p(M|yM) for all M isin Musing the EPP HUP HIP and HOP the latter two each with the two sets ofhyper-parameters

5 Count the number of true positives and false positives in each M for the differentpriors

The true positives (TP) are defined as those nodes α isin MT such that pα gt 05

With the false positives (FP) three different cutoffs are considered for pα elucidating

the adjustment for multiplicity induced by the model priors These cutoffs are

010 020 and 050 for α isin MT The results from this exercise provide insight

about the influence of the prior on the marginal posterior inclusion probabilities In Table

4-1 the model spaces considered are described in terms of the number of models they

contain and in terms of the number of nodes of MF the full model that defines the DAG

for M

Growing number of main effects fixed polynomial degree This simulation

investigates the posterior behavior as the number of covariates grows for a polynomial

100

surface of degree two The true model is assumed to be MT 1 and has 7 polynomial

terms The false positive and true positive rates are displayed in Table 4-2

First focus on the posterior when (ab) = (1 1) As p increases and the cutoff

is low the number of false positives increases for the EPP as well as the hierarchical

priors although less dramatically for the latter All of the priors identify all of the true

positives The false positive rate for the 50 cutoff is less than one for all four prior

structures with the HIP exhibiting the smallest false positive rate

With the second choice of hyper-parameters (1 ch) the improvement of the

hierarchical priors over the EPP is dramatic and the difference in performance is more

pronounced as p increases These also considerably outperform the priors using the

default hyper-parameters a = b = 1 in terms of the false positives Regarding the

number of true positives all priors discovered the 7 true predictors in MT 1 for most of

the 100 random samples of data with only minor differences observed between any of

the priors considered That being said the means for the priors with a = 1b = ch are

slightly lower for the true positives With a 50 cutoff the hierarchical priors keep a tight

control on the number of false positives but in doing so discard true positives with slightly

higher frequency

Growing polynomial degree fixed main effects For these examples the true

model is once again MT 1 When the complexity is increased by making the order of MF

larger (Table 4-3) the inability of the EPP to adjust the inclusion posteriors for multiplicity

becomes more pronounced the EPP becomes less and less efficient at removing false

positives when the FP cutoff is low Among the priors with a = b = 1 as the order

increases the HIP is the best at filtering out the false positives Using the 05 false

positive cutoff some false positives are included both for the EEP and for all the priors

with a = b = 1 indicating that the default hyper-parameters might not be the best option

to control FP The 7 covariates in the true model all obtain a high inclusion posterior

probability both with the EEP and the a = b = 1 priors

101

Table 4-2 Mean number of false and true positives in 100 randomly generated datasetsas the number of main effects increases from three to five predictors in a is afull quadratic under the equal probability prior (EPP) the hierarchicalindependence prior (HIP) the hierarchical order prior (HOP) and thehierarchical uniform prior (HUP)

Cutoff |MT | MF EPP a = 1b = 1 a = 1b = ch

HIP HUP HOP HIP HUP HOPFP(gt010)

7 (x1 + x2 + x3)2

178 178 200 200 011 131 106FP(gt020) 043 043 200 198 001 028 024FP(gt050) 004 004 097 036 000 003 002TP(gt050) (MT 1) 700 700 700 700 697 699 699FP(gt010)

7 (x1 + x2 + x3 + x4)2

362 194 233 245 010 063 107FP(gt020) 160 047 217 215 001 017 024FP(gt050) 025 006 035 036 000 002 002TP(gt050) (MT 1) 700 700 700 700 697 699 699FP(gt010)

7 (x1 + x2 + x3 + x4 + x5)2

600 216 260 255 012 043 115FP(gt020) 291 055 213 218 002 019 027FP(gt050) 066 011 025 037 000 003 001TP(gt050) (MT 1) 700 700 700 700 697 699 699

In contrast any of the a = 1 and b = ch priors dramatically improve upon their

a = b = 1 counterparts consistently assigning low inclusion probabilities for the majority

of the false positive terms even for low cutoffs As the order of the polynomial surface

increases the difference in performance between these priors and either the EEP or

their default versions becomes even more clear At the 50 cutoff the hierarchical priors

with complexity penalization exhibit very low false positive rates The true positive rate

decreases slightly for the priors but not to an alarming degree

Other model spaces This part of the analysis considers model spaces that do not

correspond to full polynomial degree response surfaces (Table 4-4) The first example

is a model space with main effects only The second example includes a full quadratic

surface of order 2 but in addition includes six terms for which only main effects are to be

modeled Two true models are used in combination with each model space to observe

how the posterior probabilities vary under the influence of the different priors for ldquolargerdquo

and ldquosmallrdquo true models

102

Table 4-3 Mean number of false and true positives in 100 randomly generated datasetsas the maximum order of MF increases from two to four in a full model withthree main effects under the equal probability prior (EPP) the hierarchicalindependence prior (HIP) the hierarchical order prior (HOP) and thehierarchical uniform prior (HUP)

Cutoff |MT | MF EPP a = 1b = 1 a = 1b = ch

HIP HUP HOP HIP HUP HOPFP(gt010)

7 (x1 + x2 + x3)2

178 178 200 200 011 131 106FP(gt020) 043 043 200 198 001 028 024FP(gt050) 004 004 097 036 000 003 002TP(gt050) (MT 1) 700 700 700 700 697 699 699FP(gt010)

7 (x1 + x2 + x3)3

737 521 606 291 055 105 139FP(gt020) 291 155 361 208 017 034 031FP(gt050) 040 021 050 026 003 003 004TP(gt050) (MT 1) 700 700 700 700 697 698 700FP(gt010)

7 (x1 + x2 + x3)4

822 400 469 261 052 055 132FP(gt020) 421 113 176 203 012 015 031FP(gt050) 056 017 022 027 003 003 004TP(gt050) (MT 1) 700 700 700 700 697 697 699

By construction in model spaces with main effects only HIP(11) and EPP are

equivalent as are HOP(ab) and HUP(ab) This accounts for the similarities observed

among the results for the first two cases presented in Table 4-4 where the model space

corresponds to a full model with 18 main effects and the true models are a model with

16 and 4 main effects respectively When the number of true coefficients is large the

HUP(11) and HOP(11) do poorly at controlling false positives even at the 50 cutoff

In contrast the HIP (and thus the EPP) with the 50 cutoff identifies the true positives

and no false positives This result however does not imply that the EPP controls false

positives well The true model contains 16 out of the 18 nodes in MF so there is little

potential for false positives The a = 1 and b = ch priors show dramatically different

behavior The HIP controls false positive well but fails to identify the true coefficients at

the 50 cutoff In contrast the HOP identifies all of the true positives and has a small

false positive rate for the 50 cutoff

103

If the number of true positives is small most terms in the full model are truly zero

The EPP includes at least one false positive in approximately 50 of the randomly

sampled datasets On the other hand the HUP(11) provides some control for

multiplicity obtaining on average a lower number of false positives than the EPP

Furthermore the proposed hierarchical priors with a = 1b = ch are substantially better

than the EPP (and the choice of a = b = 1) at controlling false positives and capturing

all true positives using the marginal posterior inclusion probabilities The two examples

suggest that the HOP(1 ch) is the best default choice for model selection when the

number of terms available at a given degree is large

The third and fourth examples in Table 4-4 consider the same irregular model

space with data generated from MT 4 with ten terms and MT 5 with six terms HIP(11)

and EPP again behave quite similarly incorporating a large number of false positives

for the 01 cutoff At the 05 cutoff some false positives are still included The HUP(11)

and HOP(11) behave similarly with a slightly higher false positive rate at the 50 cutoff

In terms of the true positives the EPP and a = b = 1 priors always include all of the

predictors in MT 4 and MT 5 On the other hand the ability of the a = 1b = ch priors

to control for false positives is markedly better than that of the EPP and the hierarchical

priors with choice of a = 1 = b = 1 At the 50 cutoff these priors identify all of the true

positives and true negatives Once again these examples point to the hierarchical priors

with additional penalization for complexity as being good default priors on the model

space44 Random Walks on the Model Space

When the model space M is too large to enumerate a stochastic procedure can

be used to find models with high posterior probability In particular an MCMC algorithm

can be utilized to generate a dependent sample of models from the model posterior The

structure of the model space M both presents difficulties and provides clues on how to

build algorithms to explore it Different MCMC strategies can be adopted two of which

104

Table 4-4 Mean number of false and true positives in 100 randomly generated datasetswith unstructured or irregular model spaces under the equal probability prior(EPP) the hierarchical independence prior (HIP) the hierarchical order prior(HOP) and the hierarchical uniform prior (HUP)

Cutoff |MT | MF EPP a = 1b = 1 a = 1b = ch

HIP HUP HOP HIP HUP HOPFP(gt010)

16 x1 + x2 + + x18

193 193 200 200 003 180 180FP(gt020) 052 052 200 200 001 046 046FP(gt050) 007 007 200 200 001 004 004TP(gt050) (MT 2) 1599 1599 1600 1600 699 1599 1599FP(gt010)

4 x1 + x2 + + x18

1395 1395 915 915 026 131 131FP(gt020) 545 545 303 303 005 045 045FP(gt050) 084 084 045 045 002 006 006TP(gt050) (MT 3) 400 400 400 400 400 400 400FP(gt010)

10

973 971 1000 560 034 233 220FP(gt020) (x1 + + x4)

2+ 265 265 873 305 012 074 069FP(gt050) +x5 + + x10 035 035 136 168 002 011 012TP(gt050) (MT 4) 1000 1000 1000 999 994 998 999FP(gt010)

6

1352 1352 1106 994 044 163 196FP(gt020) (x1 + + x4)

2+ 422 421 360 501 015 048 068FP(gt050) +x5 + + x10 053 053 057 075 001 008 011TP(gt050) (MT 5) 600 600 600 600 599 599 599

are outlined in this section Combining the different strategies allows the model selection

algorithm to explore the model space thoroughly and relatively fast441 Simple Pruning and Growing

This first strategy relies on small localized jumps around the model space turning

on or off a single node at each step The idea behind this algorithm is to grow the model

by activating one node in the children set or to prune the model by removing one node

in the extreme set At a given step in the algorithm assume that the current state of the

chain is model M Let pG be the probability that algorithm chooses the growth step The

proposed model M prime can either be M+ = M cup α for some α isin C(M) or Mminus = M α

or some α isin E(M)

An example transition kernel is defined by the mixture

g(M prime|M) = pG middot qGrow(M prime|M) + (1minus pG) middot qPrune(M prime|M)

105

=IM =MF

1 + IM =MBmiddotIαisinC(M)

|C(M)|+

IM =MB

1 + IM =MF middotIαisinE(M)

|E(M)|(4ndash11)

where pG has explicitly been defined as 05 when both C(M) and E(M) are non-empty

and as 0 (or 1) when C(M) = empty (or E(M) = empty) After choosing pruning or growing a

single node is proposed for addition to or deletion from M uniformly at random

For this simple algorithm pruning has the reverse kernel of growing and vice-versa

From this construction more elaborate algorithms can be specified First instead of

choosing the node uniformly at random from the corresponding set nodes can be

selected using the relative posterior probability of adding or removing the node Second

more than one node can be selected at any step for instance by also sampling at

random the number of nodes to add or remove given the size of the set Third the

strategy could combine pruning and growing in a single step by sampling one node

α isin C(M) cup E(M) and adding or removing it accordingly Fourth the sets of nodes from

C(M) cup E(M) that yield well-formulated models can be added or removed This simple

algorithm produces small moves around the model space by focusing node addition or

removal only on the set C(M) cup E(M)442 Degree Based Pruning and Growing

In exploring the model space it is possible to take advantage of the hierarchical

structure defined between nodes of different order One can update the vector of

inclusion indicators by blocks denoted j(M) Two flavors of this algorithm are

proposed one that separates the pruning and growing steps and one where both

are done simultaneously

Assume that at a given step say t the algorithm is at M If growing the strategy

proceeds successively by order class going from j = Jmin up to j = Jmax with Jmin

and Jmax being the lowest and highest orders of nodes in MF MB respectively Define

Mt(Jminminus1) = M and set j = Jmin The growth kernel comprises the following steps

proceeding from j = Jmin to j = Jmax

106

1) Propose a model M prime by selecting a set of nodes from Cj(Mt(jminus1)) through thekernel qGrow j(middot|Mt(jminus1))

2) Compute the Metropolis-Hastings correction for M prime versus Mt(jminus1) If M prime isaccepted then set Mt(j) = M prime otherwise set Mt(j) = Mt(jminus1)

3) If j lt Jmax then set j = j + 1 and return to 1) otherwise proceed to 4)

4) Set Mt = Mt(Jmax )

The pruning step is defined In a similar fashion however it starts at order j = Jmax

and proceeds down to j = Jmin Let Ej(M prime) = E(M prime) cap j(MF ) be the set of nodes of

order j that can removed from the model M to produce a WFM Define Mt(Jmax+1) = M

and set j = Jmax The pruning kernel comprises the following steps

1) Propose a model M prime by selecting a set of nodes from Ej(Mt(j+1)) through thekernel qPrunej(middot|Mt(j+1))

2) Compute the Metropolis-Hastings correction for M prime versus Mt(j+1) If M prime isaccepted then set Mt(j) = M prime otherwise set Mt(j) = Mt(j+1)

3) If j gt Jmin then set j = j minus 1 and return to Step 1) otherwise proceed to Step 4)

4) Set Mt = Mt(Jmin )

It is clear that the growing and pruning steps are reverse kernels of each other

Pruning and growing can be combined for each j The forward kernel proceeds from

j = Jmin to j = Jmax and proposes adding sets of nodes from Cj(M) cup Ej(M) The reverse

kernel simply reverses the direction of j proceeding from j = Jmax to j = Jmin 45 Simulation Study

To study the operating characteristics of the proposed priors a simulation

experiment was designed with three goals First the priors are characterized by how

the posterior distributions are affected by the sample size and the signal-to-noise ratio

(SNR) Second given the SNR level the influence of the allocation of the signal across

the terms in the model is investigated Third performance is assessed when the true

model has special points in the scale (McCullagh amp Nelder 1989) ie when the true

107

model has coefficients equal to zero for some lower-order terms in the polynomial

hierarchy

With these goals in mind sets of predictors and responses are generated under

various experimental conditions The model space is defined with MB being the

intercept-only model and MF being the complete order-four polynomial surface in five

main effects that has 126 nodes The entries of the matrix of main effects are generated

as independent standard normal The response vectors are drawn from the n-variate

normal distribution as y sim Nn

(ZMT

(X)βγ In) where MT is the true model and In is the

n times n identity matrix

The sample sizes considered are n isin 130 260 1040 which ensures that

ZMF(X) is of full rank The cardinality of this model space is |M| gt 12 times 1022 which

makes enumeration of all models unfeasible Because the value of the 2k-th moment

of the standard normal distribution increases with k = 1 2 higher-order terms by

construction have a larger variance than their ancestors As such assuming equal

values for all coefficients higher-order terms necessarily contain more ldquosignalrdquo than

the lower order terms from which they inherit (eg x21 has more signal than x1) Once a

higher-order term is selected its entire ancestry is also included Therefore to prevent

the simulation results from being overly optimistic (because of the larger signals from the

higher-order terms) sphering is used to calculate meaningful values of the coefficients

ensuring that the signal is of the magnitude intended in any given direction Given

the results of the simulations from Section 433 only the HOP with a = 1b = ch is

considered with the EPP included for comparison

The total number of combinations of SNR sample size regression coefficient

values and nodes in MT amounts to 108 different scenarios Each scenario was run

with 100 independently generated datasets and the mean behavior of the samples was

observed The results presented in this section correspond to the median probability

model (MPM) from each of the 108 simulation scenarios considered Figure 4-7 shows

108

the comparison between the two priors for the mean number of true positive (TP) and

false positive (FP) terms Although some of the scenarios consider true models that are

not well-formulated the smallest well-formulated model that stems from MT is always

the one shown in Figure 4-6

Figure 4-6 MT DAG of the largest true model used in simulations

The results are summarized in Figure 4-7 Each point on the horizontal axis

corresponds to the average for a given set of simulation conditions Only labels for the

SNR and sample size are included for clarity but the results are also shown for the

different values of the regression coefficients and the different true models considered

Additional details about the procedure and other results are included in the appendices451 SNR and Sample Size Effect

As expected small sample sizes conditioned upon a small SNR impair the ability

of the algorithm to detect true coefficients with both the EPP and HOP(1 ch) with this

effect being greater when using the latter prior However considering the mean number

of TPs jointly with the number of FPs it is clear that although the number of TPs is

specially low with HOP(1 ch) most of the few predictors that are discovered in fact

belong to the true model In comparison to the results with EPP in terms of FPs the

HOP(1 ch) does better and even more so when both the sample size and the SNR are

109

Figure 4-7 Average true positives (TP) and average false positives (FP) in all simulatedscenarios for the median probability model with EPP and HOP(1 ch)

smallest Finally when either the SNR or the sample size is large the performance in

terms of TPs is similar between both priors but the number of FPs are somewhat lower

with the HOP452 Coefficient Magnitude

Three ways to allocate the amount of signal across predictors are considered For

the first choice all coefficients contain the same amount of signal regardless of their

order In the second each order-one coefficient contains twice as much signal as any

order-two coefficient and four times as much as any order-three coefficient Finally

each order-one coefficient contains a half as much signal as any order-two coefficient

and a quarter of what any order-three coefficient has These choices are denoted by

β(1) = c(1o1 1o2 1o3) β(2) = c(1o1 05o2 025o3) and β(3) = c(025o1 05o2 1o3)

respectively In Figure 4-7 the first 4 scenarios correspond to simulations with β(1) the

next four use β(2) the next four correspond to β(3) and then the values are cycled in

110

the same way The results show that scenarios using either β(1) or β(3) behave similarly

contrasting with the negative impact of having the highest signal in the order-one terms

through β(2) In Figure 4-7 the effect of using β(2) is evident as it corresponds to the

lowest values for the TPs regardless of the sample size the SNR or the prior used This

is an intuitive result since giving more signal to higher-order terms makes it easier to

detect higher-order terms and consequently by strong heredity the algorithm will also

select the corresponding lower-order terms included in the true model453 Special Points on the Scale

Four true models were considered (1) the model from Figure 4-6 (MT 1) (2)

the model without the order-one terms (MT 2) (3) the model without order-two terms

(MT 3) and (4) the model without x21 and x2x5 (MT 4) The last three are clearly not

well-formulated In Figure 4-7 the leftmost point on the horizontal axis corresponds to

scenarios with MT 1 the next point is for scenarios with MT 2 followed by those with MT 3

then with MT 4 then MT 1 etc In comparison to the EPP the HOP(1 ch) tightly controls

the inclusion of FPs by choosing smaller models at the expense of also reducing the TP

count especially when there is more uncertainty about the true model (ie SNR=025)

For both prior structures the results indicate that at low SNR levels the presence of

special points has no apparent impact as the selection behavior is similar between the

four models in terms of both the TP and FP An interesting observation is that the effect

of having special points on the scale is vastly magnified whenever the coefficients that

assign more weight to order-one terms (β(2)) are used46 Case Study Ozone Data Analysis

This section uses the ozone data from Breiman amp Friedman (1985) and followsthe analysis performed by Liang et al (2008) who investigated hyper g-priors Afterremoving observations with missing values 330 observations remain includingdaily measurements of maximum ozone concentration near Los Angeles and eightmeteorological variables Table D From the 330 observations 165 were sampled atrandom without replacement and used to run the variable selection procedure theremaining 165 were used for validation The eight meteorological variables interactionsand their squared terms are used as predictors resulting in a full model with 44predictors The model space assumes that the base model MB is the intercept onlymodel and that MF is the quadratic surface in the eight meteorological variables The

111

model space contains approximately 71 billion models and computation of all modelposterior probabilities is not feasible

Table 4-5 Variables used in the analyses of the ozone contamination datasetName Descriptionozone Daily max 1hr-average ozone (ppm) at Upland CA

vh 500 millibar pressure height (m) at Vandenberg AFBwind Wind speed (mph) at LAXhum Humidity () at LAXtemp Temperature (F) measured at Sandburg CAibh Inversion base height (ft) at LAXdpg Pressure gradient (mm Hg) from LAX to Daggett CAvis Visibility (miles) measured at LAXibt Inversion base temperature (F) at LAX

The HOP HUP and HIP with a = 1 and b = ch as well as the EPP are considered forcomparison purposes To obtain the Bayes factors in equation 3ndash3 four different mixtures ofg-priors are utilized intrinsic priors (IP) (which yields the expression in equation 3ndash2) hyper-g(HG) priors (Liang et al 2008) with hyper-parameters α = 2β = 1 and α = β = 1 and Zellner-Siow (ZS) priors (Zellner amp Siow 1980) The results were extracted for the median posteriorprobability (MPM) models Additionally the model is estimated using the R package hierNet(Bien et al 2013) to compare model selection results to those obtained using the hierarchicallasso (Bien et al 2013) restricted to well formulated models by imposing the strong heredityconstraint The procedures were assessed on the basis of their predictive accuracy on thevalidation dataset

Among all models the one that yields the smallest RMSE is the median probability modelobtained using the HOP and EPP with the ZS prior and also using the HOP with both HGpriors (Table 4-6) The HOP model with the intrinsic prior has all the terms contained in thelowest RMSE model with the exception of dpg2 which has a relatively high marginal inclusionprobability of 46 This disparity between the IP and other mixtures of g-priors is explainedby the fact that the IP induces less posterior shrinkage than the ZS and HG priors The MPMobtained through the HUP and HIP are nested in the best model suggesting that these modelspace priors penalize complexity too much and result in false negatives Consideration ofthese MPMs suggest that the HOP is best at producing true positives while controlling for falsepositives

Finally the model obtained from the hierarchical lasso (HierNet) is the largest model andproduces the second to largest RMSE All of the terms contained in any of the other modelsexcept for vh are nested within the hierarchical lasso model and most of the terms that areexclusive to this model receive extremely low marginal inclusion probabilities under any of themodel priors and parameter priors considered under Bayesian model selection

112

Table 4-6 Median probability models (MPM) from different combinations of parameterand model priors vs model selected using the hierarchical lasso

BF Prior Model R2 RMSEIP EPP hum dpg ibt hum2 hum lowast dpg 08054 42739

hum lowast ibt dpg2 ibt2IP HIP hum ibt hum2 hum lowast ibt ibt2 07740 43396IP HOP hum dpg ibt hum2 hum lowast ibt ibt2 07848 43175IP HUP hum dpg ibt hum lowast ibt ibt2 07767 43508ZS EPP hum dpg ibt hum2 hum lowast ibt dpg2 ibt2 07896 42518ZS HIP hum ibt hum lowast ibt ibt2 07525 43505ZS HOP hum dpg ibt hum2 hum lowast ibt dpg2 ibt2 07896 42518ZS HUP hum dpg ibt hum lowast ibt ibt2 07767 43508HG11 EPP vh hum dpg ibt hum2 hum lowast ibt dpg2 07701 43049HG11 HIP hum ibt hum lowast ibt ibt2 07525 43505HG11 HOP hum dpg ibt hum2 hum lowast ibt dpg2 ibt2 07896 42518HG11 HUP hum dpg ibt hum lowast ibt ibt2 07767 43508HG21 EPP hum dpg ibt hum2 hum lowast ibt dpg2 07701 43037HG21 HIP hum dpg ibt hum lowast ibt ibt2 07767 43508HG21 HOP hum dpg ibt hum2 hum lowast ibt dpg2 ibt2 07896 42518HG21 HUP hum dpg ibt hum lowast ibt 07526 44036

HierNet hum temp ibh dpg ibt vis hum2 hum lowast ibt 07651 43680temp2 temp lowast ibt dpg2

47 DiscussionScott amp Berger (2010) noted that Ockhamrsquos-razor effect found automatically in Bayesian

variable selection through the Bayes factor does not correct for multiple testing The Bayesfactor penalizes complexity of the alternative model according to the number of parametersin excess of those of the null model Therefore the Bayes factor only controls complexity in apairwise fashion If the model selection procedure uses equal prior probabilities for all M isin Mthen these comparisons ignore the effect of the multiplicity of the testing problem This is wherethe role of the prior on the model space becomes important The multiplicity penalty is ldquohiddenawayrdquo in the model prior probabilities π(M|M)

In addition to the multiplicity of the testing problem disregarding the hierarchical polynomialstructure in the predictors in model selection procedures has the potential to lead to differentresults according to how the predictors are setup (eg in what units these predictors areexpressed)

In this Chapter we investigated a solution to these two issues We define prior structuresfor well-formulated models and develop random walk algorithms to traverse this type of modelspace The key to understanding prior distributions on the space of WFMs is the hierarchicalnature of the model space itself The prior distributions described take advantage of thathierarchy in two ways First conditional independence and immediate inheritance are used todevelop the HOP HIP and HUP structures discussed in Section 43 Second the conditionalnature of the priors allows for the direct incorporation of complexity penalizations Of the priorsproposed the HOP using the hyperparameter choice (1 ch) provides the best control of falsepositives while maintaining a reasonable true positive rate Thus this prior is recommended asthe default prior on the space of WFMs

113

In the near future the software developed to carry out a Metropolis-Hastings random walkon the space of WFMs will be integrated to the R package varSelectIP These new functionsimplement various local priors for the regression coefficients including the intrinsic prior Zellner-Siow prior and hyper g-priors In addition the software supports the computation of crediblesets for each regression coefficient conditioned on the selected model as well as under modelaveraging

114

CHAPTER 5CONCLUSIONS

Ecologists are now embracing the use of Bayesian methods to investigate the

interactions that dictate the distribution and abundance of organisms These tools are

both powerful and flexible They allow integrating under a single methodology empirical

observations and theoretical process models and can seamlessly account for several

sources of uncertainty and dependence The estimation and testing methods proposed

throughout the document will contribute to the understanding of Bayesian methods used

in ecology and hopefully these will shed light about the differences between estimation

and testing Bayesian tools

All of our contributions exploit the potential of the latent variable formulation This

approach greatly simplifies the analysis of complex models it redirects the bulk of

the inferential burden away from the original response variables and places it on the

easy-to-work-with latent scale for which several time-tested approaches are available

Our methods are distinctly classified into estimation and testing tools

For estimation we proposed a Bayesian specification of the single-season

occupancy model for which a Gibbs sampler is available using both logit and probit

link functions This setup allows detection and occupancy probabilities to depend

on linear combinations of predictors Then we developed a dynamic version of this

approach incorporating the notion that occupancy at a previously occupied site depends

both on survival of current settlers and habitat suitability Additionally because these

dynamics also vary in space we suggest a strategy to add spatial dependence among

neighboring sites

Ecological inquiry usually requires of competing explanations and uncertainty

surrounds the decision of choosing any one of them Hence a model or a set of

probable models should be selected from all the viable alternatives To address this

testing problem we proposed an objective and fully automatic Bayesian methodology

115

for the single season site-occupancy model Our approach relies on the intrinsic prior

which prevents from introducing (commonly unavailable) subjectively information

into the model In simulation experiments we observed that the methods single out

accurately the predictors present in the true model using the marginal posterior inclusion

probabilities of the predictors For predictors in the true model these probabilities were

comparatively larger than those for predictors not present in the true model Also the

simulations indicated that the method provides better discrimination for predictors in the

detection component of the model

In our simulations and in the analysis of the Blue Hawker data we observed that the

effect from using the multiplicity correction prior was substantial This occurs because

the Bayes factor only penalizes complexity of the alternative model according to its

number of parameters in excess to those of the null model As the number of predictors

grows the number of models in the models space also grows increasing the chances

of making false positive decisions on the inclusion of predictors This is where the role

of the prior on the model space becomes important The multiplicity penalty is ldquohidden

awayrdquo in the model prior probabilities π(M|M) In addition to the multiplicity of the

testing problem disregarding the hierarchical polynomial structure in the predictors in

model selection procedures has the potential to lead to different results according to

how the predictors are coded (eg in what units these predictors are expressed)

To confront this situation we propose three prior structures for well-formulated

models take advantage of the hierarchical structure of the predictors Of the priors

proposed we recommend the HOP using the hyperparameter choice (1 ch) which

provides the best control of false positives while maintaining a reasonable true positive

rate

Overall considering the flexibility of the latent approach several other extensions of

these methods follow Currently we envision three future developments (1) occupancy

models incorporate various sources of information (2) multi-species models that make

116

use of spatial and interspecific dependence and (3) investigate methods to conduct

model selection for the dynamic and spatially explicit version of the model

117

APPENDIX AFULL CONDITIONAL DENSITIES DYMOSS

In this section we introduce the full conditional probability density functions for all

the parameters involved in the DYMOSS model using probit as well as logic links

Sampler Z

The full conditionals corresponding to the presence indicators have the same form

regardless of the link used These are derived separately for the cases t = 1 1 lt t lt T

and t = T since their corresponding probabilities take on slightly different forms

Let ϕ(ν|microσ2) represent the density for a normal random variable ν with mean micro and

variance σ2 and recall that ψi1 = F (xprime(o)iα) and pijt = F (qprimeijtλt) where F () is the

inverse link function The full conditional for zit is given by

1 For t = 1

π(zi1|vi1αλ1βc1 δ

s1) = ψlowast

i1zi1 (1minus ψlowast

i1)1minuszi1

= Bernoulli(ψlowasti1) (Andash1)

where

ψlowasti1 =

ψi1ϕ(vi1|xprimei1βc1 + δs1 1)

prodJi1j=1(1minus pij1)

ψi1ϕ(vi1|xprimei1βc1 + δs1 1)

prodJi1j=1(1minus pij1) + (1minus ψi1)ϕ(vi1|xprimei1β

c1 1)

prodJj=1 Iyij1=0

2 For 1 lt t lt T

π(zit |zi(tminus1) zi(t+1)λt βctminus1 δ

stminus1) = ψlowast

itzit (1minus ψlowast

it)1minuszit

= Bernoulli(ψlowastit) (Andash2)

where

ψlowastit =

κitprodJit

j=1(1minus pijt)

κitprodJit

j=1(1minus pijt) +nablait

prodJj=1 Iyijt=0

with

(a) κit = F (xprimei(tminus1)β

ctminus1 + zi(tminus1)δ

stminus1)ϕ(vit |xprimeitβ

ct + δst 1) and

(b) nablait =(1minus F (xprime

i(tminus1)βctminus1 + zi(tminus1)δ

stminus1)

)ϕ(vit |xprimeitβ

ct 1)

3 For t = T

π(ziT |zi(Tminus1)λT βcTminus1 δ

sTminus1) = ψ⋆iT

ziT (1minus ψ⋆iT )1minusziT

118

=

Nprodi=1

Bernoulli(ψ⋆iT ) (Andash3)

where

ψ⋆iT =κ⋆iT

prodJiTj=1(1minus pijT )

κ⋆iTprodJiT

j=1(1minus pijT ) +nabla⋆iT

prodJj=1 IyijT=0

with

(a) κ⋆iT = F (xprimei(Tminus1)β

cTminus1 + zi(Tminus1)δ

sTminus1) and

(b) nabla⋆iT =

(1minus F (xprime

i(Tminus1)βcTminus1 + zi(Tminus1)δ

sTminus1)

)Sampler ui

1

π(ui |zi1α) = tr N(xprime(o)iα 1 trunc(zi1))

where trunc(zi1) =

(minusinfin 0] zi1 = 0

(0infin) zi1 = 1(Andash4)

and tr N(microσ2A) denotes the pdf of a truncated normal random variable with mean microvariance σ2 and truncation region A

Sampler α

1

π(α|u) prop [α]

Nprodi=1

ϕ(ui xprime(o)iα 1) (Andash5)

If [α] prop 1 then

α|u sim N(m(α)α)

with m(α) = αXprime(o)u and α = (X prime

(o)X(o))minus1

Sampler vit

1 (For t gt 1)

π(vi (tminus1)|zi (tminus1) zit βctminus1 δ

stminus1) = tr N

(micro(v)i(tminus1) 1 trunc(zit)

)(Andash6)

where micro(v)i(tminus1) = xprime

i(tminus1)βctminus1 + zi(tminus1)δ

ci(tminus1) and trunc(zit) defines the corresponding

truncation region given by zit

119

Sampler(β(c)tminus1 δ

(c)tminus1

)

1 (For t gt 1)

π(β(s)tminus1 δ

(c)tminus1|vtminus1 ztminus1) prop [β

(s)tminus1 δ

(c)tminus1]

Nprodi=1

ϕ(vit xprimei(tminus1)β

(c)tminus1 + zi(tminus1)δ

(s)tminus1 1) (Andash7)

If[β(c)tminus1 δ

(s)tminus1

]prop 1 then

β(c)tminus1 δ

(s)tminus1|vtminus1 ztminus1 sim N(m(β

(c)tminus1 δ

(s)tminus1)tminus1)

with m(β(c)tminus1 δ

(s)tminus1) = tminus1 ~X

primetminus1vtminus1 and tminus1 = (~X prime

tminus1 ~Xtminus1)minus1 where ~Xtminus1 =(

Xtminus1 ztminus1)

Sampler wijt

1 (For t gt 1 and zit = 1)

π(wijt | i zit = 1 yijt λ) = tr N(qprimeijtλt 1 tr(yijt)

)(Andash8)

Sampler λt

1 (For t = 1 2 T )

π(λt |zt wt) prop [λt ]prod

i zit=1

Jitprodj=1

ϕ(wijt qprimeijtλt 1) (Andash9)

If [λt ] prop 1 then

λt |wt zt sim N(m(λt)λt)

with m(λt) = λtQ primetwt and λt

= (Q primetQt)

minus1 where Qt and wt respectively are the designmatrix and the vector of latent variables for surveys of sites such that zit = 1

120

APPENDIX BRANDOM WALK ALGORITHMS

Global Jump From the current state M the global jump is performed by drawing

a model M prime at random from the model space This is achieved by beginning at the base

model and increasing the order from JminM to the Jmax

M the minimum and maximum orders

of nodes in (MF ) = MF MB at each order a set of nodes is selected at random from

the prior conditioned on the nodes already in the model The MH correction is

α =

1m(y|M primeM)

m(y|MM)

Local Jump From the current state M the local jump is performed by drawing a

model from the set of models L(M) = Mα α isin E(M) cup C(M) where Mα is M α

for α isin E(M) and M cup α for α isin C(M) The proposal probabilities for the model are

computed as a mixture of p(M prime|yMM prime isin L(M)) and the discrete uniform distribution

The proposal kernel is

q(M prime|yMM prime isin L(M)) =1

2

(p(M prime|yMM prime isin L(M)) +

1

|L(M)|

)This choice promotes moving to better models while maintaining a non-negligible

probability of moving to any of the possible models The MH correction is

α =

1m(y|M primeM)

m(y|MM)

q(M|yMM isin L(M prime))

q(M prime|yMM prime isin L(M))

Intermediate Jump The intermediate jump is performed by increasing or

decreasing the order of the nodes under consideration performing local proposals based

on order For a model M prime define Lj(Mprime) = M prime cup M prime

α α isin (E(M prime) cup C(M prime)) capj(MF )

From a state M the kernel chooses at random whether to increase or decrease the

order If M = MF then decreasing the order is chosen with probability 1 and if M = MB

then increasing the order is chosen with probability 1 in all other cases the probability of

increasing and decreasing order is 12 The proposal kernels are given by

121

Increasing order proposal kernel

1 Set j = JminM minus 1 and M prime

j = M

2 Draw M primej+1 from qincj+1(M

prime|yMM prime isin Lj+1(Mprimej )) where

qincj+1(Mprime|yMM prime isin Lj+1(M

primej )) =

12

(p(M prime|yMM prime isin Lj+1(M

primej )) +

1|Lj+1(M

primej)|

)

3 Set j = j + 1

4 If j lt JmaxM then return to 2 O therwise proceed to 5

5 Set M prime = M primeJmaxM

and compute the proposal probability

qinc(Mprime|yMM) =

JmaxM minus1prod

j=JminM minus1

qincj+1(Mprimej |yMM prime isin Lj+1(M

primej )) (Bndash1)

Decreasing order proposal kernel

1 Set j = JmaxM + 1 and M prime

j = M

2 Draw M primejminus1 from qdecjminus1(M

prime|yMM prime isin Ljminus1(Mprimej )) where

qdecjminus1(Mprime|yMM prime isin Ljminus1(M

primej )) =

12

(p(M prime|yMM prime isin Ljminus1(M

primej )) +

1|Ljminus1(M

primej)|

)

3 Set j = j minus 1

4 If j gt JminM then return to 2 Otherwise proceed to 5

5 Set M prime = M primeJminM

and compute the proposal probability

qdec(Mprime|yMM) =

JminM +1prod

j=JmaxM +1

qdecjminus1(Mprimej |yMM prime isin Ljminus1(M

primej )) (Bndash2)

If increasing order is chosen then the MH correction is given by

α = min

1

(1 + I (M prime = MF )

1 + I (M = MB)

)qdec(M|yMM prime)

qinc(M prime|yMM)

p(M prime|yM)

p(M|yM)

(Bndash3)

and similarly if decreasing order is chosen

Other Local and Intermediate Kernels The local and intermediate kernels

described here perform a kind of stochastic forwards-backwards selection Each kernel

122

q can be relaxed to allow more than one node to be turned on or off at each step which

could provide larger jumps for each of these kernels The tradeoff is that number of

proposed models for such jumps could be very large precluding the use of posterior

information in the construction of the proposal kernel

123

APPENDIX CWFM SIMULATION DETAILS

Briefly the idea is to let ZMT(X )βMT

= (QR)βMT= QηMT

(ie βMT= Rminus1ηMT

)

using the QR decomposition As such setting all values in ηMTproportional to one

corresponds to distributing the signal in the model uniformly across all predictors

regardless of their order

The (unconditional) variance of a single observation yi is var(yi) = var (E [yi |zi ]) +

E [var(yi |zi)] where zi is the i -th row of the design matrix ZMT Hence we take the

signal to noise ratio for each observation to be

SNR(η) = ηTMT

RminusTzRminus1ηMT

σ2

where z = var(zi) We determine how the signal is distributed across predictors up to a

proportionality constant to be able to control simultaneously the signal to noise ratio

Additionally to investigate the ability of the model to capture correctly the

hierarchical structure we specify four different 0-1 vectors that determine the predictors

in MT which generates the data in the different scenarios

Table C-1 Experimental conditions WFM simulationsParameter Values considered

SNR(ηMT) = k 025 1 4

ηMTprop (1 13 14 12) (1 13 1214

1412) (1 1413

1214 12)

γMT(1 13 14 12) (1 13 14 02) (1 13 04 12) (1 03 0 1 1 0 12)

n 130 260 1040

The results presented below are somewhat different from those found in the main

body of the article in Section 5 These are extracted averaging the number of FPrsquos

TPrsquos and model sizes respectively over the 100 independent runs and across the

corresponding scenarios for the 20 highest probability models

124

SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1) we observe that as

expected small sample sizes conditioned upon a small SNR impair the ability of the

algorithm to detect true coefficients with both the EPP and HOP(1 ch) with this effect

more notorious when using the latter prior However considering the mean number

of true positives (TP) jointly with the mean model size it is clear that although the

sensitivity is low most of the few predictors that are discovered belong to the true

model The results observed with SNR of 025 and a relatively small sample size are

far from being impressive however real problems where the SNR is as low as 025

will yield many spurious associations under the EPP The fact that the HOP(1 ch) has

a strong protection against false positive is commendable in itself A SNR of 1 also

represents a feeble relationship between the predictors and the response nonetheless

the method captures approximately half of the true coefficients while including very few

false positives Following intuition as either the sample size or the SNR increase the

algorithms performance is greatly enhanced Either having a large sample size or a

large SNR yields models that contain mostly true predictors Additionally HOP(1 ch)

provides a strong control over the number of false positives therefore for high SNR

or larger sample sizes the number of predictors in the top 20 models is close to the

size of the true model In general the EPP allows the detection of more TPrsquos while

the HOP(1 ch) provides a stronger control on the amount of FPrsquos included when

considering small sample sizes combined with small SNRs As either sample size or

SNR grows the differences between the two priors become indistinct

125

Figure C-1 SNR vs n Average model size average true positives and average false

positives for all simulated scenarios by model ranking according to model

posterior probabilities

Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across

predictors As mentioned before sphering is used to assign the coefficients values

in a manner that controls the amount of signal that goes into each coefficient Three

possible ways to allocate the signal are considered First each order-one coefficient

contains twice as much signal as any order-two coefficient and four times as much

any as order-three coefficient second all coefficients contain the same amount of

signal regardless of their order and third each order-one coefficient contains a half

as much signal as any order-two coefficient and a quarter of what any order-three

126

coefficient has In Figure C-2 these values are denoted by β = c(1o1 05o2 025o3)

β = c(1o1 1o2 1o3) and β = c(025o1 05o2 1o3) respectively

Observe that the number of FPrsquos is invulnerable to how the SNR is distributed

across predictors using the HOP(1 ch) conversely when using the EPP the number

of FPrsquos decreases as the SNR grows always being slightly higher than those obtained

with the HOP With either prior structure the algorithm performs better whenever all

coefficients are equally weighted or when those for the order-three terms have higher

weights In these two cases (ie with β = c(1o1 05o2 025o3) or β = c(1o1 1o2 1o3))

the effect of the SNR appears to be similar In contrast when more weight is given to

order one terms the algorithm yields slightly worse models at any SNR level This is an

intuitive result since giving more signal to higher order terms makes it easier to detect

higher order terms and consequently by strong heredity the algorithm will also select

the corresponding lower order terms included in the true model

Special Points on the Scale

In Nelder (1998) the author argues that the conditions under which the

weak-heredity principle can be used for model selection are so restrictive that the

principle is commonly not valid in practice in this context In addition the author states

that considering well-formulated models only does not take into account the possible

presence of special points on the scales of the predictors that is situations where

omitting lower order terms is justified due to the nature of the data However it is our

contention that every model has an underlying well-formulated structure whether or not

some predictor has special points on its scale will be determined through the estimation

of the coefficients once a valid well-formulated structure has been chosen

To understand how the algorithm behaves whenever the true data generating

mechanism has zero-valued coefficients for some lower order terms in the hierarchy

four different true models are considered Three of them are not well-formulated while

the remaining one is the WFM shown in Figure 4-6 The three models that have special

127

Figure C-2 SNR vs coefficient values Average model size average true positives andaverage false positives for all simulated scenarios by model rankingaccording to model posterior probabilities

points correspond to the same model MT from Figure 4-6 but have respectively

zero-valued coefficients for all the order-one terms all the order-two terms and for x21

and x2x5

As seen before in comparison to the EPP the HOP(1 ch) tightly controls the

inclusion FPs by choosing smaller models at the expense of also reducing the TP

count especially when there is more uncertainty about the true model (ie SNR=025)

For both prior structures the results in Figure C-3 indicate that at low SNR levels the

presence of special points has no apparent impact as the selection behavior is similar

between the four models in terms of both the TP and FP As the SNR increases the

TPs and the model size are affected for true models with zero-valued lower order

128

Figure C-3 SNR vs different true models MT Average model size average truepositives and average false positives for all simulated scenarios by modelranking according to model posterior probabilities

terms These differences however are not very large Relatively smaller models are

selected whenever some terms in the hierarchy are missing but with high SNR which

is where the differences are most pronounced the predictors included are mostly true

coefficients The impact is almost imperceptible for the true model that lacks order one

terms and the model with zero coefficients for x21 and x2x5 and is more visible for models

without order two terms This last result is expected due to strong-heredity whenever

the order-one coefficients are missing the inclusion of order-two and order-three

terms will force their selection which is also the case when only a few order two terms

have zero-valued coefficients Conversely when all order two predictors are removed

129

some order three predictors are not selected as their signal is attributed the order two

predictors missing from the true model This is especially the case for the order three

interaction term x1x2x5 which depends on the inclusion of three order two terms terms

(x1x2 x1x5 x2x5) in order for it to be included as well This makes the inclusion of this

term somewhat more challenging the three order two interactions capture most of

the variation of the polynomial terms that is present when the order three term is also

included However special points on the scale commonly occur on a single or at most

on a few covariates A true data generating mechanism that removes all terms of a given

order in the context of polynomial models is clearly not justified here this was only done

for comparison purposes

130

APPENDIX DSUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang

et al (2008) these are displayed in Table D below

Table D-1 Variables used in the analyses of the ozone contamination datasetName Descriptionozone Daily max 1hr-average ozone (ppm) at Upland CA

vh 500 millibar pressure height (m) at Vandenberg AFBwind Wind speed (mph) at LAXhum Humidity () at LAXtemp Temperature (F) measured at Sandburg CAibh Inversion base height (ft) at LAXdpg Pressure gradient (mm Hg) from LAX to Daggett CAvis Visibility (miles) measured at LAXibt Inversion base temperature (F) at LAX

The marginal posterior inclusion probability corresponds to the probability of including a given term in the full model MF after summing over all models in the model space. For each node \alpha \in MF, this probability is given by

p_\alpha = \sum_{M \in \mathcal{M}} I(\alpha \in M)\, p(M \mid \mathbf{y}, \mathcal{M}).

In problems with a large model space, such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible. Thus, these probabilities are estimated by summing over every model drawn by the random walk over the model space \mathcal{M}.
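A minimal Python sketch of this estimator is shown below. The visited models and their renormalized posterior probabilities are hypothetical placeholders for the random-walk output; only the accumulation of p_alpha over the sampled models is illustrated.

```python
from collections import defaultdict

def marginal_inclusion_probabilities(sampled_models, posterior_probs):
    """Estimate marginal posterior inclusion probabilities from visited models.

    sampled_models : list of sets, each holding the terms (nodes) in a distinct
        model visited by the random walk over the model space.
    posterior_probs : list of the corresponding model posterior probabilities,
        renormalized over the visited models.
    """
    mpip = defaultdict(float)
    for model, prob in zip(sampled_models, posterior_probs):
        for term in model:
            mpip[term] += prob  # p_alpha = sum of p(M | y) over models containing alpha
    return dict(mpip)

# Hypothetical example with three visited models built from ozone-data terms
models = [{"hum", "ibt"}, {"hum", "ibt", "hum*ibt"}, {"dpg", "ibt"}]
probs = [0.5, 0.3, 0.2]  # renormalized posterior probabilities, summing to 1
print(marginal_inclusion_probabilities(models, probs))
# e.g. {'hum': 0.8, 'ibt': 1.0, 'hum*ibt': 0.3, 'dpg': 0.2} (key order may vary)
```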

Given that there are in total 44 potential predictors, for convenience Tables D-2 to D-5 below only display the marginal posterior probabilities for the terms included under at least one of the model priors considered (EPP, HIP, HUP and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(1,1) and Hyper-g(2,1)).

Table D-2. Marginal inclusion probabilities, intrinsic prior

           EPP    HIP    HUP    HOP
hum        0.99   0.69   0.85   0.76
dpg        0.85   0.48   0.52   0.53
ibt        0.99   1.00   1.00   1.00
hum^2      0.76   0.51   0.43   0.62
hum·dpg    0.55   0.02   0.03   0.17
hum·ibt    0.98   0.69   0.84   0.75
dpg^2      0.72   0.36   0.25   0.46
ibt^2      0.59   0.78   0.57   0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior

           EPP    HIP    HUP    HOP
hum        0.76   0.67   0.80   0.69
dpg        0.89   0.50   0.55   0.58
ibt        0.99   1.00   1.00   1.00
hum^2      0.57   0.49   0.40   0.57
hum·ibt    0.72   0.66   0.78   0.68
dpg^2      0.81   0.38   0.31   0.51
ibt^2      0.54   0.76   0.55   0.77

Table D-4. Marginal inclusion probabilities, Hyper-g(1,1)

           EPP    HIP    HUP    HOP
vh         0.54   0.05   0.10   0.11
hum        0.81   0.67   0.80   0.69
dpg        0.90   0.50   0.55   0.58
ibt        0.99   1.00   0.99   0.99
hum^2      0.61   0.49   0.40   0.57
hum·ibt    0.78   0.66   0.78   0.68
dpg^2      0.83   0.38   0.30   0.51
ibt^2      0.49   0.76   0.54   0.77

Table D-5. Marginal inclusion probabilities, Hyper-g(2,1)

           EPP    HIP    HUP    HOP
hum        0.79   0.64   0.73   0.67
dpg        0.90   0.52   0.60   0.59
ibt        0.99   1.00   0.99   1.00
hum^2      0.60   0.47   0.37   0.55
hum·ibt    0.76   0.64   0.71   0.67
dpg^2      0.82   0.41   0.36   0.52
ibt^2      0.47   0.73   0.49   0.75

REFERENCES

Akaike, H. (1983). Information measures and model selection. Bulletin of the International Statistical Institute, 50, 277–290.

Albert, J. H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Berger J amp Bernardo J (1992) On the development of reference priors BayesianStatistics 4 (pp 35ndash60)

URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf

Berger J amp Pericchi L (1996) The intrinsic Bayes factor for model selection andprediction Journal of the American Statistical Association 91(433) 109ndash122

URL httpamstattandfonlinecomdoiabs10108001621459199610476668

Berger J Pericchi L amp Ghosh J (2001) Objective Bayesian methods for modelselection introduction and comparison In Model selection vol 38 of IMS LectureNotes Monogr Ser (pp 135ndash207) Inst Math Statist

URL httpwwwjstororgstable1023074356165

Besag J York J amp Mollie A (1991) Bayesian Image-Restoration with 2 Applicationsin Spatial Statistics Annals of the Institute of Statistical Mathematics 43 1ndash20

Bien J Taylor J amp Tibshirani R (2013) A lasso for hierarchical interactions TheAnnals of Statistics 41(3) 1111ndash1141

URL httpprojecteuclidorgeuclidaos1371150895

Breiman L amp Friedman J (1985) Estimating optimal transformations for multipleregression and correlation Journal of the American Statistical Association 80580ndash598

Brusco M J Steinley D amp Cradit J D (2009) An exact algorithm for hierarchicallywell-formulated subsets in second-order polynomial regression Technometrics 51(3)306ndash315

Casella, G., Girón, F. J., Martínez, M. L. & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.

URL httpprojecteuclidorgeuclidaos1239369020

Casella G Moreno E amp Giron F (2014) Cluster Analysis Model Selection and PriorDistributions on Models Bayesian Analysis TBA(TBA) 1ndash46

URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf


Chipman H (1996) Bayesian variable selection with related predictors CanadianJournal of Statistics 24(1) 17ndash36

URL httponlinelibrarywileycomdoi1023073315687abstract

Clyde M amp George E I (2004) Model Uncertainty Statistical Science 19(1) 81ndash94

URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274

Dewey J (1958) Experience and nature New York Dover Publications

Dorazio, R. M. & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison A M (2004) Bayesian inference in ecology Ecology Letters 7 509ndash520

Fiske I amp Chandler R (2011) unmarked An R package for fitting hierarchical modelsof wildlife occurrence and abundance Journal of Statistical Software 43(10)

URL httpcorekmiopenacukdownloadpdf5701760pdf

George E (2000) The variable selection problem Journal of the American StatisticalAssociation 95(452) 1304ndash1308

URL httpwwwtandfonlinecomdoiabs10108001621459200010474336

Girón, F. J., Moreno, E., Casella, G. & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas, 104(1), 57–67.

URL httpwwwspringerlinkcomindex105052RACSAM201006

Good I J (1950) Probability and the Weighing of Evidence New York Haffner

Griepentrog G L Ryan J M amp Smith L D (1982) Linear transformations ofpolynomial regression-models American Statistician 36(3) 171ndash174

Gunel E amp Dickey J (1974) Bayes factors for independence in contingency tablesBiometrika 61 545ndash557

Hanski I (1994) A Practical Model of Metapopulation Dynamics Journal of AnimalEcology 63 151ndash162

Hooten M (2006) Hierarchical spatio-temporal models for ecological processesDoctoral dissertation University of Missouri-Columbia

URL httpsmospacelibraryumsystemeduxmluihandle103554500

Hooten M B amp Hobbs N T (2014) A Guide to Bayesian Model Selection forEcologists Ecological Monographs (In Press)


Hughes J amp Haran M (2013) Dimension reduction and alleviation of confoundingfor spatial generalized linear mixed models Journal of the Royal Statistical SocietySeries B Statistical Methodology 75 139ndash159

Hurvich C M amp Tsai C-L (1989) Regression and time series model selection insmall samples Biometrika 76 297ndash307

URL httpbiometoxfordjournalsorgcontent762297abstract

Jeffreys H (1935) Some tests of significance treated by the theory of probabilityProcedings of the Cambridge Philosophy Society 31 203ndash222

Jeffreys H (1961) Theory of Probability London Oxford University Press 3rd ed

Johnson D Conn P Hooten M Ray J amp Pond B (2013) Spatial occupancymodels for large data sets Ecology 94(4) 801ndash808

URL httpwwwesajournalsorgdoiabs10189012-05641mi=3eywlhampaf=R

ampsearchText=human+population

Kass R amp Wasserman L (1995) A reference Bayesian test for nested hypothesesand its relationship to the Schwarz criterion Journal of the American StatisticalAssociation 90(431)

URL httpamstattandfonlinecomdoiabs10108001621459199510476592

Kass R E amp Raftery A E (1995) Bayes Factors Journal of the American StatisticalAssociation 90 773ndash795

URL httpwwwtandfonlinecomdoiabs10108001621459199510476572$

delimiter026E30F$nhttpwwwtandfonlinecomdoiabs10108001621459

199510476572UvBybrTIgcs

Kass R E amp Wasserman L (1996) The Selection of Prior Distributions by FormalRules Journal of the American Statistical Association 91(435) 1343

URL httpwwwjstororgstable2291752origin=crossref

Kery M (2010) Introduction to WinBUGS for Ecologists Bayesian Approach toRegression ANOVA Mixed Models and Related Analyses Academic Press 1st ed

Kéry, M., Gardner, B. & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri A (2002) Nonsingular linear transformations of the control variables in responsesurface models Technical Report

Krebs C J (1972) Ecology the experimental analysis of distribution and abundance


Lempers F B (1971) Posterior probabilities of alternative linear models University ofRotterdam Press Rotterdam

Leon-Novelo L Moreno E amp Casella G (2012) Objective Bayes model selection inprobit models Statistics in medicine 31(4) 353ndash65

URL httpwwwncbinlmnihgovpubmed22162041

Liang F Paulo R Molina G Clyde M a amp Berger J O (2008) Mixtures of g Priorsfor Bayesian Variable Selection Journal of the American Statistical Association103(481) 410ndash423

URL httpwwwtandfonlinecomdoiabs101198016214507000001337

Link W amp Barker R (2009) Bayesian inference with ecological applications Elsevier

URL httpbooksgooglecombookshl=enamplr=ampid=hecon2l2QPcCampoi=fnd

amppg=PP2ampdq=Bayesian+Inference+with+ecological+applicationsampots=S82_

0pxrNmampsig=L3xbsSQcKD8FV6rxCMp2pmP2JKk

MacKenzie D amp Nichols J (2004) Occupancy as a surrogate for abundanceestimation Animal biodiversity and conservation 1 461ndash467

URL httpcrsitbacidmediajurnalrefslandscapemackenzie2004zhpdf

MacKenzie D Nichols J amp Hines J (2003) Estimating site occupancy colonizationand local extinction when a species is detected imperfectly Ecology 84(8)2200ndash2207

URL httpwwwesajournalsorgdoiabs10189002-3090

MacKenzie, D. I., Bailey, L. L. & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie D I Nichols J D Lachman G B Droege S Royle J A amp LangtimmC A (2002) Estimating site occupancy rates when detection probabilities are lessthan one Ecology 83(8) 2248ndash2255

Mazerolle, M. & Mazerolle, M. (2013). Package 'AICcmodavg'.

URL ftpheanetarchivegnewsenseorgdisk1CRANwebpackages

AICcmodavgAICcmodavgpdf

McCullagh P amp Nelder J A (1989) Generalized linear models (2nd ed) LondonEngland Chapman amp Hall

McQuarrie A Shumway R amp Tsai C-L (1997) The model selection criterion AICu


Moreno E Bertolino F amp Racugno W (1998) An intrinsic limiting procedure for modelselection and hypotheses testing Journal of the American Statistical Association93(444) 1451ndash1460

Moreno E Giron F J amp Casella G (2010) Consistency of objective Bayes factors asthe model dimension grows The Annals of Statistics 38(4) 1937ndash1952

URL httpprojecteuclidorgeuclidaos1278861238

Nelder J A (1977) Reformulation of linear-models Journal of the Royal StatisticalSociety Series A - Statistics in Society 140 48ndash77

Nelder J A (1998) The selection of terms in response-surface models - how strong isthe weak-heredity principle American Statistician 52(4) 315ndash318

Nelder J A (2000) Functional marginality and response-surface fitting Journal ofApplied Statistics 27 (1) 109ndash112

Nichols J Hines J amp Mackenzie D (2007) Occupancy estimation and modeling withmultiple states and state uncertainty Ecology 88(6) 1395ndash1400

URL httpwwwesajournalsorgdoipdf10189006-1474

Ovaskainen O Hottola J amp Siitonen J (2010) Modeling species co-occurrenceby multivariate logistic regression generates new hypotheses on fungal interactionsEcology 91(9) 2514ndash21

URL httpwwwncbinlmnihgovpubmed20957941

Peixoto J L (1987) Hierarchical variable selection in polynomial regression-modelsAmerican Statistician 41(4) 311ndash313

Peixoto J L (1990) A property of well-formulated polynomial regression-modelsAmerican Statistician 44(1) 26ndash30

Pericchi L R (2005) Model selection and hypothesis testing based on objectiveprobabilities and bayes factors In Handbook of Statistics Elsevier

Polson N G Scott J G amp Windle J (2013) Bayesian Inference for Logistic ModelsUsing Polya-Gamma Latent Variables Journal of the American Statistical Association108 1339ndash1349

URL httpdxdoiorg101080016214592013829001

Rao C R amp Wu Y (2001) On model selection vol Volume 38 of Lecture NotesndashMonograph Series (pp 1ndash57) Beachwood OH Institute of Mathematical Statistics

URL httpdxdoiorg101214lnms1215540960


Reich B J Hodges J S amp Zadnik V (2006) Effects of residual smoothing on theposterior of the fixed effects in disease-mapping models Biometrics 62 1197ndash1206

Reiners W amp Lockwood J (2009) Philosophical Foundations for the Practices ofEcology Cambridge University Press

URL httpbooksgooglecombooksid=dr9cPgAACAAJ

Rigler F amp Peters R (1995) Excellence in Ecology Science and Limnology EcologyInstitute Germany

URL httportoncatieaccrcgi-binwxisexeIsisScript=CIENLxis

ampmethod=postampformato=2ampcantidad=1ampexpresion=mfn=008268

Robert, C., Chopin, N. & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.

URL httpswwwnewtonacukpreprintsNI08021pdf

Robert C P (1993) A note on jeffreys-lindley paradox Statistica Sinica 3 601ndash608

Royle J A amp Kery M (2007) A Bayesian state-space formulation of dynamicoccupancy models Ecology 88(7) 1813ndash23

URL httpwwwncbinlmnihgovpubmed17645027

Scott J amp Berger J (2010) Bayes and Empirical-Bayes Multiplicity Adjustment in thevariable selection problem The Annals of Statistics

URL httpprojecteuclidorgeuclidaos1278861454

Spiegelhalter D J amp Smith A F M (1982) Bayes factor for linear and log-linearmodels with vague prior information J R Statist Soc B 44 377ndash387

Tierney L amp Kadane J B (1986) Accurate approximations for posterior moments andmarginal densities Journal of the American Statistical Association 81 82ndash86

Tyre A J Tenhumberg B Field S a Niejalke D Parris K amp Possingham H P(2003) Improving Precision and Reducing Bias in Biological Surveys EstimatingFalse-Negative Error Rates Ecological Applications 13(6) 1790ndash1801

URL httpwwwesajournalsorgdoiabs10189002-5078

Waddle J H Dorazio R M Walls S C Rice K G Beauchamp J Schuman M Jamp Mazzotti F J (2010) A new parameterization for estimating co-occurrence ofinteracting species Ecological applications a publication of the Ecological Society ofAmerica 20 1467ndash1475

Wasserman L (2000) Bayesian Model Selection and Model Averaging Journal ofmathematical psychology 44(1) 92ndash107


URL httpwwwncbinlmnihgovpubmed10733859

Wilson M Iversen E Clyde M A Schmidler S C amp Schildkraut J M (2010)Bayesian model search and multilevel inference for SNP association studies TheAnnals of Applied Statistics 4(3) 1342ndash1364

URL httpwwwncbinlmnihgovpmcarticlesPMC3004292

Womack A J Leon-Novelo L amp Casella G (2014) Inference from Intrinsic BayesProcedures Under Model Selection and Uncertainty Journal of the AmericanStatistical Association (June) 140114063448000

URL httpwwwtandfonlinecomdoiabs101080016214592014880348

Yuan M Joseph V R amp Zou H (2009) Structured variable selection and estimationThe Annals of Applied Statistics 3(4) 1738ndash1757

URL httpprojecteuclidorgeuclidaoas1267453962

Zeller K A Nijhawan S Salom-Perez R Potosme S H amp Hines J E (2011)Integrating occupancy modeling and interview data for corridor identification A casestudy for jaguars in nicaragua Biological Conservation 144(2) 892ndash901

Zellner A amp Siow A (1980) Posterior odds ratios for selected regression hypothesesIn Trabajos de estadıstica y de investigacion operativa (pp 585ndash603)

URL httpwwwspringerlinkcomindex5300770UP12246M9pdf


BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a B.S. degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a Ph.D. in interdisciplinary ecology with concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has currently accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.

• ACKNOWLEDGMENTS
• TABLE OF CONTENTS
• LIST OF TABLES
• LIST OF FIGURES
• ABSTRACT
• 1 GENERAL INTRODUCTION
  • 1.1 Occupancy Modeling
  • 1.2 A Primer on Objective Bayesian Testing
  • 1.3 Overview of the Chapters
• 2 MODEL ESTIMATION METHODS
  • 2.1 Introduction
    • 2.1.1 The Occupancy Model
    • 2.1.2 Data Augmentation Algorithms for Binary Models
  • 2.2 Single Season Occupancy
    • 2.2.1 Probit Link Model
    • 2.2.2 Logit Link Model
  • 2.3 Temporal Dynamics and Spatial Structure
    • 2.3.1 Dynamic Mixture Occupancy State-Space Model
    • 2.3.2 Incorporating Spatial Dependence
  • 2.4 Summary
• 3 INTRINSIC ANALYSIS FOR OCCUPANCY MODELS
  • 3.1 Introduction
  • 3.2 Objective Bayesian Inference
    • 3.2.1 The Intrinsic Methodology
    • 3.2.2 Mixtures of g-Priors
      • 3.2.2.1 Intrinsic priors
      • 3.2.2.2 Other mixtures of g-priors
  • 3.3 Objective Bayes Occupancy Model Selection
    • 3.3.1 Preliminaries
    • 3.3.2 Intrinsic Priors for the Occupancy Problem
    • 3.3.3 Model Posterior Probabilities
    • 3.3.4 Model Selection Algorithm
  • 3.4 Alternative Formulation
  • 3.5 Simulation Experiments
    • 3.5.1 Marginal Posterior Inclusion Probabilities for Model Predictors
    • 3.5.2 Summary Statistics for the Highest Posterior Probability Model
  • 3.6 Case Study: Blue Hawker Data Analysis
    • 3.6.1 Results Variable Selection Procedure
    • 3.6.2 Validation for the Selection Procedure
  • 3.7 Discussion
• 4 PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS
  • 4.1 Introduction
  • 4.2 Setup for Well-Formulated Models
    • 4.2.1 Well-Formulated Model Spaces
  • 4.3 Priors on the Model Space
    • 4.3.1 Model Prior Definition
    • 4.3.2 Choice of Prior Structure and Hyper-Parameters
    • 4.3.3 Posterior Sensitivity to the Choice of Prior
  • 4.4 Random Walks on the Model Space
    • 4.4.1 Simple Pruning and Growing
    • 4.4.2 Degree Based Pruning and Growing
  • 4.5 Simulation Study
    • 4.5.1 SNR and Sample Size Effect
    • 4.5.2 Coefficient Magnitude
    • 4.5.3 Special Points on the Scale
  • 4.6 Case Study: Ozone Data Analysis
  • 4.7 Discussion
• 5 CONCLUSIONS
• A FULL CONDITIONAL DENSITIES DYMOSS
• B RANDOM WALK ALGORITHMS
• C WFM SIMULATION DETAILS
• D SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS
• REFERENCES
• BIOGRAPHICAL SKETCH

LIST OF TABLES

Table page

1-1 Interpretation of BFji when contrasting Mj and Mi 20

3-1 Simulation control parameters occupancy model selector 69

3-2 Comparison of average minOddsMPIP under scenarios having different numberof sites (N=50 N=100) and under scenarios having different number of surveysper site (J=3 J=5) for the presence and detection components using uniformand multiplicity correction priors 75

3-3 Comparison of average minOddsMPIP for different levels of signal consideredin the occupancy and detection probabilities for the presence and detectioncomponents using uniform and multiplicity correction priors 75

3-4 Comparison between scenarios with 50 and 100 sites in terms of the averagepercentage of true positive and true negative terms over the highest probabilitymodels for the presence and the detection components using uniform andmultiplicity correcting priors on the model space 76

3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of thepercentage of true positive and true negative predictors averaged over thehighest probability models for the presence and the detection componentsusing uniform and multiplicity correcting priors on the model space 77

3-6 Comparison between scenarios with different level of signal in the occupancycomponent in terms of the percentage of true positive and true negative predictorsaveraged over the highest probability models for the presence and the detectioncomponents using uniform and multiplicity correcting priors on the model space 77

3-7 Comparison between scenarios with different level of signal in the detectioncomponent in terms of the percentage of true positive and true negative predictorsaveraged over the highest probability models for the presence and the detectioncomponents using uniform and multiplicity correcting priors on the model space 78

3-8 Posterior probability for the five highest probability models in the presencecomponent of the blue hawker data 80

3-9 Posterior probability for the five highest probability models in the detectioncomponent of the blue hawker data 80

3-10 MPIP presence component 81

3-11 MPIP detection component 81

3-12 Mean misclassification rate for HPMrsquos and MPMrsquos using uniform and multiplicitycorrection model priors 82


4-1 Characterization of the full models MF and corresponding model spaces Mconsidered in simulations 100

4-2 Mean number of false and true positives in 100 randomly generated datasetsas the number of main effects increases from three to five predictors in a is afull quadratic under the equal probability prior (EPP) the hierarchical independenceprior (HIP) the hierarchical order prior (HOP) and the hierarchical uniformprior (HUP) 102

4-3 Mean number of false and true positives in 100 randomly generated datasetsas the maximum order of MF increases from two to four in a full model withthree main effects under the equal probability prior (EPP) the hierarchicalindependence prior (HIP) the hierarchical order prior (HOP) and the hierarchicaluniform prior (HUP) 103

4-4 Mean number of false and true positives in 100 randomly generated datasetswith unstructured or irregular model spaces under the equal probability prior(EPP) the hierarchical independence prior (HIP) the hierarchical order prior(HOP) and the hierarchical uniform prior (HUP) 105

4-5 Variables used in the analyses of the ozone contamination dataset 112

4-6 Median probability models (MPM) from different combinations of parameterand model priors vs model selected using the hierarchical lasso 113

C-1 Experimental conditions WFM simulations 124

D-1 Variables used in the analyses of the ozone contamination dataset 131

D-2 Marginal inclusion probabilities intrinsic prior 132

D-3 Marginal inclusion probabilities Zellner-Siow prior 132

D-4 Marginal inclusion probabilities Hyper-g11 132

D-5 Marginal inclusion probabilities Hyper-g21 132


LIST OF FIGURES

Figure page

2-1 Graphical representation occupancy model 25

2-2 Graphical representation occupancy model after data-augmentation 31

2-3 Graphical representation multiseason model for a single site 39

2-4 Graphical representation data-augmented multiseason model 39

3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites usinguniform (U) and multiplicity correction (MC) priors 71

3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per siteusing uniform (U) and multiplicity correction (MC) priors 72

3-3 Predictor MPIP averaged over scenarios with the interaction between the numberof sites and the surveys per site using uniform (U) and multiplicity correction(MC) priors 72

3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancyprobabilities using uniform (U) and multiplicity correction (MC) priors 73

3-5 Predictor MPIP averaged over scenarios with equal signal in the detectionprobabilities using uniform (U) and multiplicity correction (MC) priors 73

4-1 Graphs of well-formulated polynomial models for p = 2 90

4-2 E(M) and C(M) in M defined by a quadratic surface in two main effects formodel M = 1 x1 x21 91

4-3 Graphical representation of assumptions on M defined by the quadratic surfacein two main effects 93

4-4 Prior probabilities for the space of well-formulated models associated to thequadratic surface on two variables where MB is taken to be the intercept onlymodel and (ab) isin (1 1) (1 ch) 97

4-5 Prior probabilities for the space of well-formulated models associated to threemain effects and one interaction term where MB is taken to be the interceptonly model and (ab) isin (1 1) (1 ch) 98

4-6 MT DAG of the largest true model used in simulations 109

4-7 Average true positives (TP) and average false positives (FP) in all simulatedscenarios for the median probability model with EPP and HOP(1 ch) 110

C-1 SNR vs n Average model size average true positives and average false positivesfor all simulated scenarios by model ranking according to model posterior probabilities126


C-2 SNR vs coefficient values Average model size average true positives andaverage false positives for all simulated scenarios by model ranking accordingto model posterior probabilities 128

C-3 SNR vs different true models MT Average model size average true positivesand average false positives for all simulated scenarios by model ranking accordingto model posterior probabilities 129


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OBJECTIVE BAYESIAN METHODS FOR OCCUPANCY MODEL ESTIMATION AND SELECTION

By

Daniel Taylor-Rodríguez

August 2014

Chair: Linda J. Young
Cochair: Nikolay Bliznyuk
Major: Interdisciplinary Ecology

The ecological literature contains numerous methods for conducting inference about the dynamics that govern biological populations. Among these methods, occupancy models have played a leading role during the past decade in the analysis of large biological population surveys. The flexibility of the occupancy framework has brought about useful extensions for determining key population parameters, which provide insights about the distribution, structure and dynamics of a population. However, the methods used to fit the models and to conduct inference have gradually grown in complexity, leaving practitioners unable to fully understand their implicit assumptions and increasing the potential for misuse. This motivated our first contribution: we develop a flexible and straightforward estimation method for occupancy models that provides the means to directly incorporate temporal and spatial heterogeneity using covariate information that characterizes habitat quality and the detectability of a species.

Adding to the issue mentioned above, studies of complex ecological systems now collect large amounts of information. To identify the drivers of these systems, robust techniques that account for test multiplicity and for the structure in the predictors are necessary but unavailable for ecological models. We develop tools to address this methodological gap. First, working in an "objective" Bayesian framework, we develop the first fully automatic and objective method for occupancy model selection based on intrinsic parameter priors. Moreover, for the general variable selection problem, we propose three sets of prior structures on the model space that correct for multiple testing, and a stochastic search algorithm that relies on the priors on the model space to account for the polynomial structure in the predictors.


CHAPTER 1
GENERAL INTRODUCTION

As with any other branch of science ecology strives to grasp truths about the

world that surrounds us and in particular about nature The objective truth sought

by ecology may well be beyond our grasp however it is reasonable to think that at

least partially, "Nature is capable of being understood" (Dewey 1958). We can observe

and interpret nature to formulate hypotheses which can then be tested against reality

Hypotheses that encounter no or little opposition when confronted with reality may

become contextual versions of the truth and may be generalized by scaling them

spatially andor temporally accordingly to delimit the bounds within which they are valid

To formulate hypotheses accurately and in a fashion amenable to scientific inquiry

not only the point of view and assumptions considered must be made explicit but

also the object of interest the properties worthy of consideration of that object and

the methods used in studying such properties (Reiners amp Lockwood 2009 Rigler amp

Peters 1995). Ecology, as defined by Krebs (1972), is "the study of interactions that determine the distribution and abundance of organisms". This characterizes organisms

and their interactions as the objects of interest to ecology and prescribes distribution

and abundance as a relevant property of these organisms

With regards to the methods used to acquire ecological scientific knowledge

traditionally theoretical mathematical models (such as deterministic PDEs) have been

used However naturally varying systems are imprecisely observed and as such are

subject to multiple sources of uncertainty that must be explicitly accounted for Because

of this the ecological scientific community is developing a growing interest in flexible

and powerful statistical methods and among these Bayesian hierarchical models

predominate These methods rely on empirical observations and can accommodate

fairly complex relationships between empirical observations and theoretical process

models while accounting for diverse sources of uncertainty (Hooten 2006)


Bayesian approaches are now used extensively in ecological modeling however

there are two issues of concern one from the standpoint of ecological practitioners

and another from the perspective of scientific ecological endeavors First Bayesian

modeling tools require a considerable understanding of probability and statistical theory

leading practitioners to view them as black box approaches (Kery 2010) Second

although Bayesian applications proliferate in the literature in general there is a lack of

awareness of the distinction between approaches specifically devised for testing and

those for estimation (Ellison 2004) Furthermore there is a dangerous unfamiliarity with

the proven risks of using tools designed for estimation in testing procedures (Berger amp

Pericchi 1996 Berger et al 2001 Kass amp Raftery 1995 Moreno et al 1998 Robert

et al 2009 Robert 1993) (eg use of flat priors in hypothesis testing)

Occupancy models have played a leading role during the past decade in large

biological population surveys The flexibility of the occupancy framework has allowed

the development of useful extensions to determine several key population parameters

which provide robust notions of the distribution structure and dynamics of a population

In order to address some of the concerns stated in previous paragraph we concentrate

in the occupancy framework to develop estimation and testing tools that will allow

ecologists first to gain insight about the estimation procedure and second to conduct

statistically sound model selection for site-occupancy data

1.1 Occupancy Modeling

Since MacKenzie et al (2002) and Tyre et al (2003) introduced the site-occupancy

framework countless applications and extensions of the method have been developed

in the ecological literature, as evidenced by the 438,000 hits on Google Scholar for a search of "occupancy model". This class of models acknowledges that techniques

used to conduct biological population surveys are prone to detection errors ndashif an

individual is detected it must be present while if it is not detected it might or might

not be Occupancy models improve upon traditional binary regression by accounting


for observed detection and partially observed presence as two separate but related

components In the site occupancy setting the chosen locations are surveyed

repeatedly in order to reduce the ambiguity caused by the observed zeros This

approach therefore allows probabilities of both presence (occurrence) and detection

to be estimated

The uses of site-occupancy models are many For example metapopulation

and island biogeography models are often parameterized in terms of site (or patch)

occupancy (Hanski 1992, 1994, 1997, as cited in MacKenzie et al. (2003)), and

occupancy may be used as a surrogate for abundance to answer questions regarding

geographic distribution range size and metapopulation dynamics (MacKenzie et al

2004 Royle amp Kery 2007)

The basic occupancy framework which assumes a single closed population with

fixed probabilities through time has proven to be quite useful however it might be of

limited utility when addressing some problems In particular assumptions for the basic

model may become too restrictive or unrealistic whenever the study period extends

throughout multiple years or seasons especially given the increasingly changing

environmental conditions that most ecosystems are currently experiencing

Among the extensions found in the literature one that we consider particularly

relevant incorporates heterogenous occupancy probabilities through time Models

that incorporate temporally varying probabilities stem from important meta-population

notions provided by Hanski (1994) such as occupancy probabilities depending on local

colonization and local extinction processes In spite of the conceptual usefulness of

Hanskirsquos model several strong and untenable assumptions (eg all patches being

homogenous in quality) are required for it to provide practically meaningful results

A more viable alternative which builds on Hanski (1994) is an extension of

the single season occupancy model of MacKenzie et al (2003) In this model the

heterogeneity of occupancy probabilities across seasons arises from local colonization


and extinction processes This model is flexible enough to let detection occurrence

extinction and colonization probabilities to each depend upon its own set of covariates

Model parameters are obtained through likelihood-based estimation

Using a maximum likelihood approach presents two drawbacks First the

uncertainty assessment for maximum likelihood parameter estimates relies on

asymptotic results which are obtained from implementation of the delta method

making it sensitive to sample size Second to obtain parameter estimates the latent

process (occupancy) is marginalized out of the likelihood leading to the usual zero

inflated Bernoulli model Although this is a convenient strategy for solving the estimation

problem after integrating the latent state variables (occupancy indicators) they are

no longer available Therefore finite sample estimates cannot be calculated directly

Instead a supplementary parametric bootstrapping step is necessary Further

additional structure such as temporal or spatial variation cannot be introduced by

means of random effects (Royle amp Kery 2007)

1.2 A Primer on Objective Bayesian Testing

With the advent of high dimensional data such as that found in modern problems

in ecology genetics physics etc coupled with evolving computing capability objective

Bayesian inferential methods have gained increasing popularity This however is by no

means a new approach in the way Bayesian inference is conducted In fact starting with

Bayes and Laplace and continuing for almost 200 years Bayesian analysis was primarily

based on "noninformative" priors (Berger & Bernardo 1992).

Now subjective elicitation of prior probabilities in Bayesian analysis is widely

recognized as the ideal (Berger et al 2001) however it is often the case that the

available information is insufficient to specify appropriate prior probabilistic statements

Commonly as in model selection problems where large model spaces have to be

explored the number of model parameters is prohibitively large preventing one from

eliciting prior information for the entire parameter space As a consequence in practice


the determination of priors through the definition of structural rules has become the

alternative to subjective elicitation for a variety of problems in Bayesian testing Priors

arising from these rules are known in the literature as noninformative objective default

or reference Many of these connotations generate controversy and are accused

perhaps rightly of providing a false pretension of objectivity Nevertheless we will avoid

that discussion and refer to them herein exchangeably as noninformative or objective

priors to convey the sense that no attempt to introduce an informed opinion is made in

defining prior probabilities

A plethora of "noninformative" methods has been developed in the past few

decades (see Berger amp Bernardo (1992) Berger amp Pericchi (1996) Berger et al (2001)

Clyde amp George (2004) Kass amp Wasserman (1995 1996) Liang et al (2008) Moreno

et al (1998) Spiegelhalter amp Smith (1982) Wasserman (2000) and the references

therein) We find particularly interesting those derived from the model structure in which

no tuning parameters are required especially since these can be regarded as automatic

methods Among them methods based on the Bayes factor for Intrinsic Priors have

proven their worth in a variety of inferential problems given their excellent performance

flexibility and ease of use This class of priors is discussed in detail in chapter 3 For

now some basic notation and notions of Bayesian inferential procedures are introduced

Hypothesis testing and the Bayes factor

Bayesian model selection techniques that aim to find the true model as opposed

to searching for the model that best predicts the data are fundamentally extensions to

Bayesian hypothesis testing strategies In general this Bayesian approach to hypothesis

testing and model selection relies on determining the amount of evidence found in favor

of one hypothesis (or model) over the other given an observed set of data Approached

from a Bayesian standpoint this type of problem can be formulated in great generality

using a natural well defined probabilistic framework that incorporates both model and

parameter uncertainty


Jeffreys (1935) first developed the Bayesian strategy to hypothesis testing and, consequently, to the model selection problem. Bayesian model selection within a model space M = {M1, M2, ..., MJ}, where each model is associated with a parameter θj (which may be a vector of parameters itself), incorporates three types of probability distributions: (1) a prior probability distribution for each model, π(Mj); (2) a prior probability distribution for the parameters in each model, π(θj | Mj); and (3) the distribution of the data conditional on both the model and the model's parameters, f(x | θj, Mj). These three probability densities induce the joint distribution p(x, θj, Mj) = f(x | θj, Mj) · π(θj | Mj) · π(Mj), which is instrumental in producing model posterior probabilities. The model posterior probability is the probability that a model is true given the data. It is obtained by marginalizing over the parameter space and using Bayes rule:

p(M_j \mid \mathbf{x}) = \frac{m(\mathbf{x} \mid M_j)\, \pi(M_j)}{\sum_{i=1}^{J} m(\mathbf{x} \mid M_i)\, \pi(M_i)},    (1-1)

where m(\mathbf{x} \mid M_j) = \int f(\mathbf{x} \mid \theta_j, M_j)\, \pi(\theta_j \mid M_j)\, d\theta_j is the marginal likelihood of M_j.

Given that interest lies in comparing different models, evidence in favor of one or another model is assessed with pairwise comparisons using posterior odds:

\frac{p(M_j \mid \mathbf{x})}{p(M_k \mid \mathbf{x})} = \frac{m(\mathbf{x} \mid M_j)}{m(\mathbf{x} \mid M_k)} \cdot \frac{\pi(M_j)}{\pi(M_k)}.    (1-2)

The first term on the right-hand side of (1-2), m(\mathbf{x} \mid M_j)/m(\mathbf{x} \mid M_k), is known as the Bayes factor comparing model M_j to model M_k, and it is denoted by BF_{jk}(\mathbf{x}). The Bayes factor provides a measure of the evidence in favor of either model given the data, and updates the model prior odds, given by \pi(M_j)/\pi(M_k), to produce the posterior odds.

Note that the model posterior probability in (1-1) can be expressed as a function of Bayes factors. To illustrate, let model M_* \in \mathcal{M} be a reference model, to which all other models in \mathcal{M} are compared. Then, dividing both the numerator and denominator in (1-1) by m(\mathbf{x} \mid M_*)\pi(M_*) yields

p(M_j \mid \mathbf{x}) = \frac{BF_{j*}(\mathbf{x})\, \frac{\pi(M_j)}{\pi(M_*)}}{1 + \sum_{M_i \in \mathcal{M},\, M_i \neq M_*} BF_{i*}(\mathbf{x})\, \frac{\pi(M_i)}{\pi(M_*)}}.    (1-3)

Therefore, as the Bayes factor increases, the posterior probability of model M_j given the data increases. If all models have equal prior probabilities, a straightforward criterion to select the best among all candidate models is to choose the model with the largest Bayes factor. As such, the Bayes factor is not only useful for identifying models favored by the data, but it also provides a means to rank models in terms of their posterior probabilities.
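A minimal Python sketch of the computation in (1-3) is given below, assuming the Bayes factor of each candidate model against a common reference model and the model prior probabilities are already available; the numerical values are purely illustrative.

```python
import numpy as np

def posterior_model_probs(bf_vs_ref, prior_probs):
    """Model posterior probabilities from Bayes factors against a reference model (eq. 1-3).

    bf_vs_ref   : BF_{j*}(x) for each model; the reference model itself has BF = 1.
    prior_probs : prior probabilities pi(M_j), summing to one.
    """
    weights = np.asarray(bf_vs_ref, float) * np.asarray(prior_probs, float)
    return weights / weights.sum()  # the common factor 1/pi(M_*) cancels

# Illustrative example: three models with equal prior probabilities
bf = [1.0, 4.5, 30.0]            # Bayes factors against the reference model M_1
prior = [1 / 3, 1 / 3, 1 / 3]
print(posterior_model_probs(bf, prior))  # approx [0.028, 0.127, 0.845]
```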

Assuming equal model prior probabilities in (1-3), the prior odds are set equal to one and the model posterior odds in (1-2) become p(M_j \mid \mathbf{x})/p(M_k \mid \mathbf{x}) = BF_{jk}(\mathbf{x}). Based on the Bayes factors, the evidence in favor of one or another model can be interpreted using Table 1-1, adapted from Kass & Raftery (1995).

Table 1-1. Interpretation of BF_ji when contrasting M_j and M_i

ln BF_ji   BF_ji       Evidence in favor of M_j   P(M_j | x)
0 to 2     1 to 3      Weak evidence              0.50-0.75
2 to 6     3 to 20     Positive evidence          0.75-0.95
6 to 10    20 to 150   Strong evidence            0.95-0.99
> 10       > 150       Very strong evidence       > 0.99

Bayesian hypothesis testing and model selection procedures through Bayes factors and posterior probabilities have several desirable features. First, these methods have a straightforward interpretation, since the Bayes factor is an increasing function of model (or hypothesis) posterior probabilities. Second, these methods can yield frequentist matching confidence bounds when implemented with good testing priors (Kass & Wasserman 1996), such as the reference priors of Berger & Bernardo (1992). Third, since the Bayes factor contains the ratio of marginal densities, it automatically penalizes complexity according to the number of parameters in each model; this property is known as Ockham's razor (Kass & Raftery 1995). Fourth, the use of Bayes factors does not require having nested hypotheses (i.e., having the null hypothesis nested in the alternative), standard distributions, or regular asymptotics (e.g., convergence to normal or chi-squared distributions) (Berger et al. 2001). In contrast, this is not always the case with frequentist and likelihood ratio tests, which depend on known distributions (at least asymptotically) for the test statistic to perform the test. Finally, Bayesian hypothesis testing procedures using the Bayes factor can naturally incorporate model uncertainty by using the Bayesian machinery for model-averaged predictions and confidence bounds (Kass & Raftery 1995). It is not clear how to account for this uncertainty rigorously in a fully frequentist approach.

1.3 Overview of the Chapters

In the chapters that follow, we develop a flexible and straightforward hierarchical Bayesian framework for occupancy models, allowing us to obtain estimates and conduct robust testing from an "objective" Bayesian perspective. Latent mixtures of random variables supply a foundation for our methodology. This approach provides a means to directly incorporate spatial dependency and temporal heterogeneity through predictors that characterize either the habitat quality of a given site or the detectability features of a particular survey conducted at a specific site. On the other hand, the Bayesian testing methods we propose are (1) a fully automatic and objective method for occupancy model selection and (2) an objective Bayesian testing tool that accounts for multiple testing and for polynomial hierarchical structure in the space of predictors.

Chapter 2 introduces the methods proposed for estimation of occupancy model

parameters A simple estimation procedure for the single season occupancy model

with covariates is formulated using both probit and logit links Based on the simple

version an extension is provided to cope with metapopulation dynamics by introducing

persistence and colonization processes Finally given the fundamental role that spatial

dependence plays in defining temporal dynamics a strategy to seamlessly account for

this feature in our framework is introduced


Chapter 3 develops a new fully automatic and objective method for occupancy model selection that is asymptotically consistent for variable selection and averts the use of tuning parameters. In this chapter, first some issues surrounding multimodel inference are described and insight about objective Bayesian inferential procedures is provided. Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are obtained. These are used in the construction of a variable selection algorithm for "objective" variable selection tailored to the occupancy model framework.

Chapter 4 touches on two important and interconnected issues when conducting

model testing that have yet to receive the attention they deserve (1) controlling for false

discovery in hypothesis testing given the size of the model space ie given the number

of tests performed and (2) non-invariance to location transformations of the variable

selection procedures in the face of polynomial predictor structure These elements both

depend on the definition of prior probabilities on the model space In this chapter a set

of priors on the model space and a stochastic search algorithm are proposed Together

these control for model multiplicity and account for the polynomial structure among the

predictors


CHAPTER 2
MODEL ESTIMATION METHODS

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

–Sherlock Holmes, The Adventure of the Copper Beeches

2.1 Introduction

Prior to the introduction of site-occupancy models (MacKenzie et al 2002 Tyre

et al 2003) presence-absence data from ecological monitoring programs were used

without any adjustment to assess the impact of management actions to observe trends

in species distribution through space and time or to model the habitat of a species (Tyre

et al 2003) These efforts however were suspect due to false-negative errors not

being accounted for False-negative errors occur whenever a species is present at a site

but goes undetected during the survey

Site-occupancy models developed independently by MacKenzie et al (2002)

and Tyre et al (2003) extend simple binary-regression models to account for the

aforementioned errors in detection of individuals common in surveys of animal or plant

populations Since their introduction the site-occupancy framework has been used in

countless applications and numerous extensions for it have been proposed Occupancy

models improve upon traditional binary regression by analyzing observed detection

and partially observed presence as two separate but related components In the site

occupancy setting the chosen locations are surveyed repeatedly in order to reduce the

ambiguity caused by the observed zeros This approach therefore allows simultaneous

estimation of the probabilities of presence (occurrence) and detection

Several extensions to the basic single-season closed population model are

now available The occupancy approach has been used to determine species range

dynamics (MacKenzie et al 2003 Royle amp Kery 2007) and to understand agestage


structure within populations (Nichols et al 2007) model species co-occurrence

(MacKenzie et al 2004 Ovaskainen et al 2010 Waddle et al 2010) It has even been

suggested as a surrogate for abundance (MacKenzie amp Nichols 2004) MacKenzie et al

suggested using occupancy models to conduct large-scale monitoring programs since

this approach avoids the high costs associated with surveys designed for abundance

estimation Also to investigate metapopulation dynamics occupancy models improve

upon incidence function models (Hanski 1994) which are often parameterized in terms

of site (or patch) occupancy and assume homogenous patches and a metapopulation

that is at a colonization-extinction equilibrium

Nevertheless the implementation of Bayesian occupancy models commonly resorts

to sampling strategies dependent on hyper-parameters subjective prior elicitation

and relatively elaborate algorithms From the standpoint of practitioners these are

often treated as black-box methods (Kery 2010) As such the potential of using the

methodology incorrectly is high Commonly these procedures are fitted with packages

such as BUGS or JAGS. Although the package's ease of use has led to a wide-spread

adoption of the methods the user may be oblivious as to the assumptions underpinning

the analysis

We believe providing straightforward and robust alternatives to implement these

methods will help practitioners gain insight about how occupancy modeling and more

generally Bayesian modeling is performed In this Chapter using a simple Gibbs

sampling approach first we develop a versatile method to estimate the single season

closed population site-occupancy model then extend it to analyze metapopulation

dynamics through time and finally provide a further adaptation to incorporate spatial

dependence among neighboring sites.

2.1.1 The Occupancy Model

In this section of the document we first introduce our results published in Dorazio & Taylor-Rodríguez (2012) and build upon them to propose relevant extensions. For


the standard sampling protocol for collecting site-occupancy data J gt 1 independent

surveys are conducted at each of N representative sample locations (sites) noting

whether a species is detected or not detected during each survey Let yij denote a binary

random variable that indicates detection (y = 1) or non-detection (y = 0) during the

j th survey of site i Without loss of generality J may be assumed constant among all N

sites to simplify description of the model In practice however site-specific variation in

J poses no real difficulties and is easily implemented. This sampling protocol therefore yields an N × J matrix Y of detection/non-detection data.

Note that the observed process y_{ij} is an imperfect representation of the underlying occupancy or presence process. Hence, letting z_i denote the presence indicator at site i, this model specification can be represented through the hierarchy

y_{ij} \mid z_i, \lambda \sim \text{Bernoulli}(z_i\, p_{ij})
z_i \mid \alpha \sim \text{Bernoulli}(\psi_i),    (2-1)

where p_{ij} is the probability of correctly classifying as occupied the i-th site during the j-th survey, and \psi_i is the presence probability at the i-th site. The graphical representation of this process is shown in Figure 2-1.

Figure 2-1. Graphical representation of the occupancy model, with nodes ψi, zi, pij and yij.
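To make the data-generating hierarchy in (2-1) concrete, the short Python sketch below simulates detection/non-detection histories under constant occupancy and detection probabilities; the values of N, J, ψ and p are arbitrary choices for illustration, not quantities used elsewhere in this chapter.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

N, J = 100, 5        # number of sites and surveys per site (illustrative)
psi, p = 0.6, 0.4    # occupancy and detection probabilities (illustrative)

z = rng.binomial(1, psi, size=N)                   # latent presence z_i at each site
y = rng.binomial(1, z[:, None] * p, size=(N, J))   # detections y_ij; forced to 0 when z_i = 0

print("true number of occupied sites:", z.sum())
print("naive occupancy estimate (any detection):", (y.sum(axis=1) > 0).mean())
```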

Probabilities of detection and occupancy can both be made functions of covariates

and their corresponding parameter estimates can be obtained using either a maximum

likelihood or a Bayesian approach. Existing methodologies from the likelihood perspective marginalize over the latent occupancy process (z_i), making the estimation procedure depend only on the detections. Most Bayesian strategies rely on MCMC algorithms that require parameter prior specification and tuning. However, Albert & Chib (1993) proposed a longstanding strategy in the Bayesian statistical literature that models binary outcomes using a simple Gibbs sampler. This procedure, which is described in the following section, can be extrapolated to the occupancy setting, eliminating the need

for tuning parameters and subjective prior elicitation.

2.1.2 Data Augmentation Algorithms for Binary Models

Probit model Data-augmentation with latent normal variables

At the root of Albert & Chib's algorithm lies the idea that if the observed outcome is 0, the latent variable can be simulated from a truncated normal distribution with support (−∞, 0], and if the outcome is 1, the latent variable can be simulated from a truncated normal distribution on (0, ∞). To understand the reasoning behind this strategy, let Y \sim \text{Bern}(\Phi(x^T\beta)) and V = x^T\beta + \varepsilon with \varepsilon \sim N(0, 1). In such a case, note that

\Pr(y = 1 \mid x^T\beta) = \Phi(x^T\beta) = \Pr(\varepsilon < x^T\beta) = \Pr(\varepsilon > -x^T\beta) = \Pr(v > 0 \mid x^T\beta).

Thus, whenever y = 1 then v > 0, and v ≤ 0 otherwise. In other words, we may think of y as a truncated version of v. Thus, we can sample iteratively, alternating between the latent variables conditioned on the model parameters and vice versa, to draw from the desired posterior densities. By augmenting the data with the latent variables, we are able to obtain full conditional posterior distributions for the model parameters that are easy to draw from (Equation 2-3 below). Further, just as we may sample the latent variables, we may also sample the parameters.

Given some initial values for all model parameters, values for the latent variables can be simulated. By conditioning on the latter, it is then possible to draw samples from the parameters' posterior distributions. These samples can be used to generate new values for the latent variables, and so on. The process is iterated using a Gibbs sampling approach. Generally, after a large number of iterations, it yields draws from the joint posterior distribution of the latent variables and the model parameters, conditional on the observed outcome values. We formalize the procedure below.

Assume that each outcome Y_1, Y_2, ..., Y_n is such that Y_i \mid x_i, \beta \sim \text{Bernoulli}(q_i), where q_i = \Phi(x_i^T\beta) is the standard normal CDF evaluated at x_i^T\beta, and where x_i and \beta are the p-dimensional vectors of observed covariates for the i-th observation and their corresponding parameters, respectively.

Now let \mathbf{y} = (y_1, y_2, ..., y_n) be the vector of observed outcomes and let [\beta] represent the prior distribution of the model parameters. Therefore, the posterior distribution of \beta is given by

[\beta \mid \mathbf{y}] \propto [\beta] \prod_{i=1}^{n} \Phi(x_i^T\beta)^{y_i}\left(1 - \Phi(x_i^T\beta)\right)^{1-y_i},    (2-2)

which is intractable. Nevertheless, introducing latent random variables \mathbf{V} = (V_1, ..., V_n), such that V_i \sim N(x_i^T\beta, 1), resolves this difficulty by specifying that whenever Y_i = 1 then V_i > 0, and if Y_i = 0 then V_i \le 0. This yields

[\beta, \mathbf{v} \mid \mathbf{y}] \propto [\beta] \prod_{i=1}^{n} \phi(v_i \mid x_i^T\beta, 1)\left\{ I_{v_i \le 0}\, I_{y_i = 0} + I_{v_i > 0}\, I_{y_i = 1} \right\},    (2-3)

where \phi(x \mid \mu, \tau^2) is the probability density function of a normal random variable x with mean \mu and variance \tau^2. The data augmentation artifact works since [\beta \mid \mathbf{y}] = \int [\beta, \mathbf{v} \mid \mathbf{y}]\, d\mathbf{v}; hence, if we sample from the joint posterior (2-3) and extract only the sampled values for \beta, they will correspond to samples from [\beta \mid \mathbf{y}].

From the expression above it is possible to obtain the full conditional distributions for \mathbf{V} and \beta, and thus a Gibbs sampler can be proposed. For example, if we use a flat prior for \beta (i.e., [\beta] \propto 1), the full conditionals are given by

\beta \mid \mathbf{V}, \mathbf{y} \sim \text{MVN}_k\left( (X^TX)^{-1}(X^T\mathbf{V}),\; (X^TX)^{-1} \right)    (2-4)

\mathbf{V} \mid \beta, \mathbf{y} \sim \prod_{i=1}^{n} \text{trN}(x_i^T\beta, 1, Q_i)    (2-5)

where \text{MVN}_q(\mu, \Sigma) represents a multivariate normal distribution with mean vector \mu and variance-covariance matrix \Sigma, and \text{trN}(\xi, \sigma^2, Q) stands for the truncated normal distribution with mean \xi, variance \sigma^2 and truncation region Q. For each i = 1, 2, ..., n, the support of the truncated variables is given by Q = (-\infty, 0] if y_i = 0 and Q = (0, \infty) otherwise. Note that conjugate normal priors could be used alternatively.

At iteration m + 1 the Gibbs sampler draws V(m+1) conditional on β(m) from (2ndash5)

and then samples β(m+1) conditional on V(m+1) from (2ndash4) This process is repeated for

s = 0 1 nsim where nsim is the number of iterations in the Gibbs sampler
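The following is a minimal sketch (not part of the dissertation; it assumes a flat prior and uses numpy/scipy) of the Albert and Chib Gibbs sampler for probit regression just described, alternating the truncated-normal draws of (2–5) with the normal draw of (2–4).

```python
# Sketch of the two-step Gibbs sampler for Bayesian probit regression (flat prior).
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(y, X, n_iter=2000, seed=1):
    """y: (n,) array of 0/1 outcomes; X: (n, p) design matrix."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)        # (X'X)^{-1}, reused at every iteration
    chol = np.linalg.cholesky(XtX_inv)      # for drawing beta ~ N(mean, (X'X)^{-1})
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for m in range(n_iter):
        mu = X @ beta
        # v_i | beta, y_i: truncated normal on (0, inf) if y_i = 1, on (-inf, 0] if y_i = 0
        lo = np.where(y == 1, 0.0, -np.inf)
        hi = np.where(y == 1, np.inf, 0.0)
        v = truncnorm.rvs(lo - mu, hi - mu, loc=mu, scale=1.0, random_state=rng)
        # beta | v ~ N((X'X)^{-1} X'v, (X'X)^{-1})   (eq. 2-4)
        beta = XtX_inv @ (X.T @ v) + chol @ rng.standard_normal(p)
        draws[m] = beta
    return draws
```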

Logit model: Data-augmentation with latent Polya-gamma variables

Recently, Polson et al. (2013) developed a novel and efficient approach for Bayesian inference for logistic models using Polya-gamma latent variables, which is analogous to the Albert & Chib algorithm. The result arises from what the authors refer to as the Polya-gamma distribution. To construct a random variable from this family, consider the infinite mixture of the iid sequence of $\text{Exp}(1)$ random variables $\{E_k\}_{k=1}^{\infty}$ given by

$$\omega = \frac{2}{\pi^2}\sum_{k=1}^{\infty}\frac{E_k}{(2k-1)^2},$$

with probability density function

$$g(\omega) = \sum_{k=0}^{\infty}(-1)^k\,\frac{2k+1}{\sqrt{2\pi\omega^3}}\, e^{-\frac{(2k+1)^2}{8\omega}}\, I_{\omega\in(0,\infty)}, \qquad (2–6)$$

and Laplace transform $E[e^{-t\omega}] = \cosh^{-1}\!\bigl(\sqrt{t/2}\bigr)$.

The Polya-gamma family of densities is obtained through an exponential tilting of the density $g$ from 2–6. These densities, indexed by $c \ge 0$, are characterized by

$$f(\omega \mid c) = \cosh\!\Bigl(\frac{c}{2}\Bigr)\, e^{-\frac{c^2\omega}{2}}\, g(\omega).$$
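As a rough illustration (not from the dissertation), a draw from $PG(1, c)$ can be approximated with the truncated sum-of-gammas representation of Polson et al. (2013), $\omega = \frac{1}{2\pi^2}\sum_{k\ge 1} g_k/\{(k-\tfrac{1}{2})^2 + c^2/(4\pi^2)\}$ with $g_k \sim \text{Exp}(1)$; truncating the series gives an approximation only, and exact samplers exist in dedicated software.

```python
# Approximate PG(1, c) draw via a truncated series; illustrative sketch only.
import numpy as np

def rpg_approx(c, n_terms=200, rng=None):
    """Approximate draw from the Polya-gamma PG(1, c) distribution."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, n_terms + 1)
    g = rng.exponential(1.0, size=n_terms)            # g_k ~ Exp(1) = Gamma(1, 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    return (g / denom).sum() / (2.0 * np.pi ** 2)
```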

The likelihood for the binomial logistic model can be expressed in terms of latent Polya-gamma variables as follows. Assume $y_i \sim \text{Bernoulli}(\delta_i)$, with predictors $x_i' = (x_{i1}, \ldots, x_{ip})$ and success probability $\delta_i = e^{x_i'\beta}/(1 + e^{x_i'\beta})$. Hence the posterior for the model parameters can be represented as

$$[\beta \mid y] = \frac{[\beta]\prod_{i=1}^{n}\delta_i^{y_i}(1-\delta_i)^{1-y_i}}{c(y)},$$

where $c(y)$ is the normalizing constant.

To facilitate the sampling procedure, a data augmentation step can be performed by introducing a Polya-gamma random variable $\omega \sim PG(x'\beta, 1)$. This yields the data-augmented posterior

$$[\beta, \omega \mid y] = \frac{\bigl(\prod_{i=1}^{n}\Pr(y_i = 1 \mid \beta)\bigr)\, f(\omega \mid x'\beta)\,[\beta]}{c(y)}, \qquad (2–7)$$

such that $[\beta \mid y] = \int_{\mathbb{R}^+}[\beta, \omega \mid y]\, d\omega$.

Thus, from the augmented model, the full conditional density for $\beta$ is given by

$$[\beta \mid \omega, y] \propto \Bigl(\prod_{i=1}^{n}\Pr(y_i = 1 \mid \beta)\Bigr)\, f(\omega \mid x'\beta)\,[\beta] = \prod_{i=1}^{n}\frac{\bigl(e^{x_i'\beta}\bigr)^{y_i}}{1 + e^{x_i'\beta}}\prod_{i=1}^{n}\cosh\!\Bigl(\frac{|x_i'\beta|}{2}\Bigr)\exp\!\Bigl[-\frac{(x_i'\beta)^2\omega_i}{2}\Bigr]g(\omega_i). \qquad (2–8)$$

This expression yields a normal posterior distribution if $\beta$ is assigned flat or normal priors. Hence, a two-step sampling strategy analogous to that of Albert & Chib (1993) can be used to estimate $\beta$ in the occupancy framework.

2.2 Single Season Occupancy

Let $p_{ij} = F(q_{ij}^T\lambda)$ be the probability of correctly classifying the $i$th site as occupied during the $j$th survey, conditional on the site being occupied, and let $\psi_i = F(x_i^T\alpha)$ correspond to the presence probability at the $i$th site. Further, let $F^{-1}(\cdot)$ denote a link function (i.e., probit or logit) connecting the response to the predictors, and denote by $\lambda$ and $\alpha$, respectively, the $r$-variate and $p$-variate coefficient vectors for the detection and for the presence probabilities. Then the following is the joint posterior probability for the presence indicators and the model parameters:

$$\pi(\mathbf{z}, \alpha, \lambda \mid \mathbf{y}) \propto \pi_\alpha(\alpha)\,\pi_\lambda(\lambda)\prod_{i=1}^{N}F(x_i'\alpha)^{z_i}\bigl(1 - F(x_i'\alpha)\bigr)^{(1-z_i)}\times\prod_{j=1}^{J}\bigl(z_iF(q_{ij}'\lambda)\bigr)^{y_{ij}}\bigl(1 - z_iF(q_{ij}'\lambda)\bigr)^{1-y_{ij}}. \qquad (2–9)$$

As in the simple probit regression problem, this posterior is intractable; consequently, sampling from it directly is not possible. But the procedures of Albert & Chib for the probit model and of Polson et al. for the logit model can be extended to generate an MCMC sampling strategy for the occupancy problem. In what follows we make use of this framework to develop samplers with which occupancy parameter estimates can be obtained for both probit and logit link functions. These algorithms have the added benefit that they require neither tuning parameters nor subjective elicitation of parameter priors.

2.2.1 Probit Link Model

To extend Albert & Chib's algorithm to the occupancy framework with a probit link, first we introduce two sets of latent variables, denoted by $w_{ij}$ and $v_i$, corresponding to the normal latent variables used to augment the data. The corresponding hierarchy is

$$y_{ij} \mid z_i, w_{ij} \sim \text{Bernoulli}\bigl(z_iI_{w_{ij}>0}\bigr), \quad w_{ij} \mid \lambda \sim N\bigl(q_{ij}'\lambda, 1\bigr), \quad \lambda \sim [\,\lambda\,],$$
$$z_i \mid v_i \sim \text{Bernoulli}\bigl(I_{v_i>0}\bigr), \quad v_i \mid \alpha \sim N(x_i'\alpha, 1), \quad \alpha \sim [\,\alpha\,], \qquad (2–10)$$

represented by the directed graph found in Figure 2-2.

[Figure 2-2. Graphical representation of the occupancy model after data-augmentation.]

Under this hierarchical model, the joint density is given by

$$\pi^*(\mathbf{z}, \mathbf{v}, \alpha, \mathbf{w}, \lambda) \propto C_y\,\pi_\alpha(\alpha)\,\pi_\lambda(\lambda)\prod_{i=1}^{N}\phi(v_i;\, x_i'\alpha, 1)\,I_{v_i>0}^{z_i}\,I_{v_i\le 0}^{(1-z_i)}\times\prod_{j=1}^{J}\bigl(z_iI_{w_{ij}>0}\bigr)^{y_{ij}}\bigl(1 - z_iI_{w_{ij}>0}\bigr)^{1-y_{ij}}\phi(w_{ij};\, q_{ij}'\lambda, 1). \qquad (2–11)$$

The full conditional densities derived from the posterior in equation 2–11 are detailed below.

1. These are obtained from the full conditional of $\mathbf{z}$ after integrating out $\mathbf{v}$ and $\mathbf{w}$:

$$f(\mathbf{z} \mid \alpha, \lambda) = \prod_{i=1}^{N}f(z_i \mid \alpha, \lambda) = \prod_{i=1}^{N}{\psi_i^*}^{z_i}(1-\psi_i^*)^{1-z_i},\quad\text{where } \psi_i^* = \frac{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}}}{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}} + (1-\psi_i)\prod_{j=1}^{J}I_{y_{ij}=0}}. \qquad (2–12)$$

2.
$$f(\mathbf{v} \mid \mathbf{z}, \alpha) = \prod_{i=1}^{N}f(v_i \mid z_i, \alpha) = \prod_{i=1}^{N}\text{tr}\,N\bigl(x_i'\alpha, 1, A_i\bigr),\quad\text{where } A_i = \begin{cases}(-\infty, 0\,] & z_i = 0\\ (0, \infty) & z_i = 1\end{cases} \qquad (2–13)$$

and $\text{tr}\,N(\mu, \sigma^2, A)$ denotes the pdf of a truncated normal random variable with mean $\mu$, variance $\sigma^2$ and truncation region $A$.

3.
$$f(\alpha \mid \mathbf{v}) = \phi_p\bigl(\alpha;\; \Sigma_\alpha X'\mathbf{v},\; \Sigma_\alpha\bigr), \qquad (2–14)$$

where $\Sigma_\alpha = (X'X)^{-1}$ and $\phi_k(x;\, \mu, \Sigma)$ represents the $k$-variate normal density with mean vector $\mu$ and variance matrix $\Sigma$.

4.
$$f(\mathbf{w} \mid \mathbf{y}, \mathbf{z}, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J}f(w_{ij} \mid y_{ij}, z_i, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J}\text{tr}\,N\bigl(q_{ij}'\lambda, 1, B_{ij}\bigr),\quad\text{where } B_{ij} = \begin{cases}(-\infty, \infty) & z_i = 0\\ (-\infty, 0\,] & z_i = 1 \text{ and } y_{ij} = 0\\ (0, \infty) & z_i = 1 \text{ and } y_{ij} = 1\end{cases} \qquad (2–15)$$

5.
$$f(\lambda \mid \mathbf{w}) = \phi_r\bigl(\lambda;\; \Sigma_\lambda Q'\mathbf{w},\; \Sigma_\lambda\bigr), \qquad (2–16)$$

where $\Sigma_\lambda = (Q'Q)^{-1}$.

The Gibbs sampling algorithm for the model can then be summarized as follows (a code sketch of one iteration is given below):

1. Initialize $\mathbf{z}$, $\alpha$, $\mathbf{v}$, $\lambda$ and $\mathbf{w}$.
2. Sample $z_i \sim \text{Bern}(\psi_i^*)$.
3. Sample $v_i$ from a truncated normal with $\mu = x_i'\alpha$, $\sigma = 1$ and truncation region depending on $z_i$.
4. Sample $\alpha \sim N(\Sigma_\alpha X'\mathbf{v}, \Sigma_\alpha)$, with $\Sigma_\alpha = (X'X)^{-1}$.
5. Sample $w_{ij}$ from a truncated normal with $\mu = q_{ij}'\lambda$, $\sigma = 1$ and truncation region depending on $y_{ij}$ and $z_i$.
6. Sample $\lambda \sim N(\Sigma_\lambda Q'\mathbf{w}, \Sigma_\lambda)$, with $\Sigma_\lambda = (Q'Q)^{-1}$.
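The sketch below (illustrative only, assuming flat priors and a balanced design with $J$ visits per site; the function and variable names are ours) implements one pass through steps 2 to 6 using the full conditionals (2–12) to (2–16).

```python
# One Gibbs iteration for the single-season probit occupancy model (sketch).
import numpy as np
from scipy.stats import norm, truncnorm

def occupancy_probit_step(y, X, Q, alpha, lam, rng):
    """y: (N, J) detections; X: (N, p) site covariates; Q: (N, J, r) survey covariates."""
    N, J = y.shape
    psi = norm.cdf(X @ alpha)                       # occupancy probabilities
    p = norm.cdf(Q @ lam)                           # (N, J) detection probabilities
    # Step 2: z_i | alpha, lambda, y   (eq. 2-12); any detection forces z_i = 1
    like1 = psi * np.prod(p ** y * (1 - p) ** (1 - y), axis=1)
    like0 = (1 - psi) * np.all(y == 0, axis=1)
    z = rng.binomial(1, like1 / (like1 + like0))
    # Step 3: v_i | z_i, alpha  (eq. 2-13)
    mu_v = X @ alpha
    lo = np.where(z == 1, 0.0, -np.inf); hi = np.where(z == 1, np.inf, 0.0)
    v = truncnorm.rvs(lo - mu_v, hi - mu_v, loc=mu_v, scale=1.0, random_state=rng)
    # Step 4: alpha | v ~ N((X'X)^{-1} X'v, (X'X)^{-1})  (eq. 2-14)
    XtX_inv = np.linalg.inv(X.T @ X)
    alpha = rng.multivariate_normal(XtX_inv @ (X.T @ v), XtX_inv)
    # Step 5: w_ij | y, z, lambda with region B_ij  (eq. 2-15)
    mu_w = Q @ lam
    lo_w = np.where((z[:, None] == 1) & (y == 1), 0.0, -np.inf)
    hi_w = np.where((z[:, None] == 1) & (y == 0), 0.0, np.inf)
    w = truncnorm.rvs(lo_w - mu_w, hi_w - mu_w, loc=mu_w, scale=1.0, random_state=rng)
    # Step 6: lambda | w ~ N((Q'Q)^{-1} Q'w, (Q'Q)^{-1})  (eq. 2-16)
    Qf = Q.reshape(N * J, -1); wf = w.reshape(N * J)
    QtQ_inv = np.linalg.inv(Qf.T @ Qf)
    lam = rng.multivariate_normal(QtQ_inv @ (Qf.T @ wf), QtQ_inv)
    return z, v, alpha, w, lam
```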

2.2.2 Logit Link Model

Now turning to the logit link version of the occupancy model, again let $y_{ij}$ be the indicator variable used to mark detection of the target species on the $j$th survey at the $i$th site, and let $z_i$ be the indicator variable that denotes presence ($z_i = 1$) or absence ($z_i = 0$) of the target species at the $i$th site. The model is now defined by

$$y_{ij} \mid z_i, \lambda \sim \text{Bernoulli}(z_ip_{ij}), \quad\text{where } p_{ij} = \frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}, \qquad \lambda \sim [\,\lambda\,],$$
$$z_i \mid \alpha \sim \text{Bernoulli}(\psi_i), \quad\text{where } \psi_i = \frac{e^{x_i'\alpha}}{1 + e^{x_i'\alpha}}, \qquad \alpha \sim [\,\alpha\,].$$

In this hierarchy, the contribution of a single site to the likelihood is

$$L_i(\alpha, \lambda) = \frac{\bigl(e^{x_i'\alpha}\bigr)^{z_i}}{1 + e^{x_i'\alpha}}\prod_{j=1}^{J}\left(z_i\frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}}\left(1 - z_i\frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1-y_{ij}}. \qquad (2–17)$$

As in the probit case, we data-augment the likelihood with two separate sets of latent variables; however, in this case each of them has a Polya-gamma distribution. Augmenting the model and using the posterior in (2–7), the joint is

$$[\,\mathbf{z}, \mathbf{v}, \mathbf{w}, \alpha, \lambda \mid \mathbf{y}\,] \propto [\alpha][\lambda]\prod_{i=1}^{N}\frac{\bigl(e^{x_i'\alpha}\bigr)^{z_i}}{1 + e^{x_i'\alpha}}\cosh\!\Bigl(\frac{|x_i'\alpha|}{2}\Bigr)\exp\!\Bigl[-\frac{(x_i'\alpha)^2v_i}{2}\Bigr]g(v_i)\;\times$$
$$\prod_{j=1}^{J}\left(z_i\frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{y_{ij}}\left(1 - z_i\frac{e^{q_{ij}'\lambda}}{1 + e^{q_{ij}'\lambda}}\right)^{1-y_{ij}}\cosh\!\Bigl(\frac{|z_iq_{ij}'\lambda|}{2}\Bigr)\exp\!\Bigl[-\frac{(z_iq_{ij}'\lambda)^2w_{ij}}{2}\Bigr]g(w_{ij}). \qquad (2–18)$$

The full conditionals for $\mathbf{z}$, $\alpha$, $\mathbf{v}$, $\lambda$ and $\mathbf{w}$ obtained from (2–18) are provided below.

1. The full conditional for $\mathbf{z}$ is obtained after marginalizing the latent variables and yields

$$f(\mathbf{z} \mid \alpha, \lambda) = \prod_{i=1}^{N}f(z_i \mid \alpha, \lambda) = \prod_{i=1}^{N}{\psi_i^*}^{z_i}(1-\psi_i^*)^{1-z_i},\quad\text{where } \psi_i^* = \frac{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}}}{\psi_i\prod_{j=1}^{J}p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}} + (1-\psi_i)\prod_{j=1}^{J}I_{y_{ij}=0}}. \qquad (2–19)$$

2. Using the result derived in Polson et al. (2013), we have that

$$f(\mathbf{v} \mid \mathbf{z}, \alpha) = \prod_{i=1}^{N}f(v_i \mid z_i, \alpha) = \prod_{i=1}^{N}PG\bigl(1, x_i'\alpha\bigr). \qquad (2–20)$$

3.
$$f(\alpha \mid \mathbf{v}) \propto [\,\alpha\,]\prod_{i=1}^{N}\exp\!\Bigl[z_ix_i'\alpha - \frac{x_i'\alpha}{2} - \frac{(x_i'\alpha)^2v_i}{2}\Bigr]. \qquad (2–21)$$

4. By the same result as that used for $\mathbf{v}$, the full conditional for $\mathbf{w}$ is

$$f(\mathbf{w} \mid \mathbf{y}, \mathbf{z}, \lambda) = \prod_{i=1}^{N}\prod_{j=1}^{J}f(w_{ij} \mid y_{ij}, z_i, \lambda) = \Bigl(\prod_{i\in S_1}\prod_{j=1}^{J}PG\bigl(1, |q_{ij}'\lambda|\bigr)\Bigr)\Bigl(\prod_{i\notin S_1}\prod_{j=1}^{J}PG(1, 0)\Bigr), \qquad (2–22)$$

with $S_1 = \{i \in \{1, 2, \ldots, N\} : z_i = 1\}$.

5.
$$f(\lambda \mid \mathbf{z}, \mathbf{y}, \mathbf{w}) \propto [\,\lambda\,]\prod_{i\in S_1}\prod_{j=1}^{J}\exp\!\Bigl[y_{ij}q_{ij}'\lambda - \frac{q_{ij}'\lambda}{2} - \frac{(q_{ij}'\lambda)^2w_{ij}}{2}\Bigr], \qquad (2–23)$$

with $S_1$ as defined above.
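For illustration only, the sketch below assembles these full conditionals into one Gibbs iteration for the logit occupancy model. It assumes flat priors, reuses the approximate `rpg_approx` sampler sketched earlier (any exact Polya-gamma sampler could be substituted), and the helper names are ours; the normal updates for $\alpha$ and $\lambda$ follow from completing the square in (2–21) and (2–23).

```python
# One Gibbs iteration for the single-season logit occupancy model (sketch).
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def occupancy_logit_step(y, X, Q, alpha, lam, rng):
    """y: (N, J) detections; X: (N, p) site covariates; Q: (N, J, r) survey covariates."""
    N, J = y.shape
    psi, p = expit(X @ alpha), expit(Q @ lam)
    # z_i | alpha, lambda, y  (eq. 2-19)
    like1 = psi * np.prod(p ** y * (1 - p) ** (1 - y), axis=1)
    like0 = (1 - psi) * np.all(y == 0, axis=1)
    z = rng.binomial(1, like1 / (like1 + like0))
    # v_i ~ PG(1, x_i'alpha)  (eq. 2-20); rpg_approx is the approximate sampler above
    v = np.array([rpg_approx(c, rng=rng) for c in X @ alpha])
    # alpha | z, v: normal with precision X' diag(v) X and "kappa" z - 1/2  (from eq. 2-21)
    cov_a = np.linalg.inv(X.T @ (v[:, None] * X))
    alpha = rng.multivariate_normal(cov_a @ (X.T @ (z - 0.5)), cov_a)
    # w_ij ~ PG(1, |q_ij'lambda|) for occupied sites, PG(1, 0) otherwise  (eq. 2-22)
    w = np.array([[rpg_approx(abs(c) if zi else 0.0, rng=rng) for c in row]
                  for zi, row in zip(z, Q @ lam)])
    # lambda | z, y, w: only occupied sites contribute  (from eq. 2-23)
    occ = z == 1
    Qo = Q[occ].reshape(-1, Q.shape[-1]); wo = w[occ].ravel(); ko = (y[occ] - 0.5).ravel()
    cov_l = np.linalg.inv(Qo.T @ (wo[:, None] * Qo))
    lam = rng.multivariate_normal(cov_l @ (Qo.T @ ko), cov_l)
    return z, alpha, lam
```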

The Gibbs sampling algorithm is analogous to the one with a probit link, but with the obvious modifications to incorporate Polya-gamma instead of normal latent variables.

2.3 Temporal Dynamics and Spatial Structure

The uses of the single-season model are limited to very specific problems In

particular assumptions for the basic model may become too restrictive or unrealistic

whenever the study period extends throughout multiple years or seasons especially

given the increasingly changing environmental conditions that most ecosystems are

currently experiencing

Among the many extensions found in the literature, one that we consider particularly relevant incorporates heterogeneous occupancy probabilities through time. Extensions of site-occupancy models that incorporate temporally varying probabilities can be traced

back to Hanski (1994). The heterogeneity of occupancy probabilities through time arises from local colonization and extinction processes. MacKenzie et al. (2003) proposed an alternative to Hanski's approach in order to incorporate imperfect detection. The method

is flexible enough to let detection occurrence survival and colonization probabilities

each depend upon its own set of covariates using likelihood-based estimation for the

model parameters

However the approach of MacKenzie et al presents two drawbacks First

the uncertainty assessment for maximum likelihood parameter estimates relies on

asymptotic results (obtained from implementation of the delta method) making it

sensitive to sample size And second to obtain parameter estimates the latent process

(occupancy) is marginalized out of the likelihood leading to the usual zero-inflated

Bernoulli model Although this is a convenient strategy to solve the estimation problem

the latent state variables (occupancy indicators) are no longer available and as such

finite sample estimates cannot be calculated unless an additional (and computationally

expensive) parametric bootstrap step is performed (Royle & Kery, 2007). Additionally, as

the occupancy process is integrated out the likelihood approach precludes incorporation

of additional structural dependence using random effects Thus the model cannot

account for spatial dependence which plays a fundamental role in this setting

To work around some of the shortcomings encountered when fitting dynamic

occupancy models via likelihood-based methods, Royle & Kery developed what they

refer to as a dynamic occupancy state space model (DOSS) alluding to the conceptual

similarity found between this model and the class of state space models found in the

time series literature In particular this model allows one to retain the latent process

(occupancy indicators) in order to obtain small sample estimates and to eventually

generate extensions that incorporate structure in time andor space through random

effects

The data used in the DOSS model comes from standard repeated presence/absence

surveys with N sampling locations (patches or sites) indexed by i = 1 2 N Within

a given season (eg year month week depending on the biology of the species) each

sampling location is visited (surveyed) j = 1 2 J times This process is repeated for

t = 1 2 T seasons Here an important assumption is that the site occupancy status

is closed within but not across seasons

As is usual in the occupancy modeling framework two different processes are

considered The first one is the detection process per site-visit-season combination

denoted by yijt The yijt are indicator functions that take the value 1 if the species is

present at site i survey j and season t and 0 otherwise These detection indicators

are assumed to be independent within each site and season The second response

considered is the partially observed presence (occupancy) indicators zit These are

indicator variables which are equal to 1 whenever yijt = 1 for one or more of the visits

made to site $i$ during season $t$; otherwise the values of the $z_{it}$'s are unknown. Royle & Kery refer to these two processes as the observation ($y_{ijt}$) and the state ($z_{it}$) models.

In this setting the parameters of greatest interest are the occurrence or site

occupancy probabilities denoted by ψit as well as those representing the population

dynamics which are accounted for by introducing changes in occupancy status over

time through local colonization and survival That is if a site was not occupied at season

t minus 1 at season t it can either be colonized or remain unoccupied On the other hand

if the site was in fact occupied at season t minus 1 it can remain that way (survival) or

become abandoned (local extinction) at season t The probabilities of survival and

colonization from season t minus 1 to season t at the i th site are denoted by θi(tminus1) and

$\gamma_{i(t-1)}$, respectively.

During the initial period (or season), the model for the state process is expressed in terms of the occupancy probability (equation 2–24). For subsequent periods, the state process is specified in terms of survival and colonization probabilities (equation 2–25). In particular,

$$z_{i1} \sim \text{Bernoulli}(\psi_{i1}) \qquad (2–24)$$

$$z_{it} \mid z_{i(t-1)} \sim \text{Bernoulli}\bigl(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\bigr). \qquad (2–25)$$

The observation model, conditional on the latent process $z_{it}$, is defined by

$$y_{ijt} \mid z_{it} \sim \text{Bernoulli}(z_{it}p_{ijt}). \qquad (2–26)$$
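The short simulation below (an illustration of ours, not part of the dissertation, with constant probabilities across sites) shows how equations 2–24 to 2–26 generate multi-season presence histories and imperfect detections.

```python
# Simulate data from the dynamic occupancy state and observation processes (sketch).
import numpy as np

def simulate_dynamic_occupancy(N, J, T, psi1, theta, gamma, p, seed=0):
    """Returns z: (N, T) true occupancy states and y: (N, J, T) detection histories."""
    rng = np.random.default_rng(seed)
    z = np.zeros((N, T), dtype=int)
    y = np.zeros((N, J, T), dtype=int)
    z[:, 0] = rng.binomial(1, psi1, size=N)                        # eq. 2-24
    for t in range(1, T):
        prob = z[:, t - 1] * theta + (1 - z[:, t - 1]) * gamma     # eq. 2-25
        z[:, t] = rng.binomial(1, prob)
    for t in range(T):
        y[:, :, t] = rng.binomial(1, z[:, t, None] * p, size=(N, J))  # eq. 2-26
    return z, y

z, y = simulate_dynamic_occupancy(N=100, J=4, T=5, psi1=0.6, theta=0.8, gamma=0.2, p=0.5)
```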

Royle & Kery induce the heterogeneity by site, site-season and site-survey-season, respectively, in the occupancy, survival and colonization, and in the detection probabilities through the following specification:

$$\text{logit}(\psi_{i1}) = x_1 + r_i, \quad r_i \sim N(0, \sigma^2_\psi), \quad \text{logit}^{-1}(x_1) \sim \text{Unif}(0,1)$$
$$\text{logit}(\theta_{it}) = a_t + u_i, \quad u_i \sim N(0, \sigma^2_\theta), \quad \text{logit}^{-1}(a_t) \sim \text{Unif}(0,1)$$
$$\text{logit}(\gamma_{it}) = b_t + v_i, \quad v_i \sim N(0, \sigma^2_\gamma), \quad \text{logit}^{-1}(b_t) \sim \text{Unif}(0,1)$$
$$\text{logit}(p_{ijt}) = c_t + w_{ij}, \quad w_{ij} \sim N(0, \sigma^2_p), \quad \text{logit}^{-1}(c_t) \sim \text{Unif}(0,1) \qquad (2–27)$$

where $x_1, a_t, b_t, c_t$ are the season fixed effects for the corresponding probabilities, and where $(r_i, u_i, v_i)$ and $w_{ij}$ are the site and site-survey random effects, respectively. Additionally, all variance components assume the usual inverse gamma priors.

As the authors state, this formulation can be regarded as "being suitably vague"; however, it is also restrictive in the sense that it is not clear what strategy to follow to incorporate additional covariates while preserving the straightforward sampling strategy.

2.3.1 Dynamic Mixture Occupancy State-Space Model

We assume that the probabilities for occupancy survival colonization and detection

are all functions of linear combinations of covariates However our setup varies

slightly from that considered by Royle & Kery (2007). In essence, we modify the way in

which the estimates for survival and colonization probabilities are attained Our model

incorporates the notion that occupancy at a site occupied during the previous season

takes place through persistence where we define persistence as a function of both

survival and colonization That is a site occupied at time t may again be occupied

at time t + 1 if the current settlers survive if they perish and new settlers colonize

simultaneously or if both current settlers survive and new ones colonize

Our functional forms of choice are again the probit and logit link functions. This means that each probability of interest, which we will refer to for illustration as $\delta$, is linked to a linear combination of covariates $x'\xi$ through the relationship defined by $\delta = F(x^T\xi)$, where $F(\cdot)$ represents the inverse link function. This particular assumption facilitates relating the data augmentation algorithms of Albert & Chib and Polson et al. to Royle & Kery's DOSS model. We refer to this extension of Royle & Kery's model as the Dynamic Mixture Occupancy State Space model (DYMOSS).

As before let yijt be the indicator variable used to mark detection of the target

species on the j th survey at the i th site during the tth season and let zit be the indicator

variable that denotes presence (zit = 1) or absence (zit = 0) of the target species at the

i th site tth season with i isin 1 2 N j isin 1 2 J and t isin 1 2 T

Additionally, assume that the probabilities for occupancy at time $t = 1$, persistence, colonization and detection are all functions of covariates, with corresponding parameter vectors $\alpha$, $\Delta^{(s)} = \{\delta^{(s)}_{t-1}\}_{t=2}^{T}$, $B^{(c)} = \{\beta^{(c)}_{t-1}\}_{t=2}^{T}$ and $\Lambda = \{\lambda_t\}_{t=1}^{T}$, and covariate matrices $X^{(o)}$, $X = \{X_{t-1}\}_{t=2}^{T}$ and $Q = \{Q_t\}_{t=1}^{T}$, respectively. Using the notation above, our proposed dynamic occupancy model is defined by the following hierarchy.

State model
$$z_{i1} \mid \alpha \sim \text{Bernoulli}(\psi_{i1}), \quad\text{where } \psi_{i1} = F\bigl(x'_{(o)i}\alpha\bigr)$$
$$z_{it} \mid z_{i(t-1)}, \delta^{(s)}_{t-1}, \beta^{(c)}_{t-1} \sim \text{Bernoulli}\bigl(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\bigr),$$
$$\text{where } \theta_{i(t-1)} = F\bigl(\delta^{(s)}_{t-1} + x'_{i(t-1)}\beta^{(c)}_{t-1}\bigr) \text{ and } \gamma_{i(t-1)} = F\bigl(x'_{i(t-1)}\beta^{(c)}_{t-1}\bigr) \qquad (2–28)$$

Observed model
$$y_{ijt} \mid z_{it}, \lambda_t \sim \text{Bernoulli}(z_{it}p_{ijt}), \quad\text{where } p_{ijt} = F(q_{ijt}^T\lambda_t). \qquad (2–29)$$

In the hierarchical setup given by Equations 2–28 and 2–29, $\theta_{i(t-1)}$ corresponds to the probability of persistence from time $t-1$ to time $t$ at site $i$, and $\gamma_{i(t-1)}$ denotes the colonization probability. Note that $\theta_{i(t-1)} - \gamma_{i(t-1)}$ yields the survival probability from $t-1$ to $t$. The effect of survival is introduced by changing the intercept of the linear predictor by a quantity $\delta^{(s)}_{t-1}$. Although in this version of the model this effect is accomplished by just modifying the intercept, it can be extended to have covariates determining $\delta^{(s)}_{t-1}$ as well. The graphical representation of the model for a single site is

[Figure 2-3. Graphical representation of the multiseason model for a single site.]

The joint posterior for the model defined by this hierarchical setting is

$$[\,\mathbf{z}, \boldsymbol\eta, \alpha, \boldsymbol\beta, \boldsymbol\lambda \mid \mathbf{y}\,] = C_y\prod_{i=1}^{N}\Bigl\{\psi_{i1}\prod_{j=1}^{J}p_{ij1}^{y_{ij1}}(1-p_{ij1})^{(1-y_{ij1})}\Bigr\}^{z_{i1}}\Bigl\{(1-\psi_{i1})\prod_{j=1}^{J}I_{y_{ij1}=0}\Bigr\}^{1-z_{i1}}[\eta_1][\alpha]\;\times$$
$$\prod_{t=2}^{T}\prod_{i=1}^{N}\Bigl[\bigl(\theta_{i(t-1)}^{z_{it}}(1-\theta_{i(t-1)})^{1-z_{it}}\bigr)^{z_{i(t-1)}}\bigl(\gamma_{i(t-1)}^{z_{it}}(1-\gamma_{i(t-1)})^{1-z_{it}}\bigr)^{1-z_{i(t-1)}}\Bigr]\Bigl\{\prod_{j=1}^{J}p_{ijt}^{y_{ijt}}(1-p_{ijt})^{1-y_{ijt}}\Bigr\}^{z_{it}}\Bigl\{\prod_{j=1}^{J}I_{y_{ijt}=0}\Bigr\}^{1-z_{it}}[\eta_t][\beta_{t-1}][\lambda_{t-1}], \qquad (2–30)$$

which as in the single season case is intractable Once again a Gibbs sampler cannot

be constructed directly to sample from this joint posterior The graphical representation

of the model for one site incorporating the latent variables is provided in Figure 2-4

[Figure 2-4. Graphical representation of the data-augmented multiseason model.]

Probit link normal-mixture DYMOSS model

We deal with the intractability of the joint posterior distribution as before, that is, by introducing latent random variables. Each of the latent variables incorporates the relevant linear combinations of covariates for the probabilities considered in the model. This artifact enables us to sample from the joint posterior distributions of the model parameters. For the probit link, the sets of latent random variables, respectively for first-season occupancy, persistence and colonization, and detection, are

• $u_i \sim N\bigl(x_{(o)i}^T\alpha, 1\bigr)$,
• $v_{i(t-1)} \sim z_{i(t-1)}N\bigl(\delta^{(s)}_{(t-1)} + x_{i(t-1)}^T\beta^{(c)}_{(t-1)}, 1\bigr) + (1 - z_{i(t-1)})N\bigl(x_{i(t-1)}^T\beta^{(c)}_{(t-1)}, 1\bigr)$, and
• $w_{ijt} \sim N\bigl(q_{ijt}^T\lambda_t, 1\bigr)$.

Introducing these latent variables into the hierarchical formulation yields:

State model
$$u_{i1} \mid \alpha \sim N\bigl(x'_{(o)i}\alpha, 1\bigr), \qquad z_{i1} \mid u_i \sim \text{Bernoulli}\bigl(I_{u_i>0}\bigr),$$
$$\text{for } t > 1:\quad v_{i(t-1)} \mid z_{i(t-1)}, \beta_{t-1} \sim z_{i(t-1)}N\bigl(\delta^{(s)}_{(t-1)} + x'_{i(t-1)}\beta^{(c)}_{(t-1)}, 1\bigr) + (1 - z_{i(t-1)})N\bigl(x'_{i(t-1)}\beta^{(c)}_{(t-1)}, 1\bigr),$$
$$z_{it} \mid v_{i(t-1)} \sim \text{Bernoulli}\bigl(I_{v_{i(t-1)}>0}\bigr). \qquad (2–31)$$

Observed model
$$w_{ijt} \mid \lambda_t \sim N\bigl(q_{ijt}^T\lambda_t, 1\bigr), \qquad y_{ijt} \mid z_{it}, w_{ijt} \sim \text{Bernoulli}\bigl(z_{it}I_{w_{ijt}>0}\bigr). \qquad (2–32)$$

Note that the result presented in Section 2.2 corresponds to the particular case for $T = 1$ of the model specified by Equations 2–31 and 2–32.

As mentioned previously, model parameters are obtained using a Gibbs sampling approach. Let $\phi(x \mid \mu, \sigma^2)$ denote the pdf of a normally distributed random variable $x$ with mean $\mu$ and standard deviation $\sigma$. Also let

1. $W_t = (\mathbf{w}_{1t}, \mathbf{w}_{2t}, \ldots, \mathbf{w}_{Nt})$, with $\mathbf{w}_{it} = (w_{i1t}, w_{i2t}, \ldots, w_{iJ_{it}t})$ (for $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$),
2. $\mathbf{u} = (u_1, u_2, \ldots, u_N)$, and
3. $V = (\mathbf{v}_1, \ldots, \mathbf{v}_{T-1})$, with $\mathbf{v}_t = (v_{1t}, v_{2t}, \ldots, v_{Nt})$.

π(ZuV WtTt=1αB(c) δ(s)

)prop [α]

prodNi=1 ϕ

(ui∣∣ xprime(o)iα 1

)Izi1uigt0I

1minuszi1uile0

times

Tprodt=2

[β(c)tminus1 δ

(s)tminus1

] Nprodi=1

ϕ(vi(tminus1)

∣∣micro(v)i(tminus1) 1

)Izitvi(tminus1)gt0

I1minuszitvi(tminus1)le0

times

Tprodt=1

[λt ]

Nprodi=1

Jitprodj=1

ϕ(wijt

∣∣qprimeijtλt 1)(zitIwijtgt0)yij1(1minus zitIwijtgt0)

(1minusyijt)

where micro(v)i(tminus1) = zi(tminus1)δ

(s)tminus1 + xprimei(tminus1)β

(c)tminus1 (2ndash33)

Initialize the Gibbs sampler at $\alpha^{(0)}$, $B^{(c)(0)}$, $\Delta^{(s)(0)}$ and $\Lambda^{(0)}$. For $m = 0, 1, \ldots, n_{\text{sim}}$, the sampler proceeds iteratively by block sampling sequentially for each primary sampling period as follows: first the presence process, then the latent variables from the data-augmentation step for the presence component, followed by the parameters for the presence process, then the latent variables for the detection component, and finally the parameters for the detection component. Letting $[\,\cdot \mid \cdot\,]$ denote the full conditional probability density function of a component, conditional on all other unknown parameters and the observed data, the sampling procedure at iteration $m$ can be summarized as

$$\bigl[z_1^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[\mathbf{u}^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[\alpha^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[W_1^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[\lambda_1^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[z_2^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[V_{2-1}^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[\beta_{2-1}^{(c)(m)}, \delta_{2-1}^{(s)(m)} \mid \cdot\,\bigr]\rightarrow\bigl[W_2^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[\lambda_2^{(m)} \mid \cdot\,\bigr]\rightarrow\cdots$$
$$\cdots\rightarrow\bigl[z_T^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[V_{T-1}^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[\beta_{T-1}^{(c)(m)}, \delta_{T-1}^{(s)(m)} \mid \cdot\,\bigr]\rightarrow\bigl[W_T^{(m)} \mid \cdot\,\bigr]\rightarrow\bigl[\lambda_T^{(m)} \mid \cdot\,\bigr].$$

The full conditional probability densities for this Gibbs sampling algorithm are presented in detail in Appendix A.

Logit link Polya-Gamma DYMOSS model

Using the same notation as before, the logit link model resorts to the hierarchy given by:

State model
$$u_{i1} \mid \alpha \sim PG\bigl(x^T_{(o)i}\alpha, 1\bigr), \qquad z_{i1} \mid u_i \sim \text{Bernoulli}\bigl(I_{u_i>0}\bigr),$$
$$\text{for } t > 1:\quad v_{i(t-1)} \mid \cdot \sim PG\bigl(1, \bigl|z_{i(t-1)}\delta^{(s)}_{(t-1)} + x'_{i(t-1)}\beta^{(c)}_{(t-1)}\bigr|\bigr), \qquad z_{it} \mid v_{i(t-1)} \sim \text{Bernoulli}\bigl(I_{v_{i(t-1)}>0}\bigr). \qquad (2–34)$$

Observed model
$$w_{ijt} \mid \lambda_t \sim PG\bigl(q_{ijt}^T\lambda_t, 1\bigr), \qquad y_{ijt} \mid z_{it}, w_{ijt} \sim \text{Bernoulli}\bigl(z_{it}I_{w_{ijt}>0}\bigr). \qquad (2–35)$$

The logit link version of the joint posterior is given by

$$\pi\bigl(Z, \mathbf{u}, V, \{W_t\}_{t=1}^{T}, \alpha, \Delta^{(s)}, B^{(c)}, \Lambda\bigr) \propto \prod_{i=1}^{N}\frac{\bigl(e^{x'_{(o)i}\alpha}\bigr)^{z_{i1}}}{1 + e^{x'_{(o)i}\alpha}}\,PG\bigl(u_i;\, 1, |x'_{(o)i}\alpha|\bigr)\,[\lambda_1][\alpha]\;\times$$
$$\prod_{j=1}^{J_{i1}}\left(z_{i1}\frac{e^{q'_{ij1}\lambda_1}}{1 + e^{q'_{ij1}\lambda_1}}\right)^{y_{ij1}}\left(1 - z_{i1}\frac{e^{q'_{ij1}\lambda_1}}{1 + e^{q'_{ij1}\lambda_1}}\right)^{1-y_{ij1}}PG\bigl(w_{ij1};\, 1, |z_{i1}q'_{ij1}\lambda_1|\bigr)\;\times$$
$$\prod_{t=2}^{T}[\delta^{(s)}_{t-1}][\beta^{(c)}_{t-1}][\lambda_t]\prod_{i=1}^{N}\frac{\bigl(\exp[\mu^{(v)}_{i(t-1)}]\bigr)^{z_{it}}}{1 + \exp[\mu^{(v)}_{i(t-1)}]}\,PG\bigl(v_{it};\, 1, |\mu^{(v)}_{i(t-1)}|\bigr)\;\times$$
$$\prod_{j=1}^{J_{it}}\left(z_{it}\frac{e^{q'_{ijt}\lambda_t}}{1 + e^{q'_{ijt}\lambda_t}}\right)^{y_{ijt}}\left(1 - z_{it}\frac{e^{q'_{ijt}\lambda_t}}{1 + e^{q'_{ijt}\lambda_t}}\right)^{1-y_{ijt}}PG\bigl(w_{ijt};\, 1, |z_{it}q'_{ijt}\lambda_t|\bigr), \qquad (2–36)$$

with $\mu^{(v)}_{i(t-1)} = z_{i(t-1)}\delta^{(s)}_{t-1} + x'_{i(t-1)}\beta^{(c)}_{t-1}$.

The sampling procedure is entirely analogous to that described for the probit version. The full conditional densities derived from expression 2–36 are described in detail in Appendix A.

2.3.2 Incorporating Spatial Dependence

In this section we describe how an additional layer of complexity, space, can also be accounted for by continuing to use the same data-augmentation framework. The method we employ to incorporate spatial dependence is a slightly modified version of the traditional approach for spatial generalized linear mixed models (GLMMs), and extends the model proposed by Johnson et al. (2013) for the single season closed population occupancy model.

The traditional approach consists of using spatial random effects to induce a correlation structure among adjacent sites. This formulation, introduced by Besag et al. (1991), assumes that the spatial random effect corresponds to a Gaussian Markov Random Field (GMRF). The model, known as the Spatial GLMM (SGLMM), is used to analyze areal data. It has been applied extensively, given the flexibility of its hierarchical formulation and the availability of software for its implementation (Hughes & Haran, 2013).

Succinctly, the spatial dependence is accounted for in the model by adding a random vector $\eta$, assumed to have a conditionally-autoregressive (CAR) prior (also known as the Gaussian Markov random field prior). To define the prior, let the pair $G = (V, E)$ represent the undirected graph for the entire spatial region studied, where $V = (1, 2, \ldots, N)$ denotes the vertices of the graph (sites) and $E$ the set of edges between sites; $E$ is constituted by elements of the form $(i, j)$ indicating that sites $i$ and $j$ are spatially adjacent, for some $i, j \in V$. The prior for the spatial effects is then characterized by

$$[\,\eta \mid \tau\,] \propto \tau^{\text{rank}(\Omega)/2}\exp\Bigl[-\frac{\tau}{2}\eta'\Omega\,\eta\Bigr], \qquad (2–37)$$

where $\Omega = \bigl(\text{diag}(A\mathbf{1}) - A\bigr)$ is the precision matrix, with $A$ denoting the adjacency matrix. The entries of the adjacency matrix $A$ are such that $\text{diag}(A) = \mathbf{0}$ and $A_{ij} = I_{(i,j)\in E}$.

The matrix $\Omega$ is singular; hence the probability density defined in equation 2–37 is improper, i.e., it doesn't integrate to 1. Regardless of the impropriety of the prior, this model can be fitted using a Bayesian approach, since even if the prior is improper the posterior for the model parameters is proper. If a constraint such as $\sum_k \eta_k = 0$ is imposed, or if the precision matrix is replaced by a positive definite matrix, the model can also be fitted using a maximum likelihood approach.

Hence adding spatial structure into the DYMOSS framework described in the

previous section only involves adding the steps to sample η(o) and ηtT

t=2 conditional

on all other parameters Furthermore the corresponding parameters and spatial

random effects of a given component (ie occupancy survival and colonization)

can be effortlessly pooled together into a single parameter vector to perform block

sampling For each of the latent variables the only modification required is to sum the

corresponding spatial effect to the linear predictor so that these retain their conditional

independence given the linear combination of fixed effects and the spatial effects

State model
$$z_{i1} \mid \alpha \sim \text{Bernoulli}(\psi_{i1}), \quad\text{where } \psi_{i1} = F\bigl(x^T_{(o)i}\alpha + \eta^{(o)}_i\bigr), \qquad \bigl[\eta^{(o)} \mid \tau\bigr] \propto \tau^{\text{rank}(\Omega)/2}\exp\Bigl[-\frac{\tau}{2}\eta^{(o)\prime}\Omega\,\eta^{(o)}\Bigr],$$
$$z_{it} \mid z_{i(t-1)}, \alpha, \beta_{t-1}, \lambda_{t-1} \sim \text{Bernoulli}\bigl(z_{i(t-1)}\theta_{i(t-1)} + (1 - z_{i(t-1)})\gamma_{i(t-1)}\bigr),$$
$$\text{where } \theta_{i(t-1)} = F\bigl(\delta^{(s)}_{(t-1)} + x^T_{i(t-1)}\beta^{(c)}_{t-1} + \eta_{it}\bigr) \text{ and } \gamma_{i(t-1)} = F\bigl(x^T_{i(t-1)}\beta^{(c)}_{t-1} + \eta_{it}\bigr),$$
$$[\eta_t \mid \tau] \propto \tau^{\text{rank}(\Omega)/2}\exp\Bigl[-\frac{\tau}{2}\eta_t'\Omega\,\eta_t\Bigr]. \qquad (2–38)$$

Observed model
$$y_{ijt} \mid z_{it}, \lambda_t \sim \text{Bernoulli}(z_{it}p_{ijt}), \quad\text{where } p_{ijt} = F(q^T_{ijt}\lambda_t). \qquad (2–39)$$

In spite of the popularity of this approach to incorporating spatial dependence, three shortcomings have been reported in the literature (Hughes & Haran, 2013; Reich et al., 2006): (1) model parameters have no clear interpretation due to spatial confounding of the predictors with the spatial effect; (2) there is variance inflation due to spatial confounding; and (3) the high dimensionality of the latent spatial variables leads to high computational costs. To avoid such difficulties, we follow the approach used by Hughes & Haran (2013), which builds upon the earlier work by Reich et al. (2006). This methodology is summarized in what follows.

Let a vector of spatial effects $\eta$ have the CAR model given by 2–37 above. Now consider a random vector $\zeta \sim \text{MVN}\bigl(0, \tau K'\Omega K\bigr)$, with $\Omega$ defined as above and where $\tau K'\Omega K$ corresponds to the precision of the distribution and not the covariance matrix, with the matrix $K$ satisfying $K'K = I$.

This last condition implies that the linear predictor $X\beta + \eta = X\beta + K\zeta$. With respect to how the matrix $K$ is chosen, Hughes & Haran (2013) recommend basing its construction on the spectral decomposition of operator matrices based on Moran's I. The Moran operator matrix is defined as $P^{\perp}AP^{\perp}$, with $P^{\perp} = I - X(X'X)^{-1}X'$, and where $A$ is the adjacency matrix previously described. The choice of the Moran operator is based on the fact that it accounts for the underlying graph while incorporating the spatial structure residual to the design matrix $X$. These elements are incorporated into its spectral decomposition: its eigenvalues correspond to the values of Moran's I statistic (a measure of spatial autocorrelation) for a spatial process orthogonal to $X$, while its eigenvectors provide the patterns of spatial dependence residual to $X$. Thus, the matrix $K$ is chosen to be the matrix whose columns are the eigenvectors of the Moran operator for a particular adjacency matrix.
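A small numerical sketch of this construction is given below (ours, not from the dissertation); it builds the Moran operator from $X$ and $A$ and returns its leading eigenvectors as the columns of $K$. Keeping only the eigenvectors with positive eigenvalues, as done here by default, is one common reduction and is an assumption of this sketch.

```python
# Construct K from the Moran operator P_perp A P_perp (sketch).
import numpy as np

def moran_basis(X, A, n_keep=None):
    """X: (N, p) design matrix; A: (N, N) symmetric binary adjacency matrix."""
    N = X.shape[0]
    P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)   # projection onto residual space of X
    M = P_perp @ A @ P_perp                                   # Moran operator
    eigval, eigvec = np.linalg.eigh(M)                        # symmetric eigendecomposition
    order = np.argsort(eigval)[::-1]                          # largest Moran's I first
    eigval, eigvec = eigval[order], eigvec[:, order]
    if n_keep is None:
        n_keep = int((eigval > 0).sum())                      # positive spatial association only
    return eigvec[:, :n_keep]                                 # columns form K
```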

Using this strategy, the new hierarchical formulation of our model is simply modified by letting $\eta^{(o)} = K^{(o)}\zeta^{(o)}$ and $\eta_t = K_t\zeta_t$, with

1. $\zeta^{(o)} \sim \text{MVN}\bigl(0, \tau^{(o)}K^{(o)\prime}\Omega K^{(o)}\bigr)$, where $K^{(o)}$ is the eigenvector matrix for $P^{(o)\perp}AP^{(o)\perp}$, and
2. $\zeta_t \sim \text{MVN}\bigl(0, \tau_tK_t'\Omega K_t\bigr)$, where $K_t$ is the eigenvector matrix for $P_t^{\perp}AP_t^{\perp}$, for $t = 2, 3, \ldots, T$.

The algorithms for the probit and logit links from Section 2.3.1 can be readily adapted to incorporate the spatial structure, simply by obtaining the joint posteriors for $(\alpha, \zeta^{(o)})$ and $(\beta^{(c)}_{t-1}, \delta^{(s)}_{t-1}, \zeta_t)$, and making the obvious modification of the corresponding linear predictors to incorporate the spatial components.

2.4 Summary

With a few exceptions (Dorazio & Taylor-Rodríguez, 2012; Johnson et al., 2013; Royle & Kery, 2007), recent Bayesian approaches to site-occupancy modeling with

covariates have relied on model configurations (eg as multivariate normal priors of

parameters in logit scale) that lead to unfamiliar conditional posterior distributions thus

precluding the use of a direct sampling approach Therefore the sampling strategies

available are based on algorithms (eg Metropolis Hastings) that require tuning and the

knowledge to do so correctly

In Dorazio & Taylor-Rodríguez (2012) we proposed a Bayesian specification for

which a Gibbs sampler of the basic occupancy model is available and allowed detection

and occupancy probabilities to depend on linear combinations of predictors This

method, described in Section 2.2.1, is based on the data augmentation algorithm of Albert & Chib (1993). There, the full conditional posteriors of the parameters of the probit

regression model are cast as latent mixtures of normal random variables The probit and

the logit link yield similar results with large sample sizes however their results may be

different when small to moderate sample sizes are considered because the logit link

function places more mass in the tails of the distribution than the probit link does In

Section 2.2.2 we adapt the method for the single season model to work with the logit link

function

The basic occupancy framework is useful but it assumes a single closed population

with fixed probabilities through time Hence its assumptions may not be appropriate to

address problems where the interest lies in the temporal dynamics of the population

Hence we developed a dynamic model that incorporates the notion that occupancy

at a site previously occupied takes place through persistence which depends both on

survival and habitat suitability By this we mean that a site occupied at time t may again

be occupied at time t + 1 if (1) the current settlers survive (2) the existing settlers

perish but new settlers simultaneously colonize or (3) current settlers survive and new

ones colonize during the same season. In our current formulation of the DYMOSS, both colonization and persistence depend on habitat suitability, characterized by $x'_{i(t-1)}\beta^{(c)}_{t-1}$. They only differ in that persistence is also influenced by whether the site being occupied

during season t minus 1 enhances the suitability of the site or harms it through density

dependence

Additionally the study of the dynamics that govern distribution and abundance of

biological populations requires an understanding of the physical and biotic processes

that act upon them and these vary in time and space Consequently as a final step in

this Chapter we described a straightforward strategy to add spatial dependence among

neighboring sites in the dynamic metapopulation model This extension is based on the

popular Bayesian spatial modeling technique of Besag et al. (1991), updated using the methods described in Hughes & Haran (2013).

Future steps along these lines are to (1) develop the software necessary to implement the tools described throughout this chapter, and (2) build a suite of additional extensions of this framework for occupancy models. The first of

them will be used to incorporate information from different sources such as tracks

scats surveys and direct observations into a single model This can be accomplished

by adding a layer to the hierarchy where the source and spatial scale of the data is

accounted for The second extension is a single season spatially explicit multiple

species co-occupancy model This model will allow studying complex interactions

and testing hypotheses about species interactions at a given point in time Lastly this

co-occupancy model will be adapted to incorporate temporal dynamics in the spirit of

the DYMOSS model

CHAPTER 3
INTRINSIC ANALYSIS FOR OCCUPANCY MODELS

Eliminate all other factors, and the one which remains must be the truth.
–Sherlock Holmes
The Sign of Four

3.1 Introduction

Occupancy models are often used to understand the mechanisms that dictate

the distribution of a species Therefore variable selection plays a fundamental role in

achieving this goal. To the best of our knowledge, "objective" Bayesian alternatives for variable selection have not been put forth for this problem, and with a few exceptions (Hooten & Hobbs, 2014; Link & Barker, 2009), AIC is the method used to choose from

competing site-occupancy models In addition the procedures currently implemented

and accessible to ecologists require enumerating and estimating all the candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). In practice, this

can be achieved if the model space considered is small enough which is possible

if the choice of the model space is guided by substantial prior knowledge about the

underlying ecological processes Nevertheless many site-occupancy surveys collect

large amounts of covariate information about the sampled sites Given that the total

number of candidate models grows exponentially fast with the number of predictors

considered choosing a reduced set of models guided by ecological intuition becomes

increasingly difficult This is even more so the case in the occupancy model context

where the model space is the cartesian product of models for presence and models for

detection Given the issues mentioned above we propose the first objective Bayesian

variable selection method for the single-season occupancy model framework This

approach explores in a principled manner the entire model space It is completely

automatic precluding the need for both tuning parameters in the sampling algorithm and

subjective elicitation of parameter prior distributions

As mentioned above in ecological modeling if model selection or less frequently

model averaging is considered, the Akaike Information Criterion (AIC) (Akaike, 1983), or a version of it, is the measure of choice for comparing candidate models (Fiske & Chandler, 2011; Mazerolle & Mazerolle, 2013). The AIC is designed to find the model

that has on average the density closest in Kullback-Leibler distance to the density

of the true data generating mechanism The model with the smallest AIC is selected

However if nested models are considered one of them being the true one generally the

AIC will not select it (Wasserman 2000) Commonly the model selected by AIC will be

more complex than the true one The reason for this is that the AIC has a weak signal to

noise ratio and as such it tends to overfit (Rao amp Wu 2001) Other versions of the AIC

provide a bias correction that enhances the signal to noise ratio leading to a stronger

penalization for model complexity Some examples are the AICc (Hurvich amp Tsai 1989)

and AICu (McQuarrie et al 1997) however these are also not consistent for selection

albeit asymptotically efficient (Rao amp Wu 2001)

If we are interested in prediction as opposed to testing the AIC is certainly

appropriate However when conducting inference the use of Bayesian model averaging

and selection methods is more fitting If the true data generating mechanism is among

those considered asymptotically Bayesian methods choose the true model with

probability one Conversely if the true model is not among the alternatives and a

suitable parameter prior is used the posterior probability of the most parsimonious

model closest to the true one tends asymptotically to one

In spite of this in general for Bayesian testing direct elicitation of prior probabilistic

statements is often impeded because the problems studied may not be sufficiently

well understood to make an informed decision about the priors Conversely there may

be a prohibitively large number of parameters making specifying priors for each of

these parameters an arduous task In addition to this seemingly innocuous subjective

choices for the priors on the parameter space may drastically affect test outcomes

This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass & Wasserman, 1996).

One popular choice of "objective" prior is the reference prior (Berger & Bernardo, 1992), which is the prior that maximizes the amount of signal extracted from the data. These priors have proven to be effective, as they are fully automatic and can be frequentist matching, in the sense that the posterior credible interval agrees with the frequentist confidence interval from repeated sampling, with equal coverage probability (Kass & Wasserman, 1996). Reference priors, however, are improper, and while

(Kass amp Wasserman 1996) Reference priors however are improper and while

they yield reasonable posterior parameter probabilities the derived model posterior

probabilities may be ill defined. To avoid this shortcoming, Berger & Pericchi (1996) introduced the intrinsic Bayes factor (IBF) for model comparison. Moreno et al. (1998), building on the IBF of Berger & Pericchi (1996), developed a limiting procedure to generate a system of priors that yield well-defined posteriors, even though these priors may sometimes be improper. The IBF is built using a data-dependent prior to automatically generate Bayes factors; however, the extension introduced by Moreno et al. (1998) generates the intrinsic prior by taking a theoretical average over the space of training samples, freeing the prior from data dependence.

In our view, in the face of a large number of predictors the best alternative is to run a stochastic search algorithm using good "objective" testing parameter priors, and to incorporate suitable model priors. This being said, the discussion about model priors is

deferred until Chapter 4 this Chapter focuses on the priors on the parameter space

The Chapter is structured as follows First issues surrounding multimodel inference

are described and insight about objective Bayesian inferential procedures is provided

Then, building on modern methods for "objective" Bayesian testing to generate priors on the parameter space, the intrinsic priors for the parameters of the occupancy model are derived. These are used in the construction of an algorithm for "objective" model selection tailored to the occupancy model framework. To assess the performance of our methods, we provide results from a simulation study in which distinct scenarios, both favorable and unfavorable, are used to determine the robustness of these tools, and analyze the Blue Hawker data set, which has been examined previously in the ecological literature (Dorazio & Taylor-Rodríguez, 2012; Kery et al., 2010).

3.2 Objective Bayesian Inference

As mentioned before in practice noninformative priors arising from structural

rules are an alternative to subjective elicitation of priors Some of the rules used in

defining noninformative priors include the principle of insufficient reason parametrization

invariance maximum entropy geometric arguments coverage matching and decision

theoretic approaches (see Kass amp Wasserman (1996) for a discussion)

These rules reflect one of two attitudes (1) noninformative priors either aim to

convey unique representations of ignorance or (2) they attempt to produce probability

statements that may be accepted by convention This latter attitude is in the same

spirit as how weights and distances are defined (Kass & Wasserman, 1996) and

characterizes the way in which Bayesian reference methods are interpreted today ie

noninformative priors are seen to be chosen by convention according to the situation

A word of caution must be given when using noninformative priors Difficulties arise

in their implementation that should not be taken lightly In particular these difficulties

may occur because noninformative priors are generally improper (meaning that they do

not integrate or sum to a finite number) and as such are said to depend on arbitrary

constants

Bayes factors strongly depend upon the prior distributions for the parameters

included in each of the models being compared This can be an important limitation

considering that, when using noninformative priors, their introduction will result in the Bayes factors being a function of the ratio of arbitrary constants, given that these priors are typically improper (see Jeffreys, 1961; Pericchi, 2005, and references therein). Many different approaches have since been developed to deal with the arbitrary constants that arise when using improper priors. These include the use of partial Bayes factors (Berger & Pericchi, 1996; Good, 1950; Lempers, 1971), setting the ratio of arbitrary constants to a predefined value (Spiegelhalter & Smith, 1982), and approximations to the Bayes factor (see Haughton, 1988, as cited in Berger & Pericchi, 1996; Kass & Raftery, 1995; Tierney & Kadane, 1986).

3.2.1 The Intrinsic Methodology

Berger & Pericchi (1996) cleverly dealt with the arbitrary constants that arise when using improper priors by introducing the intrinsic Bayes factor (IBF) procedure. This solution, based on partial Bayes factors, provides the means to replace the improper priors by proper "posterior" priors. The IBF is obtained from combining the model structure with information contained in the observed data. Furthermore, they showed that as the sample size tends to infinity, the intrinsic Bayes factor corresponds to the proper Bayes factor arising from the intrinsic priors.

Intrinsic priors however are not unique The asymptotic correspondence between

the IBF and the Bayes factor arising from the intrinsic prior yields two functional

equations that are solved by a whole class of intrinsic priors Because all the priors

in the class produce Bayes factors that are asymptotically equivalent to the IBF for

finite sample sizes the resulting Bayes factor is not unique To address this issue

Moreno et al. (1998) formalized the methodology through the "limiting procedure". This procedure allows one to obtain a unique Bayes factor, consolidating the method as a valid objective Bayesian model selection procedure, which we will refer to as the Bayes factor for intrinsic priors (BFIP). This result is particularly valid for nested models, although the methodology may be extended, with some caution, to nonnested models.

As mentioned before the Bayesian hypothesis testing procedure is highly sensitive

to parameter-prior specification and not all priors that are useful for estimation are

recommended for hypothesis testing or model selection Evidence of this is provided

by the Jeffreys-Lindley paradox which states that a point null hypothesis will always

be accepted when the variance of a conjugate prior goes to infinity (Robert 1993)

Additionally when comparing nested models the null model should correspond to

a substantial reduction in complexity from that of larger alternative models Hence

priors for the larger alternative models that place probability mass away from the null

model are wasteful If the true model is ldquofarrdquo from the null it will be easily detected by

any statistical procedure. Therefore, the prior on the alternative models should "work harder" at selecting competitive models that are "close" to the null. This principle, known as the Savage continuity condition (Gunel & Dickey, 1974), is widely recognized by

statisticians

Interestingly the intrinsic prior in correspondence with the BFIP automatically

satisfies the Savage continuity condition That is when comparing nested models the

intrinsic prior for the more complex model is centered around the null model and in spite

of being a limiting procedure it is not subject to the Jeffreys-Lindley paradox

Moreover, beyond the usual pairwise consistency of the Bayes factor for nested models, Casella et al. (2009) show that the corresponding Bayesian procedure with intrinsic priors for variable selection in normal regression is consistent in the entire class of normal linear models, adding an important feature to the list of virtues of the procedure. Consistency of the BFIP for the case where the dimension of the alternative model grows with the sample size is discussed in Moreno et al. (2010).

3.2.2 Mixtures of g-Priors

As previously mentioned, in the Bayesian paradigm a model $M \in \mathcal{M}$ is defined by a sampling density and a prior distribution. The sampling density associated with model $M$ is denoted by $f(\mathbf{y} \mid \beta_M, \sigma^2_M, M)$, where $(\beta_M, \sigma^2_M)$ is a vector of model-specific unknown parameters. The prior for model $M$ and its corresponding set of parameters is

$$\pi(\beta_M, \sigma^2_M, M \mid \mathcal{M}) = \pi(\beta_M, \sigma^2_M \mid M, \mathcal{M}) \cdot \pi(M \mid \mathcal{M}).$$

Objective local priors for the model parameters $(\beta_M, \sigma^2_M)$ are achieved through modifications and extensions of Zellner's g-prior (Liang et al., 2008; Womack et al., 2014). In particular, below we focus on the intrinsic prior and provide some details for other scaled mixtures of g-priors. We defer the discussion on priors over the model space until Chapter 5, where we describe them in detail and develop a few alternatives of our own.

3.2.2.1 Intrinsic priors

An automatic choice of an objective prior is the intrinsic prior (Berger & Pericchi, 1996; Moreno et al., 1998). Because $M_B \subseteq M$ for all $M \in \mathcal{M}$, the intrinsic prior for $(\beta_M, \sigma^2_M)$ is defined as an expected posterior prior,

$$\pi^I(\beta_M, \sigma^2_M \mid M) = \int p^R(\beta_M, \sigma^2_M \mid \tilde{y}, M)\, m^R(\tilde{y} \mid M_B)\, d\tilde{y}, \qquad (3–1)$$

where $\tilde{y}$ is a minimal training sample for model $M$, $I$ denotes the intrinsic distributions, and $R$ denotes distributions derived from the reference prior $\pi^R(\beta_M, \sigma^2_M \mid M) = c_M\, \frac{d\beta_M\, d\sigma^2_M}{\sigma^2_M}$. In (3–1), $m^R(\tilde{y} \mid M) = \int\!\!\int f(\tilde{y} \mid \beta_M, \sigma^2_M, M)\,\pi^R(\beta_M, \sigma^2_M \mid M)\, d\beta_M\, d\sigma^2_M$ is the reference marginal of $\tilde{y}$ under model $M$, and $p^R(\beta_M, \sigma^2_M \mid \tilde{y}, M) = \frac{f(\tilde{y} \mid \beta_M, \sigma^2_M, M)\,\pi^R(\beta_M, \sigma^2_M \mid M)}{m^R(\tilde{y} \mid M)}$ is the reference posterior density.

In the regression framework, the reference marginal $m^R$ is improper and produces improper intrinsic priors. However, the intrinsic Bayes factor of model $M$ to the base model $M_B$ is well-defined and given by

$$BF^I_{M,M_B}(\mathbf{y}) = \bigl(1 - R^2_M\bigr)^{-\frac{n-|M_B|}{2}}\times\int_0^1\left(\frac{n + \sin^2\!\bigl(\tfrac{\pi}{2}\theta\bigr)(|M|+1)}{n + \frac{\sin^2(\frac{\pi}{2}\theta)(|M|+1)}{1 - R^2_M}}\right)^{\frac{n-|M|}{2}}\left(\frac{\sin^2\!\bigl(\tfrac{\pi}{2}\theta\bigr)(|M|+1)}{n + \frac{\sin^2(\frac{\pi}{2}\theta)(|M|+1)}{1 - R^2_M}}\right)^{\frac{|M|-|M_B|}{2}}d\theta, \qquad (3–2)$$

where $R^2_M$ is the coefficient of determination of model $M$ versus model $M_B$. The Bayes factor between two models $M$ and $M'$ is defined as $BF^I_{M,M'}(\mathbf{y}) = BF^I_{M,M_B}(\mathbf{y})\big/BF^I_{M',M_B}(\mathbf{y})$.

The "goodness" of the model $M$ based on the intrinsic priors is given by its posterior probability

$$p^I(M \mid \mathbf{y}, \mathcal{M}) = \frac{BF^I_{M,M_B}(\mathbf{y})\,\pi(M \mid \mathcal{M})}{\sum_{M'\in\mathcal{M}}BF^I_{M',M_B}(\mathbf{y})\,\pi(M' \mid \mathcal{M})}. \qquad (3–3)$$

It has been shown that the system of intrinsic priors produces consistent model selection (Casella et al., 2009; Giron et al., 2010). In the context of well-formulated models, the true model $M_T$ is the smallest well-formulated model $M \in \mathcal{M}$ such that $\alpha \in M$ if $\beta_\alpha \neq 0$. If $M_T$ is the true model, then the posterior probability of model $M_T$ based on equation (3–3) converges to 1.

3.2.2.2 Other mixtures of g-priors

Scaled mixtures of g-priors place a reference prior on $(\beta_{M_B}, \sigma^2)$ and a multivariate normal distribution on $\beta$ in $M \setminus M_B$, that is, normal with mean $\mathbf{0}$ and precision matrix

$$\frac{q_M w}{n\sigma^2}Z_M'(I - H_0)Z_M,$$

where $H_0$ is the hat matrix associated with $Z_{M_B}$. The prior is completed by a prior on $w$ and a choice of scaling $q_M$, which is set at $|M| + 1$ to account for the minimal sample size of $M$. Under these assumptions, the Bayes factor for $M$ to $M_B$ is given by

$$BF_{M,M_B}(\mathbf{y}) = \bigl(1 - R^2_M\bigr)^{-\frac{n-|M_B|}{2}}\int\left(\frac{n + w(|M|+1)}{n + \frac{w(|M|+1)}{1 - R^2_M}}\right)^{\frac{n-|M|}{2}}\left(\frac{w(|M|+1)}{n + \frac{w(|M|+1)}{1 - R^2_M}}\right)^{\frac{|M|-|M_B|}{2}}\pi(w)\,dw.$$

We consider the following priors on $w$. The intrinsic prior is $\pi(w) = \text{Beta}(w;\, 0.5, 0.5)$, which is only defined for $w \in (0, 1)$. A version of the Zellner-Siow prior is given by $w \sim \text{Gamma}(0.5, 0.5)$, which produces a multivariate Cauchy distribution on $\beta$. A family of hyper-g priors is defined by $\pi(w) \propto w^{-1/2}(\beta + w)^{-(\alpha+1)/2}$, which have Cauchy-like tails but produce more shrinkage than the Cauchy prior.

3.3 Objective Bayes Occupancy Model Selection

As mentioned before, Bayesian inferential approaches used for ecological models are lacking. In particular, there exists a need for suitable objective and automatic Bayesian testing procedures and software implementations that explore the model space considered thoroughly. With this goal in mind, in this section we develop an objective, intrinsic and fully automatic Bayesian model selection methodology for single season site-occupancy models. We refer to this method as automatic and objective given that, in its implementation, no hyperparameter tuning is required and that it is built using noninformative priors with good testing properties (e.g., intrinsic priors).

An inferential method for the occupancy problem is possible using the intrinsic approach, given that we are able to link intrinsic-Bayesian tools for the normal linear model through our probit formulation of the occupancy model. In other words, because we can represent the single season probit occupancy model through the hierarchy

$$y_{ij} \mid z_i, w_{ij} \sim \text{Bernoulli}\bigl(z_iI_{w_{ij}>0}\bigr), \quad w_{ij} \mid \lambda \sim N\bigl(q_{ij}'\lambda, 1\bigr),$$
$$z_i \mid v_i \sim \text{Bernoulli}\bigl(I_{v_i>0}\bigr), \quad v_i \mid \alpha \sim N(x_i'\alpha, 1),$$

it is possible to solve the selection problem on the latent scale variables $w_{ij}$ and $v_i$, and to use those results at the level of the occupancy and detection processes.

In what follows, first we provide some necessary notation. Then a derivation of the intrinsic priors for the parameters of the detection and occupancy components is outlined. Using these priors, we obtain the general form of the model posterior probabilities. Finally, the results are incorporated in a model selection algorithm for site-occupancy data. Although the priors on the model space are not discussed in this chapter, the software and methods developed have different choices of model priors built in.

3.3.1 Preliminaries

The notation used in Chapter 2 will be considered in this section as well. Namely, presence will be denoted by $z$, detection by $y$, their corresponding latent processes are $v$ and $w$, and the model parameters are denoted by $\alpha$ and $\lambda$. However, some additional notation is also necessary. Let $M_0 = \{M_{0y}, M_{0z}\}$ denote the "base" model, defined by the smallest models considered for the detection and presence processes. The base models $M_{0y}$ and $M_{0z}$ include predictors that must be contained in every model that belongs to the model space. Some examples of base models are the intercept-only model, a model with covariates related to the sampling design, and a model including some predictors important to the researcher that should be included in every model.

the covariates considered for the variable selection procedure for the presence and

detection processes respectively That is these sets denote the covariates that can

be added from the base models in M0 or removed from the largest possible models

considered MF z and MF y which we will refer to as the ldquofullrdquo models The model space

can then be represented by the Cartesian product of subsets such that Ay sube [Ky ]

and Az sube [Kz ] The entire model space is populated by models of the form MA =MAy

MAz

isin M = My timesMz with MAy

isin My and MAzisin Mz

For the presence process z the design matrix for model MAzis given by the block

matrix XAz= (X0|Xr A) X0 corresponds to the design matrix of the base model ndash which

is such that M0z sube MAzisin Mz for all Az isin [Kz ] ndash and Xr A corresponds to the submatrix

that contains the covariates indexed by Az Analogously for the detection process y the

design matrix is given by QAy= (Q0|Qr A) Similarly the coefficients for models MAz

and

MAyare given by αA = (αprime

0αprimer A)

prime and λA = (λprime0λ

primer A)

prime

With these elements in place, the model selection problem consists of finding subsets of covariates indexed by $A = \{A_z, A_y\}$ that have a high posterior probability given the detection and occupancy processes. This is equivalent to finding models with high posterior odds when compared to a suitable base model. These posterior odds are given by

$$\frac{p(M_A \mid \mathbf{y}, \mathbf{z})}{p(M_0 \mid \mathbf{y}, \mathbf{z})} = \frac{m(\mathbf{y}, \mathbf{z} \mid M_A)\,\pi(M_A)}{m(\mathbf{y}, \mathbf{z} \mid M_0)\,\pi(M_0)} = BF_{M_A,M_0}(\mathbf{y}, \mathbf{z})\,\frac{\pi(M_A)}{\pi(M_0)}.$$

Since we are able to represent the occupancy model as a truncation of latent normal variables, it is possible to work through the occupancy model selection problem in the latent normal scale used for the presence and detection processes. We formulate two solutions to this problem: one that depends on the observed and latent components, and another that solely depends on the latent-level variables used to data-augment the problem. We will, however, focus on the latter approach, as this yields a straightforward MCMC sampling scheme. For completeness, the other alternative is described in Section 3.4.

At the root of our objective inferential procedure for occupancy models lies the conditional argument introduced by Womack et al. (work in progress) for simple probit regression. In the occupancy setting, the argument is

$$p(M_A \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v}) = \frac{m(\mathbf{y}, \mathbf{z}, \mathbf{v}, \mathbf{w} \mid M_A)\,\pi(M_A)}{m(\mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})} = \frac{f_{yz}(\mathbf{y}, \mathbf{z} \mid \mathbf{w}, \mathbf{v})\bigl(\int f_{vw}(\mathbf{v}, \mathbf{w} \mid \alpha, \lambda, M_A)\,\pi_{\alpha\lambda}(\alpha, \lambda \mid M_A)\,d(\alpha, \lambda)\bigr)\pi(M_A)}{f_{yz}(\mathbf{y}, \mathbf{z} \mid \mathbf{w}, \mathbf{v})\sum_{M^*\in\mathcal{M}}\bigl(\int f_{vw}(\mathbf{v}, \mathbf{w} \mid \alpha, \lambda, M^*)\,\pi_{\alpha\lambda}(\alpha, \lambda \mid M^*)\,d(\alpha, \lambda)\bigr)\pi(M^*)}$$
$$= \frac{m(\mathbf{v} \mid M_{A_z})\,m(\mathbf{w} \mid M_{A_y})\,\pi(M_A)}{m(\mathbf{v})\,m(\mathbf{w})} \propto m(\mathbf{v} \mid M_{A_z})\,m(\mathbf{w} \mid M_{A_y})\,\pi(M_A), \qquad (3–4)$$

where

1. $f_{yz}(\mathbf{y}, \mathbf{z} \mid \mathbf{w}, \mathbf{v}) = \prod_{i=1}^{N}I_{v_i>0}^{z_i}I_{v_i\le 0}^{(1-z_i)}\prod_{j=1}^{J}(z_iI_{w_{ij}>0})^{y_{ij}}(1 - z_iI_{w_{ij}>0})^{1-y_{ij}}$,

2. $f_{vw}(\mathbf{v}, \mathbf{w} \mid \alpha, \lambda, M_A) = \underbrace{\Bigl(\prod_{i=1}^{N}\phi(v_i;\, x_i'\alpha_{M_{A_z}}, 1)\Bigr)}_{f(\mathbf{v}\,\mid\,\alpha_{r,A},\,\alpha_0,\,M_{A_z})}\underbrace{\Bigl(\prod_{i=1}^{N}\prod_{j=1}^{J_i}\phi(w_{ij};\, q_{ij}'\lambda_{M_{A_y}}, 1)\Bigr)}_{f(\mathbf{w}\,\mid\,\lambda_{r,A},\,\lambda_0,\,M_{A_y})}$, and

3. $\pi_{\alpha\lambda}(\alpha, \lambda \mid M_A) = \pi_\alpha(\alpha \mid M_{A_z})\,\pi_\lambda(\lambda \mid M_{A_y})$.

This result implies that once the occupancy and detection indicators are conditioned on the latent processes $\mathbf{v}$ and $\mathbf{w}$, respectively, the model posterior probabilities only depend on the latent variables. Hence, in this case the model selection problem is driven by the posterior odds

$$\frac{p(M_A \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})}{p(M_0 \mid \mathbf{y}, \mathbf{z}, \mathbf{w}, \mathbf{v})} = \frac{m(\mathbf{w}, \mathbf{v} \mid M_A)}{m(\mathbf{w}, \mathbf{v} \mid M_0)}\,\frac{\pi(M_A)}{\pi(M_0)}, \qquad (3–5)$$

where m(w v|MA) = m(w|MAy) middotm(v|MAz

) with

m(v|MAz) =

int intf (v|αr Aα0MAz

)π(αr A|α0MAz)π(α0)dαr Adα0

(3ndash6)

m(w|MAy) =

int intf (w|λr Aλ0MAy

)π(λr A|λ0MAy)π(λ0)dλ0dλr A

(3ndash7)

332 Intrinsic Priors for the Occupancy Problem

In general the intrinsic priors as defined by Moreno et al (1998) use the functional

form of the response to inform their construction assuming some preliminary prior

distribution proper or improper on the model parameters For our purposes we assume

noninformative improper priors for the parameters denoted by πN(middot|middot) Specifically the

intrinsic priors πIP(θMlowast|Mlowast) for a vector of parameters θMlowast corresponding to model

Mlowast isin M0M sub M for a response vector s with probability density (or mass) function

f (s|θMlowast) are defined by

πIP(θM0|M0) = πN(θM0

|M0)

πIP(θM |M) = πN(θM |M)

intm(~s|M)

m(~s|M0)f (~s|θM M)d~s

where ~s is a theoretical training sample

In what follows whenever it is clear from the context in an attempt to simplify the

notation MA will be used to refer to MAzor MAy

and A will denote Az or Ay To derive

60

the parameter priors involved in equations 3ndash6 and 3ndash7 using the objective intrinsic prior

strategy we start by assuming flat priors πN(αA|MA) prop cA and πN(λA|MA) prop dA where

cA and dA are unknown constants

The intrinsic prior for the parameters associated with the occupancy process αA

conditional on model MA is

πIP(αA|MA) = πN(αA|MA)

intm(~v|MA)

m(~v|M0)f (~v|αAMA)d~v

where the marginals m(~v|Mj) with j isin A 0 are obtained by solving the analogous

equation 3ndash6 for the (theoretical) training sample ~v These marginals are given by

m(~v|Mj) = cj (2π)pjminusp0

2 |~X primej~Xj |

12 eminus

12~vprime(Iminus~Hj )~v

The training sample ~v has dimension pAz=∣∣MAz

∣∣ that is the total number of

parameters in model MAz Note that without ambiguity we use

∣∣ middot ∣∣ to denote both

the cardinality of a set and also the determinant of a matrix The design matrix ~XA

corresponds to the training sample ~v and is chosen such that ~X primeA~XA =

pAzNX primeAXA

(Leon-Novelo et al 2012) and ~Hj is the corresponding hat matrix

Replacing m(~v|MA) and m(~v|M0) in πIP(αA|MA) and solving the integral with

respect to the theoretical training sample ~v we have

πIP(αA|MA) = cA

int ((2π)minus

pAzminusp0z2

(c0

cA

)eminus

12~vprime((Iminus~HA)minus(Iminus~H0))~v |~X

primeA~XA|12

|~X prime0~X0|12

)times(

(2π)minuspAz2 eminus

12(~vminus~XAαA)

prime(~vminus~XAαA))d~v

= c0(2π)minus

pAzminusp0z2 |~X prime

Ar~XAr |

12 2minus

pAzminusp0z2 exp

[minus1

2αprimer A

(1

2~X primer A

~Xr A

)αr A

]= πN(α0)timesN

(αr A

∣∣ 0 2 middot ( ~X primer A

~Xr A)minus1)

(3ndash8)

61

Analogously the intrinsic prior for the parameters associated to the detection

process is

πIP(λA|MA) = d0(2π)minus

pAyminusp0y2 | ~Q prime

Ar~QAr |

12 2minus

pAyminusp0y2 exp

[minus1

2λprimer A

(1

2~Q primer A

~Qr A

)λr A

]= πN(λ0)timesN

(λr A

∣∣ 0 2 middot ( ~Q primeA~QA)

minus1)

(3ndash9)

In short the intrinsic priors for αA = (αprime0α

primer A)

prime and λprimeA = (λprime

0λprimer A)

prime are the product

of a reference prior on the parameters of the base model and a normal density on the

parameters indexed by Az and Ay respectively333 Model Posterior Probabilities

We now derive the expressions involved in the calculations of the model posterior

probabilities First recall that p(MA|y zw v) prop m(w v|MA)π(MA) Hence determining

this posterior probability only requires calculating m(w v|MA)

Note that since w and v are independent obtaining the model posteriors from

expression 3ndash4 reduces to finding closed form expressions for the marginals m(v |MAz)

and m(w |MAy) respectively from equations 3ndash6 and 3ndash7 Therefore

m(w v|MA) =

int intf (vw|αλMA)π

IP (α|MAz)πIP

(λ|MAy

)dαdλ

(3ndash10)

For the latent variable associated with the occupancy process plugging the

parameter intrinsic prior given by 3ndash8 into equation 3ndash6 (recalling that ~X primeA~XA =

pAzNX primeAXA)

and integrating out αA yields

m(v|MA) =

int intc0N (v|X0α0 + Xr Aαr A I)N

(αr A|0 2( ~X prime

r A~Xr A)

minus1)dαr Adα0

= c0(2π)minusn2

int (pAz

2N + pAz

) (pAzminusp0z

)

2

times

exp[minus1

2(v minus X0α0)

prime(I minus

(2N

2N + pAz

)Hr Az

)(v minus X0α0)

]dα0

62

= c0 (2π)minus(nminusp0z )2

(pAz

2N + pAz

) (pAzminusp0z

)

2

|X prime0X0|minus

12 times

exp[minus1

2vprime(I minus H0z minus

(2N

2N + pAz

)Hr Az

)v

] (3ndash11)

with Hr Az= HAz

minus H0z where HAzis the hat matrix for the entire model MAz

and H0z is

the hat matrix for the base model

Similarly the marginal distribution for w is

m(w|MA) = d0 (2π)minus(Jminusp0y )2

(pAy

2J + pAy

) (pAyminusp0y

)

2

|Q prime0Q0|minus

12 times

exp[minus1

2wprime(I minus H0y minus

(2J

2J + pAy

)Hr Ay

)w

] (3ndash12)

where J =sumN

i=1 Ji or in other words J denotes the total number of surveys conducted

Now the posteriors for the base model M0 =M0y M0z

are

m(v|M0) =

intc0N (v|X0α0 I) dα0

= c0(2π)minus(nminusp0z )2 |X prime

0X0|minus12 exp

[minus1

2(v (I minus H0z ) v)

](3ndash13)

and

m(w|M0) = d0(2π)minus(Jminusp0y )2 |Q prime

0Q0|minus12 exp

[minus1

2

(w(I minus H0y

)w)]

(3ndash14)

334 Model Selection Algorithm

Having the parameter intrinsic priors in place and knowing the form of the model

posterior probabilities it is finally possible to develop a strategy to conduct model

selection for the occupancy framework

For each of the two components of the model ndashoccupancy and detectionndash the

algorithm first draws the set of active predictors (ie Az and Ay ) together with their

corresponding parameters This is a reversible jump step which uses a Metropolis

63

Hastings correction with proposal distributions given by

q(Alowastz |zo z(t)u v(t)MAz

) =1

2

(p(MAlowast

z|zo z(t)u v(t)Mz MAlowast

zisin L(MAz

)) +1

|L(MAz)|

)q(Alowast

y |y zo z(t)u w(t)MAy) =

1

2

(p(MAlowast

w|y zo z(t)u w(t)My MAlowast

yisin L(MAy

)) +1

|L(MAy)|

)(3ndash15)

where L(MAz) and L(MAy

) denote the sets of models obtained from adding or removing

one predictor at a time from MAzand MAy

respectively

To promote mixing this step is followed by an additional draw from the full

conditionals of α and λ The densities p(α0|) p(αr A|) p(λ0|) and p(λr A|) can

be sampled from directly with Gibbs steps Using the notation a|middot to denote the random

variable a conditioned on all other parameters and on the data these densities are given

by

bull α0|middot sim N((X

prime0X0)

minus1Xprime0v (X

prime0X0)

minus1)bull αr A|middot sim N

(microαr A

αr A

) where the mean vector and the covariance matrix are

given by αr A= 2N

2N+pAz(X

prime

r AXr A)minus1 and microαr A

=(αr A

Xprime

r Av)

bull λ0|middot sim N((Q

prime0Q0)

minus1Qprime0w (Q

prime0Q0)

minus1) and

bull λr A|middot sim N(microλr A

λr A

) analogously with mean and covariance matrix given by

λr A= 2J

2J+pAy(Q

prime

r AQr A)minus1 and microλr A

=(λr A

Qprime

r Aw)

Finally Gibbs sampling steps are also available for the unobserved occupancy

indicators zu and for the corresponding latent variables v and w The full conditional

posterior densities for z(t+1)u v(t+1) and w(t+1) are those introduced in Chapter 2 for the

single season probit model

The following steps summarize the stochastic search algorithm

1 Initialize A(0)y A

(0)z z

(0)u v(0)w(0)α(0)

0 λ(0)0

2 Sample the model indices and corresponding parameters

(a) Draw simultaneously

64

bull Alowastz sim q(Az |zo z(t)u v(t)MAz

)

bull αlowast0 sim p(α0|MAlowast

z zo z

(t)u v(t)) and

bull αlowastr Alowast sim p(αr A|MAlowast

z zo z

(t)u v(t))

(b) Accept (M(t+1)Az

α(t+1)10 α(t+1)1

r A ) = (MAlowastzαlowast

0αlowastr Alowast) with probability

δz = min

(1

p(MAlowastz|zo z(t)u v(t))

p(MA(t)z|zo z(t)u v(t))

q(A(t)z |zo z(t)u v(t)MAlowast

z)

q(Alowastz |zo z

(t)u v(t)MAz

)

)

otherwise let (M(t+1)Az

α(t+1)10 α(t+1)1

r A ) = (A(t)z α(t)2

0 α(t)2r A )

(c) Sample simultaneously

bull Alowasty sim q(Ay |y zo z(t)u w(t)MAy

)

bull λlowast0 sim p(λ0|MAlowast

y y zo z

(t)u w(t)) and

bull λlowastr Alowast sim p(λr A|MAlowast

y y zo z

(t)u w(t))

(d) Accept (M(t+1)Ay

λ(t+1)10 λ(t+1)1

r A ) = (MAlowastyλlowast

0λlowastr Alowast) with probability

δy = min

(1

p(MAlowastz|y zo z(t)u w(t))

p(MA(t)z|y zo z(t)u w(t))

q(A(t)z |y zo z(t)u w(t)MAlowast

y)

q(Alowastz |y zo z

(t)u w(t)MAy

)

)

otherwise let (M(t+1)Ay

λ(t+1)10 λ(t+1)1

r A ) = (A(t)y λ(t)2

0 λ(t)2r A )

3 Sample base model parameters

(a) Draw α(t+1)20 sim p(α0|MA

(t+1)z

zo z(t)u v(t))

(b) Draw λ(t+1)20 sim p(λ0|MA(t+1)y

y zo z(t)u v(t))

4 To improve mixing resample model coefficients not present the base model butare in MA

(a) Draw α(t+1)2r A sim p(αr A|MA

(t+1)z

zo z(t)u v(t))

(b) Draw λ(t+1)2r A sim p(λr A|MA

(t+1)y

yzo z(t)u v(t))

5 Sample latent and missing (unobserved) variables

(a) Sample z(t+1)u sim p(zu|MA(t+1)z

yα(t+1)2r A α(t+1)2

0 λ(t+1)2r A λ(t+1)2

0 )

(b) Sample v(t+1) sim p(v|MA(t+1)z

zo z(t+1)u α(t+1)2

r A α(t+1)20 )

65

(c) Sample w(t+1) sim p(w|MA(t+1)y

zo z(t+1)u λ(t+1)2

r A λ(t+1)20 )

34 Alternative Formulation

Because the occupancy process is partially observed it is reasonable to consider

the posterior odds in terms of the observed responses that is the detections y and

the presences at sites where at least one detection takes place Partitioning the vector

of presences into observed and unobserved z = (zprimeo zprimeu)

prime and integrating out the

unobserved component the model posterior for MA can be obtained as

p(MA|y zo) prop Ezu [m(y z|MA)] π(MA) (3ndash16)

Data-augmenting the model in terms of latent normal variables a la Albert and Chib

the marginals for any model My Mz = M isin M of z and y inside of the expectation in

equation 3ndash16 can be expressed in terms of the latent variables

m(y z|M) =

intT (z)

intT (yz)

m(w v|M)dwdv

=

(intT (z)

m(v| Mz)dv

)(intT (yz)

m(w|My)dw

) (3ndash17)

where T (z) and T (y z) denote the corresponding truncation regions for v and w which

depend on the values taken by z and y and

m(v|Mz) =

intf (v|αMz)π(α|Mz)dα (3ndash18)

m(w|My) =

intf (w|λMy)π(λ|My)dλ (3ndash19)

The last equality in equation 3ndash17 is a consequence of the independence of the

latent processes v and w Using expressions 3ndash18 and 3ndash19 allows one to embed this

model selection problem in the classical linear normal regression setting where many

ldquoobjectiverdquo Bayesian inferential tools are available In particular these expressions

facilitate deriving the parameter intrinsic priors (Berger amp Pericchi 1996 Moreno

et al 1998) for this problem This approach is an extension of the one implemented in

Leon-Novelo et al (2012) for the simple probit regression problem

66

Using this alternative approach all that is left is to integrate m(v|MA) and m(w|MA)

over their corresponding truncation regions T (z) and T (y z) which yields m(y z|MA)

and then to obtain the expectation with respect to the unobserved zrsquos Note however

two issues arise First such integrals are not available in closed form Second

calculating the expectation over the limit of integration further complicates things To

address these difficulties it is possible to express E [m(y z|MA)] as

Ezu [m(y z|MA)] = Ezu

[(intT (z)

m(v| MAz)dv

)(intT (yz)

m(w|MAy)dw

)](3ndash20)

= Ezu

[(intT (z)

intm(v| MAz

α0)πIP(α0|MAz

)dα0dv

)times(int

T (yz)

intm(w| MAy

λ0)πIP(λ0|MAy

)dλ0dw

)]

= Ezu

int (int

T (z)

m(v| MAzα0)dv

)︸ ︷︷ ︸

g1(T (z)|MAz α0)

πIP(α0|MAz)dα0 times

int (intT (yz)

m(w|MAyλ0)dw

)︸ ︷︷ ︸

g2(T (yz)|MAy λ0)

πIP(λ0|MAy)dλ0

= Ezu

[intg1(T (z)|MAz

α0)πIP(α0|MAz

)dα0 timesintg2(T (y z)|MAy

λ0)πIP(λ0|MAy

)dλ0

]= c0 d0

int intEzu

[g1(T (z)|MAz

α0)g2(T (y z)|MAyλ0)

]dα0 dλ0

where the last equality follows from Fubinirsquos theorem since m(v|MAzα0) and

m(w|MAyλ0) are proper densities From 3ndash21 the posterior odds are

p(MA|y zo)p(M0|y zo)

=

int intEzu

[g1(T (z)|MAz

α0)g2(T (y z)|MAyλ0)

]dα0 dλ0int int

Ezu

[g1(T (z)|M0z α0)g2(T (y z)|M0y λ0)

]dα0 dλ0

π(MA)

π(M0)

(3ndash21)

67

35 Simulation Experiments

The proposed methodology was tested under 36 different scenarios where we

evaluate the behavior of the algorithm by varying the number of sites the number of

surveys the amount of signal in the predictors for the presence component and finally

the amount of signal in the predictors for the detection component

For each model component the base model is taken to be the intercept only model

and the full models considered for the presence and the detection have respectively 30

and 20 predictors Therefore the model space contains 230times220 asymp 112times1015 candidate

models

To control the amount of signal in the presence and detection components values

for the model parameter were purposefully chosen so that quantiles 10 50 and 90 of the

occupancy and detection probabilities match some pre-specified probabilities Because

presence and detection are binary variables the amount of signal in each model

component associates to the spread and center of the distribution for the occupancy and

detection probabilities respectively Low signal levels relate to occupancy or detection

probabilities close to 50 High signal levels associate with probabilities close to 0 or 1

Large spreads of the distributions for the occupancy and detection probabilities reflect

greater heterogeneity among the observations collected improving the discrimination

capability of the model and viceversa

Therefore for the presence component the parameter values of the true model

were chosen to set the median for the occupancy probabilities equal 05 The chosen

parameter values also fix quantiles 10 and 90 symmetrically about 05 at small (Qz10 =

03Qz90 = 07) intermediate (Qz

10 = 02Qz90 = 08) and large (Qz

10 = 01Qz90 = 09)

distances For the detection component the model parameters are obtained to reflect

detection probabilities concentrated about low values (Qy50 = 02) intermediate values

(Qy50 = 05) and high values (Qy

50 = 08) while keeping quantiles 10 and 90 fixed at 01

and 09 respectively

68

Table 3-1 Simulation control parameters occupancy model selectorParameter Values considered

N 50 100

J 3 5

(Qz10Q

z50Q

z90)

(03 05 07) (02 05 08) (01 05 09)

(Qy

10Qy50Q

y90)

(01 02 09) (01 05 09) (01 08 09)

There are in total 36 scenarios these result from crossing all the levels of the

simulation control parameters (Table 3-1) Under each of these scenarios 20 data sets

were generated at random True presence and detection indicators were generated

with the probit model formulation from Chapter 2 This with the assumed true models

MTz = 1 x2 x15 x16 x22 x28 for the presence and MTy = 1 q7 q10 q12 q17 for

the detection with the predictors included in the randomly generated datasets In this

context 1 represents the intercept term Throughout the Section we refer to predictors

included in the true models as true predictors and to those absent as false predictors

The selection procedure was conducted using each one of these data sets with

two different priors on the model space the uniform or equal probability prior and a

multiplicity correcting prior

The results are summarized through the marginal posterior inclusion probabilities

(MPIPs) for each predictor and also the five highest posterior probability models (HPM)

The MPIP for a given predictor under a specific scenario and for a particular data set is

defined as

p(predictor is included|y zw v) =sumMisinM

I(predictorisinM)p(M|y zw vM) (3ndash22)

In addition we compare the MPIP odds between predictors present in the true model

and predictors absent from it Specifically we consider the minimum odds of marginal

posterior inclusion probabilities for the predictors Let ~ξ and ξ denote respectively a

69

predictor in the true model MT and a predictor absent from MT We define the minimum

MPIP odds between the probabilities of true and false predictor as

minOddsMPIP =min~ξisinMT

p(I~ξ = 1|~ξ isin MT )

maxξ isinMTp(Iξ = 1|ξ isin MT )

(3ndash23)

If the variable selection procedure adequately discriminates true and false predictors

minOddsMPIP will take values larger than one The ability of the method to discriminate

between the least probable true predictor and the most probable false predictor worsens

as the indicator approaches 0351 Marginal Posterior Inclusion Probabilities for Model Predictors

For clarity in Figures 3-1 through 3-5 only predictors in the true models are labeled

and are emphasized with a dotted line passing through them The left hand side plots

in these figures contain the results for the presence component and the ones on the

right correspond to predictors in the detection component The results obtained with

the uniform model priors correspond to the black lines and those for the multiplicity

correcting prior are in red In these Figures the MPIPrsquos have been averaged over all

datasets corresponding scenarios matching the condition observed

In Figure 3-1 we contrast the mean MPIPrsquos of the predictors over all datasets from

scenarios with 50 sites to the mean MPIPrsquos obtained for the scenarios with 100 sites

Similarly Figure 3-2 compares the mean MPIPrsquos of scenarios where 3 surveys are

performed to those of scenarios having 5 surveys per site Figures 3-4 and 3-5 show the

effect of the different levels of signal considered in the occupancy probabilities and in the

detection probabilities

From these figures mainly three results can be drawn (1) the effect of the model

prior is substantial (2) the proposed methods yield MPIPrsquos that clearly separate

true predictors from false predictors and (3) the separation between MPIPrsquos of true

predictors and false predictors is noticeably larger in the detection component

70

Regardless of the simulation scenario and model component observed under the

uniform prior false predictors obtain a relatively high MPIP Conversely the multiplicity

correction prior strongly shrinks towards 0 the MPIP for false predictors In the presence

component the MPIP for the true predictors is shrunk substantially under the multiplicity

prior however there remains a clear separation between true and false predictors In

contrast in the detection component the MPIP for true predictors remains relatively high

(Figures 3-1 through 3-5)

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

Unif N=50MC N=50

Unif N=100MC N=100

Figure 3-1 Predictor MPIP averaged over scenarios with N=50 and N=100 sites usinguniform (U) and multiplicity correction (MC) priors

71

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

Unif J=3MC J=3

Unif J=5MC J=5

Figure 3-2 Predictor MPIP averaged over scenarios with J=3 and J=5 surveys per siteusing uniform (U) and multiplicity correction (MC) priors

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

Unif N=50 J=3Unif N=50 J=5

Unif N=100 J=3Unif N=100 J=5

MC N=50 J=3MC N=50 J=5

MC N=100 J=3MC N=100 J=5

Figure 3-3 Predictor MPIP averaged over scenarios with the interaction between thenumber of sites and the surveys per site using uniform (U) and multiplicitycorrection (MC) priors

72

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

U(03 05 07)MC(03 05 07)

U(02 05 08)MC(02 05 08)

U(01 05 09)MC(01 05 09)

Figure 3-4 Predictor MPIP averaged over scenarios with equal signal in the occupancyprobabilities using uniform (U) and multiplicity correction (MC) priors

00

02

04

06

08

10

x2 x15 x22 x28 q7 q10 q17

Presence component Detection component

Mar

gina

l inc

lusi

on p

roba

bilit

y

U(01 02 09)MC(01 02 09)

U(01 05 09)MC(01 05 09)

U(01 08 09)MC(01 08 09)

Figure 3-5 Predictor MPIP averaged over scenarios with equal signal in the detectionprobabilities using uniform (U) and multiplicity correction (MC) priors

73

In scenarios where more sites were surveyed the separation between the MPIP of

true and false predictors grew in both model components (Figure 3-1) Increasing the

number of sites has an effect over both components given that every time a new site is

included covariate information is added to the design matrix of both the presence and

the detection components

On the hand increasing the number of surveys affects the MPIP of predictors in the

detection component (Figures 3-2 and 3-3) but has only a marginal effect on predictors

of the presence component This may appear to be counterintuitive however increasing

the number of surveys only increases the number of observation in the design matrix

for the detection while leaving unaltered the design matrix for the presence The small

changes observed in the MPIP for the presence predictors J increases are exclusively

a result of having additional detection indicators equal to 1 in sites where with less

surveys would only have 0 valued detections

From Figure 3-3 it is clear that for the presence component the effect of the number

of sites dominates the behavior of the MPIP especially when using the multiplicity

correction priors In the detection component the MPIP is influenced by both the number

of sites and number of surveys The influence of increasing the number of surveys is

larger when considering a smaller number of sites and viceversa

Regarding the effect of the distribution for the occupancy probabilities we observe

that mostly the detection component is affected There is stronger discrimination

between true and false predictors as the distribution has a higher variability (Figure

3-4) This is consistent with intuition since having the presence probabilities more

concentrated about 05 implies that the predictors do not vary much from one site to

the next whereas having the occupancy probabilities more spread out would have the

opposite effect

Finally concentrating the detection probabilities about high or low values For

predictors in the detection component the separation between MPIP of true and false

74

predictors is larger both in scenarios where the distribution of the detection probability

is centered about 02 or 08 when compared to those scenarios where this distribution

is centered about 05 (where the signal of the predictors is weakest) For predictors in

the presence component having the detection probabilities centered at higher values

slightly increases the inclusion probabilities of the true predictors (Figure 3-5) and

reduces that of false predictors

Table 3-2 Comparison of average minOddsMPIP under scenarios having differentnumber of sites (N=50 N=100) and under scenarios having different numberof surveys per site (J=3 J=5) for the presence and detection componentsusing uniform and multiplicity correction priors

Sites SurveysComp π(M) N=50 N=100 J=3 J=5

Presence Unif 112 131 119 124MC 320 846 420 674

Detection Unif 203 264 211 257MC 2115 3246 2139 3252

Table 3-3 Comparison of average minOddsMPIP for different levels of signal consideredin the occupancy and detection probabilities for the presence and detectioncomponents using uniform and multiplicity correction priors

(Qz10Q

z50Q

z90) (Qy

10Qy50Q

y90)

Comp π(M) (030507) (020508) (010509) (010209) (010509) (010809)

Presence Unif 105 120 134 110 123 124MC 202 455 805 238 619 640

Detection Unif 234 234 230 257 200 238MC 2537 2077 2528 2933 1852 2849

The separation between the MPIP of true and false predictors is even more

notorious in Tables 3-2 and 3-3 where the minimum MPIP odds between true and

false predictors are shown Under every scenario the value for the minOddsMPIP (as

defined in 3ndash23) was greater than 1 implying that on average even the lowest MPIP

for a true predictor is higher than the maximum MPIP for a false predictor In both

components of the model the minOddsMPIP are markedly larger under the multiplicity

correction prior and increase with the number of sites and with the number of surveys

75

For the presence component increasing the signal in the occupancy probabilities

or having the detection probabilities concentrate about higher values has a positive and

considerable effect on the magnitude of the odds For the detection component these

odds are particularly high specially under the multiplicity correction prior Also having

the distribution for the detection probabilities center about low or high values increases

the minOddsMPIP 352 Summary Statistics for the Highest Posterior Probability Model

Tables 3-4 through 3-7 show the number of true predictors that are included in

the HPM (True +) and the number of false predictors excluded from it (True minus)

The mean percentages observed in these Tables provide one clear message The

highest probability models chosen with either model prior commonly differ from the

corresponding true models The multiplicity correction priorrsquos strong shrinkage only

allows a few true predictors to be selected but at the same time it prevents from

including in the HPM any false predictors On the other hand the uniform prior includes

in the HPM a larger proportion of true predictors but at the expense of also introducing

a large number of false predictors This situation is exacerbated in the presence

component but also occurs to a minor extent in the detection component

Table 3-4 Comparison between scenarios with 50 and 100 sites in terms of the averagepercentage of true positive and true negative terms over the highestprobability models for the presence and the detection components usinguniform and multiplicity correcting priors on the model space

True + True minusComp π(M) N=50 N=100 N=50 N=100

Presence Unif 057 063 051 055MC 006 013 100 100

Detection Unif 077 085 087 093MC 049 070 100 100

Having more sites or surveys improves the inclusion of true predictors and exclusion

of false ones in the HPM for both the presence and detection components (Tables 3-4

and 3-5) On the other hand if the distribution for the occupancy probabilities is more

76

Table 3-5 Comparison between scenarios with 3 and 5 surveys per site in terms of thepercentage of true positive and true negative predictors averaged over thehighest probability models for the presence and the detection componentsusing uniform and multiplicity correcting priors on the model space

True + True minusComp π(M) J=3 J=5 J=3 J=5

Presence Unif 059 061 052 054MC 008 010 100 100

Detection Unif 078 085 087 092MC 050 068 100 100

spread out the HPM includes more true predictors and less false ones in the presence

component In contrast the effect of the spread of the occupancy probabilities in the

detection HPM is negligible (Table 3-6) Finally there is a positive relationship between

the location of the median for the detection probabilities and the number of correctly

classified true and false predictors for the presence The HPM in the detection part of

the model responds positively to low and high values of the median detection probability

(increased signal levels) in terms of correctly classified true and false predictors (Table

3-7)

Table 3-6 Comparison between scenarios with different level of signal in the occupancycomponent in terms of the percentage of true positive and true negativepredictors averaged over the highest probability models for the presence andthe detection components using uniform and multiplicity correcting priors onthe model space

True + True minusComp π(M) (030507) (020508) (010509) (030507) (020508) (010509)

Presence Unif 055 061 064 050 054 055MC 002 008 018 100 100 100

Detection Unif 081 082 081 090 089 089MC 057 061 059 100 100 100

36 Case Study Blue Hawker Data Analysis

During 1999 and 2000 an intensive volunteer surveying effort coordinated by the

Centre Suisse de Cartographie de la Faune (CSCF) was conducted in order to analyze

the distribution of the blue hawker Ashna cyanea (Odonata Aeshnidae) a common

dragonfly in Switzerland Given that Switzerland is a small and mountainous country

77

Table 3-7 Comparison between scenarios with different level of signal in the detectioncomponent in terms of the percentage of true positive and true negativepredictors averaged over the highest probability models for the presence andthe detection components using uniform and multiplicity correcting priors onthe model space

True + True minusComp π(M) (010209) (010509) (010809) (010209) (010509) (010809)

Presence Unif 059 059 062 051 054 054MC 006 010 011 100 100 100

Detection Unif 089 077 078 091 087 091MC 070 048 059 100 100 100

there is large variation in its topography and physio-geography as such elevation is a

good candidate covariate to predict species occurrence at a large spatial scale It can

be used as a proxy for habitat type intensity of land use temperature as well as some

biotic factors (Kery et al 2010)

Repeated visits to 1-ha pixels took place to obtain the corresponding detection

history In addition to the survey outcome the x and y-coordinates thermal-level the

date of the survey and the elevation were recorded Surveys were restricted to the

known flight period of the blue hawker which takes place between May 1 and October

10 In total 2572 sites were surveyed at least once during the surveying period The

number of surveys per site ranges from 1 to 22 times within each survey year

Kery et al (2010) summarize the results of this effort using AIC-based model

comparisons first by following a backwards elimination approach for the detection

process while keeping the occupancy component fixed at the most complex model and

then for the presence component choosing among a group of three models while using

the detection model chosen In our analysis of this dataset for the detection and the

presence we consider as the full models those used in Kery et al (2010) namely

minus1(ψ) = α0 + α1year+ α2elev+ α3elev2 + α4elev

3

minus1(p) = λ0 + λ1year+ λ2elev+ λ3elev2 + λ4elev

3 + λ5date+ λ6date2

78

where year = Iyear=2000

The model spaces for this data contain 26 = 64 and 24 = 16 models respectively

for the detection and occupancy components That is in total the model space contains

24+6 = 1 024 models Although this model space can be enumerated entirely for

illustration we implemented the algorithm from section 334 generating 10000 draws

from the Gibbs sampler Each one of the models sampled were chosen from the set of

models that could be reached by changing the state of a single term in the current model

(to inclusion or exclusion accordingly) This allows a more thorough exploration of the

model space because for each of the 10000 models drawn the posterior probabilities

for many more models can be observed Below the labels for the predictors are followed

by either ldquozrdquo or ldquoyrdquo accordingly to represent the component they pertain to Finally

using the results from the model selection procedure we conducted a validation step to

determine the predictive accuracy of the HPMrsquos and of the median probability models

(MPMrsquos) The performance of these models is then contrasted with that of the model

ultimately selected by Kery et al (2010)361 Results Variable Selection Procedure

The model finally chosen for the presence component in Kery et al (2010) was not

found among the highest five probability models under either model prior 3-8 Moreover

the year indicator was never chosen under the multiplicity correcting prior hinting that

this term might correspond to a falsely identified predictor under the uniform prior

Results in Table 3-10 support this claim the marginal inclusion posterior probability for

the year predictor is 7 under the multiplicity correction prior The multiplicity correction

prior concentrates more densely the model posterior probability mass in the highest

ranked models (90 of the mass is in the top five models) than the uniform prior (which

account for 40 of the mass)

For the detection component the HPM under both priors is the intercept only model

which we represent in Table 3-9 with a blank label In both cases this model obtains very

79

Table 3-8 Posterior probability for the five highest probability models in the presencecomponent of the blue hawker data

Uniform model priorRank Mz selected p(Mz |y)

1 yrz+elevz 0102 yrz+elevz+elevz3 0083 elevz2+elevz3 0084 yrz+elevz2 0075 yrz+elevz3 007

Multiplicity correcting model priorRank Mz selected p(Mz |y)

1 elevz+elevz3 0532 0153 elevz+elevz2 0094 elevz2 0065 elevz+elevz2+elevz3 005

high posterior probabilities The terms contained in cubic polynomial for the elevation

appear to contain some relevant information however this conflicts with the MPIPs

observed in Table 3-11 which under both model priors are relatively low (lt 20 with the

uniform and le 4 with the multiplicity correcting prior)

Table 3-9 Posterior probability for the five highest probability models in the detectioncomponent of the blue hawker data

Uniform model priorRank Mz selected p(Mz |y)

1 0452 elevy3 0063 elevy2 0054 elevy 0055 yry 004

Multiplicity correcting model priorRank Mz selected p(Mz |y)

1 0862 elevy3 0023 datey2 0024 elevy2 0025 yry 002

Finally it is possible to use the MPIPs to obtain the median probability model which

contains the terms that have a MPIP higher than 50 For the occupancy process

(Table 3-10) under the uniform prior the model with the year the elevation and the

elevation cubed are included The MPM with multiplicity correction prior coincides with

the HPM from this prior The MPM chosen for the detection component (Table 3-11)

under both priors is the intercept only model coinciding again with the HPM

Given the outcomes of the simulation studies from Section 35 especially those

pertaining to the detection component the results in Table 3-11 appear to indicate that

none of the predictors considered belong to the true model especially when considering

80

Table 3-10 MPIP presence component

Predictor p(predictor isin MTz |y z w v)

Unif MultCorryrz 053 007elevz 051 073elevz2 045 023elevz3 050 067

Table 3-11 MPIP detection component

Predictor p(predictor isin MTy |y z w v)

Unif MultCorryry 019 003elevy 018 003elevy2 018 003elevy 3 019 004datey 016 003datey2 015 004

those derived with the multiplicity correction prior On the other hand for the presence

component (Table 3-10) there is an indication that terms related to the cubic polynomial

in elevz can explain the occupancy patterns362 Validation for the Selection Procedure

Approximately half of the sites were selected at random for training (ie for model

selection and parameter estimation) and the remaining half were used as test data In

the previous section we observed that using the marginal posterior inclusion probability

of the predictors the our method effectively separates predictors in the true model from

those that are not in it However in Tables 3-10 and 3-11 this separation is only clear for

the presence component using the multiplicity correction prior

Therefore in the validation procedure we observe the misclassification rates for the

detections using the following models (1) the model ultimately recommended in Kery

et al (2010) (yrz+elevz+elevz2+elevz3 + elevy+ elevy2+ datey+ datey2) (2) the

highest probability model (HPM) with a uniform prior (yrz+elevz) (3) the HPM with a

multiplicity correcting prior (elevz + elevz3 ) (4) the median probability model (MPM)

ndashthe model including only predictors with a MPIP larger than 50ndash with the uniform

prior (yrz+elevz+elevz3) and finally (5) the MPM with a multiplicity correction prior

(elevz+elevz3 same as the HPM with multiplicity correction)

We must emphasize that the models resulting from the implement ion of our model

selection procedure used exclusively the training dataset On the other hand the model

in Kery et al (2010) was chosen to minimize the prediction error of the complete data

81

Because this model was obtained from the full dataset results derived from it can only

be considered as a lower bound for the prediction errors

The benchmark misclassification error rate for true 1rsquos is high (close to 70)

However the misclassification rate for true 0rsquos which accounts for most of the

responses is less pronounced (15) Overall the performance of the selected models

is comparable They yield considerably worse results than the benchmark for the true

1rsquos but achieve rates close to the benchmark for the true zeros Pooling together

the results for true ones and true zeros the selected models with either prior have

misclassification rates close to 30 The benchmark model performs comparably with a

joint misclassification error of 23 (Table 3-12)

Table 3-12 Mean misclassification rate for HPMrsquos and MPMrsquos using uniform andmultiplicity correction model priors

Model True 1 True 0 Jointbenchmark (Kery et al 2010) yrz+elevz+elevz2+elevz3 + 066 015 023

elevy+ elevy2+ datey+ datey2

HPM Unif yrz+elevz 083 017 028HPMHPM MC elevz + elevz3 082 018 028MPM Unif yrz+elevz+elevz3 082 018 029

37 Discussion

In this Chapter we proposed an objective and fully automatic Bayes methodology for

the single season site-occupancy model The methodology is said to be fully automatic

because no hyper-parameter specification is necessary in defining the parameter priors

and objective because it relies on the intrinsic priors derived from noninformative priors

The intrinsic priors have been shown to have desirable properties as testing priors We

also propose a fast stochastic search algorithm to explore large model spaces using our

model selection procedure

Our simulation experiments demonstrated the ability of the method to single out the

predictors present in the true model when considering the marginal posterior inclusion

probabilities for the predictors For predictors in the true model these probabilities

were comparatively larger than those for predictors absent from it Also the simulations

82

indicated that the method has a greater discrimination capability for predictors in the

detection component of the model especially when using multiplicity correction priors

Multiplicity correction priors were not described in this Chapter however their

influence on the selection outcome is significant This behavior was observed in the

simulation experiment and in the analysis of the Blue Hawker data Model priors play an

essential role As the number of predictors grows these are instrumental in controlling

for selection of false positive predictors Additionally model priors can be used to

account for predictor structure in the selection process which helps both to reduce the

size of the model space and to make the selection more robust These issues are the

topic of the next Chapter

Accounting for the polynomial hierarchy in the predictors within the occupancy

context is a straightforward extension of the procedures we describe in Chapter 4

Hence our next step is to develop efficient software for it An additional direction we

plan to pursue is developing methods for occupancy variable selection in a multivariate

setting This can be used to conduct hypothesis testing in scenarios with varying

conditions through time or in the case where multiple species are co-observed A

final variation we will investigate for this problem is that of occupancy model selection

incorporating random effects

83

CHAPTER 4PRIORS ON THE MODEL SPACE AND WELL-FORMULATED MODELS

It has long been an axiom of mine that the little things are infinitely themost important

ndashSherlock HolmesA Case of Identity

41 Introduction

In regression problems if a large number of potential predictors is available the

complete model space is too large to enumerate and automatic selection algorithms are

necessary to find informative parsimonious models This multiple testing problem

is difficult and even more so when interactions or powers of the predictors are

considered In the ecological literature models with interactions andor higher order

polynomial terms are ubiquitous (Johnson et al 2013 Kery et al 2010 Zeller et al

2011) given the complexity and non-linearities found in ecological processes Several

model selection procedures even in the classical normal linear setting fail to address

two fundamental issues (1) the model selection outcome is not invariant to affine

transformations when interactions or polynomial structures are found among the

predictors and (2) additional penalization is required to control for false positives as the

model space grows (ie as more covariates are considered)

These two issues motivate the developments developed throughout this Chapter

Building on the results of Chipman (1996) we propose investigate and provide

recommendations for three different prior distributions on the model space These

priors help control for test multiplicity while accounting for polynomial structure in the

predictors They improve upon those proposed by Chipman first by avoiding the need

for specific values for the prior inclusion probabilities of the predictors and second

by formulating principled alternatives to introduce additional structure in the model

84

priors Finally we design a stochastic search algorithm that allows fast and thorough

exploration of model spaces with polynomial structure

Having structure in the predictors can determine the selection outcome As an

illustration consider the model E [y ] = β00 + β01x2 + β20x21 where the order one

term x1 is not present (this choice of subscripts for the coefficients is defined in the

following section) Transforming x1 7rarr xlowast1 = x1 + c for some c = 0 the model

becomes E [y ] = β00 + β01x2 + βlowast20x

lowast21 Note that in terms of the original predictors

xlowast21 = x21 + 2c middot x1 + c2 implying that this seemingly innocuous transformation of x1

modifies the column space of the design matrix by including x1 which was not in the

original model That is when lower order terms in the hierarchy are omitted from the

model the column space of the design matrix is not invariant to afine transformations

As the hat matrix depends on the column space the modelrsquos predictive capability is also

affected by how the covariates in the model are coded an undesirable feature for any

model selection procedure To make model selection invariant to afine transformations

the selection must be constrained to the subset of models that respect the hierarchy

(Griepentrog et al 1982 Khuri 2002 McCullagh amp Nelder 1989 Nelder 2000

Peixoto 1987 1990) These models are known as well-formulated models (WFMs)

Succinctly a model is well-formulated if for any predictor in the model every lower order

predictor associated with it is also in the model The model above is not well-formulated

as it contains x21 but not x1

WFMs exhibit strong heredity in that all lower order terms dividing higher order

terms in the model must also be included An alternative is to only require weak heredity

(Chipman 1996) which only forces some of the lower terms in the corresponding

polynomial hierarchy to be in the model However Nelder (1998) demonstrated that the

conditions under which weak heredity allows the design matrix to be invariant to afine

transformations of the predictors are too restrictive to be useful in practice

85

Although this topic appeared in the literature more than three decades ago (Nelder

1977) only recently have modern variable selection techniques been adapted to

account for the constraints imposed by heredity As described in Bien et al (2013)

the current literature on variable selection for polynomial response surface models

can be classified into three broad groups mult-istep procedures (Brusco et al 2009

Peixoto 1987) regularized regression methods (Bien et al 2013 Yuan et al 2009)

and Bayesian approaches (Chipman 1996) The methods introduced in this Chapter

take a Bayesian approach towards variable selection for well-formulated models with

particular emphasis on model priors

As mentioned in previous chapters the Bayesian variable selection problem

consists of finding models with high posterior probabilities within a pre-specified model

space M The model posterior probability for M isin M is given by

p(M|yM) prop m(y|M)π(M|M) (4ndash1)

Model posterior probabilities depend on the prior distribution on the model space

as well as on the prior distributions for the model specific parameters implicitly through

the marginals m(y|M) Priors on the model specific parameters have been extensively

discussed in the literature (Berger amp Pericchi 1996 Berger et al 2001 George 2000

Jeffreys 1961 Kass amp Wasserman 1996 Liang et al 2008 Zellner amp Siow 1980) In

contrast the effect of the prior on the model space has until recently been neglected

A few authors (eg Casella et al (2014) Scott amp Berger (2010) Wilson et al (2010))

have highlighted the relevance of the priors on the model space in the context of multiple

testing Adequately formulating priors on the model space can both account for structure

in the predictors and provide additional control on the detection of false positive terms

In addition using the popular uniform prior over the model space may lead to the

undesirable and ldquoinformativerdquo implication of favoring models of size p2 (where p is the

86

total number of covariates) since this is the most abundant model size contained in the

model space

Variable selection within the model space of well-formulated polynomial models

poses two challenges for automatic objective model selection procedures First the

notion of model complexity takes on a new dimension Complexity is not exclusively

a function of the number of predictors but also depends upon the depth and

connectedness of the associations defined by the polynomial hierarchy Second

because the model space is shaped by such relationships stochastic search algorithms

used to explore the models must also conform to these restrictions

Models without polynomial hierarchy constitute a special case of WFMs where

all predictors are of order one Hence all the methods developed throughout this

Chapter also apply to models with no predictor structure Additionally although our

proposed methods are presented for the normal linear case to simplify the exposition

these methods are general enough to be embedded in many Bayesian selection

and averaging procedures including of course the occupancy framework previously

discussed

In this Chapter first we provide the necessary definitions to characterize the

well-formulated model selection problem Then we proceed to introduce three new prior

structures on the well-formulated model space and characterize their behavior with

simple examples and simulations With the model priors in place we build a stochastic

search algorithm to explore spaces of well-formulated models that relies on intrinsic

priors for the model specific parameters mdash though this assumption can be relaxed

to use other mixtures of g-priors Finally we implement our procedures using both

simulated and real data

87

42 Setup for Well-Formulated Models

Suppose that the observations yi are modeled using the polynomial regression of

the covariates xi 1 xi p given by

yi =sum

β(α1αp)

pprodj=1

xαji j + ϵi (4ndash2)

where α = (α1 αp) belongs to Np0 the p-dimensional space of natural numbers

including 0 with ϵiiidsim N(0σ2) and only finitely many βα are allowed to be non-zero

As an illustration consider a model space that includes polynomial terms incorporating

covariates xi 1 and xi 2 only The terms x2i 2 and x2i 1xi 2 can be represented by α = (0 2)

and α = (2 1) respectively

The notation y = Z(X)β + ϵ is used to denote that observed response y =

(y1 yn) is modeled via a polynomial function Z of the original covariates contained

in X = (x1 xp) (where xj = (x1j xnj)prime) and the coefficients of the polynomial

terms are given by β A specific polynomial model M is defined by the set of coefficients

βα that are allowed to be non-zero This definition is equivalent to characterizing M

through a collection of multi-indices α isin Np0 In particular model M is specified by

M = αM1 αM|M| for αMk isin Np0 where βα = 0 for α isin M

Any particular model M uses a subset XM of the original covariates X to form the

polynomial terms in the design matrix ZM(X) Without ambiguity a polynomial model

ZM(X) on X can be identified with a polynomial model ZM(XM) on the covariates XM

The number of terms used by M to model the response y denoted by |M| corresponds

to the number of columns of ZM(XM) The coefficient vector and error variance of

the model M are denoted by βM and σ2M respectively Thus M models the data as

y = ZM(XM)βM + ϵM where ϵM sim N(0 Iσ2M

) Model M is said to be nested in model M prime

if M sub M prime M models the response of the covariates in two distinct ways choosing the

set of meaningful covariates XM as well as choosing the polynomial structure of these

covariates ZM(XM)

88

The set Np0 constitutes a partially ordered set or more succinctly a poset A poset

is a set partially ordered through a binary relation ldquo≼rdquo In this context the binary relation

on the poset Np0 is defined between pairs (ααprime) by αprime ≼ α whenever αj ge αprime

j for all

j = 1 prime with αprime ≺ α if additionally αj gt αprimej for some j The order of a term α isin Np

0

is given by the sum of its elements order(α) =sumαj When order(α) = order(αprime) + 1

and αprime ≺ α then αprime is said to immediately precede α which is denoted by αprime rarr α

The parent set of α is defined by P(α) = αprime isin Np0 αprime rarr α and is given by the

set of nodes that immediately precede the given node A polynomial model M is said to

be well-formulated if α isin M implies that P(α) sub M For example any well-formulated

model using x2i 1xi 2 to model yi must also include the parent terms xi 1xi 2 and x2i 1 their

corresponding parent terms xi 1 and xi 2 and the intercept term 1

The poset Np0 can be represented by a Directed Acyclic Graph (DAG) denoted

by (Np0) Without ambiguity we can identify nodes in the graph α isin Np

0 with terms in

the set of covariates The graph has directed edges to a node from its parents Any

well-formulated model M is represented by a subgraph (M) of (Np0) with the property

that if node α isin (M) then the nodes corresponding to P(α) are also in (M) Figure

4-1 shows examples of well-formulated polynomial models where α isin Np0 is identified

withprodp

j=1 xαjj

The motivation for considering only well-formulated polynomial models is

compelling Let ZM be the design matrix associated with a polynomial model The

subspace of y modeled by ZM given by the hat matrix HM = ZM(ZprimeMZM)

minus1ZprimeM is

invariant to affine transformations of the matrix XM if and only if M corresponds to a

well-formulated polynomial model (Peixoto 1990)

89

A B

Figure 4-1 Graphs of well-formulated polynomial models for p = 2

For example if p = 2 and yi = β(00) + β(10)xi 1 + β(01)xi 2 + β(11)xi 1xi 2 + ϵi then

the hat matrix is invariant to any covariate transformation of the form A(xi 1xi 2

)+ b for any

real-valued positive definite 2 times 2 matrix A and any real-valued vector of dimension two

b In contrast if yi = β(00) + β(20)x2i 1 + ϵi then the hat matrix formed after applying the

transformation xi 1 7rarr xi 1 + c for real c = 0 is not the same as the hat matrix formed by

the original xi 1421 Well-Formulated Model Spaces

The spaces of WFMs M considered in this paper can be characterized in terms

of two WFMs MB the base model and MF the full model The base model contains at

least the intercept term and is nested in the full model The model space M is populated

by all well formulated models M that nest MB and are nested in MF

M = M MB sube M sube MF and M is well-formulated

For M to be well-formulated the entire ancestry of each node in M must also be

included in M Because of this M isin M can be uniquely identified by two different sets

of nodes in MF the set of extreme nodes and the set of children nodes For M isin M

90

the sets of extreme and children nodes respectively denoted by E(M) and C(M) are

defined by

E(M) = α isin M MB α isin P(αprime) forall αprime isin M

C(M) = α isin MF M α cupM is well-formulated

The extreme nodes are those nodes that when removed from M give rise to a WFM in

M The children nodes are those nodes that when added to M give rise to a WFM in

M Because MB sube M for all M isin M the set of nodes E(M)cupMB determine M by

beginning with this set and iteratively adding parent nodes Similarly the nodes in C(M)

determine the set αprime isin P(α) α isin C(M)cupαprime isin E(MF ) α ≼ αprime for all α isin C(M)

which contains E(M)cupMB and thus uniquely identifies M

1

x1

x2

x21

x1x2

x22

A Extreme node set

1

x1

x2

x21

x1x2

x22

B Children node set

Figure 4-2

In Figure 4-2 the extreme and children sets for model M = 1 x1 x21 are shown for

the model space characterized by MF = 1 x1 x2 x21 x1x2 x22 In Figure 4-2A the solid

nodes represent nodes α isin M E(M) the dashed node corresponds to α isin E(M) and

the dotted nodes are not in M Solid nodes in Figure 4-2B correspond to those in M

The dashed node is the single node in C(M) and the dotted nodes are not in M cup C(M)43 Priors on the Model Space

As discussed in Scott amp Berger (2010) the Ockhamrsquos-razor effect found

automatically in Bayesian variable selection through the Bayes factor does not correct

91

for multiple testing This penalization acts against more complex models but does not

account for the collection of models in the model space which describes the multiplicity

of the testing problem This is where the role of the prior on the model space becomes

important As Scott amp Berger explain the multiplicity penalty is ldquohidden awayrdquo in the

model prior probabilities π(M|M)

In what follows we propose three different prior structures on the model space

for WFMs discuss their advantages and disadvantages and describe reasonable

choices for their hyper-parameters In addition we investigate how the choice of

prior structure and hyper-parameter combinations affect the posterior probabilities for

predictor inclusion providing some recommendations for different situations431 Model Prior Definition

The graphical structure for the model spaces suggests a method for prior

construction on M guided by the notion of inheritance A node α is said to inherit from

a node αprime if there is a directed path from αprime to α in the graph (MF ) The inheritance

is said to be immediate if order(α) = order(αprime) + 1 (equivalently if αprime isin P(α) or if αprime

immediately precedes α)

For convenience define (M) = M MB to be the set of nodes in M that are not

in the base model MB For α isin (MF ) let γα(M) be the indicator function describing

whether α is included in M ie γα(M) = I(αisinM) Denote by γν(M) the set of indicators

of inclusion in M for all order ν nodes in (MF ) Finally let γltν(M) =cupνminus1

j=0 γ j(M)

the set of indicators for inclusion in M for all nodes in (MF ) of order less than ν With

these definitions the prior probability of any model M isin M can be factored as

π(M|M) =

JmaxMprod

j=JminM

π(γ j(M)|γltj(M)M) (4ndash3)

where JminM and Jmax

M are respectively the minimum and maximum order of nodes in

(MF ) and π(γJminM (M)|γltJmin

M (M)M) = π(γJminM (M)|M)

92

Prior distributions on M can be simplified by making two assumptions. First, if order(α) = order(α′) = j, then γα and γα′ are assumed to be conditionally independent when conditioned on γ<j, denoted by γα ⊥⊥ γα′ | γ<j. Second, immediate inheritance is invoked, and it is assumed that if order(α) = j, then γα(M) | γ<j(M) = γα(M) | γP(α)(M), where γP(α)(M) is the inclusion indicator for the set of parent nodes of α. This indicator is one if the complete parent set of α is contained in M and zero otherwise.

In Figure 4-3 these two assumptions are depicted with MF being an order-two surface in two main effects. The conditional independence assumption (Figure 4-3A) implies that the inclusion indicators for x1², x2², and x1x2 are independent when conditioned on all the lower-order terms. In this same space, immediate inheritance implies that the inclusion of x1², conditioned on the inclusion of all lower-order nodes, is equivalent to conditioning it on its parent set (x1 in this case).

Figure 4-3. The two assumptions used to simplify priors on M, illustrated for MF an order-two surface in two main effects. A) Conditional independence: x1² ⊥⊥ x1x2 ⊥⊥ x2² given {1, x1, x2}. B) Immediate inheritance: x1² conditioned on {1, x1, x2} is equivalent to x1² conditioned on x1.

Denote the conditional inclusion probability of node α in model M by πα = π(γα(M) = 1 | γP(α)(M), M). Under the assumptions of conditional independence and immediate inheritance, the prior probability of M is

π(M | πM, M) = ∏_{α ∈ MF \ MB} πα^{γα(M)} (1 − πα)^{1−γα(M)},   (4–4)

with πM = {πα : α ∈ MF \ MB}. Because M must be well-formulated, πα = γα = 0 if γP(α)(M) = 0. Thus, the product in 4–4 can be restricted to the set of nodes α ∈ (M \ MB) ∪ C(M). Additional structure can be built into the prior on M by making assumptions about the inclusion probabilities πα, such as equality assumptions or the assumption of a hyper-prior for these parameters. Three such prior classes are developed next, first by assigning hyper-priors on πM assuming some structure among its elements, and then marginalizing out πM.

Hierarchical Uniform Prior (HUP). The HUP assumes that the non-zero πα are all equal. Specifically, for a model M ∈ M it is assumed that πα = π for all α ∈ (M \ MB) ∪ C(M). The Bayesian specification of the HUP is completed by assuming a prior distribution for π. The choice π ∼ Beta(a, b) produces

πHUP(M | M, a, b) = B(|M \ MB| + a, |C(M)| + b) / B(a, b),   (4–5)

where B is the beta function. Setting a = b = 1 gives the particular value

πHUP(M | M, a = 1, b = 1) = 1/(|M \ MB| + |C(M)| + 1) × ( |M \ MB| + |C(M)| choose |M \ MB| )⁻¹.   (4–6)

The HUP assigns equal probabilities to all models for which the sets of nodes M \ MB and C(M) have the same cardinality. This prior provides a combinatorial penalization but essentially fails to account for the hierarchical structure of the model space. An additional penalization for model complexity can be incorporated into the HUP by changing the values of a and b. Because πα = π for all α, this penalization can only depend on some aspect of the entire graph of MF, such as the total number of nodes not in the null model, |MF \ MB|.
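As a concrete illustration, the following is a minimal sketch (in R) of how the HUP probability in Equations 4–5 and 4–6 can be evaluated; the function name and the convention of passing the two cardinalities directly are ours and are not taken from the original software.

# Minimal sketch: HUP prior probability (Equations 4-5 and 4-6)
#   n_M : |M \ MB|, the number of non-base nodes in M
#   n_C : |C(M)|, the number of children of M
hup_prior <- function(n_M, n_C, a = 1, b = 1) {
  beta(n_M + a, n_C + b) / beta(a, b)
}

# For the quadratic surface of Figure 4-4, the model {1, x1} has
# |M \ MB| = 1 and |C(M)| = 2 (x2 and x1^2 can be added):
hup_prior(1, 2)          # 1/12, the HUP(1,1) entry of Figure 4-4
hup_prior(1, 2, b = 5)   # with b = ch = |MF \ MB| = 5, this gives 5/56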


Hierarchical Independence Prior (HIP). The HIP assumes that there are no equality constraints among the non-zero πα. Each non-zero πα is given its own prior, which is assumed to be a Beta distribution with parameters aα and bα. Thus, the prior probability of M under the HIP is

πHIP(M | M, a, b) = ∏_{α ∈ M \ MB} aα/(aα + bα) × ∏_{α ∈ C(M)} bα/(aα + bα),   (4–7)

where a product over the empty set is taken to be 1. Because the πα are totally independent, any choice of aα and bα is equivalent to choosing a probability of success πα for a given α. Setting aα = bα = 1 for all α ∈ (M \ MB) ∪ C(M) gives the particular value

πHIP(M | M, a = 1, b = 1) = (1/2)^{|M \ MB| + |C(M)|}.   (4–8)

Although the prior with this choice of hyper-parameters accounts for the hierarchical structure of the model space, it essentially provides no penalization for combinatorial complexity at different levels of the hierarchy. This can be observed by considering a model space with main effects only: the exponent in 4–8 is the same for every model in the space, because each node is either in the model or in the children set.

Additional penalizations for model complexity can be incorporated into the HIP. Because each γj is conditioned on γ<j in the prior construction, the aα and bα for α of order j can be conditioned on γ<j. One such additional penalization utilizes the number of nodes of order j that could be added to produce a WFM conditioned on the inclusion vector γ<j, which is denoted chj(γ<j). Choosing aα = 1 and bα(M) = chj(γ<j) is equivalent to choosing a prior probability of success πα = 1/(1 + chj(γ<j)). This penalization can drive down the false positive rate when chj(γ<j) is large, but may produce more false negatives.
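Again as an illustration, a minimal sketch (in R) of the HIP probability in Equation 4–7 follows; the per-node hyper-parameters are passed as vectors and the function name is ours.

# Minimal sketch: HIP prior probability (Equation 4-7)
#   a_in, b_in  : Beta hyper-parameters for the nodes in M \ MB
#   a_out, b_out: Beta hyper-parameters for the nodes in C(M)
hip_prior <- function(a_in, b_in, a_out, b_out) {
  prod(a_in / (a_in + b_in)) * prod(b_out / (a_out + b_out))
}

# Default choice a = b = 1: for M = {1, x1} in the space of Figure 4-4,
# M \ MB = {x1} and C(M) = {x2, x1^2}, so the prior is (1/2)^3 = 1/8.
hip_prior(a_in = 1, b_in = 1, a_out = c(1, 1), b_out = c(1, 1))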

Hierarchical Order Prior (HOP). A compromise between complete equality and complete independence of the πα is to assume equality between the πα of a given order and independence across the different orders. Define Mj(M) = {α ∈ M \ MB : order(α) = j} and Cj(M) = {α ∈ C(M) : order(α) = j}. The HOP assumes that πα = πj for all α ∈ Mj(M) ∪ Cj(M). Assuming that πj ∼ Beta(aj, bj) provides the prior probability

πHOP(M | M, a, b) = ∏_{j=Jmin,M}^{Jmax,M} B(|Mj(M)| + aj, |Cj(M)| + bj) / B(aj, bj).   (4–9)

The specific choice aj = bj = 1 for all j gives

πHOP(M | M, a = 1, b = 1) = ∏_j [ 1/(|Mj(M)| + |Cj(M)| + 1) × ( |Mj(M)| + |Cj(M)| choose |Mj(M)| )⁻¹ ],   (4–10)

and produces a hierarchical version of the Scott and Berger multiplicity correction.

The HOP arises from a conditional exchangeability assumption on the indicator variables. Conditioned on γ<j(M), the indicators {γα : α ∈ Mj(M) ∪ Cj(M)} are assumed to be exchangeable Bernoulli random variables. By de Finetti's theorem, these arise from independent Bernoulli random variables with common probability of success πj, with πj itself following a prior distribution. Our construction of the HOP assumes that this prior is a beta distribution. Additional complexity penalizations can be incorporated into the HOP in a fashion similar to the HIP. The number of possible nodes of order j that could be added while maintaining a WFM, conditioned on the lower-order nodes of M, is given by chj(M) = chj(γ<j(M)) = |Mj(M) ∪ Cj(M)|. Using aj = 1 and bj(M) = chj(M) produces a prior with two desirable properties. First, if M′ ⊂ M, then π(M) ≤ π(M′). Second, for each order j, the conditional probability of including k nodes is greater than or equal to that of including k + 1 nodes, for k = 0, 1, ..., chj(M) − 1.
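For completeness, a minimal sketch (in R) of the HOP probability in Equation 4–9 is given below; the per-order counts are passed directly and the function name is ours.

# Minimal sketch: HOP prior probability (Equation 4-9)
#   n_M and n_C give, for each order j (lowest to highest), the counts
#   |Mj(M)| and |Cj(M)|; a and b are the per-order hyper-parameters.
hop_prior <- function(n_M, n_C, a = rep(1, length(n_M)), b = rep(1, length(n_M))) {
  prod(beta(n_M + a, n_C + b) / beta(a, b))
}

# Quadratic surface in x1, x2 (Figure 4-4): the full model has two
# order-1 nodes, three order-2 nodes, and no children, so
hop_prior(n_M = c(2, 3), n_C = c(0, 0))   # 1/12, as in Figure 4-4
# The model {1, x1} has |M1| = 1, |C1| = 1, |M2| = 0, |C2| = 1:
hop_prior(n_M = c(1, 0), n_C = c(1, 1))   # also 1/12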

4.3.2 Choice of Prior Structure and Hyper-Parameters

Each of the priors introduced in Section 4.3.1 defines a whole family of model priors, characterized by the probability distribution assumed for the inclusion probabilities πM. For the sake of simplicity, this chapter focuses on those arising from Beta distributions and concentrates on particular choices of hyper-parameters that can be specified automatically. First, we describe some general features of how each of the three prior structures (HUP, HIP, HOP) allocates mass to the models in the model space. Second, as there is an infinite number of ways in which the hyper-parameters can be specified, focus is placed on the default choice a = b = 1 as well as on the complexity penalizations described in Section 4.3.1. The second alternative is referred to as a = 1, b = ch, where b = ch has a slightly different interpretation depending on the prior structure. Accordingly, b = ch is given by bj(M) = bα(M) = chj(M) = |Mj(M) ∪ Cj(M)| for the HOP and HIP, where j = order(α), while b = ch denotes b = |MF \ MB| for the HUP. The prior behavior is illustrated for two model spaces; in both cases the base model MB is taken to be the intercept-only model and MF is the DAG shown (Figures 4-4 and 4-5). The priors considered treat model complexity differently, and some general properties can be seen in these examples.

    Model                              HIP              HOP              HUP
                                  (1,1)   (1,ch)   (1,1)   (1,ch)   (1,1)   (1,ch)
 1  {1}                            1/4     4/9      1/3     1/2      1/3     5/7
 2  {1, x1}                        1/8     1/9      1/12    1/12     1/12    5/56
 3  {1, x2}                        1/8     1/9      1/12    1/12     1/12    5/56
 4  {1, x1, x1²}                   1/8     1/9      1/12    1/12     1/12    5/168
 5  {1, x2, x2²}                   1/8     1/9      1/12    1/12     1/12    5/168
 6  {1, x1, x2}                    1/32    3/64     1/12    1/12     1/60    1/72
 7  {1, x1, x2, x1²}               1/32    1/64     1/36    1/60     1/60    1/168
 8  {1, x1, x2, x1x2}              1/32    1/64     1/36    1/60     1/60    1/168
 9  {1, x1, x2, x2²}               1/32    1/64     1/36    1/60     1/60    1/168
10  {1, x1, x2, x1², x1x2}         1/32    1/192    1/36    1/120    1/30    1/252
11  {1, x1, x2, x1², x2²}          1/32    1/192    1/36    1/120    1/30    1/252
12  {1, x1, x2, x1x2, x2²}         1/32    1/192    1/36    1/120    1/30    1/252
13  {1, x1, x2, x1², x1x2, x2²}    1/32    1/576    1/12    1/120    1/6     1/252

Figure 4-4. Prior probabilities for the space of well-formulated models associated with the quadratic surface in two variables, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

First, contrast the choices of HIP, HUP, and HOP for (a, b) = (1, 1). The HIP induces a complexity penalization that only accounts for the order of the terms in the model. This is best exhibited by the model space in Figure 4-4: models including x1 and x2 (models 6 through 13) are given the same prior probability, and no penalization is incurred for the inclusion of any or all of the quadratic terms.

    Model                            HIP               HOP               HUP
                                (1,1)   (1,ch)    (1,1)   (1,ch)    (1,1)   (1,ch)
 1  {1}                          1/8     27/64     1/4     1/2       1/4     4/7
 2  {1, x1}                      1/8     9/64      1/12    1/10      1/12    2/21
 3  {1, x2}                      1/8     9/64      1/12    1/10      1/12    2/21
 4  {1, x3}                      1/8     9/64      1/12    1/10      1/12    2/21
 5  {1, x1, x3}                  1/8     3/64      1/12    1/20      1/12    4/105
 6  {1, x2, x3}                  1/8     3/64      1/12    1/20      1/12    4/105
 7  {1, x1, x2}                  1/16    3/128     1/24    1/40      1/30    1/42
 8  {1, x1, x2, x1x2}            1/16    3/128     1/24    1/40      1/20    1/70
 9  {1, x1, x2, x3}              1/16    1/128     1/8     1/40      1/20    1/70
10  {1, x1, x2, x3, x1x2}        1/16    1/128     1/8     1/40      1/5     1/70

Figure 4-5. Prior probabilities for the space of well-formulated models associated with three main effects and one interaction term, where MB is taken to be the intercept-only model and (a, b) ∈ {(1, 1), (1, ch)}.

In contrast to the HIP, the HUP induces a penalization for model complexity, but it does not adequately penalize models for including additional terms. Under the HUP, models including all of the terms are given at least as much probability as any model containing any non-empty set of terms (Figures 4-4 and 4-5). This lack of penalization of the full model originates from its combinatorial simplicity (i.e., this is the only model that contains every term), and as an unfortunate consequence this model space distribution favors the base and full models. Similar behavior is observed with the HOP with (a, b) = (1, 1). As models become more complex, they are appropriately penalized for their size. However, after a sufficient number of nodes are added, the number of possible models of that particular size is considerably reduced; thus, combinatorial complexity is negligible for the largest models. This is best exhibited in Figure 4-5, where the HOP places more mass on the full model than on any model containing a single order-one node, highlighting an undesirable behavior of the priors with this choice of hyper-parameters.

In contrast, if (a, b) = (1, ch), all three priors produce strong penalization as models become more complex, both in terms of the number and the order of the nodes contained in the model. For all of the priors, adding a node α to a model M to form M′ produces p(M) ≥ p(M′). However, differences between the priors are apparent. The HIP penalizes the full model the most, with the HOP penalizing it the least and the HUP lying between them. At face value, the HOP creates the most compelling penalization of model complexity. In Figure 4-5 the penalization of the HOP is the least dramatic, producing prior odds of 20 for MB versus MF, as opposed to the HUP and HIP, which produce prior odds of 40 and 54, respectively. Similarly, the prior odds in Figure 4-4 are 60, 180, and 256 for the HOP, HUP, and HIP, respectively.

4.3.3 Posterior Sensitivity to the Choice of Prior

To determine how the proposed priors adjust the posterior probabilities to account for multiplicity, a simple simulation was performed. The goal of this exercise was to understand how the priors respond to increasing complexity. First, the priors are compared as the number of main effects p grows. Second, they are compared as the depth of the hierarchy increases, or in other words, as the maximum order Jmax,M increases.

The quality of a node is characterized by its marginal posterior inclusion probability, defined as pα = Σ_{M ∈ M} I(α ∈ M) p(M | y, M) for α ∈ MF. These posteriors were obtained for the proposed priors as well as for the Equal Probability Prior (EPP) on M. For all prior structures, both the default hyper-parameters a = b = 1 and the penalizing choice of a = 1 and b = ch are considered. The results for the different combinations of MF and MT incorporated in the analysis were obtained from 100 random replications (i.e., generating at random 100 matrices of main effects and responses). The simulation proceeds as follows:

1. Randomly generate main-effects matrices X = (x1, ..., x18), with xi iid ∼ Nn(0, In), and error vectors ϵ ∼ Nn(0, In), for n = 60.

2. Setting all coefficient values equal to one, calculate y = ZMT β + ϵ for the true models given by
   MT1 = {x1, x2, x3, x1², x1x2, x2², x2x3}, with |MT1| = 7;
   MT2 = {x1, x2, ..., x16}, with |MT2| = 16;
   MT3 = {x1, x2, x3, x4}, with |MT3| = 4;
   MT4 = {x1, x2, ..., x8, x1², x3x4}, with |MT4| = 10;
   MT5 = {x1, x2, x3, x4, x1², x3x4}, with |MT5| = 6.


Table 4-1. Characterization of the full models MF and corresponding model spaces M considered in the simulations.

Growing p, fixed Jmax,M                         Fixed p, growing Jmax,M
MF               |MF|     |M|     MT used       MF               |MF|     |M|      MT used
(x1+x2+x3)²        9        95    MT1           (x1+x2+x3)²        9        95     MT1
(x1+...+x4)²      14      1337    MT1           (x1+x2+x3)³       19      2497     MT1
(x1+...+x5)²      20     38619    MT1           (x1+x2+x3)⁴       34    161421     MT1

Other model spaces
MF                                |MF|      |M|      MT used
x1 + x2 + ... + x18                18     262144     MT2, MT3
(x1+...+x4)² + x5 + ... + x10      20      85568     MT4, MT5

3. In all simulations, the base model MB is the intercept-only model. The notation (x1 + ... + xp)^d is used to represent the full order-d polynomial response surface in p main effects. The model spaces, characterized by their corresponding full model MF, are presented in Table 4-1, as well as the true models used in each case.

4. Enumerate the model spaces and calculate p(M | y, M) for all M ∈ M using the EPP, HUP, HIP, and HOP, the latter two each with the two sets of hyper-parameters.

5. Count the number of true positives and false positives in each M for the different priors.

The true positives (TP) are defined as those nodes α ∈ MT such that pα > 0.5. For the false positives (FP), three different cutoffs are considered for pα, elucidating the adjustment for multiplicity induced by the model priors; these cutoffs are 0.10, 0.20, and 0.50 for α ∉ MT. The results from this exercise provide insight into the influence of the prior on the marginal posterior inclusion probabilities. In Table 4-1 the model spaces considered are described in terms of the number of models they contain and in terms of the number of nodes of MF, the full model that defines the DAG for M.

Growing number of main effects, fixed polynomial degree. This simulation investigates the posterior behavior as the number of covariates grows for a polynomial surface of degree two. The true model is assumed to be MT1, which has 7 polynomial terms. The false positive and true positive rates are displayed in Table 4-2.

First, focus on the posterior when (a, b) = (1, 1). As p increases and the cutoff is low, the number of false positives increases for the EPP as well as for the hierarchical priors, although less dramatically for the latter. All of the priors identify all of the true positives. The false positive rate for the 0.50 cutoff is less than one for all four prior structures, with the HIP exhibiting the smallest false positive rate.

With the second choice of hyper-parameters, (1, ch), the improvement of the hierarchical priors over the EPP is dramatic, and the difference in performance becomes more pronounced as p increases. These priors also considerably outperform those using the default hyper-parameters a = b = 1 in terms of false positives. Regarding the number of true positives, all priors discovered the 7 true predictors in MT1 for most of the 100 random samples of data, with only minor differences observed between any of the priors considered. That being said, the means for the priors with a = 1, b = ch are slightly lower for the true positives. With the 0.50 cutoff, the hierarchical priors keep tight control on the number of false positives, but in doing so they discard true positives with slightly higher frequency.

Growing polynomial degree, fixed main effects. For these examples the true model is once again MT1. When the complexity is increased by making the order of MF larger (Table 4-3), the inability of the EPP to adjust the inclusion posteriors for multiplicity becomes more pronounced: the EPP becomes less and less efficient at removing false positives when the FP cutoff is low. Among the priors with a = b = 1, as the order increases, the HIP is the best at filtering out the false positives. Using the 0.50 false positive cutoff, some false positives are included both for the EPP and for all the priors with a = b = 1, indicating that the default hyper-parameters might not be the best option to control FP. The 7 covariates in the true model all obtain a high posterior inclusion probability both with the EPP and with the a = b = 1 priors.


Table 4-2. Mean number of false and true positives in 100 randomly generated datasets as the number of main effects increases from three to five predictors and MF is a full quadratic, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

Cutoff      |MT|  MF                      EPP    a=1, b=1              a=1, b=ch
                                                 HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)    7    (x1+x2+x3)²             1.78   1.78   2.00   2.00    0.11   1.31   1.06
FP(>0.20)                                 0.43   0.43   2.00   1.98    0.01   0.28   0.24
FP(>0.50)                                 0.04   0.04   0.97   0.36    0.00   0.03   0.02
TP(>0.50)         (MT1)                   7.00   7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)    7    (x1+x2+x3+x4)²          3.62   1.94   2.33   2.45    0.10   0.63   1.07
FP(>0.20)                                 1.60   0.47   2.17   2.15    0.01   0.17   0.24
FP(>0.50)                                 0.25   0.06   0.35   0.36    0.00   0.02   0.02
TP(>0.50)         (MT1)                   7.00   7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)    7    (x1+x2+x3+x4+x5)²       6.00   2.16   2.60   2.55    0.12   0.43   1.15
FP(>0.20)                                 2.91   0.55   2.13   2.18    0.02   0.19   0.27
FP(>0.50)                                 0.66   0.11   0.25   0.37    0.00   0.03   0.01
TP(>0.50)         (MT1)                   7.00   7.00   7.00   7.00    6.97   6.99   6.99

In contrast, any of the a = 1 and b = ch priors dramatically improve upon their a = b = 1 counterparts, consistently assigning low inclusion probabilities to the majority of the false positive terms, even for low cutoffs. As the order of the polynomial surface increases, the difference in performance between these priors and either the EPP or their default versions becomes even clearer. At the 0.50 cutoff, the hierarchical priors with complexity penalization exhibit very low false positive rates. The true positive rate decreases slightly for these priors, but not to an alarming degree.

Other model spaces. This part of the analysis considers model spaces that do not correspond to full polynomial-degree response surfaces (Table 4-4). The first example is a model space with main effects only. The second example includes a full quadratic surface of order 2, but in addition includes six terms for which only main effects are to be modeled. Two true models are used in combination with each model space to observe how the posterior probabilities vary under the influence of the different priors for "large" and "small" true models.

Table 4-3. Mean number of false and true positives in 100 randomly generated datasets as the maximum order of MF increases from two to four in a full model with three main effects, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

Cutoff      |MT|  MF               EPP    a=1, b=1              a=1, b=ch
                                          HIP    HUP    HOP     HIP    HUP    HOP
FP(>0.10)    7    (x1+x2+x3)²      1.78   1.78   2.00   2.00    0.11   1.31   1.06
FP(>0.20)                          0.43   0.43   2.00   1.98    0.01   0.28   0.24
FP(>0.50)                          0.04   0.04   0.97   0.36    0.00   0.03   0.02
TP(>0.50)         (MT1)            7.00   7.00   7.00   7.00    6.97   6.99   6.99
FP(>0.10)    7    (x1+x2+x3)³      7.37   5.21   6.06   2.91    0.55   1.05   1.39
FP(>0.20)                          2.91   1.55   3.61   2.08    0.17   0.34   0.31
FP(>0.50)                          0.40   0.21   0.50   0.26    0.03   0.03   0.04
TP(>0.50)         (MT1)            7.00   7.00   7.00   7.00    6.97   6.98   7.00
FP(>0.10)    7    (x1+x2+x3)⁴      8.22   4.00   4.69   2.61    0.52   0.55   1.32
FP(>0.20)                          4.21   1.13   1.76   2.03    0.12   0.15   0.31
FP(>0.50)                          0.56   0.17   0.22   0.27    0.03   0.03   0.04
TP(>0.50)         (MT1)            7.00   7.00   7.00   7.00    6.97   6.97   6.99

By construction, in model spaces with main effects only, HIP(1,1) and EPP are equivalent, as are HOP(a,b) and HUP(a,b). This accounts for the similarities observed among the results for the first two cases presented in Table 4-4, where the model space corresponds to a full model with 18 main effects and the true models are models with 16 and 4 main effects, respectively. When the number of true coefficients is large, the HUP(1,1) and HOP(1,1) do poorly at controlling false positives, even at the 0.50 cutoff. In contrast, the HIP (and thus the EPP) with the 0.50 cutoff identifies the true positives and no false positives. This result, however, does not imply that the EPP controls false positives well: the true model contains 16 out of the 18 nodes in MF, so there is little potential for false positives. The a = 1 and b = ch priors show dramatically different behavior. The HIP controls false positives well but fails to identify the true coefficients at the 0.50 cutoff. In contrast, the HOP identifies all of the true positives and has a small false positive rate at the 0.50 cutoff.

If the number of true positives is small, most terms in the full model are truly zero. The EPP includes at least one false positive in approximately 50% of the randomly sampled datasets. On the other hand, the HUP(1,1) provides some control for multiplicity, obtaining on average a lower number of false positives than the EPP. Furthermore, the proposed hierarchical priors with a = 1, b = ch are substantially better than the EPP (and the choice a = b = 1) at controlling false positives and capturing all true positives using the marginal posterior inclusion probabilities. The two examples suggest that the HOP(1, ch) is the best default choice for model selection when the number of terms available at a given degree is large.

The third and fourth examples in Table 4-4 consider the same irregular model space, with data generated from MT4 with ten terms and MT5 with six terms. HIP(1,1) and EPP again behave quite similarly, incorporating a large number of false positives at the 0.10 cutoff. At the 0.50 cutoff some false positives are still included. The HUP(1,1) and HOP(1,1) behave similarly, with a slightly higher false positive rate at the 0.50 cutoff. In terms of the true positives, the EPP and a = b = 1 priors always include all of the predictors in MT4 and MT5. On the other hand, the ability of the a = 1, b = ch priors to control for false positives is markedly better than that of the EPP and the hierarchical priors with the choice a = b = 1. At the 0.50 cutoff these priors identify all of the true positives and true negatives. Once again, these examples point to the hierarchical priors with additional penalization for complexity as being good default priors on the model space.

4.4 Random Walks on the Model Space

When the model space M is too large to enumerate, a stochastic procedure can be used to find models with high posterior probability. In particular, an MCMC algorithm can be utilized to generate a dependent sample of models from the model posterior. The structure of the model space M both presents difficulties and provides clues on how to build algorithms to explore it. Different MCMC strategies can be adopted, two of which

Table 4-4. Mean number of false and true positives in 100 randomly generated datasets with unstructured or irregular model spaces, under the equal probability prior (EPP), the hierarchical independence prior (HIP), the hierarchical order prior (HOP), and the hierarchical uniform prior (HUP).

Cutoff      |MT|  MF                       EPP     a=1, b=1                a=1, b=ch
                                                   HIP     HUP     HOP     HIP    HUP     HOP
FP(>0.10)   16    x1 + x2 + ... + x18       1.93    1.93    2.00    2.00    0.03   1.80    1.80
FP(>0.20)                                   0.52    0.52    2.00    2.00    0.01   0.46    0.46
FP(>0.50)                                   0.07    0.07    2.00    2.00    0.01   0.04    0.04
TP(>0.50)         (MT2)                    15.99   15.99   16.00   16.00    6.99  15.99   15.99
FP(>0.10)    4    x1 + x2 + ... + x18      13.95   13.95    9.15    9.15    0.26   1.31    1.31
FP(>0.20)                                   5.45    5.45    3.03    3.03    0.05   0.45    0.45
FP(>0.50)                                   0.84    0.84    0.45    0.45    0.02   0.06    0.06
TP(>0.50)         (MT3)                     4.00    4.00    4.00    4.00    4.00   4.00    4.00
FP(>0.10)   10    (x1+...+x4)² +            9.73    9.71   10.00    5.60    0.34   2.33    2.20
FP(>0.20)         x5 + ... + x10            2.65    2.65    8.73    3.05    0.12   0.74    0.69
FP(>0.50)                                   0.35    0.35    1.36    1.68    0.02   0.11    0.12
TP(>0.50)         (MT4)                    10.00   10.00   10.00    9.99    9.94   9.98    9.99
FP(>0.10)    6    (x1+...+x4)² +           13.52   13.52   11.06    9.94    0.44   1.63    1.96
FP(>0.20)         x5 + ... + x10            4.22    4.21    3.60    5.01    0.15   0.48    0.68
FP(>0.50)                                   0.53    0.53    0.57    0.75    0.01   0.08    0.11
TP(>0.50)         (MT5)                     6.00    6.00    6.00    6.00    5.99   5.99    5.99

are outlined in this section. Combining the different strategies allows the model selection algorithm to explore the model space thoroughly and relatively fast.

4.4.1 Simple Pruning and Growing

This first strategy relies on small, localized jumps around the model space, turning a single node on or off at each step. The idea behind this algorithm is to grow the model by activating one node in the children set, or to prune the model by removing one node in the extreme set. At a given step of the algorithm, assume that the current state of the chain is model M, and let pG be the probability that the algorithm chooses the growth step. The proposed model M′ can either be M+ = M ∪ {α} for some α ∈ C(M), or M− = M \ {α} for some α ∈ E(M).

An example transition kernel is defined by the mixture

g(M′ | M) = pG · qGrow(M′ | M) + (1 − pG) · qPrune(M′ | M)
          = [I(M ≠ MF) / (1 + I(M ≠ MB))] · I(α ∈ C(M)) / |C(M)| + [I(M ≠ MB) / (1 + I(M ≠ MF))] · I(α ∈ E(M)) / |E(M)|,   (4–11)

where pG has explicitly been defined as 0.5 when both C(M) and E(M) are non-empty, and as 0 (or 1) when C(M) = ∅ (or E(M) = ∅). After choosing pruning or growing, a single node is proposed for addition to or deletion from M uniformly at random.
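The following is a minimal sketch (in R) of one such grow/prune Metropolis-Hastings step. It assumes that models are represented as character vectors of non-base node labels, that the model space is encoded by a parent map, and that log_post is a placeholder for log[m(y | M, M) π(M | M)] under whichever parameter and model-space priors are in use; the function and variable names are illustrative and are not taken from the original software.

# Children and extreme sets of a well-formulated model M
children_set <- function(M, parents) {
  cand <- setdiff(names(parents), M)
  cand[vapply(cand, function(a) all(parents[[a]] %in% M), logical(1))]
}
extreme_set <- function(M, parents) {
  # nodes of M that are not the parent of any other node in M
  M[vapply(M, function(a)
    !any(vapply(M, function(b) a %in% parents[[b]], logical(1))), logical(1))]
}
# Probability of choosing the growth step, as defined below Equation 4-11
p_grow <- function(M, parents) {
  nC <- length(children_set(M, parents)); nE <- length(extreme_set(M, parents))
  if (nC == 0) 0 else if (nE == 0) 1 else 0.5
}
grow_prune_step <- function(M, parents, log_post) {
  pG <- p_grow(M, parents)
  if (runif(1) < pG) {                       # grow: add one child node
    C_M   <- children_set(M, parents)
    alpha <- sample(C_M, 1)
    M_new <- c(M, alpha)
    log_q_fwd <- log(pG) - log(length(C_M))
    log_q_rev <- log(1 - p_grow(M_new, parents)) -
                 log(length(extreme_set(M_new, parents)))
  } else {                                   # prune: drop one extreme node
    E_M   <- extreme_set(M, parents)
    alpha <- sample(E_M, 1)
    M_new <- setdiff(M, alpha)
    log_q_fwd <- log(1 - pG) - log(length(E_M))
    log_q_rev <- log(p_grow(M_new, parents)) -
                 log(length(children_set(M_new, parents)))
  }
  log_acc <- log_post(M_new) - log_post(M) + log_q_rev - log_q_fwd
  if (log(runif(1)) < log_acc) M_new else M
}

For the model space of Figure 4-2, for instance, the parent map could be written as parents <- list(x1 = character(0), x2 = character(0), "x1^2" = "x1", "x1:x2" = c("x1", "x2"), "x2^2" = "x2"), with the base (intercept-only) model represented by character(0).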

For this simple algorithm, pruning has the reverse kernel of growing, and vice versa. From this construction, more elaborate algorithms can be specified. First, instead of choosing the node uniformly at random from the corresponding set, nodes can be selected using the relative posterior probability of adding or removing the node. Second, more than one node can be selected at any step, for instance by also sampling at random the number of nodes to add or remove given the size of the set. Third, the strategy could combine pruning and growing in a single step by sampling one node α ∈ C(M) ∪ E(M) and adding or removing it accordingly. Fourth, the sets of nodes from C(M) ∪ E(M) that yield well-formulated models can be added or removed. This simple algorithm produces small moves around the model space by focusing node addition or removal only on the set C(M) ∪ E(M).

4.4.2 Degree-Based Pruning and Growing

In exploring the model space, it is possible to take advantage of the hierarchical structure defined between nodes of different order: the vector of inclusion indicators can be updated in blocks, one order at a time. Two flavors of this algorithm are proposed: one that separates the pruning and growing steps, and one where both are done simultaneously.

Assume that at a given step, say t, the algorithm is at M. If growing, the strategy proceeds successively by order class, going from j = Jmin up to j = Jmax, with Jmin and Jmax being the lowest and highest orders of nodes in MF \ MB, respectively. Define Mt(Jmin−1) = M and set j = Jmin. The growth kernel comprises the following steps, proceeding from j = Jmin to j = Jmax:

1) Propose a model M′ by selecting a set of nodes from Cj(Mt(j−1)) through the kernel qGrow,j(· | Mt(j−1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j−1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j−1).

3) If j < Jmax, then set j = j + 1 and return to 1); otherwise proceed to 4).

4) Set Mt = Mt(Jmax).

The pruning step is defined in a similar fashion; however, it starts at order j = Jmax and proceeds down to j = Jmin. Let Ej(M′) = {α ∈ E(M′) : order(α) = j} be the set of nodes of order j that can be removed from the model M′ to produce a WFM. Define Mt(Jmax+1) = M and set j = Jmax. The pruning kernel comprises the following steps:

1) Propose a model M′ by selecting a set of nodes from Ej(Mt(j+1)) through the kernel qPrune,j(· | Mt(j+1)).

2) Compute the Metropolis-Hastings correction for M′ versus Mt(j+1). If M′ is accepted, then set Mt(j) = M′; otherwise set Mt(j) = Mt(j+1).

3) If j > Jmin, then set j = j − 1 and return to Step 1); otherwise proceed to Step 4).

4) Set Mt = Mt(Jmin).

It is clear that the growing and pruning steps are reverse kernels of each other. Pruning and growing can also be combined for each j. The forward kernel proceeds from j = Jmin to j = Jmax and proposes adding sets of nodes from Cj(M) ∪ Ej(M); the reverse kernel simply reverses the direction of j, proceeding from j = Jmax to j = Jmin.

4.5 Simulation Study

To study the operating characteristics of the proposed priors, a simulation experiment was designed with three goals. First, the priors are characterized by how the posterior distributions are affected by the sample size and the signal-to-noise ratio (SNR). Second, given the SNR level, the influence of the allocation of the signal across the terms in the model is investigated. Third, performance is assessed when the true model has special points in the scale (McCullagh & Nelder 1989), i.e., when the true model has coefficients equal to zero for some lower-order terms in the polynomial hierarchy.

With these goals in mind, sets of predictors and responses are generated under various experimental conditions. The model space is defined with MB being the intercept-only model and MF being the complete order-four polynomial surface in five main effects, which has 126 nodes. The entries of the matrix of main effects are generated as independent standard normals. The response vectors are drawn from the n-variate normal distribution as y ∼ Nn(ZMT(X) βMT, In), where MT is the true model and In is the n × n identity matrix.

The sample sizes considered are n ∈ {130, 260, 1040}, which ensures that ZMF(X) is of full rank. The cardinality of this model space is |M| > 1.2 × 10²², which makes enumeration of all models unfeasible. Because the value of the 2k-th moment of the standard normal distribution increases with k = 1, 2, ..., higher-order terms by construction have a larger variance than their ancestors. As such, assuming equal values for all coefficients, higher-order terms necessarily contain more "signal" than the lower-order terms from which they inherit (e.g., x1² has more signal than x1). Once a higher-order term is selected, its entire ancestry is also included. Therefore, to prevent the simulation results from being overly optimistic (because of the larger signals from the higher-order terms), sphering is used to calculate meaningful values of the coefficients, ensuring that the signal is of the magnitude intended in any given direction. Given the results of the simulations from Section 4.3.3, only the HOP with a = 1, b = ch is considered, with the EPP included for comparison.
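The following is a minimal sketch (in R) of this data-generating scheme; the details of the sphering construction are given in Appendix C. The function and variable names are illustrative (not taken from the original code): Z_true is the design matrix of the true model without the intercept column and eta0 is the chosen signal-allocation pattern.

simulate_wfm_data <- function(Z_true, eta0, snr, sigma2 = 1) {
  R       <- qr.R(qr(Z_true))                        # Z = QR
  Sigma_z <- stats::cov(Z_true)                      # variance of a design row
  A       <- t(solve(R)) %*% Sigma_z %*% solve(R)    # R^{-T} Sigma_z R^{-1}
  scale   <- sqrt(snr * sigma2 / drop(t(eta0) %*% A %*% eta0))
  beta    <- solve(R, scale * eta0)                  # beta = R^{-1} (scale * eta0)
  y       <- drop(Z_true %*% beta) + rnorm(nrow(Z_true), sd = sqrt(sigma2))
  list(y = y, beta = beta)
}

# Example: n = 130, a small true model {x1, x2, x1^2}, equal signal, SNR = 1
n  <- 130
x1 <- rnorm(n); x2 <- rnorm(n)
dat <- simulate_wfm_data(cbind(x1, x2, x1^2), eta0 = rep(1, 3), snr = 1)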

The total number of combinations of SNR, sample size, regression coefficient values, and nodes in MT amounts to 108 different scenarios. Each scenario was run with 100 independently generated datasets, and the mean behavior across the samples was observed. The results presented in this section correspond to the median probability model (MPM) from each of the 108 simulation scenarios considered. Figure 4-7 shows the comparison between the two priors for the mean number of true positive (TP) and false positive (FP) terms. Although some of the scenarios consider true models that are not well-formulated, the smallest well-formulated model that stems from MT is always the one shown in Figure 4-6.

Figure 4-6. DAG of MT, the largest true model used in the simulations.

The results are summarized in Figure 4-7. Each point on the horizontal axis corresponds to the average for a given set of simulation conditions. Only labels for the SNR and sample size are included for clarity, but the results are also shown for the different values of the regression coefficients and the different true models considered. Additional details about the procedure and other results are included in the appendices.

4.5.1 SNR and Sample Size Effect

As expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect being greater when using the latter prior. However, considering the mean number of TPs jointly with the number of FPs, it is clear that although the number of TPs is especially low with HOP(1, ch), most of the few predictors that are discovered in fact belong to the true model. In comparison to the results with the EPP, in terms of FPs the HOP(1, ch) does better, and even more so when both the sample size and the SNR are

Figure 4-7. Average true positives (TP) and average false positives (FP) in all simulated scenarios for the median probability model with EPP and HOP(1, ch).

smallest. Finally, when either the SNR or the sample size is large, the performance in terms of TPs is similar between both priors, but the numbers of FPs are somewhat lower with the HOP.

4.5.2 Coefficient Magnitude

Three ways to allocate the amount of signal across predictors are considered. For the first choice, all coefficients contain the same amount of signal, regardless of their order. In the second, each order-one coefficient contains twice as much signal as any order-two coefficient, and four times as much as any order-three coefficient. Finally, each order-one coefficient contains half as much signal as any order-two coefficient, and a quarter of what any order-three coefficient has. These choices are denoted by β(1) = c(1o1, 1o2, 1o3), β(2) = c(1o1, 0.5o2, 0.25o3), and β(3) = c(0.25o1, 0.5o2, 1o3), respectively. In Figure 4-7 the first 4 scenarios correspond to simulations with β(1), the next four use β(2), the next four correspond to β(3), and then the values are cycled in the same way. The results show that scenarios using either β(1) or β(3) behave similarly, contrasting with the negative impact of having the highest signal in the order-one terms through β(2). In Figure 4-7 the effect of using β(2) is evident, as it corresponds to the lowest values for the TPs regardless of the sample size, the SNR, or the prior used. This is an intuitive result, since giving more signal to higher-order terms makes it easier to detect higher-order terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

4.5.3 Special Points on the Scale

Four true models were considered: (1) the model from Figure 4-6 (MT1); (2) the model without the order-one terms (MT2); (3) the model without order-two terms (MT3); and (4) the model without x1² and x2x5 (MT4). The last three are clearly not well-formulated. In Figure 4-7 the leftmost point on the horizontal axis corresponds to scenarios with MT1, the next point is for scenarios with MT2, followed by those with MT3, then those with MT4, then MT1 again, etc. In comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar among the four models in terms of both TP and FP. An interesting observation is that the effect of having special points on the scale is vastly magnified whenever the coefficients that assign more weight to order-one terms (β(2)) are used.

4.6 Case Study: Ozone Data Analysis

This section uses the ozone data from Breiman & Friedman (1985) and follows the analysis performed by Liang et al. (2008), who investigated hyper-g priors. After removing observations with missing values, 330 observations remain, including daily measurements of maximum ozone concentration near Los Angeles and eight meteorological variables (Table 4-5). From the 330 observations, 165 were sampled at random without replacement and used to run the variable selection procedure; the remaining 165 were used for validation. The eight meteorological variables, their interactions, and their squared terms are used as predictors, resulting in a full model with 44 predictors. The model space assumes that the base model MB is the intercept-only model and that MF is the quadratic surface in the eight meteorological variables. The model space contains approximately 71 billion models, and computation of all model posterior probabilities is not feasible.

Table 4-5. Variables used in the analyses of the ozone contamination dataset.
Name   Description
ozone  Daily maximum 1-hour-average ozone (ppm) at Upland, CA
vh     500-millibar pressure height (m) at Vandenberg AFB
wind   Wind speed (mph) at LAX
hum    Humidity (%) at LAX
temp   Temperature (F) measured at Sandburg, CA
ibh    Inversion base height (ft) at LAX
dpg    Pressure gradient (mm Hg) from LAX to Daggett, CA
vis    Visibility (miles) measured at LAX
ibt    Inversion base temperature (F) at LAX

The HOP, HUP, and HIP with a = 1 and b = ch, as well as the EPP, are considered for comparison purposes. To obtain the Bayes factors in Equation 3–3, four different mixtures of g-priors are utilized: intrinsic priors (IP) (which yield the expression in Equation 3–2), hyper-g (HG) priors (Liang et al. 2008) with hyper-parameters α = 2, β = 1 and α = β = 1, and Zellner-Siow (ZS) priors (Zellner & Siow 1980). The results were extracted for the median probability (MPM) models. Additionally, the model is estimated using the R package hierNet (Bien et al. 2013) to compare model selection results to those obtained using the hierarchical lasso (Bien et al. 2013), restricted to well-formulated models by imposing the strong heredity constraint. The procedures were assessed on the basis of their predictive accuracy on the validation dataset.

Among all models, the one that yields the smallest RMSE is the median probability model obtained using the HOP and EPP with the ZS prior, and also using the HOP with both HG priors (Table 4-6). The HOP model with the intrinsic prior has all the terms contained in the lowest-RMSE model with the exception of dpg², which has a relatively high marginal inclusion probability of 46%. This disparity between the IP and the other mixtures of g-priors is explained by the fact that the IP induces less posterior shrinkage than the ZS and HG priors. The MPMs obtained through the HUP and HIP are nested in the best model, suggesting that these model space priors penalize complexity too much and result in false negatives. Consideration of these MPMs suggests that the HOP is best at producing true positives while controlling for false positives.

Finally, the model obtained from the hierarchical lasso (HierNet) is the largest model and produces the second-to-largest RMSE. All of the terms contained in any of the other models, except for vh, are nested within the hierarchical lasso model, and most of the terms that are exclusive to this model receive extremely low marginal inclusion probabilities under any of the model priors and parameter priors considered under Bayesian model selection.


Table 4-6. Median probability models (MPM) from different combinations of parameter and model priors vs. the model selected using the hierarchical lasso.

BF      Prior   Model                                                   R²       RMSE
IP      EPP     hum, dpg, ibt, hum², hum·dpg, hum·ibt, dpg², ibt²       0.8054   4.2739
IP      HIP     hum, ibt, hum², hum·ibt, ibt²                           0.7740   4.3396
IP      HOP     hum, dpg, ibt, hum², hum·ibt, ibt²                      0.7848   4.3175
IP      HUP     hum, dpg, ibt, hum·ibt, ibt²                            0.7767   4.3508
ZS      EPP     hum, dpg, ibt, hum², hum·ibt, dpg², ibt²                0.7896   4.2518
ZS      HIP     hum, ibt, hum·ibt, ibt²                                 0.7525   4.3505
ZS      HOP     hum, dpg, ibt, hum², hum·ibt, dpg², ibt²                0.7896   4.2518
ZS      HUP     hum, dpg, ibt, hum·ibt, ibt²                            0.7767   4.3508
HG11    EPP     vh, hum, dpg, ibt, hum², hum·ibt, dpg²                  0.7701   4.3049
HG11    HIP     hum, ibt, hum·ibt, ibt²                                 0.7525   4.3505
HG11    HOP     hum, dpg, ibt, hum², hum·ibt, dpg², ibt²                0.7896   4.2518
HG11    HUP     hum, dpg, ibt, hum·ibt, ibt²                            0.7767   4.3508
HG21    EPP     hum, dpg, ibt, hum², hum·ibt, dpg²                      0.7701   4.3037
HG21    HIP     hum, dpg, ibt, hum·ibt, ibt²                            0.7767   4.3508
HG21    HOP     hum, dpg, ibt, hum², hum·ibt, dpg², ibt²                0.7896   4.2518
HG21    HUP     hum, dpg, ibt, hum·ibt                                  0.7526   4.4036
HierNet         hum, temp, ibh, dpg, ibt, vis, hum², hum·ibt,           0.7651   4.3680
                temp², temp·ibt, dpg²

4.7 Discussion

Scott & Berger (2010) noted that the Ockham's-razor effect found automatically in Bayesian variable selection through the Bayes factor does not correct for multiple testing. The Bayes factor penalizes the complexity of the alternative model according to the number of parameters in excess of those of the null model; therefore, the Bayes factor only controls complexity in a pairwise fashion. If the model selection procedure uses equal prior probabilities for all M ∈ M, then these comparisons ignore the effect of the multiplicity of the testing problem. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M | M).

In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are set up (e.g., in what units these predictors are expressed).

In this chapter we investigated a solution to these two issues. We define prior structures for well-formulated models and develop random walk algorithms to traverse this type of model space. The key to understanding prior distributions on the space of WFMs is the hierarchical nature of the model space itself. The prior distributions described take advantage of that hierarchy in two ways. First, conditional independence and immediate inheritance are used to develop the HOP, HIP, and HUP structures discussed in Section 4.3. Second, the conditional nature of the priors allows for the direct incorporation of complexity penalizations. Of the priors proposed, the HOP using the hyper-parameter choice (1, ch) provides the best control of false positives while maintaining a reasonable true positive rate. Thus, this prior is recommended as the default prior on the space of WFMs.


In the near future, the software developed to carry out a Metropolis-Hastings random walk on the space of WFMs will be integrated into the R package varSelectIP. These new functions implement various local priors for the regression coefficients, including the intrinsic prior, the Zellner-Siow prior, and hyper-g priors. In addition, the software supports the computation of credible sets for each regression coefficient, conditioned on the selected model as well as under model averaging.


CHAPTER 5
CONCLUSIONS

Ecologists are now embracing the use of Bayesian methods to investigate the

interactions that dictate the distribution and abundance of organisms These tools are

both powerful and flexible They allow integrating under a single methodology empirical

observations and theoretical process models and can seamlessly account for several

sources of uncertainty and dependence The estimation and testing methods proposed

throughout the document will contribute to the understanding of Bayesian methods used

in ecology and hopefully these will shed light about the differences between estimation

and testing Bayesian tools

All of our contributions exploit the potential of the latent variable formulation This

approach greatly simplifies the analysis of complex models it redirects the bulk of

the inferential burden away from the original response variables and places it on the

easy-to-work-with latent scale for which several time-tested approaches are available

Our methods are distinctly classified into estimation and testing tools

For estimation, we proposed a Bayesian specification of the single-season occupancy model for which a Gibbs sampler is available using both logit and probit link functions. This setup allows detection and occupancy probabilities to depend on linear combinations of predictors. We then developed a dynamic version of this approach, incorporating the notion that occupancy at a previously occupied site depends both on survival of current settlers and on habitat suitability. Additionally, because these dynamics also vary in space, we suggested a strategy to add spatial dependence among neighboring sites.

Ecological inquiry usually requires competing explanations, and uncertainty surrounds the decision of choosing any one of them. Hence, a model, or a set of probable models, should be selected from all the viable alternatives. To address this testing problem, we proposed an objective and fully automatic Bayesian methodology for the single-season site-occupancy model. Our approach relies on the intrinsic prior, which avoids introducing (commonly unavailable) subjective information into the model. In simulation experiments we observed that the method accurately singles out the predictors present in the true model using the marginal posterior inclusion probabilities of the predictors. For predictors in the true model, these probabilities were comparatively larger than those for predictors not present in the true model. Also, the simulations indicated that the method provides better discrimination for predictors in the detection component of the model.

In our simulations and in the analysis of the Blue Hawker data, we observed that the effect of using the multiplicity correction prior was substantial. This occurs because the Bayes factor only penalizes the complexity of the alternative model according to its number of parameters in excess of those of the null model. As the number of predictors grows, the number of models in the model space also grows, increasing the chances of making false positive decisions on the inclusion of predictors. This is where the role of the prior on the model space becomes important: the multiplicity penalty is "hidden away" in the model prior probabilities π(M | M). In addition to the multiplicity of the testing problem, disregarding the hierarchical polynomial structure of the predictors in model selection procedures has the potential to lead to different results according to how the predictors are coded (e.g., in what units these predictors are expressed).

To confront this situation, we proposed three prior structures for well-formulated models that take advantage of the hierarchical structure of the predictors. Of the priors proposed, we recommend the HOP with the hyper-parameter choice (1, ch), which provides the best control of false positives while maintaining a reasonable true positive rate.

Overall, considering the flexibility of the latent approach, several other extensions of these methods follow. Currently we envision three future developments: (1) occupancy models that incorporate various sources of information; (2) multi-species models that make use of spatial and interspecific dependence; and (3) methods to conduct model selection for the dynamic and spatially explicit versions of the model.


APPENDIX A
FULL CONDITIONAL DENSITIES: DYMOSS

In this appendix we introduce the full conditional probability density functions for all the parameters involved in the DYMOSS model, using probit as well as logit links.

Sampler Z

The full conditionals corresponding to the presence indicators have the same form regardless of the link used. These are derived separately for the cases t = 1, 1 < t < T, and t = T, since their corresponding probabilities take on slightly different forms. Let φ(ν | µ, σ²) represent the density of a normal random variable ν with mean µ and variance σ², and recall that ψi1 = F(x′(o)i α) and pijt = F(q′ijt λt), where F(·) is the inverse link function. The full conditional for zit is given by:

1. For t = 1:

π(zi1 | vi1, α, λ1, βc1, δs1) = (ψ*i1)^{zi1} (1 − ψ*i1)^{1−zi1} = Bernoulli(ψ*i1),   (A–1)

where

ψ*i1 = [ψi1 φ(vi1 | x′i1 βc1 + δs1, 1) ∏_{j=1}^{Ji1} (1 − pij1)] / [ψi1 φ(vi1 | x′i1 βc1 + δs1, 1) ∏_{j=1}^{Ji1} (1 − pij1) + (1 − ψi1) φ(vi1 | x′i1 βc1, 1) ∏_{j=1}^{Ji1} I(yij1 = 0)].

2. For 1 < t < T:

π(zit | zi(t−1), zi(t+1), λt, βct−1, δst−1) = (ψ*it)^{zit} (1 − ψ*it)^{1−zit} = Bernoulli(ψ*it),   (A–2)

where

ψ*it = [κit ∏_{j=1}^{Jit} (1 − pijt)] / [κit ∏_{j=1}^{Jit} (1 − pijt) + ∇it ∏_{j=1}^{Jit} I(yijt = 0)],

with

(a) κit = F(x′i(t−1) βct−1 + zi(t−1) δst−1) φ(vit | x′it βct + δst, 1), and
(b) ∇it = (1 − F(x′i(t−1) βct−1 + zi(t−1) δst−1)) φ(vit | x′it βct, 1).

3. For t = T:

π(ziT | zi(T−1), λT, βcT−1, δsT−1) = (ψ⋆iT)^{ziT} (1 − ψ⋆iT)^{1−ziT} = Bernoulli(ψ⋆iT),   (A–3)

where

ψ⋆iT = [κ⋆iT ∏_{j=1}^{JiT} (1 − pijT)] / [κ⋆iT ∏_{j=1}^{JiT} (1 − pijT) + ∇⋆iT ∏_{j=1}^{JiT} I(yijT = 0)],

with

(a) κ⋆iT = F(x′i(T−1) βcT−1 + zi(T−1) δsT−1), and
(b) ∇⋆iT = 1 − F(x′i(T−1) βcT−1 + zi(T−1) δsT−1).

Sampler ui

1.

π(ui | zi1, α) = tr N(x′(o)i α, 1, trunc(zi1)),   (A–4)

where trunc(zi1) = (−∞, 0] if zi1 = 0 and (0, ∞) if zi1 = 1, and tr N(µ, σ², A) denotes the pdf of a truncated normal random variable with mean µ, variance σ², and truncation region A.

Sampler α

1.

π(α | u) ∝ [α] ∏_{i=1}^{N} φ(ui | x′(o)i α, 1).   (A–5)

If [α] ∝ 1, then

α | u ∼ N(m(α), Σα),

with m(α) = Σα X′(o) u and Σα = (X′(o) X(o))⁻¹.

Sampler vit

1. (For t > 1)

π(vi(t−1) | zi(t−1), zit, βct−1, δst−1) = tr N(µ(v)i(t−1), 1, trunc(zit)),   (A–6)

where µ(v)i(t−1) = x′i(t−1) βct−1 + zi(t−1) δst−1, and trunc(zit) defines the corresponding truncation region given by zit.

Sampler (βct−1, δst−1)

1. (For t > 1)

π(βct−1, δst−1 | vt−1, zt−1) ∝ [βct−1, δst−1] ∏_{i=1}^{N} φ(vi(t−1) | x′i(t−1) βct−1 + zi(t−1) δst−1, 1).   (A–7)

If [βct−1, δst−1] ∝ 1, then

βct−1, δst−1 | vt−1, zt−1 ∼ N(m(βct−1, δst−1), Σt−1),

with m(βct−1, δst−1) = Σt−1 X̃′t−1 vt−1 and Σt−1 = (X̃′t−1 X̃t−1)⁻¹, where X̃t−1 = (Xt−1, zt−1).

Sampler wijt

1. (For t > 1 and zit = 1)

π(wijt | zit = 1, yijt, λ) = tr N(q′ijt λt, 1, tr(yijt)).   (A–8)

Sampler λt

1. (For t = 1, 2, ..., T)

π(λt | zt, wt) ∝ [λt] ∏_{i: zit=1} ∏_{j=1}^{Jit} φ(wijt | q′ijt λt, 1).   (A–9)

If [λt] ∝ 1, then

λt | wt, zt ∼ N(m(λt), Σλt),

with m(λt) = Σλt Q′t wt and Σλt = (Q′t Qt)⁻¹, where Qt and wt, respectively, are the design matrix and the vector of latent variables for surveys of sites such that zit = 1.

APPENDIX B
RANDOM WALK ALGORITHMS

Global Jump. From the current state M, the global jump is performed by drawing a model M′ at random from the model space. This is achieved by beginning at the base model and increasing the order from Jmin,M to Jmax,M, the minimum and maximum orders of nodes in MF \ MB; at each order, a set of nodes is selected at random from the prior, conditioned on the nodes already in the model. The MH correction is

α = min{ 1, m(y | M′, M) / m(y | M, M) }.

Local Jump. From the current state M, the local jump is performed by drawing a model from the set of models L(M) = {Mα : α ∈ E(M) ∪ C(M)}, where Mα is M \ {α} for α ∈ E(M) and M ∪ {α} for α ∈ C(M). The proposal probabilities for the model are computed as a mixture of p(M′ | y, M, M′ ∈ L(M)) and the discrete uniform distribution. The proposal kernel is

q(M′ | y, M, M′ ∈ L(M)) = (1/2) [ p(M′ | y, M, M′ ∈ L(M)) + 1/|L(M)| ].

This choice promotes moving to better models while maintaining a non-negligible probability of moving to any of the possible models. The MH correction is

α = min{ 1, [m(y | M′, M) / m(y | M, M)] · [q(M | y, M′, M ∈ L(M′)) / q(M′ | y, M, M′ ∈ L(M))] }.
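The following is a minimal sketch (in R) of this proposal: the models in L(M) are scored by their renormalised posterior probabilities and mixed with a uniform distribution. Here neighbours is a list containing the models in L(M) and log_post is a placeholder for log[m(y | M, M) π(M | M)]; both names are illustrative.

local_jump_weights <- function(neighbours, log_post) {
  lp <- vapply(neighbours, log_post, numeric(1))
  p  <- exp(lp - max(lp)); p <- p / sum(p)    # p(M' | y, M, M' in L(M))
  0.5 * p + 0.5 / length(neighbours)          # mixture with the uniform kernel
}

A model M′ is then drawn from L(M) with these probabilities, and the same weights evaluated over L(M′) supply the reverse-kernel term q(M | y, M′, M ∈ L(M′)) needed in the MH correction.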

Intermediate Jump. The intermediate jump is performed by increasing or decreasing the order of the nodes under consideration, performing local proposals based on order. For a model M′, define Lj(M′) = {M′} ∪ {M′α : α ∈ (E(M′) ∪ C(M′)), order(α) = j}. From a state M, the kernel chooses at random whether to increase or decrease the order. If M = MF, then decreasing the order is chosen with probability 1, and if M = MB, then increasing the order is chosen with probability 1; in all other cases, the probability of increasing and of decreasing order is 1/2. The proposal kernels are given by:

Increasing order proposal kernel

1. Set j = Jmin,M − 1 and M′j = M.

2. Draw M′j+1 from qinc,j+1(M′ | y, M, M′ ∈ Lj+1(M′j)), where
   qinc,j+1(M′ | y, M, M′ ∈ Lj+1(M′j)) = (1/2) [ p(M′ | y, M, M′ ∈ Lj+1(M′j)) + 1/|Lj+1(M′j)| ].

3. Set j = j + 1.

4. If j < Jmax,M, then return to 2; otherwise proceed to 5.

5. Set M′ = M′Jmax,M and compute the proposal probability

   qinc(M′ | y, M, M) = ∏_{j=Jmin,M − 1}^{Jmax,M − 1} qinc,j+1(M′j+1 | y, M, M′ ∈ Lj+1(M′j)).   (B–1)

Decreasing order proposal kernel

1. Set j = Jmax,M + 1 and M′j = M.

2. Draw M′j−1 from qdec,j−1(M′ | y, M, M′ ∈ Lj−1(M′j)), where
   qdec,j−1(M′ | y, M, M′ ∈ Lj−1(M′j)) = (1/2) [ p(M′ | y, M, M′ ∈ Lj−1(M′j)) + 1/|Lj−1(M′j)| ].

3. Set j = j − 1.

4. If j > Jmin,M, then return to 2; otherwise proceed to 5.

5. Set M′ = M′Jmin,M and compute the proposal probability

   qdec(M′ | y, M, M) = ∏_{j=Jmax,M + 1}^{Jmin,M + 1} qdec,j−1(M′j−1 | y, M, M′ ∈ Lj−1(M′j)).   (B–2)

If increasing order is chosen, then the MH correction is given by

α = min{ 1, [(1 + I(M′ = MF)) / (1 + I(M = MB))] · [qdec(M | y, M, M′) / qinc(M′ | y, M, M)] · [p(M′ | y, M) / p(M | y, M)] },   (B–3)

and similarly if decreasing order is chosen.

Other Local and Intermediate Kernels. The local and intermediate kernels described here perform a kind of stochastic forwards-backwards selection. Each kernel q can be relaxed to allow more than one node to be turned on or off at each step, which could provide larger jumps for each of these kernels. The tradeoff is that the number of proposed models for such jumps could be very large, precluding the use of posterior information in the construction of the proposal kernel.

APPENDIX C
WFM SIMULATION DETAILS

Briefly, the idea is to let ZMT(X) βMT = (QR) βMT = Q ηMT (i.e., βMT = R⁻¹ ηMT), using the QR decomposition. As such, setting all values in ηMT proportional to one corresponds to distributing the signal in the model uniformly across all predictors, regardless of their order.

The (unconditional) variance of a single observation yi is var(yi) = var(E[yi | zi]) + E[var(yi | zi)], where zi is the i-th row of the design matrix ZMT. Hence, we take the signal-to-noise ratio for each observation to be

SNR(η) = η′MT R⁻ᵀ Σz R⁻¹ ηMT / σ²,

where Σz = var(zi). We determine how the signal is distributed across predictors up to a proportionality constant, so that the signal-to-noise ratio can be controlled simultaneously.

hierarchical structure we specify four different 0-1 vectors that determine the predictors

in MT which generates the data in the different scenarios

Table C-1 Experimental conditions WFM simulationsParameter Values considered

SNR(ηMT) = k 025 1 4

ηMTprop (1 13 14 12) (1 13 1214

1412) (1 1413

1214 12)

γMT(1 13 14 12) (1 13 14 02) (1 13 04 12) (1 03 0 1 1 0 12)

n 130 260 1040

The results presented below are somewhat different from those found in the main body of the chapter in Section 4.5: here they are extracted by averaging the number of FPs, TPs, and model sizes, respectively, over the 100 independent runs and across the corresponding scenarios, for the 20 highest posterior probability models.

SNR and Sample Size Effect

In terms of the SNR and the sample size (Figure C-1), we observe that, as expected, small sample sizes conditioned upon a small SNR impair the ability of the algorithm to detect true coefficients with both the EPP and HOP(1, ch), with this effect more pronounced when using the latter prior. However, considering the mean number of true positives (TP) jointly with the mean model size, it is clear that although the sensitivity is low, most of the few predictors that are discovered belong to the true model. The results observed with an SNR of 0.25 and a relatively small sample size are far from impressive; however, real problems where the SNR is as low as 0.25 will yield many spurious associations under the EPP, so the strong protection that the HOP(1, ch) provides against false positives is commendable in itself. An SNR of 1 also represents a feeble relationship between the predictors and the response; nonetheless, the method captures approximately half of the true coefficients while including very few false positives. Following intuition, as either the sample size or the SNR increases, the algorithm's performance is greatly enhanced. Either having a large sample size or a large SNR yields models that contain mostly true predictors. Additionally, HOP(1, ch) provides strong control over the number of false positives; therefore, for high SNR or larger sample sizes, the number of predictors in the top 20 models is close to the size of the true model. In general, the EPP allows the detection of more TPs, while the HOP(1, ch) provides stronger control on the number of FPs included when considering small sample sizes combined with small SNRs. As either the sample size or the SNR grows, the differences between the two priors become indistinct.

Figure C-1. SNR vs. n: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

Coefficient Magnitude

This part of the experiment explores the effect of how the signal is distributed across predictors. As mentioned before, sphering is used to assign the coefficient values in a manner that controls the amount of signal that goes into each coefficient. Three possible ways to allocate the signal are considered: first, each order-one coefficient contains twice as much signal as any order-two coefficient and four times as much as any order-three coefficient; second, all coefficients contain the same amount of signal, regardless of their order; and third, each order-one coefficient contains half as much signal as any order-two coefficient and a quarter of what any order-three coefficient has. In Figure C-2 these allocations are denoted by β = c(1_{o1}, 0.5_{o2}, 0.25_{o3}), β = c(1_{o1}, 1_{o2}, 1_{o3}), and β = c(0.25_{o1}, 0.5_{o2}, 1_{o3}), respectively.

Observe that the number of FPs is insensitive to how the SNR is distributed across predictors using the HOP(1, ch); conversely, when using the EPP the number of FPs decreases as the SNR grows, always remaining slightly higher than that obtained with the HOP. With either prior structure, the algorithm performs better whenever all coefficients are equally weighted or when those for the order-three terms have higher weights. In these two cases (i.e., with β = c(1_{o1}, 1_{o2}, 1_{o3}) or β = c(0.25_{o1}, 0.5_{o2}, 1_{o3})), the effect of the SNR appears to be similar. In contrast, when more weight is given to order-one terms, the algorithm yields slightly worse models at any SNR level. This is an intuitive result: giving more signal to higher-order terms makes it easier to detect those terms, and consequently, by strong heredity, the algorithm will also select the corresponding lower-order terms included in the true model.

Special Points on the Scale

In Nelder (1998), the author argues that the conditions under which the weak-heredity principle can be used for model selection are so restrictive that the principle is commonly not valid in this context. In addition, the author states that considering only well-formulated models does not take into account the possible presence of special points on the scales of the predictors, that is, situations where omitting lower-order terms is justified due to the nature of the data. However, it is our contention that every model has an underlying well-formulated structure; whether or not some predictor has special points on its scale will be determined through the estimation of the coefficients, once a valid well-formulated structure has been chosen.

To understand how the algorithm behaves whenever the true data-generating mechanism has zero-valued coefficients for some lower-order terms in the hierarchy, four different true models are considered. Three of them are not well-formulated, while the remaining one is the WFM shown in Figure 4-6. The three models that have special points correspond to the same model M_T from Figure 4-6, but have, respectively, zero-valued coefficients for all the order-one terms, for all the order-two terms, and for x_1^2 and x_2 x_5.

Figure C-2. SNR vs. coefficient values: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

As seen before, in comparison to the EPP, the HOP(1, ch) tightly controls the inclusion of FPs by choosing smaller models, at the expense of also reducing the TP count, especially when there is more uncertainty about the true model (i.e., SNR = 0.25). For both prior structures, the results in Figure C-3 indicate that at low SNR levels the presence of special points has no apparent impact, as the selection behavior is similar across the four models in terms of both TPs and FPs. As the SNR increases, the TPs and the model size are affected for true models with zero-valued lower-order terms. These differences, however, are not very large. Relatively smaller models are selected whenever some terms in the hierarchy are missing, but with high SNR, which is where the differences are most pronounced, the predictors included are mostly true coefficients. The impact is almost imperceptible for the true model that lacks order-one terms and for the model with zero coefficients for x_1^2 and x_2 x_5, and it is more visible for models without order-two terms.

Figure C-3. SNR vs. different true models M_T: average model size, average true positives, and average false positives for all simulated scenarios, by model ranking according to model posterior probabilities.

This last result is expected due to strong heredity: whenever the order-one coefficients are missing, the inclusion of order-two and order-three terms will force their selection, which is also the case when only a few order-two terms have zero-valued coefficients. Conversely, when all order-two predictors are removed, some order-three predictors are not selected, as their signal is attributed to the order-two predictors missing from the true model. This is especially the case for the order-three interaction term x_1 x_2 x_5, which depends on the inclusion of three order-two terms (x_1 x_2, x_1 x_5, x_2 x_5) in order for it to be included as well. This makes the inclusion of this term somewhat more challenging: the three order-two interactions capture most of the variation of the polynomial terms that is present when the order-three term is also included. However, special points on the scale commonly occur on a single covariate or at most on a few covariates; a true data-generating mechanism that removes all terms of a given order in the context of polynomial models is clearly not justified, and here this was done only for comparison purposes.
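To illustrate the strong-heredity constraint invoked in this discussion, the short sketch below (a hypothetical helper, not part of the dissertation's code) encodes each polynomial term by its vector of exponents and checks whether a candidate model is well formulated, i.e., whether every lower-order parent of each included term is also included; under this rule x_1 x_2 x_5 can enter only if x_1 x_2, x_1 x_5, and x_2 x_5 (and, recursively, x_1, x_2, and x_5) are present:

    def parents(term):
        # Immediate lower-order parents: reduce one positive exponent by one.
        out = []
        for i, e in enumerate(term):
            if e > 0:
                p = list(term)
                p[i] -= 1
                out.append(tuple(p))
        return out

    def is_well_formulated(model, n_vars=5):
        # Strong heredity: every parent of every included term must also be included.
        # Terms are exponent tuples over (x1, ..., x_{n_vars}); the intercept is implicit.
        terms = set(model) | {(0,) * n_vars}
        return all(p in terms for t in terms for p in parents(t))

    # x1*x2*x5 -> (1, 1, 0, 0, 1); its parents are x1*x2, x1*x5, and x2*x5.
    wfm = [(1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 0, 0, 1),
           (1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (0, 1, 0, 0, 1), (1, 1, 0, 0, 1)]
    not_wfm = [(1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (1, 1, 0, 0, 1)]  # drops x2*x5 and the main effects
    print(is_well_formulated(wfm), is_well_formulated(not_wfm))     # True False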


APPENDIX D
SUPPLEMENTARY INFORMATION FOR THE OZONE DATA ANALYSIS

The covariates considered for the ozone data analysis match those used in Liang et al. (2008); they are displayed in Table D-1 below.

Table D-1. Variables used in the analyses of the ozone contamination dataset
Name    Description
ozone   Daily max 1-hr-average ozone (ppm) at Upland, CA
vh      500 millibar pressure height (m) at Vandenberg AFB
wind    Wind speed (mph) at LAX
hum     Humidity (%) at LAX
temp    Temperature (F) measured at Sandburg, CA
ibh     Inversion base height (ft) at LAX
dpg     Pressure gradient (mm Hg) from LAX to Daggett, CA
vis     Visibility (miles) measured at LAX
ibt     Inversion base temperature (F) at LAX

The marginal posterior inclusion probability corresponds to the probability of including a given term of the full model M_F after summing over all models in the model space. For each node α ∈ M_F, this probability is given by p_α = Σ_{M ∈ M} I(α ∈ M) p(M | y). Given that, in problems with a large model space such as the one considered for the ozone concentration problem, enumeration of the entire space is not feasible, these probabilities are estimated by summing over every model drawn by the random walk over the model space M.
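As a minimal sketch of this estimator (illustrative only; the model representation and the un-normalized weights below are hypothetical stand-ins for the output of the stochastic search), each visited model is stored as a set of term labels together with an un-normalized posterior weight, and the estimated inclusion probability of a term is the normalized sum of the weights of the visited models that contain it:

    from collections import defaultdict

    def inclusion_probabilities(visited):
        # visited: dict mapping each distinct visited model (a frozenset of term
        # labels) to an un-normalized posterior weight, e.g. prior(M) * marginal(M | y).
        total = sum(visited.values())
        p = defaultdict(float)
        for model, weight in visited.items():
            for term in model:
                p[term] += weight / total
        return dict(p)

    # Toy example with three visited models and made-up weights.
    visited = {
        frozenset({"hum", "ibt"}): 0.6,
        frozenset({"hum", "ibt", "hum.ibt"}): 0.3,
        frozenset({"dpg", "ibt"}): 0.1,
    }
    print(inclusion_probabilities(visited))  # ibt: 1.0, hum: 0.9, hum.ibt: 0.3, dpg: 0.1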

Given that there are in total 44 potential predictors, for convenience, in Tables D-2 to D-5 below we only display the marginal posterior probabilities for the terms included under at least one of the model priors considered (EPP, HIP, HUP, and HOP), for each of the parameter priors utilized (intrinsic priors, Zellner-Siow priors, Hyper-g(11), and Hyper-g(21)).


Table D-2. Marginal inclusion probabilities, intrinsic prior
            EPP     HIP     HUP     HOP
hum         0.99    0.69    0.85    0.76
dpg         0.85    0.48    0.52    0.53
ibt         0.99    1.00    1.00    1.00
hum²        0.76    0.51    0.43    0.62
hum·dpg     0.55    0.02    0.03    0.17
hum·ibt     0.98    0.69    0.84    0.75
dpg²        0.72    0.36    0.25    0.46
ibt²        0.59    0.78    0.57    0.81

Table D-3. Marginal inclusion probabilities, Zellner-Siow prior
            EPP     HIP     HUP     HOP
hum         0.76    0.67    0.80    0.69
dpg         0.89    0.50    0.55    0.58
ibt         0.99    1.00    1.00    1.00
hum²        0.57    0.49    0.40    0.57
hum·ibt     0.72    0.66    0.78    0.68
dpg²        0.81    0.38    0.31    0.51
ibt²        0.54    0.76    0.55    0.77

Table D-4. Marginal inclusion probabilities, Hyper-g(11)
            EPP     HIP     HUP     HOP
vh          0.54    0.05    0.10    0.11
hum         0.81    0.67    0.80    0.69
dpg         0.90    0.50    0.55    0.58
ibt         0.99    1.00    0.99    0.99
hum²        0.61    0.49    0.40    0.57
hum·ibt     0.78    0.66    0.78    0.68
dpg²        0.83    0.38    0.30    0.51
ibt²        0.49    0.76    0.54    0.77

Table D-5. Marginal inclusion probabilities, Hyper-g(21)
            EPP     HIP     HUP     HOP
hum         0.79    0.64    0.73    0.67
dpg         0.90    0.52    0.60    0.59
ibt         0.99    1.00    0.99    1.00
hum²        0.60    0.47    0.37    0.55
hum·ibt     0.76    0.64    0.71    0.67
dpg²        0.82    0.41    0.36    0.52
ibt²        0.47    0.73    0.49    0.75


REFERENCES

Akaike, H. (1983). Information measures and model selection. Bull. Int. Statist. Inst., 50, 277–290.

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Berger, J., & Bernardo, J. (1992). On the development of reference priors. Bayesian Statistics 4 (pp. 35–60).
URL httpisbastatdukeedueventsvalencia1992Valencia4Refpdf

Berger, J., & Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433), 109–122.
URL httpamstattandfonlinecomdoiabs10108001621459199610476668

Berger, J., Pericchi, L., & Ghosh, J. (2001). Objective Bayesian methods for model selection: introduction and comparison. In Model Selection, vol. 38 of IMS Lecture Notes Monogr. Ser. (pp. 135–207). Inst. Math. Statist.
URL httpwwwjstororgstable1023074356165

Besag, J., York, J., & Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–20.

Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.
URL httpprojecteuclidorgeuclidaos1371150895

Breiman, L., & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.

Brusco, M. J., Steinley, D., & Cradit, J. D. (2009). An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression. Technometrics, 51(3), 306–315.

Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3), 1207–1228.
URL httpprojecteuclidorgeuclidaos1239369020

Casella, G., Moreno, E., & Girón, F. (2014). Cluster analysis, model selection, and prior distributions on models. Bayesian Analysis, TBA(TBA), 1–46.
URL httpwwwstatufledu~casellaPapersClusterModel-July11-Apdf


Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics, 24(1), 17–36.
URL httponlinelibrarywileycomdoi1023073315687abstract

Clyde, M., & George, E. I. (2004). Model uncertainty. Statistical Science, 19(1), 81–94.
URL httpprojecteuclidorgDienstgetRecordid=euclidss1089808274

Dewey, J. (1958). Experience and Nature. New York: Dover Publications.

Dorazio, R. M., & Taylor-Rodríguez, D. (2012). A Gibbs sampler for Bayesian analysis of site-occupancy data. Methods in Ecology and Evolution, 3, 1093–1098.

Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7, 509–520.

Fiske, I., & Chandler, R. (2011). unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software, 43(10).
URL httpcorekmiopenacukdownloadpdf5701760pdf

George, E. (2000). The variable selection problem. Journal of the American Statistical Association, 95(452), 1304–1308.
URL httpwwwtandfonlinecomdoiabs10108001621459200010474336

Girón, F. J., Moreno, E., Casella, G., & Martínez, M. L. (2010). Consistency of objective Bayes factors for nonnested linear models and increasing model dimension. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas, 104(1), 57–67.
URL httpwwwspringerlinkcomindex105052RACSAM201006

Good, I. J. (1950). Probability and the Weighing of Evidence. New York: Haffner.

Griepentrog, G. L., Ryan, J. M., & Smith, L. D. (1982). Linear transformations of polynomial regression models. American Statistician, 36(3), 171–174.

Gunel, E., & Dickey, J. (1974). Bayes factors for independence in contingency tables. Biometrika, 61, 545–557.

Hanski, I. (1994). A practical model of metapopulation dynamics. Journal of Animal Ecology, 63, 151–162.

Hooten, M. (2006). Hierarchical spatio-temporal models for ecological processes. Doctoral dissertation, University of Missouri-Columbia.
URL httpsmospacelibraryumsystemeduxmluihandle103554500

Hooten, M. B., & Hobbs, N. T. (2014). A guide to Bayesian model selection for ecologists. Ecological Monographs (in press).


Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 75, 139–159.

Hurvich, C. M., & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
URL httpbiometoxfordjournalsorgcontent762297abstract

Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203–222.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). London: Oxford University Press.

Johnson, D., Conn, P., Hooten, M., Ray, J., & Pond, B. (2013). Spatial occupancy models for large data sets. Ecology, 94(4), 801–808.
URL httpwwwesajournalsorgdoiabs10189012-05641mi=3eywlhampaf=RampsearchText=human+population

Kass, R., & Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431).
URL httpamstattandfonlinecomdoiabs10108001621459199510476592

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
URL httpwwwtandfonlinecomdoiabs10108001621459199510476572

Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435), 1343.
URL httpwwwjstororgstable2291752origin=crossref

Kéry, M. (2010). Introduction to WinBUGS for Ecologists: Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses (1st ed.). Academic Press.

Kéry, M., Gardner, B., & Monnerat, C. (2010). Predicting species distributions from checklist data using site-occupancy models. Journal of Biogeography, 37(10), 1851–1862.

Khuri, A. (2002). Nonsingular linear transformations of the control variables in response surface models. Technical Report.

Krebs, C. J. (1972). Ecology: The Experimental Analysis of Distribution and Abundance.


Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam: University of Rotterdam Press.

León-Novelo, L., Moreno, E., & Casella, G. (2012). Objective Bayes model selection in probit models. Statistics in Medicine, 31(4), 353–65.
URL httpwwwncbinlmnihgovpubmed22162041

Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.
URL httpwwwtandfonlinecomdoiabs101198016214507000001337

Link, W., & Barker, R. (2009). Bayesian Inference with Ecological Applications. Elsevier.
URL httpbooksgooglecombookshl=enamplr=ampid=hecon2l2QPcCampoi=fndamppg=PP2ampdq=Bayesian+Inference+with+ecological+applicationsampots=S82_0pxrNmampsig=L3xbsSQcKD8FV6rxCMp2pmP2JKk

MacKenzie, D., & Nichols, J. (2004). Occupancy as a surrogate for abundance estimation. Animal Biodiversity and Conservation, 1, 461–467.
URL httpcrsitbacidmediajurnalrefslandscapemackenzie2004zhpdf

MacKenzie, D., Nichols, J., & Hines, J. (2003). Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology, 84(8), 2200–2207.
URL httpwwwesajournalsorgdoiabs10189002-3090

MacKenzie, D. I., Bailey, L. L., & Nichols, J. D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. Journal of Animal Ecology, 73, 546–555.

MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, J. A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248–2255.

Mazerolle, M. (2013). Package 'AICcmodavg'.
URL ftpheanetarchivegnewsenseorgdisk1CRANwebpackagesAICcmodavgAICcmodavgpdf

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London, England: Chapman & Hall.

McQuarrie, A., Shumway, R., & Tsai, C.-L. (1997). The model selection criterion AICu.


Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93(444), 1451–1460.

Moreno, E., Girón, F. J., & Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4), 1937–1952.
URL httpprojecteuclidorgeuclidaos1278861238

Nelder, J. A. (1977). Reformulation of linear models. Journal of the Royal Statistical Society, Series A (Statistics in Society), 140, 48–77.

Nelder, J. A. (1998). The selection of terms in response-surface models: how strong is the weak-heredity principle? American Statistician, 52(4), 315–318.

Nelder, J. A. (2000). Functional marginality and response-surface fitting. Journal of Applied Statistics, 27(1), 109–112.

Nichols, J., Hines, J., & MacKenzie, D. (2007). Occupancy estimation and modeling with multiple states and state uncertainty. Ecology, 88(6), 1395–1400.
URL httpwwwesajournalsorgdoipdf10189006-1474

Ovaskainen, O., Hottola, J., & Siitonen, J. (2010). Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91(9), 2514–21.
URL httpwwwncbinlmnihgovpubmed20957941

Peixoto, J. L. (1987). Hierarchical variable selection in polynomial regression models. American Statistician, 41(4), 311–313.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. American Statistician, 44(1), 26–30.

Pericchi, L. R. (2005). Model selection and hypothesis testing based on objective probabilities and Bayes factors. In Handbook of Statistics. Elsevier.

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
URL httpdxdoiorg101080016214592013829001

Rao, C. R., & Wu, Y. (2001). On model selection. Volume 38 of Lecture Notes–Monograph Series (pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
URL httpdxdoiorg101214lnms1215540960


Reich, B. J., Hodges, J. S., & Zadnik, V. (2006). Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62, 1197–1206.

Reiners, W., & Lockwood, J. (2009). Philosophical Foundations for the Practices of Ecology. Cambridge University Press.
URL httpbooksgooglecombooksid=dr9cPgAACAAJ

Rigler, F., & Peters, R. (1995). Excellence in Ecology: Science and Limnology. Ecology Institute, Germany.
URL httportoncatieaccrcgi-binwxisexeIsisScript=CIENLxisampmethod=postampformato=2ampcantidad=1ampexpresion=mfn=008268

Robert, C., Chopin, N., & Rousseau, J. (2009). Harold Jeffreys's Theory of Probability revisited. Statistical Science, 24(2), 141–179.
URL httpswwwnewtonacukpreprintsNI08021pdf

Robert, C. P. (1993). A note on the Jeffreys-Lindley paradox. Statistica Sinica, 3, 601–608.

Royle, J. A., & Kéry, M. (2007). A Bayesian state-space formulation of dynamic occupancy models. Ecology, 88(7), 1813–23.
URL httpwwwncbinlmnihgovpubmed17645027

Scott, J., & Berger, J. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable selection problem. The Annals of Statistics.
URL httpprojecteuclidorgeuclidaos1278861454

Spiegelhalter, D. J., & Smith, A. F. M. (1982). Bayes factors for linear and log-linear models with vague prior information. J. R. Statist. Soc. B, 44, 377–387.

Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K., & Possingham, H. P. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(6), 1790–1801.
URL httpwwwesajournalsorgdoiabs10189002-5078

Waddle, J. H., Dorazio, R. M., Walls, S. C., Rice, K. G., Beauchamp, J., Schuman, M. J., & Mazzotti, F. J. (2010). A new parameterization for estimating co-occurrence of interacting species. Ecological Applications, 20, 1467–1475.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.
URL httpwwwncbinlmnihgovpubmed10733859

Wilson, M., Iversen, E., Clyde, M. A., Schmidler, S. C., & Schildkraut, J. M. (2010). Bayesian model search and multilevel inference for SNP association studies. The Annals of Applied Statistics, 4(3), 1342–1364.
URL httpwwwncbinlmnihgovpmcarticlesPMC3004292

Womack, A. J., León-Novelo, L., & Casella, G. (2014). Inference from intrinsic Bayes procedures under model selection and uncertainty. Journal of the American Statistical Association, (June), 140114063448000.
URL httpwwwtandfonlinecomdoiabs101080016214592014880348

Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4), 1738–1757.
URL httpprojecteuclidorgeuclidaoas1267453962

Zeller, K. A., Nijhawan, S., Salom-Pérez, R., Potosme, S. H., & Hines, J. E. (2011). Integrating occupancy modeling and interview data for corridor identification: a case study for jaguars in Nicaragua. Biological Conservation, 144(2), 892–901.

Zellner, A., & Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Trabajos de Estadística y de Investigación Operativa (pp. 585–603).
URL httpwwwspringerlinkcomindex5300770UP12246M9pdf


BIOGRAPHICAL SKETCH

Daniel Taylor-Rodríguez was born in Bogotá, Colombia. He earned a B.S. degree in economics from the Universidad de Los Andes (2004) and a Specialist degree in statistics from the Universidad Nacional de Colombia. In 2009 he traveled to Gainesville, Florida, to pursue a master's in statistics under the supervision of George Casella. Upon completion, he started a Ph.D. in interdisciplinary ecology with concentration in statistics, again under George Casella's supervision. After George's passing, Linda Young and Nikolay Bliznyuk continued to oversee Daniel's mentorship. He has accepted a joint postdoctoral fellowship at the Statistical and Applied Mathematical Sciences Institute and the Department of Statistical Science at Duke University.

