recent advances in statistical ecology using computationally intensive methods ruth king

38
Statistical Ecology using Computationally Intensive Methods Ruth King

Upload: araceli-still

Post on 31-Mar-2015

225 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Recent Advances in Statistical Ecology

using Computationally Intensive Methods

Ruth King

Page 2: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Overview Introduction to wildlife populations and

identification of questions of interest. Motivating example. Issues to be addressed:

Missing data; Model discrimination.

Summary. Future research.

Page 3: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Wildlife Populations In recent years there has been increasing

interest in wildlife populations. Often we may be interested in population

changes over time, e.g. if there is a declining population. (Steve Buckland)

Alternatively, we may be interested in the underlying dynamics of the system, in order to obtain a better understanding of the population.

We shall concentrate on this latter problem, with particular focus on identifying factors that affect demographic rates.

Page 4: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Data Collection Data are often collected via some form of

capture-recapture study. Observers go out into the field and record all

animals that are seen at a series of capture events.

Animals may be recorded via simply resightings or recaptures (of live animals) and recoveries (of dead animals).

At each capture event all unmarked animals are uniquely marked; all observed animals are recorded and subsequently released back into the population.

Page 5: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Data Each animal is uniquely identifiable so our data

consist of the capture histories for each individual observed in the study.

A typical capture history may look like:0 1 1 0 0 1 2

0/1 corresponds to the individual being unobserved/observed at that capture time; and

2 denotes an individual is recovered dead. We can then explicitly write down the

corresponding likelihood as a function of survival (), recapture (p) and recovery () rates.

Page 6: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Likelihood The likelihood is the product over all individuals of

the probability of their corresponding capture history, conditional on their initial capture.

For example, for an individual with history:0 1 1 0 0 1 2

the contribution to the likelihood is: 2 p3 3 (1-p4) 4 (1-p5) 5 p6 (1-6) 7.

Then, we can use this likelihood to estimate the parameter values (either MLE’s or posterior distributions).

Page 7: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Covariates Covariates are often used to explain temporal

heterogeneity within the parameters. Typically these are “environmental” factors, such

as resource availability, weather conditions or human intervention.

Alternatively, heterogeneity in a population can often be explained via different (individual) covariates.

For example, sex, condition or breeding status. We shall consider the survival rates to be possibly

dependent on the different covariates.

Page 8: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Case Study: Soay sheep We consider mark-recapture-recovery (MRR) data

relating to Soay sheep on the island of Hirta. The sheep are free from human activity and

external competition/predation. Thus, this population is ideal for investigating the

impact of different environmental and/or individual factors on the sheep.

We consider annual data from 1986-2000, collected on female sheep (1079 individuals).

This is joint work with Steve Brooks and Tim Coulson

Page 9: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Covariate Information(i) Individual covariates:

Coat type (1=dark, 2=light) Horn type (1=polled, 2=scurred, 3=classical) Birth weight (real – normalised) Age (in years); Weight (real - normalised); Number of lambs born to the sheep in the spring prior to

summer census (0, 1, 2); And in the spring following the census (0, 1, 2).

(ii) Environmental covariates: NAO index; population size; March rainfall; Autumn rainfall

and March temperature.

Page 10: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Survival Rates - Let the set of environmental covariate values at

time t be denoted by xt. The survival rate for animal i, of age a, at time t,

is given by,logit i,a,t = a + a

T xt + aT yi + a

T zi,t + a,t

Here, yi denotes the set of time-independent covariates and zi,t the time varying covariates;

a,t ~ N(0,a2) denotes a random effect.

There arise two issues here: missing covariate values and model choice.

Page 11: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Issue 1: Missing Data The capture histories and time-invariant covariate

values data are presented in the form:

Note that there are also many missing values for the time-dependent weight covariate.

Capture

history

Covariate values

Missing values

Page 12: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Problem Given the set of covariate values, the

corresponding survival rate can be obtained, and hence the likelihood calculated.

However, a complexity arises when there are unknown (i.e. missing) covariate values, removing the simple and explicit expression for the likelihood.

Page 13: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Missing Data: Classical Approaches Typical classical approaches to missing data

problems include: Ignoring individuals with missing covariate values; EM algorithm (can be difficult to implement and

computationally expensive); Imputation of missing values using some underlying

model (e.g. Gompertz curve). Conditional approach for time-varying covariates

(Catchpole, Morgan and Tavecchia, 2006 – in submission) We consider a Bayesian approach, where we

assume an underlying model for the missing data which allows us to account for the corresponding uncertainty of the missing values.

Page 14: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Bayesian Approach Suppose that we wish to make inference on the

parameters and the data observed corresponds to capture histories, c, and covariate values, vobs.

Then, Bayes’ theorem states: (|c, vobs) Ç L(c, vobs| ) p()

The posterior distribution is very complex and so we use Markov chain Monte Carlo (MCMC) to obtain estimates of the posterior statistics of interest.

However, in our case the likelihood is analytically intractable, due to the missing covariate values.

Page 15: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Auxiliary Variables We treat the missing covariate values (vmis) as

parameters or auxiliary variables (AVs). We then form the joint posterior distribution over the

parameters, , and AVs, given the capture histories c and observed covariate values yobs:

(, vmis | c, vobs) Ç L(c | , vmis , vobs)

£ f(vmis, vobs | ) p() We can now sample from the joint posterior

distribution: (, vmis | c, vobs). We can integrate out over the missing covariate

values, vmis, within the MCMC algorithm to obtain a sample from ( | c, vobs).

We can calculate the likelihood of the capture histories given all the covariate values (observed and imputed missing values).

The underlying model for the covariate values.

Prior on parameters

Page 16: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

f(vmis, vobs | ) – Categorical Data For categorical data (coat type and horn type), we

assume the following model. Let y1,i denote the horn type of individual i. Then,

y1,i 2 {1,2,3}, and we assume that,

y1,i ~ Multinomial(1,q),

where q = {q1, q2, q3}. Thus, we have additional parameters, q, which

can be regarded as the underlying probability of each horn type.

The q’s are updated within the MCMC algorithm, as well as the y1,i’s which are unknown.

We assume the analogous model for coat type.

Page 17: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

f(vmis, vobs | ) – Continuous Data Let y3,i denote the birth weight of individual i. Then, a possible model is to assume that,

y3,i ~ N(, 2),

where and 2 are to be estimated. For the weight of individual i, aged a at time t,

denoted by z1,i,t we set,

z1,i,t ~ N(z1,i,t-1 + a + t, w2),

where the parameters a, t and w2 are to be

estimated. In general, the modelling assumptions will depend

on the system under study.

Page 18: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Practical Implications Within each step of the MCMC algorithm, we not

only need to update the parameter values , but also impute the missing covariate values and random effects (if present).

This can be computationally expensive for large amounts of missing data.

The posterior results may depend on the underlying model for the covariates– a sensitivity analysis can be performed using different underlying models.

(State-space modelling can also be implemented using similar ideas – see Steve’s talk).

Page 19: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Issue 2: Model Selection For the sheep data set we can now deal with the

issue of missing covariate values. But….. how do we decide which covariates to use

and/or the age structure? – often there may be a large number of possible covariates and/or age structures.

Discriminating between different models can often be of particular interest, since they represent competing biological hypotheses.

Model choice can also be important for future predictions of the system.

Page 20: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Possible Models We only want to consider biologically plausible

models. We have uncertainty on the age structure of the

survival rates, and consider models of the form:1:a; a+1;b; …; j+

k is unknown a priori and also the covariate dependence for each age group.

We consider similar age-type models for p and with possible arbitrary time dependence.

E.g. 1(N,BW), 2:7(W,L+),8+(N,P)/p(t)/1,2+(t) The number of possible models is immense!!

k groups

Page 21: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Bayesian Approach We treat the model itself to be an unknown

parameter to be estimated. Then, applying Bayes’ Theorem we obtain the

posterior distribution over both parameter and model space:

(m, m | data) Ç L(data | m, m) p(m) p(m).

Here m denotes the parameters in model m.

Likelihood Prior on parameters in model m Prior on model m

Page 22: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Reversible Jump MCMC The reversible jump (RJ)MCMC algorithm allows us

to construct a Markov chain with stationary distribution equal to the posterior distribution.

It is simply an extension of the Metropolis-Hastings algorithm that allows moves between different dimensions.

This algorithm is needed because the number of parameters, m, in model m, may differ between models.

Note that this algorithm needs only one Markov chain, regardless of the number of models.

Page 23: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Posterior Model Probabilities Posterior model probabilities are able to

quantitatively compare different models. The posterior probability of model m is defined

as, (m| data) = s (m,m| data) dm.

These are simply estimated within the RJMCMC algorithm as the proportion of the time the chain is in the given model.

We are also able to obtain model-averaged estimates of parameters, which takes into account both parameter and model uncertainty.

Page 24: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

General Comments The RJMCMC algorithm is the most widely used

algorithm to explore and summarise a posterior distribution defined jointly over parameter and model space.

The posterior model probabilities can be sensitive to the priors specified on the parameters (p()).

The acceptance probabilities for reversible jump moves are typically lower than MH updates.

Longer simulations are generally needed to explore the posterior distribution.

Only a single Markov chain is necessary, irrespective of the number of possible models!!

Page 25: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Problem 1 Constructing efficient RJ moves can be difficult. This is particularly the case when updating the

number of age groups for the survival rates. This step involves:

Proposing a new age structure (typically add/remove a change-point);

Proposing a covariate dependence for the new age structure;

Proposing new parameter values for this new model. It can be very difficult to construct the Markov

chain with (reasonably) high acceptance rates.

Page 26: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Example: Reversibility Problem One obvious (and tried!) method for adding a

change-point would be as follows. Suppose that we propose to split age group a:c. We propose new age groups – a:b and b+1:c. Consider a small perturbation for each (non-zero)

regression parameter, e.g., ’a:b = a:c + ; and ’b+1:c = a:c - .

where ~ N(0,2).

However, to satisfy the reversibility constraints, a change-point can only be removed when the covariate dependence structure is the same for two consecutive age groups…

Page 27: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Example: High Posterior Mass An alternative proposal would be to set

’a:b = a:c (i.e. keep all parameters same for a:b).

Propose a model (in terms of covariates) for ’b+1:c: Choose each possible model with equal probability (reverse

move always possible) Choose model that differs by at most one individual and one

environmental covariate. The problem now lies in proposing “sensible”

parameter values for the new model. One approach is to use posterior estimates of the

parameters from a “saturated” model (full covariate dependence for some age structure) as the mean of the proposal distribtion.

Page 28: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Problem 2 Consider the missing covariates vmis. Then, if the covariate is present, we have,

(vmis | vobs, , data) / L(data | , vobs, vmis)

£ f(vmis, vobs | ). However, if the covariate is not present in the

model, we have,(vmis | vobs, , data) / f(vmis, vobs | ).

Thus, adding (or removing) the covariate values may be difficult, due to (potentially) fairly different posterior conditional distributions.

One way to avoid this is to simultaneously update the missing covariate values in the model move.

Page 29: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Soay Sheep Analysis We now use these techniques for analysing the

(complex) Soay sheep data. Discussion with experts identified several points

of particular interest: the age-dependence of the parameters; identification of the covariates influencing the survival

rates; whether random effects are present.

We focus on these issues on the results presented.

Page 30: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Prior Specification We place vague priors on the parameters present

in each model. Priors also need to be specified on the models. Placing an equal prior on each model places a

high prior mass on models with a large number of age groups, since the number of models increases with the number of age groups.

Thus, we specify an equal prior probability on each marginal age structure and a flat prior over the covariate dependence given the age structure; or on time-dependence.

Page 31: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Results: Survival Rates The marginal models for the age groups with

largest posterior support are:

Note that with probability 1, lambs have a distinct survival rate.

Often the models with most posterior support are close neighbours of each other.

Age-structure Posterior probability

1; 2:7; 8+ 0.943

1; 2:7; 8:9; 10+ 0.052

Page 32: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Covariate DependenceBF = 3BF = 20

BF = Bayes factor

2 4 6 8 10

0.0

0.2

0.4

0.6

0.8

1.0

Age

Po

ste

rio

r p

rob

ab

ility

Coat typeHorn typeBirth weightWeightLamb before censusLamb following census

2 4 6 8 10

0.0

0.2

0.4

0.6

0.8

1.0

Age

Po

ste

rio

r p

rob

ab

ility

NAO indexPopulation sizeMarch rainfallMarch temperatureAutumn rainfallRandom effects

Page 33: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Influence of CovariatesWeight

Age 1 Age 10+

Population size

0 200 400 600

0.0

0.2

0.4

0.6

0.8

1.0

Population size

Su

rviv

al r

ate

5 10 15 20 25 30 35

0.0

0.2

0.4

0.6

0.8

1.0

Weight

Su

rviv

al r

ate

Page 34: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Marginalisations Presenting only marginal results can hide some of

the more intricate details. This is most clearly seen from another MRR data

analysis relating to the UK lapwing population. There are two covariates – time and “fdays” – a

measure of the harshness of the winter. Without going into too many details, we had MRR

data and survey data (estimates of total population size).

An integrated analysis was performed, jointly analysing both data sets.

Page 35: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Marginal Models The marginal models with most posterior support

for the demographic parameters are: (a) 1 – 1st year survival (b) a – adult survival

Model Post. prob. Model Post. prob. 1(fdays) 0.636 a(fdays,t) 0.522

1 0.331 a(fdays) 0.407

(c) – productivity (d) – recovery rate

Model Post. prob. Model Post. prob. 0.497 (t) 0.730 (t) 0.393 (fdays,t) 0.270

This is joint work with Steve Brooks, Chiara Mazzetta and Steve Freeman

Page 36: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Warning! The previous (marginal) posterior estimates hides

some of the intricate details. The marginal models of the adult survival rate and

productivity rate are highly correlated. In particular, the joint posterior probabilities for

these parameters are:

Model Post. prob.a(fdays,t), 0.45

a(fdays), (t) 0.36

Thus, there is strong evidence that either adult survival rate OR productivity rate is time dependent.

Page 37: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Summary Bayesian techniques are widely used, and allow

complex data analyses. Covariates can be used to explain both temporal

and individual heterogeneity. However, missing values can add another layer of

complexity and the need to make additional assumptions.

The RJMCMC algorithm can explore the possible models and discriminate between competing biological hypotheses.

The analysis of the Soay sheep has stimulated new discussion with biologists, in terms of the factors that affect their survival rates.

Page 38: Recent Advances in Statistical Ecology using Computationally Intensive Methods Ruth King

Further Research The development of efficient and generic RJMCMC

algorithms. Assessing the posterior sensitivity on the model

specification for the covariates. Investigation of the missing at random

assumption for the covariates in both classical and Bayesian frameworks.

Prediction in the presence of time-varying covariate information.