CASI 2016
University of Limerick
May 16th — 18th 2016
Welcome
Dear Delegate,
On behalf of the organising committee, it is with the greatest of pleasure that I welcome you to
Limerick for the 36th Conference on Applied Statistics in Ireland. I would particularly like to welcome
our overseas visitors and those of you making your first ever visit to Limerick. Although Limerick
is Ireland’s fourth largest city, it is by far her greenest. Set in the heart of Munster, the outskirts
of the city stretch into rural Limerick and into neighbouring County Clare. If your itinerary allows,
then I would encourage you to visit the medieval King John’s Castle, the Treaty Stone, The Hunt
Museum, and the home of rugby, Thomond Park, all in the city. Bunratty Castle & Folk Park, the
Clare Glens, Killaloe, boat cruises on Lough Derg, and the megalithic site at Lough Gur are all worth
a visit.
The statistical community in Ireland has come a long way since 1981 when Trinity College Dublin
and the Royal Statistical Society Northern Ireland Local Group organised a one day meeting in the
lovely surroundings of Blessington, County Wicklow. Limerick was host for CASI in 1997, when the
Irish Statistical Association was formed. With the twentieth anniversary of the ISA to be marked next
year, it is timely to think about the reinvigoration, organisation and development of statistics in
Ireland over the next decade. I would therefore encourage you to contribute to this year’s ISA AGM
with this in mind.
On a personal note, I would like especially to thank all my colleagues on the organising committee
who have worked together to bring you what will be an excellent conference of truly international
dimension. I wish to thank everyone who agreed to chair the eight sessions, the poster judging
panel, our invited speakers and our five keynote speakers. A big thank you to all of you who have
contributed talks, posters and your time in attending CASI 2016. Finally, I wish to thank this year's
kind sponsors; each of their contributions is greatly appreciated.
I hope you have an enjoyable and rewarding conference.
Kevin Hayes
Chair of CASI 2016
CASI 2016 Organising Committee:
Aoife O’Neill, Helen Purtill, Kevin Burke, Cathal Walsh,
Aine Dooley, Ali Sheikhi, Norma Bargary, Ailish Hannigan,
Avril Hegarty, Sinead Burke, Peg Hanrahan & Kevin Brosnan.
Sponsors
Monday, 16th May
13:00 – 14:00 Lunch
14:00 – 14:10 CASI Opening Address
Session 1: Chair, Professor Cathal Walsh
Keynote Speaker
14:10 – 15:00 Linda Sharples Uncertainty in Health Economic Models: Beyond Clinical Trials
Contributed Talks
15:00 – 15:20 Lisa McCrink Biomarker discovery for chronic kidney disease: a joint modelling approach with competing risks
15:20 – 15:40 Joy Leahy Incorporating single treatment arm evidence into a network meta analysis (if you must!)
15:40 – 16:00 John Newell VisUaliSE
16:00 – 16:20 Tea/Coffee
Session 2: Chair, Professor Chris Jones
Invited Speaker
16:20 – 16:40 Alessandra Menafoglio Object Oriented Kriging in Hilbert spaces with application to particle-size densities in Bayes spaces
Contributed Talks
16:40 – 17:00 Ricardo Ehlers Bayesian inference under sparsity for generalized spatial extreme value binary regression models
17:00 – 17:20 Niamh Russell Alternatives to BIC in posterior model probability for Bayesian model averaging
17:20 – 17:40 Riccardo Rastelli Optimal Bayes decision rules in cluster analysis via greedy optimisation
17:40 – 17:50 Wine & Canapes
Poster & Lightning Talks Session: Chair, Dr. Norma Bargary
17:45 – 18:30 Lightning Talks
18:30 – 20:00 Posters & Wine Reception
20:00 – 22:00 Networking Dinner
Tuesday, 17th May (Morning)
Session 3: Chair, Professor John Hinde
Keynote Speaker
09:10 – 10:00 Laura Sangalli Functional Data Analysis, Spatial Data Analysis and Partial Differential Equations: a fruitful union
Contributed Talks
10:00 – 10:20 Alan Benson An adaptive MCMC method for multiple changepoint analysis with applications to large datasets
10:20 – 10:40 Rafael Moral Diagnostic checking for N-mixture models applied to mite abundance data
10:40 – 11:00 Bernardo Nipoti A Bayesian nonparametric approach to the analysis of clustered time-to-event data
11:00 – 11:20 Tea/Coffee
Session 4: Chair, Professor Christian Pipper
Invited Speaker
11:20 – 11:40 Chris Jones Log-location-scale-log-concave distribution for lifetime data
Contributed Talks
11:40 – 12:00 Amirhossein Jalali Confidence envelopes for the mean residual life function
12:00 – 12:20 Christopher Steele Modelling the time to type 2 diabetes related complications using a survival tree based approach
12:20 – 12:40 Shirin Moghaddam A Bayesian approach to the imputation of survival data
12:40 – 13:50 Lunch
Tuesday, 17th May (Afternoon)
Session 5: Chair, Professor Ailish Hannigan
Keynote Speaker
13:50 – 14:40 Bethany Bray Cutting-Edge Advances in Latent Class Analysis for Today's Behavioral Scientists
Contributed Talks
14:40 – 15:00 Myriam Tami EM estimation of a structural equation model
15:00 – 15:20 Arthur White Identifying patterns of learner behaviour using latent class analysis
15:20 – 15:40 Brendan Murphy Variable selection for latent class analysis with application to low back pain diagnosis
15:40 – 16:00 Tea/Coffee
Session 6: Chair, Dr. John Newell
Invited Speaker
16:00 – 16:40 David Leslie Thompson sampling for website optimisation
Contributed Talks
16:40 – 17:00 Sergio Gonzalez-Sanz Beyond machine classification: hedging predictions with confidence and credibility values
17:00 – 17:20 James Sweeney Spatial modelling of house prices in the Dublin area
17:20 – 17:40 Susana Conde Model selection in sparse multi-dimensional contingency tables
17:45 – 18:00 ISA AGM
19:00 – 19:15 Bus to Conference Dinner
19:45 – 22:00 Conference Dinner
Wednesday, 18th May
Special INSIGHT Session: Chair, Dr. Kevin Hayes
Keynote Speaker
09:10 – 10:00 Sofia Olhede Anisotropy in random fields
INSIGHT Session
Nial Friel
10:00 – 11:00 Brian Caulfield Insight centre for data analytics: A collection of short stories
Brendan Murphy
11:00 – 11:20 Tea/Coffee
Session 7: Chair, Dr. Kevin Burke
Invited Speaker
11:20 – 11:40 Christian Pipper Evaluation of multi-outcome longitudinal studies
Contributed Talks
11:40 – 12:00 Conor Donnelly A multivariate joint modelling approach to incorporate individuals' longitudinal response trajectories within the Coxian phase-type distribution
12:00 – 12:20 Katie O'Brien Breast screening and disease subtypes: a population-based analysis
12:20 – 12:40 Andrew Gordon Prediction of time until readmission to hospital of elderly patients using a discrete conditional phase-type model incorporating a survival tree
12:40 – 12:50 CASI Closing Address
12:50 – 14:00 Lunch
Monday 16th May
Session 1: Chair, Professor Cathal Walsh
Keynote Speaker
14:10 – 15:00 Linda Sharples Uncertainty in Health Economic Models: Beyond Clinical Trials
Contributed Talks
15:00 – 15:20 Lisa McCrink Biomarker discovery for chronic kidney disease: a joint modelling approach with competing risks
15:20 – 15:40 Joy Leahy Incorporating single treatment arm evidence into a network meta analysis (if you must!)
15:40 – 16:00 John Newell VisUaliSE
Uncertainty in Health Economic Models: Beyond Clinical Trials
Linda Sharples∗1
1Statistics Department, University of Leeds, United Kingdom
∗Email: [email protected]
While RCTs have become well established as an approach to establishing causal relationships between
new treatments and outcomes, it is necessary to incorporate other sources of evidence in order to
generalise findings beyond the narrow setting of a single trial. This is especially the case in the
context of health economic models, where a decision needs to be taken regarding the introduction
of a treatment into a population and this is likely to have an impact over a long time frame.
Traditional approaches to extrapolations of survival effects adopt a parametric survival function and
examine the difference between new and existing treatments. However, trial data is often of relatively
short duration and thus there is substantial uncertainty about the appropriate functional form. In
this talk, we use an approach which incorporates evidence from long term survival patterns in registry
data and that of the general population in order to extrapolate the treatment effect observed in the
trial. We implement this approach in a Bayesian evidence synthesis framework, with application to
the example of Implantable Cardioverter Defibrillators.
Biomarker Discovery for Chronic Kidney Disease: A Joint Modelling
Approach with Competing Risks
Lisa McCrink∗1, Ozgur Asar2, Helen Alderson3, Philip Kalra3 and Peter Diggle4
1CenSSOR, Queen's University Belfast, UK
2Department of Biostatistics and Health Informatics, Acibadem University, Istanbul, Turkey
3Vascular Research Group, Manchester Academic Health Sciences Centre, University of Manchester, Salford Royal NHS Foundation Trust, Salford, UK
4CHICAS, Lancaster Medical School, Lancaster University, UK
∗Email: [email protected]
Abstract: In this research, the relationship between fibroblast growth factor 23 (FGF-23), a potential
biomarker for the progression of chronic kidney disease (CKD), and the survival of CKD patients is investi-
gated. Utilising a joint modelling approach, the association between FGF-23 and the widely used biomarker,
serum creatinine, is analysed and the effect of both biomarkers changing over time on several competing
risks for CKD patients is presented.
Introduction
The prevalence of chronic kidney disease (CKD) within the UK is approximately 6-8% where a high
proportion of individuals suffer mortality due to associated cardiovascular events (Roderick, 2011).
Investigating the factors that affect the survival of CKD patients is of utmost importance. This
research will explore the association between several competing risks for CKD patients and a novel
biomarker, fibroblast growth factor 23 (FGF-23).
Data
The data analysed within this research was obtained from the Chronic Renal Insufficiency Stan-
dards Implementation Study (CRISIS), an observational study run by Salford Royal Hospital NHS
Foundation Trust, Greater Manchester. It consists of 999 patients with a total of 2,468 repeated
measurements. In a multivariate approach, FGF-23 will be studied alongside the commonly used
biomarker, serum creatinine. The influence of both biomarkers on the time to fatal or non-fatal
cardiac events, time to death due to non-cardiac reasons and time to the start of renal replacement
therapy will be analysed.
Methods
Over recent years, there has been a growing volume of literature focusing on joint modelling approaches
in the analysis of associated longitudinal and survival processes (Rizopoulos, 2012). It is common
within the literature that such approaches have analysed univariate repeated measurements and time-
to-event data. This research will expand upon such approaches through the utilisation of a bivariate
linear mixed model using correlated random effects to jointly study patients’ repeated measurements
of FGF-23 and serum creatinine. This longitudinal submodel is linked with a cause-specific hazards
model to represent the three competing risks for CKD patients discussed previously (Williamson,
2008). The parameters of both submodels are estimated using an expectation-maximisation algo-
rithm.
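As a rough illustration of the ingredients involved (not the authors' EM-based joint model), the R sketch below takes a simpler two-stage route: a mixed model is fitted to the longitudinal biomarker and the subject-level predictions are then carried into cause-specific Cox models, one per competing risk. The data frames long_data and surv_data and all variable and event names are hypothetical.

library(lme4)
library(survival)

## stage 1: subject-specific FGF-23 trajectories from a linear mixed model
stage1 <- lmer(log(fgf23) ~ time + (time | id), data = long_data)
long_data$fgf23_hat <- predict(stage1)

## carry the last fitted biomarker value per patient into the survival data
base <- merge(surv_data, aggregate(fgf23_hat ~ id, long_data, tail, 1), by = "id")

## stage 2: cause-specific hazards, one Cox model per competing risk,
## treating the other event types as censoring (event codes are hypothetical)
fits <- lapply(c("cardiac", "non_cardiac_death", "rrt"), function(ev)
  coxph(Surv(etime, event == ev) ~ fgf23_hat + age + sex, data = base))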
Discussion
The research presented utilises a joint modelling approach to demonstrate the association between
the survival of CKD patients and the novel biomarker, FGF-23. This builds upon previous work
which, through utilisation of a two-stage approach, highlighted a significant association between the
repeated measurements of FGF-23 and the risk of death and cardiovascular events for CKD patients
(Alderson, To Be Submitted).
References
Roderick, P., Roth, M. and Mindell, J. (2011). Prevalence of chronic kidney disease in England: Find-
ings from the 2009 Health Survey for England. Journal of Epidemiology and Community Health,
64(2), pp. A1 – A40.
Rizopoulos, D. (2012). Joint models for longitudinal and time-to-event data: With applications in R.
CRC Press.
Williamson, P. R., Kolamunnage-Dona, R., Philipson, P. and Marson, A. G. (2008). Joint
modelling of longitudinal and competing risks data. Statistics in Medicine, 27, pp. 6426 – 6438.
Alderson, H. V., Asar, O., Ritchie, J. P., Middleton, R., Larsson, A., Diggle, P. J., Larsson,
T. E., Kalra, P. A. (To Be Submitted). Longitudinal change in FGF-23 is associated with risk
for death and cardiovascular events but not renal replacement therapy in advanced CKD. To Be
Submitted.
Incorporating Single Treatment Arm Evidence into a Network Meta
Analysis (if you must!)
Joy Leahy ∗1 and Cathal Walsh2
1Department of Statistics, Trinity College Dublin, Ireland
2Professor of Statistics, University of Limerick, Ireland
∗Email: [email protected]
Abstract: Combining all available evidence in the form of a Mixed Treatment Comparison (MTC) is an
important tool for facilitating decision making in choosing agents in a clinical setting. Randomised Controlled
Trials (RCTs) are considered the gold standard of evidence, because potential bias is minimised by
their controlled design. However, much of the evidence available can be gained from other sources such
as one-armed, single-comparator and observational studies. Here we propose to include single arm and
single agent trials by choosing a similar arm from another trial in the network to use as a comparator arm.
By simulating trials so that the effects of treatments are known, and by sampling across the potential
matches, we vary parameters which are likely to influence the effectiveness of matching. The objective here
is to identify and examine the parameters which influence how well matching works, propose a method for
choosing matched arms, and assess whether they are likely to work better than using the RCT evidence
alone.
Introduction
There are caveats to consider when including potentially biased sources of evidence. In a well
conducted RCT we can be confident that patients are exchangeable across treatment arms as they
have been randomly assigned. Including single comparator trials breaks this randomisation. However,
one could argue that randomisation is a sufficient but not necessary condition for comparing the
effects of different treatments. Our goal is to find reliable methods of including potentially biased
sources of evidence into an MTC along with devising a test for bias.
Schmitz et al (2013) propose methods for including observational studies into an MTC. Thom et
al (2015) also investigate methods of including single arm evidence. Other relevant work includes
Hong et al (2015) which deals with absolute versus relative effects, and a follow-up discussion by
Dias and Ades (2015).
Methods
We ran a simulation study to investigate possible matching methods versus using RCT evidence
alone. We simulated the study effect, treatment effect, the effect of covariates and the number of
covariates in each arm. From this we obtained the binary response rate as follows:
$$\mathrm{Response}_{ij} = \mu_i + \delta_{t[i,j]} + \beta_1 x_{1ij} + \beta_2 x_{2ij} + \cdots + \beta_m x_{mij} + \cdots + \beta_n x_{nij},$$
where µi is the study effect in study i, δt[i,j] is the treatment effect in arm j of study i, βm is the
effect of each of the covariates, and xmij is the proportion of patients with the binary covariate m in
arm j of study i. We ran a standard logit model in OpenBUGS 3.2.3. We then compared the model's
estimate of the treatment effect by varying a number of parameters and assessed if these parameters
had an effect on whether the matched arm improved our estimate. The parameters considered were
study size, treatment effect, study effect and covariate effect.
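A rough R sketch of this simulation set-up follows (not the authors' code); the logit link used to turn the linear predictor into a response probability, the number of studies and covariates, and all parameter values are arbitrary choices for illustration.

set.seed(1)
n_stud <- 10; n_cov <- 3
mu    <- rnorm(n_stud, 0, 0.5)                  # study effects
delta <- c(0, 0.4)                              # treatment effects (reference arm = 0)
beta  <- rnorm(n_cov, 0, 0.3)                   # covariate effects
x     <- matrix(runif(n_stud * n_cov), n_stud)  # proportion with each binary covariate

arm   <- expand.grid(study = 1:n_stud, trt = 1:2)   # two-arm trials
lin   <- mu[arm$study] + delta[arm$trt] + as.vector(x[arm$study, ] %*% beta)
arm$p <- plogis(lin)                            # arm-level response probability
arm$r <- rbinom(nrow(arm), size = 100, prob = arm$p)  # responders out of 100 patients per arm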
Results
Figure 1
Matching by the covariate generally produces better estimates than randomly matching. Including single arm evidence produces better estimates than including only RCT when there is no study effect. Increasing the study effect eventually leads to the single arm evidence giving less accurate evidence. An illustration of this is shown in Figure 1.
Conclusion
Under certain conditions, including single arm
evidence can increase the accuracy of our estimates of treatment effects in an MTC. However,
we must exercise caution when including single arm estimates as they may introduce bias into the
model.
References
Dias, S., and Ades, A. E. (2016). Absolute or relative effects? Arm-based synthesis of trial data. Re-
search Synthesis Methods, 7(1), pp. 23 – 28.
Hong, H. et al (2015). A Bayesian missing data framework for generalized multiple outcome mixed
treatment comparisons. Research Synthesis Methods, 7(1), pp. 6 – 22.
Thom, HH. et al (2015). Network meta-analysis combining individual patient and aggregate data from
a mixture of study designs with an application to pulmonary arterial hypertension. BMC Medical
Research Methodology, 12, pp. 15 – 34.
Schmitz, S. et al (2013). Incorporating data from various trial designs into a mixed treatment comparison
model. Statistics in Medicine., 32, pp. 2935 – 2949.
Visualise
John Newell∗1,2, Amirhossein Jalali1,2, Shirin Moghaddam2 and Jaynal Abedin3
1HRB Clinical Research Facility, NUI Galway, Ireland
2School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
3INSIGHT Centre for Data Analytics, NUI Galway, Ireland
∗Email: [email protected]
Abstract: The ability to summarise data graphically while adhering to good statistical practice is a key
component of a statistician's training. The emergence of open source options in R and commercial applica-
tions for the creation of dynamic and interactive graphs raises questions as to whether traditional teaching
of statistics should adapt to incorporate training in these environments.
Introduction
The emergence of big data has brought with it the necessity to develop methods and software for
the generation of interactive and dynamic graphs, a renaming of graphs to visualisations and the
creation of intriguing job titles such as data visualiser, data architect, data engineer, data scientist,
data wrangler and data munger. For example, the SAS website claims that "Data visualization is going
to change the way our analysts work with data. They're going to be expected to respond to issues
more rapidly. And they'll need to be able to dig for more insights, look at data differently, more
imaginatively. Data visualization will promote that creative data exploration."
Methods
Assessment of data quality, subjective impressions with respect to the primary question of interest
and the assumptions underlying the analysis are typically achieved by preparing suitable (static)
graphs. Dynamic visualisations are often lauded as they allow the user to drill down and dive deeper
into the sample to generate insights that may be missed in static graphics. For example, a dashboard
of dynamically linked graphs summarizing data arising from a classical two-sample paired problem
is illustrated in Figure 1, with the relationship between adherence and outcome highlighted.
Discussion
In this talk dynamic visualisations across a range of platforms will be presented and their advantages
and disadvantages discussed. The question will be raised as to whether and how the teaching of
statistics should change given the availability of software and R packages for the generation of
dynamic graphs. Examples will be given of the capabilities of open source options in R, commercial
applications (Tableau and Qlik) and so-called trade-off analytics and deep learning (IBM Watson)
in a variety of clinical and sports related settings.
Figure 1: Sample Tableau Dashboard
Monday 16th May
Session 2: Chair, Professor Chris Jones
Invited Speaker
16:20 – 16:40 Alessandra Menafoglio Object Oriented Kriging in Hilbert spaces with application to particle-size densities in Bayes spaces
Contributed Talks
16:40 – 17:00 Ricardo Ehlers Bayesian inference under sparsity for generalized spatial extreme value binary regression models
17:00 – 17:20 Niamh Russell Alternatives to BIC in posterior model probability for Bayesian model averaging
17:20 – 17:40 Riccardo Rastelli Optimal Bayes decision rules in cluster analysis via greedy optimisation
Object Oriented Kriging in Hilbert spaces with application to
particle-size densities in Bayes spaces
Alessandra Menafoglio∗1, Piercesare Secchi1 and Alberto Guadagnini2
1MOX, Department of Mathematics, Politecnico di Milano, Italy
2Department of Civil and Environmental Engineering, Politecnico di Milano, Milano, Italy; Department of Hydrology and Water Resources, University of Arizona, 85721 Tucson, Arizona, USA
∗Email: [email protected]
Abstract: We present a methodology to perform Kriging of spatially dependent data of a Hilbert space.
Our object-oriented approach is conducive to the spatial prediction of the entire data-object, which is
conceptualized as a point within an infinite dimensional Hilbert space. In this communication, we focus
on the application of this broad framework to the geostatistical treatment of particle-size densities, i.e.,
probability densities that describe the distribution of grain sizes in given soil samples.
Introduction
The variety, dimensionality and complexity of the data commonly available in field studies pose new
challenges for data-driven geoscience applications. We here focus on the geostatistical treatment
of complex environmental data such as georeferenced functional data (e.g., curves or surfaces) or
distributional data (e.g., probability density functions), by pursuing an object oriented approach
(e.g., Marron and Alonso, 2014). The interpretation of the data as points within an infinite-
dimensional space offers a powerful perspective to address key issues such as estimation, prediction
and uncertainty quantification in a geostatistical setting.
Methods
Motivated by a field application dealing with particle-size data, we shall focus on the problem of the
geostatistical characterization of functional compositions (FCs). These are functions constrained to
be positive and to integrate to a constant (e.g., probability density functions). We interpret each
datum as a point within the Bayes Hilbert space of Egozcue et al. (2006) whose elements are FCs.
We here review the approach we developed in Menafoglio et al. (2014), and exploit an appropriate
notion of spatial dependence to develop a Functional Compositional Kriging (FCK) predictor. FCK
provides the best linear unbiased predictor of the data, in the sense of minimizing the prediction
error under the unbiasedness constraint. Based on our recent work (Menafoglio et al., 2015), we will
illustrate additional tools to explore and characterize the (spatial) variability of the data, including
methods for dimensionality reduction and uncertainty quantification.
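To give a concrete, if greatly simplified, sense of the Bayes-space representation, the R sketch below applies the centred log-ratio (clr) map to a discretized particle-size density; the density itself is a toy log-normal, not data from the study, and the full FCK machinery is not reproduced.

grain <- seq(0.01, 2, length.out = 100)              # grain-size grid (mm)
step  <- diff(grain)[1]
dens  <- dlnorm(grain, meanlog = -1, sdlog = 0.6)    # toy particle-size density
dens  <- dens / sum(dens * step)                     # normalise to integrate to 1

clr   <- log(dens) - mean(log(dens))                 # clr transform: unconstrained representation
back  <- exp(clr) / sum(exp(clr) * step)             # inverse clr recovers the density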
Results and Discussion
We illustrate our theoretical developments on a field study relying upon the particle-size densities
(PSDs) considered in Menafoglio et al. (2014). These data describe the local distribution of grain
sizes for 60 soil samples collected along a borehole in a shallow aquifer near the city of Tubingen,
Germany. We will show examples of Kriging predictions at the field site, and discuss the quality
of the results, assessed via cross-validation. A key advantage of our approach lies in the possibility
of obtaining predictions of the entire object at unsampled locations, as opposed to classical kriging
techniques which allow only finite-dimensional predictions, based on a set of selected features (or
synthetic indices) of the data (e.g., the mean of the density function). In general, the availability
of methods for the analysis and prediction of complex objects allows one to use the entire information
content embedded in the data, and to project this to unsampled locations in the system.
References
Egozcue, J. J., Dıaz-Barrero, J. L., and Pawlowsky-Glahn, V. (2006) Hilbert space of probability den-
sity functions based on Aitchison geometry. Acta Mathematica Sinica, English Series, 22(4), pp.
1175 – 1182.
Marron, J. S., and Alonso, A. M. (2014) Overview of object oriented data analysis. Biometrical Jour-
nal, 56, pp. 732 – 753.
Menafoglio, A., Secchi, P., and Dalla Rosa, M. (2013) A Universal Kriging predictor for spatially de-
pendent functional data of a Hilbert Space. Electronic Journal of Statistics, 7, pp. 2209 – 2240.
Menafoglio, A., Guadagnini, A., and Secchi, P. (2014) A Kriging approach based on Aitchison geom-
etry for the characterization of particle-size curves in heterogeneous aquifers. Stochastic Environ-
mental Research and Risk Assessment, 28(7), pp. 1835 – 1851.
Menafoglio, A., Guadagnini, A., and Secchi, P. (2015) Stochastic Simulation of Soil Particle-Size
Curves in Heterogeneous Aquifer Systems through a Bayes space approach. MOX-report 59/2015.
Bayesian Inference under Sparsity for Generalized Spatial Extreme Value
Binary Regression Models
Ricardo S. Ehlers∗1, Dipankar Bandyopadhyay2 and Nial Friel3
1Department of Applied Mathematics & Statistics, University of Sao Paulo, Brazil
2Department of Biostatistics, Virginia Commonwealth University, USA
3School of Mathematical Sciences, University College Dublin, Ireland
∗Email: [email protected]
Abstract: In this paper, we develop Bayesian Hamiltonian Monte Carlo inference under sparsity for spatial
generalized extreme value binary regression models. We apply our methodology to a motivating dataset
on periodontal disease.
Introduction
Consider a spatial situation where we observe a binary response yis for subject i, at site s within
subject i. We assume that Yis ∼ Bernoulli(pis) with
$$P(Y_{is} = 1) = p_{is} = 1 - \exp\left\{-\left[1 - \xi\,(\mathbf{x}_i'\boldsymbol{\beta} + \phi_{is})\right]_{+}^{-1/\xi}\right\},$$
where xi is the vector of covariates for subject i (that do not vary across space), β ∈ Rk is the
vector of covariate coefficients (fixed effects) and φis are spatially correlated random effects. We
assume that φi ∼ N(0,Σ) where Σ is the positive definite variance covariance matrix and we denote
the corresponding precision matrix by Ω = Σ−1. Instead of the usual conditionally autoregressive
(CAR) model for the spatial effects, we assume Ω = Σ−1 ∼ G-Wishart_W(κ, S) with degrees of
freedom κ and scale matrix S, constrained to have null entries for each zero in the adjacency matrix
W (Wss′ = 1 if s and s′ are neighbours and zero otherwise). Its density function is given by,
$$p(\Omega \mid W) = \frac{1}{Z_W(\kappa, S)}\, |\Omega|^{(\kappa - 2)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}(S\Omega)\right\} I(\Omega \in \mathcal{M}_W), \qquad \kappa > 2,$$
where I(·) is the indicator function, MW is the set of symmetric positive definite matrices with
null off-diagonal elements corresponding to zeroes in W and ZW (κ, S) is the normalizing constant
(Roverato, 2002). We specify S = D − ρW where D is a diagonal matrix with elements given by
the number of neighbours at each location and ρ ∈ (0, 1) controls the degree of spatial correlation.
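A small R sketch of the GEV-type inverse link defined above may help fix ideas; it is a direct transcription of the formula for p_is at a single site, not the authors' HMC code, and the inputs are arbitrary.

## GEV-type inverse link: p = 1 - exp{ -[1 - xi * eta]_+^(-1/xi) }
p_gev <- function(eta, xi) {
  z <- pmax(1 - xi * eta, 0)    # [.]_+ truncation
  1 - exp(-z^(-1 / xi))
}
p_gev(eta = c(-1, 0, 1), xi = 0.2)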
Methods
Assuming that β, φ, ξ and κ are a priori independent the joint posterior distribution is given by,
$$p(\boldsymbol{\beta}, \xi, \boldsymbol{\phi}, \kappa, \Omega \mid \mathbf{y}, X) \propto \prod_{i=1}^{n} p(\mathbf{y}_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \xi, \boldsymbol{\phi}_i)\, p(\boldsymbol{\phi}_i \mid \Omega) \times p(\Omega \mid W, \kappa)\, p(\boldsymbol{\beta})\, p(\xi)\, p(\kappa).$$
We assign the following priors: β ∼ N(0, σ²_β I_k), ξ ∼ N(0, σ²_ξ) and κ ∼ Ga(a, b). The full condi-
tional posterior distribution of Ω is again G-Wishart and we use a novel approach via Hamiltonian
Monte Carlo (HMC) methods to sample from it. The spatial effects φi and the degrees of freedom κ
have no closed form full conditionals and are sampled using Metropolis-Hastings schemes. However
the full conditional of κ depends on the normalizing constant Z(κ, S) and we use the exchange
algorithm (Murray et al., 2006). Finally, (β, ξ) are sampled using a version of Riemann manifold
HMC methods (Girolami and Calderhead, 2011).
Results
Our motivating data was provided by the Center for Oral Health Research (COHR) at the Medical
University of South Carolina. The objective of this analysis is to quantify periodontal disease status
of a population and to study the associations between disease status and patient-level covariates:
age, body mass index (BMI), gender, a diabetes marker (HbA1c) and smoking status. Preliminary
results indicate that all covariates have significant effects on the probability of disease, and likewise
for the spatial effects. The estimate of ξ indicates a strong asymmetry in the link between covariates and
disease probability.
References
Girolami, M. and Calderhead, B. (2011) Riemann manifold Langevin and Hamiltonian Monte Carlo
methods. Journal of the Royal Statistical Society B, 73, pp. 123 – 214.
Murray, I., Ghahramani, Z. and MacKay, D.J.C. (2006). MCMC for doubly-intractable distributions. In: Uncertainty in
Artificial Intelligence, AUAI Press, R. Dechter and T. Richardson (Eds.), pp. 359 – 366.
Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application
to Bayesian inference for gaussian graphical models. Scandinavian Journal of Statistics, 29, pp.
391 – 411.
Alternatives to BIC in posterior model probability for Bayesian model
averaging
Niamh Russell1,∗, Yuxin Bai1,∗, Brendan Murphy1,2
1School of Mathematical Sciences, University College Dublin, Ireland
2The Insight Centre for Data Analytics, UCD, Ireland.
∗Email: [email protected]
Abstract: Bayesian model averaging (BMA) classically uses BIC to calculate estimates of posterior model
probability. This method is sometimes objected to in the literature on the basis that it is only correct to
use BIC to select the optimal model. We propose two alternatives.
Introduction
Commonly in model-based clustering, we fit a number of competing models and base the clustering
results on the best model. This ignores the uncertainty that arises from model selection. BMA allows
for combination of clustering results across multiple models, thus accounting for model uncertainty.
However, the best method for estimating the model uncertainty is contested.
One approximation of the posterior model probabilities uses the BIC criterion to determine the
weights, where for a list of candidate models M1, . . . , MK,
$$P(M_k \mid D) \simeq \frac{\exp\left(\tfrac{1}{2}\mathrm{BIC}_k\right)}{\sum_{j=1}^{K} \exp\left(\tfrac{1}{2}\mathrm{BIC}_j\right)}. \qquad (1)$$
We propose two alternative methods. One uses BIC∗ in a similar way to Equation 1 but which
corrects for small sample size. Another method is the convex hull method (CHull) (Bulteel et al
(2013)) which produces weights as part of the algorithm.
Methods
When using BMA, suggested in Raftery (1995), to combine results across multiple models, it is
required to weight candidate models according to the posterior model probability. This technique
allows us to average across a statistic of interest, θ, say.
We propose averaging across θMk, the statistic of interest for each model, using the posterior model
probabilities from the two proposed methods. So, given model-based estimates of θ, θMk say, we have
$$\theta_{\mathrm{BMA}} = \sum_{k=1}^{K} P(M_k \mid \mathrm{Data})\, \theta_{M_k},$$
where P(Mk | Data) is calculated using three alternative methods: BIC as in Equation 1, BIC∗
(Equation 2) and CHull.
$$\mathrm{BIC}^{*} = 2\mathcal{L} - \sum_{p} \log(n_p), \qquad (2)$$
where the penalty on the log-likelihood in BIC∗ depends on the number of observations required to
estimate each parameter p in the model.
CHull is a heuristic method in which a convex hull is drawn over an ordered graph of the log-likelihood
versus the number of free parameters for a number of candidate models. The models on the
convex hull are then compared to calculate the required weights.
We propose to compare results from the three methods.
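As a minimal sketch of how the BIC-based weights of Equation 1 feed into a model-averaged estimate, the following R lines compute the weights and the average; the BIC values and model-specific estimates are made up for illustration.

bma_weights <- function(bic) {
  w <- exp(0.5 * (bic - max(bic)))   # subtract max(bic) for numerical stability
  w / sum(w)
}
bic       <- c(-1042.3, -1039.8, -1045.1)  # BIC of each candidate model (made up)
theta_hat <- c(0.31, 0.28, 0.35)           # model-specific estimates of theta (made up)
w         <- bma_weights(bic)
theta_bma <- sum(w * theta_hat)            # Bayesian model average of theta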
Discussion
We propose two alternatives to using BIC as a basis for estimating posterior model probability.
References
Bulteel, K, Wilderjans, TF, Tuerlinkx, F. and Ceulemans, E. (2013). CHull as an alternative to AIC
and BIC in the context of mixtures of factor analyzers. Behavior Research Methods, 45 3, pp.
782–791.
Fraley, C. and Raftery, A.E. (2002). Model-based clustering, discriminant analysis, and density estima-
tion. JASA, 97, pp. 611–631.
Raftery, A.E. (1995) Bayesian model selection in social research. Sociological Methodology, 25, pp.
111–164.
Optimal Bayes decision rules in cluster analysis via greedy optimisation
Riccardo Rastelli∗1 and Nial Friel1
1School of Mathematics and Statistics, University College Dublin, Ireland
∗Email: [email protected]
Abstract: In cluster analysis interest lies in capturing the partitioning of individuals into groups, such that
those belonging to the same group share similar attributes or relational profiles. Recent developments in
statistics have introduced the possibility of obtaining Bayesian posterior samples for the clustering variables
efficiently. However, the interpretation of such a collection of data is not straightforward, mainly due to
the categorical nature of the clustering variables and to the computational requirements. We consider
an arbitrary clustering context and introduce a greedy algorithm capable of finding an optimal Bayesian
clustering solution with a low computational demand and potential to scale to big data scenarios.
Introduction
The observed data characterises the profiles of a group of individuals. The population may present
a clustering structure where within each cluster individuals exhibit similar attributes or behaviours.
Each individual is characterised by a latent allocation categorical variable z defining its member-
ship. Markov Chain Monte Carlo techniques can be adopted to obtain a marginal posterior sample
Z(1), . . . ,Z(T ) for the clustering variables. However, summary statistics such as the posterior mean
or median have little meaning due to the discrete nature of the clustering variables.
A decision theoretic approach offers an alternative interpretation: a loss function L is used to assess
differences between partitions, and hence the optimal clustering solution is that minimising the
expected posterior loss, approximated by:
$$\psi(\mathbf{Z}) \approx \sum_{t=1}^{T} L\!\left(\mathbf{Z}^{(t)}, \mathbf{Z}\right). \qquad (1)$$
Related work
The 0-1 loss L(Z, Z′) = 1{Z ≠ Z′} is typically used since it implies simplifications and low computational
demands. Bertoletti et al. (2015) propose greedy routines that tackle this task efficiently.
However, the 0-1 loss is evidently short-sighted: for different partitions, the loss is equal to one re-
gardless of how different the partitions actually are. Wade and Ghahramani (2015) have addressed
this impasse, proposing to use more sensible loss functions such as those derived from information
theory. However, minimisation of the expected posterior loss requires many evaluations of the
sum on the right-hand side of Equation 1. This becomes impractical even for moderately sized datasets.
Proposed solution
In this work we focus on the wide family of loss functions that depend on the partitions only through
their contingency table. In such scenario, when a small perturbation is applied to a partition, the
variation in the corresponding loss can be evaluated in a constant time. This makes the greedy
routines of Bertoletti et al. (2015) applicable to the problem described, taking full advantage of the
computational savings to reduce the overall complexity and make such a tool widely applicable. We
present example applications to Gaussian finite mixtures and stochastic block models.
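A minimal R sketch of the quantity being minimised may be useful here (it is not the authors' greedy algorithm): it evaluates the Variation of Information loss, which depends on two partitions only through their contingency table, and averages it over hypothetical posterior samples, approximating Equation 1 up to a factor of 1/T.

vi_loss <- function(z1, z2) {
  n  <- length(z1)
  ct <- table(z1, z2) / n                    # contingency table as joint label frequencies
  p1 <- rowSums(ct); p2 <- colSums(ct)
  h  <- function(p) -sum(p[p > 0] * log(p[p > 0]))
  2 * h(ct) - h(p1) - h(p2)                  # VI = 2 H(Z1, Z2) - H(Z1) - H(Z2)
}
expected_loss <- function(z, Z_samples)      # Z_samples: hypothetical T x n matrix of labels
  mean(apply(Z_samples, 1, vi_loss, z2 = z))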
Figure 1: A toy example on a simulated Gaussian mixture: the optimal clustering according to 0-1 loss (left panel) is qualitatively different from that obtained using the Variation of Information loss (right panel).
References
Bertoletti, M., Friel N. and Rastelli, R. (2015). Choosing the number of clusters in a finite mixture
model using an exact integrated completed likelihood criterion. METRON, pp. 1 – 23.
Wade, S. and Ghahramani, Z. (2015). Bayesian cluster analysis: Point estimation and credible balls.
arXiv preprint 1505.03339
Tuesday 17th May
Session 3: Chair, Professor John Hinde
Keynote Speaker
09:10 – 10:00 Laura Sangalli Functional Data Analysis, Spatial Data Analysis and Partial Differential Equations: a fruitful union
Contributed Talks
10:00 – 10:20 Alan Benson An adaptive MCMC method for multiple changepoint analysis with applications to large datasets
10:20 – 10:40 Rafael Moral Diagnostic checking for N-mixture models applied to mite abundance data
10:40 – 11:00 Bernardo Nipoti A Bayesian nonparametric approach to the analysis of clustered time-to-event data
Functional Data Analysis, Spatial Data Analysis and
Partial Differential Equations: a fruitful union
Laura M. Sangalli
MOX - Dipartimento di Matematica, Politecnico di Milano, Italy
Email: [email protected]
Abstract: I will discuss an innovative method for the analysis of spatially distributed data, that merges
advanced statistical and numerical analysis techniques.
Spatial regression with differential regularizations
I will present a novel class of models for the analysis of spatially (or space-time) distributed data,
based on the idea of regression with differential regularizations. The models merge statistical
methodology, specifically from functional data analysis, and advanced numerical analysis techniques.
Thanks to the combination of potentialities from these different scientific areas, the proposed method
has important advantages with respect to classical spatial data analysis techniques. Spatial Regres-
sion with differential regularizations is able to efficiently deal with data distributed over irregularly
shaped domains, with complex boundaries, strong concavities and interior holes [Sangalli et al.
(2013)]. Moreover, it can comply with specific conditions at the boundaries of the problem domain
[Sangalli et al. (2013), Azzimonti et al. (2014, 2015)], which is fundamental in many applications
to obtain meaningful estimates. The proposed models have the capacity to incorporate problem-
specific a priori information about the spatial structure of the phenomenon under study [Azzimonti
et al. (2014, 2015)]; this very flexible modeling of space-variation naturally accounts for
anisotropy and non-stationarity. Space-varying covariate information is accounted for via a semipara-
metric framework. The models can also be extended to space-time data [Bernardi et al. (2016)].
Furthermore, spatial regression with differential regularizations can deal with data scattered over
non-planar domains, specifically over Riemannian manifold domains, including surface domains with
non-trivial geometries [Ettinger et al. (2016), Dassi et al. (2015), Wilhelm et al. (2016)]. This
has fascinating applications in the earth-sciences, in the life-sciences and in engineering. The use of
advanced numerical analysis techniques, and in particular of the finite element method or of isoge-
ometric analysis, makes the models computationally very efficient. The models are implemented in
the R package fdaPDE [Lila et al. (2016)].
References
Azzimonti, L., Nobile, F., Sangalli, L.M., Secchi, P. (2014). Mixed Finite Elements for spa-
tial regression with PDE penalization. SIAM/ASA Journal on Uncertainty Quantification, 2,
1, pp. 305 – 335.
Azzimonti, L., Sangalli, L.M., Secchi, P., Domanin, M., Nobile, F. (2015). Blood flow ve-
locity field estimation via spatial regression with PDE penalization. Journal of the American
Statistical Association, Theory and Methods, 110, 511, pp. 1057 – 1071.
Bernardi, M.S., Sangalli, L.M., Mazza, G., Ramsay, J.O. (2016). A penalized regression
model for spatial functional data with application to the analysis of the production of waste in
Venice province. Stochastic Environmental Research and Risk Assessment, DOI: 10.1007/s00477-
016-1237-3.
Dassi, F., Ettinger, B., Perotto, S., Sangalli, L.M. (2015). A mesh simplification strategy for
a spatial regression analysis over the cortical surface of the brain. Applied Numerical Mathe-
matics, 90, 1, pp. 111 – 131.
Ettinger, B., Perotto, S., Sangalli, L.M. (2015). Spatial regression models over two-dimensional
manifolds. Biometrika, 103, 1, pp. 71 – 88.
Lila, E., Aston, J.A.D., Sangalli, L.M. (2016). Smooth Principal Component Analysis over
two-dimensional manifolds with an application to Neuroimaging.
ArXiv:1601.03670, http://arxiv.org/abs/1601.03670.
Lila, E., Sangalli, L.M., Ramsay, J.O., Formaggia, L. (2016). fdaPDE: functional data anal-
ysis and Partial Differential Equations; statistical analysis of functional and spatial data, based
on regression with partial differential regularizations, R package version 0.1-2,
http://CRAN.R-project.org/package=fdaPDE.
Sangalli, L.M., Ramsay, J.O., Ramsay, T.O. (2013). Spatial spline regression models. Journal
of the Royal Statistical Society Ser. B, Statistical Methodology, 75, 4, pp. 681 – 703.
Wilhelm, M., Dede’, L., Sangalli, L.M., Wilhelm, P. (2016). IGS: an IsoGeometric approach
for Smoothing on surfaces. Computer Methods in Applied Mechanics and Engineering, 302,
pp. 70 – 89.
An Adaptive MCMC Method for Multiple Changepoint Analysis with
applications to Large Datasets
Alan Benson∗1 and Nial Friel1
1University College Dublin
∗Email: [email protected]
Abstract: We consider the problem of Bayesian inference for changepoints where the number and position
of the changepoints are both unknown. In particular, we consider product partition models where it is
possible to integrate out model parameters for regimes between each change point, leaving a posterior
distribution over a latent vector indicating the presence or not of a change point at each observation.
This problem has been considered by Fearnhead (2006) where one can use a filtering recursion algorithm
to make exact inference. However the complexity of this algorithm depends quadratically on the number
of observations. Our approach relies on an adaptive Markov Chain Monte Carlo (MCMC) method for
finite discrete state spaces. We develop an adaptive algorithm which can learn from the past state of the
Markov chain in order to build proposal distributions which can quickly discover where changepoints are
likely to be located. We prove that our algorithm preserves ergodicity with respect to the posterior distribution. Crucially, we
demonstrate that our adaptive MCMC algorithm is viable for large datasets for which the exact filtering
recursion approach is not. Moreover, we show that inference is possible in a reasonable time.
Introduction
A motivating example is the Well Log Data (Ruanaidh & Fitzgerald, 1996) shown in Figure 1. It
consists of 3,979 data points measuring the magnetic response of an oil well drill, used to identify changes in
rock structure with depth (time axis). These data have distinct changepoints visible by inspection but
there are many more changepoints that are not as visible without further considering the distribution
of the data.
Figure 1: Well Log Data
Problem Statement
It can be assumed that the data come from some likelihood family f(y|θ) with parameter(s) θ which
change over time. These changes occur at discrete time points τ = τ1 < · · · < τk, where k and the
individual τi values are unknown and are to be inferred. With suitable priors on τ and θ we can form a posterior
$$\pi(\boldsymbol{\tau} \mid \mathbf{y}) \propto \int_{\boldsymbol{\theta}} \prod_{j=1}^{k+1} \prod_{i=\tau_{j-1}+1}^{\tau_j} f(y_i \mid \theta_j)\, \pi(\theta_j)\, \pi(\boldsymbol{\tau})\, d\boldsymbol{\theta}. \qquad (1)$$
Methods
The purpose of this work is to develop scalable algorithms for changepoint analysis of large datasets.
We employ an adaptive Markov Chain Monte Carlo (MCMC) method to sample from the posterior
(1), leading to a novel algorithm that returns the location and number of changepoints in the data
sequence. The key idea of Adaptive MCMC (Haario, 2001) is to modify the proposal distribution
q(x′|x) in the standard Metropolis Hastings (M-H) algorithm in order to explore high probability
regions more often. Restrictions on how we modify the proposal are the essence of constructing
an efficient Adaptive MCMC that also preserves the ergodicity of the inference. The parameters
modified are the change point inclusion weights.
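Under simplifying assumptions, and without claiming to reproduce the authors' algorithm, the R sketch below illustrates one adaptive Metropolis-Hastings step on the binary inclusion vector: a position is chosen with probability given by learned weights, flipped, and accepted or rejected; log_post is a hypothetical function returning the log posterior of Equation 1.

adaptive_flip <- function(tau_ind, weights, log_post) {
  i    <- sample(length(tau_ind), 1, prob = weights)  # adaptive choice of position to flip
  prop <- tau_ind; prop[i] <- 1 - prop[i]             # flip the inclusion indicator
  ## weights are held fixed within the step, so the proposal ratio cancels
  if (log(runif(1)) < log_post(prop) - log_post(tau_ind)) prop else tau_ind
}
## between sweeps, weights can be re-estimated from the chain history, kept away
## from 0 and 1 so that every move remains possible (helping preserve ergodicity)
update_weights <- function(freq, eps = 0.05) pmin(pmax(freq, eps), 1 - eps)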
Results
Results will be presented showing the distribution of the changepoint position and the number of
changepoints in the Well Log data (Figure 1). Further results for larger datasets will be shown and
also a comparison of our algorithm to an alternative filtering recursion algorithm (Fearnhead, 2006).
References
Fearnhead, P. (2006). Exact and efficient Bayesian inference for multiple changepoint problems. Statis-
tics and computing, 16, pp. 203 – 213.
Ruanaidh & Fitzgerald (1996). Numerical Bayesian Methods Applied to Signal Processing. Springer, New York.
Haario, H., Saksman, E. and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7, pp. 223 – 242.
Diagnostic checking for N-mixture models applied to mite abundance
data
Rafael Moral∗1, John Hinde2 and Clarice Demetrio1
1Department of Exact Sciences, ESALQ/USP, Brazil
2School of Mathematics, Statistics, and Applied Mathematics, NUI-Galway, Ireland
∗Email: [email protected]
Abstract: In ecological field surveys it is often of interest to estimate the abundance of species. However,
detection is imperfect and hence it is important to model these data taking into account the ecological
processes and sampling methodologies. In this context, N-mixture models and extensions are particularly
useful, as it is possible to estimate population size and detection probabilities under different ecological
assumptions. We apply extensions of this class of models to a mite sampling study and develop methods
for assessing goodness-of-fit by proposing different types of residuals for this model class.
Introduction
It is very important in the ecological context to measure animal abundance and understand how this
abundance changes over time and space. There are different statistical models that may be used
to estimate abundance as well as site-occupancy. N-mixture models were defined by Royle (2004)
and have since been generalised, see Hostetler and Chandler (2015). Here we develop and apply
extensions of this class of models to estimate mite abundance in a field survey. So far, specific forms
of residuals and model diagnostics have not been proposed and we will develop goodness-of-fit
assessment techniques for these models.
Material and methods
The diversity and abundance of soil animals is dominated by the mites, and hence the understanding
of soil systems is directly related to the mite fauna. To study the effect of climate change on mite
abundance, a sampling study was conducted in Colombia. Mites were sampled bimonthly throughout
the year 2010 at nine sites in both a forest patch and in a pasture, totalling 6 × 9 × 2 = 108
observations.
Let nit represent mite counts for site i, i = 1, . . . , R over sampling occasion t = 1, . . . , T . We are
interested in estimating site abundance Ni, however, there is a detection (or capture) probability p
which is also unknown. Considering closed populations (i.e. no migration and constant birth and
death rates), we may assume that nit are independent and identically distributed as Binomial(Ni, p).
The approach described by Royle (2004) takes Ni to be independent and identically distributed latent
random variables with density f(Ni; θ). Integrating the binomial likelihood with respect to Ni, we
may write the likelihood function of the N-mixture model as
$$L(\theta, p \mid n_{11}, \ldots, n_{RT}) = \prod_{i=1}^{R} \left\{ \sum_{N_i = \max_t n_{it}}^{\infty} \left[ \prod_{t=1}^{T} \binom{N_i}{n_{it}} p^{n_{it}} (1-p)^{N_i - n_{it}} \right] f(N_i; \theta) \right\}. \qquad (1)$$
Sensible choices for the distribution of Ni are, for example, the Poisson and negative binomial
models.
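A direct, if naive, numerical transcription of likelihood (1) for a Poisson abundance distribution is sketched below in R, truncating the infinite sum over N_i at a large upper bound; the count matrix n and all arguments are hypothetical.

nmix_loglik <- function(lambda, p, n, Nmax = 200) {
  sum(apply(n, 1, function(ni) {              # one term per site i
    Ns <- max(ni):Nmax                        # truncate the infinite sum over N_i
    log(sum(sapply(Ns, function(N)
      prod(dbinom(ni, size = N, prob = p)) * dpois(N, lambda))))
  }))
}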
It is important to assess goodness-of-fit in this setting so that abundance may not be over- or under-
estimated. We will propose different types of residuals and develop methods to assess goodness-of-fit
for these models and check their robustness using simulation studies.
Results and discussion
Preliminary analyses showed that the mite data may be overdispersed, as the choice of the negative
binomial model for the distribution of Ni yielded a significantly better fit than a Poisson model
(likelihood ratio test statistic: 60.82 on 1 d.f., p < 0.0001). Using half-normal plots of the ordinary
conditional residuals with a simulated envelope, it appears that assuming a negative binomial model
for the latent abundance variable is more appropriate for these data.
Conclusion
N-mixture models are a valuable tool to analyse repeated count data and estimate abundance.
However, goodness-of-fit must be assessed in order to assure more reliable abundance estimates.
References
Hostetler, J.A. and Chandler, R.B. (2015). Improved state-space models for inference about spatial
and temporal variation in abundance from count data. Ecology, 96, 1713 – 1723.
Royle, J.A. (2004). N-mixture models for estimating population size from spatially replicated counts.
Biometrics, 60, 108 – 115.
A Bayesian nonparametric approach
to the analysis of clustered time-to-event data
Bernardo Nipoti∗1, Alejandro Jara2 and Michele Guindani3
1School of Computer Science and Statistics, Trinity College Dublin, Ireland
2Pontificia Universidad Catolica de Chile, Santiago, Chile
3The University of Texas MD Anderson Cancer Center, Houston, USA
∗Email: [email protected]
Abstract: We propose a clustered proportional hazard model based on the introduction of cluster-
dependent random hazard functions and on the use of mixture models induced by completely random
measures. We show that the proposed approach accommodates different degrees of association within
a cluster, which vary as a function of cluster-level and individual covariates. The behaviour of the proposed
model is illustrated by means of the analysis of simulated and real data.
Introduction
Cox’s proportional hazards (PH) model (Cox, 1972) has been widely used in the analysis of time-to-
event data. Shared frailty models (see, e.g., Hougaard, 2000) conveniently extend the PH framework
by including a group-specific random effect term (frailty) in the hazard function, so to take into
account the presence of heterogeneous clusters of subjects in the data. Frailty random variables are
typically assumed independent and identically distributed (iid). A potentially important drawback
of shared frailty models is represented by the simple marginal association structure that the model
induces, possibly not appropriate for some applications. We propose a more general and flexible
approach where cluster-specific baseline hazard functions are modelled as mixtures governed by iid
completely random measures (CRM).
Methods
Let Ti,j ∈ IR+ be the time-to-event for the ith individual in the jth cluster, with j = 1, . . . , k and
i = 1, . . . , nj, and let zi,j be a p–dimensional vector of explanatory covariates associated with the
ith individual in the jth cluster. We extend the ideas proposed by Dykstra and Laud (1981) and
propose a Bayesian model based on the assumption that the conditional hazard functions can be
expressed as a mixture model induced by iid cluster-specific random mixing distributions, i.e.
$$h_j(t) = \int_{\mathcal{Y}} k(t \mid y)\, \mu_j(dy) \quad \text{and} \quad \mu_j \mid G \overset{\text{i.i.d.}}{\sim} G, \qquad (1)$$
j = 1, . . . , k, where Y is an appropriate measurable space, k(· | ·) is a suitable kernel, µj is a CRM
on Y, such that lim_{t→∞} ∫₀ᵗ hj(s) ds = +∞ a.s., and G is the common probability law for the mixing
CRMs. We assume that, given the cluster-specific baseline hazard functions hj, j = 1, . . . , k, and a
vector β of regression coefficients, the Ti,j's are independent, following a clustered PH model with
conditional density
$$f(t \mid \mathbf{z}_{i,j}, \boldsymbol{\beta}, h_j) = \exp\{\mathbf{z}_{i,j}'\boldsymbol{\beta}\}\, h_j(t)\, \exp\left\{-\exp\{\mathbf{z}_{i,j}'\boldsymbol{\beta}\} \int_0^t h_j(u)\, du\right\},$$
that is, $T_{i,j} \mid \mathbf{z}_{i,j}, \boldsymbol{\beta}, h_j \overset{\text{ind.}}{\sim} f(\cdot \mid \mathbf{z}_{i,j}, \boldsymbol{\beta}, h_j)$.
Results
Under the proposed model we derive the expressions for the Kendall’s τ and survival ratio, popular
measures of dependence between survival times (see, e.g., Anderson et al., 1992). This allows
us to show that our approach accommodates different degrees of association within a cluster,
which vary as a function of cluster-level and individual covariates. We also show that a particular
specification of the proposed model, namely the choice of a σ-stable distribution G for the iid
CRMs in (1), has the appealing property of preserving marginally the PH structure. The behaviour
of the proposed model is illustrated by means of the analyses of simulated data as well as a real
dataset consisting of joint survival times of couples stipulating a last survivor policy with a Canadian
insurance company.
References
Anderson, J.E., Louis, T. A., Holm, N. V. and Harvald, B. (1992).
Time-dependent association measures for bivariate survival distributions, J. Am. Stat. Assoc., 87,
pp. 641 – 650.
Cox, D. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. A, 34,
pp. 187 – 202.
Dykstra, R. L. and Laud, P. (1981). A Bayesian nonparametric approach to reliability, Ann. Stat., 9,
356 – 367.
Hougaard, P. (2000). Analysis of Multivariate Survival Data. Springer, New York.
Tuesday 17th May
Session 4: Chair, Professor Christian Pipper
Invited Speaker
11:20 – 11:40 Chris Jones Log-location-scale-log-concave distribution for lifetime data
Contributed Talks
11:40 – 12:00 Amirhossein Jalali Confidence envelopes for the mean residual life function
12:00 – 12:20 Christopher Steele Modelling the time to type 2 diabetes related complications using a survival tree based approach
12:20 – 12:40 Shirin Moghaddam A Bayesian approach to the imputation of survival data
Log-Location-Scale-Log-Concave Distributions
for Lifetime Data
Chris Jones∗1
Department of Mathematics and Statistics, The Open University, U.K.
∗Email: [email protected]
Abstract: This talk concerns the sub-class of log-location-scale (LLS) models for continuous survival
and reliability data formed by restricting the density of the underlying location-scale distribution to be
log-concave (LC); hence log-location-scale-log-concave (LLSLC) models, introduced in Jones and Noufaily
(2015). These models display a number of attractive properties, one of which is the unimodality of their
density functions.
Methods
I shall concentrate on hazard functions of members of this class of distributions. Their shapes are
driven by tail properties of the underlying LLS distributions. Perhaps the most useful subset of LLSLC
models corresponds to those which allow constant, increasing, decreasing, bathtub and upside-down
bathtub shapes for their hazard functions, controlled by just two shape parameters (which principally
control the hazards’ behaviour at 0 and∞, respectively). The generalized gamma and exponentiated
Weibull distributions are particular examples thereof, for which Cox and Matheson (2014) conclude,
more generally, “that the similarity between the distributions is striking”. A third, also pre-existing,
example is the less well known power generalized Weibull distribution which I newly reparametrize
to cover each of the popular Weibull, Burr Type XII, linear hazard rate and Gompertz distributions
as special or limiting cases.
Discussion
For distributions involving shape parameters on the real line, with density function f(x), say, in
practice one necessarily incorporates all-important location, µ, and scale, σ, parameters too, via
(1/σ)f((x−µ)/σ). Similarly, for lifetime distributions with shape parameters, with hazard function
h(t), say, I contend that in practice one should incorporate scale, σ, and proportionality, β, param-
eters too, via (β/σ)h(t/σ). In the presence of covariates, this covers accelerated failure time and
proportional hazards models (with flexible parametric baseline hazards), amongst others. Practical
implementation of these ideas is in its infancy.
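As a small illustration of the scale and proportionality construction (β/σ)h(t/σ), the R sketch below applies it to a Weibull baseline hazard; the parameter values are arbitrary and this is not code from the paper.

h_weibull <- function(t, a, b) (a / b) * (t / b)^(a - 1)          # baseline hazard
h_general <- function(t, h, sigma, beta, ...) (beta / sigma) * h(t / sigma, ...)

t <- seq(0.01, 5, length.out = 200)
plot(t, h_general(t, h_weibull, sigma = 1.5, beta = 0.8, a = 2, b = 1),
     type = "l", xlab = "t", ylab = "hazard")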
References
Cox, C. and Matheson, M. (2014). A comparison of the generalized gamma and exponentiated Weibull
distributions. Statistics in Medicine, 33, pp. 3772 – 3780.
Jones, M.C. and Noufaily, A. (2015). Log-location-scale-log-concave distributions for survival and reli-
ability analysis. Electronic Journal of Statistics, 9, pp. 2732 – 2750.
Confidence Envelopes for the Mean Residual Life function
Amirhossein Jalali∗1 2, Alberto Alvarez-Iglesias 2, John Hinde1
and John Newell1 2
1School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
2HRB Clinical Research Facility, NUI Galway, Ireland
∗Email: [email protected]
Introduction
Survival analysis is a collection of statistical methods to analyse time to event data, in the presence
of censoring. The survivor function S(t), the probability of the event occurring beyond any particular
time point t, is the typical summary presented. Another function of interest is the Mean Residual
Life (MRL) function, which at any time t provides the expected remaining lifetime given survival up
to time t.
Mean Residual Life
The MRL function has been used traditionally in engineering and reliability studies and provides a
clearer answer to the question “how long do I have left?”. This characteristic of the MRL function
is particularly interesting when one tries to communicate results involving time to event data.
The MRL function is defined as the expected survival time given survival till the current time:
$$m(t) = E(T - t \mid T > t) = \frac{1}{S(t)} \int_t^{\infty} S(s)\, ds.$$
A recent paper by Alvarez-Iglesias, et al. (2015) presented an estimator of the MRL function in the
presence of non-informative right censoring. This novel semi-parametric approach combines existing
nonparametric methods and an extreme-value tail model, where the limited sample information in
the tail (up to study termination) is used to estimate the upper tail behaviour.
Variability Bands
The MRL estimator of Alvarez-Iglesias et al. (2015) is a hybrid estimator, combining a nonparametric estimate of survival with a parametric approximation of the upper tail of the survival curve. Gong and Fang (2012) derived a method of calculating the variance estimate for such hybrid estimators at the start of follow-up, i.e. t = 0. The bootstrap is an attractive option here because of the complexity of the hybrid estimator: to our knowledge, no closed-form expression for its variance is available.
An example of global and pointwise confidence envelopes is given in the following figure for an MRL estimated from simulated data arising from a Weibull distribution with increasing hazard.
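A minimal sketch of the idea (illustrative only, not the authors’ implementation, which also models the upper tail of the survival curve) bootstraps a Kaplan–Meier-based MRL estimate to obtain pointwise variability bands for simulated Weibull data; all settings are invented.

library(survival)

mrl_hat <- function(time, status, t_grid) {
  # MRL estimate based on the Kaplan-Meier survivor function, with the
  # integral truncated at the largest observed time tau
  fit <- survfit(Surv(time, status) ~ 1)
  tk <- fit$time; Sk <- fit$surv; tau <- max(tk)
  S_at <- function(t) if (t < tk[1]) 1 else Sk[max(which(tk <= t))]
  sapply(t_grid, function(t) {
    if (t >= tau || S_at(t) == 0) return(NA)
    brk <- c(t, tk[tk > t & tk < tau], tau)      # breakpoints of the step function
    sum(diff(brk) * sapply(head(brk, -1), S_at)) / S_at(t)
  })
}

set.seed(1)
n <- 200
t_true <- rweibull(n, shape = 1.5, scale = 10)   # increasing-hazard Weibull
c_time <- runif(n, 0, 25)                        # non-informative censoring
time <- pmin(t_true, c_time); status <- as.numeric(t_true <= c_time)
grid <- seq(0, 15, by = 0.5)

est <- mrl_hat(time, status, grid)
boot_reps <- replicate(500, {
  idx <- sample(n, replace = TRUE)
  mrl_hat(time[idx], status[idx], grid)
})
band <- apply(boot_reps, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)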
Conclusion
The Mean Residual Life has been suggested as a more informative graphical summary, as it may provide a clearer interpretation for both clinicians and patients. It is argued that the MRL is easier to interpret as the summary is given in units of time. This study focuses on generating variability bands to quantify the precision of the estimated MRL using the bootstrap. Additionally, a graphical tool will be presented for summarising time to event data using the MRL function.
References
Alvarez-Iglesias, A. et al. (2015). Summarising censored survival data using the mean residual life function. Statistics in Medicine, 34, pp. 1965 – 1976.
Canty, A.J. (2002). Resampling methods in R: the boot package. R News, 2(3), pp. 2 – 7.
Gong, Q. and Fang, L. (2012). Asymptotic properties of mean survival estimate based on the Kaplan–Meier curve with an extrapolated tail. Pharmaceutical Statistics, 11, pp. 135 – 140.
Newell, J. et al. (2006). Survival ratio plots with permutation envelopes in survival data problems. Computers in Biology and Medicine, 36, pp. 526 – 541.
Modelling the time to type 2 diabetes related complications using a
survival tree based approach
Christopher J. Steele∗1,2, Adele H. Marshall1,2, Anne Kouvonen2,3, Reijo Sund3, Frank Kee2
1 Centre for Statistical Science and Operational Research, School of Mathematics and Physics, Queen’s University Belfast
2 UKCRC Centre of Excellence for Public Health, Queen’s University Belfast
3 Department of Social Research, University of Helsinki
∗Email: [email protected]
Abstract: The substantial increase in the number of individuals being diagnosed with type 2 diabetes
(T2D) has caused a simultaneous increase in the prevalence of complications related to the disease. Type
2 diabetics are at a significantly greater risk of experiencing a stroke, heart disease and various other health
problems. A survival tree approach is used to identify cohorts of individuals with significantly different
time distributions from T2D diagnosis to complication. Three survival trees were built for the time until
death, amputation/revascularisation and stroke/acute myocardial infarction (AMI). Age and the presence
of a comorbidity were shown to be influential variables when modelling the time to any of the outcomes.
Introduction
The number of people that suffer from T2D has risen significantly over the past decade and there
are no signs that this sharp increase in the prevalence of the disease is going to slow down. Due
to the increase in T2D cases worldwide there has been a substantial increase in the number of
T2D related complications. Individuals with T2D are at a greater risk of experiencing numerous
conditions including heart disease, stroke, nerve damage, kidney disease and foot problems. The
aim of the proposed study is to group type 2 diabetics by their associated characteristics to give
cohorts of individuals with significantly different time to event distributions. In order to achieve this, a survival tree based approach was used.
Methods
Subjects were participants in the Diabetes in Finland (FinDM II) study, a national register-based
dataset of older adults with T2D in Finland. The objective of the analysis was to model the time
until the first recorded complication after T2D diagnosis. Hence, individuals were excluded from the
analysis if they suffered a related complication before their diagnosis of T2D. A total of 18,903 individuals met the inclusion criteria, of whom 13,835 experienced an event of interest during the follow-
up period. The events of interest were death (n=6,908; 36.5%), lower limb amputation (n=453;
2.4%), coronary revascularisation (n=868; 4.6%), stroke (n=2,861; 15.1%) and AMI (n=2,745;
14.5%). A survival tree approach was used to investigate how individual characteristics influence
the time until the event of interest occurred. The log rank test statistic was used to determine
the splitting of the tree nodes (Zhang and Singer, 1999). The variable which yielded the largest
significant log rank test statistic was chosen to make the split.
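A minimal sketch of this kind of analysis is given below (illustrative only, not the FinDM II code): it fits a survival tree with the rpart package, whose survival splitting criterion is closely related to the log-rank statistic, using hypothetical variable names.

library(survival)
library(rpart)

# d: one row per patient, with follow-up time from T2D diagnosis, an event
# indicator, and candidate splitting covariates (names are illustrative)
fit <- rpart(Surv(time, event) ~ age + gender + comorbidity + income,
             data = d, control = rpart.control(minbucket = 100, cp = 0.001))
plot(fit, uniform = TRUE); text(fit, use.n = TRUE)

# Kaplan-Meier curves and a log-rank comparison of the terminal-node cohorts
d$node <- factor(fit$where)
plot(survfit(Surv(time, event) ~ node, data = d))
survdiff(Surv(time, event) ~ node, data = d)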
Results and Discussion
It was shown that the underlying time distributions from T2D diagnosis to death and T2D related
complication were significantly different. However, the type of complication gives rise to differing
survival behaviours. Those patients experiencing an amputation or revascularisation have similar sur-
vival times which is significantly different to those patients experiencing stroke or AMI. Three survival
trees were built to investigate how individual characteristics affected the time to death, amputa-
tion/revascularisation and stroke/AMI. The trees identified cohorts of individuals with significantly
different time distributions from T2D diagnosis to the event of interest. The most influential factor
in both the death and stroke/AMI survival tree was age, where individuals aged 65 or over were at
a greater risk of experiencing an event. Gender was shown to be the most influential variable in the
amputation/revascularisation tree with age proving to be the next most significant variable. The
presence or absence of a comorbidity also played a major role in all three survival trees. Individuals aged 65 years or older who did not suffer from a comorbidity, were not in manual or non-manual employment and had low income had the shortest median time to death, while older individuals who suffered from a comorbidity and were not in any type of manual or non-manual work had the shortest median time to stroke/AMI. Older males who suffered from a comorbidity and had high levels of education had the shortest median time to amputation/revascularisation.
References
Zhang, H and Singer B. (1999). Recursive Partitioning In The Health Sciences. New York: Springer,
pp. 79-103.
A Bayesian approach to imputation of survival data
Shirin Moghaddam∗1, John Newell2 and John Hinde1
1 School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
2 HRB Clinical Research Facility and School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
∗Email: [email protected]
Abstract: In survival analysis, due to censoring, standard methods of plotting individual survival times
are invalid. Therefore, graphical display of time-to-event data usually takes the form of a Kaplan-Meier
survival plot. By treating the censored observations as missing and using imputation methods, a complete
dataset can be formed. Then standard graphics may usefully complement Kaplan-Meier plots. Here, we
consider using a Bayesian framework to present a flexible approach to impute the censored observations
using predictive distributions.
Introduction
Survival data measures time from some origin point to a particular event. One common feature of
such data is that some individuals may not experience the event during the follow-up period, giving
right-censored observations. In the presence of right censoring, simple approaches for analysis and
visualization are impracticable. Therefore, Kaplan-Meier curves, which take account of the censoring,
have become the standard graphical method for displaying survival data. But suppose that we were
to treat the censored observations as missing and use imputation to provide a complete dataset, then
both standard analysis methods and graphics could be used. One such approach was introduced by
Royston (2008) where each censored survival time is imputed by assuming a log-normal distribution.
Here, we consider using a Bayesian framework to give a more flexible approach to impute the censored
observations using predictive distributions. The use of this method is investigated for low, medium
and high censoring rates with and without covariates. The method is intended to be used for the
visual exploration and presentation of survival data. We illustrate its use for standard survivor and
hazard function plots and also for the mean residual life function, which gives a simple, interpretable
display for physicians and patients to understand the results from clinical trials.
Methods
Censored survival times may be viewed as a type of missing, or incomplete, data. Bayesian methods
are used, here taking a Weibull distribution for the survival times with lognormal and Gamma priors
for the shape and scale parameters (see Christensen et al., 2010). Using MCMC methods (e.g. in WinBUGS, see Lunn et al., 2012) we obtain simulated draws for the predicted values of the censored
observations, conditional on the observed censoring times. We can then use these predicted values
as imputed values to give complete datasets, as in standard multiple imputation methods. Standard
graphics can then be used with the imputed datasets to explore treatment effects, hazard functions,
etc., with some indication of the uncertainty due to the censoring.
The approach is easily extended to other survival distributions and Bayesian survival models.
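The imputation step can be sketched as follows (an illustration only, assuming posterior draws of the Weibull shape and scale parameters are already available from the MCMC output): each censored time is replaced by a draw from its predictive distribution, truncated below at the observed censoring time.

impute_censored <- function(cens_times, shape_draws, scale_draws) {
  sapply(cens_times, function(cc) {
    i <- sample(length(shape_draws), 1)          # pick one posterior draw
    # inverse-CDF draw from a Weibull truncated below at the censoring time cc
    u <- runif(1, pweibull(cc, shape_draws[i], scale_draws[i]), 1)
    qweibull(u, shape_draws[i], scale_draws[i])
  })
}

# One completed dataset: observed event times kept, censored times imputed
# (time and status are follow-up times and event indicators, 1 = event)
# completed <- time
# completed[status == 0] <- impute_censored(time[status == 0], shape_draws, scale_draws)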
Conclusion
In summary, we have introduced a flexible approach for imputing values from censored survival times
to give completed datasets. The datasets can then be used for standard graphical displays that may
be a useful complement to Kaplan-Meier plots of the original censored dataset.
References
Christensen, R. et al. (2010). Bayesian Ideas and Data Analysis: an introduction for scientists and
statisticians. CRC Press.
Lunn, D. et al. (2012). The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press.
Royston, P. et al. (2008). Visualizing length of survival in time-to-event studies: a complement to
Kaplan–Meier plots. Journal of the National Cancer Institute, 100, pp. 92 – 97.
Tuesday 17th May
Session 5: Chair, Professor Ailish Hannigan
Keynote Speaker
13:50 – 14:40 Bethany Bray Cutting-Edge Advances in Latent Class Analysis for Today’s Behavioral Scientists
Contributed Talks
14:40 – 15:00 Myriam Tami EM estimation of a structural equation model
15:00 – 15:20 Arthur White Identifying patterns of learner behaviour using latent class analysis
15:20 – 15:40 Brendan Murphy Variable selection for latent class analysis with application to low back pain diagnosis
Cutting-Edge Advances in Latent Class Analysis for Today’s Behavioral
Scientists
Bethany C. Bray∗1
1 The Methodology Center, College of Health and Human Development, The Pennsylvania State University
∗Email: [email protected]
Abstract: Latent class analysis (LCA) is a statistical tool that behavioural scientists are turning to with
increasing frequency to explain population heterogeneity by identifying subgroups of individuals. As appli-
cation of LCA increases in behavioural science, more complex scientific questions are being posed about
the role that class membership plays in development. Recent advances have proposed new and exciting
extensions to LCA that address some pressing methodological challenges as today’s scientists pose increas-
ingly complex questions about behavioural development. This keynote presentation will discuss two specific
advances: causal inference in LCA and LCA with a distal outcome.
Introduction
Latent class analysis (LCA) is a statistical tool that behavioural scientists are turning to with in-
creasing frequency to explain population heterogeneity by identifying subgroups of individuals. The
subgroups (i.e., classes) are comprised of individuals who are similar in their responses to a set
of observed variables; class membership is inferred from responses to the observed variables. As
application of LCA increases in behavioural science, more complex scientific questions are being
posed about the role that class membership plays in development. Addressing these questions of-
ten requires estimating associations between the latent class variable and other observed variables,
such as predictors, outcomes, moderators and mediators. In some cases, these associations can be
modelled in the context of the LCA itself. In other cases, the research questions are too complex
and cannot currently be addressed in this way. Recent methodological advances have proposed new
and exciting extensions to LCA that address some of the most pressing challenges. This keynote
presentation will discuss two advances that can be used to address the complex research questions
about development posed by today’s behavioural scientists.
Using data from the National Longitudinal Study of Adolescent to Adult Health (Add Health),
a nationally representative, longitudinal study of U.S. adolescents followed into young adulthood,
cutting-edge advances in models for causal inference in LCA and LCA with a distal outcome will be
discussed. Emphasis will be placed on the new questions that can be addressed with these methods
and how to implement them in scientists’ own work.
Causal inference in LCA
Modern causal inference methods, such as inverse propensity weighting, facilitate drawing causal
conclusions from observational data and these techniques are now commonly used in behavioural
studies. However, causal inference methods have only recently been extended to the latent variable
context to draw causal inferences about predictors of latent variables. This keynote presentation
will demonstrate the use of inverse propensity weighting to estimate the causal effect of a predictor
on latent class membership by estimating the causal effect of high risk for adolescent depression on
adult substance use latent class membership in the Add Health data.
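As a rough illustration of the weighting step (not the Add Health analysis; variable names are hypothetical), inverse propensity weights for a binary exposure can be computed as below and then carried into a weighted latent class model.

# d: one row per respondent, with a binary exposure indicator and confounders x1, x2
ps_fit <- glm(exposure ~ x1 + x2, family = binomial, data = d)
ps <- fitted(ps_fit)                                    # estimated propensity scores
w_ipw <- ifelse(d$exposure == 1, 1 / ps, 1 / (1 - ps))  # inverse propensity (ATE) weights
# w_ipw would then enter the latent class model for the outcome as case weights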
LCA with a distal outcome
The mathematical model for predicting class membership from a covariate is well-understood; ques-
tions related to associations between a latent class predictor and distal outcome, however, present
a more difficult methodological challenge. Solving this problem is a hot topic in the methodological
literature today. Advantages and disadvantages of three competing, state-of-the-art approaches to
LCA with a distal outcome will be discussed, in order to guide scientists in their own work. As a
demonstration, early risk exposure latent class membership will be linked to later binge drinking in
the Add Health data.
EM estimation of a Structural Equation Model
Myriam Tami∗1, Xavier Bry1 and Christian Lavergne1
1 Universities of Montpellier, IMAG, France
∗Email: [email protected]
Abstract: We propose an estimation method of a Structural Equation Model (SEM). It consists of viewing
the Latent Variables (LV’s) as missing data and using the EM algorithm to maximize the whole model’s
likelihood, which simultaneously provides estimates not only of the model’s coefficients, but also of the
values of LV’s. Through a simulation study, we investigate how fast and accurate the method is, and
finally apply it to real data.
Introduction
The proposed approach is an estimation method of a SEM linking latent factors. It provides estimates
of the coefficients of the model and its factors at the same time. This method departs from more
classical methods such as LISREL. In fact, LISREL mainly focuses on the covariance structure and
the LV scores estimation is based on a least-squares technique performed on mere measurement
equations. Contrary to PLS-like methods, we do not constrain factors to belong to the spaces
spanned by the Observed Variables (OV’s), but only to be normally distributed.
The model and data notations
The data consist of blocks of OV's describing the same n independent units. Y = (y_i^j) (resp. X_m = (x_i^{j,m})), with i ∈ {1, . . . , n}, j ∈ {1, . . . , q_Y} (resp. j ∈ {1, . . . , q_m}, m ∈ {1, . . . , p}), is the n × q_Y (resp. n × q_m) matrix coding the dependent block of OV's (resp. the m-th explanatory block of OV's), identified with its column vectors. T (resp. T_m) refers to an n × r_T (resp. n × r_m) matrix of covariates. For
the sake of simplicity, the SEM we handle here is a restricted one. It contains only one structural
equation, relating a dependent latent factor g, underlying block Y , to p explanatory latent factors
fm respectively underlying blocks Xm. The SEM consists of p+ 1 measurement equations and one
structural equation:
Y = T D + g b′ + ε_Y
X_m = T_m D_m + f_m a_m′ + ε_m,  ∀ m ∈ {1, . . . , p}
g = f_1 c_1 + · · · + f_p c_p + ε_g          (1)
where ε_g ∈ R^n is a disturbance vector (resp. ε_Y, ε_m are disturbance matrices) and, ∀ m ∈ {1, . . . , p}, θ = {D, D_m, b, a_m, c_1, c_2, ψ_Y, ψ_m} is the set of parameters. The main assumptions of this model are the following: the f_m are standard normal; g is normal with zero mean, and its expectation conditional on all the f_m is a linear combination of them; ε_g ∼ N(0, 1); ε_g is independent of ε_Y and ε_m, ∀ m ∈ {1, . . . , p}.
Estimation using the EM algorithm
We propose to carry out likelihood maximization through an iterative Expectation-Maximization
algorithm (Dempster et al. (1977)). If we consider factors as missing data, EM algorithm enables
us to estimate the factors. Let Z = (Y, X_1, . . . , X_p) be the OV's and h = (g, f_1, . . . , f_p) the LV's. To maximize the log-likelihood associated with the complete data, L(θ; Z, h), in the EM framework we must solve E_{h|Z}[∂L(θ; Z, h)/∂θ] = 0. Thanks to the explicit solutions of this system and the distribution of h conditional on Z we design an algorithm. It is a rapidly converging iterative
procedure starting from a good initialization. The iteration equations have been given in detail in
Bry et al. (2016) and will be presented.
Results and application
A sensitivity analysis has been performed to investigate how the quality of the estimates could be affected by the number n of observations and the number q of OV's in each block. The sample size n proved to have more impact on the quality of parameter estimation and factor reconstruction than the number of OV's. We advise using a minimum sample size of n = 100.
Conclusion
This method can estimate quickly and precisely factors of the SEM (in addition to estimating its
loadings) by maximization of the whole model’s likelihood. Various simulations and an application
on real data will be presented.
References
Bry, X., Lavergne, C. and Tami, M. (2016). EM estimation of a Structural Equation Model. In review.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39, pp. 1 – 38.
Identifying Patterns of Learner Behaviour Using Latent Class Analysis
Arthur White∗1 and Paula Carroll2
1 School of Computer Science and Statistics, Trinity College Dublin, the University of Dublin.
2 UCD Quinn School of Business, University College Dublin.
∗Email: [email protected]
Abstract: We investigate the learning patterns of students taking an introductory statistics module. Latent
class analysis is used to assess how the students interact with the different learning resources at their disposal
as the module progresses. Four behavioural groups were identified: while differing levels of face to face
attendance and online interaction existed, none of the groups engaged with online material in a timely
manner. Significant differences in levels of attainment were found to exist between groups, with an at risk
group of low engagers clearly identified.
Introduction
Learning objects are defined as any entity, digital or non-digital, that may be used for learning,
education or training. We examine how such objects are used by students in the University College
Dublin (UCD) Business School taking an introductory statistics core module called Data Analysis
for Decision Makers. The module is offered in a blended learning environment, in which, as well
as attending weekly lectures and tutorials, students use their own device to access digital learning
resources. We investigate the behavioural patterns of learning object usage, over the course of the
module, and assess how these patterns relate to attainment levels of learning outcomes.
Methods
Latent class analysis is used to identify student learning patterns. The group probability τ represents
the a priori probability that a student belongs to a particular cluster. The item probability parameter
θ represents the probability of a student interacting with a learning resource, indexed by time as
well as resource.
Denote the data X = (X_1, . . . , X_N), M-dimensional vector-valued binary random variables, composed of G groups or clusters. The observed-data likelihood can then be written:
p(X | θ, τ) = ∏_{i=1}^{N} ∑_{g=1}^{G} τ_g ∏_{m=1}^{M} θ_{gm}^{X_{im}} (1 − θ_{gm})^{1−X_{im}}.
Inference is facilitated by the introduction of the latent variable Z = (Z_1, . . . , Z_N), which indicates the cluster membership of each individual student. The complete-data likelihood is then
p(X_i, Z_i | τ, θ) = ∏_{g=1}^{G} [ τ_g ∏_{m=1}^{M} θ_{gm}^{X_{im}} (1 − θ_{gm})^{1−X_{im}} ]^{Z_{ig}}.
We applied LCA to our data using the R package BayesLCA (White and Murphy, 2014).
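A minimal sketch of such a fit (illustrative only; the data matrix and the choice of the EM routine are assumptions) using BayesLCA might look as follows.

library(BayesLCA)
# X: N x M binary matrix of weekly resource-interaction indicators (rows = students)
fit <- blca.em(X, G = 4)   # EM fit of a 4-class latent class model
fit$classprob              # estimated group probabilities (tau)
fit$itemprob               # estimated item probabilities (theta)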
Results
Four clusters were identified. The estimated group probabilities were τ = (0.34, 0.28, 0.27, 0.11).
The estimated item probability parameter, θ, is visualised in Figure 1. Based on these estimates, we
made the following interpretation: Group 1 appear to develop a preference for online material over
lectures or tutorials; Group 2 have high attendance but are slow to access online material; Group
3 have highest activity across all resources; Group 4 have low engagement overall. The ANOVA of
exam score based on this clustering was highly significant (F(3, 519) = 84.7, p < 10^−16).
Figure 1: Proportion of clusters interacting with learning resources each week.
Conclusion
Our study shows the diversity in learning behaviour among the student body and indicates that
students tailor their usage of learning resources. While many students seem to be successfully
transitioning to become self-directed learners at university, assessment prompted engagement is also
evident for a substantial proportion of students. We suggest that these students warrant further
analysis and research.
References
White, A and Murphy, T.B. (2014). BayesLCA: An R package for Bayesian latent class analysis. Jour-
nal of Statistical Software, 61(13), pp. 1 – 28.
Variable Selection for Latent Class Analysis with Application to Low
Back Pain Diagnosis
Michael Fop1, Keith Smart2 and Thomas Brendan Murphy∗1
1 School of Mathematics & Statistics and Insight Research Centre, University College Dublin, Belfield, Dublin 4, Ireland.
2 St. Vincent’s University Hospital, Dublin 4, Ireland.
∗Email: [email protected]
Abstract: The identification of the most relevant clinical criteria related to low back pain disorders is a crucial task for a quick and correct diagnosis of the nature of the pain. Data concerning low back pain can be of a categorical nature, in the form of a check-list in which each item denotes the presence or absence of a clinical
condition. Latent class analysis is a model-based clustering method for multivariate categorical responses
which can be applied to such data for a preliminary diagnosis of the type of pain. In this work we propose
a variable selection method for latent class analysis applied to the selection of the most useful variables
in detecting the group structure in the data. The method is based on the comparison of two different
models and allows the discarding of those variables with no group information and those variables carrying
the same information as the already selected ones. The method is applied to the selection of the clinical
criteria most useful for the clustering of patients in different classes of pain. It is shown to perform a
parsimonious variable selection and to give a good clustering performance.
Introduction
Low-back pain (LBP) is the musculoskeletal pain related to disorders in the lumbar spine, low back
muscles and nerves and it may radiate to the legs.
When observations are measured on categorical variables, the most common model-based clustering
method is the latent class analysis model (LCA) (Lazarsfeld and Henry, 1968). Typically all the
variables are considered in fitting the model, but often only a subset of the variables contains the
useful information about the group structure of the data.
In this work we develop a variable selection method for LCA based on the model selection framework
of Dean and Raftery (2010) which overcomes the limitation of the above independence assumption.
Model
To select the variables relevant for clustering in LCA, a stepwise model comparison approach is used.
At each step we partition the variables into:
• XC, the current set of relevant clustering variables, dependent on the cluster membership variable z,
• XP, the variable proposed to be added to or removed from the clustering variables,
• XO, the set of the other variables which are not relevant for clustering.
Figure 1: The two competing models for variable selection
The decision to add or remove the considered variable is then made by comparing two models: model M1, in which the variable is useful for clustering, and model M2, in which it is not; in M2 the proposed variable depends only on a subset XR ⊆ XC of the clustering variables. Figure 1 gives a graphical sketch of the two competing models, and an illustrative comparison step is sketched below.
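The sketch below is illustrative only (it is not the authors' implementation, which handles the regression of XP on the subset XR more carefully): the BIC of model M1 is compared with the BIC of model M2, assembled from an LCA on the current clustering variables plus a regression of the proposed variable on them. poLCA's 1/2 coding of categorical variables is assumed.

library(poLCA)   # latent class models for categorical data
library(nnet)    # multinom() for the regression in model M2

step_bic <- function(d, XC, XP, G) {
  # Model M1: the candidate variable XP clusters jointly with the current set XC
  f1 <- as.formula(paste("cbind(", paste(c(XC, XP), collapse = ","), ") ~ 1"))
  m1 <- poLCA(f1, data = d, nclass = G, verbose = FALSE)
  # Model M2: LCA on XC only, plus a regression of XP on the clustering variables
  f2 <- as.formula(paste("cbind(", paste(XC, collapse = ","), ") ~ 1"))
  m2a <- poLCA(f2, data = d, nclass = G, verbose = FALSE)
  m2b <- multinom(as.formula(paste("factor(", XP, ") ~", paste(XC, collapse = "+"))),
                  data = d, trace = FALSE)
  c(BIC_M1 = m1$bic, BIC_M2 = m2a$bic + BIC(m2b))   # prefer the model with lower BIC
}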
Conclusion
Using the variable selection method proposed here we retain 11 variables and the BIC selects a
3-class model on these; the groups closely correspond to established clinical groupings. The se-
lected variables present a good degree of separation between the three classes, which are generally
characterized by the almost full presence or almost complete absence of the selected criteria.
References
Dean, N. and Raftery, A.E. (2010). Latent Class Analysis Variable Selection. Annals of the Institute of Statistical Mathematics, 62, pp. 11 – 35.
Fop, M., Smart, K. and Murphy, T.B. (2015). Variable Selection for Latent Class Analysis with Ap-
plication to Low Back Pain Diagnosis. arXiv:1512.03350 .
Lazarsfeld, P. and Henry, N. (1968) Latent Structure Analysis, Houghton Mifflin.
Tuesday 17th May
Session 6: Chair, Dr. John Newell
Invited Speaker
16:00 – 16:40 David Leslie Thompson sampling for website optimisation
Contributed Talks
16:40 – 17:00 Sergio Gonzalez-Sanz Beyond machine classification: hedging predictions with confidence and credibility values
17:00 – 17:20 James Sweeney Spatial modelling of house prices in the Dublin area
17:20 – 17:40 Susana Conde Model selection in sparse multi-dimensional contingency tables
Thompson sampling for website optimisation
David Leslie∗1
1 Mathematics and Statistics Department, Lancaster University, United Kingdom
∗Email: [email protected]
When individuals are learning how to behave in an unknown environment, a statistically sensible
thing to do is form posterior distributions over unknown quantities of interest (such as features of
the environment and individuals’ preferences) then select an action by integrating with respect to
these posterior distributions. However, reasoning with such distributions is very troublesome, even in
a machine learning context with extensive computational resources; Savage himself indicated that
Bayesian decision theory is only sensibly used in reasonably “small” situations.
Random beliefs is a framework in which individuals instead respond to a single sample from a
posterior distribution. This is a strategy known as Thompson sampling, after its introduction in a
medical trials context by Thompson (1933), and is used by many Web providers both to select which
adverts to show you and to perform website optimisation. I will demonstrate that such behaviour
’solves’ the exploration-exploitation dilemma in a contextual bandit setting, which is the framework
used by most current applications.
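A minimal illustration of the idea for a Bernoulli bandit (e.g. choosing which of K page variants to show each visitor; the rates below are invented) is:

set.seed(1)
K <- 3
p_true <- c(0.04, 0.05, 0.07)         # unknown click-through rates
successes <- failures <- rep(0, K)    # Beta(1, 1) priors on each rate

for (visitor in 1:10000) {
  theta <- rbeta(K, successes + 1, failures + 1)   # one draw from each posterior
  arm <- which.max(theta)                          # act on the sampled beliefs
  reward <- rbinom(1, 1, p_true[arm])
  successes[arm] <- successes[arm] + reward
  failures[arm] <- failures[arm] + (1 - reward)
}
successes + failures   # pulls per variant; traffic should concentrate on the best variant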
Beyond machine classification: hedging predictions with confidence and
credibility values
Sergio Gonzalez-Sanz
1 Fashion Insights Centre, Zalando, Ireland
∗Email: [email protected]
Abstract: Supervised classification is a well-known task in the Machine Learning field, whereby a labelled training set is directly used to build a model of the underlying pattern(s) in the data. Once this model is built, the overall performance can be subsequently assessed using new data and well-known metrics such as accuracy, recall, and precision. These metrics provide an insight into how a model performs as a whole across all test data instances. However, one question remains: what is the quality of any single
prediction made by a model? Conformal predictors hedge a classifier’s predictions with confidence and
credibility measures enabling the end user to take appropriate actions according to the quality of the single
predictions. This paper describes how conformal predictions can be used to make informed decisions on
the selection of multiple competing classifiers for a given task.
Introduction
Conformal predictors were designed by Gammerman et al. (1998) using transduction rules for Support Vector Machines. Their goal was to provide a measure of the evidence found in each prediction. In order to do so, conformal predictors augment model predictions with accurate levels
of confidence and credibility (usually known as conformal measures). Later on, the methodology
was extended to work with induction (Vovk, 2013) for other ML techniques such as the nearest
neighbours, ridge regression and decision trees (Shafer and Vovk, 2008). Conformal predictors
have been successfully applied to a wide range of fields such as computer vision, nuclear fusion
(Gonzalez-Sanz, 2013) and medicine.
This approach, providing information about single predictions, differs from traditional evaluation
metrics such as accuracy or precision, which assess the quality of a model as a whole. Conformal
measures enable analysts to take informed decisions on what actions to take given the output label
of a machine learning classifier, for example whether or not to discard predicted values.
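As an illustration (a sketch only, using the common "one minus the predicted probability of the label" nonconformity score; the classifier and a held-out calibration set are assumed to exist), inductive conformal p-values, and the resulting confidence and credibility values, can be computed as follows.

conformal_measures <- function(prob_cal, y_cal, prob_new) {
  # prob_cal: calibration-set class probabilities (rows = cases, cols = labels)
  # y_cal:    calibration labels as integer column indices
  # prob_new: class probabilities for new cases
  alpha_cal <- 1 - prob_cal[cbind(seq_len(nrow(prob_cal)), y_cal)]   # nonconformity scores
  p_vals <- t(apply(prob_new, 1, function(pr) {
    sapply(seq_along(pr), function(k) {
      a_new <- 1 - pr[k]                                      # score if label k were true
      (sum(alpha_cal >= a_new) + 1) / (length(alpha_cal) + 1) # conformal p-value
    })
  }))
  list(prediction  = max.col(p_vals),                         # label with largest p-value
       credibility = apply(p_vals, 1, max),                   # largest p-value
       confidence  = 1 - apply(p_vals, 1, function(p) sort(p, decreasing = TRUE)[2]))
}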
Methods
This paper leverages conformal measures as the foundation of a novel technique for competing
multi-model evaluation and selection. The most appropriate model choice can be quantitatively
made by examining the distribution of the credibility scores of all samples compared to those of
incorrectly classified samples (false positives and false negatives). Cross-validation on all models will
also be used in order to ensure the measures obtained are stable and generalise to other independent
datasets.
In order to test this proposal, a set of competing models will be created using a large, open source,
test set. The performance values obtained from the conformal measures in each model will be then
compared against existing model selection techniques, such as the ROC curve and the overall model
accuracy and precision.
Results & Discussion
The results section will include the findings regarding the usefulness of the conformal measures as
indicators for model selection. A set containing millions of labelled webpages (obtained from DMOZ,
https://www.dmoz.org/) will be used for building a multi-class conformal predictor using multiple
models. The distribution of the errors in the test sets will be plotted as a function of the credibility.
An examination of the plots will show that the errors are located on the low credibility regions.
Thus, analysts will be able to use the credibility values to set their risk aversion strategy effectively.
These plots and different accuracy measures will be compared across different classifiers. Conformal
predictors can shed some light on the differences between two samples labelled with the same tag.
References
Gammerman, A., Vovk, V. and Vapnik, V. (1998). Learning by transduction. In: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, Madison, Wisconsin, USA, pp. 148 – 155.
Gonzalez-Sanz, S. (2013). Data mining techniques for massive databases: an application to JET and
TJ-II fusion devices. Ph.D. dissertation available at http://eprints.ucm.es/21575/1/T34490.pdf
Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Re-
search, 9, pp. 371 – 421.
Vovk, V. (2013). Conditional validity of inductive conformal predictors. Machine learning, 92(2-3), pp.
349 – 376.
Spatial Modelling Of House Prices In The Dublin Area
Dr. James Sweeney∗1
1 School of Business, UCD, Ireland
∗Email: [email protected]
Abstract: Assessments of the state of play in the Dublin housing market are mainly qualitative at present,
based on simple summaries of property prices. Here a proof of concept spatial model for house prices in
the Dublin postcode area is outlined, which is applied to a dataset of 1531 properties containing price
information and features such as size, beds, postcode, local area and spatial location. The model appears
promising for price prediction given a number of simple property features and provides some interesting
results in terms of the factors deemed important in the value of a property.
Introduction
Existing property price estimators are limited in terms of the factors they use to estimate the value
of a property for the purposes of property tax payment, being primarily based on the dwelling type,
number of bedrooms & bathrooms, as well as a comparison to nearby houses for which sales price
may be known. No uncertainty in the price prediction is typically provided, a substantial caveat given
that property price predictions in areas where property turnover is low should be highly uncertain.
Furthermore, a substantial issue of interest is whether there are subjective biases in terms of the
prices people are willing to pay for a property - for example, will people overpay for perceived good
addresses? Existing hedonic models for property prices cannot address this question because not all of the factors impacting on price are known (Gelfand et al., 2004).
General Model
Due to the unavailability of several unknown important neighbourhood characteristics in the value
of a house, we would expect that there will be spatial association remaining in the residuals of a
simple hedonic regression model, even after the inclusion of location attributes in terms of postcode
or the townland. The type of dwelling is also important, with a clear differentiation in the value of
houses and apartments for example, controlling for other factors. Visual exploration of the dataset
in terms of the size of the property and number of bedrooms reveals that the assumption of a
linear relationship between these factors and price is potentially inappropriate, in particular for larger
properties. Due to the variability in the values of properties we work on the log scale, prompting
the following model for the data, where y*_i = log(price per m²) and s_i represents spatial location (latitude, longitude):
y*_i = type_i + area_i + postcode_i + f(size_i) + f(beds_i) + g(s_i) + ε_i
ε_i ∼ N(0, σ²)
postcode_i ∼ N(0, τ₁²)
area_i ∼ N(0, τ₂²)
type_i ∼ N(0, τ₃²)
We assume that the effects of both property size and number of beds vary smoothly, assigning a
separate intrinsic random walk of order 2 for each process, i.e. f() ∼ IRW2(κ). We also assume
a smoothly varying spatial effect for g(s), assigning a Gaussian Process for this purpose.
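A related, though not identical, specification can be sketched with the mgcv package, replacing the Bayesian IRW2 and Gaussian process terms with penalised thin-plate smooths and iid random effects; variable names are assumptions.

library(mgcv)
# d: one row per property, with price, size (m2), beds, lon/lat, and factors
# postcode, area (townland) and type (dwelling type)
d$log_ppm2 <- log(d$price / d$size)                 # log(price per square metre)
fit <- gam(log_ppm2 ~ s(size) + s(beds, k = 4) +    # smooth size and beds effects
                      s(lon, lat) +                 # smooth spatial effect g(s)
                      s(postcode, bs = "re") +      # postcode random effect
                      s(area, bs = "re") +          # townland random effect
                      s(type, bs = "re"),           # dwelling-type random effect
           data = d, method = "REML")
summary(fit)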
Results
The model output is quite promising in terms of teasing out the underlying factors which are most
important in determining the value of a property. The Deviance Information Criterion (DIC) is
lowest for the specified model, illustrating the benefit of incorporating a spatial effect in addition to
parameters for size and number of beds. The spatial effect is perhaps a proxy for local effects such
as transport links and schools, or other unidentifiable features which are impossible to capture in the
data collection process. There are also significant postcode and townland effects, particularly in the
postcodes of the south side of Dublin, reflecting a consumer preference for these areas irrespective
of other factors. The number of baths in a property appears unimportant, which is an interesting
result given that existing calculators consider this to be a primary factor.
References
Gelfand, A.E., Ecker, M.D., Knight, J.R. and Sirmans, C.F. (2004). The Dynamics of Location in
Home Price. The Journal of Real Estate Finance and Economics, 29, pp. 149 – 166.
Model Selection in Sparse Multi-Dimensional Contingency Tables
Susana Conde∗1 and Gilbert MacKenzie2
1 Centre of Biostatistics, University of Manchester, UK.
2 CREST, ENSAI, Rennes, France.
∗Email: [email protected]
Abstract: We compare the Lasso method of variable selection in sparse multi-dimensional contingency
tables with a new Smooth Lasso method which casts regularization in a standard regression analysis
mould, and also with the classical backwards elimination algorithm, which is used in standard software
packages. First, we make a general methodological point in relation to model selection with interactions.
Next we undertake a simulation study which explores the ability of the three algorithms to identify the
correct model. Finally, we analyse a set of comorbidities arising in a study of obesity. The findings do not
favour the standard Lasso regardless of the optimization algorithm employed.
Introduction
Sparse contingency tables arise often in genetic, bioinformatic, medical and database applications.
Then the target is to estimate the dependence structure between the variables modelled via the
interaction terms in a log-linear model. High dimensionality will force attention on identifying
important low-order interactions — a technical advance since most model selection work relies only
on main effects. We present the Smooth Lasso (SL), a penalized likelihood, which does not require
specialized optimization algorithms such as the method of coordinate descent. It uses a convex,
parametric, analytic penalty function that asymptotically approximates the Lasso: minimization is
accomplished with standard Newton-Raphson algorithms and standard errors are available.
A Smooth Lasso
The penalized log-likelihood is ℓ_λ(θ) = ℓ(θ) − pen_λ, where pen_λ is the penalty term and λ > 0. For the Lasso, pen_λ = λ Σ_{j=2}^{p} |θ_j| (omitting the intercept term), and for the Smooth Lasso, pen_λ = λ Σ_{j=2}^{p} Q_ω(θ_j), where Q_ω(θ_j) = ω log[cosh(θ_j/ω)] for a constant ω that regulates how closely the function approximates the absolute value. Note that Q_ω(θ_j) ∈ C^∞, the set of functions that are infinitely differentiable, and is convex (Conde, 2011; Conde and MacKenzie, 2011). We then define the maximum penalised likelihood estimator (MPLE) as θ := arg max_{θ∈Θ} {ℓ(θ) − pen_λ(θ)}. We should more properly write θ_λ rather than θ, but the dependence on λ will be understood in what follows. The goal here is to estimate λ, and we use five-fold cross-validation. For the Lasso we use the method of coordinate descent and the optimization algorithm of Dahinden et al. (2007), via the R packages glmnet and logilasso respectively.
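To illustrate (a sketch under assumed data structures, not the authors' code), the smooth penalty and a penalised Poisson log-likelihood for a log-linear model can be written so that a standard quasi-Newton optimiser applies directly.

# smooth approximation Q_omega(theta) = omega * log(cosh(theta / omega)),
# written in a numerically stable form: log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)
Q_omega <- function(theta, omega = 0.05) {
  x <- theta / omega
  omega * (abs(x) + log1p(exp(-2 * abs(x))) - log(2))
}

# penalised Poisson log-likelihood for a log-linear model with design matrix X
# and observed cell counts y (the intercept, theta[1], is left unpenalised)
pen_loglik <- function(theta, X, y, lambda, omega = 0.05) {
  eta <- drop(X %*% theta)
  sum(y * eta - exp(eta)) - lambda * sum(Q_omega(theta[-1], omega))
}

# fit <- optim(rep(0, ncol(X)), pen_loglik, X = X, y = y, lambda = 1,
#              method = "BFGS", control = list(fnscale = -1))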
Results
For the Smooth Lasso one must pick a level of statistical significance, as with ordinary regression
methods. Thus SL−95 corresponds to the 5% level. We notice that the 5% level produces very poor
results when the sample size is small, but improves with increasing sample size, while the classical
Backward Elimination algorithm performs better for smaller sample sizes. The latter is very fast,
taking approximately 1 minute for 1000 simulated tables compared to hours for the Lasso methods
due to cross-validation. We also note that the Smooth Lasso estimator is sparser and closer to the
truth than the usual Lasso. However, in the analysis of obesity data, the backwards elimination
algorithm performs best overall and produces the best model as judged by the AIC.
Discussion
In the presence of interactions, Lasso methods will often fail to produce scientific models. Moreover,
it is well known that the Lasso lacks the oracle property and our results confirm this. All these issues
raise serious questions about the usefulness of Lasso methods for model selection.
References
Conde, S. (2011). Interactions: Log-Linear Models in Sparse Contingency Tables. Ph.D. thesis, University
of Limerick, Ireland.
Conde, S. & MacKenzie, G. (2011). LASSO Penalised Likelihood in High-Dimensional Contingency
Tables. In: Proceedings of the 26th International Workshop on Statistical Modelling, Valencia, D.
Conesa, A. Forte, A. et al. eds.
Dahinden, C., Parmigiani, G., Emerick, M. C., & Buhlmann, P. (2007). Penalized likelihood for
sparse contingency tables with an application to full-length cDNA libraries. BMC Bioinformatics,
8:476.
Wednesday 18th May
Special INSIGHT Session: Chair, Dr. Kevin Hayes
Keynote Speaker
09:10 – 10:00 Sofia Olhede Anisotropy in random fields
INSIGHT Session
10:00 – 11:00 Brian Caulfield, Nial Friel & Brendan Murphy Insight centre for data analytics: A collection of short stories
Anisotropy in random fields
Sofia Olhede∗1
1 Department of Statistical Science, University College London, London, United Kingdom
∗Email: [email protected]
Anisotropy is a key structural feature of many physical processes. Despite this, most theory for
the modelling and estimation of random fields is based on assuming isotropy of the observed field.
Anisotropy can arise both in the structural features of the field, and between field components. I will
discuss both forms of anisotropy, and how we may model them, parametrically for applications in
geophysics such as understanding interface-loading processes, and more generically to capture strong
directional preferences. I will also describe how we may nonparametrically identify the presence of
anisotropic features without strong structural assumptions, such as a given parametric model class.
This is joint work with Frederik Simons, David Ramirez and Peter Schreier, as well as others.
Insight centre for data analytics: A collection of short stories
Brian Caulfield1, Nial Friel∗2 and Brendan Murphy2
1 Insight Centre for Data Analytics and School of Physiotherapy and Performance Science, University College Dublin.
2 Insight Centre for Data Analytics and School of Mathematics and Statistics, University College Dublin.
∗Email: [email protected]
Abstract: Data is changing our world. The field of data analytics is progressing at a rate beyond anything
we have ever experienced. If we can tap into this new wealth of information and make decisions based on
it, we will transform the way our world works. Data analytics is a massive global research effort aimed at
taking the guesswork out of decision making in society. It has the potential to improve our approach to
everything from hospital waiting lists to energy use to advertising. At Insight Ireland, this is what we do.
We take this deluge of data and we make sense of it. Then we come up with ways to make the best use of
it for the benefit of society. At Insight Ireland, we process and use information to enable better decision
making for individuals, society and industry.
Introduction
In this session we will give a flavour of the diverse range of problems which we are working on in Insight and outline some of the research challenges which we are aiming to overcome.
Personal Analytics
Personal Analytics is a particular focus within Insight. Our research is concerned with fundamen-
tal questions related to personal sensing, measurement and understanding human behaviour and
performance, and implementation of feedback and information that is designed to enhance human
behaviour and performance. We have seen an exponential rise in our capability to measure and mon-
itor a range of human performance and behavior metrics in recent years through the development
of a large range of sensing technologies. This is irrespective of whether we are talking about the consumer wellness and fitness application space, or the formal management of health. Despite all
this progress, there is still a lot of work to do in this area. Firstly, there are still some biomedical
targets that we cannot accurately and effectively measure outside of a laboratory or clinical envi-
ronment so we need to keep developing the pipeline of new sensing technologies. As well as this,
we also need to make better progress with our capacity to better understand the data and resultant
application models associated with existing sensor technologies. And, this essentially, is what we set
out to do in the Personal Sensing group in Insight. We are addressing this gap with a programme
of interdisciplinary research that takes on the following set of high level challenges, which this talk
will outline.
Scaling Bayesian statistics for big data
One of the major issues facing practitioners is the question of how to scale Bayesian methods to
large datasets. Markov chain Monte Carlo is the de facto method of choice. However, it requires one to evaluate the likelihood function of the data twice at every iteration. Therefore it is inherently costly when the number of observations is large. Here we present Light and Widely Applicable MCMC (LWA-MCMC), a novel approximation of the Metropolis-Hastings kernel targeting a posterior
distribution for a large dataset. Inspired by Approximate Bayesian Computation, we design a Markov
chain whose transition makes use of an unknown but fixed fraction of the available data, where the
random choice of sub-sample is guided by the fidelity of this sub-sample to the observed data, as
measured by summary (or sufficient) statistics. LWA-MCMC is a generic and flexible approach, as
illustrated by the diverse set of examples which we explore. In each case LWA-MCMC yields excellent
performance and in some cases a dramatic improvement compared to existing methodologies.
Modeling network data
Network data arise when the connections between entities are the focus of the analysis. Network
data are becoming increasingly common in the big data era. We will give an overview of some recent
novel network models that have been developed within Insight. Models that account for clustered network data, temporal networks and networks of rankings will be described. Finally, recent models for hypergraph data will also be introduced.
References
Maire, F., Friel, N. and Alquier, P. (2015). Light and Widely Applicable MCMC: Approximate Bayesian
Inference for Large Datasets. arXiv:1503.04178.
Wednesday 18th May
Session 7: Chair, Dr. Kevin Burke
Invited Speaker
11:20 – 11:40 Christian Pipper Evaluation of multi-outcome longitudinal studies
Contributed Talks
11:40 – 12:00 Conor Donnelly A multivariate joint modelling approach to incorporate individuals’ longitudinal response trajectories within the Coxian phase-type distribution
12:00 – 12:20 Katie O’Brien Breast screening and disease subtypes: a population-based analysis
12:20 – 12:40 Andrew Gordon Prediction of time until readmission to hospital of elderly patients using a discrete conditional phase-type model incorporating a survival tree
Evaluation of multi-outcome longitudinal studies
Christian Bressen Pipper∗1, Signe Marie Jensen2 and Christian Ritz2
1 Department of Public Health, University of Copenhagen, Denmark.
2 Department of Nutrition, Exercise and Sports, University of Copenhagen, Denmark.
∗Email: [email protected]
Abstract: Evaluation of intervention effects on multiple outcomes is a common scenario in clinical studies.
In longitudinal studies such evaluation is a challenge if one wishes to adequately capture simultaneous
data behavior. Therefore a popular approach is to analyse each outcome separately. As a consequence
multiple statistical statements about the intervention effect need to be reported and adjustment for multiple
testing is necessary. However, this is typically done by means of the Bonferroni procedure not taking into
account the correlation between outcomes and thus resulting in overly conservative conclusions. In this
talk an alternative approach for multiplicity adjustment is proposed. The suggested approach incorporates
between outcome dependence towards an appreciably less conservative evaluation.
Introduction
In clinical intervention studies the effect of intervention is often sought to be evaluated on the
basis of multiple longitudinal outcomes. These outcomes typically represent different aspects of
the progression of a particular condition and are inherently correlated. One such example, that we
look into in this talk, is a longitudinal intervention study of the effect of consumption of different
milk proteins on health in overweight adolescents. The particular outcomes considered in this
study include profiles of body weight, BMI, waist circumference, plasma glucose and plasma insulin
(Arnberg et al., 2012). All of these outcomes are clearly biologically linked and thus expected to be
substantially correlated.
Methods
To adequately capture the data generating mechanism this apparent correlation should be addressed
at some point during the analysis of these data. However, the classical statistical approach of doing
so in terms of a simultaneous model for all outcomes quickly becomes a complicated matter involving
an excessive amount of model parameters, model assumptions, and hard to interpret measures of
intervention effect. For these reasons, the analysis of multi-outcome longitudinal studies is rarely approached by simultaneous modelling of outcomes.
A more commonly used approach to analysing such data is to model outcomes separately by means
of standard methodology such as mixed linear normal models for analysis of repeated measurements.
This has the advantage of providing an easy to understand and more robust evaluation of intervention
effect per outcome. The disadvantage is that we are subsequently faced with multiple assessments of
the intervention effect. Accordingly, if we want to make a confirmatory evaluation of the intervention
effect, where we control the familywise type 1 error, we need some kind of adjustment for multiple
testing. To this end the Bonferroni adjustment is typically applied, but, as the outcomes - and
consequently also the test statistics - are correlated, this approach may lead to overly conservative
conclusions.
Accordingly, the potential gain of utilizing correlations between test statistics is reflected in recent
developments of procedures for multiplicity adjustment the most popular one being the single-step
procedure proposed by Hothorn et al. (2008). For this procedure to work we need estimates of
correlations between test statistics. However, in the context of test statistics for different outcomes
from different models, no such estimates are readily available.
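For orientation, one way to carry out this kind of joint single-step adjustment in R is via the multcomp package's multiple-marginal-models interface (mmm/mlf), which draws on the stacking idea of Pipper et al. (2012); the sketch below is illustrative only, with invented variable names and simple linear models standing in for the mixed models of the talk.

library(multcomp)

# Separate marginal models, one per outcome (simplified to lm for the sketch);
# treatment is assumed to be a factor with levels A (reference) and B
fit_bmi <- lm(bmi ~ treatment + baseline_bmi, data = d)
fit_glu <- lm(glucose ~ treatment + baseline_glucose, data = d)

# Joint inference on the same treatment contrast across both models, with the
# single-step adjustment using the estimated between-model correlation
joint <- glht(mmm(BMI = fit_bmi, Glucose = fit_glu), mlf("treatmentB = 0"))
summary(joint)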
Results and Discussion
In this talk, we approach the analysis of multi-outcome longitudinal data by means of separate mixed
linear normal models for each outcome. Within this framework we outline how to obtain a consistent
estimator of the simultaneous asymptotic variance-covariance matrix of within and between model
fixed effect parameter estimates. This in turn enables the use of the single-step procedure proposed
by Hothorn et al (2008) to obtain an efficient evaluation of an intervention effect on multiple
longitudinal outcomes.
The derivation of the simultaneous asymptotic variance-covariance matrix is made without any
additional model assumptions. It is an extension of the methodology developed in Pipper et al
(2012), where simultaneous asymptotic behavior of estimates from multiple models was derived by
so called stacking of an asymptotic representation of the estimates.
With the methodology in place we turn to the analysis of the milk protein study. Here we discuss the
challenges and advantages of the different modelling strategies. Next, we provide an outline of the
analysis with more details on the study design and the actual statistical modelling. We also compare
evaluation of intervention effects based on the proposed multiplicity correction with evaluation based
on traditional Bonferroni correction. Finally, we remark on the applicability of our proposal in terms
of robustness, design issues such as missing values, implementation, and potential extensions. The
talk is based on the paper by Jensen et al. (2015).
References
Arnberg, K. et al. (2012). Skim Milk, Whey, and Casein Increase Body Weight and Whey and Casein Increase the Plasma C-peptide Concentration in Overweight Adolescents. The Journal of Nutrition, 142, pp. 2083 – 2090.
Hothorn, T., Bretz, F. and Westfall, P. (2008). Simultaneous inference in general parametric models. Biometrical Journal, 50, pp 346–363.
Jensen, S.M., Pipper, C.B. and Ritz, C. (2015). Evaluation of multi-outcome longitudinal
studies. Statistics in Medicine, 34, pp. 1993–2003.
Pipper, C.B., Ritz, C. and Bisgaard, H. (2012). A versatile method for confirmatory evaluation of the effects of a covariate in multiple models. Journal of the Royal Statistical Society, Series C, 61, pp. 315 – 326.
A multivariate joint modelling approach to incorporate individuals’
longitudinal response trajectories within the Coxian phase-type
distribution
Conor Donnelly∗, Lisa M. McCrink and Adele H. Marshall
Centre for Statistical Science and Operational Research (CenSSOR),
Queen’s University Belfast, Northern Ireland
∗Email: [email protected]
Abstract: This research explores the use of a two-stage approach to evaluate the effect of multiple
longitudinal response variables on some related future event outcome. In stage one, a multivariate linear
mixed effects (LME) model is utilised to represent the correlated longitudinal responses and, in stage two,
a Coxian phase-type distribution is employed to evaluate the effect of each longitudinal response on time
to event outcome. The approach is illustrated using data collected on individuals suffering from chronic
kidney disease (CKD). It was observed that both the time-varying responses, haemoglobin and creatinine
levels, have a strong predictive potential of the time to death of CKD patients.
Introduction
It is common, particularly within the medical field, for longitudinal and survival data to be collected
concurrently, with previous research showing that there typically exists some association between
both processes. For instance, when making multiple measures on various biomarkers relating to an
individual’s health condition, it would be expected that the dynamic nature of these biomarkers would
have a strong predictive potential of some related, future event outcome. In the presence of such an
association, independent analysis of one process can produce biased parameter estimates and lead to
invalid inferences. Instead, joint modelling techniques are a relatively recent statistical development
capable of considering both processes simultaneously so as to reduce this bias (Henderson et al.,
2000).
Joint models make use of two submodels to represent each of the processes of interest; typically
a linear mixed effects (LME) model is utilised to represent the longitudinal response and a Cox
proportional hazards model to represent the survival process. Such an approach has been successfully
employed, for example, to evaluate the effect of changing CD4 cell counts on the time to AIDS
diagnosis in HIV patients (Wulfsohn and Tsiatis, 1997).
This paper will consider the use of the Coxian phase-type distribution to represent the survival
process. The Coxian phase-type distribution is a special type of Markov model which represents
the time to absorption of a continuous, finite state Markov chain (Marshall and McClean, 2004).
The distribution can be used to represent an individual’s survival time as a series of distinct states
through which the individual transitions as their health condition changes. Thus, by employing this
approach, inferences can be made on the factors that affect both the individuals’ survival times and
the rates at which their condition changes.
Application
The methodology described above is implemented on a dataset collected from the Northern Ireland
Renal Information Service over a ten year period from 2002 until 2012. It consists of 1,320 pa-
tients with a total of 27,113 repeated measures. As it is of interest to evaluate the effect of both
haemoglobin (Hb) and creatinine levels on survival, a multivariate LME model with correlated ran-
dom effects is fitted in stage one. In stage two, then, the individuals’ estimated random effects are
incorporated within the Coxian phase-type distribution so as to evaluate their effect on individuals’
rates of transition through the CKD health stages, represented by the states of the distribution.
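A highly simplified sketch of the two-stage idea follows (illustrative only; variable names are invented, and a Cox model stands in for the Coxian phase-type distribution, whose fitting requires specialised code).

library(lme4)
library(survival)

## Stage 1: bivariate LME via response stacking; `long` has one row per
## measurement, with columns patient, time, marker ("hb"/"creat") and value y
fit1 <- lmer(y ~ 0 + marker + marker:time + (0 + marker | patient), data = long)
re <- as.data.frame(ranef(fit1)$patient)       # correlated per-patient random intercepts
names(re) <- sub("^marker", "b_", names(re))   # -> b_creat, b_hb
re$patient <- rownames(re)

## Stage 2: carry the estimated random effects into the survival model
## (`surv` has one row per patient with follow-up time and death indicator)
surv2 <- merge(surv, re, by = "patient")
fit2 <- coxph(Surv(stime, died) ~ b_hb + b_creat, data = surv2)
summary(fit2)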
Conclusion
The covariances of the random effects, which give a measure of the correlation between the two
response variables, indicate that Hb and creatinine are significantly correlated. By incorporating the
individuals’ random effects within the Coxian phase-type distribution, it is observed that deviations
from the population average Hb and creatinine levels have a significant impact both on individuals’
survival times and their rates of flow through the underlying stages of the disease.
References
R. Henderson, P. Diggle, and A. Dobson (2000). Joint modelling of longitudinal measurements and
event time data. Biostatistics, 1(4), pp. 465 – 480.
A. H. Marshall and S. I. McClean (2004). Using Coxian phase-type distributions to identify patient
characteristics for duration of stay in hospital. Health Care Management Science, 7(4), pp. 285 –
289.
M. S. Wulfsohn and A. A. Tsiatis (1997). A joint model for survival and longitudinal data measured
with error. Biometrics, 53(1), pp. 330 – 339.
Breast screening and disease subtypes: a population-based analysis
KM O’Brien∗1, P Fitzpatrick2,3, T Kelleher1 and L Sharp4
1National Cancer Registry, Ireland
2School of Public Health, Physiotherapy and Sports Science, University College Dublin, Ireland
3Director of Programme Evaluation, BreastCheck, Ireland
4Institute of Health & Society, Newcastle University, United Kingdom
∗Email: [email protected]
Aims
Mammographic screening affects the natural history and epidemiology of breast cancer in the popu-
lation, and population-based data are needed to improve the understanding of the impact of screening.
Previous studies have suggested differences in stage, grade and tumour size in screen-detected, com-
pared with non-screen-detected, cancers. We investigated whether there was an association between mode of detection of breast cancer and tumour characteristics, in particular disease subtype, at the population level.
Methods
We matched individual-level data from the Irish breast screening programme (BreastCheck) with
the Irish National Cancer Registry (NCR) to classify all breast cancers diagnosed in the period 2006–2011 by mode of detection: screen-detected cancers; interval cancers (i.e. cancers diagnosed within
two years of a negative mammogram); and cancers diagnosed in non-participants of the screening
programme. Information on oestrogen receptor expression (ER), progesterone receptor expression
(PR), and human epidermal growth factor receptor 2 (HER-2) was obtained from NCR records.
Subtype was defined as shown in Table 1.
Table 1: Subtype definition
Subtype                Receptor status
luminal A              ER or PR positive, and HER-2 negative
luminal B              ER or PR positive, and HER-2 positive
HER2 over-expressing   ER negative and PR negative and HER-2 positive
triple negative        ER negative and PR negative and HER-2 negative
The association of the outcome, mode of detection, with the main explanatory variable, tumour
subtype, was assessed using a multinomial logistic regression model, with screen-detected cancers
as the baseline category. The model was adjusted for socio-demographic variables (marital status,
smoking status at diagnosis and deprivation category of area of residence) and clinical variables
(stage and grade).
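As a minimal sketch of this model on simulated data with hypothetical variable names (not the NCR/BreastCheck data), nnet::multinom is one standard way to fit a multinomial logistic regression with screen-detected cancers as the baseline category:

library(nnet)

set.seed(1)
n <- 500
bc <- data.frame(
  mode        = sample(c("screen", "interval", "non-participant"), n, replace = TRUE),
  subtype     = sample(c("luminal A", "luminal B", "HER2", "triple negative"), n, replace = TRUE),
  deprivation = sample(1:5, n, replace = TRUE),
  stage       = sample(1:4, n, replace = TRUE),
  grade       = sample(1:3, n, replace = TRUE)
)
bc$mode <- relevel(factor(bc$mode), ref = "screen")   # screen-detected as baseline

# Further socio-demographic adjusters (marital status, smoking) would enter in the same way
fit <- multinom(mode ~ subtype + deprivation + stage + grade, data = bc)

s <- summary(fit)
exp(s$coefficients)                                   # odds ratios relative to screen-detected
exp(s$coefficients - 1.96 * s$standard.errors)        # approximate lower 95% limits
exp(s$coefficients + 1.96 * s$standard.errors)        # approximate upper 95% limits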
Results
In the period 2006–2011, there were 6,848 women, aged between 50 and 66, with a primary breast cancer diagnosis, of whom 45% had a screen-detected cancer, 14% had an interval cancer and 41% were not participants of the screening programme. Of those with known subtype, 75% were luminal A, 11% were luminal B, 6% were HER2 over-expressing and 8% were triple negative.
In the adjusted analysis, compared to screen-detected cancers, interval cancers were 3 times more likely to be triple negative than luminal A (OR 2.7, 95% CI [2.0, 3.7]). Similarly, non-participants were almost twice as likely to be triple negative (OR 1.9, 95% CI [1.5, 2.5]).
In addition, compared to screen-detected cancers, interval cancers were associated with a 1.7-fold increase in the odds of being HER-2 over-expressing (compared to luminal A) (95% CI [1.1, 2.4]), and for non-participants there was a 1.4-fold increase in the odds of being HER-2 over-expressing (95% CI [1.1, 1.9]).
Conclusion
In this novel study, we have provided evidence that breast cancer subtype distribution differs by
mode of detection. Since subtype is one of the major prognostic indicators, and a determinant
of treatment, our results may, in part, explain the well-known survival advantage for women with
screen-detected, rather than symptomatic, cancers. There is a need for further population-based
studies of subtype in screen-detected, interval and other (symptomatic) cancers, to determine
whether our findings hold in other healthcare settings.
Prediction of time until readmission to hospital for elderly patients using
a discrete conditional phase-type model incorporating a survival tree
Andrew S. Gordon∗1, Adele H. Marshall1, Karen J. Cairns1 and Mariangela Zenga2
1Centre for Statistical Science and Operational Research, Queen’s University Belfast, Northern Ireland, UK
2Department of Economics, Management and Statistics, Universita degli Studi di Milano-Bicocca, Italy
∗Email: [email protected]
Abstract: A feature of elderly patient care is the frequent readmissions they require to hospital. The Dis-
crete Conditional Phase-type (DC-Ph) model is a technique through which length of stay in the community
may be modelled by using a conditional component to partition patients into cohorts before representing
the resulting survival distributions by a process component, namely the Coxian phase-type distribution.
This research expands the DC-Ph family of models by introducing a survival tree as the conditional com-
ponent, with a method to predict the time taken until readmission also presented. The methodology is
demonstrated using data for elderly patients from the Lombardy and Abruzzo regions of Italy.
Introduction
With a steady rise in the length of time people are living comes an increase in the strain placed on
hospital resources. As more elderly people require healthcare resources, hospitals often discharge
patients to continue their care in the community, in an attempt to alleviate this strain. This
often leads to frequent hospital readmissions not long after discharge to the community; the result of
elderly patients not having had the necessary time to convalesce. Accurate prediction of the expected
duration spent by patients in the community before they require readmission to hospital would greatly
facilitate hospital managers in ensuring that alternative measures of community care are in place for
this time, so that readmission to hospital may be avoided. Nevertheless, the time that elderly people
stay in the community can be greatly influenced by a large number of possible circumstances. As a
result, this duration is unlikely to be homogeneous across all elderly people, meaning that obtaining
a single prediction of expected time in the community, for the elderly population as a whole, is likely
to be erroneous.
Methodology
Discrete Conditional Phase-type models (DC-Ph) are a family of models consisting of two compo-
nents; a conditional component and a process component (Marshall et al., 2007). The conditional
component is used to separate survival data into distinct classes before the process component repre-
sents the skewed survival distribution for each class using the Coxian phase-type distribution. With a
survival tree (McClean et al., 2010) used in the role of the conditional component, the DC-Ph model
is used to model time spent by discharged elderly patients in the community, prior to readmission
to hospital. Elderly people having a similar distribution of time spent in the community are grouped
together in the same class, whilst those with significantly different distributions of length of stay
are in different classes. Modelling the resulting distribution for each class through the use of the
Coxian phase-type distribution enables the rates associated with different latent subprocesses within
the overall system of community care to be determined. Furthermore, the model may be inverted
to predict the length of time spent in the community for new patients.
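A rough sketch of the conditional (survival tree) component on simulated data with hypothetical covariates; partykit::ctree is used here as a convenient stand-in for the survival tree of McClean et al. (2010), and the Coxian phase-type fit to each class is not shown:

library(survival)
library(partykit)

set.seed(1)
n <- 400
# Hypothetical discharge records: days spent in the community before readmission
dat <- data.frame(
  age  = rnorm(n, 80, 6),
  sex  = factor(sample(c("F", "M"), n, replace = TRUE)),
  ward = factor(sample(c("medical", "surgical"), n, replace = TRUE))
)
dat$days  <- rexp(n, rate = ifelse(dat$sex == "M", 1 / 90, 1 / 140))
dat$event <- rbinom(n, 1, 0.8)                  # 1 = readmitted, 0 = censored

# Conditional-inference survival tree: partitions patients into classes with
# significantly different time-to-readmission distributions
tree <- ctree(Surv(days, event) ~ age + sex + ward, data = dat)

# Terminal-node membership defines the classes whose survival distributions
# would each be represented by a Coxian phase-type distribution (not shown)
dat$class <- predict(tree, type = "node")
table(dat$class)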
Conclusion
Patient information for two regional data sets from Italy has been used to build a DC-Ph model
with a view to predicting the length of time elderly patients spend in the community between
hospital spells. Ten cohorts of patients are identified by the survival tree, each of which have
significantly different skewed survival distributions. Through the simulation of times for each of these
distributions, accurate estimates (with confidence intervals) of when newly discharged patients are
likely to require readmission to hospital may be obtained. If alternate care provision can be planned
for patients in accordance with these respective estimates, then readmission to hospital may be
avoided altogether and vital hospital resources saved.
References
Marshall, A.H. et al. (2007). Patient Activity in Hospital using Discrete Conditional Phase-type (DC-Ph)
Models. Recent Advances in Stochastic Modelling and Data Analysis, pp. 154 – 161.
McClean, S. et al. (2010). Using mixed phase-type distributions to model patient pathways. IEEE
Computer-Based Medical Systems (CBMS), 23rd International Symposium on, pp. 172 – 177.
Monday 16th May
Poster & Lightning Talks Session: Chair, Dr. Norma Bargary
1. Alberto Alvarez-Iglesias  An alternative pruning based approach to unbiased recursive partitioning algorithms
2. Idemauro Antonio Rodrigues de Lara  Ordinal transition models and a test of stationarity
3. Fiona Boland  Retention in methadone maintenance treatment (MMT) in primary care: national cohort study using proportional hazards frailty model for recurrent MMT episodes
4. Lampros Bouranis  Bayesian inference for misspecified exponential random graph models
5. Kevin Burke  Non-proportional Hazards Modelling
6. Caoimhe M. Carbery  Dynamic Bayesian networks implemented for the analysis of clinical data
7. Niamh Ducey  Cluster Analysis of Hepatitis C Viral Load Profiles in Pre-treatment Patients with Censored Infection Times
8. Jonathan Dunne  Of queues and cures: A solution to modelling the inter time arrivals of cloud outage events
9. Lida Fallah  Study of joint progressive type-II censoring in heterogeneous populations
10. John Ferguson  Extending Average Attributable Fractions
11. Olga Kalinina  Variable Selection with multiply imputed data when fitting a Cox proportional hazards model: a simulation study
12. Felicity Lamrock  Extending the msm package to derive transition probabilities for a Decision Analytic Markov Model
13. Angela McCourt  Adaptive Decision-Making using Non-Parametric Predictive Intervals
14. Meabh G. McCurdy  Identifying universal classifiers for multiple correlated outcomes, in clinical development
15. Keefe Murphy  Mixtures of Infinite Factor Analysers
16. Aoife O’Neill  Activity profiles using self-reported measures in population studies in young children: are there gender differences?
17. Amanda Reilly  Handling Missing Data in Clinical Trials
18. Davood Roshan Sangachin  Bayesian Adaptive Ranges for Clinical Biomarkers
An alternative pruning based approach to unbiased recursive partitioning
algorithms
Alberto Alvarez-Iglesias∗1, John Hinde2, John Ferguson1 and John Newell1,2
1HRB Clinical Research Facility, NUI Galway
2School of Mathematics, Statistics and Applied Mathematics, NUI Galway
∗Email: [email protected]
Abstract: A new post-pruning strategy is presented for tree based methods using unbiased recursive
partitioning algorithms. The proposed method includes a novel pruning procedure that uses a false discovery
rate (FDR) controlling procedure for the determination of splits corresponding to significant tests. The
new approach allows the automatic identification of interaction effects where other methods fail to do so.
Simulated and real-life examples will illustrate the procedure.
Introduction
Recursive partitioning algorithms are a popular tool at the exploratory stage of any data analysis
since they can generate models that can be easily interpreted with virtually no assumptions. To
avoid over-fitting, the final tree is obtained using one of the following two strategies: first grow a
large tree and then prune it back using CART-style post-pruning procedures (Breiman et al. 1984)
or use direct stopping rules based on p-values in the growing process (pre-pruning). The latter has
the advantage that variable selection is not biased towards predictors with many possible splits (see
Hothorn et al. 2006). This presentation discusses some of the drawbacks of pre-pruned trees based
on p-values in the presence of interaction effects and presents a simple solution that includes a novel
post-pruning strategy.
Methods
Pre-pruning strategies based on hypothesis tests, as in Hothorn et al. (2006), are used to protect
locally against the discovery of false positives (splits on noisy variables) at a pre-specified significance
level. They also work as a closed testing procedure where subsequent hypotheses are only assessed
if all previous ones are significant, controlling Family Wise Error Rate and preventing the tree from
over-fitting. However, due to the nested nature of the sequence of hypotheses along a tree, a stopping rule based on significance may prevent the model from testing other hypotheses further down the tree that could identify important effects, such as interactions. Solutions to this problem are available, such as increasing the significance level to grow a larger tree or adopting a CART-style post-pruning strategy, but in both cases the significance level is reduced to a simple hyper-parameter, losing its statistical meaning. The novel approach presented here allows the identification of such interactions. The new method uses an FDR controlling procedure (Benjamini and Hochberg 1995) for the determination of splits corresponding to significant tests. The proposed method considers the p-values obtained at each node globally, to control the proportion of significant tests that correspond to true alternative hypotheses. By doing so, the tests performed when growing the tree still retain a statistical interpretation and, at the same time, can be used in the pruning procedure.
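As a minimal sketch of the FDR step alone, on a hypothetical vector of split p-values collected from a grown tree (the tree-growing itself is not shown), the Benjamini-Hochberg adjustment is applied globally and splits whose adjusted p-values exceed the chosen level would be pruned away:

# Hypothetical p-values of the split tests at the inner nodes of a grown tree
node_p <- c(node1 = 0.0002, node2 = 0.0300, node3 = 0.2100,
            node4 = 0.0450, node5 = 0.6000)

# Benjamini-Hochberg adjustment applied globally across all split tests
p_adj <- p.adjust(node_p, method = "BH")

# Splits retained under FDR control at level 0.05; subtrees hanging from
# non-retained splits would be collapsed in the post-pruning step
keep <- p_adj <= 0.05
data.frame(p = node_p, p_BH = p_adj, keep = keep)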
Discussion
Table 1 shows the results of a simulation study from a model with an interaction effect and 9
additional noise variables.
Table 1: Relative comparison between the proposed method and rpart (Breiman et al. 1984) and ctree (Hothorn et al. 2006), with positive values indicating superiority of the proposed method.

         Accuracy                  Complexity
         Mean   95% CI             Mean    95% CI
rpart    37.7   (34.0, 41.5)       -65.3   (-71.8, -58.7)
ctree    43.6   (39.8, 47.3)       -87.6   (-94.2, -81.1)
As one can see, the proposed method has significantly better predictive accuracy in this setting. The drawback is an inability to pick up the first splits reliably, producing unnecessarily large trees. This is not intrinsically a problem of the proposed method but a problem of 1-step-ahead binary recursive partitioning in general, and further research is needed to provide a more desirable solution.
References
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful
approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological),
57 (1), pp. 289 – 300.
Breiman, L., Friedman, J. H., Stone, C. J., Olshen, R. A. (1984). Classification and regression trees
Boca Raton, Florida: CHAPMAN & HALL/CRC.
Hothorn, T., Hornik, K., Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference
framework. Journal of Computational and Graphical Statistics, 15 (3), pp. 651 – 674.
Ordinal transition models and a test of stationarity1
Idemauro Antonio Rodrigues de Lara∗1, John Hinde2
1Exact Sciences Department, Luiz de Queiroz College of Agriculture, University of Sao Paulo, Brazil
2School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway
∗Email: [email protected]
Abstract: In this work, we present the class of transition models to analyse longitudinal categorical data.
We consider two applications with ordinal responses, where proportional odds transition models are used
and a test to assess stationarity is proposed.
Introduction
The Markov transition models (Diggle et al. 2002) are a class of models for longitudinal data.
These models are based on stochastic processes and we consider a discrete-time discrete-state
process with a first-order Markov assumption, i.e., $\pi_{ab}(t-1, t) = \pi_{ab}(t) = P(Y_t = b \mid Y_{t-1} = a)$, with $a, b \in S = \{1, 2, \ldots, k\}$ and $t \in \tau = \{0, 1, \ldots, T\}$. In these models, a relevant issue is the assumption of stationarity, and a test to assess this is proposed.
Methodology
We consider two examples of longitudinal ordinal response data where transition models can be
applied. The first data concern respiratory condition over 5 time occasions (Koch et al., 1990). The
second dataset is from animal sciences (Castro, 2016), with 4 time occasions. We use a proportional
odds model (McCullagh, 1980) and incorporate the longitudinal dependence by including the previous
response as an additional covariate. The model is
$$\eta = \log\!\left(\frac{\gamma_{ab(t)}(x)}{1 - \gamma_{ab(t)}(x)}\right) = \lambda_{ab(t)} + \delta_t' x,$$
where $\gamma_{ab(t)}(x) = P(Y_{jt} \le b \mid Y_{a(t-1)}, x) = \pi_{a1(t)}(x) + \ldots + \pi_{ab(t)}(x)$ are the cumulative probabilities; $\lambda_{ab(t)}$ is an intercept; $x = (x_{t1}, x_{t2}, \ldots, x_{tp}, x_{t(p+1)})'$ is the vector of $(p+1)$ covariates, with $x_{t(p+1)}$ denoting the previous state; and $\delta_t' = (\beta_{t1}, \ldots, \beta_{tp}, \alpha_t)$ is a vector of unknown parameters. The general model with $\delta_t$ has time-dependent effects and, to assess stationarity, we use a likelihood-based test comparing this to a model with constant effects over time, i.e. $\delta_t = \delta_0$. A simulation study was guided by the motivating examples, i.e., we used the parameter estimates from these examples to simulate new ordinal data under two scenarios: stationary (1) and non-stationary (2).
For each scenario we performed 10,000 simulations for three different sample sizes and three time
durations. The analysis and simulation were implemented in R (R Core Team).
1This work was supported by the FAPESP, funding agency, Brazil, award 2015/02628-2
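A rough sketch of the fits and the stationarity comparison on simulated ordinal data with hypothetical variables; MASS::polr provides the proportional odds fits, and in this simplified version only the covariate and previous-state effects (not the cut-points) are allowed to vary with time:

library(MASS)

set.seed(1)
n <- 200; Tmax <- 4
d <- expand.grid(id = 1:n, time = 2:Tmax)
d$trt  <- rbinom(nrow(d), 1, 0.5)
d$prev <- factor(sample(1:3, nrow(d), replace = TRUE))                   # previous state
d$y    <- factor(sample(1:3, nrow(d), replace = TRUE), ordered = TRUE)   # current state

# Stationary model: effects constant over time
m0 <- polr(y ~ trt + prev, data = d, Hess = TRUE)
# Non-stationary model: effects allowed to differ by time occasion
m1 <- polr(y ~ (trt + prev) * factor(time), data = d, Hess = TRUE)

# Likelihood-based test of stationarity (delta_t = delta_0)
lr <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))
df <- attr(logLik(m1), "df") - attr(logLik(m0), "df")
pchisq(lr, df = df, lower.tail = FALSE)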
Results
Table 1: Rejection rates for the proposed test, resulting from 10,000 simulations, for scenario 1 (test size) and scenario 2 (test power).

                      T=4                     T=5                     T=6
Level                 10%    5%     1%        10%    5%     1%        10%    5%     1%
N=50   Scenario (1)   0.1113 0.0583 0.0135    0.0764 0.0336 0.0064    0.0649 0.0280 0.0051
       Scenario (2)   1.0000 1.0000 1.0000    0.4841 0.3356 0.1359    0.5206 0.3740 0.1536
N=100  Scenario (1)   0.1149 0.0576 0.0122    0.1137 0.0563 0.0109    0.1070 0.0559 0.0113
       Scenario (2)   1.0000 1.0000 1.0000    0.9577 0.9171 0.7664    0.9774 0.9513 0.8492
N=500  Scenario (1)   0.1089 0.0546 0.0098    0.1047 0.0507 0.0107    0.1084 0.0557 0.0115
       Scenario (2)   1.0000 1.0000 1.0000    1.0000 1.0000 1.0000    1.0000 1.0000 1.0000
The proposed test is simple to apply and the results of the simulation show good performance. The results are quite close to those of the classical goodness-of-fit type test of Anderson and Goodman (1957).
References
Anderson, T.W., Goodman, L.A. (1957) Statistical Inference about Markov Chains. Annals of Mathe-
matical Statistics, 28: 89–110.
Castro, A.C. (2016). Comportamento e desempenho sexual de suínos reprodutores em ambientes en-
riquecidos, PhD. dissertation. Brazil: University of Sao Paulo.
Diggle, P.J., Heagerty, P.J., Liang, K.Y., Zeger, S.L. (2002). Analysis of longitudinal data. New
York: Oxford University Press.
Koch, G.C., Carr, G.J., Amara, I.A., Stokes, M.E., Uryniak, T.J. (1990). Categorical Data Analy-
sis. In: Statistical Methodology in the Pharmaceutical Sciences. New York: Marcel Dekker, Chapter
13, pp. 389 – 473.
McCullagh, P. (1980). Regression Methods for Ordinal Data. Journal of The Royal Statistical Society,
42, pp. 109 – 142.
R Core Team. (2015). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria.
URL http://www.R-project.org/.
Retention in methadone maintenance treatment (MMT) in primary care:
national cohort study using proportional hazards frailty model for
recurrent MMT episodes
Grainne Cousins1, Fiona Boland∗2, Joseph Barry3, Suzi Lyons4, Kathleen Bennett5 and Tom Fahey6
1School of Pharmacy, Royal College of Surgeons in Ireland (RCSI), Dublin, Ireland
2HRB Centre for Primary Care Research, Royal College of Surgeons in Ireland
3Trinity College Centre for Health Sciences, Tallaght Hospital, Dublin, Ireland
4Health Research Board, Dublin, Ireland
5Division of Population Health Sciences, Royal College of Surgeons in Ireland
∗Email: [email protected]
Abstract: The objective of this study was to identify determinants of time to discontinuation of methadone
maintenance treatment (MMT) across multiple treatment episodes in primary care. All patients on a na-
tional methadone treatment register aged 16-65 years between 2004 and 2010 were included. Proportional
hazards frailty models were developed to assess factors associated with time to discontinuation from re-
current MMT episodes. A total of 6,393 patients experienced 19,715 treatment episodes. Median daily
doses over 60mgs and having more than 20% of methadone dispensed as supervised consumption were
associated with longer treatment episodes. Patients experiencing multiple treatment episodes tended to
stay in treatment for progressively longer periods of time.
Introduction
Opiate users have a high risk of premature mortality (Degenhardt et al., 2013). Ireland was identified
as having the third highest prevalence of opiate use in Europe (0.72%) (European Monitoring
Centre for Drugs and Drug Addiction, 2013) and the number of cases entering treatment continues
to increase. However, the overall number of deaths from overdose of opiates has not decreased
(Lyons et al., 2014). Retention in methadone maintenance treatment (MMT) is associated with
reduced mortality and therefore the objective of this study was to identify determinants of time to
discontinuation of MMT across multiple treatment episodes in primary care.
Methods
We identified people registered on the Central Treatment List (CTL), a national register of patients
in MMT, who were prescribed and dispensed at least one prescription for methadone between
August 2004 and December 2010. The outcome measure was time to discontinuation of MMT.
A patient was defined as ‘on treatment’ based on the coverage of their methadone prescriptions.
If there was a gap of 7 days, a patient was considered to have ceased treatment. Median daily
methadone dose and proportion of methadone scripts per treatment episode which were dispensed
under supervised consumption were included as possible determinants of time to discontinuation of
MMT. Age, gender and comorbidities were included as potential confounders. Proportional hazards
gamma frailty models were fitted to account for the dependence in the length of individuals’ repeated
treatment episodes.
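A minimal sketch of such a frailty model on simulated episode-level data with hypothetical column names; the gamma frailty term in the survival package is used to account for the dependence between a patient's repeated episodes:

library(survival)

set.seed(1)
n_pat <- 300
episodes <- do.call(rbind, lapply(1:n_pat, function(i) {
  k <- sample(1:4, 1)                                   # number of episodes for patient i
  data.frame(patient_id  = i,
             dose_cat    = sample(c("<60mg", "60-120mg", ">120mg"), k, replace = TRUE),
             supervision = sample(c("<20%", "20-39%", ">=40%"), k, replace = TRUE),
             age         = rnorm(1, 35, 8),
             days        = rexp(k, rate = 1 / 140),     # episode length
             ceased      = rbinom(k, 1, 0.8))           # 1 = discontinued, 0 = ongoing
}))

fit <- coxph(Surv(days, ceased) ~ dose_cat + supervision + age +
               frailty(patient_id, distribution = "gamma"),
             data = episodes)
summary(fit)   # hazard ratios below 1 correspond to longer time to discontinuation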
Results
6,393 patients experienced 19,715 treatment episodes. The overall median episode length was 140 days (IQR: 38-412), with 19.5% of all episodes ongoing at the end of follow-up. Compared to <60 mg, median daily doses >60 mg (60-120 mg: hazard ratio (HR)=0.47, 95% CI 0.45-0.50; >120 mg: HR=0.62, 95% CI 0.53-0.72), and having greater than 20% of methadone scripts dispensed as supervised consumption (compared to <20%), were associated with longer treatment episodes (20-39% of scripts: HR=0.36, 95% CI 0.33-0.38; 40-59% of scripts: HR=0.24, 95% CI 0.22-0.27; 60-79% of scripts: HR=0.25, 95% CI 0.22-0.27; >80% of scripts: HR=0.28, 95% CI 0.26-0.30). Patients experiencing multiple treatment episodes tended to stay in treatment for progressively longer periods of time.
Conclusion
The prescription of higher daily doses of methadone, and regular supervised consumption can increase
MMT retention.
References
Degenhardt L., Bucello C., Mathers B., Briegleb C., Ali H., Hickman M. et al. (2011) Mortality
among regular or dependent users of heroin and other opioids: a systematic review and meta-analysis
of cohort studies. Addiction 106: pp. 32 - 51.
European Monitoring Centre for Drugs and Drug Addiction (EMCDDA). (2013) European Drug
Report: Trends and developments. Euro Surveill 18: pp. 29 - 47.
Lyons S., Lynn E., Walsh S., Long J. (2014) Drug-related deaths and deaths among drug users in Ire-
land. HRB Trends Series 4. Dublin: Health Research Board.
Bayesian inference for misspecified exponential random graph models
Lampros Bouranis∗1, Nial Friel1 and Florian Maire1
1School of Mathematics and Statistics & Insight Centre for Data Analytics, University College
Dublin, Ireland
∗Email: [email protected]
Abstract: In this work, we explored Bayesian inference of exponential random graph models with tractable
approximations to the true likelihood and we applied our methodology in real networks of increased com-
plexity. Our work is involved with the pseudolikelihood function which is algebraically identical to the
likelihood for a logistic regression. Naive implementation of a posterior from such a misspecified model
is likely to give misleading inferences. We provide background theory and practical guidelines for efficient
correction of the posterior mean and covariance for the analysis of real-world graphs.
Introduction
There are many statistical models with intractable (or difficult to evaluate) likelihood functions.
Composite likelihoods provide a generic approach to overcome this computational difficulty. A natural idea in a Bayesian context is to consider the approximate posterior distribution $\pi_{CL}(\theta \mid y) \propto p_{CL}(y \mid \theta)\, p(\theta)$. Surprisingly, there has been very little study of such a misspecified posterior distribution. We focus on the exponential random graph model, which is widely used in statistical network analysis. The pseudolikelihood function provides a low-dimensional approximation of the ERG likelihood. We provide a framework which allows one to calibrate the pseudo-posterior distribution. In experiments our approach provided improved statistical efficiency with respect to more computationally demanding Monte Carlo approaches.
Methods
To conduct Bayesian inference using the ERG likelihood model, we adopt the well-known full-update
Metropolis-Hastings sampler. To overcome the intractability of the ERG likelihood, we propose to
replace the true likelihood $p(y \mid \theta) = q(y \mid \theta)/z(\theta)$ with a tractable but misspecified likelihood model, leading us to focus on the approximated posterior distribution, or "pseudo-posterior":
$$\pi_{PL}(\theta \mid y) \propto p_{PL}(y \mid \theta) \cdot p(\theta). \qquad (1)$$
Estimating the pseudolikelihood pPL(y|θ) is effortless; misspecification comes from the strong and
often unrealistic assumption of independent graph dyads. Calibration of the unadjusted posterior
MCMC samples to obtain appropriate inference is executed with two operations: a ”mean adjust-
ment” to ensure that the true and the approximated posterior distributions have the same mode and
a ”curvature adjustment” that modifies the geometry of the approximated posterior at the mode
(Stoehr and Friel, 2015).
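A self-contained sketch of the pseudolikelihood idea for an edges-plus-triangles ERG model on a small simulated undirected graph (this illustrates the principle only, not the calibration procedure above): each dyad is treated as an independent Bernoulli observation with its triangle change statistic as predictor, so the pseudolikelihood is exactly a logistic regression, and a Bayesian fit would simply place a prior on the same regression:

set.seed(1)
n <- 30
A <- matrix(rbinom(n * n, 1, 0.1), n, n)        # simulate a small undirected graph
A[lower.tri(A, diag = TRUE)] <- 0
A <- A + t(A)

# Dyad-level design: response A[i, j] and the triangle change statistic, i.e. the
# number of common neighbours of i and j (the edges change statistic is constant,
# so it is absorbed into the intercept)
ij <- which(upper.tri(A), arr.ind = TRUE)
y  <- A[upper.tri(A)]
tri_change <- apply(ij, 1, function(d) sum(A[d[1], ] * A[d[2], ]))

# Maximum pseudolikelihood estimate via logistic regression over the dyads
mple <- glm(y ~ tri_change, family = binomial)
summary(mple)$coefficients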
Application
We consider a large network of 1200 nodes and a two-dimensional model. We compare the calibration
procedure to the Approximate exchange algorithm (AEA) of Caimo and Friel (2011), which has been
used in the context of ERG models and has shown good results. It is possible to carry out inference
for graphs of larger size (e.g. 1,000 nodes), but at the cost of an increased computational time.
Bayesian logistic regression can be performed using standard software, in a fast and straight-forward
manner. Once the mean adjustment and the curvature adjustment were performed, a very good
approximation of the true posterior with efficient correction of the posterior variance was obtained
(Figure 1), while achieving a five-fold speedup relative to the Approximate Exchange.
[Figure: contour plots of the posterior for (θ1, θ2), comparing the AEA posterior with the unadjusted pseudo-posterior (TV = 0.028) and with the mean- and curvature-adjusted pseudo-posterior.]
Figure 1: Phases of calibration of the misspecified posterior distribution using a pseudolikelihood approximation.
References
Caimo, A. and Friel, N. (2011). Bayesian inference for exponential random graph models. Social Net-
works, 33:41-55.
Stoehr, J. and Friel, N. (2015). Calibration of conditional composite likelihood for Bayesian inference on Gibbs random fields. AISTATS, Journal of Machine Learning Research: W & CP, volume 38, pp. 921 – 929.
Non-proportional Hazards Modelling
Kevin Burke∗1 and Gilbert MacKenzie2
1Department of Mathematics and Statistics, University of Limerick, Ireland
2CREST, ENSAI, France
∗Email: [email protected]
Abstract: We investigate parametric approaches for handling non-proportional hazards. Specifically, we
introduce the Multi-Parameter Regression (MPR) modelling framework and compare this to a standard
approach known as frailty. It is noteworthy that the MPR approach generates a new test of proportionality.
We argue that multi-parameter regression is more natural than frailty for capturing non-proportional effects
and show that it is more flexible both analytically and in the context of a lung cancer dataset. We also
consider models which combine the MPR and frailty concepts providing further generality.
Introduction
The most popular regression model for survival data is, by far, the Proportional Hazards (PH)
model whereby the hazard for individual $i$ is $\lambda(t \mid x_i) = \exp(x_i^T \beta)\, \lambda_0(t)$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ and $\beta = (\beta_1, \ldots, \beta_p)^T$ are the vectors of covariates and regression coefficients respectively, and $\lambda_0(t)$ is a baseline hazard function common to all individuals. Clearly, the ratio of two hazards is $\exp[(x_i - x_j)^T \beta]$, which does not depend on time, i.e., hazards are proportional. From this, straightforward interpretation follows: e.g., "the risk in group 1 is $\psi$ times that of group 2" where $\psi$ is a proportionality constant. In spite of the virtue of interpretability under the PH assumption,
non-PH effects are often encountered in practice. We will investigate parametric survival models
which account for non-PH effects.
Methods
One common explanation for non-PH effects is the presence of an unobservable, gamma-distributed
random effects term (Duchateau & Janssen, 2007). This term represents additional heterogeneity
in the hazard which cannot be explained by xi, i.e., missing information/covariates. In this so-called
gamma frailty model, covariate effects may be proportional at the individual level but are non-
proportional at the marginal level as a consequence of this heterogeneity. An alternative explanation
is that covariates truly exhibit non-PH effects at the individual level (and, hence, at the marginal
level). We introduce and explore the “Multi-Parameter Regression” (MPR) framework which handles
this situation. Here we allow multiple distributional parameters to depend on covariates which
generalises the parametric PH model where covariates enter only through a scale parameter. We
will illustrate the various concepts by comparing four Weibull models (PH, PH-frailty, MPR and
MPR-frailty) in terms of their hazard ratio and by application to a Northern Irish lung cancer
dataset.
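A rough sketch of a Weibull MPR fit by direct likelihood maximisation on simulated data with a single binary covariate (a simplified stand-in for the models compared above): both the scale and the shape are log-linear in the covariates, and the PH model is recovered when the non-intercept shape coefficients are zero:

set.seed(1)
n <- 500
x <- rbinom(n, 1, 0.5)
X <- cbind(1, x)                                     # intercept + one covariate
t_true <- rweibull(n, shape = exp(0.2 * x), scale = 2)
cens   <- runif(n, 0, 6)
time   <- pmin(t_true, cens)
status <- as.numeric(t_true <= cens)

# Negative log-likelihood of the Weibull MPR model with hazard
# h(t) = lambda * gamma * t^(gamma - 1), lambda = exp(X beta), gamma = exp(X alpha)
negll <- function(par, time, status, X) {
  p <- ncol(X)
  beta  <- par[1:p]
  alpha <- par[(p + 1):(2 * p)]
  lam <- exp(drop(X %*% beta))
  gam <- exp(drop(X %*% alpha))
  H    <- lam * time^gam                             # cumulative hazard
  logh <- log(lam) + log(gam) + (gam - 1) * log(time)
  -sum(status * logh - H)
}

fit <- optim(rep(0, 2 * ncol(X)), negll, time = time, status = status, X = X,
             method = "BFGS", hessian = TRUE)
fit$par                            # (beta0, beta_x, alpha0, alpha_x); non-zero alpha_x signals non-PH
sqrt(diag(solve(fit$hessian)))     # approximate standard errors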
Results
We show that the PH-frailty model imposes non-PH effects on all covariates (i.e., it does not allow
any covariates to have PH effects) and, moreover, for these imposed non-PH effects, the degree
of time-variation permitted is not large – only convergent hazards are handled. In contrast, we
show that the MPR model is highly flexible and can handle proportional, convergent, divergent and
crossing hazards where covariates are not all forced into sharing any one such trajectory. This flexible
regression model can produce dramatic improvements in fit compared to the basic PH model and
also its frailty extension. Combining the MPR and frailty approaches leads to a more general model
still. It is interesting to find that this MPR-frailty model outperforms the MPR model in the setting
of our real data example showing that although both MPR and frailty approaches allow for non-PH
effects, the presence of one does not abolish the need for the other.
Discussion
The Multi-Parameter Regression (MPR) modelling framework provides a direct generalisation of the
PH model to non-PH status along with a new test of proportionality. We argue that the approach
is a more natural extension to non-PH modelling than the incorporation of random effects, i.e., the
frailty model. Notwithstanding the flexibility of the MPR approach in its own right, the combined
MPR-frailty model offers further generality and provides a method for testing MPR effects against
frailty effects. Finally, the MPR-frailty combination itself suggests further novel extensions which
we will discuss briefly; these extensions are a focus of our future work.
References
Burke, K. and MacKenzie, G. (2016 - submitted). Multi-parameter regression survival models,
Biometrics.
Duchateau, L. and Janssen, P. (2007). The frailty model. Springer.
Dynamic Bayesian networks implemented for the analysis of clinical data
Caoimhe M. Carbery∗1,2, Adele H. Marshall1,
Roger Woods2 and William Scanlon2
1 Centre for Statistical Science and Operational Research (CenSSOR), Queen’s University Belfast, Northern Ireland
2 The Institute of Electronics, Communications and Information Technology (ECIT), Queen’s University Belfast, Northern Ireland
∗Email: [email protected]
Abstract: Bayesian networks are graphical models that represent variables as nodes and conditional depen-
dence relationships between variables as arcs between the nodes. Associated with each node is a conditional
probability given the parents of that node. Bayesian networks have proved useful in the past for representing
medical data. An extension of the Bayesian network is the dynamic Bayesian network that extends the
theory to allow for data observed over multiple time slices. This paper demonstrates how the
dynamic Bayesian network can be an effective tool for representing clinical time series data. In particular
the approach is applied to data for patients with chronic kidney disease.
Introduction
Time series data are abundant; in particular, a large amount of clinical data takes the form of time series. Investigating how a system changes dynamically over time is crucial in medical analysis. This paper will illustrate the use
of dynamic Bayesian networks (DBNs) to model dynamic clinical systems. DBNs have the ability
to model medical systems that involve the rigid collection of data at set time points. This paper
presents an application of the DBN to time series data where the condition and survival of a group
of kidney dialysis patients with chronic kidney disease are modelled over fixed time points.
Methodology
BNs are a special type of graphical model whose nodes represent the variables and the edges
between the nodes correspond to the relationships between the variables. Dynamic Bayesian networks
(DBNs) are an extension of Bayesian networks which incorporates a temporal dimension to the
graphical model by allowing each time point to have a separate BN structure with multiple time
slices connected together to form the overall DBN model. This extra dimension is critical for the network to model dynamic systems through time-dependent data; thus it allows the system to relate variables to each other through adjacent time points (or time slices). There are a number of
algorithms appropriate for creating DBNs with one example being the K2 algorithm. This paper will
provide information on how these algorithms are implemented to create an appropriate DBN.
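A rough sketch of the two-slice idea on simulated, hypothetical variables; bnlearn's hill-climbing search is used here purely for illustration in place of the K2 algorithm mentioned above, with a blacklist forbidding arcs that point backwards in time:

library(bnlearn)

set.seed(1)
n <- 500
# Hypothetical measurements at two adjacent quarterly time slices (t0 and t1)
slices <- data.frame(hb_t0 = rnorm(n), ferritin_t0 = rnorm(n), dose_t0 = rnorm(n))
slices$hb_t1       <- 0.7 * slices$hb_t0 + 0.2 * slices$dose_t0 + rnorm(n, sd = 0.5)
slices$ferritin_t1 <- 0.6 * slices$ferritin_t0 + rnorm(n, sd = 0.5)
slices$dose_t1     <- rnorm(n)

# Forbid arcs from slice t1 back into slice t0 so learned arcs respect time order
t0 <- grep("_t0$", names(slices), value = TRUE)
t1 <- grep("_t1$", names(slices), value = TRUE)
bl <- expand.grid(from = t1, to = t0, stringsAsFactors = FALSE)

dbn <- hc(slices, blacklist = bl)   # structure learning over the two slices
dbn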
Application
The theory discussed above will be applied to time series data in a clinical context, specifically to
a dataset involving the analysis of kidney disease. This application will demonstrate the relevance
of using a network of this kind for clinical research. The clinical data contains information on a
number of variables associated with the measurement of kidney functions for 11,527 patients with
associated measurements at quarterly time points from 2005 to 2008. The investigation considers
the benefits of different drug types to a patient’s haemoglobin and ferritin levels. Other application
areas will be discussed to emphasise the adaptability of DBNs to different data types. Matlab is
used to create the resulting DBNs for the clinical kidney data.
Discussion
Upon demonstrating the relevance of the DBN to clinical time series data, this presentation will
discuss the incorporation of levels using hierarchical systems. This added aspect will aim to adapt
systems to allow for deep learning to be performed. Deep learning will be briefly discussed with
emphasis placed on its statistical relevance.
References
Dean, Thomas and Keiji Kanazawa. (1989). A model for reasoning about persistence and causation.
In: Computational intelligence 5,no. 2 pp. 142 – 150.
Murphy, Kevin P. (2002). Dynamic bayesian networks: representation, inference and learning. Disserta-
tion: University of California, Berkeley.
Cluster Analysis of Hepatitis C Viral Load Profiles in Pre-treatment
Patients with Censored Infection Times
Niamh Ducey∗1, Kathleen O’Sullivan1, Jian Huang1, John Levis2, Elizabeth Kenny-Walsh3, Orla
Crosbie3 and Liam Fanning2
1Department of Statistics, University College Cork, Ireland
2Molecular Virology Diagnostic and Research Laboratory, Department of Medicine, Cork University Hospital and University College Cork, Ireland
3Department of Gastroenterology and Hepatology, Cork University Hospital, Ireland
∗Email: [email protected]
Abstract: Viral load of the Hepatitis C virus (HCV) has been identified as an important predictor of the
outcome of Hepatitis C disease progression (Fanning et al., 2000). There is a limited amount of information
available to explain the fluctuations in viral load in the timeline of HCV infection and, more specifically,
the viral load changes over time in an untreated patient population. This study aims to cluster the viral
load profiles of a sample of pretreatment chronic Hepatitis C patients to investigate the presence of distinct
groupings in viral load progression patterns during HCV infection.
Introduction
Hepatitis C (World Health Organisation, 2015), infecting an estimated 185 million people globally,
is a disease characterised by liver inflammation that occurs due to infection with the Hepatitis C
virus (HCV). A key obstacle of this disease is its ability to remain symptomless for long periods of
time. This causes increased risk of viral transmission, delayed treatment, and difficulty determining
the initial infection point. Viral Load (VL, IU/ml) is the amount of virus present in body fluid at
any one time. Quantifying a patient’s VL at multiple time points leads to the development of a
VL profile. The objective of this study is to examine VL profiles in untreated chronic Hepatitis C
patients in order to identify distinct patterns in viral load progression.
Methods
VL profiles were obtained for 81 pre-treatment females chronically infected with HCV due to the
receipt of HCV-1b contaminated Anti-D immunoglobulin. This selection criterion enabled estimation of initial infection to a narrow time interval between 1977 and 1978.¹ Based on work by Luan and Li (2003) on clustering sparse and irregularly-spaced time course data, we applied
a mixed-effects model that utilises B-Spline basis functions to cluster the viral load profiles over
time since infection. The optimum cluster solution was identified using the Integrated Complete
Likelihood (ICL) criterion and Monte Carlo re-sampling simulations. Additionally, infection times were randomized to assess the effect of censoring infection times on the cluster result.
¹ The authors would like to thank Dr. Joan Power for her contribution on the infection timeline profiles.
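A simplified two-stage stand-in for the clustering step, on simulated profiles (not the anti-D cohort data): each patient's profile is summarised by coefficients on a common B-spline basis with hypothetical knot locations, and the coefficients are then clustered with a Gaussian mixture compared via ICL; note that Luan and Li (2003) fit the splines and the mixture jointly in a mixed-effects model rather than in two stages:

library(splines)
library(mclust)

set.seed(1)
n_pat <- 60
times <- seq(1, 30, length.out = 8)
grp <- rep(1:2, each = n_pat / 2)
vl  <- do.call(rbind, lapply(1:n_pat, function(i)
  data.frame(id = i, years = times,
             log10_vl = 4 + grp[i] * 0.05 * times + rnorm(8, sd = 0.3))))

# Common B-spline basis shared by all patients (hypothetical knot placement)
basis <- function(x) bs(x, knots = c(10, 20), Boundary.knots = c(0, 35))

# Stage one: per-patient spline coefficients summarising each viral load profile
coef_mat <- t(sapply(split(vl, vl$id),
                     function(p) coef(lm(log10_vl ~ basis(years), data = p))))

# Stage two: Gaussian mixture clustering of the coefficients, solutions compared by ICL
mclustICL(coef_mat)
fit <- Mclust(coef_mat, G = 2)
table(fit$classification, grp)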
Results and Conclusions
Two clusters were identified as the optimum descriptor of the distinct patterns observed in VL
progression. The mean VL curves of these two clusters are presented in Figure 1.
Figure 1: The mean curves of viral load (log10IU/ml) profiles over time since infection (years) ofthe individuals in cluster one (n = 32) and cluster two (n = 49).
Cluster 1 displays a relatively steady increase in VL over time, whereas Cluster 2 portrays a more rapid
increase in VL, in addition to peaking at a higher viral load level than Cluster 1. The randomization
of infection start times had no major effect on the cluster membership of the individual patients.
References
Fanning, L., Kenny-Walsh, et al. (2000). Natural fluctuations of hepatitis C viral load in a homoge-
neous patient population: a prospective study. Hepatology (Baltimore, Md.), 31(1), pp. 225 – 9.
Luan, Y., & Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model
with B-splines. Bioinformatics, 19(4), pp. 474 – 482.
Of queues and cures: A solution to modelling the inter time arrivals of
cloud outage events
Jonathan Dunne∗1 and David Malone1
1Hamilton Institute, Maynooth University, Ireland
∗Email: [email protected], [email protected]
Abstract: The management of Cloud based outages represents a challenge for Small Medium Enterprises
(SMEs), due to the variety of ways in which production outages can occur. We consider the inter-arrival
times for outages events in a framework where these arrival times are used to align Systems Operations
resources. Using an enterprise dataset, we address the question of how inter-arrival times are distributed
by testing against a number of common distribution types. The proposed framework can help SMEs to
manage their limited resource workflows. We shall also consider correlation between arrival times.
Keywords: Distribution fitting, goodness of fit, correlation, resource planning.
Introduction
For the European SME, the adoption of cloud technology is no easy task. Due to resource constraints and a myriad of failure patterns, SMEs face challenges in providing a reliable and stable service platform for their customers' needs. In this paper we describe a framework that the SME can leverage to best manage their limited set of resources when responding to incoming outage events in their cloud infrastructure.
Data Set
The study presented in this paper examines approximately 250 cloud outage events from a large
enterprise system. Our study aims to answer a key question: Which distribution is best suited to
model the interarrival time of cloud outage events? To answer this question, a number of common
distribution types were modelled; lognormal, gamma, Weibull, exponential, logistic, loglogistic and
Pareto.
Results
Each distribution was fitted to the interarrival times of the dataset using R and the fitdistrplus package, which also estimates the distribution parameters. Using a second package, ADGofTest, the fitted parameters of each distribution were validated for their goodness of fit. Table 1 summarises the results of this test.
Table 1: Summary of Anderson-Darling GoF statistics.
Distribution name    AD statistic    p-value
lognormal            3.039           0.026
gamma                6.034           9.347e-04
Weibull              0.975           0.371
exponential          3.110           0.024
logistic             12.819          2.765e-06
loglogistic          1.823           0.115
Pareto               0.661           0.592
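As a minimal sketch of the fitting-and-testing step for one candidate distribution on simulated interarrival times (the enterprise data are not reproduced here); the same pattern applies to the other candidate distributions:

library(fitdistrplus)
library(ADGofTest)

set.seed(1)
iat <- rweibull(250, shape = 0.9, scale = 3000)   # stand-in interarrival times

# Fit a candidate distribution by maximum likelihood
fw <- fitdist(iat, "weibull")
fw$estimate

# Anderson-Darling goodness-of-fit test at the fitted parameter values
ad.test(iat, pweibull, shape = fw$estimate["shape"], scale = fw$estimate["scale"])

# Density, Q-Q, CDF and P-P comparison plots, as in Figure 1
plot(fw)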
[Figure: histogram with theoretical densities, Q-Q plot, empirical and theoretical CDFs, and P-P plot for the Weibull, loglogistic and Pareto fits.]
Figure 1: Four goodness-of-fit plots for Weibull, loglogistic and Pareto distributions fitted to the interarrival times from the cloud outage data set.
Discussion
Table 1 shows that Pareto has the best p-value for the Anderson-Darling test, followed by loglogistic
and Weibull. All other distributions were rejected as part of hypothesis testing. Figure 1 shows
goodness of fit plots for the three best-fitting distributions. The quantile-quantile plot shows that the
Pareto distribution more closely models the data set even for extreme values.
Conclusion
It was found that the Pareto distribution is a useful distribution for modelling the interarrival times
of cloud outages. This result can be used by SMEs as an arrival time parameter for a queuing model
for cloud outage events.
References
Delignette-Muller, M.L. and Dutang, C. and Pouillot, R. and Denis, J.B. (2015). Web Page:
https://cran.r-project.org/web/packages/fitdistrplus/index.html
Bellosta, C.J.G (2011). https://cran.r-project.org/web/packages/ADGofTest/index.html
Study of joint progressive type-II censoring in heterogeneous populations
Lida Fallah∗1, John Hinde1
1School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
∗Email: [email protected]
Abstract: Time to event, or survival, data is common in the biological and medical sciences. Here, we
consider the analysis of time to event data from two populations undergoing life-testing, mainly under a
joint Type-II censoring scheme for heterogeneous situations. We consider a mixture model formulation and
maximum likelihood estimation using the EM algorithm and conduct a simulation to study the effect of
the form of censoring scheme on parameter estimation and study duration.
Key words: EM algorithm, Maximum likelihood estimation, Type-II censoring
Introduction
Time to event, or survival, data is common in the biological and medical sciences with typical exam-
ples being time to death and time to recurrence of a tumour. In practice, survival data is typically
subject to censoring with incomplete observation of some failure times due to drop-out, intermit-
tent follow-up and finite study duration. Many different probability models have been proposed for
survival times. Extending these to mixture models allows the modelling of heterogeneous popula-
tions, e.g. susceptible/non-susceptible individuals (Kuo and Peng, 2000). This allows the clustering
of individuals to different groups together with parameter estimation. This becomes more compli-
cated in the presence of censoring and requires care in model fitting and interpretation. Maximum
likelihood estimation can be done by direct optimization or with the EM algorithm using a nested
version to handle the two aspects of missing data, the mixture component labels and the censored
observations.
Model
Let $X = (X_1, \ldots, X_m)$ be i.i.d. random variables following an $f^{(1)}$ distribution for the lifetimes of $m$ units and $Y = (Y_1, \ldots, Y_n)$ be i.i.d. random variables following an $f^{(2)}$ distribution for the group of $n$ units. Now let $W_1 \le \ldots \le W_N$, $N = m + n$, denote the order statistics of the random variables $X_1, \ldots, X_m; Y_1, \ldots, Y_n$. Under joint Type-II censoring, $N = m + n$ units are placed on a life-test. At the time of the $r$th failure, $R = S + T$ remaining units are withdrawn, where $S$ and $T$ are the numbers of withdrawals from the $X$ and $Y$ samples respectively, so that $N = R + r$, and the test is terminated.

The likelihood function under joint Type-II censoring given the observed data is given by
$$L(\Theta \mid z, w, s) = C \prod_{i=1}^{r} \left[ f^{(1)}(w_i)^{z_i}\, f^{(2)}(w_i)^{1 - z_i} \right] \bar{F}^{(1)}(w_r)^{m - s}\, \bar{F}^{(2)}(w_r)^{n - t},$$
where $\bar{F}^{(1)} = 1 - F^{(1)}$, $\bar{F}^{(2)} = 1 - F^{(2)}$, $C$ is a normalising constant, and $Z_i$ is a 0/1 indicator for $W_i$ coming from the $X$ population or not.
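A small sketch of generating data under a joint Type-II censoring scheme, with hypothetical normal component distributions as in the better-behaved setting discussed below: the pooled sample is ordered, the first r failures are observed with their population labels, and the remaining R = N - r units are withdrawn at the rth failure time:

set.seed(1)
sim_joint_typeII <- function(m, n, r, mu = c(5, 7), sd = c(1, 1)) {
  x <- rnorm(m, mu[1], sd[1])                 # lifetimes from population X
  y <- rnorm(n, mu[2], sd[2])                 # lifetimes from population Y
  w <- c(x, y)
  z <- rep(c(1, 0), c(m, n))                  # 1 = from X, 0 = from Y
  ord <- order(w)
  zr  <- z[ord][1:r]
  list(w   = w[ord][1:r],                     # observed ordered failure times
       z   = zr,                              # population indicators of the failures
       s   = m - sum(zr),                     # X units withdrawn (censored) at w_r
       t   = n - sum(1 - zr),                 # Y units withdrawn at w_r
       w_r = w[ord][r])                       # termination time
}

dat <- sim_joint_typeII(m = 30, n = 30, r = 40)
str(dat)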
Discussion
We focus on Type-II censoring, where follow-up is terminated after a pre-specified number of failures,
with all other individuals censored at the largest failure time. The performance of the estimation
procedure depends, not surprisingly, on the characteristics of the censoring scheme and the form of
the component densities. Some experimentation with a mixture of exponentially distributed components highlighted potential problems; however, with normal components things are better behaved.
In progressive Type-II censoring some individuals are removed randomly at each failure time. This has
the effect of spreading out the censoring over the observation period with a consequent extension of
the follow-up time. This censoring scheme seems to improve the efficiency of estimation for mixed
populations. The above can be extended to two heterogeneous populations (e.g. male/female)
applying Type-II or progressive Type-II censoring over the two populations, referred to as joint (pro-
gressive) Type-II censoring schemes (Rasouli and Balakrishnan, 2010). We focus on the above
settings and conduct a simulation study to evaluate the impact of the form of the censoring scheme
on parameter estimation and study duration. We obtain standard errors for parameter estimates
and construct confidence intervals. Results will be discussed to show the benefits of the progressive
regime. Finally, we illustrate with a real-data example.
References
Kuo, L. and Peng, F. (2000). Generalized linear models: A Bayesian Perspective. New York: Marcel
Dekker, pp. 255 – 270.
Rasouli, A. and Balakrishnan, N. (2010). Exact likelihood inference for two exponential populations
under joint progressive Type-II censoring. Communications in Statistics- Theory and Methods, 39,
pp. 2172 – 2191.
Extending Average Attributable Fractions
John Ferguson∗1, Alberto Alvarez-Iglesias1, John Newell1,2, John Hinde2 and Martin O’Donnell1
1HRB Clinical Research Facility, NUI Galway, Galway, Ireland
2School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Galway, Ireland
∗Email: [email protected]
Abstract: Chronic diseases tend to depend on a large number of risk factors, both environmental and
genetic. Average attributable fractions (Eide and Gefeller, 1995) were introduced as a way of partitioning
overall disease burden into contributions from individual risk factors; this may be useful in deciding which
risk factors to target in disease interventions. However, in practice they are seldom used due to technical
and methodological limitations. To bridge this gap, we introduce new estimation methods for average
attributable fractions that are appropriate for both case control designs and prospective studies.
Introduction
In epidemiology, the attributable fraction represents the proportional reduction in population disease
prevalence that might be observed if a particular risk factor could be eliminated from the population.
In some regards, it is a more relevant measure of disease association than odds ratios or relative risks
as it hints at the potential impact of an intervention targeting the risk factor. Average attributable
fractions are a related concept, more tailored to the situation where several risk factors are known to
be associated with the disease, in which case they define a partition of cumulative disease burden into
contributions from each risk factor. At first sight, average attributable fractions seem an extremely
useful tool to quantify the portion of risk contributed by each risk factor; a pertinent calculation in
describing chronic disease epidemiology. However, they are seldom used by practitioners. Perhaps
the main reason for this is the technical issues that researchers face in their application. In brief, some of these hurdles relate to: (a) computational difficulties when the number of risk factors is large; (b) the lack of a proposed method for producing confidence intervals; and (c) a lack of flexible software to assist in their calculation. In this presentation, we describe these issues in more depth, and propose
some solutions. The new methods are demonstrated using real and simulated data. Estimation
accuracy, coverage-probability and computational efficiency compared to alternative approaches are
examined. An R-package, averisk, that implements the methods discussed in this presentation
can be downloaded from the CRAN server.
References
Eide, Geir Egil, and Olaf Gefeller (1995). Sequential and Average Attributable Fractions as Aids in
the Selection of Preventive Strategies. Journal of clinical epidemiology, 48(5), pp. 645 – 655.
Variable Selection with multiply imputed data when fitting a Cox
proportional hazards model: a simulation study
Olga Kalinina∗1, Dr. Emma Holian1, Dr. John Newell1,2, Dr. Nicolla Miller3 and Prof. Michael Kerin3
1School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland
2HRB Clinical Research Facility, NUI Galway, Ireland
3Discipline of Surgery, School of Medicine, National University of Ireland, Galway, Ireland
∗Email: [email protected]
Abstract: The purpose of this study is to explore methods for variable selection in survival analysis in the
presence of missing data.
Introduction
Prognostic models play an important role in medical decision making process. Missing predictors and
censored responses are common problems within prognostic modelling studies. Simple methods, such
as complete cases analysis, are commonly used as the default procedure in many statistical software
packages. Several studies have shown that such an approach loses efficiency and may lead to biased
estimates if there is a relationship between missing values and the response. Multiple imputation is an
attractive approach, which replaces each missing value in predictor by M credible values estimated
from the observed data. Then M imputed data sets are analysed separately and the parameters
estimates and their standard errors combined using ’Rubin’s Rule’. However, it is still unclear
how to conduct variable selection over multiply imputed data sets under the framework of penalized
regressions. Several methods have been proposed and used in the literature. Wood(2008) performed
classical backward stepwise selection method where i) at each step, the inclusion and exclusion
of the variable is based on combined overall estimates with standard errors using Rubin’s Rule, and
ii) a stacking method is used where the multiply imputed data sets into one using a weighting scheme
to account for the fraction of missing data in each explanatory variable. Chen(2013) and Wan(2015)
proposed methods combining multiple imputation and penalized regressions. Chen(2013) treated
estimates from the same variable across all imputed data sets as a group, and applied the group
lasso penalty to yield a consistent variable selection, while Wan(2015) proposed weighted elastic
net method to the stacking method after multiple imputation with a weight accounting for the
proportion of the observed information for each subject.
Discussion
Penalized regression techniques like lasso, elastic net and group lasso achieve parsimony as they
shrink some regression coefficients to zero. However, they may lead to inconsistent variable selection if applied directly to the multiply imputed data sets. Both proposed methods combining
multiple imputation and penalized regressions, presented in literature, are discussed for the linear
regression model only. My aim is to extend the above ideas to the Cox proportional hazards model
and examine their performance with the alternatives through a comprehensive study.
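As a small sketch of the naive per-imputation approach that the discussion above cautions can yield inconsistent selections, mice and glmnet are combined on simulated data with hypothetical variables, fitting a lasso-penalised Cox model separately to each imputed data set (the grouped and stacked approaches of Chen and Wan are not shown):

library(mice)
library(glmnet)

set.seed(1)
n <- 300; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
time   <- rexp(n, rate = exp(0.5 * X[, 1] - 0.5 * X[, 2]))
status <- rbinom(n, 1, 0.8)
y <- cbind(time = time, status = status)
X[sample(length(X), 0.1 * length(X))] <- NA        # introduce missing predictor values

imp <- mice(data.frame(X), m = 5, printFlag = FALSE)

# Lasso-penalised Cox fit on each imputed data set; the selected variables
# can disagree across imputations, which motivates the grouped/stacked methods
selected <- sapply(1:5, function(i) {
  xi  <- as.matrix(complete(imp, i))
  cvf <- cv.glmnet(xi, y, family = "cox")
  as.numeric(as.matrix(coef(cvf, s = "lambda.min")) != 0)
})
rownames(selected) <- colnames(X)
selected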
References
Chen Q, Wang S (2013). Variable selection for multiply-imputed data with application to dioxin exposure
study. J Stat Comput Simulat 2015, 85(9), pp. 1902 – 1916.
Wan Y, Datta S, Conklin DJ, Kong M (2015). Variable selection models based on multiple imputation
with an application for predicting median effective dose and maximum effect. Stat Med 2008, 27,
pp. 3227 – 3246.
Wood AM, White IR, Royston P (2008). How should variable selection be performed with multiply
imputed data? Stat Med 2008, 27, pp. 3227 – 3246.
Extending the msm package to derive transition probabilities for a
Decision Analytic Markov Model
Felicity Lamrock∗1,2, Karen Cairns2, Frank Kee1, Annette Conrads-Frank3, and Uwe Siebert3
1Centre for Public Health, 2Centre for Statistical Science and Operational Research (Queen’s University Belfast, United Kingdom)
3Institute of Public Health, Medical Decision Making and Health Technology Assessment, UMIT, Hall i.T., Austria
∗Email: [email protected]
Abstract: Several novel biomarkers have been shown to have promising ability to better determine who is at
risk for a cardiovascular event beyond conventional risk factors. The aim of this paper is to extend the R
package msm, to estimate transition probabilities between health states within a decision-analytic Markov
model, to assess if the measurement of novel biomarkers (in addition to current prevention strategies using
conventional risk factors) can lead to cost-effective strategies for prevention.
Introduction
Cardiovascular disease (CVD) is the single most common cause of death in the world, and several
novel biomarkers are being considered for their ability to enhance cardiovascular risk estimation.
Decision-analytic Markov models (DAMMs) can be used to assess whether different prevention
strategies are not only effective but cost-effective. Markov models can be used to describe how
individuals move between different health states over time. The aim of this paper is to outline the
techniques to populate a DAMM with transition probabilities between health states to assess the
cost-effectiveness of adding novel biomarkers to existing prevention strategies.
Methods
Using a Finnish population cohort (FINRISK97), and the follow-up for cardiovascular events, a
multi-state Markov model is built to describe movements between five different health states. The
R package msm can fit multi-state models to longitudinal data, giving output for all permitted
state-to-state transitions. All of the transition rates between each of the health states are examined
in the one process, where the rates between health states r and s can be influenced by time-dependent
or constant covariates, z, and are estimated in a proportional hazards fashion:

q_rs(t_j, z_ij) = q_rs^(0) exp(β_rs^T z_ij),

where z_ij are the explanatory covariates affecting the intensity for individual i at time t_j. The
usefulness of the package could be further enhanced through its extension to formulate transition
probabilities. The transition probability matrix P(t_u, t_v, z_iu), whose (r, s) element p_rs gives the
probability of an individual moving from health state r to health state s between times t_u and t_v,
is obtained by evaluating the matrix exponential of the transition intensity matrix Q:

P(t_u, t_v, z_iu) = exp[(t_v − t_u) Q(t_u, z_iu)].

Since transitions between different health states may be influenced by an individual's characteristics
(including novel biomarker information) and any prevention treatment received, different prevention
strategies will involve different transition probabilities between health states.
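A minimal sketch of this workflow with the msm package, under assumed names (a data frame cvd with columns state, years and id, an illustrative biomarker covariate ntbnp, and a five-state progressive structure); the allowed transitions and the covariate are placeholders rather than the model actually fitted to FINRISK97.

```r
library(msm)

# Skeleton of allowed instantaneous transitions between the five health states
# (non-zero entries serve as initial guesses for the baseline intensities q_rs^(0))
Q <- rbind(c(0, 0.1, 0.1, 0.1, 0.1),
           c(0, 0,   0.1, 0.1, 0.1),
           c(0, 0,   0,   0.1, 0.1),
           c(0, 0,   0,   0,   0.1),
           c(0, 0,   0,   0,   0  ))

fit <- msm(state ~ years, subject = id, data = cvd,
           qmatrix = Q, covariates = ~ ntbnp)

# One-year transition probability matrix P(t, t + 1, z) = exp(Q(z)),
# evaluated at a chosen covariate value via the matrix exponential
P1 <- pmatrix.msm(fit, t = 1, covariates = list(ntbnp = 0.5))
P1
```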
Results
All analyses were performed using R 3.1.1. To build the msm model, initial guesses for the q_rs^(0) were
deployed to avoid convergence on local maxima, with fifty runs performed in each case. It is also
useful to use starting values for the β_rs from any nested models to aid convergence. The probability
matrix P(t_u, t_v, z_iu) was obtained for each individual within the FINRISK97 cohort. For each of
the strategies within the DAMM, the relevant subset of individuals is chosen and the probability matrices
averaged over the cohort. One-year transition probabilities were thereby obtained for each year from
age 50 onwards, for each transition between health states, and implemented into a cost-effectiveness
model.
Discussion
Novel biomarkers have the potential to be an effective and cost-effective strategy for targeting
subsequent CVD prevention. The approach that extends the msm package to populate a DAMM
has been outlined, and it is a useful tool for parameter estimation. The information from this
research will be combined in the DAMM with cost data from different prevention treatments, to
assess cost-effectiveness.
References
Blankenberg, S et al. (2010). Contribution of 30 Biomarkers to 10-year cardiovascular risk estimation in
2 population cohorts: The MONICA, risk, genetics, archiving and monograph (MORGAM) biomarker
project. Circulation, 121, (22), pp. 2388-2397.
Jackson, C. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical
Software, 38, (8), pp. 1 – 29.
Adaptive Decision-Making via Non-Parametric Predictive Intervals
Angela McCourt∗1 and Dr Brett Houlding1
(1Dept. of Statistics, Trinity College Dublin, Ireland)
∗Email: [email protected]
Abstract: Normative decision theory is concerned with the study of strategies for decision-making under
conditions of uncertainty in such a way as to maximize the expected utility. The approach taken in this
research does not assume that an individual has, or can specify and work with, a belief network and/or
preference function returning a unique value regardless of the vagueness or unfamiliarity of the event to
them. Instead it allows us to explicitly model how an individual can and/or does derive their belief and
utility functions (or alternative concepts to replace these within the setting considered), based on how
a person or party may actually learn their belief network. To achieve this, non-parametric
predictive intervals (NPI) are employed. This is a modelling technique in which vagueness is incorporated
via the use of imprecise probabilities, where precise values are replaced by a lower and an upper bound of
probability. We wish to develop a way to update utilities in light of new information and to explore and
explain what heuristics may be involved in this process.
Introduction
Normative Decision Theory explores what decisions ought to be made and assumes that
the individual making a decision is able to place the correct numerical utility value on any reward,
including rewards that have never been experienced before, i.e. experiences that are novel to
that individual. If we do not assume that an individual has, or can specify and work with, a
belief network and/or preference function returning a unique value regardless of the vagueness or
unfamiliarity of the event to them, then we can explicitly model how an individual can
and/or does derive their belief and utility functions (or alternative concepts to replace these within
the setting considered). To achieve this, non-parametric predictive intervals (NPIs) are employed; a
modelling technique in which vagueness is incorporated via the use of imprecise probabilities, where
precise values are replaced by a lower and an upper bound of probability (Coolen, 2006).
Simulated Data
One thousand data points were randomly simulated from three distributions chosen so that both
correlated and uncorrelated data would be generated. The absolute correlation coefficient, |ρ|, was
calculated in blocks of 50. The selection of fifty observations per block may appear arbitrary, but
much research has been conducted on recommender systems which has highlighted that algorithms
perform reliably when there are approximately fifty ratings available (see Jannach, Zanker, Felfernig,
& Friedrich (2010) for an introduction to recommender systems). The non-parametric prediction
intervals were calculated as follows:
Lower bound:   E[Interval_new] = (1 / (n + 1)) Σ_{i=1}^{n} |ρ_i|
Upper bound:   E[Interval_new] = (1 / (n + 1)) (1 + Σ_{i=1}^{n} |ρ_i|)
For (X, Y) uncorrelated data was simulated, whereas for (Y, Z) correlated data was simulated. However,
it is assumed that we know nothing about these data, and so, via the NPIs, it is possible to "build"
our knowledge of them. As subsequent intervals are calculated, the intervals narrow. From Fig. 1 we
see that, for the uncorrelated data, the lower bounds change very little in comparison to the upper
bounds and the interval narrows towards zero, which indicates no correlation between these
distributions. For the correlated data we have the opposite effect: the lower bounds increase so that
the interval narrows towards 1.
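A minimal R sketch of these calculations under assumed simulation settings (one correlated pair, blocks of 50 observations); it simply implements the lower and upper bound formulas above and shows the interval narrowing as blocks accumulate.

```r
set.seed(1)
n_blocks <- 20
x <- rnorm(n_blocks * 50)
y <- 0.8 * x + rnorm(n_blocks * 50, sd = 0.5)      # a correlated pair, e.g. (Y, Z)

# |rho| for each block of 50 observations
abs_rho <- sapply(seq_len(n_blocks), function(b) {
  idx <- ((b - 1) * 50 + 1):(b * 50)
  abs(cor(x[idx], y[idx]))
})

# NPI bounds after n blocks: lower = sum(|rho_i|)/(n+1), upper = (1 + sum(|rho_i|))/(n+1)
n     <- seq_along(abs_rho)
lower <- cumsum(abs_rho) / (n + 1)
upper <- (1 + cumsum(abs_rho)) / (n + 1)
cbind(n, lower, upper)    # the interval narrows (towards 1 here) as blocks accumulate
```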
Figure 1: Non-parametric Predictive Intervals and Differences between Bounds
References
Coolen, F.P.A. (2006). On Nonparametric Predictive Inference and Objective Bayesianism. Journal of
Logic, Language and Information, 141, pp.382 – 391.
Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010). Recommender systems: An intro-
duction. Cambridge University Press.
Identifying universal classifiers for multiple correlated outcomes in clinical
development
Meabh G. McCurdy∗1, Adele H. Marshall1 and J. Renwick Beattie2
1Centre for Statistical Science and Operational Research, School of Mathematics and Physics,
Queen's University Belfast, Belfast, UK
2Exploristics Ltd, Floor 4 Linenhall Street, Belfast, BT2 8BG
∗Email: [email protected]
Abstract: The pharmaceutical industry has become a key driver of scientific and medical progress in
recent years, thanks to innovative ideas such as stratified medicine, which is achieved by employing
different types of subgroup analysis in order to determine which patients respond better to a particular
treatment. The method used in this analysis is known as the Patient Rule Induction Method (PRIM). This
method was applied to a dataset taken from a clinical trial to determine subgroups of patients and to aid
with the prediction of a patient’s survival of the trial.
Introduction
Currently 95% of the experimental drugs that are studied in humans fail to be both safe and effective
(Healthcare & Pharma 2013). This is a growing concern within the pharmaceutical industry and, as
a result, they are looking for ways to advance and improve the success of trial drugs whilst at the
same time managing the costs. One particular aspect of drug development that pharmaceutical
companies are required to consider is the ability to identify subgroups of patients that are likely
to derive additional benefits from different treatments. This can be achieved by using subgroup
analysis to determine different groups with similar characteristics. This paper uses subgroup analysis
methodology on a clinical trial in which each patient had four biomarkers recorded at three different
time points throughout the trial. Information on whether the patients survived the total duration of
the study was also recorded and key to the analysis.
Methods
A special type of subgroup analysis is the Patient Rule Induction Method (PRIM). PRIM, also
known as the "bump-hunting" algorithm, was published by Friedman and Fisher (1999). It is a non-
parametric method whose primary aim is to identify subgroups, or bumps, in the data that maximise
the mean of a target variable. In this analysis the target variable is the classification variable
ALL Censflag, where a value of 1 refers to a patient who survived the study and a value of 0 refers
to a patient who died before the study ended. The method was
implemented in SAS using a macro created by Sniadecki (2011).
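The SAS macro itself is not reproduced here; the following R sketch only illustrates the peeling idea behind PRIM for a single predictor, with hypothetical inputs (x a biomarker, y the 0/1 ALL Censflag target) and a simple peeling fraction alpha.

```r
prim_peel <- function(x, y, alpha = 0.05, min_support = 0.1) {
  keep <- rep(TRUE, length(y))
  repeat {
    if (mean(keep) <= min_support) break
    idx  <- which(keep)
    cuts <- quantile(x[idx], c(alpha, 1 - alpha))
    cand <- list(keep & x >= cuts[1],   # peel a fraction alpha from the lower edge
                 keep & x <= cuts[2])   # peel a fraction alpha from the upper edge
    means <- sapply(cand, function(k) mean(y[k]))
    if (max(means) <= mean(y[keep])) break       # stop when peeling no longer helps
    keep <- cand[[which.max(means)]]             # keep the peel with the higher target mean
  }
  list(box = range(x[keep]), target_mean = mean(y[keep]), support = mean(keep))
}

# Example call (hypothetical data):
# prim_peel(x = biomarker1, y = all_censflag)
```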
Results
PRIM was applied to the dataset of 105 patients and as a result was able to identify two different
subgroups within the data. The make-up of these two subgroups can be seen in Table 1.
Table 1: Properties of the subgroups obtained from PRIM
                           Group 1       Group 2
ALL Censflag = 0           2 (4%)        27 (56%)
ALL Censflag = 1           55 (96%)      21 (44%)
Mean of target variable    µ = 0.965     µ = 0.438
Therefore, if a patient belonged to Group 1 they were predicted to survive the study and likewise
if they belonged to Group 2 they were predicted to die before the study ended. The sensitivity and
specificity are 93% and 72% respectively.
Conclusion
Analysing the typical biomarker values common to patients belonging to the same subgroup revealed
that the outcome of a patient can be predicted. The methodology has the potential to be
applied to other clinical trials to determine subgroups of patients that will benefit more than others
from an experimental drug. This area of analysis will play a key role in the future development of
the pharmaceutical industry, and in particular in the creation of stratified medicine.
References
Friedman, J. H. and Fisher, N. I. (1999). Bump hunting in high-dimensional data. Statistics and Com-
puting, 9(2), pp. 123 – 143.
Healthcare & Pharma. (2013). How the staggering cost of inventing new drugs is shaping the future
of medicine. [Online] http://www.forbes.com/sites/matthewherper/2013/08/11/. Accessed 10
March 2015.
Sniadecki, J. (2011). Bump Hunting with SAS: A Macro Approach to Employing PRIM. SAS Global
Forum, Paper 156.
Mixtures of Infinite Factor Analysers
Keefe Murphy∗1,2, Dr. Claire Gormley1,2
1School of Mathematics and Statistics, UCD, Ireland
2Insight Centre for Data Analytics, UCD, Ireland
∗Email: [email protected]
Abstract: Typically, when clustering via mixtures of factor analysers (MFA), one must specify ranges of
values for the numbers of groups and factors in advance. The pair of values which optimises some model
selection criterion, such as BIC, is chosen. Not only is this computationally intensive, it is generally only
reasonable to fit models where the number of factors is the same across groups. The development to date
of a flexible, adaptive Gibbs sampler algorithm for clustering high-dimensional data via a mixture of infinite
factor analysers (MIFA) is presented. MIFA allows different clusters to have different numbers of latent
factors and estimates these quantities automatically during model fitting. An application to metabolomic
data illustrates the methodology.
Introduction
Typically, orthogonal factor analysis models the p-vector x_i as a linear function of a q-vector of
unobserved latent factors f_i, where q ≪ p:

x_i − µ = Λ f_i + ε_i,

where x_i − µ is p × 1, the loadings matrix Λ is p × q, f_i is q × 1 and ε_i is p × 1. In a Bayesian setting,
the means µ are assumed to be MVN_p distributed, with mean 0 and diagonal covariance matrix. The
scores f_i are assumed to be MVN_q distributed with mean 0 and identity covariance matrix. Finally,
ε_i ∼ MVN_p(0, Ψ), with diagonal Ψ and the following prior on its non-zero elements:
ψ_j^{-1} ∼ Ga(α/2, β/2), j = 1, . . . , p.
Methods
A latent indicator z_ig, which is equal to 1 if i ∈ cluster g of G, and 0 otherwise, is introduced, such
that z_i ∼ Mult(1, π). A Dir(α) prior is assumed for π, the mixing proportions. Marginally, this provides
the following parsimonious covariance structure:

x_i | z_ig = 1 ∼ MVN_p(µ_g, Λ_g Λ_g^T + Ψ_g),  and therefore
P(x_i) = Σ_{g=1}^{G} π_g MVN_p(µ_g, Λ_g Λ_g^T + Ψ_g).
A multiplicative gamma process shrinkage prior is used for the infinite loadings matrices. This allows
each cluster to have infinitely many factors, with loadings increasingly shrunk towards zero as the
column index increases,
Loadings:          λ_jk ∼ N(0, φ_jk^{-1} τ_k^{-1})
Local shrinkage:   φ_jk ∼ Ga(ν/2, ν/2)
Global shrinkage:  τ_k = Π_{h=1}^{k} δ_h,  with δ_1 ∼ Ga(α_1, 1) and δ_h ∼ Ga(α_2, 1) ∀ h ≥ 2.
A conservatively high estimate of q_g^* for all g = 1, . . . , G is chosen initially. The adaptive Gibbs sampler
tunes the number of factors as the MCMC chain progresses. Adaptation decreases in frequency
exponentially fast. The columns in the loadings matrix having some proportion of elements in some
neighbourhood of 0 are monitored. If the number of such columns drops to zero, an additional
loadings column is added by simulating from the prior distribution. Otherwise redundant columns
are discarded and parameters corresponding to the non-redundant columns are retained. The scores
matrix is also modified accordingly, and identifiability issues are addressed via Procrustean methods.
The number of effective factors at each iteration is stored, and the posterior mode after burn-in
and thinning have been applied is used as an estimate for qg with credible intervals quantifying
uncertainty.
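A minimal R sketch of this adaptation step for one cluster's loadings matrix, under assumed names (Lambda for the current loadings, eps for the zero-neighbourhood, prop for the proportion of near-zero elements that marks a column as redundant); in the full sampler the existing local and global shrinkage parameters would be reused rather than redrawn as here.

```r
adapt_loadings <- function(Lambda, eps = 0.01, prop = 0.99,
                           alpha1 = 2, alpha2 = 3, nu = 3) {
  # proportion of elements of each loadings column lying in a neighbourhood of zero
  redundant <- colMeans(abs(Lambda) < eps) >= prop
  if (!any(redundant)) {
    # no redundant columns: append one, simulated from the shrinkage prior
    k      <- ncol(Lambda) + 1
    delta  <- c(rgamma(1, alpha1, 1), rgamma(k - 1, alpha2, 1))
    tau_k  <- prod(delta)                               # global shrinkage for column k
    phi    <- rgamma(nrow(Lambda), nu / 2, nu / 2)      # local shrinkage
    Lambda <- cbind(Lambda, rnorm(nrow(Lambda), 0, sqrt(1 / (phi * tau_k))))
  } else {
    # otherwise discard the redundant columns and retain the rest
    Lambda <- Lambda[, !redundant, drop = FALSE]
  }
  Lambda
}
```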
Results
The methodology is illustrated with application to metabolomic data, consisting of 189 spectral
peaks from urine samples of 18 subjects, half of whom have epilepsy and half of whom are controls.
A 2-cluster MIFA model correctly uncovers the epileptic and control groups, and gives different
credible intervals for q1 and q2.
Discussion
The need to choose the optimal number of latent factors in a mixture of factor analysers has been
obviated using MIFA. Though this greatly reduces the search space, the issue of model choice is still
not entirely resolved. The next logical extension to this body of work is to estimate the number of
mixture components in a similarly choice-free manner, by exploring the literature either on overfitting
mixture models or on Dirichlet Processes.
Activity profiles using self-reported measures in population studies in
young children: are there gender differences?
Aoife O'Neill∗1, Dr. Kieran Dowd2, Ailish Hannigan3, Prof. Clodagh O'Gorman3, Prof. Cathal
Walsh1 and Dr. Helen Purtill1
1Department of Mathematics and Statistics, University of Limerick, Ireland
2Department of Physical Education and Sports Science, University of Limerick, Ireland
3Graduate Entry Medical School, University of Limerick, Ireland
∗Email: [email protected]
Abstract: The aims of this study were to cluster preadolescents into distinct activity profiles for boys and
girls separately, based on self- and parental-reported physical activity (PA) and sedentary behaviour (SB)
variables from the Growing Up in Ireland (GUI) study, and determine if the identified profiles were predictive
of weight change from age 9 to age 13. The findings highlighted 1) distinct activity profiles based on self-
and parental-reported PA and SB variables were identifiable for boys but not for girls, 2) activity profiles
for boys were associated with weight status at 9 years, and 3) activity profiles for boys were predictive of
future weight change at 13 years.
Introduction
Profiling activity behaviours in young children is important in developing a better understanding
of how the associations between activity patterns and weight status track over time (Leech et al.
2014). Cluster analysis is a multivariate statistical technique which aims to group individuals into
profiles, based on similarities found in the data (Ferrar et al. 2013). The GUI study is a nationally
representative study which aims to track the development of children in the Republic of Ireland.
8570 9-year-old children took part in the first wave of GUI and the second wave of the study was
carried out when the children were aged 13 with an 87% follow up (n=7,423). The two waves were
matched to analyse weight changes over time.
Methods
A Two Step Cluster Analysis (TSCA), using the log-likelihood distance measure, was applied to the
self- and parental-reported PA and SB variables, separately by gender. Multiple iterations of the cluster
analysis were carried out, with the number of predefined clusters adjusted to maximise the silhouette
coefficient, which measures the cohesion and separation of the clusters. The cluster membership
variable was used to examine the association between activity levels and BMI categories at ages 9
and 13.
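An illustrative R sketch of choosing the number of clusters by maximising the average silhouette width; it uses k-means on a hypothetical standardised matrix activity of PA/SB variables, rather than the SPSS Two Step procedure used in the study.

```r
library(cluster)   # silhouette()

ks        <- 2:6
sil_width <- sapply(ks, function(k) {
  km <- kmeans(activity, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(activity))[, "sil_width"])
})
best_k <- ks[which.max(sil_width)]   # number of clusters with best cohesion/separation
best_k
```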
Results
5.4% of the boys were classified as obese at age 9 and 4.6% at age 13. 7.8% of the girls were
classified as obese at age 9 and 7.2% at age 13. Four cohesive activity profiles were identified for
boys. No cohesive activity profiles were identified for girls. The profiles found for boys were ordered
by level of activity, with 32.7% of 9-year-old boys assigned to the most active profile (profile 1),
13.4% to profile 2, 28.1% to profile 3 and 23.9% to profile 4. Profile 1 had the highest PA levels
and lowest levels of SB. In comparison, profile 4 had the lowest levels of PA and the highest levels of
SB. 7.4% of boys in the least active group were identified as obese compared to 3.8% in the most
active group. There was an increased risk of being overweight or obese in profile 4 at age 9 (OR =
1.5, 95% CI: 1.2, 1.9) and at age 13 (OR = 1.9, 95% CI: 1.5, 2.3) compared to profile 1, controlling
for socio-demographic variables and parental weight status. The odds of a normal weight 9-year-old
boy in the least active profile becoming overweight or obese at age 13 were over twice those of a boy
in the most active profile (unadjusted OR = 2.4, 95% CI: 1.7, 3.3).
Conclusion
This study provides important insights into profiling PA and SB in pre-adolescent children. It also
contributes to a better understanding of how activity profiles in 9-year-old boys relate to current
weight status; and are predictive of future weight status. This study highlights gender differences in
the responses to the self- and parental-reported PA and SB questions and our findings suggest that
these questions do not identify meaningful patterns of activity in pre-adolescent girls.
References
Leech, R. M., S. A. McNaughton and A. Timperio (2014). ”The clustering of diet, physical activity
and sedentary behavior in children and adolescents: a review.” Int J Behav Nutr Phys Act, 11(4).
Ferrar, K., T. Olds and C. Maher (2013). ”More than just physical activity: Time use clusters and
profiles of Australian youth.” Journal of Science and Medicine in Sport, 16(5) pp. 427 – 432.
Handling Missing Data in Clinical Trials
Amanda Reilly∗1 and John Newell1,2
1HRB Clinical Research Facility, NUI Galway, Ireland
2School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
∗Email: [email protected]
Abstract: Ideally, data collected during a clinical trial would be complete, but this is rarely the case in
reality. In fact, missing data are a serious issue for clinical trials. The focus of this poster is to briefly
describe various techniques used to handle missing data, to allow statisticians to choose the most
appropriate technique for the data being analyzed.
Introduction
Missing data is a common but serious issue for clinical trials, so it is important that statisticians
know how to handle missing data, as well as the consequences of ignoring missing data. This poster
identifies what missing data are, as well as why, when and how they should be handled. Although
not an extensive review, this poster describes various techniques that can be used to handle missing
data.
What are missing data?
Missing data are defined by Little et al. as “values that are not available and that would be meaningful
for analysis if they were observed”. Ideally, data collected during a study would be complete, but
this is rarely the case in reality. Early discontinuations from a trial are the main source of missing
data, but there can be many other reasons such as data entry errors or subjects lost to follow up.
Why should missing data be considered?
The power of a trial is its ability to reliably detect and measure the difference between the treatment
and control groups for an effective treatment. Since power increases with sample size, it is important
to avoid excluding subjects with missing values, as doing so may lead to the incorrect conclusion that
the treatment is ineffective when there is a true treatment effect. Bias in the estimation of the
treatment effect, that is, a systematic tendency to favour the treatment group over the control group,
is another important concern around missing data. It is often caused by ignoring the uncertainty
around missing values when estimating standard errors.
The European Medicines Agency (EMA) Committee for Medicinal Products for Human Use (CHMP)
has released guidelines on handling missing data that are important for regulated trials and also
good clinical practice (GCP) for others. “ICH Topic E 9 - Statistical Principles for Clinical Trials”
(1998) stresses the importance of minimising missing data in a trial but indicates that missing data
may be compensated for with pre-defined methods, while “Guideline on Missing Data in Confirmatory
Clinical Trials” (2010) notes that ignoring missing data is not acceptable for confirmatory clinical
trials (which assess if a treatment effect observed in a previous randomized trial is real or important).
How should missing data be accounted for?
For an accurate analysis, it is paramount that missing data are handled correctly, as the power and
results of the trial may be affected. Various methods will be discussed in the poster, such as deletion,
imputation and modelling methods. The advantages, disadvantages, underlying assumptions and
limitations of each will be noted to allow statisticians to choose the most appropriate technique for the
data being analyzed.
Conclusion
Missing clinical trial data is a common issue and handling it is an important aspect of analysing
the data. The only fully satisfactory approach is to prevent missing data, but there are various methods of
handling unavoidable missing data. The power of the trial, the risk of bias in the estimation of the
treatment effect and the CHMP guidelines should be considered, and it should be acknowledged
that the approach taken may affect the results of the analysis and may be a cause of bias in itself.
References
Little, R. J. et al. (2012). The Prevention and Treatment of Missing Data in Clinical Trials.
The New England Journal of Medicine, 367, pp. 1355 – 1360.
ICH (1998). Statistical Principles For Clinical Trials E9. http://www.ich.org/fileadmin/Public_Web_Site/
ICH_Products/Guidelines/Efficacy/E9/Step4/E9_Guideline.pdf.
EMA Committee for Medicinal Products for Human Use (2010). Guideline on Missing Data in Con-
firmatory Clinical Trials. www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/
2010/09/WC500096793.pdf.
Bayesian Adaptive Ranges for Clinical Biomarkers
Davood Roshan Sangachin∗1,2, Dr. John Ferguson2, Prof. Francis J. Sullivan2,3,
and Dr. John Newell1,2
1School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
2HRB Clinical Research Facility, NUI Galway, Ireland
3Prostate Cancer Institute, NUI Galway, Ireland
∗Email: [email protected]
Abstract: In this paper I will discuss the use of Bayesian techniques to generate adaptive reference ranges
for blood biomarkers collected longitudinally. Examples will be given involving biomarkers collected in a
clinical setting, in particular prostate cancer, and amongst elite athletes.
Introduction
Biomarkers are characteristics that are objectively measured and evaluated as indicators of normal
biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.
They may be measured on a bio-sample (e.g. blood), may be a recording (e.g. blood pressure), or
an imaging test (e.g. echocardiogram), and play a vital role in clinical research as indicators of risk,
disease state or disease progression.
A reference/normal range (e.g. Figure 1), generated from a cross-sectional analysis of healthy
individuals free of the disease of interest, is typically used when interpreting a set of biomarker
test results for a particular patient. An arbitrary percentile cut-point (typically the 95th or 97.5th
percentile) is chosen to define abnormality.
When biomarkers are collected longitudinally for patients, reference ranges that adapt to account
for between- and within-subject variability are needed. In this presentation a Bayesian approach will
be used to generate such adaptive reference ranges (e.g. Figure 2). Initially the patient specific
reference range is based on the range generated for the population, and typically narrows over time
as more data are collected for that individual. Such a range has the potential to detect a meaningful
change earlier.
Examples will be given involving biomarkers collected in a clinical setting and amongst elite athletes.
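A minimal sketch of one way such an adaptive range can be produced, using conjugate normal-normal updating in R; the population parameters (mu0, tau0) and within-subject standard deviation (sigma) are assumed known here, whereas the full Bayesian model would estimate them.

```r
adaptive_range <- function(y, mu0, tau0, sigma, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  t(sapply(seq_along(y), function(n) {
    # posterior for the subject's underlying level after the first n results
    prec <- 1 / tau0^2 + n / sigma^2
    mu_n <- (mu0 / tau0^2 + sum(y[1:n]) / sigma^2) / prec
    sd_n <- sqrt(1 / prec + sigma^2)        # predictive sd for the next observation
    c(lower = mu_n - z * sd_n, upper = mu_n + z * sd_n)
  }))
}

# Hypothetical biomarker series for one patient: the range starts close to the
# population-based range and narrows as the patient's own results accumulate.
adaptive_range(y = c(2.1, 2.3, 1.9, 2.2), mu0 = 2.5, tau0 = 1.2, sigma = 0.3)
```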
Figures
Figure 1: Normal Ranges for a specific biomarker
Figure 2: Bayesian Adaptive Ranges for a specific biomarker
Conclusion
This paper highlights the capabilities of Bayesian approaches for generating adaptive ranges for
clinical biomarkers in order to identify abnormal variability at the patient level.
References
Sottas P, Baume N, Saudan C, Schweizer C, Kamber M, Saugy M. (2007). Bayesian detection of
abnormal values in longitudinal biomarkers with an application to T/E ratio. Biostatistics, Volume
8, pp. 285 – 296.
Zorzoli M. (2011). Biological passport parameters. Journal of Human Sport and Exercise, Volume 6,
pp. 205-217.