CASI 2016
University of Limerick
May 16th — 18th 2016
Welcome
Dear Delegate,
On behalf of the organising committee, it is with the greatest of pleasure that I welcome you to
Limerick for the 36th Conference on Applied Statistics in Ireland. I would particularly like to welcome
our overseas visitors and those of you making your first ever visit to Limerick. Although Limerick
is Ireland’s fourth largest city, it is by far her greenest. Set in the heart of Munster, the outskirts
of the city stretch into rural Limerick and into neighbouring County Clare. If your itinerary allows,
then I would encourage you to visit the medieval King John’s Castle, the Treaty Stone, The Hunt
Museum, and the home of rugby, Thomond Park, all in the city. Bunratty Castle & Folk Park, the
Clare Glens, Killaloe, boat cruises on Lough Derg, and the megalithic site at Lough Gur are all worth
a visit.
The statistical community in Ireland has come a long way since 1981 when Trinity College Dublin
and the Royal Statistical Society Northern Ireland Local Group organised a one day meeting in the
lovely surroundings of Blessington, County Wicklow. Limerick was host for CASI in 1997, when the
Irish Statistical Association was formed. With the twentieth anniversary of the ISA to be marked next
year, it is timely to think about the reinvigoration, organisation and development of statistics in
Ireland over the next decade. I would therefore encourage you to contribute to this year’s ISA AGM
with this in mind.
On a personal note, I would like especially to thank all my colleagues on the organising committee
who have worked together to bring you what will be an excellent conference of truly international
dimension. I wish to thank everyone who agreed to chair the eight sessions, the poster judging
panel, our invited speakers and our five keynote speakers. A big thank you to all of you who have
contributed talks, posters and your time in attending CASI 2016. Finally, I wish to thank this year's
kind sponsors; each of their contributions is greatly appreciated.
I hope you have an enjoyable and rewarding conference.
Kevin Hayes
Chair of CASI 2016
CASI 2016 Organising Committee:
Aoife O’Neill, Helen Purtill, Kevin Burke, Cathal Walsh,
Aine Dooley, Ali Sheikhi, Norma Bargary, Ailish Hannigan,
Avril Hegarty, Sinead Burke, Peg Hanrahan & Kevin Brosnan.
Sponsors
Monday, 16th May
13:00 – 14:00 Lunch
14:00 – 14:10 CASI Opening Address
Session 1: Chair, Professor Cathal Walsh
Keynote Speaker
14:10 – 15:00 Linda Sharples Uncertainty in Health Economic Models: Beyond Clinical Trials
Contributed Talks
15:00 – 15:20 Lisa McCrink Biomarker discovery for chronic kidney disease: a joint modelling approach with competing risks
15:20 – 15:40 Joy Leahy Incorporating single treatment arm evidence into a network meta analysis (if you must!)
15:40 – 16:00 John Newell VisUaliSE
16:00 – 16:20 Tea/Coffee
Session 2: Chair, Professor Chris Jones
Invited Speaker
16:20 – 16:40 Alessandra Menafoglio Object Oriented Kriging in Hilbert spaces with application to particle-size densities in Bayes spaces
Contributed Talks
16:40 – 17:00 Ricardo Ehlers Bayesian inference under sparsity for generalized spatial extreme value binary regression models
17:00 – 17:20 Niamh Russell Alternatives to BIC in posterior model probability for Bayesian model averaging
17:20 – 17:40 Riccardo Rastelli Optimal Bayes decision rules in cluster analysis via greedy optimisation
17:40 – 17:50 Wine & Canapes
Poster & Lightning Talks Session: Chair, Dr. Norma Bargary
17:45 – 18:30 Lightning Talks
18:30 – 20:00 Posters & Wine Reception
20:00 – 22:00 Networking Dinner
Tuesday, 17th May (Morning)
Session 3: Chair, Professor John Hinde
Keynote Speaker
09:10 – 10:00 Laura Sangalli Functional Data Analysis, Spatial Data Analysis and Partial Differential Equations: a fruitful union
Contributed Talks
10:00 – 10:20 Alan Benson An adaptive MCMC method for multiple changepoint analysis with applications to large datasets
10:20 – 10:40 Rafael Moral Diagnostic checking for N-mixture models applied to mite abundance data
10:40 – 11:00 Bernardo Nipoti A Bayesian nonparametric approach to the analysis of clustered time-to-event data
11:00 – 11:20 Tea/Coffee
Session 4: Chair, Professor Christian Pipper
Invited Speaker
11:20 – 11:40 Chris Jones Log-location-scale-log-concave distribution for lifetime data
Contributed Talks
11:40 – 12:00 Amirhossein Jalali Confidence envelopes for the mean residual life function
12:00 – 12:20 Christopher Steele Modelling the time to type 2 diabetes related complications using a survival tree based approach
12:20 – 12:40 Shirin Moghaddam A Bayesian approach to the imputation of survival data
12:40 – 13:50 Lunch
Tuesday, 17th May (Afternoon)
Session 5: Chair, Professor Ailish Hannigan
Keynote Speaker
13:50 – 14:40 Bethany Bray Cutting-Edge Advances in Latent Class Analysis for Today's Behavioral Scientists
Contributed Talks
14:40 – 15:00 Myriam Tami EM estimation of a structural equation model
15:00 – 15:20 Arthur White Identifying patterns of learner behaviour using latent class analysis
15:20 – 15:40 Brendan Murphy Variable selection for latent class analysis with application to low back pain diagnosis
15:40 – 16:00 Tea/Coffee
Session 6: Chair, Dr. John Newell
Invited Speaker
16:00 – 16:40 David Leslie Thompson sampling for website optimisation
Contributed Talks
16:40 – 17:00 Sergio Gonzalez-Sanz Beyond machine classification: hedging predictions with confidence and credibility values
17:00 – 17:20 James Sweeney Spatial modelling of house prices in the Dublin area
17:20 – 17:40 Susana Conde Model selection in sparse multi-dimensional contingency tables
17:45 – 18:00 ISA AGM
19:00 – 19:15 Bus to Conference Dinner
19:45 – 22:00 Conference Dinner
Wednesday, 18th May
Special INSIGHT Session: Chair, Dr. Kevin Hayes
Keynote Speaker
09:10 – 10:00 Sofia Olhede Anisotropy in random fields
INSIGHT Session
Nial Friel
10:00 – 11:00 Brian Caulfield Insight centre for data analytics: A collection of short stories
Brendan Murphy
11:00 – 11:20 Tea/Coffee
Session 7: Chair, Dr. Kevin Burke
Invited Speaker
11:20 – 11:40 Christian Pipper Evaluation of multi-outcome longitudinal studies
Contributed Talks
11:40 – 12:00 Conor Donnelly A multivariate joint modelling approach to incorporate individuals' longitudinal response trajectories within the Coxian phase-type distribution
12:00 – 12:20 Katie O'Brien Breast screening and disease subtypes: a population-based analysis
12:20 – 12:40 Andrew Gordon Prediction of time until readmission to hospital of elderly patients using a discrete conditional phase-type model incorporating a survival tree
12:40 – 12:50 CASI Closing Address
12:50 – 14:00 Lunch
Monday 16th May
Session 1: Chair, Professor Cathal Walsh
Keynote Speaker
14:10 – 15:00 Linda Sharples Uncertainty in Health Economic Models: Beyond Clinical Trials
Contributed Talks
15:00 – 15:20 Lisa McCrink Biomarker discovery for chronic kidney disease: a joint modelling approach with competing risks
15:20 – 15:40 Joy Leahy Incorporating single treatment arm evidence into a network meta analysis (if you must!)
15:40 – 16:00 John Newell VisUaliSE
Uncertainty in Health Economic Models: Beyond Clinical Trials
Linda Sharples∗1
1Statistics Department, University of Leeds, United Kingdom
∗Email: [email protected]
While RCTs have become well established as an approach to establishing causal relationships between
new treatments and outcomes, it is necessary to incorporate other sources of evidence in order to
generalise findings beyond the narrow setting of a single trial. This is especially the case in the
context of health economic models, where a decision needs to be taken regarding the introduction
of a treatment into a population and this is likely to have an impact over a long time frame.
Traditional approaches to extrapolations of survival effects adopt a parametric survival function and
examine the difference between new and existing treatments. However, trial data is often of relatively
short duration and thus there is substantial uncertainty about the appropriate functional form. In
this talk, we use an approach which incorporates evidence from long term survival patterns in registry
data and that of the general population in order to extrapolate the treatment effect observed in the
trial. We implement this approach in a Bayesian evidence synthesis framework, with application to
the example of Implantable Cardioverter Defibrillators.
Biomarker Discovery for Chronic Kidney Disease: A Joint Modelling
Approach with Competing Risks
Lisa McCrink∗1, Ozgur Asar2, Helen Alderson3, Philip Kalra3 and Peter Diggle4
1CenSSOR, Queen's University Belfast, UK
2Department of Biostatistics and Health Informatics, Acibadem University, Istanbul, Turkey
3Vascular Research Group, Manchester Academic Health Sciences Centre, University of Manchester, Salford Royal NHS Foundation Trust, Salford, UK
4CHICAS, Lancaster Medical School, Lancaster University, UK
∗Email: [email protected]
Abstract: In this research, the relationship between fibroblast growth factor 23 (FGF-23), a potential
biomarker for the progression of chronic kidney disease (CKD), and the survival of CKD patients is investi-
gated. Utilising a joint modelling approach, the association between FGF-23 and the widely used biomarker,
serum creatinine, is analysed and the effect of both biomarkers changing over time on several competing
risks for CKD patients is presented.
Introduction
The prevalence of chronic kidney disease (CKD) within the UK is approximately 6-8% where a high
proportion of individuals suffer mortality due to associated cardiovascular events (Roderick, 2011).
Investigating the factors that affect the survival of CKD patients is of utmost importance. This
research will explore the association between several competing risks for CKD patients and a novel
biomarker, fibroblast growth factor 23 (FGF-23).
Data
The data analysed within this research was obtained from the Chronic Renal Insufficiency Stan-
dards Implementation Study (CRISIS), an observational study run by Salford Royal Hospital NHS
Foundation Trust, Greater Manchester. It consists of 999 patients with a total of 2,468 repeated
measurements. In a multivariate approach, FGF-23 will be studied alongside the commonly used
biomarker, serum creatinine. The influence of both biomarkers on the time to fatal or non-fatal
cardiac events, time to death due to non-cardiac reasons and time to the start of renal replacement
therapy will be analysed.
Methods
Over recent years, there has been a growing volume of literature focusing on joint modelling approaches
in the analysis of associated longitudinal and survival processes (Rizopoulos, 2012). It is common
within the literature that such approaches have analysed univariate repeated measurements and time-
to-event data. This research will expand upon such approaches through the utilisation of a bivariate
linear mixed model using correlated random effects to jointly study patients’ repeated measurements
of FGF-23 and serum creatinine. This longitudinal submodel is linked with a cause-specific hazards
model to represent the three competing risks for CKD patients discussed previously (Williamson,
2008). The parameters of both submodels are estimated using an expectation-maximisation algo-
rithm.
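As a rough illustration of the ingredients involved (not the authors' EM-based joint model), the R sketch below takes a simpler two-stage route: a mixed model is fitted to the longitudinal biomarker and the subject-level predictions are then carried into cause-specific Cox models, one per competing risk. The data frames long_data and surv_data and all variable and event names are hypothetical.

library(lme4)
library(survival)

## stage 1: subject-specific FGF-23 trajectories from a linear mixed model
stage1 <- lmer(log(fgf23) ~ time + (time | id), data = long_data)
long_data$fgf23_hat <- predict(stage1)

## carry the last fitted biomarker value per patient into the survival data
base <- merge(surv_data, aggregate(fgf23_hat ~ id, long_data, tail, 1), by = "id")

## stage 2: cause-specific hazards, one Cox model per competing risk,
## treating the other event types as censoring (event codes are hypothetical)
fits <- lapply(c("cardiac", "non_cardiac_death", "rrt"), function(ev)
  coxph(Surv(etime, event == ev) ~ fgf23_hat + age + sex, data = base))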
Discussion
The research presented utilises a joint modelling approach to demonstrate the association between
the survival of CKD patients and the novel biomarker, FGF-23. This builds upon previous work
which, through utilisation of a two-stage approach, highlighted a significant association between the
repeated measurements of FGF-23 and the risk of death and cardiovascular events for CKD patients
(Alderson, To Be Submitted).
References
Roderick, P., Roth, M. and Mindell, J. (2011). Prevalence of chronic kidney disease in England: Find-
ings from the 2009 Health Survey for England. Journal of Epidemiology and Community Health,
64(2), pp. A1 – A40.
Rizopoulos, D. (2012). Joint models for longitudinal and time-to-event data: With applications in R.
CRC Press.
Williamson, P. R., Kolamunnage-Dona, R., Philipson, P. and Marson, A. G. (2008). Joint
modelling of longitudinal and competing risks data. Statistics in Medicine, 27, pp. 6426 – 6438.
Alderson, H. V., Asar, O., Ritchie, J. P., Middleton, R., Larsson, A., Diggle, P. J., Larsson,
T. E., Kalra, P. A. (To Be Submitted). Longitudinal change in FGF-23 is associated with risk
for death and cardiovascular events but not renal replacement therapy in advanced CKD. To Be
Submitted.
Incorporating Single Treatment Arm Evidence into a Network Meta
Analysis (if you must!)
Joy Leahy ∗1 and Cathal Walsh2
1Department of Statistics, Trinity College Dublin, Ireland
2Professor of Statistics, University of Limerick, Ireland
∗Email: [email protected]
Abstract: Combining all available evidence in the form of a Mixed Treatment Comparison (MTC) is an
important tool for facilitating decision making in choosing agents in a clinical setting. Randomised Controlled
Trials (RCTs) are considered the gold standard of evidence, because potential bias is minimised by
their controlled design. However, much of the evidence available can be gained from other sources such
as one-armed, single-comparator and observational studies. Here we propose to include single arm and
single agent trials by choosing a similar arm from another trial in the network to use as a comparator arm.
By simulating trials so that the effects of treatments are known, and by sampling across the potential
matches, we vary parameters which are likely to influence the effectiveness of matching. The objective here
is to identify and examine the parameters which influence how well matching works, propose a method for
choosing matched arms, and assess whether they are likely to work better than using the RCT evidence
alone.
Introduction
There are caveats to consider when including potentially biased sources of evidence. In a well
conducted RCT we can be confident that patients are exchangeable across treatment arms as they
have been randomly assigned. Including single comparator trials breaks this randomisation. However,
one could argue that randomisation is a sufficient but not necessary condition for comparing the
effects of different treatments. Our goal is to find reliable methods of including potentially biased
sources of evidence into an MTC along with devising a test for bias.
Schmitz et al (2013) propose methods for including observational studies into an MTC. Thom et
al (2015) also investigate methods of including single arm evidence. Other relevant work includes
Hong et al (2015) which deals with absolute versus relative effects, and a follow-up discussion by
Dias and Ades (2015).
Methods
We ran a simulation study to investigate possible matching methods versus using RCT evidence
alone. We simulated the study effect, treatment effect, the effect of covariates and the number of
covariates in each arm. From this we obtained the binary response rate as follows:
$$\mathrm{Response}_{ij} = \mu_i + \delta_{t[i,j]} + \beta_1 x_{1ij} + \beta_2 x_{2ij} + \cdots + \beta_m x_{mij} + \cdots + \beta_n x_{nij},$$
where µi is the study effect in study i, δt[i,j] is the treatment effect in arm j of study i, βm is the
effect of each of the covariates, and xmij is the proportion of patients with the binary covariate m in
arm j of study i. We ran a standard logit model in OpenBUGS 3.2.3. We then compared the model's
estimate of the treatment effect by varying a number of parameters and assessed if these parameters
had an effect on whether the matched arm improved our estimate. The parameters considered were
study size, treatment effect, study effect and covariate effect.
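A rough R sketch of this simulation set-up follows (not the authors' code); the logit link used to turn the linear predictor into a response probability, the number of studies and covariates, and all parameter values are arbitrary choices for illustration.

set.seed(1)
n_stud <- 10; n_cov <- 3
mu    <- rnorm(n_stud, 0, 0.5)                  # study effects
delta <- c(0, 0.4)                              # treatment effects (reference arm = 0)
beta  <- rnorm(n_cov, 0, 0.3)                   # covariate effects
x     <- matrix(runif(n_stud * n_cov), n_stud)  # proportion with each binary covariate

arm   <- expand.grid(study = 1:n_stud, trt = 1:2)   # two-arm trials
lin   <- mu[arm$study] + delta[arm$trt] + as.vector(x[arm$study, ] %*% beta)
arm$p <- plogis(lin)                            # arm-level response probability
arm$r <- rbinom(nrow(arm), size = 100, prob = arm$p)  # responders out of 100 patients per arm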
Results
Figure 1
Matching by the covariate generally produces better estimates than randomly matching. Including single arm evidence produces better estimates than including only RCT when there is no study effect. Increasing the study effect eventually leads to the single arm evidence giving less accurate evidence. An illustration of this is shown in Figure 1.
Conclusion
Under certain conditions, including single arm
evidence can increase the accuracy of our estimates of treatment effects in an MTC. However,
we must exercise caution when including single arm estimates as they may introduce bias into the
model.
References
Dias, S., and Ades, A. E. (2016). Absolute or relative effects? Arm-based synthesis of trial data. Re-
search Synthesis Methods, 7(1), pp. 23 – 28.
Hong, H. et al (2015). A Bayesian missing data framework for generalized multiple outcome mixed
treatment comparisons. Research Synthesis Methods, 7(1), pp. 6 – 22.
Thom, HH. et al (2015). Network meta-analysis combining individual patient and aggregate data from
a mixture of study designs with an application to pulmonary arterial hypertension. BMC Medical
Research Methodology, 12, pp. 15 – 34.
Schmitz, S. et al (2013). Incorporating data from various trial designs into a mixed treatment comparison
model. Statistics in Medicine., 32, pp. 2935 – 2949.
Visualise
John Newell∗1,2, Amirhossein Jalali1,2, Shirin Moghaddam2 and Jaynal Abedin3
1HRB Clinical Research Facility, NUI Galway, Ireland
2School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
3INSIGHT Centre for Data Analytics, NUI Galway, Ireland
∗Email: [email protected]
Abstract: The ability to summarise data graphically while adhering to good statistical practice is a key
component of a statistician's training. The emergence of open source options in R and commercial applica-
tions for the creation of dynamic and interactive graphs raises questions as to whether traditional teaching
of statistics should adapt to incorporate training in these environments.
Introduction
The emergence of big data has brought with it the necessity to develop methods and software for
the generation of interactive and dynamic graphs, a renaming of graphs to visualisations and the
creation of intriguing job titles such as data visualiser, data architect, data engineer, data scientist,
data wrangler and data munger. For example, the SAS website claims that "Data visualization is going
to change the way our analysts work with data. They're going to be expected to respond to issues
more rapidly. And they'll need to be able to dig for more insights, look at data differently, more
imaginatively. Data visualization will promote that creative data exploration."
Methods
Assessment of data quality, subjective impressions with respect to the primary question of interest
and the assumptions underlying the analysis are typically achieved by preparing suitable (static)
graphs. Dynamic visualisations are often lauded as they allow the user to drill down and dive deeper
into the sample to generate insights that may be missed in static graphics. For example, a dashboard
of dynamically linked graphs summarizing data arising from a classical two-sample paired problem
is illustrated in Figure 1, with the relationship between adherence and outcome highlighted.
Discussion
In this talk dynamic visualisations across a range of platforms will be presented and their advantages
and disadvantages discussed. The question will be raised as to whether and how the teaching of
statistics should change given the availability of software and R packages for the generation of
dynamic graphs. Examples will be given of the capabilities of open source options in R, commercial
applications (Tableau and Qlik) and so-called trade-off analytics and deep learning (IBM Watson)
in a variety of clinical and sports related settings.
Figure 1: Sample Tableau Dashboard
Monday 16th May
Session 2: Chair, Professor Chris Jones
Invited Speaker
16:20 – 16:40 Alessandra Menafoglio Object Oriented Kriging in Hilbert spaces with application to particle-size densities in Bayes spaces
Contributed Talks
16:40 – 17:00 Ricardo Ehlers Bayesian inference under sparsity for generalized spatial extreme value binary regression models
17:00 – 17:20 Niamh Russell Alternatives to BIC in posterior model probability for Bayesian model averaging
17:20 – 17:40 Riccardo Rastelli Optimal Bayes decision rules in cluster analysis via greedy optimisation
Object Oriented Kriging in Hilbert spaces with application to
particle-size densities in Bayes spaces
Alessandra Menafoglio∗1, Piercesare Secchi1 and Alberto Guadagnini2
1MOX, Department of Mathematics, Politecnico di Milano, Italy
2Department of Civil and Environmental Engineering, Politecnico di Milano, Milano, Italy; Department of Hydrology and Water Resources, University of Arizona, 85721 Tucson, Arizona, USA
∗Email: [email protected]
Abstract: We present a methodology to perform Kriging of spatially dependent data of a Hilbert space.
Our object-oriented approach is conducive to the spatial prediction of the entire data-object, which is
conceptualized as a point within an infinite dimensional Hilbert space. In this communication, we focus
on the application of this broad framework to the geostatistical treatment of particle-size densities, i.e.,
probability densities that describe the distribution of grain sizes in given soil samples.
Introduction
The variety, dimensionality and complexity of the data commonly available in field studies pose new
challenges for data-driven geoscience applications. We here focus on the geostatistical treatment
of complex environmental data such as georeferenced functional data (e.g., curves or surfaces) or
distributional data (e.g., probability density functions), by pursuing an object oriented approach
(e.g., Marron and Alonso, 2014). The interpretation of the data as points within an infinite-
dimensional space offers a powerful perspective to address key issues such as estimation, prediction
and uncertainty quantification in a geostatistical setting.
Methods
Motivated by a field application dealing with particle-size data, we shall focus on the problem of the
geostatistical characterization of functional compositions (FCs). These are functions constrained to
be positive and to integrate to a constant (e.g., probability density functions). We interpret each
datum as a point within the Bayes Hilbert space of Egozcue et al. (2006) whose elements are FCs.
We here review the approach we developed in Menafoglio et al. (2014), and exploit an appropriate
notion of spatial dependence to develop a Functional Compositional Kriging (FCK) predictor. FCK
provides the best linear unbiased predictor of the data, in the sense of minimizing the prediction
error under the unbiasedness constraint. Based on our recent work (Menafoglio et al., 2015), we will
illustrate additional tools to explore and characterize the (spatial) variability of the data, including
methods for dimensionality reduction and uncertainty quantification.
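To give a concrete, if greatly simplified, sense of the Bayes-space representation, the R sketch below applies the centred log-ratio (clr) map to a discretized particle-size density; the density itself is a toy log-normal, not data from the study, and the full FCK machinery is not reproduced.

grain <- seq(0.01, 2, length.out = 100)              # grain-size grid (mm)
step  <- diff(grain)[1]
dens  <- dlnorm(grain, meanlog = -1, sdlog = 0.6)    # toy particle-size density
dens  <- dens / sum(dens * step)                     # normalise to integrate to 1

clr   <- log(dens) - mean(log(dens))                 # clr transform: unconstrained representation
back  <- exp(clr) / sum(exp(clr) * step)             # inverse clr recovers the density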
Results and Discussion
We illustrate our theoretical developments on a field study relying upon the particle-size densities
(PSDs) considered in Menafoglio et al. (2014). These data describe the local distribution of grain
sizes for 60 soil samples collected along a borehole in a shallow aquifer near the city of Tubingen,
Germany. We will show examples of Kriging predictions at the field site, and discuss the quality
of the results, assessed via cross-validation. A key advantage of our approach lies in the possibility
of obtaining predictions of the entire object at unsampled locations, as opposed to classical kriging
techniques which allow only finite-dimensional predictions, based on a set of selected features (or
synthetic indices) of the data (e.g., the mean of the density function). In general, the availability
of methods for the analysis and prediction of complex objects allows one to use the entire information
content embedded in the data, and to project this to unsampled locations in the system.
References
Egozcue, J. J., Dıaz-Barrero, J. L., and Pawlowsky-Glahn, V. (2006) Hilbert space of probability den-
sity functions based on Aitchison geometry. Acta Mathematica Sinica, English Series, 22(4), pp.
1175 – 1182.
Marron, J. S., and Alonso, A. M. (2014) Overview of object oriented data analysis. Biometrical Jour-
nal, 56, pp. 732 – 753.
Menafoglio, A., Secchi, P., and Dalla Rosa, M. (2013) A Universal Kriging predictor for spatially de-
pendent functional data of a Hilbert Space. Electronic Journal of Statistics, 7, pp. 2209 – 2240.
Menafoglio, A., Guadagnini, A., and Secchi, P. (2014) A Kriging approach based on Aitchison geom-
etry for the characterization of particle-size curves in heterogeneous aquifers. Stochastic Environ-
mental Research and Risk Assessment, 28(7), pp. 1835 – 1851.
Menafoglio, A., Guadagnini, A., and Secchi, P. (2015) Stochastic Simulation of Soil Particle-Size
Curves in Heterogeneous Aquifer Systems through a Bayes space approach. MOX-report 59/2015.
Bayesian Inference under Sparsity for Generalized Spatial Extreme Value
Binary Regression Models
Ricardo S. Ehlers∗1, Dipankar Bandyopadhyay2 and Nial Friel3
1Department of Applied Mathematics & Statistics, University of Sao Paulo, Brazil
2Department of Biostatistics, Virginia Commonwealth University, USA
3School of Mathematical Sciences, University College Dublin, Ireland
∗Email: [email protected]
Abstract: In this paper, we develop Bayesian Hamiltonian Monte Carlo inference under sparsity for spatial
generalized extreme value binary regression models. We apply our methodology to a motivating dataset
on periodontal disease.
Introduction
Consider a spatial situation where we observe a binary response yis for subject i, at site s within
subject i. We assume that Yis ∼ Bernoulli(pis) with
$$P(Y_{is} = 1) = p_{is} = 1 - \exp\left\{-\left[1 - \xi\,(\mathbf{x}_i'\boldsymbol{\beta} + \phi_{is})\right]_{+}^{-1/\xi}\right\},$$
where xi is the vector of covariates for subject i (that do not vary across space), β ∈ Rk is the
vector of covariate coefficients (fixed effects) and φis are spatially correlated random effects. We
assume that φi ∼ N(0,Σ) where Σ is the positive definite variance covariance matrix and we denote
the corresponding precision matrix by Ω = Σ−1. Instead of the usual conditionally autoregressive
(CAR) model for the spatial effects, we assume Ω = Σ−1 ∼ G-Wishart_W(κ, S) with degrees of
freedom κ and scale matrix S, constrained to have null entries for each zero in the adjacency matrix
W (Wss′ = 1 if s and s′ are neighbours and zero otherwise). Its density function is given by,
$$p(\Omega \mid W) = \frac{1}{Z_W(\kappa, S)}\, |\Omega|^{(\kappa - 2)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}(S\Omega)\right\} I(\Omega \in \mathcal{M}_W), \qquad \kappa > 2,$$
where I(·) is the indicator function, MW is the set of symmetric positive definite matrices with
null off-diagonal elements corresponding to zeroes in W and ZW (κ, S) is the normalizing constant
(Roverato, 2002). We specify S = D − ρW where D is a diagonal matrix with elements given by
the number of neighbours at each location and ρ ∈ (0, 1) controls the degree of spatial correlation.
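A small R sketch of the GEV-type inverse link defined above may help fix ideas; it is a direct transcription of the formula for p_is at a single site, not the authors' HMC code, and the inputs are arbitrary.

## GEV-type inverse link: p = 1 - exp{ -[1 - xi * eta]_+^(-1/xi) }
p_gev <- function(eta, xi) {
  z <- pmax(1 - xi * eta, 0)    # [.]_+ truncation
  1 - exp(-z^(-1 / xi))
}
p_gev(eta = c(-1, 0, 1), xi = 0.2)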
Methods
Assuming that β, φ, ξ and κ are a priori independent the joint posterior distribution is given by,
$$p(\boldsymbol{\beta}, \xi, \boldsymbol{\phi}, \kappa, \Omega \mid \mathbf{y}, X) \propto \prod_{i=1}^{n} p(\mathbf{y}_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \xi, \boldsymbol{\phi}_i)\, p(\boldsymbol{\phi}_i \mid \Omega) \times p(\Omega \mid W, \kappa)\, p(\boldsymbol{\beta})\, p(\xi)\, p(\kappa).$$
We assign the following priors: β ∼ N(0, σ²_β I_k), ξ ∼ N(0, σ²_ξ) and κ ∼ Ga(a, b). The full condi-
tional posterior distribution of Ω is again G-Wishart and we use a novel approach via Hamiltonian
Monte Carlo (HMC) methods to sample from it. The spatial effects φi and the degrees of freedom κ
have no closed form full conditionals and are sampled using Metropolis-Hastings schemes. However
the full conditional of κ depends on the normalizing constant Z(κ, S) and we use the exchange
algorithm (Murray et al., 2006). Finally, (β, ξ) are sampled using a version of Riemann manifold
HMC methods (Girolami and Calderhead, 2011).
Results
Our motivating data was provided by the Center for Oral Health Research (COHR) at the Medical
University of South Carolina. The objective of this analysis is to quantify periodontal disease status
of a population and to study the associations between disease status and patient-level covariates:
age, body mass index (BMI), gender, a diabetes marker (HbA1c) and smoking status. Preliminary
results indicate that all covariates have significant effects on the probability of disease, and likewise
for the spatial effects. The estimate of ξ indicates a strong asymmetry in the link between covariates and
disease probability.
References
Girolami, M. and Calderhead, B. (2011) Riemann manifold Langevin and Hamiltonian Monte Carlo
methods. Journal of the Royal Statistical Society B, 73, pp. 123 – 214.
Murray, I., Ghahramani, Z. and MacKay, D.J.C. (2006). MCMC for doubly-intractable distributions. In: Uncertainty in
Artificial Intelligence, AUAI Press, R. Dechter and T. Richardson (Eds.), pp. 359 – 366.
Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application
to Bayesian inference for gaussian graphical models. Scandinavian Journal of Statistics, 29, pp.
391 – 411.
Alternatives to BIC in posterior model probability for Bayesian model
averaging
Niamh Russell1,∗, Yuxin Bai1,∗, Brendan Murphy1,2
1School of Mathematical Sciences, University College Dublin, Ireland
2The Insight Centre for Data Analytics, UCD, Ireland.
∗Email: [email protected]
Abstract: Bayesian model averaging (BMA) classically uses BIC to calculate estimates of posterior model
probability. This method is sometimes objected to in the literature on the basis that it is only correct to
use BIC to select the optimal model. We propose two alternatives.
Introduction
Commonly in model-based clustering, we fit a number of competing models and base the clustering
results on the best model. This ignores the uncertainty that arises from model selection. BMA allows
for combination of clustering results across multiple models, thus accounting for model uncertainty.
However, the best method for estimating the model uncertainty is contested.
One approximation of the posterior model probabilities uses the BIC criterion to determine the
weights, where for a list of candidate models M1, . . . , MK,
$$P(M_k \mid D) \simeq \frac{\exp\left(\tfrac{1}{2}\mathrm{BIC}_k\right)}{\sum_{j=1}^{K} \exp\left(\tfrac{1}{2}\mathrm{BIC}_j\right)}. \qquad (1)$$
We propose two alternative methods. One uses BIC∗ in a similar way to Equation 1 but which
corrects for small sample size. Another method is the convex hull method (CHull) (Bulteel et al
(2013)) which produces weights as part of the algorithm.
Methods
When using BMA, suggested in Raftery (1995), to combine results across multiple models, it is
required to weight candidate models according to the posterior model probability. This technique
allows us to average across a statistic of interest, θ, say.
We propose averaging across θMk, the statistic of interest for each model, using the posterior model
probabilities from the two proposed methods. So, given model-based estimates of θ, θMk say, we have
$$\theta_{\mathrm{BMA}} = \sum_{k=1}^{K} P(M_k \mid \mathrm{Data})\, \theta_{M_k},$$
where P(Mk | Data) is calculated using three alternative methods: BIC as in Equation 1, BIC∗
(Equation 2) and CHull.
$$\mathrm{BIC}^{*} = 2\mathcal{L} - \sum_{p} \log(n_p), \qquad (2)$$
where the penalty on the log-likelihood in BIC∗ depends on the number of observations required to
estimate each parameter p in the model.
CHull is a heuristic method in which a convex hull is drawn over an ordered graph of the log-likelihood
versus the number of free parameters for a number of candidate models. The models on the
convex hull are then compared to calculate the required weights.
We propose to compare results from the three methods.
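As a minimal sketch of how the BIC-based weights of Equation 1 feed into a model-averaged estimate, the following R lines compute the weights and the average; the BIC values and model-specific estimates are made up for illustration.

bma_weights <- function(bic) {
  w <- exp(0.5 * (bic - max(bic)))   # subtract max(bic) for numerical stability
  w / sum(w)
}
bic       <- c(-1042.3, -1039.8, -1045.1)  # BIC of each candidate model (made up)
theta_hat <- c(0.31, 0.28, 0.35)           # model-specific estimates of theta (made up)
w         <- bma_weights(bic)
theta_bma <- sum(w * theta_hat)            # Bayesian model average of theta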
Discussion
We propose two alternatives to using BIC as a basis for estimating posterior model probability.
References
Bulteel, K, Wilderjans, TF, Tuerlinkx, F. and Ceulemans, E. (2013). CHull as an alternative to AIC
and BIC in the context of mixtures of factor analyzers. Behavior Research Methods, 45 3, pp.
782–791.
Fraley, C. and Raftery, A.E. (2002). Model-based clustering, discriminant analysis, and density estima-
tion. JASA, 97, pp. 611–631.
Raftery, A.E. (1995) Bayesian model selection in social research. Sociological Methodology, 25, pp.
111–164.
Optimal Bayes decision rules in cluster analysis via greedy optimisation
Riccardo Rastelli∗1 and Nial Friel1
1School of Mathematics and Statistics, University College Dublin, Ireland
∗Email: [email protected]
Abstract: In cluster analysis interest lies in capturing the partitioning of individuals into groups, such that
those belonging to the same group share similar attributes or relational profiles. Recent developments in
statistics have introduced the possibility of obtaining Bayesian posterior samples for the clustering variables
efficiently. However, the interpretation of such a collection of data is not straightforward, mainly due to
the categorical nature of the clustering variables and to the computational requirements. We consider
an arbitrary clustering context and introduce a greedy algorithm capable of finding an optimal Bayesian
clustering solution with a low computational demand and potential to scale to big data scenarios.
Introduction
The observed data characterises the profiles of a group of individuals. The population may present
a clustering structure where within each cluster individuals exhibit similar attributes or behaviours.
Each individual is characterised by a latent allocation categorical variable z defining its member-
ship. Markov Chain Monte Carlo techniques can be adopted to obtain a marginal posterior sample
Z(1), . . . ,Z(T ) for the clustering variables. However, summary statistics such as the posterior mean
or median have little meaning due to the discrete nature of the clustering variables.
A decision theoretic approach offers an alternative interpretation: a loss function L is used to assess
differences between partitions, and hence the optimal clustering solution is that minimising the
expected posterior loss, approximated by:
$$\psi(\mathbf{Z}) \approx \sum_{t=1}^{T} L\!\left(\mathbf{Z}^{(t)}, \mathbf{Z}\right). \qquad (1)$$
Related work
The 0-1 loss L(Z, Z′) = 1{Z ≠ Z′} is typically used since it implies simplifications and low computational
demands. Bertoletti et al. (2015) propose greedy routines that tackle this task efficiently.
However, the 0-1 loss is evidently short-sighted: for different partitions, the loss is equal to one re-
gardless of how different the partitions actually are. Wade and Ghahramani (2015) have addressed
this impasse, proposing to use more sensible loss functions such as those derived from information
theory. However, minimisation of the expected posterior loss requires many evaluations of the
sum on the right-hand side of Equation 1. This becomes impractical even for moderately sized datasets.
Proposed solution
In this work we focus on the wide family of loss functions that depend on the partitions only through
their contingency table. In such scenario, when a small perturbation is applied to a partition, the
variation in the corresponding loss can be evaluated in a constant time. This makes the greedy
routines of Bertoletti et al. (2015) applicable to the problem described, taking full advantage of the
computational savings to reduce the overall complexity and make such a tool widely applicable. We
present example applications to Gaussian finite mixtures and stochastic block models.
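A minimal R sketch of the quantity being minimised may be useful here (it is not the authors' greedy algorithm): it evaluates the Variation of Information loss, which depends on two partitions only through their contingency table, and averages it over hypothetical posterior samples, approximating Equation 1 up to a factor of 1/T.

vi_loss <- function(z1, z2) {
  n  <- length(z1)
  ct <- table(z1, z2) / n                    # contingency table as joint label frequencies
  p1 <- rowSums(ct); p2 <- colSums(ct)
  h  <- function(p) -sum(p[p > 0] * log(p[p > 0]))
  2 * h(ct) - h(p1) - h(p2)                  # VI = 2 H(Z1, Z2) - H(Z1) - H(Z2)
}
expected_loss <- function(z, Z_samples)      # Z_samples: hypothetical T x n matrix of labels
  mean(apply(Z_samples, 1, vi_loss, z2 = z))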
Figure 1: A toy example on a simulated Gaussian mixture: the optimal clustering according to 0-1 loss (left panel) is qualitatively different from that obtained using the Variation of Information loss (right panel).
References
Bertoletti, M., Friel N. and Rastelli, R. (2015). Choosing the number of clusters in a finite mixture
model using an exact integrated completed likelihood criterion. METRON, pp. 1 – 23.
Wade, S. and Ghahramani, Z. (2015). Bayesian cluster analysis: Point estimation and credible balls.
arXiv preprint 1505.03339
Tuesday 17th May
Session 3: Chair, Professor John Hinde
Keynote Speaker
09:10 – 10:00 Laura Sangalli Functional Data Analysis, Spatial Data Analysis and Partial Differential Equations: a fruitful union
Contributed Talks
10:00 – 10:20 Alan Benson An adaptive MCMC method for multiple changepoint analysis with applications to large datasets
10:20 – 10:40 Rafael Moral Diagnostic checking for N-mixture models applied to mite abundance data
10:40 – 11:00 Bernardo Nipoti A Bayesian nonparametric approach to the analysis of clustered time-to-event data
Functional Data Analysis, Spatial Data Analysis and
Partial Differential Equations: a fruitful union
Laura M. Sangalli
MOX - Dipartimento di Matematica, Politecnico di Milano, Italy
Email: [email protected]
Abstract: I will discuss an innovative method for the analysis of spatially distributed data, that merges
advanced statistical and numerical analysis techniques.
Spatial regression with differential regularizations
I will present a novel class of models for the analysis of spatially (or space-time) distributed data,
based on the idea of regression with differential regularizations. The models merge statistical
methodology, specifically from functional data analysis, and advanced numerical analysis techniques.
Thanks to the combination of potentialities from these different scientific areas, the proposed method
has important advantages with respect to classical spatial data analysis techniques. Spatial Regres-
sion with differential regularizations is able to efficiently deal with data distributed over irregularly
shaped domains, with complex boundaries, strong concavities and interior holes [Sangalli et al.
(2013)]. Moreover, it can comply with specific conditions at the boundaries of the problem domain
[Sangalli et al. (2013), Azzimonti et al. (2014, 2015)], which is fundamental in many applications
to obtain meaningful estimates. The proposed models have the capacity to incorporate problem-
specific a priori information about the spatial structure of the phenomenon under study [Azzimonti
et al. (2014, 2015)]; this very flexible modeling of space-variation naturally accounts for
anisotropy and non-stationarity. Space-varying covariate information is accounted for via a semipara-
metric framework. The models can also be extended to space-time data [Bernardi et al. (2016)].
Furthermore, spatial regression with differential regularizations can deal with data scattered over
non-planar domains, specifically over Riemannian manifold domains, including surface domains with
non-trivial geometries [Ettinger et al. (2016), Dassi et al. (2015), Wilhelm et al. (2016)]. This
has fascinating applications in the earth-sciences, in the life-sciences and in engineering. The use of
advanced numerical analysis techniques, and in particular of the finite element method or of isoge-
ometric analysis, makes the models computationally very efficient. The models are implemented in
the R package fdaPDE [Lila et al. (2016)].
References
Azzimonti, L., Nobile, F., Sangalli, L.M., Secchi, P. (2014). Mixed Finite Elements for spa-
tial regression with PDE penalization. SIAM/ASA Journal on Uncertainty Quantification, 2,
1, pp. 305 – 335.
Azzimonti, L., Sangalli, L.M., Secchi, P., Domanin, M., Nobile, F. (2015). Blood flow ve-
locity field estimation via spatial regression with PDE penalization. Journal of the American
Statistical Association, Theory and Methods, 110, 511, pp. 1057 – 1071.
Bernardi, M.S., Sangalli, L.M., Mazza, G., Ramsay, J.O. (2016). A penalized regression
model for spatial functional data with application to the analysis of the production of waste in
Venice province. Stochastic Environmental Research and Risk Assessment, DOI: 10.1007/s00477-
016-1237-3.
Dassi, F., Ettinger, B., Perotto, S., Sangalli, L.M. (2015). A mesh simplification strategy for
a spatial regression analysis over the cortical surface of the brain. Applied Numerical Mathe-
matics, 90, 1, pp. 111 – 131.
Ettinger, B., Perotto, S., Sangalli, L.M. (2015). Spatial regression models over two-dimensional
manifolds. Biometrika, 103, 1, pp. 71 – 88.
Lila, E., Aston, J.A.D., Sangalli, L.M. (2016). Smooth Principal Component Analysis over
two-dimensional manifolds with an application to Neuroimaging.
ArXiv:1601.03670, http://arxiv.org/abs/1601.03670.
Lila, E., Sangalli, L.M., Ramsay, J.O., Formaggia, L. (2016). fdaPDE: functional data anal-
ysis and Partial Differential Equations; statistical analysis of functional and spatial data, based
on regression with partial differential regularizations, R package version 0.1-2,
http://CRAN.R-project.org/package=fdaPDE.
Sangalli, L.M., Ramsay, J.O., Ramsay, T.O. (2013). Spatial spline regression models. Journal
of the Royal Statistical Society Ser. B, Statistical Methodology, 75, 4, pp. 681 – 703.
Wilhelm, M., Dede’, L., Sangalli, L.M., Wilhelm, P. (2016). IGS: an IsoGeometric approach
for Smoothing on surfaces. Computer Methods in Applied Mechanics and Engineering, 302,
pp. 70 – 89.
An Adaptive MCMC Method for Multiple Changepoint Analysis with
applications to Large Datasets
Alan Benson∗1 and Nial Friel1
1University College Dublin
∗Email: [email protected]
Abstract: We consider the problem of Bayesian inference for changepoints where the number and position
of the changepoints are both unknown. In particular, we consider product partition models where it is
possible to integrate out model parameters for regimes between each change point, leaving a posterior
distribution over a latent vector indicating the presence or not of a change point at each observation.
This problem has been considered by Fearnhead (2006) where one can use a filtering recursion algorithm
to make exact inference. However the complexity of this algorithm depends quadratically on the number
of observations. Our approach relies on an adaptive Markov Chain Monte Carlo (MCMC) method for
finite discrete state spaces. We develop an adaptive algorithm which can learn from the past state of the
Markov chain in order to build proposal distributions which can quickly discover where changepoints are
likely to be located. We prove that our algorithm preserves ergodicity with respect to the posterior distribution. Crucially, we
demonstrate that our adaptive MCMC algorithm is viable for large datasets for which the exact filtering
recursion approach is not. Moreover, we show that inference is possible in a reasonable time.
Introduction
A motivating example is the Well Log Data (Ruanaidh & Fitzgerald, 1996) shown in Figure 1. It
consists of 3,979 data points measuring the magnetic response of an oil well drill, used to identify changes in
rock structure with depth (time axis). These data have distinct changepoints visible by inspection but
there are many more changepoints that are not as visible without further considering the distribution
of the data.
Figure 1: Well Log Data
Problem Statement
It can be assumed that the data come from some likelihood family f(y|θ) with parameter(s) θ which
change over time. These changes occur at discrete time points τ = τ1 < · · · < τk, where k and the
individual τi values are unknown and are to be inferred. With suitable priors on τ and θ we can form a posterior
$$\pi(\boldsymbol{\tau} \mid \mathbf{y}) \propto \int_{\boldsymbol{\theta}} \prod_{j=1}^{k+1} \prod_{i=\tau_{j-1}+1}^{\tau_j} f(y_i \mid \theta_j)\, \pi(\theta_j)\, \pi(\boldsymbol{\tau})\, d\boldsymbol{\theta}. \qquad (1)$$
Methods
The purpose of this work is to develop scalable algorithms for changepoint analysis of large datasets.
We employ an adaptive Markov Chain Monte Carlo (MCMC) method to sample from the posterior
(1), leading to a novel algorithm that returns the location and number of changepoints in the data
sequence. The key idea of Adaptive MCMC (Haario, 2001) is to modify the proposal distribution
q(x′|x) in the standard Metropolis Hastings (M-H) algorithm in order to explore high probability
regions more often. Restrictions on how we modify the proposal are the essence of constructing
an efficient Adaptive MCMC that also preserves the ergodicity of the inference. The parameters
modified are the change point inclusion weights.
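Under simplifying assumptions, and without claiming to reproduce the authors' algorithm, the R sketch below illustrates one adaptive Metropolis-Hastings step on the binary inclusion vector: a position is chosen with probability given by learned weights, flipped, and accepted or rejected; log_post is a hypothetical function returning the log posterior of Equation 1.

adaptive_flip <- function(tau_ind, weights, log_post) {
  i    <- sample(length(tau_ind), 1, prob = weights)  # adaptive choice of position to flip
  prop <- tau_ind; prop[i] <- 1 - prop[i]             # flip the inclusion indicator
  ## weights are held fixed within the step, so the proposal ratio cancels
  if (log(runif(1)) < log_post(prop) - log_post(tau_ind)) prop else tau_ind
}
## between sweeps, weights can be re-estimated from the chain history, kept away
## from 0 and 1 so that every move remains possible (helping preserve ergodicity)
update_weights <- function(freq, eps = 0.05) pmin(pmax(freq, eps), 1 - eps)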
Results
Results will be presented showing the distribution of the changepoint position and the number of
changepoints in the Well Log data (Figure 1). Further results for larger datasets will be shown and
also a comparison of our algorithm to an alternative filtering recursion algorithm (Fearnhead, 2006).
References
Fearnhead, P. (2006). Exact and efficient Bayesian inference for multiple changepoint problems. Statis-
tics and computing, 16, pp. 203 – 213.
Ruanaidh & Fitzgerald (1996). Numerical Bayesian Methods Applied to Signal Processing. Springer, New York.
Haario, H., Saksman, E. and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7, pp. 223 – 242.
Diagnostic checking for N-mixture models applied to mite abundance
data
Rafael Moral∗1, John Hinde2 and Clarice Demetrio1
1Department of Exact Sciences, ESALQ/USP, Brazil
2School of Mathematics, Statistics, and Applied Mathematics, NUI-Galway, Ireland
∗Email: [email protected]
Abstract: In ecological field surveys it is often of interest to estimate the abundance of species. However,
detection is imperfect and hence it is important to model these data taking into account the ecological
processes and sampling methodologies. In this context, N-mixture models and extensions are particularly
useful, as it is possible to estimate population size and detection probabilities under different ecological
assumptions. We apply extensions of this class of models to a mite sampling study and develop methods
for assessing goodness-of-fit by proposing different types of residuals for this model class.
Introduction
It is very important in the ecological context to measure animal abundance and understand how this
abundance changes over time and space. There are different statistical models that may be used
to estimate abundance as well as site-occupancy. N-mixture models were defined by Royle (2004)
and have since been generalised, see Hostetler and Chandler (2015). Here we develop and apply
extensions of this class of models to estimate mite abundance in a field survey. So far, specific forms
of residuals and model diagnostics have not been proposed and we will develop goodness-of-fit
assessment techniques for these models.
Material and methods
The diversity and abundance of soil animals is dominated by the mites, and hence the understanding
of soil systems is directly related to the mite fauna. To study the effect of climate change on mite
abundance, a sampling study was conducted in Colombia. Mites were sampled bimonthly throughout
the year 2010 at nine sites in both a forest patch and in a pasture, totalling 6 × 9 × 2 = 108
observations.
Let nit represent mite counts for site i, i = 1, . . . , R over sampling occasion t = 1, . . . , T . We are
interested in estimating site abundance Ni, however, there is a detection (or capture) probability p
which is also unknown. Considering closed populations (i.e. no migration and constant birth and
death rates), we may assume that nit are independent and identically distributed as Binomial(Ni, p).
The approach described by Royle (2004) takes Ni to be independent and identically distributed latent
random variables with density f(Ni; θ). Integrating the binomial likelihood with respect to Ni, we
may write the likelihood function of the N-mixture model as
$$L(\theta, p \mid n_{11}, \ldots, n_{RT}) = \prod_{i=1}^{R} \left\{ \sum_{N_i = \max_t n_{it}}^{\infty} \left[ \prod_{t=1}^{T} \binom{N_i}{n_{it}} p^{n_{it}} (1-p)^{N_i - n_{it}} \right] f(N_i; \theta) \right\}. \qquad (1)$$
Sensible choices for the distribution of Ni are, for example, the Poisson and negative binomial
models.
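A direct, if naive, numerical transcription of likelihood (1) for a Poisson abundance distribution is sketched below in R, truncating the infinite sum over N_i at a large upper bound; the count matrix n and all arguments are hypothetical.

nmix_loglik <- function(lambda, p, n, Nmax = 200) {
  sum(apply(n, 1, function(ni) {              # one term per site i
    Ns <- max(ni):Nmax                        # truncate the infinite sum over N_i
    log(sum(sapply(Ns, function(N)
      prod(dbinom(ni, size = N, prob = p)) * dpois(N, lambda))))
  }))
}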
It is important to assess goodness-of-fit in this setting so that abundance may not be over- or under-
estimated. We will propose different types of residuals and develop methods to assess goodness-of-fit
for these models and check their robustness using simulation studies.
Results and discussion
Preliminary analyses showed that the mite data may be overdispersed, as the choice of the negative
binomial model for the distribution of Ni yielded a significantly better fit than a Poisson model
(likelihood ratio test statistic: 60.82 on 1 d.f., p < 0.0001). Using half-normal plots of the ordinary
conditional residuals with a simulated envelope, it appears that assuming a negative binomial model
for the latent abundance variable is more appropriate for these data.
Conclusion
N-mixture models are a valuable tool to analyse repeated count data and estimate abundance.
However, goodness-of-fit must be assessed in order to assure more reliable abundance estimates.
References
Hostetler, J.A. and Chandler, R.B. (2015). Improved state-space models for inference about spatial
and temporal variation in abundance from count data. Ecology, 96, 1713 – 1723.
Royle, J.A. (2004). N-mixture models for estimating population size from spatially replicated counts.
Biometrics, 60, 108 – 115.
A Bayesian nonparametric approach
to the analysis of clustered time-to-event data
Bernardo Nipoti∗1, Alejandro Jara2 and Michele Guindani3
1School of Computer Science and Statistics, Trinity College Dublin, Ireland
2Pontificia Universidad Catolica de Chile, Santiago, Chile
3The University of Texas MD Anderson Cancer Center, Houston, USA
∗Email: [email protected]
Abstract: We propose a clustered proportional hazard model based on the introduction of cluster-
dependent random hazard functions and on the use of mixture models induced by completely random
measures. We show that the proposed approach accommodates different degrees of association within
a cluster, which vary as a function of cluster-level and individual covariates. The behaviour of the proposed
model is illustrated by means of the analysis of simulated and real data.
Introduction
Cox’s proportional hazards (PH) model (Cox, 1972) has been widely used in the analysis of time-to-
event data. Shared frailty models (see, e.g., Hougaard, 2000) conveniently extend the PH framework
by including a group-specific random effect term (frailty) in the hazard function, so to take into
account the presence of heterogeneous clusters of subjects in the data. Frailty random variables are
typically assumed independent and identically distributed (iid). A potentially important drawback
of shared frailty models is represented by the simple marginal association structure that the model
induces, possibly not appropriate for some applications. We propose a more general and flexible
approach where cluster-specific baseline hazard functions are modelled as mixtures governed by iid
completely random measures (CRM).
Methods
Let Ti,j ∈ IR+ be the time-to-event for the ith individual in the jth cluster, with j = 1, . . . , k and
i = 1, . . . , nj, and let zi,j be a p–dimensional vector of explanatory covariates associated with the
ith individual in the jth cluster. We extend the ideas proposed by Dykstra and Laud (1981) and
propose a Bayesian model based on the assumption that the conditional hazard functions can be
expressed as a mixture model induced by iid cluster-specific random mixing distributions, i.e.
$$h_j(t) = \int_{\mathcal{Y}} k(t \mid y)\, \mu_j(dy) \quad \text{and} \quad \mu_j \mid G \overset{\text{i.i.d.}}{\sim} G, \qquad (1)$$
j = 1, . . . , k, where Y is an appropriate measurable space, k(· | ·) is a suitable kernel, µj is a CRM
on Y, such that lim_{t→∞} ∫₀ᵗ hj(s) ds = +∞ a.s., and G is the common probability law for the mixing
CRMs. We assume that, given the cluster-specific baseline hazard functions hj, j = 1, . . . , k, and a
vector β of regression coefficients, the Ti,j's are independent, following a clustered PH model with
conditional density
$$f(t \mid \mathbf{z}_{i,j}, \boldsymbol{\beta}, h_j) = \exp\{\mathbf{z}_{i,j}'\boldsymbol{\beta}\}\, h_j(t)\, \exp\left\{-\exp\{\mathbf{z}_{i,j}'\boldsymbol{\beta}\} \int_0^t h_j(u)\, du\right\},$$
that is, $T_{i,j} \mid \mathbf{z}_{i,j}, \boldsymbol{\beta}, h_j \overset{\text{ind.}}{\sim} f(\cdot \mid \mathbf{z}_{i,j}, \boldsymbol{\beta}, h_j)$.
Results
Under the proposed model we derive the expressions for the Kendall’s τ and survival ratio, popular
measures of dependence between survival times (see, e.g., Anderson et al., 1992). This allows
us to show that our approach accommodates different degrees of association within a cluster,
which vary as a function of cluster-level and individual covariates. We also show that a particular
specification of the proposed model, namely the choice of a σ-stable distribution G for the iid
CRMs in (1), has the appealing property of preserving marginally the PH structure. The behaviour
of the proposed model is illustrated by means of the analyses of simulated data as well as a real
dataset consisting of joint survival times of couples stipulating a last survivor policy with a Canadian
insurance company.
References
Anderson, J.E., Louis, T. A., Holm, N. V. and Harvald, B. (1992).
Time-dependent association measures for bivariate survival distributions, J. Am. Stat. Assoc., 87,
pp. 641 – 650.
Cox, D. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. A, 34,
pp. 187 – 202.
Dykstra, R. L. and Laud, P. (1981). A Bayesian nonparametric approach to reliability, Ann. Stat., 9,
356 – 367.
Hougaard, P. (2000). Analysis of Multivariate Survival Data. Springer, New York.
Tuesday 17th May
Session 4: Chair, Professor Christian Pipper
Invited Speaker
11:20 – 11:40 Chris Jones Log-location-scale-log-concave distribution for lifetime data
Contributed Talks
11:40 – 12:00 Amirhossein Jalali Confidence envelopes for the mean residual life function
12:00 – 12:20 Christopher Steele Modelling the time to type 2 diabetes related complications using a survival tree based approach
12:20 – 12:40 Shirin Moghaddam A Bayesian approach to the imputation of survival data
Log-Location-Scale-Log-Concave Distributions
for Lifetime Data
Chris Jones∗1
Department of Mathematics and Statistics, The Open University, U.K.
∗Email: [email protected]
Abstract: This talk concerns the sub-class of log-location-scale (LLS) models for continuous survival
and reliability data formed by restricting the density of the underlying location-scale distribution to be
log-concave (LC); hence log-location-scale-log-concave (LLSLC) models, introduced in Jones and Noufaily
(2015). These models display a number of attractive properties, one of which is the unimodality of their
density functions.
Methods
I shall concentrate on hazard functions of members of this class of distributions. Their shapes are
driven by tail properties of the underlying LLS distributions. Perhaps the most useful subset of LLSLC
models corresponds to those which allow constant, increasing, decreasing, bathtub and upside-down
bathtub shapes for their hazard functions, controlled by just two shape parameters (which principally
control the hazards’ behaviour at 0 and∞, respectively). The generalized gamma and exponentiated
Weibull distributions are particular examples thereof, for which Cox and Matheson (2014) conclude,
more generally, “that the similarity between the distributions is striking”. A third, also pre-existing,
example is the less well known power generalized Weibull distribution which I newly reparametrize
to cover each of the popular Weibull, Burr Type XII, linear hazard rate and Gompertz distributions
as special or limiting cases.
Discussion
For distributions involving shape parameters on the real line, with density function f(x), say, in
practice one necessarily incorporates all-important location, µ, and scale, σ, parameters too, via
(1/σ)f((x−µ)/σ). Similarly, for lifetime distributions with shape parameters, with hazard function
h(t), say, I contend that in practice one should incorporate scale, σ, and proportionality, β, param-
eters too, via (β/σ)h(t/σ). In the presence of covariates, this covers accelerated failure time and
proportional hazards models (with flexible parametric baseline hazards), amongst others. Practical
implementation of these ideas is in its infancy.
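As a small illustration of the scale and proportionality construction (β/σ)h(t/σ), the R sketch below applies it to a Weibull baseline hazard; the parameter values are arbitrary and this is not code from the paper.

h_weibull <- function(t, a, b) (a / b) * (t / b)^(a - 1)          # baseline hazard
h_general <- function(t, h, sigma, beta, ...) (beta / sigma) * h(t / sigma, ...)

t <- seq(0.01, 5, length.out = 200)
plot(t, h_general(t, h_weibull, sigma = 1.5, beta = 0.8, a = 2, b = 1),
     type = "l", xlab = "t", ylab = "hazard")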
References
Cox, C. and Matheson, M. (2014). A comparison of the generalized gamma and exponentiated Weibull
distributions. Statistics in Medicine, 33, pp. 3772 – 3780.
Jones, M.C. and Noufaily, A. (2015). Log-location-scale-log-concave distributions for survival and reli-
ability analysis. Electronic Journal of Statistics, 9, pp. 2732 – 2750.
Confidence Envelopes for the Mean Residual Life function
Amirhossein Jalali∗1 2, Alberto Alvarez-Iglesias 2, John Hinde1
and John Newell1 2
1School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
2HRB Clinical Research Facility, NUI Galway, Ireland
∗Email: [email protected]
Introduction
Survival analysis is a collection of statistical methods to analyse time to event data, in the presence
of censoring. The survivor function S(t), the probability of the event occurring beyond any particular
time point t, is the typical summary presented. Another function of interest is the Mean Residual
Life (MRL) function, which at any time t provides the expected remaining lifetime given survival up
to time t.
Mean Residual Life
The MRL function has been used traditionally in engineering and reliability studies and provides a
clearer answer to the question “how long do I have left?”. This characteristic of the MRL function
is particularly interesting when one tries to communicate results involving time to event data.
The MRL function is defined as the expected survival time given survival till the current time:
$$m(t) = E(T - t \mid T > t) = \frac{1}{S(t)} \int_t^{\infty} S(s)\, ds.$$
A recent paper by Alvarez-Iglesias, et al. (2015) presented an estimator of the MRL function in the
presence of non-informative right censoring. This novel semi-parametric approach combines existing
nonparametric methods and an extreme-value tail model, where the limited sample information in
the tail (up to study termination) is used to estimate the upper tail behaviour.
Variability Bands
The MRL estimator of Alvarez-Iglesias et al. (2015) is a hybrid estimator, combining a nonparametric estimate of survival with a parametric approximation of the upper tail of the survival curve. Gong and Fang (2012) derived a method of calculating the variance estimate for such hybrid estimators at the start of follow-up, i.e. t = 0. The bootstrap is an attractive option here because of the complexity of the hybrid estimator: to our knowledge, no closed-form expression for its variance is available.
An example of global and pointwise confidence envelopes is given in the following figure for an MRL estimated from simulated data arising from a Weibull distribution with increasing hazard.
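A minimal sketch of the idea (illustrative only, not the authors’ implementation, which also models the upper tail of the survival curve) bootstraps a Kaplan–Meier-based MRL estimate to obtain pointwise variability bands for simulated Weibull data; all settings are invented.

library(survival)

mrl_hat <- function(time, status, t_grid) {
  # MRL estimate based on the Kaplan-Meier survivor function, with the
  # integral truncated at the largest observed time tau
  fit <- survfit(Surv(time, status) ~ 1)
  tk <- fit$time; Sk <- fit$surv; tau <- max(tk)
  S_at <- function(t) if (t < tk[1]) 1 else Sk[max(which(tk <= t))]
  sapply(t_grid, function(t) {
    if (t >= tau || S_at(t) == 0) return(NA)
    brk <- c(t, tk[tk > t & tk < tau], tau)      # breakpoints of the step function
    sum(diff(brk) * sapply(head(brk, -1), S_at)) / S_at(t)
  })
}

set.seed(1)
n <- 200
t_true <- rweibull(n, shape = 1.5, scale = 10)   # increasing-hazard Weibull
c_time <- runif(n, 0, 25)                        # non-informative censoring
time <- pmin(t_true, c_time); status <- as.numeric(t_true <= c_time)
grid <- seq(0, 15, by = 0.5)

est <- mrl_hat(time, status, grid)
boot_reps <- replicate(500, {
  idx <- sample(n, replace = TRUE)
  mrl_hat(time[idx], status[idx], grid)
})
band <- apply(boot_reps, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)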
Conclusion
The Mean Residual Life has been suggested as a more informative graphical summary, as it may provide a clearer interpretation for both clinicians and patients. It is argued that the MRL is easier to interpret as the summary is given in units of time. This study focuses on generating variability bands to quantify the precision of the estimated MRL using the bootstrap. Additionally, a graphical tool will be presented for summarising time to event data using the MRL function.
References
Alvarez-Iglesias, A. et al. (2015). Summarising censored survival data using the mean residual life function. Statistics in Medicine, 34, pp. 1965 – 1976.
Canty, A.J. (2002). Resampling methods in R: the boot package. R News, 2(3), pp. 2 – 7.
Gong, Q. and Fang, L. (2012). Asymptotic properties of mean survival estimate based on the Kaplan–Meier curve with an extrapolated tail. Pharmaceutical Statistics, 11, pp. 135 – 140.
Newell, J. et al. (2006). Survival ratio plots with permutation envelopes in survival data problems. Computers in Biology and Medicine, 36, pp. 526 – 541.
Modelling the time to type 2 diabetes related complications using a
survival tree based approach
Christopher J. Steele∗1,2, Adele H. Marshall1,2, Anne Kouvonen2,3, Reijo Sund3, Frank Kee2
1 Centre for Statistical Science and Operational Research, School of Mathematics and Physics, Queen’s University Belfast
2 UKCRC Centre of Excellence for Public Health, Queen’s University Belfast
3 Department of Social Research, University of Helsinki
∗Email: [email protected]
Abstract: The substantial increase in the number of individuals being diagnosed with type 2 diabetes
(T2D) has caused a simultaneous increase in the prevalence of complications related to the disease. Type
2 diabetics are at a significantly greater risk of experiencing a stroke, heart disease and various other health
problems. A survival tree approach is used to identify cohorts of individuals with significantly different
time distributions from T2D diagnosis to complication. Three survival trees were built for the time until
death, amputation/revascularisation and stroke/acute myocardial infarction (AMI). Age and the presence
of a comorbidity were shown to be influential variables when modelling the time to any of the outcomes.
Introduction
The number of people that suffer from T2D has risen significantly over the past decade and there
are no signs that this sharp increase in the prevalence of the disease is going to slow down. Due
to the increase in T2D cases worldwide there has been a substantial increase in the number of
T2D related complications. Individuals with T2D are at a greater risk of experiencing numerous
conditions including heart disease, stroke, nerve damage, kidney disease and foot problems. The
aim of the proposed study is to group type 2 diabetics by their associated characteristics to give
cohorts of individuals with significantly different time to event distributions. In order to achieve this, a survival tree based approach was used.
Methods
Subjects were participants in the Diabetes in Finland (FinDM II) study, a national register-based
dataset of older adults with T2D in Finland. The objective of the analysis was to model the time
until the first recorded complication after T2D diagnosis. Hence, individuals were excluded from the
analysis if they suffered a related complication before their diagnosis of T2D. A total of 18,903 individuals met the inclusion criteria, of whom 13,835 experienced an event of interest during the follow-
up period. The events of interest were death (n=6,908; 36.5%), lower limb amputation (n=453;
2.4%), coronary revascularisation (n=868; 4.6%), stroke (n=2,861; 15.1%) and AMI (n=2,745;
14.5%). A survival tree approach was used to investigate how individual characteristics influence
the time until the event of interest occurred. The log rank test statistic was used to determine
the splitting of the tree nodes (Zhang and Singer, 1999). The variable which yielded the largest
significant log rank test statistic was chosen to make the split.
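A minimal sketch of this kind of analysis is given below (illustrative only, not the FinDM II code): it fits a survival tree with the rpart package, whose survival splitting criterion is closely related to the log-rank statistic, using hypothetical variable names.

library(survival)
library(rpart)

# d: one row per patient, with follow-up time from T2D diagnosis, an event
# indicator, and candidate splitting covariates (names are illustrative)
fit <- rpart(Surv(time, event) ~ age + gender + comorbidity + income,
             data = d, control = rpart.control(minbucket = 100, cp = 0.001))
plot(fit, uniform = TRUE); text(fit, use.n = TRUE)

# Kaplan-Meier curves and a log-rank comparison of the terminal-node cohorts
d$node <- factor(fit$where)
plot(survfit(Surv(time, event) ~ node, data = d))
survdiff(Surv(time, event) ~ node, data = d)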
Results and Discussion
It was shown that the underlying time distributions from T2D diagnosis to death and T2D related
complication were significantly different. However, the type of complication gives rise to differing
survival behaviours. Those patients experiencing an amputation or revascularisation have similar sur-
vival times which is significantly different to those patients experiencing stroke or AMI. Three survival
trees were built to investigate how individual characteristics affected the time to death, amputa-
tion/revascularisation and stroke/AMI. The trees identified cohorts of individuals with significantly
different time distributions from T2D diagnosis to the event of interest. The most influential factor
in both the death and stroke/AMI survival tree was age, where individuals aged 65 or over were at
a greater risk of experiencing an event. Gender was shown to be the most influential variable in the
amputation/revascularisation tree with age proving to be the next most significant variable. The
presence or absence of a comorbidity also played a major role in all three survival trees. Individuals aged 65 years or older who did not suffer from a comorbidity, were not in manual or non-manual employment and had low income had the shortest median time to death, while older individuals who suffered from a comorbidity and were not in any type of manual or non-manual work had the shortest median time to stroke/AMI. Older males who suffered from a comorbidity and had high levels of education had the shortest median time to amputation/revascularisation.
References
Zhang, H and Singer B. (1999). Recursive Partitioning In The Health Sciences. New York: Springer,
pp. 79-103.
A Bayesian approach to imputation of survival data
Shirin Moghaddam∗1, John Newell2 and John Hinde1
1 School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
2 HRB Clinical Research Facility and School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
∗Email: [email protected]
Abstract: In survival analysis, due to censoring, standard methods of plotting individual survival times
are invalid. Therefore, graphical display of time-to-event data usually takes the form of a Kaplan-Meier
survival plot. By treating the censored observations as missing and using imputation methods, a complete
dataset can be formed. Then standard graphics may usefully complement Kaplan-Meier plots. Here, we
consider using a Bayesian framework to present a flexible approach to impute the censored observations
using predictive distributions.
Introduction
Survival data measures time from some origin point to a particular event. One common feature of
such data is that some individuals may not experience the event during the follow-up period, giving
right-censored observations. In the presence of right censoring, simple approaches for analysis and
visualization are impracticable. Therefore, Kaplan-Meier curves, which take account of the censoring,
have become the standard graphical method for displaying survival data. But suppose that we were
to treat the censored observations as missing and use imputation to provide a complete dataset, then
both standard analysis methods and graphics could be used. One such approach was introduced by
Royston (2008) where each censored survival time is imputed by assuming a log-normal distribution.
Here, we consider using a Bayesian framework to give a more flexible approach to impute the censored
observations using predictive distributions. The use of this method is investigated for low, medium
and high censoring rates with and without covariates. The method is intended to be used for the
visual exploration and presentation of survival data. We illustrate its use for standard survivor and
hazard function plots and also for the mean residual life function, which gives a simple, interpretable
display for physicians and patients to understand the results from clinical trials.
Methods
Censored survival times may be viewed as a type of missing, or incomplete, data. Bayesian methods
are used, here taking a Weibull distribution for the survival times with lognormal and Gamma priors
for the shape and scale parameters (see Christensen et al., 2010). Using MCMC methods (e.g. in WinBUGS, see Lunn et al., 2012) we obtain simulated draws for the predicted values of the censored
observations, conditional on the observed censoring times. We can then use these predicted values
as imputed values to give complete datasets, as in standard multiple imputation methods. Standard
graphics can then be used with the imputed datasets to explore treatment effects, hazard functions,
etc., with some indication of the uncertainty due to the censoring.
The approach is easily extended to other survival distributions and Bayesian survival models.
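The imputation step can be sketched as follows (an illustration only, assuming posterior draws of the Weibull shape and scale parameters are already available from the MCMC output): each censored time is replaced by a draw from its predictive distribution, truncated below at the observed censoring time.

impute_censored <- function(cens_times, shape_draws, scale_draws) {
  sapply(cens_times, function(cc) {
    i <- sample(length(shape_draws), 1)          # pick one posterior draw
    # inverse-CDF draw from a Weibull truncated below at the censoring time cc
    u <- runif(1, pweibull(cc, shape_draws[i], scale_draws[i]), 1)
    qweibull(u, shape_draws[i], scale_draws[i])
  })
}

# One completed dataset: observed event times kept, censored times imputed
# (time and status are follow-up times and event indicators, 1 = event)
# completed <- time
# completed[status == 0] <- impute_censored(time[status == 0], shape_draws, scale_draws)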
Conclusion
In summary, we have introduced a flexible approach for imputing values from censored survival times
to give completed datasets. The datasets can then be used for standard graphical displays that may
be a useful complement to Kaplan-Meier plots of the original censored dataset.
References
Christensen, R. et al. (2010). Bayesian Ideas and Data Analysis: an introduction for scientists and
statisticians. CRC Press.
Lunn, D. et al. (2012). The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press.
Royston, P. et al. (2008). Visualizing length of survival in time-to-event studies: a complement to
Kaplan–Meier plots. Journal of the National Cancer Institute, 100, pp. 92 – 97.
Tuesday 17th May
Session 5: Chair, Professor Ailish Hannigan
Keynote Speaker
13:50 – 14:40 Bethany Bray Cutting-Edge Advances in Latent Class Analysis for Today’s Behavioral Scientists
Contributed Talks
14:40 – 15:00 Myriam Tami EM estimation of a structural equation model
15:00 – 15:20 Arthur White Identifying patterns of learner behaviour using latent class analysis
15:20 – 15:40 Brendan Murphy Variable selection for latent class analysis with application to low back pain diagnosis
Cutting-Edge Advances in Latent Class Analysis for Today’s Behavioral
Scientists
Bethany C. Bray∗1
1 The Methodology Center, College of Health and Human Development, The Pennsylvania State University
∗Email: [email protected]
Abstract: Latent class analysis (LCA) is a statistical tool that behavioural scientists are turning to with
increasing frequency to explain population heterogeneity by identifying subgroups of individuals. As appli-
cation of LCA increases in behavioural science, more complex scientific questions are being posed about
the role that class membership plays in development. Recent advances have proposed new and exciting
extensions to LCA that address some pressing methodological challenges as today’s scientists pose increas-
ingly complex questions about behavioural development. This keynote presentation will discuss two specific
advances: causal inference in LCA and LCA with a distal outcome.
Introduction
Latent class analysis (LCA) is a statistical tool that behavioural scientists are turning to with in-
creasing frequency to explain population heterogeneity by identifying subgroups of individuals. The
subgroups (i.e., classes) are comprised of individuals who are similar in their responses to a set
of observed variables; class membership is inferred from responses to the observed variables. As
application of LCA increases in behavioural science, more complex scientific questions are being
posed about the role that class membership plays in development. Addressing these questions of-
ten requires estimating associations between the latent class variable and other observed variables,
such as predictors, outcomes, moderators and mediators. In some cases, these associations can be
modelled in the context of the LCA itself. In other cases, the research questions are too complex
and cannot currently be addressed in this way. Recent methodological advances have proposed new
and exciting extensions to LCA that address some of the most pressing challenges. This keynote
presentation will discuss two advances that can be used to address the complex research questions
about development posed by today’s behavioural scientists.
Using data from the National Longitudinal Study of Adolescent to Adult Health (Add Health),
a nationally representative, longitudinal study of U.S. adolescents followed into young adulthood,
cutting-edge advances in models for causal inference in LCA and LCA with a distal outcome will be
discussed. Emphasis will be placed on the new questions that can be addressed with these methods
and how to implement them in scientists’ own work.
Causal inference in LCA
Modern causal inference methods, such as inverse propensity weighting, facilitate drawing causal
conclusions from observational data and these techniques are now commonly used in behavioural
studies. However, causal inference methods have only recently been extended to the latent variable
context to draw causal inferences about predictors of latent variables. This keynote presentation
will demonstrate the use of inverse propensity weighting to estimate the causal effect of a predictor
on latent class membership by estimating the causal effect of high risk for adolescent depression on
adult substance use latent class membership in the Add Health data.
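As a rough illustration of the weighting step (not the Add Health analysis; variable names are hypothetical), inverse propensity weights for a binary exposure can be computed as below and then carried into a weighted latent class model.

# d: one row per respondent, with a binary exposure indicator and confounders x1, x2
ps_fit <- glm(exposure ~ x1 + x2, family = binomial, data = d)
ps <- fitted(ps_fit)                                    # estimated propensity scores
w_ipw <- ifelse(d$exposure == 1, 1 / ps, 1 / (1 - ps))  # inverse propensity (ATE) weights
# w_ipw would then enter the latent class model for the outcome as case weights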
LCA with a distal outcome
The mathematical model for predicting class membership from a covariate is well-understood; ques-
tions related to associations between a latent class predictor and distal outcome, however, present
a more difficult methodological challenge. Solving this problem is a hot topic in the methodological
literature today. Advantages and disadvantages of three competing, state-of-the-art approaches to
LCA with a distal outcome will be discussed, in order to guide scientists in their own work. As a
demonstration, early risk exposure latent class membership will be linked to later binge drinking in
the Add Health data.
EM estimation of a Structural Equation Model
Myriam Tami∗1, Xavier Bry1 and Christian Lavergne1
1 Universities of Montpellier, IMAG, France
∗Email: [email protected]
Abstract: We propose an estimation method of a Structural Equation Model (SEM). It consists of viewing
the Latent Variables (LV’s) as missing data and using the EM algorithm to maximize the whole model’s
likelihood, which simultaneously provides estimates not only of the model’s coefficients, but also of the
values of LV’s. Through a simulation study, we investigate how fast and accurate the method is, and
finally apply it to real data.
Introduction
The proposed approach is an estimation method of a SEM linking latent factors. It provides estimates
of the coefficients of the model and its factors at the same time. This method departs from more
classical methods such as LISREL. In fact, LISREL mainly focuses on the covariance structure and
the LV scores estimation is based on a least-squares technique performed on mere measurement
equations. Contrary to PLS-like methods, we do not constrain factors to belong to the spaces
spanned by the Observed Variables (OV’s), but only to be normally distributed.
The model and data notations
The data consist of blocks of OV's describing the same n independent units. Y = (y_i^j) (resp. X_m = (x_i^{j,m})), with i ∈ {1, . . . , n}, j ∈ {1, . . . , q_Y} (resp. j ∈ {1, . . . , q_m}, m ∈ {1, . . . , p}), is the n × q_Y (resp. n × q_m) matrix coding the dependent block of OV's (resp. the m-th explanatory block of OV's), identified with its column vectors. T (resp. T_m) refers to an n × r_T (resp. n × r_m) matrix of covariates. For
the sake of simplicity, the SEM we handle here is a restricted one. It contains only one structural
equation, relating a dependent latent factor g, underlying block Y , to p explanatory latent factors
fm respectively underlying blocks Xm. The SEM consists of p+ 1 measurement equations and one
structural equation:
Y = T D + g b′ + ε_Y
X_m = T_m D_m + f_m a_m′ + ε_m,  ∀ m ∈ {1, . . . , p}
g = f_1 c_1 + · · · + f_p c_p + ε_g          (1)
where ε_g ∈ R^n is a disturbance vector (resp. ε_Y, ε_m are disturbance matrices) and, ∀ m ∈ {1, . . . , p}, θ = {D, D_m, b, a_m, c_1, c_2, ψ_Y, ψ_m} is the set of parameters. The main assumptions of this model are the following: the f_m are standard normal; g is normal with zero mean, and its expectation conditional on all the f_m is a linear combination of them; ε_g ∼ N(0, 1); ε_g is independent of ε_Y and ε_m, ∀ m ∈ {1, . . . , p}.
Estimation using the EM algorithm
We propose to carry out likelihood maximization through an iterative Expectation-Maximization
algorithm (Dempster et al. (1977)). If we consider factors as missing data, EM algorithm enables
us to estimate the factors. Let Z = (Y, X_1, . . . , X_p) be the OV's and h = (g, f_1, . . . , f_p) the LV's. To maximize the log-likelihood associated with the complete data, L(θ; Z, h), in the EM framework we must solve E_{h|Z}[∂L(θ; Z, h)/∂θ] = 0. Thanks to the explicit solutions of this system and the distribution of h conditional on Z we design an algorithm. It is a rapidly converging iterative
procedure starting from a good initialization. The iteration equations have been given in detail in
Bry et al. (2016) and will be presented.
Results and application
A sensitivity analysis has been performed to investigate how the quality of the estimates could be affected by the number n of observations and the number q of OV's in each block. The sample size n proved to have more impact on the quality of parameter estimation and factor reconstruction than the number of OV's. We advise using a minimum sample size of n = 100.
Conclusion
This method can estimate quickly and precisely factors of the SEM (in addition to estimating its
loadings) by maximization of the whole model’s likelihood. Various simulations and an application
on real data will be presented.
References
Bry, X., Lavergne, C. and Tami, M. (2016). EM estimation of a Structural Equation Model. In review.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39, pp. 1 – 38.
Identifying Patterns of Learner Behaviour Using Latent Class Analysis
Arthur White∗1 and Paula Carroll2
1 School of Computer Science and Statistics, Trinity College Dublin, the University of Dublin.
2 UCD Quinn School of Business, University College Dublin.
∗Email: [email protected]
Abstract: We investigate the learning patterns of students taking an introductory statistics module. Latent
class analysis is used to assess how the students interact with the different learning resources at their disposal
as the module progresses. Four behavioural groups were identified: while differing levels of face to face
attendance and online interaction existed, none of the groups engaged with online material in a timely
manner. Significant differences in levels of attainment were found to exist between groups, with an at risk
group of low engagers clearly identified.
Introduction
Learning objects are defined as any entity, digital or non-digital, that may be used for learning,
education or training. We examine how such objects are used by students in the University College
Dublin (UCD) Business School taking an introductory statistics core module called Data Analysis
for Decision Makers. The module is offered in a blended learning environment, in which, as well
as attending weekly lectures and tutorials, students use their own device to access digital learning
resources. We investigate the behavioural patterns of learning object usage, over the course of the
module, and assess how these patterns relate to attainment levels of learning outcomes.
Methods
Latent class analysis is used to identify student learning patterns. The group probability τ represents
the a priori probability that a student belongs to a particular cluster. The item probability parameter
θ represents the probability of a student interacting with a learning resource, indexed by time as
well as resource.
Denote the data X = (X_1, . . . , X_N), M-dimensional vector-valued binary random variables, composed of G groups or clusters. The observed-data likelihood can then be written:
p(X | θ, τ) = ∏_{i=1}^{N} ∑_{g=1}^{G} τ_g ∏_{m=1}^{M} θ_{gm}^{X_{im}} (1 − θ_{gm})^{1−X_{im}}.
Inference is facilitated by the introduction of the latent variable Z = (Z_1, . . . , Z_N), which indicates the cluster membership of each individual student. The complete-data likelihood is then
p(X_i, Z_i | τ, θ) = ∏_{g=1}^{G} [ τ_g ∏_{m=1}^{M} θ_{gm}^{X_{im}} (1 − θ_{gm})^{1−X_{im}} ]^{Z_{ig}}.
We applied LCA to our data using the R package BayesLCA (White and Murphy, 2014).
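A minimal sketch of such a fit (illustrative only; the data matrix and the choice of the EM routine are assumptions) using BayesLCA might look as follows.

library(BayesLCA)
# X: N x M binary matrix of weekly resource-interaction indicators (rows = students)
fit <- blca.em(X, G = 4)   # EM fit of a 4-class latent class model
fit$classprob              # estimated group probabilities (tau)
fit$itemprob               # estimated item probabilities (theta)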
Results
Four clusters were identified. The estimated group probabilities were τ = (0.34, 0.28, 0.27, 0.11).
The estimated item probability parameter, θ, is visualised in Figure 1. Based on these estimates, we
made the following interpretation: Group 1 appear to develop a preference for online material over
lectures or tutorials; Group 2 have high attendance but are slow to access online material; Group
3 have highest activity across all resources; Group 4 have low engagement overall. The ANOVA of
exam score based on this clustering was highly significant (F(3, 519) = 84.7, p < 10^−16).
Figure 1: Proportion of clusters interacting with learning resources each week.
Conclusion
Our study shows the diversity in learning behaviour among the student body and indicates that
students tailor their usage of learning resources. While many students seem to be successfully
transitioning to become self-directed learners at university, assessment prompted engagement is also
evident for a substantial proportion of students. We suggest that these students warrant further
analysis and research.
References
White, A and Murphy, T.B. (2014). BayesLCA: An R package for Bayesian latent class analysis. Jour-
nal of Statistical Software, 61(13), pp. 1 – 28.
Variable Selection for Latent Class Analysis with Application to Low
Back Pain Diagnosis
Michael Fop1, Keith Smart2 and Thomas Brendan Murphy∗1
1 School of Mathematics & Statistics and Insight Research Centre, University College Dublin, Belfield, Dublin 4, Ireland.
2 St. Vincent’s University Hospital, Dublin 4, Ireland.
∗Email: [email protected]
Abstract: The identification of the most relevant clinical criteria related to low back pain disorders is a crucial task for a quick and correct diagnosis of the nature of the pain. Data concerning low back pain can be of a categorical nature, in the form of a check-list in which each item denotes the presence or absence of a clinical
condition. Latent class analysis is a model-based clustering method for multivariate categorical responses
which can be applied to such data for a preliminary diagnosis of the type of pain. In this work we propose
a variable selection method for latent class analysis applied to the selection of the most useful variables
in detecting the group structure in the data. The method is based on the comparison of two different
models and allows the discarding of those variables with no group information and those variables carrying
the same information as the already selected ones. The method is applied to the selection of the clinical
criteria most useful for the clustering of patients in different classes of pain. It is shown to perform a
parsimonious variable selection and to give a good clustering performance.
Introduction
Low-back pain (LBP) is the musculoskeletal pain related to disorders in the lumbar spine, low back
muscles and nerves and it may radiate to the legs.
When observations are measured on categorical variables, the most common model-based clustering
method is the latent class analysis model (LCA) (Lazarsfeld and Henry, 1968). Typically all the
variables are considered in fitting the model, but often only a subset of the variables contains the
useful information about the group structure of the data.
In this work we develop a variable selection method for LCA based on the model selection framework
of Dean and Raftery (2010) which overcomes the limitation of the above independence assumption.
Model
To select the variables relevant for clustering in LCA, a stepwise model comparison approach is used.
At each step we partition the variables into:
• XC, the current set of relevant clustering variables, dependent on the cluster membership variable z,
• XP, the variable proposed to be added to or removed from the clustering variables,
• XO, the set of the other variables which are not relevant for clustering.
Figure 1: The two competing models for variable selection
The decision to add or remove the considered variable is then made by comparing two models: model M1, in which the variable is useful for clustering, and model M2, in which it is not; in M2 the proposed variable depends only on a subset XR ⊆ XC of the clustering variables. Figure 1 gives a graphical sketch of the two competing models, and an illustrative comparison step is sketched below.
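The sketch below is illustrative only (it is not the authors' implementation, which handles the regression of XP on the subset XR more carefully): the BIC of model M1 is compared with the BIC of model M2, assembled from an LCA on the current clustering variables plus a regression of the proposed variable on them. poLCA's 1/2 coding of categorical variables is assumed.

library(poLCA)   # latent class models for categorical data
library(nnet)    # multinom() for the regression in model M2

step_bic <- function(d, XC, XP, G) {
  # Model M1: the candidate variable XP clusters jointly with the current set XC
  f1 <- as.formula(paste("cbind(", paste(c(XC, XP), collapse = ","), ") ~ 1"))
  m1 <- poLCA(f1, data = d, nclass = G, verbose = FALSE)
  # Model M2: LCA on XC only, plus a regression of XP on the clustering variables
  f2 <- as.formula(paste("cbind(", paste(XC, collapse = ","), ") ~ 1"))
  m2a <- poLCA(f2, data = d, nclass = G, verbose = FALSE)
  m2b <- multinom(as.formula(paste("factor(", XP, ") ~", paste(XC, collapse = "+"))),
                  data = d, trace = FALSE)
  c(BIC_M1 = m1$bic, BIC_M2 = m2a$bic + BIC(m2b))   # prefer the model with lower BIC
}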
Conclusion
Using the variable selection method proposed here we retain 11 variables and the BIC selects a
3-class model on these; the groups closely correspond to established clinical groupings. The se-
lected variables present a good degree of separation between the three classes, which are generally
characterized by the almost full presence or almost complete absence of the selected criteria.
References
Dean, N. and Raftery, A.E. (2010). Latent Class Analysis Variable Selection. Annals of the Institute of Statistical Mathematics, 62, pp. 11 – 35.
Fop, M., Smart, K. and Murphy, T.B. (2015). Variable Selection for Latent Class Analysis with Ap-
plication to Low Back Pain Diagnosis. arXiv:1512.03350 .
Lazarsfeld, P. and Henry, N. (1968) Latent Structure Analysis, Houghton Mifflin.
Tuesday 17th May
Session 6: Chair, Dr. John Newell
Invited Speaker
16:00 – 16:40 David Leslie Thompson sampling for website optimisation
Contributed Talks
16:40 – 17:00 Sergio Gonzalez-Sanz Beyond machine classification: hedging predictions with confidence and credibility values
17:00 – 17:20 James Sweeney Spatial modelling of house prices in the Dublin area
17:20 – 17:40 Susana Conde Model selection in sparse multi-dimensional contingency tables
Thompson sampling for website optimisation
David Leslie∗1
1 Mathematics and Statistics Department, Lancaster University, United Kingdom
∗Email: [email protected]
When individuals are learning how to behave in an unknown environment, a statistically sensible
thing to do is form posterior distributions over unknown quantities of interest (such as features of
the environment and individuals’ preferences) then select an action by integrating with respect to
these posterior distributions. However, reasoning with such distributions is very troublesome, even in
a machine learning context with extensive computational resources; Savage himself indicated that
Bayesian decision theory is only sensibly used in reasonably “small” situations.
Random beliefs is a framework in which individuals instead respond to a single sample from a
posterior distribution. This is a strategy known as Thompson sampling, after its introduction in a
medical trials context by Thompson (1933), and is used by many Web providers both to select which
adverts to show you and to perform website optimisation. I will demonstrate that such behaviour
’solves’ the exploration-exploitation dilemma in a contextual bandit setting, which is the framework
used by most current applications.
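A minimal illustration of the idea for a Bernoulli bandit (e.g. choosing which of K page variants to show each visitor; the rates below are invented) is:

set.seed(1)
K <- 3
p_true <- c(0.04, 0.05, 0.07)         # unknown click-through rates
successes <- failures <- rep(0, K)    # Beta(1, 1) priors on each rate

for (visitor in 1:10000) {
  theta <- rbeta(K, successes + 1, failures + 1)   # one draw from each posterior
  arm <- which.max(theta)                          # act on the sampled beliefs
  reward <- rbinom(1, 1, p_true[arm])
  successes[arm] <- successes[arm] + reward
  failures[arm] <- failures[arm] + (1 - reward)
}
successes + failures   # pulls per variant; traffic should concentrate on the best variant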
Beyond machine classification: hedging predictions with confidence and
credibility values
Sergio Gonzalez-Sanz
1 Fashion Insights Centre, Zalando, Ireland
∗Email: [email protected]
Abstract: Supervised classification is a well-known task in the Machine Learning field, whereby a labelled training set is directly used to build a model of the underlying pattern(s) in the data. Once this model is built, the overall performance can be subsequently assessed using new data and well-known metrics such as accuracy, recall, and precision. These metrics provide an insight into how a model performs as a whole across all test data instances. However, one question remains: what is the quality of any single
prediction made by a model? Conformal predictors hedge a classifier’s predictions with confidence and
credibility measures enabling the end user to take appropriate actions according to the quality of the single
predictions. This paper describes how conformal predictions can be used to make informed decisions on
the selection of multiple competing classifiers for a given task.
Introduction
Conformal predictors were designed by Gammerman et al. (1998) using transduction rules for Support Vector Machines. Their goal was to provide a measure of the evidence found in each prediction. In order to do so, conformal predictors augment model predictions with accurate levels
of confidence and credibility (usually known as conformal measures). Later on, the methodology
was extended to work with induction (Vovk, 2013) for other ML techniques such as the nearest
neighbours, ridge regression and decision trees (Shafer and Vovk, 2008). Conformal predictors
have been successfully applied to a wide range of fields such as computer vision, nuclear fusion
(Gonzalez-Sanz, 2013) and medicine.
This approach, providing information about single predictions, differs from traditional evaluation
metrics such as accuracy or precision, which assess the quality of a model as a whole. Conformal
measures enable analysts to take informed decisions on what actions to take given the output label
of a machine learning classifier, for example whether or not to discard predicted values.
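As an illustration (a sketch only, using the common "one minus the predicted probability of the label" nonconformity score; the classifier and a held-out calibration set are assumed to exist), inductive conformal p-values, and the resulting confidence and credibility values, can be computed as follows.

conformal_measures <- function(prob_cal, y_cal, prob_new) {
  # prob_cal: calibration-set class probabilities (rows = cases, cols = labels)
  # y_cal:    calibration labels as integer column indices
  # prob_new: class probabilities for new cases
  alpha_cal <- 1 - prob_cal[cbind(seq_len(nrow(prob_cal)), y_cal)]   # nonconformity scores
  p_vals <- t(apply(prob_new, 1, function(pr) {
    sapply(seq_along(pr), function(k) {
      a_new <- 1 - pr[k]                                      # score if label k were true
      (sum(alpha_cal >= a_new) + 1) / (length(alpha_cal) + 1) # conformal p-value
    })
  }))
  list(prediction  = max.col(p_vals),                         # label with largest p-value
       credibility = apply(p_vals, 1, max),                   # largest p-value
       confidence  = 1 - apply(p_vals, 1, function(p) sort(p, decreasing = TRUE)[2]))
}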
Methods
This paper leverages conformal measures as the foundation of a novel technique for competing
multi-model evaluation and selection. The most appropriate model choice can be quantitatively
made by examining the distribution of the credibility scores of all samples compared to those of
incorrectly classified samples (false positives and false negatives). Cross-validation on all models will
also be used in order to ensure the measures obtained are stable and generalise to other independent
datasets.
In order to test this proposal, a set of competing models will be created using a large, open source,
test set. The performance values obtained from the conformal measures in each model will be then
compared against existing model selection techniques, such as the ROC curve and the overall model
accuracy and precision.
Results & Discussion
The results section will include the findings regarding the usefulness of the conformal measures as
indicators for model selection. A set containing millions of labelled webpages (obtained from DMOZ,
https://www.dmoz.org/) will be used for building a multi-class conformal predictor using multiple
models. The distribution of the errors in the test sets will be plotted as a function of the credibility.
An examination of the plots will show that the errors are located on the low credibility regions.
Thus, analysts will be able to use the credibility values to set their risk aversion strategy effectively.
These plots and different accuracy measures will be compared across different classifiers. Conformal
predictors can shed some light on the differences between two samples labelled with the same tag.
References
Gammerman, A., Vovk, V. and Vapnik, V. (1998). Learning by transduction. In: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, Madison, Wisconsin, USA, pp. 148 – 155.
Gonzalez-Sanz, S. (2013). Data mining techniques for massive databases: an application to JET and
TJ-II fusion devices. Ph.D. dissertation available at http://eprints.ucm.es/21575/1/T34490.pdf
Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Re-
search, 9, pp. 371 – 421.
Vovk, V. (2013). Conditional validity of inductive conformal predictors. Machine learning, 92(2-3), pp.
349 – 376.
Spatial Modelling Of House Prices In The Dublin Area
Dr. James Sweeney∗1
1 School of Business, UCD, Ireland
∗Email: [email protected]
Abstract: Assessments of the state of play in the Dublin housing market are mainly qualitative at present,
based on simple summaries of property prices. Here a proof of concept spatial model for house prices in
the Dublin postcode area is outlined, which is applied to a dataset of 1531 properties containing price
information and features such as size, beds, postcode, local area and spatial location. The model appears
promising for price prediction given a number of simple property features and provides some interesting
results in terms of the factors deemed important in the value of a property.
Introduction
Existing property price estimators are limited in terms of the factors they use to estimate the value
of a property for the purposes of property tax payment, being primarily based on the dwelling type,
number of bedrooms & bathrooms, as well as a comparison to nearby houses for which sales price
may be known. No uncertainty in the price prediction is typically provided, a substantial caveat given
that property price predictions in areas where property turnover is low should be highly uncertain.
Furthermore, a substantial issue of interest is whether there are subjective biases in terms of the
prices people are willing to pay for a property - for example, will people overpay for perceived good
addresses? Existing hedonic models for property prices cannot address this question because not all of the factors impacting on price are known (Gelfand et al., 2004).
General Model
Due to the unavailability of several unknown important neighbourhood characteristics in the value
of a house, we would expect that there will be spatial association remaining in the residuals of a
simple hedonic regression model, even after the inclusion of location attributes in terms of postcode
or the townland. The type of dwelling is also important, with a clear differentiation in the value of
houses and apartments for example, controlling for other factors. Visual exploration of the dataset
in terms of the size of the property and number of bedrooms reveals that the assumption of a
linear relationship between these factors and price is potentially inappropriate, in particular for larger
properties. Due to the variability in the values of properties we work on the log scale, prompting
the following model for the data, where y*_i = log(price per m²) and s_i represents spatial location (latitude, longitude):
y*_i = type_i + area_i + postcode_i + f(size_i) + f(beds_i) + g(s_i) + ε_i
ε_i ∼ N(0, σ²)
postcode_i ∼ N(0, τ₁²)
area_i ∼ N(0, τ₂²)
type_i ∼ N(0, τ₃²)
We assume that the effects of both property size and number of beds vary smoothly, assigning a
separate intrinsic random walk of order 2 for each process, i.e. f() ∼ IRW2(κ). We also assume
a smoothly varying spatial effect for g(s), assigning a Gaussian Process for this purpose.
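A related, though not identical, specification can be sketched with the mgcv package, replacing the Bayesian IRW2 and Gaussian process terms with penalised thin-plate smooths and iid random effects; variable names are assumptions.

library(mgcv)
# d: one row per property, with price, size (m2), beds, lon/lat, and factors
# postcode, area (townland) and type (dwelling type)
d$log_ppm2 <- log(d$price / d$size)                 # log(price per square metre)
fit <- gam(log_ppm2 ~ s(size) + s(beds, k = 4) +    # smooth size and beds effects
                      s(lon, lat) +                 # smooth spatial effect g(s)
                      s(postcode, bs = "re") +      # postcode random effect
                      s(area, bs = "re") +          # townland random effect
                      s(type, bs = "re"),           # dwelling-type random effect
           data = d, method = "REML")
summary(fit)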
Results
The model output is quite promising in terms of teasing out the underlying factors which are most
important in determining the value of a property. The Deviance Information Criterion (DIC) is
lowest for the specified model, illustrating the benefit of incorporating a spatial effect in addition to
parameters for size and number of beds. The spatial effect is perhaps a proxy for local effects such
as transport links and schools, or other unidentifiable features which are impossible to capture in the
data collection process. There are also significant postcode and townland effects, particularly in the
postcodes of the south side of Dublin, reflecting a consumer preference for these areas irrespective
of other factors. The number of baths in a property appears unimportant, which is an interesting
result given that existing calculators consider this to be a primary factor.
References
Gelfand, A.E., Ecker, M.D., Knight, J.R. and Sirmans, C.F. (2004). The Dynamics of Location in
Home Price. The Journal of Real Estate Finance and Economics, 29, pp. 149 – 166.
Model Selection in Sparse Multi-Dimensional Contingency Tables
Susana Conde∗1 and Gilbert MacKenzie2
1 Centre of Biostatistics, University of Manchester, UK.
2 CREST, ENSAI, Rennes, France.
∗Email: [email protected]
Abstract: We compare the Lasso method of variable selection in sparse multi-dimensional contingency
tables with a new Smooth Lasso method which casts regularization in a standard regression analysis
mould, and also with the classical backwards elimination algorithm, which is used in standard software
packages. First, we make a general methodological point in relation to model selection with interactions.
Next we undertake a simulation study which explores the ability of the three algorithms to identify the
correct model. Finally, we analyse a set of comorbidities arising in a study of obesity. The findings do not
favour the standard Lasso regardless of the optimization algorithm employed.
Introduction
Sparse contingency tables arise often in genetic, bioinformatic, medical and database applications.
Then the target is to estimate the dependence structure between the variables modelled via the
interaction terms in a log-linear model. High dimensionality will force attention on identifying
important low-order interactions — a technical advance since most model selection work relies only
on main effects. We present the Smooth Lasso (SL), a penalized likelihood, which does not require
specialized optimization algorithms such as the method of coordinate descent. It uses a convex,
parametric, analytic penalty function that asymptotically approximates the Lasso: minimization is
accomplished with standard Newton-Raphson algorithms and standard errors are available.
A Smooth Lasso
The penalized log-likelihood is ℓ_λ(θ) = ℓ(θ) − pen_λ, where pen_λ is the penalty term and λ > 0. For the Lasso, pen_λ = λ Σ_{j=2}^{p} |θ_j| (omitting the intercept term), and for the Smooth Lasso, pen_λ = λ Σ_{j=2}^{p} Q_ω(θ_j), where Q_ω(θ_j) = ω log[cosh(θ_j/ω)] for a constant ω that regulates how closely the function approximates the absolute value. Note that Q_ω(θ_j) ∈ C^∞, the set of functions that are infinitely differentiable, and is convex (Conde, 2011; Conde and MacKenzie, 2011). We then define the maximum penalised likelihood estimator (MPLE) as θ := arg max_{θ∈Θ} {ℓ(θ) − pen_λ(θ)}. We should more properly write θ_λ rather than θ, but the dependence on λ will be understood in what follows. The goal here is to estimate λ, and we use five-fold cross-validation. For the Lasso we use the method of coordinate descent and the optimization algorithm of Dahinden et al. (2007), via the R packages glmnet and logilasso respectively.
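To illustrate (a sketch under assumed data structures, not the authors' code), the smooth penalty and a penalised Poisson log-likelihood for a log-linear model can be written so that a standard quasi-Newton optimiser applies directly.

# smooth approximation Q_omega(theta) = omega * log(cosh(theta / omega)),
# written in a numerically stable form: log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)
Q_omega <- function(theta, omega = 0.05) {
  x <- theta / omega
  omega * (abs(x) + log1p(exp(-2 * abs(x))) - log(2))
}

# penalised Poisson log-likelihood for a log-linear model with design matrix X
# and observed cell counts y (the intercept, theta[1], is left unpenalised)
pen_loglik <- function(theta, X, y, lambda, omega = 0.05) {
  eta <- drop(X %*% theta)
  sum(y * eta - exp(eta)) - lambda * sum(Q_omega(theta[-1], omega))
}

# fit <- optim(rep(0, ncol(X)), pen_loglik, X = X, y = y, lambda = 1,
#              method = "BFGS", control = list(fnscale = -1))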
Results
For the Smooth Lasso one must pick a level of statistical significance, as with ordinary regression
methods. Thus SL−95 corresponds to the 5% level. We notice that the 5% level produces very poor
results when the sample size is small, but improves with increasing sample size, while the classical
Backward Elimination algorithm performs better for smaller sample sizes. The latter is very fast,
taking approximately 1 minute for 1000 simulated tables compared to hours for the Lasso methods
due to cross-validation. We also note that the Smooth Lasso estimator is sparser and closer to the
truth than the usual Lasso. However, in the analysis of obesity data, the backwards elimination
algorithm performs best overall and produces the best model as judged by the AIC.
Discussion
In the presence of interactions, Lasso methods will often fail to produce scientific models. Moreover,
it is well known that the Lasso lacks the oracle property and our results confirm this. All these issues
raise serious questions about the usefulness of Lasso methods for model selection.
References
Conde, S. (2011). Interactions: Log-Linear Models in Sparse Contingency Tables. Ph.D. thesis, University
of Limerick, Ireland.
Conde, S. & MacKenzie, G. (2011). LASSO Penalised Likelihood in High-Dimensional Contingency
Tables. In: Proceedings of the 26th International Workshop on Statistical Modelling, Valencia, D.
Conesa, A. Forte, A. et al. eds.
Dahinden, C., Parmigiani, G., Emerick, M. C., & Buhlmann, P. (2007). Penalized likelihood for
sparse contingency tables with an application to full-length cDNA libraries. BMC Bioinformatics,
8:476.
Wednesday 18th May
Special INSIGHT Session: Chair, Dr. Kevin Hayes
Keynote Speaker
09:10 – 10:00 Sofia Olhede Anisotropy in random fields
INSIGHT Session
10:00 – 11:00 Brian Caulfield, Nial Friel & Brendan Murphy Insight centre for data analytics: A collection of short stories
Anisotropy in random fields
Sofia Olhede∗1
1 Department of Statistical Science, University College London, London, United Kingdom
∗Email: [email protected]
Anisotropy is a key structural feature of many physical processes. Despite this, most theory for
the modelling and estimation of random fields is based on assuming isotropy of the observed field.
Anisotropy can arise both in the structural features of the field, and between field components. I will
discuss both forms of anisotropy, and how we may model them, parametrically for applications in
geophysics such as understanding interface-loading processes, and more generically to capture strong
directional preferences. I will also describe how we may nonparametrically identify the presence of
anisotropic features without strong structural assumptions, such as a given parametric model class.
This is joint work with Frederik Simons, David Ramirez and Peter Schreier, as well as others.
Insight centre for data analytics: A collection of short stories
Brian Caulfield1, Nial Friel∗2 and Brendan Murphy2
1 Insight Centre for Data Analytics and School of Physiotherapy and Performance Science, University College Dublin.
2 Insight Centre for Data Analytics and School of Mathematics and Statistics, University College Dublin.
∗Email: [email protected]
Abstract: Data is changing our world. The field of data analytics is progressing at a rate beyond anything
we have ever experienced. If we can tap into this new wealth of information and make decisions based on
it, we will transform the way our world works. Data analytics is a massive global research effort aimed at
taking the guesswork out of decision making in society. It has the potential to improve our approach to
everything from hospital waiting lists to energy use to advertising. At Insight Ireland, this is what we do.
We take this deluge of data and we make sense of it. Then we come up with ways to make the best use of
it for the benefit of society. At Insight Ireland, we process and use information to enable better decision
making for individuals, society and industry.
Introduction
In this session we will give a flavour of the diverse range of problems which we are working on in Insight and outline some of the research challenges which we are aiming to overcome.
Personal Analytics
Personal Analytics is a particular focus within Insight. Our research is concerned with fundamen-
tal questions related to personal sensing, measurement and understanding human behaviour and
performance, and implementation of feedback and information that is designed to enhance human
behaviour and performance. We have seen an exponential rise in our capability to measure and mon-
itor a range of human performance and behavior metrics in recent years through the development
of a large range of sensing technologies. This is irrespective of whether we are talking about the consumer wellness and fitness application space, or the formal management of health. Despite all
this progress, there is still a lot of work to do in this area. Firstly, there are still some biomedical
targets that we cannot accurately and effectively measure outside of a laboratory or clinical envi-
ronment so we need to keep developing the pipeline of new sensing technologies. As well as this,
we also need to make better progress with our capacity to better understand the data and resultant
application models associated with existing sensor technologies. And, this essentially, is what we set
out to do in the Personal Sensing group in Insight. We are addressing this gap with a programme
of interdisciplinary research that takes on the following set of high level challenges, which this talk
will outline.
Scaling Bayesian statistics for big data
One of the major issues facing practitioners is the question of how to scale Bayesian methods to
large datasets. Markov chain Monte Carlo is the de facto method of choice. However, it requires one to evaluate the likelihood function of the data twice at every iteration. Therefore it is inherently costly when the number of observations is large. Here we present Light and Widely Applicable MCMC (LWA-MCMC), a novel approximation of the Metropolis-Hastings kernel targeting a posterior
distribution for a large dataset. Inspired by Approximate Bayesian Computation, we design a Markov
chain whose transition makes use of an unknown but fixed fraction of the available data, where the
random choice of sub-sample is guided by the fidelity of this sub-sample to the observed data, as
measured by summary (or sufficient) statistics. LWA-MCMC is a generic and flexible approach, as
illustrated by the diverse set of examples which we explore. In each case LWA-MCMC yields excellent
performance and in some cases a dramatic improvement compared to existing methodologies.
Modeling network data
Network data arise when the connections between entities are the focus of the analysis. Network
data are becoming increasingly common in the big data era. We will give an overview of some recent
novel network models that have been developed within Insight. Models that account for clustered network data, temporal networks and networks of rankings will be described. Finally, recent models for hypergraph data will also be introduced.
References
Maire, F., Friel, N. and Alquier, P. (2015). Light and Widely Applicable MCMC: Approximate Bayesian
Inference for Large Datasets. arXiv:1503.04178.
Wednesday 18th May
Session 7: Chair, Dr. Kevin Burke
Invited Speaker
11:20 – 11:40 Christian Pipper Evaluation of multi-outcome longitudinal studies
Contributed Talks
11:40 – 12:00 Conor Donnelly A multivariate joint modelling approach to incorporate individuals’ longitudinal response trajectories within the Coxian phase-type distribution
12:00 – 12:20 Katie O’Brien Breast screening and disease subtypes: a population-based analysis
12:20 – 12:40 Andrew Gordon Prediction of time until readmission to hospital of elderly patients using a discrete conditional phase-type model incorporating a survival tree
Evaluation of multi-outcome longitudinal studies
Christian Bressen Pipper∗1, Signe Marie Jensen2 and Christian Ritz2
1 Department of Public Health, University of Copenhagen, Denmark.
2 Department of Nutrition, Exercise and Sports, University of Copenhagen, Denmark.
∗Email: [email protected]
Abstract: Evaluation of intervention effects on multiple outcomes is a common scenario in clinical studies.
In longitudinal studies such evaluation is a challenge if one wishes to adequately capture simultaneous
data behavior. Therefore a popular approach is to analyse each outcome separately. As a consequence
multiple statistical statements about the intervention effect need to be reported and adjustment for multiple
testing is necessary. However, this is typically done by means of the Bonferroni procedure not taking into
account the correlation between outcomes and thus resulting in overly conservative conclusions. In this
talk an alternative approach for multiplicity adjustment is proposed. The suggested approach incorporates
between outcome dependence towards an appreciably less conservative evaluation.
Introduction
In clinical intervention studies the effect of intervention is often sought to be evaluated on the
basis of multiple longitudinal outcomes. These outcomes typically represent different aspects of
the progression of a particular condition and are inherently correlated. One such example, that we
look into in this talk, is a longitudinal intervention study of the effect of consumption of different
milk proteins on health in overweight adolescents. The particular outcomes considered in this
study include profiles of body weight, BMI, waist circumference, plasma glucose and plasma insulin
(Arnberg et al., 2012). All of these outcomes are clearly biologically linked and thus expected to be
substantially correlated.
Methods
To adequately capture the data generating mechanism this apparent correlation should be addressed
at some point during the analysis of these data. However, the classical statistical approach of doing
so in terms of a simultaneous model for all outcomes quickly becomes a complicated matter involving
an excessive amount of model parameters, model assumptions, and hard to interpret measures of
intervention effect. For these reasons, the analysis of multi-outcome longitudinal studies is rarely approached by simultaneous modelling of outcomes.
A more commonly used approach to analysing such data is to model outcomes separately by means
of standard methodology such as mixed linear normal models for analysis of repeated measurements.
This has the advantage of providing an easy to understand and more robust evaluation of intervention
effect per outcome. The disadvantage is that we are subsequently faced with multiple assessments of
the intervention effect. Accordingly, if we want to make a confirmatory evaluation of the intervention
effect, where we control the familywise type 1 error, we need some kind of adjustment for multiple
testing. To this end the Bonferroni adjustment is typically applied, but, as the outcomes - and
consequently also the test statistics - are correlated, this approach may lead to overly conservative
conclusions.
Accordingly, the potential gain of utilizing correlations between test statistics is reflected in recent
developments of procedures for multiplicity adjustment the most popular one being the single-step
procedure proposed by Hothorn et al. (2008). For this procedure to work we need estimates of
correlations between test statistics. However, in the context of test statistics for different outcomes
from different models, no such estimates are readily available.
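For orientation, one way to carry out this kind of joint single-step adjustment in R is via the multcomp package's multiple-marginal-models interface (mmm/mlf), which draws on the stacking idea of Pipper et al. (2012); the sketch below is illustrative only, with invented variable names and simple linear models standing in for the mixed models of the talk.

library(multcomp)

# Separate marginal models, one per outcome (simplified to lm for the sketch);
# treatment is assumed to be a factor with levels A (reference) and B
fit_bmi <- lm(bmi ~ treatment + baseline_bmi, data = d)
fit_glu <- lm(glucose ~ treatment + baseline_glucose, data = d)

# Joint inference on the same treatment contrast across both models, with the
# single-step adjustment using the estimated between-model correlation
joint <- glht(mmm(BMI = fit_bmi, Glucose = fit_glu), mlf("treatmentB = 0"))
summary(joint)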
Results and Discussion
In this talk, we approach the analysis of multi-outcome longitudinal data by means of separate mixed
linear normal models for each outcome. Within this framework we outline how to obtain a consistent
estimator of the simultaneous asymptotic variance-covariance matrix of within and between model
fixed effect parameter estimates. This in turn enables the use of the single-step procedure proposed
by Hothorn et al (2008) to obtain an efficient evaluation of an intervention effect on multiple
longitudinal outcomes.
The derivation of the simultaneous asymptotic variance-covariance matrix is made without any
additional model assumptions. It is an extension of the methodology developed in Pipper et al
(2012), where simultaneous asymptotic behavior of estimates from multiple models was derived by
so called stacking of an asymptotic representation of the estimates.
With the methodology in place we turn to the analysis of the milk protein study. Here we discuss the
challenges and advantages of the different modelling strategies. Next, we provide an outline of the
analysis with more details on the study design and the actual statistical modelling. We also compare
evaluation of intervention effects based on the proposed multiplicity correction with evaluation based
on traditional Bonferroni correction. Finally, we remark on the applicability of our proposal in terms
of robustness, design issues such as missing values, implementation, and potential extensions. The
talk is based on the paper by Jensen et al. (2015).
References
Arnberg, K. et al. (2012). Skim Milk, Whey, and Casein Increase Body Weight and Whey and Casein Increase the Plasma C-peptide Concentration in Overweight Adolescents. The Journal of Nutrition, 142, pp. 2083 – 2090.
Hothorn, T., Bretz, F. and Westfall, P. (2008). Simultaneous inference in general parametric models. Biometrical Journal, 50, pp 346–363.
Jensen, S.M., Pipper, C.B. and Ritz, C. (2015). Evaluation of multi-outcome longitudinal
studies. Statistics in Medicine, 34, pp. 1993–2003.
Pipper, C.B., Ritz, C. and Bisgaard, H. (2012). A versatile method for confirmatory evaluation of the effects of a covariate in multiple models. Journal of the Royal Statistical Society, Series C, 61, pp. 315 – 326.
A multivariate joint modelling approach to incorporate individuals’
longitudinal response trajectories within the Coxian phase-type
distribution
Conor Donnelly∗, Lisa M. McCrink and Adele H. Marshall
Centre for Statistical Science and Operational Research (CenSSOR),
Queen’s University Belfast, Northern Ireland
∗Email: [email protected]
Abstract: This research explores the use of a two-stage approach to evaluate the effect of multiple
longitudinal response variables on some related future event outcome. In stage one, a multivariate linear
mixed effects (LME) model is utilised to represent the correlated longitudinal responses and, in stage two,
a Coxian phase-type distribution is employed to evaluate the effect of each longitudinal response on time
to event outcome. The approach is illustrated using data collected on individuals suffering from chronic
kidney disease (CKD). It was observed that both the time-varying responses, haemoglobin and creatinine
levels, have a strong predictive potential of the time to death of CKD patients.
Introduction
It is common, particularly within the medical field, for longitudinal and survival data to be collected
concurrently, with previous research showing that there typically exists some association between
both processes. For instance, when making multiple measures on various biomarkers relating to an
individual’s health condition, it would be expected that the dynamic nature of these biomarkers would
have a strong predictive potential of some related, future event outcome. In the presence of such an
association, independent analysis of one process can produce biased parameter estimates and lead to
invalid inferences. Instead, joint modelling techniques are a relatively recent statistical development
capable of considering both processes simultaneously so as to reduce this bias (Henderson et al.,
2000).
Joint models make use of two submodels to represent each of the processes of interest; typically
a linear mixed effects (LME) model is utilised to represent the longitudinal response and a Cox
proportional hazards model to represent the survival process. Such an approach has been successfully
employed, for example, to evaluate the effect of changing CD4 cell counts on the time to AIDS
diagnosis in HIV patients (Wulfsohn and Tsiatis, 1997).
This paper will consider the use of the Coxian phase-type distribution to represent the survival
process. The Coxian phase-type distribution is a special type of Markov model which represents
the time to absorption of a continuous, finite state Markov chain (Marshall and McClean, 2004).
The distribution can be used to represent an individual’s survival time as a series of distinct states
through which the individual transitions as their health condition changes. Thus, by employing this
approach, inferences can be made on the factors that affect both the individuals’ survival times and
the rates at which their condition changes.
Application
The methodology described above is implemented on a dataset collected from the Northern Ireland
Renal Information Service over a ten year period from 2002 until 2012. It consists of 1,320 pa-
tients with a total of 27,113 repeated measures. As it is of interest to evaluate the effect of both
haemoglobin (Hb) and creatinine levels on survival, a multivariate LME model with correlated ran-
dom effects is fitted in stage one. In stage two, then, the individuals’ estimated random effects are
incorporated within the Coxian phase-type distribution so as to evaluate their effect on individuals’
rates of transition through the CKD health stages, represented by the states of the distribution.
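A highly simplified sketch of the two-stage idea follows (illustrative only; variable names are invented, and a Cox model stands in for the Coxian phase-type distribution, whose fitting requires specialised code).

library(lme4)
library(survival)

## Stage 1: bivariate LME via response stacking; `long` has one row per
## measurement, with columns patient, time, marker ("hb"/"creat") and value y
fit1 <- lmer(y ~ 0 + marker + marker:time + (0 + marker | patient), data = long)
re <- as.data.frame(ranef(fit1)$patient)       # correlated per-patient random intercepts
names(re) <- sub("^marker", "b_", names(re))   # -> b_creat, b_hb
re$patient <- rownames(re)

## Stage 2: carry the estimated random effects into the survival model
## (`surv` has one row per patient with follow-up time and death indicator)
surv2 <- merge(surv, re, by = "patient")
fit2 <- coxph(Surv(stime, died) ~ b_hb + b_creat, data = surv2)
summary(fit2)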
Conclusion
The covariances of the random effects, which give a measure of the correlation between the two
response variables, indicate that Hb and creatinine are significantly correlated. By incorporating the
individuals’ random effects within the Coxian phase-type distribution, it is observed that deviations
from the population average Hb and creatinine levels have a significant impact both on individuals’
survival times and their rates of flow through the underlying stages of the disease.
References
R. Henderson, P. Diggle, and A. Dobson (2000). Joint modelling of longitudinal measurements and
event time data. Biostatistics, 1(4), pp. 465 – 480.
A. H. Marshall and S. I. McClean (2004). Using Coxian phase-type distributions to identify patient
characteristics for duration of stay in hospital. Health Care Management Science, 7(4), pp. 285 –
289.
M. S. Wulfsohn and A. A. Tsiatis (1997). A joint model for survival and longitudinal data measured
with error. Biometrics, 53(1), pp. 330 – 339.
Breast screening and disease subtypes: a population-based analysis
KM O’Brien∗1, P Fitzpatrick2,3, T Kelleher1 and L Sharp4
1National Cancer Registry, Ireland
2School of Public Health, Physiotherapy and Sports Science, University College Dublin, Ireland
3Director of Programme Evaluation, BreastCheck, Ireland
4Institute of Health & Society, Newcastle University, United Kingdom
∗Email: [email protected]
Aims
Mammographic screening affects the natural history and epidemiology of breast cancer in the popu-
lation, and population-based data are needed to improve the understanding of the impact of screening.
Previous studies have suggested differences in stage, grade and tumour size in screen-detected, com-
pared with non-screen-detected, cancers. We investigated whether there was an association between mode of detection of breast cancer and tumour characteristics, in particular disease subtype, at the population level.
Methods
We matched individual-level data from the Irish breast screening programme (BreastCheck) with
the Irish National Cancer Registry (NCR) to classify all breast cancers diagnosed in the period 2006–2011 by mode of detection: screen-detected cancers; interval cancers (i.e. cancers diagnosed within
two years of a negative mammogram); and cancers diagnosed in non-participants of the screening
programme. Information on oestrogen receptor expression (ER), progesterone receptor expression
(PR), and human epidermal growth factor receptor 2 (HER-2) was obtained from NCR records.
Subtype was defined as shown in Table 1.
Table 1: Subtype definition
Subtype                Receptor status
luminal A              ER or PR positive, and HER-2 negative
luminal B              ER or PR positive, and HER-2 positive
HER2 over-expressing   ER negative and PR negative and HER-2 positive
triple negative        ER negative and PR negative and HER-2 negative
The association of the outcome, mode of detection, with the main explanatory variable, tumour
subtype, was assessed using a multinomial logistic regression model, with screen-detected cancers
as the baseline category. The model was adjusted for socio-demographic variables (marital status,
smoking status at diagnosis and deprivation category of area of residence) and clinical variables
(stage and grade).
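As a minimal sketch of this model on simulated data with hypothetical variable names (not the NCR/BreastCheck data), nnet::multinom is one standard way to fit a multinomial logistic regression with screen-detected cancers as the baseline category:

library(nnet)

set.seed(1)
n <- 500
bc <- data.frame(
  mode        = sample(c("screen", "interval", "non-participant"), n, replace = TRUE),
  subtype     = sample(c("luminal A", "luminal B", "HER2", "triple negative"), n, replace = TRUE),
  deprivation = sample(1:5, n, replace = TRUE),
  stage       = sample(1:4, n, replace = TRUE),
  grade       = sample(1:3, n, replace = TRUE)
)
bc$mode <- relevel(factor(bc$mode), ref = "screen")   # screen-detected as baseline

# Further socio-demographic adjusters (marital status, smoking) would enter in the same way
fit <- multinom(mode ~ subtype + deprivation + stage + grade, data = bc)

s <- summary(fit)
exp(s$coefficients)                                   # odds ratios relative to screen-detected
exp(s$coefficients - 1.96 * s$standard.errors)        # approximate lower 95% limits
exp(s$coefficients + 1.96 * s$standard.errors)        # approximate upper 95% limits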
Results
In the period 2006–2011, there were 6,848 women, aged between 50 and 66, with a primary breast cancer diagnosis, of whom 45% had a screen-detected cancer, 14% had an interval cancer and 41% were not participants of the screening programme. Of those with known subtype, 75% were luminal A, 11% were luminal B, 6% were HER2 over-expressing and 8% were triple negative.
In the adjusted analysis, compared to screen-detected cancers, interval cancers were 3 times more likely to be triple negative than luminal A (OR 2.7, 95% CI [2.0, 3.7]). Similarly, non-participants were almost twice as likely to be triple negative (OR 1.9, 95% CI [1.5, 2.5]).
In addition, compared to screen-detected cancers, interval cancers were associated with a 1.7-fold increase in the odds of being HER-2 over-expressing (compared to luminal A) (95% CI [1.1, 2.4]), and for non-participants there was a 1.4-fold increase in the odds of being HER-2 over-expressing (95% CI [1.1, 1.9]).
Conclusion
In this novel study, we have provided evidence that breast cancer subtype distribution differs by
mode of detection. Since subtype is one of the major prognostic indicators, and a determinant
of treatment, our results may, in part, explain the well-known survival advantage for women with
screen-detected, rather than symptomatic, cancers. There is a need for further population-based
studies of subtype in screen-detected, interval and other (symptomatic) cancers, to determine
whether our findings hold in other healthcare settings.
Prediction of time until readmission to hospital for elderly patients using
a discrete conditional phase-type model incorporating a survival tree
Andrew S. Gordon∗1, Adele H. Marshall1, Karen J. Cairns1 and Mariangela Zenga2
1Centre for Statistical Science and Operational Research, Queen’s University Belfast, Northern Ireland, UK
2Department of Economics, Management and Statistics, Universita degli Studi di Milano-Bicocca, Italy
∗Email: [email protected]
Abstract: A feature of elderly patient care is the frequent readmissions they require to hospital. The Dis-
crete Conditional Phase-type (DC-Ph) model is a technique through which length of stay in the community
may be modelled by using a conditional component to partition patients into cohorts before representing
the resulting survival distributions by a process component, namely the Coxian phase-type distribution.
This research expands the DC-Ph family of models by introducing a survival tree as the conditional com-
ponent, with a method to predict the time taken until readmission also presented. The methodology is
demonstrated using data for elderly patients from the Lombardy and Abruzzo regions of Italy.
Introduction
With a steady rise in the length of time people are living comes an increase in the strain placed on
hospital resources. As more elderly people require healthcare resources, hospitals often discharge
patients to continue their care in the community, in an attempt to alleviate this strain. This
often leads to frequent hospital readmissions not long after discharge to the community; the result of
elderly patients not having had the necessary time to convalesce. Accurate prediction of the expected
duration spent by patients in the community before they require readmission to hospital would greatly
facilitate hospital managers in ensuring that alternative measures of community care are in place for
this time, so that readmission to hospital may be avoided. Nevertheless, the time that elderly people
stay in the community can be greatly influenced by a large number of possible circumstances. As a
result, this duration is unlikely to be homogeneous across all elderly people, meaning that obtaining
a single prediction of expected time in the community, for the elderly population as a whole, is likely
to be erroneous.
Methodology
Discrete Conditional Phase-type models (DC-Ph) are a family of models consisting of two compo-
nents; a conditional component and a process component (Marshall et al., 2007). The conditional
component is used to separate survival data into distinct classes before the process component repre-
sents the skewed survival distribution for each class using the Coxian phase-type distribution. With a
survival tree (McClean et al., 2010) used in the role of the conditional component, the DC-Ph model
is used to model time spent by discharged elderly patients in the community, prior to readmission
to hospital. Elderly people having a similar distribution of time spent in the community are grouped
together in the same class, whilst those with significantly different distributions of length of stay
are in different classes. Modelling the resulting distribution for each class through the use of the
Coxian phase-type distribution enables the rates associated with different latent subprocesses within
the overall system of community care to be determined. Furthermore, the model may be inverted
to predict the length of time spent in the community for new patients.
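A rough sketch of the conditional (survival tree) component on simulated data with hypothetical covariates; partykit::ctree is used here as a convenient stand-in for the survival tree of McClean et al. (2010), and the Coxian phase-type fit to each class is not shown:

library(survival)
library(partykit)

set.seed(1)
n <- 400
# Hypothetical discharge records: days spent in the community before readmission
dat <- data.frame(
  age  = rnorm(n, 80, 6),
  sex  = factor(sample(c("F", "M"), n, replace = TRUE)),
  ward = factor(sample(c("medical", "surgical"), n, replace = TRUE))
)
dat$days  <- rexp(n, rate = ifelse(dat$sex == "M", 1 / 90, 1 / 140))
dat$event <- rbinom(n, 1, 0.8)                  # 1 = readmitted, 0 = censored

# Conditional-inference survival tree: partitions patients into classes with
# significantly different time-to-readmission distributions
tree <- ctree(Surv(days, event) ~ age + sex + ward, data = dat)

# Terminal-node membership defines the classes whose survival distributions
# would each be represented by a Coxian phase-type distribution (not shown)
dat$class <- predict(tree, type = "node")
table(dat$class)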
Conclusion
Patient information for two regional data sets from Italy has been used to build a DC-Ph model
with a view to predicting the length of time elderly patients spend in the community between
hospital spells. Ten cohorts of patients are identified by the survival tree, each of which have
significantly different skewed survival distributions. Through the simulation of times for each of these
distributions, accurate estimates (with confidence intervals) of when newly discharged patients are
likely to require readmission to hospital may be obtained. If alternate care provision can be planned
for patients in accordance with these respective estimates, then readmission to hospital may be
avoided altogether and vital hospital resources saved.
References
Marshall, A.H. et al. (2007). Patient Activity in Hospital using Discrete Conditional Phase-type (DC-Ph)
Models. Recent Advances in Stochastic Modelling and Data Analysis, pp. 154 – 161.
McClean, S. et al. (2010). Using mixed phase-type distributions to model patient pathways. IEEE
Computer-Based Medical Systems (CBMS), 23rd International Symposium on, pp. 172 – 177.
Monday 16th May
Poster & Lightning Talks Session: Chair, Dr. Norma Bargary
1. Alberto Alvarez-Iglesias  An alternative pruning based approach to unbiased recursive partitioning algorithms
2. Idemauro Antonio Rodrigues de Lara  Ordinal transition models and a test of stationarity
3. Fiona Boland  Retention in methadone maintenance treatment (MMT) in primary care: national cohort study using proportional hazards frailty model for recurrent MMT episodes
4. Lampros Bouranis  Bayesian inference for misspecified exponential random graph models
5. Kevin Burke  Non-proportional Hazards Modelling
6. Caoimhe M. Carbery  Dynamic Bayesian networks implemented for the analysis of clinical data
7. Niamh Ducey  Cluster Analysis of Hepatitis C Viral Load Profiles in Pre-treatment Patients with Censored Infection Times
8. Jonathan Dunne  Of queues and cures: A solution to modelling the inter time arrivals of cloud outage events
9. Lida Fallah  Study of joint progressive type-II censoring in heterogeneous populations
10. John Ferguson  Extending Average Attributable Fractions
11. Olga Kalinina  Variable Selection with multiply imputed data when fitting a Cox proportional hazards model: a simulation study
12. Felicity Lamrock  Extending the msm package to derive transition probabilities for a Decision Analytic Markov Model
13. Angela McCourt  Adaptive Decision-Making using Non-Parametric Predictive Intervals
14. Meabh G. McCurdy  Identifying universal classifiers for multiple correlated outcomes, in clinical development
15. Keefe Murphy  Mixtures of Infinite Factor Analysers
16. Aoife O’Neill  Activity profiles using self-reported measures in population studies in young children: are there gender differences?
17. Amanda Reilly  Handling Missing Data in Clinical Trials
18. Davood Roshan Sangachin  Bayesian Adaptive Ranges for Clinical Biomarkers
An alternative pruning based approach to unbiased recursive partitioning
algorithms
Alberto Alvarez-Iglesias∗1, John Hinde2, John Ferguson1 and John Newell1,2
1HRB Clinical Research Facility, NUI Galway
2School of Mathematics, Statistics and Applied Mathematics, NUI Galway
∗Email: [email protected]
Abstract: A new post-pruning strategy is presented for tree based methods using unbiased recursive
partitioning algorithms. The proposed method includes a novel pruning procedure that uses a false discovery
rate (FDR) controlling procedure for the determination of splits corresponding to significant tests. The
new approach allows the automatic identification of interaction effects where other methods fail to do so.
Simulated and real-life examples will illustrate the procedure.
Introduction
Recursive partitioning algorithms are a popular tool at the exploratory stage of any data analysis
since they can generate models that can be easily interpreted with virtually no assumptions. To
avoid over-fitting, the final tree is obtained using one of the following two strategies: first grow a
large tree and then prune it back using CART-style post-pruning procedures (Breiman et al. 1984)
or use direct stopping rules based on p-values in the growing process (pre-pruning). The latter has
the advantage that variable selection is not biased towards predictors with many possible splits (see
Hothorn et al. 2006). This presentation discusses some of the drawbacks of pre-pruned trees based
on p-values in the presence of interaction effects and presents a simple solution that includes a novel
post-pruning strategy.
Methods
Pre-pruning strategies based on hypothesis tests, as in Hothorn et al. (2006), are used to protect
locally against the discovery of false positives (splits on noisy variables) at a pre-specified significance
level. They also work as a closed testing procedure where subsequent hypotheses are only assessed
if all previous ones are significant, controlling Family Wise Error Rate and preventing the tree from
over-fitting. However, due to the nested nature of the sequence of hypotheses along a tree, a stopping rule based on significance may prevent the model from testing other hypotheses further down the tree that could identify important effects, such as interactions. Solutions to this problem are available, such as increasing the significance level to grow a larger tree or adopting a CART-style post-pruning strategy, but in both cases the significance level is reduced to a simple hyper-parameter, losing its statistical meaning. The novel approach presented here allows the identification of such interactions. The new method uses an FDR controlling procedure (Benjamini and Hochberg 1995) for the determination of splits corresponding to significant tests. The proposed method considers the p-values obtained at each node globally, to control the proportion of significant tests that correspond to true alternative hypotheses. By doing so, the tests performed when growing the tree still retain a statistical interpretation and, at the same time, can be used in the pruning procedure.
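As a minimal sketch of the FDR step alone, on a hypothetical vector of split p-values collected from a grown tree (the tree-growing itself is not shown), the Benjamini-Hochberg adjustment is applied globally and splits whose adjusted p-values exceed the chosen level would be pruned away:

# Hypothetical p-values of the split tests at the inner nodes of a grown tree
node_p <- c(node1 = 0.0002, node2 = 0.0300, node3 = 0.2100,
            node4 = 0.0450, node5 = 0.6000)

# Benjamini-Hochberg adjustment applied globally across all split tests
p_adj <- p.adjust(node_p, method = "BH")

# Splits retained under FDR control at level 0.05; subtrees hanging from
# non-retained splits would be collapsed in the post-pruning step
keep <- p_adj <= 0.05
data.frame(p = node_p, p_BH = p_adj, keep = keep)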
Discussion
Table 1 shows the results of a simulation study from a model with an interaction effect and 9
additional noise variables.
Table 1: Relative comparison between the proposed method and rpart (Breiman et al. 1984) and ctree (Hothorn et al. 2006), with positive values indicating superiority of the proposed method.

         Accuracy                  Complexity
         Mean   95% CI             Mean    95% CI
rpart    37.7   (34.0, 41.5)       -65.3   (-71.8, -58.7)
ctree    43.6   (39.8, 47.3)       -87.6   (-94.2, -81.1)
As one can see, the proposed method has significantly better predictive accuracy in this setting. The drawback is an inability to pick up the first splits reliably, producing unnecessarily large trees. This is not intrinsically a problem of the proposed method but a problem of 1-step-ahead binary recursive partitioning in general, and further research is needed to provide a more desirable solution.
References
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful
approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological),
57 (1), pp. 289 – 300.
Breiman, L., Friedman, J. H., Stone, C. J., Olshen, R. A. (1984). Classification and regression trees
Boca Raton, Florida: CHAPMAN & HALL/CRC.
Hothorn, T., Hornik, K., Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference
framework. Journal of Computational and Graphical Statistics, 15 (3), pp. 651 – 674.
Ordinal transition models and a test of stationarity1
Idemauro Antonio Rodrigues de Lara∗1, John Hinde2
1Exact Sciences Department, Luiz de Queiroz College of Agriculture, University of Sao Paulo, Brazil
2School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway
∗Email: [email protected]
Abstract: In this work, we present the class of transition models to analyse longitudinal categorical data.
We consider two applications with ordinal responses, where proportional odds transition models are used
and a test to assess stationarity is proposed.
Introduction
The Markov transition models (Diggle et al. 2002) are a class of models for longitudinal data.
These models are based on stochastic processes and we consider a discrete-time discrete-state
process with a first-order Markov assumption, i.e., $\pi_{ab}(t-1, t) = \pi_{ab}(t) = P(Y_t = b \mid Y_{t-1} = a)$, with $a, b \in S = \{1, 2, \ldots, k\}$ and $t \in \tau = \{0, 1, \ldots, T\}$. In these models, a relevant issue is the assumption of stationarity, and a test to assess this is proposed.
Methodology
We consider two examples of longitudinal ordinal response data where transition models can be
applied. The first data concern respiratory condition over 5 time occasions (Koch et al., 1990). The
second dataset is from animal sciences (Castro, 2016), with 4 time occasions. We use a proportional
odds model (McCullagh, 1980) and incorporate the longitudinal dependence by including the previous
response as an additional covariate. The model is
$$\eta = \log\!\left(\frac{\gamma_{ab(t)}(x)}{1 - \gamma_{ab(t)}(x)}\right) = \lambda_{ab(t)} + \delta_t' x,$$
where $\gamma_{ab(t)}(x) = P(Y_{jt} \le b \mid Y_{a(t-1)}, x) = \pi_{a1(t)}(x) + \ldots + \pi_{ab(t)}(x)$ are the cumulative probabilities; $\lambda_{ab(t)}$ is an intercept; $x = (x_{t1}, x_{t2}, \ldots, x_{tp}, x_{t(p+1)})'$ is the vector of $(p+1)$ covariates, with $x_{t(p+1)}$ denoting the previous state; and $\delta_t' = (\beta_{t1}, \ldots, \beta_{tp}, \alpha_t)$ is a vector of unknown parameters. The general model with $\delta_t$ has time-dependent effects and, to assess stationarity, we use a likelihood-based test comparing this to a model with constant effects over time, i.e. $\delta_t = \delta_0$. A simulation study was guided by the motivating examples, i.e., we used the parameter estimates from these examples to simulate new ordinal data under two scenarios: stationary (1) and non-stationary (2).
For each scenario we performed 10,000 simulations for three different sample sizes and three time
durations. The analysis and simulation were implemented in R (R Core Team).
1This work was supported by the FAPESP, funding agency, Brazil, award 2015/02628-2
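A rough sketch of the fits and the stationarity comparison on simulated ordinal data with hypothetical variables; MASS::polr provides the proportional odds fits, and in this simplified version only the covariate and previous-state effects (not the cut-points) are allowed to vary with time:

library(MASS)

set.seed(1)
n <- 200; Tmax <- 4
d <- expand.grid(id = 1:n, time = 2:Tmax)
d$trt  <- rbinom(nrow(d), 1, 0.5)
d$prev <- factor(sample(1:3, nrow(d), replace = TRUE))                   # previous state
d$y    <- factor(sample(1:3, nrow(d), replace = TRUE), ordered = TRUE)   # current state

# Stationary model: effects constant over time
m0 <- polr(y ~ trt + prev, data = d, Hess = TRUE)
# Non-stationary model: effects allowed to differ by time occasion
m1 <- polr(y ~ (trt + prev) * factor(time), data = d, Hess = TRUE)

# Likelihood-based test of stationarity (delta_t = delta_0)
lr <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))
df <- attr(logLik(m1), "df") - attr(logLik(m0), "df")
pchisq(lr, df = df, lower.tail = FALSE)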
Results
Table 1: Rejection rates for the proposed test, resulting from 10,000 simulations, for scenario 1 (test size) and scenario 2 (test power).

                      T=4                     T=5                     T=6
Level                 10%    5%     1%        10%    5%     1%        10%    5%     1%
N=50   Scenario (1)   0.1113 0.0583 0.0135    0.0764 0.0336 0.0064    0.0649 0.0280 0.0051
       Scenario (2)   1.0000 1.0000 1.0000    0.4841 0.3356 0.1359    0.5206 0.3740 0.1536
N=100  Scenario (1)   0.1149 0.0576 0.0122    0.1137 0.0563 0.0109    0.1070 0.0559 0.0113
       Scenario (2)   1.0000 1.0000 1.0000    0.9577 0.9171 0.7664    0.9774 0.9513 0.8492
N=500  Scenario (1)   0.1089 0.0546 0.0098    0.1047 0.0507 0.0107    0.1084 0.0557 0.0115
       Scenario (2)   1.0000 1.0000 1.0000    1.0000 1.0000 1.0000    1.0000 1.0000 1.0000
The proposed test is simple to apply and the results of the simulation show good performance. The results are quite close to those of the classical goodness-of-fit type test of Anderson and Goodman (1957).
References
Anderson, T.W., Goodman, L.A. (1957) Statistical Inference about Markov Chains. Annals of Mathe-
matical Statistics, 28: 89–110.
Castro, A.C. (2016). Comportamento e desempenho sexual de suínos reprodutores em ambientes en-
riquecidos, PhD. dissertation. Brazil: University of Sao Paulo.
Diggle, P.J., Heagerty, P.J., Liang, K.Y., Zeger, S.L. (2002). Analysis of longitudinal data. New
York: Oxford University Press.
Koch, G.C., Carr, G.J., Amara, I.A., Stokes, M.E., Uryniak, T.J. (1990). Categorical Data Analy-
sis. In: Statistical Methodology in the Pharmaceutical Sciences. New York: Marcel Dekker, Chapter
13, pp. 389 – 473.
McCullagh, P. (1980). Regression Methods for Ordinal Data. Journal of The Royal Statistical Society,
42, pp. 109 – 142.
R Core Team. (2015). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria.
URL http://www.R-project.org/.
Retention in methadone maintenance treatment (MMT) in primary care:
national cohort study using proportional hazards frailty model for
recurrent MMT episodes
Grainne Cousins1, Fiona Boland∗2, Joseph Barry3, Suzi Lyons4, Kathleen Bennett5 and Tom Fahey6
1School of Pharmacy, Royal College of Surgeons in Ireland (RCSI), Dublin, Ireland
2HRB Centre for Primary Care Research, Royal College of Surgeons in Ireland
3Trinity College Centre for Health Sciences, Tallaght Hospital, Dublin, Ireland
4Health Research Board, Dublin, Ireland
5Division of Population Health Sciences, Royal College of Surgeons in Ireland
∗Email: [email protected]
Abstract: The objective of this study was to identify determinants of time to discontinuation of methadone
maintenance treatment (MMT) across multiple treatment episodes in primary care. All patients on a na-
tional methadone treatment register aged 16-65 years between 2004 and 2010 were included. Proportional
hazards frailty models were developed to assess factors associated with time to discontinuation from re-
current MMT episodes. A total of 6,393 patients experienced 19,715 treatment episodes. Median daily
doses over 60mgs and having more than 20% of methadone dispensed as supervised consumption were
associated with longer treatment episodes. Patients experiencing multiple treatment episodes tended to
stay in treatment for progressively longer periods of time.
Introduction
Opiate users have a high risk of premature mortality (Degenhardt et al., 2013). Ireland was identified
as having the third highest prevalence of opiate use in Europe (0.72%) (European Monitoring
Centre for Drugs and Drug Addiction, 2013) and the number of cases entering treatment continues
to increase. However, the overall number of deaths from overdose of opiates has not decreased
(Lyons et al., 2014). Retention in methadone maintenance treatment (MMT) is associated with
reduced mortality and therefore the objective of this study was to identify determinants of time to
discontinuation of MMT across multiple treatment episodes in primary care.
Methods
We identified people registered on the Central Treatment List (CTL), a national register of patients
in MMT, who were prescribed and dispensed at least one prescription for methadone between
August 2004 and December 2010. The outcome measure was time to discontinuation of MMT.
A patient was defined as ‘on treatment’ based on the coverage of their methadone prescriptions.
If there was a gap of 7 days, a patient was considered to have ceased treatment. Median daily
methadone dose and proportion of methadone scripts per treatment episode which were dispensed
under supervised consumption were included as possible determinants of time to discontinuation of
MMT. Age, gender and comorbidities were included as potential confounders. Proportional hazards
gamma frailty models were fitted to account for the dependence in the length of individuals’ repeated
treatment episodes.
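A minimal sketch of such a frailty model on simulated episode-level data with hypothetical column names; the gamma frailty term in the survival package is used to account for the dependence between a patient's repeated episodes:

library(survival)

set.seed(1)
n_pat <- 300
episodes <- do.call(rbind, lapply(1:n_pat, function(i) {
  k <- sample(1:4, 1)                                   # number of episodes for patient i
  data.frame(patient_id  = i,
             dose_cat    = sample(c("<60mg", "60-120mg", ">120mg"), k, replace = TRUE),
             supervision = sample(c("<20%", "20-39%", ">=40%"), k, replace = TRUE),
             age         = rnorm(1, 35, 8),
             days        = rexp(k, rate = 1 / 140),     # episode length
             ceased      = rbinom(k, 1, 0.8))           # 1 = discontinued, 0 = ongoing
}))

fit <- coxph(Surv(days, ceased) ~ dose_cat + supervision + age +
               frailty(patient_id, distribution = "gamma"),
             data = episodes)
summary(fit)   # hazard ratios below 1 correspond to longer time to discontinuation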
Results
6,393 patients experienced 19,715 treatment episodes. The overall median episode length was 140 days (IQR: 38-412), with 19.5% of all episodes ongoing at the end of follow-up. Compared to <60 mg, median daily doses >60 mg (60-120 mg: hazard ratio (HR)=0.47, 95% CI 0.45-0.50; >120 mg: HR=0.62, 95% CI 0.53-0.72), and having greater than 20% of methadone scripts dispensed as supervised consumption (compared to <20%), were associated with longer treatment episodes (20-39% of scripts: HR=0.36, 95% CI 0.33-0.38; 40-59% of scripts: HR=0.24, 95% CI 0.22-0.27; 60-79% of scripts: HR=0.25, 95% CI 0.22-0.27; >80% of scripts: HR=0.28, 95% CI 0.26-0.30). Patients experiencing multiple treatment episodes tended to stay in treatment for progressively longer periods of time.
Conclusion
The prescription of higher daily doses of methadone, and regular supervised consumption can increase
MMT retention.
References
Degenhardt L., Bucello C., Mathers B., Briegleb C., Ali H., Hickman M. et al. (2011) Mortality
among regular or dependent users of heroin and other opioids: a systematic review and meta-analysis
of cohort studies. Addiction 106: pp. 32 - 51.
European Monitoring Centre for Drugs and Drug Addiction (EMCDDA). (2013) European Drug
Report: Trends and developments. Euro Surveill 18: pp. 29 - 47.
Lyons S., Lynn E., Walsh S., Long J. (2014) Drug-related deaths and deaths among drug users in Ire-
land. HRB Trends Series 4. Dublin: Health Research Board.
Bayesian inference for misspecified exponential random graph models
Lampros Bouranis∗1, Nial Friel1 and Florian Maire1
1School of Mathematics and Statistics & Insight Centre for Data Analytics, University College
Dublin, Ireland
∗Email: [email protected]
Abstract: In this work, we explored Bayesian inference of exponential random graph models with tractable
approximations to the true likelihood and we applied our methodology in real networks of increased com-
plexity. Our work is involved with the pseudolikelihood function which is algebraically identical to the
likelihood for a logistic regression. Naive implementation of a posterior from such a misspecified model
is likely to give misleading inferences. We provide background theory and practical guidelines for efficient
correction of the posterior mean and covariance for the analysis of real-world graphs.
Introduction
There are many statistical models with intractable (or difficult to evaluate) likelihood functions.
Composite likelihoods provide a generic approach to overcome this computational difficulty. A natural idea in a Bayesian context is to consider the approximate posterior distribution $\pi_{CL}(\theta \mid y) \propto p_{CL}(y \mid \theta)\, p(\theta)$. Surprisingly, there has been very little study of such a misspecified posterior distribution. We focus on the exponential random graph model, which is widely used in statistical network analysis. The pseudolikelihood function provides a low-dimensional approximation of the ERG likelihood. We provide a framework which allows one to calibrate the pseudo-posterior distribution. In experiments our approach provided improved statistical efficiency with respect to more computationally demanding Monte Carlo approaches.
Methods
To conduct Bayesian inference using the ERG likelihood model, we adopt the well-known full-update
Metropolis-Hastings sampler. To overcome the intractability of the ERG likelihood, we propose to
replace the true likelihood $p(y \mid \theta) = q(y \mid \theta)/z(\theta)$ with a tractable but misspecified likelihood model, leading us to focus on the approximated posterior distribution, or "pseudo-posterior":
$$\pi_{PL}(\theta \mid y) \propto p_{PL}(y \mid \theta) \cdot p(\theta). \qquad (1)$$
Estimating the pseudolikelihood pPL(y|θ) is effortless; misspecification comes from the strong and
often unrealistic assumption of independent graph dyads. Calibration of the unadjusted posterior
MCMC samples to obtain appropriate inference is executed with two operations: a ”mean adjust-
ment” to ensure that the true and the approximated posterior distributions have the same mode and
a ”curvature adjustment” that modifies the geometry of the approximated posterior at the mode
(Stoehr and Friel, 2015).
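A self-contained sketch of the pseudolikelihood idea for an edges-plus-triangles ERG model on a small simulated undirected graph (this illustrates the principle only, not the calibration procedure above): each dyad is treated as an independent Bernoulli observation with its triangle change statistic as predictor, so the pseudolikelihood is exactly a logistic regression, and a Bayesian fit would simply place a prior on the same regression:

set.seed(1)
n <- 30
A <- matrix(rbinom(n * n, 1, 0.1), n, n)        # simulate a small undirected graph
A[lower.tri(A, diag = TRUE)] <- 0
A <- A + t(A)

# Dyad-level design: response A[i, j] and the triangle change statistic, i.e. the
# number of common neighbours of i and j (the edges change statistic is constant,
# so it is absorbed into the intercept)
ij <- which(upper.tri(A), arr.ind = TRUE)
y  <- A[upper.tri(A)]
tri_change <- apply(ij, 1, function(d) sum(A[d[1], ] * A[d[2], ]))

# Maximum pseudolikelihood estimate via logistic regression over the dyads
mple <- glm(y ~ tri_change, family = binomial)
summary(mple)$coefficients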
Application
We consider a large network of 1200 nodes and a two-dimensional model. We compare the calibration
procedure to the Approximate exchange algorithm (AEA) of Caimo and Friel (2011), which has been
used in the context of ERG models and has shown good results. It is possible to carry out inference
for graphs of larger size (e.g. 1,000 nodes), but at the cost of an increased computational time.
Bayesian logistic regression can be performed using standard software, in a fast and straight-forward
manner. Once the mean adjustment and the curvature adjustment were performed, a very good
approximation of the true posterior with efficient correction of the posterior variance was obtained
(Figure 1), while achieving a five-fold speedup relative to the Approximate Exchange.
[Figure: contour plots of the posterior for (θ1, θ2), comparing the AEA posterior with the unadjusted pseudo-posterior (TV = 0.028) and with the mean- and curvature-adjusted pseudo-posterior.]
Figure 1: Phases of calibration of the misspecified posterior distribution using a pseudolikelihood approximation.
References
Caimo, A. and Friel, N. (2011). Bayesian inference for exponential random graph models. Social Net-
works, 33:41-55.
Stoehr, J. and Friel, N. (2015). Calibration of conditional composite likelihood for Bayesian inference on Gibbs random fields. AISTATS, Journal of Machine Learning Research: W & CP, volume 38, pp. 921 – 929.
Non-proportional Hazards Modelling
Kevin Burke∗1 and Gilbert MacKenzie2
1Department of Mathematics and Statistics, University of Limerick, Ireland
2CREST, ENSAI, France
∗Email: [email protected]
Abstract: We investigate parametric approaches for handling non-proportional hazards. Specifically, we
introduce the Multi-Parameter Regression (MPR) modelling framework and compare this to a standard
approach known as frailty. It is noteworthy that the MPR approach generates a new test of proportionality.
We argue that multi-parameter regression is more natural than frailty for capturing non-proportional effects
and show that it is more flexible both analytically and in the context of a lung cancer dataset. We also
consider models which combine the MPR and frailty concepts providing further generality.
Introduction
The most popular regression model for survival data is, by far, the Proportional Hazards (PH)
model whereby the hazard for individual $i$ is $\lambda(t \mid x_i) = \exp(x_i^T \beta)\, \lambda_0(t)$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ and $\beta = (\beta_1, \ldots, \beta_p)^T$ are the vectors of covariates and regression coefficients respectively, and $\lambda_0(t)$ is a baseline hazard function common to all individuals. Clearly, the ratio of two hazards is $\exp[(x_i - x_j)^T \beta]$, which does not depend on time, i.e., hazards are proportional. From this, straightforward interpretation follows: e.g., "the risk in group 1 is $\psi$ times that of group 2" where $\psi$ is a proportionality constant. In spite of the virtue of interpretability under the PH assumption,
non-PH effects are often encountered in practice. We will investigate parametric survival models
which account for non-PH effects.
Methods
One common explanation for non-PH effects is the presence of an unobservable, gamma-distributed
random effects term (Duchateau & Janssen, 2007). This term represents additional heterogeneity
in the hazard which cannot be explained by xi, i.e., missing information/covariates. In this so-called
gamma frailty model, covariate effects may be proportional at the individual level but are non-
proportional at the marginal level as a consequence of this heterogeneity. An alternative explanation
is that covariates truly exhibit non-PH effects at the individual level (and, hence, at the marginal
level). We introduce and explore the “Multi-Parameter Regression” (MPR) framework which handles
this situation. Here we allow multiple distributional parameters to depend on covariates which
generalises the parametric PH model where covariates enter only through a scale parameter. We
will illustrate the various concepts by comparing four Weibull models (PH, PH-frailty, MPR and
MPR-frailty) in terms of their hazard ratio and by application to a Northern Irish lung cancer
dataset.
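A rough sketch of a Weibull MPR fit by direct likelihood maximisation on simulated data with a single binary covariate (a simplified stand-in for the models compared above): both the scale and the shape are log-linear in the covariates, and the PH model is recovered when the non-intercept shape coefficients are zero:

set.seed(1)
n <- 500
x <- rbinom(n, 1, 0.5)
X <- cbind(1, x)                                     # intercept + one covariate
t_true <- rweibull(n, shape = exp(0.2 * x), scale = 2)
cens   <- runif(n, 0, 6)
time   <- pmin(t_true, cens)
status <- as.numeric(t_true <= cens)

# Negative log-likelihood of the Weibull MPR model with hazard
# h(t) = lambda * gamma * t^(gamma - 1), lambda = exp(X beta), gamma = exp(X alpha)
negll <- function(par, time, status, X) {
  p <- ncol(X)
  beta  <- par[1:p]
  alpha <- par[(p + 1):(2 * p)]
  lam <- exp(drop(X %*% beta))
  gam <- exp(drop(X %*% alpha))
  H    <- lam * time^gam                             # cumulative hazard
  logh <- log(lam) + log(gam) + (gam - 1) * log(time)
  -sum(status * logh - H)
}

fit <- optim(rep(0, 2 * ncol(X)), negll, time = time, status = status, X = X,
             method = "BFGS", hessian = TRUE)
fit$par                            # (beta0, beta_x, alpha0, alpha_x); non-zero alpha_x signals non-PH
sqrt(diag(solve(fit$hessian)))     # approximate standard errors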
Results
We show that the PH-frailty model imposes non-PH effects on all covariates (i.e., it does not allow
any covariates to have PH effects) and, moreover, for these imposed non-PH effects, the degree
of time-variation permitted is not large – only convergent hazards are handled. In contrast, we
show that the MPR model is highly flexible and can handle proportional, convergent, divergent and
crossing hazards where covariates are not all forced into sharing any one such trajectory. This flexible
regression model can produce dramatic improvements in fit compared to the basic PH model and
also its frailty extension. Combining the MPR and frailty approaches leads to a more general model
still. It is interesting to find that this MPR-frailty model outperforms the MPR model in the setting
of our real data example showing that although both MPR and frailty approaches allow for non-PH
effects, the presence of one does not abolish the need for the other.
Discussion
The Multi-Parameter Regression (MPR) modelling framework provides a direct generalisation of the
PH model to non-PH status along with a new test of proportionality. We argue that the approach
is a more natural extension to non-PH modelling than the incorporation of random effects, i.e., the
frailty model. Notwithstanding the flexibility of the MPR approach in its own right, the combined
MPR-frailty model offers further generality and provides a method for testing MPR effects against
frailty effects. Finally, the MPR-frailty combination itself suggests further novel extensions which
we will discuss briefly; these extensions are a focus of our future work.
References
Burke, K. and MacKenzie, G. (2016 - submitted). Multi-parameter regression survival models,
Biometrics.
Duchateau, L. and Janssen, P. (2007). The frailty model. Springer.
Dynamic Bayesian networks implemented for the analysis of clinical data
Caoimhe M. Carbery∗1,2, Adele H. Marshall1,
Roger Woods2 and William Scanlon2
1 Centre for Statistical Science and Operational Research (CenSSOR), Queen’s University Belfast, Northern Ireland
2 The Institute of Electronics, Communications and Information Technology (ECIT), Queen’s University Belfast, Northern Ireland
∗Email: [email protected]
Abstract: Bayesian networks are graphical models that represent variables as nodes and conditional depen-
dence relationships between variables as arcs between the nodes. Associated with each node is a conditional
probability given the parents of that node. Bayesian networks have proved useful in the past for representing
medical data. An extension of the Bayesian network is the dynamic Bayesian network that extends the
theory to allow for data observed over multiple time slices. This paper demonstrates how the
dynamic Bayesian network can be an effective tool for representing clinical time series data. In particular
the approach is applied to data for patients with chronic kidney disease.
Introduction
Time series data are abundant; in particular, a large amount of clinical data takes the form of time series. Investigating how a system changes dynamically over time is crucial in medical analysis. This paper will illustrate the use
of dynamic Bayesian networks (DBNs) to model dynamic clinical systems. DBNs have the ability
to model medical systems that involve the rigid collection of data at set time points. This paper
presents an application of the DBN to time series data where the condition and survival of a group
of kidney dialysis patients with chronic kidney disease are modelled over fixed time points.
Methodology
BNs are a special type of graphical model whose nodes represent the variables and the edges
between the nodes correspond to the relationships between the variables. Dynamic Bayesian networks
(DBNs) are an extension of Bayesian networks which incorporates a temporal dimension to the
graphical model by allowing each time point to have a separate BN structure with multiple time
slices connected together to form the overall DBN model. This extra dimension is critical for the network to model dynamic systems through time-dependent data; thus it allows the system to relate variables to each other through adjacent time points (or time slices). There are a number of
algorithms appropriate for creating DBNs with one example being the K2 algorithm. This paper will
provide information on how these algorithms are implemented to create an appropriate DBN.
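A rough sketch of the two-slice idea on simulated, hypothetical variables; bnlearn's hill-climbing search is used here purely for illustration in place of the K2 algorithm mentioned above, with a blacklist forbidding arcs that point backwards in time:

library(bnlearn)

set.seed(1)
n <- 500
# Hypothetical measurements at two adjacent quarterly time slices (t0 and t1)
slices <- data.frame(hb_t0 = rnorm(n), ferritin_t0 = rnorm(n), dose_t0 = rnorm(n))
slices$hb_t1       <- 0.7 * slices$hb_t0 + 0.2 * slices$dose_t0 + rnorm(n, sd = 0.5)
slices$ferritin_t1 <- 0.6 * slices$ferritin_t0 + rnorm(n, sd = 0.5)
slices$dose_t1     <- rnorm(n)

# Forbid arcs from slice t1 back into slice t0 so learned arcs respect time order
t0 <- grep("_t0$", names(slices), value = TRUE)
t1 <- grep("_t1$", names(slices), value = TRUE)
bl <- expand.grid(from = t1, to = t0, stringsAsFactors = FALSE)

dbn <- hc(slices, blacklist = bl)   # structure learning over the two slices
dbn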
Application
The theory discussed above will be applied to time series data in a clinical context, specifically to
a dataset involving the analysis of kidney disease. This application will demonstrate the relevance
of using a network of this kind for clinical research. The clinical data contains information on a
number of variables associated with the measurement of kidney functions for 11,527 patients with
associated measurements at quarterly time points from 2005 to 2008. The investigation considers
the benefits of different drug types to a patient’s haemoglobin and ferritin levels. Other application
areas will be discussed to emphasise the adaptability of DBNs to different data types. Matlab is
used to create the resulting DBNs for the clinical kidney data.
Discussion
Upon demonstrating the relevance of the DBN to clinical time series data, this presentation will
discuss the incorporation of levels using hierarchical systems. This added aspect will aim to adapt
systems to allow for deep learning to be performed. Deep learning will be briefly discussed with
emphasis placed on its statistical relevance.
References
Dean, Thomas and Keiji Kanazawa. (1989). A model for reasoning about persistence and causation.
In: Computational intelligence 5,no. 2 pp. 142 – 150.
Murphy, Kevin P. (2002). Dynamic bayesian networks: representation, inference and learning. Disserta-
tion: University of California, Berkeley.
Cluster Analysis of Hepatitis C Viral Load Profiles in Pre-treatment
Patients with Censored Infection Times
Niamh Ducey∗1, Kathleen O’Sullivan1, Jian Huang1, John Levis2, Elizabeth Kenny-Walsh3, Orla
Crosbie3 and Liam Fanning2
1Department of Statistics, University College Cork, Ireland
2Molecular Virology Diagnostic and Research Laboratory, Department of Medicine, Cork University Hospital and University College Cork, Ireland
3Department of Gastroenterology and Hepatology, Cork University Hospital, Ireland
∗Email: [email protected]
Abstract: Viral load of the Hepatitis C virus (HCV) has been identified as an important predictor of the
outcome of Hepatitis C disease progression (Fanning et al., 2000). There is a limited amount of information
available to explain the fluctuations in viral load in the timeline of HCV infection and, more specifically,
the viral load changes over time in an untreated patient population. This study aims to cluster the viral
load profiles of a sample of pretreatment chronic Hepatitis C patients to investigate the presence of distinct
groupings in viral load progression patterns during HCV infection.
Introduction
Hepatitis C (World Health Organisation, 2015), infecting an estimated 185 million people globally,
is a disease characterised by liver inflammation that occurs due to infection with the Hepatitis C
virus (HCV). A key obstacle of this disease is its ability to remain symptomless for long periods of
time. This causes increased risk of viral transmission, delayed treatment, and difficulty determining
the initial infection point. Viral Load (VL, IU/ml) is the amount of virus present in body fluid at
any one time. Quantifying a patient’s VL at multiple time points leads to the development of a
VL profile. The objective of this study is to examine VL profiles in untreated chronic Hepatitis C
patients in order to identify distinct patterns in viral load progression.
Methods
VL profiles were obtained for 81 pre-treatment females chronically infected with HCV due to the
receipt of HCV-1b contaminated Anti-D immunoglobulin. This selection criterion enabled estimation of initial infection to a narrow time interval between 1977 and 1978.¹ Based on work by Luan and Li (2003) on clustering sparse and irregularly-spaced time course data, we applied
a mixed-effects model that utilises B-Spline basis functions to cluster the viral load profiles over
time since infection. The optimum cluster solution was identified using the Integrated Complete
Likelihood (ICL) criterion and Monte Carlo re-sampling simulations. Additionally, infection times were randomized to assess the effect of censoring infection times on the cluster result.
¹ The authors would like to thank Dr. Joan Power for her contribution on the infection timeline profiles.
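A simplified two-stage stand-in for the clustering step, on simulated profiles (not the anti-D cohort data): each patient's profile is summarised by coefficients on a common B-spline basis with hypothetical knot locations, and the coefficients are then clustered with a Gaussian mixture compared via ICL; note that Luan and Li (2003) fit the splines and the mixture jointly in a mixed-effects model rather than in two stages:

library(splines)
library(mclust)

set.seed(1)
n_pat <- 60
times <- seq(1, 30, length.out = 8)
grp <- rep(1:2, each = n_pat / 2)
vl  <- do.call(rbind, lapply(1:n_pat, function(i)
  data.frame(id = i, years = times,
             log10_vl = 4 + grp[i] * 0.05 * times + rnorm(8, sd = 0.3))))

# Common B-spline basis shared by all patients (hypothetical knot placement)
basis <- function(x) bs(x, knots = c(10, 20), Boundary.knots = c(0, 35))

# Stage one: per-patient spline coefficients summarising each viral load profile
coef_mat <- t(sapply(split(vl, vl$id),
                     function(p) coef(lm(log10_vl ~ basis(years), data = p))))

# Stage two: Gaussian mixture clustering of the coefficients, solutions compared by ICL
mclustICL(coef_mat)
fit <- Mclust(coef_mat, G = 2)
table(fit$classification, grp)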
Results and Conclusions
Two clusters were identified as the optimum descriptor of the distinct patterns observed in VL
progression. The mean VL curves of these two clusters are presented in Figure 1.
Figure 1: The mean curves of viral load (log10IU/ml) profiles over time since infection (years) ofthe individuals in cluster one (n = 32) and cluster two (n = 49).
Cluster 1 displays a relatively steady increase in VL over time, whereas Cluster 2 portrays a more rapid
increase in VL, in addition to peaking at a higher viral load level than Cluster 1. The randomization
of infection start times had no major effect on the cluster membership of the individual patients.
References
Fanning, L., Kenny-Walsh, et al. (2000). Natural fluctuations of hepatitis C viral load in a homoge-
neous patient population: a prospective study. Hepatology (Baltimore, Md.), 31(1), pp. 225 – 9.
Luan, Y., & Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model
with B-splines. Bioinformatics, 19(4), pp. 474 – 482.
Of queues and cures: A solution to modelling the inter time arrivals of
cloud outage events
Jonathan Dunne∗1 and David Malone1
1Hamilton Institute, Maynooth University, Ireland
∗Email: [email protected], [email protected]
Abstract: The management of Cloud based outages represents a challenge for Small Medium Enterprises
(SMEs), due to the variety of ways in which production outages can occur. We consider the inter-arrival
times for outages events in a framework where these arrival times are used to align Systems Operations
resources. Using an enterprise dataset, we address the question of how inter-arrival times are distributed
by testing against a number of common distribution types. The proposed framework can help SMEs to
manage their limited resource workflows. We shall also consider correlation between arrival times.
Keywords: Distribution fitting, goodness of fit, correlation, resource planning.
Introduction
For the European SME, the adoption of cloud technology is no easy task. Due to resource constraints and a myriad of failure patterns, SMEs face challenges in providing a reliable and stable service platform for their customers' needs. In this paper we describe a framework that the SME can leverage to best manage their limited set of resources when responding to incoming outage events in their cloud infrastructure.
Data Set
The study presented in this paper examines approximately 250 cloud outage events from a large
enterprise system. Our study aims to answer a key question: Which distribution is best suited to
model the interarrival time of cloud outage events? To answer this question, a number of common
distribution types were modelled; lognormal, gamma, Weibull, exponential, logistic, loglogistic and
Pareto.
Results
Each distribution was fitted to the interarrival times of the dataset using R and the fitdistrplus package, which also estimates the distribution parameters. Using a second package, ADGofTest, the fitted parameters of each distribution were validated for their goodness of fit. Table 1 summarises the results of this test.
Table 1: Summary of Anderson-Darling GoF statistics.
Distribution name    AD statistic    p-value
lognormal            3.039           0.026
gamma                6.034           9.347e-04
Weibull              0.975           0.371
exponential          3.110           0.024
logistic             12.819          2.765e-06
loglogistic          1.823           0.115
Pareto               0.661           0.592
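As a minimal sketch of the fitting-and-testing step for one candidate distribution on simulated interarrival times (the enterprise data are not reproduced here); the same pattern applies to the other candidate distributions:

library(fitdistrplus)
library(ADGofTest)

set.seed(1)
iat <- rweibull(250, shape = 0.9, scale = 3000)   # stand-in interarrival times

# Fit a candidate distribution by maximum likelihood
fw <- fitdist(iat, "weibull")
fw$estimate

# Anderson-Darling goodness-of-fit test at the fitted parameter values
ad.test(iat, pweibull, shape = fw$estimate["shape"], scale = fw$estimate["scale"])

# Density, Q-Q, CDF and P-P comparison plots, as in Figure 1
plot(fw)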
[Figure: histogram with theoretical densities, Q-Q plot, empirical and theoretical CDFs, and P-P plot for the Weibull, loglogistic and Pareto fits.]
Figure 1: Four goodness-of-fit plots for Weibull, loglogistic and Pareto distributions fitted to the interarrival times from the cloud outage data set.
Discussion
Table 1 shows that Pareto has the best p-value for the Anderson-Darling test, followed by loglogistic
and Weibull. All other distributions were rejected as part of hypothesis testing. Figure 1 shows
goodness of fit plots for the three best-fitting distributions. The quantile-quantile plot shows that the
Pareto distribution more closely models the data set even for extreme values.
Conclusion
It was found that the Pareto distribution is a useful distribution for modelling the interarrival times
of cloud outages. This result can be used by SMEs as an arrival time parameter for a queuing model
for cloud outage events.
References
Delignette-Muller, M.L. and Dutang, C. and Pouillot, R. and Denis, J.B. (2015). Web Page:
https://cran.r-project.org/web/packages/fitdistrplus/index.html
Bellosta, C.J.G (2011). https://cran.r-project.org/web/packages/ADGofTest/index.html
Study of joint progressive type-II censoring in heterogeneous populations
Lida Fallah∗1, John Hinde1
1School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
∗Email: [email protected]
Abstract: Time to event, or survival, data is common in the biological and medical sciences. Here, we
consider the analysis of time to event data from two populations undergoing life-testing, mainly under a
joint Type-II censoring scheme for heterogeneous situations. We consider a mixture model formulation and
maximum likelihood estimation using the EM algorithm and conduct a simulation to study the effect of
the form of censoring scheme on parameter estimation and study duration.
Key words: EM algorithm, Maximum likelihood estimation, Type-II censoring
Introduction
Time to event, or survival, data is common in the biological and medical sciences with typical exam-
ples being time to death and time to recurrence of a tumour. In practice, survival data is typically
subject to censoring with incomplete observation of some failure times due to drop-out, intermit-
tent follow-up and finite study duration. Many different probability models have been proposed for
survival times. Extending these to mixture models allows the modelling of heterogeneous popula-
tions, e.g. susceptible/non-susceptible individuals (Kuo and Peng, 2000). This allows the clustering
of individuals to different groups together with parameter estimation. This becomes more compli-
cated in the presence of censoring and requires care in model fitting and interpretation. Maximum
likelihood estimation can be done by direct optimization or with the EM algorithm using a nested
version to handle the two aspects of missing data, the mixture component labels and the censored
observations.
Model
Let $X = (X_1, \ldots, X_m)$ be i.i.d. random variables following an $f^{(1)}$ distribution for the lifetimes of $m$ units and $Y = (Y_1, \ldots, Y_n)$ be i.i.d. random variables following an $f^{(2)}$ distribution for the group of $n$ units. Now let $W_1 \le \ldots \le W_N$, $N = m + n$, denote the order statistics of the random variables $X_1, \ldots, X_m; Y_1, \ldots, Y_n$. Under joint Type-II censoring, $N = m + n$ units are placed on a life-test. At the time of the $r$th failure, $R = S + T$ remaining units are withdrawn, where $S$ and $T$ are the numbers of withdrawals from the $X$ and $Y$ samples respectively, so that $N = R + r$, and the test is terminated.

The likelihood function under joint Type-II censoring given the observed data is given by
$$L(\Theta \mid z, w, s) = C \prod_{i=1}^{r} \left[ f^{(1)}(w_i)^{z_i}\, f^{(2)}(w_i)^{1 - z_i} \right] \bar{F}^{(1)}(w_r)^{m - s}\, \bar{F}^{(2)}(w_r)^{n - t},$$
where $\bar{F}^{(1)} = 1 - F^{(1)}$, $\bar{F}^{(2)} = 1 - F^{(2)}$, $C$ is a normalising constant, and $Z_i$ is a 0/1 indicator for $W_i$ coming from the $X$ population or not.
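A small sketch of generating data under a joint Type-II censoring scheme, with hypothetical normal component distributions as in the better-behaved setting discussed below: the pooled sample is ordered, the first r failures are observed with their population labels, and the remaining R = N - r units are withdrawn at the rth failure time:

set.seed(1)
sim_joint_typeII <- function(m, n, r, mu = c(5, 7), sd = c(1, 1)) {
  x <- rnorm(m, mu[1], sd[1])                 # lifetimes from population X
  y <- rnorm(n, mu[2], sd[2])                 # lifetimes from population Y
  w <- c(x, y)
  z <- rep(c(1, 0), c(m, n))                  # 1 = from X, 0 = from Y
  ord <- order(w)
  zr  <- z[ord][1:r]
  list(w   = w[ord][1:r],                     # observed ordered failure times
       z   = zr,                              # population indicators of the failures
       s   = m - sum(zr),                     # X units withdrawn (censored) at w_r
       t   = n - sum(1 - zr),                 # Y units withdrawn at w_r
       w_r = w[ord][r])                       # termination time
}

dat <- sim_joint_typeII(m = 30, n = 30, r = 40)
str(dat)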
Discussion
We focus on Type-II censoring, where follow-up is terminated after a pre-specified number of failures,
with all other individuals censored at the largest failure time. The performance of the estimation
procedure depends, not surprisingly, on the characteristics of the censoring scheme and the form of
the component densities. Some experimentation with a mixture of exponentially distributed components highlighted potential problems; however, with normal components things are better behaved.
In progressive Type-II censoring some individuals are removed randomly at each failure time. This has
the effect of spreading out the censoring over the observation period with a consequent extension of
the follow-up time. This censoring scheme seems to improve the efficiency of estimation for mixed
populations. The above can be extended to two heterogeneous populations (e.g. male/female)
applying Type-II or progressive Type-II censoring over the two populations, referred to as joint (pro-
gressive) Type-II censoring schemes (Rasouli and Balakrishnan, 2010). We focus on the above
settings and conduct a simulation study to evaluate the impact of the form of the censoring scheme
on parameter estimation and study duration. We obtain standard errors for parameter estimates
and construct confidence intervals. Results will be discussed to show the benefits of the progressive
regime. Finally, we illustrate with a real-data example.
References
Kuo, L. and Peng, F. (2000). Generalized linear models: A Bayesian Perspective. New York: Marcel
Dekker, pp. 255 – 270.
Rasouli, A. and Balakrishnan, N. (2010). Exact likelihood inference for two exponential populations
under joint progressive Type-II censoring. Communications in Statistics- Theory and Methods, 39,
pp. 2172 – 2191.
Extending Average Attributable Fractions
John Ferguson∗1, Alberto Alvarez-Iglesias1, John Newell1,2, John Hinde2 and Martin O’Donnell1
1HRB Clinical Research Facility, NUI Galway, Galway, Ireland
2School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Galway, Ireland
∗Email: [email protected]
Abstract: Chronic diseases tend to depend on a large number of risk factors, both environmental and
genetic. Average attributable fractions (Eide and Gefeller, 1995) were introduced as a way of partitioning
overall disease burden into contributions from individual risk factors; this may be useful in deciding which
risk factors to target in disease interventions. However, in practice they are seldom used due to technical
and methodological limitations. To bridge this gap, we introduce new estimation methods for average
attributable fractions that are appropriate for both case control designs and prospective studies.
Introduction
In epidemiology, the attributable fraction represents the proportional reduction in population disease
prevalence that might be observed if a particular risk factor could be eliminated from the population.
In some regards, it is a more relevant measure of disease association than odds ratios or relative risks
as it hints at the potential impact of an intervention targeting the risk factor. Average attributable
fractions are a related concept, more tailored to the situation where several risk factors are known to
be associated with the disease, in which case they define a partition of cumulative disease burden into
contributions from each risk factor. At first sight, average attributable fractions seem an extremely
useful tool to quantify the portion of risk contributed by each risk factor; a pertinent calculation in
describing chronic disease epidemiology. However, they are seldom used by practitioners. Perhaps
the main reason for this is the technical issues that researchers face in their application. In brief, some of these hurdles relate to: (a) computational difficulties when the number of risk factors is large; (b) the lack of a proposed method for producing confidence intervals; and (c) a lack of flexible software to assist in their calculation. In this presentation, we describe these issues in more depth, and propose
some solutions. The new methods are demonstrated using real and simulated data. Estimation
accuracy, coverage-probability and computational efficiency compared to alternative approaches are
examined. An R-package, averisk, that implements the methods discussed in this presentation
can be downloaded from the CRAN server.
References
Eide, Geir Egil, and Olaf Gefeller (1995). Sequential and Average Attributable Fractions as Aids in
the Selection of Preventive Strategies. Journal of clinical epidemiology, 48(5), pp. 645 – 655.
Variable Selection with multiply imputed data when fitting a Cox
proportional hazards model: a simulation study
Olga Kalinina∗1, Dr. Emma Holian1, Dr. John Newell1,2, Dr. Nicolla Miller3 and Prof. Michael Kerin3
1School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland
2HRB Clinical Research Facility, NUI Galway, Ireland
3Discipline of Surgery, School of Medicine, National University of Ireland, Galway, Ireland
∗Email: [email protected]
Abstract: The purpose of this study is to explore methods for variable selection in survival analysis in the
presence of missing data.
Introduction
Prognostic models play an important role in medical decision making process. Missing predictors and
censored responses are common problems within prognostic modelling studies. Simple methods, such
as complete cases analysis, are commonly used as the default procedure in many statistical software
packages. Several studies have shown that such an approach loses efficiency and may lead to biased
estimates if there is a relationship between missing values and the response. Multiple imputation is an
attractive approach, which replaces each missing value in predictor by M credible values estimated
from the observed data. Then M imputed data sets are analysed separately and the parameters
estimates and their standard errors combined using ’Rubin’s Rule’. However, it is still unclear
how to conduct variable selection over multiply imputed data sets under the framework of penalized
regressions. Several methods have been proposed and used in the literature. Wood(2008) performed
classical backward stepwise selection method where i) at each step, the inclusion and exclusion
of the variable is based on combined overall estimates with standard errors using Rubin’s Rule, and
ii) a stacking method is used where the multiply imputed data sets into one using a weighting scheme
to account for the fraction of missing data in each explanatory variable. Chen(2013) and Wan(2015)
proposed methods combining multiple imputation and penalized regressions. Chen(2013) treated
estimates from the same variable across all imputed data sets as a group, and applied the group
lasso penalty to yield a consistent variable selection, while Wan(2015) proposed weighted elastic
net method to the stacking method after multiple imputation with a weight accounting for the
proportion of the observed information for each subject.
Discussion
Penalized regression techniques like lasso, elastic net and group lasso achieve parsimony as they
shrink some regression coefficients to zero. However, they may lead to inconsistent variable selection if applied directly to the multiply imputed data sets. Both proposed methods combining
multiple imputation and penalized regressions, presented in literature, are discussed for the linear
regression model only. My aim is to extend the above ideas to the Cox proportional hazards model
and examine their performance with the alternatives through a comprehensive study.
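As a small sketch of the naive per-imputation approach that the discussion above cautions can yield inconsistent selections, mice and glmnet are combined on simulated data with hypothetical variables, fitting a lasso-penalised Cox model separately to each imputed data set (the grouped and stacked approaches of Chen and Wan are not shown):

library(mice)
library(glmnet)

set.seed(1)
n <- 300; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
time   <- rexp(n, rate = exp(0.5 * X[, 1] - 0.5 * X[, 2]))
status <- rbinom(n, 1, 0.8)
y <- cbind(time = time, status = status)
X[sample(length(X), 0.1 * length(X))] <- NA        # introduce missing predictor values

imp <- mice(data.frame(X), m = 5, printFlag = FALSE)

# Lasso-penalised Cox fit on each imputed data set; the selected variables
# can disagree across imputations, which motivates the grouped/stacked methods
selected <- sapply(1:5, function(i) {
  xi  <- as.matrix(complete(imp, i))
  cvf <- cv.glmnet(xi, y, family = "cox")
  as.numeric(as.matrix(coef(cvf, s = "lambda.min")) != 0)
})
rownames(selected) <- colnames(X)
selected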
References
Chen Q, Wang S (2013). Variable selection for multiply-imputed data with application to dioxin exposure
study. J Stat Comput Simulat 2015, 85(9), pp. 1902 – 1916.
Wan Y, Datta S, Conklin DJ, Kong M (2015). Variable selection models based on multiple imputation
with an application for predicting median effective dose and maximum effect. Stat Med 2008, 27,
pp. 3227 – 3246.
Wood AM, White IR, Royston P (2008). How should variable selection be performed with multiply
imputed data? Stat Med 2008, 27, pp. 3227 – 3246.
Extending the msm package to derive transition probabilities for a
Decision Analytic Markov Model
Felicity Lamrock∗1,2, Karen Cairns2, Frank Kee1, Annette Conrads-Frank3, and Uwe Siebert3
1Centre for Public Health, 2Centre for Statistical Science and Operational Research (Queen’s University Belfast, United Kingdom)
3Institute of Public Health, Medical Decision Making and Health Technology Assessment, UMIT, Hall i.T., Austria
∗Email: [email protected]
Abstract: Several novel biomarkers have been shown to have promising ability to better determine who is at
risk for a cardiovascular event beyond conventional risk factors. The aim of this paper is to extend the R
package msm, to estimate transition probabilities between health states within a decision-analytic Markov
model, to assess if the measurement of novel biomarkers (in addition to current prevention strategies using
conventional risk factors) can lead to cost-effective strategies for prevention.
Introduction
Cardiovascular disease (CVD) is the single most common cause of death in the world, and several
novel biomarkers are being considered for their ability to enhance cardiovascular risk estimation.
Decision-analytic Markov models (DAMMs) can be used to assess whether different prevention
strategies are not only effective but cost-effective. Markov models can be used to describe how
individuals move between different health states over time. The aim of this paper is to outline the
techniques to populate a DAMM with transition probabilities between health states to assess the
cost-effectiveness of adding novel biomarkers to existing prevention strategies.
Methods
Using a Finnish population cohort (FINRISK97), and the follow-up for cardiovascular events, a
multi-state Markov model is built to describe movements between five different health states. The
R package msm can fit multi-state models to longitudinal data, giving output for all permitted
state-to-state transitions. All of the transition rates between each of the health states are examined
in the one process, where the rates between health states r and s can be influenced by time-dependent
or constant covariates, z, and are estimated in a proportional hazards fashion:

q_rs(t_j, z_ij) = q_rs^(0) exp(β_rs^T z_ij),

where z_ij are the explanatory covariates affecting the intensity for individual i at time t_j. The
usefulness of the package could be further enhanced through its extension to formulate transition
probabilities. The transition probability matrix P(t_u, t_v, z_iu), whose (r, s) element p_rs gives the
probability of an individual moving from health state r to health state s between times t_u and t_v,
is obtained by evaluating the matrix exponential of the transition intensity matrix Q:

P(t_u, t_v, z_iu) = exp[(t_v − t_u) Q(t_u, z_iu)].

Since transitions between different health states may be influenced by an individual's characteristics
(including novel biomarker information) and any prevention treatment received, different prevention
strategies will involve different transition probabilities between health states.
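A minimal sketch of this workflow with the msm package, under assumed names (a data frame cvd with columns state, years and id, an illustrative biomarker covariate ntbnp, and a five-state progressive structure); the allowed transitions and the covariate are placeholders rather than the model actually fitted to FINRISK97.

```r
library(msm)

# Skeleton of allowed instantaneous transitions between the five health states
# (non-zero entries serve as initial guesses for the baseline intensities q_rs^(0))
Q <- rbind(c(0, 0.1, 0.1, 0.1, 0.1),
           c(0, 0,   0.1, 0.1, 0.1),
           c(0, 0,   0,   0.1, 0.1),
           c(0, 0,   0,   0,   0.1),
           c(0, 0,   0,   0,   0  ))

fit <- msm(state ~ years, subject = id, data = cvd,
           qmatrix = Q, covariates = ~ ntbnp)

# One-year transition probability matrix P(t, t + 1, z) = exp(Q(z)),
# evaluated at a chosen covariate value via the matrix exponential
P1 <- pmatrix.msm(fit, t = 1, covariates = list(ntbnp = 0.5))
P1
```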
Results
All analyses were performed using R 3.1.1. To build the msm model, initial guesses for the q_rs^(0) were
deployed to avoid convergence on local maxima, with fifty runs performed in each case. It is also
useful to use starting values for the β_rs from any nested models to aid convergence. The probability
matrix P(t_u, t_v, z_iu) was obtained for each individual within the FINRISK97 cohort. For each of
the strategies within the DAMM, the relevant subset of individuals is chosen and the probability matrices
averaged over the cohort. One-year transition probabilities were thereby obtained for each year from
age 50 onwards, for each transition between health states, and implemented into a cost-effectiveness
model.
Discussion
Novel biomarkers have the potential to be an effective and cost-effective strategy for targeting
subsequent CVD prevention. The approach that extends the msm package to populate a DAMM
has been outlined, and it is a useful tool for parameter estimation. The information from this
research will be combined in the DAMM with cost data from different prevention treatments, to
assess cost-effectiveness.
References
Blankenberg, S et al. (2010). Contribution of 30 Biomarkers to 10-year cardiovascular risk estimation in
2 population cohorts: The MONICA, risk, genetics, archiving and monograph (MORGAM) biomarker
project. Circulation, 121, (22), pp. 2388-2397.
Jackson, C. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical
Software, 38, (8), pp. 1 – 29.
Adaptive Decision-Making via Non-Parametric Predictive Intervals
Angela McCourt∗1 and Dr Brett Houlding1
(1Dept. of Statistics, Trinity College Dublin, Ireland)
∗Email: [email protected]
Abstract: Normative decision theory is concerned with the study of strategies for decision-making under
conditions of uncertainty in such a way as to maximize the expected utility. The approach taken in this
research does not assume that an individual has, or can specify and work with, a belief network and/or
preference function returning a unique value regardless of the vagueness or unfamiliarity of the event to
them. Instead it allows us to explicitly model how an individual can and/or does derive their belief and
utility functions (or alternative concepts to replace these within the setting considered), based on how
a person or party may actually learn their belief network. To achieve this, non-parametric
predictive intervals (NPI) are employed. This is a modelling technique in which vagueness is incorporated
via the use of imprecise probabilities, where precise values are replaced by a lower and an upper bound of
probability. We wish to develop a way to update utilities in light of new information and to explore and
explain what heuristics may be involved in this process.
Introduction
Normative Decision Theory explores what decisions ought to be made and assumes that
the individual making a decision is able to place the correct numerical utility value on any reward,
including rewards that have never been experienced before, i.e. experiences that are novel to
that individual. If we do not assume that an individual has, or can specify and work with, a
belief network and/or preference function returning a unique value regardless of the vagueness or
unfamiliarity of the event to them, then we can explicitly model how an individual can
and/or does derive their belief and utility functions (or alternative concepts to replace these within
the setting considered). To achieve this, non-parametric predictive intervals (NPIs) are employed; a
modelling technique in which vagueness is incorporated via the use of imprecise probabilities, where
precise values are replaced by a lower and an upper bound of probability (Coolen, 2006).
Simulated Data
One thousand data points were randomly simulated from three distributions chosen so that both
correlated and uncorrelated data would be generated. The absolute correlation coefficient, |ρ|, was
calculated in blocks of 50. The selection of fifty observations per block may appear arbitrary, but
much research has been conducted on recommender systems which has highlighted that algorithms
perform reliably when there are approximately fifty ratings available (see Jannach, Zanker, Felfernig,
& Friedrich (2010) for an introduction to recommender systems). The non-parametric prediction
intervals were calculated as follows:
Lower bound:   E[Interval_new] = (1 / (n + 1)) Σ_{i=1}^{n} |ρ_i|
Upper bound:   E[Interval_new] = (1 / (n + 1)) (1 + Σ_{i=1}^{n} |ρ_i|)
For (X, Y) uncorrelated data was simulated, whereas for (Y, Z) correlated data was simulated. However,
it is assumed that we know nothing about these data, and so, via the NPIs, it is possible to "build"
our knowledge of them. As subsequent intervals are calculated, the intervals narrow. From Fig. 1 we
see that, for the uncorrelated data, the lower bounds change very little in comparison to the upper
bounds and the interval narrows towards zero, which indicates no correlation between these
distributions. For the correlated data we have the opposite effect: the lower bounds increase so that
the interval narrows towards 1.
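A minimal R sketch of these calculations under assumed simulation settings (one correlated pair, blocks of 50 observations); it simply implements the lower and upper bound formulas above and shows the interval narrowing as blocks accumulate.

```r
set.seed(1)
n_blocks <- 20
x <- rnorm(n_blocks * 50)
y <- 0.8 * x + rnorm(n_blocks * 50, sd = 0.5)      # a correlated pair, e.g. (Y, Z)

# |rho| for each block of 50 observations
abs_rho <- sapply(seq_len(n_blocks), function(b) {
  idx <- ((b - 1) * 50 + 1):(b * 50)
  abs(cor(x[idx], y[idx]))
})

# NPI bounds after n blocks: lower = sum(|rho_i|)/(n+1), upper = (1 + sum(|rho_i|))/(n+1)
n     <- seq_along(abs_rho)
lower <- cumsum(abs_rho) / (n + 1)
upper <- (1 + cumsum(abs_rho)) / (n + 1)
cbind(n, lower, upper)    # the interval narrows (towards 1 here) as blocks accumulate
```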
Figure 1: Non-parametric Predictive Intervals and Differences between Bounds
References
Coolen, F.P.A. (2006). On Nonparametric Predictive Inference and Objective Bayesianism. Journal of
Logic, Language and Information, 141, pp.382 – 391.
Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010). Recommender systems: An intro-
duction. Cambridge University Press.
Identifying universal classifiers for multiple correlated outcomes in clinical
development
Meabh G. McCurdy∗1, Adele H. Marshall1 and J. Renwick Beattie2
1Centre for Statistical Science and Operational Research, School of Mathematics and Physics,
Queen's University Belfast, Belfast, UK
2Exploristics Ltd, Floor 4 Linenhall Street, Belfast, BT2 8BG
∗Email: [email protected]
Abstract: The pharmaceutical industry has become a key driver of scientific and medical progress in
recent years, thanks to innovative ideas such as stratified medicine, which is achieved by employing
different types of subgroup analysis in order to determine which patients respond better to a particular
treatment. The method used in this analysis is known as the Patient Rule Induction Method (PRIM). This
method was applied to a dataset taken from a clinical trial to determine subgroups of patients and to aid
with the prediction of a patient’s survival of the trial.
Introduction
Currently 95% of the experimental drugs that are studied in humans fail to be both safe and effective
(Healthcare & Pharma 2013). This is a growing concern within the pharmaceutical industry and, as
a result, they are looking for ways to advance and improve the success of trial drugs whilst at the
same time managing the costs. One particular aspect of drug development that pharmaceutical
companies are required to consider is the ability to identify subgroups of patients that are likely
to derive additional benefits from different treatments. This can be achieved by using subgroup
analysis to determine different groups with similar characteristics. This paper uses subgroup analysis
methodology on a clinical trial in which each patient had four biomarkers recorded at three different
time points throughout the trial. Information on whether the patients survived the total duration of
the study was also recorded and key to the analysis.
Methods
A special type of subgroup analysis is the Patient Rule Induction Method (PRIM). PRIM, also
known as the "bump-hunting" algorithm, was published by Friedman and Fisher (1999). It is a non-
parametric method whose primary aim is to identify subgroups, or bumps, in the data that maximise
the mean of a target variable. In this analysis the target variable is the classification variable
ALL Censflag, where a value of 1 refers to a patient who survived the study and a value of 0 refers
to a patient who died before the study ended. The method was
implemented in SAS using a macro created by Sniadecki (2011).
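The SAS macro itself is not reproduced here; the following R sketch only illustrates the peeling idea behind PRIM for a single predictor, with hypothetical inputs (x a biomarker, y the 0/1 ALL Censflag target) and a simple peeling fraction alpha.

```r
prim_peel <- function(x, y, alpha = 0.05, min_support = 0.1) {
  keep <- rep(TRUE, length(y))
  repeat {
    if (mean(keep) <= min_support) break
    idx  <- which(keep)
    cuts <- quantile(x[idx], c(alpha, 1 - alpha))
    cand <- list(keep & x >= cuts[1],   # peel a fraction alpha from the lower edge
                 keep & x <= cuts[2])   # peel a fraction alpha from the upper edge
    means <- sapply(cand, function(k) mean(y[k]))
    if (max(means) <= mean(y[keep])) break       # stop when peeling no longer helps
    keep <- cand[[which.max(means)]]             # keep the peel with the higher target mean
  }
  list(box = range(x[keep]), target_mean = mean(y[keep]), support = mean(keep))
}

# Example call (hypothetical data):
# prim_peel(x = biomarker1, y = all_censflag)
```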
Results
PRIM was applied to the dataset of 105 patients and as a result was able to identify two different
subgroups within the data. The make-up of these two subgroups can be seen in Table 1.
Table 1: Properties of the subgroups obtained from PRIM
                           Group 1       Group 2
ALL Censflag = 0           2 (4%)        27 (56%)
ALL Censflag = 1           55 (96%)      21 (44%)
Mean of target variable    µ = 0.965     µ = 0.438
Therefore, if a patient belonged to Group 1 they were predicted to survive the study and likewise
if they belonged to Group 2 they were predicted to die before the study ended. The sensitivity and
specificity are 93% and 72% respectively.
Conclusion
Analysing the typical biomarker values common to patients belonging to the same subgroup revealed
that the outcome of a patient can be predicted. The methodology has the potential to be
applied to other clinical trials to determine subgroups of patients that will benefit more than others
from an experimental drug. This area of analysis will play a key role in the future development of
the pharmaceutical industry, and in particular in the creation of stratified medicine.
References
Friedman, J. H. and Fisher, N. I. (1999). Bump hunting in high-dimensional data. Statistics and Com-
puting, 9(2), pp. 123 – 143.
Healthcare & Pharma. (2013). How the staggering cost of inventing new drugs is shaping the future
of medicine. [Online] http://www.forbes.com/sites/matthewherper/2013/08/11/. Accessed 10
March 2015.
Sniadecki, J. (2011). Bump Hunting with SAS: A Macro Approach to Employing PRIM. SAS Global
Forum, Paper 156.
Mixtures of Infinite Factor Analysers
Keefe Murphy∗1,2, Dr. Claire Gormley1,2
1School of Mathematics and Statistics, UCD, Ireland
2Insight Centre for Data Analytics, UCD, Ireland
∗Email: [email protected]
Abstract: Typically, when clustering via mixtures of factor analysers (MFA), one must specify ranges of
values for the numbers of groups and factors in advance. The pair of values which optimises some model
selection criterion, such as BIC, is chosen. Not only is this computationally intensive, it is generally only
reasonable to fit models where the number of factors is the same across groups. The development to date
of a flexible, adaptive Gibbs sampler algorithm for clustering high-dimensional data via a mixture of infinite
factor analysers (MIFA) is presented. MIFA allows different clusters to have different numbers of latent
factors and estimates these quantities automatically during model fitting. An application to metabolomic
data illustrates the methodology.
Introduction
Typically, orthogonal factor analysis models the p-vector x_i as a linear function of a q-vector of
unobserved latent factors f_i, where q ≪ p:

x_i − µ = Λ f_i + ε_i,

where x_i − µ is p × 1, the loadings matrix Λ is p × q, f_i is q × 1 and ε_i is p × 1. In a Bayesian setting,
the means µ are assumed to be MVN_p distributed, with mean 0 and diagonal covariance matrix. The
scores f_i are assumed to be MVN_q distributed with mean 0 and identity covariance matrix. Finally,
ε_i ∼ MVN_p(0, Ψ), with diagonal Ψ and the following prior on its non-zero elements:
ψ_j^{-1} ∼ Ga(α/2, β/2), j = 1, . . . , p.
Methods
A latent indicator z_ig, which is equal to 1 if i ∈ cluster g of G, and 0 otherwise, is introduced, such
that z_i ∼ Mult(1, π). A Dir(α) prior is assumed for π, the mixing proportions. Marginally, this provides
the following parsimonious covariance structure:

x_i | z_ig = 1 ∼ MVN_p(µ_g, Λ_g Λ_g^T + Ψ_g),  and therefore
P(x_i) = Σ_{g=1}^{G} π_g MVN_p(µ_g, Λ_g Λ_g^T + Ψ_g).
A multiplicative gamma process shrinkage prior is used for the infinite loadings matrices. This allows
each cluster to have infinitely many factors, with loadings increasingly shrunk towards zero as the
column index increases,
Loadings:          λ_jk ∼ N(0, φ_jk^{-1} τ_k^{-1})
Local shrinkage:   φ_jk ∼ Ga(ν/2, ν/2)
Global shrinkage:  τ_k = Π_{h=1}^{k} δ_h,  with δ_1 ∼ Ga(α_1, 1) and δ_h ∼ Ga(α_2, 1) ∀ h ≥ 2.
A conservatively high estimate of q_g^* for all g = 1, . . . , G is chosen initially. The adaptive Gibbs sampler
tunes the number of factors as the MCMC chain progresses. Adaptation decreases in frequency
exponentially fast. The columns in the loadings matrix having some proportion of elements in some
neighbourhood of 0 are monitored. If the number of such columns drops to zero, an additional
loadings column is added by simulating from the prior distribution. Otherwise redundant columns
are discarded and parameters corresponding to the non-redundant columns are retained. The scores
matrix is also modified accordingly, and identifiability issues are addressed via Procrustean methods.
The number of effective factors at each iteration is stored, and the posterior mode after burn-in
and thinning have been applied is used as an estimate for qg with credible intervals quantifying
uncertainty.
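A minimal R sketch of this adaptation step for one cluster's loadings matrix, under assumed names (Lambda for the current loadings, eps for the zero-neighbourhood, prop for the proportion of near-zero elements that marks a column as redundant); in the full sampler the existing local and global shrinkage parameters would be reused rather than redrawn as here.

```r
adapt_loadings <- function(Lambda, eps = 0.01, prop = 0.99,
                           alpha1 = 2, alpha2 = 3, nu = 3) {
  # proportion of elements of each loadings column lying in a neighbourhood of zero
  redundant <- colMeans(abs(Lambda) < eps) >= prop
  if (!any(redundant)) {
    # no redundant columns: append one, simulated from the shrinkage prior
    k      <- ncol(Lambda) + 1
    delta  <- c(rgamma(1, alpha1, 1), rgamma(k - 1, alpha2, 1))
    tau_k  <- prod(delta)                               # global shrinkage for column k
    phi    <- rgamma(nrow(Lambda), nu / 2, nu / 2)      # local shrinkage
    Lambda <- cbind(Lambda, rnorm(nrow(Lambda), 0, sqrt(1 / (phi * tau_k))))
  } else {
    # otherwise discard the redundant columns and retain the rest
    Lambda <- Lambda[, !redundant, drop = FALSE]
  }
  Lambda
}
```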
Results
The methodology is illustrated with application to metabolomic data, consisting of 189 spectral
peaks from urine samples of 18 subjects, half of whom have epilepsy and half of whom are controls.
A 2-cluster MIFA model correctly uncovers the epileptic and control groups, and gives different
credible intervals for q1 and q2.
Discussion
The need to choose the optimal number of latent factors in a mixture of factor analysers has been
obviated using MIFA. Though this greatly reduces the search space, the issue of model choice is still
not entirely resolved. The next logical extension to this body of work is to estimate the number of
mixture components in a similarly choice-free manner, by exploring the literature either on overfitting
mixture models or on Dirichlet Processes.
Activity profiles using self-reported measures in population studies in
young children: are there gender differences?
Aoife O'Neill∗1, Dr. Kieran Dowd2, Ailish Hannigan3, Prof. Clodagh O'Gorman3, Prof. Cathal
Walsh1 and Dr. Helen Purtill1
1Department of Mathematics and Statistics, University of Limerick, Ireland
2Department of Physical Education and Sports Science, University of Limerick, Ireland
3Graduate Entry Medical School, University of Limerick, Ireland
∗Email: [email protected]
Abstract: The aims of this study were to cluster preadolescents into distinct activity profiles for boys and
girls separately, based on self- and parental-reported physical activity (PA) and sedentary behaviour (SB)
variables from the Growing Up in Ireland (GUI) study, and determine if the identified profiles were predictive
of weight change from age 9 to age 13. The findings highlighted 1) distinct activity profiles based on self-
and parental-reported PA and SB variables were identifiable for boys but not for girls, 2) activity profiles
for boys were associated with weight status at 9 years, and 3) activity profiles for boys were predictive of
future weight change at 13 years.
Introduction
Profiling activity behaviours in young children is important in developing a better understanding
of how the associations between activity patterns and weight status track over time (Leech et al.
2014). Cluster analysis is a multivariate statistical technique which aims to group individuals into
profiles, based on similarities found in the data (Ferrar et al. 2013). The GUI study is a nationally
representative study which aims to track the development of children in the Republic of Ireland.
8570 9-year-old children took part in the first wave of GUI and the second wave of the study was
carried out when the children were aged 13 with an 87% follow up (n=7,423). The two waves were
matched to analyse weight changes over time.
Methods
A Two Step Cluster Analysis (TSCA), using the log-likelihood distance measure, was applied to the
self- and parental-reported PA and SB variables, separately by gender. Multiple iterations of the cluster
analysis were carried out, with the number of predefined clusters adjusted to maximise the silhouette
coefficient, which measures the cohesion and separation of the clusters. The cluster membership
variable was used to examine the association between activity levels and BMI categories at ages 9
and 13.
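An illustrative R sketch of choosing the number of clusters by maximising the average silhouette width; it uses k-means on a hypothetical standardised matrix activity of PA/SB variables, rather than the SPSS Two Step procedure used in the study.

```r
library(cluster)   # silhouette()

ks        <- 2:6
sil_width <- sapply(ks, function(k) {
  km <- kmeans(activity, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(activity))[, "sil_width"])
})
best_k <- ks[which.max(sil_width)]   # number of clusters with best cohesion/separation
best_k
```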
Results
5.4% of the boys were classified as obese at age 9 and 4.6% at age 13. 7.8% of the girls were
classified as obese at age 9 and 7.2% at age 13. Four cohesive activity profiles were identified for
boys. No cohesive activity profiles were identified for girls. The profiles found for boys were ordered
by level of activity, with 32.7% of 9-year-old boys assigned to the most active profile (profile 1),
13.4% to profile 2, 28.1% to profile 3 and 23.9% to profile 4. Profile 1 had the highest PA levels
and lowest levels of SB. In comparison, profile 4 had the lowest levels of PA and the highest levels of
SB. 7.4% of boys in the least active group were identified as obese compared to 3.8% in the most
active group. There was an increased risk of being overweight or obese in profile 4 at age 9 (OR =
1.5, 95% CI: 1.2, 1.9) and at age 13 (OR = 1.9, 95% CI: 1.5, 2.3) compared to profile 1, controlling
for socio-demographic variables and parental weight status. The odds of a normal weight 9-year-old
boy in the least active profile becoming overweight or obese at age 13 were over twice those of a boy
in the most active profile (unadjusted OR = 2.4, 95% CI: 1.7, 3.3).
Conclusion
This study provides important insights into profiling PA and SB in pre-adolescent children. It also
contributes to a better understanding of how activity profiles in 9-year-old boys relate to current
weight status; and are predictive of future weight status. This study highlights gender differences in
the responses to the self- and parental-reported PA and SB questions and our findings suggest that
these questions do not identify meaningful patterns of activity in pre-adolescent girls.
References
Leech, R. M., S. A. McNaughton and A. Timperio (2014). ”The clustering of diet, physical activity
and sedentary behavior in children and adolescents: a review.” Int J Behav Nutr Phys Act, 11(4).
Ferrar, K., T. Olds and C. Maher (2013). ”More than just physical activity: Time use clusters and
profiles of Australian youth.” Journal of Science and Medicine in Sport, 16(5) pp. 427 – 432.
Handling Missing Data in Clinical Trials
Amanda Reilly∗1 and John Newell1,2
1HRB Clinical Research Facility, NUI Galway, Ireland
2School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
∗Email: [email protected]
Abstract: Ideally, data collected during a clinical trial would be complete, but this is rarely the case in
reality. In fact, missing data are a serious issue for clinical trials. The focus of this poster is to briefly
describe various techniques used to handle missing data, to allow statisticians to choose the most
appropriate technique for the data being analyzed.
Introduction
Missing data is a common but serious issue for clinical trials, so it is important that statisticians
know how to handle missing data, as well as the consequences of ignoring missing data. This poster
identifies what missing data are, as well as why, when and how they should be handled. Although
not an extensive review, this poster describes various techniques that can be used to handle missing
data.
What are missing data?
Missing data are defined by Little et al. as “values that are not available and that would be meaningful
for analysis if they were observed”. Ideally, data collected during a study would be complete, but
this is rarely the case in reality. Early discontinuations from a trial are the main source of missing
data, but there can be many other reasons such as data entry errors or subjects lost to follow up.
Why should missing data be considered?
The power of a trial is its ability to reliably detect and measure the difference between the treatment
and control groups for an effective treatment. Since power increases with sample size, it is important
to avoid excluding subjects with missing values, as doing so may lead to the incorrect conclusion that
the treatment is ineffective when there is a true treatment effect. Bias in the estimation of the
treatment effect, that is, a systematic tendency to favour the treatment group over the control group,
is another important concern around missing data. It is often caused by ignoring the uncertainty
around missing values when estimating standard errors.
The European Medicines Agency (EMA) Committee for Medicinal Products for Human Use (CHMP)
has released guidelines on handling missing data that are important for regulated trials and also
good clinical practice (GCP) for others. “ICH Topic E 9 - Statistical Principles for Clinical Trials”
(1998) stresses the importance of minimising missing data in a trial but indicates that missing data
may be compensated for with pre-defined methods, while “Guideline on Missing Data in Confirmatory
Clinical Trials” (2010) notes that ignoring missing data is not acceptable for confirmatory clinical
trials (which assess if a treatment effect observed in a previous randomized trial is real or important).
How should missing data be accounted for?
For an accurate analysis, it is paramount that missing data are handled correctly, as the power and
results of the trial may be affected. Various methods will be discussed in the poster, such as deletion,
imputation and modelling methods. The advantages, disadvantages, underlying assumptions and
limitations of each will be noted to allow statisticians to choose the most appropriate technique for the
data being analyzed.
Conclusion
Missing clinical trial data is a common issue and handling it is an important aspect of analysing
the data. The only fully satisfactory approach is to prevent missing data, but there are various methods of
handling unavoidable missing data. The power of the trial, the risk of bias in the estimation of the
treatment effect and the CHMP guidelines should be considered, and it should be acknowledged
that the approach taken may affect the results of the analysis and may be a cause of bias in itself.
References
Little, R. J. et al. (2012). The Prevention and Treatment of Missing Data in Clinical Trials.
The New England Journal of Medicine, 367, pp. 1355 – 1360.
ICH (1998). Statistical Principles For Clinical Trials E9. http://www.ich.org/fileadmin/Public_Web_Site/
ICH_Products/Guidelines/Efficacy/E9/Step4/E9_Guideline.pdf.
EMA Committee for Medicinal Products for Human Use (2010). Guideline on Missing Data in Con-
firmatory Clinical Trials. www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/
2010/09/WC500096793.pdf.
Bayesian Adaptive Ranges for Clinical Biomarkers
Davood Roshan Sangachin∗1,2, Dr. John Ferguson2, Prof. Francis J. Sullivan2,3,
and Dr. John Newell1,2
1School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
2HRB Clinical Research Facility, NUI Galway, Ireland
3Prostate Cancer Institute, NUI Galway, Ireland
∗Email: [email protected]
Abstract: In this paper I will discuss the use of Bayesian techniques to generate adaptive reference ranges
for blood biomarkers collected longitudinally. Examples will be given involving biomarkers collected in a
clinical setting, in particular prostate cancer, and amongst elite athletes.
Introduction
Biomarkers are characteristics that are objectively measured and evaluated as indicators of normal
biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.
They may be measured on a bio-sample (e.g. blood), may be a recording (e.g. blood pressure), or
an imaging test (e.g. echocardiogram), and play a vital role in clinical research as indicators of risk,
disease state or disease progression.
A reference/normal range (e.g. Figure 1), generated from a cross-sectional analysis of healthy
individuals free of the disease of interest, is typically used when interpreting a set of biomarker
test results for a particular patient. An arbitrary percentile cut-point (typically the 95th or 97.5th
percentile) is chosen to define abnormality.
When biomarkers are collected longitudinally for patients, reference ranges that adapt to account
for between- and within-subject variability are needed. In this presentation a Bayesian approach will
be used to generate such adaptive reference ranges (e.g. Figure 2). Initially the patient specific
reference range is based on the range generated for the population, and typically narrows over time
as more data are collected for that individual. Such a range has the potential to detect a meaningful
change earlier.
Examples will be given involving biomarkers collected in a clinical setting and amongst elite athletes.
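A minimal sketch of one way such an adaptive range can be produced, using conjugate normal-normal updating in R; the population parameters (mu0, tau0) and within-subject standard deviation (sigma) are assumed known here, whereas the full Bayesian model would estimate them.

```r
adaptive_range <- function(y, mu0, tau0, sigma, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  t(sapply(seq_along(y), function(n) {
    # posterior for the subject's underlying level after the first n results
    prec <- 1 / tau0^2 + n / sigma^2
    mu_n <- (mu0 / tau0^2 + sum(y[1:n]) / sigma^2) / prec
    sd_n <- sqrt(1 / prec + sigma^2)        # predictive sd for the next observation
    c(lower = mu_n - z * sd_n, upper = mu_n + z * sd_n)
  }))
}

# Hypothetical biomarker series for one patient: the range starts close to the
# population-based range and narrows as the patient's own results accumulate.
adaptive_range(y = c(2.1, 2.3, 1.9, 2.2), mu0 = 2.5, tau0 = 1.2, sigma = 0.3)
```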
Figures
Figure 1: Normal Ranges for a specific biomarker
Figure 2: Bayesian Adaptive Ranges for a specific biomarker
Conclusion
This paper highlights the capabilities of Bayesian approaches for generating adaptive ranges for
clinical biomarkers in order to identify abnormal variability at the patient level.
References
Sottas P, Baume N, Saudan C, Schweizer C, Kamber M, Saugy M. (2007). Bayesian detection of
abnormal values in longitudinal biomarkers with an application to T/E ratio. Biostatistics, Volume
8, pp. 285 – 296.
Zorzoli M. (2011). Biological passport parameters. Journal of Human Sport and Exercise, Volume 6,
pp. 205-217.