
National University of Ireland Maynooth

33rd Conference on Applied Statistics in Ireland

15th to 17th May 2013

Westgrove Hotel, Clane, Co. Kildare

Clane Abbey


Welcome to CASI 2013!

The statistics group within the National University of Ireland Maynooth welcomes you to the 33rd Conference on Applied Statistics in Ireland, from Wednesday 15th to Friday 17th May 2013 in the Westgrove Hotel in Clane, Co. Kildare, Ireland. The conference is the Irish Statistical Association's forum for discussion of statistical and related issues for Irish and international statisticians, with an emphasis on both theoretical research and practical applications in all areas of statistics.

Organising committee

Caroline Brophy
Catherine Hurley
Katarina Domijan
Aine Dooley
Alberto Caimo
Isabella Gollini
Mark O'Connell
Grainne O'Rourke


Invited Speakers

Andreas Buja is currently Liem Sioe Liong/First Pacific Company Professor at The Wharton School, University of Pennsylvania. He has published widely on topics in statistical inference, statistical computing, data mining and data visualisation. He is co-developer of the widely used and highly influential GGobi tool for data visualisation. Andreas has also served as managing editor of the Journal of Computational and Graphical Statistics.

Trevor Hastie is Professor in Statistics and Biostatistics at Stanford University. His main research contributions have been in applied statistics, and he has co-authored two books in this area: "Generalized Additive Models" (with R. Tibshirani), and "Elements of Statistical Learning" (with R. Tibshirani and J. Friedman). He has also made contributions in statistical computing, co-editing (with J. Chambers) a book and large software library on modelling tools in the S language ("Statistical Models in S"), which form the foundation for much of the statistical modelling in R. His current research focuses on applied problems in biology and genomics, medicine and industry, in particular data mining, prediction and classification problems.

Friedrich Leisch is Professor of Applied Statistics at the University of Natural Resources and Life Sciences, Vienna. His research interests are statistical computing, market segmentation, biostatistics, econometrics, classification, cluster analysis, and time series analysis. This has led to software development and statistical applications in technology, economics, management science and biomedical research. He is Secretary General of The R Foundation for Statistical Computing and has contributed many R packages to CRAN and Bioconductor.

Marian Scott is Professor of Environmental Statistics in the School of Mathematics and Statistics at the University of Glasgow. She is an elected member of the International Statistical Institute (ISI) and a Fellow of the Royal Society of Edinburgh (RSE). Her research interests include model uncertainty and sensitivity analysis, modelling the dispersal of pollutants in the environment, radiocarbon dating and assessment of animal welfare.


CASI 2013 Programme

Wednesday 15th May 2013

11.00 Registration opens

12.30 Lunch (Assaggio Restaurant)

Session 1: Wednesday 14.00. Chair: Catherine Hurley
All talks and the poster session will be held in the Alexandra Suite.

14.00 Opening address
Professor Philip Nolan, President, NUI Maynooth

14.10 Valid post-selection inference
A. Buja, R. Berk, L. Brown, K. Zhang, L. Zhao

15.00 Survival Trees using Node Resampling
A. Alvarez-Iglesias, J. Newell, J. Hinde

15.20 Latent space models for multiview networks
M. Salter-Townshend and T. McCormick

15.40 Improved quantitative analysis of tissue characteristics in PET studies with limited uptake information
J. Sweeney and F. O'Sullivan

16.00 Tea & Coffee

Session 2: Wednesday 16.20. Chair: John Hinde

16.20 Rank penalized estimation of a quantum system
P. Alquier, C. Butucea, M. Hebiri, K. Meziani, T. Morimae

16.40 Redistributing staff for an efficient orthopaedic service
J. Gillespie, S. McClean, B. Scotney, F. FitzGibbon, F. Dobbs and B. J. Meenan

17.00 Mixed membership of experts stochastic blockmodel
A. White and T. B. Murphy

17.20 Robust estimation of crosscovariance and specification of transfer function model in the presence of multiple outliers in leading indicator series
D. K. Shangodoyin

17.40 Smarter city predictive analytics using generalized additive models
B. Chen and M. Sinn

Poster Session and Drinks Reception: Wednesday 18.45.

20.15 Dinner (Onyx Bar)

22.00 Live traditional Irish music in Oak bar


Thursday 16th May 2013

Session 3: Thursday 9.00. International Year of Statistics Session. Chair: Caroline Brophy

9.00 Sparse Linear Models
T. Hastie

9.50 The use of multilevel models to represent patient outcome from geriatric wards: An Italian case study
A. H. Marshall, M. Zenga, F. Crippa and G. Merchich

10.10 Covariance modelling for multivariate longitudinal data
G. MacKenzie and J. Xu

10.30 Quasi-likelihood estimation for Poisson-Tweedie regression models
J. Hinde, B. Jorgensen, C. Demetrio and C. Kokonendji

10.50 Tea & Coffee

Session 4: Thursday 11.10. Chair: Tony Fitzgerald

11.10 Application of random item IRT models to longitudinal data from electronic learning environments
D. T. Kadengye, E. Ceulemans and W. Van den Noortgate

11.30 Graduation of crude mortality rates for the Irish life tables
K. McCormack

11.50 Advanced analytics in fraud & compliance
M. Grimaldi

12.10 Inferences on inverse Weibull distribution based on progressive censoring using EM algorithm
A. Helu and H. Samawi

12.50 Lunch (Assaggio Restaurant)


Session 5: Thursday 14.00. Chair: Gabrielle Kelly

14.00 Statistical challenges in describing a complex aquatic environment
E. M. Scott

14.50 What does statistics tell us about the palaeo-climate?
J. Haslett and A. Parnell

15.10 Reconstruct climate history at multiple locations given irregular observations in time
T. K. Doan, A. C. Parnell and J. Haslett

15.30 A threshold based 'discounting' mechanism in the revised EU Bathing Water Directive
S. Ahilan, J. J. O'Sullivan, B. Masterson, K. Demeter, W. Weijer, G. O'Hare

15.50 Tea & Coffee

Session 6: Thursday 16.10. Chair: Adele Marshall

16.10 The impact of outliers in a joint model setting
L. M. McCrink, A. H. Marshall and K. J. Cairns

16.30 Longitudinal PSA reference ranges: choosing the underlying model of age related changes
A. Simpkin, C. Metcalfe, J. L. Donovan, R. M. Martin, J. Athene Lane, F. C. Hamdy, D. E. Neal and K. Tilling

16.50 Identifying psychosocial risk classes in families of children newly diagnosed with cancer: a latent class approach
Wei-Ting Hwang and A. E. Kazak

17.10 Using multivariate exploratory data analysis techniques to build multi-state Markov models: predicting life expectancy with and without cardiovascular disease
K. Cairns, P. McMillen, M. O'Doherty and F. Kee

17.30 ISA AGM

20.00 Conference Dinner (Alexandra Suite)


Friday 17th May 2013

Session 7: Friday 9.00. Chair: John Newell

9.00 Resampling methods for exploring cluster stability
F. Leisch

9.50 Clustering PET volumes of interest for the analysis of driving metabolic characteristics
E. Wolsztynski, E. Brick, F. O'Sullivan and J. F. Eary

10.10 A random walk interpretation of diffusion rank
H. Yang

10.30 Changepoint model with autocorrelation
E. Sturludottir and G. Stefansson

Session 8: Friday 11.10. Chair: Sally McClean

11.10 Modelling global and local changes of distributions to adapt a classification rule in the presence of verification latency
V. Hofer

11.30 Burn in: estimation of p of Binomial distribution with implemented countermeasures
H. Lewitschnig, D. Kurz and J. Pilz

11.50 Bayesian model averaging optimisation for the expectation-maximisation algorithm
A. O'Hagan and S. O'Carroll

12.10 Bayesian stable isotope mixing models
A. C. Parnell, D. L. Phillips, S. Bearhop, B. X. Semmens, E. J. Ward, J. W. Moore, A. L. Jackson, J. Grey, D. Kelly and R. Inger

12.30 The use of structural equation modelling (SEM) to assess the proportional odds assumption of ordinal logistic regression concurrently over multiple groups
R. McDowell, A. Ryan, B. Bunting, S. O'Neill

12.50 Lunch (Assaggio Restaurant)


Poster Session: Wednesday 18.45.

P2 Statistical issues in clinical oncology trials
G. Avalos, A. Alvarez-Iglesias, I. Parker and J. Newell

P3 Teaching biostatistical concepts to undergraduate medical students
A. A. Bahnassy

P4 Prognostic modelling for triaging patients diagnosed with prostate cancer
M. A. Barragan-Martinez, J. Newell and G. Escarela-Perez

P5 Adaptive Bayesian inference for doubly intractable distributions
A. Boland and N. Friel

P6 Development and validation of a panel of serum biomarkers to inform surgical intervention for prostate cancer
S. Boyce, L. Murphy, J. M. Fitzpatrick, T. B. Murphy and R. W. G. Watson

P7 Clustering of fishing vessel speeds using vessel monitoring system data
E. Brick, P. Harrison, G. Sutton, M. Cronin and E. Wolsztynski

P8 Validating the academic confidence subscales
E. Brick, K. O'Sullivan and J. O'Mullane

P9 Stochastic modelling of atmospheric re-entry highly-energetic breakup events
C. De Persis and S. Wilson

P10 An examination of variable selection strategies for the investigation of tooth wear in children
C. Doherty, M. Cronin and M. Harding

P11 Multivariate analysis of the biodiversity-ecosystem functioning relationship for grassland communities
A. Dooley, F. Isbel, L. Kirwan, J. Connolly, J. A. Finn and C. Brophy

P12 Classification methods for mortgage distress
T. Fitzpatrick

P13 A-Traumatic restorative treatment vs. conventional treatment in an elderly population
A. Halton, M. Cronin and C. Mendonca da Mata

P14 Doubly robust estimation for clinical trial studies
B. Hernandez, A. Parnell, S. Pennington, I. Lipkovich and M. O'Kelly

P15 A conjugate class of utility functions for sequential decision problems
B. Houlding, F.P.A. Coolen and D. Bolger

P16 The association between weather and bovine tuberculosis
R. Jin, M. Good, S. J. More, C. Sweeney, G. McGrath and G. E. Kelly

P17 Fractal characteristics of raingauge networks in Seoul, Korea
S. Jung, H.K. Joh and J.S. Park

P18 A generalized longitudinal mixture IRT model for measuring differential growth in learning environments
D. T. Kadengye, E. Ceulemans and W. Van den Noortgate


P20 Small sample confidence intervals for the skewness parameter of the maximum from a bivariate normal vector
V. Mameli, A. R. Brazzale

P21 A systems dynamic investigation of the long term management consequences of coronary heart disease patients
J. McQuillan, A. H. Marshall and K. J. Cairns

P22 Modelling competition between overlapping niche predators
R. Moral, J. Hinde and C. Demetrio

P23 Multi-dimensional partition models for monitoring freeze drying in pharmaceutical industry
G. Nyamundanda and K. Hayes

P24 Expert elicitation for decision tree parameters in health technology assessment
S. O Meachair, M. Arnold, G. Lacey and C. Walsh

P25 Perceptions of academic confidence
J. O'Mullane, K. O'Sullivan, A. Wall and Philip O'Mahoney

P26 An extension to the Goel and Okumoto model of software reliability
S. O Riordain and S. Wilson

P27 A new individual tree volume model for Irish Sitka spruce with a comparison to existing Forestry Commission and GROWFOR models
S. O'Rourke, G. Kelly and M. Mac Siurtain

P28 An analysis of lower limits of detection of hepatitis C virus
K. O'Sullivan, L.J. Fanning, J. O'Mullane, ST4050 Students and L. Daly

P29 Predicting magazine subscriber churn
K. O'Sullivan, A. Grannell, J. O'Mullane and ST4050 Students

P32 Predicting retinopathy of prematurity in newborn babies
R. Rollins, A. H. Marshall, K. Cairns

P33 A new approach to variable selection in presence of multicollinearity: a simulated study
O. Schoni

P34 Lapse prediction models
L. Sobisek and M. Stachova

P35 Latent class model of attitude of the unemployed
J. Vild

P36 Variable selection techniques for multiply imputed data
D. Wall, G. Callagy, H. Ingoldsby, M. J. Kerin and J. Newell

P37 A maintenance policy for a two-phase system in utility optimisation
S. Zhou, B. Houlding and S. P. Wilson


CASI 2013 Abstracts


Valid Post-Selection Inference

Andreas Buja, Richard Berk, Larry Brown, Kai Zhang, Linda Zhao

Statistics Department, The Wharton School, University of Pennsylvania

Abstract

It is common practice in statistical data analysis to perform data-driven variable selection and derive statistical inference from the resulting model. Such inference enjoys none of the guarantees that classical statistical theory provides for tests and confidence intervals when the model has been chosen a priori. We propose to produce valid "post-selection inference" by reducing the problem to one of simultaneous inference and hence suitably widening conventional confidence and retention intervals. Simultaneity is required for all linear functions that arise as coefficient estimates in all submodels. By purchasing "simultaneity insurance" for all possible submodels, the resulting post-selection inference is rendered universally valid under all possible model selection procedures. This inference is therefore generally conservative for particular selection procedures, but it is always less conservative than full Scheffé protection. Importantly it does NOT depend on the truth of the selected submodel, and hence it produces valid inference even in wrong models. We describe the structure of the simultaneous inference problem and give some asymptotic results.
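
As a hedged illustration of why such protection is needed (this is not the authors' PoSI procedure), the short R simulation below selects the "most significant" of ten pure-noise predictors and then checks the coverage of a classical 95% confidence interval for its coefficient; the setup and numbers are invented for illustration.

```r
# Illustrative sketch only: classical intervals lose coverage after data-driven selection.
set.seed(1)
n <- 100; p <- 10; reps <- 1000
cover <- logical(reps)
for (r in seq_len(reps)) {
  X <- matrix(rnorm(n * p), n, p)          # pure noise: every true coefficient is 0
  y <- rnorm(n)
  pvals <- apply(X, 2, function(x) summary(lm(y ~ x))$coefficients[2, 4])
  j <- which.min(pvals)                    # data-driven selection of the "best" predictor
  ci <- confint(lm(y ~ X[, j]))[2, ]       # nominal 95% CI, ignoring the selection step
  cover[r] <- ci[1] <= 0 && 0 <= ci[2]     # does it cover the true value, 0?
}
mean(cover)   # typically well below 0.95, motivating wider simultaneous intervals
```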


Survival Trees using Node Resampling

Alberto Alvarez-Iglesias1, John Newell2,3 and John Hinde3

1HRB Clinical Research Facility, NUI Galway, Ireland 2 HRB Clinical Research Facility, NUI Galway, Ireland

3School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland

Abstract

Tree based methods are a popular non-parametric approach for classification and regression problems in biostatistics. Extensions of the Breiman et al. (1984) CART recursive partitioning algorithm have been proposed for tree based modelling of time to event data, where typically the logrank statistic is used as a measure of between node separation. One of the drawbacks of the CART recursive partitioning algorithm is the issue of variable selection bias; splits are favoured for covariates with many possible splits, regardless of whether the corresponding predictor is associated with the response or not. More recently, Hothorn et al. (2006) presented a unified framework for tree based modelling using unbiased recursive partitioning based on conditional inference, where stopping criteria are based on a series of hypothesis tests. This modified version of the recursive partitioning algorithm was developed to overcome the problem of variable selection bias and to avoid overfitting, as pruning is automatic. This method, however, has some negative implications in relation to the identification of interaction effects, as will be demonstrated. This is a major drawback since the identification of interaction effects is one of the features that make tree based methods attractive.

In this presentation, a novel method for growing survival trees will be explored that is unaffected by variable selection bias, correctly identifies interactions and allows automatic pruning. This so-called node re-sampling algorithm uses bootstrapping at the node level to generate the different splits of the tree. The selection of the primary and surrogate splits is made using a relative importance plot which is based on the out-of-bag values of the logrank test statistic (those observations not included in each bootstrap replicate of the data). One of the novel features of node re-sampling is the possibility of pruning a saturated tree interactively. To facilitate this, a new graphical user interface has also been developed and some of its functionality will be demonstrated. Examples of survival data arising from observational studies in coronary care and breast cancer are used to illustrate the approach.

Breiman, L., Friedman, J. H., Stone, C. J. and Olshen, R. A. (1984). Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton, Florida.
Hothorn, T., Hornik, K. and Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.


Latent Space Models for Multiview Networks

Michael Salter-Townshend1 and Tyler McCormick2

1 Clique Cluster, UCD, Ireland
2 C.S.S.S., University of Washington, USA

Abstract

Social network analysis is the rapidly expanding field that deals with interactions between individuals or groups. The literature has tended to focus on single network views, i.e. networks comprised of a group of nodes with a single type of link between node pairs. However, nodes may interact in different ways with the same alters. For example, on Twitter one user may retweet, follow, list or message another user. There are thus 4 separate networks to consider. Current approaches include examining all network views independently or aggregating the different views to a single super network. Neither of these approaches is satisfying, as the interaction between relationship types across network views is not explored.

We are motivated by an example consisting of the census of 75 villages in the Karnataka province in India. The data was collated for use by a microfinance company and 12 different link types are recorded. We develop a novel method for joint modelling of multiview networks as follows: we begin with the popular latent space model for social networks and then extend the model to multiview networks through the addition of a matrix of interaction terms. The theory behind this extension is due to emerging work on Multivariate Bernoulli models. We first present the theory behind our new model. We then explore the relationship between the interaction terms and the correlation of the links across network views, and finally we present results for the Karnataka dataset.

Inference is challenging and we adopt the No-U-Turn sampler, a variant of Hamiltonian Monte Carlo, for Bayesian inference.
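
For readers new to latent space network models, the toy R sketch below simulates a single network view from the basic distance model that this work extends (an assumed two-dimensional latent space and a made-up intercept); it does not implement the multiview interaction terms or the No-U-Turn sampler.

```r
# Toy single-view latent space model: P(link i -> j) = logit^{-1}(alpha - ||z_i - z_j||)
set.seed(1)
n <- 20
z <- matrix(rnorm(2 * n), n, 2)             # latent positions in 2-d space
alpha <- 1                                  # baseline link propensity (illustrative)
d <- as.matrix(dist(z))                     # pairwise latent distances
p_link <- plogis(alpha - d)                 # closer actors are more likely to link
diag(p_link) <- 0                           # no self-ties
y <- matrix(rbinom(n * n, 1, p_link), n, n) # one simulated (directed) network view
```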


Improved quantitative analysis of tissue characteristics in PET studies with limited uptake information

James Sweeney & Finbarr O’Sullivan

School of Mathematical Sciences, University College Cork, Ireland

Abstract

Quantitative analysis of FDG uptake is important in oncologic PET studies in order to determine whether a tissue is malignant or benign, and in attempting to predict the aggressiveness of an individual tumour. Following injection of the FDG tracer, a series of scans are taken of a region of interest; this provides dynamic information from which we draw inference on the metabolic parameters describing the evolution of tracer activity and hence underlying tissue characteristics.

However, in certain cases only a single static scan of a region of interest may be available. A prime example is a "late uptake scan" where a body region is analysed for only a very brief period, providing an extremely sparse and potentially noisy data set from which it is difficult to draw conclusions on underlying tissue characteristics.

In this talk we explore the impact and benefit of incorporating prior tissue information, via penalty structures in our data models, to improve prediction outcomes in the presence of limited uptake information. Specifically, we show that our proposal appears to offer an extremely promising alternative to existing, competing methodologies.


Rank penalized estimation of a quantum system

Pierre Alquier1, Cristina Butucea2,3, Mohamed Hebiri2, Katia Meziani4

and Tomoyuki Morimae5

1 School of Mathematical Sciences, UCD, Ireland
2 LAMA, Universite de Marne-la-Vallee, France
3 CREST, ENSAE, France
4 CEREMADE, Universite Paris Dauphine, France
5 Department of Physics, Imperial College London, UK

Abstract

The estimation of low-rank matrices has recently received a lot of attention in statistics, for example in marketing and recommendation systems; see e.g. the Netflix challenge (http://www.netflixprize.com/).

In this work, we deal with another application where one faces the problem of low-rank matrix estimation: quantum tomography. In quantum physics, a physical system is represented by a (finite or infinite) positive semi-definite matrix $\rho$ with complex coefficients, such that $\rho^* = \rho$ and $\mathrm{tr}(\rho) = 1$. This matrix is called the density matrix of the system. Let $X$ be a physical quantity of interest, e.g. the energy level or the spin of a particle. $X$ is associated to a matrix $X = \sum_a \lambda_a u_a u_a^*$, called an observable, in the following way:
$$\forall a, \quad \mathrm{Prob}(X = \lambda_a) = \mathrm{tr}(u_a u_a^* \rho). \qquad (1)$$

We are here interested in systems of $q$ trapped ions; for each ion, three $\{-1,+1\}$-valued observables are available. This system is known as a $q$-qubit. In this case, it is known that the dimension of $\rho$ is $2^q \times 2^q$. The objective is to estimate in what state $\rho$ a given experimental device "puts" the ions. This is highly relevant for applications: being able to prepare $q$-qubits in given states is a necessary condition to realise a quantum computer. The problem is that, as $q$ grows, the size of the parameter space ($2^{2q}$) prevents reasonable estimation. On the other hand, the assumption that the matrix $\rho$ has a low rank makes sense, see e.g. the discussion in Guta, Kypraios and Dryden (2012). This allows us to reduce considerably the dimension of the parameter space.

As a first step, we build a moment estimator $\hat\rho$ of $\rho$ based on (1).


In order to take the low-rank information into account, we define the rank-penalized estimator
$$\hat\rho_\nu = \arg\min_{\rho} \left[ \|\rho - \hat\rho\|_F^2 + \nu\,\mathrm{rank}(\rho) \right], \qquad (2)$$
where $\|A\|_F = \sqrt{\mathrm{tr}(A^* A)}$ is the Frobenius norm of the matrix $A$. From a computational point of view, in contrast to the likelihood-based estimators used in previous works, it is possible to compute all the $\hat\rho_\nu$ for $\nu > 0$ in a reasonable time, thanks to a singular value decomposition of $\hat\rho$. From a theoretical perspective, we prove that, when $\nu$ is set to a (known) value $\nu_0$, with large probability

1. $\|\hat\rho_\nu - \rho\|_F^2 \le c\,\dfrac{\mathrm{rank}(\rho)\, q\, (4/3)^q}{n}$, where $n$ is the number of observations (experiments) and $c$ a constant; so the smaller the rank of $\rho$, the easier the estimation task;

2. under additional assumptions, $\mathrm{rank}(\hat\rho_\nu) = \mathrm{rank}(\rho)$.

The proposed methodology is illustrated with both simulated and real experimental data, with promising results.
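
For intuition, here is a minimal R sketch (not the authors' code) of the closed-form solution of (2): given a moment-type estimate, the rank penalty keeps exactly those singular values whose square exceeds ν. The toy matrix below is invented for illustration.

```r
# Rank-penalised estimate via singular value hard-thresholding:
# keeping a further singular value d is worthwhile iff d^2 > nu.
rank_penalised <- function(rho_hat, nu) {
  s <- svd(rho_hat)
  keep <- s$d^2 > nu
  s$u[, keep, drop = FALSE] %*% diag(s$d[keep], sum(keep)) %*% Conj(t(s$v[, keep, drop = FALSE]))
}

# Toy usage on a noisy rank-1 "density-like" matrix
rho <- tcrossprod(c(1, 0, 0, 0))                        # rank 1, trace 1
rho_hat <- rho + matrix(rnorm(16, sd = 0.05), 4, 4)     # crude stand-in for a moment estimate
qr(rank_penalised(rho_hat, nu = 0.1))$rank              # recovers rank 1
```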


Redistributing Staff for an Efficient Orthopaedic Service

Jennifer Gillespie1, Sally McClean1,2, Bryan Scotney2, Francis FitzGibbon1, Frank Dobbs3 and Brian J. Meenan2

1 Multidisciplinary Assessment of Technology Centre for Healthcare Programme, University of Ulster, Northern Ireland, UK
2 School of Computing and Information Engineering, University of Ulster, Northern Ireland, UK
3 Faculty of Life and Health Science, University of Ulster, Northern Ireland, UK

Abstract

Musculoskeletal disorders affect over 100 million people in Europe each year, and in the United Kingdom musculoskeletal complaints are cited as the reason that 60% of people are on long term sickness. With our population ageing these figures are expected to rise, and orthopaedic staff are under pressure to manage the increasing number of referrals. The orthopaedic Integrated Clinical Assessment and Treatment Service (ICATS) have found that queues are currently building up in the department. The main reason for this is that staff are inefficiently distributed.

In this poster we present two approaches based on classic queueing theory, which efficiently distribute staff to minimise the overall waiting time. The first is an Exhaustive Search Algorithm (ESA), which calculates the overall expected waiting time for every combination of staff in the department. This approach is known to find an optimal solution, but it also has a long execution time. The second approach uses a Genetic Algorithm (GA), which aims to find a solution as close to the optimum as possible, in a shorter time frame. These approaches have been applied to orthopaedic ICATS, to find an appropriate number of staff in each state for the department to reach steady state. The distribution of staff, the overall expected waiting time, and the computation time have been compared to assess which approach is more suitable for orthopaedic ICATS. The results show that the ESA finds a minimum overall expected waiting time, and the distribution of staff is very similar to the GA. However, the execution time of the ESA is very large. Orthopaedic ICATS is only a small department with a limited number of staff to distribute; therefore, the computation time of the ESA would increase significantly for a larger department. In conclusion, we recommend that a GA would be an appropriate approach to efficiently distribute the staff through similar healthcare departments.
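
The exhaustive search idea can be conveyed with a small, hedged R sketch: independent M/M/c service points with made-up arrival and service rates, and every feasible staff split scored by the Erlang-C expected waiting time. It illustrates the search principle only, not the authors' model or data.

```r
# Expected time in queue (Wq) for an M/M/c queue via the Erlang-C formula
erlang_c_wait <- function(lambda, mu, c) {
  if (lambda >= c * mu) return(Inf)        # unstable allocation: queue grows without bound
  a <- lambda / mu
  tail <- a^c / (factorial(c) * (1 - a / c))
  p_wait <- tail / (sum(a^(0:(c - 1)) / factorial(0:(c - 1))) + tail)
  p_wait / (c * mu - lambda)
}

lambda <- c(4, 6, 3); mu <- c(2, 2.5, 1.5) # three hypothetical clinic stages
total_staff <- 11
best <- NULL
for (c1 in 1:(total_staff - 2)) for (c2 in 1:(total_staff - 1 - c1)) {
  c3 <- total_staff - c1 - c2              # remaining staff go to the third stage
  w <- erlang_c_wait(lambda[1], mu[1], c1) +
       erlang_c_wait(lambda[2], mu[2], c2) +
       erlang_c_wait(lambda[3], mu[3], c3)
  if (is.null(best) || w < best$w) best <- list(alloc = c(c1, c2, c3), w = w)
}
best   # staff split minimising the summed expected waiting time
```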


Mixed Membership of Experts Stochastic Blockmodel

Arthur White1, Thomas Brendan Murphy1

1School of Mathematical Sciences,

University College Dublin,

Ireland.

Abstract

Network analysis is the study of relational data between a set of actors who can share links between each other. Classic examples include friendship, sexual interaction, and professional collaboration, while more recent examples include email exchanges, Facebook statuses and other interaction via social media. Empirical studies of networks have shown that links are often formed in a highly dependent manner. One commonly occurring feature of such analyses is referred to as homophily by attributes, whereby actors with attributes in common are more likely to share links.

The stochastic blockmodel (SBM) is a flexible modelling technique for network analysis, whereby actors are partitioned into blocks, latent groups which exhibit different connective properties. Conditional on block membership, the probability distribution for a link between two actors is modelled with respect to a global parameter. One extension to the SBM is the mixed membership stochastic blockmodel (MMSBM). This allows actors partial membership to several blocks, reflecting the multiple role and community memberships often exhibited by actors in real world networks.

We introduce the mixed membership of experts stochastic blockmodel (MMESBM), an extension to the MMSBM. This model incorporates covariate actor information into the existing model, facilitating direct analysis of the impact covariates have on the network, rather than having to use them to conduct a post-hoc analysis of some kind. The method is illustrated with application to the Lazega Lawyers strong coworker dataset, where the impact that covariates such as status, gender, and age have on the network is examined.


Robust estimation of crosscovariance and specification of transfer function model in the presence of multiple outliers in leading indicator series

D. K. Shangodoyin

University of Botswana, Botswana. Email: [email protected]

Abstract

This paper considers the effect of outliers on transfer function modelling in the context of autocovariance and cross-covariance as identification and model specification tools. We establish that outlier input series significantly affect the mean and variance of the cross-covariance, as their asymptotic convergence is influenced if the original series is classified as a 2-dimensional random field in the presence of outliers. Robust estimates of the cross-covariance that accommodate outlier input series in the transfer function process are proposed.

Keywords: outlying observations, transfer function, cross-covariance, jackknife estimate, alternate random group method, leading indicator.

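
As a quick, hedged illustration of the problem (not the paper's robust estimator), the R sketch below shows how a single additive outlier in a simulated leading-indicator series attenuates the sample cross-correlation that would be used to identify and specify a transfer function model.

```r
# One additive outlier in the input series distorts the estimated cross-correlation.
set.seed(1)
n <- 200
x <- as.numeric(arima.sim(list(ar = 0.6), n))   # leading-indicator (input) series
y <- 2 * x + rnorm(n, sd = 0.5)                 # output driven by the input
x_out <- x
x_out[100] <- x_out[100] + 20                   # single additive outlier in the input

max(ccf(x, y, plot = FALSE)$acf)                # strong cross-correlation with clean input
max(ccf(x_out, y, plot = FALSE)$acf)            # noticeably attenuated by the outlier
```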

Smarter city predictive analytics using Generalized Additive Models

Bei Chen and Mathieu Sinn

IBM Research - Ireland

Abstract

Establishing efficient energy and transportation systems are key challenges for accommodating the fast-growing population living in cities. While all major cities worldwide are facing energy dependence, air pollution and traffic problems, providing fast and accurate predictions of energy and transportation systems is a stepping stone to improve the efficiency and sustainability of the city. In this talk, we present a class of algorithms which use Generalized Additive Models (GAMs) (Hastie and Tibshirani, 1990) for predictive analytics in smarter city applications.

The first application is short-term electricity load forecasting on various aggregation levels in the electric grid. We show results for highly aggregated series (national and regional demand), clusters of smart meters, and data from individual buildings. We also present an adaptive method which updates the GAM model parameters using a Recursive Least Squares filter. Experiments with real data demonstrate the effectiveness of this approach for tracking trends, e.g., due to macroeconomic changes or variations in the customer portfolio.

The second part of this talk focuses on multi-modal transportation networks. The bike sharing system is one of the major modes of the Dublin transportation network. We demonstrate how the GAM algorithm solves a two-stage prediction problem: (1) How many bikes (or free bike-stands) will be available at a given time point in the future? (2) If the predicted number of bikes (or free bike-stands) is zero, how long will be the waiting time? The GAM approach takes into account exogenous factors such as weather or the time of day, and yields superior performance compared to state of the art methods. Besides bike systems, our algorithm can be applied to any shared mobility scheme, such as car parking, shared cars, etc.
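
A minimal R sketch of a GAM of this flavour, using the mgcv package on simulated hourly load data; the talk does not specify an implementation, and the variables and effects below are invented purely for illustration.

```r
library(mgcv)
set.seed(1)
hour <- rep(0:23, 60)                                            # 60 days of hourly data
temp <- 10 + 8 * sin(2 * pi * hour / 24) + rnorm(length(hour))   # toy temperature profile
load <- 50 + 10 * cos(2 * pi * (hour - 18) / 24) - 0.5 * temp +  # toy demand profile
        rnorm(length(hour), sd = 2)
demand <- data.frame(load, temp, hour)

# Smooth temperature effect plus a cyclic smooth over the hour of the day
fit <- gam(load ~ s(temp) + s(hour, bs = "cc", k = 12), data = demand)
summary(fit)
predict(fit, newdata = data.frame(temp = 12, hour = 8))          # short-term forecast
```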


Sparse Linear Models

Trevor Hastie

Statistics Department

Stanford University, California, USA

Abstract

In a statistical world faced with an explosion of data, regularization has become an important ingredient. In many problems, we have many more variables than observations, and the lasso penalty and its hybrids have become increasingly useful. This talk presents a general framework for fitting large scale regularization paths for a variety of problems. We describe the approach, and demonstrate it via examples using our R package GLMNET∗.

∗GLMNET is produced jointly with Jerome Friedman, Rob Tibshirani and Noah Simon, all of Stanford University.
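
For readers unfamiliar with the package, a brief usage sketch of glmnet on simulated data follows (illustrative only, not an example from the talk).

```r
library(glmnet)
set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))     # sparse true coefficient vector
y <- x %*% beta + rnorm(n)

fit <- glmnet(x, y)                      # full lasso regularisation path
cvfit <- cv.glmnet(x, y)                 # cross-validation to choose lambda
plot(fit, xvar = "lambda")               # coefficient paths versus log(lambda)
coef(cvfit, s = "lambda.min")[1:5, ]     # fitted coefficients at the selected lambda
```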


The use of multilevel models to represent patient outcome from geriatric wards: An Italian case study

Adele H. Marshall1, Mariangela Zenga2, Franca Crippa2 and Gianluca Merchich2

1 Centre for Statistical Science and Operational Research (CenSSOR), Queen's University Belfast, Belfast, BT7 1NN
2 Milano-Bicocca University, via B. degli Arcimboldi, 20126 Milano

Abstract

The proportion of elderly people has increased across all European countries in the last fifteen years. That means a growth of the service for the healthcare system dedicated to the elderly, in particular an increase in expenditure and in admissions to hospitals, leading to an overall increase in patient length of stay (LOS) in hospital. In fact, in Italy in 2012 elderly people comprised 37% of the admissions to hospital, consuming nearly half (49%) of the LOS days. It has been estimated that in 2050 the aging of the population will produce an increase of 4-8% of the GDP across Europe. In contrast, recent years in Italy have also seen the closure of several pediatric wards, replaced by geriatric wards.

In the Italian national health system, each care unit is allowed to establish its own criteria for effective care allocation to the elderly. However, the central health authority dictates the criteria for hospitalization of elderly patients. This paper investigates the relationship between the combination of effectiveness and appropriateness with respect to geriatric wards belonging to the national health system within a specific region in Central Italy. In particular, our attention focuses on the evaluation of the healthcare outcome, the outcome itself relying on the patient well-being, where the latter is the result of a complex system of reciprocal relations between patients and healthcare agents.

The process is considered using a multilevel model where the patient outcomes are represented taking into account both the patient condition and ward/hospital settings. The evaluation of the healthcare outcome is modelled in terms of risk, with the inclusion of risk adjustments with respect to covariates, as advanced by Goldstein and Spiegelhalter in 1996 [1]. The model shows how certain outcomes are related to the healthcare structure and others are not. The multilevel model leads to a ranking between wards according to risk adjustments.

References

[1] Goldstein, H. and Spiegelhalter, D. J. (1996). League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance. Journal of the Royal Statistical Society, Series A, 159(3), 385-443.
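
To make the multilevel idea concrete, here is a hedged R sketch of a two-level risk-adjustment model of the general kind described (patients nested within wards), fitted with lme4 on simulated data; the variables are hypothetical and the abstract does not specify this implementation.

```r
library(lme4)
set.seed(1)
n_ward <- 20; n_pat <- 50
ward <- factor(rep(1:n_ward, each = n_pat))
age <- rnorm(n_ward * n_pat, mean = 80, sd = 6)
u <- rnorm(n_ward, sd = 0.5)[ward]                     # ward-level random effect
p <- plogis(-2 + 0.05 * (age - 80) + u)                # patient-level risk of adverse outcome
outcome <- rbinom(length(p), 1, p)

fit <- glmer(outcome ~ I(age - 80) + (1 | ward), family = binomial)
ranef(fit)$ward[1:5, , drop = FALSE]   # estimated ward effects behind a risk-adjusted ranking
```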


Covariance modelling for multivariate longitudinal data

Gilbert MacKenzie1 and Jing Xu2

1 Centre of Biostatistics, University of Limerick, Ireland
2 Birkbeck College, London University, London

Abstract

In many studies subjects are measured on several occasions with regard to multivariate response variables. Consider, as an example, a randomized controlled trial of teletherapy for age-related macular degeneration (Hart et al., 2002). Patients were randomly assigned to either radiotherapy or observation and distance visual acuity, near visual acuity and contrast sensitivity were measured throughout the study.

Modelling the covariance structures for such multivariate longitudinal data is usually more complicated than for the univariate case due to correlation between the responses at each time point, correlation within separate responses over time and cross-correlation between different responses at different times. Two approaches are commonly adopted: models with a Kronecker product covariance structure and multivariate mixed models with random coefficients. These approaches select the covariance structures from a limited set of potential candidate structures including compound-symmetry, AR(1) and unstructured covariance, and very often assume that the data are sampled from multiple stationary stochastic processes.

In this paper, we develop a method to model covariance structures for bivariate longitudinal data by extending the ideas of modified Cholesky decomposition (Pourahmadi, 1999) and matrix-logarithmic covariance modelling (Chiu et al., 1996). Finally, we model the parameters in these matrices parsimoniously using regression models and use our new methods to analyse the bivariate response in the ARMD trial referenced above.

Keywords: Covariance Modelling, Multivariate Longitudinal Data

References

Xu, J. and MacKenzie, G. (2012). Modelling covariance structure in bivariate marginal models for longitudinal data. Biometrika, 99(3), 649-662.
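
For intuition, a small R sketch of the (univariate) modified Cholesky decomposition that the paper extends: a unit lower-triangular T and a diagonal D are found with T Sigma T' = D, and the sub-diagonal entries of T play the role of generalised autoregressive coefficients. The AR(1)-style covariance below is invented for illustration.

```r
# Modified Cholesky decomposition of a covariance matrix (Pourahmadi-style)
modified_cholesky <- function(Sigma) {
  L <- t(chol(Sigma))                  # Sigma = L %*% t(L), L lower triangular
  C <- L %*% diag(1 / diag(L))         # unit lower triangular: Sigma = C diag(D) C'
  D <- diag(L)^2                       # innovation variances
  T <- solve(C)                        # unit lower triangular with T Sigma T' = diag(D)
  list(T = T, D = D)
}

Sigma <- 0.7^abs(outer(1:4, 1:4, "-"))      # AR(1)-type covariance for 4 time points
mc <- modified_cholesky(Sigma)
round(mc$T %*% Sigma %*% t(mc$T), 10)       # recovers a diagonal matrix
-mc$T[lower.tri(mc$T)]                      # generalised autoregressive coefficients
```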


Quasi-likelihood Estimation for Poisson-Tweedie Regression Models

John Hinde1, Bent Jorgensen2, Clarice Demetrio3 and Celestin Kokonendji4

1 NUI Galway, Galway, Ireland
2 University of Southern Denmark, Odense, Denmark
3 ESALQ/USP, Piracicaba, Brazil
4 Universite de Franche-Comte, Besancon, France

Abstract

We consider a new type of generalized linear model for discrete data based on Poisson-Tweedie mixtures. This class of models has previously been considered intractable, but recent theoretical results show that we may parameterize the models by the mean µ and a dispersion parameter γ, where the variance takes the form µ + γµ^p and p ≥ 1 is the Tweedie power parameter. Here we describe a quasi-likelihood method for estimating a regression model and a bias-corrected Pearson estimating function for the variance parameters γ and p. This provides a unified regression methodology for a range of different discrete distributions such as the Neyman Type A, Polya-Aeppli, negative binomial and Poisson-inverse-Gaussian distributions, as well as the Hermite distribution. We discuss these models in the context of overdispersion and zero-inflation and illustrate their application to some classic datasets and some recent data on hospitalisations.
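
As a rough, hedged illustration of the variance function µ + γµ^p (not the paper's bias-corrected estimating equations), the R sketch below fits the mean by quasi-Poisson and recovers γ by a Pearson-type moment match, with the power p treated as known; the simulated data are negative binomial, the p = 2 special case.

```r
set.seed(1)
x <- runif(500)
mu <- exp(1 + x)
y <- rnbinom(500, mu = mu, size = 2)        # variance mu + 0.5 * mu^2, i.e. p = 2, gamma = 0.5

fit <- glm(y ~ x, family = quasipoisson())  # consistent for the mean whatever p is
mu_hat <- fitted(fit)
p <- 2                                      # Tweedie power taken as known here
gamma_hat <- mean((y - mu_hat)^2 / mu_hat^p - mu_hat^(1 - p))
gamma_hat                                   # close to the true value 0.5
```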


Application of Random Item IRT Models to Longitudinal Data from Electronic Learning Environments

Damazo T. Kadengye1, Eva Ceulemans2 and Wim Van den Noortgate3

1,2,3Centre for Methodology of Educational Research, KU Leuven, Belgium 1,3Faculty of Psychology and Educational Sciences, KU Leuven – Kulak, Belgium

Abstract

In educational learning environments, monitoring persons' progress over time may help teachers to continually evaluate the effectiveness of their teaching or training procedures and make more informed instructional decisions. Electronic learning (e-learning) environments are increasingly being utilized as part of formal education, and tracking and logging data sets can be used to understand how and whether students progress over time or to improve the learning environment.

In this paper, we present and compare four longitudinal models based on the item response theory (IRT) framework for measuring growth in persons' ability within and between study sessions in tracking and logging data from web based e-learning environments. Two of the models, which have been proposed and applied in other aspects of educational research, focus on measurement of growth between study sessions. These are compared to two extensions that we propose: one model measures growth within study sessions, while the other model combines the two aspects, growth within and between study sessions. Differences in growth across person groups are explained by extending the models to the explanatory IRT framework. An e-learning example is used to illustrate the presented models. Results show that by incorporating time spent within and between study sessions into an IRT model, one is able to track changes in ability of a population of persons or for groups of persons at any time of the learning process.

Keywords: Item Response Theory, e-Learning Environments, Modelling of Growth
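
The random-item IRT building block can be sketched in R as a crossed random-effects logistic regression with lme4 (a common way to fit such models); the simulated data and the omission of the within- and between-session growth terms are simplifications for illustration only.

```r
library(lme4)
set.seed(1)
n_person <- 200; n_item <- 40
d <- expand.grid(person = factor(1:n_person), item = factor(1:n_item))
theta <- rnorm(n_person)[d$person]           # person abilities
b <- rnorm(n_item)[d$item]                   # item easiness, treated as random
d$correct <- rbinom(nrow(d), 1, plogis(theta + b))

# Rasch-type model with crossed person and item random effects
fit <- glmer(correct ~ 1 + (1 | person) + (1 | item), data = d, family = binomial)
summary(fit)$varcor                          # estimated ability and item variance components
```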


Graduation of Crude Mortality Rates for the Irish Life Tables

Kevin McCormack

Central Statistics Office Ireland

Abstract

A life table is a convenient way of summarising various aspects of the variation of mortality with age. The graduation or smoothing of crude population mortality rates is essential in the construction of life tables, as the recording of the underlying deaths is subject to errors.

Period life tables have been produced by the Irish Central Statistics Office on fifteen occasions, from 1926 to 2005-07, and on each occasion King's 1911 formula for osculatory interpolation was used to graduate the crude mortality rates.

In this paper a modern and more statistically accurate cubic-spline graduation method, based on the TRANSREG feature in SAS, is developed and applied to the 2005-07 Irish crude mortality rates. The Irish crude mortality rates were also smoothed using the UK Office for National Statistics GeD spline methodology. Life tables for males and females are constructed using both sets of graduated mortality rates and the results are compared.
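
The graduation idea can be conveyed with a hedged R sketch: crude rates simulated from a Gompertz-like mortality schedule and smoothed on the log scale with an exposure-weighted cubic smoothing spline. The paper's actual tools (SAS PROC TRANSREG and the ONS GeD spline) are not reproduced here, and all numbers below are invented.

```r
set.seed(1)
age <- 0:100
qx_true <- 0.0005 + 0.00002 * exp(0.095 * age)                 # Gompertz-like mortality schedule
exposure <- round(50000 * exp(-0.03 * age))                    # lives exposed at each age
deaths <- rbinom(length(age), exposure, pmin(qx_true, 1))
qx_crude <- deaths / exposure                                  # noisy crude mortality rates

fit <- smooth.spline(age, log(qx_crude + 1e-6), w = exposure)  # weighted cubic spline on log scale
qx_grad <- exp(predict(fit, age)$y)                            # graduated rates
plot(age, qx_crude, log = "y", pch = 16, cex = 0.5)
lines(age, qx_grad, lwd = 2)
```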


Advanced Analytics in Fraud & Compliance

Marco Grimaldi
Accenture Analytics Innovation Centre, Dublin 4, Ireland

Fraud is a much bigger problem for organisations than they generally admit, and it is costing them. It is costing them through loss of revenue and reputational risk if they are not seen as protecting their customers. There is compelling international evidence to demonstrate the scale of the problem. Most organisations are struggling to keep pace with the scale and sophistication of fraud. As organisations become more innovative in how they deliver and get paid for products and services to clients, they also become more vulnerable to fraud.

Conventional approaches to fraud management are no longer enough and for the most part only tackle the tip of the fraud problem. More sophisticated, data driven approaches to fraud are delivering remarkable results. Most organisations have exceptional amounts of data on their current and potential clients which they get from internal and external sources. Many are now using this data to build risk profiles of who has defrauded or is likely to defraud them. When the data is fully exploited it allows a targeted rather than random approach to fraud. This in turn means much better results. Data mining and advanced analytics are at the heart of new, more holistic approaches to fraud management. With them, organisations can uncover patterns of fraud, and they are now developing integrated fraud prevention strategies that are cross-functional and have analytics at their core.

Accenture's Analytics Innovation Centre (AAIC) is achieving impressive results in fraud detection. The Centre is the first of its kind in Ireland and has become a Centre of Excellence for fraud analytics. It is already achieving impressive results, for example:
• It is delivering a 45% increase in non-compliance yield per investigation in a European Revenue Agency.
• It has increased fraud detection rates in a European Welfare Agency by 40%.


Inferences on Inverse Weibull distribution based on progressive censoring using EM algorithm

Amal Helu1 and Hani Samawi2

1 Department of Mathematics, The University of Jordan, Jordan
2 Jiann-Ping Hsu College of Public Health, Georgia Southern University, USA

Abstract

The Inverse Weibull (IW) distribution can be readily applied to a wide range of situations including applications in medicine, reliability and ecology. Based on progressively type-II censored data, the maximum likelihood estimators (MLEs) for the parameters of the IW distribution are derived using the expectation maximization (EM) algorithm. Moreover, the expected Fisher information matrix based on the missing value principle is computed. Using extensive simulation and three criteria, namely bias, mean squared error and Pitman nearness (PN) probability, we compare the performance of the MLEs via the EM algorithm with the Newton-Raphson (NR) method. It is concluded that the MLEs using the EM algorithm outperform their counterparts using the NR method. A real data example is used to illustrate our proposed estimators.
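
A hedged R sketch of the simpler, complete-data version of the estimation problem follows: maximum likelihood for the inverse Weibull fitted by direct numerical optimisation. The paper's progressive type-II censoring, EM steps and Fisher information are not reproduced, and the parameter values are invented.

```r
set.seed(1)
beta_true <- 2; theta_true <- 1.5
x <- 1 / rweibull(300, shape = beta_true, scale = 1 / theta_true)  # IW(beta, theta) sample

# Negative log-likelihood of IW: f(x) = (beta/theta) (x/theta)^(-beta-1) exp(-(x/theta)^(-beta))
nll <- function(par, x) {
  beta <- par[1]; theta <- par[2]
  if (beta <= 0 || theta <= 0) return(Inf)
  -sum(log(beta / theta) + (-beta - 1) * log(x / theta) - (x / theta)^(-beta))
}
optim(c(1, 1), nll, x = x)$par     # estimates close to (2, 1.5)
```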


Statistical challenges in describing a complex aquatic environment

E. Marian Scott

School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8QW, UK

Abstract

Both quality and quantity of water are of crucial importance for many reasons, impacting on diverse topics from flood risk to human health. Water quality is determined by many determinands and characteristics, and is subject to a variety of different regulatory regimes. Within the European Union, there are several regulatory frameworks dealing with the aquatic environment, of which the Water Framework Directive (WFD) is perhaps the most significant. Three others are the Bathing Waters Directive (BWD), for predicting microbiological health risk, the Floods Directive, which requires a national assessment of flood risk by 2011 and flood risk and hazard maps by 2013, and the Nitrates Directive, which is linked to the WFD. The WFD expresses objectives in terms of achieving good ecological and chemical status. Setting such environmental objectives requires a means of judging the state of the environment, and an integrated river basin management planning system.

This development presents some interesting statistical challenges, with waters being managed at river basin level through a river basin management plan. At the same time, our ability to monitor the environment at increasingly high resolution, both spatially and temporally, will produce a revolution in our understanding, provided it is matched with a revolution in our ability to model and visualise such data. New sensor technology, while not yet widespread, is becoming more generally deployed, and it is within this context that some grand water monitoring challenges are considered, including integration and synthesis of observations from a variety of sensor networks, development of statistical models to handle large data sets, a focus on extremes, and visualisation of dynamic systems.


What does statistics tell us about the palaeo-climate?

John Haslett1, Andrew Parnell2

1 School of Computer Science and Statistics, TCD, Ireland
2 School of Mathematics, UCD, Ireland

Abstract

The climate change debate has been much informed, over the past decade, by information about the palaeo-climate. For the past decade, with SFI support and several collaborators, we in Ireland have been part of this international research effort. Our general focus is on the past 100,000 years and more specifically on the past 15,000 years. This includes the abrupt transition to the Holocene, the current inter-glacial period. In this paper we present an overview of this research and touch on some of the implications.

To a statistician, the general objective of climate change research is the reduction of uncertainty about the past and/or future of climate. But public debate, especially about the future, is mired in the concept of uncertainty.

This presentation will discuss the general issue of the study of the uncertainty concerning a complex stochastic space-time system on which there is a small amount of poor quality (proxy) data. Modern Bayesian methods involving MCMC are our preferred tool. It will then touch on the communication of that uncertainty, firstly scientist-to-scientist and subsequently scientific community-to-public. The topic is timely, as the Intergovernmental Panel on Climate Change (IPCC) will in 2013 issue its first major report since 2007. It is likely to update its current recommendations on uncertainty.


Reconstruct climate history at multiple locations given irregular observations in time

Thinh K. Doan1, Andrew C. Parnell2 and John Haslett1

1 School of Computer Science & Statistics, Trinity College Dublin
2 School of Mathematical Science, University College Dublin

Abstract

The best method to predict climate change is to understand the past. Direct measurements are only available for the last 250 years, but there are other indirect measurements (known as climate proxies) which can be used as a guide to further past climatic conditions. Using pollen abundance as a proxy, the aim of this work is to derive information about the climate dynamic processes that generate climate variability in the past. This is commonly referred to as reconstruction.

The data are available at irregular time intervals at multiple locations in Finnmark (Norway), going back as far as the last 14,000 years. This period includes the very rapid warming and cooling of climate known as the Younger Dryas. We use computationally intensive Monte Carlo methods for parameter estimation, and develop a multivariate long-tail smoothing algorithm for the joint reconstruction of the multivariate climate time series.

The methodological focus for this presentation concerns the fact that the data series are not only irregular, but misaligned. That is, the series $y_j = \{ y_j(t_{ij}) : i = 1, \ldots, n_j \}$ have observations at different times.


A threshold based ‘discounting’ mechanism in the revised EU Bathing Water Directive

S. Ahilan1, J.J. O’Sullivan1, B. Masterson2, K. Demeter2, W. Weijer2, G. O’Hare3

1 UCD School of Civil, Structural and Environmental Engineering
2 UCD School of Biomolecular and Biomedical Science
3 UCD School of Computer Science and Informatics

Abstract

Under the revised Bathing Water Directive, more stringent bathing water quality standards, defined in terms of E. coli and Intestinal Enterococci (IE), will apply. An integrated approach to coastal zone management, essential for sustaining tourism and shellfish harvesting in European coastal waters, is required in the context of the new legislation. The directive recognises that elevated levels of faecal coliform bacteria in bathing areas can derive from the overland transport of waste from livestock in the rural fraction of river catchments. On days, therefore, that follow significant storm events in coastal agricultural catchments, exceedences of threshold bacteria levels may occur. Given that these exceedences result from 'natural' rather than anthropogenic influences, a 'discounting' mechanism is included in the Directive whereby high levels of faecal bacteria contamination can be excluded from the water quality record if they are predicted in advance and mitigation actions to maintain public health protection are taken. However, this discounting, which can apply to a maximum of 15% of water quality samples in a 4-year monitoring period, is required on a continuous basis rather than at the end of the monitoring period. This presents a practical problem for responsible authorities charged with enforcing the legislation. In the event of a poor quality water sample associated with a naturally occurring short-term pollution incident being recorded early in the monitoring period, authorities must decide whether or not this is likely to be included in the 15% of discounted samples in a 4-year period.

This study develops a risk based probabilistic approach to 'discounting' for advising responsible authorities whether the water quality of a particular sample should be discounted. The study uses E. coli and IE records in three bathing water areas (Dollymount, Merrion and Sandymount Strands) on the east coast of Ireland collected from 2003 to 2012. The records were initially analysed to identify any seasonal or other patterns in the data. Following this, synthetic records of E. coli and IE were generated from a Monte-Carlo simulation over the monitoring period. The non-exceedence probabilities of E. coli and IE were determined from the generated samples. The directive requires that threshold values of both E. coli and IE should be within specified limits to maintain beach quality and therefore a joint probability analysis was undertaken to identify allowable E. coli and IE levels to facilitate the discounting of 15% of samples.
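
A toy R sketch of the joint exceedance calculation described above: correlated lognormal E. coli and IE concentrations are simulated and the probability that either indicator breaches a threshold is estimated by Monte Carlo. The distributional parameters, correlation and threshold values are illustrative assumptions, not figures from the study.

```r
set.seed(1)
n_sim <- 100000
z <- matrix(rnorm(2 * n_sim), ncol = 2)
R <- chol(matrix(c(1, 0.6, 0.6, 1), 2))    # assumed correlation between log-concentrations
lz <- z %*% R
ecoli <- exp(4.5 + 1.2 * lz[, 1])          # simulated E. coli (cfu/100 ml)
ie    <- exp(3.5 + 1.0 * lz[, 2])          # simulated intestinal enterococci

mean(ecoli > 500 | ie > 200)               # P(at least one assumed threshold is breached)
mean(ecoli <= 500 & ie <= 200)             # joint non-exceedance probability
```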


The Impact of Outliers in a Joint Model Setting

Lisa M. McCrink, Adele H. Marshall and Karen J. Cairns

Centre for Statistical Science and Operational Research (CenSSOR),

Queen’s University Belfast, Northern Ireland

Abstract

With a growing volume of medical longitudinal and survival data being gathered

concurrently, joint models have become a popular technique to handle the typi-

cal associations found between such data through simultaneous estimation of both

the longitudinal and survival processes. Although this field of research is growing

rapidly, little research has assessed the impact of the commonly used normality

assumptions within these models.

This research focuses on the impact of this normality assumption of both the longitu-

dinal random effects and random error terms in the presence of longitudinal outliers.

In doing so, a robust joint model is introduced which can account for both outlying

observations within individuals alongside outlying individuals that don't conform to

the population trends. An illustration using longitudinal data from Northern Irish

end stage renal disease patients is presented in this research.
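For readers unfamiliar with the setting, a generic shared-random-effects joint model can be written as below; this is illustrative notation of mine rather than the exact specification used in this work, with y_i(t) the longitudinal outcome, h_i(t) the hazard, b_i the random effects shared between the two submodels and alpha the association parameter. The robust model discussed here would replace the normal assumptions with heavier-tailed alternatives.

\begin{align*}
  y_i(t) &= x_i^{\top}(t)\,\beta + z_i^{\top}(t)\, b_i + \varepsilon_i(t),
            \qquad \varepsilon_i(t) \sim N(0, \sigma^2), \quad b_i \sim N(0, D),\\
  h_i(t) &= h_0(t)\,\exp\!\big\{ w_i^{\top}\gamma + \alpha\, z_i^{\top}(t)\, b_i \big\}.
\end{align*}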

Due to the natural decline in kidney functions with age, the aging population within

the UK has led to an increasing number of individuals commencing renal replacement

therapy. For these patients, one of the largest influences on their survival is the

management of anaemia with previous renal research stressing the negative impact

that fluctuating haemoglobin (Hb) levels over time have on patients' survival [1].

Due to this association between a longitudinal and survival process, independent

models can result in biased estimates and so a joint model is recommended [1].

Both outlying observations and individuals are common in this type of data. This

research illustrates the effect that these outliers have on the longitudinal parameters

and the impact of this in a joint model setting.

References

[1] Ratcliffe, S.J., Guo, W. & Ten Have, T.R. 2004, ”Joint modeling of longitudinal

and survival data via a common frailty”, Biometrics, vol. 60, no. 4, pp. 892-899.


Longitudinal PSA reference ranges: choosing the underlying model of age related changes

Andrew Simpkin1, Chris Metcalfe1, Jenny L. Donovan1, Richard M. Martin1,2, J. Athene Lane1, Freddie C. Hamdy3, David E. Neal4 and Kate Tilling1

1 School of Social and Community Medicine, University of Bristol, Bristol, UK

2 MRC Centre for Causal Analysis in Translational Epidemiology, University of Bristol, Bristol, UK

3 Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK

4 Oncology Centre, Addenbrooke’s Hospital, Cambridge, UK

Abstract

Background: Serial measurements of prostate specific antigen (PSA) are used as a biomarker for

men diagnosed with prostate cancer following an active monitoring programme. Distinguishing

pathological changes from natural age-related changes is not straightforward. Here we compare

four approaches to modelling age-related change in PSA with the aim of developing reference

ranges for repeated measures of PSA. A suitable model for PSA reference ranges must satisfy

two criteria. Firstly it must offer an accurate description of the trend of PSA on average and in

individuals. Secondly it must be able to make accurate predictions about new PSA observations

for an individual and about the entire PSA trajectory for a new individual.

Methods: Data from over 7,000 serial PSA tests were available from a cohort of 512 men

enrolled in the active monitoring preference arm of the Prostate Testing for Cancer and Treatment

(ProtecT) trial. We used linear mixed models assuming (i) a linear change in PSA with time, and

explored non-linear trajectories using (ii) fractional polynomials and (iii) linear regression

splines in the mixed model setting. The final method comparison was with (iv) a nonparametric

method (PACE) for fitting smooth curves to repeated measures data. Using methods developed

for linear mixed models, we can enhance predictions for future observations by conditioning on

already observed PSA. The approaches were compared for model fit using Akaike’s Information

Criterion (AIC) while root mean squared error (RMSE) was used to evaluate the performance in

fitting the sample PSA data.
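As an illustration of method (iii), a linear regression spline mixed model of this kind can be fitted in R roughly as follows; the data frame psa, its column names and the knot ages are hypothetical placeholders rather than the ProtecT variables, so this is a sketch of the approach, not the analysis itself.

## Linear spline basis for age (degree-1 B-splines) inside a linear mixed
## model with a random intercept and slope per man
library(lme4)
library(splines)

knots <- c(55, 65, 70)   # assumed knot ages
fit <- lmer(log(psa_level) ~ bs(age, knots = knots, degree = 1) + (age | id),
            data = psa)

AIC(fit)                          # used here for model comparison
sqrt(mean(residuals(fit)^2))      # in-sample RMSE (on the log scale)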

Results: PACE offered the best fit to the observed PSA data with an RMSE of 2.26 ng/ml.

However, using conditional prediction methods, the mixed model approaches can be used to


better predict observations for new individuals. Among these methods, a linear regression spline

mixed model performed the best in modelling repeated PSA, with a lower AIC and an RMSE of

2.10 ng/ml. This analysis demonstrates an advantage of regression splines over fractional

polynomials, i.e. using a local rather than a global basis to fit models to non-linear relationships.

Conclusions: The nonparametric method performed best in describing the features of population

trend in this repeated measures analysis. Parametric techniques are more suited to predicting new

observations for individuals, as methods exist to condition on initial values. Among methods

discussed here, the linear regression spline mixed model is the optimum approach for deriving

reference ranges for repeated measures of PSA.


Identifying Psychosocial Risk Classes in Families of

Children Newly Diagnosed with Cancer: a Latent Class

Approach

Wei-Ting Hwang1 and Anne E. Kazak2

1Department of Biostatistics and Epidemiology

University of Pennsylvania Perelman School of Medicine, USA
2Department of Psychology, The Children’s Hospital of Philadelphia, USA

Abstract

In an effort to provide high quality and appropriate psychosocial care for children

newly diagnosed with cancer, and their families, the Psychosocial Assessment Tool

(PAT2.0) was developed as a brief screening instrument to identify family needs and

targets for intervention. The PAT items consist of a constellation of risk factors

focusing on family structure and resource, social support, stress reaction, family

problems, child problems, sibling problems, and family beliefs. The development of

PAT is also based on the Pediatric Preventative Psychosocial Health Model (PPPHM).

The PPPHM describes the pediatric health population by conceptualizing

families with three levels of risks and needs (universal, targeted, and clinical) in

times of a stressful event such as cancer diagnosis. The objective of this work is to

identify profiles of psychosocial risks and study their evolution in this population.

Using data collected for a PAT project of 142 families, we performed latent class

analysis (LCA) to identify categories of psychosocial risk classes based on the risk

indicators of PAT2.0. The proposed approach is compared with the traditional

method that uses the weighted scores. The stability of the PAT risk classifica-

tion over the first 4 months of treatment was then assessed by a latent transition analysis

(LTA). The covariate effects on the class memberships and transition probabilities

were also estimated.
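For readers who wish to try a similar analysis, the sketch below shows a latent class fit in R with the poLCA package; poLCA is one common implementation rather than necessarily the software used here, and the data frame pat and its item names are hypothetical stand-ins for the PAT2.0 risk indicators.

## Latent class analysis on categorical risk indicators, choosing the number
## of classes by BIC
library(poLCA)

f <- cbind(family_structure, social_support, stress_reaction,
           family_problems, child_problems, sibling_problems) ~ 1

fits <- lapply(2:4, function(k) poLCA(f, data = pat, nclass = k, verbose = FALSE))
sapply(fits, function(m) m$bic)   # compare 2-, 3- and 4-class solutions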


Using Multivariate Exploratory Data Analysis Techniques

to Build Multi-State Markov Models: Predicting Life

Expectancy with and without Cardiovascular Disease

Karen Cairns1, Paula McMillen1, Mark O’Doherty2 and Frank Kee2

1Centre for Statistical Science and Operational Research (CenSSOR), Queen’s

University Belfast, UK
2UKCRC Centre of Excellence for Public Health Research (NI), Queen’s

University Belfast, UK

Abstract

Often covariates being integrated into multi-state Markov models are of mixed type

(categorical and continuous) and exhibit associations. This can lead to spurious

results when covariate-based multi-state Markov models are fitted.

This paper demonstrates the usefulness of integrating multivariate exploratory data

analysis techniques into the process of building multi-state Markov models. The

msm package in R was used to build the multi-state Markov models (Jackson 2011),

while the FactoMineR package in R was used to perform multiple factor analysis

(Le et al 2008). Bayesian Information Criterion (BIC) was used to determine the

optimal model. This paper also extends the msm code to predict life expectancy

with and without cardiovascular disease, based on multiple covariates.
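The two building blocks named above fit together roughly as in the sketch below; the data frame prime, its variable groups, the three-state structure and the initial intensity matrix are hypothetical placeholders rather than the PRIME Belfast coding, so this only illustrates the call pattern of the two packages.

## Step 1: multiple factor analysis on covariates grouped as Background
## (assumed 2 categorical variables), Lifestyle (3 categorical) and Risk
## Indicators (4 continuous); keep the first two dimensions as covariates
library(FactoMineR)
library(msm)

mfa <- MFA(prime[, 1:9], group = c(2, 3, 4), type = c("n", "n", "s"),
           name.group = c("Background", "Lifestyle", "Risk"), graph = FALSE)
prime$dim1 <- mfa$ind$coord[, 1]
prime$dim2 <- mfa$ind$coord[, 2]

## Step 2: three-state model (CVD-free, CVD, dead) with the MFA dimensions
## as covariates on the transition intensities
Q <- rbind(c(0, 0.1, 0.1),
           c(0, 0,   0.1),
           c(0, 0,   0))
fit <- msm(state ~ years, subject = id, data = prime, qmatrix = Q,
           covariates = ~ dim1 + dim2, deathexact = 3)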

The methods have been applied to the PRIME Belfast data set. PRIME Belfast

follows 2745 men aged 50 to 59 years from Belfast, UK, giving an indication of their

coronary status. Information on subjects was grouped by ‘Background’ (married,

education); ‘Lifestyle’ (alcohol consumed, physical activity, smoking); and ‘Risk

Indicators’ (hypertension, body mass index (BMI), cholesterol, diabetes). Multiple

factor analysis indicates the first dimension is highly correlated with Risk Indicators,

while the second dimension correlates with Background and Lifestyle information.

The optimal model obtained was based on the use of the transformed variables (the

first two dimensions). In contrast, the (sub-optimal) model based on the original

variables depends only on lifestyle factors.

Le, S., Josse, J., Husson F. (2008). FactoMineR: an R package for multivariate

analysis. Journal of Statistical Software, 25, pp. 1-18.

Jackson, C.H. (2011). Multi-State Models for Panel Data: The msm Package for R.

Journal of Statistical Software, 38, pp. 1-29.


Resampling Methods for Exploring Cluster Stability

Friedrich Leisch

University of Natural Resources and Life Sciences, Vienna, Austria

[email protected]    

Model diagnostics for cluster analysis is still a developing field because of its exploratory nature. Numerous indices have been proposed in the literature to evaluate goodness-of-fit, but no clear winner that works in all situations has been found yet. Derivation of (asymptotic) distribution properties is not possible in most cases. Over the last decade several resampling schemes which cluster repeatedly on bootstrap samples or random splits of the data and compare the resulting partitions have been proposed in the literature. These resampling schemes provide an elegant framework to computationally derive the distribution of interesting quantities describing the quality of a partition. Due to the increasing availability of parallel processing even on standard laptops and desktops, these simulation-based approaches can now be used in everyday cluster analysis applications. We give an overview of existing methods, show how they can be represented in a unifying framework including an implementation in R package flexclust, and compare them on simulated and real-world data. Special emphasis will be given to stability of a partition, i.e., given a new sample from the same population, how likely is it to obtain a similar clustering?

Key Words: cluster analysis, resampling methods, bootstrap, R
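The sketch below hand-codes one such scheme in base R (the flexclust package wraps schemes of this kind in a unified framework; this is an illustration of the idea, not the package implementation): pairs of bootstrap samples are clustered with k-means, every original observation is assigned to the nearest centroid of each solution, and the two induced partitions are compared with the Rand index.

## Stability of a k-means partition via pairs of bootstrap samples
rand_index <- function(a, b) {
  tab <- table(a, b); n <- length(a)
  agree <- choose(n, 2) + 2 * sum(choose(tab, 2)) -
           sum(choose(rowSums(tab), 2)) - sum(choose(colSums(tab), 2))
  agree / choose(n, 2)
}
nearest <- function(x, centers)
  apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))

stability <- function(x, k, nboot = 50) {
  replicate(nboot, {
    km1 <- kmeans(x[sample(nrow(x), replace = TRUE), ], centers = k, nstart = 5)
    km2 <- kmeans(x[sample(nrow(x), replace = TRUE), ], centers = k, nstart = 5)
    rand_index(nearest(x, km1$centers), nearest(x, km2$centers))
  })
}

## Example: a 2-cluster structure is recovered much more stably than a 5-cluster one
x <- rbind(matrix(rnorm(200), ncol = 2), matrix(rnorm(200, mean = 4), ncol = 2))
summary(stability(x, 2))
summary(stability(x, 5))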


Clustering PET volumes of interest for the analysis of

driving metabolic characteristics

Eric Wolsztynski1, Emily Brick1, Finbarr O’Sullivan1 and Janet F. Eary2

1School of Mathematical Sciences, University College Cork, Ireland
2School of Medicine, University of Washington, Seattle, WA, USA

Abstract

The main advantage of Positron Emission Tomography (PET) over other imaging

modalities resides in the unique metabolic information it provides with the imaged

distribution of radiolabelled tracer uptake in tissue. PET imaging of solid tumours

has thus become part of standard treatment protocols for an increasing number

of cancer types, in particular for the evaluation of prognosis at baseline and of

therapeutic response. Clinical experience indicates that certain sub-regions of higher

uptake intensity within the tumour play a dominant role when determining overall

tumour metabolic progression and treatment outcome. Relevant statistical analysis

can thus be carried out on sub-volumes of the PET image. This process however

relies on adequate techniques for tumour and region delineation from data of limited

spatial resolution, which is not straightforward. The choice of a statistical indicator

of metabolic activity within the sub-region is then in most cases, depending on the

type of disease and medical protocols in place, a variation of some mean or maximum

tracer uptake quantitation.

We explore the feasibility of model-based clustering in partitioning the PET volume

of interest (VOI) into sub-volumes of characteristic uptake patterns. Delineation of

the sub-VOI is thus data-driven and also has the advantage of regrouping uptake

information from areas of comparable metabolic intensities that are not connected

spatially. We consider a recombined Gaussian mixture modelling of the PET VOI

to identify sub-VOIs of analytic value. Performance of standard statistical quanti-

tators applied to clusters of radiotracer uptake is also considered in terms of clinical

utility. Preliminary results on sarcoma studies highlight the potential of statistical

quantitation derived from clusters of PET data for both prognosis and therapeutic

response assessment.

[Research supported by SFI MI-2007 and NIH/NCI ROI-CA-65537.]


A Random Walk Interpretation of DiffusionRank

Haixuan Yang1

1School of Mathematics, Statistics, & Applied Mathematics

National University of Ireland, Galway, Ireland

Abstract

Based on a heat diffusion model on a graph, recently we (2007) proposed Diffu-

sionRank to solve the link manipulation problem of PageRank. A little earlier, we

(2005) applied a similar model to the semi-supervised classification problem where,

the task is, given some labelled nodes in a graph, to predict the labels of unlabelled

nodes. More recently, two papers appeared in Bioinformatics: Goncalves

et al. (2011) applied DiffusionRank to the problem of prioritization of regulatory

associations, and Poirel et al. (2013) evaluated its (and three other algorithms’)

performance when applied to the problem of ranking genes for 54 diverse human

diseases. With these successful applications of DiffusionRank, it is interesting to

show its properties for a better understanding of why it works. We (2005, 2009)

showed that DiffusionRank can be considered as a generalization of both K nearest

neighbors and the Parzen window nonparametric method for classification problems.

Here I would like to share a random walk interpretation: DiffusionRank is a lazy

random walk in which a random walker rests in a node with a flexible probability;

and computationally equivalently, it is a finite-step lazy random walk with a fixed

rest probability. Such a lazy random walk has the properties of keeping track of the

initial condition, and reflecting the network structure. Both of the properties are

important for ranking problems and classification problems as the inference of the

ranking scores and classification confidence scores rely on both the initial condition

and the network structure.
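The relation alluded to here can be written out as follows, in generic notation of mine rather than that of the cited papers: P is the transition matrix of the random walk on the graph, f_0 the initial score vector, gamma the diffusion (rest) parameter and N the number of steps; the finite-step lazy walk with a fixed rest probability converges to the heat-diffusion form.

\[
  f_N \;=\; \Big( \big(1 - \tfrac{\gamma}{N}\big) I + \tfrac{\gamma}{N} P \Big)^{N} f_0
  \;\longrightarrow\; e^{-\gamma (I - P)} f_0 \qquad (N \to \infty).
\]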


Changepoint model with autocorrelation

Erla Sturludottir and Gunnar Stefansson

Science Institute, University of Iceland, Iceland

Abstract

A changepoint is a point in a time-series at which there is a step change (i.e. a mean shift in the response variable) and/or a change in trend. Such changes can occur, for example, in time-series of contaminant concentrations: a change in the analysis method can result in a step change in the time-series, and a change in emissions can result in a change in trend. Changepoint analyses have mainly been carried out on climate time-series, and the focus has been on step changes rather than changes in trend.

Most methods for detecting a changepoint assume independent, identically and normally distributed errors. However, autocorrelation is common in time-series and inflates the error rate when it is ignored, and research on changepoint models with autocorrelation has focused on models which only allow a step change. The likelihood ratio (test statistic) for an unknown changepoint does not follow a known distribution, and the critical value of the empirical distribution depends both on the length of the time-series and on the autocorrelation parameters.

The model (1) allows for a changepoint in a time-series yt, i.e. the intercept (α1 ≠ α2) and/or the slope (β1 ≠ β2) are different before and after the changepoint:

\[
  y_t =
  \begin{cases}
    \alpha_1 + \beta_1 t + \varepsilon_t, & n_0 \le t \le c, \\
    \alpha_2 + \beta_2 t + \varepsilon_t, & c < t \le n - n_0,
  \end{cases}
  \qquad (1)
\]

The errors εt are autocorrelated, c is the changepoint, n0 is the first possible changepoint and n is the length of the time-series. This model will detect both step and trend type changepoints.
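A minimal illustration of fitting model (1) with AR(1) errors is sketched below; the AR(1) structure, the simulated series and the grid search over candidate changepoints are choices made for the example rather than the exact procedure of the study.

## Profile the log-likelihood of model (1) over candidate changepoints, using
## generalised least squares with an AR(1) correlation structure (nlme::gls)
library(nlme)

set.seed(1)
n <- 100; t <- 1:n
y <- ifelse(t <= 60, 10 + 0.05 * t, 13 + 0.01 * t) +
     arima.sim(list(ar = 0.4), n = n)          # step and trend change at t = 60
n0 <- 10

loglik_at <- function(cp) {
  seg <- factor(t > cp)                        # before/after candidate changepoint
  fit <- gls(y ~ seg + seg:t - 1, data = data.frame(y = y, t = t, seg = seg),
             correlation = corAR1(form = ~ t), method = "ML")
  as.numeric(logLik(fit))
}

cands <- n0:(n - n0)
profile <- sapply(cands, loglik_at)
cands[which.max(profile)]                      # estimated changepoint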

In this study, a changepoint model which can detect a step and/or trend change while accounting for autocorrelation will be investigated with simulations, and an application to contaminant time-series will be given.


Modelling global and local changes of distributions to adapt

a classification rule in the presence of verification latency

Vera Hofer1

1Department of Statistics and Operations Research,

Karl-Franzens University of Graz, Austria

Abstract

The distributions a classification problem is based on can be subject to changes over

the course of time. Such changes can relate to the class prior distribution (global

change) or to the conditional or unconditional feature distributions (local change).

After any change the original training data comprising features and class labels are

no longer representative for the population, and thus the classifier’s performance may

deteriorate. However, in the presence of verification latency a re-estimation of the

classification rule after changes is impossible. Verification latency, a phenomenon

that often appears in practical applications, denotes a learning environment in

which only recent unlabelled data are available. The labels are known only after

some time lapse.

To adapt a classification rule in the presence of verification latency a model is in-

troduced that estimates global and local changes in a two-step procedure using

recent unlabelled data. Since unlabelled data are available after a change, the new

unconditional feature distribution can be estimated. The old conditional feature

distributions are known from the time before the change. The model is based on

mixture distributions, where the local changes are modelled as local displacement of

probability mass from the positions of the old components as given in the conditional

feature distribution to the positions displayed by the new components. Assuming

that the transfer of probability mass is carried out at a minimum of energy, local

changes are estimated by solving a transportation problem for a fixed value of the

class prior change. In a further step the global change is found as the minimum of

the objective function values obtained from the transportation problem.
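In generic notation (mine, not the author's), the estimation idea can be sketched as the following transportation problem: for a fixed class-prior change, probability mass is moved from the old component weights w_i to the new component weights v_j at minimum cost, with c_ij a distance between old component i and new component j, and the global change is then the prior value whose transportation problem attains the smallest objective value.

\[
  \min_{T \ge 0} \; \sum_{i,j} c_{ij}\, T_{ij}
  \quad \text{subject to} \quad \sum_j T_{ij} = w_i, \qquad \sum_i T_{ij} = v_j .
\]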

The usefulness of the proposed models is demonstrated using artificial data and a

real-world dataset from the area of credit scoring.


Burn In: Estimation of p of Binomial Distribution with Implemented Countermeasures

Horst Lewitschnig1, Daniel Kurz2 and Juergen Pilz3

1Infineon Technologies Austria AG, Villach, Austria
2Institute of Statistics, Alpen-Adria-University, Klagenfurt, Austria
3Institute of Statistics, Alpen-Adria-University, Klagenfurt, Austria

Abstract

Quality is one of the key topics in the semiconductor industry. The failure rate of such devices follows the so-called bathtub curve. In their early life, they show a declining failure rate (early fails). Over lifetime, the failure rate is constant and at the end of life, in the so-called wear-out phase, the failure rate is increasing. We focus here on early fails. Early fails should not be delivered, but weeded out at the manufacturer. For that purpose, devices are operated under controlled conditions. This is called burn in. Burn in is done on a sample basis to check on the early life failure rate level. This is called a burn in study. A defined number of devices are taken, stressed and the number of failing units is counted. Based on that, the p of the binomial distribution is estimated. Typically the Clopper-Pearson interval estimation is used. The burn in study is passed if no burn in relevant fails have occurred. If a fail has occurred, a countermeasure is implemented. Let's say the countermeasure is 100% effective. If this countermeasure had been implemented before the start of the burn in study, the fail would not have occurred. Therefore this fail is not counted as burn in relevant. If the countermeasure is less than 100% effective, then the fail would have occurred with a certain probability, say 1 − α, and with probability α the fail would not have occurred.

We propose a model that estimates the p of the binomial distribution and takes the effectiveness of countermeasures into account. Several fails tackled by countermeasures are modelled by the generalized binomial distribution. A convolution algorithm is given for its calculation. The model is discussed for its decision theoretical background. We simulate different scenarios: various sample spaces are used with their respective weights to simulate the possible outcomes of the random sampling. The loss function, the risk function and the decision taken are adapted to these different scenarios. The benefit of this model is that improvement measures in the chip manufacturing process are reflected in the p of a binomial distribution, without the need of repeating the sampling experiment.

Acknowledgement: The work has been performed in the project EPT300, co-funded by grants from Austria, Germany, Italy, The Netherlands and the ENIAC Joint Undertaking.
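A convolution of this kind can be computed in a few lines; the sketch below is a generic Poisson-binomial (generalized binomial) calculation in R rather than the authors' algorithm, and the countermeasure effectiveness values are purely illustrative: fail i stays burn-in relevant with probability p_i = 1 − α_i.

## Probability mass function of the number of burn-in relevant fails, built up
## by convolving in one (independent, non-identical) Bernoulli fail at a time
gen_binom_pmf <- function(p) {
  pmf <- 1                                   # start: P(0 relevant fails) = 1
  for (pi in p) {
    pmf <- c(pmf, 0) * (1 - pi) + c(0, pmf) * pi
  }
  pmf                                        # P(K = 0), ..., P(K = length(p))
}

alpha <- c(0.9, 0.7, 0.5)                    # assumed countermeasure effectiveness
gen_binom_pmf(1 - alpha)                     # distribution of the relevant fail count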


Bayesian Model Averaging Optimisation for the

Expectation-Maximisation Algorithm

Adrian O’Hagan and Susan O’Carroll

School of Mathematical Sciences, University College Dublin, Ireland

Abstract

The Expectation-Maximisation (EM) algorithm is a popular tool for deriving max-

imum likelihood estimates for a large family of statistical models. Chief among its

attributes is the property that the algorithm always drives the likelihood uphill. A

serious pitfall is that in the case of multimodal likelihood functions the algorithm

may become trapped at a local maximum. In addition, even in cases where the

global likelihood maximum is ultimately attained, the rate of convergence may be

slow. These phenomena are often fuelled by the use of sub-optimal starting values

for initialisation of the EM algorithm.
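As a small base-R illustration of this sensitivity (and only of the sensitivity; this is not the BMA-EM procedure proposed in the talk), the following two-component univariate Gaussian EM can be run on the Old Faithful eruption times from different starting means and the converged log-likelihoods compared.

## Two-component univariate normal mixture fitted by EM
em2 <- function(x, mu, sigma = rep(sd(x), 2), w = c(0.5, 0.5), iter = 200) {
  for (i in seq_len(iter)) {
    d <- cbind(w[1] * dnorm(x, mu[1], sigma[1]), w[2] * dnorm(x, mu[2], sigma[2]))
    z <- d / rowSums(d)                           # E-step: responsibilities
    w <- colMeans(z)                              # M-step: weights, means, sds
    mu <- colSums(z * x) / colSums(z)
    sigma <- sqrt(colSums(z * outer(x, mu, "-")^2) / colSums(z))
  }
  d <- cbind(w[1] * dnorm(x, mu[1], sigma[1]), w[2] * dnorm(x, mu[2], sigma[2]))
  list(mu = mu, loglik = sum(log(rowSums(d))))    # log-likelihood at convergence
}

x <- faithful$eruptions
em2(x, mu = c(2, 4.5))$loglik     # starting means near the two modes
em2(x, mu = c(3.0, 3.1))$loglik   # compare the maximum reached from poor starts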

We introduce the use of Bayesian Model Averaging (BMA) as a method for pro-

moting algorithmic efficiency. When employed as a precursor to the EM algorithm

it can produce starting values of a higher quality than those arising from simply

employing random starts. The ensuing convergent likelihoods and associated clus-

tering solutions of observations from the BMA-EM algorithm are presented. These

are contrasted with the output arising from the use of random starts; as well as from

the model-based clustering package mclust in R, which uses a hierarchical cluster-

ing initialising step. Datasets tested include the Galaxies data and the Faithful

eruptions data.

The overall goal is to increase convergence rates to the global likelihood maximum

and/or to attain the global maximum in a higher percentage of cases.


Bayesian Stable Isotope Mixing Models

Andrew C. Parnell1, Donald L. Phillips2, Stuart Bearhop3, Brice X. Semmens4, Eric J. Ward5, Jonathan W. Moore6, Andrew L. Jackson7, Jonathan Grey8, David Kelly9 and Richard Inger3

1School of Mathematical Sciences (Statistics), Complex and Adaptive Systems Laboratory, University College Dublin, Ireland
2U.S. Environmental Protection Agency, National Health & Environmental Effects Research Laboratory, Oregon, USA
3Centre for Ecology and Conservation, School of Biosciences, University of Exeter, UK
4Scripps Institution of Oceanography, University of California, San Diego, 9500 Gilman Drive, La Jolla, California, USA
5Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, USA
6Earth2Ocean Research Group, Simon Fraser University, Burnaby, Canada
7School of Natural Sciences, Trinity College Dublin, Ireland
8Environment and Sustainability Institute, School of Biosciences, University of Exeter, UK
9School of Biological & Chemical Sciences, Queen Mary, University of London, UK

Abstract

Stable Isotope Mixing Models (SIMMs) are used to quantify the proportional con-

tributions of various sources to a mixture. The most widely used application is

quantifying the diet of organisms based on the food sources they have been ob-

served to consume. We propose and implement a multivariate statistical model

which allows for a compositional mixture of the food sources corrected for various

metabolic factors. The compositional component of our model is based on the iso-

metric log ratio (ilr) transform of Egozcue et al (2003). Through this transform we

can apply a range of time series and non-parametric smoothing relationships. We

illustrate our models with 3 case studies based on real animal dietary behaviour.
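In generic notation (simplified by me relative to the multivariate, ilr-based model in the talk), a stable isotope mixing model for isotope j of consumer i with K sources, dietary proportions p_k, source means mu_jk and metabolic correction factors c_jk has the form

\[
  y_{ij} \;=\; \sum_{k=1}^{K} p_k\,(\mu_{jk} + c_{jk}) + \epsilon_{ij},
  \qquad p_k \ge 0, \;\; \sum_{k=1}^{K} p_k = 1 ,
\]

with the simplex constraint on the proportions handled through the isometric log-ratio transform, so that time series and smoothing structures can be placed on the unconstrained ilr coordinates.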


The use of Structural Equation Modelling (SEM) to assess the proportional odds assumption of ordinal logistic regression concurrently over multiple groups

Ron McDowell1, Dr. Assumpta Ryan1, Professor Brendan Bunting2, Dr. Siobhan O’Neill

1Institute of Nursing and Health Research, University of Ulster, Coleraine

2 Psychology Research Institute, University of Ulster, Derry

Introduction: Ordinal logistic regression is used to model an ordinal dependent variable as a function of relevant covariates. The proportional odds (PO) assumption implicit within ordinal logistic regression can be easily tested within standard software packages; however, it is less straightforward to do so when analysing multigroup data. We propose using SEM to test the PO assumption of a number of variables concurrently over multiple groups, using data from 10,530 adults from N. Ireland, Portugal and Romania participating in the World Mental Health Survey Initiative (WMHSI) and the Mplus software package.

Methods: Each participant received a Guttman score for each of the eight mood and anxiety disorders of interest describing how far they had progressed through the Composite International Diagnostic Instrument (WHO-CIDI). These scores were analysed using multigroup ordinal latent class analysis and a series of ordered latent classes was obtained. After assessment of measurement invariance, participants were allocated to their most likely class. A series of correlated binary variables was constructed to reflect all possible divisions into two of the ranked latent classes, with the effects associated with age, gender, marital status and urbanicity on these binary variables constrained both within and across countries. These were examined using standard SEM techniques. Mediating effects of self-reported cognitive disability and chronic illness were also included in the analyses.

Results: We identified 6 latent classes describing progression through the structured interview. The proportional odds assumption held for age within each country and for the other variables across and within countries. Older adults were increasingly likely in N. Ireland and Portugal not to progress past the screening section for all 8 disorders, or if they did progress, not to receive a 12-month diagnosis for any disorder. Effects associated with each of the other variables were the same across countries. Whilst the effect of age mediated via chronic illness was associated with an increase in the probability of participants proceeding through the diagnostic process, this was nullified in later life by the effect of age mediated via cognitive disability, which was associated with a decrease in the probability of moving through the diagnostic process. This held regardless of country and how the latent classes were partitioned.

Conclusion: Testing the PO assumption in this way allows stronger claims about the effects associated with variables of interest to be made, not just within one group but across several. One limitation is the lack of assessment of measurement invariance for age and the other variables, due to treating most likely class membership as an observed rather than a latent variable. The implementation of generalized ordinal logistic models within Mplus, which can readily deal with multiple group analyses in many other contexts, will be beneficial. Given that so few older adults progress past the lifetime screening questions for any of the diagnoses, further sensitivity analysis of the instrument is required.


Statistical Issues in Clinical Oncology Trials

Gloria Avalos1, Alberto Alvarez-Iglesias1, Imelda Parker2 and John Newell1,3

1HRB Clinical Research Facility, NUI Galway, Ireland 2 All-Ireland Co-operative Oncology Research Group (ICORG), Ireland.

3School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland

Abstract

The number of cancer cases diagnosed daily continues to increase in Ireland and worldwide and there is an urgent need to develop more effective therapies for this disease. Clinical trials in patients remain the gold standard for clinical research in oncology. They are the key to developing more effective therapies for patients with cancer by providing them with access to treatments that are not currently available outside of the clinical research arena.

The All-Ireland Co-operative Oncology Research Group (ICORG) was established in Ireland in 1996 with aims to promote, design, conduct and facilitate clinical cancer research on the island of Ireland and has succeeded in offering research options to over 7500 patients across Ireland in the last fifteen years.

Oncology trials provide several interesting statistical challenges that are unique to this area of medical research. Recent advances in molecular pathways, genomics and cytostatic or targeted agent development have fuelled the rapid progress in clinical oncology. In parallel, developments in statistical theory, in particular Bayesian approaches, have continued to provide more powerful methods for dealing with these advances.

In this poster challenges relating to design issues, the choice of outcome and primary endpoints used, the inclusion of stopping rules for efficacy and futility, sequential designs and the implementation of Bayesian adaptive approaches will be presented.


Teaching Biostatistical Concepts to Undergraduate Medical Students, Faculty of Medicine, King Fahad Medical City

Dr. Ahmed A. Bahnassy, M.Sc., MSPH, PhD

Associate Professor of Biostatistics, Faculty of Medicine

King Fahad Medical City, King Saud Bin Abdulaziz University for Health Sciences

E-mail: [email protected]

Abstract: Medical students are considered consumers of biostatistical methods throughout their future careers. Many of them are not familiar enough with biostatistical techniques and their appropriate usage in their future fields. The widespread use of personal computers in the last two decades has made it easy for such students to apply statistical tests in their required research work without knowing the basic assumptions behind each test, or when, where and how to use each test properly in most cases. A course of basic biostatistics with computer applications has been developed to suit the basic needs of medical students, so that they become familiar with biostatistical tests and with using a statistical package to conduct the simple analyses required. The course consists of modules, each developed with both a theoretical component and a practical computer software component. The course was carried out in the Faculty of Medicine, King Fahad Medical City. Evaluation of the course showed that participants' statistical knowledge and interpretation of statistical results increased significantly between the pre- and post-course assessments (p < 0.05). Overall, females performed significantly better than males (p < 0.001). Students performed better in univariate analysis than in multivariate analysis (p = 0.035), while there were no differences by age or previous experience of statistical courses. Students scored significantly higher in practical than in theoretical exams (p = 0.023). The participants' mathematics and computer phobia were also addressed. By the end of the course, participants had gained the necessary confidence for carrying out their data analysis. This study recommends conducting such workshops, with more medical applications on the computer, as an appropriate form of training for health professionals.


Prognostic modelling for triaging patients diagnosed with prostate cancer

Marco Antonio Barragan-Martínez1, John Newell2,3 and Gabriel Escarela-Perez1

1 Universidad Autónoma Metropolitana-Iztapalapa, Mexico City, Mexico

2 HRB Clinical Research Facility, National University of Ireland, Galway, Ireland

3 School of Mathematics, Statistics and Applied Mathematics National University of Ireland, Galway, Ireland.

Abstract

My research is focused on the choice of treatment for patients with prostate cancer, because it is now very important that doctors have guidelines supported by a sound statistical model. I am working on the analysis of survival of patients diagnosed with prostate cancer in the United States from 1988 to 2008. The dataset comprises different explanatory variables such as demographics, date of diagnosis of cancer, date of death, cause of death, treatment, tumor grade, stage and tumor size. Only cases of adenocarcinoma of the prostate are considered because this is the most prevalent form of the cancer. The difficulty of this study is that several of the explanatory variables have missing data, for example the stage and grade, hence the need to use a method for handling missing data such as multiple imputation.

Multiple imputation is a statistical technique for analysing incomplete data. The main idea is to generate m > 1 plausible values for each missing value, giving m complete data sets which are analysed separately before the results are pooled across the imputed datasets. The method uses all the information in the data to ‘fill in’ the missing values, since methods that discard variables or cases with missing data can introduce bias and a loss of power.
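A typical multiple imputation workflow of this kind in R uses the mice package (one standard implementation; the abstract does not name the software to be used). The data frame seer, its columns and the Cox model below are hypothetical placeholders: the Cox fit merely stands in for the eventual Larson and Dinse mixture model.

## Impute, analyse each completed data set, and pool the results
library(mice)
library(survival)

imp <- mice(seer, m = 5, seed = 1)             # m = 5 imputed data sets
fits <- with(imp, coxph(Surv(time, status) ~ age + stage + grade + treatment))
summary(pool(fits))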

Once the data have been imputed, I will proceed to fit the model presented by Larson and Dinse (1985); this model specifies the cumulative incidence functions in terms of the cause-specific conditional survival probabilities and the probability that the event is of a specific cause. Larson and Dinse propose a fully parametric structure in which the conditional survival functions follow proportional hazards models with a piecewise exponential baseline hazard, and the cause-specific probabilities follow a multinomial model.

Reference

Larson, M.G. and Dinse, G.E. (1985). A Mixture Model for the Regression Analysis of Competing Risks Data. Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 34, No. 3, pp. 201-211.


Adaptive Bayesian inference for doubly intractable

distributions

Aidan Boland, Nial Friel

School of Mathematical Sciences and Complex Adaptive Systems Laboratory,

University College Dublin.

Abstract

There are many problems in which the likelihood function is analytically intractable.

In this situation Bayesian inference is often termed “doubly intractable” because the

posterior distribution is itself also intractable. There has been a lot of work

carried out on this class of problem including the exchange algorithm [1] and Caimo

and Friel [2] who adapted the exchange algorithm to the network graph framework

and used a population-based MCMC approach to improve mixing. However many of

these approaches are still computationally intensive and suffer from problems such

as slow convergence and poor mixing.

The auxiliary variable method on which the exchange algorithm is based involves

repeated sampling from the likelihood function. It turns out that the samples from

the likelihood function can be used to estimate the gradient of the target distribu-

tion. We explore how this information can be used in the context of a Langevin

MCMC algorithm. This method is less computationally intensive as it explores the

target distribution efficiently and does not need a population-based approach.
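For reference, the standard Metropolis-adjusted Langevin (MALA) proposal has the form below, in generic notation rather than that of the talk; in the doubly intractable setting the gradient term is the quantity estimated from the auxiliary draws from the likelihood, and the proposal is followed by the usual Metropolis-Hastings accept/reject step.

\[
  \theta' \;=\; \theta \;+\; \frac{\epsilon^2}{2}\,\nabla \log \pi(\theta)
  \;+\; \epsilon\, z, \qquad z \sim N(0, I).
\]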

We envisage that this approach may have some applicability to more general prob-

lems where ABC (Approximate Bayesian Computation) or likelihood-free inference

is used.

References

[1] I. Murray, Z. Ghahramani, D. MacKay. (2006), MCMC for doubly-intractable

distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in

Artificial Intelligence (UAI-06), Arlington, Virginia, AUAI Press.

[2] Caimo A., Friel N. (2011) Bayesian inference for the exponential random graph

model. Social Networks, 33, 41–55.


Development and Validation of a Panel of Serum

Biomarkers to Inform Surgical Intervention for Prostate

Cancer

Susie Boyce1,2, Lisa Murphy1, John M. Fitzpatrick1, T. Brendan Murphy2 and R.

William G. Watson1

1UCD School of Medicine and Medical Science, University College Dublin, Dublin 4
2UCD School of Mathematical Sciences, University College Dublin, Dublin 4

Abstract

Introduction: Prostate cancer (PCa) is the most common cancer in European and

North American men, and the third most common cause of male cancer deaths. We

have previously shown the inability of current clinical tests to accurately predict

key prostate cancer outcomes. Many studies have established that new biomarker

features are urgently required for this area. In this study, we measure nine protein

biomarkers and their ability to accurately predict prostate cancer stage.

Methods: Serum samples of 197 men diagnosed with prostate cancer collected

through the Prostate Cancer Research Consortium were used. Nine protein biomark-

ers were measured using Meso Scale Discovery's electrochemiluminescence multiplex-

ing platform. Statistically significant differences in the expression levels of each

marker for organ confined vs non-organ confined prostate cancer patients were as-

sessed using independent samples t-tests. The markers were modelled using logistic

regression and their predictive ability measured using a combination of discrimi-

nation metrics (receiver operating characteristic (ROC) curves and area under the

curve (AUC) values), calibration curves and decision curve analysis.
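An illustrative version of this modelling and evaluation step is sketched below; the marker names, the clinical variables and the data frame pca are hypothetical placeholders (the actual panel composition is undisclosed), and pROC is simply one common R package for ROC/AUC calculations.

## Logistic regression models for the panel and for the clinical variables,
## compared by the area under the ROC curve
library(pROC)

panel_fit <- glm(non_organ_confined ~ marker1 + marker2 + marker3 + marker4,
                 data = pca, family = binomial)
clinical_fit <- glm(non_organ_confined ~ age + psa + clinical_stage + gleason,
                    data = pca, family = binomial)

auc(roc(pca$non_organ_confined, fitted(panel_fit)))     # panel discrimination
auc(roc(pca$non_organ_confined, fitted(clinical_fit)))  # clinical tests model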

Results: Using logistic regression, each of the markers was modelled in isolation

and in combination to measure their ability to predict prostate cancer stage (organ

confined vs. non-organ confined). Backwards feature selection was then used to

remove redundant markers. The optimal biomarker panel was determined to con-

tain 4 markers (composition undisclosed due to patenting issues). This biomarker

panel achieves a discrimination AUC value of 0.81 indicating that the panel is highly

discriminative. The panel is also well calibrated and shows a significant benefit in a

clinical setting based on decision curve analysis. In order to compare the biomarker

panel to the current clinical tests in use, four clinical variables recorded for each

patient (age, prostate specific antigen (PSA), clinical stage based on digital rectal

exam (DRE) and biopsy Gleason Score) were included in a logistic regression model.


This clinical tests model achieves an AUC of 0.688. Finally, a model consisting of

the biomarker panel combined with the four clinical tests was developed and this

achieved an AUC value of 0.856.

Conclusion: The biomarker panel developed in this study achieves far better dis-

crimination and clinical benefit than the current clinical tests in use. This panel

can also be used in collaboration with the current clinical tests and offers a far more

accurate prediction method. This panel shows huge promise for the prostate cancer

field. The next stage in this study is external validation of the marker panel in an

Austrian cohort of patients (results of which are expected in time for the CASI 2013

Meeting).


Clustering of fishing vessel speeds using Vessel Monitoring

System data

Emily Brick1, Paula Harrison2, Gerry Sutton2,

Michael Cronin1 and Eric Wolsztynski1

1School of Mathematical Sciences, University College Cork, Ireland
2Coastal and Marine Research Centre, University College Cork, Ireland

Abstract

Satellite-based Vessel Monitoring Systems (VMS) record the speed, position and

course of fishing vessels. These recordings do not however explicitly report the

activity in which the vessel is engaged. Adequate classification of VMS data would

be useful to characterize vessel activity, particularly for traffic surveillance

and for measuring fish catch composition in specified areas. With this in view we aim to define

three vessel speed clusters, classifying vessels as either starting/stopping, fishing, or

steaming, and explore the feasibility of model-based clustering techniques to do so.

Here we report on the analysis of a dataset of approximately 500,000 Irish fishing

vessel speeds, reconstructed from [1]. Due to the multi-modal, heavy-tailed nature

of the underlying distribution, we explore various mixture modelling techniques, and

especially consider recombined Gaussian mixtures for this classification. Clustering

performance is evaluated non-parametrically in terms of goodness-of-fit, uncertainty

and cluster validity. The output unsupervised classification is ultimately compared

to current gold standards. To the best of the authors’ knowledge, the implementation

and calibration of such techniques to VMS data represents an original contribution.

All analyses were implemented in the R software environment and make significant

use of the add-on package mclust version 4 [2].
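A minimal sketch of the kind of mixture fit described above is given below, using mclust (on which the analysis is based); the simulated speed data and the simple mapping from fitted components to the three activity clusters are illustrative placeholders rather than the VMS data or the recombination procedure itself.

## Gaussian mixture fit to vessel speeds, with components chosen by BIC
library(mclust)

set.seed(1)
speeds <- c(abs(rnorm(2000, 0.5, 0.4)),      # starting/stopping
            rnorm(3000, 3.5, 0.8),           # fishing
            rnorm(3000, 10, 1.5))            # steaming

fit <- Mclust(speeds, G = 3:6)
summary(fit)

## Recombine fitted components into three activity clusters by thresholding
## the component means (a simple stand-in for the recombination step)
activity <- cut(fit$parameters$mean[fit$classification],
                breaks = c(-Inf, 2, 6, Inf),
                labels = c("start/stop", "fishing", "steaming"))
table(activity)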

References

[1] P. Harrison, M. Cronin, and G. Sutton, “Using VMS and EU logbook data to analyze

deep water fishing activity in the seas around Ireland,” in ERI Research Open Day,

2012.

[2] C. Fraley, A. Raftery, T. Murphy, and L. Scrucca, “mclust version 4 for R: Normal

mixture modeling for model-based clustering, classification, and density estimation,”

Tech. Rep. 597, Department of Statistics, University of Washington, Seattle, USA,

2012.


Validating the Academic Confidence Subscales

Emily Brick1, Kathleen O’Sullivan1 and John O’Mullane2

1Department of Statistics, University College Cork, Ireland
2Department of Computer Science, University College Cork, Ireland

Abstract

Introduction: An online student experience survey was conducted in 2013 on

all undergraduate students in University College Cork. One aspect of the survey

was the Academic Confidence Scale (ACS). The ACS contains 24 items relating to

students’ perception of academic confidence. The underlying dimensions of the ACS

have been investigated in studies in the UK, Spain and Ireland. In this study, the

factor structure of the ACS was derived and compared with the factor structures

found in these other studies. The aim of this study was to validate which factor

structure is most suitable for our undergraduate data.

Method: After screening for outliers, the dataset (n = 2029) was evenly split

into datasets A and B. Exploratory factor analysis using principal component factor

extraction with oblimin rotation was applied to dataset A. Factors with eigenvalues

> 1.0 were retained. Items with factor loadings of ≥ 0.4 on one factor and < 0.4 on

all other factors were retained. The resulting factor structure was compared to other

proposed factor structures for the ACS. To verify the stability of the factor structure,

dataset B was independently factor analysed and its factor structure compared with

that of dataset A.
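A sketch of this extraction and rotation step in R is given below; the psych package is one common implementation rather than necessarily the software used, and acs_A (a data frame of the 24 items for dataset A) is a hypothetical placeholder.

## Principal component factor extraction with oblimin rotation
library(psych)
library(GPArotation)   # provides the oblimin rotation used by psych

efa <- principal(acs_A, nfactors = 4, rotate = "oblimin")
efa$values                           # eigenvalues (retain factors with values > 1)
print(efa$loadings, cutoff = 0.4)    # retain items loading >= 0.4 on a single factor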

Results: Exploratory factor analysis identified a four factor solution using 21

items that explained 57% of the variance. These factors were examined and la-

belled ‘Preparation and Understanding’, ‘Engagement’, ‘Attendance’ and ‘Study

and Achievement’. Conducting exploratory factor analysis on dataset B, led to an

identical (same items dropped from the model, same items loading on each factor)

four factor solution, thus indicating a stable model.

Conclusion: The four factor solution proposed was very similar to existing factor

structures. Two factors ‘Attendance’ and ‘Engagement’ are identical in all proposed

structures, with different combinations of the other items leading to differently la-

belled factors. The four factor solution is stable, however to confirm if this solution

provides a better description of the data than the factor structures from other stud-

ies, a confirmatory factor analysis must be conducted.


Stochastic modelling of atmospheric re-entry

highly-energetic breakup events

Cristina De Persis1, Simon Wilson1

1School of Computer Science and Statistics

Trinity College Dublin

Ireland

Abstract

Spacecraft and rocket bodies re-enter via targeted trajectories or naturally decaying

orbits at the end of their missions.

An object entering the Earth’s atmosphere is subject to atmospheric drag forces.

The friction caused during the entry by these forces heats up the object. The

action of the aerodynamic forces, the heating of the structure, with resulting internal

structural stresses and melting of some materials, usually cause the fragmentation

of the object. In some cases it may occur, under certain physical conditions, that

the structural integrity of the object can no longer be maintained and the object

explodes.

The resulting fragments, after the explosion or the fragmentation, which impact the

Earth’s surface could cause serious damage. While there are various tools able to

detect the fragmentation of a spacecraft, the explosion process is a break-up mode

not yet adequately modelled.

First of all, I want to demonstrate how fault tree and Bayesian network theories could be applied to assess the probability of an explosion, starting from the combination of the elementary causes that can lead to its occurrence.

Next, I want to present a first attempt to model the uncertainty of these elementary causes, i.e. conditions of temperature and pressure, using an autoregressive model, and to show how Cox's proportional hazards model could be useful for this problem.


An Examination of Variable Selection Strategies for the Investigation of Tooth Wear in Children

Cathal Doherty1, Michael Cronin1 and Mairead Harding2

1Department of Statistics, University College Cork
2Oral Health Services Research Centre, University College Cork

Abstract

Tooth wear is an all-encompassing term describing the non-carious loss from the surface of the tooth due to attrition, abrasion or erosion. Attrition is the mechanical wearing of tooth against tooth, abrasion is the wearing of the tooth surface caused by friction, and erosion is the wearing by an acid which dissolves enamel and dentine (Smith 1989). Data have been collected longitudinally at three time points (5, 12 & 14 years old) and will be collected at a fourth (16 years old), for a sample of children residing in Cork city and county. This longitudinal data consists of a large number of potential predictors, correlation between predictors, diminishing sample size (202 at age 5, 123 at age 12 & 85 at age 16) and missing data.

Tooth wear (outcome variable) is recorded as a categorical variable; hence a multinomial logistic model is appropriate. Numerous variable selection techniques are identified and examined for suitability or potential development in the selection of such a model. Some techniques include Least Absolute Shrinkage and Selection Operator (LASSO)(Tibshirani 1996), the Elastic Net(Zou and Hastie 2005) and the Dantzig Selector(Candes and Tao 2007). We compare and contrast these techniques utilising simulated data, which have the same potential complexities as the longitudinal data collected over the four time points.
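As an illustration of two of these penalised approaches, the sketch below fits LASSO and elastic net multinomial models with glmnet, a standard R implementation of these penalties (the Dantzig selector would require a separate solver); the simulated predictors and three-category outcome are placeholders with complexities far milder than the real tooth wear data.

## Penalised multinomial logistic regression with cross-validated penalty
library(glmnet)

set.seed(1)
n <- 150; p <- 30
x <- matrix(rnorm(n * p), n, p)
y <- factor(sample(c("none", "mild", "moderate"), n, replace = TRUE))

lasso <- cv.glmnet(x, y, family = "multinomial", alpha = 1)    # LASSO
enet  <- cv.glmnet(x, y, family = "multinomial", alpha = 0.5)  # elastic net

coef(lasso, s = "lambda.min")   # selected coefficients per outcome category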

Candes, E. and T. Tao (2007). "The Dantzig Selector: Statistical Estimation When p Is Much Larger than n." The Annals of Statistics 35(6): 2313-2351.

Smith, B. (1989). "Toothwear: etiology and diagnosis." Dental Update 16: 204-212.

Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society. Series B (Methodological) 58(1): 267-288.

Zou, H. and T. Hastie (2005). "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society Series B-Statistical Methodology 67: 301-320.

 

 


Multivariate analysis of the biodiversity-ecosystem

functioning relationship for grassland communities.

Aine Dooley1, Forest Isbell2, Laura Kirwan3, John Connolly4, John A. Finn5 and

Caroline Brophy1

1Department of Mathematics and Statistics, National University of Ireland Maynooth, Maynooth, Co. Kildare, Ireland.
2Department of Ecology, Evolution, and Behavior, University of Minnesota, St Paul, Minnesota 55108, USA.
3Department of Chemical and Life Science, Waterford Institute of Technology, Cork Road, Waterford, Ireland.
4School of Mathematical Sciences, Ecological and Environmental Modelling Group, University College Dublin, Dublin 4, Ireland.
5Teagasc Environment Research Centre, Johnstown Castle, Co. Wexford, Ireland.

Abstract

Current methods for analysing the biodiversity-ecosystem function (BEF) relationship typically focus on a single ecosystem function (such as the biomass produced); however, biodiversity can affect multiple ecosystem functions simultaneously (multifunctionality). Analysing a single function may provide an incomplete picture of the effects

of biodiversity on ecosystem functioning. The Diversity-Interaction model [1] can be

used to model a single ecosystem function based on the species sown proportions and

how the species interact with one another. Here we extend the Diversity-Interaction

model [1] to a multivariate model. This extension allows us to explore the BEF

relationship for multifunctional ecosystems whilst also providing information about

how the ecosystem functions relate to one another. The Diversity-Interaction mul-

tivariate model also allows us to explore the relative effect of species on multiple

functions. We used this method to analyse data from a four species grassland ex-

periment and found that there was a positive effect of increasing biodiversity on

multiple ecosystem functions.
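For reference, the univariate Diversity-Interactions model of the kind given in [1] can be written as below (notation simplified by me); the extension described here adds an index for each ecosystem function and allows the errors to be correlated across functions within a plot.

\[
  y \;=\; \sum_{i=1}^{S} \beta_i P_i \;+\; \sum_{i<j} \delta_{ij} P_i P_j \;+\; \varepsilon ,
\]

where y is an ecosystem function, P_i is the sown proportion of species i, the beta_i are species identity effects and the delta_ij are pairwise interaction effects.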

References

[1] Kirwan, L. et al. (2009) Diversity-interaction modeling: estimating contributions

of species identities and interactions to ecosystem function. Ecology, 90, 2032-2038.


Classification methods for mortgage distress

Trevor Fitzpatrick

Central Bank of Ireland

Abstract

The banking crisis in Ireland has been one of the most severe since the 1970s (Laeven

and Valencia, 2012). One particularly important dimension of the Irish crisis has

been the large amount of mortgage debt originated in the boom years and the subsequent

large increase in mortgage delinquencies since the start of the crisis in 2008. The

magnitude of the current problem suggests that classification methods may be a

useful approach to determine the probability of being in arrears and to triage cases as

part of a systematic debt work-out.

This applied paper uses borrower level origination data, macroeconomic and mort-

gage payment status data for over 100,000 mortgages to compare the performance

of a number of classification approaches for current and future arrears status. The

methods explored include boosted regression trees, generalised linear and additive

logistic regression models (Berg, 2007), (Hastie et. al., 2009), (Muller, 2012).
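Illustrative versions of two of the classifiers compared here are sketched below; the data frame loans and its covariates are hypothetical placeholders for the loan-level and macroeconomic features, and the package choices (mgcv for the additive model, gbm for boosted trees) are mine rather than stated in the paper.

## Generalised additive logistic regression with smooth covariate effects
library(mgcv)
library(gbm)

gam_fit <- gam(arrears ~ s(loan_to_value) + s(interest_rate) + s(unemployment),
               family = binomial, data = loans)

## Boosted regression trees for the same binary arrears outcome (coded 0/1)
gbm_fit <- gbm(arrears ~ loan_to_value + interest_rate + unemployment,
               distribution = "bernoulli", data = loans, n.trees = 2000,
               interaction.depth = 3, shrinkage = 0.01)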

Preliminary results suggest regression trees and generalised additive models outper-

form generalised linear models based on AUC measures. The results indicate that

the various approaches have a reasonable degree of predictive power, but this be-

gins to degrade after 12-18 months. Related results from partial response analysis

suggests the presence of non-linear effects among the features considered. Overall,

the results suggests that early intervention strategies are possible using currently

available information.

References

Berg, D. (2007). 'Bankruptcy prediction by generalized additive models'. Applied Stochastic Models in Business and Industry, 23(2).

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning. Springer.

Laeven, L. and Valencia, F. (2012). 'Systemic Banking Crises Database: An Update'. IMF Working Paper, WP/12/163.

Muller, M. (2012). 'A case study on using generalized additive models to fit credit rating scores'. Paper presented to BIS-Irving Fisher Conference, Dublin.
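A minimal sketch of the kind of comparison described, assuming simulated borrower-level data with hypothetical covariates (a loan-to-value ratio and borrower age); the additive model is fitted with the mgcv package and AUC is computed from the rank (Mann-Whitney) formula. It is not the authors' data or feature set.

    ## Compare a logistic GLM and an additive logistic model (mgcv) by AUC on simulated data.
    library(mgcv)
    set.seed(2)
    n   <- 5000
    ltv <- runif(n, 0.3, 1.2)                                        # hypothetical loan-to-value ratio
    age <- runif(n, 25, 65)
    p   <- plogis(-4 + 5 * pmax(ltv - 0.8, 0) + 0.02 * (45 - age))   # non-linear effect of ltv
    arrears <- rbinom(n, 1, p)

    auc <- function(score, y) {                                      # Mann-Whitney AUC
      r <- rank(score); n1 <- sum(y == 1); n0 <- sum(y == 0)
      (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }
    m_glm <- glm(arrears ~ ltv + age, family = binomial)
    m_gam <- gam(arrears ~ s(ltv) + s(age), family = binomial)
    c(glm = auc(fitted(m_glm), arrears), gam = auc(fitted(m_gam), arrears))

In practice such a comparison would be made on held-out data and over future arrears horizons, as in the abstract.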

48


Atraumatic Restorative Treatment vs. Conventional Treatment in an Elderly Population

Amy Halton1, Michael Cronin1 and Cristiane Mendonca da Mata2

1Department of Statistics, University College Cork, Ireland
2Department of Restorative Dentistry, University Dental School and Hospital, Cork, Ireland

Abstract

Introduction: In the last twenty years, Atraumatic Restorative Treatment (ART) has been introduced into developed societies because of its minimally invasive nature, making it suitable for patients who suffer from stress and fear of dental procedures. However, to date few studies have been completed on the effectiveness of ART compared to conventional treatment in the elderly population. This study compares the one-year survival rate of ART to conventional treatment in a population aged over 65. The effects of age, gender, cavity class of the restoration and the number of restorations the patient received on the one-year survival rates were also assessed.

Method: Logistic regression was used to analyse the data. The bootstrapping-by-cluster technique was employed as there was an issue with dependence in the data, and 5000 bootstrap samples were taken. Following a literature review, two methods for dealing with quasi-complete separation in some of the bootstrap samples were applied and compared. Empirical confidence intervals were calculated since the estimates for two of the variables were not normally distributed.

Results: ART restorations were 76.7% less likely to survive over a 12-month period after accounting for age, gender, cavity class of the restoration and the number of restorations received by the patient. However, the results showed that this difference was not statistically significant. None of the other variables were found to be statistically significant.

Conclusion: Bootstrapping by cluster was used successfully to overcome the lack of independence in the data while preserving its dependence structure. 5000 was shown to be a sufficient number of samples to obtain stable parameter estimates and standard errors. Firth's adjustment for reducing the bias of maximum likelihood estimates was shown to be the best method for dealing with quasi-complete separation.
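A hedged sketch of the bootstrapping-by-cluster idea: resample whole patients (clusters) with replacement, refit the logistic regression, and take empirical percentile intervals. The data, effect sizes and the 1000 resamples below are invented for illustration (the study used 5000), and Firth's penalised likelihood (for example via the logistf package) is one way to refit resamples showing quasi-complete separation.

    ## Cluster (patient-level) bootstrap for a logistic regression on simulated restorations.
    set.seed(3)
    pat <- data.frame(id = 1:80, age = rnorm(80, 75, 6), art = rbinom(80, 1, 0.5))
    dat <- pat[rep(1:80, sample(1:4, 80, replace = TRUE)), ]          # 1-4 restorations per patient
    dat$survived <- rbinom(nrow(dat), 1, plogis(2 - 0.8 * dat$art))

    boot_art <- replicate(1000, {
      ids <- sample(unique(dat$id), replace = TRUE)                   # resample whole patients
      bd  <- do.call(rbind, lapply(ids, function(i) dat[dat$id == i, ]))
      coef(glm(survived ~ art + age, family = binomial, data = bd))["art"]
    })
    exp(quantile(boot_art, c(0.025, 0.975)))                          # empirical 95% CI as odds ratios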

49


Doubly Robust Estimation for Clinical Trial Studies

Belinda Hernández1,2,3, Andrew Parnell1, Stephen Pennington2, Illya Lipkovich3 and Michael O'Kelly3

1School of Mathematical Sciences, University College Dublin, Ireland
2School of Medicine and Medical Science, University College Dublin, Ireland
3Center for Statistics in Drug Development, Quintiles

Abstract

Missing data are a common problem in clinical trial studies. There are many reasons why data can be missing, including patient drop-out, perhaps due to side effects or lack of efficacy, or other censoring events unrelated to the study outcome such as death from an unrelated disease. Even if a patient completes a study they may still have elements of incomplete data due to measurements being missed at one or more visits. It has been widely noted that ignoring missing data can lead to biased results and incorrect conclusions. This bias may affect the comparison of treatment groups or the representativeness of the study. Because of this, missing data are an important issue that must be dealt with using appropriate statistical techniques.

Here we discuss doubly robust methods for dealing with missing data in the context of clinical trials and present a generalisation of a doubly robust method first proposed by Vansteelandt et al. [1] which provides doubly robust estimates for longitudinal data.

Doubly robust estimators combine three models: an imputation model, which regresses the response variable yi on the covariates X; a missingness model, which calculates πij, the probability of being observed for each subject i at trial visit j; and a final analysis model, which is the model that would have been used if there were no missing values in the data set. With doubly robust methods, either the imputation model or the missingness model, but not both, can be misspecified and the trialist will still obtain unbiased estimates, thus giving the analyst two opportunities to correctly specify a model and obtain valid, consistent results. Output and findings from two illustrative clinical trial datasets, using a SAS macro which performs our proposed doubly robust estimator, will also be shown.

References

[1] Vansteelandt, S., Carpenter, J. and Kenward, M. (2012). Analysis of incomplete data using inverse probability weighting and doubly robust estimators. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 6, 37-48.
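For readers unfamiliar with the mechanics, the following is a minimal cross-sectional sketch of a doubly robust (augmented inverse probability weighted) estimate of a mean under missingness at random; it is simulated, single-visit toy code and not the longitudinal generalisation or the SAS macro described above.

    ## Doubly robust (AIPW) estimate of E[y] when y is missing at random given x.
    set.seed(7)
    n <- 500
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n)                                # true mean of y is 1
    r <- rbinom(n, 1, plogis(0.5 + x))                       # r = 1 if y is observed
    y_obs <- ifelse(r == 1, y, NA)

    pi_hat <- fitted(glm(r ~ x, family = binomial))          # missingness model
    imp    <- lm(y_obs ~ x, subset = r == 1)                 # imputation (outcome) model
    m_hat  <- predict(imp, newdata = data.frame(x = x))      # predicted outcome for everyone

    y0 <- ifelse(r == 1, y_obs, 0)                           # avoid NA * 0 issues
    mu_dr <- mean(r * y0 / pi_hat - (r - pi_hat) / pi_hat * m_hat)
    mu_dr                                                    # close to 1 if either working model is right

The doubly robust property can be checked by deliberately misspecifying one of the two working models and noting that the estimate remains close to the true mean.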

50


A Conjugate Class of Utility Functions for Sequential Decision Problems

Brett Houlding1, Frank P.A. Coolen2 and Donnacha Bolger1

1Discipline of Statistics, Trinity College Dublin, Ireland.
2Department of Mathematical Sciences, Durham University, UK.

Abstract

The use of the conjugacy property for members of the exponential family of distributions is commonplace within Bayesian statistical analysis, allowing for tractable and simple solutions to problems of inference. However, despite a shared motivation, there has been little previous development of a similar property for utility functions within a Bayesian decision analysis. As such, this work explores a class of utility functions that appear to be reasonable for modeling the preferences of a decision maker in many real-life situations, but which also permit a tractable and simple analysis within sequential decision problems.

51


The association between weather and bovine tuberculosis

Renhao Jin1, Margaret Good2, Simon J. More3, Conor Sweeney1, Guy McGrath3 and Gabrielle E. Kelly1

1UCD School of Mathematical Sciences, University College Dublin, Belfield, Dublin 4, Ireland
2Department of Agriculture, Food and the Marine (DAFM), Kildare St, Dublin 2, Ireland
3Centre for Veterinary Epidemiology and Risk Analysis (CVERA), UCD School of Veterinary Medicine, University College Dublin, Belfield, Dublin 4, Ireland

Abstract

Bovine tuberculosis (bTB), caused by infection with Mycobacterium bovis, affects approximately 0.3% of cattle annually in Ireland, with 18,531 reactor cattle identified in 2011. This has major financial implications both for the farmer, whose herd is restricted from trading and whose cattle are slaughtered, and for the exchequer, which compensates the farmer and implements measures to control the disease. Climate describes the long-term variations of the atmosphere and is based on historical weather records for a particular location, usually over 30 years, while weather refers to the short-term state of the atmosphere. Climatic or weather factors could influence herd bTB occurrence in several ways. Firstly, they may affect the survival of M. bovis in the environment. Secondly, climatic factors may affect wildlife ecology, in particular badgers, and inter-species contact. Thirdly, it is known that weather factors affect cattle management, farming and food supply.

In this study, we examined the influence of weather variables on bTB occurrence in cattle herds, together with well-established risk factors, in the area known as West Wicklow in the east of Ireland. Using aggregated data collected from 2005 to 2009, maximum monthly rainfall over quarters and quarterly herd bTB incidence were found to be correlated. Logistic linear mixed models (LLMM) were then fitted to herd-level data, and a non-spatial LLMM was found to adequately describe the data. Herd bTB incidence was positively associated with annual total rainfall, herd size and a herd bTB history in the previous three years, and negatively associated with distance to the nearest badger sett.

Our models demonstrate that weather variables are associated with bTB risk in Irish cattle herds. High rainfall levels emerged as a significant predictor of bTB. We speculate that it may be possible to mitigate some of the impact of high rainfall through changes to farm management, including the additional controls of pre-movement testing and enhanced clearance procedures, as well as bio-security as a control measure. In addition to high rainfall, location and distance to the nearest badger sett, together with herd size and herd bTB history, are associated with herd bTB occurrence. The emergence of the inter-linking factors of high rainfall, location and distance to the nearest badger sett as predictors of herd bTB incidence requires further study. Changeable weather patterns and extreme weather events result in management difficulties for farmers, ecologists and veterinarians that may lead to increased levels of bovine TB in both cattle herds and badgers.
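A minimal sketch of a non-spatial logistic linear mixed model of the kind described, using the lme4 package on simulated herd-year records; the variable names (rainfall, herd size, bTB history, distance to nearest sett) mirror the abstract, but the data, coding and effect sizes are invented.

    ## Non-spatial logistic mixed model for herd bTB occurrence on simulated data (lme4).
    library(lme4)
    set.seed(6)
    n_herd <- 200; n_year <- 5
    d <- expand.grid(herd = factor(1:n_herd), year = 1:n_year)
    d$rain <- rnorm(nrow(d), 1200, 200)                      # annual total rainfall (mm)
    d$size <- rpois(nrow(d), 60)                             # herd size
    d$hist <- rbinom(nrow(d), 1, 0.2)                        # bTB history in previous three years
    d$dist <- runif(nrow(d), 0, 3)                           # distance to nearest badger sett (km)
    u <- rnorm(n_herd, 0, 0.8)                               # herd-level random effects
    eta <- -3 + 0.002 * (d$rain - 1200) + 0.01 * (d$size - 60) +
           0.8 * d$hist - 0.4 * d$dist + u[as.integer(d$herd)]
    d$tb <- rbinom(nrow(d), 1, plogis(eta))
    m <- glmer(tb ~ scale(rain) + scale(size) + hist + dist + (1 | herd),
               family = binomial, data = d)
    summary(m)$coefficients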

52


Fractal characteristics of raingauge networks in Seoul, Korea

Sun Jung, Hyung-Kyoung Joh and Jong-Sook Park

WISE project team, CATER (Centre for Atmosphere and Earthquake Research)

Abstract

Obtaining more accurate rainfall measurements is one of the most important factors for flood prediction, and this has been widely recognised as more intense floods have occurred around the world in recent decades. At the same time, the demand for higher-resolution rainfall measurement has increased, because the lack of observed rainfall limits the prediction of urban floods. This study was initiated by the WISE (Weather Information Service Engine, www.wise2020.org) project, which was launched in June 2012 and is funded by the Korea Meteorological Administration and the National Institute for Meteorological Research. One of the main aims of the project is to enhance the existing raingauge networks for Seoul, which suffered severe urban flash floods in 2010, 2011 and 2012.

There are 26 AWS (Automatic Weather System) sites, including raingauges, maintained by the KMA (Korea Meteorological Administration) in Seoul (605 km2). It has been recognised that these raingauges are insufficient to capture the intense rainfall that causes severe floods, and methods for optimising raingauge locations have been sought. This study is an attempt to identify optimal locations for newly installed raingauges using fractal analysis.

Firstly, the correlation coefficient was calculated using the coordinates of the twenty-six AWS sites in Seoul, and a regression coefficient was then estimated from the correlation coefficient. The resulting regression coefficient is regarded as a fractal dimension, which is an indicator of the areal homogeneity of two neighbouring raingauges. The fractal dimension lies in the range 0 (where all stations are distributed as a single or isolated point) to 2 (where all stations are uniformly distributed). The estimated fractal dimension will be interpreted to identify places where more or fewer raingauges are required in order to improve the predictability of urban flash floods for Seoul.

Keywords: fractal dimension, regression analysis, raingauge networks, urban flash floods
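One common way to compute a fractal dimension for a gauge network, consistent with the regression-based description above, is to regress the log of the correlation integral (the proportion of station pairs within distance r) on log r. The sketch below does this for simulated coordinates, since the 26 Seoul AWS coordinates are not reproduced here, and it is offered only as an illustration of the general technique rather than the authors' exact procedure.

    ## Correlation-dimension sketch for a raingauge network (simulated coordinates, km).
    set.seed(4)
    xy <- cbind(runif(26, 0, 25), runif(26, 0, 25))
    d  <- as.numeric(dist(xy))                           # all pairwise inter-gauge distances
    r  <- seq(5, 20, by = 2.5)                           # range of spatial scales (km)
    Cr <- sapply(r, function(s) mean(d <= s))            # correlation integral C(r)
    fit <- lm(log(Cr) ~ log(r))
    coef(fit)["log(r)"]                                  # slope = estimated fractal dimension (0 to 2)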

53


A Generalized Longitudinal Mixture IRT Model for Measuring Differential Growth in Learning Environments

Damazo T. Kadengye1, Eva Ceulemans2 and Wim Van den Noortgate3

1,2,3Centre for Methodology of Educational Research, KU Leuven, Belgium
1,3Faculty of Psychology and Educational Sciences, KU Leuven – Kulak, Belgium

Abstract

This paper describes a generalized longitudinal mixture item response theory (IRT) model that allows for detecting latent group differences in item response data obtained from electronic learning (e-learning) environments or other environments that result in a large number of items. The described model can be viewed as a combination of a longitudinal Rasch model, a mixture Rasch model and a random item IRT model, and includes some features of the explanatory IRT modelling framework. The model assumes the presence of latent classes in item response patterns, either due to initial person-level differences before learning takes place, or as a result of latent class-specific learning trajectories, or due to a combination of both, and allows for differential item functioning over the classes. A Bayesian model estimation procedure is described, and results of a simulation study are presented which indicate that the parameters are recovered well, particularly for conditions with large item sample sizes and for balanced sample designs.

Keywords: Item Response Theory, e-Learning, Modelling of Growth, Mixture Models.
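One way to write the core of such a combination, offered here only as an illustrative sketch and not as the authors' exact specification, is a Rasch model whose person ability grows with measurement occasion at a latent-class-specific rate, with class-specific (and possibly random) item difficulties:

\[
  \Pr(Y_{pit} = 1 \mid g) \;=\;
  \frac{\exp\{\theta_p + \delta_g\, t - \beta_{ig}\}}
       {1 + \exp\{\theta_p + \delta_g\, t - \beta_{ig}\}},
  \qquad
  \theta_p \sim N(0,\sigma^2_\theta), \quad \beta_{ig} \sim N(\mu_g,\sigma^2_\beta),
\]

where g indexes the latent class of person p, t the measurement occasion, and the class-specific item difficulties allow differential item functioning over the classes.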

54


Small sample confidence intervals for the skewness parameter of the maximum from a bivariate normal vector

Valentina Mameli1, Alessandra R. Brazzale1

1Department of Statistical Sciences, University of Padua, Italy

Abstract

Azzalini (1985) introduced the skew-normal (SN) distribution, which generalizes the normal distribution through an additional parameter, λ, used to regulate the skewness. The SN distribution presents some inferential issues connected to the estimation of this additional parameter; in particular, it is not easy to deal with confidence sets for it. Loperfido (2002) proved that the distribution of the maximum (or the minimum) of two jointly normally distributed exchangeable random variables with correlation coefficient ρ is a skew-normal whose skewness parameter depends on ρ. Mameli et al. (2012), using Loperfido's result and Fisher's transformation of ρ, provided an asymptotic confidence set for λ. Their simulation results showed that, when the sample size is small or moderate and the correlation coefficient ρ is close to −1, the actual coverage probability of their asymptotic confidence interval is close to the nominal coverage but its expected length becomes larger. The aim of this paper is to present higher-order likelihood-based procedures to obtain accurate confidence intervals for λ, in terms of both actual coverage and expected length, when ρ is negative and close to −1 and for small or moderate sample sizes.

References

[1] Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171-178.

[2] Loperfido, N. (2002). Statistical implications of selectively reported inferential results. Statistics & Probability Letters, 56, 13-22.

[3] Mameli, V., Musio, M., Saleau, E. and Biggeri, A. (2012). Large sample confidence intervals for the skewness parameter of the skew-normal distribution based on Fisher's transformation. Journal of Applied Statistics, 39, 1693-1702.
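For reference, Loperfido's result used above can be written explicitly for standardised margins: if (X1, X2) is exchangeable bivariate normal with correlation ρ, the maximum has the skew-normal density

\[
  f(z;\lambda) \;=\; 2\,\phi(z)\,\Phi(\lambda z),
  \qquad
  \lambda \;=\; \sqrt{\frac{1-\rho}{1+\rho}} \;\longrightarrow\; \infty
  \quad \text{as } \rho \to -1,
\]

so the skewness parameter grows without bound as ρ approaches −1, which is exactly the regime studied in this paper.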

55


A Systems Dynamic Investigation of the Long Term Management Consequences of Coronary Heart Disease Patients

Janette McQuillan1, Adele H. Marshall1 and Karen J. Cairns1

1Centre for Statistical Science and Operational Research (CenSSOR)

Queen’s University Belfast

Abstract

The incidence of coronary heart disease (CHD) increases with age and, with the proportion of the population aged over 65 in Northern Ireland anticipated to increase by approximately 42% by 2025, this is set to put a severe strain on our health care system [1]. It has been reported that the number of cases of CHD in Northern Ireland is expected to rise from 75,158 to 97,255 between 2007 and 2020 [2]. These projections highlight the urgent requirement for strategic management of such patients.

This study aims to develop a model which can identify the long-term effects of various interventions on the health care system in Northern Ireland. It will involve using a system dynamics modelling approach to evaluate the impact of both upstream and downstream policy interventions on patients with, or at risk of developing, CHD. System dynamics is a form of simulation modelling which aims to replicate the behaviour of a real-world system over a given time period and hence make inference regarding how this system will operate in the future [3]. Further work will consider the incorporation of patient length of stay and subsequent costs incurred into the model.

References

[1] http://www.nisra.gov.uk/archive/demography/population/projections/Northern%20Ireland%20Population%20Projections%202010%20-%20Statistical%20Report%20-%20FINAL.pdf (accessed on 15/03/2013)

[2] http://www.inispho.org/files/file/Making%20Chronic%20Conditions.pdf (accessed on 15/03/2013)

[3] Forrester, J. W. (1961). Industrial Dynamics. Originally published by MIT Press, Cambridge, MA; reprinted by Pegasus Communications, Waltham, MA.
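As a toy illustration of the system dynamics approach (and not the model under development), the sketch below integrates a two-stock model (population at risk and population with CHD) with simple Euler steps in R; all rates are hypothetical.

    ## Two-stock system dynamics toy model: an at-risk population flowing into a CHD stock.
    years <- 2013:2025
    at_risk <- numeric(length(years)); chd <- numeric(length(years))
    at_risk[1] <- 300000; chd[1] <- 75000
    incidence <- 0.012                                   # hypothetical annual incidence rate
    mortality <- 0.04                                    # hypothetical annual CHD mortality
    growth    <- 0.015                                   # growth of the at-risk population
    for (t in 2:length(years)) {
      new_cases  <- incidence * at_risk[t - 1]
      at_risk[t] <- at_risk[t - 1] * (1 + growth) - new_cases
      chd[t]     <- chd[t - 1] + new_cases - mortality * chd[t - 1]
    }
    data.frame(year = years, at_risk = round(at_risk), chd = round(chd))

Policy interventions enter such a model by changing the rates (for example, lowering the incidence rate for an upstream intervention) and comparing the resulting trajectories.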

56


Modelling competition between overlapping niche predators

Rafael Moral1, John Hinde2 and Clarice Demetrio1

1Departamento de Ciencias Exatas, Universidade de Sao Paulo, Brazil
2School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland

Abstract

The ring-legged earwig, Euborellia annulipes, and the Neotropical stink bug, Podisus nigrispinus, are potential biological control agents of the fall armyworm, Spodoptera frugiperda, and the leafworm, Alabama argillacea, important pests of maize and cotton. E. annulipes individuals are usually found on the soil and in the lower section of the crops. On the other hand, P. nigrispinus specimens are generally found in the mid-section of the crops, but are also found in the lower and upper sections. In that sense, these predators' niches overlap in the agroecosystem, hence there may be competition for prey. Two different experiments were used to study this. Firstly, males and females of both predators were placed in separate Petri dishes, along with one S. frugiperda caterpillar. The full experiment consisted of a completely randomized design with 33 replicates for males and 34 replicates for females. The system was observed for one hour and it was recorded whether one competitor attacked the other, as well as which predator effectively consumed the prey (the most efficient competitor). The outcome of this experiment motivated a second experiment with a slightly different set-up. This second experiment considered A. argillacea as prey, in the same situation as the first experiment, but with two caterpillar densities: 1 and 3 per dish. The experiment was installed in a randomized complete block design with a 2x2 factorial treatment structure (predator sex and density of prey), with 50 blocks. The number of attacks between the competitors was recorded, as well as which predator effectively consumed the prey. Binomial generalized linear models were fitted to the binary data (which predator attacked / effectively consumed the prey) and quasi-Poisson models were fitted to the count data (number of attacks). Preliminary results show that female specimens of E. annulipes show more aggressive behaviour than males and that the earwig is the more efficient competitor.
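A brief sketch of the two model families named above, on simulated data with hypothetical variable names; it is not the authors' dataset or final model.

    ## Binomial GLM for the binary outcome and quasi-Poisson GLM for the attack counts.
    set.seed(11)
    d <- expand.grid(sex = c("female", "male"), density = c(1, 3), block = factor(1:50))
    d$earwig_won <- rbinom(nrow(d), 1, plogis(0.8 - 0.6 * (d$sex == "male")))
    d$attacks    <- rpois(nrow(d), lambda = exp(0.5 + 0.3 * (d$density == 3)))

    m_bin <- glm(earwig_won ~ sex * density, family = binomial, data = d)          # binary outcome
    m_qp  <- glm(attacks ~ sex * density + block, family = quasipoisson, data = d) # overdispersed counts
    summary(m_qp)$dispersion                                                       # estimated dispersion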

57


Multi-dimensional partition models for monitoring freeze drying in the pharmaceutical industry

Gift Nyamundanda and Kevin Hayes

Department of Mathematics and Statistics, University of Limerick, Ireland

Abstract

Freeze drying, also known as lyophilization, is a low-temperature, batch-wise drying process used to remove water from pharmaceutical solutions, yielding solids of sufficient stability for distribution and storage. Freeze drying is a very expensive process performed through three successive, time-consuming stages. The initial step is freezing, in which the solution is frozen; this is followed by primary drying, where ice crystals are removed by sublimation; and finally, in the secondary drying step, the unfrozen water is removed by desorption under high vacuum. There is a need for methods that can determine, in-line and in real time, the end points of the different freeze drying steps, in order to reduce costs, improve process efficiency and guarantee final product quality.

Currently, spectroscopic process analysers, such as near infrared and Raman, are used in combination with chemometrics tools, such as principal component analysis (PCA) and partial least squares (PLS), to determine the endpoints of the different critical stages of freeze drying. However, such chemometrics methods are limited in that they are not based on any probability model. Hence, it is difficult to account for dependence in the time measurements and to efficiently predict, with a stated level of assurance, the endpoints of the intermediate steps of the freeze drying process. In this work, we treat the problem of determining the endpoints of the different stages of freeze drying as a multiple change-point problem, such that product partition models can be employed.
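The sketch below is a deliberately simplified single change-point version of the problem (product partition models handle an unknown number of change points probabilistically): it locates one shift in a simulated sensor trace by minimising the within-segment residual sum of squares.

    ## Single change point by minimising the residual sum of squares (simulated sensor trace).
    set.seed(5)
    y <- c(rnorm(60, mean = 0), rnorm(40, mean = 2))        # level shift at t = 60
    rss <- sapply(2:(length(y) - 1), function(k)
      sum((y[1:k] - mean(y[1:k]))^2) + sum((y[-(1:k)] - mean(y[-(1:k)]))^2))
    which.min(rss) + 1                                      # estimated end of the first segment

In the freeze-drying setting each detected change point would be a candidate endpoint of a drying stage.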

58


Expert Elicitation for Decision Tree Parameters in Health Technology Assessment

Shane O Meachair1,2, Mirko Arnold3, Gerard Lacey3 and Cathal Walsh1

1Discipline of Statistics, Trinity College Dublin, Ireland
2Mathematical Institute, Leiden University, The Netherlands
3Discipline of Computer Science, Trinity College Dublin, Ireland

Abstract

One of the attractions of Bayesian analysis is the ease with which evidence from different sources can be synthesised, particularly expert knowledge. Expert estimates of the probabilities of observing certain outcomes can be used as prior information and combined with data, or can supplement the likelihood and be combined with non-informative priors to derive a posterior distribution when direct data are unavailable or unreliable. However, expert elicitation is seldom formally applied in practice. In this example, expert elicitation arises in the context of a Health Technology Assessment (HTA) of an innovative colonoscopy enhancement tool to be used in colon-cancer screening. HTA assesses the utility, economic cost and the social, political and legal implications of introducing new health technologies. Methods for performing HTA on medical devices are not yet standardised and direct data are often unavailable. In this example, parameter estimates for cost-effectiveness were unavailable due to the early stage of development of the device. A variation of the non-parametric roulette method, as outlined in SHELF (Oakley and O'Hagan, 2010), was used to elicit probability distributions representing expert knowledge on the likely distribution of quality metrics associated with the device. An expert physician with experience in colonoscopy was familiarised with the device, and estimates of probabilities were elicited and used to parameterise Beta distributions. We present the method of elicitation and its application, along with results yielding parameter estimates for a cost-effectiveness decision tree.
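To make the elicitation step concrete, the sketch below turns invented roulette chip counts into a Beta prior by matching the mean and variance of the implied histogram; this is one simple summary consistent with the SHELF roulette idea, not the exact fitting procedure used in the study.

    ## From roulette chips to a Beta prior by the method of moments (all values hypothetical).
    bins  <- seq(0, 1, by = 0.1)                       # probability bins offered to the expert
    chips <- c(0, 0, 1, 3, 6, 8, 6, 3, 2, 1)           # chips placed in each bin by the expert
    mid   <- (head(bins, -1) + tail(bins, -1)) / 2     # bin midpoints
    w     <- chips / sum(chips)
    m     <- sum(w * mid)                              # implied mean
    v     <- sum(w * (mid - m)^2)                      # implied variance
    a <- m * (m * (1 - m) / v - 1)                     # Beta parameters by matching moments
    b <- (1 - m) * (m * (1 - m) / v - 1)
    c(alpha = a, beta = b)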

59


Perceptions of Academic Confidence

John O'Mullane1, Kathleen O'Sullivan2, Amanda Wall2 and Philip O'Mahoney2

1Department of Computer Science, University College Cork, Ireland
2Department of Statistics, University College Cork, Ireland

Abstract

Introduction: Undergraduate (UG) and taught postgraduate (PGT) students in University College Cork completed an Academic Confidence (AC) Scale, consisting of 24 items each rated on a 5-point Likert scale, used to identify students' confidence in their ability to perform academic tasks. The objective of this study is to determine the underlying dimensions of the UG and PGT AC scales and to infer the differences in perception of academic confidence between UG and PGT students.

Method: Principal component factor analyses using varimax (UG) and oblimin (PGT) rotations were carried out on the AC scale. The number of factors extracted was determined by eigenvalues greater than 1. A simple structure was desirable, where items with factor loadings ≥ 0.4 on one factor and < 0.4 on the other factors were retained. Factors defined by two or fewer items were eliminated.

Results: Factor analyses identified four UG factors, based on 23 items, labelled 'Study and Perform', 'Interact', 'Prepare and Understand' and 'Attend', and three PGT factors, based on 20 items, labelled 'Understanding and Participation', 'Commitment' and 'Preparation and Achievement'. Eight items loaded on the PGT 'Understanding and Participation'; six of these items formed the UG 'Interact' and 2 items loaded on the UG 'Prepare and Understand'. Of the 7 items loading on the PGT 'Commitment', 3 joined UG 'Attend', 2 items joined UG 'Study and Perform' and 2 items joined UG 'Prepare and Understand'. The PGT 'Preparation and Achievement' consisted of 5 items, all of which loaded on the UG 'Study and Perform'. The additional 3 items in the UG scale loaded on 'Study and Perform'.

Conclusions: There is overlap in the UG and PGT factor sets, but the combinations of items suggest different perceptions of academic confidence. PGT students have a more holistic view of academic confidence. For example, PGT 'Understanding and Participation' indicates that these students perceive items relating to understanding and participation as being connected, while UG students perceive items relating to understanding and preparation as being connected. Also, UG students perceive participation as a separate factor ('Attend'), unconnected to understanding.

60


An extension to the Goel and Okumoto model of software reliability

Seán Ó Ríordáin1 and Simon P. Wilson1

1School of Computer Science and Statistics, Trinity College, Dublin

Abstract

This work presents an extension of the Goel and Okumoto [1] software reliability model to three parameters. Previously the model has been applied to all of the bug data simultaneously, but this work splits the model at the release date and applies a different rate parameter b before the release date to the b′ after the release date. The model is then applied to Mozilla Firefox data for the rapid releases (versions 5+). We explore the variation in behaviour across successive releases of Firefox and discuss how this might be modelled.

References

[1] Amrit L. Goel and Kazu Okumoto, Time-Dependent Error-Detection Rate Model for Software Reliability and Other Performance Measures. IEEE Transactions on Reliability, 1979, Vol. R-28, No. 3, p. 206-211.
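For context, the basic two-parameter Goel-Okumoto mean value function m(t) = a(1 − e^{−bt}) can be fitted to cumulative defect counts as below; the data are simulated, and the three-parameter split at the release date (b before, b′ after) proposed in the abstract is not reproduced.

    ## Fit the two-parameter Goel-Okumoto model to simulated cumulative bug counts.
    set.seed(2)
    t <- 1:52                                             # weeks
    m <- 180 * (1 - exp(-0.06 * t))                       # true mean value function
    y <- cumsum(rpois(length(t), diff(c(0, m))))          # simulated cumulative bug counts
    fit <- nls(y ~ a * (1 - exp(-b * t)), start = list(a = 150, b = 0.05))
    coef(fit)                                             # estimated total bugs a and detection rate b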

61


A new individual tree volume model for Irish Sitka spruce with a comparison to existing Forestry Commission and GROWFOR models

Sarah O'Rourke1, Gabrielle Kelly1 and Máirtín Mac Siúrtáin2

1School of Mathematical Sciences, University College Dublin
2School of Agriculture and Food Science, University College Dublin

Abstract

Sitka spruce (Picea sitchensis) is the main species of tree in Irish forests, accounting for approximately 53% of the total forest area. A model is required for estimating the individual tree volume of Irish Sitka spruce, without the necessity of felling trees, so that the amount of thinnings to be removed, the value of the forest and forest forecasts can be estimated. The objective of this study was to develop a model for estimating the individual tree volume of Irish Sitka spruce and to compare this with existing models, namely the UK Forestry Commission single tree tariff model for Sitka spruce, still widely used in the UK and Ireland, and the Irish GROWFOR model for Sitka spruce, developed using the multivariate Bertalanffy-Richards model and used commercially.

Coillte Teoranta (The Irish Forestry Board Limited) maintains the most extensive crop structure database on Sitka spruce in the Irish Republic. The database includes many forestry thinning and spacing experiments that have involved repeated measures on trees during the period 1963 to 2006. Permanent sample plots were laid down in forests throughout the country and each tree was given a unique I.D. number. The age and diameter at breast height (DBH) of all trees were recorded at each assessment. Within each plot a subsample of trees also had their height and volume recorded. For many of these trees actual volume was also recorded as they have now been felled.

A number of models were fit to the data using least squares regression to investigate the variables that can be used to predict actual tree volume. The Box-Cox transformation method, along with diagnostic analysis, was used to improve the model. Volume estimates were also calculated from the data using the UK Forestry Commission and Irish GROWFOR models for Sitka spruce. The models were compared using the approximate volume per hectare (m3 ha−1) estimates based on predicted values from the models and their agreement with actual values. 2k-fold cross-validation was used to carry out a subset comparison for the preferred model.

We find that the existing UK and GROWFOR models underestimate the volume of timber in the Coillte sample plots (by 8.6% and 4.6% respectively), while the new model overpredicts volume by a small amount (0.38%).
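The sketch below illustrates the Box-Cox step on simulated allometric data with hypothetical variable names; it is not the authors' model, which was selected and validated against the Coillte plots and the Forestry Commission and GROWFOR predictions.

    ## Box-Cox-assisted volume regression on simulated tree data (not the Coillte data).
    library(MASS)
    set.seed(8)
    n   <- 200
    dbh <- runif(n, 10, 50)                                     # diameter at breast height (cm)
    ht  <- 8 + 0.45 * dbh + rnorm(n, sd = 2)                    # total height (m)
    vol <- 4e-05 * dbh^1.8 * ht^1.1 * exp(rnorm(n, sd = 0.1))   # volume (m^3), allometric form
    bc  <- boxcox(vol ~ dbh + ht, lambda = seq(-1, 1, 0.05), plotit = FALSE)
    lam <- bc$x[which.max(bc$y)]                                # transformation suggested by the data
    tv  <- if (abs(lam) < 1e-8) log(vol) else (vol^lam - 1) / lam
    fit <- lm(tv ~ dbh + ht)
    summary(fit)$adj.r.squared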

63


An Analysis of Lower Limits of Detection of Hepatitis C Virus

Kathleen O'Sullivan1, Liam J. Fanning2, John O'Mullane3, ST4050 Students1 and Linda Daly1

1Department of Statistics, School of Mathematical Sciences, University College Cork, Ireland
2Molecular Virology, Department of Medicine, Cork University Hospital and University College Cork, Ireland
3School of Computer Science & Information Technology, University College Cork, Ireland

Abstract

Introduction: Hepatitis C is a disease that attacks liver function, ranging from a mild illness lasting a few weeks to a serious lifelong condition that can lead to liver cirrhosis or hepatocellular carcinoma. It is estimated that 3% of the world's population is infected. Hepatitis C is sub-classified into six genotypes; we studied genotypes 1, 2 and 3. Detection involves testing a peripheral blood sample for the amount of virus material present (Viral Load (VL), IU/ml) and in this study the test results are qualitatively described as a positive response (virus detected) or Target Not Detected (TND). Detection varies with VL and becomes uncertain at lower levels of VL. This study determined the lower limit of detection (LLOD), the VL required to achieve a 95% detection (hit) rate, for genotypes 1, 2 and 3, and made comparisons with the manufacturer's values. We also investigated whether the effect of VL on hit rates differed by genotype.

Method: The laboratory tested independently validated third-party proficiency controls in which the VLs had been identified by the manufacturer for genotypes 1, 2 and 3. Data collected since 2005 provided, for each genotype, the number of replicates tested (4 – 27) at each VL (0.04 – 10,000 IU/ml) and the number of these that tested positive. LLODs were estimated by fitting probit models (Pi = Φ(β0 + β1 log10 VLi), where Pi = hit rate, Φ = cumulative standard normal distribution, β0 = intercept and β1 = slope), with model parameters estimated by maximum likelihood estimation. A likelihood ratio test investigated whether the effect of VL on hit rate differed by genotype. Pearson's chi-square goodness-of-fit test assessed model fits. Statistical significance was determined using p < 0.05.

Results: We estimated the LLODs for HCV genotypes 1, 2 and 3 as 10.9 IU/ml [95% CI: 6.2 – 30.4 IU/ml], 11.2 IU/ml [95% CI: 5.6 – 50.5 IU/ml] and 28.8 IU/ml [95% CI: 11.2 – 227.7 IU/ml] respectively. The manufacturer's stated LLOD was 8 IU/ml [95% CI: 6 – 14 IU/ml]. There was no evidence that the effect of VL differed by genotype (p > 0.05). Pearson's chi-square goodness-of-fit tests confirmed the suitability of all models fitted to the data (p > 0.05).

Conclusions: The LLODs for genotypes 1 and 2 were comparable to the manufacturer's value, but not for genotype 3. The effect of VL on hit rate did not differ by genotype. The laboratory is performing within established criteria for genotypes 1 and 2. For genotype 3, the data were limited due to the low number of replicates tested at VLs in the range 3.7 IU/ml to 37 IU/ml. The findings suggest that a power calculation should be conducted to inform the number of replicates tested at each VL, and that more levels in a restricted range of VLs (1 – 50 IU/ml) should be examined to improve the estimation of the LLODs.
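The probit fit and the LLOD calculation it implies can be sketched as follows on simulated replicate data (the proficiency-control data and the genotype comparison are not reproduced): the LLOD is the viral load at which the fitted probit curve reaches a 95% hit rate.

    ## Probit model for hit rate against log10 viral load, and the implied LLOD.
    set.seed(6)
    vl     <- c(0.5, 1, 5, 10, 25, 50, 100, 1000)            # viral load levels (IU/ml)
    n_rep  <- rep(20, length(vl))                            # replicates tested at each level
    p_true <- pnorm(-1.2 + 1.3 * log10(vl))                  # hypothetical true hit rates
    hits   <- rbinom(length(vl), n_rep, p_true)

    fit <- glm(cbind(hits, n_rep - hits) ~ log10(vl), family = binomial(link = "probit"))
    b <- coef(fit)
    llod <- 10^((qnorm(0.95) - b[1]) / b[2])                 # VL giving a 95% hit rate
    unname(llod)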

65


Predicting Magazine Subscriber Churn

Kathleen O'Sullivan1, Andrew Grannell2, John O'Mullane3 and ST4050 Students1

1Department of Statistics, University College Cork, Ireland
2Statistical Solutions, Cork, Ireland
3Department of Computer Science, University College Cork, Ireland

Abstract

Introduction: Retention of subscribers is critical to maintaining the success of magazines. It is significantly more cost-effective to retain subscribers than to acquire new ones. Predictive analytics focuses retention campaigns by predicting which subscribers are at risk of not renewing (churning) and targeting them. Without predictive targeting, a retention campaign may cost more than it gains. This study aimed to identify subscribers who were likely not to renew their subscription to a sports magazine and to create a profile of these non-renewal subscribers.

Method: Data were provided by the client on 22,265 subscribers to a sports magazine. Thirteen variables (four relating to subscription incentive, two relating to payment, one relating to renewal opportunities, three relating to the nature of the current subscription, one relating to customer location, one relating to subscription duration and one signifying whether the subscriber renewed) formed a subscriber record. Logistic regression was used to estimate the probability of a subscriber not renewing based on the variables in the subscriber record. The performance of the fitted model was assessed using diagnostic measures, including sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV), and ROC analysis.

Results: A logistic regression model based on 12 subscriber variables identified 18,174 non-renewal subscribers. The model's predictive ability was 61% (Nagelkerke R2). It correctly classified 88% of subscribers, and the sensitivity and specificity were 94% and 66% respectively. Additionally, the PPV was 91% and the NPV was 76%. The area under the ROC curve was 92%. Models with fewer subscriber variables (7, 5 and 4) were considered and demonstrated comparable classification performance.

Conclusions: The journal reader codes of predicted non-renewal subscribers were identified. This group was segmented by decile of risk and a profile of the subscribers in each decile was created, for use in targeted retention campaigns. As models with fewer subscriber variables had comparable performance, these may be used to reduce the amount of data a subscriber has to supply without loss of predictive ability.

66


Predicting Retinopathy of Prematurity in Newborn Babies

Rebecca Rollins∗, Adele H. Marshall∗ and Karen Cairns∗

∗CenSSOR, Queen’s University Belfast, Belfast, BT7 1NN

Abstract

Retinopathy of prematurity (ROP) [1] is a disease in which the retinal blood vessels of premature infants fail to grow and develop normally. It is one of the major causes of childhood blindness: globally, it is estimated that at least 50,000 children are blind from ROP, and likely many more are unilaterally blind or visually impaired. In general, about 60% of low birthweight infants will develop some degree of ROP [1]. The combined effect of increasing premature infant survival rates and the nature of the disease make predicting ROP difficult at such an early stage of life.

Research into the prediction of ROP has mainly been undertaken by clinicians who utilise statistical analyses such as chi-square tests and logistic regression to identify risk factors and subsequently predict ROP. In particular, the WINROP algorithm [2] uses longitudinal measures to monitor and predict ROP, with detection rates of 100% in those infants who required treatment and 84% of those who did not. However, this method relies on the collection of clinical measures over time and is not appropriate as a tool to indicate which infants are most at risk of ROP at birth.

The purpose of this research is to predict ROP in premature infants using information known at birth, which is provided by the Royal Victoria Hospital Belfast. Techniques explored consist of decision trees, random forests, adaboost models, support vector machines (SVMs) and neural networks. Results show that the models performed similarly overall. However, given the fairly uncommon occurrence of ROP, and the documented difficulties in predicting minority classes, the SVM had the best potential to be developed further. The SVM model has a sensitivity of 68% and an area under the ROC curve of 0.9016. Future work will involve developing the SVM model for the multiclass minority problem, allowing patients with ROP to be classified by the severity of the disease, indicating those who require treatment.

[1] Gilbert, C. Retinopathy of prematurity: A global perspective of the epidemics, population of babies at risk and implications for control. Early Human Development, 2008, 84, 77-82.

[2] Lofqvist, C., et al. Longitudinal Postnatal Weight and Insulin-like Growth Factor I Measurements in the Prediction of Retinopathy of Prematurity. Archives of Ophthalmology, 2006, 124, 1711-1718.
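A hedged sketch of an SVM for a rare binary outcome, using the e1071 package on simulated at-birth variables (gestational age and birth weight are hypothetical stand-ins for the Royal Victoria Hospital data); upweighting the minority class via class.weights is one simple device for the class imbalance discussed above.

    ## SVM for a rare binary outcome on simulated at-birth data.
    library(e1071)
    set.seed(9)
    n  <- 600
    ga <- rnorm(n, 30, 3)                                    # gestational age (weeks)
    bw <- 1000 + 120 * (ga - 25) + rnorm(n, sd = 200)        # birth weight (g)
    rop <- factor(ifelse(rbinom(n, 1, plogis(10 - 0.3 * ga - 0.003 * bw)) == 1, "yes", "no"))
    dat <- data.frame(rop, ga, bw)

    idx <- sample(n, 400)
    m <- svm(rop ~ ga + bw, data = dat[idx, ], kernel = "radial",
             class.weights = c(no = 1, yes = 5))             # upweight the rare ROP class
    pred <- predict(m, dat[-idx, ])
    table(predicted = pred, observed = dat$rop[-idx])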

67


A new approach to variable selection in the presence of multicollinearity: a simulated study

Olivier Schoni

Department of Quantitative Economics, University of Fribourg, Bd. de Perolles 90, 1700 Fribourg, Switzerland
[email protected]

Abstract

The present paper evaluates the impact of multicollinearity on automated variable selection procedures. This objective is achieved by comparing the model selection performance of different selection methods in hedonic regression models in which noise variables inducing, or not inducing, multicollinearity have been introduced. Besides analysing widespread stepwise selection methods based on information criteria, a new selection method using a multimodel approach is also examined. In order to gauge the performance of the considered selection procedures, a data generating process is simulated using real housing data. The ability of each selection method to correctly classify informative and uninformative variables is measured by means of its balanced accuracy. The proposed multimodel selection rule is shown to perform systematically better than the usual stepwise selection methods.
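A small illustration of the evaluation idea (not the paper's hedonic data-generating process or its multimodel rule): simulate one noise variable that is collinear with an informative one, run stepwise AIC selection, and score the selection by balanced accuracy.

    ## Balanced accuracy of a stepwise AIC selection on simulated collinear data.
    set.seed(4)
    n <- 200
    x1 <- rnorm(n); x2 <- rnorm(n)                           # informative variables
    z1 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)              # noise variable collinear with x1
    z2 <- rnorm(n)                                           # independent noise variable
    y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)

    full <- lm(y ~ x1 + x2 + z1 + z2)
    sel  <- step(full, trace = 0)                            # stepwise selection by AIC
    kept <- names(coef(sel))[-1]
    truth    <- c(x1 = TRUE, x2 = TRUE, z1 = FALSE, z2 = FALSE)
    selected <- names(truth) %in% kept
    sens <- mean(selected[truth]); spec <- mean(!selected[!truth])
    (sens + spec) / 2                                        # balanced accuracy of the selection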

68


Lapse prediction models

Lukas Sobisek1 and Maria Stachova2

1Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, Prague, CZ 13067, Czech Republic (e-mail: [email protected])
2Department of Quantitative Methods and Information Systems, Faculty of Economics, Matej Bel University, Banska Bystrica, SK 97590, Slovak Republic (e-mail: [email protected])

Abstract

The main objective of our contribution is to evaluate and compare different classification models that can be used to assess insurance customer risk. These models are built on a real customer data set from a Czech insurance company. The purpose of the models is to observe and evaluate policy lapses two years after inception. Bayesian models, such as Bayesian logistic regression, as well as non-Bayesian models, for example classification trees and random forests, were applied in our analysis. The models were estimated in the statistical system R and in the SPSS software.

69


Latent class model of attitude of the unemployed

Jiri Vild

Department of Statistics and Probability, University of Economics, Prague, Czech Republic

Abstract

The paper deals with the attitudes of the Czech unemployed. In the periodic Labour Force Survey, the unemployed answer questions concerning the ways they choose to find a new job: they can simply be registered at the Labour Office, they can use specialized agencies, they can contact potential employers directly, and so on. In the analysis, data from the Labour Force Survey held in spring 2011 will be used. In the first stage, a latent class model will be estimated and, by analysing its parameters, we will find specific groups of the unemployed, revealing which ways of job search they do or do not prefer. We will also examine the share of these groups to find out the overall attitude of the Czech unemployed. In the second stage, the latent class model will be extended with covariates. In latent class analysis, covariates are used to predict the probability of membership in the latent classes. Covariates such as gender, age and working experience will be incorporated into the model to find out how the attitude differs for specific subgroups of respondents.

70


Variable Selection Techniques for Multiply Imputed Data

Deirdre Wall1, Grace Callagy2, Helen Ingoldsby2, Michael J Kerin3 and John Newell1,4

1School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
2Discipline of Pathology, NUI Galway
3Discipline of Surgery, NUI Galway
4HRB Clinical Research Facility, NUI Galway, Ireland

Abstract

Missing data can be a serious problem, in particular in retrospective observational studies where the percentage of subjects with complete data can be of concern. With casewise deletion, when a subject is missing a value for one predictor the whole case (subject) is omitted as a consequence. This can result in over half the cases being deleted, even if there is as little as 10% missing in each predictor. This reduction in sample size will reduce the power of such studies to identify useful predictors.

Multiple Imputation (MI) is a popular technique for missing data problems. MI uses models based on the observed data to replace missing values with credible values. This process is repeated a number of times to create several imputed datasets. Typically MI is used at the end of the analysis, where the final model generated from the original data is fitted to each of the imputed datasets and the results combined using Rubin's Rules. An alternative approach to model selection is to identify the final model based on the imputed datasets.

For example, variable selection techniques can be applied to each imputed dataset. An appealing attribute of this approach is that power can be retained by avoiding casewise deletion. However, an extra level of complexity is added in terms of identifying a consistent set of predictors.

Methods for variable selection in multiply imputed data in the literature suggest selecting predictors using a voting system, such as selection of predictors that appear in any, half or all of the models. Another suggested method is to stack the imputed datasets and perform weighted regression using weights related to the amount of missingness present.

Results will be presented from a simulation study comparing the performance of variable selection techniques in multiply imputed data, along with a new method of imputation where random forests are used to impute a single dataset, with values imputed by averaging over many unpruned classification or regression trees. In addition, an application of variable selection in the presence of missing data, in determining a prognostic model for disease-free survival in the breast cancer cohort at University College Hospital Galway, will be presented.


71


A maintenance policy for a two-phase system in utility optimisation

Shuaiwei Zhou1, Brett Houlding1 and Simon P. Wilson1

1Discipline of Statistics, Trinity College, Dublin, Ireland.

Abstract

We consider a general system with two phases whose failure times follow a specified distribution, which possibly leads to failure prior to maintenance. Utility properties, such as the costs of maintenance, repair and failure, are incorporated into the reliability assessment for the system. The system is subject to a preventative maintenance policy with sequential maintenance decisions. We compute the sequentially optimal maintenance time with respect to the unit-time utility optimisation of the system. Numerical examples are studied.

72


Index

Ahilan S, O'Sullivan JJ, Masterson B, Demeter K, Weijer W, O'Hare G, 22
Alquier P, Butucea C, Hebiri M, Meziani K, Morimae T, 5, 6
Alvarez-Iglesias A, Newell J and Hinde J, 2
Avalos G, Alvarez-Iglesias A, Parker I and Newell L, 37
Bahnassy AA, 38
Barragan-Martinez MA, Newell J and Escarela-Perez G, 39
Boland A, Friel N, 40
Boyce S, Murphy L, Fitzpatrick JM, Murphy TB and Watson RWG, 41, 42
Brick E, Harrison P, Sutton G, Cronin M and Wolsztynski E, 43
Brick E, O'Sullivan K and O'Mullane J, 44
Buja A, Berk R, Brown L, Zhang K, Zhao L, 1
Cairns K, McMillen P, O'Doherty M and Kee F, 27
Chen B and Sinn M, 10
De Persis C, Wilson S, 45
Doan TK, Parnell AC, Haslett J, 21
Doherty C, Cronin M and Harding M, 46
Dooley A, Isbell F, Kirwan L, Connolly J, Finn JA and Brophy C, 47
Fitzpatrick T, 48
Gillespie J, McClean S, Scotney B, FitzGibbon F, Dobbs F and Meenan BJ, 7
Grimaldi M, 17
Halton A, Cronin M and Mendonca da Mata C, 49
Haslett J, 20
Hastie T, 11
Helu A and Samawi H, 18
Hernandez B, Parnell A, Pennington S, Lipkovich I and O'Kelly M, 50
Hinde J, Jorgensen B, Demetrio C and Kokonendji C, 14
Hofer V, 32
Houlding B, Coolen FPA and Bolger D, 51
Hwang WT and Kazak AE, 26
Jin R, Good M, More SJ, Sweeney C, McGrath G and Kelly GE, 52
Jung S, Joh HJ and Park JS, 53
Kadengye DT, Ceulemans E and Van den Noortgate W, 15, 54
Leisch F, 28
Lewitschnig H, Kurz D and Pilz J, 33
MacKenzie G, Xu J, 13
Mameli V, Brazzale AR, 55
Marshall AH, Zenga M, Crippa F and Merchich G, 12
McCormack K, 16
McCrink LM, Marshall AH and Cairns KJ, 23
McDowell R, Ryan A, Bunting B, O'Neill S, 36
McQuillan J, Marshall AH and Cairns KJ, 56
Moral R, Hinde J and Demetrio C, 57
Nyamundanda G and Hayes K, 58
O'Hagan A and O'Carroll S, 34
O'Meachair S, Arnold M, Lacey G and Walsh C, 59
O'Mullane J, O'Sullivan K, Wall A and O'Mahoney P, 60
Ó Ríordáin S, Wilson S, 61
O'Rourke S, Kelly G and Mac Siúrtáin M, 62, 63
O'Sullivan K, Fanning LJ, O'Mullane J, ST4050 Students and Daly L, 64, 65
O'Sullivan K, Grannell A, O'Mullane J, ST4050 Students, 66
Parnell AC, Phillips DL, Bearhop S, Semmens BX, Ward EJ, Moore JW, Jackson AL, Grey J, Kelly D and Inger R, 35
Rollins R, Marshall AH, Cairns K, 67
Salter-Townshend M and McCormick T, 3
Schoni O, 68
Scott EM, 19
Shangodoyin DK, 9
Simpkin A, Metcalfe C, Donovan JL, Martin RM, Athene Lane J, Hamdy FC, Neal DE and Tilling K, 24, 25
Sobisek L, Stachova M, 69
Sturludottir E and Stefansson G, 31
Sweeney J, O'Sullivan F, 4
Vild J, 70
Wall D, Callagy G, Ingoldsby H, Kerin MJ and Newell J, 71
White A and Murphy TB, 8
Wolsztynski E, Brick E, O'Sullivan F and Eary JF, 29
Yang H, 30
Zhou S, Houlding B and Wilson SP, 72