National University of Ireland Maynooth
33rd Conference on Applied Statistics in Ireland
15th to 17th May 2013
Westgrove Hotel, Clane, Co. Kildare
Clane Abbey
Welcome to CASI 2013!
The statistics group within the National University of Ireland Maynooth welcomes you
to the 33rd Conference on Applied Statistics in Ireland from Wednesday 15th to Friday
17th May 2013 in the Westgrove Hotel in Clane, Co. Kildare, Ireland. The conference is
the Irish Statistical Association’s forum for discussion of statistical and related issues for
Irish and International statisticians with an emphasis on both theoretical research and
practical applications in all areas of statistics.
Organising committee
Caroline Brophy
Catherine Hurley
Katarina Domijan
Áine Dooley
Alberto Caimo
Isabella Gollini
Mark O'Connell
Gráinne O'Rourke
Invited Speakers
Andreas Buja is currently Liem Sioe Liong/First Pacific Company Professor at The Wharton School, University of Pennsylvania. He has published widely on topics in statistical inference, statistical computing, data mining and data visualisation. He is co-developer of the widely used and highly influential GGobi tool for data visualisation. Andreas has also served as managing editor of the Journal of Computational and Graphical Statistics.
Trevor Hastie is Professor in Statistics and Biostatistics at Stanford University. His main research contributions have been in applied statistics, and he has co-authored two books in this area: "Generalized Additive Models" (with R. Tibshirani) and "Elements of Statistical Learning" (with R. Tibshirani and J. Friedman). He has also made contributions in statistical computing, co-editing (with J. Chambers) a book and large software library on modelling tools in the S language ("Statistical Models in S"), which form the foundation for much of the statistical modelling in R. His current research focuses on applied problems in biology and genomics, medicine and industry, in particular data mining, prediction and classification problems.
Friedrich Leisch is Professor of Applied Statistics at the University of Natural Resources and Life Sciences, Vienna. His research interests are statistical computing, market segmentation, biostatistics, econometrics, classification, cluster analysis, and time series analysis. This has led to software development and statistical applications in technology, economics, management science and biomedical research. He is Secretary General of The R Foundation for Statistical Computing and has contributed many R packages to CRAN and Bioconductor.
Marian Scott is Professor of Environmental Statistics in the School of Mathematics and Statistics at the University of Glasgow. She is an elected member of the International Statistical Institute (ISI) and a Fellow of the Royal Society of Edinburgh (RSE). Her research interests include model uncertainty and sensitivity analysis, modelling the dispersal of pollutants in the environment, radiocarbon dating and assessment of animal welfare.
CASI 2013 Programme
Wednesday 15th May 2013
11.00 Registration opens
12.30 Lunch (Assaggio Restaurant)
Session 1: Wednesday 14.00. Chair: Catherine Hurley
All talks and the poster session will be held in the Alexandra Suite.
14.00 Opening address
Professor Philip Nolan, President, NUI Maynooth
14.10 Valid post-selection inference
A. Buja, R. Berk, L. Brown, K. Zhang, L. Zhao
15.00 Survival Trees using Node Resampling
A. Alvarez-Iglesias, J. Newell, J. Hinde
15.20 Latent space models for multiview networks
M. Salter-Townshend and T. McCormick
15.40 Improved quantitative analysis of tissue characteristics in PET studies with limited uptake information
J. Sweeney and F. O'Sullivan
16.00 Tea & Coffee
Session 2: Wednesday 16.20. Chair: John Hinde
16.20 Rank penalized estimation of a quantum system
P. Alquier, C. Butucea, M. Hebiri, K. Meziani, T. Morimae
16.40 Redistributing staff for an efficient orthopaedic service
J. Gillespie, S. McClean, B. Scotney, F. FitzGibbon, F. Dobbs and B. J. Meenan
17.00 Mixed membership of experts stochastic blockmodel
A. White and T.B. Murphy
17.20 Robust estimation of crosscovariance and specification of transfer function model in the presence of multiple outliers in leading indicator series
D.K. Shangodoyin
17.40 Smarter city predictive analytics using generalized additive models
B. Chen and M. Sinn
Poster Session and Drinks Reception: Wednesday 18.45.
20.15 Dinner (Onyx Bar)
22.00 Live traditional Irish music in Oak bar
Thursday 16th May 2013
Session 3: Thursday 9.00. International Year of Statistics Session. Chair: Caroline Brophy
9.00 Sparse Linear Models
T. Hastie
9.50 The use of multilevel models to represent patient outcome from geriatric wards: An Italian case study
A. H. Marshall, M. Zenga, F. Crippa and G. Merchich
10.10 Covariance modelling for multivariate longitudinal data
G. MacKenzie and J. Xu
10.30 Quasi-likelihood estimation for Poisson-Tweedie regression models
J. Hinde, B. Jorgensen, C. Demetrio and C. Kokonendji
10.50 Tea & Coffee
Session 4: Thursday 11.10. Chair: Tony Fitzgerald
11.10 Application of random item IRT models to longitudinal data from electronic learning environments
D. T. Kadengye, E. Ceulemans and W. Van den Noortgate
11.30 Graduation of crude mortality rates for the Irish life tables
K. McCormack
11.50 Advanced analytics in fraud & compliance
M. Grimaldi
12.10 Inferences on inverse Weibull distribution based on progressive censoring using EM algorithm
A. Helu and H. Samawi
12.50 Lunch (Assaggio Restaurant)
Session 5: Thursday 14.00. Chair: Gabrielle Kelly
14.00 Statistical challenges in describing a complex aquatic environment
E.M. Scott
14.50 What does statistics tell us about the palaeo-climate?
J. Haslett and A. Parnell
15.10 Reconstruct climate history at multiple locations given irregular observations in time
T. K. Doan, A. C. Parnell and J. Haslett
15.30 A threshold based discounting mechanism in the revised EU Bathing Water Directive
S. Ahilan, J.J. O'Sullivan, B. Masterson, K. Demeter, W. Weijer, G. O'Hare
15.50 Tea & Coffee
Session 6: Thursday 16.10. Chair: Adele Marshall
16.10 The impact of outliers in a joint model setting
L. M. McCrink, A. H. Marshall and K. J. Cairns
16.30 Longitudinal PSA reference ranges: choosing the underlying model of age-related changes
A. Simpkin, C. Metcalfe, J. L. Donovan, R.M. Martin, J. Athene Lane, F.C. Hamdy, D. E. Neal and K. Tilling
16.50 Identifying psychosocial risk classes in families of children newly diagnosed with cancer: a latent class approach
Wei-Ting Hwang and A. E. Kazak
17.10 Using multivariate exploratory data analysis techniques to build multi-state Markov models: predicting life expectancy with and without cardiovascular disease
K. Cairns, P. McMillen, M. O'Doherty and F. Kee
17.30 ISA AGM
20.00 Conference Dinner (Alexandra Suite)
Friday 17th May 2013
Session 7: Friday 9.00. Chair: John Newell
9.00 Resampling methods for exploring cluster stability
F. Leisch
9.50 Clustering PET volumes of interest for the analysis of driving metabolic characteristics
E. Wolsztynski, E. Brick, F. O'Sullivan and J.F. Eary
10.10 A random walk interpretation of diffusion rank
H. Yang
10.30 Changepoint model with autocorrelation
E. Sturludottir and G. Stefansson
Session 8: Friday 11.10. Chair: Sally McClean
11.10 Modelling global and local changes of distributions to adapt a classification rule in the presence of verification latency
V. Hofer
11.30 Burn in: estimation of p of Binomial distribution with implemented countermeasures
H. Lewitschnig, D. Kurz and J. Pilz
11.50 Bayesian model averaging optimisation for the expectation-maximisation algorithm
A. O'Hagan and S. O'Carroll
12.10 Bayesian stable isotope mixing models
A. C. Parnell, D. L. Phillips, S. Bearhop, B. X. Semmens, E. J. Ward, J.W. Moore, A. L. Jackson, J. Grey, D. Kelly and R. Inger
12.30 The use of structural equation modelling (SEM) to assess the proportional odds assumption of ordinal logistic regression concurrently over multiple groups
R. McDowell, A. Ryan, B. Bunting, S. O'Neill
12.50 Lunch (Assaggio Restaurant)
Poster Session: Wednesday 18.45.
P2 Statistical issues in clinical oncology trials
G. Avalos, A. Alvarez-Iglesias, I. Parker and J. Newell
P3 Teaching biostatistical concepts to undergraduate medical students
A. A. Bahnassy
P4 Prognostic modelling for triaging patients diagnosed with prostate cancer
M. A. Barragan-Martinez, J. Newell and G. Escarela-Perez
P5 Adaptive Bayesian inference for doubly intractable distributions
A. Boland and N. Friel
P6 Development and validation of a panel of serum biomarkers to inform surgical intervention for prostate cancer
S. Boyce, L. Murphy, J. M. Fitzpatrick, T. B. Murphy and R. W. G. Watson
P7 Clustering of fishing vessel speeds using vessel monitoring system data
E. Brick, P. Harrison, G. Sutton, M. Cronin and E. Wolsztynski
P8 Validating the academic confidence subscales
E. Brick, K. O'Sullivan and J. O'Mullane
P9 Stochastic modelling of atmospheric re-entry highly-energetic breakup events
C. De Persis and S. Wilson
P10 An examination of variable selection strategies for the investigation of tooth wear in children
C. Doherty, M. Cronin and M. Harding
P11 Multivariate analysis of the biodiversity-ecosystem functioning relationship for grassland communities
A. Dooley, F. Isbel, L. Kirwan, J. Connolly, J. A. Finn and C. Brophy
P12 Classification methods for mortgage distress
T. Fitzpatrick
P13 A-Traumatic restorative treatment vs. conventional treatment in an elderly population
A. Halton, M. Cronin and C. Mendonca da Mata
P14 Doubly robust estimation for clinical trial studies
B. Hernandez, A. Parnell, S. Pennington, I. Lipkovich and M. O'Kelly
P15 A conjugate class of utility functions for sequential decision problems
B. Houlding, F.P.A. Coolen and D. Bolger
P16 The association between weather and bovine tuberculosis
R. Jin, M. Good, S. J. More, C. Sweeney, G. McGrath and G. E. Kelly
P17 Fractal characteristics of raingauge networks in Seoul, Korea
S. Jung, H.K. Joh and J.S. Park
P18 A generalized longitudinal mixture IRT model for measuring differential growth in learning environments
D. T. Kadengye, E. Ceulemans and W. Van den Noortgate
P20 Small sample confidence intervals for the skewness parameter of the maximum from a bivariate normal vector
V. Mameli, A. R. Brazzale
P21 A systems dynamic investigation of the long term management consequences of coronary heart disease patients
J. McQuillan, A. H. Marshall and K. J. Cairns
P22 Modelling competition between overlapping niche predators
R. Moral, J. Hinde and C. Demetrio
P23 Multi-dimensional partition models for monitoring freeze drying in pharmaceutical industry
G. Nyamundanda and K. Hayes
P24 Expert elicitation for decision tree parameters in health technology assessment
S. O Meachair, M. Arnold, G. Lacey and C. Walsh
P25 Perceptions of academic confidence
J. O'Mullane, K. O'Sullivan, A. Wall and P. O'Mahoney
P26 An extension to the Goel and Okumoto model of software reliability
S. O Riordain and S. Wilson
P27 A new individual tree volume model for Irish Sitka spruce with a comparison to existing Forestry Commission and GROWFOR models
S. O'Rourke, G. Kelly and M. Mac Siurtain
P28 An analysis of lower limits of detection of hepatitis C virus
K. O'Sullivan, L.J. Fanning, J. O'Mullane, ST4050 Students and L. Daly
P29 Predicting magazine subscriber churn
K. O'Sullivan, A. Grannell, J. O'Mullane and ST4050 Students
P32 Predicting retinopathy of prematurity in newborn babies
R. Rollins, A. H. Marshall, K. Cairns
P33 A new approach to variable selection in presence of multicollinearity: a simulated study
O. Schoni
P34 Lapse prediction models
L. Sobisek and M. Stachova
P35 Latent class model of attitude of the unemployed
J. Vild
P36 Variable selection techniques for multiply imputed data
D. Wall, G. Callagy, H. Ingoldsby, M. J. Kerin and J. Newell
P37 A maintenance policy for a two-phase system in utility optimisation
S. Zhou, B. Houlding and S. P. Wilson
CASI 2013 Abstracts
Valid Post-Selection Inference
Andreas Buja, Richard Berk, Larry Brown, Kai Zhang, Linda Zhao
Statistics Department, The Wharton School, University of Pennsylvania
Abstract
It is common practice in statistical data analysis to perform data-driven variable
selection and derive statistical inference from the resulting model. Such inference
enjoys none of the guarantees that classical statistical theory provides for tests and
confidence intervals when the model has been chosen a priori. We propose to produce
valid “post-selection inference” by reducing the problem to one of simultaneous
inference and hence suitably widening conventional confidence and retention intervals.
Simultaneity is required for all linear functions that arise as coefficient estimates in
all submodels. By purchasing “simultaneity insurance” for all possible submodels,
the resulting post-selection inference is rendered universally valid under all possible
model selection procedures. This inference is therefore generally conservative for
particular selection procedures, but it is always less conservative than full Scheffé
protection. Importantly, it does NOT depend on the truth of the selected submodel,
and hence it produces valid inference even in wrong models. We describe the
structure of the simultaneous inference problem and give some asymptotic results.
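As a toy illustration of the problem the abstract addresses (this is not the authors' PoSI procedure, and all parameters below are illustrative): if we select, from p independent coefficient estimates, the one with the largest |estimate| and then report its naive 95% interval, coverage of the true value drops well below 95%.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_sim, z95 = 10, 2000, 1.96
covered = 0
for _ in range(n_sim):
    # p coefficient estimates, true values all 0, standard error 1
    beta_hat = rng.normal(size=p)
    j = int(np.argmax(np.abs(beta_hat)))   # data-driven selection
    # naive 95% CI for the selected coefficient: covers 0 iff |estimate| < 1.96
    if abs(beta_hat[j]) < z95:
        covered += 1
coverage = covered / n_sim
print(f"naive coverage after selection: {coverage:.3f}")
```

Because selection picks the most extreme of p estimates, the naive interval covers the truth with probability roughly 0.95^p (about 0.60 for p = 10), which is exactly the gap that simultaneous widening is designed to close.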
Survival Trees using Node Resampling
Alberto Alvarez-Iglesias1, John Newell2,3 and John Hinde3
1 HRB Clinical Research Facility, NUI Galway, Ireland
2 HRB Clinical Research Facility, NUI Galway, Ireland
3 School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
Abstract
Tree based methods are a popular non-parametric approach for classification and regression problems in biostatistics. Extensions of the Breiman et al. (1984) CART recursive partitioning algorithm have been proposed for tree based modelling of time to event data, where typically the logrank statistic is used as a measure of between-node separation. One of the drawbacks of the CART recursive partitioning algorithm is the issue of variable selection bias: splits are favoured for covariates with many possible splits, regardless of whether the corresponding predictor is associated with the response or not. More recently, Hothorn et al. (2006) presented a unified framework for tree based modelling using unbiased recursive partitioning based on conditional inference, where stopping criteria are based on a series of hypothesis tests. This modified version of the recursive partitioning algorithm was developed to overcome the problem of variable selection bias and to avoid overfitting, as pruning is automatic. This method, however, has some negative implications in relation to the identification of interaction effects, as will be demonstrated. This is a major drawback, since the identification of interaction effects is one of the features that make tree based methods attractive. In this presentation, a novel method for growing survival trees will be explored that is unaffected by variable selection bias, correctly identifies interactions and allows automatic pruning. This so-called node re-sampling algorithm uses bootstrapping at the node level to generate the different splits of the tree. The selection of the primary and surrogate splits is made using a relative importance plot, which is based on the out-of-bag values of the logrank test statistic (those observations not included in each bootstrap replicate of the data). One of the novel features of node re-sampling is the possibility of pruning a saturated tree interactively.
To facilitate this, a new graphical user interface has also been developed and some of its functionality will be demonstrated. Examples of survival data arising from observational studies in coronary care and breast cancer are used to illustrate.
References
Breiman, L., Friedman, J. H., Stone, C. J., & Olshen, R. A. (1984). Classification and Regression Trees. Boca Raton, Florida: Chapman & Hall/CRC.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.
Latent Space Models for Multiview Networks
Michael Salter-Townshend1 and Tyler McCormick2
1 Clique Cluster, UCD, Ireland
2 C.S.S.S., University of Washington, USA
Abstract
Social network analysis is a rapidly expanding field that deals with interactions
between individuals or groups. The literature has tended to focus on single network
views, i.e. networks comprised of a group of nodes with a single type of link
between node pairs. However, nodes may interact in different ways with the same
alters. For example, on Twitter one user may retweet, follow, list or message another
user. There are thus four separate networks to consider. Current approaches include
examining all network views independently or aggregating the different views into a
single super network. Neither of these approaches is satisfying, as the interaction
between relationship types across network views is not explored.
We are motivated by an example consisting of the census of 75 villages in the
Karnataka province in India. The data were collated for use by a microfinance
company, and 12 different link types are recorded. We develop a novel method for
joint modelling of multiview networks as follows: we begin with the popular latent
space model for social networks and then extend the model to multiview networks
through the addition of a matrix of interaction terms. The theory behind this
extension is due to emerging work on multivariate Bernoulli models. We first present
the theory behind our new model. We then explore the relationship between the
interaction terms and the correlation of the links across network views, and finally
we present results for the Karnataka dataset.
Inference is challenging, and we adopt the No-U-Turn sampler, a variant of
Hamiltonian Monte Carlo, for Bayesian inference.
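For readers unfamiliar with the base model, a minimal sketch of the single-view latent space model that the abstract builds on (the function name is ours; the multiview extension's interaction-term matrix is not shown):

```python
import numpy as np

def link_prob(alpha, z_i, z_j):
    """Latent space model for a single network view: the log-odds of a link
    between actors i and j is an intercept minus their Euclidean distance in
    a latent social space, so nearby actors connect more often."""
    eta = alpha - np.linalg.norm(np.asarray(z_i, float) - np.asarray(z_j, float))
    return 1.0 / (1.0 + np.exp(-eta))
```

In a multiview setting each view k would have its own linear predictor, and the extension described above couples the views through additional interaction terms.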
Improved quantitative analysis of tissue characteristics in
PET studies with limited uptake information.
James Sweeney & Finbarr O’Sullivan
School of Mathematical Sciences, University College Cork, Ireland
Abstract
Quantitative analysis of FDG uptake is important in oncologic PET studies in order
to determine whether a tissue is malignant or benign, and in attempting to predict
the aggressiveness of an individual tumour. Following injection of the FDG tracer,
a series of scans are taken of a region of interest; this provides dynamic information
from which we draw inference on the metabolic parameters describing the evolution
of tracer activity and hence underlying tissue characteristics.
However, in certain cases only a single static scan of a region of interest may be
available. A prime example is a “late uptake scan” where a body region is analysed
for only a very brief period, providing an extremely sparse and potentially noisy data
set from which it is difficult to draw conclusions on underlying tissue characteristics.
In this talk we explore the impact and benefit of incorporating prior tissue
information, via penalty structures in our data models, to improve prediction
outcomes in the presence of limited uptake information. Specifically, we show
that our proposal appears to offer an extremely promising alternative to existing,
competing methodologies.
Rank penalized estimation of a quantum system
Pierre Alquier1, Cristina Butucea2,3, Mohamed Hebiri2, Katia Meziani4
and Tomoyuki Morimae5
1 School of Mathematical Sciences, UCD, Ireland
2 LAMA, Université de Marne-la-Vallée, France
3 CREST, ENSAE, France
4 CEREMADE, Université Paris Dauphine, France
5 Department of Physics, Imperial College London, UK
Abstract
The estimation of low-rank matrices has recently received a lot of attention in
statistics, for example in marketing and recommendation systems; see e.g. the
Netflix challenge (http://www.netflixprize.com/).
In this work, we deal with another application where one faces the problem of
low-rank matrix estimation: quantum tomography. In quantum physics, a physical
system is represented by a (finite or infinite) positive semi-definite matrix ρ,
with complex coefficients, such that ρ* = ρ and tr(ρ) = 1. This matrix is called
the density matrix of the system. Let X be a physical quantity of interest, e.g.
the energy level or the spin of a particle. X is associated to a matrix
X = ∑_a λ_a u_a u_a*, called an observable, in the following way:

∀a, Prob(X = λ_a) = tr(u_a u_a* ρ). (1)
We are here interested in systems of q trapped ions; for each ion, three
{−1,+1}-valued observables are available. This system is known as a q-qubit.
In this case, it is known that the dimension of ρ is 2^q × 2^q. The objective
is to estimate in what state ρ a given experimental device “puts” the ions.
This is highly relevant for applications: being able to prepare q-qubits in
given states is a necessary condition to realize a quantum computer. The
problem is that as q grows, the size of the parameter space (2^{2q}) prevents
reasonable estimation. On the other hand, the assumption that the matrix ρ
has a low rank makes sense; see e.g. the discussion in Guta, Kypraios and
Dryden (2012). This allows a considerable reduction of the dimension of the
parameter space.
As a first step, we build a moment estimator ρ̂ of ρ based on (1). In order to
take the low-rank information into account, we define the rank-penalized estimator

ρ̂_ν = argmin_ρ [ ‖ρ̂ − ρ‖²_F + ν · rank(ρ) ]   (2)

where ‖A‖_F is the Frobenius norm of the matrix A, ‖A‖_F = √tr(A*A). From a
computational point of view, contrary to the likelihood-based estimators used in
previous works, it is possible to compute all the ρ̂_ν for ν > 0 in a reasonable
time, thanks to a singular value decomposition of ρ̂. From a theoretical
perspective, we prove that, when ν is set to a (known) value ν₀, with large
probability:
1. ‖ρ̂_ν − ρ‖²_F ≤ c · rank(ρ) · q · (4/3)^q / n, where n is the number of
observations (experiments) and c is a constant. So, the smaller the rank of ρ,
the easier the estimation task.
2. under additional assumptions, rank(ρ̂_ν) = rank(ρ).
The proposed methodology is illustrated with both simulated and real experimental
data, with promising results.
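The minimiser in (2) has a closed form obtained by hard-thresholding the singular values of the moment estimator; a minimal numerical sketch (the function name is ours, not the authors' code):

```python
import numpy as np

def rank_penalized(rho_hat, nu):
    """Minimise ||rho_hat - rho||_F^2 + nu * rank(rho).
    The best rank-r approximation is the truncated SVD, costing
    nu*r plus the sum of squared discarded singular values, so the
    optimum keeps exactly those sigma_i with sigma_i^2 > nu."""
    U, s, Vh = np.linalg.svd(rho_hat)
    keep = s**2 > nu
    return (U[:, keep] * s[keep]) @ Vh[keep, :]
```

For example, a nearly rank-one density matrix is reduced to exact rank one for a moderate ν, which is the mechanism behind recovering rank(ρ) under the conditions in point 2 above.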
Redistributing Staff for an Efficient Orthopaedic Service
Jennifer Gillespie1, Sally McClean1,2, Bryan Scotney2, Francis FitzGibbon1, Frank
Dobbs3 and Brian J Meenan2
1 Multidisciplinary Assessment of Technology Centre for Healthcare Programme, University of Ulster, Northern Ireland, UK
2 School of Computing and Information Engineering, University of Ulster, Northern Ireland, UK
3 Faculty of Life and Health Science, University of Ulster, Northern Ireland, UK
Abstract
Musculoskeletal disorders affect over 100 million people in Europe each year, and in
the United Kingdom musculoskeletal complaints are cited as the reason that 60%
of people are on long-term sick leave. With our population ageing these figures are
expected to rise, and orthopaedic staff are under pressure to manage the increasing
number of referrals. The orthopaedic Integrated Clinical Assessment and Treatment
Service (ICATS) has found that queues are currently building up in the department.
The main reason for this is that staff are inefficiently distributed. In this
poster we present two approaches based on classic queueing theory, which efficiently
distribute staff to minimise the overall waiting time. The first is an Exhaustive
Search Algorithm (ESA) which calculates the overall expected waiting time for every
combination of staff in the department. This approach is known to find an optimal
solution, but it also has a long execution time. The second approach uses a Genetic
Algorithm (GA) which aims to find a solution as close to the optimum as possible, in
a shorter time frame. These approaches have been applied to orthopaedic ICATS,
to find an appropriate number of staff in each state for the department to reach
steady state. The distribution of staff, the overall expected waiting time, and the
computation time, have been compared to assess which approach is more suitable
for orthopaedic ICATS. The results show that the ESA finds a minimum overall
expected waiting time, and the distribution of staff is very similar to the GA.
However, the execution time of the ESA is very large. Orthopaedic ICATS is only a small
department with a limited number of staff to distribute; therefore, the computation
time of the ESA would increase significantly for a larger department. In conclusion,
we recommend that a GA would be an appropriate approach to efficiently distribute
the staff through similar healthcare departments.
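The exhaustive search over staff allocations can be sketched with the standard M/M/c (Erlang C) expected-wait formula; the stage arrival and service rates below are hypothetical, not the ICATS data:

```python
import math
from itertools import product

def erlang_c_wait(lam, mu, c):
    """Expected queueing wait Wq in an M/M/c queue (Erlang C formula).
    Returns inf if the queue is unstable (lam >= c * mu)."""
    a = lam / mu
    if a >= c:
        return math.inf
    rho = a / c
    num = a**c / math.factorial(c)  # Erlang C numerator
    den = (1 - rho) * sum(a**k / math.factorial(k) for k in range(c)) + num
    p_wait = num / den              # probability an arrival must queue
    return p_wait / (c * mu - lam)

def exhaustive_search(stages, total_staff):
    """Try every split of total_staff across stages (each stage gets >= 1)
    and return the allocation minimising total expected waiting cost."""
    best, best_alloc = math.inf, None
    for alloc in product(range(1, total_staff + 1), repeat=len(stages)):
        if sum(alloc) != total_staff:
            continue
        cost = sum(lam * erlang_c_wait(lam, mu, c)
                   for (lam, mu), c in zip(stages, alloc))
        if cost < best:
            best, best_alloc = cost, alloc
    return best_alloc, best
```

As the abstract notes, this enumeration is guaranteed to find the optimum but its run time grows combinatorially with staff and stages, which is what motivates the genetic algorithm alternative.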
Mixed Membership of Experts Stochastic Blockmodel
Arthur White1, Thomas Brendan Murphy1
1School of Mathematical Sciences,
University College Dublin,
Ireland.
Abstract
Network analysis is the study of relational data between a set of actors who can share
links between each other. Classic examples include friendship, sexual interaction,
and professional collaboration, while more recent examples include email exchanges,
Facebook statuses and other interaction via social media. Empirical studies of net-
works have shown that links are often formed in a highly dependent manner. One
commonly occurring feature of such analyses is referred to as homophily by attributes,
whereby actors with attributes in common are more likely to share links.
The stochastic blockmodel (SBM) is a flexible modelling technique for network
analysis, whereby actors are partitioned into blocks, latent groups which exhibit
different connective properties. Conditional on block membership, the probability
distribution for a link between two actors is modelled with respect to a global pa-
rameter. One extension to the SBM is the mixed membership stochastic blockmodel
(MMSBM). This allows actors partial membership to several blocks, reflecting the
multiple role and community memberships often exhibited by actors in real world
networks.
We introduce the mixed membership of experts stochastic blockmodel (MMESBM),
an extension to the MMSBM. This model incorporates covariate actor information
into the existing model, facilitating direct analysis of the impact covariates have
on the network, rather than having to use them to conduct a post-hoc analysis of
some kind. The method is illustrated with an application to the Lazega Lawyers
strong coworker dataset, where the impact that covariates such as status, gender,
and age have on the network is examined.
Robust estimation of crosscovariance and specification of transfer function model in the presence of multiple outliers in leading indicator series
D. K. Shangodoyin
University of Botswana, Botswana
Email: [email protected]
Abstract
This paper considers the effect of outliers on transfer function modelling in the context of autocovariance and cross-covariance as identification and model specification tools. We establish that outlier input series significantly affect the mean and variance of the cross-covariance, as their asymptotic convergence is influenced when the original series is classified as a 2-dimensional random field in the presence of outliers. Robust estimates of the cross-covariance that accommodate outlier input series in the transfer function process are proposed.
Keywords: outlying observations, transfer function, crosscovariance, jackknife estimate, alternate random group method, leading indicator.
Smarter city predictive analytics using Generalized Additive Models
Bei Chen and Mathieu Sinn
IBM Research - Ireland
Abstract
Establishing efficient energy and transportation systems is a key challenge for accommodating the fast-growing population living in cities. While all major cities worldwide are facing energy dependence, air pollution and traffic problems, providing fast and accurate predictions of energy and transportation systems is a stepping stone to improving the efficiency and sustainability of the city. In this talk, we present a class of algorithms which use Generalized Additive Models (GAMs) (Hastie and Tibshirani, 1990) for predictive analytics in smarter city applications. The first application is short-term electricity load forecasting on various aggregation levels in the electric grid. We show results for highly aggregated series (national and regional demand), clusters of smart meters, and data from individual buildings. We also present an adaptive method which updates the GAM model parameters using a Recursive Least Squares filter. Experiments with real data demonstrate the effectiveness of this approach for tracking trends, e.g., due to macroeconomic changes or variations in the customer portfolio. The second part of this talk focuses on multi-modal transportation networks. The bike sharing system is one of the major modes of the Dublin transportation network. We demonstrate how the GAM algorithm solves a two-stage prediction problem: (1) How many bikes (or free bike-stands) will be available at a given time point in the future? (2) If the predicted number of bikes (or free bike-stands) is zero, how long will the waiting time be? The GAM approach takes into account exogenous factors such as weather or the time of day, and yields superior performance compared to state-of-the-art methods. Besides bike systems, our algorithm can be applied to any shared mobility scheme, such as car parking, shared cars, etc.
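The adaptive update mentioned above can be sketched as a standard Recursive Least Squares filter applied to the coefficients of a fixed basis expansion (the class name and parameters are ours; the authors' exact formulation is not given in the abstract):

```python
import numpy as np

class RecursiveLeastSquares:
    """RLS filter with forgetting factor lam: after each new (x, y) pair the
    coefficient vector tracks the exponentially weighted least-squares fit,
    allowing the model to adapt to slow trend changes."""
    def __init__(self, dim, lam=0.99, delta=1000.0):
        self.w = np.zeros(dim)
        self.P = delta * np.eye(dim)  # large initial "covariance" => fast start
        self.lam = lam

    def update(self, x, y):
        x = np.asarray(x, float)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)            # gain vector
        self.w = self.w + k * (y - x @ self.w)  # prediction-error correction
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.w
```

In the load-forecasting setting, x would be the evaluated spline basis of the GAM's smooth terms (e.g. time of day, temperature) rather than raw features.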
Sparse Linear Models
Trevor Hastie
Statistics Department
Stanford University, California, USA
Abstract
In a statistical world faced with an explosion of data, regularization has become
an important ingredient. In many problems, we have many more variables than
observations, and the lasso penalty and its hybrids have become increasingly useful.
This talk presents a general framework for fitting large scale regularization paths for
a variety of problems. We describe the approach, and demonstrate it via examples
using our R package GLMNET∗.
∗GLMNET is produced jointly with Jerome Friedman, Rob Tibshirani and Noah
Simon, all of Stanford University
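The coordinate-wise update at the heart of glmnet-style lasso solvers can be sketched as follows; this is a simplified version (no standardization, warm starts, or active-set tricks) and the function names are ours:

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator S(z, g) = sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso: minimise (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic
    coordinate descent, the core update used in glmnet-style solvers."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove all fitted effects except coordinate j
            r = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r / n, lam) / col_sq[j]
    return b
```

A regularization path as computed by GLMNET corresponds to re-running (warm-started) updates like these over a decreasing grid of lam values.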
The use of multilevel models to represent patient outcome from geriatric wards: An Italian case study
Adele H. Marshall1, Mariangela Zenga2, Franca Crippa2 and Gianluca Merchich2
1 Centre for Statistical Science and Operational Research (CenSSOR), Queen's University Belfast, Belfast, BT7 1NN
2 Milano-Bicocca University, via B. degli Arcimboldi, 20126 Milano, Italy
Abstract
The proportion of elderly people has increased across all European countries in the last fifteen years. This means a growth in the healthcare services dedicated to the elderly, in particular an increase in expenditure and in hospital admissions, leading to an overall increase in patient length of stay (LOS) in hospital. In fact, in Italy in 2012 elderly people comprised 37% of hospital admissions, consuming nearly half (49%) of the LOS days. It has been estimated that by 2050 the ageing of the population will produce an increase of 4-8% of GDP across Europe. In contrast, recent years in Italy have also seen the closure of several paediatric wards, replaced by geriatric wards.
In the Italian national health system, each care unit is allowed to establish its own criteria for effective care allocation to the elderly. However, the central health authority dictates the criteria for hospitalization of elderly patients. This paper investigates the relationship between the combination of effectiveness and appropriateness with respect to geriatric wards belonging to the national health system within a specific region in Central Italy. In particular, our attention focuses on the evaluation of the healthcare outcome, the outcome itself relying on the patient well-being, where the latter is the result of a complex system of reciprocal relations between patients and healthcare agents.
The process is considered using a multilevel model where the patient outcomes are represented taking into account both the patient condition and ward/hospital settings. The evaluation of the healthcare outcome is modelled in terms of risk, with the inclusion of risk adjustments with respect to covariates, as advanced by Goldstein and Spiegelhalter in 1996 [1]. The model shows how certain outcomes are related to the healthcare structure and others are not. The multilevel model leads to a ranking between wards according to risk adjustments.
References
[1] Goldstein, H. and Spiegelhalter, D.J. (1996). League Tables and Their Limitations: Statistical Issues in Comparisons of
Institutional Performance. Journal of the Royal Statistical Society, Series A, 159(3), 385-443.
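As an aside on the mechanics, the shrinkage that drives such risk-adjusted ward rankings can be sketched in a few lines. This is a hypothetical two-level simulation (invented ward counts and effects, not the authors' model or data): ward means are pulled toward the grand mean in proportion to their sample size, which is the behaviour behind the league tables of Goldstein and Spiegelhalter.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 5 wards with different patient counts; outcomes vary
# both within and between wards (a two-level structure).
n_patients = np.array([12, 30, 8, 25, 15])
true_ward_effects = np.array([0.5, -0.2, 0.8, 0.0, -0.5])
outcomes = [rng.normal(mu, 1.0, size=n) for mu, n in zip(true_ward_effects, n_patients)]

# Raw ward means versus empirical-Bayes "shrunken" means: small wards are
# pulled more strongly toward the grand mean.
raw_means = np.array([y.mean() for y in outcomes])
grand_mean = np.concatenate(outcomes).mean()
sigma2_within = 1.0                    # assumed known residual variance
tau2_between = np.var(raw_means)       # crude between-ward variance estimate
weights = tau2_between / (tau2_between + sigma2_within / n_patients)
shrunken = grand_mean + weights * (raw_means - grand_mean)

ranking = np.argsort(shrunken)         # wards ranked from lowest to highest
```

A covariate-based risk adjustment would replace the grand mean by each patient's predicted outcome before averaging within wards.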
Covariance modelling for multivariate longitudinal data
Gilbert MacKenzie1 and Jing Xu2
1Centre of Biostatistics, University of Limerick, Ireland; 2Birkbeck College, University of London, London, UK
Abstract
In many studies subjects are measured on several occasions with regard to multi-
variate response variables. Consider, as an example, a randomized controlled trial of
teletherapy for age-related macular degeneration (Hart et al., 2002). Patients were
randomly assigned to either radiotherapy or observation and distance visual acuity,
near visual acuity and contrast sensitivity were measured throughout the study.
Modelling the covariance structures for such multivariate longitudinal data is usu-
ally more complicated than for the univariate case due to correlation between the
responses at each time point, correlation within separate responses over time and
cross-correlation between different responses at different times. Two approaches
are commonly adopted: models with a Kronecker product covariance structure and
multivariate mixed models with random coefficients. These approaches select the
covariance structures from a limited set of potential candidate structures including
compound symmetry, AR(1) and unstructured covariance, and very often assume
that the data are sampled from multiple stationary stochastic processes.
In this paper, we develop a method to model covariance structures for bivariate lon-
gitudinal data by extending the ideas of modified Cholesky decomposition (Pourah-
madi, 1999) and matrix-logarithmic covariance modelling (Chiu et al., 1996). Fi-
nally, we model the parameters in these matrices parsimoniously using regression
models and use our new methods to analyse the bivariate response in the ARMD
trial referenced above.
Keywords: Covariance Modelling, Multivariate Longitudinal Data
References
Xu, J. and MacKenzie, G. (2012). Modelling covariance structure in bivariate marginal
models for longitudinal data. Biometrika, 99(3), 649-662.
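The modified Cholesky idea named in the abstract can be illustrated briefly: for a covariance matrix Sigma it yields a unit lower-triangular T and a diagonal D with T Sigma T' = D, whose entries (autoregressive coefficients and innovation variances) are unconstrained and can therefore be modelled by regressions. A minimal numerical sketch, illustrative only and not the authors' code:

```python
import numpy as np

def modified_cholesky(sigma):
    # Sigma = C C' with C lower triangular; write C = L D^{1/2} with L unit
    # lower triangular, so Sigma = L D L' and T = L^{-1} gives T Sigma T' = D.
    c = np.linalg.cholesky(sigma)
    d_sqrt = np.diag(c)
    l = c / d_sqrt                 # divide column j by its diagonal entry
    t = np.linalg.inv(l)           # rows hold negated autoregressive coefficients
    d = np.diag(d_sqrt ** 2)       # innovation variances
    return t, d

# An AR(1)-like covariance for 4 time points, purely for demonstration
rho = 0.6
idx = np.arange(4)
sigma = rho ** np.abs(idx[:, None] - idx[None, :])
t_mat, d_mat = modified_cholesky(sigma)
```

Parsimonious covariance modelling then places regression models on the entries of T and log D rather than on Sigma itself.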
Quasi-likelihood Estimation for Poisson-Tweedie Regression
Models
John Hinde1, Bent Jørgensen2, Clarice Demétrio3 and Célestin Kokonendji4
1NUI Galway, Galway, Ireland; 2University of Southern Denmark, Odense, Denmark;
3ESALQ/USP, Piracicaba, Brazil; 4Université de Franche-Comté, Besançon, France
Abstract
We consider a new type of generalized linear model for discrete data based on
Poisson-Tweedie mixtures. This class of models has previously been considered
intractable, but recent theoretical results show that we may parameterize the mod-
els by the mean µ and a dispersion parameter γ, where the variance takes the form
µ + γµp and p ≥ 1 is the Tweedie power parameter. Here we describe a quasi-
likelihood method for estimating a regression model and a bias-corrected Pearson
estimating function for the variance parameters γ and p. This provides a unified re-
gression methodology for a range of different discrete distributions such as Neyman
Type A, Pólya-Aeppli, negative binomial and Poisson-inverse-Gaussian distribu-
tions, as well as the Hermite distribution. We discuss these models in the context
of overdispersion and zero-inflation and illustrate their application to some classic
datasets and some recent data on hospitalisations.
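The quasi-likelihood regression step can be made concrete: with a log link, Fisher scoring needs only the mean and the variance function mu + gamma * mu^p. The toy fit below uses simulated Poisson counts and holds gamma and p fixed (the abstract's Pearson-type estimation of the variance parameters is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5])
y = rng.poisson(np.exp(x @ beta_true))      # stand-in count data

gamma, p = 0.5, 2.0                          # dispersion parameters held fixed
beta = np.zeros(2)
for _ in range(25):                          # Fisher scoring iterations
    mu = np.exp(x @ beta)
    v = mu + gamma * mu ** p                 # Poisson-Tweedie variance function
    w = mu ** 2 / v                          # working weights under the log link
    z = x @ beta + (y - mu) / mu             # working response
    beta = np.linalg.solve(x.T @ (w[:, None] * x), x.T @ (w * z))
```

The estimating equation is unbiased for the mean even when gamma and p are misspecified, which is what makes the unified regression methodology possible.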
Application of Random Item IRT Models to Longitudinal Data from Electronic Learning Environments
Damazo T. Kadengye1, Eva Ceulemans2 and Wim Van den Noortgate3
1,2,3Centre for Methodology of Educational Research, KU Leuven, Belgium 1,3Faculty of Psychology and Educational Sciences, KU Leuven – Kulak, Belgium
Abstract
In educational learning environments, monitoring persons’ progress over time may help teachers
to continually evaluate the effectiveness of their teaching or training procedures and make more
informed instructional decisions. Electronic learning (e-learning) environments are increasingly
being utilized as part of formal education, and their tracking and logging data sets can be used to
understand how and whether students progress over time, or to improve the learning environment.
In this paper, we present and compare four longitudinal models based on the item response
theory (IRT) framework for measuring growth in persons’ ability within and between study
sessions in tracking and logging data from web based e-learning environments. Two of the
models, that have been proposed and applied in other aspects of educational research, focus on
measurement of growth between study sessions. These are compared to two extensions that we
propose; one model that measures growth within study sessions while the other model combines
the two aspects – growth within and between study sessions. Differences in growth across person
groups are explained by extending the models to the explanatory IRT framework. An e-learning
example is used to illustrate the presented models. Results show that by incorporating time spent
within and between study sessions into an IRT model, one is able to track changes in ability of a
population of persons or for groups of persons at any time of the learning process.
Keywords: Item Response Theory, e-Learning Environments, Modelling of Growth
Graduation of Crude Mortality Rates for the Irish Life
Tables
Kevin McCormack
Central Statistics Office Ireland
Abstract
A life table is a convenient way of summarising various aspects of the variation of
mortality with age. The graduation or smoothing of crude population mortality
rates is essential in the construction of life tables, as the recording of the underlying
deaths is subject to error.
Period Life Tables have been produced by the Irish Central Statistics Office on fifteen
occasions, from 1926 to 2005-07, and on each occasion King's 1911 formula for
osculatory interpolation was used to graduate the crude mortality rates.
In this paper a modern and more statistically accurate cubic-spline graduation
method based on the TRANSREG feature in SAS is developed and applied to the
2005-07 Irish crude mortality rates. The Irish crude mortality rates were also
smoothed using the UK's Office for National Statistics GeD spline methodology. Life
tables for males and females are constructed using both sets of graduated mortality rates
and the results are compared.
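To fix ideas, spline graduation can be sketched in a few lines. The example below uses Python's scipy rather than the SAS TRANSREG procedure of the paper, and the Gompertz-like crude rates are simulated, not the Irish data: a cubic smoothing spline is fitted to the log-rates and the graduated rates are read off the fitted curve.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

ages = np.arange(30, 90)
rng = np.random.default_rng(0)

# Crude rates: roughly log-linear in age, perturbed by multiplicative noise
true_log_q = -9.0 + 0.09 * (ages - 30)
crude_q = np.exp(true_log_q) * rng.lognormal(0.0, 0.15, size=ages.size)

# Cubic smoothing spline on the log scale; s controls the smoothing level
spline = UnivariateSpline(ages, np.log(crude_q), k=3, s=ages.size * 0.05)
graduated_q = np.exp(spline(ages))
```

Working on the log scale keeps the graduated rates positive and stabilises the variance across ages.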
Advanced Analytics in Fraud & Compliance
Marco Grimaldi
Accenture Analytics Innovation Centre, Dublin 4, Ireland
Fraud is a much bigger problem for organisations than they generally admit, and it is costing them: through loss of revenue, and through reputational risk if they are not seen to be protecting their customers. There is compelling international evidence to demonstrate the scale of the problem. Most organisations are struggling to keep pace with the scale and sophistication of fraud. As organisations become more innovative in how they deliver, and get paid for, products and services, they also become more vulnerable to fraud.
Conventional approaches to fraud management are no longer enough and for the most part only tackle the tip of the fraud problem. More sophisticated, data-driven approaches to fraud are delivering remarkable results. Most organisations have exceptional amounts of data on their current and potential clients, gathered from internal and external sources. Many are now using this data to build risk profiles of who has defrauded, or is likely to defraud, them. When the data is fully exploited it allows a targeted rather than random approach to fraud. This in turn means much better results. Data mining and advanced analytics are at the heart of new, more holistic approaches to fraud management. With them, organisations can uncover patterns of fraud, and they are now developing integrated fraud prevention strategies that are cross-functional and have analytics at their core.
Accenture’s Analytics Innovation Centre (AAIC) is achieving impressive results in fraud detection. The Centre is the first of its kind in Ireland and has become a Centre of Excellence for fraud analytics. For example:
• It is delivering a 45% increase in non-compliance yield per investigation in a European Revenue Agency.
• It has increased fraud detection rates in a European Welfare Agency by 40%.
Inferences on Inverse Weibull distribution based on progressive censoring using EM algorithm
Amal Helu1 and Hani Samawi2
1Department of Mathematics, The University of Jordan, Jordan; 2Jiann-Ping Hsu College of Public Health, Georgia Southern University, USA
Abstract
The Inverse Weibull (IW) distribution can be readily applied to a wide range of situations, including applications in medicine, reliability and ecology. Based on progressively type-II censored data, the maximum likelihood estimators (MLEs) for the parameters of the IW distribution are derived using the expectation-maximization (EM) algorithm. Moreover, the expected Fisher information matrix based on the missing value principle is computed. Using extensive simulation and three criteria, namely bias, mean squared error and Pitman nearness (PN) probability, we compare the performance of the MLEs obtained via the EM algorithm with that of the Newton-Raphson (NR) method. It is concluded that the MLEs using the EM algorithm outperform their counterparts using the NR method. A real data example is used to illustrate our proposed estimators.
Statistical challenges in describing a complex aquatic
environment
E Marian Scott
School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8QW,
UK
Abstract
Both quality and quantity of water are of crucial importance for many reasons,
impacting on diverse topics from flood risk, to human health. Water quality is
determined by many determinands and characteristics, and subject to a variety of
different regulatory regimes. Within the European Union, there are several regulatory
frameworks dealing with the aquatic environment, of which the Water Framework
Directive (WFD) is perhaps the most significant. Three others are the Bathing
Waters Directive (BWD), for predicting microbiological health risk, the Floods Di-
rective, which requires a national assessment of flood risk by 2011 and flood risk
and hazard maps by 2013, and the Nitrates Directive, which is linked to the WFD.
The WFD expresses objectives in terms of achieving good ecological and chemical
status. Setting such environmental objectives requires a means of judging the state
of the environment, and an integrated river basin management planning system.
This development presents some interesting statistical challenges, with waters being
managed at river basin level through a river basin management plan. At the same
time our ability to monitor the environment at increasingly high resolution both
spatially and temporally will produce a revolution in our understanding provided it
is matched with a revolution in our ability to model and visualise such data. New
sensor technology, while not yet widespread, is becoming more generally deployed,
and it is within this context that some grand water-monitoring challenges are
considered, including the integration and synthesis of observations from a variety
of sensor networks, the development of statistical models to handle large data sets,
a focus on extremes, and the visualisation of dynamic systems.
What does statistics tell us about the palaeo-climate?
John Haslett1 and Andrew Parnell2
1School of Computer Science and Statistics, TCD, Ireland; 2School of Mathematics, UCD, Ireland
Abstract
The climate change debate has been much informed, over the past decade, by infor-
mation about the palaeo-climate. For the past decade, with SFI support and several
collaborators, we in Ireland have been part of this international research effort. Our
general focus is on the past 100,000 years and more specifically on the past 15,000
years. This includes the abrupt transition to the Holocene, the current inter-glacial
period. In this paper we present an overview of this research and touch on some of
the implications.
To a statistician, the general objective of climate change research is the reduction
of uncertainty about past and/or future climate. But public debate, especially
about the future, is mired in the concept of uncertainty.
This presentation will discuss the general issue of the study of the uncertainty con-
cerning a complex stochastic space-time system on which there is a small amount
of poor quality (proxy) data. Modern Bayesian methods involving MCMC are our
preferred tool. It will then touch on the communication of that uncertainty, firstly
scientist-to-scientist and subsequently scientific community-to-public. The topic is
timely, as the Intergovernmental Panel on Climate Change (IPCC) will in 2013 issue
its first major report since 2007. It is likely to update its current recommendations
on uncertainty.
Reconstruct climate history at multiple locations given
irregular observations in time
Thinh K. Doan1, Andrew C. Parnell2 and John Haslett1
1School of Computer Science & Statistics, Trinity College Dublin; 2School of Mathematical Science, University College Dublin
Abstract
The best method to predict climate change is to understand the past. Direct mea-
surements are only available in the last 250 years, but there are other indirect
measurements (known as climate proxies) which can be used as a guide to further
past climatic conditions. Using pollen abundance as a proxy, the aim of this work
is to derive information about the climate dynamic processes that generate climate
variability in the past. This is commonly referred to as reconstruction.
The data are available at irregular time intervals at multiple locations in Finnmark
(Norway) going back as far as 14,000 years. This period includes the very
rapid warming and cooling of climate known as the Younger Dryas. We use compu-
tationally intensive Monte Carlo methods for parameter estimation, and develop a
multivariate long-tail smoothing algorithm for the joint reconstruction of the mul-
tivariate climate time series.
The methodological focus for this presentation concerns the fact that the data series
are not only irregular, but misaligned. That is, the series yj = {yj(tij); i = 1, . . . , nj} have observations at different times.
A threshold based ‘discounting’ mechanism in the revised EU Bathing Water Directive
S. Ahilan1, J.J. O’Sullivan1, B. Masterson2, K. Demeter2, W. Weijer2, G. O’Hare3
1UCD School of Civil, Structural and Environmental Engineering; 2UCD School of Biomolecular and Biomedical Science
3UCD School of Computer Science and Informatics
Abstract
Under the revised Bathing Water Directive, more stringent bathing water quality standards, defined in terms of E. coli and Intestinal Enterococci (IE), will apply. An integrated approach to coastal zone management, essential for sustaining tourism and shellfish harvesting in European coastal waters, is required in the context of the new legislation. The directive recognises that elevated levels of faecal coliform bacteria in bathing areas can derive from the overland transport of waste from livestock in the rural fraction of river catchments. On days, therefore, that follow significant storm events in coastal agricultural catchments, exceedances of threshold bacteria levels may occur. Given that these exceedances result from ‘natural’ rather than anthropogenic influences, a ‘discounting’ mechanism is included in the Directive whereby high levels of faecal bacteria contamination can be excluded from the water quality record if they are predicted in advance and mitigation actions to maintain public health protection are taken. However, this discounting, which can apply to a maximum of 15% of water quality samples in a 4-year monitoring period, is required on a continuous basis rather than at the end of the monitoring period. This presents a practical problem for the responsible authorities charged with enforcing the legislation. In the event of a poor quality water sample associated with a naturally occurring short-term pollution incident being recorded early in the monitoring period, authorities must decide whether or not this is likely to be included in the 15% of discounted samples in the 4-year period. This study develops a risk-based probabilistic approach to ‘discounting’ for advising the responsible authorities whether the water quality of a particular sample should be discounted. The study uses E. coli and IE records in three bathing water areas (Dollymount, Merrion and Sandymount Strands) on the east coast of Ireland collected from 2003 to 2012.
The records were initially analysed to identify any seasonal or other patterns in the data. Following this, synthetic records of E. coli and IE were generated from a Monte Carlo simulation over the monitoring period. The non-exceedance probabilities of E. coli and IE were determined from the generated samples. The directive requires that threshold values of both E. coli and IE should be within specified limits to maintain beach quality; therefore, a joint probability analysis was undertaken to identify allowable E. coli and IE levels to facilitate the discounting of 15% of samples.
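The Monte Carlo step can be sketched as follows. The lognormal form, the parameter values and the 500 cfu/100 ml threshold are illustrative assumptions, not the fitted values or limits from the study: synthetic bacterial counts are drawn and the non-exceedance probability of a threshold is estimated directly from the sample.

```python
import numpy as np

rng = np.random.default_rng(2013)

# Assumed lognormal model for bacterial counts (illustrative parameters)
log_mean, log_sd = 4.0, 1.2
threshold = 500.0                     # cfu / 100 ml, illustrative limit

# Draw a large synthetic record and estimate the non-exceedance probability
synthetic = rng.lognormal(log_mean, log_sd, size=100_000)
p_non_exceed = float(np.mean(synthetic <= threshold))
```

A joint analysis of E. coli and IE would draw the two series together, respecting their dependence, and evaluate both thresholds simultaneously.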
The Impact of Outliers in a Joint Model Setting
Lisa M. McCrink, Adele H. Marshall and Karen J. Cairns
Centre for Statistical Science and Operational Research (CenSSOR),
Queen’s University Belfast, Northern Ireland
Abstract
With a growing volume of medical longitudinal and survival data being gathered
concurrently, joint models have become a popular technique to handle the typi-
cal associations found between such data through simultaneous estimation of both
the longitudinal and survival processes. Although this field of research is growing
rapidly, little research has assessed the impact of the commonly used normality
assumptions within these models.
This research focuses on the impact of this normality assumption of both the longitu-
dinal random effects and random error terms in the presence of longitudinal outliers.
In doing so, a robust joint model is introduced which can account for both outlying
observations within individuals alongside outlying individuals that dont conform to
the population trends. An illustration using longitudinal data from Northern Irish
end stage renal disease patients is presented in this research.
Due to the natural decline in kidney functions with age, the aging population within
the UK has led to an increasing number of individuals commencing renal replacement
therapy. For these patients, one of the largest influences on their survival is the
management of anaemia with previous renal research stressing the negative impact
that fluctuating haemoglobin (Hb) levels over time have on patients' survival [1].
Due to this association between a longitudinal and survival process, independent
models can result in biased estimates and so a joint model is recommended [1].
Both outlying observations and individuals are common in this type of data. This
research illustrates the effect that these outliers have on the longitudinal parameters
and the impact of this in a joint model setting.
References
[1] Ratcliffe, S.J., Guo, W. and Ten Have, T.R. (2004). "Joint modeling of longitudinal
and survival data via a common frailty", Biometrics, 60(4), 892-899.
Longitudinal PSA reference ranges: choosing the underlying model of age related changes
Andrew Simpkin1, Chris Metcalfe1, Jenny L. Donovan1, Richard M. Martin1,2, J. Athene Lane1, Freddie C. Hamdy3, David E. Neal4 and Kate Tilling1
1 School of Social and Community Medicine, University of Bristol, Bristol, UK
2 MRC Centre for Causal Analysis in Translational Epidemiology, University of Bristol, Bristol, UK
3 Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK
4 Oncology Centre, Addenbrooke’s Hospital, Cambridge, UK
Abstract
Background: Serial measurements of prostate specific antigen (PSA) are used as a biomarker for
men diagnosed with prostate cancer following an active monitoring programme. Distinguishing
pathological changes from natural age-related changes is not straightforward. Here we compare
four approaches to modelling age-related change in PSA with the aim of developing reference
ranges for repeated measures of PSA. A suitable model for PSA reference ranges must satisfy
two criteria. Firstly it must offer an accurate description of the trend of PSA on average and in
individuals. Secondly it must be able to make accurate predictions about new PSA observations
for an individual and about the entire PSA trajectory for a new individual.
Methods: Data from over 7,000 serial PSA tests were available from a cohort of 512 men
enrolled in the active monitoring preference arm of the Prostate testing for cancer and Treatment
(ProtecT) trial. We used linear mixed models assuming (i) a linear change in PSA with time, and
explored non-linear trajectories using (ii) fractional polynomials and (iii) linear regression
splines in the mixed model setting. The final method comparison was with (iv) a nonparametric
method (PACE) for fitting smooth curves to repeated measures data. Using methods developed
for linear mixed models, we can enhance predictions for future observations by conditioning on
already observed PSA. The approaches were compared for model fit using Akaike’s Information
Criterion (AIC) while root mean squared error (RMSE) was used to evaluate the performance in
fitting the sample PSA data.
Results: PACE offered the best fit to the observed PSA data with an RMSE of 2.26 ng/ml.
However, using conditional prediction methods, the mixed model approaches can be used to
better predict observations for new individuals. Among these methods, a linear regression spline
mixed model performed the best in modelling repeated PSA, with a lower AIC and an RMSE of
2.10 ng/ml. This analysis demonstrates an advantage of regression splines over fractional
polynomials, i.e. using a local rather than a global basis to fit models to non-linear relationships.
Conclusions: The nonparametric method performed best in describing the features of population
trend in this repeated measures analysis. Parametric techniques are more suited to predicting new
observations for individuals, as methods exist to condition on initial values. Among methods
discussed here, the linear regression spline mixed model is the optimum approach for deriving
reference ranges for repeated measures of PSA.
Identifying Psychosocial Risk Classes in Families of
Children Newly Diagnosed with Cancer: a Latent Class
Approach
Wei-Ting Hwang1 and Anne E. Kazak2
1Department of Biostatistics and Epidemiology
University of Pennsylvania Perelman School of Medicine, USA2Department of Psychology, The Children’s Hospital of Philadelphia, USA
Abstract
In an effort to provide high quality and appropriate psychosocial care for children
newly diagnosed with cancer, and their families, the Psychosocial Assessment Tool
(PAT2.0) was developed as a brief screening instrument to identify family needs and
targets for intervention. The PAT items consist of a constellation of risk factors
focusing on family structure and resources, social support, stress reaction, family
problems, child problems, sibling problems, and family beliefs. The development of
PAT is also based on the Pediatric Preventative Psychosocial Health Model (PP-
PHM). The PPPHM describes the pediatric health population by conceptualizing
families with three levels of risks and needs (universal, targeted, and clinical) in
times of a stressful event such as cancer diagnosis. The objective of this work is to
identify profiles of psychosocial risks and study their evolution in this population.
Using data collected for a PAT project of 142 families, we performed latent class
analysis (LCA) to identify categories of psychosocial risk classes based on the risk
indicators of PAT2.0. The proposed approach is compared with the traditional
method that uses the weighted scores. The stability of the PAT risk classification
over the first 4 months of treatment is then assessed by a latent transition analysis
(LTA). The covariate effects on the class memberships and transition probabilities
are also estimated.
Using Multivariate Exploratory Data Analysis Techniques
to Build Multi-State Markov Models: Predicting Life
Expectancy with and without Cardiovascular Disease
Karen Cairns1, Paula McMillen1, Mark O’Doherty2 and Frank Kee2
1Centre for Statistical Science and Operational Research (CenSSOR), Queen’s University Belfast, UK
2UKCRC Centre of Excellence for Public Health Research (NI), Queen’s University Belfast, UK
Abstract
Often covariates being integrated into multi-state Markov models are of mixed type
(categorical and continuous) and exhibit associations. This can lead to spurious
results when covariate-based multi-state Markov models are fitted.
This paper demonstrates the usefulness of integrating multivariate exploratory data
analysis techniques into the process of building multi-state Markov models. The
msm package in R was used to build the multistate Markov models (Jackson 2011),
while the FactoMineR package in R was used to perform multiple factor analysis
(Le et al 2008). Bayesian Information Criterion (BIC) was used to determine the
optimal model. This paper also extends the msm code to predict life expectancy
with and without cardiovascular disease, based on multiple covariates.
The methods have been applied to the PRIME Belfast data set. PRIME Belfast
follows 2745 men aged 50 to 59 years from Belfast, UK, giving an indication of their
coronary status. Information on subjects was grouped by ‘Background’ (married,
education); ‘Lifestyle’ (alcohol consumed, physical activity, smoking); and ‘Risk
Indicators’ (hypertension, body mass index (BMI), cholesterol, diabetes). Multiple
factor analysis indicates the first dimension is highly correlated with Risk Indicators,
while the second dimension correlates with Background and Lifestyle information.
The optimal model obtained was based on the use of the transformed variables (the
first two dimensions). In constrast, the (sub-optimal) model based on the original
variables depends only on lifestyle factors.
Lê, S., Josse, J. and Husson, F. (2008). FactoMineR: an R package for multivariate
analysis. Journal of Statistical Software, 25, pp. 1-18.
Jackson, C.H. (2011). Multi-State Models for Panel Data: The msm Package for R.
Journal of Statistical Software, 38, pp. 1-29.
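For readers unfamiliar with the machinery behind the msm package, its core computation can be sketched in Python (an invented three-state illness-death model, not the fitted PRIME Belfast model): a transition intensity matrix Q gives transition probabilities P(t) = exp(Qt), and life expectancy follows by integrating the probability of being alive.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical intensity matrix for states well -> CVD -> dead (per year)
q = np.array([
    [-0.15,  0.10, 0.05],   # well  -> CVD or dead
    [ 0.00, -0.20, 0.20],   # CVD   -> dead
    [ 0.00,  0.00, 0.00],   # dead (absorbing)
])

# Transition probabilities over 5 years
p_5yr = expm(q * 5.0)

# Total life expectancy from "well", truncated at 100 years: integrate the
# probability of not yet being in the absorbing state (trapezoidal rule).
grid = np.linspace(0.0, 100.0, 1001)
alive = np.array([1.0 - expm(q * t)[0, 2] for t in grid])
dt = grid[1] - grid[0]
life_expectancy = float(np.sum((alive[:-1] + alive[1:]) * 0.5) * dt)
```

Covariates enter such models by letting the off-diagonal intensities depend log-linearly on them, which is how the transformed MFA dimensions would be used.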
Resampling Methods for Exploring Cluster Stability
Friedrich Leisch
University of Natural Resources and Life Sciences, Vienna, Austria
Model diagnostics for cluster analysis is still a developing field because of its exploratory nature. Numerous indices have been proposed in the literature to evaluate goodness-of-fit, but no clear winner that works in all situations has been found yet. Derivation of (asymptotic) distribution properties is not possible in most cases.
Over the last decade, several resampling schemes, which cluster repeatedly on bootstrap samples or random splits of the data and compare the resulting partitions, have been proposed in the literature. These resampling schemes provide an elegant framework to computationally derive the distribution of interesting quantities describing the quality of a partition. Due to the increasing availability of parallel processing even on standard laptops and desktops, these simulation-based approaches can now be used in everyday cluster analysis applications.
We give an overview of existing methods, show how they can be represented in a unifying framework, including an implementation in the R package flexclust, and compare them on simulated and real-world data. Special emphasis will be given to the stability of a partition, i.e., given a new sample from the same population, how likely is it to obtain a similar clustering?
Key Words: cluster analysis, resampling methods, bootstrap, R
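The bootstrap stability idea can be sketched outside flexclust as follows. This is an illustrative Python stand-in with a hand-rolled k-means and Rand index on simulated data; the talk's implementation is in R: cluster the data once, re-cluster bootstrap samples, and measure how often pairs of points are grouped the same way.

```python
import numpy as np

rng = np.random.default_rng(7)

def kmeans(x, k, iters=50):
    # Plain Lloyd's algorithm, initialised from randomly chosen points
    centres = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - centres) ** 2).sum(-1), axis=1)
        centres = np.array([x[labels == j].mean(0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels

def rand_index(a, b):
    # Fraction of point pairs on which two partitions agree
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    n = len(a)
    agree = (same_a == same_b).sum() - n      # drop the diagonal
    return agree / (n * (n - 1))

# Two well-separated Gaussian clusters: partitions should be very stable
x = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(3, 0.3, (60, 2))])
base = kmeans(x, 2)
stabilities = []
for _ in range(20):                            # bootstrap replications
    idx = rng.choice(len(x), len(x), replace=True)
    boot_labels = kmeans(x[idx], 2)
    stabilities.append(rand_index(base[idx], boot_labels))
mean_stability = float(np.mean(stabilities))
```

Because the Rand index only compares co-membership of pairs, arbitrary relabelling of clusters between runs does not matter.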
Clustering PET volumes of interest for the analysis of
driving metabolic characteristics
Eric Wolsztynski1, Emily Brick1, Finbarr O’Sullivan1 and Janet F. Eary2
1School of Mathematical Sciences, University College Cork, Ireland; 2School of Medicine, University of Washington, Seattle, WA, USA
Abstract
The main advantage of Positron Emission Tomography (PET) over other imaging
modalities resides in the unique metabolic information it provides with the imaged
distribution of radiolabelled tracer uptake in tissue. PET imaging of solid tumours
has thus become part of standard treatment protocols for an increasing number
of cancer types, in particular for the evaluation of prognosis at baseline and of
therapeutic response. Clinical experience indicates that certain sub-regions of higher
uptake intensity within the tumour play a dominant role when determining overall
tumour metabolic progression and treatment outcome. Relevant statistical analysis
can thus be carried out on sub-volumes of the PET image. This process however
relies on adequate techniques for tumour and region delineation from data of limited
spatial resolution, which is not straightforward. The choice of a statistical indicator
of metabolic activity within the sub-region is then in most cases, depending on the
type of disease and medical protocols in place, a variation of some mean or maximum
tracer uptake quantitation.
We explore the feasibility of model-based clustering in partitioning the PET volume
of interest (VOI) into sub-volumes of characteristic uptake patterns. Delineation of
the sub-VOI is thus data-driven and also has the advantage of regrouping uptake
information from areas of comparable metabolic intensities that are not connected
spatially. We consider a recombined Gaussian mixture modelling of the PET VOI
to identify sub-VOIs of analytic value. Performance of standard statistical quanti-
tators applied to clusters of radiotracer uptake is also considered in terms of clinical
utility. Preliminary results on sarcoma studies highlight the potential of statistical
quantitation derived from clusters of PET data for both prognosis and therapeutic
response assessment.
[Research supported by SFI MI-2007 and NIH/NCI R01-CA-65537.]
A Random Walk Interpretation of DiffusionRank
Haixuan Yang1
1School of Mathematics, Statistics, & Applied Mathematics
National University of Ireland, Galway, Ireland
Abstract
Based on a heat diffusion model on a graph, recently we (2007) proposed Diffu-
sionRank to solve the link manipulation problem of PageRank. A little earlier, we
(2005) applied a similar model to the semi-supervised classification problem where,
the task is, given some labelled nodes in a graph, to predict the labels of unlabelled
nodes. More recently, two papers have appeared in Bioinformatics: Goncalves
et al. (2011) applied DiffusionRank to the problem of prioritization of regulatory
associations, and Poirel et al. (2013) evaluated its (and three other algorithms’)
performance when applied to the problem of ranking genes for 54 diverse human
diseases. With these successful applications of DiffusionRank, it is interesting to
show its properties for a better understanding of why it works. We (2005, 2009)
showed that DiffusionRank can be considered as a generalization of both K nearest
neighbors and the Parzen window nonparametric method for classification problems.
Here I would like to share a random walk interpretation: DiffusionRank is a lazy
random walk in which a random walker rests at a node with a flexible probability;
equivalently, in computational terms, it is a finite-step lazy random walk with a
fixed rest probability. Such a lazy random walk both keeps track of the initial
condition and reflects the network structure. Both properties are important for
ranking and classification problems, as the inference of ranking scores and
classification confidence scores relies on both the initial condition and the
network structure.
30
Changepoint model with autocorrelation
Erla Sturludottir and Gunnar Stefansson
Science institute, University of Iceland, Iceland
Abstract
A changepoint is a point in a time series at which there is a step change, i.e. a mean
shift in the response variable, and/or a change in trend. Such changes can occur, for
example, in time series of contaminant concentrations: a change in the analysis
method can result in a step change in the time series, and a change in emissions can
result in a change in trend. Changepoint analyses have mainly been carried out on
climate time series, and the focus has been on step changes rather than changes in
trend.
Most methods for detecting a changepoint assume independent, identically and
normally distributed errors. However, autocorrelation is common in time series and
inflates the error rate when ignored, and research on changepoint models with
autocorrelation has focused on models which only allow a step change. The
likelihood ratio test statistic for an unknown changepoint does not follow a known
distribution, and the critical value of its empirical distribution depends both on the
length of the time series and on the autocorrelation parameters.
The model (1) allows for a changepoint in a time series $y_t$, i.e. the intercept
($\alpha_1 \neq \alpha_2$) and/or the slope ($\beta_1 \neq \beta_2$) differ before and
after the changepoint.

$$y_t = \begin{cases} \alpha_1 + \beta_1 t + \varepsilon_t, & n_0 \le t \le c \\ \alpha_2 + \beta_2 t + \varepsilon_t, & c < t \le n - n_0 \end{cases} \qquad (1)$$
The errors εt are autocorrelated, c is the changepoint, n0 is the first possible
changepoint and n is the length of the time series. This model will detect both
step- and trend-type changepoints.
In this study, a changepoint model which can detect a step and/or trend change
while accounting for autocorrelation will be investigated with simulations, and an
application to contaminant time series will be given.
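As a sketch of the basic profiling idea behind model (1), the following simulates a trend change and scans candidate changepoints by least squares. All settings are illustrative, and the fit deliberately ignores the autocorrelation that the abstract's model accounts for:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n0, c_true = 100, 5, 60

# Simulate a series with a trend change at c_true and AR(1) errors (phi = 0.4).
t = np.arange(1, n + 1)
eps = np.zeros(n)
for i in range(1, n):
    eps[i] = 0.4 * eps[i - 1] + rng.normal(scale=0.5)
y = np.where(t <= c_true, 1.0 + 0.02 * t, 1.0 + 0.06 * t - 0.04 * c_true) + eps

def seg_rss(tt, yy):
    # Residual sum of squares of a straight-line fit to one segment.
    X = np.column_stack([np.ones_like(tt, dtype=float), tt])
    beta, *_ = np.linalg.lstsq(X, yy, rcond=None)
    return ((yy - X @ beta) ** 2).sum()

# Profile over candidate changepoints: separate lines before and after c.
cands = list(range(n0, n - n0))
rss = [seg_rss(t[:c], y[:c]) + seg_rss(t[c:], y[c:]) for c in cands]
c_hat = cands[int(np.argmin(rss))]
print(c_hat)
```

In the abstract's setting the profiled likelihood ratio would then be compared against an empirical critical value that depends on the series length and the autocorrelation parameters.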
31
Modelling global and local changes of distributions to adapt
a classification rule in the presence of verification latency
Vera Hofer1
1Department of Statistics and Operations Research,
Karl-Franzens University of Graz, Austria
Abstract
The distributions a classification problem is based on can be subject to changes over
the course of time. Such changes can relate to the class prior distribution (global
change) or to the conditional or unconditional feature distributions (local change).
After any change the original training data comprising features and class labels are
no longer representative for the population, and thus the classifier’s performance may
deteriorate. However, in the presence of verification latency a re-estimation of the
classification rule after changes is impossible. Verification latency, a phenomenon
that often appears in practical applications, denotes a learning environment, in
which only recent unlabelled data are available. The labels are known only after
some time lapse.
To adapt a classification rule in the presence of verification latency a model is in-
troduced that estimates global and local changes in a two-step procedure using
recent unlabelled data. Since unlabelled data are available after a change, the new
unconditional feature distribution can be estimated. The old conditional feature
distributions are known from the time before the change. The model is based on
mixture distributions, where the local changes are modelled as local displacement of
probability mass from the positions of the old components as given in the conditional
feature distribution to the positions displayed by the new components. Assuming
that the transfer of probability mass is carried out at a minimum of energy, local
changes are estimated by solving a transportation problem for a fixed value of the
class prior change. In a further step the global change is found as the minimum of
the objective function values obtained from the transportation problem.
The usefulness of the proposed models is demonstrated using artificial data and a
real-world dataset from the area of credit scoring.
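The mass-transport step can be illustrated with a toy transportation problem solved as a linear program; the component positions, masses and squared-distance cost below are illustrative assumptions, not the paper's actual specification:

```python
import numpy as np
from scipy.optimize import linprog

# Old class-conditional component positions/weights vs. new (unlabelled)
# component positions/weights; all values are invented for illustration.
old_pos = np.array([0.0, 2.0, 5.0]); old_mass = np.array([0.5, 0.3, 0.2])
new_pos = np.array([0.5, 2.5, 6.0]); new_mass = np.array([0.4, 0.3, 0.3])

# "Energy" of displacing probability mass: squared distance between positions.
cost = (old_pos[:, None] - new_pos[None, :]) ** 2

# Transportation problem: ship old masses to new masses at minimum total cost.
m, k = cost.shape
A_eq, b_eq = [], []
for i in range(m):                        # each old component ships all its mass
    row = np.zeros(m * k); row[i * k:(i + 1) * k] = 1.0
    A_eq.append(row); b_eq.append(old_mass[i])
for j in range(k):                        # each new component receives its mass
    row = np.zeros(m * k); row[j::k] = 1.0
    A_eq.append(row); b_eq.append(new_mass[j])

res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
plan = res.x.reshape(m, k)                # optimal displacement of mass
print(plan.round(3))
```

In the paper's procedure this transport solve is repeated for candidate values of the class prior change, and the global change is taken at the minimum objective value.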
32
Burn In: Estimation of p of Binomial Distribution with Implemented Countermeasures
Horst Lewitschnig1, Daniel Kurz2 and Juergen Pilz3
1Infineon Technologies Austria AG, Villach, Austria
2Institute of Statistics, Alpen-Adria-University, Klagenfurt, Austria
3Institute of Statistics, Alpen-Adria-University, Klagenfurt, Austria
Abstract
Quality is one of the key topics in the semiconductor industry. The failure rate of semiconductor devices follows the so-called bathtub curve. In their early life, devices show a declining failure rate (early fails). Over the lifetime, the failure rate is constant, and at the end of life, in the so-called wear-out phase, the failure rate increases. We focus here on early fails.
Early fails should not be delivered, but weeded out at the manufacturer. For that purpose, devices are operated under controlled conditions; this is called burn in. Burn in is done on a sample basis to check the early life failure rate level; this is called a burn in study. A defined number of devices are taken, stressed, and the number of failing units is counted. Based on that, the p of the binomial distribution is estimated, typically using the Clopper-Pearson interval estimation.
The burn in study is passed if no burn-in-relevant fails have occurred. If a fail has occurred, a countermeasure is implemented. Suppose the countermeasure is 100% effective: had it been implemented before the start of the burn in study, the fail would not have occurred, and therefore the fail is not counted as burn-in relevant. If the countermeasure is less than 100% effective, then the fail would still have occurred with a certain probability, say 1 − α, and with probability α it would not have occurred.
We propose a model that estimates the p of the binomial distribution and takes the effectiveness of countermeasures into account. Several fails tackled by countermeasures are modelled by the generalized binomial distribution, and a convolution algorithm is given for its calculation.
The model is discussed for its decision-theoretical background. We simulate different scenarios: various sample spaces are used with their respective weights to simulate the possible outcomes of the random sampling. The loss function, the risk function and the decision taken are adapted to these different scenarios. The benefit of this model is that improvement measures in the chip manufacturing process are reflected in the p of the binomial distribution, without the need to repeat the sampling experiment.
Acknowledgement: This work has been performed in the project EPT300, co-funded by grants from Austria, Germany, Italy, The Netherlands and the ENIAC Joint Undertaking.
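The convolution algorithm for the generalized (Poisson) binomial distribution mentioned above can be sketched as follows; the countermeasure effectiveness values are illustrative:

```python
import numpy as np

# Each implemented countermeasure has an effectiveness; a tackled fail would
# still have occurred with probability 1 - effectiveness (values assumed here).
effectiveness = [1.0, 0.8, 0.5]
p_occur = [1.0 - e for e in effectiveness]

# Convolution: pmf of the number of fails still counted as burn-in relevant,
# i.e. a generalized binomial with per-fail occurrence probabilities p_occur.
pmf = np.array([1.0])
for p in p_occur:
    pmf = np.convolve(pmf, [1.0 - p, p])

print(pmf)   # pmf[k] = P(k relevant fails remain)
```

Here the fully effective countermeasure removes its fail with certainty, while the partially effective ones contribute Bernoulli terms; the resulting pmf can then feed the Clopper-Pearson-style estimation of p.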
33
Bayesian Model Averaging Optimisation for the
Expectation-Maximisation Algorithm
Adrian O’Hagan and Susan O’Carroll
School of Mathematical Sciences, University College Dublin, Ireland
Abstract
The Expectation-Maximisation (EM) algorithm is a popular tool for deriving max-
imum likelihood estimates for a large family of statistical models. Chief among its
attributes is the property that the algorithm always drives the likelihood uphill. A
serious pitfall is that in the case of multimodal likelihood functions the algorithm
may become trapped at a local maximum. In addition, even in cases where the
global likelihood maximum is ultimately attained, the rate of convergence may be
slow. These phenomena are often fuelled by the use of sub-optimal starting values
for initialisation of the EM algorithm.
We introduce the use of Bayesian Model Averaging (BMA) as a method for pro-
moting algorithmic efficiency. When employed as a precursor to the EM algorithm
it can produce starting values of a higher quality than those arising from simply
employing random starts. The ensuing convergent likelihoods and associated clus-
tering solutions of observations from the BMA-EM algorithm are presented. These
are contrasted with the output arising from the use of random starts; as well as from
the model-based clustering package mclust in R, which uses a hierarchical cluster-
ing initialising step. Datasets tested include the Galaxies data and the Faithful
eruptions data.
The overall goal is to increase convergence rates to the global likelihood maximum
and/or to attain the global maximum in a higher percentage of cases.
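A minimal sketch of the underlying idea, comparing EM from a purely random start with EM from a higher-quality start. Here k-means initialisation stands in for the better starting values; the BMA step itself is not reproduced:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy bimodal data standing in for e.g. the Faithful eruption durations.
x = np.concatenate([rng.normal(2, 0.3, 150),
                    rng.normal(4.5, 0.4, 150)]).reshape(-1, 1)

# EM from a random start vs. EM from a k-means-based start (a stand-in here
# for a higher-quality initialisation such as the BMA step in the abstract).
gm_random = GaussianMixture(n_components=2, init_params="random",
                            random_state=0).fit(x)
gm_kmeans = GaussianMixture(n_components=2, init_params="kmeans",
                            random_state=0).fit(x)

print(gm_random.lower_bound_, gm_kmeans.lower_bound_)
```

Comparing the convergent log-likelihoods (and the number of iterations used) across initialisation schemes mirrors the evaluation described in the abstract.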
34
Bayesian Stable Isotope Mixing Models
Andrew C. Parnell1, Donald L. Phillips2, Stuart Bearhop3, Brice X. Semmens4,
Eric J. Ward5, Jonathan W. Moore6, Andrew L. Jackson7, Jonathan Grey8, David
Kelly9 and Richard Inger3
1School of Mathematical Sciences (Statistics), Complex and Adaptive Systems Laboratory, University College Dublin, Ireland
2U.S. Environmental Protection Agency, National Health & Environmental Effects Research Laboratory, Oregon, USA
3Centre for Ecology and Conservation, School of Biosciences, University of Exeter, UK
4Scripps Institution of Oceanography, University of California, San Diego, 9500 Gilman Drive, La Jolla, California, USA
5Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, USA
6Earth2Ocean Research Group, Simon Fraser University, Burnaby, Canada
7School of Natural Sciences, Trinity College Dublin, Ireland
8Environment and Sustainability Institute, School of Biosciences, University of Exeter, UK
9School of Biological & Chemical Sciences, Queen Mary, University of London, UK
Abstract
Stable Isotope Mixing Models (SIMMs) are used to quantify the proportional con-
tributions of various sources to a mixture. The most widely used application is
quantifying the diet of organisms based on the food sources they have been ob-
served to consume. We propose and implement a multivariate statistical model
which allows for a compositional mixture of the food sources corrected for various
metabolic factors. The compositional component of our model is based on the iso-
metric log ratio (ilr) transform of Egozcue et al (2003). Through this transform we
can apply a range of time series and non-parametric smoothing relationships. We
illustrate our models with 3 case studies based on real animal dietary behaviour.
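The ilr transform at the heart of the compositional component can be sketched as follows, using one standard orthonormal basis; the composition is a hypothetical three-source diet:

```python
import numpy as np

def ilr(x):
    """Isometric log-ratio transform (Egozcue et al., 2003) of a composition.

    Uses the standard sequential-binary-partition basis; x must be positive
    and is closed to sum to 1 first.
    """
    x = np.asarray(x, dtype=float)
    x = x / x.sum()
    D = len(x)
    z = np.empty(D - 1)
    for i in range(1, D):
        gm = np.exp(np.mean(np.log(x[:i])))      # geometric mean of first i parts
        z[i - 1] = np.sqrt(i / (i + 1.0)) * np.log(gm / x[i])
    return z

# A hypothetical 3-source dietary composition (proportional contributions).
print(ilr([0.5, 0.3, 0.2]))
```

The transform maps a D-part composition to an unconstrained (D-1)-vector, on which time-series and smoothing models can be applied directly.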
35
The use of Structural Equation Modelling (SEM) to assess the proportional odds assumption of ordinal logistic regression concurrently over multiple groups
Ron McDowell1, Dr. Assumpta Ryan1, Professor Brendan Bunting2, Dr. Siobhan O’Neill
1Institute of Nursing and Health Research, University of Ulster, Coleraine
2 Psychology Research Institute, University of Ulster, Derry
Introduction: Ordinal logistic regression is used to model an ordinal dependent variable as a function of relevant covariates. The proportional odds (PO) assumption implicit within ordinal logistic regression can be easily tested within standard software packages; however, it is less straightforward to do so when analysing multigroup data. We propose using SEM to test the PO assumption of a number of variables concurrently over multiple groups, using data from 10,530 adults from N. Ireland, Portugal and Romania participating in the World Mental Health Survey Initiative (WMHSI) and the M-Plus software package.
Methods: Each participant received a Guttman score for each of the eight mood and anxiety disorders of interest, describing how far they had progressed through the Composite International Diagnostic Instrument (WHO-CIDI). These scores were analysed using multigroup ordinal latent class analysis and a series of ordered latent classes was obtained. After assessment of measurement invariance, participants were allocated to their most likely class. A series of correlated binary variables was constructed to reflect all possible divisions in two of the ranked latent classes, with the effects associated with age, gender, marital status and urbanicity on these binary variables constrained both within and across countries. These were examined using standard SEM techniques. Mediating effects of self-reported cognitive disability and chronic illness were also included in the analyses.
Results: We identified 6 latent classes describing progression through the structured interview. The proportional odds assumption held for age within each country and for the other variables across and within countries. Older adults were increasingly likely in N. Ireland and Portugal not to progress past the screening section for all 8 disorders, or, if they did progress, not to receive a 12-month diagnosis for any disorder. Effects associated with each of the other variables were the same across countries. Whilst the effect of age mediated via chronic illness was associated with an increase in the probability of participants proceeding through the diagnostic process, this was nullified in later life by the effect of age mediated via cognitive disability, which was associated with a decrease in the probability of moving through the diagnostic process. This held regardless of country and of how the latent classes were partitioned.
Conclusion: Testing the PO assumption in this way allows stronger claims about the effects associated with variables of interest to be made not just within one group but across several. One limitation is the lack of assessment of measurement invariance for age and the other variables, due to treating most likely class membership as an observed rather than a latent variable. The implementation of generalized ordinal logistic models within Mplus, which can readily deal with multiple-group analyses in many other contexts, will be beneficial. Given that so few older adults progress past the lifetime screening questions for any of the diagnoses, further sensitivity analysis of the instrument is required.
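The logic of the PO test, checking that covariate effects agree across cutpoint-specific binary models, can be sketched outside the SEM framework. The data below are simulated so that the assumption holds by construction:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 500
age = rng.normal(50, 10, n)

# Simulate a 3-class ordinal outcome from a latent logistic variable, so the
# proportional odds assumption holds by construction.
latent = 0.05 * age + rng.logistic(size=n)
y = np.digitize(latent, [2.0, 3.0])          # ordered classes 0, 1, 2

# PO check: one binary logistic regression per cutpoint; under PO the
# age coefficients should agree across cutpoints.
slopes = []
for k in [0, 1]:
    clf = LogisticRegression(C=1e6).fit(age.reshape(-1, 1), (y > k).astype(int))
    slopes.append(clf.coef_[0, 0])
print(slopes)   # similar slopes support proportional odds
```

The SEM approach in the abstract generalises this idea by constraining such cutpoint-specific effects within and across groups simultaneously.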
36
Statistical Issues in Clinical Oncology Trials.
Gloria Avalos1, Alberto Alvarez-Iglesias1, Imelda Parker2 and John Newell1,3.
1HRB Clinical Research Facility, NUI Galway, Ireland
2All-Ireland Co-operative Oncology Research Group (ICORG), Ireland
3School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
Abstract
The number of cancer cases diagnosed daily continues to increase in Ireland and worldwide and there is an urgent need to develop more effective therapies for this disease. Clinical trials in patients remain the gold standard for clinical research in oncology. They are the key to developing more effective therapies for patients with cancer by providing them with access to treatments that are not currently available outside of the clinical research arena.
The All-Ireland Co-operative Oncology Research Group (ICORG) was established in Ireland in 1996 with the aims of promoting, designing, conducting and facilitating clinical cancer research on the island of Ireland, and has succeeded in offering research options to over 7500 patients across Ireland in the last fifteen years.
Oncology trials provide several interesting statistical challenges that are unique to this area of medical research. Recent advances in molecular pathways, genomics and cytostatic or targeted agent development have fuelled rapid progress in clinical oncology. In parallel, developments in statistical theory, in particular Bayesian approaches, have continued to provide more powerful methods for dealing with these advances.
In this poster challenges relating to design issues, the choice of outcome and primary endpoints used, the inclusion of stopping rules for efficacy and futility, sequential designs and the implementation of Bayesian adaptive approaches will be presented.
37
Teaching Biostatistical Concepts to Undergraduate Medical Students, Faculty of Medicine, King Fahad Medical City
Dr. Ahmed A. Bahnassy, M.Sc., MSPH, PhD
Associate Professor of Biostatistics Faculty of Medicine
King Fahad Medical City, King Saud Bin Abdulaziz University for Health Sciences
E-mail: [email protected]
Abstract: Medical students are consumers of biostatistical methods throughout their future careers. Many of them are not familiar enough with biostatistical techniques and their appropriate uses in their future fields. The widespread use of personal computers in the last two decades has made it easy for such students to apply statistical tests in their required research work without knowing the basic assumptions behind each test, or when, where and how to use each test properly. A course of basic biostatistics with computer applications has been developed to suit the basic needs of medical students: to become familiar with biostatistical tests and with how to use a statistical package to conduct the simple analyses required. The course consists of modules, each developed with both a theoretical component and a practical component using computer software. The course was carried out in the Faculty of Medicine, King Fahad Medical City. Evaluation of the course showed that the participants' statistical knowledge and interpretation of statistical results increased significantly between the pre- and post-course assessments (p < 0.05). Overall, females performed significantly better than males (p < 0.001). Students performed better in univariate analysis than in multivariate analysis (p = 0.035), while there were no differences by age or previous experience with statistical courses. Students scored significantly higher in practical than in theoretical exams (p = 0.023). The participants' mathematics and computer phobia was addressed, and by the end of the course participants had gained the necessary confidence to carry out their own data analysis. This study recommends that conducting such workshops, with more medical applications on the computer, is an appropriate approach for health professionals.
38
Prognostic modelling for triaging patients diagnosed with prostate cancer
Marco Antonio Barragan-Martínez1, John Newell2,3 and Gabriel Escarela-Perez1
1 Universidad Autónoma Metropolitana-Iztapalapa, Mexico City, Mexico
2 HRB Clinical Research Facility, National University of Ireland, Galway, Ireland
3 School of Mathematics, Statistics and Applied Mathematics National University of Ireland, Galway, Ireland.
Abstract
My research is focused on the choice of treatment for patients with prostate cancer, because it is now very important that doctors have guidelines supported by a sound statistical model. I am working on the analysis of survival of patients diagnosed with prostate cancer in the United States from 1988 to 2008. The dataset comprises different explanatory variables such as demographics, date of diagnosis of cancer, date of death, cause of death, treatment, tumor grade, stage and tumor size. Only cases of adenocarcinoma of the prostate are considered, because this is the most prevalent form of the cancer. The difficulty of this study is that several of the explanatory variables have missing data, for example the stage and grade, hence the need for a method for handling missing data such as multiple imputation.
Multiple imputation is a statistical technique for analysing incomplete data. The main idea is to generate m > 1 possible values for each missing value, thus obtaining m complete data sets to be analysed, and then to pool the results over the imputed datasets. The method uses all the information in the data to 'fill in' missing values, as methods that discard variables with missing data can introduce bias and a loss of power.
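Rubin's pooling rules for multiply imputed data can be sketched as follows; the data and the (deliberately crude) imputation model are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 5

# Toy data: a tumour-size-like variable with ~30% missing at random.
size = rng.normal(3.0, 1.0, n)
miss = rng.random(n) < 0.3
observed = np.where(miss, np.nan, size)

obs = observed[~np.isnan(observed)]
estimates, variances = [], []
for _ in range(m):
    # Crude imputation model: draw missing values from the observed
    # distribution (a real analysis would condition on covariates).
    imp = observed.copy()
    imp[np.isnan(imp)] = rng.choice(obs, np.isnan(imp).sum())
    estimates.append(imp.mean())
    variances.append(imp.var(ddof=1) / n)

# Rubin's rules: pooled estimate, within- plus between-imputation variance.
qbar = np.mean(estimates)
b = np.var(estimates, ddof=1)
total_var = np.mean(variances) + (1 + 1 / m) * b
print(qbar, total_var)
```

The between-imputation component `b` is what propagates the uncertainty due to the missing data into the pooled inference.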
Once the data have been imputed I will proceed to fit the model presented by Larson and Dinse (1985). This model specifies the cumulative incidence functions in terms of cause-specific conditional survival probabilities and the probability that the event is of a specific cause. Larson and Dinse propose a fully parametric structure in which the conditional survival functions follow Cox proportional hazards models with piecewise-constant (exponential) baseline hazards, and the cause-specific probabilities follow a multinomial model.
Reference
Larson, M.G. and Dinse, G., E. (1985). A Mixture Model for the Regression Analysis of Competing Risks Data. Journal of the Royal Statistical Society. Series C (Applied Statistics) , Vol. 34, No. 3, pp. 201-211.
39
Adaptive Bayesian inference for doubly intractable
distributions
Aidan Boland, Nial Friel
School of Mathematical Sciences and Complex Adaptive Systems Laboratory,
University College Dublin.
Abstract
There are many problems in which the likelihood function is analytically intractable.
In this situation Bayesian inference is often termed "doubly intractable" because the
posterior distribution is itself intractable. There has been a lot of work
carried out on this class of problem, including the exchange algorithm [1] and Caimo
and Friel [2], who adapted the exchange algorithm to the network graph framework
and used a population-based MCMC approach to improve mixing. However many of
these approaches are still computationally intensive and suffer from problems such
as slow convergence and poor mixing.
The auxiliary variable method on which the exchange algorithm is based involves
repeated sampling from the likelihood function. It turns out that the samples from
the likelihood function can be used to estimate the gradient of the target distribu-
tion. We explore how this information can be used in the context of a Langevin
MCMC algorithm. This method is less computationally intensive as it explores the
target distribution efficiently and does not need a population-based approach.
We envisage that this approach may have some applicability to more general prob-
lems where ABC (Approximate Bayesian Computation) or likelihood-free inference
is used.
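The idea of driving Langevin proposals with a Monte Carlo estimate of the gradient can be sketched on a toy model where the sufficient-statistic expectation is estimated from auxiliary draws. The model, step size and prior below are illustrative, and the acceptance step is simplified (it omits the MALA proposal-density correction):

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.5, 1.0, 50)            # observed data, Y_i ~ N(theta, 1)

def grad_est(theta, n_aux=500):
    # Exponential-family score: sum_i s(y_i) - n * E_theta[s(Y)], with s(y) = y.
    # The expectation is estimated from auxiliary draws from the likelihood,
    # mirroring the exchange-algorithm setting; a N(0, 1) prior adds -theta.
    aux = rng.normal(theta, 1.0, n_aux)
    return y.sum() - len(y) * aux.mean() - theta

def log_post(theta):
    # True log-posterior, used only to accept/reject in this toy example.
    return -0.5 * ((y - theta) ** 2).sum() - 0.5 * theta ** 2

eps, theta, chain = 0.01, 0.0, []
for _ in range(2000):
    # Langevin proposal driven by the *estimated* gradient.
    prop = theta + 0.5 * eps * grad_est(theta) + np.sqrt(eps) * rng.normal()
    if np.log(rng.random()) < log_post(prop) - log_post(theta):
        theta = prop
    chain.append(theta)

print(np.mean(chain[500:]))
```

In the doubly intractable setting the same auxiliary draws already required by the exchange algorithm supply the gradient estimate, so the Langevin drift comes at little extra cost.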
References
[1] I. Murray, Z. Ghahramani, D. MacKay. (2006), MCMC for doubly-intractable
distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in
Artificial Intelligence (UAI-06), Arlington, Virginia, AUAI Press.
[2] Caimo A., Friel N. (2011) Bayesian inference for the exponential random graph
model. Social Networks, 33, 41–55.
40
Development and Validation of a Panel of Serum
Biomarkers to Inform Surgical Intervention for Prostate
Cancer
Susie Boyce12, Lisa Murphy1, John M. Fitzpatrick1, T. Brendan Murphy2 and R.
William G. Watson1
1UCD School of Medicine and Medical Science, University College Dublin, Dublin 4
2UCD School of Mathematical Sciences, University College Dublin, Dublin 4
Abstract
Introduction: Prostate cancer (PCa) is the most common cancer in European and
North American men, and the third most common cause of male cancer deaths. We
have previously shown the inability of current clinical tests to accurately predict
key prostate cancer outcomes. Many studies have established that new biomarker
features are urgently required for this area. In this study, we measure nine protein
biomarkers and their ability to accurately predict prostate cancer stage.
Methods: Serum samples of 197 men diagnosed with prostate cancer collected
through the Prostate Cancer Research Consortium were used. Nine protein biomark-
ers were measured using Meso Scale Discovery's electrochemiluminescence multiplex-
ing platform. Statistically significant differences in the expression levels of each
marker for organ confined vs non-organ confined prostate cancer patients were
assessed using independent samples t-tests. The markers were modelled using logistic
regression and their predictive ability measured using a combination of discrimi-
nation metrics (receiver operating characteristic (ROC) curves and area under the
curve (AUC) values), calibration curves and decision curve analysis.
Results: Using logistic regression, each of the markers was modelled in isolation
and in combination to measure their ability to predict prostate cancer stage (organ
confined vs. non-organ confined). Backwards feature selection was then used to
remove redundant markers. The optimal biomarker panel was determined to con-
tain 4 markers (composition undisclosed due to patenting issues). This biomarker
panel achieves a discrimination AUC value of 0.81 indicating that the panel is highly
discriminative. The panel is also well calibrated and shows a significant benefit in a
clinical setting based on decision curve analysis. In order to compare the biomarker
panel to the current clinical tests in use, four clinical variables recorded for each
patient (age, prostate specific antigen (PSA), clinical stage based on digital rectal
exam (DRE) and biopsy Gleason Score) were included in a logistic regression model.
41
This clinical tests model achieves an AUC of 0.688. Finally, a model consisting of
the biomarker panel combined with the four clinical tests was developed and this
achieved an AUC value of 0.856.
Conclusion: The biomarker panel developed in this study achieves far better dis-
crimination and clinical benefit than the current clinical tests in use. This panel
can also be used in collaboration with the current clinical tests and offers a far more
accurate prediction method. This panel shows huge promise for the prostate cancer
field. The next stage in this study is external validation of the marker panel in an
Austrian cohort of patients (results are expected in time for the CASI 2013
meeting).
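The discrimination workflow described above, a logistic regression scored by ROC AUC, can be sketched on synthetic data. The markers and cohort below are simulated; only the workflow mirrors the abstract:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n = 197   # matching the cohort size in the abstract; the data are synthetic

# Four synthetic "markers": two informative, two pure noise.
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=1.0, size=n) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
print(round(auc, 3))   # in-sample AUC; a real study would also cross-validate
```

Calibration curves and decision curve analysis, as used in the study, would complement this single discrimination summary.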
42
Clustering of fishing vessel speeds using Vessel Monitoring
System data
Emily Brick1, Paula Harrison2, Gerry Sutton2,
Michael Cronin1 and Eric Wolsztynski1
1School of Mathematical Sciences, University College Cork, Ireland
2Coastal and Marine Research Centre, University College Cork, Ireland
Abstract
Satellite-based Vessel Monitoring Systems (VMS) record the speed, position and
course of fishing vessels. These recordings do not however explicitly report the
activity in which the vessel is engaged. Adequate classification of VMS data would
be useful for characterizing vessel activity, particularly for traffic surveillance and
for measuring fish catch composition in specified areas. With this in mind we aim to define
three vessel speed clusters, classifying vessels as either starting/stopping, fishing, or
steaming, and explore the feasibility of model-based clustering techniques to do so.
Here we report on the analysis of a dataset of approximately 500,000 Irish fishing
vessel speeds, reconstructed from [1]. Due to the multi-modal, heavy-tailed nature
of the underlying distribution, we explore various mixture modelling techniques, and
especially consider recombined Gaussian mixtures for this classification. Clustering
performance is evaluated non-parametrically in terms of goodness-of-fit, uncertainty
and cluster validity. The resulting unsupervised classification is ultimately compared
to current gold standards. To the best of the authors’ knowledge, the implementation
and calibration of such techniques to VMS data represents an original contribution.
All analyses were implemented in the R software environment and make significant
use of the add-on package mclust version 4 [2].
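A minimal stand-in for the model-based clustering step, using a three-component Gaussian mixture on synthetic speeds; the speed regimes and sample sizes are invented, and the study itself uses mclust in R:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
# Synthetic vessel speeds (knots) for three activity regimes.
speeds = np.concatenate([
    np.abs(rng.normal(0.5, 0.4, 300)),   # starting/stopping
    rng.normal(3.5, 0.8, 500),           # fishing (trawling speeds)
    rng.normal(10.0, 1.5, 400),          # steaming
]).reshape(-1, 1)

# Three-component Gaussian mixture; each fitted component is interpreted
# as one activity cluster.
gm = GaussianMixture(n_components=3, random_state=0).fit(speeds)
labels = gm.predict(speeds)
print(sorted(gm.means_.ravel().round(1)))
```

The recombination of Gaussian components mentioned in the abstract would merge components so that each final cluster corresponds to one activity, even when a regime needs more than one Gaussian to fit its heavy tail.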
References
[1] P. Harrison, M. Cronin, and G. Sutton, “Using VMS and EU logbook data to analyze
deep water fishing activity in the seas around Ireland,” in ERI Research Open Day,
2012.
[2] C. Fraley, A. Raftery, T. Murphy, and L. Scrucca, “mclust version 4 for R: Normal
mixture modeling for model-based clustering, classification, and density estimation,”
Tech. Rep. 597, Department of Statistics, University of Washington, Seattle, USA,
2012.
43
Validating the Academic Confidence Subscales
Emily Brick1, Kathleen O’Sullivan1 and John O’Mullane2
1Department of Statistics, University College Cork, Ireland
2Department of Computer Science, University College Cork, Ireland
Abstract
Introduction: An online student experience survey was conducted in 2013 on
all undergraduate students in University College Cork. One aspect of the survey
was the Academic Confidence Scale (ACS). The ACS contains 24 items relating to
students’ perception of academic confidence. The underlying dimensions of the ACS
have been investigated in studies in the UK, Spain and Ireland. In this study, the
factor structure of the ACS was derived and compared with the factor structures
found in these other studies. The aim of this study was to validate which factor
structure is most suitable for our undergraduate data.
Method: After screening for outliers, the dataset (n = 2029) was evenly split
into datasets A and B. Exploratory factor analysis using principal component factor
extraction with oblimin rotation was applied to dataset A. Factors with eigenvalues
> 1.0 were retained. Items with factor loadings of ≥ 0.4 on one factor and < 0.4 on
all other factors were retained. The resulting factor structure was compared to other
proposed factor structures for the ACS. To verify the stability of the factor structure,
dataset B was independently factor analysed and its factor structure compared with
that of dataset A.
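The eigenvalue-greater-than-one retention rule can be sketched as follows on synthetic two-factor data; the loadings and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 1000, 8

# Synthetic survey items: two latent factors, four items loading on each.
f = rng.normal(size=(n, 2))
load = np.zeros((2, p))
load[0, :4] = 0.8
load[1, 4:] = 0.8
items = f @ load + rng.normal(scale=0.6, size=(n, p))

# Kaiser criterion: retain factors whose correlation-matrix eigenvalue > 1.
eig = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]
n_retained = int((eig > 1.0).sum())
print(n_retained, eig[:3].round(2))
```

The loading-threshold step in the abstract (keep items loading at least 0.4 on exactly one factor) would then be applied to the rotated loading matrix.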
Results: Exploratory factor analysis identified a four factor solution using 21
items that explained 57% of the variance. These factors were examined and la-
belled ‘Preparation and Understanding’, ‘Engagement’, ‘Attendance’ and ‘Study
and Achievement’. Conducting exploratory factor analysis on dataset B, led to an
identical (same items dropped from the model, same items loading on each factor)
four factor solution, thus indicating a stable model.
Conclusion: The four factor solution proposed was very similar to existing factor
structures. Two factors ‘Attendance’ and ‘Engagement’ are identical in all proposed
structures, with different combinations of the other items leading to differently la-
belled factors. The four factor solution is stable, however to confirm if this solution
provides a better description of the data than the factor structures from other stud-
ies, a confirmatory factor analysis must be conducted.
44
Stochastic modelling of atmospheric re-entry
highly-energetic breakup events
Cristina De Persis1, Simon Wilson1
1School of Computer Science and Statistics
Trinity College Dublin
Ireland
Abstract
Spacecraft and rocket bodies re-enter via targeted trajectories or naturally decaying
orbits at the end of their missions.
An object entering the Earth's atmosphere is subject to atmospheric drag forces.
The friction caused by these forces during entry heats up the object. The action
of the aerodynamic forces and the heating of the structure, with the resulting
internal structural stresses and melting of some materials, usually cause the
fragmentation of the object. In some cases, under certain physical conditions, the
structural integrity of the object can no longer be maintained and the object
explodes.
The fragments resulting from the explosion or the fragmentation could cause
serious damage where they impact the Earth's surface. While there are various
tools able to detect the fragmentation of a spacecraft, the explosion process is a
break-up mode not yet adequately modelled.
First of all, I want to demonstrate how fault tree and Bayesian network theories
could be applied to assess the probability of an explosion, starting from the
combination of the elementary causes that can lead to its occurrence.
Next, I want to present a first attempt to model the uncertainty of these
elementary causes, i.e. conditions of temperature and pressure, using an
autoregressive model, and to show how Cox's proportional hazards model could be
useful for this problem.
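The fault-tree part of the first step can be sketched with a toy tree; the elementary causes, gate structure and probabilities below are all hypothetical:

```python
# Hypothetical fault tree for an explosion event. Elementary causes are
# assumed independent; all probabilities are invented for illustration.
p_overpressure = 0.02    # tank overpressure from re-entry heating
p_residual_fuel = 0.10   # residual propellant on board
p_ignition = 0.30        # ignition source present given heating

# AND gate: a fuel-driven explosion needs residual fuel AND an ignition source.
p_fuel_explosion = p_residual_fuel * p_ignition

# OR gate (independent events): top event = overpressure OR fuel explosion.
p_top = 1 - (1 - p_overpressure) * (1 - p_fuel_explosion)
print(round(p_top, 4))   # -> 0.0494
```

A Bayesian network generalises this by allowing dependence between the elementary causes, replacing the independence products with conditional probability tables.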
45
An Examination of Variable Selection Strategies for the Investigation of Tooth Wear in Children
Cathal Doherty1, Michael Cronin1 and Mairead Harding2
1Department of Statistics, University College Cork
2Oral Health Services Research Centre, University College Cork
Abstract
Tooth wear is an all-encompassing term describing non-carious loss from the surface of the tooth due to attrition, abrasion or erosion. Attrition is the mechanical wearing of tooth against tooth, abrasion is the wearing of the tooth surface caused by friction, and erosion is wearing by an acid which dissolves enamel and dentine (Smith 1989). Data have been collected longitudinally at three time points (5, 12 & 14 years old) and will be collected at a fourth (16 years old), for a sample of children residing in Cork city and county. This longitudinal dataset involves a large number of potential predictors, correlation between predictors, a diminishing sample size (202 at age 5, 123 at age 12 & 85 at age 16) and missing data.
Tooth wear (outcome variable) is recorded as a categorical variable; hence a multinomial logistic model is appropriate. Numerous variable selection techniques are identified and examined for suitability or potential development in the selection of such a model. These techniques include the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani 1996), the Elastic Net (Zou and Hastie 2005) and the Dantzig Selector (Candes and Tao 2007). We compare and contrast these techniques utilising simulated data, which have the same potential complexities as the longitudinal data collected over the four time points.
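As a flavour of the selection methods being compared, the LASSO can be sketched in a few lines of numpy via cyclic coordinate descent with soft-thresholding; the simulated sparse design below is illustrative and entirely unrelated to the tooth-wear data:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO by cyclic coordinate descent: minimise
    (1/2n)||y - Xb||^2 + lam * ||b||_1 with soft-thresholding updates."""
    n, p = X.shape
    b = np.zeros(p)
    scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # partial residual excluding j
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / scale[j]
    return b

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                # sparse truth: 3 active predictors
y = X @ beta_true + rng.standard_normal(n)
b_hat = lasso_cd(X, y, lam=0.1)                 # small coefficients shrink to zero
```

The Elastic Net adds a ridge term to the same objective, while the Dantzig Selector constrains the maximal correlation of residuals with predictors instead.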
Candes, E. and T. Tao (2007). "The Dantzig Selector: Statistical Estimation When p Is Much Larger than n." The Annals of Statistics 35(6): 2313-2351.
Smith, B. (1989). "Toothwear: etiology and diagnosis." Dental Update 16: 204-212.
Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society. Series B (Methodological) 58(1): 267-288.
Zou, H. and T. Hastie (2005). "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society Series B-Statistical Methodology 67: 301-320.
46
Multivariate analysis of the biodiversity-ecosystem
functioning relationship for grassland communities.
Aine Dooley1, Forest Isbell2, Laura Kirwan3, John Connolly4, John A. Finn5 and
Caroline Brophy1
1Department of Mathematics and Statistics, National University of Ireland Maynooth, Maynooth, Co. Kildare, Ireland.
2Department of Ecology, Evolution, and Behavior, University of Minnesota, St Paul, Minnesota 55108, USA.
3Department of Chemical and Life Science, Waterford Institute of Technology, Cork Road, Waterford, Ireland.
4School of Mathematical Sciences, Ecological and Environmental Modelling Group, University College Dublin, Dublin 4, Ireland.
5Teagasc Environment Research Centre, Johnstown Castle, Co. Wexford, Ireland.
Abstract
Current methods for analysing the biodiversity-ecosystem function (BEF) relation-
ship typically focus on a single ecosystem function (such as the biomass produced);
however, biodiversity can affect multiple ecosystem functions simultaneously (multi-
functionality). Analysing a single function may therefore provide an incomplete picture
of the effects of biodiversity on ecosystem functioning. The Diversity-Interaction model [1] can be
used to model a single ecosystem function based on the species sown proportions and
how the species interact with one another. Here we extend the Diversity-Interaction
model [1] to a multivariate model. This extension allows us to explore the BEF
relationship for multifunctional ecosystems whilst also providing information about
how the ecosystem functions relate to one another. The Diversity-Interaction mul-
tivariate model also allows us to explore the relative effect of species on multiple
functions. We used this method to analyse data from a four species grassland ex-
periment and found that there was a positive effect of increasing biodiversity on
multiple ecosystem functions.
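A minimal sketch of the univariate Diversity-Interactions design (identity terms in the sown proportions plus pairwise interaction terms) for a four-species community; the coefficients are made up, and the multivariate extension described in the abstract is not reproduced here:

```python
import numpy as np
from itertools import combinations

def di_design(P):
    """Diversity-Interactions design matrix: species identity columns p_i
    plus pairwise interaction columns p_i * p_j."""
    pairs = [P[:, i] * P[:, j] for i, j in combinations(range(P.shape[1]), 2)]
    return np.column_stack([P] + pairs)

rng = np.random.default_rng(0)
n, s = 100, 4
P = rng.dirichlet(np.ones(s), size=n)           # sown proportions, rows sum to 1
X = di_design(P)                                # 4 identity + 6 interaction columns
truth = np.array([8, 6, 5, 4, 3, 3, 3, 3, 3, 3], float)  # hypothetical effects
y = X @ truth + 0.05 * rng.standard_normal(n)   # one simulated ecosystem function
coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit
```

Stacking several such responses and fitting them jointly, with correlated errors across functions, is the multivariate step the abstract develops.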
References
[1] Kirwan, L. et al. (2009) Diversity-interaction modeling: estimating contributions
of species identities and interactions to ecosystem function. Ecology, 90, 2032-2038.
47
Classification methods for mortgage distress
Trevor Fitzpatrick
Central Bank of Ireland
Abstract
The banking crisis in Ireland has been one of the most severe since the 1970s (Laeven
and Valencia, 2012). One particularly important dimension of the Irish crisis has
been the large amount of mortgage debt originated in the boom years and the
subsequent large increase in mortgage delinquencies since the start of the crisis in 2008. The
magnitude of the current problem suggests that classification methods may be a
useful approach for estimating the probability of being in arrears, in order to triage
cases as part of a systematic debt work-out.
This applied paper uses borrower level origination data, macroeconomic and mort-
gage payment status data for over 100,000 mortgages to compare the performance
of a number of classification approaches for current and future arrears status. The
methods explored include boosted regression trees, generalised linear and additive
logistic regression models (Berg, 2007), (Hastie et al., 2009), (Muller, 2012).
Preliminary results suggest regression trees and generalised additive models outper-
form generalised linear models based on AUC measures. The results indicate that
the various approaches have a reasonable degree of predictive power, but this be-
gins to degrade after 12-18 months. Related results from partial response analysis
suggest the presence of non-linear effects among the features considered. Overall,
the results suggest that early intervention strategies are possible using currently
available information.
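The AUC measure used to compare the classifiers is equivalent to the Mann-Whitney U statistic, i.e. the probability that a randomly chosen case in arrears is scored above a randomly chosen case not in arrears; a self-contained sketch on toy scores (not the mortgage data):

```python
import numpy as np

def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic, with ties counted one half."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):                 # midranks for tied scores
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    n_pos, n_neg = labels.sum(), (~labels).sum()
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# a classifier that ranks every positive above every negative attains AUC 1.0;
# a constant score gives 0.5, the no-discrimination baseline
perfect = auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```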
References
Berg, D. (2007), ’Bankruptcy prediction by generalized additive models’, Applied
Stochastic Models in Business and Industry, 23(2).
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical
Learning, Springer.
Laeven, L., and Valencia, F. (2012), ’Systemic Banking Crises Database: An Up-
date’, IMF Working Paper, WP/12/163.
Muller, M. (2012), ’A case study on using generalized additive models to fit credit
rating scores’, Paper presented to BIS-Irving Fisher Conference, Dublin.
48
Atraumatic Restorative Treatment vs. Conventional
Treatment in an Elderly Population
Amy Halton1, Michael Cronin1 and Cristiane Mendonca da Mata2
1Department of Statistics, University College Cork, Ireland
2Department of Restorative Dentistry, University Dental School and Hospital, Cork, Ireland
Abstract
Introduction: In the last twenty years, Atraumatic Restorative Treatment (ART)
has been introduced into developed societies because of its minimally invasive nature,
making it suitable for patients who suffer from stress and fear of dental procedures.
However, to date not many studies have been completed on the effectiveness of ART
compared to conventional treatment for the elderly population. This study compares
the one year survival rate of ART to conventional treatment in a population aged
over 65. The effects of age, gender, cavity class of the restoration and the number of
restorations the patient received on the one-year survival rates were also assessed.
Method: Logistic regression was used to analyse the data. The bootstrapping-by-
cluster technique was employed as there was an issue with dependence in the data;
5000 bootstrap samples were taken. Following a literature review, two methods
for dealing with quasi-complete separation in some of the bootstrap samples were
applied and compared. Empirical confidence intervals were calculated since the
estimates for two of the variables were not normally distributed.
Results: ART restorations were 76.7% less likely to survive over a 12-month pe-
riod after accounting for age, gender, cavity class of the restoration and the number
of restorations received by the patient. However, this difference was not statistically
significant, and none of the other variables were found to be statistically significant.
Conclusion: Bootstrapping by cluster was used successfully to overcome the lack
of independence in the data while preserving its dependence structure. 5000 was
shown to be a sufficient number of samples to obtain stable parameter estimates
and standard errors. Firth’s adjustment for reducing the bias of maximum likelihood
estimates was shown to be the best method for dealing with quasi-complete
separation.
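The bootstrapping-by-cluster idea can be sketched as resampling whole patients rather than individual restorations, so that within-patient dependence is preserved in every bootstrap sample. The statistic below is a simple mean on simulated data, standing in for the logistic-regression coefficients of the study:

```python
import numpy as np

def cluster_bootstrap_ci(values, clusters, stat, n_boot=500, seed=0):
    """Percentile bootstrap CI that resamples whole clusters (patients) with
    replacement, keeping each cluster's observations (restorations) together."""
    rng = np.random.default_rng(seed)
    ids = np.unique(clusters)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        chosen = rng.choice(ids, size=len(ids), replace=True)
        sample = np.concatenate([values[clusters == c] for c in chosen])
        reps[b] = stat(sample)
    return np.percentile(reps, [2.5, 97.5])

rng = np.random.default_rng(1)
clusters = np.repeat(np.arange(40), 3)          # 40 patients, 3 restorations each
values = rng.normal(0.7, 0.2, size=clusters.size)  # simulated per-restoration outcome
lo, hi = cluster_bootstrap_ci(values, clusters, np.mean)
```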
49
Doubly Robust Estimation for Clinical Trial Studies
Belinda Hernández123, Andrew Parnell1, Stephen Pennington2, Illya Lipkovich3 and Michael O’Kelly3
1School of Mathematical Sciences, University College Dublin, Ireland
2School of Medicine and Medical Science, University College Dublin, Ireland
3Center for Statistics in Drug Development, Quintiles
Abstract
Missing data are a common problem in clinical trial studies. There are many reasons why data can be missing, including patient drop-out, perhaps due to side effects or lack of efficacy, or other censoring events unrelated to the study outcome such as death from an unrelated disease. Even if a patient completes a study they may still have elements of incomplete data due to measurements being missed at one or more visits. It has been widely noted that ignoring missing data can lead to biased results and incorrect conclusions. This bias may affect the comparison of treatment groups or the representativeness of the study. Because of this, missing data are an important issue that must be dealt with using appropriate statistical techniques.
Here we discuss doubly robust methods for dealing with missing data in the context of clinical trials and present a generalisation of a doubly robust method first proposed by Vansteelandt et al [1] which provides doubly robust estimates for longitudinal data.
Doubly robust estimators combine three models: an imputation model, which models the response variable yi on the covariates X; a missingness model, which calculates πij, the probability of being observed for each subject i at trial visit j; and a final analysis model, which is the model that would have been used had there been no missing values in the data set. With doubly robust methods, either the imputation model or the missingness model, but not both, can be misspecified and the trialist will still obtain unbiased estimates, thus giving the analyst two opportunities to correctly specify a model and obtain valid, consistent results. Output and findings from two illustrative clinical trial datasets, using a SAS macro which performs our proposed doubly robust estimator, will also be shown.
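The double-robustness property can be illustrated in the simplest setting, estimating the mean of a partially observed outcome: below, the missingness model is deliberately misspecified (a constant propensity), yet the augmented inverse-probability-weighted (AIPW) estimate stays close to the truth because the imputation model is correct. All data are simulated; this is not the authors' longitudinal estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
y = 2.0 + 1.5 * x + 0.5 * rng.standard_normal(n)   # true mean of y is 2
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + x)))           # MAR: observation depends on x
obs = rng.random(n) < p_obs

# imputation (outcome) model: OLS of y on x among the observed (correct here)
A = np.column_stack([np.ones(obs.sum()), x[obs]])
beta, *_ = np.linalg.lstsq(A, y[obs], rcond=None)
m = beta[0] + beta[1] * x                          # predictions for everyone

# missingness model: deliberately misspecified constant propensity
pi_hat = np.full(n, obs.mean())

# AIPW: outcome predictions plus an inverse-probability-weighted residual term
dr_est = np.mean(m + obs * (y - m) / pi_hat)
naive_est = y[obs].mean()                          # complete-case mean, biased up
```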
References
[1] Vansteelandt S, Carpenter J, Kenward M (2012) Analysis of incomplete data using inverse probability weighting and doubly robust estimators. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences 6: 37-48
50
A Conjugate Class of Utility Functions for Sequential
Decision Problems
Brett Houlding1, Frank P.A. Coolen2 and Donnacha Bolger1
1Discipline of Statistics, Trinity College Dublin, Ireland.
2Department of Mathematical Sciences, Durham University, UK.
Abstract
The use of the conjugacy property for members of the exponential family of distribu-
tions is commonplace within Bayesian statistical analysis, allowing for tractable and
simple solutions to problems of inference. However, despite a shared motivation,
there has been little previous development of a similar property for using utility
functions within a Bayesian decision analysis. As such, this work explores a class
of utility functions that appear to be reasonable for modeling the preferences of a
decision maker in many real-life situations, but which also permit a tractable and
simple analysis within sequential decision problems.
51
The association between weather and bovine tuberculosis
Renhao Jin1, Margaret Good2, Simon J. More3 , Conor Sweeney1, Guy McGrath3, Gabrielle E. Kelly1
1UCD School of Mathematical Sciences, University College Dublin, Belfield, Dublin 4, Ireland
2Department of Agriculture, Food and the Marine (DAFM), Kildare St, Dublin 2, Ireland
3Centre for Veterinary Epidemiology and Risk Analysis (CVERA), UCD School of Veterinary Medicine, University College Dublin, Belfield, Dublin 4, Ireland
Abstract
Bovine tuberculosis (bTB), caused by infection with Mycobacterium bovis, affects approximately 0.3% of cattle annually in Ireland, with 18,531 reactor cattle identified in 2011. This has major financial implications both for the farmer, whose herd is restricted from trading and whose cattle are slaughtered, and for the exchequer, which compensates the farmer and implements measures to control the disease. Climate describes the long-term variations of the atmosphere and is based on historical weather records for a particular location, usually over 30 years, while weather refers to the short-term state of the atmosphere. Climatic or weather factors could influence herd bTB occurrence in several ways. Firstly, they may affect the survival of M. bovis in the environment. Secondly, climatic factors may affect wildlife ecology, in particular badgers, and inter-species contact. Thirdly, weather factors are known to affect cattle management, farming and food supply. In this study, we examined the influence of weather variables on herd bTB occurrence, together with well-established risk factors, in the area known as West Wicklow in the east of Ireland. Using aggregated data collected from 2005 to 2009, maximum monthly rainfall over quarters and quarterly herd bTB incidence were found to be correlated. Logistic linear mixed models (LLMM) were then fitted to herd-level data, and a non-spatial LLMM was found to describe the data adequately. Herd bTB incidence was positively associated with annual total rainfall, herd size and a herd bTB history in the previous three years, and negatively associated with distance to the nearest badger sett. Our models demonstrate that weather variables are associated with bTB risk in Irish cattle herds. High rainfall levels emerged as a significant predictor of bTB.
We speculate that it may be possible to mitigate some of the impact of high rainfall through changes to farm management, including the additional controls of pre-movement testing and enhanced clearance procedures, as well as bio-security measures. In addition to high rainfall, location and distance to the nearest badger sett, together with herd size and herd bTB history, are associated with herd bTB occurrence. The emergence of the inter-linking factors of high rainfall, location and distance to the nearest badger sett as predictors of herd bTB incidence requires further study. Changeable weather patterns and extreme weather events result in management difficulties for farmers, ecologists and veterinarians that may lead to increased levels of bovine TB in both cattle herds and badgers.
52
Fractal characteristics of raingauge networks in Seoul, Korea
Sun Jung, Hyung-Kyoung Joh and Jong-Sook Park
WISE project team, CATER (Centre for Atmosphere and Earthquake Research)
Abstract
Accurate rainfall measurement is one of the most important factors in flood prediction, a need that has become widely recognised as more intense floods have occurred worldwide in recent decades. At the same time, demand for higher-resolution rainfall measurement has increased, because the lack of observed rainfall limits the prediction of urban floods. This study was initiated by the WISE (Weather Information Service Engine, www.wise2020.org) project, launched in June 2012 and funded by the Korea Meteorological Administration and the National Institute for Meteorological Research. One of the main aims of the project is to enhance the existing raingauge network for Seoul, which suffered severe urban flash floods in 2010, 2011 and 2012.
There are 26 AWS (Automatic Weather System) sites with raingauges maintained by the KMA (Korea Meteorological Administration) in Seoul (605 km2). These raingauges have proved insufficient to capture the intense rainfall that caused severe floods, prompting a search for a method of optimising raingauge locations. This study is an attempt to identify optimal locations for newly installed raingauges using fractal analysis.
First, the correlation coefficient was calculated from the coordinates of the twenty-six AWS sites in Seoul, and a regression coefficient was then estimated from it. The resulting regression coefficient is regarded as the fractal dimension, an indicator of the areal homogeneity of two neighbouring raingauges. The fractal dimension lies in the range 0 (where all stations are concentrated at a single or isolated point) to 2 (where all stations are uniformly distributed). The estimated fractal dimension will be interpreted to indicate where more, or fewer, raingauges are required in order to improve the predictability of urban flash floods in Seoul.
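One common way to obtain such a dimension is the correlation-dimension estimate: the slope of log C(r) against log r, where C(r) is the fraction of station pairs within distance r. The sketch below uses simulated coordinates rather than the Seoul AWS sites, and recovers a dimension near 2 for area-filling stations and near 1 for collinear ones:

```python
import numpy as np

def correlation_dimension(xy, radii):
    """Slope of log C(r) vs log r, where C(r) is the fraction of point pairs
    separated by less than r (the correlation-dimension estimate)."""
    n = len(xy)
    d = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1))
    pair_d = d[np.triu_indices(n, k=1)]          # all distinct pair distances
    C = np.array([(pair_d < r).mean() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope

rng = np.random.default_rng(0)
radii = np.array([0.05, 0.1, 0.2, 0.4])
area = rng.random((500, 2))                                   # stations filling the area
line = np.column_stack([rng.random(500), np.full(500, 0.5)])  # collinear stations
d_area = correlation_dimension(area, radii)
d_line = correlation_dimension(line, radii)
```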
Keywords: fractal dimension, regression analysis, raingauge networks, urban flash floods
53
A Generalized Longitudinal Mixture IRT Model for Measuring Differential Growth in Learning Environments
Damazo T. Kadengye1, Eva Ceulemans2 and Wim Van den Noortgate3
1,2,3Centre for Methodology of Educational Research, KU Leuven, Belgium
1,3Faculty of Psychology and Educational Sciences, KU Leuven – Kulak, Belgium
Abstract
This paper describes a generalized longitudinal mixture item response theory (IRT) model that
allows for detecting latent group differences for item response data obtained from electronic
learning (e-learning) environments or other environments that result in a large number of items.
The described model can be viewed as a combination of a longitudinal Rasch model, a mixture
Rasch model, a random item IRT model, and includes some features of the explanatory IRT
modeling framework. The model assumes the presence of latent classes in item response patterns
either due to initial person-level differences before learning takes place, or as a result of latent
class-specific learning trajectories, or due to a combination of both, and allows for differential
item functioning over the classes. A Bayesian model estimation procedure is described and
results of a simulation study are presented that indicate that the parameters are recovered well
particularly for conditions with large item sample sizes, as well as for balanced sample designs.
Keywords: Item Response Theory, e-Learning, Modelling of Growth, Mixture Models.
54
Small sample confidence intervals for the skewness
parameter of the maximum from a bivariate normal vector
Valentina Mameli1, Alessandra R. Brazzale1
1Department of Statistical Sciences, University of Padua, Italy
Abstract
Azzalini (1985) introduced the skew-normal distribution (SN), which generalizes the
normal distribution through an additional parameter used to regulate the skewness.
The SN distribution presents some inferential issues connected to the estimation of
that additional parameter. In particular, it is not easy to deal with confidence sets
for this parameter. Loperfido (2002) proved that the distribution of the maximum
(the minimum) of two jointly normally distributed exchangeable random variables
with correlation coefficient ρ, is a skew-normal whose skewness parameter depends
on ρ. Mameli et al. (2012), using Loperfido’s result and Fisher’s transformation of
ρ, provided an asymptotic confidence set for λ. Their simulation results showed
that, when the sample size is small or moderate and the correlation coefficient ρ is
close to −1, the actual coverage probability of their asymptotic confidence interval
is close to the nominal coverage and its expected length become larger. The aim of
this paper is to present higher order likelihood based procedures to obtain accurate
confidence intervals for λ in terms of both actual coverage and expected length,
when ρ is negative and close to −1 and for small or moderate sample sizes.
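The Fisher-transformation step underlying the asymptotic interval can be sketched for ρ itself (the further mapping from ρ to the skewness parameter λ is not reproduced here); the simulated sample size and seed are illustrative:

```python
import numpy as np

def fisher_ci(r, n, z_crit=1.96):
    """95% CI for a correlation: Fisher's z = atanh(r) is approximately
    N(atanh(rho), 1/(n-3)); transform, add the margin, back-transform."""
    z = np.arctanh(r)
    half = z_crit / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)

rng = np.random.default_rng(2)
n, rho = 200, -0.9                                 # negative rho close to -1
xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]          # sample correlation
lo, hi = fisher_ci(r, n)
```

Note how the back-transformed interval shortens near the boundary |r| = 1, the regime the abstract focuses on.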
References
[1] Azzalini, A. (1985). A class of distributions which includes the normal ones.
Scandinavian Journal of Statistics, 12, 171–178.
[2] Loperfido, N. (2002). Statistical implications of selectively reported inferential
results. Statistics & Probability Letters, 56, 13–22.
[3] Mameli, V., Musio, M., Saleau, E. and Biggeri, A. (2012) Large sample con-
fidence intervals for the skewness parameter of the skew-normal distribution
based on Fisher’s transformation. Journal of Applied Statistics, 39, 1693–1702.
55
A System Dynamics Investigation of the Long Term
Management Consequences of Coronary Heart Disease
Patients
Janette McQuillan1, Adele H. Marshall1 and Karen J. Cairns1
1Centre for Statistical Science and Operational Research (CenSSOR)
Queen’s University Belfast
Abstract
The incidence of coronary heart disease (CHD) increases with age and, with the pro-
portion of the population aged over 65 in Northern Ireland anticipated to increase
by approximately 42% by 2025, this is set to put a severe strain on the health care
system [1]. It has been reported that the number of cases of CHD in Northern
Ireland is expected to rise from 75,158 to 97,255 between 2007 and 2020 [2]. These
projections highlight the urgent requirement for strategic management of such pa-
tients.
This study aims to develop a model which can identify the long term effects of various
interventions on the health care system in Northern Ireland. It will involve using
a system dynamics modelling approach to evaluate the impact of both upstream
and downstream policy interventions on patients with or at risk of developing CHD.
System dynamics is a form of simulation modelling which aims to replicate the be-
haviour of a real-world system over a given time period and hence make inferences
regarding how this system will operate in the future [3]. Further work will consider
the incorporation of patient length of stay and subsequent costs incurred in the
model.
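The stock-and-flow core of a system dynamics model reduces to numerically integrating d(stock)/dt = inflow − outflow. The sketch below starts the CHD case stock at the reported 75,158 and compares a baseline with a hypothetical upstream intervention that reduces incidence; the inflow and outflow rates are entirely illustrative, not estimates from the study:

```python
import numpy as np

def simulate_stock(stock0, inflow, outflow_rate, years, dt=0.25):
    """Euler integration of one stock: d(stock)/dt = inflow - rate * stock."""
    steps = int(years / dt)
    stock = np.empty(steps + 1)
    stock[0] = stock0
    for t in range(steps):
        stock[t + 1] = stock[t] + dt * (inflow - outflow_rate * stock[t])
    return stock

# baseline incidence vs a hypothetical prevention policy cutting the inflow
baseline = simulate_stock(75158, inflow=6000, outflow_rate=0.05, years=13)
upstream = simulate_stock(75158, inflow=4500, outflow_rate=0.05, years=13)
```

Real system dynamics models chain many such stocks (at-risk, diagnosed, treated) with feedback between the flows.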
References
[1] http://www.nisra.gov.uk/archive/demography/population/projections/Northern%20Ireland%20Population%20Projections%202010%20-%20Statistical%20Report%20-%20FINAL.pdf (accessed on 15/03/2013)
[2] http://www.inispho.org/files/file/Making%20Chronic%20Conditions.pdf
(accessed on 15/03/2013)
[3] Forrester, J. W. (1961). Industrial Dynamics. Originally published by MIT Press,
Cambridge, MA, reprinted by Pegasus Communications, Waltham MA.
56
Modelling competition between overlapping niche predators
Rafael Moral1, John Hinde2 and Clarice Demetrio1
1Departamento de Ciencias Exatas, Universidade de Sao Paulo, Brazil
2School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland
Abstract
The ring-legged earwig, Euborellia annulipes, and the Neotropical stink bug Podisus
nigrispinus are potential biological control agents of the fall armyworm, Spodoptera
frugiperda, and the leafworm Alabama argillacea, important pests of maize and cot-
ton. E. annulipes individuals are usually found on the soil and in the lower-section
of the crops. On the other hand, P. nigrispinus specimens are generally found in
the mid-section of the crops, but are also found in the lower and upper-sections. In
that sense, these predators’ niches overlap in the agroecosystem, hence there may
be competition for prey. Two different experiments were used to study this. Firstly,
males and females of both predators were placed in separate Petri dishes, along with
one S. frugiperda caterpillar. The full experiment consisted of a completely random-
ized design with 33 replicates for males and 34 replicates for females. The system
was observed for one hour and it was recorded whether one competitor attacked
the other, as well as which predator effectively consumed the prey (most efficient
competitor). The outcome of this experiment motivated a second experiment with
a slightly different set-up. This second experiment considered A. argillacea as prey,
in the same situation as the first experiment, but with two caterpillar densities:
1 and 3 per dish. The experiment was laid out in a randomized complete block
design with a 2x2 factorial treatment structure (predator sex and density of prey),
with 50 blocks. The number of attacks between the competitors was recorded, as
well as which predator effectively consumed the prey. Binomial generalized linear
models were fitted to the binary data (which predator attacked / effectively
consumed the prey) and quasi-Poisson models were fitted to the count data (number of attacks).
Preliminary results show that female specimens of E. annulipes show more aggressive
behaviour than males and that the earwig is a more efficient competitor.
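The quasi-Poisson step can be sketched for the simplest (intercept-only) case: keep the Poisson mean structure, estimate a dispersion factor φ from the Pearson statistic, and inflate standard errors by √φ. The simulated counts below are negative binomial, a standard way to generate overdispersion, and are unrelated to the experimental data:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, k, n = 5.0, 2.0, 500
# negative binomial counts: Var = mu + mu^2/k > mu, i.e. overdispersed
counts = rng.negative_binomial(k, k / (k + mu), size=n)

lam_hat = counts.mean()                         # intercept-only Poisson fit
phi = ((counts - lam_hat) ** 2 / lam_hat).sum() / (n - 1)  # Pearson dispersion
se_poisson = np.sqrt(lam_hat / n)               # naive Poisson standard error
se_quasi = np.sqrt(phi) * se_poisson            # quasi-Poisson inflated SE
```

With φ near 3.5 here, the naive Poisson standard errors would be badly optimistic, which is exactly why the abstract's count data are modelled as quasi-Poisson.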
57
Multi-dimensional partition models for monitoring Freeze
Drying in the Pharmaceutical Industry.
Gift Nyamundanda and Kevin Hayes
Department of Mathematics and Statistics, University of Limerick, Ireland
Abstract
Freeze drying, also known as lyophilization, is a low-temperature, batch-wise drying
process used to remove water from pharmaceutical solutions, leaving solids of
sufficient stability for distribution and storage. Freeze drying is a very expensive
process performed through three successive time-consuming stages. The initial step
is freezing, in which the solution is frozen, followed by primary drying, where ice
crystals are removed by sublimation; finally, in the secondary drying step, the
unfrozen water is removed by desorption under high vacuum. There is a need for
methods that can determine, in-line and in real-time, the end points of the different
freeze drying steps, in order to reduce costs, improve process efficiency and
guarantee final product quality.
Currently, spectroscopic process analyzers, such as near-infrared and Raman, are
used in combination with chemometric tools, such as principal component analysis
(PCA) and partial least squares (PLS), to determine the endpoints of the different
critical stages of freeze drying. However, such chemometric methods are limited in
that they are not based on any probability model. Hence, it is difficult to account
for dependence in time measurements and to efficiently predict, with a certain level
of assurance, the endpoints of the intermediate steps of the freeze drying process.
In this work, we treat the problem of determining the endpoints of the different
stages of freeze drying as a multiple change-point problem, such that product
partition models can be employed.
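The simplest instance of this change-point formulation is a single least-squares split of a univariate signal; product partition models generalise it to a posterior distribution over all partitions, which this sketch does not attempt. The signal below is simulated, with a level shift standing in for a drying-stage endpoint:

```python
import numpy as np

def best_split(y):
    """Least-squares single change point: the split index minimising the
    total within-segment sum of squares."""
    best_t, best_cost = None, np.inf
    for t in range(1, len(y)):
        left, right = y[:t], y[t:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

rng = np.random.default_rng(4)
signal = np.concatenate([rng.normal(0.0, 0.3, 80),   # e.g. during primary drying
                         rng.normal(2.0, 0.3, 60)])  # level shift at the endpoint
t_hat = best_split(signal)                           # estimated change point
```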
58
Expert Elicitation for Decision Tree Parameters in Health
Technology Assessment
Shane O Meachair12, Mirko Arnold3, Gerard Lacey3 and Cathal Walsh1
1Discipline of Statistics, Trinity College Dublin, Ireland.
2Mathematical Institute, Leiden University, The Netherlands
3Discipline of Computer Science, Trinity College Dublin, Ireland
Abstract
One of the attractions of Bayesian analysis is the ease with which evidence from dif-
ferent sources can be synthesised, particularly expert knowledge. Expert estimates
of the probabilities of observing certain outcomes can be used as prior informa-
tion and combined with data, or can supplement the likelihood and be combined
with non-informative priors to derive a posterior distribution when direct data is
unavailable or unreliable. However, expert elicitation is seldom formally applied
in practice. In this example, expert elicitation arises in the context of a Health
Technology Assessment (HTA) of an innovative colonoscopy enhancement tool to
be used in colon-cancer screening. HTA assesses the utility, economic cost, social,
political and legal implications of introducing new health technologies. Methods
for performing HTA on medical devices are not yet standardised and direct data is
often unavailable. In this example, parameter estimates for cost-effectiveness were
unavailable due to the early stage of development of the device. A variation of the
non-parametric roulette method, as outlined in SHELF (Oakley and O’Hagan,
2010), was used to elicit probability distributions representing expert knowledge on
the likely distribution of quality metrics associated with the device. An expert physi-
cian with experience in colonoscopy was familiarised with the device and estimates of
probabilities were elicited and used to parameterise Beta distributions. We present
the method of elicitation and its application, along with results yielding parameter
estimates for a cost-effectiveness decision tree.
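A roulette elicitation can be turned into a Beta prior by treating the expert's chips as histogram weights and matching moments; the bins and chip counts below are hypothetical, not the physician's actual allocation, and moment matching is only one of several ways to fit the elicited distribution:

```python
import numpy as np

def beta_from_roulette(bin_edges, chips):
    """Method-of-moments Beta(alpha, beta) fit to a roulette elicitation:
    chips stacked on probability bins act as histogram weights."""
    mids = (bin_edges[:-1] + bin_edges[1:]) / 2
    w = np.asarray(chips, float) / np.sum(chips)
    m = np.sum(w * mids)                        # elicited mean
    v = np.sum(w * (mids - m) ** 2)             # elicited variance
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

edges = np.linspace(0.5, 1.0, 11)               # bins for a quality metric in [0.5, 1]
chips = [0, 0, 1, 2, 4, 6, 8, 6, 2, 1]          # hypothetical chip allocation
a, b = beta_from_roulette(edges, chips)
```

The fitted Beta(a, b) can then feed directly into the cost-effectiveness decision tree as the prior for the corresponding branch probability.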
59
Perceptions of Academic Confidence
John O’Mullane1, Kathleen O’Sullivan2, Amanda Wall2 and Philip O’Mahoney2
1Department of Computer Science, University College Cork, Ireland
2Department of Statistics, University College Cork, Ireland
Abstract
Introduction: Undergraduate (UG) and taught postgraduate (PGT) students in
University College Cork completed an Academic Confidence (AC) Scale, consisting
of 24 items each rated on a 5-point Likert scale, used to identify students’ confidence
in their ability to perform academic tasks. The objective of this study is to determine
the underlying dimensions of the UG and PGT AC scales and to infer the differences
in perception of academic confidence between UG and PGT students.
Method: Principal component factor analyses using varimax (UG) and oblimin
(PGT) rotations were carried out on the AC scale. The number of factors extracted
was determined by eigenvalues greater than 1. A simple structure was desirable
where items with factor loadings ≥ 0.4 on one factor and < 0.4 on the other factors
were retained. Factors defined by two or fewer items were eliminated.
Results: Factor analyses identified four UG factors, based on 23 items, labelled
‘Study and Perform’, ‘Interact’, ‘Prepare and Understand’, and ‘Attend’ and three
PGT factors, based on 20 items, labelled ‘Understanding and Participation’, ‘Com-
mitment’ and ‘Preparation and Achievement’. Eight items loaded on the PGT
‘Understanding and Participation’. Six of these items formed the UG ‘Interact’ and
2 items loaded on the UG ‘Prepare and Understand’. Of the 7 items loading on
the PGT ‘Commitment’, 3 joined UG ‘Attend’, 2 items joined UG ‘Study and Per-
form’ and 2 items joined UG ‘Prepare and Understand’. The PGT ‘Preparation
and Achievement’ consisted of 5 items, all of which loaded on the UG ‘Study and
Perform’. The additional 3 items in the UG loaded on ‘Study and Perform’.
Conclusions: There is overlap in the UG and PGT factor sets, but the combina-
tions of items suggest different perceptions of academic confidence. PGT students
have a more holistic view of academic confidence. For example, PGT ‘Understand-
ing and Participation’ indicates that these students perceive items relating to un-
derstanding and participation as being connected, while UG students perceive items
relating to understanding and preparation as being connected. Also, UG students
perceive participation as a separate factor (‘Attend’), unconnected to understanding.
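The eigenvalue-greater-than-1 retention rule from the Method section can be sketched on simulated two-factor data (the loadings and sample size below are invented, not the AC-scale estimates):

```python
import numpy as np

rng = np.random.default_rng(6)
n, items = 300, 6
f = rng.standard_normal((n, 2))                     # two latent factors
load = np.array([[.8, 0], [.7, 0], [.6, 0],
                 [0, .8], [0, .7], [0, .6]])        # simple-structure loadings
X = f @ load.T + 0.5 * rng.standard_normal((n, items))

R = np.corrcoef(X, rowvar=False)                    # item correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]               # eigenvalues, descending
n_factors = int((eigvals > 1).sum())                # Kaiser criterion
```

A rotation (varimax or oblimin, as in the study) would then be applied to the retained loadings before reading off which items define each factor.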
60
An extension to the Goel and Okumoto model of software
reliability
Sean O Riordain1 and Simon P. Wilson1
1School of Computer Science and Statistics, Trinity College, Dublin
Abstract
This work presents an extension of the Goel and Okumoto [1] software reliability
model to three parameters. Previously the model has been applied to all of the
bug data simultaneously; here the model is split at the release date, allowing the
detection rate b before the release date to differ from the rate b′ after it.
The model is then applied to Mozilla Firefox data for the rapid releases (versions
5+). We explore the variation in behaviour across successive releases of Firefox and
discuss how this might be modelled.
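A sketch of the extended mean-value function, assuming (as one plausible reading of the extension, not necessarily the authors' exact formulation) that the two regimes are joined continuously at the release date; all parameter values are illustrative:

```python
import numpy as np

def go_mean(t, a, b, b2, t_release):
    """Goel-Okumoto expected cumulative bug count m(t) = a(1 - exp(-b t)),
    with detection rate b before release and b2 (= b') after, joined
    continuously at the release date."""
    t = np.asarray(t, float)
    m_rel = a * (1 - np.exp(-b * t_release))        # bugs found by release
    before = a * (1 - np.exp(-b * t))
    after = m_rel + (a - m_rel) * (1 - np.exp(-b2 * (t - t_release)))
    return np.where(t <= t_release, before, after)

t = np.linspace(0, 20, 201)                         # time since start, say in weeks
m = go_mean(t, a=100, b=0.05, b2=0.2, t_release=6)  # faster detection post-release
```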
References
[1] Amrit L. Goel and Kazu Okumoto, Time-Dependent Error-Detection Rate Model
for Software Reliability and Other Performance Measures. IEEE Transactions on
Reliability, 1979, Vol. R-28, No. 3, pp. 206-211.
61
A new individual tree volume model for Irish Sitka spruce
with a comparison to existing Forestry Commission and
GROWFOR models
Sarah O’Rourke1, Gabrielle Kelly1 and Mairtin Mac Siurtain2
1School of Mathematical Sciences, University College Dublin
2School of Agriculture and Food Science, University College Dublin
Abstract
Sitka spruce (Picea sitchensis) is the main species of tree in Irish forests accounting
for approximately 53% of the total forest area. A model is required for estimating
the individual tree volume of Irish Sitka spruce, without the necessity of felling trees,
so that the amount of thinnings to be removed, the value of the forest and forest
forecasts can be estimated. The objective of this study was to develop a model for
estimating the individual tree volume of Irish Sitka spruce and to compare this with
existing models, namely the UK Forestry Commission single tree tariff model for
Sitka spruce, still widely used in the UK and Ireland, and the Irish GROWFOR
model for Sitka spruce, developed using the multivariate Bertalanffy-Richards model
and used commercially.
Coillte Teoranta (The Irish Forestry Board Limited) maintains the most extensive
crop structure database on Sitka spruce in the Irish Republic. The database in-
cludes many forestry thinning and spacing experiments that have involved repeated
measures on trees during the period 1963 to 2006. Permanent sample plots were
laid down in forests throughout the country and each tree was given a unique I.D.
number. The age and diameter at breast height (DBH) of all trees were recorded at
each assessment. Within each plot a subsample of trees had their height and volume
also recorded. For many of these trees actual volume was also recorded as they have
now been felled.
A number of models were fit to the data using least squares regression to investigate
the variables that can be used to predict actual tree volume. The Box-Cox trans-
formation method, along with diagnostic analysis, was used to improve the model.
Volume estimates were also calculated from the data using the UK Forestry Com-
mission and Irish GROWFOR models for Sitka spruce. The models were compared
using the approximate volume per hectare (m3 ha−1) estimates based on predicted
values from the models and their agreement with actual values. 2k-fold cross-validation
was used to carry out a subset comparison for the preferred model.
We find that the existing UK and GROWFOR models underestimate the volume of
timber in the Coillte sample plots (by 8.6% and 4.6% respectively), while the new
model overpredicts volume by a small amount (0.38%).
63
An Analysis of Lower Limits of Detection of Hepatitis C
Virus
Kathleen O’Sullivan1, Liam J. Fanning2, John O’Mullane3, ST4050 Students1 and
Linda Daly1
1Department of Statistics, School of Mathematical Sciences, University College Cork, Ireland
2Molecular Virology, Department of Medicine, Cork University Hospital and University College Cork, Ireland
3School of Computer Science & Information Technology, University College Cork, Ireland
Abstract
Introduction: Hepatitis C is a disease that attacks liver function, ranging from a
mild illness lasting a few weeks to a serious lifelong condition that can lead to
liver cirrhosis or hepatocellular carcinoma. It is estimated that 3% of the world's
population is infected. Hepatitis C is sub-classified into six genotypes; we studied
genotypes 1, 2 and 3. Detection involves testing a peripheral blood sample for
the amount of virus material present (Viral Load (VL), IU/ml), and in this study
the test results are qualitatively described as a positive response (virus detected)
or Target Not Detected (TND). Detection varies with VL and becomes uncertain
at lower levels of VL. This study determined the lower limit of detection (LLOD),
the VL required to achieve a 95% detection (hit) rate, for genotypes 1, 2 and 3, and
compared these with the manufacturer's values. We also investigated whether the
effect of VL on hit rates differed by genotype.
Method: The laboratory tested independently validated third party proficiency
controls in which the VLs had been identified by the manufacturer for genotypes
1, 2 and 3. Data collected since 2005 provided, for each genotype, the number of
replicates tested (4 – 27) at each VL (0.04 – 10,000 IU/ml) and the number of these
that tested positive. LLODs were estimated by fitting probit models, Pi = Φ(β0 +
β1 log10 VLi), where Pi = hit rate, Φ = cumulative standard normal distribution
function, β0 = intercept and β1 = slope; model parameters were estimated by
Maximum Likelihood Estimation. A likelihood ratio test investigated whether the
effect of VL on hit rate differed by genotype. Pearson's chi-square goodness-of-fit
test assessed model fits. Statistical significance was determined using p < 0.05.
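Given fitted probit coefficients, the LLOD is the viral load at which the modelled hit rate reaches 95%, i.e. the solution of Φ(β0 + β1 log10 VL) = 0.95. A minimal sketch; the coefficients shown are illustrative, not the study's fitted values:

```python
from statistics import NormalDist

def llod(beta0, beta1, hit_rate=0.95):
    """Viral load (IU/ml) at which the fitted probit curve
    P = Phi(beta0 + beta1*log10(VL)) reaches the target hit rate."""
    z = NormalDist().inv_cdf(hit_rate)   # Phi^-1(0.95) is about 1.645
    return 10.0 ** ((z - beta0) / beta1)

# Illustrative coefficients only (not the study's fitted values):
estimate = llod(beta0=-0.5, beta1=2.0)
```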
64
Results: We estimated the LLODs for HCV genotypes 1, 2 and 3 as 10.9 IU/ml
[95% CI: 6.2 – 30.4 IU/ml], 11.2 IU/ml [95% CI: 5.6 – 50.5 IU/ml] and 28.8 IU/ml
[95% CI: 11.2 – 227.7 IU/ml] respectively. The manufacturer's stated LLOD was 8
IU/ml [95% CI: 6 – 14 IU/ml]. There was no evidence that the effect of VL differed
by genotype (p > 0.05). Pearson's chi-square goodness-of-fit tests confirmed the
suitability of all models fitted to the data (p > 0.05).
Conclusions: The LLODs for genotypes 1 and 2 were comparable to the manufac-
turer's value, but the LLOD for genotype 3 was not. The effect of VL on hit rate did not differ by
genotype. The laboratory is performing within established criteria for genotypes 1
and 2. For genotype 3, the data was limited due to the low number of replicates
tested at VLs in the range 3.7 IU/ml to 37 IU/ml. The findings suggest that a
power calculation should be conducted to inform the number of replicates tested at
each VL, and that more levels in a restricted range of VLs (1 – 50 IU/ml) should
be examined to improve the estimation of the LLODs.
65
Predicting Magazine Subscriber Churn
Kathleen O’Sullivan1, Andrew Grannell2, John O’Mullane3 and ST4050 Students1
1Department of Statistics, University College Cork, Ireland
2Statistical Solutions, Cork, Ireland
3Department of Computer Science, University College Cork, Ireland
Abstract
Introduction: Retention of subscribers is critical to maintaining the success of
magazines. It is significantly more cost effective to retain subscribers than acquire
new ones. Predictive analytics focuses retention campaigns by predicting which
subscribers are at risk of not renewing (churning) and targeting them. Without
predictive targeting, a retention campaign may cost more than it gains. This study
aimed to identify subscribers who were likely not to renew their subscription to a
sports magazine and to create a profile of these non-renewal subscribers.
Method: Data was provided on 22,265 subscribers to a sports magazine by the
client. Thirteen variables, four relating to subscription incentive, two relating to
payment, one relating to renewal opportunities, three relating to the nature of the
current subscription, one relating to customer location, one relating to subscription
duration and one signifying if the subscriber renewed, formed a subscriber record.
Logistic regression was used to estimate the probability of a subscriber not renewing
based on the variables in the subscriber record. The performance of the fitted model
was assessed using diagnostic measures, including sensitivity, specificity, positive
predictive value (PPV) and negative predictive value (NPV), and ROC analysis.
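The diagnostic measures named above all follow directly from the model's 2×2 confusion matrix. A minimal sketch, with hypothetical counts rather than the study's data:

```python
def diagnostics(tp, fp, fn, tn):
    """Standard diagnostic measures from a 2x2 confusion matrix, where
    'positive' means a predicted non-renewal (churn)."""
    return {
        "sensitivity": tp / (tp + fn),            # churners correctly flagged
        "specificity": tn / (tn + fp),            # renewers correctly cleared
        "ppv": tp / (tp + fp),                    # flagged who actually churn
        "npv": tn / (tn + fn),                    # cleared who actually renew
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts, not the study's data:
d = diagnostics(tp=900, fp=100, fn=60, tn=240)
```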
Results: A logistic regression model based on 12 subscriber variables identified
18,174 non-renewal subscribers. The model’s predictive ability was 61% (Nagelkerke
R2). It correctly classified 88% of subscribers, and the sensitivity and specificity were
94% and 66% respectively. Additionally PPV was 91% and NPV was 76%. Area
under the ROC curve was 92%. Models with fewer subscriber variables (7, 5 and 4)
were considered and demonstrated comparable classification performance.
Conclusions: The journal reader codes of predicted non-renewal subscribers were
identified. This group was segmented by decile of risk and a profile of the subscribers
in each decile was created, for use in targeted retention campaigns. As models with
fewer subscriber variables had comparable performance, this may be used to reduce
the amount of data a subscriber has to supply without loss of predictive ability.
66
Predicting Retinopathy of Prematurity in Newborn Babies
Rebecca Rollins∗, Adele H. Marshall∗ and Karen Cairns∗
∗CenSSOR, Queen’s University Belfast, Belfast, BT7 1NN
Abstract
Retinopathy of prematurity1 (ROP) is a disease in which the retinal blood vessels
of premature infants fail to grow and develop normally. ROP is one of the major
causes of childhood blindness: globally, it is estimated that at least 50,000 children
are blind from ROP, and likely many more are unilaterally blind or visually
impaired. In general, about 60% of low birthweight infants will develop some degree
of ROP1. The combined effect of increasing premature infant survival rates and the
nature of the disease makes predicting ROP difficult at such an early stage of life.
Research into the prediction of ROP has mainly been undertaken by clinicians who
utilise statistical analysis such as Chi-square tests and logistic regression to identify
risk factors and subsequently predict ROP. In particular, the WINROP2 algorithm
uses longitudinal measures to monitor and predict ROP, with detection rates of
100% in those infants who required treatment and 84% in those who did not.
However, this method relies on the collection of clinical measures over time and is
not appropriate as a tool to indicate which infants are most at risk of ROP at birth.
The purpose of this research is to predict ROP in premature infants using infor-
mation known at birth, which is provided by the Royal Victoria Hospital Belfast.
Techniques explored consist of decision trees, random forests, adaboost models, sup-
port vector machines (SVMs) and neural networks. Results show that the models
performed similarly overall. However, given the fairly uncommon occurrence of
ROP and the documented difficulties in predicting minority classes, the SVM had
the best potential to be developed further. The SVM model has a sensitivity of 68%
and area under the ROC curve of 0.9016. Future work will involve developing the
SVM model for the multiclass minority problem, allowing patients with ROP to be
classified by the severity of the disease, indicating those who require treatment.
1Gilbert, C., Retinopathy of prematurity: A global perspective of the epidemics, population of babies at risk and implications for control. Early Human Development, 2008, 84, 77–82.
2Lofqvist, C., et al., Longitudinal Postnatal Weight and Insulin-like Growth Factor I Measurements in the Prediction of Retinopathy of Prematurity. Archives of Ophthalmology, 2006, 124, 1711–1718.
67
A new approach to variable selection in the presence of
multicollinearity: a simulation study
Olivier Schöni
Department of Quantitative Economics, University of Fribourg,
Bd. de Pérolles 90, 1700 Fribourg, Switzerland
Abstract
The present paper evaluates the impact of multicollinearity on automated variable
selection procedures. This objective is achieved by comparing the model selection
performance of different selection methods in hedonic regression models into which
noise variables inducing/not inducing multicollinearity have been introduced.
Besides analysing widespread stepwise selection methods based on information
criteria, a new selection method using a multimodel approach is also examined. In
order to gauge the performance of the considered selection procedures, a data
generating process is simulated using real housing data. The ability of each
selection method to correctly classify informative and uninformative variables is
measured by means of its balanced accuracy. The proposed multimodel selection
rule is shown to perform systematically better than the usual stepwise selection
methods.
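Balanced accuracy, the performance measure used above, is the mean of sensitivity and specificity over the informative/uninformative classification of variables. A minimal sketch with hypothetical counts:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity, where 'positive' means a truly
    informative variable that the selection procedure kept."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)

# Hypothetical run: 8 of 10 informative variables kept,
# 15 of 20 noise variables correctly dropped.
score = balanced_accuracy(tp=8, fn=2, tn=15, fp=5)
```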
68
Lapse prediction models
Lukas Sobisek1, Maria Stachova2
1Department of Statistics and Probability, Faculty of Informatics and Statistics,
University of Economics, Prague, CZ 13067 Czech Republic
(e-mail: [email protected])
2Department of Quantitative Methods and Information Systems, Faculty of
Economics, Matej Bel University, Banska Bystrica, SK 97590 Slovak Republic
(e-mail: [email protected])
Abstract
The main objective of our contribution is to evaluate and compare different
classification models that can be used to assess insurance customer risk. These
models are built on a real customer data set from a Czech insurance company. The
purpose of these models is to observe and evaluate policy lapses two years after
inception. Bayesian models, such as Bayesian logistic regression, as well as
non-Bayesian models, such as classification trees and random forests, were applied
in our analysis. The models were estimated in the statistical system R and in the
SPSS software.
69
Latent class model of attitude of the unemployed
Jiri Vild
Department of Statistics and Probability, University of Economics, Prague, Czech Republic
Abstract
The paper deals with the attitudes of the Czech unemployed. In the periodical Labour Force Survey, the unemployed answer questions concerning the ways they choose to find a new job: they can simply be registered at the Labour Office, use specialised agencies, contact potential employers directly, etc. The analysis will use data from the Labour Force Survey held in spring 2011. In the first stage a latent class model will be estimated and, by analysing its parameters, we will find specific groups of the unemployed, revealing which ways of job search they do or do not prefer. We will also examine the share of these groups to find out the overall attitude of the Czech unemployed. In the second stage the latent class model will be extended with covariates. In latent class analysis, covariates are used to predict the probability of membership in the latent classes. Covariates such as gender, age or working experience will be incorporated into the model to find out how attitudes differ across specific subgroups of respondents.
70
Variable Selection Techniques for Multiply Imputed Data
Deirdre Wall1, Grace Callagy2, Helen Ingoldsby2, Michael J Kerin3 and John Newell1,4
1 School of Mathematics, Statistics and Applied Mathematics, NUI Galway, Ireland
2 Discipline of Pathology, NUI Galway
3 Discipline of Surgery, NUI Galway
4 HRB Clinical Research Facility, NUI Galway, Ireland
Abstract
Missing data can be a serious problem, in particular in retrospective observational
studies, where the percentage of subjects with complete data can be of concern. In
casewise deletion, when a subject is missing a value in one predictor, the whole case
(subject) is omitted as a consequence. This can result in over half the cases being
deleted, even if there is as little as 10% missing in each predictor. This reduction in
sample size will reduce the power of such studies to identify useful predictors.
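The casewise-deletion arithmetic is easy to verify: if values are missing independently with probability p in each of k predictors, the expected fraction of complete cases is (1 − p)^k. A quick check of the "10% in each predictor" claim, under the independence assumption:

```python
def complete_fraction(k, p):
    """Expected fraction of complete cases when each of k predictors is
    missing independently with probability p."""
    return (1.0 - p) ** k

# With 10% missing in each of 7 predictors, under half the cases survive:
frac = complete_fraction(7, 0.10)   # just under 0.48
```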
Multiple Imputation (MI) is a popular technique in missing data problems. MI uses
models based on observed data to replace missing values with credible values. This
process is repeated a number of times to create several imputed datasets. Typically
MI is used at the end of the analysis, where the final model generated from the
original data is fitted to each of the imputed datasets and the results combined
using Rubin's Rules. An alternative approach to model selection is to identify the
final model based on the imputed datasets.
For example, variable selection techniques can be applied to each imputed dataset.
An appealing attribute of this approach is that power can be retained by avoiding
casewise deletion. However, an extra level of complexity is added in terms of
identifying a consistent set of predictors.
Methods in the literature for variable selection in multiply imputed data suggest
selecting predictors using a voting system, such as selecting predictors that appear
in any, half or all of the models. Another suggested method is to stack the imputed
datasets and perform weighted regression, using weights related to the amount of
missingness present.
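The voting rule described above can be sketched directly; the variable names and the threshold interface below are illustrative assumptions, not notation from the papers in question:

```python
from collections import Counter

def vote_select(selections, threshold=0.5):
    """Keep predictors chosen in at least `threshold` of the imputed
    datasets; a tiny threshold gives the 'any' rule, 1.0 the 'all' rule."""
    m = len(selections)
    counts = Counter(v for chosen in selections for v in chosen)
    return sorted(v for v, c in counts.items() if c / m >= threshold)

# Hypothetical stepwise-selection results from m = 4 imputed datasets:
sels = [{"age", "grade"}, {"age", "size"}, {"age", "grade"}, {"grade"}]
```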
Results will be presented from a simulation study comparing the performance of
variable selection techniques in multiply imputed data, along with a new method of
imputation in which random forests are used to impute a single dataset, values
being imputed by averaging over many unpruned classification or regression trees.
In addition, an application of variable selection in the presence of missing data, in
determining a prognostic model for Disease Free Survival in the breast cancer
cohort at University College Hospital Galway, will be presented.
71
A maintenance policy for a two-phase system in utility
optimisation
Shuaiwei Zhou1, Brett Houlding1 and Simon P. Wilson1
1Discipline of Statistics, Trinity College, Dublin, Ireland.
Abstract
We consider a general system with two phases whose failure times follow a specified
distribution, which possibly leads to failure prior to maintenance. Utility
properties, such as the costs of maintenance, repair and failure, are incorporated
into the reliability assessment of the system. The system is subject to a
preventative maintenance policy with sequential maintenance decisions. We
compute the sequential optimal maintenance time with respect to the unit-time
utility optimisation of the system. Numerical examples are studied.
72
Index
Ahilan S, O'Sullivan JJ, Masterson B, Demeter K, Weijer W, O'Hare G, 22
Alquier P, Butucea C, Hebiri M, Meziani K, Morimae T, 5, 6
Alvarez-Iglesias A, Newell J and Hinde J, 2
Avalos G, Alvarez-Iglesias A, Parker I and Newell L, 37
Bahnassy AA, 38
Barragan-Martinez MA, Newell J and Escarela-Perez G, 39
Boland A, Friel N, 40
Boyce S, Murphy L, Fitzpatrick JM, Murphy TB and Watson RWG, 41, 42
Brick E, Harrison P, Sutton G, Cronin M and Wolsztynski E, 43
Brick E, O'Sullivan K and O'Mullane J, 44
Buja A, Berk R, Brown L, Zhang K, Zhao L, 1
Cairns K, McMillen P, O'Doherty M. and Kee F, 27
Chen B and Sinn M, 10
De Persis C, Wilson S, 45
Doan TK, Parnell AC, Haslett J, 21
Doherty C, Cronin M and Harding M, 46
Dooley A, Isbel F, Kirwan L, Connolly J, Finn JA and Brophy C, 47
Fitzpatrick T, 48
Gillespie J, McClean S, Scotney B, FitzGibbon F, Dobbs F and Meenan BJ, 7
Grimaldi, M, 17
Halton A, Cronin M and Mendonca da Mata C, 49
Haslett J, 20
Hastie T, 11
Helu A and Samawi H, 18
Hernandez B, Parnell A, Pennington S, Lipkovich I and O'Kelly M, 50
Hinde J, Jorgensen B, Demetrio C and Kokonendji C, 14
Hofer V, 32
Houlding B, Coolen FPA and Bolger D, 51
Hwang WT and Kazak AE, 26
Jin R, Good M, More SJ, Sweeney C, McGrath G and Kelly GE, 52
Jung S, Joh HJ and Park JS, 53
Kadengye DT, Ceulemans E. and Van den Noortgate W, 15, 54
Leisch F, 28
Lewitschnig H, Kurz D and Pilz J, 33
MacKenzie G, Xu J, 13
Mameli V, Brazzal AR, 55
Marshall AH, Zenga M, Crippa F and Merchich G, 12
McCormack K, 16
McCrink LM, Marshall AH and Cairns KJ, 23
McDowell R, Ryan A, Bunting B, O'Neill S, 36
McQuillan J, Marshall AH and Cairns KJ, 56
Moral R, Hinde J and Demetrio C, 57
Nyamundanda G and Hayes K, 58
O'Hagan A. and O'Carroll S, 34
O'Meachair S, Arnold M, Lacey G and Walsh C, 59
O'Mullane J, O'Sullivan K, Wall A and O'Mahoney P, 60
Ó Ríordáin S., Wilson S, 61
O'Rourke S, Kelly G and Mac Siúrtáin M, 62, 63
O'Sullivan K, Fanning LJ, O'Mullane J, ST4050 Students and Daly L, 64, 65
O'Sullivan K, Grannell A, O'Mullane J, ST4050 Students, 66
Parnell AC, Phillips DL, Bearhop S, Semmens BX, Ward EJ, Moore JW, Jackson AL, Grey J, Kelly D and Inger R, 35
Rollins R, Marshall AH, Cairns K, 67
Salter-Townshend M and McCormick T, 3
Schöni O, 68
Scott EM, 19
Shangodoyin DK, 9
Simpkin A, Metcalfe C, Donovan JL, Martin RM, Athene Lane J, Hamdy FC, Neal DE and Tilling K, 24, 25
Sobisek L, Stachova M, 69
Sturludottir E and Stefansson G, 31
Sweeney J, O'Sullivan F, 4
Vild J, 70
Wall D, Callagy G, Ingoldsby H, Kerin MJ and Newell J, 71
White A and Murphy TB, 8
Wolsztynski E, Brick E, O'Sullivan F and Eary JF, 29
Yang H, 30
Zhou S, Houlding B and Wilson SP, 72