Nemours Biomedical Research
Evaluating the performance of clustering using different inputs and algorithms to group children
based on early childhood growth patterns
Jobayer Hossain, Ph.D.
Tim Wysocki, Ph.D.
Samuel S. Gidding, MD
MingXing Gong, M.Sc.
H. Timothy Bunnell, Ph.D.
September 20, 2012
Early Childhood Growth Pattern
• Childhood growth pattern is the pattern of temporal change in height, weight, and head circumference
  – A determinant of body composition and body weight
• Commonly used measures of body composition and weight are
  – Weight-for-length for ages < 2 years
  – Body mass index (BMI) for ages ≥ 2 years
• The standardized scores of these two measures are the weight-for-length z-score and the BMI z-score (both termed BMIz in this presentation)
Enrollment Criteria
• Inclusion Criteria:
  – Born between 2001 and 2005
  – Had a first well-child visit at a Nemours clinic at < 1 month of age
  – Had at least one well-child visit each year for the next 5 years
• Exclusion Criteria: Children with medical diagnoses associated with poor growth and development, such as cancer and cystic fibrosis
Data Collection and BMIz Calculation
• 3365 children were enrolled in the study
• Retrospectively collected height, weight, age, other demographics, and comorbidities of childhood obesity from the Nemours electronic medical record (EMR) for children aged 0-5 years
• Calculated BMIz (using height, weight, age, and gender data) at each clinic visit
• Interpolated (LOESS) the BMIz score at ages (months) 1, 6, 12, 18, 24, 30, 36, 42, 48, 60 for each subject
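The interpolation step can be illustrated with a minimal sketch of a LOESS-style fit: a locally weighted linear regression with tricube weights, evaluated at each target age. This is not the authors' code. The visit ages and BMIz values are hypothetical, the `frac` span is an assumed setting, and the sketch omits the robustness iterations that full LOESS implementations offer.

```python
import math

def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 for |u| < 1, zero otherwise."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_at(x, y, x0, frac=0.75):
    """Locally weighted linear fit of y on x, evaluated at x0."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))      # number of points in the local window
    h = sorted(abs(xi - x0) for xi in x)[k - 1]  # distance to the k-th nearest point
    h = h * (1.0 + 1e-9) or 1.0                  # keep boundary weights positive; avoid h == 0
    w = [tricube((xi - x0) / h) for xi in x]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return ybar + slope * (x0 - xbar)

# Hypothetical visit ages (months) and BMIz values for one child.
visit_age = [0.5, 2.0, 4.0, 6.5, 9.0, 12.3, 15.0, 18.5, 24.2]
visit_bmiz = [0.1, 0.3, 0.5, 0.7, 0.8, 0.6, 0.5, 0.4, 0.3]

# Target ages from the slide, restricted to this child's observed age range.
grid = [1, 6, 12, 18, 24]
interpolated = {age: loess_at(visit_age, visit_bmiz, age) for age in grid}
```

Evaluating every child on the same age grid is what turns irregular clinic visits into equal-length vectors that a clustering routine can compare.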
Demographics
Variable    Category                  Number (%)
Gender      Male                      1634 (48.6%)
            Female                    1731 (51.4%)
Ethnicity   Non-Hispanic/Non-Latino   2927 (87%)
            Hispanic/Latino           404 (12%)
            Missing/Refused           34 (1%)
Race        Caucasian                 1429 (42.5%)
            African American          1542 (45.8%)
            Others                    353 (10.5%)
            Missing/Refused           41 (1.2%)
Temporal Change in Mean BMIz
• Trend in mean BMIz
  – A cubic polynomial trend over the ages (0-5 years)
  – Approximately linear in three intervals of time: 1-9 months, 9-27 months, and 27-60 months
[Figure: A cubic polynomial of mean change in BMIz over time]
Mean BMIz Over Time: By Demographics
Objectives
• To group children (0-5 years of age) using a cluster analysis that captures the natural structure of the temporal changes in BMIz
• To achieve this objective, we
  – Performed many sets of cluster analyses using different clustering methods, algorithms, software, and cluster inputs
  – Used a mixed model to evaluate and compare the performance of the many sets of clusters
  – Selected the optimum clustering(s) based on the statistics of model fit to the data
Rationale of Our Work
• Cluster analysis is mainly an exploratory data analysis technique, with limited scope for evaluating and comparing the results of two or more sets of clusterings
• A large number of clustering methods/algorithms have been developed
• Different statistical software packages accommodate different algorithms
• Cluster inputs can be raw data or some form of standardized data
• No method/algorithm is uniquely best
Cluster Methods and Software Used
• We used the following software and cluster methods, with several of the suggested distance/similarity measures
• SAS Procedures: CLUSTER (hierarchical), FASTCLUS (k-means), MODECLUS (clusters based on non-parametric density estimation using several algorithms), and some ancillary procedures such as VARCLUS, TREE, ACECLUS, DISTANCE, PRINCOMP, and STDIZE
• SPSS Cluster Analysis: Two-step (hierarchical clustering in two steps) and k-means
• R Cluster Analysis: Package mclust (model based clustering), Package cluster (hierarchical), and function kmeans (K-means)
Inputs Used for Cluster Analyses
We performed cluster analyses of
• BMIz scores at different ages (at six-month intervals)
• Selected principal component (PC) and factor scores, obtained from PC and factor analyses of the 11 BMIz variables
• Random coefficients of mixed effects model of BMIz
• Coefficients of auto-regression (AR)/autocorrelation (AC) for each individual
Model Used for Cluster Evaluation
• We used a mixed model that fits the BMIz of children, nested within a cluster group, as a polynomial function of time

$$Y_{ijk} = S_{ij} + \mathrm{group}_i + \mathrm{group}_i \cdot t_{ijk} + \mathrm{group}_i \cdot t_{ijk}^2 + \mathrm{group}_i \cdot t_{ijk}^3 + e_{ijk}$$

Where,
$Y_{ijk}$ is the BMIz of the $j$th subject nested within the $i$th group at the $k$th time
$S_{ij}$ is the random subject effect nested within the $i$th group of cluster
$\mathrm{group}_i$ is the intercept of the $i$th group
$t_{ijk}$ is the time (continuous)
$e_{ijk}$ is the random error term
Cluster Analysis of BMIz score
? Natural structure of the temporal change in BMI
Principal Component Analysis (PCA)
• PCA is a powerful exploratory tool to identify the patterns in data
• PCs reflect the interrelation of the similarities and dissimilarities between
observed variables
• Performed PCA of BMIz
• The first 4 PCs (PC4) explain about 99.19% of the total variation in the data

PC component   Eigenvalue (total)   % of variance   Cumulative %
1              8.900                80.910          80.910
2              1.347                12.246          93.157
3              .511                 4.644           97.801
4              .153                 1.390           99.191
5              .055                 .499            99.690
6              .019                 .176            99.866
7              .007                 .062            99.928
8              .005                 .044            99.972
9              .002                 .018            99.990
10             .001                 .007            99.997
11             .000                 .003            100.000
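As a worked check of the table, the sketch below recomputes the percentage of variance and the cumulative percentage from the reported eigenvalues. The function name is illustrative only. Because the eigenvalues are rounded to three decimals on the slide, the recomputed percentages can differ from the table in the later decimal places.

```python
# Eigenvalues as reported on the slide (11 BMIz variables).
eigenvalues = [8.900, 1.347, 0.511, 0.153, 0.055, 0.019,
               0.007, 0.005, 0.002, 0.001, 0.000]

def explained_variance(eigs):
    """Return (% of variance, cumulative %) for each component."""
    total = sum(eigs)
    pct = [100.0 * e / total for e in eigs]
    cumulative, running = [], 0.0
    for p in pct:
        running += p
        cumulative.append(running)
    return pct, cumulative

pct, cumulative = explained_variance(eigenvalues)
for i, (e, p, c) in enumerate(zip(eigenvalues, pct, cumulative), start=1):
    print(f"PC{i:<2} eigenvalue={e:6.3f}  % variance={p:7.3f}  cumulative={c:8.3f}")
```

The eigenvalues sum to 11, the number of input variables, which is what one expects when the components are extracted from a correlation matrix.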
Cluster Analysis of PC Scores
• PC4 (the first four PCs, which explain 99.19% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• PC3 (the first three PCs, which explain 97.80% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• PC2 (the first two PCs, which explain 93.16% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• PC11 (all eleven PCs): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm (cluster analysis using all 11 PCs is perhaps equivalent to cluster analysis of the raw data)
Cluster Analysis of PC4
? Natural structure of the temporal change in BMI
Cluster Analysis of PC3
? Natural structure of the temporal change in BMI
Cluster Analysis of PC2
? Natural structure of the temporal change in BMI
Cluster Analysis of Factor Scores
• Fac4 (the first four factors, which explain 99.19% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• Fac3 (the first three factors, which explain 97.80% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• Fac2 (the first two factors, which explain 93.16% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• Fac11 (all eleven factors): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm (cluster analysis using all 11 factors is perhaps equivalent to cluster analysis of the raw data)
Cluster Analysis of Fac4
? Natural structure of the temporal change in BMI
Cluster Analysis of Fac3
? Natural structure of the temporal change in BMI
Cluster Analysis of Fac2
? Natural structure of the temporal change in BMI
Piece-wise Mixed Effects Model with Random Coefficients
• Recall: the change in the population mean of BMIz follows a cubic polynomial trend over the ages 0-5 years, i.e., it is approximately linear in three intervals of time
• We can model this dataset with an intercept and three slopes, one for each of the three intervals
• We used the parameters (intercept and slopes) as both fixed and random effects in the model
• Fixed effects of the slopes explain the rate of change in the population level of BMIz in each interval of time
• Random effects explain the individual-to-individual variation
  – at the beginning of life (age of one month)
  – of the change in BMIz trajectories in the three intervals
• That is, the random effects account for the sources of heterogeneity in the change in population BMIz
Piece-wise Mixed Effects Model with Random Coefficients
• Exploratory analyses indicate that splits at 9 and 27 months of age yield the best fit of the piece-wise linear mixed effects model to the individual and population levels of change in BMIz
Piece-wise Mixed Effects Model with Random Coefficients

$$E(Y_{ij} \mid b_i) = \beta_1 + \beta_2 t_{ij} + \beta_3 (t_{ij})_9 + \beta_4 (t_{ij})_{27} + b_{1i} + b_{2i} t_{ij} + b_{3i} (t_{ij})_9 + b_{4i} (t_{ij})_{27}$$

Where,
$Y_{ij}$ is the BMIz of the $i$th individual at the $(j+1)$th month
$t_{ij}$ is the $(j+1)$th month of age of the $i$th individual
$(t_{ij})_9 = t_{ij} - 8$ for $t_{ij} > 8$, and $0$ for $t_{ij} \le 8$
$(t_{ij})_{27} = t_{ij} - 26$ for $t_{ij} > 26$, and $0$ for $t_{ij} \le 26$
Piece-wise Mixed Effects Model with Random Coefficients

$$E(Y_{ij} \mid b_i) = \beta_1 + \beta_2 t_{ij} + \beta_3 (t_{ij})_9 + \beta_4 (t_{ij})_{27} + b_{1i} + b_{2i} t_{ij} + b_{3i} (t_{ij})_9 + b_{4i} (t_{ij})_{27}$$

Where,
$\beta_1$ is the intercept (population) at age 1 month
$\beta_1 + b_{1i}$ is the intercept of the $i$th individual at the age of 1 month
$\beta_2$ is the rate of change in population BMIz for ages between (1-9) months, and $\beta_2 + b_{2i}$ is the rate of change in BMIz of the $i$th individual for the same age interval
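Piece-wise Mixed Effects Model with Random Coefficients

$$E(Y_{ij} \mid b_i) = \beta_1 + \beta_2 t_{ij} + \beta_3 (t_{ij})_9 + \beta_4 (t_{ij})_{27} + b_{1i} + b_{2i} t_{ij} + b_{3i} (t_{ij})_9 + b_{4i} (t_{ij})_{27}$$

Where,
$\beta_2 + \beta_3$ is the rate of change in population BMIz for ages between (9-27) months, and $\beta_2 + \beta_3 + b_{2i} + b_{3i}$ is the rate of change in BMIz of the $i$th individual for the same age interval
Similarly, $\beta_2 + \beta_3 + \beta_4$ is the rate of change (month) in population BMIz for ages between (27-60) months, and $\beta_2 + \beta_3 + \beta_4 + b_{2i} + b_{3i} + b_{4i}$ is the rate of change in BMIz of the $i$th individual for the same age interval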
Piece-wise Mixed Effects Model with Random Coefficients
Estimate of the regression coefficients (population)
Effect      Estimate   SE       P-value
Intercept   0.2879     0.0138   <0.0001
β1          0.102      0.0195   <0.0001
β2          -0.1606    0.0215   <0.0001
β3          0.1535     0.0082   <0.0001
Estimated variance of the random effects (individual)
Effect         Estimate   SE       P-value
V(intercept)   0.6305     0.0157   <0.0001
V(b1)          1.212      0.0311   <0.0001
V(b2)          1.4552     0.0381   <0.0001
V(b3)          0.2056     0.0055   <0.0001
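The link between these estimates and the interval-specific rates of change can be sketched as follows. The three slope estimates that follow the intercept in the fixed-effects table (0.102, -0.1606, 0.1535) are read here as the coefficients on t, (t)_9, and (t)_27. That reading is an inference, supported by the fact that their running sums reproduce the interval rates reported later in the deck (-0.0586 and 0.0949). The helper names are hypothetical, and `piecewise_mean` takes t in whatever time units the model was fitted in, with the knots at t = 8 and t = 26 following the slide's coding of t = 0 at age 1 month.

```python
# Fixed-effect estimates from the slide: intercept and the three slope terms.
intercept = 0.2879
slope_terms = [0.102, -0.1606, 0.1535]  # assumed to multiply t, (t)_9, (t)_27

def interval_slopes(terms):
    """Running sums of the slope terms give the slope within each interval."""
    slopes, running = [], 0.0
    for term in terms:
        running += term
        slopes.append(round(running, 4))
    return slopes

def hinge(t, knot):
    """Hinge term: zero up to the knot, then the time elapsed since the knot."""
    return max(0.0, t - knot)

def piecewise_mean(t, intercept, terms, knots=(8, 26)):
    """Population mean at model time t under the piece-wise linear model."""
    b_t, b_k1, b_k2 = terms
    return intercept + b_t * t + b_k1 * hinge(t, knots[0]) + b_k2 * hinge(t, knots[1])

slopes = interval_slopes(slope_terms)
print(slopes)
```

The same running-sum logic applies to each child's random coefficients, which is why the clustering inputs are the intercept and the three summed slopes rather than the raw hinge coefficients.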
Piece-wise Mixed Effects Model with Random Coefficients
• The model just discussed is our initial model for tracking the individual-level and population-level trajectories of BMIz
• We also observed significant differences in mean BMIz by gender, race, and ethnicity
• In addition to our initial model, we fitted a model to track the change in BMIz after adjustment for gender and race-ethnicity
Cluster Analysis of the Random Effects
• We performed a cluster analysis of four individual-level parameters: the intercept (b1i) and three slopes (b2i, b2i + b3i, b2i + b3i + b4i)
Cluster Analysis of the Random Coefficients of Piece-wise Mixed Effects Model
? Natural structure of the temporal change in BMI
Time Series Analysis
• Performed
  – Auto-regression (AR) on the BMIz of each individual, extracting the coefficients of the best model suggested by AIC (Burg, Yule-Walker, and MLE methods were used)
  – Auto-correlation on the BMIz of each individual, extracting the autocorrelations
  – Spectral analysis on each individual's BMIz, extracting the frequencies
• Performed cluster analysis of the AR coefficients, autocorrelations, and frequencies
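A minimal sketch of the per-child feature extraction, assuming low-order AR models: the sample autocorrelations are computed directly, and the Yule-Walker equations are solved in closed form for AR(1) and AR(2). The BMIz series is hypothetical, and the AIC-based order selection and the Burg and MLE estimators mentioned above are not shown.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (sums of mean-centred products)."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    if c0 == 0:
        return 0.0
    ck = sum(dev[i] * dev[i + lag] for i in range(n - lag))
    return ck / c0

def yule_walker_ar1(series):
    """Yule-Walker estimate of the AR(1) coefficient: the lag-1 autocorrelation."""
    return autocorrelation(series, 1)

def yule_walker_ar2(series):
    """Yule-Walker estimates (phi1, phi2) of an AR(2) model, in closed form."""
    r1 = autocorrelation(series, 1)
    r2 = autocorrelation(series, 2)
    denom = 1.0 - r1 * r1
    phi1 = r1 * (1.0 - r2) / denom
    phi2 = (r2 - r1 * r1) / denom
    return phi1, phi2

# Hypothetical interpolated BMIz series for one child.
bmiz = [0.2, 0.6, 0.9, 0.7, 0.5, 0.4, 0.5, 0.6, 0.8, 0.9]
acf_features = [autocorrelation(bmiz, k) for k in (1, 2, 3)]
ar2_features = yule_walker_ar2(bmiz)
```

Each child then contributes one short feature vector (autocorrelations or AR coefficients) to the cluster analysis, in place of the full BMIz series.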
Cluster Analysis of the AR Coefficients (Yule-Walker)
? Natural structure of the temporal change in BMI
Model-Based Evaluation Ordered by Log-likelihood (Largest to Smallest)
(AIC and BIC: smaller is better)
Clustering                Groups   AIC       BIC       -2 Log-likelihood   Log-likelihood
Fac4_reg_grp_opt          10       -8152.6   -8140.4   -8156.6             4078.3
PC4corr_grp_opt           11       -7629.5   -7617.3   -7633.5             3816.75
Fac3_reg_grp_opt          7        -6391.3   -6379.1   -6395.3             3197.65
PC3cov_grp_opt            6        -5443.7   -5431.4   -5447.7             2723.85
Fac3_reg_grp6             6        -5171.4   -5159.1   -5175.4             2587.7
initmodel_grp6            6        -5066.2   -5054     -5070.2             2535.1
Model_gendrace_grp6       6        -4855.9   -4843.6   -4859.9             2429.95
PC3corr_grp6              6        -4301.1   -4288.8   -4305.1             2152.55
PC4corr_grp6              6        -4217.2   -4204.9   -4221.2             2110.6
Fac3_reg_grp5             5        -4153.4   -4141.2   -4157.4             2078.7
Fac4_reg_grp6             6        -3951.8   -3939.5   -3955.8             1977.9
PC3corr_grp_opt           5        -3595.1   -3582.9   -3599.1             1799.55
PC4corr_grp5              5        -3515.2   -3503     -3519.2             1759.6
Fac4_reg_grp5             5        -3248     -3235.8   -3252               1626
PC4cov_grp6               6        -3146.2   -3134     -3150.2             1575.1
PC3cov_grp5               5        -3125.1   -3112.9   -3129.1             1564.55
initmodelcoe_grp6         6        -2848.9   -2836.6   -2852.9             1426.45
Model_gender_ethnicity6   6        -2825.8   -2813.5   -2829.8             1414.9
PC11corr_grp11            11       -2741.6   -2729.4   -2745.6             1372.8
Model_gendrace_coe_grp6   6        -2278.4   -2266.2   -2282.4             1141.2
initmodel_grpopt4         4        -2139     -2126.8   -2143               1071.5
PC4cov_grp_opt            5        -2105.5   -2093.3   -2109.5             1054.75
PC3cov_grp4               4        -1467.1   -1454.8   -1471.1             735.55
PC4corr_grp4              4        -1429.7   -1417.5   -1433.7             716.85
PC2corr_grp_opt           9        -1187.9   -1175.6   -1191.9             595.95
Fac3_reg_grp4             4        -1040.5   -1028.3   -1044.5             522.25
PC3corr_grp4              4        -969.5    -957.3    -973.5              486.75
PC11corr_grp9             9        -655.9    -643.6    -659.9              329.95
PC11corr_grp10            10       -309.6    -297.4    -313.6              156.8
Fac4_reg_grp4             4        34.2      46.5      30.2                -15.1
PC4cov_grp4               4        167       179.2     163                 -81.5
PC11corr_grp7             7        182       194.2     178                 -89
Model_gendrace_coe_opt3   3        428.6     440.8     424.6               -212.3
initmodelcoe_opt3         3        631.5     643.7     627.5               -313.75
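The fit statistics in each row are tied together by simple arithmetic, sketched below. Each row is read as number of groups, AIC, BIC, -2 log-likelihood, and log-likelihood. The penalty settings are not stated on the slide: d = 2 counted parameters and n = 3365 (the number of enrolled children) are inferred because they reproduce the tabulated values, and should be treated as assumptions.

```python
import math

def information_criteria(neg2_loglik, n_params, n_obs):
    """AIC = -2LL + 2d and BIC = -2LL + d * ln(n)."""
    aic = neg2_loglik + 2 * n_params
    bic = neg2_loglik + n_params * math.log(n_obs)
    return aic, bic

# -2 log-likelihood of the first row (Fac4_reg_grp_opt).
# d = 2 and n = 3365 are inferred from the table, not stated on the slide.
neg2ll = -8156.6
aic, bic = information_criteria(neg2ll, n_params=2, n_obs=3365)
loglik = -neg2ll / 2

print(round(aic, 1), round(bic, 1), round(loglik, 2))
```

Because the penalty is the same for every row under this reading, ranking the clusterings by AIC, BIC, or log-likelihood gives the same order.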
Choosing an Optimum Clustering
• The following few are the candidates for an optimum clustering
• Optimum choice: based on the likelihood score and the number of groups, prefer a smaller number of groups with the maximum possible likelihood score

(AIC and BIC: smaller is better)
Clustering                Groups   AIC       BIC       -2 Log-likelihood   Log-likelihood
Fac4_reg_grp_opt          10       -8152.6   -8140.4   -8156.6             4078.3
PC4corr_grp_opt           11       -7629.5   -7617.3   -7633.5             3816.75
Fac3_reg_grp_opt          7        -6391.3   -6379.1   -6395.3             3197.65
PC3cov_grp_opt            6        -5443.7   -5431.4   -5447.7             2723.85
Fac3_reg_grp6             6        -5171.4   -5159.1   -5175.4             2587.7
initmodel_grp6            6        -5066.2   -5054     -5070.2             2535.1
Model_gendrace_grp6       6        -4855.9   -4843.6   -4859.9             2429.95
PC3corr_grp6              6        -4301.1   -4288.8   -4305.1             2152.55
PC4corr_grp6              6        -4217.2   -4204.9   -4221.2             2110.6
Fac3_reg_grp5             5        -4153.4   -4141.2   -4157.4             2078.7
Best Clustering Based on the Model Fit Statistics
[Figure: Mean Profile for the Best Clustering; mean BMIz (y-axis) by age in months (x-axis), one profile for each of groups 1-10]
Optimum Clustering Based on the Number of Groups and Model Fit Statistics
The Fastclus Procedure
• The FASTCLUS procedure combines an effective method for finding
initial clusters with a standard iterative algorithm for minimizing the
sum of squared distances from the cluster means. This kind of
clustering method is often called a k-means model.
• By default, the FASTCLUS procedure uses Euclidean distances.
• The FASTCLUS procedure can use an Lp (least pth powers) clustering criterion instead of the least squares (L2) criterion used in k-means clustering methods.
• PROC FASTCLUS uses algorithms that place a larger influence on variables with larger variance, so it might be necessary to standardize the variables before performing the cluster analysis.
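The two ideas above, standardizing first and then iteratively minimizing squared distances to the cluster means, can be sketched with a plain k-means loop (Lloyd's algorithm). This is not the FASTCLUS implementation, which has its own method for choosing initial seeds; here the seeds are supplied by hand, and the data and function names are hypothetical.

```python
import math

def standardize(rows):
    """Scale each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, seeds, max_iter=100):
    """Assign each row to its nearest center, recompute the means, and repeat."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda j: sq_dist(row, centers[j]))
                      for row in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical two-variable data on very different scales.
raw = [[0.1, 100.0], [0.2, 110.0], [0.15, 105.0],
       [1.9, 300.0], [2.1, 310.0], [2.0, 305.0]]
scaled = standardize(raw)
labels, centers = kmeans(scaled, seeds=[scaled[0], scaled[3]])
```

Without the standardization step, the second column would dominate the Euclidean distances simply because its variance is larger.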
The Modeclus Procedure
• The MODECLUS procedure clusters observations in a SAS data set by using
any of several algorithms based on nonparametric density estimates. PROC
MODECLUS implements several clustering methods by using nonparametric
density estimation.
• PROC MODECLUS can perform approximate significance tests for the
number of clusters
• PROC MODECLUS produces output data sets containing density estimates
and cluster membership, various cluster statistics including approximate p-
values, and a summary of the number of clusters generated by various
algorithms, smoothing parameters, and significance levels.
• For nonparametric clustering methods, a cluster is loosely defined as a region
surrounding a local maximum of the probability density function.
The Cluster Procedure
• The CLUSTER procedure hierarchically clusters the observations
in a SAS data set by using one of 11 methods. The data can be
coordinates or distances.
• The clustering methods are: average linkage, the centroid method,
complete linkage, density linkage (including Wong’s hybrid and kth-
nearest-neighbor methods), maximum likelihood for mixtures of
spherical multivariate normal distributions with equal variances but
possibly unequal mixing proportions, the flexible-beta method,
McQuitty’s similarity analysis, the median method, single linkage,
two-stage density linkage, and Ward’s minimum-variance method.
The SPSS TwoStep Cluster
• Handles both continuous and categorical variables by extending the model-
based distance
• Utilizes a two-step clustering approach similar to BIRCH (Zhang et al. 1996)
• Step 1 (pre-cluster): Uses a sequential clustering approach (Theodoridis and Koutroumbas 1999). It scans the records one by one and decides whether the current record should merge with a previously formed cluster or start a new cluster, based on the distance criterion.
• Step 2 (group sub-clusters into clusters): Takes the sub-clusters resulting from step 1 and groups them into the desired number of clusters
• Provides the capability to automatically find the optimal number of clusters.
R-mclust package
• mclust is a contributed R package for model-based clustering, classification,
and density estimation based on finite normal mixture modeling.
• It provides functions for parameter estimation via the EM algorithm for normal
mixture models with a variety of covariance structures, and functions for
simulation from these models.
• Also included are functions that combine model-based hierarchical clustering,
EM for mixture estimation and the Bayesian Information Criterion (BIC) in
comprehensive strategies for clustering, density estimation and discriminant
analysis.
• Provides the capability to automatically find the optimal number of clusters.
Principal Component Analysis
• PCA transforms a set of interrelated variables into a new set of uncorrelated variables called principal components (PCs).
• The variance of a PC indicates the amount of total variation (information) in the original variables conveyed by that particular PC.
• The transformation is defined in such a way that the PCs are ordered: the first PC accounts for as much of the variability in the original variables as possible, and each succeeding PC in turn has the highest variance possible under the constraint that it is uncorrelated with the preceding PCs.
• Thus the most informative PC is the first and the least informative is the last.
• PCA is a powerful exploratory tool for identifying patterns in data because PCs reflect the interrelation of the similarities and dissimilarities between observed variables.
Piece-wise Mixed Effects Model with Random Coefficients
Estimate of the regression coefficients (population)
Effect                                          Estimate (SE)     P-value
Mean BMIz at the age of one month               0.2879 (.0138)    <0.0001
Rate of change in mean BMIz from 1-9 months     0.102 (.0195)     <0.0001
Rate of change in mean BMIz from 9-26 months    -.0586 (.0096)    <0.0001
Rate of change in mean BMIz from 26-60 months   0.0949 (.0062)    <0.0001
• A significant increasing trend in population BMIz at ages 1 to 9 months and at ages 27 to 60 months
• A significant decreasing trend in population BMIz at ages 9 to 27 months
• The rates of change in BMIz in the three time intervals are significantly different
Piece-wise Mixed Effects Model with Random Coefficients
Estimated variance of the random effects (individual)
Effect                                            Estimate (SE)      P-value
Var(BMIz at the age of one month)                 0.6305 (0.0157)    <.0001
Var(Rate of change in BMIz from 1-9 months)       1.212 (0.0311)     <.0001
Var(Rate of change in BMIz from 9-26 months)      0.1766 (0.0281)    <.0001
Var(Rate of change in BMIz from 26-60 months)     0.0636 (0.0045)    <.0001
• Substantial individual-to-individual variability in
  – BMIz at the age of one month
  – the rate of change in BMIz in each time interval