Nemours Biomedical Research Evaluating the performance of clustering using different inputs and algorithms to group children based on early childhood growth patterns Jobayer Hossain, Ph.D. Tim Wysocki, Ph.D. Samuel S. Gidding, MD MingXing Gong, M.Sc. H. Timothy Bunnell, Ph.D. September 20, 2012


Nemours Biomedical Research

Evaluating the performance of clustering using different inputs and algorithms to group children

based on early childhood growth patterns

Jobayer Hossain, Ph.D.

Tim Wysocki, Ph.D.

Samuel S. Gidding, MD

MingXing Gong, M.Sc.

H. Timothy Bunnell, Ph.D.

September 20, 2012

Early Childhood Growth Pattern

• Childhood growth pattern is the pattern of temporal change in height, weight, and head circumference
   - A determinant of body composition and body weight
• Commonly used measures of body composition and weight are
   - Weight-for-length for ages < 2 years
   - Body mass index (BMI) for ages ≥ 2 years
• Standardized scores of these two measures are the weight-for-length z-score and the BMI z-score (both termed BMIz in this presentation)

Enrollment Criteria

• Inclusion criteria:
   - Born between 2001 and 2005
   - Had their first well-child visit at a Nemours clinic at < 1 month of age
   - Had at least one well-child visit each year for the next 5 years
• Exclusion criteria: children with medical diagnoses associated with poor growth and development, such as cancer and cystic fibrosis

Data Collection and BMIz Calculation

• 3365 children were enrolled in the study
• Height, weight, age, other demographics, and comorbidities of childhood obesity were collected retrospectively from the Nemours electronic medical record (EMR) for children aged 0-5 years
• BMIz was calculated (using height, weight, age, and gender data) at each clinic visit
• BMIz scores were interpolated (LOESS) at ages (months) 1, 6, 12, 18, 24, 30, 36, 42, 48, 60 for each subject
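The BMIz calculation step can be illustrated with the LMS transformation used by growth references such as the CDC growth charts. This is only a sketch: the slides do not say which reference or software was used, and the `L_HYP`, `M_HYP`, `S_HYP` values below are hypothetical placeholders, not real reference values for any age or sex.

```python
import math


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def lms_zscore(x: float, L: float, M: float, S: float) -> float:
    """z-score of measurement x given age- and sex-specific LMS parameters."""
    if L == 0:
        # Limiting case of the Box-Cox transformation
        return math.log(x / M) / S
    return ((x / M) ** L - 1.0) / (L * S)


# Hypothetical LMS values for illustration only (not taken from any chart).
L_HYP, M_HYP, S_HYP = -2.0, 16.0, 0.08

z_at_median = lms_zscore(bmi(16.0, 1.0), L_HYP, M_HYP, S_HYP)  # BMI equal to M
z_above = lms_zscore(bmi(18.0, 1.0), L_HYP, M_HYP, S_HYP)      # BMI above M
print(z_at_median, z_above)
```

For children under 2 years, the same transformation would be applied to weight, with LMS parameters indexed by length instead of age, to give a weight-for-length z-score.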

Demographics

Variable    Category                   Number (%)
Gender      Male                       1634 (48.6%)
            Female                     1731 (51.4%)
Ethnicity   Non-Hispanic/Non-Latino    2927 (87%)
            Hispanic/Latino            404 (12%)
            Missing/Refused            34 (1%)
Race        Caucasian                  1429 (42.5%)
            African American           1542 (45.8%)
            Others                     353 (10.5%)
            Missing/Refused            41 (1.2%)

Temporal Change in Mean BMIz

• Trend in mean BMIz
   - A cubic polynomial trend over the ages (0-5 years)
   - Approximately linear in three intervals of time: 1-9 months, 9-27 months, and 27-60 months

Natural Structure of the Temporal Change in BMIz of Individual Children


Mean BMIz Over Time: By Demographics


A cubic polynomial of mean change in BMIz over time


Objectives

• To group children (0-5 years of age) using a cluster analysis that captures the natural structure of the temporal changes in BMIz
• To achieve this objective, we
   - Performed many sets of cluster analyses using different clustering methods, algorithms, software, and cluster inputs
   - Used a mixed model to evaluate and compare the performance of the resulting sets of clusters
   - Selected the optimum clustering(s) based on the statistics of model fit to the data

Rationale of Our Work

• Cluster analysis is mainly an exploratory data analysis, with limited scope for evaluating and comparing the results of two or more sets of clustering
• A large number of clustering methods/algorithms have been developed
• Different statistical software packages accommodate different algorithms
• Cluster inputs can be raw data or some form of standardized data
• No method/algorithm is uniquely best

Cluster Methods and Software Used

• We used the following software/cluster methods, with several of the suggested distance/similarity measures
• SAS procedures: CLUSTER (hierarchical), FASTCLUS (k-means), MODECLUS (clusters based on nonparametric density estimation using several algorithms), and some ancillary procedures such as VARCLUS, TREE, ACECLUS, DISTANCE, PRINCOMP, STDIZE
• SPSS cluster analysis: TwoStep (hierarchical clustering in two steps) and k-means
• R cluster analysis: package mclust (model-based clustering), package cluster (hierarchical), and function kmeans (k-means)

Inputs Used for Cluster Analyses

We performed cluster analyses of

• BMIz scores at different ages (at six-month intervals)
• Selected principal component (PC)/factor scores, after PC/factor analyses of the BMIz scores (11 variables)
• Random coefficients of a mixed effects model of BMIz
• Coefficients of auto-regression (AR)/autocorrelation (AC) for each individual

Model Used for Cluster Evaluation

• We used a mixed model that fits the BMIz of children, nested within a cluster group, as a polynomial function of time

$$Y_{ijk} = S_{ij}(\mathrm{group}_i) + \mathrm{group}_i + \mathrm{group}_i \times t_{ijk} + \mathrm{group}_i \times t_{ijk}^{2} + \mathrm{group}_i \times t_{ijk}^{3} + e_{ijk}$$

Where,
   - $Y_{ijk}$ is the BMIz of the jth subject nested within the ith group at the kth time
   - $S_{ij}(\mathrm{group}_i)$ is the random subject effect nested within the ith group of cluster
   - $\mathrm{group}_i$ is the intercept of the ith group
   - $t_{ijk}$ is the time (continuous)
   - $e_{ijk}$ is the random error term

Cluster Analysis of BMIz score

? Natural structure of the temporal change in BMI


Principal Component Analysis (PCA)

• PCA is a powerful exploratory tool for identifying patterns in data
• PCs reflect the interrelation of the similarities and dissimilarities between observed variables
• Performed PCA of BMIz
• The first 4 PCs (PC4) explain about 99.19% of the total variation in the data

Component   Eigenvalue   % of Variance   Cumulative %
1           8.900        80.910          80.910
2           1.347        12.246          93.157
3           .511         4.644           97.801
4           .153         1.390           99.191
5           .055         .499            99.690
6           .019         .176            99.866
7           .007         .062            99.928
8           .005         .044            99.972
9           .002         .018            99.990
10          .001         .007            99.997
11          .000         .003            100.000
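The percentage columns follow directly from the eigenvalues: each eigenvalue is divided by their sum (11, the number of standardized variables). A small sketch reproducing the table from its eigenvalue column; because the eigenvalues are rounded to three decimals, the results agree with the printed percentages only to about two decimals.

```python
# Eigenvalues as printed in the table (rounded to three decimals).
eigenvalues = [8.900, 1.347, 0.511, 0.153, 0.055, 0.019,
               0.007, 0.005, 0.002, 0.001, 0.000]


def explained_variance(eigs):
    """Return (% of variance, cumulative %) for each component."""
    total = sum(eigs)
    pct = [100.0 * e / total for e in eigs]
    cumulative = []
    running = 0.0
    for p in pct:
        running += p
        cumulative.append(running)
    return pct, cumulative


pct, cumulative = explained_variance(eigenvalues)
for k, (p, c) in enumerate(zip(pct, cumulative), start=1):
    print(f"PC{k:<2d}  {p:7.3f}  {c:8.3f}")
```

The cumulative value at the fourth component is the 99.19% quoted for PC4, and the values at the second and third components are the figures quoted for PC2 and PC3.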

Cluster Analysis of PC Scores

• PC4 (the first four PCs, which explain 99.19% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• PC3 (the first three PCs, which explain 97.80% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• PC2 (the first two PCs, which explain 93.16% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• PC11 (all eleven PCs): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm (cluster analysis using all 11 PCs is perhaps equivalent to cluster analysis of the raw data)

Cluster Analysis of PC4


? Natural structure of the temporal change in BMI


Cluster Analysis of PC3

? Natural structure of the temporal change in BMI


Cluster Analysis of PC2

? Natural structure of the temporal change in BMI


Cluster Analysis of Factor Scores

• Fac4 (the first four factors, which explain 99.19% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• Fac3 (the first three factors, which explain 97.80% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• Fac2 (the first two factors, which explain 93.16% of the variation in the data): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm
• Fac11 (all eleven factors): four clusterings, with 4 groups, 5 groups, 6 groups, and the optimal number of groups suggested by the cluster algorithm (cluster analysis using all 11 factors is perhaps equivalent to cluster analysis of the raw data)

Cluster Analysis of Fac4

? Natural structure of the temporal change in BMI


Cluster Analysis of Fac3

? Natural structure of the temporal change in BMI


Cluster Analysis of Fac2

? Natural structure of the temporal change in BMI

Piece-wise Mixed Effects Model with Random Coefficients

• Recall: the change in the population mean of BMIz follows a cubic polynomial trend over the ages 0-5 years, i.e., it is approximately linear in three intervals of time
• We can model this dataset with an intercept and three slopes for the three intervals
• We used the parameters (intercept and slopes) as both fixed and random effects in the model
• Fixed effects of the slopes explain the rate of change in the population level of BMIz in each interval of time
• Random effects explain the individual-to-individual variation
   - at the beginning of life (age of one month)
   - of the change in BMIz trajectories in the three intervals
• That is, the random effects account for the sources of heterogeneity in the change in population BMIz

• Exploratory analyses indicate that splits at 9 and 27 months of age yield the best fit of the piece-wise linear mixed effects model to the individual and population levels of change in BMIz

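$$E(Y_{ij} \mid b_i) = \beta_1 + \beta_2 t_{ij} + \beta_3 (t_{ij})_{9} + \beta_4 (t_{ij})_{27} + b_{1i} + b_{2i} t_{ij} + b_{3i} (t_{ij})_{9} + b_{4i} (t_{ij})_{27}$$

Where,
   - $Y_{ij}$ is the BMIz of the ith individual at the jth measurement occasion
   - $t_{ij}$ is the corresponding age in months minus 1, so that $t_{ij} = 0$ at the age of 1 month
   - $(t_{ij})_{9}$ and $(t_{ij})_{27}$ are the changes of slope at 9 and 27 months of age:

$$(t_{ij})_{9} = \begin{cases} t_{ij} - 8 & \text{for } t_{ij} > 8 \\ 0 & \text{for } t_{ij} \le 8 \end{cases} \qquad (t_{ij})_{27} = \begin{cases} t_{ij} - 26 & \text{for } t_{ij} > 26 \\ 0 & \text{for } t_{ij} \le 26 \end{cases}$$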
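A minimal sketch of how the piece-wise linear mean can be evaluated, assuming time is measured as age in months minus 1 so that the knots fall at 8 and 26 (ages 9 and 27 months). The function names and the coefficient values `beta_hyp` and `b_hyp` are hypothetical and chosen only to make the arithmetic easy to follow; they are not estimates from the study.

```python
KNOTS = (8.0, 26.0)  # t = age in months minus 1, i.e. ages 9 and 27 months


def hinge(t, knot):
    """Truncated linear term: 0 up to the knot, then t - knot."""
    return t - knot if t > knot else 0.0


def population_mean(t, beta):
    """Fixed-effects part of the model; beta = (beta1, beta2, beta3, beta4)."""
    b1, b2, b3, b4 = beta
    return b1 + b2 * t + b3 * hinge(t, KNOTS[0]) + b4 * hinge(t, KNOTS[1])


def individual_mean(t, beta, b_i):
    """Individual trajectory: fixed effects plus that child's random coefficients."""
    return population_mean(t, [f + r for f, r in zip(beta, b_i)])


# Hypothetical coefficients for illustration only.
beta_hyp = (0.0, 1.0, -2.0, 3.0)
b_hyp = (0.5, 0.0, 0.0, 0.0)   # this child starts 0.5 above the population

for t in (0, 8, 26, 27):
    print(t, population_mean(t, beta_hyp), individual_mean(t, beta_hyp, b_hyp))
```

Because the model is linear in its coefficients, an individual's trajectory has the same shape as the population curve, with each coefficient shifted by that individual's random effect.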

In the model for $E(Y_{ij} \mid b_i)$:

• $\beta_1$ is the (population) intercept at the age of 1 month
• $\beta_1 + b_{1i}$ is the intercept of the ith individual at the age of 1 month
• $\beta_2$ is the rate of change in population BMIz for ages between 1 and 9 months, and $\beta_2 + b_{2i}$ is the rate of change in BMIz of the ith individual for the same age interval

For the second and third age intervals:

• $\beta_2 + \beta_3$ is the rate of change in population BMIz for ages between 9 and 27 months, and $\beta_2 + \beta_3 + b_{2i} + b_{3i}$ is the rate of change in BMIz of the ith individual for the same age interval
• Similarly, $\beta_2 + \beta_3 + \beta_4$ is the rate of change (month) in population BMIz for ages between 27 and 60 months, and $\beta_2 + \beta_3 + \beta_4 + b_{2i} + b_{3i} + b_{4i}$ is the rate of change in BMIz of the ith individual for the same age interval

Estimate of the regression coefficients (population)

Effect      Estimate   SE       P-value
Intercept   0.2879     0.0138   <0.0001
β1          0.102      0.0195   <0.0001
β2          -0.1606    0.0215   <0.0001
β3          0.1535     0.0082   <0.0001

Estimated variance of the random effects (individual)

Effect         Estimate   SE       P-value
V(intercept)   0.6305     0.0157   <0.0001
V(b1)          1.212      0.0311   <0.0001
V(b2)          1.4552     0.0381   <0.0001
V(b3)          0.2056     0.0055   <0.0001

In these tables the intercept is listed separately, so the rows numbered 1-3 correspond to the three slope terms of the model.
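A quick arithmetic check on how these estimates relate to per-interval rates of change. It assumes the three non-intercept fixed effects are, in order, the slope for the first interval and the changes in slope at the two knots (an assumption about the table's row labels). Under that reading, the running sums reproduce the per-interval rates of -0.0586 and 0.0949 reported for the second and third intervals later in the presentation.

```python
from itertools import accumulate

# Fixed-effect estimates from the table: first-interval slope, then the
# change in slope at each knot (assumed interpretation of the row labels).
slope_terms = [0.102, -0.1606, 0.1535]

# Rate of change within each interval = running sum of the slope terms.
interval_rates = list(accumulate(slope_terms))

for label, rate in zip(("interval 1", "interval 2", "interval 3"), interval_rates):
    print(f"{label}: {rate:+.4f}")
```

The same running-sum rule applies to an individual's random coefficients, which is why the cluster inputs are sums of coefficients rather than the raw coefficients themselves.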

• The model just discussed is our initial model for tracking the trajectories of individual and population levels of BMIz
• We also observed significant differences in mean BMIz by gender, race, and ethnicity
• In addition to our initial model, we fitted a model to track the change in BMIz after adjustment for gender and race-ethnicity

Cluster Analysis of the Random Effects

• We performed a cluster analysis of the four individual-level parameters: the intercept (b1i) and the three slopes (b2i, b2i + b3i, b2i + b3i + b4i)

Cluster Analysis of the Random Coefficients of Piece-wise Mixed Effects Model

? Natural structure of the temporal change in BMI


Time Series Analysis

• Performed
   - Auto-regression (AR) on the BMIz of each individual, extracting the coefficients of the best model suggested by AIC (Burg, Yule-Walker, and MLE methods were used)
   - Auto-correlation on the BMIz of each individual, extracting the autocorrelations
   - Spectral analysis on each individual's BMIz, extracting the frequencies
• Performed cluster analysis of the AR coefficients, autocorrelations, and frequencies
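As an illustration of the AR step, the sketch below estimates AR coefficients for one child's BMIz series with the Yule-Walker method, solved by the Levinson-Durbin recursion. The function names and the example series are hypothetical, and the order is fixed by hand here, whereas the slides describe choosing the best model by AIC.

```python
def autocov(x, lag):
    """Biased sample autocovariance of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n


def yule_walker(x, order):
    """AR coefficients via the Levinson-Durbin recursion.

    Returns (phi, innovation_variance) with phi[0] the lag-1 coefficient.
    """
    r = [autocov(x, k) for k in range(order + 1)]
    phi = []
    err = r[0]
    for k in range(1, order + 1):
        # Reflection coefficient (partial autocorrelation at lag k)
        kappa = (r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))) / err
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        err *= 1.0 - kappa ** 2
    return phi, err


# Hypothetical BMIz values of one child at the ten interpolated ages.
bmiz = [0.3, 0.5, 0.2, -0.1, 0.0, 0.4, 0.6, 0.5, 0.7, 0.9]
phi, sigma2 = yule_walker(bmiz, order=2)
print(phi, sigma2)
```

Each child then contributes one short vector of coefficients, and those vectors (rather than the raw BMIz values) become the input to the cluster analysis.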

Cluster Analysis of the AR Coefficients (Yule-Walker)


? Natural structure of the temporal change in BMI


Model-Based Evaluation Ordered by Log-likelihood (Largest to Smallest)

Clustering                Number of Groups   AIC (smaller is better)   BIC (smaller is better)   -2 Log-likelihood   Log-likelihood
Fac4_reg_grp_opt          10                 -8152.6                   -8140.4                   -8156.6             4078.3
PC4corr_grp_opt           11                 -7629.5                   -7617.3                   -7633.5             3816.75
Fac3_reg_grp_opt          7                  -6391.3                   -6379.1                   -6395.3             3197.65
PC3cov_grp_opt            6                  -5443.7                   -5431.4                   -5447.7             2723.85
Fac3_reg_grp6             6                  -5171.4                   -5159.1                   -5175.4             2587.7
initmodel_grp6            6                  -5066.2                   -5054                     -5070.2             2535.1
Model_gendrace_grp6       6                  -4855.9                   -4843.6                   -4859.9             2429.95
PC3corr_grp6              6                  -4301.1                   -4288.8                   -4305.1             2152.55
PC4corr_grp6              6                  -4217.2                   -4204.9                   -4221.2             2110.6
Fac3_reg_grp5             5                  -4153.4                   -4141.2                   -4157.4             2078.7
Fac4_reg_grp6             6                  -3951.8                   -3939.5                   -3955.8             1977.9
PC3corr_grp_opt           5                  -3595.1                   -3582.9                   -3599.1             1799.55
PC4corr_grp5              5                  -3515.2                   -3503                     -3519.2             1759.6
Fac4_reg_grp5             5                  -3248                     -3235.8                   -3252               1626
PC4cov_grp6               6                  -3146.2                   -3134                     -3150.2             1575.1
PC3cov_grp5               5                  -3125.1                   -3112.9                   -3129.1             1564.55
initmodelcoe_grp6         6                  -2848.9                   -2836.6                   -2852.9             1426.45
Model_gender_ethnicity6   6                  -2825.8                   -2813.5                   -2829.8             1414.9
PC11corr_grp11            11                 -2741.6                   -2729.4                   -2745.6             1372.8
Model_gendrace_coe_grp6   6                  -2278.4                   -2266.2                   -2282.4             1141.2
initmodel_grpopt4         4                  -2139                     -2126.8                   -2143               1071.5
PC4cov_grp_opt            5                  -2105.5                   -2093.3                   -2109.5             1054.75
PC3cov_grp4               4                  -1467.1                   -1454.8                   -1471.1             735.55
PC4corr_grp4              4                  -1429.7                   -1417.5                   -1433.7             716.85
PC2corr_grp_opt           9                  -1187.9                   -1175.6                   -1191.9             595.95
Fac3_reg_grp4             4                  -1040.5                   -1028.3                   -1044.5             522.25
PC3corr_grp4              4                  -969.5                    -957.3                    -973.5              486.75
PC11corr_grp9             9                  -655.9                    -643.6                    -659.9              329.95
PC11corr_grp10            10                 -309.6                    -297.4                    -313.6              156.8
Fac4_reg_grp4             4                  34.2                      46.5                      30.2                -15.1
PC4cov_grp4               4                  167                       179.2                     163                 -81.5
PC11corr_grp7             7                  182                       194.2                     178                 -89
Model_gendrace_coe_opt3   3                  428.6                     440.8                     424.6               -212.3
initmodelcoe_opt3         3                  631.5                     643.7                     627.5               -313.75
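The fit statistics in this table are tied together by simple formulas, which is also how the column order can be checked. The sketch below recomputes AIC and BIC from -2 log-likelihood, assuming 2 counted parameters (inferred from AIC exceeding -2 log-likelihood by 4 in every row) and a sample size of 3365 subjects; both are assumptions about how the software counted, not values stated in the slides.

```python
import math

N_PARAMS = 2        # assumption: inferred from AIC - (-2 log-likelihood) = 4
N_SUBJECTS = 3365   # children enrolled in the study


def aic(neg2_loglik, n_params=N_PARAMS):
    """Akaike information criterion from -2 log-likelihood."""
    return neg2_loglik + 2 * n_params


def bic(neg2_loglik, n_params=N_PARAMS, n=N_SUBJECTS):
    """Bayesian information criterion from -2 log-likelihood."""
    return neg2_loglik + n_params * math.log(n)


# -2 log-likelihood values copied from three rows of the table.
rows = {
    "Fac4_reg_grp_opt": -8156.6,
    "PC4corr_grp_opt": -7633.5,
    "Fac3_reg_grp_opt": -6395.3,
}
for name, neg2ll in rows.items():
    print(name, round(aic(neg2ll), 1), round(bic(neg2ll), 1), -neg2ll / 2)
```

Since every clustering is scored with the same number of counted parameters, ranking by AIC, BIC, or log-likelihood gives the same order in this table.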

Choosing an Optimum Clustering

• The following are the candidates for an optimum clustering
• Optimum choice: based on the likelihood score and the number of groups, prefer a smaller number of groups with the maximum possible likelihood score

Clustering            Number of Groups   AIC (smaller is better)   BIC (smaller is better)   -2 Log-likelihood   Log-likelihood
Fac4_reg_grp_opt      10                 -8152.6                   -8140.4                   -8156.6             4078.3
PC4corr_grp_opt       11                 -7629.5                   -7617.3                   -7633.5             3816.75
Fac3_reg_grp_opt      7                  -6391.3                   -6379.1                   -6395.3             3197.65
PC3cov_grp_opt        6                  -5443.7                   -5431.4                   -5447.7             2723.85
Fac3_reg_grp6         6                  -5171.4                   -5159.1                   -5175.4             2587.7
initmodel_grp6        6                  -5066.2                   -5054                     -5070.2             2535.1
Model_gendrace_grp6   6                  -4855.9                   -4843.6                   -4859.9             2429.95
PC3corr_grp6          6                  -4301.1                   -4288.8                   -4305.1             2152.55
PC4corr_grp6          6                  -4217.2                   -4204.9                   -4221.2             2110.6
Fac3_reg_grp5         5                  -4153.4                   -4141.2                   -4157.4             2078.7

Best Clustering Based on the Model Fit Statistics


Mean Profile for the Best Clustering

[Figure: mean BMIz (y-axis, -2 to 3) versus age in months (x-axis, 1 to 61), one profile per group for Groups 1-10]

Optimum Clustering Based on the Number of Groups and Model Fit Statistics

Acknowledgement

• Li Xie, Biostatistician


Thank you very much


The Fastclus Procedure

• The FASTCLUS procedure combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means. This kind of clustering method is often called a k-means model.
• By default, the FASTCLUS procedure uses Euclidean distances.
• The FASTCLUS procedure can use an Lp (least pth powers) clustering criterion instead of the least squares (L2) criterion used in k-means clustering methods.
• PROC FASTCLUS uses algorithms that place a larger influence on variables with larger variance, so it might be necessary to standardize the variables before performing the cluster analysis.
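The two ideas on this slide, standardizing the inputs and then iteratively minimizing squared distances to cluster means, can be sketched in a few lines. This is a plain Lloyd-style k-means written for illustration, not the FASTCLUS algorithm itself: the initial centers are simply the first `k` rows (FASTCLUS has its own seed-selection method), and the example rows are hypothetical BMIz values.

```python
from statistics import mean, pstdev


def standardize(rows):
    """z-score each column so that no variable dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def sqdist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=100):
    """Alternate assignment and mean-update steps until labels stop changing."""
    centers = [list(r) for r in rows[:k]]  # simplification: first k rows as seeds
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sqdist(r, centers[c])) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return labels, centers


# Hypothetical BMIz values of four children at three ages.
rows = [[-1.0, -1.1, -0.9], [1.0, 1.2, 0.9], [-0.8, -1.0, -1.2], [1.1, 0.9, 1.0]]
labels, centers = kmeans(standardize(rows), k=2)
print(labels)
```

Because the result depends on the starting centers, a real analysis would normally try several starts or a dedicated seeding method.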


The Modeclus Procedure

• The MODECLUS procedure clusters observations in a SAS data set by using any of several algorithms based on nonparametric density estimates
• PROC MODECLUS can perform approximate significance tests for the number of clusters
• PROC MODECLUS produces output data sets containing density estimates and cluster membership, various cluster statistics including approximate p-values, and a summary of the number of clusters generated by various algorithms, smoothing parameters, and significance levels
• For nonparametric clustering methods, a cluster is loosely defined as a region surrounding a local maximum of the probability density function

The Cluster Procedure

• The CLUSTER procedure hierarchically clusters the observations in a SAS data set by using one of 11 methods. The data can be coordinates or distances.
• The clustering methods are: average linkage, the centroid method, complete linkage, density linkage (including Wong's hybrid and kth-nearest-neighbor methods), maximum likelihood for mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing proportions, the flexible-beta method, McQuitty's similarity analysis, the median method, single linkage, two-stage density linkage, and Ward's minimum-variance method.
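To make the idea of hierarchical clustering concrete, here is a small sketch of one of the listed methods, average linkage, on coordinate data. It is a naive illustration of the agglomerative principle (repeatedly merge the two closest clusters), not the algorithm PROC CLUSTER uses internally; the function names and example points are hypothetical.

```python
from itertools import combinations
from math import sqrt


def euclid(a, b):
    """Euclidean distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, n_clusters):
    """Merge clusters bottom-up until n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]

    def link(a, b):
        # Average pairwise distance between members of the two clusters
        total = sum(euclid(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        ia, ib = min(combinations(range(len(clusters)), 2),
                     key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]  # ib > ia, so index ia is unaffected
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
groups = average_linkage(points, n_clusters=2)
print(groups)
```

Other linkage methods in the list differ mainly in how the distance between two clusters (`link` here) is defined.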


The SPSS TwoStep Cluster

• Handles both continuous and categorical variables by extending the model-based distance
• Utilizes a two-step clustering approach similar to BIRCH (Zhang et al. 1996)
• Step 1 (pre-cluster): uses a sequential clustering approach (Theodoridis and Koutroumbas 1999). It scans the records one by one and decides whether the current record should be merged with a previously formed cluster or should start a new cluster, based on the distance criterion.
• Step 2 (group sub-clusters): takes the sub-clusters resulting from step 1 and groups them into the desired number of clusters
• Provides the capability to automatically find the optimal number of clusters.
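The pre-cluster step can be pictured as a single pass over the records with a merge-or-start-new decision. The sketch below is a deliberately simplified stand-in: it uses Euclidean distance to running cluster means and a fixed `threshold`, both of which are assumptions made for illustration, whereas SPSS TwoStep uses its own distance criterion and tree structure.

```python
from math import sqrt


def euclid(a, b):
    """Euclidean distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sequential_precluster(records, threshold):
    """One pass over the records: join the nearest sub-cluster or start a new one."""
    centers, counts, labels = [], [], []
    for rec in records:
        nearest = None
        if centers:
            nearest = min(range(len(centers)), key=lambda c: euclid(rec, centers[c]))
        if nearest is not None and euclid(rec, centers[nearest]) <= threshold:
            n = counts[nearest]
            # Update the running mean of the sub-cluster that absorbs the record
            centers[nearest] = [(m * n + v) / (n + 1)
                                for m, v in zip(centers[nearest], rec)]
            counts[nearest] = n + 1
            labels.append(nearest)
        else:
            centers.append(list(rec))
            counts.append(1)
            labels.append(len(centers) - 1)
    return labels, centers


records = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
sub_labels, sub_centers = sequential_precluster(records, threshold=1.0)
print(sub_labels, sub_centers)
```

The resulting sub-cluster centers are far fewer than the original records, which is what makes a second, more expensive grouping step affordable.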


R-mclust package

• mclust is a contributed R package for model-based clustering, classification, and density estimation based on finite normal mixture modeling.
• It provides functions for parameter estimation via the EM algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models.
• Also included are functions that combine model-based hierarchical clustering, EM for mixture estimation, and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation, and discriminant analysis.
• Provides the capability to automatically find the optimal number of clusters.

Principal Component Analysis

• PCA transforms a set of interrelated variables into a new set of uncorrelated variables called principal components (PCs).
• The variance of a PC indicates the amount of total variation (information) in the original variables conveyed by that particular PC.
• The transformation is chosen so that the PCs are ordered: the first PC accounts for as much of the variability in the original variables as possible, and each succeeding PC in turn has the highest variance possible under the constraint that it be uncorrelated with the preceding PCs.
• Thus the most informative PC is the first and the least informative is the last.
• PCA is a powerful exploratory tool for identifying patterns in data because PCs reflect the interrelation of the similarities and dissimilarities between observed variables.

Estimate of the regression coefficients (population)

Effect                                          Estimate (SE)      P-value
Mean BMIz at the age of one month               0.2879 (0.0138)    <0.0001
Rate of change in mean BMIz from 1-9 months     0.102 (0.0195)     <0.0001
Rate of change in mean BMIz from 9-27 months    -0.0586 (0.0096)   <0.0001
Rate of change in mean BMIz from 27-60 months   0.0949 (0.0062)    <0.0001

• A significant increasing trend in population BMIz at ages 1 to 9 months and at ages 27 to 60 months
• A significant decreasing trend in population BMIz at ages 9 to 27 months
• The rates of change in BMIz in the three time intervals are significantly different

Estimated variance of the random effects (individual)

Effect                                          Estimate (SE)     P-value
Var(BMIz at the age of one month)               0.6305 (0.0157)   <0.0001
Var(Rate of change in BMIz from 1-9 months)     1.212 (0.0311)    <0.0001
Var(Rate of change in BMIz from 9-27 months)    0.1766 (0.0281)   <0.0001
Var(Rate of change in BMIz from 27-60 months)   0.0636 (0.0045)   <0.0001

• Substantial individual-to-individual variability in
   - BMIz at the age of one month
   - the rate of change in BMIz in each time interval
