NEW VARIATIONAL BAYESIAN
APPROACHES FOR
STATISTICAL DATA MINING
With applications to profiling and differentiating
habitual consumption behaviour of customers
in the wireless telecommunication industry
BURTON WU
Bachelor of Applied Science (Mathematics), QUT
Bachelor of Engineering (Electrical & Computing), QUT
Bachelor of Engineering (Honours), QUT
A thesis submitted for the degree of Doctor of Philosophy
Mathematical Sciences
Faculty of Science and Technology
Queensland University of Technology
Principal Supervisor: Professor Anthony N. Pettitt
Associate Supervisor: Dr. Clare A. McGrory
April 2011
Abstract
This thesis investigates profiling and differentiating customers through the use of
statistical data mining techniques. The business application of our work centres on
examining individuals’ seldom-studied yet critical consumption behaviour over an
extensive time period within the context of the wireless telecommunication indus-
try; consumption behaviour (as opposed to purchasing behaviour) is behaviour that
has been performed so frequently that it has become habitual and involves minimal
intention or decision making. Key variables investigated are the activity initiation
timestamp and cell tower location as well as the activity type and usage quantity
(e.g., voice call with duration in seconds); the research focus is on customers’
spatial and temporal usage behaviour. The main methodological emphasis is on
the development of clustering models based on Gaussian mixture models (GMMs)
which are fitted with the use of the recently developed variational Bayesian (VB)
method. VB is an efficient deterministic alternative to the popular but computa-
tionally demanding Markov chain Monte Carlo (MCMC) methods. The standard VB-
GMM algorithm is extended by allowing component splitting such that it is robust to
initial parameter choices and can automatically and efficiently determine the num-
ber of components. The new algorithm we propose allows more effective modelling
of individuals’ highly heterogeneous and spiky spatial usage behaviour, or more gen-
erally human mobility patterns; the term spiky describes data patterns with large
areas of low probability mixed with small areas of high probability. Customers are
then characterised and segmented based on the fitted GMM which corresponds to
how each of them uses the products/services spatially in their daily lives; this es-
sentially reflects their likely lifestyle and occupational traits. Other significant research con-
tributions include fitting GMMs using VB to circular data, i.e., the temporal usage
behaviour, and developing clustering algorithms suitable for high dimensional data
based on the use of VB-GMM.
Keywords
Gaussian Mixture Model (GMM); Mixture Models; Probability Density Estima-
tion; Variational Bayes (VB); Bayesian Statistics; Data Mining (DM); Combinational
Data Analysis (CDA); Profiling; Segmentation; Clustering; Feature Extraction; Be-
havioural Characteristics; Consumer Behaviour; Customer Behaviour; Consump-
tion Behaviour; Customer Relationship Management (CRM); Relationship Market-
ing (RM); Human Mobility Pattern; Spatial Behaviour; Temporal Behaviour; Circu-
lar Data; Data Stream; High Dimensional Data; Call Detail Records (CDR); Wireless
Telecommunication Industry
Acronyms
AIC      Akaike’s Information Criterion
BIC      Bayesian Information Criterion
CCC      Cubic Clustering Criterion
CDA      Combinational Data Analysis
CDR      Call Detail Records
CH       Calinski and Harabasz (Index)
DIC      Deviance Information Criterion
DP       Dirichlet Process
EM       Expectation-Maximization (Algorithm)
GMM      Gaussian Mixture Model
HDDC     High Dimensional Data Clustering (Algorithm)
i.i.d.   Independent and Identically Distributed
KL       Kullback-Leibler (Divergence)
KM       k-Means Algorithm
LL       Log-Likelihood
MAE      Mean Absolute Error
MAEAC    Mean Absolute Error Adjusted for Covariance
MCMC     Markov Chain Monte Carlo
MD       Mahalanobis Distance
ML       Maximum Likelihood
PCA      Principal Component Analysis
RJMCMC   Reversible Jump Markov Chain Monte Carlo
SD       Standard Deviation
SEVB     Split and Eliminate Variational Bayesian (Method/Algorithm)
SMS      Short Message Services
VB       Variational Bayes or Variational Bayesian (Method/Algorithm)
Preface
This thesis includes four chapters that have been submitted as articles for publica-
tion as follows.
• Chapter 3 titled “The Variational Bayesian Method: Component Elimination,
Initialization & Circular Data” has been submitted;
• Chapter 4 titled “A New Variational Bayesian Algorithm with Application to Hu-
man Mobility Pattern Modeling” has been accepted for Statistics and Comput-
ing. Note that the concepts described in this chapter were also presented as
a peer-reviewed poster at the Thirteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2010);
• Chapter 5 titled “Customer Spatial Usage Behavior Profiling and Segmentation
with Mixture Modeling” is undergoing revision for a marketing journal. Note
that the concepts discussed here were also presented as a poster at the Ninth
World Conference of the International Society for Bayesian Analysis (ISBA 2008);
and
• Chapter 6 titled “Identifying Subspace Clusters for High Dimensional Data with
Mixture Models” has been submitted.
All research was carried out in collaboration with my principal supervisor, Professor
Anthony N. Pettitt, and my associate supervisor, Dr. Clare A. McGrory. I proposed
the ideas for each of the articles and I was the main researcher responsible for im-
plementing the methodology described therein and the writing of the articles. Ad-
ditionally, I collected all of the data according to the intellectual property (IP) agree-
ment signed by all associated parties.
Acknowledgements
I am grateful to my supervisors, Professor Anthony N. Pettitt and Dr. Clare A. Mc-
Grory, for their guidance in this work. Their experiences and knowledge were in-
valuable to this project.
I also thank my former managers, Michael Sheehan and Terry Simmonds, for their
support, in particular their efforts in organising the intellectual property agreement;
this research would not have been possible without their involvement. For the same
reason, I also thank Dr. Terry Bell who was the senior contract review officer at QUT.
I am also grateful to all the academic and supporting staff and students within the
discipline of mathematical sciences as well as many of my former colleagues for their
assistance and kindness.
Finally, thank you mum, dad and my sister for always being there for me.
Contents
Abstract iii
Keywords v
Acronyms vii
Preface ix
Acknowledgements xi
1 Introduction 1
1.1 Understanding Customer Behaviour & Human Mobility Patterns . . . . 1
1.2 Telecommunication Call Data Record Dataset . . . . . . . . . . . . . . . 3
1.3 Motivation for Our Inferential Approach . . . . . . . . . . . . . . . . . . 3
1.4 The Role of Statistical Data Analysis . . . . . . . . . . . . . . . . . . . . . 7
1.5 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Literature Review 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Research problem & methodology requirements . . . . . . . . . 11
2.1.2 Research methodology overview . . . . . . . . . . . . . . . . . . 16
2.2 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Gaussian mixture model (GMM) . . . . . . . . . . . . . . . . . . 18
2.2.3 Classical/frequentist techniques (for GMMs) . . . . . . . . . . . 18
2.2.4 Bayesian techniques (for GMMs) . . . . . . . . . . . . . . . . . . 21
2.2.5 Approximate techniques (for GMMs) . . . . . . . . . . . . . . . . 27
2.2.6 High dimensional GMM . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.7 Review conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Classical clustering algorithms . . . . . . . . . . . . . . . . . . . 35
2.3.3 Scalable clustering algorithms . . . . . . . . . . . . . . . . . . . . 37
2.3.4 Algorithms for clustering high dimensional data . . . . . . . . . 39
2.3.5 Review conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3 Variational Bayesian Method: Component Elimination, Initialization & Cir-
cular Data 51
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 VB-GMM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Model Evaluation Criterion . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.1 The irreversible nature of the VB component elimination prop-
erty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.2 Evaluating the results of the VB-GMM fit under different initial-
ization schemes for padded circular data . . . . . . . . . . . . . 59
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 A New Variational Bayesian Algorithm with Application to Human Mobility
Pattern Modeling 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Standard VB-GMM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3 Split and Eliminate Variational Bayes for Gaussian Mixture Models
(SEVB-GMM) Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.1 Model stability criterion . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.2 Component splitting criteria . . . . . . . . . . . . . . . . . . . . . 83
4.3.3 Component split operations . . . . . . . . . . . . . . . . . . . . . 85
4.3.4 Algorithm termination criterion . . . . . . . . . . . . . . . . . . . 88
4.3.5 Model selection criterion . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 Human Mobility Pattern Application & Results . . . . . . . . . . . . . . 90
4.4.1 Data mining & human mobility patterns . . . . . . . . . . . . . . 90
4.4.2 Simulated results . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.3 Real data results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5 Customer Spatial Usage Behavior Profiling and Segmentation with Mixture
Modeling 111
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2 Data & Individuals’ Consumption Behavior . . . . . . . . . . . . . . . . 116
5.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2.2 Usage behavior of aggregated voice call durations and SMS
counts & the segmentation stability benchmark . . . . . . . . . 117
5.2.3 Spatial usage behavior (or mobility patterns) . . . . . . . . . . . 118
5.3 Modeling Individuals’ Spatial Usage Behavior . . . . . . . . . . . . . . . 119
5.3.1 Gaussian mixture model (GMM) . . . . . . . . . . . . . . . . . . 120
5.3.2 The variational Bayesian (VB) method . . . . . . . . . . . . . . . 121
5.3.3 Split and eliminate variational Bayes for Gaussian mixture
model (SEVB-GMM) algorithm . . . . . . . . . . . . . . . . . . . 122
5.3.4 Results, model accuracy & computational efficiency . . . . . . . 124
5.4 Profiling Individuals’ Spatial Usage Behavior . . . . . . . . . . . . . . . 126
5.4.1 SEVB-GMM component characteristics . . . . . . . . . . . . . . 128
5.4.2 Differentiating SEVB-GMM components . . . . . . . . . . . . . 128
5.4.3 SEVB-GMM component types . . . . . . . . . . . . . . . . . . . . 130
5.4.4 Spatial usage behavioral signatures . . . . . . . . . . . . . . . . . 131
5.4.5 Results & spatial usage behavioral profile stability . . . . . . . . 133
5.5 Spatial Usage Behavioral Segmentation . . . . . . . . . . . . . . . . . . 135
5.5.1 The k-means (KM) algorithm & selection of number of groups . 135
5.5.2 Results & spatial usage behavioral segmentation stability . . . . 137
5.6 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6 Identifying Subspace Clusters for High Dimensional Data with Mixture
Models 151
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.2 VB-GMM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.3 Subspace Clusters Identification . . . . . . . . . . . . . . . . . . . . . . 157
6.3.1 Approximating the density of each 2D subspace with VB-GMM 157
6.3.2 Detection of dense regions of each 2D subspace . . . . . . . . . 158
6.3.3 Estimating the associated subspace of each observation . . . . . 159
6.3.4 Identifying interesting associated subspaces . . . . . . . . . . . 161
6.3.5 Assigning observations to appropriate subspace clusters . . . . 161
6.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4.1 Sensitivity to choice of δ, the tolerance level for determining if
the VB-GMM model has converged . . . . . . . . . . . . . . . . . 163
6.4.2 Sensitivity to choice of GMM granularity h (or kinitial) . . . . . . 164
6.4.3 Sensitivity to choice of c2, the likelihood threshold where ob-
servations are considered to be in the dense regions . . . . . . . 164
6.4.4 Sensitivity to choice of c3, the threshold level in determining
the dimension relevance to an observation . . . . . . . . . . . . 165
6.4.5 Effect of data dimensionality d . . . . . . . . . . . . . . . . . . . 165
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7 Conclusion 171
7.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.1.1 Semi-parametric Bayesian methods & mixed membership
models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.1.2 Spatial-temporal/longitudinal extension . . . . . . . . . . . . . 173
7.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 174
A Review of Research Question 179
A.1 Telecommunication Industry Research . . . . . . . . . . . . . . . . . . . 179
A.2 Customer/Consumer Research . . . . . . . . . . . . . . . . . . . . . . . 180
A.2.1 Customer management system . . . . . . . . . . . . . . . . . . . 182
A.2.2 Customer behaviour heterogeneity . . . . . . . . . . . . . . . . . 183
A.2.3 Consumer behaviour research . . . . . . . . . . . . . . . . . . . . 184
A.2.4 Customer/market segmentation . . . . . . . . . . . . . . . . . . 186
A.3 Review Conclusion & Research Proposal . . . . . . . . . . . . . . . . . . 187
B Review of Data Stream Mining 191
B.1 Data Stream & Its Mining Challenges . . . . . . . . . . . . . . . . . . . . 191
B.2 Synopsis Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
B.3 Review Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
C Review of Clustering Time Series & Data Stream 199
C.1 Time Series Representation & Clustering . . . . . . . . . . . . . . . . . . 199
C.2 Clustering on Extracted Time Series Characteristics . . . . . . . . . . . 200
C.3 Data Stream Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
C.4 Review Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Bibliography 203
List of Figures
1.1 Spatial usage behaviour of two subscribers over the 17-month period.
Plots (a) and (c) are line plots connecting consecutive activities; (b) is a
bubble plot where the bubble volume represents the total number of
activities at the location. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Weekday 24-hour temporal usage behaviour (i.e., number of activities)
for two subscribers over the 17-month period. . . . . . . . . . . . . . . 5
3.1 Overlapping initialization scheme . . . . . . . . . . . . . . . . . . . . . 55
3.2 Distribution of number of components k with selected setups. . . . . . 61
3.3 The results of the VB-GMM fits of the usage pattern of User A. The his-
togram summarizes the actual observations; (a) represents the model
fit of the Partitioned and kinitial = 17 setup, (b) Overlapping and
kinitial = 17 setup, (c) Partitioned and kinitial = 23 setup, (d) Overlap-
ping and kinitial = 23 setup, (e) Partitioned and kinitial = 35 setup, and
(f) Overlapping and kinitial = 35 setup. . . . . . . . . . . . . . . . . . . . 64
3.4 The results of the VB-GMM fits of the usage pattern of User B. The his-
togram summarizes the actual observations; (a) represents the model
fit of the Partitioned and kinitial = 17 setup, (b) Overlapping and
kinitial = 17 setup, (c) Partitioned and kinitial = 23 setup, (d) Overlap-
ping and kinitial = 23 setup, (e) Partitioned and kinitial = 35 setup, and
(f) Overlapping and kinitial = 35 setup. . . . . . . . . . . . . . . . . . . . 65
3.5 The results of the VB-GMM fits of the usage pattern of User C. The his-
togram summarizes the actual observations; (a) represents the model
fit of the Partitioned and kinitial = 17 setup, (b) Overlapping and
kinitial = 17 setup, (c) Partitioned and kinitial = 23 setup, (d) Overlap-
ping and kinitial = 23 setup, (e) Partitioned and kinitial = 35 setup, and
(f) Overlapping and kinitial = 35 setup. . . . . . . . . . . . . . . . . . . . 66
3.6 Stephen’s Kuiper V∗n vs. n; (a) Random and kinitial = 17 setup, (b) Par-
titioned and kinitial = 17 setup, (c) Overlapping and kinitial = 17 setup,
(d) Random and kinitial = 23 setup, (e) Partitioned and kinitial = 23 setup,
(f) Overlapping and kinitial = 23 setup, (g) Random and kinitial = 35 setup,
(h) Partitioned and kinitial = 35 setup, and (i) Overlapping
and kinitial = 35 setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1 (a) Plot of our simulated dataset where the data points (‘Actual’) are
marked by an ‘x’. (b) The results of our SEVB-GMM fit of a bivariate
mixture model to these data; the center of each component in the fit-
ted mixture is indicated by a ‘+’ and we also show 95% probability re-
gions (outlined by ‘-’) for each component in the model. We can see
that the data appear to be well represented by the fitted model. Note
also that the resulting fit is identical for kinitial = 1− 20. . . . . . . . . . 96
4.2 Selected results obtained from applying the standard VB-GMM algo-
rithm under different initialization conditions to the simulated data
shown in Figure 4.1 (a). The centers of each component in the fitted
mixtures are indicated by a ‘+’, we also show 95% probability regions
(‘-’) for each component in the model. The computed values of F and
MAEAC, and the fitted value of k in the final model are also shown.
We can see that the initial choice for k and the corresponding initial
component allocation does influence the final fit obtained. . . . . . . . 97
4.3 (a) Observed mobility pattern of Subscriber A over a 17-month period
corresponding to the recorded locations, marked by an ‘x’, of cell tow-
ers from which telecommunication activities were initialized. (b) The
results of the SEVB-GMM fit of a bivariate mixture model to these data;
the center of each component in the fitted mixture is indicated by a ‘+’
and we plot the 95% probability regions (‘-’) for each fitted compo-
nent. Note that results obtained were similar for kinitial = 1 − 18 and
that values of kfinal, F and MAEAC corresponding to various kinitial are
summarized in Table 4.3. . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4 Selected results obtained by using various choices for kinitial in the
standard VB-GMM algorithm for Subscriber A’s mobility pattern
shown in Figure 4.3 (a); the center of each component in the fitted
mixture is indicated by a ‘+’ and we plot the 95% probability regions
(marked by ‘-’) for each fitted component. . . . . . . . . . . . . . . . . . 101
4.5 Mobility patterns over a 17-month period for four subscribers are
shown in the left column. Observations, ‘x’, are the recorded cell tower
locations from which subscribers initiated a communication. Bivari-
ate mixture models fitted using SEVB-GMM are shown in the right
column; the center of each fitted mixture component is marked ‘+’
and corresponding 95% probability regions (‘-’) are shown. Note that
SEVB-GMM was initialized with inappropriate choice kinitial = 1 each
time, yet we are still able to model the data well. Values for F and
MAEAC are also reported. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.6 Comparisons between fits obtained from standard VB-GMM and our
SEVB-GMM algorithm when using different values kinitial ranging from
1 to 30 based on the observed data for 100 randomly selected anony-
mous individuals: (a) plot of the fitted kfinal vs. the kinitial that was used
for both algorithms, (b) value of MAEAC (Equation (4.7)) for the fits
from both algorithms vs. the kinitial that was used, and (c) for the fits
obtained from both the standard and SEVB algorithms, we computed
the corresponding values of BIC, DIC, F and MAEAC then plotted the
% of times there was an agreement in the model that would be selected
based on either the BIC, DIC or F values, and the model that was se-
lected in the SEVB algorithm using MAEAC. . . . . . . . . . . . . . . . . 105
5.1 Voice call duration distributions approximated by a mixture of log-
normal distributions (‘—’s) of two subscribers whose voice call dura-
tions have a mean of 58 seconds. (a) Subscriber 1: large amount of
‘message’-like calls of a very short duration. (b) Subscriber 2: call du-
ration is more evenly distributed when compared with Subscriber 1. . . 113
5.2 Spatial behavior of four different subscribers. (a) Subscriber A: inter-
capital businessperson-like pattern. (b) Subscriber B: inter-state truck
driver-like pattern. (c) Subscriber C: home-office-like pattern shown
in bubble plot. (d) Subscriber D: taxi driver-like pattern shown in bub-
ble plot. Note that in (a) and (b) ‘x’s represent the actual observations
and ‘. . .’s represent the ‘virtual’ path the user is likely to have taken
between two consecutive actual observations. In (c) and (d), user pat-
terns are shown in the form of bubble plots instead of the scatter plot
for better demonstrating that a large number of activities were initi-
ated from the same cell tower locations; the size of the bubble repre-
sents the activity volume of the particular location. . . . . . . . . . . . . 115
5.3 Distributions of users’ aggregated call patterns. (a) Aggregated voice
call durations. (b) Aggregated SMS counts. . . . . . . . . . . . . . . . . . 117
5.4 Mobility pattern analysis. (a) Percentage of outbound activities made
from users’ top five preferred locations. (b) Average of users’ cumula-
tive activity count distribution with respect to distance from their real
centers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5 SEVB-GMM results of the four subscribers in Figure 5.2. (a) Subscriber
A. (b) Subscriber B. (c) Subscriber C. (d) Subscriber D. Note that the el-
lipses represent 95% probability regions for the component densities,
whereas the estimated centers of these components are marked by ‘+’s
and the actual observations are marked by ‘x’s. We also note that the
95% probability regions of some components (e.g., those correspond-
ing to point masses) are not always visible because they are simply
too small to be seen. The most noticeable examples are the two most
weighted components in (c) which correspond to the three big bub-
bles (with two of them centered at a nearly identical spot) in Figure 5.2 (c). . 125
5.6 Model accuracy of SEVB-GMM. (a) Distribution of distances between
real & SEVB-GMM centers. (b) Average of users’ cumulative activ-
ity count distribution with respect to distance from their SEVB-GMM
centers. Note that ‘. . .’s refer to calculations made with respect to the
SEVB-GMM model fits, whereas ‘—’s were calculated with respect to
the actual data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.7 Mobility pattern analysis based on SEVB-GMM. (a) Distribution of
SEVB-GMM component maximum SD σmax for which σmax ≤ 10 km.
(b) Distribution of SEVB-GMM component weight w for which
w ≤ 0.24. (c) Distribution of % of variation accounted for by the
first principal components (the p1’s) of the SEVB-GMM components.
(d) Distribution of distances between users’ daily activity boundary to
their SEVB-GMM centers (Note: almost identical for real centers). . . . 129
5.8 Distribution of spatial usage behavior. (a) SignificantWt. (b) UrbanWt.
(c) RemoteWtX2. (d) UrbanArea (1 = 30²π km²). (e) RouteDist (1 = 1000 km). (f) HomeOfficeLik. . . . . . . . . . . . . . . . . 132
5.9 Selected k-means clustering results # 1. (a) Clustering quality eval-
uated with respect to different g when subscribers are clustered with
SignificantWt, UrbanWt & UrbanArea. (b) Clustering quality evaluated
with respect to different g when subscribers are clustered with Signif-
icantWt, UrbanWt, UrbanArea, RemoteWtX2 & RouteDist. (c) Cluster-
ing quality evaluated with respect to different g when including voice
call duration & SMS counts into the setting (b). (d) Variables R2’s (RSQ)
for the setting (c) with voice call duration marked as D, SMS counts as
S & five spatial behavioral signatures unmarked. Note that in (a) to (c),
lines with • correspond to CH index, and lines with � correspond to
CCC; number of groups g is generally chosen based on the local max-
ima shared by both CH index and CCC. . . . . . . . . . . . . . . . . . . . 137
5.10 Cross validation results # 1. (a) Clustering quality evaluated with re-
spect to different g for the new sample # 1 with setting of Figure 5.9 (b).
(b) Clustering quality evaluated with respect to different g for the new
sample # 1 with the unsuccessful simplistic model described in §5.6.
Note that lines with • correspond to CH index, and lines with � cor-
respond to CCC; number of groups g is generally chosen based on the
local maxima shared by both CH index and CCC. . . . . . . . . . . . . . 141
Statement of Original Authorship
I hereby declare that this submission is my own work and to the best of my knowl-
edge it contains no material previously published or written by another person, nor
material which to a substantial extent has been accepted for the award of any other
degree or diploma at QUT or any other educational institution, except where due
acknowledgement is made in the thesis. Any contribution made to the research by
colleagues, with whom I have worked at QUT or elsewhere, during my candidature,
is fully acknowledged.
I also declare that the intellectual content of this thesis is the product of my
own work, except to the extent that assistance from others in the project’s design
and conception or in style, presentation and linguistic expression is acknowledged.
Signature:
Burton Wu
Date:
1 Introduction
1.1 Understanding Customer Behaviour & Human Mobility
Patterns
Customers are the most important asset of any business, but customers today
are more educated, sophisticated, expectant, demanding and volatile than ever
(Yankelovich and Meer, 2006). “Being willing and able to change your behaviour to-
ward an individual customer based on what the customer tells you and what else you
know about the customer” (Peppers et al., 1999, p.151) is vital to business survival
and success. It is also important to understand that not all customers are the same or
equally profitable for business (Cooper and Kaplan, 1991). Differentiating between
customers according to the detailed understanding of their needs, behaviour, prof-
itability and values to the business is therefore crucial in enabling companies to have
appropriate relationships with each individual customer (Peppers et al., 1999).
Research finds that currently there is “too little emphasis on actual [customer] be-
haviour” (Yankelovich and Meer, 2006, p.131). Still, the majority of existing stud-
ies focus on purchasing behaviour (e.g., buying goods such as houses, vehicles, or
plasma televisions) or loyalties (i.e., churn), while little attention is paid to compre-
hending customers’ consumption behaviour (Jacoby, 1978), i.e., behaviour such as
making phone calls, accessing the Internet and using water or electricity.
More formally, consumption behaviour refers to activities which have been per-
formed so frequently that they have become habitual and involve little decision
making (Ouellette and Wood, 1998; Ajzen, 2001). It is different from purchasing be-
haviour (Alderson, 1957) and is more relevant to the service industries than to the
retail industries. However, companies’ existing understanding of individuals’ con-
sumption behaviour appears to be almost exclusively limited to discrete (e.g., which
services customers use and the number of institutions they conduct business with),
or average and aggregated measures (e.g., number of transactions per month) (cf.
Yankelovich and Meer, 2006). These measures, which make no distributional as-
sumptions at all, are not necessarily appropriate, meaningful or adequate for de-
scribing the observed patterns (Schultz, 1995). Also, some studies (e.g., Schultz,
1995) suggest that using just a single Gaussian distribution is generally not appro-
priate for describing customers and their behaviour. Appendix A.2.2 reviews the issue
of customer behavioural heterogeneity further.
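To illustrate why a single Gaussian is generally inadequate for such heterogeneous behaviour, the following minimal sketch compares log-likelihoods on simulated bimodal data; the data, means and standard deviations are purely illustrative assumptions and are not drawn from the study's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bimodal 'usage' data: two distinct habitual behaviour modes.
data = np.concatenate([rng.normal(2, 0.3, 500), rng.normal(9, 0.5, 500)])

def gauss_loglik(x, mu, sigma):
    """Log-likelihood of x under a single Gaussian N(mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# Single Gaussian (ML fit): forced to spread its mass across both modes.
single = gauss_loglik(data, data.mean(), data.std())

# Two-component mixture with the generating parameters, standing in for a fitted GMM.
dens = 0.5 * np.exp(-(data - 2)**2 / (2 * 0.3**2)) / (0.3 * np.sqrt(2 * np.pi)) \
     + 0.5 * np.exp(-(data - 9)**2 / (2 * 0.5**2)) / (0.5 * np.sqrt(2 * np.pi))
mixture = np.sum(np.log(dens))

print(mixture > single)  # True: the mixture captures the two modes far better
```

The gap in log-likelihood grows with the separation of the modes, which is the situation the thesis describes as spiky behaviour.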
In spite of this, most studies to date do not take the observed behaviour pattern
variations at a point in time into consideration; and they also do not investigate
individuals’ behaviour spatially or temporally. Businesses need a more sophisti-
cated approach than any that are currently available to profile as well as differentiate
the types of customer behaviour they have observed. This would increase insights
that can be used for interacting with each individual in a personalised format, and
support effective strategic and tactical informed decision making, business man-
agement and resource planning. In other words, there appears to be insufficient
understanding of customers’ actual consumption behaviour in the competitive and
rapidly changing wireless telecommunication industry, which is the application fo-
cus of this research.
Telecommunication is both a service as well as a retail industry (Berry and Linoff,
2004, pp.314-315) and there is a great demand for statistical data analysis. Our ap-
plication for research involves the use of mobile phone traffic data to better compre-
hend each telecommunication subscriber’s usage patterns. Such data are sometimes
referred to as call detail records (CDR); their primary use is for billing subscribers (and
hence there are fewer issues with accuracy and completeness when compared to
data typically used for analysis). To date, detailed CDR has not been utilised for bet-
ter profiling and differentiating individual subscribers. This is somewhat surprising
since CDR is typically readily available to all established telecommunication busi-
nesses.
Specifically, the primary interest for our application of this research is to analyse in-
dividuals’ seldom-studied, habitual spatial and temporal usage behaviour, the latter
with respect to patterns across days of the week and hours of the day. However, the key focus
of the modelling is the spatial aspect of the problem where we see unique charac-
teristics which are discussed further in § 2.1.1. This is essentially the problem of
gaining statistical insights into human mobility patterns (Gonzalez et al., 2008) for
which modelling is still largely problematic. The ability to model human mobility
has implications for epidemic prevention, emergency response, and urban planning,
for example (Gonzalez et al., 2008). Our research in this area makes a contribution
to the development of statistical methods for pattern approximation, interpretation
and classification.
1.2 Telecommunication Call Data Record Dataset 3
1.2 Telecommunication Call Data Record Dataset
The dataset consisted of a total of 95,061 wireless consumer (as opposed to business)
customers or 185,331 subscribers when including those who have defected, partially
defected or newly connected. They have been randomly sampled via a simple ran-
dom sampling (SRS) procedure, and are based on a fixed percentage of the entire
subscriber base. Note that the selected subscribers should not give a biased repre-
sentation of the customer base as they are not based on uncontrolled convenience
samples, a technique that is commonly applied (Glymour et al., 1996; Hand, 1998).
Successful outbound usage activities for 17 consecutive months from 1st May 2006
00:00 to 30th September 2007 23:59 for these traced subscribers were recorded. Key
attributes of these records are:
• subscriber identification number,
• activity initiated date and time,
• activity initiated tower location in latitude and longitude,
• activity type, and usage quantities. For example, voice call in seconds (and count), and text message in count.
There are a total of 264,782,106 observed records (i.e., usage activities) with
185,324,517 made within Australia.
Moreover, in contrast to other related research, popular demographic variables
available about the customer, while having been shown to be useful (Verhoef and
Donkers, 2001), are not investigated here. The reason is that the account holder is not necessarily the actual user, so such demographic variables may be misleading; note that this false assumption is often made in current practice. The sole focus here is the seldom studied behavioural data, which should also be useful in predicting a customer's future behaviour (c.f. Schmittlein and Peterson, 1994).
Additionally, in contrast to analysing outbound voice calls and short message service
(SMS) separately, as is done typically, this research focuses mainly on analysing the
activities together.
1.3 Motivation for Our Inferential Approach
Selected examples which illustrate the inspiration for our research application are
presented here. Figure 1.1 shows the spatial usage behaviour of two anonymous sub-
scribers over a period of time. The pattern is based on the observed cellular tower
location (i.e., its longitude and latitude) where the subscriber successfully initialised
an outbound activity such as a voice call or text message; the activity-initialised cell tower is typically the one geographically closest to the user, although exceptions may occur if, for example, the closest cell tower is out of service or too busy. Figure 1.2 illustrates the temporal usage behaviour of two other anonymous subscribers over the
(a) Subscriber A (b) Subscriber A (Zoomed in on Sydney)
(c) Subscriber B
Figure 1.1: Spatial usage behaviour of two subscribers over the 17-month period. Plots (a) and (c) are line plots connecting consecutive activities; (b) is a bubble plot where the bubble volume represents the total number of activities at the location.
same period.
Looking at Figure 1.1 (a), we can see that there is a strong suggestion that Subscriber
A is a business professional who is based in Sydney, Australia, and travels interstate
(i.e., to Brisbane, Gold Coast, Cairns, and Melbourne) occasionally (about 10% of the
time). It also shows that he/she is most likely to have travelled by airplane rather
than by car to these interstate destinations since no activities have been recorded
(a) Subscriber C (b) Subscriber D
Figure 1.2: Weekday 24-hour temporal usage behaviour (i.e., number of activities) for two subscribers over the 17-month period.
between Sydney and the interstate locations; Hamilton Island is a holiday destina-
tion. Figure 1.1 (b) provides a closer look at his/her spatial usage behaviour closer to
‘home’ (i.e., Sydney). The size of the ‘bubble’ in this figure represents the number of
the activities initialized via the cellar tower location. The somewhat ‘tradesperson’-
like pattern suggests that his/her profession and/or lifestyle requires visiting vari-
ous part of the Sydney regularly (about 30% of the time), in contrast to many other
‘home-and-office’-like subscribers who typically visit only a handful of selected lo-
cations (e.g., home and office) within the living neighbourhood. It also suggests that
Subscriber A is able to make the majority (about 60%) of activities from a relatively
fixed location where the two largest bubbles are located. His/her ‘home’, in this case,
is most likely to be located at the intersection of two cellular towers' service areas.
Figure 1.1 (c) shows the mobility pattern of another subscriber, Subscriber B, who
has travelled around Australia during the analysed period.
Figure 1.2 (a) shows one of many subscribers, Subscriber C, for whom the majority of his/her 1,442 communications are 'restricted' to an hour-long window (i.e., between 7 and 8pm) during the entire analysed period. Preliminary investigation indicates that
this is not an isolated case; communication time ‘restriction’, for reasons that are
not known to us, is a behavioural characteristic of this user. On the other hand, Figure 1.2 (b) shows the temporal usage pattern of Subscriber D, who has been quite active around midnight when the majority of other subscribers are asleep; this 'party lifestyle'-like behaviour, shared by many other users, appears to be their distinguishing feature. Note that this study only focuses on weekdays; while preliminary investigation shows that individuals' temporal behaviour generally varies somewhat from
Monday to Friday, weekday usage tends to be significantly different to weekend us-
age.
Overall, considering the above examples, individuals' time-independent spatial and temporal usage patterns appear to provide some indication of their likely lifestyle and/or occupational traits, which otherwise cannot be easily or cheaply discovered. These insights, which combine actual behavioural understanding with inferred lifestyle/occupation, appear to be potentially valuable for businesses to
enhance their relationship with customers, influence customer behaviour, ‘bene-
fit’ the customers with a deeper understanding of how the products/services are
used in their day to day lives (Fournier et al., 1998; Yankelovich and Meer, 2006),
and take more appropriate actions and/or decisions (e.g., pricing structures around
hours of the day). There is an obvious need to understand each customer's exhibited behaviour more comprehensively, and businesses have long argued the importance of understanding customers' lifestyle/occupation. However, many of them appear to have lost focus and have been 'actively looking' for customers' lifestyle/occupation (Yankelovich and Meer, 2006) through the only approaches available to them, such as the rather 'unreliable' market research approach (Wolfers and
Zitzewitz, 2004) and/or the use of coupon promotions (Stone et al., 2004, p.114). Ap-
pendix A.2.3 provides a detailed review of the current ‘strategies’ used in practice and
discusses the lack of reliability issues associated with market research.
The research challenges/objectives are therefore to first find a reliable and practical statistical way to approximate these clearly overlooked individual habitual consumption behaviour patterns, many of which are extremely heterogeneous (to be discussed further in § 2.1.1, and more generally in Appendix A.2.2), and then to profile these observed patterns meaningfully. Efficient and effective techniques are evidently required, as it is clearly impractical to attempt to interpret each pattern subjectively, visually and manually. Of course, the proposed methods must also be transparent for interpretation. While it appears that our proposed strategy can provide
businesses with a more sophisticated description and understanding of each (ex-
isting) individual customer with respect to his/her actual consumption behaviour,
profiling and differentiating customers based on these alternative ideas also needs
to be examined and compared to an existing benchmark. Chapter 2 reviews various
existing potentially suitable techniques for our problem, while Appendix A provides a more in-depth review of the business-related aspects of our research question; in particular, Appendix A.2.4 reviews how customer segmentation is typically performed today.
Note that in reality, one wireless customer can have multiple wireless subscriptions.
Here the focus is on the subscriptions rather than the customers. However, the ex-
tracted subscriber knowledge will still be insightful for those who have multiple sub-
scriptions.
1.4 The Role of Statistical Data Analysis
Analysis of this kind is often referred to as data mining (DM) or data mining and
knowledge discovery from data (DMKDD); it refers to the process of extracting non-trivial, previously unknown but useful information hidden in large datasets in a bottom-up fashion (Fayyad et al., 1996). It is an interdisciplinary field of study
which integrates statistics, database technology, machine learning, and pattern
recognition, for example (Hand, 1998). It has attracted serious attention in recent
years, particularly in industry, as a result of the explosive amount of data now available, and the pressing need to turn it into competitive advantage, i.e., knowledge (Han and Kamber, 2006).
A common misconception is that data mining is an “automatic or semi-automatic”
(Berry and Linoff, 2004, p.7) black box product, and while you might place your ‘faith’
in finding the complete solutions, it often turns out to be a disaster (Dasu and John-
son, 2003, p.2). That is, while it is essential to have some sort of computer program
(i.e., for automation) when dealing with large volumes of data, making use of the
domain knowledge, and involvement of the researcher in the inference process is
fundamental (Hand, 1998), but often ignored. Additionally, most off the shelf pack-
ages generally only consist of restricted functionalities of very limited standard or
textbook techniques that will not suit all projects, such as ours, for example.
Large volumes of CDR pose great challenges to analysts for obtaining useful sub-
scriber knowledge. One of the essential first steps for mining them is therefore to
summarise/approximate the data efficiently with significantly less required space
(Han and Kamber, 2006). While it is important to be able to process the data effi-
ciently, or even in a parallel/distributed fashion, which is also known as high per-
formance computing (Park and Kargupta, 2003), it is also important to emphasise
the statistical learning aspects (rather than today's common emphasis on computing and databases (c.f. Appendix A.2.1), for example) (Hand, 1998). Large volumes of data records and variables mean that traditional processes for statistical inference are likely to be inappropriate (Hand, 1998).
Our research focus here is on designing transparent methodologies. Predictive ac-
curacy improvements from the black box operations, if any (Glymour et al., 1996),
should not override the interpretability goals for both models and results that are
critical to the business (e.g., ability to include experts’ insights and opinions). In
fact, modelling individuals’ mobility patterns with nonparametric approaches such
as the kernel method, for example, would not be able to provide the interpretations
and the data summarisation needed for this research. As indicated earlier, various techniques were considered for this research problem before we chose our approach based on the variational Bayesian (VB) method for Gaussian mixture models
(GMMs).
1.5 Thesis Overview
So far, we have demonstrated the application value of this proposed customer consumption behaviour research, in which our primary focus is analysing users' spatial usage behaviour, although we also investigate temporal and high-dimensional features. We emphasise that, to the best of our knowledge, this is the first research which
aims to profile each mobile phone subscriber’s overall spatial usage behaviour auto-
matically, effectively and meaningfully, as well as differentiate general users based
on their actual observed mobility patterns (outlined in Chapter 4 and 5). In essence,
our proposed strategy to this spatial objective involves a two-stage clustering pro-
cess:
• we first cluster/model each individual's spatial usage pattern, from which his/her unique behavioural characteristics are extracted;
• customers are then clustered/segmented based on these extracted features.
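The two-stage process above can be sketched with toy data; the single 'top-location share' feature and the threshold rule used here are illustrative stand-ins for the GMM-based signatures and the clustering algorithms developed in later chapters.

```python
from collections import Counter

# Hypothetical CDR fragments: subscriber -> list of (lat, lon) tower hits.
cdr = {
    "A": [(0, 0)] * 8 + [(1, 1)] * 2,               # mostly one location
    "B": [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],  # highly mobile
    "C": [(5, 5)] * 9 + [(6, 6)],                   # mostly one location
}

# Stage 1: model each individual's spatial pattern and extract a feature
# (here, simply the share of activity at the single most-used location).
def top_location_share(activities):
    counts = Counter(activities)
    return counts.most_common(1)[0][1] / len(activities)

features = {sub: top_location_share(acts) for sub, acts in cdr.items()}

# Stage 2: segment subscribers on the extracted features
# (a simple threshold here; a clustering algorithm in practice).
segments = {sub: ("home-based" if f >= 0.7 else "mobile")
            for sub, f in features.items()}
print(segments)
```

In the thesis, stage 1 fits a mixture model per subscriber and stage 2 clusters the resulting signature vectors; the sketch only conveys the shape of the pipeline.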
As we will discuss later, our data is difficult to model efficiently and effectively using
many existing standard methods, as they are too restrictive and lacking in the flex-
ibility required and/or are not scalable. This is largely the result of our spatial data
being highly heterogeneous and spiky; the term spiky describes data patterns with
large areas of low probability mixed with small areas of high probability. The nature
of our data will be discussed further in Chapter 2.
To undertake this research, a total of three new statistical algorithms have been de-
veloped. They are designed to:
• model one-dimensional circular data i.e., individuals’ temporal usage be-
haviour (Chapter 3),
• model two-dimensional heterogeneous spiky patterns with weak prior infor-
mation i.e., individuals’ spatial usage behaviour (Chapter 4), and
• cluster high dimensional data (Chapter 6) in a way that is more useful for seg-
menting customers with a large number of (behavioural) attributes,
respectively. These algorithms are all based on the variational Bayesian (VB) method
and Gaussian mixture models (GMMs). We briefly introduce these methods and our
algorithms in the chapter overview below.
The main text of the thesis is separated into seven chapters, of which this is the first.
In Chapter 2, we review the literature related to the research problem in detail
to facilitate comprehension of research requirements for this project. This is fol-
lowed by detailed methodology reviews on mixture models and clustering. Con-
sideration of the thorough reviews led to the adoption of the recently popular VB
method together with GMMs as the foundational techniques of the study. That is, the
fast, non-sampling based VB method, an alternative to Markov chain Monte Carlo
(MCMC)-based methods, and GMMs will be utilised for approximating individuals’
consumption behaviour and to facilitate the extraction of selected customer-centric
behavioural characteristics. GMMs are one of the most popular and flexible ap-
proaches for modelling more complicated patterns.
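To illustrate the component-elimination behaviour referred to above, the following sketch uses scikit-learn's `BayesianGaussianMixture`, a standard off-the-shelf VB-GMM implementation (not the thesis' own algorithms), on synthetic two-cluster data: surplus components' weights are driven towards zero, so the effective model complexity is determined automatically.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters of synthetic 'tower locations'.
X = np.vstack([rng.normal((0, 0), 0.3, (150, 2)),
               rng.normal((5, 5), 0.3, (150, 2))])

# Ask for 8 components; VB shrinks the weights of the surplus ones.
vb = BayesianGaussianMixture(n_components=8, random_state=0,
                             weight_concentration_prior=0.01).fit(X)
effective = int(np.sum(vb.weights_ > 0.05))  # components that 'survive'
print("fitted weights:", np.round(vb.weights_, 3))
print("effective components:", effective)
```

The thesis' SEVB extension additionally splits components, so the fit is robust to starting with too few components as well as too many.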
Chapters 3 to 6 are written in the form of papers.
Chapter 3 focuses on modelling individuals’ 24-hour activity patterns (c.f. one-
dimensional circular data). We begin by exploring VB's unique component elimination property in more detail and evaluating its modelling effectiveness and robustness. The empirical results appear promising, and we highlight a potential implication of the VB elimination property which is often overlooked. A new VB-GMM
algorithm is also presented that is suitable for modelling circular data; its effective-
ness is evaluated and demonstrated with Stephens’ Kuiper statistics.
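To see why 24-hour activity times are circular data, note that 23:00 and 01:00 should be treated as close together. A minimal sketch (an illustrative circular mean only, not the circular VB-GMM of this chapter):

```python
import math

# Map each hour onto the unit circle, average the resulting unit vectors,
# and map the mean direction back to an hour of the day.
def circular_mean_hour(hours):
    angles = [2 * math.pi * h / 24 for h in hours]
    s = sum(math.sin(a) for a in angles) / len(angles)
    c = sum(math.cos(a) for a in angles) / len(angles)
    return (math.atan2(s, c) * 24 / (2 * math.pi)) % 24

late_calls = [23, 0, 1]                # activity clustered around midnight
print(sum(late_calls) / 3)             # naive arithmetic mean: 8.0, misleading
print(circular_mean_hour(late_calls))  # circular mean: ~0, i.e., midnight
```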
Chapter 4 and Chapter 5 are the heart of the thesis; they focus primarily on the
subscribers’ spatial usage behaviour. In Chapter 4, a new VB algorithm, called split
and eliminate variational Bayesian (SEVB), is developed. This new algorithm is more
suitable for modelling large numbers of highly heterogeneous spatial patterns as
GMMs with weak prior information. This new algorithm introduces and makes use
of several novel concepts in areas such as component splitting and proposes a new
model evaluation criterion. Empirical results suggest that our SEVB-GMM algorithm
is effective and robust to different initialisation settings including the initial choice of
the number of components in the model, as well as various observation component
initialisation allocations. This chapter also examines the instability of many existing log-likelihood (LL)-based model selection criteria, which makes them unsuitable for this application, whereas our proposed alternative model evaluation measure appears to provide consistent and reliable model selection. Chapter 5
adapts this new algorithm to our real world dataset, and focuses on interpreting the
patterns to gain useful customer insights. It also investigates the stability and differentiability of users' spatial usage behaviour. Empirical results reveal that users' spatial usage behaviour profiles are more stable than those produced by the currently popular approach, which involves the ordered partitioning of customers based on benchmark measures such as aggregated voice call durations.
Chapter 6 develops a new VB-GMM based clustering algorithm capable of finding subspace clusters in high-dimensional data; the key notion of subspace clusters is that not all attributes are relevant to all clusters. This algorithm aims to address the potential cluster quality issue of many current algorithms. Besides demonstrating the portability of VB to many existing data mining algorithms (which make use of, for example, histograms as the density estimation tool), the rationale of this
chapter is that clustering algorithms for high dimensional data should be considered
for customer behaviour segmentation since the number of subscriber behavioural
characteristics that we are interested in and extract from the database are likely to
increase over time. Empirical results suggest that this algorithm is capable of identi-
fying subspace clusters with very low intrinsic dimensionality in settings that would
be considered challenging for many existing clustering algorithms.
The thesis concludes in Chapter 7 where our contributions are summarised. We also
discuss several future research directions that would be valuable from both statistical
and application perspectives.
2 Literature Review
2.1 Introduction
2.1.1 Research problem & methodology requirements
In order to understand the nature of the challenges we faced in this research and
to set out appropriate methodology requirements, it is important to review the lit-
erature on consumption behaviour studies within the telecommunication industry,
human mobility pattern modelling, and the nature of call detail records (CDR).
Existing Consumption Behavioural Understanding & Segmentation
As we briefly mentioned earlier in § 1.3, customers' consumption behaviour is typically evaluated based on discrete, or average and aggregated, measures. In relation to segmentation (i.e., differentiating customers based on certain characteristics), customers are commonly partitioned, for example, into several quantile groups based on the RFM model; RFM stands for recency (i.e., time elapsed since the last activity), (average) frequency and (aggregated) monetary value (over a predefined time period) (Stone et al., 2004, pp.111-134). The telecommunication industry is no exception; however, measures used that somewhat deviate from RFM include:
• whether certain features have been used by a customer,
• the total number of distinct receivers contacted by the customer over a speci-
fied period (Wei and Chiu, 2002),
• fraction of incomplete calls,
• call number ‘birthday’ i.e., the first day that a call was made to the number
(Cortes and Pregibon, 1999),
• the top regions where calls were made to and from, and
• the top cell tower locations where calls were made from (Cortes et al., 2000).
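A minimal sketch of the RFM scoring described above, using hypothetical activity histories; in practice each score is then cut into quantile groups across the customer base.

```python
from datetime import date

# Illustrative data: customer -> list of (activity date, spend).
today = date(2007, 9, 30)
activities = {
    "X": [(date(2007, 9, 25), 12.0), (date(2007, 9, 28), 8.0)],
    "Y": [(date(2007, 6, 1), 30.0)],
}

def rfm(history):
    last = max(d for d, _ in history)
    return {"recency_days": (today - last).days,   # time since last activity
            "frequency": len(history),             # activity count
            "monetary": sum(s for _, s in history)}  # total spend

scores = {cust: rfm(h) for cust, h in activities.items()}
print(scores)
```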
Besides evaluating the churn (i.e., customer defection) likelihood of the customer
(e.g., Wei and Chiu, 2002) (c.f. Appendix A.1), which should not be considered as con-
sumption behaviour, individuals’ usage patterns have not been examined in detail;
papers such as Cortes and Pregibon (1998), Cortes and Pregibon (1999) and Cortes
et al. (2000) from AT&T can be considered the exceptions, though their focus is mostly on mining data streams (which we discuss briefly below and more fully in Appendix B) and identifying fraudulent accounts.
For each phone number, AT&T modelled the voice call distribution with 24 bins (c.f.
24 hours) and the voice call duration distribution with 12 logarithmically spaced
bins. The degree of 'business'-likeness of the phone number was evaluated based on
whether the majority of the calls were made during weekday office hours (exclud-
ing lunch time) and whether they were mainly shorter calls. However, even with
their work on this topic, valuable call detail records (CDR) are still not being utilised fully for understanding individuals' actual behaviour; we can see this overlooked potential when considering the spatial usage behaviour illustrated in § 1.3
and thus the proposed profiling/segmentation approach in this research. Note that
analysing CDR has already been shown to be useful: for example, in promoting services to groups of customers via association and sequential patterns (Han and Kamber, 2006, p.653), identifying the best time to contact customers (Berry and Linoff, 2000, p.394), and understanding the relationship among pricing, voice call counts, average voice call duration, household demographics and revenue at a higher-than-individual level (Train et al., 1987; Heitfield and Levy, 2001).
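The AT&T-style summaries described above can be sketched as follows (our own reconstruction with illustrative bin edges, not their code): call start times are histogrammed into 24 hourly bins, and durations into 12 logarithmically spaced bins.

```python
import math

def hour_bin(hour):
    return int(hour) % 24

def log_duration_bin(seconds, n_bins=12, max_seconds=3600 * 4):
    # Bin edges grow geometrically from 1 second to max_seconds
    # (the cap of 4 hours is an assumption for illustration).
    if seconds <= 1:
        return 0
    frac = math.log(seconds) / math.log(max_seconds)
    return min(n_bins - 1, int(frac * n_bins))

calls = [(9, 45), (10, 200), (14, 30), (22, 1800)]  # (hour, duration in s)
hour_hist = [0] * 24
dur_hist = [0] * 12
for h, d in calls:
    hour_hist[hour_bin(h)] += 1
    dur_hist[log_duration_bin(d)] += 1
print(hour_hist, dur_hist)
```

Mostly short calls in weekday office-hour bins would then count towards the 'business'-likeness score.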
Businesses often impose an 'actionable' requirement (Wedel and Kamakura, 1998, pp.328-329) of having small and balanced customer groups (Ghosh and Strehl, 2004); this single-view segmentation is often achieved by partitioning the customer base based only on predefined attributes derived from so-called experts' opinions (Berkhin, 2006). Such a process completely ignores the reality of heterogeneous customer behaviour (Smith et al., 2000). This commonly used customer segmentation approach,
which is in fact contrary to the idea of ‘actionable’ one-to-one/relationship market-
ing, cannot explain behaviour in detail and cannot be used to comprehend the needs
of the customer; additionally the resulting segments may not be easy to relate to and
action on for the market specialists. In fact, 50 homogeneous customer groups were believed to be the 'optimal' solution when segmenting car insurance renewal behaviour (Smith et al., 2000), whereas a total of 66 marketing segments are identified in Claritas's segmentation system, PRIZM, which is based on market research into demographic traits, lifestyle preferences and consumer behaviour in the USA (Claritas Inc., 2008).
That is, rather than analysing the entire customer behaviour as a 'whole', which would result in a large number of customer groups and is thus not necessarily useful or actionable for businesses given its highly heterogeneous nature, it would be preferable to have “different segments for different purposes” (Yankelovich and Meer, 2006, p.125) and not to predetermine the number of segments required for
each purpose. This implies that customers should be able to be classified into several different groups at the same time, such that detailed understanding of different aspects of customer behaviour can be preserved; the algorithms utilised need to be able
to automatically determine the complexity of the models. As indicated in § 1.3, our
proposed strategy may provide a complementary/innovative view of the customers
based on their spatial (and temporal) usage behaviour; its feasibility and potential
merit (e.g., differentiability and stability) will need to be measured against the most
common existing approach, for example, the most popular method of naively parti-
tioning the customers based on the aggregated call volumes.
Furthermore, the number of subscriber behavioural characteristics (what we call 'signatures') that we are interested in and that are extracted from the database is likely to increase over time, even when analyses have the same purpose (e.g., analysing spatial usage behaviour). This means that the chosen scalable algorithm
utilised for segmenting customers also needs to be suitable for high dimensional
data. This is a problematic issue and requires thorough investigation; detailed ex-
planations and reviews are presented in § 2.3.4. Note that Appendix C reviews a re-
lated subject: clustering algorithms specifically for time series or data stream. However, it appears preferable, at least given the nature of this research, to cluster time series in the typical way, i.e., capturing each series' longitudinal characteristics and then clustering the series based on these extracted characteristics with an
algorithm suitable for high dimensional data. Note that the investigation of longitu-
dinal aspects of subscriber behavioural changes is outside the scope of this research
because our data is limited to only 17 months of records.
Spatial Usage Behaviour (or Human Mobility Patterns)
One of the most important, unique and ignored features of CDR is its spatial in-
formation (Han and Kamber, 2006, p.653). Engineering/network focused commer-
cial warehouses such as CDRInsight (LGR Telecommunications, 2008a) and CDRLive
(LGR Telecommunications, 2008b) are perhaps the stand-out exceptions. While their primary focus is mostly on monitoring call drop-outs (with respect to the cell towers) (Intel Corporation, 2002; Ajala, 2005, 2006), they are able to identify subscribers geographically (i.e., subscribers who have utilised certain cell towers) for marketing/strategy purposes, and to display each subscriber's usage history efficiently.
On the other hand, from the more ‘scientific’ viewpoint, individual human mobility
patterns (or spatial usage behaviour) have recently been studied (Gonzalez et al.,
2008) and results suggest that human trajectories have a high degree of spatial (as well as temporal) regularity, as we observed/illustrated in § 1.3. That is, individuals
typically spend the majority of their time in their most highly preferred locations,
and occasionally visit other ‘isolated’ places varying widely in the range of distances
outside of their usual activity areas (Gonzalez et al., 2008).
Statistically, this implies that users’ repeated spatial usage behaviour is not only
heterogeneous (both between and within users), but also spiky; the term spiky de-
scribes data patterns with large areas of low probability mixed with small areas of
high probability. These important features of human mobility patterns have been
largely ignored, as is evident when we consider that previous modelling approaches were based on Lévy flights and Markov models (e.g., Brockmann et al., 2006). To further complicate the problem, CDR typically only consist of
locations with respect to the cell tower where the activity was initialised from i.e., the
exact location of the subscriber is not known. This lack of jitter creates an additional
data singularity issue; jitter refers to some small jumpy movements. Moreover, the
‘home’ location for each subscriber is different and is not known in advance. Over-
all, these unique features of individuals’ spatial usage behaviour can be problematic
when applying standard algorithms (c.f. Nurmi and Koolwaaij, 2006) and trying to
interpret/segment them meaningfully. We note that there appears to be no previous
attempt to model each individual’s overall mobility pattern, and differentiate their
spatial usage behaviour in the way that we propose here.
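One common preprocessing workaround for the singularity caused by exactly repeated tower coordinates is to add a small artificial jitter before model fitting; a sketch (illustrative jitter scale, not the thesis' exact treatment of the issue):

```python
import random

random.seed(0)

def jitter(points, scale=1e-3):
    # Perturb each coordinate by a tiny uniform offset so that repeated
    # tower locations no longer coincide exactly (avoiding degenerate,
    # zero-variance covariance estimates in a fitted mixture component).
    return [(x + random.uniform(-scale, scale),
             y + random.uniform(-scale, scale)) for x, y in points]

tower_hits = [(151.21, -33.87)] * 5   # five activities, one tower
jittered = jitter(tower_hits)
print(len(set(tower_hits)), len(set(jittered)))  # 1 distinct -> 5 distinct
```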
Call Detail Records (CDR) & Data Stream
The distinct features of these research data are the massive volume of observations
and the sequential characteristics. This emerging form of data is known as data
stream (or stream data) and it is also not necessarily ordered in the desired way. For
example, telecommunication transactions are recorded in a sequential manner, but a
transaction is not recorded until it has ended. With today’s technology, one can make
use of several service connections at the same time, but the data sequence recorded
may be ordered by transaction completion time; for example, an Internet connection commenced prior to making a phone call, but ended only after the phone call completes, would appear after the phone call in the record sequence.
Data stream was first defined in 1998 to refer to this type of data that is often real time
and generated by a continuous process, growing rapidly at an ‘unlimited’ rate (Hen-
zinger et al., 1998; Muthukrishnan, 2005). Much modern data shares these unique
characteristics, for example records on banking, credit card, shopping and financial
market transactions, Internet clickstream records, weather measurements, and sen-
sor monitoring observations (Babcock et al., 2002a).
Analysing such data (or working under such a data environment) usually requires
one to:
• process it in a single pass (i.e., only have one look at each data record); and
• approximate it with acceptable levels of accuracy within strict time and space requirements.
It is also often necessary to be adaptive to the non-stationary nature of the data as
it may evolve over time (Aggarwal, 2007b). This is because data stream may not be
stored entirely on a disk or memory and it may not be possible to access it randomly
(Babcock et al., 2002a). More traditional data analytic approaches focus on examining the same data throughout the study, learning from data held within bounded memory; they also generally require multiple scans of the data. In contrast, a data stream is typically updated continuously throughout the analysis, meaning that having advance knowledge of the input data size, for example, is generally not possible and not particularly helpful (Babcock et al., 2002a). In other words,
traditional tactics for analysing data, such as finding the median value of a dataset (which requires first knowing the size of the input data and then ordering the data before the median can be determined, a computationally heavy process), are no longer practical or efficient. This is an exciting aspect of the research problem.
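The single-pass constraint above can be illustrated with reservoir sampling, a classical streaming technique that maintains a fixed-size uniform sample of a stream of unknown length, from which quantiles such as the median can then be approximated (an illustrative sketch, not one of the thesis' algorithms):

```python
import random

def reservoir_sample(stream, k, seed=0):
    # One pass; each stream item ends up in the sample with probability
    # k / n, without ever knowing the total length n in advance.
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    return sample

stream = iter(range(100000))          # total size unknown to the algorithm
sample = reservoir_sample(stream, k=100)
approx_median = sorted(sample)[50]    # an estimate of the true median (49999.5)
print(approx_median)
```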
Most of the research in this field comes from research laboratories within companies
such as AT&T, Bell, Google, IBM, and Microsoft that have a management system or
database as the primary focus (Muthukrishnan, 2005). However, after detailed re-
viewing their research which can be found in Appendix B, it appears that they are
typically based on the use of histograms (e.g., Muthukrishnan and Strauss, 2004)
(of which many are based on the use of wavelet transformation (c.f. Donoho et al.,
1996)), the concept of averages (e.g., Aggarwal et al., 2003), or they are in a format
such that distribution and/or seasonality/periodic information cannot be easily ob-
tained (Littau and Boley, 2006b), for example. While the above statement overlooks
what they have achieved with these approaches, it does imply that their analytical
techniques do not appear to be appropriate for extracting the versatile behaviour characteristics we seek in this research.
That is, despite the fact that CDR is data stream, this research will not proceed in
that direction i.e., CDR will be treated as typical data; this should be acceptable since
CDR for analytical purposes is typically updated periodically rather than in real time
(Chaudhuri et al., 2001; Ganti et al., 2002). Though, it is worthwhile pointing out
that purely from the viewpoint of density approximation i.e., not interpreting the
patterns, the following algorithms appear useful: one-dimensional histogram (Guha
16 Chapter 2. Literature Review
and Harb, 2006), two-dimensional histogram (without assuming variables are inde-
pendent from each other) (Thaper et al., 2002), and kernel estimation (Heinz and
Seeger, 2008); many recent papers (e.g., Bhaduri et al., 2007; Parthasarathy et al.,
2007; Feldman et al., 2008) also focus on analysing data stream in the parallel/
distributed fashion.
2.1.2 Research methodology overview
It is clear by now that this research needs to utilise scalable, space efficient and trans-
parent algorithms to approximate the patterns in one- (i.e., temporal) and two- (i.e.,
spatial) dimensional spaces, and to differentiate the patterns in high dimensional
space without predetermining the model complexity. Analysis of this kind can be
more formally described as combinatorial data analysis (CDA); the focus of CDA is
the sensible arrangement of objects for which useful data are available (Arabie and
Hubert, 1996). Mixture modelling and clustering are the two main approaches used
for this (Arabie and Hubert, 1996). Mixture models aim to model an unknown den-
sity as a combination of typically parametric functions, although the main difficulty
encountered is that the parametric form for the density is not known (Scott and Sain,
2005). Clustering is perhaps the most commonly applied form of CDA (Arabie and Hubert,
1996, p.8) and it aims to identify homogeneous groups of objects in a nonparamet-
ric manner. Mixture models are reviewed in the following section while clustering is
reviewed in § 2.3.
While clustering algorithms (e.g., Ester et al., 1996) have been applied before to
modelling some aspects of individuals' spatial usage behaviour (e.g., Nurmi and
Koolwaaij, 2006), such an approach appears to lack the interpretability
needed to profile each subscriber; although clustering is still preferable to his-
tograms from the viewpoint of density approximation due to the heterogeneous and
spiky nature of human mobility patterns that we discussed above. Mixture models,
on the other hand, appear to be more suitable for this particular task; we shall have
to evaluate their effectiveness against the clustering approach. Both mixture mod-
els and clustering have been considered before for use in customer segmentation
(Wedel and Kamakura, 1998, Chapters 5 and 6), though this was without the assump-
tion of the data being high dimensional.
2.2 Mixture Models
2.2.1 Introduction
Patterns in data can be statistically represented with a convex combination of den-
sity distributions resulting in what is known as mixture models (Newcomb, 1886;
Pearson, 1894; Everitt and Hand, 1981, pp.1-2). They are flexible and attractive in that
they do not assume the overall shape of the distribution. At the same time, they are
known to be suitable for representing any distribution (c.f. Marron and Wand, 1992;
Priebe, 1994) as in the case of nonparametric approaches (Scott and Sain, 2005). This
is despite the fact that they often model subpopulations of the observed data para-
metrically (e.g., using a Gaussian distribution) (c.f. § 2.2.2).
Mixture models have been used extensively in various applications (McLachlan and
Peel, 2000) including customer segmentation (Wedel and Kamakura, 1998, Chapter
6); their usefulness in the context of clustering and discriminant analysis is a result
of the fact that they are able to represent a particular subpopulation in the data as a
mixture component (Banfield and Raftery, 1993; Fraley and Raftery, 2002). They can
also be useful for detecting outliers (Aitkin and Wilson, 1980; Wang et al., 1997a), an
important task in data mining (Madigan and Ridgeway, 2003).
Mixture models are now often considered as incomplete/missing data/variable
models (Everitt and Hand, 1981, pp.5-7) since the subpopulation identifier of the
observations is generally not known in practice (Marin and Robert, 2007, Chapter 6).
In fact, even though mixture models have a long history, fitting them was problem-
atic until the introduction of the expectation-maximisation (EM) algorithm (Dempster
et al., 1977), which framed the models as a missing data problem (Scott and Sain,
2005). The objective of mixture models is to ‘unmix’ the distributions (Wedel and
Kamakura, 1998, p.73), and hence to estimate how many groups there are and the
distribution settings for each group. They are often used in a way such that recov-
ering the missing actual group membership information is not of interest (Marin
and Robert, 2007, p.149), and one of their advantages is enabling statistical inference
(Wedel and Kamakura, 1998, p.76).
Mixtures of a finite number of parametric distributions have been shown to provide a
computationally convenient and flexible approach for modelling more complicated
distributions (McLachlan and Peel, 2000, pp.1-4) making them suitable for repre-
senting individuals’ heterogeneous behavioural patterns. The overall finite paramet-
ric mixture density with k (∈ N) components for an observation x = (x1, ..., xn) can
be expressed as:
f(x) = Σ_{j=1}^{k} w_j f_j(x|θ_j),

where w_j is the mixing weight associated with the jth component, the {w_j} must
satisfy both 0 ≤ w_j and Σ_{j=1}^{k} w_j = 1, and f_j(.) denotes a parametric
density distribution with θ_j representing the unknown component parameters (Everitt
and Hand, 1981, pp.4-7).
2.2.2 Gaussian mixture model (GMM)
One of the most popular classes of mixture models is the Gaussian mixture model
(GMM); these have been applied extensively (McLachlan and Peel, 2000), mainly because
the theory for Gaussian distributions is well understood. The overall density distribution
of a GMM can be expressed as:
f(x) = Σ_{j=1}^{k} w_j N(x; μ_j, T_j^{-1}),

where μ_j and T_j^{-1} represent the mean and (co)variance parameters of the correspond-
ing underlying Gaussian distribution N(.) (Everitt and Hand, 1981, p.25). From the
viewpoint of clustering, T_j^{-1} controls the geometric features of the cluster, i.e., its
shape, volume and orientation (Fraley and Raftery, 2002).
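For illustration, the overall GMM density above can be evaluated directly; the following Python sketch uses arbitrary illustrative weights, means and variances (they are not drawn from this thesis or the cited works):

```python
import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_density(x, weights, means, variances):
    """Overall mixture density f(x) = sum_j w_j N(x; mu_j, var_j)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Illustrative two-component mixture; the weights must sum to one.
weights, means, variances = [0.3, 0.7], [0.0, 5.0], [1.0, 2.0]
density_at_zero = gmm_density(0.0, weights, means, variances)
```

In the multivariate case used later in the thesis the component densities would be multivariate Gaussians with precision matrices T_j, but the convex combination has exactly the same form.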
The main problem here, and more generally for mixture models, is the estimation of
the parameters k, w, and θ = (μ, T^{-1}). However, parameter formulae typically can-
not be written down explicitly (Titterington et al., 1985, p.ix), and the value of k, de-
spite its importance (Titterington et al., 1985, p.148), is usually not known in advance
(McLachlan and Peel, 2000, p.4); the challenges generally increase with increas-
ing dimensionality of the data (Scott and Sain, 2005; Jain and Dubes, 1988, p.118).
This means that the solutions generally cannot be obtained analytically and many
estimation techniques have been considered (c.f. Titterington et al., 1985, Chapter
4). We briefly review some below, and we note that these methods extend beyond
mixture model scenarios.
2.2.3 Classical/frequentist techniques (for GMMs)
In the past the main method for the parameter estimation was the method of mo-
ments based on Pearson (1894). However, the introduction of the EM algorithm
(Dempster et al., 1977), which simplified the problem considerably by instead inter-
preting the observations as incomplete data (McLachlan and Peel, 2000, p.4), com-
bined with computational resources becoming more widely available, has made
maximum likelihood estimation (MLE) popular. MLE estimates the parameter values
best suited to the data by maximising the likelihood; for mixtures, these estimates are
not available in closed form. The likelihood of finite mixtures can also be maximised
via numerical optimisation routines such as the Newton-Raphson (NR) method
(McHugh, 1956).
Expectation-Maximisation (EM) Algorithm
The EM algorithm starts with initial guesses for the parameters of interest; making
these guesses may involve some sort of clustering of the data (Scott and Sain, 2005). It is
iterative in nature, and there are two steps in each iteration for a given fixed number
of components k (Dempster et al., 1977; McLachlan and Krishnan, 2008). They are:
• Expectation (E-step): This is the first step at each iteration; it calculates the
expected value of the complete data log-likelihood (i.e., replaces the missing
values with their conditional expectation) given the observed data and the pro-
visional parameter estimates;
• Maximisation (M-step): This is the second step at each iteration; it maximises
the expectation using both the observed data and the predictions found in the
previous step.
The algorithm iterates between these two steps until a converged solution is reached.
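The two steps above can be sketched for a univariate two-component GMM as follows; this is a minimal illustration with fixed k = 2, simulated data and crude initialisation, not the algorithm as configured in the cited works:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm(data, n_iter=50):
    """EM for a univariate 2-component GMM (k fixed at 2)."""
    # Crude initialisation of weights, means and variances.
    w, mu, var = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. the conditional expectations of
        # the missing component memberships under current parameters.
        resp = []
        for x in data:
            p = [w[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate parameters by maximising the expected
        # complete-data log-likelihood.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2
                         for r, x in zip(resp, data)) / nj
            var[j] = max(var[j], 1e-6)  # guard against degenerate variance
    return w, mu, var

random.seed(1)
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(6.0, 1.0) for _ in range(200)])
w, mu, var = em_gmm(data)
```

On this well-separated simulated mixture the estimated means settle near the true values of 0 and 6, with weights near 0.5 each; as discussed below, behaviour is far less benign with poor initial values or overlapping components.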
Theoretically (Boyles, 1983; Wu, 1983), though not always in practice (Fraley and
Raftery, 2002), the solution found by EM should be at least a local maximum, as the like-
lihood is increased at each iteration, i.e., it converges monotonically. This means
that, in order for EM to avoid finding a suboptimal solution, it is essential to have
good initial parameter values (Biernacki et al., 2003; Karlis and Xekalaki, 2003) and
stopping criteria (Wedel and Kamakura, 1998, p.87). Thus, many strategies have
been proposed for better initialisation (e.g., McLachlan and Peel, 2000, pp.54-60);
and many modifications to the standard EM algorithm have been made which at-
tempt to guide the algorithms towards the global maximum (e.g., McLachlan and
Krishnan, 2008). For example, the stochastic EM (SEM) algorithm (e.g., Celeux and
Diebolt, 1985; Celeux et al., 1996), and split and merge EM (SMEM) algorithm (Ueda
et al., 2000).
Scalability-wise, EM may be slower than the direct numerical optimisation meth-
ods mentioned above (c.f. Titterington et al., 1985, pp.90-91), since the convergence
is quadratic in the number of parameters (Wedel and Kamakura, 1998, p.87). Thus,
it may be a good idea to adopt a hybrid approach, i.e., use EM initially for its good
global convergence property but then switch to a Newton-type method for its rapid
local convergence (McLachlan and Peel, 2000, pp.70-75). Note that there have also
been many recent developments (e.g., McLachlan and Peel, 2000, Chapter 12) mak-
ing EM more suitable for large datasets such as ours.
However, beyond all the previously described limitations (e.g., Wedel and Kamakura,
1998, pp.87-92), the key weakness of EM for mixture modelling is the need to prede-
fine the number of components k (Cheung, 2005), the central issue in mixture mod-
elling; this is also an issue for many clustering algorithms (to be reviewed in § 2.3).
Consequently, there has been a great amount of research into better determining
the choice of k (Scott and Sain, 2005). For example, this can be done more tradition-
ally by using informal graphical techniques or formal hypothesis testing techniques
(Everitt and Hand, 1981, pp.22,30-57,108-118).
Today, selecting models with respect to k is often achieved by comparing solu-
tions utilizing complexity criteria (McLachlan and Peel, 2000, Chapter 6) such as
Akaike’s information criterion (AIC) (Akaike, 1974), Bayesian information criterion
(BIC) (Schwarz, 1978), Laplace empirical criterion (LEC), and minimum message
length (MML) (Wallace and Dowe, 1994; Wallace and Freeman, 1987; Wallace and
Dowe, 2000). We note that numerical experiments have shown that measures such
as BIC and MML, for example, lead to similar results (Roberts et al., 1998; Biernacki
et al., 2000). Sometimes computationally expensive approaches (Corduneanu and
Bishop, 2001) such as the bootstrap (Efron and Tibshirani, 1993), cross-validation
which is also wasteful of valuable data (Smyth, 2000), or statistical tests (Har-even
and Brailovsky, 1995; Polymenis and Titterington, 1998) may also be used.
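The two most common of these criteria can be computed directly from a maximised log-likelihood; the sketch below uses the standard definitions AIC = -2ℓ + 2p and BIC = -2ℓ + p log n, applied to hypothetical (purely illustrative) log-likelihood values for univariate GMMs with k = 1, ..., 4:

```python
import math

def aic(loglik, n_params):
    """Akaike's information criterion; smaller is better."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; penalises complexity more
    heavily than AIC once log(n_obs) exceeds 2."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def gmm_n_params(k):
    """Free parameters of a univariate GMM: k means, k variances
    and k - 1 independent mixing weights."""
    return 3 * k - 1

# Hypothetical maximised log-likelihoods for k = 1..4 on n = 500 points:
# the fit improves with k, but the improvement tails off after k = 2.
logliks = {1: -1450.0, 2: -1210.0, 3: -1205.0, 4: -1202.0}
n = 500
best_k = min(logliks, key=lambda k: bic(logliks[k], gmm_n_params(k), n))
```

Here BIC selects k = 2 because the small likelihood gains for k = 3 and 4 do not offset the p log n penalty, illustrating how such criteria trade fit against complexity.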
Nonetheless, the central issue with EM, and more generally with maximum likelihood
(ML) approaches, is that they tend to favour models with ever-increasing complexity
(Svensen and Bishop, 2005; Archambeau and Verleysen, 2007; McLachlan and Peel,
2000, p.41). That is, having more components, i.e., larger k, nearly always translates
to a 'better' model within the ML framework, which is generally not appropriate (Ya-
mazaki and Watanabe, 2003). The model selection criteria mentioned above do aim
to penalise overly complicated models and find the balance between data fitting and
model complexity. However, these measures can be misleading if the 'training'
sample size is small (Watanabe et al., 2002); AIC is known to choose models with too
many components, while certain regularity conditions for BIC do not hold in the
mixture model case despite BIC's increasing popularity (Aitkin and Rubin, 1985;
Biernacki et al., 2000; Scott and Sain, 2005).
In other words, besides the need to predefine k, EM has several inherent problems
such as sensitivity to the starting parameter values, possible singular/suboptimal
solutions (Jain and Dubes, 1988, p.118), and unsuitability for estimating very
large numbers of components (Fraley and Raftery, 1998). With few exceptions
(e.g., Bradley et al., 2000; Ordonez and Omiecinski, 2005), like most fuzzy-, nearest
neighbours-, kernel-, optimisation-, or neural network-based clustering techniques
(c.f. Jain et al., 1999; Xu and Wunsch II, 2005), EM is not particularly well suited for
clustering large datasets. Nonetheless, it has been used as a clustering algorithm
(e.g., Wallace and Dowe, 1994; Fraley and Raftery, 1998, 1999; McLachlan et al., 1999;
Cadez et al., 2001), and can provide pattern interpretations which otherwise can-
not be obtained with those model-free clustering approaches (c.f. § 2.3). Finally, on
the theoretical front, mixture models with the k covariance matrices assumed equal
are related to a well-known clustering method (Symons, 1981), while Celeux and Gov-
aert (1992) and Banfield and Raftery (1993) have shown that the classification EM
(CEM) algorithm under a spherical Gaussian mixture is the 'same' as the k-means
(KM) clustering algorithm. A more effective modelling approach than EM is needed.
2.2.4 Bayesian techniques (for GMMs)
Bayesian Statistics Overview
A more ‘recent’ advance in mixture modelling was the use of Bayesian techniques
(Binder, 1978; Symons, 1981; Gilks et al., 1989). Bayesian data analysis (Gelman et al.,
2004; Lee, 2004; Ghosh et al., 2006b; Marin and Robert, 2007) aims to make infer-
ences about data using probability models for quantities we observe and for quan-
tities about which we wish to learn (Gelman et al., 2004, p.1). It has now been used
widely in various real life applications (Ridgeway and Madigan, 2003; Gelman et al.,
2004) including many financial, economic (e.g., Rachev et al., 2008) and marketing
applications (e.g., Rossi et al., 2005). It differs from classical/frequentist statistics in
its use of probability for naturally quantifying model uncertainty at all levels of the
modelling process, and provides a natural framework for producing more reliable
parameter estimates (Andrieu et al., 2003); this includes selecting an appropriate
number of components k in mixture modelling either by fully Bayesian models or
comparing model marginal likelihood (Diebolt and Robert, 1994; Richardson and
Green, 1997). Bayesian statistics makes use of prior distributions on the model pa-
rameters to express the uncertainty present even before seeing the data (Chatfield,
1995). In essence, it is like mixing several models rather than aiming to obtain one
single best model (Chatfield, 1995).
The attractiveness of the Bayesian approach comes from the transparent inclusion of
prior knowledge, a straightforward probabilistic interpretation and hence communi-
cation of parameter estimates, and greater flexibility in model specification (Ridge-
way and Madigan, 2003; Gelman et al., 2004, pp.3-4). Unlike classical/frequentist
statistics, Bayesian approaches favour a simpler model (Jefferys and Berger, 1992;
MacKay, 1995); they are less likely to over-fit the data (Beal and Ghahramani, 2002).
Additionally, they do not rely on the sometimes problematic p-values (Schervish, 1996;
Sterne et al., 2001; Hubbard and Lindsay, 2008). Consequently, Bayesian inference is
also now widely established as one of the principal foundations for machine learning
(e.g., Winn and Bishop, 2005; Bishop, 2006).
A Bayesian statistical analysis typically begins with a full probability model, and then
uses Bayes’ theorem (Bayes, 1763) to learn or to compute the posterior distribution
of the parameters of interest after seeing the data. The Bayesian posterior distribution
p(θ|x) of the model, for the parameter of interest θ given data x, can be expressed as:
posterior ∝ prior× likelihood
p(θ|x) ∝ p(θ)× p(x|θ),
with p(θ) representing the prior knowledge of θ, and p(x|θ) representing the likeli-
hood inference of θ drawn from x (Gelman et al., 2004, pp.7-8). Integration is re-
quired for obtaining the expectation of h(θ):
E(h(θ)|x) = ∫ h(θ) p(θ|x) dθ,
and typically researchers work with the logarithm of these quantities for conve-
nience (Madigan and Ridgeway, 2003). However, such operations are known to be
generally intractable, and hence obtaining exact inference is rarely possible (Madi-
gan and Ridgeway, 2003). Thus, many approximation techniques have been devel-
oped. Two fundamental techniques are described below.
Monte Carlo Sampling Sampling techniques enjoy wide applicability and can be
powerful in evaluating multi-dimensional integrals and representing posterior dis-
tributions (Madigan and Ridgeway, 2003). Monte Carlo integration is one
computational approach that can approximate the solutions by sampling from the
posterior distribution iteratively (c.f. m times):

lim_{m→∞} (1/m) Σ_{i=1}^{m} h(θ_i) = ∫ h(θ) p(θ|x) dθ.
However, the convergence of the approximation of this method can be slow (Madi-
gan and Ridgeway, 2003).
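The limit above can be sketched directly; in this illustration the posterior is taken, for the sake of a known answer, to be N(2, 1), so that E(θ²|x) = μ² + σ² = 5:

```python
import random

def mc_expectation(h, sampler, m):
    """Monte Carlo estimate of E[h(theta) | x] from m posterior draws."""
    return sum(h(sampler()) for _ in range(m)) / m

# Illustration: pretend the posterior is N(2, 1) and estimate E[theta^2],
# whose exact value is mu^2 + sigma^2 = 5.
random.seed(0)
estimate = mc_expectation(lambda t: t * t,
                          lambda: random.gauss(2.0, 1.0),
                          100000)
```

The error of such an estimate shrinks only at the rate 1/sqrt(m), which is one reason the convergence can be slow in practice.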
Importance Sampling (IS) Importance sampling is a useful technique that may
assist in obtaining a sampling distribution that converges quickly to the required
integral. It is often utilised in Monte Carlo-based approaches for estimating a tar-
get distribution p(θ|x) which is difficult to compute with samples generated from an
alternative and more amenable distribution g(θ), known as the importance distribu-
tion (Madigan and Ridgeway, 2003). Importance sampling is based on the identities
below, where the θ_i are drawn independently and identically distributed (i.i.d.) from g(θ):

∫ h(θ) p(θ|x) dθ = ∫ h(θ) [p(θ|x)/g(θ)] g(θ) dθ = lim_{m→∞} (1/m) Σ_{i=1}^{m} ω_i h(θ_i),

which in self-normalised form gives

E(h(θ)|x) ≈ Σ_{i=1}^{m} ω_i h(θ_i) / Σ_{i=1}^{m} ω_i,   with   ω_i = p(θ_i|x)/g(θ_i),

where the ω_i are the importance sampling weights; the self-normalised form applies
when p(θ|x) is known only up to a normalising constant.
However, importance sampling can be difficult to implement when the target distri-
bution is complex, because it can be difficult to find a suitable g(θ) to use (Andrieu
et al., 2003).
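The self-normalised estimator can be sketched as follows; the target here is taken (illustratively) to be proportional to an N(2, 1) density, with a deliberately wide N(0, 9) importance distribution:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def importance_estimate(h, target_unnorm, g_sample, g_pdf, m):
    """Self-normalised importance sampling estimate of E[h(theta) | x];
    the target density need only be known up to a constant."""
    draws = [g_sample() for _ in range(m)]
    weights = [target_unnorm(t) / g_pdf(t) for t in draws]
    return sum(w * h(t) for w, t in zip(weights, draws)) / sum(weights)

# Illustration: target proportional to N(2, 1); importance distribution
# g = N(0, 9), chosen wide enough to cover the target's support well.
random.seed(0)
est = importance_estimate(lambda t: t,
                          lambda t: math.exp(-(t - 2.0) ** 2 / 2.0),
                          lambda: random.gauss(0.0, 3.0),
                          lambda t: normal_pdf(t, 0.0, 9.0),
                          50000)
```

Had g been chosen too narrow, or centred far from the target, a few enormous weights would dominate the sum; this is the practical difficulty of finding a suitable g(θ) noted above.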
Markov Chain Monte Carlo (MCMC) Methods
While there exist many other types of Monte Carlo-based approaches (e.g., Fearn-
head, 2008), Markov chain Monte Carlo (MCMC) methods (Geyer, 1992; Tierney,
1994; Besag et al., 1995; Gilks et al., 1998) are generally considered as standard, flex-
ible, and the most widely used Bayesian methods for approximating these incalcu-
lable distributions (Andrieu et al., 2003; Ridgeway and Madigan, 2003; Balakrishnan
and Madigan, 2006). Notably, Bayesian inference for mixture models was not cor-
rectly treated until the introduction of MCMC algorithms in the early 1990s (Marin
and Robert, 2007, p.147), or it required simplification for approximation (Everitt and
Hand, 1981, pp.12-13).
Unlike standard Monte Carlo-based methods that create independent sample draws,
MCMC methods build a Markov chain, a sequence of dependent draws whose
stationary distribution is the target p(θ|x). The Metropolis-Hastings (MH) algorithm
(Metropolis et al., 1953; Hastings, 1970) is the most popular MCMC method. A pop-
ular MH kernel is the Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith,
1990), which is somewhat related to the EM algorithm (Andrieu et al., 2003). The
Gibbs sampler forms the basis for software packages such as BUGS (Gilks et al.,
1994), and is just an example of an ‘efficient’ sampler that aims to make sensible
moves based on the current distributional knowledge (Andrieu et al., 2003).
However, while the MH algorithm is simple, its success or failure of-
ten depends on the careful design of the proposal distribution (Andrieu et al., 2003);
and it is typically not suitable for high dimensions (Mengersen and Tweedie, 1996),
i.e., it may not be flexible enough for the task of modelling individuals' heteroge-
neous mobility patterns.
Note that it is also possible to combine several different samplers (Tierney, 1994). A
combination of ‘global’ and ‘local’ proposals, for example, can be a useful approach
when the target distribution has many narrow peaks (Andrieu et al., 2003).
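A minimal random-walk MH sketch makes the accept/reject mechanism concrete; the N(2, 1) target and step size below are purely illustrative, and the symmetric Gaussian proposal means the Hastings ratio reduces to a ratio of target densities:

```python
import math
import random

def metropolis(log_target, init, step, n_samples, burn_in=1000):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian
    proposal; accept with probability min(1, p(theta')/p(theta))."""
    theta, samples = init, []
    for i in range(n_samples + burn_in):
        proposal = random.gauss(theta, step)
        log_ratio = log_target(proposal) - log_target(theta)
        if random.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal  # accept the move
        if i >= burn_in:
            samples.append(theta)  # dependent draws forming the chain
    return samples

# Illustration: an unnormalised N(2, 1) target (constants may be dropped
# since only ratios of the target enter the acceptance step).
random.seed(0)
draws = metropolis(lambda t: -(t - 2.0) ** 2 / 2.0,
                   init=0.0, step=1.0, n_samples=20000)
chain_mean = sum(draws) / len(draws)
```

Even in this one-dimensional toy case, the dependence between successive draws means many more iterations are needed than with independent sampling, which foreshadows the scalability concerns discussed below.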
Reversible Jump MCMC (RJMCMC) & Related Methods
The Bayes factor (BF) (Kass and Raftery, 1995) is a standard measure for comparing
Bayesian models and can therefore be used for selecting a suitable number of com-
ponents k; this selection is particularly critical for modelling patterns which are het-
erogeneous. However, BF approximations can be computationally demanding (Han
and Carlin, 2001). Alternatively, one can:
• utilise other measures (e.g., Mengersen and Robert, 1994; Raftery, 1996; Roeder
and Wasserman, 1997) such as the recently proposed deviance information
criteria (DIC) (Spiegelhalter et al., 2002; Celeux et al., 2006) which can be
computed more straightforwardly (McGrory and Titterington, 2007) for model
comparison; this strategy is, of course, ad-hoc since models with various k
need to be obtained first;
• adopt the nonparametric Dirichlet process strategy, which involves mod-
elling the Gaussian parameters as coming from a Dirichlet process (Escobar
and West, 1995); this approach, however, is not always recommended: Roeder
and Wasserman (1997) argue that choosing the number of components rather
than modelling it using a Dirichlet process is preferable, on the basis that this
provides more direct control over the number of components;
• circumvent the problem by having fixed k in the model but allowing some com-
ponents to be empty as done by Gilks; this strategy, while theoretically sound,
has been found to be problematic in practice particularly for modelling spiky
patterns (c.f. discussions in Richardson and Green, 1997); or
• utilise a reversible jump sampler (Green, 1995) capable of performing model
comparison between models of varying dimensions.
The practical advantage of adopting the reversible jump approach for mixture mod-
elling within the MCMC scheme is being able to automatically select k by a fully
Bayesian method and estimate the parameter values simultaneously (Richardson
and Green, 1997); this approach is known as the reversible jump MCMC (RJMCMC)
method. The reversible jump sampler (Green, 1995) allows algorithms to occasion-
ally propose ‘jumps’ for exploring potential models; models may be rejected to en-
sure the desired stationary distribution is retained. In the mixture modelling sense,
this means that the algorithm can attempt to randomly split/merge components
provided the move is reversible (Richardson and Green, 1997) en route to the ‘op-
timal’ models. However, engineering reversible moves can be very tricky and time
consuming (Andrieu et al., 2003). Note that such split/merge strategies can also be
implemented for the EM algorithm (Ueda et al., 2000), but k needs to stay fixed; this
is to prevent over-fitting within the maximum likelihood (ML) framework.
Instead of making ‘reversible jump’ moves, Stephens (2000) proposed an alterna-
tive scheme for determining k based on a continuous time Markov birth-death pro-
cess; the scheme allows ‘birth’ of new components and the ‘death’ of some exist-
ing components. While this more straightforward birth-death MCMC method does
not require the calculations of a complicated Jacobian, its computational time re-
quirement is comparable to RJMCMC; that is, both algorithms are not suitable for
analysing large datasets such as those used in our research.
However, significant headway has been made recently in approximate Bayesian
computation (ABC) such as the approach we describe in the following.
Sequential Monte Carlo (SMC) Methods & Sequential Importance Sampling (SIS)
Traditional importance sampling (IS) can be modified into sequential importance
sampling (SIS) so that efficiency can be improved in the sequential data environ-
ment. When a new observation arrives at time t, the importance sampling weights need
to be adjusted by the factor g_t(θ)/g_{t-1}(θ), which is proportional to p(x_t|θ). Such a
setting allows the algorithm to stop when uncertainty in the parameters of interest has
reached a satisfactory level. This approach is known as the sequential Monte Carlo (SMC) method
or particle filter (PF) (Doucet et al., 2001); in contrast to standard MCMC methods
(Ridgeway and Madigan, 2003), SMC has been shown to be efficient, flexible, paral-
lelisable, and easy to implement (Doucet et al., 2001; Ridgeway and Madigan, 2003;
Balakrishnan and Madigan, 2006). More recently, there has been even more focus on
the efficiency issue of SMC methods (Chopin, 2002).
Already, Ridgeway and Madigan (2002, 2003), for example, have managed to reduce
data access requirements by 98% compared to traditional MCMC methods through
the use of an importance sampling algorithm in their experimental studies.
Their algorithm partitions all observations into subsets x1 and x2; instead of sam-
pling from the posterior conditioned on all the data, samples are drawn from the
posterior conditioned on x1 to speed up the sampling procedure, and x2 is utilised
for adjusting the sampled parameter values by reweighting. Furthermore, Balakr-
ishnan and Madigan (2006) have recently improved on this and have presented the
first single-pass algorithm in this research direction, known as the one pass particle
filter with shrinkage (1PFS), which is better suited to a real-time environment. 1PFS
bypasses the exhaustive analysis of an initial portion of the training data by sam-
pling initial particles from the prior distribution of the parameters (Balakrishnan
and Madigan, 2006). Nonetheless, SMC methods typically focus on analysing long
series (as opposed to many shorter series), and their properties for static datasets are
still largely unknown; their shortfalls are described more generally below.
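The sequential reweighting step described above, where each arriving observation multiplies a particle's weight by the likelihood p(x_t|θ), can be sketched for the toy problem of an unknown Gaussian mean (all distributions and values are illustrative, and log-weights and resampling are deliberately omitted):

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Particles drawn from the prior approximate the posterior of an unknown
# mean theta; each arriving observation x_t multiplies a particle's weight
# by the likelihood p(x_t | theta), i.e. the g_t/g_{t-1} adjustment above.
random.seed(0)
particles = [random.gauss(0.0, 5.0) for _ in range(5000)]  # prior draws
weights = [1.0] * len(particles)

stream = [random.gauss(3.0, 1.0) for _ in range(50)]  # arriving observations
for x_t in stream:
    weights = [w * normal_pdf(x_t, theta, 1.0)
               for w, theta in zip(weights, particles)]

total = sum(weights)
posterior_mean = sum(w * t for w, t in zip(weights, particles)) / total
```

In practice, SMC implementations work with log-weights and resample periodically to counter the weight degeneracy visible here, where a handful of particles end up carrying nearly all the weight.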
Disadvantages of Monte Carlo-Based Approaches
MCMC methods can provide ‘correct’ approximations given infinite computational
resources (Bishop, 2006, p.462). However, despite their popularity, they are typi-
cally computationally intensive (Balakrishnan and Madigan, 2006; Blei and Jordan,
2006). While they have been successfully utilised for solving many smaller data min-
ing problems (Giudici and Castelo, 2001; Giudici and Passerone, 2002), they are not
known to be practical for analysing massive datasets (Madigan and Ridgeway, 2003);
this is in spite of the fact that their statistical concepts are sound and useful, and
are often noted in the data mining literature (Glymour et al., 1996). MCMC's scal-
ability issues are the result of the following requirements:
1. intensive computational requirements for scanning (and updating models
through) the dataset, which generally means a very large number of iterations
to ensure converged models (Ridgeway and Madigan, 2003);
2. typical requirement of a complete scan of the dataset for each iteration (Ridge-
way and Madigan, 2003; Balakrishnan and Madigan, 2006);
3. high storage requirements (Wang and Titterington, 2006);
4. having parameter posterior distributions stored as a set of samples usually in
the memory (Andrieu et al., 2003); and
5. the typical step of loading data into the memory prior to the modelling process
(Ridgeway and Madigan, 2003).
While the more recently popular sequential Monte Carlo (SMC) methods (Doucet
et al., 2001; Chopin, 2002; Ridgeway and Madigan, 2003; Balakrishnan and Madigan,
2006) have attempted to address many of these scalability challenges, Monte Carlo-
based methods typically have several additional drawbacks, namely they:
6. rely on samples being a good representation of the true model (Andrieu et al.,
2003);
7. do not yield closed form solutions;
8. do not guarantee monotonically improving approximations (Jaakkola and Jor-
dan, 2000);
9. involve difficulties in verifying if models have converged (Robert and Casella,
1999, Chapter 8); and
10. have a label switching issue caused by the non-identifiability of the components
under the symmetric priors of Monte Carlo-based methods (Celeux et al., 2000;
Marin and Robert, 2007, p.162).
In short, Bayesian approaches allow for flexibility within models by naturally incor-
porating (decision) uncertainty into the models via prior distributions on the pa-
rameters, and the results can be interpreted simply. However, their scalability can
be significantly worse than that of the classical/frequentist techniques described in the
previous subsection, making them impractical for many applications such as this re-
search. Nonetheless, there are several clustering algorithms (e.g., Cheeseman and
Stutz, 1996) based on the Bayesian mixture model framework; some recent algo-
rithms (e.g., Pizzuti and Talia, 2003) are more scalable as a result of a parallel imple-
mentation, for example. In summary, a more efficient approach than the existing
typical Bayesian methods is needed for our task.
2.2.5 Approximate techniques (for GMMs)
Variational Bayesian (VB) Methods
There are many Bayesian inference approximation schemes (e.g., Madigan and
Ridgeway, 2003; Gelman et al., 2004; Bishop, 2006; Rue et al., 2009), and one general
approach is variational Bayesian (VB) methods (Wang and Titterington, 2006). VB
(Waterhouse et al., 1996; Neal and Hinton, 1998; Jordan et al., 1998) was first formally
formulated by Attias (1999). It involves introducing a prior over the model struc-
ture and is a deterministic alternative to sampling-based Monte Carlo methods
for Bayesian inference. It is less computationally demanding and is promising for
analysing large datasets (Madigan and Ridgeway, 2003). This is achieved by convert-
ing the inference problems into 'relaxed' optimisation problems (Blei and Jordan,
2006). It is based on the calculus of variations, which has its origins in the
18th century; that is, it involves a functional derivative, which expresses how the value
of the functional changes in response to infinitesimal changes to the input function
(Bishop, 2006, pp.462-463).
The theory of VB has now been well documented (e.g., Wang and Titterington, 2006).
In short, it aims to obtain an approximation to the posterior distribution p(θ|x),
which leads to a coupled expression for the posterior that can be solved iteratively
until convergence. It maximises a lower bound on the data marginal like-
lihood p(x) by minimising the Kullback-Leibler (KL) diver-
gence between a function approximating the posterior and the actual
target posterior. That
is, with the introduction of a variational function q(θ, z|x) and the use of Jensen's in-
equality, the log transformation of the data marginal likelihood can be expressed as
follows:
log p(x) = log ∫ Σ_{z} q(θ, z|x) [p(x, z, θ)/q(θ, z|x)] dθ

= ∫ Σ_{z} q(θ, z|x) log [p(x, z, θ)/q(θ, z|x)] dθ
+ ∫ Σ_{z} q(θ, z|x) log [q(θ, z|x)/p(θ, z|x)] dθ (2.1)

= F(q(θ, z|x)) + KL(q|p)

≥ F(q(θ, z|x)),
where F(·) is the first term in (2.1) and KL(·) is the second, which is the KL divergence
between the target p(θ, z|x) and its variational approximation; z denotes the
unobserved component membership information of x. Since KL(·) cannot be negative,
minimising it means that VB is effectively maximising F(·), a lower bound on
log p(x). However, the variational function must be chosen carefully so that it is
a close approximation to the true conditional density and, importantly, so that it gives
tractable computations for approximating the required posterior distribution. Typically
it is assumed that q(θ, z|x) can be factorised as qθ(θ|x) × qz(z|x), with conjugate
distributions chosen for the parameters. VB then involves solving for q(θ, z|x) iteratively
in a way similar to the standard EM algorithm; at each iteration, the bound F(·) is increased
unless it has already reached its maximum. We note that this typical factorisation of the
variational function is based on the approximation framework of mean
field theory (Opper and Saad, 2001).
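The decomposition in (2.1) can be checked numerically on a toy discrete model. In the sketch below, the joint probabilities and the variational function q are arbitrary illustrative values of our own choosing; the check verifies that F(q) + KL(q‖p) recovers log p(x) exactly and that F(q) is indeed a lower bound:

```python
import numpy as np

# Toy discrete model: theta in {0,1}, z in {0,1}, one fixed observation x.
# Entries are p(x, z, theta); the values are hypothetical, for illustration only.
p_joint = np.array([[0.10, 0.05],   # rows index theta, columns index z
                    [0.20, 0.15]])

log_px = np.log(p_joint.sum())       # log marginal likelihood log p(x)
p_post = p_joint / p_joint.sum()     # exact posterior p(theta, z | x)

# An arbitrary (normalised) variational approximation q(theta, z | x).
q = np.array([[0.30, 0.20],
              [0.25, 0.25]])

F = np.sum(q * np.log(p_joint / q))  # first term of (2.1): the lower bound F(q)
KL = np.sum(q * np.log(q / p_post))  # second term of (2.1): KL(q || p)

assert np.isclose(F + KL, log_px)    # the decomposition holds exactly
assert F <= log_px                   # F(q) is a lower bound on log p(x)
```

Any other valid q changes how log p(x) is split between F and KL, but never the sum; VB's iterations shift mass from the KL term to F.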
Besides its scalability and deterministic nature, the advantages of VB are that it does
not suffer from the singularity problems of maximum likelihood (ML) approaches such as
EM (Attias, 1999), or from the mixing and label switching problems of Monte Carlo-based
methods (Wang and Titterington, 2006). It has been shown to perform well empirically
(Wang and Titterington, 2006), and is more suitable (Ghahramani and Beal,
1999) and more accurate for mixture models than the Laplace approximation,
a large sample method that assumes all posterior distributions are Gaussian
(MacKay, 1998). Consequently, it has already been applied in a variety of applications
(Winn and Bishop, 2005; Wang and Titterington, 2006). Note that, as with the other
methods described previously, the VB approach is not limited to mixture modelling
or the missing data model context, but this is the primary focus of this research.
Furthermore, although it is not yet clear exactly why, VB effectively and progressively
eliminates redundant components when an excessive number of components is
specified in the initial model (Attias, 1999; Corduneanu and Bishop, 2001;
McGrory and Titterington, 2007). This automatic selection of the number of
components k implies that one does not need to utilise computationally intensive
fully Bayesian approaches such as the RJMCMC method (Richardson and Green, 1997)
or the birth-death MCMC method (Stephens, 2000). That is, VB can be used to
simultaneously perform model selection and estimation of the model parameters
(McGrory and Titterington, 2007), as those fully Bayesian methods do, but in a much
more efficient way. Note that this can be a very useful feature for our research aim
of modelling individuals' heterogeneous spatial usage behaviour, but its effect
requires further investigation. We discuss this issue in Chapter 3, where we explore the
practical implications of this aspect of the approach, and also in Chapter 4, where
we attempt to describe this feature more fully. The actual VB model hierarchies
for GMMs are omitted here as they are discussed in relation to particular cases later;
please refer to Chapter 3 for the one-dimensional GMM and Chapters 4 to 6 for the
two-dimensional GMM.
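This component-elimination behaviour can be illustrated with scikit-learn's `BayesianGaussianMixture`, which implements a VB fit of a GMM. In the sketch below the data are synthetic and the prior settings are our own illustrative choices; k is deliberately overspecified, and we inspect how many components retain non-negligible weight after fitting:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two well-separated one-dimensional Gaussian components, 300 points each.
X = np.vstack([rng.normal(-5.0, 1.0, (300, 1)),
               rng.normal(5.0, 1.0, (300, 1))])

# VB fit with a deliberately excessive number of components (10); a small
# Dirichlet concentration prior encourages pruning of redundant components.
vb = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=1e-3,
    max_iter=500, random_state=0).fit(X)

# Redundant components collapse to (near-)zero weight; count the survivors.
effective_k = int(np.sum(vb.weights_ > 0.01))
print(effective_k)  # typically the true number of components here
```

Increasing `weight_concentration_prior` weakens the pruning effect, which is one practical way to probe the behaviour discussed above.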
For completeness, we conclude this subsection by detailing the general result of VB
for mixtures of exponential family models in the one-dimensional case, which can
easily be extended to higher dimensions as outlined in the thesis of McGrory (2006).
For consistency, we again denote the observations as x = (x1, ..., xn), which are to be
modelled as a mixture of k distributions where each distribution has a corresponding
parameter φj, and z = {zij : i = 1, ..., n, j = 1, ..., k} as the missing binary
component indicator variable such that zij = 1 if observation xi is from component j
and zij = 0 otherwise. The complete-data likelihood of a mixture of
exponential family distributions can generally be expressed as follows:
\[
p(x, z \,|\, \phi, w) = \prod_{i=1}^{n} \prod_{j=1}^{k} w_j^{z_{ij}} \left[ s(x_i)\, t(\phi_j) \exp\left\{a(x_i)\, b(\phi_j)\right\} \right]^{z_{ij}},
\]
where w denotes the mixing weights; b(φj) is the natural parameter, and s(xi),
t(φj) and a(xi) are functions defining the exponential family distribution. Given
this, the conjugate prior takes the form:
\[
p(\phi, w) \propto \prod_{j=1}^{k} w_j^{\alpha_j^{(0)} - 1} \prod_{j=1}^{k} h\!\left(\eta_j^{(0)}, \upsilon_j^{(0)}\right) t(\phi_j)^{\eta_j^{(0)}} \exp\left\{\upsilon_j^{(0)}\, b(\phi_j)\right\},
\]
where α is a hyper-parameter and h(·) is a function defining another exponential family
distribution with parameters η and υ. The joint density of the data, indicators and
parameters is then:
\[
p(x, z, \phi, w \,|\, \eta, \upsilon) \propto \prod_{i=1}^{n} \prod_{j=1}^{k} w_j^{z_{ij}} \left[ s(x_i)\, t(\phi_j) \exp\left\{a(x_i)\, b(\phi_j)\right\} \right]^{z_{ij}}
\times \prod_{j=1}^{k} w_j^{\alpha_j^{(0)} - 1} \prod_{j=1}^{k} h\!\left(\eta_j^{(0)}, \upsilon_j^{(0)}\right) t(\phi_j)^{\eta_j^{(0)}} \exp\left\{\upsilon_j^{(0)}\, b(\phi_j)\right\}.
\]
Assuming that the introduced variational function factorises as
\( q(z, \phi, w) = \prod_{i=1}^{n} q_{z_i}(z_i)\, q_\phi(\phi)\, q_w(w) \) with
\( q_\phi(\phi) = \prod_{j=1}^{k} q_{\phi_j}(\phi_j) \), the variational posterior
\( q_{z_i}(z_i = j) \) can be derived by focusing on the sufficient statistics of the
variational lower bound on the marginal log-likelihood:
\begin{align*}
& \sum_{\{z\}} \int \prod_{i=1}^{n} q_{z_i}(z_i)\, q_\phi(\phi)\, q_w(w)
\log \frac{p(\phi, w) \prod_{i=1}^{n} p(x_i, z_i \,|\, \phi, w)}{\prod_{i=1}^{n} q_{z_i}(z_i)\, q_\phi(\phi)\, q_w(w)} \, d\phi\, dw \\
&= \sum_{\{j\}} \int q_{z_i}(z_i = j)\, q_\phi(\phi)\, q_w(w)
\log \frac{p(x_i, z_i = j \,|\, \phi, w)}{q_{z_i}(z_i = j)} \, d\phi\, dw
+ \text{terms independent of } q_{z_i} \\
&= \sum_{\{j\}} q_{z_i}(z_i = j) \left\{ \int q_\phi(\phi)\, q_w(w) \log p(x_i, z_i = j \,|\, \phi, w)\, d\phi\, dw - \log q_{z_i}(z_i = j) \right\}
+ \text{terms independent of } q_{z_i} \\
&= \sum_{\{j\}} q_{z_i}(z_i = j) \log \left[ \frac{\exp \int q_\phi(\phi)\, q_w(w) \log p(x_i, z_i = j \,|\, \phi, w)\, d\phi\, dw}{q_{z_i}(z_i = j)} \right]
+ \text{terms independent of } q_{z_i}.
\end{align*}
That is, substituting the prior and the general form of the mixture density into the
equation gives:
\begin{align*}
q_{z_i}(z_i = j) &\propto \exp \left\{ \int q_\phi(\phi)\, q_w(w) \log p(x_i, z_i = j \,|\, \phi, w)\, d\phi\, dw \right\} \\
&\propto \exp \left\{ \int q_\phi(\phi)\, q_w(w) \left[ \log w_j + \log t(\phi_j) + a(x_i)\, b(\phi_j) \right] d\phi\, dw \right\} \\
&= \exp \left\{ E_q[\log w_j] + E_q[\log t(\phi_j)] + a(x_i)\, E_q[b(\phi_j)] \right\}.
\end{align*}
Similarly, we can obtain:
\[
q_w(w) \propto \prod_{j=1}^{k} w_j^{\alpha_j^{(0)} + \sum_{i=1}^{n} q_{ij} - 1} = \prod_{j=1}^{k} w_j^{\alpha_j - 1},
\]
where \( q_{ij} = q_{z_i}(z_i = j) \) and \( \alpha_j = \alpha_j^{(0)} + \sum_{i=1}^{n} q_{ij} \); and
\begin{align*}
q_{\phi_j}(\phi_j) &\propto t(\phi_j)^{\sum_{i=1}^{n} E_{q_{z_i}}[z_{ij}]} \exp \left[ \sum_{i=1}^{n} E_{q_{z_i}}[z_{ij}]\, a(x_i)\, b(\phi_j) \right]
t(\phi_j)^{\eta_j^{(0)}} \exp \left[ \upsilon_j^{(0)}\, b(\phi_j) \right] \\
&= t(\phi_j)^{\sum_{i=1}^{n} q_{ij} + \eta_j^{(0)}} \exp \left[ \left\{ \sum_{i=1}^{n} q_{ij}\, a(x_i) + \upsilon_j^{(0)} \right\} b(\phi_j) \right] \\
&= t(\phi_j)^{\eta_j} \exp \left[ \upsilon_j\, b(\phi_j) \right],
\end{align*}
where
\[
\eta_j = \eta_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad \upsilon_j = \upsilon_j^{(0)} + \sum_{i=1}^{n} q_{ij}\, a(x_i).
\]
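As an illustrative instance of these updates, the sketch below applies them to a mixture of exponential densities p(x|λ) = λ exp(−λx), i.e., t(λ) = λ, a(x) = x and b(λ) = −λ, so that the conjugate prior λ^η exp(−υλ) is of Gamma form. The function name, prior settings, quantile-based initialisation and simulated data are our own illustrative choices, not part of the general result:

```python
import numpy as np
from scipy.special import digamma

def vb_exp_mixture(x, k=2, iters=200, alpha0=1.0, eta0=1.0, ups0=1.0):
    """VB coordinate ascent for a mixture of exponential densities, using the
    general exponential family updates: alpha_j = alpha0 + sum_i q_ij,
    eta_j = eta0 + sum_i q_ij, ups_j = ups0 + sum_i q_ij * x_i."""
    n = len(x)
    # Deterministic initial responsibilities from quantile bins of x.
    edges = np.quantile(x, np.linspace(0.0, 1.0, k + 1))
    z0 = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, k - 1)
    q = np.eye(k)[z0]
    for _ in range(iters):
        nk = q.sum(axis=0)
        alpha = alpha0 + nk                           # Dirichlet update for weights
        eta = eta0 + nk                               # eta_j update
        ups = ups0 + q.T @ x                          # ups_j update (a(x_i) = x_i)
        # Expectations: q(lam_j) is Gamma(eta_j + 1, ups_j).
        e_log_w = digamma(alpha) - digamma(alpha.sum())
        e_lam = (eta + 1.0) / ups
        e_log_lam = digamma(eta + 1.0) - np.log(ups)
        # Responsibility update: q_ij ∝ exp{E[log w_j] + E[log t] + a(x_i) E[b]}.
        log_q = e_log_w + e_log_lam - np.outer(x, e_lam)
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
    return alpha / alpha.sum(), e_lam                 # weight and rate estimates

rng = np.random.default_rng(1)
x = np.concatenate([rng.exponential(1 / 0.5, 1000),   # component with rate 0.5
                    rng.exponential(1 / 5.0, 1000)])  # component with rate 5.0
w, lam = vb_exp_mixture(x, k=2)
print(np.round(np.sort(lam), 2))
```

The same skeleton covers any one-dimensional exponential family member by swapping the expectations E[log t(φj)] and E[b(φj)] for the chosen conjugate pair.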
Expectation Propagation (EP) Method
The deterministic expectation propagation (EP) method (Minka, 2001) is closely related
to VB in that it also approximates inference by minimising a KL divergence.
However, EP minimises KL(p|q) instead of KL(q|p); note that the KL divergence is not
symmetric. One disadvantage of EP is that it is typically not guaranteed to converge
monotonically; moreover, EP is not 'sensible' for mixture modelling, as it aims to capture all
of the posterior modes (Bishop, 2006, p.510), although attempts to use it in this setting
have been made (Minka and Ghahramani, 2003).
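The asymmetry of the divergence can be seen with a small numerical example; the two discrete distributions below are arbitrary illustrative values:

```python
import numpy as np

# Two discrete distributions on the same three-point support.
p = np.array([0.80, 0.15, 0.05])
q = np.array([0.40, 0.40, 0.20])

kl_qp = np.sum(q * np.log(q / p))  # KL(q||p), the quantity VB minimises
kl_pq = np.sum(p * np.log(p / q))  # KL(p||q), the quantity EP minimises

print(round(kl_qp, 3), round(kl_pq, 3))  # → 0.392 0.338
assert not np.isclose(kl_qp, kl_pq)      # the divergence is not symmetric
```

Minimising one direction or the other over q therefore yields different approximations: KL(q‖p) penalises placing q-mass where p is small (mode-seeking), while KL(p‖q) penalises missing p-mass (mass-covering), which is why EP tries to cover all posterior modes.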
2.2.6 High dimensional GMM
As indicated earlier, one of the research requirements is to investigate combinational
data analysis (CDA) for high dimensional data. There has been limited study of high
dimensional GMMs in comparison to high dimensional clustering algorithms (which we
discuss later in § 2.3.4). This is not surprising considering that it is more challenging
to model 'complex' data with parametric or semi-parametric models than with
nonparametric approaches.
One recent notable exception is Bouveyron et al. (2007), which models subspace
clusters in high dimensional spaces with a GMM; the concept of a subspace cluster,
discussed further in § 2.3.4, is that some dimensions are considered as
noise for some clusters. The algorithm is based on work on mixtures of probabilistic
PCA (Tipping and Bishop, 1999; McLachlan et al., 2003) and on eigenvalue decomposition
of the covariance matrices (Celeux and Govaert, 1995), with only certain essential
parameters estimated by an EM algorithm. The algorithm assumes that there
are no irrelevant dimensions (although some can have a weighting very close to zero),
and the intrinsic dimensionality of the clusters is estimated iteratively (in each M-step,
based on the eigenvalues of each cluster's covariance matrix) with the use of the
scree-test of Cattell (1966) and BIC.
In spite of the assumptions and constraints made so that it calculates
parameter estimates only with respect to the likely subspace of each cluster, the
algorithm still appears to be quite computationally demanding. More problematic,
however, is that it requires the number of clusters k to be predetermined, which
is clearly not desirable for this research. Note that algorithms (e.g., Friedman and
Meulman, 2004; Domeniconi et al., 2004; Jing et al., 2007; Cheng et al., 2008) which
adopt this approach are often considered to be weighted k-means-like algorithms since
they focus on normalising attributes rather than discarding them; they are sometimes
referred to as soft projected clustering algorithms (Kriegel et al., 2009).
2.2.7 Review conclusion
Overall, the two recent but different approaches, sampling-based sequential Monte
Carlo (SMC) methods and non-sampling-based variational Bayesian (VB) methods,
both within the Bayesian framework, appear to be effective and scalable for
approximating individuals' mobility patterns as Gaussian mixture models (GMMs).
However, the deterministic nature of VB, its ability to automatically determine a
suitable number of components k, and its space efficiency suggest that VB should be
the preferred technique for modelling the heterogeneous patterns we expect to observe
in this research, and it is applied more generally in Chapters 3 to 6. Moreover, it
may be necessary to consider utilising more robust covariance estimation (Campbell,
1980; Pena and Prieto, 2001; Wang and Raftery, 2002) within the VB framework to
account for the irregular nature of the problem, though this appears to be
computationally expensive.
2.3 Clustering
2.3.1 Introduction
The aim of clustering (Jain and Dubes, 1988; Duda et al., 2001) is to segment
unlabelled and typically numerical data, in an automatic fashion, into relatively
meaningful, natural, homogeneous but hidden subgroups or 'clusters'. This is done by
maximising intra-cluster and minimising inter-cluster similarities without the need
for any prior knowledge (Hastie et al., 2009, p.501). The similarities are often measured
based on the (Euclidean) distance for (low dimensional) numerical data, though the
definition of 'similar' (Xu and Wunsch II, 2005; Jain and Dubes, 1988, pp.14-23) can
vary greatly. In machine learning, clustering is sometimes known as unsupervised
classification (Jain et al., 1999); and today it is usually performed without first
assessing the cluster tendency, i.e., without first determining whether clusters are
present in the data at all (Smith and Jain, 1984), even though some have argued the
importance of this step and proposed approaches for performing it (Smith and Jain,
1984; Jain and Dubes, 1988, p.201). In addition to discovering the underlying data
structure (Hastie et al., 2009, p.502), clustering has been used as a data reduction,
compression, summarisation (Jain, 2010) and outlier detection tool (Barbara et al.,
1997); it has been shown to be useful for pattern/image segmentation/recognition and
information retrieval, for example (Jain, 2010). Furthermore, it has been shown to be
useful as a stand-alone technique as well as a preprocessing technique for other
analytical tasks such as supervised classification (Han and Kamber, 2006, pp.383-384).
Even though probabilistic theories have recently been proposed for better algorithm
design (Dougherty and Brun, 2004), clustering is still considered to be a problematic
and subjective process with no standard benchmarks available for comparison (Jain
et al., 1999). This is especially true for clustering high dimensional data (Patrikainen
and Meila, 2006). The loose definition of a cluster, however, implies that clustering
results can be evaluated/validated according to their internal, external and relative
similarities (Dubes, 1999; Jain and Dubes, 1988, Chapter 4). Yet many of these
approaches do not appear to be effective or practical, and they often do not address
issues such as the stability of the results (Lange et al., 2004). For applications such
as our business application, however, being able to assess cluster interpretability and
visualisation (Berkhin, 2006) can be useful if not critical.
Unlike mixture models (c.f. § 2.2), which profile data based on the mixture
decomposition (Jain and Dubes, 1988, pp.117-118), the 'definition' of a cluster can vary
greatly (Everitt, 1974, pp.43-44). Examples of different cluster representations
for numerical data, which is our primary research focus, include:
• by the cluster’s gravity centre. For example, the classical k-means (KM) (Mac-
Queen, 1967) and hierarchical (Kaufman and Rousseeuw, 1990) clustering al-
gorithms;
• by an object (i.e., medoid) located near its centre. For example, k-medoids
algorithms (Kaufman and Rousseeuw, 1990; Ng and Han, 1994). This definition
makes the algorithms less sensitive to outliers than the gravity centre
definition above but, at the same time, makes them less scalable;
• by density connected points (Jain and Dubes, 1988, pp.128-133). Algorithms
based on this notion are sometimes referred to as density-based (e.g., Ester et al.,
1996) or grid-based (e.g., Wang et al., 1997b) clustering algorithms (c.f. § 2.3.3).
They have had significant influence on research in clustering high dimensional data
(Hinneburg and Keim, 1998; Agrawal et al., 1998) (c.f. § 2.3.4);
• by a collection of points (e.g., Guha et al., 1998) or by a boundary (e.g., Karypis
et al., 1999). Algorithms based on the former definition can often better represent
the clusters and reduce the implications of clusters being very different in
size, for example; whereas algorithms utilising boundaries as cluster
representations are often based on graph theory (Jain and Dubes, 1988,
pp.120-128), such as the minimal spanning tree (MST) (Zahn, 1971). The graphical
approach is closely related to hierarchical clustering algorithms (c.f. § 2.3.2).
However, while there have been many recent developments in this research
direction (c.f. Jain, 2010), they do not appear to be effective for large high
dimensional datasets;
• by concept (e.g., Gennari et al., 1989);
• by its statistical summaries (e.g., Zhang et al., 1996); and
• by the mode of its probability density estimate (Jain and Dubes, 1988, pp.118-120).
This in turn suggests that while a significant literature exists on clustering
and its algorithms (e.g., Jain and Dubes, 1988; Jain et al., 1999; Xu and Wunsch II, 2005;
Berkhin, 2006; Han and Kamber, 2006), differences in cluster definitions (Jain, 2010),
in cluster assumptions, such as assigning each data point to:
• only one (MacQueen, 1967) or multiple (Cole and Wishart, 1970) clusters, or
• every cluster with various degrees of probability (Dempster et al., 1977; Bezdek,
1981),
and in the contexts in which clustering is used, for example, have made reviewing
clustering and transferring useful generic concepts and methodologies challenging
(Jain et al., 1999). Moreover, it appears that there is no optimal algorithm for
solving all clustering problems (Kleinberg, 2002), and no standard or effective criteria
to guide algorithm selection (Xu and Wunsch II, 2005). There are also issues that need
to be carefully considered, such as feature selection, weighting and normalisation
(Wedel and Kamakura, 1998, pp.57-59).
Furthermore, most algorithms (including many more recently developed ones) appear
to be very sensitive to critical parameter settings such as the number of clusters k
(Jain and Dubes, 1988, p.177). Yet these parameters can be difficult to determine and
can lead to unreliable or poor clustering quality (Jain et al., 1999), especially when
clustering high dimensional data (Moise et al., 2008). Determining the 'true' value of
k is the fundamental problem of clustering, or cluster validity (Everitt, 1979), and
some guidance (e.g., Milligan and Cooper, 1985; Dubes, 1987; Tibshirani et al., 2001)
has been provided (c.f. Xu and Wunsch II, 2005; Berkhin, 2006); see also the discussion
in § 2.2.3. Bayesian techniques in particular (Schwarz, 1978; Wallace and Freeman,
1987; Kass and Raftery, 1995; Fraley and Raftery, 1998; Blei et al., 2003; Li and
McCallum, 2006) have been shown to be useful in determining the value of k, and
parameters more generally. However, most of these techniques are based on the concept
of clusters being 'compact' and 'isolated' (Jain, 2010), which is not necessarily
appropriate for this application, since human mobility patterns have a high degree of
spatial regularity which makes the overall pattern heterogeneous and spiky.
In the remainder of this review of the clustering literature, methodology is discussed
roughly in order by time of development. In § 2.3.4, techniques for clustering high
dimensional numerical data are reviewed.
2.3.2 Classical clustering algorithms
Hierarchical Clustering Algorithms
Classical algorithms are commonly separated into two classes: hierarchical or flat
partitioning (Jain and Dubes, 1988, pp.55-58). Hierarchical clustering algorithms
represent data as an easily understood, hierarchically nested series of partitions; they
are typically based on a distance-related criterion and/or a number-of-clusters
criterion for merging (c.f. agglomerative) or splitting (c.f. divisive) (Jain et al., 1999).
However, their minimum requirements of quadratic time and space complexity (i.e.,
O(n²) with n being the number of observations) (Jain et al., 1999) imply that they have
limited application for analysing large datasets, although many recent algorithms
(e.g., Achtert et al., 2007a) for clustering high dimensional data are influenced by them.
k-Means Algorithm(s)
On the other hand, the popular k-means (KM) algorithm is the most representative,
but not the only, partitional algorithm (Jain et al., 1999). It and its variants are
sometimes known as squared error clustering algorithms since they segment objects
into k groups iteratively by minimising the sum of squared errors over all k clusters
(Jain et al., 1999). KM-based algorithms are computationally more efficient than
the hierarchical clustering algorithms; their time complexity is O(ndk), with n being
the number of observations, d the number of attributes and k the number of clusters
(Jain et al., 1999). While comparisons of the quality of the resultant clusters between
these and the hierarchical alternatives are inconclusive (Milligan, 1980; Punj and
Stewart, 1983), both (as well as the k-medoid variant) can generally only identify
clusters with convex shapes (or hyper-spherical shapes, to be more precise) (Jain et al.,
1999), and have a tendency to split larger clusters in favour of similarly sized clusters
(Mao and Jain, 1996). Additionally, both can be rather sensitive to noise due to the use
of a single representative per cluster and the use of distance-based measures (Guha
et al., 1999). Interestingly, however, the KM algorithm using the Mahalanobis distance
(MD) (Mahalanobis, 1936) as an alternative to the commonly used Euclidean distance
as the proximity measure tends to produce clusters that are hyper-ellipsoidal and
can be unusually large or small in size (Mao and Jain, 1996).
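The squared error iteration can be sketched minimally as follows (a plain Lloyd-style k-means on synthetic data; the helper name and settings are our own illustrative choices):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm): repeatedly assign each point to its
    nearest centre, then recompute centres as cluster means, which decreases
    the sum of squared errors over all k clusters. Cost is O(ndk) per pass."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # initial partition
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centres):                       # converged
            break
        centres = new
    return centres, labels

rng = np.random.default_rng(2)
# Two spherical clusters around (0,0) and (4,4).
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(4.0, 0.5, (100, 2))])
centres, labels = kmeans(X, k=2)
```

Replacing the squared Euclidean distance in `d2` with a Mahalanobis distance is the variant noted above that favours hyper-ellipsoidal clusters; the sensitivity to the initial `centres` draw is the initial-partition issue discussed below.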
There are a great number of KM algorithm extensions (c.f. Jain, 2010). In particular,
many (e.g., Bradley et al., 1998; Farnstrom et al., 2000; Pham et al., 2004; Ordonez
and Omiecinski, 2004) have been developed to address, for example:
• the quality of the results (as well as the algorithm's sensitivity to the initial
partition selection, which can lead to locally optimal clusters);
• the issue of data order dependency, an issue for many efficient algorithms
(c.f. Hartigan, 1975; Fisher, 1987) which scan the data only once (c.f.
Appendix B and C); and
• the suitability of the algorithm for clustering large datasets without needing to
freely access all of the data, as most (classical) algorithms have assumed (c.f. Jain
et al., 1999; Xu and Wunsch II, 2005; Kogan et al., 2006; Kogan, 2007).
Additionally, note that the KM algorithm, which is predominantly used for clustering
numerical data, has also been extended to handle categorical or mixed-type attributes
(i.e., utilising measures other than distance) (e.g., Huang, 1998; Ghosh and
Strehl, 2006; Kogan, 2007). Overall, however, more scalable algorithms that do not
require predetermination of k and are capable of discovering more arbitrarily shaped
clusters are needed for this application.
2.3.3 Scalable clustering algorithms
Random Sampling & Index Tree Structure Random sampling (and randomised
search) (Kaufman and Rousseeuw, 1990; Ng and Han, 1994; Guha et al., 1998), which
has been shown to be robust (in terms of resilience to noise) and useful for clustering
large datasets by fitting selected samples into memory, is one technique frequently
utilised in various parts of many clustering algorithms. Sets of well-chosen samples
have been shown to be useful for identifying and representing quality clusters (Guha
et al., 1998). An alternative to random sampling, which discards part of
the observations to improve scalability, is to summarise the data into a dynamically
updated index tree structure; an index tree structure can efficiently identify
observations' nearest neighbours (Jain, 2010). The use of an index has been shown to
help algorithms focus on relevant data; algorithms can thus cluster data
representatives (residing in memory) based on the summarised statistical information
instead of the original data (Zhang et al., 1996). Algorithms based on this technique
(e.g., Zhang et al., 1996) have shown that it is actually possible to cluster with close
to linear time complexity, i.e., O(n). However, they typically utilise statistics (e.g.,
zeroth, first and second moments) that assume the data is Gaussian distributed, which
is generally inappropriate. Additionally, these algorithms are somewhat sensitive to
the data ordering; and their use of a radius to control the boundary of each cluster
still results in spherically shaped clusters, as in the case of the classical clustering
algorithms (Guha et al., 1998; Sheikholeslami et al., 1998). Nonetheless, the index
tree structure has been widely utilised since (e.g., Ganti et al., 1999c; Aggarwal et al.,
2003), and has been shown to be useful in dealing with noisy data and detecting
anomalies (e.g., Bohm et al., 2000; Burbeck and Nadjm-Tehrani, 2005).
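A k-d tree is one common example of such an index. The sketch below uses SciPy's `cKDTree` to answer nearest-neighbour queries without scanning all n points (the data are synthetic and the sizes are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
X = rng.random((10_000, 2))          # 10,000 points in the unit square

tree = cKDTree(X)                    # build the spatial index once
dist, idx = tree.query(X[:5], k=2)   # 2 nearest neighbours of 5 query points

# Each query point is itself its first neighbour (distance 0); idx[:, 1] is
# the true nearest neighbour, found by descending the tree rather than by
# comparing against all 10,000 points.
assert np.allclose(dist[:, 0], 0.0)
```

For low dimensional data such queries cost roughly O(log n) each, which is what makes index-based clustering summaries feasible; as noted in § 2.3.4, this advantage erodes quickly as the dimensionality grows.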
Density-Based Clustering Algorithms
Density-based clustering algorithms define and connect clusters agglomeratively based
on the density of neighbourhood objects (i.e., the number of data points within a given
radius of an object); DBSCAN is perhaps the most well-known example (Ester et al.,
1996; Sander et al., 1998). Density-based clustering algorithms, through the use of a
local criterion, have been shown to be able to discover somewhat arbitrary clusters
with different shapes, sizes and densities, and are naturally robust to outliers, though
the clusters may not be very informative and/or easy to interpret (Berkhin, 2006).
Many improvements have been made in, for example:
• eliminating the input parameter requirements (e.g., Xu et al., 1998). This can
be done, for example, by identifying the intrinsic clustering structures of the
clusters (Ankerst et al., 1999), a technique which has also been found useful
even for discovering high dimensional clusters hierarchically (e.g., Achtert
et al., 2006, 2007a),
• dealing with outliers better through the use of degrees of likelihood instead of a
binary decision (e.g., Breunig et al., 2000),
• improving efficiency by incrementally updating only the neighbourhood
information related to the updated data (e.g., Ester et al., 1998), and
• extending the algorithms to clustering objects in the spatial-temporal domain
(e.g., Birant and Kut, 2007).
Nevertheless, poor quality clusters may still be obtained as a result of the global
density tactic (i.e., the use of a fixed radius), and these algorithms typically still
require O(n log n) time even with the use of the index tree structure discussed above.
Note that, instead of using a global distance measure, utilising hyper-graph
partitioning techniques (e.g., Karypis et al., 1999; Estivill-Castro and Lee, 2000;
Agarwal et al., 2005) has been shown to improve the resultant cluster quality through
relative interconnectivity (c.f. Guha et al., 1999) and closeness concepts (c.f. Guha
et al., 1998).
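The ability to recover non-convex shapes can be illustrated with scikit-learn's DBSCAN on the classic two-moons dataset, where density connectivity separates two interleaved half-moons that k-means cannot (the `eps` and `min_samples` settings below are illustrative choices of our own):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved, non-convex half-moon clusters with a little noise.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# Points with >= min_samples neighbours within radius eps are core points;
# density-connected core points are merged into one cluster.
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

n_clusters = len(set(labels.tolist()) - {-1})  # label -1 marks noise points
print(n_clusters)
```

The fixed radius `eps` is exactly the global density tactic criticised above: an `eps` large enough to bridge the gap between the moons would merge them, while one too small fragments each moon, which is why locally adaptive criteria have been proposed.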
Grid-Based Clustering Algorithms
The reliance on density means that random sampling (c.f. Guha et al., 1998) is not a
practical means of improving scalability. However, density-based clustering algorithms
can be approximated efficiently by algorithms based on grids (e.g., Schikuta
and Erhart, 1997; Wang et al., 1997b; Sheikholeslami et al., 1998). These algorithms
minimise the distance computation requirements by:
• quantising the objects in their original feature space,
• computing statistical distribution summaries for each attribute within each
grid cell in a single scan of the data, i.e., with time complexity of O(n), and then
• hierarchically clustering on the resultant grid information structure instead of
the original objects.
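The three steps above can be sketched as follows. This is an illustrative toy implementation on synthetic data: it uses per-cell counts as the only summary statistic and simple 4-connectivity to merge adjacent dense cells, and all names and thresholds are our own choices rather than any published algorithm:

```python
import numpy as np
from collections import deque

def grid_cluster(X, bins=20, density_min=3):
    """Grid-based sketch: (1) quantise points into cells, (2) summarise each
    cell (here just a count) in one O(n) scan, (3) merge adjacent dense cells
    instead of clustering the original points."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.minimum(((X - mins) / (maxs - mins) * bins).astype(int), bins - 1)
    counts = np.zeros((bins, bins), dtype=int)
    np.add.at(counts, (cells[:, 0], cells[:, 1]), 1)   # single scan of the data

    dense = counts >= density_min
    label = -np.ones_like(counts)                      # -1 = sparse cell / noise
    k = 0
    for i, j in zip(*np.where(dense)):                 # flood-fill dense cells
        if label[i, j] >= 0:
            continue
        queue = deque([(i, j)])
        label[i, j] = k
        while queue:
            a, b = queue.popleft()
            for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                na, nb = a + da, b + db
                if 0 <= na < bins and 0 <= nb < bins and dense[na, nb] and label[na, nb] < 0:
                    label[na, nb] = k
                    queue.append((na, nb))
        k += 1
    return label[cells[:, 0], cells[:, 1]], k          # per-point labels, cluster count

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.3, (300, 2)), rng.normal(3.0, 0.3, (300, 2))])
labels, k = grid_cluster(X)
```

Note that once the counts are built, the merging step touches only the `bins × bins` grid, never the n points again, which is the source of the near-linear overall cost.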
They generally require minimal prior knowledge, can obtain quality clusters (sometimes
even at multiple resolutions (e.g., Sheikholeslami et al., 1998)), are data order
independent, and are robust to outliers (Berkhin, 2006). More importantly, however,
the grid structure can also naturally facilitate parallel/distributed processing (c.f.
Parthasarathy et al., 2007), incremental updating of only the summaries related to
the updated data (Wang et al., 1997b), and work with mixed-type attributes
(Berkhin, 2006). Its use has also been shown to improve algorithm scalability, make
better use of memory, and provide clusters of better quality (as a result of the
algorithms being less dependent on the initialisation) (e.g., Guha et al., 1998; Zhang
et al., 2005; Garg et al., 2006); and it has been found useful for speeding up clustering
algorithms based on kernel density distribution functions that model the overall
density at a point analytically as the sum of the influence functions of the data points
around it (e.g., Hinneburg and Keim, 1998; Hinneburg and Gabriel, 2007). However,
while these algorithms (as well as many recent algorithms) have improved the
scalability requirements for clustering large datasets and the quality of the resulting
clusters, often without the need to predetermine k, they are still not suitable for
clustering the type of high dimensional data that this application is facing (though
they should be adequate for modelling individuals' mobility patterns, as these are
two dimensional). The most pertinent part of our review of clustering follows.
2.3.4 Algorithms for clustering high dimensional data
Curse of Dimensionality Data embedded in a high dimensional space remains
difficult for humans to interpret (Jain et al., 1999) despite some recent development
of visualisation tools (e.g., Lee and Ong, 1996; Kandogan, 2001; Ankerst et al., 1999;
Konig and Gratz, 2004; Ghosh and Strehl, 2004; Assent et al., 2007b). As the data
dimensionality d increases, the sparseness of the data usually increases as a result;
this leads to meaningless similarity measures, which are the foundation of clustering
(Berchtold et al., 1997; Agrawal et al., 1998; Aggarwal et al., 2001), especially when
distance-based proximity similarity is used. This issue is known as the "curse of
dimensionality" (Bellman, 1961, p.94) or the "empty space phenomenon" (Scott, 1992,
p.84). It implies that there is a lack of data separation in high dimensional spaces
(Aggarwal and Yu, 2000; Hastie et al., 2009, pp.22-27) and that nearest neighbours
are not stable (Beyer et al., 1999). This phenomenon also means that algorithms based
on the density notion (c.f. § 2.3.3) are less effective for clustering high dimensional
data, and that outlier detection is more challenging (Aggarwal and Yu, 2001; Yu et al.,
2002) while at the same time being more critical (Hinneburg and Keim, 1999).
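The loss of distance contrast can be demonstrated directly: for uniform random data, the relative gap between a point's nearest and farthest neighbour shrinks as d grows. The sketch below is in the spirit of the observation of Beyer et al. (1999); the sample sizes and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
contrasts = {}
for d in (2, 10, 100, 1000):
    X = rng.random((n, d))                        # n uniform points in [0,1]^d
    dist = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from one reference point
    # Relative contrast: how much farther the farthest neighbour is than the nearest.
    contrasts[d] = (dist.max() - dist.min()) / dist.min()

print({d: round(c, 2) for d, c in contrasts.items()})
```

As d increases the contrast collapses towards zero, so "nearest" and "farthest" become nearly indistinguishable, which is precisely why distance-based similarity (and hence density-based clustering) degrades in high dimensions.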
Recall also that, in § 2.3.3, many of the more recent algorithms make use of (spatial)
index tree structures for scalability improvements. However, their effectiveness has
been shown to degrade rapidly for dimensions d > 10; that is, having an index is no
better than simply performing sequential searches (Weber et al., 1998; Beyer et al.,
1999; Chakrabarti and Mehrotra, 1999). While some developments in index data
structures have achieved higher dimensional limits (e.g., Berchtold et al., 1996, 1998)
or have even been extended to the spatial-temporal (e.g., Zhang et al., 2003) or
multi-dimensional (e.g., Gaede and Gunther, 1998; Bohm et al., 2001) domain, they
typically still require superlinear overall runtime complexity (Bohm et al., 2000).
Additionally, they are generally still somewhat limited in their usefulness for
clustering 'real' high dimensional data (i.e., d ≫ 20), though the index data structure
itself is still an active research area (e.g., Houle and Sakuma, 2005; Manolopoulos
et al., 2005).
On the other hand, there are a small number of algorithms based on the use of grids
(e.g., Sheikholeslami et al., 1998; Hinneburg and Keim, 1998), random sampling
(e.g., Guha et al., 1998), and the fractal dimension (e.g., Barbara and Chen, 2000)
(not discussed here due to its uniqueness) that have been shown, or are believed, to
work somewhat better with high dimensional data (i.e., d ∼ 20). However, they too
increasingly lose effectiveness (and become more sensitive to noise) as the dimension
d increases, since clusters are likely to be spread over many grid cells, with many of
those cells actually being empty (Hinneburg and Keim, 1999). Even so, obtaining
(non-axis-parallel) 'optimal' grid partitions, by recursively and divisively cutting low
density regions and maximising the discrimination between clusters through a set of
(contracting) projections, has been shown to be useful for analysing high dimensional
data (Hinneburg and Keim, 1999).
Dimension Reduction One approach to this high dimensional problem is to first
reduce the dimensionality of the data prior to clustering; feature transformation/
extraction, and feature selection are two logical methods (Han and Kamber, 2006,
pp.435-436).
Feature transformation projects the data into a smaller space while seeking to maintain
the original relative distances between objects. Principal component analysis
(PCA) (Jolliffe, 2002) is one of the most popular techniques, and was also suggested by
Sheikholeslami et al. (1998) to address their algorithm's shortfall in clustering 'real'
high dimensional data. However, PCA is only suitable for projecting Gaussian
distributions (Cherkassky and Mulier, 2007, p.204); it is relatively computationally
expensive (Witten and Frank, 2005, p.309), sensitive to noise, and often misses
interesting details (Volkovich et al., 2006). Not surprisingly, PCA has been shown to be
an ineffective technique for clustering high dimensional data (Kriegel et al., 2009).
In fact, Chang (1983) argued that the components with the largest eigenvalues do not
necessarily carry the most cluster information, which makes PCA unsuitable for
clustering high dimensional data; PCA was also not recommended by Wedel and
Kamakura (1998, p.59) for customer segmentation more generally.
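Chang's point can be reproduced on synthetic data: when the cluster separation lies along a low-variance direction, the first principal component follows the high-variance axis and the projected data lose the cluster structure. The sketch below is our own illustrative construction, not Chang's original example:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
y = np.where(rng.random(n) < 0.5, -2.0, 2.0)       # true cluster assignment
X = np.column_stack([
    rng.normal(0.0, 10.0, n),                      # high-variance axis, no clusters
    y + rng.normal(0.0, 0.5, n),                   # low-variance axis, clear separation
])

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD of centred data
pc1 = Xc @ Vt[0]                                   # scores on the first component

# Separation between cluster means, before and after projecting onto PC1.
sep_orig = abs(X[y > 0, 1].mean() - X[y < 0, 1].mean())   # roughly 4 on axis 2
sep_pc1 = abs(pc1[y > 0].mean() - pc1[y < 0].mean())      # small relative to pc1 scale
assert sep_pc1 / pc1.std() < 0.5 < sep_orig / X[:, 1].std()
```

PC1 captures almost all of the total variance here yet almost none of the cluster structure; a clustering algorithm run on the PC1 scores alone would see a single blob.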
There are many more recent linear techniques, such as independent component
analysis (ICA), projection pursuit, random projections (RP), singular value
decomposition (SVD) (c.f. Xu and Wunsch II, 2005; Hastie et al., 2009) and the wavelet
transformation (WT) (Murtagh et al., 2000), which have shown improvements over PCA.
However, it has also been shown that, for example, SVD is unable to achieve any 'real'
dimensionality reduction (Agrawal et al., 1998); and RP, while useful for nearest
neighbour search as in the case of index tree structures (Jain, 2010), can result in
highly unstable clusters (Fern and Brodley, 2003). Their ineffectiveness is the result
of not removing irrelevant attributes (Parsons et al., 2004) and of not taking into
account that there may be different feature correlations for different clusters (Kriegel
et al., 2009). Perhaps more importantly, at least for our application, transformations
generate results with poor interpretability, and interpretability is critical to clustering
(Agrawal et al., 1998). Of course, this also means that it is not practical to consider
computationally infeasible kernel or non-linear transformation techniques as a
preprocessing step for clustering high dimensional data (c.f. Xu and Wunsch II, 2005).
It is worth pointing out, though, that some recent PCA-based (e.g., Bohm et al., 2004b;
Tung et al., 2005; Kriegel et al., 2008) and SVD-based (e.g., Agarwal and Mustafa,
2004) clustering algorithms have been shown to discover even arbitrarily shaped high
dimensional clusters, and some success has been reported with the use of the Hough
transform (e.g., Achtert et al., 2008).
On the other hand, feature selection (Guyon and Elisseeff, 2003) aims to reduce
the number of attributes in a dataset by removing irrelevant/redundant dimensions
(globally); it has been shown, more generally, to be able to improve prediction per-
formance, stability, and interpretation (Parsons et al., 2004). Note that most of these
techniques were designed for supervised learning (Parsons et al., 2004). While there exists no universally accepted approach for measuring (unsupervised) clustering accuracy, and hence for guiding attribute selection, a number of methods (e.g., entropy analysis) have been shown to be useful (Parsons et al., 2004). Unfortunately, these techniques are typically iterative in nature and hence not scalable (Parsons et al., 2004); and they can cause the loss of important information or even distort the real clusters (Aggarwal et al., 1999; Xu and Wunsch II, 2005). That is, feature selection, as typically performed, generally cannot overcome the challenges of clustering high dimensional data (Kriegel et al., 2009). Interestingly, however, some success was shown, without any dimensionality reduction, by McCallum et al. (2000), who first divided the data into overlapping subsets, known as canopies, prior to clustering; this technique is more generally known as domain decomposition.
Subspace Clusters Fortunately, despite all the challenges faced in clustering high dimensional data, such data usually have an intrinsic dimensionality (Jain and Dubes, 1988, pp.42-46) that is much lower than the original dimensionality
(Cherkassky and Mulier, 2007, p.178). That is, as observed by Agrawal et al. (1998), usually only a small number of dimensions (i.e., subspaces) are relevant to a given cluster, whilst noisy signals contribute information in the remaining, unwanted dimensions. This also implies that the number of unwanted attributes grows with the dimensionality, as objects are increasingly likely to be located in different dimensional subspaces (Berkhin, 2006). This phenomenon is sometimes referred to as “local feature relevance” or “local feature correlation” (Kriegel et al., 2009, p.5). That is, the algorithms discussed previously are not effective because they were developed to discover clusters in the full dimensional space (Agrawal et al., 1998), and it is not feasible to obtain clusters by searching through all possible combinations of subspaces (i.e., different combinations of features) (Parsons et al., 2004). Consequently, the challenge becomes being able to instead search (in a localised way) effectively and efficiently for groups of clusters within different subspaces of the same dataset.
Besides (sequential) pattern-based clustering (see the short discussion separately below), algorithms that aim to discover subspace clusters are often divided into two categories, subspace clustering algorithms and projected clustering algorithms (Parsons et al., 2004), although the distinction is sometimes not clear (Kriegel et al., 2009).
• Subspace clustering algorithms aim to find all subspaces in which clusters can be identified (Kriegel et al., 2009); thus their solutions can overlap significantly, since they aim to discover all clusters in all subspaces.
• In contrast, projected clustering algorithms aim to find an assignment of each point to a subspace (or subspaces) (Kriegel et al., 2009). They typically report non-overlapping clusters (i.e., a unique assignment for each object) and thus are sometimes referred to as partition-based clustering algorithms (Liu et al., 2009).
Note that, unlike subspace clustering algorithms, which always follow a bottom-up search approach, projected clustering algorithms often adopt a top-down approach to discovering the clusters (Kriegel et al., 2009); though, as pointed out by Kriegel et al. (2009), the distinction between the two should not be simplified as being between bottom-up ‘dimension-growth’ subspace algorithms and top-down ‘dimension-reduction’ projected algorithms, as done in Han and Kamber (2006, p.434). Nevertheless, Han and Kamber (2006, Chapter 7) provides the first ‘textbook’ review of high dimensional numerical data clustering; nearly all work in this research direction has focused on algorithm development, with the surveys of Parsons et al. (2004) and Kriegel et al. (2009) being notable exceptions.
Most of these relatively recent algorithms are axis-parallel in the sense that they do not focus on finding arbitrarily shaped clusters (Kriegel et al., 2009); algorithms that are non-axis-parallel are referred to separately as correlation clustering algorithms in Kriegel et al. (2009). Axis-parallel algorithms have the advantage of a restricted search space, though one that is still O(2^d) with d the data dimensionality (Kriegel et al., 2009), and the clusters found are more meaningful for business applications such as this one. Note that many of these algorithms (e.g., Ng et al., 2005; Moise et al., 2008) can still discover arbitrarily shaped clusters, but the clusters are reported in a hyper-rectangle format. Consequently, this review concentrates on the axis-parallel algorithms. Otherwise, it is worth pointing out that Kailing et al. (2003) proposed an approach for ranking interesting subspaces and showed some success for clustering high dimensional data.
Pattern-based Clustering Algorithms
As mentioned previously, pattern-based clustering (also known as bi-clustering
(Cheng and Church, 2000) or co-clustering, for example) algorithms also aim to dis-
cover subspace clusters (Kriegel et al., 2009). They, in contrast, focus on clustering
categorical data with application domains such as microarray (e.g., gene expression)
data (e.g., Jiang et al., 2004; Madeira and Oliveira, 2004; Van Mechelen et al., 2004;
Tanay et al., 2006). They have also been utilised for discovering association rules (Agrawal et al., 1993; Agrawal and Srikant, 1994) among transactions/webs/texts and are useful for (e-commerce and retail) business applications such as recommender systems (e.g., Cho and Kim, 2004; Cho et al., 2002; Kim and Yum, 2005; Lee et al., 2002; Li et al., 2005; Wang et al., 2004). However, since our research focus is primarily on clustering numerical data, this particular topic is not reviewed further.
Subspace Clustering Algorithms
Recall that subspace clustering algorithms aim to identify all subspace clusters in all subspaces (Kriegel et al., 2009). To avoid exhaustively searching through all possible subspaces, they employ a bottom-up strategy based on the downward closure (also known as monotonicity) property of density. This property is based on the lemma that if a d′-dimensional unit is dense (i.e., contains a cluster or clusters), then so are its projections in any (d′ − 1)-dimensional subspace (Agrawal et al., 1998). That is, if a cluster is found in subspace S, it must also be found in every subspace S′ ⊆ S (Kriegel et al., 2009). Accordingly, subspace clustering algorithms:
• first identify the dense regions/units in each dimension; this is generally based on the use of a histogram with a predefined number of bins and a density threshold parameter; and
• then use the dimensions that contain dense regions to form clusters by combining adjacent dense regions; this ‘integration’ step typically, though not always (e.g., Liu et al., 2009), utilises an algorithm similar to Apriori (Agrawal et al., 1993; Agrawal and Srikant, 1994), which was developed for market basket analysis (i.e., for searching for frequent itemsets in transactional databases).
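The two steps above can be sketched as follows. This is a toy, NumPy-based illustration of the bottom-up strategy with hypothetical data, bin counts and thresholds, showing only one Apriori-like join level; it is not a faithful implementation of CLIQUE or any other cited algorithm.

```python
import numpy as np
from itertools import combinations

def dense_units_1d(X, n_bins, density_threshold):
    """Step 1: per-dimension histograms over [0, 1]; a unit (dim, bin) is
    dense if it contains more than density_threshold points. Both n_bins
    and density_threshold are user-supplied, as in CLIQUE-style methods."""
    units = []
    for dim in range(X.shape[1]):
        counts, edges = np.histogram(X[:, dim], bins=n_bins, range=(0.0, 1.0))
        for b in range(n_bins):
            if counts[b] > density_threshold:
                units.append((dim, b, edges[b], edges[b + 1]))
    return units

def dense_units_2d(X, units_1d, density_threshold):
    """Step 2 (one Apriori-like join level): combine dense 1-D units from
    different dimensions and keep the dense 2-D candidates; by downward
    closure, a dense 2-D unit must project onto dense 1-D units."""
    dense = []
    for (d1, b1, lo1, hi1), (d2, b2, lo2, hi2) in combinations(units_1d, 2):
        if d1 == d2:
            continue
        inside = ((X[:, d1] >= lo1) & (X[:, d1] < hi1) &
                  (X[:, d2] >= lo2) & (X[:, d2] < hi2))
        if inside.sum() > density_threshold:
            dense.append(((d1, b1), (d2, b2)))
    return dense

# Hypothetical data scaled to [0, 1]: a tight cluster living only in the
# subspace of dimensions 0 and 1, plus uniform background noise.
rng = np.random.default_rng(1)
cluster = np.column_stack([rng.normal(0.45, 0.01, 80),
                           rng.normal(0.45, 0.01, 80),
                           rng.uniform(0.0, 1.0, 80)])
noise = rng.uniform(0.0, 1.0, size=(20, 3))
X = np.vstack([cluster, noise])
u1 = dense_units_1d(X, n_bins=10, density_threshold=30)
u2 = dense_units_2d(X, u1, density_threshold=30)
```

Here the dense 1-D units are found only in dimensions 0 and 1 (dimension 2 is uniform), and the join step recovers a dense 2-D unit in that subspace; a full algorithm would iterate the join to higher dimensionalities.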
The first and most well-known algorithm of this kind is CLIQUE (Agrawal et al., 1998), which is innovative and has had a significant influence on categorical data clustering (i.e., pattern-based clustering algorithms) (e.g., Ganti et al., 1999a; Cheng and Church, 2000; Wang et al., 2002a; Yang et al., 2002; Zaki et al., 2005).
The utilisation of an axis-parallel grid (c.f. histograms) implies that these algorithms are scalable, insensitive to the order of the records, and make no assumptions about the data distribution. The process of connecting dense regions means that they can handle somewhat arbitrarily shaped clusters, and they focus on producing clusters with good interpretability rather than accurate cluster shapes. However, they may produce a fairly large number of overlapping clusters, many of them projections of higher dimensional clusters (Moise and Sander, 2008), which can make interpretation of the results more complicated (Aggarwal et al., 1999). Additionally, they may consider a relatively large number of objects to be outliers, which may not be acceptable for some applications (Aggarwal et al., 1999).
Many improvements have been made to CLIQUE. The quality of the clusters has been shown to improve, for example:
• by pruning unwanted subspaces with entropy (e.g., Cheng et al., 1999) instead of the minimal description length (MDL) (Rissanen, 1983) used in CLIQUE;
• by using an adaptive grid instead of a static grid with a fixed interval size (e.g., Nagesh et al., 2000), with the bin cut-points on each dimension analysed based on histograms (Nagesh et al., 2000; Chang and Jin, 2002); this strategy can eliminate the use of pruning techniques that could result in missed clusters (Nagesh et al., 2000);
• by allowing histogram bins to overlap (e.g., Liu et al., 2007, 2009); and
• by varying the density threshold parameter, either globally (e.g., Sequeira and Zaki, 2004) or more adaptively in the sense of taking the dimensionality into consideration (e.g., Assent et al., 2007a).
In terms of computational requirements, CLIQUE scales linearly with the size of the input (Agrawal et al., 1998). Although its complete time complexity is data dependent, its most computationally demanding step is O(n·dImax + c^dImax), with n the number of observations, dImax the highest cluster intrinsic dimensionality (which is generally ≪ d, with d the full data dimensionality) and c a constant (Agrawal et al., 1998). However, despite the fact that subspace clustering algorithms are generally already faster than projected clustering algorithms (to be discussed in § 2.3.4) (Parsons et al., 2004), efficient techniques can still be utilised to further improve their performance, as we discuss in more detail below; note that this is typically still the case even though many projected clustering algorithms have already adopted some sort of random sampling strategy (Moise and Sander, 2008) to improve scalability. For example, by:
• allowing the algorithm to perform in a parallel/distributed fashion (e.g.,
Nagesh et al., 2000), or
• adopting a ‘filter(-refinement) architecture’ which can approximate clusters without performing the worst-case search procedure; rather than merging dense regions in the typical Apriori style, one-dimensional dense regions, so-called base-clusters, are grouped through the use of a modified DBSCAN (c.f. § 2.3.3) (Kriegel et al., 2005).
However, even though these algorithms do not require the number of clusters k to be predetermined (Agrawal et al., 1998), they have been found to be generally rather sensitive to their input parameter values, which are difficult to determine; though this is generally the case for most high dimensional data clustering algorithms (Parsons et al., 2004; Yip et al., 2005; Moise et al., 2008; Kriegel et al., 2009). It is worth pointing out that while most subspace clustering algorithms are grid-based, some algorithms (e.g., Kailing et al., 2004; Assent et al., 2007a) are built on the notion of density, for example by utilising DBSCAN (c.f. § 2.3.3) instead of histograms for identifying the dense regions of each dimension; these algorithms can produce better clustering results but require more computation.
Projected Clustering Algorithms
In contrast to the subspace clustering algorithms discussed above, projected clustering algorithms typically cluster data in a top-down, partition-like fashion. Generally speaking, they aim to
• first locate an initial approximation of the clusters in the full set of equally weighted dimensions (Parsons et al., 2004); and
• then adjust the feature weightings and evaluate the subspaces of each cluster iteratively (Parsons et al., 2004); though some algorithms (e.g., Friedman and Meulman, 2004; Achtert et al., 2007a) adjust each dimension's weight for each instance.
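The two-step loop above can be sketched as follows: a toy, hypothetical example that iterates k-means-style assignments under per-cluster feature weights, re-estimating the weights from the within-cluster spread of each dimension. It is written in the spirit of the top-down approach rather than as an implementation of any one published algorithm.

```python
import numpy as np

def projected_kmeans(X, k, n_iter=20, seed=0):
    """Toy top-down projected-clustering loop: assign points to centres
    under per-cluster weighted distances, then re-estimate each cluster's
    dimension weights so that dimensions with small within-cluster spread
    receive large weight. A hypothetical sketch only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centres = X[rng.choice(n, k, replace=False)]
    weights = np.full((k, d), 1.0 / d)            # start with equal weights
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assignment step: weighted squared Euclidean distance per cluster
        dist = np.stack([((X - centres[j]) ** 2 * weights[j]).sum(axis=1)
                         for j in range(k)], axis=1)
        labels = dist.argmin(axis=1)
        # update step: recompute centres and dimension weights per cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            centres[j] = members.mean(axis=0)
            spread = members.var(axis=0) + 1e-9
            w = np.exp(-spread / spread.mean())   # small spread -> big weight
            weights[j] = w / w.sum()
    return labels, centres, weights

# Hypothetical data: two groups separated in dimension 0, noisy dimension 1.
rng = np.random.default_rng(7)
g1 = np.column_stack([rng.normal(0.0, 0.05, 50), rng.uniform(0, 1, 50)])
g2 = np.column_stack([rng.normal(5.0, 0.05, 50), rng.uniform(0, 1, 50)])
X = np.vstack([g1, g2])
labels, centres, weights = projected_kmeans(X, k=2)
```

The weight update is deliberately simple; published algorithms differ in how the per-cluster subspaces or weights are derived, but the alternation between assignment and subspace evaluation is the common pattern.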
Accordingly, these algorithms typically use some measure, either explicitly or implicitly, of similarity between attributes or observations of interest (Moise and Sander, 2008). They are, by design, computationally more demanding than the subspace clustering approaches. Additionally, they may require the number of clusters k as an input (e.g., Bohm et al., 2004a; Friedman and Meulman, 2004) or even the average cluster dimensionality dIavg (e.g., Aggarwal et al., 1999; Aggarwal and Yu, 2000).
In spite of this, projected clustering algorithms can often obtain clusters of better quality than the subspace clustering approach; the subspace clustering approach typically utilises global density thresholds, which are problematic because density decreases with increasing dimensionality (Parsons et al., 2004; Moise et al., 2008; Moise and Sander, 2008). As pointed out by Moise and Sander (2008), if the global density threshold is chosen to be too large, this will encourage the formation of only low dimensional clusters; if it is too small, many outlying clusters will be found in addition to the real clusters of higher dimensionality. As to the usefulness of the results, some researchers (e.g., Aggarwal et al., 1999) believe that non-overlapping clusters, as often obtained by projected clustering algorithms, provide clearer cluster interpretations, while others argue that they result in the loss of useful information (e.g., Kriegel et al., 2005). Interestingly, however, the few projected clustering algorithms (Procopiuc et al., 2002; Moise et al., 2008) that can produce either type of solution have shown some improvements in scalability and cluster quality when obtaining clusters that do not overlap significantly.
The first two projected clustering algorithms are:
• the axis-parallel PROCLUS (Aggarwal et al., 1999) which aims to group objects
located closely in each of the related dimensions in its associated subspace,
and
• the ‘improved’ non-axis-parallel ORCLUS, which aims to discover arbitrarily shaped clusters (Aggarwal and Yu, 2000).
However, while both algorithms were designed for high dimensional data, they both behave like k-medoid algorithms (Moise et al., 2008). This is because these algorithms initialise the clusters based on distance calculations involving all the dimensions (Moise et al., 2008). Accordingly, these two algorithms tend to produce similarly sized clusters with spherical shapes (Kriegel et al., 2009), as in the case of classical clustering algorithms (c.f. § 2.3.2). Additionally, their use of the mean square error (MSE) as an objective function is also considered problematic, since using fewer dimensions will always result in a better MSE (Moise et al., 2008). Furthermore, their use of random sampling (c.f. Guha et al., 1998) for cluster initialisation means that they can miss interesting clusters and obtain different results each time (Kriegel et al., 2009); though many more recent algorithms (e.g., Procopiuc et al., 2002; Woo et al., 2004; Yiu and Mamoulis, 2005) still adopt random sampling as a result of the general scalability challenges of the projected clustering approach (Parsons et al., 2004). Interestingly, the relatively computationally demanding decision tree approach has also been considered for projected clustering (e.g., Liu et al., 2000): a cluster split is determined by evaluating the information gain, with the calculation based on labelling all observations with a common class and ‘adding’ uniformly distributed data with a different class.
Locality Assumption It is important to point out that projected clustering algorithms often assume that the subspace of each cluster can be determined locally, i.e., based on the local neighbourhood corresponding to members of the cluster or to the cluster representatives (Kriegel et al., 2009); this is despite the analysed data being high dimensional. In other words, many projected clustering algorithms (e.g., Friedman and Meulman, 2004) aim to derive insights, for example the ‘true’ subspace of an observation, through its nearest neighbours (c.f. density-based clustering algorithms; § 2.3.3). In fact, such a tactic has also been utilised even by algorithms (e.g., Achtert et al., 2007c,b) that aim to uncover arbitrarily shaped subspace clusters. However, this implies that the ‘definitions’ of distance and of nearest neighbours clearly need to be revisited. Some ideas are discussed below.
As in the case of subspace clustering algorithms, the well-known DBSCAN (c.f. § 2.3.3) has also been utilised. However, instead of applying DBSCAN to identify the dense regions of each dimension, as done by some subspace clustering algorithms, some projected clustering algorithms (e.g., Bohm et al., 2004a,b) have modified DBSCAN so that it can cluster the data full-dimensionally. The key modification, of course, is the calculation of the distance between observations. Instead of using the simple Euclidean distance, which clearly suffers from the curse of dimensionality, they propose using weighted Euclidean distances based essentially only on the preferred dimensions (Bohm et al., 2004a) or eigenbases (Bohm et al., 2004b) of the observations.
Alternatively, a specialised distance measure, the dimension oriented distance (DoD), was introduced by FINDIT (Woo et al., 2004) for determining an observation's/instance's nearest neighbours, and it appears to be robust. The DoD measures the similarity between two observations/instances by counting the number of dimensions in which their Manhattan distance is less than a given ε; the actual value of the Manhattan distance is not important, it is simply utilised to determine whether the two observations/instances are ‘close enough’ with respect to that particular attribute. Obviously, the largest DoD implies that the two observations/instances are the ‘nearest neighbours’. FINDIT then proposes a ‘dimensional voting policy’ for determining an observation's/instance's likely subspace, i.e., its correlated dimensions; the relevant dimensions are determined by the number of its neighbours that are considered to be ‘close’. Such a strategy is perhaps an improvement over the use of random sampling (e.g., Procopiuc et al., 2002); however, fixing the number of reliable ‘voters’ may be somewhat restrictive.
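The counting idea behind the DoD can be sketched in a few lines; the observations and the value of ε below are hypothetical, and this is an illustration of the measure as described above rather than of FINDIT as a whole.

```python
import numpy as np

def dod(a, b, eps):
    """Dimension-oriented similarity in the spirit of FINDIT's DoD: count
    the dimensions in which two observations differ by less than eps; the
    magnitude of the difference is otherwise ignored. A larger count means
    the observations are 'nearer'."""
    return int(np.sum(np.abs(np.asarray(a) - np.asarray(b)) < eps))

a = [0.10, 0.50, 0.90, 0.30]
b = [0.12, 0.55, 0.10, 0.31]
# dimensions 0, 1 and 3 agree to within eps = 0.1; dimension 2 does not
similarity = dod(a, b, eps=0.1)
```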
Bottom-up Strategy Some projected clustering algorithms (e.g., Moise et al., 2008), conversely, have adopted a bottom-up strategy, as in the case of subspace clustering algorithms such as CLIQUE. However, not all of them adopt this for efficiency reasons. HARP (Yip et al., 2004) is one such example; it is the first algorithm that aims to produce a cluster hierarchy and can eliminate the need for the number of clusters k to be predetermined. In essence, HARP is a slow, single-link-like agglomerative hierarchical clustering algorithm (Kriegel et al., 2009) which merges observations/clusters based on the proposed ‘relevance score’ with respect to the dimensions. While HARP has trouble identifying low dimensional clusters, as is the case for most other algorithms (Kriegel et al., 2009), its ‘related’ algorithm, the k-medoid-like top-down SSPC (Yip et al., 2005), has been shown to be able to discover clusters of low dimensionality; SSPC successfully avoids extensive distance calculations involving all dimensions by basing its objective function on HARP's ‘relevance score’ measure. Nonetheless, it is worthwhile pointing out that some recent top-down algorithms (e.g., Achtert et al., 2007a) can also obtain a cluster hierarchy. SSPC is also rather unique in being a semi-supervised algorithm, though there appears to be growing interest in clustering within the semi-supervised framework (Chapelle et al., 2010).
Other bottom-up projected clustering algorithms include EPCH (Ng et al., 2005) and P3C (Moise et al., 2008) which, as in the case of HARP, also conveniently do not require the number of clusters k as an input. Both algorithms assume that clusters' low dimensional projections will ‘stand out’ (Moise and Sander, 2008), similar to the subspace clustering algorithms. However, rather than identifying the low dimensional dense regions using a predefined global density threshold, as done by most subspace clustering algorithms, they adopt different approaches: EPCH iteratively lowers the threshold, whereas P3C applies a chi-square goodness-of-fit test to examine whether an attribute's observations are uniformly distributed across the histogram bins.
Yet the processes by which EPCH and P3C obtain the subspace clusters are quite different. In EPCH, after all low dimensional dense regions have been identified, a ‘signature’ is derived for each observation, recording in which dense regions the object is found. By comparing the ‘signature’ coefficients, the degree of similarity between two observations is determined; similar objects/clusters are merged until at most a user-specified number of clusters is obtained. Many similarity ‘rules’ have been introduced by the algorithm. On the other hand, P3C locates ‘cluster cores’, initial maximal-dimensional subspace cluster approximations, by merging dense regions with an Apriori-like process. The ‘cluster cores’ are then iteratively refined with an EM-like procedure, and observations are assigned to the nearest ‘cluster core’ based on the Mahalanobis distance (MD) (Mahalanobis, 1936).
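The P3C-style uniformity screening can be sketched as follows. The bin count, critical value and data below are illustrative assumptions (27.88 is approximately the chi-square critical value for 9 degrees of freedom at the 0.001 level, matching the 10 bins used here); this is a flavour of the idea, not the published P3C procedure.

```python
import numpy as np

def nonuniform_dense_bins(x, n_bins=10, crit=27.88):
    """Flag an attribute whose histogram departs from uniformity (chi-square
    goodness-of-fit statistic above a critical value), and return the bins
    exceeding the uniform expectation as candidate dense regions."""
    counts, _ = np.histogram(x, bins=n_bins)
    expected = len(x) / n_bins
    stat = ((counts - expected) ** 2 / expected).sum()
    if stat <= crit:
        return []  # attribute looks uniform: no candidate dense bins
    return [b for b in range(n_bins) if counts[b] > expected]

# A uniform attribute yields no candidate bins; an attribute with a spike
# around 0.45 yields only the bin containing the spike.
uniform_attr = np.linspace(0.0, 1.0, 1000)
spiked_attr = np.concatenate([np.full(500, 0.45), np.linspace(0.0, 1.0, 500)])
```

Only the attributes (and bins) that survive this screening would then enter the Apriori-like merging of dense regions described above.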
Interestingly, for completeness of this review of projected clustering algorithms, note that there exist some relatively efficient but unique (top-down) algorithms (e.g., Procopiuc et al., 2002; Yiu and Mamoulis, 2005) that do not produce all clusters at the same time; rather, they discover clusters in a sequential manner. However, these algorithms are based on the inappropriate assumption that clusters are hypercube-like with a fixed width for all attributes (Moise et al., 2008), and they usually have problems with clusters of significantly different dimensionality (Kriegel et al., 2009). The paper of Yiu and Mamoulis (2005) is perhaps more significant in showing the suitability of, and the algorithm scalability improvements achievable by, adopting FP-Growth (Han et al., 2000) instead of an Apriori-like process; this tactic can be, and has already been, adopted by some subspace clustering algorithms (e.g., Liu et al., 2009).
Overall, the iterative nature of the projected clustering algorithms, and the use of random sampling by some, imply that they are not suitable for real-time applications; their time complexity requirements are also more challenging to determine, and algorithms such as P3C do not specify their time complexity. Experimental evaluations (Moise et al., 2008; Moise and Sander, 2008) comparing P3C with the subspace clustering algorithm MAFIA, which is considered faster than CLIQUE, indicate that P3C (like many other projected clustering algorithms) is about 10–100 times slower than MAFIA, but about 10–100 times faster than PRIM (c.f. bump hunting) (Friedman and Fisher, 1999). On the other hand, while projected clustering algorithms may miss clusters (Kriegel et al., 2009), and there are no agreed cluster quality evaluation criteria (Patrikainen and Meila, 2006), it appears that the parameter-robust P3C and SSPC generally perform well, particularly in identifying clusters with the lower intrinsic dimensionality that is more realistic in the real world (Yip et al., 2004; Cherkassky and Mulier, 2007, p.178). Moreover, only P3C is also suitable for clustering categorical/mixed attribute type data, which could be useful for extending this application.
2.3.5 Review conclusion
Classical algorithms such as k-means (KM) (MacQueen, 1967) have been utilised widely in various applications; KM will be applied in Chapter 5 for segmenting customer behaviour (c.f. Wedel and Kamakura, 1998, Chapter 5). KM's limitations, and more generally the clustering subject itself, have been well documented (e.g., Han and Kamber, 2006, Chapter 7). In the 1990s, there was a significant amount of research into developing more scalable clustering algorithms capable of discovering non-convex shaped clusters without the need to predetermine the number of clusters k, which in turn addressed some earlier algorithm shortfalls. DBSCAN (Ester et al., 1996) is perhaps the most influential algorithm of that period; it has recently been shown to be useful for identifying individuals' highly visited locations (Nurmi and Koolwaaij, 2006), an application which is somewhat similar to one of this research's objectives, and its effectiveness in comparison to mixture models will be examined in Chapter 4.
Since 1998, many efficient algorithms suitable for identifying hidden subspace clusters in a high dimensional space have been proposed based on geometric considerations for avoiding exhaustive subspace searches. The majority of them can be described as grid-based; they utilise histograms (c.f. grids) as a density estimation tool for identifying the low dimensional dense regions that correspond to the low dimensional projections of the subspace clusters. However, the effectiveness of these grid-based algorithms depends on the granularity and the positioning of the grid (Kriegel et al., 2009). The well-known Sturges' rule (Sturges, 1926), which determines the number of bins as (1 + log2(n)) with n the number of observations, has been used frequently by algorithms such as P3C (Moise et al., 2008); this rule, however, has been shown to be ineffective for n > 100 or 200 (Hyndman, 1995; Scott, 2009). An alternative approach to clustering high dimensional data will be investigated in Chapter 6; it may be useful to model the low dimensional density distributions with mixture models instead of the common grid/histogram approach. Finally, it is worthwhile pointing out that Moise and Sander (2008) have recently proposed an approach for determining whether a subspace cluster is statistically significant; this approach is related to scan statistics (Agarwal et al., 2006) and bump hunting (Friedman and Fisher, 1999).
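For reference, Sturges' rule is trivial to state in code (a minimal, illustrative implementation only):

```python
import math

def sturges_bins(n):
    """Sturges' rule: suggested number of histogram bins for n observations."""
    return math.ceil(1 + math.log2(n))
```

Because the bin count grows only logarithmically (e.g., 8 bins for 100 observations but still only 21 for around a million), large samples are summarised with very coarse histograms, which is consistent with the criticism of the rule noted above.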
3 Variational Bayesian Method: Component Elimination, Initialization & Circular Data
Abstract
The recently popular variational Bayesian (VB) method is an efficient non-
simulation based alternative to Bayesian approaches such as the Markov chain
Monte Carlo (MCMC) method. A key practical advantage of VB in fitting data with
a Gaussian mixture model (GMM) is its ability to effectively and progressively elim-
inate redundant components specified in the initial model thereby simultaneously
estimating model complexity and parameters. In this paper, we consider the poten-
tial implications of this irreversible VB property. We then outline an extension of the
VB approach to modeling circular data represented by a truncated GMM. We con-
sider the usefulness and effectiveness of this approach and evaluate how different
observation component allocation initialization schemes may influence results.
Keywords
Variational Bayes (VB); Gaussian Mixture Model (GMM); Component Elimination;
Initialization; Directional/Circular Statistics
3.1 Introduction
Mixture models provide a convenient, flexible way to model data; a popular and
computationally efficient approach is to use Gaussian mixture models (GMMs)
(McLachlan and Peel, 2000). In this paper, we investigate the increasingly popular
variational Bayesian (VB) method for GMMs which has been shown to approximate
Bayesian posterior distributions efficiently and has already been used in various ap-
plications (Wang and Titterington, 2006). One of VB's unique features for mixture modeling is its automatic redundant component elimination property; we demonstrate the usefulness of this feature, as well as its generally overlooked implications, in a one-dimensional scenario. We base results on the algorithm of McGrory and Titterington (2007). In addition, we extend this one-dimensional VB-GMM approach to approximate the distribution of real world circular data. Our empirical results reveal that such an approach is generally sufficient; the implications of several different observation component allocation initialization schemes are also evaluated.
Our application dataset comprises the temporal usage patterns of phone customers
over a 24-hour period; being able to summarize each user’s behavior more formally
can provide businesses with the means to obtain better customer behavioral under-
standing, profiling/differentiation and hence better customer relationships (c.f. Wu
et al., 2010a).
The deterministic VB method for Bayesian inference was first formally outlined in
Attias (1999). It is more efficient in terms of both computation and storage require-
ments than most other approximate Bayesian approaches such as Markov chain
Monte Carlo (MCMC). Additionally, unlike Monte Carlo based approaches, VB does
not suffer from the mixing or label switching problem (c.f. Celeux et al., 2000), or the
difficulties with assessing convergence (Jaakkola and Jordan, 2000; Wang and Titter-
ington, 2006). Being a Bayesian method, VB suffers less from the over-fitting and singularity problems that persist in maximum likelihood (ML) approaches (Attias, 1999); and it has been shown theoretically to be asymptotically consistent in approximating mixture models with a fixed number of components k (Wang and Titterington, 2006). Given that a central issue in mixture modeling is the selection of a suitable k (McLachlan and Peel, 2000), a key practical advantage of VB over ML approaches is its ability to automatically select the k giving the ‘best’ fit to the data according to the variational optimization, while simultaneously estimating the model parameter values and their posterior distributions (c.f. Richardson and Green, 1997).
We use the term standard VB-based algorithms to refer to those algorithms that
do not allow components to be split and/or merged. These algorithms select k
through the complexity reduction property of the VB approximation; this property
leads to the progressive elimination of redundant components (as their component
weights tend towards zero) that were specified in the initial model during conver-
gence (McGrory and Titterington, 2007). Note that this implies that the final k, kfinal,
in the model cannot be greater than the initial specification of k, kinitial; though it
is worthwhile to mention that non-standard algorithms (e.g., Ghahramani and Beal,
1999; Ueda and Ghahramani, 2002; Constantinopoulos and Likas, 2007; Wu et al.,
2010b) which allow components to be split do not face such limitations. This auto-
matic feature of the approximation has been observed by many researchers (e.g., At-
tias, 1999; Corduneanu and Bishop, 2001; McGrory and Titterington, 2007), and has
been shown to perform as well or better than the use of expectation-maximization
(EM) algorithm (Dempster et al., 1977) with Bayesian information criterion (BIC)
(Schwarz, 1978) (e.g., Watanabe et al., 2002; Teschendorff et al., 2005); but its theoret-
ical reasoning is still not well understood. Despite the usefulness and effectiveness
of this VB component elimination property, its irreversible nature can potentially
lead to incorrect/different models in practice. This is because a component might
be eliminated prematurely in the iterative convergence towards a solution, remov-
ing the opportunity for appropriate observations to be allocated to it. Discussion of
this issue has generally been overlooked in the literature; here we give some consideration to its implications.
As indicated earlier, our application involves analyzing users’ phone activity patterns
over a 24-hour period, and it is therefore critical to take the circular characteristics
of the data into consideration; i.e., the difference between hour 0 and hour 24 is not 24 hours but zero hours. Data analysis of this kind is often referred to
as circular or directional statistics (Fisher, 1996; Mardia and Jupp, 2000; Jammala-
madaka and Sengupta, 2001). There have been numerous distributional models al-
ready proposed for analyzing this type of data, and the von Mises, also known as
circular normal (Jammalamadaka and Sengupta, 2001), distributional family is per-
haps the most popular choice; the wrapped distributional family is also useful (Mar-
dia and Jupp, 2000, pp.32-52). However, they typically have focused on modeling
unimodal and symmetric data and are therefore not well suited to many real world
applications.
Recently some researchers (e.g., Fernandez-Duran, 2004; Pewsey, 2008; McVinish
and Mengersen, 2008) have successfully modeled more complicated circular pat-
terns either parametrically or semi-parametrically. For example, one computation-
ally demanding approach is to model them using a mixture of von Mises distribu-
tions (Ghosh et al., 2003). However, VB cannot easily be utilized when taking this ap-
proach because the algebraic expressions required for the update equations are not
computationally straightforward to evaluate. An EM-based analysis of a mixture of
wrapped distributions (c.f. Fisher and Lee, 1994) can be computationally inefficient
due to its infinite sums that need to be approximated at each step. Lees et al. (2007)
present a VB-based implementation of a wrapped normal analysis, but they appear
to have overly simplified the study by assuming only one distribution wrapping on
the circle; this analysis would otherwise be more computationally demanding. An-
other option is to use non-parametric kernel density estimation based on the von
Mises-Fisher kernel (Mardia and Jupp, 2000, pp.277-278), but the results of the ker-
nel approach depend on the degree of smoothing and lack the interpretability that
may be critical for some applications such as ours.
Consequently, in this paper, we propose a simple approach that circumvents such problems: we apply the VB-GMM approach for interval data to the circular data problem by padding repeated data at both ends (c.f. Mardia and Jupp, 2000, p.4), and then normalizing the resulting model over the original interval, i.e., f(x), 0 ≤ x ≤ 24. We acknowledge that this tactic can only be used as an approximation, but we have found it to
be generally useful. We investigate how sensitive our modeling approach is to the
extent to which we pad out the ends; we do this by comparing results when we re-
peat data at both ends up to either 1/4, 1/2 or 1 cycle; that is, we consider situations
where the overall patterns analyzed are either 1.5, 2 or 3 complete data cycles. Note
that we found the usage pattern of each user to be quite different on weekdays than at weekends; here we restrict our focus to the weekday activities.
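As a sketch of this padding tactic (the function name `pad_circular` and its parameters are ours, not the thesis's), the data repetition for the three scenarios can be written as:

```python
import numpy as np

def pad_circular(hours, pad_cycles=0.25, period=24.0):
    """Repeat circular data at both ends before fitting an interval model.

    pad_cycles = 0.25, 0.5 or 1 gives the 1.5-, 2- and 3-cycle scenarios:
    observations within pad_cycles * period of each end are copied across.
    """
    h = np.asarray(hours, dtype=float) % period
    pad = pad_cycles * period
    left = h[h >= period - pad] - period   # e.g., 23:30 reappears at -0.5
    right = h[h <= pad] + period           # e.g., 00:30 reappears at 24.5
    return np.concatenate([left, h, right])
```

With `pad_cycles` set to 0.25, 0.5 or 1, the padded data span 1.5, 2 or 3 complete cycles respectively; a GMM fitted to the padded data is then renormalized over 0 ≤ x ≤ 24.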
While VB improves the model approximation at each iteration (in a similar manner
to the EM algorithm), like many other algorithms including EM (Ueda et al., 2000), it
is still somewhat sensitive to the initialization with respect to the component mem-
bership of each observation (c.f. Watanabe et al., 2002; Wu et al., 2010b). In this
paper, we also demonstrate this issue through modeling each user’s heterogeneous
temporal calling pattern. We compare the fitted models that arise when using the following three simple initialization schemes (with all other prior settings being non-informative, with the exception of kinitial); in the more informative schemes, i.e., Partitioned and Overlapping, observation component allocations correspond to intervals:
1. Random: assigning the component membership j of each observation i, which
we call i(j), non-informatively and thus randomly;
2. Partitioned: assigning each i(j) more informatively based on which non-overlapped equal-width interval the observed value falls into. We focus on the equal-width interval setting here; and
3. Overlapping: similar to Partitioned but using an overlapped interval setting, i.e., allowing an observation an equal chance of being initialized to either of the components corresponding to the overlapping intervals. An example of the 17-overlapped-interval setting for data ranging from -6 to 30 (corresponding to 6 hours of information padded on both sides of the 24-hour data) is shown in Figure 3.1.
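The three schemes can be sketched as follows (function and argument names are ours; the Overlapping branch is a simplification of the half-width-offset interval grid of Figure 3.1, used here only to illustrate the equal-chance assignment):

```python
import numpy as np

def initial_allocation(x, k, scheme="partitioned", lo=-6.0, hi=30.0, seed=0):
    """Return an initial component label for each observation in x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    if scheme == "random":
        # Non-informative: uniformly random memberships
        return rng.integers(0, k, size=len(x))
    width = (hi - lo) / k
    if scheme == "partitioned":
        # Non-overlapping equal-width intervals
        return np.clip(((x - lo) // width).astype(int), 0, k - 1)
    if scheme == "overlapping":
        # A second interval grid shifted by half a width; each observation
        # is assigned to either candidate component with equal probability.
        base = np.clip(((x - lo) // width).astype(int), 0, k - 1)
        shifted = np.clip(((x - lo - width / 2) // width).astype(int), 0, k - 1)
        pick = rng.integers(0, 2, size=len(x))
        return np.where(pick == 0, base, shifted)
    raise ValueError(scheme)
```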
Note that it is quite obvious that the VB-GMM algorithm with Random initialization
will require more iterations to reach converged models; thus our focus here is solely
on the goodness-of-fit among the fitted models with different initialization schemes
after a large number of iterations, i.e., after 1,000 iterations have been executed. This
also implies that we are less interested in the resulting fitted kfinal's; rather, we choose larger values for kinitial, which should allow a better chance of finding a good fit to the model. We compare the goodness-of-fit of the results by considering Kuiper's (1962) test statistic and the mean absolute error (MAE).

Figure 3.1: Overlapping initialization scheme (17 overlapped intervals with interleaved boundaries at -6, -2, ..., 30 and -4, 0, ..., 28, covering the padded range -6 to 30).
We organize the rest of this paper as follows. In Section 3.2, we briefly discuss fitting GMMs with VB. In Section 3.3, we detail Kuiper's test statistic and the MAE. We present the results in Section 3.4 and conclude in Section 3.5.
3.2 VB-GMM Algorithm
In a GMM, it is assumed that all k underlying distributions (or components) of the mixture are Gaussian. In the notation we adopt here, the density of each observation $x_i$ in the sample $x = (x_1, \ldots, x_n)$ is given by $\sum_{j=1}^{k} w_j N(x_i; \mu_j, \tau_j^{-1})$, where $k \in \mathbb{N}$, $\mu_j$ and $\tau_j^{-1}$ represent the mean and variance, respectively, of the $j$th component, each mixing coefficient $w_j$ satisfies $0 \le w_j$ and $\sum_{j=1}^{k} w_j = 1$, and $N(\cdot)$ denotes a Gaussian density. In the Bayesian framework, inference is based on the target posterior distribution $p(\theta, z \mid x)$, where $\theta$ denotes the model parameters $(\mu, \tau, w)$ and $z = \{z_{ij}\}$ denotes the missing component membership information of the observations. Note that the $z_{ij}$'s are indicator variables such that $z_{ij} = 1$ if observation $x_i$ belongs to the $j$th component and $z_{ij} = 0$ otherwise.
The target posterior is not analytically available in this mixture model problem, as
is generally the case, and therefore it has to be approximated in the Bayesian infer-
ence approach. The idea of the VB approach is to approximate the target posterior
by a variational distribution which we denote by q (θ, z|x). Importantly, it is a mean-
field type approach in that it is assumed that this approximating distribution fac-
torizes over the model parameters θ and the missing variables z; this means that we
can write q (θ, z|x) = qθ (θ|x) × qz (z|x). In order to obtain a good approximation to
the target, the distribution q (θ, z|x) overall must be chosen carefully so that it can
approximate the true conditional density well, and qθ (θ|x) and qz (z|x) can provide
computational convenience needed at the same time. VB’s objective is to maximize
the lower bound on the log marginal likelihood, logp (x). This is equivalent to min-
imizing the Kullback-Leibler (KL) divergence between the target posterior and the
variational approximating distribution. This approach leads to tractable coupled expressions for the variational posteriors over the parameters, which can be iteratively updated in a similar fashion to the classical EM algorithm to obtain convergence to a solution.
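The equivalence stated above follows from a standard decomposition of the log marginal likelihood (written here in the notation of this section):

```latex
\log p(x)
  = \underbrace{\int q(\theta, z \mid x)\,
      \log \frac{p(x, z, \theta)}{q(\theta, z \mid x)}\,
      \mathrm{d}(\theta, z)}_{\text{lower bound } \mathcal{L}(q)}
  \;+\;
  \underbrace{\int q(\theta, z \mid x)\,
      \log \frac{q(\theta, z \mid x)}{p(\theta, z \mid x)}\,
      \mathrm{d}(\theta, z)}_{\mathrm{KL}(q \,\|\, p) \;\ge\; 0}
```

Because $\log p(x)$ is fixed with respect to $q$, raising $\mathcal{L}(q)$ necessarily lowers the KL divergence, so the two optimization views coincide.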
Most of the papers on the subject of fitting GMMs with VB (e.g., Attias, 1999; Cor-
duneanu and Bishop, 2001; McGrory and Titterington, 2007) make similar prior as-
sumptions, but they differ in the form of the model hierarchy used. As indicated pre-
viously, we follow the model setting described in McGrory and Titterington (2007),
but we do not make use of the Deviance information criterion (DIC) as a comple-
mentary model selection criterion as they did in that paper (c.f. Spiegelhalter et al.,
2002; Celeux et al., 2006). We model the pattern as a mixture of k Gaussian distri-
butions with unknown means µ = (µ1, ..., µk), precisions τ = (τ1, ..., τk) and mixing
coefficients w = (w1, ..., wk), such that
\[
p(x, z \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left\{ w_j \, N\!\left(x_i;\, \mu_j,\, \tau_j^{-1}\right) \right\}^{z_{ij}},
\]
with the joint distribution being p (x, z, θ) = p (x, z|θ) p (w) p (µ|τ) p (τ). We express
our priors as:
\[
p(w) = \mathrm{Dirichlet}\!\left(w;\, \alpha_1^{(0)}, \ldots, \alpha_k^{(0)}\right),
\]
\[
p(\mu \mid \tau) = \prod_{j=1}^{k} N\!\left(\mu_j;\, m_j^{(0)},\, \left(\beta_j^{(0)} \tau_j\right)^{-1}\right), \text{ and}
\]
\[
p(\tau) = \prod_{j=1}^{k} \mathrm{Gamma}\!\left(\tau_j;\, \tfrac{1}{2}\upsilon_j^{(0)},\, \tfrac{1}{2}\sigma_j^{(0)}\right),
\]
with α(0), β(0), m(0), υ(0), and σ(0) being known, user chosen initial values. These
are the standard conjugate priors used in Bayesian mixture modeling (Gelman et al.,
2004). Using the lower bound approximation, the posteriors are then:
\[
q_w(w) = \mathrm{Dirichlet}(w;\, \alpha_1, \ldots, \alpha_k),
\]
\[
q_{\mu \mid \tau}(\mu \mid \tau) = \prod_{j=1}^{k} N\!\left(\mu_j;\, m_j,\, (\beta_j \tau_j)^{-1}\right), \text{ and}
\]
\[
q_\tau(\tau) = \prod_{j=1}^{k} \mathrm{Gamma}\!\left(\tau_j;\, \tfrac{1}{2}\upsilon_j,\, \tfrac{1}{2}\sigma_j\right).
\]
The posterior parameters are iteratively updated as:
\[
\alpha_j = \alpha_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad
\beta_j = \beta_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad
\upsilon_j = \upsilon_j^{(0)} + \sum_{i=1}^{n} q_{ij},
\]
\[
m_j = \frac{1}{\beta_j}\left(\beta_j^{(0)} m_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i\right), \text{ and}
\]
\[
\sigma_j = \sigma_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i^{2} + \beta_j^{(0)} \left(m_j^{(0)}\right)^{2} - \beta_j m_j^{2},
\]
where $q_{ij}$ is the VB posterior probability that $z_{ij} = 1$, and expectations are given by $E(\mu_j) = m_j$ and $E(\tau_j) = \upsilon_j \sigma_j^{-1}$. Please refer to McGrory and Titterington (2007), for example, for more details on VB-GMM. Finally, we emphasize that approximating circular data with a truncated GMM, as proposed here as the simple circumventing approach, requires the data to first be padded on both sides of the interval prior to modeling.
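For illustration, the update equations above can be implemented directly. The following one-dimensional sketch (function names and hyperparameter defaults are ours) uses the Partitioned initialization of Section 3.1 and shows the component-elimination behavior through the weight threshold in the last line:

```python
import numpy as np

def digamma(z):
    """Digamma via recurrence plus an asymptotic series (adequate here)."""
    z = np.atleast_1d(np.asarray(z, dtype=float)).copy()
    r = np.zeros_like(z)
    while np.any(z < 6):
        small = z < 6
        r[small] -= 1.0 / z[small]
        z[small] += 1.0
    f = 1.0 / z ** 2
    return r + np.log(z) - 0.5 / z - f * (1 / 12 - f * (1 / 120 - f / 252))

def vb_gmm(x, k_init=6, iters=200, alpha0=1e-3, beta0=1e-3,
           upsilon0=1.0, sigma0=1.0):
    """One-dimensional VB-GMM following the conjugate updates above.

    Hyperparameters mirror alpha^(0), beta^(0), upsilon^(0), sigma^(0).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    m0 = x.mean()
    # Partitioned initialization: hard-assign by equal-width interval
    bins = np.linspace(x.min(), x.max() + 1e-9, k_init + 1)
    labels = np.clip(np.digitize(x, bins) - 1, 0, k_init - 1)
    q = np.zeros((n, k_init))
    q[np.arange(n), labels] = 1.0
    for _ in range(iters):
        nk = q.sum(axis=0)                   # sum_i q_ij per component
        alpha = alpha0 + nk
        beta = beta0 + nk
        upsilon = upsilon0 + nk
        m = (beta0 * m0 + q.T @ x) / beta
        sigma = sigma0 + q.T @ x ** 2 + beta0 * m0 ** 2 - beta * m ** 2
        # Variational E-step: responsibilities from expected log densities
        e_log_w = digamma(alpha) - digamma(alpha.sum())
        e_log_tau = digamma(upsilon / 2) - np.log(sigma / 2)
        e_tau = upsilon / sigma
        quad = 1.0 / beta + e_tau * (x[:, None] - m) ** 2  # E[tau (x-mu)^2]
        log_rho = e_log_w + 0.5 * e_log_tau - 0.5 * quad
        log_rho -= log_rho.max(axis=1, keepdims=True)
        q = np.exp(log_rho)
        q /= q.sum(axis=1, keepdims=True)
    w = alpha / alpha.sum()
    keep = w > 1e-3      # redundant components' weights collapse to ~0
    return w[keep], m[keep], e_tau[keep]
```

In practice the redundant components' weights collapse towards zero during the iterations, so kfinal is read off from the surviving components, as described in Section 3.1.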
3.3 Model Evaluation Criterion
Kuiper’s test statistic (Mardia and Jupp, 2000, pp.99-103) is based on the popular
Kolmogorov-Smirnov (KS) test statistic (c.f. Sheskin, 2004). However, unlike the KS statistic, Kuiper's is suitable for comparing two circular distributions non-parametrically as it
does not depend on the choice of origin. It is based on evaluating the cumulative
distribution function (CDF), and can be expressed as:
\[
V_n = \max_{1 \le i \le n}\left(F(x'_i) - S_n(x'_i)\right) - \min_{1 \le i \le n}\left(F(x'_i) - S_n(x'_i)\right).
\]
In this expression, x is the data re-arranged into increasing numerical order in an
array x′; Sn is the sample distribution function of the data and F is the fitted distri-
bution function of the GMM. However, we felt that simply evaluating the goodness-
of-fit of the results based on the combined largest deviances on both sides of the
distributional fit may not be sufficient. Consequently, we propose also using the
MAE:
\[
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| F(x'_i) - S_n(x'_i) \right|.
\]
We consider the MAE to be useful here given that VB generally does not over-fit (Attias,
1999). In our results section, we utilized both measures. Finally, we will make use of
the following modification of Vn (Stephens, 1970):
\[
V_n^{*} = \sqrt{n}\, V_n \left(1 + \frac{0.155}{\sqrt{n}} + \frac{0.24}{n}\right).
\]
The definition of V*_n allows us to evaluate more generally whether our modeling approach is robust with respect to n, the number of observations in a pattern; the distribution of V*_n has been shown to be quite stable for n ≥ 4, but having n ≥ 8 is recommended.
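The three criteria can be computed together. In this sketch (the function name `kuiper_mae` is ours) F is any fitted distribution function, and the step-function conventions for S_n are deliberately glossed over:

```python
import numpy as np

def kuiper_mae(x, cdf):
    """Evaluate a fitted CDF F against the empirical CDF S_n (Section 3.3)."""
    xs = np.sort(np.asarray(x, dtype=float))   # the ordered data x'
    n = len(xs)
    F = cdf(xs)
    S = np.arange(1, n + 1) / n                # S_n(x'_i)
    diffs = F - S
    vn = diffs.max() - diffs.min()             # Kuiper's V_n
    mae = np.abs(diffs).mean()                 # mean absolute error
    # Stephens' (1970) modification V*_n
    v_star = np.sqrt(n) * vn * (1 + 0.155 / np.sqrt(n) + 0.24 / n)
    return vn, mae, v_star
```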
3.4 Results
As indicated in the introduction, we carry out two analyses. The first task is to
demonstrate the irreversible nature of the VB component elimination property for
fitting GMMs or mixture models more generally. We next evaluate the goodness-
of-fit among different initialization schemes with respect to the observation com-
ponent allocations. We also evaluate the effectiveness of using the simple circum-
venting VB-GMM approach for modeling circular data of users’ weekday temporal
usage patterns. We note that the heterogeneous nature of this dataset provides a
good example of a real world problem for which to evaluate the practical merits and
effectiveness of VB-GMM methodology for approximating circular data.
3.4.1 The irreversible nature of the VB component elimination property
To demonstrate this property, a simulated dataset with 200 observations was generated from a mixture of four Gaussian components with the parameter values shown in Table 3.1. The data generated here can be considered not well separated and are therefore reasonably challenging. This example is chosen to
illustrate that in some cases, the VB elimination property can lead to differences in
results when different values of kinitial are used and we demonstrate this with a ran-
dom initial allocation of observations to components. The models recovered by the
VB-GMM algorithm with the random initialization scheme and with various kinitial’s
are shown in Table 3.2. We note that while the algorithm is unable to identify the
original model for this less than well behaved dataset, the four-component solu-
tion found by most of the different kinitial’s appears to be satisfactory. Moreover, we
observe that VB has eliminated components with little support rather effectively as
has been observed by other previous researchers (e.g., Attias, 1999; Corduneanu and
Bishop, 2001; McGrory and Titterington, 2007).
Obviously, model outputs with kinitial < 4 cannot recover the original model. How-
ever, we observe that in the case of this example, there is some variation in results
obtained even when we initialize the algorithm such that kinitial is larger than the
number of components in the model the data were simulated from. In particular,
when we chose kinitial = 4 or 8, the algorithm only fits a three-component model,
failing to fit component four which was the least weighted component in the model
we simulated from. It is easy to see how such an over-simplification of the model
can arise, particularly when there are low weighted components heterogeneously
mixed in the model. This over-simplification occurred here because in the initial ob-
servation allocation, observations that were generated from component four were
not grouped appropriately, giving little support to a fourth component and
hence leading to its premature elimination from the model before convergence was
Table 3.1: Mixture model parameters used for the simulated dataset

Component     µ       τ^(-1/2)   w
#1            5.124   0.552      0.240
#2            6.237   0.323      0.410
#3            6.584   0.052      0.260
#4            7.335   0.176      0.090
complete. Following the computation iteration by iteration, it is clear that the standard VB elimination cannot be reversed.
While the non-uniform behavior of VB in this example (c.f. the differing results for kinitial = 8 versus kinitial = 7 or 9 in Table 3.2), which occurs as a result of the irreversible nature of its component elimination property, is somewhat concerning, we note that fortunately it seldom occurred in our other experiments when the data were well behaved. We do not show these here as our focus is on more problematic examples. Of course, heterogeneous data can be modeled differently and it can also cause
inference challenges in other Bayesian approaches too. However, what is interesting
to us is that regardless of how the observation component allocations are initialized,
and what value of kinitial is used for this non-well separated dataset, the VB solutions
have been quite consistent in the sense that models with the same number of com-
ponents were nearly always practically identical (c.f. Table 3.2).
Finally, we note that in practice, researchers often perform several analyses on the
same dataset to ensure the findings are appropriate. Moreover, they typically choose
some kind of initial clustering (e.g., Corduneanu and Bishop, 2001), for example, to
initialize their Bayesian schemes; this tactic can provide potential computational
savings given that a more informative prior is utilized. Nevertheless, we note that randomly allocating observations to components in a one-dimensional space is somewhat concerning, in the sense that all components are initialized very similarly; within the VB framework this can make premature component elimination more likely to occur and lead to very inappropriate models. This motivated the design of the allocation schemes in Section 3.1, and we follow up on this point in the next subsection. Despite this, we again emphasize that
many studies as well as the above analysis have shown empirically that VB for GMMs
is generally very effective.
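For reference, the simulation design of this subsection can be reproduced in outline as follows (the seed is arbitrary, so this reproduces the design, with the Table 3.1 parameter values, rather than the exact 200 observations):

```python
import numpy as np

# Generate a 200-observation dataset from the Table 3.1 mixture.
rng = np.random.default_rng(1)
w  = np.array([0.240, 0.410, 0.260, 0.090])   # mixing weights, sum to 1
mu = np.array([5.124, 6.237, 6.584, 7.335])   # component means
sd = np.array([0.552, 0.323, 0.052, 0.176])   # standard deviations tau^(-1/2)
z = rng.choice(4, size=200, p=w)              # latent component memberships
x = rng.normal(mu[z], sd[z])                  # the simulated observations
```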
3.4.2 Evaluating the results of the VB-GMM fit under different initialization schemes for padded circular data
Our data for this experiment is provided by a telecommunication provider. The dataset consists of every successful outbound communication activity made by 100 users during a 17-month period. These anonymous users were randomly sampled from a large database of several million users, with users' activity during the
Table 3.2: Parameter estimates recovered by VB with various kinitial

               Component #1          Component #2          Component #3          Component #4
               µ     τ^(-1/2)  w     µ     τ^(-1/2)  w     µ     τ^(-1/2)  w     µ     τ^(-1/2)  w
kinitial = 1   6.130 0.769  1.000
kinitial = 2   5.969 0.840  0.737   6.579 0.038  0.263
kinitial = 3   4.160 0.033  0.025   6.036 0.779  0.715   6.579 0.038  0.260
kinitial = 4   4.160 0.033  0.025   6.036 0.779  0.715   6.579 0.038  0.260
kinitial = 5   5.348 0.706  0.342   6.278 0.300  0.309   6.580 0.038  0.259   7.299 0.160  0.089
kinitial = 6   5.348 0.707  0.342   6.278 0.300  0.309   6.580 0.038  0.259   7.299 0.160  0.089
kinitial = 7   5.348 0.707  0.342   6.278 0.300  0.309   6.580 0.038  0.259   7.299 0.160  0.089
kinitial = 8   4.160 0.033  0.025   6.036 0.779  0.715   6.579 0.038  0.260
kinitial = 9   5.348 0.707  0.342   6.278 0.300  0.309   6.580 0.038  0.259   7.299 0.160  0.089
...
kinitial = 20  5.348 0.707  0.342   6.278 0.300  0.309   6.580 0.038  0.259   7.299 0.160  0.089
Table 3.3: Average Kuiper of all setups for all 100 users; the 'best' model for each user is determined based on the lowest Kuiper. Note that different kinitial correspond to different cycle scenarios.

                 Average Kuiper                            # of 'Best' Models
Init. scheme     Random     Partitioned  Overlapping       Random  Partitioned  Overlapping
kinitial = 17    0.041911   0.033613     0.034013          21      43           36
kinitial = 23    0.060567   0.036761     0.037422          15      47           38
kinitial = 35    0.227374   0.039212     0.038057          12      40           48
weekends ignored (c.f. Section 3.1); overall, each user has more than 100 activities. The average number of activities for the analyzed users is 1,766, and the maximum number of activities is 10,607. In this evaluation, we first consider results for just three of the anonymous users. Then we consider the whole sample to summarize the overall results
obtained for all 100 users with various VB-GMM settings.
Our objectives are to evaluate the goodness-of-fit of the VB-GMM results when ap-
plied to padded circular data and the implications of the three different initialization
schemes, Random, Partitioned and Overlapping, for the observations’ initial com-
ponent allocations. That is, on one hand, we aim to understand the implications of different initialization schemes for VB-GMM with more complicated data, and on the other hand, we assess how well circular data can be approximated with a truncated GMM. As stated in the introduction, we also investigate the sensitivity of our
approach by considering three scenarios: 1.5, 2 and 3 complete data cycles. For ease
of demonstration and ‘fair’ comparison, kinitial = 17 is used for the 1.5-cycle scenario
(c.f. Figure 3.1 for Overlapping ); while kinitial = 23 and 35 are utilized for 2- and 3-
cycle scenarios, respectively. In other words, for each pattern a total of nine setups
will be evaluated (c.f. three initialization schemes crossed with three data-repeating scenarios). Note that for simplicity, we shall refer to the three data-repeating scenarios by their corresponding kinitial.
Figures 3.3 to 3.5 illustrate selected modeling results for the temporal usage patterns of three selected users over the 24-hour period. As suspected, we have observed that the Random initialization can have a disastrous effect in this
Table 3.4: Average MAE of all setups for all 100 users; the 'best' model for each user is determined based on the lowest MAE. Note that different kinitial correspond to different cycle scenarios.

                 Average MAE                               # of 'Best' Models
Init. scheme     Random     Partitioned  Overlapping       Random  Partitioned  Overlapping
kinitial = 17    0.006779   0.005205     0.005172          19      48           33
kinitial = 23    0.011930   0.005557     0.005656          13      48           39
kinitial = 35    0.073938   0.006210     0.005865          11      38           51
Figure 3.2: Distribution of the number of components k for selected setups: (a) Overlapping and kinitial = 35 setup; (b) Random and kinitial = 35 setup.
one-dimensional study; the observed poor fits appear to result from initialized
components being too vague and/or too similar. This suggests that all initialized
components covered nearly identical data ranges, as if one were aiming to fit a Gaus-
sian distribution to a padded heterogeneous pattern. We will follow up on this point
in the next paragraph. In contrast, most patterns appeared to have been modeled
quite well in all setups where the Partitioned or Overlapping initialization was used
(c.f. Figures 3.3 to 3.5). We can also see that the fit appears good on the ‘edges’ of the
datasets which indicates that our proposed VB-GMM with padded data approach
works well in the circular data setting.
The averages of Kuiper and MAE for all analyzed users are summarized in Tables 3.3
and 3.4. It seems that the more informative Partitioned and Overlapping initializations have generally produced results as good as, or better than, those of the non-informative Random initialization; and the fitted models from Random initialization
appear to suffer significantly when a longer series of data is padded. Moreover, Over-
lapping appears to perform marginally better than Partitioned when a longer se-
ries of data is padded. To understand why some VB-GMM fits appear to be better
than others, we next summarize the average kfinal’s for these users in Table 3.5. It
seems to us that, on average, one requires approximately four or five components
Table 3.5: Average kfinal of all setups for all 100 users. Note that different kinitial correspond to different cycle scenarios.

                 Average kfinal
Init. scheme     Random   Partitioned  Overlapping
kinitial = 17    6.051    7.333        7.606
kinitial = 23    6.788    9.586        9.778
kinitial = 35    12.778   12.202       12.667
Table 3.6: Average Stephens' Kuiper, V*_n, for all setups for all 100 users; a 'good' model is determined by comparing its V*_n to the critical value of 1.224 at the nominal significance level of α = 10% (Stephens, 1970). Note that different kinitial correspond to different cycle scenarios.

                 Average Stephens' Kuiper V*_n             # of 'Good' Models
Init. scheme     Random   Partitioned  Overlapping         Random  Partitioned  Overlapping
kinitial = 17    1.245    0.963        0.960               63      83           83
kinitial = 23    1.851    1.058        1.097               41      77           75
kinitial = 35    9.733    1.127        1.109               20      67           72
(c.f. Partitioned and Overlapping) for modeling one cycle of a user's 24-hour calling pattern, whereas the poorer model fits obtained from Random initialization (except with kinitial = 35, i.e., three complete data cycles) appear to be the direct result of there being fewer surviving components in the models, a side-effect of the irreversible nature of VB's component elimination property, which is normally quite effective. Nonetheless, careful evaluation of the distributions of kfinal's for all the different
setups revealed something interesting. All distributions centered around their average kfinal's (c.f. Figure 3.2 (a)) as expected; however, the Random and kinitial = 35 setup (c.f. Figure 3.2 (b)) is the exception: most of its models have either many surviving components (all of them nearly identical) or very few, making its average kfinal misleading. This result suggests that the Random initialization
scheme for VB can be problematic for modeling complicated one-dimensional pat-
terns, and more informative observation component allocation schemes are gener-
ally needed. Additionally, we note that if we were to execute several thousand more iterations for those models that still consist of a very high number of components, in many cases components could still be eliminated; however, such models generally end up with only a handful of components, which is clearly still not sufficient for modeling three cycles of heterogeneous data.
Recall that our other focus is to assess the effectiveness of modeling circular data
as a truncated GMM. Figure 3.6 shows how the Stephens’ Kuiper, V ∗n , is distributed
with respect to n in the fitted model for all nine setups. It shows that the VB-GMM
with padded data approach for modeling circular data with either Partitioned or
Overlapping observation component initialization schemes is robust regardless of
the size of n; and for our data, different numbers of cycles appear to have minimal effect when a more informative initialization scheme is used. Table 3.6 summarizes the number of cases (out of 100 models/users) where a model's V*_n is less than the upper-tail critical value at the nominal significance level of α = 10%, that is, the number of 'good' fitted models. These numbers echo the findings
from Figure 3.6; they show that the VB-GMM with the Partitioned and Overlapping
approach on padded circular data will generally result in satisfactory models, and it
is sufficient to approximate circular patterns in a simple way by circumventing the
problem with truncated GMMs.
3.5 Discussion
In this paper, we have shown how VB-GMM can be adapted for use in approximating
circular data by taking an approach where the data is padded at the edges. Addition-
ally, we have illustrated and discussed the generally overlooked potential implica-
tions of the irreversible nature of VB’s component elimination property and illus-
trated the effectiveness of utilizing more informative observation component allo-
cation schemes in avoiding this problem. In doing this we have demonstrated an
effective circumventing modeling approach for circular data that will be of particular value in settings where there are large volumes of data to be analyzed, as the VB-based approach is generally more computationally and time efficient than other Bayesian approaches. This should also be useful in other circular data applications.
One key advantage of modeling each user’s temporal usage pattern using a GMM
is the ease of interpretation of the fitted model. From an application standpoint,
this can provide telecommunication companies with an opportunity to gain insights
into, as well as differentiate, each customer’s temporal usage behavior (c.f. Wu et al.,
2010a). This type of information has valuable implications for marketing and product
design. For example, User B is mostly active during business hours, while User C is
highly active around midnight. This might suggest that these two users would have
very different needs and a company could therefore better tailor their product for
each of them. We also note that while we restricted our attention here to the standard
VB algorithm in which components may only be eliminated and not added, there are
component splitting VB schemes available as discussed in the introduction. We note
that Wu et al. (2010b) shows the usefulness of component splitting in VB applied to
telecommunication spatial data. We anticipate that the model goodness-of-fit could
also be further improved by adopting a component splitting strategy in the VB-GMM
algorithm more generally.
3.6 References
Attias, H., 1999. Inferring parameters and structure of latent variable models by vari-
ational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Con-
ference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm,
Sweden, pp. 21–30.
Celeux, G., Forbes, F., Robert, C., Titterington, D., 2006. Deviance information crite-
ria for missing data models. Bayesian Analysis 1 (4), 651–674.
Celeux, G., Hurn, M., Robert, C. P., 2000. Computational and inferential difficulties
Figure 3.3: The results of the VB-GMM fits of the usage pattern of User A. The histogram summarizes the actual observations; (a) Partitioned and kinitial = 17 setup: Kuiper = 0.020312, MAE = 0.002811; (b) Overlapping and kinitial = 17 setup: Kuiper = 0.020313, MAE = 0.002811; (c) Partitioned and kinitial = 23 setup: Kuiper = 0.031329, MAE = 0.005304; (d) Overlapping and kinitial = 23 setup: Kuiper = 0.020389, MAE = 0.002708; (e) Partitioned and kinitial = 35 setup: Kuiper = 0.02255, MAE = 0.003156; (f) Overlapping and kinitial = 35 setup: Kuiper = 0.04123, MAE = 0.007255.
Figure 3.4: The results of the VB-GMM fits of the usage pattern of User B. The histogram summarizes the actual observations; (a) Partitioned and kinitial = 17 setup: Kuiper = 0.010413, MAE = 0.001333; (b) Overlapping and kinitial = 17 setup: Kuiper = 0.011612, MAE = 0.001605; (c) Partitioned and kinitial = 23 setup: Kuiper = 0.010727, MAE = 0.001525; (d) Overlapping and kinitial = 23 setup: Kuiper = 0.010257, MAE = 0.001525; (e) Partitioned and kinitial = 35 setup: Kuiper = 0.009021, MAE = 0.001338; (f) Overlapping and kinitial = 35 setup: Kuiper = 0.010963, MAE = 0.001692.
Figure 3.5: The results of the VB-GMM fits of the usage pattern of User C. The histogram summarizes the actual observations; (a) Partitioned and kinitial = 17 setup: Kuiper = 0.038779, MAE = 0.016959; (b) Overlapping and kinitial = 17 setup: Kuiper = 0.040034, MAE = 0.017051; (c) Partitioned and kinitial = 23 setup: Kuiper = 0.040514, MAE = 0.015667; (d) Overlapping and kinitial = 23 setup: Kuiper = 0.040352, MAE = 0.015937; (e) Partitioned and kinitial = 35 setup: Kuiper = 0.041396, MAE = 0.013901; (f) Overlapping and kinitial = 35 setup: Kuiper = 0.042945, MAE = 0.017163.
Figure 3.6: Stephens' Kuiper V*_n vs. n; (a) Random and kinitial = 17 setup, (b) Partitioned and kinitial = 17 setup, (c) Overlapping and kinitial = 17 setup, (d) Random and kinitial = 23 setup, (e) Partitioned and kinitial = 23 setup, (f) Overlapping and kinitial = 23 setup, (g) Random and kinitial = 35 setup, (h) Partitioned and kinitial = 35 setup, and (i) Overlapping and kinitial = 35 setup.
Celeux, G., Hurn, M., Robert, C. P., 2000. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association 95 (451), 957–970.
Constantinopoulos, C., Likas, A., 2007. Unsupervised learning of Gaussian mixtures based on variational component splitting. IEEE Transactions on Neural Networks 18 (3), 745–755.
Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mixture distributions. In: Proceedings of the Eighth International Conference on Artificial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.
Dempster, A. P., Laird, N. M., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39 (1), 1–38.
Fernandez-Duran, J. J., 2004. Circular distributions based on nonnegative trigonometric sums. Biometrics 60 (2), 499–503.
Fisher, N. I., 1996. Statistical Analysis of Circular Data, 2nd Edition. Cambridge University Press, Cambridge, UK.
Fisher, N. I., Lee, A. J., 1994. Time series analysis of circular data. Journal of the Royal Statistical Society: Series B (Methodological) 56 (2), 327–339.
Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.
Ghahramani, Z., Beal, M. J., 1999. Variational inference for Bayesian mixtures of factor analysers. In: Solla, S. A., Leen, T. K., Muller, K.-R. (Eds.), Proceedings of the 1999 Neural Information Processing Systems. MIT, Denver, CO, pp. 449–455.
Ghosh, K., Jammalamadaka, S. R., Tiwari, R. C., 2003. Semiparametric Bayesian techniques for problems in circular data. Journal of Applied Statistics 30 (2), 145–161.
Jaakkola, T. S., Jordan, M. I., 2000. Bayesian parameter estimation via variational methods. Statistics and Computing 10 (1), 25–37.
Jammalamadaka, S. R., Sengupta, A., 2001. Topics in Circular Statistics. Series on Multivariate Analysis. World Scientific, Singapore.
Kuiper, N. H., 1962. Tests concerning random points on a circle. Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen, Series A 63, 38–47.
Lees, K., Roberts, S., Skamnioti, P., Gurr, S., 2007. Gene microarray analysis using angular distribution decomposition. Journal of Computational Biology 14 (1), 68–83.
Mardia, K. V., Jupp, P. E., 2000. Directional Statistics, 2nd Edition. Wiley Series in Probability and Statistics. Wiley, Chichester, UK.
McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian model selection for finite mixture distributions. Computational Statistics & Data Analysis 51 (11), 5352–5367.
McLachlan, G. J., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley, New York.
McVinish, R., Mengersen, K., 2008. Semiparametric Bayesian circular statistics. Computational Statistics & Data Analysis 52 (10), 4722–4730.
Pewsey, A., 2008. The wrapped stable family of distributions as a flexible model for circular data. Computational Statistics & Data Analysis 52 (3), 1516–1523.
Richardson, S., Green, P. J., 1997. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (4), 731–792.
Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2), 461–464.
Sheskin, D. J., 2004. Handbook of Parametric and Nonparametric Statistical Procedures, 3rd Edition. Chapman & Hall, Boca Raton, FL.
Spiegelhalter, D., Best, N., Carlin, B., Van der Linde, A., 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4), 583–639.
Stephens, M. A., 1970. Use of the Kolmogorov-Smirnov, Cramer-von Mises and related statistics without extensive tables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 32 (1), 115–122.
Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D., Caldas, C., 2005. A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics 21 (13), 3025–3033.
Ueda, N., Ghahramani, Z., 2002. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks 15 (10), 1223–1241.
Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G. E., 2000. SMEM algorithm for mixture models. Neural Computation 12 (9), 2109–2128.
Wang, B., Titterington, D. M., 2006. Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Analysis 1 (3), 625–650.
Watanabe, S., Minami, Y., Nakamura, A., Ueda, N., 2002. Application of variational Bayesian approach to speech recognition. In: Becker, S., Thrun, S., Obermayer, K. (Eds.), Proceedings of the 2002 Neural Information Processing Systems. MIT, Vancouver, BC, Canada, pp. 1237–1244.
Wu, B., McGrory, C. A., Pettitt, A. N., 2010a. Customer spatial usage behavior profiling and segmentation with mixture modeling. Submitted.
Wu, B., McGrory, C. A., Pettitt, A. N., 2010b. A new variational Bayesian algorithm with application to human mobility pattern modeling. Statistics and Computing, (in press). http://dx.doi.org/10.1007/s11222-010-9217-9
4 A New Variational Bayesian Algorithm with Application to Human Mobility Pattern Modeling
Abstract
A new variational Bayesian (VB) algorithm, split and eliminate VB (SEVB), for modeling data via a Gaussian mixture model (GMM) is developed. This new algorithm makes use of component splitting in a way that is more appropriate for analyzing a large number of highly heterogeneous spiky spatial patterns with weak prior information than existing VB-based approaches. SEVB is a highly computationally efficient approach to Bayesian inference and, like any VB-based algorithm, it can perform model selection and parameter value estimation simultaneously. A significant feature of our algorithm is that the fitted number of components is not limited by the initial proposal, giving increased modeling flexibility. We introduce two types of split operation in addition to proposing a new goodness-of-fit measure for evaluating mixture models, and we evaluate their usefulness through empirical studies. In addition, we illustrate the utility of our new approach in an application on modeling human mobility patterns. This application involves large volumes of highly heterogeneous spiky data; it is difficult to model this type of data well using the standard VB approach as it is too restrictive and lacking in the flexibility required. Empirical results suggest that our algorithm improves upon the goodness-of-fit achieved by the standard VB method, and that it is more robust to various initialization settings.
Keywords
Variational Bayes (VB); Gaussian Mixture Model (GMM); Component Splitting; Human Mobility Pattern; Data Mining
4.1 Introduction
Mixture models are commonly employed in statistical analysis as they provide a great deal of modeling flexibility. In particular, one very popular and computationally convenient approach is to model data as a mixture of a finite number of independent Gaussian distributions. In this paper we refer to this model as a Gaussian mixture model (GMM) (McLachlan and Peel, 2000). In recent years the computationally efficient variational Bayesian (VB) approach has been successfully used to fit GMMs, as described in McGrory and Titterington (2007). We refer to this approach as the standard VB-GMM method. While this method enables faster computation and lower storage requirements than most other Bayesian approaches, working with large volumes of data which exhibit widely varying patterns can still be challenging. This paper aims to improve on the standard method to create an approach which is better suited to analyzing datasets that are characterized by a large number of highly heterogeneous spiky spatial patterns and, in particular, when there is only weak prior information available. We use the term spiky to describe data patterns with large areas of low probability mixed with small areas of high probability, and the term heterogeneous to describe datasets where we observe patterns in different regions with various degrees of complexity, some of which are better described by a mixture of one or two components, while others may require a model with a large number of components.
Human mobility patterns are known to be highly heterogeneous and spiky (Gonzalez et al., 2008). An understanding of human mobility patterns is valuable for urban planning, traffic modeling and predicting the transmission of biological viruses (Gonzalez et al., 2008), for example. To the best of our knowledge, individuals' mobility patterns have not yet been modeled with GMMs; therefore taking such an approach will allow us to gain further insights into this type of data. To capture human mobility patterns, we analyze individuals' telecommunication call detail records (CDR) that were observed over a 17-month period. While CDR information is clearly biased in that it only reflects those times when communications are being made, studies have previously shown that this information is in fact adequate to provide a good reflection of a person's overall mobility pattern (Gonzalez et al., 2008). For the reasons mentioned earlier, fitting a GMM to this type of data presents challenges for many standard Bayesian approaches. Another challenge is that CDR data is somewhat discrete, since the location of the user is only known up to the location of the cell tower through which the activity was initiated. In this paper we present an algorithm that is highly stable, efficient and more appropriate than existing approaches for analyzing real-world data where such issues arise. In particular we shall show the advantages of adopting a component splitting strategy in the analysis.
VB was first formally proposed by Attias (1999) and has now been used in various applications (Wang and Titterington, 2006). Its scalability, ease of computation, and efficiency in terms of both computation and storage requirements make VB practical for analyzing large datasets, in contrast to the more popular but computationally demanding Markov chain Monte Carlo (MCMC) approach (Madigan and Ridgeway, 2003). Another alternative Bayesian approach for GMMs is sequential Monte Carlo (SMC); the properties of SMC for static datasets are largely unknown, although Balakrishnan and Madigan (2006) propose an efficient approach which is comparable with MCMC. Other key advantages of the VB approach are that, unlike Monte Carlo based approaches, it does not suffer from mixing or label switching problems or the difficulties with assessing convergence (Celeux et al., 2000; Jaakkola and Jordan, 2000; Wang and Titterington, 2006; McGrory and Titterington, 2007). Further, since VB is deterministic, it does not rely on sampling, the accuracy of which can be difficult to assess in the context of GMMs, and, being a Bayesian approach, it suffers less from over-fitting and singularity problems (Bishop, 2006, pp. 461-486). While the literature is lacking in formal comparisons between the VB and maximum likelihood (ML) approaches such as expectation-maximization (EM) algorithms (e.g., Aitkin and Wilson, 1980), within the context of speech recognition problems Watanabe et al. (2002), for example, showed through empirical studies that VB performed as well as or better than EM algorithms used with the Bayesian information criterion (BIC) and minimum description length (MDL), in terms of robustness, accuracy and rate of convergence in hidden Markov modeling.
Another significant practical advantage of using VB for mixture modeling is that, in the same way as the reversible jump MCMC (RJMCMC) approach (Richardson and Green, 1997) or birth-death MCMC (Stephens, 2000), VB is able to automatically select the number of components k and estimate the parameter values simultaneously (e.g., Attias, 1999). Note, however, that in McGrory and Titterington (2007) the authors had also chosen to compute the deviance information criterion (DIC) (Spiegelhalter et al., 2002; Celeux et al., 2006) within the VB algorithm, but DIC was used only as a complementary approach to validate the automatic VB selection and to assist with modeling decisions for those cases where the VB algorithm could automatically select alternative fitted models under different initialization settings. Of course the computing time and storage involved for VB is significantly less than that required to carry out Monte Carlo based approaches. Many other mixture modeling methods are incapable of this type of simultaneous estimation; they instead separate the selection of k, which is an important issue in mixture modeling (McLachlan and Peel, 2000), from the parameter estimation, which assumes k is fixed (c.f. Richardson and Green, 1997). The automatic and simultaneous scheme is naturally more desirable, particularly for analyzing the heterogeneous spiky patterns with weak prior information that we see in applications.
As mentioned above, in the GMM case, VB will ultimately select a suitable k and converge to give parameter estimates for the k-component model. This leads to what we call the variational posterior fit to the data. Standard VB converges to select a suitable k by effectively and progressively eliminating redundant components in the model. In general mixture modeling (i.e., fitting mixtures with an unknown number of components and unknown parameter values), it is well known that the posterior may be multimodal (Wang and Titterington, 2006; Titterington et al., 1985, pp. 48-50). This of course can cause mixing and label switching problems for MCMC-based algorithms. Within the VB framework, if the posterior is multimodal, then naturally the algorithm can only converge to one of the local maxima of the posterior and any others would not be explored. It would also be possible for the algorithm to converge to different local maxima if different parameter initializations were used. Therefore, while in the specific context of GMMs, VB is guaranteed to converge (Bishop, 2006, p. 466) at least locally to the maximum likelihood estimator (Wang and Titterington, 2006) and has been shown to monotonically improve the model approximations from one iteration to the next (in contrast to the stochastic convergence that is associated with MCMC schemes), standard VB is still somewhat sensitive to the initialization of the hyper-parameters in the prior (Watanabe et al., 2002) and the initial component membership probabilities of each observation. That is, in some cases suboptimal models might be found as a result of these initialization choices.
While the elimination property of VB may often be convenient and useful, note that it implies that the initially proposed number of components kinitial in the standard VB-GMM approach (Attias, 1999; Corduneanu and Bishop, 2001; McGrory and Titterington, 2007) is effectively the maximum. That is, the standard method will lead to suboptimal models if the value chosen for kinitial is smaller than required. This implies that, after assessing the dataset, a suitably large kinitial should be selected in order to try to avoid this problem, but this approach is clearly not convenient or efficient for exploring a large number of highly heterogeneous patterns. In this situation, while one can set kinitial to be equivalent to or larger than the maximum number of components likely to be present across all subsets of the data and then let the algorithm converge in each case, such a tactic is computationally wasteful in terms of both time and storage for those simple patterns where a large number of unnecessary components would have to be removed as the algorithm converged.
This paper addresses the aforementioned challenges of the standard VB-GMM algorithm by allowing components to be split. This strategy removes the limitation imposed through the choice of kinitial, and allows a more thorough exploration of the parameter space than the standard algorithm would achieve. Our approach aims to avoid the possibility of obtaining a less appropriate model as a result of an irreversible VB elimination operation (Wu et al., 2010c), and is more suitable for analyzing the more challenging types of datasets that we are interested in here, namely heterogeneous spiky spatial patterns with weak prior information. Split operations have been proposed previously within the VB-GMM framework in the machine learning literature (Ghahramani and Beal, 1999; Ueda and Ghahramani, 2002; Constantinopoulos and Likas, 2007). However, these approaches have focused on splitting only one component at a time, in an attempt to split every single component until reaching a model that is optimal with respect to their criteria. In contrast, we pursue a more focused approach that is more adaptable for real-world problems. That is, each time we attempt a split in a given iteration, we attempt to split not all components, but only those fitted poorly, and we attempt to split all of them at the same time. We define and assess the goodness-of-fit of each fitted component through a set of proposed split criteria designed to identify why a component is a poor fit and hence determine the appropriate split operations to pursue.
Two possible distributional failures of fit, and hence two different split operations, are considered in our algorithm. The first operation is to split a component into two side-by-side subcomponents such that their means are unequal, µ(1) ≠ µ(2), as is the approach taken in most mixture modeling split studies (e.g., Richardson and Green, 1997). The other operation is to split a component in a less conventional way into two overlapping subcomponents such that µ(1) ≈ µ(2) but ‖Σ(1)‖ ≪ ‖Σ(2)‖. That is, components are split into 'inlier' and 'non-inlier' subcomponents in situations where there exists a high concentration of observations (i.e., inliers) close to the component mean. We propose to allow these two split operations non-exclusively; that is, we will allow a component to be split into three subcomponents at the same time if both split criteria have been satisfied. We have found this to be useful. Like all previous VB-GMM split studies, we rely on the competing nature of the VB mixture modeling approach that arises from the component elimination property associated with using the variational approximation. This enables the scheme to converge to an appropriate model; therefore we need no other elimination or merge moves.
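To make the two split operations concrete, the sketch below constructs subcomponent parameters for each case. The displacement scale eps, the variance ratio rho, and the function names are illustrative choices of ours, not the settings used in this work.

```python
import numpy as np

def split_side_by_side(mu, cov, eps=1.0):
    """First operation: two side-by-side subcomponents with unequal means,
    displaced in opposite directions along the component's longest principal axis."""
    vals, vecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
    direction = np.sqrt(vals[-1]) * vecs[:, -1]      # longest principal axis
    return (mu - eps * direction, cov.copy()), (mu + eps * direction, cov.copy())

def split_inlier_noninlier(mu, cov, rho=0.25):
    """Second operation: overlapping 'inlier'/'non-inlier' subcomponents with
    (approximately) equal means and a much smaller covariance for the inliers."""
    return (mu.copy(), rho * cov), (mu.copy(), cov / rho)

# Illustrative component to be split.
mu = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.3], [0.3, 1.0]])
(a_mu, a_cov), (b_mu, b_cov) = split_side_by_side(mu, cov)
(i_mu, i_cov), (o_mu, o_cov) = split_inlier_noninlier(mu, cov)
```

A component satisfying both split criteria would simply receive both operations, yielding three subcomponents, with the subsequent VB iterations left to eliminate any that prove redundant.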
Typically, mixture models within the Bayesian framework are evaluated based on information criteria such as BIC (McLachlan and Peel, 2000) or the more recently developed DIC (McGrory and Titterington, 2007), which are less wasteful than validation approaches (Corduneanu and Bishop, 2001). However, the posterior lower bound F, which we outline below, is perhaps the most popular approach for comparing models in the VB framework. In fact, F has been used for guiding each split attempt in the previously proposed VB-GMM algorithms with a split process (Ghahramani and Beal, 1999; Ueda and Ghahramani, 2002; Constantinopoulos and Likas, 2007). Despite the fact that our proposed split targets poorly fitted components, and empirical studies suggest that these splits lead to a more suitable model for our application data, we have observed that the models fitted through our split moves are sometimes ranked lower than the pre-split models with respect to BIC, DIC and F. A closer examination of these cases revealed that these measures of fit can be unstable for our data, mainly due to the discreteness present. We discuss this in more detail later. Consequently, we propose a new criterion that is more robust to this issue for evaluating results, and we select the final model based on our proposed goodness-of-fit measure instead. In Section 4.3.5, we will introduce this alternative criterion, which is based on absolute errors and takes the covariance matrix of each component into consideration. We believe our proposed criterion is more appropriate for assessing the fit of the mixture models, at least for our application, and we provide motivation for this opinion by demonstrating results on some examples.
We structure this paper as follows. In Section 4.2, we briefly discuss the theory of
VB and outline the standard VB-GMM algorithm. In Section 4.3, we detail our new
algorithm and proposed model selection measure. Section 4.4 discusses the human
mobility pattern application and presents both simulated and real data results. We
conclude this paper in Section 4.5.
4.2 Standard VB-GMM Algorithm
Modeling more complex distributions via a GMM with k independent Gaussian distributed underlying mixture components is a well-known and popular approach. Given data x = (x1, ..., xn), the mixture density of an observation is of the form Σ_{j=1}^{k} wj N(x; µj, Tj^{-1}), where k ∈ N, N(.) represents the (multivariate) Gaussian density, µj and Tj^{-1} denote the mean and covariance, respectively, for component j, and the mixing proportions {wj} satisfy 0 ≤ wj and Σ_{j=1}^{k} wj = 1. Bayesian inference is based on the target posterior distribution, p(θ, z|x), where θ represents the model parameters (µ, T, w) and z = {zij : i = 1, ..., n, j = 1, ..., k} denotes the unobserved component membership indicators for the observed data x. The target posterior is not analytically available, as is typically the case, and it has to be approximated in the Bayesian inference approach.
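For concreteness, a minimal sketch of evaluating the mixture density Σ_j wj N(x; µj, Tj^{-1}) at a point, parameterized by precision matrices as in the text; the helper names and the two-component parameter values are illustrative choices of ours.

```python
import numpy as np

def mvn_pdf(x, mu, T):
    """Bivariate Gaussian density N(x; mu, T^{-1}), parameterized by the
    precision (inverse covariance) matrix T."""
    d = x - mu
    return np.sqrt(np.linalg.det(T)) / (2.0 * np.pi) * np.exp(-0.5 * d @ T @ d)

def gmm_density(x, w, mus, Ts):
    """Mixture density sum_{j=1}^{k} w_j N(x; mu_j, T_j^{-1}) at a point x."""
    return sum(wj * mvn_pdf(x, mu, T) for wj, mu, T in zip(w, mus, Ts))

# Illustrative two-component mixture; the weights are non-negative and sum to one.
w = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Ts = [np.eye(2), 2.0 * np.eye(2)]
p0 = gmm_density(np.array([0.0, 0.0]), w, mus, Ts)
```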
VB methods are becoming increasingly popular as an approach for approximating the posterior distribution of the parameters of a GMM (Wang and Titterington, 2006). VB aims to obtain tractable coupled expressions for approximating the posterior distribution p(θ|x); these resulting expressions can then be solved iteratively (McGrory and Titterington, 2007). The parameters of the coupled expressions are adjusted via an EM-like optimization algorithm. This yields a sequence of approximations that improve with each iteration and can often be expressed in closed form when these parameters have fixed values (Jaakkola and Jordan, 2000). In the following we briefly outline the VB approach.

We begin by introducing a variational function written as q(θ, z|x), which will be used to maximize the value of a quantity F(q(θ, z|x)) that depends on it as follows.
Jensen's inequality tells us that we can express the marginal log-likelihood as

log p(x) = log ∫ Σ_{z} q(θ, z|x) [ p(x, z, θ) / q(θ, z|x) ] dθ                (4.1)
         = ∫ Σ_{z} q(θ, z|x) log [ p(x, z, θ) / q(θ, z|x) ] dθ
           + ∫ Σ_{z} q(θ, z|x) log [ q(θ, z|x) / p(θ, z|x) ] dθ               (4.2)
         = F(q(θ, z|x)) + KL(q|p)                                             (4.3)
         ≥ F(q(θ, z|x)),                                                      (4.4)
where F(.) is the first term in Equation (4.2) and KL(q|p) is the second term, which is the Kullback-Leibler (KL) divergence between the target p(θ, z|x) and its variational approximation q(θ, z|x). Note that KL(q|p) cannot be negative. By minimizing KL(q|p), VB is effectively maximizing F(q(θ, z|x)), a lower bound on log p(x). However, q(θ, z|x) must be chosen carefully so that it is a close approximation to the true conditional density and, importantly, so that it gives tractable computations for approximating the required posterior distribution. Typically it is assumed that q(θ, z|x) can be expressed as qθ(θ|x) × qz(z|x), with conjugate distributions chosen for the parameters. VB then involves solving for q(θ, z|x) iteratively in a way similar to the classical EM algorithm:
• E-step: find the expected value of the posterior of the component membership, qz(z|x); and,
• M-step: estimate the model parameters in qθ(θ|x) by maximizing F(q(θ, z|x)).
This results in variational posterior approximations of the form p (θ|x) ≈ qθ (θ|x).
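The decomposition log p(x) = F(q) + KL(q|p) can be checked numerically on a toy model with a single discrete latent variable and no continuous parameters; the joint probabilities below are arbitrary illustrative numbers of ours.

```python
import numpy as np

# Toy model: one discrete latent z in {0, 1} at a fixed observation x,
# with joint probabilities p(x, z) chosen arbitrarily for illustration.
p_xz = np.array([0.12, 0.03])            # p(x, z=0), p(x, z=1)
log_px = np.log(p_xz.sum())              # marginal log-likelihood log p(x)

def elbo(q):
    """F(q) = sum_z q(z) log[ p(x, z) / q(z) ], the lower bound on log p(x)."""
    return float(np.sum(q * (np.log(p_xz) - np.log(q))))

q_arbitrary = np.array([0.5, 0.5])       # an arbitrary variational distribution
q_exact = p_xz / p_xz.sum()              # the true posterior p(z|x)
gap = log_px - elbo(q_arbitrary)         # equals KL(q|p), and is non-negative
```

The bound F(q) is tight exactly when q equals the true posterior, which is why minimizing KL(q|p) and maximizing F(q) are the same operation.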
Research on the theoretical properties of VB is limited. However, Wang and Titterington (2006) have already demonstrated the asymptotic consistency of the VB approximation for GMMs with fixed k. They have pointed out that VB-GMM is not biased in large samples, and have proved that its local convergence to the maximum likelihood estimators is at the rate of O(1/n) for large n. As has been noted by other researchers (e.g., Attias, 1999; McGrory and Titterington, 2007), we see here that VB can effectively eliminate unnecessary mixture components when an excessive number of components is specified in the initial model. Although it is not yet well understood how this feature of the algorithm works, it means that VB can be used to estimate model complexity and parameter values simultaneously.
Previous articles on VB-GMM (Attias, 1999; Corduneanu and Bishop, 2001; McGrory and Titterington, 2007; Bishop, 2006, Section 10.2) have made similar prior assumptions for this type of model, but have used different model hierarchies. The conjugate priors used have been of the following forms:
• A Dirichlet distribution for the mixture weights w with respect to the k mixture components,
• A Gaussian distribution for the mean µ of each mixture component, and
• A Wishart distribution for the precision (i.e., inverse covariance) matrix T of each mixture component.
In this paper, we follow the model hierarchy outlined in McGrory and Titterington (2007), and the reader may refer to that paper for further detail on the derivation of the expressions for the VB posteriors given below. Alternatively, readers may also wish to refer to Attias (1999), Corduneanu and Bishop (2001), and (Bishop, 2006, Section 10.2) for different model hierarchies with different parameter notations. We assume a mixture of k bivariate Gaussian distributions with unknown means µ = (µ1, ..., µk), precisions T = (T1, ..., Tk) and mixing coefficients w = (w1, ..., wk), such that

p(x, z|θ) = ∏_{i=1}^{n} ∏_{j=1}^{k} { wj N(xi; µj, Tj^{-1}) }^{zij}.
Recall that we have introduced latent indicator variables, the zij's, in order to express the GMM in the convenient and popular missing data model representation. Note that zij = 1 if observation xi belongs to the jth component and zij = 0 otherwise. The VB approximation will lead to an update expression for the variational posterior estimates of these latent variables, which is outlined below. The joint distribution is

p(x, z, θ) = p(x, z|θ) p(w) p(µ|T) p(T).
Our priors are given by:

p(w) = Dirichlet(w; α1^(0), ..., αk^(0)),
p(µ|T) = ∏_{j=1}^{k} N(µj; mj^(0), (βj^(0) Tj)^{-1}),
p(T) = ∏_{j=1}^{k} Wishart(Tj; υj^(0), Σj^(0)),

with α^(0), β^(0), m^(0), υ^(0), and Σ^(0) being known, user-chosen initial values. These are standard conjugate priors used in Bayesian mixture modeling (Gelman et al., 2004).
Using the lower bound approximation, the posteriors are:

qw(w) = Dirichlet(w; α1, ..., αk),
qµ|T(µ|T) = ∏_{j=1}^{k} N(µj; mj, (βj Tj)^{-1}),
qT(T) = ∏_{j=1}^{k} Wishart(Tj; υj, Σj).
The variational posterior update for the qij's, which denote the VB posterior probability that for observation xi the component membership indicator variable zij = 1, is given by

qij ∝ exp{ Ψ(αj) − Ψ(α·)
           + (1/2) [ Σ_{s=1}^{2} Ψ((υj + 1 − s)/2) + 2 log 2 − log|Σj| ]
           − (1/2) tr( υj Σj^{-1} (xi − mj)(xi − mj)^T + (1/βj) I2 ) },        (4.5)

where Ψ is the digamma function and α· = Σ_{j} αj. Note that the above expression is normalized so that, for each observation xi, the qij's sum to one over the j's. As we can see, the update for the qij's involves the updates for the parameters, which in turn require this update for the qij's; i.e., we have a set of coupled expressions that must be solved iteratively.
The corresponding updates for our posterior parameters are then:

αj = αj^(0) + Σ_{i=1}^{n} qij,
βj = βj^(0) + Σ_{i=1}^{n} qij,
υj = υj^(0) + Σ_{i=1}^{n} qij,
mj = (1/βj) ( βj^(0) mj^(0) + Σ_{i=1}^{n} qij xi ),
Σj = Σj^(0) + Σ_{i=1}^{n} qij xi xi^T + βj^(0) mj^(0) mj^(0)T − βj mj mj^T,

where the posterior expectations are given by E(µj) = mj and E(Tj) = υj Σj^{-1}. In this way, the variational posterior estimate for each of the parameters is updated at each iteration by adding some function of the current estimates of the qij's to the user-chosen initial values that are denoted by the superscript (0).
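Assuming the bivariate model above, one coupled update, Eq. (4.5) followed by the hyper-parameter updates, might be sketched as follows. This is our own illustrative sketch rather than the actual implementation: the synthetic data, initialization values and iteration count are all example choices.

```python
import numpy as np
from scipy.special import digamma

def vb_gmm_iteration(x, prior, post):
    """One coupled update of the bivariate VB-GMM: Eq. (4.5) for the q_ij,
    then the hyper-parameter updates.  `prior` holds the user-chosen (0)
    values; `post` holds the current variational posterior parameters."""
    n, k = x.shape[0], len(post["alpha"])
    log_q = np.empty((n, k))
    for j in range(k):
        Sig, v, beta, m = post["Sigma"][j], post["v"][j], post["beta"][j], post["m"][j]
        # E[log w_j] and E[log|T_j|] under the current variational posterior
        e_log_w = digamma(post["alpha"][j]) - digamma(post["alpha"].sum())
        e_log_detT = (digamma(v / 2.0) + digamma((v - 1.0) / 2.0)
                      + 2.0 * np.log(2.0) - np.log(np.linalg.det(Sig)))
        d = x - m
        quad = v * np.einsum("ni,ij,nj->n", d, np.linalg.inv(Sig), d) + 2.0 / beta
        log_q[:, j] = e_log_w + 0.5 * e_log_detT - 0.5 * quad
    q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)               # normalize each row over j

    for j in range(k):                              # hyper-parameter updates
        Nj = q[:, j].sum()
        post["alpha"][j] = prior["alpha"][j] + Nj
        post["beta"][j] = prior["beta"][j] + Nj
        post["v"][j] = prior["v"][j] + Nj
        post["m"][j] = (prior["beta"][j] * prior["m"][j] + q[:, j] @ x) / post["beta"][j]
        post["Sigma"][j] = (prior["Sigma"][j] + (q[:, j, None] * x).T @ x
                            + prior["beta"][j] * np.outer(prior["m"][j], prior["m"][j])
                            - post["beta"][j] * np.outer(post["m"][j], post["m"][j]))
    return q, post

# Illustrative run on synthetic two-cluster data (all settings are examples).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal([0, 0], 0.5, (100, 2)), rng.normal([5, 5], 0.5, (100, 2))])
k = 2
prior = {"alpha": np.ones(k), "beta": np.ones(k), "v": np.full(k, 2.0),
         "m": [x.mean(axis=0)] * k, "Sigma": [np.eye(2)] * k}
post = {"alpha": np.ones(k), "beta": np.ones(k), "v": np.full(k, 2.0),
        "m": [np.array([0.5, 0.5]), np.array([4.5, 4.5])],
        "Sigma": [np.eye(2), np.eye(2)]}
for _ in range(50):
    q, post = vb_gmm_iteration(x, prior, post)
```

After the iterations converge, E(µj) = mj gives the fitted component means and E(Tj) = υj Σj^{-1} the fitted precisions; note the pull of mj towards the prior mean mj^(0) with weight βj^(0).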
Note that the VB framework we have described can straightforwardly be applied to the case of a multivariate GMM with general dimension. For clarity we have restricted our notation in this article to the two-dimensional case, since our application involves two-dimensional data.
4.3 Split and Eliminate Variational Bayes for Gaussian Mixture Models (SEVB-GMM) Algorithm
Fitting a GMM via the standard VB approach typically involves choosing a value for kinitial to start off the algorithm; this is then effectively the maximum number of components allowed in the model, because the component elimination property of standard VB means that components may be removed at any iteration, but none will be added. This automatic complexity reduction (e.g., Attias, 1999) results from the use of the variational approximation in performing the Bayesian inference. The choice of kinitial can have an effect on the results in some cases. In this paper, we remove the limitation imposed by the choice of kinitial by allowing components in the model to be split during the SEVB iterations, so that the size of k in the final fitted model is still automatically determined, but it can now also be larger than kinitial, if appropriate, as well as smaller. As we discussed in the introduction, this approach is capable of exploring the parameter space more thoroughly than the standard algorithm and is more suitable for analyzing the type of data patterns that we are interested in here, that is, heterogeneous spiky spatial patterns with weak prior information. As we have mentioned, splitting as well as merging mixture components has already been considered within both the RJMCMC (Richardson and Green, 1997) and the VB-GMM frameworks (Ghahramani and Beal, 1999; Ueda and Ghahramani, 2002; Constantinopoulos and Likas, 2007). In Richardson and Green (1997), components are randomly chosen either to be split into two side-by-side subcomponents, or combined into one, with the condition that the split/combine moves are reversible. These moves are accepted or rejected via a trans-dimensional Metropolis-Hastings update. However, this approach is known to be very computationally demanding.
In the context of VB-GMM, a birth-death operation on components has been pro-
posed (Ghahramani and Beal, 1999), based on the idea of improvingF . Alternatively,
Ueda and Ghahramani (2002) suggested transforming VB into a greedy search algo-
rithm that examines all possible split, merge, and split-and-merge moves until F
cannot be further improved. Note that this was based on their previous work (Ueda
et al., 2000) that transformed the standard EM algorithm to be less dependent on
4.3 Split and Eliminate Variational Bayes for Gaussian Mixture Models (SEVB-GMM)Algorithm 81
initial settings. Constantinopoulos and Likas (2007) also proposed a splitting VB al-
gorithm which always starts with a single component and progressively adds more.
Component additions are again guided by improvements in F , and this approach
requires components that will not be considered for splitting integrated out. Each
of the previously proposed VB-GMM splitting algorithms tries to split components
with the worst fit first, and has a different assessment criterion for assessing the fit.
However, we do not believe these approaches are very practical considering that the
range of feasible values for k can be very wide and many applications involve massive
volumes of data. That is, we do not believe that it is necessary, efficient, or effective, from our modeling perspective, to choose and attempt to split only one component at a time and/or to attempt to split every single component. Consequently, in each
split attempt, our algorithm instead identifies all components that do not appear to
have described the pattern well based on a set of proposed criteria and then splits all
of them at the same time.
Our SEVB algorithm makes use of the elimination property of the VB approximation
as discussed in the introduction. That is, components fitting the same region of the
data will be competing with each other; and when there is strong evidence to suggest
that two or more components are fitting the same part of the data, in most cases
only one of these existing components will survive while others will be removed.
In addition, we only attempt to split components when we have reached a stable
model, that is, one which cannot be improved by performing further iterations of
the VB algorithm.
Our proposed algorithm can be summarized by seven steps listed under the heading
Algorithm 1.
Algorithm 1 SEVB-GMM
1: Randomly assign mixture component membership probabilities to each observation and initialize the prior parameter values.
2: Execute a standard VB-GMM iteration (see Section 4.2).
3: If the model is assessed as not stable according to our criteria (see Section 4.3.1) ⇒ repeat Step 2.
4: Examine our split criteria (see Section 4.3.2) on each component.
5: If the algorithm should be terminated according to our criteria (see Section 4.3.4) ⇒ go to Step 7.
6: Perform our split operation(s) (see Section 4.3.3) on components to be split ⇒ return to Step 2.
7: Select the final model based on our model selection criterion (see Section 4.3.5).
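For concreteness, the control flow of Algorithm 1 can be sketched as follows. This is only a schematic: the helper callables (run_vb_iteration, is_stable, and so on) are hypothetical placeholders for the routines detailed in Sections 4.3.1 to 4.3.5, and the toy stubs merely exercise the loop structure.

```python
def sevb_gmm(run_vb_iteration, is_stable, find_split_candidates,
             should_terminate, perform_splits, select_model, model):
    """Schematic control flow of Algorithm 1 (SEVB-GMM).

    `model` holds the current variational posterior parameters,
    with the Step 1 initialization assumed already done.
    """
    while True:
        model = run_vb_iteration(model)             # Step 2
        if not is_stable(model):                    # Step 3
            continue
        candidates = find_split_candidates(model)   # Step 4
        if should_terminate(model, candidates):     # Step 5
            return select_model(model)              # Step 7
        model = perform_splits(model, candidates)   # Step 6


# Toy stubs illustrating the loop only: the "model" is just a counter.
def run_vb(m): return m + 1
def stable(m): return m % 3 == 0            # "stable" every 3rd iteration
def cands(m):  return [0] if m < 9 else []  # nothing left to split at m = 9
def term(m, c): return not c                # T1: no split candidates remain
def split(m, c): return m
def select(m): return m

final = sevb_gmm(run_vb, stable, cands, term, split, select, model=0)
```

Running the stubbed loop terminates via criterion T1 once no split candidates remain, illustrating the Step 2–7 cycle without any actual VB computation.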
That is, we propose to execute standard VB-GMM in Step 2, until a stable model
result is obtained in Step 3. We then aim to identify all poorly fitted components in
Step 4; these are then split simultaneously in Step 6, provided the algorithm has not
satisfied the terminating condition which is checked in Step 5. In the cases where
there are no remaining poorly fitted components to be split, or we do not want to
82 Chapter 4. A New Variational Bayesian Algorithm
split components any further (Step 5), we go to the last step (Step 7) to select the final
model and then terminate the algorithm. Otherwise, we return to Step 2 to execute
more standard VB-GMM iterations until we reach the next stable model (Step 3) after
the split operations have been performed in Step 6.
In the remainder of this section, we first discuss our criterion for assessing whether
the model is stable (Section 4.3.1). We then detail our proposed split criteria (Sec-
tion 4.3.2), split operations (Section 4.3.3), and our criterion for determining if the
algorithm should be terminated (Section 4.3.4). We conclude this section by outlin-
ing our proposed model selection criterion (Section 4.3.5).
4.3.1 Model stability criterion
Our algorithm considers splitting components only when in a stable model. Most
VB-GMM algorithms define a stable model based on examining F . That is, a model
is typically declared stable if F of the current iteration is the same as the previous it-
eration up to a very small tolerance level. Such an approach can be computationally
demanding, and is therefore not suitable for analyzing large amounts of data. We
also point out that, while subsequent iterations may be able to fine tune the mod-
els, we have observed that, often when analyzing the real world data, VB-based al-
gorithms may simply be hopping among several alternative similarly good, but dif-
ferent models, i.e., the models are not really improving. See further discussion on
Criterion T3 in Section 4.3.4. In contrast, we aim to find a balance between accuracy
and computational efficiency. We declare that the model is stable in Step 3 if:
• The number of surviving components, k_surviving, i.e., the number of components currently in the model, remained identical from the previous iteration (S1);
• The variational posterior mean estimates of all surviving components, the mj's, remained the same, up to a tolerance level δ1, from the previous iteration (S2). A suitable tolerance level choice will be application driven, so as to obtain the user's desired level of accuracy; and,
• At least c0 iterations have been completed since the initialization or the last stable model (S3). This is to prevent the algorithm being declared stable prematurely in the very first few iterations.
Therefore, instead of monitoring changes in F , we propose to focus on the estima-
tion of the key model parameters through checking criteria S1 and S2. We have found
this to be adequate. As noted, S3 is to prevent the algorithm being declared stable
too early (e.g., after only one or two iterations) in the process. While in the majority of cases model parameter estimates change rapidly in the early iterations, premature stability has occasionally been observed when applying the algorithm. Therefore, in practice, the choice of c0 will have minimal effect on the final fit; however, choosing it to be a very large number would tend to lead to excessive and wasteful iterations. Once we have obtained a stable model, we then proceed to Step 4.
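A minimal sketch of the stability check, under the assumption that the surviving means are held in NumPy arrays (the function and argument names are our own, illustrative choices):

```python
import numpy as np

def model_is_stable(k_prev, k_curr, m_prev, m_curr,
                    iters_since_event, delta1=1e-5, c0=5):
    """Criteria S1-S3 of Section 4.3.1.

    k_prev/k_curr: surviving component counts at the previous and
    current iterations; m_prev/m_curr: arrays of the variational
    posterior means m_j; iters_since_event: iterations completed since
    initialization or the last stable model. delta1 and c0 are the
    user-chosen tolerance and minimum iteration count (illustrative
    values here).
    """
    s1 = k_prev == k_curr
    s2 = s1 and bool(np.all(np.abs(np.asarray(m_prev)
                                   - np.asarray(m_curr)) < delta1))
    s3 = iters_since_event >= c0
    return s1 and s2 and s3
```

Note that S2 is only evaluated when S1 holds, so the mean arrays are never compared across models with different numbers of components.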
4.3.2 Component splitting criteria
Unlike existing VB-GMM splitting algorithms, which consider every component as a candidate for splitting, we adopt a more targeted and efficient approach: we only attempt to split poorly fitted components. We identify the poorly fitted components in Step
4 through the use of our proposed criteria. We have proposed two split criteria for
the two distributional imperfections that we are interested in identifying. We shall
first discuss our inliers and non-inliers split criterion. This criterion aims to find components which might be better separated into two overlapping subcomponents whose means are such that µ(1) ≈ µ(2), but whose variances are such that ‖Σ(1)‖ ≪ ‖Σ(2)‖. We then detail our standard split criterion for identifying poorly fitted components that could be improved by separation into two side-by-side subcomponents whose means are such that µ(1) ≠ µ(2). Note that the reader may ignore
the inliers and non-inliers split criterion and the corresponding operation if they do
not wish to make such assumptions in their application.
4.3.2.1 Inliers and Non-Inliers Split Criterion
This criterion is based on the Mahalanobis distance (MD) measure. This measure
is utilized because it takes correlation between variables into consideration and it
is scale invariant. While MD is more typically used as a multivariate outlier statistic,
we show here that by considering its theoretical distribution, we obtain a straightfor-
ward diagnostic for the distributions fitted to the components which can aid in the
identification of the inliers. We define the distance MD(j)i, corresponding to observation xi, from the most likely jth component (determined by the largest qij, Equation (4.5)) with mean mj, as

MD(j)i = √[ (xi − mj)^T (Σj/υj)^(−1) (xi − mj) ].
For each fitted component, the estimated distribution can be compared with the chi-square distribution with two degrees of freedom (df), i.e., MD² ∼ χ²(df=2), as this is the theoretical relationship that exists when x is assumed to be bivariate Gaussian distributed (Azzalini, 1996, p. 291). We consider an observation xi to be an inlier with respect to the jth component if

MD(j)i < √( χ²(df=2, α=r) ),   (4.6)
where α represents the cumulative probability of the area under curve of the chi-
square distribution. Its value r is a user chosen probability value and a reasonable
choice for r will depend on the application: typically one would like to have a small
r so that only observations lying within a small MD, calculated from the right hand
side of Equation (4.6), from a fitted component mean will be identified as inliers.
The inliers and non-inliers split criterion for the jth component is as follows. We consider it appropriate to split into two overlapping subcomponents if

N(j)inliers / N(j) > q.

In the above expression, N(j) and N(j)inliers are the total number of observations and the number of inliers, respectively, belonging to component j; and q is a chosen probability value such that 1 > q > r > 0. That is, we highlight components for splitting into two overlapping subcomponents when they have more than a proportion q of their observations classified as inliers, whereas the theoretical expected proportion
would only be r. In other words, we adopt a simple two-level thresholding approach to identifying inlier components: we choose r to correspond to the proportion of observations we would expect to see lying in the center region of a fitted component; then, by assessing how many observations actually lie in that region for each component present in the model, we can decide whether a further split is needed. Note, however, that choosing r too close to zero will result in the algorithm missing the
potential inliers subcomponent if its µ is not fairly close to the fitted component
mean.
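As a sketch, the inlier-proportion check might be implemented as follows. We exploit the fact that for df = 2 the chi-square quantile has the closed form −2 ln(1 − r), so no statistics library is needed; the function and argument names are illustrative, not from the thesis.

```python
import numpy as np

def inlier_split_flag(x, m_j, cov_j, r=0.25, q=0.45):
    """Inliers and non-inliers split criterion (Section 4.3.2.1).

    x: (N_j, 2) observations assigned to component j by the largest
    q_ij; m_j: posterior mean; cov_j: the scaled covariance
    Sigma_j / upsilon_j. An observation is an inlier when
    MD_i^(j) < sqrt(chi2_{df=2, alpha=r}) (Equation (4.6)); the
    component is flagged when the observed inlier proportion exceeds
    q (with 1 > q > r > 0).
    """
    diff = np.asarray(x) - np.asarray(m_j)
    # Squared Mahalanobis distances of all observations at once.
    md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov_j), diff)
    # chi-square(df=2) quantile at r in closed form: -2 ln(1 - r).
    threshold2 = -2.0 * np.log(1.0 - r)
    return float(np.mean(md2 < threshold2)) > q
```

A component concentrated at its mean (e.g., a point mass at a single cell tower) would be flagged, while a well-fitted bivariate Gaussian would have an inlier proportion near r and would not.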
4.3.2.2 Standard Split Criterion
In contrast with the inliers and non-inliers split criterion we have just described, the
type of move we use here, which splits components into two side-by-side subcom-
ponents is much more typical in the literature. In fact, it is generally the only type
of split that is used in mixture modeling problems (Richardson and Green, 1997;
Constantinopoulos and Likas, 2007) or clustering algorithms (Ball and Hall, 1965).
For our standard split criterion move, we propose using principal component anal-
ysis (PCA). We use PCA to transform linearly correlated variables into a set of un-
correlated principal components (or eigenvectors). This can assist in determining
whether or not a component has too much variation in a certain basis as this would
suggest that it could perhaps be better fitted using more than one component. An-
other advantage to using PCA is that we can straightforwardly incorporate it into our
algorithm since it makes use of the easily computable covariance matrix Σj/υj which
is already estimated in the algorithm.
In the PCA for our bivariate model, we transform our variables to obtain two prin-
cipal components p1 and p2. Here p1 represents as much of the data variation as
possible, that is, the larger eigenvalue λ1, and p2 represents the remaining variation
(λ2) of the component. The standard split criterion for the jth component is then as follows: split into two side-by-side subcomponents if

λ1 / (λ1 + λ2) > g   and   σ(j)max > s.

In the above expression, σ(j)max = √[ max( diag(Σj/υj) ) ], which represents the larger of the standard deviations in the X and Y coordinates of the component of interest. This
means that we assess that a component should be split into two side-by-side com-
ponents if it has more than proportion g of the data variation along p1, and the larger
of the standard deviations in either X-Y coordinates is greater than s. Here g and s
are carefully chosen values that will often be application driven: g should be chosen to be reasonably large, as a large eigenvalue ratio typically suggests a poorly fitted component; and the smaller s is, the more likely we are to split an irregular component, which would then lead to more complex fitted models. In this way we have devised an alternative criterion to that
used in Constantinopoulos and Likas (2007) where the order of the split is assessed
by det(Σ^(−1)). It is also an alternative to that used in Ueda and Ghahramani (2002), which assessed whether to split based on the KL divergence between the data density and its estimated model distribution. Note that, instead of σmax, we could alternatively have used other measures such as √λ1. Either of these would assist the algorithm in a similar manner; here σmax is adopted following Ball and Hall (1965).
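Since Σj/υj is 2 × 2 in our bivariate setting, the eigenvalue ratio can be computed directly. A sketch, with illustrative g and s defaults matching the choices quoted later in Section 4.4.2 (the function name is our own):

```python
import numpy as np

def standard_split_flag(cov_j, g=0.90, s=0.05):
    """Standard split criterion (Section 4.3.2.2) for one component.

    cov_j is the scaled covariance Sigma_j / upsilon_j. The component
    is flagged for a side-by-side split when the leading principal
    component carries more than proportion g of the variation,
    lambda1 / (lambda1 + lambda2) > g, and the larger coordinate
    standard deviation sigma_max^(j) = sqrt(max diag(cov_j)) exceeds s.
    """
    lam2, lam1 = np.linalg.eigvalsh(cov_j)   # eigenvalues in ascending order
    sigma_max = float(np.sqrt(np.max(np.diag(cov_j))))
    return bool(lam1 / (lam1 + lam2) > g and sigma_max > s)
```

The second condition prevents very small but elongated components (which are already fitting tight clusters well) from being split needlessly.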
4.3.3 Component split operations
In Section 4.3.2 we described the two proposed split criteria that are used in Step
4. Unless the algorithm termination criterion has been satisfied in Step 5 (e.g., this could be satisfied because there are no remaining poorly fitted components for splitting), the aim in Step 6 is then to split the identified components into either two or three subcomponents as appropriate. The main focus
here is on determining the posterior parameter values for initializing the newly cre-
ated subcomponents. Depending on which split criterion the component has sat-
isfied, one of the two possible split moves will be performed. We discuss these and
the special case where a component has been flagged by both of the criteria in detail
in Sections 4.3.3.1–4.3.3.3 below. Section 4.3.3.4 then gives a description of model
adjustments that we carry out after all subcomponents have been created in order
to ensure that the algorithm continues towards convergence.
4.3.3.1 Inliers and Non-Inliers Split Operation
If a component has satisfied the inliers and non-inliers split criterion (see Section 4.3.2.1), this implies that at least a proportion q of the observations assigned to that component have been assessed as inliers. In these instances, our split creates two new overlapping subcomponents; one represents the inliers and the other represents the non-inliers. Assuming that we choose q ≈ 50% in this split move, we can initialize the posterior parameters of the two new subcomponents (instead of modifying the qij's to estimate them) as follows.
• m_inliers = m_non-inliers = mj;
• α_inliers = α_non-inliers = 0.5 × αj;
• β_inliers = β_non-inliers = 0.5 × βj;
• υ_inliers = υ_non-inliers = 0.5 × υj;
• Σ_non-inliers = Σj;
• Σ_inliers = (1/c1) × Σj,
where c1 > 1 is a user-chosen value. This implies that we assign these two new subcomponents the same mean, each with only half of the parent component's mixing weight, and that the inliers subcomponent will have smaller variances than its parent component. These newly created subcomponent initializations are used in the next round
of standard VB-GMM iterations in Step 2. The choice of c1 is data dependent, and
represents the assumed difference in variance between the inliers and non-inliers
components. While the specific choice of c1 will generally have only a minimal effect
on the resulting fit, setting c1 too large or too small may increase the likelihood of the
newly created inliers subcomponent being eliminated.
Note that the user may wish to estimate these posterior parameter values more formally by first partitioning the observations in the component. However, we found that our simple proposal was sufficient, as the following VB iterations will adjust these proposed values.
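The bullet-point initializations above translate directly into code. A sketch (the dictionary layout and the c1 default are illustrative assumptions, not from the thesis):

```python
import numpy as np

def inliers_split(alpha_j, beta_j, upsilon_j, m_j, Sigma_j, c1=4.0):
    """Inliers and non-inliers split operation (Section 4.3.3.1).

    Returns (inliers, non_inliers) posterior-parameter dictionaries:
    shared mean m_j; alpha, beta, and upsilon halved; the parent
    covariance Sigma_j for the non-inliers and Sigma_j / c1 (c1 > 1)
    for the inliers. c1 = 4 is an illustrative, data-dependent choice.
    """
    shared = dict(alpha=0.5 * alpha_j, beta=0.5 * beta_j,
                  upsilon=0.5 * upsilon_j, m=np.array(m_j, dtype=float))
    inliers = dict(shared, Sigma=np.asarray(Sigma_j) / c1)
    non_inliers = dict(shared, Sigma=np.array(Sigma_j, dtype=float))
    return inliers, non_inliers
```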
4.3.3.2 Standard Split Operation
Components flagged by the standard split criterion (see Section 4.3.2.2) will be split
into two side-by-side subcomponents. Our approach is to use the data to initialize the posterior parameters of these subcomponents. We do so by first linearly projecting the observations onto p1 via PCA, and then grouping them according to m(j)p1, which is the p1 transformation of mj. These subcomponent initializations are then used in the standard VB-GMM iteration at Step 2. Our approach
here differs from Ghahramani and Beal (1999) where the split direction was instead
sampled from the parent component’s distribution rather than relying on PCA. While
our objective here is similar to that of Constantinopoulos and Likas (2007), we found their inverse covariance matrix assumption T^(±) = T^(j) for initializing the two subcomponents problematic for our real-world application, as many components that required this type of split were those covering two or more unrelated clusters, and in these cases the inverse covariance matrix assumption is not realistic.
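A sketch of the projection step, under the assumption that the observations currently assigned to the component are available as a NumPy array (the names are illustrative):

```python
import numpy as np

def standard_split_partition(x, m_j, cov_j):
    """Standard split operation (Section 4.3.3.2): divide a component's
    observations between two side-by-side subcomponents.

    The observations are projected onto the leading principal
    component p1 of cov_j = Sigma_j / upsilon_j and grouped according
    to which side of the projected mean m_p1^(j) they fall on.
    Returns the two index arrays used to initialize the subcomponents.
    """
    _, eigvecs = np.linalg.eigh(cov_j)
    p1 = eigvecs[:, -1]                      # eigenvector of lambda1
    scores = np.asarray(x) @ p1              # projected observations
    m_p1 = float(np.dot(m_j, p1))            # projected component mean
    left = np.flatnonzero(scores <= m_p1)
    right = np.flatnonzero(scores > m_p1)
    return left, right
```

Each index set then seeds one subcomponent's posterior parameters (sample mean and covariance of its observations, for instance) before the next VB iteration refines them.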
4.3.3.3 Case of Splitting One Component into Three Subcomponents
If a component satisfies both of our split criteria, we perform both operations on the
component which results in its replacement by three new subcomponents instead
of two. We do this by performing the inliers and non-inliers split operation first.
When the standard split operation is then performed, all posterior parameters, with the exception of m, will need to be further halved, because the inliers were not excluded from the side-by-side subcomponent initialization process. As before,
these initializations are used in the next iteration of Step 2.
4.3.3.4 Adjusting the Variance Posterior Parameters for All Components
As mentioned, our final task in Step 6 is to adjust the overall model that we obtain
after all subcomponents have been initialized. Due to the convergence properties
of the VB algorithm, the combined component variance will generally decrease at
each iteration as the algorithm moves closer to a solution and the components pro-
vide an improved fit. Since we will have split some or all components in the stable
model, it is logical to assume that the overall dynamics of the model will have been
changed. Since splitting leads to additional components in some neighborhoods,
we would expect that some of the observations covered by other non-altered com-
ponents could now be incorrectly classified. We can address this issue by increasing
the variances of all components (after all subcomponents have been initialized) such
that the value of the posterior parameter Σj is updated as follows:

diag(Σ*j) = c2 × diag(Σj),

with a user-chosen value c2 > 1. We do this without concern as the estimates of the
variances will then be updated and improved with further iterations of the algorithm.
That is, the specific choice of c2 will generally have little effect on the overall results, but of course setting c2 too small would defeat the purpose of this particular step. After this process has been performed, we return to Step 2.
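This adjustment is a one-line operation per component; a sketch with an illustrative c2 (the function name is our own):

```python
import numpy as np

def inflate_variances(Sigmas, c2=1.5):
    """Post-split adjustment of Section 4.3.3.4: scale the diagonal of
    every component's Sigma_j by a user-chosen c2 > 1 so that further
    VB iterations can reassign observations freely; off-diagonal
    entries are left unchanged. c2 = 1.5 is illustrative.
    """
    adjusted = []
    for S in Sigmas:
        S = np.array(S, dtype=float)
        idx = np.arange(S.shape[0])
        S[idx, idx] *= c2            # diag(Sigma_j*) = c2 x diag(Sigma_j)
        adjusted.append(S)
    return adjusted
```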
4.3.4 Algorithm termination criterion
In Step 5, we must decide whether to terminate the algorithm. In order to do this we
introduce termination criteria. We declare that the algorithm should be terminated
if either:
• No components satisfy any of the split criteria (T1);
• N_splitting / k_surviving > c3 (T2); or
• The model log-likelihood (LL) is the same, within a tolerance level δ2, as that of a previous stable model with an identical k_surviving (T3).
Here c3 is a chosen value such that c3 > 1, and k_surviving and N_splitting represent the number of components currently surviving, i.e., currently in the model, and the total number of split operations that have been performed up to this point, respectively. If the chosen c3 is too small, it will limit the algorithm's exploration of the parameter space, while if it is too large, potentially unnecessary iterations will be performed; this choice therefore involves a trade-off. On the other hand, there is no need to choose the tolerance level δ2 to be very small, as very small differences between the LLs have little significance.
Termination criterion T1 is straightforward, so here we give some further detail on
the motivation for the other criteria.
Criterion T2 allows us to assess whether further split attempts are worthwhile based
on previous split attempts. Ideally we would like to track whether previous splits
have been successful or not as was done in Constantinopoulos and Likas (2007).
However, because the design of our algorithm allows for multiple splits to be per-
formed simultaneously, and these splits can potentially change the dynamics within
the model, the tracking approach is less straightforward here. For this reason we
have designed T2 as a way of tracking simply how successful the splits have been
overall.
Criterion T3 was proposed to recognize two further situations in which the algorithm should be terminated to avoid wasteful unnecessary computations. Firstly,
we know that when all split attempts have failed the algorithm will most likely have
converged back to models which are identical or very similar to those models that
we had prior to the attempted split. Secondly, we know that, quite often, we can
model the same (heterogeneous) data well with several alternative models; and in
those situations, we have observed that our algorithm can become stuck moving be-
tween several alternative ‘good’ models in our application. As a result, we would like
to declare the algorithm to be terminated in these situations immediately without
further unnecessary computations.
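The three criteria combine with a logical OR. A sketch follows; the bookkeeping of stable-model log-likelihoods as (k, LL) pairs is our own illustrative choice, not a structure prescribed by the thesis.

```python
def should_terminate(split_candidates, n_splits, k_surviving,
                     ll_current, stable_history, c3=3.0, delta2=1.0):
    """Termination criteria T1-T3 of Section 4.3.4.

    split_candidates: components flagged by the split criteria;
    n_splits: total split operations performed so far;
    stable_history: (k_surviving, log-likelihood) pairs recorded at
    earlier stable models. c3 > 1 and delta2 are user chosen; the
    defaults here are illustrative.
    """
    t1 = len(split_candidates) == 0
    t2 = n_splits / k_surviving > c3
    t3 = any(k == k_surviving and abs(ll - ll_current) < delta2
             for k, ll in stable_history)
    return t1 or t2 or t3
```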
4.3.5 Model selection criterion
The final step, Step 7, of our algorithm is to select the final model. We opt to select the
final model after all proposed splits have been considered and computations have
been completed. Models within the Bayesian framework can be evaluated based on
information criteria such as the BIC or DIC. The DIC has been used as a complementary model selection technique in the VB-GMM approach of McGrory and Titterington (2007). However, in the VB literature, F is perhaps the most popular approach for comparing models.
Interestingly, some studies (e.g., Beal and Ghahramani, 2002, 2006) have shown that monitoring F consistently outperformed the less computationally efficient BIC
approach for finding an appropriate model structure in each of their simulated ex-
amples. Most previous VB-GMM splitting algorithms allow components to be split
(Ghahramani and Beal, 1999; Ueda and Ghahramani, 2002; Constantinopoulos and
Likas, 2007) and then examine if the proposed random split should be accepted or
rejected based on the improvement of F . We do not monitor F or utilize F for deter-
mining the validity of the splits. Our proposed splits target poorly fitted components
meaning that intuitively they should result in a better representation of the data and
visual inspection of empirical results suggested that this was the case. However,
somewhat surprisingly, we observed that on some occasions the fit we obtained after
carrying out splits was ranked lower than the pre-split model fit, with respect to BIC,
DIC, and F . Since this conflicts with intuitive reasoning, we further explored this
issue and concluded that this is largely due to discreteness in our dataset. We have
proposed a new criterion for comparing the fitted models; we have found that eval-
uating the results and selecting the final model using our new goodness-of-fit mea-
sure is more robust to this issue than any of the aforementioned criteria based on
empirical results. We outline and discuss how we propose to evaluate the goodness-of-fit below, and Section 4.4.3 further illustrates this point through empirical studies.
Model evaluations or selections based on goodness-of-fit measures are particularly
useful for Bayesian techniques as they suffer less from the problem of over-fitting.
In this respect, it has been shown that absolute error is preferable as a goodness-of-fit measure to widely used squared-error-related measures (e.g., the sum of squared errors (SSE) and root mean squared error (RMSE), used in Ueda et al. (2000) and Ueda and Ghahramani (2002), for example), which are known to be misleading and particularly sensitive to outliers (see Armstrong, 2001, Chapter 14 for further detail). However, none of these simple distance-related measures is appropriate for evaluating results in applications such as ours, which often involve large numbers of inliers and in which the
variables are highly correlated. To address this problem, we have introduced an al-
ternative criterion which is based on absolute errors, but also takes the 1-norm of
the covariance matrix of each component into consideration. We propose that this
provides a more appropriate assessment of the model. We call our proposed mea-
sure Mean Absolute Error Adjusted for Covariance (MAEAC), and it is based on the
use of MD:
MAEAC = (1/n) ∑_{i=1..n} MD(j)i × √( ‖Σ(j)‖₁ / υ(j) ),   (4.7)
where observation xi belongs to the jth component as determined by the maximum
value of qij (Equation (4.5)). Recall that MD has been used in Section 4.3.2.1 for identifying inliers, and it can be considered as an absolute deviance measure. In MAEAC, we estimate the model 'absolute error' with respect to observation xi by multiplying its MD(j)i by our best estimate of the deviance of the jth component, namely the square root of the maximum overall variance of the component; averaging these estimated absolute errors yields the MAEAC. We select the final model based on this goodness-of-fit measure
before we end the algorithm. We believe that this is a more appropriate selection criterion than the BIC, DIC, or F, because unlike these it does not involve an estimated
LL term. Estimation of the LL can be unstable when the component estimated co-
variance measure is singular or near singular. For example, this can occur when there
is a point mass in the dataset. Empirical studies support the assertion that MAEAC is
more robust in these settings and we also find that the use of MAEAC leads to more
consistent and reliable model selection when the same data is being analyzed with
different initialization settings. The initialization settings we are referring to are the
choice of initial model complexity kinitial and the corresponding various possible ini-
tial allocations of observations to components. Note that the user may elect to adopt
the usual model selection criteria instead of our MAEAC.
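Equation (4.7) can be computed directly from the fitted posterior parameters. A sketch (the argument layout is our own; note that `np.linalg.norm(A, 1)` gives the matrix 1-norm, i.e., the maximum absolute column sum):

```python
import numpy as np

def maeac(x, assign, means, Sigmas, upsilons):
    """Mean Absolute Error Adjusted for Covariance, Equation (4.7).

    assign[i] is the component j with the largest q_ij for x_i.
    Each observation contributes MD_i^(j) * sqrt(||Sigma^(j)||_1 /
    upsilon^(j)), where the Mahalanobis distance uses the scaled
    covariance Sigma_j / upsilon_j as in Section 4.3.2.1.
    """
    total = 0.0
    for i, j in enumerate(assign):
        cov = np.asarray(Sigmas[j]) / upsilons[j]   # Sigma_j / upsilon_j
        diff = np.asarray(x[i]) - np.asarray(means[j])
        md = float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
        one_norm = np.linalg.norm(np.asarray(Sigmas[j]), 1)
        total += md * np.sqrt(one_norm / upsilons[j])
    return total / len(assign)
```

Because no log-likelihood term is involved, the measure remains finite and comparable even when a component's covariance is near singular, which is the failure mode described above for the BIC, DIC, and F.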
4.4 Human Mobility Pattern Application & Results
4.4.1 Data mining & human mobility patterns
Data mining involves extracting nontrivial, previously unknown, but useful hidden
information from large datasets (Han and Kamber, 2006). It has attracted increas-
ing attention in recent years as a result of the rapidly growing amount of available
data, and the timely need to turn it into knowledge. The real-world application on modeling human mobility patterns that we explore in this chapter is one such example of a research area where large amounts of data are involved and efficient data
mining techniques are required. Through this application, we demonstrate that our
algorithm has improved upon the standard VB-GMM method and is also more ro-
bust to various initialization settings. This is a prime example for illustrating our
approach because individuals' observed mobility patterns are highly heterogeneous and spiky, and there is very little prior knowledge about them. We model each individual's mobility pattern by a GMM. We believe our modeling approach is appropriate because of the ease of interpretation, flexibility, and computational convenience
of GMM. We present results for both simulated (in Section 4.4.2) as well as real data
(in Section 4.4.3).
Human trajectories have been modeled before with Lévy flight and random walk models (Brockmann et al., 2006), but these previous analyses have not taken individuals' well-known high degree of spatial regularity into consideration; i.e., it is known that over a period of time people tend to return frequently to the same several locations, and these frequented locations may change across different periods
throughout their lives. For example, for many people, two of their most frequented
locations will be their current home and office (c.f. Gonzalez et al., 2008). This is
an important issue that should be accounted for when modeling this type of data,
particularly when we consider that it is estimated that individuals typically spend
approximately 40% to 80% of their time in their first two preferred locations. Note
that Gonzalez et al. (2008) show that the probabilities of individuals visiting certain
locations can be reasonably approximated by a truncated power law. This regularity issue, which has a significant influence on the choice of an appropriate distribution for use in modeling, is addressed very easily in our approach with the use of our
proposed inliers and non-inliers split process. That is, we can model the frequently
visited locations with inliers components and capture the broader activity areas with
non-inliers components. This, on the whole, should lead to a better representation
of the observed mobility patterns.
Of course, while individuals typically spend the majority of their time in the same
area(s), they will also occasionally visit alternative locations which we refer to as
‘remote’ in this context. We use the term ‘remote locations’ to encompass all ob-
served locations other than the ones that are habitually visited by the given individ-
ual. These remote locations can vary widely in their range of distance from the habit-
ual daily activity areas and in frequency of observation. For example, a one-off visit
to a friend living at the opposite end of the city, or a vacation to the other side of the
country, would both represent visits to remote locations in relation to an individual’s
commonly exhibited day to day trajectories. Readers are referred to Gonzalez et al.
(2008) for further discussion on this point. This is another important feature of hu-
man mobility patterns which has also been ignored in previous Markov-based mod-
eling approaches. This characteristic presents challenges for the standard algorithm
as these isolated cases can have a large effect on the fit obtained using the standard
approach. We have observed repeatedly that applying the standard algorithm with
kinitial too small, or simply with an inappropriate component membership initializa-
tion setting, can lead to one component representing two or more ‘unrelated’ areas
visited by an individual. Here the notion of unrelated refers to locations in which the
individual’s observed presence has no clear connection, for example, we might think
of observations recorded at a person’s office and the surrounding cafes or transport
hubs as being related, while an observed visit to a restaurant in another part of town
and a visit to their local doctor’s surgery are unrelated. Our algorithm, through our
proposed standard split process, aims to address this issue in modeling terms, and
is therefore more robust to the initialization settings.
Gonzalez et al. (2008) point out that modeling of human mobility patterns has suggested that, regardless of how diverse and wide an individual's mobility or travel history is, human beings tend to follow simple underlying reproducible patterns.
Therefore, while our interest here is in capturing patterns with the business oriented
view of improving understanding of the needs and habits of each telecommunica-
tion customer, an ability to effectively model human mobility patterns could poten-
tially have wider implications for many other real world problems that are driven by
the effects of human mobility, from the formulation of disease and epidemic strategies to disaster response strategies.
4.4.2 Simulated results
We consider modeling a simulated human mobility-like spatial pattern. We compare
our SEVB-GMM algorithm to the standard VB-GMM algorithm outlined in McGrory
and Titterington (2007) using identical prior initialization, the settings for which are
not our primary concern, with the exception of kinitial. We opt to use uninformative prior settings, as it is unrealistic to assume any specific advance knowledge of each pattern to be analyzed.
We have several algorithm settings that we must assign. We point out that in the case
of straightforward non-heterogeneous datasets, the particular choices are much less
significant as they will generally only impact computational time. This is because
their influence is to affect the number of split-merge attempts in the convergence
towards the final fit and, with datasets where there is clearer separation between
components, the final fitted model would generally not be altered as the algorithm
can more easily recover a good fitting model. However, for real data applications
where we often see heterogeneous or spiky patterns, these choices become slightly
more significant: if these values are chosen so as to encourage more split attempts
then we increase the likelihood of finding particular components and hence obtain-
ing a more complex model. This means that in choosing these variables our aim is to
find a trade off which encourages an appropriate tendency for split attempts with as
4.4 Human Mobility Pattern Application & Results 93
little computational waste as possible. In practical settings the choice of these val-
ues will primarily be application driven and here we explain the motivation for our
particular choices within the context of our problem. Our choices were:
• δ1 = 0.00001 longitudinal or latitudinal degrees. This choice corresponds to a
small area of approximately 1 meter as the tolerance level for declaring a stable
model. We use a very small δ1 here because the actual position of an individual,
unlike the observed data, is not restricted by the locations of the cell towers (c.f.
Gonzalez et al., 2008),
• δ2 = 1 as the model LL tolerance level for terminating the algorithm, which
means that we treat the model LLs as equal if the difference between the current
and any previous stable model is less than 1,
• r = 0.25 and q = 0.45, representing that we would like to separate out inliers
from the rest of the observations for components that have more than 45% of
their observations classified as inliers, whereas the theoretical value is only 25%,
• g = 0.90 and s = 0.05 degrees, representing that we would like to split components
for which over 90% of the data variation is along the principal component p1,
and whose extent exceeds 30 km, or 0.30 degrees (i.e., 6s), in either the longitudinal
or latitudinal direction,
• c0 = 5 so that in each case the algorithm will have at least 5 iterations; this
choice is simply to ensure that the algorithm does not terminate prematurely
and any other small number would be reasonable,
• c1 = 1000 which means that the variance posterior parameter of the inliers sub-
component will be initialized 1000 times smaller than that of the non-inliers
subcomponent when the inliers and non-inliers split operation is performed,
• c2 = 100 which means that after all subcomponents have been initialized, we
would like to adjust the overall model by increasing all component variance
posterior parameters by 100 times, and
• c3 = 2 which means that, on average, we allow each surviving component to be
split fewer than two times.
Note that we chose q close to 0.50 as a result of our inliers and non-inliers split op-
eration assumption and in order to have a meaningful inliers subcomponent with a
reasonable number of observations. On the other hand, our choice of r, which is
partially affected by the choice of q, allows us to identify potential inlier subcom-
ponents whose µ’s are not fully aligned with the current component mean. This is
94 Chapter 4. A New Variational Bayesian Algorithm
particularly important for our real application data as it is somewhat discrete (c.f.
comments above on the selection of δ1). We find the selection of g and s sensible,
as components split using the standard approach in this application mainly arise
through individuals occasionally traveling to isolated locations in relation to their
base. That is, we target components which have their first eigenvalue typically con-
tributing close to 100% of component variation as these are the result of components
wrongly surviving through groupings of observations from two or more unrelated
components; and we simply elect to ignore those cases with small σmax which, even
when they appear to be poorly fitted, are unlikely to affect our understanding of an
individual’s mobility pattern overall.
We emphasize that the selection of c0 = 5 is only there to ensure that the algorithm
has a chance to move towards stable model results by avoiding early termination.
Users may select alternative values but we would suggest five as a minimum; our
data analysis in Section 4.4.3 involved 3000 runs, and over 95% of them required
more than five iterations (including the initialization and splitting steps) to reach
the first stable models. The choice c1 = 1000 is proposed because the average cell
tower service area is approximately 3 km2 and over 30% of them cover an area of
1 km2 or less (Gonzalez et al., 2008); cities like New York City are about 1000 times
that size. c2 = 100 is used as we have observed that, on average, when excluding
the top 5% of cases, the combined component variance of our test data decreased
by a factor of around 100 from the first iteration to the first stable model. We have
observed that when a much smaller c2 is used, we are more likely to obtain results
which either over-fit the data or miss inliers components, and which are less robust
to the initialization settings.
Empirical results suggest that our algorithm is generally not very sensitive to the
selection of these parameter values. However, many of them were set based on
advance knowledge of the nature of the data. It is largely only the computational
efficiency that may be influenced by the selection of δ1 and δ2. However, the model
complexity, in particular for analyzing real application data, may be affected by the
selection of r, q, g, s, and c3 to some extent. The selection of c0, c1 and c2 has largely
no effect on either the computational efficiency or the model complexity. We discuss
this in our limited sensitivity analysis based on the simulated data at the end of this
section.
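One plausible reading of how δ1, δ2 and c0 combine into the stopping rule described above can be sketched as follows (function and argument names are hypothetical, not the thesis code):

```python
import numpy as np

def should_terminate(prev_means, means, prev_ll, ll, iteration,
                     delta1=0.00001, delta2=1.0, c0=5):
    """Sketch of the stopping rule: run at least c0 iterations, declare the
    model stable once no component mean has moved more than delta1 degrees,
    and treat log-likelihoods as equal when they differ by less than delta2."""
    if iteration < c0:                      # never terminate prematurely
        return False
    moved = np.abs(np.asarray(means) - np.asarray(prev_means)).max()
    return bool(moved < delta1 and abs(ll - prev_ll) < delta2)

# Means still drifting by ~0.01 degrees (~1 km): keep iterating.
print(should_terminate([[2.90, 1.50]], [[2.91, 1.50]], -100.0, -99.5, 10))  # False
# Means frozen to sub-metre precision and LL change below 1: stop.
print(should_terminate([[2.90, 1.50]], [[2.900001, 1.50]], -99.5, -99.4, 10))  # True
```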
Our experimental results are based on a simulated dataset of 200 observations from
a mixture of seven Gaussian distributions with parameter values given in Table 4.1
and shown in Figure 4.1 (a). This simulated individual is most active around the area
of latitudinal (Lat) 2.90 and longitudinal (Lng) 1.50 (c.f. component numbers 1 to 3;
total 85% of all observations) and is particularly active around the locations marked
by component number 2. This simulated individual has occasionally visited other
Table 4.1: Parameters of the mixture model that our synthetic data were simulated from
Component    Mean Vector       Covariance Matrix
(#Points)    Lat      Lng      Var(Lat)   Cov      Var(Lng)
1 (40)       2.90     1.50     3e-03      1e-04    2e-03
2 (100)      2.95     1.49     1e-05      0        1e-05
3 (30)       2.90     1.51     1e-04      0        1e-04
4 (13)       2.20     1.20     1e-03      9e-04    1e-03
5 (15)       3.30     1.25     1e-04      0        1e-04
6 (1)        3.40     1.25     1e-99      0        1e-99
7 (1)        3.35     1.25     1e-99      0        1e-99
Table 4.2: Parameter estimates recovered by our SEVB-GMM algorithm with kinitial = 1 for the simulated dataset plotted in Figure 4.1 (a)
Component    Mean Vector       Covariance Matrix
(#Points)    Lat      Lng      Var(Lat)   Cov        Var(Lng)
1 (41)       2.90     1.50     3.5e-03    -3.7e-05   2.3e-03
2 (100)      2.95     1.49     1.3e-05    -2.3e-07   8.9e-06
3 (29)       2.90     1.51     8.9e-05    -2.0e-05   5.5e-05
4 (13)       2.21     1.20     8.3e-04    6.8e-04    7.0e-04
5 (17)       3.31     1.25     7.1e-04    -2.0e-05   1.2e-04
locations marked by component numbers 4 to 7. The results from our SEVB-GMM
algorithm with kinitial varying between one and 20 were identical and this is shown
in Figure 4.1 (b), while selected results of standard VB-GMM algorithm with vari-
ous kinitial have been presented in Figure 4.2. The ellipses represent 95% probability
regions for the component densities, whereas the estimated centers of these compo-
nents are marked by ‘+’s in this figure. Note that we plot our data pattern separately
from the fitted models for clearer presentation.
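Data of the form described here can be regenerated directly from the Table 4.1 parameters; a sketch using NumPy (the random seed and the layout of the parameter list are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Components from Table 4.1: (number of points, mean (Lat, Lng),
# covariance entries (Var(Lat), Cov, Var(Lng))).
components = [
    (40,  (2.90, 1.50), (3e-03, 1e-04, 2e-03)),
    (100, (2.95, 1.49), (1e-05, 0.0,   1e-05)),
    (30,  (2.90, 1.51), (1e-04, 0.0,   1e-04)),
    (13,  (2.20, 1.20), (1e-03, 9e-04, 1e-03)),
    (15,  (3.30, 1.25), (1e-04, 0.0,   1e-04)),
    (1,   (3.40, 1.25), (1e-99, 0.0,   1e-99)),  # isolated outlier
    (1,   (3.35, 1.25), (1e-99, 0.0,   1e-99)),  # isolated outlier
]

draws = []
for n, mean, (v_lat, cov, v_lng) in components:
    sigma = np.array([[v_lat, cov], [cov, v_lng]])
    draws.append(rng.multivariate_normal(mean, sigma, size=n))
data = np.vstack(draws)
print(data.shape)  # (200, 2)
```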
In contrast to the standard VB-GMM algorithm, our SEVB-GMM algorithm has pro-
duced very consistent models regardless of the value of kinitial. That is, our proposed
split process appears to be working, and we can obtain models with k both higher
and lower than initially proposed. On the other hand, as discussed before, the stan-
dard algorithm can be sensitive to the isolated observations when uninformative ini-
tialization settings are used. Besides the outliers (c.f. components 6 and 7), our algo-
rithm has recovered all other components, including both inliers components (c.f.
components 1 and 3). The parameter estimates obtained by our algorithm with
kinitial = 1 are presented in Table 4.2. We found that when a larger kinitial is used
in the standard algorithm, we are more likely, but not always (c.f. kinitial = 11 and
kinitial = 14), to recover the model. Neither algorithm allocated any observations to the outliers,
which is useful for understanding individuals’ mobility patterns at an aggregated
level.
Note that despite the fact that six components were identified by the standard VB-
GMM algorithm with kinitial = 11 and kinitial = 14 (c.f. Figure 4.2 (d)–(e)), we stress
(a) Simulated data, n = 200 (b) SEVB-GMM
Figure 4.1: (a) Plot of our simulated dataset where the data points ('Actual') are marked by an 'x'. (b) The results of our SEVB-GMM fit of a bivariate mixture model to these data; the center of each component in the fitted mixture is indicated by a '+' and we also show 95% probability regions (outlined by '-') for each component in the model. We can see that the data appear to be well represented by the fitted model. Note also that the resulting fit is identical for kinitial = 1–20.
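The 95% probability region of a bivariate Gaussian component can be traced from its mean and covariance via the chi-squared quantile with two degrees of freedom; a sketch assuming NumPy and SciPy (not the plotting code used for these figures):

```python
import numpy as np
from scipy.stats import chi2

def ellipse_points(mean, cov, prob=0.95, n=100):
    """Boundary of the `prob` probability region of a bivariate Gaussian.

    The region {x : (x-mu)^T Sigma^{-1} (x-mu) <= chi2.ppf(prob, df=2)} is
    an ellipse whose semi-axes are sqrt(c * eigenvalue) along the
    eigenvectors of Sigma."""
    c = chi2.ppf(prob, df=2)                   # ~5.99 for prob = 0.95
    eigvals, eigvecs = np.linalg.eigh(cov)
    t = np.linspace(0.0, 2.0 * np.pi, n)
    circle = np.stack([np.cos(t), np.sin(t)])  # unit circle
    # scale by the semi-axes, rotate into data coordinates, shift to the mean
    return (eigvecs @ (np.sqrt(c * eigvals)[:, None] * circle)).T + mean

# Boundary of the 95% region for a component like #2 of Table 4.1:
pts = ellipse_points(np.array([2.95, 1.49]),
                     np.array([[1e-05, 0.0], [0.0, 1e-05]]))
print(pts.shape)  # (100, 2)
```

Plotting `pts` with a line style of '-' reproduces the kind of ellipse shown in the figures.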
that the sixth component does not represent the outliers. The extra component (c.f.
the linear-shaped component) was formed because observations from two unre-
lated components were inappropriately grouped together by the algorithm. These
are good examples of results obtained when observations' initial component alloca-
tions are poor; as we have shown, this is a concern for the standard algorithm but
not for our SEVB-GMM algorithm.
Finally, we discuss the implication of the choice of algorithm settings. We concen-
trate on the SEVB-GMM algorithm only and we perform a limited sensitivity analysis
on the algorithm settings of SEVB-GMM. We focus solely on the implication of the
choice of values for g, s, r, q, c1, c2 and c3. Having experimented with a wide
range of algorithm settings for our SEVB-GMM algorithm, we are pleased to observe
that, with some rare exceptions, our algorithm has remained able to recover the cor-
rect model shown in Figure 4.1 (b). As we would expect, most of the exceptions
occurred when kinitial was very small; this follows logically as the settings we are
modifying have some influence on whether or not split attempts will be made and how
these new components will be initialized. Of course, the splitting feature is crucial
when there are not enough components in the initial model to represent the data
well. As a result, here we only focus on reporting our results for various initializations
of the user chosen parameters when kinitial = 1 as this gives a reasonable picture of
the effect these parameters can have.
Our analysis produces three different models: the correct model shown in Fig-
ure 4.1 (b), the correct model minus one inliers component which we will refer to
(a) kinitial = 1 (b) kinitial = 2
(c) kinitial = 5 (d) kinitial = 11
(e) kinitial = 14 (f) kinitial = 19
Figure 4.2: Selected results obtained from applying the standard VB-GMM algorithm under different initialization conditions to the simulated data shown in Figure 4.1 (a). The centers of each component in the fitted mixtures are indicated by a '+'; we also show 95% probability regions ('-') for each component in the model. The computed values of F and MAEAC, and the fitted value of k in the final model, are also shown. We can see that the initial choice for k and the corresponding initial component allocation does influence the final fit obtained.
as a four-component model, and the failed model in which no component splitting
has taken place, i.e., what we would have obtained by simply using the standard
VB-GMM model. We detail the results of interest below:
• the correct model was recovered when g was changed from 0.9 to 0.5; this
change means that a component may be split if its first and second eigenvalues
are not the same, which is a rather extreme choice;
• the correct model was recovered when s was changed from 0.05 to 0; this
change means that a component may be split regardless of its size, which is
again a rather extreme choice;
• the correct model was recovered when r was changed from 0.25 to 0.40;
• a four-component model was recovered when r (which we recall determines
the size of the center region of the components for an observation to be con-
sidered as an inlier) was changed from 0.25 to 0.10, but the algorithm failed
when a more extreme choice of 0.01 was utilized;
• the correct model was recovered when q (the proportion of components actu-
ally considered as inliers) was changed from 0.45 to either 0.30 or 0.60;
• a four-component model was recovered with q = 0.70, but the algorithm failed
when an extreme choice of 0.90 was utilized;
• the correct model was recovered when c1 (the variance ratio between inliers
and non-inliers subcomponent) was changed from 1000 to 10000;
• a four-component model was recovered with c1 = 100 or 10, but the algorithm
failed when extreme choices such as 100000 or 1 were used (note that c1 = 1
corresponds to effectively disabling the inliers and non-inliers split operation
(c.f. Section 4.3.3.1));
• a four-component model was recovered when c2 was changed from 100 to
1000, but the algorithm failed when 10 or 1 was used (note that the choice of
1 would imply no overall model variance posterior parameter adjustment (c.f.
Section 4.3.3.4) should take place). The test with c2 = 1 is particularly impor-
tant to demonstrate the need to have this operation;
• a four-component model was recovered when c3 was changed from 2 to 1; this
change means that on average a component can only be split once;
• the correct model was still recovered when c3 was increased to 3 or more.
Table 4.3: For the mobility pattern of Subscriber A (the observed data are plotted in Figure 4.3 (a)), we report the values of kfinal, F and MAEAC resulting from several SEVB-GMM fits that were obtained using different values of kinitial; note that for comparison these values were chosen to correspond to those selected in the study represented in Figure 4.4. Comparing these results with Figure 4.4, we can see that, unlike the standard VB algorithm, SEVB is much more robust to component initialization settings.
kinitial    1       2       4       5       8       10
kfinal      3       3       3       3       3       3
F           394     574     711     572     426     575
MAEAC       0.078   0.080   0.074   0.077   0.074   0.080
One can see then that our SEVB-GMM algorithm appears to be quite robust to the
selection of these values.
4.4.3 Real data results
Our real data analysis is based on the confidential call detail record (CDR) data pro-
vided by a wireless telecommunication provider based in Australia. It comprises
every single successful outbound activity made by 100 consumer subscribers over
a 17-month period. These anonymous subscribers were randomly selected from a
large database of several million subscribers, and they remained connected during
the entire study period. In practice the geographic position of an individual is
generally recorded based on the position of the mobile cell tower that was
used at the commencement of the call (c.f. Gonzalez et al., 2008). This means that
the locations of the individuals are in fact approximate rather than precise geograph-
ical locations. However, cell tower location is precise enough to give us a picture of
the users' movements. These data were collected for billing purposes with attributes
including, but not limited to, the activity-initiated date, time and mobile cell tower
location in latitude and longitude coordinates. Here we follow the same initializa-
tion settings as in the simulated study with the exception of δ1. We choose δ1 = 0.01
degrees, which translates to approximately 1 km. This selection reflects that, in re-
ality, we do not know the exact location of the subscribers within the tower service
area, as was the case in Gonzalez et al. (2008), and is therefore sufficient when dis-
tances between most towers are considered. Before we evaluate both algorithms, we
first examine model outputs of several selected anonymous subscribers with various
kinitial.
Selected model outputs have been presented in Figures 4.3 (b), 4.4 and 4.5. As before,
the ellipses represent 95% probability regions for the component densities, whereas
the estimated centers of these components are marked by ‘+’s in these figures. Fig-
ures 4.3 (b) and 4.4 are the results of our SEVB-GMM and the standard VB-GMM
algorithms, respectively, for the mobility pattern of Subscriber A with various values
(a) Actual data, n = 56 (b) SEVB-GMM
Figure 4.3: (a) Observed mobility pattern of Subscriber A over a 17-month period corresponding to the recorded locations, marked by an 'x', of cell towers from which telecommunication activities were initialized. (b) The results of the SEVB-GMM fit of a bivariate mixture model to these data; the center of each component in the fitted mixture is indicated by a '+' and we plot the 95% probability regions ('-') for each fitted component. Note that results obtained were similar for kinitial = 1–18 and that values of kfinal, F and MAEAC corresponding to various kinitial are summarized in Table 4.3.
of kinitial. With the SEVB-GMM algorithm, the resulting fits were almost identical for
the various kinitial, hence we only plot one of the actual fits, and Table 4.3 summarizes
the values of kfinal, F and MAEAC for selected kinitial. These figures show that our al-
gorithm is more robust than the standard method and is able to obtain very similar
results regardless of the value of kinitial used. Note that the computed values of F and
MAEAC in Table 4.3 appear to vary somewhat even though the final fitted models
were almost identical and had three components in each case; this is mostly due to
the size of this particular dataset. Since it is rather small, having only 56 observa-
tions, any differences in the posterior component allocations of the observed points
will have more of an influence on the computation of the selection criteria. Further,
the presence of a near singular inliers component in the fitted model strongly affects
the estimation of the LL value that is required for the calculation of F , which is why
F varies more than MAEAC does.
Figure 4.5 shows the results for four other subscribers, Subscribers B to E, with the
clearly inappropriate choice kinitial = 1. It shows that our algorithm is able
to model very complicated patterns adequately even when kinitial is assigned in-
correctly. These results are very encouraging.
Before we evaluate our SEVB algorithms more generally, we draw the reader’s atten-
tion to the differences in the values of F , between, for example:
(a) kinitial = 1 (b) kinitial = 2
(c) kinitial = 4 (d) kinitial = 5
(e) kinitial = 8 (f) kinitial = 10
Figure 4.4: Selected results obtained by using various choices for kinitial in the standard VB-GMM algorithm for Subscriber A's mobility pattern shown in Figure 4.3 (a); the center of each component in the fitted mixture is indicated by a '+' and we plot the 95% probability regions (marked by '-') for each fitted component.
(a) Subscriber B, n = 2970
(b) Subscriber C, n = 1593
(c) Subscriber D, n = 3576
(d) Subscriber E, n = 1843
Figure 4.5: Mobility patterns over a 17-month period for four subscribers are shown in the left column. Observations, 'x', are the recorded cell tower locations from which subscribers initiated a communication. Bivariate mixture models fitted using SEVB-GMM are shown in the right column; the center of each fitted mixture component is marked '+' and corresponding 95% probability regions ('-') are shown. Note that SEVB-GMM was initialized with the inappropriate choice kinitial = 1 each time, yet we are still able to model the data well. Values for F and MAEAC are also reported.
• Figure 4.1 (b), and 4.2 (d) and (e),
• Table 4.3 with kinitial = 5 and 8, and Figure 4.4 (d)–(f), but note the small
sample size; and
• Figure 4.5 (d), and the F value resulting from the standard VB-GMM algorithm
fit with inappropriate kinitial = 1, which we found to be −16743.
These are some of the examples where the measure of F appears to conflict with
the choice of our algorithm. Comparing F between Figure 4.4 (b) and (d), for exam-
ple, gives us cause for concern about using F to choose the model since it selects a
model which does not appear to be very appropriate while MAEAC leads to a choice
which intuitively appears to be more suitable. We suggest that for this application
our goodness-of-fit measure, MAEAC, is more reliable and robust than the widely
used F measure. While we have yet to show this result more generally, we
believe that one of the reasons is that F (and also BIC and DIC) relies heavily on
the estimated model LL which is influenced by the covariance matrix estimation for
components in the model. The presence of point masses in our data, which occur at
some cell tower locations, and of any inappropriately surviving linear-shaped com-
ponents in the fitted model (e.g., those in the fits shown in Figure 4.4 (d)
and (e)), causes numerical
stability problems because the corresponding covariance matrices for these com-
ponents are singular or near singular. This is because the estimated covariance ma-
trix for a singular or near singular component can be heavily influenced by round-
ing errors which can have a big effect on the resulting LL estimate computed. Note
also that the estimates obtained in this case are also software dependent since dif-
ferent packages can deal with this issue in different ways. MAEAC is more robust
to these situations as we outline in the following. When there is a fitted compo-
nent corresponding to a point mass, its estimated absolute deviance measure term
in the MAEAC formula will be very close to zero and have almost no influence on
MAEAC’s ranking of the models, while in contrast an estimated LL could be greatly
and unduly influenced by even slight changes in covariance estimates associated
with a point mass; it may even be undefined, and its estimation may be misleading,
if the covariance matrix turns out to be not positive definite. Further, MAEAC will
penalize linear-shaped components (which are often inappropriate in our applica-
tion, although we note that they may not be in some situations) as the estimated
absolute deviance measure term for those will be extremely large, while in contrast
an estimated LL for such a term could appear very favorable as a result of the cor-
responding covariance matrix being near singular or singular. These point masses,
and to some degree those linear-shaped components, are a result of how these data
were recorded, i.e., recorded user locations are not continuously varying in the plane
but given by the finite set of cell tower locations. This also explains why the differ-
ence between models selected using F and MAEAC appears to be less pronounced
in our simulated data studies as synthetic data tend to be more well behaved than
real data. From this perspective, MAEAC appears to be a more robust measure un-
der these conditions and therefore it may also be useful in other applications where
these data issues arise.
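The fragility of the LL (and hence of F) in the presence of a point-mass component can be illustrated numerically; a small sketch, assuming SciPy, with an artificial point mass at a single tower location:

```python
import numpy as np
from scipy.stats import multivariate_normal

# A "point mass": 50 observations recorded at one cell tower location.
tower = np.array([2.95, 1.49])
x = np.tile(tower, (50, 1))

# Tiny changes in a near-singular covariance estimate swing the LL wildly,
# while the absolute deviance from the component centre stays exactly zero,
# so a MAEAC-style term is untouched by the instability.
for var in (1e-08, 1e-10, 1e-12):
    cov = var * np.eye(2)
    ll = multivariate_normal.logpdf(x, mean=tower, cov=cov).sum()
    abs_dev = np.abs(x - tower).mean()
    print(f"var={var:.0e}  LL={ll:8.1f}  abs. deviance={abs_dev}")
```

Here the LL grows without bound as the variance estimate shrinks, which is exactly the behaviour that makes F (and BIC, DIC) sensitive to rounding in near-singular covariance estimates.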
Figure 4.6 summarizes the results of both algorithms based on the 100 anonymous
subscribers. For each of these subscribers, we have fitted 30 GMMs to their observed
pattern corresponding to the fits obtained with kinitial ranging from one to 30. Fig-
ure 4.6 (a) focuses on the final number of components fitted, kfinal, when different
kinitial are used. It shows that our algorithm is able to discover models with more
consistent estimates of kfinal, and is not limited by the choice of kinitial. We are able
to discover, on average, more than six components for this particular dataset, even
when kinitial = 1 is used. Note that this result possibly could be improved further by
varying the user chosen parameters. Moreover, this figure also shows that if we use
larger values for kinitial, we tend to obtain a fitted model with a higher kfinal. This is
understandable as the dataset we are dealing with is highly heterogeneous. However,
it is encouraging that for both algorithms, and more so for our algorithm, the value
of kfinal in the fitted models is reasonably consistent even when a large kinitial is used.
Note that, for this particular dataset, our algorithm appears to fit models with
smaller kfinal than the standard method when a large kinitial is used.
Figure 4.6 (b) examines both algorithms from the aspect of goodness-of-fit. It shows
that our model performs better which is not surprising considering we select our
models based on this particular criterion. It also shows that data, particularly for the
standard method, tends to be less well fitted when smaller kinitial is used correspond-
ing to models with smaller kfinal (c.f. Figure 4.6 (a)). Except when very small kinitial is
used, our algorithm appears to be able to obtain models with similar goodness-of-fit
regardless of the value of kinitial used. Based on Figure 4.6 (a) and (b), we believe our
algorithm is, in comparison, very robust and suitable for our targeted application.
Note that while we did not evaluate the results with respect to different observa-
tion membership initializations with kinitial held fixed, we believe the results
in Figure 4.6 (a) and (b) are already sufficient in demonstrating the robustness of our
algorithm in this respect.
We next examined our proposed goodness-of-fit measure, MAEAC, more closely.
Figure 4.6 (c) is plotted based on comparing the model BIC, DIC, and F obtained
by both algorithms. It is based on the principle that, if our algorithm has dis-
covered the same or better models compared to the standard method, then,
in theory, models from our algorithm should have equal or lower BIC and DIC, as well
as equal or higher F, relative to models from the standard method. Percentages of
Figure 4.6: Comparisons between fits obtained from the standard VB-GMM and our SEVB-GMM algorithm when using different values of kinitial ranging from 1 to 30, based on the observed data for 100 randomly selected anonymous individuals: (a) plot of the fitted kfinal vs. the kinitial that was used for both algorithms, (b) value of MAEAC (Equation (4.7)) for the fits from both algorithms vs. the kinitial that was used, and (c) for the fits obtained from both the standard and SEVB algorithms, we computed the corresponding values of BIC, DIC, F and MAEAC, then plotted the % of times there was an agreement between the model that would be selected based on the BIC, DIC or F values and the model that was selected in the SEVB algorithm using MAEAC.
cases where BIC, DIC and F agreed with our final model selection are presented in
this figure. It shows that when kinitial is small, our final models, besides being better
fitted, have generally also improved from the point of view of BIC, DIC and F; BIC,
DIC and F appear to agree with our model selection around 60 to 70% of the time
regardless of kinitial. As we have discussed earlier, in cases where MAEAC is not in
agreement with these other measures, we would expect it to be more robust to the
data discreteness issues we have described.
4.5 Discussion
We have proposed an extension of the standard VB method for approximating
GMMs. Unlike the standard approach, our algorithm can lead to models with a
higher number of components than proposed initially. It is therefore more flexible
and practical for applications as we have demonstrated through our empirical re-
sults in Section 4.4. Our approach is inherently faster than other existing VB-GMM
algorithms with split operations. This is because we only attempt to split compo-
nents that are identified as showing a poor fit, as determined by our proposed split
assessment criteria, and we split all of these components at the same time. Addi-
tionally, we terminate the algorithm based on a set of diagnostic criteria in contrast
to current approaches which typically run for as many iterations as necessary un-
til F cannot be improved further. While the exact computational advantage of our
SEVB-GMM over other VB-GMM algorithms which allow for component splitting is
difficult to evaluate, for illustration we note that a total of nine split attempts were
made by our algorithm for modeling subscriber D (c.f. Figure 4.5 (c)) while this fit
would have required making at least 39 successful split attempts if we had used one
of the existing splitting algorithms with kinitial = 1. This, in our view, suggests that
ours is a more effective and efficient approach as the increased speed is extremely
important when working with real, large datasets. We have also improved on the
standard algorithm in the sense that the parameter space is now explored more
thoroughly.
From the application perspective, to the best of our knowledge, this is the first piece
of research that aims to model individuals’ overall human mobility patterns with
GMMs. Mixture modeling (Jain and Dubes, 1988, pp.117-118) is often considered
as a model-based approach to clustering. For comparison, we have also attempted
to model the simulated data used in Section 4.4.2 with the well-known k-Means
(KM) algorithm as well as DBSCAN (Density-Based Spatial Clustering of Applica-
tions with Noise) (Ester et al., 1996). DBSCAN is an efficient density-based
clustering algorithm which has become popular in the machine learning liter-
ature (Han and Kamber, 2006). However, unlike the GMM approach, we found that
these clustering algorithms did not appear to be able to provide us with meaning-
ful model descriptions as the patterns were represented with combinations of many
non-overlapped and irregularly shaped clusters. When compared to the GMM, KM
appears to focus on identifying outliers as the result of the pattern being heteroge-
neous. In contrast, DBSCAN is able to first identify and then remove the outliers (c.f.
mixture number 6 and 7). However, beside the fact that it had identified three out-
liers from mixture number 1, it was not able to identify any inliers components. This
is not a surprise to us as clustering is typically based on the principle that the opti-
mal model will have clusters that are compact or clearly separated (Milligan and
Cooper, 1985; Jain and Dubes, 1988). This suggests that clustering is generally not an
appropriate method for modeling human mobility patterns.
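For readers wishing to reproduce a comparison of this kind, a minimal sketch using scikit-learn's GaussianMixture, KMeans and DBSCAN on an artificial diffuse-plus-dense pattern (the data and all settings are illustrative, not those used in our study):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# A heterogeneous pattern: a dense "inliers" cloud sitting inside a diffuse
# one, loosely mimicking components 1 and 2 of the simulated mixture.
diffuse = rng.multivariate_normal([2.90, 1.50], 3e-3 * np.eye(2), size=60)
dense = rng.multivariate_normal([2.95, 1.49], 1e-5 * np.eye(2), size=140)
X = np.vstack([diffuse, dense])

# A two-component GMM can place a tight component inside a broad one...
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
# ...whereas k-means partitions the plane into non-overlapping cells and
# DBSCAN keeps density-connected points together and labels the rest noise.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.01, min_samples=5).fit(X)

print("GMM component sizes:", np.bincount(gmm.predict(X)))
print("k-means cluster sizes:", np.bincount(km.labels_))
print("DBSCAN noise points:", int((db.labels_ == -1).sum()))
```

Plotting the resulting labels makes the contrast visible: only the mixture model represents the dense cloud as a component overlapping the diffuse one.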
However, we should note that there is a limitation to using GMMs for modeling an
individual’s spatial pattern despite its superiority to clustering approaches. That is,
it is not able to model well the movements of an individual when they correspond
to a journey or path that follows a non-linear trajectory; this is illustrated in
the results of Subscriber D in Figure 4.5 (c). Additionally, one major limitation of a
GMM is its lack of robustness to outliers. Modeling more robust mixtures of Student-
t distributions within the VB framework has been proposed (Svensen and Bishop,
2005; Archambeau and Verleysen, 2007). Such algorithms might be improved further
with the adoption of our split process to result in an approach even more suitable
for our application. Additionally, Aitkin and Wilson (1980) outlined a modified EM
algorithm in which outliers were identified via mixture modeling and this might also
be usefully incorporated into our approach.
We have also proposed a new model selection criterion, MAEAC. We have shown
through empirical results that it appears to be more robust to the problems of near-singular
or singular covariance matrices that arise due to issues of data discreteness.
This new criterion might also be a useful tool in other applications where such data
problems exist.
4.6 References
Aitkin, M., Wilson, G. T., 1980. Mixture models, outliers, and the EM algorithm. Tech-
nometrics 22 (3), 325–331.
Archambeau, C., Verleysen, M., 2007. Robust Bayesian clustering. Neural Networks
20 (1), 129–138.
Armstrong, J. S., 2001. Principles of Forecasting: A Handbook for Researchers and
Practitioners. International Series in Operations Research & Management Science.
Kluwer Academic, Boston, MA.
Attias, H., 1999. Inferring parameters and structure of latent variable models by vari-
ational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Con-
ference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm,
Sweden, pp. 21–30.
Azzalini, A., 1996. Statistical Inference: Based on the Likelihood. Monographs on
Statistics and Applied Probability. Chapman & Hall, London.
Balakrishnan, S., Madigan, D., 2006. A one-pass sequential Monte Carlo method for
Bayesian analysis of massive datasets. Bayesian Analysis 1 (2), 345–362.
Ball, G. H., Hall, D. J., 1965. ISODATA, a novel method of data analysis and pattern
classification. Tech. rep., Stanford Research Institute, Menlo Park, CA.
Beal, M. J., Ghahramani, Z., 2002. The variational Bayesian EM algorithm for incomplete
data: with application to scoring graphical model structures. In: Bernardo,
J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., West,
M. (Eds.), Proceedings of the Seventh Valencia International Meeting. Oxford Uni-
versity, Tenerife, Spain, pp. 453–464.
Beal, M. J., Ghahramani, Z., 2006. Variational Bayesian learning of directed graphical
models with hidden variables. Bayesian Analysis 1 (4), 793–832.
Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Information Sci-
ence and Statistics. Springer, New York.
Brockmann, D., Hufnagel, L., Geisel, T., 2006. The scaling laws of human travel. Na-
ture 439 (7075), 462–465.
Celeux, G., Forbes, F., Robert, C., Titterington, D., 2006. Deviance information crite-
ria for missing data models. Bayesian Analysis 1 (4), 651–674.
Celeux, G., Hurn, M., Robert, C. P., 2000. Computational and inferential difficulties
with mixture posterior distributions. Journal of the American Statistical Associa-
tion 95 (451), 957–970.
Constantinopoulos, C., Likas, A., 2007. Unsupervised learning of Gaussian mixtures
based on variational component splitting. IEEE Transactions on Neural Networks
18 (3), 745–755.
Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mix-
ture distributions. In: Proceedings of the Eighth International Conference on Arti-
ficial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.
Ester, M., Kriegel, H.-P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering
clusters in large spatial databases with noise. In: Simoudis, E., Han,
J., Fayyad, U. M. (Eds.), Proceedings of the Second International Conference on
Knowledge Discovery and Data Mining. AAAI, Portland, OR, pp. 226–231.
Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd
Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.
Ghahramani, Z., Beal, M. J., 1999. Variational inference for Bayesian mixtures of fac-
tor analysers. In: Solla, S. A., Leen, T. K., Muller, K.-R. (Eds.), Proceedings of the
1999 Neural Information Processing Systems. MIT, Denver, CO, pp. 449–455.
Gonzalez, M. C., Hidalgo, C. A., Barabasi, A.-L., 2008. Understanding individual hu-
man mobility patterns. Nature 453 (7196), 779–782.
Han, J., Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd Edition. The
Morgan-Kaufmann Series in Data Management Systems. Morgan Kaufmann, San
Francisco, CA.
Jaakkola, T. S., Jordan, M. I., 2000. Bayesian parameter estimation via variational
methods. Statistics and Computing 10 (1), 25–37.
Jain, A. K., Dubes, R. C., 1988. Algorithms for Clustering Data. Prentice Hall, Upper
Saddle River, NJ.
Madigan, D., Ridgeway, G., 2003. Bayesian data analysis. In: Ye, N. (Ed.), The Hand-
book of Data Mining. Human Factors and Ergonomics. Lawrence Erlbaum Asso-
ciates, Mahwah, NJ.
McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian
model selection for finite mixture distributions. Computational Statistics & Data
Analysis 51 (11), 5352–5367.
McLachlan, G. J., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability
and Statistics. Wiley, New York.
Milligan, G., Cooper, M., 1985. An examination of procedures for determining the
number of clusters in a data set. Psychometrika 50 (2), 159–179.
Richardson, S., Green, P. J., 1997. On Bayesian analysis of mixtures with an unknown
number of components (with discussion). Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 59 (4), 731–792.
Spiegelhalter, D., Best, N., Carlin, B., Van der Linde, A., 2002. Bayesian measures of
model complexity and fit. Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology) 64 (4), 583–639.
Stephens, M., 2000. Bayesian analysis of mixture models with an unknown number
of components - an alternative to reversible jump methods. The Annals of Statis-
tics 28 (1), 40–74.
Svensen, M., Bishop, C. M., 2005. Robust Bayesian mixture modelling. Neurocom-
puting 64, 235–252.
Titterington, D. M., Smith, A. F. M., Makov, U. E., 1985. Statistical Analysis of Finite
Mixture Distributions. Wiley Series in Probability and Mathematical Statistics.
Wiley, New York.
Ueda, N., Ghahramani, Z., 2002. Bayesian model search for mixture models based
on optimizing variational bounds. Neural Networks 15 (10), 1223–1241.
Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G. E., 2000. SMEM algorithm for
mixture models. Neural Computation 12 (9), 2109–2128.
Wang, B., Titterington, D. M., 2006. Convergence properties of a general algo-
rithm for calculating variational Bayesian estimates for a normal mixture model.
Bayesian Analysis 1 (3), 625–650.
Watanabe, S., Minami, Y., Nakamura, A., Ueda, N., 2002. Application of variational
Bayesian approach to speech recognition. In: Becker, S., Thrun, S., Obermayer,
K. (Eds.), Proceedings of the 2002 Neural Information Processing Systems. MIT,
Vancouver, BC, Canada, pp. 1237–1244.
Wu, B., McGrory, C. A., Pettitt, A. N., 2010c. The variational Bayesian method: com-
ponent elimination, initialization & circular data. Submitted.
5 Customer Spatial Usage Behavior Profiling and Segmentation with Mixture Modeling
Abstract
While companies typically acknowledge the need to be customer focused, an un-
derstanding of how each customer utilizes their product/service often appears to be
lacking. This paper describes how businesses can improve on this knowledge short-
coming, with ideas illustrated within the context of the wireless telecommunication
industry. Importantly, this article demonstrates the feasibility and potential merit
in analyzing individuals’ frequently overlooked habitual consumption behavior. For
the first time, an approach is developed that can automatically and effectively pro-
file each user’s observed overall spatial usage behavior (or mobility pattern). Mobil-
ity data is highly heterogeneous and spiky; we tailor a technique based on the use
of Gaussian mixture models (GMMs) and the variational Bayesian (VB) method for
overcoming these difficulties. The detailed distributional understanding achieved
here is then transformed to unlock potentially valuable insights such as each sub-
scriber’s likely lifestyle and occupational traits, which otherwise cannot be easily or
cheaply discovered. Our empirical results reveal that users' spatial usage behavior
profiles are more stable than those produced by the currently popular approach of
ordered partitioning of customers based on current benchmark measures such as
aggregated voice call durations. The mobility patterns that we find among customer
groups are highly differentiable and therefore are valuable for business strategy for-
mulation.
Keywords
Consumption Behavior; Spatial Usage Behavioral Segmentation; Gaussian Mixture
Model; Variational Bayes; k-Means Clustering; Wireless Telecommunication Indus-
try
5.1 Introduction
Customers are the most important asset of any business, and companies typically
acknowledge the necessity of being customer focused (Christopher et al.,
1991, p. 13). However, not all customers are the same (Cooper and Kaplan, 1991).
To better serve and/or satisfy each customer, businesses often seek to group them
based on their characteristics, needs, preferences and behavior exhibited for distinct
marketing propositions (Smith, 1956). Alternatively, they may try to differentiate
customers based on their current and future needs and values to the business with
the aim of exchanging appropriate relationships with them (Blattberg and Deighton,
1996; Reichheld, 1996; Fournier et al., 1998; Peppers et al., 1999). Detailed customer
behavior understanding forms a critical part of such customer knowledge. However,
the habitual consumption aspect of customer behavior has not been very well stud-
ied despite the fact that there is already a wealth of customer/consumer behavior
literature. The repeated nature of this behavior should provide good insight into
customers’ current and future patterns (c.f. Schmittlein and Peterson, 1994). In this
paper, we address this knowledge shortcoming and present an innovative approach
which aims to enable businesses to comprehend how customers have utilized their
product/service in their daily lives (c.f. Fournier et al., 1998). We illustrate our ideas
in the context of the wireless telecommunication industry, although they can be
generalized to other industries. More specifically, we explore and analyze customers'
spatial usage behavior (or mobility patterns) in a novel way, transforming it into
insightful and more stable information, as well as a highly differentiable segmentation,
for the business.
Consumption behavior differs from purchasing behavior (Alderson, 1957; Jacoby,
1978), and is more relevant to the service than the retail industries because of its re-
peated patterns (Ouellette and Wood, 1998; Ajzen, 2001). Existing knowledge of each
individual’s consumption behavior is primarily limited to discrete (e.g., which ser-
vices customers use) or average and aggregated measures (e.g., number of transac-
tions per month). These measures, however, are not necessarily appropriate, mean-
ingful or adequate for describing the observed pattern. Instead, these patterns can
often be modeled in a way that is more revealing, and yet still fairly straightforward,
by using mixture models (McLachlan and Peel, 2000). We demonstrate this with the
Figure 5.1: Voice call duration distributions, approximated by mixtures of lognormal distributions (solid lines), of two subscribers whose voice call durations have a mean of 58 seconds. (a) Subscriber 1: a large number of 'message'-like calls of very short duration. (b) Subscriber 2: call duration is more evenly distributed when compared with Subscriber 1.
following example. The two users shown in Figure 5.1 have behaved quite differently
in their outbound calling behavior. Yet, if we simply calculated the average
call duration for each, we would obtain the same information for these two very
differently behaving subscribers; both of their average voice call durations over a
particular period are 58 seconds. The fitted models, marked by the overlaid solid
lines, demonstrate that these distributions can be well approximated by a mixture of
several lognormal distributions. That is, a model involving several means and standard
deviations, which is much more flexible than just one mean value. More interestingly,
however, this figure also illustrates that even commonly used duration models,
such as the Weibull distribution with its simple hazard function, are inappropriate
(cf. Heitfield and Levy, 2001). Clearly, the behavioral distributional differences
revealed by the type of modeling used in Figure 5.1 can be critical to the business.
Consider, for example, the implications for pricing structure or for understanding
potential product/service substitution and migration. Consequently, we believe
it is important to promote the analysis and comprehension of the distributions
of customers’ consumption behavior, rather than just observing averages. We will
show that this distributional analysis tactic can provide businesses with more com-
prehensive insights, which otherwise could not be uncovered using the typically ap-
plied approaches of the ordered partitioning of customers based on their average or
aggregated measures (cf. §5.2.2) (Twedt, 1967; Wedel and Kamakura, 1998, p. 10).
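The modeling behind Figure 5.1 can be sketched as follows: a mixture of lognormals is obtained by fitting a Gaussian mixture to log-durations. The durations below are synthetic stand-ins (loosely mimicking Subscriber 1's short 'message'-like calls plus longer conversations), not the subscribers' actual data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic call durations (seconds): many very short 'message'-like
# calls mixed with longer conversations.
durations = np.concatenate([
    rng.lognormal(mean=1.0, sigma=0.4, size=700),  # short calls
    rng.lognormal(mean=4.5, sigma=0.8, size=300),  # longer conversations
])

# A lognormal mixture is a Gaussian mixture on the log scale.
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(np.log(durations).reshape(-1, 1))

# Back-transforming the component means reveals the two regimes that
# a single average duration would mask.
for w, m in zip(gmm.weights_, gmm.means_.ravel()):
    print(f"weight={w:.2f}, typical duration={np.exp(m):.1f} s")
```

The two back-transformed component means recover the short-call and long-call regimes, whereas the overall arithmetic mean collapses them into a single uninformative number.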
This paper focuses primarily on the spatial aspect of customers' consumption behavior.
This aspect is rarely considered in the literature despite the fact that studies
have already shown that individuals exhibit a high degree of spatial regularity
(Gonzalez et al., 2008), and that the locations visited by an individual can be socially
important to them (Stryker and Burke, 2000). That is, each user's highly repetitive
mobility pattern should reveal something about them; yet, we know very little about this
subject. For our application dataset which we will describe in §5.2.1, unsurprisingly,
our preliminary analysis of subscribers’ spatial usage behavior has demonstrated
that different users have been using their wireless devices in very different ways in
their daily lives. For example:
• The profiles of Subscriber A and B in Figure 5.2 (a) and (b) suggest that they
have behaved like a businessperson frequently flying between cities, and like
an inter-state truck driver frequently driving between cities, respectively.
• The profile suggests that Subscriber C in Figure 5.2 (c) has been mostly active
in places most likely to be his/her home and workplace.
• Subscriber D in Figure 5.2 (d) appears to have behaved like a tradesperson or a
taxi driver moving around places in their living (or working) neighborhood.
We believe that the distinctly different observed patterns are largely influenced by
occupations and/or lifestyles; such valuable individual insights could not otherwise
be easily or cheaply obtained, but can be useful to the business in improving
customer interactions. Given the volumes of this type of data, it is essential that these
patterns, both individually and by behavioral segment, are identified and also
interpreted automatically in an efficient and effective analysis process.
Mathematically, the goal of this research is to put forward an approach for pro-
filing as well as segmenting subscribers accurately and intuitively based on their
frequently overlooked actual mobility patterns. We do not, however, aim to segment
customers spatially/geographically (cf. Hofstede et al., 2002), but rather to
differentiate users based on their spatial behavioral characteristics. In a novel way
of exploring individuals’ spatial usage behavior, we show that it is practical and
effective to model individuals’ mobility patterns using Gaussian mixture models
(GMMs), whose characteristics can be easily captured unlike alternative nonpara-
metric approaches. We tailor a recently proposed computationally efficient varia-
tional Bayesian (VB) algorithm that was designed specifically for modeling highly
heterogeneous and ‘spiky’ patterns with weak prior information available (Wu et al.,
2010b), and is therefore suitable for this application. Note that our application was
discussed in Wu et al. (2010b); but unlike that paper, which focuses on the statistical
methodology used for fitting a GMM, this paper concentrates on interpreting the
patterns to gain useful customer knowledge. We use the term ‘spiky’ to describe pat-
terns where large areas of low probability are often mixed with small areas of high
probability. Although VB (Attias, 1999; Wang and Titterington, 2006) has received in-
creasing attention in other fields, to the best of our knowledge, no references to such
models have been made in the marketing literature; although Braun and McAuliffe
(2010) have already illustrated the usefulness of a VB-based discrete choice model in
a statistical journal. We must emphasize that the use and interpretation of GMMs for
customer behavior modeling is the foundation of this study; therefore, while Wu et al.
(2010b)’s split and eliminate VB (SEVB) algorithm is used to fit the GMMs, alternative
modeling approaches could also be taken as we discuss in §5.3.
5.1 Introduction 115
Figure 5.2: Spatial behavior of four different subscribers. (a) Subscriber A: inter-capital businessperson-like pattern. (b) Subscriber B: inter-state truck driver-like pattern. (c) Subscriber C: home-office-like pattern shown in a bubble plot. (d) Subscriber D: taxi driver-like pattern shown in a bubble plot. Note that in (a) and (b), 'x's represent the actual observations and dotted lines represent the 'virtual' path the user is likely to have taken between two consecutive actual observations. In (c) and (d), user patterns are shown as bubble plots instead of scatter plots to better demonstrate that a large number of activities were initiated from the same cell tower locations; the size of each bubble represents the activity volume of the particular location.
Based on our thorough data analyses for individuals’ real and approximated mobil-
ity patterns, we develop a new customer behavior modeling method involving the
introduction of several behavioral ‘signatures’ (i.e., characteristics) for automatically
and statistically profiling how each user utilizes the product/service spatially in their
daily lives. We demonstrate statistically that these meaningful descriptors, which are
extracted from the approximated GMM, are more stable and highly differentiable
(c.f. Wedel and Kamakura, 1998, p.4) than existing alternatives such as the quan-
tiles of aggregated outbound voice call durations and short message services (SMS)
counts. In fact, we show that the lack of stability of ordered partitioning of customers
based on these aggregated measures can be alarming. We also show that customers’
spatial usage behaviors naturally form clusters that market specialists can readily
relate to.
The remainder of this paper is organized as follows. We introduce our data in §5.2; we
then establish a benchmark for assessing segmentation stability which is based on
subscribers’ aggregated outbound voice call durations and SMS counts since these
are two of the most widely analyzed consumption behaviors in this industry. Addi-
tionally, we analyze individuals’ spatial usage behavioral data, and identify its unique
characteristics which pose challenges in modeling. We begin §5.3 with a brief discus-
sion of existing literature on modeling individuals’ mobility patterns. We then artic-
ulate the advantages, and demonstrate the accuracy and efficiency of instead mod-
eling the patterns with GMMs; we fit the GMMs using the SEVB algorithm (Wu et al.,
2010b). This is followed by further data analyses differentiating users’ spatial usage
behavior, which leads to the introduction of our behavioral signatures in §5.4. We
emphasize that a separate GMM is fitted to each individual, whose mobility pattern
characteristics are then extracted. In concluding §5.4 and §5.5, we evaluate the
effectiveness of our spatial usage behavioral profiling and segmentation, including
the comparison to the benchmark established in §5.2. We next perform validation
demonstrating that the proposed behavioral grouping is useful and highly differen-
tiable in §5.6, and finish with a discussion of our contributions in §5.7.
5.2 Data & Individuals’ Consumption Behavior
5.2.1 Data
Studies have shown that we can sufficiently comprehend an individual’s mobility
pattern through analyzing their call detail records (CDR) without the need to track
them at all times (Gonzalez et al., 2008). Our research adopts this convenient ap-
proach, and analyzes confidential CDR provided by a wireless telecommunication
provider in Australia. Our data records every single successful outbound activity
made by 1,082 consumer subscribers during a consecutive 17-month period, which
Table 5.1: Distributions of users' aggregated call patterns

                              Mean    Min     1Q   Median     3Q       Max
  Voice Call Durations (s)   80926      0  12326    39062  97925   1446649
  SMS Counts                   557      0     16      135    564     12522

Figure 5.3: Distributions of users' aggregated call patterns. (a) Aggregated voice call durations. (b) Aggregated SMS counts.
is a relatively long history for an analysis of this kind. These anonymous subscribers
were randomly selected, prior to the CDR being collected, and have stayed
connected during the entire study period. Attributes collected in this sample in-
clude, but are not limited to, the activity initiated timestamp and cell tower location
in latitude and longitude. Note that we do not know the precise location of an individual;
we only know approximately where they are through the location of the cell tower
used to initialize an outbound activity. This paper focuses only on the activities made
domestically.
5.2.2 Usage behavior of aggregated voice call durations and SMS counts & the segmentation stability benchmark
Ordered partitioning of customers based on their aggregated voice call durations or
SMS counts over a period of time is perhaps the most commonly adopted approach
for usage behavioral segmentation in the telecommunications industry. We refer to
this as the benchmark in this paper. Before we examine its segmentation effective-
ness, we explore these attributes. The distributional summary statistics of our sub-
scribers’ aggregated voice call durations and SMS counts over the 17-month period
are shown in Table 5.1, and presented as histograms in Figure 5.3. They reveal
that a typical user initializes more than half an hour of talk and nearly eight SMSs
per month. Note that as a result of both distributions being highly skewed, we have
grouped observations above the 95th percentile together and capped them at that
level, limiting the influence of extremely heavy outlying users in the histograms.
To evaluate the stability of the ordered partitioning of subscribers based on these two
measures, we partition our 17-month data into three non-overlapping periods:
Table 5.2: Stability of the 'benchmark' customer segmentation defined by Equation 5.1

                          Period 1 vs. 2            Period 1 vs. 3
  # of Groups             4      5      10          4      5      10
  Voice Call Durations  53.4%  44.4%  23.8%       55.6%  49.6%  29.3%
  SMS Counts            62.8%  54.9%  32.1%       58.6%  51.7%  28.7%
• Period 1 corresponding to the first five months,
• Period 2 for the following seven months, and
• Period 3 for the final five months (i.e., same months as Period 1 but a year later).
For each period, we partition subscribers into four, five, and 10 equal-sized groups
based on percentiles, and note which group each user's value falls in for each period.
We concentrate on the comparisons between Periods 1 and 2, and Periods 1 and 3,
and define:

    Stability = (# of subscribers in the same group for both periods) / (# of subscribers).    (5.1)
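Equation 5.1 can be computed with a minimal sketch like the following, assuming simple rank-based quantile grouping; the subscriber values here are simulated, not our CDR data, and the function names are our own.

```python
import numpy as np

def quantile_groups(values, n_groups):
    """Assign each subscriber to one of n_groups equal-sized rank groups."""
    ranks = np.argsort(np.argsort(values))   # rank 0 .. n-1
    return ranks * n_groups // len(values)

def stability(values_p1, values_p2, n_groups):
    """Equation 5.1: share of subscribers in the same group in both periods."""
    same = quantile_groups(values_p1, n_groups) == quantile_groups(values_p2, n_groups)
    return float(np.mean(same))

rng = np.random.default_rng(2)
period1 = rng.lognormal(10, 1, size=1082)             # aggregated durations
period2 = period1 * rng.lognormal(0, 0.5, size=1082)  # correlated next period
print(f"stability (4 groups): {stability(period1, period2, 4):.1%}")
```

Finer partitions (more groups) mechanically lower the stability score, which is why Table 5.2's 10-group figures are so much lower than the 4-group figures.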
The results are presented in Table 5.2 and they suggest that users’ usage patterns in
SMS are more stable than their voice calls. Interestingly, their aggregated SMS counts
can predict the behavior of the following period better than the same period a year
later, while the situation is reversed in their aggregated voice call durations. While
users’ behavior in this industry is known to change quickly, many of these prediction
accuracies look alarming. For example, we can only correctly predict the behaviors
of 23.8% of all subscribers for the following period when they are partitioned into 10
quantile groups based on their aggregated voice call durations in the previous pe-
riod. We refer to values in Table 5.2 as the benchmark for our spatial usage behavior
research. However, we acknowledge that the usefulness of these numbers is diffi-
cult to assess. To the best of our knowledge, there exists no benchmark that is better
suited for comparison than the one we have selected; this is a direct result of the fact
that this type of analysis is typically not conducted, despite its importance.
5.2.3 Spatial usage behavior (or mobility patterns)
Human mobility patterns have recently been examined closely (Gonzalez et al.,
2008), revealing that they are highly heterogeneous while exhibiting strong spatial
regularity. That is, individuals typically spend most of their time in their most highly
preferred locations, and occasionally visit other places that are ‘isolated’ in relation
to their usual activity areas. Statistically speaking, this implies that users’ spatial us-
age behavior is not only heterogeneous (both between and within users), but also
spiky. Our telecommunications data supports this finding, and Figure 5.4 (a) reveals
Figure 5.4: Mobility pattern analysis. (a) Percentage of outbound activities made from users' top five preferred locations. (b) Average of users' cumulative activity count distribution with respect to distance from their real centers.
that around 70% of all outbound activities made by each user were initialized at his/her
top five preferred locations, as marked by the corresponding cell towers. This
heterogeneous and spiky nature of the spatial usage behavior, along with the data being
somewhat discrete (cf. the locations recorded in the CDR are restricted to where the
cell towers are located rather than positioned on a continuous plane), poses some
modeling challenges as we shall explain in §5.3.
To help us understand how each user has moved around spatially, we define their
mobility pattern ‘real center’, also known as ‘home’ (e.g., Balazinska and Castro,
2003), as being the most frequently used cell tower location in the most active ar-
eas; each evaluated area is circular with radius being 100 km, and no differences
were found if 200 km, for example, were used. For nearly all of our users, this real
center corresponds to the location of the most frequently used cell tower; exceptions
typically occurred when a subscriber's activities were divided roughly equally
into two or more regions and his/her spatial movement in one of those regions was
largely limited (e.g., a mining site covered by only one or two cell towers).
Note that we use the term 'real' simply to express the fact that these centers are
calculated directly from the actual data (as opposed to the estimation we will carry out in
§5.3.4). Assuming that each latitude or longitude degree always corresponds to 100
km, the spatial behavior of our average user with respect to their real center is shown
in Figure 5.4 (b). It illustrates that, on average, users have made around 65% of all
outbound activities within 10 km of their real centers, and nearly 90% within 100 km.
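The 'real center' described above might be computed along the following lines; the function name, the toy tower coordinates, and the activity counts are all hypothetical, and degree distances are converted at the 1 degree ≈ 100 km approximation used in the text.

```python
import numpy as np
from collections import Counter

def real_center(locations, radius_km=100.0):
    """Most frequently used tower location within the most active circular area."""
    towers = Counter(map(tuple, locations))       # activity count per tower
    coords = np.array(list(towers))
    counts = np.array(list(towers.values()))
    # Pairwise tower distances, treating one degree as 100 km.
    d = np.hypot(coords[:, None, 0] - coords[None, :, 0],
                 coords[:, None, 1] - coords[None, :, 1]) * 100.0
    # Total activity volume of the circle centred on each tower.
    area_volume = (d <= radius_km).astype(int) @ counts
    in_best_area = d[np.argmax(area_volume)] <= radius_km
    # Most-used tower inside the most active area.
    return tuple(coords[np.argmax(np.where(in_best_area, counts, -1))])

# Hypothetical towers (lat, lon): a busy home area plus a remote site.
acts = [(0.0, 0.0)] * 50 + [(0.05, 0.05)] * 20 + [(5.0, 5.0)] * 10
print(real_center(np.array(acts)))
```

In this toy example the busy home-area tower wins, even though a remote site also has activity, mirroring the behaviour described for the mining-site exception above.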
5.3 Modeling Individuals’ Spatial Usage Behavior
Individuals’ mobility patterns have mostly been studied from the perspective of in-
frastructure (Liu et al., 1998; Perkins, 2001; Camp et al., 2002; Balazinska and Cas-
tro, 2003), with the aim of providing better network experiences for the users. How-
ever, these approaches are ineffective for understanding individuals’ spatial usage
behavior and typically do not take the strong spatial regularities discussed in §5.2.3
into consideration. For those studies that do focus on modeling each subscriber's
mobility pattern, little attention is given beyond the identification of 'significant'
locations, which refer to individuals' preferred (cell tower) locations, for example,
their home and workplace (Nurmi and Koolwaaij, 2006). While most users are
stationary in the sense that they can generally be found at the same locations over
time (Balazinska and Castro, 2003), non-significant locations are actually crucial to
the understanding of highly mobile users (e.g., Subscriber D in Figure 5.2 (d)). It is
important to comprehend users’ overall mobility patterns, not just the significant
locations. While it provides some insights, simply identifying the top several most
frequently used cell tower locations by individuals and attempting to characterize
them as done in Cortes et al. (2000), for example, is not adequate for fully under-
standing their spatial usage behavior. In addition, while most studies have utilized
density-based clustering algorithms (e.g., DBSCAN algorithm (Ester et al., 1996)) for
identifying individuals’ significant locations, the accuracy of these algorithms has
been shown to be a concern particularly when there is more than one cell tower cov-
ering the same location (Nurmi and Koolwaaij, 2006). In fact, Wu et al. (2010b) have
shown that these algorithms often fail to identify those frequently used locations.
Additionally, it appears that these algorithms are inadequate for comprehending and
characterizing subscribers’ spatial usage behavior in detail. This is because they can
only represent users' mobility patterns with combinations of many non-overlapping,
irregularly shaped, clearly separated clusters, and that type of representation is
not meaningful here. We will show that the approach we outline is much
more appropriate.
5.3.1 Gaussian mixture model (GMM)
Our approach is to model each user’s overall mobility pattern (i.e., latitude and lon-
gitude) with bivariate Gaussian mixture models (GMMs). These are easy to inter-
pret, flexible and computationally convenient (McLachlan and Peel, 2000). Mixture
models, which are convex combinations of a number of component densities,
have been shown to be capable of approximating any distribution, as in the case of
nonparametric approaches (Escobar and West, 1995; Roeder and Wasserman, 1997),
and have therefore been extensively used in other research (e.g., Wedel and Ka-
makura, 1998, Chapter 6). The spatial density of an individual’s mobility pattern
x = (x_1, ..., x_n) (i.e., n outbound activity observations), when modeled with a
mixture of k Gaussian components, is given by

    f(x) = \sum_{j=1}^{k} w_j \, N(x; \mu_j, T_j^{-1}),    (5.2)
where k ≥ 1, and µ_j and T_j^{-1} represent the mean and covariance, respectively, of
the jth component density; each mixing proportion w_j satisfies 0 ≤ w_j and
∑_{j=1}^{k} w_j = 1; and N(·) denotes a bivariate Gaussian density. We emphasize that each user's
overall spatial usage behavior over a time period is individually modeled with a dif-
ferent GMM and thus fitted independently.
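The density in Equation 5.2 can be evaluated directly; the short sketch below uses SciPy, with hypothetical weights, means, and precision matrices standing in for a fitted subscriber model.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, precisions):
    """Evaluate f(x) = sum_j w_j N(x; mu_j, T_j^{-1}) of Equation 5.2."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=np.linalg.inv(T))
               for w, mu, T in zip(weights, means, precisions))

# Hypothetical two-component fit on the (latitude, longitude) plane.
weights = [0.7, 0.3]
means = [np.array([0.0, 0.0]), np.array([1.0, 2.0])]
precisions = [np.eye(2) * 50.0, np.eye(2) * 10.0]

# Density is highest near the dominant component's mean.
print(gmm_density(np.array([0.0, 0.0]), weights, means, precisions))
```

Note that the model is parameterized with precision matrices T_j, as in the text, so each component's covariance is recovered by inversion before evaluating the Gaussian density.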
5.3.2 The variational Bayesian (VB) method
The most popular approach for fitting the GMM in the literature is the expectation-
maximization (EM) algorithm. However, maximum likelihood (ML) approaches
such as EM (Dempster et al., 1977) can suffer from over-fitting and singularity problems;
these are much less of a problem for the relatively recent variational Bayesian
(VB) inference approach (McGrory and Titterington, 2007). VB is one of the most
computationally efficient Bayesian techniques currently available. It has
been shown to perform as well as or better than the EM algorithm with the use
of Bayesian information criterion (BIC) (Schwarz, 1978), in terms of accuracy, ro-
bustness and rate of convergence for mixture modeling (e.g., Watanabe et al., 2002;
Teschendorff et al., 2005). The Bayesian approach differs from classical or frequen-
tist statistical methods such as ML approaches in its use of probability for naturally
quantifying uncertainty at all levels of the modeling process. Consequently, it pro-
vides a natural framework for producing reliable parameter estimates (Gelman et al.,
2004).
The most practical motivation for adopting VB is that it can automatically select the
number of components k for each mixture model while simultaneously estimating
the parameter values; previous studies have shown that these simultaneous estimates
typically lead to a reliable fit (Attias, 1999; Corduneanu and Bishop, 2001; McGrory
and Titterington, 2007). Automatic selection of k is particularly critical for our application
because the complexity of users' mobility patterns can vary greatly and k is typically
not known in advance. VB determines the 'optimal' k by automatically and
progressively eliminating redundant components when an excessive number of
initial components is specified in the mixture model. Such a strategy is clearly more
efficient than ML/EM approaches where one must choose the ‘optimal’ k by using
more ad-hoc approaches based on comparing measures such as the BIC after fitting
models with various possible k to the same pattern. In addition, the deterministic
nature of VB implies that it is much more computationally efficient than alternative
Bayesian methods that are also capable of simultaneously estimating k such as Re-
versible Jump Markov chain Monte Carlo (RJMCMC) (Richardson and Green, 1997).
122 Chapter 5. Customer Spatial Usage Behavior Profiling and Segmentation
The theory of VB is now well documented in the literature (e.g., Wang and Titterington,
2006). In short, VB approximates the target posterior distribution with a more
tractable distribution, and this approximation is iteratively improved through an
EM-like algorithm that minimizes the Kullback-Leibler (KL) divergence between the
two. The VB approximation for GMMs, through its use of tractable coupled
expressions typically under conjugate distribution settings, has been theoretically
shown to be reliable, asymptotically consistent, and unbiased for large samples
(Wang and Titterington, 2006). Additionally, unlike Markov chain Monte Carlo
(MCMC)-based methods, VB does not suffer from mixing or label-switching problems,
and its model convergence is easier to assess (McGrory and Titterington, 2007).
For the reasons given above, in this paper, we choose the VB algorithm of Wu et al.
(2010b) to approximate each individual's mobility pattern with a GMM. However, the
particular algorithm utilized for fitting the GMM is not critical for this research, and
other existing GMM algorithms could also be adapted for this application.
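To illustrate the component-elimination behaviour described above, the sketch below uses scikit-learn's BayesianGaussianMixture, an off-the-shelf variational Bayesian GMM (not the algorithm of Wu et al. (2010b)), on synthetic rather than subscriber data: the mixture is deliberately over-specified, and the variational updates shrink the weights of redundant components towards zero.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Synthetic bivariate data: two well-separated 'locations' (illustrative only).
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.1, size=(300, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.1, size=(300, 2)),
])

# Deliberately over-specify the number of components; the variational
# updates drive the weights of redundant components towards zero.
vb = BayesianGaussianMixture(
    n_components=8,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(X)

effective_k = int(np.sum(vb.weights_ > 0.01))  # components that survived pruning
```

With an EM/BIC approach one would instead fit a separate model for every candidate k and compare the fits afterwards; here a single run suffices.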
5.3.3 Split and eliminate variational Bayes for Gaussian mixture model (SEVB-GMM) algorithm
The split and eliminate VB for the GMM (SEVB-GMM) algorithm (Wu et al., 2010b)
is particularly suitable for our application. It is very effective for modeling highly
heterogeneous, spiky patterns with little prior knowledge of the parameters or the
model complexity. The unique component-elimination property of VB, coupled
with the additional component-splitting strategy used in SEVB, means that this algorithm
is able to explore the parameter space well and can provide a good data
fit without limiting the number of components k. There are other available
VB-GMM algorithms that allow for component splitting (e.g., Ghahramani and
Beal, 1999; Ueda et al., 2000; Constantinopoulos and Likas, 2007), but this algorithm
adopts a more directed and targeted approach, which is why we choose it here. Empirical
results of Wu et al. (2010b) suggest that, typically, only a handful of split attempts
are required for analyzing patterns such as ours, and this algorithm is therefore
more computationally efficient than alternatives which attempt to split all components
one after another until an 'optimal' model is reached.
We emphasize the importance of allowing components to be split, as it can guide the
algorithms such that they are less likely to converge to a local minimum solution;
algorithms that allow components to be split (e.g., Richardson and Green, 1997; Ueda et al.,
2000; Ueda and Ghahramani, 2002) are thus much more appropriate for fitting our
highly heterogeneous and spiky data than the standard methods. Nevertheless, in
most of these existing algorithms, components are typically only ever split into two
side-by-side subcomponents, making them rather ineffective in their current form.
That is, as observed by Wu et al. (2010b), when fitting heterogeneous, spiky patterns
with GMMs there is actually another distributional defect that needs to be
carefully evaluated. Specifically, it was observed that a fitted component can often
contain an unusually high concentration of observations near the component center,
a direct result of the strong human spatial regularity discussed earlier in §5.2.3.
To address this challenge, Wu et al. (2010b) introduced and incorporated a novel 'inliers
and non-inliers' split process into their VB-GMM algorithm; such a step is
particularly useful for identifying and separating the significant locations (i.e., 'inliers')
of each subscriber from the activity areas surrounding them (i.e., 'non-inliers').
Recall that, besides being heterogeneous and spiky, the other somewhat interesting
feature of our data is its discreteness (and thus point masses), which arises because
individuals' positions cannot be located accurately; they are based on the positions
of the mobile cell towers. While this type of challenge can usually be resolved
with some jittering, such a tactic raises additional issues about how the data should be
'tweaked'; activities initiated through cell towers located in remote areas need to
be adjusted very differently from those in the CBD. Without introducing those small
jumpy movements, we point out that this data singularity issue causes modeling
challenges on two fronts. First, a lack of randomness around certain locations can often
lead to two sets of 'unrelated' observations being grouped together. This can be
addressed quite simply by forcing linear-shaped components (i.e., those simply connecting
two cell tower locations) to split. More important, however, is the question
of how models of the same pattern should be evaluated when the underlying
data lacks jitter. This is a very interesting statistical problem in itself and requires
much further research; we do not attempt to address it fully here. Nonetheless,
we point out that this discreteness issue is particularly problematic when adopting
algorithms such as EM, where k needs to be determined by comparing BIC among
models with various different k values. This is because measures such as BIC rely
heavily on the log-likelihood (LL) calculation, which is influenced by the covariance
matrix estimation for components in the model (Wu et al., 2010b). Covariance matrix
estimation can be numerically unstable for components that are singular or near-singular,
which biases LL/BIC either towards or away from a particular model
when linear-shaped components or components corresponding to point masses
exist. Consequently, Wu et al. (2010b) suggest making use of goodness-of-fit measures
that allow components to overlap or variables to be highly correlated; we
found that the somewhat commonly applied Silhouette coefficient, for example, is
not suitable for evaluating models with overlapping components. Wu et al. (2010b)
proposed an alternative measure called mean absolute error adjusted for covariance
(MAEAC), which aims to approximate the overall model 'absolute errors', taking
advantage of the fact that VB is less likely to over-fit. MAEAC was found to be generally
in agreement with model selection choices made by BIC when the data is well behaved,
but has been shown to be more robust for data that is heterogeneous, spiky, and lacking
jitter (Wu et al., 2010b). Consequently, in this paper, we make use of the SEVB-GMM
algorithm with MAEAC. Again we emphasize that the main focus here is fitting and
interpreting a GMM for each user's spatial usage behavior, not the particular
algorithm for doing the task.
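The log-likelihood instability described above is easy to see in isolation. The toy computation below (our own, with illustrative variances) shows that the Gaussian log-density at a component's mean grows without bound as the fitted variance shrinks towards a point mass, which is what biases LL-based measures such as BIC.

```python
import numpy as np

def bvn_logpdf_at_mean(cov):
    """Log-density of a bivariate Gaussian evaluated at its own mean."""
    return -np.log(2.0 * np.pi) - 0.5 * np.log(np.linalg.det(cov))

# A component covering a modest activity area versus one collapsing onto a
# point mass (e.g., all activities logged at a single cell tower site).
ll_area = bvn_logpdf_at_mean(np.diag([1e-2, 1e-2]))   # finite, moderate
ll_point = bvn_logpdf_at_mean(np.diag([1e-8, 1e-8]))  # inflated as variance -> 0
```

Because the log-density diverges as the covariance determinant approaches zero, models containing such near-singular components can dominate LL/BIC comparisons for reasons unrelated to fit quality.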
5.3.4 Results, model accuracy & computational efficiency
The SEVB-GMM algorithm is relatively robust to the selection of kinitial, the initial
choice of k, and can automatically determine the ‘optimal’ k for each fitted GMM.
However, models with marginally larger k are likely to be obtained with larger kinitial
values for data that is highly heterogeneous. While it is typically sufficient, and certainly
more computationally efficient, to obtain a good visual representation of the
data with a smaller kinitial (Wu et al., 2010b), we opt to initialize the SEVB-GMM algorithm
with a larger kinitial for the potential slight improvement in goodness of fit.
We adopt the default initialization settings used in Wu et al. (2010b) with
kinitial = 30, but here we also force linear-shaped components simply connecting two
sites to be split. We detail our promising results of modeling each user's spatial usage
behavior below.
Figure 5.5 gives the modeling results for the four selected subscribers shown in Figure
5.2. It demonstrates that the SEVB-GMM algorithm appears to be effective for
our application, since the fitted mixture model components seem to capture the patterns.
In terms of the modeling accuracy of the SEVB-GMM, one approach is to
evaluate the pattern distributions with respect to their (estimated) pattern centers.
We opt to estimate the pattern center directly from the GMM instead of utilizing
the 'real center' defined in §5.2.3; this is convenient for profiling and segmenting
users in §5.4 and §5.5, i.e., without access to the raw data. We define the 'SEVB-GMM
center' as the most heavily weighted component center in the most active area, which
generally corresponds simply to the most heavily weighted component center overall
(c.f. §5.2.3). Although the definitions of the real center and the SEVB-GMM center
differ (that is to say, one is no more correct than the other), the difference between
them is shown in Figure 5.6 (a). For over 50% of users, the difference
between these two center definitions is less than 5 km. Figure 5.6 (b) examines the
spatial pattern distribution with respect to the SEVB-GMM centers for both the real
and the modeled data. The SEVB-GMM modeled distribution is estimated from the
component weights and the distances between the component centers and the SEVB-GMM
centers. It shows that SEVB-GMM can effectively summarize the patterns, and the
'gap' on the left-hand side between the modeled and actual distributions (i.e., the
cumulative distribution of the distances between the SEVB-GMM center and the
actual activity locations) can be explained by patterns surrounding the SEVB-GMM centers
Figure 5.5: SEVB-GMM results of the four subscribers in Figure 5.2. (a) Subscriber A. (b) Subscriber B. (c) Subscriber C. (d) Subscriber D. Note that the ellipses represent 95% probability regions for the component densities, whereas the estimated centers of these components are marked by '+'s and the actual observations are marked by 'x's. We also note that the 95% probability regions of some components (e.g., those corresponding to point masses) are not always visible because they are simply too small to be seen. The most noticeable examples are the two most heavily weighted components in (c), which correspond to the three big bubbles (two of them centered at a nearly identical spot) in Figure 5.2 (c).
Figure 5.6: Model accuracy of SEVB-GMM. (a) Distribution of distances between real and SEVB-GMM centers. (b) Average of users' cumulative activity count distribution with respect to distance from their SEVB-GMM centers. Note that '. . .'s refer to calculations made with respect to the SEVB-GMM model fits, whereas '—'s were calculated with respect to the actual data.
that have already been summarized into a handful of components. Figure 5.6 (b)
echoes Figure 5.4 (b) in that users are mostly active within 10 km of their centers,
and, on average, around 10% of their activities were carried out in locations
over 100 km from the centers.
We conclude that it is appropriate to model individuals' mobility patterns with
GMMs. With an average of 10.763 GMM components, or 63.578 parameters (a
bivariate GMM with k full-covariance components has 6k − 1 free parameters:
k − 1 weights, 2k means, and 3k covariance terms), we have been able to accurately
approximate an average of 1,288 outbound activities during the 17-month analyzed
period; this gives us a data compression ratio of approximately 20 to 1. Of course,
this ratio will be higher when a longer history is analyzed. In addition, we are
pleased with the efficiency delivered by VB. The SEVB-GMM algorithm has been
able to obtain a good approximation in an average of 62.612 iterations with an
average of 2.982 split attempts. It is clearly less demanding in terms of computational
requirements than many other Bayesian methods, as well as other VB-GMM splitting
algorithms which attempt to split all components separately until an 'optimal' model
is reached. It is also, in the case of this sample, 36 times more efficient than if we
were to adopt an EM algorithm, given that we now know our final pattern complexities
range from one to 36; using SEVB-GMM has removed the need to separately fit a
model for each number of components k = 1 to 36 and then select the 'optimal'
model. Our next task is to interpret these behavior patterns meaningfully and
automatically before segmenting their spatial behavior in §5.5.
5.4 Profiling Individuals’ Spatial Usage Behavior
Several previous studies have attempted to characterize individuals' mobility patterns
(e.g., Balazinska and Castro, 2003). However, they typically focused on corporate
or campus networks with data histories often only of the order of weeks, and
sometimes even days (Balazinska and Castro, 2003), making them difficult to generalize.
Moreover, Balazinska and Castro (2003) grid-partitioned users' spatial usage
behavior based on the standardized frequency of users' most visited location and
their median usage quantity, which appears overly simplified. On the other hand,
a mobility study described in Ghosh et al. (2006a) did give more consideration to
the spatial aspect of the patterns. They first identified the nature of all locations
with respect to all users, such as seminar rooms, shopping malls, and homes, in the
ETH Zurich Campus of The State University of New York at Buffalo, New York; this
was then utilized for profiling individuals based on their activity frequency in each
location. Despite the fact that we could generalize this study further by matching
each subscriber’s mobility pattern to the social importance of each cell tower loca-
tion (e.g., football stadium or airport) or even the regional census information (e.g.,
education, income, and primary industry such as mining or tourism), this approach
suffers from several drawbacks. Firstly, in the real world, each user’s significant loca-
tions are not aligned. That is, we do not necessarily all live in the same location and
work in the same location. Secondly, the social importance of each location to dif-
ferent individuals can be difficult to determine. For example, shopping malls have
different meanings for people who work there. Finally, as we have articulated earlier,
it is important to comprehend each user’s overall mobility pattern.
That is, simply profiling users based on their visitation probability at each location,
as done in Ghosh et al. (2006a), is still inadequate for fully comprehending or characterizing
users' spatial usage behavior. We reference Larson et al. (2005), which aims
to cluster/profile customers' supermarket shopping paths, a somewhat related focus;
though, as in the case of Ghosh et al. (2006a), their work is also based on first defining
the meaning of each zone of a finite space which has the same significance to all
shoppers (as well as predefining the high-medium-low grouping of the path time).
Studies related to Larson et al. (2005) include Hui et al. (2009a,b), where Hui et al.
(2009b) has a strong analytical focus in relation to the implications of actual purchasing
behavior. Farley and Ring (1966) and Batty (2003), for example, focused on
analyzing pedestrians' zone-to-zone Markovian movement, but these also do not appear
to be valuable to this study. In contrast, in this paper, we develop an approach
for automatically and statistically profiling each user's overall mobility pattern based
on the approximated SEVB-GMM; note that while we have chosen a particular inferential
approach, the underlying concept of partitioning patterns could also be realized in
other ways. We explore the characteristics of the SEVB-GMM components for approximating
users' spatial usage behavior in §5.4.1, then differentiate them in §5.4.2. In
§5.4.4, we introduce several spatial usage behavioral signatures, extracted from the
SEVB-GMM approximations, for automatically profiling users' mobility patterns;
their effectiveness is examined in §5.4.5.
5.4.1 SEVB-GMM component characteristics
Previous literature and analyses such as that illustrated in Figure 5.4 have revealed
most individuals' spatial usage behavioral tendency: they are mostly active
in the neighborhood they live in and only occasionally travel to places outside
that region. This means that the location of a component, expressed as the distance
from its corresponding SEVB-GMM center (which we call ∆), and the component
mixing weight w can be crucial for interpreting individuals' overall mobility
patterns. In addition, the component size and shape can assist us in understanding
what patterns have been summarized nearby. For example, most components
in Figure 5.5 (b) appear to be long and narrow, representing the routes that
the individual appears to have regularly passed through, whereas the two concentrated
point components in Figure 5.5 (c) correspond to the user's significant locations. We
can determine the component size simply by considering the standard deviation
(SD) σ in latitude and longitude, whereas the percentage of the variation (λ1) accounted
for by the first principal component p1 of a component, r = λ1/(λ1 + λ2), computed
through principal component analysis (PCA), can also provide us with insights
into the component shape; in the bivariate case, the second principal component p2
accounts for the remaining variation λ2. Besides distance from center ∆, weight
w, SD σ, and p1's variation share r, components can be described and characterized by
the probability region A that they cover and the maximum length l. Note that here
we adopt 99.9% as the probability region for the ellipses. Assuming a and b are the
ellipse's (i.e., the GMM component's) semi-major and semi-minor axes respectively,
then A = a × b × π and l = 2 × a. We next examine the differentiability of components
with some of these characteristics.
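These characteristics can all be derived from a component's covariance matrix by eigendecomposition (equivalently, PCA). The sketch below is our own minimal illustration, not the thesis code; the 99.9% chi-square quantile for two degrees of freedom is available in closed form, q = −2 ln(1 − 0.999).

```python
import numpy as np

def component_characteristics(cov, prob=0.999):
    """Size and shape summaries of a bivariate GMM component from its covariance."""
    eigvals = np.linalg.eigvalsh(cov)                 # eigendecomposition = PCA here
    lam2, lam1 = np.sort(eigvals)                     # lam1: variation along p1 (largest)
    r = lam1 / (lam1 + lam2)                          # share of variation in p1
    sigma_max = float(np.sqrt(np.max(np.diag(cov))))  # largest SD in lat/long
    q = -2.0 * np.log(1.0 - prob)                     # chi-square(2 df) quantile, closed form
    a, b = np.sqrt(q * lam1), np.sqrt(q * lam2)       # semi-major/semi-minor axes
    return {"r": r, "sigma_max": sigma_max,
            "A": np.pi * a * b,                       # probability-region area
            "l": 2.0 * a}                             # maximum length

# A long, narrow ('Route'-like) hypothetical component; units follow the text,
# where one degree is taken to correspond to 100 km.
route = component_characteristics(np.array([[4.0, 0.0], [0.0, 0.04]]))
```

For an isotropic (circular) component the two eigenvalues coincide, so r = 0.5; for a route-like component r approaches one.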
5.4.2 Differentiating SEVB-GMM components
Analyses to date have motivated us to differentiate observations or SEVB-GMM com-
ponents into the following four types:
• Those that correspond to the individuals’ significant locations (Significant),
• Those in the daily activity areas of an individual but which are not significant
(Urban),
• Those in the locations where the individual does not visit regularly (Remote),
and
• Those that correspond to the long commuting routes frequently traveled by an
individual (Route).
To reinforce this idea and assist in formally defining these component types, we have
conducted the following additional mobility pattern analyses.
Figure 5.7: Mobility pattern analysis based on SEVB-GMM. (a) Distribution of SEVB-GMM component maximum SD σmax for which σmax ≤ 10 km. (b) Distribution of SEVB-GMM component weight w for which w ≤ 0.24. (c) Distribution of the % of variation accounted for by the first principal components (the p1's) of the SEVB-GMM components. (d) Distribution of distances between users' daily activity boundary and their SEVB-GMM centers (note: almost identical for real centers).
The first three analyses are based solely on the characteristics of the SEVB-GMM
components. The distribution of components' maximum SD σ, which we call σmax,
in latitude and longitude is shown in Figure 5.7 (a). It reveals that more than 25% of
all components have σmax ≤ 1/3 km. In fact, most of these compact components
have σmax ≈ 0, corresponding to the exact locations of cell towers and individuals'
significant locations. Figure 5.7 (b) examines the distribution of components'
weight w. It indicates that more than 25% of all components are almost meaningless
with respect to the understanding of individuals' overall mobility patterns, as their
w < 0.01. This needs to be taken into consideration when profiling users' mobility
patterns. The result of the PCA on components is shown in Figure 5.7 (c). A large portion
of the components are narrow, with their p1's accounting for nearly all of the variability;
these components are generally associated with individuals' (long) commuting
routes. Note that we assume all compact components have their p1's accounting for as
much variability as their p2's (i.e., r = 0.5).
A different perspective on individuals' spatial usage behavior is taken in Figure
5.7 (d). This time we focus on users' daily activity areas. It is revealed that over
30% of the subscribers were mostly active within a 10 km (inner) radius of
their centers, but were practically inactive in the circular ring whose outer
radius is 10 km more than the inner radius. We define practically inactive as less than
1% of the activities. Similarly, over 35% of the users were mostly active in locations
where the distance from the center ∆ ≤ 20 km, but were practically inactive in areas where
20 < ∆ ≤ 30 km. Overall, the daily activity areas of approximately 90% of the subscribers
were within the 30 km radius area, whereas nearly everyone had been active
mostly within the 60 km radius area. Consequently, to suit most subscribers,
we define the boundary of daily activity areas (or living neighborhoods), i.e., Significant
and Urban, as 30 km from the SEVB-GMM centers, and Remote locations as
those over 60 km away from the centers. This also implies that we will ignore around
6% of all components centered between 30 and 60 km from the SEVB-GMM centers
because of the difficulty in determining their social importance with respect to each
individual. Note that the 20 km circular area is approximately 1,257 km², about the size
of New York City, whereas the 60 km circular area is approximately 11,309 km², about
the size of metropolitan Sydney.
5.4.3 SEVB-GMM component types
Accordingly, we define compact components with weight w ≥ 0.01 and distance
from center ∆ ≤ 30 km as individuals' significant locations. However, we believe
that it is important to relax the definition of compact from our earlier discussion
with respect to Figure 5.7 (a). This relaxation is necessary in order to avoid situations
in which there is more than one cell tower covering the same area, and the locations
of these different cell towers are represented by a single component which is not
compact under the earlier definition. If we assume the typical distance between
two neighboring cell towers is 4 km, then we define compact, and thus the criterion
for Significant components, as 3 × σmax ≤ 4 km. Recall that σmax refers to the component's
maximum SD. On the other hand, we identify Route components, which can
also be Remote, as those components with ∆ > 30 km, w ≥ 0.01, p1's variation
share r ≥ 0.90, and maximum length l > 12 km. This criterion on l corresponds to
three times the assumed typical distance between two neighboring cell towers. We
formally define different types of components as follows:
• Significant: ∆ ≤ 30 km, 3 × σmax ≤ 4 km, and w ≥ 0.01;
• Urban: ∆ ≤ 30 km, 3 × σmax ≤ 60 km, and not Significant;
• Remote: ∆ > 60 km;
• Route: ∆ > 30 km, r ≥ 0.90, w ≥ 0.01, and l > 12 km.
Note that we could have redefined Remote as those components that are not Route;
however, we found in practice (c.f. §5.4.5 and §5.5.2) that similar conclusions about
individual subscribers' behavior would in any case still be obtained. We next investigate
how we can characterize individuals' overall spatial usage behavior from our detailed
understanding of each SEVB-GMM component.
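Under stated assumptions, the rules above can be written as a small classifier; the function below is our own sketch (the evaluation order is ours, and, following the text, a component may be both Route and Remote, while components centered between 30 and 60 km that are not Route are left unlabeled).

```python
def component_type(delta, sigma_max, w, r, l):
    """Classify a SEVB-GMM component per the rules of Section 5.4.3.

    delta: distance from the SEVB-GMM center (km); sigma_max: largest SD (km);
    w: mixing weight; r: share of variation in the first principal component;
    l: maximum length (km). Returns a set of labels, since Route components
    may also be Remote; components between 30 and 60 km that are not Route
    are left unlabeled (ignored), as in the text.
    """
    labels = set()
    if delta > 30 and r >= 0.90 and w >= 0.01 and l > 12:
        labels.add("Route")
    if delta > 60:
        labels.add("Remote")
    if delta <= 30:
        if 3 * sigma_max <= 4 and w >= 0.01:
            labels.add("Significant")
        elif 3 * sigma_max <= 60:
            labels.add("Urban")
    return labels

# Illustrative checks with hypothetical component values.
home = component_type(delta=2, sigma_max=0.5, w=0.40, r=0.5, l=3)       # {'Significant'}
truck = component_type(delta=150, sigma_max=20, w=0.05, r=0.97, l=400)  # {'Route', 'Remote'}
```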
5.4.4 Spatial usage behavioral signatures
In this section, we introduce several of what we refer to as spatial usage behavioral
signatures; these aim to profile the mobility patterns of each subscriber meaningfully
and automatically based on the component types defined above. The key signatures
are:
• SignificantWt = ∑_{j∈Significant} w(j);
• UrbanWt = ∑_{j∈Urban} w(j);
• UrbanArea = ∑_{j∈Urban} A(j) / (30²π);
• RemoteWtX2 = Min(1, 2 × RemoteWt) = Min(1, 2 × ∑_{j∈Remote} w(j));
• RouteDist = Min(1, ∑_{j∈Route} l(j) / 1000).
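Given components labeled with the types of §5.4.3, the key signatures are simple aggregations. The sketch below (our own illustration, with hypothetical component values; A in km² and l in km) mirrors the definitions above.

```python
import math

def spatial_signatures(components):
    """Compute the five key signatures from labeled components.

    Each component is a dict with keys 'types' (set of labels), 'w' (weight),
    'A' (probability-region area, km^2), and 'l' (maximum length, km).
    """
    def total(key, label, min_w=0.0):
        return sum(c[key] for c in components
                   if label in c["types"] and c["w"] >= min_w)

    return {
        "SignificantWt": total("w", "Significant"),
        "UrbanWt": total("w", "Urban"),
        # Urban components with w >= 0.01 only, per the UrbanArea definition.
        "UrbanArea": total("A", "Urban", min_w=0.01) / (30**2 * math.pi),
        "RemoteWtX2": min(1.0, 2.0 * total("w", "Remote")),
        # The text applies a w > 0.01 threshold to RouteDist; >= is used here.
        "RouteDist": min(1.0, total("l", "Route", min_w=0.01) / 1000.0),
    }

# Hypothetical user: two significant locations, some urban and remote activity.
comps = [
    {"types": {"Significant"}, "w": 0.55, "A": 2.0, "l": 3.0},
    {"types": {"Significant"}, "w": 0.25, "A": 1.5, "l": 2.5},
    {"types": {"Urban"}, "w": 0.12, "A": 400.0, "l": 40.0},
    {"types": {"Remote", "Route"}, "w": 0.08, "A": 300.0, "l": 250.0},
]
sig = spatial_signatures(comps)
```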
We describe these signatures in more detail below; their effectiveness in profiling
and segmentation is examined in §5.4.5 and §5.5.2, respectively.
SignificantWt, UrbanWt & RemoteWtX2 Recall that the weight w is one simple and
informative measure for describing each VB-GMM component. Notice that three of
the signatures we have designed are based on the aggregated w's of different component
types. They are SignificantWt, UrbanWt, and RemoteWt, which equate to the aggregated
w of all Significant, Urban, and Remote VB-GMM components respectively,
representing an individual's share of activities in three different zones. However,
as expected, the majority of users have not been very active in locations outside
their daily activity areas. Consequently, we choose RemoteWtX2 = 2 × RemoteWt
instead of RemoteWt, so that it is more evenly distributed between zero and one.
This rescaling/standardization is necessary for the clustering/segmentation in §5.5,
which utilizes the k-means algorithm (Jain and Dubes, 1988, pp. 89-117).
Figure 5.8 (a)–(c) presents the distributions of the subscribers with respect to these
three signatures. Figure 5.8 (a) indicates that more than 20% of all subscribers have
not been particularly active, in the overall sense, in their significant locations, while a
large number of users have been predominantly active in their first several preferred
locations. Figures 5.8 (b) and (c) show that the majority of subscribers do, at least
occasionally, visit their living neighborhood and places outside their daily activity
areas, while approximately 5% of all users are less typical, having been more active
in locations greater than 60 km away from their center than in their daily activity areas.
Note that this does not imply that our estimated center is incorrect, but rather that
these users have been more active in the various places outside their living neighborhood
combined.
Figure 5.8: Distribution of spatial usage behavior. (a) SignificantWt. (b) UrbanWt. (c) RemoteWtX2. (d) UrbanArea (1 = 30²π km²). (e) RouteDist (1 = 1,000 km). (f) HomeOfficeLik.
UrbanArea It has been observed that the size of each subscriber's daily activity area
does vary, but this cannot always be reflected by UrbanWt. We have therefore introduced
the signature UrbanArea, which equates to the portion of the aggregated probability
region A of all Urban components with weight w ≥ 0.01 with respect to the
nominated area of 30²π = 2,827.43 km², with the aim of providing a good proxy for
this variation. We note that the key limitation of this proposed signature is its inability
to exclude overlapped regions; a GMM focuses on density approximation and
thus often requires multiple components for representing one non-Gaussian component
(Baudry et al., 2010). Figure 5.8 (d) shows the distribution of the subscribers
with respect to UrbanArea. It reveals that around 15% of the users have been very
active throughout most areas of their living neighborhood.
RouteDist The unique nature of Route components implies that the maximum length
l is useful for characterizing this aspect of users' mobility patterns. We have therefore
introduced the signature RouteDist, which equates to the portion of the aggregated l of
these components with weight w > 0.01 with respect to the nominated distance of
1,000 km, representing an individual's overall unique long commuting distance. This is
useful for identifying those whose spatial usage behavior is less conventional, as in
the case of Subscriber B in Figure 5.2 (b). In fact, the unique long commuting
distance for this user was estimated to be 1,785 km, in good agreement with the
1,515 km approximated in Google Maps (c.f. Brisbane to Sydney: 929 km, and Coffs
Harbour to Newcastle via Tamworth: 586 km). Note that this knowledge cannot be
extracted easily by analyzing a user's physical path directly. Recall that we assume
one degree corresponds to 100 km in this paper. Figure 5.8 (e) shows the distribution
of the subscribers with respect to RouteDist, and reveals that the majority of users
rarely commute long distances, with only about 5% of our subscribers having traveled
along routes of more than 1,000 km in combined unique route distance.
Alternative signatures Finally, we would like to point out that we can also interpret,
for example, SignificantWt as the likelihood of an individual being a stationary
subscriber, the combination of UrbanWt and UrbanArea as the likelihood of being
tradesman- or taxi-driver-like, and the combination of RemoteWtX2 and RouteDist as
the likelihood of being interstate-businessperson- or truck-driver-like. In addition, it is
also possible to profile each user's spatial usage behavior slightly differently. For example,
besides describing how active an individual has been, say 70% of his/her time, in significant
locations (SignificantWt = 0.70), we can also profile them as being a certain
type of user with some degree of likelihood. For example, a signature HomeOfficeLik
can be introduced for measuring the likelihood that an individual is a home-office-like
user who is mostly active in two locations assumed to be their home and office. If
we define HomeOfficeLik as the aggregated weight w of the top two Significant components
when there is more than one Significant component, then the likelihood
distribution of a user being home-office-like is as shown in Figure 5.8 (f). It reveals
that only a small number of subscribers have a very high probability of being home-
office-like users.
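A minimal sketch of HomeOfficeLik, using hypothetical weights; the thesis defines it only when there is more than one Significant component, so returning zero otherwise is our own assumption.

```python
def home_office_lik(significant_weights):
    """Aggregated weight of the two most heavily weighted Significant
    components when a user has more than one; zero otherwise (our
    assumption -- the single-component case is not specified here)."""
    if len(significant_weights) < 2:
        return 0.0
    return sum(sorted(significant_weights, reverse=True)[:2])

lik = home_office_lik([0.55, 0.25, 0.05])  # hypothetical weights -> 0.80
```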
5.4.5 Results & spatial usage behavioral profile stability
We illustrate the usefulness of our profiling approach through examples and evalu-
ate the stability of our proposed spatial behavioral signatures for all users with re-
spect to the benchmark. For consistency, we again demonstrate our results with the
same users shown in Figure 5.2. Their signature values are listed in Table 5.3, and we
profile them as follows:
• Subscriber A is mostly active (85.0%; SignificantWt = 0.850) in the selected
fixed locations, rarely visiting other parts of the living neighborhood, and
sometimes (10.8%; RemoteWtX2 = 0.216) flying to towns/cities away from his/
her center;
134 Chapter 5. Customer Spatial Usage Behavior Profiling and Segmentation
Table 5.3: Spatial profile signature values for each subscriber in Figure 5.2
Subscriber   SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
A            0.850           0.001     -           0.216        -
B            0.191           -         -           1.000        1.000
C            0.970           -         -           0.006        0.048
D            -               0.964     1.000       0.031        0.033
Table 5.4: Stability of the customer spatial profile signature values defined by Equation 5.1

                 Period 1 vs. 2            Period 1 vs. 3
# of Groups      4       5       10        4       5       10
SignificantWt    61.7%   56.2%   40.3%     59.0%   52.5%   33.8%
UrbanWt          62.3%   55.7%   38.2%     60.2%   54.2%   36.5%
UrbanArea        62.7%   57.3%   46.0%     58.3%   53.7%   39.8%
RemoteWtX2       80.7%   76.0%   55.2%     78.8%   74.3%   59.2%
RouteDist        80.8%   76.7%   63.0%     81.5%   76.8%   64.3%
HomeOfficeLik    57.3%   53.7%   46.3%     62.5%   58.3%   49.3%
• Subscriber B is mostly active (≥ 50.0%; RemoteWtX2 = 1.000 and
RouteDist = 1.000) on the roads traveling long distances, and sometimes
(19.1%; SignificantWt = 0.191) making calls or sending messages from the se-
lected fixed locations;
• Subscriber C is practically only active (97.0%; SignificantWt = 0.970) in his/her
significant locations;
• Subscriber D is very active (96.4%; UrbanWt = 0.964) throughout his/her entire
living neighborhood with no particular preferred locations.
This provides very similar findings to the discussion in §5.1. In addition, HomeOfficeLik
for Subscribers A and C is 0.850 and 0.970, respectively, suggesting that they
were very active in their top two preferred locations, most likely their home and
workplace.
This highlights how our approach can effectively and automatically provide meaningful
and useful hidden insights based on the values of our proposed signatures,
which capture different aspects of users' spatial usage behavior.
We next examine the effectiveness of profiling individuals into several non-
overlapping, equal-width intervals based on these signatures. For example, if we
group subscribers into 10 groups based on their share of activities in significant lo-
cations, group one will consist of users with 0 ≤ SignificantWt < 0.1, group two
with 0.1 ≤ SignificantWt < 0.2, and so on. Note that this is different from how users are
more commonly partitioned, which is usually based on usage quantiles. We
take this alternative approach because here the partitions have some meaning within
the context of the signature values. To provide some comparison of this profiling
approach with a benchmark, even though the two are not strictly directly comparable,
individuals' CDRs have also been divided into three periods as in §5.2.2 prior to being
approximated with the SEVB-GMM methods (i.e., fitted three times, one GMM for
each user and period).
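The equal-width grouping just described can be sketched as follows (the function name and the handling of the boundary value 1.0 are our assumptions):

```python
def signature_group(value, n_groups=10):
    """Assign a signature value in [0, 1] to one of n_groups equal-width,
    non-overlapping intervals: group 1 covers [0, 0.1), group 2 [0.1, 0.2), ...
    A value of exactly 1.0 falls into the last group."""
    g = int(value * n_groups) + 1
    return min(g, n_groups)

print(signature_group(0.05))   # group 1
print(signature_group(0.191))  # group 2 (e.g., Subscriber B's SignificantWt)
print(signature_group(1.0))    # group 10
```

This is the equal-width alternative to the more common quantile-based partitioning: the group boundaries here carry meaning on the signature scale itself rather than depending on the empirical distribution.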
Table 5.4 presents the stability results, and it shows that our profiling approach has
generally 'outperformed' the benchmark. Again, we stress that the two are not
directly comparable; we do this simply to provide a reference to one of the most popular
approaches used in customer behavioral segmentation today. This implies that
subscribers' spatial usage behavior is, relatively speaking, stable, and allows businesses
to understand, for the first time, how their customers have been using their product/service
in a spatial sense. Some of the shortfalls in prediction accuracy, as in §5.2.2,
can be attributed to subscribers' behavior changing over time, and some loss in
accuracy may have arisen from the SEVB-GMM approximation. In addition, since the
signatures are based on fixed numerical values (e.g., the living-neighborhood boundary
being set at 30 km), these could possibly be optimized further with a Bayesian or fuzzy
approach, which might further improve how well users' spatial usage behavior is
approximated.
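One plausible reading of the period-to-period stability reported in Table 5.4 is the share of users who fall into the same group in both periods; the thesis's exact measure is Equation 5.1 (defined earlier and not reproduced here), so the following is only an illustrative sketch:

```python
def group_stability(groups_p1, groups_p2):
    """Fraction of users assigned to the same group in both periods.
    A plausible sketch only; the exact definition (Equation 5.1) may differ."""
    assert len(groups_p1) == len(groups_p2)
    same = sum(a == b for a, b in zip(groups_p1, groups_p2))
    return same / len(groups_p1)

# Hypothetical group assignments for five users in two periods.
p1 = [1, 2, 2, 3, 4]
p2 = [1, 2, 3, 3, 1]
print(group_stability(p1, p2))  # 0.6
```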
5.5 Spatial Usage Behavioral Segmentation
5.5.1 The k-means (KM) algorithm & selection of number of groups
As well as gaining additional insights into how each individual has used their prod-
uct/service spatially, it is vital for the business to comprehend the similarities and
dissimilarities among users' behavior, for example for distinct marketing propositions
or product/service developments. The most obvious approach for this is a
cluster analysis that groups users into clusters with similar characteristics in an un-
supervised manner. As a first attempt, we adopt the most widely used k-means (KM)
algorithm for grouping users’ overall spatial usage behavior; the limitations of KM
have been well documented (e.g., Wedel and Kamakura, 1998, Chapter 5). Note that
it is also possible to adopt a GMM or mixture modeling approach more generally
for this particular task (Fraley and Raftery, 1998); results of KM and GMM analy-
ses can be very similar (e.g., Symons, 1981; Celeux and Govaert, 1992; Banfield and
Raftery, 1993). However, GMMs are computationally more demanding than the KM
algorithm, their parameter estimates can sometimes be numerically unstable, and,
as pointed out by Baudry et al. (2010), GMM components may need to be merged for
more appropriate clustering results. On the other hand, methods such as DeSarbo
et al. (1990) that make use of multi-dimensional scaling (MDS) may be useful in ob-
taining more ‘appropriate’ groupings; but the transformed/scaled features can make
Table 5.5: Pearson correlation coefficients among spatial profile signatures
                      SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
UrbanWt               -0.736          -         -           -            -
UrbanArea             -0.161          0.374     -           -            -
RemoteWtX2            -0.388          -0.200    -0.231      -            -
RouteDist             -0.211          -0.115    -0.131      0.519        -
Voice Call Durations  0.056           -0.052    0.176       0.014        0.041
SMS Counts            0.210           -0.078    0.102       -0.160       -0.081
the results more difficult to interpret. Please also refer to §5.7 for further discussion
in relation to other clustering techniques.
Results of KM can be influenced by variables that are unequally weighted or unstandardized;
this has already been taken into consideration in our behavioral signature
constructions. However, KM can also produce unreliable results if the variables
defined in §5.4 are highly correlated. We have thus performed a correlation analysis,
with results shown in Table 5.5. It reveals that only SignificantWt and UrbanWt
are moderately negatively correlated, which coincides with our expectation as the
sum of these two variables is approximately equal to 0.80 (c.f., Figure 5.4 (b)), whereas
the correlations among the other variables, including the two benchmark measures, are
quite weak. Overall, we believe KM should be sufficiently robust
for our purposes.
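A correlation screen like Table 5.5 takes only a few lines with NumPy; the signature vectors below are synthetic stand-ins constructed so that the two variables sum to roughly 0.8, mimicking the SignificantWt/UrbanWt relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-user signatures (1,000 users), built so that
# significant_wt + urban_wt is approximately 0.8, as in our real signatures.
significant_wt = rng.uniform(0, 0.8, 1000)
urban_wt = 0.8 - significant_wt + rng.normal(0, 0.05, 1000)

# Pearson correlation coefficient between the two signatures.
r = np.corrcoef(significant_wt, urban_wt)[0, 1]
print(r < -0.9)  # True: strongly negative, by construction
```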
Our biggest challenge in interpreting the KM results is determining the most appro-
priate number of spatial usage behavior groups, g, as in the case of mixture mod-
eling. Milligan and Cooper (1985) have shown that two of the better measures for
determining g, and hence the clustering quality, are the Calinski and Harabasz (CH)
index (Calinski and Harabasz, 1974) and the cubic clustering criterion (CCC) (SAS
Institute Inc., 1983). That is, the local peaks of these two measures, when in agreement,
represent the most likely values of g. The CH index (also known as the Pseudo-F
statistic) aims to capture the tightness of clusters and is given by:
CH index = [SSB / (g − 1)] / [SSW / (m − g)], (5.3)
when m users are grouped into g groups. It is dominated by the ratio between the
between-group sum of squares (SSB) and the within-group sum of squares (SSW). CCC,
on the other hand, can be biased towards larger g, and measures the deviation of the
clusters from the distribution expected if the data had been sampled from a uniform
distribution. We will utilize both measures for the 'optimal' selection of g and hence
the clustering results.
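Equation (5.3) can be implemented directly given cluster labels; a sketch on synthetic data (scikit-learn's `calinski_harabasz_score` computes the same quantity):

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz (Pseudo-F) index: [SSB/(g-1)] / [SSW/(m-g)]."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m = len(X)
    grand_mean = X.mean(axis=0)
    ssb = ssw = 0.0
    groups = np.unique(labels)
    for k in groups:
        Xk = X[labels == k]
        mu_k = Xk.mean(axis=0)
        ssb += len(Xk) * np.sum((mu_k - grand_mean) ** 2)  # between-group SS
        ssw += np.sum((Xk - mu_k) ** 2)                    # within-group SS
    g = len(groups)
    return (ssb / (g - 1)) / (ssw / (m - g))

# Two tight, well-separated synthetic clusters give a large CH value.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (50, 2)),
               np.random.default_rng(2).normal(5, 0.1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(ch_index(X, labels) > 1000)  # True
```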
Figure 5.9: Selected k-means clustering results # 1. (a) Clustering quality evaluated with respect to different g when subscribers are clustered with SignificantWt, UrbanWt & UrbanArea. (b) Clustering quality evaluated with respect to different g when subscribers are clustered with SignificantWt, UrbanWt, UrbanArea, RemoteWtX2 & RouteDist. (c) Clustering quality evaluated with respect to different g when including voice call duration & SMS counts into the setting (b). (d) Variables' R2's (RSQ) for the setting (c), with voice call duration marked as D, SMS counts as S & the five spatial behavioral signatures unmarked. Note that in (a) to (c), lines with • correspond to the CH index, and lines with � correspond to CCC; the number of groups g is generally chosen based on the local maxima shared by both the CH index and CCC.
5.5.2 Results & spatial usage behavioral segmentation stability
Here we focus on segmenting users based on the following three scenarios. We first
consider their behavior only in their living neighborhood, ignoring their activities
outside of this 30 km radius area, where the majority of activities have taken place.
Next, we expand the analysis to include all of the spatial usage behavioral signatures
representing subscribers' overall mobility patterns. Finally, we add the two
benchmark measures to this previous scenario with the aim of increasing our
understanding of these variables' inter-relationships and their implications
for user behavioral segmentation.
Mathematically, our first scenario focuses on clustering users based on SignificantWt,
UrbanWt and UrbanArea; the number of groups g is determined, as discussed
in §5.5.1, based on the CH index and CCC results shown in Figure 5.9 (a). These
suggest that there are four distinct groups of users with respect to their spatial usage
behavior. The four-cluster results are listed in Table 5.6 (a):
Table 5.6: Selected k-means clustering results # 2. (a) 4-cluster solution: Cluster centers & variables' R2's (setting of Figure 5.9 (a)). (b) 6-cluster solution: Cluster centers & variables' R2's (setting of Figure 5.9 (b)).

(a)

Cluster   # Obs   SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
(A)       434     0.690           0.101     0.092       -            -
(B)       189     0.608           0.282     0.813       -            -
(C)       305     0.122           0.564     0.166       -            -
(D)       154     0.095           0.721     0.852       -            -
Variable R2:      0.738           0.664     0.826       -            -

(b)

Cluster   # Obs   SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
(A)       321     0.760           0.117     0.107       0.129        0.077
(B)       170     0.626           0.291     0.832       0.104        0.104
(C)       204     0.165           0.665     0.191       0.161        0.101
(D)       138     0.102           0.748     0.865       0.160        0.146
(E)       125     0.245           0.216     0.174       0.831        0.177
(F)       124     0.303           0.219     0.159       0.698        0.941
Variable R2:      0.696           0.681     0.755       0.690        0.729
• Cluster (A) is the largest and represents those who are mostly active in se-
lected fixed locations, for example, ‘home and office’ (c.f. Subscriber C in Fig-
ure 5.2 (c));
• Cluster (B) consists of users with similar behavior to those in Cluster (A) but are
more active outside those significant locations;
• Cluster (C) groups those who are mostly active in selected parts of their living
neighborhood, for example, 'regional salespersons' or 'tradespersons'; and,
• Cluster (D) represents subscribers who are active throughout most parts of
their living neighborhood, for example, 'taxi drivers' (c.f. Subscriber D in Figure 5.2 (d)).
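The grouping behind Table 5.6 (a) is plain k-means over each user's [SignificantWt, UrbanWt, UrbanArea] vector; a self-contained sketch of Lloyd's algorithm on synthetic signature data (the synthetic cluster locations are loosely inspired by, not equal to, the reported centers):

```python
import numpy as np

def kmeans(X, g, n_iter=100):
    """Plain Lloyd's algorithm; deterministic init from spread-out data points."""
    centers = X[np.linspace(0, len(X) - 1, g).astype(int)]
    for _ in range(n_iter):
        # Assign each user to the nearest cluster center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(g)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Synthetic signature vectors [SignificantWt, UrbanWt, UrbanArea]:
# 'home-office-like' users vs 'taxi-driver-like' users.
rng = np.random.default_rng(3)
home = rng.normal([0.7, 0.1, 0.1], 0.03, (100, 3))
taxi = rng.normal([0.1, 0.7, 0.85], 0.03, (100, 3))
centers, labels = kmeans(np.vstack([home, taxi]), g=2)
print(np.round(centers, 2))  # recovers the two synthetic cluster centers
```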
Next, we examine users’ overall mobility patterns. That is, also taking signatures
RemoteWtX2 and RouteDist into consideration. The CH index and CCC results are
presented in Figure 5.9 (b), which indicate that there are most likely six unique user
groups. The six-cluster results are listed in Table 5.6 (b), and the first four clusters
correspond very nicely to the results in Table 5.6 (a). The remaining two clusters are:
• Cluster (E) identifies users who frequently visit places away from their centers,
most likely by air travel (c.f. Subscriber A in Figure 5.2 (a)); and,
• Cluster (F) represents those who frequently travel along selected routes. For
example, ‘inter-state truck delivery workers’ (c.f. Subscriber B in Figure 5.2 (b)).
In addition, we have observed that if a higher number of groups g is chosen, the
first four groups remain largely the same, while Clusters (E) and (F) are broken up
and differentiated further based on the first three signatures.
Interestingly, however, if we further include voice call duration and SMS counts
(which we first standardized to a maximum of 1, with 1 set at the 95th percentile to
limit the influence of extremely heavy users; c.f. §5.2.2) in the cluster analysis,
the cluster structures suddenly become less clear (c.f. Figure 5.9 (c)). A variable's
R2 measures the proportion of its variance that lies between clusters; and, importantly,
based on this measure, it appears that customers are more differentiable with respect to
their spatial usage behavioral signatures than the benchmark measures, as indicated
in Figure 5.9 (d). Voice call duration, in particular, has performed relatively
poorly throughout the entire analysis. That is, mobility patterns among customer
groups are highly differentiable: the variables' R2's with respect to mobility patterns
(c.f. unmarked lines) are quite high in comparison to those of voice
call duration and SMS counts (c.f. lines marked with D and S).
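The percentile-capped standardization and the per-variable R² just referred to can be sketched as follows (synthetic data; `cap_standardize` and `variable_r2` are our names, and R² here is computed as the between-cluster share of a variable's total sum of squares):

```python
import numpy as np

def cap_standardize(x, pct=95):
    """Scale so the pct-th percentile maps to 1 and cap values above it,
    limiting the influence of extremely heavy users."""
    cap = np.percentile(x, pct)
    return np.minimum(x / cap, 1.0)

def variable_r2(x, labels):
    """Per-variable R^2: between-cluster sum of squares over total."""
    sst = np.sum((x - x.mean()) ** 2)
    ssb = sum(np.sum(labels == k) * (x[labels == k].mean() - x.mean()) ** 2
              for k in np.unique(labels))
    return ssb / sst

rng = np.random.default_rng(4)
durations = rng.exponential(300, 1000)   # synthetic call durations (seconds)
z = cap_standardize(durations)
labels = (z > 0.5).astype(int)           # toy two-group split
print(z.max())                           # 1.0: heavy users are capped
print(0 < variable_r2(z, labels) < 1)    # True
```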
Despite the overwhelming support for how key customer groups differ
with respect to their spatial usage behavior, as articulated earlier, it is important to
examine the segmentation effectiveness. We adopt a similar setting to §5.2.2 and
§5.4.5 in terms of separating the data into three periods. When comparing Period 1 to 2,
and 1 to 3, the stability of the 4-cluster solution is 47.5% and 44.2% respectively;
for the 6-cluster solution, the results are 42.2% and 40.0% respectively. These results
are reasonably good but are poorer than those in §5.2.2 and §5.4.5, despite significant
evidence pointing to such a clustering structure; however, users' spatial usage
behavior is considered more completely here, i.e., more than just
one attribute at a time. Interestingly, the stability measures for the largest cluster,
Cluster (A), are 59.1% and 62.8% for the 4-cluster solution, and 53.6% and 59.1% for
the 6-cluster solution, respectively, making it the most stable customer group and
potentially the most useful for the business. Overall, we believe this implies that more
effort is still required to comprehend users' complicated spatial usage behavior and
the appropriate segmentation. Finally, we have found the results to be generally similar
even when different hard signature parameter values are used, even though KM can be very
sensitive to the derived data values. That is, we view spatial usage behavioral segmentation
as reasonably effective and potentially valuable for business strategy formulation.
5.6 Cross Validation
The clustering result can be unstable in general, and cross validation is thus required
to further demonstrate the usability of the proposed spatial usage behavioral segmentation.
We perform this on several new samples, each much larger
than the initial dataset. The first two new samples are utilized for demonstrating
the differentiability of the groupings; these samples consist of 6,898 and 9,337 users
respectively, and these users do not need to stay connected for the entire 17 months.
However, users in new sample # 1 are randomly sampled, while those in # 2 are block
sampled (based on the internal IDs). New sample # 3 has 6,624 subscribers and is
utilized for evaluating the segmentation stability; users in this dataset were chosen
randomly but need to stay connected throughout the entire 17-month period for the
obvious reason.
We are pleased to observe that all of the results were fairly consistent with what was
observed previously, even though KM can be quite sensitive to the data, suggesting
that our initial sample is quite representative. For example, when grouping subscribers
in new sample # 1 based on our five proposed spatial behavioral signatures (i.e.,
the same setting as in Figure 5.9 (b)), the CH index and CCC also point to the likely
solution of six clusters (c.f. Figure 5.10 (a)); the same holds for new sample # 2. We
note, however, that the CH indexes for both new samples also point to a good three-
cluster solution, but are in disagreement with the CCCs. We further note that when
combining these two samples, the CH index and CCC again point to the likely solution
of six clusters, but the solution is 'less obvious', as in the case of Figure 5.9 (b). Additionally,
the cluster structures disappeared as before (i.e., similar to Figure 5.9 (c)) when voice
call duration and SMS counts are further included in the study. Table 5.7 lists the KM
clustering results for new samples # 1 and # 2; note the near-identical results
between the two samples as well as with the initial sample (i.e., Table 5.6 (b)). Finally, we
note that new sample # 3 illustrates the stability of the six-cluster solution; the stabilities
between Periods 1 and 2, and Periods 1 and 3, are 42.3% and 40.7% respectively,
similar to the findings for our initial sample.
Due to the lack of directly comparable approaches in the literature, our final experiment
in this paper aimed to design a suitable benchmark in order to obtain a
more appropriate comparison with our spatial behavioral segmentation, which appears
to be quite effective, or at least highly differentiable. We aimed to group users by a
simple comparison model based on the raw dataset; behavioral characteristics of an
individual were to be derived mostly from the aggregated activity frequencies in
each zone, determined by the distance to each user's center. For example,
we classified zone urban as being distance ∆ < 30 km from the center, and
zone remote as ∆ > 100 km; activities initiated in urban through heavily used cell
towers of the individual (e.g., frequencies greater than 0.01) were instead classified as
significant. Boundary activities in zone urban were used to proxy our UrbanArea,
whereas we attempted to use the activity frequency in zone 60 < ∆ < 100 km to proxy
our RouteDist, given that the 'path' cell towers are difficult to define and each tower
has different meanings to different users. Unfortunately, we have been unsuccessful
in finding a suitable and comparable simplistic model; the CH index and CCC (c.f.
Figure 5.10 (b)) generally point to a lack of cluster structure. In some settings where
Figure 5.10: Cross validation results # 1. (a) Clustering quality evaluated with respect to different g for the new sample # 1 with the setting of Figure 5.9 (b). (b) Clustering quality evaluated with respect to different g for the new sample # 1 with the unsuccessful simplistic model described in §5.6. Note that lines with • correspond to the CH index, and lines with � correspond to CCC; the number of groups g is generally chosen based on the local maxima shared by both the CH index and CCC.
the cluster structures appear to be more obvious, the resulting clusters are often not very
meaningful at all; i.e., the cluster centers of an attribute proxying a behavioral characteristic
are often very close to each other, and the two main clusters residing near the 'center'
can often consist of around 75% of all users.
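The simplistic benchmark's zone logic can be sketched as follows; the thresholds (30 km, 100 km, 60 km, frequency 0.01) are those stated above, while the function and zone names are ours:

```python
def classify_zone(delta_km, tower_freq):
    """Assign an activity to a zone by distance from the user's center;
    heavily used towers (freq > 0.01) inside 'urban' become 'significant'."""
    if delta_km < 30:
        return "significant" if tower_freq > 0.01 else "urban"
    if delta_km > 100:
        return "remote"
    if 60 < delta_km < 100:
        return "route_proxy"   # our stand-in for RouteDist
    return "other"             # 30 <= delta_km <= 60

print(classify_zone(5, 0.2))     # significant
print(classify_zone(20, 0.001))  # urban
print(classify_zone(80, 0.001))  # route_proxy
print(classify_zone(250, 0.05))  # remote
```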
In summary, the above cross validation has illustrated the usefulness of our research.
5.7 Discussion
In this paper, we have made several contributions. Firstly, we have illustrated how
businesses can further improve their understanding of how customers typically utilize
their product/service, and the feasibility and potential merits of analyzing individuals'
habitual consumption behavior. Secondly, we have shown that our approach
can, for the first time, effectively model and automatically profile each user's
overall spatial usage behavior. Thirdly, individuals' spatial usage behavioral signatures
are more effective, at least for our dataset, for predicting their future behavior
than ordered partitioning of subscribers based on their aggregated voice call durations
and SMS counts. Finally, spatial usage behavior among customer groups is highly
differentiable.
Tactically, we have utilized observational CDR data that is typically already available,
and hence cost efficient, to established businesses. We have shown that
massive amounts of CDRs, when leveraged more fully, can provide wireless
telecommunication providers with a wealth of enhanced customer knowledge. In
fact, we could have also included incoming CDRs and unsuccessful outbound CDRs,
Table 5.7: Cross validation results # 2. (a) 6-cluster k-means algorithm solution for the new sample # 1: Cluster centers & variables' R2's (setting of Figure 5.9 (b)). (b) 6-cluster k-means algorithm solution for the new sample # 2: Cluster centers & variables' R2's (setting of Figure 5.9 (b)).

(a)

Cluster   # Obs   SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
(A)       2335    0.779           0.108     0.098       0.117        0.077
(B)       1104    0.622           0.282     0.790       0.110        0.069
(C)       1375    0.154           0.703     0.214       0.163        0.089
(D)       751     0.100           0.747     0.884       0.199        0.133
(E)       704     0.362           0.145     0.104       0.764        0.180
(F)       627     0.278           0.283     0.191       0.642        0.895
Variable R2:      0.717           0.722     0.765       0.622        0.693

(b)

Cluster   # Obs   SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
(A)       3217    0.779           0.105     0.095       0.126        0.080
(B)       1471    0.614           0.286     0.793       0.118        0.083
(C)       1708    0.163           0.687     0.218       0.163        0.099
(D)       1048    0.099           0.746     0.892       0.194        0.151
(E)       1024    0.351           0.158     0.122       0.780        0.201
(F)       869     0.263           0.272     0.186       0.677        0.923
Variable R2:      0.710           0.704     0.764       0.642        0.686
which are often considered valueless in analyses. Moreover, our data mining approach
has provided a level of granularity and subtlety that cannot be achieved by
approaches such as market research. In addition, our tactic has allowed marketers
to identify and interact with individuals, and could be easily extended into a longitu-
dinal behavioral study. Note that recently there has been much research focused on
approximating the density distributions of stream data such as CDR in an extremely
time-and-space efficient and scalable manner (Aggarwal, 2007a). However, we be-
lieve their non-parametric density representations, as in the case of clustering (c.f.
§5.3), would be less suited for obtaining behavioral descriptors as we have done in
§5.4.
We have demonstrated the accuracy of modeling each subscriber's mobility
pattern with a GMM. While we have assumed that observations are independent
and identically distributed (i.i.d.), we believe this assumption is reasonable given
that we are analyzing users' habitual consumption behavior. Besides
understanding the implications of the observations actually being sequential, we believe
that the most practical extension to this work is to explore users' spatial usage behavior
with respect to different time periods, for example weekdays, weeknights or
weekends (c.f. Ghosh et al., 2006a). Also, our choice of non-overlapping segmentation
is 'unnecessarily restrictive' for interacting with the customers (c.f. Wedel and
Kamakura, 1998, p.32). The results in this paper also prompt us to consider other
alternative approaches for profiling each user's mobility pattern. We are currently exploring
profiling behavior with 'mixed membership' (Airoldi et al., 2008). That is, rather
than assigning each user to one cluster, we may profile users as a mixture of behavioral
clusters. For example, for Subscriber A in Figure 5.2, his/her behavior might
be better profiled as 80% home-office-like plus 20% inter-capital-businessperson-like,
which may further improve the stability results of our spatial
usage behavioral segments in §5.5.
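Under a mixed-membership view, each user carries a vector of cluster proportions rather than a single label; a toy sketch of the bookkeeping (the affinities and profile names are illustrative):

```python
def mixed_profile(weights, names):
    """Normalize per-cluster affinities into membership proportions."""
    total = sum(weights)
    return {n: w / total for n, w in zip(names, weights)}

# Hypothetical affinities of one user to two behavioral clusters.
profile = mixed_profile([4.0, 1.0],
                        ["home-office-like",
                         "inter-capital-businessperson-like"])
print(profile)  # an 80% / 20% split, as in the Subscriber A example
```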
We acknowledge that more investigation is required to understand how our spatial
usage behavioral profiling and segmentation: (1) relate to the current and future
needs and values of the business, (2) correlate with their geo-demographic or other
aspects of customer information or knowledge, including their purchasing behavior,
(3) have implications for future product/service development, and (4) are effective
in interacting meaningfully with each customer in real life. Given the exploratory
nature of this paper, we leave it to future researchers to examine the managerial implications.
Nonetheless, we believe that our richer inferred behavioral descriptions,
derived from a series of exploratory analyses, can provide businesses with a better
understanding of both individuals' and spatial usage behavioral segments' typical
needs and behavior, which marketers can more easily relate to and which are hence
valuable for strategy formulation. Note that many benefits to both the business and
the clients, such as finding the nearest restaurants or cash machines, or assisting
in emergency situations or commercial activities such as advertising, can already
be provided simply by knowing the current position of an individual (Chong et al.,
2009). However, additional benefits can be delivered with detailed understanding
of their spatial needs. That is, it will now be possible to target customers in a more
advanced and precise manner without the need to access all the data all the time or to have
data pre-summarized for particular purposes. For example, our method allows us
to target inter-capital, non-Sydney-based businesspersons with restaurant discount
vouchers in Sydney, or to target users mostly active at significant locations with innovative
mobile devices capable of transforming into premises equipment. Also, it allows
us to provide better services to subscribers in relation to traffic and public trans-
port alerts relevant to them. Furthermore, we believe that our spatial knowledge can
help businesses to better determine the cost of providing services to each user, and
hence assess their value to the business as a result of the different servicing costs
associated with different cell towers. We also believe that our spatial usage behavioral
signatures may further improve predictions of subscribers' future behavior, such
as churn modeling (Bhattacharya, 1998; Mozer et al., 2000; Keaveney and
Parthasarathy, 2001; Lemon et al., 2002; Buckinx and Van den Poel, 2005), which is
currently based on the less effective benchmark measures.
Finally, we believe that it is not sensible to extend this research much further with
traditional clustering algorithms for investigating the relationships among a large
number of customer attributes at the same time. This includes the KM algorithm used here.
This is because, as dimensionality increases, the sparseness of the data usually increases
as a result, which in turn leads to meaningless similarity measures and clustering results
(Agrawal et al., 1998). The situation is worse when a high level of noise (c.f. heterogeneous
behavior) also exists. Interestingly, it has been observed that usually
only a small number of dimensions (i.e., subspaces) are relevant to certain
clusters, whilst noisy signals are often contributed by the remaining
irrelevant dimensions (Agrawal et al., 1998). Put simply, age, for example, may be
critical to one customer group but not to another. Consequently, algorithms that
aim to cluster data over the full dimensionality, as done traditionally, are inappropriate. In
fact, even applying traditional feature selection or transformation prior to clustering,
based on the full-dimension philosophy, will not resolve this dimensionality issue
(Agrawal et al., 1998). Accordingly, we believe future research on this topic needs to
consider adopting recently developed subspace or projected clustering algorithms,
which have been shown to efficiently, effectively and automatically identify groups of
clusters within different subspaces of the same dataset (e.g., Moise et al., 2008). In particular,
we believe the P3C algorithm (Moise et al., 2008) currently looks very promising. It
can deal with both numerical and categorical attributes, and can find both non-
overlapping and overlapping clusters. Overall, given the competition taking place in
the telecommunication industry, we believe this study should be of interest to both
academics and practitioners.
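The dimensionality problem referred to above can be seen in a small simulation: as dimensions are added, the nearest and farthest points become nearly equidistant, which is what renders full-dimensional similarity measures uninformative (a standard illustration using uniform random data, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(5)

def relative_contrast(dim, n=500):
    """(max - min) / min of distances from the origin for n uniform points."""
    X = rng.uniform(0, 1, (n, dim))
    d = np.linalg.norm(X, axis=1)
    return (d.max() - d.min()) / d.min()

low_dim, high_dim = relative_contrast(2), relative_contrast(1000)
print(low_dim > high_dim)  # True: distance contrast collapses as dim grows
```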
5.8 References
Aggarwal, C. C., 2007. Data Streams: Models and Algorithms. Advances in Database
Systems. Springer, New York.
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic subspace clus-
tering of high dimensional data for data mining applications. In: Haas, L. M., Ti-
wary, A. (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference
on Management of Data. ACM, Seattle, WA, pp. 94–105.
Airoldi, E. M., Blei, D. M., Fienberg, S. E., Xing, E. P., 2008. Mixed membership
stochastic blockmodels. The Journal of Machine Learning Research 9 (Sep), 1981–
2014.
Ajzen, I., 2001. Nature and operation of attitudes. Annual Review of Psychology
52 (1), 27–58.
Alderson, W., 1957. Marketing Behavior and Executive Action: A Functionalist Ap-
proach to Marketing Theory. Richard D. Irwin, Homewood, IL.
Attias, H., 1999. Inferring parameters and structure of latent variable models by vari-
ational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Con-
ference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm,
Sweden, pp. 21–30.
Balazinska, M., Castro, P., 2003. Characterizing mobility and network usage in a cor-
porate wireless local-area network. In: Proceedings of the First International Con-
ference on Mobile Systems, Applications, and Services. USENIX, San Francisco,
CA, pp. 303–316.
Banfield, J. D., Raftery, A. E., 1993. Model-based Gaussian and non-Gaussian cluster-
ing. Biometrics 49 (3), 803–821.
Batty, M., 2003. Agent-based pedestrian modelling. In: Longley, P. A., Batty, M. (Eds.),
Advanced Spatial Analysis: The CASA Book of GIS. ESRI Press, Redlands, CA.
Baudry, J.-P., Raftery, A. E., Celeux, G., Lo, K., Gottardo, R., 2010. Combining mix-
ture components for clustering. Journal of Computational and Graphical Statistics
19 (2), 332–353.
Bhattacharya, C. B., 1998. When customers are members: Customer retention in
paid membership contexts. Journal of the Academy of Marketing Science 26 (1),
31–44.
Blattberg, R. C., Deighton, J., 1996. Manage marketing by the customer equity test.
Harvard Business Review July-August, 136–144.
Braun, M., McAuliffe, J., 2010. Variational inference for large-scale models of discrete
choice. Journal of American Statistical Association 105 (489), 324–335.
Buckinx, W., Van den Poel, D., 2005. Customer base analysis: partial defection of
behaviourally loyal clients in a non-contractual FMCG retail setting. European
Journal of Operational Research 164 (1), 252–268.
Calinski, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Communica-
tions in Statistics - Theory and Methods 3 (1), 1–27.
Camp, T., Boleng, J., Davies, V., 2002. A survey of mobility models for ad hoc network
research. Wireless Communications and Mobile Computing 2 (5), 483–502.
Celeux, G., Govaert, G., 1992. A classification EM algorithm for clustering and two
stochastic versions. Computational Statistics & Data Analysis 14, 315–332.
Chong, C.-C., Guvenc, I., Watanabe, F., Inamura, H., 2009. Ranging and localization
by UWB radio for indoor LBS. NTT DOCOMO Technical Journal 11 (1), 41–48.
Christopher, M., Payne, A., Ballantyne, D., 1991. Relationship Marketing: Bring-
ing Quality, Customer Service and Marketing Together. The Marketing Series.
Butterworth-Heinemann, Boston, MA.
Constantinopoulos, C., Likas, A., 2007. Unsupervised learning of Gaussian mixtures
based on variational component splitting. IEEE Transactions on Neural Networks
18 (3), 745–755.
Cooper, R., Kaplan, R. S., 1991. Profit priorities from activity-based costing. Harvard
Business Review May-June, 130–135.
Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mix-
ture distributions. In: Proceedings of the Eighth International Conference on Arti-
ficial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.
Cortes, C., Fisher, K., Pregibon, D., Rogers, A., 2000. Hancock: a language for extract-
ing signatures from data streams. In: Proceedings of the Sixth ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining. ACM, Boston,
MA, pp. 9–17.
Dempster, A. P., Laird, N. M., Rubin, D. B., 1977. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology) 39 (1), 1–38.
DeSarbo, W. S., Howard, D. J., Jedidi, K., 1990. MULTICLUS: A new method for simul-
taneously performing multidimensional scaling and cluster analysis. Psychome-
trika 56 (1), 121–136.
Escobar, M. D., West, M., 1995. Bayesian density estimation and inference using
mixtures. Journal of the American Statistical Association 90 (430), 577–588.
Ester, M., Kriegel, H.-P., Sander, J., Xu, X., 1996. A density-based algorithm for dis-
covering clusters in large spatial databases with noise. In: Simoudis, E., Han,
J., Fayyad, U. M. (Eds.), Proceedings of the Second International Conference on
Knowledge Discovery and Data Mining. AAAI, Portland, OR, pp. 226–231.
Fournier, S., Dobscha, S., Mick, D. G., 1998. Preventing the premature death of rela-
tionship marketing. Harvard Business Review 76 (1), 42–51.
5.8 References 147
Fraley, C., Raftery, A. E., 1998. How many clusters? which clustering method? an-
swers via model-based cluster analysis. The Computer Journal 41 (8), 578–588.
Farley, J. U., Ring, L. W., 1966. A stochastic model of supermarket traffic flow. Oper-
ations Research 14 (4), 555–567.
Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd
Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.
Ghahramani, Z., Beal, M. J., 1999. Variational inference for Bayesian mixtures of fac-
tor analysers. In: Solla, S. A., Leen, T. K., Muller, K.-R. (Eds.), Proceedings of the
1999 Neural Information Processing Systems. MIT, Denver, CO, pp. 449–455.
Ghosh, J., Beal, M. J., Ngo, H. Q., Qiao, C., 2006. On profiling mobility and predicting
locations of wireless users. In: Proceedings of the 2nd International Workshop on
Multi-hop Ad Hoc Networks: From Theory to Reality. ACM, Florence, Italy, pp.
55–62.
Gonzalez, M. C., Hidalgo, C. A., Barabasi, A.-L., 2008. Understanding individual hu-
man mobility patterns. Nature 453 (7196), 779–782.
Heitfield, E., Levy, A., 2001. Parametric, semi-parametric and non-parametric mod-
els of telecommunications demand: an investigation of residential calling pat-
terns. Information Economics and Policy 13 (3), 311–329.
Hofstede, F. T., Wedel, M., Steenkamp, J.-B. E. M., 2002. Identifying spatial segments
in international markets. Marketing Science 21 (2), 160–177.
Hui, S. K., Bradlow, E. T., Fader, P. S., 2009a. Testing behavioral hypotheses using an
integrated model of grocery store shopping path and purchase behavior. Journal
of Consumer Research 36 (3), 478–493.
Hui, S. K., Fader, P. S., Bradlow, E. T., 2009b. The traveling salesman goes shopping:
The systematic deviations of grocery paths from TSP-optimality. Marketing Sci-
ence 28 (3), 566–572.
Jacoby, J., 1978. Consumer research: a state of the art review. Journal of Marketing
42 (2), 87–96.
Jain, A. K., Dubes, R. C., 1988. Algorithms for Clustering Data. Prentice Hall, Upper
Saddle River, NJ.
Keaveney, S. M., Parthasarathy, M., 2001. Customer switching behavior in online
services: an exploratory study of the role of selected attitudinal, behavioral, and
demographic factors. Journal of the Academy of Marketing Science 29 (4), 374–
390.
Larson, J. S., Bradlow, E. T., Fader, P. S., 2005. An exploratory look at supermarket
shopping paths. International Journal of Research in Marketing 22 (4), 395–414.
Lemon, K. N., White, T. B., Winer, R. S., 2002. Dynamic customer relationship man-
agement: incorporating future considerations into the service retention decision.
Journal of Marketing 66 (1), 1–14.
Liu, T., Bahl, P., Chlamtac, I., 1998. Mobility modeling, location tracking, and tra-
jectory prediction in wireless ATM networks. IEEE Journal on Selected Areas in
Communications 16 (6), 922–936.
McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian
model selection for finite mixture distributions. Computational Statistics & Data
Analysis 51 (11), 5352–5367.
McLachlan, G. J., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability
and Statistics. Wiley, New York.
Milligan, G., Cooper, M., 1985. An examination of procedures for determining the
number of clusters in a data set. Psychometrika 50 (2), 159–179.
Moise, G., Sander, J., Ester, M., 2008. Robust projected clustering. Knowledge and
Information Systems 14 (3), 273–298.
Mozer, M. C., Wolniewicz, R., Grimes, D. B., Johnson, E., Kaushansky, H., 2000. Pre-
dicting subscriber dissatisfaction and improving retention in the wireless telecom-
munications industry. IEEE Transactions on Neural Networks 11 (3), 690–696.
Nurmi, P., Koolwaaij, J., 2006. Identifying meaningful locations. In: Proceedings of
the Third Annual International Conference on Mobile and Ubiquitous Systems:
Networks and Services. IEEE, San Jose, CA, pp. 1–8.
Ouellette, J. A., Wood, W., 1998. Habit and intention in everyday life: the multiple
processes by which past behavior predicts future behavior. Psychological Bulletin
124 (1), 54–74.
Peppers, D., Rogers, M., Dorf, B., 1999. Is your company ready for one-to-one mar-
keting? Harvard Business Review 77 (1), 151–160.
Perkins, C. E., 2001. Ad Hoc Networking. Addison-Wesley, Boston, MA.
Reichheld, F. F., 1996. The Loyalty Effect: The Hidden Force Behind Growth, Profits,
and Lasting Value. Harvard Business School, Boston, MA.
Richardson, S., Green, P. J., 1997. On Bayesian analysis of mixtures with an unknown
number of components (with discussion). Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 59 (4), 731–792.
Roeder, K., Wasserman, L., 1997. Practical Bayesian density estimation using mix-
tures of normals. Journal of the American Statistical Association 92 (439), 894–902.
SAS Institute Inc., 1983. Cubic clustering criterion. Tech. Rep. SAS Technical Report
A-108, SAS Institute Inc., Cary, NC.
Schmittlein, D. C., Peterson, R. A., 1994. Customer base analysis: an industrial pur-
chase process application. Marketing Science 13 (1), 41–67.
Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics
6 (2), 461–464.
Smith, W. R., 1956. Product differentiation and market segmentation as alternative
marketing strategies. Journal of Marketing 21 (1), 3–8.
Stryker, S., Burke, P. J., 2000. The past, present, and future of an identity theory. Social
Psychology Quarterly 63 (4), 284–297.
Symons, M. J., 1981. Clustering criteria and multivariate normal mixtures. Biomet-
rics 37 (1), 35–43.
Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D., Caldas, C., 2005.
A variational Bayesian mixture modelling framework for cluster analysis of gene-
expression data. Bioinformatics 21 (13), 3025–3033.
Twedt, D. W., 1967. How does brand awareness-attitude affect marketing strategy?
Journal of Marketing 31 (4), 64–66.
Ueda, N., Ghahramani, Z., 2002. Bayesian model search for mixture models based
on optimizing variational bounds. Neural Networks 15 (10), 1223–1241.
Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G. E., 2000. SMEM algorithm for
mixture models. Neural Computation 12 (9), 2109–2128.
Wang, B., Titterington, D. M., 2006. Convergence properties of a general algo-
rithm for calculating variational Bayesian estimates for a normal mixture model.
Bayesian Analysis 1 (3), 625–650.
Watanabe, S., Minami, Y., Nakamura, A., Ueda, N., 2002. Application of variational
Bayesian approach to speech recognition. In: Becker, S., Thrun, S., Obermayer,
K. (Eds.), Proceedings of the 2002 Neural Information Processing Systems. MIT,
Vancouver, BC, Canada, pp. 1237–1244.
Wedel, M., Kamakura, W. A., 1998. Market Segmentation: Conceptual and Method-
ological Foundations. International Series in Quantitative Marketing. Kluwer Aca-
demic, Boston, MA.
Wu, B., McGrory, C. A., Pettitt, A. N., 2010b. A new variational Bayesian algorithm
with application to human mobility pattern modeling. Statistics and Computing,
(in press).
http://dx.doi.org/10.1007/s11222-010-9217-9
6 Identifying Subspace Clusters for High Dimensional Data with Mixture Models
Abstract
Clustering algorithms have been extensively researched, but the traditional algorithms were developed to search for clusters over the full dimensional space, meaning that they are typically not suitable for analyzing high dimensional data, i.e., for identifying clusters located in different subspaces. In this paper, we put forward a simple alternative approach for integrating the low dimensional patterns to locate subspace clusters, based on geometric considerations among observations. We propose utilizing Gaussian mixture models (GMMs) for approximating the low dimensional densities; equal width histograms are currently in common use for this purpose, but their granularity can often affect the clustering results. We fit the GMMs with the efficient and recently popular non-simulation-based variational Bayesian (VB) method, an alternative to more computationally expensive techniques such as Markov chain Monte Carlo (MCMC) methods. In addition to the fact that the number of clusters need not be prespecified in our approach, its clustering accuracy contrasts markedly with that of the standard full dimensional GMM, as we would expect, and also with that of another existing model-based subspace clustering algorithm. The method is applied to simulated data, yielding promising empirical results.
Keywords
Variational Bayes (VB); Gaussian Mixture Model (GMM); Clustering Algorithm; Sub-
space Clusters; High Dimensional Data
6.1 Introduction
Clustering algorithms aim to automatically segment unlabeled data into relatively meaningful, natural, homogeneous, but hidden subgroups (or clusters). This is done by maximizing intra-cluster similarities and minimizing inter-cluster similarities without the need for any prior knowledge (Hastie et al., 2009, p.501). They have been used for data reduction, compression, summarization and outlier detection, and have been shown to be useful not only as a stand-alone technique, but also as a preprocessing technique for other analytical tasks (Kriegel et al., 2009).
However, while clustering algorithms have been well studied, the traditional algo-
rithms were developed to discover clusters in the full dimensional space and such
an approach is typically not suitable for analyzing high dimensional data (Agrawal
et al., 1998). This is because as the dimensionality increases, the sparseness of data
usually also increases as a result, and this in turn means that the similarity measures
that are critical for clustering are essentially meaningless (Aggarwal et al., 2001). This
phenomenon is known as the ‘curse of dimensionality’ (Bellman, 1961) or ‘empty
space phenomenon’ (Scott, 1992, p.84); it implies a lack of data separation in high
dimensional spaces (Hastie et al., 2009, pp.22-27) and means that the nearest neigh-
bors are not stable (Beyer et al., 1999). However, we stress the point that this does
not imply that observations are always better distinguished in the lower dimensional
space. Therefore, it is still worthwhile to explore techniques for higher dimensional data, which motivates us in this paper to propose an approach that is suitable for identifying clusters in the high dimensional space.
One possible approach to tackle this ‘curse of dimensionality’ issue is to apply di-
mensionality reduction techniques prior to clustering high dimensional data. How-
ever, global techniques such as feature transformation, e.g., principal component
analysis (PCA) and singular value decomposition (SVD), and feature selection have
already been shown to be inappropriate (Chang, 1983; Agrawal et al., 1998; Kriegel
et al., 2009). This is because many attributes are irrelevant (Parsons et al., 2004) and
there may be variation in the level of correlation that is relevant among different
clusters (Kriegel et al., 2009).
Another concept that has been proposed is that of subspace clusters; it is based on
the observation that high dimensional data usually have an intrinsic dimensionality
that is lower than the original full dimensionality (Jain and Dubes, 1988, pp.42-46).
That is, usually only a small number of the dimensions are actually relevant to any particular cluster, and the remaining unwanted attributes often contribute little more than noise (Agrawal et al., 1998). Also, the number of unwanted
attributes is likely to grow with the total number of dimensions, as observations are
increasingly likely to be located in different subspaces. The challenge in clustering
high dimensional data, at least in the notion of subspace clusters, is therefore to
achieve the ability to search effectively and efficiently for groups of clusters within
different subspaces of the same dataset without exhaustively examining all possible
attribute combinations (Parsons et al., 2004). In other words, the key for clustering
high dimensional data is to perform feature selection, but not in the global sense as
is typically done. We follow this notion in this paper.
Many new efficient algorithms following this research direction have recently been
introduced (Parsons et al., 2004; Kriegel et al., 2009). They typically employ the reduced search space strategy based on the observation that if a $d'$-dimensional unit, or cluster, is dense, then so are its projections onto any $(d'-1)$-dimensional subspace
(Agrawal et al., 1998). That is, they often aim to first identify dense regions at a lower
(typically one or two) dimensional subspace. Then, depending on the algorithms
used and the approach taken, subspaces that contain clusters are identified and
clusters are formed, for example, by combining adjacent dense units in a bottom-
up fashion (Agrawal et al., 1998).
Axis-parallel clustering algorithms (as opposed to those focused on finding arbitrarily
shaped clusters) (Kriegel et al., 2009) have often adopted a grid-based approach for
identifying dense regions. That is, they often model the distribution of each dimen-
sion (or dimension combination at the very low level) with an equal width histogram
(or a grid) with a predefined number of bins, and then identify the dense regions of
the histograms using a predefined value for the density threshold parameter. How-
ever, while the results of these algorithms can generally be interpreted meaningfully
and the algorithms are often also capable of finding arbitrarily shaped clusters in
hyper-rectangular format, clusters are likely to be spread over many bins (or grid
cells) (Hinneburg and Keim, 1999); and the accuracy of such a strategy, while simple
and generally effective, depends on the granularity and the positioning of the grid
(Kriegel et al., 2009).
Some grid-based algorithms are flexible in that they utilize an adaptive grid instead
of a fixed interval size static grid (e.g., Nagesh et al., 2000). Alternatively, some al-
gorithms increase flexibility by iteratively lowering the density threshold parameter
value as in Ng et al. (2005), for example, or by setting the threshold parameter value
based on the Poisson distribution, i.e., utilizing the chi-square goodness-of-fit test to examine whether the number of observations in a bin differs significantly from that expected under a uniform distribution, as in Moise et al. (2008). However, while these improved grid-
based approaches can sometimes better identify the dense regions, they can still
mistakenly divide a cluster into several smaller subclusters. This concern is shared
by Liu et al. (2007) who have instead proposed adopting histograms with overlapped
bins. The disadvantage of using grid-based approaches is also highlighted by Kriegel
et al. (2005) who have shown that better clustering results can be obtained by using
a density-based algorithm (c.f. Ester et al., 1996) that is suitable for low dimensional
data to identify the dense regions of each dimension instead of a grid.
In this paper, we follow these above typical bottom-up approaches in identifying
subspace clusters by first approximating the densities in the low dimensional spaces
(which are then utilized for guiding the discovery of the subspace clusters). However,
we do so somewhat differently. We show that it is suitable to adopt Gaussian mixture models (GMMs) for identifying the dense regions at the low dimensional level. Additionally, instead of identifying dense regions of each dimension in the usual way, we opt to do this for each two-dimensional (2D) subspace (i.e., each combination of two dimensions). While such a tactic is clearly less computationally efficient, better clustering results have been obtained by doing so (Ng et al., 2005).
To implement this approach, we adopt McGrory and Titterington (2007)’s algorithm
that is based on the recently popular variational Bayesian (VB) framework for GMM
approximations. VB is able to automatically select the number of components k that
best represents the data (based on the variational approximation), which in this case is the set of observations in each 2D subspace. It also allows estimation of the model parame-
ter values at the same time and is computationally more efficient than the alternative
Markov chain Monte Carlo (MCMC) Bayesian approach (McGrory and Titterington,
2007). Note that one of the features associated with the use of VB is that it results in
a somewhat automatic choice of a suitable k for the fitted model by effectively and
progressively eliminating redundant components specified in the initial models as it
converges. See Wang and Titterington (2006) and McGrory and Titterington (2007),
for example, for more discussion of this aspect of the VB approximation. This VB
property implies that we are less dependent on the initial choice of k, kinitial, than we
would be if we had adopted approaches based on expectation-maximization (EM)
algorithm (Dempster et al., 1977), for example. This is partly because forcing data to be grouped into $k$ components can lead to a component being unnecessarily divided into several smaller ones.
Additionally, unlike some other clustering algorithms, our algorithm does not re-
quire prespecification either of the number of subspace clusters g, or of the average
subspace cluster dimensionality. We do this by taking observations' nearest neighbors into consideration and forming the subspace clusters bottom-up; we found that our simple, straightforward approach (described later in Section 6.3) significantly reduces the impact of the 'curse of dimensionality' in the high dimensional space. Note that despite our modeling the 2D subspaces with GMMs, we form the clusters nonparametrically. This differs from certain model-based algorithms, such as the high dimensional data clustering (HDDC) algorithm (Bouveyron et al., 2007), that aim to model each cluster as a GMM in a subspace. In HDDC, the intrinsic dimensionality of each cluster is estimated iteratively based on the eigenvalues of that cluster's covariance matrix, and $g$ can be determined with a criterion such as the Bayesian information criterion (BIC) (Schwarz, 1978; Fraley and Raftery, 1998). Furthermore, this research also differs from papers (e.g., Raftery and Dean,
2006; Maugis et al., 2009; Scrucca, 2010) that aim to identify one single, but poten-
tially transformed subspace (c.f. variable selection) which best distinguishes high
dimensional data overall. It is also quite different from many other weighted k-
means-like algorithms (e.g., Friedman and Meulman, 2004) that focus on normal-
izing attributes, but do not discard attributes as we do here; these approaches lead
to clusters that are more difficult to interpret.
We organize the rest of this paper as follows. In Section 6.2, we briefly discuss how
VB can be utilized for approximating each 2D subspace with a GMM. In Section 6.3,
we detail our proposed process for identifying subspace clusters. We present our
experimental results and comparisons to the standard full dimensional GMM and
the HDDC algorithm in Section 6.4. Of course, we expect the full dimensional GMM
to perform poorly due to the ‘curse of dimensionality’ effect. We conclude with a
discussion in Section 6.5.
6.2 VB-GMM Algorithm
Finite mixture distributions provide a convenient, flexible way to approximate other
potentially complex distributions (Titterington et al., 1985). In a GMM, it is assumed
that all k underlying mixture components are distributed as Gaussian. The density
of an observation $x$ is given by $\sum_{j=1}^{k} w_j N(x; \mu_j, T_j^{-1})$, where $k \in \mathbb{N}$, $\mu_j$ and $T_j^{-1}$ represent the mean and variance, respectively, of the $j$th component density, each mixing coefficient $w_j$ satisfies $0 \le w_j$ and $\sum_{j=1}^{k} w_j = 1$, and here $N(\cdot)$ denotes a multivariate Gaussian density. In the Bayesian framework, inference is based on the target posterior distribution, $p(\theta, z \mid x)$, where $x = (x_1, \ldots, x_n)$ denotes the observations, $\theta$ denotes the model parameters $(\mu, T, w)$ and $z$ denotes the missing component membership information. Note that the elements of $z$, which we call the $z_{ij}$, are indicator variables such that $z_{ij} = 1$ if observation $x_i$ belongs to the $j$th component and $z_{ij} = 0$ otherwise. The target posterior is proportional to the product of the likelihood and the chosen prior distributions and is generally not analytically tractable.
VB methods have become popular for approximating the target posteriors of a GMM
and the theory is now well documented in the literature (e.g., Wang and Titterington,
2006). VB approximation for a GMM is reliable, asymptotically consistent, and not
biased for large samples (Wang and Titterington, 2006). The idea of VB is to approximate the target posterior by a variational distribution $q(\theta, z \mid x)$ that factorizes over $\theta$ and $z$ so that $q(\theta, z \mid x) = q_\theta(\theta \mid x) \times q_z(z \mid x)$. The distribution $q(\theta, z)$ is chosen to maximize the lower bound on the log marginal likelihood; equivalently, this minimizes the Kullback-Leibler (KL) divergence between the target posterior and the variational approximating distribution. This minimization produces a set of coupled expressions for the variational approximations to the posteriors over the parameters, and these can be iteratively updated to find a solution.
While alternative model hierarchies could be used, this paper follows the model specification and resulting posterior updates that are described in McGrory and Titterington (2007). We model each combination of two dimensions as a mixture of $k$ bivariate Gaussian distributions with unknown means $\mu = (\mu_1, \ldots, \mu_k)$, precisions $T = (T_1, \ldots, T_k)$ and mixing coefficients $w = (w_1, \ldots, w_k)$, such that

$$p(x, z \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left\{ w_j N\!\left(x_i; \mu_j, T_j^{-1}\right) \right\}^{z_{ij}},$$

with the joint distribution being $p(x, z, \theta) = p(x, z \mid \theta)\, p(w)\, p(\mu \mid T)\, p(T)$. We express our priors as:

$$p(w) = \mathrm{Dirichlet}\left(w; \alpha_1^{(0)}, \ldots, \alpha_k^{(0)}\right),$$

$$p(\mu \mid T) = \prod_{j=1}^{k} N\!\left(\mu_j; m_j^{(0)}, \left(\beta_j^{(0)} T_j\right)^{-1}\right), \quad \text{and}$$

$$p(T) = \prod_{j=1}^{k} \mathrm{Wishart}\left(T_j; \upsilon_j^{(0)}, \Sigma_j^{(0)}\right),$$

with $\alpha^{(0)}$, $\beta^{(0)}$, $m^{(0)}$, $\upsilon^{(0)}$, and $\Sigma^{(0)}$ being known, user-chosen values. These are the standard conjugate priors used in Bayesian mixture modeling (Gelman et al., 2004). Using the lower bound approximation, the posteriors are:

$$q_w(w) = \mathrm{Dirichlet}(w; \alpha_1, \ldots, \alpha_k),$$

$$q_{\mu \mid T}(\mu \mid T) = \prod_{j=1}^{k} N\!\left(\mu_j; m_j, (\beta_j T_j)^{-1}\right), \quad \text{and}$$

$$q_T(T) = \prod_{j=1}^{k} \mathrm{Wishart}(T_j; \upsilon_j, \Sigma_j).$$

The variational updates for the posterior parameters are then:

$$\alpha_j = \alpha_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad \beta_j = \beta_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad \upsilon_j = \upsilon_j^{(0)} + \sum_{i=1}^{n} q_{ij},$$

$$m_j = \frac{1}{\beta_j}\left(\beta_j^{(0)} m_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i\right), \quad \text{and}$$

$$\Sigma_j = \Sigma_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i x_i^{T} + \beta_j^{(0)} m_j^{(0)} m_j^{(0)T} - \beta_j m_j m_j^{T},$$

where $q_{ij}$ is the VB posterior probability that the component membership indicator $z_{ij} = 1$, and the required expectations are given by $E(\mu_j) = m_j$ and $E(T_j) = \upsilon_j \Sigma_j^{-1}$. This is a standard VB approach to fitting GMMs; other algorithms with different model hierarchies include Attias (1999) and Corduneanu and Bishop (2001), for example.
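As an illustration, one round of the hyperparameter updates of Section 6.2 can be sketched in a few lines of NumPy. This is our own minimal sketch, not the thesis implementation: the function name, the argument layout, and the simplifying assumption that the scalar prior hyperparameters are shared across components are all our choices.

```python
import numpy as np

def vb_update(x, q, alpha0, beta0, v0, m0, Sigma0):
    """One round of the variational hyperparameter updates for a GMM.

    x      : (n, d) data matrix (d = 2 for a 2D subspace)
    q      : (n, k) responsibilities q_ij (rows sum to one)
    alpha0, beta0, v0 : scalar prior hyperparameters (assumed shared)
    m0     : (k, d) prior component means
    Sigma0 : (d, d) prior scale matrix (assumed shared)
    """
    n, d = x.shape
    k = q.shape[1]
    Nj = q.sum(axis=0)                     # sum_i q_ij for each component j

    alpha = alpha0 + Nj
    beta = beta0 + Nj
    v = v0 + Nj
    # m_j = (beta0 * m0_j + sum_i q_ij x_i) / beta_j
    m = (beta0 * m0 + q.T @ x) / beta[:, None]

    Sigma = np.empty((k, d, d))
    for j in range(k):
        xx = (q[:, j, None] * x).T @ x     # sum_i q_ij x_i x_i^T
        Sigma[j] = (Sigma0 + xx
                    + beta0 * np.outer(m0[j], m0[j])
                    - beta[j] * np.outer(m[j], m[j]))

    # Required posterior expectations: E(mu_j) = m_j, E(T_j) = v_j Sigma_j^{-1}
    E_T = v[:, None, None] * np.linalg.inv(Sigma)
    return alpha, beta, v, m, Sigma, E_T
```

In a full VB-GMM loop, this update would alternate with the recomputation of the responsibilities $q_{ij}$ until convergence.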
6.3 Subspace Clusters Identification
In the following, we assume an observation $x$ has dimensionality $d$ and lies in the feature space $\mathbb{R}^d$. We denote the $l$th attribute of an observation $x_i$ as $x_{il}$ and assume that the $x_{il}$'s have already been standardized such that $0 \le x_{il} \le 1$.
Our approach for identifying associated subspaces and evaluating the results is in a
similar spirit to that of Ng et al. (2005), but differs in that it uses a VB approach for the low dimensional dense region estimation. It can be summarized in the following steps:

1. Approximating the density of each 2D subspace with VB-GMM.

2. Detecting the dense regions of each 2D subspace.

3. Estimating the associated subspace of each observation and deriving each observation's 'signature'.

4. Identifying interesting associated subspaces, or subspace clusters, by merging similar observation 'signatures'.

5. Assigning observations to appropriate subspace clusters.
We next describe these steps in detail.
6.3.1 Approximating the density of each 2D subspace with VB-GMM
Our first step is to adopt the VB-GMM algorithm described in Section 6.2 for approximating the pattern of each 2D subspace, i.e., we need to execute the VB-GMM algorithm a total of $\binom{d}{2}$ times. Typically, the iterative VB-GMM algorithm 'declares' that a model has converged based on examining the lower bound on the log marginal likelihood $F$ (Attias, 1999; Corduneanu and Bishop, 2001; Wang and Titterington, 2006). When the lower bound $F$ of the current iteration is the same as that of the previous iteration, up to a very small tolerance level, the variational scheme has converged.
However, Wu et al. (2010b) pointed out that in practice such an approach can be
computationally wasteful and subsequent iterations can simply be ‘hopping’ among
several alternative ‘good’ models. Consequently, we follow Wu et al. (2010b)’s model
stability criterion for identifying converged models, which is as follows; we found that this led to good results in our simulation trials and improved efficiency. A model is declared to have converged if:
• The number of components k currently in the model has remained unchanged
from the previous iteration (S1);
• The variational posterior mean estimates $m_j$ of all components currently in the model are the same as in the previous iteration up to a very small tolerance level $\delta$ (S2); and,
• At least c0 iterations have been completed (S3).
In other words, instead of monitoring changes in F as is done in most other VB pa-
pers, we focus on key model parameter estimates as this has been found to be ad-
equate. Note that we follow Wu et al. (2010b) and choose c0 to be equal to five; the
role of S3 is simply to prevent the algorithm ‘declaring’ that a model has converged
prematurely before at least some iterations have been carried out.
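The stability criterion (S1)–(S3) can be sketched as a small check run after each iteration. This is an illustrative sketch only: the function name and the assumed per-iteration record of the component count and posterior means $m_j$ are our own choices, not the thesis code.

```python
import numpy as np

def has_converged(history, delta=1e-4, c0=5):
    """Sketch of the model-stability convergence criterion (S1-S3).

    `history` holds one (k, means) record per completed VB iteration,
    where `means` are the current posterior mean estimates m_j.
    The data layout and function name are illustrative assumptions.
    """
    if len(history) < max(c0, 2):          # S3: at least c0 iterations done
        return False
    (k_prev, m_prev), (k_cur, m_cur) = history[-2], history[-1]
    if k_cur != k_prev:                    # S1: component count unchanged
        return False
    # S2: all posterior mean estimates unchanged up to tolerance delta
    return bool(np.all(np.abs(np.asarray(m_cur) - np.asarray(m_prev)) < delta))
```

Note that the means are only compared when the component count is unchanged, so the two arrays are guaranteed to have the same shape.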
Each time we run the algorithm, we initialize it with $k_{\text{initial}}$ components. Assuming we choose $k_{\text{initial}} = h^2$ with $h \in \mathbb{N}$, we propose to informatively assign the initial mixture membership of an observation based on where the observation lies on an $h \times h$ grid. An informative initial allocation strategy has been shown to perform better than simply allocating the observation component membership randomly (Wu
et al., 2010c). However, in order to avoid introducing any bias from this initialization
scheme, we set larger initial component covariance matrices than implied by the
grid. The computational requirements of this step (for a given dimensionality $d$) are very dependent on the choices of $\delta$ and $h$; the implications of this are examined in
Section 6.4.
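The informative grid-based initialization just described can be sketched as follows, under our standing assumption that attributes are standardized to $[0, 1]$; the function name and the clipping of boundary values are our own illustrative choices.

```python
import numpy as np

def initial_assignment(x, h):
    """Informative initialization sketch: map each 2D observation
    (attributes standardized to [0, 1]) to one of k_initial = h*h
    components according to its cell on an h-by-h grid.
    Clipping keeps boundary values (x == 1.0) on the grid."""
    cells = np.minimum((np.asarray(x) * h).astype(int), h - 1)
    return cells[:, 0] * h + cells[:, 1]
```

The returned component indices then seed the responsibilities before the first VB iteration; as noted above, the initial component covariances should be set larger than the grid cells imply to avoid biasing the fit.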
6.3.2 Detection of dense regions of each 2D subspace
In order to identify interesting associated subspaces, we must first identify dense
regions of each 2D subspace. As discussed in the introduction, many differ-
ent approaches have been used in grid-based methods, but these suffered from
some drawbacks. Alternatively, an approach was suggested by Kriegel et al. (2005)
which involves performing density approximation at each dimension with a non-
overlapped density-based clustering algorithm that is suitable for low-dimensional
data instead; dense regions are identified as those one-dimensional clusters with
weights greater than 25% of the average cluster weights.
In this paper, we adopt a similar approach to that of Kriegel et al. (2005). However, in-
stead of ignoring components with small weights as done in Kriegel et al. (2005), we
consider the $j$th mixture component of a 2D subspace to be dense if $w_j \ge c_1 \times w_{\text{average}}$, where $c_1 > 1$ and $w_{\text{average}}$ is the average component weight. Note that $w_{\text{average}}$ can be different for each 2D subspace as it depends on the number of components remaining in the model; this count must be less than or equal to $k_{\text{initial}}$ due to VB's
component elimination property. Additionally, instead of identifying dense regions
based on all observations in the dense components as done in Kriegel et al. (2005),
we only consider observations to be in the dense region if their component likelihood is greater than $c_2$; observations are only considered with respect to their most probable component according to the VB posterior estimate of $q_{ij}$. The implications of choosing different $c_2$ values are examined in Section 6.4.
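The dense-component rule above is a one-line test on the fitted mixture weights. A hypothetical sketch follows; the default value $c_1 = 1.5$ is purely illustrative, not a value prescribed by the text.

```python
import numpy as np

def dense_components(w, c1=1.5):
    """Flag the components of one fitted 2D-subspace GMM as 'dense' when
    their weight w_j is at least c1 times the average component weight
    (with c1 > 1, per the rule above)."""
    w = np.asarray(w, dtype=float)
    return w >= c1 * w.mean()
```

Because the average is taken over the components that survive VB's elimination, the same $c_1$ can yield different absolute thresholds in different 2D subspaces.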
After all dense regions have been identified for all 2D subspaces, we summarize each
observation into a ‘signature’ data structure as proposed by Ng et al. (2005) at the
end of this step. Each observation summary describes whether it has been found in a dense region of each of the 2D subspaces and, if so, in which dense region (i.e., component number). That is, suppose we have four dimensions, A to D, and an
observation has been found (only) to be in the dense region #2 for subspace AB and
dense region #5 for subspace AC, this observation will be summarized as [2 5 0 0 0 0]
corresponding to subspaces AB, AC, AD, BC, BD, and CD. However, unlike Ng et al.
(2005), we do not refer to these observation summaries as signatures yet, as we will
first refine them further in our next step.
6.3.3 Estimating the associated subspace of each observation
After detecting the dense regions for each 2D subspace, our next step, Step 3, is to
estimate the corresponding subspace of the clusters to which the observations are
likely to belong. Ng et al. (2005) determines the ‘best’ estimated associated subspace
of an observation as the union of its dense regions’ dimensions with respect to 2D
subspaces. For example, for the example observation given above, its ‘best’ esti-
mated associated subspace will be ABC even though it was not identified to be in
any of the dense regions of subspace BC, but was identified to be in the dense re-
gions of subspaces AB and AC. Obviously, for this particular observation, one would
be more confident with the estimation if it were also in one of the subspace BC dense
regions. To take this confidence level into consideration, Ng et al. (2005) have pro-
posed to also compute the likelihood of each observation’s ‘best’ estimated associ-
ated subspace; observations with higher likelihood will have more influence on the
final clustering results.
However, we found that the tactic described above for estimating the associated subspace of an observation (i.e., taking the union of its dense regions' dimensions with respect to 2D subspaces) can be ineffective in practice. That is, assuming an
observation is in a cluster for which its true associated subspace includes dimension
A, we have observed that the observation is often found in a dense region of most 2D subspaces involving dimension A, even when the other dimension is irrelevant.
Consequently, we often observed that the likelihoods of a large number of observations, as calculated in Ng et al. (2005) with respect to their 'best' estimated associated subspaces, are practically zero and hence not very useful. For this reason we believe that the
union of all 2D subspace dense regions’ dimensions of an observation should really
be considered as the ‘upper bound’ of the estimated associated subspace of an ob-
servation. Thus, this suggests a need to estimate the associated subspace for each
observation differently.
We propose to do this by refining the observation summaries (described in Sec-
tion 6.3.2) obtained from Step 2. We do this using two simple strategies; both aim
to identify irrelevant dimensions with respect to an observation. We then update the
observation summaries so that the observation will no longer be identified as involv-
ing those irrelevant dimensions. We believe that this way the associated subspace for
each observation can be better estimated.
Our first strategy is based on our observation that if dimension A is highly relevant
to an observation, then the observation is likely to be identified in the dense regions
of most if not all 2D subspaces involving dimension A. Thus, we determine the rele-
vance of a dimension to an observation by counting how many times an observation
has been found in a dense region of 2D subspaces involving that dimension; a dimension with a low count with respect to an observation, i.e., a count less than or equal to c3 of the data dimensionality d, is considered irrelevant. The implications of choosing different c3 values are examined in Section 6.4, where we found c3 = 12% to be a good choice.
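This counting rule can be sketched as follows (an illustrative implementation of our own; the summary layout follows the lexicographic subspace ordering used earlier):

```python
from itertools import combinations

def irrelevant_dimensions(summary, dims, c3=0.12):
    """First refinement strategy: flag dimensions with too few
    dense-region hits among an observation's 2D subspaces.

    summary : per-2D-subspace dense-region numbers (0 = no hit),
              ordered as combinations(dims, 2).
    c3      : fraction of the data dimensionality d; a dimension
              whose hit count is <= c3 * d is deemed irrelevant.
    """
    d = len(dims)
    counts = {dim: 0 for dim in dims}
    for (a, b), region in zip(combinations(dims, 2), summary):
        if region != 0:          # observation is in a dense region here
            counts[a] += 1
            counts[b] += 1
    return {dim for dim in dims if counts[dim] <= c3 * d}
```

For the running example [2, 5, 0, 0, 0, 0] over dimensions A to D, only dimension D has no dense-region hits and would be flagged irrelevant at c3 = 12%.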
For our second strategy, we adopt an approach similar to the dimension voting procedure in Woo et al. (2004); that is, we identify irrelevant dimensions of an observation
by utilizing the estimated associated subspace information of its neighbors. For ex-
ample, if an observation’s estimated associated subspace is ABC and dimension A
was not found to be part of any of its neighbors’ estimated associated subspaces,
then dimension A will be considered as irrelevant for the observation and the ob-
servation’s estimated associated subspace will become just BC. To do this, Woo et al.
(2004) introduce a unique distance measure for identifying an observation’s p near-
est neighbors. However, the properties of this measure have not been explored thor-
oughly. In this paper, we instead utilize a measure similar to that of Ng et al. (2005)
for measuring the similarity between two observation summaries. Suppose we have two observations x1 and x2; we measure their similarity as follows:
$$\mathrm{sim}(x_1, x_2) = \binom{d_{\mathrm{common}}}{2} \bigg/ \binom{d_{\mathrm{unique}}}{2}, \qquad (6.1)$$

where $\binom{d}{2}$ denotes the number of possible two-dimensional combinations of d dimensions, and d_common and d_unique are the numbers of common and unique dimensions of the two observations' estimated associated subspaces, respectively. We found in simulation experiments that this measure can
effectively identify those observations which should not be considered as neighbors and hence have no right to 'vote'. This way, our decision as to whether a dimension is irrelevant for an observation will not depend only on its p nearest neighbors, but rather on all observations that are similar. In this paper, we define x2 to be a
neighbor of x1 if it is at least 30% similar based on Equation 6.1, and the estimated
associated subspace dimension of x1 is relevant if it is shared with at least 70% of its
neighbors. We repeat this several times (e.g., five iterations) or until there are no further changes
to our observation summaries.
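Equation 6.1 and the voting rule can be sketched as follows (illustrative code of our own; we interpret d_unique as the size of the union of the two estimated associated subspaces, which is an assumption on our part):

```python
from math import comb

def summary_similarity(dims1, dims2):
    """Equation (6.1): similarity of two observations' estimated
    associated subspaces, given as sets of dimension labels."""
    d_common = len(dims1 & dims2)
    d_unique = len(dims1 | dims2)   # assumption: 'unique' = union size
    if comb(d_unique, 2) == 0:
        return 0.0
    return comb(d_common, 2) / comb(d_unique, 2)

def relevant_by_vote(target, others, sim_min=0.3, share_min=0.7):
    """Second strategy: keep a dimension of `target` only if at least
    `share_min` of sufficiently similar neighbours also carry it."""
    neighbours = [o for o in others if summary_similarity(target, o) >= sim_min]
    if not neighbours:
        return set(target)
    return {dim for dim in target
            if sum(dim in o for o in neighbours) / len(neighbours) >= share_min}
```

With the 30%/70% thresholds from the text, a dimension carried by too few similar neighbours is voted out.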
Recall that after identifying those irrelevant dimensions with respect to an observa-
tion, we update the observation summaries such that the observations are consid-
ered as not being found in any dense region in the 2D subspaces involving those ir-
relevant dimensions. We refer to this updated observation summary as a 'signature'; its structure still corresponds to the list of the observation's dense region or component numbers for each 2D subspace.
6.3.4 Identifying interesting associated subspaces
Our next step, Step 4, is to merge similar observation signature entries. Our proce-
dure is simple: we group observations together as long as there is no ‘conflict’ in the
dense region number in any of the 2D subspaces, and we set some minimum size
for the group (e.g., 3% of n). We define ‘conflict’ as follows: assuming we have three
observations x1, x2, and x3 and their signature entries with respect to subspace AB
are 0, 5, and 7, respectively; we consider x2 and x3 to be in conflict with each other with respect to subspace AB, whereas x1 and x2, and x1 and x3, are not. Recall that a signature entry equal to 0 implies that the observation is not considered to be in any dense region of that 2D subspace. At the end of this observation signature grouping/merging process, we obtain a list of groups, or subspace clusters,
which are sometimes referred to as projected clusters in the literature.
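The conflict rule and the grouping step can be sketched as follows (an illustrative greedy first-fit merge of our own; the thesis specifies only the conflict test and the minimum group size):

```python
def compatible(sig1, sig2):
    """Two signatures conflict if, in any 2D subspace, both entries
    are non-zero but name different dense regions."""
    return all(a == b or a == 0 or b == 0 for a, b in zip(sig1, sig2))

def merge_signatures(signatures, min_size=2):
    """Greedy grouping sketch: add each signature to the first
    compatible group, then keep only groups of at least min_size."""
    groups = []  # each group: (merged signature, member indices)
    for i, sig in enumerate(signatures):
        for merged, members in groups:
            if compatible(merged, sig):
                for p, v in enumerate(sig):
                    if v != 0:           # fill in the newly seen entries
                        merged[p] = v
                members.append(i)
                break
        else:
            groups.append(([*sig], [i]))
    return [(m, idx) for m, idx in groups if len(idx) >= min_size]
```

In practice the minimum size would be set relative to n (e.g., 3% of n, as in the text).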
6.3.5 Assigning observations to appropriate subspace clusters
Finally, we assign observations into appropriate subspace clusters utilizing the sim-
ilarity measure defined in Ng et al. (2005). Assuming we have subspace cluster s1,
and an observation x1, their similarity is defined as follows:
sim(s_1, x_1) = (number of matched 2D subspaces) / (number of unique 2D subspaces).
If we had x1 and s1 signatures of [2 5 0 2 0 0] and [2 5 3 7 0 0], respectively, then x1 and
s1 would be considered to be 50% similar since the number of unique 2D subspaces
= 4 and the number of matched 2D subspaces = 2. We assign each observation to
a subspace cluster with the highest similarity measure as long as it is greater than a
certain threshold (e.g., 30%). Note that in the above expression, we place no impor-
tance on matching signature entries that are 0, since these dimensions/subspaces are less relevant.
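This assignment measure can be sketched as follows (illustrative code of our own; entries equal to 0 contribute to neither the matched count nor, when both are 0, the unique count, consistent with the worked example above):

```python
def cluster_similarity(cluster_sig, obs_sig):
    """Similarity between a subspace cluster and an observation:
    matched non-zero 2D-subspace entries over 2D subspaces that are
    non-zero in either signature; zero entries carry no weight."""
    matched = sum(c == o and c != 0 for c, o in zip(cluster_sig, obs_sig))
    unique = sum(c != 0 or o != 0 for c, o in zip(cluster_sig, obs_sig))
    return matched / unique if unique else 0.0
```

For the example in the text, signatures [2 5 3 7 0 0] and [2 5 0 2 0 0] give 2 matches over 4 unique subspaces, i.e., 50% similarity; the observation would be assigned to the most similar cluster above the chosen threshold (e.g., 30%).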
6.4 Experimental Results
We demonstrate the effectiveness of our method in identifying subspace clusters in
high dimensional data. One of our goals is to evaluate the suitability of utilizing
mixture models for assisting in identifying subspace clusters. This preliminary eval-
uation is based on a simulated dataset with n = 1000 observations and d = 25 di-
mensions. It consists of five clusters, four of which have weights equal to 20% and one of which has a weight equal to 15%; the remaining 5% of the observations are outliers (c.f. Moise et al., 2008). Each cluster has an intrinsic dimensionality of six, or 24% of the total dimensionality, with some dimensions shared by more than one cluster.
But first, we note that attributes with no relevance to any of the clusters should dis-
play a uniform distribution, while those that are relevant to one or more clusters will
typically display a non-uniform distribution (Moise et al., 2008). Our experimental
evaluations suggested to us that when we approximate bivariate uniform distributions with a VB-GMM, the maximum component weight is generally lower than 1.5 × waverage. Consequently, we set c1 = 1.5; that is, we consider a component to be a dense region if its weight w ≥ 1.5 × waverage.
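The c1 rule can be sketched as follows (illustrative code; the names are ours):

```python
def dense_components(weights, c1=1.5):
    """Flag mixture components as dense regions: a component is dense
    when its weight is at least c1 times the average weight of the
    components remaining after VB's component elimination."""
    w_average = sum(weights) / len(weights)
    return [j for j, w in enumerate(weights) if w >= c1 * w_average]
```

Uniformly distributed 2D subspaces then tend to yield no dense components, while genuinely clustered subspaces yield one or more.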
We evaluate our results in the following contexts. First, we examine the implications of the choice of δ, the tolerance level for determining whether the VB-GMM model has converged. We then evaluate the implications of different GMM granularities, i.e., the choice of grid size h (c.f. h × h = kinitial); of c2, the likelihood threshold above which observations are considered to be in the dense regions; and of c3, the threshold level for determining the relevance of a dimension to an observation.
Finally, we consider situations where the intrinsic dimensionality of the clusters is lower than 24% by adding up to 75 additional irrelevant dimensions. As indicated
in the introduction, the approaches of using both full dimensional GMM, and the
HDDC algorithm (Bouveyron et al., 2007) written within the software package of
MIXMOD (Biernacki et al., 2006) will also be applied in the above situation for comparison. As is done typically, an EM algorithm is used here for the full dimensional GMM, whereas the default settings of HDDC are employed. We note that HDDC is based on work on mixtures of probabilistic PCA (Tipping and Bishop, 1999; McLachlan et al., 2003) and on eigenvalue decomposition of the covariance matrices (Celeux and Govaert, 1995), with only certain essential parameters estimated by an EM algorithm. Given that both algorithms are EM-based and thus dependent on initialization, we execute them for a total of 20 runs, with the results presented being the average of these. Additionally, since neither algorithm can automatically determine the number of clusters g, BIC is utilized for this purpose.

Table 6.1: Comparison of results obtained using different δ, the tolerance level for determining if the VB-GMM model has converged

                                      δ = 10^-3    δ = 5×10^-2
  VB performance
    Avg. kfinal                       8.8          8.9
    Avg. # of VB iterations           93.6         22.1
  Clustering results
    # of subspace clusters            6            5
    Clustering accuracy               70.7%        96.9%
    Avg. cluster dimensionality       5.4          6.0
Note that we compare and evaluate results based on the accuracy of cluster grouping, that is, how many observations were correctly grouped together or identified as outliers; in the event that more than five clusters are found, we classify the results as 'correct' as long as the clusters are simply subsets of the original clusters. Additionally, we report the average dimensionality of the subspace clusters recovered by our approach. We also consider the performance of VB-GMM by reporting the average final number of components in the model, kfinal, over all 2D subspace combinations, and the average number of iterations required to reach converged models.
6.4.1 Sensitivity to choice of δ, the tolerance level for determining if the VB-GMM model has converged

We initialize the VB algorithm for each 2D subspace with nine components, i.e., h = 3 or kinitial = 9, and set c1 = 1.5, c2 = 90% and c3 = 12%. The results are shown in
Table 6.1 for two different choices of δ. We can see that if we have smaller δ then we
will require more VB iterations (i.e., more computations), but it does not guarantee
that we will obtain better clustering accuracy. Therefore it seems that it is only necessary to choose δ sufficiently small to ensure one obtains a good density approximation to each 2D subspace for identifying subspace clusters. Recall that the primary pur-
pose of fitting GMMs is simply to identify the potential dense regions within each 2D
subspace.
Table 6.2: Comparison of results obtained using different GMM granularity

  Granularity h × h (kinitial)    2×2 (4)    3×3 (9)    4×4 (16)    5×5 (25)
  Avg. kfinal                     3.9        8.9        13.8        17.1
  Avg. # of VB iterations         26.6       22.1       27.7        30.1
  # of subspace clusters          6          5          5           5
  Clustering accuracy             78.3%      96.9%      98.7%       81.4%
  Avg. cluster dimensionality     4.8        6.0        6.0         3.6
6.4.2 Sensitivity to choice of GMM granularity h (or kinitial)
Next, we examine the effects of using different GMM granularity i.e., different h
or kinitial. Based on the results reported in the previous subsection, we choose
δ = 5× 10−2 (with c1 = 1.5, c2 = 90% and c3 = 12%) here; the results are shown in Ta-
ble 6.2. The results suggest that having more mixture components for approximating
the density distribution of each 2D subspace does not guarantee that we will obtain
a better clustering accuracy. Note the excellent clustering accuracy that is achieved
when VB-GMM is initialized with either nine or 16 components. This is somewhat in
contrast to the grid-based approaches for which better clustering results may be ob-
tained with finer granularity. This in turn also highlights a potential challenge when
GMMs are estimated with an EM-based algorithm since, unlike VB, it is unable to
remove redundant components. As is the case when a smaller δ is chosen, initial-
izing VB-GMM with larger kinitial will require more computations. This appears to
be wasteful, as empirical results suggest that our inference was not improved. Thus kinitial should not be too large or too small.
6.4.3 Sensitivity to choice of c2, the likelihood threshold where observations are considered to be in the dense regions
Unlike most hard clustering algorithms, mixture models can provide each observa-
tion with a membership likelihood measure with respect to a certain component.
This provides an opportunity to define dense regions based on only a subset of the observations of a component. Here we examine how the choice of c2, the likelihood
threshold where observations are considered to be in the dense regions, can affect
the results, which are shown in Table 6.3. The table shows that, at least for our proposed method, better results can be obtained with a larger c2. That is, we consider a dense
region to be the area covered by all observations that are classified with high prob-
ability as belonging to a heavily weighted component; this differs from Kriegel et al.
(2005)’s approach of using the area covered by all observations assigned to a heavily
weighted component even those for which the probabilities of assignment are rather
low. While we found that adjusting some other parameter values (e.g., simply setting c3 = 20%) can significantly improve the very poor results shown towards the right-hand side of Table 6.3, choosing a smaller c2 still leads to the clustering algorithm being less accurate than choosing a larger c2.

Table 6.3: Comparison of results obtained using different c2, the likelihood threshold where observations are considered to be in the dense regions

  c2                              90%      80%      70%      60%      50%      40%      30%      <20%
  # of subspace clusters          5        5        5        3        2        3        2        2
  Clustering accuracy             96.9%    94.3%    75.0%    41.4%    19.3%    7.8%     21.0%    18.0%
  Avg. cluster dimensionality     6.0      6.0      6.0      8.0      16.0     16.7     16.0     16.5

Table 6.4: Comparison of results obtained using different c3, the threshold level in determining the dimension relevance to an observation

  c3                              4%       8%       12%      16%      20%      24%      28%      32%
  # of subspace clusters          2        5        5        5        5        5        5        5
  Clustering accuracy             22.9%    93.4%    96.9%    97.8%    97.2%    92.5%    92.7%    91.7%
  Avg. cluster dimensionality     6.0      6.0      6.0      6.0      5.0      3.4      3.2      3.2
6.4.4 Sensitivity to choice of c3, the threshold level in determining the dimension relevance to an observation

The clustering results shown in Table 6.4 suggest that setting c3 too small (or too large) can strongly influence the results. For this particular dataset, it appears
that our algorithm is relatively robust; however, having larger c3 implies that more
dimensions with respect to an observation will be considered as irrelevant which in
turn leads to smaller average cluster dimensionality. That is, the subspace clusters
will have been identified mostly correctly, but not all of the dimensions of the clus-
ters will have been. However, in each case the dimensions identified with respect to each subspace cluster were sufficient for obtaining good clustering accuracy.
6.4.5 Effect of data dimensionality d
Finally, we consider an additional scenario where the intrinsic dimensionality of the
subspace clusters is much lower than 24% of the total dimensionality. We test our al-
gorithm in this respect by adding irrelevant dimensions to our existing test data: the intrinsic dimensionality of the subspace clusters is reduced to 12% when an additional 25 noise dimensions are added, and to 8% and 6%, respectively, when a total of 50 and 75 irrelevant dimensions are added. Many existing algorithms would find sce-
narios such as 6% challenging (Moise et al., 2008). While we found the selection of
the parameter values became more critical when the intrinsic dimensionality of the
subspace clusters was smaller, we show that our approach can still identify the sub-
space clusters accurately (see Table 6.5). This contrasts significantly with situations
where the full dimensional GMM or the HDDC algorithm is applied (see Table 6.6).
Table 6.5: Comparison of results obtained for different d

  d                               25       50       75       100
  # of subspace clusters          5        5        5        5
  Clustering accuracy             96.9%    96.6%    93.5%    92.2%
  Avg. cluster dimensionality     6.0      5.0      3.6      3.6

Table 6.6: Comparison of results obtained for different d for full dimensional GMM and HDDC

  Algorithm                  Full Dimensional GMM               HDDC
  d                          25      50      75      100       25      50      75      100
  # of subspace clusters     4.50    1.00    1.00    1.00      7.40    3.40    1.00    failed
  Clustering accuracy        84.8%   20.0%   20.0%   20.0%     92.0%   54.1%   20.0%   n/a
The results of using the full dimensional GMM were to be expected; this is simply the outcome of the 'curse of dimensionality' as discussed in the introduction. In this
particular example, it is unable to cluster the data when d ≥ 50; its recorded accuracy
of 20% simply reflects the fact that the largest component weight of the simulated
dataset is 20%.
On the other hand, HDDC is more robust than the full dimensional GMM. This is
not surprising since HDDC was designed to discover subspace clusters distributed
as Gaussians. However, the results in Table 6.6 indicate that, in this case, the effectiveness of this model-based approach decreased quite sharply with increasing d in
comparison to our proposed method in which the cluster subspaces are, in a sense,
determined based on observations’ nearest neighbors.
Yet, importantly, we note that the number of clusters determined for HDDC by BIC appears to be problematic. When there are fewer irrelevant attributes in the dataset, i.e., when d is smaller, BIC selects a higher number of clusters g than actually exist. This
is perhaps understandable since the BIC is based on providing a density approxima-
tion rather than the number of clusters per se (Biernacki et al., 2000; Baudry et al.,
2010). However, as the number of irrelevant attributes included increases, i.e., d is
larger, the BIC appears to be ineffective in determining the suitable g in the dataset.
That is, in contrast to full dimensional GMM, for d = 50 and 75, HDDC can actu-
ally achieve a somewhat similar clustering accuracy to that of d = 25 when different,
larger g is chosen. This implies that the key issue for HDDC for clustering high di-
mensional data lies with how g should be chosen; BIC appears to have over-penalized in relation to the number of parameters in the model in the high dimensional space (c.f. Biernacki et al., 2000). Nonetheless, we note that a small-matrix determinant estimation error caused the HDDC algorithm to terminate when d = 100.
6.5 Discussion
In this paper, we have shown that we can use mixture models for assisting in identi-
fying subspace clusters and our straightforward intuitive method appears to be use-
ful. The proposed approach for identifying nearest neighbors in the high dimen-
sional space appears effective. We have shown that good results can be obtained
without having to execute many VB iterations, and also that approximating each 2D
subspace in very fine detail may not be helpful in the identification of the subspace
clusters. Additionally, we showed that improved results may be obtained by select-
ing only the highly probable observations in the dense components, rather than all observations of the components as was done in Kriegel et al. (2005). However, we
cannot generalize this finding at this point with respect to other existing algorithms
without further exploration. Finally, we have shown that our method can also iden-
tify subspace clusters with very low intrinsic dimensionality and that it compares favourably with the full dimensional GMM and the HDDC algorithm. Additionally, we have
observed that using the BIC can be problematic in determining the number of clus-
ters. More research is required, particularly on the scalability and the comparisons
to other existing algorithms, as well as on the ability to automatically select appro-
priate parameter values. Ideas from experimental design could also possibly have
application here for reducing the number of subspace combinations that have to be
considered in the VB approximations.
6.6 References
Aggarwal, C. C., Hinneburg, A., Keim, D. A., 2001. On the surprising behavior of dis-
tance metrics in high dimensional spaces. In: Van den Bussche, J., Vianu, V. (Eds.),
Proceedings of the 8th International Conference on Database Theory. Vol. 1973.
Springer, London, pp. 420–434.
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic subspace clus-
tering of high dimensional data for data mining applications. In: Haas, L. M., Ti-
wary, A. (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference
on Management of Data. ACM, Seattle, WA, pp. 94–105.
Attias, H., 1999. Inferring parameters and structure of latent variable models by vari-
ational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Con-
ference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm,
Sweden, pp. 21–30.
Baudry, J.-P., Raftery, A. E., Celeux, G., Lo, K., Gottardo, R., 2010. Combining mix-
ture components for clustering. Journal of Computational and Graphical Statistics
19 (2), 332–353.
Bellman, R. E., 1961. Adaptive Control Processes: A Guided Tour, 5th Edition. Prince-
ton University, Princeton, NJ.
Beyer, K. S., Goldstein, J., Ramakrishnan, R., Shaft, U., 1999. When is “nearest neigh-
bor” meaningful? In: Beeri, C., Buneman, P. (Eds.), Proceedings of the Seventh
International Conference on Database Theory. Vol. 1540. Springer, Jerusalem, Is-
rael, pp. 217–235.
Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (7), 719–725.
Biernacki, C., Celeux, G., Govaert, G., Langrognet, F., 2006. Model-based cluster
and discriminant analysis with the MIXMOD software. Computational Statistics
& Data Analysis 51 (2), 587–600.
Bouveyron, C., Girard, S., Schmid, C., 2007. High-dimensional data clustering. Com-
putational Statistics & Data Analysis 52 (1), 502–519.
Celeux, G., Govaert, G., 1995. Gaussian parsimonious clustering models. Pattern
Recognition 28 (5), 781–793.
Chang, W.-C., 1983. On using principal components before separating a mixture
of two multivariate normal distributions. Journal of the Royal Statistical Society:
Series C (Applied Statistics) 32 (3), 267–275.
Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mix-
ture distributions. In: Proceedings of the Eighth International Conference on Arti-
ficial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.
Dempster, A. P., Laird, N. M., Rubin, D., 1977. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology) 39 (1), 1–38.
Ester, M., Kriegel, H.-P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han,
J., Fayyad, U. M. (Eds.), Proceedings of the Second International Conference on
Knowledge Discovery and Data Mining. AAAI, Portland, OR, pp. 226–231.
Fraley, C., Raftery, A. E., 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41 (8), 578–588.
Friedman, J., Meulman, J., 2004. Clustering objects on subsets of attributes. Journal
of the Royal Statistical Society: Series B (Statistical Methodology) 66 (4), 1–25.
Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd
Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.
Hastie, T., Tibshirani, R., Friedman, J. H., 2009. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics.
Springer, New York.
Hinneburg, A., Keim, D. A., 1999. Optimal grid-clustering: towards breaking the
curse of dimensionality in high-dimensional clustering. In: Atkinson, M. P., Or-
lowska, M. E., Valduriez, P., Zdonik, S. B., Brodie, M. L. (Eds.), Proceedings of the
25th International Conference on Very Large Data Bases. Morgan Kaufmann, Ed-
inburgh, UK, pp. 506–517.
Jain, A. K., Dubes, R. C., 1988. Algorithms for Clustering Data. Prentice Hall, Upper
Saddle River, NJ.
Kriegel, H.-P., Kroger, P., Renz, M., Wurst, S. H. R., 2005. A generic framework for ef-
ficient subspace clustering of high-dimensional data. In: Proceedings of the Fifth
IEEE International Conference on Data Mining. IEEE, Houston, TX, pp. 250–257.
Kriegel, H.-P., Kroger, P., Zimek, A., 2009. Clustering high-dimensional data: A survey
on subspace clustering, pattern-based clustering, and correlation clustering. ACM
Transactions on Knowledge Discovery from Data 3 (1), 1–58.
Liu, G., Li, J., Sim, K., Wong, L., 2007. Distance based subspace clustering with flexi-
ble dimension partitioning. In: Proceedings of the 23rd International Conference
on Data Engineering. IEEE, Istanbul, Turkey, pp. 1250–1254.
Maugis, C., Celeux, G., Martin-Magniette, M.-L., 2009. Variable selection for cluster-
ing with Gaussian mixture models. Biometrics 65 (3), 701–709.
McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian
model selection for finite mixture distributions. Computational Statistics & Data
Analysis 51 (11), 5352–5367.
McLachlan, G. J., Peel, D., Bean, R. W., 2003. Modelling high-dimensional data by
mixtures of factor analyzers. Computational Statistics & Data Analysis 41 (3), 379–
388.
Moise, G., Sander, J., Ester, M., 2008. Robust projected clustering. Knowledge and
Information Systems 14 (3), 273–298.
Nagesh, H. S., Goil, S., Choudhary, A. N., 2000. Adaptive grids for clustering mas-
sive data sets. In: Proceedings of the 2000 International Conference on Parallel
Processing. IEEE, Toronto, ON, Canada, pp. 477–484.
Ng, E. K. K., Fu, A. W.-C., Wong, R. C.-W., 2005. Projective clustering by histograms.
IEEE Transactions on Knowledge and Data Engineering 17 (3), 369–383.
Parsons, L., Haque, E., Liu, H., 2004. Subspace clustering for high dimensional data:
a review. SIGKDD Explorations 6 (1), 90–105.
Raftery, A. E., Dean, N., 2006. Variable selection for model-based clustering. Journal
of the American Statistical Association 101 (473), 168–178.
Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics
6 (2), 461–464.
Scott, D. W., 1992. Multivariate Density Estimation: Theory, Practice, and Visualiza-
tion. Wiley Series in Probability and Statistics. Wiley, New York.
Scrucca, L., 2010. Dimension reduction for model-based clustering. Statistics and
Computing 20, 471–484.
Tipping, M. E., Bishop, C. M., 1999. Mixtures of probabilistic principal component
analyzers. Neural Computation 11 (2), 443–482.
Titterington, D. M., Smith, A. F. M., Makov, U. E., 1985. Statistical Analysis of Finite Mixture Distributions. Wiley Series in Probability and Mathematical Statistics. Wiley, New York.
Wang, B., Titterington, D. M., 2006. Convergence properties of a general algo-
rithm for calculating variational Bayesian estimates for a normal mixture model.
Bayesian Analysis 1 (3), 625–650.
Woo, K.-G., Lee, J.-H., Kim, M.-H., Lee, Y.-J., 2004. FINDIT: a fast and intelligent
subspace clustering algorithm using dimension voting. Information and Software
Technology 46 (4), 255–271.
Wu, B., McGrory, C. A., Pettitt, A. N., 2010b. A new variational Bayesian algorithm
with application to human mobility pattern modeling. Statistics and Computing,
(in press).
http://dx.doi.org/10.1007/s11222-010-9217-9
Wu, B., McGrory, C. A., Pettitt, A. N., 2010c. The variational Bayesian method: com-
ponent elimination, initialization & circular data. Submitted.
7 Conclusion
7.1 Discussion
In this thesis, we have motivated and demonstrated the value of analysing each cus-
tomer’s habitual consumption behaviour with the use of the variational Bayesian
(VB) method and Gaussian mixture models (GMMs). Before concluding our summary of the contributions made here, we detail some future research directions that are expected to be active and useful from the viewpoints of both methodology and application. These are separated into the following two categories:
semi-parametric Bayesian methods & mixed membership models, and the spatial-
temporal/longitudinal extension.
7.1.1 Semi-parametric Bayesian methods & mixed membership models
Nonparametric Bayesian models provide a flexible approach to many more difficult problems; these models can be as complex as necessary (Blei and Jordan, 2004). Their key underlying assumption is that there is a set of random variables arising from some unknown probability distribution and that this underlying distribution is itself drawn from a prior distribution; the cornerstone prior choice is the Dirichlet process (DP), a nonparametric measure on measures (Ferguson, 1973). The DP is useful in that the measures drawn from it are discrete, which avoids problems attached to trying to fit the structure of a model (Blei and Jordan, 2004).
In relation to this research, DP mixture models (Antoniak, 1974; Ferguson, 1983), which are semi-parametric Bayesian methods, are particularly useful in that there is no need to prespecify the number of components k, since it is 'unbounded' (Blei and Jordan, 2006; Heller and Ghahramani, 2007); the DP provides an alternative tactic for model selection
(Escobar and West, 1995). The usefulness of DP mixture models to date is largely due to the development of Markov chain Monte Carlo (MCMC) samplers (MacEachern and Muller, 1998; Neal, 2000). However, despite the fact that some recent studies have shown improvements in model quality by allowing the splitting/merging of components (e.g., Green and Richardson, 2001; Jain and Neal, 2004), by not making use of restricted conjugate prior settings (e.g., Jain and Neal, 2007), or even by obtaining models with hierarchically nested structure (Teh et al., 2006), MCMC-based DP models are not suitable for analysing massive, multivariate and highly correlated data (Blei and Jordan, 2004).
One interesting development in this area is the use of deterministic VB, the underly-
ing method utilised in this research, which until recently has focused only on para-
metric models and typically has been used in the context of the exponential family of distributions (Ghahramani and Beal, 2001; Wainwright and Jordan, 2003). Early re-
sults of VB-based DP mixture models (Blei and Jordan, 2004, 2006) appear effective,
scalable and thus promising. However, one of the key drawbacks of DP mixture mod-
els more generally is the so called ‘rich-get-richer’ property (Wallach et al., 2010);
that is, as number of observations n → ∞, a small number of large clusters and
larger number of small clusters are expected to be found. Nonetheless, research
in this direction seems promising and is certainly a valuable extension. However,
the effectiveness of DP mixture models with respect to heterogeneous and spiky
data such as the one this research has been analysing require further investigation.
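The ‘rich-get-richer’ property can be demonstrated by simulating the Chinese restaurant process, the clustering distribution a DP mixture induces marginally. This is a generic textbook simulation, not code from any of the cited studies; n = 5000 and alpha = 1 are arbitrary choices:

```python
import numpy as np

def crp(n, alpha, rng):
    # Each new observation joins an existing cluster with probability
    # proportional to its size, or opens a new cluster with prob. ~ alpha.
    counts = []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)       # new cluster
        else:
            counts[k] += 1         # existing cluster: rich get richer
    return sorted(counts, reverse=True)

rng = np.random.default_rng(1)
sizes = crp(5000, alpha=1.0, rng=rng)
# Typically a handful of large clusters and a long tail of small ones.
print(sizes[:5], len(sizes))
```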
Note that there are alternatives besides making use of VB in DP mixture models
(Guha, 2010). Examples include Pennell and Dunson (2007), which adopts an em-
pirical Bayes approach, and MacEachern et al. (1999), which makes use of sequential
importance sampling (SIS). Other related approaches include the use of data squash-
ing (DuMouchel et al., 1999; Madigan et al., 2002; Owen, 2003).
Finally, partitioning data into mutually exclusive groups, as we have done in Chap-
ter 5 with respect to segmenting different subscriber spatial usage behaviours, while
useful, is perhaps unnecessarily restrictive. That is, some observations/customers do
naturally belong to multiple groups. Consequently, it may be preferable to ex-
periment with profiling customers’ heterogeneous behaviour using mixed member-
ship models (Blei et al., 2003; Heller and Ghahramani, 2007; Airoldi et al., 2008). This
can be a useful extension; some model-free (projected) clustering algorithms (c.f.
§ 2.3.4) for high dimensional data can already produce overlapping clusters but, to
the best of our knowledge, cannot provide degrees of membership, which
can be critical to this application.
7.1.2 Spatial-temporal/longitudinal extension
Besides assuming that observations are independent and iden-
tically distributed (i.i.d.), one of the most critical assumptions made in this research
is that the attributes, i.e., spatial and temporal behaviours, are independent of
each other (Hsu et al., 2008). In reality, behaviours more generally, not limited to just
spatial usage behaviour, are expected to differ across time
periods, for example, weekdays, weeknights and weekends (c.f. Ghosh et al., 2006a).
Being able to capture and interpret the complex interactions among different be-
haviours is certainly valuable from the viewpoint of this application, though this can
be somewhat circumvented by first partitioning the data into different periods. Of
course, there is still the issue of seasonal/periodic behavioural variations that
must be taken into consideration.
However, it is perhaps more valuable to extend the algorithm to analysing longitu-
dinal/sequential data such that it is possible to track model changes, or at least to
update the latest model, without having all the (historical) data available (c.f.
Doucet et al., 2001; Babcock et al., 2002a). Currently, VB requires iteratively scan-
ning through the entire dataset. One exception is Bruneau et al. (2008), in which the
VB-GMM method is reformulated such that models can be aggregated. In
that work, the algorithm takes the parameter values of existing models as input in-
stead of the observations; the parameters of the models are ‘summarised’ via a modified
VB algorithm making use of a virtual sampling technique (Vasconcelos and Lippman,
1998). This strategy can perhaps be generalised further.
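As an illustration of the general idea of aggregating models rather than raw observations, consider the following sketch. This is only a naive weight-pooling step under assumed one-dimensional summaries; it is not the actual algorithm of Bruneau et al. (2008), which additionally collapses similar components via virtual sampling:

```python
import numpy as np

def merge_gmms(weights_a, means_a, vars_a, n_a,
               weights_b, means_b, vars_b, n_b):
    # Combine two 1-D GMM summaries into one mixture by pooling the
    # components and reweighting by the effective sample sizes.
    total = n_a + n_b
    w = np.concatenate([np.asarray(weights_a) * n_a / total,
                        np.asarray(weights_b) * n_b / total])
    mu = np.concatenate([means_a, means_b])
    var = np.concatenate([vars_a, vars_b])
    return w, mu, var

# A quarter's model (2 components, 1000 calls) merged with a newer
# model (1 component, 500 calls); all figures are invented.
w, mu, var = merge_gmms([0.6, 0.4], [0.0, 5.0], [1.0, 2.0], 1000,
                        [1.0], [5.2], [1.5], 500)
print(np.round(w, 3))  # pooled weights still sum to 1
```

A full scheme would also merge near-duplicate components (here the ones near 5.0) so the summary does not grow without bound.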
In other words, from the viewpoint of this application, it could be preferable to have
each subscriber’s several years’ worth of historical data, i.e., not just 17 months,
‘compressed’, say quarterly, into sets of model parameters; the historical model sum-
maries can then be utilised for ‘updating’ the latest (possibly exponentially decaying)
models. Such a tactic is clearly more computationally efficient than the current ap-
proach, which requires all of the raw data to be available. Moreover, the resulting abil-
ity to analyse individuals’ gradual or sudden behavioural changes (c.f. Jacoby, 1978;
Flint et al., 1997; Eriksson and Mattsson, 2002; Liu et al., 2005; Wang and Hong, 2006),
or to differentiate/segment their behaviour longitudinally, would perhaps be even
more beneficial, since subscriber behaviour is known to be typically non-stationary
(Wedel and Kamakura, 1998, Chapter 10).
The temporal aspect of behaviour is rarely examined. In other words, customer
segmentations today are, at best, “redrawn as soon as they have lost their relevance”
(Yankelovich and Meer, 2006, p.129). This is perhaps largely a result of the lack
of suitable/established models (c.f. Wedel and Kamakura, 1998, Chapter 10), the
volatile nature of customer behaviour (Hunt, 1997; Arnould et al., 2004), and the dif-
ficulty of capturing all of the historical events/interventions (and forecasting them, for
that matter) and determining their causal implications. We believe extending
the current VB-GMM method to the temporal setting may assist this greatly, since VB
is efficient in terms of computational storage requirements and speed. Note that
behavioural changes can also lead to consumption events or interventions by the
customer (e.g., churn or change of product/service) to address them; the relationship
between the two is not necessarily causal, as is usually assumed.
7.2 Summary of Contributions
The aim of this research has been to turn large volumes of dormant, seemingly val-
ueless, but typically readily available ‘cheap’ data into detailed and useful customer in-
telligence, i.e., statistical data mining, for the wireless telecommunication industry.
A unique aspect of the work presented here is the investigation into modelling, in-
terpreting and differentiating, in an effective and efficient manner, patterns which are
highly heterogeneous, both within and between patterns, and spiky (c.f. each wireless
subscriber’s spatial usage behaviour); the term spiky describes data patterns
with large areas of low probability mixed with small areas of high probability. Devel-
opments of the recently popular variational Bayesian (VB) method, Gaussian
mixture models (GMMs) and clustering techniques underpin the entire study.
Chapter 1 outlined the motivation for our research and, with selected illustrations,
the potential merit, from the viewpoint of business applications, of analysing individuals’
frequently overlooked habitual consumption behaviour. It detailed the
broader research interest in understanding how customers typically utilise prod-
ucts/services temporally and spatially, and motivated the necessity of using sta-
tistical modelling in this context. Appendix A provided more in-depth related detail.
Chapter 2 provided an up-to-date, thorough review and investigation of the nature
of the problem and the data, as well as consideration of the advantages and drawbacks
of various potentially useful current techniques. These methodological reviews dis-
cussed many very recent developments. Note that while our dataset can be consid-
ered a data stream, our research did not proceed in the direction of data stream
mining. It is worth pointing out that the definition of a data stream was first outlined
in 1998. Research labs within companies such as AT&T, Bell, Google,
IBM and Microsoft have since been actively researching suitable techniques in this
still relatively unestablished field; this could be an exciting future research direction.
The careful methodological evaluation which led to the utilisation of VB and GMMs
for the study was also described. Appendices B and C gave more detailed information
related to these issues.
Chapter 3 outlined and illustrated the effectiveness and robustness
of adopting VB-GMM for modelling one-dimensional circular data, i.e., individuals’
temporal usage behaviour with respect to the hours of the day. Recall that
one of the key advantages of fitting GMMs with VB is its ability to automatically de-
termine the number of components needed to represent the data by effectively eliminating
redundant components specified in the initial model. This chapter also examined
and noted a generally overlooked implication of this irreversible property of
VB, as well as the implications of the initial component allocation strategies.
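The elimination behaviour described above can be reproduced with scikit-learn's variational Gaussian mixture implementation, used here as a stand-in for the thesis's VB-GMM; the synthetic data, the deliberately over-specified ten initial components, and the small weight concentration prior are all illustrative choices:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
# Two true components; the initial model deliberately specifies ten.
X = np.concatenate([rng.normal(0.0, 1.0, 500),
                    rng.normal(8.0, 1.0, 500)]).reshape(-1, 1)

vb = BayesianGaussianMixture(n_components=10,
                             weight_concentration_prior=0.01,
                             max_iter=500, random_state=0).fit(X)
# Redundant components are effectively eliminated: their weights
# shrink towards zero, leaving only components with appreciable mass.
print(np.sum(vb.weights_ > 0.05))
```

Note that, as discussed in the chapter, elimination is irreversible within a run: once a component's weight has collapsed, the standard algorithm cannot resurrect it.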
Chapter 4 was an important and significant chapter. A new VB algorithm, called
split and eliminate variational Bayesian (SEVB), was developed; it is capable of
exploring the parameter space more fully than the standard method for two-
dimensional data. Unlike the standard algorithm, our SEVB can discover models
with a higher number of components than proposed initially; this is achieved by
allowing components to be split. Relying on the ‘competing’ and ‘eliminating’ na-
ture of the mixture components within the VB framework, SEVB attempts to split
not all components but only those poorly fitted ones at each split opportunity, i.e., after a sta-
ble model has been obtained. The adopted strategy is clearly more computationally
efficient than existing alternatives that attempt to split all components one by one
until the ‘optimal’ model is found. Several criteria were introduced to ensure the
scalability of the algorithm.
Moreover, this new SEVB algorithm has considered, for the first time, splitting com-
ponents into two overlapping subcomponents, one having a much larger variance
than the other; this is in addition to the standard approach of splitting components
into two side-by-side subcomponents. This newly introduced concept is motivated by our appli-
cation data, i.e., individuals’ mobility patterns, which have high probabilities at sev-
eral of their own preferred locations. This is in contrast to the common perception
that good clustering results or mixture models should have clusters or components
that are isolated and clearly separated.
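The two split proposals can be sketched as follows for a one-dimensional component; the equal subcomponent weights and the separation and variance-ratio constants are illustrative assumptions, not the exact proposal densities used by SEVB:

```python
def split_side_by_side(mean, var, weight, sep=1.0):
    # Standard split: two subcomponents placed either side of the parent,
    # each with reduced variance.
    d = sep * var ** 0.5
    return [(weight / 2, mean - d, var / 2), (weight / 2, mean + d, var / 2)]

def split_overlapping(mean, var, weight, ratio=4.0):
    # Overlapping split: both subcomponents share the parent mean, one
    # narrow and one much wider, suited to spiky, heavy-tailed patterns.
    return [(weight / 2, mean, var / ratio), (weight / 2, mean, var * ratio)]

print(split_side_by_side(0.0, 4.0, 1.0))  # [(0.5, -2.0, 2.0), (0.5, 2.0, 2.0)]
print(split_overlapping(0.0, 4.0, 1.0))   # [(0.5, 0.0, 1.0), (0.5, 0.0, 16.0)]
```

After either split, the usual VB ‘competition’ decides whether both subcomponents survive or one is eliminated again.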
The adoption of this new component overlapping philosophy, coupled with the fact
that our data is somewhat discrete (c.f. the location of the user is recorded as the
location of the cell tower where the activity was initialised), implies that classical
approaches to model evaluation or selection are no longer appropriate. An advan-
tage of having Bayesian models is that it is appropriate to assess the results by mea-
suring the goodness-of-fit. This chapter introduced a new adaptable goodness-of-fit
measure, called mean absolute error adjusted for covariance (MAEAC), which aims
to measure the average estimated errors of observations with the use of the Mahalanobis
distance (MD); its effectiveness over other existing measures was demonstrated em-
pirically, though not theoretically.
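MAEAC itself is specific to this thesis, but the Mahalanobis distance underlying it is standard; the sketch below shows how MD discounts error in directions where the component covariance is large:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    # MD(x) = sqrt((x - mean)^T cov^{-1} (x - mean))
    diff = np.asarray(x) - np.asarray(mean)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])
# Equal Euclidean distances, unequal Mahalanobis distances: the component
# is 'wider' along the first axis, so error in that direction counts less.
print(mahalanobis([2.0, 0.0], [0.0, 0.0], cov))  # 1.0
print(mahalanobis([0.0, 2.0], [0.0, 0.0], cov))  # 2.0
```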
Overall, the modelling results for the real world data show that our new SEVB al-
gorithm is more robust, practical and flexible for analysing large numbers of het-
erogeneous and spiky patterns. From the application perspective, to the best of our
knowledge, this is the first piece of research that aims to model an individual’s over-
all mobility pattern with a GMM. Some notable previous attempts have made use of
density-based clustering algorithms such as DBSCAN. However, we demonstrated
in this chapter that such a tactic is ineffective even for identifying only an individual’s
highly visited locations (which was their aim), whereas our SEVB-GMM can provide
an interpretable and effective approximation to an individual’s overall behavioural
pattern.
Chapter 5 made full use of the SEVB-GMM algorithm developed in Chapter 4. It
examined the effectiveness of the algorithm more fully, showing that it can accu-
rately model very complicated bivariate patterns, i.e., individuals’ spatial usage be-
haviour, with considerably lower storage requirements. SEVB’s superior scalability
over other common approaches, such as the expectation-maximisation (EM) algo-
rithm, was also discussed; this is partly the result of SEVB being able to automatically
determine the model complexity. Nonetheless, this application chapter began by
demonstrating, for those aiming to improve marketing abilities, the importance of
distributional understanding. We also illustrated, using the example of two individuals’
outbound voice call duration patterns, that the widely used average-measures
approach can be misleading. This can be somewhat concerning (to
marketers) with respect to existing pricing strategies (c.f. Danaher, 2002), and cus-
tomer valuation (c.f. Blattberg and Deighton, 1996) or churn models; these are often
based on uninformative measures such as averages and thus often do not address
issues such as value at risk (VaR) (Chatfield, 1995).
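The point can be illustrated with synthetic call-duration data (entirely made up for this sketch): two subscribers with near-identical mean durations but radically different distributions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Subscriber A: steady mid-length calls. Subscriber B: mostly very short
# calls mixed with a minority of very long ones. Durations in seconds.
steady = rng.normal(120.0, 10.0, 1000)
spiky = np.concatenate([rng.normal(20.0, 5.0, 900),
                        rng.normal(1020.0, 60.0, 100)])

# The averages are almost indistinguishable...
print(round(float(steady.mean())), round(float(spiky.mean())))
# ...while the upper tails, which drive pricing and value-at-risk
# considerations, differ by an order of magnitude.
print(round(float(np.percentile(steady, 95))),
      round(float(np.percentile(spiky, 95))))
```

A segmentation built on the mean alone would place these two subscribers in the same group despite entirely different usage behaviour.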
The key contribution of this chapter, at least from the viewpoint of the application,
was that it illustrated how the approximated models can be interpreted; without
this step of translating the hidden meanings of the patterns, these new statisti-
cal models would be worthless to profit-oriented organisations. Several ‘signatures’,
corresponding to the characteristics of users’ spatial behaviour, were developed for
profiling the patterns through extensive exploratory data analyses; these signatures
were then utilised for pattern differentiation, i.e., segmenting the behaviours of the sub-
scribers. Empirical analyses in this chapter suggested that these signatures
are meaningful and more stable than the currently common alternative
approach of ordered partitioning of subscribers based on their aggregated voice call
durations and SMS counts. Spatial usage behaviour among customer groups, from
which we might infer richer behavioural descriptions, such as lifestyle or occupa-
tional traits that otherwise cannot be easily or cheaply obtained, turned out, as we
expected, to be highly differentiable and therefore valuable for business strategy for-
mulation.
Finally, in a similar way to Chapter 4, this is, to the best of our knowledge, the first
research that can automatically and meaningfully profile each user’s overall spatial usage behaviour
and differentiate general users based on their actual observed mobil-
ity patterns. These alternative insights can assist businesses in interacting better with
each individual (c.f. customer relationship management (CRM)), including handling
and guiding customer shifts (Flint et al., 1997), and in more effective and informed strategic
and tactical decision making (c.f. decision support systems (DSS)), such as
customer behaviour segmentation based pricing (van Raaij et al., 2003; Jaihak and
Rao, 2003), business management (c.f. business performance management (BPM))
and resource planning (c.f. enterprise resource planning (ERP)), for example.
Chapter 6 continued the discussion of Chapter 5 with respect to the need to apply
a clustering algorithm suitable for high dimensional data; this is because the
number of subscriber behavioural characteristics we are interested in extracting
from the database is likely to increase over time. In this chapter, we designed a new
high dimensional clustering algorithm and showed, for the first time, that it is feasible
to make use of mixture models instead of the commonly adopted histogram/
grid approaches for identifying subspace clusters in high dimensional space; the
concept of a subspace cluster is that some dimensions are considered as noise for
some clusters. Due to the nature of the problem, i.e., the lack of data separation
in high dimensional space, we introduced several new concepts defining
what is ‘similar’. While more research is clearly still required on scalability, on
comparisons with other existing algorithms, and on automatically
selecting appropriate parameter values, empirical results suggested that our straight-
forward, intuitive method appears to be useful, even for identifying subspace clusters
with very low intrinsic dimensionality.
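The notion of a subspace cluster can be illustrated with a toy example; the variance threshold used to flag ‘relevant’ dimensions below is a generic heuristic for illustration only, not the similarity concepts introduced in Chapter 6:

```python
import numpy as np

rng = np.random.default_rng(4)
# A toy subspace cluster: tight in dimensions 0 and 2, pure noise in
# dimensions 1 and 3 (its members are scattered uniformly there).
cluster = np.column_stack([rng.normal(5.0, 0.1, 200),
                           rng.uniform(0.0, 10.0, 200),
                           rng.normal(-3.0, 0.1, 200),
                           rng.uniform(0.0, 10.0, 200)])

# Flag the dimensions in which the cluster is actually compact.
relevant = np.where(cluster.std(axis=0) < 1.0)[0]
print(relevant)
```

Full-dimensional distance measures would dilute the tight structure in dimensions 0 and 2 with the noise in dimensions 1 and 3, which is why subspace-aware notions of similarity are needed.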
Overall, from the application perspective, this research provides businesses with
a more precise understanding of existing customers’ habitual consumption be-
haviour, particularly how products/services are utilised spatially; this is a com-
petitive advantage in a very competitive industry (Wei and Chiu, 2002). In terms of
contribution to statistical methodology, we have developed a new SEVB-GMM algo-
rithm which can automatically determine the model complexity, can explore the
parameter space more thoroughly, and is much more suitable for modelling
heterogeneous and spiky patterns; the first meaningful approaches to profiling and
differentiating each individual’s overall mobility pattern are also presented.
A Review of Research Question
A.1 Telecommunication Industry Research
There is great demand for the use of data mining in the telecommunication indus-
try with a view to improving the business (Weiss, 2005), and many examples have
already been provided in Berry and Linoff (2000, Chapters 11 and 12). However, most
research in this industry appears to concentrate primarily on analysing demand and
churn.
Demand analysis typically focuses on analysing aggregated product/service or
telecommunication traffic volume (Amaral et al., 1995; Levy, 1999; Heitfield and
Levy, 2001; Cox Jr., 2001; Fildes, 2002; Gilbert et al., 2003; Li et al., 2006; Marinucci
and Perez-Amaral, 2005); the objective often being fraud detection (Xing and Giro-
lami, 2007; Gibbons and Matias, 1998) or infrastructure resource management (Cox
Jr. and Popken, 2002).
On the other hand, churn analysis (often approached via survival analysis) is perhaps the
most widely studied aspect of individual users’ behaviour. Its popularity, particularly in the
wireless telecommunication industry (Mozer et al., 2000; Wei and Chiu, 2002; Cox
Jr., 2002; Ahn et al., 2006; Figini et al., 2006; Lemmens and Croux, 2006; Neslin et al.,
2006), can be explained by:
• the relatively high churn rate (Bolton, 1998; Neslin et al., 2006),
• the high cost of customer acquisition,
• the lack of direct customer contact, and
• the significant roles played by the device models (Berry and Linoff, 2000).
However, we believe that analyses of this kind, as done today, are often flawed. Firstly, they
typically focus only on analysing customers’ single product/service churn behaviour
instead of taking a more holistic view. Secondly, besides the need to be cautious
in dealing with truncated/censored observations, they are often based on the com-
mon flawed strategy of ‘zero defections’ (Reichheld and Sasser Jr, 1990) rather than
focusing on strategies for optimising retention values (Blattberg et al., 2000). Finally,
despite being common practice, having advanced models for predicting churn
events is not necessarily more meaningful or valuable than simple models based
on the contract expiry date.
A.2 Customer/Consumer Research
Customers are the most important asset of any business, but today they are more
educated, sophisticated, expectant, demanding and volatile than ever (Yankelovich
and Meer, 2006). This is the result of market competition (Reinartz and Kumar, 2000).
Companies today have typically acknowledged that not all customers are the same
(Peppers et al., 1999; Hallberg, 1995) or equally profitable for the business (Cooper and
Kaplan, 1991; Storbacka, 1997; Niraj et al., 2001); “being willing and able to change
your behaviour toward an individual customer based on what the customer tells you
and what else you know about the customer” (Peppers et al., 1999, p.151) is vital
to business survival and success (Lloyd, 2005; Arnould et al., 2004). Consequently,
organisations today are more willing to focus on customers (Kotler and Armstrong,
2009; Christopher et al., 1991, p.13) and customer relationships (Gummesson, 1999;
Reichheld, 1996, p.24). Similarly, they are now also more willing to move away from
the traditional perspective of the ‘4Ps’ (i.e., product, price, promotion and place) (Bor-
den, 1964; McCarthy, 1978) and the mass or transactional marketing attitude (Dwyer
et al., 1987; Gummesson, 1994; Gronroos, 1994; Payne et al., 1998; Gummesson,
1999), which treats new customers as equals to long-term loyal/profitable customers.
That is, companies today have typically recognised the need to differentiate
customers according to a detailed understanding of their current and future needs,
wants, desires, behaviour, profitability and value to the business, in order to exchange
(i.e., establish, develop, maintain and terminate) appropriate relation-
ships with them (Morgan and Hunt, 1994; Blattberg and Deighton, 1996; Reichheld,
1996; Fournier et al., 1998; Peppers et al., 1999). The importance of customer un-
derstanding has prompted many organisations to invest billions of dollars in con-
structing customer relationship management (CRM) systems; this initiative is based
on the belief and hope that holistic knowledge of the customers can be obtained for
managing each individual customer more effectively, and consequently deliver value
to the business (Rigby et al., 2002; Stone et al., 2004, p.90).
The process of artificially grouping heterogeneous customers or the focused market
based on similar characteristics, needs, preferences and behaviour exhibited, for dis-
tinct marketing propositions (i.e., targeting and positioning), is known as market seg-
mentation (Smith, 1956). The advantages of such theoretical conceptual tactics have
already been well documented and accepted (McDonald and Dunbar, 2004; Wein-
stein, 2004; Wedel and Kamakura, 1998, pp.3-5). Essentially, homogeneous mar-
ket segmentation (Blattberg and Deighton, 1996; Dhar and Glazer, 2003) has been
viewed as the foundation for effective marketing planning and strategy formulation,
and hence:
• provides businesses competitive advantages and better returns as a result of
being able to better serve customers (Egan, 2005, p.214); or
• satisfies their varying needs and wants with either existing or future products/
services (McDonald and Dunbar, 2004; Weinstein, 2004; Wedel and Kamakura,
1998, p.3).
Additionally, market segmentation is also critical for integrated marketing commu-
nication (Duncan, 2005; Shimp, 2007). However, the fast-changing market environ-
ment, along with advancements in information technology in recent years, has made
it possible, and sometimes necessary, for marketers to interact with finer segments
or even segments-of-one (i.e., individuals) (Wedel and Kamakura, 1998, p.4). In this
respect, the subject of market segmentation links closely to the subjects of:
• relationship marketing (also known as one-to-one marketing or customer rela-
tionship management) where there is a stronger emphasis on the value of the
customers (Blattberg and Deighton, 1996; Buttle, 2009, pp.127-136); and
• transactional-oriented database or direct marketing (O’Malley et al., 1999;
Evans et al., 2004; Egan, 2005, p.215).
Nevertheless, segment identification is still a major focus of today’s research (Wedel
and Kamakura, 1998, p.327), and is highly dependent on the bases (i.e., variables
or criteria) researched (Weinstein, 2004, p.19) and the methods employed
(Wedel and Kamakura, 1998, p.5).
A better understanding of customer behaviour is essential for a successful business.
However, customer/consumer research today appears to place limited emphasis on
examining customers’ actual behaviour (Jacoby, 1978; Yankelovich and Meer, 2006);
most behaviour investigations to date are largely limited to:
• the purchasing aspect of the observable product-specific (Wedel and Ka-
makura, 1998, p.10) behaviour (e.g., buying goods such as houses, vehicles, or
plasma televisions) (Alderson, 1957; Jacoby, 1978); or
• the loyalties (e.g., product/service attrition).
That is, customers’ habitual consumption behaviour (e.g., making phone calls, ac-
cessing the Internet and using water or electricity), which is different to the purchas-
ing behaviour and is more relevant to service than retail industries, is typically being
ignored today. This is in spite of the fact that most businesses have understood that
a “successful customer relationship requires a deep understanding of the context in
which our products and services are used in the course of our customer’s day-to-day
lives” (Fournier et al., 1998, p.49). Existing knowledge of individuals’ habitual con-
sumption behaviour appears to be mostly limited to discrete measures (e.g., which services
customers use and the number of institutions they conduct business with), or average
or aggregated measures (e.g., number of transactions per month), which are not neces-
sarily appropriate or meaningful for describing the observed pattern.
Below we briefly discuss selected subjects of customer/consumer research.
We focus particularly on (1) customer management systems, (2) the heterogeneity
of customer behaviour, (3) today’s typical focus of consumer behaviour
research, and (4) segmentation. We conclude with our proposed research aim, that
is, profiling and differentiating individuals’ habitual consumption be-
haviour.
A.2.1 Customer management system
The ideology of relationship marketing, the economics of customer retention (c.f.
zero defections) (Reichheld and Sasser Jr, 1990) and the possibilities afforded by tech-
nological developments (Sheth and Parvatiyar, 1995) have propelled the focus on
customer relationships in mass markets, resulting in the emergence of CRM systems
(Mitussis et al., 2006). Many ambitious businesses have already invested heavily in
CRM, hoping that such a move will ‘automatically’ present them with improved cus-
tomer insights for better business decision making; though they typically focus only
on end customers (Mitussis et al., 2006) and their loyalty (Buttle, 2009). Unfortu-
nately, despite the potential benefits of having CRM (Stone et al., 2004, p.98), more
than half of the much-hyped CRM initiatives were believed to have generated unsat-
isfactory returns (i.e., not being an effective and profitable communication system
with customers) (Gummesson, 1999; Rigby et al., 2002; Weinstein, 2004; Egan, 2005;
Strouse, 2004; Mitussis et al., 2006; Buttle, 2009). This is often the result of
inadequate cultural adjustment from being product focused to customer focused,
or of too much emphasis on the critical information technology infrastructure.
That is, the actual implementation of many ‘real world’ CRM systems has appeared to amount to
building up a massive customer database and displaying standard reports or vague
indicators (i.e., ambiguous and subjective; e.g., market share) (Doyle, 1995;
Mitussis et al., 2006; Egan, 2005, pp.18,219) without clear strategies or sufficient
value-adding analyses (Peppers et al., 1999; Rigby et al., 2002; Cokins, 2004; Rigby
and Ledingham, 2004; Stone et al., 2004; Buttle, 2009, Chapter 1). In other words,
‘real world’ CRM systems have often focused more on the operational aspects
than on analysis; customer insights do not just ‘pop up’, as they require thor-
ough investigation despite the common misperceptions (Little and Marandi, 2003;
Berry and Linoff, 2004; Egan, 2005, p.220). Furthermore, many currently available
analyses can be difficult to adopt in practice without producing misleading or statis-
tically biased findings, as they often need to assume that companies offer only a single
product/service, or hold the customer’s entire business, or they rely on assumptions or infor-
mation that may not be available (e.g., personal income, and the gender or age
of the actual user rather than the account holder) (Reinartz and Kumar, 2000; Verhoef
and Donkers, 2001). Nonetheless, it is encouraging to see that companies have now
rightly shifted their focus to customers and to profitability (c.f. Ac-
tivity Based Management (ABM)) rather than revenue contribution (Babad and
Balachandran, 1993; Foster and Gupta, 1994; Morgan and Hunt, 1994; Cooper and
Kaplan, 1998). This research aims to provide reliable customer insights, utilising al-
ready available behavioural data, that can be easily adopted by wireless telecommuni-
cation providers.
A.2.2 Customer behaviour heterogeneity
The principle that “all customers are not created equal” has already been established
(Hallberg, 1995). The traditional 20/80 rule (also known as Pareto’s Principle or
Pareto’s Law) reinforces this concept by suggesting that the best 20% of customers
are responsible for 80% of revenue; yet in reality the contrast between customer prof-
itability and behaviour is far greater than the rule suggests. In fact, Cooper and
Kaplan (1991) found that, for one manufacturer, only 9% of customers generated as much
as 225% of the total profit. Similarly, the study of Storbacka (1997) in the context of retail
banking shows that half of the customers were not only unprofitable, but
eroded the overall profit by 50%. Further, Niraj et al. (2001) showed that the
loss on some customers can be as high as 252% of the sales revenue.
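The shape of such profitability findings can be sketched with a synthetic ‘whale curve’: ranking customers by profit and accumulating shows the peak cumulative profit exceeding the final total whenever a loss-making tail exists. All figures below are invented for illustration and do not come from the cited studies:

```python
import numpy as np

# 100 customers: a small profitable head, a roughly break-even bulk,
# and an unprofitable tail that erodes the total.
profit = np.concatenate([np.full(9, 250.0),     # highly profitable few
                         np.full(71, 5.0),      # near break-even bulk
                         np.full(20, -60.0)])   # loss-making tail

# Whale curve: cumulative profit with customers ranked best-first.
cum = np.cumsum(np.sort(profit)[::-1])
# The peak exceeds the final total: the best customers earn well over
# 100% of overall profit before the tail erodes it back down.
print(round(float(cum.max() / cum[-1]), 2))  # 1.85
```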
Customer heterogeneity highlights strategic issues and naturally causes challenges
when companies attempt to apply relationship marketing efficiently and effectively
(Hunt, 1997; Eriksson and Mattsson, 2002). In fact, it is often necessary to passively
ignore, or actively deter, customers showing certain behaviour (Hunt, 1997; Gummes-
son, 1999, p.26). However, the unimodal normal distribution, the lognormal distribution, or
even highly aggregated measures such as the average, which make no distributional
assumptions at all, have been applied extensively, deliberately or not, and are gener-
ally not appropriate for describing customers and their behaviour (Schultz, 1995).
On the other hand, extreme usage behaviour is commonly observed and is a
critical insight that should never be treated as outliers (Hinneburg and Keim,
1999; Shaw et al., 2001; Lloyd, 2005). Consequently, this research investigates mod-
els that better deal with uncertainty, so as to incorporate the inherent stochastic nature of
customers’ behaviour (Niraj et al., 2001).
A.2.3 Consumer behaviour research
Cognitive & affective. Consumer behaviour is a complex, multidimensional, dy-
namic process (Belk, 1987; Belk et al., 1988). Recently, (post-modernist) marketers
(Thompson et al., 1989; Holbrook and Hirschman, 1982) have focused on:
• consumers’ information processing & decision processes (i.e., cognitive as-
pects), and
• their experiences (i.e., affective aspects)
in examining both:
• internal influences (e.g., attitude, knowledge, motivations, needs, opinions,
perceptions, personality, and involvement), and
• external influences (e.g., culture, lifestyle, marketing activities, reference
groups, social class, and values)
on consumers (Schiffman and Kanuk, 2004). The cognitive and affective (C&A) ap-
proach (Mitchell, 1983; Rokeach, 1973; Kahle, 1983; Veroff et al., 1981) has been used
extensively by practitioners to explain consumer behaviour theoretically (Peter and
Olson, 1983).
The C&A approaches are typically built on theories such as:
• Maslow’s theory (Maslow, 1954),
• Theory of reasoned action (TRA) (Ajzen and Fishbein, 1980),
• Theory of planned behaviour (TPB) (Ajzen, 1991), and
• Theory of value-attitude-behaviour hierarchy (Homer and Kahle, 1988).
However, these theories, while likely to be true, still have little empirical support (An-
derson Jr. and Golden, 1984; Kahle et al., 1986; Schwartz and Bilsky, 1987; Yalch and
Brunel, 1996; Holt, 1997; Ajzen, 2001; Schiffman and Kanuk, 2004; Rokeach, 1973,
p.122). Still, C&A researchers (Jacoby, 1978; Anderson, 1983; Hirschman, 1986; Gra-
ham, 2005) have argued that empirical testing and prediction are unnecessary, criticising
that “facts do not speak for themselves” (Anderson, 1983, p.28). They, along with Pe-
ter and Olson (1983), believe that an ‘infinite’ number of objective tests,
no matter how large the datasets, still does not guarantee the truth. Moreover, they
also believe that all observations are subject to errors, and that the choice of methodolo-
gies, data and findings is heavily influenced by the researchers. Consequently, C&A
researchers believe this cognitively and socially significant problem of consumer
behaviour should be solved by theoretically driven research.
However, it is important to point out that C&A approaches have often been conducted using a large number of variables for describing consumers' values and lifestyles, which may yield variables that appear to discriminate between consumers but are statistically significant simply by chance; no statistical test results related to VALS (values, attitudes, and lifestyles) categories were reported in Mitchell (1983) or Kahle et al. (1986). Additionally, the currently popular C&A approaches
(including segmentation based on C&A measures):
• may not be suitable for a fast changing (Egan, 2005, p.18) or innovative market
(Strouse, 2004, p.39), and
• require customers’ unobservable information. That is, they require customer
details that are not legally, directly, or dynamically available to the business
(e.g. income, education, culture, attitudes, perceptions, and satisfaction) (Ver-
hoef and Donkers, 2001). Note that many of these attributes (e.g. experience,
lifestyle) are subject to higher uncertainty because they are difficult to measure.
Furthermore, C&A approaches typically rely heavily on external market research, for
which accuracy is a concern (Wolfers and Zitzewitz, 2004; Leigh and Wolfers, 2006;
Arrow et al., 2008). Maintaining the reliability of surveys has become increasingly challenging, partly because of rapidly declining response rates (Bickart and Schmittlein, 1999; Curtin et al., 2005; Robert, 2006); Bickart and Schmittlein (1999) estimated that as few as 5% of adults in the US account for 50% of telephone survey interviews, as a result of personal characteristics. There is also the issue that the surveyed
public comes from a variety of backgrounds meaning that they will interpret survey
questions differently (Grunert and Scherlorn, 1990; Brennan and Hoek, 1992; Kahle
et al., 1992), and there is a distinction between cognitive response and actual deci-
sion making (Claxton et al., 1974), for example.
Behavioural Finally, while C&A approaches may assist companies to understand subjective theoretical hypotheses about why consumers behave the way they do, they are weak in predicting behaviour and hence are of little use in assisting businesses:
• to plan and manage the business (Hudson and Ozanne, 1988),
to attain their ultimate goal of better valuing the lifetime profitability of their customers,
• to better control or change their behaviour, and to better meet their individual
customer needs (Watson, 1913; Skinner, 1974) by implementing effective mar-
keting strategy at the appropriate time (Wicker, 1969; Anderson Jr. and Golden,
1984; Kahle et al., 1986; Schwartz and Bilsky, 1987; Blattberg and Deighton,
1996; Peppers et al., 1999; Dhar and Glazer, 2003; Solomon, 2004; Kumar et al.,
2006; Yankelovich and Meer, 2006).
Consequently, this research instead focuses on understanding each customer's actual behaviour utilizing data already available. This approach (also known as modernism, positivism or behaviourism) has been shown to be useful for predicting customers' behaviour and their potential value to the business (Foster and Gupta, 1994), but is less emphasised today (Yankelovich and Meer, 2006).
A.2.4 Customer/market segmentation
Studies have shown that segmenting customers into homogeneous groups can help companies achieve greater profitability in a quicker and more focused manner (Kotler, 1991; Blattberg and Deighton, 1996; Reichheld, 1996; Dhar and Glazer, 2003). However, homogeneous customer segmentation is not novel in itself (Lloyd, 2005). While researchers and practitioners often debate the best way of segmenting customers, they often fail to realize the need to differentiate customers differently for different purposes, and often use segmentation in ways that depart from its original design (Yankelovich and Meer, 2006). Below we discuss some segmentation techniques commonly applied today.
Psychographical segmentation (Rokeach, 1973; Kahle, 1983; Veroff et al., 1981; Mitchell, 1983), based on the C&A framework discussed above, is now used extensively by marketers; examples include AIO (i.e., activities, interests, and opinions), which focuses on individuals' personality, and VALS2, which is heavily driven by psychology (Cahill, 2006, pp.15,25). In fact, many organizations today have been 'actively looking' for a 'speculative' C&A understanding of their customers' behaviour (Gordon, 1998; Egan, 2005, pp.17-18). While undoubtedly this segmentation approach can be valuable for market positioning, new product concepts, advertising and distribution (Wind, 1978; Belk et al., 1988; Wedel and Kamakura, 1998, pp.15,32), many have questioned its explanatory power and effectiveness (Lastovicka, 1982; Lastovicka et al., 1990; Novak and MacEvoy, 1990; Gordon, 1998; Cahill, 2006; Egan, 2005; Wedel and Kamakura, 1998, p.13), particularly its capability for identifying behavioural factors that will influence a particular brand (Ziff, 1971; Wells, 1975; Dickson, 1982). Moreover, despite its popularity, psychographical segmentation is believed to be "a mostly wasteful diversion from its original and true purpose - discovering customers whose behaviour can be changed or whose needs are not being met", and from informing the company about "which markets to enter or what kinds of offers to make, how products should be taken to market, and how they should be priced" (Yankelovich and Meer, 2006, p.126). That is, the outcome of psychographical segmentation still follows the tradition of last century (e.g., the demographic segmentation approach (Yankelovich and Meer, 2006)), i.e., speaking to average consumers (Arnould et al., 2004, p.159).
Besides psychographical segmentations, customers today are often being parti-
tioned (McDonald and Dunbar, 2004; Duncan, 2005; Buttle, 2009, pp.154-157) based
on:
• demographics or geographics. For example, PRIZM (potential rating index for zip markets) is based on the premise that where people live and whom they live among tells a lot about them (Cahill, 2006, p.19). Demographics, in particular, have
been shown to be important for acquisition for financial service companies
(Kamakura et al., 1991; Rust and Verhoef, 2005). However, this information may not be available for all products/services or industries/markets;
• interactive channel or intervention opportunities such as cross-sell and up-
selling (DeSarbo and Ramaswamy, 1994; Cokins and King, 2004; Rust and Ver-
hoef, 2005);
• lifecycle stages (Dwyer et al., 1987; Christopher et al., 1991; Payne et al., 1998;
Gordon, 1998; Stone et al., 2004, p.98) or relationship status (e.g., satisfaction,
loyalty, and referrals). This is on the premise that satisfied customers are more loyal, and loyal customers are more profitable and refer more (Baldinger
and Rubinson, 1996; Reichheld, 1996; Dowling and Uncles, 1997; Knox, 1998;
Reinartz and Kumar, 2000; Anderson and Mittal, 2000; Kumar et al., 2007);
• benefits sought from the products/services,
• current profit, revenue, or usage contributions (Shapiro et al., 1987; Bult and
Wansbeek, 1995; Bitran and Mondschein, 1996; Zeithaml et al., 2001; van Raaij
et al., 2003; Kotler and Armstrong, 2009). Note that more focus is now placed on
the profitability instead of the revenue contribution (Foster and Gupta, 1994;
Cooper and Kaplan, 1998), or
• lifetime/future/potential values (Niraj et al., 2001; Verhoef and Donkers, 2001).
Note that these measures may be difficult to define (Zeithaml, 2000; Venkate-
san et al., 2007) particularly with changing market definitions/conditions (e.g.,
change of pricing structure), and can be misleading for organisations with multiple products/services (Verhoef and Donkers, 2001).
Additionally, it is important to point out that RFM (recency, frequency, and monetary value) segmentation (Berry and Linoff, 2004) is a popular tactic for segmenting customers based on their usage contribution. However, its use of averages has been shown to be inappropriate (Blattberg and Deighton, 1996), it often focuses only on revenue contribution rather than profitability (Dhar and Glazer, 2003), and it is often misinterpreted by practitioners who look at each measure independently (Stone et al., 2004, pp.40-45).
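The RFM idea can be illustrated with a small hedged sketch: each customer is scored from 1 to 3 on recency, frequency, and monetary value, and the three scores are then read jointly rather than independently (the misinterpretation noted above). The customer records and the simple tercile binning scheme are our own illustrative choices, not drawn from any cited work.

```python
# Hypothetical RFM scoring sketch: rank customers into terciles (1..3,
# 3 = best) on each of recency, frequency, and monetary value.
# All customer records below are invented for illustration.
from datetime import date

customers = {
    "A": {"last_purchase": date(2010, 12, 1), "n_purchases": 30, "spend": 900.0},
    "B": {"last_purchase": date(2010, 6, 15), "n_purchases": 5,  "spend": 120.0},
    "C": {"last_purchase": date(2009, 11, 3), "n_purchases": 1,  "spend": 40.0},
}

def tercile_scores(values, reverse=False):
    """Map each value to a score 1..3; reverse=True for recency,
    where fewer elapsed days should earn a higher score."""
    order = sorted(values, reverse=reverse)
    return {v: 1 + (order.index(v) * 3) // len(values) for v in values}

today = date(2011, 1, 1)
recency = {c: (today - d["last_purchase"]).days for c, d in customers.items()}
r = tercile_scores(list(recency.values()), reverse=True)
f = tercile_scores([d["n_purchases"] for d in customers.values()])
m = tercile_scores([d["spend"] for d in customers.values()])

# Joint (R, F, M) profile per customer, read as one combined signature
rfm = {c: (r[recency[c]], f[customers[c]["n_purchases"]], m[customers[c]["spend"]])
       for c in customers}
```

A practitioner would then treat, say, a (3, 1, 1) customer (recent but infrequent, low spend) very differently from a (1, 3, 3) one, rather than averaging or inspecting the three measures in isolation.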
Notice that, besides segmenting customers based on their average usage contribution, there appears to be little focus on how each user has utilised the product/service, particularly from the viewpoint of habitual consumption rather than purchasing behaviour. This research aims to address this shortfall.
A.3 Review Conclusion & Research Proposal
Customer Behavioural Segmentation
A detailed understanding of the customers and the ability to predict their future be-
haviour as well as their potential value with good confidence is vital to the business
(Kumar et al., 2006), but cannot be achieved by simply grouping or segmenting the customers (Rust and Verhoef, 2005). On the other hand, analysing the little-studied behavioural data, a golden resource and asset conveniently available to all established businesses, has been shown to be potentially more important and effective in achieving this objective (Foster and Gupta, 1994; Wei and Chiu, 2002). Yet, most companies today seem to have a limited understanding of their customers' actual behaviour (Fournier et al., 1998; Yankelovich and Meer, 2006). Furthermore, behavioural approaches have been found to be particularly useful in understanding customers' consumption behaviours that have been performed frequently and have become habitual, involving little intention (Schmittlein and Peterson, 1994; Verplanken et al., 1998; Ouellette and Wood, 1998; Leone et al., 1999; Ajzena and Fishbeinb, 2000; Ajzen, 2001). Still, most existing behavioural studies concentrate only on analysing purchasing behaviour, which has been shown to be very different from consumption behaviour (Alderson, 1957) and is more relevant to retail industries than to service industries.
Consequently, this research aims to fill the gap by analysing customers’ actual ha-
bitual consumption behavioural data. This also means that, while psychographic
measures can provide a richer description and understanding of consumers, and can be involved in the different stages of a consumer's lifecycle and decision making process (Belk et al., 1988), we believe these measures should be utilized to assist the company to better understand the needs of customers or the reasoning behind their behaviour, rather than being 'actively looked' into as is commonly done in the
industry today (Yankelovich and Meer, 2006; Stone et al., 2004, p.114).
Additionally, the benefit of taking a data-driven approach to customer/consumer
behaviour understanding can also be demonstrated by the importance of existing
customers. Existing customers have been shown to be more valuable than new ones, because it is less expensive to create extra value from them (e.g., through up-selling and cross-selling), and the cost of acquiring new customers to replace lost ones is high. This is why customer retention, especially of high value customers (Blattberg and Deighton, 1996), has been described by some as the most important role of relationship marketing, critical to business profitability (Reichheld and Sasser Jr, 1990). Studies have suggested that companies that have retained just 5% more of their existing customers have been able to almost double their profits (Reichheld and Sasser Jr, 1990), as the profits generated by retained customers tend to accelerate over time due to price premiums, cost savings, and revenue (also known as customer share of wallet (SOW)) growth (Reichheld, 1996, p.39).
Moreover, the importance of the existing customers can also be demonstrated from
the viewpoint of referral i.e., the word of mouth (WOM) effects, which have been
found to be highly valued by the potential customers. That is, positive WOM will po-
tentially improve the company’s future profit; and even more importantly, negative
ones could hurt the company’s outlook, particularly when no well defined impres-
sion has been formed by the potential customers (Arndt, 1967; Richins, 1983; Murray,
1991; Herr et al., 1991). Of course, the lack of information on potential new customers also conversely makes the existing customers more valuable (Hwang et al., 2004). For industries such as wireless telecommunication, where customer attrition has been found to be a huge headache for the business, the value of the existing customers is undoubtedly even greater (Wei and Chiu, 2002). Accordingly, our proposed research, i.e., to comprehend as well as differentiate the habitual consumption behaviour of existing customers, should be valuable to businesses.
B Review of Data Stream Mining
B.1 Data Stream & Its Mining Challenges
Data stream (also known as stream data) is a massive or possibly ‘infinite’ volume of
unordered sequential data (Henzinger et al., 1998; Babcock et al., 2002a; Gaber et al.,
2005; Muthukrishnan, 2005; Aggarwal et al., 2007). It is often real time data, or data
generated by a continuous process which grows rapidly at an 'unlimited' rate. Everyday examples of such data include telecommunication, banking, credit card, shopping, and financial market transactions. Examples also include Internet clickstream records (c.f. text or web mining), weather measurements, and sensor network, mobile traffic or security monitoring observations (Babu et al., 2001; Gama and Gaber, 2007). Data stream poses many great challenges for an insightful analysis (Gaber et al., 2007), because of its often low-level, detailed nature (c.f. Cortes et al., 2000; Han and Kamber, 2006, p.468).
Large volumes of data (stream) pose efficiency and scalability challenges (Han and
Kamber, 2006). That is, storing an entire dataset on disk or in memory, and randomly accessing it for analysis, is generally not possible (Dong et al., 2003). 'Traditional' data mining techniques focused on learning from data with bounded memory
(Vitter, 2008), but generally require multiple scans of the data (Wang et al., 2003; Ag-
garwal et al., 2007). As a result, they are not suitable for the data stream environment
where data can generally only be looked at once (at least for the preprocessing step), without advance knowledge such as the size of the data, for example (Babcock et al., 2002a).
In other words, traditional data mining algorithms work on the assumption that
the same data is being analysed throughout the entire process, whereas in the
data stream mining scenarios, data is being continuously updated throughout the
analytical process. Traditional approaches such as those constructing histograms
(Piatetsky-Shapiro and Connell, 1984; Muralikrishna and DeWitt, 1988; Ioannidis
and Poosala, 1995; Poosala et al., 1996; Poosala and Ioannidis, 1997; Jagadish et al.,
1998) are not suitable for the data stream environment as they require superlinear time and space complexity (Vitter and Wang, 1999; Aggarwal and Yu, 2007). Similarly, the popular singular value decomposition (SVD) also requires multiple scans of the data (Littau and Boley, 2006b).
Another feature of data stream is that it may evolve over time (Yang et al., 2005;
Gao et al., 2007). This is known as data stream evolution (Aggarwal, 2003), changes
in data stream (Dong et al., 2003), or concept drift (Wang et al., 2003). This non-
stationary issue is not new to data streams, and is often addressed by real time or incremental methods that continuously update the models when new data arrives
(Wang et al., 2003). The typical approach to address this non-stationary issue is to
utilise a time window or a data weighting scheme. Unfortunately, pattern changes
may often be more critical or informative than pattern snapshots (Dong et al., 2003),
and simply adopting existing single pass (including real time or incremental) mining
algorithms may not be suitable (Aggarwal, 2003). Note that, data stream models are
very similar to real time or incremental models in the sense that decisions need to
be made before all the data are available; they are however, different in what data is
being accessed, and the timing of the required decision (Guha et al., 2003a).
In short, while many traditional algorithms have already been developed to address
the efficiency and scalability challenges posed by large volumes of data (Han and
Kamber, 2006), they are typically not suitable for data stream scenarios; the key additional challenges of data stream mining compared to traditional data mining (Han and Kamber, 2006) are: single pass processing
(at least for preprocessing), random accessing data constraints, and concept drift
(Aggarwal et al., 2007).
B.2 Synopsis Data Structure
Unlike traditional data mining, it is generally acceptable to have approximate solu-
tions in data stream mining (Dong et al., 2003). Many traditional data reduction techniques, such as sampling, data weighting (e.g., exponential decay models (Gilbert et al., 2001; Cohen and Strauss, 2003)), and sliding (time) windows which focus only on part of the stream (Babcock et al., 2002b; Datar et al., 2002; Datar and
Motwani, 2007), in one form or another, are often adopted by techniques specially designed for data stream. So too is the histogram approach, which has been shown to be useful for approximating the distribution of the data (Silverman, 1986) as well as being a basic data analysis and visualisation tool in the data stream environment (Thaper et al., 2002). Load shedding techniques, which deliberately ignore chunks of data, are also sometimes used for data stream mining (Babcock et al., 2007).
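The exponential decay and sliding window weighting schemes mentioned above can be sketched in a few lines. The class names and the half-life parameterisation are our own illustrative choices, not any cited formulation.

```python
# Two simple stream-weighting sketches: an exponentially decayed mean
# (older points fade smoothly) and a sliding-window mean (only the
# last `width` points count). Both process one value per arrival.
from collections import deque

class DecayedMean:
    """Exponentially decayed running mean: a point's weight halves
    every `half_life` arrivals."""
    def __init__(self, half_life):
        self.decay = 0.5 ** (1.0 / half_life)
        self.num = 0.0   # decayed sum of values
        self.den = 0.0   # decayed sum of weights
    def update(self, x):
        self.num = self.num * self.decay + x
        self.den = self.den * self.decay + 1.0
        return self.num / self.den

class WindowMean:
    """Sliding-window mean over the most recent `width` points."""
    def __init__(self, width):
        self.buf = deque(maxlen=width)
    def update(self, x):
        self.buf.append(x)
        return sum(self.buf) / len(self.buf)

# Feed a toy stream whose level shifts from 0 to 1 part-way through
d, w = DecayedMean(half_life=10), WindowMean(width=3)
for x in [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]:
    decayed, windowed = d.update(x), w.update(x)
```

After the shift, the window mean forgets the old level entirely, while the decayed mean drifts towards it, weighting recent points more heavily than the plain average would.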
An extensive number of recent studies have focused on efficiently constructing synopsis data structures (Gibbons and Matias, 1999) that can summarise data with acceptable levels of accuracy while being substantially smaller than their base datasets. Some recent studies focus not only on the time required for construction and the space required for storage, but also on the time required for updating and query responding, and on the working space required (Matias and Urieli, 2005; Muthukrishnan, 2005; Cormode et al., 2006). However, synopses are often designed explicitly for specific applications, and it is as yet unknown how well the different methods compare with one another (Aggarwal and Yu, 2007).
Although many synopsis designs have been shown to have good accuracy, be robust, and be easy to maintain, they mostly focus on the management system (Arasu et al., 2003) or database (Babcock et al., 2002a) viewpoint. For example, some (additive) synopses can serve an important role in self-maintaining views in the database, or (dynamically) provide approximate information even when the base data is not available or is remote (Faloutsos et al., 1997; Thaper et al., 2002; Babcock et al., 2002a). Most of the focus has been on approximating (query) answers (Chakrabarti et al., 2001) and selectivity (join) estimation (i.e., estimating the fraction of records that satisfy a query) (Alon et al., 1999); typical applications include computing aggregates, evaluating differences between data streams, and identifying heavy hitters, item frequencies, and frequent itemsets (Babcock et al., 2002a; Muthukrishnan, 2005; Aggarwal, 2007a).
Random sampling (Acharya et al., 1999; Chaudhuri et al., 1999; Acharya et al., 2000)
and histograms (Ioannidis and Poosala, 1999; Poosala and Ganti, 1999; Ioannidis,
2003; Muthukrishnan and Strauss, 2004) are two popular techniques which have
been frequently utilised either as stand alone synopses or embedded in other syn-
opses (e.g., wavelet-based histograms (Matias et al., 1998; Vitter and Wang, 1999; Ma-
tias et al., 2000)). Histograms, in particular, have been shown to be useful for query optimization (Poosala and Ioannidis, 1997), approximate data warehouse queries (Acharya et al., 1999), and approximate answers for correlated aggregate queries over data streams (Gehrke et al., 2001; Dobra et al., 2002, 2004), for example. Note that an interesting alternative approach (DuMouchel et al., 1999) is to perform data stream analysis on generated pseudo data, reproduced according to a series of statistical moments computed (with the use of sampling) from mutually exclusive groups of actual objects. Below we briefly discuss some key synopses with different problem focuses or challenges.
Input Data Size Is Unknown As discussed previously, one of the key challenges associated with traditional data mining techniques is the need to have prior knowledge of the data size, i.e., the number of observations. Reservoir-based sampling (Vitter, 1985), which maintains a random sample of fixed size, was the first algorithm to break this barrier within this research framework; this is in contrast to classical random sampling, which requires knowing the target data size in advance in order to calculate the sample size and hence obtain the sample. Vitter's (1985) technique has recently been
improved (Gibbons and Matias, 1998; Chaudhuri et al., 1999; Babcock et al., 2002b;
Aggarwal, 2006), and has been shown to be useful for constructing histograms (syn-
opsis) (Gibbons et al., 1997; Chaudhuri et al., 1998) in the data stream environment.
Note that constructing histograms (Agrawal and Swami, 1995; Alsabti et al., 1997;
Manku et al., 1998) in the data stream environment is closely related to estimating
quantiles without data being completely available (Aggarwal and Yu, 2007).
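The reservoir idea can be sketched in a few lines. This is a minimal illustrative version of the classic algorithm, not Vitter's optimised variant (which skips ahead in the stream to reduce the number of random draws); the function name and seed handling are our own.

```python
# Minimal reservoir sampling sketch: maintain a uniform random sample
# of fixed size k from a stream whose total length is unknown upfront.
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            # the (i+1)-th item replaces a random slot with prob k/(i+1),
            # which keeps every item's inclusion probability uniform
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), 10)
```

Only the k retained items and a counter are kept in memory, so the sample is ready at any point in the stream without knowing how many records remain.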
Recent advancements mean that approaches can now estimate quantiles (Manku et al., 1999; Greenwald and Khanna, 2001) and construct near optimal histograms (synopses) (Guha et al., 2001, 2002; Guha and Koudas, 2002; Guha et al., 2006) in a (working) space efficient manner (Guha, 2005) and in a single pass fashion, without the need for advance knowledge of the input data size.
More Accurate and Space Efficient Synopses A large portion of the recent liter-
ature also focuses on improving data representation accuracy, and is often based
on transformation (Lee et al., 1999). (Discrete) wavelet-based transformations, in
particular, have been frequently used within the synopsis design (Chakrabarti et al.,
2001; Gilbert et al., 2003). They have been shown to be better than their transforma-
tion alternatives (Barbara et al., 1997; Peng and Chu, 2004) (as well as sampling and
the better class of histograms (Chakrabarti et al., 2001)) in being able to:
• represent data in multiple resolution (Matias et al., 1998), and
• approximate sparse and/or skewed data (Barbara et al., 1997) with only the most significant wavelet coefficients (i.e., space efficient).
However, quality wavelets other than those minimising Euclidean errors cannot be easily obtained (Karras and Mamoulis, 2005; Guha, 2005; Guha and Harb, 2005). The selection of quality wavelets has been shown to depend on the query workloads (Matias and Urieli, 2005; Muthukrishnan, 2005), and tracking wavelet changes throughout the process can have complicated effects (Matias et al., 2000; Guha et al., 2004b).
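The "most significant coefficients" idea behind wavelet synopses can be illustrated with a toy Haar construction of our own: transform the series, zero all but the B largest-magnitude coefficients, and reconstruct. This is a hedged sketch assuming the series length is a power of two, not any specific cited algorithm.

```python
# Toy Haar wavelet synopsis: keep only the B largest-magnitude
# coefficients. Assumes len(data) is a power of two.

def haar(data):
    """Unnormalised Haar decomposition: pairwise averages and
    half-differences, applied recursively to the averages."""
    coeffs = []
    while len(data) > 1:
        avgs = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]
        coeffs = diffs + coeffs
        data = avgs
    return data + coeffs    # [overall average, coarse..fine details]

def inverse_haar(coeffs):
    """Exact inverse of haar(): rebuild by expanding each level."""
    data, detail = coeffs[:1], coeffs[1:]
    while detail:
        n = len(data)
        d, detail = detail[:n], detail[n:]
        data = [v for a, c in zip(data, d) for v in (a + c, a - c)]
    return data

def synopsis(data, B):
    """Zero all but the B largest-magnitude Haar coefficients."""
    c = haar(data)
    keep = set(sorted(range(len(c)), key=lambda i: abs(c[i]),
                      reverse=True)[:B])
    return [v if i in keep else 0.0 for i, v in enumerate(c)]
```

Reconstructing from the truncated coefficients gives a lossy but compact approximation; smooth or skewed series concentrate their energy in few coefficients, which is why the truncation is space efficient.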
Randomised projection techniques (many of which are based on the use of wavelets) (Gilbert et al., 2002, 2003) have been shown to be able to construct synopses that are even more space efficient than the wavelet approach. They typically work in polylog space with respect to the base data because that is the minimum requirement for database indexing (Muthukrishnan, 2005). However, synopses based on randomised projection are generally much more difficult to interpret (than wavelets, for example). Note that an extensive amount of research has been done on these randomised projection techniques (Flajolet and Martin, 1983; Alon et al., 1996; Feigenbaum et al., 1999; Indyk, 2000) in recent years (Babcock et al., 2002a; Muthukrishnan, 2005; Aggarwal, 2007a).
Synopses with Guaranteed Bounds Random sampling is generally considered easy, efficient, and widely applicable. However, many critics argue that it is:
• not suitable for evaluating infrequent patterns (Aggarwal and Yu, 2007; Gaber et al., 2007), and
• difficult to determine whether a truly representative sample has been drawn from the dataset (Littau and Boley, 2006a).
Nonetheless, one of the key advantages of random sampling is its ability to provide unbiased data estimates with probabilistic error bounds (Haas, 1997). This is in contrast to most other synopses (Aggarwal and Yu, 2007), for which it is difficult to find error bounds.
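The probabilistic error bound that makes sampling attractive can be illustrated with a hedged sketch: a normal-approximation 95% confidence interval for a population mean, computed from the sample alone. The synthetic data and parameter values are purely illustrative.

```python
# Sketch of a probabilistic error bound from a random sample: a 95%
# normal-approximation confidence interval for the population mean.
import math
import random

rng = random.Random(42)
# Synthetic "population" draws: true mean 10, standard deviation 2
sample = [rng.gauss(10.0, 2.0) for _ in range(400)]

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
half_width = 1.96 * math.sqrt(var / n)                  # 95% normal bound

interval = (mean - half_width, mean + half_width)
```

The half-width shrinks at rate 1/sqrt(n), so the analyst can state, before seeing the full stream, how large a sample is needed for a desired accuracy; most deterministic synopses offer no comparable statement.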
Consequently, some recent studies (Matias et al., 1998; Manku et al., 1999; Indyk
et al., 2000; Gehrke et al., 2001; Garofalakis and Gibbons, 2002; Guha et al., 2004b;
Guha and Harb, 2005) have focused on being able to provide approximation guar-
antees. Many of them now focus on minimising relative errors, or minimising maximum absolute or maximum relative errors (Garofalakis and Kumar, 2004; Karras and Mamoulis, 2005); they do this instead of minimising the less appropriate absolute errors or overall root mean squared errors, which have been shown to result in poor quality data representations.
Multi-dimensional Extension Apart from sampling (Barbara et al., 1997; Aggarwal and Yu, 2007), which retains the same dimensional representation as the original data, most other approximation techniques do not work well, or have only very limited success, with higher dimensional data (i.e., more than four or five dimensions) (Vitter et al., 1998; Vitter and Wang, 1999; Gilbert et al., 2003). However, despite the fact that attributes are often wrongly assumed to be independent, such approaches (i.e., one-dimensional synopses) are often still adopted (in commercial software) (Poosala and Ioannidis, 1997; Matias et al., 2000). Some recent research has focused on, for example:
• constructing multi-dimensional histograms (Poosala and Ioannidis, 1997; Ma-
tias et al., 1998; Vitter et al., 1998; Vitter and Wang, 1999; Aboulnaga and Chaud-
huri, 1999; Muthukrishnan et al., 1999; Gunopulos et al., 2000; Chakrabarti
et al., 2001; Wu et al., 2001; Thaper et al., 2002),
• extended wavelets (Stollnitz et al., 1996; Deligiannakis and Roussopoulos,
2003; Guha et al., 2004a), and
• multi-dimensional synopses based on the randomised projection technique
(Cormode et al., 2006).
However, challenges remain in constructing optimal data summaries in a timely and
space efficient manner.
Temporal Extension While some synopses (e.g., wavelet synopses, and synopses based on the randomised projection technique) are believed to be easily extendable to a temporal representation of the data stream (with a much larger space requirement), they are seldom used for this purpose (Aggarwal and Yu, 2007). The exceptions include:
• an offline approach by Indyk et al. (2000) suitable for analysing universal
trends, and
• the work of Thaper et al. (2002) that can be used for tracking or comparing the
distribution of data streams temporally,
for example. That is, most existing synopsis constructions concentrate on efficiently providing one static or continuously updated summary data structure (Thaper et al., 2002). However, such continuously updated approaches still amount to reapplying the traditional algorithms every time new data arrives (Thaper et al., 2002; Aggarwal et al., 2003). That is, while maintaining the latest summary information may resolve the data evolution issue, a proper understanding of the evolving behaviour, or of seasonality or periodic insights, for example, cannot be achieved. Like the multi-dimensional extension, the temporal extension has many challenges ahead.
B.3 Review Conclusion
An extensive number of recent studies have focused on obtaining accurate and robust data summary structures through efficient single pass processing within a limited space framework (Gibbons and Matias, 1999). However, most of this research comes from research laboratories within companies such as AT&T, Bell, Google, IBM, and Microsoft (as well as selected academics and universities around the world) (Muthukrishnan, 2005), with management systems or database applications being their primary focus. Additionally, these structures are typically based on the concept of averages (Aggarwal et al., 2003), or in a format from which distribution and/or seasonality/periodicity information cannot be easily obtained (Littau and Boley, 2006b). Consequently, these approaches, despite their efficiency, do not appear to be appropriate for extracting customer behavioural characteristics, or for customer analytics in general; in these settings data is likely to be sparse and/or skewed, or to comprise a mixture of distributions with seasonal/periodic patterns. This is despite the fact that our research data is in the form of a data stream.
In this research, we adopt the variational Bayesian (VB) method to extract customer behaviour characteristics more formally and naturally. While VB is not a single pass algorithm (as defined by data stream mining research), it can provide a more sophisticated statistical understanding which otherwise cannot be obtained from data stream models (c.f. Muthukrishnan, 2005). Alternatively, some recent studies (Zhou et al., 2003; Heinz and Seeger, 2006, 2007, 2008) have successfully and efficiently used one-dimensional kernel density estimation in the stream environment; such nonparametric approaches could be useful to our research if our goal were simply to approximate various customers' behaviour. Note that, as in the case of data stream mining, more work is still needed to extend VB to the multi-dimensional and temporal cases.
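As a hedged illustration of the nonparametric alternative just mentioned, a basic one-dimensional Gaussian kernel density estimator can be written in a few lines. The streaming variants cited above additionally compress the sample, which this toy version does not, and the call-time data below are invented.

```python
# One-dimensional Gaussian kernel density estimation: the density at x
# is the average of Gaussian bumps centred on the sample points.
import math

def kde(sample, bandwidth):
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in sample)
    return density

# Invented example: one customer's call start times (hours of day),
# clustered around 9am and 6pm
calls = [8.5, 9.0, 9.2, 9.4, 17.8, 18.0, 18.3]
f = kde(calls, bandwidth=0.5)
```

Evaluating `f` over a grid would reveal the two usage peaks without assuming any parametric form, which is exactly the sense in which such estimators "approximate" a customer's behaviour.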
C Review of Clustering Time Series & Data Stream
C.1 Time Series Representation & Clustering
One of the logical choices for analysing customer behaviour longitudinally is to
group time series that are correlated, or have similar patterns or similar fitted models
(Lin et al., 2004; Ratanamahatana et al., 2005; Wang et al., 2006), for example. How-
ever, simply examining the similarity among subsequences (as is often done) instead
of the entire series can be misleading (Keogh et al., 2003).
Many data approximation techniques have already been studied for reducing data dimensionality, which is critical for time series clustering (Gavrilov et al., 2000; Keogh and Kasetty, 2003; Ding et al., 2008) and for its closely related problems of similarity search
and indexing (Agrawal et al., 1995; Yi and Faloutsos, 2000; Keogh et al., 2001;
Chakrabarti et al., 2002). Common representations that have been proposed to date
include:
• statistical models (Xiong and Yeung, 2004),
• spectral transformations (Agrawal et al., 1993; Faloutsos et al., 1994),
• dynamic time warping (DTW) (Berndt and Clifford, 1994),
• wavelets (Chan and Fu, 1999),
• singular value decomposition (SVD),
• piecewise polynomial models (Yi and Faloutsos, 2000; Chakrabarti et al., 2002),
and
• symbolic models (Lin et al., 2003),
for example; improvements have recently been made in:
• reducing local reconstruction error, and improving accuracy, efficiency, scalability, and space usage (Keogh et al., 2001; Chakrabarti et al., 2002),
• making representations better suited to series that are out of phase, or that have missing values or different lengths (Xiong and Yeung, 2004; Keogh and Ratanamahatana, 2005),
• making representations better suited to the data stream environment (Palpanas et al., 2004;
200 Appendix C. Review of Clustering Time Series & Data Stream
Yankov et al., 2007),
• extending to multiple dimensional series (Vlachos et al., 2005), and
• moving towards parametric free data mining (Keogh et al., 2004),
for example.
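For instance, the DTW measure listed above can be computed with a simple dynamic program (a minimal sketch without the usual warping-window constraints and lower-bounding optimisations used in practice; function and variable names are ours):

```python
# Minimal dynamic time warping (DTW) distance, cf. Berndt and Clifford (1994).
def dtw_distance(x, y):
    """Return the DTW distance between two numeric sequences."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])          # local distance
            # extend the cheapest of the three admissible warping steps
            cost[i][j] = d + min(cost[i - 1][j],      # stretch x
                                 cost[i][j - 1],      # stretch y
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Because the warping path may stretch either series, sequences that are locally out of phase (e.g., `[1, 1, 2, 3]` versus `[1, 2, 3]`) can still obtain distance zero.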
Symbolic representations (Lin et al., 2003, 2007), in particular, have recently been
shown to be very promising for representing massive amounts of time series in clus-
tering (Lin et al., 2004), indexing and mining (Shieh and Keogh, 2008), detecting un-
usual patterns (Keogh et al., 2005; Wei et al., 2006; Yankov et al., 2007), and visuali-
sation (Kumar et al., 2005), for example. On the other hand, many statistical models
such as hidden Markov models (HMM), Markov models (Ge and Smyth, 2000), and
autoregressive moving average (ARMA) models (Kalpakis et al., 2001; Xiong and Ye-
ung, 2002) have been shown to perform unfavourably in comparison (Keogh and
Kasetty, 2003; Ratanamahatana et al., 2005; Wang et al., 2006). This is perhaps not
unexpected, since these statistical models are often based on assumptions such as
stationarity, normality, and independent residuals, and they often do not take trends
into consideration or cannot be fitted easily (Chatfield, 1995).
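A SAX-style symbolic representation along the lines of Lin et al. (2003) can be sketched as follows (the breakpoints shown are the standard-normal quartiles for an alphabet of size four, the series length is assumed divisible by the number of segments, and all names are ours, not the cited authors’ code):

```python
# Sketch of a SAX-style symbolic representation: z-normalise, reduce with
# piecewise aggregate approximation (PAA), then map segment means to symbols.
import statistics

BREAKPOINTS = [-0.6745, 0.0, 0.6745]  # quartiles of N(0, 1)
ALPHABET = "abcd"

def sax(series, n_segments):
    """Return a symbolic word of length n_segments for the series."""
    mu = statistics.mean(series)
    sigma = statistics.pstdev(series) or 1.0   # guard constant series
    z = [(v - mu) / sigma for v in series]     # z-normalise
    seg = len(z) // n_segments                 # points per segment
    word = []
    for k in range(n_segments):
        m = statistics.mean(z[k * seg:(k + 1) * seg])  # PAA segment mean
        idx = sum(m > b for b in BREAKPOINTS)          # digitise the mean
        word.append(ALPHABET[idx])
    return "".join(word)
```

A steadily rising series thus maps to a monotone word such as "abcd", and massive collections of series can be compared cheaply as short strings.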
C.2 Clustering on Extracted Time Series Characteristics
However, ‘traditional’ time series clustering algorithms (i.e., those algorithms group-
ing time series that are correlated, or have similar patterns or fitted models) often
do not take the temporal aspect of the data into consideration. That is, they often
analyse series as sequences (i.e., irrespective of time) and thus are not necessarily
appropriate for all applications, as they generally do not incorporate seasonal or
periodic insights (Ghosh and Strehl, 2004). Additionally, from a business application
viewpoint, these clusters (or customer groups) are not meaningful without
identifying their (longitudinal) characteristics.
Alternatively, one can take a different approach to this problem; that is, to extract
the (longitudinal) characteristics of each series (e.g., trend, seasonality, serial corre-
lation, skewness, and signal-to-noise ratio (SNR) for expressing the fluctuations of the
series) (Armstrong, 2001; Last et al., 2001; Nanopoulos et al., 2001) prior to perform-
ing any non-time series clustering (Wang et al., 2006). This strategy (i.e., clustering
on extracted time series characteristics) has been shown to be robust with good ac-
curacy (Wang et al., 2006) and can incorporate the temporal aspect of the data more
appropriately. We believe this approach is more suitable to our research.
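To make the extract-then-cluster strategy concrete, a toy feature extractor for three of the characteristics mentioned above (trend, serial correlation, and skewness) might look as follows; the resulting feature vectors, rather than the raw series, would then be passed to an ordinary clustering algorithm. The feature definitions here are illustrative, not those of the cited authors:

```python
import statistics

def features(series):
    """Map one time series to a small vector of global characteristics."""
    n = len(series)
    mu = statistics.mean(series)
    sd = statistics.pstdev(series) or 1.0
    t_mean = (n - 1) / 2
    # least-squares slope against time, as a crude trend measure
    slope = (sum((t - t_mean) * (v - mu) for t, v in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    # lag-1 autocorrelation, as a crude serial-correlation measure
    acf1 = sum((series[t] - mu) * (series[t + 1] - mu)
               for t in range(n - 1)) / (n * sd ** 2)
    # standardised third moment, as a skewness measure
    skew = sum(((v - mu) / sd) ** 3 for v in series) / n
    return (slope, acf1, skew)
```

Each customer’s series collapses to one short vector, so any standard (non-time-series) algorithm such as k-means can group the customers while the temporal information survives in the features themselves.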
Unfortunately, in practice, many time series are not necessarily stationary over
time, and may involve interventions (e.g., step functions) (Chatfield, 1995), mak-
ing feature extraction from the series challenging. While there have been (mostly
one-dimensional) techniques proposed for quantifying changes (Ganti et al.,
1999b, 2002), detecting changes (Krishnamurthy et al., 2003; Zeira et al., 2004; Kifer
et al., 2004; Schweller et al., 2004), and diagnosing changes (Aggarwal, 2003; Dasu
et al., 2006), even in the data stream environment, it is uncertain how these inter-
ventions should be incorporated into the clustering process for applications such as
ours. Below we briefly discuss some recent literature which is closely related to time
series clustering.
C.3 Data Stream Clustering
Recent studies (O’Callaghan et al., 2002; Guha et al., 2003b; Charikar et al., 2003; Bab-
cock et al., 2003; Aggarwal et al., 2003) have shown that k-centroid clustering algo-
rithms can be approximated efficiently in the data stream environment (i.e., where deci-
sions must be made before all the data are available and the data can only be read once;
cf. Appendix B). However, while some algorithms based on a micro-macro clustering
strategy (e.g., Aggarwal et al., 2003; Cao et al., 2006) have been shown
to be able to obtain better clusters, to provide multiple time granularity informa-
tion, and to provide means for tracing objects or clusters temporally, they typically
function from the viewpoint of sequences rather than time series. In other words,
as in the case of traditional time series clustering algorithms, they typically focus on
analysing data independent of time.
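A minimal single-pass k-centroid sketch conveys the flavour of these streaming algorithms (heavily simplified: the cited algorithms maintain careful cluster summaries under memory bounds, whereas this toy version only keeps running means; class and method names are ours):

```python
# Toy one-pass k-centroid clustering for one-dimensional stream values.
class StreamKMeans:
    def __init__(self, k):
        self.k = k
        self.centroids = []  # running means, one per cluster
        self.counts = []     # points absorbed per cluster

    def add(self, x):
        """Fold one stream value into the model; each point is read once."""
        if len(self.centroids) < self.k:   # seed with the first k points
            self.centroids.append(float(x))
            self.counts.append(1)
            return
        # assign to the nearest centroid, then shift it towards the point
        i = min(range(self.k), key=lambda j: abs(x - self.centroids[j]))
        self.counts[i] += 1
        self.centroids[i] += (x - self.centroids[i]) / self.counts[i]
```

Note that the incremental mean update never revisits earlier points, which is exactly the one-read constraint of the stream setting; it also illustrates why such algorithms analyse the data independent of time, since the arrival order of points does not enter the model beyond the seeding.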
A rare exception among data stream clustering algorithms is HPStream (Aggarwal et al., 2004), which
can identify subspace clusters in the high-dimensional data stream environment.
Its subspace notion can improve on the common but problematic approach of clus-
tering equal-length time series, which treats each time point as one dimension (cf.
Xiong and Yeung, 2004) and thus faces the serious issue of the curse of dimensionality. How-
ever, HPStream still focuses on analysing sequences.
Note that analysing sequences can still be very useful, and it has been applied
quite frequently. For example, to combat the issue of data evolution (i.e.,
patterns changing over time), many stream mining algorithms have been designed to:
• continuously update statistical information (e.g., correlation among multiple
series) (Yi et al., 2000; Guha et al., 2003a; Sakurai et al., 2005),
• continuously update models (Domingos and Hulten, 2000, 2001; Hulten et al.,
2001), or
• focus on monitoring the data stream (Ganti et al., 2001; Zhu and Shasha, 2002;
Wang et al., 2002b; Zhu and Shasha, 2003; Kleinberg, 2003; Papadimitriou et al.,
2007).
Similarly, most temporal and spatial-temporal algorithms and applications (e.g.,
in earth science, epidemiology, ecology, and climatology) (Last et al., 2001; Li et al.,
2004; Aref et al., 2004; Han and Kamber, 2006; Huang et al., 2008; Hsu et al., 2008)
typically only focus on:
• analysing or monitoring sequential patterns (Agrawal and Psaila, 1995). Note
that scan statistics (Neill et al., 2005), for example, can be utilised to adjust
the compared patterns for seasonality; or
• mining time independent transaction association rules and frequent patterns
(Agrawal et al., 1993; Agrawal and Srikant, 1994; Agrawal et al., 1995; Srikant
and Agrawal, 1996; Han et al., 1999). However, many do this by first converting
data unnaturally into a sequence of events (Tan et al., 2001; Perlman and Java,
2003; Mamoulis et al., 2004).
Nonetheless, algorithms originating from the data stream mining viewpoint, like
traditional time series clustering algorithms, generally do not appear
appropriate for our application.
C.4 Review Conclusion
In summary, we believe the most appropriate approach to analyse customer be-
haviour longitudinally, or spatially for that matter, is to first extract each customer’s
overall behaviour characteristics and then cluster the customers based on their ex-
tracted features. Obviously, the extracted features need to be representative and
stable. In Chapter 5, we investigate how to profile each customer’s spatial behaviour
meaningfully, and evaluate the usefulness of segmenting customers based on these
extracted characteristics. However, when the number of customer attributes to be
considered is large (as is typically the case), it can be problematic to group customers
based on the typically applied classic clustering algorithms such as k-means and hi-
erarchical algorithms (cf. Wang et al., 2006). This is a result of the curse of dimension-
ality. In Chapter 6, we investigate high-dimensional data clustering.
Bibliography
Aboulnaga, A., Chaudhuri, S., 1999. Self-tuning histograms: building histograms
without looking at data. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S. (Eds.),
Proceedings of the 1999 ACM SIGMOD International Conference on Management
of Data. ACM, Philadelphia, PA, pp. 181–192.
Acharya, S., Gibbons, P. B., Poosala, V., 2000. Congressional samples for approximate
answering of group-by queries. In: Chen, W., Naughton, J., Bernstein, P. (Eds.),
Proceedings of the 2000 ACM SIGMOD International Conference on Management
of Data. ACM, Dallas, TX, pp. 487–498.
Acharya, S., Gibbons, P. B., Poosala, V., Ramaswamy, S., 1999. Join synopses for
approximate query answering. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S.
(Eds.), Proceedings of the 1999 ACM SIGMOD International Conference on Man-
agement of Data. ACM, Philadelphia, PA, pp. 275–286.
Achtert, E., Bohm, C., David, J., Kroger, P., Zimek, A., 2008. Robust clustering in ar-
bitrarily oriented subspaces. In: Proceedings of the 2008 SIAM International Con-
ference on Data Mining. SIAM, Atlanta, GA, pp. 763–774.
Achtert, E., Bohm, C., Kriegel, H.-P., Kroger, P., Muller-Gorman, I., Zimek, A., 2007a.
Detection and visualization of subspace cluster hierarchies. In: Ramamohanarao,
K., Krishna, P. R., Mohania, M. K., Nantajeewarawat, E. (Eds.), Proceedings of the
12th International Conference on Database Systems for Advanced Applications.
Springer, Bangkok, Thailand, pp. 152–163.
Achtert, E., Bohm, C., Kriegel, H.-P., Kroger, P., Zimek, A., 2007b. On exploring com-
plex relationships of correlation clusters. In: Proceedings of the 19th International
Conference on Scientific and Statistical Database Management. IEEE, Banff, AB,
Canada, pp. 7–16.
Achtert, E., Bohm, C., Kriegel, H.-P., Kroger, P., Zimek, A., 2007c. Robust, complete,
and efficient correlation clustering. In: Proceedings of the 2007 SIAM International
Conference on Data Mining. SIAM, Minneapolis, MN, pp. 413–418.
Achtert, E., Bohm, C., Kroger, P., Zimek, A., 2006. Mining hierarchies of correlation
clusters. In: Proceedings of the 18th International Conference on Scientific and
Statistical Database Management. IEEE, Vienna, Austria, pp. 119–128.
Agarwal, D., McGregor, A., Phillips, J. M., Venkatasubramanian, S., Zhu, Z., 2006.
Spatial scan statistics: approximations and performance study. In: Eliassi-Rad,
T., Ungar, L. H., Craven, M., Gunopulos, D. (Eds.), Proceedings of the Twelfth
ACM SIGKDD International Conference on Knowledge Discovery and Data Min-
ing. ACM, Philadelphia, PA, pp. 24–33.
Agarwal, P. K., Mustafa, N. H., 2004. k-means projective clustering. In: Deutsch,
A. (Ed.), Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Sympo-
sium on Principles of Database Systems. ACM, Paris, France, pp. 155–165.
Agarwal, S., Lim, J., Zelnik-Manor, L., Perona, P., Kriegman, D. J., Belongie, S., 2005.
Beyond pairwise clustering. In: Proceedings of the 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition. Vol. 2. IEEE, San Diego,
CA, pp. 838–845.
Aggarwal, C. C., 2003. A framework for diagnosing changes in evolving data streams.
In: Halevy, A. Y., Ives, Z. G., Doan, A. (Eds.), Proceedings of the 2003 ACM SIGMOD
International Conference on Management of Data. ACM, San Diego, CA, pp. 575–
586.
Aggarwal, C. C., 2006. On biased reservoir sampling in the presence of stream evolu-
tion. In: Dayal, U., Whang, K.-Y., Lomet, D. B., Alonso, G., Lohman, G. M., Kersten,
M. L., Cha, S. K., Kim, Y.-K. (Eds.), Proceedings of the 32nd International Confer-
ence on Very Large Data Bases. ACM, Seoul, Korea, pp. 607–618.
Aggarwal, C. C., 2007a. Data Streams: Models and Algorithms. Advances in Database
Systems. Springer, New York.
Aggarwal, C. C., 2007b. An introduction to data streams. In: Aggarwal, C. C. (Ed.),
Data Streams: Models and Algorithms. Advances in Database Systems. Springer,
New York.
Aggarwal, C. C., Han, J., Wang, J., Yu, P. S., 2003. A framework for clustering evolv-
ing data streams. In: Freytag, J. C., Lockemann, P. C., Abiteboul, S., Carey, M. J.,
Selinger, P. G., Heuer, A. (Eds.), Proceedings of the 29th International Conference
on Very Large Data Bases. Morgan Kaufmann, Berlin, Germany, pp. 81–92.
Aggarwal, C. C., Han, J., Wang, J., Yu, P. S., 2004. A framework for projected clustering
of high dimensional data streams. In: Nascimento, M. A., Ozsu, M. T., Kossmann,
D., Miller, R. J., Blakeley, J. A., Schiefer, K. B. (Eds.), Proceedings of the Thirtieth
International Conference on Very Large Data Bases. Morgan Kaufmann, Toronto,
ON, Canada, pp. 852–863.
Aggarwal, C. C., Han, J., Wang, J., Yu, P. S., 2007. On clustering massive data streams:
a summarization paradigm. In: Aggarwal, C. C. (Ed.), Data Streams: Models and
Algorithms. Advances in Database Systems. Springer, New York.
Aggarwal, C. C., Hinneburg, A., Keim, D. A., 2001. On the surprising behavior of dis-
tance metrics in high dimensional spaces. In: Van den Bussche, J., Vianu, V. (Eds.),
Proceedings of the 8th International Conference on Database Theory. Vol. 1973.
Springer, London, pp. 420–434.
Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., Park, J. S., 1999. Fast algorithms
for projected clustering. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S. (Eds.),
Proceedings of the 1999 ACM SIGMOD International Conference on Management
of Data. ACM, Philadelphia, PA, pp. 61–72.
Aggarwal, C. C., Yu, P. S., 2000. Finding generalized projected clusters in high dimen-
sional spaces. In: Chen, W., Naughton, J. F., Bernstein, P. A. (Eds.), Proceedings of
the 2000 ACM SIGMOD International Conference on Management of Data. ACM,
Dallas, TX, pp. 70–81.
Aggarwal, C. C., Yu, P. S., 2001. Outlier detection for high dimensional data. In: Aref,
W. G. (Ed.), Proceedings of the 2001 ACM SIGMOD International Conference on
Management of Data. ACM, Santa Barbara, CA, pp. 37–46.
Aggarwal, C. C., Yu, P. S., 2007. A survey of synopsis construction in data streams. In:
Aggarwal, C. C. (Ed.), Data Streams: Models and Algorithms. Advances in Database
Systems. Springer, New York.
Agrawal, R., Faloutsos, C., Swami, A., 1993. Efficient similarity search in sequence
databases. In: Lomet, D. B. (Ed.), Proceedings of the 4th International Conference
of Foundations of Data Organization and Algorithms. Springer, Chicago, IL, pp.
69–84.
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic subspace clus-
tering of high dimensional data for data mining applications. In: Haas, L. M., Ti-
wary, A. (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference
on Management of Data. ACM, Seattle, WA, pp. 94–105.
Agrawal, R., Lin, K.-I., Sawhney, H. S., Shim, K., 1995. Fast similarity search in the
presence of noise, scaling, and translation in time-series databases. In: Dayal, U.,
Gray, P. M. D., Nishio, S. (Eds.), Proceedings of the 21st International Conference
on Very Large Data Bases. Morgan Kaufmann, Zurich, Switzerland, pp. 490–501.
Agrawal, R., Psaila, G., 1995. Active data mining. In: Fayyad, U. M., Uthurusamy, R.
(Eds.), Proceedings of the First International Conference on Knowledge Discovery
and Data Mining. AAAI, Montreal, QC, Canada, pp. 3–8.
Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Bocca,
J. B., Jarke, M., Zaniolo, C. (Eds.), Proceedings of the 20th International Conference
on Very Large Data Bases. Morgan Kaufmann, Santiago de Chile, Chile, pp. 487–
499.
Agrawal, R., Swami, A. N., 1995. A one-pass space-efficient algorithm for finding
quantiles. In: Chaudhuri, S., Deshpande, A., Krishnamurthy, R. (Eds.), Proceed-
ings of the Seventh International Conference on Management of Data. McGraw-
Hill, Pune, India.
Ahn, J.-H., Han, S.-P., Yung-Seop, L., 2006. Customer churn analysis: churn determi-
nants and mediation effects of partial defection in the Korean mobile telecommu-
nications service industry. Telecommunications Policy 30 (10-11), 552–568.
Airoldi, E. M., Blei, D. M., Fienberg, S. E., Xing, E. P., 2008. Mixed membership
stochastic blockmodels. The Journal of Machine Learning Research 9 (Sep), 1981–
2014.
Aitkin, M., Rubin, D. B., 1985. Estimation and hypothesis testing in finite mixture
models. Journal of the Royal Statistical Society: Series B (Statistical Methodology)
47, 67–75.
Aitkin, M., Wilson, G. T., 1980. Mixture models, outliers, and the EM algorithm. Tech-
nometrics 22 (3), 325–331.
Ajala, I., 26 Nov 2005. GIS and GSM network quality monitoring: A Nigerian case
study.
URL http://www.directionsmag.com/articles/
Ajala, I., 07 Mar 2006. Spatial analysis of GSM subscriber call data records.
URL http://www.directionsmag.com/articles/
Ajzen, I., 1991. The theory of planned behavior. Organizational Behavior and Human
Decision Processes 50 (2), 179–211.
Ajzen, I., 2001. Nature and operation of attitudes. Annual Review of Psychology
52 (1), 27–58.
Ajzen, I., Fishbein, M., 1980. Understanding Attitudes and Predicting Social Behav-
ior. Prentice-Hall, Englewood-Cliffs, NJ.
Ajzen, I., Fishbein, M., 2000. Attitudes and the attitude-behavior relation: reasoned
and automatic processes. European Review of Social Psychology 11, 1–33.
Akaike, H., 1974. A new look at the statistical model identification. IEEE Transactions
on Automatic Control 19 (6), 716–723.
Alderson, W., 1957. Marketing Behavior and Executive Action: A Functionalist Ap-
proach to Marketing Theory. Richard D. Irwin, Homewood, IL.
Alon, N., Gibbons, P. B., Matias, Y., Szegedy, M., 1999. Tracking join and self-join
sizes in limited storage. In: Proceedings of the Eighteenth ACM SIGMOD-SIGACT-
SIGART Symposium on Principles of Database Systems. ACM, Philadelphia, PA,
pp. 10–20.
Alon, N., Matias, Y., Szegedy, M., 1996. The space complexity of approximating the
frequency moments. In: Proceedings of the 28th Annual ACM Symposium on The-
ory of Computing. ACM, Philadelphia, PA, pp. 20–29.
Alsabti, K., Ranka, S., Singh, V., 1997. A one-pass algorithm for accurately estimat-
ing quantiles for disk-resident data. In: Jarke, M., Carey, M. J., Dittrich, K. R.,
Lochovsky, F. H., Loucopoulos, P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd
International Conference on Very Large Data Bases. Morgan Kaufmann, Athens,
Greece, pp. 346–355.
Amaral, T. P., Gonzalez, F. A., Jimenez, B. M., 1995. Business telephone traffic demand
in Spain: 1980–1991, an econometric approach. Information Economics and Policy
7 (2), 115–134.
Anderson, E. W., Mittal, V., 2000. Strengthening the satisfaction-profit chain. Journal
of Service Research 3 (2), 107–120.
Anderson, P. F., 1983. Marketing, scientific progress, and scientific method. Journal
of Marketing 47 (4), 18–31.
Anderson Jr., W. T., Golden, L. L., 1984. Lifestyle and psychographics: a critical review
and recommendation. Advances in Consumer Research 11 (1), 405–411.
Andrieu, C., de Freitas, N., Doucet, A., Jordan, M. I., 2003. An introduction to MCMC
for machine learning. Machine Learning 50 (1-2), 5–43.
Ankerst, M., Breunig, M., Kriegel, H.-P., Sander, J., 1999. OPTICS: ordering points to
identify the clustering structure. ACM SIGMOD Record 28 (2), 49–60.
Antoniak, C., 1974. Mixtures of Dirichlet processes with applications to Bayesian
nonparametric problems. The Annals of Statistics 2 (6), 1152–1174.
Arabie, P., Hubert, L. J., 1996. An overview of combinatorial data analysis. In: Arabie,
P., Hubert, L. J., De Soete, G. (Eds.), Clustering and Classification.
World Scientific, River Edge, NJ, pp. 5–63.
Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Nishizawa, I., Rosenstein, J.,
Widom, J., 2003. STREAM: Stanford stream data manager. In: Halevy, A. Y., Ives,
Z. G., Doan, A. (Eds.), Proceedings of the 2003 ACM SIGMOD International Con-
ference on Management of Data. ACM, San Diego, CA, pp. 665–665.
Archambeau, C., Verleysen, M., 2007. Robust Bayesian clustering. Neural Networks
20 (1), 129–138.
Aref, W. G., Elfeky, M. G., Elmagarmid, A. K., 2004. Incremental, online, and merge
mining of partial periodic patterns in time-series databases. IEEE Transactions on
Knowledge and Data Engineering 16 (3), 332–342.
Armstrong, J. S., 2001. Principles of Forecasting: A Handbook for Researchers and
Practitioners. International Series in Operations Research & Management Science.
Kluwer Academic, Boston, MA.
Arndt, J., 1967. Role of product-related conversations in the diffusion of a new prod-
uct. Journal of Marketing Research 4 (3), 291–295.
Arnould, E. J., Price, L., Zinkhan, G. M., 2004. Consumers, 2nd Edition. McGraw-
Hill/Irwin Series in Marketing. McGraw-Hill/Irwin, Boston, MA.
Arrow, K. J., Forsythe, R., Gorham, M., Hahn, R., Hanson, R., Ledyard, J. O., Levmore,
S., Litan, R., Milgrom, P., Nelson, F. D., Neumann, G. R., Ottaviani, M., Schelling,
T. C., Shiller, R. J., Smith, V. L., Snowberg, E., Sunstein, C. R., Tetlock, P. C., Tet-
lock, P. E., Varian, H. R., Wolfers, J., Zitzewitz, E., 2008. The promise of prediction
markets. Science 320 (5878), 877–878.
Assent, I., Krieger, R., Muller, E., Seidl, T., 2007a. DUSC: dimensionality unbiased
subspace clustering. In: Proceedings of the 7th IEEE International Conference on
Data Mining. IEEE, Omaha, NE, pp. 409–414.
Assent, I., Krieger, R., Muller, E., Seidl, T., 2007b. VISA: visual subspace clustering
analysis. SIGKDD Explorations 9 (2), 5–12.
Attias, H., 1999. Inferring parameters and structure of latent variable models by vari-
ational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Con-
ference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm,
Sweden, pp. 21–30.
Azzalini, A., 1996. Statistical Inference: Based on the Likelihood. Monographs on
Statistics and Applied Probability. Chapman & Hall, London.
Babad, Y. M., Balachandran, B. V., 1993. Cost driver optimization in activity-based
costing. The Accounting Review 68 (3), 563–575.
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., 2002a. Models and issues
in data stream systems. In: Popa, L. (Ed.), Proceedings of the Twenty-first ACM
SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM,
Madison, WI, pp. 1–16.
Babcock, B., Datar, M., Motwani, R., 2002b. Sampling from a moving window over
streaming data. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium
on Discrete Algorithms. ACM, San Francisco, CA, pp. 633–634.
Babcock, B., Datar, M., Motwani, R., 2007. Load shedding in data stream systems. In:
Aggarwal, C. C. (Ed.), Data Streams: Models and Algorithms. Advances in Database
Systems. Springer, New York.
Babcock, B., Datar, M., Motwani, R., O’Callaghan, L., 2003. Maintaining variance and
k-medians over data stream windows. In: Proceedings of the Twenty-Second ACM
SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM,
San Diego, CA, pp. 234–243.
Babu, S., Subramanian, L., Widom, J., 2001. A data stream management system for
network traffic management. In: Workshop on Network-Related Data Manage-
ment. Santa Barbara, CA.
Balakrishnan, S., Madigan, D., 2006. A one-pass sequential Monte Carlo method for
Bayesian analysis of massive datasets. Bayesian Analysis 1 (2), 345–362.
Balazinska, M., Castro, P., 2003. Characterizing mobility and network usage in a cor-
porate wireless local-area network. In: Proceedings of the First International Con-
ference on Mobile Systems, Applications, and Services. USENIX, San Francisco,
CA, pp. 303–316.
Baldinger, A. L., Rubinson, J., 1996. Brand loyalty: the link between attitude and be-
havior. Journal of Advertising Research 36 (6), 22–34.
Ball, G. H., Hall, D. J., 1965. ISODATA, a novel method of data analysis and pattern
classification. Tech. rep., Stanford Research Institute, Menlo Park, CA.
Banfield, J. D., Raftery, A. E., 1993. Model-based Gaussian and non-Gaussian cluster-
ing. Biometrics 49 (3), 803–821.
Barbara, D., Chen, P., 2000. Using the fractal dimension to cluster datasets. In: Pro-
ceedings of the Sixth ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining. ACM, Boston, MA, pp. 260–264.
Barbara, D., Dumouchel, W., Faloutsos, C., Haas, P., Hellerstein, J., Ioannidis, Y., Ja-
gadish, H. V., Johnson, T., Ng, R., Poosala, V., Ross, K., Sevcik, K., 1997. The new
jersey data reduction report. IEEE Data Engineering Bulletin 20 (4), 3–45.
Batty, M., 2003. Agent-based pedestrian modelling. In: Longley, P. A., Batty, M. (Eds.),
Advanced Spatial Analysis: The CASA Book of GIS. ESRI Press, Redlands, CA.
Baudry, J.-P., Raftery, A. E., Celeux, G., Lo, K., Gottardo, R., 2010. Combining mix-
ture components for clustering. Journal of Computational and Graphical Statistics
19 (2), 332–353.
Bayes, T., 1763. An essay towards solving a problem in the doctrine of chances. Philo-
sophical Transactions of the Royal Society of London 53, 370–418; 54, 296–325.
Beal, M. J., Ghahramani, Z., 2002. The variational Bayesian EM algorithm for incomplete
data: with application to scoring graphical model structures. In: Bernardo,
J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., West,
M. (Eds.), Proceedings of the Seventh Valencia International Meeting. Oxford Uni-
versity, Tenerife, Spain, pp. 453–464.
Beal, M. J., Ghahramani, Z., 2006. Variational Bayesian learning of directed graphical
models with hidden variables. Bayesian Analysis 1 (4), 793–832.
Belk, R. W., 1987. ACR presidential address: happy thought. Advances in Consumer
Research 14 (1), 1–4.
Belk, R. W., Sherry Jr., J. F., Wallendorf, M., 1988. A naturalistic inquiry into buyer and
seller behavior at a swap meet. Journal of Consumer Research 14 (4), 449–470.
Bellman, R. E., 1961. Adaptive Control Processes: A Guided Tour, 5th Edition. Prince-
ton University, Princeton, NJ.
Berchtold, S., Bohm, C., Keim, D. A., Kriegel, H.-P., 1997. A cost model for nearest
neighbor search in high-dimensional data space. In: Proceedings of the Sixteenth
ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems.
ACM, Tucson, AZ, pp. 78–86.
Berchtold, S., Bohm, C., Kriegal, H.-P., 1998. The pyramid-technique: Towards break-
ing the curse of dimensionality. In: Proceedings of the 1998 ACM SIGMOD Inter-
national Conference on Management of data. ACM, pp. 142–153.
Berchtold, S., Keim, D. A., Kriegel, H.-P., 1996. The X-tree: an index structure for
high-dimensional data. In: Vijayaraman, T. M., Buchmann, A. P., Mohan, C., Sarda,
N. L. (Eds.), Proceedings of the 22nd International Conference on Very Large Data
Bases. Morgan Kaufmann, Mumbai, India, pp. 28–39.
Berkhin, P., 2006. A survey of clustering data mining techniques. In: Kogan, J.,
Nicholas, C., Teboulle, M. (Eds.), Grouping Multidimensional Data: Recent Ad-
vances in Clustering. Springer, New York.
Berndt, D. J., Clifford, J., 1994. Using dynamic time warping to find patterns in time
series. In: Fayyad, U. M., Uthurusamy, R. (Eds.), AAAI Workshop on Knowledge
Discovery in Databases. AAAI, Seattle, WA, pp. 359–370.
Berry, M. J. A., Linoff, G., 2000. Mastering Data Mining: The Art and Science of Cus-
tomer Relationship Management. Wiley, New York.
Berry, M. J. A., Linoff, G., 2004. Data Mining Techniques: for Marketing, Sales, and
Customer Relationship Management, 2nd Edition. Wiley, Indianapolis, IN.
Besag, J., Green, P., Higdon, D., Mengersen, K., 1995. Bayesian computation and
stochastic systems. Statistical Science 10 (1), 3–41.
Beyer, K. S., Goldstein, J., Ramakrishnan, R., Shaft, U., 1999. When is “nearest neigh-
bor” meaningful? In: Beeri, C., Buneman, P. (Eds.), Proceedings of the Seventh In-
ternational Conference on Database Theory. Vol. 1540. Springer, Jerusalem, Israel,
pp. 217–235.
Bezdek, J. C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms.
Advanced Applications in Pattern Recognition. Springer, New York.
Bhaduri, K., Das, K., Sivakumar, K., Kargupta, H., Wolff, R., 2007. Algorithms for dis-
tributed data stream mining. In: Aggarwal, C. C. (Ed.), Data Streams: Models and
Algorithms. Advances in Database Systems. Springer, New York.
Bhattacharya, C. B., 1998. When customers are members: Customer retention in
paid membership contexts. Journal of the Academy of Marketing Science 26 (1),
31–44.
Bickart, B., Schmittlein, D., 1999. The distribution of survey contact and participa-
tion in the United States: constructing a survey-based estimate. Journal of Mar-
keting Research 36 (2), 286–294.
Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing mixture model for clustering
with integrated completed likelihood. IEEE Transactions on Pattern Analysis and
Machine Intelligence 22 (7), 719–725.
Biernacki, C., Celeux, G., Govaert, G., 2003. Choosing starting values for the EM algo-
rithm for getting the highest likelihood in multivariate Gaussian mixture models.
Computational Statistics & Data Analysis 41 (3-4), 561–575.
Biernacki, C., Celeux, G., Govaert, G., Langrognet, F., 2006. Model-based cluster
and discriminant analysis with the MIXMOD software. Computational Statistics
& Data Analysis 51 (2), 587–600.
Binder, D. A., 1978. Bayesian cluster analysis. Biometrika 65 (1), 31–38.
Birant, D., Kut, A., 2007. ST-DBSCAN: an algorithm for clustering spatial-temporal
data. Data & Knowledge Engineering 60 (1), 208–221.
Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Information Sci-
ence and Statistics. Springer, New York.
Bitran, G. R., Mondschein, S. V., 1996. Mailing decisions in the catalog sales industry.
Management Science 42 (9), 1364–1381.
Blattberg, R. C., Deighton, J., 1996. Manage marketing by the customer equity test.
Harvard Business Review July-August, 136–144.
Blattberg, R. C., Getz, G., Thomas, J. S., 2000. Customer Equity: Building and Manag-
ing Relationships as Valuable Assets. Harvard Business School, Boston, MA.
Blei, D. M., Jordan, M. I., 2004. Variational methods for the Dirichlet process. In:
Brodley, C. E. (Ed.), Proceedings of the Twenty-first International Conference Ma-
chine Learning. ACM, Banff, AB, Canada.
Blei, D. M., Jordan, M. I., 2006. Variational inference for Dirichlet process mixtures.
Bayesian Analysis 1 (1), 121–144.
Blei, D. M., Ng, A. Y., Jordan, M. I., 2003. Latent Dirichlet allocation. Journal of Ma-
chine Learning Research 3 (Jan), 993–1022.
Bohm, C., Berchtold, S., Keim, D. A., 2001. Searching in high-dimensional spaces:
index structures for improving the performance of multimedia databases. ACM
Computing Surveys 33 (3), 322–373.
Bohm, C., Braunmuller, B., Breunig, M. M., Kriegel, H.-P., 2000. High performance
clustering based on the similarity join. In: Proceedings of the 2000 ACM CIKM
International Conference on Information and Knowledge Management. ACM,
McLean, VA, pp. 298–305.
Bohm, C., Kailing, K., Kriegel, H.-P., Kroger, P., 2004a. Density connected cluster-
ing with local subspace preferences. In: Proceedings of the 4th IEEE International
Conference on Data Mining. IEEE, Brighton, UK, pp. 27–34.
Bohm, C., Kailing, K., Kroger, P., Zimek, A., 2004b. Computing clusters of correlation
connected objects. In: Weikum, G., Konig, A. C., Deßloch, S. (Eds.), Proceedings of
the 2004 ACM SIGMOD International Conference on Management of Data. ACM,
Paris, France, pp. 455–466.
Bolton, R. N., 1998. A dynamic model of the duration of the customer’s relation-
ship with a continuous service provider: the role of satisfaction. Marketing Science
17 (1), 45–65.
Borden, N. H., 1964. The concept of the marketing mix. Journal of Advertising Re-
search 4 (June), 2–7.
Bouveyron, C., Girard, S., Schmid, C., 2007. High-dimensional data clustering. Com-
putational Statistics & Data Analysis 52 (1), 502–519.
Boyles, R. A., 1983. On the convergence of the EM algorithm. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 45 (1), 47–50.
Bradley, P., Fayyad, U. M., Reina, C., 1998. Scaling clustering algorithms to large
databases. In: Agrawal, R., Stolorz, P. E., Piatetsky-Shapiro, G. (Eds.), Proceedings
of the Fourth International Conference on Knowledge Discovery and Data Mining.
AAAI, New York, pp. 9–15.
Bradley, P. S., Reina, C., Fayyad, U. M., 2000. Clustering very large databases using EM
mixture models. In: Proceedings of the 15th International Conference on Pattern
Recognition. IEEE, Barcelona, Spain, pp. 2076–2080.
Braun, M., McAuliffe, J., 2010. Variational inference for large-scale models of discrete
choice. Journal of the American Statistical Association 105 (489), 324–335.
Brennan, M., Hoek, J., 1992. The behavior of respondents, nonrespondents, and re-
fusers across mail surveys. Public Opinion Quarterly 56 (4), 530–535.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., Sander, J., 2000. LOF: identifying density-
based local outliers. In: Chen, W., Naughton, J. F., Bernstein, P. A. (Eds.), Proceed-
ings of the 2000 ACM SIGMOD International Conference on Management of Data.
ACM, Dallas, TX, pp. 93–104.
Brockmann, D., Hufnagel, L., Geisel, T., 2006. The scaling laws of human travel. Na-
ture 439 (7075), 462–465.
Bruneau, P., Gelgon, M., Picarougne, F., 2008. Parameter-based reduction of Gaus-
sian mixture models with a variational-Bayes approach. In: Proceedings of the
19th International Conference on Pattern Recognition. IEEE, Tampa, FL, pp. 1–4.
Buckinx, W., Van den Poel, D., 2005. Customer base analysis: partial defection of be-
haviourally loyal clients in a non-contractual FMCG retail setting. European Jour-
nal of Operational Research 164 (1), 252–268.
Bult, J. R., Wansbeek, T., 1995. Optimal selection for direct mail. Marketing Science
14 (4), 378–394.
Burbeck, K., Nadjm-Tehrani, S., 2005. ADWICE - anomaly detection with real-time
incremental clustering. In: Park, C., Chee, S. (Eds.), Proceedings of the Seventh
International Conference on the Theory and Application of Cryptology and Infor-
mation Security. Springer, Seoul, Korea, pp. 407–424.
Buttle, F., 2009. Customer Relationship Management: Concepts and Technologies,
2nd Edition. Butterworth-Heinemann, London.
Cadez, I. V., Smyth, P., Ip, E., Mannila, H., 2001. Predictive profiles for transaction data
using finite mixture models. Tech. Rep. UCI-ICS 01-67, Department of Information
& Computer Science, University of California, Irvine, CA.
Cahill, D. J., 2006. Lifestyle Market Segmentation. Haworth Series in Segmented, Tar-
geted, and Customized Marketing. Haworth, New York.
Calinski, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Communica-
tions in Statistics - Theory and Methods 3 (1), 1–27.
Camp, T., Boleng, J., Davies, V., 2002. A survey of mobility models for ad hoc network
research. Wireless Communications and Mobile Computing 2 (5), 483–502.
Campbell, N. A., 1980. Robust procedures in multivariate analysis I: robust covari-
ance estimation. Journal of the Royal Statistical Society: Series C (Applied Statis-
tics) 29 (3), 231–237.
Cao, F., Ester, M., Qian, W., Zhou, A., 2006. Density-based clustering over an evolving
data stream with noise. In: Ghosh, J., Lambert, D., Skillicorn, D. B., Srivastava, J.
(Eds.), Proceedings of the Sixth SIAM International Conference on Data Mining.
SIAM, Bethesda, MD.
Cattell, R. B., 1966. The scree test for the number of factors. Multivariate Behavioral
Research 1 (2), 245–276.
Celeux, G., Chauveau, D., Diebolt, J., 1996. Stochastic versions of the EM algorithm:
an experimental study in the mixture case. Journal of Statistical Computation and
Simulation 55 (4), 287–314.
Celeux, G., Diebolt, J., 1985. The SEM algorithm: a probabilistic teacher algorithm
derived from EM algorithm for the mixture problem. Computational Statistics
Quarterly 2, 73–82.
Celeux, G., Forbes, F., Robert, C., Titterington, D., 2006. Deviance information criteria
for missing data models. Bayesian Analysis 1 (4), 651–674.
Celeux, G., Govaert, G., 1992. A classification EM algorithm for clustering and two
stochastic versions. Computational Statistics & Data Analysis 14, 315–332.
Celeux, G., Govaert, G., 1995. Gaussian parsimonious clustering models. Pattern
Recognition 28 (5), 781–793.
Celeux, G., Hurn, M., Robert, C. P., 2000. Computational and inferential difficulties
with mixture posterior distributions. Journal of the American Statistical Associa-
tion 95 (451), 957–970.
Chakrabarti, K., Garofalakis, M. N., Rastogi, R., Shim, K., 2001. Approximate query
processing using wavelets. The VLDB Journal 10 (2-3), 199–223.
Chakrabarti, K., Keogh, E. J., Mehrotra, S., Pazzani, M. J., 2002. Locally adaptive di-
mensionality reduction for indexing large time series databases. ACM Transac-
tions on Database Systems 27 (2), 188–228.
Chakrabarti, K., Mehrotra, S., 1999. The hybrid tree: an index structure for high di-
mensional feature spaces. In: Proceedings of the 15th International Conference on
Data Engineering. IEEE, Sydney, Australia, pp. 440–447.
Chan, K.-p., Fu, A. W.-C., 1999. Efficient time series matching by wavelets. In: Pro-
ceedings of the 15th International Conference on Data Engineering. IEEE, Sydney,
Australia, pp. 126–133.
Chang, J.-W., Jin, D.-S., 2002. A new cell-based clustering method for large, high-
dimensional data in data mining applications. In: Proceedings of the 2002 ACM
Symposium on Applied Computing. ACM, Madrid, Spain, pp. 503–507.
Chang, W.-C., 1983. On using principal components before separating a mixture of
two multivariate normal distributions. Journal of the Royal Statistical Society: Se-
ries C (Applied Statistics) 32 (3), 267–275.
Chapelle, O., Schölkopf, B., Zien, A., 2010. Semi-Supervised Learning. Adaptive
Computation and Machine Learning. MIT, London.
Charikar, M., O’Callaghan, L., Panigrahy, R., 2003. Better streaming algorithms for
clustering problems. In: Proceedings of the 35th Annual ACM Symposium on The-
ory of Computing. ACM, San Diego, CA, pp. 30–39.
Chatfield, C., 1995. Model uncertainty, data mining and statistical inference. Journal
of the Royal Statistical Society: Series A (Statistics in Society) 158 (3), 419–466.
Chaudhuri, S., Dayal, U., Ganti, V., 2001. Database technology for decision support
systems. Computer 34 (12), 48–55.
Chaudhuri, S., Motwani, R., Narasayya, V., 1998. Random sampling for histogram
construction: how much is enough? In: Proceedings of the 1998 ACM SIGMOD
International Conference on Management of Data. ACM, Seattle, WA, pp. 436–447.
Chaudhuri, S., Motwani, R., Narasayya, V. R., 1999. On random sampling over joins.
In: Delis, A., Faloutsos, C., Ghandeharizadeh, S. (Eds.), Proceedings of the 1999
ACM SIGMOD International Conference on Management of Data. ACM, Philadel-
phia, PA, pp. 263–274.
Cheeseman, P., Stutz, J., 1996. Bayesian classification (AutoClass): theory and re-
sults. In: Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.),
Advances in Knowledge Discovery and Data Mining. AAAI.
Cheng, C. H., Fu, A. W.-C., Zhang, Y., 1999. Entropy-based subspace clustering for
mining numerical data. In: Proceedings of the Fifth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, San Diego, CA, pp.
84–93.
Cheng, H., Hua, K. A., Vu, K., 2008. Constrained locally weighted clustering. Proceed-
ings of the VLDB Endowment 1 (1), 90–101.
Cheng, Y., Church, G. M., 2000. Biclustering of expression data. In: Bourne, P. E.,
Gribskov, M., Altman, R. B., Jensen, N., Hope, D. A., Lengauer, T., Mitchell, J. C.,
Scheeff, E. D., Smith, C., Strande, S., Weissig, H. (Eds.), Proceedings of the Eighth
International Conference Intelligent Systems for Molecular Biology. AAAI, La Jolla,
CA, pp. 93–103.
Cherkassky, V., Mulier, F. M., 2007. Learning from Data: Concepts, Theory, and Meth-
ods, 2nd Edition. Wiley, New York.
Cheung, Y.-m., 2005. Maximum weighted likelihood via rival penalized EM for den-
sity mixture clustering with automatic model selection. IEEE Transactions on
Knowledge and Data Engineering 17 (6), 750–761.
Cho, Y. H., Kim, J. K., 2004. Application of web usage mining and product taxonomy
to collaborative recommendations in e-commerce. Expert Systems with Applica-
tions 26 (2), 233–246.
Cho, Y. H., Kim, J. K., Kim, S. H., 2002. A personalized recommender system based on
web usage mining and decision tree induction. Expert Systems with Applications
23 (3), 329–342.
Chong, C.-C., Guvenc, I., Watanabe, F., Inamura, H., 2009. Ranging and localization
by UWB radio for indoor LBS. NTT DOCOMO Technical Journal 11 (1), 41–48.
Chopin, N., 2002. A sequential particle filter method for static models. Biometrika
89 (3), 539–552.
Christopher, M., Payne, A., Ballantyne, D., 1991. Relationship Marketing: Bring-
ing Quality, Customer Service and Marketing Together. The Marketing Series.
Butterworth-Heinemann, Boston, MA.
Claritas Inc., 2008. PRIZM NE®.
Claxton, J. D., Fry, J. N., Portis, B., 1974. A taxonomy of prepurchase information
gathering patterns. Journal of Consumer Research 1 (3), 35–42.
Cohen, E., Strauss, M., 2003. Maintaining time-decaying stream aggregates. In: Pro-
ceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on
Principles of Database Systems. ACM, San Diego, CA, pp. 223–233.
Cokins, G., 2004. Performance Management: Finding the Missing Pieces (to Close
the Intelligence Gap). Wiley and SAS Business Series. Wiley, Hoboken, NJ.
Cokins, G., King, K., 2004. Managing customer profitability and economic value in
the telecommunications industry: a holistic look at the individual level to build
corporate profitability one customer at a time. Tech. rep., SAS Institute Inc., Cary,
NC.
Cole, A. J., Wishart, D., 1970. An improved algorithm for the Jardine-Sibson method
of generating overlapping clusters. Computer Journal 13 (2), 156–163.
Constantinopoulos, C., Likas, A., 2007. Unsupervised learning of Gaussian mixtures
based on variational component splitting. IEEE Transactions on Neural Networks
18 (3), 745–755.
Cooper, R., Kaplan, R. S., 1991. Profit priorities from activity-based costing. Harvard
Business Review May-June, 130–135.
Cooper, R., Kaplan, R. S., 1998. The promise - and peril - of integrated cost systems.
Harvard Business Review July-August, 109–119.
Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mix-
ture distributions. In: Proceedings of the Eighth International Conference on Arti-
ficial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.
Cormode, G., Garofalakis, M. N., Sacharidis, D., 2006. Fast approximate wavelet
tracking on streams. In: Ioannidis, Y. E., Scholl, M. H., Schmidt, J. W., Matthes, F.,
Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (Eds.), Proceedings of
the Tenth International Conference on Extending Database Technology. Springer,
Munich, Germany, pp. 4–22.
Cortes, C., Fisher, K., Pregibon, D., Rogers, A., 2000. Hancock: a language for extract-
ing signatures from data streams. In: Proceedings of the Sixth ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining. ACM, Boston,
MA, pp. 9–17.
Cortes, C., Pregibon, D., 1998. Giga mining. In: Agrawal, R., Stolorz, P. E., Piatetsky-
Shapiro, G. (Eds.), Proceedings of the Fourth International Conference on Knowl-
edge Discovery and Data Mining. AAAI, New York, pp. 174–178.
Cortes, C., Pregibon, D., 1999. Information mining platform: an infrastructure for
KDD rapid deployment. In: Proceedings of the Fifth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, San Diego, CA, pp.
327–331.
Cox Jr., L. A., 2001. Forecasting demand for telecommunications products from
cross-sectional data. Telecommunication Systems 16 (3), 437–454.
Cox Jr., L. A., 2002. Data mining and causal modeling of customer behaviors.
Telecommunication Systems 21 (2-4), 349–381.
Cox Jr., L. A., Popken, D. A., 2002. A hybrid system-identification method for forecast-
ing telecommunications product demands. International Journal of Forecasting
18 (4), 647–671.
Curtin, R., Presser, S., Singer, E., 2005. Changes in telephone survey nonresponse
over the past quarter century. Public Opinion Quarterly 69 (1), 87–98.
Danaher, P. J., 2002. Optimal pricing of new subscription services: analysis of a mar-
ket experiment. Marketing Science 21 (2), 119–138.
Dasu, T., Johnson, T., 2003. Exploratory Data Mining and Data Cleaning. Wiley Series
in Probability and Statistics. Wiley, Hoboken, NJ.
Dasu, T., Krishnan, S., Venkatasubramanian, S., Yi, K., 2006. An information-
theoretic approach to detecting changes in multi-dimensional data streams. In:
Proceedings of the 38th Symposium on the Interface of Statistics, Computing Sci-
ence, and Applications. Pasadena, CA.
Datar, M., Gionis, A., Indyk, P., Motwani, R., 2002. Maintaining stream statistics over
sliding windows. SIAM Journal on Computing 31 (6), 1794–1813.
Datar, M., Motwani, R., 2007. The sliding-window computation model and results.
In: Aggarwal, C. C. (Ed.), Data Streams: Models and Algorithms. Advances in
Database Systems. Springer, New York.
Deligiannakis, A., Roussopoulos, N., 2003. Extended wavelets for multiple measures.
In: Halevy, A. Y., Ives, Z. G., Doan, A. (Eds.), Proceedings of the 2003 ACM SIGMOD
International Conference on Management of Data. ACM, San Diego, CA, pp. 229–
240.
Dempster, A. P., Laird, N. M., Rubin, D., 1977. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology) 39 (1), 1–38.
DeSarbo, W. S., Howard, D. J., Jedidi, K., 1990. MULTICLUS: A new method for simul-
taneously performing multidimensional scaling and cluster analysis. Psychome-
trika 56 (1), 121–136.
DeSarbo, W. S., Ramaswamy, V., 1994. CRISP: customer response based iterative seg-
mentation procedures for response modeling in direct marketing. Journal of Direct
Marketing 8 (3), 7–20.
Dhar, R., Glazer, R., 2003. Hedging customers. Harvard Business Review 81 (5), 86–92.
Dickson, P. R., 1982. Person-situation: segmentation’s missing link. Journal of Mar-
keting 46 (4), 56–64.
Diebolt, J., Robert, C. P., 1994. Estimation of finite mixture distributions through
Bayesian sampling. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 56 (2), 363–375.
Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E. J., 2008. Querying
and mining of time series data: experimental comparison of representations and
distance measures. In: Proceedings of the 34th International Conference on Very
Large Data Bases. ACM, Auckland, New Zealand, pp. 1542–1552.
Dobra, A., Garofalakis, M. N., Gehrke, J., Rastogi, R., 2002. Processing complex ag-
gregate queries over data streams. In: Franklin, M. J., Moon, B., Ailamaki, A. (Eds.),
Proceedings of the 2002 ACM SIGMOD International Conference on Management
of Data. ACM, Madison, WI, pp. 61–72.
Dobra, A., Garofalakis, M. N., Gehrke, J., Rastogi, R., 2004. Sketch-based multi-query
processing over data streams. In: Bertino, E., Christodoulakis, S., Plexousakis,
D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (Eds.), Proceedings of
the Ninth International Conference on Extending Database Technology. Springer,
Heraklion, Greece, pp. 551–568.
Domeniconi, C., Papadopoulos, D., Gunopulos, D., Ma, S., 2004.
Subspace clustering of high dimensional data. In: Berry, M. W., Dayal, U., Kamath,
C., Skillicorn, D. B. (Eds.), Proceedings of the Fourth SIAM International Confer-
ence on Data Mining. SIAM, Lake Buena Vista, FL.
Domingos, P., Hulten, G., 2000. Mining high-speed data streams. In: Proceedings
of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining. ACM, Boston, MA, pp. 71–80.
Domingos, P., Hulten, G., 2001. A general method for scaling up machine learning al-
gorithms and its application to clustering. In: Brodley, C. E., Danyluk, A. P. (Eds.),
Proceedings of the Eighteenth International Conference on Machine Learning.
Morgan Kaufmann, Williamstown, MA, pp. 106–113.
Dong, G., Han, J., Lakshmanan, L. V., Pei, J., Wang, H., Yu, P. S., 2003. Online mining
of changes from data streams: research problems and preliminary results. In: Pro-
ceedings of the 2003 ACM SIGMOD Workshop on Management and Processing of
Data Streams. ACM, San Diego, CA.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., Picard, D., 1996. Density estima-
tion by wavelet thresholding. The Annals of Statistics 24 (2), 508–539.
Doucet, A., de Freitas, N., Gordon, N., 2001. An introduction to sequential Monte
Carlo methods. In: Doucet, A., de Freitas, N., Gordon, N. (Eds.), Sequential Monte
Carlo Methods in Practice. Statistics for Engineering and Information Science.
Springer, New York.
Dougherty, E. R., Brun, M., 2004. A probabilistic theory of clustering. Pattern Recog-
nition 37 (5), 917–925.
Dowling, G. R., Uncles, M., 1997. Do customer loyalty programs really work? Sloan
Management Review 38 (4), 71–82.
Doyle, P., 1995. Marketing in the new millennium. European Journal of Marketing
29 (13), 23–41.
Dubes, R., 1987. How many clusters are best? - an experiment. Pattern Recognition
20 (6), 645–663.
Dubes, R. C., 1999. Cluster analysis and related issues. In: Chen, C.-H., Pau, L. F.,
Wang, P. S.-P. (Eds.), Handbook of Pattern Recognition & Computer Vision. World
Scientific Publishing Company, River Edge, NJ.
Duda, R. O., Hart, P. E., Stork, D. G., 2001. Pattern Classification, 2nd Edition. Wiley,
New York.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., Pregibon, D., 1999. Squashing
flat files flatter. In: Proceedings of the Fifth ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining. ACM, San Diego, CA, pp. 6–15.
Duncan, T., 2005. Principles of Advertising & IMC, 2nd Edition. The McGraw-
Hill/Irwin Series in Marketing. McGraw-Hill/Irwin, Burr Ridge, IL.
Dwyer, F. R., Schurr, P. H., Oh, S., 1987. Developing buyer-seller relationships. Journal
of Marketing 51 (2), 11–27.
Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman &
Hall/CRC Monographs on Statistics & Applied Probability. Chapman & Hall, Lon-
don.
Egan, J., 2005. Relationship Marketing: Exploring Relational Strategies in Marketing,
2nd Edition. Prentice Hall, New York.
Eriksson, K., Mattsson, J., 2002. Managers’ perception of relationship management
in heterogeneous markets. Industrial Marketing Management 31 (6), 535–543.
Escobar, M. D., West, M., 1995. Bayesian density estimation and inference using mix-
tures. Journal of the American Statistical Association 90 (430), 577–588.
Ester, M., Kriegel, H.-P., Sander, J., Wimmer, M., Xu, X., 1998. Incremental cluster-
ing for mining in a data warehousing environment. In: Gupta, A., Shmueli, O.,
Widom, J. (Eds.), Proceedings of the 24th International Conference on Very Large
Data Bases. Morgan Kaufmann, New York, pp. 323–333.
Ester, M., Kriegel, H.-P., Sander, J., Xu, X., 1996. A density-based algorithm for dis-
covering clusters in large spatial databases with noise. In: Simoudis, E., Han,
J., Fayyad, U. M. (Eds.), Proceedings of the Second International Conference on
Knowledge Discovery and Data Mining. AAAI, Portland, OR, pp. 226–231.
Estivill-Castro, V., Lee, I., 2000. AMOEBA: hierarchical clustering based on spatial
proximity using Delaunay diagram. In: Foyer, P., Yeh, A., He, J. (Eds.), Proceedings
of the 9th International Symposium on Spatial Data Handling. IGU, Beijing, China,
pp. 26–41.
Evans, M., O’Malley, L., Patterson, M., 2004. Exploring Direct & Relationship Market-
ing, 2nd Edition. Thomson, London.
Everitt, B. S., 1974. Cluster Analysis. Reviews of Current Research. Heinemann, Lon-
don.
Everitt, B. S., 1979. Unresolved problems in cluster analysis. Biometrics 35 (1), 169–
181.
Everitt, B. S., Hand, D. J., 1981. Finite Mixture Distributions. Monographs on Applied
Probability and Statistics. Chapman & Hall, London.
Faloutsos, C., Jagadish, H. V., Sidiropoulos, N., 1997. Recovering information from
summary data. In: Carey, M. J., Dittrich, K. R., Lochovsky, F. H., Loucopoulos,
P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd International Conference on Very
Large Data Bases. Morgan Kaufmann, Athens, Greece, pp. 36–45.
Faloutsos, C., Ranganathan, M., Manolopoulos, Y., 1994. Fast subsequence matching
in time-series databases. In: Snodgrass, R. T., Winslett, M. (Eds.), Proceedings of
the 1994 ACM SIGMOD International Conference on Management of Data. ACM,
Minneapolis, MN, pp. 419–429.
Farnstrom, F., Lewis, J., Elkan, C., 2000. Scalability for clustering algorithms revisited.
SIGKDD Explorations 2 (1), 51–57.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., 1996. From data mining to knowledge
discovery in databases. AI Magazine 17 (3), 37–54.
Fearnhead, P., 2008. Computational methods for complex stochastic systems: a re-
view of some alternatives to MCMC. Statistics and Computing 18 (2), 151–171.
Feigenbaum, J., Kannan, S., Strauss, M., Viswanathan, M., 1999. An approximate l1-
difference algorithm for massive data streams. In: Proceedings of the 40th Annual
Symposium on Foundations of Computer Science. IEEE, New York, pp. 501–511.
Feldman, J., Muthukrishnan, S., Sidiropoulos, A., Stein, C., Svitkina, Z., 2008. On the
complexity of processing massive, unordered, distributed data. The Computing
Research Repository.
Ferguson, T. S., 1973. A Bayesian analysis of some nonparametric problems. The An-
nals of Statistics 1 (2), 209–230.
Ferguson, T. S., 1983. Bayesian density estimation by mixtures of normal distribu-
tions. In: Rizvi, H., Rustagi, J. (Eds.), Recent Advances in Statistics. Academic, New
York, pp. 287–303.
Fern, X. Z., Brodley, C. E., 2003. Random projection for high dimensional data clus-
tering. In: Fawcett, T., Mishra, N. (Eds.), Proceedings of the Twentieth Interna-
tional Conference on Machine Learning. AAAI, Washington, DC, pp. 186–193.
Fernández-Durán, J. J., 2004. Circular distributions based on nonnegative trigono-
metric sums. Biometrics 60 (2), 499–503.
Figini, S., Giudici, P., Brooks, S. P., 2006. Bayesian feature selection for estimating
customer survival. In: The Eighth World Meeting on Bayesian Statistics. Valencia,
Spain.
Fildes, R., 2002. Telecommunications demand forecasting: a review. International
Journal of Forecasting 18 (4), 489–522.
Fisher, D. H., 1987. Improving inference through conceptual clustering. In: Proceed-
ings of the Sixth National Conference on Artificial Intelligence. AAAI, Seattle, WA,
pp. 461–465.
Fisher, N. I., 1996. Statistical Analysis of Circular Data, 2nd Edition. Cambridge Uni-
versity Press, Cambridge, UK.
Fisher, N. I., Lee, A. J., 1994. Time series analysis of circular data. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 56 (2), 327–339.
Flajolet, P., Martin, G. N., 1983. Probabilistic counting. In: Proceedings of the 24th
Annual IEEE Symposium on Foundations of Computer Science. IEEE, Tucson, AZ,
pp. 76–82.
Flint, D. J., Woodruff, R. B., Gardial, S. F., 1997. Customer value change in industrial
marketing relationships: a call for new strategies and research. Industrial Market-
ing Management 26 (2), 163–175.
Foster, G., Gupta, M., 1994. Marketing, cost management and management account-
ing. Journal of Management Accounting Research 6 (Fall), 43–77.
Fournier, S., Dobscha, S., Mick, D. G., 1998. Preventing the premature death of rela-
tionship marketing. Harvard Business Review 76 (1), 42–51.
Fraley, C., Raftery, A. E., 1998. How many clusters? which clustering method? an-
swers via model-based cluster analysis. The Computer Journal 41 (8), 578–588.
Fraley, C., Raftery, A. E., 1999. MCLUST: software for model-based cluster and dis-
criminant analysis. Tech. Rep. Tech Report 342, Statistics Department, University
of Washington, Seattle, WA.
Fraley, C., Raftery, A. E., 2002. Model-based clustering, discriminant analysis, and
density estimation. Journal of the American Statistical Association 97 (458), 611–
631.
Farley, J. U., Ring, L. W., 1966. A stochastic model of supermarket traffic flow. Oper-
ations Research 14 (4), 555–567.
Friedman, J., Fisher, N., 1999. Bump hunting in high-dimensional data. Statistics and
Computing 9 (2), 123–143.
Friedman, J., Meulman, J., 2004. Clustering objects on subsets of attributes. Journal
of the Royal Statistical Society: Series B (Statistical Methodology) 66 (4), 1–25.
Gaber, M., Zaslavsky, A., Krishnaswamy, S., 2005. Mining data streams: a review. SIG-
MOD Record 34 (2), 18–26.
Gaber, M. M., Zaslavsky, A., Krishnaswamy, S., 2007. A survey of classification meth-
ods in data streams. In: Aggarwal, C. C. (Ed.), Data Streams: Models and Algo-
rithms. Advances in Database Systems. Springer, New York.
Gaede, V., Günther, O., 1998. Multidimensional access methods. ACM Computing
Surveys 30 (2), 170–231.
Gama, J., Gaber, M. M., 2007. Learning from Data Streams: Processing Techniques in
Sensor Networks. Springer, Dordrecht, Netherlands.
Ganti, V., Gehrke, J., Ramakrishnan, R., 1999a. CACTUS - clustering categorical data
using summaries. In: Proceedings of the Fifth ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining. ACM, San Diego, CA, pp. 73–83.
Ganti, V., Gehrke, J., Ramakrishnan, R., 1999b. A framework for measuring changes
in data characteristics. In: Proceedings of the Eighteenth ACM SIGMOD-SIGACT-
SIGART Symposium on Principles of Database Systems. ACM, Philadelphia, PA,
pp. 126–137.
Ganti, V., Gehrke, J., Ramakrishnan, R., 2001. DEMON: mining and monitoring evolv-
ing data. IEEE Transactions on Knowledge and Data Engineering 13 (1), 50–63.
Ganti, V., Gehrke, J., Ramakrishnan, R., 2002. Mining data streams under block evo-
lution. SIGKDD Explorations 3 (2), 1–10.
Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A. L., French, J. C., 1999c. Clustering
large datasets in arbitrary metric spaces. In: Proceedings of the 15th International
Conference on Data Engineering. IEEE, Sydney, Australia, pp. 502–511.
Gao, J., Fan, W., Han, J., 2007. On appropriate assumptions to mine data streams:
analysis and practice. In: Proceedings of the Seventh IEEE International Confer-
ence on Data Mining. IEEE, Omaha, NE, pp. 143–152.
Garg, A., Mangla, A., Gupta, N., Bhatnagar, V., 2006. PBIRCH: a scalable parallel clus-
tering algorithm for incremental data. In: Proceedings of the Tenth International
Database Engineering and Applications Symposium. IEEE, Delhi, India, pp. 315–
316.
Garofalakis, M. N., Gibbons, P. B., 2002. Wavelet synopses with error guarantees. In:
Franklin, M. J., Moon, B., Ailamaki, A. (Eds.), Proceedings of the 2002 ACM SIG-
MOD International Conference on Management of Data. ACM, Madison, WI, pp.
476–487.
Garofalakis, M. N., Kumar, A., 2004. Deterministic wavelet thresholding for
maximum-error metrics. In: Deutsch, A. (Ed.), Proceedings of the Twenty-third
ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
ACM, Paris, France, pp. 166–176.
Gavrilov, M., Anguelov, D., Indyk, P., Motwani, R., 2000. Mining the stock market:
which measure is best? In: Proceedings of the Sixth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, Boston, MA, pp. 487–
496.
Ge, X., Smyth, P., 2000. Deformable Markov model templates for time-series pattern
matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, Boston, MA, pp. 81–90.
Gehrke, J., Korn, F., Srivastava, D., 2001. On computing correlated aggregates over
continual data streams. SIGMOD Record 30 (2), 13–24.
Gelfand, A. E., Smith, A. F. M., 1990. Sampling-based approaches to calculating
marginal densities. Journal of the American Statistical Association 85 (410), 398–
409.
Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd
Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence 6 (6), 721–741.
Gennari, J. H., Langley, P., Fisher, D. H., 1989. Models of incremental concept forma-
tion. Artificial Intelligence 40 (1-3), 11–61.
Geyer, C. J., 1992. Practical Markov chain Monte Carlo. Statistical Science 7 (4), 473–
483.
Ghahramani, Z., Beal, M. J., 1999. Variational inference for Bayesian mixtures of fac-
tor analysers. In: Solla, S. A., Leen, T. K., Müller, K.-R. (Eds.), Proceedings of the
1999 Neural Information Processing Systems. MIT, Denver, CO, pp. 449–455.
Ghahramani, Z., Beal, M. J., 2001. Propagation algorithms for variational Bayesian
learning. In: Leen, T. K., Dietterich, T. G., Tresp, V. (Eds.), Proceedings of the 2001
Neural Information Processing Systems. MIT, Denver, CO, pp. 507–513.
Ghosh, J., Beal, M. J., Ngo, H. Q., Qiao, C., 2006a. On profiling mobility and predicting
locations of wireless users. In: Proceedings of the 2nd International Workshop on
Multi-hop Ad Hoc Networks: From Theory to Reality. ACM, Florence, Italy, pp. 55–
62.
Ghosh, J., Strehl, A., 2004. Clustering and visualization of retail market baskets. In:
Pal, N. R., Jain, L. C. (Eds.), Advanced Techniques in Knowledge Discovery and
Data Mining. Advanced Information and Knowledge Processing. Springer, New
York.
Ghosh, J., Strehl, A., 2006. Similarity-based text clustering: a comparative study. In:
Kogan, J., Nicholas, C., Teboulle, M. (Eds.), Grouping Multidimensional Data: Re-
cent Advances in Clustering. Springer, New York.
Ghosh, J. K., Delampady, M., Samanta, T., 2006b. An Introduction to Bayesian Anal-
ysis: Theory and Methods. Springer Texts in Statistics. Springer, New York.
Ghosh, K., Jammalamadaka, S. R., Tiwari, R. C., 2003. Semiparametric Bayesian tech-
niques for problems in circular data. Journal of Applied Statistics 30 (2), 145–161.
Gibbons, P. B., Matias, Y., 1998. New sampling-based summary statistics for improv-
ing approximate query answers. In: Haas, L. M., Tiwary, A. (Eds.), Proceedings of
the 1998 ACM SIGMOD International Conference on Management of Data. ACM,
Seattle, WA, pp. 331–342.
Gibbons, P. B., Matias, Y., 1999. Synopsis data structures for massive data sets. In:
Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms.
Vol. A. ACM, Baltimore, MD, pp. 909–910.
Gibbons, P. B., Matias, Y., Poosala, V., 1997. Fast incremental maintenance of ap-
proximate histograms. In: Jarke, M., Carey, M. J., Dittrich, K. R., Lochovsky, F. H.,
Loucopoulos, P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd International Con-
ference on Very Large Data Bases. Morgan Kaufmann, Athens, Greece, pp. 466–475.
Gilbert, A. C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., Strauss, M., 2002.
Fast, small-space algorithms for approximate histogram maintenance. In: Pro-
ceedings of the 34th Annual ACM Symposium on Theory of Computing. ACM,
Quebec, QC, Canada, pp. 389–398.
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., Strauss, M., 2001. Surfing wavelets on
streams: One-pass summaries for approximate aggregate queries. In: Apers, P.
M. G., Atzeni, P., Ceri, S., Paraboschi, S., Ramamohanarao, K., Snodgrass, R. T.
(Eds.), Proceedings of the 27th International Conference on Very Large Data Bases.
Morgan Kaufmann, Roma, Italy, pp. 79–88.
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., Strauss, M., 2003. One-pass wavelet
decompositions of data streams. IEEE Transactions on Knowledge and Data Engi-
neering 15 (3), 541–554.
Gilks, W. R., Oldfield, L., Rutherford, A., 1989. Statistical analysis. In: Knapp, W.,
Dörken, B., Gilks, W. R., Schlossman, S. F., Boumsell, L., Harlan, J. M., Kishimoto, T.,
Morimoto, C., Ritz, J., Shaw, S., Silverstein, R., Springer, T., Tedder, T. F., Todd, R. F. (Eds.),
Leucocyte Typing IV. Oxford University, Oxford, UK, pp. 6–12.
Gilks, W. R., Richardson, S., Spiegelhalter, D. J., 1998. Markov Chain Monte Carlo in
Practice. Chapman & Hall, Boca Raton, FL.
Gilks, W. R., Thomas, A., Spiegelhalter, D. J., 1994. A language and program for com-
plex Bayesian modelling. Journal of the Royal Statistical Society: Series D (The
Statistician) 43 (1), 169–177.
Giudici, P., Castelo, R., 2001. Association models for web mining. Data Mining and
Knowledge Discovery 5 (3), 183–196.
Giudici, P., Passerone, G., 2002. Data mining of association structures to model con-
sumer behaviour. Computational Statistics & Data Analysis 38 (4), 533–541.
Glymour, C., Madigan, D., Pregibon, D., Smyth, P., 1996. Statistical inference and data
mining. Communications of the ACM 39 (11), 35–41.
Gonzalez, M. C., Hidalgo, C. A., Barabasi, A.-L., 2008. Understanding individual hu-
man mobility patterns. Nature 453 (7196), 779–782.
Gordon, I. H., 1998. Relationship Marketing: New Strategies, Techniques, and Tech-
nologies to Win the Customers You Want and Keep Them Forever. Wiley, Etobi-
coke, ON, Canada.
Graham, G., 2005. Behaviorism. In: Zalta, E. N. (Ed.), The Stanford Encyclopedia of
Philosophy, fall 2005 Edition.
Green, P. J., 1995. Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination. Biometrika 82 (4), 711–732.
Green, P. J., Richardson, S., 2001. Modelling heterogeneity with and without the
Dirichlet process. Scandinavian Journal of Statistics 28 (2), 355–375.
Greenwald, M., Khanna, S., 2001. Space-efficient online computation of quantile
summaries. In: Aref, W. G. (Ed.), Proceedings of the 2001 ACM SIGMOD Interna-
tional Conference on Management of Data. ACM, Santa Barbara, CA, pp. 58–66.
Gronroos, C., 1994. Quo vadis, marketing? Toward a relationship marketing
paradigm. Journal of Marketing Management 10 (5), 347–360.
Grunert, S. C., Scherhorn, G., 1990. Consumer values in West Germany: underlying di-
mensions and cross-cultural comparison with North America. Journal of Business
Research 20 (2), 97–107.
Guha, S., 2005. Space efficiency in synopsis construction algorithms. In: Bohm, K.,
Jensen, C. S., Haas, L. M., Kersten, M. L., Larson, P.-A., Ooi, B. C. (Eds.), Proceedings
of the 31st International Conference on Very Large Data Bases. ACM, Trondheim,
Norway, pp. 409–420.
Guha, S., 2010. Posterior simulation in countable mixture models for large datasets.
Journal of the American Statistical Association 105 (490), 775–786.
Guha, S., Gunopulos, D., Koudas, N., 2003a. Correlating synchronous and asyn-
chronous data streams. In: Getoor, L., Senator, T. E., Domingos, P., Faloutsos,
C. (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, Washington, DC, pp. 529–534.
Guha, S., Harb, B., 2005. Wavelet synopsis for data streams: minimizing non-
Euclidean error. In: Grossman, R., Bayardo, R. J., Bennett, K. P. (Eds.), Proceedings
of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. ACM, Chicago, IL, pp. 88–97.
Guha, S., Harb, B., 2006. Approximation algorithms for wavelet transform coding of
data streams. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium
on Discrete Algorithms. ACM, Miami, FL, pp. 698–707.
Guha, S., Indyk, P., Muthukrishnan, S., Strauss, M., 2002. Histogramming data
streams with fast per-item processing. In: Widmayer, P., Ruiz, F. T., Bueno, R. M.,
Hennessy, M., Eidenbenz, S., Conejo, R. (Eds.), Proceedings of the 29th Interna-
tional Colloquium on Automata, Languages and Programming. Springer, Malaga,
Spain, pp. 681–692.
Guha, S., Kim, C., Shim, K., 2004a. XWAVE: approximate extended wavelets for
streaming data. In: Nascimento, M. A., Ozsu, M. T., Kossmann, D., Miller, R. J.,
Blakeley, J. A., Schiefer, K. B. (Eds.), Proceedings of the 30th International Con-
ference on Very Large Data Bases. Morgan Kaufmann, Toronto, ON, Canada, pp.
288–299.
Guha, S., Koudas, N., 2002. Approximating a data stream for querying and estima-
tion: algorithms and performance evaluation. In: Proceedings of the 18th Interna-
tional Conference on Data Engineering. IEEE, San Jose, CA, pp. 567–576.
Guha, S., Koudas, N., Shim, K., 2001. Data-streams and histograms. In: Proceedings
of the 33rd Annual ACM Symposium on Theory of Computing. ACM, Heraklion,
Greece, pp. 471–475.
Guha, S., Koudas, N., Shim, K., 2006. Approximation and streaming algorithms for
histogram construction problems. ACM Transactions on Database Systems 31 (1),
396–438.
Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L., 2003b. Clustering
data streams: theory and practice. IEEE Transactions on Knowledge and Data En-
gineering 15 (3), 515–528.
Guha, S., Rastogi, R., Shim, K., 1998. CURE: an efficient clustering algorithm for large
databases. In: Haas, L. M., Tiwary, A. (Eds.), Proceedings of the ACM SIGMOD In-
ternational Conference on Management of Data. ACM, Seattle, WA, pp. 73–84.
Guha, S., Rastogi, R., Shim, K., 1999. ROCK: a robust clustering algorithm for cate-
gorical attributes. In: Proceedings of the 15th International Conference on Data
Engineering. IEEE, Sydney, Australia, pp. 512–521.
Guha, S., Shim, K., Woo, J., 2004b. REHIST: relative error histogram construction
algorithms. In: Nascimento, M. A., Ozsu, M. T., Kossmann, D., Miller, R. J., Blakeley,
J. A., Schiefer, K. B. (Eds.), Proceedings of the 30th International Conference on
Very Large Data Bases. Morgan Kaufmann, Toronto, ON, Canada, pp. 300–311.
Gummesson, E., 1994. Making relationship marketing operational. International
Journal of Service Industry Management 5 (5), 5–20.
Gummesson, E., 1999. Total Relationship Marketing: Rethinking Marketing. Man-
agement from 4Ps to 30Rs. Butterworth-Heinemann, Oxford, UK.
Gunopulos, D., Kollios, G., Tsotras, V. J., Domeniconi, C., 2000. Approximating multi-
dimensional aggregate range queries over real attributes. In: Chen, W., Naughton,
J. F., Bernstein, P. A. (Eds.), Proceedings of the 2000 ACM SIGMOD International
Conference on Management of Data. ACM, Dallas, TX, pp. 463–474.
Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. Jour-
nal of Machine Learning Research 3 (Mar), 1157–1182.
Haas, P. J., 1997. Large-sample and deterministic confidence intervals for online ag-
gregation. In: Ioannidis, Y. E., Hansen, D. M. (Eds.), Proceedings of the Ninth In-
ternational Conference on Scientific and Statistical Database Management. IEEE,
Olympia, WA, pp. 51–63.
Hallberg, G., 1995. All Customers Are Not Created Equal: The Differential Marketing
Strategy for Brand Loyalty and Profits. Wiley, New York.
Han, C., Carlin, B. P., 2001. Markov chain Monte Carlo methods for computing
Bayes factors: a comparative review. Journal of the American Statistical Associa-
tion 96 (455), 1122–1132.
Han, J., Dong, G., Yin, Y., 1999. Efficient mining of partial periodic patterns in time
series database. In: Proceedings of the 15th International Conference on Data En-
gineering. IEEE, Sydney, Australia, pp. 106–115.
Han, J., Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd Edition. The
Morgan-Kaufmann Series in Data Management Systems. Morgan Kaufmann, San
Francisco, CA.
Han, J., Pei, J., Yin, Y., 2000. Mining frequent patterns without candidate generation.
In: Chen, W., Naughton, J. F., Bernstein, P. A. (Eds.), Proceedings of the 2000 ACM
SIGMOD International Conference on Management of Data. ACM, Dallas, TX, pp.
1–12.
Hand, D. J., 1998. Data mining: statistics and more? The American Statistician 52 (2),
112–118.
Har-even, M., Brailovsky, V. L., 1995. Probabilistic validation approach for clustering.
Pattern Recognition Letters 16 (11), 1189–1196.
Hartigan, J. A., 1975. Clustering Algorithms. Wiley Series in Probability and Mathe-
matical Statistics. Wiley, New York.
Hastie, T., Tibshirani, R., Friedman, J. H., 2009. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics.
Springer, New York.
Hastings, W. K., 1970. Monte Carlo sampling methods using Markov Chains and their
applications. Biometrika 57 (1), 97–109.
Heinz, C., Seeger, B., 2006. Resource-aware kernel density estimators over stream-
ing data. In: Yu, P. S., Tsotras, V. J., Fox, E. A., Liu, B. (Eds.), Proceedings of the
2006 ACM CIKM International Conference on Information and Knowledge Man-
agement. ACM, Arlington, VA, pp. 870–871.
Heinz, C., Seeger, B., 2007. Adaptive wavelet density estimators over data streams.
In: Proceedings of the 19th International Conference on Scientific and Statistical
Database Management. IEEE, Banff, AB, Canada, pp. 35–35.
Heinz, C., Seeger, B., 2008. Cluster kernels: resource-aware kernel density estima-
tors over streaming data. IEEE Transactions on Knowledge and Data Engineering
20 (7), 880–893.
Heitfield, E., Levy, A., 2001. Parametric, semi-parametric and non-parametric mod-
els of telecommunications demand: an investigation of residential calling pat-
terns. Information Economics and Policy 13 (3), 311–329.
Heller, K. A., Ghahramani, Z., 2007. A nonparametric Bayesian approach to modeling
overlapping clusters. In: Proceedings of the Eleventh International Conference on
Artificial Intelligence and Statistics. San Juan, PR.
Henzinger, M. R., Raghavan, P., Rajagopalan, S., 26 May 1998. Computing on data
streams. Tech. Rep. SRC-TN-1998-011, Systems Research Center, Palo Alto, CA.
Herr, P. M., Kardes, F. R., Kim, J., 1991. Effects of word-of-mouth and product-
attribute information on persuasion: an accessibility-diagnosticity perspective.
Journal of Consumer Research 17 (4), 454–462.
Hinneburg, A., Gabriel, H.-H., 2007. DENCLUE 2.0: fast clustering based on kernel
density estimation. In: Berthold, M. R., Shawe-Taylor, J., Lavrac, N. (Eds.), Pro-
ceedings of the Seventh International Symposium on Intelligent Data Analysis.
Vol. 4723. Springer, Ljubljana, Slovenia, pp. 70–80.
Hinneburg, A., Keim, D. A., 1998. An efficient approach to clustering in large mul-
timedia databases with noise. In: Agrawal, R., Stolorz, P. E., Piatetsky-Shapiro, G.
(Eds.), Proceedings of the Fourth International Conference on Knowledge Discov-
ery and Data Mining. AAAI, New York, pp. 58–65.
Hinneburg, A., Keim, D. A., 1999. Optimal grid-clustering: towards breaking the curse
of dimensionality in high-dimensional clustering. In: Atkinson, M. P., Orlowska,
M. E., Valduriez, P., Zdonik, S. B., Brodie, M. L. (Eds.), Proceedings of the 25th In-
ternational Conference on Very Large Data Bases. Morgan Kaufmann, Edinburgh,
UK, pp. 506–517.
Hirschman, E. C., 1986. Humanistic inquiry in marketing research: philosophy,
method, and criteria. Journal of Marketing Research 23 (3), 237–249.
Hofstede, F. T., Wedel, M., Steenkamp, J.-B. E. M., 2002. Identifying spatial segments
in international markets. Marketing Science 21 (2), 160–177.
Holbrook, M. B., Hirschman, E. C., 1982. The experiential aspects of consumption:
consumer fantasies, feelings, and fun. Journal of Consumer Research 9 (2), 132–
140.
Holt, D. B., 1997. Poststructuralist lifestyle analysis: conceptualizing the social pat-
terning of consumption in postmodernity. Journal of Consumer Research 23 (4),
326–350.
Homer, P. M., Kahle, L. R., 1988. A structural equation test of the value-attitude-
behavior hierarchy. Journal of Personality and Social Psychology 54 (4), 638–646.
Houle, M. E., Sakuma, J., 2005. Fast approximate similarity search in extremely high-
dimensional data sets. In: Proceedings of the 21st International Conference on
Data Engineering. IEEE, Tokyo, Japan, pp. 619–630.
Hsu, W., Lee, M. L., Wang, J., 2008. Temporal and Spatio-Temporal Data Mining. IGI,
Hershey, PA.
Huang, Y., Zhang, L., Zhang, P., 2008. A framework for mining sequential patterns
from spatio-temporal event data sets. IEEE Transactions on Knowledge and Data
Engineering 20 (4), 433–448.
Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets
with categorical values. Data Mining and Knowledge Discovery 2 (3), 283–304.
Hubbard, R., Lindsay, R. M., 2008. Why p values are not a useful measure of evidence
in statistical significance testing. Theory & Psychology 18 (1), 69–88.
Hudson, L. A., Ozanne, J. L., 1988. Alternative ways of seeking knowledge in con-
sumer research. Journal of Consumer Research 14 (4), 508.
Hui, S. K., Bradlow, E. T., Fader, P. S., 2009a. Testing behavioral hypotheses using an
integrated model of grocery store shopping path and purchase behavior. Journal
of Consumer Research 36 (3), 478–493.
Hui, S. K., Fader, P. S., Bradlow, E. T., 2009b. The traveling salesman goes shopping:
The systematic deviations of grocery paths from TSP-optimality. Marketing Sci-
ence 28 (3), 566–572.
Hulten, G., Spencer, L., Domingos, P., 2001. Mining time-changing data streams. In:
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. ACM, San Francisco, CA, pp. 97–106.
Hunt, S. D., 1997. Competing through relationships: grounding relationship market-
ing in resource-advantage theory. Journal of Marketing Management 13 (5), 431–
445.
Hwang, H., Jung, T., Suh, E., 2004. An LTV model and customer segmentation based
on customer value: a case study on the wireless telecommunication industry. Ex-
pert Systems with Applications 26 (2), 181–188.
Hyndman, R. J., 1995. The problem with Sturges’ rule for constructing histograms.
Tech. rep., Department of Econometrics and Business Statistics, Monash Univer-
sity, Clayton, VIC, Australia.
URL http://www-personal.buseco.monash.edu.au/~hyndman/papers/sturges.htm
Indyk, P., 2000. Stable distributions, pseudorandom generators, embeddings and
data stream computation. In: Proceedings of the 41st Annual Symposium on
Foundations of Computer Science. IEEE, Redondo Beach, CA, pp. 189–197.
Indyk, P., Koudas, N., Muthukrishnan, S., 2000. Identifying representative trends
in massive time series data sets using sketches. In: Abbadi, A. E., Brodie, M. L.,
Chakravarthy, S., Dayal, U., Kamel, N., Schlageter, G., Whang, K.-Y. (Eds.), Pro-
ceedings of the 26th International Conference on Very Large Data Bases. Morgan
Kaufmann, Cairo, Egypt, pp. 363–372.
Intel Corporation, 2002. CDR analysis and warehousing for mobile networks. Tech.
rep., Intel Corporation, Santa Clara, CA.
Ioannidis, Y. E., 2003. The history of histograms. In: Freytag, J. C., Lockemann, P. C.,
Abiteboul, S., Carey, M. J., Selinger, P. G., Heuer, A. (Eds.), Proceedings of the 29th
International Conference on Very Large Data Bases. Morgan Kaufmann, Berlin,
Germany, pp. 19–30.
Ioannidis, Y. E., Poosala, V., 1995. Balancing histogram optimality and practicality for
query result size estimation. SIGMOD Record 24 (2), 233–244.
Ioannidis, Y. E., Poosala, V., 1999. Histogram-based approximation of set-valued
query-answers. In: Atkinson, M. P., Orlowska, M. E., Valduriez, P., Zdonik, S. B.,
Brodie, M. L. (Eds.), Proceedings of the 25th International Conference on Very
Large Data Bases. Morgan Kaufmann, Edinburgh, UK, pp. 174–185.
Jaakkola, T. S., Jordan, M. I., 2000. Bayesian parameter estimation via variational
methods. Statistics and Computing 10 (1), 25–37.
Jacoby, J., 1978. Consumer research: a state of the art review. Journal of Marketing
42 (2), 87–96.
Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., Suel, T.,
1998. Optimal histograms with quality guarantees. In: Gupta, A., Shmueli, O.,
Widom, J. (Eds.), Proceedings of the 24th International Conference on Very Large
Data Bases. Morgan Kaufmann, New York, pp. 275–286.
Jaihak, C., Rao, V. R., 2003. A general choice model for bundles with multiple-
category products: application to market segmentation and optimal pricing for
bundles. Journal of Marketing Research 40 (2), 115–130.
Jain, A. K., 2010. Data clustering: 50 years beyond k-means. Pattern Recognition Let-
ters 31 (8), 651–666.
Jain, A. K., Dubes, R. C., 1988. Algorithms for Clustering Data. Prentice Hall, Upper
Saddle River, NJ.
Jain, A. K., Murty, M. N., Flynn, P. J., 1999. Data clustering: a review. ACM Computing
Surveys 31 (3), 264–323.
Jain, S., Neal, R. M., 2004. A split-merge Markov chain Monte Carlo procedure for the
Dirichlet process mixture model. Journal of Computational and Graphical Statis-
tics 13 (1), 158–182.
Jain, S., Neal, R. M., 2007. Splitting and merging components of a nonconjugate
Dirichlet process mixture model (with discussion). Bayesian Analysis 2 (3), 445–
472.
Jammalamadaka, S. R., Sengupta, A., 2001. Topics in Circular Statistics. Series on
Multivariate Analysis. World Scientific, Singapore.
Jefferys, W. H., Berger, J. O., 1992. Ockham’s razor and Bayesian analysis. American
Scientist 80 (Jan/Feb), 64–72.
Jiang, D., Tang, C., Zhang, A., 2004. Cluster analysis for gene expression data: a sur-
vey. IEEE Transactions on Knowledge and Data Engineering 16 (11), 1370–1386.
Jing, L., Ng, M. K., Huang, J. Z., 2007. An entropy weighting k-means algorithm
for subspace clustering of high-dimensional sparse data. IEEE Transactions on
Knowledge and Data Engineering 19 (8), 1026–1041.
Jolliffe, I. T., 2002. Principal Component Analysis, 2nd Edition. Springer Series in
Statistics. Springer, New York.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., Saul, L. K., 1998. An introduction
to variational methods for graphical models. In: Jordan, M. I. (Ed.), Learning
in Graphical Models. Adaptive Computation and Machine Learning. MIT, Cam-
bridge, MA, pp. 105–162.
Kahle, L. R., 1983. Social Values and Social Change: Adaptation to Life in America.
Praeger, New York.
Kahle, L. R., Beatty, S. E., Homer, P., 1986. Alternative measurement approaches
to consumer values: the list of values (LOV) and values and life style
(VALS). Journal of Consumer Research 13 (3), 405.
Kahle, L. R., Liu, R., Watkins, H., 1992. Psychographic variation across the United
States geographic regions. Advances in Consumer Research 19 (1), 346–352.
Kailing, K., Kriegel, H., Kroger, P., 2004. Density-connected subspace clustering for
high-dimensional data. In: Berry, M. W., Dayal, U., Kamath, C., Skillicorn, D. B.
(Eds.), Proceedings of the Fourth SIAM International Conference on Data Mining.
SIAM, Lake Buena Vista, FL, pp. 246–257.
Kailing, K., Kriegel, H.-P., Kroger, P., Wanka, S., 2003. Ranking interesting subspaces
for clustering high dimensional data. In: Lavrac, N., Blockeel, D. G. H., Todorovski,
L. (Eds.), Proceedings of the Seventh European Conference on Principles and Prac-
tice of Knowledge Discovery in Databases. Springer, Cavtat-Dubrovnik, Croatia,
pp. 241–252.
Kalpakis, K., Gada, D., Puttagunta, V., 2001. Distance measures for effective cluster-
ing of ARIMA time-series. In: Cercone, N., Lin, T. Y., Wu, X. (Eds.), Proceedings of
the 2001 IEEE International Conference on Data Mining. IEEE, San Jose, CA, pp.
273–280.
Kamakura, W. A., Ramaswami, S. N., Srivastava, R. K., 1991. Applying latent trait anal-
ysis in the evaluation of prospects for cross-selling of financial services. Interna-
tional Journal of Research in Marketing 8 (4), 329–349.
Kandogan, E., 2001. Visualizing multi-dimensional clusters, trends, and outliers us-
ing star coordinates. In: Proceedings of the Seventh ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, San Francisco, CA,
pp. 107–116.
Karlis, D., Xekalaki, E., 2003. Choosing initial values for the EM algorithm for finite
mixtures. Computational Statistics & Data Analysis 41 (3-4), 577–590.
Karras, P., Mamoulis, N., 2005. One-pass wavelet synopses for maximum-error met-
rics. In: Bohm, K., Jensen, C. S., Haas, L. M., Kersten, M. L., Larson, P.-A., Ooi, B. C.
(Eds.), Proceedings of the 31st International Conference on Very Large Data Bases.
ACM, Trondheim, Norway, pp. 421–432.
Karypis, G., Han, E.-H., Kumar, V., 1999. CHAMELEON: hierarchical clustering us-
ing dynamic modeling. Computer 32 (8), 68–75.
Kass, R. E., Raftery, A. E., 1995. Bayes factors. Journal of the American Statistical As-
sociation 90 (430), 773–795.
Kaufman, L., Rousseeuw, P. J., 1990. Finding Groups in Data: An Introduction to Clus-
ter Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley, New
York.
Keaveney, S. M., Parthasarathy, M., 2001. Customer switching behavior in online ser-
vices: an exploratory study of the role of selected attitudinal, behavioral, and de-
mographic factors. Journal of the Academy of Marketing Science 29 (4), 374–390.
Keogh, E. J., Chakrabarti, K., Pazzani, M., Mehrotra, S., 2001. Dimensionality reduc-
tion for fast similarity search in large time series databases. Knowledge and Infor-
mation Systems 3 (3), 263–286.
Keogh, E. J., Kasetty, S., 2003. On the need for time series data mining benchmarks: a
survey and empirical demonstration. Data Mining and Knowledge Discovery 7 (4),
349–371.
Keogh, E. J., Lin, J., Fu, A. W.-C., 2005. Hot SAX: efficiently finding the most unusual
time series subsequence. In: Proceedings of the Fifth IEEE International Confer-
ence on Data Mining. IEEE, Houston, TX, pp. 226–233.
Keogh, E. J., Lin, J., Truppel, W., 2003. Clustering of time series subsequences is
meaningless: implications for previous and future research. In: Proceedings of the
Third IEEE International Conference on Data Mining. IEEE, Melbourne, FL, pp.
115–122.
Keogh, E. J., Lonardi, S., Ratanamahatana, C., 2004. Towards parameter-free data
mining. In: Kim, W., Kohavi, R., Gehrke, J., DuMouchel, W. (Eds.), Proceedings of
the Tenth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining. ACM, Seattle, WA, pp. 206–215.
Keogh, E. J., Ratanamahatana, C., 2005. Exact indexing of dynamic time warping.
Knowledge and Information Systems 7 (3), 358–386.
Kifer, D., Ben-David, S., Gehrke, J., 2004. Detecting change in data streams. In: Nasci-
mento, M. A., Ozsu, M. T., Kossmann, D., Miller, R. J., Blakeley, J. A., Schiefer, K. B.
(Eds.), Proceedings of the Thirtieth International Conference on Very Large Data
Bases. Morgan Kaufmann, Toronto, ON, Canada, pp. 180–191.
Kim, D., Yum, B.-J., 2005. Collaborative filtering based on iterative principal compo-
nent analysis. Expert Systems with Applications 28 (4), 823–830.
Kleinberg, J. M., 2002. An impossibility theorem for clustering. In: Becker, S., Thrun,
S., Obermayer, K. (Eds.), Proceedings of the 2002 Neural Information Processing
Systems. MIT, Vancouver, BC, Canada, pp. 446–453.
Kleinberg, J. M., 2003. Bursty and hierarchical structure in streams. Data Mining and
Knowledge Discovery 7 (4), 373–397.
Knox, S., 1998. Loyalty-based segmentation and the customer development process.
European Management Journal 16 (6), 729–737.
Kogan, J., 2007. Introduction to Clustering Large and High-Dimensional Data. Cam-
bridge University, Cambridge, UK.
Kogan, J., Nicholas, C. K., Teboulle, M., 2006. Grouping Multidimensional Data: Re-
cent Advances in Clustering. Springer, New York.
Konig, A., Gratz, A., 2004. Advanced methods for the analysis of semiconductor man-
ufacturing process data. In: Pal, N. R., Jain, L. C. (Eds.), Advanced Techniques in
Knowledge Discovery and Data Mining. Advanced Information and Knowledge
Processing. Springer, New York.
Kotler, P., 1991. Marketing Management: Analysis, Planning, and Control. Prentice-
Hall, Englewood Cliffs, NJ.
Kotler, P., Armstrong, G., 2009. Principles of Marketing, 13th Edition. Pearson, Upper
Saddle River, NJ.
Kriegel, H.-P., Kroger, P., Renz, M., Wurst, S. H. R., 2005. A generic framework for ef-
ficient subspace clustering of high-dimensional data. In: Proceedings of the Fifth
IEEE International Conference on Data Mining. IEEE, Houston, TX, pp. 250–257.
Kriegel, H.-P., Kroger, P., Schubert, E., Zimek, A., 2008. A general framework for
increasing the robustness of PCA-based correlation clustering algorithms. In:
Proceedings of the 20th international conference on Scientific and Statistical
Database Management. Springer-Verlag, Berlin, Germany, pp. 418–435.
Kriegel, H.-P., Kroger, P., Zimek, A., 2009. Clustering high-dimensional data: A survey
on subspace clustering, pattern-based clustering, and correlation clustering. ACM
Transactions on Knowledge Discovery from Data 3 (1), 1–58.
Krishnamurthy, B., Sen, S., Zhang, Y., Chen, Y., 2003. Sketch-based change detection:
methods, evaluation, and applications. In: Proceedings of the Third ACM SIG-
COMM Conference on Internet Measurement. ACM, Miami Beach, FL, pp. 234–
247.
Kuiper, N. H., 1962. Tests concerning random points on a circle. Proceedings of the
Koninklijke Nederlandse Akademie van Wetenschappen, Series A 63, 38–47.
Kumar, N., Lolla, V. N., Keogh, E. J., Lonardi, S., Ratanamahatana, C. A., 2005. Time-
series bitmaps: a practical visualization tool for working with large time series
databases. In: Proceedings of the 2005 SIAM International Data Mining Confer-
ence. Newport Beach, CA.
Kumar, V., Petersen, J. A., Leone, R. P., 2007. How valuable is word of mouth? Harvard
Business Review 85 (10), 139–146.
Kumar, V., Venkatesan, R., Reinartz, W., 2006. Knowing what to sell, when, and to
whom. Harvard Business Review 84 (3), 131–137.
Lange, T., Roth, V., Braun, M., Buhmann, J., 2004. Stability-based validation of clus-
tering solutions. Neural Computation 16 (6), 1299–1323.
Larson, J. S., Bradlow, E. T., Fader, P. S., 2005. An exploratory look at supermarket
shopping paths. International Journal of Research in Marketing 22 (4), 395 – 414.
Last, M., Klein, Y., Kandel, A., 2001. Knowledge discovery in time series databases.
IEEE Transactions on Systems, Man, and Cybernetics, Part B 31 (1), 160–169.
Lastovicka, J. L., 1982. On the validation of lifestyle traits: a review and illustration.
Journal of Marketing Research 19 (1), 126–138.
Lastovicka, J. L., Murry Jr., J. P., Joachimsthaler, E. A., 1990. Evaluating the measure-
ment validity of lifestyle typologies. Journal of Marketing Research 27 (1), 11–23.
Lee, H.-Y., Ong, H.-L., 1996. Visualization support for data mining. IEEE Expert 11 (5),
69–75.
Lee, J.-H., Kim, D.-H., Chung, C.-W., 1999. Multi-dimensional selectivity estimation
using compressed histogram information. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S.
(Eds.), Proceedings of the 1999 ACM SIGMOD International Conference on Man-
agement of Data. ACM, Philadelphia, PA, pp. 205–214.
Lee, P. M., 2004. Bayesian Statistics: An Introduction, 3rd Edition. Hodder Arnold,
London.
Lee, W.-P., Liu, C.-H., Lu, C.-C., 2002. Intelligent agent-based systems for personal-
ized recommendations in internet commerce. Expert Systems with Applications
22 (4), 275–284.
Lees, K., Roberts, S., Skamnioti, P., Gurr, S., 2007. Gene microarray analysis using
angular distribution decomposition. Journal of Computational Biology 14 (1), 68–
83.
Leigh, A., Wolfers, J., 2006. Competing approaches to forecasting elections: eco-
nomic models, opinion polling and prediction markets. Economic Record 82 (258),
325–340.
Lemmens, A., Croux, C., 2006. Bagging and boosting classification trees to predict
churn. Journal of Marketing Research 43 (2), 276–286.
Lemon, K. N., White, T. B., Winer, R. S., 2002. Dynamic customer relationship man-
agement: incorporating future considerations into the service retention decision.
Journal of Marketing 66 (1), 1–14.
Leone, L., Perugini, M., Ercolani, A. P., 1999. A comparison of three models of
attitude-behavior relationships in the studying behavior domain. European Jour-
nal of Social Psychology 29 (2-3), 161–189.
Levy, A., 1999. Semi-parametric estimates of demand for intra-LATA telecommuni-
cations. In: Loomis, D. G., Taylor, L. D. (Eds.), The Future of the Telecommunica-
tions Industry: Forecasting and Demand Analysis. Kluwer, Boston, MA, pp. 115–
124.
LGR Telecommunications, 05 Jun 2008a. CDRInsight.
LGR Telecommunications, 05 Jun 2008b. CDRLive.
Li, S.-T., Shue, L.-Y., Lee, S.-F., 2006. Enabling customer relationship management
in ISP services through mining usage patterns. Expert Systems with Applications
30 (4), 621–632.
Li, W., McCallum, A., 2006. Pachinko allocation: DAG-structured mixture models of
topic correlations. In: Cohen, W. W., Moore, A. (Eds.), Proceedings of the 23rd In-
ternational Conference on Machine Learning. ACM, Pittsburgh, PA, pp. 577–584.
Li, Y., Han, J., Yang, J., 2004. Clustering moving objects. In: Kim, W., Kohavi, R.,
Gehrke, J., DuMouchel, W. (Eds.), Proceedings of the Tenth ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining. ACM, Seattle, WA,
pp. 617–622.
Li, Y., Lu, L., Li, X., 2005. A hybrid collaborative filtering method for multiple-interests
and multiple-content recommendation in e-commerce. Expert Systems with Ap-
plications 28 (1), 67–77.
Lin, J., Keogh, E. J., Lonardi, S., Chiu, B. Y.-c., 2003. A symbolic representation of time
series, with implications for streaming algorithms. In: Zaki, M. J., Aggarwal, C. C.
(Eds.), Proceedings of the Eighth ACM SIGMOD Workshop on Research Issues in
Data Mining and Knowledge Discovery. ACM, San Diego, CA, pp. 2–11.
Lin, J., Keogh, E. J., Wei, L., Lonardi, S., 2007. Experiencing SAX: a novel symbolic
representation of time series. Data Mining and Knowledge Discovery 15 (2), 107–
144.
Lin, J., Vlachos, M., Keogh, E. J., Gunopulos, D., 2004. Iterative incremental clustering
of time series. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides,
V., Koubarakis, M., Bohm, K., Ferrari, E. (Eds.), Proceedings of the Ninth In-
ternational Conference on Extending Database Technology. Springer, Heraklion,
Greece, pp. 106–122.
Littau, D., Boley, D., 2006a. Clustering very large data sets with principal direction
divisive partitioning. In: Kogan, J., Nicholas, C., Teboulle, M. (Eds.), Grouping Mul-
tidimensional Data: Recent Advances in Clustering. Springer, New York.
Littau, D., Boley, D., 2006b. Streaming data reduction using low-memory factored
representations. Information Sciences 176 (14), 2016–2041.
Little, E., Marandi, E., 2003. Relationship Marketing Management. Thomson, Lon-
don.
Liu, A. H., Leach, M. P., Bernhardt, K. L., 2005. Examining customer value percep-
tions of organizational buyers when sourcing from multiple vendors. Journal of
Business Research 58 (5), 559–568.
Liu, B., Xia, Y., Yu, P. S., 2000. Clustering through decision tree construction. In: Pro-
ceedings of the Ninth International Conference on Information and Knowledge
Management. ACM, New York, pp. 20–29.
Liu, G., Li, J., Sim, K., Wong, L., 2007. Distance based subspace clustering with flexi-
ble dimension partitioning. In: Proceedings of the 23rd International Conference
on Data Engineering. IEEE, Istanbul, Turkey, pp. 1250–1254.
Liu, G., Sim, K., Li, J., Wong, L., 2009. Efficient mining of distance-based subspace
clusters. Statistical Analysis and Data Mining 2 (5-6), 427–444.
Liu, T., Bahl, P., Chlamtac, I., 1998. Mobility modeling, location tracking, and tra-
jectory prediction in wireless ATM networks. IEEE Journal on Selected Areas in
Communications 16 (6), 922–936.
Lloyd, A., 2005. The grid and CRM: From ‘if’ to ‘when’? Telecommunications Policy
29 (2-3), 153–172.
MacEachern, S. N., Clyde, M., Liu, J., 1999. Sequential importance sampling for
nonparametric Bayes models: the next generation. The Canadian Journal of Statis-
tics 27 (2), 251–267.
MacEachern, S. N., Muller, P., 1998. Estimating mixture of Dirichlet process models.
Journal of Computational and Graphical Statistics 7 (2), 223–238.
MacKay, D. J. C., 1995. Probable networks and plausible predictions - a review of
practical Bayesian methods for supervised neural networks. Network: Computation
in Neural Systems 6 (3), 469–505.
MacKay, D. J. C., 1998. Choice of basis for Laplace approximation. Machine Learning
33 (1), 77–86.
MacQueen, J. B., 1967. Some methods for classification and analysis of multivariate
observations. In: Le Cam, L. M., Neyman, J. (Eds.), Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability. University of California,
Berkeley, CA, pp. 281–297.
Madeira, S. C., Oliveira, A. L., 2004. Biclustering algorithms for biological data analy-
sis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformat-
ics 1 (1), 24–45.
Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., Ridgeway, G., 2002.
Likelihood-based data squashing: a modeling approach to instance construction.
Data Mining and Knowledge Discovery 6 (2), 173–190.
Madigan, D., Ridgeway, G., 2003. Bayesian data analysis. In: Ye, N. (Ed.), The Hand-
book of Data Mining. Human Factors and Ergonomics. Lawrence Erlbaum Asso-
ciates, Mahwah, NJ.
Mahalanobis, P. C., 1936. On the generalized distance in statistics. Proceedings of the
National Institute of Sciences of India 2 (1), 49–55.
Mamoulis, N., Cao, H., Kollios, G., Hadjieleftheriou, M., Tao, Y., Cheung, D. W., 2004.
Mining, indexing, and querying historical spatiotemporal data. In: Kim, W., Ko-
havi, R., Gehrke, J., DuMouchel, W. (Eds.), Proceedings of the Tenth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. ACM, Seat-
tle, WA, pp. 236–245.
Manku, G. S., Rajagopalan, S., Lindsay, B. G., 1998. Approximate medians and other
quantiles in one pass and with limited memory. In: Haas, L. M., Tiwary, A. (Eds.),
Proceedings of the 1998 ACM SIGMOD International Conference on Management
of Data. ACM, Seattle, WA, pp. 426–435.
Manku, G. S., Rajagopalan, S., Lindsay, B. G., 1999. Random sampling techniques for
space efficient online computation of order statistics of large datasets. In: Delis,
A., Faloutsos, C., Ghandeharizadeh, S. (Eds.), Proceedings of the 1999 ACM SIG-
MOD International Conference on Management of Data. ACM, Philadelphia, PA,
pp. 251–262.
Manolopoulos, Y., Nanopoulos, A., Papadopoulos, A., Theodoridis, Y., 2005. R-
Trees: Theory and Applications. Advanced Information and Knowledge Process-
ing. Springer, London.
Mao, J., Jain, A. K., 1996. A self-organizing network for hyperellipsoidal clustering
(HEC). IEEE Transactions on Neural Networks 7 (1), 16–29.
Mardia, K. V., Jupp, P. E., 2000. Directional Statistics, 2nd Edition. Wiley Series in
Probability and Statistics. Wiley, Chichester, UK.
Marin, J.-M., Robert, C. P., 2007. Bayesian Core: A Practical Approach to Computa-
tional Bayesian Statistics. Springer Texts in Statistics. Springer, New York.
Marinucci, M., Perez-Amaral, T., 2005. Econometric modeling of business telecom-
munications demand using RETINA and finite mixtures. Tech. rep., Facultad
de Ciencias Economicas y Empresariales, Universidad Complutense de Madrid,
Madrid, Spain.
Marron, J. S., Wand, M. P., 1992. Exact mean integrated squared error. The Annals of
Statistics 20 (2), 712–736.
Maslow, A. H., 1954. Motivation and personality. Harper’s Psychological Series.
HarperCollins, New York.
Matias, Y., Urieli, D., 2005. Optimal workload-based weighted wavelet synopses.
In: Eiter, T., Libkin, L. (Eds.), Proceedings of the Tenth International Conference
on Database Theory. Vol. 3363. Springer, Edinburgh, UK, pp. 368–382.
Matias, Y., Vitter, J. S., Wang, M., 1998. Wavelet-based histograms for selectivity esti-
mation. In: Haas, L. M., Tiwary, A. (Eds.), Proceedings of the 1998 ACM SIGMOD
International Conference on Management of Data. ACM, Seattle, WA, pp. 448–459.
Matias, Y., Vitter, J. S., Wang, M., 2000. Dynamic maintenance of wavelet-based his-
tograms. In: Abbadi, A. E., Brodie, M. L., Chakravarthy, S., Dayal, U., Kamel, N.,
Schlageter, G., Whang, K.-Y. (Eds.), Proceedings of the 26th International Confer-
ence on Very Large Data Bases. Morgan Kaufmann, Cairo, Egypt, pp. 101–110.
Maugis, C., Celeux, G., Martin-Magniette, M.-L., 2009. Variable selection for cluster-
ing with Gaussian mixture models. Biometrics 65 (3), 701–709.
McCallum, A., Nigam, K., Ungar, L. H., 2000. Efficient clustering of high-dimensional
data sets with application to reference matching. In: Proceedings of the Sixth
ACM SIGKDD International Conference on Knowledge Discovery and Data Min-
ing. ACM, Boston, MA, pp. 169–178.
McCarthy, E. J., 1978. Basic Marketing: A Managerial Approach, 6th Edition. Richard
D. Irwin, Homewood, IL.
McDonald, M., Dunbar, I., 2004. Market segmentation: how to do it, how to profit
from it, 3rd Edition. Elsevier, Oxford, UK.
McGrory, C. A., 2006. Variational approximations in Bayesian model selection. Ph.D.
thesis, University of Glasgow.
McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian
model selection for finite mixture distributions. Computational Statistics & Data
Analysis 51 (11), 5352–5367.
McHugh, R. B., 1956. Efficient estimation and local identification in latent class anal-
ysis. Psychometrika 21 (4), 331–347.
McLachlan, G., Krishnan, T., 2008. The EM Algorithm and Extensions, 2nd Edition.
Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ.
McLachlan, G. J., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability
and Statistics. Wiley, New York.
McLachlan, G. J., Peel, D., Basford, K. E., Adams, P., 1999. The EMMIX software for
the fitting of mixtures of normal and t-components. Journal of Statistical Software
4 (2), 1–4.
McLachlan, G. J., Peel, D., Bean, R. W., 2003. Modelling high-dimensional data by
mixtures of factor analyzers. Computational Statistics & Data Analysis 41 (3), 379–
388.
McVinish, R., Mengersen, K., 2008. Semiparametric Bayesian circular statistics.
Computational Statistics & Data Analysis 52 (10), 4722–4730.
Mengersen, K., Robert, C., 1994. Testing for mixtures: a Bayesian entropy approach.
In: Bernardo, J. M., Berger, J. O., Dawid, A. P., Smith, A. F. M. (Eds.), Proceedings of
the Fifth Valencia International Meeting. Clarendon, Alicante, Spain.
Mengersen, K. L., Tweedie, R. L., 1996. Rates of convergence of the Hastings and
Metropolis algorithms. The Annals of Statistics 24 (1), 101–121.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., Teller, E., 1953.
Equation of state calculations by fast computing machines. Journal of Chemical
Physics 21 (6), 1087–1092.
Milligan, G., Cooper, M., 1985. An examination of procedures for determining the
number of clusters in a data set. Psychometrika 50 (2), 159–179.
Milligan, G. W., 1980. An examination of the effect of six types of error perturbation
on fifteen clustering algorithms. Psychometrika 45 (3), 325–342.
Minka, T. P., 2001. Expectation propagation for approximate Bayesian inference. In:
Breese, J. S., Koller, D. (Eds.), Proceedings of the 17th Conference in Uncertainty in
Artificial Intelligence. Morgan Kaufmann, Seattle, WA, pp. 362–369.
Minka, T. P., Ghahramani, Z., 2003. Expectation propagation for infinite mixtures.
In: NIPS’03 Workshop on Nonparametric Bayesian Methods and Infinite Models.
Whistler, BC, Canada.
Mitchell, A., 1983. The Nine American Lifestyles. Macmillan, New York.
Mitussis, D., O’Malley, L., Patterson, M., 2006. Mapping the re-engagement of CRM
with relationship marketing. European Journal of Marketing 40 (5-6), 572–589.
Moise, G., Sander, J., 2008. Finding non-redundant, statistically significant regions
in high dimensional data: a novel approach to projected and subspace clustering.
In: Li, Y., Liu, B., Sarawagi, S. (Eds.), Proceedings of the 14th ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining. ACM, Las Vegas,
NV, pp. 533–541.
Moise, G., Sander, J., Ester, M., 2008. Robust projected clustering. Knowledge and
Information Systems 14 (3), 273–298.
Morgan, R. M., Hunt, S. D., 1994. The commitment-trust theory of relationship mar-
keting. Journal of Marketing 58 (3), 20–38.
Mozer, M. C., Wolniewicz, R., Grimes, D. B., Johnson, E., Kaushansky, H., 2000. Pre-
dicting subscriber dissatisfaction and improving retention in the wireless telecom-
munications industry. IEEE Transactions on Neural Networks 11 (3), 690–696.
Muralikrishna, M., DeWitt, D. J., 1988. Equi-depth histograms for estimating selec-
tivity factors for multi-dimensional queries. In: Boral, H., Larson, P.-A. (Eds.), Pro-
ceedings of the 1988 ACM SIGMOD International Conference on Management of
Data. ACM, Chicago, IL, pp. 28–36.
Murray, K. B., 1991. A test of services marketing theory: consumer information ac-
quisition activities. Journal of Marketing 55 (1), 10–25.
Murtagh, F., Starck, J.-L., Berry, M. W., 2000. Overcoming the curse of dimensionality
in clustering by means of the wavelet transform. The Computer Journal 43 (2), 107–
120.
Muthukrishnan, S., 2005. Data streams: Algorithms and applications. Foundations
and Trends in Theoretical Computer Science 1 (2), 117–236.
Muthukrishnan, S., Poosala, V., Suel, T., 1999. On rectangular partitionings in two
dimensions: Algorithms, complexity, and applications. In: Beeri, C., Buneman, P.
(Eds.), Proceedings of the Seventh International Conference on Database Theory.
Vol. 1540. Springer, Jerusalem, Israel, pp. 236–256.
Muthukrishnan, S., Strauss, M., 2004. Approximate histogram and wavelet sum-
maries of streaming data. Tech. Rep. TR: 2004-52, DIMACS, The State University
of New Jersey, Piscataway, NJ.
Nagesh, H. S., Goil, S., Choudhary, A. N., 2000. Adaptive grids for clustering massive
data sets. In: Proceedings of the 2000 International Conference on Parallel Pro-
cessing. IEEE, Toronto, ON, Canada, pp. 477–484.
Nanopoulos, A., Alcock, R., Manolopoulos, Y., 2001. Feature-based classification of
time-series data. In: Information Processing and Technology. Nova, New York, pp.
49–61.
Neal, R. M., 2000. Markov chain sampling methods for Dirichlet process mixture
models. Journal of Computational and Graphical Statistics 9 (2), 249–265.
Neal, R. M., Hinton, G. E., 1998. A view of the EM algorithm that justifies incremental,
sparse, and other variants. In: Jordan, M. I. (Ed.), Learning in Graphical Models.
MIT Press, Cambridge, MA, pp. 355–368.
Neill, D. B., Moore, A. W., Sabhnani, M., Daniel, K., 2005. Detection of emerging
space-time clusters. In: Grossman, R., Bayardo, R. J., Bennett, K. P. (Eds.), Proceedings
of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery
in Data Mining. ACM, Chicago, IL, pp. 218–227.
Neslin, S. A., Gupta, S., Kamakura, W., Junxiang, L., Mason, C. H., 2006. Defection de-
tection: measuring and understanding the predictive accuracy of customer churn
models. Journal of Marketing Research 43 (2), 204–211.
Newcomb, S., 1886. A generalized theory of the combination of observations so as to
obtain the best result. American Journal of Mathematics 8 (4), 343–366.
Ng, E. K. K., Fu, A. W.-C., Wong, R. C.-W., 2005. Projective clustering by histograms.
IEEE Transactions on Knowledge and Data Engineering 17 (3), 369–383.
Ng, R. T., Han, J., 1994. Efficient and effective clustering methods for spatial data
mining. In: Bocca, J. B., Jarke, M., Zaniolo, C. (Eds.), Proceedings of the 20th In-
ternational Conference on Very Large Data Bases. Morgan Kaufmann, Santiago de
Chile, Chile, pp. 144–155.
Niraj, R., Gupta, M., Narasimhan, C., 2001. Customer profitability in a supply chain.
Journal of Marketing 65 (3), 1–16.
Novak, T. P., MacEvoy, B., 1990. On comparing alternative segmentation schemes: the
list of values (LOV) and values and life styles (VALS). Journal of Consumer Research
17 (1), 105–109.
Nurmi, P., Koolwaaij, J., 2006. Identifying meaningful locations. In: Proceedings of
the Third Annual International Conference on Mobile and Ubiquitous Systems:
Networks and Services. IEEE, San Jose, CA, pp. 1–8.
O’Callaghan, L., Meyerson, A., Motwani, R., Mishra, N., Guha, S., 2002. Streaming-
data algorithms for high-quality clustering. In: Proceedings of the 18th Interna-
tional Conference on Data Engineering. IEEE, San Jose, CA, pp. 685–696.
O’Malley, L., Patterson, M., Evans, M. J., 1999. Exploring Direct Marketing. Thomson,
London, UK.
Opper, M., Saad, D. (Eds.), 2001. Advanced Mean Field Methods: Theory and Prac-
tice. MIT Press, Cambridge, MA.
Ordonez, C., Omiecinski, E., 2004. Efficient disk-based k-means clustering for rela-
tional databases. IEEE Transactions on Knowledge and Data Engineering 16 (8),
909–921.
Ordonez, C., Omiecinski, E., 2005. Accelerating EM clustering to find high-quality
solutions. Knowledge and Information Systems 7 (2), 135–157.
Ouellette, J. A., Wood, W., 1998. Habit and intention in everyday life: the multiple
processes by which past behavior predicts future behavior. Psychological Bulletin
124 (1), 54–74.
Owen, A. B., 2003. Data squashing by empirical likelihood. Data Mining and Knowl-
edge Discovery 7 (1), 101–113.
Palpanas, T., Vlachos, M., Keogh, E. J., Gunopulos, D., Truppel, W., 2004. Online am-
nesic approximation of streaming time series. In: Proceedings of the 20th Interna-
tional Conference on Data Engineering. IEEE, Boston, MA, pp. 338–349.
Papadimitriou, S., Sun, J., Faloutsos, C., 2007. Dimensionality reduction and fore-
casting on streams. In: Aggarwal, C. C. (Ed.), Data Streams: Models and Algo-
rithms. Advances in Database Systems. Springer, New York.
Park, B.-H., Kargupta, H., 2003. Distributed data mining. In: Ye, N. (Ed.), The Hand-
book of Data Mining. Human Factors and Ergonomics. Lawrence Erlbaum Asso-
ciates, Mahwah, NJ.
Parsons, L., Haque, E., Liu, H., 2004. Subspace clustering for high dimensional data:
a review. SIGKDD Explorations 6 (1), 90–105.
Parthasarathy, S., Ghoting, A., Otey, M. E., 2007. A survey of distributed mining of
data streams. In: Aggarwal, C. C. (Ed.), Data Streams: Models and Algorithms. Ad-
vances in Database Systems. Springer, New York.
Patrikainen, A., Meila, M., 2006. Comparing subspace clusterings. IEEE Transactions
on Knowledge and Data Engineering 18 (7), 902–916.
Payne, A., Christopher, M., Clark, M., Peck, H., 1998. Relationship Marketing
for Competitive Advantage: Winning and Keeping Customers, 2nd Edition.
Butterworth-Heinemann, Oxford, UK.
Pearson, K., 1894. Contributions to the mathematical theory of evolution. Philosophical
Transactions of the Royal Society of London 185, 71–110.
Pena, D., Prieto, F. J., 2001. Multivariate outlier detection and robust covariance matrix
estimation. Technometrics 43 (3), 286–310.
Peng, Z. K., Chu, F. L., 2004. Application of the wavelet transform in machine con-
dition monitoring and fault diagnostics: a review with bibliography. Mechanical
Systems and Signal Processing 18 (2), 199–221.
Pennell, M. L., Dunson, D. B., 2007. Fitting semiparametric random effects models
to large data sets. Biostatistics 8 (4), 821–834.
Peppers, D., Rogers, M., Dorf, B., 1999. Is your company ready for one-to-one mar-
keting? Harvard Business Review 77 (1), 151–160.
Perkins, C. E., 2001. Ad Hoc Networking. Addison-Wesley, Boston, MA.
Perlman, E., Java, A., 2003. Predictive mining of time series data in astronomy. In:
Astronomical Data Analysis Software and Systems XII ASP Conference Series. Vol.
295. pp. 431–434.
Peter, J. P., Olson, J. C., 1983. Is science marketing? Journal of Marketing 47 (4), 111–
125.
Pewsey, A., 2008. The wrapped stable family of distributions as a flexible model for
circular data. Computational Statistics & Data Analysis 52 (3), 1516–1523.
Pham, D. T., Dimov, S. S., Nguyen, C. D., 2004. An incremental k-means algorithm.
Proceedings of the Institution of Mechanical Engineers, Part C 218 (7), 783–795.
Piatetsky-Shapiro, G., Connell, C., 1984. Accurate estimation of the number of tuples
satisfying a condition. In: Yormark, B. (Ed.), Proceedings of the 1984 ACM SIGMOD
International Conference on Management of Data. ACM, Boston, MA, pp. 256–276.
Pizzuti, C., Talia, D., 2003. P-AutoClass: scalable parallel clustering for mining large
data sets. IEEE Transactions on Knowledge and Data Engineering 15 (3), 629–641.
Polymenis, A., Titterington, D. M., 1998. On the determination of the number of com-
ponents in a mixture. Statistics & Probability Letters 38 (4), 295–298.
Poosala, V., Ganti, V., 1999. Fast approximate answers to aggregate queries on a data
cube. In: Ozsoyoglu, Z. M., Ozsoyoglu, G., Hou, W.-C. (Eds.), Proceedings of the
11th International Conference on Scientific and Statistical Database Management.
IEEE, Cleveland, OH, pp. 24–33.
Poosala, V., Ioannidis, Y. E., 1997. Selectivity estimation without the attribute value
independence assumption. In: Carey, M. J., Dittrich, K. R., Lochovsky, F. H.,
Loucopoulos, P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd International Con-
ference on Very Large Data Bases. Morgan Kaufmann, Athens, Greece, pp. 486–495.
Poosala, V., Ioannidis, Y. E., Haas, P. J., Shekita, E. J., 1996. Improved histograms for
selectivity estimation of range predicates. In: Jagadish, H. V., Mumick, I. S. (Eds.),
Proceedings of the 1996 ACM SIGMOD International Conference on Management
of Data. ACM, Montreal, QC, Canada, pp. 294–305.
Priebe, C. E., 1994. Adaptive mixtures. Journal of the American Statistical Association
89 (427), 796–806.
Procopiuc, C. M., Jones, M., Agarwal, P. K., Murali, T. M., 2002. A Monte Carlo algo-
rithm for fast projective clustering. In: Franklin, M. J., Moon, B., Ailamaki, A. (Eds.),
Proceedings of the 2002 ACM SIGMOD International Conference on Management
of Data. ACM, Madison, WI, pp. 418–427.
Punj, G., Stewart, D. W., 1983. Cluster analysis in marketing research: review and
suggestions for application. Journal of Marketing Research 20 (2), 134–148.
Rachev, S. T., Hsu, J. S. J., Bagasheva, B. S., Fabozzi, F. J., 2008. Bayesian methods in
finance. The Frank J. Fabozzi Series. John Wiley & Sons, Inc., Hoboken, NJ.
Raftery, A. E., 1996. Hypothesis testing and model selection. In: Gilks, W. R., Richardson,
S., Spiegelhalter, D. J. (Eds.), Markov Chain Monte Carlo in Practice. Chapman
& Hall, London, pp. 163–188 (Chapter 10).
Raftery, A. E., Dean, N., 2006. Variable selection for model-based clustering. Journal
of the American Statistical Association 101 (473), 168–178.
Ratanamahatana, C., Keogh, E. J., Bagnall, A. J., Lonardi, S., 2005. A novel bit level
time series representation with implication of similarity search and clustering.
In: Ho, T. B., Cheung, D. W.-L., Liu, H. (Eds.), Proceedings of the Ninth Pacific-
Asia Conference Advances in Knowledge Discovery and Data Mining. Vol. 3518.
Springer, Hanoi, Vietnam, pp. 771–777.
Reichheld, F. F., 1996. The Loyalty Effect: The Hidden Force Behind Growth, Profits,
and Lasting Value. Harvard Business School, Boston, MA.
Reichheld, F. F., Sasser Jr., W. E., 1990. Zero defections: quality comes to services.
Harvard Business Review 68 (5), 105–111.
Reinartz, W. J., Kumar, V., 2000. On the profitability of long-life customers in a non-
contractual setting: an empirical investigation and implications for marketing.
Journal of Marketing 64 (4), 17–35.
Richardson, S., Green, P. J., 1997. On Bayesian analysis of mixtures with an unknown
number of components (with discussion). Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 59 (4), 731–792.
Richins, M. L., 1983. Negative word-of-mouth by dissatisfied consumers: a pilot
study. Journal of Marketing 47 (1), 68–78.
Ridgeway, G., Madigan, D., 2002. Bayesian analysis of massive datasets via particle
filters. In: Proceedings of the Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, Edmonton, AB, Canada, pp. 5–13.
Ridgeway, G., Madigan, D., 2003. A sequential Monte Carlo method for Bayesian
analysis of massive datasets. Data Mining and Knowledge Discovery 7 (3), 301–
319.
Rigby, D. K., Ledingham, D., 2004. CRM done right. Harvard Business Review 82 (11),
118–129.
Rigby, D. K., Reichheld, F. F., Schefter, P., 2002. Avoid the four perils of CRM. Harvard
Business Review 80 (2), 101–109.
Rissanen, J., 1983. A universal prior for integers and estimation by minimum descrip-
tion length. Annals of Statistics 11 (2), 416–431.
Robert, C. P., Casella, G., 1999. Monte Carlo statistical methods. Springer Texts in
Statistics. Springer, New York.
Groves, R. M., 2006. Nonresponse rates and nonresponse bias in household surveys.
Public Opinion Quarterly 70 (5), 646–675.
Roberts, S. J., Husmeier, D., Rezek, I., Penny, W., 1998. Bayesian approaches to Gaus-
sian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence 20 (11), 1133–1142.
Roeder, K., Wasserman, L., 1997. Practical Bayesian density estimation using mix-
tures of normals. Journal of the American Statistical Association 92 (439), 894–902.
Rokeach, M., 1973. The Nature of Human Values. Free, New York.
Rossi, P. E., Allenby, G. M., McCulloch, R., 2005. Bayesian Statistics and Marketing.
John Wiley & Sons, Inc.
Rue, H., Martino, S., Chopin, N., 2009. Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of
the Royal Statistical Society: Series B (Statistical Methodology) 71 (2), 319–392.
Rust, R. T., Verhoef, P. C., 2005. Optimizing the marketing interventions mix in
intermediate-term CRM. Marketing Science 24 (3), 477–489.
Sakurai, Y., Papadimitriou, S., Faloutsos, C., 2005. BRAID: Stream mining through
group lag correlations. In: Ozcan, F. (Ed.), Proceedings of the 2005 ACM SIGMOD
International Conference on Management of Data. ACM, Baltimore, MD, pp. 599–
610.
Sander, J., Ester, M., Kriegel, H.-P., Xu, X., 1998. Density-based clustering in spatial
databases: the algorithm GDBSCAN and its applications. Data Mining and Knowl-
edge Discovery 2 (2), 169–194.
SAS Institute Inc., 1983. Cubic clustering criterion. Tech. Rep. SAS Technical Report
A-108, SAS Institute Inc., Cary, NC.
Schervish, M. J., 1996. P values: what they are and what they are not. The American
Statistician 50 (3), 203–206.
Schiffman, L. G., Kanuk, L. L., 2004. Consumer Behavior, 8th Edition. Pearson, Upper
Saddle River, NJ.
Schikuta, E., Erhart, M., 1997. The BANG-clustering system: grid-based data anal-
ysis. In: Liu, X., Cohen, P. R., Berthold, M. R. (Eds.), Proceeding of the Advances
in Intelligent Data Analysis, Reasoning about Data, Second International Sympo-
sium. Springer, London, pp. 513–524.
Schmittlein, D. C., Peterson, R. A., 1994. Customer base analysis: an industrial pur-
chase process application. Marketing Science 13 (1), 41–67.
Schultz, D. E., 1995. From the editor: the technological challenges to traditional direct
marketing. Journal of Direct Marketing 9 (1), 5–7.
Schwartz, S. H., Bilsky, W., 1987. Toward a universal psychological structure of hu-
man values. Journal of Personality and Social Psychology 53 (3), 550–562.
Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics
6 (2), 461–464.
Schweller, R. T., Gupta, A., Parsons, E., Chen, Y., 2004. Reversible sketches for effi-
cient and accurate change detection over network data streams. In: Lombardo, A.,
Kurose, J. F. (Eds.), Proceedings of the 4th ACM SIGCOMM Conference on Internet
Measurement. ACM, Taormina, Sicily, Italy, pp. 207–212.
Scott, D. W., 1992. Multivariate Density Estimation: Theory, Practice, and Visualiza-
tion. Wiley Series in Probability and Statistics. Wiley, New York.
Scott, D. W., 2009. Sturges’ rule. Wiley Interdisciplinary Reviews: Computational
Statistics 1 (3), 303–306.
Scott, D. W., Sain, S. R., 2005. Multidimensional density estimation. In: Rao, C., Weg-
man, E. J., Solka, J. L. (Eds.), Handbook of Statistics 24: Data Mining and Data
Visualization. Vol. 24 of Handbook of Statistics. Elsevier, San Diego, CA, pp. 229–
261.
Scrucca, L., 2010. Dimension reduction for model-based clustering. Statistics and
Computing 20, 471–484.
Sequeira, K., Zaki, M. J., 2004. SCHISM: A new approach for interesting subspace
mining. In: Proceedings of the Fourth IEEE International Conference on Data Min-
ing. IEEE, Brighton, UK, pp. 186–193.
Shapiro, B. P., Rangan, V. K., Moriarty Jr., R. T., Ross, E. B., 1987. Manage cus-
tomers for profits (not just sales). Harvard Business Review 65 (5), 101–108.
Shaw, M. J., Subramaniam, C., Tan, G. W., Welge, M. E., 2001. Knowledge manage-
ment and data mining for marketing. Decision Support Systems 31 (1), 127–137.
Sheikholeslami, G., Chatterjee, S., Zhang, A., 1998. WaveCluster: a multi-resolution
clustering approach for very large spatial databases. In: Gupta, A., Shmueli, O.,
Widom, J. (Eds.), Proceedings of the 24th International Conference on Very Large
Data Bases. Morgan Kaufmann, New York, pp. 428–439.
Sheskin, D. J., 2004. Handbook of Parametric and Nonparametric Statistical Proce-
dures, 3rd Edition. Chapman & Hall, Boca Raton, FL.
Sheth, J. N., Parvatiyar, A., 1995. Relationship marketing in consumer markets: an-
tecedents and consequences. Journal of the Academy of Marketing Science 23 (4),
255–271.
Shieh, J., Keogh, E. J., 2008. iSAX: indexing and mining terabyte sized time series. In:
Li, Y., Liu, B., Sarawagi, S. (Eds.), Proceedings of the 14th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining. ACM, Las Vegas, NV,
pp. 623–631.
Shimp, T. A., 2007. Advertising, Promotion and Other Aspects of Integrated Market-
ing Communications, 7th Edition. Thomson, Mason, OH.
Silverman, B. W., 1986. Density Estimation for Statistics and Data Analysis. Mono-
graphs on Statistics and Applied Probability. Chapman & Hall, New York.
Skinner, B. F., 1974. About Behaviorism. Vintage, New York.
Smith, K. A., Willis, R. J., Brooks, M., 2000. An analysis of customer retention and
insurance claim patterns using data mining: a case study. Journal of Operational
Research Society 51 (5), 532–541.
Smith, S. P., Jain, A. K., 1984. Testing for uniformity in multidimensional data.
IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (1), 73–81.
Smith, W. R., 1956. Product differentiation and market segmentation as alternative
marketing strategies. Journal of Marketing 21 (1), 3–8.
Smyth, P., 2000. Model selection for probabilistic clustering using cross-validated
likelihood. Statistics and Computing 10 (1), 63–72.
Solomon, M. R., 2004. Consumer Behavior: Buying, Having, and Being, 6th Edition.
Prentice Hall, Upper Saddle River, NJ.
Spiegelhalter, D., Best, N., Carlin, B., Van der Linde, A., 2002. Bayesian measures of
model complexity and fit. Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology) 64 (4), 583–639.
Srikant, R., Agrawal, R., 1996. Mining sequential patterns: generalizations and per-
formance improvements. In: Apers, P. M. G., Bouzeghoub, M., Gardarin, G. (Eds.),
Proceedings of the Fifth International Conference on Extending Database Tech-
nology. Vol. 1057. Springer, Avignon, France, pp. 3–17.
Stephens, M., 2000. Bayesian analysis of mixture models with an unknown number
of components - an alternative to reversible jump methods. The Annals of Statis-
tics 28 (1), 40–74.
Stephens, M. A., 1970. Use of the Kolmogorov-Smirnov, Cramer-von Mises and re-
lated statistics without extensive tables. Journal of the Royal Statistical Society: Se-
ries B (Statistical Methodology) 32 (1), 115–122.
Sterne, J. A. C., Smith, G. D., Cox, D. R., 2001. Sifting the evidence - what’s wrong with
significance tests? British Medical Journal 322 (7280), 226–231.
Stollnitz, E. J., DeRose, A. D., Salesin, D. H., 1996. Wavelets for Computer Graph-
ics. The Morgan Kaufmann Series in Computer Graphics. Morgan Kaufmann, San
Francisco, CA.
Stone, M., Bond, A., Foss, B., 2004. Consumer Insight: How to Use Data and Mar-
ket Research to Get Closer to Your Customer. Market Research in Practice Series.
Kogan Page, London.
Storbacka, K. E., 1997. Segmentation based on customer profitability - retrospective
analysis of retail bank customer bases. Journal of Marketing Management 13 (5),
479–492.
Strouse, K. G., 2004. Customer-centered Telecommunications Services Marketing.
Artech House telecommunications library. Artech House, Boston, MA.
Stryker, S., Burke, P. J., 2000. The past, present, and future of an identity theory. Social
Psychology Quarterly 63 (4), 284–297.
Sturges, H. A., 1926. The choice of a class interval. Journal of the American Statistical
Association 21 (153), 65–66.
Svensen, M., Bishop, C. M., 2005. Robust Bayesian mixture modelling. Neurocom-
puting 64, 235–252.
Symons, M. J., 1981. Clustering criteria and multivariate normal mixtures. Biomet-
rics 37 (1), 35–43.
Tan, P., Steinbach, M., Kumar, V., Potter, C., Klooster, S., Torregrosa, A., 2001. Finding
spatio-temporal patterns in earth science data. In: Proceedings of the KDD 2001
Workshop on Temporal Data Mining. ACM, San Francisco, CA.
Tanay, A., Sharan, R., Shamir, R., 2006. Biclustering algorithms: a survey. In: Aluru,
S. (Ed.), Handbook of Computational Molecular Biology. Chapman & Hall, Boca
Raton, FL.
Teh, Y. W., Jordan, M. I., Beal, M. J., Blei, D. M., 2006. Hierarchical Dirichlet processes.
Journal of the American Statistical Association 101 (476), 1566–1581.
Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D., Caldas, C., 2005.
A variational Bayesian mixture modelling framework for cluster analysis of gene-
expression data. Bioinformatics 21 (13), 3025–3033.
Thaper, N., Guha, S., Indyk, P., Koudas, N., 2002. Dynamic multidimensional his-
tograms. In: Franklin, M. J., Moon, B., Ailamaki, A. (Eds.), Proceedings of the 2002
ACM SIGMOD International Conference on Management of Data. ACM, Madison,
WI, pp. 428–439.
Thompson, C. J., Locander, W. B., Pollio, H. R., 1989. Putting consumer experi-
ence back into consumer research: the philosophy and method of existential-
phenomenology. Journal of Consumer Research 16 (2), 133–146.
Tibshirani, R., Walther, G., Hastie, T., 2001. Estimating the number of clusters in a
data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Sta-
tistical Methodology) 63 (2), 411–423.
Tierney, L., 1994. Markov chains for exploring posterior distributions. The Annals of
Statistics 22 (4), 1701–1728.
Tipping, M. E., Bishop, C. M., 1999. Mixtures of probabilistic principal component
analyzers. Neural Computation 11 (2), 443–482.
Titterington, D. M., Smith, A. F. M., Makov, U. E., 1985. Statistical Analysis of Finite
Mixture Distributions. Wiley Series in Probability and Mathematical Statistics.
Wiley, New York.
Train, K. E., McFadden, D. L., Ben-Akiva, M., 1987. The demand for local telephone
service: a fully discrete model of residential calling patterns and service choices.
RAND Journal of Economics 18 (1), 109–123.
Tung, A. K. H., Xu, X., Ooi, B. C., 2005. CURLER: finding and visualizing nonlinear
correlation clusters. In: Proceedings of the 2005 ACM SIGMOD International Con-
ference on Management of Data. ACM, New York, pp. 467–478.
Twedt, D. W., 1967. How does brand awareness-attitude affect marketing strategy?
Journal of Marketing 31 (4), 64–66.
Ueda, N., Ghahramani, Z., 2002. Bayesian model search for mixture models based
on optimizing variational bounds. Neural Networks 15 (10), 1223–1241.
Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G. E., 2000. SMEM algorithm for mix-
ture models. Neural Computation 12 (9), 2109–2128.
Van Mechelen, I., Bock, H.-H., Boeck, P. D., 2004. Two-mode clustering methods: a
structured overview. Statistical Methods in Medical Research 13 (5), 363–394.
van Raaij, E. M., Vernooij, M. J. A., van Triest, S., 2003. The implementation of cus-
tomer profitability analysis: a case study. Industrial Marketing Management 32 (7),
573–583.
Vasconcelos, N., Lippman, A., 1998. Learning mixture hierarchies. In: Kearns, M. J.,
Solla, S. A., Cohn, D. A. (Eds.), Proceedings of the 1998 Neural Information Pro-
cessing Systems. MIT, Denver, CO, pp. 606–612.
Venkatesan, R., Kumar, V., Bohling, T., 2007. Optimal customer relationship man-
agement using Bayesian decision theory: an application for customer selection.
Journal of Marketing Research 44 (4), 579–594.
Verhoef, P. C., Donkers, B., 2001. Predicting customer potential value: an application
in the insurance industry. Decision Support Systems 32 (2), 189–199.
Veroff, J., Douvan, E., Kulka, R. A., 1981. The Inner American: A Self-Portrait from
1957 to 1976. Basic Books, New York.
Verplanken, B., Aarts, H., van Knippenberg, A., Moonen, A., 1998. Habit versus
planned behaviour: a field experiment. The British Journal of Social Psychology
37 (1), 111–128.
Vitter, J. S., 1985. Random sampling with a reservoir. ACM Transactions on Mathe-
matical Software 11 (1), 37–57.
Vitter, J. S., 2008. External memory algorithms and data structures: dealing with mas-
sive data. Foundations and Trends in Theoretical Computer Science 2 (4), 305–474.
Vitter, J. S., Wang, M., 1999. Approximate computation of multidimensional aggre-
gates of sparse data using wavelets. In: Delis, A., Faloutsos, C., Ghandeharizadeh,
S. (Eds.), Proceedings of the ACM SIGMOD International Conference on Manage-
ment of Data. ACM, Philadelphia, PA, pp. 193–204.
Vitter, J. S., Wang, M., Iyer, B. R., 1998. Data cube approximation and histograms
via wavelets. In: Gardarin, G., French, J. C., Pissinou, N., Makki, K., Bouganim, L.
(Eds.), Proceedings of the 1998 ACM CIKM International Conference on Informa-
tion and Knowledge Management. ACM, Bethesda, MD, pp. 96–104.
Vlachos, M., Hadjieleftheriou, M., Keogh, E., Gunopulos, D., 2005. Indexing multi-
dimensional trajectories for similarity queries. In: Manolopoulos, Y., Papadopoulos,
A. N., Vassilakopoulos, M. G. (Eds.), Spatial Databases: Technologies, Techniques
and Trends. IGI, London, pp. 107–128.
Volkovich, Z., Kogan, J., Nicholas, C., 2006. Sampling methods for building initial par-
titions. In: Kogan, J., Nicholas, C., Teboulle, M. (Eds.), Grouping Multidimensional
Data: Recent Advances in Clustering. Springer, New York.
Wainwright, M. J., Jordan, M. I., 2003. Graphical models, exponential families, and
variational inference. Technical Report 649, Department of Statistics, University of
California, Berkeley, CA.
Wallace, C. S., Dowe, D. L., 1994. Intrinsic classification by MML - the Snob pro-
gram. In: Proceedings of the Seventh Australian Joint Conference on Artificial In-
telligence. World Scientific, Singapore, pp. 37–44.
Wallace, C. S., Dowe, D. L., 2000. MML clustering of multi-state, Poisson, von Mises
circular and Gaussian distributions. Statistics and Computing 10 (1), 73–83.
Wallace, C. S., Freeman, P. R., 1987. Estimation and inference by compact coding.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 49 (3),
240–265.
Wallach, H. M., Dicker, L., Jensen, S. T., Heller, K. A., 2010. An alternative prior pro-
cess for nonparametric Bayesian clustering. In: Proceedings of the 13th Interna-
tional Conference on Artificial Intelligence and Statistics. Sardinia, Italy.
Wang, B., Titterington, D. M., 2006. Convergence properties of a general algo-
rithm for calculating variational Bayesian estimates for a normal mixture model.
Bayesian Analysis 1 (3), 625–650.
Wang, H., Fan, W., Yu, P. S., Han, J., 2003. Mining concept-drifting data streams us-
ing ensemble classifiers. In: Getoor, L., Senator, T. E., Domingos, P., Faloutsos,
C. (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, Washington, DC, pp. 226–235.
Wang, H., Wang, W., Yang, J., Yu, P. S., 2002a. Clustering by pattern similarity in large
data sets. In: Franklin, M. J., Moon, B., Ailamaki, A. (Eds.), Proceedings of the 2002
ACM SIGMOD International Conference on Management of Data. ACM, Madison,
WI, pp. 394–405.
Wang, H.-F., Hong, W.-K., 2006. Managing customer profitability in a competitive
market by continuous data mining. Industrial Marketing Management 35 (6), 715–
723.
Wang, M., Chan, N. H., Papadimitriou, S., Faloutsos, C., Madhyastha, T. M., 2002b.
Data mining meets performance evaluation: fast algorithms for modeling bursty
traffic. In: Proceedings of the 18th International Conference on Data Engineering.
IEEE, San Jose, CA, pp. 507–516.
Wang, N., Raftery, A. E., 2002. Nearest-neighbor variance estimation (NNVE). Journal
of the American Statistical Association 97 (460), 994–1019.
Wang, S. J., Woodward, W. A., Gray, H. L., Wiechecki, S., Sain, S. R., 1997a. A new test
for outlier detection from a multivariate mixture distribution. Journal of Compu-
tational and Graphical Statistics 6 (3), 285–299.
Wang, W., Yang, J., Muntz, R. R., 1997b. STING: a statistical information grid ap-
proach to spatial data mining. In: Jarke, M., Carey, M. J., Dittrich, K. R., Lochovsky,
F. H., Loucopoulos, P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd International
Conference on Very Large Data Bases. Morgan Kaufmann, Athens, Greece, pp. 186–
195.
Wang, X., Smith, K. A., Hyndman, R. J., 2006. Characteristic-based clustering for time
series data. Data Mining and Knowledge Discovery 13 (3), 335–364.
Wang, Y.-F., Chuang, Y.-L., Hsu, M.-H., Keh, H.-C., 2004. A personalized recom-
mender system for the cosmetic business. Expert Systems with Applications 26 (3),
427–434.
Watanabe, S., Minami, Y., Nakamura, A., Ueda, N., 2002. Application of variational
Bayesian approach to speech recognition. In: Becker, S., Thrun, S., Obermayer,
K. (Eds.), Proceedings of the 2002 Neural Information Processing Systems. MIT,
Vancouver, BC, Canada, pp. 1237–1244.
Waterhouse, S. R., MacKay, D. J. C., Robinson, A. J., 1996. Bayesian methods for mix-
tures of experts. In: Touretzky, D. S., Mozer, M., Hasselmo, M. E. (Eds.), Proceedings
of the 1996 Neural Information Processing Systems. MIT, Denver, CO, pp. 351–357.
Watson, J. B., 1913. Psychology as the behaviorist views it. Psychological Review 20,
158–177.
Weber, R., Schek, H.-J., Blott, S., 1998. A quantitative analysis and performance study
for similarity-search methods in high-dimensional spaces. In: Gupta, A., Shmueli,
O., Widom, J. (Eds.), Proceedings of the 24th International Conference on Very
Large Data Bases. Morgan Kaufmann, New York, pp. 194–205.
Wedel, M., Kamakura, W. A., 1998. Market Segmentation: Conceptual and Method-
ological Foundations. International Series in Quantitative Marketing. Kluwer Aca-
demic, Boston, MA.
Wei, C.-P., Chiu, I.-T., 2002. Turning telecommunications call details to churn predic-
tion: a data mining approach. Expert Systems with Applications 23 (2), 103–112.
Wei, L., Keogh, E. J., Xi, X., 2006. SAXually explicit images: finding unusual shapes.
In: Proceedings of the 6th IEEE International Conference on Data Mining. IEEE,
Hong Kong, China, pp. 711–720.
Weinstein, A., 2004. Handbook of Market Segmentation: Strategic Targeting for Busi-
ness and Technology Firms, 3rd Edition. Haworth Series in Segmented, Targeted,
and Customized Market. Haworth, Binghamton, NY.
Weiss, G. M., 2005. Data mining in telecommunications. In: Maimon, O., Rokach, L.
(Eds.), Data Mining and Knowledge Discovery Handbook. Springer, New York.
Wells, W. D., 1975. Psychographics: a critical review. Journal of Marketing Research
12 (2), 196–213.
Wicker, A. W., 1969. Attitudes vs. actions: the relationship of verbal and overt behav-
ioral responses to attitude objects. Journal of Social Issues 25 (4), 41–78.
Wind, Y., 1978. Issues and advances in segmentation research. Journal of Marketing
Research 15 (3), 317–337.
Winn, J. M., Bishop, C. M., 2005. Variational message passing. Journal of Machine
Learning Research 6, 661–694.
Witten, I. H., Frank, E., 2005. Data Mining: Practical Machine Learning Tools and
Techniques, 2nd Edition. Morgan Kaufmann Series in Data Management Systems.
Morgan Kaufmann, Boston, MA.
Wolfers, J., Zitzewitz, E., 2004. Prediction markets. Journal of Economic Perspectives
18 (2), 107–126.
Woo, K.-G., Lee, J.-H., Kim, M.-H., Lee, Y.-J., 2004. FINDIT: a fast and intelligent
subspace clustering algorithm using dimension voting. Information and Software
Technology 46 (4), 255–271.
Wu, B., McGrory, C. A., Pettitt, A. N., 2010a. Customer spatial usage behavior profiling
and segmentation with mixture modeling. Submitted.
Wu, B., McGrory, C. A., Pettitt, A. N., 2010b. A new variational Bayesian algorithm
with application to human mobility pattern modeling. Statistics and Computing,
(in press).
URL http://dx.doi.org/10.1007/s11222-010-9217-9
Wu, B., McGrory, C. A., Pettitt, A. N., 2010c. The variational Bayesian method: com-
ponent elimination, initialization & circular data. Submitted.
Wu, C. F. J., 1983. On convergence properties of the EM algorithm. The Annals of
Statistics 11 (1), 95–103.
Wu, Y.-L., Agrawal, D., Abbadi, A. E., 2001. Applying the golden rule of sampling for
query estimation. SIGMOD Record 30 (2), 449–460.
Xing, D., Girolami, M., 2007. Employing latent Dirichlet allocation for fraud detection
in telecommunications. Pattern Recognition Letters 28 (13), 1727–1734.
Xiong, Y., Yeung, D.-Y., 2002. Mixtures of ARMA models for model-based time se-
ries clustering. In: Proceedings of the 2002 IEEE International Conference on Data
Mining. IEEE, Maebashi City, Japan, pp. 717–720.
Xiong, Y., Yeung, D.-Y., 2004. Time series clustering with ARMA mixtures. Pattern
Recognition 37 (8), 1675–1689.
Xu, R., Wunsch II, D., 2005. Survey of clustering algorithms. IEEE Transactions on
Neural Networks 16 (3), 645–678.
Xu, X., Ester, M., Kriegel, H.-P., Sander, J., 1998. A distribution-based clustering al-
gorithm for mining in large spatial databases. In: Proceedings of the 14th Interna-
tional Conference on Data Engineering. IEEE, Orlando, FL, pp. 324–331.
Yalch, R., Brunel, F., 1996. Need hierarchies in consumer judgments of product de-
signs: is it time to reconsider Maslow’s theory? Advances in Consumer Research
23 (1), 405–410.
Yamazaki, K., Watanabe, S., 2003. Singularities in mixture models and upper bounds
of stochastic complexity. Neural Networks 16 (7), 1029–1038.
Yang, J., Wang, W., Wang, H., Yu, P. S., 2002. δ-clusters: capturing subspace correla-
tion in a large data set. In: Proceedings of the 18th International Conference on
Data Engineering. IEEE, San Jose, CA, pp. 517–528.
Yang, Y., Wu, X., Zhu, X., 2005. Combining proactive and reactive predictions for
data streams. In: Grossman, R., Bayardo, R. J., Bennett, K. P. (Eds.), Proceedings
of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. ACM, Chicago, IL, pp. 710–715.
Yankelovich, D., Meer, D., 2006. Rediscovering market segmentation. Harvard Busi-
ness Review 84 (2), 122–131.
Yankov, D., Keogh, E. J., Rebbapragada, U., 2007. Disk aware discord discovery: find-
ing unusual time series in terabyte sized datasets. In: Proceedings of the Seventh
IEEE International Conference on Data Mining. IEEE, Omaha, NE, pp. 381–390.
Yi, B.-K., Faloutsos, C., 2000. Fast time sequence indexing for arbitrary Lp norms.
In: Abbadi, A. E., Brodie, M. L., Chakravarthy, S., Dayal, U., Kamel, N., Schlageter,
G., Whang, K.-Y. (Eds.), Proceedings of the 26th International Conference on Very
Large Data Bases. Morgan Kaufmann, Cairo, Egypt, pp. 385–394.
Yi, B.-K., Sidiropoulos, N., Johnson, T., Jagadish, H. V., Faloutsos, C., Biliris, A., 2000.
Online data mining for co-evolving time sequences. In: Proceedings of the 16th
International Conference on Data Engineering. IEEE, San Diego, CA, pp. 13–22.
Yip, K. Y., Cheung, D. W., Ng, M. K., 2004. HARP: a practical projected clustering algo-
rithm. IEEE Transactions on Knowledge and Data Engineering 16 (11), 1387–1397.
Yip, K. Y., Cheung, D. W., Ng, M. K., 2005. On discovery of extremely low-dimensional
clusters using semi-supervised projected clustering. In: Proceedings of the 21st
International Conference on Data Engineering. IEEE, Tokyo, Japan, pp. 329–340.
Yiu, M. L., Mamoulis, N., 2005. Iterative projected clustering by subspace mining.
IEEE Transactions on Knowledge and Data Engineering 17 (2), 176–189.
Yu, D., Sheikholeslami, G., Zhang, A., 2002. FindOut: finding outliers in very large
datasets. Knowledge and Information Systems 4 (4), 387–412.
Zahn, C. T., 1971. Graph-theoretical methods for detecting and describing gestalt
clusters. IEEE Transactions on Computers C-20 (1), 68–86.
Zaki, M. J., Peters, M., Assent, I., Seidl, T., 2005. CLICKS: an effective algorithm for
mining subspace clusters in categorical datasets. In: Grossman, R., Bayardo, R. J.,
Bennett, K. P. (Eds.), Proceedings of the Eleventh ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining. ACM, Chicago, IL, pp. 736–742.
Zeira, G., Last, M., Maimon, O., 2004. Segmentation of continuous data streams
based on a change detection methodology. In: Pal, N. R., Jain, L. C. (Eds.), Ad-
vanced Techniques in Knowledge Discovery and Data Mining. Advanced Informa-
tion and Knowledge Processing. Springer, New York.
Zeithaml, V. A., 2000. Service quality, profitability, and the economic worth of cus-
tomers: what we know and what we need to learn. Journal of the Academy of Mar-
keting Science 28 (1), 67–85.
Zeithaml, V. A., Rust, R. T., Lemon, K. N., 2001. The customer pyramid: creating and
serving profitable customers. California Management Review 43 (4), 118–142.
Zhang, D., Gunopulos, D., Tsotras, V. J., Seeger, B., 2003. Temporal and spatio-
temporal aggregations over data streams using multiple time granularities. Infor-
mation Systems 28 (1-2), 61–84.
Zhang, J., Hsu, W., Lee, M.-L., 2005. Clustering in dynamic spatial databases. Journal
of Intelligent Information Systems 24 (1), 5–27.
Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an efficient data clustering
method for very large databases. In: Jagadish, H. V., Mumick, I. S. (Eds.), Proceed-
ings of the 1996 ACM SIGMOD International Conference on Management of Data.
ACM, Montreal, QC, Canada, pp. 103–114.
Zhou, A., Cai, Z., Wei, L., Qian, W., 2003. M-kernel merging: towards density estima-
tion over data streams. In: Proceedings of the Eighth International Conference on
Database Systems for Advanced Applications. IEEE, Kyoto, Japan, pp. 285–292.
Zhu, Y., Shasha, D., 2002. StatStream: statistical monitoring of thousands of data
streams in real time. In: Proceedings of the 28th International Conference on Very
Large Data Bases. Morgan Kaufmann, Hong Kong, China, pp. 358–369.
Zhu, Y., Shasha, D., 2003. Efficient elastic burst detection in data streams. In: Getoor,
L., Senator, T. E., Domingos, P., Faloutsos, C. (Eds.), Proceedings of the Ninth
ACM SIGKDD International Conference on Knowledge Discovery and Data Min-
ing. ACM, Washington, DC, pp. 336–345.
Ziff, R., 1971. Psychographics for market segmentation. Journal of Advertising Re-
search 11 (2), 3–9.