Quick Overview of Unsupervised Learning in Salford SPM
The Salford Predictive Modeler (SPM) software suite offers several tools for clustering
and segmentation, including CART, Random Forests, and a classical statistical module,
CLUSTER. In this article we illustrate the use of these tools with the well-known
Boston Housing data set (pertaining to 1970s housing prices and neighborhood
characteristics in the greater Boston area).
CART
We open the data set and go to the Model Setup dialog.
Observe that we are using the CART analysis engine, that the Analysis Type is set
to Unsupervised, and that we have checked ALL variables as predictors without
selecting a Target (the Target Variable box is empty). What CART will do here is
take the original data, make a scrambled copy of it, label the first version
“Original” and the second “Copy”, and then build a CART tree to try to distinguish
between the two versions of the data. The key intuition behind this style of
analysis is that the “Copy” is intended to follow a roughly uniform, structureless
pattern, whereas the “Original” data presumably does have structure. Terminal nodes
of the CART tree that are particularly rich in “Original” data exhibit marked
structure and can be considered regions of high density. If those nodes are of
reasonable size, they could be considered clusters in the conventional sense.
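The scramble-and-classify idea can be sketched in a few lines of Python. This is not SPM code, just an illustration of the technique using scikit-learn's decision tree on simulated data; the data, the two-column layout, and the node-size limits below are stand-ins:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for the real data: two well-separated groups of records.
# (Assumption: any numeric matrix works; SPM would use the Boston variables.)
original = np.vstack([rng.normal(0, 1, (100, 2)),
                      rng.normal(5, 1, (100, 2))])

# The "Copy": permute each column independently, which preserves every
# marginal distribution but destroys the joint structure.
copy = np.column_stack([rng.permutation(original[:, j])
                        for j in range(original.shape[1])])

X = np.vstack([original, copy])
y = np.r_[np.ones(len(original)), np.zeros(len(copy))]  # 1 = "Original"

# Leaves rich in "Original" records mark high-density regions.
# min_samples_leaf / min_samples_split mirror the 30 / 70 limits in the text.
tree = DecisionTreeClassifier(min_samples_leaf=30,
                              min_samples_split=70).fit(X, y)
accuracy = tree.score(X, y)
```

If the tree cannot beat 50% accuracy, the “Original” data is essentially indistinguishable from its structureless copy, which suggests there is little cluster structure to find.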
To control the size of the terminal nodes we use the Limits dialog to set the
smallest allowable terminal node to 30 records and the smallest splittable node to
70. There is nothing special about these settings, which were chosen simply to
ensure that no “cluster” is smaller than 30 records. The value of 70 for the
smallest parent node is more or less forced on us once we have set the minimum
terminal node size to 30, since a parent gives rise to children: if each child
contains at least 30 records, the parent contains at least 60, and we give the
split a little flexibility by requiring the parent to be at least slightly larger
than 60.
The resulting tree, grown using cross-validation as the test method, is shown
below; the bright red nodes contain the results of interest.
Double-clicking any node reveals its defining “rules.” For example, terminal node
10 contains a collection of neighborhoods with below-average crime rates
(typically below the median), below-average pollution levels, and above-average
distance from employment centers.
This “cluster” comprises 200 neighborhoods, which is large enough to keep it from
being very distinctive. If we allowed smaller terminal nodes, the most highly
concentrated of them would likely define more interesting segments.
What about the nodes least populated with “Original” records? What can they tell
us about the data? One interpretation, when the “Original” records are truly rare
in a node, is that we have identified anomalies (patterns that are unexpected in
the data being analyzed). Below we hover the mouse over one of the blue terminal
nodes to reveal a node containing only 5 “Original” but 80 “Copy” records,
suggesting that the pattern is indeed unusual.
The rule indicates that there are few neighborhoods that are both close to the
centers of employment and have lower percentages of older buildings. This is
something we might not have noticed while exploring the data in other ways.
Using CART unsupervised learning with large datasets
The method we have described can be used to find quite interesting segmentations
in a two-stage process (grow an unsupervised tree, then cluster its terminal
nodes):
1. First, run CART in unsupervised mode allowing the tree to reach some
moderate number of terminal nodes, such as 200.
2. Score the data assigning every record to its terminal node and compute the
mean, median, or some other representative vector for each terminal node. If
all variables are continuous for example, we could just represent each
terminal node with a vector of mean values.
3. Now, we take this table of means, one for each terminal node, and treat it as a
new data set to be input into a clustering procedure (hierarchical would
make a lot of sense given the small size of this data set).
4. Once we have arrived at a satisfactory clustering of the mean vectors we can
use it to assign every record in the original data set to its appropriate cluster.
How satisfactory this will be will depend in part on the variability of the
actual data around the mean vectors.
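Steps 2 through 4 can be sketched as follows. This is an illustration only, not SPM code: the terminal-node assignments are simulated here, whereas in practice they would come from scoring the data with the unsupervised CART tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # stand-in for the original data
leaf = rng.integers(0, 200, size=1000)    # stand-in terminal-node ids (step 1)

# Step 2: represent each terminal node by its vector of mean values.
nodes = np.unique(leaf)
means = np.array([X[leaf == n].mean(axis=0) for n in nodes])

# Step 3: hierarchically cluster the small table of node means.
Z = linkage(means, method="ward")
node_cluster = fcluster(Z, t=6, criterion="maxclust")  # e.g. 6 clusters

# Step 4: assign every original record to its node's cluster.
record_cluster = node_cluster[np.searchsorted(nodes, leaf)]
```

Because the clustering runs on at most a few hundred mean vectors rather than the full data, the hierarchical step stays cheap even when the original data set is very large.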
Clustering With Random Forests
The only other data mining engine that works with our data scrambling technique is
Random Forests. Its mechanism is quite different from CART's, in spite of the fact
that it leverages the same contrast between the “Original” data and a “Copy” that
is intended to be more or less structureless. Random Forests clustering works via
its construction of a proximity matrix.
a proximity matrix. The details of this matrix are described in other Salford
publications but the essential idea is quite simple. Take any pair of data records and
drop both down every tree in the forest, counting the number of times the two
records end up in the same terminal node. You might think this would occur very
rarely but it will depend on the distribution of the data. If the two records end up in
the same terminal node in every tree they must be very similar or even identical. If
two records never end up in the same terminal node, they are unlikely to have much
in common in their data values: each tree defines its terminal nodes with a
different set of conditions involving different subsets of variables, so records
that never match differ no matter which subset of variables we focus on. The
continuum between never matching and always matching defines a similarity metric
that can be used to drive hierarchical clustering.
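The proximity computation itself can be sketched briefly. This is an illustration only, with simulated terminal-node assignments standing in for a real forest:

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_trees = 60, 100

# leaf_ids[t, i]: the terminal node that record i reaches in tree t.
# (Simulated here; a real forest produces these by dropping each record
# down every tree.)
leaf_ids = rng.integers(0, 8, size=(n_trees, n_records))

# Proximity = fraction of trees in which two records share a terminal node.
prox = np.zeros((n_records, n_records))
for t in range(n_trees):
    prox += leaf_ids[t][:, None] == leaf_ids[t][None, :]
prox /= n_trees

# 1 - proximity is a dissimilarity suitable for hierarchical clustering.
dissim = 1.0 - prox
```

A record always shares a node with itself, so the diagonal is 1; pairs that never co-occur get proximity 0 and maximal dissimilarity.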
The proximity matrix is the principal mechanism for discovering clusters in
Random Forests. Further, proximity matrices are optionally created for any
Random Forests classification model, meaning that the analyst can approach the
challenge from various perspectives. The target variable could be “Original”
versus “Copy”, but it could also be any real variable existing in the original
data, allowing for models that involve no duplication and scrambling of data at
all. Running different models could result in different proximity matrices and
thus different clustering solutions.
A helpful facility in SPM Random Forests is the option to run a hierarchical
clustering process immediately after the forest and proximity matrix have been
constructed. This facility is available only from the command line and can be
invoked with commands following this pattern:
LOPTIONS UNS = YES
KEEP CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS,
RAD, TAX, PT, B, LSTAT, MV
RF SAVE="\\PSF\BOSTON_UNSUP.CSV"
RF GO POST=YES JOIN=YES LINKAGE=WARD
The first command puts Random Forests into unsupervised mode (UNS=YES); the KEEP
command lists the variables permitted to enter the trees; the RF SAVE command
gives the root name Random Forests will use to save several output files; and the
final command specifies that the clustering should proceed using WARD linkage.
There will not be any visual output, but the output file with the word “join” in
its name (BOSTON_UNSUP_JOIN.CSV in this example) will contain clustering
solutions for 4, 6, and 10 clusters. Every record in the training data set is
assigned to one of the available clusters; three columns in the output file,
named CLUSTER_1_1, CLUSTER_1_2, and CLUSTER_1_3, contain the cluster
assignments. For other numbers of clusters you will, at this point in time, need
to run a clustering procedure outside of SPM.
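One hypothetical way to do that externally is to feed the forest's proximity matrix to SciPy's hierarchical clustering and cut the tree at any number of clusters. The proximity matrix below is simulated; in practice it would be loaded from the forest's output files:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Stand-in for a proximity matrix exported from a Random Forests run.
rng = np.random.default_rng(0)
a = rng.random((30, 30))
prox = (a + a.T) / 2          # symmetric, values in [0, 1]
np.fill_diagonal(prox, 1.0)   # each record is fully proximate to itself

# Convert 1 - proximity to the condensed form SciPy's linkage expects.
# Note: Ward linkage formally assumes Euclidean distances, so "average"
# or "complete" linkage is the safer choice for a proximity matrix.
d = squareform(1.0 - prox, checks=False)
Z = linkage(d, method="average")

clusters_7 = fcluster(Z, t=7, criterion="maxclust")  # any k, not just 4/6/10
```

The same linkage tree `Z` can be cut repeatedly at different values of `t` without recomputing anything.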
Cluster: KMEANS
This is a conventional implementation of the KMEANS algorithm and is included in
SPM for convenience. Look for an updated version of this note after April 2014 to
learn more about the advances included in all of our clustering machinery in the
2014 edition of SPM.
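For reference, the classical KMEANS procedure amounts to Lloyd's algorithm: repeatedly assign each record to its nearest center, then move each center to the mean of its assigned records. The sketch below is plain NumPy on toy data, not SPM's implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Classical KMEANS (Lloyd's algorithm): alternate assign and update."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct randomly chosen records.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its records.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
```

Unlike the hierarchical methods above, KMEANS requires the number of clusters `k` up front and is sensitive to the random initialization, so multiple restarts are common practice.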