Quick Overview of Unsupervised Learning in Salford SPM

The Salford Predictive Modeler (SPM) software suite offers several tools for clustering and segmentation, including CART, Random Forests, and a classical statistical module, CLUSTER. In this article we illustrate the use of these tools with the well-known Boston Housing data set (pertaining to 1970s housing prices and neighborhood characteristics in the greater Boston area).

CART

We open the data set and go to the Model Setup dialog. Observe that we are using the CART analysis engine, that the Analysis Type is set to Unsupervised, that we have checked ALL variables as predictors, and that we have not selected a Target (the Target Variable box is empty). What CART will do here is take our original data and make a scrambled copy of it, calling the first version “Original” and the second “Copy”, and then build a CART tree that tries to distinguish between the two versions of the data. The key intuition behind this style of analysis is that the “Copy” is intended to follow a roughly uniform, structureless pattern, whereas the “Original” data presumably does have structure. Terminal nodes of the CART tree that are particularly rich in “Original” data exhibit marked structure and can be considered regions of high density. If the nodes are of reasonable size they can be considered clusters in the conventional sense.

To control the size of the terminal nodes we use the Limits dialog to set the smallest allowable terminal node to 30 records and the smallest splittable node to 70. There is nothing special about these settings; they were chosen simply to ensure that no “cluster” is smaller than 30 records. The value of 70 for the smallest parent node is more or less forced upon us once we have set the minimum terminal node size to 30: a parent gives rise to two children, so if each child contains at least 30 records the parent must contain at least 60, and we give the split a little flexibility by requiring the parent to be at least somewhat larger than 60.
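
To make the mechanics concrete, here is a minimal sketch of the scramble-and-classify idea in Python with scikit-learn. It is an illustration of the technique rather than SPM's implementation: the file name, the column handling, and the use of a single decision tree in place of CART are all assumptions.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Illustration only: DecisionTreeClassifier stands in for SPM's CART,
    # and the file name and column layout are assumptions.
    df = pd.read_csv("boston.csv")

    rng = np.random.default_rng(0)

    # Build the "Copy": shuffle each column independently, preserving every
    # marginal distribution while destroying the joint structure.
    copy = df.apply(lambda col: rng.permutation(col.to_numpy()))

    X = pd.concat([df, copy], ignore_index=True)
    y = np.array(["Original"] * len(df) + ["Copy"] * len(copy))

    # The node-size limits discussed above: terminal nodes hold at least 30
    # records, and only nodes with at least 70 records may be split.
    tree = DecisionTreeClassifier(min_samples_leaf=30, min_samples_split=70,
                                  random_state=0)
    tree.fit(X, y)

    # Fraction of "Original" records in each terminal node: values near 1
    # flag high-density regions, values near 0 flag anomalous regions.
    leaf = tree.apply(X)
    purity = pd.Series(y == "Original").groupby(leaf).mean()
    print(purity.sort_values(ascending=False))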

The resulting tree, grown using cross-validation as the test method, is shown below; the bright red nodes contain the results of interest.

Double-clicking any node reveals its defining “rules.” For example, terminal node 10 contains a collection of neighborhoods that have below-average crime rates (typically below the median), below-average pollution levels, and above-average distance from employment centers.

This “cluster” is made up of 200 neighborhoods, which is probably too large for it to be very distinctive. If we allow smaller terminal nodes, the most highly concentrated of them are likely to define more interesting segments.

What about the nodes that are least populated with “Original” records? What can they tell us about the data? One interpretation, when the “Original” records are truly rare, is that we have identified anomalies: patterns that are unexpected in the data being analyzed. Below, we hover the mouse over one of the blue terminal nodes to reveal a node that contains only 5 “Original” but 80 “Copy” records, suggesting that the pattern is indeed unusual.

The rule indicates that there are few neighborhoods that are both close to the centers of employment and have lower percentages of older buildings. This is something we might not have noticed while exploring the data in other ways.

Using CART unsupervised learning with large datasets

The method we have described can be used to find quite interesting segmentations in a multi-step process (a code sketch follows the list):

1. First, run CART in unsupervised mode, allowing the tree to reach some moderate number of terminal nodes, such as 200.

2. Score the data, assigning every record to its terminal node, and compute the mean, median, or some other representative vector for each terminal node. If all variables are continuous, for example, we could simply represent each terminal node by its vector of mean values.

3. Take this table of means, one per terminal node, and treat it as a new data set to be input into a clustering procedure (hierarchical clustering would make a lot of sense given the small size of this data set).

4. Once we have arrived at a satisfactory clustering of the mean vectors, use it to assign every record in the original data set to its appropriate cluster. How satisfactory this will be depends in part on the variability of the actual data around the mean vectors.
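
A minimal Python sketch of steps 2 through 4, reusing the df and tree objects from the earlier snippet; the choice of five clusters is arbitrary and for illustration only.

    import pandas as pd
    from scipy.cluster.hierarchy import fcluster, linkage

    # Step 2: assign each record to its terminal node and compute the
    # per-node mean vector.
    leaf_id = tree.apply(df)
    node_means = df.groupby(leaf_id).mean()

    # Step 3: hierarchically cluster the (small) table of node means.
    Z = linkage(node_means.to_numpy(), method="ward")
    node_cluster = fcluster(Z, t=5, criterion="maxclust")

    # Step 4: carry each node's cluster label back to its member records.
    node_to_cluster = dict(zip(node_means.index, node_cluster))
    record_cluster = pd.Series(leaf_id, index=df.index).map(node_to_cluster)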

Clustering With Random Forests

The only other data mining engine that works with our data-scrambling technique is Random Forests. Its mechanism is quite different from CART's, in spite of the fact that it leverages the same contrast between the “Original” data and a “Copy” that is intended to be more or less structureless. Random Forests clustering works via its construction of a proximity matrix. The details of this matrix are described in other Salford publications, but the essential idea is quite simple: take any pair of data records, drop both down every tree in the forest, and count the number of times the two records end up in the same terminal node. You might think this would occur very rarely, but that depends on the distribution of the data. If the two records end up in the same terminal node in every tree, they must be very similar or even identical. If two records never end up in the same terminal node, they are unlikely to have much in common (in terms of their data values). The reason is that each tree defines its terminal nodes with a different set of conditions, involving different subsets of variables; if two records never match in any tree, they are different regardless of which subset of variables we focus on. The continuum between never matching and always matching defines a similarity metric that can be used to drive hierarchical clustering.
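
The following Python sketch illustrates the proximity arithmetic with scikit-learn; SPM's Random Forests constructs this matrix for you, so this is only a hand-rolled approximation, reusing the df, X, and y objects from the CART sketch.

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

    # leaves[i, t] is the terminal node of record i in tree t; we compute
    # proximities among the "Original" records only.
    leaves = rf.apply(df)
    n, n_trees = leaves.shape

    proximity = np.zeros((n, n))
    for t in range(n_trees):
        same = leaves[:, t][:, None] == leaves[:, t][None, :]
        proximity += same
    proximity /= n_trees  # 1 = always co-land in a leaf, 0 = never

    # Treat 1 - proximity as a distance and drive hierarchical clustering.
    condensed = 1.0 - proximity[np.triu_indices(n, k=1)]
    Z_rf = linkage(condensed, method="average")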

The proximity matrix is the principal mechanism for discovering clusters in Random Forests. Further, a proximity matrix can optionally be created for any Random Forests classification model, meaning that the analyst can approach the challenge from various perspectives. The target variable could be “Original” versus “Copy”, but it could also be any real variable in the original data, allowing for models that involve no duplication and scrambling of the data. Running different models can result in different proximity matrices and thus different clustering solutions.
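
For instance, under the same assumptions as the previous sketch, proximities could be derived from a supervised forest with the real variable MV (median home value) as its target:

    from sklearn.ensemble import RandomForestRegressor

    # Same proximity recipe from a supervised forest: predict MV from the
    # other variables, with no scrambled copy at all.
    rf_mv = RandomForestRegressor(n_estimators=500, random_state=0)
    rf_mv.fit(df.drop(columns="MV"), df["MV"])
    leaves = rf_mv.apply(df.drop(columns="MV"))
    # ...then build the proximity and distance matrices exactly as above.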

A helpful facility in SPM Random Forests is the option to run a hierarchical clustering process immediately after the forest and proximity matrix have been constructed. This facility is available only from the command line and can be invoked with commands following this pattern:

    LOPTIONS UNS = YES
    KEEP CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PT, B, LSTAT, MV
    RF SAVE="\\PSF\BOSTON_UNSUP.CSV"
    RF GO POST=YES JOIN=YES LINKAGE=WARD

The first command puts Random Forests into unsupervised mode (UNS=YES); the KEEP command lists the variables permitted to enter the trees; the RF SAVE command gives the root name that Random Forests will use to save several output files; and the final command specifies that the clustering should proceed using Ward linkage.

There will not be any visual output, but the output file with the word “join” in its name (boston_unsup_join.csv in this example) will contain clustering solutions for 4, 6, and 10 clusters. Every record in the training data set is assigned to one of the available clusters; three columns in the output data file, named CLUSTER_1_1, CLUSTER_1_2, and CLUSTER_1_3, contain the cluster assignments. For other numbers of clusters you will, at this point in time, need to run a clustering procedure outside of SPM.
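
To inspect the assignments, a short pandas snippet along these lines would tabulate them; note that the mapping of the three CLUSTER columns to the 4-, 6-, and 10-cluster solutions is our assumption and should be verified against your own output.

    import pandas as pd

    join = pd.read_csv("boston_unsup_join.csv")
    for col in ["CLUSTER_1_1", "CLUSTER_1_2", "CLUSTER_1_3"]:
        print(col, join[col].value_counts().sort_index().to_dict())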

Cluster: KMEANS

This is a conventional implementation of the KMEANS algorithm, included in SPM for convenience. Look for an updated version of this note after April 2014 to learn more about the advances included in all of our clustering machinery in the 2014 edition of SPM.
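
For readers who want to experiment outside SPM in the meantime, a rough scikit-learn analogue of a conventional k-means run on the same variables might look like this; the file name and the choice of k are assumptions.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Assumed file name, as above; k=6 is arbitrary. Standardizing first
    # keeps wide-ranging variables such as TAX from dominating distances.
    df = pd.read_csv("boston.csv")
    X_std = StandardScaler().fit_transform(df)
    km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_std)
    print(pd.Series(km.labels_).value_counts().sort_index())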