

Data Mining: Modeling

    18749-001

    SPSS v11.5; Clementine v7.0; AnswerTree 3.1; DecisionTime 1.1 Revised 9/26/2002 ss/mr


For more information about SPSS software products, please visit our Web site at http://www.spss.com or contact:

SPSS Inc.
233 South Wacker Drive, 11th Floor
Chicago, IL 60606-6412
Tel: (312) 651-3000
Fax: (312) 651-3668

SPSS is a registered trademark and its other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials.

The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412.

    TableLook is a trademark of SPSS Inc.

Windows is a registered trademark of Microsoft Corporation. DataDirect, DataDirect Connect, INTERSOLV, and SequeLink are registered trademarks of MERANT Solutions Inc. Portions of this product were created using LEADTOOLS © 1991-2000, LEAD Technologies, Inc. ALL RIGHTS RESERVED. LEAD, LEADTOOLS, and LEADVIEW are registered trademarks of LEAD Technologies, Inc. Portions of this product were based on the work of the FreeType Team (http://www.freetype.org).

General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks or registered trademarks of their respective companies in the United States and other countries.

Data Mining: Modeling
Copyright © 2002 by SPSS Inc.

All rights reserved. Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.


    Data Mining: Modeling

    Table of Contents

CHAPTER 1  INTRODUCTION

    INTRODUCTION
    MODEL OVERVIEW
    VALIDATION

CHAPTER 2  STATISTICAL DATA MINING TECHNIQUES

    INTRODUCTION
    STATISTICAL TECHNIQUES
    LINEAR REGRESSION
    DISCRIMINANT ANALYSIS
    LOGISTIC AND MULTINOMIAL REGRESSION
    APPENDIX: GAINS TABLES

CHAPTER 3  MARKET BASKET OR ASSOCIATION ANALYSIS

    INTRODUCTION
    TECHNICAL CONSIDERATIONS
    RULE GENERATION
    APRIORI EXAMPLE: GROCERY PURCHASES
    USING THE ASSOCIATIONS
    APRIORI EXAMPLE: TRAINING COURSE PURCHASES

CHAPTER 4  NEURAL NETWORKS

    INTRODUCTION
    BASIC PRINCIPLES OF SUPERVISED NEURAL NETWORKS
    A NEURAL NETWORK EXAMPLE: PREDICTING CREDIT RISK


CHAPTER 5  RULE INDUCTION AND DECISION TREE METHODS

    INTRODUCTION
    WHY SO MANY METHODS?
    CHAID ANALYSIS
    A CHAID EXAMPLE: CREDIT RISK
    RULE INDUCTION (C5.0)
    A C5.0 EXAMPLE: CREDIT RISK

CHAPTER 6  CLUSTER ANALYSIS

    INTRODUCTION
    WHAT TO LOOK AT WHEN CLUSTERING
    A K-MEANS EXAMPLE: CLUSTERING SOFTWARE USAGE DATA
    CLUSTERING WITH KOHONEN NETWORKS
    A KOHONEN EXAMPLE: CLUSTERING PURCHASE DATA

CHAPTER 7  TIME SERIES ANALYSIS

    INTRODUCTION
    DATA ORGANIZATION FOR TIME SERIES ANALYSIS
    INTRODUCTION TO EXPONENTIAL SMOOTHING
    A DECISIONTIME FORECASTING EXAMPLE: DAILY PARCEL DELIVERIES

CHAPTER 8  SEQUENCE DETECTION

    INTRODUCTION TO SEQUENCE DETECTION
    TECHNICAL CONSIDERATIONS
    DATA ORGANIZATION FOR SEQUENCE DETECTION
    SEQUENCE DETECTION ALGORITHMS IN CLEMENTINE
    A SEQUENCE DETECTION EXAMPLE: REPAIR DIAGNOSTICS

REFERENCES


    Chapter 1

    Introduction

    Topics:

    INTRODUCTION

    MODEL OVERVIEW

    VALIDATION


    INTRODUCTION

This course focuses on the modeling stage of the data mining process. It will compare and review the analytic methods commonly used for data mining. In addition, it will illustrate these methods using SPSS software (SPSS, AnswerTree, DecisionTime, and Clementine). The course assumes that a business question has been formulated and that relevant data have been collected, organized, checked, and prepared. In short, it assumes that all the time-consuming preparatory work has been completed and you are at the modeling stage of your project. For more details concerning what should be done during the earlier stages of a data mining project, see the SPSS Data Mining: Overview and Data Mining: Data Understanding and Data Preparation courses.

This chapter serves as a road map for the rest of the course. We try to place the various methods discussed within a framework and give you a sense of when to use which methods. The unifying theme is data mining, and we discuss in detail the analytic techniques most often used to support these efforts. The course emphasizes the practical issues of setting up, running, and interpreting the results of statistical and machine learning analyses. It assumes you have, or will have, some business questions that require analysis, and that you know what to do with the results once you have them.

There are choices regarding specific methods within several of these techniques, and the recommendations we make are based on what is known from properties of the methods, Monte Carlo simulations, or empirical work. You should be aware from the start that in most cases there is no single method that will definitely yield the best results. However, in the chapters that follow detailing the specific methods, we include sections that list research projects for which the method is appropriate, features and limitations of the method, and comments concerning model deployment. These should prove useful when you must decide on the method to apply to your problem.

Finally, the approach is practical, not mathematical. Relatively few equations are presented, and references are given for those who would like a more rigorous review of the techniques. Our goal is to provide you with a good sense of the properties of each method and how it is used and interpreted. The course does not strive for exhaustive detail: entire books have been written on topics we cover in a single chapter, and we are trying to present the main issues a practitioner will face.

Analyses are run using different SPSS products. However, the emphasis in this course is on understanding the characteristics of the methods and being able to interpret the results. Thus we will not discuss data definition and general program operation issues. We do present instructions to perform the analyses, but more information than is presented here is needed to master the software programs used. To provide this depth, SPSS offers operational courses for the products used in this course.


than non-inferential techniques, which can be a disadvantage. However, they provide rigorous tests of hypotheses unavailable with more automated methods of analysis.

Although these methods are not always mentioned in data mining books and articles, you need to be aware of them because they are often exactly what is necessary to answer a particular question. For instance, to predict the amount of revenue, in dollars, that a new customer is likely to provide in the next two years, linear regression could be a natural choice, depending on the available predictor variables and the nature of the relationships.

GENERAL TECHNIQUE (Data Mining)    PREDICTOR VARIABLES       OUTCOME VARIABLE
Decision Trees (Rule Induction)    Continuous or dummies*    Categorical (some allow continuous)
Neural Networks                    Continuous or dummies     Categorical or continuous

The key difference for most users between inferential and non-inferential techniques is whether hypotheses need to be specified beforehand. In the latter methods, this is not normally required, as each is semi- or completely automated as it searches for a model. Nonetheless, in all non-inferential techniques, you clearly need to specify a list of variables as inputs to the procedure, and you may have to specify other details, depending on the exact method. As we discussed in the previous courses in the SPSS Data Mining sequence, data mining is not a mindless activity; even here, you need a plan of approach (a research design) to use these techniques wisely.

Notice that the inferential statistical methods are not distinguished from the data mining methods in terms of the types of variables they allow. Instead, data mining methods, such as decision trees and neural networks, are distinguished by making fewer assumptions about the data (for example, normality of errors). In many instances both classes of methods can be applied to a given prediction problem.

Some data mining methods do not involve prediction, but instead search for groupings or associations in the data. Several of these methods are listed below along with the types of analysis you can do with them.


GENERAL TECHNIQUE                    ANALYSIS
Cluster Analysis                     Uses continuous or categorical variables to
                                     create cluster memberships; no predefined
                                     outcome variable.
Market Basket/Association Analysis   Uses categorical variables to create
                                     associations between categories; no outcome
                                     variable required.
Sequence Detection                   Uses categorical variables in data sorted in
                                     time order to discover sequences in data; no
                                     outcome variable required, but there may be
                                     interest in specific outcomes.

Finally, discussions of data mining mention the tasks of classification, affinity analysis, prediction, or segmentation. Below we group the data mining techniques within these categories.

Affinity/Association: These methods attempt to find items that are closely associated in a data file, with the archetypal case being the shopping patterns of consumers. Market basket analysis and sequence detection fall into this category.

Classification/Segmentation: These methods attempt to classify customers into discrete categories that have already been defined (for example, customers who stay and those who leave), based on a set of predictors. Several methods are available, including decision trees, neural networks, and sequence detection (when data are time structured). Note that logistic regression and discriminant analysis are inferential techniques that accomplish this same task.

Clustering/Segmentation: Notice that we have repeated the word segmentation. This is because segmentation is used in two senses in data mining. Its second meaning is to create natural clusters of objects (without using an outcome variable) that are similar on various characteristics. Cluster analysis and Kohonen networks accomplish this task.

Prediction/Estimation: These methods predict a continuous outcome variable, as opposed to classification methods, which work with discrete outcomes. Neural networks fall into this group. Decision tree methods can work with continuous predictors, but they split them into discrete ranges as the tree is built. Memory-based reasoning techniques (not covered in this course) can also predict continuous outcomes. Regression is the inferential method most likely to be used for this purpose.

The descriptions above are quite simple and hide a wealth of detail that we will consider as we review the techniques. More than one specific method is usually available for a general technique. So, to cluster data, K-means clustering, Two-step clustering, and Kohonen networks (a form of neural network) could all be used, with the choice of which to use depending on the type of data, the availability of software, the ease of understanding desired, the speed of processing, and so forth.


VALIDATION

Since most data mining methods do not depend on specific data distribution assumptions (for example, normality of errors) to draw inferences from the sample to the population, validation is strongly recommended. It is usually done by fitting the model to a portion of the data (called the Training data) and then applying the predictions to, and evaluating the results with, the other portion of the data (called the Validation data; note that some authors refer to this as Test data, but as we will see, Test data has a specific meaning in neural network estimation). In this way, the validity of the model is established by demonstrating that it applies to (fits) data independent of that used to derive the model. Statisticians often recommend such validation for statistical models, but it is crucial for the more general (less distribution-bound) data mining techniques. There are several methods of performing validation.

Holdout Sample

This method was described above. The data set is split into two parts: training and validation files. For large files this might be a 50/50 split, while for smaller files more records are typically placed in the training set. Modeling is performed on the training data, but fit evaluation is done on the separate validation data.
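As a rough illustration of the mechanics outside SPSS, the following minimal Python sketch performs a holdout split (scikit-learn is assumed here, and the data and field names are invented purely for illustration):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Invented stand-in for a prepared data mining file.
    df = pd.DataFrame({"los":   [1, 3, 2, 5, 4, 2, 7, 1],
                       "age":   [34, 51, 42, 66, 58, 29, 71, 45],
                       "claim": [4200, 7900, 6100, 11200, 9800, 5300, 14100, 6800]})

    # 50/50 holdout split; a smaller file would put more records in training.
    train, validation = train_test_split(df, test_size=0.5, random_state=42)
    # Fit the model on `train`; judge its fit on `validation` only.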

N-Fold Validation

If the data file is small, reserving a holdout sample may not be feasible (the training sample may be too small to obtain stable results). In this case n-fold validation may be done. Here the data set is divided into a number of groups of equal sample size. Let's use 10 groups for the example. The first group is held out from the analysis, which is based on the other 9 groups (or 9/10ths of the data), and is used as the validation sample. Next the second group is held out from the analysis, again based on the other 9 groups, and is used as the validation sample. This continues until each of the 10 groups has served as a validation sample. The validation results from each of these samples are then pooled.

This has the advantage of providing a form of validation in the presence of small samples, but since any given data record is used in 9 of the 10 models, there is less than complete independence. A second problem is that since 10 models are run there is no single model result (there are 10). For this reason, n-fold validation is generally used to estimate the fit or accuracy of a model with small data files and not to produce the model coefficients or rules.
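The fold-and-pool logic can be sketched in Python with scikit-learn's KFold (an assumption; the data and the regression model are placeholders chosen only to make the example run):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))   # hypothetical small file: 50 records, 3 predictors
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=50)

    scores = []
    for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[val_idx], y[val_idx]))  # fit on the held-out fold

    print(np.mean(scores))  # pooled estimate of fit; note there are 10 models, not one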

Some procedures extend this principle to base the model on all but one observation (using fast algorithms), keeping a single record as the holdout. Generally speaking, resource-wise, only closed-form models that involve no iteration (like regression or discriminant analysis) can afford this.


Validate with Other Models

Since different data mining models can often be applied to the same data, you would have greater confidence in your results if different methods led to the same conclusions. This is not to say that the results should be identical, since the models do differ in their assumptions and approach. But you would expect that important predictors repeat across methods and have the same general relationship to the outcome.

Validate with Different Starting Values

Neural networks usually begin with randomly assigned weights and then, hopefully, converge to the optimum solution. If analyses run with different starting values for the weights produce the same solution, then you would have greater confidence in it.

Domain Validation

Do the model results make sense within the business area being studied? Here a domain expert (someone who understands the business and the data) examines the model results to determine if they make sense, and to decide if they are interesting and useful, as opposed to obvious and trivial.


    Chapter 2

Statistical Data Mining Techniques

    Topics:

    STATISTICAL TECHNIQUES

    LINEAR REGRESSION

    DISCRIMINANT ANALYSIS

    LOGISTIC AND MULTINOMIAL REGRESSION

    APPENDIX: GAINS TABLES


    INTRODUCTION

In this chapter we consider the various inferential statistical techniques that are commonly used in data mining. We include a detailed example of each, as well as discussions about typical sample sizes, whether the method can be automated, and how easily the model can be understood and deployed.

As you work on the tasks we've cited in the last chapter, you should also be thinking about which data mining techniques to use to answer those questions. Research isn't done step-by-step, in some predefined order, as we are taught in textbooks. Instead, all phases of a data mining project should be under review early in the process.

This is especially critical for the data mining techniques you plan to employ, for at least three reasons. First, each data mining technique is suitable for only some types of analysis, not all. Thus the research question you have defined can't necessarily be answered by just any technique. So if you want to answer a question that requires, say, market basket analysis (discussed in Chapter 3), and you have little expertise in this procedure, you'll need to prepare ahead of time, conceivably even acquire additional software, so you are ready to begin analysis when the data are ready. Second, some techniques require more data than others, or data of a particular kind, so you will need to have these conditions in mind when you collect the data. And third, some techniques are more easily understandable than others, and their models are more readily retrained if the environment changes rapidly; both of these might affect your choice of technique.

In this chapter we provide several different frameworks or classification schemes by which to understand and conceptualize the various inferential data mining techniques available in SPSS and other software. Examples of each technique will be given, including research questions or projects suitable for that type of analysis. Although details for running various analyses are given in the chapter, the emphasis is on setting up the basic analysis and interpreting the results. For this reason, all available options and variations will not be covered in this class. Also, such steps as data definition and data exploration are assumed to be completed prior to the modeling stage. In short, the goal of the chapter is not to exhaustively cover each data mining procedure in SPSS, but to present and discuss the core features needed for most analyses. (For more details on specific procedures, you may attend separate SPSS, AnswerTree, DecisionTime, and Clementine application courses.) We provide an overview of these methods with enough detail for you to begin to make an informed choice about which method will be appropriate for your own data mining projects, to set up a typical analysis, and to interpret the results.


    STATISTICAL TECHNIQUES

Recall that inferential statistics have two key features. They require that you specify a hypothesis to test (such as that more satisfied customers will be more likely to make additional purchases), and they allow you to make inferences back to the population from the particular sample data you are studying.

Below is the listing, from Chapter 1, of the inferential methods commonly used in data mining projects. We will define them in later sections of this chapter. The type of variables each requires is also listed.

GENERAL TECHNIQUE                     PREDICTOR VARIABLES      OUTCOME VARIABLE
Discriminant Analysis                 Continuous or dummies*   Categorical
Linear Regression (and ANOVA)         Continuous or dummies    Continuous
Logistic and Multinomial Regression   Continuous or dummies    Categorical
Time Series Analysis                  Continuous or dummies    Continuous

(*Dummies refers to transformed variables coded 1 or 0, representing the presence or absence of a characteristic. Thus a field such as region (north, south, east, and west), when used as a predictor variable in several inferential methods, would be represented by dummy variables. For example, one dummy field might be named North and coded 1 if the record's region code was north and 0 otherwise.)
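As an illustration of the footnote above, dummy coding of a region field might look like the following in Python with pandas (used here purely for demonstration; SPSS users would typically create dummies within the product):

    import pandas as pd

    # Hypothetical region field.
    df = pd.DataFrame({"region": ["north", "south", "east", "west", "north"]})

    # One 1/0 dummy per category; one category is dropped to avoid redundancy
    # when the dummies enter a model that includes an intercept.
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True).astype(int)
    print(dummies)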

As we discuss the techniques, we also provide information on whether they can be automated, their ease of understanding, and the typical size of data files, plus other important traits.

After this brief glimpse at the various techniques, we turn next to a short discussion of each, including examples of research questions it can answer and where each technique can be found, if available, in SPSS software.

You are probably already familiar with several of the inferential statistics methods we consider here. Our emphasis is on the practical use of the techniques, not on the theory underlying each one.


    LINEAR REGRESSION

Linear regression is a method familiar to just about everyone these days. It is the classic linear model technique, and is used to predict an outcome variable that is interval or ratio scale with a set of predictors that are also interval or ratio. In addition, categorical predictor variables can be included by creating dummy variables. Linear regression is available in SPSS under the Analyze..Regression menu and is available in SPSS Clementine.

Linear regression, of course, assumes that the data can be modeled with a linear relationship. As an illustration, Figure 2.1 exhibits a scatterplot depicting the relationship between the number of previous late payments for bills and the credit risk of defaulting on a new loan. Superimposed on the plot is the best-fit regression line.

The plot may look a bit unusual because of the use of sunflowers, which are used to represent the number of cases at a point. Since credit risk and late payments are measured as whole integers, the number of discrete points here is relatively limited given the large file size (over 2,000 cases).

    Figure 2.1 Scatterplot of Late Payments and Credit Risk

Although there is a lot of spread around the regression line, it is clear that there is a trend in the data such that more late payments are associated with greater credit risk. Of course, linear regression is normally used with several predictors; this makes it impossible to display the complete solution with all predictors in convenient graphical form. Thus most users of linear regression rely on the numeric output.


Basic Concepts of Regression

Earlier we pointed out that, to the eye, there seems to be a positive relation between credit risk and the number of late payments. However, it would be more useful in practice to have some form of prediction equation. Specifically, if some simple function can approximate the pattern shown in the plot, then the equation for the function would concisely describe the relation, and could be used to predict values of one variable given knowledge of the other. A straight line is a very simple function, and is usually what researchers start with, unless there are reasons (theory, previous findings, or a poor linear fit) to suggest another. Also, since the point of much research involves prediction, a prediction equation is valuable. However, the value of the equation is linked to how well it actually describes or fits the data, and so part of the regression output includes fit measures.

    The Regression Equation and Fit Measure

In the plot above, credit risk is placed on the Y (vertical) axis and the number of late payments appears along the X (horizontal) axis. If we are interested in credit risk as a function of the number of late payments, we consider credit risk to be the dependent variable and number of late payments the independent or predictor variable. A straight line is superimposed on the scatterplot along with the general form of the equation:

Y = B*X + A

Here, B is the slope (the change in Y per one-unit change in X) and A is the intercept (the value of Y when X is zero).

Given this, how would one go about finding the best-fitting straight line? In principle, various criteria might be used: minimizing the mean deviation, the mean absolute deviation, or the median deviation. Due to technical considerations, and with a dose of tradition, the best-fitting straight line is taken to be the one that minimizes the sum of the squared deviations of each point about the line.

Returning to the plot of credit risk and number of late payments, we might wish to quantify the extent to which the straight line fits the data. The fit measure most often used, the r-square measure, has the dual advantages of falling on a standardized scale and having a practical interpretation. The r-square measure (which is the correlation squared, or r2, when there is a single predictor variable, hence its name) is on a scale from 0 (no linear association) to 1 (perfect prediction). Also, the r-square value can be interpreted as the proportion of variation in one variable that can be predicted from the other. Thus an r-square of .50 indicates that we can account for 50% of the variation in one variable if we know values of the other. You can think of this value as a measure of the improvement in your ability to predict one variable from the other (or others if there is more than one independent variable).
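To make the least-squares and r-square ideas concrete, here is a minimal Python sketch (numpy only; the data are invented for illustration) that fits Y = B*X + A and reports r-square:

    import numpy as np

    # Invented data: number of late payments (X) and credit risk score (Y).
    x = np.array([0, 1, 1, 2, 3, 3, 4, 5, 6, 7])
    y = np.array([20, 25, 22, 30, 34, 31, 40, 42, 50, 55])

    b, a = np.polyfit(x, y, deg=1)   # least-squares slope B and intercept A
    residuals = y - (b * x + a)

    # r-square: proportion of the variation in Y accounted for by the line.
    r_square = 1 - residuals.var() / y.var()
    print(f"Y = {b:.2f}*X + {a:.2f}, r-square = {r_square:.3f}")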


Multiple regression represents a direct extension of simple regression. Instead of a single predictor variable (Y = B*X + A), multiple regression allows for more than one independent variable in the prediction equation:

Y = B1*X1 + B2*X2 + B3*X3 + . . . + A

While we are limited in the number of dimensions we can view in a single plot (SPSS can build a 3-dimensional scatterplot), the regression equation allows for many independent variables. When we run multiple regression we will again be concerned with how well the equation fits the data, whether there are any significant linear relations, and estimating the coefficients for the best-fitting prediction equation. In addition, we are interested in the relative importance of the independent variables in predicting the dependent measure.
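The same least-squares machinery extends directly to several predictors; a sketch follows (again with invented data):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))   # three hypothetical predictors X1, X2, X3
    y = X @ np.array([2.0, -1.5, 0.5]) + 4.0 + rng.normal(size=100)

    # Append a column of ones so the intercept A is estimated along with B1-B3.
    design = np.column_stack([X, np.ones(len(X))])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    print("B1, B2, B3, A =", np.round(coefs, 2))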

    Residuals and Outliers

Viewing the plot, we see that many points fall near the line, but some are more distant from it. For each point, the difference between the value of the dependent variable and the value predicted by the equation (the value on the line) is called the residual. Points above the line have positive residuals (they were under-predicted), those below the line have negative residuals (they were over-predicted), and a point falling on the line has a residual of zero (perfect prediction). Points having relatively large residuals are of interest because they represent instances where the prediction line did poorly. As we will see shortly in our detailed example, large residuals (gross deviations from the model) have been used to identify data errors or possible instances of fraud (in application areas such as insurance claims, invoice submission, and telephone and credit card usage).

In SPSS, the Regression procedure can provide information about large residuals, and can also present them in standardized form. Outliers, or points far from the mass of the others, are of interest in regression because they can exert considerable influence on the equation (especially if the sample size is small, which is rarely the case in data mining). Also, outliers can have large residuals and would be of interest for this reason as well. While not covered in this class, SPSS can provide influence statistics to aid in judging whether the equation was strongly affected by an observation and, if so, to identify the observation.

    Assumptions

Regression is usually performed on data for which the dependent and independent variables are interval scale. In addition, when statistical significance tests are performed, it is assumed that the deviations of points around the line (residuals) follow the normal bell-shaped curve. Also, the residuals are assumed to be independent of the predicted values (values on the line), which implies that the variation of the residuals around the line is homogeneous (homogeneity of variance). SPSS can provide summaries and plots useful in evaluating these latter issues. One special case of the assumptions involves the interval scale nature of the independent variable(s). A variable coded as a dichotomy (say


Click Edit..Options
Click the General tab
Click the Display names option button in the Variable Lists section
Click the Alphabetical option button in the Variable Lists section
Click OK

Also, files are assumed to be located in the c:\Train\DM_Model directory. They can be copied from the floppy accompanying this guide (or from the CD-ROM containing this guide). If you are running SPSS Server (you can check by clicking File..Switch Server from within SPSS), then files used with SPSS should be copied to a directory that can be accessed from (is mapped to) the server.

To develop a regression equation predicting claims amount based on hospital length of stay, severity of illness group, and age using SPSS:

Click File..Open..Data (switch to the c:\Train\DM_Model directory if necessary)
Double-click on InsClaims
Click Analyze..Regression

This chapter will discuss two choices: Linear regression, which performs simple and multiple linear regression, and Logistic regression (binary). Curve Estimation invokes the Curvefit procedure, which can apply up to 16 different functions relating two variables. Binary logistic regression is used when the dependent variable is a dichotomy (for example, when predicting whether a prospective customer makes a purchase or not). Multinomial logistic regression is appropriate when you have a categorical dependent variable with more than two possible values. Ordinal regression is appropriate if the outcome variable is ordinal (rank ordered). Probit analysis, nonlinear regression, weight estimation (used for weighted least squares analysis), 2-Stage least squares, and optimal scaling are not generally used for data mining and so will not be discussed further here.


    Figure 2.2 Regression Menu

We will select Linear to perform multiple linear regression, then specify claim as the dependent variable and age, asg (severity level), and length of stay (los) as the independent variables.

Click Linear from the Regression menu
Move claim to the Dependent: list box
Move age, asg, and los to the Independent(s): list box


    Figure 2.3 Linear Regression Dialog Box

Since our goal is to identify exceptions to the regression model, we will ask for residual plots and information about cases with large residuals. The Regression dialog box allows many specifications; here we will discuss the most important features.

    Note on Stepwise Regression

With such a small number of predictor variables, we will simply add them all into the model. However, in the more common situation of many predictor variables (most insurance claims forms would contain far more information), a mechanism to select the most promising predictors is desirable. This could be based on the domain knowledge of a business expert (here perhaps a medical expert). In addition, an option may be chosen to select, from a larger set of independent variables, those that in some statistical sense are the best predictors (the Stepwise method).

The Selection Variable option permits cross-validation of regression results. Only cases whose values meet the rule specified for a selection variable will be used in the regression analysis, yet the resulting prediction equation will be applied to the other cases. Thus you can evaluate the regression on cases not used in the analysis, or apply the equation derived from one subgroup of your data to other groups. The importance of such validation in data mining is a repeated theme in this course.

While SPSS will present standard regression output by default, many additional (and some quite technical) statistics can be requested via the Statistics dialog box.


The Plots dialog box is used to generate various diagnostic plots used in regression, including a residual plot, in which we have interest. The Save dialog box permits you to add new variables to the data file containing such statistics as the predicted values from the regression equation, various residuals, and influence measures. We will create these in order to calculate our own percentage deviation field.

Finally, the Options dialog box controls the criteria used when running stepwise regression and the choices for handling missing data (the SPSS Missing Values option provides more sophisticated methods of handling missing values). Note that by default, SPSS excludes a case from the regression if it has one or more values missing for the variables used in the analysis.

Residual Plots

While we could run the multiple regression at this point, we will first request some diagnostic plots involving residuals and information about outliers. A residual is the (signed) difference between the actual value of the dependent variable and the value predicted by the model. Residuals can be used to identify large errors in prediction or cases poorly fit by the model. By default no residual plots appear. These options are explained below.

    Click the Plots pushbutton

    Within the Plots dialog box:

    Check Histogram in the Standardized Residual Plots area

    Figure 2.4 Regression Plots Dialog Box


The options in the Standardized Residual Plots area of the dialog box all involve plots of standardized residuals. Ordinary residuals are useful if the scale of the dependent variable is meaningful, as it is here (claim amount in dollars). Standardized residuals are helpful if the scale of the dependent variable is not familiar (say, a 1 to 10 customer satisfaction scale); in that case it may not be clear to the analyst just what constitutes a large residual: is an over-prediction of 1.5 units a large miss on a 1 to 10 scale? In such situations, standardized residuals (residuals expressed in standard deviation units) are very useful because large prediction errors can be easily identified. If the errors follow a normal distribution, then standardized residuals greater than 2 (in absolute value) should occur in about 5% of the cases, and those greater than 3 (in absolute value) should occur in less than 1% of the cases. Thus standardized residuals provide a norm against which to judge what constitutes a large residual. Recall that the F and t tests in regression assume that the residuals follow a normal distribution.

    Click Continue

Next we will look at the Statistics dialog box, which contains the Casewise Diagnostics option. When this option is checked, Regression will list information about all cases whose standardized residuals are more than 3 standard deviations from the line. This outlier criterion is under your control.

Click the Statistics pushbutton
Click the Casewise diagnostics check box in the Residuals area

    Figure 2.5 Regression Statistics Dialog Box

By requesting this option we will obtain a listing of those records that the model predicts poorly. When dealing with a very large data file, which may have many outliers, such a list is cumbersome.


It would be more efficient to save the residual value (standardized or not) as a new field, then select the large residuals and write these cases to a new file or add a flag field to the main database. We create these new fields below.

    Click Continue

Click the Save pushbutton
Click the check boxes for Unstandardized Predicted Values,
Unstandardized Residuals, and Standardized Residuals

    Figure 2.6 Saving Predicted Values and Errors

    Click Continue, then click OK
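For readers who want a rough non-SPSS analogue of these steps, the sketch below (Python with statsmodels, an assumption; the field names claim, age, asg, and los match the example, but the data values are invented) fits the same equation and saves predictions and residuals as new fields:

    import pandas as pd
    import statsmodels.api as sm

    # Invented stand-in for the InsClaims file.
    df = pd.DataFrame({"claim": [5200, 8100, 9400, 7200, 15800, 6900],
                       "age":   [45, 62, 58, 39, 51, 47],
                       "asg":   [1, 2, 3, 1, 3, 2],
                       "los":   [2, 4, 5, 3, 9, 3]})

    X = sm.add_constant(df[["age", "asg", "los"]])
    fit = sm.OLS(df["claim"], X).fit()

    df["pre_1"] = fit.fittedvalues             # unstandardized predicted values
    df["res_1"] = fit.resid                    # unstandardized residuals
    df["zre_1"] = fit.resid / fit.resid.std()  # standardized residuals (approximate)
    print(df[abs(df["zre_1"]) > 3])            # casewise diagnostics analogue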

    Now we examine the results.


expert (here a physician). Perhaps the youngest patients are at greater risk. If there isn't a convincing reason for this negative association, the data values for age and claims should be examined more carefully (perhaps data errors or outliers are influencing the results). Such oddities may have shown up in the original data exploration. We will not pursue this issue here, but it certainly would be done in practice.

The constant or intercept of $3,027 indicates that someone with 0 days in the hospital, in the least severe illness category (0), and at age 0 would be expected to file a claim of $3,027. This is clearly impossible. This odd result stems in part from the fact that no one in the sample had less than 1 day in the hospital (it was an inpatient procedure) and the patients were adults (no ages of 0), so the intercept projects well beyond where there are any data. Thus the intercept cannot represent an actual patient, but may still be needed to fit the data. Also, note that when using regression it can be risky to extrapolate beyond where the data are observed; the assumption is that the same pattern continues. Here it clearly cannot!

The Standard Error (of B) column contains the standard errors of the estimated regression coefficients. These provide a measure of the precision with which we estimate the B coefficients. The standard errors can be used to create a 95% confidence band around the B coefficients (available as a Statistics option). In our example, the regression coefficient for length of stay is $1,106 and the standard error is about $104. Thus we would not be surprised if the true population regression coefficient were $1,000 or $1,200 (within two standard errors of our sample estimate), but it is very unlikely that the true population coefficient would be $300 or $2,000.

Betas are standardized regression coefficients and are used to judge the relative importance of each of several independent variables. They are important because the values of the regression coefficients (Bs) are influenced by the standard deviations of the independent variables, and the beta coefficients adjust for this. Here, not surprisingly, length of stay is the most important predictor of claims amount, followed by severity group and age. Betas typically range from -1 to 1, and the further from 0, the more influential the predictor variable.

Thus if we wish to predict claims based on length of stay, severity code, and age, the formula would use the B coefficients:

Predicted Claims = $1,106*(length of stay) + $417*(severity code) - $33*(age) + $3,027
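Applied to a hypothetical patient (the values below are invented purely for illustration), the equation is straightforward arithmetic:

    # One hypothetical patient: 5 days in hospital, severity code 3, age 40.
    claim = 1106 * 5 + 417 * 3 - 33 * 40 + 3027
    print(claim)  # 5530 + 1251 - 1320 + 3027 = 8488 dollars predicted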

Points Poorly Fit by the Model

The motivation for this analysis is to detect errors or possible fraud by identifying cases that deviate substantially from the model. As mentioned earlier, these need not be the result of errors or fraud, but they are inconsistent with the majority of cases and thus merit scrutiny. We first turn to a list of cases whose residuals are more than three standard deviations from 0 (a residual of 0 indicates the model perfectly predicts the outcome).


    Figure 2.9 Outliers

There are two cases for which the claims value is more than three standard deviations from the regression prediction. Both are about $6,000 more than expected from the model. Note that they are 5.5 and 6.1 standard deviations away from the model predictions. These would be the claims to examine more carefully. The case sequence number for these records appears, or an identification field could be substituted (through the Case Labels box within the Linear Regression dialog).

    Figure 2.10 Histogram of Residuals

This histogram of the standardized residuals presents the overall distribution of the errors. It is clear that all large residuals are positive (meaning the model under-predicted the claims value). Case (record) identification is not available in the histogram, but since the standardized residuals were added to the data file, they can be easily selected and examined.


    Figure 2.12 Percent Deviation Field

Extreme values on this percent deviation field can also be used to identify exceptional claims. While we won't pursue it here, a histogram would display the distribution of the deviations, and cases with extreme values could be selected for closer examination. Unusual values could appear at both the high and low ends, with low values indicating the claim was much less than predicted by the model. These might be examined as well, since they might reflect errors or suggest less expensive variations on the treatment.
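Continuing the earlier Python sketch, the percent deviation field can be computed from the saved prediction and residual fields (the exact definition used in the pages not reproduced here is assumed to be the residual expressed as a percentage of the predicted value):

    # Residual as a percentage of the predicted value (assumed definition).
    df["pct_dev"] = 100 * df["res_1"] / df["pre_1"]

    # Flag claims deviating more than, say, 50% in either direction.
    print(df[abs(df["pct_dev"]) > 50])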

In this section, we offered the search for deviations from a model as a method to identify data errors or possible fraud. It would not, of course, detect fraudulent claims consistent with the model prediction. In actual practice, such models are usually based on a much greater number of predictor variables, but the principles, whether using regression or more complex models such as neural networks, are largely the same.

Appropriate Research Projects

Other examples of questions for which linear regression is appropriate are:

Predict expected revenue in dollars from a new customer based on customer characteristics.

Predict sales revenue for a store (with a sufficiently large number of stores in the database).

Predict waiting time on hold for callers to an 800 number.


Other Features

There is no limit to the size of data files used with linear regression, but just as with discriminant analysis, most uses of regression limit the number of predictors to a manageable number, say under 50 or so. As before, there is then no need for extremely large file sizes.

The use of stepwise regression is quite common. Since this involves selection of a few predictors from a larger set, it is recommended that you validate the results with a validation data set when you use a stepwise method.

Although this technique is called linear regression, with the use of suitable transformations of the predictors it is possible to model non-linear relationships. However, more in-depth knowledge is needed to do this correctly, so if you expect non-linear relationships to occur in your data, you might consider using neural networks or classification and regression trees, which handle these more readily, if differently.

Model Understanding

Linear regression produces very easily understood models, as we can see from the table in Figure 2.8. As noted, graphical results are less helpful with more than a few predictors, although graphing the prediction error against other variables can lead to insights about where the model fails.

    Model Deployment

Predictions for new cases are made from one equation using the unstandardized regression coefficient estimates. Any convenient software for doing this calculation can be employed, and regression equations can therefore be applied directly to data warehouses, not only to extracted datasets. This makes the model easily deployable.


    DISCRIMINANT ANALYSIS

Discriminant analysis, a technique used in market research and credit analysis for many years, is a general linear model method, like linear regression. It is used in situations where you want to build a predictive model of group or category membership, based on linear combinations of predictor variables that are either continuous (age) or categorical variables represented by dummy variables (type of customer). Most of the predictors should be truly interval scale, or else the multivariate normality assumption will be violated. Discriminant is available in SPSS under the Analyze..Classify menu.

Discriminant follows from a view that the domain of interest is composed of separate populations, each of which is measured on variables that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of these measures that best separate the populations. This is represented in Figure 2.13, which shows one discriminant function derived from two input variables, X and Y, that can be used to predict membership in a dependent variable: group. The score on the discriminant function separates cases in group 1 from group 2, using the midpoint of the discriminant function (the short line segment).

    Figure 2.13 Discriminant Function Derived From Two Predictors
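The same idea can be sketched in Python with scikit-learn's LinearDiscriminantAnalysis (an assumption; the two populations below are invented):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(2)
    # Two invented populations measured on X and Y, shifted apart.
    group1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
    group2 = rng.normal(loc=[2.0, 1.5], scale=1.0, size=(100, 2))
    data = np.vstack([group1, group2])
    labels = np.array([1] * 100 + [2] * 100)

    lda = LinearDiscriminantAnalysis().fit(data, labels)
    print(lda.coef_, lda.intercept_)  # the linear combination separating the groups
    print(lda.score(data, labels))    # classification accuracy on the training data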


A Discriminant Example: Predicting Purchases

To demonstrate discriminant analysis we take data from a study in which respondents answered, hypothetically, whether they would accept an interactive news subscription service (via cable). There was interest in identifying those segments most likely to adopt the service. Several demographic variables were available: education, gender, age, income category, number of children, number of organizations the respondent belonged to, and the number of hours of TV watched per day. The outcome measure was whether they would accept the offering. Most of the predictor variables are interval scale, the exceptions being gender (a dichotomy) and income (an ordered categorical variable). We would expect few if any of these variables to follow a normal distribution, but we will proceed with discriminant anyway. As in our other examples, we will move directly to the analysis, although ordinarily you would run data checks and exploratory data analysis first.

Click File..Open..Data
Move to the c:\Train\DM_Model directory (if necessary)
Double-click on Newschan (respond No if asked to save Data Editor contents)
Click Analyze..Classify..Discriminant
Click newschan, then click the upper arrow to move it into the Grouping Variable list box

Notice that two question marks appear beside newschan in the Grouping Variable list box. This is because Discriminant can be applied to more than two outcome groups and expects a minimum and maximum group code. The news channel acceptance variable is coded 0 (no) and 1 (yes), and we use the Define Range pushbutton to supply this information.

Click the Define Range pushbutton (not shown)
Type 0 in the Minimum text box
Click in the Maximum text box, and type 1
Click Continue to process the range

The default Method within Discriminant is to run the analysis using all the predictor variables. For the typical data mining application, you would probably invoke a stepwise option that enters predictor variables into the equation based on statistical criteria instead of forcing all predictors into the model.

Click and drag from age to tvday to select them
Click the lower arrow to place the selected variables in the Independents: list box
Click the Use stepwise method option button


    Figure 2.14 Discriminant Analysis Dialog Box

Click the Classify pushbutton
Click the Summary table checkbox
Click the Leave-one-out classification checkbox

    Figure 2.15 Classification Dialog Box

The Classification dialog box controls the results displayed when the discriminant model is applied to the data. The most useful table does not print by default (because misclassification summaries require a second data pass), but you can easily request a summary classification table, which reports how well the model predicts the outcome.


case) would be more convenient to work with. If you run discriminant with more than two outcome categories, then Fisher's coefficients are easier to apply as prediction rules. If you suspect some of the predictors are highly related, you might view the within-groups correlations among the predictor variables to identify highly correlated predictors.

    Click Continue to process Statistics requests

Now we are ready to run the stepwise discriminant analysis. The Select pushbutton can be used to have SPSS select part of the data to estimate the discriminant function, and then apply the predictions to the other part (cross-validation). We would use this method of validation in place of the leave-one-out method if our data set were larger. The Save pushbutton will create new variables that contain the group membership predicted from the discriminant function and the associated probabilities. To retain predictions for the training data set, you would use the Save dialog to create these variables.

    Click OK to run the analysis

Scroll to the Classification Results table at the bottom of the Viewer window

    Figure 2.17 Classification Results Table

Although this table appears at the end of the discriminant output, we turn to it first. It is an important summary since it tells us how well we can expect to predict the outcome. There are two subtables, with Original referring to the training data and Cross-Validated supplying the leave-one-out results. The actual (known) groups constitute the rows and the predicted groups make up the columns of the table.


Looking at the Original section, of the 227 people surveyed who said they would not accept the offering, the discriminant model correctly predicted 157 of them, so its accuracy for this group is 69.2%. For the 214 respondents who said they would accept the offering, 66.4% were correctly predicted. Thus overall, the discriminant model was accurate in 67.8% of the cases. The Cross-Validated summary is very close (67.3% accurate overall). Is this performance good? If we simply guessed the larger group 100% of the time, we would be correct 227 times out of 441 (227 + 214), or about 51.5% of the time. The 67.8% and 67.3% figures, while certainly far from perfect accuracy, do far better than guessing. Whether you would accept this performance and review the remaining output, or go back to the drawing board, is largely a function of the level of predictive accuracy required. Since we are interested in discovering which characteristics are associated with someone who accepts the news channel offer, we proceed.
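These percentages can be recomputed from the classification table itself; a small Python sketch follows (the counts are reconstructed from the percentages quoted above):

    import numpy as np

    # Rows = actual (no, yes); columns = predicted (no, yes).
    # 157 of 227 "no" and 142 of 214 "yes" respondents were classified correctly.
    table = np.array([[157,  70],
                      [ 72, 142]])

    accuracy = np.trace(table) / table.sum()          # (157 + 142) / 441, about .678
    baseline = table.sum(axis=1).max() / table.sum()  # always guess larger group, about .515
    print(round(accuracy, 3), round(baseline, 3))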

Stepwise Results

Age is entered first, followed by gender and education. A significance test (Wilks' lambda) of between-group differences is performed for the variables at each step. None of the other variables made a significant difference after adjusting for the first three. As an exercise you might rerun the analysis with the additional variables entered and compare the classification results.

    Figure 2.18 Stepwise Results

This summary is followed by one entitled "Variables in the Analysis" (not shown), which lists the variables included in the discriminant analysis at each step. For the variables selected, tolerance is shown. It measures the proportion of variance in each predictor variable that is independent of the other predictors in the equation at this step. As tolerance values approach 0 (say, below .1 or so) the data approach multicollinearity, meaning the predictor variables are highly interrelated, and interpretation of individual coefficients can be compromised. Note that discriminant coefficients are only calculated after the stepwise phase is complete.

    Figure 2.19 Standardized Coefficients and Structure Matrix

The standardized discriminant coefficients can be used as you would regression Beta coefficients in that they attempt to quantify the relative importance of each predictor in the discriminant function. Not surprisingly, age is the dominant factor. The signs of the coefficients can be interpreted with respect to the group means on the discriminant function (see Figure 2.20). An older individual will have a higher discriminant score, since the age coefficient is positive. The outcome group accepting the offering has a positive mean (see Figure 2.20), and so older people are more likely to accept the offering. Notice the coefficient for gender is negative. Other things being equal, a shift from a man (code 0) to a woman (code 1) results in a one unit change, which when multiplied by the negative coefficient will lower the discriminant score, and move the individual toward the group with a negative mean (those that don't accept the offering). Thus women are less likely to accept the offering, adjusting for the other predictors.


    Figure 2.20 Unstandardized Coefficients and Group Means (Centroids)

Back in Figure 2.13 we saw a scatterplot of two separate groups and the axis along which they could be best separated. Unstandardized discriminant coefficients, when multiplied by the values of an observation, project an individual onto this discriminant axis (or function) that separates the groups. If you wish to use the unstandardized coefficient estimates for prediction purposes, you simply multiply a prospective customer's education, gender and age values by the corresponding unstandardized coefficients and add the constant. Then you compare this value to the cut point (by default the midpoint) between the two group means (centroids) along the discriminant function (the means appear in Figure 2.20). If the prospective customer's value is greater than the cut point, you predict the customer will accept; if the score is below the cut point, then you predict the customer will not accept. This prediction rule is easy to implement with two groups, but involves much more complex calculations when more than two groups are involved. It is in a convenient form for "what if" scenarios; for example, if we have a male with 16 years of education, at what age would such an individual become a good prospect? To answer this we determine the age value that moves the discriminant score above the cut point, as the sketch below illustrates.
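The following Python sketch illustrates this prediction rule. The coefficient and centroid values are hypothetical placeholders (the actual values appear in Figure 2.20 of the output, which is not reproduced in the text); substitute your own estimates before using it.

# Hypothetical unstandardized coefficients and group centroids -- replace
# with the values from your own discriminant output (Figure 2.20)
B_EDUC, B_GENDER, B_AGE, CONST = 0.10, -0.50, 0.05, -2.50
CENTROID_NO, CENTROID_YES = -0.35, 0.37
CUT_POINT = (CENTROID_NO + CENTROID_YES) / 2   # default: midpoint of the centroids

def discriminant_score(educ, gender, age):
    # Project the individual onto the discriminant function
    return B_EDUC * educ + B_GENDER * gender + B_AGE * age + CONST

def predict(educ, gender, age):
    return "accept" if discriminant_score(educ, gender, age) > CUT_POINT else "not accept"

# "What if" scenario: for a male (code 0) with 16 years of education,
# find the age at which the score first exceeds the cut point
age_needed = (CUT_POINT - CONST - B_EDUC * 16 - B_GENDER * 0) / B_AGE
print(predict(16, 0, 30), round(age_needed, 1))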


    Figure 2.21 Fisher Classification Coefficients

The Fisher function coefficients can be used to classify new observations (customers). If we know a prospective customer's education (say 16 years), gender (Female = 1) and age (30), we multiply these values by the set of Fisher coefficients for the No (no acceptance) group (2.07*16 + 1.98*1 + .32*30 - 20.85), which yields a numeric score. We repeat the process using the coefficients for the Yes group and obtain another score. The customer is then placed in the outcome group for which she has the higher score. Thus the Fisher coefficients are easy to incorporate later into other software (spreadsheets, databases) for predictive purposes.
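A sketch of the same rule in Python appears below. The No coefficients are the ones quoted above; the Yes coefficients are hypothetical placeholders, since the actual values from Figure 2.21 are not reproduced here.

# Fisher classification function coefficients: one set per outcome group.
# "No" values are from the text; "Yes" values are hypothetical placeholders.
FISHER = {
    "No":  {"educ": 2.07, "gender": 1.98, "age": 0.32, "const": -20.85},
    "Yes": {"educ": 2.10, "gender": 1.50, "age": 0.40, "const": -24.00},  # hypothetical
}

def classify(educ, gender, age):
    # Score the case with each group's coefficients; pick the higher score
    scores = {grp: c["educ"] * educ + c["gender"] * gender + c["age"] * age + c["const"]
              for grp, c in FISHER.items()}
    return max(scores, key=scores.get), scores

group, scores = classify(educ=16, gender=1, age=30)
print(group, scores)   # "No" score: 2.07*16 + 1.98*1 + .32*30 - 20.85 = 23.85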

We did not test for the assumptions of discriminant analysis (normality, equality of within-group covariance matrices) in this example. In general, normality does not make a great deal of difference, but heterogeneity of the covariance matrices can, especially if the sample group sizes are very different. Here the sample sizes were about the same. For a more detailed discussion of problems with assumption violation in discriminant analysis, see Lachenbruch (1975) or Huberty (1994).

As mentioned earlier, whether you consider the hit rate here to be adequate really depends on the costs of errors, the benefits of a correct prediction, and what your alternatives are. Here, although the prediction was far from perfect, we were able to identify the relations between the demographic variables and the choice outcome.

Appropriate Research Projects

Examples of questions for which discriminant analysis is appropriate are:

Predict instances of fraud in all types of situations, including credit card, insurance, and telephone usage.

    Predict whether customers will remain or leave (churn or not).


    Predict which customers will respond to a new product or offer.

    Predict outcomes of various medical procedures.

Other Features

In theory, there is no limit to the size of data files for discriminant analysis, either in terms of records or variables. However, practically speaking, most applications of discriminant limit the number of predictors to a few dozen at most. With that number of predictors, there is usually no reason to use more than a few thousand records.

It is possible to use stepwise methods with discriminant, so that the software can select the best set of predictors from a larger potential group. In this sense, stepwise discriminant can be considered an automated procedure like decision trees. As a result, if you use a stepwise method, you should use a validation dataset on which to check the model derived by discriminant.

Model Understanding

Discriminant analysis produces easily understood results. We have already seen the classification table in Figure 2.17. In addition, the procedure calculates the relative importance of each variable as a predictor (standardized coefficients; see Figure 2.19). Graphical output is produced by discriminant, but with more than a few predictors it becomes less useful.

Model Deployment

Predictions for new cases are made from simple equations using the classification function coefficients (especially the Fisher coefficients). This means that any statistical program, or even a spreadsheet program, could be used to generate new predictions, and that the model can be applied directly to data warehouses, not only extracted data sets. This makes the model easily deployable.


LOGISTIC AND MULTINOMIAL REGRESSION

    Figure 2.22 The Logistic Function

After the procedure calculates the outcome probability, it simply assigns a case to a predicted category based on whether its probability is above .50 or not. The same basic approach is used when the dependent variable has three or more categories.

In Figure 2.22, we see that the logistic model is a nonlinear model relating predictor variables to the probability of a choice or event (for example, a purchase). If there are two predictor variables (X1, X2), then the logistic prediction equation can be expressed as:

prob(event) = exp(B1*X1 + B2*X2 + A) / (1 + exp(B1*X1 + B2*X2 + A))

where exp() represents the exponential function. The conceptual problem is that the probability of the event is not linearly related to the predictors. However, if a little math is done you can establish that the odds of the event occurring are equal to:

exp(B1*X1 + B2*X2 + A), which equals exp(B1*X1) * exp(B2*X2) * exp(A)

Although not obviously simpler to the eye, the second formulation (and SPSS displays the logistic coefficients both in the original form and raised to the exponential power) allows you to state how much the odds of the event change with a one unit change in the predictor. For example, if I stated that the odds of making a sale double if a resource is given to me, everyone would know what I meant. With this in mind we will look at the coefficients in the logistic regression equation and try to interpret them.
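Before turning to the output, here is a small numeric check of the two formulations above, written in Python with hypothetical coefficient values (B1, B2 and A are not taken from any output in this chapter):

import math

B1, B2, A = 0.8, -0.4, -1.0    # hypothetical coefficients
X1, X2 = 2.0, 1.5              # hypothetical predictor values

lp = B1 * X1 + B2 * X2 + A                 # the linear predictor
prob = math.exp(lp) / (1 + math.exp(lp))   # probability of the event
odds = prob / (1 - prob)                   # odds of the event

# The odds equal exp(lp), and exp(lp) factors into per-variable multipliers
assert math.isclose(odds, math.exp(lp))
assert math.isclose(odds, math.exp(B1 * X1) * math.exp(B2 * X2) * math.exp(A))

predicted = 1 if prob >= 0.5 else 0        # the default classification rule
print(f"prob = {prob:.3f}, odds = {odds:.3f}, predicted = {predicted}")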

Recall that logistic regression assumes that the predictor variables are interval scale, and, like regression, dummy coding of predictors can be performed. As such, its assumptions are less restrictive than those of discriminant analysis.

A Logistic Regression Example: Predicting Purchases

We will apply logistic regression to the same problem of discovering which demographics are related to acceptance of an interactive news service. However, instead of running a stepwise method we will apply the variables selected by our discriminant analysis and compare the results.

Click Analyze..Regression..Binary Logistic
Move newschan into the Dependent list box
Move age, educate and gender into the Covariates: list box

    Figure 2.23 Logistic Regression Dialog Box

This is all we need in order to run a standard logistic regression analysis. Notice the Interaction button. You can create interaction terms by clicking on two or more predictor variables in the original list, then clicking on the Interaction button. Also, you can use the Categorical pushbutton to have Logistic Regression create dummy coded (or contrast) variables to substitute for your categorical predictor variables (note that Clementine performs such operations automatically in its modeling nodes). The Save pushbutton allows you to create new variables containing the predicted probability of the event, and various residual and influence measures. As in Discriminant, the Select pushbutton can be used to estimate the model on a subset of the cases.


    Figure 2.25 Classification Table

The classification results table indicates that those refusing the offer were predicted with 70.5% accuracy and those accepting with 61.7% accuracy, for an overall correct classification of 66.2%. The logistic model predicted slightly better for the refusals, and about 4 percentage points worse for the acceptances, so overall it does slightly worse (about 2 percentage points) than discriminant on the training sample. The default classification rule for a case is that if the predicted probability of belonging to the outcome group with the higher value (here 1) is greater than or equal to .5, then predict membership in that group. Otherwise, predict membership in the group with the lower outcome value (here 0). We will examine these predicted probabilities in more detail later.

Figure 2.26 Significance Tests and Model Summary

The Model Chi-square test provides a significance test for the entire model (three variables), similar to the overall F test in regression. We would say there is a significant relation between the three predictors and the outcome. The Step Chi-square records the change in chi-square from one step to the next and is useful when running stepwise methods.


    Figure 2.27 Model Summary

The pseudo r-square is a statistic modeled after the r-square in regression (discussed earlier in this chapter). It measures how much of the initial lack-of-fit chi-square is accounted for by the variables in the model. Both variants indicate the model only accounts for a modest amount of the initial unexplained chi-square. Now let's move to the variables in the equation.

    Figure 2.28 Variables in the Equation

The B coefficients are the actual logistic regression coefficients, but recall they bear a nonlinear relationship to the probability of accepting the offer. Although they do linearly relate to the log odds of accepting, most people do not find this metric helpful for interpretation. The second column (S.E.) contains the standard errors for the B coefficients. The Wald statistic is used to test whether the predictor is significantly related to the outcome measure, adjusting for the other variables in the equation (all three are highly significant). The last column presents the B coefficient exponentiated using the e (exponential) function, and we can interpret these coefficients in terms of an odds shift in the outcome. For example, the coefficients of age and education are above 1, meaning that the odds of accepting the offer increase with increasing age and education. The coefficient for age indicates that the odds increase by a factor of 1.06 per year, which seems rather small. However, recall that age can range from 18 to almost 90 years old, and a 20-year age difference would have a substantial impact on the odds of accepting the offering (the odds more than triple).
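The arithmetic is easy to verify with a one-line check in Python, using the exp(B) value of roughly 1.06 reported in the table:

# Odds multiplier implied by a 20-year age difference, given exp(B) ~ 1.06 per year
print(1.06 ** 20)   # about 3.21 -- the odds more than triple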

The coefficient for gender is about .5, indicating that if the other factors are held constant, moving from a male to a female reduces the odds of accepting the offering by about 1/2.


involved, which requires more calculations, but this doesn't make prediction for new cases that much more difficult.

APPENDIX: GAINS TABLES

One way to evaluate the usefulness of a classification model is to group or order cases by a score predicted from the model, and then examine the desired response rate within these score-ordered groups. For the model to be useful at the practical level, cases with high model scores should show higher response rates than those with low scores. Gains tables and lift charts provide numeric and graphical summaries of this aspect of the analysis.

Using them, a direct marketing analyst can estimate the proportion of positive respondents they will find if they promote the offering to the top x% of the sample. Some programs, for example SPSS AnswerTree, automatically produce gains tables. However, a basic gains table can be produced easily within SPSS, and with some additional effort, a gains plot can be created as well. In this context, the score would be the predicted probability in a binary logistic analysis or the estimated discriminant score in a two-outcome discriminant analysis.

The main concept behind the gains table and chart involves ordering or grouping the cases by the score produced from the model, and evaluating the response rates. If the summary is tabular (gains table), then the cases are placed into score groups (for example, decile groups by score). For graphs, the cases can be grouped (for example, a gains plot based on decile groups) or not (for example, a gains plot in which each point represents a unique score).

To demonstrate, we will create a basic gains table using the SPSS menus. In addition, SPSS syntax that will produce a gains chart can be found in Gains.sps (located in c:\Train\DM_Model); it involves more data manipulation in SPSS. Gains and other evaluation charts are available in Clementine and AnswerTree.

Since the predictions in binary logistic regression are based on the predicted probability, we will use these values as the model scores. First we will collapse the data into ten groups based on their predicted probabilities, and then we will display the proportion of Yes responses within each decile group. Our use of decile (ten) groups is arbitrary; you might create more or fewer groups.

Click Transform..Rank Cases from the Data Editor window
Move pre_1 into the Variable(s) list box
Click the Largest value option button


The predicted probability variable from the logistic regression (pre_1) will be used to create a new rank variable. In addition, we indicate that the highest value of pre_1 should be assigned rank 1. This makes sense: the case with the highest predicted probability of belonging to the Yes outcome group (coded 1, the highest value) should have the top rank. If we stopped here, each case would be assigned a unique rank (assuming no ties on predicted probability). Since we want to create decile groups, we must make use of the Rank Cases: Types dialog box.

Click the Rank Types pushbutton
Click the Ntiles check box to check it
Erase 4 and type 10 in the Ntiles text box
Click the Rank check box so it is not checked
Click Continue, then click OK

The Rank procedure will now create a new variable coded 1 through 10, representing decile groups based on the predicted probability of responding Yes.

Although a number of procedures can display the critical summary (response percentage for each decile group), we will use the OLAP Cubes procedure since it can present the base rate (assuming a random prediction model) as well.

Click Analyze..Reports..OLAP Cubes
Move newschan into the Summary Variable(s) list box
Move npre_1 into the Grouping Variable(s) list box

The OLAP Cubes procedure is designed to produce a multidimensional summary table that supports drill-down operations. We choose it for some specific summary statistics.

Click the Statistics pushbutton
Click and drag to select all statistics in the Cell Statistics list box
Click the left arrow to remove these statistics from the Cell Statistics list box
Move Number of Cases into the Cell Statistics list box
Move Percent of Sum in (npre_1) into the Cell Statistics list box
Move Percent of N in (npre_1) into the Cell Statistics list box

We request several summaries for each decile score group. The number of cases in each group will appear, along with the percent of the total sum of the newschan variable in each decile group (based on npre_1). Since newschan is coded 0 or 1, the percentage of the overall sum of newschan in each decile group is the percentage of all positive responses to the newschan question that fall in that group. Finally, the percentage of cases in each decile group will display. This provides the base rate against which the model-score based deciles will be compared. The logic is that if the model predictions are unrelated to the actual outcome, then we expect that the top decile of cases (based on the model) will contain about 10% of the positive responses: the rate we expect by chance alone.
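The same gains-table logic is easy to sketch outside SPSS. The Python/pandas code below is a minimal illustration, assuming a data frame df with the observed outcome newschan (0/1) and the model score pre_1 (the column names mirror this example; the data frame itself is assumed, not created here):

import pandas as pd

def gains_table(df, score="pre_1", outcome="newschan", groups=10):
    # Rank 1 = highest score, then cut the ranks into equal-sized groups
    ranks = df[score].rank(method="first", ascending=False)
    decile = pd.qcut(ranks, groups, labels=range(1, groups + 1))
    g = df.groupby(decile, observed=True)[outcome].agg(n="count", positives="sum")
    g["% of positives"] = 100 * g["positives"] / df[outcome].sum()  # the gains column
    g["% of cases"] = 100 * g["n"] / len(df)                        # the base rate
    return g

# print(gains_table(df))   # one row per decile, best-scoring decile first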


Click Continue
Click the Title pushbutton
Erase OLAP Cubes and type Gains Table in the Title text box
Click Continue, then click OK

The OLAP Cubes table is designed to be manipulated within the Pivot Table editor to permit different views of the summaries. In order to see the entire table we must move the decile-grouping variable into the row dimension of the pivot table (such manipulation is covered in more detail in The Basics: SPSS for Windows and Intermediate Topics: SPSS for Windows courses).

Double-click on the Gains Table pivot table in the Viewer window
If the Pivoting Trays window is not visible, then click Pivot..Pivoting Trays
Click the pivot icon in the Layer tray of the Pivoting Trays window

The Pivoting Trays window allows us to move table elements (in this case the decile categories) from one dimension to another. We need to move the npre_1 (NTILES of PRE_1) icon from the layer to the row dimension.


    Figure 2.29 Pivot Table Editor

Drag the NTILES of PRE_1 pivot icon to the right side of the Row tray (right of the icon already in the row tray)
Click outside the crosshatched area in the Viewer window to close the Pivot Table Editor


    Figure 2.30 Gains Table

The column headings of the pivot table can be edited so they are easier to read (double-click on the pivot table, then double-click on any column heading to edit it). The rows of the table represent the decile groupings based on the predicted probabilities (scores) from the logistic regression model. The column labeled N contains the number of cases in each decile group and the % of N in NTILES of PRE_1 column displays the percentages. This latter column contains the expected percentage of overall positive responses appearing in each group under a model with no predictability (i.e., the base rate). The column of greatest interest is labeled % of Sum in NTILES of PRE_1. It contains the percentage of the overall positive responses contained in each decile group under the model.

Examining the first decile (1), we see that the most promising 10% of the sample contains 16.4% of the positive respondents. Similarly, the second and third deciles each contain 15% of the positive respondents. Thus, if we were to offer the interactive news cable package to the top 30% of the prospects, we expect we would obtain 46.4% of the positive responders. In this way, analysts in direct mail and related areas can evaluate the expected return of mailing to the top x% of the population (based on the model). In the fourth decile and beyond, the return is near or below that expected from a random model, so the decile groups holding most promise are the first three.


Chapter 3

Market Basket or Association Analysis

    Topics:

    INTRODUCTION

    TECHNICAL CONSIDERATIONS

    RULE GENERATION

    APRIORI EXAMPLE: GROCERY PURCHASES

    USING THE ASSOCIATIONS

    APRIORI EXAMPLE: TRAINING COURSE PURCHASES


    INTRODUCTION

As its name implies, market basket analysis techniques were developed, in part, to analyze consumer shopping patterns. These methods are descriptive and find groupings of items. Market basket or association analysis clusters fields (items), which are typically products purchased, but could also be medical procedures prescribed for patients, or banking or telecom services used by customers. The techniques look for patterns or clusters among a small number of items. These techniques can be found in Clementine under the Apriori and GRI procedures.

Market basket or association analysis produces output that is easy to understand. For example, a rule may state that if corn chips are purchased, then 65% of the time cola is purchased, unless there is a promotion, in which case 85% of the time cola is purchased. In other words, the technique correlates the presence of one set of items with another. A large set of rules is typically generated for any data set with a reasonably diverse type and number of transactions. As an illustration, Figure 3.1 shows a portion of the output from Clementine's Apriori procedure showing relations between various products bought by customers over one week in a supermarket. The first line tells us that 10.8% of the sample, or 85 customers, bought both frozen foods and milk, and that 77.6% of such customers also bought some type of alcoholic product.

    Figure 3.1 A Set of Market Basket Association Rules

Such association rules describe relations among the items purchased; the goal is to discover interesting and actionable associations.


    TECHNICAL CONSIDERATIONS

Number of Items?

If the fields to be analyzed represent items sold, the number of distinct items influences the resources required for analysis. For example, the number of SPSS products a customer can currently purchase is about 15, which is a relatively small number of items to analyze. Now, let's consider a major retail chain store, auto parts supplier, mail catalog vendor or web vendor. Each might have anywhere from hundreds or thousands to tens of thousands of unique products. Generally, when such large numbers of products are involved, they are binned (grouped) together in higher-level product categories. As items are added, the number of possible combinations increases exponentially. Just how much categorization is necessary depends upon the original number of items, the detail level of the business question asked, and the level of grouping at which meaningful categories can be created. When a large number of items is present, careful consideration must be given to this issue, which can be time consuming. But time spent on this matter will increase your chance of finding useful associations and will reduce the number of largely redundant rules (for example, rules describing the purchase of a hammer with each of many different sizes and types of nails).

Only True Responses?

Since market basket fields are usually coded as dichotomies (0,1) or logical flags (F,T), one issue the analyst needs to consider is whether there is interest in both true (purchase or occurrence) and false (no purchase or no occurrence) responses. Typically, a customer purchases a relatively small proportion of the available items. If both true and false responses are included, many rules will be of the sort "those who don't purchase X tend not to purchase Y." On the other hand, excluding the false responses implies that a rule such as "those who don't purchase X tend to purchase Y" will not be discovered. Thus items that act as substitutes for each other are more difficult to discover. On balance, many analysts restrict the association rules to those of the true form to avoid wading through the lengthy list of things not purchased.

Actionable?

More so than with other data-mining methods, data-mining authors and consultants raise questions about whether the associations discovered using market basket analysis are useful and actionable. The challenge is that if you do discover an association between two products, say beer and diapers (to use an example with a storied past), just what action would this lead a retailer to take? Of course, other data-mining methods are open to a similar challenge, but it is worthwhile to consider in advance how your organization would make use of any strong associations that are discovered in the market basket analysis.


No Dependent Variable (Necessarily)

An advantage of market basket analysis is that it investigates all associations among the analysis variables and not only those related to a specific outcome or dependent field. In this sense it is a more broadly based method than data-mining techniques that are trying to predict a specific outcome field. That said, after the associations have been generated in Clementine, you could designate one variable as an outcome and display only those associations that relate to it.

    RULE GENERATION

Within Clementine, the association rule methods begin by generating simple rules (those involving two items) and testing them against the data. The most interesting of these rules (that is, those that meet the specified minimum criteria; by default, coverage and accuracy) are stored. Next, all rules are expanded by adding an additional condition from a third item (this process is called specialization) and are, as before, tested against the data. The most interesting of these rules are stored and specialization continues. When the analysis ends, the best of the stored rules can be examined.

Two procedures in Clementine perform association rule analysis. GRI is more general in that it permits both numeric (continuous) and categorical input (condition) variables. Apriori, because it only permits categorical input (condition) variables, is quicker. It also supports a wider choice of criteria used to select rules.

    Results from an Association analysis are presented in a table with these column headings:

Consequent   Antecedent 1   Antecedent 2   ...   Antecedent N

    For example:

    Consequent Antecedent 1 Antecedent 2

    AnswerTree Regression Models Advanced Models

This rule tells us that customers who purchase the Regression Models and Advanced Models SPSS options also purchase AnswerTree.

    Association rules are commonly evaluated by two criteria: support and confidence.

Support is the percentage of records in the data set for which the conditions (antecedents) hold. It indicates how general the rule is, that is, to what percentage of the data it will apply.


Confidence (accuracy) is the proportion of records meeting the conditions (antecedents) that also meet the consequent (conclusion). It indicates how likely the consequent is, given that the conditions are met.
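A minimal Python sketch makes the two definitions concrete; the four baskets below are hypothetical:

# Support and confidence for the rule "if {frozen foods, milk} then alcohol"
baskets = [
    {"frozen foods", "milk", "alcohol"},
    {"frozen foods", "milk"},
    {"milk", "bakery goods"},
    {"frozen foods", "milk", "alcohol", "snacks"},
]

antecedents, consequent = {"frozen foods", "milk"}, "alcohol"

matches = [b for b in baskets if antecedents <= b]   # baskets meeting the conditions
support = len(matches) / len(baskets)                # how general the rule is
confidence = sum(consequent in b for b in matches) / len(matches)

print(f"support = {support:.1%}, confidence = {confidence:.1%}")   # 75.0%, 66.7%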

It is worth noting explicitly that market basket analysis does not take time into account. Thus, while purchases may well be time ordered, Apriori and GRI do not include time as a component of the rule generation process. Practically speaking, this means that all data for a case (e.g., shopping trip) must be stored in one physical record. Sequence detection algorithms take time sequencing into account, and such analyses can be done in Clementine with the Sequence node (or the CaprI algorithm add-on); see Chapter 8.

APRIORI EXAMPLE: GROCERY PURCHASES

To demonstrate a market basket analysis we will use the Apriori procedure within Clementine to analyze the purchase patterns among a limited number of product categories (10) of grocery store items: ready made, frozen foods, alcohol, fresh vegetables, milk, bakery goods, fresh meat, toiletries, snacks and tinned goods. About two thousand shopping visits are analyzed. We will not construct the entire data stream within Clementine, b