Detecting Outliers


Upload: imogene-amberly-small

Post on 15-Jan-2016

TRANSCRIPT

Page 1: Slide 1 Detecting Outliers Outliers are cases that have an atypical score either for a single variable (univariate outliers) or for a combination of variables

Slide 1

Detecting Outliers

Outliers are cases that have an atypical score either for a single variable (univariate outliers) or for a combination of variables (multivariate outliers).  Outliers generally have a large impact on the solution, i.e. the outlier case can conceivably change the value or score that we would predict for every other case in the study.  Our concern with outliers is to answer the question of whether our analysis is more valid with the outlier case included or more valid with the outlier case excluded.

To answer this question, we must have methods for detecting and assessing outliers. To detect univariate outliers, we convert the scores on a variable to standard scores and scan for very large positive and negative values. We will normally apply this strategy to a metric dependent variable. To detect multivariate outliers, we look for unusual cases on the combined set of metric independent variables, using a multivariate distance measure analogous to the standard-score distance from the sample mean.

The decision to exclude or retain the outlier case is based on our understanding of the cause of the outlier and the impact it is having on the results. If the outlier is a data entry error or an obvious misstatement by a respondent, it probably should be excluded. If the outlier is an unusual but probable value, it should be retained. We can improve our understanding of the impact of the outlier by running the analysis twice, once with the outlier included and once with the outlier excluded.


Slide 2

1. Detecting Univariate Outliers

To detect univariate outliers, we convert our numeric variables to their standard score equivalents. Outliers will be those cases with large standard (z) scores, e.g. smaller than -2.5 or larger than +2.5. Standardizing converts the variables to standard deviation units, so that the distance from the mean for any case on any variable is expressed in comparable units.

The Descriptives procedure can create standard scores for our variables and add them to our data. SPSS names the z-score variables by preceding the variable name with the letter z. The name for the standard score equivalent for x1 is zx1.

To locate the outliers for each variable, we can either sort the data set by the z-score variable or use the SPSS Explore procedure (the EXAMINE command in SPSS syntax) to print the highest and lowest values of the z-score variables to the output window.
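
The standardize-and-scan strategy can also be sketched outside SPSS. The Python below is a minimal illustration, not the SPSS implementation: the function names and the data are hypothetical, and it assumes the sample standard deviation (n - 1 denominator) and the ±2.5 cutoff mentioned above.

```python
from statistics import mean, stdev

def z_scores(values):
    """Convert raw scores to standard scores using the sample mean and SD."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]

def flag_univariate_outliers(ids, values, cutoff=2.5):
    """Return (id, z) pairs whose |z| exceeds the cutoff, most extreme first."""
    pairs = zip(ids, z_scores(values))
    flagged = [(case_id, z) for case_id, z in pairs if abs(z) > cutoff]
    return sorted(flagged, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical data: twelve typical scores plus one unusually high score.
ids = list(range(1, 14))
scores = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 9, 25]
outliers = flag_univariate_outliers(ids, scores)
print(outliers)
```

Sorting the flagged cases by the absolute z-score plays the same role as sorting the data set by the z-score variable: the most extreme cases come first.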

The use of standard scores to detect outliers presumes that the variable is normally distributed. When a variable is not normally distributed, a boxplot may be more effective in identifying outliers. A boxplot identifies outliers using a somewhat different criterion: cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box are identified as outliers. (SPSS marks cases more than 3 box lengths away as extreme values.) The box length is the interquartile range, i.e. the difference between the values at the 25th and 75th percentiles.
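
The box-length rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: the data and function name are hypothetical, the quartiles come from `statistics.quantiles` with its default method (statistical packages, SPSS included, may compute the box edges slightly differently), and the extra "extreme" label for cases beyond 3 box lengths follows the convention SPSS boxplots use.

```python
from statistics import quantiles

def classify_by_box_length(values):
    """Label each value 'ok', 'outlier' (1.5 to 3 box lengths) or 'extreme' (> 3)."""
    q1, _median, q3 = quantiles(values, n=4)  # quartile cut points
    box_length = q3 - q1                      # interquartile range (assumed > 0)
    labels = []
    for v in values:
        if v > q3:
            distance = (v - q3) / box_length  # box lengths above the upper edge
        elif v < q1:
            distance = (q1 - v) / box_length  # box lengths below the lower edge
        else:
            distance = 0.0                    # inside the box
        if distance > 3:
            labels.append("extreme")
        elif distance > 1.5:
            labels.append("outlier")
        else:
            labels.append("ok")
    return labels

# Hypothetical data with one moderately high and one very high score.
data = [9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 14, 25]
labels = classify_by_box_length(data)
print(list(zip(data, labels)))
```

Because the rule depends only on the quartiles, it does not assume a normal distribution, which is why it can be preferable for skewed variables.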


Slide 3

Compute Standard Scores for the Metric Variables

First, we select the 'Descriptive Statistics | Descriptives…' command from the Analyze menu.

Second, we move the metric variables 'Delivery Speed', 'Price Level', 'Price Flexibility', 'Manufacturer Image', 'Service', 'Salesforce Image', 'Product Quality', 'Usage Level', and 'Satisfaction Level' to the 'Variable(s): ' list box.

Third, we click on the 'Save standardized values as variables' check box to request that SPSS add standard scores for each variable to the data set.

Fourth, click on the OK button to request the output.


Slide 4

The Standard Scores in the SPSS Data Editor

If we scroll the SPSS Data Editor window to the right, we see the z-score variables that SPSS computed as part of the Descriptives procedure. It automatically named the new variables by prepending the letter 'z' to the original variable names.


Slide 5

Use the Explore Procedure to Locate Large Standard Scores Indicating Outliers

First, select the 'Descriptive Statistics | Explore…' command from the Analyze menu.

Second, move the Zscore variables computed above to the 'Dependent List: ' list box.

Third, move the ID variable to the 'Label Cases by: ' text box so that the case ID will appear in the output listings.

Fourth, mark the 'Statistics' option on the Display panel.

Fifth, click on the 'Statistics…' button to request the listing of outliers.


Slide 6

Specify Outliers as the Desired Statistics

First, we mark the 'Outliers' check box and clear all other check boxes.

Second, we click on the Continue button to complete our selection of statistics.

Third, we click on the OK button to produce the output.


Slide 7

Extreme Values as Outliers

The 'Extreme Values' output from the Explore command lists the five highest and five lowest values for each variable. We use these values to identify outliers, i.e. cases with standard scores larger than +2.5 or smaller than -2.5. For 'Delivery Speed', case ID 39 has a z-score of -2.66141, indicating that case 39 might have an unusually low Delivery Speed value. Similarly, for 'Price Level', case 71 has a z-score of 2.53919, indicating an unusually high Price Level compared to other cases.


Slide 8

2. Detecting Multivariate Outliers

Standard scores measure the statistical distance of a data point from the mean for all cases, measured in standard deviation units along the horizontal axis of a normal distribution plot. There is a similar measure of statistical distance in multidimensional space, known as Mahalanobis D² (d-squared). This statistic measures the distance from the centroid (the multidimensional equivalent of a mean) to a case's set of scores (or vector) on the independent variables included in the analysis. The larger the value of the Mahalanobis D² for a case, and the smaller its corresponding probability value, the more likely the case is to be a multivariate outlier. The probability value enables us to make a decision about the statistical test of the null hypothesis, which is that the vector of scores for a case is equal to the centroid of the distribution for all cases.
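
To make the distance measure concrete, here is a minimal pure-Python sketch of D² for each case: the deviation of the case from the centroid, weighted by the inverse of the covariance matrix of the variables. The function names and the two-variable data are hypothetical, and the sketch assumes the sample covariance matrix (n - 1 denominator); it is an illustration, not the routine SPSS runs.

```python
def centroid_and_covariance(rows):
    """Return the centroid and the sample covariance matrix of the rows."""
    n, k = len(rows), len(rows[0])
    centroid = [sum(r[j] for r in rows) / n for j in range(k)]
    cov = [[sum((r[a] - centroid[a]) * (r[b] - centroid[b]) for r in rows) / (n - 1)
            for b in range(k)] for a in range(k)]
    return centroid, cov

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, k))
        x[i] = (aug[i][k] - tail) / aug[i][i]
    return x

def mahalanobis_d2(rows):
    """Squared Mahalanobis distance of every row from the centroid."""
    centroid, cov = centroid_and_covariance(rows)
    distances = []
    for r in rows:
        diff = [r[j] - centroid[j] for j in range(len(centroid))]
        weighted = solve(cov, diff)  # inverse covariance applied to diff
        distances.append(sum(d * w for d, w in zip(diff, weighted)))
    return distances

# Hypothetical cases: y is roughly 2 * x, except for the last case.
rows = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8), (6, 12.1), (3, 11.0)]
d2 = mahalanobis_d2(rows)
most_distant_case = d2.index(max(d2))
print(most_distant_case, d2)
```

The last case is not extreme on either variable taken alone, yet it has the largest D² because its combination of scores breaks the pattern shared by the other cases. That is the sense in which a case can be a multivariate outlier without being a univariate one.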

Mahalanobis D² can be computed in SPSS with the regression procedure for a set of independent variables. The Save option will add the D² values to the data set. SPSS does not compute the probability of Mahalanobis D². Mahalanobis D² is distributed approximately as a chi-square statistic with degrees of freedom equal to the number of independent variables in the analysis. The SPSS cumulative distribution function will compute the area under the chi-square curve from the left end of the distribution to the point corresponding to our statistical value. The right-tail probability of obtaining a D² value this size is equal to one minus the cumulative distribution function value.
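
The "one minus the cumulative function" step can be sketched in Python without any statistics library. This is an illustration under assumptions: the function names are hypothetical, the chi-square cumulative distribution function is evaluated with the standard series for the regularized lower incomplete gamma function, and the D² value passed in at the end is a made-up example rather than a value from the data set.

```python
import math

def chi_square_cdf(x, df):
    """Area under the chi-square curve from 0 to x, via the series for the
    regularized lower incomplete gamma function P(df / 2, x / 2)."""
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10_000):
        term *= t / (a + n)          # next term of the series
        total += term
        if term < 1e-16 * total:     # terms no longer change the sum
            break
    log_scale = a * math.log(t) - t - math.lgamma(a)
    return min(1.0, total * math.exp(log_scale))

def mahalanobis_p_value(d2, n_independent_vars):
    """Right-tail probability of a D² value at least this large."""
    return 1.0 - chi_square_cdf(d2, n_independent_vars)

# Hypothetical D² for a case, with seven independent variables (df = 7).
p_mahal = mahalanobis_p_value(14.067, 7)
print(round(p_mahal, 4))
```

The degrees of freedom equal the number of independent variables, so a different set of predictors changes both the D² values and the reference distribution used to judge them.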

We use the probability values to identify the cases which are most distant, or different, from the other cases in the sample.  We would make our decision about omitting or including extreme cases by re-running the analysis without them and comparing the results we obtain with and without them to determine whether our results are more representative with or without the extreme cases.


Slide 9

Request a Multiple Regression to Compute Mahalanobis Distance Statistics

Select the 'Regression | Linear…' command from the Analyze menu.


Slide 10

Specify the Variables to Include in the Analysis

First, in order for SPSS to calculate the Mahalanobis D² statistics, we must specify a dependent variable for the regression, even though we are not interested in the regression output. I arbitrarily selected 'Satisfaction Level' (x10) and moved it to the 'Dependent' text box. The Mahalanobis distance calculations only involve the independent variables, so we would get the same D² statistics no matter what we specified as the dependent variable.

Second, move the seven metric variables, 'Delivery Speed' (x1), 'Price Level' (x2), 'Price Flexibility' (x3), 'Manufacturer Image' (x4), 'Service' (x5), 'Salesforce Image' (x6), and 'Product Quality' (x7) to the 'Independent(s): ' list box.

Third, we click on the 'Save…' button to open the dialog of statistics that SPSS will add to our data set as a byproduct of regression analysis.


Slide 11

Add the Mahalanobis Distance Statistic to the Data Set

First, mark the check box for 'Mahalanobis' on the Distances panel. All other check boxes should be clear.

Second, click on the Continue button to close the 'Linear Regression: Save' dialog box.

Third, click on the OK button on the 'Linear Regression' dialog to request the output.


Slide 12

The Mahalanobis Distance Statistics in the Data Editor

The output we want is in the data editor rather than the output viewer. SPSS computed the Mahalanobis D² statistic for each case in the data set. It assigns the new variable a unique name, 'mah_1', which we could change if we desired. The values of 'mah_1' match the values in table 2.5 on page 59 of the text.


Slide 13

Compute the Probability Values for the Mahalanobis D² Statistics


Slide 14

Sorting the Data Set to Locate Statistically Significant D² Scores

First, we see that the probabilities that we computed for D² agree with the values in the text in table 2.10 on page 59.

Second, it will be easier to locate the significant values if we sort the data set by the 'p_mahal' variable. Select the 'Sort Cases…' command from the Data menu.

Third, we move the 'p_mahal' variable to the 'Sort by: ' list box.

Fourth, we click on the OK button to produce the sort.


Slide 15

Highlight Cases with Statistically Significant Mahalanobis D² Scores

Using the traditional alpha level of 0.05 for statistical significance, we identify six cases that are potentially outliers. We highlight the cases in these rows so that we can identify their ID numbers.
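
The sort-and-highlight step amounts to ordering the cases by probability and keeping those below alpha. A minimal Python sketch follows; the case IDs and probabilities are invented for illustration and are not the values from the data set, and only the variable name 'p_mahal' is taken from the slides.

```python
ALPHA = 0.05  # traditional significance level used above

# Hypothetical cases: each has an ID and a right-tail probability for its D².
cases = [
    {"id": 101, "p_mahal": 0.412},
    {"id": 102, "p_mahal": 0.031},
    {"id": 103, "p_mahal": 0.877},
    {"id": 104, "p_mahal": 0.004},
    {"id": 105, "p_mahal": 0.050},
]

# Sort ascending so the most unusual cases (smallest probabilities) come first.
ranked = sorted(cases, key=lambda case: case["p_mahal"])

# Keep the IDs of cases whose probability is strictly below alpha.
flagged_ids = [case["id"] for case in ranked if case["p_mahal"] < ALPHA]
print(flagged_ids)
```

Whether a probability exactly equal to alpha counts as significant is a convention; this sketch uses a strict comparison.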


Slide 16

The Case IDs for the Multivariate Outliers

We scroll the SPSS Data Editor window to the left to see the ID numbers for the cases that are potential multivariate outliers.

If we were analyzing a real problem, we would repeat our analysis without the more extreme values to see what impact their omission had on our findings, and then decide whether our results are more representative with or without the extreme cases.
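
The with-and-without comparison can be sketched as fitting the same model twice. The Python below uses a simple least-squares line as a stand-in for whatever analysis is being run; the function names and data are hypothetical, and the last case is constructed to break the pattern of the others.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def compare_with_and_without(xs, ys, excluded_positions):
    """Fit once on all cases and once with the listed positions removed."""
    kept = [i for i in range(len(xs)) if i not in excluded_positions]
    full_fit = fit_line(xs, ys)
    reduced_fit = fit_line([xs[i] for i in kept], [ys[i] for i in kept])
    return full_fit, reduced_fit

# Hypothetical data: y is roughly 2 * x, except for the last case.
xs = [1, 2, 3, 4, 5, 6, 3]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 11.0]
full_fit, reduced_fit = compare_with_and_without(xs, ys, excluded_positions={6})
print("with outlier:   ", full_fit)
print("without outlier:", reduced_fit)
```

The size of the change between the two fits is what informs the decision. A large shift caused by a single case suggests the case is driving the result, while a negligible shift suggests it can be retained without concern.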
