detecting outliers

SW388R7 Data Analysis & Computers II

Detecting Outliers

Detecting univariate outliers

Detecting multivariate outliers


Outliers

- Outliers are cases that have data values that are very different from the data values for the majority of cases in the data set.

- Outliers are important because they can change the results of our data analysis.

- Whether we include or exclude outliers from a data analysis depends on the reason why the case is an outlier and the purpose of the analysis.


Univariate and Multivariate Outliers

- Univariate outliers are cases that have an unusual value for a single variable. In our analyses, we will be concerned with univariate outliers for the dependent variable in our data analysis.

- Multivariate outliers are cases that have an unusual combination of values for a number of variables. The value for any individual variable may not be a univariate outlier, but the combination of values across the variables occurs very rarely. In our analyses, we will be concerned with multivariate outliers for the set of independent variables in our data analysis.


Standard Scores Detect Univariate Outliers

- One way to identify univariate outliers is to convert all of the scores for a variable to standard scores.

- If the sample size is small (80 or fewer cases), a case is an outlier if its standard score is ±2.5 or beyond.

- If the sample size is larger than 80 cases, a case is an outlier if its standard score is ±3.0 or beyond.

- This method applies to interval level variables, and to ordinal level variables that are treated as metric. It does not apply to nominal level variables.
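
The sample-size rule above can be sketched outside SPSS. This is a minimal Python illustration, not SPSS's own implementation: the function name `flag_univariate_outliers` and the sample data are hypothetical, and the standard scores use the sample standard deviation (n - 1 denominator), which is assumed to match the values SPSS saves.

```python
from statistics import mean, stdev

def flag_univariate_outliers(values):
    """Return (index, z) pairs for cases whose standard score is at or beyond the cutoff."""
    # Cases with missing data (None) are skipped, as they have no standard score.
    valid = [(i, v) for i, v in enumerate(values) if v is not None]
    scores = [v for _, v in valid]
    # Rule of thumb from the slide: 2.5 for 80 or fewer cases, 3.0 otherwise.
    cutoff = 2.5 if len(scores) <= 80 else 3.0
    m, s = mean(scores), stdev(scores)  # stdev uses the n - 1 denominator
    return [(i, (v - m) / s) for i, v in valid if abs((v - m) / s) >= cutoff]

# Hypothetical small sample: years of school, with one missing value.
educ = [12, 12, 13, 14, 12, 16, 12, 13, None, 14, 12, 1, 13]
for index, z in flag_univariate_outliers(educ):
    print(f"case at position {index}: z = {z:.2f}")
```

With only 12 valid cases the 2.5 cutoff applies, so the case reporting 1 year of school is flagged.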


Mahalanobis D² and Multivariate Outliers

- Mahalanobis D² is a multidimensional version of a z-score. It measures the distance of a case from the centroid (multidimensional mean) of a distribution, given the covariance (multidimensional variance) of the distribution.

- A case is a multivariate outlier if the probability associated with its D² is 0.001 or less. D² follows a chi-square distribution with degrees of freedom equal to the number of variables included in the calculation.

- Mahalanobis D² requires that the variables be metric, i.e. interval level or ordinal level variables that are treated as metric.
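
The definition above corresponds to D² = (x - mean)' S⁻¹ (x - mean), where S is the covariance matrix of the variables. The sketch below computes it in plain Python for cases with complete data. The names `solve` and `mahalanobis_d2` and the data rows are hypothetical, and the use of the sample covariance (n - 1 denominator) is an assumption about how the SPSS value is scaled.

```python
from statistics import fmean

def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def mahalanobis_d2(rows):
    """Return D² for each row: squared distance from the centroid, scaled by covariance."""
    n, k = len(rows), len(rows[0])
    means = [fmean(col) for col in zip(*rows)]  # the centroid
    centered = [[v - mu for v, mu in zip(row, means)] for row in rows]
    # Sample covariance matrix (n - 1 denominator).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(k)]
           for i in range(k)]
    # D² = d' S^-1 d, computed by solving S y = d and taking the dot product d . y
    return [sum(d * y for d, y in zip(row, solve(cov, row))) for row in centered]

# Hypothetical cases: hours worked, prestige score, years of school.
rows = [(40, 44, 12), (45, 51, 14), (38, 36, 12),
        (50, 62, 16), (42, 47, 13), (10, 70, 8)]
for row, d2 in zip(rows, mahalanobis_d2(rows)):
    print(row, round(d2, 2))
```

A useful sanity check on any such computation is that the D² values sum to (n - 1) times the number of variables.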

Problem 1

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic?

In the dataset, there are 2 cases that should be evaluated as univariate outliers for highest year of school completed.

1. True

2. True with caution

3. False

4. Incorrect application of a statistic

Descriptive statistics compute standard scores

To compute standard scores in SPSS, select the Descriptive Statistics | Descriptives… command from the Analyze menu.

Select the variable(s) for the analysis

First, click on the variable to be included in the analysis to highlight it.

Second, click on the right arrow button to move the highlighted variable to the list of variables.

Mark the option for computing standard scores

First, click on the checkbox to save standard score values as a new variable in the dataset. The new variable will have the letter z prepended to its name, e.g. the standard score variable for “educ” will be “zeduc”.

Second, click on the OK button to complete the analysis request.

The z-score variable in the data editor

The variable containing the standard scores will be added to the list of variables in the data editor.

To identify outliers below -3.0, we sort the data set in ascending order. Right click on the variable header zeduc and select the Sort Ascending command from the popup menu.


Outliers with unusually low scores

Cases that are outliers because they have unusually low scores for the variable will appear at the top of the sorted list.

Since there are 269 cases with valid data for the variable, the criterion for identifying an outlier is ±3.0. In this example, we have two outliers with z-scores less than -3.0.

Additional information about the outliers

To see additional information about the outliers, we highlight the rows containing the outliers and scroll horizontally to other variables in which we are interested, for example, the id numbers for the cases.

The raw data scores for the outliers

Before deciding whether we retain or omit outliers from the analysis, we should examine the raw scores that made these cases outliers.

In this example, one of our subjects had completed only 2 years of school and another had completed only 3 years.

Comparing the raw scores to the mean

The Descriptives output helps us in evaluating the raw data scores for the outliers.

When we compare the raw data values of 2 and 3 to the mean (13.12) and standard deviation (2.930) of the distribution for the variable, we see why these cases are outliers for this distribution. Completing 2 and 3 years of school is unusual in a distribution that had a mean of 13 years.
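
The comparison can be checked by hand with the standard score formula z = (raw score - mean) / standard deviation. The short Python sketch below does the arithmetic using the rounded mean and standard deviation quoted above; the helper name `z_score` is only illustrative.

```python
def z_score(raw, mean, sd):
    """Standard score: distance from the mean in standard deviation units."""
    return (raw - mean) / sd

# Rounded values taken from the Descriptives output quoted in the slide.
MEAN_EDUC, SD_EDUC = 13.12, 2.930

for years in (2, 3):
    print(f"{years} years of school -> z = {z_score(years, MEAN_EDUC, SD_EDUC):.2f}")
# prints:
# 2 years of school -> z = -3.80
# 3 years of school -> z = -3.45
```

Both values fall beyond -3.0, which is why the two cases are identified as outliers.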


Outliers with unusually high scores

To identify outliers above +3.0, we sort the data set in descending order. Right click on the variable header zeduc and select the Sort Descending command from the popup menu.

Descriptive statistics compute standard scores

Cases that are outliers because they have unusually high scores for the variable will now appear at the top of the sorted list. In this example, there are no outliers with extremely large values.

The answer to this problem is True.

Univariate outliers are detected by computing standard scores for the variable. Computing standard scores requires that the variable be metric. Highest year of school completed (educ) is an interval level (metric) variable, satisfying the requirement for computing standard scores.

Since there are 269 cases with valid data for the variable, the criterion for identifying an outlier is ±3.0. In this dataset, 2 cases have a z-score value outside this range (20000391: -3.45; 20001984: -3.80).

Deleting the z-score variable

Once we are finished with the outlier analysis, we should delete the variables that were added to the data set.

First, click on the zeduc column header to select the entire column.

Second, select the Clear command from the Edit menu to delete the column from the dataset.


Other problems on univariate outliers

- A problem may ask about outliers for a nominal level variable. The answer will be “Incorrect application of a statistic” since z-scores cannot be computed for nominal level variables.

- A problem may ask about outliers for an ordinal level variable. If the number of outliers in the problem statement is accurate, the correct answer to the question is “True with caution” since we may be required to defend treating an ordinal variable as metric.

- A problem may contain an inaccurate number of outliers for the variable. The answer will be “False.”

Problem 2

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Use 0.001 as the level of significance.

In the dataset, there is 1 case that should be evaluated as a multivariate outlier for the combination of: number of hours worked in the past week, occupational prestige score, and highest year of school completed.

1. True

2. True with caution

3. False

4. Incorrect application of a statistic

Mahalanobis D² is computed by Regression

To compute Mahalanobis D² in SPSS, select the Regression | Linear… command from the Analyze menu.

Adding the independent variables

The SPSS Linear Regression procedure computes Mahalanobis D² for the set of independent variables entered into the dialog box.

Move the variables hrs1, prestg80, and educ to the list of independent variables.

Adding an arbitrary dependent variable

First, arbitrarily select a variable to use as the dependent variable. The variable should be a numeric variable that does not have any missing cases. For example, click on the first numeric variable in the list of variables: wrkstat.

Second, click on the right arrow button to move wrkstat to the text box for the dependent variable.

SPSS will not compute the Regression unless we specify a dependent variable, even though the dependent variable is not used in the analysis of multivariate outliers.

Adding Mahalanobis D² to the dataset

To request that SPSS add the value of Mahalanobis D² to the data set, click on the Save button to open the save dialog box.

Specify saving Mahalanobis D² distance

First, mark the checkbox for Mahalanobis in the Distances panel. All other checkboxes can be unchecked.

Second, complete the request for Mahalanobis distance by clicking on the Continue button.

Specify the statistics output needed

To understand why a particular case is an outlier, we want to examine the descriptive statistics for each variable.

Click on the Statistics… button to request the statistics.

Request descriptive statistics

First, mark the checkbox for Descriptives. All other checkboxes can be unchecked.

Second, complete the request for descriptive statistics by clicking on the Continue button.

Complete the request for Mahalanobis D²

To complete the request for the regression analysis that will compute Mahalanobis D², click on the OK button.

Mahalanobis D² scores in the data editor

If we look in the column farthest to the right in the data editor, we see that SPSS has calculated the Mahalanobis D² scores for us in a variable it has named "mah_1."

The evaluation for outliers, however, requires the probability for the Mahalanobis D² and not the scores themselves.

Computing the probability of D²

To compute the probability of D², we will use an SPSS function in a Compute command.

First, select the Compute… command from the Transform menu.

Specifying the variable name and function

First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score.

Second, scroll down the list of functions to find CDF.CHISQ, which calculates the cumulative probability for a variable that follows a chi-square distribution, like Mahalanobis D².

Third, click on the up arrow button to move the highlighted function to the Numeric Expression text box.

Completing the specifications for the function

First, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.

Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution. The complete numeric expression is 1 - CDF.CHISQ(mah_1, 3).

Second, click on the OK button to signal completion of the Compute Variable dialog.
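
To see what the upper-tail calculation produces, the same probability can be sketched in Python using only the standard library. This is an illustration rather than the SPSS implementation: the function name `chi2_upper_tail` is hypothetical, and it relies on closed-form expressions for the chi-square upper tail that hold only for whole-number degrees of freedom.

```python
import math

def chi2_upper_tail(d2, df):
    """Upper-tail probability P(X >= d2) for a chi-square variable with integer df."""
    half = d2 / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: complementary error function plus a finite sum of half-integer powers.
    terms = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms

# The two D² values reported for the outlier cases in this problem, with 3 variables.
for d2 in (35.58, 17.15):
    p = chi2_upper_tail(d2, 3)
    print(f"D² = {d2}: p = {p:.4f}, outlier at 0.001: {p <= 0.001}")
```

Formatted to four decimals, the two probabilities display as 0.0000 and 0.0007, so both cases fall at or below the 0.001 criterion.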

Probabilities for D² in the data editor

SPSS used the compute command to calculate the probabilities for the D² scores and list them in the data editor. To find the smallest probability value, we will sort the data set in ascending order.

To sort the data set, right click on the column header p_mah_1, and select Sort Ascending from the popup menu.

Identifying outliers

Scroll down the data editor past the probabilities with missing values, which are the result of the compute command when one or more variables have missing data.

There are two values less than 0.001, displayed as .0000 and .0007. Two cases had an unusual combination of values on the three variables, resulting in their designation as outliers.

Answering the original question

The original question asked if the number of outliers for the combination of three variables is 1.

The answer to this question is False because there are two outliers. In this dataset, 2 cases have a Mahalanobis D² with a probability less than or equal to 0.001 (20000391: D²=35.58, p<0.0001; 20001785: D²=17.15, p=0.0007).


Evaluating Multivariate Outliers

- Before we can decide whether we should omit or retain an outlier in our data analysis, we need to understand why it is an outlier.

- To accomplish this, we will move the columns for the variables adjacent to each other in the data editor so that we can compare the values for each case.

- We will compare the values for each case to the mean and standard deviation for each variable, computed in the descriptive statistics section of the regression output.

Moving columns in the data editor – step 1

We will move the column for the variable prestg80 next to the column for hrs1.

First, click on the column header prestg80 for the variable we want to move, so that the column is selected.


Next, click and hold the left mouse button down on the column header of the variable we want to move. A box outline will appear at the bottom of the arrow cursor, indicating that SPSS is prepared to move the column.


Next, while holding the mouse button down, move the arrow cursor over columns to the left or right.

A vertical red line will appear between the columns to indicate where the column will be relocated. When the red line is located where we want to position the column we are moving, release the mouse button. The column will now be relocated.


The columns for the variables are now adjacent to one another, making it easier to compare values.

Hint: when we move a column, the command “Undo Move Variables” will appear at the top of the Edit menu. I find this command the easiest way to return the columns to their original locations in the data editor. Leaving columns in different locations can make it harder to find a variable we are looking for.


Highlighting the outliers for analysis

When I finished relocating the three variables, I moved the p_mah_1 column also, so I could easily identify which cases were outliers. Then I highlighted the outlier rows and scrolled them to the top row in the data editor.

I can now compare the values for these two cases to the mean and standard deviation of the distribution for the three variables.

Evaluating the outlier cases

Descriptive Statistics

Variable                                  Mean     Std. Deviation    N
LABOR FRCE STATUS                          1.18      .384            174
NUMBER OF HOURS WORKED LAST WEEK          41.01    12.599            174
RS OCCUPATIONAL PRESTIGE SCORE (1980)     45.16    14.188            174
HIGHEST YEAR OF SCHOOL COMPLETED          13.79     2.778            174

The number of hours worked for both cases is well below the average for the sample. The first case has an above average occupational prestige score combined with below average years of education. The second case has a below average occupational prestige score combined with above average education.

Deleting variables added to dataset

Once we are finished with the outlier analysis, we should delete the variables that were added to the data set.

First, select the mah_1 and p_mah_1 columns.

Second, select the Clear command from the Edit menu to delete the columns from the dataset.


Other problems on multivariate outliers

- A problem may ask about outliers for variables that include a nominal level variable. The answer will be “Incorrect application of a statistic” since Mahalanobis D² cannot be computed unless all variables are metric.

- A problem may ask about outliers for variables that include an ordinal level variable. If the number of outliers in the problem statement is accurate, the correct answer to the question is “True with caution” since we may be required to defend treating an ordinal variable as metric.

- A problem may contain an inaccurate number of outliers for the variables. The answer will be “False.”

Steps in evaluating outliers

The following is a guide to the decision process for answering problems about outliers:

1. Are all of the variables to be evaluated metric?
   No: Incorrect application of a statistic
   Yes: go to question 2

2. Is the number of outliers stated in the problem the correct number?
   No: False
   Yes: go to question 3

3. Are any of the metric variables ordinal level?
   Yes: True with caution
   No: True
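
The decision process can also be written as a small function. This is only a sketch of the guide above: the function name `evaluate_outlier_problem`, its parameters, and the level labels ("nominal", "ordinal", "interval") are hypothetical choices made for illustration.

```python
def evaluate_outlier_problem(levels, stated_count, actual_count):
    """Return the answer to an outlier problem.

    levels: measurement level of each variable ('nominal', 'ordinal', or 'interval').
    stated_count: number of outliers claimed in the problem statement.
    actual_count: number of outliers found in the data.
    """
    # A nominal variable is not metric, so the statistic does not apply.
    if any(level == "nominal" for level in levels):
        return "Incorrect application of a statistic"
    # The stated number of outliers must match what the analysis finds.
    if stated_count != actual_count:
        return "False"
    # Treating an ordinal variable as metric may need to be defended.
    if any(level == "ordinal" for level in levels):
        return "True with caution"
    return "True"

# Problem 1: one interval variable, 2 outliers stated, 2 found.
print(evaluate_outlier_problem(["interval"], 2, 2))       # True
# Problem 2: three interval variables, 1 outlier stated, 2 found.
print(evaluate_outlier_problem(["interval"] * 3, 1, 2))   # False
```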
