
PS4700 Data Screening and Transformation Assignment.

TASK 1: Missing data

Before choosing how to deal with missing data it is important to discover the amount of data which is missing, why it is missing, and whether or not it is missing at random. It is known that the data file is missing 15.9% of its data, which is relatively high, especially considering that the number of participants is only 79. Secondly, the reason for the missing data is that participants have not responded to every question on the questionnaires. Thirdly, James needs to discover whether the data is missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR).

If James finds that the missing data is MCAR then he can consider using the mean substitution method. In this method, the mean of the available data is calculated for each variable and inserted in place of that variable's missing values. However, many experts now warn against this method (Bennett, 2001; Pallant, 2007) due to its many disadvantages. Firstly, when the amount of missing data is relatively high, the variance of the variable with missing data is greatly reduced (Tabachnick & Fidell, 2013). With more data grouped around the mean, resulting in positive kurtosis and therefore a non-normal distribution, the data must be transformed to satisfy the assumptions of a parametric test, making it less reliable and harder to interpret (Tabachnick & Fidell, 2013). Also, although the sample size is increased by adding in new values for the missing data (Cohen et al., 2003), and the reduction in the standard error suggests that the sample is more representative of the population, this is not the result of ‘true’ data.
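To make the variance problem concrete, the short Python sketch below (an illustration only; the scores and the variable name are hypothetical, not taken from James's SPSS file) fills the missing cases of a single variable with its mean and compares the standard deviation before and after.

```python
import pandas as pd

# Hypothetical questionnaire totals with three unanswered cases.
scores = pd.Series([28, 31, None, 22, 35, None, 30, 18, None, 33],
                   name="emotion_management")

# Mean substitution: every missing value is replaced by the observed mean.
filled = scores.fillna(scores.mean())

print("SD, missing cases excluded:", round(scores.std(), 2))
print("SD after mean substitution:", round(filled.std(), 2))
# The second SD is smaller: the imputed cases sit exactly on the mean,
# so they add no spread and the variance is artificially reduced.
```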

Instead of using one value to fill in each missing case, multiple imputation (Rubin, 1987) replaces each missing value with a set of plausible values which reflect the researcher's uncertainty as to which value to use. This method is regarded by many professionals as ‘the most respectable way of dealing with missing data’ (Tabachnick & Fidell, 2013). It can be applied whether data is MCAR, MAR, or NMAR, and has been shown to provide adequate results even with small samples and high percentages of missing data (Wayman, 2003). James's data therefore meets the conditions for multiple imputation, and it is recommended that he choose this method.
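In SPSS James would use the Multiple Imputation procedure; purely as an illustration of the logic, the sketch below uses scikit-learn's IterativeImputer to create five completed data sets (the variable names and scores are hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical questionnaire totals; np.nan marks unanswered cases.
data = pd.DataFrame({
    "emotion_management": [28, np.nan, 31, 22, np.nan, 35, 30, 27],
    "stress":             [12, 20, np.nan, 25, 18, np.nan, 14, 19],
})

# Create several completed data sets rather than a single filled-in one.
completed_sets = []
for m in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
    completed_sets.append(completed)

# The analysis would be run on every completed data set and the estimates
# pooled; printing one imputed cell shows that the filled-in values vary,
# reflecting the uncertainty about the true score.
print([round(d.loc[1, "emotion_management"], 1) for d in completed_sets])
```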

TASK 2: Outliers

There is much debate over whether to delete, retain or transform outliers within a data set, and this largely depends on the nature of the outliers. Tabachnick and Fidell (2013) explain that there are four possible reasons for outliers. Firstly, James needs to make sure that his outliers have been entered correctly, and that missing-value codes have been labelled as missing rather than read as actual values, correcting any which have been entered incorrectly. Participants with extreme values should also be part of the intended target population, and if not then the participant's scores should be deleted from the data set. If none of these first three points explains the extreme values, then the extreme score should be retained, and possibly altered, to improve the normality of the data.

Univariate outliers can be identified on a box plot as points which lie outside the ‘box and whiskers’. If an outlier has little or no effect on the skewness of the data then it is acceptable to retain its value without transformation; however, if the outlier has a seemingly significant effect on the skewness of the data then it is argued that it should be deleted or transformed (Tabachnick & Fidell, 2013). Firstly, the data can be trimmed (Field, 2013), whereby a certain percentage of the data is deleted from both extremes. As a second option, the scores of univariate outliers can be changed so that they are one unit larger or smaller than the next most extreme score in the data, known as winsorizing (Field, 2013).
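As a rough Python sketch of these two options (the scores and the 10% trim proportion are illustrative, not James's data):

```python
import pandas as pd
from scipy import stats

# Hypothetical scores containing one low extreme value (10).
scores = pd.Series([10, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 36])

# Option 1: trimming - delete a fixed proportion of cases (here 10%)
# from both extremes of the distribution.
trimmed = pd.Series(stats.trimboth(scores.to_numpy(), 0.10))

# Option 2: winsorizing as described above - pull the lowest score in to
# one unit below the next lowest score (a high outlier would likewise be
# set one unit above the next highest score).
winsorized = scores.copy()
next_lowest = winsorized.nsmallest(2).iloc[1]
winsorized[winsorized.idxmin()] = next_lowest - 1   # 10 becomes 21

print(sorted(trimmed), list(winsorized))
```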

Multivariate outliers can be identified by computing Mahalanobis distances and are more difficult to deal with because they involve an unusual combination of scores on more than one variable. James should look into the possibility that either one variable is responsible for the outliers, the variable is highly correlated with others, or the variable is not critical to the analysis, any of which could justify deleting the whole variable.
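A minimal sketch of how Mahalanobis distances could be checked outside SPSS, assuming the questionnaire totals sit in a pandas DataFrame (the scores are hypothetical; the p < .001 chi-square cut-off follows Tabachnick and Fidell's convention):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical scores on the two variables in James's analysis.
data = pd.DataFrame({
    "emotion_management": [28, 31, 22, 35, 30, 18, 33, 10, 27, 29],
    "stress":             [12, 20, 25, 18, 14, 30, 16, 11, 19, 15],
})

# Squared Mahalanobis distance of each case from the centroid.
diff = data - data.mean()
inv_cov = np.linalg.inv(np.cov(data.T))
d_squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Cases beyond the chi-square critical value (df = number of variables)
# at p < .001 are flagged as multivariate outliers.
critical = stats.chi2.ppf(0.999, df=data.shape[1])
print(data[d_squared > critical])
```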

TASK 3: Other screening tests

a) Overview

After transformation of the outliers, the data for the emotion management variable were screened to see whether they satisfied the assumptions of the parametric test, Pearson's correlation. The case summary shows that 79 participants took part in the emotion management questionnaire and there was no missing data, this having been dealt with at an earlier stage. The minimum emotion management score was 10.000 and the maximum was 36.000, with a mean score of 28.418. A Kolmogorov-Smirnov test revealed that the scores on the emotion management variable were not normally distributed, as indicated by the significant result, D(78) = .170, p < .001.

Skewness and Kurtosis

The presence of skewness and kurtosis in the data set was examined using the normal Q-Q plot. An S-shaped distribution indicated that the data were skewed; however, because the observed scores were relatively close to the expected scores, kurtosis was less of a problem. A histogram and boxplot of the data set showed that the scores on the emotion management variable were negatively skewed, with more people scoring a high emotion management ability than a low one. Taking a z-score of 1.96 as the critical value at p < .05, the negative skewness (-1.019) is statistically significant (z = 3.760). There was also a slight negative kurtosis (-.076); however, this was not significant (z = .142). A detrended normal Q-Q plot showed that most of the scores on the emotion management variable deviated from the normal distribution.
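The z-scores above come from dividing each statistic by its standard error; the sketch below shows one way to reproduce this in Python using the approximate standard errors √(6/N) and √(24/N) (the scores are simulated stand-ins, and SPSS's Kolmogorov-Smirnov test applies the Lilliefors correction, so its p-value will differ slightly).

```python
import numpy as np
from scipy import stats

def skew_kurtosis_z(x):
    """z-scores for skewness and excess kurtosis, using the
    approximate standard errors sqrt(6/N) and sqrt(24/N)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return stats.skew(x) / np.sqrt(6 / n), stats.kurtosis(x) / np.sqrt(24 / n)

# Simulated stand-in for the 79 emotion management scores.
scores = np.random.default_rng(1).integers(10, 37, size=79)

z_skew, z_kurt = skew_kurtosis_z(scores)
print(f"z(skewness) = {z_skew:.2f}, z(kurtosis) = {z_kurt:.2f}")
# |z| > 1.96 indicates significant skew or kurtosis at p < .05.

# Kolmogorov-Smirnov test against a normal distribution with the
# sample mean and standard deviation.
d, p = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))
print(f"D = {d:.3f}, p = {p:.3f}")
```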

Outliers

The boxplot shows that case 42 is a mild outlier, with the lowest value of 10.00, indicating the poorest emotion management ability. Despite the other outliers within the data set having been dealt with, case 42 was retained due to it having little effect on the distribution of the data.

b) Even once James has run the same data screening tests on the stress variable as on the emotion management variable, his data screening is not yet complete. There are further assumptions which need to be met before James can run parametric tests on his data. He should also test both variables for homogeneity of variance (or homoscedasticity), linearity and independence.

Homogeneity of variance means that the spread, or variance, of scores is the same for the emotion management scores and the stress scores. Levene's test of homogeneity of variance can be conducted, and if the result is non-significant (i.e. p > .05) then homogeneity of variance can be assumed. James should be aware that the size of his sample can greatly affect the results of Levene's test, with differences between variances often remaining undetected in small samples (Field, 2007).
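A short sketch of how this check could be run in Python with SciPy (the two lists of scores are hypothetical stand-ins for James's variables):

```python
from scipy import stats

# Hypothetical emotion management and stress totals.
emotion_management = [28, 31, 22, 35, 30, 18, 33, 27, 29, 25]
stress = [12, 20, 25, 18, 14, 30, 16, 19, 15, 22]

# Levene's test: a non-significant result (p > .05) suggests the
# variances can be treated as homogeneous.
statistic, p = stats.levene(emotion_management, stress)
print(f"W = {statistic:.2f}, p = {p:.3f}")
```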

The assumption of linearity expects that there is a linear relationship between the variables, in this case the stress and emotion management variables. To look at the linearity of his data, James should produce a scatterplot in which each participant's scores on both variables are plotted. Including a line of best fit and calculating a correlation coefficient will make the assumption of linearity easier to assess.
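The scatterplot and correlation coefficient could be produced as follows (a sketch only; the paired scores are hypothetical):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Hypothetical paired scores for the two variables.
emotion_management = np.array([28, 31, 22, 35, 30, 18, 33, 27, 29, 25])
stress = np.array([14, 12, 24, 10, 13, 28, 11, 17, 15, 20])

# Scatterplot with a least-squares line of best fit.
slope, intercept = np.polyfit(emotion_management, stress, deg=1)
plt.scatter(emotion_management, stress)
plt.plot(emotion_management, slope * emotion_management + intercept)
plt.xlabel("Emotion management")
plt.ylabel("Stress")

# Pearson's correlation coefficient summarises the linear relationship.
r, p = stats.pearsonr(emotion_management, stress)
plt.title(f"r = {r:.2f}, p = {p:.3f}")
plt.show()
```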

Finally, the assumption of independence refers to the requirement that the answers each participant gives should be their own and not influenced by other participants. If this assumption has been violated, Field (2013) explains that this will have an impact on the standard error, which will in turn invalidate any confidence intervals and significance tests which are run on the data.

Once these tests have been completed, James can think about transforming his data so that it more closely resembles a normal distribution.

TASK 4: Transformation

James has used the square root transformation to statistically transform the data, which is an appropriate choice. However, because the emotion management data is moderately negatively skewed, the data should be reflected before the square root transformation is carried out. Field (2013) explains that to do this, each score should be subtracted from (the highest score + 1), so that all of the data can be square rooted.

Using the square root transformation should reduce the negative skew of the data and therefore bring it closer to a normal distribution. Once the z-scores have been recalculated, it will become clear whether the skewness is now non-significant and whether the Kolmogorov-Smirnov test is now non-significant, indicating an overall normal distribution.

If the data is still significantly skewed after the square root transformation, then a logarithmic transformation should be applied (again reflecting the data, as with the square root transformation) to see whether this improves the skewness. If the skewness of the data becomes non-significant then the results of the logarithmic transformation will be used; however, if the skewness remains significant then the results of the square root transformation will be retained.
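The whole transformation step could be sketched in Python as follows (the scores are hypothetical; note that reflection reverses the scale, so high transformed values indicate poor emotion management):

```python
import numpy as np
from scipy import stats

# Hypothetical negatively skewed emotion management scores.
scores = np.array([36, 35, 35, 34, 33, 33, 32, 31, 30, 29, 27, 24, 19, 10],
                  dtype=float)

# Reflect the scores so the skew becomes positive: subtract each score
# from (highest score + 1), which keeps every value at least 1.
reflected = (scores.max() + 1) - scores

sqrt_scores = np.sqrt(reflected)    # square root transformation
log_scores = np.log10(reflected)    # logarithmic fallback

# Keep whichever transformation brings the skewness z-score below 1.96.
for label, x in [("square root", sqrt_scores), ("log10", log_scores)]:
    z_skew = stats.skew(x) / np.sqrt(6 / len(x))
    print(f"{label}: skewness = {stats.skew(x):.2f}, z = {z_skew:.2f}")
```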

References:

Bennett, D. A. (2001). How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health, 25, 464–469.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

Field, A. (2007). Homogeneity of variance. Retrieved 01 February, 2014, from http://srmo.sagepub.com/view/encyclopedia-of-measurement-and-statistics/n207.xml

Field, A. (2013). Chapter 5: The beast of bias. In Discovering Statistics Using IBM SPSS Statistics (4th ed., pp. 163–212). Sage.

Pallant, J. (2007). SPSS Survival Manual (3rd ed.). New York, NY: Open University Press.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.

Tabachnick, B. G., & Fidell, L. S. (2013). Cleaning up your act: Screening data prior to analysis. In Using Multivariate Statistics (6th ed., pp. 60–116). Pearson.

Wayman, J. C. (2003). Multiple Imputation For Missing Data: What Is It And How Can I Use It? Retrieved 01 February, 2014, from http://coedpages.uncc.edu/cpflower/wayman_multimp_aera2003.pdf

Wuensch, K. L. (2013). Screening data. Retrieved 03 February, 2014, from http://core.ecu.edu/psyc/wuenschk/MV/Screening/Screen.docx