PS4700 Data Screening and Transformation Assignment.
TASK 1: Missing data
Before choosing how to deal with missing data, it is important to establish how much data is missing, why it is missing, and whether it is missing at random. It is known that the data file is missing 15.9% of its data, which is relatively high, especially given that the number of participants is only 79. Secondly, the reason for the missing data is that participants did not respond to every question on the questionnaires. Thirdly, James needs to determine whether the data is missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR).
If James finds that the missing data is MCAR then he can consider using the mean substitution method, in which the mean of the available data for each variable is calculated and inserted in place of each missing value. However, many experts now warn against this method (Bennett, 2001; Pallant, 2007) because of its disadvantages. Firstly, when the proportion of missing data is relatively high, the variance of the variable with missing data is greatly reduced (Tabachnick & Fidell, 2013). With more data grouped around the mean, positive kurtosis results and the distribution becomes non-normal, so the data must be transformed to satisfy the assumptions of a parametric test, making the analysis less reliable and harder to interpret (Tabachnick & Fidell, 2013). Also, although the effective sample size is increased by filling in the missing values (Cohen et al., 2003), and the reduced standard error suggests that the sample is more representative of the population, this improvement does not come from 'true' data.
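To see why mean substitution shrinks the variance, consider this small sketch. The data are hypothetical, generated so that the missing rate echoes the 15.9% in James's file:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical scores; the ~16% missing rate mirrors the 15.9% in James's file.
scores = rng.normal(loc=28, scale=5, size=79)
missing = rng.random(79) < 0.159          # mask of non-responses
observed = scores[~missing]

# Mean substitution: fill every gap with the observed mean.
filled = scores.copy()
filled[missing] = observed.mean()

print(f"variance of observed data: {observed.var(ddof=1):.2f}")
print(f"variance after mean fill : {filled.var(ddof=1):.2f}")
# The filled series always has the smaller variance, because the
# substituted values sit exactly on the mean and add no spread.
```

Every substituted value contributes zero to the sum of squared deviations while inflating the denominator, so the variance can only shrink.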
Instead of using one value to fill in each missing case, multiple imputation (Rubin, 1987) replaces each missing value with a set of plausible values which reflect the researcher's uncertainty about which values to use. This method is regarded by many professionals as 'the most respectable way of dealing with missing data' (Tabachnick & Fidell, 2013). It can be applied whether data is MCAR, MAR, or NMAR, and has been shown to provide adequate results even with small samples and high percentages of missing data (Wayman, 2003). James's data therefore meets the conditions for multiple imputation, and it is recommended that he chooses this method.
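A hand-rolled sketch of the idea on hypothetical data (a real analysis would use a dedicated routine, such as SPSS's multiple imputation procedure): each missing value is drawn several times from a model of the observed data, and the results are pooled with Rubin's (1987) rules.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with ~16% missing, echoing James's 15.9%.
scores = rng.normal(28, 5, 79)
scores[rng.random(79) < 0.159] = np.nan
obs = scores[~np.isnan(scores)]
n_mis = int(np.isnan(scores).sum())

m = 20                          # number of imputed data sets
estimates, variances = [], []
for _ in range(m):
    # Draw plausible values from a model fitted to a bootstrap resample
    # of the observed data, so each data set reflects our uncertainty.
    boot = rng.choice(obs, size=obs.size, replace=True)
    draws = rng.normal(boot.mean(), boot.std(ddof=1), size=n_mis)
    completed = scores.copy()
    completed[np.isnan(scores)] = draws
    estimates.append(completed.mean())
    variances.append(completed.var(ddof=1) / completed.size)

# Rubin's (1987) pooling rules: total variance combines the
# within-imputation and between-imputation components.
qbar = np.mean(estimates)                 # pooled point estimate
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / m) * between
print(f"pooled mean {qbar:.2f}, pooled SE {np.sqrt(total_var):.3f}")
```

Unlike mean substitution, the pooled standard error is honest: the between-imputation term keeps the uncertainty added by the missing data in the estimate.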
TASK 2: Outliers
There is much debate over whether to delete, retain or transform outliers within a data set, and this largely depends on the nature of the outliers. Tabachnick & Fidell (2013) explain that there are four possible reasons for outliers. Firstly, James needs to make sure that the outlying scores have been entered correctly, correcting any that have not. Secondly, he should check that missing-value codes have been declared as missing rather than read as actual values. Thirdly, participants with extreme values should be members of the intended target population; if they are not, those participants' scores should be deleted from the data set. If none of these first three points explains the extreme values, then the case belongs to the intended population but is simply more extreme than a normal distribution would predict, and the score should be retained, and possibly altered, to improve the normality of the data.
Univariate outliers can be identified on a box plot as points which lie outside the 'box and whiskers'. If an outlier has little or no effect on the skewness of the data, it is acceptable to retain its value without transformation; however, if the outlier has a substantial effect on the skewness, it is argued that it should be deleted or transformed (Tabachnick & Fidell, 2013). Firstly, the data can be trimmed (Field, 2013), whereby a certain percentage of the data is deleted from both extremes. As a second option, the scores of univariate outliers can be changed so that they are one unit larger or smaller than the next most extreme score in the data, known as winsorizing (Field, 2013).
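Both options can be sketched with scipy on made-up scores containing one extreme low value, like case 42's 10.00. Note that scipy's `winsorize` replaces tail values with the nearest retained score, a slight variant of the one-unit rule Field describes:

```python
import numpy as np
from scipy import stats

# Hypothetical scores with one extreme low value, like case 42's 10.00.
scores = np.array([10., 22., 24., 25., 26., 27., 28., 29.,
                   30., 31., 32., 33., 34., 35., 36.])

# Trimming: drop a fixed percentage of scores from both tails.
trimmed_mean = stats.trim_mean(scores, proportiontocut=0.1)

# Winsorizing (scipy variant): pull each tail in to the nearest
# retained score instead of deleting it.
winsorized = stats.mstats.winsorize(scores, limits=0.1)

print(f"raw mean        {scores.mean():.2f}")
print(f"trimmed mean    {trimmed_mean:.2f}")
print(f"winsorized mean {winsorized.mean():.2f}")
```

With the low outlier pulled in or removed, both robust means sit above the raw mean, showing how much one extreme score drags the estimate.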
Multivariate outliers can be identified by computing Mahalanobis distances and are more difficult to deal with because they involve an unusual combination of scores on two or more variables. James should investigate whether one variable is responsible for most of the multivariate outliers; if that variable is highly correlated with others, or is not critical to the analysis, deleting the whole variable may be justified.
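A sketch of the Mahalanobis screen on hypothetical two-variable data (the variable names, means and covariances are invented; the chi-square cutoff at p = .001 follows Tabachnick & Fidell, 2013):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical paired scores (e.g. emotion management and stress).
X = rng.multivariate_normal([28, 20], [[25, 10], [10, 16]], size=79)
X[0] = [10, 38]                       # plant one multivariate outlier

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared distances

# Conventional cutoff: chi-square critical value at p = .001,
# with df equal to the number of variables.
cutoff = stats.chi2.ppf(0.999, df=X.shape[1])
outliers = np.where(d2 > cutoff)[0]
print("flagged cases:", outliers)
```

Each squared distance is compared with the chi-square critical value; the planted case, extreme on the combination of both variables, is the kind of point this screen catches even when neither score alone looks impossible.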
TASK 3: Other screening tests
a) Overview
After transformation of outliers, the data for the emotion management variable were screened to check whether they satisfied the assumptions of the parametric test, Pearson's correlation. The case summary shows that 79 participants completed the emotion management questionnaire and that there was no missing data, this having been dealt with at an earlier stage. The minimum emotion management score was 10.000 and the maximum was 36.000, with a
mean score of 28.418. A Kolmogorov-Smirnov test revealed that scores on the emotion management variable were not normally distributed, D(78) = .170, p < .001.
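An equivalent check can be sketched in Python on hypothetical, negatively skewed scores. Note that SPSS's Explore procedure applies the Lilliefors correction to the K-S test, so its p-values differ from scipy's plain `kstest` against a fitted normal (statsmodels' `lilliefors` reproduces the SPSS version):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical negatively skewed scores, like the emotion management data.
scores = 36 - rng.gamma(shape=2.0, scale=3.0, size=79)

# Plain K-S test against a normal fitted to the sample (standardise
# first); a small p-value indicates departure from normality.
z = (scores - scores.mean()) / scores.std(ddof=1)
D, p = stats.kstest(z, 'norm')
print(f"D = {D:.3f}, p = {p:.4f}")
print(f"sample skewness = {stats.skew(scores):.3f}")  # negative here
```

The significant K-S result in James's output is what motivates the skewness and kurtosis checks that follow.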
Skewness and Kurtosis
Skewness and kurtosis in the data set were examined using the normal Q-Q plot. An S-shaped pattern indicated that the data was skewed; however, because the observed scores lay relatively close to the expected scores, kurtosis was less of a problem. A histogram and boxplot showed that scores on the emotion management variable were negatively skewed, with more people scoring high than low on emotion management ability. Against the critical z-value of 1.96, significant at p < .05, the negative skewness (-1.019) was statistically significant (z = 3.760). There was also slight negative kurtosis (-.076), but this was not significant (z = .142). A detrended normal Q-Q plot showed that most of the scores on the emotion management variable deviated from the normal distribution.
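The z-values quoted above are each statistic divided by its standard error; the report quotes the magnitudes. Using the small-sample SE formulas SPSS reports, the figures can be reproduced:

```python
import numpy as np

# z-test for skewness and kurtosis: statistic / standard error,
# with the exact small-sample SE formulas SPSS uses.
N = 79
se_skew = np.sqrt(6 * N * (N - 1) / ((N - 2) * (N + 1) * (N + 3)))
se_kurt = np.sqrt(4 * (N**2 - 1) * se_skew**2 / ((N - 3) * (N + 5)))

z_skew = -1.019 / se_skew        # skewness value from the output above
z_kurt = -0.076 / se_kurt        # kurtosis value from the output above
print(f"z(skew) = {z_skew:.2f}")   # magnitude ~3.77: significant, |z| > 1.96
print(f"z(kurt) = {z_kurt:.2f}")   # magnitude ~0.14: not significant
```

With N = 79 the skewness SE is about 0.27, so even a moderate skew of -1.019 is several standard errors from zero.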
Outliers
The boxplot shows that case 42 is a mild outlier, with the lowest value of 10.00, indicating
the poorest emotion management ability. Despite having dealt with the other outliers within
the data set, case 42 was retained due to it having little effect on the distribution of the data.
b) Even once James has run the same data screening tests on the stress variable as on the emotion management variable, his data screening is not complete. Further assumptions need to be met before James can run parametric tests on his data. He should also test both data sets for homogeneity of variance (or homoscedasticity), linearity and independence.
Homogeneity of variance means that the spread or variance of scores is the same for both the emotion management scores and the stress scores. Levene's test of homogeneity of variance can be conducted, and if the result is non-significant (p > .05) then homogeneity of variance can be assumed. James should be aware that the size of his sample can greatly affect the results of Levene's test, with differences between variances often remaining undetected in small sample sizes (Field, 2007).
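A sketch of the check on hypothetical score sets (the variable names and parameters are invented). scipy centres on the median by default, which is the more robust Brown-Forsythe variant; `center='mean'` matches the classic Levene procedure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical emotion management and stress scores for 79 cases.
emotion = rng.normal(28, 4, 79)
stress = rng.normal(20, 4, 79)

# Levene's test: a non-significant p (> .05) means equal variances
# can be assumed for the two sets of scores.
W, p = stats.levene(emotion, stress, center='mean')
print(f"W = {W:.3f}, p = {p:.3f}")
```

As Field (2007) warns, a non-significant result with only 79 cases per set is weak evidence: with small samples the test often lacks the power to detect real variance differences.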
The assumption of linearity expects a linear relationship between the variables, in this case stress and emotion management. To examine the linearity of his data, James should produce a scatter plot showing each participant's scores on both variables. Including a line of best fit and calculating a correlation coefficient will make the assumption of linearity easier to assess.
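The correlation coefficient summarises the linear trend the scatter plot would show. A sketch on hypothetical paired scores (the negative slope and noise level are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical paired scores: stress falls as emotion management rises.
emotion = rng.normal(28, 4, 79)
stress = 40 - 0.8 * emotion + rng.normal(0, 3, 79)

# Pearson's r quantifies the linear relationship; the best-fit line
# through the scatter plot has slope r * (sd_y / sd_x).
r, p = stats.pearsonr(emotion, stress)
slope = r * stress.std(ddof=1) / emotion.std(ddof=1)
print(f"r = {r:.3f}, p = {p:.4f}, best-fit slope = {slope:.3f}")
```

A strong r with residuals scattered evenly around the line supports linearity; a clearly curved scatter would suggest the relationship is not linear even if r is non-zero.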
Finally, the assumption of independence means that the answers each participant gives should be their own and not influenced by other participants. If this assumption is violated, Field (2013) explains that the standard error will be affected, which in turn invalidates any confidence intervals and significance tests run on the data.
Once these tests have been completed, James can think about transforming his data so
that it more closely resembles a normal distribution.
TASK 4: Transformation
James has used the square root transformation to transform the data, which is an appropriate choice. However, because the emotion management data is moderately negatively skewed, the data should be reflected before the square root transformation is applied. Field (2013) explains that to do this, each score is subtracted from the highest score plus 1, so that all values are positive and can be square rooted.
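The reflect-then-square-root step can be sketched on hypothetical negatively skewed scores (the distribution is invented to resemble the emotion management data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical negatively skewed scores in the 10-36 range.
scores = np.clip(36 - rng.gamma(2.0, 3.0, 79), 10, 36)

# Reflect: subtract each score from (max + 1); the result is
# positively skewed and strictly positive, so it can be square rooted.
reflected = scores.max() + 1 - scores
transformed = np.sqrt(reflected)

print(f"skew before: {stats.skew(scores):.3f}")
print(f"skew after : {stats.skew(transformed):.3f}")
```

One caveat worth remembering: after reflection, high transformed values correspond to low original scores, so the direction of any subsequent correlation is reversed.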
Using the square root transformation should reduce the negative skew of the data and therefore bring it closer to a normal distribution. Once the z-scores have been calculated, it will become clear whether the skewness is now non-significant and whether the Kolmogorov-Smirnov test is now non-significant, indicating an overall normal distribution.
If the square root transformation leaves the data significantly skewed, a logarithmic transformation should be tried (again reflecting the data first, as with the square root transformation) to see whether it improves the skewness. If the skewness becomes non-significant, the results of the logarithmic transformation should be used; if it remains significant, the results of the square root transformation should be retained.
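The decision between the two transformations can be sketched as follows, on hypothetical scores; as a simplified stand-in for the significance rule above, this version simply keeps whichever transformation leaves the smaller absolute skew:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Hypothetical negatively skewed scores; reflect once, then compare.
scores = np.clip(36 - rng.gamma(2.0, 3.0, 79), 10, 36)
reflected = scores.max() + 1 - scores    # all values >= 1, so log is safe

candidates = {
    "sqrt": np.sqrt(reflected),
    "log": np.log(reflected),
}
for name, t in candidates.items():
    print(f"{name:>4}: skew = {stats.skew(t):.3f}")

# Keep the transformation that leaves the distribution least skewed.
best = min(candidates, key=lambda k: abs(stats.skew(candidates[k])))
print("retain the", best, "transformation")
```

The log is the stronger correction, so it can overshoot on mildly skewed data; comparing both, as here, mirrors James's try-sqrt-then-log strategy.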
References:
Bennett, D. A. (2001). How can I deal with missing data in my study? Australian and New
Zealand Journal of Public Health, 25, 464–469.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Field, A. (2013). Chapter 5: The beast of bias. Discovering Statistics Using IBM SPSS
Statistics, 163-212, 4th Edition, Sage.
Field, A. (2007). Homogeneity of variance. Retrieved 01 February, 2014, from
http://srmo.sagepub.com/view/encyclopedia-of-measurement-and-statistics/n207.xml.
Pallant, J. (2007). SPSS survival manual (3rd ed.). New York, NY: Open University Press.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.
Tabachnick, B.G., & Fidell, L.S. (2013). Cleaning up your act: Screening data prior to
analysis. Using Multivariate Statistics (6th ed.), 60-116, Pearson.
Wayman, J. C. (2003). Multiple Imputation For Missing Data: What Is It And How Can I Use It? Retrieved 01 February, 2014, from
http://coedpages.uncc.edu/cpflower/wayman_multimp_aera2003.pdf
Wuensch, K.L. (2013). Screening data. Retrieved 03 February, 2014, from
http://core.ecu.edu/psyc/wuenschk/MV/Screening/Screen.docx.