Post on 12-Jan-2016

DATA PREPARATION AND SCREENING

James G. Anderson, Ph.D.

Importance

• Model specification

• Failure of model fitting

• Problems with parameter estimates

• Problems with tests of significance

Categories of Problems

• Case-related Issues

– Missing Observations

– Outliers

• Distributional/Relational Issues

– Normality

– Linearity

– Homoscedasticity

Missing Data

• Missing Completely at Random (MCAR) – Missingness is entirely unrelated statistically to the values that would have been observed.

• Missing at Random (MAR) – Missingness and data values are statistically unrelated conditional on a set of predictor or stratifying variables.

• Nonignorable Missing Data (NMD) – Missingness conveys probabilistic information about the values that would have been observed, beyond the information provided in the observed data.
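
The three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under assumed conditions: the variables x and y, the 0.7 slope, and the logistic missingness probabilities are all hypothetical choices, not taken from the slides. It compares the mean of y among the cases that remain observed under each mechanism.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: x is always observed; y depends on x and may be missing.
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.7 * xi + random.gauss(0, 1) for xi in x]


def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def observed_mean(values, p_missing):
    """Mean of the values that survive a random draw against p_missing."""
    kept = [v for v, p in zip(values, p_missing) if random.random() >= p]
    return sum(kept) / len(kept)


true_mean = sum(y) / n
# MCAR: every case has the same chance of losing y.
mcar = observed_mean(y, [0.3] * n)
# MAR: the chance of losing y depends only on the observed x.
mar = observed_mean(y, [logistic(2 * xi) for xi in x])
# Nonignorable: the chance of losing y depends on y itself.
nmd = observed_mean(y, [logistic(2 * yi) for yi in y])

print(f"mean of y, all cases       : {true_mean: .3f}")
print(f"mean of observed y, MCAR   : {mcar: .3f}")
print(f"mean of observed y, MAR    : {mar: .3f}")
print(f"mean of observed y, NMD    : {nmd: .3f}")
```

In this setup the MCAR mean should stay close to the full-data mean, while the MAR and nonignorable means drift downward because high-scoring cases are more likely to be missing.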

Methods for Dealing with Missing Data

• Listwise deletion

• Pairwise deletion

• Mean replacement

• Regression replacement

• Pattern matching

• Maximum likelihood

Listwise Deletion (LD)

• Eliminates observations where there is any data value missing.

• Limitations:

– Discards other information that the respondent provided

– Reduces sample size significantly

Pairwise Deletion (PD)

• Excludes an observation from a calculation only when it is missing a value needed for that particular calculation.

• Limitations:

– Each mean, variance, covariance, etc. that is calculated is based on a different sample size.

– Pairwise deletion may lead to out-of-bounds values, resulting in nonpositive definite/singular covariance matrices, negative variances, etc.

– Pairwise deletion is not recommended for SEM
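
The difference between listwise and pairwise deletion can be sketched in a few lines. The records and variable names below are hypothetical, and the helper functions are written for this illustration only.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": None},
    {"x": 2.0, "y": None, "z": 1.0},
    {"x": 3.0, "y": 4.0, "z": 3.0},
    {"x": 4.0, "y": 5.0, "z": None},
    {"x": 5.0, "y": 7.0, "z": 4.0},
    {"x": None, "y": 8.0, "z": 6.0},
]
variables = ["x", "y", "z"]


def complete_pairs(data, u, v):
    """Cases that have a value on both u and v."""
    return [(r[u], r[v]) for r in data if r[u] is not None and r[v] is not None]


def covariance(pairs):
    """Sample covariance (n - 1 denominator) of a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / (n - 1)


# Listwise deletion: keep only cases with no missing value at all.
listwise = [r for r in rows if all(r[v] is not None for v in variables)]

# Pairwise deletion: each covariance uses every case complete on that pair.
pairwise_n = {}
pairwise_cov = {}
for u in variables:
    for v in variables:
        pairs = complete_pairs(rows, u, v)
        pairwise_n[(u, v)] = len(pairs)
        pairwise_cov[(u, v)] = covariance(pairs)

print("cases kept by listwise deletion:", len(listwise))
for pair in [("x", "y"), ("x", "z"), ("y", "z")]:
    print(pair, "n =", pairwise_n[pair], "cov =", round(pairwise_cov[pair], 3))
```

Listwise deletion keeps only the fully complete cases, whereas each pairwise covariance rests on its own subset of cases. Because the entries of the pairwise matrix come from different subsamples, nothing guarantees that the matrix as a whole is positive definite.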

Data Imputation (MI)

• Replaces the missing value with an estimate based on the available data (e.g., the mean of the variable among those persons who reported it).
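
A minimal sketch of mean replacement, using made-up scores, shows why it preserves the mean but shrinks the variance: every imputed value sits exactly at the mean and adds nothing to the sum of squared deviations.

```python
import statistics

# Hypothetical scores; None marks a missing response.
scores = [4.0, None, 7.0, 5.0, None, 9.0, 6.0, None, 3.0, 8.0]

observed = [s for s in scores if s is not None]
mean_observed = statistics.mean(observed)

# Mean replacement: every missing value receives the observed mean.
imputed = [mean_observed if s is None else s for s in scores]

var_observed = statistics.variance(observed)  # sample variance of reported values
var_imputed = statistics.variance(imputed)    # sample variance after imputation

print("mean before and after:", mean_observed, statistics.mean(imputed))
print("variance before:", round(var_observed, 3))
print("variance after :", round(var_imputed, 3))
```

The same shrinkage affects covariances, which is the concern for methods that analyze a covariance matrix.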

Data Imputation (AMOS)

• Regression Imputation. The model is first fitted with ML. After the model parameters are set to their ML estimates, linear regression is used to predict the unobserved values for each case as a linear combination of the observed values for the same case.

• Stochastic Regression Imputation. Imputes values for each case by drawing at random from the conditional distribution of the missing values given the observed values with the unknown model parameters fixed at their ML estimates.
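
The contrast between the two approaches can be sketched as follows. This is not the AMOS procedure: as a simplifying assumption, an ordinary least squares regression of one variable on one predictor stands in for the ML-fitted model, and the data are simulated with hypothetical parameters.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: x is complete; y is missing (None) for about 30% of cases.
x = [random.gauss(50, 10) for _ in range(500)]
y_full = [0.5 * xi + random.gauss(0, 5) for xi in x]
y = [None if random.random() < 0.3 else yi for yi in y_full]

# Fit y = a + b * x on the complete cases. Ordinary least squares is used here
# as a stand-in for the ML-fitted model described on the slide.
pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mean_x = statistics.mean(p[0] for p in pairs)
mean_y = statistics.mean(p[1] for p in pairs)
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
     / sum((xi - mean_x) ** 2 for xi, _ in pairs))
a = mean_y - b * mean_x
residual_sd = (sum((yi - (a + b * xi)) ** 2 for xi, yi in pairs)
               / (len(pairs) - 2)) ** 0.5

# Regression imputation: fill in the predicted value itself.
y_reg = [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]

# Stochastic regression imputation: predicted value plus a random residual draw.
y_stoch = [a + b * xi + random.gauss(0, residual_sd) if yi is None else yi
           for xi, yi in zip(x, y)]

print("variance of complete y            :", round(statistics.variance(y_full), 2))
print("variance after regression imput.  :", round(statistics.variance(y_reg), 2))
print("variance after stochastic imput.  :", round(statistics.variance(y_stoch), 2))
```

Plain regression imputation places every imputed value exactly on the regression line, so it understates the variance; adding a residual draw restores the scatter around the line.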

Data Imputation (AMOS)

• Bayesian Imputation. Resembles stochastic regression imputation, except that it takes into account the fact that the parameter values are only estimated and not known.

Performance of the Various Methods to Deal with Missing Data

• When the missing data are MCAR (missingness is entirely unrelated statistically to the values that would have been observed):

– PD, LD, and full information maximum likelihood (FIML) all yield consistent solutions

– PD and LD are not as efficient as FIML

– MI is consistent for the first moments but yields biased variance and covariance estimates.

– MI is not recommended for structural equation modeling, which is based on variance and covariance information.

Performance of the Various Methods to Deal with Missing Data

• When the missing data are MAR (missingness and data values are statistically unrelated conditional on a set of predictor or stratifying variables):

– PD, LD, and MI can produce severely biased results regardless of the sample size.

– FIML yields parameter estimates that are consistent and efficient.

Performance of the Various Methods to Deal with Missing Data

• When the missing data are nonignorable (missingness conveys probabilistic information about the values that would have been observed):

– All standard multivariate approaches can yield biased results.

– There is some evidence, however, that FIML estimates tend to be less biased than those of other methods.

– FIML is recommended for handling missing data.

NORMALITY

• Many SEM estimation procedures assume multivariate normal distributions

• Univariate nonnormality is considered severe when the absolute value of the skew index is > 3.0 or the absolute value of the kurtosis index is > 10.

• Multivariate nonnormality can be detected by indices of multivariate skew or kurtosis

• Non-normal distributions can sometimes be corrected by transforming variables
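
A rough screening function for the univariate cutoffs might look like the sketch below. It assumes moment-based indices and a kurtosis index centered so that a normal curve scores 0; software packages differ on this convention, so the choice is an assumption of the example. The function names and the simulated samples are hypothetical.

```python
import random


def skew_and_kurtosis(values):
    """Return (skew index, kurtosis index) computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0  # assumption: 0 corresponds to a normal curve
    return skew, kurtosis


def severely_nonnormal(values, skew_cutoff=3.0, kurtosis_cutoff=10.0):
    """Flag a variable when either index exceeds its cutoff in absolute value."""
    skew, kurtosis = skew_and_kurtosis(values)
    return abs(skew) > skew_cutoff or abs(kurtosis) > kurtosis_cutoff


random.seed(3)
roughly_normal = [random.gauss(0, 1) for _ in range(5000)]
heavy_tailed = [random.lognormvariate(0, 1.5) for _ in range(5000)]

print("normal sample flagged      :", severely_nonnormal(roughly_normal))
print("heavy-tailed sample flagged:", severely_nonnormal(heavy_tailed))
```

A logarithmic transformation of the heavy-tailed sample would be one example of the corrective transformations mentioned above.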

OUTLIERS

• Univariate outliers are more than three SDs away from the mean

• Detection by inspecting frequency distributions and univariate measures of skewness and kurtosis

• Multivariate outliers may have extreme scores on two or more variables, or their configurations of scores may be unusual

• Detection by inspecting indices of multivariate skewness and kurtosis. Mahalanobis distance squared is distributed as chi-square with df equal to the number of variables.

• Can be remedied by correcting errors, by dropping these cases, or by transforming the variables
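
The Mahalanobis check can be sketched for the two-variable case, where the covariance matrix can be inverted by hand and the chi-square upper-tail probability with df = 2 has the closed form exp(-D²/2). The simulated data, the planted case, and the p < .001 cutoff are assumptions made for the illustration.

```python
import math
import random

random.seed(2)

# Hypothetical data: 200 cases on two strongly correlated variables.
cases = []
for _ in range(200):
    a = random.gauss(0, 1)
    cases.append((a, 0.9 * a + random.gauss(0, 0.44)))
# Planted case: each score is within three SDs, but the combination is unusual.
cases.append((2.0, -2.0))

n = len(cases)
mx = sum(c[0] for c in cases) / n
my = sum(c[1] for c in cases) / n
sxx = sum((c[0] - mx) ** 2 for c in cases) / (n - 1)
syy = sum((c[1] - my) ** 2 for c in cases) / (n - 1)
sxy = sum((c[0] - mx) * (c[1] - my) for c in cases) / (n - 1)
det = sxx * syy - sxy ** 2


def d_squared(case):
    """Squared Mahalanobis distance from the centroid (2 x 2 inverse by hand)."""
    dx, dy = case[0] - mx, case[1] - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


d2 = [d_squared(c) for c in cases]
# Chi-square upper-tail probability for df = 2 (two variables).
p_values = [math.exp(-d / 2) for d in d2]
# Assumed conservative cutoff for flagging a multivariate outlier.
flagged = [i for i, p in enumerate(p_values) if p < 0.001]

print("flagged case indices:", flagged)
print("D^2 of planted case :", round(d2[-1], 1))
```

The planted case would pass a univariate three-SD screen on either variable, yet its pattern of scores gives it by far the largest distance. With more than two variables a general matrix inverse and a chi-square routine would be needed instead of these closed forms.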

MULTICOLLINEARITY

• Occurs when intercorrelations among some variables are so high that certain mathematical operations are impossible or results are unstable because denominators are close to 0.

• Bivariate correlations > 0.85; multiple correlations > 0.90

• May cause a nonpositive definite/singular covariance matrix

• May be due to inclusion of both individual and composite variables

• Detection: Tolerance = 1 - R^2 < 0.10; Variance Inflation Factor (VIF) = 1/(1 - R^2) > 10, where R^2 is the squared multiple correlation between a variable and all the others

• Can be corrected by eliminating or combining redundant variables
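
The two detection statistics are simple functions of R^2, as the short sketch below shows. The R^2 values are hypothetical; in practice each would come from regressing one variable on all the others.

```python
def tolerance_and_vif(r_squared):
    """Tolerance = 1 - R^2 and VIF = 1 / (1 - R^2) for one variable."""
    tolerance = 1.0 - r_squared
    return tolerance, 1.0 / tolerance


def is_redundant(r_squared, tolerance_cutoff=0.10):
    """Tolerance below 0.10 is the same condition as VIF above 10."""
    tolerance, _ = tolerance_and_vif(r_squared)
    return tolerance < tolerance_cutoff


# Hypothetical squared multiple correlations for three variables.
for name, r2 in [("income", 0.50), ("education", 0.85), ("ses_composite", 0.92)]:
    tol, vif = tolerance_and_vif(r2)
    print(f"{name:14s} R^2={r2:.2f} tolerance={tol:.2f} VIF={vif:.1f} "
          f"redundant={is_redundant(r2)}")
```

For R^2 = 0.92 the tolerance is 0.08 and the VIF is 12.5, so the variable would be flagged; since VIF is the reciprocal of tolerance, the two cutoffs always agree.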

RELATIVE VARIANCES

• Covariance matrices where the ratio of the largest to the smallest variance is greater than 10 are ill scaled

• Most SEM estimation methods are iterative

• Estimates may not converge to stable values when variances of observed variables are very different in magnitude

• To prevent this problem, variables with extremely low or high variances can be rescaled by multiplying or dividing observed scores by a constant. This changes a variable's mean and variance but not its correlations with other variables.
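
The effect of rescaling can be checked directly. In this sketch the two variables, their values, and the divisor of 10,000 are hypothetical.

```python
import statistics

# Hypothetical observed variables on very different scales.
income = [32000.0, 45000.0, 51000.0, 60000.0, 75000.0, 98000.0]  # dollars
satisfaction = [2.0, 3.0, 3.0, 4.0, 4.0, 5.0]                    # 1 to 5 rating


def correlation(a, b):
    """Pearson correlation computed from sums of squares and cross-products."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / (ss_a * ss_b) ** 0.5


def variance_ratio(a, b):
    """Ratio of the larger to the smaller sample variance."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return max(va, vb) / min(va, vb)


ratio_before = variance_ratio(income, satisfaction)
r_before = correlation(income, satisfaction)

# Rescale income by dividing by a constant (units of 10,000 dollars).
income_rescaled = [v / 10000.0 for v in income]
ratio_after = variance_ratio(income_rescaled, satisfaction)
r_after = correlation(income_rescaled, satisfaction)

print("variance ratio before:", round(ratio_before))
print("variance ratio after :", round(ratio_after, 2))
print("correlation before/after:", round(r_before, 4), round(r_after, 4))
```

Dividing by the constant brings the variance ratio under 10 while leaving the correlation unchanged apart from floating-point rounding.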

LINEARITY

• SEMs assume linearity in the relations among the variables

• Estimation of curvilinear and interactive effects is possible.

VIOLATIONS OF ASSUMPTIONS

• The best known distribution with no excess kurtosis is the multivariate normal.

• Leptokurtic (more peaked) distributions result in too many rejections of H0 based on the chi-square statistic.

• Platykurtic distributions lead to chi-square estimates that are too low.

VARIABLE SCALES

• SEM in general assumes observed variables are measured on a linear continuous scale

• Dichotomous and ordinal variables cause problems because correlations/covariances tend to be truncated. Scores on such variables are not normally distributed, and responses to individual items may not be very reliable.

• Some SEM programs like LISCOMP can analyze dichotomous and ordinal variables

• PRELIS can be used to prepare a corrected covariance matrix for non-continuous variables.

VIOLATIONS OF ASSUMPTIONS

• High degrees of skewness lead to excessively large Chi square estimates.

• In small samples (N<100), the Chi square statistic tends to be too large.

Reliability

• The degree to which scores are free from random measurement error

• Reliability measures

– Internal Consistency Reliability

– Test-retest Reliability

– Alternate Forms Reliability

Reliability

• Levels of Reliability

– 0.90 Excellent

– 0.80 Very Good

– 0.70 Adequate
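
As an illustration of how such a coefficient might be obtained and labeled, the sketch below computes Cronbach's alpha, a common internal consistency coefficient. The slides do not name a specific coefficient, so this choice is an assumption, as are the item scores and the treatment of the three levels as lower bounds.

```python
import statistics


def cronbach_alpha(items):
    """items: one list of scores per item, all in the same respondent order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variance_sum = sum(statistics.variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / statistics.variance(totals))


def reliability_label(coefficient):
    """Label a coefficient, treating the listed levels as lower bounds (assumption)."""
    if coefficient >= 0.90:
        return "excellent"
    if coefficient >= 0.80:
        return "very good"
    if coefficient >= 0.70:
        return "adequate"
    return "below adequate"


# Hypothetical responses of six people to three items of one scale.
items = [
    [4.0, 3.0, 5.0, 2.0, 4.0, 1.0],
    [5.0, 3.0, 4.0, 2.0, 5.0, 2.0],
    [4.0, 2.0, 5.0, 3.0, 4.0, 1.0],
]

alpha = cronbach_alpha(items)
print("alpha =", round(alpha, 3), "->", reliability_label(alpha))
```

Test-retest and alternate forms reliability would instead be estimated by correlating scores across occasions or across forms.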

Validity

• Whether the scores measure what they are supposed to measure

• Types of validity

– Construct Validity (SEM Confirmatory Factor Analysis helps to establish construct validity)

– Criterion-Related Validity (Correlation with an external standard)

– Convergent Validity/Discriminant Validity (Can be determined through SEM Confirmatory Factor Analysis)

FORM OF INPUT DATA

• ASCII

• SPSS

• Microsoft Excel 3 through 8

• Microsoft Access through Access 97

• Microsoft FoxPro 2.0, 2.5, 2.6

• dBase 3 through 5

• Lotus 1-2-3 with wk1, wk3 and wk4 extensions

Reference

• R.B. Kline, “Chapter 3. Data Preparation and Screening,” in Principles and Practice of Structural Equation Modeling, NY: Guilford Press, 2005, pp. 45-62.