PSYCHOMETRIKA--VOL. 44, NO. 1 MARCH, 1979

RECOVERY OF STRUCTURE IN INCOMPLETE DATA BY ALSCAL

ROBERT C. MACCALLUM

THE OHIO STATE UNIVERSITY

A Monte Carlo study was carried out to investigate the ability of ALSCAL to recover true structure inherent in simulated proximity measures when portions of the data are missing. All sets of simulated proximity measures were based on 30 stimuli and three dimensions, and selection of missing elements was done randomly. Properties of the simulated data varied according to (a) the number of individuals, (b) the level of random error, (c) the proportion of missing data, and (d) whether the same entries or different entries were deleted for each individual. Results showed that very accurate recovery of true distances, stimulus coordinates, and weight vectors could be achieved with as much as 60% missing data as long as sample size was sufficiently large and the level of random error was low.

Key words: multidimensional scaling, individual differences models, missing data.

A serious practical problem arises in the data collection phase of a multidimensional scaling study if the researcher wishes to employ a large number of stimuli. Given p stimuli, a complete set of data for a single individual would consist of proximity judgments for each of the p(p - 1)/2 distinct stimulus pairs. Thus, it is difficult to employ more than about 20 stimuli because the number of responses required from a subject would become too large. This problem is of substantial importance in that a number of studies [Isaac & Poor, 1974; Sherman, 1972; Young, 1970] have demonstrated that it is generally desirable to employ a large number of stimuli.
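The growth in the number of required judgments can be made concrete with a short calculation. The sketch below is illustrative only; the function name `n_pairs` is an assumption introduced here and does not come from the paper.

```python
def n_pairs(p: int) -> int:
    """Number of distinct stimulus pairs, p(p - 1)/2, for p stimuli."""
    return p * (p - 1) // 2


# Judgments required from one subject for a complete data set.
for p in (10, 20, 30):
    print(p, n_pairs(p))  # 10 -> 45, 20 -> 190, 30 -> 435
```

With 20 stimuli a subject must already supply 190 judgments, and with the 30 stimuli used in this study a complete set would require 435.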

A possible solution to this difficulty is offered by the capability of nonmetric multidimensional scaling algorithms for dealing with missing data. A researcher could employ a large number of stimuli, collect proximity judgments for only a subset of the p(p - 1)/2 possible pairs, and then obtain the scaling solution based on the incomplete data. Application of this procedure raises some important questions. How does one select the subset of stimulus pairs for which to collect judgments? How much data need be collected? What will be the quality of the solution obtained from the incomplete data relative to (a) a solution that would be obtained from complete data, and (b) the true underlying structure of the data?

These questions have been investigated for the two-way multidimensional scaling model in a Monte Carlo study by Spence and Domoney [1974]. The purpose of the present study is to investigate the issue of incomplete data in the context of three-way multidimensional scaling. Whereas a variety of three-way scaling models have been proposed [e.g., Carroll & Chang, 1970; Tucker, 1972; Tucker & Messick, 1963; Carroll & Chang, Note 2; Harshman, Note 3], the ALSCAL technique recently developed by Takane, Young, and deLeeuw [1977] is the first three-way scaling technique capable of dealing explicitly with missing data. The ALSCAL algorithm treats missing data in the same fashion as the familiar two-way nonmetric algorithms, optimizing a goodness of fit measure which is based on only those stimulus pairs for which judgments are present. The scaling model which ALSCAL employs is the weighted euclidean model [Carroll & Chang, 1970; Horan, 1969; Bloxom, Note 1], which assumes that all individuals share a common r-dimensional stimulus space and that individual differences in proximity judgments arise from differential weighting of the stimulus dimensions. The present study will investigate, using Monte Carlo procedures, the quality of weighted euclidean solutions obtained by ALSCAL under various conditions of incomplete data.

Requests for reprints should be sent to Robert C. MacCallum, Department of Psychology, 404C West 17th Avenue, The Ohio State University, Columbus, Ohio 43210.

0033-3123/79/0300-0069$00.75/0 © 1979 The Psychometric Society
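As a brief illustration of the weighted euclidean model, the sketch below computes the distance between two stimuli for one individual by weighting each squared coordinate difference by that individual's dimension weight before summing. The function and argument names (`weighted_distance`, `x_i`, `x_j`, `w_k`) are assumptions chosen for this example; the sketch is not the ALSCAL estimation procedure itself, only the distance model it fits.

```python
import math


def weighted_distance(x_i, x_j, w_k):
    """Weighted euclidean distance between stimuli i and j for individual k.

    x_i, x_j: coordinates of the two stimuli in the common stimulus space.
    w_k: nonnegative weights that individual k applies to each dimension.
    """
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(w_k, x_i, x_j)))


# Two stimuli in a common two-dimensional space (hypothetical coordinates).
x1, x2 = (0.0, 0.0), (3.0, 4.0)

# Equal unit weights reduce to the ordinary euclidean distance.
print(weighted_distance(x1, x2, (1.0, 1.0)))

# An individual who attends only to the first dimension.
print(weighted_distance(x1, x2, (4.0, 0.0)))
```

Two individuals thus share the same stimulus coordinates but can produce different proximity judgments solely because their weight vectors differ.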

Generation of Simulated Data

To generate simulated proximity measures, the author employed exactly the same process as was used by MacCallum and Cornelius [1977] in their investigation of recovery of structure by ALSCAL for complete data. The reader is referred to the MacCallum and Cornelius paper for all details of the data generating procedure. The present author merely added a method for deleting specified proportions of data in a controlled fashion so as to yield incomplete sets of simulated proximity measures. In generating the sets of simulated data, two aspects of the data were held constant: the number of stimuli (30), and the number of dimensions (3). Four other characteristics of the data were controlled as independent variables. The first independent variable, the number of individuals, varied across the values 10 and 20. The second independent variable, random error, took on levels of .10 and .40 as defined by the Girard and Cliff [1976] error model.

The remaining two independent variables concerned the deletion of proximity measures from the complete sets of simulated data. First, the proportion of data to be deleted for all individuals in a sample was set at either .20, .40, or .60. Random deletion was employed because the results of Spence and Domoney [1974] showed that, for practical purposes, random deletion worked as well as systematic deletion techniques. The final independent variable involved individual differences in the selection of missing elements. If missing elements were selected independently for each person, it would be virtually certain that each stimulus pair would be judged by at least a few people in the sample; however, if the same elements were missing for each person, there would be many stimulus pairs for which no proximity measures were available. Intuitively, it would seem that recovery of structure might be better in the former case because there would almost certainly be available proximity measures for all pairs of stimuli. In order to investigate this matter, simulated data were generated under each of these conditions. Under one condition, missing elements were selected randomly and independently for each individual. Under the second condition, missing elements were selected randomly for the first individual, and then the same elements were deleted for all remaining individuals.
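The two deletion conditions can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's program: the function names, the use of Python's `random` module, and the seed are all choices made for this example.

```python
import random
from itertools import combinations


def deletion_sets(n_stimuli, n_individuals, proportion, same, seed=0):
    """Return, for each individual, the set of stimulus pairs treated as missing."""
    rng = random.Random(seed)
    pairs = list(combinations(range(n_stimuli), 2))
    n_delete = round(proportion * len(pairs))
    if same:
        # Select once, then delete the same elements for every individual.
        shared = frozenset(rng.sample(pairs, n_delete))
        return [shared] * n_individuals
    # Select randomly and independently for each individual.
    return [frozenset(rng.sample(pairs, n_delete)) for _ in range(n_individuals)]


def never_judged(missing):
    """Pairs that are missing for every individual in the sample."""
    return frozenset.intersection(*missing)


same = never_judged(deletion_sets(30, 10, 0.60, same=True))
diff = never_judged(deletion_sets(30, 10, 0.60, same=False))
print(len(same), len(diff))
```

Under the same-deletion condition every deleted pair (261 of 435 at the .60 level) has no judgment from anyone, whereas under independent deletion only a pair that happens to be deleted for all individuals is left entirely unjudged.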

The control of these four independent variables yields a 2 × 2 × 3 × 2 design, or 24 different conditions in which simulated proximities were generated. For each cell, five different sets of data were constructed, yielding a total of 120 different sets of incomplete simulated proximity measures, each having a known underlying stimulus configuration and known true weight structure. The 120 sets of data were then analyzed via ALSCAL in three dimensions, using exactly the same program controls as specified by MacCallum and Cornelius [1977].
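The factorial design described above can be enumerated directly. The variable names in this sketch are assumptions introduced for illustration; the factor levels are those stated in the text.

```python
from itertools import product

n_individuals = (10, 20)
error_levels = (0.10, 0.40)
proportions_deleted = (0.20, 0.40, 0.60)
deletion_pattern = ("same", "different")
replications = 5  # data sets generated per cell

# Every combination of factor levels is one cell of the design.
cells = list(product(n_individuals, error_levels, proportions_deleted, deletion_pattern))
n_datasets = len(cells) * replications
print(len(cells), n_datasets)  # 24 120
```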

To evaluate the ability of ALSCAL to recover true structure, the author employed three of the four measures of recovery used by MacCallum and Cornelius [1977]. The three indices which were used were (a) M, measuring recovery of true interpoint distances; (b) δ, measuring recovery of the true stimulus configuration; and (c) φ, measuring recovery of true individual weights. SSTRESS was not employed as a recovery measure in the present study because SSTRESS values would not be comparable across sets of data having different proportions of missing elements.

To determine how these three measures of recovery were affected by the four independent variables, a four-way univariate analysis of variance was applied to each recovery measure. It was also desired to examine how recovery of structure for the sets of incomplete data would compare to recovery of structure in similar sets of complete data. Thus, five sets of complete data were generated for each cell of a 2 × 2 design, where the two independent variables were number of individuals (10 and 20) and level of random error (.1 and .4). Again, these sets of data were based on 30 stimuli and three dimensions. Values of the recovery measures were then obtained for each of these 20 sets of complete data.

Results

Results of the analyses of variance should be interpreted with caution for several reasons. First, there is a lack of randomization in that different sets of simulated data can contain overlapping sets of stimuli. Second, it is possible that assumptions of normality and homogeneity of variance have been violated. Thus, attention will be focused on those effects which reveal important phenomena and which achieve practical as well as statistical significance. A third problem is that the three dependent variables are not uncorrelated. Mean within-cell correlations are as follows: r(M, δ) = -.42; r(M, φ) = -.48; r(δ, φ) = .68. Though these correlations are substantial, they are not extremely high, indicating that M, δ, and φ seem to measure conceptually distinct aspects of recovery; thus, separate univariate analyses of variance might be considered justifiable and useful.

Probability levels and ω² values from the analyses of variance are presented in Table 1. (Note: M^(1/2) was transformed to Fisher's z prior to analysis.) The pattern of effects of the four independent variables and their interactions is quite similar across the dependent variables. An understanding of the nature of these effects can be achieved by examining two sources of variance: (a) the main effect of error level, and (b) the three-way interaction among number of individuals, proportion of missing data, and same vs. different deletion. The main effect of error level was strong for all three recovery measures. For the .1 and .4 error levels, respectively, the marginal means of M were .980 and .944; the means for δ were .186 and .398; and the means for φ were .006 and .017. Thus, it is seen that recovery of all aspects of structure deteriorates substantially with increasing error.
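For readers unfamiliar with the transformation mentioned in the note, Fisher's z is the inverse hyperbolic tangent of a correlation. The sketch below assumes that the transformed quantity is the square root of M treated as a correlation; that reading of the note, and the function names, are assumptions made for this illustration rather than details restated in the paper.

```python
import math


def fisher_z(r):
    """Fisher's z transformation of a correlation r, with -1 < r < 1."""
    return math.atanh(r)


def inverse_fisher_z(z):
    """Map a Fisher z value back to the correlation scale."""
    return math.tanh(z)


# Marginal means of M at the two error levels, as reported in the text.
for m in (0.980, 0.944):
    r = math.sqrt(m)  # assumed: the square root of M is analyzed as a correlation
    print(m, round(fisher_z(r), 3))
```

The transformation stretches the scale near 1, where values of M in this study are concentrated, which makes the distribution more suitable for analysis of variance.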

The effects of the other three independent variables can be understood in terms of their three-way interaction. The effect of this interaction on M is shown in Figure 1; δ and φ behaved in the same fashion. The figure shows clearly that recovery deteriorates badly under one particular combination of circumstances: a small sample of individuals, a high level of missing data, and the same elements missing for each individual. Without that one condition, effects of these three independent variables and their interactions are rather mild.

Recovery measures obtained from the analysis of complete data support the findings of MacCallum and Cornelius [1977]. The various aspects of recovery deteriorate with increasing error, but are essentially unaffected by changes in the number of individuals. Some relevant cell means are cited below.

The results are quite interpretable and have some important practical implications. The rather large effect produced by the level of random error is as expected. The effect is similar to that found by MacCallum and Cornelius [1977] in their study of complete data, and indicates that random error is a very important aspect of the data with respect to the ability of a scaling technique to recover true structure in those data. Thus, the empirical researcher should take whatever precautions he can to minimize error of measurement during the data collection process.

The nature of the three-way interaction among the other three independent variables is also quite important. If we temporarily disregard the especially disadvantageous condition of small sample, high deletion, and same missing elements for each individual, we find that (a) the effect of sample size on recovery is negligible and unsystematic; (b) recovery deteriorates only mildly as the amount of missing data increases from 20% to 60%; and (c) it makes relatively little difference whether missing elements are the same or different across individuals. This implies that the empirical researcher wishing to collect incomplete data and still obtain an optimal level of recovery can collect as little as 40%-50% of the proximity measures if he uses a moderately large sample. The user need not go to the trouble of having each subject judge a different subset of stimulus pairs. This is an important point because it would generally be difficult and costly to obtain judgments on different stimulus pairs for each subject, whereas it would be quite simple to collect incomplete data by asking each subject to provide judgments about the same subset of stimulus pairs. Results clearly indicate, though, that when sample size is small and the proportion of missing data is high, it will be necessary to collect judgments on different stimulus pairs from each subject.

TABLE 1

Significance Levels and Omega-Squares from Analysis of Variance of M, δ, and φ

                             M                   δ                   φ
Source                Probability  ω²     Probability  ω²     Probability  ω²

A (Individuals)         <.001     .04       <.001     .11       <.005     .04
B (Error Level)         <.001     .28       <.001     .19       <.001     .17
C (Proportions)         <.001     .16       <.001     .16       <.001     .13
D (Same vs. Diff.)      <.005     .02       <.05      .01       NS        .00
AB                      <.005     .02       <.05      .01       NS        .00
AC                      <.001     .08       <.001     .10       <.001     .09
AD                      <.001     .05       <.001     .04       NS        .01
BC                      NS        .00       NS        .00       NS        .00
BD                      NS        .01       <.01      .02       NS        .01
CD                      NS        .01       NS        .00       NS        .00
ABC                     NS        .00       NS        .00       NS        .00
ABD                     NS        .00       NS        .00       NS        .00
ACD                     <.005     .03       <.005     .03       <.05      .03
BCD                     NS        .00       NS        .00       NS        .00
ABCD                    NS        .00       NS        .00       NS        .01
Residual                          .30                 .33                 .53

It is especially interesting to examine the values of the recovery statistics for the cell representing the conditions for optimal recovery with the greatest amount of missing data. With 60% missing data, optimal levels of all three recovery measures were obtained with n = 20, deletion of the same elements for each subject, and the lowest level of random error. For the five replications in this cell, the mean value of M was .9611, the mean δ was .0619, and the mean φ was .0013. These values compare favorably with the results for complete data. In particular, cell means from the analysis of complete data for n = 20 and low error level were M = .9887, δ = .0300, and φ = .0004. Thus, M declines only slightly with 60% missing data. Though the deterioration in δ and φ is a little more noticeable, the values of δ and φ specified above still represent quite accurate recovery of stimulus coordinates and weight vectors even when 60% of the data are missing. MacCallum and Cornelius [1977] point out that values of δ less than .1 and values of φ less than .02 can be taken as representing accurate recovery of corresponding aspects of structure. In conclusion, it is found that very accurate recovery of distances, stimulus coordinates, and weight vectors can be achieved by ALSCAL using less than half the proximity measures if sample size is sufficient and error of measurement is low.

FIGURE 1

Effect of interaction between number of individuals, percent deletion, and same-vs.-different deletion on M. [Figure not reproduced: three panels (20%, 40%, and 60% deletion), each plotting M (vertical axis, .70 to 1.00) against number of individuals (10 and 20), with separate lines for the same and different deletion conditions.]

REFERENCE NOTES

1. Bloxom, B. Individual differences in multidimensional scaling (ETS Res. Bull. 68-45). Princeton, New Jersey: Educational Testing Service, 1968.

2. Carroll, J. D., & Chang, J. J. IDIOSCAL (Individual Differences in Orientation SCALing): A generalization of INDSCAL allowing IDIOsyncratic reference systems as well as an analytic approximation to INDSCAL. Paper presented at meetings of the Psychometric Society, Princeton, N.J., March 1972.

3. Harshman, R. A. PARAFAC2: Mathematical and technical notes. (Working Papers in Phonetics 22) California: U.C.L.A., March 1972.

REFERENCES

Carroll, J. D., & Chang, J. J. Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 1970, 35, 283-320.

Girard, R. A., & Cliff, N. A Monte Carlo evaluation of interactive multidimensional scaling. Psychometrika, 1976, 41, 43-64.

Horan, C. B. Multidimensional scaling: Combining observations when individuals have different perceptual structures. Psychometrika, 1969, 34, 139-165.

Isaac, P. D., & Poor, D. D. S. On the determination of appropriate dimensionality in data with error. Psychometrika, 1974, 39, 91-109.

MacCallum, R. C., & Cornelius, E. T., III. A Monte Carlo investigation of recovery of structure by ALSCAL. Psychometrika, 1977, 42, 401-428.

Sherman, C. R. Nonmetric multidimensional scaling: A Monte Carlo study of the basic parameters. Psychometrika, 1972, 37, 323-355.

Spence, I., & Domoney, D. W. Single subject incomplete designs for nonmetric multidimensional scaling. Psychometrika, 1974, 39, 469-490.

Takane, Y., Young, F. W., & deLeeuw, J. Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika, 1977, 42, 7-67.

Tucker, L. R. Relations between multidimensional scaling and three-mode factor analysis. Psychometrika, 1972, 37, 3-27.

Tucker, L. R., & Messick, S. An individual differences model for multidimensional scaling. Psychometrika, 1963, 28, 333-367.

Young, F. W. Nonmetric multidimensional scaling: Recovery of metric information. Psychometrika, 1970, 35, 455-473.

Manuscript received 7/8/77
First revision received 1/3/78
Final version received 6/7/78