
  • GRE Analytical Reasoning Item Statistics Prediction Study

    R. F. Boldt

    GRE Board Report No. 94-02P

    August 1998

    This report presents the findings of a research project funded by and carried out under the auspices of the Graduate

    Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541

  • ********************

    Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate

    Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

    ********************

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.

Copyright © 1998 by Educational Testing Service. All rights reserved.

  • Abstract

    Chalifour and Powers (1989) noted that the ability to predict item statistics might be used to reduce the volume of item pretesting (Mislevy, Sheehan, & Wingersky, 1993). It could also lead to better control of test statistics, such as item difficulty distributions, and to improved test specifications. Research, such as the previously cited study, is needed to examine the prospects for attaining these benefits. That study examined prediction of GRE analytical reasoning item statistics. Linear regression was used by these authors, but they surmised that non-linear techniques might provide somewhat better prediction of item statistics.

    Chalifour and Powers had amassed item statistics on a very large sample of analytical reasoning items. That sample was used in the present study.

For the present study, predictions were generated using a type of neural net, a technique that can accommodate a wide variety of non-linear relationships, though at a cost of requiring the calculation of many constants in the prediction function. This technique did indeed provide more accurate predictions of item difficulty and item discrimination in an estimation sample. However, when the functions developed in the estimation sample were cross validated in a fresh sample, the advantages noted in the estimation sample disappeared. Only by including expert judgements of item difficulty with other predictors was the accuracy of prediction improved.

The variables that produced the above results had been selected by Chalifour and Powers (1989) for their efficacy in linear prediction. The study reported here used a "genetic algorithm", which provides a quasi-random search of the predictor set to find an optimal set of predictors. The search used the validities of neural nets computed during operation of the algorithm to evaluate the efficacy of prediction. Hence, this technique sought variable sets that were optimal when a neural net was to be used.


The genetic algorithm accomplished some improved prediction of item difficulties in the estimation sample, but no improvement with regard to discrimination. When the variable sets and the related nets were evaluated in the validation sample, any advantage gained by the search was lost. The validities of predictions in the validation sample were approximately equal whether or not the genetic algorithm had been used in developing the prediction functions.

Examination of the root mean squares of the discrepancies between predicted and actual values for item difficulties (and discriminations) in the validation sample revealed no advantage for the neural net.

  • In sum, application of the complex and computer-intensive neural nets and genetic algorithms revealed no advantage over linear methods for predicting item difficulty and item discrimination statistics for GRE analytical reasoning items.

  • Background

    Several methods are available for studying the relations between item characteristics and item performance. They include:

- factor analysis
- causal analysis
- Tatsuoka's rule space
- classical prediction

    These methods use different data and lead to somewhat different inferences.

Factor analysis is the most familiar method for studying the abilities underlying item performance. GRE studies include those of Schaeffer and Kingston (1988), Stricker and Rock (1983), Rock, Wertz, and Grandy (1981), Powers and Swinton (1981), and Powers, Swinton, and Carlson (1977). These studies associate items with rather generic abilities and can lead to new item types or scales.

    Causal analysis links hypothesized systems of constructs, e.g., skills and abilities, to their effects on item performance. Thus a causal model includes a mixture of observed and latent variables together with a flow of influence between variables. Hypotheses about the flow of influence are often expressed graphically, as in influence diagrams (Oliver and Smith, 1990). Causal analysis can be investigated using confirmatory factor analysis and latent variables, or latent structure models in which the variables can have only a few levels. GRE has funded the development of a version of causal analysis that uses multichotomous variables (Kim, 1993). Causal research can lead to verified models that relate item performances to the underlying ability variables.

A treatment of more specific abilities and item performance uses "rule space" as defined by Tatsuoka (1983, 1990). In contrast with other approaches, rule space is defined by assigning several scores to each item response. The scores reflect locations in rule space. Thus, responses implied by specific cognitive errors can be translated into locations in rule space, and the cognitive error can be located in rule space. Also, scores on the dimensions are found for examinees, so the examinees have locations in the space. As a result, when the point representing an examinee lies near a point representing a cognitive error, that examinee might benefit from tutoring to remove the error.

    Rule space studies and causal analyses both seek ways of interpreting data in terms of cognitive constructs. In these studies, characteristics of items form the basis for judgements about the role of constructs in item performance. For example, performance on an item can reflect a certain cognitive error if it affords the opportunity to make that error.


  • One can, however, conduct studies that connect item characteristics (including such characteristics as might be used in the rule-space analysis) to item statistics directly. For example, Freedle and Kostin (1992) studied the prediction of item statistics for reading comprehension, and Chalifour and Powers (1989) predicted analytical reasoning item statistics. In these classical prediction studies, coded characteristics of items were used as predictors of item statistics.

    Analytical reasoning (AR) items from the GRE analytical measure are similar to the familiar paragraph comprehension items, for which approximately five items are associated with each passage. For AR items, paragraphs pose rules that relate elements, such as the members of a committee or sets of events. The items require examinees to use the rules to deduce a solution to a problem. Some examples of the items were given by Chalifour and Powers (1989).

For the kind of study conducted by Chalifour and Powers (1989), once the item characteristics are identified, further specification of cognitive constructs is unnecessary for predicting item statistics. Of course, theory can play a role in selecting variables and in interpreting research results. But theoretical considerations aside, predicting item statistics is of value in itself. For example, a validated set of characteristics could be used to manipulate item difficulty distributions and to set test or item specifications.

    Freedle and Kostin (1992) and Chalifour and Powers (1989) used four steps to establish prediction formulas:

1. Identify promising item attributes.
2. Define a system for coding items in terms of attributes.
3. Code a set of items.
4. Establish a prediction formula that uses the codes as predictors.

These authors used linear regression to establish their prediction formulas, but they surmised that nonlinear methods might yield even better predictions.

For ease of future applications, it seemed desirable in this study to base prediction on only a few codes. Therefore, a series of fittings of prediction formulas to item statistics, carried out using different combinations of predictor sets, was used here. The series of fittings was a systematic procedure designed to reach an optimum set of predictors. Freedle and Kostin and Chalifour and Powers used traditional, step-wise regression to arrive at their optimum subsets of predictors. A different system of selecting variables was needed for the non-linear prediction system used here.


  • Nonlinear Prediction: Neural Nets

    "Neural nets" is the name of a technique that originated in the field of artificial intelligence (AI) and that has been used often and successfully for prediction. The neural net approach can be characterized as one in which a complex mesh of logical connections is available, and the right configuration of strong and weak connections is sought for prediction purposes.

The use of neural nets in prediction is not widespread in the educational testing literature, but they have been applied in a wide variety of other applications, including price forecasts, labor hour estimates, polymer identification, legal strategies, bacteria identification, and sales prospect selection. In each of these applications there was an array of predictors and a good deal of uncertainty concerning how these predictors should be combined for prediction. The owners of these nets recorded codes describing many instances of the type in which they were interested, such as market statistics, labor statistics, chemical properties, bacteria measurements, and buyer characteristics. The codes were then fed into the net one case at a time, along with actual known outcomes, such as prices, labor hours, polymer identifications, legal outcomes, bacteria names, and sales figures. For each case, a prediction was generated and compared to the outcome, generating an error. The parameters of the net were modified on the basis of that error. When a reduction of error of useful magnitude was achieved, the net was considered to be "trained." The trained net was then available to its owner to generate forecasts using new market statistics, labor data, new compounds, bacteria characteristics, and buyers.

Descriptions of neural nets can be found in Caudill (1990). Input variables (coded item characteristics), composites, nodes, transformations, and output variables are important features of these nets. The type of net used in the present study puts the inputs through several phases when making predictions. For the first phase, several linear composites of the input variables are computed, one as input to each node (a "node" is a block of code that produces an output, which is a nonlinear transformation of its input). The outputs from several nodes are then linearly combined to obtain predictions.
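To make this structure concrete, the following sketch (a minimal illustration, not the program actually used in the study) computes the output of a net of this type: each node forms a weighted linear composite of the item codes, applies a nonlinear transformation, and the node outputs are then linearly combined into a single prediction. The logistic transformation and the parameter names (a, b, A, B) follow the notation of Appendix A; the dimensions in the example are arbitrary.

```python
import numpy as np

def logistic(z):
    """Node transformation t(I) = e^I / (1 + e^I); see Appendix A, equation (6)."""
    return 1.0 / (1.0 + np.exp(-z))

def net_predict(X, a, b, A, B):
    """Forward pass of a single-hidden-layer net.

    X : (n_items, n_inputs) coded item characteristics
    a : (n_nodes,)           additive constants for the node inputs
    b : (n_nodes, n_inputs)  weights for the node inputs
    A : scalar               additive constant for the output composite
    B : (n_nodes,)           weights combining the node outputs
    """
    I = X @ b.T + a          # linear composite entering each node
    W = logistic(I)          # node outputs (nonlinear transformation)
    return A + W @ B         # linear combination of node outputs

# Example with hypothetical dimensions: 7 predictors, 7 nodes, 5 items.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 7))
y_hat = net_predict(X, rng.normal(size=7), rng.normal(size=(7, 7)),
                    0.0, rng.normal(size=7))
```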

The net is similar to more familiar nonlinear prediction formulas in that both include powers and cross-products of input variables. The familiar models incorporate these elements of nonlinear prediction explicitly (Saunders, 1967). In the neural net they are incorporated through nonlinear transformation and the subsequent linear combination of transformed composites. By using the more complex process, the net selects from a wide variety of prediction formulae. As in least-squares regression, the present study uses a least-squares criterion to evaluate the accuracy of the resulting predictions. The procedure used for calculating the constants required for the net is given in Appendix A.

    Data

With neural nets, as with traditional prediction studies, an estimation sample and a validation sample are used. The estimation sample is used to determine the weights in the composites. There are many such weights with neural nets, making it easier to capitalize on chance with the nets than with traditional linear formulae, and also making it important to cross-validate the results. For example, seven predictors require eight constants for linear regression: seven regression weights and one additive constant. By contrast, if seven nodes are used, the type of net used here will require estimating 64 constants: 56 (seven nodes times eight constants for the prediction composites), plus eight to combine the node outputs into a prediction. Fitting that many constants permits much capitalization on chance factors.

The 64 constants that are associated with predicting deltas, and the 57 constants (seven nodes times seven constants for the six inputs, plus eight constants to combine node outputs) that are associated with predicting item biserials, were estimated using Newton-Raphson iterations in the estimation sample. Determining the constants of a net for any combination of seven variables examined was accomplished using 1,000 iterations.
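The parameter counts quoted above follow directly from the architecture: each node carries one weight per input plus an additive constant, and the output stage carries one weight per node plus an additive constant. A quick check (assuming the single-hidden-layer structure described here; the helper name is ours):

```python
def constant_count(n_inputs, n_nodes):
    """Number of constants in a single-hidden-layer net of the type used here."""
    per_node = n_inputs + 1       # weights plus an additive constant for each node
    output_stage = n_nodes + 1    # weights combining node outputs, plus A
    return n_nodes * per_node + output_stage

print(constant_count(7, 7))  # 64 constants for the delta predictions
print(constant_count(6, 7))  # 57 constants for the biserial predictions
```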

    Successful cross-validation requires large samples, and for this reason the Chalifour and Powers data, consisting of codings of more than 1,400 items, was used for this application. The number of items in Chalifour and Powers makes it ideal for this application, because previous experience (Boldt & Freedle, 1996) has demonstrated severe shrinkage on validation.

    Procedures

    Selecting Variables: Genetic Algorithms

    As noted earlier, it is useful to base predictions on a small number of variables. To this end, procedures for finding the most-valid small sets of predictors among a large number of potential predictors have long been used. In the linear context, the computations needed for variable selection are compact and efficient; equally efficient schemes for nonlinear prediction do not exist. Here again a useful procedure was found in the AI literature. It is called the "genetic algorithm" (GA), and involves a constrained random search of predictor sets. Buckles and Petry (1992) give an overview of genetic algorithms.


As applied to the present problem, the GA was used to select a sample of sets of predictor variables. Using the net, the validity of each member of the sample (that is, each predictor set) was evaluated. Then five processes followed:

1. The poorest members were dropped.
2. Missing members were replaced.
3. Randomly chosen members of the revised sample were "mutated" by changing some, but not all, of the members' predictors.
4. Randomly chosen pairs of members were "crossed" by exchanging some, but not all, of their predictors.
5. The accuracy of the prediction was evaluated.

Random variables entered into steps 2, 3, and 4, thus allowing a sampling of the predictor space. After the accuracy of the prediction by the new members was evaluated, the procedure recycled. Fifty Newton-Raphson iterations were used to develop the nets by which the accuracies of new members were evaluated. Fewer iterations were used in the GA because the major changes in validity occur in the early iterations. The GA used here is described in more detail in Appendix B, and a schematic sketch of one generation cycle is given below.
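The sketch below shows how the five processes fit together in one generation cycle. It is schematic only: the fitness function and the cross and mutate mechanisms are supplied by the caller (the mechanisms actually used are described in Appendix B), and, for simplicity, parents are drawn uniformly rather than in proportion to the objective function. The default settings (15 generations, nine chromosomes, two retained) follow the values reported later in this study.

```python
import random

def run_ga(all_predictors, set_size, fitness, cross, mutate,
           n_generations=15, pop_size=9, n_keep=2):
    """Schematic genetic algorithm over sets of predictor variables.

    fitness(pred_set) is assumed to fit a small neural net in the estimation
    sample and return its validity (higher is better).
    """
    population = [random.sample(all_predictors, set_size) for _ in range(pop_size)]
    for _ in range(n_generations - 1):                # the first generation is the start
        ranked = sorted(population, key=fitness, reverse=True)
        next_population = ranked[:n_keep]             # retain the best chromosomes
        while len(next_population) < pop_size:        # replace the dropped members
            if random.random() < 0.5:                 # mutation or crossing, equally likely
                next_population.append(mutate(random.choice(ranked), all_predictors))
            else:
                parent1, parent2 = random.sample(ranked, 2)
                next_population.append(cross(parent1, parent2))
        population = next_population
    return max(population, key=fitness)
```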

    Data and Variables

As also noted earlier, the data used in the present study were developed by Chalifour and Powers, who coded the characteristics of 1,474 items associated with 227 analytical reasoning item sets. Data were missing for 17 of those items, and as a result, Chalifour and Powers used missing-data correlations. Missing-data techniques were not available for the methods used here, so only the 1,457 fully coded items were used in the present study.

Two practices were followed for item coding in the present study. First, where any code consisted of slotting items into nonordered categories, the predictor variables were coded either as "one" for the category into which an item fell, or "zero" otherwise. For example, one code that referred to three types of objects in a pool used three categories: (1) living things (persons, animals, imaginary creatures, plants); (2) inanimate objects, events, places; and (3) other. Items that required categorizing animals in order would be coded 1,0,0; and items that required rating a mixture of plants and people would be coded 1,0,1.

Second, where any code consisted of counting some characteristic, if present, the predictor variable received two codes: (1) a "zero" or "one" for the absence or presence of the characteristic, and (2) the number of times the characteristic was counted. For example, one code recorded the number of positions in any orderings in any groups. For this code there were two variables with the values 0,0 if no orderings were involved and 1,x (where "x" is the number of positions) if indeed there were positions to be ordered. All counts were used as a single, numerical variable.
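A small illustration of the two coding practices (the function names and the example categories are illustrative only, not the study's coding program):

```python
def code_categories(category, categories=("living", "inanimate", "other")):
    """Nonordered categories become one 0/1 indicator per category."""
    return [1 if category == c else 0 for c in categories]

def code_count(count):
    """Counted characteristics become a presence indicator plus the count itself."""
    return [1, count] if count > 0 else [0, 0]

code_categories("living")   # -> [1, 0, 0]
code_count(0)               # no orderings            -> [0, 0]
code_count(4)               # four ordered positions  -> [1, 4]
```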

    With this coding 57 predictors were available, 56 of these made from item codings and the 57th being the expert prediction of item difficulty. A complete listing of the predictors is available in Chalifour and Powers (1988).

    Predicting Item Difficulty

The items used in the present study were divided into two samples: an estimation sample, in which the constants, or weights, used in the neural net were computed, and a validation sample, used to allow for the effects of capitalization on chance in the estimation process. Unlike Chalifour and Powers, who used 90% of the items for the estimation sample and 10% for the validation sample, the present study used equal estimation and validation sample sizes because of limited computer memory.

Another difference between the Chalifour and Powers study and the present study was that the present study ensured that all of the items associated with a particular paragraph appeared in the same sample; that is, they were all in the estimation sample or all in the validation sample. This was done because the paragraph is likely a partial determiner of the difficulty of the associated items, and splitting items from the same paragraph between the estimation and validation samples could lead to an inflated estimate of the validation sample validity. Item sets were allocated to the estimation and validation samples alternately, paragraph by paragraph. Since analytical reasoning items are arranged in increasing order of difficulty, the two samples were approximately equal in terms of difficulty. After the split there were 728 items in the estimation sample and 729 items in the validation sample.
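A sketch of this allocation rule, assuming each item can be mapped to the paragraph (item set) it shares; the function and variable names are illustrative:

```python
def split_by_paragraph(items, paragraph_of):
    """Assign whole item sets alternately to the estimation and validation samples."""
    estimation, validation = [], []
    seen_paragraphs = []
    for item in items:                       # items taken in their operational order
        p = paragraph_of(item)
        if p not in seen_paragraphs:
            seen_paragraphs.append(p)
        # Alternate paragraphs go to alternate samples, so no paragraph is split.
        target = estimation if seen_paragraphs.index(p) % 2 == 0 else validation
        target.append(item)
    return estimation, validation
```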

Analyses were carried out for predicting item difficulties (item deltas) and item-test biserials. The item delta is the proportion incorrect transformed into a normal deviate with a mean of 13 and a standard deviation of 4; deltas are commonly used in item analyses because they are more likely to exhibit linear trends than are the untransformed proportions. Experts' judgements of pass rates were also subjected to this transformation, but note that the argument of the transformation was the pass rate, not the failure rate.
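The delta transformation can be written as delta = 13 + 4z, where z is the normal deviate corresponding to the proportion of examinees answering incorrectly. A minimal sketch using the Python standard library (the helper name is ours):

```python
from statistics import NormalDist

def delta(p_incorrect):
    """Item delta: proportion incorrect as a normal deviate, rescaled to mean 13, SD 4."""
    return 13.0 + 4.0 * NormalDist().inv_cdf(p_incorrect)

delta(0.5)   # 13.0: an item that half the examinees miss
delta(0.16)  # about 9.0: an easier item
delta(0.84)  # about 17.0: a harder item
```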

    Results

Chalifour and Powers identified seven predictors that made significant contributions to prediction of item deltas. In the present study, these seven variables accounted for 31% of the variation of the item deltas in the estimation sample when linear regression was used to make the predictions; predictions based on a neural net after a thousand iterations accounted for 51% of the variation in the deltas. Chalifour and Powers used a six-predictor set for the biserials. In the estimation sample, linear prediction accounted for 16% of the variation of the item biserials, but the neural net accounted for 26%.

Figures 1 and 2 display plots of linear predictions versus neural net predictions calculated in the estimation sample. Figure 1 depicts predictions of item deltas and Figure 2 depicts predictions of item biserials. The item pairs that are plotted in these figures are derived from different transformations of the same item characteristic codes, yet the effect of using the different transformations appears to be the production of scatterplots of correlated random variables. This effect indicates that the neural net does more than merely bend the linear prediction.

These results appear to support the Chalifour and Powers conjecture that nonlinear prediction might improve the prediction of item statistics. But the gain was not large, and the resulting squared correlation, .51, was only on the order of the squared correlation of expert estimates of item difficulty alone, which was .46. Improved prediction was clearly desirable. The GA was expected to bring about such an improvement by selecting variables based on their efficacy for prediction in the context of the neural net, which is the context in which the variables were being evaluated.

Variable selection by applying the GA was accomplished for predicting item deltas, both allowing and not allowing expert estimates of proportion passing as one of the predictors. Similar applications were accomplished for predicting item biserial correlations. The variables in one of the chromosomes in the first generation of every application of the algorithm were the same as those identified in the Chalifour and Powers study. The algorithms were operated over 15 generations with nine chromosomes per generation. The two best chromosomes were retained from generation to generation, with the result that seven neural nets were constructed for any particular generation except the first, for which nine neural nets were constructed. For any application, then, 107 [(7 x 14) + 9] neural nets were constructed. When predicting item deltas, seven predictors were used; six predictors were used for predicting item biserials. As noted, the numbers of predictors used were the same as the numbers identified through linear predictor selection in the Chalifour and Powers study. Fifty Newton-Raphson cycles were used for each chromosome. When selecting chromosomes for a new generation, the selection mechanism chose crossing and mutation at random with equal probability. The crossing mechanism chose four of the seven variables at random to be crossed.


• Figure 1: Neural Net vs. Linear Estimates of Item Deltas (scatterplot; horizontal axis: linear regression estimates)

• Figure 2: Neural Net vs. Linear Estimates of Item Biserial Correlations (scatterplot; horizontal axis: linear regression estimates)

  • Table 1 presents all of the item delta validity results in the study. The predictions listed in the Predictions column of the table are defined as follows:

    - "Linear" predictions were made with a linear composite based on the variables selected in the Chalifour and Powers study--seven variables for predicting item deltas and six for predicting r-biserials. The composite used regression weights that were determined in the estimation sample.

- "Net" predictions were made using the variables selected by Chalifour and Powers in a neural net constructed in the estimation sample. Chalifour and Powers selected these variables using linear regression.

    - GA predictions were made using variables selected by using the genetic algorithm. The GA was applied in the estimation sample.

    - "GA without expert" predictions were made using variables selected using the GA but excluding the expert estimate of proportion pass. The GA was applied in the estimation sample.

    Table 1

    Validities of four types of predictions of item deltas

    Predictions            Estimation sample    Validation sample
    Linear                        .31                 .28
    Net                           .51                 .17
    GA                            .68                 .26
    GA without expert             .51                 .15

Examination of Table 1 reveals several salient points. First, the "Linear" predictions were the poorest of those noted in the estimation sample. With this method the fewest constants are determined in the estimation sample. All the other predictions used neural nets, which require estimating many more constants, dozens more, in fact, than linear prediction requires. Determining so many constants provides a great opportunity, or danger, of capitalizing on chance.

Second, the validity of the "Net" predictions and the "GA without expert" predictions was about the same for both samples. This means the genetic search procedure provided no improvement over the set of predictors selected for use in linear regression if expert predictions of percent correct were excluded (as they were in the Chalifour and Powers study).


  • Third, the validity of the GA predictions was the highest of those noted in the estimation sample. This result reflects the fact that the expert judgments provided the most valid predictions, and that there was a good deal of capitalization on chance factors in the estimation sample.

Fourth, the validity for "Net" predictions is smaller than that for "Linear" predictions in the validation sample. The "gain" in validity due to using the genetic algorithm seems to have consisted of capitalization on chance factors in the estimation sample because it was lost when cross-validation occurred.

    Finally, the "Linear" predictions and the GA predictions were equally valid in the validation sample. This is not a success for the genetic algorithm, because the "Linear" predictions achieved their validity without the expert judgements of proportion correct. Indeed, if the selection of variables were made in the estimation sample using traditional linear predictor selection methods and allowing the use of expert judgements, the resulting cross sample validity for predicting item difficulties was .54, well above the any results achieved by the neural nets. The corresponding results for predicting item biserial correlations was .33, well above any of the validation sample results in Table 2, which follows.

    Table 2

    Validities of four types of predictions of item biserial correlations

    Predictions            Estimation sample    Validation sample
    Linear                        .16                 .14
    Net                           .26                 .13
    GA                            .43                 .13
    GA without expert             .26                 .12

    Table 2 reveals that all of the types of predictions resulted in about the same validity in the validation sample. This result echoes Table 1 in that, in both cases, the estimation sample results reflect the effects of capitalization on chance factors, and on the use of expert predictions. Table 2 also reveals that all of the types of predictions resulted in about the same validity in the estimation sample.

Clearly, the use of the neural nets and GA provided no advantage in validity in the validation sample. However, the validity correlation averages errors over the entire range of predictions. A possible advantage for the neural net may be that it fits predictions better in a validation sample, because the ability to introduce a curve in the predictions might avoid large errors that occur at the extremes of linear prediction. To this end, the root mean square errors (RMS) and the maximum errors of prediction (MD) were computed for the "Linear" and "Net" predictions using the validation sample. The results are presented in Table 3.

    Table 3

    RMS and MD errors of predictions in the validation sample

                              Deltas                Biserials
    Predictions            RMS       MD          RMS       MD
    Linear                 2.26      7.17        .13       .36
    Net                    2.60      7.18        .14       .39

Table 3 shows that the "Linear" predictions yielded very slightly smaller errors on average than did the "Net" predictions.
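The two error summaries in Table 3 are straightforward to compute in a validation sample; a minimal sketch (array names are illustrative):

```python
import numpy as np

def error_summaries(actual, predicted):
    """Root mean square error (RMS) and maximum absolute error (MD) of predictions."""
    errors = np.asarray(actual) - np.asarray(predicted)
    rms = float(np.sqrt(np.mean(errors ** 2)))
    md = float(np.max(np.abs(errors)))
    return rms, md

# rms, md = error_summaries(validation_deltas, predicted_deltas)
```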

Discussion

The present study provided reasonable opportunity for nonlinearities to be displayed. Indeed, the use of neural nets resulted in increased accuracy of prediction of item deltas in the estimation sample. However, this increase unexpectedly did not hold up in a validation sample. Shrinkage in validity was expected on cross-validation. What was unexpected was that upon cross validation there would be no advantage to the nonlinear method, even when the genetic search was conducted.

Based on these results, one could not support the contention that nonlinear methods, and the neural net in particular, must replace the simpler linear regression when screening for possible predictors of analytical reasoning item difficulty. Many more items are needed to develop and validate a neural net than are needed for a linear system. Rather than pursue a nonlinear alternative, a better strategy for improving item difficulty prediction would be to develop new theories of item difficulty and new codes based on those theories. Then, when more item coding is done, it can be used to study the new theories rather than to build the replication needed to evaluate a neural net.

However, the attraction of the nonlinear hypothesis is still strong. It seems intuitively obvious that, if one could isolate the elements of language and deductive reasoning, effective characterization of passages and problems would entail using combinations of these elements as well as enumerating the elements separately. One would expect synergistic effects of combinations of elements in producing item difficulties or r-biserials. This synergy would be reflected in nonlinear relations between the elements present in an item and item statistics. It is therefore useful to explore why the present study failed.

    Perhaps a more complicated net would have revealed stronger results. However, a functional relation that is expressible using a complex net can be well approximated using a net with a single layer of transformation nodes if a sufficient number of nodes is used. Perhaps an insufficient number of nodes was used in the present study. However, increasing the number of nodes greatly increases the number of constants to be calculated, and iterations with functions with large numbers of constants converge extremely slowly. Hence, if enough nodes are used, the value of the objective function would converge very slowly, and a more complex net, or more nodes, probably would not have provided a better result.

Another possibility is that too few iterations were used. However, the decrease in the value of the objective function between cycles was, toward the completion of the thousand cycles, on the order of 10^-5. Indeed, the decrease was on the order of 10^-4 by the end of 50 cycles. It therefore seems extremely unlikely that more iterations would have provided very different results.

    These results do not, however, indicate that no synergistic effects on item statistics exist between the elements of language and deductive reasoning. Perhaps the existing coded characteristics include the possibility of such synergism, but the right combination was not found by the GA. Certainly, only 114 different chromosomes were evaluated during each run of the algorithm, which is a very small fraction of the huge number of possible combinations in the present study. Indeed, one purpose of using the least squares evaluations of chromosomes' predictive efficacy as part of the randomization mechanism used to choose the chromosomes to enter into the next generation was to reduce the number of evaluations necessary and still identify the best predictor sets. But one cannot deny that some more efficient selection procedure may have done better. However, such a procedure has yet to be developed.

    Another possibility is that other statistical approaches might be effective in achieving the goals of the study. The CART algorithm (Breiman, Friedman, Olshen, and Stone, 1984), for example, would proceed through a series of sets of item characteristics, choosing at each stage the characteristic that results in sample splits that discriminate maximally between groups with high and low values of the criterion statistic. The CART algorithm, however, also provides substantial opportunity to capitalize on chance factors, so a cross-validation research design would be essential. K. M. Sheehan (Personal communication, March 1997) reports successful cross-validation of the CART procedure in another context.


  • Also, statistical tests applied in an estimation sample might indicate the existence of effects that do not appear in a fresh sample. In the present study, statistical tests of the gains in the validity of prediction afforded by changing from linear regression to the use of neural nets indicated that the gains were quite significant. But those gains disappeared in a fresh sample--a situation that would not be expected if the statistical tests were accurate. Such tests require assumptions that were not met. Clearly, working with fresh samples seems to provide a much more severe test of the hypotheses.

    Another possible reason for not finding synergistic effects is that the wrong elements were used to make the predictions. Prediction improved when expert judgement of percent correct was included in the predictor sets. In fact, my examination of prediction results from a large number of chromosomes revealed a quantum increase in validity whenever expert judgment was one of the variables in the set. Perhaps a careful examination of how expert judgements are made could produce better predictors. Or perhaps hypotheses could be formulated that state how synergistic effects may arise, and how they may be used to write items with predicted difficulty or discrimination. Those predictions could then be evaluated. Such an experimental approach has proven effective in many fields and might be useful here.


  • References

Boldt, R. F., & Freedle, O. R. (1996). Using a neural net to predict item difficulty (TOEFL Technical Report TR-11). Princeton, NJ: Educational Testing Service.

    Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks.

Buckles, W. P., & Petry, F. E. (1992). Genetic algorithms. Los Alamitos, CA: IEEE Computer Society Press.

Caudill, M. (1990). Neural networks primer (Reprinted from issues of AI Expert: The Magazine of Artificial Intelligence in Practice). San Francisco: Miller Freeman Publications.

    Chalifour, C.L., & Powers, D.E. (1989). The relationship of content characteristics of GRE analytical reasoning items to their difficulties and discriminations. J. of Educ. Meas., Z(2), 120-132.

    Chalifour, C.L., & Powers, D.E. (1988). Content characteristics of GRE analytical reasoninq items. (ETS Research Report No. RR-88-7). Princeton, NJ: Educational Testing Service.

    Freedle, R., & Kostin, I. (1992). The prediction of GRE readinq comprehension item difficulty for expository prose passaqes for each of three item types: Main idea, inferences, and explicit statement items. (Research Report RR-91-57). Princeton NJ: Educational Testing Service.

Kim, S. H. (1993). Maximum likelihood estimation for influence diagrams. Unpublished manuscript.

Mislevy, R. J., Sheehan, K. M., & Wingersky, M. S. (1993). How to equate tests with little or no data. J. of Educ. Meas., 30, 55-78.

    Oliver, R.M., & Smith, J.Q. (1990). Influence diagrams, belief nets and decision analysis. New York: John Wiley and Sons.

    Powers, D.E., & Swinton, S.S. (1981). Extending the measurement of graduate admission abilities beyond the verbal and quantitative domains. Applied Psychological Measurement, 5, 141-158.

    Powers, D.E., Swinton, S.S., & Carlson, A. (1977). A factor analytic study of the GRE Aptitude Test. (GRE Report No. 75- 11P). Princeton NJ: Educational Testing Service


Rock, D., Wertz, C., & Grandy, J. (1981). Construct validity of the GRE Aptitude Test across populations: An empirical confirmatory study (GRE Board Report No. 78-1P). Princeton, NJ: Educational Testing Service.

Saunders, D. R. (1967). Moderator variables in prediction. In D. N. Jackson & S. J. Messick (Eds.), Problems in human assessment (pp. 362-367). New York: McGraw-Hill.

Schaeffer, G. A., & Kingston, N. M. (1988). Strength of the analytical factor of the GRE General Test in several subgroups: A full-information factor analysis approach (GRE Board Professional Report No. 86-7P). Princeton, NJ: Educational Testing Service.

Stricker, L. J., & Rock, D. A. (1983). Factor structure of the GRE General Test for older examinees: Implications for construct validity (GRE Board Report No. 83-10). Princeton, NJ: Educational Testing Service.

    Tatsuoka, K.K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. J. of Educ. Meas., 20, 345-354.

Tatsuoka, K. K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453-488). Hillsdale, NJ: Erlbaum.

Widder, D. V. (1989). Advanced calculus (2nd ed.). New York: Dover Publications, Inc.


  • Appendix A

    Estimating Constants in the Neural Nets

Artificial neural nets originated as analogies to functioning structures within biological nets. The biological nets consist of connected neurons with impulses moving in a directed stream. Impulses from several neurons act quasi-summatively on a single cell that responds by firing an impulse in its characteristic fashion. The output of the firing neuron, or node as it will be called here, might be thought of as a transformation of its quasi-summed input.

    An analogous computational structure used in artificial neural nets consists of inputs (the impulses from other neurons), weighted composites (the quasi-summed inputs), and transformations of the weighted composites (the output of the node). A formula for the output of a node follows:

$$\text{output} = t\Bigl(a_j + \sum_k b_{jk} X_k\Bigr) \qquad (1)$$

where j indexes a particular node, k indexes inputs, $a_j$ and $b_{jk}$ accomplish the weighted summation of inputs to the node, $t(\cdot)$ is the transformation yielding the node output, and $X_k$ is the code for the kth attribute input.

    There can be layers of nodes where weighted outputs from one layer provide inputs to the next layer. A final stage follows where the output of the last layer of nodes is quasi-summed to produce a response from some structure other than a neuron, for example a bunch of muscle fibers. Clearly there is room for much complexity in this type of structure.

A simple net was used in the present study. It consists of one layer of nodes of the type expressed in formula (1), the weighted outputs of which are summed into a single estimate. In this net the equation connecting inputs (item codes) to outputs (estimated item statistics) is given as follows:

$$\hat{y}_i = A + \sum_j B_j\, t\Bigl(a_j + \sum_k b_{jk} X_{ik}\Bigr) \qquad (2)$$

where i indexes items and $\hat{y}_i$ is the estimated item statistic. It will be helpful to separate equation (2) into parts as follows:

$$I_{ij} = a_j + \sum_k b_{jk} X_{ik} \qquad (3)$$

is the input to the jth node, and

$$W_{ij} = t(I_{ij}) \qquad (4)$$

is the output from the jth node, so that

$$\hat{y}_i = A + \sum_j B_j W_{ij}. \qquad (5)$$

To complete the specification of $\hat{y}_i$ it is necessary to choose a transformation, $t(\cdot)$. A transformation frequently used for this purpose, and also used here, is the following:

$$W = t(I) = \frac{e^{I}}{1 + e^{I}} \qquad (6)$$

where e is the base of the natural logarithm.

    Examination of equation (2) reveals that the (y)s depend on A, and the (B)s, (a)s, and (b)s, which are constants to be determined. We seek values for these constants such that estimated item statistics closely approximate observed statistics. We choose to use

$$Q = \sum_i \bigl(Y_i - \hat{y}_i\bigr)^2 \qquad (7)$$

    as the measure of the goodness of fit of predictions to observations, where (Y)s are observations and the summation is over the estimation sample.

The definition of $\hat{y}_i$ requires that A and the (B)s, (a)s, and (b)s be estimated iteratively, because no formula can be solved to obtain their values directly. A mixture of iteration procedures was used.

Note from equations (5) and (7) that A and the (B)s can be estimated as coefficients of linear regression of Y on W when the (a)s and (b)s are fixed, which fixes the (W)s. To take advantage of this simplicity, the iterative sequence chosen was to determine the (a)s and (b)s, then determine A and the (B)s, then determine the (a)s and (b)s again, and so on, each time choosing new values so as to minimize Q in equation (7).

Two comments will facilitate further development of the iterations used, but first a notational convention is introduced. If F is a function of several variables including u and v, then D(F|u) will refer to the first derivative of F with respect to u, and D(F|u,v) will refer to D[D(F|u)|v], the derivative with respect to v of the derivative of F with respect to u (in other words, the cross derivative of F with respect to u and v). D(F|u,u), then, is the second derivative of F with respect to u. With the functions involved here, the order of differentiation will not affect the cross derivatives.

    The first comment referred to above concerns the first derivative of f[g(x)] with respect to x if f and g are continuous and single-valued. Widder (Ch. 1, 1989) shows that

$$D[f(g)\mid x] = D[f(g)\mid g]\, D(g\mid x).$$

That is, because f is a function of g, which is a function of x, the derivative of f with respect to x equals the derivative of f with respect to g times the derivative of g with respect to x. For example, seconds per hour equals minutes per hour times seconds per minute.

    The second comment is that because from equation (3) it can be seen that the derivative of I with respect to a or b is a constant, and because of the definition of t() given in equation (6) it can be shown that

$$D(W\mid u) = (W - W^2)\,k_1 \qquad (8)$$

where $k_1$ comes from the derivative of I with respect to u. Then, using equation (8) and the first comment, it can be shown that

$$D(W\mid u,v) = D\bigl[(W - W^2)k_1 \mid v\bigr] = (W - W^2)(1 - 2W)\,k_1 k_2 \qquad (9)$$

where $k_2$ comes from the derivative of I with respect to v. In the equations below the (k)s will be either an X or one. Equations (8) and (9) give the first and second derivatives of W in terms of W itself and known constants.
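Equation (8) is easy to check numerically; the following verification sketch compares the analytic derivative with a central finite difference for the logistic transformation of equation (6), with $k_1 = x$ as in the case of a b-parameter (the particular numbers are arbitrary):

```python
import math

def t(I):
    return math.exp(I) / (1.0 + math.exp(I))   # logistic transformation, eq. (6)

# For W = t(a + b*x), the analytic derivative with respect to b is (W - W**2) * x.
a, b, x, h = 0.3, -0.7, 1.5, 1e-6
W = t(a + b * x)
analytic = (W - W ** 2) * x
numeric = (t(a + (b + h) * x) - t(a + (b - h) * x)) / (2 * h)
print(abs(analytic - numeric) < 1e-8)   # True: the two estimates agree closely
```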

Based on equation (8), the derivative of $W_{ij}$ with respect to $a_j$ is given by

$$D(W_{ij}\mid a_j) = W_{ij} - W_{ij}^2 \qquad (10)$$

because equation (3) implies that the derivative of $I_{ij}$ with respect to $a_j$ is one. Differentiating $W_{ij}$ with respect to $b_{jk}$ yields

$$D(W_{ij}\mid b_{jk}) = (W_{ij} - W_{ij}^2)\,X_{ik} \qquad (11)$$

because equation (3) implies that the derivative of $I_{ij}$ with respect to $b_{jk}$ is $X_{ik}$.

Second and cross derivatives for the (b)s and $a_j$ are as follows:

$$D(W_{ij}\mid a_j,a_j) = (1 - 2W_{ij})(W_{ij} - W_{ij}^2) \qquad (12)$$

$$D(W_{ij}\mid a_j,b_{jk}) = (1 - 2W_{ij})(W_{ij} - W_{ij}^2)\,X_{ik} \qquad (13)$$

$$D(W_{ij}\mid b_{jk},b_{jk}) = (1 - 2W_{ij})(W_{ij} - W_{ij}^2)\,X_{ik}^2 \qquad (14)$$

and

$$D(W_{ij}\mid b_{jk},b_{jk'}) = (1 - 2W_{ij})(W_{ij} - W_{ij}^2)\,X_{ik}X_{ik'}. \qquad (15)$$

    Note in equation (3) that only the b's and the a associated with node j affect the input to, hence the output of, that node. That is why the j's in equations (12) through (15) are all the same.


  • Any other cross derivative, one that involves parameters of the input to different nodes, is zero.

Equation (5) shows that any derivative or cross derivative of $\hat{y}_i$ is equal to $B_j$ times the corresponding derivative or cross derivative of $W_{ij}$; that is, $B_j$ times the derivatives or cross derivatives given in equations (10) through (15). This provides all of the derivatives and cross derivatives of $\hat{y}$ needed for the Newton iterations, as shown below.

Newton iterations require computation of the first and second derivatives and cross derivatives of Q, given in equation (7). If $T_p$ is an a or a b, then

$$D(Q\mid T_p) = -2 \sum_i \bigl[(Y_i - \hat{y}_i)\, D(\hat{y}_i\mid T_p)\bigr] \qquad (16)$$

Further, if $T_{p'}$ is a possibly different parameter, then

$$D(Q\mid T_p,T_{p'}) = -2 \sum_i \bigl[(Y_i - \hat{y}_i)\, D(\hat{y}_i\mid T_p,T_{p'}) - D(\hat{y}_i\mid T_p)\, D(\hat{y}_i\mid T_{p'})\bigr] \qquad (17)$$

Note that if p stands for jk and p' stands for j'k', all cross derivatives of Q are zero if j is not equal to j'.

The procedure used to find a new set of (a)s and (b)s, given old A, (B)s, (a)s, and (b)s, was as follows. Let

- V_old be the old (a)s and (b)s arranged in a column,

- V_new be the new (a)s and (b)s arranged in a column in the same order as in V_old,

- DV be the gradient of Q, with derivatives arranged as in V_old, and

- Delta be a vector of changes such that

$$V_{\text{new}} = V_{\text{old}} + \text{Delta} \qquad (18)$$

with Delta to be determined. Let DQ be a square matrix of second derivatives and cross derivatives of Q with respect to the (a)s and (b)s, with the order of rows and columns corresponding to that in V_old. Then, for Newton iterations,

$$\text{Delta} = -(DQ)^{-1}\, DV \qquad (19)$$

The matrix DQ is of the order of the number of parameters, which is (number of input variables plus one) times (number of nodes). In the present study DQ is 56 by 56 (8 times 7, because the number of parameters associated with the input to each node is eight, that is, seven weights plus an additive constant, and the number of nodes is seven). A great savings in computation can be made because the structure of DQ is such that it is not necessary to invert the whole 56-by-56 matrix: a series of smaller matrices serves the same purpose. This is so because DQ is the matrix of second derivatives and cross derivatives of the objective function with respect to parameters involved in computing the input to the nodes. As mentioned above, the cross derivatives are nonzero only when both parameters with respect to which the derivatives are taken enter the input to the same node. Thus, if the parameters represented by rows and columns of DQ are arranged by blocks related to input to the same node, then the subsections of DQ that contain nonzero entries are square and are symmetric around the diagonal of DQ. These subsections are of the order of the number of input variables plus one and can be inverted separately. Hence the problem of inverting the very large matrix DQ can be reduced to inverting a series of smaller matrices. The number of such smaller matrices is equal to the number of nodes, and the order of the matrices is (number of input variables plus one) by (number of input variables plus one). Thus, if the coefficients of regression of item statistics on node outputs are computed as a separate step, the computations for the parameters related to each node can be accomplished separately.

    The iterations, then, proceed from one stage to the next as follows:

- Given old (a)s and (b)s, use equations (3) and (4) to compute the (W)s.

    - Compute new B's and an A as coefficients of regression of Y on W in equation (5).

- For each node, use equations (5), (10), and (11) to calculate the quantities in brackets in equation (16), arranged in order corresponding to that of V_old, and accumulate across cases. Multiply each entry by the B for the node to which the differentiation applies. This yields DV.

- For each node, use equations (5) and (12) through (15) to calculate the quantities in brackets in equation (17), arranged in order corresponding to that of V_old, and accumulate across cases. Multiply each entry in the table by the B's to which the row and column correspond. This yields the sections of DQ to be inverted.

- Compute Delta using equation (19), and then compute V_new using equation (18).

- Let V_new become V_old and cycle back to find new (W)s, etc.

The sequence given above requires starting (a)s and (b)s. It is usual to start with a set of randomly chosen values and then carry out enough cycles so that the effect of the choice is virtually eliminated. That is what was done here. Trial values were chosen at random and then adjusted so that the node inputs, the (I)s, had zero means and unit variances in the estimation sample.

This adjustment was used so that the inputs were, on the average, in the range of the maximum slope of t(). These randomly chosen values became the "old" (a)s and (b)s for the first phase of the first iteration.
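A compact sketch of the alternating scheme described above follows. For brevity it substitutes simple gradient steps for the full block-wise Newton updates on the (a)s and (b)s, while keeping the other elements: random starting values scaled toward standardized node inputs, an exact least-squares solution for A and the (B)s given the node outputs, and the least-squares objective Q. It is an illustration of the idea under those stated simplifications, not the program used in the study.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_net(X, Y, n_nodes=7, n_cycles=1000, step=0.01, seed=0):
    """Alternating fit: least squares for (A, B) given W, gradient steps for (a, b)."""
    rng = np.random.default_rng(seed)
    n_items, n_inputs = X.shape
    # Random starting values, scaled so node inputs are roughly standardized
    # (assuming the columns of X are themselves standardized).
    b = rng.normal(size=(n_nodes, n_inputs)) / np.sqrt(n_inputs)
    a = rng.normal(size=n_nodes) * 0.1
    for _ in range(n_cycles):
        I = X @ b.T + a                            # node inputs, eq. (3)
        W = logistic(I)                            # node outputs, eq. (4)
        # Solve for A and the B's by linear regression of Y on W, eq. (5).
        design = np.column_stack([np.ones(n_items), W])
        coefs, *_ = np.linalg.lstsq(design, Y, rcond=None)
        A, B = coefs[0], coefs[1:]
        resid = Y - (A + W @ B)                    # Y - y_hat
        # Gradient of Q = sum(resid**2) with respect to a and b (cf. eqs. 8, 10, 11, 16).
        dQdI = (W - W ** 2) * (resid[:, None] * B) # per-item, per-node factor (up to -2)
        grad_a = -2.0 * dQdI.sum(axis=0)
        grad_b = -2.0 * dQdI.T @ X
        a -= step * grad_a                         # simple gradient step (Newton in the study)
        b -= step * grad_b
    return A, B, a, b
```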


  • Appendix B

    Description of the Genetic Algorithm

    Genetic algorithms are often described using a colorful language taken from population genetics. The algorithms proceed through generations, each generation consisting of populations of chromosomes. The chromosomes in these populations are subjected to evaluation processes that moderate the selection of the population for the next generation. The selection mechanisms that were used in the present study are analogous to genetic crossing and mutation. As in nature, the mechanisms in this study involved the operation of random factors on the combinations of genes in chromosomes.

The natural processes thus described provide a framework for thinking about computational processes, particularly optimization processes. The process of selecting predictor sets for the present study fits this framework. A predictor set is analogous to a chromosome and predictors are analogous to genes. Thus, generations consist of collections of sets of predictors. Each set of predictors is evaluated, the evaluation being the efficacy of the set for predicting some item statistic. The specific objective function being evaluated is the sum of squared differences between predicted and actual item statistics, evaluated using the process outlined in Appendix A. Since the estimation in this study was accomplished using neural nets, the genetic algorithm used here put predictor selection squarely in the context of evaluating collections of predictor sets using neural nets.

    The present study used several arbitrary constraints on the generations, the chromosomes, and the mechanisms for generating new chromosomes. Fifteen generations were used with nine chromosomes per generation. Each chromosome in a particular analysis contained the same number of predictors, six for predicting biserial correlations and seven for predicting item deltas.

    While the algorithms were operated, only two chromosomes were retained from generation to generation. This was done to provide substantial opportunity to explore a large number of sets of variables, to retain the best combinations, and to retain the chance for the variables in those best predictor sets to enter into other combinations.

In this study, the probability of a chromosome being chosen for crossing or mutation was proportional to the value of its objective function. When constructing a chromosome for a new generation, a mutation or crossing mechanism was chosen at random with equal probabilities. If mutation was selected, then one old chromosome was selected for use as a basis for the new one; if genetic crossing was selected, then two old chromosomes were selected.

    The crossing mechanism proceeded in the following steps:

    The number of variables in a chromosome was kept constant at all times. If that number was seven for example, then the lists of variables from the old chromosomes were combined into a single list with fourteen positions.

The crossing algorithm then proceeded as follows: (a) one position in the combined list was chosen at random, with all positions having an equal probability, but if the position contained a zero, then another position was drawn until a nonzero position was encountered; (b) the variable number in the chosen position was copied into an unfilled position in the new chromosome; (c) if all the positions in the new chromosome were filled, then the process terminated; and (d) if some positions were not filled in the new chromosome, then that same variable number was erased wherever it appeared in the combined list and the process returned to step (a). In this sequence, no variable that is not in one of the two chromosomes chosen for crossing can enter the new chromosome; all variables that appear in only one parent chromosome have equal probability of appearing in the new chromosome; all variables that appear in both parent chromosomes also have equal probability of appearing in the new chromosome; and the probability of a variable that appears in both of the old chromosomes being copied into the new chromosome is double that of a variable appearing in only one of the old chromosomes.
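A sketch of this crossing mechanism (the function name is ours; variables are represented by their index numbers, numbered from 1 so that 0 can serve as the erasure marker described above):

```python
import random

def cross(parent1, parent2):
    """Build a child chromosome by sampling without replacement from two parents' variables.

    A variable appearing in both parents occupies two positions in the combined
    list, so it is twice as likely to be copied into the child.
    """
    combined = list(parent1) + list(parent2)     # e.g., 14 positions for 7-variable parents
    child = []
    while len(child) < len(parent1):
        pick = random.choice(combined)
        if pick == 0:                            # erased positions are skipped
            continue
        child.append(pick)
        combined = [0 if v == pick else v for v in combined]  # erase every occurrence
    return child
```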

The mutation mechanism in the present study proceeded by replacing a fixed number of variables in the old chromosome with a like number drawn at random from the total list of variables available. The variables to be replaced were chosen at random with equal probability, as were the variable numbers selected to be the replacements. It was through this mechanism, and only through this mechanism, that variables outside of those appearing in at least one chromosome in a generation could enter the population. The number of variables to be changed when mutation took place was four for both the six- and seven-variable sets. This provided an opportunity for approximately fourteen variables per generation to be drawn from the complete list of variables. Thus a total of approximately 196 draws from the complete list of variables occurred over the fourteen subsequent generations in each operation of the genetic algorithm.
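A matching sketch of the mutation mechanism, with four positions replaced as in the study. The names are illustrative, and the exclusion of duplicates within a chromosome is our assumption rather than a detail stated in the report; the sketch also assumes the variable pool is larger than the chromosome, as it is here (57 variables versus sets of six or seven).

```python
import random

def mutate(parent, all_variables, n_change=4):
    """Replace n_change randomly chosen variables with draws from the full variable list."""
    child = list(parent)
    positions = random.sample(range(len(child)), n_change)   # which genes to replace
    for pos in positions:
        # Draw a replacement at random from variables not already in the chromosome.
        candidates = [v for v in all_variables if v not in child]
        child[pos] = random.choice(candidates)
    return child
```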
