



A History of Social Science Measurement

Benjamin D. Wright
University of Chicago, MESA Psychometric Laboratory

What are the historic origins of measurement? What are the basic requirements for fundamental measurement? What are some common pitfalls to avoid? Do item response models meet these needs?

After language, our greatest invention is numbers. Numbers make measures and maps and so enable us to figure out where we are, what we have, and how much it's worth. Science is impossible without an evolving network of stable measures. The history of measurement, however, does not begin in mathematics, or even in science, but in trade and construction. Long before science emerged as a profession, the commercial, architectural, political, and even moral necessities for abstract, exchangeable units of unchanging value were well recognized.

Let us begin by recalling two dramatic turning points in political history that remind us of the antiquity and moral force of our need for stable measures. Next, we review the psychometric and mathematical histories of measurement, show how the obstacles to inference shape our measurement practice, and summarize Georg Rasch's contributions to fundamental measurement. Finally, we review some mistakes that the history of measurement has taught us to stop making.

A weight of seven was a tenet of faith among seventh century Muslims. Muslim leaders were censured for using less righteous standards (Sears, 1997). Caliph 'Umar b. 'Abd al-Aziz ruled that:

The people of al-Kufa have been struck with . . . wicked practices set upon them by evil tax collectors. The more righteous law is justice and good conduct. . . . I order you to take in taxes only the weight of seven. (Damascus, 723 AD)

The Magna Carta of King John of England requires that:

There shall be one measure of wine throughout Our kingdom, and one of ale, and one measure of corn, to wit, the London quarter, and one breadth of cloth, . . . , to wit, two ells within the selvages. As with measures so shall it be with weights. (Runnymede, 1215 AD)

These events remind us that commerce and politics are the source of stable units for length, area, volume, and weight. It was the development of the steam engine which led to our modern measures of temperature and pressure. The success of all science stands on these commercial and engineering achievements. Although the mathematics of measurement did not initiate these practices, we will find that it is the mathematics of measurement that provides the ultimate foundation for practice and the final logic by which useful measurement evolves and thrives.

Benjamin D. Wright is a Professor in the Department of Education, University of Chicago, 5835 S. Kimbark Ave., Chicago, IL 60637-1609. His specialization is psychoanalytic psychology.

Winter 1997 33

History

Mathematics

The concrete measures that help us make life better are so familiar that we seldom think about how or why they work. A mathematical history of measurement, however, takes us behind practice to the theoretical requirements that make the practical success of measurement possible.

1. Measures are always inferences,
2. Obtained by stochastic approximations,
3. Of one dimensional quantities,
4. Counted in abstract units, the fixed sizes of which are
5. Unaffected by extraneous factors.

As we work through the anatomy of inference and bring out the mathematical discoveries that make inference possible, we will see that, to meet the above requirements, measurement must be an inference of values for infinitely divisible parameters which are set to define the transition odds between observable increments of a promising theoretical variable (Feller, 1950, pp. 271-272). We will also see what a divisible parameter structure looks like and how it is equivalent to conjoint additivity.

A critical turning point in the mathematical history of measurement is the application of Jacob Bernoulli's 1713 binomial distribution as an inverse probability for interpreting the implications of observed events (Thomas Bayes, 1764; Pierre Laplace, 1774, cited in Stigler, 1986, pp. 63-67, 99-105). The data in hand are not what we seek. Our interests go beyond to what these data imply about other data yet unmet but urgent to foresee. When we read our weight as 180 pounds, we take that number not as a one-time, local description of a particular step onto this particular scale but as our approximate weight right now, just before now, and, inferentially, for a useful time to come.

Inference

The first problem of inference is how to infer values for these other data, which, by the meaning of inference, are currently missing. This includes the data that are always missing in any actual attempt at data collection. Because the purpose of inference is to estimate what future data might be like before we encounter them, methods which require complete data in order to proceed cannot, by that very requirement, be methods of inference. This realization engenders a simple law: Any statistical method nominated to serve inference which requires complete data, by this requirement, disqualifies itself as an inferential method.

But, if what we want to know is missing, how can we use the data in hand to make useful inferences about the missing data they might imply? Inverse probability reconceives our raw observations as a probable consequence of a relevant stochastic process with a stable formulation. The apparent determinism of formulae like F = MA depends on the prior construction of relatively precise measures of M and A. The first step from raw observation to inference is to identify the stochastic process by which an inverse probability can be defined. Bernoulli's binomial distribution is the simplest and most widely used stochastic process. Its elaboration into the compound Poisson distribution is the parent of all useful measuring distributions.

The second step to inference is to discover what mathematical models can govern the stochastic process in a way that enables a stable, ambiguity-resilient estimation of the model's parameters from the limited data in hand. This step requires an awareness of the obstacles that stand in the way of stable inference.

At first glance, the second step to inference looks complicated. Its 20th century history has followed so many paths traveled by so many mathematicians that one might suppose there was no clear second step, only a jumble of unconnected possibilities with their seemingly separate mathematical solutions. Fortunately, reflection on the motivations for these paths and examination of their mathematics lead to a reassuring simplification. Although each path was motivated by a particular concern about what inference must overcome to succeed, all solutions end up with the same simple, easy to understand, and easy to use formulation.

The second step to inference is solved by formulating the mathematical function which governs the inferential stochastic process so that its parameters are either infinitely divisible or conjointly additive, that is, separable. That's all there is to it!

Psychometrics

Understanding what it takes to construct useful measures, however, has only recently been applied in psychometric practice. This failure of practice has not been due to lack of knowledge about the problems involved. Edward L. Thorndike, the patriarch of educational measurement, observed in 1904 that:

If one attempts to measure even so simple a thing as spelling, one is hampered by the fact that there exist no units in which to measure. One may arbitrarily make up a list of words and observe ability by the number spelled correctly. But if one examines such a list one is struck by the inequality of the units. All results based on the equality of any one word with any other are necessarily inaccurate. (Thorndike, 1904, p. 7)

Thorndike saw the unavoidable ambiguity in counting concrete events, however indicative they might seem. One might observe signs of spelling ability. But one would not have measured spelling, not yet (Engelhard, 1984, 1991, 1994). The problem of what to count, entity ambiguity, is ubiquitous in science, commerce, and cooking. What is an apple? How many little apples make a big one? How many apples make a pie? Why don't three apples always cost the same amount? With apples, we solve entity ambiguity by renouncing the concrete apple count and turning, instead, to abstract apple volume or abstract apple weight (Wright, 1996b, 1996c).

34 Educational Measurement: Issues and Practice

Raw Scores Are NOT Measures

Unfortunately, in educational measurement, we have only recently begun to take this reasonable step from concrete counting to abstract measuring. Thorndike was aware not only of the inequality of the units counted but also of the nonlinearity of any resulting raw scores. Raw scores are limited to begin at none right and to end at all right. But the linear measures we intend raw scores to imply have no such bounds. The monotonically increasing ogival exchange of one more right answer for a measure increment is steepest where items are dense, usually toward the middle of a test near 50% right. At the extremes of 0% and 100% right, however, the exchange becomes flat. This means that for a symmetrical set of item difficulties one more right answer implies the least measure increment near 50% but an infinite increment at each extreme.

The magnitude of this raw score bias against extreme measures depends on the distribution of item difficulties. The ratio of the measure increment corresponding to one more right answer at the next-to-largest extreme step to the measure increment corresponding to one more right answer at the smallest central step, for a test with L normally distributed item difficulties, is:

log{2(L - 1)/(L - 2)} / log{(L + 2)/(L - 2)}.

When items are heaped in the middle of a test, the usual case, then the bias for a 50-item test is ninefold. Even when item difficulties are spread out uniformly in equal increments, the raw score bias against measure increments at the extremes for a 50-item test is sixfold.
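The ratio above can be checked numerically; this sketch (not from the article) evaluates it for a 50-item test and recovers the ninefold figure:

```python
from math import log

def extreme_to_central_bias(L):
    # Ratio of the logit increment earned by one more right answer at the
    # next-to-extreme step to the increment at the smallest central step,
    # for a test of L items (formula from the text).
    return log(2 * (L - 1) / (L - 2)) / log((L + 2) / (L - 2))

print(round(extreme_to_central_bias(50), 1))  # -> 8.9, roughly ninefold
```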

This raw score bias is not limited to dichotomous responses. The bias is just as severe for partial credits, rating scales, and, of course, the infamous Likert scale, the misuse of which pushed Thurstone's seminal 1920s work on how to transform concrete raw scores into abstract linear measures out of use.

Figure 1 shows a typical raw score to measure ogive. Notice that the measure distance between scores of 88% and 98% is 5 times greater than the distance between scores of 45% and 55%.

The raw score bias in favor of central scores and against extreme scores means that raw scores are always target-biased and sample-dependent (Wright & Linacre, 1989; Wright & Masters, 1982; Wright & Stone, 1979). Any statistical method like linear regression, analysis of variance, generalizability, or factor analysis that uses raw scores or Likert scales as though they were linear measures will have its output hopelessly distorted by this bias. That is why so much social science has turned out to be no more than transient description of never-to-be-reencountered situations, easy to contradict with almost any replication. The obvious and easy to practice (Linacre & Wright, 1997; Wright & Linacre, 1997) law that follows is: Before applying linear statistical methods to concrete raw data, one must first use a measurement model to construct, from the observed raw data, abstract sample- and test-free linear measures.
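As a minimal sketch of what that law asks for (assuming, for simplicity, a test whose items are equally difficult, so a plain log odds transformation serves as the measurement model; real Rasch software also estimates item calibrations):

```python
from math import log

def score_to_logit(r, L):
    # Convert a raw score r on an L-item test into a linear logit measure.
    return log(r / (L - r))

raw = [10, 25, 40, 48]
measures = [score_to_logit(r, 50) for r in raw]
# Equal raw-score gaps do not yield equal measure gaps: the 8-point step
# from 40 to 48 right spans more logits than the 15-point step from 10 to 25.
```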

There are two additional advantages obtained by model-controlled linearization. Each measure and each calibration is now accompanied by a realistic estimate of its precision and a mean square residual-from-expectation evaluation of the extent to which its data pattern fits the stochastic measurement model, that is, its statistical validity. When we then move on to plotting results and applying linear statistics to analyze relationships among measures, we not only have linear measures to work with, but we also know their numerical precision and numerical validity.

Fundamental Measurement

The general name for the kind of measurement we are looking for is fundamental measurement. This term comes from physicist Norman Campbell's 1920 deduction that fundamental measurement (on which the success of physics was based) required, at least by analogy, the possibility of a physical concatenation, like joining the ends of sticks to concatenate length or piling bricks to concatenate weight (Campbell, 1920).

[Figure 1: raw-score-to-measure ogive; x-axis: ability measure in logits (B), from -3 to 6; y-axis: percent correct, 0 to 100.]

FIGURE 1. Extreme raw scores are biased against measures

Sufficiency

The estimator requirement to implement fundamental measurement is called sufficiency. In 1920, Ronald Fisher, while developing his likelihood version of inverse probability to construct maximum likelihood estimation, discovered a statistic so sufficient that it exhausted all information concerning its modeled parameter from the data in hand (Fisher, 1920). Statistics that exhaust all modeled information enable conditional formulations by which a value for each parameter can be estimated independently of all other parameters in the model. This follows because the presence of a parameter in the model can be replaced by its sufficient statistic. Fisher's sufficiency enables independent parameter estimation for models that incorporate many different parameters (Andersen, 1977). This leads to another law: When a psychometric model employs parameters for which there are no sufficient statistics, that model cannot construct useful measurement because it cannot estimate its parameters independently of one another.
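A small numerical sketch (illustrative values, not from the article) of what sufficiency buys: under the Rasch model, once a person's raw score is given, the probability of any particular response pattern no longer involves the person parameter, which is why persons can be conditioned out when calibrating items:

```python
from itertools import product
from math import exp

def pattern_prob(x, b, d):
    # Rasch probability of response pattern x for ability b, item difficulties d.
    prob = 1.0
    for xi, di in zip(x, d):
        e = exp(b - di)
        prob *= e / (1 + e) if xi else 1 / (1 + e)
    return prob

def conditional_on_score(x, b, d):
    # Probability of pattern x given its raw score, computed at ability b.
    r = sum(x)
    same_score = [y for y in product((0, 1), repeat=len(d)) if sum(y) == r]
    return pattern_prob(x, b, d) / sum(pattern_prob(y, b, d) for y in same_score)

d = [-1.0, 0.0, 1.0]   # arbitrary item calibrations
x = (1, 0, 0)          # one pattern with raw score 1
low = conditional_on_score(x, -0.5, d)
high = conditional_on_score(x, 2.0, d)
# low == high: the raw score has exhausted all information about ability.
```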

Divisibility

What is the mathematical foundation for Campbell's concatenation and Fisher's sufficiency? In 1924, Paul Levy (1937) proved that the construction of laws which are stable with respect to arbitrary decisions as to what to count requires infinitely divisible parameters. Levy's divisibility is logarithmically equivalent to the conjoint additivity (Luce & Tukey, 1964) that we now recognize as the mathematical generalization of Campbell's fundamental measurement. Levy's conclusions were reaffirmed in 1932 when A. N. Kolmogorov (1950, pp. 9, 57) proved that independence of parameter estimates also required divisibility, this time in the form of additive decomposition.

Thurstone

The problems for which divisibility and its consequences, concatenation and sufficiency, were the solution were not unknown to psychometricians. Between 1925 and 1932 electrical engineer and psychologist Louis Thurstone published 24 articles and a book on how to attempt solutions to these problems. Thurstone's requirements for useful measures are:

Unidimensionality:

The measurement of any object or entity describes only one attribute of the object measured. This is a universal characteristic of all measurement. (Thurstone, 1931, p. 257)

Linearity:

The very idea of measurement implies a linear continuum of some sort such as length, price, volume, weight, age. When the idea of measurement is applied to scholastic achievement, for example, it is necessary to force the qualitative variations into a scholastic linear scale of some kind. (Thurstone & Chave, 1929, p. 11)

Abstraction:

The linear continuum which is implied in all measurement is always an abstraction. . . . There is a popular fallacy that a unit of measurement is a thing, such as a piece of yardstick. This is not so. A unit of measurement is always a process of some kind. . . . (Thurstone, 1931, p. 257)

Invariance:

A process of some kind which can be repeated without modification in the different parts of the measurement continuum. (Thurstone, 1931, p. 257)

Sample-Free calibration:

The scale must transcend the group measured. A measuring instrument must not be seriously affected in its measuring function by the object of measurement. . . . Within the range of objects . . . intended, its function must be independent of the object of measurement. (Thurstone, 1928, p. 547)

Test-Free measurement:

It should be possible to omit several test questions at different levels of the scale without affecting the individual score (measure). . . . It should not be required to submit every subject to the whole range of the scale. The starting point and the terminal point . . . should not directly affect the individual score (measure). (Thurstone, 1926, p. 446)

Thus, by 1930, we had in print somewhere everything social science needed for the construction of stable, objective measures. The pieces were not joined. But, in the requirements of L. L. Thurstone, we knew exactly what was called for. And, in the inverse probabilities of Bernoulli, Bayes, and Laplace and the mathematics of Fisher, Levy, and Kolmogorov, we had what was missing from Thurstone's normal distribution method.

Guttman

Then, in 1950, sociologist Louis Guttman pointed out that the meaning of any raw score, including Likert scales, would remain ambiguous unless the score specified every response in the pattern on which it was based.

If a person endorses a more extreme statement, he should endorse all less extreme statements if the statements are to be considered a scale. . . . We shall call a set of items of common content a scale if a person with a higher rank than another person is just as high or higher on every item than the other person. (Italics added, Guttman, 1950, p. 62)

According to Guttman, only data that form this kind of perfect conjoint transitivity can produce unambiguous measures. Notice that Guttman's definition of scalability is a deterministic version of Fisher's stochastic definition of sufficiency. Each requires that an unambiguous statistic must exhaust the information to which it refers.
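Guttman's criterion can be phrased as a pairwise dominance check. A sketch (illustrative response patterns and a hypothetical helper name, not from the article):

```python
from itertools import combinations

def is_guttman_scale(patterns):
    # A set of 0/1 response patterns forms a Guttman scale when any person
    # with a higher total scores at least as high on every single item.
    for a, b in combinations(patterns, 2):
        lo, hi = sorted((a, b), key=sum)
        if sum(lo) < sum(hi) and any(l > h for l, h in zip(lo, hi)):
            return False
    return True

perfect = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
flawed = [(0, 0, 0), (0, 1, 0), (1, 0, 1)]
# perfect conforms; flawed does not, because the person scoring 2
# misses an item the person scoring 1 answered correctly.
```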

Rasch

Complete solutions to Thurstone's and Guttman's requirements, however, did not emerge until 1953 when Danish mathematician Georg Rasch (1960) deduced that the only way he could compare past performances on different tests of oral reading was to apply the exponential additivity of Poisson's 1837 distribution to data produced by a new sample of students responding simultaneously to both tests. Rasch used Poisson's distribution because it was the only one he could think of that enabled the equation of the two tests to be entirely independent of the obviously arbitrary distribution of the reading abilities of the new sample.

As Rasch worked out his mathematical solution to equating reading tests, he discovered that the mathematics of the probability process, the measurement model, must be restricted to formulations that produced sufficient statistics. Only when his parameters had sufficient statistics could he use these statistics to replace and hence remove the unwanted person parameters from his estimation equations. In this way, he obtained estimates of his test parameters that were independent of the incidental values or distributions of whatever other parameters were at work in the measurement model.

As Rasch describes the properties of his probability function, we see that he has constructed a stochastic solution to the impossibility of living up to Guttman's deterministic conjoint transitivity with raw data.

A person having a greater ability than another should have the greater probability of solving any item of the type in question, and similarly, one item being more difficult than another one means that for any person the probability of solving the second item correctly is the greater one. (Rasch, 1960, p. 117)

Rasch completes his measurement model on pages 117 to 122 of his 1960 book. His measuring function on page 118 specifies the multiplicative definition of fundamental measurement for dichotomous observations as:

f(P) = b/d

where P is the probability of a correct solution; f(P) is a function of P, still to be determined; b is a ratio measure of person ability; and d is a ratio calibration of item difficulty. This model applies the divisibility Levy requires for stability.
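In later notation, taking B = log b and D = log d turns this multiplicative form into the familiar additive one: the log odds of success equal B - D. A minimal numeric check (arbitrary values):

```python
from math import exp, log

def p_correct(B, D):
    # Rasch model: probability that a person of ability B (in logits)
    # answers an item of difficulty D correctly.
    return exp(B - D) / (1 + exp(B - D))

B, D = 1.2, 0.5
P = p_correct(B, D)
assert abs(log(P / (1 - P)) - (B - D)) < 1e-12  # log odds recover B - D
```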

Rasch explains his measurement model as an inverse

probability of a correct solution, which may be taken as the imagined outcome of an indefinitely long series of trials. . . . The formula says that in order that the concepts b and d could be at all considered meaningful, f(P), as derived in some way from P, should equal the ratio between b and d. (1960, p. 118)

And, after pointing out that a normal probit, even with its second parameter set to one, will be too complicated to serve as the measuring function f(P), Rasch asks: "Does there exist such a function, f(P), that f(P) = b/d is fulfilled?" (1960, p. 119).

Because "an additive system . . . is simpler than the original . . . multiplicative system," Rasch (1960, p. 119) takes logarithms:

log{f(P)} = log b - log d = B - D

which for technical advantage he expresses as the log odds logit:

L = log{P/(1 - P)}.

The question has now reached its final form: "Does there exist a function g(L) of the variable L which forms an additive system in parameters for persons B and parameters for items -D such that

g(L) = B - D?"

(Rasch, 1960, pp. 119-120). Rasch then shows that the function g(L), which can be L itself, as in:

L = log{P/(1 - P)} = B - D

"contains all the possible measuring functions which can be constructed . . . by suitable choice of dimensions and units, A and C, for:

f(P) = C{f0(P)}^A"

(Rasch, 1960, p. 121). Because of the validity of a separability theorem (due to sufficiency):

It is possible to arrange the observational situation in such a way that from the responses of a number of persons to the set of items in question we may derive two sets of quantities, the distributions of which depend only on the item parameters, and only on the personal parameters, respectively. Furthermore the conditional distribution of the whole set of data for given values of the two sets of quantities does not depend on any of the parameters.

With respect to separability the choice of this model has been lucky. Had we for instance assumed the "Normal-Ogive Model" with all s_i = 1 (which numerically may be hard to distinguish from the logistic), then the separability theorem would have broken down. And the same would, in fact, happen for any other conformity model which is not equivalent, in the sense of f(P) = C{f0(P)}^A to f(P) = b/d, . . . as regards separability. The possible distributions are . . . limited to rather simple types, but . . . lead to rather far-reaching generalizations of the Poisson . . . process. (Rasch, 1960, p. 122)

By 1960 Rasch had proven that formulations in the compound Poisson family, such as Bernoulli's binomial, were not only sufficient but, more telling, also necessary for the construction of stable measurement. Rasch had found that the multiplicative Poisson was the only mathematical solution to the second step of inference: the formulation of an objective, sample-free, and test-free measurement model.

In 1992, Bookstein began reporting his astonishment at the mathematical equivalence of every counting law he could find (Bookstein, 1992, 1996). In deciphering how this ubiquitous equivalence could occur, he discovered that the counting formulations were members of one family which was surprisingly robust with respect to ambiguities of entity (what to count), aggregation (what is countable), and scope (how long and how far to count). Bookstein discovered that the necessary and sufficient formulation for this remarkable robustness was Levy's divisibility and, as Rasch had seen 35 years earlier, that the one and only stochastic application of this requirement was the compound, that is, multiplicative, Poisson distribution.

More recently Andrich (1978a, 1978b, 1978c), whose contributions in the 1970s made rating scale analysis practical and efficient, has shown that Rasch's separability requirement leads to the conclusion that the necessary and sufficient distribution for constructing measures from discrete observations is Poisson (Andrich, 1995, 1996). The natural parameter for this Poisson is the ratio of the location of the object and the measurement unit of the instrument in question. This formulation preserves concatenation and divisibility and also the generality requirement that measurement in different units always implies the same location.

Conjoint Additivity

American work on mathematical foundations for measurement came to fruition with the proof by Duncan Luce and John Tukey (1964) that Campbell's concatenation was a physical realization of a general mathematical law which is the definition of fundamental measurement.

The essential character of . . . the fundamental measurement of extensive quantities is described by an axiomatization for the comparison of effects of arbitrary combinations of "quantities" of a single specified kind. . . . Measurement on a ratio scale follows from such axioms.

The essential character of simultaneous conjoint measurement is described by an axiomatization for the comparison of effects of pairs formed from two specified kinds of "quantities". . . . Measurement on interval scales which have a common unit follows from these axioms.

A close relation exists between conjoint measurement and the establishment of response measures in a two-way table . . . for which the "effects of columns" and the "effects of rows" are additive. Indeed the discovery of such measures . . . may be viewed as the discovery, via conjoint measurement, of fundamental measures of the row and column variables. (Luce & Tukey, 1964, p. 1)

In spite of the practical advantages of such response measures, objections have been raised to their quest. . . . The axioms of simultaneous conjoint measurement overcome these objections. . . . Additivity is just as axiomatizable as concatenation . . . in terms of axioms that lead to . . . interval and ratio scales.

In . . . the behavioral and biological sciences, where factors producing orderable effects and responses deserve more useful and more fundamental measurement, the moral seems clear: When no natural concatenation operation exists, one should try to discover a way to measure factors and responses such that the "effects" of different factors are additive. (Luce & Tukey, 1964, p. 4)

Although conjoint additivity has been known to be a decisive requirement for fundamental measurement since 1964, few social scientists realize that Rasch models are its fully practical realization (Wright, 1984). Rasch models construct conjoint additivity by applying inverse probability to empirical data and then test these data for their goodness-of-fit to this measurement construction (Fischer, 1968; Keats, 1967; Perline, Wright, & Wainer, 1979; Wright, 1968).

The Rasch model is a special case of additive conjoint measurement . . . a fit of the Rasch model implies that the cancellation axiom (i.e., conjoint transitivity) will be satisfied. . . . It then follows that items and persons are measured on an interval scale with a common unit. (Brogden, 1977, p. 633)
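Brogden's point can be illustrated numerically: a table of probabilities generated by a Rasch model satisfies the cancellation (conjoint transitivity) conditions. A small sketch with invented abilities and difficulties:

```python
import math

def rasch_p(b, d):
    """Rasch success probability for person ability b on item difficulty d."""
    return 1.0 / (1.0 + math.exp(-(b - d)))

abilities = [0.0, 1.0, 2.0]        # rows: three invented persons
difficulties = [1.5, 0.5, -1.0]    # columns: three invented items
P = [[rasch_p(b, d) for d in difficulties] for b in abilities]

# Single cancellation: the person order is the same in every column (item).
for j in range(3):
    assert P[0][j] < P[1][j] < P[2][j]

# Double cancellation (conjoint transitivity): when both antecedent
# inequalities hold, the additive structure forces the conclusion.
assert P[0][1] >= P[1][0] and P[1][2] >= P[2][1]   # antecedents hold here
assert P[0][2] >= P[2][0]                           # conclusion follows
```

Because the log-odds are the additive combination B - D, the antecedent inequalities sum to the concluding one, and the monotone probability function preserves the order.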

An Anatomy of Inference
We can summarize the history of inference in a table according to four obstacles that stand between raw data and the stable inference of measures they might imply.

Table 1. An Anatomy of Inference

Obstacle: Uncertainty (have → want; now → later; statistic → parameter).
Solution: Probability (binomial odds; regular irregularity; misfit detection).
Inventors: Bernoulli 1713, Bayes 1764, Laplace 1774, Poisson 1837.

Obstacle: Distortion (nonlinearity; unequal intervals; incommensurability).
Solution: Additivity (linearity; concatenation; conjoint additivity).
Inventors: Fechner 1860, Helmholtz 1887, N. Campbell 1920, Luce/Tukey 1964.

Obstacle: Confusion (interdependence; interaction; confounding).
Solution: Separability (sufficiency; invariance; conjoint order).
Inventors: R. A. Fisher 1920, Thurstone 1925, Guttman 1944, Rasch 1958.

Obstacle: Ambiguity (of entity, interval, and aggregation).
Solution: Divisibility (independence; stability).
Inventors: Levy 1924, Kolmogorov 1932, Bookstein 1992.

Uncertainty is the motivation for inference. The future is uncertain by definition. We have only the past by which to foresee. Our solution is to capture uncertainty in a construction of imaginary probability distributions which regularize the irregularities that disrupt connections between what seems certain now but is uncertain later. The solution to uncertainty is Bernoulli's inverse probability.

Distortion interferes with the transition from observation to conceptualization. Our ability to figure things out comes from our faculty to visualize. Our power of visualization evolved from the survival value of body navigation through the three-dimensional space in which we live. Our antidote to distortion is to represent our observations of experience in the linear form that makes them look like the space in front of us. To see what experience means, we map it.

Confusion is caused by interdependencies. As we look for tomorrow's probabilities in yesterday's lessons, confusing interactions intrude. Our resolution of confusion is to represent the complexity we experience in terms of a few shrewdly invented dimensions. The authority of these dimensions is their utility. Final truths are unknowable. But, when our inventions work, we find them useful. And when they continue to work, we come to count on them and to call them real and true.

The method we use to control confusion is to enforce our ideas of unidimensionality. We define and measure one invented dimension at a time. The necessary mathematics is parameter separability. Models which introduce putative causes as separately estimable parameters are our laws of quantification. These models define measurement, determine what is measurable, decide which data are useful, and expose data which are not.

Ambiguity, a fourth obstacle to inference, occurs because there is no nonarbitrary way to determine exactly which particular definitions of existential entities are the right ones to count. As a result, the only measurement models that can work are models that are indifferent to level of composition. Bookstein (1992) shows that to accomplish this the models must embody parameter divisibility or additivity as in:

H(x·y) = H(x)·H(y)  and  G(x + y) = G(x) + G(y).
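The two functional equations are easy to verify with simple closed forms: a power function is divisible and a linear function is additive (the constant c below is only illustrative):

```python
import math

c = 0.7  # an arbitrary illustrative constant

def H(x):
    """A divisible form: H(x * y) = H(x) * H(y)."""
    return x ** c

def G(x):
    """An additive form: G(x + y) = G(x) + G(y)."""
    return c * x

x, y = 3.0, 5.0
assert abs(H(x * y) - H(x) * H(y)) < 1e-9      # divisibility holds
assert abs(G(x + y) - (G(x) + G(y))) < 1e-9    # additivity holds

# The logarithm links the two forms: it turns the divisible H into an additive one.
assert abs(math.log(H(x * y)) - (math.log(H(x)) + math.log(H(y)))) < 1e-9
```

The last line is the connection exploited throughout this article: exponentiating an additive model yields a divisible one, and taking logarithms goes back.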

Fortunately, the mathematical solutions to ambiguity, confusion, and distortion are identical. The parameters in the model governing the probability of the data must appear in either a divisible or additive form. Following Bookstein enables:

1. the conjoint additivity which Norman Campbell (1920) and Luce and Tukey (1964) require for fundamental measurement and which Rasch's models provide in practice (Perline, Wright, & Wainer, 1979; Wright, 1985, 1995a),

2. the exponential linearity which Ronald Fisher (1920) requires for estimation sufficiency (Andersen, 1977; Wright, 1995b), and

3. the parameter separability which Louis Thurstone (1925) and Rasch (1960) require for objectivity (Wright & Linacre, 1995).

No model which fails to satisfy the four necessities for inference (probability, additivity, separability, and divisibility) can survive actual practice. No other formulation can define or construct results which any scientist, engineer, businessman, tailor, or cook would be willing to use as measures. Only data that can be understood and organized to fit such a model can be useful for constructing measures. When data cannot be made to fit such a model, the inevitable conclusion will be that those data are inadequate and must be reconsidered, perhaps omitted, perhaps replaced (Wright, 1977).

Measurement Models
Turning to the details of practice, our data come to us in the form of nominal response categories like: yes/no, right/wrong, present/absent, always/usually/sometimes/never, strongly agree/agree/disagree/strongly disagree, and so on.

The labels we choose for these categories suggest an ordering from less to more: more yes, more right, more present, more frequent, more agreeable. Without thinking much about it, we take as linguistically given a putative hierarchy of ordinal response categories, an ordered rating scale. But whether responses to these labels are, in practice, actually distinct or even ordered remains to be discovered when we try to use our data to construct useful measures.

It is not only the unavoidable ambiguity of what is counted but also our lack of knowledge about the functioning distances between the ordered categories that mislead us. The response counts cannot form a linear scale. They are not only restricted to occur as integers between none and all but also systematically biased against off-target measures. Because, at best, they are counts, their natural quantitative comparison will be like ratios rather than differences. Means and standard deviations calculated from these ranks are systematically misleading.

There are serious problems in our initial raw data: ambiguity of entity, nonlinearity, and confusion of source (Is it the smart person or the easy item that produces the right answer?). In addition, it is not these particular data that interest us. Our needs focus on what these local data imply about more extensive, future data which, in the service of inference, are by definition missing. We therefore apply the inverse probability step to inference by addressing each piece of observed data, x_nix, as a stochastic consequence of a modeled probability of occurring, P_nix. Then we take the mathematical step to inference by connecting P_nix to a conjointly additive function that specifies how the measurement parameters in which we are interested are supposed to govern P_nix.

Our parameters could be B_n, the location measure of person n on the continuum of reference; D_i, the location calibration of item i on the same continuum; and F_x, the threshold of the transition from category (x - 1) to category x. The necessary and sufficient formulation is:

log(P_nix / P_ni(x-1)) ≡ B_n - D_i - F_x,

in which the symbol ≡ means "by definition" rather than merely "equals".
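One way to turn this formulation into category probabilities is to accumulate the step log-odds and normalize; the person, item, and threshold values below are invented for illustration:

```python
import math

def rating_scale_probs(b, d, thresholds):
    """Category probabilities implied by log(P_nix / P_ni(x-1)) = b - d - F_x."""
    logits = [0.0]                      # log-measure of category 0 (reference)
    for f in thresholds:
        logits.append(logits[-1] + (b - d - f))
    total = sum(math.exp(v) for v in logits)
    return [math.exp(v) / total for v in logits]

b, d = 1.0, 0.0                 # invented person measure and item calibration
F = [-1.0, 0.0, 1.0]            # invented thresholds for a four-category scale
probs = rating_scale_probs(b, d, F)

assert abs(sum(probs) - 1.0) < 1e-9
# The defining log-odds relation is recovered for every adjacent category pair:
for x in range(1, len(probs)):
    assert abs(math.log(probs[x] / probs[x - 1]) - (b - d - F[x - 1])) < 1e-9
```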

On the left of this measurement model, we see the replacement of x_nix by its Bernoulli/Bayes/Laplace stochastic proxy P_nix. On the right, we see the Campbell/Luce/Tukey conjoint additivity which produces parameter estimates in the linear form to which our eyes, hands, and feet are so accustomed.

Exponentiating shows how this model also meets the Levy/Kolmogorov/Bookstein divisibility requirement. But it is the linear form that serves our scientific aims best. When we want to see what we mean, we draw a picture, because only seeing is believing. But the only pictures we see clearly are maps of linear measures. Graphs of ratios mislead us. Try as we may, our eyes cannot see things that way. Needless to say, what we cannot see, we cannot understand, let alone believe.

Indeed, Fechner (1860) showed that when we experience any kind of ratio (light, sound, or pain) our nervous system takes its logarithm so that we can see how it feels on a linear scale. Nor was Fechner the first to notice this neurological phenomenon. On the Pythagorean scale, musical instruments sound out of tune at each change of key. Tuning is key dependent. This problem was solved in the 17th century by tuning instruments, instead, to notes which increase in frequency by equal ratios. Equal ratio tuning produced an equally tempered scale of notes which sound equally spaced in any key. Bach wrote "The Well-Tempered Clavier" to demonstrate the validity of this invention.
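The equal-temperament example is simple arithmetic to reproduce: semitone frequencies grow by a constant ratio of 2^(1/12), so they are equally spaced only after taking logarithms (A4 = 440 Hz assumed):

```python
import math

ratio = 2 ** (1 / 12)                              # equal-tempered semitone ratio
freqs = [440.0 * ratio ** n for n in range(13)]    # A4 up one octave to A5

# Equal ratios: every adjacent pair of notes has the same quotient ...
quotients = [freqs[i + 1] / freqs[i] for i in range(12)]
assert all(abs(q - ratio) < 1e-9 for q in quotients)

# ... which means equal spacing in log frequency, the linear scale the ear hears.
log_steps = [math.log(freqs[i + 1]) - math.log(freqs[i]) for i in range(12)]
assert all(abs(s - math.log(ratio)) < 1e-9 for s in log_steps)

assert abs(freqs[-1] - 880.0) < 1e-6               # an octave doubles the frequency
```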

These conclusions, so thoroughly founded on the seminal work of great mathematicians, have penetrating consequences. This history teaches us not only what to do but also what NOT to do. No study of history is complete without learning from the wrong directions and blind alleys by which we were confused and misled. What, then, are the unlearned lessons in the history of social science measurement? Several significant blind alleys stand out.

Winter 1997 39


What History Tells Us NOT to Do

Do NOT Use Raw Scores as Though They Were Measures
Many social scientists still believe that misusing raw scores as measures does no harm. They are unaware of the consequences for their work of the raw score bias against extreme scores. Some believe that they can construct measures by decomposing raw score matrices with some kind of factor analysis. There is a similarity between measurement construction and factor analysis in the way that they expose multidimensionality (Smith, 1996). But factor analysis does not construct measures (Wright, 1996a). All results from raw score analyses are spoiled by their nonlinearity, their extreme score bias, and their sample dependence.
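The raw-score bias against extreme scores can be sketched with the standard logit transform of proportion-correct (a deliberate simplification that assumes items of equal difficulty): the same ten-point raw gain is worth far more logits near the top of a test than in the middle.

```python
import math

def logit(p):
    """Log-odds of a proportion-correct score."""
    return math.log(p / (1 - p))

# Two equal 10-point raw gains on a 100-item test:
mid_gain = logit(0.60) - logit(0.50)     # 50 -> 60 items correct
high_gain = logit(0.95) - logit(0.85)    # 85 -> 95 items correct

# The identical raw gain covers almost three times the logit distance near the top.
assert high_gain > 2 * mid_gain
print(round(mid_gain, 2), round(high_gain, 2))   # → 0.41 1.21
```

Treating the two raw gains as equal, as any arithmetic on raw scores does, is exactly the nonlinearity and extreme-score bias the paragraph describes.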

Do NOT Use Nonadditive Models
Among those who have seen their way beyond raw scores to item response theory (IRT), there is a baffling misunderstanding concerning the necessity for conjoint additivity and sufficient statistics. These adventurers cannot resist trying their luck with measurement models like:

log[P_ni / (1 - P_ni)] = A_i(B_n - D_i)

and

log[(P_ni - C_i) / (1 - P_ni)] = A_i(B_n - D_i),

which they call the 2P and 3P IRT models of Birnbaum (Lord & Novick, 1968). These models are imagined to be improvements over the 1P Rasch model because they include an item scaling parameter A_i to estimate a discrimination for each item and a lower asymptote parameter C_i to estimate a guessing level for each item. But, because these extra parameters are not additive, their proponents find, when they try to apply them to data, that:

Item discriminations "increase without limit." Person abilities "increase or decrease without limit" (Lord, 1968, pp. 1015-1016).

Even for data generated to fit the 3PL model exactly, "only item difficulty is satisfactorily recovered by [the 3P computer program] LOGIST. . . . If restraints are not imposed, the estimated value of discrimination is likely to increase without limit. . . . Left to itself, maximum likelihood estimation procedures would produce unacceptable values of guessing" (Lord, 1975, pp. 13-14, 16).

During "estimation in the two and three parameter models . . . the item parameter estimates drift out of bounds" (Swaminathan, 1983, p. 34).

"Range restrictions (must be) applied to all parameters except the item difficulties" to control "the problem of item discrimination going to infinity" (Wingersky, 1983, p. 48).

"Bias [in person measures] is significant when ability estimates are obtained from estimated item parameters. . . . And, in spite of the fact that the calibration and cross-validation samples are the same for each setting, the bias differs by test" (Stocking, 1989, p. 18).

"Running LOGIST to complete convergence allows too much movement away from the good starting values" (Stocking, 1989, p. 25).

The reason why 2P and 3P IRT models do not converge is clear in Birnbaum's original estimation equations (Lord & Novick, 1968, pp. 421-422), which, written schematically, alternate between

sum over items i of a_i(x_θi - P_θi) = 0, solved for each person's ability θ,

and

sum over persons θ of θ(x_θi - P_θi) = 0, solved for each item's discrimination a_i.

These equations are intended to iterate reciprocally to convergence. When the first equation is applied to a person with a correct response x_θi = 1 on an item with discrimination a_i > 1, the ability estimate θ is increased by the factor a_i. When the second equation is applied, the same person response x_θi = 1 is multiplied by the increased ability estimate θ, which further increases the discrimination estimate a_i. The presence of response x_θi = 1 on both sides of these reciprocal equations produces a feedback which soon escalates the estimates for item discrimination a_i and person measure θ to infinity.

Do NOT Use Models That Fail to Minimize Residuals
The sine qua non of a statistical model is its success at reproducing its data. The simplest evaluation of success is the mean square residual between each piece of data x and its modeled expectation E_x, as in the mean of (x - E_x)^2 over x. Ordinarily, the more parameters a model uses, the smaller the mean square residual becomes. Otherwise, why add more parameters? Should we ever encounter a parameter the addition of which increases our mean square residuals, we have exposed a parameter that works against the intentions of our model.

Hambleton and Martois used LOGIST to analyze 18 sets of data twice, first with a 1-item-parameter Rasch model and second with a 3-item-parameter Birnbaum model (Hambleton & Martois, 1983). In 12 of their 18 experiments, much to their surprise, two fewer item parameters (that is, the Rasch model) produced smaller mean square residuals than their 3-item-parameter model. In the six data sets where this did not happen, the tests were unusually difficult for the students. As a result, attempting to estimate guessing parameters reduced residuals slightly more than the Rasch model without a guessing constant.

Had a single a priori guessing constant been set at a reasonable value like C = .25 for all items and the data reanalyzed with a 1P Rasch model so modified, Hambleton and Martois would have discovered that one well-chosen a priori guessing constant did a better job than attempting to estimate a full set of item-specific guessing parameters. When we encounter a situation in which the addition of a parameter makes things worse, we have proven to ourselves that the parameter in question does not belong in our model.

Do NOT Destroy Additivity
Another way to see the problem is to attempt to separate parameters for independent estimation by subtraction. Using G_ni as the data-capturing log-odds side of the model for a dichotomy, consider the following Rasch equations. When

G_ni = B_n - D_i
G_mi = B_m - D_i
G_nj = B_n - D_j,

then

G_ni - G_mi = B_n - B_m,

so that D_i drops out of consideration, and

G_ni - G_nj = D_j - D_i,

so that B_n drops out of consideration.

Now consider the parallel 2P model equations. When

G_ni = A_i(B_n - D_i)
G_mi = A_i(B_m - D_i)
G_nj = A_j(B_n - D_j),

then

G_ni - G_mi = A_i(B_n - B_m),

and we are left with A_i, and

G_ni - G_nj = B_n(A_i - A_j) - A_i D_i + A_j D_j,

and we are left with B_n. We cannot separate these parameters in order to estimate them independently.
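The algebra above can be mirrored numerically with invented parameter values: the Rasch log-odds difference between two persons is the same whichever item is used, while the 2P difference is not:

```python
Bn, Bm = 1.2, 0.4        # invented person measures
Di, Dj = -0.3, 0.8       # invented item calibrations
Ai, Aj = 1.0, 2.5        # invented 2P discriminations

# Rasch log-odds G = B - D: comparing two persons, every item returns B_n - B_m.
rasch_diff_i = (Bn - Di) - (Bm - Di)
rasch_diff_j = (Bn - Dj) - (Bm - Dj)
assert abs(rasch_diff_i - (Bn - Bm)) < 1e-12
assert abs(rasch_diff_i - rasch_diff_j) < 1e-12   # the item drops out: item-free

# 2P log-odds G = A(B - D): the same comparison depends on which item is used.
two_p_diff_i = Ai * (Bn - Di) - Ai * (Bm - Di)
two_p_diff_j = Aj * (Bn - Dj) - Aj * (Bm - Dj)
assert abs(two_p_diff_i - two_p_diff_j) > 0.5     # persons and items do not separate
```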

But Merely Asserting Additivity Is NOT Enough
Parameters can be combined additively and asserted to govern a monotonic probability function over an infinite range yet fail to construct stable fundamental measurement. Consider Goldstein (1980):

log[-log(P_ni)] = B_n - D_i

and Samejima (1997):

{log[P_ni / (1 - P_ni)]}^A = B_n - D_i.

These two models appear to specify conjoint additivity, but they do not construct fundamental measurement. Neither model provides sufficient statistics for B_n and D_i, and both models fail to construct unique measures. To see this, reverse the direction of the latent variable and focus on person deficiency (-B_n), item easiness (-D_i), and task failure (1 - P_ni).

Rasch (1960):

log[P_ni / (1 - P_ni)] = B_n - D_i

becomes

log[(1 - P_ni) / P_ni] = -(B_n - D_i) = -log[P_ni / (1 - P_ni)],

in which nothing changes but direction.

Goldstein (1980):

log[-log(P_ni)] = B_n - D_i,

however, becomes

log[-log(1 - P_ni)] = -(B_n - D_i),

which does NOT equal

-log[-log(P_ni)]

unless [log P_ni][log(1 - P_ni)] = 1.

Samejima (1997):

{log[P_ni / (1 - P_ni)]}^A = B_n - D_i

becomes

{log[(1 - P_ni) / P_ni]}^A = -(B_n - D_i),

which does NOT equal

-{log[P_ni / (1 - P_ni)]}^A

unless A = 1, which makes Samejima's model the Rasch model.

For Goldstein and Samejima, merely measuring from the other end of the ruler produces a second set of measures that are incommensurable with the first. The mere assertion of additivity on one side of a model is not enough. To produce fundamental measurement, the model must reproduce itself regardless of direction.
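The reversal test is easy to run numerically with invented values: the Rasch form flips sign exactly when measured from the other end of the scale, while Goldstein's log-log form does not:

```python
import math

b, d = 1.0, 0.2                          # invented person and item parameters

# Rasch: log-odds of success; reversing the variable only flips the sign.
p = 1 / (1 + math.exp(-(b - d)))
forward = math.log(p / (1 - p))          # B - D
reverse = math.log((1 - p) / p)          # measured from the other end
assert abs(forward + reverse) < 1e-9     # a pure sign flip: the same measure

# Goldstein's log-log link: the same reversal is NOT a sign flip.
p_g = math.exp(-math.exp(b - d))         # probability satisfying log[-log(P)] = B - D
forward_g = math.log(-math.log(p_g))     # recovers B - D
reverse_g = math.log(-math.log(1 - p_g)) # measured from the other end
assert abs(forward_g - (b - d)) < 1e-9
assert abs(forward_g + reverse_g) > 0.1  # the two ends give incommensurable results
```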

Do NOT Destroy Construct Stability
Finally, there is a fundamental illogic in attempting to define a construct with item characteristic curves (ICCs) that are designed to cross, by letting their slopes differ due to differing item discriminations or allowing their asymptotes to differ due to differing item-guessing parameters. The resulting crossing curves destroy the variable's criterion definition because the hierarchy of relative item difficulty becomes different at every level of ability.

Figure 2 shows the relative locations of Rasch item calibrations for 5 words drawn from the word recognition construct Woodcock defined with Rasch item calibrations (Woodcock, 1974). Notice that it does not matter whether the level of ability is at first, second, or third grade: the words red, away, drink, octopus, and equestrian remain in the same order of experienced difficulty, at the same relative spacing. This word recognition ruler works the same way and defines the same variable for every child, whatever his or her grade. It obeys the Magna Carta.

FIGURE 2. Five sample-free Rasch items: three perceptions of one variable. At first, second, and third grade, the five words (red, away, drink, octopus, equestrian) hold the same order and spacing on the logit scale of item difficulty (-2 to +2, relative to item C). Word order stays the same: one variable defined.


To obtain the construct stability evident in Figure 2, we need the kind of item response curves that follow from the standard definition of fundamental measurement. Figure 3 shows that these Rasch curves do not cross. When we transform the vertical axis of these curves into log-odds instead of probabilities, the curves become parallel straight lines, thus demonstrating their conjoint additivity.

Figure 4, in contrast, shows five 3P Birnbaum curves for the same data. These five curves have different slopes and different asymptotes. There is no sign of conjoint additivity.

Figure 5 shows the construct destruction produced by the crossing curves of Figure 4. Now for a first-grader, red is calibrated to be easier than away, which is easier than drink, which is easier than octopus. But for a third-grader, the order of item difficulty is different. Now it is away, rather than red, that is easier. Red has become harder than drink! And octopus is nearly as easy to recognize as red, instead of being nearly as hard as equestrian. What is the criterion definition of this variable? What construct is defined? The definition is different at every level of ability. There is no construct! No ruler! No Magna Carta!

Much as we might be intrigued by the complexity of the Birnbaum 3P curves in Figure 4, we cannot use them to construct measures. To construct measures, we require orderly, cooperating, noncrossing curves like the Rasch curves in Figure 3. This means that we must take the trouble to collect and refine data so that they serve this clearly defined purpose, so that they approximate a stochastic Guttman scale.

When we go to market, we eschew rotten fruit. When we make a salad, we demand fresh lettuce. We have a recipe for what we want. We select our ingredients to follow it. It is the same with making measures. We must think ahead when we select and prepare our data for analysis. It is foolish to swallow whatever comes. Our data must be directed to building a structure like the one in Figures 2 and 3 (one ruler for everyone, everywhere, every time) so we can achieve a useful, stable construct definition like Woodcock's word-recognition ruler.

FIGURE 3. Five sample-free Rasch curves: probability of success plotted against ability (-5 to +5 logits, from low ability through first, second, and third grade to high ability). The curves do not cross.

FIGURE 4. Five sample-dependent Birnbaum curves: probability of success plotted against ability (-5 to +5 logits) for the same data. The curves differ in slope and asymptote and cross one another.

42 Educational Measurement: Issues and Practice


FIGURE 5. Five sample-dependent Birnbaum items: three different variables. At first, second, and third grade, the five words (red, away, drink, octopus, equestrian) fall in a different difficulty order relative to item C. Chaos! What is the item definition of this variable?

There is a vast difference between gerrymandering whatever kind of model might seem to give a locally good description of some transient set of data and searching, instead, for the kind of data that can yield inferentially stable (that is, generalizable) meaning for the parameter estimates of interest. The 3P model is data driven: The model must fit, or another model must be found. The 3P model seldom objects to an item, no matter how badly it functions. The Rasch model is theory driven: The data must fit, or else better data must be found. Indeed, it is the search for better data that sets the stage for discovery. The only way discovery can occur is as an unexpected discrepancy from an otherwise stable frame of reference. When we study data misfit to the Rasch model, we discover new things about the nature of what we are measuring and the way that people are able to tell us about it in their responses. These discoveries are important events that strengthen and clarify our construct as well as our ability to measure it.

Conclusions
We have recalled the political and moral history of stable units for fair taxation and trade. When units are unequal, when they vary from time to time and place to place, it is not only unfair but immoral. This is also the case with the misuse of necessarily unequal and therefore unfair raw score units.

The purpose of measurement is inference. We measure to inform and specify our plans for what to do next. If our measures are unreliable, if our units vary in unknown ways, our plans must go awry. This might seem a small point. Indeed, it has been belittled by presumably knowledgeable social scientists. But, far from being small, it is vital and decisive! We will never build a useful, let alone moral, social science until we stop deluding ourselves by analyzing raw scores as though they were measures (Wright, 1984).

Laws of Measurement
Some laws that are basic to the construction of measurement have emerged:

Any statistical method nominated to serve inference that requires complete data, by this very requirement, disqualifies itself as an inferential method.

When a model employs parameters for which there are no sufficient statistics, that model cannot construct useful measurement, because it cannot estimate its parameters independently of one another.

Before applying linear statistical methods to raw data, one must first use a measurement model to construct (not merely assert) coherent, sample-free and test-free linear measures from the observed raw data.

Requirements for Measures
The history of measurement can be summarized as the history of the way in which solutions were found to Thurstone's requirements:

1. Measures must be linear, so that arithmetic can be done with them.
2. Item calibrations must not depend on whose responses are used to estimate them; they must be sample-free.
3. Person measures must not depend on which items they happened to take; they must be test-free.
4. Missing data must not matter.
5. The method must be easy to apply.

These solutions were latent in Campbell's 1920 concatenation, in Fisher's 1920 sufficiency, and in Levy's 1937 and Kolmogorov's 1950 divisibility; clarified by Guttman's 1950 conjoint transitivity; and realized by Rasch's 1953 additive Poisson model.

Guessing and Discrimination
The history of Birnbaum's 3P model is a cautionary tale. Guessing is celebrated as a reliable item asset. Discrimination is saluted as a useful scoring weight. Crossed item characteristic curves are shrugged off as naturally unavoidable. The Rasch model is choosier. It recognizes guessing not as an item asset but as an unreliable person liability. Variation in discrimination, a sure symptom of item bias and multidimensionality, is also rejected (Masters, 1988). Unlike the Birnbaum model, the Rasch model does not parameterize discrimination and guessing and then forget them. The Rasch model always analyzes the data for statistical symptoms of variation in discrimination and guessing, identifies their sources, and weighs their impact on measurement quality (Smith, 1985, 1986, 1988, 1991, 1994).

In practice, guessing is easy to minimize by using well-targeted tests. When it does occur, it is not items that do the guessing. The place to look for guessing is among guessers. Even then, few people guess. But, from time to time, some people do seem to have scored a few lucky guesses. The fairest and most efficient way to deal with guessing, when it does occur, is to detect it and then to decide what is the most reasonable thing to do with the improbably successful responses the lucky guesser may have chanced on.
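One common way to operationalize this detection, sketched here with invented parameter values and an illustrative 2.0 cutoff, is to flag responses whose standardized residuals from their Rasch expectations are improbably large:

```python
import math

def rasch_p(b, d):
    """Rasch probability of success for ability b on item difficulty d."""
    return 1 / (1 + math.exp(-(b - d)))

def standardized_residual(x, p):
    """(observed - expected) / model standard deviation for a 0/1 response."""
    return (x - p) / math.sqrt(p * (1 - p))

# Invented case: a low-ability person (b = -1) succeeds on a very hard item (d = 3).
z_lucky = standardized_residual(1, rasch_p(-1.0, 3.0))

# The same person succeeding on an easy item (d = -2) is unremarkable.
z_expected = standardized_residual(1, rasch_p(-1.0, -2.0))

assert z_lucky > 2.0      # improbable success: flag as a possible lucky guess
assert z_expected < 2.0   # expected success: no flag
```

Inspecting flagged responses person by person keeps the diagnosis where the article puts it: with the guesser, not with the item.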

Fundamental Measurement
The knowledge needed to construct fundamental measures from raw scores has been with us for 40 years. Despite hesitation by some to use fundamental measurement models to transform raw scores into measures so that subsequent statistical analysis can become fruitful, there have been many successful applications (Engelhard & Wilson, 1996; Fisher & Wright, 1994; Smith, 1997; Wilson, 1992, 1994; Wilson, Engelhard, & Draney, 1997).

Rasch's model is being extended to address every imaginable raw observation (dichotomies, rating scales, partial credits, binomial and Poisson counts; Masters & Wright, 1984) in every reasonable observational situation, including ratings faceted to persons, items, judges, and tasks (Linacre, 1989).

Computer programs that apply Rasch models have been in circulation for 30 years (Wright & Panchapakesan, 1969). Convenient and easy-to-use software to accomplish the application of Rasch's measuring functions is readily available (Linacre & Wright, 1997; Wright & Linacre, 1997).

Today, it is easy for any scientist to use these computer programs to traverse the decisive step from their unavoidably ambiguous concrete raw observations to well-defined abstract linear measures with realistic precision and validity estimates. There is no methodological reason why social science cannot become as stable, as reproducible, and hence as useful as physics.

References Andersen, E. B. (1977). Sufficient sta-

tistics and latent trait models. Psy- chometrika, 42, 69-81.

Andrich, D. (1978a). A rating formula- tion for ordered response categories. Psychometrika, 43, 561-573.

Andrich, D. (1978b). Scaling attitude items constructed and scored in the Likert tradition. Educational and Psychological Measurement, 38, 665- 680.

Andrich, D. (1978~). Application of a psychometric rating model to ordered categories which are scored with suc- cessive integers. Applied Psychologi- cal Measurement, 2, 581-594.

Andrich, D. (1995). Models for measure- ment: precision and the non-di- chotomization of graded responses, Psychometrika, 60, 7-26.

Andrich, D. (1996). Measurement crite- ria for choosing among models for graded responses. In A. von Eye & C. C. Clogg (Eds.), Analysis of cate- gorical variable in developmental re- search (pp. 3-35). Orlando: Academic.

Bookstein, A. (1992). Informetric distri- butions, Parts I and 11. Journal of the American Society for Information Sci- ence, 41(5), 368-388.

Bookstein, A. (1996). Informetric distri- butions. 111. Ambiguity and random- ness. Journal of the American Society for Information Science, 48(1), 2-10.

Brogden, H. E. (1977). The Rasch model, the law of comparative judge- ment and additive conjoint measure- ment. Psychometrika, 42, 631-634.

Campbell, N. R. (1920). Physics: The el- ements. London: Cambridge Univer- sity Press.

Engelhard, G. (1984). Thorndike, Thur- stone and Rasch: A comparison of their methods of scaling psychological tests. Applied Psychological Measure- ment, 8, 21-38.

Engelhard, G. (1991). Thorndike, Thur- stone and Rasch: A comparison of their approaches to item-invariant measurement. Journal of Research and Development in Education, 24(2),

Engelhard, G. (1994). Historical views of the concept of invariance in mea- surement theory. In M. Wilson (Ed.), Objective measurement: Theory into practice, Vol. 2 (pp. 73-99). Norwood, N J Ablex.

Engelhard, G., & Wilson, M. (Eds.). (1996). Objective measurement: The- ory into practice, Vol. 3. Norwood, NJ: Ablex.

45-60.

Fechner, G. T. (1966). Elements of psy- chophysics (H. E. Adler, Trans.). New York: Holt, Rinehart & Winston. (Original work published 1860)

Feller, W. (1950). An introduction to probability theory and its applica- tions, Vol. I . New York: Wiley.

Fischer, G. (1968). Psychologische test- theorie [Psychological test theory]. Bern: Huber.

Fisher, R. A. (1920). A mathematical ex- amination of the methods of deter- mining the accuracy of an observation by the mean error and by the mean sauare error. Monthlv Notices o f t i e Royal Astronomicil Society, 53, 758-770. . - - . . . .

Fisher, W. P., & Wright, B. D. (1994). Applications of probabilistic conjoint measurement [Special issue]. Inter- national Journal Educational Re- search, 21, 557-664.

Goldstein, H. (1980). Dimensionality, bias, independence and measurement scale problems in latent trait test score models. British Journal of Mathematical and Statistical Psy- chology, 33, 234-246.

Guttman, L. (1950). The basis for scalogram analysis. In Stouffer et al. (Eds.), Measurement and prediction, Vol. 4 (pp. 60-90). Princeton, NJ: Princeton University Press.

Hambleton, R., & Martois, J. (1983). Test score prediction system. In Applications of item response theory (pp. 208-209). Vancouver, BC: Educational Research Institute of British Columbia.

Keats, J. A. (1967). Test theory. Annual Review of Psychology, 18, 217-238.

Kolmogorov, A. N. (1950). Foundations of the theory of probability. New York: Chelsea.

Levy, P. (1937). Theorie de l'addition des variables aleatoires [Theory of the addition of random variables]. Paris: Wiley.

Linacre, J. M. (1989). Many-Faceted Rasch measurement. Chicago: MESA.

Linacre, J. M., & Wright, B. D. (1997). FACETS: Many-Faceted Rasch anal- ysis. Chicago: MESA.

Lord, F. M. (1968). An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter model. Educational and Psychological Measurement, 28, 989-1020.

Lord, F. M. (1975). Evaluation with artificial data of a procedure for estimating ability and item characteristic curve parameters (Research Rep. No. RB-75-33). Princeton: Educational Testing Service.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement. Journal of Mathematical Psychology, 1, 1-27.

44 Educational Measurement: Issues and Practice

Masters, G. N. (1988). Item discrimination: When more is worse. Journal of Educational Measurement, 24, 15-29.

Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.

Perline, R., Wright, B. D., & Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3, 237-255.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago: MESA.

Samejima, F. (1997, April). Ability estimates that order individuals with consistent philosophies. Paper presented at the annual meeting of the American Educational Research Association, Chicago.

Sears, S. D. (1997). A monetary history of Iraq and Iran. Unpublished doctoral dissertation, University of Chicago.

Smith, R. M. (1985). Validation of individual test response patterns. In J. Keeves (Ed.), International encyclopedia of education (pp. 5410-5413). Oxford: Pergamon.

Smith, R. M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46, 359-372.

Smith, R. M. (1988). The distributional properties of Rasch standardized residuals. Educational and Psychological Measurement, 48, 657-667.

Smith, R. M. (1991). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, 541-565.

Smith, R. M. (1994). A comparison of the power of Rasch total and between item fit statistics to detect measurement disturbances. Educational and Psychological Measurement, 54, 42-55.

Smith, R. M. (1996). A comparison of methods for determining dimensionality. Structural Equation Modeling, 3(1), 25-40.

Smith, R. M. (Ed.). (1997). Outcome measurement. Physical Medicine and Rehabilitation: State of the Art Reviews, 11(2). Philadelphia: Hanley & Belfus.

Stigler, S. M. (1986). The history of statistics. Cambridge: Harvard University Press.

Stocking, M. L. (1989). Empirical estimation errors in item response theory as a function of test properties (Research Rep. No. RR-89-5). Princeton: Educational Testing Service.

Swaminathan, H. (1983). Parameter estimation in item response models. In R. Hambleton & J. Martois (Eds.), Applications of item response theory (pp. 24-44). Vancouver, BC: Educational Research Institute of British Columbia.

Thorndike, E. L. (1904). An introduction to the theory of mental and social measurements. New York: Teacher's College.

Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433-451.

Thurstone, L. L. (1926). The scoring of individual performance. Journal of Educational Psychology, 17, 446-457.

Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 23, 529-554.

Thurstone, L. L. (1931). Measurement of social attitudes. Journal of Abnormal and Social Psychology, 26, 249-269.

Thurstone, L. L., & Chave, E. J. (1929). The measurement of attitude. Chicago: University of Chicago Press.

Wilson, M. (Ed.). (1992). Objective measurement: Theory into practice (Vol. 1). Norwood, NJ: Ablex.

Wilson, M. (Ed.). (1994). Objective measurement: Theory into practice (Vol. 2). Norwood, NJ: Ablex.

Wilson, M., Engelhard, G., & Draney, K. (Eds.). (1997). Objective measurement: Theory into practice (Vol. 4). Norwood, NJ: Ablex.

Wingersky, M. S. (1983). LOGIST: A program for computing maximum likelihood procedures for logistic test models. In R. Hambleton & J. Martois (Eds.), Applications of item response theory (pp. 45-56). Vancouver, BC: Educational Research Institute of British Columbia.


Winter 1997 45