TRANSCRIPT
Chapter 4 – Validity and Test Development
- Note: Reliability is a necessary but not a sufficient precursor of validity
- Test developers have a responsibility to demonstrate that new instruments fulfill the purposes for
which they are designed
Validity: A definition
- Validity – a test is valid to the extent that inferences made from it are appropriate, meaningful and
useful.
- A test score is per se meaningless until the examiner draws inferences from it based on the test
manual or other research findings.
- Unfortunately, it is very seldom possible to summarize the validity of a test in terms of a
single, tidy statistic
o Determining whether inferences are appropriate, meaningful, and useful typically requires
numerous studies of the relationships between test performances and independently
observed behaviours.
- Validity reflects an evolutionary, research-based judgement of how adequately a test measures the
attribute it was designed to measure
- Traditionally the different ways of accumulating validity evidence have been grouped into
categories:
o Content validity
o Criterion-related validity
o Construct validity
Content validity
- Content validity is determined by the degree to which the questions, tasks, or items on a test are
representative of the universe of behaviour the test was designed to sample.
o Nothing more than a sampling issue
o The items of a test can be visualized as a sample drawn from a larger population of
potential items that define what the researcher really wishes to measure
If the sample (specific items on the test) is representative of the population (all
possible items), then the test possesses content validity
- Content validity is a useful concept when a great deal is known about the variable that the
researcher wishes to measure
- When evaluating content validity, response specification is also an integral part of defining the
relevant universe of behaviours
o For example: in reference to spelling achievement, it cannot be assumed that a multiple
choice test will measure the same spelling skills as an oral test or a frequency count of
misspellings in written compositions
- Content validity is more difficult to assure when the test measures an ill-defined trait
Quantification of Content Validity
- A coefficient of content validity can be derived from the following formula:
o Content validity = D / (A + B + C + D)
- The commonsense approach to content validity advocated here serves well as a flagging mechanism
to help cull out existing items that are deemed inappropriate by expert raters
- A test could possess a robust coefficient of content validity and still fall short in subtle ways
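The formula above can be sketched in code. The interpretation of A through D as the cells of a two-rater agreement table, with D the count of items both expert raters judge relevant, is an assumption, since the notes do not define the letters.

```python
def content_validity(a, b, c, d):
    """Coefficient of content validity: D / (A + B + C + D).

    Assumed interpretation (not spelled out in the notes): two expert
    raters classify each item as relevant or not; d counts items BOTH
    raters judge relevant, and a, b, c are the remaining cells.
    """
    return d / (a + b + c + d)

# Example: 2 items rejected by both raters, 3 split decisions,
# 15 items endorsed by both raters
print(content_validity(2, 1, 2, 15))  # 0.75
```

A value near 1.0 indicates strong agreement that the retained items sample the intended domain.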
Face Validity
- A test has face validity if it looks valid to test users, examiners and especially the examinees
- Face validity is really a matter of social acceptability, not a technical form of validity in the same category as
content, criterion-related, or construct validity
- In fact, a test could possess extremely strong face validity (the items might look highly relevant to
what is presumably measured by the instrument) yet produce totally meaningless scores with no
predictive utility whatsoever
Criterion-Related Validity
- Criterion related validity is demonstrated when a test is shown to be effective in estimating an
examinee’s performance on some outcome measure
- The variable of primary interest is the outcome measure called a criterion
- Two different approaches to validity evidence are subsumed under the heading of criterion-related
validity:
o In concurrent validity, the criterion measures are obtained at approximately the same time
as the test scores
For example: the current psychiatric diagnosis of patients would be an appropriate
criterion measure to provide validity evidence for a paper-and-pencil psychodiagnostic test
o In predictive validity, the criterion measures are obtained in the future, usually months or
years after the test scores are obtained, as with college grades predicted from an
entrance exam
Characteristics of a Good Criterion
- The criterion must itself be reliable if it is to be a useful index of what the test measures.
- An unreliable criterion will be inherently unpredictable regardless of the merits of the test
- To the extent that the reliability of the test, the criterion, or both is low, the validity coefficient
is also diminished
o The validity coefficient is the resulting correlation coefficient
- A criterion must also be free of contamination from the test itself.
o For example: screening tests of psychiatric symptoms often check for changes in eating,
sleeping, or social activities. Unfortunately, the SRE incorporates questions that check the
following: change in eating habits, change in sleeping habits, change in social activities, so if
the screening test contains the same items as the SRE, then the correlation between these
two measures will be artificially inflated. This potential source of error in test validation is
referred to as criterion contamination, since the criterion is contaminated by its artificial
commonality with the test.
- Criterion contamination is also possible when the criterion consists of ratings from experts.
o If the experts possess the knowledge of the examinee’s test scores, this information may
influence their ratings.
Concurrent validity
- In a concurrent validation study, test scores and criterion information are obtained simultaneously
- A test with demonstrated validity provides a shortcut for obtaining information that might
otherwise require the extended investment of professional time
- Correlations between a new test and existing tests are often cited as evidence of concurrent validity
o Old tests validating a new test – but it is nonetheless appropriate if two conditions are met:
The criterion (existing) tests must have been validated through correlations with
appropriate nontest behavioural data
The instrument being validated must measure the same construct as the criterion
tests
Predictive Validity
- In a predictive validation study, test scores are used to estimate outcome measures obtained at a
later date
- Predictive validity is particularly relevant for entrance examinations and employment tests
- When tests are used for purpose of prediction, it is necessary to develop a regression equation
o A regression equation describes the best fitting straight line for estimating the criterion
from the test
o Regression equation
Y = 0.7X + 0.2
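The illustrative regression equation above can be applied directly; a minimal sketch, treating Y as the predicted criterion score and X as the test score, with the slope and intercept values taken from the example:

```python
def predict_criterion(x, slope=0.7, intercept=0.2):
    """Best-fitting straight line for estimating the criterion
    from the test: Y' = 0.7X + 0.2 (illustrative values)."""
    return slope * x + intercept

# Predicted criterion scores for a few hypothetical test scores
for score in (1.0, 2.0, 3.0):
    print(score, round(predict_criterion(score), 2))
```

In practice the slope and intercept would be estimated by least squares from a validation sample, not assumed.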
Validity Coefficient and the Standard Error of the Estimate
- The resulting correlation is known as the validity coefficient
- The higher the validity coefficient, the more accurate is the test in predicting the criterion
- In the hypothetical case in which rxy is 1.00, the test would possess perfect validity and allow for flawless
prediction
- The standard error of estimate is the margin of error to be expected in the predicted criterion score
- The standard error of measurement indicates the margin of measurement error caused by the
unreliability of the test, whereas the standard error of estimate indicates the margin of prediction error
caused by the imperfect validity of the test
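A sketch of the standard error of estimate, using the standard formula SE_est = SD_Y * sqrt(1 - r_xy^2); the criterion standard deviation and validity coefficient used here are hypothetical:

```python
import math

def standard_error_of_estimate(sd_y, r_xy):
    """SE_est = SD_Y * sqrt(1 - r_xy**2): the margin of prediction
    error caused by the imperfect validity of the test."""
    return sd_y * math.sqrt(1 - r_xy ** 2)

# Perfect validity (r = 1.00) allows flawless prediction: zero error
print(standard_error_of_estimate(15.0, 1.0))            # 0.0
# Hypothetical criterion SD of 15 and validity coefficient of .60
print(round(standard_error_of_estimate(15.0, 0.6), 4))  # 12.0
```

Note how even a respectable validity coefficient of .60 leaves a wide margin of prediction error.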
Decision Theory Applied to Psychological Tests
- Proponents of decision theory stress that the purpose of psychological testing is not measurement
per se but measurement in the service of decision making
- Certain combinations of predicted and actual outcomes are more likely than others. If a test has
good predictive validity, then most persons predicted to succeed will succeed and most persons
predicted to fail will fail.
- No selection test is a perfect predictor so two other types of outcomes are also possible. Some
persons predicted to succeed will in fact fail.
o These cases are referred to as false positives
- Some persons predicted to fail would, if given the chance, succeed
o Case referred to as false negatives
- False positives and false negatives are collectively known as misses, because in both cases the test
has made an inaccurate prediction
o Hit rate = hits / (hits + misses)
- Proponents of decision theory make two fundamental assumptions about the use of selection tests:
o The value of various outcomes to the institution can be expressed in terms of a common
utility scale. One such scale – but by no means the only one – is profit and loss. For example
when using an interest inventory to select salespersons, a corporation can anticipate profit
from applicants correctly identified as successful but will lose money when inevitably some
of those selected do not sell enough even to support their own salary (false positives). The
cost of the selection procedure must also be factored in to the utility scale as well
o In institution selection decisions, the most generally useful strategy is one that maximizes
the average gain on the utility scale (or minimizes average loss) over many similar decisions.
For example which selection ratio produces the largest average gain on the utility scale?
Maximization is the most fundamental decision principle.
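The four decision outcomes and the hit-rate formula above can be sketched as follows; the predicted/actual pairs are hypothetical:

```python
def tally_outcomes(predictions):
    """Classify (predicted_success, actual_success) pairs into the
    decision-theory outcomes and compute the hit rate."""
    counts = {"hit": 0, "false_positive": 0, "false_negative": 0}
    for predicted, actual in predictions:
        if predicted == actual:
            counts["hit"] += 1               # accurate prediction
        elif predicted and not actual:
            counts["false_positive"] += 1    # predicted success, failed
        else:
            counts["false_negative"] += 1    # predicted failure, succeeded
    misses = counts["false_positive"] + counts["false_negative"]
    counts["hit_rate"] = counts["hit"] / (counts["hit"] + misses)
    return counts

# Hypothetical selection results: (predicted to succeed, actually succeeded)
results = [(True, True), (True, True), (True, False),
           (False, False), (False, True)]
print(tally_outcomes(results))  # 3 hits, 1 false positive, 1 false negative
```

A fuller decision-theory analysis would weight each cell by its utility (e.g., profit and loss) rather than counting all misses equally.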
Construct Validity
- This is the most difficult and elusive type of validity evidence
- A construct is a theoretical, intangible quality or trait in which individuals differ
o Examples of constructs include: leadership ability, over controlled hostility, depression and
intelligence
- Constructs are theorized to have some form of independent existence and to exert broad but to
some extent predictable influences on human behaviour
- A test designed to measure a construct must estimate the existence of an inferred, underlying
characteristic based on a limited sample of behaviour
- All psychological constructs possess two characteristics in common:
o There is no single external referent sufficient to validate the existence of the construct; that
is the construct cannot be operationally defined
o Nonetheless, a network of interlocking suppositions can be derived from existing theory
about the construct
o Psychopathy is surely a construct, in that there is no single behavioural characteristic or
outcome sufficient to determine who is strongly psychopathic and who is not
o Construct validity pertains to psychological tests that claim to measure complex,
multifaceted, and theory bound psychological attributes such as psychopathy, intelligence,
leadership ability and the like
o To evaluate the construct validity of a test, we must amass a variety of evidence from
numerous sources
o Many psychometric theorists regard construct validity as the unifying concept for all types
of validity evidence
Approaches to construct validity
- Most studies of construct validity fall into one of the following categories
o Analysis to determine whether the test items or subtests are homogeneous and therefore
measure a single construct
o Study of developmental changes to determine whether they are consistent with the theory
of the construct
o Research to ascertain whether group differences on test scores are theory-consistent
o Analysis to determine whether intervention effects on test scores are theory-consistent
o Correlation of the test with other related and unrelated tests and measures
o Factor analysis of test scores in relation to other sources of information
o Analysis to determine whether test scores allow for the correct classification of examinees.
Test Homogeneity
- If a test measures a single construct, then its component items or subtests likely will be
homogeneous (also referred to as internally consistent)
- The aim of test development is to select items that form a homogeneous scale
- The most commonly used method for achieving this goal is to correlate each potential item with the
total score and select items that show high correlation with the test score
- Homogeneity is an important first step in certifying the construct validity of a new test, but taken
alone it is weak evidence
Appropriate Developmental Changes
- For any test of vocabulary, an important piece of construct validity evidence would be that
older subjects score better than younger subjects, assuming that education and health factors are
held constant.
Theory-Consistent Group Differences
- Crandall developed a social interest scale that illustrates the use of theory-consistent group
differences in the process of construct validation
o To measure this construct, he devised a brief and simple instrument consisting of 15 forced
choice items.
o For each item, one of the two alternatives includes a trait closely related to the Adlerian
concept of social interest, whereas the other choice consists of an equally attractive but
non-social trait
o The subject is instructed to choose the trait "which you value more highly"
o Total score on the social interest scale can range from 0 to 15
Theory-Consistent Intervention Effects
- Another approach to construct validation is to show that test scores change in the appropriate direction
and amount in reaction to planned or unplanned interventions
Convergent and Discriminant validation
- Convergent validity is demonstrated when a test correlates highly with other variables or tests with
which it shares an overlap of constructs
- Discriminant validity is demonstrated when a test does not correlate with variables or tests from
which it should differ
o Ex: social interest and intelligence
- Campbell and Fiske proposed a systematic design for simultaneously confirming the convergent and
discriminant validities of a psychological test
o Their design is called the multitrait-multimethod matrix, and it calls for the assessment of two or
more traits by two or more methods
- When each of these tests is administered twice to the same group of subjects and scores on all pairs
of tests are correlated, the result is a multitrait-multimethod matrix
o This matrix is a rich source of data on reliability, convergent validity and discriminant
validity
- It is more common for the test developers to collect convergent and discriminant validity data in
bits and pieces rather than producing an entire matrix of intercorrelations
Factor Analysis
- The purpose of factor analysis is to identify the minimum number of determiners (factors) required
to account for the intercorrelations among a battery of tests
- The goal in factor analysis is to find a smaller set of dimensions called factors that can account for
the observed array of intercorrelations among individual tests
- A typical approach in factor analysis is to administer a battery of tests to several hundred subjects
and then calculate a correlation matrix from the scores on all possible pairs of tests
- A factor loading is actually a correlation between an individual test and a single factor
o Factor loadings can vary between –1.0 and +1.0
o The final outcome of a factor analysis is a table depicting the correlation of each test with
each factor
- A table of factor loadings helps describe the factorial composition of a test and thereby provides
information relevant to construct validity
- The Category Test is a relatively complex concept-formation test designed to be different from
traditional psychometric measures of intelligence and superior to them at detecting neurological
disorders.
Classification Accuracy
- Many tests are used for screening purposes to identify examinees who meet or don’t meet certain
diagnostic criteria
o For these instruments accurate classification is an essential index of validity
o Mini-Mental State Examination, a short screening test of cognitive functioning
Consists of a number of simple questions and easy tasks like remembering three
words
Test yields a score from 0 to 30
Used to identify early those individuals who might be experiencing dementia
Dementia is a general term that refers to significant cognitive decline and memory
loss caused by a disease process such as Alzheimer's disease or the accumulation
of small strokes
- In exploring its utility, researchers turn to two psychometric features that bear upon validity: sensitivity
and specificity
o Sensitivity has to do with accurate identification of patients who have a syndrome, in this
case dementia
o Specificity has to do with accurate identification of normal patients
- The concepts of sensitivity and specificity are chiefly helpful in dichotomous diagnostic situations in
which individuals are presumed either to manifest a syndrome or not
- Screening tests typically provide a cut off score used to identify possible cases of the syndrome in
question
- An ideal screening test, of course, would yield 100 percent sensitivity and 100 percent specificity
- Sadly, no such test exists
- Choosing a cut off score that increases sensitivity invariably will reduce specificity and vice versa
- The inverse relationship between sensitivity and specificity is not only an empirical fact but also
a logical necessity: for a given test, a cutoff that improves one must reduce the other; no exceptions are possible
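A sketch of sensitivity and specificity for a cutoff-based screen; the data, and the rule that low scores flag possible dementia (as on the Mini-Mental exam), are illustrative assumptions. Raising the cutoff increases sensitivity at the cost of specificity:

```python
def sensitivity_specificity(scores, cutoff):
    """scores: list of (test_score, has_syndrome) pairs. Scores at or
    below the cutoff are flagged as possible cases, mirroring a
    low-score screen like the Mini-Mental exam."""
    tp = sum(1 for s, ill in scores if ill and s <= cutoff)      # cases caught
    fn = sum(1 for s, ill in scores if ill and s > cutoff)       # cases missed
    tn = sum(1 for s, ill in scores if not ill and s > cutoff)   # normals cleared
    fp = sum(1 for s, ill in scores if not ill and s <= cutoff)  # normals flagged
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical screening data: (score out of 30, dementia present)
data = [(18, True), (22, True), (25, True), (27, False),
        (24, False), (29, False), (21, True), (26, False)]
for cutoff in (22, 25):
    sens, spec = sensitivity_specificity(data, cutoff)
    print(cutoff, round(sens, 2), round(spec, 2))  # tradeoff as cutoff rises
```

With the lower cutoff one case is missed but no normals are flagged; with the higher cutoff every case is caught but a normal is misclassified.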
Extravalidity Concerns and the Widening Scope of Test Validity
- Extravalidity concerns include the side effects and unintended consequences of testing
- Regarding the importance of the extravalidity domain, psychologists confirm that the decision to use a test involves
social, legal, and political considerations that extend far beyond the traditional questions of technical
validity
Unintended Side Effects of Testing
- Cole and Moss cite the example of using psychological tests to determine eligibility for special
education. Although the intended outcome is to help students learn, the process of identifying
students eligible for special education may produce numerous negative
effects:
o The identified children may feel unusual or dumb
o Other children may call the children names
o Teachers may view these children as unworthy of attention
o The process may produce classes segregated by race or social class
- A consideration of side effects should influence an examiner’s decision to use a particular test for a
specified person
- Although the MMPI was originally designed as an aid in psychiatric diagnosis subsequent research
indicated that it is also useful in the identification of persons unsuited to a career in law
enforcement
The Widening Scope of Test Validity
- The functionalist perspective explicitly recognizes that test validation carries an obligation to
determine whether a practice has constructive consequences for individuals and institutions and
especially to guard against adverse outcomes
- Test validity is an overall judgement of the adequacy and appropriateness of inferences and actions
that flow from test scores
- Validity rests on four bases:
o Traditional evidence of construct validity, for example, appropriate convergent and
discriminant validity
o An analysis of the value implications of the test interpretation
o Evidence for the usefulness of test interpretations in particular applications
o An appraisal of the potential and actual social consequences including side effects from test
use
- A valid test is one that answers well to all four facets of test validity
Utility: The Last Horizon of Test Validity
- Test utility can be summed up by the question “does use of this test result in better patient
outcomes or more efficient delivery of services?”
Test Construction
- Test construction consists of six intertwined stages
o Defining the test – determining the scope and purpose which must be known before the
developer can proceed to test construction
o Selecting a scaling method – the process of setting the rules by which numbers are assigned
to test results
o Constructing the items – is as much art as science, and it is here that the creativity of the
test developer may be required. Once a preliminary version of the test is available, the
developer usually administers it to a modest-sized sample of subjects in order to collect
initial data about test item characteristics.
o Testing the items – entails a variety of statistical procedures referred to collectively as item
analysis. The purpose of item analysis is to determine which items should be retained, which
revised, and which thrown out.
o Revising the test – if the revisions are substantial, new items and additional pretesting with
new subjects may be required. Test construction involves a feedback loop whereby second,
third, and fourth drafts of an instrument might be produced.
o Publishing the test is a final step.
Defining the Test
- In order to construct a new test, the developer must have a clear idea of what the test is to measure
and how it is to differ from existing instruments.
- Kaufman and Kaufman provide a good model of the test definition process.
o Kaufman Assessment Battery for children – a new test for general intelligence in children,
the authors listed six primary goals that define the purpose of the test and distinguish it
from existing measures:
Measure intelligence from a strong theoretical and research basis
Separate acquired factual knowledge from the ability to solve unfamiliar problems
Yield scores that translate to educational intervention
Include novel tasks
Be easy to administer and objective to score
Be sensitive to the diverse needs of preschool, minority group and exceptional
children
Selecting a Scaling Method
- The purpose of psychological testing is to assign numbers to responses on a test so that the
examinee can be judged to have more or less of the characteristic measured
o The rules by which the numbers are assigned to responses define the scaling method
- No single scaling method is uniformly better than the others. For some traits ordinal ranking of
expert judges might be the best measurement approach, for other traits, complex scaling of self
report data might yield the most valid measurements
Levels of Measurement
- In a nominal scale, the numbers serve only as category names
o Ex: when collecting data for a demographic study, a researcher can code 1 for males and 2 for
females
o Note: the numbers are arbitrary and do not designate more or less of something; numbers
are simplified forms of naming
- An ordinal scale constitutes a form of ordering or ranking
o However, ordinal rankings fail to provide information about the relative strength of rankings;
a first-place ranking does not reveal whether that option is strongly or only mildly preferred over the second
- An interval scale provides information about ranking, but also supplies a metric for gauging the
differences between rankings
o In short, interval scales are based on the assumption of equal-sized units or intervals for
the underlying scale
- A ratio scale has all the characteristics of an interval scale but also possesses a conceptually
meaningful zero point at which there is a total absence of the characteristic being measured.
o Ratio scales are rare in psychological measurement, where meaningful zero points just do not exist
o Physical measures such as weight, height, and electrodermal response qualify as ratio scales
- Levels of measurement are relevant to test construction because the most powerful and useful
parametric statistical procedures (Pearson r, analysis of variance, multiple regression) should be
used only for scores derived from measures that meet the criteria of interval or ratio scales
- For scales that are only nominal or ordinal, less powerful nonparametric statistical procedures (chi-
square, rank-order correlation, median tests) must be employed
Representative Scaling Methods – Expert Rankings
- A depth-of-coma scale could be very important in predicting the course of improvement, because it is
well known that a lengthy period of unconsciousness carries a poor prognosis for ultimate recovery
- One approach to scaling the depth of coma would be to rely on the behavioural rankings of experts
o For example, you could ask a panel of neurologists to list patient behaviours associated with
different levels of consciousness
After the experts had submitted a large list of diagnostic characteristics, the test
developers could rank the indicator behaviours along a continuum of consciousness
ranging from deep coma to basic orientation
- In addition to the rankings it is possible to compute a single overall score that is something more
than an ordinal scale although probably less than true interval level measurement
Method of Equal-Appearing Intervals
- L.L. Thurstone proposed a method for constructing interval-level scales from attitude statements
o His method of equal-appearing intervals is still used today, marking him as one of the giants
of psychometric theory
o The actual methodology for constructing an equal-appearing intervals scale is somewhat
tedious and statistically laden, but the underlying logic is easy to explain
- Thurstone's approach to item scaling has powerful applications in test development
Method of Absolute Scaling
- The method of absolute scaling is a procedure for obtaining a measure of absolute item difficulty based
on results for different age groups of test takers. The methodology for determining individual item
difficulty on an absolute scale is quite complex, although the underlying rationale is not too difficult
to understand
Likert Scales
- A Likert scale presents the examinee with five responses ordered on an agreed/disagree or
approve/disapprove continuum.
- Depending on the wording of an individual item, an extreme answer of strongly agree or strongly
disagree will indicate the most favourable response on the underlying attitude measured by the
questionnaire
- Likert assigned a score of 5 to this extreme response, 1 to the opposite extreme, and 2, 3, and 4 to
intermediate replies. The total scale score is obtained by adding the scores from individual items.
For this reason, a Likert scale is also referred to as a summative scale.
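Likert's summative scoring can be sketched as follows; the reverse-keying of unfavourably worded items is standard practice, though not spelled out in the notes:

```python
def score_likert(responses, reverse_keyed=()):
    """Sum 1-5 Likert responses into a summative scale score.

    Items listed in reverse_keyed (an assumed convention for
    unfavourably worded items) are flipped so that 5 always marks
    the favourable end of the attitude continuum."""
    total = 0
    for i, r in enumerate(responses):
        total += (6 - r) if i in reverse_keyed else r
    return total

# Five items; item 2 is worded in the unfavourable direction
print(score_likert([5, 4, 1, 3, 5], reverse_keyed={2}))  # 22
```

A five-item scale of this kind yields totals from 5 (least favourable) to 25 (most favourable).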
Guttman Scales
- On a Guttman scale, respondents who endorse one statement also agree with milder statements
pertinent to the same underlying continuum
- Guttman scales are produced by selecting items that fall into an ordered sequence of examinee
endorsement. A perfect Guttman scale is seldom achieved because of errors of measurement but is
nonetheless a fitting goal for certain types of tests.
- Originally devised to determine whether a set of attitude statements is unidimensional, the
technique has been used in many different kinds of tests
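A perfect Guttman pattern can be checked mechanically; this sketch assumes the statements are stored in order from mildest to most extreme:

```python
def fits_guttman(responses):
    """True if a response pattern fits a perfect Guttman scale:
    endorsing a statement implies endorsing every milder (earlier)
    statement. Items are assumed ordered mild -> extreme."""
    return all(not later or earlier
               for earlier, later in zip(responses, responses[1:]))

print(fits_guttman([True, True, False, False]))   # True: clean cutoff
print(fits_guttman([True, False, True, False]))   # False: scale error
```

In real data, scalogram analysis tallies how many such violations occur rather than demanding a perfect pattern.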
Method of Empirical Keying
- In the method of empirical keying, test items are selected for a scale based entirely on how well they
contrast a criterion group from a normative sample
- A Depression scale could be derived from a pool of true-false personality inventory questions in the
following manner
o A carefully selected and homogeneous group of persons experiencing major depression is
gathered to answer the pool of true/false questions
o For each item, the endorsement frequency of the depression group is compared to
endorsement frequency of the normative sample
o Items that show a large difference in endorsement frequency between the depression and
normative samples are selected for the Depression scale, keyed in the direction favoured by
depressed subjects (true or false, as appropriate)
o Raw score on the depression scale is then simply the number of items answered in the
keyed direction
- A common finding is that some items selected for a scale may show no obvious relation to the
construct measured
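The empirical-keying steps above can be sketched as follows; the endorsement frequencies and the 0.25 selection threshold are illustrative assumptions:

```python
def empirically_key(criterion_freqs, normative_freqs, min_diff=0.25):
    """Select items whose true-endorsement frequency differs between
    the criterion (e.g., depressed) group and the normative sample by
    at least min_diff (an arbitrary threshold for illustration).
    Each selected item is keyed True if the criterion group endorses
    it more often, False otherwise."""
    scale = []
    for item, (p_crit, p_norm) in enumerate(zip(criterion_freqs,
                                                normative_freqs)):
        diff = p_crit - p_norm
        if abs(diff) >= min_diff:
            scale.append((item, diff > 0))
    return scale

# Hypothetical endorsement frequencies for four true/false items
depressed = [0.80, 0.50, 0.10, 0.90]
normative = [0.30, 0.45, 0.60, 0.85]
print(empirically_key(depressed, normative))  # [(0, True), (2, False)]
```

The raw score is then simply the count of items a respondent answers in the keyed direction, which is why empirically keyed items need not show any obvious relation to the construct.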
Rational Scale Construction (Internal Consistency)
- The rational approach to scale construction is a popular method for the development of self report
personality inventories
- The heart of the method of rational scaling is that all scale items correlate positively with each other
and also with the total score for the scale
- For instance, if the scale is designed to identify college students with leadership potential, then it
should be administered to a cross section of several hundred college students. For scale
development, very large samples are desirable.
- The next step in rational scale construction is to correlate scores on each of the preliminary items
with the total score on the test for the 500 subjects in the tryout sample
o Scores on the items are dichotomous
- Up to half of the initial items might be discarded
- If a large proportion of items is initially discarded, the researcher might recalculate item-total
correlations based upon the reduced item pool to verify the homogeneity of the remaining items
Constructing the items
- The item writer is confronted with a profusion of initial questions
o Should item content be homogeneous or varied?
o What range of difficulty should the items cover?
o How many initial items should be constructed?
o Which cognitive processes and item domains should be tapped?
o What kind of test item should be used?
Initial Questions in Test construction
- The first question pertains to the homogeneity versus heterogeneity of the test items
- In large measure whether item content is homogeneous or varied is dictated by the manner in
which the test developer has defined the new instrument
- The test developer might seek to incorporate novel problems equally unfamiliar to all examinees
- On the other hand, with a theory-based test of spatial thinking, subscales with homogeneous item
content would be required
- The range of item difficulty must be sufficient to allow for meaningful differentiations of examinees
at both extremes
- A ceiling effect is observed when significant numbers of examinees obtain perfect or near perfect
scores
o the problem with a ceiling effect is that distinctions between high-scoring examinees are
not possible, even though these examinees might differ substantially on the underlying trait
measured by the test
- A floor effect is observed when significant numbers of examinees obtain scores at or near the
bottom of the scale
- Test developers expect that some initial items will prove to make ineffectual contributions to the
overall measurement goal of their instrument
- For this reason, the first draft usually contains excess items, perhaps double the number of questions
desired on the final draft
Table of Specifications
- A table of specifications enumerates the information and cognitive tasks on which examinees are to
be assessed
Item Formats
- Proponents of multiple-choice methodology argue that properly constructed items can measure
conceptual as well as factual knowledge. Multiple-choice tests also permit quick and objective machine
scoring.
o The fairness of multiple-choice questions can be proved (or occasionally disproved) with
very simple item analysis procedures discussed subsequently
o The major shortcomings of multiple-choice questions are, first, the difficulty of writing good
distractor options and, second, the possibility that the presence of the correct response may cue a
half-knowledgeable respondent to the correct answer
Testing the Items
- Test developers use item analysis, a family of statistical procedures, to identify the best items
o To determine which item should be retained, which should be revised and which should be
thrown out
- In conducting a thorough item analysis, the test developer might make use of the item difficulty index,
item reliability index, item validity index, item characteristic curve, and an index of item
discrimination
Item Difficulty Index
- The item difficulty for a single test item is defined as the proportion of examinees in a large tryout
sample who get that item correct
- Item difficulty index is a useful tool for identifying items that should be altered or discarded
- Unfortunately, an item that nearly everyone passes or nearly everyone fails is psychometrically
unproductive because it provides no information about differences between examinees
- For most applications, such an item should be rewritten or thrown out
- Generally item difficulties that hover around 0.5 ranging between 0.3 and 0.7 maximize the
information the test provides about differences between examinees.
- For true false or multiple choice items, the optimal level of item difficulty needs to be adjusted for
the effects of guessing
- If a test is to be used for selection of an extreme group by means of a cutting score it may be
desirable to select items with difficulty levels outside the 0.3 and 0.7 range
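The item difficulty index and the 0.3 to 0.7 screening rule can be sketched as follows; the answer matrix is hypothetical:

```python
def item_difficulties(responses):
    """responses: list of per-examinee answer vectors (1 = correct).
    Returns the proportion of examinees passing each item."""
    n = len(responses)
    return [sum(col) / n for col in zip(*responses)]

def flag_items(difficulties, low=0.3, high=0.7):
    """Flag items outside the roughly 0.3-0.7 band that maximizes
    information about differences between examinees."""
    return [i for i, p in enumerate(difficulties) if not low <= p <= high]

# Four examinees x four items (1 = correct, 0 = incorrect)
answers = [[1, 1, 0, 1],
           [1, 0, 0, 1],
           [1, 1, 1, 1],
           [1, 0, 0, 1]]
diffs = item_difficulties(answers)
print(diffs)              # [1.0, 0.5, 0.25, 1.0]
print(flag_items(diffs))  # [0, 2, 3]
```

Items 0 and 3 (everyone passes) and item 2 (too hard) would be candidates for revision, unless an extreme cutting score is intended.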
Item Reliability Index
- A test developer may desire an instrument with a high level of internal consistency in which the
items are reasonably homogeneous
- However, individual items are typically scored right or wrong (often 1 or 0), whereas total scores
constitute a continuous variable
o In order to correlate these two different kinds of scores it is necessary to use a special type
of statistic called the POINT-BISERIAL CORRELATION COEFFICIENT
- The computational formula for this correlation coefficient is algebraically equivalent to the Pearson
r discussed earlier, and the point-biserial coefficient conveys much the same kind of information
about the relationship between two variables (one of which happens to be a dichotomous 0/1 score)
- The usefulness of an individual dichotomous test item is also determined by the extent to which
scores on it are distributed between the two outcomes of 0 and 1
- The more closely the item approaches a 50-50 split of right and wrong scores, the greater is its
standard deviation
o In general, the greater the standard deviation of an item, the more useful the item is to the
overall scale
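The point-biserial correlation named above can be sketched directly from its standard formula; the example data are invented:

```python
import statistics

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a dichotomous item (0/1) and a
    continuous total score; algebraically equivalent to a Pearson r."""
    p = sum(item_scores) / len(item_scores)        # proportion passing
    q = 1 - p
    mean_pass = statistics.fmean(
        t for i, t in zip(item_scores, total_scores) if i == 1)
    mean_all = statistics.fmean(total_scores)
    sd_all = statistics.pstdev(total_scores)
    return (mean_pass - mean_all) / sd_all * (p / q) ** 0.5

# Invented example: examinees who pass the item tend to score higher overall
print(round(point_biserial([1, 1, 0, 0], [3, 4, 1, 2]), 3))   # 0.894
```

A high positive value indicates that passing the item goes with a high total score, which is what a homogeneous scale requires.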
Item Validity Index
- The item validity index is a useful tool in the psychometrician’s quest to identify predictively useful
test items
- By computing the item validity index for every item in the preliminary test, the test developer can
identify ineffectual items, eliminate or rewrite them, and produce a revised instrument with greater
practical utility.
- The first step in computing an item validity index is to find the point-biserial correlation between
the item score and the score on the criterion variable
Item Characteristic Curves
- An item characteristic curve (ICC) is a graphical display of the relationship between the probability
of a correct response and the examinee's position on the underlying trait measured by the test
- An ICC is actually a mathematical idealization of the relationship between the probability of a
correct response and the amount of the trait possessed by test respondents
- Different ICC models use different mathematical functions based on initial assumptions
o The simplest ICC model is the Rasch model, which makes two assumptions:
Test items are unidimensional and measure one common trait
Test items vary on a continuum of difficulty level
o The normal ogive is simply the normal distribution graphed in cumulative form
- Psychometric purists would prefer that test item ICCs approximate the normal ogive because this
curve is convenient for making mathematical deductions about the underlying trait
- ICCs are especially useful for identifying items that perform differently for subgroups of examinees
- The underlying theory of ICC is also known as item response theory and latent trait theory
o The usefulness of this approach has been questioned by Nunnally, who points out that the
assumption of test unidimensionality is violated by many psychological tests
- The merits of the ICC approach are still debated
o ICC theory seems particularly appropriate for certain forms of computerized adaptive
testing (CAT) in which each test taker responds to an individual and unique set of items that
are then scored on an underlying uniform scale
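The Rasch model named above is commonly written as a one-parameter logistic function, which produces the S-shaped (ogive-like) curve the notes describe. A minimal sketch:

```python
import math

def rasch_icc(theta, b):
    """Rasch (one-parameter logistic) ICC: probability of a correct
    response for an examinee with trait level theta on an item of
    difficulty b, both expressed on the same latent scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability exactly equals item difficulty, the probability is .50;
# as ability rises above difficulty, the curve climbs toward 1.0.
print(rasch_icc(0.0, 0.0))             # 0.5
print(round(rasch_icc(2.0, 0.0), 3))   # 0.881
```

Plotting this function for each item, and comparing curves across subgroups, is how differently performing items are spotted in practice.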
Item Discrimination Index
- Implicit in the discussion of ICCs is the notion that an effective test item is one that discriminates
between high scorers and low scorers on the entire test
- An ideal test item is one that most of the high scorers pass and most of the low scorers fail
- If the slope of the curve is positive and the curve is preferably ogive shaped, the item is doing a
good job of separating high and low scorers. But visual inspection is not a completely objective
procedure; what is needed is a statistical tool that summarizes the discriminating power of
individual test items
- An item discrimination index is a statistical index of how efficiently an item discriminates between
persons who obtain high and low scores on the entire test
- A test developer can supplement the item discrimination approach by inspecting the number of
examinees in the upper and lower scoring groups who choose each of the incorrect alternatives
- If a multiple choice item is well written the incorrect alternatives should be equally attractive to
subjects who do not know the correct answer
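One widely used form of the discrimination index (an assumption here, since the notes do not give the formula) is d = (U − L) / T, the difference between the proportions of the upper- and lower-scoring groups who pass the item:

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """d = (U - L) / T, where U and L are the numbers of examinees in the
    upper- and lower-scoring groups who pass the item, and T is the size
    of each group. One common formula among several variants."""
    return (upper_correct - lower_correct) / group_size

# Invented tryout result: 24 of the top 30 scorers pass the item,
# but only 9 of the bottom 30 do
print(discrimination_index(24, 9, 30))   # 0.5
```

Values near +1 indicate strong discrimination; values near zero or negative values flag items that fail to separate high and low scorers and should be revised or discarded.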
Revising the Test
- It is common in the evolutionary process of test development that many items are revised or
replaced and new items added
- This revised test likely contains more discriminating items with higher reliability and greater
predictive accuracy but these improvements are known to be true only for the first tryout sample
- The next step in test development is to collect new data from a second tryout sample
- If major changes are needed, it is desirable to collect data from a third and perhaps even a fourth
tryout sample
Cross Validation
- The term cross validation refers to the practice of using the original regression equation in a new
sample to determine whether the test predicts the criterion as well as it did in the original sample
Validity Shrinkage
- A common discovery in cross validation research is that a test predicts the relevant criterion less
accurately with the new sample of examinees than with the original tryout sample. The term
validity shrinkage is applied to this phenomenon
- Validity shrinkage is an inevitable part of test development and underscores the need for cross
validation
- Shrinkage can be a major problem when the derivation and cross validation samples are small, the
number of potential test items is large, and items are chosen on a purely empirical basis without
theoretical rationale
- Validity must be demonstrated through cross validation, not assumed merely from the good
intentions behind a new instrument
Feedback from Examinees
- The EFeQ is a short and simple questionnaire designed to elicit candid opinions from examinees
about the following features of the test-examiner-respondent matrix:
o Behaviour of examiners
o Testing conditions
o Clarity of exam instructions
o Convenience in using the answer sheet
o Perceived suitability of the test
o Perceived cultural fairness of the test
o Perceived sufficiency of time
o Perceived difficulty of the test
o Emotional response to the test
o Level of guessing
o Cheating by the examinee or others
o A final open-ended question asks what the examinee thought of the test and what could be
improved
Publishing the test
- The test construction process does not end with the collection of cross validation data
- The test developer also must oversee the production of the testing materials, publish a technical
manual and produce a user’s manual
Production of testing materials
- The testing materials must be user friendly if they are to receive wide acceptance by psychologists
and educators
- If it is possible for the test developer to simplify the duties of the examiner while leaving examinee
task demands unchanged, the resulting instrument will have much greater acceptability to potential
users
Technical Manual and User’s Manual
- Technical data about a new instrument are usually summarized with appropriate references in a
technical manual
- In some cases this information is incorporated into the user's manual, which gives instructions for
administration and also provides guidelines for test interpretation
- Test manuals serve many purposes, as outlined in the Standards for Educational and Psychological
Testing
- The influential Standards manual suggests that test manuals accomplish the following goals:
o Describe the rationale and recommended uses for the test
o Provide specific cautions against anticipated misuses of a test
o Cite representative studies regarding general and specific test uses
o Identify special qualifications needed to administer and interpret the test
o Provide revisions, amendments, and supplements as needed
o Cite quantitative relationships between test scores and criteria
o Report on the degree to which alternative modes of response are interchangeable
o Provide appropriate interpretive aids to the test taker
o Furnish evidence of the validity of any automated test interpretations
- Test manuals should provide the essential data on reliability and validity rather than referring the
user to other sources – an unfortunate practice encountered in some test manuals