TRANSCRIPT
Chapter 4 – Validity and Test Development
- Note: Reliability is a necessary but not a sufficient precursor of validity
- Test developers have a responsibility to demonstrate that new instruments fulfill the purposes for
which they are designed
Validity: A definition
- Validity – a test is valid to the extent that inferences made from it are appropriate, meaningful and
useful.
- A test score is per se meaningless until the examiner draws inferences from it based on the test
manual or other research findings.
- Unfortunately, it is very seldom possible to summarize the validity of a test in terms of a
single, tidy statistic
o Determining whether inferences are appropriate, meaningful, and useful typically requires
numerous studies of the relationships between test performances and independently
observed behaviours.
- Validity reflects an evolutionary, research-based judgement of how adequately a test measures the
attribute it was designed to measure
- Traditionally the different ways of accumulating validity evidence have been grouped into
categories:
o Content validity
o Criterion-related validity
o Construct validity
Content validity
- Content validity is determined by the degree to which the questions, tasks, or items on a test are
representative of the universe of behaviour the test was designed to sample.
o Nothing more than a sampling issue
o The items of a test can be visualized as a sample drawn from a larger population of
potential items that define what the researcher really wishes to measure
If the sample (specific items on the test) is representative of the population (all
possible items), then the test possesses content validity
- Content validity is a useful concept when a great deal is known about the variable that the
researcher wishes to measure
- When evaluating content validity, response specification is also an integral part of defining the
relevant universe of behaviours
o For example: in reference to spelling achievement, it cannot be assumed that a multiple
choice test will measure the same spelling skills as an oral test or a frequency count of
misspellings in written compositions
- Content validity is more difficult to assure when the test measures an ill-defined trait
Quantification of Content Validity
- A coefficient of content validity can be derived from the following formula:
o Content validity = D / (A + B + C + D)
- The commonsense approach to content validity advocated here serves well as a flagging mechanism
to help cull out existing items that are deemed inappropriate by expert raters
- A test could possess a robust coefficient of content validity and still fall short in subtle ways
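The formula above can be sketched in code. The interpretation of A through D as the cells of a two-rater agreement table, with D the count of items both expert raters judge relevant, is an assumption, since the notes do not define the letters.

```python
def content_validity(a, b, c, d):
    """Coefficient of content validity: D / (A + B + C + D).

    Assumed interpretation (not spelled out in the notes): two expert
    raters classify each item as relevant or not; d counts items BOTH
    raters judge relevant, and a, b, c are the remaining cells.
    """
    return d / (a + b + c + d)

# Example: 2 items rejected by both raters, 3 split decisions,
# 15 items endorsed by both raters
print(content_validity(2, 1, 2, 15))  # 0.75
```

A value near 1.0 indicates strong agreement that the retained items sample the intended domain.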
Face Validity
- A test has face validity if it looks valid to test users, examiners and especially the examinees
- Face validity is really a matter of social acceptability, not a technical form of validity in the same category as
content, criterion-related, or construct validity
- In fact, a test could possess extremely strong face validity (the items might look highly relevant to
what is presumably measured by the instrument) yet produce totally meaningless scores with no
predictive utility whatsoever
Criterion-Related Validity
- Criterion related validity is demonstrated when a test is shown to be effective in estimating an
examinee’s performance on some outcome measure
- The variable of primary interest is the outcome measure called a criterion
- Two different approaches to validity evidence are subsumed under the heading of criterion-related
validity:
o In concurrent validity, the criterion measures are obtained at approximately the same time
as the test scores
For example: the current psychiatric diagnosis of patients would be an appropriate
criterion measure to provide validity evidence for a paper-and-pencil psychodiagnostic test
o In predictive validity, the criterion measures are obtained in the future, usually months or
years after the test scores are obtained, as with college grades predicted from an
entrance exam
Characteristics of a Good Criterion
- The criterion must itself be reliable if it is to be a useful index of what the test measures.
- An unreliable criterion will be inherently unpredictable regardless of the merits of the test
- To the extent that the reliability of the test, the criterion, or both is low, the validity coefficient
is also diminished
o The validity coefficient is the resulting correlation coefficient
- A criterion must also be free of contamination from the test itself.
o For example: screening tests of psychiatric symptoms often check for changes in eating,
sleeping, or social activities. Unfortunately, the SRE incorporates questions that check the
following: change in eating habits, change in sleeping habits, change in social activities, so if
the screening test contains the same items as the SRE, then the correlation between these
two measures will be artificially inflated. This potential source of error in test validation is
referred to as criterion contamination, since the criterion is contaminated by its artificial
commonality with the test.
- Criterion contamination is also possible when the criterion consists of ratings from experts.
o If the experts possess the knowledge of the examinee’s test scores, this information may
influence their ratings.
Concurrent validity
- In a concurrent validation study, test scores and criterion information are obtained simultaneously
- A test with demonstrated validity provides a shortcut for obtaining information that might
otherwise require the extended investment of professional time
- Correlations between a new test and existing tests are often cited as evidence of concurrent validity
o Old tests validating a new test – but it is nonetheless appropriate if two conditions are met:
The criterion (existing) tests must have been validated through correlations with
appropriate nontest behavioural data
The instrument being validated must measure the same construct as the criterion
tests
Predictive Validity
- In a predictive validation study, test scores are used to estimate outcome measures obtained at a
later date
- Predictive validity is particularly relevant for entrance examinations and employment tests
- When tests are used for purpose of prediction, it is necessary to develop a regression equation
o A regression equation describes the best fitting straight line for estimating the criterion
from the test
o Regression equation
Y = 0.7X + 0.2
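The illustrative regression equation above can be applied directly; a minimal sketch, treating Y as the predicted criterion score and X as the test score, with the slope and intercept values taken from the example:

```python
def predict_criterion(x, slope=0.7, intercept=0.2):
    """Best-fitting straight line for estimating the criterion
    from the test: Y' = 0.7X + 0.2 (illustrative values)."""
    return slope * x + intercept

# Predicted criterion scores for a few hypothetical test scores
for score in (1.0, 2.0, 3.0):
    print(score, round(predict_criterion(score), 2))
```

In practice the slope and intercept would be estimated by least squares from a validation sample, not assumed.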
Validity Coefficient and the Standard Error of the Estimate
- The resulting correlation is known as the validity coefficient
- The higher the validity coefficient, the more accurate is the test in predicting the criterion
- In the hypothetical case in which rxy is 1.00, the test would possess perfect validity and allow for flawless
prediction
- The standard error of estimate is the margin of error to be expected in the predicted criterion score
- The standard error of measurement indicates the margin of measurement error caused by the
unreliability of the test, whereas the standard error of estimate indicates the margin of prediction error
caused by the imperfect validity of the test
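A sketch of the standard error of estimate, using the standard formula SE_est = SD_Y * sqrt(1 - r_xy^2); the criterion standard deviation and validity coefficient used here are hypothetical:

```python
import math

def standard_error_of_estimate(sd_y, r_xy):
    """SE_est = SD_Y * sqrt(1 - r_xy**2): the margin of prediction
    error caused by the imperfect validity of the test."""
    return sd_y * math.sqrt(1 - r_xy ** 2)

# Perfect validity (r = 1.00) allows flawless prediction: zero error
print(standard_error_of_estimate(15.0, 1.0))            # 0.0
# Hypothetical criterion SD of 15 and validity coefficient of .60
print(round(standard_error_of_estimate(15.0, 0.6), 4))  # 12.0
```

Note how even a respectable validity coefficient of .60 leaves a wide margin of prediction error.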
Decision Theory Applied to Psychological Tests
- Proponents of decision theory stress that the purpose of psychological testing is not measurement
per se but measurement in the service of decision making
- Certain combinations of predicted and actual outcomes are more likely than others. If a test has
good predictive validity, then most persons predicted to succeed will succeed and most persons
predicted to fail will fail.
- No selection test is a perfect predictor so two other types of outcomes are also possible. Some
persons predicted to succeed will in fact fail.
o These cases are referred to as false positives
- Some persons predicted to fail would, if given the chance, succeed
o Case referred to as false negatives
- False positives and false negatives are collectively known as misses, because in both cases the test
has made an inaccurate prediction
o Hit rate = hits / (hits + misses)
- Proponents of decision theory make two fundamental assumptions about the use of selection tests:
o The value of various outcomes to the institution can be expressed in terms of a common
utility scale. One such scale – but by no means the only one – is profit and loss. For example
when using an interest inventory to select salespersons, a corporation can anticipate profit
from applicants correctly identified as successful but will lose money when inevitably some
of those selected do not sell enough even to support their own salary (false positives). The
cost of the selection procedure must also be factored in to the utility scale as well
o In institution selection decisions, the most generally useful strategy is one that maximizes
the average gain on the utility scale (or minimizes average loss) over many similar decisions.
For example which selection ratio produces the largest average gain on the utility scale?
Maximization is the most fundamental decision principle.
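The four decision outcomes and the hit-rate formula above can be sketched as follows; the predicted/actual pairs are hypothetical:

```python
def tally_outcomes(predictions):
    """Classify (predicted_success, actual_success) pairs into the
    decision-theory outcomes and compute the hit rate."""
    counts = {"hit": 0, "false_positive": 0, "false_negative": 0}
    for predicted, actual in predictions:
        if predicted == actual:
            counts["hit"] += 1               # accurate prediction
        elif predicted and not actual:
            counts["false_positive"] += 1    # predicted success, failed
        else:
            counts["false_negative"] += 1    # predicted failure, succeeded
    misses = counts["false_positive"] + counts["false_negative"]
    counts["hit_rate"] = counts["hit"] / (counts["hit"] + misses)
    return counts

# Hypothetical selection results: (predicted to succeed, actually succeeded)
results = [(True, True), (True, True), (True, False),
           (False, False), (False, True)]
print(tally_outcomes(results))  # 3 hits, 1 false positive, 1 false negative
```

A fuller decision-theory analysis would weight each cell by its utility (e.g., profit and loss) rather than counting all misses equally.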
Construct Validity
- This is the most difficult and elusive type of validity evidence
- A construct is a theoretical, intangible quality or trait in which individuals differ
o Examples of constructs include: leadership ability, over controlled hostility, depression and
intelligence
- Constructs are theorized to have some form of independent existence and to exert broad but to
some extent predictable influences on human behaviour
- A test designed to measure a construct must estimate the existence of an inferred, underlying
characteristic based on a limited sample of behaviour
- All psychological constructs possess two characteristics in common:
o There is no single external referent sufficient to validate the existence of the construct; that
is the construct cannot be operationally defined
o Nonetheless, a network of interlocking suppositions can be derived from existing theory
about the construct
o Psychopathy is surely a construct, in that there is no single behavioural characteristic or
outcome sufficient to determine who is strongly psychopathic and who is not
o Construct validity pertains to psychological tests that claim to measure complex,
multifaceted, and theory bound psychological attributes such as psychopathy, intelligence,
leadership ability and the like
o To evaluate the construct validity of a test, we must amass a variety of evidence from
numerous sources
o Many psychometric theorists regard construct validity as the unifying concept for all types
of validity evidence
Approaches to construct validity
- Most studies of construct validity fall into one of the following categories
o Analysis to determine whether the test items or subtests are homogeneous and therefore
measure a single construct
o Study of developmental changes to determine whether they are consistent with the theory
of the construct
o Research to ascertain whether group differences on test scores are theory-consistent
o Analysis to determine whether intervention effects on test scores are theory-consistent
o Correlation of the test with other related and unrelated tests and measures
o Factor analysis of test scores in relation to other sources of information
o Analysis to determine whether test scores allow for the correct classification of examinees.
Test Homogeneity
- If a test measures a single construct, then its component items or subtests likely will be
homogeneous (also referred to as internally consistent)
- The aim of test development is to select items that form a homogeneous scale
- The most commonly used method for achieving this goal is to correlate each potential item with the
total score and select items that show high correlation with the test score
- Homogeneity is an important first step in certifying the construct validity of a new test, but taken
alone it is weak evidence
Appropriate Developmental Changes
- For any test of vocabulary, an important piece of construct validity evidence would be that
older subjects score better than younger subjects, assuming that education and health factors are
held constant.
Theory-Consistent Group Differences
- Crandall developed a social interest scale that illustrates the use of theory-consistent group
differences in the process of construct validation
o To measure this construct, he devised a brief and simple instrument consisting of 15 forced
choice items.
o For each item, one of the two alternatives includes a trait closely related to the Adlerian
concept of social interest, whereas the other choice consists of an equally attractive but
non-social trait
o The subject is instructed to choose the trait "which you value more highly"
o Total score on the social interest scale can range from 0 to 15
Theory-Consistent Intervention Effects
- Another approach to construct validation is to show that test scores change in the appropriate direction
and amount in reaction to planned or unplanned interventions
Convergent and Discriminant validation
- Convergent validity is demonstrated when a test correlates highly with other variables or tests with
which it shares an overlap of constructs
- Discriminant validity is demonstrated when a test does not correlate with variables or tests from
which it should differ
o Ex: social interest and intelligence
- Campbell and Fiske proposed a systematic design for simultaneously confirming the convergent and
discriminant validities of a psychological test
o Their design is called the multitrait-multimethod matrix, and it calls for the assessment of two or
more traits by two or more methods
- When each of these tests is administered twice to the same group of subjects and scores on all pairs
of tests are correlated, the result is a multitrait-multimethod matrix
o This matrix is a rich source of data on reliability, convergent validity and discriminant
validity
- It is more common for the test developers to collect convergent and discriminant validity data in
bits and pieces rather than producing an entire matrix of intercorrelations
Factor Analysis
- The purpose of factor analysis is to identify the minimum number of determiners (factors) required
to account for the intercorrelations among a battery of tests
- The goal in factor analysis is to find a smaller set of dimensions called factors that can account for
the observed array of intercorrelations among individual tests
- A typical approach in factor analysis is to administer a battery of tests to several hundred subjects
and then calculate a correlation matrix from the scores on all possible pairs of tests
- A factor loading is actually a correlation between an individual test and a single factor
o Factor loadings can vary between –1.0 and +1.0
o The final outcome of a factor analysis is a table depicting the correlation of each test with
each factor
- A table of factor loadings helps describe the factorial composition of a test and thereby provides
information relevant to construct validity
- The Category Test is a relatively complex concept-formation test designed to be different from
traditional psychometric measures of intelligence and superior to them at detecting neurological
disorders.
Classification Accuracy
- Many tests are used for screening purposes to identify examinees who meet or don’t meet certain
diagnostic criteria
o For these instruments accurate classification is an essential index of validity
o Mini-Mental State Examination, a short screening test of cognitive functioning
Consists of a number of simple questions and easy tasks like remembering three
words
Test yields a score from 0 to 30
Used to identify early those individuals who might be experiencing dementia
Dementia is a general term that refers to significant cognitive decline and memory
loss caused by a disease process such as Alzheimer's disease or the accumulation
of small strokes
- In exploring its utility, researchers turn to two psychometric features that bear upon validity: sensitivity
and specificity
o Sensitivity has to do with accurate identification of patients who have a syndrome, in this
case dementia
o Specificity has to do with accurate identification of normal patients
- The concepts of sensitivity and specificity are chiefly helpful in dichotomous diagnostic situations in
which individuals are presumed either to manifest a syndrome or not
- Screening tests typically provide a cut off score used to identify possible cases of the syndrome in
question
- An ideal screening test, of course, would yield 100 percent sensitivity and 100 percent specificity
- Sadly, no such test exists
- Choosing a cut off score that increases sensitivity invariably will reduce specificity and vice versa
- The inverse relationship between sensitivity and specificity is not only an empirical fact but also
a logical necessity: for a given test, a cutoff that improves one must reduce the other; no exceptions are possible
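A sketch of sensitivity and specificity for a cutoff-based screen; the data, and the rule that low scores flag possible dementia (as on the Mini-Mental exam), are illustrative assumptions. Raising the cutoff increases sensitivity at the cost of specificity:

```python
def sensitivity_specificity(scores, cutoff):
    """scores: list of (test_score, has_syndrome) pairs. Scores at or
    below the cutoff are flagged as possible cases, mirroring a
    low-score screen like the Mini-Mental exam."""
    tp = sum(1 for s, ill in scores if ill and s <= cutoff)      # cases caught
    fn = sum(1 for s, ill in scores if ill and s > cutoff)       # cases missed
    tn = sum(1 for s, ill in scores if not ill and s > cutoff)   # normals cleared
    fp = sum(1 for s, ill in scores if not ill and s <= cutoff)  # normals flagged
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical screening data: (score out of 30, dementia present)
data = [(18, True), (22, True), (25, True), (27, False),
        (24, False), (29, False), (21, True), (26, False)]
for cutoff in (22, 25):
    sens, spec = sensitivity_specificity(data, cutoff)
    print(cutoff, round(sens, 2), round(spec, 2))  # tradeoff as cutoff rises
```

With the lower cutoff one case is missed but no normals are flagged; with the higher cutoff every case is caught but a normal is misclassified.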
Extravalidity Concerns and the Widening Scope of Test Validity
- Extravalidity concerns include the side effects and unintended consequences of testing
- Regarding the importance of the extravalidity domain, psychologists confirm that the decision to use a test involves
social, legal, and political considerations that extend far beyond the traditional questions of technical
validity
Unintended Side Effects of Testing
- Cole and Moss cite the example of using psychological tests to determine eligibility for special
education. Although the intended outcome is to help students learn, the process of identifying
students eligible for special education may produce numerous negative
effects:
o The identified children may feel unusual or dumb
o Other children may call the children names
o Teachers may view these children as unworthy of attention
o The process may produce classes segregated by race or social class
- A consideration of side effects should influence an examiner’s decision to use a particular test for a
specified person
- Although the MMPI was originally designed as an aid in psychiatric diagnosis subsequent research
indicated that it is also useful in the identification of persons unsuited to a career in law
enforcement
The Widening Scope of Test Validity
- The functionalist perspective explicitly recognizes that test validation carries an obligation to
determine whether a practice has constructive consequences for individuals and institutions and
especially to guard against adverse outcomes
- Test validity is an overall judgement of the adequacy and appropriateness of inferences and actions
that flow from test scores
- Validity rests on four bases:
o Traditional evidence of construct validity, for example, appropriate convergent and
discriminant validity
o An analysis of the value implications of the test interpretation
o Evidence for the usefulness of test interpretations in particular applications
o An appraisal of the potential and actual social consequences including side effects from test
use
- A valid test is one that answers well to all four facets of test validity
Utility: The Last Horizon of Test Validity
- Test utility can be summed up by the question “does use of this test result in better patient
outcomes or more efficient delivery of services?”
Test Construction
- Test construction consists of six intertwined stages
o Defining the test – determining the scope and purpose which must be known before the
developer can proceed to test construction
o Selecting a scaling method – the process of setting the rules by which numbers are assigned
to test results
o Constructing the items – is as much art as science, and it is here that the creativity of the
test developer may be required. Once a preliminary version of the test is available, the
developer usually administers it to a modest-sized sample of subjects in order to collect
initial data about test item characteristics.
o Testing the items – entails a variety of statistical procedures referred to collectively as item
analysis. The purpose of item analysis is to determine which items should be retained, which
revised, and which thrown out.
o Revising the test – if the revisions are substantial, new items and additional pretesting with
new subjects may be required. Test construction involves a feedback loop whereby second,
third, and fourth drafts of an instrument might be produced.
o Publishing the test is a final step.
Defining the Test
- In order to construct a new test, the developer must have a clear idea of what the test is to measure
and how it is to differ from existing instruments.
- Kaufman and Kaufman provide a good model of the test definition process.
o Kaufman Assessment Battery for children – a new test for general intelligence in children,
the authors listed six primary goals that define the purpose of the test and distinguish it
from existing measures:
Measure intelligence from a strong theoretical and research basis
Separate acquired factual knowledge from the ability to solve unfamiliar problems
Yield scores that translate to educational intervention
Include novel tasks
Be easy to administer and objective to score
Be sensitive to the diverse needs of preschool, minority group and exceptional
children
Selecting a Scaling Method
- The purpose of psychological testing is to assign numbers to responses on a test so that the
examinee can be judged to have more or less of the characteristic measured
o The rules by which the numbers are assigned to responses define the scaling method
- No single scaling method is uniformly better than the others. For some traits ordinal ranking of
expert judges might be the best measurement approach, for other traits, complex scaling of self
report data might yield the most valid measurements
Levels of Measurement
- In a nominal scale, the numbers serve only as category names
o Ex: when collecting data for a demographic study, a researcher can code 1 for males and 2 for
females
o Note: the numbers are arbitrary and do not designate more or less of something; numbers
are simplified forms of naming
- An ordinal scale constitutes a form of ordering or ranking
o However, ordinal rankings fail to provide information about the relative strength of rankings;
a first-place ranking does not reveal whether that option is strongly or only mildly preferred over the second
- An interval scale provides information about ranking, but also supplies a metric for gauging the
differences between rankings
o In short, interval scales are based on the assumption of equal-sized units or intervals for
the underlying scale
- A ratio scale has all the characteristics of an interval scale but also possesses a conceptually
meaningful zero point at which there is a total absence of the characteristic being measured.
o Ratio scales are rare in psychological measurement, where meaningful zero points just do not exist
o Physical measures such as weight, height, and electrodermal response qualify as ratio scales
- Levels of measurement are relevant to test construction because the most powerful and useful
parametric statistical procedures (Pearson r, analysis of variance, multiple regression) should be
used only for scores derived from measures that meet the criteria of interval or ratio scales
- For scales that are only nominal or ordinal, less powerful nonparametric statistical procedures (chi-
square, rank-order correlation, median tests) must be employed
Representative Scaling Methods – Expert Rankings
- A depth-of-coma scale could be very important in predicting the course of improvement, because it is
well known that a lengthy period of unconsciousness carries a poor prognosis for ultimate recovery
- One approach to scaling the depth of coma would be to rely on the behavioural rankings of experts
o For example, you could ask a panel of neurologists to list patient behaviours associated with
different levels of consciousness
After the experts had submitted a large list of diagnostic characteristics, the test
developers could rank the indicator behaviours along a continuum of consciousness
ranging from deep coma to basic orientation
- In addition to the rankings it is possible to compute a single overall score that is something more
than an ordinal scale although probably less than true interval level measurement
Method of Equal-Appearing Intervals
- L.L. Thurstone proposed a method for constructing interval-level scales from attitude statements
o His method of equal-appearing intervals is still used today, marking him as one of the giants
of psychometric theory
o The actual methodology for constructing an equal-appearing intervals scale is somewhat
tedious and statistically laden, but the underlying logic is easy to explain
- Thurstone's approach to item scaling has powerful applications in test development
Method of Absolute Scaling
- The method of absolute scaling is a procedure for obtaining a measure of absolute item difficulty based
on results for different age groups of test takers. The methodology for determining individual item
difficulty on an absolute scale is quite complex, although the underlying rationale is not too difficult
to understand
Likert Scales
- A Likert scale presents the examinee with five responses ordered on an agreed/disagree or
approve/disapprove continuum.
- Depending on the wording of an individual item, an extreme answer of strongly agree or strongly
disagree will indicate the most favourable response on the underlying attitude measured by the
questionnaire
- Likert assigned a score of 5 to this extreme response, 1 to the opposite extreme, and 2, 3, and 4 to
intermediate replies. The total scale score is obtained by adding the scores from individual items.
For this reason, a Likert scale is also referred to as a summative scale.
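Likert's summative scoring can be sketched as follows; the reverse-keying of unfavourably worded items is standard practice, though not spelled out in the notes:

```python
def score_likert(responses, reverse_keyed=()):
    """Sum 1-5 Likert responses into a summative scale score.

    Items listed in reverse_keyed (an assumed convention for
    unfavourably worded items) are flipped so that 5 always marks
    the favourable end of the attitude continuum."""
    total = 0
    for i, r in enumerate(responses):
        total += (6 - r) if i in reverse_keyed else r
    return total

# Five items; item 2 is worded in the unfavourable direction
print(score_likert([5, 4, 1, 3, 5], reverse_keyed={2}))  # 22
```

A five-item scale of this kind yields totals from 5 (least favourable) to 25 (most favourable).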
Guttman Scales
- On a Guttman scale, respondents who endorse one statement also agree with milder statements
pertinent to the same underlying continuum
- Guttman scales are produced by selecting items that fall into an ordered sequence of examinee
endorsement. A perfect Guttman scale is seldom achieved because of errors of measurement but is
nonetheless a fitting goal for certain types of tests.
- Originally devised to determine whether a set of attitude statements is unidimensional, the
technique has been used in many different kinds of tests
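A perfect Guttman pattern can be checked mechanically; this sketch assumes the statements are stored in order from mildest to most extreme:

```python
def fits_guttman(responses):
    """True if a response pattern fits a perfect Guttman scale:
    endorsing a statement implies endorsing every milder (earlier)
    statement. Items are assumed ordered mild -> extreme."""
    return all(not later or earlier
               for earlier, later in zip(responses, responses[1:]))

print(fits_guttman([True, True, False, False]))   # True: clean cutoff
print(fits_guttman([True, False, True, False]))   # False: scale error
```

In real data, scalogram analysis tallies how many such violations occur rather than demanding a perfect pattern.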
Method of Empirical Keying
- In the method of empirical keying, test items are selected for a scale based entirely on how well they
contrast a criterion group from a normative sample
- A Depression scale could be derived from a pool of true-false personality inventory questions in the
following manner
o A carefully selected and homogeneous group of persons experiencing major depression is
gathered to answer the pool of true/false questions
o For each item, the endorsement frequency of the depression group is compared to
endorsement frequency of the normative sample
o Items that show a large difference in endorsement frequency between the depression and
normative samples are selected for the Depression scale, keyed in the direction favoured by
depressed subjects (true or false, as appropriate)
o Raw score on the depression scale is then simply the number of items answered in the
keyed direction
- A common finding is that some items selected for a scale may show no obvious relation to the
construct measured
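The empirical-keying steps above can be sketched as follows; the endorsement frequencies and the 0.25 selection threshold are illustrative assumptions:

```python
def empirically_key(criterion_freqs, normative_freqs, min_diff=0.25):
    """Select items whose true-endorsement frequency differs between
    the criterion (e.g., depressed) group and the normative sample by
    at least min_diff (an arbitrary threshold for illustration).
    Each selected item is keyed True if the criterion group endorses
    it more often, False otherwise."""
    scale = []
    for item, (p_crit, p_norm) in enumerate(zip(criterion_freqs,
                                                normative_freqs)):
        diff = p_crit - p_norm
        if abs(diff) >= min_diff:
            scale.append((item, diff > 0))
    return scale

# Hypothetical endorsement frequencies for four true/false items
depressed = [0.80, 0.50, 0.10, 0.90]
normative = [0.30, 0.45, 0.60, 0.85]
print(empirically_key(depressed, normative))  # [(0, True), (2, False)]
```

The raw score is then simply the count of items a respondent answers in the keyed direction, which is why empirically keyed items need not show any obvious relation to the construct.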
Rational Scale Construction (Internal Consistency)
- The rational approach to scale construction is a popular method for the development of self report
personality inventories
- The heart of the method of rational scaling is that all scale items correlate positively with each other
and also with the total score for the scale
- For instance, if the scale is designed to identify college students with leadership potential, then it
should be administered to a cross section of several hundred college students. For scale
development, very large samples are desirable.
- The next step in rational scale construction is to correlate scores on each of the preliminary items
with the total score on the test for the 500 subjects in the tryout sample
o Scores on the items are dichotomous
- Up to half of the initial items might be discarded
- If a large proportion of items is initially discarded, the researcher might recalculate item-total
correlations based upon the reduced item pool to verify the homogeneity of the remaining items
Constructing the items
- The item writer is confronted with a profusion of initial questions
o Should item content be homogeneous or varied?
o What range of difficulty should the items cover?
o How many initial items should be constructed?
o Which cognitive processes and item domains should be tapped?
o What kind of test item should be used?
Initial Questions in Test construction
- The first question pertains to the homogeneity versus heterogeneity of the test items
- In large measure whether item content is homogeneous or varied is dictated by the manner in
which the test developer has defined the new instrument
- The test developer might seek to incorporate novel problems equally unfamiliar to all examinees
- On the other hand, with a theory-based test of spatial thinking, subscales with homogeneous item
content would be required
- The range of item difficulty must be sufficient to allow for meaningful differentiations of examinees
at both extremes
- A ceiling effect is observed when significant numbers of examinees obtain perfect or near perfect
scores
o the problem with a ceiling effect is that distinctions between high-scoring examinees are
not possible, even though these examinees might differ substantially on the underlying trait
measured by the test
- A floor effect is observed when significant numbers of examinees obtain scores at or near the
bottom of the scale
- Test developers expect that some initial items will prove to make ineffectual contributions to the
overall measurement goal of their instrument
- For this reason, the first draft usually contains excess items, perhaps double the number of questions
desired on the final draft
Table of Specifications
- A table of specifications enumerates the information and cognitive tasks on which examinees are to
be assessed
Item Formats
- Proponents of multiple-choice methodology argue that properly constructed items can measure
conceptual as well as factual knowledge. Multiple-choice tests also permit quick and objective machine
scoring.
o The fairness of multiple-choice questions can be proved (or occasionally disproved) with
very simple item analysis procedures discussed subsequently
o The major shortcomings of multiple-choice questions are, first, the difficulty of writing good
distractor options and, second, the possibility that the presence of the correct response may cue a
half-knowledgeable respondent to the correct answer
Testing the Items
- Test developers use item analysis, a family of statistical procedures, to identify the best items
o To determine which item should be retained, which should be revised and which should be
thrown out
- In conducting a thorough item analysis, the test developer might make use of the item difficulty index,
item reliability index, item validity index, item characteristic curve, and an index of item
discrimination
Item Difficulty Index
- The item difficulty for a single test item is defined as the proportion of examinees in a large tryout
sample who get that item correct
- Item difficulty index is a useful tool for identifying items that should be altered or discarded
- Unfortunately, an item that nearly everyone passes or nearly everyone fails is psychometrically
unproductive because it provides no information about differences between examinees
- For most applications, such an item should be rewritten or thrown out
- Generally item difficulties that hover around 0.5 ranging between 0.3 and 0.7 maximize the
information the test provides about differences between examinees.
- For true false or multiple choice items, the optimal level of item difficulty needs to be adjusted for
the effects of guessing
- If a test is to be used for selection of an extreme group by means of a cutting score it may be
desirable to select items with difficulty levels outside the 0.3 and 0.7 range
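The item difficulty index and the 0.3 to 0.7 screening rule can be sketched as follows; the answer matrix is hypothetical:

```python
def item_difficulties(responses):
    """responses: list of per-examinee answer vectors (1 = correct).
    Returns the proportion of examinees passing each item."""
    n = len(responses)
    return [sum(col) / n for col in zip(*responses)]

def flag_items(difficulties, low=0.3, high=0.7):
    """Flag items outside the roughly 0.3-0.7 band that maximizes
    information about differences between examinees."""
    return [i for i, p in enumerate(difficulties) if not low <= p <= high]

# Four examinees x four items (1 = correct, 0 = incorrect)
answers = [[1, 1, 0, 1],
           [1, 0, 0, 1],
           [1, 1, 1, 1],
           [1, 0, 0, 1]]
diffs = item_difficulties(answers)
print(diffs)              # [1.0, 0.5, 0.25, 1.0]
print(flag_items(diffs))  # [0, 2, 3]
```

Items 0 and 3 (everyone passes) and item 2 (too hard) would be candidates for revision, unless an extreme cutting score is intended.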
Item Reliability Index
- A test developer may desire an instrument with a high level of internal consistency in which the
items are reasonably homogeneous
- However, individual items are typically scored right or wrong (often 1 or 0), whereas total scores
constitute a continuous variable
o In order to correlate these two different kinds of scores it is necessary to use a special type
of statistic called the POINT-BISERIAL CORRELATION COEFFICIENT
- The computational formula for this correlation coefficient is algebraically equivalent to the Pearson
r discussed earlier, and the point-biserial coefficient conveys much the same kind of information
about the relationship between two variables (one of which happens to be a dichotomous 0/1 score)
- The usefulness of an individual dichotomous test item is also determined by the extent to which
scores on it are distributed between the two outcomes of 0 and 1
- The more closely the item approaches a 50-50 split of right and wrong scores, the greater is its
standard deviation
o In general, the greater the standard deviation of an item, the more useful the item is to the
overall scale
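The point-biserial correlation named above can be sketched directly from its standard formula; the example data are invented:

```python
import statistics

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a dichotomous item (0/1) and a
    continuous total score; algebraically equivalent to a Pearson r."""
    p = sum(item_scores) / len(item_scores)        # proportion passing
    q = 1 - p
    mean_pass = statistics.fmean(
        t for i, t in zip(item_scores, total_scores) if i == 1)
    mean_all = statistics.fmean(total_scores)
    sd_all = statistics.pstdev(total_scores)
    return (mean_pass - mean_all) / sd_all * (p / q) ** 0.5

# Invented example: examinees who pass the item tend to score higher overall
print(round(point_biserial([1, 1, 0, 0], [3, 4, 1, 2]), 3))   # 0.894
```

A high positive value indicates that passing the item goes with a high total score, which is what a homogeneous scale requires.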
Item Validity Index
- The item validity index is a useful tool in the psychometrician’s quest to identify predictively useful
test items
- By computing the item validity index for every item in the preliminary test, the test developer can
identify ineffectual items, eliminate or rewrite them, and produce a revised instrument with greater
practical utility.
- The first step in computing an item validity index is to find the point-biserial correlation between
the item score and the score on the criterion variable
Item Characteristic Curves
- An item characteristic curve (ICC) is a graphical display of the relationship between the probability
of a correct response and the examinee's position on the underlying trait measured by the test
- An ICC is actually a mathematical idealization of the relationship between the probability of a
correct response and the amount of the trait possessed by test respondents
- Different ICC models use different mathematical functions based on initial assumptions
o The simplest ICC model is the Rasch model, which makes two assumptions:
Test items are unidimensional and measure one common trait
Test items vary on a continuum of difficulty level
o The normal ogive is simply the normal distribution graphed in cumulative form
- Psychometric purists would prefer that test item ICCs approximate the normal ogive because this
curve is convenient for making mathematical deductions about the underlying trait
- ICCs are especially useful for identifying items that perform differently for subgroups of examinees
- The underlying theory of ICC is also known as item response theory and latent trait theory
o The usefulness of this approach has been questioned by Nunnally, who points out that the
assumption of test unidimensionality is violated by many psychological tests
- The merits of the ICC approach are still debated
o ICC theory seems particularly appropriate for certain forms of computerized adaptive
testing (CAT) in which each test taker responds to an individual and unique set of items that
are then scored on an underlying uniform scale
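The Rasch model named above is commonly written as a one-parameter logistic function, which produces the S-shaped (ogive-like) curve the notes describe. A minimal sketch:

```python
import math

def rasch_icc(theta, b):
    """Rasch (one-parameter logistic) ICC: probability of a correct
    response for an examinee with trait level theta on an item of
    difficulty b, both expressed on the same latent scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability exactly equals item difficulty, the probability is .50;
# as ability rises above difficulty, the curve climbs toward 1.0.
print(rasch_icc(0.0, 0.0))             # 0.5
print(round(rasch_icc(2.0, 0.0), 3))   # 0.881
```

Plotting this function for each item, and comparing curves across subgroups, is how differently performing items are spotted in practice.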
Item Discrimination Index
- Implicit in the discussion of ICCs is the notion that an effective test item is one that discriminates
between high scorers and low scorers on the entire test
- An ideal test item is one that most of the high scorers pass and most of the low scorers fail
- If the slope of the curve is positive and the curve is preferably ogive shaped, the item is doing a
good job of separating high and low scorers. But visual inspection is not a completely objective
procedure; what is needed is a statistical tool that summarizes the discriminating power of
individual test items
- An item discrimination index is a statistical index of how efficiently an item discriminates between
persons who obtain high and low scores on the entire test
- A test developer can supplement the item discrimination approach by inspecting the number of
examinees in the upper and lower scoring groups who choose each of the incorrect alternatives
- If a multiple choice item is well written the incorrect alternatives should be equally attractive to
subjects who do not know the correct answer
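One widely used form of the discrimination index (an assumption here, since the notes do not give the formula) is d = (U − L) / T, the difference between the proportions of the upper- and lower-scoring groups who pass the item:

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """d = (U - L) / T, where U and L are the numbers of examinees in the
    upper- and lower-scoring groups who pass the item, and T is the size
    of each group. One common formula among several variants."""
    return (upper_correct - lower_correct) / group_size

# Invented tryout result: 24 of the top 30 scorers pass the item,
# but only 9 of the bottom 30 do
print(discrimination_index(24, 9, 30))   # 0.5
```

Values near +1 indicate strong discrimination; values near zero or negative values flag items that fail to separate high and low scorers and should be revised or discarded.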
Revising the Test
- It is common in the evolutionary process of test development that many items are revised or
replaced and new items added
- This revised test likely contains more discriminating items with higher reliability and greater
predictive accuracy but these improvements are known to be true only for the first tryout sample
- The next step in test development is to collect new data from a second tryout sample
- If major changes are needed, it is desirable to collect data from a third and perhaps even a fourth
tryout sample
Cross Validation
- The term cross validation refers to the practice of using the original regression equation in a new
sample to determine whether the test predicts the criterion as well as it did in the original sample
Validity Shrinkage
- A common discovery in cross validation research is that a test predicts the relevant criterion less
accurately with the new sample of examinees than with the original tryout sample. The term
validity shrinkage is applied to this phenomenon
- Validity shrinkage is an inevitable part of test development and underscores the need for cross
validation
- Shrinkage can be a major problem when the derivation and cross validation samples are small, the
number of potential test items is large, and items are chosen on a purely empirical basis without
theoretical rationale
- Validity must be demonstrated through cross validation, not assumed merely from the good
intentions behind a new instrument
Feedback from Examinees
- The EFeQ is a short and simple questionnaire designed to elicit candid opinions from examinees
about the following features of the test-examiner-respondent matrix:
o Behaviour of examiners
o Testing conditions
o Clarity of exam instructions
o Convenience in using the answer sheet
o Perceived suitability of the test
o Perceived cultural fairness of the test
o Perceived sufficiency of time
o Perceived difficulty of the test
o Emotional response to the test
o Level of guessing
o Cheating by the examinee or others
o A final open-ended question asks what the examinee thought of the test and what could be
improved
Publishing the test
- The test construction process does not end with the collection of cross validation data
- The test developer also must oversee the production of the testing materials, publish a technical
manual and produce a user’s manual
Production of testing materials
- The testing materials must be user friendly if they are to receive wide acceptance by psychologists
and educators
- If it is possible for the test developer to simplify the duties of the examiner while leaving examinee
task demands unchanged, the resulting instrument will have much greater acceptability to potential
users
Technical Manual and User’s Manual
- Technical data about a new instrument are usually summarized with appropriate references in a
technical manual
- In some cases this information is incorporated into the user's manual, which gives instructions for
administration and also provides guidelines for test interpretation
- Test manuals serve many purposes, as outlined in the Standards for Educational and Psychological
Testing
- The influential Standards manual suggests that test manuals accomplish the following goals:
o Describe the rationale and recommended uses for the test
o Provide specific cautions against anticipated misuses of a test
o Cite representative studies regarding general and specific test uses
o Identify special qualifications needed to administer and interpret the test
o Provide revisions, amendments, and supplements as needed
o Cite quantitative relationships between test scores and criteria
o Report on the degree to which alternative modes of response are interchangeable
o Provide appropriate interpretive aids to the test taker
o Furnish evidence of the validity of any automated test interpretations
- Test manuals should provide the essential data on reliability and validity rather than referring the
user to other sources – an unfortunate practice encountered in some test manuals