
TEST ANALYSIS

Basic Concepts in Item and Test Analysis
Susan Matlock-Hetzel
Texas A&M University, January 1997

Abstract

When norm-referenced tests are developed for instructional purposes, to assess the effects of educational programs, or for educational research purposes, it can be very important to conduct item and test analyses. These analyses evaluate the quality of the items and of the test as a whole. Such analyses can also be employed to revise and improve both items and the test as a whole. However, some best practices in item and test analysis are too infrequently used in actual practice. The purpose of the present paper is to summarize the recommendations for item and test analysis practices, as these are reported in commonly used measurement textbooks (Crocker & Algina, 1986; Gronlund & Linn, 1990; Pedhazur & Schmelkin, 1991; Sax, 1989; Thorndike, Cunningham, Thorndike, & Hagen, 1991).

Paper presented at the annual meeting of the Southwest Educational Research Association, Austin, January 1997.

Basic Concepts in Item and Test Analysis

Making fair and systematic evaluations of others' performance can be a challenging task. Judgments cannot be made solely on the basis of intuition, haphazard guessing, or custom (Sax, 1989). Teachers, employers, and others in evaluative positions use a variety of tools to assist them in their evaluations. Tests are tools that are frequently used to facilitate the evaluation process. When norm-referenced tests are developed for instructional purposes, to assess the effects of educational programs, or for educational research purposes, it can be very important to conduct item and test analyses.

Test analysis examines how the test items perform as a set. Item analysis "investigates the performance of items considered individually either in relation to some external criterion or in relation to the remaining items on the test" (Thompson & Levitov, 1985, p. 163). These analyses evaluate the quality of the items and of the test as a whole. Such analyses can also be employed to revise and improve both items and the test as a whole.

However, some best practices in item and test analysis are too infrequently used in actual practice. The purpose of the present paper is to summarize the recommendations for item and test analysis practices, as these are reported in commonly used measurement textbooks (Crocker & Algina, 1986; Gronlund & Linn, 1990; Pedhazur & Schmelkin, 1991; Sax, 1989; Thorndike, Cunningham, Thorndike, & Hagen, 1991). These tools include item difficulty, item discrimination, and item distractors.

Item Difficulty

Item difficulty is simply the percentage of students taking the test who answered the item correctly. The larger the percentage getting an item right, the easier the item; the higher the difficulty index, the easier the item is understood to be (Wood, 1960). To compute the item difficulty, divide the number of people answering the item correctly by the total number of people answering the item. The resulting proportion is usually denoted as p and is called the item difficulty (Crocker & Algina, 1986). An item answered correctly by 85% of the examinees would have an item difficulty, or p value, of .85, whereas an item answered correctly by 50% of the examinees would have a lower item difficulty, or p value, of .50.

A p value is basically a behavioral measure. Rather than defining difficulty in terms of some intrinsic characteristic of the item, difficulty is defined in terms of the relative frequency with which those taking the test choose the correct response (Thorndike et al., 1991). For instance, in the example below, which item is more difficult?

1. Who was Boliver Scagnasty?
2. Who was Martin Luther King?

One cannot determine which item is more difficult simply by reading the questions. One can recognize the name in the second question more readily than that in the first. But saying that the first question is more difficult than the second, simply because the name in the second question is easily recognized, would be to compute the difficulty of the item from an intrinsic characteristic. This method determines the difficulty of the item in a much more subjective manner than does a p value.

Another implication of a p value is that difficulty is a characteristic of both the item and the sample taking the test. For example, an English test item that is very difficult for an elementary student will be very easy for a high school student. A p value also provides a common measure of the difficulty of test items that measure completely different domains. It is very difficult to determine whether answering a history question involves knowledge that is more obscure, complex, or specialized than that needed to answer a math problem. When p values are used to define difficulty, it is very simple to determine whether an item on a history test is more difficult than a specific item on a math test taken by the same group of students.

To make this more concrete, consider the following examples. When no one chooses the correct answer (p = 0), there are no individual differences in the "score" on that item. As shown in Table 1, the correct answer C was not chosen by either the upper group or the lower group. (The upper group and lower group will be explained later.) The same is true when everyone taking the test chooses the correct response, as is seen in Table 2. An item with a p value of .00 or a p value of 1.00 does not contribute to measuring individual differences and is therefore almost certain to be useless. Item difficulty has a profound effect on both the variability of test scores and the precision with which test scores discriminate among different groups of examinees (Thorndike et al., 1991). When all of the test items are extremely difficult, the great majority of the test scores will be very low. When all items are extremely easy, most test scores will be extremely high. In either case, test scores will show very little variability. Thus, extreme p values directly restrict the variability of test scores.

Table 1
Minimum Item Difficulty Example Illustrating No Individual Differences

Group          Item response
               A    B    C*   D
Upper group    4    5    0    6
Lower group    2    6    0    7

Note. * denotes correct response.
Item difficulty (p): (0 + 0)/30 = .00
Discrimination index (D): (0 - 0)/15 = .00

Table 2
Maximum Item Difficulty Example Illustrating No Individual Differences

Group          Item response
               A    B    C*   D
Upper group    0    0   15    0
Lower group    0    0   15    0

Note. * denotes correct response.
Item difficulty (p): (15 + 15)/30 = 1.00
Discrimination index (D): (15 - 15)/15 = .00

In discussing the procedure for determining the minimum and maximum score on a test, Thompson and Levitov (1985) stated that

items tend to improve test reliability when the percentage of students who correctly answer the item is halfway between the percentage expected to correctly answer if pure guessing governed responses and the percentage (100%) who would correctly answer if everyone knew the answer. (pp. 164-165)

For example, many teachers may think that the minimum score on a test consisting of 100 items with four alternatives each is 0, when in actuality the theoretical floor on such a test is 25. This is the score that would be most likely if a student answered every item by guessing (e.g., without even being given the test booklet containing the items). Similarly, the ideal percentage of correct answers on a four-choice multiple-choice test is not 70-90%. According to Thompson and Levitov (1985), the ideal difficulty for such an item would be halfway between the percentage expected from pure guessing (25%) and 100%, that is, 25% + (100% - 25%)/2 = 62.5%. Therefore, for a test with 100 items with four alternatives each, the ideal mean percentage of correct items, for the purpose of maximizing score reliability, is roughly 63%. Tables 3, 4, and 5 show examples of items with p values of roughly 63%.
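The ideal difficulty just described is easy to compute for any number of answer choices. The short Python sketch below is only an illustration of that arithmetic; the function name and the four- and five-choice examples are mine, not part of the original paper.

    def ideal_difficulty(n_choices: int) -> float:
        """Ideal item difficulty (proportion correct) per Thompson & Levitov (1985):
        halfway between the chance level and 1.0."""
        chance = 1.0 / n_choices          # proportion expected from pure guessing
        return chance + (1.0 - chance) / 2

    print(ideal_difficulty(4))   # 0.625 -> roughly 63% for four-choice items
    print(ideal_difficulty(5))   # 0.6   -> 60% for five-choice items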

Table 3
Item Difficulty Example Illustrating Individual Differences

Group          Item response
               A    B    C*   D
Upper group    1    0   13    3
Lower group    2    5    5    6

Note. * denotes correct response.
Item difficulty (p): (13 + 5)/30 = .60
Discrimination index (D): (13 - 5)/15 = .53

Table 4
Item Difficulty Example Illustrating Individual Differences

Group          Item response
               A    B    C*   D
Upper group    1    0   11    3
Lower group    2    0    7    6

Note. * denotes correct response.
Item difficulty (p): (11 + 7)/30 = .60
Discrimination index (D): (11 - 7)/15 = .267

Table 5
Item Difficulty Example Illustrating Individual Differences

Group          Item response
               A    B    C*   D
Upper group    1    0    7    3
Lower group    2    0   11    6

Note. * denotes correct response.
Item difficulty (p): (11 + 7)/30 = .60
Discrimination index (D): (7 - 11)/15 = -.267

Item Discrimination

If the test and a single item measure the same thing, one would expect people who do well on the test to answer that item correctly, and those who do poorly to answer the item incorrectly. A good item discriminates between those who do well on the test and those who do poorly. Two indices can be computed to determine the discriminating power of an item: the item discrimination index, D, and discrimination coefficients.

Item Discrimination Index, D

The method of extreme groups can be applied to compute a very simple measure of the discriminating power of a test item. If a test is given to a large group of people, the discriminating power of an item can be measured by comparing the number of people with high test scores who answered that item correctly with the number of people with low scores who answered the same item correctly. If a particular item is doing a good job of discriminating between those who score high and those who score low, more people in the top-scoring group will have answered the item correctly.

In computing the discrimination index, D, first score each student's test and rank order the test scores. Next, separate the 27% of the students at the top and the 27% at the bottom for the analysis. Wiersma and Jurs (1990) stated that "27% is used because it has shown that this value will maximize differences in normal distributions while providing enough cases for analysis" (p. 145). There need to be as many students as possible in each group to promote stability; at the same time, it is desirable to have the two groups be as different as possible to make the discriminations clearer. According to Kelly (as cited in Popham, 1981), the use of 27% maximizes these two characteristics. Nunnally (1972) suggested using 25%.

The discrimination index, D, is the number of people in the upper group who answered the item correctly minus the number of people in the lower group who answered the item correctly, divided by the number of people in the larger of the two groups. Wood (1960) stated that

when more students in the lower group than in the upper group select the right answer to an item, the item actually has negative validity. Assuming that the criterion itself has validity, the item is not only useless but is actually serving to decrease the validity of the test. (p. 87)

The higher the discrimination index, the better the item, because such a value indicates that the item discriminates in favor of the upper group, which should get more items correct, as shown in Table 6. An item that everyone gets correct or that everyone gets incorrect, as shown in Tables 1 and 2, will have a discrimination index equal to zero. Table 7 illustrates that if more students in the lower group than in the upper group get an item correct, the item will have a negative D value and is probably flawed.

Table 6
Positive Item Discrimination Index D

Group          Item response
               A    B    C*   D
Upper group    3    2   15    0
Lower group   12    3    3    2

Note. * denotes correct response. 74 students took the test; 27% of 74 gives approximately 20 students per group (N = 20).
Item difficulty (p): (15 + 3)/40 = .45
Discrimination index (D): (15 - 3)/20 = .60
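The calculations above are simple to script. The following Python sketch is a minimal illustration of the two index computations using the Table 6 counts; the function names are mine and not part of the original paper.

    def item_difficulty(correct_upper, correct_lower, n_upper, n_lower):
        """p: proportion of the combined extreme groups answering the item correctly."""
        return (correct_upper + correct_lower) / (n_upper + n_lower)

    def discrimination_index(correct_upper, correct_lower, n_larger_group):
        """D: difference in correct answers between the groups, divided by the
        size of the larger group."""
        return (correct_upper - correct_lower) / n_larger_group

    # Table 6: 15 of 20 upper-group and 3 of 20 lower-group students answered correctly.
    print(item_difficulty(15, 3, 20, 20))      # 0.45
    print(discrimination_index(15, 3, 20))     # 0.6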

Table 7
Negative Item Discrimination Index D

Group          Item response
               A    B    C*   D
Upper group    0    0    0    0
Lower group    0    0   15    0

Note. * denotes correct response.
Item difficulty (p): (0 + 15)/30 = .50
Discrimination index (D): (0 - 15)/15 = -1.0

A negative discrimination index is most likely to occur when an item covers complex material written in such a way that it is possible to select the correct response without any real understanding of what is being assessed. A poor student may make a guess, select that response, and come up with the correct answer. Good students may be suspicious of a question that looks too easy, may take the harder path to solving the problem, may read too much into the question, and may end up being less successful than those who guess. As a rule of thumb for the discrimination index, items at .40 and above are very good; .30 to .39 are reasonably good but possibly subject to improvement; .20 to .29 are marginal and need some revision; and items at .19 and below are considered poor and need major revision or should be eliminated (Ebel & Frisbie, 1986).

Discrimination Coefficients

Two indicators of an item's discrimination effectiveness are the point biserial correlation and the biserial correlation coefficient. The choice of correlation depends upon what kind of question we want to answer. The advantage of using discrimination coefficients over the discrimination index (D) is that every person taking the test is used to compute the discrimination coefficients, whereas only 54% (27% upper + 27% lower) are used to compute the discrimination index, D.

Point biserial. The point biserial correlation (rpbis) is used to find out whether the right people are getting the items right, how much predictive power the item has, and how it would contribute to predictions. Henrysson (1971) suggests that the rpbis tells more about the predictive validity of the total test than does the biserial r, in that it tends to favor items of average difficulty. It is further suggested that the rpbis is a combined measure of item-criterion relationship and of difficulty level.

Biserial correlation. Biserial correlation coefficients (rbis) are computed to determine whether the attribute or attributes measured by the criterion are also measured by the item, and the extent to which the item measures them. The rbis gives an estimate of the well-known Pearson product-moment correlation between the criterion score and the hypothesized item continuum when the item is dichotomized into right and wrong (Henrysson, 1971). Ebel and Frisbie (1986) state that the rbis simply describes the relationship between scores on a test item (e.g., "0" or "1") and scores (e.g., "0", "1", ..., "50") on the total test for all examinees.
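As a rough illustration of the point biserial coefficient described above, the sketch below correlates dichotomous item scores with total test scores using the standard formula r = (M1 - M0)/SD * sqrt(p*q); the variable names and the tiny data set are made up for the example and are not from the paper.

    from statistics import mean, pstdev

    def point_biserial(item_scores, total_scores):
        """Point biserial correlation between a 0/1 item and the total test score."""
        right = [t for i, t in zip(item_scores, total_scores) if i == 1]
        wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
        p = len(right) / len(item_scores)          # item difficulty
        q = 1 - p
        sd = pstdev(total_scores)                  # population SD of total scores
        return (mean(right) - mean(wrong)) / sd * (p * q) ** 0.5

    # Hypothetical item and total scores for eight examinees.
    item  = [1, 1, 1, 0, 1, 0, 0, 0]
    total = [47, 44, 40, 38, 35, 30, 28, 25]
    print(round(point_biserial(item, total), 2))   # 0.77 for this sample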
Distractors

Analyzing the distractors (i.e., the incorrect alternatives) is useful in determining the relative usefulness of the decoys in each item. Items should be modified if students consistently fail to select certain multiple-choice alternatives; such alternatives are probably so implausible as to be of little use as decoys. A discrimination index or discrimination coefficient should be obtained for each option in order to determine each distractor's usefulness (Millman & Greene, 1993). Whereas the discrimination value of the correct answer should be positive, the discrimination values for the distractors should be lower and, preferably, negative. Distractors should be carefully examined when items show large positive D values. When one or more of the distractors looks extremely plausible to the informed reader, and when recognition of the correct response depends on some extremely subtle point, it is possible that examinees will be penalized for partial knowledge.

Thompson and Levitov (1985) suggested computing reliability estimates for test scores to determine an item's usefulness to the test as a whole. The authors stated, "The total test reliability is reported first and then each item is removed from the test and the reliability for the test less that item is calculated" (Thompson & Levitov, 1985, p. 167). From this the test developer deletes the indicated items so that the test scores have the greatest possible reliability.

Summary

Developing the perfect test is an unattainable goal for anyone in an evaluative position. Even when guidelines for constructing fair and systematic tests are followed, a plethora of factors may enter into a student's perception of the test items. Looking at an item's difficulty and discrimination will assist the test developer in determining what is wrong with individual items. Item and test analysis provide empirical data about how individual items and whole tests are performing in real test situations.

References

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.
Ebel, R. L., & Frisbie, D. A. (1986). Essentials of educational measurement. Englewood Cliffs, NJ: Prentice-Hall.
Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan.
Henrysson, S. (1971). Gathering, analyzing, and using data on test items. In R. L. Thorndike (Ed.), Educational measurement (p. 141). Washington, DC: American Council on Education.
Millman, J., & Greene, J. (1993). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (pp. 335-366). Phoenix, AZ: Oryx Press.
Nunnally, J. C. (1972). Educational measurement and evaluation (2nd ed.). New York: McGraw-Hill.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Erlbaum.
Popham, W. J. (1981). Modern educational measurement. Englewood Cliffs, NJ: Prentice-Hall.
Sax, G. (1989). Principles of educational and psychological measurement and evaluation (3rd ed.). Belmont, CA: Wadsworth.
Thompson, B., & Levitov, J. E. (1985). Using microcomputers to score and evaluate test items. Collegiate Microcomputer, 3, 163-168.
Thorndike, R. M., Cunningham, G. K., Thorndike, R. L., & Hagen, E. P. (1991). Measurement and evaluation in psychology and education (5th ed.). New York: Macmillan.
Wiersma, W., & Jurs, S. G. (1990). Educational measurement and testing (2nd ed.). Boston, MA: Allyn and Bacon.
Wood, D. A. (1960). Test construction: Development and interpretation of achievement tests. Columbus, OH: Charles E. Merrill Books.


We assess the quality of tests as administered and carry out test analysis for the various organizations involved in testing, such as qualifying examination bodies, educational institutions, and education services companies, to ensure that examinees' abilities are measured correctly. Typical questions include the following:

- Is the difficulty of the questions within an appropriate range?
- Is the number of questions appropriate?
- Do the answer choices function well?
- Does each question distinguish between examinees of high and low ability?
- Are the pass mark and the method of dividing levels appropriate?
- Is it possible to compare results with previous test scores and track changes across successive administrations?
- What is the relationship between examinee attributes, grouping, and test scores?

Item analysis: analysis using classical test theory. Each output index is listed with an explanation.

Percentage of correct answers: the difficulty of the question in the test population.
Point biserial correlation: how well the question discriminates between high- and low-ability examinees.
Biserial correlation: the correlation when the question and test scores are assumed to follow a bivariate normal distribution.
Choice selection rate: how often each answer choice is selected and whether it is functioning.
Actual choice rate: the actual number of times each choice was selected, as seen from the test results.
Fundamental statistics: fundamental statistical information about the test.
Reliability factor: an index of the reliability of the test score.
Standard error: the standard deviation of the score, assuming a given examinee takes the test repeatedly.
GP analysis table: a table and graph showing how the choices function at each ability level.

Analysis using item response theory (IRT). Each output index is listed with an explanation.

Parameter a: an index of the sensitivity with which ability groups around value b are distinguished.
Parameter b: the difficulty of a particular question.
Parameter c: an index of the possibility of guessing the correct answer.
Standard error a: the standard error of a, assuming repeated data acquisition and estimation.
Standard error b: the standard error of b, assuming repeated data acquisition and estimation.
Standard error c: the standard error of c, assuming repeated data acquisition and estimation.
Chi-square: the deviation between the percentage of correct answers predicted by the model and the percentage observed in the data.
df: degrees of freedom (the number of ability-score categories used in calculating the chi-square value).
p: the probability of obtaining the observed data if the model and the data are equivalent.
Fundamental statistics: average, median, standard deviation, variance, range, minimum, maximum, sample size.
Test characteristic curve: a graph of the correspondence between ability score and test score.
Test information curve: a diagram showing the reliability at each ability score point.
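The a, b, and c parameters listed above correspond to the discrimination, difficulty, and guessing parameters of the widely used three-parameter logistic (3PL) model, under which the probability of a correct answer for ability theta is c + (1 - c) / (1 + exp(-a(theta - b))). The Python sketch below simply evaluates that curve; the function name and the example parameter values are illustrative assumptions, not output of any particular analysis package.

    import math

    def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
        """Probability of a correct response under the 3PL model:
        P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
        return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

    # Illustrative item: moderate discrimination, average difficulty, 25% guessing floor.
    for theta in (-2, -1, 0, 1, 2):
        print(theta, round(p_correct_3pl(theta, a=1.2, b=0.0, c=0.25), 2))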

Test analysis: identifying test conditions

Test analysis is the process of looking at something that can be used to derive test information. This basis for the tests is called the test basis. The test basis is the information we need in order to start the test analysis and create our own test cases. Basically, it is the documentation on which test cases are based, such as requirements, design specifications, product risk analyses, architecture and interfaces. We can use the test basis documents to understand what the system should do once built. The test basis includes whatever the tests are based on; sometimes tests can be based on experienced users' knowledge of the system, which may not be documented.

From a testing perspective, we look at the test basis in order to see what could be tested. These are the test conditions. A test condition is simply something that we could test. While identifying test conditions we want to identify as many as we can, and then select which ones to take forward and combine into test cases; we could call them test possibilities. As we know, testing everything is an impractical goal, known as exhaustive testing. Since we cannot test everything, we have to select a subset of all possible tests. In practice the subset we select may be a very small subset, and yet it has to have a high probability of finding most of the defects in the system. Hence we need some intelligent thought process to guide our selection, called test techniques.

The test conditions that are chosen will depend on the test strategy or detailed test approach. For example, they might be based on risk, models of the system, and so on. Once we have identified a list of test conditions, it is important to prioritize them so that the most important ones are identified. Test conditions can be identified for test data as well as for test inputs and test outcomes, for example, different types of record, or different sizes of records or fields in a record. Test conditions are documented in the IEEE 829 document called a Test Design Specification. A short illustrative sketch of turning a test condition into test cases follows.
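As a minimal illustration of combining a selected test condition into concrete test cases, the pytest-style sketch below parametrizes one hypothetical condition ("different sizes of a record field") into several cases. The function names, the field limit, and the condition itself are assumptions for the example, not part of any standard.

    import pytest

    MAX_NAME_LENGTH = 50  # assumed limit from a hypothetical requirement

    def is_valid_name(name: str) -> bool:
        """Hypothetical behaviour under test: accept 1..MAX_NAME_LENGTH characters."""
        return 1 <= len(name) <= MAX_NAME_LENGTH

    # One test condition ("field size handling") expanded into several test cases.
    @pytest.mark.parametrize(
        "name, expected",
        [
            ("", False),                           # empty field
            ("A", True),                           # minimum size
            ("A" * MAX_NAME_LENGTH, True),         # boundary: exactly at the limit
            ("A" * (MAX_NAME_LENGTH + 1), False),  # boundary: just over the limit
        ],
    )
    def test_name_field_size(name, expected):
        assert is_valid_name(name) == expected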

Tests: Post Test Analysis

Sometimes you can get valuable study clues for upcoming tests by examining old tests you have already taken. This method works best if the instructor gives many examinations; obviously it would not work on the first test. It is based on the premise that people tend to be consistent. Here's what you do:

1. Gather all your notes, texts, and test answer sheets and visit your instructor during office hours. Ask to look over the test that was previously given in your class.
2. As you look over the test, answer two basic questions:
   1. Where did this test come from? Did the test come mostly from lecture notes, the textbook, or the homework? Did your instructor lecture hard on Chapter 4 and then test hard on Chapter 4? Does he like lots of little specifics, or just test on broad, general areas?
   2. What kinds of questions were asked? Were there factual questions, application questions, definition questions? If factual, then know names, dates, places; if application, then study theory; if definitions, then be familiar with terms.

For example, one student discovered that her instructor made up exams by selecting only the major paragraphs in the chapter and then using the topic sentence of each paragraph as the exam question. It was then a simple matter to study for the forthcoming examinations. This knowledge came only after carefully examining the old test.

Tips to Combat Test Panic

1. Sleep. Get a good night's rest.
2. Diet. Eat breakfast or lunch. This may help calm your nervous stomach and give you energy. Avoid greasy or acidic foods, avoid overeating, and avoid caffeine pills.
3. Exercise. Nothing reduces stress more than exercise. An hour or two before an examination, stop studying and go work out: swimming, jogging, cycling, aerobics.
4. Allow yourself enough time to get to the test without hurrying.
5. Don't swap questions at the door. Hearing anything you don't know may weaken your confidence and send you into a state of anxiety.
6. Leave your books at home. Flipping pages at the last minute may only upset you. If you must take something, take a brief outline that you know well.
7. Take a watch with you, as well as extra pencils, scantron sheets, and blue books.
8. Answer the easy questions first. This will relax you and help build your confidence, plus give you some assured points.
9. Sit apart from your classmates to reduce being distracted by their movements.
10. Don't panic if others are writing and you aren't. Your thinking may be more profitable than their writing.
11. Don't be upset if others finish their tests before you do. Use as much time as you are allowed. Students who leave early don't always get the highest grades.
12. If you still feel nervous during the test, try some emergency first aid: inhale deeply, close your eyes, hold, then exhale slowly. Repeat as needed.

There are many benefits that can be gained by using tools to support testing:

Reduction of repetitive work: Repetitive work is very boring if it is done manually, and people tend to make mistakes when doing the same task over and over. Examples of this type of repetitive work include running regression tests, entering the same test data again and again (which can be done by a test execution tool), checking against coding standards (which can be done by a static analysis tool), or creating a specific test database (which can be done by a test data preparation tool).

Greater consistency and repeatability: People have a tendency to do the same task in a slightly different way even when they think they are repeating something exactly. A tool will exactly reproduce what it did before, so each time it is run the result is consistent.

Objective assessment: If a person calculates a value from the software or from incident reports, they may omit something by mistake, or their own one-sided preconceived judgments or convictions may lead them to interpret the data incorrectly. Using a tool means that subjective preconceptions are removed and the assessment is more repeatable and consistently calculated. Examples include assessing the cyclomatic complexity or nesting levels of a component (which can be done by a static analysis tool), coverage (a coverage measurement tool), system behavior (monitoring tools), and incident statistics (a test management tool).

Ease of access to information about tests or testing: Information presented visually is much easier for the human mind to understand and interpret. For example, a chart or graph is a better way to show information than a long list of numbers; this is why charts and graphs in spreadsheets are so useful. Special-purpose tools provide these features directly for the information they process. Examples include statistics and graphs about test progress (test execution or test management tool), incident rates (incident management or test management tool) and performance (performance testing tool).

Item Analysis

Table of Contents: Major Uses of Item Analysis | Item Analysis Reports | Item Analysis Response Patterns | Basic Item Analysis Statistics | Interpretation of Basic Statistics | Other Item Statistics | Summary Data | Report Options | Item Analysis Guidelines

Major Uses of Item Analysis

Item analysis can be a powerful technique available to instructors for the guidance and improvement of instruction. For this to be so, the items to be analyzed must be valid measures of instructional objectives. Further, the items must be diagnostic; that is, knowledge of which incorrect options students select must be a clue to the nature of the misunderstanding, and thus prescriptive of appropriate remediation.

In addition, instructors who construct their own examinations may greatly improve the effectiveness of test items and the validity of test scores if they select and rewrite their items on the basis of item performance data. Such data are available to instructors who have their examination answer sheets scored at the Computer Laboratory Scoring Office.

Item Analysis Reports

As the answer sheets are scored, records are written which contain each student's score and his or her response to each item on the test. These records are then processed and an item analysis report file is generated. An instructor may obtain test score distributions and a list of students' scores in alphabetic order, in student number order, in percentile rank order, and/or in order of percentage of total points. Instructors are sent their item analysis reports as e-mail attachments. The item analysis report is contained in the file IRPT####.RPT, where the four digits indicate the instructor's GRADER III file. A sample of an individual long-form item analysis listing is shown below.

Item 10 of 125. The correct option is 5.

Item Response Pattern

              1     2     3     4     5     Omit   Error   Total
Upper 27%     2     8     0     1     19    0      0       30
              7%    27%   0%    3%    63%   0%     0%      100%
Middle 46%    3     20    3     3     23    0      0       52
              6%    38%   6%    6%    44%   0%     0%      100%
Lower 27%     6     5     8     2     9     0      0       30
              20%   17%   27%   7%    30%   0%     0%      101%
Total         11    33    11    6     51    0      0       112
              10%   29%   11%   5%    46%   0%     0%      100%


Item Analysis Response Patterns

Each item is identified by number and the correct option is indicated. The group of students taking the test is divided into upper, middle and lower groups on the basis of students' scores on the test. This division is essential if information is to be provided concerning the operation of distracters (incorrect options) and if an easily interpretable index of discrimination is to be computed. It has long been accepted that optimal item discrimination is obtained when the upper and lower groups each contain twenty-seven percent of the total group.

The number of students who selected each option or omitted the item is shown for each of the upper, middle, lower and total groups. The number of students who marked more than one option to the item is indicated under the "error" heading. The percentage of each group who selected each of the options, omitted the item, or erred, is also listed. Note that the total percentage for each group may be other than 100%, since the percentages are rounded to the nearest whole number before totaling.

The sample item listed above appears to be performing well. About two-thirds of the upper group but only one-third of the lower group answered the item correctly. Ideally, the students who answered the item incorrectly should select each incorrect response in roughly equal proportions, rather than concentrating on a single incorrect option. Option two seems to be the most attractive incorrect option, especially to the upper and middle groups. It is most undesirable for a greater proportion of the upper group than of the lower group to select an incorrect option; the item writer should examine such an option for possible ambiguity. For the sample item above, option four was selected by only five percent of the total group. An attempt might be made to make this option more attractive.

Item analysis provides the item writer with a record of student reaction to items. It gives us little information about the appropriateness of an item for a course of instruction. The appropriateness, or content validity, of an item must be determined by comparing the content of the item with the instructional objectives.

Basic Item Analysis Statistics

A number of item statistics are reported which aid in evaluating the effectiveness of an item. The first of these is the index of difficulty, which is the proportion of the total group who got the item wrong. Thus a high index indicates a difficult item and a low index indicates an easy item. Some item analysts prefer an index of difficulty which is the proportion of the total group who got an item right. This index may be obtained by marking the PROPORTION RIGHT option on the item analysis header sheet. Whichever index is selected is shown as the INDEX OF DIFFICULTY on the item analysis print-out. For classroom achievement tests, most test constructors desire items with indices of difficulty no lower than 20 nor higher than 80, with an average index of difficulty from 30 or 40 to a maximum of 60.

The INDEX OF DISCRIMINATION is the difference between the proportion of the upper group who got an item right and the proportion of the lower group who got the item right. This index is dependent upon the difficulty of an item. It may reach a maximum value of 100 for an item with an index of difficulty of 50, that is, when 100% of the upper group and none of the lower group answer the item correctly. For items with difficulty less than or greater than 50, the index of discrimination has a maximum value of less than 100. The "Interpreting the Index of Discrimination" document contains a more detailed discussion of the index of discrimination.

Interpretation of Basic Statistics

To aid in interpreting the index of discrimination, the maximum discrimination value and the discriminating efficiency are given for each item. The maximum discrimination is the highest possible index of discrimination for an item at a given level of difficulty. For example, an item answered correctly by 60% of the group would have an index of difficulty of 40 and a maximum discrimination of 80. This would occur when 100% of the upper group and 20% of the lower group answered the item correctly. The discriminating efficiency is the index of discrimination divided by the maximum discrimination. For example, an item with an index of discrimination of 40 and a maximum discrimination of 50 would have a discriminating efficiency of 80. This may be interpreted to mean that the item is discriminating at 80% of the potential of an item of its difficulty. For a more detailed discussion of the maximum discrimination and discriminating efficiency concepts, see the "Interpreting the Index of Discrimination" document.
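The two quantities just described can be computed directly. One common way to obtain the maximum discrimination, assuming equal-sized upper and lower groups, is twice the smaller of the percentage right and the percentage wrong; the sketch below uses that assumption and reproduces the worked examples in the text, and the helper names are mine.

    def max_discrimination(difficulty: float) -> float:
        """Highest possible index of discrimination for an item whose index of
        difficulty (percent wrong) is given, assuming equal-sized upper and
        lower groups: 2 * min(percent wrong, percent right)."""
        return 2 * min(difficulty, 100 - difficulty)

    def discriminating_efficiency(discrimination: float, difficulty: float) -> float:
        """Index of discrimination expressed as a percentage of the maximum
        possible for an item of that difficulty."""
        return 100 * discrimination / max_discrimination(difficulty)

    print(max_discrimination(40))              # 80, as in the first example above
    print(discriminating_efficiency(40, 75))   # 80.0: D = 40 against a maximum of 50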

Other Item Statistics

Some test analysts may desire more complex item statistics. Two correlations which are commonly used as indicators of item discrimination are shown on the item analysis report. The first is the biserial correlation, which is the correlation between a student's performance on an item (right or wrong) and his or her total score on the test. This correlation assumes that the distribution of test scores is normal and that there is a normal distribution underlying the right/wrong dichotomy. The biserial correlation has the characteristic, disconcerting to some, of having maximum values greater than unity. There is no exact test for the statistical significance of the biserial correlation coefficient.

The point biserial correlation is also a correlation between student performance on an item (right or wrong) and test score. It assumes that the test score distribution is normal and that the division on item performance is a natural dichotomy. The possible range of values for the point biserial correlation is +1 to -1. The Student's t test for the statistical significance of the point biserial correlation is given on the item analysis report. Enter a table of Student's t values with N - 2 degrees of freedom at the desired percentile point; N, in this case, is the total number of students appearing in the item analysis.

The mean scores for students who got an item right and for those who got it wrong are also shown. These values are used in computing the biserial and point biserial coefficients of correlation and are not generally used as item analysis statistics.

Generally, item statistics will be somewhat unstable for small groups of students. Perhaps fifty students might be considered a minimum number if item statistics are to be stable. Note that for a group of fifty students, the upper and lower groups would contain only thirteen students each. The stability of item analysis results will improve as the group of students is increased to one hundred or more. An item analysis for very small groups must not be considered a stable indication of the performance of a set of items.
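The t statistic mentioned above can be reproduced with the usual significance test for a correlation coefficient; the small Python sketch below is illustrative only, and the example values are made up.

    import math

    def t_for_point_biserial(r: float, n: int) -> float:
        """Student's t for a point biserial correlation r based on n students,
        with n - 2 degrees of freedom: t = r * sqrt((n - 2) / (1 - r**2))."""
        return r * math.sqrt((n - 2) / (1.0 - r ** 2))

    # Hypothetical example: r = .35 from an item analysis of 112 students.
    print(round(t_for_point_biserial(0.35, 112), 2))   # about 3.92, with df = 110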

Summary Data

The item analysis data are summarized on the last page of the item analysis report. The distribution of item difficulty indices is a tabulation showing the number and percentage of items whose difficulties are in each of ten categories, ranging from a very easy category (00-10) to a very difficult category (91-100). The distribution of discrimination indices is tabulated in the same manner, except that a category is included for negatively discriminating items.

The mean item difficulty is determined by adding all of the item difficulty indices and dividing the total by the number of items. The mean item discrimination is determined in a similar manner.

Test reliability, estimated by the Kuder-Richardson formula number 20, is given. If the test is speeded, that is, if some of the students did not have time to consider each test item, the reliability estimate may be spuriously high.

The final test statistic is the standard error of measurement. This statistic is a common device for interpreting the absolute accuracy of the test scores. The size of the standard error of measurement depends on the standard deviation of the test scores as well as on the estimated reliability of the test.

Occasionally, a test writer may wish to omit certain items from the analysis although these items were included in the test as it was administered. Such items may be omitted by leaving them blank on the test key. The response patterns for omitted items will be shown, but the keyed options will be listed as OMIT. The statistics for these items will be omitted from the Summary Data.

Report Options

A number of report options are available for item analysis data. The long-form item analysis report contains three items per page. A standard-form item analysis report is also available, in which the data for each item are summarized on one line. A sample report is shown below.

ITEM ANALYSIS    Test 4482    125 Items    112 Students

Percentages: Upper 27% - Middle - Lower 27%

Item  Key   1         2         3         4         5         Omit      Error     Diff  Disc
1     4     7-23-57   0- 4- 7   28- 8-36  64-62- 0  0- 0- 0   0- 0- 0   0- 0- 0   54    64
2     2     7-12- 7   64-42-29  14- 4-21  14-42-36  0- 0- 0   0- 0- 0   0- 0- 0   56    35

The standard form shows the item number, key (number of the correct option), the percentage of the upper, middle, and lower groups who selected each option, omitted the item or erred, the index of difficulty, and the index of discrimination. For example, in item 1 above, option 4 was the correct answer and it was selected by 64% of the upper group, 62% of the middle group and 0% of the lower group. The index of difficulty, based on the total group, was 54 and the index of discrimination was 64.

Item Analysis Guidelines

Item analysis is a completely futile process unless the results help instructors improve their classroom practices and item writers improve their tests. Let us suggest a number of points of departure in the application of item analysis data.

1. Item analysis gives necessary but not sufficient information concerning the appropriateness of an item as a measure of intended outcomes of instruction. An item may perform beautifully with respect to item analysis statistics and yet be quite irrelevant to the instruction whose results it was intended to measure. A most common error is to teach for behavioral objectives such as analysis of data or situations, ability to discover trends, ability to infer meaning, etc., and then to construct an objective test measuring mainly recognition of facts. Clearly, the objectives of instruction must be kept in mind when selecting test items.

2. An item must be of appropriate difficulty for the students to whom it is administered. If possible, items should have indices of difficulty no less than 20 and no greater than 80. It is desirable to have most items in the 30 to 50 range of difficulty. Very hard or very easy items contribute little to the discriminating power of a test.

3. An item should discriminate between upper and lower groups. These groups are usually based on total test score, but they could be based on some other criterion such as grade-point average, scores on other tests, etc. Sometimes an item will discriminate negatively; that is, a larger proportion of the lower group than of the upper group selected the correct option. This often means that the students in the upper group were misled by an ambiguity that the students in the lower group, and the item writer, failed to discover. Such an item should be revised or discarded.

4. All of the incorrect options, or distracters, should actually be distracting. Preferably, each distracter should be selected by a greater proportion of the lower group than of the upper group. If, in a five-option multiple-choice item, only one distracter is effective, the item is, for all practical purposes, a two-option item. Existence of five options does not automatically guarantee that the item will operate as a five-choice item.

Item Analysis of Classroom Tests: Aims and Simplified Procedures

Aim: How well did my test distinguish among students according to how well they met my learning goals?

Recall that each item on your test is intended to sample performance on a particular learning outcome. The test as a whole is meant to estimate performance across the full domain of learning outcomes you have targeted. Unless your learning goals are minimal or low (as they might be, for instance, on a readiness test), you can expect students to differ in how well they have met those goals. (Students are not peas in a pod!) Your aim is not to differentiate students just for the fun of it, but to be able to measure the differences in mastery that occur.

One way to assess how well your test is functioning for this purpose is to look at how well the individual items do so. The basic idea is that a good item is one that good students get correct more often than do poor students. You might end up with a big spread in scores, but what if the good students are no more likely than poor students to get a high score? If we assume that you have actually given them proper instruction, then your test has not really assessed what they have learned. That is, it is "not working."

An item analysis gets at the question of whether your test is working by asking the same question of all individual items: how well does each one discriminate? If you have lots of items that didn't discriminate much, if at all, you may want to replace them with better ones. If you find ones that worked in the wrong direction (where good students did worse) and therefore lowered test reliability, then you will definitely want to get rid of them. In short, item analysis gives you a way to exercise additional quality control over your tests. Well-specified learning objectives and well-constructed items give you a head start in that process, but item analyses can give you feedback on how successful you actually were. Item analyses can also help you diagnose why some items did not work especially well, and thus suggest ways to improve them (for example, if you find distracters that attracted no one, try developing better ones).

Reminder: Item analyses are intended to assess and improve the reliability of your tests. If test reliability is low, test validity will necessarily also be low. This is the ultimate reason you do item analyses: to improve the validity of a test by improving its reliability. Higher reliability will not necessarily raise validity (you can be more consistent in hitting the wrong target), but it is a prerequisite. That is, high reliability is necessary but not sufficient for high validity (do you remember this point on Exam 1?). However, when you examine the properties of each item, you will often discover how they may or may not actually have assessed the learning outcome you intended, which is a validity issue. When you change items to correct these problems, it means the item analysis has helped you to improve the likely validity of the test the next time you give it.

The procedure (apply it to the sample results I gave you)

1. Identify the upper 10 scorers and lowest 10 scorers on the test. Set aside the remainder.

2. Construct an empty chart for recording their scores, following the sample I gave you in class. This chart lists the students down the left, by name. It arrays each item number across the top. For a 20-item test, you will have 20 columns for recording the answers for each student. Underneath the item number, write in the correct answer (A, B, etc.).
3. Enter the student data into the chart you have just constructed.

a. Take the top 10 scorers, and write each student's name down the left, one row for each student. If there is a tie for 10th place, pick one student randomly from those who are tied.

b. Skip a couple of rows, then write the names of the 10 lowest-scoring students, one row for each.

c. Going student by student, enter each student's answers into the cells of the chart. However, enter only the wrong answers (A, B, etc.). Any empty cell will therefore signal a correct answer.

d. Go back to the upper 10 students. Count how many of them got Item 1 correct (this would be all the empty cells). Write that number at the bottom of the column for those 10. Do the same for the other 19 questions. We will call these sums RU, where U stands for "upper."

e. Repeat the process for the 10 lowest students. Write those sums under their 20 columns. We will call these RL, where L stands for "lower."

4. Now you are ready to calculate the two important indices of item functioning. These are actually only estimates of what you would get if you had a computer program to calculate the indices for everyone who took the test (some schools do), but they are pretty good.

a. Difficulty index. This is just the proportion of people who passed the item. Calculate it for each item by adding the number correct in the top group (RU) to the number correct in the bottom group (RL) and then dividing this sum by the total number of students in the top and bottom groups (20):

   difficulty = (RU + RL) / 20

Record these 20 numbers in a row near the bottom of the chart.

b. Discrimination index. This index is designed to highlight the extent to which students in the upper group were more likely than students in the lower group to get the item correct. That is, it is designed to get at the difference between the two groups. Calculate the index by subtracting RL from RU, and then dividing by half the number of students involved (10):

   discrimination = (RU - RL) / 10

Record these 20 numbers in the last row of the chart.

5. You are now ready to enter these discrimination indexes into a second chart.

6. Construct the second chart, based on the model I gave you in class. (This is the smaller chart that contains no student names.)

a. Note that there are two rows of column headings in the sample. The first row of headings contains the maximum possible discrimination indexes for each item difficulty level (more on that in a moment). The second row contains possible difficulty indexes. Let's begin with that second row of headings (labeled "p"). As your sample shows, the entries range on the far left from "1.0" (for 100%) to ".4-0" (40%-0%) for a final catch-all column. Just copy the numbers from the sample onto your chart.

b. Now copy the numbers from the first row of headings in the sample (labeled "Md").

7. Now is the time to pick up your first chart again, where you will find the discrimination indexes you need to enter into your second chart.

a. You will be entering its last row of numbers into the body of the second chart.

b. List each of these discrimination indexes in one and only one of the 20 columns. But which one? List each in the column corresponding to its difficulty level. For instance, if item 4's difficulty level is .85 and its discrimination index is .10, put the .10 in the difficulty column labeled ".85." This number is entered, of course, into the row for the fourth item.

8. Study this second chart.

a. How many of the items are of medium difficulty? These are the best, because they provide the most opportunity to discriminate (to see this, look at their maximum discrimination indexes in the first row of headings). Items that most everybody gets right or gets wrong simply can't discriminate much.

b. The important test of an item's discriminability is to compare it to the maximum possible. How well did each item discriminate relative to the maximum possible for an item of its particular difficulty level? Here is a rough rule of thumb:

   Discrimination index near the maximum possible = very discriminating item
   Discrimination index about half the maximum possible = moderately discriminating item
   Discrimination index about a quarter of the maximum possible = weak item
   Discrimination index near zero = non-discriminating item
   Discrimination index negative = bad item (delete it if worse than -.10)

9. Go back to the first chart and study it.

a. Look at whether all the distracters attracted someone. If some did not attract anyone, then the distracter may not be very useful. Normally you would want to examine it and consider how it might be improved or replaced.

b. Look also for distracters that tended to pull your best students and thereby lower discriminability. Consider whether the discrimination you are asking them to make is educationally significant (or even clear). You can't do this kind of examination for the sample data I have given you, but keep it in mind for real-life item analyses.

10. There is much more you can do to mine these data for ideas about your items, but this is the core of an item analysis. (A short scripted version of steps 3 and 4 appears after this list.)
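The following Python sketch is a minimal scripted version of steps 3 and 4, assuming ten top scorers and ten bottom scorers; the data structures, names, and the tiny two-item example are mine, not part of the handout.

    # answers: one string of responses per student; key: the correct answers.
    def item_indices(upper_answers, lower_answers, key):
        """Return (difficulty, discrimination) per item from the upper and lower
        groups of 10, as in steps 3 and 4: (RU + RL) / 20 and (RU - RL) / 10."""
        results = []
        for i, correct in enumerate(key):
            ru = sum(1 for a in upper_answers if a[i] == correct)  # RU
            rl = sum(1 for a in lower_answers if a[i] == correct)  # RL
            results.append(((ru + rl) / 20, (ru - rl) / 10))
        return results

    # Tiny illustration with a 2-item "test" and made-up response strings.
    upper = ["AB", "AB", "AC", "AB", "AB", "AB", "AB", "AD", "AB", "AB"]
    lower = ["AB", "CB", "CD", "AC", "BB", "CB", "AD", "CB", "DB", "CC"]
    print(item_indices(upper, lower, key="AB"))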
If you are lucky

If you use scantron sheets for grading exams, ask your school whether it can calculate item statistics when it processes the scantrons. If it can, those statistics probably include what you need: (a) difficulty indexes for each item, (b) correlations of each item with students' total scores on the test, and (c) the number of students who responded to each distracter. The item-total correlation is comparable to (and more accurate than) your discrimination index. If your school has this software, then you won't have to calculate any item statistics, which makes your item analyses faster and easier. It is important that you have calculated the indexes once on your own, however, so that you know what they mean.

Improve multiple choice tests using item analysis

Item analysis report

An item analysis includes two statistics that can help you analyze the effectiveness of your test questions. The question difficulty is the percentage of students who selected the correct response. The discrimination (item effectiveness) indicates how well the question separates the students who know the material well from those who don't.

Question difficulty

Question difficulty is defined as the proportion of students selecting the correct answer. The most effective questions in terms of distinguishing between high- and low-scoring students will be answered correctly by about half of the students. In practical terms, questions in most classroom tests will range from easy (difficulty near .90) to very difficult (difficulty near .40). Questions with difficulty estimates outside this range may not contribute much to the effective evaluation of student performance. Very easy questions may not sufficiently challenge the most able students; however, having a few relatively easy questions in a test may be important to verify the mastery of some course objectives. Keep tests balanced in terms of question difficulty. Very difficult questions, if they form most of a test, may produce frustration among students, yet some very difficult questions are needed to challenge the best students.

Question discrimination

The discrimination index (item effectiveness) is a kind of correlation that describes the relationship between a student's response to a single question and his or her total score on the test. This statistic can tell you how well each question was able to differentiate among students in terms of their ability and preparation. As a correlation, question discrimination can theoretically take values between -1.00 and +1.00. In practical terms, values for most classroom tests range from near 0.00 to near .90. If a question is so easy that nearly all students answered correctly, its discrimination will be near zero; extremely easy questions cannot distinguish among students in terms of their performance. If a question is so difficult that nearly all students answered incorrectly, the discrimination will likewise be near zero. The most effective questions will have moderate difficulty and high discrimination values. The higher the discrimination value, the more effective the question is in discriminating between students who perform well on the test and those who don't. Questions having low or negative discrimination values need to be reviewed very carefully for confusing language or an incorrect key. If no confusing language is found, then the course design for the topic of the question needs to be critically reviewed. A high level of student guessing on a question will result in a discrimination value near zero.
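A simple way to apply the guidance above when reviewing a report is to flag questions whose statistics fall outside the ranges it describes. The thresholds in the sketch below (difficulty outside .40-.90, discrimination at or below zero) come from the preceding paragraphs; everything else, including the sample data, is a made-up illustration.

    def flag_questions(stats):
        """stats: list of (question_id, difficulty, discrimination) tuples.
        Returns the questions worth a closer look, with the reason."""
        flagged = []
        for qid, difficulty, discrimination in stats:
            if discrimination <= 0.0:
                flagged.append((qid, "low or negative discrimination: check wording and key"))
            elif not 0.40 <= difficulty <= 0.90:
                flagged.append((qid, "difficulty outside the .40-.90 range"))
        return flagged

    report = [("Q1", 0.85, 0.45), ("Q2", 0.95, 0.05), ("Q3", 0.50, -0.12)]
    for qid, reason in flag_questions(report):
        print(qid, "-", reason)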

Steps in a review of an item analysis report

1. Review the difficulty and discrimination of each question.
2. For each question having low values of discrimination, review the distribution of responses along with the question text to determine what might be causing a response pattern that suggests student confusion.
3. If the text of the question is confusing, change the text or remove the question from the course database. If the question text is not confusing or faulty, then try to identify the instructional component that may be leading to student confusion.
4. Carefully examine the questions that discriminate well between high- and low-scoring students to fully understand the role that instructional design played in producing these results. Ask yourself what aspects of the instructional process appear to be most effective.

Test Item Performance: The Item Analysis

Table of Contents: Summary of Test Statistics | Test Frequency Distribution | Item Difficulty and Discrimination: Quintile Table | Interpreting Item Statistics | MERMAC - Test Analysis and Questionnaire Package

The ITEM ANALYSIS output consists of four parts: a summary of test statistics, a test frequency distribution, an item quintile table, and item statistics. This analysis can be processed for an entire class; if it is of interest to compare the item analysis for different test forms, the analysis can be processed by test form. The Division of Measurement and Evaluation staff is available to help instructors interpret their item analysis data.

Summary of Test Statistics

Part I of the ITEM ANALYSIS consists of a summary of the following statistics:

* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *
SAMPLE ITEM ANALYSIS
SUMMARY OF TEST STATISTICS

NUMBER OF ITEMS: 80 (Number of items on the test.)

MEAN SCORE: 60.92 (Arithmetic average; the sum of all scores divided by the number of scores.)

MEDIAN SCORE: 63.15 (The raw score point that divides the raw score distribution in half; 50% of the scores fall above the median and 50% fall below.)

STANDARD DEVIATION: 12.24 (Measure of the spread or variability of the score distribution. The higher the value of the standard deviation, the better the test is discriminating among student performance levels.)

RELIABILITY (KR-20): 0.915 (An estimate of test reliability indicating the internal consistency of the test. Reliability ranges from 0.00 to 1.00; a reliability of .70 or better is desirable for classroom tests.)

RELIABILITY (KR-21): 0.915 (An estimate of test reliability indicating the internal consistency of the test when item difficulties are approximately equal. Reliability ranges from 0.00 to 1.00; a reliability of .70 or better is desirable for classroom tests.)

S.E. OF MEASUREMENT: 3.58 (The accuracy of measurement expressed in the test score scale. The larger the standard error, the less precise the measure of student achievement. Two-thirds of the time, test takers' obtained scores fall within one standard error of measurement of their true score.)

POSSIBLE LOW SCORE: 0 (The lowest possible score.)

POSSIBLE HIGH SCORE: 80 (The highest possible score.)

OBTAINED LOW SCORE: 0 (The lowest score obtained.)

OBTAINED HIGH SCORE: 80 (The highest score obtained.)

NUMBER OF SCORES: 603 (The number of answer sheets submitted for scoring.)

BLANK SCORES (1): 0 (Number of test scores that could not be computed.)

INVALID SCORES: 0 (Number of test scores outside the range specified by the user.)

VALID SCORES: 603 (Only scores that fall within the range specified by the user are included in the analysis, so the user has the option of disregarding certain scores.)

(1) Blank and invalid scores (those falling outside the specified range) are counted but are omitted from the analysis.
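For readers who want to reproduce two of these figures, here is a small sketch (assumed Python, hypothetical data) of the usual KR-20 and standard-error-of-measurement calculations; these are the standard textbook formulas, not anything specific to MERMAC.

    # KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores),
    # for k dichotomously scored items (1 = correct, 0 = incorrect).
    def kr20(scored):
        k = len(scored[0])
        totals = [sum(s) for s in scored]
        n = len(totals)
        mean = sum(totals) / n
        var = sum((t - mean) ** 2 for t in totals) / n   # population variance; some programs use n - 1
        pq = 0.0
        for i in range(k):
            p = sum(s[i] for s in scored) / n            # item difficulty
            pq += p * (1 - p)
        return (k / (k - 1)) * (1 - pq / var)            # assumes the totals vary (var > 0)

    # Standard error of measurement = SD * sqrt(1 - reliability)
    def sem(sd, reliability):
        return sd * (1 - reliability) ** 0.5

    print(sem(12.24, 0.915))   # about 3.57, in line with the 3.58 reported above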

Test Frequency Distribution

Part II of the ITEM ANALYSIS program displays a test frequency distribution. The raw scores are ordered from high to low with corresponding statistics:

1. Standard score -- a linear transformation of the raw score that sets the mean equal to 500 and the standard deviation equal to 100. In normal score distributions for classes of 500 students or more, the standard score range usually falls between 200 and 800 (plus or minus three standard deviations from the mean); for classes with fewer than 30 students, the standard score range usually falls within two standard deviations of the mean, i.e., a range of 300 to 700.
2. Percentile rank -- the percentage of individuals who received a score lower than the given score, plus half the percentage of individuals who received the given score. This measure indicates a person's relative position within a group.
3. Percent -- the percentage of people in the total group who received the given score.
4. Frequency -- in a test analysis, the number of individuals who receive a given score.
5. Cumulative frequency -- in a test analysis, the number of individuals who score at or below a given score value.

[Sample output: MERMAC TEST FREQUENCY DISTRIBUTION, listing each raw score from high to low with its standard score, percentile rank, percent, frequency, and cumulative frequency, plus a histogram in which each * represents one person.]
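A brief sketch (illustrative Python, not the MERMAC code) of the first two statistics, using the definitions just given:

    # Standard score: linear transformation with mean 500 and SD 100.
    def standard_score(raw, mean, sd):
        return 500 + 100 * (raw - mean) / sd

    # Percentile rank: percent scoring below, plus half the percent at the score.
    def percentile_rank(raw, all_scores):
        below = sum(1 for s in all_scores if s < raw)
        at = sum(1 for s in all_scores if s == raw)
        return 100 * (below + 0.5 * at) / len(all_scores)

    # With the mean and SD from the summary above, the median raw score of 63.15
    # converts to roughly 518.
    print(standard_score(63.15, 60.92, 12.24))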

Item Difficulty and Discrimination: Quintile Table

Part III of the ITEM ANALYSIS output, an item quintile table, can aid in the interpretation of Part IV of the output, which compares the item responses against the total score distribution for each item. A good item discriminates between students who scored high and students who scored low on the examination as a whole. In order to compare different student performance levels on the examination, the score distribution is divided into fifths, or quintiles. The first fifth includes students who scored between the 81st and 100th percentiles; the second fifth includes students who scored between the 61st and 80th percentiles, and so forth. When the score distribution is skewed, more than one-fifth of the students may have scores within a given quintile and, as a result, less than one-fifth of the students may score within another quintile. The table indicates the sample size, the proportion of the distribution, and the score range within each fifth.

* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *
THE QUINTILE GRAPH AND MATRIX OF RESPONSES APPEARING WITH EACH ITEM ARE BASED ON THE STATISTICS INDICATED IN THE TABLE BELOW:

QUINTILE   SAMPLE SIZE   PROPORTION   SCORE RANGE
1ST        128           0.21         77 - 92
2ND        127           0.21         70 - 76
3RD        121           0.20         64 - 69
4TH        121           0.20         56 - 63
5TH        106           0.18         24 - 55
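The fifths themselves can be formed by ranking the total scores and slicing, roughly as in this sketch (illustrative Python). This simple version slices purely by rank into near-equal fifths; in the report the fifths are defined by score ranges, so with tied scores and a skewed distribution the counts can deviate from exactly 20%, as the table above shows.

    def quintiles(scores):
        ordered = sorted(scores, reverse=True)
        n = len(ordered)
        groups, start = [], 0
        for fifth in range(5):
            end = round(n * (fifth + 1) / 5)
            groups.append(ordered[start:end])   # groups[0] = top fifth
            start = end
        return groups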

Interpreting Item Statistics

Part IV of the ITEM ANALYSIS portrays item statistics which can help determine which items are good and which need improvement or deletion from the examination. The quintile graph on the left side of the output indicates the percent of students within each fifth who answered the item correctly. A good, discriminating item is one in which students who scored well on the examination answered the correct alternative more frequently than students who did not score well. Therefore, the scattergram graph should form a line going from the bottom left-hand corner to the top right-hand corner of the graph. Item 1 in the sample output shows an example of this type of positive linear relationship. Item 2 in the sample output also portrays a discriminating item; although few students correctly answered the item, the students in the first fifth answered it correctly more frequently than the students in the rest of the score distribution. Item 3 indicates a poor item: the graph shows no relationship between the fifths of the score distribution and the percentage of correct responses by fifths. However, it is likely that this item was miskeyed by the instructor -- note the response pattern for alternative B.

A. Evaluating Item Distractors: Matrix of Responses

On the right-hand side of the output, a matrix of responses by fifths shows the frequency of students within each fifth who answered each alternative and who omitted the item. This information can help point out which distractors, or incorrect alternatives, are not successful because: (a) they are not plausible answers and few or no students chose the alternative (see alternatives D and E, item 2), or (b) too many students, especially students in the top fifths of the distribution, chose the incorrect alternative instead of the correct response (see alternative B, item 3). A good item will result in students in the top fifths answering the correct response more frequently than students in the lower fifths, and students in the lower fifths answering the incorrect alternatives more frequently than students in the top fifths. The matrix of responses prints the correct response for the item on the right-hand side and encloses the correct response in the matrix in parentheses.

B. Item Difficulty: The PROP Statistic

The proportion (PROP) of students who answer each alternative and who omit the item is printed in the first row below the matrix. The item difficulty is the proportion of subjects in a sample who correctly answer the item. In order to obtain maximum spread of student scores it is best to use items with moderate difficulties. Moderate difficulty can be defined as the point halfway between perfect score and chance score. For a five choice item, moderate difficulty level is .60, or a range between .50 and .70 (because 100% correct is perfect and we would expect 20% of the group to answer the item correctly by blind guessing).
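A tiny worked version of that rule of thumb (illustrative Python): moderate difficulty is the point halfway between the chance score (1/k for k choices) and a perfect score.

    def moderate_difficulty(n_choices):
        chance = 1.0 / n_choices
        return (1.0 + chance) / 2

    print(moderate_difficulty(5))   # 0.6 for a five-choice item, as above
    print(moderate_difficulty(2))   # applying the same rule to true-false gives 0.75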

Evaluating Item Difficulty. For the most part, items which are too easy or too difficult cannot discriminate adequately between student performance levels. Item 2 in the sample output is an exception; although the item difficulty is .23, the item is a good, discriminating one. In item 4, everyone correctly answered the item; the item difficulty is 1.00. Such an item does not discriminate at all between good and poor students, and therefore does not contribute statistically to the effectiveness of the examination. However, if one of the instructor's goals is to check that all students grasp certain basic concepts, and if the examination is long enough to contain a sufficient number of discriminating items, then such an item may remain on the examination.

C. Item Discrimination: Point Biserial Correlation (RPBI)

Interpreting the RPBI Statistic. The point biserial correlation (RPBI) for each alternative and for omits is printed below the PROP row. It indicates the relationship between the item response and the total test score within the group tested, i.e., it measures the discriminating power of an item. It is interpreted similarly to other correlation coefficients. Assuming that the total test score accurately discriminates among individuals in the group tested, high positive RPBIs for the correct responses represent the most discriminating items. That is, students who chose the correct response scored well on the examination, whereas students who did not choose the correct response did not score well. It is also interesting to check the RPBIs for the item distractors, or incorrect alternatives. The opposite correlation between total score and choice of alternative is expected for the incorrect versus the correct alternative: where a high positive correlation is desired for the RPBI of a correct alternative, a high negative correlation is good for the RPBI of a distractor, i.e., students who answer with an incorrect alternative did not score well on the total examination. Due to restrictions incurred when correlating a continuous variable (total examination score) with a dichotomous variable (response vs. nonresponse to an alternative), the highest possible RPBI is about .80 instead of the usual maximum value of 1.00 for a correlation. This maximum RPBI is directly influenced by the item difficulty level. The maximum RPBI value of .80 occurs with items of moderate difficulty; the further the difficulty level deviates from moderate difficulty in either direction, the lower the ceiling on RPBI. For example, the maximum RPBI is about .58 for difficulty levels of .10 or .90. Therefore, in order to maximize item discrimination, items of moderate difficulty are preferred, although easy and difficult items can still be discriminating (see item 2 in the sample output).
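The same idea can be sketched in a few lines of Python (illustrative names, not the MERMAC implementation): for each alternative, correlate "chose this alternative" (1/0) with the total score.

    def _pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0   # 0.0 if no one (or everyone) chose it

    def rpbi_by_alternative(choices, totals, alternatives="ABCDE"):
        # choices: the letter each student picked for one item; totals: their total scores
        return {alt: _pearson([1 if c == alt else 0 for c in choices], totals)
                for alt in alternatives}

    choices = ["A", "B", "A", "C", "A", "B"]   # one item, six hypothetical students
    totals = [78, 52, 74, 49, 80, 61]
    print(rpbi_by_alternative(choices, totals, "ABC"))
    # The keyed alternative should come out with a high positive RPBI,
    # the distracters with negative (or near-zero) RPBIs.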

Evaluating Item Discrimination. When an instructor examines the item analysis data, the RPBI is an important indicator in deciding which items are discriminating and should be retained, and which items are not discriminating and should be revised or replaced by a better item (other content considerations aside). The quintile graph also illustrates this same relationship between item response and total scores, but the RPBI is a more accurate representation of it. An item with an RPBI of .25 or below should be examined critically for revision or deletion; items with RPBIs of .40 and above are good discriminators. Note that all items, not only those with RPBIs lower than .25, can be improved. An examination of the matrix of responses by fifths for all items may point out weaknesses, such as implausible distractors, that can be reduced by modifying the item.
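Those review thresholds are easy to apply mechanically; a minimal sketch (illustrative Python, using the keyed-answer RPBIs of the four sample items shown later):

    def review_list(rpbi_by_item, low=0.25, good=0.40):
        notes = {}
        for item, r in rpbi_by_item.items():
            if r <= low:
                notes[item] = "examine critically for revision or deletion"
            elif r >= good:
                notes[item] = "good discriminator"
            else:
                notes[item] = "acceptable; still check the matrix of responses"
        return notes

    print(review_list({1: 0.51, 2: 0.43, 3: 0.13, 4: 0.00}))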

It is important to keep in mind that the statistical functioning of an item should not be the sole basis for deleting or retaining an item. The most important quality of a classroom test is its validity, the extent to which items measure relevant tasks. Items that perform poorly statistically might be retained (and perhaps revised) if they correspond to specific instructional objectives in the course. Items that perform well statistically but are not related to specific instructional objectives should be reviewed carefully before being reused.


* * * MERMAC -- TEST ANALYSIS AND QUESTIONNAIRE PACKAGE * * *
SAMPLE ITEM STATISTICS (Part IV)

[For each item, the output plots the percent of correct responses within each fifth (the quintile graph) beside a matrix of responses by fifths; the matrices and the PROP and RPBI rows are reproduced below.]

ITEM 1 -- MATRIX OF RESPONSES BY FIFTHS (E IS CORRECT RESPONSE)
         A       B       C       D      (E)     OMIT
1ST      0       25      1       0      102     0
2ND      1       45      6       0      75      0
3RD      1       63      5       3      49      0
4TH      2       76      9       0      34      0
5TH      11      73      13      4      5       0
PROP     0.02    0.47    0.06    0.01  (0.44)   0.00
RPBI    -0.20   -0.33   -0.20   -0.13  (0.51)   0.00

ITEM 2 -- MATRIX OF RESPONSES BY FIFTHS (A IS CORRECT RESPONSE)
        (A)      B       C       D       E      OMIT
1ST      83      35      10      0       0      0
2ND      19      85      23      0       0      0
3RD      17      67      37      0       0      0
4TH      13      78      30      0       0      0
5TH      6       84      16      0       0      0
PROP    (0.23)   0.57    0.19    0.00    0.00   0.00
RPBI    (0.43)  -0.33   -0.05    0.00    0.00   0.00

ITEM 3 -- MATRIX OF RESPONSES BY FIFTHS (E IS CORRECT RESPONSE)
         A       B       C       D      (E)     OMIT
1ST      2       125     0       1      0       0
2ND      6       109     0       8      4       0
3RD      14      86      4       7      10      0
4TH      23      71      2       19     6       0
5TH      29      45      8       15     8       1
PROP     0.12    0.72    0.02    0.08  (0.05)   0.00
RPBI    -0.24    0.45   -0.16   -0.17  (0.13)  -0.14

ITEM 4 -- MATRIX OF RESPONSES BY FIFTHS (E IS CORRECT RESPONSE)
         A       B       C       D      (E)     OMIT
1ST      0       0       0       0      128     0
2ND      0       0       0       0      127     0
3RD      0       0       0       0      121     0
4TH      0       0       0       0      121     0
5TH      0       0       0       0      106     0
PROP     0.00    0.00    0.00    0.00  (1.00)   0.00
RPBI     0.00    0.00    0.00    0.00  (0.00)   0.00

Purpose of Item Analysis

OK, you now know how to plan a test and build a test. Now you need to know how to do ITEM ANALYSIS --> looks complicated at first glance, but actually quite simple

--> even I can do this, and I'm a mathematical idiot

Talking about norm-referenced, objective tests: mostly multiple-choice, but the same principles apply to true-false, matching, and short answer. By analyzing results you can refine your testing.

SERVES SEVERAL PURPOSES

1. Fix marks for the current class that just wrote the test
- find flaws in the test so that you can adjust the marks before returning it to students
- can find questions with two right answers, or that were too hard, etc., that you may want to drop from the exam
- even had to do that occasionally on Diploma exams: even after 36 months in development, maybe 20 different reviewers, and extensive field tests, there is still occasionally a question whose problems only become apparent after you give the test
- more common on classroom tests -- but instead of getting defensive, or making these decisions at random on the basis of which of your students can argue with you, do it scientifically

2. More diagnostic information on students -- another immediate payoff of item analysis
- Classroom level: will tell you which questions they were all guessing on; or if you find a question which most of them found very difficult, you can reteach that concept
- CAN do item analysis on pretests too: if you find a question they all got right, don't waste more time on this area; find the wrong answers they are choosing to identify common misconceptions
- can't tell this just from the score on the total test, or the class average
- Individual level: isolate the specific errors this child made
- after you've planned these tests, written perfect questions, and now analyzed the results, you're going to know more about these kids than they know themselves

3. Build future tests, revise test items to make them better
- REALLY pays off the second time you teach the same course
- by now you know how much work writing good questions is
- studies have shown us that it is FIVE times faster to revise items that didn't work, using item analysis, than to try to replace them with completely new questions -- a new item would just have new problems anyway

--> this way you eventually get perfect items, the envy of your neighbours
- SHOULD NOT REUSE WHOLE TESTS --> diagnostic teaching means that you are responding to the needs of your students
- so after a few years you build up a bank of test items; you can custom make tests for your class
- know what the class average will be before you even give the test, because you will know approximately how difficult each item is before you use it; can spread difficulty levels across your blueprint too...

4. Part of your continuing professional development
- doing the occasional item analysis will help teach you how to become a better test writer
- and you're also documenting just how good your evaluation is -- useful for dealing with parents or principals if there's ever a dispute
- once you start bringing out all these impressive looking stats, parents and administrators will believe that maybe you do know what you're talking about when you fail students...
- parent says, "I think your question stinks";

well, "according to the item analysis, this question appears to have worked well -- it's your son that stinks"

(Just kidding! -- actually, face validity takes priority over stats any day!) And if the analysis shows that the question does stink, you've already dropped it before you've handed it back to the student, let alone the parent seeing it...

5. Before and After Pictures -- long term payoff
- collect this data over ten years: not only do you get a great item bank, but if you change how you teach the course, you can find out if the innovation is working
- if you have a strong class (as compared to the provincial baseline) but they do badly on the same test you used five years ago, the new textbook stinks.

ITEM ANALYSIS is one area where even a lot of otherwise very good classroom teachers fall down. They think they're doing a good job; they think they're doing good evaluation, but without doing item analysis, they can't really know. Part of being a professional is going beyond the illusion of doing a good job to finding out whether you really are -- but it's something a lot of teachers just don't know HOW to do, or they do it indirectly when kids argue with them... they wait for complaints from students, students' parents, and maybe other teachers...

ON THE OTHER HAND.... I do realize that I am advocating more work for you in the short term, but it will pay off in the long term.

But realistically:

* Probably only doing it for your most important tests -- end of unit tests, final exams --> summative evaluation
- especially if you're using common exams with other teachers
- common exams give you a bigger sample to work with, which is good
- makes sure that questions the other teacher wrote are working for YOUR class -- maybe they taught different stuff in a different way
- impress the daylights out of your colleagues

* Probably only doing it for test questions you are likely going to reuse next year

* Spend less time on item analysis than on revising items
- item analysis is not an end in itself; no point unless you use it to revise items and help students on the basis of the information you get out of it

I also find that, if you get into it, it is kind of fascinating. When stats turn out well, it's objective, external validation of your work. When stats turn out differently than you expect, it becomes a detective mystery as you figure out what went wrong. But you'll have to take my word on this until you try it on your own stuff.

Eight Simple Steps to Item Analysis

1. Score each answer sheet, write the score total on the corner -- obviously have to do this anyway.

2. Sort the pile into rank order from top to bottom score (1 minute, 30 seconds tops).

3. If a normal class of 30 students, divide the class in half -- same number in top and bottom group; toss the middle paper if there's an odd number (put it aside).

4. Take the 'top' pile and count the number of students who responded to each alternative -- the fast way is simply to sort piles into "A", "B", "C", "D" // or true/false, or the type of error you get for short answer, fill-in-the-blank

OR set up a spreadsheet if you're familiar with computers:

ITEM ANALYSIS FORM -- TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM        UPPER   LOWER   DIFFERENCE   D       TOTAL   DIFFICULTY
1.   A        0
    *B        4
     C        1
     D        1
     O

*=Keyed Answer

Repeat for the lower group:

ITEM ANALYSIS FORM -- TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM        UPPER   LOWER   DIFFERENCE   D       TOTAL   DIFFICULTY
1.   A        0
    *B        4       2
     C        1
     D        1
     O

*=Keyed Answer

this is the time consuming part --> but not that bad, can do it while watching TV, because you're just sorting piles

THREE POSSIBLE SHORT CUTS HERE (STEP 4)

(A) If you have a large sample of around 100 or more, you can cut down the sample you work with
- take the top 27% (27 out of 100) and the bottom 27% (so you're only dealing with 54, not all 100); put the middle 46 aside for the moment
- the larger the sample, the more accurate, but you have to trade that off against labour; using the top 1/3 or so is probably good enough by the time you get to 100 -- 27% is the magic figure statisticians tell us to use
- I'd use halves at 30, but you could just use a sample of the top 10 and bottom 10 if you're pressed for time -- but it means a single student changes the stats by 10%; trading off speed for accuracy... still, I'd rather have you doing ten and ten than nothing

(B) Second shortcut, if you have access to a photocopier (budgets!)
- photocopy the answer sheets, cut off the identifying info (can't use this if handwriting is distinctive)
- colour code high and low groups --> dab of marker pen colour
- distribute randomly to students in your class so they don't know whose answer sheet they have
- get them to raise their hands: for #6, how many have "A" on a blue sheet? how many have "B"? how many "C"? For #6, how many have "A" on a red sheet?...
- some reservations, because they can screw you up if they don't take it seriously
- another version of this would be to hire the kid who cuts your lawn to do the counting, provided you've removed all identifying information; I actually did this for a bunch of teachers at one high school in Edmonton when I was in university, for pocket money

(C) Third shortcut, IF you can't use a separate answer sheet: sometimes faster to type than to sort

SAMPLE OF TYPING FORMAT FOR ITEM ANALYSIS

ITEM # 1 2 3 4 5 6 7 8 9 10

KEY T F T F T A D C A B

STUDENT

Kay T T T F F A D D A C

Jane T T T F T A D C A D

John F F T F T A D C A B

type the name, then T or F, or A, B, C, D == all left hand on the typewriter, leaving the right hand free to turn pages (from Sax)

IF you have a computer program -- there are some kicking around -- it will give you all the stats you need, plus bunches more you don't, automatically after this stage

OVERHEAD: SAMPLE ITEM ANALYSIS FOR CLASS OF 30 (PAGE #1) (in text)

5. Subtract the number of students in the lower group who got the question right from the number of high-group students who got it right -- quite possible to get a negative number.

ITEM ANALYSIS FORM -- TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM        UPPER   LOWER   DIFFERENCE   D       TOTAL   DIFFICULTY
1.   A        0
    *B        4       2        2
     C        1
     D        1
     O

*=Keyed Answer

6. Divide the difference by the number of students in the upper or lower group -- in this case, divide by 15. This gives you the "discrimination index" (D).

ITEM ANALYSIS FORM -- TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM        UPPER   LOWER   DIFFERENCE   D       TOTAL   DIFFICULTY
1.   A        0
    *B        4       2        2        0.333
     C        1
     D        1
     O

*=Keyed Answer

7. Total the number who got it right.

ITEM ANALYSIS FORM -- TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM        UPPER   LOWER   DIFFERENCE   D       TOTAL   DIFFICULTY
1.   A        0
    *B        4       2        2        0.333     6
     C        1
     D        1
     O

*=Keyed Answer

8. If you have a large class and were only using the 1/3 samples for the top and bottom groups, then you have to NOW count the number of the middle group who got each question right (not each alternative this time, just right answers).

9. Sample form: class size = 100. (If a class of 30, upper and lower halves only -- no middle column here.)

10. Divide the total by the total number of students: difficulty = the proportion who got it right (p).

ITEM ANALYSIS FORM -- TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM        UPPER   LOWER   DIFFERENCE   D       TOTAL   DIFFICULTY
1.   A        0
    *B        4       2        2        0.333     6       .42
     C        1
     D        1
     O

*=Keyed Answer

11. You will NOTE the complete lack of complicated statistics --> counting, adding, dividing --> no tricky formulas required for this; not going to worry about corrected point biserials etc. -- one of the advantages of using a fixed number of alternatives
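If you do want to hand the arithmetic to a computer, the whole thing is a few lines -- a rough sketch (Python, made-up names) of steps 5 through 10:

    # upper / lower: lists of 1 (got the item right) or 0 (got it wrong),
    # one entry per student in that group; groups are assumed equal in size.
    def upper_lower_stats(upper, lower):
        group_size = len(upper)
        difference = sum(upper) - sum(lower)          # can be negative
        d_index = difference / group_size             # discrimination index (D)
        total_right = sum(upper) + sum(lower)
        difficulty = total_right / (len(upper) + len(lower))   # proportion p
        # NOTE: with a big class and the 27% short cut, difficulty should use the
        # count of right answers across ALL students (middle group included).
        return d_index, difficulty

    # e.g. 15 per group, 10 of the top half and 5 of the bottom half correct:
    d, p = upper_lower_stats([1] * 10 + [0] * 5, [1] * 5 + [0] * 10)
    print(d, p)   # 0.33..., 0.5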

Interpreting Item Analysis

Let's look at what we have and see what we can see -- 90% of item analysis is just common sense...

1. Potential Miskey
2. Identifying Ambiguous Items
3. Equal Distribution to All Alternatives
4. Alternatives Are Not Working
5. Distracter Too Attractive
6. Question Not Discriminating
7. Negative Discrimination
8. Too Easy
9. Omit
10. & 11. Relationship between D Index and Difficulty (p)

Item Analysis of Computer Printouts.

1. What do we see looking at this first one? [Potential Miskey]

            Upper   Low   Difference    D     Total   Difficulty
1.   *A       1      4       -3        -.2      5        .17
      B       1      3
      C      10      5
      D       3      3
      O

- this is an easy and very common mistake for you to make; better you find out now, before you hand the test back, than when the kids complain -- OR WORSE, they don't complain, and teach themselves your miskey as the "correct" answer
- so check it out and rescore that question on all the papers before handing them back: makes it 10 - 5, Difference = 5; D = .34; Total = 15; difficulty = .50 --> nice item
- OR: you check and find that you didn't miskey it --> that IS the answer you thought

Two possibilities:
- one possibility is that you made a slip of the tongue and taught them the wrong answer (anything you say in class can be taken down and used against you on an examination....)
- more likely it means even "good" students are being tricked by a common misconception --> you're not supposed to have trick questions, so you may want to dump it --> give those who got it right their point, but total the rest of the marks out of 24 instead of 25
- if scores are high, or you want to make a point, you might let it stand and then teach to it --> sometimes, if they get caught, it will help them to remember better in future -- such as very fine distinctions, or crucial steps which are often overlooked
- REVISE it for next time to weaken the too-attractive alternative ("C" in this example) -- alternatives are not supposed to draw more than the keyed answer -- almost always an item flaw, rather than a useful distinction
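(If you're doing this on a spreadsheet or in code anyway, the miskey check is one comparison -- a rough Python sketch, using the upper-group tallies from the example above:)

    def possible_miskey(upper_counts, key):
        # Flag the alternative the top group liked best, if it beat the keyed answer.
        best = max(upper_counts, key=upper_counts.get)
        if best != key and upper_counts[best] > upper_counts[key]:
            return best
        return None

    print(possible_miskey({"A": 1, "B": 1, "C": 10, "D": 3}, "A"))   # 'C' -- check the key!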

2. What can we see with #2? [Can identify ambiguous items]

            Upper   Low   Difference    D     Total   Difficulty
2.    A       6      5
      B       1      2
     *C       7      5        2        .13     12       .40
      D       1      3
      O

- in #2, about equal numbers of top students went for A and for the keyed answer C; suggests they couldn't tell which was correct: either students didn't know this material (in which case you can reteach it), or the item was defective
- look at their favourite alternative again, and see if you can find any reason they could be choosing it; often items that look perfectly straightforward to adults are ambiguous to students (favourite examples of ambiguous items...)
- if you NOW realize that A was a defensible answer, rescore before you hand it back to give everyone credit for either A or C -- avoids arguing with you in class
- if it's clearly a wrong answer, then you now know which error most of your students are making to get the wrong answer -- useful diagnostic information on their learning, your teaching

Equal distribution to all alternatives

            Upper   Low   Difference    D     Total   Difficulty
3.    A       4      3
      B       3      4
     *C       5      4        1        .06      9       .30
      D       3      4
      O

- in item #3, students respond about equally to all alternatives; usually means they are guessing

Three possibilities:
1. may be material you didn't actually get to yet -- you designed the test in advance (because I've convinced you to plan ahead) but didn't actually get everything covered before the holidays.... or it's an item on a common exam that you didn't stress in your class
2. item so badly written students have no idea what you're asking
3. item so difficult students are just completely baffled

Review the item:
- if badly written (by the other teacher) or on material your class hasn't taken, toss it out and rescore the exam out of the lower total -- BUT give credit to those that got it, to a total of 100%
- if it seems well written, but too hard, then you know to (re)teach this material for the rest of the class....
- maybe the 3 who got it are the top three students -- tough but valid item: OK, if the item tests a valid objective; you want to provide the occasional challenging question for top students
- but make sure you haven't defined "top 3 students" as "those able to figure out what the heck I'm talking about"

Alternatives aren't working

            Upper   Low   Difference    D     Total   Difficulty
4.    A       1      5
     *B      14      7        7        .47     21       .77
      C       0      2
      D       0      0
      O

- example #4 --> no one fell for D --> so it is not a plausible alternative
- the question is fine for this administration, but revise the item for next time: toss alternative D and replace it with something more realistic
- each distracter has to attract at least 5% of the students; in a class of 30, it should get at least two students
- or you might accept one if you positively can't think of another fourth alternative -- otherwise, do not reuse the item
- if two alternatives don't draw any students --> might consider redoing it as true/false

Distracter too attractive

            Upper   Low   Difference    D     Total   Difficulty
5.    A       7     10
      B       1      2
      C       1      1
     *D       5      2        3        .20      7       .23
      O

- sample #5 --> too many going for A --> no ONE distracter should get more than the key
--> no one distracter should pull more than about half of the students -- doesn't leave enough for the correct answer and five percent for each alternative
- keep it for this time; weaken it for next time
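(Again, if you're working in a spreadsheet or code, both distracter rules -- at least 5% each, and never more than the key or about half the class -- are easy to automate; a rough Python sketch with the combined tallies from example #5:)

    def check_distracters(counts, key, n_students):
        # counts: total choices per alternative across both groups, e.g. {"A": 17, ...}
        problems = []
        for alt, n in counts.items():
            if alt == key:
                continue
            if n < 0.05 * n_students:
                problems.append(alt + ": implausible -- fewer than 5% chose it")
            if n > counts[key] or n > 0.5 * n_students:
                problems.append(alt + ": too attractive -- check for a miskey or ambiguity")
        return problems

    print(check_distracters({"A": 17, "B": 3, "C": 2, "D": 7}, "D", 30))   # flags A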

Question not discriminating

            Upper   Low   Difference    D     Total   Difficulty
6.   *A       7      7        0        .00     14       .47
      B       3      2
      C       2      1
      D       3      5
      O

- sample #6: the low group gets it as often as the high group
- on norm-referenced tests, the point is to rank students from best to worst, so individual test items should have good students get the question right and poor students get it wrong
- the test overall decides who is a good or poor student on this particular topic: those who do well have more information and skills than those who do less well
- so if, on a particular question, those with more skills and knowledge do NOT do better, something may be wrong with the question
- the question may be VALID, but off to