
A Journal

CLEAR Exam Review

Volume XXIII, Number 1
Spring 2012


CLEAR Exam Review is a journal, published twice a year, reviewing issues affecting testing and credentialing. CER is published by the Council on Licensure, Enforcement, and Regulation, 403 Marquis Ave., Suite 200, Lexington, KY 40502.

Editing and composition of this journal have been underwritten by Prometric, which specializes in the design, development, and full-service operation of high-quality licensing, certification and other adult examination programs.

Subscriptions to CER are sent free of charge to all CLEAR members and are available for $30 per year to others. Contact Stephanie Thompson at (859) 269-1802, or at her e-mail address, [email protected], for membership and subscription information.

Advertisements and classified listings (e.g., position vacancies) for CER may be reserved by contacting Janet Horne at the address or phone number noted above. Ads are limited in size to 1/4 or 1/2 page, and cost $100 or $200, respectively, per issue.

Editorial Board

Janet Ciuccio
American Psychological Association

Rose C. McCallin
Colorado Department of Regulatory Agencies

Steven Nettles
Applied Measurement Professionals

Coeditor
Michael Rosenfeld, Ph.D.
Educational Testing Service
Princeton, NJ
[email protected]

Coeditor
F. Jay Breyer, Ph.D.
Educational Testing Service
Rosedale Road, MS13-R
Princeton, NJ
[email protected]

CLEAR Exam Review

Contents

FROM THE EDITORS ................................................ 1
F. Jay Breyer, Ph.D.
Michael Rosenfeld, Ph.D.

COLUMNS

Abstracts and Updates ........................................... 2
George T. Gray, Ed.D.

Technology and Testing .......................................... 7
Robert Shaw, Jr., Ph.D.

Legal Beat ..................................................... 11
Dale J. Atkinson, Esq.

ARTICLES

Challenges in Developing High Quality Test Items .............. 15
Greg Applegate, MBA

A Multistate Approach to Setting Standards: An Application to Teacher Licensure Tests .............. 18
Richard J. Tannenbaum, Ph.D.

Effect on Pass Rate when Innovative Items Added to the National Certification Examination for Nurse Anesthetists .............. 25
Mary Anne Krogh, Ph.D., CRNA

VOLUME XXIII, NUMBER 1 SPRING 2012

Copyright ©2012 Council on Licensure, Enforcement, and Regulation. All rights reserved. ISSN 1076-8025


From the Editors

Welcome to the Spring 2012 issue of the CLEAR Exam Review, volume XXIII, number 1. This issue has three columns and three articles that we think you will find interesting and informative. This year marks our 21st year editing the CLEAR Exam Review, which started with volume II, number 1 in 1991 for what was then the Summer issue.

In this issue, George Gray’s Abstracts and Updates column reviews a book on psychometric theory, an internet paper on item response theory and Rasch modeling, a book about developing a quality certification program, and two instructional modules – one on subscores and a second on linking and equating. He also summarizes four articles covering topics such as equating using a graphical approach, changing item responses, adjusting human scores for rater effects, and certification in pediatric nursing.

In the Technology and Testing column, Robert Shaw talks about item banking standards. His intention with this article is to describe a current set of concepts that could affect some item banks today and in the future. This concept of item-banking embraces interoperability among multiple systems, which is in contrast to systems that were built in the past to serve the parochial needs of a single user.

Dale Atkinson’s Legal Beat column explores a recent case in which a licensure candidate had his scores cancelled and his license revoked for suspected impersonation. Given the outcome of the case, the column will be of interest to many test sponsors and test vendors. It deserves your attention.

This issue also contains an article by Mary Anne Krogh on the effect that introducing innovative item types has on pass rates in a certification test. The article has some practical implications for credentialing programs seeking to introduce new item types and serves as an introduction to an area that is sure to become increasingly important in the future. A second article by Greg Applegate summarizes the factors that lead to quality tests as well as the steps needed to create quality items. Our third article in this issue is from Richard Tannenbaum, who discusses standard setting for multistate licensing programs. He describes a multistate standard-setting approach designed to address the issues presented by the more traditional state-by-state approach for recommending passing scores on teacher licensure tests. Although the multistate approach has been used with teacher licensure tests, it certainly may be applied to other licensure contexts that involve multiple jurisdictions or multiple agencies.

We hope that we cover topics that will be of interest to you. We are always interested in new topics and appreciate article submissions. We urge you to submit an article on some interesting endeavor that you have worked on recently. We also invite readers interested in writing a column that would appeal to CER readers to get in touch with either of us at Educational Testing Service, Princeton NJ 08541.

Now turn the page and read on …


This issue’s column focuses on pedagogical references concerning measurement concepts, a new book on certification, an article on the value of certification in a nursing specialty, and a number of papers focusing on specific measurement topics (equating and IRT test characteristic curves, the merits of allowing for review of items in computer-based tests, rater effects on performance assessment, test subscores, and the concept of population invariance in linking and equating).

Test Theory

Raykov, T. and Marcoulides, G.A. (2011). Introduction to Psychometric Theory. New York: Routledge/Taylor & Francis Group, 335 pp.

If you pick up this book, what are you getting into? The word “introduction” seems inviting enough, but “psychometric theory” is a term that is inherently intimidating to some. The authors describe the purpose of the book as being “to provide a coherent introduction to psychometric theory which would adequately cover the basics of the subject.” (p. xi) They describe their approach as “a relatively non-technical introduction to the subject” using “mathematical formulas mainly in their definitional meaning.” (p. xi) No previous knowledge of measurement is required, but a basic course in analysis of variance and regression analysis is recommended. The “short introduction to matrix algebra” on page 21 may seem like a hurdle to be overcome, but quite a bit of the text throughout the book is carefully written with the interested student in mind.

The sequence of chapter topics reflects the conceptual approach taken in the book. Early chapters focus on factor analysis and move on to latent variable modeling and confirmatory factor analysis. Classical test theory is the subject of the next chapter. The first topic covered is the nature of measurement error, which leads into the concept that an observed score has two components: a true score and an error score. In a section titled “misconceptions about classical test theory” the authors indicate that there really isn’t a classical test theory model, that there is no assumption that true score and error score are unrelated, and there is no assumption that error scores on different tests are uncorrelated. (p. 121)

Subsequent chapters of the book focus on familiar topics such as reliability, validity, generalizability theory, and item response theory—again focusing on conceptual clarity, as well as the relationship to latent variable modeling.

Abstracts and Updates

GEORGE T. GRAY

George T. Gray, EdD, is director of the Test Development Department, Workforce Development Division, ACT, Inc.


Item Response Theory and Rasch Modeling

Yu, C. H. (2010). A simple guide to the item response theory (IRT) and Rasch modeling. www.creative-wisdom.com

Normally this column doesn’t reference internet papers, but this one came up in a search and proved to have some interesting attributes. Five topics are covered in step-by-step narration with few formulas and Greek letters: (1) item calibration and ability estimation, (2) item characteristic curves in one to three parameters, (3) item information functions and test information functions, (4) the item-person map, and (5) misfit. One distinction that the author (and other Rasch model advocates) makes is that the Rasch model is not a one-parameter example of item response theory (IRT). Those not wedded to the Rasch model would say that there are three commonly used IRT models depending on the number of parameters: item difficulty only, item difficulty and statistical discrimination, and the addition of a third parameter – getting an item correct by “chance.”
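For readers who want to see how the parameter counts play out, the short Python sketch below evaluates a logistic item characteristic curve under one-, two-, and three-parameter assumptions at a few ability levels. The item parameters are invented for illustration and are not taken from Yu’s paper.

```python
import math

def icc(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a 1-, 2-, or 3-parameter
    logistic model: difficulty b, discrimination a, pseudo-guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative (made-up) parameters for one item under each model.
for theta in [-2, -1, 0, 1, 2]:
    p1 = icc(theta, b=0.5)                  # one parameter: difficulty only
    p2 = icc(theta, b=0.5, a=1.4)           # two parameters: adds discrimination
    p3 = icc(theta, b=0.5, a=1.4, c=0.20)   # three parameters: adds "chance" success
    print(f"theta={theta:+d}  1PL={p1:.2f}  2PL={p2:.2f}  3PL={p3:.2f}")
```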

Although the difference between one-parameter IRT and Rasch may escape the reader, it is not a barrier to understanding the paper. Item calibration and ability estimation is illustrated with tables of item responses showing correct and incorrect responses and percent correct. The Rasch item characteristic curve (ICC) is illustrated and then compared with two- and three-parameter IRT ICCs. The third part of the paper graphically illustrates item information functions and test information functions. Part IV defines the logit as a unit of measurement and introduces the item difficulty–person ability map produced by the Winsteps Rasch software. The last topic presented is misfit: the author brings the reader to the topic of infit and outfit standardized residuals with hardly an equation in sight – a major boon for the reader who is an enthusiastic learner but has a limited mathematical background.

Developing Certification Programs

Brauer, R.L. (2011). Exceptional Certification – Principles, Concepts, and Ideas for Achieving Credentialing Excellence. Champaign, IL: Premier Print Group, 263 pp.

The author of this book has written it to share his experience in the certification field, and it is indeed a book filled with useful information that will benefit organizations offering a certification, particularly those planning to develop a new certification program. The author brings his experience in certification to a wide range of topics: if you are planning to do this, you ought to think about that. It’s like having a consultant between the covers of a book.

The book has twenty-six chapters, including introductory topics such as psychometrics, finance, governance, administration and staffing, marketing and sales, and legal assistance. The second major section of the book is titled: “Becoming Exceptional: Moving Beyond Certification Basics.” It includes topics such as: focus on customers, creating institutional memory, achieving financial success, governance enhancements, and a culture of excellence.

The typical format of the presentation in the book is the listing of a topic followed by a paragraph to a page of explanation. The shorter entries are mainly items to think about or the ingredients of the overall recipe for certification. The longer sections take a more complete “cookbook” approach. A stunning number of topics are covered, and short vignettes illustrating actual situations are liberally interspersed throughout the text. There are comments on the finances required to start a new certification, the viability of launching a certification, given the prospect of a relatively small number of applicants, suggestions for economical ways to reward and recognize organizational staff, and how to detect that a person with a similar name has “adopted” a deceased certificant’s credential.

One inherent limitation of the approach taken in this book is that the number of topics covered limits the amount of detail that can be provided. For example, Brauer covers international certification in three pages. In contrast, Certification: the ICE Handbook, 2nd edition devotes a chapter to this topic. As the ICE Handbook is cited in the page of references (p. 233), it is certainly possible to use the Brauer book as a starting point for a number of topics and then branch out to other sources.

Notwithstanding the virtues of the book, brevity does not always serve the cause well. In some cases brief recommendations should be considered with caution. For example, in the psychometric section, it is stated that, “A typical acceptance range (for the point-biserial item discrimination statistic) is 0.33 to 0.90.” (p. 17) For a test of homogeneous content, such as two-digit multiplication problems, items will meet this criterion; however, for certification examinations, the content covered is usually broad and contains a mix of facts and applications. The yield of items that meet the stated 0.33 criterion is likely to be low. In fact, the definition of the term discrimination in the glossary states that the point biserial correlation “should be above 0.10 for each item.” (p. 239) This benchmark is a much more realistic expectation.
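As a point of reference for these criteria, the sketch below shows one common way a point-biserial discrimination index can be computed: the correlation between a dichotomous item score and the total test score (in practice the item is often removed from the total first). The response data are invented for illustration.

```python
from statistics import mean, pstdev

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a dichotomous item score (0/1)
    and the examinees' total test scores."""
    mx, my = mean(item_scores), mean(total_scores)
    sx, sy = pstdev(item_scores), pstdev(total_scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
    return cov / (sx * sy)

# Made-up scores for eight examinees: one item (0/1) and total test score.
item  = [1, 0, 1, 1, 0, 1, 0, 1]
total = [42, 25, 38, 45, 30, 40, 22, 36]
print(round(point_biserial(item, total), 2))
```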


Another example is illustrated by the following quote: “Suppose on a 200 item examination the computed standard error of measurement is 3. Also assume that a candidate’s actual score was 130. The standard error of measurement means that the ‘true’ score for the candidate is between 127 and 133.” (p. 16) This explanation is incomplete: the true score lies within one standard error of the observed score approximately two-thirds of the time and within two standard errors (124 and 136 in the cited example) about 95% of the time. (Crocker and Algina, p. 123)
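The arithmetic behind the corrected interpretation can be shown in a few lines; the observed score of 130 and the SEM of 3 are the values quoted above.

```python
# Classical-test-theory rule of thumb: an observed score of 130 with SEM = 3
# gives roughly a 68% band of +/- 1 SEM and a 95% band of +/- 2 SEM.
observed, sem = 130, 3
for k, coverage in [(1, "about 68%"), (2, "about 95%")]:
    low, high = observed - k * sem, observed + k * sem
    print(f"{coverage} of the time the true score falls between {low} and {high}")
```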

The book is available only through amazon.com. Overall, Exceptional Certification has a great deal to recommend it. The reader should augment the psychometrics chapter with other resources.

Equating with Test Characteristic Curves

Wyse, A.E. and Reckase, M.D. (2011). A graphical approach to evaluating equating using test characteristic curves. Applied Psychological Measurement, 35(3): 217-234.

Item information functions are graphic plots of the amount of IRT information that an item provides at different ability levels. Easy items have their highest information at low levels of ability and vice versa. Test information functions represent the sum of information from items on the test at each ability level. Tests that are equated should have very similar test characteristic curves, and the authors use this concept to equate test forms.

A study is reported using forms of the Multistate Bar Examination (MBE). A new approach to equating “involves graphically examining the difference in test characteristic curves (TCCs), calculating the maximum absolute difference between the TCCs, and comparing the differences in TCCs to the difference that matters (DTM).” (p. 217) The authors evaluate six different scaling approaches using a common items nonequivalent groups design. They conclude that, “the Stocking-Lord and fixed-parameter methods appear to perform the best for equating the MBE and that the use of concurrent calibration is not desirable.” (p. 217)
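A rough sketch of the general idea, not the authors’ exact procedure, appears below: compute each form’s test characteristic curve over a grid of ability values, take the maximum absolute difference, and compare it to a chosen “difference that matters.” The item parameters and the DTM value are invented for illustration.

```python
import math

def tcc(theta, items):
    """Test characteristic curve: expected raw score at ability theta,
    summing 2PL item probabilities (a = discrimination, b = difficulty)."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in items)

# Made-up (a, b) parameters for two forms placed on the same scale.
form_x = [(1.0, -1.0), (1.2, -0.3), (0.8, 0.2), (1.1, 0.9)]
form_y = [(0.9, -1.1), (1.1, -0.2), (1.0, 0.3), (1.2, 0.8)]

grid = [i / 10.0 for i in range(-40, 41)]            # theta from -4 to 4
max_diff = max(abs(tcc(t, form_x) - tcc(t, form_y)) for t in grid)

dtm = 0.5   # hypothetical "difference that matters", e.g. half a raw-score point
print(f"max |TCC_X - TCC_Y| = {max_diff:.3f}; flag = {max_diff > dtm}")
```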

Examinee Opportunity to Change Item Responses

van der Linden, W.J., Jeon, M., and Ferrara, S. (2011). A paradox in the study of the benefits of test-item review. Journal of Educational Measurement 48(4):380-398.

This paper takes a look at the popular perspective that, given the opportunity to change answers on a test, the examinee’s first responses will tend to be correct more frequently than the changed answers. In the context of variable length IRT tests, this perspective has practical consequences, because if item review is available, a stopping rule may be invoked based on an initial ability estimate, and the final ability calculation after answer changes may be different. In a CAT test, the assignment of items is also based on ability estimates, which would be calculated from the initial item responses.

In interpreting data, the authors introduce Simpson’s paradox, not an actual paradox but a perspective in which inferences are made about the probability of success of individuals based on population statistics. They cite an analog in the perception of the “hot hand” shooter in basketball. In reality, what is happening when shooting free throws, for example, is that individuals have different skill levels, and the probability of success on any individual shot reflects the individual shooter’s skill level. Simpson’s paradox is considered in the authors’ analysis of the impact of changing answers on an examination. Individual ability levels should be considered as a variable.

The authors’ analysis of data from a 65-item math test supported the popular belief that examinees should go with their first responses and suggests that a long history of empirical research has ignored the important variable of individual ability. Implications for future research are explored.
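The counts below are invented purely to illustrate the general mechanism of Simpson’s paradox in answer-changing data; they are not taken from the study discussed above. Within each ability group, changed answers fare at least as well as kept answers, yet the aggregated rates point the other way, which is why ignoring ability can mislead.

```python
# Invented counts (attempts, correct) purely to illustrate Simpson's paradox.
groups = {
    "lower-ability":  {"kept": (200, 60), "changed": (100, 35)},
    "higher-ability": {"kept": (100, 90), "changed": (10, 10)},
}

def rate(pair):
    attempts, correct = pair
    return correct / attempts

totals = {"kept": [0, 0], "changed": [0, 0]}
for name, g in groups.items():
    for action in ("kept", "changed"):
        totals[action][0] += g[action][0]
        totals[action][1] += g[action][1]
    print(f"{name}: kept {rate(g['kept']):.0%} correct, changed {rate(g['changed']):.0%} correct")

print(f"aggregate: kept {totals['kept'][1]/totals['kept'][0]:.0%}, "
      f"changed {totals['changed'][1]/totals['changed'][0]:.0%}")
```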

Rater Influence on Performance Assessment

Raymond, M.R., Harik, P. and Clauser, B.E. (2011). The impact of statistically adjusting rater effects on conditional standard errors of performance ratings. Applied Psychological Measurement 35(3): 235-246.

Rater severity is a well-known concern in performance testing. In spite of well-documented criteria for assigning scores, some raters may be more lenient or severe than others. It is inappropriate for rater tendencies, rather than candidate ability, to be the determining factor in whether a candidate passes or fails an assessment. The authors note that the Rasch model and an approach based on ordinary least squares (OLS) regression are the most common methods of detecting and correcting for systematic error associated with rater and task variables. (p. 236)


They conducted studies on data from 12 stations of a part of the NBME physician licensure examination, using OLS regression rater adjustments to evaluate “the impact of OLS adjustment on standard errors of measurement (SEM) at specific score levels.” (p. 235) The conclusion of the research is that “As methods such as OLS regression and the Rasch model continue their migration from research to practice, it will be important to identify the conditions under which they do and do not function adequately. Finally, this study demonstrated that conditional SEMs can serve as a very useful evaluative criterion, and the authors advocate its use for a variety of investigations or interventions where score reliability is of concern (e.g., utility of subscores, effectiveness of rater training, comparisons of subpopulations).” (p. 245)
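The sketch below illustrates the general logic of adjusting ratings for rater severity with a crude main-effects correction; it is not the NBME procedure or the OLS model evaluated in the article, and the ratings are invented.

```python
from statistics import mean

# Hypothetical ratings: ratings[candidate][rater] on a common task.
ratings = {
    "cand1": {"raterA": 7, "raterB": 5},
    "cand2": {"raterA": 9, "raterB": 6},
    "cand3": {"raterA": 6, "raterB": 4},
}

all_scores = [s for by_rater in ratings.values() for s in by_rater.values()]
grand_mean = mean(all_scores)

# Crude severity estimate: how far each rater's average sits from the grand mean.
raters = {r for by_rater in ratings.values() for r in by_rater}
severity = {r: mean(by_rater[r] for by_rater in ratings.values()) - grand_mean
            for r in raters}

# Adjusted score: remove the estimated rater main effect from each rating.
for cand, by_rater in ratings.items():
    adjusted = {r: round(s - severity[r], 2) for r, s in by_rater.items()}
    print(cand, adjusted)
```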

ITEMS (Instructional Topics in Educational Measurement)

The following are two new titles in the National Council on Measurement in Education (NCME) ITEMS (Instructional Topics in Educational Measurement) series. ITEMS articles are presented as instructional modules rather than as scholarly papers for a group of measurement insiders, so they are highly recommended. The series has been in existence since 1987, and many basic measurement topics – such as the comparison of classical test theory and item response theory, traditional equating methodology, IRT equating methods, understanding reliability, the standard error of measurement, and generalizability theory – were covered in earlier publications that are still relevant teaching tools. The previously published ITEMS modules are available in the Library section of the NCME website at www.ncme.org.

Examination Subscores

Sinharay, S., Puhan, G. and Haberman, S.J. (2011). An NCME instructional module on subscores. Educational Measurement: Issues and Practice 30(3): 29-40.

As an NCME instructional module, this article is quite a bit more intellectually accessible to the licensure and certification community than the authors’ research papers which have been cited in this column. The article starts with the most basic question: “What are subscores and why should anybody care about them?” and ends with a section on proportional reduction of mean squared errors. Subscores can provide additional candidate performance information that is not contained in a single number for total score. Candidates and training program directors alike appreciate the additional data points captured by subscores. The main problem with subscores is that they are all too frequently based on a relatively small number of items and contain more measurement error than meaning. In a literature review that the first author conducted, sixteen of twenty-five tests reviewed “had no subscores with added value.” (p. 33)

The authors review a number of methods to assess the psychometric quality of subscores. They emphasize that “subscores have to be based on a certain number of items and have to be sufficiently distinct from each other to have adequate psychometric quality.” (p. 29) Their recommendations include that evidence be provided of “adequate reliability, validity, and distinctness of the subscores.” (p.36) They also recommend that subscores be combined into more general content categories to increase the number of items and reliability. If the overall test is unidimensional, it may not be possible to divide it into content categories that are statistically distinct. It is not possible to measure something that is not there.
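A rough first-pass screening in this spirit, though not the proportional-reduction-in-mean-squared-error methods the module describes, is to compute each subscore’s internal-consistency reliability and the correlation between subscores; low reliabilities or near-perfect correlations both argue against reporting the subscores. The responses below are invented.

```python
from statistics import mean, pvariance

def cronbach_alpha(item_matrix):
    """item_matrix: one response vector (0/1 or points) per examinee for one subscore."""
    n_items = len(item_matrix[0])
    item_vars = [pvariance([row[j] for row in item_matrix]) for j in range(n_items)]
    total_var = pvariance([sum(row) for row in item_matrix])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

def correlation(x, y):
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pvariance(x) ** 0.5 * pvariance(y) ** 0.5)

# Made-up responses: 6 examinees, two 4-item subscores.
sub1 = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 0, 1, 1], [0, 1, 1, 0]]
sub2 = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1], [0, 1, 0, 0], [1, 1, 0, 1], [1, 0, 1, 1]]

print("alpha sub1:", round(cronbach_alpha(sub1), 2))
print("alpha sub2:", round(cronbach_alpha(sub2), 2))
print("subscore correlation:",
      round(correlation([sum(r) for r in sub1], [sum(r) for r in sub2]), 2))
```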

Population Invariance in Linking and Equating

Huggins, A.C., and Penfield, R.D. (2012). An NCME instructional module on population invariance in linking and equating. Educational Measurement: Issues and Practice 31(1): 27-40.

In this instructional module on population invariance, the authors state that, “Whether one is conducting a linking or equating, an important issue is whether the comparability of the test scores is dependent on particular groups of examinees defined by characteristics such as gender, language, race, geographic region and accommodation strategy.” (p. 27) If candidates from different populations have the same score on one form of a test but different expected scores on a linked or equated test, a question of fairness arises. (Ibid)

The authors present an overview of population invariance in linking, a number of methods for evaluating population invariance, and implications where violations of the assumption of population invariance are found. With regard to the latter problem, they indicate that no single remedy fits all situations. Separate linkings for individual populations are a possibility that should be considered with caution. Characteristics of the test, data, and populations should be reviewed. Finally, “Technical reports that accompany linkage results should thoroughly describe the population and the conditions under which the linking function is invariant or dependent.” (p. 38)
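One crude way to probe population invariance, not the indices presented in the module, is to compute a linking function separately within each subgroup and for the combined group and compare the linked scores; the sketch below does this with a mean-sigma linear linking on invented scores.

```python
from statistics import mean, pstdev

def linear_link(x_scores, y_scores):
    """Mean-sigma linear linking: map an X-form score onto the Y-form scale."""
    slope = pstdev(y_scores) / pstdev(x_scores)
    intercept = mean(y_scores) - slope * mean(x_scores)
    return lambda x: slope * x + intercept

# Made-up raw scores on two forms for two examinee subgroups.
group1_x, group1_y = [20, 25, 30, 35, 40], [22, 26, 31, 37, 41]
group2_x, group2_y = [18, 24, 29, 36, 42], [21, 25, 32, 38, 45]

overall = linear_link(group1_x + group2_x, group1_y + group2_y)
by_group = {"group1": linear_link(group1_x, group1_y),
            "group2": linear_link(group2_x, group2_y)}

for x in (25, 30, 35):
    linked = {name: round(f(x), 1) for name, f in by_group.items()}
    print(f"X score {x}: overall -> {overall(x):.1f}, by group -> {linked}")
```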


Certification in Pediatric Nursing

Messmer, P.R., Hill-Rodrigues, D., Williams, A.R., Ernst, M.E., and Tahmooressi, J. (2011). Perceived value of national certification for pediatric nurses. Journal of Continuing Education in Nursing 42(9): 421-432.

The study compared responses to a questionnaire on the value of nursing specialty certification for respondents who were either certified or not certified in pediatric nursing in a single pediatric hospital. An instrument previously developed and validated was administered. It contained 18 “value of certification” statements that were rated using a Likert scale. Survey instruments were sent to 400 pediatric nurses, and 160 usable responses were obtained. Results of the study included the conclusions that nurses associate certification with professional growth, autonomy and recognition, with certified nurses attaching a greater value to certification than non-certified nurses. The authors noted that in this institution, although certification was weighted in salary determinations, “it does not automatically guarantee an increase in salary.” (p. 430) A moderate correlation was found between an item on salary and the value of certification instrument.

References

Crocker, L. and Algina, J. (1986). Introduction to Classical & Modern Test Theory. Orlando FL: Harcourt Brace Jovanovich, 527 pp.

Knapp, J., Anderson, L., and Wild, C. (Eds.) (2009). Certification – The ICE Handbook 2nd Ed., Washington, DC: Institute for Credentialing Excellence, 424 pp.


Several elements contribute to occupational credentialing programs, including the systems on which the regulation of professionals relies. Among these elements are organizational components associated with operating programs, legal components associated with treating regulated professionals fairly, and authoring components associated with producing and refining item content that will be used within tests.

Focusing on the items for a moment, they require a storage system. Once items are stored, a system must permit someone to access them, edit content, and classify items in a variety of ways that are germane to the assessment program to which they are linked. In this article, I intend to focus on principles that one might observe in item-banking systems today or might anticipate for item banks in the future.

A look back at item-banking practices

There was a time when contents of an item bank were made portable by transporting 3X5 index cards within an elongated box. Item bank managers manually edited content of an item on the card or replaced an old card with a new one containing a revised version of an item. Items (cards) that were retired could be marked as such or removed from the box. When the time came to publish a test, a test constructor would pull cards to assemble test content after which content from the cards was typeset, proofed, and printed into booklets.

The first computerized item banks created virtual cards and boxes. In predictable fashion, individual organizations each hired programmers, each of whom developed a system that fit the needs of the institution that hired him or her to do the work. While these technological advances improved on the card and box system, new problems cropped up.

At the risk of over-generalizing and migrating into philosophy, I will suggest that humans are problem solvers. However, the problems we tend to solve are the immediate ones. If we can solve this problem over here, we tend not to wonder whether someone else is solving a similar problem over there from which we might learn. Neither do we tend to have much concern about whether our solution will cause someone else a problem.

Interoperability

When individual organizations each develop a system for storing and accessing items in banks, a solution implemented by one can cause problems for another when they subsequently decide to interoperate their systems.

Technology and Testing
A New Wave of Item-banking Principles

ROBERT C. SHAW, JR., PHD

Robert C. Shaw, Jr. is a Program Director in the Psychometrics Division at Applied Measurement Professionals, Inc.


For example, a credentialing program may decide to have one company develop its test content but have another company administer and score tests. Occasionally, a credentialing program will decide to move its whole development, administration, and scoring operation from one company to another.

For these scenarios and others, it is no longer sufficient to solve the immediate local problem. One has to consider what others have done to solve their problems while getting the systems to work together.

An international standards group

A group populated by members from around the world has organized to address issues that arise when education systems, including testing systems, interoperate. The group is called the Instructional Management Systems (IMS) Global Learning Consortium (http://www.imsproject.org). The mission of this global nonprofit organization is to enable growth of learning technology. There are just short of 200 organizations that have joined the Consortium; 58% of members are corporations, 24% are institutions of learning, and 18% are classified as consortia or governments.

The Consortium has specified several standards, each of which focuses on a learning system. Some standards guide content packaging within learning systems and there are standards for other elements of learning, assessment, and documentation of competencies. Plus, the Consortium has developed standards for a structure called an ePortfolio, which is a concept that assumes thorough integration across multiple systems.

Contents of an ePortfolio could be started for a person while he or she was enrolled in an instructional program. Some content from the ePortfolio could be expected to move with the person to the institution by which he or she becomes employed. Perhaps in between, some content from a person’s ePortfolio could reside within a credentialing system as a part of verifying eligibility to take an examination that is a part of regulating professionals. It is this kind of integration among learning systems that the Consortium envisions. Included among its standards is a specification for Question and Test Interoperability (QTI) that applies to item-banking.

A brief history of QTI specifications

QTI specifies technical models so that content of test items, information from candidates’ responses, and reports about candidates’ responses can be represented within computer systems. Importantly, otherwise independent systems that each follow QTI specifications should be able to exchange information among authoring tools, item banks, test construction tools, and test delivery systems (a QTI glossary appears at the end of this article). In brief, if each party stores the same type of information in the same format and location, then information can be readily exchanged.
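To give a sense of what such exchangeable item content can look like, the Python sketch below assembles a QTI-2.x-style multiple-choice item as XML. The element and attribute names are approximations based on the QTI 2.x specification, and the item text is invented; a production system would validate against the official schema.

```python
import xml.etree.ElementTree as ET

# Rough sketch of a QTI 2.x-style choice item; element/attribute names are
# approximations of the specification, not a validated QTI document.
item = ET.Element("assessmentItem", identifier="item001",
                  title="Example choice item", adaptive="false", timeDependent="false")

resp = ET.SubElement(item, "responseDeclaration", identifier="RESPONSE",
                     cardinality="single", baseType="identifier")
correct = ET.SubElement(resp, "correctResponse")
ET.SubElement(correct, "value").text = "B"

body = ET.SubElement(item, "itemBody")
choice = ET.SubElement(body, "choiceInteraction", responseIdentifier="RESPONSE",
                       shuffle="true", maxChoices="1")
ET.SubElement(choice, "prompt").text = "Which statistic indexes item discrimination?"
for ident, text in [("A", "Mean score"), ("B", "Point-biserial correlation"), ("C", "Pass rate")]:
    ET.SubElement(choice, "simpleChoice", identifier=ident).text = text

print(ET.tostring(item, encoding="unicode"))
```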

The first version (V0.5) of the QTI specification was released for discussion in 1999. V1.0 of the specification was released in 2000, after which updates were released in 2001 and 2002. V2.0 of the specification was drafted in 2003 followed by V2.1 in 2006. The website (http://www.imsproject.org/question/index.html) for the V2.1 specification indicates that it has not been finalized as of early 2012.

An important difference between QTI V1 and V2 is that V2 is intended to accommodate innovative types of items that are only available within computer-based systems. As the rest of this section makes clear, the QTI group has taken on a complicated task in trying to accommodate content from, and responses to, technologically complex types of test items. Some examples include the following:

• An InlineChoice Interaction type of item permits candidates to view the content of a pull-down menu that has been embedded within the text of an item to select the best response.

• A Match Interaction type of item presents a table in which a candidate can click each cell that contains a correct match between content presented in the rows and columns of the table.

• A TextEntry Interaction item type provides space for candidates to type several words of a response into a field, which most likely will be scored later.

• A HotSpot Interaction type of item permits candidates to connect a sequence of dots on screen. The example given by the Consortium shows an image of a map so a test taker would have to have mastered knowledge of the geography to connect the dots correctly. However, creative people could find other uses for this type of item.

• A GraphicGapMatch Interaction item presents a slider to candidates (often associated with a percentage) that lets them move the tool with the computer mouse until the intended response is displayed.

I will save a personal reaction about the relative merits of innovative types of items as compared to one-best answer multiple-choice items for the next article in this space.


I will stay on point for now and say that there is a QTI Lite (V1.2) specification that is available from the Consortium. QTI Lite focuses on testing scenarios that rely exclusively on the one best answer multiple-choice type of items. The group working on the QTI specifications understood that a great quantity of testing is done using one best answer multiple-choice items. QTI Lite is intended to simplify specifications for such systems that are intended to interoperate.

Potential features of a modern item-banking system

QTI specifications make use of the Extensible Markup Language (XML), a markup language widely used to structure and exchange data on the web. Immediately, one can see a fundamental change compared to item-banking applications of the past. Instead of a company developing a computer application that is installed on each personal computer, software developers would develop an application that is accessed from the Internet. Users would still rely on personal computers to access information, but the application would not run on each personal computer. The application that would be running on each personal computer is the Internet browser.

Instead of purchasing a license to install an application on each personal computer, users of an item-banking service would purchase authenticated access to the Internet application. Such a system would simplify some elements of item bank management for a user. The user would be freed from having to update application software and provide storage space for information on which the application otherwise would rely. In other words, the system would be modeled on software-as-a-service rather than software-as-a-product.

Such a system might generate some concern among some potential users. No longer would they possess the contents of their bank in the sense that item data are stored on a computer that they own. A user would access content of an item bank through an authentication system, which would limit access to persons or personal computers that were supposed to have access. Users who may come to realize that “our items are stored with the items of everyone else” could raise another point of concern. However, such an item storage scheme happens already. When an item bank owner purchases item-banking and maintenance services from a company, the bank owner’s content is stored on the same drive with the contents of other item bank owners.

Remote item entry and editing is a significant advantage that a web-based item bank can offer. An author could be authenticated to access an authoring tool through which he or she could enter the content of new items directly into the bank. Other persons could be authenticated to log in and review content of newly submitted draft items. These reviewers could be authorized to directly edit draft items to potentially encourage more rapid approval for tests.

Summary

My intention with this article was to describe a current set of concepts that could affect some item banks today and in the future. This concept of item-banking embraces interoperability among multiple systems, which is in contrast to systems that were built in the past to serve the parochial needs of users. Because one entity may develop test content while another entity may administer the content and score responses, and because an owner of a test bank could decide to move its bank to a new system, principles of interoperability can become important.

An international group of corporations, learning institutions, and governments called IMS has been working on interoperability standards for computerized learning systems. Among these standards is a QTI set that started development in 1999. The current version of the QTI standard is V2.0, although V2.1 was in development at the time of this writing. Of use to many credentialing programs should be the QTI Lite (V1.2) Standard, which is intended for examinations that are constructed entirely of one best answer multiple-choice items. I cite the V1.2 and V2.0 Standards because regulatory bodies may encounter them while interacting with other regulatory bodies or they may want to cite one of these standards when soliciting item-banking services.

Lastly, this article might sensitize some readers to elements of item-banking systems in the future. Content could be added by an authenticated item author while using an Internet browser rather than directly using a software application. Likewise, an item bank manager could remotely review and edit item content while using a browser. Neither the application that permits management of items within a bank nor the content of items need be stored directly on the computer that the item bank manager is using within this paradigm.

QTI Glossary

Item author: A person who creates content of a test item.
Authoring tool: A system used by an author to create or modify an item.
Item bank: A system that collects and permits management of items.
Item bank manager: A person who manages a collection of items while using an item bank.
Test constructor: A person who creates test forms from an item bank.
Test construction tool: A system for assembling tests from items.
Assessment delivery system: A system for managing delivery of tests to candidates.
Proctor or invigilator: A person who oversees test delivery.
Scorer: A person who assesses responses from candidates to items on a test.


This edition of the Legal Beat column will focus on the legal implications of disclosure by examination owners of suspected examination misconduct and the defensibility of a passing score on a licensure examination used by a state board. Because examination results in a state-based licensure system are used as a determiner of minimum competence, invalidated examination results can result in the denial of licensure by a board. Further, because examination score anomalies may be discovered well after the examination administration and likely after the issuance of a professional license, information questioning the examination score may result in the loss or removal of the practitioner’s license by the state board. In a state-based licensure system, licensees are entitled to the full panoply of due process rights before adverse action can be taken by the state regulatory board.

Of course, it is critical that high-stakes licensure examinations accurately reflect the knowledge, skills and abilities of the examinee and examination owners are legally bound to take necessary measures to ensure examination results reflect such aptitude. When information surfaces after the administration regarding scoring anomalies of a particular candidate who is already licensed, the examination owner must assess the credibility of the information and, under the right circumstances, report such irregularities to the licensing entity. While the licensing board has the ultimate authority regarding the authorization to practice and also enjoys the immunity protections of a governmental entity, the private sector examination owner may or may not be the beneficiary of these legal protections. Consider the following.

The National Association of Boards of Pharmacy (NABP) is a private, not-for-profit organization whose membership is comprised of the state boards of pharmacy in the United States, Puerto Rico, the Virgin Islands and the District of Columbia (as well as certain similar international pharmacy licensing boards). NABP provides programs and services to its member boards of pharmacy, intended to lessen burdens on state government, educate pharmacy board members, and provide uniformity to the licensure of pharmacists, all in the interest of public protection. One such NABP program includes the development, administration, scoring and maintenance of the North American Pharmacists Licensure Examination (NAPLEX). The NAPLEX is the entry-level licensing examination that is recognized by all state boards of pharmacy as one criterion in the licensure process of pharmacists. NABP contracts with an outside vendor for services related to the administration of the NAPLEX at test centers throughout the United States. To pass the NAPLEX, an examinee must receive a score of 75 out of a possible 150.

In June 2007, a candidate sat for the NAPLEX and scored a near perfect 130. Previously, the candidate had tested in June 2006 and scored a 24, and tested again in December 2006, scoring a 24.

Legal Beat
Score Anomalies, Reporting & Immunity

DALE J. ATKINSON, ESQ.

Dale Atkinson is a partner in the law firm of Atkinson & Atkinson. http://www.lawyers.com/atkinson&atkinson/


After the June 2007 administration, NABP received evidence that the candidate had an imposter take the 2007 examination for him. NABP conducted an investigation, including attempts to gather the video of the examination administration and to review fingerprints and pictures of the candidate. However, the video of that administration had been erased, in accordance with vendor policy, based upon the passage of time. Further, a picture that initially suggested the candidate at the third administration was different from the candidate at the first two tests proved to be erroneous. In fact, the pictures for all three tests were eventually matched. Finally, the fingerprints were too blurry to make conclusive determinations as to candidate identity.

In March 2008, after its investigation, NABP informed the candidate that it could not verify that he was the person who sat for the June 2007 exam and invalidated his score. In short, NABP determined that it could not verify that the candidate actually sat for the June 2007 exam and also could not verify that the June 2007 examination score was achieved on his own merits. Significant to these determinations were the score variance and the improbability of such a significant increase in score. NABP also notified the Michigan Board of Pharmacy of its determinations and invalidation of the NAPLEX score. The Michigan Board filed an administrative complaint against the pharmacist and summarily suspended his license. A hearing was held on April 15, 2008, at which the Department of the Attorney General recommended that the complaint be dismissed based upon the speculative nature of the grounds. The Board reinstated the pharmacist’s license on April 15, 2008. In July 2008, NABP sent the pharmacist another notice affirming its invalidation of the June 2007 exam score based upon its inability to verify that the candidate passed the examination on his own merits.

Thereafter, the pharmacist (hereinafter referred to as plaintiff) filed suit in state court alleging negligence, libel, defamation, intentional infliction of emotional distress, and breach of contract. Defendants in the suit included Prometric, Educational Testing Service (ETS), Thompson Reuters Corporation, NABP, and NABP’s executive director. The plaintiff alleged damages of $2.1 million. Based upon diversity of citizenship issues, the case was removed to federal court.

In a January 2010 opinion (Dakshinamoorthy v. National Association of Boards of Pharmacy, 2010 WL 103884), the United States District Court for the Eastern District of Michigan addressed the defendants’ various motions to either dismiss the litigation or rule on summary judgment. Summary judgment involves a judicial ruling on matters of law, which disposes of the case without the need for a trial when there is no dispute over material issues of fact. The court focused on the contractual relationship between the parties and the fact that NABP contracts with the Board for examination services. In addition, NABP contracted at the time with Prometric for examination administration and assessment services. Also, the court addressed the elements required for a plaintiff to state a cause of action for negligence, which include a duty, breach of duty, causation, and damages.

Regarding the Prometric and ETS motions to dismiss, those defendants argued that there was no contractual relationship with, nor duty owed to, the plaintiff. They argued that any relationship with the plaintiff arose strictly out of their contracts with NABP. In denying the defendants’ motions to dismiss, the court noted that discovery had barely begun and that such issues must be explored during these initial phases of the litigation. Thus, the motions were denied as to Prometric and ETS.

As to NABP, the court distinguished between contract and tort/negligence claims and the need to determine if NABP owed a duty to plaintiff separate and apart from the contractual obligations. Again, based upon the infancy of the litigation, the court denied the NABP motion for partial summary judgment on the negligence claim and allowed the parties to engage in discovery.

Finally, the court did dismiss the claims against Thompson Reuters as the plaintiff did not identify any reason to pierce the corporate veil of the parent company. Accordingly, the case was allowed to proceed against Prometric, ETS, NABP, and NABP’s executive director. Thereafter, either by stipulation or order of the court, Prometric and ETS were dismissed from the suit. However, NABP and its executive director remained and will be referred to collectively as NABP defendants.

After further discovery, the NABP defendants again filed for summary judgment. In an April 2011 opinion, the District Court again ruled on the case. (Dakshinamoorthy v. NABP, 2011 U.S. Dist. LEXIS 40034) NABP argued that it was entitled to summary judgment on all claims based upon civil immunity under Michigan law. Michigan law provides immunity from civil or criminal liability to a “person…acting in good faith who makes a report; assists a board or task force, a disciplinary subcommittee, or the department in carrying out its duties under this article.” Any person acting in this manner is “immune from civil or criminal liability, including, but not limited to, liability in a civil action for damages… .”


The court noted that NABP’s actions (including those of its executive director) which form the basis for plaintiff’s complaint were undertaken pursuant to the contract between the Board and NABP. As noted by the court, on this basis alone, the NABP defendants are entitled to summary judgment. The court continued, stating that even if the Michigan statute were found not to be dispositive, the NABP defendants would be entitled to judgment on the intentional infliction of emotional distress claim, as plaintiff must prove the following:

1. Extreme and outrageous conduct;

2. Intent or recklessness;

3. Causation; and

4. Severe emotional distress.

In this case, the court found that there was no evidence of any such outrageous conduct.

The court also found that the breach of contract claim, premised on a third-party beneficiary theory, would likewise be subject to summary judgment in favor of the NABP defendants. It held that the plaintiff under these circumstances was merely an incidental beneficiary of the contract between NABP and the board. Indeed, the NABP/board contract was “primarily for the benefit of the NABP and the Michigan Board, as well as the people of the State of Michigan to ensure that state-licensed pharmacists have some level of competence.”

Finally, regarding the defamation claim, the court also found that NABP defendants would be entitled to judgment even without the immunity statute. To substantiate a defamation claim, plaintiff must show:

1. A false and defamatory statement concerning plaintiff;

2. An unprivileged communication to a third party;

3. Fault amounting to at least negligence on the part of the publisher;

4. Either actionability of the statement irrespective of special harm or the existence of special harm caused by publication.

The court found that plaintiff cannot meet this burden as the statements made were not false, they were privileged, and the plaintiff cannot establish that they were published with actual malice. Accordingly, the District Court granted summary judgment to NABP and its executive director.

The plaintiff appealed the case to the Sixth Circuit Court of Appeals. In an April 2012 opinion, the Sixth Circuit affirmed the District Court’s grant of summary judgment in favor of NABP and its executive director. In an abbreviated opinion, the court agreed with the lower court that the NABP defendants were entitled to immunity under Michigan law and that the defendants’ previous pleading sufficiently alleged immunity as a defense.

Examination owners in state-based licensure systems are under affirmative obligations to ensure that candidates possess minimum competence as one criterion in the licensure process. NABP-like associations, under Michigan law, are rightfully entitled to immunity for providing such a valuable service to a member board of pharmacy based upon these contractual relationships and obligations. As a reminder, the boards of pharmacy are statutorily created and empowered to protect the public through the enforcement of the statutory scheme. One critical element of such a licensure process is a uniform mechanism to assess entry-level competence.


Making a pass/fail decision or measuring a person’s ability with a test is a complex process. The development cycle for most professionally created tests follows a sequence of events similar to this: An area of knowledge is defined and a test specification is created from the definition. Items are developed using the definition and the specification. The items are used to create a test. The test is administered to examinees and the examinee responses are scored. Scores are used as data for a measurement model and the measurement model is used to make a pass/fail decision or estimate an ability level. Each step builds upon the previous step.

The steps leading up to collecting item responses are the most critical because they provide the foundation for all later analysis. It is possible to use different scoring methods or different measurement models with a set of responses; however, no amount of statistical analysis can correct for poor data. Generating good data, in the form of a consistent set of responses, is a critical function of a test, and the quality of the process is dependent on the quality of the items on the test.

Item responses serve as individual measurement points. Each item represents an opportunity for an examinee to demonstrate his or her ability level. Every instance of an examinee response to an item provides a reference data point for an ability estimate. Estimating ability from the response to a single item is problematic as the item only provides information about a single difficulty point and responses are sometimes influenced by random error rather than by the examinee’s ability. Factors such as fatigue, motivation or guessing may cause a specific response to be an inaccurate measurement of ability. This is why tests are made up of many items. Using multiple reference points minimizes the effects of factors other than ability. Generally speaking, longer tests are more reliable than shorter tests.
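The familiar way to quantify the last point is the Spearman-Brown prophecy formula, sketched below with an illustrative starting reliability of 0.70 (the value is invented, not taken from the article).

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`
    with comparable items (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# A test with reliability 0.70, doubled and tripled in length (illustrative values).
for k in (1, 2, 3):
    print(f"length x{k}: projected reliability = {spearman_brown(0.70, k):.2f}")
```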

Ideally the information provided by an item response is directly related to an examinee’s ability; however, items also measure knowledge that is not intended to be part of the test. Factors such as item format, cultural assumptions or unfamiliar language influence item responses and often produce irrelevant information. Separating the relevant from the irrelevant information in a single item response is not possible. From an item development perspective, the best way to deal with this issue is to write items in such a way as to minimize the amount of irrelevant information produced by each item. Generally, high quality items maximize relevant information and minimize irrelevant information.

Challenges in Developing High Quality Test Items

GREG APPLEGATE, MBA
Pearson VUE


Item Writing

High quality items possess two important characteristics: content relevance and examinee separation. In other words, the content of the item must directly relate to the subject being tested, and the responses to each item should consistently distinguish between examinees of different ability levels. How well the content of an item relates to the subject of the test can be checked in a number of ways, but it almost always begins with the judgment of a subject matter expert. A strong basis for ensuring the item content is related to the test subject can be established by using the consensus judgment of subject matter experts. One way this may be accomplished is to develop a write and review process in which items are developed by one group of subject matter experts and independently reviewed by another. This helps to ensure that item writers are working from a common subject definition and that multiple perspectives of the subject are included.

Even when item writers start with a solid definition of the subject, writing high quality items can be challenging. Guidelines based on the consensus of content developers and item development researchers (Haladyna, Downing, & Rodriguez, 2002) have been developed to assist in the item writing process. In general, items that are clearly written, free of extraneous information and focused on a single point of information are more likely to provide useful data (Haladyna, 2004). Items written in this fashion are sometimes criticized as being too simplistic; however, simplicity is exactly what is needed to obtain a good measurement. Direct and clear wording of an item helps to ensure the difficulty of the item is based upon the examinee's knowledge of the subject rather than the ability to decipher a complex piece of writing (unless, of course, measuring the ability to decipher a complex piece of writing is the objective of the test).

In addition to guidance on how to create good items, the guidelines also provide information about item characteristics that may reduce item quality. Inclusive options for multiple-choice items (all of the above or none of the above), the complex multiple-choice (Type K) format, and alternate-choice formats (like true/false) have all been shown to have negative effects on item quality.

Despite having these guidelines to follow, writing high quality items is still a challenge because the guidelines are not definitive. The guidelines are extensive; at least one list has 31 separate rules. They are also incomplete in the sense that not all the rules have scientific evidence to support them. Furthermore, some of the guidelines may be difficult to implement. For example, one guideline suggests, "Use novel material to test understanding and application of knowledge and skills" (Haladyna, 2004, p. 99). Unless the item writers are very familiar with the examinees, it can be difficult to ensure that material would be novel to everyone. In general, however, items that violate these rules are less likely to maximize relevant information and minimize irrelevant information (Downing, 2005; Tarrant & Ware, 2008). Following the guidelines increases the probability of creating a high quality item; however, it does not guarantee it.

The Interaction between Items and Examinees

Following item writing guidelines does not guarantee success because the quality of an item depends on its ability to distinguish between examinees of differing ability levels (examinee separation). An item might be a high quality item for one group of examinees, but not for another. This can occur for a number of reasons.

One possible reason might be the overall level of difficulty of the item. More information is produced when the item difficulty closely matches the examinee's ability. An item that is too easy or too difficult for a group of examinees provides little information about ability because the item fails to distinguish between high ability and low ability examinees. This means an item could be well-written and include the correct content yet still not be a high quality item for a specific application. For example, an item that is well targeted and provides maximum information for a group of freshmen would likely be too easy, and therefore provide little information about ability, for a group of graduating seniors.
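
A small numerical illustration of why extreme difficulty limits separation (the proportions below are hypothetical and are not drawn from the article): the variance of a right/wrong item score is p(1 - p), so an item that nearly everyone answers correctly, or that nearly everyone misses, has almost no score variance and therefore cannot distinguish among examinees.

    def item_score_variance(p_correct):
        # Variance of a dichotomously scored (0/1) item, where p_correct is the
        # proportion of the examinee group answering the item correctly.
        return p_correct * (1 - p_correct)

    # Hypothetical proportions correct for the same item in three groups:
    for p in (0.95, 0.50, 0.05):
        print(p, round(item_score_variance(p), 4))  # 0.0475, 0.25, 0.0475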

The interaction between items and examinees can only be evaluated after examinees have responded to the items. Item response analysis is required to determine if item responses are consistent and delineate between examinees of different ability levels. A failure to delineate may be unrelated to the item writing process. For example, students of a similar ability range are taken from two classes with different teachers. Teacher A teaches a process that almost everyone in the class is able to master. Teacher B does not teach the process. Items with content related to the process will likely be answered correctly by students from Teacher A's class and incorrectly by students from Teacher B's class regardless of the ability level of the student. Because of the difference in instruction, the item acts as an indicator of which class the student attended rather than the student's ability. For this item, the relevant information about the student's ability level is overshadowed by the irrelevant information of which class the student attended.

This lack of transparency in the process sometimes creates frustration. Item developers who follow item writing guidelines generally produce what they consider high quality items. Analysis of the response patterns to those items, however, may show that some of the items work better than others. A visual examination of the items would likely reveal little to no difference. The item response analysis only shows that a difference in responses exists but does not provide a reason for the difference. This can be frustrating for item writers attempting to consistently create high quality items or looking for information on how to improve existing items.

Practical Application

The process of creating high quality items falls into three distinct steps. First, a clear definition of the subject is created so that items can be written that directly relate to the subject. Second, items are developed using research-based item writing guidelines to increase the probability that each item will maximize relevant information and minimize irrelevant information. Third, response patterns to items are analyzed to ensure that the items are of appropriate difficulty and clearly delineate between examinees of different ability levels.

The first and second steps of this process are relatively straightforward to implement. Developing a good process that includes stakeholders, subject matter experts and content developers can address these issues in an environment that is manageable and productive. Defining the area of knowledge may be based on current publications in the field along with the input of the stakeholders for the examination. While building consensus may be difficult, this is a problem that has been faced by many organizations with many different issues and a variety of resources exist to facilitate the process. Similarly, guidelines for best practices in item writing are readily available.

The third step poses a challenge since it cannot be accomplished without having a sample of examinee responses to the items. Ideally, items are included on an examination but are not used for scoring, so the items can be evaluated without affecting examinee scores. This method has disadvantages. Among them, it increases the amount of time needed to add new items to an examination, which in turn increases costs, and requires examinees to take longer examinations. As an alternative, new items may be included on an examination and the item response analysis completed before scores are released. Unfortunately, this may introduce or increase a delay in the reporting of scores.

Summary

Developing a good test is a complex process. Each step builds upon the previous step, and it is particularly important to ensure the quality of the early steps in the process in order to produce a high quality result. Developing high quality items is one of the critical first steps, and the key to creating high quality items is to create a process that includes developing a clearly defined subject, writing items using research based guidelines and analyzing item responses. Developing high quality items is challenging both because of the complexity of the task and because item writers cannot completely control the process. It is, however, a challenge worth mastering because high quality tests are built on a foundation of high quality items.

References

Downing, S. M. (2005). The effects of violating standard item writing principles on tests and students: The consequences of using flawed test items on achievement examinations in medical education. Advances in Health Sciences Education: Theory and Practice, 10(2), 133-143.

Haladyna, T. (2004). Developing and validating multiple-choice test items (3rd ed.). Mahwah, NJ: Erlbaum.

Haladyna, T., Downing, S., & Rodriguez, M. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

Tarrant, M., & Ware, J. (2008). Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Medical Education, 42(2), 198-206.



A Multistate Approach to Setting Standards: An Application to Teacher Licensure Tests1

RICHARD J. TANNENBAUM, PhD
Richard J. Tannenbaum is a Senior Research Director at Educational Testing Service.

1 Parts excerpted from Setting Standards on The Praxis Series™ Tests: A Multistate Approach, R&D Connections, 17, by Richard J. Tannenbaum, 2011, Princeton, NJ: Educational Testing Service. Copyright 2011 by Educational Testing Service. Reprinted with permission.

Licensure tests are designed to ascertain the extent to which a candidate has demonstrated a sufficient amount of knowledge and skills important for professional practice (Clauser, Margolis, & Case, 2006). These types of tests are intended to reduce the likelihood of individuals engaging in practice before they are reasonably qualified. In the context of teacher licensure, a state department of education is responsible for establishing the minimum test score required to pass the test in that state. Typically, one within-state panel of experts recommends a passing score to its state department through the panel's completion of a standard-setting study. The state department of education then presents the recommendation to its state board, which sets the operational passing score. The continued value of a one-panel, state-by-state approach to setting standards may need to be reconsidered to foster greater interstate mobility and to bring efficiency to the standard-setting process.

First, a state must recruit a sufficiently large number of representative educators for each test-specific panel. On average, a state may require more than 30 licensure tests across content areas and grade levels2. Raymond and Reid (2001) suggest between 10 and 15 educators serve on a panel, and Zieky, Perie, and Livingston (2008) caution that a panel with fewer than eight members may not be defensible. For some licensure areas (e.g., Mathematics, Social Studies, English Language Arts), these numbers may not present an issue for a state, but for other areas (e.g., Economics, Business Education, Physics), assembling even eight educators may not be feasible.

2 The average was estimated from a sample of 17 states that use the Praxis Series™ of licensure tests.

Second, having one panel of educators making the passing-score recommendation leaves open to question whether other panels of educators would have recommended a comparable passing score, i.e. the issue of replicability (Cizek & Bunch, 2007; Hambleton & Pitoniak, 2006). As Hambleton (2001) noted, "If it cannot be demonstrated that similar performance standards would result with a second panel . . . the validity of the performance standards is significantly reduced" (p. 95). But the time and resources a state would need to invest to assemble more than one panel of experts for setting standards on the array of teacher licensure tests used is prohibitive; consequently, a one-panel approach is the norm. This means, however, that no direct measure of passing-score replicability exists, and so replicability must be approximated by the standard error of judgment (SEJ). The SEJ is analogous to a standard error of the mean; the standard deviation of panelists' passing-score recommendations is divided by the square root of the number of panelists (Cizek & Bunch, 2007). However, the assumptions underlying the SEJ, such as experts being randomly selected and judgments being independent, may not always hold, and so the SEJ likely underestimates the uncertainty associated with passing-score recommendations (Tannenbaum & Katz, in press).
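
The SEJ calculation itself is small; a minimal sketch follows, using the definition just given. The panelist recommendations shown are invented for illustration and are not taken from any study.

    import statistics

    def standard_error_of_judgment(panelist_cut_scores):
        # SEJ: standard deviation of the panelists' passing-score
        # recommendations divided by the square root of the panel size.
        n = len(panelist_cut_scores)
        return statistics.stdev(panelist_cut_scores) / (n ** 0.5)

    # Hypothetical raw-score recommendations from a 12-member panel:
    panel = [61, 64, 58, 66, 62, 60, 63, 65, 59, 62, 64, 61]
    print(round(standard_error_of_judgment(panel), 2))  # about 0.70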

Third, the current movement toward more national standards for K-12 students has renewed the interest in national expectations for teachers. In 2010, for instance, the Interstate Teacher Assessment and Support Consortium released its Model Core Teaching Standards: A Resource for State Dialogue (http://www.ccsso.org/Resources/Publications). These standards describe what all teachers should know and be able to perform and the dispositions they should exhibit. Darling-Hammond (2010), commenting on the merits of a national teacher-licensure process, noted that a national assessment would provide consistency in evaluating teacher preparedness and support teacher mobility across states, facilitating a more equitable distribution of teacher quality. This assumes, however, that a national passing score is also in place. The Praxis Series™ of teacher licensure tests are used across large numbers of states; but the current state-by-state standard-setting approach has led more often to variance across states’ passing scores than convergence. More than 30 states each, for example, use the Praxis licensure test for secondary Mathematics, for Social Studies, and for English Language Arts; the scaled scores for these tests range from 100 to 200. The state-adopted passing scores for these tests vary. The lowest and highest scaled passing scores and the inter-quartile range of the scaled passing scores are 123, 156, and 10 (respectively for Mathematics); 142, 172, and 9 (for English Language Arts); and 143, 162, and 7 (for Social Studies).

This paper describes a multistate standard-setting approach designed to address the issues presented by the more traditional state-by-state approach for recommending passing scores on teacher licensure tests (The Praxis Series). Although the multistate approach has been used with teacher licensure tests, it certainly may be applied to other licensure contexts that involve multiple jurisdictions or multiple agencies.

Multistate Standard Setting

Two design features distinguish the multistate standard-setting approach from the more traditional state-by-state approach for setting standards. The first feature is that educators representing several states jointly participate in recommending the passing score for the test being considered. This has two benefits. One, it reduces any one state's recruitment burden: Rather than a state having to assemble 10 to 15 educators on a panel, any one state may only need to contribute up to four educators in the multistate approach. Two, the passing-score recommendation will now reflect a more diversified perspective. The second feature is that two panels are formed from the same group of states for each test, and each panel makes a separate passing-score recommendation. The two panels permit a direct determination of the replicability of the passing-score recommendation, which is unique for teacher licensure. It is more common, as previously noted, for a single state to bring together only one panel of educators to recommend a passing score for that state. This is due to the difficulty of recruiting a sufficient number of educators for more than one panel. But the multistate approach reduces any one state's recruitment burden; hence we are able to assemble two panels to recommend a passing score on the same test.

However, the multistate process does not use any new method of standard setting. The core methodologies – a modified Angoff for multiple-choice items and an extended Angoff for constructed-response items – are well established and widely used in setting standards on teacher licensure tests. This paper does not describe these methods in detail as several of the sources in the literature (Cizek & Bunch, 2007; Hambleton & Pitoniak, 2006; Tannenbaum & Katz, in press; Zieky et al., 2008) are available to readers who wish to study them further.
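
For readers unfamiliar with the approach, the core computation of a modified Angoff study can be sketched as follows: each panelist judges, for every multiple-choice item, the probability that a just-qualified candidate would answer it correctly; a panelist's cut score is the sum of those judgments, and the panel's recommendation is typically the mean of the panelists' cut scores. The sketch below is a generic illustration with invented judgments, not a description of the operational Praxis procedure.

    def modified_angoff_cut_score(judgments_by_panelist):
        # judgments_by_panelist: one list per panelist, holding that panelist's
        # probability judgments (0 to 1) for each multiple-choice item.
        # Returns the panel-recommended raw cut score (mean of panelist sums).
        panelist_cuts = [sum(probs) for probs in judgments_by_panelist]
        return sum(panelist_cuts) / len(panelist_cuts)

    # Hypothetical judgments from three panelists on a five-item test:
    judgments = [
        [0.80, 0.65, 0.90, 0.55, 0.70],  # panelist 1 -> 3.60
        [0.75, 0.60, 0.85, 0.60, 0.65],  # panelist 2 -> 3.45
        [0.85, 0.70, 0.95, 0.50, 0.75],  # panelist 3 -> 3.75
    ]
    print(round(modified_angoff_cut_score(judgments), 2))  # 3.6 raw-score points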

These standard-setting methods have long been in use for The Praxis Series tests. State departments of education and their state boards – responsible for setting the operational passing scores – are familiar with and accept these methods. It was an explicit goal to maintain these same standard-setting methods in developing the multistate standard-setting approach while still enabling us to: reduce the burden on any one state for recruiting educators to serve on standard-setting panels, and reduce the overall costs associated with conducting state-by-state studies; determine the replicability of a passing-score recommendation by having two panels of educators; and support a more uniform, national perspective regarding passing scores for initial teacher licensure.

Prerequisite Policy Groundwork

Setting a standard is in effect setting a policy (Kane, 2001), in this case a policy about the testing requirements for teacher licensure – the type and amount of knowledge and skills that beginning teachers need to have. Cizek and Bunch (2007) aptly noted, "All standard setting is unavoidably situated in a complex constellation of political, social, economic, and historical contexts" (p. 12). As we began to think about the multistate standard-setting process, this "truism" reinforced the need for us to present the concept to the state departments of education before any attempt to implement it. We needed to make sure that the state departments of education understood what we were proposing and why. We also wanted them to have the opportunity to raise issues or ask questions. In this regard, we implemented two strategies to secure state acceptance. First, we conducted a series of webinars explaining how we envisioned the multistate process. Approximately 20 state departments of education participated. The webinars were instrumental in allaying states' concerns about changing a process that was familiar to them. Second, we invited them to observe the first two multistate studies we conducted. Several state directors or their designees attended these studies to see firsthand how the studies were conducted and the nature of the interactions among educators from the states represented on the panels. State observers regularly attended multistate studies throughout the first year of implementation.

Overview of the Multistate Process

There are certain elements of panel-based standard setting that contribute to the quality and reasonableness of passing-score recommendations (Tannenbaum & Katz, in press). Several of these are part of the multistate approach and include having the panelists take the test to become familiar with its content; construct a performance level description – the minimal knowledge and skills expected of a candidate who is qualified to be licensed; receive training in the standard-setting method(s) and the opportunity to practice making judgments; and engage in two rounds of standard-setting judgments.

FIGURE 1. Multistate Standard-Setting Process


The multistate standard-setting process is outlined in Figure 1. Each state interested in adopting the test for licensure nominates educators to represent the state. We encourage each state to nominate up to six educators – four teachers and two teacher education faculty members. They should come from different settings and be diverse with respect to gender, race, and ethnicity. Two panels are formed from the cross-state pool of nominees so that the composition and representation of the panels are comparable. We then contact each state to review the educators selected from that state, and ask each state to either approve the selection or suggest alternative educators. For the multistate process, each panel includes up to 25 educators. This number is greater than the range suggested by Raymond and Reid (2001) to bolster state representation, but it still supports interaction among the educators during the standard-setting process. This means that up to 50 educators may contribute to the passing-score recommendation for a test.


The two panels meet on different occasions, often within the same week. This expedites the test-adoption process for states. In the more traditional state-by-state model, states are necessarily placed in a queue for standard setting (due to resource limitations), which extends the timeframe for states to adopt the tests. The multistate process enables us to complete the standard setting for all states in the same brief time period, expediting the first step in each state’s adoption cycle – the presentation of a recommended passing score to its board.

The Performance Level Description

Because it meets first, Panel 1 has primary responsibility for constructing the performance level description, which delineates the minimal knowledge and skills expected of a candidate to pass the test. The objective of the standard-setting task is for the panel to identify the test score likely to be earned by a candidate who just meets the expectation expressed by the description.

The process of developing a performance level description roughly works like this: After having taken the test and discussed its content, the educators on Panel 1 are formed into two or three subgroups, and each of them independently constructs a performance level description. The entire panel then reviews and discusses the individual performance level descriptions in order to reach a consensus on a final performance level description. In general, we devote approximately two hours to this work. The final description is printed so that each educator has a copy. The panel then completes the standard-setting task. Panelists respond to evaluation surveys after training and practice and at the conclusion of the standard-setting session. The survey responses address the quality of the implementation, and the reasonableness of the panel's recommended passing score.

Panel 2 then begins its work. The value of having a second panel is that the number of educators contributing to the passing-score recommendation increases and that we can obtain a direct estimate of the replicability of the passing score. In this instance, replicability addresses the question of how close the recommended passing score of a second panel of educators would be to that of the first panel, if the two panels followed the same standard-setting procedures and used the same performance level description. The key here is that our design maintains the consistency of the performance level description (the performance expectation) between the panels. The standard-setting method and the performance expectation remain constant, with only the particular educators on the two panels varying. The educators from Panel 2 take the test and discuss its content, just as the educators on Panel 1 did. However, the educators on Panel 2 receive the performance level description from the first panel rather than having to construct a new description. They are informed of the reason for this – to maintain consistency of the performance expectation with the first panel – and told what they need to do in order to "internalize" the meaning of the description. The recommended passing score from Panel 1 is not shared with Panel 2, only the performance level description.

The work regarding the performance level description then proceeds. The educators on Panel 2 discuss the performance level description as a group. The same researcher who facilitated Panel 1 also facilitates Panel 2, so that the “history” of Panel 1’s performance level description can be shared as needed. Then the educators are formed into subgroups, and each subgroup is asked to develop critical indicators for each of the knowledge or skill statements in the performance level description. A critical indicator helps to operationally define each knowledge or skill, helps to “flesh it out.” Each subgroup is asked to generate two or three indicators for each knowledge or skill statement. The indicators are then presented to the whole panel for discussion, and a final set of indicators for each performance-level-description statement is documented. The indicators are not intended to be exhaustive, but to illustrate what the statement means.

Hence, only a few indicators for each statement are necessary. Table 1 presents examples of indicators associated with one knowledge or skill statement each from a performance level description for a Physical Education licensure test and for a School Counselor licensure test. The panel reviews the indicators for internal consistency before finalizing them, verifying that the entire set of indicators is related to the knowledge or skill statement and that it has not changed the fundamental meaning of the statement. The panel then completes the same standard-setting task and evaluation surveys that Panel 1 completed.

TABLE 1. Examples of Critical Indicators

Physical Education
  Performance-level-description statement: understands individual and group motivation and behavior to foster positive social interaction, active engagement in learning, and self-motivation
  Indicators: classroom rules demonstrate positive social interaction; uses "do nows" (instant activity cards) that promote positive engagement

School Counselor
  Performance-level-description statement: understands how to deliver prevention and intervention services through individual, small- and large-group counseling
  Indicators: describes the process or steps to develop a counseling group; identifies two rationales for considering whether to provide services to a group or to an individual

Documentation

Each participating state department of education receives a technical report that documents the characteristics and experiences of the educators on the two panels, the methods and procedures each panel followed in arriving at its passing-score recommendation, and the round-by-round results for each panel. Once a state has received the report, it goes through its particular process to determine the final passing score to be set.

Quality Metrics

One indicator of the quality of the multistate standard-setting process comes from the panelists' responses to the final evaluation. Panelist evaluations are a credible indicator of the validity of the standard-setting implementation (Cizek, Bunch, & Koons, 2004; Kane, 1994). Table 2 presents results from surveys of more than 530 panelists across 16 tests. The tests cover Art, Business Education, English Language Arts, General Pedagogy, Physical Education, School Leadership, Special Education, Technology Education, Teaching Reading, and World Languages. Table 2 summarizes the results for questions dealing with panelists' understanding of the purpose of the standard-setting study, the adequacy of the standard-setting training, and the ease of completing the standard-setting task. The responses were on a four-point scale ranging from strongly agree to strongly disagree. The panelists also were asked to indicate whether they believed the recommended passing score was about right, too low, or too high.

TABLE 2. Responses to Final Evaluations

Statement                                                   % Strongly Agree   % Agree   % Disagree/Strongly Disagree
I understood the purpose of the standard-setting study.          92.44           7.56                0
The training was adequate for me to complete the
  standard-setting task.                                          88.75          11.07               .18
The standard-setting process was easy to follow.                  73.01          25.69              1.29

Question                                                    % About Right      % Too Low     % Too High
How reasonable was the recommended passing score?                 88.39           9.74              1.87

Approximately 90% of the panelists strongly agreed that they understood the purpose of the study and that the standard-setting training they received was adequate. Nearly three-quarters (73%) of the panelists strongly agreed that the standard-setting process was easy to follow. We had expected a lower percentage here as standard setting is a novel activity for most educators; nonetheless, the positive result attests to the perceived quality of the standard-setting implementation. The percentages for these three questions approach 100 if the responses strongly agree and agree are combined. Close to 90% of the panelists indicated that the recommended passing score was about right and close to 10% indicated that it was too low.

A second indicator of quality is the replicability of the recommended passing score. The use of two panels is an explicit feature of the multistate standard-setting process. This increases the number of educators contributing to the passing-score recommendation, leading to a more stable recommendation, but it also permits a direct estimate of the replicability of the passing-score recommendation. The recommended passing score is the Round 2 (final round) mean for a panel, so two means are available for each test (one for Panel 1 and one for Panel 2). Brennan (2002) provides a way to calculate a standard error of a mean when there are two observations, as is the case in the multistate approach. The standard error provides a way to gauge replicability. The standard error is the absolute difference between the two means (recommended passing scores) divided by two. Sireci, Hauger, Wells, Shea, and Zenisky (2009) suggest that a value of less than 2.5 indicates that other panels of educators would likely recommend comparable passing scores. The standard error is less than 2.5 in all 16 instances. The lowest value was 0.43 and the highest was 2.14; the average value was 1.29. This indicates that the recommended passing scores should not vary significantly across other panels of educators.
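
The replicability check described above reduces to a very small calculation, sketched below with hypothetical panel means (the passing scores shown are invented; the 2.5 criterion is from Sireci et al., 2009).

    def two_panel_standard_error(panel1_cut, panel2_cut):
        # Brennan's (2002) standard error of a mean based on two observations:
        # half the absolute difference between the two panels' recommendations.
        return abs(panel1_cut - panel2_cut) / 2.0

    # Hypothetical Round 2 means from the two panels, on the 100-200 scale:
    se = two_panel_standard_error(157.0, 154.5)
    print(se, se < 2.5)  # 1.25 True -> comparable recommendations are likely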

A third indicator of quality in a multistate process is the variability in passing scores across states. A reduction in variance indicates a higher potential for teacher mobility, and preliminary evidence points in this direction. Between 2008 and 2010, 13 states that participated in single-state studies set passing scores for 37 Praxis Series tests. The average percentage of change from the panel-recommended passing score was approximately 5 scaled points. Between 2009 and 2010, 26 states that participated in multistate studies set passing scores for 10 Praxis Series tests.3 The average percentage of change from the panel-recommended passing score in these instances was approximately 1 scaled point. Preliminary analyses of multistate standard-setting results conducted in 2011 indicate that the trend toward a minimal percentage of change from the panel-recommended passing scores has continued.

3 Not all states that participated in standard setting set final passing scores in this time period.

Conclusion

Traditional standard setting for teacher licensure tests is done on a state-by-state basis, with each state assembling one panel of educators on one occasion to recommend a passing score. This places a burden on each state to recruit a sufficient number of educators to serve on a panel, leaves open to question whether other panels of experts would recommend a similar passing score, and often leads to variation in passing scores across states. These challenges, though clearly faced in teacher licensure, are likely faced in other licensure contexts that involve multiple jurisdictions or agencies. The multistate standard-setting process outlined in this paper addresses these challenges. The approach is accepted by state departments of education responsible for teacher licensure, uses well-researched standard-setting methods, and has yielded evidence that a recommended passing score from a multistate panel is likely replicable. This latter finding indicates that one multistate panel may, in fact, be sufficient for generating a stable passing-score recommendation to state licensing boards.

References

Brennan, R. L. (2002, October). Estimated standard error of a mean when there are only two observations (Center for Advanced Studies in Measurement and Assessment Technical Note Number 1). Iowa City: University of Iowa.

Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage.

Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23, 31–50.

Clauser, B. E., Margolis, M. J., & Case, S. M. (2006). Testing for licensure and certification in the professions. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 701-731). Westport, CT: Praeger.


Darling-Hammond, L. (2010, October). Evaluating teacher effectiveness: How teacher performance assessments can measure and improve teaching. Retrieved May 10, 2011, from Center for American Progress (http://www.americanprogress.org).

Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 89-116). Mahwah, NJ: Lawrence Erlbaum.

Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: American Council on Education/ Praeger.

Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 53–88). Mahwah, NJ: Lawrence Erlbaum.

Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425–461.

Raymond, M. R., & Reid, J. R. (2001). Who made thee a judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 119–157). Mahwah, NJ: Lawrence Erlbaum.

Sireci, S. G., Hauger, J. B., Wells, C. S., Shea, C., & Zenisky, A. L. (2009). Evaluation of the standard setting on the 2005 Grade 12 National Assessment of Educational Progress mathematics test. Applied Measurement in Education, 22, 339–358.

Tannenbaum, R. J., & Katz, I. R. (in press). Standard setting. In K. F. Geisinger (Ed.), APA handbook of testing and assessment in psychology. Washington, DC: American Psychological Association.

Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: ETS.


Effect on Pass Rate when Innovative Items Added to the National Certification Examination for Nurse Anesthetists

MARY ANNE KROGH, PhD, CRNA

In 2009, innovative items were added to the National Certification Examination for Nurse Anesthetists (NCE). This study was undertaken to examine the effect of this addition on the pass rates for the examination. The percent pass rates for recent graduates in 2008, 2009, and 2010 were analyzed through pairwise z-test comparisons with a Bonferroni-corrected significance level of p = .01. No difference was found in the pass rate when experimental items were added to the examination. Additionally, no difference was found when comparing the pass rate with experimental items to the pass rate with scored items. The pass rate did increase significantly between 2008, when there were no innovative items on the examination, and 2010, when innovative items were scored. It is possible that a novelty effect exists whereby candidates are sharing items with subsequent test-takers. Expansion of the innovative item pool should reduce this effect in the future.

In 1945, entry-level certification became a requirement for all graduates of approved nurse anesthesia programs. To achieve initial certification, candidates were required to pass a qualifying examination with the intention of protecting hospitals, surgeons, and the public served by certified registered nurse anesthetists (CRNAs). Since its inception, the certification examination for nurse anesthetists (NCE) has consisted primarily of four-option multiple-choice questions (MCQs). In 2008, a task force was created by the Council on Certification of Nurse Anesthetists (CCNA) with the charge of exploring the use of innovative item types (IIT) in the examination. In the fall of 2009, IITs were added to the NCE as experimental, unscored items and as scored items in 2010. Innovative items are traditionally defined as any item that is not formatted as an MCQ. For the purposes of this study, the IITs include multiple correct response (MCR), drag-and-drop (DD), and short-answer calculation (SAC) items (Figure 1). The effects of adding the new item types, both on the psychometric properties of the examination as a whole and on the experience of the examinee, have not previously been investigated.

Purpose of the Study

The purpose of this study was to determine whether the addition of innovative items impacted the pass rate on the examination. In 2008, no innovative items were present on the NCE. The new item types were added as experimental items in August of 2009, and as scored items in August of 2010.


Background and Literature Review

A major concern surrounding the use of MCQs is their level of difficulty. Even when written without flaws, many MCQs are written at a low cognitive level (Huff & Sireci, 2001; Tarrant, Knierim, Hayes, & Ware, 2006). MCQs that measure critical thinking and require higher levels of cognition from test candidates are difficult to write (Morrison & Free, 2001; Simkin & Kuechler, 2005; Tarrant et al., 2006).

Critical thinking, including interpretation, analysis, inference, evaluation, and explanation, is a central component of excellent nursing care. This process is in no way linear and requires reflection and reasoning on the part of the nurse. Examination items should reflect this process to effectively evaluate the candidate's ability in nursing decision-making processes (Wendt, Kenny, & Marks, 2007). If the goal of examination is to get at clinical reality, then it is important to move beyond MCQs; otherwise the test will not fully measure its intended construct (Jodoin, 2003). This shift in testing has the potential to improve construct representation, thus improving the measurement of knowledge, skills, and abilities needed for clinical success (Sireci & Zenisky, 2006). Because an important component of entry-level nurse anesthesia competence is the ability to make clinically relevant decisions, the addition of innovative items should improve the NCE as a measurement tool.

The cognitive skills necessary to answer IITs differ from those required to respond to MCQs. In constructed response items, for example, the cognitive demand is much broader because the candidate must analyze and synthesize data to formulate an answer. With MCQs, on the other hand, the candidate must eventually settle on a single answer that is provided (Martinez, 1999). IITs are thought to measure different, more complex cognitive constructs (Parshall, Davey, & Pashley, 2000; Wendt et al., 2007). Kubinger, Holocer-Ertl, Reif, Hohensinn, and Frebort (2010) studied the effect of adding MCRs and constructed response items to a mathematics examination. Both item types were found to be more difficult than standard MCQs even though they tested similar content. The difference in item difficulty was attributed to the effect of guessing correctly in the MCQs. Wendt (2008) studied the item characteristics of IITs. For that study, the innovative items consisted of constructed response, ordered response, and MCR items in the nursing licensure examination. The constructed response and MCR items were consistently more difficult than traditional MCQs.

Construct irrelevant variance is a major concern with IITs (Parshall et al., 2000; Sireci & Zenisky, 2006). Test anxiety and the difficulty of new items were of particular concern in this study. While all items on the NCE are presented via computer, the innovative items require more data manipulation and, therefore, may introduce more test anxiety for test candidates. Because test anxiety could not be measured directly with the data available, the study focused on comparisons of pass rates on the NCE before innovative items were added, after experimental items were added, and after scored items were added.


FIGURE 1. Samples of MCQs and innovative items

MCQ (multiple-choice question):
In the elderly, the time needed for clinical recovery from neuromuscular blockade is significantly increased for:
A. cisatracurium
B. vecuronium
C. pipecuronium
D. mivacurium

MCR (multiple correct response):
What are the hemodynamic goals for the patient with hypertrophic cardiomyopathy? Select two.
A. Decrease contractility
B. Decrease preload
C. Increase afterload
D. Increase heart rate

SAC (short answer/calculation):
Calculate the cardiac output, given the following parameters: HR 60 beats/min, BP 120/80 mm Hg, SV 60 ml/beat.

DD (drag and drop):
[Drag-and-drop item pairing entries in an "Action" column with entries in a "Laryngeal muscle" column.]
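
For the sample SAC item, the intended computation (worked out here only for illustration; the keyed answer is not reported in the article) is cardiac output = heart rate × stroke volume = 60 beats/min × 60 mL/beat = 3,600 mL/min, or 3.6 L/min; the blood pressure values are not needed for the calculation.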


Theoretical Framework

The NCE is administered in a computer-adaptive manner and the Rasch 1-parameter item response theory (IRT) model is utilized for scoring. IRT is a probabilistic model used to predict the abilities or latent traits of test candidates (De Champlain, 2010; Downing, 2003; Kyngdon, 2008; Weiss & Yoess, 1991). In an IRT item characteristic curve, item difficulty and candidate ability are expressed on the same scale along the x-axis, while the probability of a correct response is plotted on the y-axis.

For IRT to be utilized, items must first be calibrated for the model being used. Models can account for difficulty, discrimination, and guessing. The Rasch model sets the guessing parameter at zero and the discrimination parameter at 1. Therefore, the model functions by determining the difficulty of the item and the probability that any one candidate of a certain ability will be able to answer the item correctly. The Rasch model may be utilized in conjunction with computer adaptive testing to deliver an examination tailored to the ability level of the test candidate (Downing, 2003; Weiss & Kingsbury, 1984).
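
A minimal sketch of the Rasch probability function and the logic of matching item difficulty to candidate ability follows (an illustration only, not the operational NCE scoring algorithm; the ability and difficulty values are invented). When an item's difficulty equals the candidate's ability, the modeled probability of a correct response is .50, which is where the item is most informative for that candidate.

    import math

    def rasch_probability(ability, difficulty):
        # Rasch (one-parameter logistic) probability of a correct response,
        # with guessing fixed at 0 and discrimination fixed at 1.
        return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

    # A candidate at 0.5 logits answering items of varying difficulty; an
    # adaptive test keeps selecting items with difficulty near the current
    # ability estimate, so these probabilities stay near .50.
    for b in (-1.5, 0.5, 2.5):
        print(b, round(rasch_probability(0.5, b), 2))  # 0.88, 0.5, 0.12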

Because the candidate ability and item difficulty are matched during the administration of a computer-adaptive examination, it was hypothesized that the introduction of items of higher difficulty would not impact the pass rate on the examination. There was concern, however, that IITs could introduce an element of test anxiety and contribute to construct irrelevant variance.

Methodology

Data for the study were collected by the National Board of Certification for Nurse Anesthetists (NBCRNA). All individual data were coded to conceal any identifying information from the researcher. After exemption was received from the human subjects committee at South Dakota State University, score results were compiled and formatted for analysis.

Pass rates for the NCE were taken from candidate data from August through December of 2008, 2009, and 2010. Pass rates for recent graduates who passed the NCE on their first attempt were used in the analysis. In 2008, no innovative items were present on the NCE. The IITs were added as unscored items in 2009, and as scored items in 2010. Pairwise comparisons of the pass rates were made using a z-test for comparing proportions to determine whether a statistically significant difference existed in pass rates between the test formats. The data for this analysis were independent because they were derived from different candidate groups. Because multiple pairwise comparisons inflate the risk of a Type I error, a Bonferroni correction was used to set the significance level at .01 for the analysis.
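
The analysis just described can be sketched as follows, using the counts reported in Table 1 and a pooled two-proportion z-test. This is an illustrative implementation; the published analysis may have used slightly different inputs or a different variance formula, so the values produced here are close to, but do not exactly match, those in Table 2.

    import math
    from statistics import NormalDist

    def two_proportion_z(pass1, n1, pass2, n2):
        # Pooled two-proportion z-test for two independent samples.
        p1, p2 = pass1 / n1, pass2 / n2
        pooled = (pass1 + pass2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (p2 - p1) / se

    ALPHA = 0.01  # Bonferroni-adjusted significance level for three comparisons
    critical_z = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 2.58, two-sided

    comparisons = {
        "2008 vs. 2009": two_proportion_z(925, 1071, 932, 1039),
        "2009 vs. 2010": two_proportion_z(932, 1039, 906, 994),
        "2008 vs. 2010": two_proportion_z(925, 1071, 906, 994),
    }
    for label, z in comparisons.items():
        # Only 2008 vs. 2010 exceeds the critical value, mirroring Table 2.
        print(label, round(z, 2), abs(z) > critical_z)  # roughly 2.36, 1.11, 3.42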

Results

The pass rate for the NCE progressively increased from 2008 to 2010, despite the addition of IITs (Table 1). When the pairwise z-tests were conducted to determine if a significant difference existed in the pass rates, no significant difference was found between 2008 and 2009, or between 2009 and 2010. A significant difference existed between 2008 and 2010, indicating that the pass rate was significantly higher with scored IITs on the examination than with no IITs (Table 2).

TABLE 1. National Certification Examination Pass Rates for 2008, 2009, 2010

Year*   N Pass   N Candidates   % Pass
2008      925        1071         86.4
2009      932        1039         89.7
2010      906         994         91.2

NOTE: *August through December of each year.

TABLE 2. Results from z-test Pairwise Analysis of Pass Rates from August through December 2008, 2009, and 2010

Comparison        z-score   Significance
2008 vs. 2009       2.29        .02
2009 vs. 2010       1.03        .30
2008 vs. 2010       3.35        .001*

NOTE: *p<.01 with Bonferroni correction


The NCE is a computer-adaptive examination that utilizes item response theory for scoring. Item difficulty varied between the item types. On average, MCR and DD items were most difficult for students, while MCQ and SAC items were easier (Table 3). This finding differs from that of Wendt (2008). Because this administration and scoring model matches item difficulty to candidate ability, no difference in pass rate was expected, despite the difference in difficulty among the items. However, unexpectedly, the pass rate on the examination increased significantly between 2008 and 2010. For the time periods studied, the introduction of IITs did not negatively impact the pass rate on the NCE. Further, it does not appear that the addition of IITs has introduced sufficient test anxiety to reduce the examination pass rate.

TABLE 3. Comparison of item difficulty

Item Format   Average Difficulty (logit)   Average Difficulty (Pval)
SAC                     1.83                          .64
MCQ                     1.85                          .62
DD                      2.44                          .53
MCR                     2.68                          .49

NOTE: SAC=short answer calculation, MCQ=multiple choice question, DD=drag and drop, MCR=multiple correct response, Pval=proportion correct.

Implications of the Study

Harmes and Wendt (2009) identified a concern that adding IITs to an examination could introduce a novelty effect, whereby candidates share novel test items with other candidates. However, the researchers found that test candidates did not remember specific items with enough detail to impact test security. Despite these findings in the pilot study, it is possible that the increase in the pass rate over time is related to a novelty effect. An expansion of the IIT pool will be required to overcome this effect. Further research could include replication of the Harmes and Wendt pilot study to determine if memorability of IITs exists in the NCE.

Additionally, the concerns over changes in the pass rate due to test anxiety seem to be unfounded. A significantly higher pass rate on the examination was seen after introduction of scored test items. Perhaps the new item types mimic real-world problem-solving in a more realistic way than can be accomplished by MCQs alone. Further studies involving the critical thinking processes of test candidates will need to be completed to determine if different cognitive skills are utilized in the new item types.

References

De Champlain, A. F. (2010). A primer on classical test theory and item response theory for assessments in medical education. Medical Education, 44, 109-117.

Downing, S. M. (2003). Item response theory: Applications of modern test theory in medical education. Medical Education, 37, 739-745.

Harmes, J. C., & Wendt, A. (2009). Memorability of innovative items. CLEAR Exam Review, 20(1), 16-20.

Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measurement: Issues and Practice, 20, 16-25.

Jodoin, M. G. (2003). Measurement efficiency of innovative item formats in computer-based testing. Journal of Educational Measurement, 40(1), 1-15.

Kubinger, K. D., Holocer-Ertl, S., Reif, M., Hohensinn, C., & Frebort, M. (2010). On minimizing guessing effects on multiple-choice items: Superiority of a two solutions and three distractors item format to a one solution and five distractors item format. International Journal of Selection and Assessment, 18(1), 111-115.

Kyngdon, A. (2008). The Rasch model from the perspective of the representational theory of measurement. Theory & Psychology, 18(1), 89-109.

Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207-218.

Morrison, S., & Free, K. W. (2001). Writing multiple-choice test items that promote and measure critical thinking. Journal of Nursing Education, 40(1), 17-24.

Parshall, C. G., Davey, T., & Pashley, P. J. (2000). Innovative item types for computerized testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 129-147). Boston: Kluwer Academic Publishers.

Simkin, M. G., & Kuechler, W. L. (2005). Multiple-choice tests and student understanding: What is the connection? Decision Sciences Journal of Innovative Education, 3(1), 73-97.

Sireci, S. G., & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In pursuit of improved construct representation. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development. Mahwah, NJ: Erlbaum.

Tarrant, M., Knierim, A., Hayes, S. K., & Ware, J. (2006). The frequency of item writing flaws in multiple-choice questions used in high stakes nursing assessments. Nurse Education Today, 26(8), 354-363.

Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21(4), 361-375.

Weiss, D. J., & Yoess, M. E. (1991). Item response theory. In R. K. Hambleton & J. Zaal (Eds.), Advances in educational and psychological testing. Boston, MA: Kluwer Academic Publishers.

Wendt, A. (2008). Investigation of the item characteristics of innovative item formats. CLEAR Exam Review, 19(1), 22-28.

Wendt, A., Kenny, L. E., & Marks, C. (2007). Assessing critical thinking using a talk-aloud protocol. CLEAR Exam Review, 18(1), 18-27.