
Copyright © 2010 Biddle Consulting Group, Inc.

TVAP

Test Validation & Analysis Program

Version 7.0

July 2010

Biddle Consulting Group, Inc.

193 Blue Ravine, Suite 270

Folsom, CA 95630 / 916.294.4250 ext. 113

www.biddle.com


Contents

Program Setup and Overview
    System Requirements and Setup Instructions
    Program Overview
    Historical/Legal Background of the Program

Steps for Validating and Analyzing Tests Using the Program
    Validate the Test Content Using the Test Validation Workbook
        Step 1: Have SMEs Rate Items Using the Test Item Survey and Input Survey Data into the Test Item Survey Workbook
        Step 2: Evaluate the Question Analysis Sheet in the Test Validation Workbook and Select Surviving Items
        Step 3: Export the Angoff Ratings from the Test Analysis Export Sheet and Import to the Test Analysis Workbook
    Analyze the Test Results and Establish Cutoff Score Options Using the Test Analysis Workbook
        Step 1: Rater Input: Import Angoff Ratings and Keep Only Ratings for Items that Survived the Question Analysis Review
        Step 2: Rater Analysis: Evaluate Rater Reliability and Remove Outlier Raters
        Step 3: Finalize and Administer the Test and Announce the Critical Score
        Step 4: Complete Item Analyses
        Step 5: Remove Test Items Based on Item Analyses
        Step 6: Complete Overall Test Analyses
        Step 7: Calculate Three Cutoff Score Options and Set the Final Cutoff Score
        Step 8: Cutoff Results by Gender/Ethnicity & Adverse Impact Analyses

Glossary

References

Attachment A - Instructions for Completing the Test Item Survey
    Introduction
    Administering the Test Item Survey
    Instructions for Providing Ratings on Checklist (Questions 1-10)
    Instructions for Rating Job Knowledge Tests (Questions 11-14)
    Test Item Comment Form

Attachment B - The Conditional Standard Error of Measurement (CSEM)

Attachment C - Test Item Writing Guidelines
    Test Item Writing Guidelines
    Item Writing
    Item Format When Developing Tests
    Examples of Test Items
    Examples of Item Types
    Test Plan Example

Attachment D - Upward Rating Bias on Angoff Panels


Program Setup and Overview

System Requirements and Setup Instructions

Before installing the Test Validation & Analysis Program, be sure that your computer is equipped with at least the following:

• CD-ROM drive
• Pentium III processor (or higher)
• 1 GB of RAM minimum (more recommended)
• 150 megabytes of free hard drive space
• Super VGA monitor, 800x600 resolution minimum, 1024x768 recommended
• Microsoft Windows® XP, Vista, or 7
• Microsoft Excel® XP/2003/2007
• Excel must be configured with macros enabled for the Program to operate

To install the Program, insert the CD, run the "setup.exe" file, and follow the on-screen instructions.

Program Overview

This Program is designed for use by human resource professionals to aid in validating and analyzing written tests (the Program may also be used for other types of tests; please call for assistance). Because this Program uses a content validity approach for validation (see Sections 14C and 15C of the Uniform Guidelines, 1978), it is most appropriate for validating tests designed to measure knowledges, skills, abilities, and personal characteristics ("KSAPCs") that can be validated using this method. Note that the Uniform Guidelines specify that tests measuring abstract traits or constructs that cannot be "operationally defined" in terms of observable aspects of job behavior (see Section 14C(1) and Question & Answer #75) should not be validated using a content validation strategy.

The Program was developed by integrating concepts and requirements from professional standards, the Uniform Guidelines (1978), and relevant court cases into a system that is relatively automated. While efforts were made to automate these processes as much as possible, professional judgment should be used when operating this Program and evaluating its results.

The Program includes two separate Workbooks: the Test Validation Workbook and the Test Analysis Workbook. These Workbooks operate in Microsoft Excel and include numerous features that use embedded programming and macros to complete the calculations. Microsoft Excel was chosen as the development platform to allow the user to readily import and export data and relevant analysis results, and to make custom changes if desired. Used in sequence, the Workbooks provide a complete set of tools to validate written tests (entry level or promotional), analyze item-level and test-level results (after administration), and set job-related and defensible cutoff scores using court-approved methods.


To use the two Workbooks in a successful test development and validation process, the user needs two documents in advance. The first is a job analysis document that includes a list of important and/or critical job duties and KSAPCs for the target position. The second is the draft written test, which should have been developed based on the job analysis using both subject-matter experts ("SMEs") and human resource staff. If these two documents are available, the test validation and analysis process can begin by using the two Workbooks in the following manner.

The Test Validation Workbook includes the tools for validating a test before it is administered to applicants. This Workbook includes a survey that is printed and given to a panel of SMEs. The SMEs use the survey to rate each question on the draft written test. These ratings are then entered into the Test Validation Workbook and analyzed using the Program to evaluate which items to include on the final test (and which items to discard, or save for re-evaluation by SMEs after revision). After the final items are identified using this Workbook, part of the SME ratings (the Angoff ratings, which reflect the SMEs' opinions about the minimum passing level for each item on the test) are exported to the Test Analysis Workbook for the remaining steps in the process.

The Test Analysis Workbook is used to analyze the test results (at both an item and overall test level), modify and improve the test based on these results, and then set job-related cutoffs using methods that have been previously endorsed by the courts. There are eight steps included in this Workbook, all of which should be completed the first time a new test is administered; only some of them are used to analyze test data for subsequent administrations.

In summary, the Test Validation Workbook is used before a test is administered (primarily to decide which questions to include on the test), and the Test Analysis Workbook is used to analyze the test after it has been administered and to set cutoffs based on various factors.

This manual is designed to provide the user with instructions for operating both Workbooks and interpreting their results. Note that some background in statistics and test development/validation is needed to operate these Workbooks and effectively analyze their results. A Glossary at the end of this manual defines key terms.

There are four attachments in this manual:

Attachment A - Instructions for Completing the Test Item Survey can be used as a guide for administering the Test Item Survey to SMEs and facilitating a test item validation workshop.

Attachment B - The Conditional Standard Error of Measurement (CSEM) describes some of the technical aspects included in this manual, specifically the use of statistics to establish a cutoff score and create top-down score bands.

Attachment C - Test Item Writing Guidelines provides information for working with SMEs and human resource staff to write effective test items.

Attachment D - Upward Rating Bias on Angoff Panels provides a brief description of the tools and methods in the Program for detecting (and correcting where appropriate) an upward rating bias that sometimes occurs when establishing cutoff scores for tests.

Historical/Legal Background of the Program

Written tests have been the focus of litigation for several decades. Completing a thorough validation process, such as the one facilitated by this Program, offers two key benefits to the employer. First, it helps ensure that the test used for selection or promotion is sufficiently related to the job and includes only test items that SMEs have deemed fair and effective. Second, a validation process generates documentation that can be used as evidence should the test ever be challenged in an arbitration or civil rights litigation setting.

Some of the standards used in the Program were adopted from court cases where the criteria pertaining to test validation have been litigated. Two of these court cases are Contreras v. City of Los Angeles (656 F.2d 1267, 9th Cir. 1981) and U.S. v. South Carolina (434 U.S. 1026, 1978).

In the Contreras v. City of Los Angeles case, a three-phase process was used to develop and validate an examination for an Auditor position. In the final validation phase, where the SMEs were asked to identify a knowledge, skill, or ability that was measured by the test item, a "5 out of 7" rule (71%) was used to screen items for inclusion on the final test. After extensive litigation, the Ninth Circuit approved the validation process of constructing a written test using items that had been linked to the knowledges, skills, abilities, and personal characteristics (KSAPCs) of a job analysis by at least five members of a seven-member SME panel.

In U.S. v. South Carolina, SMEs were convened into ten-member panels and asked to judge whether each question on the tests (which included 19 subtests on a National Teacher Exam used in the state) involved subject matter that was part of the curriculum at the rater's teacher training institution and was therefore appropriate for testing. These review panels determined that between 63% and 98% of the items on the various tests were content valid and relevant for use in South Carolina. The U.S. Supreme Court endorsed this process as "sufficiently valid" to be upheld.

These cases provide guidelines for establishing minimum thresholds (71% and 63%, respectively) for the levels of SME endorsement necessary when screening test items for inclusion on a final test used for selection or promotion purposes. In both cases, it is important to note that at least an "obvious majority" of the SMEs was required to justify that an item was sufficiently related to the job to be selected for inclusion on the test.

Following the reasonable precedent established by these two cases, this Program uses a ">65% job duty or KSAPC linkage" criterion for classifying items as "acceptable" for inclusion on a test. If 65% or fewer of the SMEs link the item to a job duty or KSAPC, the Program simply "red flags" the item for closer evaluation. It should be noted, however, that because this Program function requires at least a 65% endorsement level for each item, the collective endorsement level for all of the items used on an actual test is likely to be much higher, because several of the items are likely to have much higher SME endorsement levels.


Steps for Validating and Analyzing Tests Using the Program

As previously mentioned, a Job Analysis document and a draft written test are needed before beginning the test validation and analysis process. The Job Analysis used for this process must consist of important and/or critical job duties and KSAPCs, presented in numbered lists so that SMEs can reference them when providing ratings (e.g., Job Duties 1-23 and KSAPCs 1-41). The draft written test should consist of multiple-choice and/or true/false questions, also arranged in numerical order for SME reference during the rating process. After these two documents have been compiled, the steps below can be completed using the Program.

Important Note! There are 11 steps described below for completing the test validation and analysis process facilitated by the Program. Three (3) of the steps are completed using the Test Validation Workbook. Eight (8) of the steps are completed using the Test Analysis Workbook.

Validate the Test Content Using the Test Validation Workbook

Step 1: Have SMEs Rate Items Using the Test Item Survey and Input Survey Data into the Test Item Survey Workbook

This step can be completed by printing the Test Item Survey contained in the Test Validation Workbook, having SMEs complete the surveys using the test, the answer key, and the appropriate job analysis documents, and then inputting their ratings into the SME Sheets in the Workbook (e.g., SME I, SME II, SME III, etc.). See "Instructions for Completing the Test Item Survey" in Attachment A for detailed instructions on conducting this workshop. It is recommended that no fewer than 3 and no more than 12 SMEs participate in a panel (it is difficult to evaluate rater reliability with fewer than 3 raters, and more than 12 can sometimes be unwieldy and provide no different or "better" results than a panel with fewer SMEs).

Important Note! The Test Item Survey includes a total of 14 survey questions to be evaluated by SMEs for every item on a test (14 questions are needed for job knowledge tests; only 10 are needed for all other types of tests). So, for a job knowledge test with 100 items, does this mean that each SME needs to provide 14 x 100 ratings, or 1,400 ratings? Yes. Survey questions 6, 8, 9, 12, and 14 are "active" questions that require the SME to provide a unique response (e.g., the minimum passing percentage for each item). The remaining 9 questions are "passive" questions, meaning that the SMEs only need to dissent on these questions if they desire to provide a negative response. For this reason, an "Auto-Fill" function has been added to the Program, which allows the user to populate all 9 "passive" questions with "Yes" for a specified number of SMEs and test items. This way, only the "No" ratings from the SMEs need to be input for these questions.


When inputting data into the Test Validation Workbook, users should start at the Menu Sheet. To begin inputting data from the Test Item Survey, click on "SME I." After the data entry screen appears, begin inputting data for the first test question, using the "Tab" key to move between fields (note that "Tab" moves down through the fields and "Shift-Tab" moves up). Note that "1" should be input for a "No" response and "2" should be input for a "Yes" response in fields requiring Yes or No input. The "Close" button on the data entry screen is used to close the data entry screen. Alternatively, data may be input directly into the spreadsheet using the required input values. Repeat the data entry process for all test items and all SMEs. When entering multiple duty and/or KSAPC linkages for a test item, separate each number with a comma.

Important Note! Because Excel has limited file saving/recovery features, be sure to save your work frequently when using the Program.

Step 2: Evaluate the Question Analysis Sheet in the Test Validation Workbook and Select Surviving Items

After completing and double-checking the data entry, return to the Menu Sheet and click on the "Question Analysis" button. This will calculate all the values and summarize the results in the Question Analysis Sheet. The different values shown on this screen are described below.

Column Descriptions

# Red Flags: The number shown in this column represents the number of potential problem areas with the corresponding test item. Test items that are red flagged should be considered for removal from the test, or revised and re-evaluated by SMEs. (Some minor changes can be made to test items without requiring re-evaluation by SMEs; however, items that are substantially re-worked to address the problem areas identified by SMEs should be re-evaluated.) Because some survey criteria are more significant than others, professional judgment should be used in this process.

Columns 1-5: These columns provide the "Yes/No" ratio from the SMEs (e.g., 67% indicates that two-thirds of the SMEs answered "Yes" to the survey question) regarding the quality of the item stem (the part of the test item that asks the question) and the alternatives. A >65% criterion is used for survey questions 1, 2, 4, and 5. A >86% criterion is used for question 3 (correct key).

Column 6: This column shows the average Angoff rating (minimum competency rating) for the item. No red flag criterion is used for this rating; however, items that are rated very low (near a "chance score" for the item) or very high (e.g., >95%) should be closely evaluated (see the section titled "Administering the Test Item Survey" in Attachment A). This survey question is designed to address Sections 5H, 14C(7), and 15C(7) of the Uniform Guidelines. See Attachment A for detailed instructions on how to gather these ratings from SMEs.

Column 7: This column shows the ratio of SMEs who agree that the item is fair to all groups and free from unnecessary bias or culturally loaded content. A >86% criterion is used for this question.

Columns 8-9: These columns show the percentage of SMEs who linked the test item to a duty (Column 8) or a KSAPC (Column 9). A >65% criterion is used for these ratings. It is not necessary that items be linked to both job duties and KSAPCs if an acceptable KSAPC/job duty linkage study was included in the job analysis process (where SMEs identified where the KSAPCs are actually applied on the job). These survey questions are designed to address Sections 14C(4-5) and 15C(4-5) of the Uniform Guidelines.

Column 10: This column provides the "Yes/No" ratio for the "Necessary on the First Day of the Job" rating; a >65% criterion is used for this rating. Tests should measure an aspect of the targeted KSAPC that is needed before on-the-job training. This survey question is designed to address Sections 14C(1) and 5F of the Uniform Guidelines. While the job analysis process may ensure that only KSAPCs needed on the first day of the job are selected for measurement on the written test, this survey question helps ensure that the specific aspect of the KSAPC measured on the test is needed before on-the-job training.

NOTE: Survey Questions 11-14 are only required for tests that include items measuring job knowledge.

Column 11: This column provides the ratio of SMEs who believed that the item was "Based on Current Information." An item that fails to meet the >65% endorsement criterion used for this question may not be based on current job knowledge (or, even if the item is technically based on current information, the SMEs are indicating that, given their current understanding of the job, it may not be practically based on current information).

Column 12: This column provides the average "Memorized" rating (using a 0-2 scale). A >=1.0 criterion is used for this question. Tests measuring job knowledge should only measure job knowledge areas that need to be held in memory while a person is performing the job (rather than areas that can easily be looked up while performing the job without a potential negative consequence). Sometimes job knowledge tests measure areas of job knowledge that are provided on a reading list that applicants are directed to study before taking the test. Tests measuring these "directed" areas of job knowledge are acceptable, provided that the majority of SMEs endorse that the item measures a specific area of knowledge that is necessary to have in memory and cannot be looked up without some potential negative consequence on the job.

Column 13: This column provides the "Yes/No" ratio regarding the level of difficulty of the item. A >65% criterion is used for this rating. Test items designed to measure job knowledge should be written at a level of difficulty that is similar to how the job knowledge will actually be applied on the job. This survey question is designed to address Section 14C(4) and Question & Answer #79 of the Uniform Guidelines.

Column 14: This column provides the average "Consequences" rating (using a 0-2 scale). A >=1.0 criterion is used for this rating. This survey question is designed to flag items that measure only trivial aspects of job knowledge. The rating is useful for screening out items that may be linked to a job knowledge domain (survey question 9) that is (globally) critical to job performance, but where the specific aspect of the knowledge measured by the item is not critically important. This survey question is designed to address Section 14C(4) and Question & Answer #62 of the Uniform Guidelines.

The >65% criterion is used for the majority of the survey questions for at least three reasons. First, this criterion level has been previously endorsed in two high-profile court cases involving written tests where SME judgments on job relatedness were evaluated. Second, the criterion represents a "clear majority." Third, 65% is a "natural break" that works well for SME panels of various sizes. For example, in a 3-5 member SME panel, the 65% criterion allows for one dissenting SME; in a 6-8 member panel, two can dissent; in a 9-11 member panel, three can dissent; and in a 12-member panel, four can dissent. The >86% criterion (used for the "correct key" and "fair" survey questions) requires perfect agreement among SMEs for panels with six or fewer members, and allows for one dissenting SME for panels of 7-12 members.

Please note that conservative standards have been used for the "Accept" and "Reject" values produced by the Program. The type of test, the nature of the position, and the extent to which other tests are used should be among the factors considered when using the summary data from this Sheet.
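To make the endorsement arithmetic concrete, the sketch below shows how a "Yes/No" ratio can be compared against the >65% criterion to flag an item. This is a simplified, hypothetical illustration rather than the Program's actual code, and the function and variable names are assumptions.

```python
# Hypothetical sketch of the endorsement-ratio screening described above.
# Most survey questions use a >65% "Yes" criterion; the "correct key" and
# "fairness" questions use >86%. Names here are illustrative only.

def endorsement_ratio(votes):
    """votes: list of 'Yes'/'No' responses from the SME panel for one question."""
    return sum(v == "Yes" for v in votes) / len(votes)

def red_flag(votes, criterion=0.65):
    """Flag the item when the 'Yes' ratio does not exceed the criterion."""
    return endorsement_ratio(votes) <= criterion

# Example: a 7-member panel rating one item on survey question 1 (item stem).
panel_votes = ["Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"]   # 5 of 7, about 71%
print(endorsement_ratio(panel_votes))   # 0.714...
print(red_flag(panel_votes, 0.65))      # False: 71% exceeds the >65% criterion
```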

Step 3: Export the Angoff Ratings from the Test Analysis Export Sheet and Import to the Test Analysis Workbook

Using the "Export" button in the Test Analysis Export Sheet in the Test Validation Workbook, export the Angoff ratings (the data in Column 6 of the Question Analysis Sheet). This will ready the data for importing into the Test Analysis Workbook.

Analyze the Test Results and Establish Cutoff Score Options Using the Test Analysis Workbook

Step 1: Rater Input: Import Angoff Ratings and Keep Only Ratings for Items that Survived the Question Analysis Review

Click the "Import Rater Data" button under the "Tools" menu to import the Angoff ratings for the test items. This will import the Angoff ratings from SMEs that were developed using the Test Validation Workbook. After importing these ratings, go to the Rater Input Worksheet and delete the ratings for the items that did not survive the criteria in the Question Analysis Sheet (in the Test Validation Workbook). This is necessary because the Question Analysis Sheet in the Test Validation Workbook preserves and records the process used for choosing the items to include on the final test, whereas only the Angoff ratings for the final items are retained for test administration and analysis purposes. The Angoff ratings for each unused test item should be deleted from the Rater Input Worksheet by placing the cursor in the appropriate test item row, holding down the Shift key, using the right arrow key to highlight all raters in that row (be careful to stop at the last rater), and then pressing the Delete key. Important: after removing the ratings for the unused items, square the data by moving row data up so there are no empty rows.

Step 2: Rater Analysis: Evaluate Rater Reliability and Remove Outlier Raters

This section of the Program provides six (6) useful Outputs for analyzing the raters who participated in the Angoff process. Outputs 1-5 can be interpreted prior to administering the test; Output 6 can only be computed and viewed after the test has been administered.

After the data has been imported into the Rater Input Sheet, the first five Outputs on the Rater Analysis Sheet are automatically calculated, and the outlier raters (raters whose ratings fall outside of the "normal range" of ratings provided by the other raters) are automatically removed from the calculation of the Critical Score. The Critical Score is the raw score obtained when the average of the SME item ratings is multiplied by the number of test items. A cutoff score, in contrast, is the final score selected as a pass/fail point for the test, and is determined after reducing the Critical Score by one, two, or three Conditional Standard Errors of Measurement (CSEMs) (see the discussion below and in Attachment B). The six Outputs on the Rater Analysis Sheet are described below.

Output 1: Reliability Matrix

This matrix displays the correlations between raters on the item ratings. Correlation values range between -1 and 1. Positive numbers indicate various levels of "agreement" between raters; negative numbers indicate "disagreement." Low correlations are .10 or lower; medium correlations are between .10 and .30; high correlations are between .30 and .50; and very high correlations exceed .50. Raters who are not highly correlated with the others (or who are negatively correlated) should be closely evaluated.

Output 2: Overall Reliability by Rater

This Output shows, for each rater, a correlation indicating how consistent his or her ratings were relative to all other raters on the panel. Raters who are not statistically significantly correlated with the average rating of all other raters (p-values greater than .05, indicated in yellow) are automatically removed by the Program from the calculation of the overall Critical Score at this step. Raters who are not statistically significantly correlated with the average rating of the other raters either did not understand the directions sufficiently, purposefully rated items at random, tried to express a particularly high or low bias in their ratings, or simply had very different opinions than the other raters (or possibly a combination of several of these reasons).
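A minimal sketch of the rater-consistency checks described in Outputs 1 and 2 is shown below. It assumes a simple items-by-raters matrix of Angoff percentages and uses ordinary Pearson correlations; the Program's exact computations may differ, and the data and variable names are illustrative only.

```python
# Sketch (assumed form): rows = test items, columns = raters,
# values = Angoff ratings (minimum passing percentage per item).
import numpy as np
from scipy.stats import pearsonr

ratings = np.array([
    [70, 75, 65, 90],
    [80, 85, 75, 60],
    [60, 65, 55, 85],
    [90, 85, 80, 50],
    [75, 70, 70, 95],
], dtype=float)
n_items, n_raters = ratings.shape

# Output 1 (reliability matrix): pairwise correlations between raters.
matrix = np.corrcoef(ratings, rowvar=False)
print(np.round(matrix, 2))

# Output 2 (overall reliability by rater): correlate each rater with the
# average of all OTHER raters and flag non-significant raters (p > .05).
for j in range(n_raters):
    others = np.delete(ratings, j, axis=1).mean(axis=1)
    r, p = pearsonr(ratings[:, j], others)
    flag = "REMOVE (p > .05)" if p > 0.05 else "keep"
    print(f"Rater {j + 1}: r = {r:.2f}, p = {p:.3f} -> {flag}")
```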

Output 3: Overall Rater Panel Reliability

This Output shows the overall reliability of all raters using the intraclass correlation coefficient (ICC), which reflects the average reliability of the entire panel as a whole. While it is desirable to have a panel with an ICC value that exceeds .50, lower values may be acceptable. The ICC provided in this Output only includes raters who survived Outputs 2 and 4.
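The manual does not specify which ICC variant the Program uses; the sketch below computes one common formulation (a two-way, average-measures ICC, often labeled ICC(2,k)) from the same kind of items-by-raters matrix, purely as an illustration under that assumption.

```python
# Illustrative ICC(2,k) (two-way random effects, average of k raters),
# computed from ANOVA mean squares. The variant is an assumption; the
# Program may use a different ICC formulation.
import numpy as np

def icc2k(ratings):
    """ratings: items x raters matrix of Angoff percentages."""
    n, k = ratings.shape
    grand = ratings.mean()
    item_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)

    ss_items = k * ((item_means - grand) ** 2).sum()
    ss_raters = n * ((rater_means - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_error = ss_total - ss_items - ss_raters

    bms = ss_items / (n - 1)              # between-items mean square
    jms = ss_raters / (k - 1)             # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (bms - ems) / (bms + (jms - ems) / n)

ratings = np.array([[70, 75, 65], [80, 85, 75], [60, 65, 55],
                    [90, 85, 80], [75, 70, 70]], dtype=float)
print(round(icc2k(ratings), 2))  # panel-level reliability estimate
```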

Output 4: Outlier Raters (Raters With Overall Averages That Are Significantly High or Low Compared to the Average of the Panel)

This Output highlights raters who, on average, rated items significantly higher or lower than the other raters (using a rule of +/- 1.645 standard deviations from the average of the overall panel). Note that a rater can be statistically significantly correlated with other raters (based on Outputs 1 and 2 above) yet still consistently provide higher or lower ratings than the other raters. Raters who are significantly higher or lower than the others may be attempting to raise or lower the Critical Score for reasons that may not be job related. The process of omitting these "outlier" raters is known as "trimming"; it is useful for eliminating atypical data reported by SMEs and brings the average values within "normal ranges" rather than allowing extreme data points to skew the average.
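The outlier screen can be illustrated in a few lines of code. This is a simplified reading of the +/- 1.645 standard deviation rule described above; whether the Program uses the population or sample standard deviation, or excludes the rater being tested from the panel statistics, is not stated in the manual, so the version below is only an assumption with illustrative values.

```python
# Illustrative outlier-rater screen: flag raters whose overall average rating
# falls more than 1.645 standard deviations from the panel's average.
import numpy as np

rater_means = np.array([72.0, 74.5, 70.0, 88.0, 73.0])  # per-rater average Angoff %
panel_mean = rater_means.mean()
panel_sd = rater_means.std(ddof=1)   # sample SD; an assumption (see above)

z = (rater_means - panel_mean) / panel_sd
for j, zj in enumerate(z, start=1):
    status = "OUTLIER (trimmed)" if abs(zj) > 1.645 else "retained"
    print(f"Rater {j}: mean = {rater_means[j - 1]:.1f}, z = {zj:+.2f} -> {status}")
```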

Output 5: Overall Critical Score

This Output represents the final, unmodified Critical Score for the test, which will later be reduced by one, two, or three CSEMs to establish the final cutoff for the test. The results are shown both with all raters included and with the unreliable and outlier raters removed (i.e., including only raters not highlighted in Outputs 2 and 4).
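Because the Critical Score is defined above as the average of the SME item ratings multiplied by the number of test items, the arithmetic can be sketched as follows (the ratings shown are illustrative values only).

```python
# Critical Score = average Angoff rating (as a proportion) x number of items.
import numpy as np

# Surviving raters' item-level Angoff percentages (items x raters), after
# removing unreliable and outlier raters; values are illustrative.
ratings = np.array([[70, 75, 65], [80, 85, 75], [60, 65, 55],
                    [90, 85, 80], [75, 70, 70]], dtype=float)

n_items = ratings.shape[0]
critical_pct = ratings.mean()                 # average rating across items and raters
critical_raw = critical_pct / 100 * n_items   # expressed as a raw score

print(f"Critical Score: {critical_pct:.1f}% ({critical_raw:.2f} of {n_items} items)")
```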


Output 6: Compare Critical Score to Actual Test Results

This Output provides important insight into how the rating panel's recommended cutoff score relates to the scores of the test takers who scored in the region of the recommended cutoff score. Specifically, this Output evaluates whether the SMEs who set the recommended cutoff score had an upward bias when compared to the test takers who scored within the score range surrounding the Critical Score. The Output provides the means for adjusting the Critical Score recommended by the rater panel if the statistical analysis results justify such an adjustment. This Output has been provided because, in our experience, most rater panels have a tendency to overshoot the Critical Score level actually required for the test. See Attachment D (Upward Rating Bias on Angoff Panels) for a full discussion of this topic.

Important: This Output can only be interpreted after the Item Analysis, Test Analysis, and Rater Bias Evaluation programs have been run (by clicking their respective buttons in that order). In addition, for this part of the Program to work correctly, there must be a lock-step connection between the item numbers in the Rater Input and Item Data Input Sheets (e.g., item #1 must refer to the same item in both sheets).

Interpreting these Outputs requires assuming that the test taker pool (overall) was in fact at least minimally qualified. However, the Program extracts only the test takers who are within a statistically similar score range of the Critical Score (using a confidence interval of +/- 1.645 Standard Errors of Difference, or "SEDs"), and it only recommends a correction if a significant pattern of overestimation is observed (rather than a balanced mix of underestimates and overestimates). Under these conditions, making downward adjustments to the cutoff score recommended by the SME panel is justified if the warnings set up in the Program are triggered. Further, the two adjustments suggested by the Program are scaled to the severity of the potential bias observed, and are only made within a statistically similar range of the original ratings provided by the SME panel.

Even with these constraints, we recommend following the guidelines below before using these Outputs to adjust the Critical Score:

• Be sure that the test taker pool was (overall) qualified relative to the minimum requirements necessary for applying to the position.
• The sample size of the test takers should be relatively large (e.g., >200).
• Consider whether the SMEs who provided item ratings were properly calibrated (see the helpful criteria in Attachments A and D).
• Carefully evaluate each of the Outputs, in succession, and be sure that the proper adjustment is used (either Option 1 or Option 2).

After following these guidelines, if one of the two optional Critical Scores is selected (because the Statistical Test exceeded 2.0), simply substitute this value in the Cutoff Score % field of the CSEM Specification Menu that displays when running either the "Test Analysis" or "Cutoff Analysis" programs. Each section of Output 6 is described below.

Correlation Between Angoff Ratings and Item Difficulty Values: This column provides the correlation between the minimum passing score estimates (Angoff ratings) provided by the SMEs and the Item Difficulty Values (also called "item p-values," or the percentage of test takers who answered each item correctly) from the test takers. Stronger correlations suggest a tighter connection between the competency levels judged by the SMEs who rated the items and the test taker pool taking the test. We typically see correlations in the .20s, with the range running from about .15 to as high as .55.

Difference Between Critical Score and Test Difficulty: This column provides the average difference between the Angoff ratings (from SME raters) and the Item Difficulty Values (from test takers). Positive values indicate that the Angoff ratings were higher than the Item Difficulty Values; the opposite is of course true for negative values. For example, if an item had an Angoff rating of 80% and 75% of the test takers answered the item correctly, a 5% difference would be displayed for that item. This column shows the average of these differences across all items on the test.

Skew of Difference Values: Skew is a statistical indicator that reflects whether the distribution of the data is symmetrical (i.e., balanced, with values falling evenly above and below the average of the distribution). If the skewness statistic is zero (0), the data are perfectly symmetrical. As a general guideline, if the skewness statistic is less than −1 or greater than +1, the distribution is highly skewed; if skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed; and if skewness is between −½ and +½, the distribution is approximately symmetric. Here the skewness statistic is applied to the difference values obtained by subtracting the Item Difficulty Values from the Angoff ratings. Positive skew values reveal a disproportionately high number of test items with positive difference values (i.e., items that were potentially over-rated by the raters); negative skew values indicate the opposite.

St. Error of Skew: This column provides the standard error of the skew.

St. Error of Skew Threshold (2X St. Error of Skew): This column provides two times the standard error of the skew (used below).

Skewness Test Result (Skew/St. Error of Skew): This column provides the result of a statistical test indicating whether the skewness is significant (Tabachnick & Fidell, 1996). Tests with unusually high differences between Angoff ratings and Item Difficulty Values (indicated when the Skewness Test exceeds 2.0) should be carefully evaluated. If the Skewness Test exceeds 2.0, consider using OPT Critical Score #1, which is computed by reducing each over-rated item's Angoff rating to its outer lower limit (1.96 x the SE Mean of the SME ratings for each over-rated item). If the Skewness Test exceeds 3.0, consider using OPT Critical Score #2, which is computed by reducing the Critical Score to the outer lower limit of the raters (1.96 Standard Errors of Difference from the Critical Score, using the average rater reliability and the SD of the raters' average ratings). OPT Critical Score #2 provides a greater correction than OPT Critical Score #1.
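The core of the bias check (the difference values, their skewness, and the skewness test) can be sketched as below. The standard-error-of-skewness formula shown is a common textbook expression and is assumed here; the Program's exact implementation of Output 6 may differ, and the data are fabricated for illustration.

```python
# Illustrative version of the Output 6 bias statistics: compare Angoff ratings
# (SME estimates) with item p-values (observed percent correct).
import numpy as np
from scipy.stats import pearsonr, skew

angoff_pct = np.array([80, 75, 70, 85, 65, 90, 60, 78], dtype=float)
p_value_pct = np.array([75, 74, 60, 70, 66, 72, 58, 65], dtype=float)

diff = angoff_pct - p_value_pct           # positive = SMEs rated the item harder to pass
r, _ = pearsonr(angoff_pct, p_value_pct)  # correlation between ratings and difficulty
g1 = skew(diff, bias=False)               # sample skewness of the differences

n = len(diff)
se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))  # common SE formula
skew_test = g1 / se_skew

print(f"Mean difference: {diff.mean():.1f} points")
print(f"Correlation (Angoff vs. p-values): {r:.2f}")
print(f"Skewness = {g1:.2f}, SE = {se_skew:.2f}, test = {skew_test:.2f}")
# Per the manual's guideline: consider OPT Critical Score #1 if the test
# exceeds 2.0, and OPT Critical Score #2 if it exceeds 3.0.
```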

Step 3: Finalize and Administer the Test and Announce the Critical Score

After the test has been assembled using items screened with the Question Analysis Sheet (and the ratings from only the final items included on the test have been used in the Critical Score calculation, based on Output 5 or Output 6 in the Rater Analysis Sheet), the cutoff for the test can be pre-announced with the caveat that the final cutoff will be determined after making adjustments for measurement error (as reflected by the CSEM). The applicant pool should be informed that, because tests are "less than perfectly reliable instruments," the Critical Score will be adjusted after taking several factors into consideration (e.g., the number of applicants that can feasibly be processed at the next selection step, the degree of adverse impact of the various cutoff score choices, etc.). See the additional guidelines under Step 7 below.

Step 4: Complete Item Analyses

Input the applicant scores from the test into the Item Data Input Sheet (preferably by copying and pasting from a separate worksheet), using 0s (zeros) to indicate incorrectly answered items and 1s to indicate correctly answered items. Then run the Item Analysis program by clicking its button on the Main Screen, and evaluate the Outputs provided by the Program using the guidelines below.

Important Note! Most HR/testing software packages will output score data in text format. The most practical way to input data into the Item Data Input Sheet in the Program is to configure the data on a separate worksheet (aligning the columns with those on the Item Data Input Sheet) and then use the copy/paste commands to transfer the data. Be sure that Men and Women are coded as 1s and 2s, and that Whites, Blacks, Hispanics, Asians, Native Americans, and "others" are coded 1, 2, 3, 4, 5, and 6, respectively.

Output 1: Point Biserials

Point biserials provide a value for each item showing how correlated the item is with the overall test score. Negative values (shown in red) indicate that the item is most likely mis-keyed or has some other significant flaw. Values between 0.0 and 0.2 (shown in yellow) indicate that the item is functioning somewhat effectively, but is not contributing to the overall reliability of the test in a meaningful way. Values of 0.2 and higher indicate that the item is functioning effectively and is contributing to the overall reliability of the test.

Note: The overall reliability of the test can be increased by removing the items with low or negative point biserials.

Output 2: Item Difficulty

This Output shows the percentage of applicants who answered each question correctly. Items that are excessively "easy" (where more than 90% of the test takers answered correctly) are shown in red, and items that are excessively difficult (where fewer than 30% of the test takers answered correctly) are shown in yellow. Typically, the items that make the highest contribution to the overall reliability of the test are in the mid-range of difficulty (e.g., 40% to 60%).
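The two item statistics described above can be computed from a 0/1 response matrix as sketched below. Whether the Program correlates each item with the total score or with the total excluding the item is not stated in the manual, so the item-total version shown here is an assumption, and the demo data are fabricated.

```python
# Illustrative item analysis on a 0/1 response matrix (rows = applicants,
# columns = items): item difficulty (p-values) and point-biserial correlations.
import numpy as np

rng = np.random.default_rng(0)
responses = (rng.random((200, 10)) < 0.6).astype(int)  # fabricated demo data
total = responses.sum(axis=1)

for i in range(responses.shape[1]):
    p = responses[:, i].mean()                      # proportion answering correctly
    r = np.corrcoef(responses[:, i], total)[0, 1]   # item-total point biserial
    flag = ""
    if p > 0.90:
        flag = "  <- very easy"
    elif p < 0.30:
        flag = "  <- very difficult"
    if r < 0:
        flag += "  <- check key"
    print(f"Item {i + 1:2d}: p = {p:.2f}, point biserial = {r:.2f}{flag}")
```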

Output 3: Differential Item Functioning (DIF)

DIF analyses detect items that are not functioning in similar ways between the indicated group and whites, or between women and men. Rather than simply comparing the average difference in item performance between groups, DIF statistically controls for overall group performance on the test, so flagged items may be functioning differently even when differences in overall group ability levels are controlled.


TVAP implements the same DIF categorization rules used by the Educational Testing Service (ETS), where test items are classified into three groups (see Zieky, 1993; and Dorans and Holland, 1993):

A - The item has negligible or nonsignificant DIF;
B - The item has statistically significant and moderate DIF; or
C - The item has statistically significant and moderate to large DIF.

Cells that display numeric values (e.g., 79, 89, etc.) do not have computable DIF statistics (e.g., one or more groups had either all correct or all incorrect values for the item of interest). Items with "B" DIF values have some level of DIF; items with "C" DIF values should be checked for possible content that could unnecessarily favor the majority group (see the guidelines provided below).

To complete the DIF analysis, TVAP breaks the test score distribution into seven (7) strata to divide the examinees into similar ability levels. The number of strata used to break up the test score distribution can be changed by selecting the "Options" button beneath the Item Analysis Program and changing the value. (Unless the test includes >1,000 applicants, we recommend using no fewer than 1 stratum per 5-10 items to maximize the statistical power of the DIF analysis; for example, consider using 7 strata for a test with 50 items that was administered to 500 examinees.) The maximum value is limited to the total number of items in the test.
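The manual does not spell out the DIF statistic TVAP computes; one common approach consistent with the ETS A/B/C rules is the Mantel-Haenszel procedure, sketched below under that assumption. Examinees are grouped into score strata, a common odds ratio is pooled across strata, and the result is converted to the ETS delta metric. The significance test that the full ETS rules also require is omitted here for brevity, and all names and data are illustrative.

```python
# Assumed illustration: Mantel-Haenszel DIF with score stratification and the
# ETS delta metric. TVAP's exact computation may differ from this sketch.
import numpy as np

def mh_dif(item, total, group, n_strata=7):
    """item: 0/1 responses; total: total test scores; group: 'ref' or 'focal'."""
    edges = np.quantile(total, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, total, side="right") - 1, 0, n_strata - 1)

    num, den = 0.0, 0.0
    for s in range(n_strata):
        m = strata == s
        ref, foc = m & (group == "ref"), m & (group == "focal")
        a = item[ref].sum();  b = (item[ref] == 0).sum()   # reference correct / incorrect
        c = item[foc].sum();  d = (item[foc] == 0).sum()   # focal correct / incorrect
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t
    alpha_mh = num / den if den > 0 else np.nan
    delta = -2.35 * np.log(alpha_mh)   # ETS MH D-DIF metric
    # Rough ETS-style categories (significance tests omitted in this sketch):
    if abs(delta) < 1.0:
        cat = "A"
    elif abs(delta) < 1.5:
        cat = "B"
    else:
        cat = "C"
    return delta, cat

# Fabricated demo data for one item.
rng = np.random.default_rng(1)
group = np.where(rng.random(400) < 0.75, "ref", "focal")
total = rng.integers(20, 50, size=400)
item = (rng.random(400) < 0.6).astype(int)
print(mh_dif(item, total, group))
```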

Before evaluating the results from a DIF analysis, a few conceptual issues should be explained. First, there is a significant difference between deleting a test item because one group is scoring lower than another and deleting an item based on DIF analyses. As defined by the Standards (1999), DIF "…occurs when different groups of examinees with similar overall ability, or similar status on an appropriate criterion, have, on average, systematically different responses to a particular item." The DIF analyses used in TVAP only flag items as candidates for removal after controlling for differences in overall group ability levels. This means, for example, that if only 20% of minority group members answered a particular item correctly and 70% of whites answered it correctly (a large, 50% score gap between the two groups), the item could still escape a "DIF designation." A DIF designation, however, could occur if the minority group and whites scored very close as overall groups (for example, 55% and 60%, respectively) but systematically performed differently on a certain item.

Second, DIF analyses provide the most accurate results on tests that measure highly related KSAPCs. For example, if a 50-item test contains a 25-item math scale and a 25-item interpersonal skills scale, and the test has low reliability because these skills are unrelated, a DIF analysis conducted on the overall test may produce unreliable and inaccurate results. In these circumstances, it would be best to separate the two test scales and conduct separate DIF analyses on each.

Interpreting DIF results should always be done with caution, particularly when decisions are being made about whether to keep items on a test based on DIF results. For example, the SIOP Principles approach the DIF topic with a certain degree of caution (Principles, 2003, pp. 33-34). Because DIF results can be limited for a variety of reasons, and because the validity and reliability of the test should be the dominant concern when making personnel decisions, we recommend making DIF a secondary consideration behind test validity and reliability. To this end, we advise removing DIF items from a test based on the results of a five-step evaluation process, considered in the order provided below.

Factor 1: The first evaluation factor that should be considered is the sample size of the DIF study. Because DIF analyses rely heavily on inferential statistics, they are very dependent on sample size. As such, items flagged as DIF based on large sample sizes (e.g., more than 500 applicants) are more reliable than those based on small sample sizes (e.g., less than 100 or so). While the testing literature provides various suggestions and guidelines for sample size requirements when using these types of analyses, a baseline number of test takers for accurate statistical analysis is more than 200 applicants in the reference group (whites or men) and at least 30 in the focal group (i.e., the minority group of interest). More attention should be given to DIF results that are based on larger sample sizes.

Factor 2: Is there a qualitative reason why the item could be flagged as DIF? Items that are flagged as DIF sometimes contain words, phrases, or comparisons that require culturally- or gender-loaded content, knowledge, or context to provide an adequate response. For example, using a football situation on a test measuring math skills could introduce content biased against females. On the flip side, if the item measures a very neutral skill (e.g., 2 + 2 = 4) and has no apparent reason to exhibit DIF, there is no qualitative evidence to combine with the quantitative indicator (i.e., the item's DIF level: A, B, or C) to form an overall conclusion that the item should be eligible for removal for possible bias reasons. When evaluating this factor, it sometimes helps to examine the item alternative that the DIF group selected over the non-DIF group.

Factor 3: The third evaluation factor pertains to the consistency and directionality of the DIF results. For example, is there consistent DIF against both minorities and females? Items with DIF values of "B" or "C" are considered "potentially DIF" and should be flagged if they are directionally consistent for both minorities and females. For example, one should proceed cautiously before pulling an item off a test that exhibited DIF against minorities (i.e., it was lowering the pass rate of minorities) yet had positive DIF for women (i.e., an opposite DIF result, showing that women were advantaged). Thus, all other things being equal, such a "flip-flop" between minorities and females can be considered a "self-canceling" concern.

Factor 4: If the conditions described above are met, the psychometric qualities of the item should be evaluated, starting with the specific validity of the item. With this step, the question is asked, "What happens to the validity of the test if this item is removed?" For example, if there is specific evidence that the particular item is statistically correlated with job performance, this must be carefully weighed in the decision process. If there is specific content validity evidence for the item, this should be carefully evaluated next. For example, the item under review may specifically measure key KSAPCs that have been identified as crucial for job success. However, if there are several items measuring such a key KSAPC, and only one of them exhibits DIF while the others do not, this may be a situation where the DIF item can be removed (with other factors also being evaluated). This analysis can be made by evaluating the Question Analysis Sheet in the Test Validation Workbook: was this item "clear" on the various ratings provided by SMEs? If the item had an unusual number of red flags compared to the other items included on the test, it may be a candidate for removal.

Factor 5: After evaluating the validity of the item, the fifth and final step should be completed: reviewing item reliability. This step is last, but not least, because item reliability is typically interwoven with the item's validity. For this step, the key statistic under consideration is the point biserial correlation. This statistic is used to determine the extent to which the item can discern between "high ability" and "low ability" test takers. Items with point biserial values that are exceptionally high (e.g., above .30) have a high degree of power to distinguish the test takers who have high levels of the KSAPCs being tested from those who have low ability levels. On the other hand, items with low point biserials (e.g., below .15) should be carefully evaluated. Items with point biserial values that are not statistically significantly correlated with the total test score (typically those with values less than .15, varying with sample size) should be more closely scrutinized (if the sample sizes are sufficient). Another factor that should be considered in this step of the process is the item difficulty value. Items that have exceptionally high pass rates (e.g., >90%, indicating they are easy for the majority of test takers) can be more readily justified for retention on the test than items that the majority of test takers found more difficult.

In summary, all of these decision criteria should be evaluated carefully, and the reasons for pulling an item off a test should clearly outweigh the justification for leaving the item on the test. More detailed decision rules are discussed in various texts (e.g., Biddle, 2005; Principles, 2003, pp. 33-34).

Step 5: Remove Test Items Based on Item Analyses

If items need to be removed from the test based on the results of the Item Analyses, three steps need to be completed so that the Program can adjust the scoring and analysis procedures appropriately:

1. Remove the item's Angoff ratings from the Rater Input Sheet (remove the data in the entire row, and re-square the data by removing empty rows).

2. Delete the item data (the 0s and 1s) contained in the Item Data Input Sheet. Do not delete the entire column; only delete the 0s and 1s corresponding to the item data, beginning in row 3 and ending with the last applicant (then re-square the data by removing empty columns).

3. After removing the item data, re-compute the Item Analysis by clicking its button on the Main Sheet.

These steps are necessary for the various other calculations in the Workbook to operate correctly (including the cutoff calculations).

Step 6: Complete Overall Test Analyses

The Test Analysis Sheet in the Test Analysis Workbook includes several Outputs that can be used to evaluate the overall quality of the test. These are described below.


Descriptive Statistics

This Output provides the Mean, Standard Deviation, and Minimum/Maximum scores for the test. While the Mean can be a useful statistic for evaluating the overall test results, it should be given less consideration when evaluating mastery-based or certification tests (because certain score levels are needed for passing the test, irrespective of fluctuations in score averages across applicant groups). The Standard Deviation is a statistical unit showing the average dispersion of the overall test scores. Typically, about 68% of applicant scores will fall within one standard deviation above and below the test mean, 95% within two, and 99% within three.

Reliability

Three reliability estimates are provided in this Output: Cronbach's Alpha, Guttman's Split-Half, and KR-21. Cronbach's Alpha is a widely accepted method for determining the internal consistency of a written test. The reliability computed using this method is shown, along with the interpretive guidelines of "Excellent," "Good," "Adequate," and "Limited," which are taken from the U.S. Department of Labor's guidelines (DOL, 2000). The Guttman split-half reliability coefficient is an adaptation of the Spearman-Brown coefficient, but one that does not require equal variances between the two split forms. The KR-21 formula is another method for evaluating the overall consistency of the test. It is typically more conservative than Cronbach's Alpha and is calculated using only each applicant's total score (whereas the Cronbach's Alpha method takes item-level data into consideration).
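For readers who want to see the arithmetic, the sketch below computes Cronbach's Alpha and KR-21 from a 0/1 response matrix using their standard textbook formulas. It is offered only as an illustration of the statistics named above, not as the Program's own code, and details such as the variance denominator (n vs. n-1) may differ from TVAP's implementation. The demo data are fabricated.

```python
# Standard-formula illustration of two of the reliability estimates above.
import numpy as np

def cronbach_alpha(responses):
    """responses: applicants x items matrix of 0/1 scores."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def kr21(responses):
    """KR-21 uses only the total-score mean and variance."""
    k = responses.shape[1]
    total = responses.sum(axis=1)
    m, v = total.mean(), total.var(ddof=1)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * v))

# Fabricated demo data with a single underlying trait.
rng = np.random.default_rng(2)
ability = rng.normal(0, 1, 300)
difficulty = rng.normal(0, 1, 40)
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
demo = (rng.random((300, 40)) < prob).astype(int)

print(f"Cronbach's Alpha: {cronbach_alpha(demo):.2f}")
print(f"KR-21:            {kr21(demo):.2f}")
```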

Standard Error of Measurement (SEM) and Conditional Standard Error of Measurement (CSEM)

This Output provides the classical SEM of the test (using the formula: standard deviation multiplied by the square root of 1 minus the reliability). The formula on this Sheet uses Cronbach's Alpha for the calculation. The SEM provides a confidence interval for an applicant's "true score" around his or her "obtained score." An applicant's true score represents his or her true, actual ability level on the overall test, whereas an applicant's obtained score represents where he or she "just happened to score on that given test day." For example, if the test's SEM is 3.0 and an applicant obtained a raw score of 60, his or her true score is between 57 and 63 (with 68% likelihood), between 54 and 66 (with 95% likelihood), and between 51 and 69 (with 99% likelihood).
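A minimal sketch of the classical SEM formula and the true-score intervals in the example above follows; the SD and reliability values are illustrative, and the 1, 2, and 3 SEM bands correspond to the approximate 68%, 95%, and 99% levels used in the manual's example.

```python
# Classical SEM = SD * sqrt(1 - reliability), and true-score bands around an
# obtained score (using the 1/2/3 SEM convention from the example above).
import math

sd = 6.7            # standard deviation of test scores (illustrative)
reliability = 0.80  # e.g., Cronbach's Alpha
sem = sd * math.sqrt(1 - reliability)   # about 3.0 here

obtained = 60
for k, likelihood in [(1, "68%"), (2, "95%"), (3, "99%")]:
    low, high = obtained - k * sem, obtained + k * sem
    print(f"+/- {k} SEM ({likelihood}): {low:.1f} to {high:.1f}")
```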

The Conditional SEM (CSEM) is computed using the method described in Attachment B and is provided for each score in the distribution. The CSEM provides an estimate of the SEM for each score in the distribution, allowing the user to focus on the CSEM in the range of scores around the Critical Score (the classical SEM provides only an average that considers all scores in the distribution). The CSEM is typically smaller than the SEM, since it takes only the scores around each score point into consideration.

Note: This Program is designed for a test measuring a single trait, or a set of highly related traits. If the user inputs applicant test score data for a test measuring divergent traits or KSAPCs, it is likely that the reliability (and other related statistics) will be low. Call for assistance when using the Program for multi-trait or multi-scaled tests.


Step 7: Calculate Three Cutoff Score Options and Set the Final Cutoff Score

TVAP uses the modified Angoff process (see Biddle, 2006 for a complete description of this process) to establish job-related cutoff scores. This process is useful for determining the critical point in the score distribution that delineates "qualified" from "unqualified" based on SME ratings, the measurement properties of the test, and the consistency and accuracy of the SMEs. This process can effectively address Section 5H of the Uniform Guidelines, which requires pass/fail cutoffs to be ". . . set so as to be reasonable and consistent with the normal expectations of acceptable proficiency in the workforce."

The modified Angoff process involves two steps. The first step is to have a panel of SMEs determine the Critical Score, which is based on the average of their Angoff ratings for all of the items included on the test. The second step is to reduce the Critical Score by one, two, or three CSEMs (to account for the measurement error of the test), which provides three cutoff score options for the test. The following steps can be completed to generate the three cutoff score options.

After loading and analyzing the test item data, the three cutoff options can be calculated by clicking either the "Test Analysis" or "Cutoff Analysis" button. Doing so reveals the CSEM Specification program menu.

This Program computes the CSEMs (using the Mollenkopf-Feldt procedure described in Attachment B) that are used for adjusting the Critical Score set by SMEs. To run this Program, complete the fields using the guidelines below:

1. Enter Max Score: This field will automatically populate using the number of test

items detected in the Item Data Input Sheet.

2. Polynomial Power: This option allows the user to choose either a second-order

equation (using the quadratic equation) or a third-order equation (using the cubic


equation) for computing the smoothed CSEM. While a second-order equation

typically gathers the necessary information to provide an adequately smoothed

CSEM, we typically recommend leaving the default option (3) for this field.

3. Regression Constant: Because CSEMs need to be set to an interpretable floor of 0,

leave this option to the default (0).

4. CSEM Confid. Interval (Z): This field allows a user to toggle between 1, 2, and 3

CSEMs for the top-down banding output of the Program (it does not affect the adjustment of the Critical Score, which uses only 1, 2, or 3 CSEMs that are not multiplied by confidence intervals). See further discussion below, and the sketch following this list.

5. Cutscore (%): The Program automatically populates this field with the Critical Score

(unmodified Angoff percentage score from SMEs) based upon the ―Final‖ percentage

score in Output 5 in the Rater Analysis sheet. If the user desires to select an

alternative Critical Score (such as the Optional Critical Score shown in Output 6 in

the Rater Analysis sheet, or some other cutoff), manually input the value here (as a

percentage).
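As a rough illustration of how the three cutoff options relate to the Critical Score and the CSEM (a minimal sketch that ignores the ETS centering discussed in Attachment B; the A/B/C labeling and the example values are assumptions for illustration only):

    def cutoff_options(critical_score, csem):
        # Cutoff options: the Critical Score reduced by 1, 2, or 3 CSEMs
        return {label: critical_score - k * csem
                for k, label in enumerate(("A", "B", "C"), start=1)}

    # Hypothetical values: Critical Score of 70 raw points, CSEM of 2.7 near that score
    print(cutoff_options(70.0, 2.7))   # -> roughly {'A': 67.3, 'B': 64.6, 'C': 61.9}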

After running this program function, it is the user‘s prerogative to select one of the three

cutoff options that are displayed on the Cutoff Analysis Worksheet (see Pass/Fail Status

column) (the cutoff options are also displayed on the Cutoff Options and the Score List

Worksheets). Factors such as the number of applicants needed to move to the next step in the

selection process, the levels of adverse impact, the size of the CSEM, and others should be

considered for making the decision to use one, two, or three CSEMs for the final cutoff. For

example, in U.S. v. South Carolina (1978) five statistical and human factors were considered

when deciding whether to use one, two, or three SEMs when setting the cutoff scores. These

five factors are:

1. Size of the standard error of measurement (SEM). We advise using the Conditional

SEM—or CSEM—in most instances over the SEM because it considers the error

variance only for test takers around the Critical Score, which is the area of decision-

making interest3 (see Attachment B for a discussion on the Conditional Standard

Error of Measurement);

2. Possibility of sampling error in the study;

3. Consistency of the results (internal comparisons of the panel results) (this can be

evaluated using the Rater Analysis Sheet);

4. Supply and demand for teachers in each specialty field; and

5. Racial composition of the teacher force.

While these factors were based upon the specific needs and circumstances relevant to U.S. v. South Carolina (1978), they provide some useful considerations for employers when setting cutoff scores.

In addition to the three cutoff options provided on the Cutoff Options Sheet, the

―Decision Consistency Reliability‖ and ―Kappa‖ are also shown for each corresponding

cutoff. These are described briefly below.

Important Note! The standard deviation and reliability of the test change

(sometimes only slightly) every time the test is administered. Because these two

3 See Standards 2.1, 2.2, 2.14, and 2.15 of the Standards for Educational and Psychological Testing (1999).


statistics are involved in the calculation of the SEM and CSEM, they will also

change each time the test is given. Research has shown, however, that the SEM is

likely to be relatively stable across applicant populations which differ in

variability (i.e., high or low test means or small or large standard deviations)

because the resulting changes in the reliability coefficient and standard deviation

partially offset each other (see Nunnally & Bernstein, 1994, p. 262). Therefore, if

cutoff scores need to be pre-announced prior to administering the test to

applicants, the SEM from a previous administration can be used to set a pre-

established cutoff (using one, two, or three SEMs).

Decision Consistency Reliability (DCR)

DCR is the appropriate type of reliability to consider when interpreting reliability and

cutoff score effectiveness for mastery-based tests. Mastery-based tests are tests used to

classify examinees as ―masters‖ or ―non-masters‖ or ―having enough competency‖ or ―not

having enough competency‖ with respect to the KSAPC set being measured by the test. See

Chapter 2 (pages 29-30) and Standard 14.15 of the Standards (1999).

DCR attempts to answer the following question regarding a mastery-level cutoff on a

test: If the test was hypothetically administered to the same group of examinees a second

time, how consistently would the test pass the examinees (i.e., classify them as "masters") who passed the first administration again on a second administration? Similarly, DCR answers: "How consistently would examinees who were classified by the test as 'non-masters' (failing) fail the test the second time?" This type of reliability is different than internal consistency reliability (e.g., Cronbach's Alpha and KR-21), which considers the consistency of the test internally, without respect to the consistency with which the test's cutoff classifies examinees as masters and non-masters.

Important Note! Because DCR is a different type of reliability estimation (which estimates the overall test's ability to classify test takers), it cannot be used in the classical SEM formula (SEM = σx * sqrt(1 - rxx), where σx is the standard deviation and rxx is the reliability of the test).

The Program outputs both estimated (using the procedure outlined in Subkoviak, 1988) and

calculated (using the procedure recommended in Peng & Subkoviak, 1980) values for this

reliability, along with interpretive guidelines derived from Subkoviak (1988). The calculated

values are typically more accurate than the estimated values.

Kappa Coefficients

"Kappa" coefficients are also provided in this Sheet, along with interpretive guidelines. A Kappa coefficient explains how consistently the test classifies "masters" and "non-masters" beyond what could be expected by chance. This is essentially a measure of utility for the test.

Kappa coefficients exceeding .31 indicate adequate levels of effectiveness and levels of .42

and higher are good.
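The Program derives DCR and Kappa from a single administration using the Subkoviak (1988) and Peng & Subkoviak (1980) procedures. The minimal sketch below only illustrates the underlying definitions (raw agreement and chance-corrected agreement), assuming a hypothetical second administration is available; it is not the Program's estimation method:

    import numpy as np

    def decision_consistency(pass_1, pass_2):
        # Raw agreement (p0) and kappa for pass/fail classifications obtained from
        # two administrations of the same test to the same examinees
        pass_1, pass_2 = np.asarray(pass_1, bool), np.asarray(pass_2, bool)
        p0 = np.mean(pass_1 == pass_2)                           # observed agreement
        chance_pass = pass_1.mean() * pass_2.mean()              # both pass by chance
        chance_fail = (1 - pass_1.mean()) * (1 - pass_2.mean())  # both fail by chance
        pe = chance_pass + chance_fail
        kappa = (p0 - pe) / (1 - pe)
        return p0, kappa

    # Hypothetical pass (1) / fail (0) outcomes for eight examinees
    print(decision_consistency([1, 1, 1, 0, 0, 1, 0, 1],
                               [1, 1, 0, 0, 0, 1, 0, 1]))   # p0 = 0.875, kappa = 0.75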

Top-Down Banding

Top-down bands are displayed in a column in the Cutoff Analysis Sheet in the section

labeled, ―Conditional Standard Error of Measurement.‖


Using top-down bands for hiring applicants in groups is useful in situations where the

employer cannot feasibly process all applicants who pass the chosen cutoff score.4 TVAP sets

top-down bands by establishing score bands using the process described in Attachment B. While

the entire score list is banded by the Program, the banding process should stop where the minimum cutoff has been set (even if the cutoff score falls in the middle of a band).

While using this option may be effective for obtaining a smaller group of applicants who

pass the cutoff score and are substantially equally qualified, a second option is strict rank

ordering (i.e., hiring the applicants top-down in strict score order). Strict rank ordering is not

typically advised on written tests for three reasons: (1) the reliability (and related C/SEM) of any

written test demonstrates that the precise score order of applicants is not highly stable (i.e.,

applicants would likely change order if the same test was re-administered), (2) the difference in actual qualification levels between closely-scoring applicants is typically insignificant (compared to applicants who are grouped using a broader score band), and (3) strict ranking will almost always exhibit higher

levels of adverse impact than other techniques, and is typically very difficult to defend in

litigation settings.

At a minimum, be careful to ensure that at least the following three criteria are met if

ranking will be used:

1. The test should measure KSAPCs that are performance differentiating. Performance

differentiating KSAPCs distinguish between acceptable and above-acceptable

performance on the job. A strict rank ordering process should not be used on a test that

measures KSAPCs that are only needed at minimum, baseline levels on the job and do

not distinguish between acceptable and above-acceptable job performance. See Questions

& Answers #62 and Section 14C(9) of the Uniform Guidelines;

2. The reliability of the test should be sufficiently high (tests with low levels of reliability

do not adequately distinguish between test scores at a level that justifies making selection

decisions based on very small score differences). We recommend using a reliability

coefficient of .85 as a minimum threshold; and

3. The test results should show an adequate dispersion of scores (i.e., score variance as

represented by the standard deviation or standard error of measurement) within the range

of interest (i.e., the range where the selection decisions are being made). One way to

evaluate the dispersion of scores is to use the CSEM values provided by this Program.

Using the CSEM, the employer can evaluate whether the score dispersion is adequate

within the relevant range of scores when compared to other parts of the score distribution.

For example, if the CSEM is very small (e.g., 2.0) in the range of scores where the strict

rank ordering will occur (e.g., 95-100), but is very broad throughout the other parts of the

score distribution (e.g., double or triple the size), the score dispersion in the relevant

range of interest (e.g., 95-100) may not be sufficiently high to satisfy this criterion.

With these options provided, it is recommended that employers use the pass/fail cutoff options

provided by this Program as much as possible, and use other valid devices (such as structured

interviews) for banding or strict rank ordering applicants.

4 While the Guidelines are clear that this is a justifiable process, the degree of adverse impact should also be

considered (see Section 5H of the Guidelines).


Step 8: Cutoff Results by Gender/Ethnicity & Adverse Impact Analyses

This Output provides the results of the three cutoff options (A, B, and C) for each group.

In addition to the number passing and passing rate percentage for each group, adverse impact

analyses are also provided, including the results of the 80% Test and Statistical Significance

Tests. These are described below.

80% Test

The 80% Test is an analysis that compares the passing rate of one group to the passing

rate of another group (e.g., Men vs. Women). An 80% test ―violation‖ would occur if one

group‘s passing rate is less than 80% of the group with the highest rate. For example, if the male

pass rate on a test was 90% and the female pass rate was 70% (77.8% of the male pass rate), an

80% Test violation would occur. The 80% Test is described by the Uniform Guidelines as:

―. . . a ‗rule of thumb‘ as a practical means for determining adverse impact for use in

enforcement proceedings . . . It is not a legal definition of discrimination, rather it is a

practical device to keep the attention of enforcement agencies on serious discrepancies in

hire or promotion rates or other employment decisions.‖ (Uniform Guidelines Overview,

Section ii).

―. . . a selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5)

(or eighty percent) of the rate for the group with the highest rate will generally be

regarded by the Federal enforcement agencies as evidence of adverse impact, while a

greater than four-fifths rate will generally not be regarded by Federal enforcement

agencies as evidence of adverse impact. Smaller differences in selection rate may

nevertheless constitute adverse impact, where they are significant in both statistical and

practical terms . . .‖ (Uniform Guidelines, Section 4D).
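A minimal sketch of the 80% Test computation (passing rates expressed as proportions; the function and group names are illustrative only):

    def eighty_percent_test(passing_rates):
        # Compare each group's passing rate to the group with the highest rate
        highest = max(passing_rates.values())
        return {group: {"impact_ratio": rate / highest,
                        "violation": rate / highest < 0.80}
                for group, rate in passing_rates.items()}

    # Reproduces the example above: male pass rate 90%, female pass rate 70%
    print(eighty_percent_test({"Men": 0.90, "Women": 0.70}))
    # Women's impact ratio is about 0.778 (77.8% of the male rate) -> an 80% Test violation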

The 80% Test has been scrutinized in Title VII litigation because it is greatly impacted by small

numbers and does not consider the ―statistical significance‖ of the passing rate disparity between

the two groups (see, for example, Bouman v. Block, 940 F.2d 1211, C.A.9 [Cal., 1991]; and

Clady v. County of Los Angeles, 770 F.2d 1421, 1428 [9th Cir., 1985]). More typically, courts

consider the statistical significance of the passing rate disparity between groups:

―Rather than using the 80 percent rule as a touchstone, we look more generally to

whether the statistical disparity is ‗substantial‘ or ‗significant‘ in a given case.‖ (Bouman

v. Block (citing Contreras, 656 F.2d at 1274-75).

―There is no consensus on a threshold mathematical showing of variance to constitute

substantial disproportionate impact. Some courts have looked to Castaneda v. Partida,

430 U.S. 482, 496-97 n. 17, 97 S.Ct. 1272, 1281-82 n. 17, 51 L.Ed.2d 498 (1977),

which found adverse impact where the selection rate for the protected group was ‗greater

than two or three standard deviations‘ from the selection rate of their counterparts. See,

e.g., Rivera, 665 F.2d at 536-37 n. 7; Guardians Association of the New York City Police

Dept. v. Civil Service Commission, 630 F.2d 79, 88 (2d Cir. 1980), cert. denied, 452

U.S. 940, 101 S.Ct. 3083, 69 L.Ed.2d 954 (1981)‖ (emphasis added). Clady v. Los

Angeles County, 770 F.2d 1421, C.A.9 (Cal., 1985).

Note above the adoption of the "greater than two or three standard deviations" rule. While the courts have ruled that statistical significance occurs when the difference in passing rates is "two or three standard deviations," the actual statistical threshold is 1.96 standard deviations (which equates to a 5% significance level).


Statistical Significance Tests

The statistical significance test conducted by the program is a two-tail Fisher Exact Test

probability statistic for 2 X 2 contingency tables.5 Any values that are less than .05 are

statistically significant and indicate a difference in passing rates between the two groups that is unlikely to have occurred by chance.
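For readers who want to reproduce this kind of analysis outside the Program, SciPy provides a standard two-tailed Fisher Exact Test; note that the sketch below does not apply the Lancaster (1961) adjustment used by the Program (see the footnote), so the two p-values will not match exactly:

    from scipy.stats import fisher_exact

    # Hypothetical 2 x 2 contingency table of pass/fail counts by group:
    #                 pass  fail
    table = [[81,  9],      # Men   (90 tested, 90% passing)
             [56, 24]]      # Women (80 tested, 70% passing)

    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(p_value)   # values below .05 indicate a statistically significant
                     # difference in passing rates between the two groups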

Charts

The Program provides two charts to evaluate each group‘s passing rate:

The Passing Percentage of Each Group By Cutoff Score chart displays the total

percentage of each group that passes the test at each of the three cutoff choices (A, B, or C). So,

if there was a total of 100 Hispanics who took the test and 10, 15, and 20 Hispanics passed

cutoffs A, B, and C respectively, the chart would show 10%, 15%, and 20%. So, this chart shows

the passing rate of each group compared to the total test takers in each group.

The chart Passing Rates Compared to Men or Whites displays the percentage of each

group passing when compared to the men (for the women bar) and when compared to whites (for

each of the ethnic groups). For example, if 60% of all of the men passed Cutoff A and 40% of

the women passed, 66.6% would show for Cutoff A in the chart (40% / 60% = 66.6%).

5 The widely-endorsed Lancaster (1961) correction has been included as a sensible compromise that mitigates the

effects of conservatism of exact methods while continuing to use the exact probabilities from the small-sample

distribution being analyzed.


Glossary

Adverse Impact—A substantially different rate of selection in hiring, promotion, or other

employment decision that works to the disadvantage of members of a race, sex, or ethnic group.

Angoff Ratings—Ratings that are provided by SMEs on the percentage of minimally qualified

applicants they expect to answer the test item correctly. These ratings are averaged into a score

called the ―unmodified Angoff score‖ (also referred to as a ―Critical Score‖).

Critical Score—The score level of the test that was set by averaging the Angoff ratings that are

provided by SMEs on the percentage of minimally qualified applicants they expect to answer the

test items correctly.

Cutoff Score—The final pass/fail score set for the test (set by reducing the Critical Score by 1,

2, or 3 CSEMs).

CSEM—Conditional Standard Error of Measurement. The SEM at a particular score level in the

score distribution (see SEM definition below).

DCR—Decision Consistency Reliability. A type of test reliability that estimates how

consistently the test classifies ―masters‖ and ―non-masters‖ or those who pass the test versus fail.

DIF—Differential Item Functioning. A statistical analysis that identifies test items where a focal

group (usually a minority group or women) scores lower than the majority group (usually whites

or men), after matching the two groups on overall test score. DIF items are therefore potentially

biased or unfair.

ETS—Estimated True Score. A person's true score is defined as the expected number-correct score over an infinite number of independent administrations of the test; the ETS is estimated from the obtained score, the test mean, and the test reliability (see Attachment B).

Item Difficulty Values—The percentage of all test takers who answered the item correctly.

Job Analysis—A document created by surveying SMEs that includes job duties (with relevant

ratings such as frequency, importance, and performance differentiating), KSAPCs (with ratings

such as frequency, importance, performance differentiating, and duty linkages), and other

relevant information about the job (such as supervisory characteristics, licensing and certification

requirements, etc.).

Job Duties—Statements of "tasks" or "work behaviors" that describe discrete aspects of work

performance. Job duties typically start with an action word (e.g., drive, collate, complete,

analyze, etc.) and include relevant ―work products‖ or outcomes.

KSAPCs—Knowledges, skills, abilities, and personal characteristics. Job knowledges refer to

bodies of information applied directly to the performance of a work function; skills refer to an

observable competence to perform a learned psychomotor act (e.g., keyboarding is a skill

because it can be observed and requires a learned process to perform); abilities refer to a present

competence to perform an observable behavior or a behavior which results in an observable

product (see the Uniform Guidelines, Definitions). Personal characteristics typically refer to

traits or characteristics that may be more abstract in nature, but include ―operational definitions‖

that specifically tie them into observable aspects of the job. For example, dependability is a

personal characteristic (not a knowledge, skill, or ability), but can be included in a job analysis if


it is defined in terms of observable aspects of job behavior. For example: ―Dependability

sufficient to show up for work on time, complete tasks in a timely manner, notify supervisory

staff if delays are expected, and regularly complete critical work functions."

Outlier—A statistical term used to define a rating, score, or some other measure that is outside

the normal range of other similar ratings or scores. Several techniques are available for

identifying outliers.

Point Biserial—A statistical correlation between a test item (in the form of a 0 for incorrect and

1 for correct) and the overall test score (in raw points). Items with negative point biserials are

inversely related to higher test scores, which indicates that they are negatively impacting test

reliability; items with positive point biserials contribute to test reliability to varying degrees.

Reliability—The consistency of the test as a whole. Tests that have high reliability are consistent

internally because the items are measuring a similar trait in a way that holds together between

items. Tests that have low reliability include items that are pulling away statistically from other

items either because they are poor items for the trait of interest, or they are good items that are

measuring a different trait.

SEM—Standard Error of Measurement. A statistic that represents the likely range of a test

taker‘s ―true score‖ (or speculated ―real ability level‖) from any given score. For example, if the

test‘s SEM is 3 and an applicant obtained a raw score of 60, his or her true score (with 68%

likelihood) is between 57 and 63, between 54 and 66 (with 95% likelihood), and between 51 and

69 (with 99% likelihood). Because test takers have ―good days‖ and ―bad days‖ when taking

tests, this statistic is useful for adjusting the test cutoff to account for such differences that may

be unrelated to a test taker‘s actual ability level.

SME—Subject-matter expert. A job incumbent who has been selected for providing input on the

job analysis or test validation process. SMEs should have at least one year of on-the-job experience

and not be on probationary or ―light/modified duty‖ status. Supervisors and trainers can also

serve as SMEs, provided that they know how to perform the target job.


References

American Educational Research Association, the American Psychological Association,

and the National Council on Measurement in Education (1999). Standards for educational and

psychological testing. Washington DC: American Educational Research Association.

Angoff, W.H. (1971). Scales, norms, and equivalent scores. In Thorndike, R.L.,

Educational Measurement, pp. 508-600. Washington, DC: American Council on Education.

Biddle, D. (2006). Adverse impact and test validation: A practitioner's guide to valid and defensible employment testing (2nd ed.). Burlington, VT: Gower.

Contreras v. City of Los Angeles, 656 F.2d 1267 (9th Cir. 1981).

Dorans, N.J. & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Lawrence Erlbaum Associates.

Lancaster, H.O. (1961). Significance tests in discrete distributions. Journal of the American

Statistical Association, 56, 223-234.

Nunnally, J.C. & Bernstein, I.H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Peng, C.J. & Subkoviak, M. (1980). A note on Huynh‘s normal approximation procedure

for estimating criterion-referenced reliability. Journal of Educational Measurement, 17 (4), 359-

368.

SIOP (Society for Industrial and Organizational Psychology, Inc.) (1987, 2003),

Principles for the Validation and Use of Personnel Selection Procedures (3rd and 4th eds).

College Park, MD: SIOP.

Subkoviak, M. (1988). A practitioner‘s guide to computation and interpretation of

reliability indices for mastery tests. Journal of Educational Measurement, 25 (1), 47-55.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New

York: Harper Collins.


Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.). Educational measurement

(pp. 560-620). Washington DC: American Council on Education.

U.S. Department of Labor, Employment and Training Administration (2000). Testing and assessment: An employer's guide to good practices.

U.S. v. South Carolina, 434 US 1026 (1978).

Zieky, M. (1993). Practical questions in the use of DIF statistics in item development. In

P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–364). Hillsdale, NJ:

Lawrence Erlbaum.


Attachment A - Instructions for

Completing the Test Item Survey

Introduction

This document provides instructions on how to use the Test Item Survey. To access and

print the Test Item Survey, go to the Test Item Survey Sheet in the Test Validation Workbook.

Based on the specific needs of the employer, survey questions 1-5 on the Test Item Survey may

be answered by qualified SMEs or by qualified test development personnel, based upon their

level of familiarity with the relevant position for which the test is being developed.

Administering the Test Item Survey

The following steps are recommended for administering the Test Item Survey:

1. Convene between 4 and 12 qualified SMEs with a gender and race/ethnic balance. The

SMEs selected for the panel should consist mostly of incumbents for the target position,

along with a limited representation of supervisors (perhaps 1 for every 3-4 SMEs). The

SMEs selected for the panel should have at least one year of on-the-job experience and

not be on probationary or ―light/modified duty‖ status. We also recommend having SMEs

who are ―not too far removed‖ from the actual requirements of the target position

involved in the process. While there is no hard-and-fast guideline for the number of years

experience a SME should have for serving on a panel (other than the one year

recommendation), we find that SMEs with 1-5 years experience in the target position are

best.

2. Reproduce the Job Analysis and make a copy for each SME. Be sure that the Job

Analysis itemizes the various job duties and knowledge, skills, abilities, and personal

characteristics (KSAPCs) that are important or critical to the job.

3. Make a copy of the test and key for each SME and stamp all tests and keys with a

numbered control stamp (so that each SME is assigned a numbered test and key).

4. Explain the confidential nature of the workshop, the overall goals and outcomes, and ask

the SMEs to sign confidentiality agreements.

5. Explain the mechanics of a test item with the SME panel, including the item ―stem‖ (the

part of the item that asks the question), alternates (all choices including the key),

distractors (incorrect alternatives), and answer key. Also review any source linkage

documentation for the items (i.e., where the correct answers are located in a book or

manual).

6. Review each question on the Test Item Survey with the SME panel. After going through

(and reading) each question, go through one of the test items as a group, answering the

questions for the test item collectively and having discussion about each question as

necessary. This step should also include reviewing the Job Analysis and the job duty and


KSAPC6 numbering system. Be sure to insert only numbers when listing duties and

KSAPCs. Use the Instructions for Providing Ratings sections in this Manual to answer

any questions.

7. Provide the Test Item Comment Form to the SMEs and ask them to use this form to

record any ―No‖ ratings so their responses can be evaluated.

8. Facilitate a discussion with the SME panel to clarify and define the concept of a

―minimally qualified applicant‖ (for the relevant question on the Test Item Survey). The

definition should be limited to one who possesses the necessary, baseline levels of the

KSAPC measured by the test item to successfully perform the first day (before

training) on the job. It is sometimes useful to ask the SMEs to imagine 100 minimally

qualified applicants in the room (in the various states that an applicant can be) and ask,

―How many of the 100 do you believe will answer the item correctly?‖ Be sure to warn

the SMEs against providing ratings below a chance score (50% for true/false items; 25%

for a multiple choice item with four alternatives; 20% for an item with five). In addition,

SMEs should not assign ratings of 100%, because this assumes that every minimally

qualified applicant was having a ―perfect day‖ on the testing day, and allows no room for

error.

9. Allow the SMEs to continue rating the first five test items and then stop.

10. Select one of the first five test items as a ―group discussion‖ item. Ask each of the SMEs

to share their ratings for the item by going through the survey questions one at a time.

Allow the SMEs to debate over ratings—especially the Angoff rating. This will help

centralize the panel and rein in any extreme outliers before they rate the remaining

items. It is acceptable to stimulate the SMEs by discussing and contrasting their ratings; however, the facilitator should not require or coerce any SME to make changes. The

facilitator and the SMEs can prod, argue, discuss, and challenge any individual SME

rating during the items that are reviewed during group discussion. However, each SME

should be encouraged to ―cast their own vote‖ after discussion.

11. Allow the SMEs to continue completing the Survey for the remaining test items.

12. After all SMEs have rated all test items, go through the test one item at a time and solicit

feedback. If items are mis-keyed or can be improved by making minor changes, allow for

group discussion and make changes to the item. For any changes made, be sure to ask the

SMEs to record the same changes on their hard copy version of the test, and to re-rate any

questions on the Test Item Survey based on the changes to the item.

13. Input all ratings and have the data entry independently verified until perfect.

Instructions for Providing Ratings on Checklist (Questions 1-10)

Question 1

The item ―stem‖ is the part of the test item that asks the question and solicits response from

the test taker.

Part A asks if the item stem reads well. The test item should not be unnecessarily wordy or

complex. In addition, it should not oversimplify complex concepts.

A ―Yes‖ or ―No‖ response should be provided to this question.

6 For test items measuring only work behaviors (duties), only linkages between the test item and job duties are

necessary. For items measuring knowledges or abilities, it is helpful but not necessary to link to both.


Part B asks if sufficient information is provided within the item for the test taker to provide

an accurate response. The item should be ―complete‖ and contain all the relevant information

the test taker will need to provide an accurate response.

A ―Yes‖ or ―No‖ response should be provided to this question.

Question 2

This section asks five questions about the item‘s distractors. Distractors are the response

alternates that are incorrect and misleading.

Part A asks if the distractors are similar in difficulty. Although distractors will vary to the

extent that they are ―tempting,‖ the test taker should not be able to easily eliminate the

distractors.

A ―Yes‖ or ―No‖ response should be provided to this question.

Part B asks if the distractors are distinct. Distractors should not be too similar; each one

should represent a unique, plausible (but incorrect) response.

A ―Yes‖ or ―No‖ response should be provided to this question.

Part C asks if the distractors are incorrect, yet plausible. Each distractor should be

―tempting‖ to the test takers, but should be incorrect under any likely circumstance on the

job. If a distractor could possibly be correct under circumstances on the job that are likely to

occur, it is not an appropriate distractor.

A ―Yes‖ or ―No‖ response should be provided to this question.

Part D asks if the distractors are similar in length. It is acceptable for distractors to be

different lengths, but not excessively different. For example, it would not be acceptable for

one or more distractors to be one or two sentences in length if the key or other distractors are

just a few words in length.

A ―Yes‖ or ―No‖ response should be provided to this question.

Part E asks if the distractors are correctly matched to the stem. All distractors should

smoothly flow from the question stem. Each distractor should complete the sentence or

question posed by the stem.

A ―Yes‖ or ―No‖ response should be provided to this question.

Question 3

This question asks if the key is correct in all circumstances. The key should be clearly correct

under any likely circumstance of the job. If the key could possibly be incorrect under

circumstances on the job that are likely to occur, it is not an appropriate key.

A ―Yes‖ or ―No‖ response should be provided to this question.

Question 4

This question asks if the test item is free of providing clues to other items on the test.

Consider the item stem, distractors, and the key when providing this rating. It is not

uncommon to find test item keys that provide clues to other items, or eliminate the

plausibility of distractors on other test items.

A ―Yes‖ or ―No‖ response should be provided to this question.


Question 5

This question asks if the test item is free from unnecessary complexities. The difficulty of a

test item should be related only to the difficulty or complexity of the concept it is testing. An

item‘s difficulty should not be attributed to it being overly complex, tricky, or excessively

wordy.

A ―Yes‖ or ―No‖ response should be provided to this question.

Question 6

This question asks SMEs to provide an estimate on the percentage of minimally qualified

applicants they would expect to answer the item correctly. If SMEs believe that the item is

extremely difficult and they would expect a low percentage of minimally qualified applicants

to answer it correctly, their rating should be low. If the test item is extremely easy and they

would expect most minimally qualified applicants to answer it correctly, their rating should

be high.

When considering what makes up a ―minimally-qualified applicant,‖ SMEs should consider

an applicant who has the minimum level of the knowledge, skill, or ability being tested by

the item that is needed to satisfactorily perform the essential duties of the position on the first

day of the job (before training). They should not consider ―above average‖ or ―above

satisfactory‖ performance—they should consider only applicants who have the minimum

levels needed to satisfactorily perform the essential duties of the position.

Ratings to this question should fall within the following ranges (see the sketch after the list):

51% - 99% for true/false items

26% - 99% for multiple-choice items with four alternatives

21% - 99% for multiple-choice items with five alternatives
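These ranges follow directly from the chance score for each item type (100 divided by the number of alternatives, plus one point) up to a maximum of 99%. A small illustrative check (not part of the Program):

    def angoff_rating_bounds(n_alternatives):
        # Acceptable Angoff rating range: above the chance score, below 100%
        chance = 100 // n_alternatives
        return chance + 1, 99

    print(angoff_rating_bounds(2))   # true/false item          -> (51, 99)
    print(angoff_rating_bounds(4))   # four-alternative item    -> (26, 99)
    print(angoff_rating_bounds(5))   # five-alternative item    -> (21, 99)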

Question 7

This question asks if the test item is fair to all groups of people. Are there any parts of the

question, key, or distractors that would be better understood by some groups than others? For

example, a question testing for math ability that uses football examples may be unfairly

biased toward men.

A ―Yes‖ or ―No‖ response should be provided to this question.

Question 8

This question requires SMEs to identify the job duty(s) that is represented by the test

question. Using the Job Analysis, write the number(s) of the job duty(s) that is related to the

item.

Question 9

This question requires SMEs to identify the knowledge, skill, ability, and/or personal

characteristic (KSAPC) that is measured by the test item. Using the Job Analysis, SMEs

should write the number(s) of the KSAPC that is being measured by the item.

Question 10

This question asks if the knowledge being measured by the item is necessary on the first day of

the job. Some KSAPCs are only acquired through training or experience gained while on the

job, while others are not acquired through job training or experience, but are expected to be

possessed by candidates on the first day of hire.


A ―Yes‖ or ―No‖ response should be provided to this question.

Instructions for Rating Job Knowledge Tests (Questions 11-14)

Question 11

This question asks whether the test item is based on current information. The knowledge

being measured by this test item should be linked to current, job-relevant information.

A ―Yes‖ or ―No‖ response should be provided to this question.

Question 12

This question asks how important it is that the knowledge being tested be memorized.

If it is not necessary that the knowledge area(s) measured by this item is memorized and it

can be looked up without impacting job performance, a ―0‖ should be recorded.

If it is important that the knowledge area(s) measured by this question is memorized and

having to look it up is likely to have a negative impact on the job, a ―1‖ should be recorded.

If it is essential that the knowledge area(s) measured by this question is memorized and

having to look it up is most likely to have a negative impact on the job, a ―2‖ should be

recorded.

If the SMEs believe that having to look up the knowledge area(s) measured by the item is

halfway between ―likely‖ and ―most likely‖ to negatively impact job performance, ―1.5‖

should be recorded (or any range between 1.0 and 2.0). Any value between 0 and 2.0 may

also be used.

Question 13

This question asks about the level of difficulty of the item. Test items designed to measure

job knowledge should be written at a level of difficulty that is similar to how the job

knowledge will actually be applied on the job. For example, if the job requires a basic level

of knowledge about a particular knowledge domain, test items should not measure the area at

an ―advanced‖ or ―overly complex‖ level.

A ―Yes‖ or ―No‖ response should be provided to this question.

Question 14

This question asks about the seriousness of the consequences that are likely to occur if the

applicant does not possess the knowledge required to answer the item correctly.

If little or no consequences are likely to occur if the applicant does not possess the

knowledge required to answer the item correctly, a ―0‖ should be recorded.

If moderate consequences are likely to occur if the applicant does not possess the knowledge

required to answer the item correctly, a ―1‖ should be recorded.

If severe consequences are likely to occur if the applicant does not possess the knowledge

required to answer the item correctly, a ―2‖ should be recorded.

If SMEs believe that the consequences that are likely to occur if the applicant does not

possess the knowledge required to answer the question correctly are halfway between

―moderate‖ and ―severe,‖ ―1.5‖ should be recorded (or any range between 1.0 and 2.0). Any

value between 0 and 2.0 may also be used.


Test Item Comment Form

Instructions: Use this form to explain any ―No‖ ratings given to items when using the Test Item

Survey.

Name: _______________Test: _____________________ Date: _______________________

Survey Question #          Test Item # (from Survey)          Explanation

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________

_____________ _____________ ___________________________________________


Attachment B - The Conditional

Standard Error of Measurement

(CSEM)

The Standards (1999) require the consideration of the Conditional Standard Error of

Measurement (or ―CSEM‖)7 when setting cutoff scores (as opposed to the traditional Standard

Error of Measurement, or ―SEM‖). The traditional SEM represents the standard deviation of an

individual examinee‘s true score (the score that best represents the applicant‘s actual ability

level) around his or her obtained (or actual) score. The traditional SEM considers the entire range

of test scores when calculated. Because the traditional SEM considers the entire range of scores,

its accuracy and relevance is limited when evaluating the reliability and consistency of test

scores within a certain range of the score distribution.

Most test score distributions have scores bunched in the middle and spread out through

the low and high range of the distribution. Those examinees who score in the lowest range of the

distribution lower the overall test reliability (hence affecting the size of the SEM) by adding chance variance through guessing and by not possessing levels of the measured KSAPC that are high enough to contribute to the true score variance of the test. High scorers can also lower the overall reliability (and similarly affect the size of the SEM) by possessing exceedingly high levels of the KSAPC being measured, which can also reduce the variance

included in the test score range. The chart below shows how the SEM is not constant throughout

a score distribution (regardless of which method is used to estimate the SEM).8

7 See pages 27, 29, 30 and Standard 2.2, 2.14, and 2.15 of the Standards (1999).

8 Chart derived from data provided in Lord (1984).


Because the SEM considers the average reliability of scores throughout the entire range

of scores, it is less precise when considering the scores of a particular section of the score

distribution. When tests are used for making personnel decisions, the entire score range is almost

never the central concern. Typically in personnel settings, only a certain range of scores are

considered (i.e., those scores at or near the Critical Score, or the scores that will be included in a

banding or ranking procedure).

The CSEM avoids the limitations of the SEM by considering only the range of scores of

interest (i.e., the CSEM associated with any given score). By interpreting the CSEM at the

Critical Score, the most accurate reliability estimate can be evaluated at the point in the

distribution that matters the most (at least when making pass/fail determinations).

For these reasons, the traditional SEM9 should not be used in practice unless the

researcher is confident that measurement error is consistent throughout the score range—a

circumstance that is rare indeed.10

Rather, the conditional SEM should be used because it

accurately reflects the variation of measurement precision throughout the range of scores.

More than a dozen CSEM techniques have accumulated since the first mention of

the concept over 60 years ago.11

Some techniques are original; others are ―improved‖ techniques

that include minor adjustments to the original author‘s work. Some techniques are based on

―strong true score models,‖ which are tied to strong statistical and theoretical (i.e., parametric)

assumptions. Binomial error models and Item Response Theory (IRT) models for computing

CSEMs generally fall into this category. Other CSEM models are categorized as ―weak true

score models‖ and encompass the ―difference methods‖ that were used by the original CSEM

authors. Such methods are referred to as the ―difference‖ methods because they compute the

CSEM based on the variance between two split halves of the same test. These methods are

referred to as the ―weak‖ methods because they are not fundamentally tied to theoretical or

parametric distributional assumptions.

While the techniques for computing CSEMs are widely varied, they are fortunately rather

close in their outputs. For example, Qualls-Payne (1992) showed that the average CSEM

produced by the difference method (across 13 different score levels) produced CSEM values that

were very close to five other methods (including binomial and IRT methods). Further, a study

conducted by Feldt, Steffen, & Gupta (1985) shows that, while the difference methods are less

complicated than some other methods (i.e., those based on ANOVA, IRT, or binomial methods),

it is ―integrally related‖ and generally produces very similar results.

IRT-based CSEM methods typically require a large sample size for calibrating accurate

parameter estimates (e.g., Tsutakawa & Johnson, 1990, recommend a minimum sample size of

approximately 500 for accurate parameter estimates) and the use of sophisticated software (e.g.,

BILOG). Further, they can only be used for tests that meet certain psychometric assumptions

(e.g., uni-dimensionality). The binomial error methods are generally less computationally

intensive, and require only KR-21 reliability, the mean and standard deviation of test scores, and

9 SEM = σx * sqrt(1 - rxx), where σx is the standard deviation and rxx is the reliability of the test.

10 The SEM is only constant in symmetrical, mesokurtic score distributions (see Mollenkopf, 1949).

11 Ten years prior to this ―discovery‖ made by Mollenkopf, Rulon (1939) showed that the SEM is equal to the

standard deviation of the differences between the two half-test scores (which could be calculated at any score

interval and thus resulting in the SEM at various levels). However, Rulon‘s emphasis was not placed on the

conditional nature of SEMs; the same was only a focus of the later work first conducted by Mollenkopf.


the number of items on the test. As ―strong‖ methods, both of these techniques are tied to

distributional assumptions.

The ―difference‖ method used in TVAP adopts the Mollenkopf-Feldt (M-F) method,

which uses polynomial regression to ―smooth‖ the CSEM values that are calculated using the

variance between scores on two split test halves. Specifically, the M-F CSEM method can be

executed by following the steps outlined in the section below.

After computing CSEMs, score bands can be created by centering confidence intervals

around classically computed Estimated True Scores (ETS), using the formula ETS = ((X - M) * rxx) + M (where X is the examinee's obtained score, M is the average test score of all examinees, and

rxx is the reliability of the test) and the desired Confidence Interval (e.g., 1.96 for 95% confidence

when considering 2 CSEMs). The score banding approach used in TVAP is based on the original M-F method, modified for personnel score banding (as originally defined in Biddle et al., 2007), and will be outlined in further detail in a forthcoming work by Biddle &

Feldt (2010).

As a ―weak‖ model, it is not tied to assumptions inherent to other models (e.g., IRT or the

binomial methods). It has the added value of being tolerant of various item types (e.g., binary or

polytomous items) and having sample size requirements similar to those of most multiple

regression situations. One of the additional advantages of the M-F method is that it is based on

the actual item responses and characteristics of each data set (binomial models can be computed

without item-level data because they are based on the binomial probability distribution). The

method allows data from each test administration to form varying CSEM values at each score

interval. Some of our research has shown that the M-F method is highly correlated (r = .88) to

the CSEM models based on binomial methods.

The implications of using CSEMs are wide and substantial for the HR professional.

Classical SEMs are widely used by HR professionals in two ways—both of which have

substantial impact on countless applicants that are ranked on various score lists: (1) score

banding, and (2) adjusting Critical Scores to derive Cutoff Scores.

Test score banding is commonly practiced by I-O professionals as a way of preserving

utility (over just using cutoffs) while minimizing adverse impact (over top-down ranking).

Banding procedures typically use the Standard Error of Difference (or ―SED,‖ which is

calculated by multiplying the SEM by the square root of 2), along with a confidence interval

multiplier (e.g., 95% using a 1.96 multiplier) to categorize groups of ―substantially equally

qualified‖ applicants. The conditional nature of the SEMs, however, demands the use of CSEMs

when creating bands rather than using this traditional method.

Adjusting minimum competency cutoffs (i.e., Critical Scores), as in the case of the

modified Angoff technique, is another area where CSEMs are commonly applied. In this

situation, the Critical Score for the test is lowered using 1, 2, or 3 CSEMs (also centered on the

ETS associated with the Critical Score—see MacCann, 2008) to account for measurement error

in the test (and thereby giving the ―benefit of the doubt‖ to the applicant). Most CSEM

techniques typically produce smaller values than classical SEM calculations in the upper range

of the score distribution, which translates to smaller groups of applicants being categorized

because of the added precision of the conditional measurement precision (compared to the

classical SEM, which simply averages measurement error across the entire score range).


Steps for Developing Personnel Score Bands Using the M-F Procedure Centered on ETS

The steps below provide a systematic process (and the same process used in TVAP) for

computing M-F CSEMs and subsequent score bands.

Step 1. Divide the test into two halves that are approximately equal in difficulty and

variance, and have the same number of items. Because the M-F method is a ―difference‖ CSEM

method, it is important that the test halves are balanced by having the same number of items

contributing variance to the equation. If the test has an odd number of items, one of the items can

be removed from the test. It is also important that all of the items on the test have the same point

values, as the M-F method cannot be used on a test that includes mixed binary and polytomous

(multi-point) items. Fortunately, the method can be used for tests with polytomous items; however, all items on the test need to be of the same type (i.e., either binary or polytomous).

In addition to having the same number and type of items, the two test halves should be

psychometrically parallel. While a number of sophisticated techniques are available for

assembling and evaluating the comparability between tests (referred to as tau equivalency—see

Haertel, 2006), we encourage that at least some minimal uniform procedure be developed and

followed. In many cases, simply dividing the test using an even/odd split (with the same number

of items on each half) will do the job. However, if such a split results in two test halves that have

substantially different means or standard deviations, we recommend using a more intensive

splitting method.

Because the M-F CSEM approach already takes the mean difference between test halves

into account (by subtracting it out of the equation), the SD of each test half should receive attention. With this in mind, we suggest the following steps be taken to ensure the test halves are parallel (a minimal sketch of this splitting procedure follows the list):

1. Calculate the Item Difficulty Value for each item (i.e., the average passing rate for each

item).

2. Sort the items by Item Difficulty Value.

3. Split the test items (using odd/even) into two forms with the same number of items (if the

test has an odd number of items, remove one). The two forms should have fairly

equivalent means, which is one component of establishing parallel forms.

4. Compute the SD for each test half.

5. Use a pre-specified criterion/threshold for the difference between the two forms. If the difference exceeds the threshold, swap items from hardest to easiest until the two halves are within the threshold.12
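A minimal sketch of this splitting procedure (Python, assuming 0/1 item scores; it drops one item when the count is odd and omits the SD comparison and item swapping described in step 5):

    import numpy as np

    def split_halves(responses):
        # responses: applicants x items matrix of 0/1 scores.
        # Sort items by difficulty and assign them alternately to two halves.
        responses = np.asarray(responses, dtype=float)
        n_items = responses.shape[1] - (responses.shape[1] % 2)  # drop one item if odd
        responses = responses[:, :n_items]
        difficulty = responses.mean(axis=0)             # item difficulty (passing rate)
        order = np.argsort(difficulty)                  # easiest to hardest
        half_1 = responses[:, order[0::2]].sum(axis=1)  # 1st, 3rd, 5th, ... ranked items
        half_2 = responses[:, order[1::2]].sum(axis=1)  # 2nd, 4th, 6th, ... ranked items
        return half_1, half_2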

Step 2. Calculate an "adjusted difference score" for each applicant by taking the difference between their scores on the two test halves, subtracting the difference between the half-test means, and squaring the result: Y = ((X1 - X2) - (M1 - M2))^2, where X1 and X2 are the applicant's scores on the two test halves and M1 and M2 are the means of each test half. This results in an "adjusted difference score" for each applicant that will be used as the dependent variable, Y, in the subsequent regression equation.

12 These steps are automated in TVAP. Such a complicated process is not necessary; however, some steps should be taken to ensure the forms are approximately similar.


For example, for a test with two halves that have means of 26.28 and 26.45, respectively, and a first examinee who scored 60 on the test (28 on half 1 and 32 on half 2), the formula would be Y = ((28 - 32) - (26.28 - 26.45))^2, which computes to a value of 14.67. This calculation is repeated for each examinee in the data set to create the Y values that will be used as the dependent variable in the multiple regression analysis, described next.

Step 3. Conduct a multiple regression analysis using the dependent variable Y from the step above and the total test score (X), total test score squared (X^2), and total test score cubed (X^3) as the independent variables. Set the constant in the regression to 0 because this is the theoretical floor of CSEM values.

Step 4. Use the resulting three beta coefficients to compute CSEM values for each score in the distribution using the formula: CSEM = sqrt((B1 * X) + (B2 * X^2) + (B3 * X^3)). For example, to calculate the CSEM for the raw score of 70, the predictors X = 70, X^2 = 4,900, and X^3 = 343,000 are multiplied by their corresponding beta coefficients (here, B1 = 1.5597, B2 = -0.01914, and B3 = 0.000006) and the square root of the sum is taken. This results in a CSEM of 2.667 for the score of 70.
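Steps 2 through 4 can be sketched as follows, assuming binary items and ordinary least squares through the origin; TVAP's own implementation (including the quadratic option and the small-sample handling described later in this attachment) may differ in its details:

    import numpy as np

    def mollenkopf_feldt_csem(half_1, half_2):
        # Regress squared adjusted difference scores on the total score, its square,
        # and its cube (no intercept); the smoothed CSEM is the square root of the
        # predicted value at each score.
        half_1, half_2 = np.asarray(half_1, float), np.asarray(half_2, float)
        total = half_1 + half_2
        # Step 2: squared adjusted difference score for each applicant
        y = ((half_1 - half_2) - (half_1.mean() - half_2.mean())) ** 2
        # Step 3: third-order polynomial regression through the origin
        design = np.column_stack([total, total ** 2, total ** 3])
        betas, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Step 4: smoothed CSEM for every score up to the highest observed score
        scores = np.arange(0, int(total.max()) + 1)
        predicted = np.column_stack([scores, scores ** 2, scores ** 3]) @ betas
        return scores, np.sqrt(np.clip(predicted, 0, None))  # floor negatives at 0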

By way of technical thoroughness, a clarification should be made at this point. The dependent variable, Y, must be defined as the square of the difference score, not the square root of it. Thus, Y = ((X1 - X2) - (M1 - M2))^2. In effect, the regression analysis must be carried out as a process of developing smoothed estimates of the conditional error variance. Only after the curve is fitted to the data for smoothing the variances are the square roots of the predicted Y values taken to obtain the CSEM for various score values. At first blush, this may seem identical to smoothing the unsquared deviations directly. It is not. The square root of the average of a set of values is not equal to the average of the square roots of the values. This can be illustrated as follows:

sqrt((25 + 144) / 2) = 9.19 vs. (5 + 12) / 2 = 8.50.

When the values are closer together, the difference is not so great. Consider these values:

sqrt((9 + 16) / 2) = 3.53 vs. (3 + 4) / 2 = 3.50.

The CSEM must be taken as the square root of the smoothed (predicted) Y value at each score, as defined above. The process cannot be abbreviated by defining Y as the positive square root of the squared deviation and then smoothing those values.

Step 5. Multiply each CSEM by the desired Confidence Interval (e.g., 1.96 for 95%

confidence intervals). This creates the width of the bands to be used in Step 7.

Step 6. Calculate Estimated True Scores (ETS) for each score in the distribution, using

the formula: ETS = ((X-M)* rxx + M) where X is the score, M is the average test score of all

examinees, and rxx is the reliability of the test. This will result in fractional values for each observed score, which is fine because these are estimates of true scores.

Step 7. Create score bands by centering confidence intervals around the ETS values so

that the lower boundary and upper boundary of the bands do not overlap. See Table 1 for an

example.
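Steps 5 through 7 can be sketched as follows; the band-assignment rule encodes the logic illustrated in Table 1 and discussed after it (working from the top score down, a new band starts when an interval no longer reaches the lower limit of the current band's anchor score). Names and structure are illustrative:

    def estimated_true_score(score, mean, reliability):
        # Step 6: ETS = (X - M) * reliability + M
        return (score - mean) * reliability + mean

    def csem_score_bands(scores, csems, mean, reliability, z=1.96):
        # scores: observed scores; csems: dict of score -> smoothed CSEM
        scores = sorted(scores, reverse=True)
        ets = {s: estimated_true_score(s, mean, reliability) for s in scores}
        lower = {s: ets[s] - z * csems[s] for s in scores}   # Steps 5 and 7
        upper = {s: ets[s] + z * csems[s] for s in scores}
        anchor = next(s for s in scores if csems[s] > 0)     # highest score with CSEM > 0
        band, bands = 1, {}
        for s in scores:
            if upper[s] < lower[anchor]:                     # interval no longer overlaps
                band, anchor = band + 1, s
            bands[s] = band
        return bands

Applied to the ETS, CSEM, reliability, and mean behind Table 1, this rule should reproduce the band breaks at scores 72 and 58 shown there.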


Table 1. Score Bands Using the M-F CSEM Method

Obs. Score   ETS     M-F CSEM   95% Lower Conf. Int.   95% Upper Conf. Int.   CSEM Band   Classic Band
80           77.68   0.00       77.68                  77.68                  1           1
79           76.77   0.77       75.26                  78.28                  1           1
78           75.85   1.17       73.57                  78.14                  1           1
77           74.94   1.46       72.08                  77.80                  1           1
76           74.02   1.69       70.70                  77.35                  1           1
75           73.11   1.90       69.39                  76.83                  1           1
74           72.19   2.08       68.11                  76.28                  1           1
73           71.28   2.25       66.87                  75.68                  1           1
72           70.36   2.40       65.66                  75.06                  2           1
71           69.45   2.54       64.48                  74.42                  2           1
70           68.53   2.67       63.31                  73.76                  2           1
69           67.62   2.79       62.15                  73.08                  2           2
68           66.70   2.90       61.02                  72.39                  2           2
67           65.79   3.01       59.89                  71.69                  2           2
66           64.87   3.11       58.78                  70.97                  2           2
65           63.96   3.20       57.68                  70.24                  2           2
64           63.04   3.29       56.59                  69.50                  2           2
63           62.13   3.38       55.50                  68.75                  2           2
62           61.21   3.46       54.43                  68.00                  2           2
61           60.30   3.54       53.36                  67.23                  2           2
60           59.38   3.61       52.31                  66.46                  2           2
59           58.47   3.68       51.26                  65.68                  2           2
58           57.55   3.74       50.21                  64.89                  3           3

Table 1 demonstrates that creating bands is a bi-directional process. Because CSEMs vary in size (typically being smaller toward the upper end of the distribution and largest near the mean), and because the ETS adjustment centers the CSEM bands slightly below each observed score at the higher range, the upper and lower confidence intervals of each CSEM must be evaluated simultaneously to determine where the bands divide. For example, Bands 1 and 2 divide at the score of 72 because the upper confidence interval limit at 72 (the ETS of 70.36 plus 2.40 * 1.96 = 4.70, or 75.06) falls below the lower confidence interval limit (75.26) of the highest score with a non-zero CSEM (79), whereas the upper limit at 73 (75.68) still reaches it.
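The band-construction logic just described can be sketched as a small Python routine. This is an illustrative assumption about how the bookkeeping might be done, not TVAP's internal code; it expects rows ordered from the highest observed score downward.

    def assign_csem_bands(rows):
        # rows: (score, lower_ci, upper_ci, csem) tuples, ordered from the
        # highest observed score down to the lowest.
        band = 1
        # Band 1 is anchored on the lower confidence limit of the highest
        # score with a non-zero CSEM (the maximum score has a CSEM of 0).
        ref_lower = next(lower for _, lower, _, csem in rows if csem > 0)
        bands = []
        for score, lower_ci, upper_ci, csem in rows:
            if csem > 0 and upper_ci < ref_lower:
                # This score's interval no longer reaches the current band's
                # anchor, so a new band starts here and becomes the anchor.
                band += 1
                ref_lower = lower_ci
            bands.append((score, band))
        return bands

Applied to the rows shown in Table 1, this routine assigns Band 1 to scores 80 through 73, Band 2 to 72 through 59, and starts Band 3 at 58, matching the CSEM Band column.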

Classic bands are also displayed in Table 1 for comparative purposes. Using the conventional SED formula (SEM * sqrt(2)) based on the reliability of the test (0.9151) and the SD (12.76), and applying the same 1.96 multiplier used in Step 5, a band width of 10.31 can be calculated and centered on observed scores. This process produces wider bands in general, particularly in the upper part of the distribution, than the M-F CSEM banding. In this dataset, the classic band procedure starts the second band at a score of 69, three points lower than the M-F CSEM method (which starts it at 72).
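As a check on those figures, the classical values can be reproduced from the reported reliability and SD (intermediate values rounded), assuming the same 1.96 multiplier used in Step 5:

    SEM = 12.76 * sqrt(1 - 0.9151) ≈ 3.72
    SED = SEM * sqrt(2) ≈ 5.26
    Band width = 1.96 * SED ≈ 10.31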

Given a sufficiently large sample size (generally 100 or more subjects), the process described above should produce stable CSEMs and related score bands. However, in situations where fewer than 100 subjects are involved, and particularly in situations with fewer than 50 subjects, the classical SEM should be used instead of the CSEM if the CSEM produces higher overall values. This is because the regression-based estimates tend to overestimate CSEMs when smaller samples are involved. In addition, negative CSEM values will occasionally occur when small datasets are analyzed; these values should be set to the theoretical minimum value of 0.


References

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (pp. 508-600). Washington, DC: American Council on Education.

Biddle, D. (2005). Adverse impact and test validation. Burlington, VT: Ashgate Publishing.

Biddle, D., Kuang, D.C.Y., & Higgins, J. (2007, March). Test use: ranking, banding,

cutoffs, and weighting. Paper presented at the Personnel Testing Council of Northern California,

Sacramento.

Biddle, D. & Feldt, L. (2010). A new method for personnel score banding using the

conditional standard error of measurement. Unpublished manuscript.

Feldt, L. S., Steffen, M., & Gupta, N. C. (1985, December). A comparison of five

methods for estimating the standard error of measurement at specific score levels. Applied

Psychological Measurement, 9 (4), 351-361.

Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th

ed., p. 83). Westport, CT: Praeger Publishers.

Lord, F. M. (1984). Standard errors of measurement at different ability levels. Journal of

Educational Measurement, 21 (3), 239-243.

MacCann, R. G. (2008, April). A modification to Angoff and bookmarking cut scores to

account for the imperfect reliability of test scores. Educational and Psychological Measurement,

68 (2), 197-214.

Mollenkopf, W. G. (1949). Variation of the standard error of measurement of scores. Psychometrika, 14 (3), 189-229.

Qualls-Payne, A. L. (1992). A comparison of score level estimates of the standard error of measurement. Journal of Educational Measurement, 29 (3), 213-225.

Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by

split-halves. Harvard Educational Review, 9, 99-103.


American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 560-620). Washington, DC: American Council on Education.

Tsutakawa, R. K. & Johnson, J. C. (1990). The effect of uncertainty of item parameter

estimation on ability estimates. Psychometrika, 55, 371-390.


Attachment C - Test Item Writing Guidelines

Test Item Writing Guidelines

Before writing any items, you should:

Read all of the related materials (job description, test plan, textbooks used by applicants, etc.).

Understand the job and all of the source materials.

Use only the most recent publication/edition of the source materials.

1. Make sure the applicants are told which publication/edition of the source material

will be used for the test items.

2. Do NOT use a source just because it is convenient.

3. Examine and review alternative sources.

4. Do NOT use ambiguous sources.

Design your test based upon a written test plan.

1. Develop a written test plan. The test plan should be your road map in designing

your test.

2. Develop a process-by-content matrix to determine what kinds of items to write. (This will be covered later in this guide.)

Select the number of alternatives you wish to include for each item before starting to

write the test.

1. Research has shown that there is relatively little advantage to having more than three alternatives for any one multiple-choice question. Using more than four alternatives is impractical, since it is extremely difficult to develop that many plausible incorrect alternatives on a consistent basis.

2. Use the same number of alternatives for EVERY question on the test.

Item Writing

The Basics

Use correct spelling.

1. Spell check everything.

2. Spell check the test again completely just before the final version of the test is

printed. Often, there are errors made during final edits just before a test is printed.


3. Have another person (other than yourself) check over the completed test for spelling errors. Do NOT rely solely on your computer word-processing program's "spell check" function - it has serious limitations such as not catching the word "form" being spelled as "from."

Use correct grammar.
1. Do NOT rely on your computer's grammar checking function. Have your grammar checked over by another person before releasing the test to be printed.
2. Use all punctuation correctly, including commas and hyphens.
3. Do NOT use contractions. For example, use "do not" instead of "don't."

4. Check your grammar at the later stages of test development. Do not allow any test

to be administered without checking for proper grammar and usage. Check the test

again completely just before the final version of the test is to be printed.

Avoid negatively worded items, if possible.

Constantly check for typographical errors.

Capitalize ALL limiting or directive words.

1. Capitalize the word NOT and other negative words.

2. Capitalize key limiting and directive words that are vital for choosing the correct

alternative such as REQUIRED, PROHIBITED, ALLOWED, ALL, MUST, etc.

Avoid using slang, ambiguous, and obsolete (or archaic) words.

Break up long sentences into shorter sentences.

1. Avoid run-on sentences that are difficult to follow.

2. Keep the subject and verb close together to avoid ambiguity. For example, do NOT say: "In California, a person has committed the offense of simple battery, a gross misdemeanor, if he/she has:"

Make sure all alternatives are parallel and similar in form. For example, make sure they

are all the same tense.

Use acronyms correctly. If the source material uses an acronym to describe something, the test-writer should use both the acronym and the full title. For example, the test should read: "The Central Intelligence Agency (CIA) is a part of the United States government."

Setting an item in an applied situation, when appropriate, can enhance the item.

The Stem of an Item

The part of a test item that leads up to (but does not include) the choice of answers is called the

stem of the item.

Use incomplete statements as stems whenever possible instead of fill-in-the-blanks. "The capital of Pennsylvania is" is preferred to "__________ is the capital of Pennsylvania."

1. If using blanks, try to put the blank as close as possible to the end of the last

sentence in the stem.

2. Avoid using double alternatives. For example, you should not write an item that states: "The two longest rivers in the United States are __________ and __________."


Include all qualifying information in the stem.

Avoid using "all of the above," "NONE of the above," or "all of the above… except" in the item stem.

Eliminate irrelevant words and ideas from the stem and alternatives. Be clear and

concise.

Use active voice rather than passive voice, if possible. Most computer word-processing

programs have a grammar-checking program that identifies passive voice, but not always

reliably or accurately.

When an incomplete statement is used as a stem, use "a/an" if some of the alternatives in that item begin with vowels and some begin with consonants.

State only one idea or central problem in each item.

Include as much of the item in the stem as possible.

Use objective ideas that can be referenced to the source. Do not ask for personal

opinions.

Avoid using absolute terms such as "always" and "never."

Keep items as simple as possible. If the stem must be read more than once to be

understood, you should rewrite the stem.

The Alternatives of an Item

The choices for answers given to the test-taker are called alternatives. Incorrect alternatives are

called distracters. The correct choice is called the correct alternative.

Have only ONE correct alternative for each item.

1. Eliminate overlapping alternatives. For example, do NOT use the following:

a) 5 to 10 days.

b) 5 to 20 days.

c) 15 to 25 days.

2. Use only plausible distracters. Do not try to be funny or cute. If you become stuck,

work on another item and come back to the original item later on. Many item writers

think of effective distracters for previous items when working on other items.

Place periods at the end of each alternative when the item stem ends in an incomplete

statement. Do not place periods at the end of each alternative where the stem either

contained a blank or ended with a question mark.

Numbers.

1. Write out the numbers one to ten. Use the numeric form for numbers over ten (e.g.,

11, 12, 27, 53). However, if you have an item where the alternatives are both less than

ten and greater than ten, use the numeric form for ALL alternatives in that item.

2. Use ascending order for alternatives containing numbers. For example: 1, 4, 8, 10; or 25, 30, 35, 40.

3. Be consistent when using units of measurement. Do not mix and match units within

a single item. For example, if you use minutes in one alternative, use minutes for

ALL of the alternatives in that item.


4. Use exclusionary terms such as "What is the MAXIMUM number of days" to make numeric questions unambiguous.

Do not use both categories and sub-categories as your alternatives. If you do, then test-

takers who know only the headings of sections in the reference source will be able to

answer the item on this basis alone. For example, you would not want to use the alternatives of "vertebrate," "mammal," and "human" as choices in the same item.

Be as clear and concise as possible. Eliminate any possible misinterpretations.

1. Do NOT reword one of the other alternatives to create a new alternative.

2. Do NOT use synonyms of other alternatives as an alternative.

3. If you CANNOT come up with a sufficient number of distracters, then reword the

item or do not use it.

Each alternative should be independent. Avoid using "all of the above," "none of the above," "A and C," or other such alternatives.

FOR LAW ENFORCEMENT TESTS ONLY: Avoid using lesser and included offenses

or duties as your alternatives. For example, simple assault is usually a lesser and included

offense of aggravated assault - therefore, do not use both as alternatives in the same item.

Do not make the correct alternative significantly different in form from the other

distracters.

1. Alternatives within the same item should be of similar length if possible.

2. Alternatives within the same item should be parallel (e.g., same tense, form,

structure, etc.).

3. All alternatives should be grammatically consistent with the stem.

4. Avoid using words that sound or look alike as alternatives.

General Rules

Do not allow the stem and alternatives of one item to help a test-taker answer any other

item in the test.

1. If possible, avoid using identical alternatives for different items. This may give the

test-taker a clue as to how to answer another item.

2. If you CANNOT avoid using identical alternatives for more than one item, put the

alternatives in the same order for both items.

Make certain the complete stem and all alternatives associated with that stem are on the same page when the test is printed.

Use gender-neutral and race/ethnic-neutral terms and pronouns in your items, unless this

information is vital to the item.

1. Use he/she, him/her, or his/hers whenever possible.

2. If the item has several actors, try using gender-neutral names, such as Pat or Chris.

3. Use titles to eliminate gender. For example, "officer" and "supervisor" do not denote any particular gender.

4. Be consistent. If you make Chris Johnson a female in one item, make sure Chris

Johnson is a female in all items.


Avoid abbreviations whenever possible.

1. Use only those abbreviations that are commonly used by the test-taker. If in doubt,

do not use the abbreviated version.

2. Use the time and date structure that is common to the agency or group for whom

you are writing. For example, many police and fire departments use military time

(e.g., 1800 hours instead of 6:00 p.m.). If in doubt, check with the agency for which

you are writing.

3. Do not mix words and abbreviations. For example, if you use the abbreviation "ft" in one part of the item, do not use the word "feet" in another.

Avoid negatively worded items.

Randomize the order of alternatives and location of the correct answer.

1. Be sure the final version of the test has a similar number of As, Bs, Cs, and Ds as

correct alternatives for a four-alternative test.

2. The pattern for the answers should be random. Make certain there is no discernible pattern in the location of the correct answers on the test. (One way to automate this shuffling and balance check is sketched below.)
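The sketch below shows one way such shuffling and balance checking could be automated in Python. The item representation (a list of four distinct alternative strings with the correct answer listed first before shuffling) and the function names are illustrative assumptions, not a TVAP feature.

    import random
    from collections import Counter

    LETTERS = "ABCD"

    def shuffle_item(alternatives, rng=random):
        # 'alternatives' holds four distinct strings, correct answer first.
        # Returns the shuffled alternatives and the letter that now marks
        # the correct one.
        correct = alternatives[0]
        shuffled = list(alternatives)
        rng.shuffle(shuffled)
        return shuffled, LETTERS[shuffled.index(correct)]

    def key_balance(answer_key):
        # Count how often each letter is the correct answer across the test
        # so the final key can be checked for a similar number of As, Bs,
        # Cs, and Ds and for any discernible pattern.
        return Counter(answer_key)

Running shuffle_item over every item and passing the resulting letters to key_balance gives a quick check on the answer-key distribution before the test is printed.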

Other Considerations

Item Difficulty. Do NOT make the items either too easy/obvious or too difficult/trivial.

Make the level of item difficulty for the test appropriate to the job.

Job Related. Make sure the items are job related. Do not become immersed in the

reference source and forget the job analysis.

1. Does each item match something in the job content according to the job

description?

2. Does each item tap into a relevant aspect of the job? Just because information is in

a reference source does not automatically mean it should be tested. It is up to you, the

item writer, to determine what information is relevant.

3. Make sure each item is applicable to the agency or group you are testing. For

example, it would be inappropriate to ask items about railroad crossings if there are

no railroad tracks in the municipality for which you are testing.

Complexity.

1. Match the test items to the complexity of the job. If the job requires a high school

diploma, then maintain the appropriate vocabulary and reading level.

2. Check the reading level with a computerized word-processing program. List the

desired reading level in the test plan and the test reading level in the test development

guide.

Do not go outside of the reference source for the alternatives or rationale for

answering the alternatives. For example, do not use a recent Supreme Court decision to

justify your choice of a correct alternative if that Supreme Court decision is not

mentioned in the reference source.

Direct and indirect clue errors in the structure of the item. Avoid providing clues to the correct answer in the stem.

1. Do not make distracters significantly different from the correct answer in form.


2. Make distracters equally plausible to the uninformed candidate:

a) Avoid using distracters that are opposites of the correct answer.

b) Avoid using the same/similar words (or synonyms) in both the stem and the

correct alternative. This provides a clue to the correct answer.

c) Avoid using stereotyped phraseology or stating the correct alternative in

more detail than the distracters.

d) Avoid using absolute terms in distracters. Words such as "all," "always," "none," "never," and "only" are often associated with incorrect responses to

test questions.

e) Avoid creating a sub-set of alternatives that is all-inclusive. If two of the

alternatives cover all possibilities, then the other distracters are eliminated

from consideration as correct by the test-wise candidate.

f) Avoid writing two distracters with the same meaning. The test-wise

candidate can eliminate these alternatives if only one answer is to be selected.

One correct alternative for each item. Search for possible conflicts with other resource

materials or other sections of the reference material from which you are working. For

example, criminal codes and vehicle codes are notorious for having the same actions

constitute more than one offense. To illustrate, in one state the same car theft could be a

misdemeanor (according to the vehicle code) or a felony (according to the criminal code).

Process-by-content matrix. Three types of items should be included (see the "Examples of Item Types" section for examples of each type):
1. Knowledge of terms/definitions,
2. Knowledge of principles and concepts, and
3. Application of principles and concepts.

Know when to admit defeat. Do NOT be afraid to throw out or set aside items that

CANNOT be saved. Sometimes there is just nothing you can do to save an item. It is

better to discard or set aside a bad item than to slip it into the test or to spend too much

time working on any one item.

Item Format When Developing Tests

Sample Item and Answer Documentation Example:

QUESTION: According to the CRIMINAL INVESTIGATION workbook, testimony given by

an accomplice or participant in a crime which tends to convict others is _________ evidence.

A. state

B. police

C. defense

D. dissent

Answer: A

(Source: Criminal Investigation workbook, 3rd Edition, p. 47.)


Note:

• Develop four (4) alternatives per item, lettered A, B, C, and D.

• Write the letter of the correct alternative at the end of the item.

• Write a source at the end of each item, describing the source and page

number or section (e.g., Merit System Manual, IX-3).

• Place the name of the source within the text of the item stem in capital

letters. (e.g., “According to the MERIT SYSTEM MANUAL....”)

• Avoid negative items.

• Capitalize limiting, directional, or negative words.

• Randomize the order of distracters and the correct alternative.

Examples of Test Items

I. If the stem is a question.....

What color are polar bears in the Arctic?

A. Black

B. Yellow

C. White

D. Brown

......then

• The stem must be a grammatically complete sentence.

• Put a question mark at the end of the question.

• The question must be the last sentence in the stem.

• Capitalize the first letter of each alternative, even if it is not a proper noun.

• Do not put a period at the end of each alternative.

II. If the correct alternative completes the stem....

A person who burns his/her own car and reports it stolen in order to obtain the insurance money

is guilty of

A. insurance fraud.

B. theft.

C. arson for profit.

D. burglary.

.....then

• Use no punctuation at the end of the stem. Do not use a blank line or colon.

• The sentence leading to the alternatives should be the last sentence in the stem.

• Place a period at the end of each alternative.


• Capitalize the first letter of the alternative only if it is grammatically correct to do so.

III. If the alternative is embedded in the stem....

A robbery is an example of a __________ offense.

A. traffic

B. criminal

C. civil

D. zoning

....then

• Do not put a period or other punctuation at the end of these alternatives. If punctuation is

required, put it in the stem.

• Capitalize the first letter of the alternative only if it is grammatically correct to do so.

• Make all blanks ten spaces long, regardless of the length of the alternatives.

Examples of Item Types

I. Knowledge of Definitions/Terms

The proceeding whereby one party to an action may be informed as to the facts known to other

parties or witnesses is a

A. discovery.

B. garnishment.

C. indictment.

D. tort.

Note: This item taps the test-taker's knowledge of the definition of the

word/term “discovery.”

II. Knowledge of Principles and Concepts

The changing phases of the moon are caused by

A. the tilt of the earth's axis.

B. the rotation of the moon on its axis.

C. the tidal patterns of the earth's oceans.

D. the orbit of the moon around the earth.

Note: The concept/principle being tapped here is the understanding that the

cause of the changes in the phases of the moon is the orbit of the moon

around the earth. These types of items include those that tap the test-taker's knowledge of facts.

III. Application of Principles and Concepts


On hot sunny days, parked cars with the hottest interiors are those that are __________ in color.

A. white

B. red

C. black

D. green

Note: This item taps into the application of the principle (concept/fact) that

black surfaces absorb heat (and light) more efficiently than other colors

and thus will be relatively hotter.

Test Plan Example

The following is an example of a Test Plan for a test for the promotion of police officers to the

position of sergeant.

The length of the Police Sergeant exam will be 150 items. The number of items drawn from each

document will be based on the following:

A. The importance and frequency of behaviors associated with the knowledge contained in

that document.

B. The priority assigned to each document by the advisory committee.

C. The nature of the content and the relative length of the document. Some documents are

very short. For this reason, the number of items drawn from a document is partially

determined by how feasible it is to write items from that source. Some content areas are less

amenable to testing than others.

The reading level for this exam will be that of a college sophomore.

Following the determination of the length of the test and the number of items to be derived from

each source, a test plan was developed. The use of a process-by-content matrix ensures adequate

sampling of job knowledge content areas and problem-solving processes. Problem-solving areas involve the following:
A. Knowledge of terminology.
B. Understanding of principles.
C. Application of knowledge to new situations.

While knowledge of terminology is important, the understanding and application of principles

are considered to be of primary importance. This is reflected in the recommendation that a

majority of the items should involve the application of knowledge and understanding of

principles. Furthermore, not all documents are equally well suited for each of the three problem-

solving processes. Therefore, the manner in which each is sampled takes this into consideration.

PROPOSED PROCESS-BY-CONTENT MATRIX


POLICE SERGEANT

SOURCE DEF PRINC APP TOTAL

1. Essentials of Modern Police Work 3 8 17 28

2. Community Policing 0 7 13 20

3. Rules of Evidence 3 10 15 28

4. Department Rules & Regulations 0 4 6 10

5. State Criminal Code 2 5 11 18

6. State Vehicle Code 2 4 14 20

7. City Ordinances 0 2 10 12

8. Performance Appraisal Guidelines/Employee Ratings 0 3 4 7

9. Labor Agreement with the city 0 3 4 7

Total 10 46 94 150

Note: The document above was the goal the test writers set themselves to

follow. Turn to the next page to see the final Process-by-Content Matrix

that shows what really occurred during the test’s development. Notice

how it is slightly different from the test plan.


PROCESS-BY-CONTENT MATRIX

POLICE SERGEANT

SOURCE DEF PRINC APP TOTAL

1. Essentials of Modern Police Work 4 10 20 34

2. Community Policing 3 7 13 23

3. Rules of Evidence 3 10 17 30

4. Department Rules & Regulations 1 3 6 10

5. State Criminal Code 4 5 9 18

6. State Vehicle Code 4 6 10 20

7. City Ordinances 2 2 6 10

8. Performance Appraisal Guidelines/Employee Ratings 0 1 1 2

9. Labor Agreement with the city 0 1 2 3

Total 21 45 84 150

Note: This is an example of the Process-by-Content Matrix for a completed

test. Not all documents were equally well suited for each of the three

problem-solving processes. While the item writers attempted to conform to the test plan, some sources did not yield the required number of items.

So, the number of items actually developed from each source was

different from the preliminary test plan in some respects. This is because

sometimes the source materials do not supply sufficient information to

write appropriate test items. Alternatively, when the item writers

examined the source materials more closely, they identified additional

areas in the source materials that should be covered during the test that

were not identified when they originally made up the proposed Process-

by-Content Matrix. From this, you can see that test writing is a dynamic

process where the reality of writing test items is frequently dictated by

the quality of the source materials.


Attachment D - Upward Rating Bias on Angoff Panels

In some situations, SME panels responsible for setting the Critical Score level for a test set

the bar too high. For example, we have experienced situations where only 50% of credentialed

applicants in a given area of expertise taking a pre-employment test (measuring the same

competency areas in which they are credentialed) would pass a recommended cutoff score set by

the rater panel. In highly-regulated fields where credentialing programs are oftentimes very

rigorous, a situation where a pre-employment test fails 50% of the credentialed applicants could

have two possible explanations: (1) the credentialing program is setting the bar much too low

(and unqualified candidates are being credentialed), or (2) the rating panel that established the

Critical Score for the pre-employment test set the bar too high. While there is a range of other

plausible explanations between these two extremes, it has been our experience that the latter

explanation is oftentimes the case.

This "upward bias" tendency that is sometimes observed does not (of course) rule out the opposite, where a rater panel underestimates the ideal minimum competency level. However, it has been our experience that rating biases of the overestimation type are more common than those of the underestimation type.

While there are several viable theories that may explain why this phenomenon occurs with rating panels, one particular theory seems to provide a practical explanation. The conscious competence theory (which is sometimes also called the "Four Stages of Learning" theory) is popularly attributed to psychologist Abraham Maslow, although its origins are debated. This theory describes how people learn in four progressive stages:

1. Unconscious Incompetence (where you don't know that you don't know something),

to

2. Conscious Incompetence (you are now aware that you are incompetent at

something), to

3. Conscious Competence (you develop a skill in that area but have to think about it),

to the final stage

4. Unconscious Competence (you are good at it and it now comes naturally).

These four "learning stages" (ranging from unconsciously unskilled, to consciously unskilled, to consciously skilled, to unconsciously skilled) have been widely adopted in both theory and practice in the educational, psychological, and organizational behavior fields since their inception. It is the fourth stage (unconsciously skilled) that may cause some of the upward bias sometimes observed in rating panels. This is because individuals who have had so much practice with a particular skill (to the point where it becomes "second nature" and can be performed easily without intense concentration) can sometimes underestimate how long it took them to master the skill when they first started in the position, which may cause them to overestimate the percentage of qualified applicants who will be able to answer the test question on the first day of the job.

Common examples of skills that can be attained at this "fourth level" include driving, sports activities, typing, manual dexterity tasks, listening, and communicating. For example, performing a "Y turn" is second nature to most people who have been driving for several years. In fact, many experienced drivers may not even recall ever having to acquire this skill, but in actuality many of them had to practice it repeatedly before it was mastered.

This issue can create an upward bias when applying minimum passing score

recommendations. Some raters may even be able to teach the target skill to others; however, after some time of being unconsciously competent, a person may actually have difficulty explaining exactly how he/she performs a particular skill because the skill has become largely instinctual. This arguably gives rise to the need for long-standing unconscious competence to be

checked periodically against new standards. Fortunately, the extent to which this possible bias

may exist can be evaluated (after the test administration) by interpreting Output 6 in the Rater

Analysis Sheet.

Below are four suggestions that can be followed to help alleviate this potential problem:

1. Select SMEs who have between 1 and 5 years of experience to serve on the rating panel,

2. Have the SMEs reveal and discuss their ratings on the first few test items so the

outliers in either direction can be reined in by the workshop proctor,

3. Conduct rigorous discussions with the SMEs regarding the "true" minimum

qualification level relevant to the test, and

4. In some situations, masking the answer key from the SMEs can help reduce the

potential for upward rating bias.