
Page 1: Reliability

CHAPTER 4

Reliability

Page 2: Reliability

What is Reliability?

There are actually several ways to go about defining and interpreting reliability.

In one instance, a test may be considered reliable if its observed scores are highly correlated with its true scores.

• Presuming that both observed and true scores were obtained for every examinee on a given test, the squared correlation coefficient between observed and true scores, \(\rho_{XT}^2\), is known as the reliability coefficient for that test.

Reliability may also be expressed as a correlation between observed scores on two parallel tests.

• In this case, the correlation coefficient, \(\rho_{XX'}\), is the reliability coefficient for that test.

Unfortunately, in most cases it is not possible to obtain the true scores, nor is a parallel test available…

Page 3: Reliability

Defining Reliability

For simplicity, let us use \(\rho_{XX'}\) to represent the reliability coefficient, even when we are not discussing the correlation between two parallel tests.

We can now define six ways of interpreting reliability:

1. \(\rho_{XX'}\) is the correlation between the observed scores on parallel tests.
2. \(\rho_{XX'}^2\) is the proportion of variance in \(X\) that is explained by a linear relation with \(X'\).
3. \(\rho_{XX'} = \sigma_T^2 / \sigma_X^2\): the ratio of true-score variance to observed-score variance.
4. \(\rho_{XX'} = \rho_{XT}^2\): the square of the correlation between observed and true scores.
5. \(\rho_{XX'} = 1 - \rho_{XE}^2\): one minus the squared correlation between observed and error scores.
6. \(\rho_{XX'} = 1 - \sigma_E^2 / \sigma_X^2\): one minus the ratio of error-score variance to observed-score variance.

Page 4: Reliability

Defining Reliability

\(\rho_{XX'}\) is the correlation between the observed scores on parallel tests.

This interpretation asserts that the reliability of a test is equal to the correlation of its observed scores with the observed scores from a parallel test.

If each examinee obtains the same observed score when measured with a parallel test and there is some variance in observed scores within each testing, then the tests have perfect reliability (i.e., \(\rho_{XX'} = 1\)).

If examinees' observed scores on one test are uncorrelated with their observed scores on a parallel test, then the tests are totally unreliable (i.e., \(\rho_{XX'} = 0\)).

Page 5: Reliability

Defining Reliability

\(\rho_{XX'}^2\) is the proportion of variance in \(X\) that is explained by a linear relation with \(X'\).

This is the standard interpretation of reliability, and it is to be read and understood in the same way as the squared correlation coefficient for the typical Pearson correlation.

In this way, we can view \(\rho_{XX'}^2\) as the proportion of variance in \(X\) explained by \(X'\).

Page 6: Reliability

Defining Reliability

This interpretation of reliability asserts that the reliability coefficient can be expressed as the ratio of true-score variance to observed-score variance.

For a perfectly reliable text and ; thus when , then measurement is made without error.

When , then error is present in the measurement.

When , then , which means that all scores reflect only error and that any differences between examinees’ observed scores reflect random error rather than true-score differences.

Page 7: Reliability

Defining Reliability

From this relation we can also state that as the reliability of a given test increases, the error-score variance becomes relatively smaller.

To the right is a graph depicting a pair of error distributions, one with larger variance than the other.

The examinee's true score, \(T\), is indicated on the graph.

For the curve with the smaller error variance, the observed scores are more tightly packed around \(T\).

For the curve with the larger error variance, the observed scores are less tightly packed around \(T\).

Page 8: Reliability

Defining Reliability

This interpretation states that the reliability coefficient is equal to the square of the correlation between observed and true scores:

\[ \rho_{XX'} = \rho_{XT}^2, \qquad \text{so that} \qquad \rho_{XT} = \sqrt{\rho_{XX'}} \]

To the right is a graph depicting the relation between \(\rho_{XX'}\) and \(\rho_{XT}\).

We can see that for every value of \(\rho_{XX'}\), \(\rho_{XT} \geq \rho_{XX'}\).

In other words, an observed test score will correlate higher with its own true score than with an observed score on a parallel test.

This will become more important when we discuss validity.

Page 9: Reliability

Defining Reliability

This interpretation asserts that the reliability coefficient is equal to one minus the squared correlation between observed and error scores:

\[ \rho_{XX'} = 1 - \rho_{XE}^2 \]

In ideal circumstances \(\rho_{XX'} = 1\), but this is only true when \(\rho_{XE} = 0\).

This is depicted in the graph to the right.

Page 10: Reliability

Defining Reliability

This interpretation relates the reliability coefficient to one minus the ratio of error-score variance to observed-score variance:

\[ \rho_{XX'} = 1 - \frac{\sigma_E^2}{\sigma_X^2} \]

If \(\sigma_E^2 = \sigma_X^2\), then \(\rho_{XX'}\) necessarily equals 0.

This relation also indicates that observed-score variance can affect reliability.

• For instance, a restricted range of observed scores will produce a low observed-score variance, thus increasing the value of the ratio \(\sigma_E^2 / \sigma_X^2\) and reducing the reliability coefficient.

Page 11: Reliability

Test/Retest Reliability Estimates

Test/Retest Reliability is a means of obtaining a reliability estimate, \(r_{XX'}\), that is based on testing the same examinees twice with the same test and then correlating the results.

If, for example, we administer the same test to the same examinees twice, each examinee produces the same score, and there is some variance across examinees, then the test/retest correlation will be 1.0.

• This is an example of perfect reliability.

If the test/retest correlation is 0.0, then the test is completely unreliable.
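To make this concrete, here is a minimal Python sketch (the data and the function name are illustrative, not from the slides): the test/retest estimate is simply the Pearson correlation between the two administrations.

```python
import numpy as np

def test_retest_reliability(time1, time2):
    """Estimate test/retest reliability as the Pearson correlation
    between two administrations of the same test."""
    return np.corrcoef(time1, time2)[0, 1]

# Hypothetical scores for five examinees at the two administrations.
time1 = [12, 18, 25, 31, 40]
time2 = [14, 17, 27, 30, 41]
print(round(test_retest_reliability(time1, time2), 3))  # close to 1.0
```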

Page 12: Reliability

Test/Retest Reliability Estimates

The Test/Retest method of evaluating reliability is one of the easiest ways to do so, but it comes with some inherent complications.

The biggest obstacle to overcome is the potential for carry-over effects between the two tests.

• In other words, the first testing has somehow influenced the second testing.

For example:

• If examinees remember their answers from the previous test, then \(r_{XX'}\) tends to overestimate \(\rho_{XX'}\).

• If the examinees' performance improves between testings (unevenly across examinees), then \(r_{XX'}\) tends to underestimate \(\rho_{XX'}\).

Page 13: Reliability

Test/Retest Reliability Estimates

A second problem with the Test/Retest method is the length of time required to conduct the two test administrations.

A short delay between Time 1 and Time 2 increases the potential for carry-over effects due to memory, fatigue, practice, etc.

But a long delay between Time 1 and Time 2 increases the potential for carry-over effects due to mood, developmental change, etc.

Consequently, the Test/Retest method is most appropriate in contexts wherein the test is not susceptible to carry-over effects.

Page 14: Reliability

Parallel-Forms & Alternate-Forms Reliability Estimates

A Parallel-Forms Reliability estimate is the correlation between the observed scores on two parallel tests, \(\rho_{XX'}\).

• In a practical context it is generally not possible to verify that two tests are indeed parallel.

• In these cases, alternate test forms are often used instead.

Alternate test forms are any two test forms that have been constructed in an effort to make them parallel.

• Consequently, they may have equal or similar observed-score means, variances, and correlations with other measures.

Page 15: Reliability

Parallel-Forms & Alternate-Forms Reliability Estimates

The correlation between observed scores on the alternate test forms, \(r_{X_1 X_2}\), is therefore an estimate of the reliability of either of the test forms.

This correlation, \(r_{X_1 X_2}\), reflects how reliable and parallel the two tests are.

CAUTION! This does not eliminate the concern for carry-over effects; indeed, these two tests may be susceptible to the same problems.

Furthermore, if the alternate tests are not parallel, then \(r_{X_1 X_2}\) will be an inaccurate estimate of \(\rho_{X_1 X_1'}\) or \(\rho_{X_2 X_2'}\).

Page 16: Reliability

Parallel-Forms & Alternate-Forms Reliability Estimates

To illustrate the last point, consider the following:

Let \(X_1 = T_1 + E_1\) and \(X_2 = T_2 + E_2\) be scores on two alternate forms.

If \(\rho_{T_1 T_2} = 1\) but \(\rho_{X_1 X_1'} < \rho_{X_2 X_2'}\), then \(X_1\) is less reliable than \(X_2\).

• \(r_{X_1 X_2}\) will tend to overestimate \(\rho_{X_1 X_1'}\) and underestimate \(\rho_{X_2 X_2'}\).

If \(\rho_{T_1 T_2} < 1\), it is possible that the tests are measuring different traits.

• \(r_{X_1 X_2}\) will tend to underestimate both \(\rho_{X_1 X_1'}\) and \(\rho_{X_2 X_2'}\).

Page 17: Reliability

Parallel-Forms & Alternate-Forms Reliability Estimates

It is also possible for alternate test forms to have unequal true scores and error variances, even though the correlation between their observed scores is equivalent to a parallel-forms correlation.

Let \(X_1 = Y\) and \(X_2 = bY' + a\), where \(Y\) and \(Y'\) are scores on parallel tests.

Let us also suppose that \(a\) and \(b\) are constants, establishing that \(X_2\) is a linear function of \(Y'\).

Although \(X_1\) and \(X_2\) are not parallel tests (i.e., their means and variances differ), it is possible that \(\rho_{X_1 X_2} = \rho_{YY'}\), because a correlation is unchanged by a positive linear transformation of one of its variables.

Therefore, a correlation between observed scores on alternate forms will produce a good estimate of reliability if the alternate forms are parallel or if they are linear functions of parallel test scores AND if carry-over effects do not influence the correlation.

Page 18: Reliability

Internal Consistency Estimates of Reliability

We have seen that reliability estimates can be obtained by administering the same test to the same examinees and by correlating the results: Test/Retest

We have also seen that reliability estimates can be obtained by administering two parallel or alternate forms of a test, and then correlating those results: Parallel- & Alternate-Forms

In both of the above cases, the researcher must administer two exams, and the exams are sometimes given at different times, making them susceptible to carry-over effects.

Here, we will see that it is possible to obtain a reliability estimate using only a single test.

The most common way to obtain a reliability estimate using a single test is through the Split-half approach.

Page 19: Reliability

Split-Half approach to Reliability

When using the Split-Half approach, one gives a single test to a group of examinees.

Later, the test is divided into two parts, which may be considered to be alternate forms of one another.

• In fact, the split is not arbitrary; an attempt is made to choose the two halves so that they are parallel or essentially τ-equivalent.

• If the halves are considered parallel, then the reliability of the whole test is estimated using the Spearman-Brown formula.

• If the halves are essentially τ-equivalent, then coefficient α can be used to estimate reliability.

Page 20: Reliability

Split-Half approach to Reliability

Using the Spearman-Brown formula:

• Here, we are assuming the two test halves (\(Y\) and \(Y'\)) are parallel forms.

The two halves are correlated, producing the half-test correlation, \(r_{YY'}\).

• But this is only a measure of the reliability of one half of the test.

• The reliability of the entire test will be greater than the reliability of either test half taken alone.

The Spearman-Brown formula for estimating the reliability of the entire test is therefore:

\[ \text{reliability} = \frac{2\, r_{YY'}}{1 + r_{YY'}} \]
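As a sketch (assuming nothing beyond the formula above), the stepped-up reliability can be computed directly; the loop reproduces the table on the next slide.

```python
def spearman_brown(r_half):
    """Step up the correlation between two parallel test halves, r_YY',
    to an estimate of the reliability of the full-length test."""
    return 2 * r_half / (1 + r_half)

for r in [0.00, 0.20, 0.40, 0.60, 0.80, 1.00]:
    print(f"r_YY' = {r:.2f} -> full-test reliability = {spearman_brown(r):.2f}")
```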

Page 21: Reliability

Split-Half approach to Reliability

Correlation between Parallel Halves of a Test (\(r_{YY'}\)) and the Reliability of the Entire Test:

\(r_{YY'}\)    Reliability of the entire test
0.00           0.00
0.20           0.33
0.40           0.57
0.60           0.75
0.80           0.89
1.00           1.00

\[ \text{reliability} = \frac{2\, r_{YY'}}{1 + r_{YY'}} \]

Page 22: Reliability

Split-Half approach to Reliability

On the other hand, the two test halves may not be (and likely are not) parallel forms.

This is confirmed when it is determined that the two halves have unequal variances.

In these situations, it is best to use a different approach to estimating reliability:

• Cronbach's coefficient α can be used to estimate the reliability of the entire test.

Page 23: Reliability

Split-Half approach to Reliability

If the test halves are not essentially τ-equivalent, then coefficient α will give a lower bound for the test's reliability.

• In other words, the test's reliability must be greater than or equal to the value produced by Cronbach's α.

• If α is high, then you know that the test's reliability is also high.

• If α is low, then you may not know whether the test actually has low reliability or whether the halves of the test are simply not essentially τ-equivalent.

Page 24: Reliability

Split-Half approach to Reliability

Nevertheless, the formula for Cronbach's coefficient α in the split-half case is as follows:

\[ \alpha = 2\left(1 - \frac{\sigma_{Y_1}^2 + \sigma_{Y_2}^2}{\sigma_X^2}\right) \]

where \(\sigma_{Y_1}^2\) and \(\sigma_{Y_2}^2\) are the variances of test halves \(Y_1\) and \(Y_2\), respectively, and \(\sigma_X^2\) is the variance of the entire test, \(X = Y_1 + Y_2\).
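The following Python sketch implements the split-half α exactly as defined above (the half-test scores are hypothetical):

```python
import numpy as np

def split_half_alpha(y1, y2):
    """Cronbach's alpha for two test halves:
    alpha = 2 * (1 - (var(Y1) + var(Y2)) / var(X)), with X = Y1 + Y2."""
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    total = y1 + y2
    return 2 * (1 - (y1.var(ddof=1) + y2.var(ddof=1)) / total.var(ddof=1))

# Hypothetical half-test scores for six examinees.
print(round(split_half_alpha([10, 12, 15, 9, 14, 11],
                             [11, 13, 14, 8, 15, 10]), 3))
```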

Page 25: Reliability

Split-Half approach to Reliability

To facilitate the conceptual understanding of this formula, we may also state the following:

\[ \alpha = \frac{4\,\sigma_{Y_1 Y_2}}{\sigma_X^2} \]

where \(\sigma_{Y_1 Y_2}\) is the covariance between the two halves.

In so doing, we see that as the covariance (and correlation) between the halves increases, coefficient α also increases.

As with the Spearman-Brown formula, if the reliability is high, then the correlation value will also be high, and vice versa.

• Specifically, these are measures of the test's internal consistency and/or homogeneity.

Page 26: Reliability

Split-Half approach to Reliability

If the variances of the two test halves are equal, then the Spearman-Brown formula and Cronbach's α will produce identical results.

If the variances of the two test halves are equal but the halves are not essentially τ-equivalent, then both the Spearman-Brown formula and Cronbach's α will underestimate the test's reliability.

• That is, they provide a lower-bound estimate.

If the observed-score variances of the test halves are equal and the halves are essentially τ-equivalent, then the Spearman-Brown formula and Cronbach's α will both equal the test's reliability.

Page 27: Reliability

Split-Half approach to Reliability

Obviously, the major advantage of internal-consistency reliability estimates is that the test need only be given once to obtain such an estimate.

Naturally, this approach is limited to tests that can be divided into two parts that are parallel or essentially τ-equivalent; it cannot be used when the test lacks independent items that can be separated from one another.

• In those situations, one must use the test/retest, parallel-forms, or alternate-forms reliability approaches.

Assuming one is able to use the Split-Half approach, however, how does one go about forming two test halves?

Page 28: Reliability

Split-Half approach to Reliability

Forming Test Halves:

There are three commonly used methods for forming test halves:

1. The Odd/Even method
2. The Order method
3. The Matched Random Subsets method

Page 29: Reliability

Odd/Even approach to Test Halves

The Odd/Even method classifies items by their order on the test, whether odd-numbered or even-numbered.

• In other words, all odd-numbered test items form the first half, and all even-numbered test items form the second half.

After the two halves are formed, a score for each half is obtained for each examinee.

These scores are used to obtain an estimate of reliability.

This is a fairly simple and straightforward approach to forming two test halves.
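A minimal sketch of the Odd/Even split in Python (the score matrix is hypothetical; the final step-up uses the Spearman-Brown formula from earlier):

```python
import numpy as np

def odd_even_split_half(items, step_up=True):
    """Split an (examinees x items) score matrix into odd/even item
    halves, correlate the half scores, and optionally step up with
    the Spearman-Brown formula."""
    items = np.asarray(items, float)
    odd = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half) if step_up else r_half

scores = np.array([[1, 1, 0, 1, 0, 1],
                   [0, 1, 0, 0, 1, 0],
                   [1, 1, 1, 1, 1, 1],
                   [0, 0, 0, 1, 0, 0],
                   [1, 0, 1, 1, 1, 1]])
print(round(odd_even_split_half(scores), 3))
```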

Page 30: Reliability

Ordered approach to Test Halves

The Order method requires that a test be divided prior to its administration.

From this point, there are multiple approaches to administering the Order method:

1. Every examinee can be given the same test, and then one can compare scores from the first half to scores from the second half.

• Carry-over effects may be a concern.

2. The halves are labeled, say A and B, and are then given in different orders to different examinees.

• In other words, half the examinees will be randomly assigned order A-B, and the other half will be assigned order B-A.

The Order method is generally considered less satisfactory than the Odd/Even method because of the increased potential for carry-over effects.

Page 31: Reliability

The Matched Random Subsets approach to Test Halves

The Matched Random Subsets method is much more sophisticated than the two aforementioned methods.

The process involves several steps:

1. For each test item, two statistics are computed:

• The proportion of examinees passing the item, a measure of the item's "difficulty."

• The biserial or point-biserial correlation between the item score and the total test score.

2. Each item is plotted on a graph using the above two statistics.

• Items that are close together on the graph are paired, and one item from each pair is randomly assigned to one half of the test.

• The remaining items form the other half of the test.

Page 32: Reliability

The Matched Random Subsets approach to Test Halves

For example, in the graphic above, we see the plot of test items A, B, C, D, E, and F.

Test items A and B are similar and are therefore paired. Likewise, C is paired with D, and E with F.
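Below is a rough Python sketch of the Matched Random Subsets procedure. It is a simplification under stated assumptions: items are 0/1 scored with nonzero variance, the point-biserial is taken against the total score, and "closeness" on the plot is approximated by sorting on the two statistics and pairing adjacent items rather than by true nearest-neighbor matching.

```python
import numpy as np

def item_stats(items):
    """For each 0/1 item: difficulty (proportion passing) and the
    point-biserial correlation between item score and total score."""
    items = np.asarray(items, float)
    total = items.sum(axis=1)
    difficulty = items.mean(axis=0)
    r_pb = np.array([np.corrcoef(items[:, i], total)[0, 1]
                     for i in range(items.shape[1])])
    return difficulty, r_pb

def matched_random_subsets(items, seed=None):
    """Pair items with similar (difficulty, r_pb) values and randomly
    assign one item of each pair to each half of the test."""
    rng = np.random.default_rng(seed)
    difficulty, r_pb = item_stats(items)
    order = np.lexsort((r_pb, difficulty))    # crude proximity ordering
    half_a, half_b = [], []
    for i in range(0, len(order) - 1, 2):     # an unpaired leftover item is ignored
        first, second = rng.permutation(order[i:i + 2])
        half_a.append(int(first))
        half_b.append(int(second))
    return sorted(half_a), sorted(half_b)
```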

Page 33: Reliability

Internal-Consistency Reliability – The General Case

In our previous examples, we divided a given test into two equal halves.

But here we examine dividing a given test into multiple equal components.

Even in these cases, we can apply the basic principles of each of the methods for dividing a test.

• For example, the Odd/Even method can be modified to divide a nine-item test into thirds by taking every third item in sequence to form a given component.

• The Matched Random Subsets method would involve forming triplets rather than pairs; the first item of each triplet is randomly assigned to one component, the next to another, and so on.

Page 34: Reliability

Internal-Consistency Reliability – The General Case

Let us assume that a given test is divided into N components.

The variances of the scores on each component and the variance of the entire test are used to estimate the reliability of the test.

If the components are essentially τ-equivalent, then the formulas presented herein will provide good estimates of the test's reliability.

If, however, the components are not essentially τ-equivalent, then the formulas presented herein will underestimate (i.e., provide a lower bound for) the test's reliability.

Furthermore, it is important that any test divided into components measure only a single trait (i.e., be homogeneous in content).

• Intelligence tests are a classic example of a heterogeneous test, because they measure a broad spectrum of traits.

Page 35: Reliability

Internal-Consistency Reliability – The General Case

The general formula for estimating internal-consistency reliability (coefficient α) is as follows:

\[ \alpha = \frac{N}{N-1}\left(1 - \frac{\sum_{i=1}^{N} \sigma_{Y_i}^2}{\sigma_X^2}\right) \]

Where: \(X\) is the observed score for a test formed by combining \(N\) components; \(\sigma_X^2\) is the population variance of \(X\); \(\sigma_{Y_i}^2\) is the population variance of the \(i\)th component of \(X\); and \(N\) is the number of components that are combined to form \(X\).
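A direct Python translation of the general formula (the component matrix is whatever division of the test one has formed):

```python
import numpy as np

def coefficient_alpha(components):
    """Cronbach's alpha for an (examinees x N components) score matrix:
    alpha = N/(N-1) * (1 - sum_i var(Y_i) / var(X))."""
    components = np.asarray(components, float)
    n = components.shape[1]
    total_var = components.sum(axis=1).var(ddof=1)  # var of X = sum of components
    component_vars = components.var(axis=0, ddof=1).sum()
    return n / (n - 1) * (1 - component_vars / total_var)
```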

Page 36: Reliability

Internal-Consistency Reliability – The General Case

If each component test, \(Y_i\), is a dichotomous item, then we can simplify our formula to the following, also known as the Kuder-Richardson formula 20 (KR20)†:

\[ \text{KR20} = \frac{N}{N-1}\left(1 - \frac{\sum_{i=1}^{N} p_i(1 - p_i)}{\sigma_X^2}\right) \]

This formulation reflects the fact that the variance of scores on item \(i\), when these scores can only take the values 0 or 1, equals \(p_i(1 - p_i)\), where \(p_i\) is the proportion of examinees in the population who get a score of 1 on the item (that is, the proportion who pass).

†Sometimes this formula is referred to as α-20 or α(20).
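Since each item variance is \(p_i(1 - p_i)\), KR20 needs only the item difficulties and the total-score variance. A minimal sketch (population variances are used throughout, matching the formula):

```python
import numpy as np

def kr20(items):
    """KR20 for an (examinees x items) matrix of 0/1 scores."""
    items = np.asarray(items, float)
    n = items.shape[1]
    p = items.mean(axis=0)                      # item difficulties p_i
    total_var = items.sum(axis=1).var(ddof=0)   # population variance of X
    return n / (n - 1) * (1 - (p * (1 - p)).sum() / total_var)
```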

Page 37: Reliability

Internal-Consistency Reliability – The General Case

Similarly, another Kuder-Richardson formula that is useful when every \(Y_i\) is a dichotomous item is:

\[ \text{KR21} = \frac{N}{N-1}\left(1 - \frac{N\,\bar{p}(1 - \bar{p})}{\sigma_X^2}\right) \]

where \(\bar{p}\) is the average of the item difficulties \(p_i\).

This is a special case of the KR20 formula and is usually referred to as KR21 or α-21.

It is also the case that \(\text{KR21} \leq \text{KR20}\); the two formulas will be equal only when the difficulties of all the test items are equal.

• If the item difficulties are not all equal, then KR21 will be less than KR20 and will underestimate the test's reliability.
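KR21 replaces each item variance with the value implied by the average difficulty, so it can never exceed KR20. A sketch, for comparison with the kr20 function above:

```python
import numpy as np

def kr21(items):
    """KR21 for an (examinees x items) matrix of 0/1 scores; uses the
    average difficulty p_bar in place of the individual p_i."""
    items = np.asarray(items, float)
    n = items.shape[1]
    p_bar = items.mean()                        # average item difficulty
    total_var = items.sum(axis=1).var(ddof=0)   # population variance of X
    return n / (n - 1) * (1 - n * p_bar * (1 - p_bar) / total_var)
```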

Page 38: Reliability

Internal-Consistency Reliability – The General Case

If the components of a given test are essentially τ-equivalent, then the values produced by each of the aforementioned coefficients will be equal to the test's reliability.

• In the case of KR21, however, the items must also be of equal difficulty.

If the components are not essentially τ-equivalent, then the values produced will be less than the test's reliability and are therefore only lower bounds.

• If the components that make up the total test score intercorrelate highly, then the values of the aforementioned coefficients will be high.

• If the components that make up the total test score do not intercorrelate highly, then the values of the aforementioned coefficients will be small.

Page 39: Reliability

The Spearman-Brown Formula: The General Case

Earlier, we saw that one can estimate a test’s reliability using information about the reliability of parallel components using the Spearman-Brown Formula.

We can also use this formula to predict the effects of changes in test length on reliability.

• For this reason, the Spearman-Brown formula is sometimes referred to as the Spearman-Brown Prophecy formula.

In its general form, the formula is as follows:

\[ \rho_{XX'} = \frac{N\,\rho_{YY'}}{1 + (N - 1)\,\rho_{YY'}} \]

Where: \(X\) is the observed total test score for a test formed by combining \(N\) parallel component test scores (i.e., \(X = Y_1 + Y_2 + \dots + Y_N\)); \(Y\) is a component test score that is a part of \(X\); \(\rho_{XX'}\) is the population reliability of \(X\); \(\rho_{YY'}\) is the population reliability of any \(Y\); and \(N\) is the number of parallel test scores that are combined to form \(X\).
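A one-line Python sketch of the general (prophecy) form; the example reproduces the point made on a later slide that ten-fold lengthening of a 0.20-reliable component yields only about 0.71:

```python
def spearman_brown_general(rho_yy, n):
    """Reliability of a test built from n parallel components,
    each with reliability rho_yy."""
    return n * rho_yy / (1 + (n - 1) * rho_yy)

print(round(spearman_brown_general(0.20, 10), 3))  # 0.714
```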

Page 40: Reliability

The Spearman-Brown Formula: The General Case

In this way, the reliability of any test, \(\rho_{XX'}\), is expressed in terms of the reliability of parallel components of the test, \(\rho_{YY'}\).

Note that \(\rho_{XX'}\) will always be greater than \(\rho_{YY'}\) (for \(N > 1\) and \(\rho_{YY'} > 0\)).

• For this reason, \(\rho_{XX'}\) is sometimes referred to as the stepped-up reliability.

Page 41: Reliability

The Spearman-Brown Formula: The General Case

To the right is a graph that illustrates the general effect that changes in test length have on reliability.

Each line represents a different base component reliability estimate, \(\rho_{YY'}\).

Using the Spearman-Brown Prophecy formula, one is able to determine how long a test must be in order to obtain a given reliability coefficient, or whether that is even possible.

• Note that even when a test with a component reliability of 0.20 is lengthened by a factor of ten, the total test reliability is still only slightly greater than 0.70.

Page 42: Reliability

The Spearman-Brown Formula: The General Case

If we know \(\rho_{YY'}\) and \(N\), then we can solve for \(\rho_{XX'}\).

Alternatively, suppose we know \(\rho_{XX'}\) and would like to predict \(\rho_{YY'}\), the reliability of a component \(1/N\) as long.

• By rearrangement, we obtain the following:

\[ \rho_{YY'} = \frac{\rho_{XX'}}{N - (N - 1)\,\rho_{XX'}} \]

Page 43: Reliability

The Spearman-Brown Formula: The General Case

It is even possible that we know \(\rho_{YY'}\) and our desire is to obtain a given \(\rho_{XX'}\) value.

• In this case, we can rearrange the formula in the following way:

\[ N = \frac{\rho_{XX'}\,(1 - \rho_{YY'})}{\rho_{YY'}\,(1 - \rho_{XX'})} \]

This formula determines the number of components necessary to reach a given \(\rho_{XX'}\) value.
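As a sketch, the rearranged formula gives the required length factor directly; rounding up gives a whole number of components (the values here are illustrative):

```python
import math

def required_length_factor(rho_target, rho_yy):
    """N = rho_target * (1 - rho_yy) / (rho_yy * (1 - rho_target)):
    how many parallel components of reliability rho_yy are needed
    to reach a total reliability of rho_target."""
    return rho_target * (1 - rho_yy) / (rho_yy * (1 - rho_target))

n = required_length_factor(0.80, 0.60)
print(round(n, 2), "->", math.ceil(n), "times the original length")  # 2.67 -> 3
```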

Page 44: Reliability

The Spearman-Brown Formula: The General Case

If the component tests are not parallel, then the Spearman-Brown formula will either over- or underestimate the reliability of a longer test.

An example scenario of overestimation:

• Suppose one has a ten-item test with a reliability of 0.60.

• The Spearman-Brown formula predicts that adding a parallel ten-item test will produce a total reliability of 0.75.

• But suppose the test that is added is a faulty test that has no variance.

• Effectively, we have only added a constant to every examinee's score, which does not contribute to the test's reliability.

• In this case, the total test reliability would still be 0.60.

Page 45: Reliability

The Spearman-Brown Formula: The General Case

If the component tests are not parallel, then the Spearman-Brown formula will either over- or underestimate the reliability of a longer test.

An example scenario of underestimation:

• Suppose a ten-item test has a reliability of 0.00.

• The Spearman-Brown formula predicts that doubling the test length with a parallel component would produce a reliability of 0.00.

• However, if a non-parallel test with a reliability of, say, 0.70 is added instead, then the reliability of the lengthened test will be greater than 0.00.

Page 46: Reliability

Comparison of Methods of Estimating Reliabilities

So far, we have learned several different ways to estimate the reliability of a given test.

Here is a summary of the basic principles one should use when deciding which approach is appropriate for estimating the reliability of one's test:

1. For speeded tests, one should use Test/Retest, Parallel-Forms, or Alternate-Forms reliability estimates, because most internal-consistency measures would be inaccurate for such tests.

2. Use of Cronbach's α or the Kuder-Richardson formulas produces a lower bound for a given test's reliability.

• If the components happen to be essentially τ-equivalent, then the estimated reliability equals the test's reliability.

• These estimates should only be used for homogeneous tests.

3. When using the Split-Half method, the Spearman-Brown formula can over- or underestimate a test's reliability if the components are not parallel.

• When the components are parallel, the estimate provided is very good for judging the effects of changing test length.

Page 47: Reliability

Standard Errors of Measurement & Confidence Intervals for True Scores

Recall the assumptions of classical true-score theory.

Specifically recall the distribution of observed scores for a given examinee over repeated independent testings with the same test (or with parallel tests).

This distribution is centered on \(T\) and has a standard deviation \(\sigma_E\), where \(\sigma_E\) is the standard error of measurement.

• Given this situation, unless \(\sigma_E = 0\), any single observation is unlikely to be equal to our examinee's true score.

• The observed score, however, will fall somewhere within a range of values near the true score.

Page 48: Reliability

Standard Errors of Measurement & Confidence Intervals for True Scores

The bottom chart depicts an approximately normal distribution of observed scores obtained from many independent testings of a single examinee.

Note how the scores vary, but tend to group around the examinee’s true score.

By applying our understanding of normal distributions, we can state that X% of observed scores will fall within a given range, \(T - z\,\sigma_E\) through \(T + z\,\sigma_E\).

Likewise, if we establish a bracket (range) of the same width around each observed score, then we can state that the probability that the range contains the true score is X%.

Page 49: Reliability

Standard Errors of Measurement & Confidence Intervals for True Scores

The construction of such an interval requires an estimate of \(\sigma_E\).

Typically, the examiner does not have the opportunity to administer an infinite number of tests to a given subject…

Consequently, \(\sigma_E\) is estimated by using the scores of many examinees in a given sample.

If we assume that \(\sigma_E\) is the same for every examinee in the sample, then the estimate is obtained as follows:

\[ \hat{\sigma}_E = s_X \sqrt{1 - r_{XX'}} \]

where \(s_X\) is the sample standard deviation of observed scores and \(r_{XX'}\) is the estimated reliability.
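A minimal Python sketch of this estimate (the score sample and reliability value are hypothetical):

```python
import numpy as np

def standard_error_of_measurement(scores, reliability):
    """Estimated SEM: s_E = s_X * sqrt(1 - r_XX')."""
    s_x = np.std(scores, ddof=1)
    return s_x * np.sqrt(1 - reliability)

scores = [85, 92, 78, 88, 95, 70, 81, 90]
print(round(standard_error_of_measurement(scores, reliability=0.91), 2))
```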

Page 50: Reliability

Standard Errors of Measurement & Confidence Intervals for True Scores

This consideration allows us to construct confidence intervals for true scores.

For instance, a confidence interval for the true score of an examinee can be constructed if the following assumptions are met:

1. The assumptions of classical true-score theory are met.
2. For a given examinee, errors of measurement are normally distributed.
3. \(\sigma_E\) is the same for all examinees (homoskedasticity).

Page 51: Reliability

Standard Errors of Measurement & Confidence Intervals for True Scores

When the aforementioned assumptions are met, a confidence interval for an examinee's true score can be constructed:

\[ X - z_c\,\hat{\sigma}_E \leq T \leq X + z_c\,\hat{\sigma}_E \]

which is typically written as:

\[ X \pm z_c\,\hat{\sigma}_E \]

Where: \(X\) is the observed score for a given examinee; \(\hat{\sigma}_E\) is the estimated standard error of measurement; and \(z_c\) is the critical value of the standard normal deviate at the desired probability level.
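A sketch of the interval construction (scipy's normal quantile supplies \(z_c\); the observed score and SEM values are hypothetical):

```python
from scipy import stats

def true_score_ci(observed, sem, confidence=0.95):
    """Confidence interval for a true score: X +/- z_c * SEM."""
    z_c = stats.norm.ppf(1 - (1 - confidence) / 2)
    return observed - z_c * sem, observed + z_c * sem

low, high = true_score_ci(observed=85, sem=3.2)
print(f"95% CI for T: [{low:.1f}, {high:.1f}]")  # about [78.7, 91.3]
```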

Page 52: Reliability

Standard Errors of Measurement & Confidence Intervals for True Scores

The confidence intervals for true scores can be interpreted in either of two ways:

1. The interval can be expected to contain a given examinee's true score a specified percentage of the time when the interval is constructed using observed scores that are the result of repeated independent testings of the examinee with the same test (or parallel tests).

2. The interval can be expected to cover a specified percentage of examinees' true scores when many examinees are tested once with the same test (or parallel tests) and a confidence interval is calculated for each examinee.

Page 53: Reliability

Standard Errors of Measurement & Confidence Intervals for True Scores

Tests with a high degree of measurement error will produce confidence intervals that are necessarily wider.

Less reliable tests tend to have a high degree of measurement error.

Therefore, wide confidence intervals are an indication that the observed scores are not very good estimates of true scores.

If a test has good reliability, then the confidence intervals will also be narrow, indicating good estimates of true scores.