
  • 8/6/2019 Interpreting Validity Indexes for Diagnostic Tests

    1/10

    Interpreting Validity Indexes for Diagnostic Tests: An Illustration Using the Berg Balance Test

    Physical therapists routinely make diagnostic and prognostic decisions in the course of patient care. The purpose of this clinical perspective is to illustrate what we believe is the optimal method for interpreting the results of studies that describe the diagnostic or prognostic accuracy of examination procedures. To illustrate our points, we chose the Berg Balance Test as an exemplar measure. We combined the data from 2 previously published research reports designed to determine the validity of the Berg Balance Test for predicting risk of falls among elderly people. We calculated the most common validity indexes, including sensitivity, specificity, predictive values, and likelihood ratios for the combined data. Clinical scenarios were used to demonstrate how we believe these validity indexes should be used to guide clinical decisions. We believe therapists should use validity indexes to decrease the uncertainty associated with diagnostic and prognostic decisions. More studies of the accuracy of diagnostic and prognostic tests used by physical therapists are urgently needed. [Riddle DL, Stratford PW. Interpreting validity indexes for diagnostic tests: an illustration using the Berg Balance Test. Phys Ther. 1999;79:939–948.]

    Key Words: Diagnosis; Tests and measurements, general.

    Physical Therapy . Volume 79 . Number 10 . October 1999 939

    Clinical Perspective

    Daniel L Riddle

    Paul W Stratford

  • 8/6/2019 Interpreting Validity Indexes for Diagnostic Tests

    2/10

    Physical therapists routinely perform diagnostic tests on their patients. For diagnostic test results to be most useful, we contend that validity estimates from studies of the diagnostic test in question should be used to guide clinical decisions. The purpose of this perspective is to describe a conceptual model proposed by other authors1,2 for the application of validity indexes for diagnostic (or prognostic) tests to clinical practice. We use a clinical illustration to demonstrate how measures, which we refer to as validity indexes (ie, sensitivity, specificity, positive and negative predictive values, likelihood ratios), can be interpreted for individual patients. The illustration combines data from 2 studies on the use (validity) of the Berg Balance Test (BBT) for predicting risk of falls among elderly people aged 65 to 94 years.3,4 The illustration is meant only to demonstrate how validity indexes can be useful for practice and not necessarily to assist clinicians in the examination of patients suspected of having balance disorders.

    Studies that can be used to determine whether meaningful clinical inferences can be made based on diagnostic tests are classified as criterion-related validity studies.5 Criterion-related validity studies take 1 of 2 forms. Researchers can compare a clinical measure with a gold standard measure (ideally, a valid diagnostic test or a definitive measure of whether the condition of interest is truly present) obtained at about the same time as the measure being studied. In our illustration, the patient's report of falling is considered the gold standard measure. In other cases, a gold standard measure may be a diagnosis made at the time of surgery or via an invasive diagnostic procedure. Studies in which some form of gold standard is obtained at about the same time as the diagnostic test being studied are commonly called concurrent criterion-related validity studies.5 Researchers can also compare a measure's prediction of a future event with what actually happens to a patient in the future. These studies are commonly termed predictive criterion-related validity studies.5

    Studies designed to estimate the risk of a future adverse event are often used by clinicians to make judgments about prognoses. For example, investigating whether the BBT can be used to predict whether a person will fall in the future is an illustration of a predictive criterion-related validity study. The gold standard for this type of study would be the subject's report of falls for a period of time following administration of the BBT.

    The Berg Balance Test

    The BBT was designed to be an easy-to-administer, safe, simple, and reasonably brief measure of balance for elderly people. The developers expressed the hope that the BBT would be used to monitor the status of a patient's balance and to assess disease course and response to treatment.6 Patients are asked to complete 14 tasks, and each task is rated by an examiner on a 5-point scale ranging from 0 (cannot perform) to 4 (normal performance). Elements of the test are supposed to be representative of daily activities that require balance, including tasks such as sitting, standing, leaning over, and stepping. Some tasks are rated according to the quality of the performance of the task, whereas the time taken to complete the task is measured for other tasks. The developers of the BBT provided operational definitions for each task and the criteria for grading each task. Overall scores can range from 0 (severely impaired balance) to 56 (excellent balance).

    Data exist to support the reliability of BBT scores obtained from elderly subjects.3,6,7 For example, Bogle Thorbahn and Newton3 reported an intertester reliability (Spearman rho) value of .88 for 17 subjects aged 69 to 94 years. Evidence also exists to support the content validity,6 construct validity,7,8 and criterion-related validity3,4,8 of test scores for inferring fall risk in elderly subjects tested in a variety of settings. Construct validity has been assessed using a variety of approaches. For example, construct validity was supported to the extent that BBT scores were shown to correlate reasonably well with other measures of balance (Pearson r = .38–.91) and measures of motor performance (Pearson r =

    DL Riddle, PhD, PT, is Associate Professor, Department of Physical Therapy, Medical College of Virginia Campus, Virginia Commonwealth

    University, 1200 E Broad, Richmond, VA 23298-0224 (USA) ([email protected]). Address all correspondence to Dr Riddle.

    PW Stratford, PT, is Associate Professor, School of Rehabilitation Science, and Associate Member, Department of Clinical Epidemiology and

    Biostatistics, McMaster University, Hamilton, Ontario, Canada.

    Concept, writing, and data analysis were provided by Riddle and Stratford. Consultation (including review of manuscript before submitting) was

    provided by Cheryl Ford-Smith, Susan Cromwell, Dr Roberta Newton, and Dr Anne Shumway-Cook.

    This article was submitted December 7, 1998, and was accepted July 7, 1999.

    How can clinicians use diagnostic test studies to guide clinical decisions for individual patients?


    .62–.94).7,8 For example, the Pearson r correlation between the BBT and the balance subscale of the Tinetti Performance-Oriented Mobility Assessment9 was .91.8 The Pearson r correlation between the BBT and the Barthel Index mobility subscale10 was .67.8

    The Illustration

    To illustrate how to interpret validity indexes, we have combined data from 2 studies3,4 designed to determine whether BBT scores could identify elderly people (age range = 65–94 years) who are at risk for falling. Subjects in both studies were of similar ages and had similar BBT scores, and the proportions of male and female subjects were also similar (Tab. 1). In both studies, the subjects reported whether they had fallen and the number of falls in the 6 months prior to being admitted to the study. In addition, for both studies, the authors appeared to use essentially the same definition for what constituted a fall. Bogle Thorbahn and Newton3 defined a fall as an unexpected contact of any part of the body with the ground. Shumway-Cook and colleagues4 defined a fall as any event that led to an unplanned, unexpected contact with a supporting surface.

    The 2 studies differed in 2 potentially important ways. First, Shumway-Cook et al4 excluded subjects with comorbidities that may have affected balance. Bogle Thorbahn and Newton3 did not exclude these types of subjects. Subjects in the study by Shumway-Cook and colleagues reported no comorbidities, whereas 38% of the subjects in the study by Bogle Thorbahn and Newton reported having diagnoses of neurological or orthopedic conditions. Second, subjects in the study by Shumway-Cook et al were required to have fallen at least twice in the previous 6 months, whereas subjects in the study by Bogle Thorbahn and Newton had to have fallen only once or more in the previous 6 months. It is unclear how these differences affected the validity estimates reported by these authors, but we believe the studies were similar enough to allow us to combine the data for the illustration in this article. It is also unclear why the proportion of fallers (50%) in the study by Shumway-Cook et al was much higher than the proportion of fallers (17%) in the study by Bogle Thorbahn and Newton.

    Diagnostic Test Methodology

    We believe that the subjects studied (the sample) should represent those types of patients who will be measured during clinical practice.11 In our illustration, the sample of subjects was elderly people (ages ranging from 65 to 94 years) living independently. Some patients will have the disorder of interest (using our illustration, some subjects reported falls), and some patients will not have the disorder of interest (some reported no falls). The test being studied (ie, the BBT) and the gold standard or criterion measure (ie, determination of whether the subject had fallen in the past 6 months) are applied to all subjects, and the test's diagnostic accuracy (Tab. 2) is determined.

    The results from diagnostic accuracy studies are often summarized in a format similar to that shown in Table 2.12-14 In this table, the terms "condition present" and "condition absent" are used to identify people who truly have or do not have the condition of interest (the gold standard test is either positive or negative). The letters "a," "b," "c," and "d" are used to reference cells in the table, and the sums a + b, c + d, a + c, b + d, and a + b + c + d denote marginal values. The cell values and marginal values are combined in various ways to calculate validity indexes. Definitions of terms related to diagnostic testing and formulas for the many validity indexes also are presented in Table 2.

    Sensitivity and Specificity

    Sensitivity indicates how often a diagnostic test detects a disease or condition when it is present. Sensitivity essentially tells the clinician how good the test is at correctly identifying patients with the condition of interest. Specificity indicates how often a diagnostic test is negative in the absence of the disease or condition. Specificity essentially tells the clinician how good the test is at correctly identifying the absence of disease.15 The closer the sensitivity or specificity is to 100%, the more sensitive or specific the test.

    The authors of both studies in our illustration reported the sensitivity and specificity of the BBT for determining current fall risk. Berg et al8 contended that the best way

    Table 1. Characteristics of the Subjects Combined From Two Studies3,4

    Characteristic                  Study of Shumway-Cook       Study of Bogle Thorbahn
                                    and colleagues4 (N = 44)    and Newton3 (N = 66)
    Age (y)
      Mean                          76.1                        79.2
      SD                            6.6                         6.2
      Range                         65–94                       69–94
    Sex (%)
      Male                          27                          24
      Female                        73                          76
    Berg Balance Test
      Mean                          46.1                        48.2
      SD                            10.5                        9.9
      Range                         18–56                       9–56
    Gold standard classification
      of fallers (%)                50                          17


    to interpret scores on the BBT is to use a single cutoff point of 45 to differentiate those at risk for falls (those with scores of ≤45) and those who are not at risk for falls (those with scores of >45). Using a cutoff point of 45, as recommended by Berg et al, the sensitivity for the data collected by Shumway-Cook and colleagues4 was 55% and the specificity was 95%. For the data collected by Bogle Thorbahn and Newton,3 the sensitivity was 82% and the specificity was 87%. When we combined the data from both studies, a cutoff point of 45 yielded a sensitivity of 64% and a specificity of 90% (Tab. 3). A sensitivity of 64% indicates that 64% of subjects who were true fallers had a positive BBT (a score of ≤45). That is, approximately a third of the subjects who were fallers were missed by the BBT. Although there are no agreed-on standards for judging sensitivity and specificity, we believe the sensitivity of 64% should generally be considered quite low because more than a third of the subjects were misclassified.

    A specificity of 90% indicates that 90% of subjects who were nonfallers had a negative BBT (a score of >45). That is, only 10% of the nonfallers were missed by the BBT. Specificity was much higher than sensitivity, indicating that the BBT does a better job of identifying subjects who are not fallers than subjects who are fallers.

    When we use diagnostic tests, we do not know who has the condition of interest and who does not have the condition of interest. That is, sensitivity and specificity have somewhat limited usefulness because they do not describe validity in the context of the test result.1 Rather, they describe validity in the context of the gold standard, a value we do not know when we do diagnostic tests. Sensitivity, for example, does not take into account the false positive test results (Tab. 2) on a group of patients. Stated another way, sensitivity does not describe how often patients with positive tests have the disorder of interest. Sensitivity only describes the proportion of patients with the disorder of interest who have a positive test. Similarly, specificity does not take into account false negative test results (Tab. 2). Specificity does not describe how often patients with negative tests do not have the disorder of interest. Specificity only describes the proportion of patients without the disorder of interest who have a negative test.

    Diagnostic testing, in our view, is used because clinicians want to know the probability of the condition existing. Because clinicians make decisions based on diagnostic test results and not necessarily on results of tests that are considered gold standards, some authors1 have contended that positive and negative predictive values (see

    Table 2. Two × Two Table, Formulas, and Definitions for Validity Indexesa

    Diagnostic          Gold Standard Test Result
    Test Result         Condition Present       Condition Absent        Total
    Positive            True Positive (a)       False Positive (b)      a + b
    Negative            False Negative (c)      True Negative (d)       c + d
    Total               a + c                   b + d                   a + b + c + d

    Sensitivity: Those people correctly identified by the test as having the condition of interest as a percentage of all those who truly have the condition of interest: [100% × (a/[a + c])].

    Specificity: Those people correctly identified by the test as not having the condition of interest as a percentage of all those who truly do not have the condition of interest: [100% × (d/[b + d])].

    False Positive Rate: Those people falsely identified by the test as having the condition of interest as a percentage of all patients without the condition of interest: [100% × (b/[b + d])].

    False Negative Rate: Those people falsely identified by the test as not having the condition of interest as a percentage of all patients with the condition of interest: [100% × (c/[a + c])].

    Positive Predictive Value: Those people correctly identified by the test as having the condition of interest as a percentage of all those identified by the test as having the condition of interest: [100% × (a/[a + b])].

    Negative Predictive Value: Those people correctly identified by the test as not having the condition of interest as a percentage of all those identified by the test as not having the condition of interest: [100% × (d/[c + d])].

    Diagnostic Accuracy: The percentage of people who are correctly diagnosed: [100% × (a + d)/(a + b + c + d)].

    Prevalence: The percentage of people in a target population who truly have the condition of interest: [100% × (a + c)/(a + b + c + d)].

    Likelihood Ratio for a Positive Test: Sensitivity divided by (1 − specificity): [{a/(a + c)}/{b/(b + d)}].

    Likelihood Ratio for a Negative Test: (1 − sensitivity) divided by specificity: [{c/(a + c)}/{d/(b + d)}].

    Pretest Probability of the Disorder: The therapist's estimate of the patient's chance of having the disorder (condition of interest) prior to the therapist doing the test. It is usually estimated by the clinician based on prior knowledge and experience.

    Posttest Probability of the Disorder: The patient's chance of having the condition of interest after the results of the test are obtained.

    a All definitions agree with the Standards for Tests and Measurements in Physical Therapy Practice.5 Definitions for sensitivity, specificity, false positive rate, false negative rate, positive predictive value, and negative predictive value are derived from the Standards for Tests and Measurements in Physical Therapy Practice.5 Definitions for diagnostic accuracy, prevalence, likelihood ratio for a positive test, likelihood ratio for a negative test, pretest probability of the disorder, and posttest probability of the disorder are derived from Sackett and colleagues.1,2
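The Table 2 formulas translate directly into a small function. The sketch below is our own illustration (not from the article); as example input it uses the combined-data cell counts for the 45-point cutoff reported in Table 3 (a = 21, b = 8, c = 12, d = 69).

```python
def validity_indexes(a, b, c, d):
    """Compute the Table 2 validity indexes from 2 x 2 cell counts.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    n = a + b + c + d
    sens = a / (a + c)  # sensitivity as a proportion
    spec = d / (b + d)  # specificity as a proportion
    return {
        "sensitivity": 100 * sens,
        "specificity": 100 * spec,
        "false_positive_rate": 100 * b / (b + d),
        "false_negative_rate": 100 * c / (a + c),
        "positive_predictive_value": 100 * a / (a + b),
        "negative_predictive_value": 100 * d / (c + d),
        "diagnostic_accuracy": 100 * (a + d) / n,
        "prevalence": 100 * (a + c) / n,
        "lr_positive": sens / (1 - spec),
        "lr_negative": (1 - sens) / spec,
    }

# Combined BBT data at the 45-point cutoff (Tab. 3): a=21, b=8, c=12, d=69
idx = validity_indexes(21, 8, 12, 69)
print(round(idx["sensitivity"]))                # 64
print(round(idx["specificity"]))                # 90
print(round(idx["positive_predictive_value"]))  # 72
```

Run on the combined counts, the function reproduces the values reported in the article: sensitivity 64%, specificity 90%, positive predictive value 72%, negative predictive value 85%, and a positive likelihood ratio of about 6.1.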


    next section) are more important than sensitivity and specificity for clinical practice.

    Positive and Negative Predictive Values

    Before diagnostic testing, therapists usually have collected a variety of information (eg, medical history, some examination data) from the patient. Based on their knowledge, training, and experience, therapists can sometimes use these data, depending on what is known about various conditions, to estimate the probability the condition of interest is present. This is known as the pretest probability of the disorder.1 For example, if a therapist found that an elderly patient had a history of dizziness and required assistance with most activities of daily living, the therapist might anticipate that the patient's risk of falling was quite high, say on the order of 60%. Because the therapist knew evidence existed to indicate that dizziness16 and difficulty with home activities of daily living17 increase fall risk, the therapist estimated the pretest probability for falls to be quite high. The pretest probability estimate of 60% is only an estimate and may contain some error. The therapist could then do a BBT to better estimate the patient's risk of falling. Positive and negative predictive values describe the probability of disease after the test is completed. The probability of the condition of interest after the test result is obtained is also known as the posttest probability of the disorder.1

    For many clinicians, the idea of estimating the probability of a disorder prior to doing a diagnostic test (pretest probability) may seem like a new or unusual concept. We believe that some clinicians, based on their experience and training, may use an ordinal-based scale estimate of pretest probability, such as the disease is "highly likely," "somewhat likely," or "not very likely" given the patient's signs and symptoms. In our view, however, using percentage estimates of pretest probability is not commonly done by most therapists. We suggest that therapists should make percentage estimates of the pretest probability of the disorder of interest. For example, if a clinician used an ordinal scale similar to the one just described, we contend that the clinician should convert it to a percentage estimate of pretest probability in the following way. If the pretest probability of the disorder were judged to be "highly likely," this judgment could be converted to a 75% pretest probability, whereas a rating of "somewhat likely" could be converted to a pretest probability of 50%. A rating of "not very likely" might be converted to a pretest probability of 25%. We believe that, as therapists become more comfortable with making percentage estimates of pretest probability, they will become more accurate, although we have no data to support this argument. By using percentage estimates for pretest probability, therapists can take full advantage of positive and negative predictive values (and likelihood ratios, to be discussed elsewhere in this article) reported in the literature. Several examples are discussed elsewhere in this article to illustrate how pretest probability can be estimated and how these estimates can influence the interpretation of the diagnostic test.
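A percentage pretest probability is usually combined with a test result through odds and likelihood ratios, following Sackett and colleagues1,2; the exact arithmetic is not spelled out in this article, so the sketch below is our illustration. It assumes the combined-data likelihood ratios for the 45-point cutoff given in Table 4 (positive likelihood ratio = 6.1, negative likelihood ratio = 0.4).

```python
def posttest_probability(pretest_prob, likelihood_ratio):
    """Convert a pretest probability to a posttest probability via odds.

    posttest odds = pretest odds x likelihood ratio
    """
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# A "somewhat likely" pretest estimate of 50%, with a positive BBT
# (score of 45 or less) and LR+ = 6.1 for the 45-point cutoff (Tab. 4):
p_pos = posttest_probability(0.50, 6.1)
# The same pretest estimate with a negative BBT and LR- = 0.4:
p_neg = posttest_probability(0.50, 0.4)
print(round(100 * p_pos))  # 86
print(round(100 * p_neg))  # 29
```

With a 50% pretest probability, a positive BBT raises the probability of being a faller to about 86%, and a negative BBT lowers it to about 29%.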

    Positive predictive value is the proportion of patients with a positive test who have the condition of interest.1 Negative predictive value is the proportion of patients with a negative test who do not have the condition of interest.1 The closer the positive predictive value is to 100%, the more likely the disease is present with a positive test finding. The closer the negative predictive value is to

    Table 3. Sensitivity and Specificity for Four Cutoff Points of the Berg Balance Test (BBT)

    2 × 2 cell counts for each cutoff point (a = fallers with a positive test, b = nonfallers with a
    positive test, c = fallers with a negative test, d = nonfallers with a negative test):

    BBT Cutoff Point     a      b      c      d      Sensitivity a/(a + c)     Specificity d/(b + d)
    40                  15      3     18     74      45%                       96%
    45                  21      8     12     69      64%                       90%
    50                  28     21      5     56      85%                       73%
    55                  32     57      1     20      97%                       26%


    100%, the more likely the disease is absent with a negative test finding.

    In our illustration, the combined data from both studies yielded a positive predictive value of 72% when using a cutoff point of 45 on the BBT (Tab. 4). A positive predictive value of 72% indicates that 72% of patients with a positive test (a BBT score of ≤45) were classified as fallers (the gold standard) and 28% of the patients were misclassified as fallers based on the BBT, an error rate that we consider to be fairly high. A negative predictive value of 85% indicates that 85% of patients with a negative test (a BBT score of >45) were classified as nonfallers (the gold standard). Our misclassification rate for nonfallers is less than for fallers (ie, we can be more confident about identifying nonfallers than fallers based on BBT test results).

    As with sensitivity and specificity, no standard exists for what constitutes an acceptable level of positive or negative predictive value. In addition, interpretations of predictive values, sensitivity, and specificity are not always straightforward. In the next section, we attempt to describe the critical issues that we believe should be considered when interpreting validity indexes.

    Issues Related to the Interpretation of Sensitivity, Specificity, and Predictive Values

    Some tests have a binary outcome (2 mutually exclusive categories such as "present" or "absent"), but many other test results are reported on an ordinal scale (such as the manual muscle test) or a continuous scale (such as the BBT). When using sensitivity, specificity, and predictive values, the researcher is forced to dichotomize results for ordinal and continuous measures (such as the BBT) and, therefore, may lose information about the usefulness of the test. One example is the use of a single cutoff point of 45 for the BBT. We will show later how some researchers have dealt with the problem of only one cutoff point for continuous measures.

    The choice of the cutoff point influences the sensitivity, specificity, and positive and negative predictive values. This concept is illustrated in Table 4. For example, if the cutoff point for the BBT were set at 40, the sensitivity would be 45% and the specificity would be 96%. With a cutoff point of 50, the sensitivity is 85% and the specificity is 73%. Generally, the choice of cutoff point by the researcher will increase one validity index (eg, sensitivity) but will decrease the other validity index (eg, specificity). For example, when sensitivity rises (as seen when going from a cutoff point of 40 to a cutoff point of 50 on the BBT), specificity falls. The same concept holds for positive and negative predictive values. When the positive predictive value rises (as seen when going from a cutoff point of 50 to a cutoff point of 40 on the BBT), the negative predictive value falls (Tab. 4).
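This trade-off can be verified by recomputing sensitivity and specificity from the 2 × 2 cell counts in Table 3. The short sketch below (ours, for illustration) sweeps the four published cutoff points:

```python
# 2 x 2 cell counts (a, b, c, d) from Table 3 for each BBT cutoff point:
# a = fallers testing positive, b = nonfallers testing positive,
# c = fallers testing negative, d = nonfallers testing negative.
counts = {
    40: (15, 3, 18, 74),
    45: (21, 8, 12, 69),
    50: (28, 21, 5, 56),
    55: (32, 57, 1, 20),
}

results = {}
for cutoff, (a, b, c, d) in counts.items():
    sensitivity = 100 * a / (a + c)
    specificity = 100 * d / (b + d)
    results[cutoff] = (sensitivity, specificity)
    print(f"cutoff {cutoff}: sensitivity {sensitivity:.0f}%, "
          f"specificity {specificity:.0f}%")
```

As the cutoff rises from 40 to 55, sensitivity climbs from 45% to 97% while specificity falls from 96% to 26%, exactly the pattern described above.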

    The principal factor influencing the clinician's choice of a cutoff point is related to the consequence of misclassifying patients. Broadly speaking, there are 3 choices for a cutoff point: (1) maximize both sensitivity and specificity, (2) maximize sensitivity at the cost of minimizing specificity, and (3) maximize specificity at the cost of minimizing sensitivity. Maximizing sensitivity and specificity is appropriate when the consequences of false positives and false negatives are about equal. Maximizing

    Table 4. Validity Estimates for Several Different Cutoff Points of the Berg Balance Test

    BBT       Positive          Negative          Sensitivity     Specificity     Positive          Negative
    Cutoff    Predictive        Predictive        (95% CI)        (95% CI)        Likelihood        Likelihood
    Point     Value (95% CIa)   Value (95% CI)                                    Ratio (95% CI)    Ratio (95% CI)
    35        77% (54–100)      67% (58–76)       30% (14–46)     96% (92–100)    7.8 (2.3–26.4)    0.7 (0.6–0.9)
    40        83% (66–100)      67% (57–77)       45% (28–62)     96% (92–100)    11.7 (3.6–37.6)   0.6 (0.4–0.8)
    45        72% (56–88)       85% (77–93)       64% (48–80)     90% (83–97)     6.1 (3.0–12.4)    0.4 (0.3–0.6)
    50        57% (43–71)       92% (85–99)       85% (73–97)     73% (63–83)     3.1 (2.1–4.6)     0.2 (0.1–0.5)
    55        36% (26–46)       95% (86–100)      97% (91–100)    26% (16–36)     1.3 (1.1–1.5)     0.1 (0.02–0.8)
    60        30% (21–39)       100% (5–100)      100% (91–100)   1% (0–3)        1.01 (1–1.04)     Undefined

    a CI = confidence interval.


    sensitivity at the cost of minimizing specificity is desirable when the consequence of a false negative (eg, falsely identifying a subject as a nonfaller) exceeds the consequence of a false positive (eg, falsely identifying the subject as a faller). Conversely, maximizing specificity at the cost of minimizing sensitivity is desirable when the consequence of a false positive exceeds the consequence of a false negative. In the case of the BBT, it would appear that sensitivity should be optimized to avoid classifying a faller as a nonfaller. Misclassifying fallers would appear to have serious consequences (eg, fractures).

    An important advantage associated with the use of sensitivity and specificity is that they are not influenced by prevalence. Prevalence is defined as the proportion of patients with the disorder of interest among all patients tested.1 A therapist can use sensitivity and specificity estimates from a published report and apply these estimates to a patient as long as the patient is reasonably similar to the subjects in the study.

    Predictive values should guide clinical decisions (they estimate validity in the context of the test result), but unlike sensitivity and specificity, predictive values are prevalence dependent.1 That is, as the proportion of those with the disease changes, predictive values also change. Predictive values, therefore, vary when the prevalence of the disorder of interest changes. As the prevalence increases, the positive predictive value increases and the negative predictive value decreases. When the prevalence decreases, the positive predictive value decreases and the negative predictive value increases.

    Because the chance that an individual patient will have a target disorder varies (ie, the pretest probability changes depending on the patient's signs and symptoms), the prevalence associated with a diagnostic accuracy study may not apply to a given patient. For example, in the study by Shumway-Cook et al,4 there was a prevalence of fallers of 50%. If, for example, a clinician estimated the pretest probability of falling for a patient to be only 10%, the predictive values from the data of Shumway-Cook et al would not provide accurate estimates of positive or negative predictive values for the patient. The positive predictive value from the data of Shumway-Cook and colleagues would be spuriously high (because of the higher prevalence), and the negative predictive value would be spuriously low for the patient with a pretest probability of 10%.
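This prevalence dependence follows directly from the 2 × 2 arithmetic. The sketch below (our illustration) holds sensitivity and specificity fixed at the values Shumway-Cook et al4 reported for the 45-point cutoff (55% and 95%) and recomputes the predictive values at the study prevalence of 50% and at the hypothetical 10% pretest probability:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV at a given prevalence (all inputs as proportions)."""
    tp = sensitivity * prevalence              # true positives per unit tested
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    tn = specificity * (1 - prevalence)        # true negatives
    fn = (1 - sensitivity) * prevalence        # false negatives
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

# Sensitivity 55% and specificity 95% (Shumway-Cook et al, 45-point cutoff)
ppv50, npv50 = predictive_values(0.55, 0.95, 0.50)
ppv10, npv10 = predictive_values(0.55, 0.95, 0.10)
print(f"prevalence 50%: PPV {ppv50:.0%}, NPV {npv50:.0%}")
print(f"prevalence 10%: PPV {ppv10:.0%}, NPV {npv10:.0%}")
```

Dropping the prevalence from 50% to 10% cuts the positive predictive value from about 92% to 55%, while the negative predictive value rises from about 68% to 95%, which is why applying the study's predictive values directly to a low-risk patient would be misleading.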

    Unfortunately, predictive values are influenced by prevalence, whereas sensitivity and specificity are not. Sensitivity and specificity, however, are related to positive and negative predictive values in the following way. When specificity is high, the positive predictive value tends to be high, and when sensitivity is high, the negative predictive value tends to be high. That is, when sensitivity is high, a negative test generally indicates the disorder is not present (or, in our illustration, the person is not at risk of falling). When specificity is high, a positive test generally indicates the disorder is present (the person is at risk of falling).2 Table 4 illustrates this concept. When specificity is high, for example, for a BBT cutoff point of 40 (96%), the positive predictive value will generally be high (83%). A clinician might hypothetically believe, for example, that based on medical history and examination data, a patient had a pretest probability of falling of approximately 40%, and the patient might subsequently have a score of 37 on the BBT, a score considered positive using a cutoff point of 40 (Tab. 4). The positive predictive value would be 83%, an increase of 43 percentage points from the pretest probability. We contend that the clinician can be reasonably confident the patient is a faller.

    Similarly, when sensitivity is high (97% for a cutoff point of 55), the negative predictive value will also generally be high (95%). For example, a clinician might believe, based on a patient's medical history and examination data, that the patient had a pretest probability of falling of approximately 40% (or a pretest probability of not falling of 60%). The patient might subsequently have a score of 56 on the BBT, a score considered negative using a cutoff point of 55 (Tab. 4). The negative predictive value (posttest probability) in this hypothetical example would be 95%, and we argue that the clinician can be very confident the patient is not a faller. We noted earlier that predictive values are dependent on prevalence, and in our examples, the prevalence (pretest probability) for falls was estimated to be 40%, a reasonable approximation of the prevalence reported in our illustration using the BBT data. Had the pretest probabilities for the patient examples been appreciably lower or higher, the predictive values reported in the 2 examples above would not have been accurate estimates of posttest probability.

In summary, sensitivity and specificity are not dependent on prevalence and are therefore seen as useful for clinical practice.1 As a general guide, we believe clinicians should conclude the condition is likely to be present when a test is positive and the specificity for the test is high. Conversely, clinicians should conclude the condition is likely to be absent when a test is negative and the sensitivity for the test is high.1,2 Positive and negative predictive values are, in part, prevalence dependent. As a result, we argue that predictive values are meaningful only when the prevalence reported in a study approximates the pretest probability of the disorder the clinician has estimated for the patient. To be most accurate, pretest probability estimates should be based on sound scientific data.
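All 4 of these indexes follow directly from the counts in a 2-by-2 table. The sketch below is ours rather than part of the original analysis; the counts come from the combined BBT data dichotomized at a cutoff point of 40 (15 of 33 fallers and 3 of 77 nonfallers scored below 40).

```python
def validity_indexes(tp, fp, fn, tn):
    """Compute the four basic validity indexes from 2x2 table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives / all with the disorder
        "specificity": tn / (tn + fp),  # true negatives / all without the disorder
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Combined BBT data, cutoff point of 40 (scores below 40 = positive test)
indexes = validity_indexes(tp=15, fp=3, fn=18, tn=74)
print(f"sensitivity = {indexes['sensitivity']:.0%}")  # 45%
print(f"specificity = {indexes['specificity']:.0%}")  # 96%
print(f"PPV = {indexes['ppv']:.0%}")                  # 83%
print(f"NPV = {indexes['npv']:.0%}")                  # 80%
```

Note that the 83% positive predictive value matches the example above only because the pretest probability assumed there (40%) is close to the prevalence of fallers in the combined sample; with a different prevalence, the predictive values (but not the sensitivity or specificity) would change.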

Confidence Intervals for Validity Indexes

Sensitivity, specificity, positive and negative predictive values, and likelihood ratios represent point estimates of population values.15 Point estimates are estimations of the true value for the index of interest. To determine the accuracy of a point estimate, confidence intervals (CIs) are calculated.15 Confidence intervals indicate how closely a study's point estimate of these values approximates the population values.15 Confidence intervals essentially describe for clinicians how confident they can be about a point estimate. For example, if sensitivity was 80%, with a 95% CI of 70% to 90%, the true value for sensitivity in the population (with 95% certainty) lies between 70% and 90%. The width of a CI becomes narrower as the sample size increases, and it becomes wider as the sample size decreases.15 In addition, the width is dependent on the variability of the measure within the population.15 The degree of confidence we place on these validity estimates can be calculated.1,18 In our view, studies that examine the validity of diagnostic tests should provide CI estimates.

For example, the 95% CI for specificity reported by Bogle Thorbahn and Newton3 ranged from 67% (not very specific) to 100% (perfect specificity). The 95% CI for specificity for the combined data from the studies of Bogle Thorbahn and Newton3 and Shumway-Cook et al4 ranged from 83% to 97% (both values, in our opinion, represent reasonably high specificity).

Likelihood Ratios

Positive and negative likelihood ratios are 2 additional validity indexes for diagnostic tests. Likelihood ratios have been proposed to be more efficient and more powerful than sensitivity, specificity, and predictive values.15,19 Likelihood ratios essentially combine the benefits of both sensitivity and specificity into one index.1 Likelihood ratios indicate by how much a given diagnostic test result will raise or lower the pretest probability of the target disorder.20 Likelihood ratios are reported in a decimal number format rather than as percentages. A likelihood ratio of 1 means the posttest probability (probability of the condition after the test results are obtained) for the target disorder is the same as the pretest probability (probability of the condition before the test was done). Likelihood ratios greater than 1 increase the chance the target disorder is present, whereas likelihood ratios less than 1 decrease the chance the target disorder is present.20

Jaeschke and colleagues20 proposed the following guide to interpreting likelihood ratios. Likelihood ratios greater than 10 or less than 0.1 generate large and often conclusive changes from pretest to posttest probability. Likelihood ratios between 5 and 10 or between 0.2 and 0.1 generate moderate changes from pretest to posttest probability. Likelihood ratios from 2 to 5 and from 0.5 to 0.2 result in small (but sometimes important) shifts in probability, and likelihood ratios from 0.5 to 2 result in small and rarely important changes in probability.

Because likelihood ratios can be applied to score intervals for tests with continuous measures, we believe they are more useful than sensitivity, specificity, and predictive values, which are limited to data presented in a dichotomous format. For example, the positive likelihood ratio for the score interval of 40 to 44 (a test score considered positive based on recommendations of Berg and colleagues8) is 2.8 (Tab. 5). This likelihood ratio indicates that a patient with a BBT score between 40 and 44 is 2.8 times more likely to be a faller than a nonfaller. The 95% CI ranges from 0.9 to 8.5. That is, the 95% CI overlaps 1 (no change in the probability of the disorder); therefore, a clinician cannot be very confident that a score between 40 and 44 increases the probability of identifying a patient at risk for falls. If a patient scores below 40 on the BBT, however, the likelihood ratio increases to 11.7 (95% CI=3.6-37.6). A patient with a BBT score below 40 is at greater risk for falls as compared with patients with scores between 40 and 44. On average, patients with BBT scores less than 40 are almost 12 times more likely to be a faller than a nonfaller.

Likelihood ratios should not be confused with odds ratios. Odds ratios are an estimate of risk often expressed in case-control studies designed to investigate causation of a disease.

Table 5.
Positive Likelihood Ratios for Several Different Intervals of Berg Balance Test Scores

Berg Balance   Gold Standard Positive     Gold Standard Negative     Positive Likelihood
Test Result    Number   Proportion        Number   Proportion        Ratio (95% CI)a

<40            15       15/33 = 0.455      3       3/77 = 0.039      11.7 (3.6-37.6)
40-44           6       6/33 = 0.182       5       5/77 = 0.065       2.8 (0.9-8.5)
45-49           7       7/33 = 0.212      13       13/77 = 0.169      1.3 (0.5-2.9)
50-54           4       4/33 = 0.121      36       36/77 = 0.467      0.3 (0.1-0.7)
>54             1       1/33 = 0.030      20       20/77 = 0.260      0.1 (0.02-0.8)
Total          33                         77

a CI = confidence interval.
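The interval likelihood ratios in Table 5, and their CIs, can be reproduced from the raw counts. The sketch below is ours; it assumes the conventional log-method approximation for the CI of a likelihood ratio, ln(LR) +/- 1.96 x sqrt(1/a - 1/n1 + 1/b - 1/n2), where a fallers and b nonfallers fall in the score interval.

```python
import math

def interval_lr(a, b, n_pos=33, n_neg=77, z=1.96):
    """Likelihood ratio for a score interval, with a 95% CI by the log method.

    a = fallers in the interval, b = nonfallers in the interval;
    n_pos and n_neg default to the combined BBT totals from Table 5.
    """
    lr = (a / n_pos) / (b / n_neg)
    se_log = math.sqrt(1/a - 1/n_pos + 1/b - 1/n_neg)
    lo = math.exp(math.log(lr) - z * se_log)
    hi = math.exp(math.log(lr) + z * se_log)
    return lr, lo, hi

for label, a, b in [("<40", 15, 3), ("40-44", 6, 5), ("45-49", 7, 13),
                    ("50-54", 4, 36), (">54", 1, 20)]:
    lr, lo, hi = interval_lr(a, b)
    print(f"{label:>6}: LR = {lr:.1f} (95% CI = {lo:.1f}-{hi:.1f})")
```

For scores below 40, for example, this yields 11.7 with a 95% CI of 3.6 to 37.6, matching Table 5; the other intervals agree to within rounding.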



der (falls, in this case) would be relatively low, perhaps on the order of 20%. The patient then had a BBT done, and a score of 50 (a negative test, using a cutoff point of 50) was obtained. The negative likelihood ratio for a cutoff point of 50 is 0.2 (Tab. 4). We align a ruler with the left column of the nomogram (Figure) at 20 (20% pretest probability) and with the middle column at a likelihood ratio of approximately 0.2. We find that the posttest probability of current fall risk for this patient is approximately 5%, an improvement of 15 percentage points from the pretest probability (the chance of the patient being a faller has gone from 20% down to 5%). Hypothetically, we substantially increased our level of certainty about the patient's current risk of falling based on the BBT score.

Our second hypothetical example is about a 75-year-old man who was diagnosed with congestive heart failure approximately 5 years previously and requires assistance with some activities of daily living. He reports losing his balance occasionally and remembers falling once in the past few years. Based on the patient's medical history and functional status, the pretest probability for falls would be fairly high (ie, on the order of 50%). A BBT was done, and a score of 38 (a positive test, using a cutoff point of 40) was obtained. Using the data in Table 4, the positive likelihood ratio for a score of less than 40 is 11.7. That is, this patient is 11.7 times more likely to be a faller than a nonfaller. Using the nomogram shown in the Figure, the posttest probability for current fall risk is approximately 92%, an increase of 42 percentage points above the pretest probability. If we believe our data are correct and our estimates are appropriate, we can theoretically be confident that we have identified a patient who has a very high probability of falling. We again appear to have substantially increased our level of certainty about the patient's risk of falling.
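The ruler-and-nomogram step is simply Bayes' theorem in odds form (posttest odds = pretest odds x likelihood ratio), so both hypothetical examples above can be checked arithmetically. A minimal sketch of that calculation, ours rather than the original article's:

```python
def posttest_probability(pretest_prob, likelihood_ratio):
    """Convert a pretest probability to a posttest probability
    via the odds form of Bayes' theorem."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# First example: 20% pretest probability, negative likelihood ratio of 0.2
print(f"{posttest_probability(0.20, 0.2):.0%}")   # 5%

# Second example: 50% pretest probability, positive likelihood ratio of 11.7
print(f"{posttest_probability(0.50, 11.7):.0%}")  # 92%
```

The computed values (approximately 5% and 92%) agree with the posttest probabilities read from the nomogram in the two examples.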

Summary

Validity indexes for diagnostic tests were reviewed, and terms used in studies designed to describe the validity of diagnostic tests were defined. Data from 2 studies examining the validity of measurements obtained with the BBT for inferring current fall risk were used as an illustration to demonstrate how clinicians could use diagnostic test studies to guide clinical decisions for individual patients. Unfortunately, there are only a small number of diagnostic test studies describing the validity of examination procedures commonly used by physical therapists. There is an urgent need to conduct more studies of the usefulness of diagnostic and prognostic tests in physical therapy.

Acknowledgments

We thank Dr Anne Shumway-Cook, Linda Thorbahn, and Dr Roberta Newton for their insights and for allowing us to use their data in this article. We also thank Cheryl Ford-Smith and Sue Cromwell for reviewing an earlier version of the manuscript.

References

1 Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine. 2nd ed. Boston, Mass: Little, Brown and Co Inc; 1991:85-86.

2 Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-based Medicine: How to Practice and Teach EBM. New York, NY: Churchill Livingstone Inc; 1997.

3 Bogle Thorbahn LD, Newton RA. Use of the Berg Balance Test to predict falls in elderly persons. Phys Ther. 1996;76:576-583.

4 Shumway-Cook A, Baldwin M, Polissar NL, Gruber W. Predicting the probability for falls in community-dwelling older adults. Phys Ther. 1997;77:812-819.

5 Task Force on Standards for Measurement in Physical Therapy. Standards for tests and measurements in physical therapy practice. Phys Ther. 1991;71:589-622.

6 Berg KO, Wood-Dauphinee SL, Williams JI, Gayton D. Measuring balance in the elderly: preliminary development of an instrument. Physiotherapy Canada. 1989;41:304-311.

7 Berg KO, Maki BE, Williams JI, et al. Clinical and laboratory measures of postural balance in an elderly population. Arch Phys Med Rehabil. 1992;73:1073-1080.

8 Berg KO, Wood-Dauphinee SL, Williams JI, Maki B. Measuring balance in the elderly: validation of an instrument. Can J Public Health. 1992;83(suppl 2):S7-S11.

9 Tinetti ME. Performance-oriented assessment of mobility problems in elderly patients. J Am Geriatr Soc. 1986;34:119-126.

10 Mahoney FL, Barthel DW. Functional evaluation: the Barthel index. Md State Med J. 1965;14:61-65.

11 Department of Clinical Epidemiology and Biostatistics, McMaster University. How to read clinical journals, II: to learn about a diagnostic test. Can Med Assoc J. 1981;124:703-710.

12 Department of Clinical Epidemiology and Biostatistics, McMaster University. Interpretation of diagnostic data, 2: how to do it with a simple table (part A). Can Med Assoc J. 1983;129:5-11.

13 Department of Clinical Epidemiology and Biostatistics, McMaster University. Interpretation of diagnostic data, 2: how to do it with a simple table (part B). Can Med Assoc J. 1983;129:12-17.

14 Department of Clinical Epidemiology and Biostatistics, McMaster University. Interpretation of diagnostic data, 2: how to do it with simple math. Can Med Assoc J. 1983;129:22-29.

15 Sackett DL. A primer on the precision and accuracy of the clinical examination. JAMA. 1992;267:2638-2644.

16 Luukinen H, Koski K, Kivela SL, Laippala P. Social status, life changes, housing conditions, health, functional abilities, and lifestyle as risk factors for recurrent falls among the home-dwelling elderly. Public Health. 1996;110:115-118.

17 Tinetti ME, Speechley M, Ginter SF. Risk factors for falls among elderly persons living in the community. N Engl J Med. 1988;319:1701-1707.

18 Colton T. Statistics in Medicine. Boston, Mass: Little, Brown and Co Inc; 1974:160.

19 Crombie DL. Diagnostic process. J Coll Gen Prac. 1963;6:579-589.

20 Jaeschke R, Guyatt GH, Sackett DL. Users' guides to the medical literature, III: how to use an article about a diagnostic test, B: What are the results and will they help me in caring for my patients? JAMA. 1994;271:703-707.

21 Fagan TJ. Nomogram for Bayes theorem [letter]. N Engl J Med. 1975;293:257.
