Meta-analysis of Test Performance When There Is a “Gold Standard” Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for Medical Test Reviews Methods Guide www.ahrq.gov

Upload: laban

Post on 19-Jan-2016



TRANSCRIPT

Page 1: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Meta-analysis of Test Performance When There Is a “Gold Standard”

Prepared for:

The Agency for Healthcare Research and Quality (AHRQ)

Training Modules for Medical Test Reviews Methods Guide

www.ahrq.gov

Page 2: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Graphically display diagnostic test performance across multiple studies using a “gold standard” reference

Explain the dependence of sensitivity and specificity over studies and thus the need for a multivariate (joint) analysis

Describe choices for a meta-analysis to summarize test performance depending on whether the sensitivity and specificity estimates from multiple studies vary (or do not vary) widely

Learning Objectives

Trikalinos TA, Coleman CI, Griffith L, et al. Meta-analysis of test performance when there is a “gold standard.” In: Methods guide for medical test reviews. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

Page 3: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

This module focuses on how to conduct a meta-analysis with a “gold standard” reference. Module 9 discusses how to conduct a meta-analysis when no gold standard reference exists.

There are two goals for a meta-analysis in a systematic review:

1. Provide summary estimates for key quantities.

2. Explain observed heterogeneity in the results of studies included in the review.

For systematic reviews of medical tests, a meta-analysis often focuses on synthesis of test performance data.

Background


Page 4: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Gold Standard: A reference standard that is considered adequate in defining the presence or absence of the condition of interest (disease).

Diagnostic Test: This type of test is potentially less accurate than using the gold standard to ascertain disease.

Data: The main focus is on tests with positive or negative results because of the use of a cut-off level (threshold); each study provides 2 × 2 tabulation.

Important Terms

Test Result    With Disease      Healthy
Positive       True positive     False positive
Negative       False negative    True negative

Page 5: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Sensitivity = The proportion of test positives among people with a disease (true-positive rate)

Specificity = The proportion of test negatives among healthy people (true-negative rate)

Positive predictive value = Proportion with disease among people with test-positive results

Negative predictive value = Proportion of healthy people among those with test-negative results

The predictive values can be computed from sensitivity, specificity, and disease prevalence.

Measures Used To Assess Test Performance (1 of 2)
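The definitions on this slide can be sketched in code. This is a minimal illustration, assuming a hypothetical 2 × 2 table: the class name `TwoByTwo`, the helper functions, and the counts are not from the module. It also checks that predictive values computed from sensitivity, specificity, and prevalence agree with those computed directly from the counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from one study's 2 x 2 table (hypothetical container)."""
    tp: int  # test positive, with disease
    fp: int  # test positive, healthy
    fn: int  # test negative, with disease
    tn: int  # test negative, healthy

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    def prevalence(self) -> float:
        return (self.tp + self.fn) / (self.tp + self.fp + self.fn + self.tn)


def ppv_from_rates(se: float, sp: float, prev: float) -> float:
    """Positive predictive value from sensitivity, specificity, prevalence."""
    return se * prev / (se * prev + (1 - sp) * (1 - prev))


def npv_from_rates(se: float, sp: float, prev: float) -> float:
    """Negative predictive value from sensitivity, specificity, prevalence."""
    return sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)


# Hypothetical counts, chosen only for illustration.
study = TwoByTwo(tp=90, fp=30, fn=10, tn=170)
se, sp, prev = study.sensitivity(), study.specificity(), study.prevalence()
print(f"Se={se:.3f} Sp={sp:.3f} PPV={study.ppv():.3f} NPV={study.npv():.3f}")
print(f"PPV from rates={ppv_from_rates(se, sp, prev):.3f}")
print(f"NPV from rates={npv_from_rates(se, sp, prev):.3f}")
```

The agreement between the two routes holds only when the prevalence used in the formulas is the prevalence in the study sample itself.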


Page 6: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Positive likelihood ratio = sensitivity/(1 ‒ specificity) = proportion of test positives among diseased/proportion of test positives among healthy

Negative likelihood ratio = (1 ‒ sensitivity)/specificity = proportion of test negatives among diseased/proportion of test negatives among healthy

Diagnostic odds ratio = (true positives/false negatives)/(false positives/true negatives) = odds of a positive test with disease over odds of a positive test without disease

Diagnostic odds ratios do not allow weighing of the true-positive and false-positive rates separately.

Measures Used To Assess Test Performance (2 of 2)
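A short sketch of the three ratios defined on this slide, using hypothetical counts (the function names and numbers are illustrative assumptions, not part of the module). It also shows that the diagnostic odds ratio equals the positive likelihood ratio divided by the negative likelihood ratio, which is why it cannot distinguish between the two kinds of error.

```python
def positive_lr(se: float, sp: float) -> float:
    """Positive likelihood ratio: sensitivity / (1 - specificity)."""
    return se / (1 - sp)


def negative_lr(se: float, sp: float) -> float:
    """Negative likelihood ratio: (1 - sensitivity) / specificity."""
    return (1 - se) / sp


def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """(true positives / false negatives) / (false positives / true negatives)."""
    return (tp / fn) / (fp / tn)


# Hypothetical 2 x 2 counts, chosen only for illustration.
tp, fp, fn, tn = 90, 30, 10, 170
se = tp / (tp + fn)
sp = tn / (tn + fp)

lr_pos = positive_lr(se, sp)
lr_neg = negative_lr(se, sp)
dor = diagnostic_odds_ratio(tp, fp, fn, tn)

print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}, DOR = {dor:.1f}")
print(f"LR+ / LR- = {lr_pos / lr_neg:.1f}")
```

This sketch does not handle zero cells; a table with no false negatives or no false positives would need a continuity correction or a different estimator.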


Page 7: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Meta-analysis aims to provide a meaningful summary of sensitivity and specificity across studies.

Within each study, sensitivity and specificity are independent, because they are estimated from different patients (those with a disease or those who are healthy).

Across studies, sensitivity and specificity are generally negatively correlated: as one increases, the other is expected to decrease.

This negative correlation is most obvious with varying thresholds (known as “threshold effect”), varying time from onset of symptom to test, et cetera.

Positive correlations are often due to a missing covariate.

Dependence of Sensitivity and Specificity Across Studies (1 of 2)


Page 8: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

This is an example with 11 studies using D-dimer tests to diagnose venous thromboembolism, showing that sensitivity increases as specificity decreases:

Summarizing the two correlated quantities is a multivariate problem, and multivariate methods should be used to address it.

Dependence of Sensitivity and Specificity Across Studies (2 of 2)
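As a rough first look at this dependence, the across-study correlation can be computed on the logit scale. The sketch below uses made-up sensitivity/specificity pairs (not the D-dimer data) and a hand-written Pearson correlation; it ignores within-study sampling error, which the hierarchical models discussed later account for, so it is a descriptive check rather than an analysis method.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def pearson(x: list[float], y: list[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical study-level estimates, chosen to mimic a threshold effect.
sensitivities = [0.95, 0.90, 0.85, 0.80, 0.70]
specificities = [0.50, 0.60, 0.70, 0.80, 0.90]

r = pearson([logit(p) for p in sensitivities],
            [logit(p) for p in specificities])
print(f"correlation of logit(Se) and logit(Sp) across studies: {r:.2f}")
```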

Becker DM, Philbrick JT, Bachhuber TL, et al. Arch Intern Med 1996 May 13;156(9):939-46. PMID: 8624174.

Page 9: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

How to quantitatively summarize medical test performance when:

1. The sensitivity and specificity estimates of various studies do not vary widely.

Can use a summary point to obtain summary test performance if the tests have the same threshold.

Summary point = a summary sensitivity and summary specificity pair.

2. The sensitivity and specificity of multiple studies vary widely.

Can use a summary line to describe the relationship between average sensitivity and average specificity.

May be less important than variations in thresholds, reference standards, study designs, et cetera, between the studies.

Challenges


Page 10: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Principle 1: Favor the most informative way to summarize the data. Choose between a summary point and a summary line.

Use the summary point when sensitivity/specificity do not vary much.

Use the summary line when there are different thresholds for positive tests or estimates vary widely.

Both can also be used, since they convey complementary information.

The choice is subjective; there are no hard-and-fast rules.

Principle 2: Explore the variability in study results with graphs and suitable analyses rather than relying exclusively on “grand means” (i.e., a single summary statistic).

Principles for Addressing the Challenges


Page 11: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Problem:

Within a study, sensitivity, specificity, positive/negative predictive values, and prevalence are all interrelated via simple formulas.

Meta-analyzing each metric across studies will create summaries that are inconsistent with these formulas.

Proposed solution:

Obtain summaries for sensitivities and specificities across studies via meta-analysis, then back-calculate the rest of the metrics (using the formulas) over a range of prevalence values.

Deciding Which Metrics To Meta-analyze (1 of 6)
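The proposed solution can be sketched as follows. The summary sensitivity and specificity here are hypothetical placeholders for the output of a meta-analysis, and `back_calculate` is an illustrative name, not a function from any package. Likelihood ratios do not depend on prevalence, so they are computed once; predictive values are tabulated over a range of prevalence values.

```python
def back_calculate(se: float, sp: float, prevalences: list[float]) -> list[dict]:
    """Derive PPV and NPV from one (Se, Sp) pair at each prevalence."""
    rows = []
    for prev in prevalences:
        ppv = se * prev / (se * prev + (1 - sp) * (1 - prev))
        npv = sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)
        rows.append({"prevalence": prev, "ppv": ppv, "npv": npv})
    return rows


# Hypothetical summary estimates standing in for meta-analysis output.
summary_se, summary_sp = 0.90, 0.85

lr_pos = summary_se / (1 - summary_sp)
lr_neg = (1 - summary_se) / summary_sp
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")

prevalence_grid = [i / 100 for i in range(5, 55, 5)]  # 5% to 50%
table = back_calculate(summary_se, summary_sp, prevalence_grid)
for row in table:
    print(f"prev={row['prevalence']:.2f}  "
          f"PPV={row['ppv']:.3f}  NPV={row['npv']:.3f}")
```

Because every derived metric comes from the same sensitivity/specificity pair, the results satisfy the within-study formulas by construction.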


Page 12: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

A Visual Summary of Sensitivity/Specificity Across Studies With Back Calculation of the Other Metrics

Deciding Which Metrics To Meta-analyze (2 of 6)

NLR = negative likelihood ratio; NPV = negative predictive value; PLR = positive likelihood ratio; PPV = positive predictive value; Prev = prevalence; Se = sensitivity; Sp = specificity

Page 13: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Why does it make sense to directly meta-analyze sensitivity and specificity?

It aligns well with our understanding of positivity threshold effects.

Sensitivity and specificity are often considered independent of prevalence.

Summary sensitivity and specificity obtained by direct meta-analysis will always be between 0 and 1.

These two metrics are not as easily understood as predictive values and likelihood ratios, so back calculation of these other metrics is useful.

Deciding Which Metrics To Meta-analyze (3 of 6)


Page 14: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Why does it not make sense to directly meta-analyze predictive values or prevalence?

Predictive values are dependent on prevalence.

Rarely is it meaningful to meta-analyze each value across studies.

Prevalence is often wide ranging.

Prevalence cannot be estimated from case-control studies (the main design of many medical test studies).

It is better to back calculate these values over a range of plausible prevalence values.

Deciding Which Metrics To Meta-analyze (4 of 6)


Page 15: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Why can directly meta-analyzing positive and negative likelihood ratios be problematic?

Combining likelihood ratios across studies does not guarantee the summary values are “internally consistent.”

It is possible to obtain summary likelihood ratios that correspond to impossible summary sensitivities or specificities (i.e., values <0 or >1).

Back calculation avoids this.

This is not a common case, however; often direct meta-analysis yields the same conclusions as back calculation.

Deciding Which Metrics To Meta-analyze (5 of 6)
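The internal-consistency problem can be made concrete by solving the two likelihood ratio definitions for sensitivity and specificity. With LR+ = Se/(1 − Sp) and LR− = (1 − Se)/Sp, algebra gives Sp = (LR+ − 1)/(LR+ − LR−) and Se = LR+ × (1 − LR−)/(LR+ − LR−). The sketch below applies this to one consistent pair and one hypothetical pair of separately pooled values; the function name and the numbers are illustrative assumptions.

```python
def implied_se_sp(lr_pos: float, lr_neg: float) -> tuple[float, float]:
    """Sensitivity and specificity implied by a pair of likelihood ratios.

    Derived from LR+ = Se / (1 - Sp) and LR- = (1 - Se) / Sp.
    Undefined when LR+ equals LR-.
    """
    sp = (lr_pos - 1) / (lr_pos - lr_neg)
    se = lr_pos * (1 - lr_neg) / (lr_pos - lr_neg)
    return se, sp


def is_internally_consistent(lr_pos: float, lr_neg: float) -> bool:
    se, sp = implied_se_sp(lr_pos, lr_neg)
    return 0 <= se <= 1 and 0 <= sp <= 1


# A pair computed from a single (Se, Sp) = (0.90, 0.85) is consistent.
consistent = (0.90 / 0.15, 0.10 / 0.85)

# Hypothetical summary values from two separate meta-analyses, both below 1.
inconsistent = (0.9, 0.5)

for lr_pos, lr_neg in (consistent, inconsistent):
    se, sp = implied_se_sp(lr_pos, lr_neg)
    print(f"LR+={lr_pos:.3f} LR-={lr_neg:.3f} -> Se={se:.3f} Sp={sp:.3f} "
          f"consistent={is_internally_consistent(lr_pos, lr_neg)}")
```

The second pair implies a sensitivity above 1 and a negative specificity, which no real test can have.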


Page 16: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Directly analyzing diagnostic odds ratios is straightforward and follows standard meta-analytic methods.

Characteristics of the diagnostic odds ratio:

Closely linked to sensitivity, specificity, and likelihood ratios

Can easily be included in meta-regression models for analysis of heterogeneity between studies

Disadvantages:

Challenging to interpret

Impossible to weigh the true-positive rate and the false-positive rate separately

Deciding Which Metrics To Meta-analyze (6 of 6)


Page 17: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Meta-analytic methods should:

Respect the multivariate nature of test performance metrics (i.e., sensitivity and specificity)

Allow for nonindependence between sensitivity and specificity across studies (“threshold effect”)

Allow for between-study heterogeneity (i.e., variability not explained by the statistical distribution of the data in each study)

The most theoretically motivated approaches are based on multivariate methods (hierarchical modeling).

Desired Characteristics of Meta-analytic Methods


Page 18: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Multivariate meta-analysis of sensitivity and specificity (i.e., joint analysis of both) should be performed, rather than separate univariate meta-analyses.

It requires hierarchical modeling:

Bivariate model

Hierarchical summary receiver operator characteristic model

Both families of models use two levels to model data.

1st level = within-study variability, from 2 × 2 table counts

2nd level = between-study variability (i.e., heterogeneity), allowing for nonindependence of sensitivity and specificity across studies

Preferred Methods for Obtaining a Summary Point (1 of 4)
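The two levels can be written out for the bivariate model. The notation below is a common way of stating it and is an assumption here, since the slides give no formulas: study i contributes TP_i true positives among n_{1i} people with disease and TN_i true negatives among n_{0i} healthy people, and the logit is assumed as the transformation.

```latex
\begin{aligned}
\text{Level 1 (within study } i\text{):}\quad
  & \mathrm{TP}_i \sim \mathrm{Binomial}(n_{1i}, \mathrm{Se}_i), \qquad
    \mathrm{TN}_i \sim \mathrm{Binomial}(n_{0i}, \mathrm{Sp}_i) \\[4pt]
\text{Level 2 (between studies):}\quad
  & \begin{pmatrix} \operatorname{logit} \mathrm{Se}_i \\ \operatorname{logit} \mathrm{Sp}_i \end{pmatrix}
    \sim \mathcal{N}\!\left(
      \begin{pmatrix} \mu_{\mathrm{Se}} \\ \mu_{\mathrm{Sp}} \end{pmatrix},
      \begin{pmatrix}
        \sigma_{\mathrm{Se}}^{2} & \rho\,\sigma_{\mathrm{Se}}\sigma_{\mathrm{Sp}} \\
        \rho\,\sigma_{\mathrm{Se}}\sigma_{\mathrm{Sp}} & \sigma_{\mathrm{Sp}}^{2}
      \end{pmatrix}
    \right)
\end{aligned}
```

Under this parameterization the summary point is the pair obtained by back-transforming the two means, and the correlation parameter ρ carries the nonindependence of sensitivity and specificity across studies.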


Page 19: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Model families differ in the parameters used for between-study variability in the 2nd level.

The bivariate model uses parameters that are transformations of the average sensitivity and specificity.

The hierarchical summary receiver operator characteristic (HSROC) model uses a scale parameter and an accuracy parameter.

Both models are functions of the sensitivity and specificity.

They also define an underlying HSROC curve.

Both models are mathematically the same in the absence of covariates.

Both models assume a normal distribution of parameters, which can be difficult to satisfy.

Preferred Methods for Obtaining a Summary Point (2 of 4)


Page 20: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Researchers need to choose between the bivariate and the hierarchical summary receiver operator characteristic (HSROC) models when covariates are present (i.e., meta-regression analysis). For example:

The bivariate model is more appropriate when there is variation in disease severity. This affects sensitivity but not specificity.

The bivariate model allows direct evaluation of the difference in sensitivity and/or specificity.

The HSROC model is more effective when “spectrum effects” (the subjects in a study do not represent the patients who will receive the test in practice) are present. This is more likely to affect test accuracy rather than threshold.

The HSROC model allows direct evaluation of the difference in accuracy and/or threshold parameters.

Preferred Methods for Obtaining a Summary Point (3 of 4)


Page 21: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Methods Commonly Used To Calculate a Summary Point

Preferred Methods for Obtaining a Summary Point (4 of 4)

Method: Independent meta-analysis of sensitivity and specificity

Description or Comment: Separate meta-analyses per metric. Within-study variability preferably modeled by the binomial distribution.

Does It Have the Desired Characteristics? Ignores the correlation between sensitivity and specificity. Underestimates the summary sensitivity and specificity and yields wrong confidence intervals.

Method: Joint (multivariate) meta-analysis of sensitivity and specificity based on hierarchical modeling

Description or Comment: Based on multivariate (joint) modeling of sensitivity and specificity. Two families of models that are equivalent when there are no covariates. Modeling preferably using binomial likelihood rather than normal approximations.

Does It Have the Desired Characteristics? The generally preferred method.


Page 22: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Hierarchical modeling is recommended.

Hierarchical summary lines can be calculated from bivariate random-effects model parameters.

A range of hierarchical summary receiver operator characteristic (HSROC) lines can be calculated from fitted bivariate model parameters. An example is the Rutter-Gatsonis HSROC model.

These lines represent alternative characterizations of the bivariate distribution of sensitivity and specificity.

They show how the summary sensitivity changes with the summary specificity.

Preferred Methods for Obtaining a Summary Line (1 of 2)

Rutter CM, Gatsonis CA. Acad Radiol 1995 Mar;2 Suppl 1:S48-56; discussion S65-7, S70-1 pas. PMID: 9419705.

Page 23: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Methods Commonly Used To Calculate a Summary Line

Preferred Methods for Obtaining a Summary Line (2 of 2)

Method: Moses-Littenberg model

Description or Comment: Summary line based on a simple regression of the difference of logit-transformed true-positive and false-positive rates versus their average.

Does It Have the Desired Characteristics? Ignores unexplained variation between studies (fixed effects). Does not account for correlation between sensitivity and specificity. Does not account for variability in the independent variable. Inability to weight studies optimally; yields wrong inferences when covariates are used.

Method: Random intercept augmentation of the Moses-Littenberg model

Description or Comment: Regression of the difference of logit-transformed true-positive and false-positive rates versus their average, with random effects that allow for variability across studies.

Does It Have the Desired Characteristics? Does not account for correlation between sensitivity and specificity. Does not account for variability in the independent variable.

Method: Summary receiver operator characteristic (ROC) based on hierarchical modeling

Description or Comment: Same as for multivariate meta-analysis to obtain a summary point (hierarchical modeling). Many ways to obtain a (hierarchical) summary ROC: Rutter-Gatsonis (most common); several alternative curves.

Does It Have the Desired Characteristics? Most theoretically motivated method. Rutter-Gatsonis hierarchical summary ROC is recommended in the Cochrane Handbook, as it is the method that has been used most often.

Littenberg B, Moses LE. Med Decis Making 1993 Oct-Dec;13(4):313-21. PMID: 8246704.

Rutter CM, Gatsonis CA. Acad Radiol 1995 Mar;2 Suppl 1:S48-56; discussion S65-7, S70-1 pas. PMID: 9419705.
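For reference, the simplest method in the table, the Moses-Littenberg regression, can be sketched in a few lines. The sketch assumes the usual formulation with D = logit(TPR) − logit(FPR) regressed on S = logit(TPR) + logit(FPR); using the average instead of the sum only rescales the slope. It fits an unweighted ordinary least-squares line to made-up study data, so it shares the limitations listed in the table and is shown only to clarify what the method does, not as a recommended analysis.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def expit(x: float) -> float:
    return 1 / (1 + math.exp(-x))


def fit_moses_littenberg(tprs: list[float], fprs: list[float]) -> tuple[float, float]:
    """Unweighted least-squares fit of D = a + b * S; returns (a, b)."""
    d = [logit(t) - logit(f) for t, f in zip(tprs, fprs)]
    s = [logit(t) + logit(f) for t, f in zip(tprs, fprs)]
    mean_s, mean_d = sum(s) / len(s), sum(d) / len(d)
    slope = (sum((si - mean_s) * (di - mean_d) for si, di in zip(s, d))
             / sum((si - mean_s) ** 2 for si in s))
    intercept = mean_d - slope * mean_s
    return intercept, slope


def sroc_tpr(fpr: float, a: float, b: float) -> float:
    """True-positive rate on the fitted summary line at a given FPR.

    Solves D = a + b * S for logit(TPR); requires b != 1.
    """
    return expit((a + (1 + b) * logit(fpr)) / (1 - b))


# Hypothetical study-level rates (no zero cells, so no continuity correction).
study_tpr = [0.95, 0.90, 0.85, 0.75, 0.60]
study_fpr = [0.40, 0.30, 0.20, 0.10, 0.05]

a, b = fit_moses_littenberg(study_tpr, study_fpr)
for fpr in (0.05, 0.10, 0.20, 0.40):
    print(f"FPR={fpr:.2f} -> summary TPR={sroc_tpr(fpr, a, b):.3f}")
```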

Page 24: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

It is not uncommon for studies to report multiple sensitivity/specificity pairs at several thresholds for positive tests.

Option 1: Decide on one threshold from each study (e.g., the threshold with the highest sensitivity).

Option 2: Use all thresholds.

An extension of the hierarchical summary receiver operator characteristic model has been developed for this purpose.

A method combining whole receiver operator characteristic (ROC) curves can also be used.

It is recommended that data be explored graphically in ROC space to highlight similarities and differences among the studies.

Special Case: Joint Analysis of Sensitivity and Specificity With Multiple Thresholds (1 of 2)


Page 25: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

This is an example of an ROC graph for studies with different thresholds for total serum bilirubin. Points on the line for each study represent sensitivity/specificity pairs at different threshold values.

Special Case: Joint Analysis of Sensitivity and Specificity With Multiple Thresholds (2 of 2)

This is a typical receiver operator characteristic (ROC) graph for four hypothetical studies. Studies in the left shaded area have an LR+ ≥10. Studies in the top shaded area have an LR- ≤ 0.1. Those in the intersection have both.
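The shaded regions described in this caption follow directly from the likelihood ratio definitions: LR+ ≥ 10 is equivalent to sensitivity ≥ 10 × (1 − specificity), and LR− ≤ 0.1 is equivalent to 1 − sensitivity ≤ 0.1 × specificity. A minimal sketch, with four made-up studies rather than the ones in the figure:

```python
def roc_regions(se: float, sp: float) -> tuple[bool, bool]:
    """Return (in_left_region, in_top_region) for one study.

    Left region: LR+ >= 10. Top region: LR- <= 0.1. The inequalities are
    written without division so that Sp = 1 or Sp = 0 does not raise.
    """
    in_left = se >= 10 * (1 - sp)
    in_top = (1 - se) <= 0.1 * sp
    return in_left, in_top


# Four hypothetical studies as (sensitivity, specificity) pairs.
studies = {
    "A": (0.95, 0.95),
    "B": (0.50, 0.98),
    "C": (0.98, 0.50),
    "D": (0.70, 0.70),
}

for name, (se, sp) in studies.items():
    left, top = roc_regions(se, sp)
    print(f"study {name}: LR+ >= 10: {left}, LR- <= 0.1: {top}")
```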

Trikalinos TA, Chung M, Lau J, et al. Pediatrics 2009 Oct;124(4):1162-71. PMID: 19786450.

Page 26: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

A three-step algorithm is recommended for meta-analyzing studies with a gold standard reference:

1. Start by considering sensitivity and specificity separately.

2. Perform a multivariate meta-analysis (when each study reports a single threshold).

3. Explore between-study heterogeneity

Recommended Algorithm


Page 27: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Reviewers should familiarize themselves with the pattern of study-level sensitivities and specificities.

Use graphical displays.

Forest plots of study sensitivities and specificities with their confidence intervals give a visual impression of the variability of sensitivity and specificity across studies.

A plot of sensitivity (vertical axis) versus 1 – specificity (horizontal axis) gives a visual impression of the relationship between sensitivity and specificity across studies. These plots are also known as receiver operator characteristic graphs.

A shoulder-and-arm pattern is present when there is a threshold effect.

Step 1: Start by Considering Sensitivity and Specificity Separately (1 of 2)
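The numbers behind such displays can be prepared as sketched below. The module does not say which confidence interval to use for the forest plots; the Wilson score interval is an assumption made here, and the study counts are invented. Each study yields a sensitivity and specificity with intervals (for the forest plots) and a point (1 − specificity, sensitivity) for the receiver operator characteristic graph.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a 95% interval


def wilson_interval(successes: int, total: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical studies as (tp, fp, fn, tn) counts.
studies = [(90, 30, 10, 170), (45, 5, 15, 95), (28, 40, 2, 60)]

roc_points = []
for tp, fp, fn, tn in studies:
    se, sp = tp / (tp + fn), tn / (tn + fp)
    se_lo, se_hi = wilson_interval(tp, tp + fn)
    sp_lo, sp_hi = wilson_interval(tn, tn + fp)
    roc_points.append((1 - sp, se))  # (horizontal, vertical) coordinates
    print(f"Se={se:.2f} ({se_lo:.2f} to {se_hi:.2f})  "
          f"Sp={sp:.2f} ({sp_lo:.2f} to {sp_hi:.2f})")
```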


Page 28: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Examples of forest plots

An example of a receiver operator characteristic graph with the shoulder-and-arm pattern

Step 1: Start by Considering Sensitivity and Specificity Separately (2 of 2)

Increasing the threshold decreases sensitivity but increases specificity

Becker DM, Philbrick JT, Bachhuber TL, et al. Arch Intern Med 1996 May 13;156(9):939-46. PMID: 8624174.

Page 29: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Obtain a 2-dimensional summary point (sensitivity, specificity) using the bivariate model of meta-analysis, preferably with a binomial likelihood for the within-study variability.

Obtain summary lines based on multivariate meta-analytic models.

Interpretation of a summary line is not “automatically” that of “threshold effects,” especially if there is a positive correlation between sensitivity and specificity across studies.

If more than one threshold is reported per study, consider incorporating all of them in the analysis both qualitatively (via graphs) and quantitatively (via proper methods).

Step 2: Multivariate Meta-analysis (When Each Study Reports a Single Threshold)


Page 30: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

The hierarchical summary receiver operator characteristic (HSROC) model allows direct evaluation of heterogeneity in accuracy and threshold parameters.

Bivariate models allow direct evaluation of sensitivity and specificity.

Added covariates that reduce variability across studies may need to be taken into account when summarizing the studies.

Some common sources of heterogeneity:

Patient population/selection

Methods to verify/interpret results

Clinical setting

Disease severity

Step 3: Explore Between-Study Heterogeneity


Page 31: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

D-dimers are fragments specific to fibrin degradation.

They are measured by using an enzyme-linked immunosorbent assay (ELISA) to diagnose venous thromboembolism.

Example 1: D-Dimer Assays for Diagnosing Venous Thromboembolism (1 of 4)

Forest Plots of Sensitivity, Specificity, and Likelihood Ratios

Becker DM, Philbrick JT, Bachhuber TL, et al. Arch Intern Med 1996 May 13;156(9):939-46. PMID: 8624174.

Page 32: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Forest plots show more heterogeneity in sensitivity/specificity than in likelihood ratios.

Verified by formal heterogeneity testing

May be a threshold effect

Because of the variety of thresholds used across the studies, it is more informative to summarize test performance with a hierarchical summary receiver operator characteristic plot than by summarizing sensitivities and specificities.

Example 1: D-Dimer Assays for Diagnosing Venous Thromboembolism (2 of 4)


Page 33: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

The shoulder-and-arm pattern indicates the threshold effect.

The location of points in the upper shaded area of the receiver operator characteristic space indicates high sensitivity and low specificity.

The test minimizes false-negative results and is good for ruling out disease.

Example 1: D-Dimer Assays for Diagnosing Venous Thromboembolism (3 of 4)

HSROC Plot of D-Dimer Tests Using the Highest Thresholds

Lijmer JG, Bossuyt PM, Heisterkamp SH. Stat Med 2002 Jun 15;21(11):1525-37. PMID: 12111918.

Page 34: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Example 1: D-Dimer Assays for Diagnosing Venous Thromboembolism (4 of 4)

It is informative to give a summary of the negative and positive predictive values for this test.

Calculate over a range of prevalence values using the summary sensitivity and specificity values.

A consistently high negative predictive value line means that a high percentage of people who test negative actually are negative for the disease.

Calculated Negative Predictive Values for the D-Dimer Test With the Prevalence of Venous Thromboembolism Between 5 and 50 Percent

Lijmer JG, Bossuyt PM, Heisterkamp SH. Stat Med 2002 Jun 15;21(11):1525-37. PMID: 12111918.

Page 35: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Serial measurements of the creatine kinase-myocardial band (CK-MB) are used to diagnose acute cardiac ischemia in the emergency room.

Blood levels of CK-MB increase over time from symptom onset.

Fourteen studies performed CK-MB testing at varying times after symptom onset.

There was evident heterogeneity in sensitivity that was not attributable to the threshold effect.

The sensitivity of the test increased as the time from symptom onset increased.

The difference in sensitivity may be attributable to time; to test this possibility, a bivariate meta-analytic model was used.

Example 2: Serial Measurements of the Creatine Kinase-Myocardial Band To Diagnose Acute Cardiac Ischemia (1 of 3)


Page 36: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Sensitivity increases with longer hours from symptom onset to the last measurement of the creatine kinase-myocardial band.

Example 2: Serial Measurements of the Creatine Kinase-Myocardial Band To Diagnose Acute Cardiac Ischemia (2 of 3)

Actual hours are indicated next to the points; circles ≤ 3 hours; Xs > 3 hours

Dashed lines = 95-percent confidence regions; blue ≤ 3 hours; red > 3 hours

Actual Hours 95-Percent Confidence Regions

Balk EM, Ioannidis JP, Salem D, et al. Ann Emerg Med 2001 May;37(5):478-94. PMID: 11326184.

Lau J, Ioannidis JP, Balk E, et al. Evid Rep Technol Assess (Summ) 2000 Sep;(26):1-4. PMID: 11079073.

Page 37: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

A hierarchical summary receiver operator characteristic (HSROC) model (bivariate meta-regression) was used to compare summary sensitivity and specificity between subgroups, with a fixed-effects binary covariate for the timing of the last serial creatine kinase-myocardial band measurement (≤3 hours versus >3 hours).

Note that properly specified bivariate/HSROC meta-regressions can be used to compare two or more index tests.

Example 2: Serial Measurements of the Creatine Kinase-Myocardial Band To Diagnose Acute Cardiac Ischemia (3 of 3)
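One common way to write a bivariate meta-regression with a single binary covariate is sketched below. The notation is an assumption made for illustration; the module does not state the exact parameterization used in this example. Here i indexes studies, n_{1i} and n_{0i} are the numbers of patients with and without the condition, and z_i is 1 when the last CK-MB measurement was taken more than 3 hours after symptom onset and 0 otherwise.

```latex
% Within-study level: exact binomial likelihoods
\mathrm{TP}_i \sim \mathrm{Binomial}(n_{1i}, \mathrm{Se}_i), \qquad
\mathrm{TN}_i \sim \mathrm{Binomial}(n_{0i}, \mathrm{Sp}_i)

% Between-study level: correlated random effects on the logit scale,
% with the covariate z_i shifting both means (fixed effects)
\begin{pmatrix} \operatorname{logit}\mathrm{Se}_i \\ \operatorname{logit}\mathrm{Sp}_i \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_{\mathrm{Se}} + \beta_{\mathrm{Se}} z_i \\ \mu_{\mathrm{Sp}} + \beta_{\mathrm{Sp}} z_i \end{pmatrix},
\begin{pmatrix} \sigma_{\mathrm{Se}}^{2} & \rho\,\sigma_{\mathrm{Se}}\sigma_{\mathrm{Sp}} \\ \rho\,\sigma_{\mathrm{Se}}\sigma_{\mathrm{Sp}} & \sigma_{\mathrm{Sp}}^{2} \end{pmatrix}
\right)
```

Under this sketch, the summary sensitivity is the inverse logit of \mu_{Se} for the ≤3 hours subgroup and of \mu_{Se} + \beta_{Se} for the >3 hours subgroup, and likewise for specificity. A test of \beta_{Se} = 0 (or \beta_{Sp} = 0) would correspond to a p-value for the comparison across subgroups.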

Trikalinos TA, Coleman CI, Griffith L, et al. Meta-analysis of test performance when there is a “gold standard.” In: Methods guide for medical test reviews. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

Balk EM, Ioannidis JP, Salem D, et al. Ann Emerg Med 2001 May;37(5):478-94. PMID: 11326184.

Lau J, Ioannidis JP, Balk E, et al. Evid Rep Technol Assess (Summ) 2000 Sep;(26):1-4. PMID: 11079073.

Meta-analysis Metric | ≤3 Hours | >3 Hours | P-Value for the Comparison Across Subgroups

Summary sensitivity (percentage) | 80 (64 to 90) | 96 (85 to 99) | 0.36

Summary specificity (percentage) | 97 (94 to 98) | 97 (95 to 99) | 0.56

Page 38: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Use bivariate random-effects meta-analytic models to obtain a summary sensitivity and specificity.

Back-calculate the overall positive and negative predictive values (over a range of prevalence values) from summary estimates of sensitivity and specificity, rather than meta-analyzing them directly.

Back-calculate overall positive and negative likelihood ratios from summary estimates of sensitivity and specificity, rather than meta-analyzing them directly.

Overall Recommendations (1 of 3)
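The back-calculation recommended on this slide can be sketched with the standard formulas for predictive values and likelihood ratios. The snippet below is a minimal illustration, not part of the module: the function names are hypothetical, the inputs are the >3 hours point estimates from Example 2 (sensitivity 96 percent, specificity 97 percent), and the prevalence values are arbitrary. It ignores the uncertainty in the summary estimates.

```python
def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    true_neg = spec * (1 - prevalence)
    false_neg = (1 - sens) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) implied by sensitivity and specificity."""
    return sens / (1 - spec), (1 - sens) / spec


# Point estimates taken from Example 2 (>3 hours); used here for illustration only.
summary_sens = 0.96
summary_spec = 0.97

lr_pos, lr_neg = likelihood_ratios(summary_sens, summary_spec)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}")

# Predictive values depend on prevalence, so report them over a range (values assumed).
for prevalence in (0.05, 0.10, 0.25, 0.50):
    ppv, npv = predictive_values(summary_sens, summary_spec, prevalence)
    print(f"prevalence = {prevalence:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Likelihood ratios do not involve prevalence, so they are computed once; predictive values are tabulated across the prevalence range of interest.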


Page 39: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

To obtain a summary line, use multivariate meta-analysis methods such as the hierarchical summary receiver operator characteristic (HSROC) model.

Several summary lines can be obtained based on multivariate meta-analytic models.

They can differ when the estimated correlation between sensitivity and specificity is positive and when there is little between-study variability.

If there is evidence of a positive correlation, the variability in the studies cannot be attributed to a threshold effect.

Explore for missing important covariates.

Overall Recommendations (2 of 3)


Page 40: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

If more than one threshold is reported per study, this must be taken into account in the quantitative analyses.

Qualitative analysis with graphs and quantitative analyses with proper methods are encouraged.

Explore the impact of study characteristics on summary results using meta-regression–based analyses or subgroup analyses in the context of the primary methodology used to summarize the studies.

Overall Recommendations (3 of 3)


Page 41: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

1. Within individual studies of a systematic review, sensitivity and specificity are independent variables.

a. True

b. False

Practice Question 1 (1 of 2)

Page 42: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Explanation for Question 1:

This statement is true. Sensitivity and specificity within each study are independent because they are estimated from different patients. Across studies they typically are negatively correlated.

Practice Question 1 (2 of 2)
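The negative across-study correlation mentioned in the explanation is what a threshold effect would produce. The sketch below illustrates this under assumptions that are not in the module: test values are taken to be normally distributed with unit variance (mean 0 without the condition, mean 2 with it), each hypothetical study uses a different positivity threshold, and sampling error is ignored so that only the threshold varies.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def accuracy_at_threshold(threshold, separation=2.0):
    """True (sensitivity, specificity) when values above the threshold are positive.

    Assumed model: non-diseased ~ N(0, 1), diseased ~ N(separation, 1).
    """
    sensitivity = 1.0 - normal_cdf(threshold - separation)
    specificity = normal_cdf(threshold)
    return sensitivity, specificity


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical thresholds, one per study (assumed values).
thresholds = [0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75]
pairs = [accuracy_at_threshold(t) for t in thresholds]
sens = [p[0] for p in pairs]
spec = [p[1] for p in pairs]

for t, (se, sp) in zip(thresholds, pairs):
    print(f"threshold = {t:.2f}: sensitivity = {se:.3f}, specificity = {sp:.3f}")
print(f"correlation across studies = {pearson(sens, spec):.3f}")
```

Raising the threshold lowers sensitivity and raises specificity, so the pairs are negatively correlated across the simulated studies even though nothing links the two estimates within any one study.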

Page 43: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

2. Why does this module recommend directly meta-analyzing sensitivity and specificity?

a. Sensitivity and specificity are dependent on the prevalence of the condition under study.

b. Other metrics, such as predictive values and likelihood ratios, can be back-calculated for a range of prevalence values by using known formulas.

c. Summary sensitivity and specificity obtained by direct meta-analysis will always be greater than 1.

d. Interpretation of sensitivity and specificity is very intuitive.

Practice Question 2 (1 of 2)

Page 44: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Explanation for Question 2:

The correct answer is b. Once the summary sensitivity and specificity are calculated by meta-analysis, there are formulas that allow the back-calculation of overall predictive values and likelihood ratios. Likelihood ratios and predictive values are more easily interpreted by the reader of the review. Sensitivity and specificity are often considered to be independent of prevalence because they do not depend on it mathematically, and, being proportions, they always lie between 0 and 1.

Practice Question 2 (2 of 2)

Page 45: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

3. What is the preferred method for obtaining a summary sensitivity and specificity in a meta-analysis?

a. Multivariate meta-analysis

b. Separate univariate meta-analyses

c. Using a summary line

d. The Kester and Buntinx variant

Practice Question 3 (1 of 2)

Page 46: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Explanation for Question 3:

The correct answer is a. A multivariate meta-analysis of sensitivity and specificity is the recommended method for obtaining a summary point (summary sensitivity and specificity). This is a joint analysis of both quantities instead of separate univariate meta-analyses. Obtaining a summary line is an alternative to calculating a summary point. The Kester and Buntinx method is used to analyze sensitivity and specificity pairs when there are several thresholds for positive tests.

Practice Question 3 (2 of 2)

Page 47: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

4. In which situation would a summary line be more helpful in summarizing medical test performance?

a. Sensitivity and specificity estimates of various studies do not vary widely.

b. Sensitivity and specificity of various studies vary over a large range.

Practice Question 4 (1 of 2)

Page 48: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Explanation for Question 4:

The correct answer is b. Both a summary point and a summary line are informative and are useful in synthesizing data. There are no strict rules to follow in deciding which to use. A summary line may be more helpful as a summary of test performance when the sensitivity and specificity estimates of various studies vary over a large range.

Practice Question 4 (2 of 2)

Page 49: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

This presentation was prepared by Brooke Heidenfelder, Andrzej Kosinski, Rachael Posey, Lorraine Sease, Remy Coeytaux, Gillian Sanders, and Alex Vaz, of the Duke University Evidence-based Practice Center.

The module is based on Trikalinos TA, Coleman CI, Griffith L, et al. Meta-analysis of test performance when there is a “gold standard.” In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June 2012. p. 8.1-21. AHRQ Publication No. 12-EHC017. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

Authors

Page 50: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Arends LR, Hamza TH, van Houwelingen JC, et al. Bivariate random effects meta-analysis of ROC curves. Med Decis Making 2008 Sep-Oct;28(5):621-38. PMID: 18591542.

Balk EM, Ioannidis JP, Salem D, et al. Accuracy of biomarkers to diagnose acute cardiac ischemia in the emergency department: a meta-analysis. Ann Emerg Med 2001 May;37(5):478-94. PMID: 11326184.

Becker DM, Philbrick JT, Bachhuber TL, et al. D-dimer testing and acute venous thromboembolism. A shortcut to accurate diagnosis? Arch Intern Med 1996 May 13;156(9):939-46. PMID: 8624174.

Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med 2003 Jan 7;138(1):W1-12. PMID: 12513067.

References (1 of 9)

Page 51: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Chappell FM, Raab GM, Wardlaw JM. When are summary ROC curves appropriate for diagnostic meta-analyses? Stat Med 2009 Sep 20;28(21):2653-68. PMID: 19591118.

Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ 2004 Jul 17;329(7458):168-9. PMID: 15258077.

Dukic V, Gatsonis C. Meta-analysis of diagnostic test accuracy assessment studies with varying number of thresholds. Biometrics 2003 Dec;59(4):936-46. PMID: 14969472.

Fu R, Gartlehner G, Grant M, et al. Conducting quantitative synthesis when comparing medical interventions: AHRQ and the Effective Health Care Program. J Clin Epidemiol 2011 Nov;64(11):1187-97. PMID: 21477993.

Glas AS, Lijmer JG, Prins MH, et al. The diagnostic odds ratio: a single indicator of test performance. J Clin Epidemiol 2003 Nov;56(11):1129-35. PMID: 14615004.

References (2 of 9)

Page 52: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Harbord RM, Deeks JJ, Egger M, et al. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007 Apr;8(2):239-51. PMID: 16698768.

Harbord RM, Whiting P, Sterne JA, et al. An empirical comparison of methods for meta-analysis of diagnostic accuracy showed hierarchical models are necessary. J Clin Epidemiol 2008 Nov;61(11):1095-103. PMID: 19208372.

Hartmann KE, Matchar DB, Chang S. Chapter 6: assessing applicability of medical test studies in systematic reviews. J Gen Intern Med 2012 Jun;27 Suppl 1:S39-46. PMID: 22648674.

Irwig L, Tosteson AN, Gatsonis C, et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 1994 Apr 15;120(8):667-76. PMID: 8135452.

References (3 of 9)

Page 53: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Kardaun JW, Kardaun OJ. Comparative diagnostic performance of three radiological procedures for the detection of lumbar disk herniation. Methods Inf Med 1990 Jan;29(1):12-22. PMID: 2308524.

Kester AD, Buntinx F. Meta-analysis of ROC curves. Med Decis Making 2000 Oct-Dec;20(4):430-9. PMID: 11059476.

Lau J, Ioannidis JP, Balk E, et al. Evaluation of technologies for identifying acute cardiac ischemia in emergency departments. Evid Rep Technol Assess (Summ) 2000 Sep;(26):1-4. PMID: 11079073.

Lau J, Ioannidis JP, Schmid CH. Summing up evidence: one answer is not always enough. Lancet 1998 Jan 10;351(9096):123-7. PMID: 9439507.

Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009 Jan;62(1):5-12. PMID: 18778913.

References (4 of 9)

Page 54: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Lijmer JG, Bossuyt PM, Heisterkamp SH. Exploring sources of heterogeneity in systematic reviews of diagnostic tests. Stat Med 2002 Jun 15;21(11):1525-37. PMID: 12111918.

Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999 Sep 15;282(11):1061-6. PMID: 10493205.

Littenberg B, Moses LE. Estimating diagnostic accuracy from multiple conflicting reports: a new meta-analytic method. Med Decis Making 1993 Oct-Dec;13(4):313-21. PMID: 8246704.

Loong TW. Understanding sensitivity and specificity with the right side of the brain. BMJ 2003 Sep 27;327(7417):716-9. PMID: 14512479.

Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993 Jul 30;12(14):1293-316. PMID: 8210827.

References (5 of 9)

Page 55: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Mulherin SA, Miller WC. Spectrum bias or spectrum effect? Subgroup variation in diagnostic test evaluation. Ann Intern Med 2002 Oct 1;137(7):598-602. PMID: 12353947.

Oei EH, Nikken JJ, Verstijnen AC, et al. MR imaging of the menisci and cruciate ligaments: a systematic review. Radiology 2003 Mar;226(3):837-48. PMID: 12601211.

Reitsma JB, Glas AS, Rutjes AW, et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005 Oct;58(10):982-90. PMID: 16168343.

Riley RD, Abrams KR, Lambert PC, et al. An evaluation of bivariate random-effects meta-analysis for the joint synthesis of two correlated outcomes. Stat Med 2007 Jan 15;26(1):78-97. PMID: 16526010.

References (6 of 9)

Page 56: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Riley RD, Abrams KR, Sutton AJ, et al. Bivariate random-effects meta-analysis and the estimation of between-study correlation. BMC Med Res Methodol 2007 Jan 12;7:3. PMID: 17222330.

Rutjes AW, Reitsma JB, Di Nisio M, et al. Evidence of bias and variation in diagnostic accuracy studies. CMAJ 2006 Feb 14;174(4):469-76. PMID: 16477057.

Rutter CM, Gatsonis CA. Regression methods for meta-analysis of diagnostic test data. Acad Radiol 1995 Mar;2 Suppl 1:S48-56; discussion S65-7, S70-1 pas. PMID: 9419705.

Simel DL, Bossuyt PM. Differences between univariate and bivariate models for summarizing diagnostic accuracy may not be large. J Clin Epidemiol 2009 Dec;62(12):1292-300. PMID: 19447007.

Thompson SG, Sharp SJ. Explaining heterogeneity in meta-analysis: a comparison of methods. Stat Med 1999 Oct 30;18(20):2693-708. PMID: 10521860.

References (7 of 9)

Page 57: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Trikalinos TA, Coleman CI, Griffith L, et al. Meta-analysis of test performance when there is a “gold standard.” In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June 2012. p. 8.1-21. AHRQ Publication No. 12-EHC017. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

Trikalinos TA, Chung M, Lau J, et al. Systematic review of screening for bilirubin encephalopathy in neonates. Pediatrics 2009 Oct;124(4):1162-71. PMID: 19786450.

Visser K, Hunink MG. Peripheral arterial disease: gadolinium-enhanced MR angiography versus color-guided duplex US--a meta-analysis. Radiology 2000 Jul;216(1):67-77. PMID: 10887229.

Whiting P, Rutjes AW, Reitsma JB, et al. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004 Feb 3;140(3):189-202. PMID: 14757617.

References (8 of 9)

Page 58: Meta-analysis of Test Performance When There Is a  “ Gold Standard ”

Zwinderman AH, Bossuyt PM. We should not pool diagnostic likelihood ratios in systematic reviews. Stat Med 2008 Feb 28;27(5):687-97. PMID: 17611957.

References (9 of 9)