SPECTRUM TOPR 0075: Veterans Health Administration (VHA) Strategic Analytics for Improvement and Learning (SAIL) Assessment Office of Strategic Integration (OSI) Veterans Health Administration (VHA) U.S. Department of Veterans Affairs (VA) April 24, 2015
Final Report Contract Number: VA798-11-D-0122 Prepared by: Booz Allen Hamilton 8283 Greensboro Drive McLean, VA 22102
TABLE OF CONTENTS
1 EXECUTIVE SUMMARY
2 INTRODUCTION
3 QUALITATIVE ANALYSIS
3.1 Assess Validity of SAIL as a Tool to Identify Strengths and Weaknesses of Medical Center Performance
3.2 Assess Individual Domains Included within SAIL for Evaluating Facility Performance
3.3 Provide a Detailed Comparison of SAIL to Other Systems for Assessing Hospital and Delivery System Performance that Are Used in the Public and Private Sectors
3.4 Determine How SAIL is Assessed by Users in Terms of Relevance, Utility, and Perceived Accuracy
4 EMPIRICAL ANALYSIS
4.1 Assess Whether the Data Elements within SAIL are Representative, Valid, and Reliable Contributors to their Respective Domains
4.2 Evaluate the Methodology for the (“Star”) Ratings of Quality and their Internal and External Validity
5 RECOMMENDATIONS
5.1 Measurement System Purpose — Accountability and Quality Improvement
5.2 Measures
5.3 Measurement System Hierarchy
5.4 Scoring/Star Rating
5.5 Measurement System Management
6 CONCLUSION
APPENDIX A. METHODS
LIST OF EXHIBITS
Exhibit 1. Short- and Long-Term Recommendations
Exhibit 2. SAIL Overview
Exhibit 3. Domain Framework
Exhibit 4. Medicare Total Performance Score
Exhibit 5. California Hospital Performance Ratings (CHART)
Exhibit 6. Comparisons of Domain Scores by Star Rating
Exhibit 7. Comparisons and Domain Scores by Star Ratings
Exhibit 8. Measure-Domain Correlation Results Summary
Exhibit 9. Fit Summary of Confirmatory Factor Analysis
Exhibit 10. Promax Rotated Factor Pattern of Current SAIL Data
Exhibit 11. Latent Profile Analysis
Exhibit 12. 3-Class Profile Plot
Exhibit A-1. Sources of Input for SAIL Final Report
Exhibit A-2. Measurement System Industry Experts
Exhibit A-3. Discussions Conducted by Title
Exhibit A-4. Star Rating Distribution
Exhibit A-5. Complexity Distribution
1 EXECUTIVE SUMMARY
As the nation’s policy makers call for alignment with national quality goals, the Veterans Health
Administration (VHA) continues to refine tools and strategies aimed at improving the quality of
care received by the nation’s Veterans, informed by comparisons with the private and public sectors. With
over 120 hospitals, 800 clinics, and 200 nursing homes, this is no small task for VHA. VHA has
been a leader in the use of system-wide performance measurement to assess care and to improve
processes and clinical outcomes.1,2,3 As part of an integrated set of activities intended to improve the
health care system and its constituent parts, VHA developed and implemented a comprehensive
measurement system known as the Strategic Analytics for Improvement and Learning (SAIL)
Value Model. With information and capabilities provided by SAIL, the VA gains the opportunity to
identify facilities that may require assistance achieving higher performance, select facilities that
could offer assistance as models for new practices or as high-performing mentors, and assist VA
Medical Center (VAMC) directors and staff in developing targeted quality improvement
initiatives.
VHA, given its ongoing role and commitment to the use of performance measurement to
improve quality of care, contracted with Booz Allen Hamilton (Booz Allen) and independent
consultants (referred to as the Booz Allen Team in this report) for an independent review of
SAIL. The recommendations produced by the Booz Allen Team (Exhibit 1) serve as a catalyst to
VHA-led and other stakeholders’ deliberations and strategies on iterative advancements in
performance measurement and reporting in the VA health care system. The Team understands
that advances and improvements in the field are frequently incremental and has grouped the
recommendations by suggested timeframe—short term versus long term—and included
observations for each recommendation area (e.g., measurement purpose, measures, hierarchy,
scoring/star rating, and measurement system management). The Final Report offers readers the
opportunity to glean findings and explore contextual information provided in more substantive
narrative (Sections 3, 4, and 5).

1 Kizer KW, Dudley RA. “Extreme makeover: transformation of the Veterans Health Care System.” Annual Review of Public Health. 2009;30:1–27.
2 Edmondson EA, Golden BR, Young GJ. Turnaround at the Veterans Health Administration. N9-607-035. Boston, MA: Harvard Business School. 2006.
3 Trevelyan EW. The Performance Management System of the Veterans Health Administration. Harvard School of Public Health Case Study. Cambridge, MA: Harvard School of Public Health. 2002.
Exhibit 1. Short- and Long-Term Recommendations
Area Observation Short Term (before next year) Long Term
Measurement System Purpose
There is a discrepancy between SAIL’s original design as an improvement tool and its current use also as an accountability tool. A determination of SAIL’s primary purpose will drive many of VHA’s design decisions and improvements.
• Clarify the purpose of the SAIL measurement system for accountability and improvement
• Clarify the scoring approaches for mastery (predetermined values or external standards) or relative peer performance
• Clarify the specific role of SAIL in informing WHAT versus HOW to improve
• Clarify the change model – how improvement is achieved and SAIL’s role in contributing to such change
Measures
Measures used in SAIL, especially for accountability, need to be valid (accurately and fairly measure what they purport to measure) and reliable (precisely and reproducibly discriminate true performance differences).
• Formalize a process for selecting and managing measures with respect to their intended uses
• Perform additional screening to examine each measure’s ability to discriminate performance
• Consider integrated scoring of quality and efficiency
Hierarchy
There are alternative hierarchy structures to be considered based on key stakeholders and industry standards. For accountability purposes, numerically combining individual metrics or domain scores into single measurements of quality may not be effective.
• Clarify rationale for the domain structure such as grouping measures that: i) share the same underlying dimension of performance, ii) reflect policy priorities, iii) are organized by clinical program area, iv) are based on psychometric properties, v) align with industry-wide constructs, etc.
• Reconsider weighting scheme for grouping measures that includes input from stakeholders
• Expand the measures hierarchy to include tiers that serve frontline care and service providers and construct clinical program area performance dashboards to align accountability with staff’s span of control by program area
Scoring/Star Rating
Our empirical analysis (latent class analysis) suggests the data support a three-class model, not a five-class model. The current star rating system may risk misclassification of facilities. Feedback from the field raises questions about the need for additional methods to allow for optimal fairness in inferences made from final summary scores.
• Conduct empirical investigations of non-equivalence across facilities to demonstrate appropriate handling of such differences, and optimal fairness in inferences made from the final summary scores
• Assess potential misclassification and address using techniques to mitigate medical center misclassification error
• Perform sensitivity analysis of the weighting scheme to reveal differences in resulting scores
• Consider shifting to a 3-star rating summary system
• Consider assigning responsibility for performance using a balanced scorecard approach that drills down to frontline supervisors
• Consider alternatives to creating fixed strata and limit comparisons within strata
Measurement System Management
The Field Directors Assessment suggests that some staff in other parts of VHA have not engaged with SAIL enough to feel like effective users of, or contributors to, SAIL as it exists.
• Engage internal stakeholders in adapting SAIL to the needs of its users over time
• Consider adopting rapid cycle evaluation or improvement best practices for managing SAIL
• Engage internal stakeholders through a formal process and consider a more robust training and technical assistance program to support SAIL end-users
2 INTRODUCTION
The Booz Allen Team conducted an independent assessment of the SAIL Value Model in
accordance with contract number VA798-11-D-0122. The SAIL Final Report (Task 6.5) is the
culmination of earlier deliverables from this review. The purpose of the report is to:
• Assess the validity of SAIL as a tool to identify strengths and weaknesses of medical
center performance.
• Assess the domain structure as a hierarchy for medical center performance reporting.
• Assess whether the measures within SAIL are representative, valid, and reliable
contributors to their respective domains.
• Evaluate the methodology for the (“star”) ratings of quality and their internal and external
validity.
• Provide a comparison of SAIL to other systems for assessing hospital and delivery
system performance used in the public and private sectors.
• Document how SAIL is assessed by users in terms of relevance, utility, and perceived
accuracy.
• Provide recommended next steps for SAIL improvements.
The Final Report contains findings from earlier reports—Task 6.2: Discussion Paper, Task 6.3:
Synopsis of Industry Best Practices on Measurement Systems (referred to hereafter as the
Synopsis), Task 6.4: Field Directors Assessment—and an empirical analysis of raw data
provided by VHA. The material in this report is organized to offer readers both high-level
summaries of findings, as well as contextually relevant discussions. The observations are
presented to share information, stimulate discussion, assist communication among stakeholders,
facilitate understanding, and provide guidance on potential SAIL enhancements that could be
pursued by VHA. Hereafter, the paper is organized in the following manner:
• Section 3: Findings from the qualitative portion of the assessment
• Section 4: Findings from the empirical analysis
• Section 5: Recommendations for potential next steps for SAIL
• Section 6: Conclusion of key themes in the report
• Appendix A: Methodology used to collect data for earlier reports
The SAIL tool is a web-based, balanced scorecard model used by the U.S. Department
Veterans Affairs to measure, evaluate, and benchmark quality and efficiency among VA Medical
Centers.4 SAIL was designed to offer a high-level view of health care quality and efficiency,
enabling executives and managers to examine a wide breadth of existing VA measures in one
place.1 SAIL assesses 28 quality measures, including mortality, complications, and customer
satisfaction, which are organized within the following domains: acute care mortality; avoidable
adverse events; Centers for Medicare and Medicaid Services (CMS) mortality and readmission
measures; length of stay; mental health; performance measures (ORYX/HEDIS); customer
satisfaction; ambulatory care sensitive condition hospitalizations; clinical wait times and call
center responsiveness; and efficiency. The underlying data on which SAIL is based are
provided through other VHA sources, such as Linking Knowledge and Systems (LinKS),
ASPIRE, VA Inpatient Evaluation Centers (IPEC), Performance Management, and Office of
Productivity, Efficiency, and Staffing (OPES).
Exhibit 2. SAIL Overview
The SAIL reporting tool incorporates data from 128 VAMCs that provide acute inpatient
medical and/or surgical care to Veteran patients. The report also includes data from facilities that
do not have acute inpatient medical and/or surgical care (i.e., Ambulatory Care Centers,
Rehabilitation Centers, and Outpatient VAMCs).1 From these measures, SAIL also provides a
composite 1- to 5-star rating of overall quality for each of the 128 VAMCs.2 The VA’s
hypothesis is that 1-star facilities will benefit from adopting successful practices from 5-star
facilities.

4 Veterans Health Administration Office of Informatics and Analytics. Strategic Analytics for Improvement and Learning (SAIL) Fact Sheet. Retrieved March 2015 at http://www.hospitalcompare.va.gov/docs/06092014SAILFactSheet.pdf
SAIL’s assessment of the relative performance of facilities involves several steps:
1. Facilities are first compared within their comparison group on individual quality
measures and assigned a score based on their relative performance;
2. Within each domain, the measure scores are multiplied by the assigned weight and then
added together to become the domain score;
3. The domain scores are then used to calculate the quality composite score; and
4. Using 10th, 30th, 70th, and 90th percentile cut-offs of the composite scores, each facility
is designated a 1- to 5-star rating for overall quality.
Facilities are assigned a 1- or 5-star rating if their scores fall in the bottom or top 10 percent,
respectively.5 Facilities in the next bottom and top 20% of the distribution are assigned a 2- or
4-star rating, respectively. The remaining 40% of facilities are assigned a 3-star rating. SAIL’s
5-star facilities that have acute care inpatient mortality or 30-day mortality in the highest 20%
(high mortality) are demoted to a 4-star rating. An equal number of 4-star facilities are promoted
to a 5-star rating. Facilities with a 1-star rating that have the most inpatient measures performing
better than the bottom 20% of health systems in the Truven Top Health Systems study are
promoted to a 2-star rating.2 SAIL is noted by VHA to be unique because, unlike most other
health industry report cards, which are updated annually, SAIL is updated quarterly to allow
medical centers to more closely monitor the quality and efficiency of the care delivered to
Veterans.2

5 Veterans Health Administration Office of Informatics and Analytics (2015). Strategic Analytics for Improvement and Learning (SAIL) Measure Education Module – Healthcare Acquired Infections.
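The percentile-based star assignment described above can be illustrated with a brief sketch. This is a minimal illustration of the cutoff logic only, not SAIL's actual implementation; function and variable names are hypothetical, and the mortality demotion and Truven-based promotion adjustments are omitted.

```python
import numpy as np

def assign_stars(composite_scores):
    """Assign 1- to 5-star ratings from composite quality scores using the
    10th/30th/70th/90th percentile cutoffs described in the report."""
    scores = np.asarray(composite_scores, dtype=float)
    cutoffs = np.percentile(scores, [10, 30, 70, 90])
    # searchsorted returns 0-4 depending on which cutoff band a score falls
    # into; adding 1 yields the 1- to 5-star rating.
    return np.searchsorted(cutoffs, scores, side="right") + 1

# Example with 128 simulated facility composite scores: roughly 10% of
# facilities receive 1 star, 20% receive 2 stars, 40% receive 3 stars,
# 20% receive 4 stars, and 10% receive 5 stars.
rng = np.random.default_rng(0)
stars = assign_stars(rng.normal(size=128))
```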
3 QUALITATIVE ANALYSIS
The primary purpose of a measurement system will influence the decisions made by system
developers. The primary purpose drives decisions in the selection of measures, structure of
domains, and scoring methodologies. One example is the choice in measure selection between
rapid-cycle, focused measures intended to provide formative feedback regarding
improvement efforts and more valid and reliable summary measures suitable for formal
accountability. As noted in earlier reports produced by the Booz Allen Team, these
differences can be illustrated through the example of readmission measures6:
• Accountability systems include measures designed for provider comparisons, include
risk-adjustments, and are collected retrospectively over long periods of time.
• Improvement systems include measures that capture local performance over time, are
not risk-adjusted, and are captured weekly or even daily to provide real-time information
as to whether interventions or clinical care are producing the intended outcomes.
A primary issue that underlies the prioritization of improvements to SAIL is the
discrepancy between its original design as an improvement tool and its current use as an
accountability tool. SAIL was originally developed as “VA’s internal improvement tool which is
designed to offer high-level views of health care quality and efficiency.”7 A determination of
SAIL’s primary purpose will drive many of VHA’s design decisions and improvements.
Accountability may take on a different meaning when referring to a public institution like the
VA. In a private sector hospital system, measures used for improvement may never be visible to
the public. But in VA, even internal measures intended to focus on improvement are likely to be
discovered by congressional staffers or the media and made public for the purpose of holding
VA accountable. Therefore, they may become de facto accountability measures even when that
is not their original intent. The dichotomy of the primary purpose(s) of SAIL is further discussed
throughout the Final Report.
3.1 Assess Validity of SAIL as a Tool to Identify Strengths and Weaknesses of
Medical Center Performance
This section of the Final Report provides findings from the qualitative tasks from the Field
Directors Assessment and Synopsis.
6 SAIL Discussion Paper (Task 2), Section 3
7 SAIL Value Model 20141210.pdf
3.1.1 Field Director Discussions
The Field Director discussions with the Medical Center Directors, Chiefs of Staff, Deputy Chiefs
of Staff, and Veterans Integrated Service Network (VISN) Directors provided insights into how
SAIL is being used in the field, as well as the perception of SAIL by users with regard to its level
of accuracy and relevancy for identifying both strengths and weaknesses in their own
performance. Responses obtained throughout these discussions were used to assess the perceived
validity of SAIL for identifying strengths and weaknesses of medical center performance.
Nearly all of the VISN Directors, Medical Center Directors, Chiefs of Staff, and Deputy Chiefs of
Staff appreciated that their feedback and opinions are being considered for future improvements
to SAIL. When all 38 Program Office Staff, VISN Directors, Medical Center Directors, Chiefs of
Staff, and Deputy Chiefs of Staff were asked what they like most about SAIL, 90% indicated that
they like the concept of a centralized dashboard or “snapshot” across all VAMCs where they can
view high-level data, and 70% indicated that they liked the graphics and visual presentation of
SAIL data and reported that improvements were made in their facilities as a direct result of SAIL.
Responses from the field indicate changes could be made to improve SAIL’s perceived validity
for identifying strengths and weaknesses of medical center performance. The following three
overarching themes emerged in terms of recommendations for improvement with regard to
evaluating facility strengths and weaknesses: 1) comparison of VAMC measures to existing
benchmarks of excellence and to neighboring non-VA facilities; 2) increase in the frequency
with which the SAIL results are reported; and 3) addition of measures sensitive to day-to-day
aspects of managing VA care.
These recommendations must be reviewed in the context of the intended use of SAIL. For the
first recommendation, the selection of comparators can be quite different depending on the
intended use. For example, if the desire is to support consumer choice of a facility, local
comparators would best aid decision making. If selecting comparators for accountability of results, a focus
on a wide national sample would best suit this purpose. For the second recommendation,
frequent and recent results best suit quality improvement purposes. However, for the purpose of
accountability, SAIL results can be less frequent to support reliable estimates for low-frequency
events. Regarding the third recommendation, more sensitive measures support quality
improvement, but those measures are often not appropriate to be rolled up for a star rating or
accountability measurements. The dichotomy of the intended use of SAIL as a tool for quality
improvement and accountability is further explored in the following sections of the Final Report.
When interviewees were asked what they liked least about SAIL, 39% (15 out of 38) pointed to
the method behind the star rating system. Interviewees generally noted that since the star rating is
a ranking system among VA facilities, it is not viewed as an appropriate depiction of medical
center performance since someone is always at the top and someone is always at the bottom. As
a result, a few Field Directors reported concern that this drives competition as opposed to
collaboration. Furthermore, concerns indicated that the star rating focuses all attention on
performance difference between VHA facilities without regard necessarily to the meaning or
value of such differences and ignores more relevant comparisons to established benchmarks of
excellence and to local health care alternatives to VHA facilities. More than half (58%)
responded they would prefer, instead, to be measured against established benchmarks of
excellence and to have their performance compared to neighboring non-VA facilities as much as
possible so they can tell patients their facility is “just as good or better than [the] neighboring
non-VA facilities.”
When measuring VA facilities against established benchmarks of excellence, it was reported by
many respondents that facility demographics need to be factored into the ranking — this would
include facility complexity level as well as urban versus rural facilities so “apples-to-apples”
comparisons occur as opposed to what is now happening, which is “apples-to-oranges.” These
recommendations highlight considerations that must be addressed when establishing benchmarks
for performance improvement. Relative benchmarks are best for purposes of continuous
improvement. Absolute benchmarks are best in some circumstances to establish clear levels of
achievement in the short term (e.g., a one-year payout schedule). The primary purpose for SAIL
and its measures must be considered when determining appropriate benchmarks.
Reporting lag-time was mentioned as an aspect preventing users from being able to effectively
use SAIL as a performance assessment and improvement tool, especially when it comes to
identifying shortcomings within care delivery. In fact, 10% (4 out of 38) of all respondents
cited the reporting lag-time when asked what they liked least about SAIL. This response
was the third most common response for this question. Moreover, when asked what measures
interviewees would like added or removed from SAIL, 95% (32 out of 34) noted they would like
to see more real-time data metrics incorporated into SAIL. The data reporting lag-time also
played a role in why respondents reported certain measures, such as patient satisfaction and the
efficiency measure, are not useful for monitoring improvement efforts. Specifically, 20% (7 out
of 34) stated directly that SAIL data must be current in order to be actionable. Existing
quality decision support tools designed specifically for monitoring clinical care process and
outcome measures and managing patient satisfaction may be more appropriate than using SAIL
for both purposes, improvement and accountability.
When interviewees were asked how relevant and useful they found SAIL overall for assisting
them in managing their medical centers’ performance, 50% (15 out of 30) indicated SAIL is not
useful or relevant. This led to discussion of what additional information pertinent to their day-to-
day operations is needed. The most commonly suggested types of information to be added are:
1) how much VAMCs are supporting staff development and fostering employee growth and
satisfaction; 2) human resource data (e.g., vacancy rates); 3) data on patient flow; 4) how
VAMCs are supporting research; 5) contracting data (e.g., purchased care or "fee" care); 6)
outpatient care domain; 7) surgical care domain; 8) homelessness; 9) diabetic foot care; and 10)
mental health drill-down data.
3.1.2 Industry Expert Discussions
The industry expert discussions further provide insight into performance measurement tools, such
as SAIL, used to assess strengths and weaknesses within a system. The industry experts
interviewed represented organizations that are committed to continuous improvement in
healthcare quality, efficiency, and value. The efforts within their organizations are highly
regarded, even exemplary, in relation to the state of the art in the industry as a whole. VHA has
watched the industry attentively for opportunities to borrow conventions and adapt best practices
in measurement, as well as dissemination and education.
In many of those respects, VHA matches the quality and thoroughness of other leading health
systems and ratings organizations. As shown elsewhere, the SAIL Value Model includes a
number of measures spanning all the dimensions of health system performance recommended by
the Institute of Medicine (IOM).8 And, as of 2015, SAIL extended its measurement system to
include mental health as a set of important conditions faced by Veterans and as a specialty or
service line provided by VHA. In addition, SAIL includes mission-critical measures related to
access to care.
In contrast to the standalone rating systems (Health Grades, Truven, Consumer Reports, and
Leapfrog), the measurement systems integral to health delivery systems tend to manage and
deploy many more measures. This reflects broad responsibility for complex delivery systems that
span the continuum of care, as well as the need to direct information to multiple parties ranging
from external parties to senior leadership, middle management, and frontline staff. Also,
measures used for evaluating management are often parsimonious and focused and rolled up to a
summary measure. Measures used to drive improvement are varied and not usually aggregated to
create an overall rating. The multiple audiences reflect several additional purposes for measures
beyond sending comparative signals to external parties (e.g., consumers), such as:
• Making or monitoring the business case for investments in infrastructure or operations,
e.g., clinical registries, electronic health records (EHRs), clinicians, administrative and
clinical support staff, building capacity, patient education, and other technologies.
• Inferring or establishing responsibility (accountability) for specific patients or episodes of
care, i.e., attribution of outcomes to specific departments, sites, or individuals.
• Linking measures in coherent causal chains that connect frontline activities to designated
outcomes of interest.
• Organizing measures around specific patient cohorts conforming to physician specialty
and departmental lines of service; connecting care processes, service utilization, resource
inputs, cost, clinical outcomes, and self-reported patient experience.
For the purpose of accountability, the number of SAIL measures is consistent with information
gleaned from the industry interviews. Industry experts highlighted that fewer measures were
generally used for accountability measurement. Most of the SAIL measures reflect industry
standards in terms of their construction and meet standards of face validity (i.e., they are
specified and constructed reasonably to reflect their intent, or label). Many of the measures are
borrowed, replicated, or adapted from external sources (e.g., Truven, CMS, and the Agency for
Healthcare Research and Quality [AHRQ]).

8 Deliverable 6.3 Synopsis of Industry Best Practices on Measurement Systems, Exhibit 8.
The ability of SAIL to capture and identify strengths and weaknesses of medical center
performance for quality improvement is limited partly by the measure set. Although motivated
VHA staff can “drill down” into layers of data beneath the surface measures, it is largely a
qualitative or intuitive exercise. For example, users can see which patients died. This provides
great transparency with respect to how raw data and calculations led to SAIL measures and even
could facilitate qualitative improvement efforts, such as special root cause analyses. However,
the root causes themselves, the significant determinants of the outcomes of interest, are not
codified as measures in SAIL. Additionally, SAIL does not provide users with a consistent
methodology for carrying out these types of analyses.
The industry experts interviewed by the Team discussed that integrated measurement systems
involve a process map consisting of nested layers that give rise to the outcome of interest, but
connect directly to the “decision layer where all change happens.” Processes can be organized
into hierarchies, typically involving around three to nine steps, each of which can be seen as a process or
an outcome depending on perspective. For example, certain tests, medications, and other clinical
activities can determine directly the lipid profile of a patient, and lipid control is an
(intermediate) outcome of interest. Similarly, clinical activities affect blood pressure control or
glucose control, which individually and in combination affect the clinical progression of diabetes
and the onset of complications such as retinal disease, impaired kidney function,
neuropathy/amputation, cardiovascular health and outcomes, and eventually survival. This type
of process map is not apparent in SAIL. Consequently, SAIL does not by itself identify the
pockets of deficiency, the unreliable links in the causal chain, or the specific challenges at the
decision layer of operations.
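To make the notion of a nested process map concrete, the sketch below encodes a hypothetical causal chain for diabetes care as a simple data structure. The layers and activities are illustrative examples drawn from the narrative above, not actual SAIL or IDS content.

```python
# A hypothetical nested process map: frontline activities feed intermediate
# outcomes, which in turn feed the outcome of interest. Illustrative only.
diabetes_process_map = {
    "outcome": "clinical progression of diabetes and survival",
    "intermediate_outcomes": {
        "lipid control": ["lipid panel testing", "statin prescribing"],
        "blood pressure control": ["BP monitoring", "antihypertensive titration"],
        "glucose control": ["HbA1c testing", "medication adherence counseling"],
    },
}

def decision_layer(process_map):
    """Return the frontline clinical activities -- the 'decision layer where
    all change happens' -- underlying the outcome of interest."""
    return [step
            for steps in process_map["intermediate_outcomes"].values()
            for step in steps]
```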
SAIL is an efficient tool managed by a fairly small and dedicated multidisciplinary team. It
welcomes suggestions for new measures and imports various measures from other parts of VHA
or from external sources. From our Team’s discussions with the SAIL Team during this review
process, the SAIL Team is open to augmenting its capacity and breadth to serve the medical centers with
more details and guidance regarding performance improvement.
Taking the included measures as given, SAIL makes commendable efforts to help users
interact with the measures and understand them in context. Tabulations and graphical
presentations can allow users to see easily how they rate compared to other users (medical
centers) and to their own prior performance. Rating organizations tended to stress cross-sectional
or contemporaneous comparisons among providers and health plans. That largely reflects their
mission of providing report cards and decision-support for making choices about where to
receive care presently.
In contrast, although experts in health systems acknowledged the importance of external
benchmarks and comparisons, particularly for external accountability, they tended to give greater
attention to internal benchmarks and evidence of improvement toward specified goals. This is
especially relevant for systems designed for the purpose of performance improvement. Measures
that were for external consumption or for accountability generally were seen as “floors” or
requirements for minimum adequate performance. Measures for improvement were viewed
instead as more challenging and “aspirational,” intended to motivate and guide performance to
levels not normally achieved (by nearly anyone).
Regarding measures of improvement in relation to internal benchmarks, industry experts
downplayed the need for statistical risk-adjustment, given so much is held constant naturally,
including the needs and complexity of the patient population. VHA also would seem to benefit
analytically from having a fairly stable target population for whom they retain ongoing
responsibility for continuity of care and long-run outcomes.
Comparisons between the VHA system and other healthcare sectors, and even among medical
centers within VHA, may be more difficult without risk adjustment. Failure to measure or adjust
for differences in underlying patient characteristics and needs, or for the relationship between the
characteristics of different medical centers and their resulting “niche” in the local markets for
Veterans, may threaten the validity of such comparisons. Hence, capturing or identifying
situations of relatively low performance and opportunities for improvement may be distorted by
non-equivalent comparators.
3.2 Assess Individual Domains Included within SAIL for Evaluating Facility
Performance
This section provides a review of our findings from the Synopsis (Task 6.3) and the results from
the Field Directors Assessment (Task 6.4).
3.2.1 Field Discussions
The SAIL Value Model allows users to view performance according to each domain—Acute
Care Mortality, Avoidable Adverse Events, CMS 30-Day RSMR & RSRR, Length of Stay,
Performance Measures, Customer Satisfaction, ACSC Hospitalizations, Clinical Wait Times &
Call Center Responsiveness, Mental Health, and Efficiency. The domain scores are presented as
the average z-score of measures in the same domain. The overall quality z-score is the average of
the nine quality domain z-scores.9
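A minimal sketch of this scoring convention, assuming each measure has already been standardized into a z-score across facilities (all names are illustrative):

```python
import numpy as np

def domain_and_overall_scores(measure_z, domain_of):
    """Average standardized measure scores into domain z-scores, then average
    the quality domain z-scores into the overall quality z-score.

    measure_z: dict mapping measure name -> z-score for one facility
    domain_of: dict mapping measure name -> quality domain name
    """
    by_domain = {}
    for measure, z in measure_z.items():
        by_domain.setdefault(domain_of[measure], []).append(z)
    domain_scores = {d: float(np.mean(zs)) for d, zs in by_domain.items()}
    overall = float(np.mean(list(domain_scores.values())))
    return domain_scores, overall
```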
The perceived saliency of the SAIL domains for evaluating facility performance, as reported
throughout our interviews with Medical Center Directors, revealed mixed results and several
suggestions for improvement. When interviewees were asked to assess the relevance and
usefulness of the SAIL domains for managing their medical center(s)’ performance and driving
improvement, 57% of respondents expressed that the domains were generally relevant.
As mentioned earlier, several interviewees recommended incorporating additional information
into SAIL, including the following: 1) how much VAMCs are supporting staff development and
fostering employee growth and satisfaction; 2) human resource data; 3) data on patient flow; 4)
how VAMCs are supporting research; 5) contracting data (e.g., purchased care or "fee" care); 6)
outpatient care domain; 7) surgical care domain; 8) homelessness; 9) diabetic foot care; and 10)
mental health drill-down data. In fact, 12% (5 out of 34) said SAIL was weighted too heavily
toward inpatient rather than outpatient care. Additionally, several interviewees recommended
eliminating overlap of measures within domains. For instance, there are multiple mortality measures
within the acute care mortality domain and additional mortality measures within the CMS
measures domain. Pressure ulcer measures are in both the risk adjusted complication index and
the risk adjusted patient safety index within the avoidable adverse event domain.
9 As documented in SAIL Data Definition 2014, Quarter 4
Overall, more outpatient care information, surgery, and mental health drill-down data were noted
to be foundational needs for attempting to evaluate the level of care provided by VA, whereas the
other suggestions were noted to be important for managing their facility and evaluating overall
quality of their facility or facilities. One interviewee suggested the domains first should be
formulated based on the primary goals of VA. Then measures can be identified to fulfill these
goals, such as population health, Veteran experience, financial stewardship, excellence in
workforce, and excellence in service to communities. It was expressed that these goals as a
whole are not adequately captured within the current SAIL domains.
Many expressed concerns with the current weighting of the domains. These concerns boiled
down to two key aspects: 1) a poor understanding of why the domains are weighted the way
they are; and 2) frustration with the “lack of transparency” surrounding the weighting
methodology, leading to poor confidence and trust in the SAIL tool overall. In addition, one
respondent suggested the SAIL domains need to more accurately account for geographic
variation, such as labor force issues and population issues.
3.2.2 Industry Expert Discussions
Clinical programs are the bedrock of the performance measurement systems for three of the
leading U.S. integrated delivery systems interviewed by the Booz Allen Team – Intermountain
Healthcare, Geisinger Health System (Geisinger), and Kaiser Permanente (Kaiser). From the
industry expert discussions, the Team gleaned that Intermountain Healthcare organizes its
performance metrics by 13 clinical areas, and Geisinger maps its metrics to approximately 25
clinical programs. While the measures are grouped by clinical programs, unlike SAIL, these
groupings are not used as a performance scoring mechanism. In essence, Intermountain
Healthcare and Geisinger group measures based on clinical programs or service level for
organizational purposes, but these systems do not provide a “clinic-level” score. Thus, these
systems do not produce clinical program metrics rolled up into system-wide scores. The lack of a
scoring hierarchy is common when the purpose of measurement is quality improvement.
Kaiser’s domain formulation differs somewhat from the other IDS. Many of Kaiser’s measure
sets are bundled by condition and others are clustered by program priorities such as patient safety
or preventive health. These composites are then grouped into clinical program areas. For
example, a cancer treatment composite is joined with a cancer screening composite to form the
cancer management domain. Kaiser has no single set of domains — rather its decision support
system enables the user to tailor a dashboard as needed. The domain templates available to
Kaiser staff include:
• IOM Six Aims (safe, effective, efficient, timely, patient-centered, equitable)
• Donabedian classification (structure, process, outcome)
• National Committee for Quality Assurance (NCQA) categories (respiratory,
musculoskeletal, diabetes, cardiovascular, etc.)
• Care setting (inpatient, ambulatory, home health, skilled nursing facility, hospice)
• Externally mandated (regulatory, accrediting, purchaser agreement)
• AHRQ and National Quality Forum (NQF) groupings (similar to organ systems)
The nine SAIL domains, organized mainly by measure type (e.g., outcomes, processes,
patient/staff experience, efficiency, etc.), differ from the IDS approaches that are oriented to
clinical programs and their underlying processes of care. A number of SAIL domains are
conceptually akin to the CMS Hospital Value-Based Purchasing program though CMS uses four
summary performance categories: 1) clinical process of care; 2) patient experience; 3) outcomes;
and 4) efficiency. SAIL’s newest domain, mental health, shares the clinical program focus that is
the core of the IDS measurement programs.
The measurement systems in our convenience sample underscored the importance of domains or
other summary indicators as devices to align the workforce around a small number of priorities
that embody the organization’s mission. Each of these systems signals the organization’s focus
on their major clinical programs. Beyond this clinical program orientation, the measurement
systems’ approaches differ when communicating summary performance indicators to foster
alignment.
The measurement systems use several performance information constructs to organize and
communicate each organization’s goals, business priorities, and performance. In lieu of summary
performance metrics, Intermountain Healthcare aligns measures to illustrate the five program
goals in patient safety, integrated electronic medical records (EMR), patient experience, care
redesign, and prevention/wellness. By contrast, Geisinger uses four “value equation” dimensions
as the pillars of its performance management system:
• Clinical
• Patient Experience
• Total Cost of Care
• Professional Experience
The flexibility of Kaiser’s decision support tool equips Kaiser’s regions and medical centers to
formulate performance dashboards for different needs — whether clinical program management,
region-wide strategic plan monitoring, or otherwise.
Resource use accountability is addressed differently among the IDS. Total cost of care is one of
the four dimensions that comprise Geisinger’s value equation; resource use accountability is
mainly lodged with its primary care providers. Intermountain Healthcare incorporates
appropriateness markers and cost/resource intensity per case metrics into its clinical program
dashboards. Kaiser includes appropriate care metrics in its national measures repository, while
cost metrics are adopted at the region or medical center level.
In summary, the measurement systems’ domains markedly differ from the SAIL approach
(Exhibit 3). For a given clinical program, the measurement systems encapsulate the various
metrics found across the nine SAIL domains, but the metrics are specific to the measurement
system’s clinical program patient population. That is, an IDS clinical program dashboard can
include process, outcome, and cost measures. These IDS likely also use some domains that
comprise cross-program measures (e.g., a Hospital Consumer Assessment of Healthcare
Providers and Systems [HCAHPS] patient survey measure set), but the emphasis is on
fashioning measure sets that are within the span of control of a given clinical program. The IDS
also group performance measures into summary domains for internal accountability purposes,
but these are organizing constructs, not mechanisms to produce domain-level performance
scores.

Exhibit 3. Domain Framework

SAIL
• Organized mainly by measure type (e.g., outcomes, processes, patient/staff experience, efficiency, etc.)
• Global indicator

Intermountain Healthcare
• Emphasis on specific process metrics rather than aggregated domains
• Measure sets organized by ~10 clinical areas
• Overall defect rate is an example of a roll-up metric (e.g., selected care process measures)
• Major care areas and processes organized into 60 datamarts
• No global indicator

Geisinger Health System
• Summary domains: Clinical, Patient Experience, Total Cost of Care, Professional Experience
• Care bundles comprised of multiple measures (e.g., 9 diabetes measures), nested in the clinical domain, organized by 25 clinical service areas
• No global indicator

Kaiser Permanente
• Aggregates measures into domains
• No single approach; tools give managers flexibility to tailor dashboards/hierarchy
• Preference to organize by IOM six aims
• No global indicator
3.3 Provide a Detailed Comparison of SAIL to Other Systems for Assessing
Hospital and Delivery System Performance that Are Used in the Public and
Private Sectors
This section provides a review of our findings from the Synopsis (Task 6.3) comparing the SAIL
Value Model to other systems used in the public and private sectors. In this section, we discuss
the following topics:
• Accountability versus improvement
• Measures (selection, reliability, etc.)
• Measurement system hierarchy (structure)
• Scoring (weighting, time periods)
3.3.1 Accountability versus Improvement
The distinction between measurement approaches used for accountability versus improvement
was an overarching theme sounded by industry experts. While often there is overlap in the
performance measures used for both purposes, in many instances the measures differ. More
importantly, different performance targets are applied and the methods’ rigor can vary when used
for learning and improvement in contrast to use for accountability objectives, like public
reporting, personnel performance assessment, and incentive payments.
The interviewees stressed the distinction between the measures and performance targets used for
accountability given the higher stakes — career, payment, reputation — compared to measures
for improvement purposes. The performance indicators for improvement can include metrics or
methods that have less technical rigor (e.g., no standard case-mix adjustment element) and whose
performance targets are aspirational, given that their uses have lesser consequences. Geisinger
cited an example of using a “quality gate” coupled with an incentive payment program. The
“quality gate” defines minimally acceptable quality to be eligible for payment rewards under an
accountability initiative. That same quality metric could be ratcheted to a much higher threshold
if used for improvement work.
The Kaiser Permanente Quality Measures (KPQM) repository, which comprises approximately
450 quality measures, illustrates measures’ dual purposes. Each measure is categorized by its
approved use for: 1) accountability; 2) improvement; or 3) both accountability and improvement.
Kaiser reports that while measure designation for improvement purposes is generally
straightforward, the assignment of measures for accountability is ongoing as there is less internal
consensus on designating certain measures for accountability uses, particularly if the measure is
linked to payment.
Kaiser’s performance management decision support organizes information to support the
accountable units that can effect change in a set of care or service processes. Unlike SAIL, these
performance systems are not organized by measure type (outcome, process, cost, etc.); rather the
measure sets are fitted to patients grouped by related clinical disciplines like cardiovascular
health.
For the measurement systems reviewed, reporting for accountability and rating purposes uses few
categories to represent the organization’s key performance dimensions. Notably, these categories
are not a quantitative summation of underlying metrics; instead they are composed of a subset of
metrics that are representative of the topic and may be tailored to the user’s needs. For example,
if access to care was deemed a vital aim of the organization, that category could house
overlapping but different measure sets for a hospital chief, an ambulatory clinical leader, or a
rehabilitation unit manager.
3.3.2 Measures
For measure sets, which are composed of hundreds of measures, the most common measure
selection criteria cited by the industry key informants were:
• Health impairment — condition prevalence and severity
• Improvement opportunity — potential clinical and cost gains
• Performance variation — among the accountable entities
Intermountain Healthcare explained its longstanding efforts to identify the relatively few care
processes, and their associated metrics, that account for the bulk of the opportunity to improve
health and reduce costs. Approximately 7% of its processes account for approximately 95% of
the potential gain. The Intermountain Healthcare change model — the systems and mechanisms
to influence quality and cost — is centered on process change. As such, its performance system
is predominately process metrics though certain outcomes measures — including clinical, cost,
and patient experience — are part of the measurement system.
Intermountain Healthcare’s framing of domains by clinical programs reflects a second tier of
criteria for selecting performance dashboard measures. These additional criteria are: 1)
actionability; 2) span of control; and 3) time to effect change.
IDS clinical program managers are responsible for measure sets that reflect the processes,
people, and resources within their purview and for performance targets that can be influenced
during the relevant reporting period. Though our expert discussions did not address partitioning
performance objectives and metrics into near-term (annual) and long-term (e.g., three years)
components, such an approach can be used to match accountability to the time to effect change
and in the meantime monitor progress.
Kaiser’s repository of quality measures largely is sourced from external measurement sets used
by CMS, NCQA, NQF, and the Joint Commission in their regulatory, payment, and recognition
programs. Kaiser’s federation model, in which its regional health systems and plans have had
considerable autonomy, explains the need to draw upon industry standard measures as it seeks
metrics alignment across its seven regions. For example, there is no common patient scheduling
system across Kaiser regions; hence, it is unable to craft access measures specific to appointment
systems. About one-third of the KPQM metrics are internal to Kaiser. The regions and their
health centers assemble their dashboards from the KPQM metrics in addition to local
performance measures.
In contrast to drawing upon external performance metrics and categories, Intermountain
Healthcare’s system has been built from the bottom up. With a concentration on the needs of
those front-line staff that implement change, Intermountain Healthcare has created metrics that
are actionable for staff in their work to manage care processes. Similarly, Geisinger, with its
emphasis on staff span of control, has shaped its performance information system to support staff
as participants and contributors to its system of care.
Measurement systems like Kaiser and Geisinger have the advantage of enrolled health plan
populations for which they capture total cost of care metrics. Intermountain Healthcare may have
the same advantage given its health plan division (SelectHealth) though its health services
division concentration is on cost per case, which is well suited to its process management
approach.
Though SAIL measure selection criteria were not included in the materials we reviewed, the
criteria appear to include: 1) clinical measures for which there are external benchmarks; 2) a
well-rounded mix of measure types that represent a number of the IOM six aims; and 3)
particular attention to two performance areas of high importance to VHA — access and mental
health. The measures are predominately inpatient, though there are important exceptions. It is
unclear to what extent the SAIL measure criteria include improvement-centric elements like
actionability.
3.3.3 Measurement Hierarchy
Each measurement system reviewed has a measurement hierarchy in which individual measures
are aggregated into summary sets and a single score is computed for that composite measure.
The composition of these composites varies — in some instances individual measures are
collapsed into a condition-specific bundle; in other cases the measures are aggregated into a
cross-cutting construct like patient experience or safety. Grouping of composite measures by
clinical programs is a mainstay of these measurement systems (e.g., orthopedics or primary
preventive care). Importantly, though measures are grouped by clinical programs, summary
program scores are not computed. Rather, discrete metrics are organized into dashboards that are
useful for staff to manage systems of care for patient populations within a given clinical area.
These IDS are not using domains in the same way as SAIL, which derives its origins from a
public reporting and rating model (Truven). For VHA, a SAIL domain defines a specific, small
set of two to four measures to represent a performance topic, such as satisfaction or mortality,
and these domains are the underpinning of a structure to compute summary performance scores.
In contrast, the IDS use domains as an organizing construct in which a variety of measures can
be mapped to the domain depending on the user needs. The “clinical quality domain” may
comprise orthopedic quality metrics if that is the program of interest, or for another user it may
be a set of preventive care measures. Similarly, a set of inpatient adverse event measures may be
mapped to a patient safety domain or the domain may be populated by ambulatory metrics for
fall prevention and medication reconciliation.
The IDS highlighted the importance of “nested dashboards” in which there are multiple
dashboards to organize performance information for each operational tier, aligned with that staff’s
responsibilities. For instance, the cardiovascular program medical officer’s performance
dashboard may include a more comprehensive array of cardiovascular metrics, subsets of which
are nested in subsidiary dashboards for department chiefs for interventional cardiology, surgery,
recovery, and rehabilitation, etc. Although we did not assess the relationship of the 28 SAIL
measures to underlying program measure sets, the sample drill-down reports do not reveal a
bridge from SAIL to other VHA clinical program dashboards.
All of the IDS use performance rubrics that mimic the Institute for Healthcare Improvement
(IHI) Triple Aim — each uses dashboards with population health, patient experience of care, and
per capita cost dimensions. Geisinger is notable for its elevation of the “experience of the
professional” dimension on an equal footing with the Triple Aim components. Here, Geisinger is
highlighting the importance of operating a system of care that is embraced and advanced by its
staff.
3.3.4 Scoring
The health care systems we interviewed do not compute summary category or global scores. As such, their scoring tasks involve straightforward scoring of individual measures and the combination of measures into composites. We did not probe the components of these scoring formulas, though there are several industry standards for calculating such composites, including summing the measures to compute a weighted average or using binary all-or-nothing scoring. Standardized scoring techniques, like the SAIL z-scores, are not needed because IDS measure aggregation generally entails like measures within a composite.
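To make the two composite approaches named above concrete, here is a minimal Python sketch; the measure names, rates, weights, and target are hypothetical illustrations rather than SAIL or IDS values.

    # Two common ways to combine like measures into a composite score:
    # a weighted average and binary all-or-nothing scoring.

    # Hypothetical facility-level pass rates for three related process measures.
    measures = {"aspirin_on_arrival": 0.97, "beta_blocker_at_discharge": 0.92,
                "smoking_cessation_advice": 0.88}
    weights = {"aspirin_on_arrival": 0.5, "beta_blocker_at_discharge": 0.3,
               "smoking_cessation_advice": 0.2}

    # Weighted average: each measure contributes in proportion to its weight.
    weighted_avg = sum(measures[m] * weights[m] for m in measures)

    # All-or-nothing: credit is given only if every component meets its target.
    target = 0.90
    all_or_nothing = all(rate >= target for rate in measures.values())

    print(f"Weighted-average composite: {weighted_avg:.3f}")
    print(f"All-or-nothing (all measures >= {target}): {all_or_nothing}")

As the sketch shows, the same inputs can yield a high weighted-average score while failing an all-or-nothing test, which is why the choice between the two standards matters.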
Though the formulas differ, Kaiser and Geisinger use a similar approach to weighting or
ascribing importance to measures for internal accountability. For incentive compensation, Kaiser
designates measures as: 1) maintain high performance; or 2) attain performance improvement.
Incentive pay is linked only to measures in the latter category for which there are improvement
targets. Kaiser and Geisinger generally are not using differential weights to score measures; rather, the weighting is used in compensation formulas (i.e., measure scores or composites are not combined into weighted summary scores).
In its value equation formula used for performance and compensation reviews, Geisinger assigns
weights to metrics based on the size of the gap between actual and targeted performance.
Intermountain Healthcare assigns differential weights to combine measures for certain composites; higher weights are assigned based on health impact. For example, to construct a complications composite, greater weight is assigned to higher-severity complication categories. Intermountain Healthcare links up to 20% of salary to progress on organization-wide goals, not to specific performance metrics.
CMS and private sector payment and recognition programs — like the Medicare Hospital Value-
Based Payment initiative — give weights to domain scores in computing global results (Exhibit
4). Similarly, NCQA, in its health plan accreditation scoring, applies domain weights: it assigns point values to each of its approximately 35 measures and to its accreditation standards, which equates to about 35% weight for clinical performance, 15% for patient experience, and 50% for accreditation standards. Though the IDS monitor external benchmarks, they
largely use internal benchmarks when setting performance targets. Kaiser, in a hybrid approach,
uses the higher of its internal benchmark or a national reference benchmark.
Measures are differentially weighted in the SAIL global scoring. Using a rule of equal weights for each of the domains, the point values within a domain are allocated proportionally among the domain measures, which leads to differential measure weights. This weighting formula results in relatively small weights for dimensions like patient experience and access, staff satisfaction, and ambulatory care clinical quality.

Exhibit 4. Medicare Total Performance Score
Domain               Weight
Clinical             20%
Patient Experience   30%
Outcome              30%
Efficiency           20%
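The dilution effect of this equal-domain-weight rule can be sketched in a few lines of Python; the measure counts per domain below are illustrative, not the actual SAIL configuration.

    # With equal domain weights, each measure's global weight is
    # (1 / number_of_domains) / measures_in_its_domain, so measures in
    # crowded domains carry less weight than measures in small domains.

    domains = {                      # hypothetical measure counts per domain
        "Acute Care Mortality": 2,
        "Avoidable Adverse Events": 5,
        "Performance Measures": 8,
        "Customer Satisfaction": 3,
    }
    domain_weight = 1.0 / len(domains)   # equal weight per domain

    for name, n_measures in domains.items():
        per_measure = domain_weight / n_measures
        print(f"{name:28s} {per_measure:.3f} per measure")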
3.3.5 Global Ratings
None of these IDS use a global, summary indicator; rather, the grouping of measures into clinical programs, or into categories defined by strategic priorities, is the apex of the performance measures hierarchy. Intermountain Healthcare emphasized the importance of having a direct link between
the measures used in its clinical program dashboards to measures used by frontline staff who
manage or deliver care. Geisinger has a similar emphasis on clinical program dashboards in lieu
of global indicators. However, these IDS systems are not accountable to the public, and
therefore, may not have the same need for summary scores and global indicators of performance.
Initially patterned after public reporting dashboards and tools developed and used in the private
sector, SAIL has evolved into a much more comprehensive and complex reporting system.
Moreover, data from SAIL are available to the public and provided to the U.S. Congress. In a
private sector hospital system measures used for improvement may not ever be visible to the
public. But in VA, even those internal measures intended to focus on improvement, are likely
discovered by congressional staffers or the media and made public, for their own purpose of
holding VA accountable. Therefore, they may become de facto accountability measures even
when that is not their original intent.
Kaiser is distinct in its use of a national measures repository that is a key, but not sole, source of quality measures and performance results used by regional and local leadership. The Kaiser entities integrate subsets of the KPQM measures with locally adopted measures, many of which are operational and financial indicators. As such, Kaiser seeks alignment across its regions on measures of strategic import, but the makeup of the overall performance dashboard is locally controlled.
Each IDS's overall performance is graded in various rating programs operated by payers, accreditors, and performance reporting programs. The IDS operate measurement systems they view as more comprehensive than these external programs, which enables them to meet or exceed such programs' performance thresholds. The California hospital rating program (CHART) produces a 5-category summary quality rating for each of its condition-specific topics and for several cross-cutting hospital composites, including patient experience. CHART computes its 5-category rating schema (Exhibit 5) by clustering hospital scores into 13 bands and then assigning the bands to one of the five rating categories. The clustering of hospital scores is based on 25th, 50th, and 90th percentile thresholds; these percentile benchmarks use the higher of statewide or national performance. Thus, the IDS performance information sets are inherently more actionable than SAIL, given their grouping of measures by clinical programs without a global rating.
Exhibit 5. California Hospital Performance Ratings (CHART)
3.4 Determine How SAIL is Assessed by Users in Terms of Relevance, Utility, and
Perceived Accuracy
Throughout the development of this report, the following overarching themes emerged in terms
of the relevancy, utility, and perceived accuracy of SAIL.
3.4.1 Relevance
Responses were mixed with regard to the overall relevance of SAIL in helping respondents facilitate improvement and manage their facilities. Respondents reported a limited connection between the data captured by SAIL and the local needs of facilities and VISNs for motivating and supporting improvement. It was also reported that the SAIL domains should align
more closely with all aspects included in the VHA Vision10 and the ten essential strategies to achieve the VHA mission.11
10 U.S. Department of Veterans Affairs. Veterans Health Administration: http://www.va.gov/health/aboutvha.asp
11 Veterans Health Administration. Blueprint for Excellence (2014): http://www.va.gov/HEALTH/docs/VHA_Blueprint_for_Excellence.pdf
Despite this area for improvement, it is important to consider that 70% of discussion responses
(24 out of 34) indicated that improvements were made in their facilities as a direct result of
SAIL, thus indicating that SAIL has been successful in facilitating and supporting improvement
in VAMCs across the nation.
3.4.2 Perceived Accuracy
While nearly all interviewees liked SAIL as a centralized dashboard to view performance, 70%
of respondents said that the SAIL star rating is perceived to represent quality of care inaccurately
in their medical center(s). Moreover, virtually all interviewees responded that they would change
the star rating system. A majority of concerns were related to the VAMC ranking comparison, as
several expressed an interest in being rated by comparison to non-VA facilities in their
community and to some established (external) benchmark of excellence. Interviewees with a
higher star rating were more likely to respond that SAIL accurately reflects the overall
performance at their medical center (r=0.49, p<.01).
3.4.3 Utility
Responses indicate that the lag time of the SAIL report (quarterly) limits its utility for supporting quality improvement; more recent data are more conducive to motivating change and monitoring improvements. Responses from interviewees consistently indicate that the efficiency measure is not useful. Users did not understand how they were supposed to use it to facilitate improvement, and they were less inclined to use it because the measure is updated only annually.
Respondents also reported a lack of understanding of, transparency in, and involvement in SAIL's weighting and scoring methods generally, which may feed the perception that star ratings are inaccurate, inappropriate, unfair, or misleading. Moreover, rankings and ratings based on relative performance, rather than on absolute achievement of a best practice target, have introduced a competitive environment that some observed to have a negative impact on how facilities use SAIL to share best practices and collaborate on improvement. A few respondents reported being turned away when they approached higher star-rated facilities to learn how those facilities achieved their scores. Overall, there was no significant difference in the use of SAIL among VISN Directors, Medical Center Directors, Chiefs of Staff, and Deputy Chiefs of Staff. Almost all interviewees noted that
they review it regularly in conjunction with other reports during executive leadership meetings.
Frequency of reviews ranged from weekly to quarterly, and several noted that their frequency for
reviewing the SAIL report has increased recently because it has become a part of their
performance plan.
This section concludes our qualitative analysis of the industry discussion, review of performance
measurement systems, and field discussions. The following section presents our empirical
analysis of the raw data provided by VHA.
4 EMPIRICAL ANALYSIS
4.1 Assess Whether the Data Elements within SAIL are Representative, Valid,
and Reliable Contributors to their Respective Domains
The Team's assessment of SAIL includes an examination of the measures used in SAIL, both to present readers with an informative empirical description of what SAIL is and to investigate patterns in the data that inform our assessment and recommendations.
We first calculated mean, standard deviation, median, percentiles, variance, and coefficient of
variation for all continuous variables, and frequency tabulations for all categorical variables. For
this, we examined:
• Raw values for each measure
• Unweighted standardized scores of measures (z-scores)
• Weighted z-scores of measures
• Weighted domain scores
• Overall weighted quality score, and
• Resulting star ratings (before applying the adjustments used in SAIL to add or subtract
stars using external criteria)
An important concern for measures used in SAIL is the extent to which VHA facilities score similarly, virtually indistinguishably, on any given measure. This is especially important for the purpose of making distinctions in relative performance by comparing VA facilities against each other. Indistinguishable scores can occur when there is a theoretical or practical limit to high or low performance and entities tend to amass near that limit, a situation referred to as a “topped-out” measure. To test for this, we compared the point scores at the 75th and 90th percentiles, and we calculated the “truncated coefficient of variation,” i.e., the coefficient of variation after removing the highest-scoring 5% and the lowest-scoring 5% of the population of medical centers. This is a technique used by CMS for selecting measures for value-based purchasing (Tompkins et al, 2008). Using these criteria, none of the examined measures were found to be topped out. Outliers (i.e., measures with extreme values, high or low, out of normal range) based on descriptive statistics were not identified.
We also performed F test and t tests to compare the overall quality score and each domain score
by the resulting star ratings.
Exhibit 6. Comparisons of Domain Scores by Star Rating
Acute Care Mortality
• Star 4 versus Star 3: mean values did not differ significantly (t=-0.30, p=0.77).
Avoidable Adverse Events
• Star 2 versus Star 1: mean values did not differ significantly (t=-0.68, p=0.50).
• Star 4 versus Star 3: mean values did not differ significantly (t=0.23, p=0.82).
Performance Measures
• Star 2 versus Star 1: mean values did not differ significantly (t=1.12, p=0.27).
• Star 5 versus Star 4: mean values did not differ significantly (t=0.62, p=0.53).
Length of Stay
• Star 5 versus Star 4: mean values did not differ significantly (t=-0.88, p=0.38).
Ambulatory Care Sensitive Conditions (ACSC) Hospitalizations
• Star 3 versus Star 2: mean values did not differ significantly (t=0.13, p=0.90).
• Star 4 versus Star 3: mean values did not differ significantly (t=1.51, p=0.13).
• Star 5 versus Star 4: mean values did not differ significantly (t=1.10, p=0.28).
Customer Satisfaction
• Star 2 versus Star 1: mean values did not differ significantly (t=1.62, p=0.11).
• Star 5 versus Star 4: mean values did not differ significantly (t=1.24, p=0.22).
CMS Measures (RSMR and RSRR)
• Star 2 versus Star 1: mean values did not differ significantly (t=0.71, p=0.48).
• Star 3 versus Star 2: mean values did not differ significantly (t=0.85, p=0.40).
• Star 5 versus Star 4: mean values did not differ significantly (t=1.00, p=0.32).
Access
• Star 2 versus Star 1: mean values did not differ significantly (t=1.60, p=0.11).
Mental Health
• Star 2 versus Star 1: mean values did not differ significantly (t=0.06, p=0.96).
• Star 5 versus Star 4: mean values did not differ significantly (t=-0.31, p=0.75).
Our findings indicate that although facilities with different star ratings differ significantly on overall quality score (F=292.45, p<.0001), they may not differ significantly on each domain score (Exhibit 6 and Exhibit 7). Essentially, facilities with different star ratings show consistent differences in overall quality, but are not distinguished consistently across their domain scores.
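A minimal Python sketch of these comparisons on simulated domain scores follows; the group means and sizes are arbitrary, not SAIL values.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Simulated domain scores for facilities grouped by star rating (1-5).
    groups = {star: rng.normal(loc=0.1 * star, scale=1.0, size=30)
              for star in range(1, 6)}

    # Overall F test: do mean domain scores differ across the five ratings?
    f_stat, f_p = stats.f_oneway(*groups.values())

    # Pairwise t tests for adjacent star ratings, as in Exhibit 6.
    for lo_star in range(1, 5):
        t_stat, t_p = stats.ttest_ind(groups[lo_star + 1], groups[lo_star])
        print(f"Star {lo_star + 1} vs Star {lo_star}: t={t_stat:.2f}, p={t_p:.2f}")
    print(f"Overall: F={f_stat:.2f}, p={f_p:.4f}")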
Exhibit 7. Comparisons of Domain Scores by Star Ratings
[Figure: SAIL domain scores plotted by star rating (star 1 through star 5); vertical axis from -1.5 to 1]
We then performed an internal consistency reliability analysis (Cronbach's alpha) to examine the measure-domain correlations and internal consistencies, i.e., how well measures correlate with their assigned domains. This analysis used z-scores of measures. The Cronbach's alpha coefficients ranged from 0.14 to 0.69 — specifically, 0.59 for Acute Care Mortality, 0.14 for Avoidable Adverse Events, 0.54 for CMS Measures, 0.44 for
Performance Measures, 0.49 for Customer Satisfaction, 0.69 for Access, and 0.69 for Overall Quality (25 measures). These findings indicate acceptable internal consistency (alpha of 0.60 and above, i.e., high measure-domain correlations) for the overall quality and access measures, but the lack of internal consistency for some domains (alpha below 0.60, i.e., poor measure-domain correlations; see Exhibit 8) signals the need to reaffirm the rationale for those domains' composition given their weaker psychometric properties.
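For reference, Cronbach's alpha can be computed directly from a facilities-by-measures matrix of z-scores; this sketch uses simulated correlated measures, not SAIL data.

    import numpy as np

    def cronbach_alpha(item_scores: np.ndarray) -> float:
        """Cronbach's alpha for an (n_facilities, n_measures) array of z-scores."""
        k = item_scores.shape[1]
        item_vars = item_scores.var(axis=0, ddof=1)
        total_var = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

    rng = np.random.default_rng(2)
    latent = rng.normal(size=(147, 1))                       # shared domain trait
    z = 0.7 * latent + rng.normal(scale=0.7, size=(147, 3))  # three correlated measures
    print(f"alpha = {cronbach_alpha(z):.2f}")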
Exhibit 8. Measure-domain correlation results summary
Acceptable measure-domain correlation (alpha > 0.60):
• Access
• Overall Quality
• Length of stay*
• ACSC Hospitalizations*
Unacceptable measure-domain correlation (alpha < 0.60):
• Acute Care Mortality
• CMS Measures
• Performance Measures
• Customer Satisfaction
• Avoidable Adverse Events
*The length of stay and ACSC hospitalizations domains each comprise a single measure. The mental health domain was excluded from this analysis.
Third, we performed Confirmatory Factor Analysis (CFA) to examine the current SAIL model structure (eight quality domains and the corresponding measures in each domain; the mental health domain was excluded due to lack of data for individual mental health measures), using z-scores of measures. Neither the factor model nor the linear equation model supports the current SAIL domain structure; that is, neither generates an acceptable model fit for eight domains (factors) (Exhibit 9).
Exhibit 9. Fit Summary of Confirmatory Factor Analysis
Fit Summary Factor Model Linear Equation Model
Chi-Square 598.8928 696.4999
Chi-Square DF 224 224
Pr > Chi-Square <.0001 <.0001
Standardized RMR (SRMR) 0.1125 0.1125
RMSEA Estimate 0.1186 0.1331
Bentler Comparative Fit Index 0.1832 0
Fourth, we performed an Exploratory Factor Analysis (EFA) to examine the underlying
relationships between the measures in the most recent SAIL data. EFA is often used when
developing a scale (e.g., star ratings) to identify the latent constructs (e.g. domains) that are
supported by the observed data. We conducted EFA on z-scores using Principal Components with Promax rotation. Applying the eigenvalue-greater-than-one criterion, we identified three factors as fitting the data and explaining 71% of the variance. Exhibit 10 presents the Promax rotated factor pattern (standardized regression coefficients); a factor loading equal to or greater than 0.30 was considered acceptable. The first factor could be called “access to care,” based on strong loadings from variables such as call-waiting times. The second factor loads heavily on mortality, readmission, and patient safety variables, suggesting it could be called “mortality and safety.” The third factor could be called “patient experience of care.” Healthcare associated infections was the only measure that did not load significantly on any of the three factors.
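A sketch of this EFA workflow follows, assuming the third-party factor_analyzer Python package; the input is simulated, so the recovered structure will be weak, but the steps (principal components extraction, Promax rotation, eigenvalue and loading screens) mirror the analysis described above.

    import numpy as np
    from factor_analyzer import FactorAnalyzer  # assumed third-party package

    rng = np.random.default_rng(3)
    X = rng.normal(size=(147, 24))  # simulated z-scores: 147 facilities x 24 measures

    # Principal-components extraction with Promax (oblique) rotation.
    fa = FactorAnalyzer(n_factors=3, rotation="promax", method="principal")
    fa.fit(X)

    eigenvalues, _ = fa.get_eigenvalues()   # eigenvalue-greater-than-one screen
    loadings = fa.loadings_                 # rotated factor pattern

    print("Factors with eigenvalue > 1:", int(np.sum(eigenvalues > 1)))
    # Flag salient loadings (|loading| >= 0.30), zeroing the rest for readability.
    print(np.where(np.abs(loadings) >= 0.30, np.round(loadings, 2), 0.0))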
Exhibit 10. Promax Rotated Factor Pattern of Current SAIL Data
Measures Measure Description Factor1 Factor2 Factor3
z3_callcenter Seconds to pick up calls 0.74872 -0.05303 -0.11837
z3_xaccess PCMH access domain composite 0.66286 -0.00644 0.20468
z3_telr2 Telephone abandonment rate 0.59059 0.15127 -0.07074
z3_pc11 Primary care new patient wait time 0.50786 -0.07969 0.01006
z3_sc13 Specialty care new patient wait time 0.47841 0.13064 0.05119
z3_bptw Best places to work 0.46992 -0.20337 0.26918
z3_sc12 Specialty care established patient wait times 0.36891 0.12194 -0.11065
z3_rnturnover Registered Nurse turnover rate 0.3103 -0.0435 -0.073
z3_smr30 30-day risk adjusted mortality 0.2274 0.60761 -0.23288
z3_smr In-hospital risk adjusted mortality 0.0129 0.58009 0.27151
z3_rsrrchf Risk standardized readmission rates, congestive heart failure (CHF) -0.17197 0.57194 0.14495
z3_rsrrpn Risk standardized readmission rates pneumonia -0.30834 0.40818 0.10451
z3_rsrrami Risk standardized readmission rates, acute myocardial infarction (AMI) -0.03709 0.39206 0.06939
z3_rsmrchf Risk standardized mortality rates CHF 0.17951 0.39079 -0.22804
z3_rsmrpn Risk standardized mortality rates pneumonia 0.11503 0.37704 0.09916
z3_psi Risk adjusted patient safety index (PSI) -0.0321 0.37362 -0.1675
z3_rComp In-hospital complications 0.06231 0.20227 0.1988
z3_alos Adjusted length of stay -0.09339 0.13147 0.62919
z3_hcahps Patient rating of overall hospital performance 0.40894 0.07693 0.46136
z3_mh12 Mental health new patient wait time 0.11285 -0.04446 0.41339
z3_oryx Inpatient core measures mean percentage -0.0613 -0.13619 0.39263
z3_acsc ACSC hospitalization -0.18882 0.06945 0.28955
z3_hedis Healthcare Effectiveness Data and Information Set (HEDIS) outpatient core measures mean percentage 0.15066 0.0064 0.22169
z3_hai Healthcare associated infections 0.05448 -0.04197 -0.0891
4.2 Evaluate the Methodology for the (“Star”) Ratings of Quality and their
Internal and External Validity
The current SAIL star-rating process generates star ratings for facilities based on their quality
summary scores, using a five-star scale. In general terms, this involves three major steps:
• Computing standardized (z) scores (deviations from the mean of all VHA medical centers) to measure the performance of each medical center on each measure
• Weighting those individual measures within and across their respective domains
• Choosing cut-off points along the resulting summary/average score to produce an ordinal scale numbered from one to five stars
Based on the distribution of summary scores, facilities scoring in the lowest 10 percent receive a 1-star rating; facilities scoring in the top 10 percent receive a 5-star rating; facilities scoring in the middle 40 percent of the distribution receive a 3-star rating; facilities scoring above the 10th percentile up to the 30th receive a 2-star rating; and facilities scoring above the 70th percentile up to the 90th receive a 4-star rating (note that 1-star facilities that exceed the commercial industry average are promoted to 2-star). This process for setting the four relative thresholds for the five performance categories relies on a norm-referenced interpretation of overall quality summary scores.
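The percentile-band assignment can be expressed compactly; this minimal Python sketch uses simulated summary scores and omits SAIL's external-criteria star adjustments.

    import numpy as np

    def assign_stars(summary_scores: np.ndarray) -> np.ndarray:
        """Map summary scores to 1-5 stars using relative percentile bands:
        bottom 10% -> 1 star, 10-30% -> 2, 30-70% -> 3, 70-90% -> 4, top 10% -> 5.
        (The promotion of 1-star facilities that exceed the commercial industry
        average is omitted here.)"""
        cuts = np.percentile(summary_scores, [10, 30, 70, 90])
        return 1 + np.searchsorted(cuts, summary_scores, side="right")

    rng = np.random.default_rng(4)
    scores = rng.normal(size=147)
    stars = assign_stars(scores)
    print(np.bincount(stars, minlength=6)[1:])  # counts of 1- to 5-star facilities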
The current star-rating methods carry a risk of misclassification — that is, the assigned star rating may not reflect a facility's true performance. This could occur if the individual measures or domain scores are not sufficiently reliable, introducing noise and error.
Still, the star ratings might be considered misleading for at least three reasons. First, two facilities can have the same star rating but significantly different patterns on individual measures and domains. For example, two medical facilities can currently both achieve a quality rating of three stars even though one facility's actual overall quality score is at the 39th percentile and the other's is at the 68th percentile, given that the range for three stars runs from the 39th to the 70th percentile. Second, two medical facilities can have different star ratings (such as 4-star and 3-star) even though their actual percentiles of overall quality scores are close but happen to fall just above and just below a threshold (such as the 71st and 69th percentiles). Third, a misleading inference would be that few stars necessarily mean poor quality relative to healthcare performance in a facility's local community, given the widely documented geographic variation in healthcare metrics.
Considering the problem of different underlying patterns for scores, we explored an alternative
approach for classifying data into subgroups, and performed Latent Profile Analysis (LPA). LPA
provides an advantage over other methods because it groups medical facilities based on their
scoring patterns within the data and then uses those patterns as independent variables (Muthén,
2001). LPA is a subject-centered (in this case, medical facility-centered rather than measure-centered), model-based cluster analytic approach. Unique model parameters are
estimated for each profile based on maximum likelihood estimation. Specifically, LPA estimates,
for each profile/class, the mean and variance for each measure, the probability that each facility
falls into each cluster, and the probability that any facility falls into a given class across all
facilities. It thus assigns medical facilities to profiles with the highest member probability.
Probabilities closer to one for a single profile/class, and closer to zero for the remaining classes, suggest good group assignment and distinct profiles/classes.
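LPA with continuous indicators is closely related to a Gaussian mixture model, so the mechanics can be sketched with scikit-learn; the data below are simulated, the report's actual LPA software is not specified, and a dedicated LPA tool would also provide statistics such as the sample-size-adjusted BIC.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(5)
    # Simulated z-scores: 147 facilities x 25 measures, with 3 latent groups.
    centers = rng.normal(scale=0.8, size=(3, 25))
    labels_true = rng.integers(0, 3, size=147)
    X = centers[labels_true] + rng.normal(scale=0.5, size=(147, 25))

    for k in (3, 4, 5):
        gm = GaussianMixture(n_components=k, covariance_type="diag",
                             random_state=0).fit(X)
        post = gm.predict_proba(X)      # per-facility class probabilities
        # Relative entropy, scaled to [0, 1]; higher = cleaner classification.
        ent = 1 + (post * np.log(np.clip(post, 1e-12, 1))).sum() / (len(X) * np.log(k))
        print(f"{k}-class: AIC={gm.aic(X):.0f}  BIC={gm.bic(X):.0f}  entropy={ent:.2f}")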
Several models were fit to the z-scores of each individual measure and to the z-score of the
mental health domain data, specifying three through five latent profiles. Models with different
numbers of profiles were compared using information criteria (IC)-based fit statistics. These
include the Bayesian Information Criteria (BIC), Akaike Information Criteria (AIC), and
Adjusted BIC. Lower values on these fit statistics indicate better model fit. The accuracy with
which models classify medical facilities into their most likely profile/class is examined. Entropy
is a type of statistic that assesses this accuracy, and can range from 0 to 1, with higher scores
representing greater classification accuracy. The exhibit below presents the Information Criteria,
Entropy, and Average Class Probabilities of LPA. As shown in Exhibit 11, the 3-class and 4-
class models achieve good fit.
The profile plot in Exhibit 12 graphically shows the latent class estimated means on the y-axis; the x-axis starts at zero and increases in units of one for each of the observed variables/measures. Class 1 has lower average scores on most measures, class 3 has higher average scores on most measures, and class 2 falls in between. This is in line with a 3-star rating scale. The 5-class model does not fit the data well, implying that the data may not support the 5-star ratings.
Exhibit 11. Latent Profile Analysis
Goodness-of-fit statistics for the 3-class to 5-class latent profile solutions
N=147
Fit statistics 3-Class 4-Class 5-Class
Log-likelihood -4264.142 -4192.269 -4215.968
AIC 8832.284 8790.537 8939.937
BIC 9286.83 9397.595 9699.507
SSA-BIC 8805.82 8755.193 8895.714
Entropy 0.92 0.91 0.95
AIC = Akaike Information Criteria; BIC = Bayesian Information Criteria; SSA-BIC = Sample-Size-Adjusted BIC
Class counts and proportions for the latent classes (based on estimated posterior probabilities), with average latent class probabilities
Three-class model 1 2 3
1, n = 20.9, 14.2% 0.984 0.016 0
2, n = 87.3, 59.4% 0.014 0.973 0.013
3, n = 38.8, 26.4% 0 0.035 0.965
Four-class model 1 2 3 4
1, n = 20.4, 13.9% 0.969 0.001 0.029 0
2, n = 21.2, 14.4% 0.007 0.936 0.053 0.004
3, n = 64.3, 43.8% 0.013 0.02 0.958 0.01
4, n = 41.1, 28.0% 0 0.005 0.032 0.963
Five-class model 1 2 3 4 5
1, n = 2.0, 1.4% 1 0 0 0 0
2, n = 18.3, 12.4% 0 0.965 0.035 0 0
3, n = 87.9, 60.0% 0 0.01 0.971 0.019 0
4, n = 38.8, 26.4% 0 0 0.021 0.979 0
5, n = 0, 0% 0 0 0 0 0
Exhibit 12. 3-Class Profile Plot
[Figure: latent class estimated means (y-axis) plotted across the observed measures (x-axis) for the 3-class solution]
5 RECOMMENDATIONS
Based on its independent assessment of SAIL, the Booz Allen Team provides recommendations
related to measurement system purpose, measures, hierarchy, scoring, and system management.
5.1 Measurement System Purpose — Accountability and Quality Improvement
The primary purpose(s) of a measurement system will influence the decisions made by system
developers. A recent Commonwealth Fund and Institute for Healthcare Improvement (IHI) issue
brief12 illustrated these differences:
• Accountability systems include measures designed for provider comparisons, include
risk-adjustments, and are collected retrospectively over long periods of time.
• Improvement systems include measures that capture local performance over time and are not risk-adjusted, providing information as to whether interventions or clinical care are producing the intended outcomes.
The purpose of a measurement system drives decisions in the selection of measures, the structure of domains, and the scoring methodologies. A primary issue that can underlie the prioritization of improvements to SAIL is the discrepancy between its original design as an improvement tool and its current use as an accountability tool. SAIL was originally developed as
“VA’s internal improvement tool which is designed to offer high-level views of health care
quality and efficiency.”13 A determination of SAIL’s primary purpose will drive VHA decisions
with regard to many design decisions and improvements. Accountability may take on a different
meaning when referring to a public institution like the VHA. In a private sector hospital system,
measures used for improvement may not ever be visible to the public. But in VA, even those
internal measures intended to focus on improvement are likely discovered by congressional
staffers or the media and made public, for their own purpose of holding VA accountable.
Therefore, they may become de facto accountability measures even when that is not their
original intent. The Booz Allen Team recommends VHA consider whether the primary purpose of the SAIL Value Model is improvement or accountability. It is possible that SAIL can be used for both purposes and redesigned accordingly. For example, existing measures within SAIL could be reorganized and presented by clinical area, aligning with a given service line's need for performance data to manage. This would provide end users with reports of measures that align with their locus of control. In addition, a subset of measures, given rigorous evaluation, could be designated and used for accountability purposes.
12 Clifford Marks, et al. Hospital Readmissions: Measuring for Improvement, Accountability, and Patients, Commonwealth Fund/Institute for Healthcare Improvement, Issue Brief, September 2013.
13 SAIL Value Model 20141210.pdf
An important purpose of any measurement system is to discern levels of performance and to
attach judgments to observed levels, such as high or low, adequate or inadequate, unsatisfactory
or exemplary. The benchmarks used to make such judgments can take one of two forms, scoring
according to mastery or scoring relative to peers.
When “scoring according to mastery,” an entity's performance is judged against predetermined values or external benchmarks, such as an indicated service that is always delivered or an external value drawn from some other population. For example, administering aspirin on
arrival to patients with acute myocardial infarction (AMI) generally is considered good clinical
practice. VHA may set a high bar, such as a 95% success rate, and label any hospital achieving
that level of performance as a high performer because it has “mastered” the process, and any
hospital failing to achieve that level as a low performer. In such a scenario, where the threshold
for high performance is determined by policy or external reference, hypothetically some or all
VHA facilities could be above (or below) that threshold rather than rated relative to their peers.
When “scoring relative to peers” (or grading on the curve), each entity is judged based on its
score within the distribution of all entities. Performance is judged relative to the mean value
calculated for all entities, or another reference point such as the median value or the 90th
percentile. For the most part, SAIL metrics are used in this way. Some VHA facilities will be
found to be high performers because they scored better than other facilities, which will be
labeled low performers. Z-scores are standardized representations of the location of a facility on
the distribution of scores on a measure. The five-star rating system grades VHA centers on the
curve, although a star can be added or subtracted from a center’s rating based on reference to
external benchmarks on specific measures or domains.
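The practical difference between the two benchmark forms can be seen in a short sketch; the measure, simulated rates, and thresholds here are hypothetical.

    import numpy as np

    rng = np.random.default_rng(6)
    aspirin_rates = rng.uniform(0.85, 0.99, size=147)  # simulated facility rates

    # Mastery scoring: judge each facility against a fixed external bar;
    # in principle, all facilities could pass (or fail).
    mastery_bar = 0.95
    high_by_mastery = aspirin_rates >= mastery_bar

    # Relative scoring: grade on the curve via z-scores, so a fixed share of
    # facilities is always "below average" regardless of absolute performance.
    z = (aspirin_rates - aspirin_rates.mean()) / aspirin_rates.std(ddof=1)
    high_by_curve = z >= np.percentile(z, 90)          # top decile only

    print(f"High performers by mastery: {high_by_mastery.sum()} of {len(z)}")
    print(f"High performers by curve:   {high_by_curve.sum()} of {len(z)}")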
One of the major themes that emerged from interviews with VHA and private industry leaders is
that benchmarking should consider both performance within VHA and external benchmarks
drawn from national and local results for non-VA facilities. Performance differences between
external benchmarks and VHA’s internal benchmarks should inform the adoption of benchmarks
to set performance goals or to establish thresholds used in categorizing performance (e.g., star
ratings). Local non-VA area benchmarks are of particular interest given Veterans’ options to use
competing healthcare providers. The Booz Allen Team recommends VHA clarify the scoring
approaches for mastery (predetermined values or external benchmarks) or relative
performance. This clarification would include examining the most appropriate scoring approach
by measure (i.e., some measures are more suited to mastery scoring) and considering the use of
the data by various stakeholder groups (i.e., senior leadership, VISN directors, VAMC directors,
frontline staff, Veterans, and other external groups).
In addition to comparing a medical center to other (similar) medical centers or to external
benchmarks, VHA can measure each facility against itself, i.e., its own historical value on a
measure such as a baseline reference point, the previous reporting period, or a trend line or
rolling average. Surely, there is merit for a medical center that performs better than most or all of
its peers; there also may be merit in making significant improvements over time. Accountability
systems can acknowledge either or both types of merit, i.e., achievement or improvement,
respectively. However, comparing VA facilities and ranking them within a narrow range of
variation is not useful for either accountability or improvement purposes. Consideration should be given, measure by measure, to whether a comparison or a benchmarking approach to scoring is more appropriate.
Whereas metrics can be used to judge whether an entity performs well, or has improved
significantly, asking a measurement system to inform entities how to improve can imply a
different set of demands. SAIL may do well informing entities how they rank compared to other
entities or external benchmarks, and which measures indicate the greatest performance gaps. That may be a great service to entities with respect to what to improve, but it says less about how to go about improving.
The Booz Allen Team recommends VHA clarify the specific role of SAIL to inform the
network about what versus how to improve. Integrated systems in our study built and
leveraged their own measurement systems to serve both roles. Thus, we observed that a single
coherent (but not necessarily simple) measurement system can be applied for the goals of
accountability and fostering improvement, but the approaches should be tailored to the different
uses. By itself, SAIL does not emulate the single coherent systems that are designed to
accomplish both goals. If SAIL is intended to work seamlessly with other measurement systems
within VHA, we observed only limited anecdotal evidence that the respective medical center
personnel experience useful integration.
Even beyond metrics, the integrated systems in our study married their measurement systems to
their own cultural “change model,” which identified the prerequisites and mechanisms for
driving improvement. In some but decisively not all cases, that meant linking performance to
financial compensation of senior executives or other staff. For example, financial incentives are secondary in Intermountain Healthcare's change model, while Geisinger emphasized their importance in its own. We did find passionate belief in framing
measures to guide processes linked to outcomes of interest, or in other words, applying measures
of how to perform well (input measures) as well as measures of success for the patient
(intermediate, and eventually ultimate outcomes).
The Booz Allen Team recommends VHA clarify its own change model(s), given its public
mission along with its constraints. This may dovetail with larger VHA systems of
accountability and compensation; however, for the present purpose, such clarification would help
to inform changes to the mission and makeup of SAIL. Specifically, this could facilitate
transparent communication of a vision for how the network can improve and what should be
captured and measured in SAIL that would be used to drive improvement. This would create a
system that equips staff, often front line staff who can most effect change, with the information
needed to manage and improve processes. Such clarification of the VHA change model could
involve at least two dimensions:
• Does VHA know/have consensus on the key processes that influence Veterans’
health/experiences?
• What are the strategies for driving change to better health/experiences (e.g., staff
configurations, medical homes, technology, process controls, integration with local
community resources etc.)? And, does VA have a teachable process that can be used
across facilities to diagnose the problem, to use analysis to develop solutions, to
implement change in a sustainable way, and to measure improvements?
The approaches and related measures chosen after such clarification may lead to enhancements
to SAIL such as to provide a menu of dashboards for different needs. Some measures could be
deployed to support processes; other measures may be constructed rigorously and refined to
support judgments and accountability for performance.
5.2 Measures
Measures used in SAIL, especially for accountability, need to be valid (accurately and fairly
measure what they purport to measure) and reliable (precisely and reproducibly discriminate true
performance differences). In this study, we examined measures and domains according to these
scientific criteria (e.g., suitability of domains; potential for topped-out measures).
The Booz Allen Team recommends VHA clarify and formalize a process for selecting and
managing measures with respect to their intended uses. This could include adopting measure
inclusion criteria to include dimensions such as: mission alignment, performance improvement
opportunity, reliability, and actionability. An initial part of this process may be to reconsider or
confirm the set of performance domains (see Section 5.3). Measures used to rank-order entities
should pass tests related to sufficient reliability and ability to contribute to discriminating
performance. In contrast, certain SAIL quality and safety measures may have little variance
across VHA facilities and have high absolute scores, but are still very important to track. Our
empirical analysis did begin to examine this at the domain and subdomain level and did not find
topped-out measures. However, our analysis did not examine the performance of every
component measure contained within SAIL.
In particular, it may be useful to screen each measure for possible “topped-out status.” Generally,
when the distribution of scores for most or all entities is concentrated in a narrow range, then
assigning relative ranks (percentiles or z-scores) that lead to labels such as “high (or low)
performer” can amount to making distinctions without meaningful differences. Mixing topped-
out measures into the relative scoring systems can detract from the ability to identify reliably true
differences in performance. More generally, better measures for relative ranking are those which
distinguish performance differences reliably and contribute to distinctions among entities in their
respective domain summary scores. Measures that are topped out, or nearly so, may be used in relation to external benchmarks (mastery) or for public reporting to inform and reassure stakeholders that facilities and VHA are maintaining their success on these measures.
Accountability measures can be partitioned into important measures with improvement aims that are core elements of the accountability structure, versus other important measures for which high performance has been achieved and that are monitored but are not core accountability metrics. Additional measures, and a different mix of measures, may be needed based on further assessment of how to support performance improvement and consideration of how to formalize measures of efficiency and value.
Despite its full name, “SAIL Value Model,” SAIL itself does little to link quality and cost measures into integrated measures of efficiency or value. Our interviews with VHA leaders found that 97% of them do not use SAIL's efficiency measure in any way to improve either operational efficiency or budget performance. The measure is not actionable and is not reported frequently; therefore, leaders do not see its utility, and the vast majority simply do not understand it. As such, the measure cannot reasonably be used for accountability purposes. In addition, it does not get at the central question of value: whether quality or outcomes at a facility are commensurate with its resource use.
The efficiency measure, constructed using stochastic frontier analysis, is an overarching, macro-level summary indicator that senior VHA executives can use to quickly compare resource use across facilities and frame further questions about efficiency. Information contained in the Efficiency Opportunity Grid (EOG), which is embedded in the SAIL package, is actionable and can lead to improvement.
VHA and SAIL may be poised to move forward in measuring efficiency and value, beyond
simple side-by-side displays, or parallel rather than integrated scoring methods involving
dimensions of quality and cost. For the most part, SAIL imports measures of relative cost
across entities, and links to multiple measures of resource use that reside in other parts of VHA.
In very tangible ways, the SAIL Team coordinates well with other parts of the agency that are
involved in formulating measures, and this is a case in point. Questions of great concern in
healthcare generally, and presumably VHA, have to do with distinguishing high-value spending
from low-value spending. For what patient conditions are services and costs too low, such as
underutilization of effective services? For what conditions are costs too high, such as excessive
volumes or inefficient use of high-cost alternatives? These questions relate to allocative
efficiency, i.e., the ability to properly steer resources away from waste and inefficient utilization
patterns, and toward their highest-value purposes.
SAIL has demonstrated this insight with the recent deployment of measures centered on a
particular set of conditions, i.e., mental health. This approach helps to organize thinking and
improvement around an identifiable set of patient cohorts, as well as the clinical and
administrative components of the VHA system that are responsible for those conditions and
patients. In other words, the information gets organized and targeted around quality and access
pertaining to certain conditions and patients; this approach may then lend itself to consideration
about resource use measures integrated to define and measure efficiency and value for mental
health services. Generalized to include other conditions and lines of service, medical centers
could embark on empirically-driven measurement and improvement goals pertaining to overall
efficiency and value by reallocating resources to best meet the needs of Veterans.
5.3 Measurement System Hierarchy
The Booz Allen Team recommends VHA clarify which one or more purposes it chooses to
support through a domain structure.
There can be several reasons for grouping measures into domains. One reason is to group measures that tend to explain the same underlying dimension of performance. Similar measures tend to reinforce each other when they track the common trait, and tend to cancel each other out when they exhibit “noise” or otherwise do not track the common trait of interest, i.e., the trait that defines the domain. Familiar examples come from psychology (hence this approach is called psychometric), such as using several questions (individual measures) to draw conclusions about levels of intelligence, certain personality traits, types of aptitude, or indications of morbidity. Grouping individual items for this purpose often uses factor analysis techniques, which attempt to align measures according to the underlying “factors” that may explain the concepts of interest.
Another somewhat related approach is to group measures into domains according to conceptual
similarity. Inpatient mortality rate is similar conceptually to all-cause 30-day post-discharge
mortality rate, but that does not mean necessarily that the two measures would tap the same trait
or dimension of quality. For example, inpatient mortality may be driven by poor technique,
unsanitary conditions, or deficiencies related to intensive care. In contrast, 30-day mortality may
be caused by inadequate patient education, community supports, adherence to medications, or
failure to reconcile contraindicated prescription patterns. Empirically it may be true that entities
tend to score higher or lower than others on both measures either because both measures do share
some causes (e.g. high infection rates), or perhaps because low performers may tend to be
suboptimal in many processes for many reasons.
Domain structures can also be selected to align with key business drivers or policy priorities,
such as clinical quality, patient access, patient experience, or employee experience. These
domains would be selected based on identifying the key areas where VA has to perform well in
to be successful.
A different approach to grouping measures is by condition, specialty, or line of service. VHA
might wish to set up dashboards that show results for various types of measures, individually and
collectively, that are calculated for selected patient cohorts. One organizing framework
recommended by the NQF is the patient-focused episode of care. For example, measures relevant
to heart failure patients can be calculated for those patients and shown to clinicians and
departments responsible for meeting the needs of heart failure patients.
Another type of cohort could be patients undergoing coronary artery bypass grafting (CABG).
Domains could include episodes that relate to the same type of physician specialties such as
orthopedics, mental health, etc. Under this rubric, the signals carried by individual measures
would directly reflect on the attributed providers, and could reinforce each other in
differentiating the relative performance of departments or medical centers as it relates to the lines
of service. An intrinsic advantage to using the episode framework is the relevance of quality and
cost to measuring relative efficiency and value across patient cohorts and lines of service. How
do entities compare on resources used to manage clinically similar patients? When costs are
above average for managing patients in similar clinical contexts, do quality measures support the
higher resource use, or do they seem to indicate inefficiencies?
Similarly, determination of relative value can occur when the scoring rules, particularly the weights given to individual measures, are calibrated properly to reflect the clinical or policy importance attributed to them by VHA. Applying more or less equal weights to individual
measures is a choice, and likely does not convey significant differences in importance or priority
given by policymakers or senior leadership, or even clinical staff. For example, patient
experience may be given a higher weight for some episodes in which perceived access and
communication are paramount to value, while clinical outcomes may have higher weights when
considering value in other contexts.
5.4 Scoring/Star Rating
SAIL applies weights to the measures inside domains and, in turn, to the respective quality domains to produce a summary score, which is expressed as a value from one to five stars. More specifically, performance on each measure is the standardized score using the z distribution (observed minus mean, divided by standard deviation). Thus, entities are given star ratings mostly based on rankings derived from cross-sectional comparisons of each medical center against others; comparisons are made within strata defined by the complexity of the facility (high, medium, or low).
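A minimal sketch of this within-stratum standardization, assuming a simple tabular layout (the data are simulated and the column names are hypothetical):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(7)
    df = pd.DataFrame({
        "facility": range(147),
        "complexity": rng.choice(["high", "medium", "low"], size=147),
        "measure": rng.normal(size=147),       # simulated raw measure values
    })

    # Standardize each facility's measure within its complexity stratum:
    # z = (observed - stratum mean) / stratum standard deviation.
    df["z"] = df.groupby("complexity")["measure"].transform(
        lambda s: (s - s.mean()) / s.std(ddof=1))

    print(df.groupby("complexity")["z"].agg(["mean", "std"]).round(2))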
Construction of total summary scores or star ratings is always methodologically problematic. However, for practical managerial reasons, high-level VHA executives need some way to regularly monitor whether a hospital or health system's performance is faltering and may require intervention to avoid subpar care.
VHA might also consider assigning responsibility for performance using a balanced
scorecard approach that drills down to frontline supervisors. Using SAIL as an improvement
strategy begins with educating those who have to execute it. Performance or improvement goals
should disseminate from top to bottom in a medical center. A rolled-down balanced scorecard methodology shows senior, mid-level, and frontline staff what their responsibilities are regarding performance improvement and empowers them to work with their care delivery teams to analyze appropriate data, consider options, take action, and track improvement. Both improvement and
accountability are spread across the organization and locus of control can be placed closest to the
patient.
The Booz Allen Team recommends considering some alternatives to creating fixed strata
(the three complexity levels) and simply limiting all comparisons within strata. One
alternative could be pooling data across all facilities and examining predictive margins (also
known as “recycled predictions”) in which robust estimates are determined of the incremental
differences in expected values related to the three respective complexity levels. Another
alternative would also take advantage of pooled data to develop comparison samples of patients from potentially all other facilities based on direct standardization techniques; in other words, create matched samples for comparison using each facility's unique mix of patients in the populations served. Still another alternative may be to calculate
expected measure or domain results using hierarchical statistical models, which could allow for
robust patient-level risk-adjustment along with facility or market characteristics that also can
affect measure results.
No matter which alternative is chosen going forward, it seems that these additional empirical
investigations of non-equivalence across facilities — across strata but possibly also within strata
— would be useful to demonstrate appropriate handling of such differences, and optimal fairness
in inferences made from the final summary scores. Thus, the Booz Allen Team builds on its
previous recommendation and recommends conducting empirical investigations of non-
equivalence across facilities — across strata but possibly also within strata.
Based on our review, the use of the star ratings for accountability may warrant further
investigation of potential technical concerns (domain weighting and risk of misclassification). If
star rating is to be used for accountability purposes with important consequences, the Booz Allen Team recommends further research to assess potential misclassification and to apply techniques that mitigate medical center misclassification error. While we do not have
evidence that the star rating is unreliable, additional analyses may clarify or improve the
reliability of star ratings over time. For example, the formation of summary scores out of
individual measures requires decisions about how much weight to give to each of the measures,
respectively, to produce a summary score that is valid for the purpose. There can be several
criteria used for determining such weights, including the statistical reliability of the individual
measures, empirically-derived relative importance based on some criterion such as the relative
impact on health, or preference weights that express relative policy importance or desire to focus
attention. Sometimes developers resort to an even simpler approach, giving equal weight to each measure, which passively allows the relative numbers of measures to drive the overall weight given to any particular concept. What can be lost in the process is an appreciation
of what effects differential weights may have on the resulting summary scores, and in turn, the
absolute or rank-order differences observed for medical centers. Therefore, whether it is to
explore different conceptual approaches to weighting schemes, or whether to investigate
alternative empirically-driven weights, we recommend that sensitivity analysis be used to reveal
the differences in the resulting scores and their implications that would result from the choices
about weights.
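One simple form of such a sensitivity analysis reweights the same z-scores and compares the resulting facility rank orders; this sketch uses simulated scores and an arbitrary alternative weight vector.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(8)
    z = rng.normal(size=(147, 10))          # simulated z-scores, 10 measures

    equal_w = np.full(10, 0.1)              # equal weights
    alt_w = rng.dirichlet(np.ones(10))      # one alternative weighting draw

    score_equal = z @ equal_w
    score_alt = z @ alt_w

    # How much does the facility rank order move under the alternative weights?
    rho, _ = spearmanr(score_equal, score_alt)
    shift = np.abs(np.argsort(np.argsort(-score_equal))
                   - np.argsort(np.argsort(-score_alt)))
    print(f"Spearman rank correlation: {rho:.3f}")
    print(f"Largest rank shift: {shift.max()} positions")

Repeating this over many candidate weight vectors would show how sensitive the absolute and rank-order results are to the weighting choices described above.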
Understandably, SAIL expects to track relative improvement over time for a given facility in
comparison to others. Such tracking may be undermined if the ratings are considered unreliable
or if methods can distort a facility's score. Furthermore, the empirical analysis conducted by the Booz Allen Team, detailed in Section 4.2 of this report, suggests that a three-class model may better fit the rating system. The Booz Allen Team recommends that SAIL consider shifting to a 3-star summary rating system.
5.5 Measurement System Management
It has appeared to the Booz Allen Team that the SAIL Team is genuinely open to suggestions
about measures or scoring approaches. At the same time, some findings suggest some perceived
gaps in communication in either influencing or using SAIL among other VHA staff. Perhaps
some staff in other parts of VHA have not engaged enough with SAIL to feel like effective users,
or effective contributors to SAIL as it exists. As the results of this assessment are considered in
VHA, and particularly as changes in SAIL are explored and implemented, there is perhaps a
corresponding opportunity to solicit input and “buy-in” from staff who may wish to influence the
direction of SAIL.
The Booz Allen Team recommends VHA engage internal stakeholders in adapting SAIL to
the needs of its users over time. This may include a formal SAIL stakeholder process in which
frontline staff, measurement system owners, and others are authorized to oversee the SAIL
measures set and decision support services. Such a mechanism could help to ensure that the
measures set serves the needs of each internal client.
The Booz Allen Team suggests VHA consider adopting continuous, rapid-cycle evaluation and improvement best practices into its management of SAIL. To undertake this initiative, the Team suggests that VHA establish internal measures for evaluating SAIL's impact, relevance, perceived accuracy, and utility, and develop a process for collecting pertinent data in these areas. We suggest that this performance data be reviewed regularly by leadership so that improvements can be made to SAIL over time, allowing SAIL to adapt to user needs and produce better long-term outcomes. The Team believes this approach would support performance excellence and allow SAIL to more adequately meet VHA's intended goals.
Beyond the formal stakeholder process described above, it could be beneficial to post proposed
additions to the measures set for public comment before deciding to adopt them, so that users
are more engaged in the process of changing SAIL. Overall, the Team believes that SAIL
should be a utility that serves the VISNs/VAMCs, and hence those stakeholders should oversee
the makeup of SAIL, the introduction of new measures, the refinement of decision-support
tools/training, and so on.
The Team and the field applaud the work of the SAIL Team in providing facility trainings when
requested, and recommend continuing such trainings, including executive-level training on
SAIL that instructs senior leaders not just on what the measures are, but also on how they should
or could be using them. Examples of collaborative (across-facility) approaches for improvement
based on SAIL metrics should be included in the training. There is, however, an opportunity to
provide medical centers with more robust resources for training and technical assistance. One
approach VHA might take is to develop a Learning and Action Network that provides
role-specific educational webinars and video shorts, as well as other user-friendly resources.
Users would benefit from being required to complete a set training curriculum that helps them
understand how they are expected to use SAIL, in their specific role, to facilitate improvement
in their facility. We also encourage including additional information on the methodology behind
the SAIL ratings and on how users can provide feedback.
6 CONCLUSION
The use of SAIL has resulted in several important changes to the way VHA, VISN, and VAMC
leadership have assessed VAMC care quality over the years. SAIL has gained widespread
attention from VHA leadership as an important tool for holding VAMCs accountable. As a
result, SAIL users have embraced the opportunity to provide feedback, and industry leaders in
the field of performance measurement have come together, all to assess the adequacy of SAIL
for evaluating performance, facilitating improvement, and, most importantly, holding VAMCs
accountable.
Overall, SAIL provides VA stakeholders with a visual tool for evaluating VAMC performance
and is recognized as a valuable VA-specific performance assessment dashboard that is both
needed and desired across VA settings. According to several responses from our Field Directors
Assessment, improvements in patient care have occurred as a direct result of using SAIL.
However, the Team believes opportunities exist to improve SAIL so it can become more specific
to both user needs and local settings, allowing users to better assess what to improve and how to
improve care in their facility. In addition, the Team believes that by focusing on improving this
tool, the quality of care provided to Veterans can reach a state of excellence.
This report has identified several overarching themes and recommendations for improvement.
These recommendations are provided to VHA in order to stimulate internal conversations among
VHA and SAIL leadership regarding next steps for improving SAIL.
APPENDIX A. METHODS
The Booz Allen Team was tasked with data collection under Task 6.2 (Discussion Paper),
Task 6.3 (Synopsis of Industry Best Practices on Measurement Systems), and Task 6.4 (Field
Directors Assessment) of the contract. These data collection efforts feed directly into the Final
Report (Task 6.5). The tasks are grouped by the qualitative and quantitative analyses the Team
used to arrive at recommendations addressing potential next steps for SAIL (Exhibit A-1). The
methodology of each of the earlier reports in this review process is summarized in the following
sections.
Exhibit A-1. Sources of Input for SAIL Final Report
Topic / Research Question | Data Collection / Input

Qualitative Analysis
  Assess the validity of SAIL as a tool to identify strengths and weaknesses of medical center performance | Field Directors Assessment; Synopsis of Industry Best Practices
  Critique whether the individual domains included within SAIL are the most salient for evaluating facility performance | Field Directors Assessment; Synopsis of Industry Best Practices
  Provide a detailed comparison of SAIL to other systems for assessing hospital performance that are presently used in the public and private sectors | Synopsis of Industry Best Practices
  Determine how SAIL is assessed by users in terms of relevance, utility, and perceived accuracy | Field Directors Assessment

Quantitative Analysis
  Assess whether the data elements within SAIL are representative, valid, and reliable contributors to their respective domains | Empirical analysis
  Evaluate the methodology for the ("star") ratings of quality and their internal and external validity | Empirical analysis
Discussion Paper
The purpose of this task was to develop a discussion paper of initial findings and gaps between
the committee report and industry practices. The Booz Allen Team reviewed the SAIL Review
Committee report,14 the SAIL Team response,15 and initial findings from an industry best
practice literature review in order to evaluate issues that have arisen as a result of SAIL
implementation. The Team focused on the measures, domains and reporting hierarchy, and
methodology and scoring.
14 The Strategic Analytics for Improvement and Learning (SAIL) Value Model Documentation, August 2014.
15 Response to the Report of the Committee to Review SAIL.
Synopsis of Industry Best Practices
To conduct the research for the Synopsis and to identify and review industry best practices
related to measurement systems for quality and safety, the Booz Allen Team used a four-step
approach: establishing research criteria, identifying key research questions, performing an
environmental scan, and synthesizing key findings. The Team collected data from three sources:
a review of measurement systems, a literature review, and industry expert discussions.
During the research process for the Synopsis, the Booz Allen Team reviewed industry
measurement systems for similarities across the industry. The Team developed an Excel data
collection template to organize the key elements and decision areas leaders must address when
developing and refining a measurement system. Using this template, team members reviewed
publicly available data on the following measurement systems: CMS Hospital Compare,
Leapfrog Group, Consumer Reports, Truven Health Analytics, Kaiser Permanente,
Intermountain Health, US News, and Health Grades. The Team also reviewed the SAIL Value
Model to collect comparable information on SAIL.16

16 On a few occasions, the Team received information that was not public (e.g., an updated SAIL Value Model from the SAIL Team).
The Booz Allen Team used the Google Scholar and PubMed search engines to complete the
literature review, which included peer-reviewed scholarly articles and grey literature from
industry reports, think tanks, not-for-profit associations, patient advocacy groups, and
government agencies. The literature search included studies published in English from 2004 to
the present, without geographic limitation, to capture a wide range of tested methods. The
literature review also included an assessment of access-to-care standards developed by insurance
regulators and agencies, such as state Medicaid programs. The Team searched for articles, then
reviewed abstracts to determine which articles to review in detail. The information extracted
from each article was mapped systematically against the research questions in a master
spreadsheet.
As a supplement to the literature review, the Booz Allen Team conducted eight stakeholder
interviews with leaders of private rating systems and health care delivery systems (Exhibit A-2).
The Team conducted these interviews to learn about other performance measurement systems
used in the field, to provide a more complete picture of existing industry best practices around
performance measurement, and to identify priority areas for measurement. These experts provided
robust comments and insights across the following nine sections of the Discussion Guide:
• Accountability and quality improvement
• Purpose
• Hierarchical structure of measures
• Scoring rules
• Comparisons/benchmarks and performance classification
• Ratings and data presentation
• Challenges
• Improvements
• Knowledge of VA population and geographies
The findings from the review of measurement systems, literature review, and industry expert
discussions are woven through the Final Report.
Field Directors Assessment
The objectives of the Field Directors Assessment were to: 1) determine how SAIL is judged by
users in the field in terms of relevance, utility, and perceived accuracy; 2) assess the validity of
SAIL as a tool for identifying strengths and weaknesses of medical center performance; and
3) determine whether the individual measures and domains included within SAIL are the most
salient for evaluating facility performance. In essence, day-to-day users of SAIL were
interviewed about their ability to comprehend SAIL metrics and translate the metrics into
actionable steps.
VHA provided the Booz Allen Team with a list of discussion candidates from which the Team
sent invitation emails to Medical Center Directors, Chiefs of Staff, Veterans Integrated Service
Network (VISN) Directors, and Program Office staffs, all diverse in both geography and facility
complexity score. One Deputy Chief of Staff was contacted in place of a Chief of Staff.
Exhibit A-3 identifies the breakdown of participants by title.

Exhibit A-2. Measurement System Industry Experts

Contact                        Organization
Patrick Conway, MD, MSc        Centers for Medicare and Medicaid Services
Melissa Danforth               The Leapfrog Group
Andy Amster                    Kaiser Permanente
Doris Peter, PhD               Consumer Reports
Bruce Spurlock, MD             Cynosure Health Solutions
Brent James, MD                Intermountain Healthcare
Thomas Graf, MD                Geisinger Health System
Jonathan Perlin, MD, PhD17     Hospital Corporation of America

17 Due to scheduling availability, the Booz Allen Team spoke with Dr. Perlin after completion of the Synopsis. His input is captured in the Final Report along with the other seven industry experts.
The informants represented a diverse cross-section of the VHA system. Their facilities ranged in
size from 85 to 1,621 beds. Interviewees represented 16 (32%) of the 50 U.S. states. Exhibit A-4
shows the star rating distribution for the facilities in the sample. The mean star rating for these
facilities was 3.2. The distribution of star ratings for this sample of interviewees was not
statistically different from the distribution of star ratings across the 128 VA Medical Centers
(VAMCs) and 19 ad-hoc facilities included in SAIL reports.
Exhibit A-4. Star Rating Distribution

Star Rating    N    %     Sample Distribution (%)   National Distribution (%)   P-Value
1              2    7     7.69                      7.81                        p=0.98
2              5    16    19.23                     21.88                       p=0.76
3              8    27    30.77                     40.63                       p=0.35
4              8    27    30.77                     19.53                       p=0.20
5              3    10    11.54                     10.16                       p=0.83
Excluded18     4    13    -                         -                           -
Total          30   100   -                         -                           -
Mean Rating    3.2  -     -                         -                           -
Median Rating  3    -     -                         -                           -
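The report does not state which statistical test produced the per-star p-values above. As a minimal sketch, assuming a two-sided exact binomial test of each star level's sample count against the corresponding national proportion (an assumption, not the documented method), the comparison might look like the following Python snippet:

```python
from scipy.stats import binomtest

# Hypothetical reconstruction: 26 rated facilities in the sample, compared
# star-by-star against the national star rating distribution.
national_props = {1: 0.0781, 2: 0.2188, 3: 0.4063, 4: 0.1953, 5: 0.1016}
sample_counts = {1: 2, 2: 5, 3: 8, 4: 8, 5: 3}
n_rated = sum(sample_counts.values())  # 26 facilities with star ratings

for star, count in sample_counts.items():
    result = binomtest(count, n_rated, national_props[star])
    print(f"{star}-star: observed {count}/{n_rated}, "
          f"national {national_props[star]:.2%}, p={result.pvalue:.2f}")
```

Under this assumption, a large p-value for every star level, as in Exhibit A-4, would indicate no detectable difference between the sample and national distributions.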
Exhibit A-5 presents the complexity level distribution for the facilities discussed. Complexity
level is determined based on the characteristics of the facility's patient population, clinical
services offered, educational and research missions, and administrative complexity. Level 1 is
the most complex and is subdivided into 1a, 1b, and 1c, with 1a being the most complex; Level 2
is moderately complex; and Level 3 is the least complex.
Exhibit A-3. Discussions Conducted by Title

Title                      N
Medical Center Director    15
Chief of Staff             14
Program Office             4
VISN Director              4
Deputy Chief of Staff      1
TOTAL                      38

Exhibit A-5. Complexity Distribution

Complexity Level    N    %
1a                  8    27
1b                  7    23.5
1c                  5    16
2                   7    23.5
3                   2    7
Excluded19          1    3
Total               30   100

18 No star rating available based on facility type.
19 No complexity rating available based on facility type.
Empirical Analysis
The Team performed an empirical analysis to assess whether the data elements within SAIL are
representative, valid, and reliable contributors to their respective domains and to evaluate the
methodology for the ("star") ratings of quality and their internal and external validity. To do so,
we used the most recently updated SAIL data, from the fourth quarter of fiscal year 2014
(FY2014Q4). These data included raw data on measures and domains, including numerators and
denominators, z-scores, weighted z-scores, and star ratings for 24 measures and nine quality
domains, covering 128 VAMCs and 19 non-acute care facilities. The efficiency z-score and
domain scores were not included in the current analysis.
We used these data to calculate the mean, standard deviation, median, percentiles, variance, and
coefficient of variation for all continuous variables, and frequencies for all categorical variables.
We performed F tests and t tests to compare the overall quality score and each domain score by
star rating. We performed internal consistency reliability analysis (using Cronbach's alpha
coefficients), examining measure-domain correlations and internal consistencies. We performed
Confirmatory Factor Analysis (CFA) to examine the underlying structure of the data and to
assess the current SAIL model structure. We also performed Exploratory Factor Analysis (EFA)
to investigate the factor structure of the most recent SAIL data. Finally, we performed Latent
Profile Analysis (LPA) to explore an alternative approach to classifying facilities into subgroups.
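As an illustrative sketch of two of these analyses, run on synthetic data with a hypothetical four-measure domain rather than the actual FY2014Q4 SAIL data, the following Python snippet computes Cronbach's alpha for one domain and, using a Gaussian mixture model as a common stand-in for LPA, fits one to five latent classes and compares them by BIC (lower is better):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic stand-in: z-scores for 128 facilities on 4 measures in one domain.
X = rng.normal(size=(128, 4))

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

print(f"Cronbach's alpha for the domain: {cronbach_alpha(X):.2f}")

# LPA-style class enumeration: fit mixture models with 1-5 classes and
# compare BIC to choose the number of facility subgroups.
for n_classes in range(1, 6):
    gmm = GaussianMixture(n_components=n_classes, random_state=0).fit(X)
    print(f"{n_classes} classes: BIC = {gmm.bic(X):.1f}")
```

On the real data, a three-class solution yielding the lowest BIC would be the kind of evidence behind the recommendation in Section 5.4 to consider a 3-star summary system.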
The main body of this report presents our high-level findings, followed by recommendations for
consideration based on all sources of information (previous reviews of SAIL, current industry
practices, VHA field experience, and empirical analysis).