SPECTRUM TOPR 0075: Veterans Health Administration (VHA) Strategic Analytics for Improvement and Learning (SAIL) Assessment Office of Strategic Integration (OSI) Veterans Health Administration (VHA) U.S. Department of Veterans Affairs (VA) April 24, 2015
Final Report Contract Number: VA798-11-D-0122 Prepared by: Booz Allen Hamilton 8283 Greensboro Drive McLean, VA 22102
TABLE OF CONTENTS
1 EXECUTIVE SUMMARY
2 INTRODUCTION
3 QUALITATIVE ANALYSIS
3.1 Assess Validity of SAIL as a Tool to Identify Strengths and Weaknesses of Medical Center Performance
3.2 Assess Individual Domains Included within SAIL for Evaluating Facility Performance
3.3 Provide a Detailed Comparison of SAIL to Other Systems for Assessing Hospital and Delivery System Performance that Are Used in the Public and Private Sectors
3.4 Determine How SAIL is Assessed by Users in Terms of Relevance, Utility, and Perceived Accuracy
4 EMPIRICAL ANALYSIS
4.1 Assess Whether the Data Elements within SAIL are Representative, Valid, and Reliable Contributors to their Respective Domains
4.2 Evaluate the Methodology for the (“Star”) Ratings of Quality and their Internal and External Validity
5 RECOMMENDATIONS
5.1 Measurement System Purpose — Accountability and Quality Improvement
5.2 Measures
5.3 Measurement System Hierarchy
5.4 Scoring/Star Rating
5.5 Measurement System Management
6 CONCLUSION
APPENDIX A. METHODS
LIST OF EXHIBITS
Exhibit 1. Short- and Long-Term Recommendations
Exhibit 2. SAIL Overview
Exhibit 3. Domain Framework
Exhibit 4. Medicare Total Performance Score
Exhibit 5. California Hospital Performance Ratings (CHART)
Exhibit 6. Comparisons of Domain Scores by Star Rating
Exhibit 7. Comparisons and Domain Scores by Star Ratings
Exhibit 8. Measure-Domain Correlation Results Summary
Exhibit 9. Fit Summary of Confirmatory Factor Analysis
Exhibit 10. Promax Rotated Factor Pattern of Current SAIL Data
Exhibit 11. Latent Profile Analysis
Exhibit 12. 3-Class Profile Plot
Exhibit A-1. Sources of Input for SAIL Final Report
Exhibit A-2. Measurement System Industry Experts
Exhibit A-3. Discussions Conducted by Title
Exhibit A-4. Star Rating Distribution
Exhibit A-5. Complexity Distribution
1 EXECUTIVE SUMMARY
As the nation’s policy makers call for alignment with national quality goals, the Veterans Health
Administration (VHA) continues to refine tools and strategies aimed at improving the quality of
care received by the nation’s Veterans, informed by comparisons with the private and public sectors. With
over 120 hospitals, 800 clinics, and 200 nursing homes, this is no small task for VHA. VHA has
been a leader in the use of system-wide performance measurement to assess care and to improve
processes and clinical outcomes.1,2,3 As part of an integrated set of activities intended to improve the
health care system and its constituent parts, VHA developed and implemented a comprehensive
measurement system known as the Strategic Analytics for Improvement and Learning (SAIL)
Value Model. With information and capabilities provided by SAIL, the VA gains the opportunity to
identify facilities that may require assistance achieving higher performance, select facilities that
could offer assistance as models for new practices or as high-performing mentors, and assist VA
Medical Center (VAMC) directors and staff in developing targeted quality improvement
initiatives.
VHA, given its ongoing role and commitment to the use of performance measurement to
improve quality of care, contracted with Booz Allen Hamilton (Booz Allen) and independent
consultants (referred to as the Booz Allen Team in this report) for an independent review of
SAIL. The recommendations produced by the Booz Allen Team (Exhibit 1) serve as a catalyst to
VHA-led and other stakeholders’ deliberations and strategies on iterative advancements in
performance measurement and reporting in the VA health care system. The Team understands
that advances and improvements in the field are frequently incremental and has grouped the
recommendations by suggested timeframe—short term versus long term—and included
observations for each recommendation area (e.g., measurement purpose, measures, hierarchy,
scoring/star rating, and measurement system management). The Final Report offers readers the
opportunity to glean findings and explore contextual information provided in more substantive
narrative (Sections 3, 4, and 5).

1 Kizer KW, Dudley RA. “Extreme makeover: transformation of the Veterans Health Care System.” Annual Review of Public Health. 2009;30:1–27.
2 Edmondson EA, Golden BR, Young GJ. Turnaround at the Veterans Health Administration. N9-607-035. Boston, MA: Harvard Business School. 2006.
3 Trevelyan EW. The Performance Management System of the Veterans Health Administration. Harvard School of Public Health Case Study. Cambridge, MA: Harvard School of Public Health. 2002.
Exhibit 1. Short- and Long-Term Recommendations
Area Observation Short Term (before next year) Long Term
Measurement System Purpose
There is a discrepancy between SAIL’s original design as an improvement tool and its current use also as an accountability tool. A determination of SAIL’s primary purpose will drive many of VHA’s design decisions and improvements.
• Clarify the purpose of the SAIL measurement system for accountability and improvement
• Clarify the scoring approaches for mastery (predetermined values or external standards) or relative peer performance
• Clarify the specific role of SAIL in informing WHAT versus HOW to improve
• Clarify the change model – how improvement is achieved and SAIL’s role in contributing to such change
Measures
Measures used in SAIL, especially for accountability, need to be valid (accurately and fairly measure what they purport to measure) and reliable (precisely and reproducibly discriminate true performance differences).
• Formalize a process for selecting and managing measures with respect to their intended uses
• Perform additional screening to examine each measure’s ability to discriminate performance
• Consider integrated scoring of quality and efficiency
Hierarchy
There are alternative hierarchy structures to be considered based on key stakeholders and industry standards. For accountability purposes, numerically combining individual metrics or domain scores into single measurements of quality may not be effective.
• Clarify rationale for the domain structure such as grouping measures that: i) share the same underlying dimension of performance, ii) reflect policy priorities, iii) are organized by clinical program area, iv) are based on psychometric properties, v) align with industry-wide constructs, etc.
• Reconsider weighting scheme for grouping measures that includes input from stakeholders
• Expand the measures hierarchy to include tiers that serve frontline care and service providers and construct clinical program area performance dashboards to align accountability with staff’s span of control by program area
Scoring/Star Rating
Our empirical analysis (latent class analysis) suggests the data support a three-class model, not a five-class model. The current star rating system may risk misclassification of facilities. Feedback from the field raises questions about the need for additional methods to allow for optimal fairness in inferences made from final summary scores.
• Conduct empirical investigations of non-equivalence across facilities to demonstrate appropriate handling of such differences, and optimal fairness in inferences made from the final summary scores
• Assess potential misclassification and address using techniques to mitigate medical center misclassification error
• Perform sensitivity analysis of the weighting scheme to reveal differences in resulting scores
• Consider shifting to a 3-star rating summary system
• Consider assigning responsibility for performance using a balanced scorecard approach that drills down to frontline supervisors
• Consider alternatives to creating fixed strata and limit comparisons within strata
Measurement System Management
The Field Directors Assessment suggests that some staff in other parts of VHA have not engaged with SAIL enough to feel like effective users of, or contributors to, SAIL as it exists.
• Engage internal stakeholders in adapting SAIL to the needs of its users over time
• Consider adopting rapid cycle evaluation or improvement best practices for managing SAIL
• Engage internal stakeholders through a formal process and consider a more robust training and technical assistance program to support SAIL end-users
2 INTRODUCTION
The Booz Allen Team conducted an independent assessment of the SAIL Value Model in
accordance with contract number VA798-11-D-0122. The SAIL Final Report (Task 6.5) is the
culmination of earlier deliverables from this review. The purpose of the report is to:
• Assess the validity of SAIL as a tool to identify strengths and weaknesses of medical
center performance.
• Assess the domain structure as a hierarchy for medical center performance reporting.
• Assess whether the measures within SAIL are representative, valid, and reliable
contributors to their respective domains.
• Evaluate the methodology for the (“star”) ratings of quality and their internal and external
validity.
• Provide a comparison of SAIL to other systems for assessing hospital and delivery
system performance used in the public and private sectors.
• Document how SAIL is assessed by users in terms of relevance, utility, and perceived
accuracy.
• Provide recommended next steps for SAIL improvements.
The Final Report contains findings from earlier reports—Task 6.2: Discussion Paper, Task 6.3:
Synopsis of Industry Best Practices on Measurement Systems (referred to hereafter as the
Synopsis), Task 6.4: Field Directors Assessment—and an empirical analysis of raw data
provided by VHA. The material in this report is organized to offer readers both high-level
summaries of findings, as well as contextually relevant discussions. The observations are
presented to share information, stimulate discussion, assist communication among stakeholders,
facilitate understanding, and provide guidance on potential SAIL enhancements that could be
pursued by VHA. Hereafter, the paper is organized in the following manner:
• Section 3: Findings from the qualitative portion of the assessment
• Section 4: Findings from the empirical analysis
• Section 5: Recommendations for potential next steps for SAIL
• Section 6: Conclusion of key themes in the report
• Appendix A: Methodology used to collect data for earlier reports
The SAIL tool is a web-based, balanced scorecard model used by the U.S. Department
Veterans Affairs to measure, evaluate, and benchmark quality and efficiency among VA Medical
Centers.4 SAIL was designed to offer a high-level view of health care quality and efficiency,
enabling executives and managers to examine a wide breadth of existing VA measures in one
place.1 SAIL assesses 28 quality measures, including mortality, complications, and customer
satisfaction, which are organized within the following domains: acute care mortality; avoidable
adverse events; Centers for Medicare and Medicaid Services (CMS) mortality and readmission
measures; length of stay; mental health; performance measures (ORYX/HEDIS); customer
satisfaction; ambulatory care sensitive condition hospitalizations; clinical wait times and call
center responsiveness; and efficiency. The underlying data on which SAIL is based are
provided through other VHA sources, such as Linking Knowledge and Systems (LinKS),
ASPIRE, VA Inpatient Evaluation Centers (IPEC), Performance Management, and Office of
Productivity, Efficiency, and Staffing (OPES).
Exhibit 2. SAIL Overview
The SAIL reporting tool incorporates data from 128 VAMCs that provide acute inpatient
medical and/or surgical care to Veteran patients. The report also includes data from facilities that
do not have acute inpatient medical and/or surgical care (i.e., Ambulatory Care Centers,
Rehabilitation Centers, and Outpatient VAMCs).1 From these measures, SAIL also provides a
composite 1- to 5-star rating of overall quality for each of the 128 VAMCs.2 The VA’s
hypothesis is that 1-star facilities will benefit from adopting successful practices from 5-star
facilities.

4 Veterans Health Administration Office of Informatics and Analytics. Strategic Analytics for Improvement and Learning (SAIL) Fact Sheet. Retrieved March 2015 at http://www.hospitalcompare.va.gov/docs/06092014SAILFactSheet.pdf
SAIL’s assessment of the relative performance of facilities involves several steps:
1. Facilities are first compared within their comparison group on individual quality
measures and assigned a score based on their relative performance;
2. Within each domain, the measure scores are multiplied by the assigned weight and then
added together to become the domain score;
3. The domain scores are then used to calculate the quality composite score; and
4. Using 10th, 30th, 70th, and 90th percentile cut-offs of the composite scores, each facility
is designated a 1- to 5-star rating for overall quality.
Facilities are assigned a 1- or 5-star rating if their scores fall in the bottom or top 10 percent,
respectively.5 Facilities in the next bottom and top 20% of the distribution are assigned a 2- or
4-star rating, respectively. The remaining 40% of facilities are assigned a 3-star rating. SAIL’s
5-star facilities that have acute care inpatient mortality or 30-day mortality in the highest 20%
(high mortality) are demoted to a 4-star rating. An equal number of 4-star facilities are promoted
to a 5-star rating. Facilities with a 1-star rating that have the most inpatient measures performing
better than the bottom 20% of health systems in the Truven Top Health Systems study are
promoted to a 2-star rating.2 SAIL is noted by VHA to be unique because, unlike most other
health industry report cards, which are updated annually, SAIL is updated quarterly to allow
medical centers to more closely monitor the quality and efficiency of the care delivered to
Veterans.2

5 Veterans Health Administration Office of Informatics and Analytics (2015). Strategic Analytics for Improvement and Learning (SAIL) Measure Education Module – Healthcare Acquired Infections.
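The percentile-based star assignment described above can be illustrated with a brief sketch. This is a minimal illustration of the cutoff logic only, not SAIL's actual implementation; function and variable names are hypothetical, and the mortality demotion and Truven-based promotion adjustments are omitted.

```python
import numpy as np

def assign_stars(composite_scores):
    """Assign 1- to 5-star ratings from composite quality scores using the
    10th/30th/70th/90th percentile cutoffs described in the report."""
    scores = np.asarray(composite_scores, dtype=float)
    cutoffs = np.percentile(scores, [10, 30, 70, 90])
    # searchsorted returns 0-4 depending on which cutoff band a score falls
    # into; adding 1 yields the 1- to 5-star rating.
    return np.searchsorted(cutoffs, scores, side="right") + 1

# Example with 128 simulated facility composite scores: roughly 10% of
# facilities receive 1 star, 20% receive 2 stars, 40% receive 3 stars,
# 20% receive 4 stars, and 10% receive 5 stars.
rng = np.random.default_rng(0)
stars = assign_stars(rng.normal(size=128))
```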
3 QUALITATIVE ANALYSIS
The primary purpose of a measurement system will influence the decisions made by system
developers. The primary purpose drives decisions in the selection of measures, structure of
domains, and scoring methodologies. One example is the choice in measure selection between
rapid-cycle, focused measures intended to provide formative feedback regarding
improvement efforts and more valid and reliable summary measures suitable for formal
accountability. As noted in earlier reports produced by the Booz Allen Team, these
differences can be illustrated through the example of readmission measures6:
• Accountability systems include measures designed for provider comparisons, include
risk-adjustments, and are collected retrospectively over long periods of time.
• Improvement systems include measures that capture local performance over time, are
not risk-adjusted, and are captured weekly or even daily to provide real-time information
as to whether interventions or clinical care are producing the intended outcomes.
A primary issue that underlies the prioritization of improvements to SAIL is the
discrepancy between its original design as an improvement tool and its current use as an
accountability tool. SAIL was originally developed as “VA’s internal improvement tool which is
designed to offer high-level views of health care quality and efficiency.”7 A determination of
SAIL’s primary purpose will drive many of VHA’s design decisions and improvements.
Accountability may take on a different meaning when referring to a public institution like the
VA. In a private sector hospital system, measures used for improvement may never be visible to
the public. But in VA, even internal measures intended to focus on improvement are likely to be
discovered by congressional staffers or the media and made public for the purpose of holding
VA accountable. Therefore, they may become de facto accountability measures even when that
is not their original intent. The dichotomy of the primary purpose(s) of SAIL is further discussed
throughout the Final Report.
3.1 Assess Validity of SAIL as a Tool to Identify Strengths and Weaknesses of
Medical Center Performance
This section of the Final Report provides findings from the qualitative tasks from the Field
Directors Assessment and Synopsis.
6 SAIL Discussion Paper (Task 2), Section 3
7 SAIL Value Model 20141210.pdf
3.1.1 Field Director Discussions
The Field Director discussions with the Medical Center Directors, Chiefs of Staff, Deputy Chiefs
of Staff, and Veterans Integrated Service Network (VISN) Directors provided insights into how
SAIL is being used in the field, as well as the perception of SAIL by users with regard to its level
of accuracy and relevancy for identifying both strengths and weaknesses in their own
performance. Responses obtained throughout these discussions were used to assess the perceived
validity of SAIL for identifying strengths and weaknesses of medical center performance.
Nearly all of the VISN Directors, Medical Center Directors, Chiefs of Staff, and Deputy Chiefs of
Staff appreciated that their feedback and opinions are being considered for future improvements
to SAIL. When all 38 Program Office Staff, VISN Directors, Medical Center Directors, Chiefs of
Staff, and Deputy Chiefs of Staff were asked what they like most about SAIL, 90% indicated that
they like the concept of a centralized dashboard or “snapshot” across all VAMCs where they can
view high-level data, and 70% indicated that they liked the graphics and visual presentation of
SAIL data and reported that improvements were made in their facilities as a direct result of SAIL.
Responses from the field indicate changes could be made to improve SAIL’s perceived validity
for identifying strengths and weaknesses of medical center performance. The following three
overarching themes emerged in terms of recommendations for improvement with regard to
evaluating facility strengths and weaknesses: 1) comparison of VAMC measures to existing
benchmarks of excellence and to neighboring non-VA facilities; 2) increase in the frequency
with which the SAIL results are reported; and 3) addition of measures sensitive to day-to-day
aspects of managing VA care.
These recommendations must be reviewed in the context of the intended use of SAIL. For the
first recommendation, the selection of comparators can be quite different depending on the
intended use. For example, if the desire is to support consumer choice of a facility, local
comparators would best aid decision making. If selecting comparators for accountability of results, a focus
on a wide national sample would best suit this purpose. For the second recommendation,
frequent and recent results best suit quality improvement purposes. However, for the purpose of
accountability, SAIL results can be less frequent to support reliable estimates for low-frequency
events. Regarding the third recommendation, more sensitive measures support quality
improvement, but those measures are often not appropriate to be rolled up for a star rating or
accountability measurements. The dichotomy of the intended use of SAIL as a tool for quality
improvement and accountability is further explored in the following sections of the Final Report.
When interviewees were asked what they liked least about SAIL, 39% (15 out of 38) pointed to
the method behind the star rating system. Interviewees generally noted that since the star rating is
a ranking system among VA facilities, it is not viewed as an appropriate depiction of medical
center performance since someone is always at the top and someone is always at the bottom. As
a result, a few Field Directors reported concern that this drives competition as opposed to
collaboration. Furthermore, concerns indicated that the star rating focuses all attention on
performance difference between VHA facilities without regard necessarily to the meaning or
value of such differences and ignores more relevant comparisons to established benchmarks of
excellence and to local health care alternatives to VHA facilities. More than half (58%)
responded they would prefer, instead, to be measured against established benchmarks of
excellence and to have their performance compared to neighboring non-VA facilities as much as
possible so they can tell patients their facility is “just as good or better than [the] neighboring
non-VA facilities.”
When measuring VA facilities against established benchmarks of excellence, it was reported by
many respondents that facility demographics need to be factored into the ranking — this would
include facility complexity level as well as urban versus rural facilities so “apples-to-apples”
comparisons occur as opposed to what is now happening, which is “apples-to-oranges.” These
recommendations highlight considerations that must be addressed when establishing benchmarks
for performance improvement. Relative benchmarks are best for purposes of continuous
improvement. Absolute benchmarks are best in some circumstances to establish clear levels of
achievement in the short term (e.g., a one-year payout schedule). The primary purpose for SAIL
and its measures must be considered when determining appropriate benchmarks.
Reporting lag-time was mentioned as an aspect preventing users from being able to effectively
use SAIL as a performance assessment and improvement tool, especially when it comes to
identifying shortcomings within care delivery. In fact, 10% (4 out of 38) of all respondents
cited the reporting lag-time when asked what they liked least about SAIL. This response
was the third most common response for this question. Moreover, when asked what measures
interviewees would like added or removed from SAIL, 95% (32 out of 34) noted they would like
to see more real-time data metrics incorporated into SAIL. The data reporting lag-time also
played a role in why respondents reported certain measures, such as patient satisfaction and the
efficiency measure, are not useful for monitoring improvement efforts. Specifically, 20% (7 out
of 34) stated directly that SAIL data must be current in order to be actionable. Existing
quality decision support tools designed specifically for monitoring clinical care process and
outcome measures and managing patient satisfaction may be more appropriate than using SAIL
for both purposes, improvement and accountability.
When interviewees were asked how relevant and useful they found SAIL overall for assisting
them in managing their medical centers’ performance, 50% (15 out of 30) indicated SAIL is not
useful or relevant. This led to discussion of what additional information pertinent to their day-to-
day operations is needed. The most commonly suggested types of information to be added are:
1) how much VAMCs are supporting staff development and fostering employee growth and
satisfaction; 2) human resource data (e.g., vacancy rates); 3) data on patient flow; 4) how
VAMCs are supporting research; 5) contracting data (e.g., purchased care or "fee" care); 6)
outpatient care domain; 7) surgical care domain; 8) homelessness; 9) diabetic foot care; and 10)
mental health drill-down data.
3.1.2 Industry Expert Discussions
The industry expert discussions further provide insight into performance measurement tools, such
as SAIL, used to assess strengths and weaknesses within a system. The industry experts
interviewed represented organizations that are committed to continuous improvement in
healthcare quality, efficiency, and value. The efforts within their organizations are highly
regarded, even exemplary, in relation to the state of the art in the industry as a whole. VHA has
watched the industry attentively for opportunities to borrow conventions and adapt best practices
in measurement, as well as dissemination and education.
In many of those respects, VHA matches the quality and thoroughness of other leading health
systems and ratings organizations. As shown elsewhere, the SAIL Value Model includes a
number of measures spanning all the dimensions of health system performance recommended by
the Institute of Medicine (IOM).8 And, as of 2015, SAIL extended its measurement system to
include mental health as a set of important conditions faced by Veterans and as a specialty or
service line provided by VHA. In addition, SAIL includes mission-critical measures related to
access to care.
In contrast to the standalone rating systems (Health Grades, Truven, Consumer Reports, and
Leapfrog), the measurement systems integral to health delivery systems tend to manage and
deploy many more measures. This reflects broad responsibility for complex delivery systems that
span the continuum of care, as well as the need to direct information to multiple parties ranging
from external parties to senior leadership, middle management, and frontline staff. Also,
measures used for evaluating management are often parsimonious and focused and rolled up to a
summary measure. Measures used to drive improvement are varied and not usually aggregated to
create an overall rating. The multiple audiences reflect several additional purposes for measures
beyond sending comparative signals to external parties (e.g., consumers), such as:
• Making or monitoring the business case for investments in infrastructure or operations,
e.g., clinical registries, electronic health records (EHRs), clinicians, administrative and
clinical support staff, building capacity, patient education, and other technologies.
• Inferring or establishing responsibility (accountability) for specific patients or episodes of
care, i.e., attribution of outcomes to specific departments, sites, or individuals.
• Linking measures in coherent causal chains that connect frontline activities to designated
outcomes of interest.
• Organizing measures around specific patient cohorts conforming to physician specialty
and departmental lines of service; connecting care processes, service utilization, resource
inputs, cost, clinical outcomes, and self-reported patient experience.
For the purpose of accountability, the number of SAIL measures is consistent with information
gleaned from the industry interviews. Industry experts highlighted that fewer measures were
generally used for accountability measurement. Most of the SAIL measures reflect industry
standards in terms of their construction and meet standards of face validity (i.e., they are
specified and constructed reasonably to reflect their intent, or label). Many of the measures are
borrowed, replicated, or adapted from external sources (e.g., Truven, CMS, and the Agency for
Healthcare Research and Quality [AHRQ]).

8 Deliverable 6.3 Synopsis of Industry Best Practices on Measurement Systems, Exhibit 8.
The ability of SAIL to capture and identify strengths and weaknesses of medical center
performance for quality improvement is limited partly by the measure set. Although motivated
VHA staff can “drill down” into layers of data beneath the surface measures, it is largely a
qualitative or intuitive exercise. For example, users can see which patients died. This provides
great transparency with respect to how raw data and calculations led to SAIL measures and even
could facilitate qualitative improvement efforts, such as special root cause analyses. However,
the root causes themselves, the significant determinants of the outcomes of interest, are not
codified as measures in SAIL. Additionally, SAIL does not provide users with a consistent
methodology for carrying out these types of analyses.
The industry experts interviewed by the Team discussed that integrated measurement systems
involve a process map consisting of nested layers that give rise to the outcome of interest, but
connect directly to the “decision layer where all change happens.” Processes can be organized
into hierarchies, typically involving around three to nine steps, each of which can be seen as a process or
an outcome depending on perspective. For example, certain tests, medications, and other clinical
activities can determine directly the lipid profile of a patient, and lipid control is an
(intermediate) outcome of interest. Similarly, clinical activities affect blood pressure control or
glucose control, which individually and in combination affect the clinical progression of diabetes
and the onset of complications such as retinal disease, impaired kidney function,
neuropathy/amputation, cardiovascular health and outcomes, and eventually survival. This type
of process map is not apparent in SAIL. Consequently, SAIL does not by itself identify the
pockets of deficiency, the unreliable links in the causal chain, or the specific challenges at the
decision layer of operations.
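To make the notion of a nested process map concrete, the sketch below encodes a hypothetical causal chain for diabetes care as a simple data structure. The layers and activities are illustrative examples drawn from the narrative above, not actual SAIL or IDS content.

```python
# A hypothetical nested process map: frontline activities feed intermediate
# outcomes, which in turn feed the outcome of interest. Illustrative only.
diabetes_process_map = {
    "outcome": "clinical progression of diabetes and survival",
    "intermediate_outcomes": {
        "lipid control": ["lipid panel testing", "statin prescribing"],
        "blood pressure control": ["BP monitoring", "antihypertensive titration"],
        "glucose control": ["HbA1c testing", "medication adherence counseling"],
    },
}

def decision_layer(process_map):
    """Return the frontline clinical activities -- the 'decision layer where
    all change happens' -- underlying the outcome of interest."""
    return [step
            for steps in process_map["intermediate_outcomes"].values()
            for step in steps]
```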
SAIL is an efficient tool managed by a fairly small and dedicated multidisciplinary team. It
welcomes suggestions for new measures and imports various measures from other parts of VHA
or from external sources. From our Team’s discussions with the SAIL Team during this review
process, the SAIL Team is open to augmenting its capacity and breadth to serve the medical centers with
more details and guidance regarding performance improvement.
Taking the included measures as given, SAIL makes commendable efforts to help users
interact with the measures and understand them in context. Tabulations and graphical
presentations can allow users to see easily how they rate compared to other users (medical
centers) and to their own prior performance. Rating organizations tended to stress cross-sectional
or contemporaneous comparisons among providers and health plans. That largely reflects their
mission of providing report cards and decision-support for making choices about where to
receive care presently.
In contrast, although experts in health systems acknowledged the importance of external
benchmarks and comparisons, particularly for external accountability, they tended to give greater
attention to internal benchmarks and evidence of improvement toward specified goals. This is
especially relevant for systems designed for the purpose of performance improvement. Measures
that were for external consumption or for accountability generally were seen as “floors” or
requirements for minimum adequate performance. Measures for improvement were viewed
instead as more challenging and “aspirational,” intended to motivate and guide performance to
levels not normally achieved (by nearly anyone).
Regarding measures of improvement in relation to internal benchmarks, industry experts
downplayed the need for statistical risk-adjustment, given so much is held constant naturally,
including the needs and complexity of the patient population. VHA also would seem to benefit
analytically from having a fairly stable target population for whom they retain ongoing
responsibility for continuity of care and long-run outcomes.
Comparisons between the VHA system and other healthcare sectors, and even among medical
centers within VHA, may be more difficult without risk adjustment. Failure to measure or adjust
for differences in underlying patient characteristics and needs, or for the relationship between the
characteristics of different medical centers and their resulting “niche” in the local markets for
Veterans, may threaten the validity of such comparisons. Hence, capturing or identifying
situations of relatively low performance and opportunities for improvement may be distorted by
non-equivalent comparators.
3.2 Assess Individual Domains Included within SAIL for Evaluating Facility
Performance
This section provides a review of our findings from the Synopsis (Task 6.3) and the results from
the Field Directors Assessment (Task 6.4).
3.2.1 Field Discussions
The SAIL Value Model allows users to view performance according to each domain—Acute
Care Mortality, Avoidable Adverse Events, CMS 30-Day RSMR & RSRR, Length of Stay,
Performance Measures, Customer Satisfaction, ACSC Hospitalizations, Clinical Wait Times &
Call Center Responsiveness, Mental Health, and Efficiency. The domain scores are presented as
the average z-score of measures in the same domain. The overall quality z-score is the average of
the nine quality domain z-scores.9
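A minimal sketch of this scoring convention, assuming each measure has already been standardized into a z-score across facilities (all names are illustrative):

```python
import numpy as np

def domain_and_overall_scores(measure_z, domain_of):
    """Average standardized measure scores into domain z-scores, then average
    the quality domain z-scores into the overall quality z-score.

    measure_z: dict mapping measure name -> z-score for one facility
    domain_of: dict mapping measure name -> quality domain name
    """
    by_domain = {}
    for measure, z in measure_z.items():
        by_domain.setdefault(domain_of[measure], []).append(z)
    domain_scores = {d: float(np.mean(zs)) for d, zs in by_domain.items()}
    overall = float(np.mean(list(domain_scores.values())))
    return domain_scores, overall
```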
The perceived saliency of the SAIL domains for evaluating facility performance, as reported
throughout our interviews with Medical Center Directors, revealed mixed results and several
suggestions for improvement. When interviewees were asked to assess the relevance and
usefulness of the SAIL domains for managing their medical center(s)’ performance and driving
improvement, 57% of respondents expressed that the domains were generally relevant.
As mentioned earlier, several interviewees recommended incorporating additional information
into SAIL, including the following: 1) how much VAMCs are supporting staff development and
fostering employee growth and satisfaction; 2) human resource data; 3) data on patient flow; 4)
how VAMCs are supporting research; 5) contracting data (e.g., purchased care or "fee" care); 6)
outpatient care domain; 7) surgical care domain; 8) homelessness; 9) diabetic foot care; and 10)
mental health drill-down data. In fact, 12% (5 out of 34) said SAIL was weighted too heavily
toward inpatient rather than outpatient care. Additionally, several interviewees recommended
eliminating overlap of measures within domains. For instance, there are multiple mortality measures
within the acute care mortality domain and additional mortality measures within the CMS
measures domain. Pressure ulcer measures are in both the risk adjusted complication index and
the risk adjusted patient safety index within the avoidable adverse event domain.
9 As documented in SAIL Data Definition 2014, Quarter 4
Overall, more outpatient care information, surgery, and mental health drill-down data were noted
to be foundational needs for attempting to evaluate the level of care provided by VA, whereas the
other suggestions were noted to be important for managing their facility and evaluating overall
quality of their facility or facilities. One interviewee suggested the domains first should be
formulated based on the primary goals of VA. Then measures can be identified to fulfill these
goals, such as population health, Veteran experience, financial stewardship, excellence in
workforce, and excellence in service to communities. It was expressed that these goals as a
whole are not adequately captured within the current SAIL domains.
Many expressed concerns with the current weighting of the domains. These concerns boiled
down to two key aspects: 1) a poor understanding of why the domains are weighted the way
they are; and 2) frustration with the “lack of transparency” surrounding the weighting
methodology, leading to poor confidence and trust in the SAIL tool overall. In addition, one
respondent suggested the SAIL domains need to more accurately account for geographic
variation, such as labor force issues and population issues.
3.2.2 Industry Expert Discussions
Clinical programs are the bedrock of the performance measurement systems for three of the
leading U.S. integrated delivery systems interviewed by the Booz Allen Team – Intermountain
Healthcare, Geisinger Health System (Geisinger), and Kaiser Permanente (Kaiser). From the
industry expert discussions, the Team gleaned that Intermountain Healthcare organizes its
performance metrics by 13 clinical areas, and Geisinger maps its metrics to approximately 25
clinical programs. While the measures are grouped by clinical programs, unlike SAIL, these
groupings are not used as a performance scoring mechanism. In essence, Intermountain
Healthcare and Geisinger group measures based on clinical programs or service level for
organizational purposes, but these systems do not provide a “clinic-level” score. Thus, these
systems do not produce clinical program metrics rolled up into system-wide scores. The lack of a
scoring hierarchy is common when the purpose of measurement is quality improvement.
Kaiser’s domain formulation differs somewhat from the other IDS. Many of Kaiser’s measure
sets are bundled by condition and others are clustered by program priorities such as patient safety
or preventive health. These composites are then grouped into clinical program areas. For
example, a cancer treatment composite is joined with a cancer screening composite to form the
cancer management domain. Kaiser has no single set of domains — rather its decision support
system enables the user to tailor a dashboard as needed. The domain templates available to
Kaiser staff include:
• IOM Six Aims (safe, effective, efficient, timely, patient-centered, equitable)
• Donabedian classification (structure, process, outcome)
• National Committee for Quality Assurance (NCQA) categories (respiratory,
musculoskeletal, diabetes, cardiovascular, etc.)
• Care setting (inpatient, ambulatory, home health, skilled nursing facility, hospice)
• Externally mandated (regulatory, accrediting, purchaser agreement)
• AHRQ and National Quality Forum (NQF) groupings (similar to organ systems)
The nine SAIL domains, organized mainly by measure type (e.g., outcomes, processes,
patient/staff experience, efficiency, etc.), differ from the IDS approaches that are oriented to
clinical programs and their underlying processes of care. A number of SAIL domains are
conceptually akin to the CMS Hospital Value-Based Purchasing program though CMS uses four
summary performance categories: 1) clinical process of care; 2) patient experience; 3) outcomes;
and 4) efficiency. SAIL’s newest domain, mental health, shares the clinical program focus that is
the core of the IDS measurement programs.
The measurement systems in our convenience sample underscored the importance of domains or
other summary indicators as devices to align the workforce around a small number of priorities
that embody the organization’s mission. Each of these systems signals the organization’s focus
on their major clinical programs. Beyond this clinical program orientation, the measurement
systems’ approaches differ when communicating summary performance indicators to foster
alignment.
The measurement systems use several performance information constructs to organize and
communicate each organization’s goals, business priorities, and performance. In lieu of summary
performance metrics, Intermountain Healthcare aligns measures to illustrate the five program
goals in patient safety, integrated electronic medical records (EMR), patient experience, care
redesign, and prevention/wellness. By contrast, Geisinger uses four “value equation” dimensions
as the pillars of its performance management system:
• Clinical
• Patient Experience
• Total Cost of Care
• Professional Experience
The flexibility of Kaiser’s decision support tool equips Kaiser’s regions and medical centers to
formulate performance dashboards for different needs — whether clinical program management,
region-wide strategic plan monitoring, or otherwise.
Resource use accountability is addressed differently among the IDS. Total cost of care is one of
the four dimensions that comprise Geisinger’s value equation; resource use accountability is
mainly lodged with its primary care providers. Intermountain Healthcare incorporates
appropriateness markers and cost/resource intensity per case metrics into its clinical program
dashboards. Kaiser includes appropriate care metrics in its national measures repository, while
cost metrics are adopted at the region or medical center level.
In summary, the measurement systems’ domains markedly differ from the SAIL approach
(Exhibit 3). For a given clinical program, the measurement systems encapsulate the various
metrics found across the nine SAIL domains, but the metrics are specific to the measurement
system’s clinical program patient population. That is, an IDS clinical program dashboard can
include process, outcome, and cost measures. These IDS likely also use some domains that
comprise cross-program measures (e.g., a Hospital Consumer Assessment of Healthcare
Providers and Systems [HCAHPS] patient survey measure set), but the emphasis is on
fashioning measure sets that are within the span of control of a given clinical program. The IDS
also group performance measures into summary domains for internal accountability purposes,
but these are organizing constructs, not mechanisms to produce domain-level performance
scores.

Exhibit 3. Domain Framework

SAIL
• Organized mainly by measure type (e.g., outcomes, processes, patient/staff experience, efficiency, etc.)
• Global indicator

Intermountain Healthcare
• Emphasis on specific process metrics rather than aggregated domains
• Measure sets organized by ~10 clinical areas
• Overall defect rate is an example of a roll-up metric (e.g., selected care process measures)
• Major care areas and processes organized into 60 datamarts
• No global indicator

Geisinger Health System
• Summary domains: Clinical, Patient Experience, Total Cost of Care, Professional Experience
• Care bundles comprised of multiple measures (e.g., 9 diabetes measures), nested in the clinical domain, organized by 25 clinical service areas
• No global indicator

Kaiser Permanente
• Aggregates measures into domains
• No single approach; tools give managers flexibility to tailor dashboards/hierarchy
• Preference to organize by IOM six aims
• No global indicator
3.3 Provide a Detailed Comparison of SAIL to Other Systems for Assessing
Hospital and Delivery System Performance that Are Used in the Public and
Private Sectors
This section provides a review of our findings from the Synopsis (Task 6.3) comparing the SAIL
Value Model to other systems used in the public and private sectors. In this section, we discuss
the following topics:
• Accountability versus improvement
• Measures (selection, reliability, etc.)
• Measurement system hierarchy (structure)
• Scoring (weighting, time periods)
3.3.1 Accountability versus Improvement
The distinction between measurement approaches used for accountability versus improvement
was an overarching theme sounded by industry experts. While often there is overlap in the
performance measures used for both purposes, in many instances the measures differ. More
importantly, different performance targets are applied and the methods’ rigor can vary when used
for learning and improvement in contrast to use for accountability objectives, like public
reporting, personnel performance assessment, and incentive payments.
The interviewees stressed the distinction between the measures and performance targets used for
accountability given the higher stakes — career, payment, reputation — compared to measures
for improvement purposes. The performance indicators for improvement can include metrics or
methods that have less technical rigor (e.g., no standard case-mix adjustment element) and whose
performance targets are aspirational, given that their uses have lesser consequences. Geisinger
cited an example of using a “quality gate” coupled with an incentive payment program. The
“quality gate” defines minimally acceptable quality to be eligible for payment rewards under an
accountability initiative. That same quality metric could be ratcheted to a much higher threshold
if used for improvement work.
The Kaiser Permanente Quality Measures (KPQM) repository, which comprises approximately
450 quality measures, illustrates measures’ dual purposes. Each measure is categorized by its
approved use for: 1) accountability; 2) improvement; or 3) both accountability and improvement.
Kaiser reports that while measure designation for improvement purposes is generally
straightforward, the assignment of measures for accountability is ongoing as there is less internal
consensus on designating certain measures for accountability uses, particularly if the measure is
linked to payment.
Kaiser’s performance management decision support organizes information to support the
accountable units that can effect change in a set of care or service processes. Unlike SAIL, these
performance systems are not organized by measure type (outcome, process, cost, etc.); rather the
measure sets are fitted to patients grouped by related clinical disciplines like cardiovascular
health.
For the measurement systems reviewed, reporting for accountability and rating purposes uses few
categories to represent the organization’s key performance dimensions. Notably, these categories
are not a quantitative summation of underlying metrics; instead they are composed of a subset of
metrics that are representative of the topic and may be tailored to the user’s needs. For example,
if access to care was deemed a vital aim of the organization, that category could house
overlapping but different measure sets for a hospital chief, an ambulatory clinical leader, or a
rehabilitation unit manager.
3.3.2 Measures
For measure sets, which are composed of hundreds of measures, the most common measure
selection criteria cited by the industry key informants were:
• Health impairment — condition prevalence and severity
• Improvement opportunity — potential clinical and cost gains
• Performance variation — among the accountable entities
Intermountain Healthcare explained its longstanding efforts to identify the relatively few care
processes, and their associated metrics, that account for the bulk of the opportunity to improve
health and reduce costs. Approximately 7% of its processes account for approximately 95% of
the potential gain. The Intermountain Healthcare change model — the systems and mechanisms
to influence quality and cost — is centered on process change. As such, its performance system
is predominately process metrics though certain outcomes measures — including clinical, cost,
and patient experience — are part of the measurement system.
Intermountain Healthcare’s framing of domains by clinical programs reflects a second tier of
criteria for selecting performance dashboard measures. These additional criteria are: 1)
actionability; 2) span of control; and 3) time to effect change.
IDS clinical program managers are responsible for measure sets that reflect the processes,
people, and resources within their purview and for performance targets that can be influenced
during the relevant reporting period. Though our expert discussions did not address partitioning
performance objectives and metrics into near-term (annual) and long-term (e.g., three years)
components, such an approach can be used to match accountability to the time to effect change
and in the meantime monitor progress.
Kaiser’s repository of quality measures largely is sourced from external measurement sets used
by CMS, NCQA, NQF, and the Joint Commission in their regulatory, payment, and recognition
programs. Kaiser’s federation model, in which its regional health systems and plans have had
considerable autonomy, explains the need to draw upon industry standard measures as it seeks
metrics alignment across its seven regions. For example, there is no common patient scheduling
system across Kaiser regions; hence, it is unable to craft access measures specific to appointment
systems. About one-third of the KPQM metrics are internal to Kaiser. The regions and their
health centers assemble their dashboards from the KPQM metrics in addition to local
performance measures.
In contrast to drawing upon external performance metrics and categories, Intermountain
Healthcare’s system has been built from the bottom up. With a concentration on the needs of
those front-line staff that implement change, Intermountain Healthcare has created metrics that
are actionable for staff in their work to manage care processes. Similarly, Geisinger, with its
emphasis on staff span of control, has shaped its performance information system to support staff
as participants and contributors to its system of care.
Measurement systems like Kaiser and Geisinger have the advantage of enrolled health plan
populations for which they capture total cost of care metrics. Intermountain Healthcare may have
the same advantage given its health plan division (SelectHealth) though its health services
division concentration is on cost per case, which is well suited to its process management
approach.
Though SAIL measure selection criteria were not included in the materials we reviewed, the
criteria appear to include: 1) clinical measures for which there are external benchmarks; 2) a
well-rounded mix of measure types that represent a number of the IOM six aims; and 3)
particular attention to two performance areas of high importance to VHA — access and mental
health. The measures are predominately inpatient, though there are important exceptions. It is
unclear to what extent the SAIL measure criteria include improvement-centric elements like
actionability.
3.3.3 Measurement Hierarchy
Each measurement system reviewed has a measurement hierarchy in which individual measures
are aggregated into summary sets and a single score is computed for that composite measure.
The composition of these composites varies — in some instances individual measures are
collapsed into a condition-specific bundle; in other cases the measures are aggregated into a
cross-cutting construct like patient experience or safety. Grouping of composite measures by
clinical programs is a mainstay of these measurement systems (e.g., orthopedics or primary
preventive care). Importantly, though measures are grouped by clinical programs, summary
program scores are not computed. Rather, discrete metrics are organized into dashboards that are
useful for staff to manage systems of care for patient populations within a given clinical area.
These IDS are not using domains in the same way as SAIL, which derives its origins from a
public reporting and rating model (Truven). For VHA, a SAIL domain defines a specific, small
set of two to four measures to represent a performance topic, such as satisfaction or mortality,
and these domains are the underpinning of a structure to compute summary performance scores.
In contrast, the IDS use domains as an organizing construct in which a variety of measures can
be mapped to the domain depending on the user needs. The “clinical quality domain” may
comprise orthopedic quality metrics if that is the program of interest, or for another user it may
be a set of preventive care measures. Similarly, a set of inpatient adverse event measures may be
mapped to a patient safety domain or the domain may be populated by ambulatory metrics for
fall prevention and medication reconciliation.
The IDS highlighted the importance of “nested dashboards” in which there are multiple
dashboards to organize performance information for each operational tier, aligned with that staff’s
responsibilities. For instance, the cardiovascular program medical officer’s performance
dashboard may include a more comprehensive array of cardiovascular metrics, subsets of which
are nested in subsidiary dashboards for department chiefs for interventional cardiology, surgery,
recovery, and rehabilitation, etc. Although we did not assess the relationship of the 28 SAIL
measures to underlying program measure sets, the sample drill-down reports do not reveal a
bridge from SAIL to other VHA clinical program dashboards.
All of the IDS use performance rubrics that mimic the Institute for Healthcare Improvement
(IHI) Triple Aim — each uses dashboards with population health, patient experience of care, and
per capita cost dimensions. Geisinger is notable for its elevation of the “experience of the
professional” dimension on an equal footing with the Triple Aim components. Here, Geisinger is
highlighting the importance of operating a system of care that is embraced and advanced by its
staff.
3.3.4 Scoring
The health care systems we interviewed do not compute summary category or global scores. As such, their scoring tasks involve straightforward scoring of individual measures and the combination of measures into composites. We did not probe the components of these scoring formulas, though there are several industry standards for calculating such composites, including summing the measures to compute a weighted average or using binary all-or-nothing scoring. Standardized scoring techniques, like the SAIL z-scores, are not needed because IDS measure aggregation generally entails like measures within a composite.
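To make the two composite approaches named above concrete, here is a minimal Python sketch; the measure names, rates, weights, and target are hypothetical illustrations rather than SAIL or IDS values.

    # Two common ways to combine like measures into a composite score:
    # a weighted average and binary all-or-nothing scoring.

    # Hypothetical facility-level pass rates for three related process measures.
    measures = {"aspirin_on_arrival": 0.97, "beta_blocker_at_discharge": 0.92,
                "smoking_cessation_advice": 0.88}
    weights = {"aspirin_on_arrival": 0.5, "beta_blocker_at_discharge": 0.3,
               "smoking_cessation_advice": 0.2}

    # Weighted average: each measure contributes in proportion to its weight.
    weighted_avg = sum(measures[m] * weights[m] for m in measures)

    # All-or-nothing: credit is given only if every component meets its target.
    target = 0.90
    all_or_nothing = all(rate >= target for rate in measures.values())

    print(f"Weighted-average composite: {weighted_avg:.3f}")
    print(f"All-or-nothing (all measures >= {target}): {all_or_nothing}")

As the sketch shows, the same inputs can yield a high weighted-average score while failing an all-or-nothing test, which is why the choice between the two standards matters.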
Though the formulas differ, Kaiser and Geisinger use a similar approach to weighting or
ascribing importance to measures for internal accountability. For incentive compensation, Kaiser
designates measures as: 1) maintain high performance; or 2) attain performance improvement.
Incentive pay is linked only to measures in the latter category for which there are improvement
targets. Kaiser and Geisinger generally are not using differential weights to score measures; rather, the weighting is used in compensation formulas (i.e., measure scores or composites are not combined into weighted summary scores).
In its value equation formula used for performance and compensation reviews, Geisinger assigns
weights to metrics based on the size of the gap between actual and targeted performance.
Intermountain Healthcare assigns differential weights to combine measures for certain composites; higher weights are assigned based on health impact. For example, to construct a complications composite, greater weight is assigned to higher-severity complication categories. Intermountain Healthcare links up to 20% of salary to progress on organization-wide goals, not to specific performance metrics.
CMS and private sector payment and recognition programs — like the Medicare Hospital Value-
Based Payment initiative — give weights to domain scores in computing global results (Exhibit
4). Similarly, NCQA, in its health plan accreditation scoring, applies domain weights: it assigns point values to each of its approximately 35 measures and to its accreditation standards, which equates to about 35% weight for clinical performance, 15% for patient experience, and 50% for accreditation standards. Though the IDS monitor external benchmarks, they
largely use internal benchmarks when setting performance targets. Kaiser, in a hybrid approach,
uses the higher of its internal benchmark or a national reference benchmark.
Measures are differentially weighted in the SAIL global scoring. Using a rule of equal weights for each of the domains, the point values within a domain are allocated proportionally among the domain measures, which leads to differential measure weights. This weighting formula results in relatively small weights for dimensions like patient experience and access, staff satisfaction, and ambulatory care clinical quality.

Exhibit 4. Medicare Total Performance Score
Domain               Weight
Clinical             20%
Patient Experience   30%
Outcome              30%
Efficiency           20%
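The dilution effect of this equal-domain-weight rule can be sketched in a few lines of Python; the measure counts per domain below are illustrative, not the actual SAIL configuration.

    # With equal domain weights, each measure's global weight is
    # (1 / number_of_domains) / measures_in_its_domain, so measures in
    # crowded domains carry less weight than measures in small domains.

    domains = {                      # hypothetical measure counts per domain
        "Acute Care Mortality": 2,
        "Avoidable Adverse Events": 5,
        "Performance Measures": 8,
        "Customer Satisfaction": 3,
    }
    domain_weight = 1.0 / len(domains)   # equal weight per domain

    for name, n_measures in domains.items():
        per_measure = domain_weight / n_measures
        print(f"{name:28s} {per_measure:.3f} per measure")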
3.3.5 Global Ratings
None of these IDS use a global, summary indicator; rather, the grouping of measures into clinical programs, or into categories defined by strategic priorities, is the apex of the performance measures hierarchy. Intermountain Healthcare emphasized the importance of having a direct link between
the measures used in its clinical program dashboards to measures used by frontline staff who
manage or deliver care. Geisinger has a similar emphasis on clinical program dashboards in lieu
of global indicators. However, these IDS systems are not accountable to the public, and
therefore, may not have the same need for summary scores and global indicators of performance.
Initially patterned after public reporting dashboards and tools developed and used in the private
sector, SAIL has evolved into a much more comprehensive and complex reporting system.
Moreover, data from SAIL are available to the public and provided to the U.S. Congress. In a
private sector hospital system measures used for improvement may not ever be visible to the
public. But in VA, even those internal measures intended to focus on improvement, are likely
discovered by congressional staffers or the media and made public, for their own purpose of
holding VA accountable. Therefore, they may become de facto accountability measures even
when that is not their original intent.
Kaiser is distinct in its use of a national measures repository that is a key, but not sole, source of quality measures and performance results used by regional and local leadership. The Kaiser entities integrate subsets of the KPQM measures with locally adopted measures, many of which are operational and financial indicators. As such, Kaiser seeks alignment across its regions on measures of strategic import, but the makeup of the overall performance dashboard is locally controlled.
Each IDS's overall performance is graded in various rating programs operated by payers, accreditors, and performance reporting programs. The IDS operate measurement systems they view as more comprehensive than these external programs, which enables them to meet or exceed such programs' performance thresholds. The California hospital rating program (CHART) produces a 5-category summary quality rating for each of its condition-specific topics and for several cross-cutting hospital composites, including patient experience. CHART computes its 5-category rating schema (Exhibit 5) by clustering hospital scores into 13 bands and then assigning the bands to one of the five rating categories. The clustering of hospital scores is based on 25th, 50th, and 90th percentile thresholds; these percentile benchmarks use the higher of statewide or national performance. Thus, the IDS performance information sets are inherently more actionable than SAIL, given their grouping of measures by clinical programs without a global rating.
Exhibit 5. California Hospital Performance Ratings (CHART)
3.4 Determine How SAIL is Assessed by Users in Terms of Relevance, Utility, and
Perceived Accuracy
Throughout the development of this report, the following overarching themes emerged in terms
of the relevancy, utility, and perceived accuracy of SAIL.
3.4.1 Relevance
Responses were mixed with regard to the overall relevance of SAIL in helping respondents facilitate improvement and manage their facilities. Respondents reported a limited connection between the data captured by SAIL and the local needs of facilities and VISNs for motivating and supporting improvement. It was also reported that the SAIL domains should align
more closely with all aspects included in the VHA Vision10 and the ten essential strategies to achieve the VHA mission.11
10 U.S. Department of Veterans Affairs. Veterans Health Administration: http://www.va.gov/health/aboutvha.asp
11 Veterans Health Administration. Blueprint for Excellence (2014): http://www.va.gov/HEALTH/docs/VHA_Blueprint_for_Excellence.pdf
Despite this area for improvement, it is important to consider that 70% of discussion responses
(24 out of 34) indicated that improvements were made in their facilities as a direct result of
SAIL, thus indicating that SAIL has been successful in facilitating and supporting improvement
in VAMCs across the nation.
3.4.2 Perceived Accuracy
While nearly all interviewees liked SAIL as a centralized dashboard to view performance, 70%
of respondents said that the SAIL star rating is perceived to represent quality of care inaccurately
in their medical center(s). Moreover, virtually all interviewees responded that they would change
the star rating system. A majority of concerns were related to the VAMC ranking comparison, as
several expressed an interest in being rated by comparison to non-VA facilities in their
community and to some established (external) benchmark of excellence. Interviewees with a
higher star rating were more likely to respond that SAIL accurately reflects the overall
performance at their medical center (r=0.49, p<.01).
3.4.3 Utility
Responses indicate that the lag time of the SAIL report (quarterly) limits its utility for supporting quality improvement; more recent data are more conducive to motivating change and monitoring improvements. Responses from interviewees consistently indicate that the efficiency measure is not useful. Users did not understand how they were supposed to use it to facilitate improvement, and they were less inclined to use it because the measure is updated only annually.
Respondents also reported a lack of understanding of, transparency in, and involvement in SAIL's weighting and scoring methods generally, which may feed the perception that star ratings are inaccurate, inappropriate, unfair, or misleading. Moreover, rankings and ratings based on relative performance, rather than on absolute achievement of a best practice target, have introduced a competitive environment that some observed to have a negative impact on how facilities use SAIL to share best practices and collaborate on improvement. A few respondents reported being turned away when they approached higher star-rated facilities to learn how those facilities achieved their scores. Overall, there was no significant difference in the use of SAIL among VISN Directors, Medical Center Directors, Chiefs of Staff, and Deputy Chiefs of Staff. Almost all interviewees noted that
they review it regularly in conjunction with other reports during executive leadership meetings.
Frequency of reviews ranged from weekly to quarterly, and several noted that their frequency for
reviewing the SAIL report has increased recently because it has become a part of their
performance plan.
This section concludes our qualitative analysis of the industry discussion, review of performance
measurement systems, and field discussions. The following section presents our empirical
analysis of the raw data provided by VHA.
4 EMPIRICAL ANALYSIS
4.1 Assess Whether the Data Elements within SAIL are Representative, Valid,
and Reliable Contributors to their Respective Domains
The Team's assessment of SAIL includes an examination of the measures used in SAIL, both to present readers with an informative empirical description of what SAIL is and to investigate patterns in the data that inform our assessment and recommendations.
We first calculated mean, standard deviation, median, percentiles, variance, and coefficient of
variation for all continuous variables, and frequency tabulations for all categorical variables. For
this, we examined:
• Raw values for each measure
• Unweighted standardized scores of measures (z-scores)
• Weighted z-scores of measures
• Weighted domain scores
• Overall weighted quality score, and
• Resulting star ratings (before applying the adjustments used in SAIL to add or subtract
stars using external criteria)
An important concern for measures used in SAIL is the extent to which VHA facilities score similarly, virtually indistinguishably, on any given measure. This is especially important for the purpose of making distinctions in relative performance by comparing VA facilities against each other. Indistinguishable scores can occur when there is a theoretical or practical limit to high or low performance and entities tend to amass near that limit, a situation referred to as a “topped-out” measure. To test for this, we compared the point scores at the 75th and 90th percentiles, and we calculated the “truncated coefficient of variation,” i.e., the coefficient of variation after removing the highest-scoring 5% and the lowest-scoring 5% of the population of medical centers. This is a technique used by CMS for selecting measures for value-based purchasing (Tompkins et al, 2008). Using these criteria, none of the examined measures were found to be topped out. Outliers (i.e., measures with extreme values, high or low, out of normal range) based on descriptive statistics were not identified.
We also performed F test and t tests to compare the overall quality score and each domain score
by the resulting star ratings.
Exhibit 6. Comparisons of Domain Scores by Star Rating
Acute Care Mortality
• Star 4 versus Star 3: mean values did not differ significantly (t=-0.30, p=0.77).
Avoidable Adverse Events
• Star 2 versus Star 1: mean values did not differ significantly (t=-0.68, p=0.50).
• Star 4 versus Star 3: mean values did not differ significantly (t=0.23, p=0.82).
Performance Measures
• Star 2 versus Star 1: mean values did not differ significantly (t=1.12, p=0.27).
• Star 5 versus Star 4: mean values did not differ significantly (t=0.62, p=0.53).
Length of Stay
• Star 5 versus Star 4: mean values did not differ significantly (t=-0.88, p=0.38).
Ambulatory Care Sensitive Conditions (ACSC) Hospitalizations
• Star 3 versus Star 2: mean values did not differ significantly (t=0.13, p=0.90).
• Star 4 versus Star 3: mean values did not differ significantly (t=1.51, p=0.13).
• Star 5 versus Star 4: mean values did not differ significantly (t=1.10, p=0.28).
Customer Satisfaction
• Star 2 versus Star 1: mean values did not differ significantly (t=1.62, p=0.11).
• Star 5 versus Star 4: mean values did not differ significantly (t=1.24, p=0.22).
CMS Measures (RSMR and RSRR)
• Star 2 versus Star 1: mean values did not differ significantly (t=0.71, p=0.48).
• Star 3 versus Star 2: mean values did not differ significantly (t=0.85, p=0.40).
• Star 5 versus Star 4: mean values did not differ significantly (t=1.00, p=0.32).
Access
• Star 2 versus Star 1: mean values did not differ significantly (t=1.60, p=0.11).
Mental Health
• Star 2 versus Star 1: mean values did not differ significantly (t=0.06, p=0.96).
• Star 5 versus Star 4: mean values did not differ significantly (t=-0.31, p=0.75).
Our findings indicate that although facilities with different star ratings differ significantly on overall quality score (F=292.45, p<.0001), they may not differ significantly on each domain score (Exhibit 6 and Exhibit 7). Essentially, facilities with different star ratings show consistent differences in overall quality, but are not distinguished consistently across their domain scores.
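A minimal Python sketch of these comparisons on simulated domain scores follows; the group means and sizes are arbitrary, not SAIL values.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Simulated domain scores for facilities grouped by star rating (1-5).
    groups = {star: rng.normal(loc=0.1 * star, scale=1.0, size=30)
              for star in range(1, 6)}

    # Overall F test: do mean domain scores differ across the five ratings?
    f_stat, f_p = stats.f_oneway(*groups.values())

    # Pairwise t tests for adjacent star ratings, as in Exhibit 6.
    for lo_star in range(1, 5):
        t_stat, t_p = stats.ttest_ind(groups[lo_star + 1], groups[lo_star])
        print(f"Star {lo_star + 1} vs Star {lo_star}: t={t_stat:.2f}, p={t_p:.2f}")
    print(f"Overall: F={f_stat:.2f}, p={f_p:.4f}")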
Exhibit 7. Comparisons of Domain Scores by Star Ratings
[Figure: SAIL domain scores plotted by star rating (star 1 through star 5); vertical axis from -1.5 to 1]
We then performed an internal consistency reliability analysis (Cronbach's alpha) to examine the measure-domain correlations and internal consistencies, i.e., how well measures correlate with their assigned domains. This analysis used z-scores of measures. The Cronbach's alpha coefficients ranged from 0.14 to 0.69 — specifically, 0.59 for Acute Care Mortality, 0.14 for Avoidable Adverse Events, 0.54 for CMS Measures, 0.44 for
Performance Measures, 0.49 for Customer Satisfaction, 0.69 for Access, and 0.69 for Overall Quality (25 measures). These findings indicate acceptable internal consistency (alpha of 0.60 and above, i.e., high measure-domain correlations) for the overall quality and access measures, but the lack of internal consistency for some domains (alpha below 0.60, i.e., poor measure-domain correlations; see Exhibit 8) signals the need to reaffirm the rationale for those domains' composition given their weaker psychometric properties.
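For reference, Cronbach's alpha can be computed directly from a facilities-by-measures matrix of z-scores; this sketch uses simulated correlated measures, not SAIL data.

    import numpy as np

    def cronbach_alpha(item_scores: np.ndarray) -> float:
        """Cronbach's alpha for an (n_facilities, n_measures) array of z-scores."""
        k = item_scores.shape[1]
        item_vars = item_scores.var(axis=0, ddof=1)
        total_var = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

    rng = np.random.default_rng(2)
    latent = rng.normal(size=(147, 1))                       # shared domain trait
    z = 0.7 * latent + rng.normal(scale=0.7, size=(147, 3))  # three correlated measures
    print(f"alpha = {cronbach_alpha(z):.2f}")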
Exhibit 8. Measure-domain correlation results summary
Acceptable measure-domain correlation (alpha > 0.60):
• Access
• Overall Quality
• Length of stay*
• ACSC Hospitalizations*
Unacceptable measure-domain correlation (alpha < 0.60):
• Acute Care Mortality
• CMS Measures
• Performance Measures
• Customer Satisfaction
• Avoidable Adverse Events
*The length of stay and ACSC hospitalizations domains each comprise a single measure. The mental health domain was excluded from this analysis.
Third, we performed Confirmatory Factor Analysis (CFA) to examine the current SAIL model structure (eight quality domains and the corresponding measures in each domain; the mental health domain was excluded due to lack of data for individual mental health measures), using z-scores of measures. Neither the factor model nor the linear equation model supports the current SAIL domain structure; that is, neither generates an acceptable model fit for eight domains (factors) (Exhibit 9).
Exhibit 9. Fit Summary of Confirmatory Factor Analysis
Fit Summary Factor Model Linear Equation Model
Chi-Square 598.8928 696.4999
Chi-Square DF 224 224
Pr > Chi-Square <.0001 <.0001
Standardized RMR (SRMR) 0.1125 0.1125
RMSEA Estimate 0.1186 0.1331
Bentler Comparative Fit Index 0.1832 0
Fourth, we performed an Exploratory Factor Analysis (EFA) to examine the underlying
relationships between the measures in the most recent SAIL data. EFA is often used when
developing a scale (e.g., star ratings) to identify the latent constructs (e.g. domains) that are
supported by the observed data. We conducted EFA on z-scores using Principal Components with Promax rotation. Applying the eigenvalue-greater-than-one criterion, we identified three factors as fitting the data and explaining 71% of the variance. Exhibit 10 presents the Promax rotated factor pattern (standardized regression coefficients); a factor loading equal to or greater than 0.30 was considered acceptable. The first factor could be called “access to care,” based on strong loadings from variables such as call-waiting times. The second factor loads heavily on mortality, readmission, and patient safety variables, suggesting it could be called “mortality and safety.” The third factor could be called “patient experience of care.” Healthcare associated infections was the only measure that did not load significantly on any of the three factors.
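A sketch of this EFA workflow follows, assuming the third-party factor_analyzer Python package; the input is simulated, so the recovered structure will be weak, but the steps (principal components extraction, Promax rotation, eigenvalue and loading screens) mirror the analysis described above.

    import numpy as np
    from factor_analyzer import FactorAnalyzer  # assumed third-party package

    rng = np.random.default_rng(3)
    X = rng.normal(size=(147, 24))  # simulated z-scores: 147 facilities x 24 measures

    # Principal-components extraction with Promax (oblique) rotation.
    fa = FactorAnalyzer(n_factors=3, rotation="promax", method="principal")
    fa.fit(X)

    eigenvalues, _ = fa.get_eigenvalues()   # eigenvalue-greater-than-one screen
    loadings = fa.loadings_                 # rotated factor pattern

    print("Factors with eigenvalue > 1:", int(np.sum(eigenvalues > 1)))
    # Flag salient loadings (|loading| >= 0.30), zeroing the rest for readability.
    print(np.where(np.abs(loadings) >= 0.30, np.round(loadings, 2), 0.0))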
Exhibit 10. Promax Rotated Factor Pattern of Current SAIL Data
Measures Measure Description Factor1 Factor2 Factor3
z3_callcenter Seconds to pick up calls 0.74872 -0.05303 -0.11837
z3_xaccess PCMH access domain composite 0.66286 -0.00644 0.20468
z3_telr2 Telephone abandonment rate 0.59059 0.15127 -0.07074
z3_pc11 Primary care new patient wait time 0.50786 -0.07969 0.01006
z3_sc13 Specialty care new patient wait time 0.47841 0.13064 0.05119
z3_bptw Best places to work 0.46992 -0.20337 0.26918
z3_sc12 Specialty care established patient wait times 0.36891 0.12194 -0.11065
z3_rnturnover Registered Nurse turnover rate 0.3103 -0.0435 -0.073
z3_smr30 30-day risk adjusted mortality 0.2274 0.60761 -0.23288
z3_smr In-hospital risk adjusted mortality 0.0129 0.58009 0.27151
z3_rsrrchf Risk standardized readmission rates, congestive heart failure (CHF) -0.17197 0.57194 0.14495
z3_rsrrpn Risk standardized readmission rates pneumonia -0.30834 0.40818 0.10451
z3_rsrrami Risk standardized readmission rates, acute myocardial infarction (AMI) -0.03709 0.39206 0.06939
z3_rsmrchf Risk standardized mortality rates CHF 0.17951 0.39079 -0.22804
z3_rsmrpn Risk standardized mortality rates pneumonia 0.11503 0.37704 0.09916
z3_psi Risk adjusted patient safety index (PSI) -0.0321 0.37362 -0.1675
z3_rComp In-hospital complications 0.06231 0.20227 0.1988
z3_alos Adjusted length of stay -0.09339 0.13147 0.62919
z3_hcahps Patient rating of overall hospital performance 0.40894 0.07693 0.46136
z3_mh12 Mental health new patient wait time 0.11285 -0.04446 0.41339
z3_oryx Inpatient core measures mean percentage -0.0613 -0.13619 0.39263
z3_acsc ACSC hospitalization -0.18882 0.06945 0.28955
z3_hedis Healthcare Effectiveness Data and Information Set (HEDIS) outpatient core measures mean percentage 0.15066 0.0064 0.22169
z3_hai Healthcare associated infections 0.05448 -0.04197 -0.0891
4.2 Evaluate the Methodology for the (“Star”) Ratings of Quality and their
Internal and External Validity
The current SAIL star-rating process generates star ratings for facilities based on their quality
summary scores, using a five-star scale. In general terms, this involves three major steps:
• Computing standardized (z) scores (deviations from the mean of all VHA medical centers) to measure the performance of each medical center on each measure
• Weighting those individual measures within and across their respective domains
• Choosing cut-off points along the resulting summary/average score to produce an ordinal scale numbered from one to five stars
Based on the distribution of summary scores, facilities scoring in the lowest 10 percent receive a 1-star rating; facilities scoring in the top 10 percent receive a 5-star rating; facilities scoring in the middle 40 percent of the distribution receive a 3-star rating; facilities scoring above the 10th percentile up to the 30th receive a 2-star rating; and facilities scoring above the 70th percentile up to the 90th receive a 4-star rating (note that 1-star facilities that exceed the commercial industry average are promoted to 2-star). This process for setting the four relative thresholds for the five performance categories relies on a norm-referenced interpretation of overall quality summary scores.
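The percentile-band assignment can be expressed compactly; this minimal Python sketch uses simulated summary scores and omits SAIL's external-criteria star adjustments.

    import numpy as np

    def assign_stars(summary_scores: np.ndarray) -> np.ndarray:
        """Map summary scores to 1-5 stars using relative percentile bands:
        bottom 10% -> 1 star, 10-30% -> 2, 30-70% -> 3, 70-90% -> 4, top 10% -> 5.
        (The promotion of 1-star facilities that exceed the commercial industry
        average is omitted here.)"""
        cuts = np.percentile(summary_scores, [10, 30, 70, 90])
        return 1 + np.searchsorted(cuts, summary_scores, side="right")

    rng = np.random.default_rng(4)
    scores = rng.normal(size=147)
    stars = assign_stars(scores)
    print(np.bincount(stars, minlength=6)[1:])  # counts of 1- to 5-star facilities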
The current star-rating methods carry a risk of misclassification — that is, the assigned star rating may not reflect a facility's true performance. This could occur if the individual measures or domain scores are not sufficiently reliable, introducing noise and error.
Still, the star ratings might be considered misleading for at least three reasons. First, two facilities can have the same star rating but significantly different patterns on individual measures and domains. For example, two medical facilities can currently both achieve a quality rating of three stars even though one facility's actual overall quality score is at the 39th percentile and the other's is at the 68th percentile, given that the range for three stars runs from the 39th to the 70th percentile. Second, two medical facilities can have different star ratings (such as 4-star and 3-star) even though their actual percentiles of overall quality scores are close but happen to fall just above and just below a threshold (such as the 71st and 69th percentiles). Third, a misleading inference would be that few stars necessarily mean poor quality relative to healthcare performance in a facility's local community, given the widely documented geographic variation in healthcare metrics.
Considering the problem of different underlying patterns for scores, we explored an alternative
approach for classifying data into subgroups, and performed Latent Profile Analysis (LPA). LPA
provides an advantage over other methods because it groups medical facilities based on their
scoring patterns within the data and then uses those patterns as independent variables (Muthén,
2001). LPA is a subject-centered (in this case, medical facility-centered rather than measure-centered), model-based cluster analytic approach. Unique model parameters are
estimated for each profile based on maximum likelihood estimation. Specifically, LPA estimates,
for each profile/class, the mean and variance for each measure, the probability that each facility
falls into each cluster, and the probability that any facility falls into a given class across all
facilities. It thus assigns medical facilities to profiles with the highest member probability.
Probabilities closer to one for a single profile/class, and closer to zero for the remaining classes, suggest good group assignment and distinct profiles/classes.
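LPA with continuous indicators is closely related to a Gaussian mixture model, so the mechanics can be sketched with scikit-learn; the data below are simulated, the report's actual LPA software is not specified, and a dedicated LPA tool would also provide statistics such as the sample-size-adjusted BIC.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(5)
    # Simulated z-scores: 147 facilities x 25 measures, with 3 latent groups.
    centers = rng.normal(scale=0.8, size=(3, 25))
    labels_true = rng.integers(0, 3, size=147)
    X = centers[labels_true] + rng.normal(scale=0.5, size=(147, 25))

    for k in (3, 4, 5):
        gm = GaussianMixture(n_components=k, covariance_type="diag",
                             random_state=0).fit(X)
        post = gm.predict_proba(X)      # per-facility class probabilities
        # Relative entropy, scaled to [0, 1]; higher = cleaner classification.
        ent = 1 + (post * np.log(np.clip(post, 1e-12, 1))).sum() / (len(X) * np.log(k))
        print(f"{k}-class: AIC={gm.aic(X):.0f}  BIC={gm.bic(X):.0f}  entropy={ent:.2f}")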
Several models were fit to the z-scores of each individual measure and to the z-score of the
mental health domain data, specifying three through five latent profiles. Models with different
numbers of profiles were compared using information criteria (IC)-based fit statistics. These
include the Bayesian Information Criteria (BIC), Akaike Information Criteria (AIC), and
Adjusted BIC. Lower values on these fit statistics indicate better model fit. The accuracy with
which models classify medical facilities into their most likely profile/class is examined. Entropy
is a type of statistic that assesses this accuracy, and can range from 0 to 1, with higher scores
representing greater classification accuracy. The exhibit below presents the Information Criteria,
Entropy, and Average Class Probabilities of LPA. As shown in Exhibit 11, the 3-class and 4-
class models achieve good fit.
The profile plot in Exhibit 12 graphically shows the latent class estimated means on the y-axis; the x-axis starts at zero and increases in units of one for each of the observed variables/measures. Class 1 has lower average scores on most measures, class 3 has higher average scores on most measures, and class 2 falls in between. This is in line with a 3-star rating scale. The 5-class model does not fit the data well, implying that the data may not support the 5-star ratings.
Exhibit 11. Latent Profile Analysis
Goodness-of-fit statistics for the 3-class to 5-class latent profile solutions
N=147
Fit statistics 3-Class 4-Class 5-Class
Log-likelihood -4264.142 -4192.269 -4215.968
AIC 8832.284 8790.537 8939.937
BIC 9286.83 9397.595 9699.507
SSA-BIC 8805.82 8755.193 8895.714
Entropy 0.92 0.91 0.95
AIC = Akaike Information Criteria; BIC = Bayesian Information Criteria; SSA-BIC = Sample-Size-Adjusted BIC
Class counts and proportions for the latent classes (based on estimated posterior probabilities), with average latent class probabilities
Three-class model 1 2 3
1, n = 20.9, 14.2% 0.984 0.016 0
2, n = 87.3, 59.4% 0.014 0.973 0.013
3, n = 38.8, 26.4% 0 0.035 0.965
Four-class model 1 2 3 4
1, n = 20.4, 13.9% 0.969 0.001 0.029 0
2, n = 21.2, 14.4% 0.007 0.936 0.053 0.004
3, n = 64.3, 43.8% 0.013 0.02 0.958 0.01
4, n = 41.1, 28.0% 0 0.005 0.032 0.963
Five-class model 1 2 3 4 5
1, n = 2.0, 1.4% 1 0 0 0 0
2, n = 18.3, 12.4% 0 0.965 0.035 0 0
3, n = 87.9, 60.0% 0 0.01 0.971 0.019 0
4, n = 38.8, 26.4% 0 0 0.021 0.979 0
5, n = 0, 0% 0 0 0 0 0
Exhibit 12. 3-Class Profile Plot
[Figure: latent class estimated means (y-axis) plotted across the observed measures (x-axis) for the 3-class solution]
5 RECOMMENDATIONS
Based on its independent assessment of SAIL, the Booz Allen Team provides recommendations
related to measurement system purpose, measures, hierarchy, scoring, and system management.
5.1 Measurement System Purpose — Accountability and Quality Improvement
The primary purpose(s) of a measurement system will influence the decisions made by system
developers. A recent Commonwealth Fund and Institute for Healthcare Improvement (IHI) issue
brief12 illustrated these differences:
• Accountability systems include measures designed for provider comparisons, include
risk-adjustments, and are collected retrospectively over long periods of time.
• Improvement systems include measures that capture local performance over time and are not risk-adjusted, providing information as to whether interventions or clinical care are producing the intended outcomes.
The purpose of a measurement system drives decisions in the selection of measures, the structure of domains, and the scoring methodologies. A primary issue that can underlie the prioritization of improvements to SAIL is the discrepancy between its original design as an improvement tool and its current use as an accountability tool. SAIL was originally developed as
“VA’s internal improvement tool which is designed to offer high-level views of health care
quality and efficiency.”13 A determination of SAIL’s primary purpose will drive VHA decisions
with regard to many design decisions and improvements. Accountability may take on a different
meaning when referring to a public institution like the VHA. In a private sector hospital system,
measures used for improvement may not ever be visible to the public. But in VA, even those
internal measures intended to focus on improvement are likely discovered by congressional
staffers or the media and made public, for their own purpose of holding VA accountable.
Therefore, they may become de facto accountability measures even when that is not their
original intent. The Booz Allen Team recommends VHA consider whether the primary purpose of the SAIL Value Model is improvement or accountability. It is possible that SAIL can be used for both purposes and redesigned accordingly. For example, existing measures within SAIL could be reorganized and presented by clinical area, aligning with a given service line's need for performance data to manage. This would provide end users with reports of measures that align with their locus of control. In addition, a subset of measures, given rigorous evaluation, could be designated and used for accountability purposes.
12 Clifford Marks, et al. Hospital Readmissions: Measuring for Improvement, Accountability, and Patients, Commonwealth Fund/Institute for Healthcare Improvement, Issue Brief, September 2013.
13 SAIL Value Model 20141210.pdf
An important purpose of any measurement system is to discern levels of performance and to
attach judgments to observed levels, such as high or low, adequate or inadequate, unsatisfactory
or exemplary. The benchmarks used to make such judgments can take one of two forms, scoring
according to mastery or scoring relative to peers.
When “scoring according to mastery,” an entity's performance is judged against predetermined values or external benchmarks, such as an indicated service that is always delivered or an external value drawn from some other population. For example, administering aspirin on
arrival to patients with acute myocardial infarction (AMI) generally is considered good clinical
practice. VHA may set a high bar, such as a 95% success rate, and label any hospital achieving
that level of performance as a high performer because it has “mastered” the process, and any
hospital failing to achieve that level as a low performer. In such a scenario, where the threshold
for high performance is determined by policy or external reference, hypothetically some or all
VHA facilities could be above (or below) that threshold rather than rated relative to their peers.
When “scoring relative to peers” (or grading on the curve), each entity is judged based on its
score within the distribution of all entities. Performance is judged relative to the mean value
calculated for all entities, or another reference point such as the median value or the 90th
percentile. For the most part, SAIL metrics are used in this way. Some VHA facilities will be
found to be high performers because they scored better than other facilities, which will be
labeled low performers. Z-scores are standardized representations of the location of a facility on
the distribution of scores on a measure. The five-star rating system grades VHA centers on the
curve, although a star can be added or subtracted from a center’s rating based on reference to
external benchmarks on specific measures or domains.
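The practical difference between the two benchmark forms can be seen in a short sketch; the measure, simulated rates, and thresholds here are hypothetical.

    import numpy as np

    rng = np.random.default_rng(6)
    aspirin_rates = rng.uniform(0.85, 0.99, size=147)  # simulated facility rates

    # Mastery scoring: judge each facility against a fixed external bar;
    # in principle, all facilities could pass (or fail).
    mastery_bar = 0.95
    high_by_mastery = aspirin_rates >= mastery_bar

    # Relative scoring: grade on the curve via z-scores, so a fixed share of
    # facilities is always "below average" regardless of absolute performance.
    z = (aspirin_rates - aspirin_rates.mean()) / aspirin_rates.std(ddof=1)
    high_by_curve = z >= np.percentile(z, 90)          # top decile only

    print(f"High performers by mastery: {high_by_mastery.sum()} of {len(z)}")
    print(f"High performers by curve:   {high_by_curve.sum()} of {len(z)}")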
One of the major themes that emerged from interviews with VHA and private industry leaders is
that benchmarking should consider both performance within VHA and external benchmarks
drawn from national and local results for non-VA facilities. Performance differences between
external benchmarks and VHA’s internal benchmarks should inform the adoption of benchmarks
to set performance goals or to establish thresholds used in categorizing performance (e.g., star
ratings). Local non-VA area benchmarks are of particular interest given Veterans’ options to use
competing healthcare providers. The Booz Allen Team recommends VHA clarify the scoring
approaches for mastery (predetermined values or external benchmarks) or relative
performance. This clarification would include examining the most appropriate scoring approach
by measure (i.e., some measures are more suited to mastery scoring) and considering the use of
the data by various stakeholder groups (i.e., senior leadership, VISN directors, VAMC directors,
frontline staff, Veterans, and other external groups).
In addition to comparing a medical center to other (similar) medical centers or to external
benchmarks, VHA can measure each facility against itself, i.e., its own historical value on a
measure such as a baseline reference point, the previous reporting period, or a trend line or
rolling average. Surely, there is merit for a medical center that performs better than most or all of
its peers; there also may be merit in making significant improvements over time. Accountability
systems can acknowledge either or both types of merit, i.e., achievement or improvement,
respectively. However, comparing VA facilities and ranking them within a narrow range of
variation is not useful for either accountability or improvement purposes. Consideration should be given, measure by measure, to whether a comparison or a benchmarking approach to scoring is more appropriate.
Whereas metrics can be used to judge whether an entity performs well, or has improved
significantly, asking a measurement system to inform entities how to improve can imply a
different set of demands. SAIL may do well informing entities how they rank compared to other
entities or external benchmarks, and which measures indicate the greatest performance gaps. That may be a great service to entities with respect to what to improve, but it says less about how to go about improving.
The Booz Allen Team recommends VHA clarify the specific role of SAIL to inform the
network about what versus how to improve. Integrated systems in our study built and
leveraged their own measurement systems to serve both roles. Thus, we observed that a single
coherent (but not necessarily simple) measurement system can be applied for the goals of
accountability and fostering improvement, but the approaches should be tailored to the different
uses. By itself, SAIL does not emulate the single coherent systems that are designed to
accomplish both goals. If SAIL is intended to work seamlessly with other measurement systems
within VHA, we observed only limited anecdotal evidence that the respective medical center
personnel experience useful integration.
Even beyond metrics, the integrated systems in our study married their measurement systems to
their own cultural “change model,” which identified the prerequisites and mechanisms for
driving improvement. In some but decisively not all cases, that meant linking performance to
financial compensation of senior executives or other staff. For example, financial incentives are secondary in Intermountain Healthcare's change model, while Geisinger emphasized their importance in its own. We did find passionate belief in framing
measures to guide processes linked to outcomes of interest, or in other words, applying measures
of how to perform well (input measures) as well as measures of success for the patient
(intermediate, and eventually ultimate outcomes).
The Booz Allen Team recommends VHA clarify its own change model(s), given its public
mission along with its constraints. This may dovetail with larger VHA systems of
accountability and compensation; however, for the present purpose, such clarification would help
to inform changes to the mission and makeup of SAIL. Specifically, this could facilitate
transparent communication of a vision for how the network can improve and what should be
captured and measured in SAIL that would be used to drive improvement. This would create a
system that equips staff, often front line staff who can most effect change, with the information
needed to manage and improve processes. Such clarification of the VHA change model could
involve at least two dimensions:
• Does VHA know/have consensus on the key processes that influence Veterans’
health/experiences?
• What are the strategies for driving change to better health/experiences (e.g., staff
configurations, medical homes, technology, process controls, integration with local
community resources etc.)? And, does VA have a teachable process that can be used
across facilities to diagnose the problem, to use analysis to develop solutions, to
implement change in a sustainable way, and to measure improvements?
The approaches and related measures chosen after such clarification may lead to enhancements
to SAIL such as to provide a menu of dashboards for different needs. Some measures could be
deployed to support processes; other measures may be constructed rigorously and refined to
support judgments and accountability for performance.
5.2 Measures
Measures used in SAIL, especially for accountability, need to be valid (accurately and fairly
measure what they purport to measure) and reliable (precisely and reproducibly discriminate true
performance differences). In this study, we examined measures and domains according to these
scientific criteria (e.g., suitability of domains; potential for topped-out measures).
The Booz Allen Team recommends VHA clarify and formalize a process for selecting and
managing measures with respect to their intended uses. This could include adopting measure
inclusion criteria to include dimensions such as: mission alignment, performance improvement
opportunity, reliability, and actionability. An initial part of this process may be to reconsider or
confirm the set of performance domains (see Section 5.3). Measures used to rank-order entities
should pass tests related to sufficient reliability and ability to contribute to discriminating
performance. In contrast, certain SAIL quality and safety measures may have little variance
across VHA facilities and have high absolute scores, but are still very important to track. Our
empirical analysis did begin to examine this at the domain and subdomain level and did not find
topped-out measures. However, our analysis did not examine the performance of every
component measure contained within SAIL.
In particular, it may be useful to screen each measure for possible “topped-out status.” Generally,
when the distribution of scores for most or all entities is concentrated in a narrow range, then
assigning relative ranks (percentiles or z-scores) that lead to labels such as “high (or low)
performer” can amount to making distinctions without meaningful differences. Mixing topped-
out measures into the relative scoring systems can detract from the ability to identify reliably true
differences in performance. More generally, better measures for relative ranking are those which
distinguish performance differences reliably and contribute to distinctions among entities in their
respective domain summary scores. Measures that are topped out, or nearly so, may be used in relation to external benchmarks (mastery) or for public reporting to inform and reassure stakeholders that facilities and VHA are maintaining their success on these measures.
Accountability measures can be partitioned into important measures with improvement aims that are core elements of the accountability structure, versus other important measures for which high performance has been achieved and that are monitored but are not core accountability metrics. Additional measures, and a different mix of measures, may be needed based on further assessment of how to support performance improvement and consideration of how to formalize measures of efficiency and value.
Despite its full name, “SAIL Value Model,” SAIL itself does little to link quality and cost measures into integrated measures of efficiency or value. Our interviews with VHA leaders found that 97% of them do not use SAIL's efficiency measure in any way to improve either operational efficiency or budget performance. The measure is not actionable and is not reported frequently; therefore, leaders do not see its utility, and the vast majority simply do not understand it. As such, the measure cannot reasonably be used for accountability purposes. In addition, it does not get at the central question of value: whether quality or outcomes at a facility are commensurate with its resource use.
The efficiency measure, constructed using stochastic frontier analysis, is an overarching, macro-level summary indicator that senior VHA executives can use to quickly compare resource use across facilities and frame further questions about efficiency. Information contained in the Efficiency Opportunity Grid (EOG), which is embedded in the SAIL package, is actionable and can lead to improvement.
VHA and SAIL may be poised to move forward in measuring efficiency and value, beyond
simple side-by-side displays, or parallel rather than integrated scoring methods involving
dimensions of quality and cost. For the most part, SAIL imports measures of relative cost
across entities, and links to multiple measures of resource use that reside in other parts of VHA.
In very tangible ways, the SAIL Team coordinates well with other parts of the agency that are
involved in formulating measures, and this is a case in point. Questions of great concern in
healthcare generally, and presumably VHA, have to do with distinguishing high-value spending
from low-value spending. For what patient conditions are services and costs too low, such as
underutilization of effective services? For what conditions are costs too high, such as excessive
volumes or inefficient use of high-cost alternatives? These questions relate to allocative
efficiency, i.e., the ability to properly steer resources away from waste and inefficient utilization
patterns, and toward their highest-value purposes.
SAIL has demonstrated this insight with the recent deployment of measures centered on a
particular set of conditions, i.e., mental health. This approach helps to organize thinking and
improvement around an identifiable set of patient cohorts, as well as the clinical and
administrative components of the VHA system that are responsible for those conditions and
patients. In other words, the information gets organized and targeted around quality and access
pertaining to certain conditions and patients; this approach may then lend itself to consideration
about resource use measures integrated to define and measure efficiency and value for mental
health services. Generalized to include other conditions and lines of service, medical centers
could embark on empirically-driven measurement and improvement goals pertaining to overall
efficiency and value by reallocating resources to best meet the needs of Veterans.
5.3 Measurement System Hierarchy
The Booz Allen Team recommends VHA clarify which one or more purposes it chooses to
support through a domain structure.
There can be several reasons for grouping measures into domains. One reason is to group measures that tend to explain the same underlying dimension of performance. Similar measures tend to reinforce each other when they track the common trait, and tend to cancel each other out when they exhibit “noise” or otherwise do not track the common trait of interest, i.e., the trait that defines the domain. Familiar examples come from psychology (hence this approach is called psychometric), such as using several questions (individual measures) to draw conclusions about levels of intelligence, certain personality traits, types of aptitude, or indications of morbidity. Grouping individual items for this purpose often uses factor analysis techniques, which attempt to align measures according to the underlying “factors” that may explain the concepts of interest.
Another somewhat related approach is to group measures into domains according to conceptual
similarity. Inpatient mortality rate is similar conceptually to all-cause 30-day post-discharge
mortality rate, but that does not mean necessarily that the two measures would tap the same trait
or dimension of quality. For example, inpatient mortality may be driven by poor technique,
unsanitary conditions, or deficiencies related to intensive care. In contrast, 30-day mortality may
be caused by inadequate patient education, community supports, adherence to medications, or
failure to reconcile contraindicated prescription patterns. Empirically it may be true that entities
tend to score higher or lower than others on both measures either because both measures do share
some causes (e.g. high infection rates), or perhaps because low performers may tend to be
suboptimal in many processes for many reasons.
Domain structures can also be selected to align with key business drivers or policy priorities,
such as clinical quality, patient access, patient experience, or employee experience. These
domains would be selected based on identifying the key areas where VA has to perform well in
to be successful.
A different approach to grouping measures is by condition, specialty, or line of service. VHA
might wish to set up dashboards that show results for various types of measures, individually and
collectively, that are calculated for selected patient cohorts. One organizing framework
recommended by the NQF is the patient-focused episode of care. For example, measures relevant
to heart failure patients can be calculated for those patients and shown to clinicians and
departments responsible for meeting the needs of heart failure patients.
Another type of cohort could be patients undergoing coronary artery bypass grafting (CABG).
Domains could include episodes that relate to the same type of physician specialties such as
orthopedics, mental health, etc. Under this rubric, the signals carried by individual measures
would directly reflect on the attributed providers, and could reinforce each other in
differentiating the relative performance of departments or medical centers as it relates to the lines
of service. An intrinsic advantage to using the episode framework is the relevance of quality and
cost to measuring relative efficiency and value across patient cohorts and lines of service. How
do entities compare on resources used to manage clinically similar patients? When costs are
above average for managing patients in similar clinical contexts, do quality measures support the
higher resource use, or do they seem to indicate inefficiencies?
Similarly, determination of relative value can occur when the scoring rules, particularly the weights given to individual measures, are calibrated properly to reflect the clinical or policy importance attributed to them by VHA. Applying more or less equal weights to individual
measures is a choice, and likely does not convey significant differences in importance or priority
given by policymakers or senior leadership, or even clinical staff. For example, patient
experience may be given a higher weight for some episodes in which perceived access and
communication are paramount to value, while clinical outcomes may have higher weights when
considering value in other contexts.
5.4 Scoring/Star Rating
SAIL applies weights to the measures inside domains and, in turn, to the respective quality domains to produce a summary score, which is expressed as a value from one to five stars. More specifically, performance on each measure is the standardized score using the z distribution (observed minus mean, divided by standard deviation). Thus, entities are given star ratings mostly based on rankings derived from cross-sectional comparisons of each medical center against others; comparisons are made within strata defined by the complexity of the facility (high, medium, or low).
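A minimal sketch of this within-stratum standardization, assuming a simple tabular layout (the data are simulated and the column names are hypothetical):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(7)
    df = pd.DataFrame({
        "facility": range(147),
        "complexity": rng.choice(["high", "medium", "low"], size=147),
        "measure": rng.normal(size=147),       # simulated raw measure values
    })

    # Standardize each facility's measure within its complexity stratum:
    # z = (observed - stratum mean) / stratum standard deviation.
    df["z"] = df.groupby("complexity")["measure"].transform(
        lambda s: (s - s.mean()) / s.std(ddof=1))

    print(df.groupby("complexity")["z"].agg(["mean", "std"]).round(2))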
Construction of total summary scores or star ratings is always methodologically problematic. However, for practical managerial reasons, high-level VHA executives need some way to regularly monitor whether a hospital or health system's performance is faltering and may require intervention to avoid subpar care.
VHA might also consider assigning responsibility for performance using a balanced
scorecard approach that drills down to frontline supervisors. Using SAIL as an improvement
strategy begins with educating those who have to execute it. Performance or improvement goals
should disseminate from top to bottom in a medical center. A rolled-down balanced scorecard methodology shows senior, mid-level, and frontline staff what their responsibilities are regarding performance improvement and empowers them to work with their care delivery teams to analyze appropriate data, consider options, take action, and track improvement. Both improvement and
accountability are spread across the organization and locus of control can be placed closest to the
patient.
The Booz Allen Team recommends considering some alternatives to creating fixed strata
(the three complexity levels) and simply limiting all comparisons within strata. One
alternative could be pooling data across all facilities and examining predictive margins (also
known as “recycled predictions”) in which robust estimates are determined of the incremental
differences in expected values related to the three respective complexity levels. Another
alternative would also take advantage of pooled data to develop comparison samples of patients from potentially all other facilities based on direct standardization techniques; in other words, create matched samples for comparison using each facility's unique mix of patients in the populations served. Still another alternative may be to calculate
expected measure or domain results using hierarchical statistical models, which could allow for
robust patient-level risk-adjustment along with facility or market characteristics that also can
affect measure results.
No matter which alternative is chosen going forward, it seems that these additional empirical
investigations of non-equivalence across facilities — across strata but possibly also within strata
— would be useful to demonstrate appropriate handling of such differences, and optimal fairness
in inferences made from the final summary scores. Thus, the Booz Allen Team builds on its
previous recommendation and recommends conducting empirical investigations of non-
equivalence across facilities — across strata but possibly also within strata.
Based on our review, the use of the star ratings for accountability may warrant further
investigation of potential technical concerns (domain weighting and risk of misclassification). If
star rating is to be used for accountability purposes with important consequences, the Booz Allen Team recommends further research to assess potential misclassification and to apply techniques that mitigate medical center misclassification error. While we do not have
evidence that the star rating is unreliable, additional analyses may clarify or improve the
reliability of star ratings over time. For example, the formation of summary scores out of
individual measures requires decisions about how much weight to give to each of the measures,
respectively, to produce a summary score that is valid for the purpose. There can be several
criteria used for determining such weights, including the statistical reliability of the individual
measures, empirically-derived relative importance based on some criterion such as the relative
impact on health, or preference weights that express relative policy importance or desire to focus
attention. Sometimes developers resort to an even simpler approach, giving equal weight to each measure, which passively allows the relative numbers of measures to drive the overall weight given to any particular concept. What can be lost in the process is an appreciation
of what effects differential weights may have on the resulting summary scores, and in turn, the
absolute or rank-order differences observed for medical centers. Therefore, whether it is to
explore different conceptual approaches to weighting schemes, or whether to investigate
alternative empirically-driven weights, we recommend that sensitivity analysis be used to reveal
the differences in the resulting scores and their implications that would result from the choices
about weights.
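One simple form of such a sensitivity analysis reweights the same z-scores and compares the resulting facility rank orders; this sketch uses simulated scores and an arbitrary alternative weight vector.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(8)
    z = rng.normal(size=(147, 10))          # simulated z-scores, 10 measures

    equal_w = np.full(10, 0.1)              # equal weights
    alt_w = rng.dirichlet(np.ones(10))      # one alternative weighting draw

    score_equal = z @ equal_w
    score_alt = z @ alt_w

    # How much does the facility rank order move under the alternative weights?
    rho, _ = spearmanr(score_equal, score_alt)
    shift = np.abs(np.argsort(np.argsort(-score_equal))
                   - np.argsort(np.argsort(-score_alt)))
    print(f"Spearman rank correlation: {rho:.3f}")
    print(f"Largest rank shift: {shift.max()} positions")

Repeating this over many candidate weight vectors would show how sensitive the absolute and rank-order results are to the weighting choices described above.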
Understandably, SAIL expects to track relative improvement over time for a given facility in
comparison to others. Such tracking may be undermined if the ratings are considered unreliable
or if methods can distort a facility's score. Furthermore, the empirical analysis conducted by the Booz Allen Team, detailed in Section 4.2 of this report, suggests that a three-class model may better fit the rating system. The Booz Allen Team recommends that SAIL consider shifting to a 3-star summary rating system.
5.5 Measurement System Management
It has appeared to the Booz Allen Team that the SAIL Team is genuinely open to suggestions
about measures or scoring approaches. At the same time, some findings suggest some perceived
gaps in communication in either influencing or using SAIL among other VHA staff. Perhaps
some staff in other parts of VHA have not engaged enough with SAIL to feel like effective users,
or effective contributors to SAIL as it exists. As the results of this assessment are considered in
VHA, and particularly as changes in SAIL are explored and implemented, there is perhaps a
corresponding opportunity to solicit input and “buy-in” from staff who may wish to influence the
direction of SAIL.
The Booz Allen Team recommends VHA engage internal stakeholders in adapting SAIL to
the needs of its users over time. This may include a formal SAIL stakeholder process in which
frontline staff, measurement system owners, and others are authorized to oversee the SAIL
measures set and decision support services. Such a mechanism could help to ensure that the
measures set serves the needs of each internal client.
The Booz Allen Team suggests VHA consider adopting continuous, rapid-cycle evaluation and improvement best practices into its management of SAIL. To undertake this initiative, the Team suggests that VHA establish internal measures for evaluating SAIL's impact, relevance, perceived accuracy, and utility, and develop a process for collecting pertinent data in these areas. We suggest that this performance data be reviewed regularly by leadership so that improvements can be made to SAIL over time, allowing SAIL to adapt to user needs and produce better long-term outcomes. The Team believes this approach would support performance excellence and allow SAIL to more adequately meet VHA's intended goals.
Beyond the formal stakeholder process described above, it could be beneficial to post proposed
additions to the measures set for public comment before deciding to adopt them, so that users
are more engaged in the process of changing SAIL. Overall, the Team believes that SAIL
should be a utility that serves the VISNs/VAMCs, and hence those stakeholders should oversee
the makeup of SAIL, the introduction of new measures, the refinement of decision-support
tools/training, and so on.
The Team and the field applaud the work of the SAIL Team in providing facility trainings when
requested, and recommend continuing such trainings, including executive-level training on
SAIL that instructs senior leaders not just on what the measures are, but also on how they should
or could be using them. Examples of collaborative (across-facility) approaches for improvement
based on SAIL metrics should be included in the training. There is, however, an opportunity to
provide medical centers with more robust resources for training and technical assistance. One
approach VHA might take is to develop a Learning and Action Network that provides
role-specific educational webinars and video shorts, as well as other user-friendly resources.
Users would benefit from being required to complete a set training curriculum that helps them
understand how they are expected to use SAIL, in their specific role, to facilitate improvement
in their facility. We also encourage including additional information on the methodology behind
the SAIL ratings and on how users can provide feedback.
6 CONCLUSION
The use of SAIL has resulted in several important changes to the way VHA, VISN, and VAMC
leadership have assessed VAMC care quality over the years. SAIL has gained widespread
attention from VHA leadership as an important tool for holding VAMCs accountable. As a
result, SAIL users have embraced the opportunity to provide feedback, and industry leaders in
the field of performance measurement have come together, all to assess the adequacy of SAIL
for evaluating performance, facilitating improvement, and, most importantly, holding VAMCs
accountable.
Overall, SAIL provides VA stakeholders with a visual tool for evaluating VAMC performance
and is recognized as a valuable VA-specific performance assessment dashboard that is both
needed and desired across VA settings. According to several responses from our Field Directors
Assessment, improvements in patient care have occurred as a direct result of using SAIL.
However, the Team believes opportunities exist to improve SAIL so it can become more specific
to both user needs and local settings, allowing users to better assess what to improve and how to
improve care in their facility. In addition, the Team believes that by focusing on improving this
tool, the quality of care provided to Veterans can reach a state of excellence.
This report has identified several overarching themes and recommendations for improvement.
These recommendations are provided to VHA in order to stimulate internal conversations among
VHA and SAIL leadership regarding next steps for improving SAIL.
APPENDIX A. METHODS
The Booz Allen Team was tasked with data collection under Task 6.2 (Discussion Paper),
Task 6.3 (Synopsis of Industry Best Practices on Measurement Systems), and Task 6.4 (Field
Directors Assessment) of the contract. These data collection efforts feed directly into the Final
Report (Task 6.5). The tasks are grouped by the qualitative and quantitative analyses the Team
used to arrive at recommendations addressing potential next steps for SAIL (Exhibit A-1). The
methodology of each of the earlier reports in this review process is summarized in the following
sections.
Exhibit A-1. Sources of Input for SAIL Final Report
Topic / Research Question | Data Collection / Input

Qualitative Analysis
  Assess the validity of SAIL as a tool to identify strengths and weaknesses of medical center performance | Field Directors Assessment; Synopsis of Industry Best Practices
  Critique whether the individual domains included within SAIL are the most salient for evaluating facility performance | Field Directors Assessment; Synopsis of Industry Best Practices
  Provide a detailed comparison of SAIL to other systems for assessing hospital performance that are presently used in the public and private sectors | Synopsis of Industry Best Practices
  Determine how SAIL is assessed by users in terms of relevance, utility, and perceived accuracy | Field Directors Assessment

Quantitative Analysis
  Assess whether the data elements within SAIL are representative, valid, and reliable contributors to their respective domains | Empirical analysis
  Evaluate the methodology for the ("star") ratings of quality and their internal and external validity | Empirical analysis
Discussion Paper
The purpose of this task was to develop a discussion paper of initial findings and gaps between
the committee report and industry practices. The Booz Allen Team reviewed the SAIL Review
Committee report,14 the SAIL Team response,15 and initial findings from an industry best
practice literature review in order to evaluate issues that have arisen as a result of SAIL
implementation. The Team focused on the measures, domains and reporting hierarchy, and
methodology and scoring.
14 The Strategic Analytics for Improvement and Learning (SAIL) Value Model Documentation, August 2014.
15 Response to the Report of the Committee to Review SAIL.
Synopsis of Industry Best Practices
To conduct the research for the Synopsis and to identify and review industry best practices
related to measurement systems for quality and safety, the Booz Allen Team used a four-step
approach: establishing research criteria, identifying key research questions, performing an
environmental scan, and synthesizing key findings. The Team collected data from three sources:
a review of measurement systems, a literature review, and industry expert discussions.
During the research process for the Synopsis, the Booz Allen Team reviewed industry
measurement systems for similarities across the industry. The Team developed an Excel data
collection template to organize the key elements and decision areas leaders must address when
developing and refining a measurement system. Using this template, team members reviewed
publicly available data on the following measurement systems: CMS Hospital Compare,
Leapfrog Group, Consumer Reports, Truven Health Analytics, Kaiser Permanente,
Intermountain Health, US News, and Health Grades. The Team also reviewed the SAIL Value
Model to collect comparable information on SAIL.16

16 On a few occasions, the Team received information that was not public (e.g., an updated SAIL Value Model from the SAIL Team).
The Booz Allen Team used the Google Scholar and PubMed search engines to complete the
literature review, which included peer-reviewed scholarly articles and grey literature from
industry reports, think tanks, not-for-profit associations, patient advocacy groups, and
government agencies. The literature search included studies published in English from 2004 to
the present, without geographic limitation, to capture a wide range of tested methods. The
literature review also included an assessment of access-to-care standards developed by insurance
regulators and agencies, such as state Medicaid programs. The Team searched for articles, then
reviewed abstracts to determine which articles to review in detail. The information extracted
from each article was mapped systematically against the research questions in a master
spreadsheet.
As a supplement to the literature review, the Booz Allen Team conducted eight stakeholder
interviews with leaders of private rating systems and health care delivery systems (Exhibit A-2).
The Team conducted these interviews to learn about other performance measurement systems
used in the field, to provide a more complete picture of existing industry best practices around
performance measurement, and to identify priority areas for measurement. These experts provided
robust comments and insights across the following nine sections of the Discussion Guide:
• Accountability and quality improvement
• Purpose
• Hierarchical structure of measures
• Scoring rules
• Comparisons/benchmarks and performance classification
• Ratings and data presentation
• Challenges
• Improvements
• Knowledge of VA population and geographies
The findings from the review of measurement systems, literature review, and industry expert
discussions are woven through the Final Report.
Field Directors Assessment
The objectives of the Field Directors Assessment were to: 1) determine how SAIL is judged by
users in the field in terms of relevance, utility, and perceived accuracy; 2) assess the validity of
SAIL as a tool for identifying strengths and weaknesses of medical center performance; and
3) determine whether the individual measures and domains included within SAIL are the most
salient for evaluating facility performance. In essence, day-to-day users of SAIL were
interviewed about their ability to comprehend SAIL metrics and translate the metrics into
actionable steps.
VHA provided the Booz Allen Team with a list of discussion candidates from which the Team
sent invitation emails to Medical Center Directors, Chiefs of Staff, Veterans Integrated Service
Network (VISN) Directors, and Program Office staffs, all diverse in both geography and facility
complexity score. One Deputy Chief of Staff was contacted in place of a Chief of Staff.
Exhibit A-3 identifies the breakdown of participants by title.

Exhibit A-2. Measurement System Industry Experts

Contact                        Organization
Patrick Conway, MD, MSc        Centers for Medicare and Medicaid Services
Melissa Danforth               The Leapfrog Group
Andy Amster                    Kaiser Permanente
Doris Peter, PhD               Consumer Reports
Bruce Spurlock, MD             Cynosure Health Solutions
Brent James, MD                Intermountain Healthcare
Thomas Graf, MD                Geisinger Health System
Jonathan Perlin, MD, PhD17     Hospital Corporation of America

17 Due to scheduling availability, the Booz Allen Team spoke with Dr. Perlin after completion of the Synopsis. His input is captured in the Final Report along with the other seven industry experts.
The informants represented a diverse cross-section of the VHA system. Their facilities ranged in
size from 85 to 1,621 beds. Interviewees represented 16 (32%) of the 50 U.S. states. Exhibit A-4
shows the star rating distribution for the facilities in the sample. The mean star rating for these
facilities was 3.2. The distribution of star ratings for this sample of interviewees was not
statistically different from the distribution of star ratings across the 128 VA Medical Centers
(VAMCs) and 19 ad-hoc facilities included in SAIL reports.
Exhibit A-4. Star Rating Distribution

Star Rating    N    %     Sample Distribution (%)   National Distribution (%)   P-Value
1              2    7     7.69                      7.81                        p=0.98
2              5    16    19.23                     21.88                       p=0.76
3              8    27    30.77                     40.63                       p=0.35
4              8    27    30.77                     19.53                       p=0.20
5              3    10    11.54                     10.16                       p=0.83
Excluded18     4    13    -                         -                           -
Total          30   100   -                         -                           -
Mean Rating    3.2  -     -                         -                           -
Median Rating  3    -     -                         -                           -
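The report does not state which statistical test produced the per-star p-values above. As a minimal sketch, assuming a two-sided exact binomial test of each star level's sample count against the corresponding national proportion (an assumption, not the documented method), the comparison might look like the following Python snippet:

```python
from scipy.stats import binomtest

# Hypothetical reconstruction: 26 rated facilities in the sample, compared
# star-by-star against the national star rating distribution.
national_props = {1: 0.0781, 2: 0.2188, 3: 0.4063, 4: 0.1953, 5: 0.1016}
sample_counts = {1: 2, 2: 5, 3: 8, 4: 8, 5: 3}
n_rated = sum(sample_counts.values())  # 26 facilities with star ratings

for star, count in sample_counts.items():
    result = binomtest(count, n_rated, national_props[star])
    print(f"{star}-star: observed {count}/{n_rated}, "
          f"national {national_props[star]:.2%}, p={result.pvalue:.2f}")
```

Under this assumption, a large p-value for every star level, as in Exhibit A-4, would indicate no detectable difference between the sample and national distributions.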
Exhibit A-5 presents the complexity level distribution for the facilities discussed. Complexity
level is determined based on the characteristics of the facility's patient population, clinical
services offered, educational and research missions, and administrative complexity. Level 1 is
the most complex and is subdivided into 1a, 1b, and 1c, with 1a being the most complex; Level 2
is moderately complex; and Level 3 is the least complex.
Exhibit A-3. Discussions Conducted by Title

Title                      N
Medical Center Director    15
Chief of Staff             14
Program Office             4
VISN Director              4
Deputy Chief of Staff      1
TOTAL                      38

Exhibit A-5. Complexity Distribution

Complexity Level    N    %
1a                  8    27
1b                  7    23.5
1c                  5    16
2                   7    23.5
3                   2    7
Excluded19          1    3
Total               30   100

18 No star rating available based on facility type.
19 No complexity rating available based on facility type.
Empirical Analysis
The Team performed an empirical analysis to assess whether the data elements within SAIL are
representative, valid, and reliable contributors to their respective domains and to evaluate the
methodology for the ("star") ratings of quality and their internal and external validity. To do so,
we used the most recently updated SAIL data, from the fourth quarter of fiscal year 2014
(FY2014Q4). These data included raw data on measures and domains, including numerators and
denominators, z-scores, weighted z-scores, and star ratings for 24 measures and nine quality
domains, covering 128 VAMCs and 19 non-acute care facilities. The efficiency z-score and
domain scores were not included in the current analysis.
We used these data to calculate the mean, standard deviation, median, percentiles, variance, and
coefficient of variation for all continuous variables, and frequencies for all categorical variables.
We performed F tests and t tests to compare the overall quality score and each domain score by
star rating. We performed internal consistency reliability analysis (using Cronbach's alpha
coefficients), examining measure-domain correlations and internal consistencies. We performed
Confirmatory Factor Analysis (CFA) to examine the underlying structure of the data and to
assess the current SAIL model structure. We also performed Exploratory Factor Analysis (EFA)
to investigate the factor structure of the most recent SAIL data. Finally, we performed Latent
Profile Analysis (LPA) to explore an alternative approach to classifying facilities into subgroups.
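As an illustrative sketch of two of these analyses, run on synthetic data with a hypothetical four-measure domain rather than the actual FY2014Q4 SAIL data, the following Python snippet computes Cronbach's alpha for one domain and, using a Gaussian mixture model as a common stand-in for LPA, fits one to five latent classes and compares them by BIC (lower is better):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic stand-in: z-scores for 128 facilities on 4 measures in one domain.
X = rng.normal(size=(128, 4))

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

print(f"Cronbach's alpha for the domain: {cronbach_alpha(X):.2f}")

# LPA-style class enumeration: fit mixture models with 1-5 classes and
# compare BIC to choose the number of facility subgroups.
for n_classes in range(1, 6):
    gmm = GaussianMixture(n_components=n_classes, random_state=0).fit(X)
    print(f"{n_classes} classes: BIC = {gmm.bic(X):.1f}")
```

On the real data, a three-class solution yielding the lowest BIC would be the kind of evidence behind the recommendation in Section 5.4 to consider a 3-star summary system.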
The main body of this report presents our high-level findings, followed by recommendations for
consideration based on all sources of information (previous reviews of SAIL, current industry
practices, VHA field experience, and empirical analysis).