validationofabehavioralhealthtreatmentoutcome ...the brief symptom inventory the brief symptom...

31
Validation of a Behavioral Health Treatment Outcome and Assessment Tool Designed for Naturalistic Settings: The Treatment Outcome Package David R. Kraus and David A. Seligman Behavioral Health Laboratories John R. Jordan Family Loss Project In 1994, the American Psychological Association and the Society for Psychotherapy Research convened a Core Battery Conference to develop a set of criteria for the selection of a universal core battery that could be used as a common outcome tool across all outcome studies. The Treat- ment Outcome Package (TOP) is a behavioral health assessment and out- come battery with modules for assessing a wide array of behavioral health symptoms and functioning, demographics, case-mix, and treatment satis- faction. It was developed to follow the design specifications set forth by the Core Battery Conference, but also to ensure the battery’s applica- bility to naturalistic treatment settings in which randomization may be impossible. In this article we discuss a number of studies that evaluate the initial psychometrics of the items that comprise the mental health symp- tom and functional modules of the TOP. We conclude that the TOP has an excellent factor structure, good test-retest reliability, promising initial con- vergent and discriminant validity, measures the full range of pathology on each scale, and has some ability to distinguish between behavioral health clients and members of the general population. © 2004 Wiley Periodi- cals, Inc. J Clin Psychol 61: 285–314, 2005. Keywords: treatment outcomes; assessment tests; test validation Editor’s Note: This article may represent a conflict of interest. Please see the comment by Karla Moras. The authors would like to thank Peter Arnett, David Barlow, Tim Brown, Tom Borkovec, two anonymous reviewers, and JoAnn Bouzan for their contributions and suggestions. Correspondence concerning this article should be addressed to: David R. Kraus, 50 Main Street, Ashland, MA 01721; e-mail: [email protected]. JOURNAL OF CLINICAL PSYCHOLOGY, Vol. 61(3), 285–314 (2005) © 2005 Wiley Periodicals, Inc. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/jclp.20084

Upload: others

Post on 23-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

  • Validation of a Behavioral Health Treatment Outcomeand Assessment Tool Designed for NaturalisticSettings: The Treatment Outcome Package

    David R. Kraus and David A. SeligmanBehavioral Health Laboratories

    John R. JordanFamily Loss Project

    In 1994, the American Psychological Association and the Society forPsychotherapy Research convened a Core Battery Conference to developa set of criteria for the selection of a universal core battery that could beused as a common outcome tool across all outcome studies. The Treat-ment Outcome Package (TOP) is a behavioral health assessment and out-come battery with modules for assessing a wide array of behavioral healthsymptoms and functioning, demographics, case-mix, and treatment satis-faction. It was developed to follow the design specifications set forth bythe Core Battery Conference, but also to ensure the battery’s applica-bility to naturalistic treatment settings in which randomization may beimpossible. In this article we discuss a number of studies that evaluate theinitial psychometrics of the items that comprise the mental health symp-tom and functional modules of the TOP. We conclude that the TOP has anexcellent factor structure, good test-retest reliability, promising initial con-vergent and discriminant validity, measures the full range of pathology oneach scale, and has some ability to distinguish between behavioral healthclients and members of the general population. © 2004 Wiley Periodi-cals, Inc. J Clin Psychol 61: 285–314, 2005.

    Keywords: treatment outcomes; assessment tests; test validation

    Editor’s Note: This article may represent a conflict of interest. Please see the comment by Karla Moras.The authors would like to thank Peter Arnett, David Barlow, Tim Brown, Tom Borkovec, two anonymousreviewers, and JoAnn Bouzan for their contributions and suggestions.Correspondence concerning this article should be addressed to: David R. Kraus, 50 Main Street, Ashland, MA01721; e-mail: [email protected].

    JOURNAL OF CLINICAL PSYCHOLOGY, Vol. 61(3), 285–314 (2005) © 2005 Wiley Periodicals, Inc.Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/jclp.20084

  • Mandates for accountability data from behavioral healthcare purchasers and accreditingbodies have led to an industry-wide surge in outcome evaluations in naturalistic, real-world settings. This rapid growth of outcome measurement presents a tremendous oppor-tunity to conduct rigorous, low-cost experimental research using large samples and toimprove treatment quality through accurate measurement and appropriate feedback. How-ever, many of these opportunities are predicated on the existence of a core outcomebattery that meets the needs of both clinicians and researchers (Borkovec, Echemendia,Ragusea, & Ruiz, 2001). The most noteworthy effort to develop the standards and selec-tion criteria for a core battery comes from the 1994 Core Battery Conference (referred toas Conference) organized by the Society for Psychotherapy Research and the AmericanPsychological Association (Horowitz, Lambert, & Strupp, 1997). The Conference con-cluded that a Universal Core Battery (UCB) that would work across all diagnostic cat-egories and levels of care was required, and that more focused, diagnostic-specific batteriesshould supplement the UCB as needed in individual studies. Table 1 summarizes theUCB development and selection criteria.

    The Treatment Outcome Package (TOP) was designed to measure outcomes in nat-uralistic settings and developed using these Conference UCB guidelines. Our purposehere is to present a number of studies that evaluate the TOP in light of the psychometricrequirements set forth by the Conference.

    Beyond the Conference criteria summarized in Table 1, several other criteria areimportant in judging the appropriateness of an outcome tool because they affect thechances of widespread adoption in naturalistic settings: the inclusion of case-mix (alsoknown as moderator or risk-adjustment) variables (Goldfield, 1999), minimal or absentfloor and ceiling effects, and real-time reporting. Our rationale for including each of themis discussed below.

    Variables that are beyond the control of the therapeutic process, but nonethelessinfluence the outcome, are defined as case-mix variables (Goldfield, 1999). Naturalistic

    Table 1Universal Core Battery Requirements

    Core Battery Conference Criteria for a Universal Core Battery

    Not bound to specific theoriesAppropriate across all diagnostic groupsMust measure subjective distressMust measure symptomatic statesMust measure social and interpersonal functioningMust have clear and standardized administration and scoringNorms to help discriminate between patients and nonpatientsAbility to distinguish clients from general populationInternal consistency and test-retest reliabilityConstruct and external validitySensitive to changeEasy to useEfficiency and feasibility in clinical settingsEase of use by clinicians and relevance to clinical needsAbility to track multiple administrationsReflect categorical and dimensional dataAbility to gather data from multiple sources

    Source. (Horowitz et al., 1997).

    286 Journal of Clinical Psychology, March 2005

  • research typically lacks the experimental controls used in efficacy research to mitigateagainst the need to statistically control for (or measure) these case-mix variables.1 With-out measuring and controlling for such variables, comparing or benchmarking naturalis-tic datasets can be quite misleading. Hsu (1989) has shown that even with randomization,when the samples are small, the chances of a “nuisance” (case-mix, e.g. AIDS) variablebeing disproportionately distributed across groups is not only common, but very likely(in some cases exceeding 90%). Therefore, without extensive case-mix data, results havelimited administrative value in real-world settings. These data need to be used to disag-gregate and/or statistically adjust outcome data to produce fair and accurate benchmark-ing. Furthermore, a major, and essential purpose of naturalistic outcome research is toassess the generalizability of tightly controlled efficacy research to real-world settings. Inorder to discover the populations to which the efficacy results can generalize, case-mixmust obviously be measured.

    If an outcome tool is to be truly applicable across all diagnostic groups and consumerpopulations, it needs to demonstrate that it can measure the full spectrum of pathology.Since this is rarely discussed in convergent validation samples (cf. Foa, Kozak, Salkov-skis, Coles, & Amir, 1998), we believe analysis of floor and ceiling effects should be aseparate psychometric requirement. For an outcome tool to be widely applicable (espe-cially for populations like the seriously and persistently mentally ill) it must accuratelymeasure the full spectrum of the construct, including its extremes. The use of measuresthat cannot do this is comparable to the use of a basal body thermometer (with a built-inceiling of only 1028) to study air temperature in the desert. On a string of hot summerdays, one might conclude that the temperature never changes and stays at 1028. For apsychiatric patient who scores at the ceiling of the tool but actually has much more severesymptomatology, the patient could make considerable progress in treatment, but still bemeasured at the ceiling on follow-up. Incorrectly concluding that a client is not makingclinically significant changes can lead to poor administrative and clinical decisions. Floorand ceiling effects of the TOP are discussed in Study 4 of this article.

    Although not part of an outcome tool per se, near-real-time delivery of results toclinicians is imperative. The Conference hints at this by noting that the tool and its resultsshould be clinically useful. Similar to psychological testing, outcome assessment dataand their reports need to be fed back to clinicians in a timely manner so that the resultscan be integrated into treatment planning, evaluation, and diagnostic formulations. Onlyby delivering reports that facilitate the treatment process can a system meet clinicians’needs and win their buy-in. For both the client and the clinician, the purpose of partici-pation in naturalistic outcome studies must be first and foremost to help the client toimprove and only secondarily to assist a research project. Therefore, an outcome tool, its

    1In randomized controlled trials, risk adjustment is typically neither necessary nor done; yet in naturalisticoutcomes it is essential. In randomized controlled trials, certain case-mix variables are seen as so powerful thatthey are typically controlled by making them part of strict inclusion and exclusion criteria (e.g. co-morbidmedical conditions). Strict inclusion and exclusion criteria in efficacy research limit the variance in importantcase-mix variables. Further mitigating the effects of case-mix variables, random assignment increases thechances that other uncontrolled variables are evenly distributed across control conditions.However, most naturalistic research studies lack one or both of the methods that tend to homogenize thesamples (random assignment and strict inclusion exclusion criteria). The samples in naturalistic outcome mea-surement are often quite heterogeneous and require larger Ns, disaggregation, and/or statistical techniques tocontrol for sample differences. Consequently, measurement tools like the Beck Depression Inventory or SCL-90that have been used successfully for decades in efficacy research are simply insufficient (by themselves) fornaturalistic outcome measurement because these tools lack case-mix variables, making risk-adjustment or dis-aggregation impossible. Measurement of symptoms, functioning, and general distress must be augmented byextensive assessment of case-mix.

    Validation of the Treatment Outcome Package 287

  • processing system, and report structures must deliver useful and immediate feedback.The mean time it takes for a report to be returned to the provider after a TOP is completedis 16 minutes.

    The Treatment Outcome Package

    Since a battery meeting all of the above criteria did not exist, a decision was made todevelop a new battery. Additional rationale for creating a new instrument included thedemand for a royalty-free set of modules, creation of one common Likert scale for allclinical questions, elimination of duplicate items across measures, and control over therights to modify and add questions in the future.

    Initial development of these modules consisted of the first author generating morethan 250 a-theoretical items that spanned diagnostic symptoms and functional areas iden-tified in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV; American Psychiatric Association, 1994). All DSM-IV Axis I diagnostic symptomswere reviewed and those symptoms that the first author thought clients could reliably rateon a self-report measure were formulated into questions. Many assessment tools werealso reviewed for item inclusion, but most were based on theoretical constructs inconsis-tent with DSM-IV symptomatology.

    These questions were then presented to other clinicians for their review and editing.They made suggestions for modifications and deletions, based on relative importance andclarity of items. Clients were administered initial versions of the questionnaires and askedfor feedback as well. Questions were reworded based on feedback, and items that wereless important or appeared to measure a similar symptom were eliminated. The tool wasthen revised and re-introduced for feedback. The instrument presented here is the resultof four iterations of that process.

    The current version of the TOP is a battery of distinct modules that can be adminis-tered all together or in combinations as needed. Expert clinical and client review in thedevelopment process ensured adequate face validity. The various modules of the TOPinclude:

    • Chief complaints

    • Demographics

    • Treatment utilization and provider characteristics

    • Comorbid medical conditions and medical utilization

    • Assessment of life stress

    • Substance abuse

    • Treatment satisfaction

    • Functioning

    • Quality of life/Subjective distress• Mental health symptoms

    The present study focuses on the 93 items of the TOP designed to measure function-ing, quality of life, and mental health symptoms since these scales are directly related tothe Conference criteria for a UCB. In this article six separate studies are included, eachimpacting upon UCB criteria. Study 1 determines the factor structure of the TOP toensure a theory free tool with a solid foundation derived from extensive patient popula-tions spanning all levels of care. Study 2 provides preliminary information on the test-retest reliability of the TOP scales determined in Study 1. Study 3 explores the discriminant

    288 Journal of Clinical Psychology, March 2005

  • and convergent validity, comparing the TOP to other standard assessment and outcometools in the industry. Study 4 explores floor and ceiling effects of the TOP that are criticalfactors in ensuring the scales applicability to diverse clinical populations and its sensi-tivity to change. Study 5 specifically explores the scale’s sensitivity to change, whileStudy 6 explores the tool’s ability to distinguish patients from nonpatients.

    Instruments

    The following instruments were used in the present studies to test the validity of the TOP.

    The Beck Depression Inventory

    The Beck Depression Inventory (BDI; Beck, Steer, & Ranieri, 1988) is a 21-item self-report scale used to assess cognitive and physical symptoms of depression. It has beenused extensively in psychological research with numerous populations and psychiatricdisorders. The BDI has been a central outcome tool used in depression efficacy researchand was specifically recommended as a good example of the measurement of depressionby the Core Battery Conference. Mean internal consistency across patient and nonpatientsamples is .86 (Beck, Steer, & Garbin, 1988), and there is good validation data with othermeasures of depression like the Hamilton Psychiatric Rating Scale for Depression (.73),the Zung Self-Reported Depression Scale (.76), and the MMPI Depression Scale (.76)(Groth-Marnat, 1990).

    The Brief Symptom Inventory

    The Brief Symptom Inventory (BSI; Derogatis, 1975) is a 53-item version of the Symp-tom Checklist-90-Revised (SCL-90-R; Derogatis, 1977). The SCL-90-R succeeded theHopkins Symptom Checklist (HSCL), which was used as part of the core outcome bat-tery developed in 1970 by the National Institute for Mental Health (Waskow, 1975). TheBSI is a self-report scale that has been used to assess a broad array of psychiatric symp-toms. It has been used extensively in psychological research across numerous popula-tions and psychiatric disorders (Flynn, 2002; Trabin, Freeman, & Pallak, 1995). Its scalesinclude Somatization, Obsessive–Compulsive, Interpersonal Sensitivity, Depression, Anx-iety, Hostility, Phobic Anxiety, Paranoid Ideation, and Psychoticism. The internal consis-tency of BSI scales ranges from .71 (Psychoticism) to .85 (Depression). Except forsomatization (.68), test-retest reliabilities are good (.78 to .85) (Derogatis, 1975). TheBSI’s validity has been extensively tested against the SCL-90-R, MMPI (Hathaway &McKinley, 1989), and many others.

    The Minnesota Multiphasic Personality Inventory-2

    The MMPI-2 (Hathaway & McKinley, 1989) is a 567-item self-report instrument used toassess personality characteristics. The original MMPI was also part of the core outcomebattery developed by NIMH in 1970 (Waskow, 1975). The MMPI-2 has been extensivelyused in psychological research across numerous populations and psychiatric disorders.Reliability of MMPI-2 scales is mixed with certain scales (5, 6, 9) showing unacceptablelevels of internal consistency, which is further supported by the lack of unidimensionalscales in factor analysis. Other scales (especially 1, 7, 8, 0) have good internal consis-tency and will be emphasized in the analyses below (Graham, 1993).

    Validation of the Treatment Outcome Package 289

  • The BASIS 32

    The BASIS 32 (Eisen, Grob, & Klein, 1986) is a 32-item self-report instrument designedto measure clinical outcomes in inpatient facilities. As evidenced by its inclusion by mostoutcome software vendors, it has emerged as one of the most widely used naturalisticoutcome tools (Trabin et al., 1995), and studies have documented its utility in outpatientsettings as well (Eisen, Wilcox, Leff, Schaefer, & Culhane, 1999). The BASIS 32 has fivescales (Depression and Anxiety, Relation to Self and Others, Psychosis, Impulsive andAddictive Behavior, Daily Living and Role Functioning) and one summary scale (Over-all Score). Test-retest reliability of the BASIS-32 total score was .85 with specific sub-scales’ reliabilities for Relation to Self and Others at .80; Daily Living Skills and RoleFunctioning at .81; Depression and Anxiety at .78; Impulsive and Addictive Behavior at.65; and Psychosis at .76 (Eisen, Dill, & Grob, 1994). Correlating scores with hospital-ization status and the tool’s ability to discriminate between diagnostic groups have assessedthe tool’s validity.

    The SF-36

    The SF-36 (Ware & Sherboume, 1992) is a 36-item self-report measure of general healthstatus. It has eight scales and two summary scales (Mental and Physical). As evidenced byits inclusion by most outcome software vendors, the SF-36 has also emerged as one of themost widely used psychiatric outcome scales in naturalistic settings. It has documented sat-isfactory reliability and validity in both psychiatric and medical settings. Internal consis-tency of SF-36 subscales is generally reported to exceed .80 (Jette & Downing, 1994; Garratt,Ruta, Abdalla, Buckingham & Russell, 1993; Ware, 1996). Its usefulness as a general out-come tool has been tested on a wide variety of mental health and general medical patients(e.g. Brazier et al., 1992; McHorney, Ware, & Raczek, 1993; Wells et al., 1989).

    Study 1: Factor Structure

    In this section, we describe the exploratory factor analysis (EFA) and confirmatory factoranalyses (CFA) of the TOP’s internal structure.

    Method

    For this study, 93 mental health symptom, functional, and quality of life items adminis-tered to a large sample of newly admitted psychiatric clients were analyzed. Participantswere instructed to rate each question in relation to “How much of the time during the lastmonth you have a . . .” All questions were answered on a 6-point Likert frequency scale:1 (All ), 2 (Most), 3 (A lot), 4 (Some), 5 (A little), 6 (None).

    The sample consisted of 19,801 adult patients treated in 383 different behavioralhealth services across the United States who completed all questions of the TOP at intakeas part of standard treatment. Age, sex, and years of education for the samples are sum-marized in Table 2. Breakdown of service facility types is presented in Table 3.

    Procedure

    The sample was split into five random subsamples (split sample 1, 2, 3, 5, n’s � 3,960 andsample 4, n � 3,961) as a cross-validation strategy. Samples 1 and 2 were used to develop

    290 Journal of Clinical Psychology, March 2005

  • Table 2Participants

    M SD

    Study 1: Factor Analysis19,801 Participants

    Age 33.3 12.1Education 12.4 3.7% Women 51%

    Study 2: Test-Retest53 Participants

    Age 38.6 15.4Education 11.1 6.2% Women 63%

    Study 3: Discriminant and Convergent Validity312 Participants

    Age 41.2 16.2Education 14.0 3.6% Women 65%

    Study 4: Floor and Ceiling Effects216,642 Participants

    Age 33.8 13.8Education 12.0 4.1% Women 58%

    Study 5: Sensitivity to Change20,098 Participants

    Age 33.1 14.2Education 12.1 4.3% Women 60%

    Study 6: Criterion Validity1,034 Participants

    Age 46.3 17.5Education 14.9 3.4% Women 74%

    Table 3Psychiatric Services Included in Studies 1, 4, and 5

    Number of Facilities in Each Study

    Service Type Study 1 Study 4 Study 5

    Long-term inpatient locked units 3 10 4Acute short-term inpatient locked units 21 39 26Acute short-term inpatient unlocked units 11 16 12Partial hospitalization programs 20 70 51Crisis stabilization/respite programs 10 13 8Crisis/emergency evaluation 27 34 26Outpatient milieu programs (e.g., day treatment) 21 61 27Outpatient therapy programs 183 379 206Outpatient assessment and referral services 5 24 5Community living/supported housing 12 32 14Residential programs 55 130 84Employee assistance programs 2 2 2Unknown 13 115 46

    Validation of the Treatment Outcome Package 291

  • and refine a factor model that was subsequently confirmed in samples 3–5. All sampleshad no missing data.

    Sample 1 was used to develop a baseline factor model. Responses to the 93 items(detailed in Table 4) were correlated, and the resulting matrix was submitted to principal-components analysis (PCA) followed by correlated (Direct Oblimin) rotations. The opti-mal number of factors to be retained was determined by the criterion of eigenvalue greaterthan 1 supplemented by the scree test and the criterion of interpretability (Cattell, 1966;Tabachnick & Fidell, 1996). Items that did not load on at least one factor greater than0.45, and factors with fewer than three items were trimmed from the model.

    Sample 2 was then used to develop a baseline measure of acceptability in a Confir-matory Factor Analysis (CFA) and revise the model using fit diagnostics in AMOS 4.0(Arbuckle & Wothke, 1999). Goodness of fit was evaluated using the root mean squareerror of approximation (RMSEA) and its 90% confidence interval (90% CI; cf. MacCal-lum, Browne, & Sugawara, 1996), comparative fit index (CFI), and the Tucker-Lewisindex (TLI). Acceptable model fit was defined by the following criteria: RMSEA (�0.08, 90% CI � 0.08), CFI (� 0.90), and TLI (� 0.90). Multiple indices were usedbecause they provide different information about model fit (i.e., absolute fit, fit adjustingfor model parsimony, fit relative to a null model); used together these indices provide amore conservative and reliable test of the solution (Jaccard & Wan, 1996). Most of therevised models were nested; in these situations, comparative fit was evaluated by �2

    differences tests ~xdiff2 ) and the interpretability of the solution.The final model that resulted from sample 2 exploratory procedures was then com-

    paratively evaluated in three independent CFAs (samples 3–5) using the criteria above.

    Results

    Based on the above criteria, 16 factors were extracted and reviewed from the EFA dataset.Both orthogonal and oblique rotations were explored, with only minor differences foundin factor loadings. With the assumption that these factors are related to each other, theoblimin rotation was chosen. Five factors were dropped due to insufficient number ofloading items. Other items were also trimmed due to insufficient loadings. In total, 41 ofthe 93 items were dropped and the final 52 items were again analyzed with obliquerotations. The final model produced 11 factors accounting for 63% of the variance and ispresented in Table 5.

    This model was then tested using structural equation modeling in an initial CFA(Sample 2). Although this model did meet the conservative multiple-index fit criteria(Table 6), fit diagnostics indicated that the model could be improved. Through exploringall possible sources of strain (potential cross loadings, method effects, over- or under-factoring, and minor factors), a series of steps were taken to improve the model, nowusing the CFA framework in an exploratory fashion. With each modification, the xdiff2 wassignificant ( p � 0.001). During this process, four additional items (1, 4, 29, 55) wereeliminated from the model due to relatively low factor loadings and the item having morethan one correlated error with items on other factors. In these cases, dropping the itemfrom the model improved overall model fit. The model was also improved by freeing tenitems to crossload on other factors. Standardized regression weights for these cross-loaded items ranged from 0.132 to 0.315 with a mean of 0.200. The crossloading itemscan be seen in Table 4. Finally, eight correlated errors were mapped into the model. Twowere mapped due to item juxtaposition (90–91, 47– 48), four were mapped due to itemcontent similarity (68–70, 64– 66, 56– 60, 23–26), and two for both reasons (24–25,

    292 Journal of Clinical Psychology, March 2005

  • Table 4Original 93 TOP Items With Final Primary and Secondary Factor Loading

    Item Item Wording Final Model

    1 Been satisfied with your physical abilities

    2 Felt a lack of closeness or contact with others

    3 Been satisfied with your relationships with others LIFEQ

    4 Been satisfied with your sleep

    5 Been satisfied with your daily responsibilities LIFEQ

    6 Been satisfied with your sex life

    7 Been satisfied with your general mood and feelings LIFEQ

    8 Been satisfied with how you cope with daily problems

    9 Been satisfied with your life in general LIFEQ

    10 Had trouble telling others your feelings or needs

    11 Felt that others were not responding to your feelings or needs

    12 Felt too much conflict with someone SCONF

    13 Been emotionally hurt by someone SCONF

    14 Felt someone else had too much control over your life SCONF

    15 Felt too dependent on others

    16 Had trouble falling asleep SLEEP

    17 Had nightmares SLEEP, PSYCS

    18 Awakened frequently during the night SLEEP

    19 Had trouble returning to sleep after awakening in the night SLEEP

    20 Felt tired during the day

    21 Slept too much or at unwanted times

    22 Had conflicts with others at work or school regardless of fault WORKF

    23 Missed work or school for any reason WORKF

    24 Not been acknowledged for your accomplishments WORKF

    25 Had your performance criticized WORKF

    26 Not been excited about your work or school work WORKF

    27 Spent too much time working

    28 Yelled at someone

    29 Broken or damaged things in anger

    30 Physically hurt someone else or an animal VIOLN

    31 Had desires to seriously hurt someone VIOLN

    32 Had thoughts of killing someone else VIOLN

    33 Felt that you were going to act on violent thoughts VIOLN, SUICD

    34 Felt no desire for, or pleasure in, sex SEXFN

    35 Had sexual thoughts you did not want to have

    36 Felt sexually incompatible with your partner or frustrated by the lack of a partner SEXFN, SCONF

    37 Felt emotional or physical pain during sex SEXFN

    ~continued !

    Validation of the Treatment Outcome Package 293

  • Table 4Continued

    Item Item Wording Final Model

    38 Been aroused by things that felt unacceptable

    39 Had trouble functioning sexually (having orgasms, etc.) SEXFN

    40 Felt shaky or trembled

    41 Had a racing heart PANIC

    42 Felt light-headed PANIC

    43 Frequently urinated

    44 Had shortness of breath PANIC

    45 Been startled (by a touch or by someone entering the room)

    46 Felt nauseous, had diarrhea or other stomach or abdominal pains

    47 Had a dry mouth or trouble swallowing (“a lump in your throat”) PANIC

    48 Had sweaty hands (clammy) or cold hands or feet PANIC

    49 Felt restless, keyed up, or on edge

    50 Had muscle pain, including back, neck, or headache pain

    51 Felt down or depressed DEPRS

    52 Felt easily irritated or annoyed

    53 Felt little or no interest in most things DEPRS

    54 Felt hopeless

    55 Felt nervous or anxious

    56 Felt guilty DEPRS

    57 Felt angry

    58 Felt restless DEPRS, MANIA

    59 Wanted to be alone

    60 Felt worthless DEPRS, SUICD

    61 Had to do something to avoid anxiety or fear (washing hands, etc.)

    62 Felt shy or inhibited

    63 Felt tired, slowed down, or had little energy DEPRS

    64 Worried about things DEPRS

    65 Had trouble concentrating or making decisions DEPRS

    66 Noticed your thoughts racing ahead DEPRS, MANIA

    67 Been too talkative

    68 Inflicted pain on yourself SUICD

    69 Felt rested after only a few hours of sleep MANIA, PSYCS

    70 Thought about killing yourself or wished you were dead SUICD, DEPRS

    71 Planned or tried to kill yourself SUICD

    72 Avoided certain situations due to fear or panic

    73 Felt emotionally numb to something that would normally cause intense feelings

    74 Felt you were better than other people MANIA, WORKF

    ~continued !

    294 Journal of Clinical Psychology, March 2005

  • 65– 66). All modifications to the model were made based on both strain indices and theconceptual interpretation of the findings.

    Samples 3, 4, and 5 were used to validate the final model developed with Sample 2,and showed excellent and consistent model fit criteria across all indices. Taken together,there is strong support for the stability and strength of these factors. Results are summa-rized in Table 6 and demonstrate excellent model fit with no significant strains. Thefactor names, Cronbach’s alphas to assess internal consistency, and intercorrelations arelisted in Table 7.

    Discussion

    The TOP was designed to assess a broad range of behavioral health functional and symp-tom domains. The factor analysis presented here revealed eleven stable and clinicallyuseful TOP subscales with excellent confirmatory modeling in large samples of diversepatients. One factor (Quality of Life) incorporates questions about how often the clienthas felt satisfied with various areas of his or her life (e.g. “been satisfied with your life ingeneral”). Three other factors include functional questions and are labeled: Work Func-tioning, Sexual Functioning, and Social Conflict. The other seven factors hold symptom

    Table 4Continued

    Item Item Wording Final Model

    75 Felt on top of the world MANIA

    76 Felt panic in places that would be hard to leave if necessary

    77 Had a large appetite or little or no appetite

    78 Had trouble with your memory

    79 Felt others were working against you

    80 Had no time for yourself

    81 Felt responsible for your troubles

    82 Worried that someone might hurt you PSYCS, SCONF

    83 Felt detached from what was really happening

    84 Been unable to talk to at least one other person about your problems

    85 Had unwanted thoughts or images PSYCS

    86 Worried about going crazy

    87 Done something without thinking of the consequences

    88 Felt people or events kept you from achieving your goals

    89 Felt confused, in a fog, or dazed

    90 Seen or heard something that was not really there PSYCS

    91 Felt someone or something was controlling your mind PSYCS

    92 Forced yourself to throw-up food

    93 Had difficulty remembering personal information (important life events orperiods of time)

    Note. Dropped items.

    Validation of the Treatment Outcome Package 295

  • Table 5EFA Pattern Matrix

    Component

    Item DEPRS VIOLN WORKF LIFEQ SLEEP SEXFN SCONF SUICD MANIA PSYCS PANIC

    64 .64956 .62255 .618 �.26466 .59565 .58858 .57851 .57753 .55660 .53563 .45531 .81732 .76933 .75730 .73129 .60626 .69922 .66025 .65123 .64924 .6453 .7475 .7369 .6811 .6447 �.257 .63519 �.86418 �.85916 �.7984 .531 .56617 �.493 �.27439 .77634 .68937 .67736 .66613 �.74612 �.72014 �.70671 .90468 .74070 .73874 .76175 .74569 .45391 �.74690 �.69182 �.60985 .322 �.50344 �.77042 �.69747 �.67741 �.66748 �.627

    Note. Only loadings greater than 0.25 are shown. Dropped items.

    296 Journal of Clinical Psychology, March 2005

  • items and include: Depression, Panic, Psychosis, Suicidal Ideation, Violence, Mania, andSleep. Cronbach’s alphas were used as one estimate of scale reliability and were adequatefor all factors, with the exception of Mania. Its lower internal consistency may be due tothe nature of the items that load onto it, in which extreme scores at either end may beviewed as unhealthy (symptoms of mania or depression), while scores in the middle maybe viewed as healthy. Despite its questionable internal consistency, Mania was retained asa factor because of its clinical importance and acceptable test-retest reliability (see Study2 below).

    Through this method of developing the factor structure of the TOP, it is clear thatmany clinically interesting questions have been dropped (as compared to previous ver-sions of the tool). If it becomes clear from additional research and clinician feedback thatthese questions are valuable, it may be important to return these items with additionalquestions from the same construct and develop additional factors for inclusion in futureversions. In other words, these questions may have been related to important clinicalconstructs for which insufficient items were available to form reliable factor structures.

    While the TOP includes many questions about both the physiological and cognitivecomponents of anxiety, just the physiological symptoms of anxiety loaded on a separatefactor, which we labeled Panic. Some cognitive symptoms of anxiety (e.g., worried aboutthings, noticed your thoughts racing ahead, felt restless) loaded on the Depression factor,a finding consistent with the literature (Barlow, Bach, & Tracey, 1998; Eisen, Grob, &Klein, 1986).

    Finally, space limitations prevent an adequate review of factor invariance analyses ofthe TOP factors in this large clinical population. Analyzing whether people in differentdemographic and clinical populations show similar patterns of responses is an importantdiscussion to which an entire article could be devoted.

    Study 2: Test-Retest Reliability

    In this section, we report on the test-retest reliability of the TOP. Another measure ofreliability, internal consistency, was presented in Study 1.

    Method

    In 1998, 53 behavioral health clients were recruited by four community mental healthcenters to participate in a one-week test re-test study. All clients were Medicaid enrolleeswho completed the Treatment Outcome Package one week apart while they were waiting

    Table 6CFA Validation

    CFA Description N DF TLI CFI RMSEARMSEAUpper

    Sample 2 initial Derived from EFA model 3,960 1218 .898 .906 .045 .046Sample 2 final Modified model 3,960 1007 .945 .951 .033 .034Sample 3 Confirmatory analysis 1 3,960 1007 .940 .946 .035 .036Sample 4 Confirmatory analysis 2 3,960 1007 .942 .948 .034 .035Sample 5 Confirmatory analysis 3 3,961 1007 .940 .947 .035 .036

    Validation of the Treatment Outcome Package 297

  • Tabl

    e7

    Fac

    tor

    Cor

    rela

    tion

    Mat

    rix

    and

    Test

    -Ret

    est

    Rel

    iabi

    liti

    es

    Fac

    tor

    Des

    crip

    tion

    DE

    PR

    SV

    IOL

    NS

    CO

    NF

    LIF

    EQ

    SL

    EE

    PS

    EX

    FN

    WO

    RK

    FP

    SY

    CS

    PAN

    ICM

    AN

    ICS

    UIC

    D�

    Intr

    acla

    ssTe

    st-R

    etes

    t

    DE

    PR

    SD

    epre

    ssio

    n1.

    00.9

    3.9

    3V

    IOL

    NV

    iole

    nce

    0.33

    1.00

    .81

    .88

    SC

    ON

    FS

    ocia

    lC

    onfl

    ict

    0.55

    0.33

    1.00

    .72

    .93

    LIF

    EQ

    Qua

    lity

    ofL

    ife

    �0.

    78�

    0.24

    �0.

    451.

    00.8

    5.9

    3S

    LE

    EP

    Sle

    epF

    unct

    ioni

    ng0.

    640.

    260.

    41�

    0.50

    1.00

    .86

    .94

    SE

    XF

    NS

    exua

    lF

    unct

    ioni

    ng0.

    510.

    210.

    38�

    0.41

    0.36

    1.00

    .69

    .92

    WO

    RK

    FW

    ork

    Fun

    ctio

    ning

    0.55

    0.43

    0.53

    �0.

    410.

    370.

    341.

    00.7

    2.9

    0P

    SY

    CS

    Psy

    chos

    is0.

    660.

    550.

    42�

    0.46

    0.51

    0.42

    0.50

    1.00

    .69

    .87

    PAN

    ICP

    anic

    0.73

    0.33

    0.43

    �0.

    520.

    590.

    430.

    460.

    671.

    00.8

    3.8

    8M

    AN

    ICM

    ania

    �0.

    260.

    11�

    0.09

    0.37

    �0.

    12�

    0.09

    0.01

    0.05

    0.04

    1.00

    .53

    .76

    SU

    ICD

    Sui

    cida

    lity

    0.44

    0.44

    0.26

    �0.

    330.

    270.

    230.

    360.

    610.

    38�

    0.02

    1.00

    .78

    .90

    298 Journal of Clinical Psychology, March 2005

  • for outpatient treatment to begin. Age, sex, and years of education for the sample aresummarized in Table 2.

    Results

    The stability of the TOP over time was assessed by computing intraclass correlationcoefficients using a one-way random model. Except for MANIA, all reliabilities for sub-scales (factors presented in Study 1) were excellent (see Table 7), ranging from .87 to .94.Mania’s reliability was acceptable, but considerably lower at .76.

    Discussion

    Results of Study 2 revealed that the TOP has good test-retest reliability for all subscalescores for the sample chosen. However, the sample used (53 outpatients waiting fortreatment to begin), was chosen largely because it was easy to obtain. Ideally, it would beimportant to assess test-retest reliabilities for all types of populations and levels of care.However, for some levels of care with acute or high-risk clients, it is difficult or impos-sible to ethically obtain a sample waiting for treatment. One potential source of partici-pants representing a more severe population that might ethically be recruited would be ahomeless population with previous severe psychiatric histories who are currently refus-ing treatment. Until more diverse and larger samples are collected, all that may be statedis that the TOP seems to have good test-retest reliability among outpatients awaitingclinical treatment. The use of intraclass correlation in this analysis demonstrates that notonly is the rank-ordering of clinical severity in patients similar from one week to the next,but so is the actual level of severity within patients.

    Study 3: Discriminant and Convergent Validity

    In this section, we evaluate the discriminant and convergent validity of the factors devel-oped in Study 1. Important to the testing of the validity of a measure is the testing ofwhether the measure correlates highly with other variables with which it should theoret-ically correlate (convergent validity), and whether it does not correlate significantly withvariables from which it should differ (discriminant validity). The validity instrumentschosen for this study were selected because of their acceptable psychometrics and prom-inence in the field. Because the instruments were chosen before the factors from Study 1emerged, a few factors do not have ideal convergent validity measures. In evaluating theresults, it should be noted that there is no item overlap between the TOP and any of thevalidity measures—if such overlap did exist, it might artificially inflate the convergentcorrelations.

    Method

    Study 3 included 312 participants. Age, sex, and years of education for the sample aresummarized in Table 2. Ninety-four participants were from the general population, 123were from an outpatient clinical population, and 95 were from an inpatient clinical pop-ulation. All participants completed the TOP and one or more validity questionnaires,outlined as follows: 110 completed the BASIS 32 (51 general population, 23 outpatient,and 36 inpatient), 80 completed the SF-36 (43 general population, 3 outpatient, and 34inpatient), and 69 completed the BSI, BDI, and MMPI-2 (69 outpatient). Ideally, all

    Validation of the Treatment Outcome Package 299

  • patients would have completed all measures, however attempting this may have repre-sented too large a burden for many participants. That all patients did not complete allmeasures should be considered in interpreting the results. Specifically, it suggests thatdifferences between validity scales in the relative magnitude of correlations may be dueto sample differences (or tool reliability differences) rather than to differences in truerelationships among the constructs.

    All clients signed informed consent and were recruited through customers of BHL.During a 2-month period in 1996, the first author attempted to recruit all newly admittedpatients within the first 24 hours of admission to three inpatient psychiatric and substanceabuse units in a Boston area hospital. Outpatient clinicians who agreed to participate inthe study attempted to recruit all new admissions during a 6-month period during 1997.General population samples were recruited by clinicians from BHL sites during 1997 byasking friends and acquaintances (nonfamily) to participate in the study.

    Procedure

    Validity scales were used to evaluate discriminant and convergent validity of the TOP.The specific measures used for both are detailed in the results section below. Becausesome of the TOP factors are not normally distributed, both Pearson and Spearman corre-lations were analyzed and reviewed. No significant differences were found between thetwo methods and just the Pearson correlations are presented below.

    Results

    The construct validity of the TOP was assessed by correlating each TOP measure witheach validity measure’s score. The entire correlation matrix is presented in Table 8. Becausethe study design did not call for all clients to complete all measures, it is impossible toevaluate the relative strength of correlations between some of the validity measures andthe TOP. Therefore, these differences are not discussed.

    As discussed below, inspection of the correlations indicated that the TOP measuresgenerally showed the expected relationships with other relevant self-report measures ofpsychiatric symptoms and functioning. In most cases, convergent coefficients were quitehigh for each validity measure.

    Measuring depression, the TOP Depression (DEPRS) scale should show convergentrelationships with other measures of the same construct. These are: the BDI (.92), MMPI-Depression (.73), BSI-Depression (.90), BSI-Anxiety (.70), BASIS32-Depression/Anxiety (.86), the SF36-Mental Health (.82), and the SF-36-Vitality (.68) measures. Allof these correlations were quite high. By contrast the TOP Depression scale should notcorrelate with MMPI-Mania (�.23), or the MMPI-Schizophrenia (.24) measures.

    Measuring violence and temper, the TOP Violence (VIOLN) scale was expected tocorrelate with the BSI-Hostility scale (.77). A similar, but not identical construct is tappedby the BASIS32-Impulsive (.69). It was not expected to correlate with MMPI-Hypochondriasis (�.16) or BSI Somatization (�.13).

    Measuring interpersonal functioning and conflict, the TOP Social Conflict (SCONF)scale was expected to correlate with the BASIS32-Relationship to Self and Other (.60),SF36-Social Functioning (�.35), MMPI Social Introversion (.37), BSI-Paranoid (.72),and BSI-Interpersonal Sensitivity (.44). It was not expected to correlate with BSI-OCD(�.24), or MMPI-Psychasthenia (�.04).

    Measuring quality of life and subjective distress, the TOP Quality of Life (LIFEQ)scale was expected to correlate with SF36-Vitality (�.57), SF36-Mental Health (�.68),

    300 Journal of Clinical Psychology, March 2005

  • Tabl

    e8

    Cor

    rela

    tion

    sB

    etw

    een

    TOP

    and

    Vali

    dity

    Scal

    es

    DE

    PR

    SV

    IOL

    NS

    CO

    NF

    LIF

    EQ

    SL

    EE

    PS

    EX

    FN

    WO

    RK

    FP

    SY

    CS

    PAN

    ICM

    AN

    ICS

    UIC

    D

    BD

    I�

    .92*

    *(6

    6)�

    .41*

    *(6

    6)�

    .62*

    *(6

    5).7

    1**

    (63)

    �.5

    2**

    (67)

    �.4

    2**

    (57)

    �.3

    7**

    (63)

    �.5

    0**

    (67)

    �.4

    9**

    (65)

    �.0

    8(6

    5)�

    .60*

    *(6

    5)

    MM

    PI

    HS

    �.0

    9(6

    6)�

    .27*

    (66)

    �.1

    4(6

    5).1

    9(6

    3)�

    .11

    (67)

    .00

    (57)

    �.3

    4**

    (63)

    �.0

    2(6

    7)�

    .30*

    (65)

    �.0

    1(6

    5)�

    .05

    (65)

    D�

    .73*

    *(6

    6)�

    .27*

    (66)

    �.4

    6**

    (65)

    .60*

    *(6

    3)�

    .43*

    *(6

    7)�

    .31*

    (57)

    �.2

    5*(6

    3)�

    .47*

    *(6

    7)�

    .30*

    (65)

    �.1

    1(6

    5)�

    .42*

    *(6

    5)H

    Y�

    .29*

    (66)

    �.1

    6(6

    6)�

    .25*

    (65)

    .29*

    (63)

    �.4

    2**

    (67)

    �.1

    8(5

    7)�

    .21

    (63)

    �.1

    7(6

    7)�

    .13

    (65)

    �.2

    1(6

    5)�

    .15

    (65)

    PD

    �.4

    1**

    (66)

    �.0

    5(6

    6)�

    .38*

    *(6

    5).5

    9**

    (63)

    �.0

    3(6

    7)�

    .33*

    (57)

    �.1

    6(6

    3)�

    .22

    (67)

    �.1

    7(6

    5).1

    9(6

    5)�

    .27*

    (65)

    PA�

    .51*

    *(6

    6)�

    .19

    (66)

    �.3

    8**

    (65)

    .43*

    *(6

    3)�

    .14

    (67)

    �.3

    7**

    (57)

    �.2

    4(6

    3)�

    .36*

    *(6

    7)�

    .32*

    *(6

    5)�

    .08

    (65)

    �.3

    7**

    (65)

    PT

    .03

    (66)

    .14

    (66)

    .00

    (65)

    .05

    (63)

    �.0

    1(6

    7)�

    .14

    (57)

    .11

    (63)

    .00

    (67)

    .20

    (65)

    .04

    (65)

    �.0

    4(6

    5)S

    C�

    .24

    (66)

    �.0

    5(6

    6)�

    .28*

    (65)

    .28*

    (63)

    �.2

    0(6

    7)�

    .27*

    (57)

    �.2

    7*(6

    3)�

    .28*

    (67)

    �.1

    3(6

    5)�

    .06

    (65)

    �.2

    2(6

    5)M

    A.2

    3(6

    6).1

    5(6

    6).1

    3(6

    5)�

    .23

    (63)

    �.1

    0(6

    7).0

    7(5

    7).0

    3(6

    3).1

    8(6

    7).1

    6(6

    5).4

    3**

    (65)

    .18

    (65)

    SI

    �.2

    8*(6

    4)�

    .20

    (65)

    �.3

    7**

    (64)

    .30*

    (63)

    �.1

    7(6

    5)�

    .13

    (56)

    �.2

    6*(6

    2)�

    .24

    (66)

    �.1

    2(6

    4)�

    .14

    (64)

    �.0

    8(6

    4)

    BSI D

    epre

    ssio

    n�

    .90*

    *(6

    6)�

    .42*

    *(6

    6)�

    .61*

    *(6

    5).6

    6**

    (63)

    �.4

    6**

    (67)

    �.3

    6**

    (57)

    �.3

    7**

    (63)

    �.5

    0**

    (67)

    �.5

    0**

    (65)

    �.1

    6(6

    5)�

    .69*

    *(6

    5)P

    sych

    otic

    ism

    �.6

    3**

    (66)

    �.2

    7*(6

    6)�

    .53*

    *(6

    5).5

    0**

    (63)

    �.3

    6**

    (67)

    �.2

    4(5

    7)�

    .46*

    *(6

    3)�

    .72*

    *(6

    7)�

    .60*

    *(6

    5)�

    .18

    (65)

    �.4

    6**

    (65)

    Som

    atiz

    atio

    n�

    .30*

    (66)

    �.1

    3(6

    6)�

    .47*

    *(6

    5).4

    3**

    (63)

    �.2

    2(6

    7)�

    .29*

    (57)

    �.3

    2**

    (63)

    �.3

    9**

    (67)

    �.7

    2**

    (65)

    .02

    (65)

    �.3

    0*(6

    5)H

    osti

    lity

    �.4

    1**

    (66)

    �.7

    7**

    (66)

    �.4

    1**

    (65)

    .17

    (63)

    �.3

    4**

    (67)

    �.1

    2(5

    7)�

    .32*

    *(6

    3)�

    .37*

    *(6

    7)�

    .18

    (65)

    �.1

    8(6

    5)�

    .51*

    *(6

    5)P

    hobi

    c�

    .52*

    *(6

    6)�

    .25*

    (66)

    �.5

    2**

    (65)

    .45*

    *(6

    3)�

    .27*

    (67)

    �.2

    1(5

    7)�

    .40*

    *(6

    3)�

    .56*

    *(6

    7)�

    .82*

    *(6

    5)�

    .18

    (65)

    �.3

    5**

    (65)

    OC

    D�

    .29*

    (66)

    �.1

    6(6

    6)�

    .24

    (65)

    .31*

    (63)

    �.1

    0(6

    7)�

    .10

    (57)

    �.3

    1*(6

    3)�

    .44*

    *(6

    7)�

    .19

    (65)

    .01

    (65)

    �.2

    1(6

    5)A

    nxie

    ty�

    .70*

    *(6

    6)�

    .20

    (66)

    �.4

    2**

    (65)

    .54*

    *(6

    3)�

    .38*

    *(6

    7)�

    .30*

    (57)

    �.2

    3(6

    3)�

    .36*

    *(6

    7)�

    .41*

    *(6

    5)�

    .14

    (65)

    �.4

    7**

    (65)

    Inte

    rper

    sona

    l�

    .38*

    *(6

    6)�

    .15

    (66)

    �.4

    4**

    (65)

    .35*

    *(6

    3)�

    .27*

    (67)

    �.2

    1(5

    7)�

    .41*

    *(6

    3)�

    .45*

    *(6

    7)�

    .60*

    (65)

    �.1

    0(6

    5)�

    .23

    (65)

    Par

    anoi

    d�

    .58*

    *(6

    6)�

    .28*

    (66)

    �.7

    2**

    (65)

    .45*

    *(6

    3)�

    .44*

    *(6

    7)�

    .35*

    *(5

    7)�

    .47*

    *(6

    3)�

    .47*

    *(6

    7)�

    .42*

    *(6

    5)�

    .08

    (65)

    �.3

    8**

    (65)

    Bas

    is32

    Rel

    Sel

    fO

    ther

    �.8

    4**

    (110

    )�

    .58*

    *(1

    10)

    �.6

    0**

    (36)

    .71*

    *(1

    10)

    �.5

    4**

    (110

    )�

    .28*

    *(1

    05)

    �.5

    6**

    (102

    )�

    .61*

    *(1

    10)

    �.6

    8**

    (110

    )�

    .15

    (109

    )�

    .64*

    *(1

    10)

    Dai

    lyR

    ole

    �.8

    2**

    (110

    )�

    .57*

    *(1

    10)

    �.6

    5**

    (36)

    .73*

    *(1

    10)

    �.4

    6**

    (110

    )�

    .24*

    (105

    )�

    .51*

    *(1

    02)

    �.6

    3**

    (110

    )�

    .67*

    *(1

    10)

    �.1

    2(1

    09)

    �.6

    7**

    (110

    )D

    ep/A

    nx�

    .86*

    *(1

    10)

    �.6

    1**

    (110

    )�

    .44*

    *(3

    6).7

    3**

    (110

    )�

    .61*

    *(1

    10)

    �.2

    2*(1

    05)

    �.5

    5**

    (102

    )�

    .68*

    *(1

    10)

    �.7

    3**

    (110

    )�

    .22*

    (109

    )�

    .72*

    *(1

    10)

    Impu

    lsiv

    e�

    .68*

    *(1

    10)

    �.6

    9**

    (110

    )�

    .15

    (36)

    .57*

    *(1

    10)

    �.4

    9**

    (110

    ).0

    0(1

    05)

    �.4

    9**

    (102

    )�

    .68*

    *(1

    10)

    �.6

    2**

    (110

    )�

    .20*

    (109

    )�

    .59*

    *(1

    10)

    Psy

    chos

    is�

    .62*

    *(1

    10)

    �.5

    8**

    (110

    )�

    .41*

    (36)

    .52*

    *(1

    10)

    �.4

    5**

    (110

    )�

    .21*

    (105

    )�

    .53*

    *(1

    02)

    �.8

    0**

    (110

    )�

    .59*

    *(1

    10)

    �.2

    2*(1

    09)

    �.6

    4**

    (110

    )

    SF-3

    6P

    hysi

    cal

    Fun

    c.2

    8*(7

    7).3

    3**

    (76)

    .30

    (32)

    �.2

    6*(7

    7).2

    9*(7

    7).3

    9**

    (75)

    .11

    (70)

    .27*

    (77)

    .31*

    *(7

    7).3

    3**

    (76)

    .19

    (77)

    Rol

    eP

    hysi

    cal

    .43*

    *(7

    7).3

    1**

    (76)

    .30

    (32)

    �.4

    2**

    (77)

    .39*

    *(7

    7).4

    1**

    (75)

    .38*

    *(7

    0).4

    5**

    (77)

    .47*

    *(7

    7).2

    5*(7

    6).1

    9(7

    7)B

    odil

    yP

    ain

    .39*

    *(7

    7).2

    6*(7

    6).3

    7*(3

    2)�

    .38*

    *(7

    7).4

    6**

    (77)

    .31*

    *(7

    5).2

    1(7

    0).3

    6**

    (77)

    .42*

    *(7

    7).2

    3*(7

    6).2

    3*(7

    7)G

    ener

    alH

    ealt

    h.4

    8**

    (78)

    .24*

    (77)

    .25

    (32)

    �.5

    6**

    (78)

    .37*

    *(7

    8).2

    9*(7

    6).3

    1**

    (71)

    .44*

    *(7

    8).4

    4**

    (78)

    .27*

    (77)

    .32*

    *(7

    8)V

    ital

    ity

    .68*

    *(7

    8).3

    6**

    (77)

    .56*

    *(3

    2)�

    .57*

    *(7

    8).3

    9**

    (78)

    .47*

    *(7

    6).1

    8(7

    1).4

    5**

    (78)

    .54*

    *(7

    8).2

    2(7

    7).5

    1**

    (78)

    Soc

    ial

    Fun

    c.7

    5**

    (78)

    .43*

    *(7

    7).3

    5(3

    2)�

    .68*

    *(7

    8).6

    1**

    (78)

    .37*

    *(7

    6).4

    4**

    (71)

    .57*

    *(7

    8).5

    0**

    (78)

    .28*

    (77)

    .53*

    *(7

    8)R

    ole

    Em

    otio

    nal

    .59*

    *(7

    5).4

    2**

    (74)

    .22

    (30)

    �.5

    1**

    (75)

    .49*

    *(7

    5).3

    6**

    (73)

    .39*

    *(6

    8).5

    9**

    (75)

    .49*

    *(7

    5).2

    0(7

    4).3

    6**

    (75)

    Men

    tal

    Hea

    lth

    .82*

    *(7

    8).4

    8**

    (77)

    .54*

    *(3

    2)�

    .68*

    *(7

    8).4

    7**

    (78)

    .36*

    *(7

    6).3

    4**

    (71)

    .54*

    *(7

    8).6

    2**

    (78)

    .32*

    *(7

    7).6

    9**

    (78)

    *Sig

    nifi

    cant

    p�

    0.05

    (Nin

    pare

    nthe

    ses)

    .**

    Sig

    nifi

    cant

    p�

    0.01

    (Nin

    pare

    nthe

    ses)

    .

    Validation of the Treatment Outcome Package 301

  • and SF36-General Health (�.56). It was not expected to correlate with BSI-OCD (.31), orMMPI-Hypochondriasis (.29).

    No validity instruments had a direct measure of the construct of sleep disturbance.However, the TOP Sleep (SLEEP) scale was expected to correlate with other measuresthat relate to sleep functioning, including SF36-Bodily Pain (.46), SF-36 Vitality (.40),BDI (�.52), BSI-Depression (�.46), MMPI-Depression (�.43), and BASIS32-Depression/Anxiety (�.61). It was not expected to correlate with MMPI-Psychopathic Deviance(�.03), or MMPI-Schizophrenia (�.20).

    With no direct validity measure of sexual functioning, the TOP Sexual Functioning(SEXFN) scale was not expected to correlate highly with any validity measure, but it wasexpected to correlate moderately with several measures that are related to sexual func-tioning like the BDI (.42), SF36 Vitality (.47), and other measures of depression [MMPI-Depression (�.31), BSI-Depression (.36), and the BASIS32-Depression/Anxiety (�.22)].

    Measuring work performance and functioning, the TOP Work Functioning (WORKF)scale was expected to correlate with BASIS32-Daily Role (�.51), and the SF36-RoleFunctioning Emotional (.40). It was not expected to correlate with MMPI-Psychasthenia(�.11).

    Measuring issues related to psychotic processes, the TOP Psychosis (PSYCS) scalewas expected to correlate with MMPI-Schizophrenia (�.28), BSI-Psychoticism (.72),and the BASIS32-Psychosis (.80). It was not expected to correlate with MMPI-Hypochondriasis (�.17), or MMPI-Mania (.18).

    Measuring the physiological symptoms of anxiety, the TOP Panic (PANIC) scalewas expected to correlate with BSI-Somatization (.67), BSI-Anxiety (.50), BASIS32-Depression/Anxiety (.82), and SF36-Vitality (�.65). It was not expected to correlate withMMPI-Psychopathic deviate (.30), or BSI-Hostility (.23).

    Measuring symptoms of mania, the TOP Manic (MANIC) scale was expected tocorrelate with the MMPI-Hypomania (.43) scale, and was not expected to correlate withMMPI-Hypochondriasis (�.21), or the MMPI-Psychasthenia (.04) scales.

    Measuring suicidal ideation and planning, the TOP Suicide (SUICD) scale wasexpected to correlate with related measures of depression like the BDI (.60), BSI-Depression (.69), and BASIS32-Depression/Anxiety (.72). It was not expected to corre-late with MMPI-Hypochondriasis (�.15), or the MMPI-Psychasthenia (�.04) scales.

    Discussion

    This first study evaluating the validity of the TOP provides an initial foundation of dataon the TOP factors. Two limitations to the current study that should be addressed in futurework are the lack of validity measures that tap directly suicidality, sleep, and sexualfunctioning, and the failure to have all clients complete all validity measures. However,these limitations do not prevent one from drawing important initial conclusions about theTOP’s convergent and discriminant validity.

    As an initial study, these results document good convergent and excellent discrimi-nant ability of many of the TOP scales. Indeed, almost all expected convergent relation-ships with validity measures were supported by significant correlations. In most casesthese correlation coefficients were large (in the 0.60 to 0.90 range), demonstrating goodconvergent validity. All but one (TOP LIFEQ and BSI-OCD, 0.31) expected discriminantrelationships were below 0.30, demonstrating excellent discriminant validity.

    In many cases, there were other significant relationships between the TOP measuresand validity measures of different constructs. For example, the TOP Depression measure

    302 Journal of Clinical Psychology, March 2005

  • correlated with the BASIS32-Relationship to Self and Other, and BASIS32-Daily role.As another example, the TOP Quality of Life measure correlated highly with validityscale measures of depression. One interpretation of such correlations is that many psy-chological constructs are not orthogonal and have been shown to inter-correlate. Anotherinterpretation is that many psychological subscales include a portion of something likegeneral subjective distress, which is common across different subscales.

    As stated above, several TOP factors warrant further investigation. No validationsubscale was used for the exact same construct measured by the TOP-Suicidal Ideationfactor, although this factor demonstrated expected relationships with depression factorson the MMPI, BSI, BDI, and BASIS 32. In future investigations, this factor should becorrelated with scales of suicidal ideation like the Beck Scale for Suicide Ideation (Beck,Steer, & Ranieri, 1988). Similarly, the TOP Manic factor should be correlated with otherscales of mania. Although the Manic factor correlated satisfactorily with the MMPI-Hypomania scale, the items of the MMPI scale do not necessarily reflect current diag-nostic classification symptoms.

    Finally, the TOP Sleep and Sexual Functioning factors did not have a convergentvalidity measure used in this study. Both scales did show expected relationships withother related measures; however, future studies should correlate these factors with otherdirect measures of both of these constructs.

    Study 4: Floor and Ceiling Effects

    In this section, we report on the floor and ceiling effects of the TOP in a large clinicalsample. Floor and ceiling effects are serious issues to consider in selecting outcome toolsfor clinical populations. If the tools are not able to measure the full range of pathology,their ability to accurately measure initial status and change may be severely limited. Forexample, Nelson, Hartman, Ojemann, & Wilcox (1995) reported that the SF-36 has sig-nificant ceiling effects in clinical samples, suggesting that the tool has limited applica-bility to the Medicaid population for which it was being tested. As another example, theaverage inpatient’s Total Score at admission for the BASIS 32 is reported to be 0.79 on ascale of 0 (no problems) to 4 (severe problems) (Eisen et al., 1986). This means that theaverage inpatient starts near the floor of the tool and suggests that many inpatients startat the actual floor, leaving little or no room to document improvement. For the TOP toserve as a reliable and valid UCB it must demonstrate that it can measure the full range ofpathology.

    Method

    A total of 216,642 clinical TOP administrations were analyzed for both floor and ceilingeffects. Demographic information of the clinical sample is presented in Table 2. Thislarge dataset included all adult clients from a diverse array of service settings that con-tracted with Behavioral Health Laboratories between the years of 1996 and 2003 to pro-cess and analyze their clinical outcome data. The number of each service type is presentedin Table 3. The dataset was analyzed for frequency counts of clients who scored at eitherthe theoretical maximum or minimum score of each TOP scale. The TOP scores arepresented in Z-scores, standardized by using general population means and standard devi-ations. All scales are oriented so that higher scores indicate more symptoms or poorerfunctioning. Theoretical maximum scores were calculated by scoring each measure withitem scores at their highest symptom level (e.g., for the item “Indicate how much of the

    Validation of the Treatment Outcome Package 303

  • time during the last month you have felt down or depressed,” an item score of 1 was usedreferring to “All of the time.”). Continuing this example of depression, the DEPRS scor-ing results in a theoretical maximum Z-score of 4.63 (standard deviations from the gen-eral population mean). Similarly, the theoretical minimum score for Depression (�1.67)was calculated using the item scores representing no depressive symptoms for each itemin the construct. Frequency counts were then calculated for the number of clients whoactually scored at the theoretical maximum or minimum.

    Results

    Table 9 presents the number and percent of clients who scored at the theoretical minimumor maximum for each TOP subscale. TOP ceiling effects are virtually undetectable withonly 0.1% to 4.0% of the clinical sample scoring at the theoretical maximum of TOPsubscales. Only three TOP subscales had frequency counts at the maximum theoreticalscore greater than 1% (Quality of Life 4.0%, Sleep Functioning 2.9%, and Depression1.1%). This result suggests that little would be gained by redesigning any subscale tohave a higher maximum score.

    TOP floor effects were evident on most subscales, but none of the floors are on the“pathological” side of the general population mean. In all cases the floor was below thegeneral population mean, suggesting that each subscale is assessing the pathologicalrange of the construct (also demonstrated by a lack of ceiling effects), but not necessarilythe full “healthy” range of the construct. The most notable of floor effects occurred onViolence, Suicidality, and Sexual Functioning.

    Discussion

    Analysis of the TOP revealed no substantial ceiling effects on any TOP scales, suggestingthat the TOP sufficiently measures into the clinically severe extremes of these constructs.Furthermore, each TOP subscale measures at least a half to more than two standarddeviations into the “healthy” tails of its construct. Therefore, from this very large clinical

    Table 9Floor and Ceiling Effects

    FactorTheoreticalMinimum

    TheoreticalMaximum

    Number ofClients atMinimum

    Number ofClients atMaximum

    TotalSample

    Size(N )

    Percentageof Clients atMinimum

    Percentageof Clients atMaximum

    DEPRS �1.67 4.63 7,519 2,406 212,589 3.5 1.1VIOLN �0.44 15.44 121,625 978 205,932 59.1 0.5SCONF �1.44 2.87 11,606 726 145,695 8.0 0.5LIFEQ �2.34 5.05 4,430 6,210 156,738 2.8 4.0SLEEP �1.43 3.73 23,106 5,907 206,677 11.2 2.9SEXFN �1.15 3.79 48,905 1,264 150,576 32.5 0.8WORKF �1.54 5.95 22,081 163 152,511 14.5 0.1PSYCS �0.93 13.23 33,900 339 202,306 16.8 0.2PANIC �1.13 7.59 30,444 1,153 212,474 14.3 0.5MANIC �1.57 4.75 16,779 474 211,802 7.9 0.2SUICD �0.51 15.57 58,388 702 211,836 27.6 0.3

    304 Journal of Clinical Psychology, March 2005

  • sample it is reasonable to conclude that the each TOP scale measures the full range ofclinical severity, and represents a substantial improvement over the widely used natural-istic outcome tools reported previously.

    Of particular note are the floor effects present on Suicidality and Violence. As theycurrently exist, both of these subscales are pathological constructs without a clear healthyside to the continuum. Having any suicidal or violent behavior is clinically defined aspathological, and is supported by the overwhelming numbers of people in the generalpopulation who do not report any problems on either of these dimensions. In other words,it is hard to report or measure less than zero suicidality. What would it mean to say thatsomeone has an extreme score on the healthy side of violence? One generally thinks thatthere may be a wide range of violent thoughts, tendencies, and behaviors among people,but there is a built-in floor of little or no violent thoughts, tendencies, and behaviors,where a large percentage of the population exists. If there are any “healthy” aspects tothese constructs, they are probably inoculation-type behaviors or attitudes that help insu-late and protect individuals from becoming violent toward themselves or others. In thefuture, it would certainly be a useful goal to explore these relationships, and if they areconnected to the same construct, add items to each of these measures to assist providersin not only reducing pathological behaviors, but also strengthening their resistance tothese destructive actions.

    Study 5: Sensitivity to Change

    In this section, we report information about the TOP’s sensitivity to change. The moreaccurately an outcome measure is able to measure important (even subtle) changes inclinical status, the more useful it is as an outcome tool. Ideally, evaluating sensitivity tochange should include two subject samples—one that is expected to change, and anotherthat is expected not to change based on prior knowledge or research. In addition, anexternal measure with proven validity and sensitivity to change should be used to verifythat change has, or has not, occurred. Then the measure in question can be compared tothis standard. Unfortunately, most of the constructs measured by the TOP do not havematching external measures with sensitivity to change reported in this ideal format. There-fore, less than ideal methodology must be employed.

    Sensitivity to change is a critical issue for the industry to begin addressing in natu-ralistic settings. Many state governments (e.g. Michigan, Georgia) and private payers(e.g. Tufts) have mandated the use of outcome tools that have inadequate sensitivity tochange, costing all involved extensive time and wasted resources, only to have the projectabandoned after the data are unable to demonstrate differences in provider outcomes. Forexample, the functional scales of the Ohio Youth Scales are not showing change in func-tional status in treatment (Ogles, Melendez, Davis, & Lunnen, 2000).

    Since this is such a critical issue, if an external measure of change does not exist withproven sensitivity to change to be used as a “gold standard” of comparison, the field mustnot ignore this important UCB requirement. Instead, it should design studies to make thebest inferences possible, allowing more informed decision-making.

    Without an external “gold standard” of measurement, change documented in sensi-tivity to change studies cannot rule out the possibility that observed changes are theproduct of tool instability rather than actual change. Instead, we argue that measurementerror (caused by poor reliability or validity) must be assessed prior to the study throughother means (i.e., other studies of reliability and validity). First, the tool’s stability shouldbe documented (i.e., test re-test reliabilities) to ensure that change scores are not causedby errors in measurement (we have done this in Study 2). Second, the tool should dem-

    Validation of the Treatment Outcome Package 305

  • onstrate that it is effectively measuring the constructs it is supposed to be measuring (i.e.,convergent and discriminant validity), which we have done in Study 3. With good test-retest reliabilities and good convergent and discriminant validity, the current study offersuseful, albeit circumstantial, evidence about the TOP’s sensitivity to change.

    Method

    Between April 1996 and June 2001, as part of routine care, 20,098 adult behavioral healthclients were administered the TOP at the start of treatment and later after several therapysessions. Age, sex, and years of education of participants are presented in Table 2 andbreakdowns of service facility types are presented in Table 3. The median number of daysbetween TOP administrations was 49 and the median treatment session at which thesecond TOP was administered was 7.

    For each TOP subscale, within group Cohen’s d effect sizes were calculated compar-ing subscale scores at first TOP administration to subscale scores at second TOP admin-istration. In addition, a reliable change index was calculated for each TOP factor usingprocedures outlined in Jacobson, Roberts, Berns, and McGlinchey (1999). The reliablechange index can be used to determine if the change an individual client makes is beyondthe measurement error of the instrument. We used the indices to classify each client ashaving made reliable improvement (or reliable worsening), or not, on each TOP subscale.In addition, the same indices were used to calculate the number of clients who showedreliable improvement (or reliable worsening) on at least one TOP subscale.

    Results

    For each TOP subscale, Table 10 presents sample size, mean, and standard deviation offirst and second TOP administrations, within-group Cohen’s d effect size, and the per-centage of clients who showed reliable improvement or worsening. With an average ofonly seven treatment sessions, Cohen’s d effect sizes ranged from .16 (Mania) to .53(Depression). The percentage of clients who made reliable improvement ranged from 10

    Table 10Sensitivity to Change

    Variable NInitialMean

    Follow-upMean

    InitialSD

    Follow-upSD

    Cohen’sd

    % ClientsShowingReliable

    Improvement

    % ClientsShowingReliable

    Worsening

    DEPRS 19,660 1.34 .48 1.68 1.55 .53 54 14VIOLN 18,765 1.25 .68 2.97 2.37 .21 31 17SCONF 8,047 .28 �.04 1.08 1.01 .31 38 18LIFEQ 10,039 2.19 1.44 1.83 1.81 .41 52 21SLEEP 18,869 .68 .16 1.46 1.32 .37 47 20SEXFN 9,407 �.12 �.31 1.12 1.04 .18 25 15WORKF 9,600 .30 �.10 1.44 1.29 .29 39 20PSYCS 18,320 2.02 1.14 2.85 2.42 .33 44 18PANIC 19,701 1.36 .75 1.93 1.73 .33 41 17MANIC 19,561 �.31 �.47 1.00 0.96 .16 10 6SUICD 19,562 2.38 1.14 3.69 2.80 .38 42 14

    306 Journal of Clinical Psychology, March 2005

  • (Mania) to 54 (Depression), and the percentage of clients who got reliably worse rangedfrom 6 (Mania) to 21 (Quality of Life). Out of 6,577 clients with scores for every sub-scale, 91% of clients showed reliable improvement on at least one TOP subscale and 67%of clients showed reliable worsening on at least one TOP subscale.

    Discussion

    Since no external measure indicating that change actually occurred was available for thisstudy, the possibility that the TOP is unstable (rather than sensitive to change) cannot beruled out from this study when considered in isolation. However, the strong test-retestresults from Study 2 suggest that instability in the subscales is not responsible for theresults from the current study. Studies 1 and 3 provide further evidence for the TOPscales’ reliability and validity, suggesting that the results from the current study are notdue to inaccurate measurement.

    Furthermore, there is robust evidence from past research documenting the efficacyand effectiveness of psychotherapy (Feltham, 1999; Lambert & Bergin, 1994; Seligman,1995; Shadish, 2000; Howard, Kopta, Krause, & Orlinsky, 1986; Shadish et al., 1997;Smith, Glass, & Miller, 1980). Therefore, it is reasonable to speculate that at least someof the change demonstrated in this study was real change associated with treatment ratherthan measurement error. However, future studies will be needed to provide definitiveevidence on the issue.

    This study provides evidence that the TOP may be sensitive to change. Most of thewithin-group Cohen’s d effect sizes were in the small (.2) to medium (.5) range (Cohen,1988), and may have been increased by measuring client improvement through termina-tion. In addition, effect sizes were reported in all cases, even if the patient did not entertreatment for a problem on the dimension and already had scores at or below the generalpopulation average. This was especially true for Sexual Functioning where most patientshad normal functioning at the start of treatment and had little room for, or need forimprovement, on this dimension.

    Most TOP measures showed reliable improvement for at least a quarter of partici-pants, and 91% of clients showed reliable improvement on at least one TOP subscale. Asone might expect, the functional domains (Social Conflict, Work, and Sex) tended toshow less change than the symptom domains.

    Study 6: Criterion Validity

    In this section, we report on the TOP’s ability to accurately discriminate between mem-bers of the general population and behavioral health clients, and should provide furthercorroboration of the tentative findings discussed in Study 5. The ability of an instrumentto distinguish between clients and members of the general population is important fortwo reasons. First, the Core Battery Conference recommended that the Universal CoreBattery be able to do so as part of criterion validation. To the extent that an instrumentcan distinguish between clients and members of the general population, we are morelikely to believe that it measures aspects of psychopathology. Second, a possible appli-cation of the TOP is to help clinicians screen potential clients to decide whether or notany treatment is needed. While the decision to treat or not should always be a matter ofmany factors, including clinical judgment, such decisions should be based on as muchrelevant information as possible, including scores on self-report tests.

    Validation of the Treatment Outcome Package 307

  • Method

    A total of 94 members of the general population completed the TOP. These were the samegeneral population participants from Study 3. Demographic information of this sample ispresented in Table 11 under the heading “General Population.” Age, years of education,and sex were used to create 10 unique matched samples of 94 clients each drawn from theBHL database of behavioral health clients who have taken the TOP. Binary logistic regres-sion was applied to each set of the 94 general population participants and the matchedsample from the clinical population. These analyses combined all TOP measures into abinary stepwise logistic regression to determine the most parsimonious collection of sub-scales accounting for independent prediction of client/general population status. In thistype of analysis, independent variables are entered into the equation one at a time basedon which variable will add the most to the regression equation. The 10 available TOPscales (Depression, Violence, Quality of Life, Sleep, Sexual Functioning, Work Func-tioning, Psychosis, Mania, Panic, and Suicide) served as the independent variables andclient/general population status served as the dependent variable.

    Results

    Demographic information for the 10 client-matched samples is presented in Table 11.The extensive BHL database (more than 210,000 adult TOP administrations) allowed forvery precise matching between the general population sample and the 10 sets of clientsamples.

    In Analysis 1, the first variable entered into the model was Quality of Life, �2(1) �40.74, p � .001. Seventy percent of the clients were correctly classified as clients and68% of general population participants were correctly classified as such, with a totalclassification accuracy of 69%. Psychosis was entered next, �2(1) � 10.67, p � .001.With its entry, correct classification of clients increased to 73%, correct classification ofthe general population participants increased to 77%, and total classification accuracyincreased to 75%. The results from the other four steps and the total model of analysis 1,as well as Analyses 2 through 10 are presented in Table 12.

    Table 11Demographic Information of Participants in Study 6

    AnalysisNumber N Population

    MeanAge

    SDAge

    MeanEducation

    SDEducation

    %Women

    1–10 94 General Population 46.3 17.5 14.9 3.4 73.91 94 Patient 46.2 17.3 14.8 3.4 74.22 94 Patient 46.0 17.3 14.9 3.4 74.23 94 Patient 46.3 17.7 14.9 3.3 74.24 94 Patient 46.1 17.2 14.8 3.3 74.25 94 Patient 46.1 17.4 14.8 3.3 74.26 94 Patient 46.2 17.2 14.8 3.4 73.97 94 Patient 45.8 16.9 14.9 3.4 73.98 94 Patient 46.0 17.2 14.8 3.4 74.29 94 Patient 45.8 17.1 14.9 3.3 74.2

    10 94 Patient 45.9 17.0 14.9 3.3 74.2

    308 Journal of Clinical Psychology, March 2005

  • Table 12Logistic Regression Results

    AnalysisNo. Step No.

    VariableEntered �2~df !, p �

    % ClientsClassifiedCorrectly

    % GeneralPopulationParticipantsClassifiedCorrectly

    Total% Classified

    CorrectlyNagelkerke

    R2

    1 1 LIFEQ 40.74 (1) .001 70 68 69 .281 2 PSYCS 10.67 (1) .001 73 77 75 .341 3 MANIC 14.04 (1) .001 72 80 76 .421 4 SUICD 9.35 (1) .01 75 83 79 .471 5 WORKF 7.06 (1) .01 76 86 81 .501 6 SEXFN 6.97 (1) .01 78 83 80 .541 Total Model 88.82 (6) .001 78 83 80 .54

    2 1 LIFEQ 97.49 (1) .001 83 82 82 .572 2 MANIC 21.89 (1) .001 84 82 83 .662 3 DEPRS 12.91 (1) .001 86 85 86 .712 4 PSYCS 6.13 (1) .05 86 85 86 .732 5 VIOLN 5.38 (1) .05 86 85 86 .752 6 SEXFN 5.46 (1) .05 87 90 89 .772 Total Model 149.25 (6) .001 87 90 89 .77

    3 1 LIFEQ 74.00 (1) .001 76 79 77 .473 2 MANIC 8.97 (1) .01 78 79 79 .513 3 PSYCS 16.07 (1) .001 82 82 82 .583 4 SEXFN 7.58 (1) .01 81 84 83 .623 5 DEPRS 4.23 (1) .05 82 86 84 .633 Total Model 110.84 (5) .001 82 86 84 .63

    4 1 LIFEQ 73.41 (1) .001 78 79 79 .464 2 SEXFN 6.06 (1) .05 80 83 82 .494 3 PSYCS 10.39 (1) .001 78 83 80 .544 4 MANIC 11.44 (1) .001 83 84 83 .594 5 VIOLN 5.24 (1) .05 84 86 85 .614 6 WORKF 4.83 (1) .05 84 84 84 .634 7 PANIC 6.71 (1) .01 87 86 87 .664 Total Model 118.09 (7) .001 87 86 87 .66

    5 1 LIFEQ 97.79 (1) .001 82 82 82 .585 2 MANIC 6.55 (1) .01 83 82 82 .615 3 PSYCS 14.58 (1) .001 82 84 83 .665 4 SEXFN 8.03 (1) .01 85 88 86 .695 5 PANIC 8.20 (1) .01 84 84 84 .725 Total Model 135.16 (5) .001 84 84 84 .72

    6 1 LIFEQ 85.32 (1) .001 81 82 81 .526 2 WORKF 8.67 (1) .01 81 83 82 .566 3 PSYCS 14.24 (1) .001 82 83 83 .636 4 MANIC 8.73 (1) .01 83 85 84 .666 5 SLEEP 4.97 (1) .05 86 86 86 .686 Total Model 121.93 (5) .001 86 86 86 .68

    7 1 LIFEQ 66.03 (1) .001 75 79 77 .427 2 WORKF 14.95 (1) .001 83 79 81 .507 3 PSYCS 13.52 (1) .001 79 80 80 .567 4 SEXFN 9.15 (1) .01 80 82 81 .607 5 VIOLN 5.78 (1) .05 80 84 82 .637 6 MANIC 5.37 (1) .05 83 85 84 .657 7 PANIC 5.73 (1) .05 85 84 84 .677 Total Model 120.53 (7) .001 85 84 84 .67

    ~continued !

    Validation of the Treatment Outcome Package 309

  • To explore the amount of variance accounted for in client/general populationstatus by the six significant predictors in Analysis 1, we employed the Nagelkerke R 2

    test (Nagelkerke, 1991). Quality of Life accounted for 28% of the variance in client/general population status, Psychosis accounted for another 6%, Mania accounted foranother 8%, Suicidality accounted for another 5%, Work Functioning accounted for another3%, and Sexual Functioning accounted for another 4%. Thus, together these six variablesaccounted for 54% of the variance in predicting client/general population status. Thelogistic regression results for this analysis and the remaining nine analyses are presentedin Table 12.

    In the 10 analyses, the percentage of participants correctly classified as being from aclient or general population sample ranged from 80% to 89%, with an average of 84%.Nagelkerke R 2 for the complete models ranged from .54 to .77 with a mean of .65. Inaddition, the variables that were significant predictors of client/general population statuswere fairly consistent across the 10 analyses. In 10 of the analyses, Quality of Life andMania were significant predictors, in 9 of the analyses Sexual Functioning was a signif-icant predictor, in 8 of the analyses Psychosis was a significant predictor, and in 6 of theanalyses Work Functioning and Panic were significant predictors. Other significant pre-dictors included Suicidality (three analyses), Violence (three analyses), Depression (twoanalyses), and Sleep (one analysis). The most important predictor of client/general pop-ulation status for each of the 10 analyses was Quality of Life.

    Table 12Continued

    AnalysisNo. Step No.

    VariableEntered �2~df !, p �

    % ClientsClassifiedCorrectly

    % GeneralPopulationParticipantsClassifiedCorrectly

    Total% Classified

    CorrectlyNagelkerke

    R2

    8 1 LIFEQ 67.35 (1) .001 76 79 78 .438 2 WORKF 12.67 (1) .001 74 79 76 .498 3 SUICD 8.74 (1) .01 76 82 79 .548 4 MANIC 4.45 (1) .05 77 82 79 .568 5 PANIC 5.18 (1) .05 82 82 82 .588 6 SEXFN 3.98 (1) .05 82 79 80 .608 Total Model 102.36 (6) .001 82 79 80 .60

    9 1 LIFEQ 69.28 (1) .001 73 79 76 .459 2 MANIC 9.05 (1) .01 73 79 76 .499 3 PANIC 7.02 (1) .01 76 82 79 .539 4 SEXFN 6.84 (1) .01 80 83 81 .569 5 SUICD 5.64 (1) .05 80 83 81 .589 6 WORKF 6.05 (1) .05 81 83 82 .619 Total Model 103.88 (6) .001 81 83 82 .61

    10 1 LIFEQ 67.28 (1) .001 76 79 77 .4310 2 MANIC 9.71 (1) .01 78 78 78 .4810 3 PSYCS 9.38 (1) .01 78 79 78 .5310 4 SEXFN 4.99 (1) .05 80 80 80 .5510 5 PANIC 6.90 (1) .01 82 82 82 .5810 Total Model 98.26 (5) .001 82 82 82 .58

    310 Journal of Clinical Psychology, March 2005

  • Discussion

    The results demonstrate that the TOP has some ability to discriminate between clients andmembers of the general population with an average correct classification rate of 84%.The consistency across the 10 separate analyses lends credence to these results. It ispossible that the analyses could be further improved by adding several other scales to theanalysis. The Social Conflict and Substance Abuse subscales of the TOP were not avail-able for this analysis because these scales have been revised since the general populationsample was collected.

    We were not able to find other studies with which to benchmark these results. Othercriterion validity studies in the literature typically used another measure of psychopathol-ogy, the presence of a DSM diagnosis in the medical chart, or an expert rating as thecriterion (Baity & Hilsenroth, 2002; Snowden, Kersten, & Roy-Byrne, 2003). Futureanalyses of the TOP’s criterion validity should focus on larger general population sam-ples in which all symptom and functional factors are available and a gold standard likethe Structured Clinical Interview for DSM-IV-TR (SCID; First, Spitzer, Gibbon, & Wil-liams, 1997) is available to accurately distinguish between groups.

    General Discussion

    In the present article we describe the development and initial validation of the TOP.These initial studies suggest the TOP is a promising multipurpose self-report measure. Todocument good psychometric properties with many different demographic and clinicalpopulations serviced in a diverse number of treatment settings, it will be important toreplicate several of the current studies that reported smaller sample sizes (especiallytest-retest, and convergent and discriminant validity samples). All validity and reliabilitystudies should be replicated on diverse clinical samples to evaluate the TOP’s psycho-metrics across the full spectrum of disorders and settings. Beyond the validity measuresreported here, these future studies should include additional validity measures specifi-cally designed for the content domains of suicidality, sexual functioning, sleep, and mania.Ideally all participants would receive all validation measures so as to assess the relativestrength of correlations.

    The initial results from these limited samples suggest the TOP has good test-retestreliability on all symptom and functional measures. The TOP factors correspond wellwith other measures of symptoms and functioning, and the TOP can distinguish betweenclients and members of the general population. The TOP has virtually no ceiling effectsand the floor effects that do exist are not within the pathological range of the constructs.Furthermore, there is some initial evidence that the TOP subscales are sensitive to change.A definitive study of the TOP’s sensitivity to change should include both a populationthat is expected to change and one that is not. It should also include a measure withwell-documented validity and sensitivity to rule out the possibility of instability inmeasurement. In addition, the TOP’s ability to discriminate between diagnostic groupsshould be tested.

    The TOP-Manic scale may require additional work. Questions like “felt on top of theworld” clearly are not unidimensional with respect to health, and may have very differentclinical meanings for people who do, and do not, have bipolar disorder. Additional itemsand scoring changes may improve its internal consistency and correlation to other measures.

    In summary, the self-report version of the adult TOP is a promising instrument. Itsadministration requires no technical expertise and typically takes only 25 minutes tocomplete the full battery. It surveys a broad range of symptom, functional, and case-mix

    Validation of the Treatment Outcome Package 311

  • variables and yields a profile of the client’s condition in comparison to the general pop-ulation. Good reliability and validity of the TOP and its subscales have been demon-strated with clinical and nonclinical samples.

    References

    American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders(4th ed.). Washington, DC: Author.

    Arbuckle, J., & Wothke, W. (1999). AMOS 4.0 User’s Guide. Chicago: Smallwaters Corporation,Inc.

    Baity, M.R., & Hilsenroth, M.J. (2002). Rorschach aggressive content (AgC) variable: A study ofcriterion validity. Journal of Personality Assessment, 78, 275–287.

    Barlow, D.H., Bach, A.K., & Tracey, S.A. (1998). The nature and development of anxiety anddepression: Back to the future. In D.K. Routh & R.J. DeRubeis (Eds.), The science of clinicalpsychology: Accomplishments and future directions (pp. 95–120). Washington, DC: Ameri-can Psychological Association).

    Beck, A.T., Steer, R.A., & Garbin, M.G. (1988). Psychometric properties of the Beck DepressionInventory: Twenty-five years of evaluation. Clinical Psychology Review, 8(1), 77–100.